devlyn-cli 0.5.2 → 0.5.3
- package/bin/devlyn.js +1 -0
- package/optional-skills/dokkit/ANALYSIS.md +198 -0
- package/optional-skills/dokkit/COMMANDS.md +365 -0
- package/optional-skills/dokkit/DOCX-XML.md +76 -0
- package/optional-skills/dokkit/EXPORT.md +102 -0
- package/optional-skills/dokkit/FILLING.md +377 -0
- package/optional-skills/dokkit/HWPX-XML.md +73 -0
- package/optional-skills/dokkit/IMAGE-SOURCING.md +127 -0
- package/optional-skills/dokkit/INGESTION.md +65 -0
- package/optional-skills/dokkit/SKILL.md +153 -0
- package/optional-skills/dokkit/STATE.md +60 -0
- package/optional-skills/dokkit/references/docx-field-patterns.md +151 -0
- package/optional-skills/dokkit/references/docx-structure.md +58 -0
- package/optional-skills/dokkit/references/field-detection-patterns.md +130 -0
- package/optional-skills/dokkit/references/hwpx-field-patterns.md +461 -0
- package/optional-skills/dokkit/references/hwpx-structure.md +159 -0
- package/optional-skills/dokkit/references/image-opportunity-heuristics.md +121 -0
- package/optional-skills/dokkit/references/image-xml-patterns.md +338 -0
- package/optional-skills/dokkit/references/section-image-interleaving.md +346 -0
- package/optional-skills/dokkit/references/section-range-detection.md +118 -0
- package/optional-skills/dokkit/references/state-schema.md +143 -0
- package/optional-skills/dokkit/references/supported-formats.md +67 -0
- package/optional-skills/dokkit/scripts/compile_hwpx.py +134 -0
- package/optional-skills/dokkit/scripts/detect_fields.py +301 -0
- package/optional-skills/dokkit/scripts/detect_fields_hwpx.py +286 -0
- package/optional-skills/dokkit/scripts/export_pdf.py +99 -0
- package/optional-skills/dokkit/scripts/parse_hwpx.py +185 -0
- package/optional-skills/dokkit/scripts/parse_image_with_gemini.py +159 -0
- package/optional-skills/dokkit/scripts/parse_xlsx.py +98 -0
- package/optional-skills/dokkit/scripts/source_images.py +365 -0
- package/optional-skills/dokkit/scripts/validate_docx.py +142 -0
- package/optional-skills/dokkit/scripts/validate_hwpx.py +281 -0
- package/optional-skills/dokkit/scripts/validate_state.py +132 -0
- package/package.json +1 -1
|
@@ -0,0 +1,65 @@
|
|
|
1
|
+
# Ingestion Knowledge
|
|
2
|
+
|
|
3
|
+
Parsing strategies and format routing for converting source documents into the dual-file format (Markdown content + JSON sidecar).
|
|
4
|
+
|
|
5
|
+
## Format Routing
|
|
6
|
+
|
|
7
|
+
| Format | Parser | Command |
|
|
8
|
+
|--------|--------|---------|
|
|
9
|
+
| PDF | Docling | `python -m docling <file> --to md` |
|
|
10
|
+
| DOCX | Docling | `python -m docling <file> --to md` |
|
|
11
|
+
| PPTX | Docling | `python -m docling <file> --to md` |
|
|
12
|
+
| HTML | Docling | `python -m docling <file> --to md` |
|
|
13
|
+
| CSV | Docling | `python -m docling <file> --to md` |
|
|
14
|
+
| MD | Direct copy | Read and process as-is |
|
|
15
|
+
| XLSX | Custom | `python .claude/skills/dokkit/scripts/parse_xlsx.py` |
|
|
16
|
+
| HWPX | Custom | `python .claude/skills/dokkit/scripts/parse_hwpx.py` |
|
|
17
|
+
| JSON | Custom | Read, format as structured markdown |
|
|
18
|
+
| TXT | Custom | Read, wrap as markdown |
|
|
19
|
+
| PNG/JPG | Gemini Vision | `python .claude/skills/dokkit/scripts/parse_image_with_gemini.py` |
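The routing table above can be read as a lookup from file extension to parser. The following is a minimal sketch of that lookup, not the skill's actual implementation; `route_format` is a hypothetical helper, and only the parser labels come from the table.

```python
from pathlib import Path

# Extension groups copied from the routing table above.
DOCLING_FORMATS = {"pdf", "docx", "pptx", "html", "csv"}
CUSTOM_FORMATS = {"xlsx", "hwpx", "json", "txt"}
IMAGE_FORMATS = {"png", "jpg"}


def route_format(path: str) -> str:
    """Return the parser label for a source file (hypothetical helper)."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in DOCLING_FORMATS:
        return "Docling"
    if ext == "md":
        return "Direct copy"
    if ext in CUSTOM_FORMATS:
        return "Custom"
    if ext in IMAGE_FORMATS:
        return "Gemini Vision"
    # Unknown formats are reported instead of being guessed at.
    raise ValueError(f"Unsupported source format: {ext or path!r}")
```

Raising on unknown extensions keeps the sketch in line with the rule that errors are shown explicitly rather than hidden behind a fallback.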
|
|
20
|
+
|
|
21
|
+
## Docling Usage
|
|
22
|
+
|
|
23
|
+
Primary parser for most formats:
|
|
24
|
+
|
|
25
|
+
```bash
|
|
26
|
+
python -m docling <input-file> --to md --output <output-dir>
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
After Docling runs:
|
|
30
|
+
1. Read the markdown output
|
|
31
|
+
2. Extract key-value pairs from the content
|
|
32
|
+
3. Build the JSON sidecar with metadata
|
|
33
|
+
4. Move files to `.dokkit/sources/`
|
|
34
|
+
|
|
35
|
+
If Docling is not installed, show an explicit error with install instructions: `pip install docling`. Do NOT silently fall back to a different parser.
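One way to enforce the "no silent fallback" rule is to check for the package before invoking it. This is a sketch under the assumption that a plain import check is enough; `require_module` is a hypothetical helper, and only the `pip install docling` hint comes from the text above.

```python
import importlib.util


def require_module(name: str, install_hint: str) -> None:
    """Raise an explicit error when a required package is missing (hypothetical helper)."""
    if importlib.util.find_spec(name) is None:
        raise RuntimeError(
            f"{name} is not installed. Install it with: {install_hint}"
        )


# Intended use before running the Docling command:
# require_module("docling", "pip install docling")
```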
|
|
36
|
+
|
|
37
|
+
## Custom Parser Output Format
|
|
38
|
+
|
|
39
|
+
All custom parsers output JSON to stdout:
|
|
40
|
+
```json
|
|
41
|
+
{
|
|
42
|
+
"content_md": "# Document Title\n\nExtracted content...",
|
|
43
|
+
"metadata": {
|
|
44
|
+
"file_name": "original.xlsx",
|
|
45
|
+
"file_type": "xlsx",
|
|
46
|
+
"parse_date": "2026-02-07T12:00:00Z",
|
|
47
|
+
"key_value_pairs": { "Name": "John", "Date": "2026-01-15" },
|
|
48
|
+
"sections": ["Sheet1", "Sheet2"]
|
|
49
|
+
}
|
|
50
|
+
}
|
|
51
|
+
```
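A consumer of this output might split it into the Markdown file and JSON sidecar roughly as follows. This is a sketch with assumptions: `write_dual_files` is a hypothetical helper, the files are named after the stem of `metadata.file_name`, and the sidecar stores the `metadata` object as-is. None of these choices is confirmed by the parser scripts.

```python
import json
from pathlib import Path


def write_dual_files(parser_stdout: str, sources_dir: str):
    """Write <stem>.md and <stem>.json from a custom parser's JSON output."""
    payload = json.loads(parser_stdout)
    metadata = payload["metadata"]
    # Assumption: name both files after the original file's stem.
    stem = Path(metadata["file_name"]).stem
    out_dir = Path(sources_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    md_path = out_dir / f"{stem}.md"
    json_path = out_dir / f"{stem}.json"
    md_path.write_text(payload["content_md"], encoding="utf-8")
    # ensure_ascii=False keeps Korean text readable in the sidecar.
    json_path.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return md_path, json_path
```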
|
|
52
|
+
|
|
53
|
+
## Key-Value Extraction
|
|
54
|
+
|
|
55
|
+
After parsing, scan content for structured data:
|
|
56
|
+
- Table cells with label-value patterns (e.g., "Name: John Doe")
|
|
57
|
+
- Form fields with values
|
|
58
|
+
- Metadata headers
|
|
59
|
+
- Labeled sections
|
|
60
|
+
|
|
61
|
+
Store in the JSON sidecar's `key_value_pairs` field for fast lookup during template filling.
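The label-value scan could look something like the sketch below. It is a simple heuristic and not the skill's extractor: `extract_key_value_pairs` and the 40-character label limit are assumptions, and lines such as URLs or times (`12:00`) will produce noise that a real implementation would need to filter.

```python
import re

# "Label: value" in one cell or line; an optional list bullet is skipped.
LABEL_VALUE = re.compile(
    r"^\s*(?:[-*]\s*)?(?P<label>[^:|\n]{1,40}?)\s*:\s*(?P<value>\S.*?)\s*$"
)


def extract_key_value_pairs(content_md: str) -> dict:
    """Collect label-value pairs from Markdown lines and table cells."""
    pairs = {}
    for line in content_md.splitlines():
        if "|" in line:
            # Markdown table row: test each cell separately.
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
        else:
            cells = [line]
        for cell in cells:
            match = LABEL_VALUE.match(cell)
            if match:
                # Keep the first value seen for a label.
                pairs.setdefault(match.group("label").strip(), match.group("value"))
    return pairs
```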
|
|
62
|
+
|
|
63
|
+
## References
|
|
64
|
+
|
|
65
|
+
See `references/supported-formats.md` for detailed format specifications.
|
|
@@ -0,0 +1,153 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: dokkit
|
|
3
|
+
description: >
  Document template filling system for DOCX and HWPX formats.
  Ingests source documents, analyzes templates, detects fillable fields,
  fills them surgically using source data, reviews with confidence scoring,
  and exports completed documents. Supports Korean and English templates.
  Subcommands: init, sources, preview, ingest, fill, fill-doc, modify, review, export.
  Use when user says "fill template", "fill document", "ingest", "dokkit".
user-invocable: true
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent
argument-hint: "<subcommand> [arguments]"
context:
  - type: file
    path: ${CLAUDE_SKILL_DIR}/COMMANDS.md
|
|
16
|
+
---
|
|
17
|
+
|
|
18
|
+
# Dokkit — Document Template Filling System
|
|
19
|
+
|
|
20
|
+
Surgical document filling for DOCX and HWPX templates using ingested source data. One command with 9 subcommands covering the full document filling lifecycle.
|
|
21
|
+
|
|
22
|
+
## Subcommands
|
|
23
|
+
|
|
24
|
+
| Subcommand | Arguments | Type | Description |
|
|
25
|
+
|------------|-----------|------|-------------|
|
|
26
|
+
| `init` | `[--force] [--keep-sources]` | Inline | Initialize or reset workspace |
|
|
27
|
+
| `sources` | — | Inline | Display ingested sources dashboard |
|
|
28
|
+
| `preview` | — | Inline | Generate PDF preview via LibreOffice |
|
|
29
|
+
| `ingest` | `<file1> [file2] ...` | Agent | Parse source documents into workspace |
|
|
30
|
+
| `fill` | `<template.docx\|hwpx>` | Agent | End-to-end: analyze, fill, review, auto-fix, export |
|
|
31
|
+
| `fill-doc` | `<template.docx\|hwpx>` | Agent | Analyze template and fill fields only |
|
|
32
|
+
| `modify` | `"<instruction>"` | Agent | Apply targeted changes to filled document |
|
|
33
|
+
| `review` | `[section\|approve]` | Agent | Review with per-field confidence annotations |
|
|
34
|
+
| `export` | `<docx\|hwpx\|pdf>` | Agent | Export filled document to format |
|
|
35
|
+
|
|
36
|
+
## Routing
|
|
37
|
+
|
|
38
|
+
Parse `$ARGUMENTS` to determine the subcommand:
|
|
39
|
+
|
|
40
|
+
1. Extract `$1` as the subcommand name
|
|
41
|
+
2. Pass remaining arguments (`$2`, `$3`, ...) to the subcommand
|
|
42
|
+
3. If `$1` is empty or unrecognized, display the subcommand table above with usage examples
|
|
43
|
+
|
|
44
|
+
Full workflows for each subcommand are in COMMANDS.md (auto-loaded via context).
|
|
45
|
+
|
|
46
|
+
<example>
|
|
47
|
+
- `/dokkit ingest docs/resume.pdf docs/transcript.xlsx` — ingest two sources
|
|
48
|
+
- `/dokkit fill docs/template.hwpx` — end-to-end fill pipeline
|
|
49
|
+
- `/dokkit modify "Change the phone number to 010-1234-5678"` — targeted change
|
|
50
|
+
- `/dokkit export pdf` — export as PDF
|
|
51
|
+
</example>
|
|
52
|
+
|
|
53
|
+
## Architecture
|
|
54
|
+
|
|
55
|
+
### Agents
|
|
56
|
+
|
|
57
|
+
| Agent | Model | Role |
|
|
58
|
+
|-------|-------|------|
|
|
59
|
+
| **dokkit-ingestor** | opus | Parse source docs into `.dokkit/sources/` (.md + .json pairs) |
|
|
60
|
+
| **dokkit-analyzer** | opus | Analyze templates, detect fields, map to sources. Writes `analysis.json`. READ-ONLY on templates. |
|
|
61
|
+
| **dokkit-filler** | opus | Surgical XML modification using analysis.json. Three modes: fill, modify, review. |
|
|
62
|
+
| **dokkit-exporter** | sonnet | Repackage ZIP archives, PDF conversion via LibreOffice. |
|
|
63
|
+
|
|
64
|
+
### Workspace
|
|
65
|
+
|
|
66
|
+
All agents communicate via the `.dokkit/` filesystem:
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
.dokkit/
|
|
70
|
+
├── state.json # Single source of truth for session state
|
|
71
|
+
├── sources/ # Ingested content (.md + .json pairs)
|
|
72
|
+
├── analysis.json # Template analysis output (from analyzer)
|
|
73
|
+
├── images/ # Sourced images for template filling
|
|
74
|
+
├── template_work/ # Unpacked template XML (working copy)
|
|
75
|
+
└── output/ # Exported filled documents
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
### State Protocol
|
|
79
|
+
|
|
80
|
+
Read `.dokkit/state.json` before any operation. Write state changes atomically: read current → update fields → write back → validate.
|
|
81
|
+
|
|
82
|
+
```
|
|
83
|
+
init → state created (empty)
|
|
84
|
+
ingest → source added to sources[]
|
|
85
|
+
fill/fill-doc → template set, analysis created, filled_document created
|
|
86
|
+
modify → filled_document updated
|
|
87
|
+
review approve → filled_document.status = "finalized"
|
|
88
|
+
export → export entry added to exports[]
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
Validate after every write: `python ${CLAUDE_SKILL_DIR}/scripts/validate_state.py .dokkit/state.json`
|
|
92
|
+
|
|
93
|
+
### Knowledge Files
|
|
94
|
+
|
|
95
|
+
Agent-facing knowledge bases in this skill directory:
|
|
96
|
+
|
|
97
|
+
| File | Purpose | Agents |
|
|
98
|
+
|------|---------|--------|
|
|
99
|
+
| `STATE.md` | State schema and management protocol | All |
|
|
100
|
+
| `INGESTION.md` | Format routing and parsing strategies | dokkit-ingestor |
|
|
101
|
+
| `ANALYSIS.md` | Field detection, confidence scoring, output schema | dokkit-analyzer |
|
|
102
|
+
| `FILLING.md` | XML surgery rules, matching strategy, image insertion | dokkit-analyzer, dokkit-filler |
|
|
103
|
+
| `DOCX-XML.md` | Open XML structure for DOCX documents | dokkit-analyzer, dokkit-filler |
|
|
104
|
+
| `HWPX-XML.md` | OWPML structure for HWPX documents | dokkit-analyzer, dokkit-filler |
|
|
105
|
+
| `IMAGE-SOURCING.md` | Image generation, search, and insertion patterns | dokkit-filler |
|
|
106
|
+
| `EXPORT.md` | Document compilation and format conversion | dokkit-exporter |
|
|
107
|
+
|
|
108
|
+
Deep reference material in `references/`:
|
|
109
|
+
- `state-schema.md` — Complete state.json schema
|
|
110
|
+
- `supported-formats.md` — Detailed format specifications
|
|
111
|
+
- `docx-structure.md`, `docx-field-patterns.md` — DOCX patterns
|
|
112
|
+
- `hwpx-structure.md`, `hwpx-field-patterns.md` — HWPX patterns (10 detection patterns)
|
|
113
|
+
- `field-detection-patterns.md` — Advanced heuristics (9 DOCX + 6 HWPX)
|
|
114
|
+
- `section-range-detection.md` — Dynamic range detection for section_content
|
|
115
|
+
- `section-image-interleaving.md` — Image interleaving algorithm
|
|
116
|
+
- `image-opportunity-heuristics.md` — AI image opportunity detection
|
|
117
|
+
- `image-xml-patterns.md` — Image element structures (DOCX + HWPX)
|
|
118
|
+
|
|
119
|
+
Scripts in `scripts/`:
|
|
120
|
+
- `validate_state.py` — State validation
|
|
121
|
+
- `parse_xlsx.py`, `parse_hwpx.py`, `parse_image_with_gemini.py` — Custom parsers
|
|
122
|
+
- `detect_fields.py`, `detect_fields_hwpx.py` — Field detection
|
|
123
|
+
- `validate_docx.py`, `validate_hwpx.py` — Document validation
|
|
124
|
+
- `compile_hwpx.py` — HWPX repackaging
|
|
125
|
+
- `export_pdf.py` — PDF conversion
|
|
126
|
+
|
|
127
|
+
## Rules
|
|
128
|
+
|
|
129
|
+
<rules>
|
|
130
|
+
- Display errors clearly with actionable guidance. Never silently fall back to defaults.
|
|
131
|
+
- Original template is never modified — copies go to `.dokkit/template_work/`.
|
|
132
|
+
- Analyzer is read-only on templates. Only the filler modifies XML.
|
|
133
|
+
- Confidence levels: high, medium, low (not numeric scores).
|
|
134
|
+
- Signatures must be user-provided — never auto-generate them.
|
|
135
|
+
- Validate state after every write with `scripts/validate_state.py`.
|
|
136
|
+
- Inline commands (init, sources, preview) execute directly — do NOT spawn agents.
|
|
137
|
+
- Agent-delegated commands spawn the appropriate agent(s) sequentially.
|
|
138
|
+
</rules>
|
|
139
|
+
|
|
140
|
+
## Known Pitfalls
|
|
141
|
+
|
|
142
|
+
Critical issues discovered through production use:
|
|
143
|
+
|
|
144
|
+
1. **HWPX namespace stripping**: Python's `xml.etree.ElementTree` (ET) drops namespace declarations that no element or attribute uses when it serializes. Restore ALL 14 original xmlns declarations on EVERY root element after any `tree.write()`. Applies to section0.xml, content.hpf, header.xml.
|
|
145
|
+
2. **HWPX subList cell wrapping**: ~65% of cells wrap content in `<hp:subList>/<hp:p>`. Check for subList before writing content.
|
|
146
|
+
3. **table_content "Pre-filled" bug**: Never set `mapped_value` to placeholder strings for `table_content` fields. Use `mapped_value: null` with `action: "preserve"`.
|
|
147
|
+
4. **HWPX cellAddr rowAddr corruption**: After row insert/delete, re-index ALL `rowAddr` values. Duplicate rowAddr causes silent data loss.
|
|
148
|
+
5. **HWPX `<hp:pic>` inside `<hp:run>`**: Pic as sibling of run renders invisible. Must be `<hp:run><hp:pic>...<hp:t/></hp:run>`.
|
|
149
|
+
6. **HWPML units**: 1/7200 inch, NOT hundredths of mm. 1mm ~ 283.46 units. A4 text width ~ 46,648 units.
|
|
150
|
+
7. **rowSpan stripping**: When cloning rows with rowSpan>1, divide cellSz height by rowSpan.
|
|
151
|
+
8. **HWPX pic element order**: offset, orgSz, curSz, flip, rotationInfo, renderingInfo, imgRect, imgClip, inMargin, imgDim, hc:img, sz, pos, outMargin.
|
|
152
|
+
9. **HWPX post-write safety**: After ET write: (a) restore namespaces, (b) fix XML declaration to double quotes with `standalone="yes"`, (c) remove newline between `?>` and `<root>`.
|
|
153
|
+
10. **compile_hwpx.py skip .bak**: Backup files must be excluded from ZIP repackaging.
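The unit rule in pitfall 6 is easy to get wrong, so here is the arithmetic as a small sketch. The function names are hypothetical; the only facts used are 7200 units per inch and 25.4 mm per inch.

```python
HWP_UNITS_PER_INCH = 7200
MM_PER_INCH = 25.4


def mm_to_hwp_units(mm: float) -> int:
    """Convert millimetres to HWPX units (1/7200 inch), rounded to an integer."""
    return round(mm * HWP_UNITS_PER_INCH / MM_PER_INCH)


def hwp_units_to_mm(units: int) -> float:
    """Convert HWPX units (1/7200 inch) back to millimetres."""
    return units * MM_PER_INCH / HWP_UNITS_PER_INCH
```

Since 7200 / 25.4 is about 283.46, one millimetre rounds to 283 units, which matches the figure quoted in the pitfall.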
|
|
@@ -0,0 +1,60 @@
|
|
|
1
|
+
# State Management
|
|
2
|
+
|
|
3
|
+
Protocol for reading and writing `.dokkit/state.json`. All agents follow this protocol.
|
|
4
|
+
|
|
5
|
+
## Workspace Structure
|
|
6
|
+
|
|
7
|
+
```
|
|
8
|
+
.dokkit/
├── state.json              # Single source of truth for session state
├── sources/                # Ingested source content
│   ├── <name>.md           # Extracted content (LLM-optimized markdown)
│   └── <name>.json         # Structured metadata sidecar
├── analysis.json           # Template analysis output (from analyzer)
├── images/                 # Sourced images
├── template_work/          # Unpacked template XML (working copy)
│   ├── word/               # (DOCX) or Contents/ (HWPX)
│   └── ...
└── output/                 # Exported filled documents
    └── filled_<name>.<ext>
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
## Reading State
|
|
23
|
+
|
|
24
|
+
Read `.dokkit/state.json` before any operation. Check:
|
|
25
|
+
- `sources` array for available context
|
|
26
|
+
- `template` for current template info
|
|
27
|
+
- `analysis` for field mapping data
|
|
28
|
+
- `filled_document` for current document status
|
|
29
|
+
|
|
30
|
+
## Writing State
|
|
31
|
+
|
|
32
|
+
After any mutation:
|
|
33
|
+
1. Read current state.json (avoid overwriting concurrent changes)
|
|
34
|
+
2. Update only the relevant fields
|
|
35
|
+
3. Write the full state back
|
|
36
|
+
4. Validate: `python .claude/skills/dokkit/scripts/validate_state.py .dokkit/state.json`
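The four steps above can be sketched as a single helper. This is an illustration under assumptions, not the agents' actual code: `update_state` is a hypothetical name, the temp-file-plus-rename write is one possible way to avoid half-written files, and only the validator command line comes from step 4.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

VALIDATOR = ".claude/skills/dokkit/scripts/validate_state.py"


def update_state(state_path, changes, validate=True):
    """Read, update, write back, and optionally validate state.json."""
    path = Path(state_path)
    # 1. Read the current state right before writing.
    state = json.loads(path.read_text(encoding="utf-8"))
    # 2. Update only the relevant top-level fields.
    state.update(changes)
    # 3. Write the full state to a temp file, then rename it into place
    #    (assumption: readers should never see a partially written file).
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)
    # 4. Validate with the skill's validator script.
    if validate:
        subprocess.run([sys.executable, VALIDATOR, str(path)], check=True)
    return state
```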
|
|
37
|
+
|
|
38
|
+
## State Transitions
|
|
39
|
+
|
|
40
|
+
```
|
|
41
|
+
/dokkit init → state created (empty)
|
|
42
|
+
/dokkit ingest → source added to sources[]
|
|
43
|
+
/dokkit fill or fill-doc → template set, analysis created, filled_document created
|
|
44
|
+
/dokkit modify → filled_document updated
|
|
45
|
+
/dokkit review approve → filled_document.status = "finalized"
|
|
46
|
+
/dokkit export → export entry added to exports[]
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Validation
|
|
50
|
+
|
|
51
|
+
The validator checks:
|
|
52
|
+
- Schema conformance
|
|
53
|
+
- Required fields present
|
|
54
|
+
- Valid status values
|
|
55
|
+
- Source file references exist
|
|
56
|
+
- No orphaned entries
|
|
57
|
+
|
|
58
|
+
## References
|
|
59
|
+
|
|
60
|
+
See `references/state-schema.md` for the complete schema definition.
|
|
@@ -0,0 +1,151 @@
|
|
|
1
|
+
# DOCX Field Detection Patterns
|
|
2
|
+
|
|
3
|
+
## Pattern 1: Placeholder Text
|
|
4
|
+
|
|
5
|
+
```xml
|
|
6
|
+
<!-- Text like {{name}} or <<name>> in a run -->
|
|
7
|
+
<w:r>
|
|
8
|
+
<w:rPr>
|
|
9
|
+
<w:rFonts w:ascii="Arial" w:hAnsi="Arial"/>
|
|
10
|
+
<w:sz w:val="20"/>
|
|
11
|
+
</w:rPr>
|
|
12
|
+
<w:t>{{full_name}}</w:t> <!-- REPLACE this text content -->
|
|
13
|
+
</w:r>
|
|
14
|
+
```
|
|
15
|
+
|
|
16
|
+
**Action**: Replace the text content of `<w:t>` while preserving `<w:rPr>`.
|
|
17
|
+
|
|
18
|
+
## Pattern 2: Empty Table Cell
|
|
19
|
+
|
|
20
|
+
```xml
|
|
21
|
+
<w:tr>
|
|
22
|
+
<w:tc>
|
|
23
|
+
<w:p><w:r><w:t>Name</w:t></w:r></w:p> <!-- Label cell -->
|
|
24
|
+
</w:tc>
|
|
25
|
+
<w:tc>
|
|
26
|
+
<w:p/> <!-- Empty cell → FILL THIS -->
|
|
27
|
+
</w:tc>
|
|
28
|
+
</w:tr>
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
**Action**: Insert `<w:r><w:t>value</w:t></w:r>` into the empty `<w:p>`. Copy `<w:rPr>` from the label cell's run to match formatting.
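A minimal sketch of this action with `xml.etree.ElementTree` follows. `fill_empty_cell` and `q` are hypothetical helpers, and the sketch assumes the label cell sits immediately to the left of the empty cell in the same row. The copied `<w:rPr>` should still be checked against the Color Warning section of this file.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag: str) -> str:
    """Qualify a WordprocessingML tag name with its namespace."""
    return f"{{{W}}}{tag}"


def fill_empty_cell(row, value: str) -> bool:
    """Fill the first empty cell that directly follows a label cell in a <w:tr>."""
    cells = row.findall(q("tc"))
    for label_cell, target_cell in zip(cells, cells[1:]):
        label_run = label_cell.find(f".//{q('r')}")
        para = target_cell.find(q("p"))
        # Skip pairs without a label run, or whose target already has text.
        if label_run is None or para is None or para.find(f".//{q('t')}") is not None:
            continue
        run = ET.SubElement(para, q("r"))
        rpr = label_run.find(q("rPr"))
        if rpr is not None:
            run.append(copy.deepcopy(rpr))  # <w:rPr> must precede <w:t>
        ET.SubElement(run, q("t")).text = value
        return True
    return False
```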
|
|
32
|
+
|
|
33
|
+
## Pattern 3: Underline Placeholder
|
|
34
|
+
|
|
35
|
+
```xml
|
|
36
|
+
<w:r>
|
|
37
|
+
<w:rPr>
|
|
38
|
+
<w:u w:val="single"/>
|
|
39
|
+
</w:rPr>
|
|
40
|
+
<w:t xml:space="preserve"> </w:t> <!-- Spaces with underline -->
|
|
41
|
+
</w:r>
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
**Action**: Replace the spaces in `<w:t>` with the actual value. Keep `<w:u>` in `<w:rPr>`.
|
|
45
|
+
|
|
46
|
+
## Pattern 4: Content Control
|
|
47
|
+
|
|
48
|
+
```xml
|
|
49
|
+
<w:sdt>
|
|
50
|
+
<w:sdtPr>
|
|
51
|
+
<w:alias w:val="Company Name"/>
|
|
52
|
+
<w:tag w:val="company"/>
|
|
53
|
+
<w:showingPlcHdr/> <!-- Indicates placeholder is showing -->
|
|
54
|
+
</w:sdtPr>
|
|
55
|
+
<w:sdtContent>
|
|
56
|
+
<w:p>
|
|
57
|
+
<w:r>
|
|
58
|
+
<w:rPr><w:rStyle w:val="PlaceholderText"/></w:rPr>
|
|
59
|
+
<w:t>Click here to enter text.</w:t>
|
|
60
|
+
</w:r>
|
|
61
|
+
</w:p>
|
|
62
|
+
</w:sdtContent>
|
|
63
|
+
</w:sdt>
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
**Action**: Replace the run inside `<w:sdtContent>` with a new run containing the value. Remove `<w:showingPlcHdr/>` from `<w:sdtPr>`. Remove the placeholder style from `<w:rPr>`.
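The three parts of this action can be sketched as below. `fill_content_control` and `q` are hypothetical helpers, and the sketch assumes a block-level content control whose `<w:sdtContent>` holds a single `<w:p>`; inline controls would need a different path.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag: str) -> str:
    """Qualify a WordprocessingML tag name with its namespace."""
    return f"{{{W}}}{tag}"


def fill_content_control(sdt, value: str) -> bool:
    """Replace placeholder runs in a <w:sdt> with one run holding the value."""
    para = sdt.find(f"{q('sdtContent')}/{q('p')}")
    if para is None:
        return False
    old_runs = para.findall(q("r"))
    # Reuse the first run's formatting, minus the placeholder style.
    rpr = None
    if old_runs and old_runs[0].find(q("rPr")) is not None:
        rpr = copy.deepcopy(old_runs[0].find(q("rPr")))
        for style in rpr.findall(q("rStyle")):
            if style.get(q("val")) == "PlaceholderText":
                rpr.remove(style)
    for run in old_runs:
        para.remove(run)
    new_run = ET.SubElement(para, q("r"))
    if rpr is not None and len(rpr):
        new_run.append(rpr)
    ET.SubElement(new_run, q("t")).text = value
    # Clear the "placeholder is showing" flag.
    sdt_pr = sdt.find(q("sdtPr"))
    if sdt_pr is not None:
        for flag in sdt_pr.findall(q("showingPlcHdr")):
            sdt_pr.remove(flag)
    return True
```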
|
|
67
|
+
|
|
68
|
+
## Pattern 5: Instruction Text
|
|
69
|
+
|
|
70
|
+
```xml
|
|
71
|
+
<w:r>
|
|
72
|
+
<w:rPr>
|
|
73
|
+
<w:color w:val="808080"/> <!-- Gray text -->
|
|
74
|
+
<w:i/> <!-- Italic -->
|
|
75
|
+
</w:rPr>
|
|
76
|
+
<w:t>(enter your name)</w:t>
|
|
77
|
+
</w:r>
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
**Action**: Replace text content. Change `<w:rPr>` to remove gray color and italic (or copy from a nearby filled field).
|
|
81
|
+
|
|
82
|
+
## Pattern 6: Writing Tip Box (작성 팁)
|
|
83
|
+
|
|
84
|
+
Single-cell tables with dashed borders containing `※` guidance text. These are NOT fillable — they must be **deleted**.
|
|
85
|
+
|
|
86
|
+
```xml
|
|
87
|
+
<w:tbl>
|
|
88
|
+
<w:tblPr>
|
|
89
|
+
<w:tblBorders>
|
|
90
|
+
<w:top w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
|
|
91
|
+
<w:left w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
|
|
92
|
+
<w:bottom w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
|
|
93
|
+
<w:right w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
|
|
94
|
+
</w:tblBorders>
|
|
95
|
+
</w:tblPr>
|
|
96
|
+
<w:tr>
|
|
97
|
+
<w:tc>
|
|
98
|
+
<w:p>
|
|
99
|
+
<w:r>
|
|
100
|
+
<w:rPr><w:color w:val="FF0000"/></w:rPr>
|
|
101
|
+
<w:t>※ 작성 팁: 구체적인 사업 목표를 기재하세요.</w:t>
|
|
102
|
+
</w:r>
|
|
103
|
+
</w:p>
|
|
104
|
+
</w:tc>
|
|
105
|
+
</w:tr>
|
|
106
|
+
</w:tbl>
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
**Identifying traits**:
|
|
110
|
+
- Single row, single cell (`<w:tr>` has one `<w:tc>`)
|
|
111
|
+
- `<w:tblBorders>` with `w:val="dashed"` on all sides
|
|
112
|
+
- Text starts with `※` or contains `작성 팁`, `작성요령`
|
|
113
|
+
- Often has red `<w:color w:val="FF0000"/>` styling
|
|
114
|
+
|
|
115
|
+
**Action**: Flag as `field_type: "tip_box"`, `action: "delete"`. Delete the entire `<w:tbl>` element.
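The identifying traits above translate fairly directly into a check. The following sketch uses hypothetical helpers (`is_tip_box`, `remove_tip_boxes`, `q`) and treats the red text color as optional, since the traits list calls it only "often" present.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)
TIP_MARKERS = ("작성 팁", "작성요령")
SIDES = ("top", "left", "bottom", "right")


def q(tag: str) -> str:
    """Qualify a WordprocessingML tag name with its namespace."""
    return f"{{{W}}}{tag}"


def is_tip_box(tbl) -> bool:
    """True for a 1x1 table with dashed borders on all sides and tip text."""
    rows = tbl.findall(q("tr"))
    if len(rows) != 1 or len(rows[0].findall(q("tc"))) != 1:
        return False
    borders = tbl.find(f"{q('tblPr')}/{q('tblBorders')}")
    if borders is None:
        return False
    for side in SIDES:
        edge = borders.find(q(side))
        if edge is None or edge.get(q("val")) != "dashed":
            return False
    text = "".join(t.text or "" for t in tbl.iter(q("t"))).strip()
    return text.startswith("※") or any(marker in text for marker in TIP_MARKERS)


def remove_tip_boxes(root) -> int:
    """Delete every tip-box table under root and return how many were removed."""
    removed = 0
    # ElementTree has no parent pointers, so walk parents and test their children.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == q("tbl") and is_tip_box(child):
                parent.remove(child)
                removed += 1
    return removed
```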
|
|
116
|
+
|
|
117
|
+
## Color Warning for Copied Formatting
|
|
118
|
+
|
|
119
|
+
When copying `<w:rPr>` from template guide text or instruction text (Patterns 2 and 5), **always check for red color**:
|
|
120
|
+
|
|
121
|
+
```xml
|
|
122
|
+
<!-- DANGER: This rPr has red color from guide text -->
|
|
123
|
+
<w:rPr>
|
|
124
|
+
<w:color w:val="FF0000"/> <!-- REMOVE THIS -->
|
|
125
|
+
<w:i/> <!-- REMOVE THIS (from guide text) -->
|
|
126
|
+
<w:sz w:val="20"/> <!-- KEEP -->
|
|
127
|
+
</w:rPr>
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
**Rule**: After copying rPr from any template text, check for `<w:color>` elements. If the value is `FF0000`, `FF0000FF`, or any red shade, **remove the `<w:color>` element** (defaults to black). Also remove `<w:i/>` if it came from guide text.
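The rule could be applied with a helper along these lines. `sanitize_copied_rpr` and `is_red` are hypothetical, and the thresholds that decide what counts as "any red shade" are an assumption chosen for illustration; the source names only `FF0000` and `FF0000FF` explicitly.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag: str) -> str:
    """Qualify a WordprocessingML tag name with its namespace."""
    return f"{{{W}}}{tag}"


def is_red(value) -> bool:
    """Heuristic: strong red channel, weak green and blue (assumed thresholds)."""
    hex6 = (value or "")[:6]  # ignore a trailing alpha byte such as FF0000FF
    try:
        r, g, b = (int(hex6[i:i + 2], 16) for i in (0, 2, 4))
    except ValueError:
        return False  # "auto", empty, or otherwise not a hex color
    return r >= 0xC0 and g <= 0x60 and b <= 0x60


def sanitize_copied_rpr(rpr, from_guide_text: bool = True) -> None:
    """Strip red color, and italics taken from guide text, from a copied <w:rPr>."""
    for color in rpr.findall(q("color")):
        if is_red(color.get(q("val"))):
            rpr.remove(color)  # text falls back to the default color
    if from_guide_text:
        for italic in rpr.findall(q("i")):
            rpr.remove(italic)
```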
|
|
131
|
+
|
|
132
|
+
## Safe Modification Template
|
|
133
|
+
|
|
134
|
+
```python
|
|
135
|
+
import re
import xml.etree.ElementTree as ET

ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
# Register every prefix the document uses; unregistered ones are rewritten as ns0, ns1, ...
ET.register_namespace("w", ns["w"])

# Example mapping of field names to values; replace with real source data.
field_values = {"full_name": "John Doe"}

# Matches {{name}} and <<name>> placeholders, capturing the field name.
PLACEHOLDER = re.compile(r"\{\{\s*([^}]+?)\s*\}\}|<<\s*([^>]+?)\s*>>")


def substitute(match):
    field_name = match.group(1) or match.group(2)
    # Leave unknown placeholders untouched instead of inventing a value.
    return field_values.get(field_name, match.group(0))


tree = ET.parse("word/document.xml")

# Replace placeholder text in place; only <w:t> text changes, <w:rPr> is preserved.
# Placeholders split across several runs are not handled here.
for t_elem in tree.iter("{%s}t" % ns["w"]):
    if t_elem.text:
        t_elem.text = PLACEHOLDER.sub(substitute, t_elem.text)

tree.write("word/document.xml", xml_declaration=True, encoding="UTF-8")
|
|
151
|
+
```
|
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
# DOCX XML Structure Reference
|
|
2
|
+
|
|
3
|
+
## Unpacking a DOCX
|
|
4
|
+
|
|
5
|
+
```bash
|
|
6
|
+
# Unzip to inspect
|
|
7
|
+
mkdir -p .dokkit/template_work
|
|
8
|
+
cd .dokkit/template_work
|
|
9
|
+
unzip -o /path/to/template.docx
|
|
10
|
+
```
|
|
11
|
+
|
|
12
|
+
## Reading document.xml
|
|
13
|
+
|
|
14
|
+
The main content is in `word/document.xml`. Parse with any XML parser.
|
|
15
|
+
|
|
16
|
+
### Python Example
|
|
17
|
+
```python
|
|
18
|
+
import xml.etree.ElementTree as ET

ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
tree = ET.parse("word/document.xml")
root = tree.getroot()

# Find all paragraphs
for p in root.iter("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p"):
    texts = []
    for t in p.iter("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"):
        if t.text:
            texts.append(t.text)
    print("".join(texts))
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
## Repackaging a DOCX
|
|
34
|
+
|
|
35
|
+
After modifying XML, repackage as a valid DOCX:
|
|
36
|
+
|
|
37
|
+
```python
|
|
38
|
+
import zipfile
import os

def repackage_docx(work_dir, output_path):
    """Repackage modified XML files into a valid DOCX."""
    with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(work_dir):
            for file in files:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, work_dir)
                zf.write(file_path, arcname)
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
## Critical Rules for DOCX Surgery
|
|
52
|
+
|
|
53
|
+
1. **Never remove `<w:rPr>` elements** — they contain all formatting
|
|
54
|
+
2. **Preserve `xml:space="preserve"`** on `<w:t>` elements with leading/trailing spaces
|
|
55
|
+
3. **Keep `<w:pPr>` intact** — paragraph formatting must not change
|
|
56
|
+
4. **Maintain bookmark pairs** — `<w:bookmarkStart>` must have matching `<w:bookmarkEnd>`
|
|
57
|
+
5. **Don't modify `<w:sectPr>`** — section properties control page layout
|
|
58
|
+
6. **Preserve table cell merge attributes** — `<w:vMerge>` and `<w:gridSpan>`
|
|
@@ -0,0 +1,130 @@
|
|
|
1
|
+
# Field Detection Patterns
|
|
2
|
+
|
|
3
|
+
## DOCX Detection Heuristics
|
|
4
|
+
|
|
5
|
+
### Heuristic 1: Curly Brace Placeholders
|
|
6
|
+
```regex
|
|
7
|
+
\{\{[^}]+\}\}
|
|
8
|
+
```
|
|
9
|
+
Match text like `{{field_name}}`. High reliability.
|
|
10
|
+
|
|
11
|
+
### Heuristic 2: Angle Bracket Placeholders
|
|
12
|
+
```regex
|
|
13
|
+
<<[^>]+>>
|
|
14
|
+
```
|
|
15
|
+
Match text like `<<field_name>>`. High reliability.
|
|
16
|
+
|
|
17
|
+
### Heuristic 3: Square Bracket Placeholders
|
|
18
|
+
```regex
|
|
19
|
+
\[[^\]]+\]
|
|
20
|
+
```
|
|
21
|
+
Match text like `[field_name]`. Medium reliability (may match references).
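Heuristics 1 to 3 can be combined into one scan that tags each hit with its stated reliability. This is a sketch only: `find_placeholders` and the result dictionary shape are hypothetical, while the three regexes and the high/medium labels are taken from the heuristics above.

```python
import re

# (kind, pattern, reliability) in the order the heuristics are listed.
PLACEHOLDER_PATTERNS = [
    ("curly", re.compile(r"\{\{[^}]+\}\}"), "high"),
    ("angle", re.compile(r"<<[^>]+>>"), "high"),
    ("square", re.compile(r"\[[^\]]+\]"), "medium"),
]


def find_placeholders(text: str) -> list:
    """Return every placeholder match in text with its kind and reliability."""
    hits = []
    for kind, pattern, reliability in PLACEHOLDER_PATTERNS:
        for match in pattern.finditer(text):
            hits.append(
                {"kind": kind, "match": match.group(0), "reliability": reliability}
            )
    return hits
```

Square-bracket hits such as `[1]` are still returned; the medium reliability label is what signals that they may be references rather than fields.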
|
|
22
|
+
|
|
23
|
+
### Heuristic 4: Underline-Only Runs
|
|
24
|
+
A run where:
|
|
25
|
+
- `<w:rPr>` contains `<w:u w:val="single"/>`
|
|
26
|
+
- `<w:t>` contains only spaces, underscores, or is empty
|
|
27
|
+
- Run length > 3 characters
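A sketch of this check follows, with `is_underline_placeholder` and `q` as hypothetical helpers. It reads the length condition as applying to the run's text, so an empty `<w:t>` on its own does not qualify; that interpretation is an assumption.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag: str) -> str:
    """Qualify a WordprocessingML tag name with its namespace."""
    return f"{{{W}}}{tag}"


def is_underline_placeholder(run) -> bool:
    """True for a single-underlined run made only of spaces or underscores."""
    underline = run.find(f"{q('rPr')}/{q('u')}")
    if underline is None or underline.get(q("val")) != "single":
        return False
    text = "".join(t.text or "" for t in run.findall(q("t")))
    # More than 3 characters, all of them spaces or underscores.
    return len(text) > 3 and set(text) <= {" ", "_"}
```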
|
|
28
|
+
|
|
29
|
+
### Heuristic 5: Empty Table Cells
|
|
30
|
+
A `<w:tc>` that:
|
|
31
|
+
- Contains only `<w:p/>` or `<w:p><w:pPr/></w:p>` (empty paragraph)
|
|
32
|
+
- Is adjacent to a cell containing text (the label)
|
|
33
|
+
- The label cell's text is short (< 50 chars) and not numeric
|
|
34
|
+
|
|
35
|
+
### Heuristic 6: Instruction Text
|
|
36
|
+
A run where text matches patterns like:
|
|
37
|
+
```regex
|
|
38
|
+
\(.*?(enter|type|input|write|fill|입력).*?\)
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
### Heuristic 7: Content Controls
|
|
42
|
+
Any `<w:sdt>` element with `<w:showingPlcHdr/>` in its properties.
|
|
43
|
+
|
|
44
|
+
### Heuristic 8: Image Fields
|
|
45
|
+
A field is classified as `image` when any of these conditions hold:
|
|
46
|
+
- A `{{placeholder}}` or `<<placeholder>>` contains an image keyword
|
|
47
|
+
- A table cell contains an existing `<w:drawing>` element (pre-positioned image slot)
|
|
48
|
+
- An empty table cell is adjacent to a cell whose label matches an image keyword
|
|
49
|
+
|
|
50
|
+
**Image keywords** (case-insensitive):
|
|
51
|
+
- Korean: 사진, 증명사진, 여권사진, 로고, 서명, 날인, 도장, 직인
|
|
52
|
+
- English: Photo, Picture, Logo, Signature, Stamp, Seal, Image, Portrait
|
|
53
|
+
|
|
54
|
+
**Image type classification**:
|
|
55
|
+
| Keyword match | `image_type` |
|
|
56
|
+
|---------------|-------------|
|
|
57
|
+
| 사진, 증명사진, 여권사진, photo, picture, portrait, image | `photo` |
|
|
58
|
+
| 로고, logo | `logo` |
|
|
59
|
+
| 서명, 날인, 도장, 직인, signature, stamp, seal | `signature` |
|
|
60
|
+
| (no keyword match) | `figure` |
|
|
61
|
+
|
|
62
|
+
Image fields are **excluded** from the `placeholder_text` and `empty_cell` detectors to prevent double-detection.
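The classification table can be sketched as a keyword lookup. `classify_image_type` is a hypothetical helper, matching is by substring, and rows are checked in table order so the first match wins; the table itself does not state a precedence, so that ordering is an assumption.

```python
# Keyword lists copied from the classification table, in table order.
IMAGE_TYPE_KEYWORDS = {
    "photo": ["사진", "증명사진", "여권사진", "photo", "picture", "portrait", "image"],
    "logo": ["로고", "logo"],
    "signature": ["서명", "날인", "도장", "직인", "signature", "stamp", "seal"],
}


def classify_image_type(label: str) -> str:
    """Map a field label to an image_type; unmatched labels become 'figure'."""
    text = label.casefold()  # case-insensitive matching
    for image_type, keywords in IMAGE_TYPE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return image_type
    return "figure"
```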
|
|
63
|
+
|
|
64
|
+
### Heuristic 9: Tip Box
|
|
65
|
+
A `<w:tbl>` that:
|
|
66
|
+
- Has exactly one row and one cell (1×1 table)
|
|
67
|
+
- `<w:tblBorders>` uses `w:val="dashed"` borders
|
|
68
|
+
- Cell text starts with `※` or contains `작성 팁` / `작성요령`
|
|
69
|
+
- Often has red text color (`<w:color w:val="FF0000"/>`)
|
|
70
|
+
|
|
71
|
+
→ `field_type: "tip_box"`, `action: "delete"`
|
|
72
|
+
|
|
73
|
+
## HWPX Detection Heuristics
|
|
74
|
+
|
|
75
|
+
### Heuristic 1: Empty Adjacent Cells
|
|
76
|
+
Same as DOCX but using `<hp:tc>` and `<hp:t>` elements.
|
|
77
|
+
|
|
78
|
+
### Heuristic 2: Korean Instruction Text
|
|
79
|
+
```regex
|
|
80
|
+
\(.*?(입력|기재|작성).*?\)
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
### Heuristic 3: Date Component Cells
|
|
84
|
+
Cells immediately before 년/월/일 (year/month/day) markers.
|
|
85
|
+
|
|
86
|
+
### Heuristic 4: Image Fields
|
|
87
|
+
Same logic as DOCX Heuristic 8, adapted for HWPX elements:
|
|
88
|
+
- `<hp:pic>` instead of `<w:drawing>`
|
|
89
|
+
- `<hp:tc>` / `<hp:t>` instead of `<w:tc>` / `<w:t>`
|
|
90
|
+
- Same image keyword list and type classification
|
|
91
|
+
|
|
92
|
+
### Heuristic 5: Tip Box
|
|
93
|
+
An `<hp:tbl>` that:
|
|
94
|
+
- Has `rowCnt="1"` and `colCnt="1"` (single-cell table)
|
|
95
|
+
- `borderFillIDRef` resolves to DASH border style in `header.xml`
|
|
96
|
+
- Cell text starts with `※` or contains `작성 팁` / `작성요령` / `작성 요령`
|
|
97
|
+
- May appear standalone or nested inside a `<hp:subList>` within another cell
|
|
98
|
+
|
|
99
|
+
→ `field_type: "tip_box"`, `action: "delete"`, `container: "standalone"|"nested"`
|
|
100
|
+
|
|
101
|
+
### Heuristic 6: Section Header Rows
|
|
102
|
+
Table rows where:
|
|
103
|
+
- First cell spans multiple columns (`hp:cellSpan colSpan > 1`)
|
|
104
|
+
- Text is short and descriptive (section name)
|
|
105
|
+
- Background may be shaded
|
|
106
|
+
|
|
107
|
+
## HWPX Pre-Fill Sanitization
|
|
108
|
+
|
|
109
|
+
### Negative Character Spacing
|
|
110
|
+
HWPX templates may define `<hh:charPr>` elements in `header.xml` with negative `<hh:spacing>` values (e.g., `hangul="-3"`). These compress characters closer together, which works for short placeholder text but causes **severe text overlap** when the filler replaces placeholders with longer content.
|
|
111
|
+
|
|
112
|
+
**Rule**: Before filling, scan ALL `<hh:charPr>` definitions in `header.xml` and set any negative spacing attribute values to `"0"`. This applies to all attributes: `hangul`, `latin`, `hanja`, `japanese`, `other`, `symbol`, `user`.
|
|
113
|
+
|
|
114
|
+
**Example fix**:
|
|
115
|
+
```xml
|
|
116
|
+
<!-- Before (causes overlap) -->
|
|
117
|
+
<hh:spacing hangul="-3" latin="-3" hanja="-3" japanese="-3" other="-3" symbol="-3" user="-3"/>
|
|
118
|
+
|
|
119
|
+
<!-- After (normal spacing) -->
|
|
120
|
+
<hh:spacing hangul="0" latin="0" hanja="0" japanese="0" other="0" symbol="0" user="0"/>
|
|
121
|
+
```
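A sketch of the sanitization pass is shown below. `zero_negative_spacing` is a hypothetical helper that matches elements by local name, so it does not hard-code the `hh:` namespace URI; restoring namespaces and the XML declaration after writing is a separate step and is not covered here.

```python
import xml.etree.ElementTree as ET

SPACING_ATTRS = ("hangul", "latin", "hanja", "japanese", "other", "symbol", "user")


def local_name(elem) -> str:
    """Return an element's tag without its namespace part."""
    tag = elem.tag
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def zero_negative_spacing(root) -> int:
    """Set negative spacing attributes under every charPr to "0"; return the count."""
    changed = 0
    for char_pr in root.iter():
        if local_name(char_pr) != "charPr":
            continue
        for child in char_pr:
            if local_name(child) != "spacing":
                continue
            for attr in SPACING_ATTRS:
                value = child.get(attr)
                if value is not None and value.strip().startswith("-"):
                    child.set(attr, "0")
                    changed += 1
    return changed
```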
|
|
122
|
+
|
|
123
|
+
## False Positive Filtering
|
|
124
|
+
|
|
125
|
+
Exclude detected "fields" that are:
|
|
126
|
+
- Part of a header/title row (not fillable)
|
|
127
|
+
- Copyright notices or footer text
|
|
128
|
+
- Page numbers or running headers
|
|
129
|
+
- Table of contents entries
|
|
130
|
+
- Cross-reference markers
|