devlyn-cli 0.5.1 → 0.5.3

Files changed (37)
  1. package/bin/devlyn.js +1 -0
  2. package/optional-skills/better-auth-setup/SKILL.md +222 -11
  3. package/optional-skills/better-auth-setup/references/proxy-gotchas.md +148 -0
  4. package/optional-skills/better-auth-setup/references/proxy-setup.md +284 -0
  5. package/optional-skills/dokkit/ANALYSIS.md +198 -0
  6. package/optional-skills/dokkit/COMMANDS.md +365 -0
  7. package/optional-skills/dokkit/DOCX-XML.md +76 -0
  8. package/optional-skills/dokkit/EXPORT.md +102 -0
  9. package/optional-skills/dokkit/FILLING.md +377 -0
  10. package/optional-skills/dokkit/HWPX-XML.md +73 -0
  11. package/optional-skills/dokkit/IMAGE-SOURCING.md +127 -0
  12. package/optional-skills/dokkit/INGESTION.md +65 -0
  13. package/optional-skills/dokkit/SKILL.md +153 -0
  14. package/optional-skills/dokkit/STATE.md +60 -0
  15. package/optional-skills/dokkit/references/docx-field-patterns.md +151 -0
  16. package/optional-skills/dokkit/references/docx-structure.md +58 -0
  17. package/optional-skills/dokkit/references/field-detection-patterns.md +130 -0
  18. package/optional-skills/dokkit/references/hwpx-field-patterns.md +461 -0
  19. package/optional-skills/dokkit/references/hwpx-structure.md +159 -0
  20. package/optional-skills/dokkit/references/image-opportunity-heuristics.md +121 -0
  21. package/optional-skills/dokkit/references/image-xml-patterns.md +338 -0
  22. package/optional-skills/dokkit/references/section-image-interleaving.md +346 -0
  23. package/optional-skills/dokkit/references/section-range-detection.md +118 -0
  24. package/optional-skills/dokkit/references/state-schema.md +143 -0
  25. package/optional-skills/dokkit/references/supported-formats.md +67 -0
  26. package/optional-skills/dokkit/scripts/compile_hwpx.py +134 -0
  27. package/optional-skills/dokkit/scripts/detect_fields.py +301 -0
  28. package/optional-skills/dokkit/scripts/detect_fields_hwpx.py +286 -0
  29. package/optional-skills/dokkit/scripts/export_pdf.py +99 -0
  30. package/optional-skills/dokkit/scripts/parse_hwpx.py +185 -0
  31. package/optional-skills/dokkit/scripts/parse_image_with_gemini.py +159 -0
  32. package/optional-skills/dokkit/scripts/parse_xlsx.py +98 -0
  33. package/optional-skills/dokkit/scripts/source_images.py +365 -0
  34. package/optional-skills/dokkit/scripts/validate_docx.py +142 -0
  35. package/optional-skills/dokkit/scripts/validate_hwpx.py +281 -0
  36. package/optional-skills/dokkit/scripts/validate_state.py +132 -0
  37. package/package.json +1 -1
@@ -0,0 +1,153 @@
1
+ ---
2
+ name: dokkit
3
+ description: >
4
+ Document template filling system for DOCX and HWPX formats.
5
+ Ingests source documents, analyzes templates, detects fillable fields,
6
+ fills them surgically using source data, reviews with confidence scoring,
7
+ and exports completed documents. Supports Korean and English templates.
8
+ Subcommands: init, sources, preview, ingest, fill, fill-doc, modify, review, export.
9
+ Use when user says "fill template", "fill document", "ingest", "dokkit".
10
+ user-invocable: true
11
+ allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent
12
+ argument-hint: "<subcommand> [arguments]"
13
+ context:
14
+ - type: file
15
+ path: ${CLAUDE_SKILL_DIR}/COMMANDS.md
16
+ ---
17
+
18
+ # Dokkit — Document Template Filling System
19
+
20
+ Surgical document filling for DOCX and HWPX templates using ingested source data. One command with 9 subcommands covering the full document filling lifecycle.
21
+
22
+ ## Subcommands
23
+
24
+ | Subcommand | Arguments | Type | Description |
25
+ |------------|-----------|------|-------------|
26
+ | `init` | `[--force] [--keep-sources]` | Inline | Initialize or reset workspace |
27
+ | `sources` | — | Inline | Display ingested sources dashboard |
28
+ | `preview` | — | Inline | Generate PDF preview via LibreOffice |
29
+ | `ingest` | `<file1> [file2] ...` | Agent | Parse source documents into workspace |
30
+ | `fill` | `<template.docx\|hwpx>` | Agent | End-to-end: analyze, fill, review, auto-fix, export |
31
+ | `fill-doc` | `<template.docx\|hwpx>` | Agent | Analyze template and fill fields only |
32
+ | `modify` | `"<instruction>"` | Agent | Apply targeted changes to filled document |
33
+ | `review` | `[section\|approve]` | Agent | Review with per-field confidence annotations |
34
+ | `export` | `<docx\|hwpx\|pdf>` | Agent | Export filled document to format |
35
+
36
+ ## Routing
37
+
38
+ Parse `$ARGUMENTS` to determine the subcommand:
39
+
40
+ 1. Extract `$1` as the subcommand name
41
+ 2. Pass remaining arguments (`$2`, `$3`, ...) to the subcommand
42
+ 3. If `$1` is empty or unrecognized, display the subcommand table above with usage examples
43
+
44
+ Full workflows for each subcommand are in COMMANDS.md (auto-loaded via context).
45
+
46
+ <example>
47
+ - `/dokkit ingest docs/resume.pdf docs/transcript.xlsx` — ingest two sources
48
+ - `/dokkit fill docs/template.hwpx` — end-to-end fill pipeline
49
+ - `/dokkit modify "Change the phone number to 010-1234-5678"` — targeted change
50
+ - `/dokkit export pdf` — export as PDF
51
+ </example>
52
+
53
+ ## Architecture
54
+
55
+ ### Agents
56
+
57
+ | Agent | Model | Role |
58
+ |-------|-------|------|
59
+ | **dokkit-ingestor** | opus | Parse source docs into `.dokkit/sources/` (.md + .json pairs) |
60
+ | **dokkit-analyzer** | opus | Analyze templates, detect fields, map to sources. Writes `analysis.json`. READ-ONLY on templates. |
61
+ | **dokkit-filler** | opus | Surgical XML modification using analysis.json. Three modes: fill, modify, review. |
62
+ | **dokkit-exporter** | sonnet | Repackage ZIP archives, PDF conversion via LibreOffice. |
63
+
64
+ ### Workspace
65
+
66
+ All agents communicate via the `.dokkit/` filesystem:
67
+
68
+ ```
69
+ .dokkit/
70
+ ├── state.json # Single source of truth for session state
71
+ ├── sources/ # Ingested content (.md + .json pairs)
72
+ ├── analysis.json # Template analysis output (from analyzer)
73
+ ├── images/ # Sourced images for template filling
74
+ ├── template_work/ # Unpacked template XML (working copy)
75
+ └── output/ # Exported filled documents
76
+ ```
77
+
78
+ ### State Protocol
79
+
80
+ Read `.dokkit/state.json` before any operation. Write state changes atomically: read current → update fields → write back → validate.
81
+
82
+ ```
83
+ init → state created (empty)
84
+ ingest → source added to sources[]
85
+ fill/fill-doc → template set, analysis created, filled_document created
86
+ modify → filled_document updated
87
+ review approve → filled_document.status = "finalized"
88
+ export → export entry added to exports[]
89
+ ```
90
+
91
+ Validate after every write: `python ${CLAUDE_SKILL_DIR}/scripts/validate_state.py .dokkit/state.json`
92
+
93
+ ### Knowledge Files
94
+
95
+ Agent-facing knowledge bases in this skill directory:
96
+
97
+ | File | Purpose | Agents |
98
+ |------|---------|--------|
99
+ | `STATE.md` | State schema and management protocol | All |
100
+ | `INGESTION.md` | Format routing and parsing strategies | dokkit-ingestor |
101
+ | `ANALYSIS.md` | Field detection, confidence scoring, output schema | dokkit-analyzer |
102
+ | `FILLING.md` | XML surgery rules, matching strategy, image insertion | dokkit-analyzer, dokkit-filler |
103
+ | `DOCX-XML.md` | Open XML structure for DOCX documents | dokkit-analyzer, dokkit-filler |
104
+ | `HWPX-XML.md` | OWPML structure for HWPX documents | dokkit-analyzer, dokkit-filler |
105
+ | `IMAGE-SOURCING.md` | Image generation, search, and insertion patterns | dokkit-filler |
106
+ | `EXPORT.md` | Document compilation and format conversion | dokkit-exporter |
107
+
108
+ Deep reference material in `references/`:
109
+ - `state-schema.md` — Complete state.json schema
110
+ - `supported-formats.md` — Detailed format specifications
111
+ - `docx-structure.md`, `docx-field-patterns.md` — DOCX patterns
112
+ - `hwpx-structure.md`, `hwpx-field-patterns.md` — HWPX patterns (10 detection patterns)
113
+ - `field-detection-patterns.md` — Advanced heuristics (9 DOCX + 6 HWPX)
114
+ - `section-range-detection.md` — Dynamic range detection for section_content
115
+ - `section-image-interleaving.md` — Image interleaving algorithm
116
+ - `image-opportunity-heuristics.md` — AI image opportunity detection
117
+ - `image-xml-patterns.md` — Image element structures (DOCX + HWPX)
118
+
119
+ Scripts in `scripts/`:
120
+ - `validate_state.py` — State validation
121
+ - `parse_xlsx.py`, `parse_hwpx.py`, `parse_image_with_gemini.py` — Custom parsers
122
+ - `detect_fields.py`, `detect_fields_hwpx.py` — Field detection
123
+ - `validate_docx.py`, `validate_hwpx.py` — Document validation
124
+ - `compile_hwpx.py` — HWPX repackaging
125
+ - `export_pdf.py` — PDF conversion
126
+
127
+ ## Rules
128
+
129
+ <rules>
130
+ - Display errors clearly with actionable guidance. Never silently fall back to defaults.
131
+ - Original template is never modified — copies go to `.dokkit/template_work/`.
132
+ - Analyzer is read-only on templates. Only the filler modifies XML.
133
+ - Confidence levels: high, medium, low (not numeric scores).
134
+ - Signatures must be user-provided — never auto-generate them.
135
+ - Validate state after every write with `scripts/validate_state.py`.
136
+ - Inline commands (init, sources, preview) execute directly — do NOT spawn agents.
137
+ - Agent-delegated commands spawn the appropriate agent(s) sequentially.
138
+ </rules>
139
+
140
+ ## Known Pitfalls
141
+
142
+ Critical issues discovered through production use:
143
+
144
+ 1. **HWPX namespace stripping**: Python ET strips unused namespace declarations. Restore ALL 14 original xmlns on EVERY root element after any `tree.write()`. Applies to section0.xml, content.hpf, header.xml.
145
+ 2. **HWPX subList cell wrapping**: ~65% of cells wrap content in `<hp:subList>/<hp:p>`. Check for subList before writing content.
146
+ 3. **table_content "Pre-filled" bug**: Never set `mapped_value` to placeholder strings for `table_content` fields. Use `mapped_value: null` with `action: "preserve"`.
147
+ 4. **HWPX cellAddr rowAddr corruption**: After row insert/delete, re-index ALL `rowAddr` values. Duplicate rowAddr causes silent data loss.
148
+ 5. **HWPX `<hp:pic>` inside `<hp:run>`**: Pic as sibling of run renders invisible. Must be `<hp:run><hp:pic>...<hp:t/></hp:run>`.
149
+ 6. **HWPML units**: 1/7200 inch, NOT hundredths of mm. 1mm ~ 283.46 units. A4 text width ~ 46,648 units.
150
+ 7. **rowSpan stripping**: When cloning rows with rowSpan>1, divide cellSz height by rowSpan.
151
+ 8. **HWPX pic element order**: offset, orgSz, curSz, flip, rotationInfo, renderingInfo, imgRect, imgClip, inMargin, imgDim, hc:img, sz, pos, outMargin.
152
+ 9. **HWPX post-write safety**: After ET write: (a) restore namespaces, (b) fix XML declaration to double quotes with `standalone="yes"`, (c) remove newline between `?>` and `<root>`.
153
+ 10. **compile_hwpx.py skip .bak**: Backup files must be excluded from ZIP repackaging.
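Pitfalls 6 and 9 (b, c) above reduce to arithmetic and byte-level cleanup, so they can be sketched briefly. The following is a minimal illustration, not the code shipped in `scripts/`: the function names are hypothetical, and namespace restoration (pitfall 9a) is deliberately left out because the list of 14 namespaces is not given here.

```python
import re
import xml.etree.ElementTree as ET

# Pitfall 6: HWPML lengths are expressed in 1/7200 inch.
HWP_UNITS_PER_INCH = 7200
MM_PER_INCH = 25.4


def mm_to_hwp_units(mm):
    """Convert millimetres to HWPML units, rounded to the nearest unit."""
    return round(mm * HWP_UNITS_PER_INCH / MM_PER_INCH)


def hwp_units_to_mm(units):
    """Convert HWPML units back to millimetres."""
    return units * MM_PER_INCH / HWP_UNITS_PER_INCH


# Pitfall 9 (b, c): ElementTree emits a single-quoted declaration followed by
# a newline. Replace it with a double-quoted, standalone declaration that is
# immediately followed by the root element.
XML_DECL = re.compile(rb"\A<\?xml[^>]*\?>\s*")
HWPX_DECL = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def fix_xml_declaration(data):
    """Return serialized XML bytes with the declaration normalized."""
    return HWPX_DECL + XML_DECL.sub(b"", data, count=1)
```

As a check on the figures quoted in pitfall 6, `hwp_units_to_mm(46648)` comes out at roughly 164.6 mm, which is a plausible A4 text width once margins are subtracted from 210 mm.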
@@ -0,0 +1,60 @@
1
+ # State Management
2
+
3
+ Protocol for reading and writing `.dokkit/state.json`. All agents follow this protocol.
4
+
5
+ ## Workspace Structure
6
+
7
+ ```
8
+ .dokkit/
9
+ ├── state.json # Single source of truth for session state
10
+ ├── sources/ # Ingested source content
11
+ │ ├── <name>.md # Extracted content (LLM-optimized markdown)
12
+ │ └── <name>.json # Structured metadata sidecar
13
+ ├── analysis.json # Template analysis output (from analyzer)
14
+ ├── images/ # Sourced images
15
+ ├── template_work/ # Unpacked template XML (working copy)
16
+ │ ├── word/ # (DOCX) or Contents/ (HWPX)
17
+ │ └── ...
18
+ └── output/ # Exported filled documents
19
+ └── filled_<name>.<ext>
20
+ ```
21
+
22
+ ## Reading State
23
+
24
+ Read `.dokkit/state.json` before any operation. Check:
25
+ - `sources` array for available context
26
+ - `template` for current template info
27
+ - `analysis` for field mapping data
28
+ - `filled_document` for current document status
29
+
30
+ ## Writing State
31
+
32
+ After any mutation:
33
+ 1. Read current state.json (avoid overwriting concurrent changes)
34
+ 2. Update only the relevant fields
35
+ 3. Write the full state back
36
+ 4. Validate: `python .claude/skills/dokkit/scripts/validate_state.py .dokkit/state.json`
37
+
38
+ ## State Transitions
39
+
40
+ ```
41
+ /dokkit init → state created (empty)
42
+ /dokkit ingest → source added to sources[]
43
+ /dokkit fill or fill-doc → template set, analysis created, filled_document created
44
+ /dokkit modify → filled_document updated
45
+ /dokkit review approve → filled_document.status = "finalized"
46
+ /dokkit export → export entry added to exports[]
47
+ ```
48
+
49
+ ## Validation
50
+
51
+ The validator checks:
52
+ - Schema conformance
53
+ - Required fields present
54
+ - Valid status values
55
+ - Source file references exist
56
+ - No orphaned entries
57
+
58
+ ## References
59
+
60
+ See `references/state-schema.md` for the complete schema definition.
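The read, update, write sequence described under "Writing State" could look roughly like the sketch below. It assumes `state.json` is a plain JSON object; `update_state` and its `mutate` callback are hypothetical names, and the temp-file-plus-rename step is one common way to avoid a half-written file, not necessarily what the skill's agents do. Validation with `validate_state.py` would follow as a separate step.

```python
import json
import os
import tempfile
from pathlib import Path


def update_state(state_path, mutate):
    """Re-read state.json, apply `mutate` in place, and write the full state back.

    The new content goes to a temporary file in the same directory and is then
    moved over the original with os.replace, so readers never see a partial file.
    """
    path = Path(state_path)
    # Step 1: read the current state right before changing it.
    state = json.loads(path.read_text(encoding="utf-8"))
    # Step 2: update only the relevant fields.
    mutate(state)
    # Step 3: write the full state back.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, ensure_ascii=False, indent=2)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return state
```

A caller handling `/dokkit ingest` might use it as `update_state(".dokkit/state.json", lambda s: s["sources"].append(entry))`, where the shape of `entry` is defined by `references/state-schema.md` rather than by this sketch.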
@@ -0,0 +1,151 @@
1
+ # DOCX Field Detection Patterns
2
+
3
+ ## Pattern 1: Placeholder Text
4
+
5
+ ```xml
6
+ <!-- Text like {{name}} or <<name>> in a run -->
7
+ <w:r>
8
+ <w:rPr>
9
+ <w:rFonts w:ascii="Arial" w:hAnsi="Arial"/>
10
+ <w:sz w:val="20"/>
11
+ </w:rPr>
12
+ <w:t>{{full_name}}</w:t> <!-- REPLACE this text content -->
13
+ </w:r>
14
+ ```
15
+
16
+ **Action**: Replace the text content of `<w:t>` while preserving `<w:rPr>`.
17
+
18
+ ## Pattern 2: Empty Table Cell
19
+
20
+ ```xml
21
+ <w:tr>
22
+ <w:tc>
23
+ <w:p><w:r><w:t>Name</w:t></w:r></w:p> <!-- Label cell -->
24
+ </w:tc>
25
+ <w:tc>
26
+ <w:p/> <!-- Empty cell → FILL THIS -->
27
+ </w:tc>
28
+ </w:tr>
29
+ ```
30
+
31
+ **Action**: Insert `<w:r><w:t>value</w:t></w:r>` into the empty `<w:p>`. Copy `<w:rPr>` from the label cell's run to match formatting.
32
+
33
+ ## Pattern 3: Underline Placeholder
34
+
35
+ ```xml
36
+ <w:r>
37
+ <w:rPr>
38
+ <w:u w:val="single"/>
39
+ </w:rPr>
40
+ <w:t xml:space="preserve"> </w:t> <!-- Spaces with underline -->
41
+ </w:r>
42
+ ```
43
+
44
+ **Action**: Replace the spaces in `<w:t>` with the actual value. Keep `<w:u>` in `<w:rPr>`.
45
+
46
+ ## Pattern 4: Content Control
47
+
48
+ ```xml
49
+ <w:sdt>
50
+ <w:sdtPr>
51
+ <w:alias w:val="Company Name"/>
52
+ <w:tag w:val="company"/>
53
+ <w:showingPlcHdr/> <!-- Indicates placeholder is showing -->
54
+ </w:sdtPr>
55
+ <w:sdtContent>
56
+ <w:p>
57
+ <w:r>
58
+ <w:rPr><w:rStyle w:val="PlaceholderText"/></w:rPr>
59
+ <w:t>Click here to enter text.</w:t>
60
+ </w:r>
61
+ </w:p>
62
+ </w:sdtContent>
63
+ </w:sdt>
64
+ ```
65
+
66
+ **Action**: Replace the run inside `<w:sdtContent>` with a new run containing the value. Remove `<w:showingPlcHdr/>` from `<w:sdtPr>`. Remove the placeholder style from `<w:rPr>`.
67
+
68
+ ## Pattern 5: Instruction Text
69
+
70
+ ```xml
71
+ <w:r>
72
+ <w:rPr>
73
+ <w:color w:val="808080"/> <!-- Gray text -->
74
+ <w:i/> <!-- Italic -->
75
+ </w:rPr>
76
+ <w:t>(enter your name)</w:t>
77
+ </w:r>
78
+ ```
79
+
80
+ **Action**: Replace text content. Change `<w:rPr>` to remove gray color and italic (or copy from a nearby filled field).
81
+
82
+ ## Pattern 6: Writing Tip Box (작성 팁)
83
+
84
+ Single-cell tables with dashed borders containing `※` guidance text. These are NOT fillable — they must be **deleted**.
85
+
86
+ ```xml
87
+ <w:tbl>
88
+ <w:tblPr>
89
+ <w:tblBorders>
90
+ <w:top w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
91
+ <w:left w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
92
+ <w:bottom w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
93
+ <w:right w:val="dashed" w:sz="4" w:space="0" w:color="auto"/>
94
+ </w:tblBorders>
95
+ </w:tblPr>
96
+ <w:tr>
97
+ <w:tc>
98
+ <w:p>
99
+ <w:r>
100
+ <w:rPr><w:color w:val="FF0000"/></w:rPr>
101
+ <w:t>※ 작성 팁: 구체적인 사업 목표를 기재하세요.</w:t>
102
+ </w:r>
103
+ </w:p>
104
+ </w:tc>
105
+ </w:tr>
106
+ </w:tbl>
107
+ ```
108
+
109
+ **Identifying traits**:
110
+ - Single row, single cell (`<w:tr>` has one `<w:tc>`)
111
+ - `<w:tblBorders>` with `w:val="dashed"` on all sides
112
+ - Text starts with `※` or contains `작성 팁`, `작성요령`
113
+ - Often has red `<w:color w:val="FF0000"/>` styling
114
+
115
+ **Action**: Flag as `field_type: "tip_box"`, `action: "delete"`. Delete the entire `<w:tbl>` element.
116
+
117
+ ## Color Warning for Copied Formatting
118
+
119
+ When copying `<w:rPr>` from template guide text or instruction text (Patterns 2 and 5), **always check for red color**:
120
+
121
+ ```xml
122
+ <!-- DANGER: This rPr has red color from guide text -->
123
+ <w:rPr>
124
+ <w:color w:val="FF0000"/> <!-- REMOVE THIS -->
125
+ <w:i/> <!-- REMOVE THIS (from guide text) -->
126
+ <w:sz w:val="20"/> <!-- KEEP -->
127
+ </w:rPr>
128
+ ```
129
+
130
+ **Rule**: After copying rPr from any template text, check for `<w:color>` elements. If the value is `FF0000`, `FF0000FF`, or any red shade, **remove the `<w:color>` element** (defaults to black). Also remove `<w:i/>` if it came from guide text.
131
+
132
+ ## Safe Modification Template
133
+
134
+ ```python
135
+ import xml.etree.ElementTree as ET
136
+
137
+ ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
138
+ ET.register_namespace("w", ns["w"])
139
+
140
+ tree = ET.parse("word/document.xml")
141
+
142
+ # Find and replace placeholder text (field_values: dict of field name -> value, supplied by the caller)
143
+ for t_elem in tree.iter("{%s}t" % ns["w"]):
144
+     if t_elem.text and ("{{" in t_elem.text or "<<" in t_elem.text):
145
+ placeholder = t_elem.text # e.g., "{{name}}"
146
+ field_name = placeholder.strip("{}").strip("<>")
147
+ if field_name in field_values:
148
+ t_elem.text = field_values[field_name]
149
+
150
+ tree.write("word/document.xml", xml_declaration=True, encoding="UTF-8")
151
+ ```
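The template above assumes the placeholder is the only text in its `<w:t>`; when a run mixes label text and a placeholder (for example `Name: {{full_name}}`), the derived field name will not match. A regex-based variant that substitutes in place, for both `{{...}}` and `<<...>>`, might look like the sketch below. `fill_placeholders` and the `field_values` mapping are hypothetical names, and placeholders that Word has split across several runs are not handled.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Matches {{name}} or <<name>>, tolerating spaces inside the delimiters.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}|<<\s*([^<>]+?)\s*>>")


def fill_placeholders(root, field_values):
    """Substitute known placeholders inside every <w:t>; return how many were filled.

    Only the text of <w:t> is touched, so the sibling <w:rPr> formatting is kept.
    Unknown placeholders are left as they are.
    """
    filled = 0

    def substitute(match):
        nonlocal filled
        name = match.group(1) or match.group(2)
        if name in field_values:
            filled += 1
            return field_values[name]
        return match.group(0)

    for t_elem in root.iter("{%s}t" % W_NS):
        if t_elem.text:
            t_elem.text = PLACEHOLDER.sub(substitute, t_elem.text)
    return filled
```

Leaving unknown placeholders untouched keeps them visible for a later review pass instead of silently blanking them.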
@@ -0,0 +1,58 @@
1
+ # DOCX XML Structure Reference
2
+
3
+ ## Unpacking a DOCX
4
+
5
+ ```bash
6
+ # Unzip to inspect
7
+ mkdir -p .dokkit/template_work
8
+ cd .dokkit/template_work
9
+ unzip -o /path/to/template.docx
10
+ ```
11
+
12
+ ## Reading document.xml
13
+
14
+ The main content is in `word/document.xml`. Parse with any XML parser.
15
+
16
+ ### Python Example
17
+ ```python
18
+ import xml.etree.ElementTree as ET
19
+
20
+ ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
21
+ tree = ET.parse("word/document.xml")
22
+ root = tree.getroot()
23
+
24
+ # Find all paragraphs
25
+ for p in root.iter("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p"):
26
+ texts = []
27
+ for t in p.iter("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"):
28
+ if t.text:
29
+ texts.append(t.text)
30
+ print("".join(texts))
31
+ ```
32
+
33
+ ## Repackaging a DOCX
34
+
35
+ After modifying XML, repackage as a valid DOCX:
36
+
37
+ ```python
38
+ import zipfile
39
+ import os
40
+
41
+ def repackage_docx(work_dir, output_path):
42
+ """Repackage modified XML files into a valid DOCX."""
43
+ with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zf:
44
+ for root, dirs, files in os.walk(work_dir):
45
+ for file in files:
46
+ file_path = os.path.join(root, file)
47
+ arcname = os.path.relpath(file_path, work_dir)
48
+ zf.write(file_path, arcname)
49
+ ```
50
+
51
+ ## Critical Rules for DOCX Surgery
52
+
53
+ 1. **Never remove `<w:rPr>` elements** — they contain all formatting
54
+ 2. **Preserve `xml:space="preserve"`** on `<w:t>` elements with leading/trailing spaces
55
+ 3. **Keep `<w:pPr>` intact** — paragraph formatting must not change
56
+ 4. **Maintain bookmark pairs** — `<w:bookmarkStart>` must have matching `<w:bookmarkEnd>`
57
+ 5. **Don't modify `<w:sectPr>`** — section properties control page layout
58
+ 6. **Preserve table cell merge attributes** — `<w:vMerge>` and `<w:gridSpan>`
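Rule 4 above lends itself to a quick automated check after any edit. The sketch below pairs `<w:bookmarkStart>` and `<w:bookmarkEnd>` elements by their `w:id` attribute and reports the ids that lack a partner. `unmatched_bookmarks` is a hypothetical helper, not part of `validate_docx.py`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _bookmark_ids(root, local_name):
    """Collect the w:id values of all elements with the given local name."""
    ids = set()
    for elem in root.iter("{%s}%s" % (W_NS, local_name)):
        bookmark_id = elem.get("{%s}id" % W_NS)
        if bookmark_id is not None:
            ids.add(bookmark_id)
    return ids


def unmatched_bookmarks(root):
    """Return (starts_without_end, ends_without_start) as sorted lists of ids."""
    starts = _bookmark_ids(root, "bookmarkStart")
    ends = _bookmark_ids(root, "bookmarkEnd")
    return sorted(starts - ends), sorted(ends - starts)
```

Running it on `ET.parse("word/document.xml").getroot()` before and after a modification, and comparing the two results, would show whether the edit dropped one half of a pair.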
@@ -0,0 +1,130 @@
1
+ # Field Detection Patterns
2
+
3
+ ## DOCX Detection Heuristics
4
+
5
+ ### Heuristic 1: Curly Brace Placeholders
6
+ ```regex
7
+ \{\{[^}]+\}\}
8
+ ```
9
+ Match text like `{{field_name}}`. High reliability.
10
+
11
+ ### Heuristic 2: Angle Bracket Placeholders
12
+ ```regex
13
+ <<[^>]+>>
14
+ ```
15
+ Match text like `<<field_name>>`. High reliability.
16
+
17
+ ### Heuristic 3: Square Bracket Placeholders
18
+ ```regex
19
+ \[[^\]]+\]
20
+ ```
21
+ Match text like `[field_name]`. Medium reliability (may match references).
22
+
23
+ ### Heuristic 4: Underline-Only Runs
24
+ A run where:
25
+ - `<w:rPr>` contains `<w:u w:val="single"/>`
26
+ - `<w:t>` contains only spaces, underscores, or is empty
27
+ - Run length > 3 characters
28
+
29
+ ### Heuristic 5: Empty Table Cells
30
+ A `<w:tc>` that:
31
+ - Contains only `<w:p/>` or `<w:p><w:pPr/></w:p>` (empty paragraph)
32
+ - Is adjacent to a cell containing text (the label)
33
+ - The label cell's text is short (< 50 chars) and not numeric
34
+
35
+ ### Heuristic 6: Instruction Text
36
+ A run where text matches patterns like:
37
+ ```regex
38
+ \(.*?(enter|type|input|write|fill|입력).*?\)
39
+ ```
40
+
41
+ ### Heuristic 7: Content Controls
42
+ Any `<w:sdt>` element with `<w:showingPlcHdr/>` in its properties.
43
+
44
+ ### Heuristic 8: Image Fields
45
+ A field is classified as `image` when any of these conditions hold:
46
+ - A `{{placeholder}}` or `<<placeholder>>` contains an image keyword
47
+ - A table cell contains an existing `<w:drawing>` element (pre-positioned image slot)
48
+ - An empty table cell is adjacent to a cell whose label matches an image keyword
49
+
50
+ **Image keywords** (case-insensitive):
51
+ - Korean: 사진, 증명사진, 여권사진, 로고, 서명, 날인, 도장, 직인
52
+ - English: Photo, Picture, Logo, Signature, Stamp, Seal, Image, Portrait
53
+
54
+ **Image type classification**:
55
+ | Keyword match | `image_type` |
56
+ |---------------|-------------|
57
+ | 사진, 증명사진, 여권사진, photo, picture, portrait, image | `photo` |
58
+ | 로고, logo | `logo` |
59
+ | 서명, 날인, 도장, 직인, signature, stamp, seal | `signature` |
60
+ | (no keyword match) | `figure` |
61
+
62
+ Image fields are **excluded** from the `placeholder_text` and `empty_cell` detectors to prevent double-detection.
63
+
64
+ ### Heuristic 9: Tip Box
65
+ A `<w:tbl>` that:
66
+ - Has exactly one row and one cell (1×1 table)
67
+ - `<w:tblBorders>` uses `w:val="dashed"` borders
68
+ - Cell text starts with `※` or contains `작성 팁` / `작성요령`
69
+ - Often has red text color (`<w:color w:val="FF0000"/>`)
70
+
71
+ → `field_type: "tip_box"`, `action: "delete"`
72
+
73
+ ## HWPX Detection Heuristics
74
+
75
+ ### Heuristic 1: Empty Adjacent Cells
76
+ Same as DOCX but using `<hp:tc>` and `<hp:t>` elements.
77
+
78
+ ### Heuristic 2: Korean Instruction Text
79
+ ```regex
80
+ \(.*?(입력|기재|작성).*?\)
81
+ ```
82
+
83
+ ### Heuristic 3: Date Component Cells
84
+ Cells immediately before 년/월/일 (year/month/day) markers.
85
+
86
+ ### Heuristic 4: Image Fields
87
+ Same logic as DOCX Heuristic 8, adapted for HWPX elements:
88
+ - `<hp:pic>` instead of `<w:drawing>`
89
+ - `<hp:tc>` / `<hp:t>` instead of `<w:tc>` / `<w:t>`
90
+ - Same image keyword list and type classification
91
+
92
+ ### Heuristic 5: Tip Box
93
+ An `<hp:tbl>` that:
94
+ - Has `rowCnt="1"` and `colCnt="1"` (single-cell table)
95
+ - `borderFillIDRef` resolves to DASH border style in `header.xml`
96
+ - Cell text starts with `※` or contains `작성 팁` / `작성요령` / `작성 요령`
97
+ - May appear standalone or nested inside a `<hp:subList>` within another cell
98
+
99
+ → `field_type: "tip_box"`, `action: "delete"`, `container: "standalone"|"nested"`
100
+
101
+ ### Heuristic 6: Section Header Rows
102
+ Table rows where:
103
+ - First cell spans multiple columns (`hp:cellSpan colSpan > 1`)
104
+ - Text is short and descriptive (section name)
105
+ - Background may be shaded
106
+
107
+ ## HWPX Pre-Fill Sanitization
108
+
109
+ ### Negative Character Spacing
110
+ HWPX templates may define `<hh:charPr>` elements in `header.xml` with negative `<hh:spacing>` values (e.g., `hangul="-3"`). These compress characters closer together, which works for short placeholder text but causes **severe text overlap** when the filler replaces placeholders with longer content.
111
+
112
+ **Rule**: Before filling, scan ALL `<hh:charPr>` definitions in `header.xml` and set any negative spacing attribute values to `"0"`. This applies to all attributes: `hangul`, `latin`, `hanja`, `japanese`, `other`, `symbol`, `user`.
113
+
114
+ **Example fix**:
115
+ ```xml
116
+ <!-- Before (causes overlap) -->
117
+ <hh:spacing hangul="-3" latin="-3" hanja="-3" japanese="-3" other="-3" symbol="-3" user="-3"/>
118
+
119
+ <!-- After (normal spacing) -->
120
+ <hh:spacing hangul="0" latin="0" hanja="0" japanese="0" other="0" symbol="0" user="0"/>
121
+ ```
122
+
123
+ ## False Positive Filtering
124
+
125
+ Exclude detected "fields" that are:
126
+ - Part of a header/title row (not fillable)
127
+ - Copyright notices or footer text
128
+ - Page numbers or running headers
129
+ - Table of contents entries
130
+ - Cross-reference markers
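The pre-fill sanitization rule above (negative character spacing) can be sketched as a small pass over the parsed `header.xml`. This is an illustration under assumptions: `zero_negative_spacing` is a hypothetical name, and elements are matched by local name because the `hh` namespace URI is not stated in this document.

```python
import xml.etree.ElementTree as ET

SPACING_ATTRS = ("hangul", "latin", "hanja", "japanese", "other", "symbol", "user")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def is_negative(value):
    """True if the attribute value parses as an integer below zero."""
    try:
        return int(value) < 0
    except (TypeError, ValueError):
        return False


def zero_negative_spacing(root):
    """Set negative spacing attributes under every charPr to "0"; return the count."""
    changed = 0
    for char_pr in root.iter():
        if not isinstance(char_pr.tag, str) or local_name(char_pr.tag) != "charPr":
            continue
        for child in char_pr:
            if not isinstance(child.tag, str) or local_name(child.tag) != "spacing":
                continue
            for attr in SPACING_ATTRS:
                if is_negative(child.get(attr)):
                    child.set(attr, "0")
                    changed += 1
    return changed
```

Positive and zero values are left alone, so only the compression that causes overlap is removed. If the tree is written back with ElementTree, the HWPX namespace and XML declaration pitfalls listed in SKILL.md still apply.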