devlyn-cli 0.5.2 → 0.5.3

Files changed (34)
  1. package/bin/devlyn.js +1 -0
  2. package/optional-skills/dokkit/ANALYSIS.md +198 -0
  3. package/optional-skills/dokkit/COMMANDS.md +365 -0
  4. package/optional-skills/dokkit/DOCX-XML.md +76 -0
  5. package/optional-skills/dokkit/EXPORT.md +102 -0
  6. package/optional-skills/dokkit/FILLING.md +377 -0
  7. package/optional-skills/dokkit/HWPX-XML.md +73 -0
  8. package/optional-skills/dokkit/IMAGE-SOURCING.md +127 -0
  9. package/optional-skills/dokkit/INGESTION.md +65 -0
  10. package/optional-skills/dokkit/SKILL.md +153 -0
  11. package/optional-skills/dokkit/STATE.md +60 -0
  12. package/optional-skills/dokkit/references/docx-field-patterns.md +151 -0
  13. package/optional-skills/dokkit/references/docx-structure.md +58 -0
  14. package/optional-skills/dokkit/references/field-detection-patterns.md +130 -0
  15. package/optional-skills/dokkit/references/hwpx-field-patterns.md +461 -0
  16. package/optional-skills/dokkit/references/hwpx-structure.md +159 -0
  17. package/optional-skills/dokkit/references/image-opportunity-heuristics.md +121 -0
  18. package/optional-skills/dokkit/references/image-xml-patterns.md +338 -0
  19. package/optional-skills/dokkit/references/section-image-interleaving.md +346 -0
  20. package/optional-skills/dokkit/references/section-range-detection.md +118 -0
  21. package/optional-skills/dokkit/references/state-schema.md +143 -0
  22. package/optional-skills/dokkit/references/supported-formats.md +67 -0
  23. package/optional-skills/dokkit/scripts/compile_hwpx.py +134 -0
  24. package/optional-skills/dokkit/scripts/detect_fields.py +301 -0
  25. package/optional-skills/dokkit/scripts/detect_fields_hwpx.py +286 -0
  26. package/optional-skills/dokkit/scripts/export_pdf.py +99 -0
  27. package/optional-skills/dokkit/scripts/parse_hwpx.py +185 -0
  28. package/optional-skills/dokkit/scripts/parse_image_with_gemini.py +159 -0
  29. package/optional-skills/dokkit/scripts/parse_xlsx.py +98 -0
  30. package/optional-skills/dokkit/scripts/source_images.py +365 -0
  31. package/optional-skills/dokkit/scripts/validate_docx.py +142 -0
  32. package/optional-skills/dokkit/scripts/validate_hwpx.py +281 -0
  33. package/optional-skills/dokkit/scripts/validate_state.py +132 -0
  34. package/package.json +1 -1
@@ -0,0 +1,461 @@
1
+ # HWPX Field Detection Patterns
2
+
3
+ ## Pattern 1: Empty Table Cell
4
+
5
+ Korean forms are heavily table-based. The most common pattern:
6
+
7
+ ```xml
8
+ <hp:tr>
9
+ <hp:tc>
10
+ <!-- Label cell -->
11
+ <hp:p>
12
+ <hp:run>
13
+ <hp:rPr charPrIDRef="1"/>
14
+ <hp:t>성명</hp:t>
15
+ </hp:run>
16
+ </hp:p>
17
+ </hp:tc>
18
+ <hp:tc>
19
+ <!-- Empty value cell → FILL THIS -->
20
+ <hp:p>
21
+ <hp:lineseg/>
22
+ </hp:p>
23
+ </hp:tc>
24
+ </hp:tr>
25
+ ```
26
+
27
+ **Action**: Insert a new `<hp:run>` with `<hp:t>value</hp:t>` into the empty paragraph. Copy `charPrIDRef` from the label cell's run, unless that run uses a red guide-text style (see Pattern 6).
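The action above can be sketched with `xml.etree.ElementTree`. This is a minimal illustration, not the package's filler code: `hp_tag` and `fill_empty_cell` are hypothetical names, and the run layout (`<hp:rPr>` followed by `<hp:t>`) simply mirrors the sample XML in this pattern.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"


def hp_tag(name):
    """Clark-notation tag name for the hp: namespace (hypothetical helper)."""
    return f"{{{HP}}}{name}"


def fill_empty_cell(label_cell, value_cell, value):
    """Append a text run to the first paragraph found in value_cell."""
    # A descendant search also finds paragraphs wrapped in <hp:subList>.
    para = value_cell.find(f".//{hp_tag('p')}")
    if para is None:
        raise ValueError("value cell has no <hp:p> to fill")
    run = ET.SubElement(para, hp_tag("run"))
    # Reuse the label run's charPrIDRef, as the Action above suggests.
    label_rpr = label_cell.find(f".//{hp_tag('rPr')}")
    if label_rpr is not None and label_rpr.get("charPrIDRef") is not None:
        ET.SubElement(run, hp_tag("rPr"), {"charPrIDRef": label_rpr.get("charPrIDRef")})
    ET.SubElement(run, hp_tag("t")).text = value
    return run


# Sample row shaped like the Pattern 1 example.
row = ET.fromstring(
    f'<hp:tr xmlns:hp="{HP}">'
    '<hp:tc><hp:p><hp:run><hp:rPr charPrIDRef="1"/><hp:t>성명</hp:t></hp:run></hp:p></hp:tc>'
    "<hp:tc><hp:p><hp:lineseg/></hp:p></hp:tc>"
    "</hp:tr>"
)
label_cell, value_cell = row.findall(hp_tag("tc"))
fill_empty_cell(label_cell, value_cell, "홍길동")
```

The sketch copies the style blindly; a real filler would first check that the label's style is not one of the red guide-text styles.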
28
+
29
+ ## Pattern 2: Placeholder Text in Cell
30
+
31
+ ```xml
32
+ <hp:tc>
33
+ <hp:p>
34
+ <hp:run>
35
+ <hp:t>(이름을 입력하세요)</hp:t> <!-- Instruction text -->
36
+ </hp:run>
37
+ </hp:p>
38
+ </hp:tc>
39
+ ```
40
+
41
+ **Action**: Replace the text in `<hp:t>` with the actual value.
42
+
43
+ ## Pattern 3: Multi-Row Spanning Label
44
+
45
+ Korean forms often have a label cell spanning multiple rows:
46
+
47
+ ```xml
48
+ <hp:tr>
49
+ <hp:tc>
50
+ <hp:cellSpan rowSpan="3"/>
51
+ <hp:p><hp:run><hp:t>학력</hp:t></hp:run></hp:p>
52
+ </hp:tc>
53
+ <hp:tc><hp:p><hp:run><hp:t>학교명</hp:t></hp:run></hp:p></hp:tc>
54
+ <hp:tc><hp:p/></hp:tc> <!-- Empty → fill with school name -->
55
+ </hp:tr>
56
+ ```
57
+
58
+ **Action**: Treat the spanning label ("학력" = Education) as the section name and each sub-label ("학교명" = School Name) as the label of an individual field. Fill the empty cell next to each sub-label.
59
+
60
+ ## Pattern 4: Date Fields
61
+
62
+ ```xml
63
+ <hp:tc>
64
+ <hp:p>
65
+ <hp:run><hp:t>년</hp:t></hp:run> <!-- Year -->
66
+ </hp:p>
67
+ </hp:tc>
68
+ <hp:tc>
69
+ <hp:p>
70
+ <hp:run><hp:t>월</hp:t></hp:run> <!-- Month -->
71
+ </hp:p>
72
+ </hp:tc>
73
+ <hp:tc>
74
+ <hp:p>
75
+ <hp:run><hp:t>일</hp:t></hp:run> <!-- Day -->
76
+ </hp:p>
77
+ </hp:tc>
78
+ ```
79
+
80
+ **Action**: Fill the cells preceding 년/월/일 with the appropriate date components.
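One way to read this action, as a rough sketch: walk the cells of a row in order and, whenever a cell holds only 년, 월, or 일, write the matching date component into the empty cell just before it. The helper names (`hp_tag`, `cell_text`, `fill_date_cells`) are hypothetical, and the assumption that the value cell sits immediately before its marker cell comes from the wording above, not from a verified template.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"


def hp_tag(name):
    """Clark-notation tag name for the hp: namespace (hypothetical helper)."""
    return f"{{{HP}}}{name}"


def cell_text(cell):
    """Concatenate the text of every <hp:t> inside a cell."""
    return "".join(t.text or "" for t in cell.iter(hp_tag("t"))).strip()


def fill_date_cells(cells, year, month, day):
    """Fill the empty cell before each 년/월/일 marker; return the fill count."""
    values = {"년": str(year), "월": str(month), "일": str(day)}
    filled = 0
    for prev, cur in zip(cells, cells[1:]):
        marker = cell_text(cur)
        # Only fill when the current cell is a bare marker and the previous is empty.
        if marker in values and not cell_text(prev):
            para = prev.find(f".//{hp_tag('p')}")
            if para is None:
                continue
            run = ET.SubElement(para, hp_tag("run"))
            ET.SubElement(run, hp_tag("t")).text = values[marker]
            filled += 1
    return filled
```

Cells that already contain text are left alone, so running the function twice does not duplicate values.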
81
+
82
+ ## Pattern 5: Writing Tip Box (작성 팁)
83
+
84
+ Standalone 1×1 tables with DASH-bordered cells that contain `※` guidance text. These are NOT fillable fields — they must be **deleted** before or during filling.
85
+
86
+ ```xml
87
+ <hp:tbl rowCnt="1" colCnt="1">
88
+ <hp:tr>
89
+ <hp:tc borderFillIDRef="16">
90
+ <hp:p>
91
+ <hp:run>
92
+ <hp:rPr charPrIDRef="45"/> <!-- Often RED style -->
93
+ <hp:t>※ 작성 팁: 사업의 목적과 필요성을 구체적으로 작성하세요.</hp:t>
94
+ </hp:run>
95
+ </hp:p>
96
+ <hp:p>
97
+ <hp:run>
98
+ <hp:rPr charPrIDRef="45"/>
99
+ <hp:t>※ 관련 법령이나 정책 근거를 제시하면 좋습니다.</hp:t>
100
+ </hp:run>
101
+ </hp:p>
102
+ </hp:tc>
103
+ </hp:tr>
104
+ </hp:tbl>
105
+ ```
106
+
107
+ **Identifying traits**:
108
+ - `rowCnt="1"` and `colCnt="1"` (single-cell table)
109
+ - `borderFillIDRef` resolves to DASH border style in `header.xml`
110
+ - Text starts with `※` or contains `작성 팁`, `작성요령`, `작성 요령`
111
+ - Often appears inside a `<hp:subList>` within another table cell
112
+
113
+ **Two container types**:
114
+ - **Standalone**: Top-level 1×1 table between other content → delete the entire `<hp:tbl>`
115
+ - **Nested**: Inside a `<hp:subList>` within a fill-target cell → delete the `<hp:subList>` element
116
+
117
+ **Action**: Flag as `field_type: "tip_box"`, `action: "delete"`. The filler agent removes these before filling.
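The text-based traits listed above can be checked with a short heuristic. The sketch below is an illustration only: `is_tip_box` and `flag_tip_box` are hypothetical names, and it deliberately skips the `borderFillIDRef` lookup in `header.xml`, so it may over-match 1×1 tables that merely start with `※`.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"
TIP_MARKERS = ("작성 팁", "작성요령", "작성 요령")


def hp_tag(name):
    """Clark-notation tag name for the hp: namespace (hypothetical helper)."""
    return f"{{{HP}}}{name}"


def is_tip_box(tbl):
    """Heuristic: single-cell table whose text starts with ※ or has a tip marker."""
    if tbl.get("rowCnt") != "1" or tbl.get("colCnt") != "1":
        return False
    text = "".join(t.text or "" for t in tbl.iter(hp_tag("t"))).strip()
    return text.startswith("※") or any(marker in text for marker in TIP_MARKERS)


def flag_tip_box(tbl):
    """Return the flag described in the Action above, or None for ordinary tables."""
    if is_tip_box(tbl):
        return {"field_type": "tip_box", "action": "delete"}
    return None
```

Whether to delete the `<hp:tbl>` or the enclosing `<hp:subList>` still depends on the container type described above.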
118
+
119
+ ## Pattern 6: Character Property Resolution (charPrIDRef)
120
+
121
+ HWPX text formatting is controlled by `charPrIDRef` attributes that reference `<hh:charPr>` entries in `header.xml`.
122
+
123
+ ### How charPrIDRef works
124
+ ```xml
125
+ <!-- In section*.xml — a run references charPr ID 45 -->
126
+ <hp:run>
127
+ <hp:rPr charPrIDRef="45"/>
128
+ <hp:t>Some text</hp:t>
129
+ </hp:run>
130
+
131
+ <!-- In header.xml — charPr ID 45 defines the style -->
132
+ <hh:charPr id="45" height="1000" textColor="#FF0000"
133
+ bold="false" italic="true" spacing="-5"/>
134
+ ```
135
+
136
+ ### Template guide text uses RED styles
137
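+ Many templates style guide text, tip boxes, and instructions with red (`#FF0000`) `charPr` entries. Red IDs commonly seen in Korean government templates include 39, 45, 51, 52, 57, 62, and 81. IDs are defined per file in `header.xml`, so treat this list as a hint and resolve the actual color from the header.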
138
+
139
+ **Critical rule**: When filling a field, NEVER copy `charPrIDRef` from guide/tip text. Instead, find or create a black (#000000) charPr.
140
+
141
+ ### Finding a suitable black charPr
142
+ ```python
+ import xml.etree.ElementTree as ET
+
+ def find_black_charpr(header_path):
+     """Find a charPrIDRef suitable for filled text (black, normal style)."""
+     hns = {"hh": "http://www.hancom.co.kr/hwpml/2011/head"}
+     tree = ET.parse(header_path)
+     root = tree.getroot()
+
+     candidates = []
+     for cp in root.iter("{%s}charPr" % hns["hh"]):
+         color = cp.get("textColor", "#000000")
+         bold = cp.get("bold", "false")
+         italic = cp.get("italic", "false")
+         spacing = int(cp.get("spacing", "0"))
+
+         # Want: black text, not italic, non-negative spacing.
+         # `color` is upper-cased before the comparison, so the named
+         # color must be listed in upper case as well.
+         if color.upper() in ("#000000", "#000000FF", "BLACK") and \
+                 italic == "false" and spacing >= 0:
+             candidates.append({
+                 "id": cp.get("id"),
+                 "bold": bold == "true",
+                 "height": int(cp.get("height", "1000")),
+                 "spacing": spacing,
+             })
+
+     # Prefer non-bold, standard size, zero spacing
+     normal = [c for c in candidates if not c["bold"] and c["spacing"] == 0]
+     bold_list = [c for c in candidates if c["bold"] and c["spacing"] == 0]
+
+     return {
+         "normal": normal[0]["id"] if normal else None,
+         "bold": bold_list[0]["id"] if bold_list else None,
+     }
+ ```
177
+
178
+ ### Creating a new charPr if needed
179
+ If no suitable black charPr exists in `header.xml`, create one by appending a new `<hh:charPr>` element with the next available ID, `textColor="#000000"`, `bold="false"`, `italic="false"`, `spacing="0"`.
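A minimal sketch of that fallback, assuming the attribute-style `<hh:charPr>` shown earlier in this pattern. `add_black_charpr` is a hypothetical name, the default `height` of `"1000"` is borrowed from the example above, and the new element is appended to whichever element already holds the existing `<hh:charPr>` entries. If that container tracks an item count, it would need updating too; the sketch does not attempt this.

```python
import xml.etree.ElementTree as ET

HH = "http://www.hancom.co.kr/hwpml/2011/head"
CHARPR = f"{{{HH}}}charPr"


def add_black_charpr(root, height="1000"):
    """Append a black, non-bold, non-italic charPr; return its id as a string."""
    # Locate the element that directly contains the existing charPr entries.
    for parent in root.iter():
        existing = parent.findall(CHARPR)
        if existing:
            break
    else:
        raise ValueError("no <hh:charPr> elements found in header")

    # Next available ID = highest existing numeric id + 1.
    next_id = max(int(cp.get("id", "-1")) for cp in existing) + 1
    ET.SubElement(parent, CHARPR, {
        "id": str(next_id),
        "height": height,
        "textColor": "#000000",
        "bold": "false",
        "italic": "false",
        "spacing": "0",
    })
    return str(next_id)
```

The returned ID can then be used as the `charPrIDRef` of newly filled runs.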
180
+
181
+ ## Pattern 7: Image Field in Table Cell
182
+
183
+ A label cell containing image-related keywords (사진, 증명사진, 로고, 서명, 직인, 사업자등록증) next to an empty cell indicates an image insertion point.
184
+
185
+ ```xml
186
+ <hp:tr>
187
+ <hp:tc>
188
+ <!-- Label cell with image keyword -->
189
+ <hp:p>
190
+ <hp:run>
191
+ <hp:rPr charPrIDRef="1"/>
192
+ <hp:t>사진</hp:t>
193
+ </hp:run>
194
+ </hp:p>
195
+ </hp:tc>
196
+ <hp:tc>
197
+ <!-- Empty cell → INSERT IMAGE HERE -->
198
+ <hp:p>
199
+ <hp:lineseg/>
200
+ </hp:p>
201
+ </hp:tc>
202
+ </hp:tr>
203
+ ```
204
+
205
+ **Action**: Insert a `<hp:pic>` element INSIDE a `<hp:run>` within the cell's `<hp:p>`. The `<hp:t/>` goes AFTER the pic inside the run.
206
+
207
+ ### Image Paragraph Structure (CRITICAL)
208
+
209
+ ```xml
210
+ <!-- pic must be INSIDE run, t/ AFTER pic (matches real Hancom Office output) -->
211
+ <hp:p id="..." paraPrIDRef="..." styleIDRef="0" pageBreak="0" columnBreak="0" merged="0">
212
+ <hp:linesegarray>
213
+ <hp:lineseg textpos="0" vertpos="0" vertsize="{H}" textheight="{H}"
214
+ baseline="{H*0.85}" spacing="500" .../>
215
+ </hp:linesegarray>
216
+ <hp:run charPrIDRef="0">
217
+ <hp:pic id="{seq_id}" zOrder="{z}" ...>...</hp:pic>
218
+ <hp:t/>
219
+ </hp:run>
220
+ </hp:p>
221
+ ```
222
+
223
+ ### Complete `<hp:pic>` Structure (Hancom Canonical Order)
224
+
225
+ ```xml
226
+ <hp:pic id="{seq_id}" zOrder="{z}" numberingType="PICTURE" textWrap="TOP_AND_BOTTOM"
227
+ textFlow="BOTH_SIDES" lock="0" dropcapstyle="None"
228
+ href="" groupLevel="0" instid="{seq_id}" reverse="0">
229
+ <!-- Group 1: Geometry -->
230
+ <hp:offset x="0" y="0"/>
231
+ <hp:orgSz width="{W}" height="{H}"/>
232
+ <hp:curSz width="{W}" height="{H}"/>
233
+ <hp:flip horizontal="0" vertical="0"/>
234
+ <hp:rotationInfo angle="0" centerX="{W/2}" centerY="{H/2}" rotateimage="1"/>
235
+ <hp:renderingInfo>
236
+ <hc:transMatrix e1="1" e2="0" e3="0" e4="0" e5="1" e6="0"/>
237
+ <hc:scaMatrix e1="1" e2="0" e3="0" e4="0" e5="1" e6="0"/>
238
+ <hc:rotMatrix e1="1" e2="-0" e3="0" e4="0" e5="1" e6="0"/>
239
+ </hp:renderingInfo>
240
+ <!-- Group 2: Image data -->
241
+ <hp:imgRect>
242
+ <hc:pt0 x="0" y="0"/>
243
+ <hc:pt1 x="{W}" y="0"/>
244
+ <hc:pt2 x="{W}" y="{H}"/>
245
+ <hc:pt3 x="0" y="{H}"/>
246
+ </hp:imgRect>
247
+ <hp:imgClip left="0" right="{pixW}" top="0" bottom="{pixH}"/>
248
+ <hp:inMargin left="0" right="0" top="0" bottom="0"/>
249
+ <hp:imgDim dimwidth="{pixW}" dimheight="{pixH}"/>
250
+ <hc:img binaryItemIDRef="{manifest_id}" bright="0" contrast="0" effect="REAL_PIC" alpha="0"/>
251
+ <!-- Group 3: Layout (AFTER hc:img) -->
252
+ <hp:sz width="{W}" widthRelTo="ABSOLUTE" height="{H}" heightRelTo="ABSOLUTE" protect="0"/>
253
+ <hp:pos treatAsChar="1" affectLSpacing="0" flowWithText="0" allowOverlap="0"
254
+ holdAnchorAndSO="0" vertRelTo="PARA" horzRelTo="COLUMN"
255
+ vertAlign="TOP" horzAlign="LEFT" vertOffset="0" horzOffset="0"/>
256
+ <hp:outMargin left="0" right="0" top="0" bottom="0"/>
257
+ </hp:pic>
258
+ ```
259
+
260
+ Where: `{W}/{H}` = HWPML units (1/7200 inch), `{pixW}/{pixH}` = pixel dimensions from PIL, `{manifest_id}` = `id` from `content.hpf`.
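Since `{W}` and `{H}` are in 1/7200 inch while `{pixW}` and `{pixH}` are pixels, some conversion is needed. The sketch below shows the arithmetic only; the function names are hypothetical and the 96 DPI default is an assumption, not something the source specifies.

```python
HWPML_UNITS_PER_INCH = 7200


def px_to_hwpml(pixels, dpi=96):
    """Convert a pixel length to HWPML units (1/7200 inch) at the given DPI."""
    return round(pixels * HWPML_UNITS_PER_INCH / dpi)


def fit_to_width(pix_w, pix_h, target_w_units):
    """Scale an image to a target width in HWPML units, keeping its aspect ratio."""
    return target_w_units, round(target_w_units * pix_h / pix_w)


# An 800x600 px image at the assumed 96 DPI:
W, H = px_to_hwpml(800), px_to_hwpml(600)

# The same image constrained to a 30000-unit-wide slot:
W_fit, H_fit = fit_to_width(800, 600, 30000)
```

`imgClip` and `imgDim` keep the raw pixel values; only `orgSz`, `curSz`, `imgRect`, and `sz` use the converted units.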
261
+
262
+ ### 9 Critical Rules for `<hp:pic>`
263
+
264
+ 1. **`<img>` uses `hc:` namespace** — `<hc:img>`, NOT `<hp:img>`
265
+ 2. **`<imgRect>` has 4 `<hc:pt>` children** — `<hc:pt0>` through `<hc:pt3>`, NOT inline attributes
266
+ 3. **All required children present** — `offset`, `orgSz`, `curSz`, `flip`, `rotationInfo`, `renderingInfo`, `inMargin`
267
+ 4. **No spurious elements** — Do NOT add `hp:lineShape`, `hp:caption`, `hp:shapeComment`
268
+ 5. **`imgClip` right/bottom = pixel dims** — from `imgDim`, NOT zeros
269
+ 6. **Hancom canonical element order** — offset, orgSz, ..., hc:img, **then** sz, pos, outMargin
270
+ 7. **Register in `content.hpf` manifest only** — Do NOT add `<hh:binDataItems>` to `header.xml`
271
+ 8. **`hp:pos` attributes** — `flowWithText="0"` `horzRelTo="COLUMN"`
272
+ 9. **pic INSIDE run, t AFTER pic** — `<hp:run><hp:pic>...</hp:pic><hp:t/></hp:run>`
273
+
274
+ ## Pattern 8: SubList Cell Wrapping (CRITICAL)
275
+
276
+ In Korean government HWPX templates, ~65% of table cells wrap their content in `<hp:subList>/<hp:p>` rather than having `<hp:p>` as a direct child of `<hp:tc>`. Hancom Office reads content from inside `<hp:subList>` and ignores orphaned direct `<hp:p>` elements.
277
+
278
+ ### Two cell structures
279
+
280
+ **Direct pattern** (~35% of cells):
281
+ ```xml
282
+ <hp:tc>
283
+ <hp:cellAddr .../>
284
+ <hp:cellSpan .../>
285
+ <hp:cellSz .../>
286
+ <hp:p>
287
+ <hp:run><hp:t>Content here</hp:t></hp:run>
288
+ </hp:p>
289
+ </hp:tc>
290
+ ```
291
+
292
+ **SubList pattern** (~65% of cells):
293
+ ```xml
294
+ <hp:tc>
295
+ <hp:cellAddr .../>
296
+ <hp:cellSpan .../>
297
+ <hp:cellSz .../>
298
+ <hp:subList>
299
+ <hp:p>
300
+ <hp:run><hp:t>Content here</hp:t></hp:run>
301
+ </hp:p>
302
+ </hp:subList>
303
+ </hp:tc>
304
+ ```
305
+
306
+ ### Critical rule for filling
307
+
308
+ When writing content into a cell, ALWAYS check for `<hp:subList>` first:
309
+ 1. If `<hp:subList>` exists: write into `<hp:subList>/<hp:p>`, NOT as a direct `<hp:p>` child of `<hp:tc>`
310
+ 2. If no `<hp:subList>`: write as direct `<hp:p>` child of `<hp:tc>` (standard pattern)
311
+
312
+ **Wrong** — creates orphaned paragraphs that Hancom ignores:
313
+ ```python
314
+ # BAD: always writes to cell directly
315
+ p = ET.SubElement(cell, hp_tag("p"))
316
+ ```
317
+
318
+ **Correct** — respects subList wrapper:
319
+ ```python
+ # GOOD: check for subList first
+ container = cell
+ for c in cell:
+     if c.tag == hp_tag("subList"):
+         container = c
+         break
+ p = ET.SubElement(container, hp_tag("p"))
+ ```
328
+
329
+ This applies to ALL cell operations: `clear_cell_content()`, `fill_cell_text()`, and `insert_cell_image_resolved()`.
330
+
331
+ ## Pattern 9: cellAddr Row Addressing (CRITICAL)
332
+
333
+ Every `<hp:tc>` inside a `<hp:tr>` contains a `<hp:cellAddr>` element with `colAddr` and `rowAddr` attributes. The `rowAddr` MUST equal the **0-based index** of the `<hp:tr>` within its parent `<hp:tbl>`.
334
+
335
+ ### Structure
336
+ ```xml
337
+ <hp:tbl rowCnt="3" colCnt="2">
338
+ <hp:tr> <!-- row index 0 -->
339
+ <hp:tc>
340
+ <hp:cellAddr colAddr="0" rowAddr="0"/> <!-- rowAddr = 0 ✓ -->
341
+ ...
342
+ </hp:tc>
343
+ <hp:tc>
344
+ <hp:cellAddr colAddr="1" rowAddr="0"/> <!-- rowAddr = 0 ✓ -->
345
+ ...
346
+ </hp:tc>
347
+ </hp:tr>
348
+ <hp:tr> <!-- row index 1 -->
349
+ <hp:tc>
350
+ <hp:cellAddr colAddr="0" rowAddr="1"/> <!-- rowAddr = 1 ✓ -->
351
+ ...
352
+ </hp:tc>
353
+ <hp:tc>
354
+ <hp:cellAddr colAddr="1" rowAddr="1"/> <!-- rowAddr = 1 ✓ -->
355
+ ...
356
+ </hp:tc>
357
+ </hp:tr>
358
+ </hp:tbl>
359
+ ```
360
+
361
+ ### Consequence of violation
362
+ If two `<hp:tr>` elements share the same `rowAddr`, Polaris Office **silently hides** the duplicate rows. The table renders with missing data but no error is reported. This is the most common corruption when cloning rows.
363
+
364
+ ### Fix code
365
+ ```python
+ HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"
+
+ def fix_celladdr_rowaddr(tbl):
+     """Fix rowAddr values and rowCnt for an HWPX table after row insertion."""
+     rows = tbl.findall(f"{{{HP}}}tr")
+     for row_idx, tr in enumerate(rows):
+         for tc in tr.findall(f"{{{HP}}}tc"):
+             cell_addr = tc.find(f"{{{HP}}}cellAddr")
+             if cell_addr is not None:
+                 cell_addr.set("rowAddr", str(row_idx))
+     tbl.set("rowCnt", str(len(rows)))
+ ```
378
+
379
+ ### When to apply
380
+ - After cloning a `<hp:tr>` and inserting it into a table
381
+ - After inserting new rows built from `table_content` pipe-delimited data
382
+ - After deleting rows from a table
383
+ - Any time the number or order of `<hp:tr>` children changes
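Because the failure is silent, it may help to check a table before and after applying the fix. The following is a sketch of such a check; `find_rowaddr_mismatches` is a hypothetical helper, not part of the package's validators.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"


def hp_tag(name):
    """Clark-notation tag name for the hp: namespace (hypothetical helper)."""
    return f"{{{HP}}}{name}"


def find_rowaddr_mismatches(tbl):
    """Return (row_index, colAddr, rowAddr) for every cell whose rowAddr is wrong."""
    mismatches = []
    for row_idx, tr in enumerate(tbl.findall(hp_tag("tr"))):
        for tc in tr.findall(hp_tag("tc")):
            addr = tc.find(hp_tag("cellAddr"))
            if addr is not None and addr.get("rowAddr") != str(row_idx):
                mismatches.append((row_idx, addr.get("colAddr"), addr.get("rowAddr")))
    return mismatches
```

An empty list means every `rowAddr` matches its row index. A cloned row that was appended without re-indexing shows up as one entry per cell.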
384
+
385
+ ## Pattern 10: Image Paragraph Center Alignment
386
+
387
+ Image paragraphs in HWPX should be center-aligned using a `paraPrIDRef` that references a center-aligned `<hh:paraPr>` from `header.xml`.
388
+
389
+ ### Finding center-aligned paraPrIDRef
390
+
391
+ ```python
+ def find_center_parapr(header_path):
+     """Find first center-aligned paraPr from header.xml for image paragraphs."""
+     import xml.etree.ElementTree as ET
+     HH = "http://www.hancom.co.kr/hwpml/2011/head"
+     tree = ET.parse(header_path)
+     for pp in tree.getroot().iter(f"{{{HH}}}paraPr"):
+         align = pp.find(f"{{{HH}}}align")
+         if align is not None and align.get("horizontal") == "CENTER":
+             return pp.get("id")
+     return "0"  # fallback to default
+ ```
403
+
404
+ ### Usage in image paragraphs
405
+
406
+ ```xml
407
+ <!-- Image paragraph uses center-aligned paraPrIDRef -->
408
+ <hp:p id="..." paraPrIDRef="{CENTER_PARAPR_ID}" styleIDRef="0" pageBreak="0" columnBreak="0" merged="0">
409
+ <hp:linesegarray>
410
+ <hp:lineseg textpos="0" vertpos="0" vertsize="{H}" textheight="{H}" .../>
411
+ </hp:linesegarray>
412
+ <hp:run charPrIDRef="0">
413
+ <hp:pic id="{seq_id}" ...>...</hp:pic>
414
+ <hp:t/>
415
+ </hp:run>
416
+ </hp:p>
417
+ ```
418
+
419
+ ### Why this matters
420
+
421
+ Without center alignment, images default to left-aligned positioning. Korean government document templates expect centered images, particularly for section content images (~77% page width). The `paraPrIDRef` must reference a `<hh:paraPr>` that has `<hh:align horizontal="CENTER"/>`.
422
+
423
+ ### When to apply
424
+ - ALL image paragraphs in section content (from `image_opportunities`)
425
+ - Cell-level images that should be centered within the cell
426
+ - Both standalone and inline image paragraphs
427
+
428
+ ## Safe HWPX Modification
429
+
430
+ ```python
+ import xml.etree.ElementTree as ET
+
+ ns = {
+     "hp": "http://www.hancom.co.kr/hwpml/2011/paragraph",
+     "hs": "http://www.hancom.co.kr/hwpml/2011/section",
+ }
+
+ # Register namespaces to avoid prefix changes
+ for prefix, uri in ns.items():
+     ET.register_namespace(prefix, uri)
+
+ tree = ET.parse("Contents/section0.xml")
+ root = tree.getroot()
+
+ # Find empty cells adjacent to label cells in tables.
+ # Note: iter() is recursive, so cells of nested tables are visited too.
+ for tbl in root.iter("{%s}tbl" % ns["hp"]):
+     for tr in tbl.iter("{%s}tr" % ns["hp"]):
+         cells = list(tr.iter("{%s}tc" % ns["hp"]))
+         for i, cell in enumerate(cells):
+             # Check if this cell has text (label)
+             texts = [t.text for t in cell.iter("{%s}t" % ns["hp"]) if t.text]
+             if texts and i + 1 < len(cells):
+                 next_cell = cells[i + 1]
+                 next_texts = [t.text for t in next_cell.iter("{%s}t" % ns["hp"]) if t.text]
+                 if not next_texts:
+                     label = "".join(texts)
+                     # This is a fillable field with label
+                     print(f"Found field: {label}")
+
+ # After writing, restore the original root namespace declarations
+ # (ElementTree drops the unused ones).
+ tree.write("Contents/section0.xml", xml_declaration=True, encoding="UTF-8")
+ ```
@@ -0,0 +1,159 @@
1
+ # HWPX XML Structure Reference
2
+
3
+ ## Unpacking an HWPX
4
+
5
+ ```bash
6
+ mkdir -p .dokkit/template_work
7
+ cd .dokkit/template_work
8
+ unzip -o /path/to/template.hwpx
9
+ ```
10
+
11
+ ## Reading Section XML
12
+
13
+ ```python
+ import xml.etree.ElementTree as ET
+
+ # Parse section file
+ tree = ET.parse("Contents/section0.xml")
+ root = tree.getroot()
+
+ # HWPX namespaces
+ ns = {
+     "hp": "http://www.hancom.co.kr/hwpml/2011/paragraph",
+     "hs": "http://www.hancom.co.kr/hwpml/2011/section",
+     "hc": "http://www.hancom.co.kr/hwpml/2011/core",
+     "hh": "http://www.hancom.co.kr/hwpml/2011/head",
+     # Trailing slash kept so the URI matches the xmlns:opf root declaration
+     "opf": "http://www.idpf.org/2007/opf/",
+ }
+
+ # Find all paragraphs
+ for p in root.iter("{http://www.hancom.co.kr/hwpml/2011/paragraph}p"):
+     texts = []
+     for t in p.iter("{http://www.hancom.co.kr/hwpml/2011/paragraph}t"):
+         if t.text:
+             texts.append(t.text)
+     if texts:
+         print("".join(texts))
+ ```
38
+
39
+ ## CRITICAL: Preserving Namespace Declarations
40
+
41
+ Python's `xml.etree.ElementTree` **strips unused namespace declarations** when re-serializing XML. This breaks Hancom/Polaris Office, which requires ALL original namespace declarations on EVERY XML root element, even if no elements use those prefixes.
42
+
43
+ **This applies to ALL HWPX XML files**, not just `section0.xml`:
44
+ - `Contents/section0.xml` — root `<hs:sec>` needs 14+ xmlns
45
+ - `Contents/content.hpf` — root `<opf:package>` needs 14+ xmlns
46
+ - `Contents/header.xml` — root `<hh:head>` needs 14+ xmlns
47
+
48
+ **After any ET-based XML modification**, you MUST restore the original namespace declarations:
49
+
50
+ ```python
+ # After tree.write(), fix the root element:
+ import re
+
+ with open(section_xml_path, 'r', encoding='utf-8') as f:
+     content = f.read()
+
+ # Capture original namespace declarations BEFORE any ET parsing
+ ORIGINAL_ROOT_NS = (
+     'xmlns:ha="http://www.hancom.co.kr/hwpml/2011/app" '
+     'xmlns:hp="http://www.hancom.co.kr/hwpml/2011/paragraph" '
+     'xmlns:hp10="http://www.hancom.co.kr/hwpml/2016/paragraph" '
+     'xmlns:hs="http://www.hancom.co.kr/hwpml/2011/section" '
+     'xmlns:hc="http://www.hancom.co.kr/hwpml/2011/core" '
+     'xmlns:hh="http://www.hancom.co.kr/hwpml/2011/head" '
+     'xmlns:hhs="http://www.hancom.co.kr/hwpml/2011/history" '
+     'xmlns:hm="http://www.hancom.co.kr/hwpml/2011/master-page" '
+     'xmlns:hpf="http://www.hancom.co.kr/schema/2011/hpf" '
+     'xmlns:dc="http://purl.org/dc/elements/1.1/" '
+     'xmlns:ooxmlchart="http://www.hancom.co.kr/hwpml/2016/ooxmlchart" '
+     'xmlns:epub="http://www.idpf.org/2007/ops" '
+     'xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0" '
+     'xmlns:opf="http://www.idpf.org/2007/opf/"'
+ )
+
+ # Replace stripped root with full original declarations
+ content = re.sub(
+     r'<hs:sec\s+xmlns:[^>]+>',
+     f'<hs:sec {ORIGINAL_ROOT_NS}>',
+     content, count=1
+ )
+
+ # Also restore XML declaration to original format
+ content = re.sub(
+     r"<\?xml version='1\.0' encoding='UTF-8'\?>",
+     '<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>',
+     content, count=1
+ )
+
+ with open(section_xml_path, 'w', encoding='utf-8') as f:
+     f.write(content)
+ ```
92
+
93
+ **Also remove newlines** that ET inserts between the XML declaration and root element:
94
+ ```python
95
+ content = content.replace('?>\n<', '?><')
96
+ ```
97
+
98
+ **Best practice**: Before calling `ET.parse()`, save the original root opening tag. After `tree.write()`, replace the new root tag with the saved original. Apply this to EVERY HWPX XML file you modify (section0.xml, content.hpf, header.xml).
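The best practice above can be sketched as a small wrapper that works for any of the three files without hard-coding the namespace list. This is an illustration under simplifying assumptions: `first_root_tag` and `rewrite_preserving_root` are hypothetical names, the regex assumes no `>` inside attribute values and no markup-like text in comments before the root, and only the root tag is restored (the XML declaration and newline fixes shown earlier still apply).

```python
import re
import xml.etree.ElementTree as ET

# First tag that is not a declaration (<?...?>) or a comment/doctype (<!...>).
ROOT_TAG_RE = re.compile(r"<(?![?!])[^>]+>")
XMLNS_RE = re.compile(r'xmlns:([A-Za-z_][\w.-]*)="([^"]*)"')


def first_root_tag(xml_text):
    """Return the opening tag of the root element as raw text."""
    match = ROOT_TAG_RE.search(xml_text)
    if match is None:
        raise ValueError("no root element found")
    return match.group(0)


def rewrite_preserving_root(path, modify):
    """Parse path, apply modify(root), write back, then restore the original root tag."""
    with open(path, encoding="utf-8") as f:
        saved_root = first_root_tag(f.read())

    # Register the original prefixes so ElementTree does not rename them to ns0, ns1, ...
    for prefix, uri in XMLNS_RE.findall(saved_root):
        ET.register_namespace(prefix, uri)

    tree = ET.parse(path)
    modify(tree.getroot())
    tree.write(path, xml_declaration=True, encoding="UTF-8")

    with open(path, encoding="utf-8") as f:
        written = f.read()
    # Swap the re-serialized root tag for the saved original one.
    written = written.replace(first_root_tag(written), saved_root, 1)
    with open(path, "w", encoding="utf-8") as f:
        f.write(written)
```

A call such as `rewrite_preserving_root("Contents/header.xml", my_edit)` would then keep every original `xmlns` declaration, where `my_edit` is any function that mutates the root in place.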
99
+
100
+ ## Repackaging an HWPX
101
+
102
+ CRITICAL: The `mimetype` file must be first and uncompressed.
103
+
104
+ ```python
+ import zipfile
+ import os
+
+ def repackage_hwpx(work_dir, output_path):
+     """Repackage modified XML files into a valid HWPX."""
+     with zipfile.ZipFile(output_path, 'w') as zf:
+         # mimetype MUST be first and uncompressed
+         mimetype_path = os.path.join(work_dir, "mimetype")
+         if os.path.exists(mimetype_path):
+             zf.write(mimetype_path, "mimetype", compress_type=zipfile.ZIP_STORED)
+
+         # Add all other files with compression
+         for root, dirs, files in os.walk(work_dir):
+             for file in files:
+                 if file == "mimetype":
+                     continue
+                 file_path = os.path.join(root, file)
+                 arcname = os.path.relpath(file_path, work_dir)
+                 zf.write(file_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
+ ```
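To confirm that a repackaged archive satisfies the `mimetype` rule, the entry order and compression method can be read back with `zipfile`. This is a small sketch; `check_mimetype_entry` is a hypothetical helper and checks nothing beyond that single rule.

```python
import zipfile


def check_mimetype_entry(hwpx_path):
    """Return a list of problems with the mimetype entry (empty list means OK)."""
    problems = []
    with zipfile.ZipFile(hwpx_path) as zf:
        infos = zf.infolist()  # entries in archive order
        first = infos[0] if infos else None
        if first is None or first.filename != "mimetype":
            problems.append("mimetype is not the first archive entry")
        elif first.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype entry is compressed")
    return problems
```

Running it on the output of a repackaging step gives a quick sanity check before opening the file in an office suite.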
125
+
126
+ ## BinData and Image Handling
127
+
128
+ ### BinData Directory
129
+ The `BinData/` directory (at the archive root) stores embedded binary resources — primarily images. Files are named sequentially: `image1.png`, `image2.jpg`, etc.
130
+
131
+ ### Image Registration — Manifest Only
132
+ Images are registered ONLY in `Contents/content.hpf` via `<opf:item>` elements:
133
+ ```xml
134
+ <opf:item id="image1" href="BinData/image1.png" media-type="image/png" isEmbeded="1"/>
135
+ ```
136
+
137
+ **Critical**: Do NOT add `<hh:binDataItems>` entries to `header.xml` for images. The `content.hpf` manifest is the sole registration point. No entries are needed in `META-INF/manifest.xml` either.
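As a rough sketch of manifest-only registration, the function below appends an item element next to the existing ones in a parsed `content.hpf`. `register_image` is a hypothetical name. To avoid hard-coding the `opf` namespace URI, it reuses the tag of an existing item, which assumes the manifest already lists at least one item.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the {namespace} part from a Clark-notation tag."""
    return tag.rsplit("}", 1)[-1]


def register_image(root, image_id, href, media_type):
    """Append an item entry for an image beside the existing manifest items."""
    # Find the element that directly contains the existing item entries.
    for parent in root.iter():
        items = [child for child in parent if local_name(child.tag) == "item"]
        if items:
            break
    else:
        raise ValueError("no existing item elements found in manifest")

    if any(item.get("id") == image_id for item in items):
        raise ValueError(f"manifest already has an item with id {image_id!r}")

    # Attribute names, including the spelling 'isEmbeded', follow the example above.
    return ET.SubElement(parent, items[0].tag, {
        "id": image_id,
        "href": href,
        "media-type": media_type,
        "isEmbeded": "1",
    })
```

The `image_id` passed here is the value that `binaryItemIDRef` on `<hc:img>` must reference.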
138
+
139
+ ### Image Elements Use `hc:` Namespace
140
+ The `<img>` element inside `<hp:pic>` uses the **core** namespace (`hc:`), not the paragraph namespace (`hp:`):
141
+ ```xml
142
+ <!-- CORRECT -->
143
+ <hc:img binaryItemIDRef="image1" bright="0" contrast="0" effect="REAL_PIC" alpha="0"/>
144
+
145
+ <!-- WRONG — will not render -->
146
+ <hp:img binaryItemIDRef="image1" .../>
147
+ ```
148
+
149
+ See the `dokkit-image-sourcing` skill for the complete `<hp:pic>` element structure with all required children.
150
+
151
+ ## Critical Rules for HWPX Surgery
152
+
153
+ 1. **`mimetype` must be first in ZIP** — stored uncompressed
154
+ 2. **Preserve `hp:rPr` elements** — character formatting
155
+ 3. **Don't modify `hp:cellSpan`** — cell merging must remain intact
156
+ 4. **Keep `hp:cellAddr` — and ensure `rowAddr` = row index** — Each `<hp:tc>` has `<hp:cellAddr colAddr="C" rowAddr="R"/>` where `R` MUST equal the 0-based index of the parent `<hp:tr>` within the `<hp:tbl>`. If two rows share the same `rowAddr`, Polaris Office **silently hides** the duplicate — the table renders with missing data and no error. After any row insertion, deletion, or reordering, re-index ALL `rowAddr` values and update `<hp:tbl rowCnt="N">`.
157
+ 5. **Preserve paragraph properties** — `hp:pPr` controls alignment, spacing
158
+ 6. **Korean font references** — don't change `hangulFont`, `latinFont` attributes
159
+ 7. **Section boundaries** — each section file is independent