devlyn-cli 0.5.2 → 0.5.3
- package/bin/devlyn.js +1 -0
- package/optional-skills/dokkit/ANALYSIS.md +198 -0
- package/optional-skills/dokkit/COMMANDS.md +365 -0
- package/optional-skills/dokkit/DOCX-XML.md +76 -0
- package/optional-skills/dokkit/EXPORT.md +102 -0
- package/optional-skills/dokkit/FILLING.md +377 -0
- package/optional-skills/dokkit/HWPX-XML.md +73 -0
- package/optional-skills/dokkit/IMAGE-SOURCING.md +127 -0
- package/optional-skills/dokkit/INGESTION.md +65 -0
- package/optional-skills/dokkit/SKILL.md +153 -0
- package/optional-skills/dokkit/STATE.md +60 -0
- package/optional-skills/dokkit/references/docx-field-patterns.md +151 -0
- package/optional-skills/dokkit/references/docx-structure.md +58 -0
- package/optional-skills/dokkit/references/field-detection-patterns.md +130 -0
- package/optional-skills/dokkit/references/hwpx-field-patterns.md +461 -0
- package/optional-skills/dokkit/references/hwpx-structure.md +159 -0
- package/optional-skills/dokkit/references/image-opportunity-heuristics.md +121 -0
- package/optional-skills/dokkit/references/image-xml-patterns.md +338 -0
- package/optional-skills/dokkit/references/section-image-interleaving.md +346 -0
- package/optional-skills/dokkit/references/section-range-detection.md +118 -0
- package/optional-skills/dokkit/references/state-schema.md +143 -0
- package/optional-skills/dokkit/references/supported-formats.md +67 -0
- package/optional-skills/dokkit/scripts/compile_hwpx.py +134 -0
- package/optional-skills/dokkit/scripts/detect_fields.py +301 -0
- package/optional-skills/dokkit/scripts/detect_fields_hwpx.py +286 -0
- package/optional-skills/dokkit/scripts/export_pdf.py +99 -0
- package/optional-skills/dokkit/scripts/parse_hwpx.py +185 -0
- package/optional-skills/dokkit/scripts/parse_image_with_gemini.py +159 -0
- package/optional-skills/dokkit/scripts/parse_xlsx.py +98 -0
- package/optional-skills/dokkit/scripts/source_images.py +365 -0
- package/optional-skills/dokkit/scripts/validate_docx.py +142 -0
- package/optional-skills/dokkit/scripts/validate_hwpx.py +281 -0
- package/optional-skills/dokkit/scripts/validate_state.py +132 -0
- package/package.json +1 -1
|
@@ -0,0 +1,461 @@
|
|
|
1
|
+
# HWPX Field Detection Patterns
|
|
2
|
+
|
|
3
|
+
## Pattern 1: Empty Table Cell
|
|
4
|
+
|
|
5
|
+
Korean forms are heavily table-based. The most common pattern:
|
|
6
|
+
|
|
7
|
+
```xml
|
|
8
|
+
<hp:tr>
|
|
9
|
+
<hp:tc>
|
|
10
|
+
<!-- Label cell -->
|
|
11
|
+
<hp:p>
|
|
12
|
+
<hp:run>
|
|
13
|
+
<hp:rPr charPrIDRef="1"/>
|
|
14
|
+
<hp:t>성명</hp:t>
|
|
15
|
+
</hp:run>
|
|
16
|
+
</hp:p>
|
|
17
|
+
</hp:tc>
|
|
18
|
+
<hp:tc>
|
|
19
|
+
<!-- Empty value cell → FILL THIS -->
|
|
20
|
+
<hp:p>
|
|
21
|
+
<hp:lineseg/>
|
|
22
|
+
</hp:p>
|
|
23
|
+
</hp:tc>
|
|
24
|
+
</hp:tr>
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
**Action**: Insert a new `<hp:run>` containing `<hp:t>value</hp:t>` into the empty paragraph. Copy `charPrIDRef` from the label cell's run, unless that run uses a red guide-text style (see Pattern 6).
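A minimal sketch of this action, assuming the run layout shown in the example above (an `<hp:rPr>` child carrying `charPrIDRef`). The helper names `hp_tag` and `fill_empty_paragraph` are illustrative and are not taken from dokkit's scripts.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"


def hp_tag(name):
    """Build a namespace-qualified tag name for the hp: prefix."""
    return f"{{{HP}}}{name}"


def fill_empty_paragraph(p, value, char_pr_id):
    """Append <hp:run><hp:rPr/><hp:t>value</hp:t></hp:run> to paragraph p.

    Hypothetical helper: char_pr_id is the charPrIDRef chosen by the caller,
    for example the one read from the label cell's run.
    """
    run = ET.SubElement(p, hp_tag("run"))
    ET.SubElement(run, hp_tag("rPr"), {"charPrIDRef": str(char_pr_id)})
    t = ET.SubElement(run, hp_tag("t"))
    t.text = value
    return run


# Usage: an empty value-cell paragraph like the one in the example above.
p = ET.fromstring(f'<hp:p xmlns:hp="{HP}"><hp:lineseg/></hp:p>')
fill_empty_paragraph(p, "홍길동", "1")
filled_text = "".join(t.text for t in p.iter(hp_tag("t")))
```

The existing `<hp:lineseg/>` is left in place; whether it needs recomputing after the fill is not covered by this sketch.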
|
|
28
|
+
|
|
29
|
+
## Pattern 2: Placeholder Text in Cell
|
|
30
|
+
|
|
31
|
+
```xml
|
|
32
|
+
<hp:tc>
|
|
33
|
+
<hp:p>
|
|
34
|
+
<hp:run>
|
|
35
|
+
<hp:t>(이름을 입력하세요)</hp:t> <!-- Instruction text -->
|
|
36
|
+
</hp:run>
|
|
37
|
+
</hp:p>
|
|
38
|
+
</hp:tc>
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
**Action**: Replace the text in `<hp:t>` with the actual value.
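One possible way to recognise such placeholder text is a simple pattern match. This is a sketch under assumptions: the regular expression and the keyword list (입력, 기재, 작성) are a hypothetical heuristic, not the rule dokkit's detectors use.

```python
import re

# Hypothetical heuristic: text fully wrapped in parentheses (ASCII or
# full-width) that contains an instruction word such as 입력 (enter),
# 기재 (fill in), or 작성 (write).
PLACEHOLDER_RE = re.compile(r"^\s*[\((].*(입력|기재|작성).*[\))]\s*$")


def is_placeholder(text):
    """Return True if text looks like instruction text rather than a value."""
    return bool(PLACEHOLDER_RE.match(text or ""))


looks_like_placeholder = is_placeholder("(이름을 입력하세요)")
looks_like_value = is_placeholder("홍길동")
```

A real template set would likely need a longer keyword list, so treat the pattern as a starting point only.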
|
|
42
|
+
|
|
43
|
+
## Pattern 3: Multi-Row Spanning Label
|
|
44
|
+
|
|
45
|
+
Korean forms often have a label cell spanning multiple rows:
|
|
46
|
+
|
|
47
|
+
```xml
|
|
48
|
+
<hp:tr>
|
|
49
|
+
<hp:tc>
|
|
50
|
+
<hp:cellSpan rowSpan="3"/>
|
|
51
|
+
<hp:p><hp:run><hp:t>학력</hp:t></hp:run></hp:p>
|
|
52
|
+
</hp:tc>
|
|
53
|
+
<hp:tc><hp:p><hp:run><hp:t>학교명</hp:t></hp:run></hp:p></hp:tc>
|
|
54
|
+
<hp:tc><hp:p/></hp:tc> <!-- Empty → fill with school name -->
|
|
55
|
+
</hp:tr>
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
**Action**: The spanning label ("학력" = Education) is the section. Sub-labels ("학교명" = School Name) identify individual fields.
|
|
59
|
+
|
|
60
|
+
## Pattern 4: Date Fields
|
|
61
|
+
|
|
62
|
+
```xml
|
|
63
|
+
<hp:tc>
|
|
64
|
+
<hp:p>
|
|
65
|
+
<hp:run><hp:t>년</hp:t></hp:run> <!-- Year -->
|
|
66
|
+
</hp:p>
|
|
67
|
+
</hp:tc>
|
|
68
|
+
<hp:tc>
|
|
69
|
+
<hp:p>
|
|
70
|
+
<hp:run><hp:t>월</hp:t></hp:run> <!-- Month -->
|
|
71
|
+
</hp:p>
|
|
72
|
+
</hp:tc>
|
|
73
|
+
<hp:tc>
|
|
74
|
+
<hp:p>
|
|
75
|
+
<hp:run><hp:t>일</hp:t></hp:run> <!-- Day -->
|
|
76
|
+
</hp:p>
|
|
77
|
+
</hp:tc>
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
**Action**: Fill the empty cells that precede the 년/월/일 unit cells with the year, month, and day components, respectively.
|
|
81
|
+
|
|
82
|
+
## Pattern 5: Writing Tip Box (작성 팁)
|
|
83
|
+
|
|
84
|
+
Standalone 1×1 tables with DASH-bordered cells that contain `※` guidance text. These are NOT fillable fields — they must be **deleted** before or during filling.
|
|
85
|
+
|
|
86
|
+
```xml
|
|
87
|
+
<hp:tbl rowCnt="1" colCnt="1">
|
|
88
|
+
<hp:tr>
|
|
89
|
+
<hp:tc borderFillIDRef="16">
|
|
90
|
+
<hp:p>
|
|
91
|
+
<hp:run>
|
|
92
|
+
<hp:rPr charPrIDRef="45"/> <!-- Often RED style -->
|
|
93
|
+
<hp:t>※ 작성 팁: 사업의 목적과 필요성을 구체적으로 작성하세요.</hp:t>
|
|
94
|
+
</hp:run>
|
|
95
|
+
</hp:p>
|
|
96
|
+
<hp:p>
|
|
97
|
+
<hp:run>
|
|
98
|
+
<hp:rPr charPrIDRef="45"/>
|
|
99
|
+
<hp:t>※ 관련 법령이나 정책 근거를 제시하면 좋습니다.</hp:t>
|
|
100
|
+
</hp:run>
|
|
101
|
+
</hp:p>
|
|
102
|
+
</hp:tc>
|
|
103
|
+
</hp:tr>
|
|
104
|
+
</hp:tbl>
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
**Identifying traits**:
|
|
108
|
+
- `rowCnt="1"` and `colCnt="1"` (single-cell table)
|
|
109
|
+
- `borderFillIDRef` resolves to DASH border style in `header.xml`
|
|
110
|
+
- Text starts with `※` or contains `작성 팁`, `작성요령`, `작성 요령`
|
|
111
|
+
- Often appears inside a `<hp:subList>` within another table cell
|
|
112
|
+
|
|
113
|
+
**Two container types**:
|
|
114
|
+
- **Standalone**: Top-level 1×1 table between other content → delete the entire `<hp:tbl>`
|
|
115
|
+
- **Nested**: Inside a `<hp:subList>` within a fill-target cell → delete the `<hp:subList>` element
|
|
116
|
+
|
|
117
|
+
**Action**: Flag as `field_type: "tip_box"`, `action: "delete"`. The filler agent removes these before filling.
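The identifying traits above can be sketched as a small check. The function name `is_tip_box` is hypothetical, and this version tests only the single-cell shape and the text markers; resolving `borderFillIDRef` to a DASH border in `header.xml` is deliberately left out.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"
TIP_MARKERS = ("작성 팁", "작성요령", "작성 요령")


def is_tip_box(tbl):
    """Return True if tbl looks like a writing-tip box.

    Checks the 1x1 shape and the guidance-text markers only.
    """
    if tbl.get("rowCnt") != "1" or tbl.get("colCnt") != "1":
        return False
    text = "".join(t.text or "" for t in tbl.iter(f"{{{HP}}}t")).strip()
    return text.startswith("※") or any(m in text for m in TIP_MARKERS)


# A tip box and an ordinary label/value table for comparison.
tip = ET.fromstring(
    f'<hp:tbl xmlns:hp="{HP}" rowCnt="1" colCnt="1"><hp:tr><hp:tc>'
    "<hp:p><hp:run><hp:t>※ 작성 팁: 구체적으로 작성하세요.</hp:t></hp:run></hp:p>"
    "</hp:tc></hp:tr></hp:tbl>"
)
form = ET.fromstring(
    f'<hp:tbl xmlns:hp="{HP}" rowCnt="1" colCnt="2"><hp:tr>'
    "<hp:tc><hp:p><hp:run><hp:t>성명</hp:t></hp:run></hp:p></hp:tc>"
    "<hp:tc><hp:p/></hp:tc></hp:tr></hp:tbl>"
)
```

A detector built on this could then emit `field_type: "tip_box"` with `action: "delete"` for every table where the check returns `True`.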
|
|
118
|
+
|
|
119
|
+
## Pattern 6: Character Property Resolution (charPrIDRef)
|
|
120
|
+
|
|
121
|
+
HWPX text formatting is controlled by `charPrIDRef` attributes that reference `<hh:charPr>` entries in `header.xml`.
|
|
122
|
+
|
|
123
|
+
### How charPrIDRef works
|
|
124
|
+
```xml
|
|
125
|
+
<!-- In section*.xml — a run references charPr ID 45 -->
|
|
126
|
+
<hp:run>
|
|
127
|
+
<hp:rPr charPrIDRef="45"/>
|
|
128
|
+
<hp:t>Some text</hp:t>
|
|
129
|
+
</hp:run>
|
|
130
|
+
|
|
131
|
+
<!-- In header.xml — charPr ID 45 defines the style -->
|
|
132
|
+
<hh:charPr id="45" height="1000" textColor="#FF0000"
|
|
133
|
+
bold="false" italic="true" spacing="-5"/>
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
### Template guide text uses RED styles
|
|
137
|
+
Many templates use red (#FF0000) charPrIDRef values for guide text, tip boxes, and instructions. Common red IDs seen in Korean government templates: 39, 45, 51, 52, 57, 62, 81.
|
|
138
|
+
|
|
139
|
+
**Critical rule**: When filling a field, NEVER copy `charPrIDRef` from guide/tip text. Instead, find or create a black (#000000) charPr.
|
|
140
|
+
|
|
141
|
+
### Finding a suitable black charPr
|
|
142
|
+
```python
|
|
143
|
+
import xml.etree.ElementTree as ET
|
|
144
|
+
|
|
145
|
+
def find_black_charpr(header_path):
|
|
146
|
+
"""Find a charPrIDRef suitable for filled text (black, normal style)."""
|
|
147
|
+
hns = {"hh": "http://www.hancom.co.kr/hwpml/2011/head"}
|
|
148
|
+
tree = ET.parse(header_path)
|
|
149
|
+
root = tree.getroot()
|
|
150
|
+
|
|
151
|
+
candidates = []
|
|
152
|
+
for cp in root.iter("{%s}charPr" % hns["hh"]):
|
|
153
|
+
color = cp.get("textColor", "#000000")
|
|
154
|
+
bold = cp.get("bold", "false")
|
|
155
|
+
italic = cp.get("italic", "false")
|
|
156
|
+
spacing = int(cp.get("spacing", "0"))
|
|
157
|
+
|
|
158
|
+
# Want: black text, not italic, non-negative spacing
|
|
159
|
+
if color.upper() in ("#000000", "#000000FF", "BLACK") and \
|
|
160
|
+
italic == "false" and spacing >= 0:
|
|
161
|
+
candidates.append({
|
|
162
|
+
"id": cp.get("id"),
|
|
163
|
+
"bold": bold == "true",
|
|
164
|
+
"height": int(cp.get("height", "1000")),
|
|
165
|
+
"spacing": spacing,
|
|
166
|
+
})
|
|
167
|
+
|
|
168
|
+
# Keep only zero-spacing candidates, split into non-bold and bold (height is recorded but not used for ranking)
|
|
169
|
+
normal = [c for c in candidates if not c["bold"] and c["spacing"] == 0]
|
|
170
|
+
bold_list = [c for c in candidates if c["bold"] and c["spacing"] == 0]
|
|
171
|
+
|
|
172
|
+
return {
|
|
173
|
+
"normal": normal[0]["id"] if normal else None,
|
|
174
|
+
"bold": bold_list[0]["id"] if bold_list else None,
|
|
175
|
+
}
|
|
176
|
+
```
|
|
177
|
+
|
|
178
|
+
### Creating a new charPr if needed
|
|
179
|
+
If no suitable black charPr exists in `header.xml`, create one by appending a new `<hh:charPr>` element with the next available ID, `textColor="#000000"`, `bold="false"`, `italic="false"`, `spacing="0"`.
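A rough sketch of that step, under stated assumptions: charPr ids are integers, all `<hh:charPr>` elements share one parent, and the container name `hh:charProperties` in the sample is assumed for illustration. The function name `add_black_charpr` is hypothetical, and the new element carries only the four attributes named above, not the fonts, height, or child elements a real entry may need.

```python
import xml.etree.ElementTree as ET

HH = "http://www.hancom.co.kr/hwpml/2011/head"


def add_black_charpr(root):
    """Append a black, non-bold, non-italic charPr and return its new id.

    Returns None if no element with <hh:charPr> children is found.
    """
    tag = f"{{{HH}}}charPr"
    for parent in root.iter():
        existing = parent.findall(tag)  # direct children only
        if existing:
            next_id = max(int(cp.get("id", "-1")) for cp in existing) + 1
            ET.SubElement(parent, tag, {
                "id": str(next_id),
                "textColor": "#000000",
                "bold": "false",
                "italic": "false",
                "spacing": "0",
            })
            return str(next_id)
    return None


# Sample header with only red styles (container name is an assumption).
header = ET.fromstring(
    f'<hh:head xmlns:hh="{HH}"><hh:charProperties>'
    '<hh:charPr id="0" textColor="#FF0000"/>'
    '<hh:charPr id="1" textColor="#FF0000" italic="true"/>'
    "</hh:charProperties></hh:head>"
)
new_id = add_black_charpr(header)
```

In practice it may be safer to copy an existing charPr and change only its colour and style attributes, so that any attributes this sketch omits are carried over.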
|
|
180
|
+
|
|
181
|
+
## Pattern 7: Image Field in Table Cell
|
|
182
|
+
|
|
183
|
+
A label cell containing image-related keywords (사진, 증명사진, 로고, 서명, 직인, 사업자등록증) next to an empty cell indicates an image insertion point.
|
|
184
|
+
|
|
185
|
+
```xml
|
|
186
|
+
<hp:tr>
|
|
187
|
+
<hp:tc>
|
|
188
|
+
<!-- Label cell with image keyword -->
|
|
189
|
+
<hp:p>
|
|
190
|
+
<hp:run>
|
|
191
|
+
<hp:rPr charPrIDRef="1"/>
|
|
192
|
+
<hp:t>사진</hp:t>
|
|
193
|
+
</hp:run>
|
|
194
|
+
</hp:p>
|
|
195
|
+
</hp:tc>
|
|
196
|
+
<hp:tc>
|
|
197
|
+
<!-- Empty cell → INSERT IMAGE HERE -->
|
|
198
|
+
<hp:p>
|
|
199
|
+
<hp:lineseg/>
|
|
200
|
+
</hp:p>
|
|
201
|
+
</hp:tc>
|
|
202
|
+
</hp:tr>
|
|
203
|
+
```
|
|
204
|
+
|
|
205
|
+
**Action**: Insert a `<hp:pic>` element INSIDE a `<hp:run>` within the cell's `<hp:p>`. The `<hp:t/>` goes AFTER the pic inside the run.
|
|
206
|
+
|
|
207
|
+
### Image Paragraph Structure (CRITICAL)
|
|
208
|
+
|
|
209
|
+
```xml
|
|
210
|
+
<!-- pic must be INSIDE run, t/ AFTER pic (matches real Hancom Office output) -->
|
|
211
|
+
<hp:p id="..." paraPrIDRef="..." styleIDRef="0" pageBreak="0" columnBreak="0" merged="0">
|
|
212
|
+
<hp:linesegarray>
|
|
213
|
+
<hp:lineseg textpos="0" vertpos="0" vertsize="{H}" textheight="{H}"
|
|
214
|
+
baseline="{H*0.85}" spacing="500" .../>
|
|
215
|
+
</hp:linesegarray>
|
|
216
|
+
<hp:run charPrIDRef="0">
|
|
217
|
+
<hp:pic id="{seq_id}" zOrder="{z}" ...>...</hp:pic>
|
|
218
|
+
<hp:t/>
|
|
219
|
+
</hp:run>
|
|
220
|
+
</hp:p>
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
### Complete `<hp:pic>` Structure (Hancom Canonical Order)
|
|
224
|
+
|
|
225
|
+
```xml
|
|
226
|
+
<hp:pic id="{seq_id}" zOrder="{z}" numberingType="PICTURE" textWrap="TOP_AND_BOTTOM"
|
|
227
|
+
textFlow="BOTH_SIDES" lock="0" dropcapstyle="None"
|
|
228
|
+
href="" groupLevel="0" instid="{seq_id}" reverse="0">
|
|
229
|
+
<!-- Group 1: Geometry -->
|
|
230
|
+
<hp:offset x="0" y="0"/>
|
|
231
|
+
<hp:orgSz width="{W}" height="{H}"/>
|
|
232
|
+
<hp:curSz width="{W}" height="{H}"/>
|
|
233
|
+
<hp:flip horizontal="0" vertical="0"/>
|
|
234
|
+
<hp:rotationInfo angle="0" centerX="{W/2}" centerY="{H/2}" rotateimage="1"/>
|
|
235
|
+
<hp:renderingInfo>
|
|
236
|
+
<hc:transMatrix e1="1" e2="0" e3="0" e4="0" e5="1" e6="0"/>
|
|
237
|
+
<hc:scaMatrix e1="1" e2="0" e3="0" e4="0" e5="1" e6="0"/>
|
|
238
|
+
<hc:rotMatrix e1="1" e2="-0" e3="0" e4="0" e5="1" e6="0"/>
|
|
239
|
+
</hp:renderingInfo>
|
|
240
|
+
<!-- Group 2: Image data -->
|
|
241
|
+
<hp:imgRect>
|
|
242
|
+
<hc:pt0 x="0" y="0"/>
|
|
243
|
+
<hc:pt1 x="{W}" y="0"/>
|
|
244
|
+
<hc:pt2 x="{W}" y="{H}"/>
|
|
245
|
+
<hc:pt3 x="0" y="{H}"/>
|
|
246
|
+
</hp:imgRect>
|
|
247
|
+
<hp:imgClip left="0" right="{pixW}" top="0" bottom="{pixH}"/>
|
|
248
|
+
<hp:inMargin left="0" right="0" top="0" bottom="0"/>
|
|
249
|
+
<hp:imgDim dimwidth="{pixW}" dimheight="{pixH}"/>
|
|
250
|
+
<hc:img binaryItemIDRef="{manifest_id}" bright="0" contrast="0" effect="REAL_PIC" alpha="0"/>
|
|
251
|
+
<!-- Group 3: Layout (AFTER hc:img) -->
|
|
252
|
+
<hp:sz width="{W}" widthRelTo="ABSOLUTE" height="{H}" heightRelTo="ABSOLUTE" protect="0"/>
|
|
253
|
+
<hp:pos treatAsChar="1" affectLSpacing="0" flowWithText="0" allowOverlap="0"
|
|
254
|
+
holdAnchorAndSO="0" vertRelTo="PARA" horzRelTo="COLUMN"
|
|
255
|
+
vertAlign="TOP" horzAlign="LEFT" vertOffset="0" horzOffset="0"/>
|
|
256
|
+
<hp:outMargin left="0" right="0" top="0" bottom="0"/>
|
|
257
|
+
</hp:pic>
|
|
258
|
+
```
|
|
259
|
+
|
|
260
|
+
Where: `{W}/{H}` = HWPML units (1/7200 inch), `{pixW}/{pixH}` = pixel dimensions from PIL, `{manifest_id}` = `id` from `content.hpf`.
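As a worked example of the unit conversion, the sketch below turns pixel dimensions into HWPML units. The 96 DPI default and both function names are assumptions made for illustration; the source does not state which DPI dokkit uses.

```python
HWPML_UNITS_PER_INCH = 7200


def pixels_to_hwpml(pixels, dpi=96):
    """Convert a pixel length to HWPML units (1/7200 inch) at the given DPI."""
    return round(pixels * HWPML_UNITS_PER_INCH / dpi)


def fit_to_width(pix_w, pix_h, target_w):
    """Scale to a target width in HWPML units, keeping the aspect ratio."""
    return target_w, round(target_w * pix_h / pix_w)


# 800 x 600 px at the assumed 96 DPI: 800 / 96 inch * 7200 units per inch.
w = pixels_to_hwpml(800)
h = pixels_to_hwpml(600)
scaled = fit_to_width(800, 600, 30000)
```

`{W}` and `{H}` take the converted or scaled values, while `{pixW}` and `{pixH}` stay in raw pixels for `imgClip` and `imgDim`.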
|
|
261
|
+
|
|
262
|
+
### 9 Critical Rules for `<hp:pic>`
|
|
263
|
+
|
|
264
|
+
1. **`<img>` uses `hc:` namespace** — `<hc:img>`, NOT `<hp:img>`
|
|
265
|
+
2. **`<imgRect>` has 4 `<hc:pt>` children** — `<hc:pt0>` through `<hc:pt3>`, NOT inline attributes
|
|
266
|
+
3. **All required children present** — `offset`, `orgSz`, `curSz`, `flip`, `rotationInfo`, `renderingInfo`, `inMargin`
|
|
267
|
+
4. **No spurious elements** — Do NOT add `hp:lineShape`, `hp:caption`, `hp:shapeComment`
|
|
268
|
+
5. **`imgClip` right/bottom = pixel dims** — from `imgDim`, NOT zeros
|
|
269
|
+
6. **Hancom canonical element order** — offset, orgSz, ..., hc:img, **then** sz, pos, outMargin
|
|
270
|
+
7. **Register in `content.hpf` manifest only** — Do NOT add `<hh:binDataItems>` to `header.xml`
|
|
271
|
+
8. **`hp:pos` attributes** — `flowWithText="0"` `horzRelTo="COLUMN"`
|
|
272
|
+
9. **pic INSIDE run, t AFTER pic** — `<hp:run><hp:pic>...</hp:pic><hp:t/></hp:run>`
|
|
273
|
+
|
|
274
|
+
## Pattern 8: SubList Cell Wrapping (CRITICAL)
|
|
275
|
+
|
|
276
|
+
In Korean government HWPX templates, ~65% of table cells wrap their content in `<hp:subList>/<hp:p>` rather than having `<hp:p>` as a direct child of `<hp:tc>`. Hancom Office reads content from inside `<hp:subList>` and ignores orphaned direct `<hp:p>` elements.
|
|
277
|
+
|
|
278
|
+
### Two cell structures
|
|
279
|
+
|
|
280
|
+
**Direct pattern** (~35% of cells):
|
|
281
|
+
```xml
|
|
282
|
+
<hp:tc>
|
|
283
|
+
<hp:cellAddr .../>
|
|
284
|
+
<hp:cellSpan .../>
|
|
285
|
+
<hp:cellSz .../>
|
|
286
|
+
<hp:p>
|
|
287
|
+
<hp:run><hp:t>Content here</hp:t></hp:run>
|
|
288
|
+
</hp:p>
|
|
289
|
+
</hp:tc>
|
|
290
|
+
```
|
|
291
|
+
|
|
292
|
+
**SubList pattern** (~65% of cells):
|
|
293
|
+
```xml
|
|
294
|
+
<hp:tc>
|
|
295
|
+
<hp:cellAddr .../>
|
|
296
|
+
<hp:cellSpan .../>
|
|
297
|
+
<hp:cellSz .../>
|
|
298
|
+
<hp:subList>
|
|
299
|
+
<hp:p>
|
|
300
|
+
<hp:run><hp:t>Content here</hp:t></hp:run>
|
|
301
|
+
</hp:p>
|
|
302
|
+
</hp:subList>
|
|
303
|
+
</hp:tc>
|
|
304
|
+
```
|
|
305
|
+
|
|
306
|
+
### Critical rule for filling
|
|
307
|
+
|
|
308
|
+
When writing content into a cell, ALWAYS check for `<hp:subList>` first:
|
|
309
|
+
1. If `<hp:subList>` exists: write into `<hp:subList>/<hp:p>`, NOT as a direct `<hp:p>` child of `<hp:tc>`
|
|
310
|
+
2. If no `<hp:subList>`: write as direct `<hp:p>` child of `<hp:tc>` (standard pattern)
|
|
311
|
+
|
|
312
|
+
**Wrong** — creates orphaned paragraphs that Hancom ignores:
|
|
313
|
+
```python
|
|
314
|
+
# BAD: always writes to cell directly
|
|
315
|
+
p = ET.SubElement(cell, hp_tag("p"))
|
|
316
|
+
```
|
|
317
|
+
|
|
318
|
+
**Correct** — respects subList wrapper:
|
|
319
|
+
```python
|
|
320
|
+
# GOOD: check for subList first
|
|
321
|
+
container = cell
|
|
322
|
+
for c in cell:
|
|
323
|
+
if c.tag == hp_tag("subList"):
|
|
324
|
+
container = c
|
|
325
|
+
break
|
|
326
|
+
p = ET.SubElement(container, hp_tag("p"))
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
This applies to ALL cell operations: `clear_cell_content()`, `fill_cell_text()`, and `insert_cell_image_resolved()`.
|
|
330
|
+
|
|
331
|
+
## Pattern 9: cellAddr Row Addressing (CRITICAL)
|
|
332
|
+
|
|
333
|
+
Every `<hp:tc>` inside a `<hp:tr>` contains a `<hp:cellAddr>` element with `colAddr` and `rowAddr` attributes. The `rowAddr` MUST equal the **0-based index** of the `<hp:tr>` within its parent `<hp:tbl>`.
|
|
334
|
+
|
|
335
|
+
### Structure
|
|
336
|
+
```xml
|
|
337
|
+
<hp:tbl rowCnt="2" colCnt="2">
|
|
338
|
+
<hp:tr> <!-- row index 0 -->
|
|
339
|
+
<hp:tc>
|
|
340
|
+
<hp:cellAddr colAddr="0" rowAddr="0"/> <!-- rowAddr = 0 ✓ -->
|
|
341
|
+
...
|
|
342
|
+
</hp:tc>
|
|
343
|
+
<hp:tc>
|
|
344
|
+
<hp:cellAddr colAddr="1" rowAddr="0"/> <!-- rowAddr = 0 ✓ -->
|
|
345
|
+
...
|
|
346
|
+
</hp:tc>
|
|
347
|
+
</hp:tr>
|
|
348
|
+
<hp:tr> <!-- row index 1 -->
|
|
349
|
+
<hp:tc>
|
|
350
|
+
<hp:cellAddr colAddr="0" rowAddr="1"/> <!-- rowAddr = 1 ✓ -->
|
|
351
|
+
...
|
|
352
|
+
</hp:tc>
|
|
353
|
+
<hp:tc>
|
|
354
|
+
<hp:cellAddr colAddr="1" rowAddr="1"/> <!-- rowAddr = 1 ✓ -->
|
|
355
|
+
...
|
|
356
|
+
</hp:tc>
|
|
357
|
+
</hp:tr>
|
|
358
|
+
</hp:tbl>
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
### Consequence of violation
|
|
362
|
+
If two `<hp:tr>` elements share the same `rowAddr`, Polaris Office **silently hides** the duplicate rows. The table renders with missing data but no error is reported. This is the most common corruption when cloning rows.
|
|
363
|
+
|
|
364
|
+
### Fix code
|
|
365
|
+
```python
|
|
366
|
+
HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"
|
|
367
|
+
|
|
368
|
+
def fix_celladdr_rowaddr(tbl):
|
|
369
|
+
"""Fix rowAddr values and rowCnt for an HWPX table after row insertion."""
|
|
370
|
+
rows = tbl.findall(f"{{{HP}}}tr")
|
|
371
|
+
for row_idx, tr in enumerate(rows):
|
|
372
|
+
for tc in tr.findall(f"{{{HP}}}tc"):
|
|
373
|
+
cell_addr = tc.find(f"{{{HP}}}cellAddr")
|
|
374
|
+
if cell_addr is not None:
|
|
375
|
+
cell_addr.set("rowAddr", str(row_idx))
|
|
376
|
+
tbl.set("rowCnt", str(len(rows)))
|
|
377
|
+
```
|
|
378
|
+
|
|
379
|
+
### When to apply
|
|
380
|
+
- After cloning a `<hp:tr>` and inserting it into a table
|
|
381
|
+
- After inserting new rows built from `table_content` pipe-delimited data
|
|
382
|
+
- After deleting rows from a table
|
|
383
|
+
- Any time the number or order of `<hp:tr>` children changes
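Before re-indexing, it can help to check whether a table is actually affected. The following is a sketch of a read-only validator; `find_rowaddr_mismatches` is a hypothetical name, not a function from dokkit's scripts.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"


def find_rowaddr_mismatches(tbl):
    """Return (row_index, rowAddr) pairs where rowAddr differs from the row index."""
    mismatches = []
    for row_idx, tr in enumerate(tbl.findall(f"{{{HP}}}tr")):
        for tc in tr.findall(f"{{{HP}}}tc"):
            addr = tc.find(f"{{{HP}}}cellAddr")
            if addr is not None and addr.get("rowAddr") != str(row_idx):
                mismatches.append((row_idx, addr.get("rowAddr")))
    return mismatches


# A cloned second row that still carries rowAddr="0".
tbl = ET.fromstring(
    f'<hp:tbl xmlns:hp="{HP}" rowCnt="2" colCnt="1">'
    '<hp:tr><hp:tc><hp:cellAddr colAddr="0" rowAddr="0"/></hp:tc></hp:tr>'
    '<hp:tr><hp:tc><hp:cellAddr colAddr="0" rowAddr="0"/></hp:tc></hp:tr>'
    "</hp:tbl>"
)
problems = find_rowaddr_mismatches(tbl)
```

An empty result means the table already satisfies the rule; a non-empty one is a signal to run the fix code above.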
|
|
384
|
+
|
|
385
|
+
## Pattern 10: Image Paragraph Center Alignment
|
|
386
|
+
|
|
387
|
+
Image paragraphs in HWPX should be center-aligned using a `paraPrIDRef` that references a center-aligned `<hh:paraPr>` from `header.xml`.
|
|
388
|
+
|
|
389
|
+
### Finding center-aligned paraPrIDRef
|
|
390
|
+
|
|
391
|
+
```python
|
|
392
|
+
def find_center_parapr(header_path):
|
|
393
|
+
"""Find first center-aligned paraPr from header.xml for image paragraphs."""
|
|
394
|
+
import xml.etree.ElementTree as ET
|
|
395
|
+
HH = "http://www.hancom.co.kr/hwpml/2011/head"
|
|
396
|
+
tree = ET.parse(header_path)
|
|
397
|
+
for pp in tree.getroot().iter(f"{{{HH}}}paraPr"):
|
|
398
|
+
align = pp.find(f"{{{HH}}}align")
|
|
399
|
+
if align is not None and align.get("horizontal") == "CENTER":
|
|
400
|
+
return pp.get("id")
|
|
401
|
+
return "0" # fallback to default
|
|
402
|
+
```
|
|
403
|
+
|
|
404
|
+
### Usage in image paragraphs
|
|
405
|
+
|
|
406
|
+
```xml
|
|
407
|
+
<!-- Image paragraph uses center-aligned paraPrIDRef -->
|
|
408
|
+
<hp:p id="..." paraPrIDRef="{CENTER_PARAPR_ID}" styleIDRef="0" pageBreak="0" columnBreak="0" merged="0">
|
|
409
|
+
<hp:linesegarray>
|
|
410
|
+
<hp:lineseg textpos="0" vertpos="0" vertsize="{H}" textheight="{H}" .../>
|
|
411
|
+
</hp:linesegarray>
|
|
412
|
+
<hp:run charPrIDRef="0">
|
|
413
|
+
<hp:pic id="{seq_id}" ...>...</hp:pic>
|
|
414
|
+
<hp:t/>
|
|
415
|
+
</hp:run>
|
|
416
|
+
</hp:p>
|
|
417
|
+
```
|
|
418
|
+
|
|
419
|
+
### Why this matters
|
|
420
|
+
|
|
421
|
+
Without center alignment, images default to left-aligned positioning. Korean government document templates expect centered images, particularly for section content images (~77% page width). The `paraPrIDRef` must reference a `<hh:paraPr>` that has `<hh:align horizontal="CENTER"/>`.
|
|
422
|
+
|
|
423
|
+
### When to apply
|
|
424
|
+
- ALL image paragraphs in section content (from `image_opportunities`)
|
|
425
|
+
- Cell-level images that should be centered within the cell
|
|
426
|
+
- Both standalone and inline image paragraphs
|
|
427
|
+
|
|
428
|
+
## Safe HWPX Modification
|
|
429
|
+
|
|
430
|
+
```python
|
|
431
|
+
import xml.etree.ElementTree as ET
|
|
432
|
+
|
|
433
|
+
ns = {
|
|
434
|
+
"hp": "http://www.hancom.co.kr/hwpml/2011/paragraph",
|
|
435
|
+
"hs": "http://www.hancom.co.kr/hwpml/2011/section",
|
|
436
|
+
}
|
|
437
|
+
|
|
438
|
+
# Register namespaces to avoid prefix changes
|
|
439
|
+
for prefix, uri in ns.items():
|
|
440
|
+
ET.register_namespace(prefix, uri)
|
|
441
|
+
|
|
442
|
+
tree = ET.parse("Contents/section0.xml")
|
|
443
|
+
root = tree.getroot()
|
|
444
|
+
|
|
445
|
+
# Find empty cells adjacent to label cells in tables
|
|
446
|
+
for tbl in root.iter("{%s}tbl" % ns["hp"]):
|
|
447
|
+
for tr in tbl.findall("{%s}tr" % ns["hp"]):
|
|
448
|
+
cells = tr.findall("{%s}tc" % ns["hp"])
|
|
449
|
+
for i, cell in enumerate(cells):
|
|
450
|
+
# Check if this cell has text (label)
|
|
451
|
+
texts = [t.text for t in cell.iter("{%s}t" % ns["hp"]) if t.text]
|
|
452
|
+
if texts and i + 1 < len(cells):
|
|
453
|
+
next_cell = cells[i + 1]
|
|
454
|
+
next_texts = [t.text for t in next_cell.iter("{%s}t" % ns["hp"]) if t.text]
|
|
455
|
+
if not next_texts:
|
|
456
|
+
label = "".join(texts)
|
|
457
|
+
# This is a fillable field with label
|
|
458
|
+
print(f"Found field: {label}")
|
|
459
|
+
|
|
460
|
+
tree.write("Contents/section0.xml", xml_declaration=True, encoding="UTF-8")
|
|
461
|
+
```
|
|
@@ -0,0 +1,159 @@
|
|
|
1
|
+
# HWPX XML Structure Reference
|
|
2
|
+
|
|
3
|
+
## Unpacking an HWPX
|
|
4
|
+
|
|
5
|
+
```bash
|
|
6
|
+
mkdir -p .dokkit/template_work
|
|
7
|
+
cd .dokkit/template_work
|
|
8
|
+
unzip -o /path/to/template.hwpx
|
|
9
|
+
```
|
|
10
|
+
|
|
11
|
+
## Reading Section XML
|
|
12
|
+
|
|
13
|
+
```python
|
|
14
|
+
import xml.etree.ElementTree as ET
|
|
15
|
+
|
|
16
|
+
# Parse section file
|
|
17
|
+
tree = ET.parse("Contents/section0.xml")
|
|
18
|
+
root = tree.getroot()
|
|
19
|
+
|
|
20
|
+
# HWPX namespaces
|
|
21
|
+
ns = {
|
|
22
|
+
"hp": "http://www.hancom.co.kr/hwpml/2011/paragraph",
|
|
23
|
+
"hs": "http://www.hancom.co.kr/hwpml/2011/section",
|
|
24
|
+
"hc": "http://www.hancom.co.kr/hwpml/2011/core",
|
|
25
|
+
"hh": "http://www.hancom.co.kr/hwpml/2011/head",
|
|
26
|
+
"opf": "http://www.idpf.org/2007/opf",
|
|
27
|
+
}
|
|
28
|
+
|
|
29
|
+
# Find all paragraphs
|
|
30
|
+
for p in root.iter("{http://www.hancom.co.kr/hwpml/2011/paragraph}p"):
|
|
31
|
+
texts = []
|
|
32
|
+
for t in p.iter("{http://www.hancom.co.kr/hwpml/2011/paragraph}t"):
|
|
33
|
+
if t.text:
|
|
34
|
+
texts.append(t.text)
|
|
35
|
+
if texts:
|
|
36
|
+
print("".join(texts))
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
## CRITICAL: Preserving Namespace Declarations
|
|
40
|
+
|
|
41
|
+
Python's `xml.etree.ElementTree` **strips unused namespace declarations** when re-serializing XML. This breaks Hancom/Polaris Office, which requires ALL original namespace declarations on EVERY XML root element, even if no elements use those prefixes.
|
|
42
|
+
|
|
43
|
+
**This applies to ALL HWPX XML files**, not just `section0.xml`:
|
|
44
|
+
- `Contents/section0.xml` — root `<hs:sec>` needs 14+ xmlns
|
|
45
|
+
- `Contents/content.hpf` — root `<opf:package>` needs 14+ xmlns
|
|
46
|
+
- `Contents/header.xml` — root `<hh:head>` needs 14+ xmlns
|
|
47
|
+
|
|
48
|
+
**After any ET-based XML modification**, you MUST restore the original namespace declarations:
|
|
49
|
+
|
|
50
|
+
```python
|
|
51
|
+
# After tree.write(), fix the root element:
|
|
52
|
+
import re
|
|
53
|
+
|
|
54
|
+
with open(section_xml_path, 'r', encoding='utf-8') as f:
|
|
55
|
+
content = f.read()
|
|
56
|
+
|
|
57
|
+
# Original namespace declarations, as captured from the file BEFORE any ET parsing
|
|
58
|
+
ORIGINAL_ROOT_NS = (
|
|
59
|
+
'xmlns:ha="http://www.hancom.co.kr/hwpml/2011/app" '
|
|
60
|
+
'xmlns:hp="http://www.hancom.co.kr/hwpml/2011/paragraph" '
|
|
61
|
+
'xmlns:hp10="http://www.hancom.co.kr/hwpml/2016/paragraph" '
|
|
62
|
+
'xmlns:hs="http://www.hancom.co.kr/hwpml/2011/section" '
|
|
63
|
+
'xmlns:hc="http://www.hancom.co.kr/hwpml/2011/core" '
|
|
64
|
+
'xmlns:hh="http://www.hancom.co.kr/hwpml/2011/head" '
|
|
65
|
+
'xmlns:hhs="http://www.hancom.co.kr/hwpml/2011/history" '
|
|
66
|
+
'xmlns:hm="http://www.hancom.co.kr/hwpml/2011/master-page" '
|
|
67
|
+
'xmlns:hpf="http://www.hancom.co.kr/schema/2011/hpf" '
|
|
68
|
+
'xmlns:dc="http://purl.org/dc/elements/1.1/" '
|
|
69
|
+
'xmlns:ooxmlchart="http://www.hancom.co.kr/hwpml/2016/ooxmlchart" '
|
|
70
|
+
'xmlns:epub="http://www.idpf.org/2007/ops" '
|
|
71
|
+
'xmlns:config="urn:oasis:names:tc:opendocument:xmlns:config:1.0" '
|
|
72
|
+
'xmlns:opf="http://www.idpf.org/2007/opf/"'
|
|
73
|
+
)
|
|
74
|
+
|
|
75
|
+
# Replace stripped root with full original declarations
|
|
76
|
+
content = re.sub(
|
|
77
|
+
r'<hs:sec\s+xmlns:[^>]+>',
|
|
78
|
+
f'<hs:sec {ORIGINAL_ROOT_NS}>',
|
|
79
|
+
content, count=1
|
|
80
|
+
)
|
|
81
|
+
|
|
82
|
+
# Also restore XML declaration to original format
|
|
83
|
+
content = re.sub(
|
|
84
|
+
r"<\?xml version='1\.0' encoding='UTF-8'\?>",
|
|
85
|
+
'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>',
|
|
86
|
+
content, count=1
|
|
87
|
+
)
|
|
88
|
+
|
|
89
|
+
with open(section_xml_path, 'w', encoding='utf-8') as f:
|
|
90
|
+
f.write(content)
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
**Also remove newlines** that ET inserts between the XML declaration and root element:
|
|
94
|
+
```python
|
|
95
|
+
content = content.replace('?>\n<', '?><')
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
**Best practice**: Before calling `ET.parse()`, save the original root opening tag. After `tree.write()`, replace the new root tag with the saved original. Apply this to EVERY HWPX XML file you modify (section0.xml, content.hpf, header.xml).
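The save-and-restore approach can be sketched with two small string helpers. Both function names are hypothetical, and the regular expression assumes that no attribute value in the root tag contains a literal `>`.

```python
import re

# Matches the first opening tag that is not a declaration (<?...?>) or a
# comment/doctype (<!...>), i.e. the root element's start tag.
ROOT_TAG_RE = re.compile(r"<[^?!][^>]*>")


def capture_root_tag(xml_text):
    """Return the root element's opening tag exactly as written in xml_text."""
    match = ROOT_TAG_RE.search(xml_text)
    return match.group(0) if match else None


def restore_root_tag(xml_text, saved_tag):
    """Replace the first opening tag in xml_text with saved_tag."""
    # A callable replacement avoids backslash interpretation in saved_tag.
    return ROOT_TAG_RE.sub(lambda _m: saved_tag, xml_text, count=1)


# Toy input: the "rewritten" text has lost the unused xmlns:hp declaration.
original = '<?xml version="1.0"?><hs:sec xmlns:hs="urn:a" xmlns:hp="urn:b"><hs:x/></hs:sec>'
rewritten = "<?xml version='1.0' encoding='UTF-8'?>\n<hs:sec xmlns:hs=\"urn:a\"><hs:x /></hs:sec>"
saved = capture_root_tag(original)
restored = restore_root_tag(rewritten, saved)
```

Call `capture_root_tag` on the raw file text before `ET.parse()`, and `restore_root_tag` on the text read back after `tree.write()`. Unlike a hard-coded namespace list, this keeps whatever declarations each file originally had.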
|
|
99
|
+
|
|
100
|
+
## Repackaging an HWPX
|
|
101
|
+
|
|
102
|
+
CRITICAL: The `mimetype` file must be first and uncompressed.
|
|
103
|
+
|
|
104
|
+
```python
|
|
105
|
+
import zipfile
|
|
106
|
+
import os
|
|
107
|
+
|
|
108
|
+
def repackage_hwpx(work_dir, output_path):
|
|
109
|
+
"""Repackage modified XML files into a valid HWPX."""
|
|
110
|
+
with zipfile.ZipFile(output_path, 'w') as zf:
|
|
111
|
+
# mimetype MUST be first and uncompressed
|
|
112
|
+
mimetype_path = os.path.join(work_dir, "mimetype")
|
|
113
|
+
if os.path.exists(mimetype_path):
|
|
114
|
+
zf.write(mimetype_path, "mimetype", compress_type=zipfile.ZIP_STORED)
|
|
115
|
+
|
|
116
|
+
# Add all other files with compression
|
|
117
|
+
for root, dirs, files in os.walk(work_dir):
|
|
118
|
+
for file in files:
|
|
119
|
+
if file == "mimetype":
|
|
120
|
+
continue
|
|
121
|
+
file_path = os.path.join(root, file)
|
|
122
|
+
arcname = os.path.relpath(file_path, work_dir)
|
|
123
|
+
zf.write(file_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
|
|
124
|
+
```
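After repackaging, the ordering rule can be verified by reading the archive back. This is a sketch: the function name is hypothetical, and the mimetype string in the demo archive is only illustrative, since the check does not look at the file's content.

```python
import io
import zipfile


def mimetype_is_first_and_stored(hwpx):
    """Check that the first ZIP entry is 'mimetype' and is stored uncompressed.

    hwpx may be a path or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(hwpx) as zf:
        entries = zf.infolist()
        if not entries:
            return False
        first = entries[0]
        return first.filename == "mimetype" and first.compress_type == zipfile.ZIP_STORED


# Build a tiny in-memory archive with the same entry order as repackage_hwpx().
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/hwp+zip", compress_type=zipfile.ZIP_STORED)
    zf.writestr("Contents/section0.xml", "<x/>", compress_type=zipfile.ZIP_DEFLATED)
buf.seek(0)
ok = mimetype_is_first_and_stored(buf)
```

Running this check on the output of `repackage_hwpx()` would also catch the case where `mimetype` was missing from the work directory and silently skipped.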
|
|
125
|
+
|
|
126
|
+
## BinData and Image Handling
|
|
127
|
+
|
|
128
|
+
### BinData Directory
|
|
129
|
+
The `BinData/` directory (at the archive root) stores embedded binary resources — primarily images. Files are named sequentially: `image1.png`, `image2.jpg`, etc.
|
|
130
|
+
|
|
131
|
+
### Image Registration — Manifest Only
|
|
132
|
+
Images are registered ONLY in `Contents/content.hpf` via `<opf:item>` elements:
|
|
133
|
+
```xml
|
|
134
|
+
<opf:item id="image1" href="BinData/image1.png" media-type="image/png" isEmbeded="1"/>
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
**Critical**: Do NOT add `<hh:binDataItems>` entries to `header.xml` for images. The `content.hpf` manifest is the sole registration point. No entries are needed in `META-INF/manifest.xml` either.
|
|
138
|
+
|
|
139
|
+
### Image Elements Use `hc:` Namespace
|
|
140
|
+
The `<img>` element inside `<hp:pic>` uses the **core** namespace (`hc:`), not the paragraph namespace (`hp:`):
|
|
141
|
+
```xml
|
|
142
|
+
<!-- CORRECT -->
|
|
143
|
+
<hc:img binaryItemIDRef="image1" bright="0" contrast="0" effect="REAL_PIC" alpha="0"/>
|
|
144
|
+
|
|
145
|
+
<!-- WRONG — will not render -->
|
|
146
|
+
<hp:img binaryItemIDRef="image1" .../>
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
See the `dokkit-image-sourcing` skill for the complete `<hp:pic>` element structure with all required children.
|
|
150
|
+
|
|
151
|
+
## Critical Rules for HWPX Surgery
|
|
152
|
+
|
|
153
|
+
1. **`mimetype` must be first in ZIP** — stored uncompressed
|
|
154
|
+
2. **Preserve `hp:rPr` elements** — character formatting
|
|
155
|
+
3. **Don't modify `hp:cellSpan`** — cell merging must remain intact
|
|
156
|
+
4. **Keep `hp:cellAddr` — and ensure `rowAddr` = row index** — Each `<hp:tc>` has `<hp:cellAddr colAddr="C" rowAddr="R"/>` where `R` MUST equal the 0-based index of the parent `<hp:tr>` within the `<hp:tbl>`. If two rows share the same `rowAddr`, Polaris Office **silently hides** the duplicate — the table renders with missing data and no error. After any row insertion, deletion, or reordering, re-index ALL `rowAddr` values and update `<hp:tbl rowCnt="N">`.
|
|
157
|
+
5. **Preserve paragraph properties** — `hp:pPr` controls alignment, spacing
|
|
158
|
+
6. **Korean font references** — don't change `hangulFont`, `latinFont` attributes
|
|
159
|
+
7. **Section boundaries** — each section file is independent
|