kordoc 1.6.0 → 1.7.0

package/README.md CHANGED
@@ -14,13 +14,35 @@
 
  ---
 
- ## What's New in v1.6.0
+ ## What's New in v1.7.0
+
+ - **Image Extraction (HWP/HWPX)** — Binary image extraction from ZIP entries and HWP5 BinData streams. Rendered as `![image](...)` in markdown output.
+ - **Partial Parsing (Graceful Degradation)** — Single page failures no longer abort the whole document. Failed pages emit `PARTIAL_PARSE` warnings and parsing continues.
+ - **Progress Callbacks** — `onProgress` callback in `ParseOptions`. CLI shows `[3/15 pages]` progress. Batch mode shows `[2/10 files]`.
+ - **File Path Input** — `parse("path/to/file.hwp")` string overload. Auto-reads file, detects format, returns result.
+ - **PDF Header/Footer Filtering** — `removeHeaderFooter: true` option removes repeated text at page edges. Removed elements recorded in `ParseWarning`.
+ - **Security Hardening** — ZIP bomb cumulative-size tracking across all file types, SSRF prevention on webhook URLs, XSS-safe hyperlink rendering (javascript: URLs stripped), null-byte path traversal detection, Levenshtein length guard (O(m×n) DoS prevention), 30s PDF load timeout.
+ - **Bug Fixes** — HWPX generator separator logic, XML recursion depth limit (MAX_XML_DEPTH=200), PDF table row merge protection, CLI `--format` validation, variable shadowing in PDF parser.
+ - **UX Improvements** — KV table false-positive reduction (time/URL/number patterns excluded), MCP `parse_metadata` uses 50MB limit with header-only format detection, Watch debounce increased to 1000ms with stable-size check.
+
+ <details>
+ <summary>v1.6.1 fixes</summary>
+
+ - **HWP5 Table Cell Offset Fix** — Fixed critical 2-byte offset misalignment in LIST_HEADER parsing. Row address was incorrectly read as colSpan, causing 3-column tables to explode into 6+ columns with misaligned content. Tables now use colAddr/rowAddr-based direct placement for accurate cell positioning.
+ - **HWP5 TAB Control Character Fix** — TAB (0x0009) inline control's 14-byte extension data was not skipped, producing garbage characters (`࣐Ā`) after every tab in the output. Fixed by adding the required 14-byte skip.
+
+ </details>
+
+ <details>
+ <summary>v1.6.0 features</summary>
 
  - **Cluster-Based Table Detection (PDF)** — Detects borderless tables by analyzing text alignment patterns. Baseline grouping + X-coordinate clustering identifies 2+ column tables that line-based detection misses. Sort-and-split clustering for order-independent results.
  - **Korean Special Table Detection** — Automatically detects `구분/항목/종류`-style key-value patterns common in Korean government documents and converts them to structured 2-column tables.
  - **Korean Word-Break Recovery** — Improved merging of broken Korean words in PDF table cells. Handles character-level PDF rendering (micro-gaps between Hangul characters) and cell line-break artifacts up to 8 characters.
  - **Empty Table Filtering** — Tables with all-empty cells (from line detection of decorative borders) are now automatically removed.
 
+ </details>
+
  <details>
  <summary>v1.5.0 features</summary>
 
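
Reviewer note: the v1.6.1 entry in the hunk above says tables now use colAddr/rowAddr-based direct placement. A minimal TypeScript sketch of that idea follows. The `Cell` shape and the `placeCells` function are hypothetical names chosen for illustration; the diff does not show kordoc's actual types or implementation.

```typescript
// Hypothetical cell record; field names are assumptions, not kordoc's types.
interface Cell {
  colAddr: number; // zero-based column of the cell's top-left corner
  rowAddr: number; // zero-based row of the cell's top-left corner
  colSpan: number;
  rowSpan: number;
  text: string;
}

// Place every cell at its declared address instead of trusting stream order.
// Slots covered only by a span are left as empty strings.
function placeCells(cells: Cell[]): string[][] {
  let rows = 0;
  let cols = 0;
  for (const c of cells) {
    rows = Math.max(rows, c.rowAddr + Math.max(1, c.rowSpan));
    cols = Math.max(cols, c.colAddr + Math.max(1, c.colSpan));
  }
  const grid: string[][] = Array.from({ length: rows }, () =>
    new Array<string>(cols).fill(""),
  );
  for (const c of cells) {
    const row = grid[c.rowAddr];
    if (row) row[c.colAddr] = c.text;
  }
  return grid;
}

// Cells deliberately listed out of order: placement depends only on addresses.
const grid = placeCells([
  { colAddr: 2, rowAddr: 0, colSpan: 1, rowSpan: 1, text: "C" },
  { colAddr: 0, rowAddr: 0, colSpan: 2, rowSpan: 1, text: "AB" },
  { colAddr: 0, rowAddr: 1, colSpan: 3, rowSpan: 1, text: "wide" },
]);
```

Because the grid size is derived from addresses plus spans, a misread span can widen the grid but cannot shift other cells out of their columns, which is the failure mode the changelog entry describes.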
@@ -215,7 +237,7 @@ import type {
  | Format | Engine | Features |
  |--------|--------|----------|
  | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
- | **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
+ | **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection, colAddr-based table cell placement |
  | **PDF** | pdfjs-dist | Line-based table detection, XY-Cut reading order, heading detection, hidden text filter, OCR |
 
  ## Security
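
Reviewer note: the Security Hardening entry in the v1.7.0 notes lists a "Levenshtein length guard (O(m×n) DoS prevention)". The sketch below illustrates the general technique in TypeScript. The name `boundedLevenshtein`, the `MAX_LEN` value, and the choice to return `null` for oversized input are assumptions made for illustration; the diff does not show how kordoc implements the guard.

```typescript
// Assumed cap on input length; not a value taken from kordoc.
const MAX_LEN = 1000;

// Edit distance with an up-front length check, so that the O(m×n) loop
// never runs on attacker-sized strings. Returns null when the guard trips.
function boundedLevenshtein(
  a: string,
  b: string,
  maxLen: number = MAX_LEN,
): number | null {
  if (a.length > maxLen || b.length > maxLen) return null;

  // prev[j] holds the distance between a[0..i-1] and b[0..j].
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charCodeAt(i - 1) === b.charCodeAt(j - 1) ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // deletion
          curr[j - 1] + 1, // insertion
          prev[j - 1] + cost, // substitution or match
        ),
      );
    }
    prev = curr;
  }
  return prev[b.length];
}
```

Returning `null` rather than throwing lets a caller treat oversized input as "no fuzzy match" and move on; a real implementation might choose a different signal.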