kordoc 1.4.1 → 1.6.0

package/README.md CHANGED
@@ -14,7 +14,29 @@
 
  ---
 
- ## What's New in v1.4.0
+ ## What's New in v1.6.0
+
+ - **Cluster-Based Table Detection (PDF)** — Detects borderless tables by analyzing text alignment patterns. Baseline grouping + X-coordinate clustering identifies 2+ column tables that line-based detection misses. Sort-and-split clustering for order-independent results.
+ - **Korean Special Table Detection** — Automatically detects `구분/항목/종류`-style key-value patterns common in Korean government documents and converts them to structured 2-column tables.
+ - **Korean Word-Break Recovery** — Improved merging of broken Korean words in PDF table cells. Handles character-level PDF rendering (micro-gaps between Hangul characters) and cell line-break artifacts up to 8 characters.
+ - **Empty Table Filtering** — Tables with all-empty cells (from line detection of decorative borders) are now automatically removed.
+
+ <details>
+ <summary>v1.5.0 features</summary>
+
+ - **Line-Based Table Detection (PDF)** — Ported from OpenDataLoader. Extracts horizontal/vertical lines from PDF graphics commands, builds grid via intersection vertices, maps text to cells by bbox overlap. Proper colspan/rowspan detection. Falls back to heuristic for line-free PDFs.
+ - **IRBlock v2** — 6 block types: `heading`, `paragraph`, `table`, `list`, `image`, `separator`. New fields: `bbox`, `style`, `pageNumber`, `level`, `href`, `footnoteText`.
+ - **ParseResult v2** — `outline` (document structure) and `warnings` (skipped elements, hidden text) fields.
+ - **PDF Enhancements** — XY-Cut reading order, heading detection (font-size ratio), hidden text filtering (prompt injection defense), bounding box on every block.
+ - **HWP5 Enhancements** — CHAR_SHAPE parsing, style-based heading detection, warnings for skipped OLE/images.
+ - **HWPX Enhancements** — Style parsing from header.xml, hyperlink/footnote extraction.
+ - **List Detection** — Numbered paragraphs after tables auto-converted to ordered list blocks.
+ - **MCP Server** — Now returns `outline` and `warnings` in parse_document responses.
+
+ </details>
+
+ <details>
+ <summary>v1.4.x features</summary>
 
  - **Document Compare** — Diff two documents at IR level. Cross-format (HWP vs HWPX) supported.
  - **Form Field Recognition** — Extract label-value pairs from government forms automatically.
@@ -26,6 +48,8 @@
  - **7 MCP Tools** — parse_document, detect_format, parse_metadata, parse_pages, parse_table, compare_documents, parse_form.
  - **Error Codes** — Structured `code` field: `"ENCRYPTED"`, `"ZIP_BOMB"`, `"IMAGE_BASED_PDF"`, etc.
 
+ </details>
+
  ---
 
  ## Why kordoc?
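
The v1.6.0 notes in the first hunk describe borderless-table detection as baseline grouping plus X-coordinate clustering, with sort-and-split clustering for order-independent results. The following is only a minimal sketch of that general idea, not kordoc's implementation: the `TextItem` shape, the function names `clusterValues` and `detectColumnAnchors`, and the tolerance defaults are all hypothetical and chosen for illustration.

```typescript
// Hypothetical shape for a positioned text fragment (not kordoc's actual type).
interface TextItem {
  text: string;
  x: number; // left edge of the fragment
  y: number; // baseline of the fragment
}

// Sort-and-split clustering: sort the values, then start a new cluster
// wherever the gap to the previous value exceeds `maxGap`. Sorting first
// makes the result independent of the input order.
function clusterValues(values: number[], maxGap: number): number[][] {
  const sorted = [...values].sort((a, b) => a - b);
  const clusters: number[][] = [];
  for (const v of sorted) {
    const last = clusters[clusters.length - 1];
    if (last !== undefined && v - last[last.length - 1] <= maxGap) {
      last.push(v);
    } else {
      clusters.push([v]);
    }
  }
  return clusters;
}

// Cluster left edges into candidate columns, then keep only the candidates
// whose items span at least `minRows` distinct baselines (rows).
// The tolerances are assumed values, not taken from the package.
function detectColumnAnchors(
  items: TextItem[],
  yTol = 2,
  xTol = 5,
  minRows = 2,
): number[] {
  const anchors: number[] = [];
  for (const cluster of clusterValues(items.map((i) => i.x), xTol)) {
    const lo = cluster[0];
    const hi = cluster[cluster.length - 1];
    // Baselines of the items whose left edge falls inside this X cluster.
    const ys = items.filter((i) => i.x >= lo && i.x <= hi).map((i) => i.y);
    if (clusterValues(ys, yTol).length >= minRows) {
      anchors.push(lo);
    }
  }
  return anchors;
}

// Usage: two aligned columns over three rows, plus one stray fragment.
const sampleItems: TextItem[] = [
  { text: "구분", x: 50, y: 100 },
  { text: "내용", x: 200, y: 100 },
  { text: "항목", x: 50.5, y: 112 },
  { text: "값 A", x: 201, y: 112 },
  { text: "종류", x: 51, y: 124 },
  { text: "값 B", x: 199.5, y: 124 },
  { text: "page 3", x: 400, y: 300 },
];
console.log(detectColumnAnchors(sampleItems)); // [50, 199.5]
```

Two or more surviving anchors would suggest a table with at least two columns. A real detector would also need to assign cells to rows and handle right-aligned or centered text, which this sketch leaves out.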
@@ -177,7 +201,8 @@ npx kordoc watch ./docs --webhook https://api/hook # webhook notification
  ```typescript
  import type {
  ParseResult, ParseSuccess, ParseFailure, FileType,
- IRBlock, IRTable, IRCell, CellContext,
+ IRBlock, IRBlockType, IRTable, IRCell, CellContext,
+ BoundingBox, InlineStyle, OutlineItem, ParseWarning, WarningCode,
  DocumentMetadata, ParseOptions, ErrorCode,
  DiffResult, BlockDiff, CellDiff, DiffChangeType,
  FormField, FormResult,
@@ -191,7 +216,7 @@ import type {
  |--------|--------|----------|
  | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
  | **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
- | **PDF** | pdfjs-dist | Line grouping, table detection, image PDF + OCR |
+ | **PDF** | pdfjs-dist | Line-based table detection, XY-Cut reading order, heading detection, hidden text filter, OCR |
 
  ## Security