kordoc 1.4.1 → 1.6.0
- package/README.md +28 -3
- package/dist/{chunk-FC5R5FMV.js → chunk-TFGOV2ML.js} +1394 -58
- package/dist/chunk-TFGOV2ML.js.map +1 -0
- package/dist/cli.js +2 -2
- package/dist/index.cjs +1396 -60
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +55 -2
- package/dist/index.d.ts +55 -2
- package/dist/index.js +1393 -57
- package/dist/index.js.map +1 -1
- package/dist/mcp.js +18 -4
- package/dist/mcp.js.map +1 -1
- package/dist/{provider-JB7SY74K.js → provider-A4FHJSID.js} +2 -2
- package/dist/provider-A4FHJSID.js.map +1 -0
- package/dist/{watch-K2JXCS32.js → watch-WMRLOFYY.js} +2 -2
- package/package.json +2 -1
- package/dist/chunk-FC5R5FMV.js.map +0 -1
- package/dist/provider-JB7SY74K.js.map +0 -1
- package/dist/{watch-K2JXCS32.js.map → watch-WMRLOFYY.js.map} +0 -0
package/README.md
CHANGED
@@ -14,7 +14,29 @@
 
 ---
 
-## What's New in v1.
+## What's New in v1.6.0
+
+- **Cluster-Based Table Detection (PDF)** — Detects borderless tables by analyzing text alignment patterns. Baseline grouping + X-coordinate clustering identifies 2+ column tables that line-based detection misses. Sort-and-split clustering for order-independent results.
+- **Korean Special Table Detection** — Automatically detects `구분/항목/종류`-style key-value patterns common in Korean government documents and converts them to structured 2-column tables.
+- **Korean Word-Break Recovery** — Improved merging of broken Korean words in PDF table cells. Handles character-level PDF rendering (micro-gaps between Hangul characters) and cell line-break artifacts up to 8 characters.
+- **Empty Table Filtering** — Tables with all-empty cells (from line detection of decorative borders) are now automatically removed.
+
+<details>
+<summary>v1.5.0 features</summary>
+
+- **Line-Based Table Detection (PDF)** — Ported from OpenDataLoader. Extracts horizontal/vertical lines from PDF graphics commands, builds grid via intersection vertices, maps text to cells by bbox overlap. Proper colspan/rowspan detection. Falls back to heuristic for line-free PDFs.
+- **IRBlock v2** — 6 block types: `heading`, `paragraph`, `table`, `list`, `image`, `separator`. New fields: `bbox`, `style`, `pageNumber`, `level`, `href`, `footnoteText`.
+- **ParseResult v2** — `outline` (document structure) and `warnings` (skipped elements, hidden text) fields.
+- **PDF Enhancements** — XY-Cut reading order, heading detection (font-size ratio), hidden text filtering (prompt injection defense), bounding box on every block.
+- **HWP5 Enhancements** — CHAR_SHAPE parsing, style-based heading detection, warnings for skipped OLE/images.
+- **HWPX Enhancements** — Style parsing from header.xml, hyperlink/footnote extraction.
+- **List Detection** — Numbered paragraphs after tables auto-converted to ordered list blocks.
+- **MCP Server** — Now returns `outline` and `warnings` in parse_document responses.
+
+</details>
+
+<details>
+<summary>v1.4.x features</summary>
 
 - **Document Compare** — Diff two documents at IR level. Cross-format (HWP vs HWPX) supported.
 - **Form Field Recognition** — Extract label-value pairs from government forms automatically.
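Reviewer note: the "Cluster-Based Table Detection" entry in the hunk above names baseline grouping plus sort-and-split X-coordinate clustering. The sketch below is one plausible reading of that description, not kordoc's actual code. The `TextItem` shape, the function names, and both tolerances are hypothetical, and it assumes `y` grows downward (flip the row order for raw PDF user space).

```typescript
// Hypothetical positioned text fragment; not a kordoc type.
interface TextItem {
  text: string;
  x: number; // left edge of the fragment
  y: number; // baseline of the fragment
}

// Sort-and-split 1-D clustering: sort the values, then start a new cluster
// wherever the gap to the previous value exceeds maxGap. Sorting first makes
// the result independent of the order in which items arrive.
function clusterValues(values: number[], maxGap: number): number[][] {
  const sorted = [...values].sort((a, b) => a - b);
  const clusters: number[][] = [];
  for (const v of sorted) {
    const last = clusters[clusters.length - 1];
    if (last !== undefined && v - last[last.length - 1] <= maxGap) {
      last.push(v);
    } else {
      clusters.push([v]);
    }
  }
  return clusters;
}

// Index of the cluster whose [min, max] range contains v.
function indexOfCluster(clusters: number[][], v: number): number {
  return clusters.findIndex((c) => v >= c[0] && v <= c[c.length - 1]);
}

// Rows come from baseline clusters, columns from left-edge clusters.
// Returns null when fewer than two columns are found.
function detectBorderlessTable(
  items: TextItem[],
  yTol = 2, // assumed baseline tolerance
  xTol = 5, // assumed column-alignment tolerance
): string[][] | null {
  const colClusters = clusterValues(items.map((i) => i.x), xTol);
  if (colClusters.length < 2) return null;
  const rowClusters = clusterValues(items.map((i) => i.y), yTol);

  const table: string[][] = rowClusters.map(() => colClusters.map(() => ""));
  for (const item of items) {
    const r = indexOfCluster(rowClusters, item.y);
    const c = indexOfCluster(colClusters, item.x);
    table[r][c] = table[r][c] ? `${table[r][c]} ${item.text}` : item.text;
  }
  return table;
}
```

Because both axes are clustered after sorting, shuffling the input items does not change which row or column a fragment lands in, which matches the "order-independent results" wording in the README.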
@@ -26,6 +48,8 @@
 - **7 MCP Tools** — parse_document, detect_format, parse_metadata, parse_pages, parse_table, compare_documents, parse_form.
 - **Error Codes** — Structured `code` field: `"ENCRYPTED"`, `"ZIP_BOMB"`, `"IMAGE_BASED_PDF"`, etc.
 
+</details>
+
 ---
 
 ## Why kordoc?
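Reviewer note: the "Error Codes" entry in the hunk above mentions a structured `code` field. A caller might branch on it roughly as sketched here. The `ParseResultLike` type is a local stand-in with an assumed shape (`success`, `markdown`, `error`), not kordoc's exported `ParseResult`; only the three code strings come from the README, and the messages are illustrative guesses at what each code means.

```typescript
// Local stand-in types; the real kordoc result types may differ.
type ParseResultLike =
  | { success: true; markdown: string }
  | { success: false; error: string; code?: string };

// Map a result to a short human-readable status by switching on `code`.
function describeResult(result: ParseResultLike): string {
  if (result.success) return "ok";
  switch (result.code) {
    case "ENCRYPTED":
      return "document appears to be encrypted";
    case "ZIP_BOMB":
      return "archive rejected as a possible zip bomb";
    case "IMAGE_BASED_PDF":
      return "PDF seems to have no text layer; OCR may be needed";
    default:
      // The README lists codes with "etc.", so unknown codes must be tolerated.
      return `unhandled failure (${result.code ?? "no code"}): ${result.error}`;
  }
}
```

Keeping a `default` branch matters here because the README's list of codes is open-ended.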
@@ -177,7 +201,8 @@ npx kordoc watch ./docs --webhook https://api/hook # webhook notification
 ```typescript
 import type {
 ParseResult, ParseSuccess, ParseFailure, FileType,
-IRBlock, IRTable, IRCell, CellContext,
+IRBlock, IRBlockType, IRTable, IRCell, CellContext,
+BoundingBox, InlineStyle, OutlineItem, ParseWarning, WarningCode,
 DocumentMetadata, ParseOptions, ErrorCode,
 DiffResult, BlockDiff, CellDiff, DiffChangeType,
 FormField, FormResult,
@@ -191,7 +216,7 @@ import type {
 |--------|--------|----------|
 | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
 | **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
-| **PDF** | pdfjs-dist | Line
+| **PDF** | pdfjs-dist | Line-based table detection, XY-Cut reading order, heading detection, hidden text filter, OCR |
 
 ## Security
 