kordoc 1.4.0 → 1.5.0
- package/README.md +18 -3
- package/dist/{chunk-BWZW234S.js → chunk-5SZWGBNL.js} +1083 -57
- package/dist/chunk-5SZWGBNL.js.map +1 -0
- package/dist/cli.js +2 -2
- package/dist/index.cjs +1085 -59
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +55 -2
- package/dist/index.d.ts +55 -2
- package/dist/index.js +1082 -56
- package/dist/index.js.map +1 -1
- package/dist/mcp.js +18 -4
- package/dist/mcp.js.map +1 -1
- package/dist/{provider-JB7SY74K.js → provider-A4FHJSID.js} +2 -2
- package/dist/provider-A4FHJSID.js.map +1 -0
- package/dist/{watch-LIGKH3QS.js → watch-YCWNFYAW.js} +2 -2
- package/package.json +2 -1
- package/dist/chunk-BWZW234S.js.map +0 -1
- package/dist/provider-JB7SY74K.js.map +0 -1
- package/dist/{watch-LIGKH3QS.js.map → watch-YCWNFYAW.js.map} +0 -0
package/README.md
CHANGED
@@ -14,7 +14,19 @@
 
 ---
 
-## What's New in v1.
+## What's New in v1.5.0
+
+- **Line-Based Table Detection (PDF)** — Ported from OpenDataLoader. Extracts horizontal/vertical lines from PDF graphics commands, builds grid via intersection vertices, maps text to cells by bbox overlap. Proper colspan/rowspan detection. Falls back to heuristic for line-free PDFs.
+- **IRBlock v2** — 6 block types: `heading`, `paragraph`, `table`, `list`, `image`, `separator`. New fields: `bbox`, `style`, `pageNumber`, `level`, `href`, `footnoteText`.
+- **ParseResult v2** — `outline` (document structure) and `warnings` (skipped elements, hidden text) fields.
+- **PDF Enhancements** — XY-Cut reading order, heading detection (font-size ratio), hidden text filtering (prompt injection defense), bounding box on every block.
+- **HWP5 Enhancements** — CHAR_SHAPE parsing, style-based heading detection, warnings for skipped OLE/images.
+- **HWPX Enhancements** — Style parsing from header.xml, hyperlink/footnote extraction.
+- **List Detection** — Numbered paragraphs after tables auto-converted to ordered list blocks.
+- **MCP Server** — Now returns `outline` and `warnings` in parse_document responses.
+
+<details>
+<summary>v1.4.x features</summary>
 
 - **Document Compare** — Diff two documents at IR level. Cross-format (HWP vs HWPX) supported.
 - **Form Field Recognition** — Extract label-value pairs from government forms automatically.
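The "Line-Based Table Detection" bullet added in this hunk describes building a grid from ruling lines and mapping text to cells by bounding-box overlap. A minimal sketch of that last step follows. It is not kordoc's implementation: `Box`, `TextItem`, `overlapArea`, `distinct`, and `mapTextToGrid` are hypothetical names, the grid is taken from the distinct x and y positions of the rules rather than from true intersection vertices, and colspan/rowspan handling is omitted.

```typescript
// Hypothetical types for illustration; these are not kordoc's exported types.
interface Box { x0: number; y0: number; x1: number; y1: number }
interface TextItem { text: string; box: Box }

// Area of the intersection of two boxes (0 when they do not overlap).
function overlapArea(a: Box, b: Box): number {
  const w = Math.min(a.x1, b.x1) - Math.max(a.x0, b.x0);
  const h = Math.min(a.y1, b.y1) - Math.max(a.y0, b.y0);
  return w > 0 && h > 0 ? w * h : 0;
}

// Sort coordinates and merge values that lie within `tol` of the previous one.
function distinct(values: number[], tol = 1): number[] {
  const sorted = [...values].sort((a, b) => a - b);
  const out: number[] = [];
  for (const v of sorted) {
    if (out.length === 0 || v - out[out.length - 1] > tol) out.push(v);
  }
  return out;
}

// xs: x positions of vertical rules; ys: y positions of horizontal rules.
// Rows are ordered by increasing y; flip them if the source uses PDF
// coordinates, where y grows upward.
function mapTextToGrid(xs: number[], ys: number[], items: TextItem[]): string[][] {
  const cols = distinct(xs);
  const rows = distinct(ys);
  const nRows = Math.max(rows.length - 1, 0);
  const nCols = Math.max(cols.length - 1, 0);
  const grid: string[][] = Array.from({ length: nRows }, () =>
    Array.from({ length: nCols }, () => ""));

  for (const item of items) {
    // Pick the cell with the largest overlap; items touching no cell are skipped.
    let best = { r: -1, c: -1, area: 0 };
    for (let r = 0; r < nRows; r++) {
      for (let c = 0; c < nCols; c++) {
        const cell: Box = { x0: cols[c], y0: rows[r], x1: cols[c + 1], y1: rows[r + 1] };
        const area = overlapArea(item.box, cell);
        if (area > best.area) best = { r, c, area };
      }
    }
    if (best.r >= 0) {
      const prev = grid[best.r][best.c];
      grid[best.r][best.c] = prev ? prev + " " + item.text : item.text;
    }
  }
  return grid;
}
```

Choosing the cell with the largest overlap, rather than the first cell that overlaps at all, keeps text that slightly crosses a rule in the cell where most of it sits.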
@@ -26,6 +38,8 @@
 - **7 MCP Tools** — parse_document, detect_format, parse_metadata, parse_pages, parse_table, compare_documents, parse_form.
 - **Error Codes** — Structured `code` field: `"ENCRYPTED"`, `"ZIP_BOMB"`, `"IMAGE_BASED_PDF"`, etc.
 
+</details>
+
 ---
 
 ## Why kordoc?
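The "IRBlock v2" and "ParseResult v2" bullets in the What's New section list six block types, several new fields, and an `outline` of the document structure. The sketch below shows one way such an outline could be derived from heading blocks. The shapes are assumptions inferred only from the README wording: `SketchBlock`, `SketchOutlineItem`, `buildOutline`, and the `text` field are hypothetical, and the real definitions in `dist/index.d.ts` may differ.

```typescript
// Hypothetical shapes based only on the README bullets; not kordoc's real types.
type SketchBlockType =
  | "heading" | "paragraph" | "table" | "list" | "image" | "separator";

interface SketchBlock {
  type: SketchBlockType;
  text?: string;        // assumed field name, not confirmed by the README
  level?: number;       // heading level, named in the README field list
  pageNumber?: number;  // named in the README field list
  href?: string;        // named in the README field list
  footnoteText?: string;
}

interface SketchOutlineItem {
  level: number;
  text: string;
  pageNumber: number | null;
}

// Collect heading blocks, in document order, into a flat outline.
function buildOutline(blocks: SketchBlock[]): SketchOutlineItem[] {
  const outline: SketchOutlineItem[] = [];
  for (const block of blocks) {
    if (block.type !== "heading") continue;
    outline.push({
      level: block.level ?? 1,            // default to top level when unset
      text: block.text ?? "",
      pageNumber: block.pageNumber ?? null,
    });
  }
  return outline;
}

// Example input: two headings separated by a paragraph.
const sampleBlocks: SketchBlock[] = [
  { type: "heading", text: "Intro", level: 1, pageNumber: 1 },
  { type: "paragraph", text: "Body text." },
  { type: "heading", text: "Scope", level: 2, pageNumber: 2 },
];
const sampleOutline = buildOutline(sampleBlocks);
```

A flat list with an explicit `level` is the simplest representation; a consumer that needs a tree can nest items by comparing consecutive levels.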
@@ -177,7 +191,8 @@ npx kordoc watch ./docs --webhook https://api/hook # webhook notification
 ```typescript
 import type {
   ParseResult, ParseSuccess, ParseFailure, FileType,
-  IRBlock, IRTable, IRCell, CellContext,
+  IRBlock, IRBlockType, IRTable, IRCell, CellContext,
+  BoundingBox, InlineStyle, OutlineItem, ParseWarning, WarningCode,
   DocumentMetadata, ParseOptions, ErrorCode,
   DiffResult, BlockDiff, CellDiff, DiffChangeType,
   FormField, FormResult,
@@ -191,7 +206,7 @@ import type {
 |--------|--------|----------|
 | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
 | **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
-| **PDF** | pdfjs-dist | Line
+| **PDF** | pdfjs-dist | Line-based table detection, XY-Cut reading order, heading detection, hidden text filter, OCR |
 
 ## Security
 