kordoc 1.3.0 → 1.4.1

package/README.md CHANGED
@@ -1,12 +1,12 @@
1
1
  # kordoc
2
2
 
3
- **I'll parse them all** — Parse any Korean document to Markdown.
3
+ **I'll parse them all** — The Korean Document Platform.
4
4
 
5
5
  [![npm version](https://img.shields.io/npm/v/kordoc.svg)](https://www.npmjs.com/package/kordoc)
6
6
  [![license](https://img.shields.io/npm/l/kordoc.svg)](https://github.com/chrisryugj/kordoc/blob/main/LICENSE)
7
7
  [![node](https://img.shields.io/node/v/kordoc.svg)](https://nodejs.org)
8
8
 
9
- > *HWP, HWPX, PDF — if it is a Korean document, kordoc parses every last bit of it.*
9
+ > *Parse, compare, extract, and generate Korean documents. HWP, HWPX, PDF — all of them.*
10
10
 
11
11
  [한국어](./README-KR.md)
12
12
 
@@ -14,46 +14,40 @@
14
14
 
15
15
  ---
16
16
 
17
- ## Why kordoc?
18
-
19
- South Korea's government runs on **HWP** — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.
17
+ ## What's New in v1.4.0
20
18
 
21
- **kordoc** was born from that document hell. Built by a Korean civil servant who spent **7 years** buried under HWP files at a district office. One day he snapped — and decided to parse them all. Its parsers have been battle-tested across 5 real government projects, processing school curriculum plans, facility inspection reports, legal annexes, and municipal newsletters. If a Korean public servant wrote it, kordoc can parse it.
19
+ - **Document Compare** Diff two documents at IR level. Cross-format (HWP vs HWPX) supported.
20
+ - **Form Field Recognition** — Extract label-value pairs from government forms automatically.
21
+ - **Structured Parsing** — Access `IRBlock[]` and `DocumentMetadata` directly, not just markdown.
22
+ - **Page Range Parsing** — Parse only pages 1-3: `parse(buffer, { pages: "1-3" })`.
23
+ - **Markdown to HWPX** — Reverse conversion. Generate valid HWPX files from markdown.
24
+ - **OCR Integration** — Pluggable OCR for image-based PDFs (bring your own provider).
25
+ - **Watch Mode** — `kordoc watch ./incoming --webhook https://...` for auto-conversion.
26
+ - **7 MCP Tools** — parse_document, detect_format, parse_metadata, parse_pages, parse_table, compare_documents, parse_form.
27
+ - **Error Codes** — Structured `code` field: `"ENCRYPTED"`, `"ZIP_BOMB"`, `"IMAGE_BASED_PDF"`, etc.
22
28
 
23
29
  ---
24
30
 
25
- ## Features
31
+ ## Why kordoc?
26
32
 
27
- - **HWP 5.x Binary Parsing** OLE2 container + record stream + UTF-16LE. No Hancom Office needed.
28
- - **HWPX ZIP Parsing** — OPF manifest resolution, multi-section, nested tables.
29
- - **PDF Text Extraction** — Y-coordinate line grouping, table reconstruction, image PDF detection.
30
- - **2-Pass Table Builder** — Correct `colSpan`/`rowSpan` via grid algorithm. No broken tables.
31
- - **Broken ZIP Recovery** — Corrupted HWPX? Scans raw Local File Headers.
32
- - **3 Interfaces** — npm library, CLI tool, and MCP server (Claude/Cursor).
33
- - **Cross-Platform** — Pure JavaScript. Runs on Linux, macOS, Windows.
33
+ South Korea's government runs on **HWP** — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare.
34
34
 
35
- ## Supported Formats
35
+ **kordoc** was born from that document hell. Built by a Korean civil servant who spent **7 years** buried under HWP files. Battle-tested across 5 real government projects. If a Korean public servant wrote it, kordoc can parse it.
36
36
 
37
- | Format | Engine | Features |
38
- |--------|--------|----------|
39
- | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
40
- | **HWP 5.x** (한컴 레거시) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
41
- | **PDF** | pdfjs-dist | Line grouping, table detection, image PDF warning |
37
+ ---
42
38
 
43
39
  ## Installation
44
40
 
45
41
  ```bash
46
42
  npm install kordoc
47
43
 
48
- # PDF support requires pdfjs-dist (optional peer dependency)
44
+ # PDF support (optional)
49
45
  npm install pdfjs-dist
50
46
  ```
51
47
 
52
- > `pdfjs-dist` is an optional peer dependency. Not needed for HWP/HWPX parsing.
48
+ ## Quick Start
53
49
 
54
- ## Usage
55
-
56
- ### As a Library
50
+ ### Parse Any Document
57
51
 
58
52
  ```typescript
59
53
  import { parse } from "kordoc"
@@ -63,40 +57,76 @@ const buffer = readFileSync("document.hwpx")
63
57
  const result = await parse(buffer.buffer)
64
58
 
65
59
  if (result.success) {
66
- console.log(result.markdown)
60
+ console.log(result.markdown) // Markdown text
61
+ console.log(result.blocks) // IRBlock[] structured data
62
+ console.log(result.metadata) // { title, author, createdAt, ... }
67
63
  }
68
64
  ```
69
65
 
70
- #### Format-Specific
66
+ ### Compare Two Documents
67
+
68
+ ```typescript
69
+ import { compare } from "kordoc"
70
+
71
+ const diff = await compare(bufferA, bufferB)
72
+ // diff.stats → { added: 3, removed: 1, modified: 5, unchanged: 42 }
73
+ // diff.diffs → BlockDiff[] with cell-level table diffs
74
+ ```
75
+
76
+ Cross-format supported: compare HWP against HWPX of the same document.
77
+
78
+ ### Extract Form Fields
71
79
 
72
80
  ```typescript
73
- import { parseHwpx, parseHwp, parsePdf } from "kordoc"
81
+ import { parse, extractFormFields } from "kordoc"
74
82
 
75
- const hwpxResult = await parseHwpx(buffer) // HWPX
76
- const hwpResult = await parseHwp(buffer) // HWP 5.x
77
- const pdfResult = await parsePdf(buffer) // PDF
83
+ const result = await parse(buffer)
84
+ if (result.success) {
85
+ const form = extractFormFields(result.blocks)
86
+ // form.fields → [{ label: "성명", value: "홍길동", row: 0, col: 0 }, ...]
87
+ // form.confidence → 0.85
88
+ }
78
89
  ```
79
90
 
80
- #### Format Detection
91
+ ### Generate HWPX from Markdown
81
92
 
82
93
  ```typescript
83
- import { detectFormat } from "kordoc"
94
+ import { markdownToHwpx } from "kordoc"
84
95
 
85
- detectFormat(buffer) // "hwpx" | "hwp" | "pdf" | "unknown"
96
+ const hwpxBuffer = await markdownToHwpx("# Title\n\nParagraph text\n\n| A | B |\n| --- | --- |\n| 1 | 2 |")
97
+ writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))
86
98
  ```
87
99
 
88
- ### As a CLI
100
+ ### Parse Specific Pages
89
101
 
90
- ```bash
91
- npx kordoc document.hwpx # stdout
92
- npx kordoc document.hwp -o output.md # save to file
93
- npx kordoc *.pdf -d ./converted/ # batch convert
94
- npx kordoc report.hwpx --format json # JSON with metadata
102
+ ```typescript
103
+ const result = await parse(buffer, { pages: "1-3" }) // pages 1-3 only
104
+ const result = await parse(buffer, { pages: [1, 5, 10] }) // specific pages
105
+ ```
106
+
107
+ ### OCR for Image-Based PDFs
108
+
109
+ ```typescript
110
+ const result = await parse(buffer, {
111
+ ocr: async (pageImage, pageNumber, mimeType) => {
112
+ return await myOcrService.recognize(pageImage) // Tesseract, Claude Vision, etc.
113
+ }
114
+ })
95
115
  ```
96
116
 
97
- ### As an MCP Server
117
+ ## CLI
118
+
119
+ ```bash
120
+ npx kordoc document.hwpx # stdout
121
+ npx kordoc document.hwp -o output.md # save to file
122
+ npx kordoc *.pdf -d ./converted/ # batch convert
123
+ npx kordoc report.hwpx --format json # JSON with blocks + metadata
124
+ npx kordoc report.hwpx --pages 1-3 # page range
125
+ npx kordoc watch ./incoming -d ./output # watch mode
126
+ npx kordoc watch ./docs --webhook https://api/hook # webhook notification
127
+ ```
98
128
 
99
- Works with **Claude Desktop**, **Cursor**, **Windsurf**, and any MCP-compatible client.
129
+ ## MCP Server (Claude / Cursor / Windsurf)
100
130
 
101
131
  ```json
102
132
  {
@@ -109,104 +139,67 @@ Works with **Claude Desktop**, **Cursor**, **Windsurf**, and any MCP-compatible
109
139
  }
110
140
  ```
111
141
 
112
- **Tools exposed:**
142
+ **7 Tools:**
113
143
 
114
144
  | Tool | Description |
115
145
  |------|-------------|
116
- | `parse_document` | Parse HWP/HWPX/PDF file → Markdown |
146
+ | `parse_document` | Parse HWP/HWPX/PDF → Markdown with metadata |
117
147
  | `detect_format` | Detect file format via magic bytes |
148
+ | `parse_metadata` | Extract metadata only (fast, no full parse) |
149
+ | `parse_pages` | Parse specific page range |
150
+ | `parse_table` | Extract Nth table from document |
151
+ | `compare_documents` | Diff two documents (cross-format) |
152
+ | `parse_form` | Extract form fields as structured JSON |
118
153
 
119
154
  ## API Reference
120
155
 
121
- ### `parse(buffer: ArrayBuffer): Promise<ParseResult>`
156
+ ### Core
122
157
 
123
- Auto-detects format and converts to Markdown.
158
+ | Function | Description |
159
+ |----------|-------------|
160
+ | `parse(buffer, options?)` | Auto-detect format, parse to Markdown + IRBlock[] |
161
+ | `parseHwpx(buffer, options?)` | HWPX only |
162
+ | `parseHwp(buffer, options?)` | HWP 5.x only |
163
+ | `parsePdf(buffer, options?)` | PDF only |
164
+ | `detectFormat(buffer)` | Returns `"hwpx" \| "hwp" \| "pdf" \| "unknown"` |
124
165
 
125
- ```typescript
126
- interface ParseResult {
127
- success: boolean
128
- markdown?: string
129
- fileType: "hwpx" | "hwp" | "pdf" | "unknown"
130
- isImageBased?: boolean // scanned PDF detection
131
- pageCount?: number // PDF only
132
- error?: string
133
- }
134
- ```
166
+ ### Advanced
167
+
168
+ | Function | Description |
169
+ |----------|-------------|
170
+ | `compare(bufferA, bufferB, options?)` | Document diff at IR level |
171
+ | `extractFormFields(blocks)` | Form field recognition from IRBlock[] |
172
+ | `markdownToHwpx(markdown)` | Markdown → HWPX reverse conversion |
173
+ | `blocksToMarkdown(blocks)` | IRBlock[] → Markdown string |
135
174
 
136
175
  ### Types
137
176
 
138
177
  ```typescript
139
- import type { ParseResult, ParseSuccess, ParseFailure, FileType } from "kordoc"
178
+ import type {
179
+ ParseResult, ParseSuccess, ParseFailure, FileType,
180
+ IRBlock, IRTable, IRCell, CellContext,
181
+ DocumentMetadata, ParseOptions, ErrorCode,
182
+ DiffResult, BlockDiff, CellDiff, DiffChangeType,
183
+ FormField, FormResult,
184
+ OcrProvider, WatchOptions,
185
+ } from "kordoc"
140
186
  ```
141
187
 
142
- > Internal types (`IRBlock`, `IRTable`, `IRCell`, `CellContext`) and utilities (`KordocError`, `sanitizeError`, `isPathTraversal`, `buildTable`, `blocksToMarkdown`) are not part of the public API.
143
-
144
- ## Requirements
188
+ ## Supported Formats
145
189
 
146
- - **Node.js** >= 18
147
- - **pdfjs-dist** >= 4.0.0 — Optional. Only needed for PDF. HWP/HWPX work without it.
190
+ | Format | Engine | Features |
191
+ |--------|--------|----------|
192
+ | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
193
+ | **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
194
+ | **PDF** | pdfjs-dist | Line grouping, table detection, image PDF + OCR |
148
195
 
149
196
  ## Security
150
197
 
151
- Production-grade security hardening:
152
-
153
- - **ZIP bomb protection** — Entry count validation, 100MB decompression limit, 500 entry cap
154
- > **Known limitation:** Pre-check reads declared sizes from ZIP Central Directory, which an attacker can falsify. The primary defense is per-file cumulative size tracking during actual decompression. For fully untrusted input where streaming decompression is required, consider wrapping kordoc behind a size-limited sandbox.
155
- - **XXE/Billion Laughs prevention** — Internal DTD subsets fully stripped from HWPX XML
156
- - **Decompression bomb guard** — `maxOutputLength` on HWP5 zlib streams, cumulative 100MB limit across sections
157
- - **PDF resource limits** — MAX_PAGES=5,000, cumulative text size 100MB cap, `doc.destroy()` cleanup
158
- - **HWP5 record cap** — Max 500,000 records per section, prevents memory exhaustion from crafted files
159
- - **Table dimension clamping** — rows/cols read from HWP5 binary clamped to MAX_ROWS/MAX_COLS before allocation
160
- - **colSpan/rowSpan clamping** — Crafted merge values clamped to grid bounds (MAX_COLS=200, MAX_ROWS=10,000)
161
- - **Path traversal guard** — Backslash normalization, `..`, absolute paths, Windows drive letters all rejected
162
- - **MCP error sanitization** — Allowlist-based error filtering, unknown errors return generic message
163
- - **MCP path restriction** — Only `.hwp`, `.hwpx`, `.pdf` extensions allowed, symlink resolution
164
- - **File size limit** — 500MB max in MCP server and CLI
165
- - **HWP5 section limit** — Max 100 sections in both primary and fallback paths
166
- - **HWP5 control char fix** — Character code 10 (footnote/endnote) now correctly handled
167
-
168
- ## How It Works
169
-
170
- ```
171
- ┌─────────────┐ Magic Bytes ┌──────────────────┐
172
- │ File Input │ ──── Detection ────→ │ Format Router │
173
- └─────────────┘ └────────┬─────────┘
174
-
175
- ┌──────────────────────────┼──────────────────────────┐
176
- │ │ │
177
- ┌─────▼─────┐ ┌───────▼───────┐ ┌──────▼──────┐
178
- │ HWPX │ │ HWP 5.x │ │ PDF │
179
- │ ZIP+XML │ │ OLE2+Record │ │ pdfjs-dist │
180
- └─────┬─────┘ └───────┬───────┘ └──────┬──────┘
181
- │ │ │
182
- │ ┌──────────────────┤ │
183
- │ │ �� │
184
- ┌─────▼───────▼─────┐ │ │
185
- │ 2-Pass Table │ │ │
186
- │ Builder (Grid) │ │ │
187
- └─────────┬─────────┘ │ │
188
- │ │ │
189
- ┌─────▼──────────────────────▼──────────────────────────▼─────┐
190
- │ IRBlock[] │
191
- │ (Intermediate Representation) │
192
- └────────────────────────┬───────────────────────────────────┘
193
-
194
- ┌──────▼──────┐
195
- │ Markdown │
196
- │ Output │
197
- └─────────────┘
198
- ```
198
+ Production-grade hardening: ZIP bomb protection, XXE/Billion Laughs prevention, decompression bomb guard, path traversal guard, MCP error sanitization, file size limits (500MB). See [SECURITY.md](./SECURITY.md) for details.
199
199
 
200
200
  ## Credits
201
201
 
202
- Production-tested across 5 Korean government technology projects:
203
- - School curriculum plans (학교교육과정)
204
- - Facility inspection reports (사전기획 보고서)
205
- - Legal document annexes (법률 별표)
206
- - Municipal newsletters (소식지)
207
- - Public data extraction tools (공공데이터)
208
-
209
- Thousands of real government documents parsed without breaking a sweat.
202
+ Production-tested across 5 Korean government projects: school curriculum plans, facility inspection reports, legal document annexes, municipal newsletters, and public data extraction tools. Thousands of real government documents parsed.
210
203
 
211
204
  ## License
212
205