kordoc 1.2.0 → 1.4.0
- package/README.md +111 -118
- package/dist/{chunk-4BKNDXGU.js → chunk-BWZW234S.js} +595 -86
- package/dist/chunk-BWZW234S.js.map +1 -0
- package/dist/cli.js +15 -3
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +665 -59
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +163 -6
- package/dist/index.d.ts +163 -6
- package/dist/index.js +667 -58
- package/dist/index.js.map +1 -1
- package/dist/mcp.js +216 -13
- package/dist/mcp.js.map +1 -1
- package/dist/provider-JB7SY74K.js +38 -0
- package/dist/provider-JB7SY74K.js.map +1 -0
- package/dist/watch-LIGKH3QS.js +90 -0
- package/dist/watch-LIGKH3QS.js.map +1 -0
- package/package.json +1 -1
- package/dist/chunk-4BKNDXGU.js.map +0 -1
package/README.md
CHANGED
|
@@ -1,12 +1,12 @@
|
|
|
1
1
|
# kordoc
|
|
2
2
|
|
|
3
|
-
**모두 파싱해버리겠다** —
|
|
3
|
+
**모두 파싱해버리겠다** — The Korean Document Platform.
|
|
4
4
|
|
|
5
5
|
[](https://www.npmjs.com/package/kordoc)
|
|
6
6
|
[](https://github.com/chrisryugj/kordoc/blob/main/LICENSE)
|
|
7
7
|
[](https://nodejs.org)
|
|
8
8
|
|
|
9
|
-
> *HWP, HWPX, PDF —
|
|
9
|
+
> *Parse, compare, extract, and generate Korean documents. HWP, HWPX, PDF — all of them.*
|
|
10
10
|
|
|
11
11
|
[한국어](./README-KR.md)
|
|
12
12
|
|
|
@@ -14,46 +14,40 @@
|
|
|
14
14
|
|
|
15
15
|
---
|
|
16
16
|
|
|
17
|
-
##
|
|
18
|
-
|
|
19
|
-
South Korea's government runs on **HWP** — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.
|
|
17
|
+
## What's New in v1.4.0
|
|
20
18
|
|
|
21
|
-
|
|
19
|
+
- **Document Compare** — Diff two documents at IR level. Cross-format (HWP vs HWPX) supported.
|
|
20
|
+
- **Form Field Recognition** — Extract label-value pairs from government forms automatically.
|
|
21
|
+
- **Structured Parsing** — Access `IRBlock[]` and `DocumentMetadata` directly, not just markdown.
|
|
22
|
+
- **Page Range Parsing** — Parse only pages 1-3: `parse(buffer, { pages: "1-3" })`.
|
|
23
|
+
- **Markdown to HWPX** — Reverse conversion. Generate valid HWPX files from markdown.
|
|
24
|
+
- **OCR Integration** — Pluggable OCR for image-based PDFs (bring your own provider).
|
|
25
|
+
- **Watch Mode** — `kordoc watch ./incoming --webhook https://...` for auto-conversion.
|
|
26
|
+
- **7 MCP Tools** — parse_document, detect_format, parse_metadata, parse_pages, parse_table, compare_documents, parse_form.
|
|
27
|
+
- **Error Codes** — Structured `code` field: `"ENCRYPTED"`, `"ZIP_BOMB"`, `"IMAGE_BASED_PDF"`, etc.
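
A structured `code` field lets callers branch on the failure reason instead of matching error strings. The sketch below is illustrative only: `FailureLike` and `describeFailure` are hypothetical names, the failure shape is an assumption rather than kordoc's actual `ParseFailure` type, and only the three codes named in the bullet above are handled.

```typescript
// Hypothetical failure shape for illustration; kordoc's real ParseFailure may differ.
interface FailureLike {
  success: false
  code?: string // e.g. "ENCRYPTED", "ZIP_BOMB", "IMAGE_BASED_PDF"
}

// Map a failure code to a short, user-facing hint.
function describeFailure(failure: FailureLike): string {
  switch (failure.code) {
    case "ENCRYPTED":
      return "The document is encrypted and cannot be parsed as-is."
    case "ZIP_BOMB":
      return "The archive exceeded the decompression limits and was rejected."
    case "IMAGE_BASED_PDF":
      return "The PDF has no text layer; supply an OCR provider."
    default:
      // Codes not listed in the README fall through to a generic message.
      return `Parsing failed (${failure.code ?? "no code"}).`
  }
}

console.log(describeFailure({ success: false, code: "IMAGE_BASED_PDF" }))
```

Falling back to a generic message keeps the caller working if further codes exist beyond the ones shown here.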
|
|
22
28
|
|
|
23
29
|
---
|
|
24
30
|
|
|
25
|
-
##
|
|
31
|
+
## Why kordoc?
|
|
26
32
|
|
|
27
|
-
|
|
28
|
-
- **HWPX ZIP Parsing** — OPF manifest resolution, multi-section, nested tables.
|
|
29
|
-
- **PDF Text Extraction** — Y-coordinate line grouping, table reconstruction, image PDF detection.
|
|
30
|
-
- **2-Pass Table Builder** — Correct `colSpan`/`rowSpan` via grid algorithm. No broken tables.
|
|
31
|
-
- **Broken ZIP Recovery** — Corrupted HWPX? Scans raw Local File Headers.
|
|
32
|
-
- **3 Interfaces** — npm library, CLI tool, and MCP server (Claude/Cursor).
|
|
33
|
-
- **Cross-Platform** — Pure JavaScript. Runs on Linux, macOS, Windows.
|
|
33
|
+
South Korea's government runs on **HWP** — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare.
|
|
34
34
|
|
|
35
|
-
|
|
35
|
+
**kordoc** was born from that document hell. Built by a Korean civil servant who spent **7 years** buried under HWP files. Battle-tested across 5 real government projects. If a Korean public servant wrote it, kordoc can parse it.
|
|
36
36
|
|
|
37
|
-
|
|
38
|
-
|--------|--------|----------|
|
|
39
|
-
| **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
|
|
40
|
-
| **HWP 5.x** (한컴 레거시) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
|
|
41
|
-
| **PDF** | pdfjs-dist | Line grouping, table detection, image PDF warning |
|
|
37
|
+
---
|
|
42
38
|
|
|
43
39
|
## Installation
|
|
44
40
|
|
|
45
41
|
```bash
|
|
46
42
|
npm install kordoc
|
|
47
43
|
|
|
48
|
-
# PDF support
|
|
44
|
+
# PDF support (optional)
|
|
49
45
|
npm install pdfjs-dist
|
|
50
46
|
```
|
|
51
47
|
|
|
52
|
-
|
|
48
|
+
## Quick Start
|
|
53
49
|
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
### As a Library
|
|
50
|
+
### Parse Any Document
|
|
57
51
|
|
|
58
52
|
```typescript
|
|
59
53
|
import { parse } from "kordoc"
|
|
@@ -63,40 +57,76 @@ const buffer = readFileSync("document.hwpx")
|
|
|
63
57
|
const result = await parse(buffer.buffer)
|
|
64
58
|
|
|
65
59
|
if (result.success) {
|
|
66
|
-
console.log(result.markdown)
|
|
60
|
+
console.log(result.markdown) // Markdown text
|
|
61
|
+
console.log(result.blocks) // IRBlock[] structured data
|
|
62
|
+
console.log(result.metadata) // { title, author, createdAt, ... }
|
|
67
63
|
}
|
|
68
64
|
```
|
|
69
65
|
|
|
70
|
-
|
|
66
|
+
### Compare Two Documents
|
|
67
|
+
|
|
68
|
+
```typescript
|
|
69
|
+
import { compare } from "kordoc"
|
|
70
|
+
|
|
71
|
+
const diff = await compare(bufferA, bufferB)
|
|
72
|
+
// diff.stats → { added: 3, removed: 1, modified: 5, unchanged: 42 }
|
|
73
|
+
// diff.diffs → BlockDiff[] with cell-level table diffs
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
Cross-format supported: compare HWP against HWPX of the same document.
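
As a rough sketch of how the `diff.stats` counters might be reported, the helper below turns them into a one-line summary. The `DiffStats` interface mirrors the four fields shown in the comment above; `summarizeStats` is a hypothetical helper and not part of kordoc.

```typescript
// Field names taken from the `diff.stats` comment in the README example.
interface DiffStats {
  added: number
  removed: number
  modified: number
  unchanged: number
}

// Produce a one-line, human-readable summary of the counters.
function summarizeStats(stats: DiffStats): string {
  const total = stats.added + stats.removed + stats.modified + stats.unchanged
  const changed = total - stats.unchanged
  // Guard against division by zero when both documents are empty.
  const pct = total === 0 ? 0 : Math.round((changed / total) * 100)
  return `${changed} of ${total} blocks changed (${pct}%): +${stats.added} -${stats.removed} ~${stats.modified}`
}

console.log(summarizeStats({ added: 3, removed: 1, modified: 5, unchanged: 42 }))
// prints "9 of 51 blocks changed (18%): +3 -1 ~5"
```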
|
|
77
|
+
|
|
78
|
+
### Extract Form Fields
|
|
71
79
|
|
|
72
80
|
```typescript
|
|
73
|
-
import {
|
|
81
|
+
import { parse, extractFormFields } from "kordoc"
|
|
74
82
|
|
|
75
|
-
const
|
|
76
|
-
|
|
77
|
-
const
|
|
83
|
+
const result = await parse(buffer)
|
|
84
|
+
if (result.success) {
|
|
85
|
+
const form = extractFormFields(result.blocks)
|
|
86
|
+
// form.fields → [{ label: "성명", value: "홍길동", row: 0, col: 0 }, ...]
|
|
87
|
+
// form.confidence → 0.85
|
|
88
|
+
}
|
|
78
89
|
```
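
Extracted fields are often easier to consume as a label-to-value lookup. A minimal sketch, assuming each field has the `{ label, value, row, col }` shape shown in the comment above: `FormFieldLike` and `fieldsToMap` are hypothetical names, and the second sample row is made-up data.

```typescript
// Field shape taken from the `form.fields` comment; the type name is illustrative.
interface FormFieldLike {
  label: string
  value: string
  row: number
  col: number
}

// Build a label → value lookup. If a label repeats, the first occurrence wins.
function fieldsToMap(fields: FormFieldLike[]): Map<string, string> {
  const lookup = new Map<string, string>()
  for (const field of fields) {
    const key = field.label.trim()
    if (key !== "" && !lookup.has(key)) {
      lookup.set(key, field.value.trim())
    }
  }
  return lookup
}

const sampleFields: FormFieldLike[] = [
  { label: "성명", value: "홍길동", row: 0, col: 0 },
  { label: "소속 ", value: " 총무과", row: 0, col: 2 }, // made-up sample row
]
console.log(fieldsToMap(sampleFields).get("성명"))
```

A `Map` is used instead of a plain object so that labels such as `constructor` cannot collide with inherited properties.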
|
|
79
90
|
|
|
80
|
-
|
|
91
|
+
### Generate HWPX from Markdown
|
|
81
92
|
|
|
82
93
|
```typescript
|
|
83
|
-
import {
|
|
94
|
+
import { markdownToHwpx } from "kordoc"
|
|
84
95
|
|
|
85
|
-
|
|
96
|
+
const hwpxBuffer = await markdownToHwpx("# Title\n\nParagraph text\n\n| A | B |\n| --- | --- |\n| 1 | 2 |")
|
|
97
|
+
writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))
|
|
86
98
|
```
|
|
87
99
|
|
|
88
|
-
###
|
|
100
|
+
### Parse Specific Pages
|
|
89
101
|
|
|
90
|
-
```
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
102
|
+
```typescript
|
|
103
|
+
const result = await parse(buffer, { pages: "1-3" }) // pages 1-3 only
|
|
104
|
+
const result = await parse(buffer, { pages: [1, 5, 10] }) // specific pages
|
|
105
|
+
```
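
The `pages` option accepts either a range string or an array of page numbers. The sketch below shows one way such a spec could be expanded into a sorted list of unique, 1-based page numbers. `normalizePages` is a hypothetical helper and not kordoc's implementation; silently skipping out-of-range pages is an assumption of this sketch.

```typescript
// Hypothetical helper: expand a page spec into sorted, unique, 1-based page numbers.
type PageSpec = string | number[]

function normalizePages(spec: PageSpec, pageCount: number): number[] {
  const pages = new Set<number>()
  const add = (n: number): void => {
    // Skip anything that is not a whole page number inside the document.
    if (Number.isInteger(n) && n >= 1 && n <= pageCount) pages.add(n)
  }

  if (Array.isArray(spec)) {
    spec.forEach(add)
  } else {
    // Accept "N" or "N-M"; anything else is rejected.
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(spec)
    if (!match) throw new Error(`Invalid page range: "${spec}"`)
    const start = Number(match[1])
    const end = match[2] !== undefined ? Number(match[2]) : start
    if (start > end) throw new Error(`Invalid page range: "${spec}"`)
    for (let n = start; n <= Math.min(end, pageCount); n++) add(n)
  }

  return Array.from(pages).sort((a, b) => a - b)
}

console.log(normalizePages("1-3", 10)) // [ 1, 2, 3 ]
console.log(normalizePages([10, 1, 5, 5], 8)) // [ 1, 5 ]
```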
|
|
106
|
+
|
|
107
|
+
### OCR for Image-Based PDFs
|
|
108
|
+
|
|
109
|
+
```typescript
|
|
110
|
+
const result = await parse(buffer, {
|
|
111
|
+
ocr: async (pageImage, pageNumber, mimeType) => {
|
|
112
|
+
return await myOcrService.recognize(pageImage) // Tesseract, Claude Vision, etc.
|
|
113
|
+
}
|
|
114
|
+
})
|
|
95
115
|
```
|
|
96
116
|
|
|
97
|
-
|
|
117
|
+
## CLI
|
|
118
|
+
|
|
119
|
+
```bash
|
|
120
|
+
npx kordoc document.hwpx # stdout
|
|
121
|
+
npx kordoc document.hwp -o output.md # save to file
|
|
122
|
+
npx kordoc *.pdf -d ./converted/ # batch convert
|
|
123
|
+
npx kordoc report.hwpx --format json # JSON with blocks + metadata
|
|
124
|
+
npx kordoc report.hwpx --pages 1-3 # page range
|
|
125
|
+
npx kordoc watch ./incoming -d ./output # watch mode
|
|
126
|
+
npx kordoc watch ./docs --webhook https://api/hook # webhook notification
|
|
127
|
+
```
|
|
98
128
|
|
|
99
|
-
|
|
129
|
+
## MCP Server (Claude / Cursor / Windsurf)
|
|
100
130
|
|
|
101
131
|
```json
|
|
102
132
|
{
|
|
@@ -109,104 +139,67 @@ Works with **Claude Desktop**, **Cursor**, **Windsurf**, and any MCP-compatible
|
|
|
109
139
|
}
|
|
110
140
|
```
|
|
111
141
|
|
|
112
|
-
**Tools
|
|
142
|
+
**7 Tools:**
|
|
113
143
|
|
|
114
144
|
| Tool | Description |
|
|
115
145
|
|------|-------------|
|
|
116
|
-
| `parse_document` | Parse HWP/HWPX/PDF
|
|
146
|
+
| `parse_document` | Parse HWP/HWPX/PDF → Markdown with metadata |
|
|
117
147
|
| `detect_format` | Detect file format via magic bytes |
|
|
148
|
+
| `parse_metadata` | Extract metadata only (fast, no full parse) |
|
|
149
|
+
| `parse_pages` | Parse specific page range |
|
|
150
|
+
| `parse_table` | Extract Nth table from document |
|
|
151
|
+
| `compare_documents` | Diff two documents (cross-format) |
|
|
152
|
+
| `parse_form` | Extract form fields as structured JSON |
|
|
118
153
|
|
|
119
154
|
## API Reference
|
|
120
155
|
|
|
121
|
-
###
|
|
156
|
+
### Core
|
|
122
157
|
|
|
123
|
-
|
|
158
|
+
| Function | Description |
|
|
159
|
+
|----------|-------------|
|
|
160
|
+
| `parse(buffer, options?)` | Auto-detect format, parse to Markdown + IRBlock[] |
|
|
161
|
+
| `parseHwpx(buffer, options?)` | HWPX only |
|
|
162
|
+
| `parseHwp(buffer, options?)` | HWP 5.x only |
|
|
163
|
+
| `parsePdf(buffer, options?)` | PDF only |
|
|
164
|
+
| `detectFormat(buffer)` | Returns `"hwpx" \| "hwp" \| "pdf" \| "unknown"` |
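
Format detection by magic bytes can be sketched as follows. This is an illustration under stated assumptions, not kordoc's `detectFormat`: `sniffFormat` is a hypothetical name, it takes a `Uint8Array` rather than a buffer, and it treats any ZIP archive as HWPX, whereas a real detector would also need to inspect the archive contents. The signatures are the standard ones for ZIP, OLE2/CFB containers, and PDF.

```typescript
type DetectedFormat = "hwpx" | "hwp" | "pdf" | "unknown"

const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04": ZIP local file header
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // OLE2 compound file
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46] // "%PDF"

// True when `bytes` begins with every byte of `magic`, in order.
function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  if (bytes.length < magic.length) return false
  return magic.every((value, index) => bytes[index] === value)
}

// Hypothetical sniffer: classify a file by its leading bytes only.
function sniffFormat(bytes: Uint8Array): DetectedFormat {
  if (startsWith(bytes, ZIP_MAGIC)) return "hwpx" // assumption: any ZIP is treated as HWPX
  if (startsWith(bytes, CFB_MAGIC)) return "hwp"
  if (startsWith(bytes, PDF_MAGIC)) return "pdf"
  return "unknown"
}

console.log(sniffFormat(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]))) // "pdf"
```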
|
|
124
165
|
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
}
|
|
134
|
-
```
|
|
166
|
+
### Advanced
|
|
167
|
+
|
|
168
|
+
| Function | Description |
|
|
169
|
+
|----------|-------------|
|
|
170
|
+
| `compare(bufferA, bufferB, options?)` | Document diff at IR level |
|
|
171
|
+
| `extractFormFields(blocks)` | Form field recognition from IRBlock[] |
|
|
172
|
+
| `markdownToHwpx(markdown)` | Markdown → HWPX reverse conversion |
|
|
173
|
+
| `blocksToMarkdown(blocks)` | IRBlock[] → Markdown string |
|
|
135
174
|
|
|
136
175
|
### Types
|
|
137
176
|
|
|
138
177
|
```typescript
|
|
139
|
-
import type {
|
|
178
|
+
import type {
|
|
179
|
+
ParseResult, ParseSuccess, ParseFailure, FileType,
|
|
180
|
+
IRBlock, IRTable, IRCell, CellContext,
|
|
181
|
+
DocumentMetadata, ParseOptions, ErrorCode,
|
|
182
|
+
DiffResult, BlockDiff, CellDiff, DiffChangeType,
|
|
183
|
+
FormField, FormResult,
|
|
184
|
+
OcrProvider, WatchOptions,
|
|
185
|
+
} from "kordoc"
|
|
140
186
|
```
|
|
141
187
|
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
## Requirements
|
|
188
|
+
## Supported Formats
|
|
145
189
|
|
|
146
|
-
|
|
147
|
-
|
|
190
|
+
| Format | Engine | Features |
|
|
191
|
+
|--------|--------|----------|
|
|
192
|
+
| **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
|
|
193
|
+
| **HWP 5.x** (한컴 Legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
|
|
194
|
+
| **PDF** | pdfjs-dist | Line grouping, table detection, image PDF + OCR |
|
|
148
195
|
|
|
149
196
|
## Security
|
|
150
197
|
|
|
151
|
-
Production-grade
|
|
152
|
-
|
|
153
|
-
- **ZIP bomb protection** — Entry count validation, 100MB decompression limit, 500 entry cap
|
|
154
|
-
> **Known limitation:** Pre-check reads declared sizes from ZIP Central Directory, which an attacker can falsify. The primary defense is per-file cumulative size tracking during actual decompression. For fully untrusted input where streaming decompression is required, consider wrapping kordoc behind a size-limited sandbox.
|
|
155
|
-
- **XXE/Billion Laughs prevention** — Internal DTD subsets fully stripped from HWPX XML
|
|
156
|
-
- **Decompression bomb guard** — `maxOutputLength` on HWP5 zlib streams, cumulative 100MB limit across sections
|
|
157
|
-
- **PDF resource limits** — MAX_PAGES=5,000, cumulative text size 100MB cap, `doc.destroy()` cleanup
|
|
158
|
-
- **HWP5 record cap** — Max 500,000 records per section, prevents memory exhaustion from crafted files
|
|
159
|
-
- **Table dimension clamping** — rows/cols read from HWP5 binary clamped to MAX_ROWS/MAX_COLS before allocation
|
|
160
|
-
- **colSpan/rowSpan clamping** — Crafted merge values clamped to grid bounds (MAX_COLS=200, MAX_ROWS=10,000)
|
|
161
|
-
- **Path traversal guard** — Backslash normalization, `..`, absolute paths, Windows drive letters all rejected
|
|
162
|
-
- **MCP error sanitization** — Allowlist-based error filtering, unknown errors return generic message
|
|
163
|
-
- **MCP path restriction** — Only `.hwp`, `.hwpx`, `.pdf` extensions allowed, symlink resolution
|
|
164
|
-
- **File size limit** — 500MB max in MCP server and CLI
|
|
165
|
-
- **HWP5 section limit** — Max 100 sections in both primary and fallback paths
|
|
166
|
-
- **HWP5 control char fix** — Character code 10 (footnote/endnote) now correctly handled
|
|
167
|
-
|
|
168
|
-
## How It Works
|
|
169
|
-
|
|
170
|
-
```
|
|
171
|
-
┌─────────────┐ Magic Bytes ┌──────────────────┐
|
|
172
|
-
│ File Input │ ──── Detection ────→ │ Format Router │
|
|
173
|
-
└─────────────┘ └────────┬─────────┘
|
|
174
|
-
│
|
|
175
|
-
┌──────────────────────────┼──────────────────────────┐
|
|
176
|
-
│ │ │
|
|
177
|
-
┌─────▼─────┐ ┌───────▼───────┐ ┌──────▼──────┐
|
|
178
|
-
│ HWPX │ │ HWP 5.x │ │ PDF │
|
|
179
|
-
│ ZIP+XML │ │ OLE2+Record │ │ pdfjs-dist │
|
|
180
|
-
└─────┬─────┘ └───────┬───────┘ └──────┬──────┘
|
|
181
|
-
│ │ │
|
|
182
|
-
│ ┌──────────────────┤ │
|
|
183
|
-
│ │ │ │
|
|
184
|
-
┌─────▼───────▼─────┐ │ │
|
|
185
|
-
│ 2-Pass Table │ │ │
|
|
186
|
-
│ Builder (Grid) │ │ │
|
|
187
|
-
└─────────┬─────────┘ │ │
|
|
188
|
-
│ │ │
|
|
189
|
-
┌─────▼──────────────────────▼──────────────────────────▼─────┐
|
|
190
|
-
│ IRBlock[] │
|
|
191
|
-
│ (Intermediate Representation) │
|
|
192
|
-
└────────────────────────┬───────────────────────────────────┘
|
|
193
|
-
│
|
|
194
|
-
┌──────▼──────┐
|
|
195
|
-
│ Markdown │
|
|
196
|
-
│ Output │
|
|
197
|
-
└─────────────┘
|
|
198
|
-
```
|
|
198
|
+
Production-grade hardening: ZIP bomb protection, XXE/Billion Laughs prevention, decompression bomb guard, path traversal guard, MCP error sanitization, file size limits (500MB). See [SECURITY.md](./SECURITY.md) for details.
|
|
199
199
|
|
|
200
200
|
## Credits
|
|
201
201
|
|
|
202
|
-
Production-tested across 5 Korean government
|
|
203
|
-
- School curriculum plans (학교교육과정)
|
|
204
|
-
- Facility inspection reports (사전기획 보고서)
|
|
205
|
-
- Legal document annexes (법률 별표)
|
|
206
|
-
- Municipal newsletters (소식지)
|
|
207
|
-
- Public data extraction tools (공공데이터)
|
|
208
|
-
|
|
209
|
-
Thousands of real government documents parsed without breaking a sweat.
|
|
202
|
+
Production-tested across 5 Korean government projects: school curriculum plans, facility inspection reports, legal document annexes, municipal newsletters, and public data extraction tools. Thousands of real government documents parsed.
|
|
210
203
|
|
|
211
204
|
## License
|
|
212
205
|
|