kordoc 0.1.0 → 0.2.1
- package/README.md +102 -145
- package/dist/{chunk-P2BZDRLZ.js → chunk-C3XHIIJZ.js} +130 -94
- package/dist/cli.js +9 -6
- package/dist/index.cjs +127 -82
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +17 -9
- package/dist/index.d.ts +17 -9
- package/dist/index.js +126 -89
- package/dist/index.js.map +1 -1
- package/dist/mcp.js +30 -13782
- package/package.json +17 -7
package/README.md
CHANGED
|
@@ -1,45 +1,60 @@
|
|
|
1
1
|
# kordoc
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
**I'll parse them all.** Parse any Korean document to Markdown.
|
|
4
4
|
|
|
5
|
-
|
|
5
|
+
[npm package](https://www.npmjs.com/package/kordoc)
|
|
6
|
+
[License](https://github.com/chrisryugj/kordoc/blob/main/LICENSE)
|
|
7
|
+
[Node.js](https://nodejs.org)
|
|
6
8
|
|
|
7
|
-
|
|
9
|
+
> *HWP, HWPX, PDF: if it is a Korean document, kordoc parses every last bit of it.*
|
|
8
10
|
|
|
9
|
-
|
|
11
|
+
[한국어](./README-KR.md)
|
|
12
|
+
|
|
13
|
+

|
|
10
14
|
|
|
11
15
|
---
|
|
12
16
|
|
|
13
17
|
## Why kordoc?
|
|
14
18
|
|
|
15
|
-
South Korea runs on HWP
|
|
19
|
+
South Korea's government runs on **HWP** — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.
|
|
16
20
|
|
|
17
|
-
**kordoc** was
|
|
21
|
+
**kordoc** was born from that document hell. Built by a Korean civil servant who spent **7 years** buried under HWP files at a district office. One day he snapped — and decided to parse them all. Its parsers have been battle-tested across 5 real government projects, processing school curriculum plans, facility inspection reports, legal annexes, and municipal newsletters. If a Korean public servant wrote it, kordoc can parse it.
|
|
18
22
|
|
|
19
|
-
|
|
20
|
-
|--------|--------|--------|
|
|
21
|
-
| **HWPX** (Hancom 2020+) | ZIP + XML DOM walk | Stable |
|
|
22
|
-
| **HWP 5.x** (Hancom legacy) | OLE2 binary + record parsing | Stable |
|
|
23
|
-
| **PDF** | pdfjs-dist text extraction | Stable |
|
|
23
|
+
---
|
|
24
24
|
|
|
25
|
-
|
|
25
|
+
## Features
|
|
26
26
|
|
|
27
|
-
- **
|
|
28
|
-
- **
|
|
29
|
-
- **
|
|
30
|
-
- **
|
|
31
|
-
- **
|
|
27
|
+
- **HWP 5.x Binary Parsing** — OLE2 container + record stream + UTF-16LE. No Hancom Office needed.
|
|
28
|
+
- **HWPX ZIP Parsing** — OPF manifest resolution, multi-section, nested tables.
|
|
29
|
+
- **PDF Text Extraction** — Y-coordinate line grouping, table reconstruction, image PDF detection.
|
|
30
|
+
- **2-Pass Table Builder** — Correct `colSpan`/`rowSpan` via grid algorithm. No broken tables.
|
|
31
|
+
- **Broken ZIP Recovery** — Corrupted HWPX? Scans raw Local File Headers.
|
|
32
|
+
- **3 Interfaces** — npm library, CLI tool, and MCP server (Claude/Cursor).
|
|
33
|
+
- **Cross-Platform** — Pure JavaScript. Runs on Linux, macOS, Windows.
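The 2-pass grid idea named in the feature list can be sketched roughly as follows. This is an illustrative sketch only, not kordoc's actual `buildTable`; the `SpanCell` type and the `layoutGrid` function are hypothetical names. Pass 1 places each cell in the first free slot of its row and marks every slot its span covers; pass 2 emits a rectangular grid, leaving covered slots empty.

```typescript
// Hypothetical input shape: cells arrive row by row, each with optional spans.
interface SpanCell {
  text: string
  colSpan?: number
  rowSpan?: number
}

function layoutGrid(rows: SpanCell[][]): string[][] {
  const placed: { row: number; col: number; text: string }[] = []
  const taken = new Set<string>() // "row,col" keys of occupied slots
  let width = 0
  let height = rows.length

  // Pass 1: assign each cell an origin, skipping slots claimed by earlier spans.
  rows.forEach((cells, r) => {
    let c = 0
    for (const cell of cells) {
      while (taken.has(`${r},${c}`)) c++ // slot covered by a rowSpan from above
      const cs = Math.max(1, cell.colSpan ?? 1)
      const rs = Math.max(1, cell.rowSpan ?? 1)
      for (let dr = 0; dr < rs; dr++) {
        for (let dc = 0; dc < cs; dc++) taken.add(`${r + dr},${c + dc}`)
      }
      placed.push({ row: r, col: c, text: cell.text })
      width = Math.max(width, c + cs)
      height = Math.max(height, r + rs)
      c += cs
    }
  })

  // Pass 2: build a rectangular grid; covered slots stay as empty strings.
  const grid = Array.from({ length: height }, () => new Array<string>(width).fill(""))
  for (const p of placed) grid[p.row][p.col] = p.text
  return grid
}

const demoGrid = layoutGrid([
  [{ text: "A", rowSpan: 2 }, { text: "B", colSpan: 2 }],
  [{ text: "C" }, { text: "D" }],
])
console.log(demoGrid)
```

Because every row of the result has the same width, it can be rendered directly as a Markdown table without misaligned columns.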
|
|
32
34
|
|
|
33
|
-
|
|
35
|
+
## Supported Formats
|
|
34
36
|
|
|
35
|
-
|
|
37
|
+
| Format | Engine | Features |
|
|
38
|
+
|--------|--------|----------|
|
|
39
|
+
| **HWPX** (Hancom 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
|
|
40
|
+
| **HWP 5.x** (Hancom legacy) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
|
|
41
|
+
| **PDF** | pdfjs-dist | Line grouping, table detection, image PDF warning |
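To illustrate the "line grouping" entry in the PDF row: text items extracted from a PDF page carry coordinates, and items whose Y values are close together can be merged into one line. The sketch below is an assumption-laden illustration, not kordoc's implementation. The simplified `TextItem` shape, the `groupLines` name, and the tolerance of 2 units are all hypothetical.

```typescript
// Hypothetical, simplified text item: a string plus its page coordinates.
interface TextItem {
  str: string
  x: number
  y: number // PDF convention: y grows upward, so larger y is higher on the page
}

// Group items whose y lies within `tolerance` of the line's first item,
// ordering lines top to bottom and items within a line left to right.
function groupLines(items: TextItem[], tolerance = 2): string[] {
  const sorted = [...items].sort((a, b) => b.y - a.y)
  const lines: TextItem[][] = []
  for (const item of sorted) {
    const current = lines[lines.length - 1]
    if (current && Math.abs(current[0].y - item.y) <= tolerance) {
      current.push(item)
    } else {
      lines.push([item])
    }
  }
  return lines.map((line) =>
    line
      .sort((a, b) => a.x - b.x)
      .map((i) => i.str)
      .join(" ")
  )
}

const demoLines = groupLines([
  { str: "world", x: 60, y: 700.4 },
  { str: "Hello", x: 10, y: 700 },
  { str: "Second line", x: 10, y: 685 },
])
console.log(demoLines)
```

A fixed tolerance is the simplest choice; a real extractor would probably scale it with font size.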
|
|
36
42
|
|
|
37
|
-
|
|
43
|
+
## Installation
|
|
38
44
|
|
|
39
45
|
```bash
|
|
40
46
|
npm install kordoc
|
|
47
|
+
|
|
48
|
+
# PDF support requires pdfjs-dist (optional peer dependency)
|
|
49
|
+
npm install pdfjs-dist
|
|
41
50
|
```
|
|
42
51
|
|
|
52
|
+
> **Since v0.2.1**, `pdfjs-dist` is an optional peer dependency. Not needed for HWP/HWPX parsing.
|
|
53
|
+
|
|
54
|
+
## Usage
|
|
55
|
+
|
|
56
|
+
### As a Library
|
|
57
|
+
|
|
43
58
|
```typescript
|
|
44
59
|
import { parse } from "kordoc"
|
|
45
60
|
import { readFileSync } from "fs"
|
|
@@ -49,31 +64,25 @@ const result = await parse(buffer.buffer)
|
|
|
49
64
|
|
|
50
65
|
if (result.success) {
|
|
51
66
|
console.log(result.markdown)
|
|
52
|
-
// → Clean markdown with tables, headings, and structure preserved
|
|
53
67
|
}
|
|
54
68
|
```
|
|
55
69
|
|
|
56
|
-
|
|
70
|
+
#### Format-Specific
|
|
57
71
|
|
|
58
72
|
```typescript
|
|
59
73
|
import { parseHwpx, parseHwp, parsePdf } from "kordoc"
|
|
60
74
|
|
|
61
|
-
|
|
62
|
-
const
|
|
63
|
-
|
|
64
|
-
// HWP 5.x (legacy binary format)
|
|
65
|
-
const hwpResult = await parseHwp(buffer)
|
|
66
|
-
|
|
67
|
-
// PDF (text-based)
|
|
68
|
-
const pdfResult = await parsePdf(buffer)
|
|
75
|
+
const hwpxResult = await parseHwpx(buffer) // HWPX
|
|
76
|
+
const hwpResult = await parseHwp(buffer) // HWP 5.x
|
|
77
|
+
const pdfResult = await parsePdf(buffer) // PDF
|
|
69
78
|
```
|
|
70
79
|
|
|
71
|
-
|
|
80
|
+
#### Format Detection
|
|
72
81
|
|
|
73
82
|
```typescript
|
|
74
|
-
import { detectFormat
|
|
83
|
+
import { detectFormat } from "kordoc"
|
|
75
84
|
|
|
76
|
-
|
|
85
|
+
detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
|
|
77
86
|
```
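Detection by magic bytes generally means comparing the first few bytes of a file against known signatures. Below is a minimal sketch using the standard signatures for ZIP (`PK\x03\x04`), OLE2/CFB (`D0 CF 11 E0 A1 B1 1A E1`), and PDF (`%PDF`). The `sniffFormat` function is a hypothetical stand-in, not kordoc's `detectFormat`, and it treats any ZIP as HWPX, which a real detector would likely refine by inspecting the archive contents.

```typescript
type SniffedFormat = "hwpx" | "hwp" | "pdf" | "unknown"

// Well-known file signatures (not taken from kordoc's source).
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04": HWPX is a ZIP container
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // OLE2 compound file
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46] // "%PDF"

function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b)
}

// Hypothetical helper: classify a buffer by its leading bytes only.
function sniffFormat(buffer: ArrayBuffer): SniffedFormat {
  const bytes = new Uint8Array(buffer)
  if (startsWith(bytes, ZIP_MAGIC)) return "hwpx" // assumption: any ZIP counts as HWPX
  if (startsWith(bytes, CFB_MAGIC)) return "hwp"
  if (startsWith(bytes, PDF_MAGIC)) return "pdf"
  return "unknown"
}

const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]).buffer // "%PDF-"
console.log(sniffFormat(pdfHeader))
```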
|
|
78
87
|
|
|
79
88
|
### As a CLI
|
|
@@ -85,9 +94,9 @@ npx kordoc *.pdf -d ./converted/ # batch convert
|
|
|
85
94
|
npx kordoc report.hwpx --format json # JSON with metadata
|
|
86
95
|
```
|
|
87
96
|
|
|
88
|
-
### As an MCP
|
|
97
|
+
### As an MCP Server
|
|
89
98
|
|
|
90
|
-
|
|
99
|
+
Works with **Claude Desktop**, **Cursor**, **Windsurf**, and any MCP-compatible client.
|
|
91
100
|
|
|
92
101
|
```json
|
|
93
102
|
{
|
|
@@ -100,147 +109,95 @@ kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
|
|
|
100
109
|
}
|
|
101
110
|
```
|
|
102
111
|
|
|
103
|
-
|
|
104
|
-
- **`parse_document`** — Parse a HWP/HWPX/PDF file to Markdown
|
|
105
|
-
- **`detect_format`** — Detect file format via magic bytes
|
|
112
|
+
**Tools exposed:**
|
|
106
113
|
|
|
107
|
-
|
|
114
|
+
| Tool | Description |
|
|
115
|
+
|------|-------------|
|
|
116
|
+
| `parse_document` | Parse HWP/HWPX/PDF file → Markdown |
|
|
117
|
+
| `detect_format` | Detect file format via magic bytes |
|
|
108
118
|
|
|
109
119
|
## API Reference
|
|
110
120
|
|
|
111
121
|
### `parse(buffer: ArrayBuffer): Promise<ParseResult>`
|
|
112
122
|
|
|
113
|
-
Auto-detects format
|
|
114
|
-
|
|
115
|
-
### `ParseResult`
|
|
123
|
+
Auto-detects format and converts to Markdown.
|
|
116
124
|
|
|
117
125
|
```typescript
|
|
118
126
|
interface ParseResult {
|
|
119
127
|
success: boolean
|
|
120
|
-
markdown?: string
|
|
128
|
+
markdown?: string
|
|
121
129
|
fileType: "hwpx" | "hwp" | "pdf" | "unknown"
|
|
122
|
-
isImageBased?: boolean //
|
|
123
|
-
pageCount?: number // PDF
|
|
124
|
-
error?: string
|
|
130
|
+
isImageBased?: boolean // scanned PDF detection
|
|
131
|
+
pageCount?: number // PDF only
|
|
132
|
+
error?: string
|
|
125
133
|
}
|
|
126
134
|
```
|
|
127
135
|
|
|
128
|
-
### Low-
|
|
136
|
+
### Low-Level Exports
|
|
129
137
|
|
|
130
138
|
```typescript
|
|
131
|
-
|
|
132
|
-
import { buildTable, blocksToMarkdown } from "kordoc"
|
|
133
|
-
|
|
134
|
-
// Type definitions
|
|
139
|
+
import { buildTable, blocksToMarkdown, convertTableToText } from "kordoc"
|
|
135
140
|
import type { IRBlock, IRTable, IRCell, CellContext } from "kordoc"
|
|
136
141
|
```
|
|
137
142
|
|
|
138
|
-
---
|
|
139
|
-
|
|
140
|
-
## Supported Formats
|
|
141
|
-
|
|
142
|
-
### HWPX (Hancom Office 2020+)
|
|
143
|
-
|
|
144
|
-
ZIP-based XML format. kordoc reads the OPF manifest (`content.hpf`) for correct section ordering, walks the XML DOM for paragraphs and tables, and handles:
|
|
145
|
-
- Multi-section documents
|
|
146
|
-
- Nested tables (table inside a table cell)
|
|
147
|
-
- `colSpan` / `rowSpan` merged cells
|
|
148
|
-
- Corrupted ZIP archives (Local File Header fallback)
|
|
149
|
-
|
|
150
|
-
### HWP 5.x (Hancom Office legacy)
|
|
151
|
-
|
|
152
|
-
OLE2 Compound Binary format. kordoc parses the CFB container, decompresses section streams (zlib), reads HWP record structures, and extracts UTF-16LE text with full control character handling:
|
|
153
|
-
- 21 control character types (line breaks, tabs, hyphens, NBSP, extended objects)
|
|
154
|
-
- Encrypted/DRM file detection (fails fast with clear error)
|
|
155
|
-
- Table extraction with grid-based cell arrangement
|
|
156
|
-
|
|
157
|
-
### PDF
|
|
158
|
-
|
|
159
|
-
Server-side text extraction via pdfjs-dist:
|
|
160
|
-
- Y-coordinate based line grouping
|
|
161
|
-
- Gap-based cell/table detection
|
|
162
|
-
- Image-based PDF detection (< 10 chars/page average)
|
|
163
|
-
- Korean text line joining (particle/conjunction awareness)
|
|
164
|
-
|
|
165
|
-
---
|
|
166
|
-
|
|
167
143
|
## Requirements
|
|
168
144
|
|
|
169
145
|
- **Node.js** >= 18
|
|
170
|
-
- **pdfjs-dist** —
|
|
171
|
-
|
|
172
|
-
---
|
|
173
|
-
|
|
174
|
-
## Credits
|
|
175
|
-
|
|
176
|
-
Built by a Korean civil servant who spent years drowning in HWP files. Production-tested across 5 government technology projects — school curriculum plans, facility inspection reports, legal documents, and municipal newsletters. The parsers in this package have processed thousands of real Korean government documents without breaking a sweat.
|
|
177
|
-
|
|
178
|
-
---
|
|
179
|
-
|
|
180
|
-
## License
|
|
146
|
+
- **pdfjs-dist** >= 4.0.0 — Optional. Only needed for PDF. HWP/HWPX work without it.
|
|
181
147
|
|
|
182
|
-
|
|
148
|
+
## Security
|
|
183
149
|
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
<br>
|
|
187
|
-
|
|
188
|
-
# kordoc (Korean)
|
|
189
|
-
|
|
190
|
-
### I'll parse them all.
|
|
191
|
-
|
|
192
|
-
> *South Korea's document hell is second to none. Built by a civil servant who survived seven years of it.*
|
|
150
|
+
v0.2.1 includes the following security hardening:
|
|
193
151
|
|
|
194
|
-
|
|
152
|
+
- **ZIP bomb protection** — 100MB decompression limit, 500 entry cap
|
|
153
|
+
- **XXE prevention** — DOCTYPE declarations stripped from HWPX XML
|
|
154
|
+
- **Decompression bomb guard** — `maxOutputLength` on HWP5 zlib streams
|
|
155
|
+
- **MCP path restriction** — Only `.hwp`, `.hwpx`, `.pdf` extensions allowed
|
|
156
|
+
- **Table memory guard** — 10,000 row cap on table builder
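The decompression guard can be illustrated with Node's built-in `zlib`, whose inflate functions accept a `maxOutputLength` option and throw once the output would exceed it. This is a sketch under assumptions: it uses raw-deflate input and an illustrative 100 MB default cap, and `safeInflate` is a hypothetical helper, not kordoc's code.

```typescript
import { deflateRawSync, inflateRawSync } from "node:zlib"

// Hypothetical helper: inflate a raw-deflate stream, refusing to produce more
// than `limit` bytes. Returns null instead of throwing when the cap is hit
// or the stream is invalid.
function safeInflate(compressed: Uint8Array, limit = 100 * 1024 * 1024): Uint8Array | null {
  try {
    return inflateRawSync(compressed, { maxOutputLength: limit })
  } catch {
    return null
  }
}

// 1 MiB of zeros compresses to roughly a kilobyte: a miniature "bomb".
const bomb = deflateRawSync(new Uint8Array(1024 * 1024))
const blocked = safeInflate(bomb, 64 * 1024) // cap below the real size
const allowed = safeInflate(bomb) // default cap is large enough
console.log(blocked === null, allowed?.length)
```

Capping the output at inflate time means memory stays bounded even when the compressed input is tiny.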
|
|
195
157
|
|
|
196
|
-
|
|
158
|
+
## How It Works
|
|
197
159
|
|
|
198
|
-
- **No Hancom Office required**: parses the binary directly without COM automation. Also runs on Linux and Mac.
|
|
199
|
-
- **Corrupted file recovery**: HWPX files with a broken ZIP Central Directory are recovered by scanning Local File Headers.
|
|
200
|
-
- **Merged cells handled correctly**: a 2-pass grid algorithm renders colSpan/rowSpan accurately.
|
|
201
|
-
- **Direct HWP5 binary parsing**: OLE2 container → record stream → UTF-16LE text extraction.
|
|
202
|
-
- **Image PDF detection**: warns up front that text cannot be extracted from scanned PDFs.
|
|
203
|
-
- **Field-tested**: tested on 5 public-sector projects and thousands of real government documents.
|
|
204
|
-
|
|
205
|
-
### Installation
|
|
206
|
-
|
|
207
|
-
```bash
|
|
208
|
-
npm install kordoc
|
|
209
160
|
```
|
|
210
|
-
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
|
|
214
|
-
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
|
|
218
|
-
|
|
219
|
-
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
161
|
+
┌─────────────┐ Magic Bytes ┌──────────────────┐
|
|
162
|
+
│ File Input │ ──── Detection ────→ │ Format Router │
|
|
163
|
+
└─────────────┘ └────────┬─────────┘
|
|
164
|
+
│
|
|
165
|
+
┌──────────────────────────┼──────────────────────────┐
|
|
166
|
+
│ │ │
|
|
167
|
+
┌─────▼─────┐ ┌───────▼───────┐ ┌──────▼──────┐
|
|
168
|
+
│ HWPX │ │ HWP 5.x │ │ PDF │
|
|
169
|
+
│ ZIP+XML │ │ OLE2+Record │ │ pdfjs-dist │
|
|
170
|
+
└─────┬─────┘ └───────┬───────┘ └──────┬──────┘
|
|
171
|
+
│ │ │
|
|
172
|
+
│ ┌──────────────────┤ │
|
|
173
|
+
│ │ │ │
|
|
174
|
+
┌─────▼───────▼─────┐ │ │
|
|
175
|
+
│ 2-Pass Table │ │ │
|
|
176
|
+
│ Builder (Grid) │ │ │
|
|
177
|
+
└─────────┬─────────┘ │ │
|
|
178
|
+
│ │ │
|
|
179
|
+
┌─────▼──────────────────────▼──────────────────────────▼─────┐
|
|
180
|
+
│ IRBlock[] │
|
|
181
|
+
│ (Intermediate Representation) │
|
|
182
|
+
└────────────────────────┬───────────────────────────────────┘
|
|
183
|
+
│
|
|
184
|
+
┌──────▼──────┐
|
|
185
|
+
│ Markdown │
|
|
186
|
+
│ Output │
|
|
187
|
+
└─────────────┘
|
|
223
188
|
```
|
|
224
189
|
|
|
225
|
-
|
|
190
|
+
## Credits
|
|
226
191
|
|
|
227
|
-
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
|
|
192
|
+
Production-tested across 5 Korean government technology projects:
|
|
193
|
+
- School curriculum plans (학교교육과정)
|
|
194
|
+
- Facility inspection reports (사전기획 보고서)
|
|
195
|
+
- Legal document annexes (법률 별표)
|
|
196
|
+
- Municipal newsletters (소식지)
|
|
197
|
+
- Public data extraction tools (공공데이터)
|
|
232
198
|
|
|
233
|
-
|
|
199
|
+
Thousands of real government documents parsed without breaking a sweat.
|
|
234
200
|
|
|
235
|
-
|
|
201
|
+
## License
|
|
236
202
|
|
|
237
|
-
|
|
238
|
-
{
|
|
239
|
-
"mcpServers": {
|
|
240
|
-
"kordoc": {
|
|
241
|
-
"command": "npx",
|
|
242
|
-
"args": ["-y", "kordoc-mcp"]
|
|
243
|
-
}
|
|
244
|
-
}
|
|
245
|
-
}
|
|
246
|
-
```
|
|
203
|
+
[MIT](./LICENSE)
|