kordoc 0.1.1 → 0.2.2

package/README.md CHANGED
@@ -1,45 +1,60 @@
1
1
  # kordoc
2
2
 
3
- ### 모두 파싱해버리겠다.
3
+ **모두 파싱해버리겠다** — Parse any Korean document to Markdown.
4
4
 
5
- > *"HWP든 HWPX든 PDF든 — 대한민국 문서라면 남김없이 파싱해버립니다."*
5
+ [![npm version](https://img.shields.io/npm/v/kordoc.svg)](https://www.npmjs.com/package/kordoc)
6
+ [![license](https://img.shields.io/npm/l/kordoc.svg)](https://github.com/chrisryugj/kordoc/blob/main/LICENSE)
7
+ [![node](https://img.shields.io/node/v/kordoc.svg)](https://nodejs.org)
6
8
 
7
- Built by a Korean civil servant who spent 7 years in the deepest circle of document hell. One day he snapped, and kordoc was born.
9
+ > *HWP, HWPX, PDF: 대한민국 문서라면 남김없이 파싱해버립니다.*
8
10
 
9
- Korean document formats — parsed, converted, delivered as clean Markdown. No COM automation, no Windows dependency, no tears.
11
+ [한국어](./README-KR.md)
12
+
13
+ ![kordoc demo](./demo.gif)
10
14
 
11
15
  ---
12
16
 
13
17
  ## Why kordoc?
14
18
 
15
- South Korea runs on HWP. The rest of the world has never heard of it. Government offices produce thousands of `.hwp` files daily, and extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary formats with zero documentation, and tables that break every parser.
19
+ South Korea's government runs on **HWP**, a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.
16
20
 
17
- **kordoc** was forged in this document hell. Its parsers have been battle-tested across 5 real Korean government projects, processing everything from school curriculum plans to facility inspection reports. If a Korean public servant wrote it, kordoc can parse it.
21
+ **kordoc** was born from that document hell. Built by a Korean civil servant who spent **7 years** buried under HWP files at a district office. One day he snapped — and decided to parse them all. Its parsers have been battle-tested across 5 real government projects, processing school curriculum plans, facility inspection reports, legal annexes, and municipal newsletters. If a Korean public servant wrote it, kordoc can parse it.
18
22
 
19
- | Format | Engine | Status |
20
- |--------|--------|--------|
21
- | **HWPX** (한컴 2020+) | ZIP + XML DOM walk | Stable |
22
- | **HWP 5.x** (한컴 레거시) | OLE2 binary + record parsing | Stable |
23
- | **PDF** | pdfjs-dist text extraction | Stable |
23
+ ---
24
24
 
25
- ### What makes it different
25
+ ## Features
26
26
 
27
- - **2-pass table builder** — Correct `colSpan`/`rowSpan` handling via grid algorithm. No more broken table layouts.
28
- - **Broken ZIP recovery** — Corrupted HWPX? We scan raw Local File Headers and still extract text.
29
- - **OPF manifest resolution** — Multi-section HWPX documents parsed in correct spine order.
30
- - **21 HWP5 control characters** — Full UTF-16LE decoding with extended/inline object skip.
31
- - **Image-based PDF detection** — Warns you when a scanned PDF can't be text-extracted.
27
+ - **HWP 5.x Binary Parsing** — OLE2 container + record stream + UTF-16LE. No Hancom Office needed.
28
+ - **HWPX ZIP Parsing** — OPF manifest resolution, multi-section, nested tables.
29
+ - **PDF Text Extraction** — Y-coordinate line grouping, table reconstruction, image PDF detection.
30
+ - **2-Pass Table Builder** — Correct `colSpan`/`rowSpan` via grid algorithm. No broken tables.
31
+ - **Broken ZIP Recovery** — Corrupted HWPX? Scans raw Local File Headers.
32
+ - **3 Interfaces** — npm library, CLI tool, and MCP server (Claude/Cursor).
33
+ - **Cross-Platform** — Pure JavaScript. Runs on Linux, macOS, Windows.
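
The "2-Pass Table Builder" bullet above describes resolving `colSpan`/`rowSpan` on a grid. The following is a minimal sketch of that general idea, not kordoc's actual implementation: the `SpanCell` type and `layoutGrid` function are hypothetical names invented for illustration. Pass 1 records which slots each merged cell covers, and pass 2 writes the text into a full rectangular grid.

```typescript
// Hypothetical input cell: text plus optional merge spans (illustrative names only).
interface SpanCell {
  text: string
  colSpan?: number
  rowSpan?: number
}

// Lay out rows of possibly merged cells on a rectangular grid of strings.
function layoutGrid(rows: SpanCell[][]): string[][] {
  // Pass 1: mark occupied slots in a sparse set to find each cell's anchor
  // column and the overall grid dimensions.
  const occupied = new Set<string>()
  const placements: { r: number; c: number; cell: SpanCell }[] = []
  let width = 0
  let height = rows.length
  rows.forEach((row, r) => {
    let c = 0
    for (const cell of row) {
      // Skip slots already covered by a rowSpan from an earlier row.
      while (occupied.has(`${r},${c}`)) c++
      const cs = Math.max(1, cell.colSpan ?? 1)
      const rs = Math.max(1, cell.rowSpan ?? 1)
      for (let dr = 0; dr < rs; dr++) {
        for (let dc = 0; dc < cs; dc++) occupied.add(`${r + dr},${c + dc}`)
      }
      placements.push({ r, c, cell })
      width = Math.max(width, c + cs)
      height = Math.max(height, r + rs)
      c += cs
    }
  })

  // Pass 2: allocate the full grid and write each cell's text at its anchor slot.
  const grid = Array.from({ length: height }, () => new Array<string>(width).fill(""))
  for (const { r, c, cell } of placements) grid[r][c] = cell.text
  return grid
}
```

Slots covered by a merge are left as empty strings here. A Markdown renderer could equally repeat the anchor text in them, which is a design choice this sketch does not make.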
32
34
 
33
- ---
35
+ ## Supported Formats
34
36
 
35
- ## Quick Start
37
+ | Format | Engine | Features |
38
+ |--------|--------|----------|
39
+ | **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
40
+ | **HWP 5.x** (한컴 레거시) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
41
+ | **PDF** | pdfjs-dist | Line grouping, table detection, image PDF warning |
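
The PDF row above mentions line grouping. As a rough illustration of Y-coordinate line grouping in general (not kordoc's code), the sketch below clusters positioned text fragments whose `y` values fall within a tolerance and orders each cluster left to right. The `TextItem` shape, the `groupIntoLines` name, and the default tolerance of 2 units are all assumptions made for this example.

```typescript
// Hypothetical positioned text fragment; a real PDF text layer carries more fields.
interface TextItem {
  str: string
  x: number
  y: number
}

// Group fragments into lines of text by vertical position.
function groupIntoLines(items: TextItem[], tolerance = 2): string[] {
  // PDF user space has y growing upward, so descending y reads top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x)
  const lines: { y: number; items: TextItem[] }[] = []
  for (const item of sorted) {
    const last = lines[lines.length - 1]
    // Join the current line if the fragment sits close enough to its baseline.
    if (last && Math.abs(last.y - item.y) <= tolerance) last.items.push(item)
    else lines.push({ y: item.y, items: [item] })
  }
  // Within a line, order fragments left to right and join them with spaces.
  return lines.map(line =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map(i => i.str)
      .join(" ")
  )
}
```

A fixed tolerance is the simplest choice. Scaling it by font size would likely cope better with mixed heading and body text.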
36
42
 
37
- ### As a library
43
+ ## Installation
38
44
 
39
45
  ```bash
40
46
  npm install kordoc
47
+
48
+ # PDF support requires pdfjs-dist (optional peer dependency)
49
+ npm install pdfjs-dist
41
50
  ```
42
51
 
52
+ > **Since v0.2.1**, `pdfjs-dist` is an optional peer dependency. Not needed for HWP/HWPX parsing.
53
+
54
+ ## Usage
55
+
56
+ ### As a Library
57
+
43
58
  ```typescript
44
59
  import { parse } from "kordoc"
45
60
  import { readFileSync } from "fs"
@@ -49,31 +64,25 @@ const result = await parse(buffer.buffer)
49
64
 
50
65
  if (result.success) {
51
66
  console.log(result.markdown)
52
- // → Clean markdown with tables, headings, and structure preserved
53
67
  }
54
68
  ```
55
69
 
56
- ### Format-specific parsing
70
+ #### Format-Specific
57
71
 
58
72
  ```typescript
59
73
  import { parseHwpx, parseHwp, parsePdf } from "kordoc"
60
74
 
61
- // HWPX (modern Hancom format)
62
- const hwpxResult = await parseHwpx(buffer)
63
-
64
- // HWP 5.x (legacy binary format)
65
- const hwpResult = await parseHwp(buffer)
66
-
67
- // PDF (text-based)
68
- const pdfResult = await parsePdf(buffer)
75
+ const hwpxResult = await parseHwpx(buffer) // HWPX
76
+ const hwpResult = await parseHwp(buffer) // HWP 5.x
77
+ const pdfResult = await parsePdf(buffer) // PDF
69
78
  ```
70
79
 
71
- ### Format detection
80
+ #### Format Detection
72
81
 
73
82
  ```typescript
74
- import { detectFormat, isHwpxFile, isOldHwpFile, isPdfFile } from "kordoc"
83
+ import { detectFormat } from "kordoc"
75
84
 
76
- const format = detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
85
+ detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
77
86
  ```
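
For readers curious what magic-byte detection involves, here is a minimal standalone sketch. It is not kordoc's `detectFormat`; the `sniffFormat` name and `SniffedFormat` type are hypothetical. It only checks the well-known container signatures: ZIP local file header, OLE2/CFB header, and `%PDF`.

```typescript
type SniffedFormat = "zip" | "ole2" | "pdf" | "unknown"

// Well-known leading signatures for the three container types.
const SIGNATURES: { format: Exclude<SniffedFormat, "unknown">; bytes: number[] }[] = [
  { format: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04", ZIP local file header
  { format: "ole2", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // OLE2/CFB header
  { format: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
]

// Compare the first bytes of the buffer against each signature.
function sniffFormat(buffer: ArrayBuffer): SniffedFormat {
  const head = new Uint8Array(buffer, 0, Math.min(8, buffer.byteLength))
  for (const { format, bytes } of SIGNATURES) {
    if (head.length >= bytes.length && bytes.every((b, i) => head[i] === b)) return format
  }
  return "unknown"
}
```

A ZIP signature alone does not prove a file is HWPX, and an OLE2 header alone does not prove it is HWP, so a real detector presumably inspects the container contents as well.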
78
87
 
79
88
  ### As a CLI
@@ -85,9 +94,9 @@ npx kordoc *.pdf -d ./converted/ # batch convert
85
94
  npx kordoc report.hwpx --format json # JSON with metadata
86
95
  ```
87
96
 
88
- ### As an MCP server (Claude / Cursor / Windsurf)
97
+ ### As an MCP Server
89
98
 
90
- kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
99
+ Works with **Claude Desktop**, **Cursor**, **Windsurf**, and any MCP-compatible client.
91
100
 
92
101
  ```json
93
102
  {
@@ -100,147 +109,100 @@ kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
100
109
  }
101
110
  ```
102
111
 
103
- This exposes two tools:
104
- - **`parse_document`** — Parse a HWP/HWPX/PDF file to Markdown
105
- - **`detect_format`** — Detect file format via magic bytes
112
+ **Tools exposed:**
106
113
 
107
- ---
114
+ | Tool | Description |
115
+ |------|-------------|
116
+ | `parse_document` | Parse HWP/HWPX/PDF file → Markdown |
117
+ | `detect_format` | Detect file format via magic bytes |
108
118
 
109
119
  ## API Reference
110
120
 
111
121
  ### `parse(buffer: ArrayBuffer): Promise<ParseResult>`
112
122
 
113
- Auto-detects format via magic bytes and parses to Markdown.
114
-
115
- ### `ParseResult`
123
+ Auto-detects format and converts to Markdown.
116
124
 
117
125
  ```typescript
118
126
  interface ParseResult {
119
127
  success: boolean
120
- markdown?: string // Extracted markdown text
128
+ markdown?: string
121
129
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
122
- isImageBased?: boolean // true if scanned PDF (no text extractable)
123
- pageCount?: number // PDF page count
124
- error?: string // Error message on failure
130
+ isImageBased?: boolean // scanned PDF detection
131
+ pageCount?: number // PDF only
132
+ error?: string
125
133
  }
126
134
  ```
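
One way a caller might branch on these fields is sketched below. The interface is copied from above so the block stands alone, and `summarize` is a hypothetical helper, not part of kordoc. The sketch assumes `isImageBased` may be set regardless of `success`, which the README does not state, so it checks that flag first.

```typescript
// Copied from the interface above so this sketch is self-contained.
interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean
  pageCount?: number
  error?: string
}

// Hypothetical helper: turn a result into a one-line status message.
function summarize(result: ParseResult): string {
  // Assumption: a scanned PDF is flagged here whether or not success is true.
  if (result.isImageBased) return "scanned PDF: no extractable text"
  if (!result.success) {
    return `failed (${result.fileType}): ${result.error ?? "no error message"}`
  }
  // pageCount is documented as PDF only, so include it only when present.
  const pages = result.pageCount !== undefined ? `, ${result.pageCount} pages` : ""
  return `ok (${result.fileType}${pages}): ${(result.markdown ?? "").length} chars`
}
```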
127
135
 
128
- ### Low-level exports
136
+ ### Low-Level Exports
129
137
 
130
138
  ```typescript
131
- // Table builder (2-pass colSpan/rowSpan algorithm)
132
- import { buildTable, blocksToMarkdown } from "kordoc"
133
-
134
- // Type definitions
139
+ import { buildTable, blocksToMarkdown, convertTableToText } from "kordoc"
135
140
  import type { IRBlock, IRTable, IRCell, CellContext } from "kordoc"
136
141
  ```
137
142
 
138
- ---
139
-
140
- ## Supported Formats
141
-
142
- ### HWPX (한컴오피스 2020+)
143
-
144
- ZIP-based XML format. kordoc reads the OPF manifest (`content.hpf`) for correct section ordering, walks the XML DOM for paragraphs and tables, and handles:
145
- - Multi-section documents
146
- - Nested tables (table inside a table cell)
147
- - `colSpan` / `rowSpan` merged cells
148
- - Corrupted ZIP archives (Local File Header fallback)
149
-
150
- ### HWP 5.x (한컴오피스 레거시)
151
-
152
- OLE2 Compound Binary format. kordoc parses the CFB container, decompresses section streams (zlib), reads HWP record structures, and extracts UTF-16LE text with full control character handling:
153
- - 21 control character types (line breaks, tabs, hyphens, NBSP, extended objects)
154
- - Encrypted/DRM file detection (fails fast with clear error)
155
- - Table extraction with grid-based cell arrangement
156
-
157
- ### PDF
158
-
159
- Server-side text extraction via pdfjs-dist:
160
- - Y-coordinate based line grouping
161
- - Gap-based cell/table detection
162
- - Image-based PDF detection (< 10 chars/page average)
163
- - Korean text line joining (조사/접속사 awareness)
164
-
165
- ---
166
-
167
143
  ## Requirements
168
144
 
169
145
  - **Node.js** >= 18
170
- - **pdfjs-dist** — Required only for PDF parsing. HWP/HWPX work without it.
171
-
172
- ---
173
-
174
- ## Credits
175
-
176
- Built by a Korean civil servant who spent years drowning in HWP files. Production-tested across 5 government technology projects — school curriculum plans, facility inspection reports, legal documents, and municipal newsletters. The parsers in this package have processed thousands of real Korean government documents without breaking a sweat.
177
-
178
- ---
179
-
180
- ## License
146
+ - **pdfjs-dist** >= 4.0.0: Optional. Only needed for PDF. HWP/HWPX work without it.
181
147
 
182
- MIT
148
+ ## Security
183
149
 
184
- ---
185
-
186
- <br>
187
-
188
- # kordoc (한국어)
189
-
190
- ### 모두 파싱해버리겠다.
191
-
192
- > *대한민국에서 둘째가라면 서러울 문서지옥. 거기서 7년 버틴 공무원이 만들었습니다.*
150
+ v0.2.2 security hardening (cumulative since v0.2.1):
193
151
 
194
- HWP, HWPX, PDF관공서에서 쏟아지는 모든 문서 포맷을 마크다운으로 변환하는 Node.js 라이브러리입니다. 학교 교육과정, 사전기획 보고서, 검토의견서, 소식지 원고... 뭐든 넣으면 파싱합니다.
152
+ - **ZIP bomb protection**: 100MB decompression limit, 500 entry cap
153
+ - **XXE/Billion Laughs prevention** — Internal DTD subsets fully stripped from HWPX XML
154
+ - **Decompression bomb guard** — `maxOutputLength` on HWP5 zlib streams, cumulative 100MB limit across sections
155
+ - **colSpan/rowSpan clamping** — Crafted merge values clamped to grid bounds (MAX_COLS=200, MAX_ROWS=10,000)
156
+ - **Broken ZIP path traversal guard** — `..` and absolute path entries rejected, filename length capped
157
+ - **MCP path restriction** — Only `.hwp`, `.hwpx`, `.pdf` extensions allowed
158
+ - **File size limit** — 500MB max in MCP server and CLI
159
+ - **PDF resource cleanup** — `doc.destroy()` prevents WASM memory leaks
160
+ - **Table memory guard** — Sparse Set-based allocation in Pass 1, 10,000 row cap
161
+ - **HWP5 section limit** — Max 100 sections to prevent infinite loop on corrupted files
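
Two of the guards listed above, span clamping and the path traversal check, are easy to picture as small pure functions. The sketch below shows one possible shape and is not kordoc's code. `clampSpan` and `isSafeEntryName` are hypothetical names, and the 255-character filename cap is an assumed value, since the list only says the length is capped. `MAX_COLS` and `MAX_ROWS` use the values quoted above.

```typescript
// Grid limits quoted in the list above.
const MAX_COLS = 200
const MAX_ROWS = 10000

// Clamp a merge span so a cell starting at `start` never extends past `limit`.
function clampSpan(span: number, start: number, limit: number): number {
  if (!Number.isFinite(span) || span < 1) return 1
  return Math.max(1, Math.min(Math.floor(span), limit - start))
}

// Reject ZIP entry names that are absolute, contain ".." segments, or are too long.
// The 255-character default is an assumption made for this example.
function isSafeEntryName(name: string, maxLength = 255): boolean {
  if (name.length === 0 || name.length > maxLength) return false
  // Absolute paths: POSIX root, Windows root, or a drive letter prefix.
  if (name.startsWith("/") || name.startsWith("\\") || /^[A-Za-z]:/.test(name)) return false
  // Any ".." segment could escape the extraction directory.
  return !name.split(/[\\/]/).includes("..")
}
```

For instance, a crafted `colSpan` of 5000 on a cell at column 190 would be clamped with `clampSpan(5000, 190, MAX_COLS)`, and a `rowSpan` would be clamped against `MAX_ROWS` in the same way.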
195
162
 
196
- ### 특징
163
+ ## How It Works
197
164
 
198
- - **한컴오피스 불필요** — COM 자동화 없이 바이너리 직접 파싱. Linux, Mac에서도 동작
199
- - **손상 파일 복구** — ZIP Central Directory가 깨진 HWPX도 Local File Header 스캔으로 복구
200
- - **병합 셀 완벽 처리** — 2-pass 그리드 알고리즘으로 colSpan/rowSpan 정확히 렌더링
201
- - **HWP5 바이너리 직접 파싱** — OLE2 컨테이너 → 레코드 스트림 → UTF-16LE 텍스트 추출
202
- - **이미지 PDF 감지** — 스캔된 PDF는 텍스트 추출 불가를 사전에 알려줌
203
- - **실전 검증 완료** — 5개 공공 프로젝트, 수천 건의 실제 관공서 문서에서 테스트됨
204
-
205
- ### 설치
206
-
207
- ```bash
208
- npm install kordoc
209
165
  ```
210
-
211
- ### 사용법
212
-
213
- ```typescript
214
- import { parse } from "kordoc"
215
- import { readFileSync } from "fs"
216
-
217
- const buffer = readFileSync("사업계획서.hwpx")
218
- const result = await parse(buffer.buffer)
219
-
220
- if (result.success) {
221
- console.log(result.markdown)
222
- }
166
+ ┌─────────────┐ Magic Bytes ┌──────────────────┐
167
+ │ File Input │ ──── Detection ────→ │ Format Router │
168
+ └─────────────┘ └────────┬─────────┘
169
+
170
+ ┌──────────────────────────┼──────────────────────────┐
171
+ │ │ │
172
+ ┌─────▼─────┐ ┌───────▼───────┐ ┌──────▼──────┐
173
+ │ HWPX │ │ HWP 5.x │ │ PDF │
174
+ │ ZIP+XML │ │ OLE2+Record │ │ pdfjs-dist │
175
+ └─────┬─────┘ └───────┬───────┘ └──────┬──────┘
176
+ │ │ │
177
+ │ ┌──────────────────┤ │
178
+ │ │ │ │
179
+ ┌─────▼───────▼─────┐ │ │
180
+ │ 2-Pass Table │ │ │
181
+ │ Builder (Grid) │ │ │
182
+ └─────────┬─────────┘ │ │
183
+ │ │ │
184
+ ┌─────▼──────────────────────▼──────────────────────────▼─────┐
185
+ │ IRBlock[] │
186
+ │ (Intermediate Representation) │
187
+ └────────────────────────┬───────────────────────────────────┘
188
+
189
+ ┌──────▼──────┐
190
+ │ Markdown │
191
+ │ Output │
192
+ └─────────────┘
223
193
  ```
224
194
 
225
- ### CLI
195
+ ## Credits
226
196
 
227
- ```bash
228
- npx kordoc 사업계획서.hwpx # 터미널 출력
229
- npx kordoc 보고서.hwp -o 보고서.md # 파일 저장
230
- npx kordoc *.pdf -d ./변환결과/ # 일괄 변환
231
- ```
197
+ Production-tested across 5 Korean government technology projects:
198
+ - School curriculum plans (학교교육과정)
199
+ - Facility inspection reports (사전기획 보고서)
200
+ - Legal document annexes (법률 별표)
201
+ - Municipal newsletters (소식지)
202
+ - Public data extraction tools (공공데이터)
232
203
 
233
- ### MCP 서버 (Claude / Cursor / Windsurf)
204
+ Thousands of real government documents parsed without breaking a sweat.
234
205
 
235
- Claude Desktop이나 Cursor에서 문서 파싱 도구로 바로 사용 가능합니다:
206
+ ## License
236
207
 
237
- ```json
238
- {
239
- "mcpServers": {
240
- "kordoc": {
241
- "command": "npx",
242
- "args": ["-y", "kordoc-mcp"]
243
- }
244
- }
245
- }
246
- ```
208
+ [MIT](./LICENSE)