kordoc 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 chrisryugj
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,246 @@
+ # kordoc
+
+ ### I will parse them all.
+
+ > *"HWP, HWPX, or PDF: if it is a Korean document, kordoc parses every last bit of it."*
+
+ Built by a Korean civil servant who spent 7 years in the deepest circle of document hell. One day he snapped, and kordoc was born.
+
+ Korean document formats: parsed, converted, delivered as clean Markdown. No COM automation, no Windows dependency, no tears.
+
+ ---
+
+ ## Why kordoc?
+
+ South Korea runs on HWP. The rest of the world has never heard of it. Government offices produce thousands of `.hwp` files daily, and extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary formats with zero documentation, and tables that break every parser.
+
+ **kordoc** was forged in this document hell. Its parsers have been battle-tested across 5 real Korean government projects, processing everything from school curriculum plans to facility inspection reports. If a Korean public servant wrote it, kordoc can parse it.
+
+ | Format | Engine | Status |
+ |--------|--------|--------|
+ | **HWPX** (Hancom 2020+) | ZIP + XML DOM walk | Stable |
+ | **HWP 5.x** (Hancom legacy) | OLE2 binary + record parsing | Stable |
+ | **PDF** | pdfjs-dist text extraction | Stable |
+
+ ### What makes it different
+
+ - **2-pass table builder**: Correct `colSpan`/`rowSpan` handling via a grid algorithm. No more broken table layouts.
+ - **Broken ZIP recovery**: Corrupted HWPX? We scan raw Local File Headers and still extract text.
+ - **OPF manifest resolution**: Multi-section HWPX documents are parsed in correct spine order.
+ - **21 HWP5 control characters**: Full UTF-16LE decoding with extended/inline object skip.
+ - **Image-based PDF detection**: Warns you when a scanned PDF can't be text-extracted.
+
+ ---
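+
+ ## Quick Start
+
+ ### As a library
+
+ ```bash
+ npm install kordoc
+ ```
+
+ ```typescript
+ import { parse } from "kordoc"
+ import { readFileSync } from "fs"
+
+ const buffer = readFileSync("document.hwpx")
+ // A Node Buffer can be a view into a larger shared ArrayBuffer, so pass only
+ // the bytes that belong to this file rather than the whole `buffer.buffer`.
+ const arrayBuffer = buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
+ const result = await parse(arrayBuffer)
+
+ if (result.success) {
+   console.log(result.markdown)
+   // → Clean markdown with tables, headings, and structure preserved
+ } else {
+   console.error(result.error)
+ }
+ ```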
+
+ ### Format-specific parsing
+
+ ```typescript
+ import { parseHwpx, parseHwp, parsePdf } from "kordoc"
+
+ // HWPX (modern Hancom format)
+ const hwpxResult = await parseHwpx(buffer)
+
+ // HWP 5.x (legacy binary format)
+ const hwpResult = await parseHwp(buffer)
+
+ // PDF (text-based)
+ const pdfResult = await parsePdf(buffer)
+ ```
+
+ ### Format detection
+
+ ```typescript
+ import { detectFormat, isHwpxFile, isOldHwpFile, isPdfFile } from "kordoc"
+
+ const format = detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
+ ```
+
+ ### As a CLI
+
+ ```bash
+ npx kordoc document.hwpx                 # stdout
+ npx kordoc document.hwp -o output.md     # save to file
+ npx kordoc *.pdf -d ./converted/         # batch convert
+ npx kordoc report.hwpx --format json     # JSON with metadata
+ ```
+
+ ### As an MCP server (Claude / Cursor / Windsurf)
+
+ kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
+
+ ```json
+ {
+   "mcpServers": {
+     "kordoc": {
+       "command": "npx",
+       "args": ["-y", "kordoc-mcp"]
+     }
+   }
+ }
+ ```
+
+ This exposes two tools:
+ - **`parse_document`**: Parse a HWP/HWPX/PDF file to Markdown
+ - **`detect_format`**: Detect file format via magic bytes
+
+ ---
+
+ ## API Reference
+
+ ### `parse(buffer: ArrayBuffer): Promise<ParseResult>`
+
+ Auto-detects format via magic bytes and parses to Markdown.
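
To make "magic bytes" concrete, here is a minimal, self-contained sketch of signature sniffing. It is not kordoc's `detectFormat`; the names `sniffFormat` and `startsWith` are hypothetical. The three signatures themselves are the standard ones for ZIP, OLE2 compound files, and PDF.

```typescript
// Hypothetical sketch of magic-byte sniffing; not kordoc's actual detectFormat.
type SniffedFormat = "hwpx" | "hwp" | "pdf" | "unknown"

const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04": ZIP local file header
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // OLE2 compound file
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46] // "%PDF"

function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  if (bytes.length < magic.length) return false
  return magic.every((value, index) => bytes[index] === value)
}

function sniffFormat(buffer: ArrayBuffer): SniffedFormat {
  // Only the first 8 bytes are needed for these three signatures.
  const bytes = new Uint8Array(buffer, 0, Math.min(buffer.byteLength, 8))
  if (startsWith(bytes, ZIP_MAGIC)) return "hwpx" // assumption: any ZIP is treated as HWPX
  if (startsWith(bytes, CFB_MAGIC)) return "hwp" // assumption: any OLE2 file is treated as HWP
  if (startsWith(bytes, PDF_MAGIC)) return "pdf"
  return "unknown"
}
```

A ZIP signature alone cannot tell HWPX from other ZIP-based formats, and an OLE2 signature also matches legacy Office files, so a real detector would likely inspect the container contents as a second step.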
+
+ ### `ParseResult`
+
+ ```typescript
+ interface ParseResult {
+   success: boolean
+   markdown?: string       // Extracted markdown text
+   fileType: "hwpx" | "hwp" | "pdf" | "unknown"
+   isImageBased?: boolean  // true if scanned PDF (no text extractable)
+   pageCount?: number      // PDF page count
+   error?: string          // Error message on failure
+ }
+ ```
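
One way a caller might consume this shape is sketched below. The interface is copied from the README so the block is self-contained, and `markdownOrThrow` is a hypothetical helper, not part of kordoc. Whether `isImageBased` is reported together with `success: true` or `success: false` is not stated in the README, so the sketch checks it first.

```typescript
// Copied from the documented ParseResult so this sketch runs on its own.
interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean
  pageCount?: number
  error?: string
}

// Hypothetical helper: returns the Markdown or throws a descriptive error.
function markdownOrThrow(result: ParseResult): string {
  if (result.isImageBased) {
    throw new Error("Scanned (image-based) PDF: no extractable text")
  }
  if (!result.success) {
    throw new Error(`Failed to parse ${result.fileType} document: ${result.error ?? "unknown error"}`)
  }
  return result.markdown ?? ""
}
```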
+
+ ### Low-level exports
+
+ ```typescript
+ // Table builder (2-pass colSpan/rowSpan algorithm)
+ import { buildTable, blocksToMarkdown } from "kordoc"
+
+ // Type definitions
+ import type { IRBlock, IRTable, IRCell, CellContext } from "kordoc"
+ ```
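
The README does not show how the 2-pass table builder works internally, so the following is only a sketch of the general grid technique under stated assumptions. `SketchCell`, `buildGrid`, and `gridToMarkdown` are hypothetical names, and kordoc's `IRCell` and `buildTable` may look quite different. Pass 1 places each cell on a grid while skipping slots already claimed by an earlier span. Pass 2 pads every row to the same width so the result can be rendered as a rectangular Markdown table.

```typescript
// Hypothetical cell shape for illustration; kordoc's IRCell may differ.
interface SketchCell {
  text: string
  colSpan?: number
  rowSpan?: number
}

function buildGrid(rows: SketchCell[][]): string[][] {
  const grid: (string | undefined)[][] = []
  let width = 0
  // Pass 1: place each cell, skipping slots already claimed by an earlier span.
  rows.forEach((row, r) => {
    const line = (grid[r] ??= [])
    let c = 0
    for (const cell of row) {
      while (line[c] !== undefined) c++ // slot taken by a rowSpan from above
      const colSpan = Math.max(1, cell.colSpan ?? 1)
      const rowSpan = Math.min(rows.length - r, Math.max(1, cell.rowSpan ?? 1))
      for (let dr = 0; dr < rowSpan; dr++) {
        const target = (grid[r + dr] ??= [])
        for (let dc = 0; dc < colSpan; dc++) {
          // Only the top-left slot keeps the text; covered slots stay blank.
          target[c + dc] = dr === 0 && dc === 0 ? cell.text : ""
        }
      }
      c += colSpan
      width = Math.max(width, c)
    }
  })
  // Pass 2: pad every row to the same width so the table is rectangular.
  return grid.map((line) => Array.from({ length: width }, (_, c) => line[c] ?? ""))
}

function gridToMarkdown(grid: string[][]): string {
  const header = grid[0]
  if (!header) return ""
  const toRow = (cells: string[]) => `| ${cells.join(" | ")} |`
  const divider = `| ${header.map(() => "---").join(" | ")} |`
  return [toRow(header), divider, ...grid.slice(1).map(toRow)].join("\n")
}
```

Markdown pipe tables have no native merged cells, so this sketch leaves covered slots empty. Repeating the text in each covered slot would be an equally valid choice.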
+
+ ---
+
+ ## Supported Formats
+
+ ### HWPX (Hancom Office 2020+)
+
+ ZIP-based XML format. kordoc reads the OPF manifest (`content.hpf`) for correct section ordering, walks the XML DOM for paragraphs and tables, and handles:
+ - Multi-section documents
+ - Nested tables (table inside a table cell)
+ - `colSpan` / `rowSpan` merged cells
+ - Corrupted ZIP archives (Local File Header fallback)
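
The Local File Header fallback can be illustrated with a short sketch. This is an assumption about the general approach, not kordoc's code, and `scanLocalFileHeaders` is a hypothetical name. It ignores the central directory entirely and walks the raw bytes looking for the `PK\x03\x04` signature, using the field offsets from the ZIP format's local file header layout.

```typescript
import { inflateRawSync } from "node:zlib"

interface RecoveredEntry {
  name: string
  data: Buffer
}

const LOCAL_HEADER_SIGNATURE = 0x04034b50 // bytes "PK\x03\x04" read as little-endian

// Hypothetical recovery routine: scans for local file headers without the central directory.
function scanLocalFileHeaders(zip: Buffer): RecoveredEntry[] {
  const entries: RecoveredEntry[] = []
  let offset = 0
  while (offset + 30 <= zip.length) {
    if (zip.readUInt32LE(offset) !== LOCAL_HEADER_SIGNATURE) {
      offset++
      continue
    }
    const flags = zip.readUInt16LE(offset + 6)
    const method = zip.readUInt16LE(offset + 8)
    const compressedSize = zip.readUInt32LE(offset + 18)
    const nameLength = zip.readUInt16LE(offset + 26)
    const extraLength = zip.readUInt16LE(offset + 28)
    const dataStart = offset + 30 + nameLength + extraLength
    const dataEnd = dataStart + compressedSize
    // Flag bit 3 means the sizes live in a trailing data descriptor; this sketch skips those.
    const sizeKnown = (flags & 0x0008) === 0
    if (!sizeKnown || dataEnd > zip.length) {
      offset += 4
      continue
    }
    // Assumption: entry names are UTF-8.
    const name = zip.toString("utf8", offset + 30, offset + 30 + nameLength)
    const raw = zip.subarray(dataStart, dataEnd)
    try {
      if (method === 0) entries.push({ name, data: Buffer.from(raw) }) // stored
      else if (method === 8) entries.push({ name, data: inflateRawSync(raw) }) // deflate
    } catch {
      // Damaged stream: skip this entry and keep scanning.
    }
    offset = dataEnd
  }
  return entries
}
```

Archives written in streaming mode set flag bit 3 and leave the header sizes at zero. Recovering those would need a search for the next signature or the data descriptor, which this sketch leaves out.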
+
+ ### HWP 5.x (Hancom Office legacy)
+
+ OLE2 Compound Binary format. kordoc parses the CFB container, decompresses section streams (zlib), reads HWP record structures, and extracts UTF-16LE text with full control character handling:
+ - 21 control character types (line breaks, tabs, hyphens, NBSP, extended objects)
+ - Encrypted/DRM file detection (fails fast with clear error)
+ - Table extraction with grid-based cell arrangement
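
As an illustration of "reads HWP record structures", the sketch below walks a decompressed section stream record by record. It assumes the record header layout described in the published HWP 5.0 format documentation: a 32-bit little-endian word holding a 10-bit tag ID, a 10-bit level, and a 12-bit size, where a size of `0xFFF` means the real size follows as a separate 32-bit value. Treat that layout as an assumption to check against the specification. `readRecordHeader` and `walkRecords` are hypothetical names, not kordoc exports.

```typescript
interface RecordHeader {
  tagId: number
  level: number
  size: number
  headerLength: number // 4, or 8 when the extended size field is present
}

// Assumed layout: bits 0-9 tag ID, bits 10-19 level, bits 20-31 size.
function readRecordHeader(view: DataView, offset: number): RecordHeader | null {
  if (offset + 4 > view.byteLength) return null
  const packed = view.getUint32(offset, true) // little-endian
  const tagId = packed & 0x3ff
  const level = (packed >>> 10) & 0x3ff
  const size = packed >>> 20
  if (size !== 0xfff) return { tagId, level, size, headerLength: 4 }
  // Extended form: the real size follows as a separate 32-bit value.
  if (offset + 8 > view.byteLength) return null
  return { tagId, level, size: view.getUint32(offset + 4, true), headerLength: 8 }
}

function* walkRecords(stream: Uint8Array): Generator<RecordHeader & { payload: Uint8Array }> {
  const view = new DataView(stream.buffer, stream.byteOffset, stream.byteLength)
  let offset = 0
  while (offset < stream.byteLength) {
    const header = readRecordHeader(view, offset)
    if (!header) break
    const start = offset + header.headerLength
    const end = start + header.size
    if (end > stream.byteLength) break // truncated record
    yield { ...header, payload: stream.subarray(start, end) }
    offset = end
  }
}
```

A text-bearing payload can then be decoded with `new TextDecoder("utf-16le").decode(payload)`, after which the control characters listed above still need to be filtered or mapped.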
+
+ ### PDF
+
+ Server-side text extraction via pdfjs-dist:
+ - Y-coordinate based line grouping
+ - Gap-based cell/table detection
+ - Image-based PDF detection (< 10 chars/page average)
+ - Korean text line joining (particle/conjunction awareness)
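
Two of the steps above, Y-coordinate line grouping and the chars-per-page check, can be sketched without pdfjs-dist. The simplified `TextItem` shape, the 2-unit tolerance, and the function names are assumptions made for illustration. Only the threshold of 10 characters per page comes from the list above, and whether kordoc counts whitespace toward it is not stated.

```typescript
// Hypothetical, simplified text item. pdfjs-dist items carry `str` plus a
// `transform` matrix whose last two entries are the x and y position.
interface TextItem {
  text: string
  x: number
  y: number
}

// Groups items whose y positions lie within `tolerance` of a line's first item.
function groupIntoLines(items: TextItem[], tolerance = 2): string[] {
  // PDF y grows upward, so descending y reads top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y)
  const lines: { y: number; items: TextItem[] }[] = []
  for (const item of sorted) {
    const current = lines[lines.length - 1]
    if (current && Math.abs(current.y - item.y) <= tolerance) current.items.push(item)
    else lines.push({ y: item.y, items: [item] })
  }
  // Within a line, order left to right before joining.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(" "),
  )
}

// Flags a document as image-based when the average text per page is tiny.
// Assumption: whitespace is not counted toward the threshold.
function looksImageBased(pageTexts: string[], minCharsPerPage = 10): boolean {
  if (pageTexts.length === 0) return false
  const total = pageTexts.reduce((sum, text) => sum + text.replace(/\s/g, "").length, 0)
  return total / pageTexts.length < minCharsPerPage
}
```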
+
+ ---
+
+ ## Requirements
+
+ - **Node.js** >= 18
+ - **pdfjs-dist**: Required only for PDF parsing. HWP/HWPX work without it.
+
+ ---
+
+ ## Credits
+
+ Built by a Korean civil servant who spent years drowning in HWP files. Production-tested across 5 government technology projects: school curriculum plans, facility inspection reports, legal documents, and municipal newsletters. The parsers in this package have processed thousands of real Korean government documents without breaking a sweat.
+
+ ---
+
+ ## License
+
+ MIT
+
+ ---
+
+ <br>
+
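+ # kordoc (Korean version)
+
+ ### I will parse them all.
+
+ > *Korea's document hell is second to none. This was built by a civil servant who survived seven years of it.*
+
+ A Node.js library that converts every document format pouring out of Korean government offices (HWP, HWPX, PDF) into Markdown. School curricula, preliminary planning reports, review opinions, newsletter drafts... feed it anything and it parses it.
+
+ ### Features
+
+ - **No Hancom Office required**: Parses the binary directly, without COM automation. Also runs on Linux and Mac.
+ - **Damaged file recovery**: HWPX files with a broken ZIP Central Directory are recovered by scanning Local File Headers.
+ - **Complete merged-cell handling**: A 2-pass grid algorithm renders colSpan/rowSpan accurately.
+ - **Direct HWP5 binary parsing**: OLE2 container → record stream → UTF-16LE text extraction.
+ - **Image PDF detection**: Reports up front when a scanned PDF cannot be text-extracted.
+ - **Field-tested**: Tested on 5 public-sector projects and thousands of real government documents.
+
+ ### Installation
+
+ ```bash
+ npm install kordoc
+ ```
+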
+ ### Usage
+
+ ```typescript
+ import { parse } from "kordoc"
+ import { readFileSync } from "fs"
+
+ const buffer = readFileSync("사업계획서.hwpx")
+ // A Node Buffer can be a view into a larger shared ArrayBuffer, so pass only
+ // the bytes that belong to this file rather than the whole `buffer.buffer`.
+ const arrayBuffer = buffer.buffer.slice(buffer.byteOffset, buffer.byteOffset + buffer.byteLength)
+ const result = await parse(arrayBuffer)
+
+ if (result.success) {
+   console.log(result.markdown)
+ }
+ ```
+
+ ### CLI
+
+ ```bash
+ npx kordoc 사업계획서.hwpx              # print to terminal
+ npx kordoc 보고서.hwp -o 보고서.md      # save to file
+ npx kordoc *.pdf -d ./변환결과/         # batch convert
+ ```
+
+ ### MCP server (Claude / Cursor / Windsurf)
+
+ Ready to use as a document parsing tool in Claude Desktop or Cursor:
+
+ ```json
+ {
+   "mcpServers": {
+     "kordoc": {
+       "command": "npx",
+       "args": ["-y", "kordoc-mcp"]
+     }
+   }
+ }
+ ```