kordoc 0.1.0
- package/LICENSE +21 -0
- package/README.md +246 -0
- package/dist/chunk-P2BZDRLZ.js +813 -0
- package/dist/cli.js +55 -0
- package/dist/index.cjs +857 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +79 -0
- package/dist/index.d.ts +79 -0
- package/dist/index.js +815 -0
- package/dist/index.js.map +1 -0
- package/dist/mcp.js +13854 -0
- package/package.json +64 -0
package/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 chrisryugj
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
package/README.md
ADDED
|
@@ -0,0 +1,246 @@
|
|
|
1
|
+
# kordoc
|
|
2
|
+
|
|
3
|
+
### I will parse them all.
|
|
4
|
+
|
|
5
|
+
> *"HWP, HWPX, or PDF: if it is a Korean document, it gets parsed, with nothing left behind."*
|
|
6
|
+
|
|
7
|
+
Built by a Korean civil servant who spent 7 years in the deepest circle of document hell. One day he snapped, and kordoc was born.
|
|
8
|
+
|
|
9
|
+
Korean document formats — parsed, converted, delivered as clean Markdown. No COM automation, no Windows dependency, no tears.
|
|
10
|
+
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
## Why kordoc?
|
|
14
|
+
|
|
15
|
+
South Korea runs on HWP. The rest of the world has never heard of it. Government offices produce thousands of `.hwp` files daily, and extracting text from them has always been a nightmare — COM automation that only works on Windows, proprietary formats with zero documentation, and tables that break every parser.
|
|
16
|
+
|
|
17
|
+
**kordoc** was forged in this document hell. Its parsers have been battle-tested across 5 real Korean government projects, processing everything from school curriculum plans to facility inspection reports. If a Korean public servant wrote it, kordoc can parse it.
|
|
18
|
+
|
|
19
|
+
| Format | Engine | Status |
|
|
20
|
+
|--------|--------|--------|
|
|
21
|
+
| **HWPX** (Hancom 2020+) | ZIP + XML DOM walk | Stable |
|
|
22
|
+
| **HWP 5.x** (Hancom legacy) | OLE2 binary + record parsing | Stable |
|
|
23
|
+
| **PDF** | pdfjs-dist text extraction | Stable |
|
|
24
|
+
|
|
25
|
+
### What makes it different
|
|
26
|
+
|
|
27
|
+
- **2-pass table builder** — Correct `colSpan`/`rowSpan` handling via grid algorithm. No more broken table layouts.
|
|
28
|
+
- **Broken ZIP recovery** — Corrupted HWPX? We scan raw Local File Headers and still extract text.
|
|
29
|
+
- **OPF manifest resolution** — Multi-section HWPX documents parsed in correct spine order.
|
|
30
|
+
- **21 HWP5 control characters** — Full UTF-16LE decoding with extended/inline object skip.
|
|
31
|
+
- **Image-based PDF detection** — Warns you when a scanned PDF can't be text-extracted.
|
|
32
|
+
|
|
33
|
+
---
|
|
34
|
+
|
|
35
|
+
## Quick Start
|
|
36
|
+
|
|
37
|
+
### As a library
|
|
38
|
+
|
|
39
|
+
```bash
|
|
40
|
+
npm install kordoc
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
```typescript
|
|
44
|
+
import { parse } from "kordoc"
|
|
45
|
+
import { readFileSync } from "fs"
|
|
46
|
+
|
|
47
|
+
const buffer = readFileSync("document.hwpx")
|
|
48
|
+
const result = await parse(buffer.buffer)
|
|
49
|
+
|
|
50
|
+
if (result.success) {
|
|
51
|
+
console.log(result.markdown)
|
|
52
|
+
// → Clean markdown with tables, headings, and structure preserved
|
|
53
|
+
}
|
|
54
|
+
```
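One caveat with `buffer.buffer`: a Node.js `Buffer` can be a view onto a larger, shared `ArrayBuffer`, so `.buffer` may expose bytes that are not part of the file. The sketch below copies exactly the viewed bytes into a fresh `ArrayBuffer` before parsing. The helper name `toArrayBuffer` is hypothetical and not part of kordoc, and whether kordoc already guards against this internally is not stated in the README.

```typescript
// Hypothetical helper (not part of kordoc): copy exactly the bytes a view
// covers into a fresh ArrayBuffer of the same length.
function toArrayBuffer(view: Uint8Array): ArrayBuffer {
  const out = new ArrayBuffer(view.byteLength)
  new Uint8Array(out).set(view)
  return out
}

// A view covering only 3 bytes of a 7-byte backing buffer,
// similar in shape to a pooled Node.js Buffer.
const backing = new Uint8Array([0, 0, 1, 2, 3, 0, 0])
const view = backing.subarray(2, 5)
const exact = toArrayBuffer(view)
```

Because `Buffer` is a `Uint8Array` subclass, the result of `readFileSync(...)` can be passed to this helper directly, and the returned `ArrayBuffer` handed to `parse`.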
|
|
55
|
+
|
|
56
|
+
### Format-specific parsing
|
|
57
|
+
|
|
58
|
+
```typescript
|
|
59
|
+
import { parseHwpx, parseHwp, parsePdf } from "kordoc"
|
|
60
|
+
|
|
61
|
+
// HWPX (modern Hancom format)
|
|
62
|
+
const hwpxResult = await parseHwpx(buffer)
|
|
63
|
+
|
|
64
|
+
// HWP 5.x (legacy binary format)
|
|
65
|
+
const hwpResult = await parseHwp(buffer)
|
|
66
|
+
|
|
67
|
+
// PDF (text-based)
|
|
68
|
+
const pdfResult = await parsePdf(buffer)
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
### Format detection
|
|
72
|
+
|
|
73
|
+
```typescript
|
|
74
|
+
import { detectFormat, isHwpxFile, isOldHwpFile, isPdfFile } from "kordoc"
|
|
75
|
+
|
|
76
|
+
const format = detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
|
|
77
|
+
```
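To illustrate what detection by magic bytes involves, here is a minimal sketch using the well-known signatures for ZIP (the HWPX container), OLE2 compound files (the HWP 5.x container), and PDF. It is not kordoc's source: `sniffFormat` and `startsWith` are hypothetical names, and a signature alone cannot tell an HWPX file from any other ZIP archive.

```typescript
type DetectedFormat = "hwpx" | "hwp" | "pdf" | "unknown"

// Well-known file signatures (illustrative sketch, not kordoc's implementation).
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04": HWPX is a ZIP container
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // OLE2 compound file
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46] // "%PDF"

// True when `bytes` begins with every byte of `magic`, in order.
function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b)
}

// Hypothetical stand-in for a magic-byte detector.
function sniffFormat(bytes: Uint8Array): DetectedFormat {
  if (startsWith(bytes, ZIP_MAGIC)) return "hwpx"
  if (startsWith(bytes, CFB_MAGIC)) return "hwp"
  if (startsWith(bytes, PDF_MAGIC)) return "pdf"
  return "unknown"
}
```

A stricter detector would open the ZIP and look for HWPX-specific entries before answering `"hwpx"`.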
|
|
78
|
+
|
|
79
|
+
### As a CLI
|
|
80
|
+
|
|
81
|
+
```bash
|
|
82
|
+
npx kordoc document.hwpx # stdout
|
|
83
|
+
npx kordoc document.hwp -o output.md # save to file
|
|
84
|
+
npx kordoc *.pdf -d ./converted/ # batch convert
|
|
85
|
+
npx kordoc report.hwpx --format json # JSON with metadata
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
### As an MCP server (Claude / Cursor / Windsurf)
|
|
89
|
+
|
|
90
|
+
kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
|
|
91
|
+
|
|
92
|
+
```json
|
|
93
|
+
{
|
|
94
|
+
"mcpServers": {
|
|
95
|
+
"kordoc": {
|
|
96
|
+
"command": "npx",
|
|
97
|
+
"args": ["-y", "kordoc-mcp"]
|
|
98
|
+
}
|
|
99
|
+
}
|
|
100
|
+
}
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
This exposes two tools:
|
|
104
|
+
- **`parse_document`** — Parse a HWP/HWPX/PDF file to Markdown
|
|
105
|
+
- **`detect_format`** — Detect file format via magic bytes
|
|
106
|
+
|
|
107
|
+
---
|
|
108
|
+
|
|
109
|
+
## API Reference
|
|
110
|
+
|
|
111
|
+
### `parse(buffer: ArrayBuffer): Promise<ParseResult>`
|
|
112
|
+
|
|
113
|
+
Auto-detects format via magic bytes and parses to Markdown.
|
|
114
|
+
|
|
115
|
+
### `ParseResult`
|
|
116
|
+
|
|
117
|
+
```typescript
|
|
118
|
+
interface ParseResult {
|
|
119
|
+
success: boolean
|
|
120
|
+
markdown?: string // Extracted markdown text
|
|
121
|
+
fileType: "hwpx" | "hwp" | "pdf" | "unknown"
|
|
122
|
+
isImageBased?: boolean // true if scanned PDF (no text extractable)
|
|
123
|
+
pageCount?: number // PDF page count
|
|
124
|
+
error?: string // Error message on failure
|
|
125
|
+
}
|
|
126
|
+
```
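Since most fields on `ParseResult` are optional, callers need to branch before using them. The sketch below restates the interface so it is self-contained and shows one way to turn a result into a status line. `summarize` is a hypothetical helper, and checking `isImageBased` before `success` is a choice made here because the README does not say how the two flags combine.

```typescript
// Mirrors the ParseResult shape documented above.
interface ParseResult {
  success: boolean
  markdown?: string
  fileType: "hwpx" | "hwp" | "pdf" | "unknown"
  isImageBased?: boolean
  pageCount?: number
  error?: string
}

// Hypothetical helper: describe a parse result in one line.
function summarize(result: ParseResult): string {
  if (result.isImageBased) {
    return `${result.fileType}: image-based document, no extractable text`
  }
  if (!result.success) {
    return `${result.fileType}: failed (${result.error ?? "no error message"})`
  }
  const length = result.markdown?.length ?? 0
  return `${result.fileType}: ${length} characters of Markdown`
}
```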
|
|
127
|
+
|
|
128
|
+
### Low-level exports
|
|
129
|
+
|
|
130
|
+
```typescript
|
|
131
|
+
// Table builder (2-pass colSpan/rowSpan algorithm)
|
|
132
|
+
import { buildTable, blocksToMarkdown } from "kordoc"
|
|
133
|
+
|
|
134
|
+
// Type definitions
|
|
135
|
+
import type { IRBlock, IRTable, IRCell, CellContext } from "kordoc"
|
|
136
|
+
```
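The README does not document `buildTable`'s signature, so the following is only a sketch of the general idea behind a two-pass grid layout for merged cells. The first pass places each cell at the first free column and reserves every slot it spans. The second pass emits a dense rectangular grid. `SpanCell` and `layoutGrid` are hypothetical names, and kordoc's own `IRCell` and algorithm may differ.

```typescript
// Hypothetical cell shape for this sketch; kordoc's IRCell may differ.
interface SpanCell {
  text: string
  colSpan: number
  rowSpan: number
}

function layoutGrid(rows: SpanCell[][]): string[][] {
  // Pass 1: place cells and reserve every slot each one covers.
  const slots = new Map<string, string>() // "row,col" -> text ("" for spanned slots)
  let width = 0
  let height = rows.length
  rows.forEach((cells, r) => {
    let c = 0
    for (const cell of cells) {
      // Skip columns already taken by a rowSpan from an earlier row.
      while (slots.has(`${r},${c}`)) c += 1
      for (let dr = 0; dr < cell.rowSpan; dr += 1) {
        for (let dc = 0; dc < cell.colSpan; dc += 1) {
          slots.set(`${r + dr},${c + dc}`, dr === 0 && dc === 0 ? cell.text : "")
        }
      }
      c += cell.colSpan
      width = Math.max(width, c)
      height = Math.max(height, r + cell.rowSpan)
    }
  })

  // Pass 2: emit a rectangular grid, filling unclaimed slots with "".
  const grid: string[][] = []
  for (let r = 0; r < height; r += 1) {
    const row: string[] = []
    for (let c = 0; c < width; c += 1) row.push(slots.get(`${r},${c}`) ?? "")
    grid.push(row)
  }
  return grid
}
```

Every row of the output has the same number of columns, which is what a Markdown table needs.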
|
|
137
|
+
|
|
138
|
+
---
|
|
139
|
+
|
|
140
|
+
## Supported Formats
|
|
141
|
+
|
|
142
|
+
### HWPX (Hancom Office 2020+)
|
|
143
|
+
|
|
144
|
+
ZIP-based XML format. kordoc reads the OPF manifest (`content.hpf`) for correct section ordering, walks the XML DOM for paragraphs and tables, and handles:
|
|
145
|
+
- Multi-section documents
|
|
146
|
+
- Nested tables (table inside a table cell)
|
|
147
|
+
- `colSpan` / `rowSpan` merged cells
|
|
148
|
+
- Corrupted ZIP archives (Local File Header fallback)
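The Local File Header fallback can be pictured as a linear scan for the `PK\x03\x04` signature, reading the fixed 30-byte header that the ZIP format places before each entry. The sketch below only lists the entries it finds. It is an illustration under stated assumptions, not kordoc's code: `scanLocalHeaders` and `LocalEntry` are hypothetical names, entry names are assumed to be ASCII, and entries written with a trailing data descriptor (sizes stored as zero in the header) are not handled.

```typescript
interface LocalEntry {
  name: string
  method: number // 0 = stored, 8 = deflate
  dataOffset: number
  compressedSize: number
}

const LOCAL_HEADER_SIGNATURE = 0x04034b50 // bytes "PK\x03\x04", read little-endian

function scanLocalHeaders(bytes: Uint8Array): LocalEntry[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const entries: LocalEntry[] = []
  let pos = 0
  while (pos + 30 <= bytes.length) {
    if (view.getUint32(pos, true) !== LOCAL_HEADER_SIGNATURE) {
      pos += 1
      continue
    }
    // Fixed offsets within the local file header.
    const method = view.getUint16(pos + 8, true)
    const compressedSize = view.getUint32(pos + 18, true)
    const nameLength = view.getUint16(pos + 26, true)
    const extraLength = view.getUint16(pos + 28, true)
    const nameStart = pos + 30
    const dataOffset = nameStart + nameLength + extraLength
    if (dataOffset + compressedSize > bytes.length) {
      pos += 4 // truncated entry or false positive: keep scanning
      continue
    }
    // Assumption: ASCII entry names, decoded one byte per character.
    const nameBytes = bytes.subarray(nameStart, nameStart + nameLength)
    const name = Array.from(nameBytes, (c) => String.fromCharCode(c)).join("")
    entries.push({ name, method, dataOffset, compressedSize })
    pos = dataOffset + compressedSize
  }
  return entries
}
```

Each entry's payload is `bytes.subarray(dataOffset, dataOffset + compressedSize)`, which still needs inflating when `method` is 8.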
|
|
149
|
+
|
|
150
|
+
### HWP 5.x (Hancom Office legacy)
|
|
151
|
+
|
|
152
|
+
OLE2 Compound Binary format. kordoc parses the CFB container, decompresses section streams (zlib), reads HWP record structures, and extracts UTF-16LE text with full control character handling:
|
|
153
|
+
- 21 control character types (line breaks, tabs, hyphens, NBSP, extended objects)
|
|
154
|
+
- Encrypted/DRM file detection (fails fast with clear error)
|
|
155
|
+
- Table extraction with grid-based cell arrangement
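As a rough picture of the text-extraction step, the sketch below walks a paragraph-text record as little-endian UTF-16 code units and skips control characters. The control-code table is an assumption drawn from the published HWP 5.0 format description (codes below 32 are controls, and the inline and extended ones occupy 8 code units each). Verify it against the specification before relying on it. `decodeParaText` and `WIDE_CONTROLS` are hypothetical names and this is not kordoc's implementation.

```typescript
// Assumption: control codes that occupy 8 UTF-16 code units (inline and extended controls).
const WIDE_CONTROLS = new Set([
  1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
])

function decodeParaText(bytes: Uint8Array): string {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  let out = ""
  let pos = 0
  while (pos + 2 <= bytes.length) {
    const code = view.getUint16(pos, true) // one little-endian code unit
    if (code >= 32) {
      // Ordinary text; surrogate pairs survive because units are appended in order.
      out += String.fromCharCode(code)
      pos += 2
    } else if (WIDE_CONTROLS.has(code)) {
      if (code === 9) out += "\t" // tab is assumed to be an inline control
      pos += 16 // skip the control and its 7 trailing code units
    } else {
      // Single-unit controls: keep the ones that have a plain-text meaning.
      if (code === 10 || code === 13) out += "\n"
      else if (code === 24) out += "-"
      else if (code === 30 || code === 31) out += " "
      pos += 2
    }
  }
  return out
}
```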
|
|
156
|
+
|
|
157
|
+
### PDF
|
|
158
|
+
|
|
159
|
+
Server-side text extraction via pdfjs-dist:
|
|
160
|
+
- Y-coordinate based line grouping
|
|
161
|
+
- Gap-based cell/table detection
|
|
162
|
+
- Image-based PDF detection (< 10 chars/page average)
|
|
163
|
+
- Korean text line joining (aware of particles and conjunctions)
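Two of the steps listed above are easy to sketch in isolation: grouping positioned text fragments into lines by Y coordinate, and flagging a document as image-based when the average text per page falls below 10 characters. The shapes and names here (`TextFragment`, `groupIntoLines`, `isLikelyImageBased`) are hypothetical, the 2-unit tolerance is an arbitrary choice, and ignoring whitespace in the character count is an assumption. Only the 10 chars/page threshold comes from the README.

```typescript
// Hypothetical fragment shape: text plus its page position (PDF origin is bottom-left).
interface TextFragment {
  text: string
  x: number
  y: number
}

// Group fragments whose Y values are within `tolerance` into one line,
// ordered top to bottom, then left to right within each line.
function groupIntoLines(fragments: TextFragment[], tolerance = 2): string[] {
  const sorted = [...fragments].sort((a, b) => b.y - a.y || a.x - b.x)
  const lines: { y: number; items: TextFragment[] }[] = []
  for (const fragment of sorted) {
    const last = lines[lines.length - 1]
    if (last && Math.abs(last.y - fragment.y) <= tolerance) {
      last.items.push(fragment)
    } else {
      lines.push({ y: fragment.y, items: [fragment] })
    }
  }
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      .map((item) => item.text)
      .join(" ")
  )
}

// Flag a document whose pages average fewer than `minCharsPerPage` non-space characters.
function isLikelyImageBased(pageTexts: string[], minCharsPerPage = 10): boolean {
  if (pageTexts.length === 0) return true // assumption: no pages means nothing to extract
  const total = pageTexts.reduce((sum, text) => sum + text.replace(/\s/g, "").length, 0)
  return total / pageTexts.length < minCharsPerPage
}
```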
|
|
164
|
+
|
|
165
|
+
---
|
|
166
|
+
|
|
167
|
+
## Requirements
|
|
168
|
+
|
|
169
|
+
- **Node.js** >= 18
|
|
170
|
+
- **pdfjs-dist** — Required only for PDF parsing. HWP/HWPX work without it.
|
|
171
|
+
|
|
172
|
+
---
|
|
173
|
+
|
|
174
|
+
## Credits
|
|
175
|
+
|
|
176
|
+
Built by a Korean civil servant who spent years drowning in HWP files. Production-tested across 5 government technology projects — school curriculum plans, facility inspection reports, legal documents, and municipal newsletters. The parsers in this package have processed thousands of real Korean government documents without breaking a sweat.
|
|
177
|
+
|
|
178
|
+
---
|
|
179
|
+
|
|
180
|
+
## License
|
|
181
|
+
|
|
182
|
+
MIT
|
|
183
|
+
|
|
184
|
+
---
|
|
185
|
+
|
|
186
|
+
<br>
|
|
187
|
+
|
|
188
|
+
# kordoc (Korean)
|
|
189
|
+
|
|
190
|
+
### I will parse them all.
|
|
191
|
+
|
|
192
|
+
> *Korea's document hell is second to none. This was built by a civil servant who endured seven years of it.*
|
|
193
|
+
|
|
194
|
+
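HWP, HWPX, PDF: a Node.js library that converts every document format pouring out of government offices into Markdown. School curricula, preliminary planning reports, review opinions, newsletter drafts... feed it anything and it gets parsed.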
|
|
195
|
+
|
|
196
|
+
### Features
|
|
197
|
+
|
|
198
|
+
- **No Hancom Office required**: parses the binary directly, with no COM automation. Works on Linux and Mac too.
|
|
199
|
+
- **Damaged file recovery**: HWPX files with a broken ZIP Central Directory are recovered by scanning Local File Headers.
|
|
200
|
+
- **Full merged-cell handling**: a 2-pass grid algorithm renders colSpan/rowSpan accurately.
|
|
201
|
+
- **Direct HWP5 binary parsing**: OLE2 container → record stream → UTF-16LE text extraction.
|
|
202
|
+
- **Image PDF detection**: tells you up front when a scanned PDF cannot be text-extracted.
|
|
203
|
+
- **Field-tested**: tested on 5 public-sector projects and thousands of real government documents.
|
|
204
|
+
|
|
205
|
+
### Installation
|
|
206
|
+
|
|
207
|
+
```bash
|
|
208
|
+
npm install kordoc
|
|
209
|
+
```
|
|
210
|
+
|
|
211
|
+
### Usage
|
|
212
|
+
|
|
213
|
+
```typescript
|
|
214
|
+
import { parse } from "kordoc"
|
|
215
|
+
import { readFileSync } from "fs"
|
|
216
|
+
|
|
217
|
+
const buffer = readFileSync("사업계획서.hwpx")
|
|
218
|
+
const result = await parse(buffer.buffer)
|
|
219
|
+
|
|
220
|
+
if (result.success) {
|
|
221
|
+
console.log(result.markdown)
|
|
222
|
+
}
|
|
223
|
+
```
|
|
224
|
+
|
|
225
|
+
### CLI
|
|
226
|
+
|
|
227
|
+
```bash
|
|
228
|
+
npx kordoc 사업계획서.hwpx # print to terminal
|
|
229
|
+
npx kordoc 보고서.hwp -o 보고서.md # save to file
|
|
230
|
+
npx kordoc *.pdf -d ./변환결과/ # batch convert
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
### MCP server (Claude / Cursor / Windsurf)
|
|
234
|
+
|
|
235
|
+
Ready to use as a document-parsing tool in Claude Desktop or Cursor:
|
|
236
|
+
|
|
237
|
+
```json
|
|
238
|
+
{
|
|
239
|
+
"mcpServers": {
|
|
240
|
+
"kordoc": {
|
|
241
|
+
"command": "npx",
|
|
242
|
+
"args": ["-y", "kordoc-mcp"]
|
|
243
|
+
}
|
|
244
|
+
}
|
|
245
|
+
}
|
|
246
|
+
```
|