kordoc 0.1.1 → 0.2.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

package/README.md +107 -145
package/dist/{chunk-P2BZDRLZ.js → chunk-KZMWHK72.js} +134 -73
package/dist/cli.js +19 -7
package/dist/index.cjs +131 -61
package/dist/index.cjs.map +1 -1
package/dist/index.d.cts +17 -9
package/dist/index.d.ts +17 -9
package/dist/index.js +130 -68
package/dist/index.js.map +1 -1
package/dist/mcp.js +46 -13783
package/package.json +13 -3

package/README.md CHANGED

@@ -1,45 +1,60 @@
 # kordoc
-### 모두 파싱해버리겠다.
+**모두 파싱해버리겠다** — Parse any Korean document to Markdown.
-> *"HWP든 HWPX든 PDF든 — 대한민국 문서라면 남김없이 파싱해버립니다."*
+[![npm version](https://img.shields.io/npm/v/kordoc.svg)](https://www.npmjs.com/package/kordoc)
+[![license](https://img.shields.io/npm/l/kordoc.svg)](https://github.com/chrisryugj/kordoc/blob/main/LICENSE)
+[![node](https://img.shields.io/node/v/kordoc.svg)](https://nodejs.org)
-Built by a Korean civil servant who spent 7 years in the deepest circle of document hell. One day he snapped, and kordoc was born.
+> *HWP, HWPX, PDF — 대한민국 문서라면 남김없이 파싱해버립니다.*
-Korean document formats — parsed, converted, delivered as clean Markdown. No COM automation, no Windows dependency, no tears.
+[한국어](./README-KR.md)
+![kordoc demo](./demo.gif)
 ---
 ## Why kordoc?
-South Korea runs on HWP. The rest of the world has never heard of it. Government offices produce thousands of `.hwp` files daily, and extracting text from them has always been a nightmare — COM automation that only works on Windows, proprietary formats with zero documentation, and tables that break every parser.
+South Korea's government runs on **HWP** — a proprietary word processor the rest of the world has never heard of. Every day, 243 local governments and thousands of public institutions produce mountains of `.hwp` files. Extracting text from them has always been a nightmare: COM automation that only works on Windows, proprietary binary formats with zero documentation, and tables that break every existing parser.
-**kordoc** was forged in this document hell. Its parsers have been battle-tested across 5 real Korean government projects, processing everything from school curriculum plans to facility inspection reports. If a Korean public servant wrote it, kordoc can parse it.
+**kordoc** was born from that document hell. Built by a Korean civil servant who spent **7 years** buried under HWP files at a district office. One day he snapped — and decided to parse them all. Its parsers have been battle-tested across 5 real government projects, processing school curriculum plans, facility inspection reports, legal annexes, and municipal newsletters. If a Korean public servant wrote it, kordoc can parse it.
-| Format | Engine | Status |
-|--------|--------|--------|
-| **HWPX** (한컴 2020+) | ZIP + XML DOM walk | Stable |
-| **HWP 5.x** (한컴 레거시) | OLE2 binary + record parsing | Stable |
-| **PDF** | pdfjs-dist text extraction | Stable |
+---
-### What makes it different
+## Features
-- **2-pass table builder** — Correct `colSpan`/`rowSpan` handling via grid algorithm. No more broken table layouts.
-- **Broken ZIP recovery** — Corrupted HWPX? We scan raw Local File Headers and still extract text.
-- **OPF manifest resolution** — Multi-section HWPX documents parsed in correct spine order.
-- **21 HWP5 control characters** — Full UTF-16LE decoding with extended/inline object skip.
-- **Image-based PDF detection** — Warns you when a scanned PDF can't be text-extracted.
+- **HWP 5.x Binary Parsing** — OLE2 container + record stream + UTF-16LE. No Hancom Office needed.
+- **HWPX ZIP Parsing** — OPF manifest resolution, multi-section, nested tables.
+- **PDF Text Extraction** — Y-coordinate line grouping, table reconstruction, image PDF detection.
+- **2-Pass Table Builder** — Correct `colSpan`/`rowSpan` via grid algorithm. No broken tables.
+- **Broken ZIP Recovery** — Corrupted HWPX? Scans raw Local File Headers.
+- **3 Interfaces** — npm library, CLI tool, and MCP server (Claude/Cursor).
+- **Cross-Platform** — Pure JavaScript. Runs on Linux, macOS, Windows.
----
+## Supported Formats
-## Quick Start
+| Format | Engine | Features |
+|--------|--------|----------|
+| **HWPX** (한컴 2020+) | ZIP + XML DOM | Manifest, nested tables, merged cells, broken ZIP recovery |
+| **HWP 5.x** (한컴 레거시) | OLE2 + CFB | 21 control chars, zlib decompression, DRM detection |
+| **PDF** | pdfjs-dist | Line grouping, table detection, image PDF warning |
-### As a library
+## Installation
 ```bash
 npm install kordoc
+# PDF support requires pdfjs-dist (optional peer dependency)
+npm install pdfjs-dist
 ```
+> **Since v0.2.1**, `pdfjs-dist` is an optional peer dependency. Not needed for HWP/HWPX parsing.
+## Usage
+### As a Library
 ```typescript
 import { parse } from "kordoc"
 import { readFileSync } from "fs"
@@ -49,31 +64,25 @@ const result = await parse(buffer.buffer)
 if (result.success) {
   console.log(result.markdown)
-  // → Clean markdown with tables, headings, and structure preserved
 }
 ```
-### Format-specific parsing
+#### Format-Specific
 ```typescript
 import { parseHwpx, parseHwp, parsePdf } from "kordoc"
-// HWPX (modern Hancom format)
-const hwpxResult = await parseHwpx(buffer)
-// HWP 5.x (legacy binary format)
-const hwpResult = await parseHwp(buffer)
-// PDF (text-based)
-const pdfResult = await parsePdf(buffer)
+const hwpxResult = await parseHwpx(buffer)   // HWPX
+const hwpResult  = await parseHwp(buffer)    // HWP 5.x
+const pdfResult  = await parsePdf(buffer)    // PDF
 ```
-### Format detection
+#### Format Detection
 ```typescript
-import { detectFormat, isHwpxFile, isOldHwpFile, isPdfFile } from "kordoc"
+import { detectFormat } from "kordoc"
-const format = detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
+detectFormat(buffer) // → "hwpx" | "hwp" | "pdf" | "unknown"
 ```
 ### As a CLI
@@ -85,9 +94,9 @@ npx kordoc *.pdf -d ./converted/            # batch convert
 npx kordoc report.hwpx --format json        # JSON with metadata
 ```
-### As an MCP server (Claude / Cursor / Windsurf)
+### As an MCP Server
-kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
+Works with **Claude Desktop**, **Cursor**, **Windsurf**, and any MCP-compatible client.
 ```json
 {
@@ -100,147 +109,100 @@ kordoc includes a built-in MCP server. Add it to your Claude Desktop config:
 }
 ```
-This exposes two tools:
-- **`parse_document`** — Parse a HWP/HWPX/PDF file to Markdown
-- **`detect_format`** — Detect file format via magic bytes
+**Tools exposed:**
----
+| Tool | Description |
+|------|-------------|
+| `parse_document` | Parse HWP/HWPX/PDF file → Markdown |
+| `detect_format` | Detect file format via magic bytes |
 ## API Reference
 ### `parse(buffer: ArrayBuffer): Promise<ParseResult>`
-Auto-detects format via magic bytes and parses to Markdown.
-### `ParseResult`
+Auto-detects format and converts to Markdown.
 ```typescript
 interface ParseResult {
   success: boolean
-  markdown?: string          // Extracted markdown text
+  markdown?: string
   fileType: "hwpx" | "hwp" | "pdf" | "unknown"
-  isImageBased?: boolean     // true if scanned PDF (no text extractable)
-  pageCount?: number         // PDF page count
-  error?: string             // Error message on failure
+  isImageBased?: boolean     // scanned PDF detection
+  pageCount?: number         // PDF only
+  error?: string
 }
 ```
-### Low-level exports
+### Low-Level Exports
 ```typescript
-// Table builder (2-pass colSpan/rowSpan algorithm)
-import { buildTable, blocksToMarkdown } from "kordoc"
-// Type definitions
+import { buildTable, blocksToMarkdown, convertTableToText } from "kordoc"
 import type { IRBlock, IRTable, IRCell, CellContext } from "kordoc"
 ```
----
-## Supported Formats
-### HWPX (한컴오피스 2020+)
-ZIP-based XML format. kordoc reads the OPF manifest (`content.hpf`) for correct section ordering, walks the XML DOM for paragraphs and tables, and handles:
-- Multi-section documents
-- Nested tables (table inside a table cell)
-- `colSpan` / `rowSpan` merged cells
-- Corrupted ZIP archives (Local File Header fallback)
-### HWP 5.x (한컴오피스 레거시)
-OLE2 Compound Binary format. kordoc parses the CFB container, decompresses section streams (zlib), reads HWP record structures, and extracts UTF-16LE text with full control character handling:
-- 21 control character types (line breaks, tabs, hyphens, NBSP, extended objects)
-- Encrypted/DRM file detection (fails fast with clear error)
-- Table extraction with grid-based cell arrangement
-### PDF
-Server-side text extraction via pdfjs-dist:
-- Y-coordinate based line grouping
-- Gap-based cell/table detection
-- Image-based PDF detection (< 10 chars/page average)
-- Korean text line joining (조사/접속사 awareness)
----
 ## Requirements
 - **Node.js** >= 18
-- **pdfjs-dist** — Required only for PDF parsing. HWP/HWPX work without it.
----
-## Credits
-Built by a Korean civil servant who spent years drowning in HWP files. Production-tested across 5 government technology projects — school curriculum plans, facility inspection reports, legal documents, and municipal newsletters. The parsers in this package have processed thousands of real Korean government documents without breaking a sweat.
----
-## License
+- **pdfjs-dist** >= 4.0.0 — Optional. Only needed for PDF. HWP/HWPX work without it.
-MIT
+## Security
----
-<br>
-# kordoc (한국어)
-### 모두 파싱해버리겠다.
-> *대한민국에서 둘째가라면 서러울 문서지옥. 거기서 7년 버틴 공무원이 만들었습니다.*
+v0.2.2 security hardening (cumulative since v0.2.1):
-HWP, HWPX, PDF — 관공서에서 쏟아지는 모든 문서 포맷을 마크다운으로 변환하는 Node.js 라이브러리입니다. 학교 교육과정, 사전기획 보고서, 검토의견서, 소식지 원고... 뭐든 넣으면 파싱합니다.
+- **ZIP bomb protection** — 100MB decompression limit, 500 entry cap
+- **XXE/Billion Laughs prevention** — Internal DTD subsets fully stripped from HWPX XML
+- **Decompression bomb guard** — `maxOutputLength` on HWP5 zlib streams, cumulative 100MB limit across sections
+- **colSpan/rowSpan clamping** — Crafted merge values clamped to grid bounds (MAX_COLS=200, MAX_ROWS=10,000)
+- **Broken ZIP path traversal guard** — `..` and absolute path entries rejected, filename length capped
+- **MCP path restriction** — Only `.hwp`, `.hwpx`, `.pdf` extensions allowed
+- **File size limit** — 500MB max in MCP server and CLI
+- **PDF resource cleanup** — `doc.destroy()` prevents WASM memory leaks
+- **Table memory guard** — Sparse Set-based allocation in Pass 1, 10,000 row cap
+- **HWP5 section limit** — Max 100 sections to prevent infinite loop on corrupted files
-### 특징
+## How It Works
-- **한컴오피스 불필요** — COM 자동화 없이 바이너리 직접 파싱. Linux, Mac에서도 동작
-- **손상 파일 복구** — ZIP Central Directory가 깨진 HWPX도 Local File Header 스캔으로 복구
-- **병합 셀 완벽 처리** — 2-pass 그리드 알고리즘으로 colSpan/rowSpan 정확히 렌더링
-- **HWP5 바이너리 직접 파싱** — OLE2 컨테이너 → 레코드 스트림 → UTF-16LE 텍스트 추출
-- **이미지 PDF 감지** — 스캔된 PDF는 텍스트 추출 불가를 사전에 알려줌
-- **실전 검증 완료** — 5개 공공 프로젝트, 수천 건의 실제 관공서 문서에서 테스트됨
-### 설치
-```bash
-npm install kordoc
 ```
-### 사용법
-```typescript
-import { parse } from "kordoc"
-import { readFileSync } from "fs"
-const buffer = readFileSync("사업계획서.hwpx")
-const result = await parse(buffer.buffer)
-if (result.success) {
-  console.log(result.markdown)
-}
+┌─────────────┐     Magic Bytes      ┌──────────────────┐
+│  File Input  │ ──── Detection ────→ │  Format Router   │
+└─────────────┘                       └────────┬─────────┘
+                                               │
+                    ┌──────────────────────────┼──────────────────────────┐
+                    │                          │                          │
+              ┌─────▼─────┐            ┌───────▼───────┐          ┌──────▼──────┐
+              │   HWPX    │            │    HWP 5.x    │          │     PDF     │
+              │  ZIP+XML  │            │  OLE2+Record  │          │  pdfjs-dist │
+              └─────┬─────┘            └───────┬───────┘          └──────┬──────┘
+                    │                          │                          │
+                    │       ┌──────────────────┤                          │
+                    │       │                  │                          │
+              ┌─────▼───────▼─────┐            │                          │
+              │  2-Pass Table     │            │                          │
+              │  Builder (Grid)   │            │                          │
+              └─────────┬─────────┘            │                          │
+                        │                      │                          │
+                  ┌─────▼──────────────────────▼──────────────────────────▼─────┐
+                  │                      IRBlock[]                              │
+                  │              (Intermediate Representation)                  │
+                  └────────────────────────┬───────────────────────────────────┘
+                                           │
+                                    ┌──────▼──────┐
+                                    │  Markdown   │
+                                    │   Output    │
+                                    └─────────────┘
 ```
-### CLI
+## Credits
-```bash
-npx kordoc 사업계획서.hwpx                     # 터미널 출력
-npx kordoc 보고서.hwp -o 보고서.md              # 파일 저장
-npx kordoc *.pdf -d ./변환결과/                 # 일괄 변환
-```
+Production-tested across 5 Korean government technology projects:
+- School curriculum plans (학교교육과정)
+- Facility inspection reports (사전기획 보고서)
+- Legal document annexes (법률 별표)
+- Municipal newsletters (소식지)
+- Public data extraction tools (공공데이터)
-### MCP 서버 (Claude / Cursor / Windsurf)
+Thousands of real government documents parsed without breaking a sweat.
-Claude Desktop이나 Cursor에서 문서 파싱 도구로 바로 사용 가능합니다:
+## License
-```json
-{
-  "mcpServers": {
-    "kordoc": {
-      "command": "npx",
-      "args": ["-y", "kordoc-mcp"]
-    }
-  }
-}
-```
+[MIT](./LICENSE)
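
Note on the "Magic Bytes Detection" step in the README diagram above: the diff does not include the detection code itself, so the following is only a minimal sketch of how signature sniffing for these three formats could look. The helper names `startsWith` and `sniffFormat` are hypothetical and are not kordoc exports; only the byte signatures (ZIP local file header, OLE2/CFB header, `%PDF`) are standard facts.

```typescript
// Illustrative sketch only: NOT kordoc's actual implementation.
type DetectedFormat = "hwpx" | "hwp" | "pdf" | "unknown"

// Well-known file signatures (magic bytes).
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04", ZIP local file header
const OLE2_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] // OLE2 / CFB header
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46] // "%PDF"

// Hypothetical helper: true when `bytes` begins with every byte of `signature`.
function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false
  return signature.every((value, index) => bytes[index] === value)
}

// Hypothetical stand-in for a detector; maps leading bytes to a format label.
// Assumption: any ZIP container is treated as HWPX and any OLE2 container as HWP 5.x.
function sniffFormat(buffer: ArrayBufferLike): DetectedFormat {
  const bytes = new Uint8Array(buffer)
  if (startsWith(bytes, ZIP_SIGNATURE)) return "hwpx"
  if (startsWith(bytes, OLE2_SIGNATURE)) return "hwp"
  if (startsWith(bytes, PDF_SIGNATURE)) return "pdf"
  return "unknown"
}
```

The signature alone only identifies the container: DOCX and other formats are also ZIP archives, and legacy `.doc`/`.xls` files share the OLE2 header. A production detector would need a further check inside the container before committing to a parser; how kordoc does this is not shown in this diff.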