hwp2md 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE.md ADDED
@@ -0,0 +1,7 @@
+ Copyright 2025 Jaechan Kim<kjc0210@gmail.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,174 @@
+ # hwp2md
+
+ An HWP/HWPX to Markdown converter for Node.js and the browser.
+
+ Converts Hangul (HWP 5.0 and HWPX) files into clean Markdown optimized for LLMs.
+
+ ## Key Features
+
+ - ✅ Text extraction from HWP 5.0 and HWPX files
+ - ✅ HWPX (ZIP+XML) format support (auto-detected by file extension)
+ - ✅ Tables converted to Markdown format
+ - ✅ Document structure preserved
+ - ✅ Merged-cell handling (content is repeated for LLM optimization)
+ - ✅ Works in Node.js and browser environments
+ - ✅ TypeScript type definitions included
+
+ ## Installation
+
+ ```bash
+ npm install hwp2md
+ # or
+ pnpm add hwp2md
+ # or
+ yarn add hwp2md
+ ```
+
+ ## Usage
+
+ ### Node.js
+
+ ```typescript
+ import { convert } from "hwp2md";
+
+ // Convert HWP or HWPX (auto-detected by file extension)
+ const markdown = await convert("document.hwp");
+ const markdown2 = await convert("document.hwpx");
+ console.log(markdown);
+
+ // With options
+ const markdown3 = await convert("document.hwp", {
+   tableLineBreakStyle: "space", // or 'br'
+ });
+ ```
+
+ ### Browser
+
+ ```typescript
+ import { convertFromFile } from "hwp2md/browser";
+
+ const fileInput = document.getElementById("file") as HTMLInputElement;
+ fileInput.addEventListener("change", async (e) => {
+   const file = (e.target as HTMLInputElement).files?.[0];
+   if (file) {
+     const markdown = await convertFromFile(file);
+     console.log(markdown);
+   }
+ });
+ ```
+
+ ### CLI
+
+ ```bash
+ # Install globally
+ npm install -g hwp2md
+
+ # Show file information
+ hwp2md info document.hwp
+ hwp2md info document.hwpx
+
+ # Convert to Markdown
+ hwp2md convert document.hwp output.md
+
+ # Convert an HWPX file (auto-detected)
+ hwp2md convert document.hwpx output.md
+
+ # With options
+ hwp2md convert document.hwp output.md --table-line-breaks br
+ ```
+
+ ## API
+
+ ### `convert(input, options?)`
+
+ Converts an HWP/HWPX file to Markdown (format auto-detected by file extension).
+
+ **Parameters:**
+
+ - `input`: file path (Node.js; `.hwp` or `.hwpx`), ArrayBuffer, or Uint8Array
+ - `options`: conversion options
+   - `tableLineBreakStyle`: `'space'` (default) or `'br'`
+
+ **Returns:** `Promise<string>` - the Markdown content
+
+ ### `convertFromFile(file, options?)` (browser only)
+
+ Converts an HWP file from a browser File object.
+
+ **Parameters:**
+
+ - `file`: the File object from an input element
+ - `options`: conversion options
+
+ **Returns:** `Promise<string>` - the Markdown content
+
+ ### `HWPFile`
+
+ Low-level API for parsing HWP files.
+
+ ```typescript
+ import { HWPFile } from "hwp2md";
+
+ const hwp = await HWPFile.fromFile("document.hwp");
+ hwp.open();
+
+ console.log(hwp.fileHeader);
+ console.log(hwp.getSectionCount());
+
+ const sectionData = hwp.readSection(0);
+ hwp.close();
+ ```
+
+ ### `HWPXFile`
+
+ Low-level API for parsing HWPX files.
+
+ ```typescript
+ import { HWPXFile, convertHwpxToMarkdown } from "hwp2md";
+
+ const hwpx = await HWPXFile.fromFile("document.hwpx");
+ hwpx.open();
+
+ console.log(hwpx.getSectionCount());
+ const markdown = convertHwpxToMarkdown(hwpx);
+
+ hwpx.close();
+ ```
+
+ ## Options
+
+ ### `tableLineBreakStyle`
+
+ How line breaks inside table cells are handled:
+
+ - `'space'` (default): lines are joined with a space - optimized for LLM processing
+ - `'br'`: lines are joined with `<br>` tags - prioritizes readability
+
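Reviewer note: the two `tableLineBreakStyle` values can be illustrated with a small standalone sketch. This is not the package's implementation. The helper name `joinCellLines` and the pipe-escaping step are assumptions added for illustration, since a raw `|` would otherwise end a Markdown table cell.

```typescript
// Hypothetical helper (not part of hwp2md): flattens multi-line cell text
// so it fits on a single Markdown table row.
type TableLineBreakStyle = "space" | "br";

function joinCellLines(
  text: string,
  style: TableLineBreakStyle = "space",
): string {
  // Split on any newline convention and drop empty lines.
  const lines = text
    .split(/\r\n|\r|\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // "space" keeps the cell compact; "br" keeps the visual line breaks.
  const joined = lines.join(style === "br" ? "<br>" : " ");

  // Assumption: escape pipes so cell text cannot break the table layout.
  return joined.replace(/\|/g, "\\|");
}

console.log(joinCellLines("first line\nsecond line"));
console.log(joinCellLines("first line\nsecond line", "br"));
```

The `'space'` form yields fewer tokens and no inline HTML, which matches the README's stated preference for LLM input, while `'br'` preserves the original line structure for human readers.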
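+ ## Limitations
+
+ - **HWP 5.0 and HWPX only** - older HWP formats (HWP 3.0, HWP 97, HWP 2002, etc.) are not supported
+   - Legacy HWP files can be converted to HWP 5.0 or HWPX in Hancom Office
+   - An error is raised when a legacy format is detected
+ - **Text & tables** - currently only text and tables are extracted; images and complex objects are skipped
+ - **Korean-focused** - optimized for Korean documents (UTF-16LE encoding)
+ - **Basic formatting only** - fonts, colors, and advanced styles are not preserved
+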
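Reviewer note: the limitations mention UTF-16LE text. A minimal sketch of decoding such bytes with the standard `TextDecoder` follows. The helper name `decodeUtf16le` is hypothetical, and the sketch ignores any in-band control characters that a real HWP paragraph stream may contain.

```typescript
// Hypothetical helper (not part of hwp2md): decodes raw UTF-16LE bytes.
function decodeUtf16le(bytes: Uint8Array): string {
  // Drop a dangling trailing byte so only whole 16-bit code units are decoded.
  const usable = bytes.subarray(0, bytes.byteLength - (bytes.byteLength % 2));
  return new TextDecoder("utf-16le").decode(usable);
}

// U+D55C and U+AE00, each stored low byte first.
const sample = new Uint8Array([0x5c, 0xd5, 0x00, 0xae]);
console.log(decodeUtf16le(sample));
```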
+ ## Development
+
+ ```bash
+ # Install dependencies
+ pnpm install
+
+ # Build
+ pnpm build
+
+ # Test
+ pnpm test
+
+ # Type check
+ pnpm typecheck
+ ```
+
+ ## License
+
+ MIT
@@ -0,0 +1,179 @@
+ //#region src/types.d.ts
+ /**
+  * HWP File Header Information
+  */
+ interface FileHeader {
+   signature: string;
+   version: string;
+   isCompressed: boolean;
+   isEncrypted: boolean;
+   rawProperties: number;
+ }
+ /**
+  * Binary Record Structure
+  */
+ interface Record$1 {
+   tagId: number;
+   level: number;
+   data: Uint8Array;
+   size: number;
+ }
+ /**
+  * Paragraph Header Information
+  */
+ interface ParagraphHeader {
+   textCount: number;
+   controlMask: number;
+   paraShapeId: number;
+   styleId: number;
+   columnType: number;
+   charShapeCount: number;
+ }
+ /**
+  * Parsed Paragraph
+  */
+ interface Paragraph {
+   text: string;
+   header: ParagraphHeader;
+ }
+ /**
+  * Table Cell
+  */
+ interface Cell {
+   row: number;
+   col: number;
+   rowspan: number;
+   colspan: number;
+   text: string;
+ }
+ /**
+  * Parsed Table
+  */
+ interface Table {
+   rows: number;
+   cols: number;
+   cells: Cell[];
+ }
+ /**
+  * Conversion Options
+  */
+ interface ConvertOptions {
+   tableLineBreakStyle?: "space" | "br";
+ }
+ /**
+  * Merge Strategy for Table Cells
+  */
+ type MergeStrategy = "repeat" | "blank";
+ //#endregion
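Reviewer note: the `Cell`, `Table`, and `MergeStrategy` types above suggest how merged cells could be flattened into a rectangular Markdown table. The sketch below is one possible reading, not the package's code. It assumes `row`/`col` are 0-based coordinates of a cell's top-left corner, that `"repeat"` copies the text into every covered position, and that `"blank"` leaves the covered positions empty. `expandTable` and `toMarkdownTable` are hypothetical names.

```typescript
// Local copies of the declared shapes so the sketch is self-contained.
interface Cell {
  row: number;
  col: number;
  rowspan: number;
  colspan: number;
  text: string;
}
interface Table {
  rows: number;
  cols: number;
  cells: Cell[];
}
type MergeStrategy = "repeat" | "blank";

// Hypothetical: expand merged cells into a full rows x cols grid of strings.
function expandTable(table: Table, strategy: MergeStrategy = "repeat"): string[][] {
  const grid: string[][] = Array.from({ length: table.rows }, () =>
    new Array<string>(table.cols).fill(""),
  );
  for (const cell of table.cells) {
    for (let r = cell.row; r < cell.row + cell.rowspan && r < table.rows; r++) {
      for (let c = cell.col; c < cell.col + cell.colspan && c < table.cols; c++) {
        const isAnchor = r === cell.row && c === cell.col;
        // The anchor position always gets the text; covered positions depend on the strategy.
        grid[r][c] = isAnchor || strategy === "repeat" ? cell.text : "";
      }
    }
  }
  return grid;
}

// Hypothetical: render the grid as a Markdown table, using the first row as the header.
function toMarkdownTable(grid: string[][]): string {
  if (grid.length === 0) return "";
  const line = (cells: string[]): string => `| ${cells.join(" | ")} |`;
  const [header, ...body] = grid;
  const separator = line(header.map(() => "---"));
  return [line(header), separator, ...body.map(line)].join("\n");
}

const demo: Table = {
  rows: 2,
  cols: 2,
  cells: [
    { row: 0, col: 0, rowspan: 2, colspan: 1, text: "A" },
    { row: 0, col: 1, rowspan: 1, colspan: 1, text: "B" },
    { row: 1, col: 1, rowspan: 1, colspan: 1, text: "C" },
  ],
};
console.log(toMarkdownTable(expandTable(demo, "repeat")));
```

Repeating the text keeps every row self-describing, which is the trade-off the README describes as LLM optimization; the blank variant is closer to how the table looks on the page.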
+ //#region src/parser.d.ts
+ /**
+  * HWP File Parser
+  * Reads and parses HWP 5.0 files using OLE Compound File format
+  */
+ declare class HWPFile {
+   private data;
+   private cfb;
+   private _fileHeader;
+   private _isCompressed;
+   /**
+    * Create HWPFile from raw data
+    * @param data - Raw HWP file data
+    */
+   constructor(data: Uint8Array | ArrayBuffer);
+   /**
+    * Create HWPFile from file path (Node.js only)
+    * @param path - Path to HWP file
+    */
+   static fromFile(path: string): Promise<HWPFile>;
+   /**
+    * Create HWPFile from ArrayBuffer
+    * @param data - ArrayBuffer data
+    */
+   static fromArrayBuffer(data: ArrayBuffer): HWPFile;
+   /**
+    * Create HWPFile from Uint8Array
+    * @param data - Uint8Array data
+    */
+   static fromUint8Array(data: Uint8Array): HWPFile;
+   /**
+    * Open and parse HWP file
+    */
+   open(): void;
+   /**
+    * Close HWP file and release resources
+    */
+   close(): void;
+   /**
+    * Get file header information
+    */
+   get fileHeader(): FileHeader | null;
+   /**
+    * Check if file is compressed
+    */
+   get isCompressed(): boolean;
+   /**
+    * Parse FileHeader stream (256 bytes fixed)
+    */
+   private parseFileHeader;
+   /**
+    * Read and decompress stream
+    * @param streamPath - Stream path (e.g., 'DocInfo', 'BodyText/Section0')
+    * @returns Decompressed data or null if stream doesn't exist
+    */
+   readStream(streamPath: string): Uint8Array | null;
+   /**
+    * List all streams in HWP file
+    * @returns Array of stream paths
+    */
+   listStreams(): string[][];
+   /**
+    * Get file information
+    */
+   getFileInfo(): Record<string, unknown>;
+   /**
+    * Get number of sections in BodyText
+    */
+   getSectionCount(): number;
+   /**
+    * Read section data
+    * @param sectionIndex - Section index (0-based)
+    * @returns Decompressed section data
+    */
+   readSection(sectionIndex: number): Uint8Array | null;
+ }
+ //#endregion
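Reviewer note: `readSection` returns decompressed bytes, and the `Record` type declared earlier (`tagId`, `level`, `size`, `data`) hints at how those bytes are structured. The sketch below assumes the publicly documented HWP 5.0 record header layout: a 32-bit little-endian word holding a 10-bit tag ID, a 10-bit level, and a 12-bit size, where a size field of `0xFFF` means the real size follows as an extra 32-bit value. `parseRecords` and `RawRecord` are hypothetical names, and this diff does not show whether the package parses records this way.

```typescript
// Mirrors the declared Record shape; named RawRecord to avoid clashing with TS's Record<K, V>.
interface RawRecord {
  tagId: number;
  level: number;
  size: number;
  data: Uint8Array;
}

// Hypothetical: split a decompressed stream into records, stopping at truncated input.
function parseRecords(bytes: Uint8Array): RawRecord[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const records: RawRecord[] = [];
  let offset = 0;

  while (offset + 4 <= bytes.byteLength) {
    const header = view.getUint32(offset, true); // little-endian
    offset += 4;

    const tagId = header & 0x3ff; // bits 0-9
    const level = (header >>> 10) & 0x3ff; // bits 10-19
    let size = (header >>> 20) & 0xfff; // bits 20-31

    // Assumed extended-size rule: 0xFFF signals a following 32-bit length.
    if (size === 0xfff) {
      if (offset + 4 > bytes.byteLength) break;
      size = view.getUint32(offset, true);
      offset += 4;
    }

    if (offset + size > bytes.byteLength) break; // truncated payload
    records.push({ tagId, level, size, data: bytes.subarray(offset, offset + size) });
    offset += size;
  }
  return records;
}

// One record: tagId 67, level 1, size 2, payload AA BB.
const stream = new Uint8Array([0x43, 0x04, 0x20, 0x00, 0xaa, 0xbb]);
console.log(parseRecords(stream));
```

Using `subarray` keeps each record's `data` as a view over the original buffer, so no payload bytes are copied.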
+ //#region src/converter.d.ts
+ /**
+  * Convert HWP file to Markdown
+  * @param hwp - Opened HWP file
+  * @param options - Conversion options
+  * @returns Markdown content
+  */
+ declare function convertHwpToMarkdown(hwp: HWPFile, options?: ConvertOptions): string;
+ /**
+  * High-level API: Convert HWP/HWPX file to Markdown
+  * Auto-detects format based on file extension or magic bytes.
+  * @param input - File path (Node.js), ArrayBuffer, or Uint8Array
+  * @param options - Conversion options
+  * @returns Markdown content
+  */
+ declare function convert(input: string | ArrayBuffer | Uint8Array, options?: ConvertOptions): Promise<string>;
+ //#endregion
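Reviewer note: the `convert` doc comment mentions detection by magic bytes. Since the parser comment says HWP 5.0 uses the OLE Compound File format and the README describes HWPX as ZIP+XML, a plausible check compares the leading bytes against the well-known signatures of those two containers. The sketch below is an assumption about how such detection could look, not the package's logic; `detectFormat` is a hypothetical name. A matching signature only identifies the container, so a real converter would still need to validate the contents.

```typescript
type DetectedFormat = "hwp" | "hwpx" | "unknown";

// OLE Compound File Binary signature (container used by HWP 5.0).
const CFB_MAGIC: readonly number[] = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
// ZIP local file header signature "PK\x03\x04" (container used by HWPX).
const ZIP_MAGIC: readonly number[] = [0x50, 0x4b, 0x03, 0x04];

function startsWith(bytes: Uint8Array, magic: readonly number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Hypothetical: classify input by container signature only.
function detectFormat(bytes: Uint8Array): DetectedFormat {
  if (startsWith(bytes, CFB_MAGIC)) return "hwp";
  if (startsWith(bytes, ZIP_MAGIC)) return "hwpx";
  return "unknown";
}

console.log(detectFormat(new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00])));
```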
+ //#region src/browser.d.ts
+ /**
+  * Convert HWP from Blob (browser File API)
+  * @param blob - Blob or File object
+  * @param options - Conversion options
+  * @returns Markdown content
+  */
+ declare function convertFromBlob(blob: Blob, options?: ConvertOptions): Promise<string>;
+ /**
+  * Convert HWP from File input (browser)
+  * @param file - File object from input element
+  * @param options - Conversion options
+  * @returns Markdown content
+  */
+ declare function convertFromFile(file: File, options?: ConvertOptions): Promise<string>;
+ //#endregion
+ export { Cell, ConvertOptions, FileHeader, HWPFile, MergeStrategy, Paragraph, ParagraphHeader, Record$1 as Record, Table, convert, convertFromBlob, convertFromFile, convertHwpToMarkdown };
+ //# sourceMappingURL=browser.d.mts.map
@@ -0,0 +1 @@
+ {"version":3,"file":"browser.d.mts","names":[],"sources":["../src/types.ts","../src/parser.ts","../src/converter.ts","../src/browser.ts"],"sourcesContent":[],"mappings":";;AAGA;AAWA;AAUiB,UArBA,UAAA,CAqBe;EAYf,SAAA,EAAA,MAAS;EAQT,OAAI,EAAA,MAAA;EAWJ,YAAK,EAAA,OAGb;EAMQ,WAAA,EAAA,OAAc;EAOnB,aAAA,EAAA,MAAa;;;;AC1DzB;AAS4B,UDRX,QAAA,CCQW;EAAa,KAAA,EAAA,MAAA;EAMM,KAAA,EAAA,MAAA;EAAR,IAAA,EDX/B,UCW+B;EAUR,IAAA,EAAA,MAAA;;;;;AAmIG,UDjJjB,eAAA,CCiJiB;EAmDjB,SAAA,EAAA,MAAA;EAuCoB,WAAA,EAAA,MAAA;EAAU,WAAA,EAAA,MAAA;;;;AC/M/C;AAiEA;;;AAEY,UFnFK,SAAA,CEmFL;EACT,IAAA,EAAA,MAAA;EAAO,MAAA,EFlFA,eEkFA;;;;ACxGV;AACQ,UH2BS,IAAA,CG3BT;EAAI,GAAA,EAAA,MAAA;EAET,GAAA,EAAA,MAAA;EAAO,OAAA,EAAA,MAAA;EAYY,OAAA,EAAA,MAAA;EACd,IAAA,EAAA,MAAA;;;;;UHuBS,KAAA;;;SAGR;;;;;UAMQ,cAAA;;;;;;KAOL,aAAA;;;AA3BZ;AAWA;AASA;AAOA;cC1Da,OAAA;;;EAAA,QAAA,WAAO;EASQ,QAAA,aAAA;EAAa;;;;EAgBI,WAAA,CAAA,IAAA,EAhBjB,UAgBiB,GAhBJ,WAgBI;EAQf;;;;EA8Kb,OAAA,QAAA,CAAA,IAAA,EAAA,MAAA,CAAA,EAhMsB,OAgMtB,CAhM8B,OAgM9B,CAAA;EAuCoB;;;;+BA7NN,cAAc;ECc7B;AAiEhB;;;EAEY,OAAA,cAAA,CAAA,IAAA,EDzEkB,UCyElB,CAAA,EDzE+B,OCyE/B;EACT;;;;;ACxGH;;EACY,KAAA,CAAA,CAAA,EAAA,IAAA;EAET;;AAYH;EACQ,IAAA,UAAA,CAAA,CAAA,EFuEY,UEvEZ,GAAA,IAAA;EAAI;;;;;;;;;;;;;kCFyIsB;;;;;;;;;iBAmDjB;;;;;;;;;;qCAuCoB;;;;;;;;;;AArNP,iBCMd,oBAAA,CDNc,GAAA,ECOvB,ODPuB,EAAA,OAAA,CAAA,ECQnB,cDRmB,CAAA,EAAA,MAAA;;;ACM9B;AAiEA;;;;AAGG,iBAHmB,OAAA,CAGnB,KAAA,EAAA,MAAA,GAFe,WAEf,GAF6B,UAE7B,EAAA,OAAA,CAAA,EADS,cACT,CAAA,EAAA,OAAA,CAAA,MAAA,CAAA;;;AFxDH;AAOA;;;;AC1DA;AAS4B,iBENN,eAAA,CFMM,IAAA,EELpB,IFKoB,EAAA,OAAA,CAAA,EELhB,cFKgB,CAAA,EEHzB,OFGyB,CAAA,MAAA,CAAA;;;;;;;AAwBe,iBEfrB,eAAA,CFeqB,IAAA,EEdnC,IFcmC,EAAA,OAAA,CAAA,EEd/B,cFc+B,CAAA,EEZxC,OFYwC,CAAA,MAAA,CAAA"}
@@ -0,0 +1,27 @@
+ import { a as HWPFile, n as convertHwpToMarkdown, t as convert } from "./converter-D6LrZNSL.mjs";
+
+ //#region src/browser.ts
+ /**
+  * Convert HWP from Blob (browser File API)
+  * @param blob - Blob or File object
+  * @param options - Conversion options
+  * @returns Markdown content
+  */
+ async function convertFromBlob(blob, options) {
+   const arrayBuffer = await blob.arrayBuffer();
+   const { convert: convert$1 } = await import("./converter-C0C25ssg.mjs");
+   return convert$1(arrayBuffer, options);
+ }
+ /**
+  * Convert HWP from File input (browser)
+  * @param file - File object from input element
+  * @param options - Conversion options
+  * @returns Markdown content
+  */
+ async function convertFromFile(file, options) {
+   return convertFromBlob(file, options);
+ }
+
+ //#endregion
+ export { HWPFile, convert, convertFromBlob, convertFromFile, convertHwpToMarkdown };
+ //# sourceMappingURL=browser.mjs.map
@@ -0,0 +1 @@
+ {"version":3,"file":"browser.mjs","names":["convert"],"sources":["../src/browser.ts"],"sourcesContent":["/**\n * HWP2MD - Browser Entry Point\n * Browser-specific entry point (no Node.js dependencies)\n */\n\n// Core exports for browser\nexport { HWPFile } from \"./parser\";\nexport { convert, convertHwpToMarkdown } from \"./converter\";\nexport type * from \"./types\";\n\n/**\n * Convert HWP from Blob (browser File API)\n * @param blob - Blob or File object\n * @param options - Conversion options\n * @returns Markdown content\n */\nexport async function convertFromBlob(\n blob: Blob,\n options?: import(\"./types\").ConvertOptions,\n): Promise<string> {\n const arrayBuffer = await blob.arrayBuffer();\n const { convert } = await import(\"./converter\");\n return convert(arrayBuffer, options);\n}\n\n/**\n * Convert HWP from File input (browser)\n * @param file - File object from input element\n * @param options - Conversion options\n * @returns Markdown content\n */\nexport async function convertFromFile(\n file: File,\n options?: import(\"./types\").ConvertOptions,\n): Promise<string> {\n return convertFromBlob(file, options);\n}\n"],"mappings":";;;;;;;;;AAgBA,eAAsB,gBACpB,MACA,SACiB;CACjB,MAAM,cAAc,MAAM,KAAK,aAAa;CAC5C,MAAM,EAAE,uBAAY,MAAM,OAAO;AACjC,QAAOA,UAAQ,aAAa,QAAQ;;;;;;;;AAStC,eAAsB,gBACpB,MACA,SACiB;AACjB,QAAO,gBAAgB,MAAM,QAAQ"}
@@ -0,0 +1 @@
+ export { };