file2md 1.4.53 → 1.4.55
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +402 -400
- package/dist/index.d.ts +5 -0
- package/dist/index.js +204 -132
- package/dist/parsers/docx-parser.d.ts +2 -1
- package/dist/parsers/docx-parser.js +54 -21
- package/dist/parsers/hwp-parser.d.ts +2 -1
- package/dist/parsers/hwp-parser.js +27 -11
- package/dist/parsers/pdf-parser.d.ts +4 -1
- package/dist/parsers/pdf-parser.js +5 -0
- package/dist/parsers/pptx-parser.d.ts +3 -1
- package/dist/parsers/pptx-parser.js +51 -15
- package/dist/parsers/xlsx-parser.d.ts +2 -1
- package/dist/parsers/xlsx-parser.js +62 -14
- package/dist/types/errors.d.ts +32 -0
- package/dist/types/errors.js +48 -0
- package/dist/types/interfaces.d.ts +16 -0
- package/dist/utils/chart-extractor.d.ts +6 -2
- package/dist/utils/chart-extractor.js +83 -3
- package/dist/utils/image-extractor.d.ts +2 -2
- package/dist/utils/image-extractor.js +27 -4
- package/dist/utils/layout-parser.d.ts +1 -1
- package/dist/utils/layout-parser.js +2 -2
- package/dist/utils/resource-monitor.d.ts +94 -0
- package/dist/utils/resource-monitor.js +232 -0
- package/dist/utils/secure-xml-parser.d.ts +50 -0
- package/dist/utils/secure-xml-parser.js +219 -0
- package/dist/utils/zip-security.d.ts +70 -0
- package/dist/utils/zip-security.js +217 -0
- package/package.json +15 -3
package/README.md
CHANGED
|
@@ -1,401 +1,403 @@
|
|
|
1
|
-
# file2md
|
|
2
|
-
|
|
3
|
-
[](https://badge.fury.io/js/file2md)
|
|
4
|
-
[](https://www.typescriptlang.org/)
|
|
5
|
-
[](https://opensource.org/licenses/MIT)
|
|
6
|
-
|
|
7
|
-
A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with **advanced layout preservation**, **image extraction**, **chart conversion**, and **Korean language support**.
|
|
8
|
-
|
|
9
|
-
**English** | [한국어](README.ko.md)
|
|
10
|
-
|
|
11
|
-
## ✨ Features
|
|
12
|
-
|
|
13
|
-
- 🔄 **Multiple Format Support**: PDF, DOCX, XLSX, PPTX, HWP, HWPX
|
|
14
|
-
- 🎨 **Layout Preservation**: Maintains document structure, tables, and formatting
|
|
15
|
-
- 🖼️ **Image Extraction**: Extract embedded images from DOCX, PPTX,
|
|
16
|
-
- 📊 **Chart Conversion**: Converts charts to Markdown tables
|
|
17
|
-
- 📝 **List & Table Support**: Proper nested lists and complex tables
|
|
18
|
-
- 🌏 **Korean Language Support**: Full support for HWP/HWPX Korean document formats
|
|
19
|
-
- 🔒 **Type Safety**: Full TypeScript support with comprehensive types
|
|
20
|
-
- ⚡ **Modern ESM**: ES2022 modules with CommonJS compatibility
|
|
21
|
-
- 🚀 **Zero Config**: Works out of the box
|
|
22
|
-
- 📄 **PDF Text Extraction**: Enhanced text extraction with layout detection
|
|
1
|
+
# file2md
|
|
2
|
+
|
|
3
|
+
[](https://badge.fury.io/js/file2md)
|
|
4
|
+
[](https://www.typescriptlang.org/)
|
|
5
|
+
[](https://opensource.org/licenses/MIT)
|
|
6
|
+
|
|
7
|
+
A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with **advanced layout preservation**, **image extraction**, **chart conversion**, and **Korean language support**.
|
|
8
|
+
|
|
9
|
+
**English** | [한국어](README.ko.md)
|
|
10
|
+
|
|
11
|
+
## ✨ Features
|
|
12
|
+
|
|
13
|
+
- 🔄 **Multiple Format Support**: PDF, DOCX, XLSX, PPTX, HWP, HWPX
|
|
14
|
+
- 🎨 **Layout Preservation**: Maintains document structure, tables, and formatting
|
|
15
|
+
- 🖼️ **Image Extraction**: Extract embedded images from DOCX, PPTX, HWP documents
|
|
16
|
+
- 📊 **Chart Conversion**: Converts charts to Markdown tables
|
|
17
|
+
- 📝 **List & Table Support**: Proper nested lists and complex tables
|
|
18
|
+
- 🌏 **Korean Language Support**: Full support for HWP/HWPX Korean document formats
|
|
19
|
+
- 🔒 **Type Safety**: Full TypeScript support with comprehensive types
|
|
20
|
+
- ⚡ **Modern ESM**: ES2022 modules with CommonJS compatibility
|
|
21
|
+
- 🚀 **Zero Config**: Works out of the box
|
|
22
|
+
- 📄 **PDF Text Extraction**: Enhanced text extraction with layout detection
|
|
23
|
+
|
|
24
|
+
> **Note**: XLSX image extraction is planned but not yet supported.
|
|
25
|
+
|
|
26
|
+
## 📦 Installation
|
|
27
|
+
|
|
28
|
+
```bash
|
|
29
|
+
npm install file2md
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
## 🚀 Quick Start
|
|
33
|
+
|
|
34
|
+
### TypeScript / ES Modules
|
|
35
|
+
|
|
36
|
+
```typescript
|
|
37
|
+
import { convert } from 'file2md';
|
|
38
|
+
|
|
39
|
+
// Convert from file path
|
|
40
|
+
const pdfResult = await convert('./document.pdf');
|
|
41
|
+
console.log(pdfResult.markdown);
|
|
42
|
+
|
|
43
|
+
// Convert with options
|
|
44
|
+
const result = await convert('./presentation.pptx', {
|
|
45
|
+
imageDir: 'extracted-images',
|
|
46
|
+
preserveLayout: true,
|
|
47
|
+
extractCharts: true,
|
|
48
|
+
extractImages: true
|
|
49
|
+
});
|
|
50
|
+
|
|
51
|
+
console.log(`✅ Converted successfully!`);
|
|
52
|
+
console.log(`📄 Markdown length: ${result.markdown.length}`);
|
|
53
|
+
console.log(`🖼️ Images extracted: ${result.images.length}`);
|
|
54
|
+
console.log(`📊 Charts found: ${result.charts.length}`);
|
|
55
|
+
console.log(`⏱️ Processing time: ${result.metadata.processingTime}ms`);
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
### Korean Document Support (HWP/HWPX)
|
|
59
|
+
|
|
60
|
+
```typescript
|
|
61
|
+
import { convert } from 'file2md';
|
|
62
|
+
|
|
63
|
+
// Convert Korean HWP document
|
|
64
|
+
const hwpResult = await convert('./document.hwp', {
|
|
65
|
+
imageDir: 'hwp-images',
|
|
66
|
+
preserveLayout: true,
|
|
67
|
+
extractImages: true
|
|
68
|
+
});
|
|
69
|
+
|
|
70
|
+
// Convert Korean HWPX document (XML-based format)
|
|
71
|
+
const hwpxResult = await convert('./document.hwpx', {
|
|
72
|
+
imageDir: 'hwpx-images',
|
|
73
|
+
preserveLayout: true,
|
|
74
|
+
extractImages: true
|
|
75
|
+
});
|
|
76
|
+
|
|
77
|
+
console.log(`🇰🇷 HWP content: ${hwpResult.markdown.substring(0, 100)}...`);
|
|
78
|
+
console.log(`📄 HWPX pages: ${hwpxResult.metadata.pageCount}`);
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
### CommonJS
|
|
82
|
+
|
|
83
|
+
```javascript
|
|
84
|
+
const { convert } = require('file2md');
|
|
85
|
+
|
|
86
|
+
convert('./document.docx') // CommonJS has no top-level await, so chain on the promise
|
|
87
|
+
  .then((result) => console.log(result.markdown));
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
### From Buffer
|
|
91
|
+
|
|
92
|
+
```typescript
|
|
93
|
+
import { convert } from 'file2md';
|
|
94
|
+
import { readFile } from 'fs/promises';
|
|
95
|
+
|
|
96
|
+
const buffer = await readFile('./document.xlsx');
|
|
97
|
+
const result = await convert(buffer, {
|
|
98
|
+
imageDir: 'spreadsheet-images'
|
|
99
|
+
});
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
## 📋 API Reference
|
|
103
|
+
|
|
104
|
+
### `convert(input, options?)`
|
|
105
|
+
|
|
106
|
+
**Parameters:**
|
|
107
|
+
- `input: string | Buffer` - File path or buffer containing document data
|
|
108
|
+
- `options?: ConvertOptions` - Conversion options
|
|
109
|
+
|
|
110
|
+
**Returns:** `Promise<ConversionResult>`
|
|
111
|
+
|
|
112
|
+
### Options
|
|
113
|
+
|
|
114
|
+
```typescript
|
|
115
|
+
interface ConvertOptions {
|
|
116
|
+
imageDir?: string; // Directory for extracted images (default: 'images')
|
|
117
|
+
outputDir?: string; // Output directory for slide screenshots (PPTX, falls back to imageDir)
|
|
118
|
+
preserveLayout?: boolean; // Maintain document layout (default: true)
|
|
119
|
+
extractCharts?: boolean; // Convert charts to tables (default: true)
|
|
120
|
+
extractImages?: boolean; // Extract embedded images (default: true)
|
|
121
|
+
maxPages?: number; // Max pages for PDFs (default: unlimited)
|
|
122
|
+
}
|
|
123
|
+
```
|
|
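The defaults documented in the `ConvertOptions` comments can be made explicit with a small resolver. This is a minimal sketch under stated assumptions, not file2md's internal code: `ResolvedOptions` and `resolveOptions` are hypothetical names, and `Infinity` is used here only as a stand-in for the "unlimited" page default.

```typescript
// Mirrors the ConvertOptions interface shown in the README.
interface ConvertOptions {
  imageDir?: string;
  outputDir?: string;
  preserveLayout?: boolean;
  extractCharts?: boolean;
  extractImages?: boolean;
  maxPages?: number;
}

// Hypothetical type: every option filled in, so callers never see undefined.
interface ResolvedOptions {
  imageDir: string;
  outputDir: string;
  preserveLayout: boolean;
  extractCharts: boolean;
  extractImages: boolean;
  maxPages: number; // Infinity stands in for "unlimited" (an assumption)
}

// Hypothetical helper: applies the defaults stated in the README comments.
function resolveOptions(options: ConvertOptions = {}): ResolvedOptions {
  const imageDir = options.imageDir ?? 'images';
  return {
    imageDir,
    // The README says outputDir falls back to imageDir.
    outputDir: options.outputDir ?? imageDir,
    preserveLayout: options.preserveLayout ?? true,
    extractCharts: options.extractCharts ?? true,
    extractImages: options.extractImages ?? true,
    maxPages: options.maxPages ?? Number.POSITIVE_INFINITY,
  };
}
```

Using `??` rather than `||` keeps an explicit `false` (for example `extractCharts: false`) from being overwritten by the default.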
124
|
+
|
|
125
|
+
### Result
|
|
126
|
+
|
|
127
|
+
```typescript
|
|
128
|
+
interface ConversionResult {
|
|
129
|
+
markdown: string; // Generated Markdown content
|
|
130
|
+
images: ImageData[]; // Extracted image information
|
|
131
|
+
charts: ChartData[]; // Extracted chart data
|
|
132
|
+
metadata: DocumentMetadata; // Document metadata with processing info
|
|
133
|
+
}
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
## 🎯 Format-Specific Features
|
|
137
|
+
|
|
138
|
+
### 📄 PDF
|
|
139
|
+
- ✅ **Text extraction** with layout enhancement
|
|
140
|
+
- ✅ **Table detection** and formatting
|
|
141
|
+
- ✅ **List recognition** (bullets, numbers)
|
|
142
|
+
- ✅ **Heading detection** (ALL CAPS, colons)
|
|
143
|
+
- ❌ **Image extraction** (text-only processing)
|
|
144
|
+
|
|
145
|
+
### 📝 DOCX
|
|
146
|
+
- ✅ **Heading hierarchy** (H1-H6)
|
|
147
|
+
- ✅ **Text formatting** (bold, italic)
|
|
148
|
+
- ✅ **Complex tables** with merged cells
|
|
149
|
+
- ✅ **Nested lists** with proper indentation
|
|
150
|
+
- ✅ **Embedded images** and charts
|
|
151
|
+
- ✅ **Cell styling** (alignment, colors)
|
|
152
|
+
- ✅ **Font size preservation** and formatting
|
|
153
|
+
|
|
154
|
+
### 📊 XLSX
|
|
155
|
+
- ✅ **Multiple worksheets** as separate sections
|
|
156
|
+
- ✅ **Cell formatting** (bold, colors, alignment)
|
|
157
|
+
- ✅ **Data type preservation**
|
|
158
|
+
- ✅ **Chart extraction** to data tables
|
|
159
|
+
- ✅ **Conditional formatting** notes
|
|
160
|
+
- ✅ **Shared strings** handling for large files
|
|
161
|
+
|
|
162
|
+
### 🎬 PPTX
|
|
163
|
+
- ✅ **Slide-by-slide** organization
|
|
164
|
+
- ✅ **Text positioning** and layout
|
|
165
|
+
- ✅ **Image placement** per slide
|
|
166
|
+
- ✅ **Table extraction** from slides
|
|
167
|
+
- ✅ **Multi-column layouts**
|
|
168
|
+
- ✅ **Title extraction** from document properties
|
|
169
|
+
- ✅ **Chart and image** inline embedding
|
|
170
|
+
|
|
171
|
+
### 🇰🇷 HWP (Korean)
|
|
172
|
+
- ✅ **Binary format** parsing using hwp.js
|
|
173
|
+
- ✅ **Korean text extraction** with proper encoding
|
|
174
|
+
- ✅ **Image extraction** from embedded content
|
|
175
|
+
- ✅ **Layout preservation** for Korean documents
|
|
176
|
+
- ✅ **Copyright message filtering** for clean output
|
|
177
|
+
|
|
178
|
+
### 🇰🇷 HWPX (Korean XML)
|
|
179
|
+
- ✅ **XML-based format** parsing with JSZip
|
|
180
|
+
- ✅ **Multiple section support** for large documents
|
|
181
|
+
- ✅ **Relationship mapping** for image references
|
|
182
|
+
- ✅ **OWPML structure** parsing
|
|
183
|
+
- ✅ **Enhanced Korean text** processing
|
|
184
|
+
- ✅ **BinData image extraction** from ZIP archive
|
|
185
|
+
|
|
186
|
+
## 🖼️ Image Handling
|
|
187
|
+
|
|
188
|
+
Images are automatically extracted and saved to the specified directory:
|
|
189
|
+
|
|
190
|
+
```typescript
|
|
191
|
+
const result = await convert('./presentation.pptx', {
|
|
192
|
+
imageDir: 'my-images'
|
|
193
|
+
});
|
|
194
|
+
|
|
195
|
+
// Result structure:
|
|
196
|
+
// my-images/
|
|
197
|
+
// ├── image_1.png
|
|
198
|
+
// ├── image_2.jpg
|
|
199
|
+
// └── chart_1.png
|
|
200
|
+
|
|
201
|
+
// Markdown will contain:
|
|
202
|
+
// 
|
|
203
|
+
```
|
|
204
|
+
|
|
205
|
+
**Note:** PDF files are processed as text-only. Use dedicated PDF tools for image extraction if needed.
|
|
206
|
+
|
|
207
|
+
## 📊 Chart Conversion
|
|
208
|
+
|
|
209
|
+
Charts are converted to Markdown tables:
|
|
210
|
+
|
|
211
|
+
```markdown
|
|
212
|
+
#### Chart 1: Sales Data
|
|
213
|
+
|
|
214
|
+
| Category | Q1 | Q2 | Q3 | Q4 |
|
|
215
|
+
| --- | --- | --- | --- | --- |
|
|
216
|
+
| Revenue | 100 | 150 | 200 | 250 |
|
|
217
|
+
| Profit | 20 | 30 | 45 | 60 |
|
|
218
|
+
```
|
|
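To make the table layout above concrete, the following sketch renders chart data into the same Markdown shape. It is illustrative only: `ChartSketch` and `chartToMarkdown` are hypothetical names, and the real `ChartData` type exported by file2md may be structured differently.

```typescript
// Hypothetical chart shape for illustration; not file2md's ChartData type.
interface ChartSketch {
  title: string;
  categories: string[]; // column headers, e.g. Q1..Q4
  series: { name: string; values: number[] }[]; // one table row per series
}

// Renders a heading, a header row, a divider row, and one row per series.
function chartToMarkdown(chart: ChartSketch, index: number): string {
  const heading = `#### Chart ${index}: ${chart.title}`;
  const header = `| ${['Category'].concat(chart.categories).join(' | ')} |`;
  const dividerCells = ['---'].concat(chart.categories.map(() => '---'));
  const divider = `| ${dividerCells.join(' | ')} |`;
  const rows = chart.series.map(
    (s) => `| ${[s.name].concat(s.values.map(String)).join(' | ')} |`
  );
  return [heading, '', header, divider].concat(rows).join('\n');
}
```

Called with the title `Sales Data`, categories `Q1` to `Q4`, and the `Revenue` and `Profit` series, this produces the same lines as the example table above.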
219
|
+
|
|
220
|
+
## 🛡️ Error Handling
|
|
221
|
+
|
|
222
|
+
```typescript
|
|
223
|
+
import {
|
|
224
|
+
convert,
|
|
225
|
+
UnsupportedFormatError,
|
|
226
|
+
FileNotFoundError,
|
|
227
|
+
ParseError
|
|
228
|
+
} from 'file2md';
|
|
229
|
+
|
|
230
|
+
try {
|
|
231
|
+
const result = await convert('./document.pdf');
|
|
232
|
+
} catch (error) {
|
|
233
|
+
if (error instanceof UnsupportedFormatError) {
|
|
234
|
+
console.error('Unsupported file format');
|
|
235
|
+
} else if (error instanceof FileNotFoundError) {
|
|
236
|
+
console.error('File not found');
|
|
237
|
+
} else if (error instanceof ParseError) {
|
|
238
|
+
console.error('Failed to parse document:', error.message);
|
|
239
|
+
}
|
|
240
|
+
}
|
|
241
|
+
```
|
|
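When the `try`/`catch` chain above is repeated in many call sites, one option is to fold it into a wrapper that returns a tagged result instead of throwing. The sketch below is an assumption-laden illustration: `tryConvert`, `Converter`, and `Outcome` are hypothetical names, the converter is passed in as a parameter rather than imported, and the failure `kind` is read from the standard `Error.name` property, which file2md's error classes may or may not set to their class name.

```typescript
// Any async function that turns an input path into a result.
type Converter<T> = (input: string) => Promise<T>;

// Tagged union: callers check `ok` before touching `value` or `message`.
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; kind: string; message: string };

// Hypothetical wrapper: never rejects, always resolves to an Outcome.
async function tryConvert<T>(run: Converter<T>, input: string): Promise<Outcome<T>> {
  try {
    return { ok: true, value: await run(input) };
  } catch (error) {
    // `error` is `unknown` in strict TypeScript, so narrow before reading fields.
    if (error instanceof Error) {
      return { ok: false, kind: error.name, message: error.message };
    }
    return { ok: false, kind: 'Unknown', message: String(error) };
  }
}
```

A caller could then write `const outcome = await tryConvert(convert, './document.pdf')` and branch on `outcome.ok`, keeping `instanceof` checks for the cases that need class-specific handling.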
242
|
+
|
|
243
|
+
## 🧪 Advanced Usage
|
|
244
|
+
|
|
245
|
+
### Batch Processing
|
|
246
|
+
|
|
247
|
+
```typescript
|
|
248
|
+
import { convert } from 'file2md';
|
|
249
|
+
import { readdir } from 'fs/promises';
|
|
250
|
+
|
|
251
|
+
async function convertFolder(folderPath: string) {
|
|
252
|
+
const files = await readdir(folderPath);
|
|
253
|
+
const results = [];
|
|
254
|
+
|
|
255
|
+
for (const file of files) {
|
|
256
|
+
if (file.match(/\.(pdf|docx|xlsx|pptx|hwp|hwpx)$/i)) {
|
|
257
|
+
try {
|
|
258
|
+
const result = await convert(`${folderPath}/${file}`, {
|
|
259
|
+
imageDir: 'batch-images',
|
|
260
|
+
extractImages: true
|
|
261
|
+
});
|
|
262
|
+
results.push({ file, success: true, result });
|
|
263
|
+
} catch (error) {
|
|
264
|
+
results.push({ file, success: false, error });
|
|
265
|
+
}
|
|
266
|
+
}
|
|
267
|
+
}
|
|
268
|
+
|
|
269
|
+
return results;
|
|
270
|
+
}
|
|
271
|
+
```
|
|
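The `results` array built by `convertFolder` mixes successes and failures, so a short summary step is often useful afterwards. This is a sketch only: `BatchEntry` and `summarize` are hypothetical names that mirror the `{ file, success, error }` objects pushed in the loop above.

```typescript
// Mirrors the objects pushed into `results` in the batch example.
interface BatchEntry {
  file: string;
  success: boolean;
  error?: unknown; // whatever the catch block received
}

// Hypothetical helper: counts successes and formats one line per failure.
function summarize(results: BatchEntry[]): { converted: number; failures: string[] } {
  const failures = results
    .filter((entry) => !entry.success)
    .map((entry) => {
      // Caught values are not guaranteed to be Error instances.
      const reason = entry.error instanceof Error ? entry.error.message : String(entry.error);
      return `${entry.file}: ${reason}`;
    });
  return { converted: results.length - failures.length, failures };
}
```

For example, `summarize(await convertFolder('./docs'))` would report how many files converted and list the reason for each one that did not.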
272
|
+
|
|
273
|
+
### Large Document Processing
|
|
274
|
+
|
|
275
|
+
```typescript
|
|
276
|
+
import { convert } from 'file2md';
|
|
277
|
+
|
|
278
|
+
// Optimize for large documents
|
|
279
|
+
const result = await convert('./large-document.pdf', {
|
|
280
|
+
maxPages: 50, // Limit PDF processing
|
|
281
|
+
preserveLayout: true // Keep layout analysis
|
|
282
|
+
});
|
|
283
|
+
|
|
284
|
+
// Enhanced PPTX processing
|
|
285
|
+
const pptxResult = await convert('./presentation.pptx', {
|
|
286
|
+
outputDir: 'slides', // Separate directory for slides
|
|
287
|
+
extractCharts: true, // Extract chart data
|
|
288
|
+
extractImages: true // Extract embedded images
|
|
289
|
+
});
|
|
290
|
+
|
|
291
|
+
// Performance metrics are available in metadata
|
|
292
|
+
console.log('Performance Metrics:');
|
|
293
|
+
console.log(`- Processing time: ${result.metadata.processingTime}ms`);
|
|
294
|
+
console.log(`- Pages processed: ${result.metadata.pageCount}`);
|
|
295
|
+
console.log(`- Images extracted: ${result.metadata.imageCount}`);
|
|
296
|
+
console.log(`- File type: ${result.metadata.fileType}`);
|
|
297
|
+
```
|
|
298
|
+
|
|
299
|
+
## 📊 Supported Formats
|
|
300
|
+
|
|
301
|
+
| Format | Extension | Layout | Images | Charts | Tables | Lists |
|
|
302
|
+
|--------|-----------|---------|---------|---------|---------|--------|
|
|
303
|
+
| PDF | `.pdf` | ✅ | ❌ | ❌ | ✅ | ✅ |
|
|
304
|
+
| Word | `.docx` | ✅ | ✅ | ✅ | ✅ | ✅ |
|
|
305
|
+
| Excel | `.xlsx` | ✅ | ❌ | ✅ | ✅ | ❌ |
|
|
306
|
+
| PowerPoint | `.pptx` | ✅ | ✅ | ✅ | ✅ | ❌ |
|
|
307
|
+
| HWP | `.hwp` | ✅ | ✅ | ❌ | ❌ | ✅ |
|
|
308
|
+
| HWPX | `.hwpx` | ✅ | ✅ | ❌ | ❌ | ✅ |
|
|
309
|
+
|
|
310
|
+
> **Note**: PDF processing focuses on text extraction with enhanced layout detection. For PDF image extraction, consider using dedicated PDF processing tools.
|
|
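The support table above can also be encoded as data, which lets calling code decide up front whether to request images or charts for a given file. This is a sketch transcribed from the table, not an API exposed by file2md: `Feature`, `SUPPORT`, and `supports` are hypothetical names.

```typescript
type Feature = 'layout' | 'images' | 'charts' | 'tables' | 'lists';

// One entry per row of the Supported Formats table (checked cells only).
const SUPPORT: Record<string, readonly Feature[]> = {
  pdf: ['layout', 'tables', 'lists'],
  docx: ['layout', 'images', 'charts', 'tables', 'lists'],
  xlsx: ['layout', 'charts', 'tables'],
  pptx: ['layout', 'images', 'charts', 'tables'],
  hwp: ['layout', 'images', 'lists'],
  hwpx: ['layout', 'images', 'lists'],
};

// Hypothetical helper: looks up a feature by file extension, case-insensitively.
function supports(fileName: string, feature: Feature): boolean {
  const ext = fileName.slice(fileName.lastIndexOf('.') + 1).toLowerCase();
  // hasOwnProperty guards against inherited keys such as "constructor".
  const features: readonly Feature[] = Object.prototype.hasOwnProperty.call(SUPPORT, ext)
    ? SUPPORT[ext]
    : [];
  return features.indexOf(feature) !== -1;
}
```

A caller might pass `extractImages: supports(file, 'images')` so that PDF and XLSX inputs skip image extraction that the table marks as unsupported.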
311
|
+
|
|
312
|
+
## 🌏 Korean Document Support
|
|
313
|
+
|
|
314
|
+
file2md includes comprehensive support for Korean document formats:
|
|
315
|
+
|
|
316
|
+
### HWP (한글)
|
|
317
|
+
- **Binary format** used by Hangul (한글) word processor
|
|
318
|
+
- **Legacy format** still widely used in Korean organizations
|
|
319
|
+
- **Full text extraction** with Korean character encoding
|
|
320
|
+
- **Image extraction** support
|
|
321
|
+
|
|
322
|
+
### HWPX (한글 XML)
|
|
323
|
+
- **Modern XML-based** format, successor to HWP
|
|
324
|
+
- **ZIP archive structure** with XML content files
|
|
325
|
+
- **Enhanced parsing** with relationship mapping
|
|
326
|
+
- **Multiple sections** and complex document support
|
|
327
|
+
|
|
328
|
+
### Usage Examples
|
|
329
|
+
|
|
330
|
+
```typescript
|
|
331
|
+
// Convert Korean documents
|
|
332
|
+
const koreanDocs = [
|
|
333
|
+
'report.hwp', // Legacy binary format
|
|
334
|
+
'document.hwpx', // Modern XML format
|
|
335
|
+
'presentation.pptx'
|
|
336
|
+
];
|
|
337
|
+
|
|
338
|
+
for (const doc of koreanDocs) {
|
|
339
|
+
const result = await convert(doc, {
|
|
340
|
+
imageDir: 'korean-docs-images',
|
|
341
|
+
preserveLayout: true
|
|
342
|
+
});
|
|
343
|
+
|
|
344
|
+
console.log(`📄 ${doc}: ${result.markdown.length} characters`);
|
|
345
|
+
console.log(`🖼️ Images: ${result.images.length}`);
|
|
346
|
+
console.log(`⏱️ Processed in ${result.metadata.processingTime}ms`);
|
|
347
|
+
}
|
|
348
|
+
```
|
|
349
|
+
|
|
350
|
+
## 🔧 Performance & Configuration
|
|
351
|
+
|
|
352
|
+
The library is optimized for performance with sensible defaults:
|
|
353
|
+
|
|
354
|
+
- **Zero configuration** - Works out of the box
|
|
355
|
+
- **Efficient processing** - Optimized for various document sizes
|
|
356
|
+
- **Memory management** - Proper cleanup of temporary resources
|
|
357
|
+
- **Type safety** - Full TypeScript support
|
|
358
|
+
|
|
359
|
+
Performance metrics are included in the conversion result for monitoring and optimization.
|
|
360
|
+
|
|
361
|
+
## 🤝 Contributing
|
|
362
|
+
|
|
363
|
+
Contributions are welcome! Please feel free to submit a Pull Request.
|
|
364
|
+
|
|
365
|
+
1. Fork the repository
|
|
366
|
+
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
|
|
367
|
+
3. Commit your changes (`git commit -m 'Add amazing feature'`)
|
|
368
|
+
4. Push to the branch (`git push origin feature/amazing-feature`)
|
|
369
|
+
5. Open a Pull Request
|
|
370
|
+
|
|
371
|
+
### Development Setup
|
|
372
|
+
|
|
373
|
+
```bash
|
|
374
|
+
# Clone the repository
|
|
375
|
+
git clone https://github.com/ricky-clevi/file2md.git
|
|
376
|
+
cd file2md
|
|
377
|
+
|
|
378
|
+
# Install dependencies
|
|
379
|
+
npm install
|
|
380
|
+
|
|
381
|
+
# Run tests
|
|
382
|
+
npm test
|
|
383
|
+
|
|
384
|
+
# Build the project
|
|
385
|
+
npm run build
|
|
386
|
+
|
|
387
|
+
# Run linting
|
|
388
|
+
npm run lint
|
|
389
|
+
```
|
|
390
|
+
|
|
391
|
+
## 📄 License
|
|
392
|
+
|
|
393
|
+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
394
|
+
|
|
395
|
+
## 🔗 Links
|
|
396
|
+
|
|
397
|
+
- [npm package](https://www.npmjs.com/package/file2md)
|
|
398
|
+
- [GitHub repository](https://github.com/ricky-clevi/file2md)
|
|
399
|
+
- [Issues & Bug Reports](https://github.com/ricky-clevi/file2md/issues)
|
|
400
|
+
|
|
401
|
+
---
|
|
402
|
+
|
|
401
403
|
**Made with ❤️ and TypeScript** • **🖼️ Enhanced with intelligent document parsing** • **🇰🇷 Korean document support**
|