file2md 1.4.53 → 1.4.55

package/README.md CHANGED
@@ -1,401 +1,403 @@
1
- # file2md
2
-
3
- [![npm version](https://badge.fury.io/js/file2md.svg)](https://badge.fury.io/js/file2md)
4
- [![TypeScript](https://img.shields.io/badge/TypeScript-Ready-blue.svg)](https://www.typescriptlang.org/)
5
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
6
-
7
- A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with **advanced layout preservation**, **image extraction**, **chart conversion**, and **Korean language support**.
8
-
9
- **English** | [한국어](README.ko.md)
10
-
11
- ## ✨ Features
12
-
13
- - 🔄 **Multiple Format Support**: PDF, DOCX, XLSX, PPTX, HWP, HWPX
14
- - 🎨 **Layout Preservation**: Maintains document structure, tables, and formatting
15
- - 🖼️ **Image Extraction**: Extract embedded images from DOCX, PPTX, XLSX, HWP documents
16
- - 📊 **Chart Conversion**: Converts charts to Markdown tables
17
- - 📝 **List & Table Support**: Proper nested lists and complex tables
18
- - 🌏 **Korean Language Support**: Full support for HWP/HWPX Korean document formats
19
- - 🔒 **Type Safety**: Full TypeScript support with comprehensive types
20
- - ⚡ **Modern ESM**: ES2022 modules with CommonJS compatibility
21
- - 🚀 **Zero Config**: Works out of the box
22
- - 📄 **PDF Text Extraction**: Enhanced text extraction with layout detection
23
-
24
- ## 📦 Installation
25
-
26
- ```bash
27
- npm install file2md
28
- ```
29
-
30
- ## 🚀 Quick Start
31
-
32
- ### TypeScript / ES Modules
33
-
34
- ```typescript
35
- import { convert } from 'file2md';
36
-
37
- // Convert from file path
38
- const result = await convert('./document.pdf');
39
- console.log(result.markdown);
40
-
41
- // Convert with options
42
- const result = await convert('./presentation.pptx', {
43
- imageDir: 'extracted-images',
44
- preserveLayout: true,
45
- extractCharts: true,
46
- extractImages: true
47
- });
48
-
49
- console.log(`✅ Converted successfully!`);
50
- console.log(`📄 Markdown length: ${result.markdown.length}`);
51
- console.log(`🖼️ Images extracted: ${result.images.length}`);
52
- console.log(`📊 Charts found: ${result.charts.length}`);
53
- console.log(`⏱️ Processing time: ${result.metadata.processingTime}ms`);
54
- ```
55
-
56
- ### Korean Document Support (HWP/HWPX)
57
-
58
- ```typescript
59
- import { convert } from 'file2md';
60
-
61
- // Convert Korean HWP document
62
- const hwpResult = await convert('./document.hwp', {
63
- imageDir: 'hwp-images',
64
- preserveLayout: true,
65
- extractImages: true
66
- });
67
-
68
- // Convert Korean HWPX document (XML-based format)
69
- const hwpxResult = await convert('./document.hwpx', {
70
- imageDir: 'hwpx-images',
71
- preserveLayout: true,
72
- extractImages: true
73
- });
74
-
75
- console.log(`🇰🇷 HWP content: ${hwpResult.markdown.substring(0, 100)}...`);
76
- console.log(`📄 HWPX pages: ${hwpResult.metadata.pageCount}`);
77
- ```
78
-
79
- ### CommonJS
80
-
81
- ```javascript
82
- const { convert } = require('file2md');
83
-
84
- const result = await convert('./document.docx');
85
- console.log(result.markdown);
86
- ```
87
-
88
- ### From Buffer
89
-
90
- ```typescript
91
- import { convert } from 'file2md';
92
- import { readFile } from 'fs/promises';
93
-
94
- const buffer = await readFile('./document.xlsx');
95
- const result = await convert(buffer, {
96
- imageDir: 'spreadsheet-images'
97
- });
98
- ```
99
-
100
- ## 📋 API Reference
101
-
102
- ### `convert(input, options?)`
103
-
104
- **Parameters:**
105
- - `input: string | Buffer` - File path or buffer containing document data
106
- - `options?: ConvertOptions` - Conversion options
107
-
108
- **Returns:** `Promise<ConversionResult>`
109
-
110
- ### Options
111
-
112
- ```typescript
113
- interface ConvertOptions {
114
- imageDir?: string; // Directory for extracted images (default: 'images')
115
- outputDir?: string; // Output directory for slide screenshots (PPTX, falls back to imageDir)
116
- preserveLayout?: boolean; // Maintain document layout (default: true)
117
- extractCharts?: boolean; // Convert charts to tables (default: true)
118
- extractImages?: boolean; // Extract embedded images (default: true)
119
- maxPages?: number; // Max pages for PDFs (default: unlimited)
120
- }
121
- ```
122
-
123
- ### Result
124
-
125
- ```typescript
126
- interface ConversionResult {
127
- markdown: string; // Generated Markdown content
128
- images: ImageData[]; // Extracted image information
129
- charts: ChartData[]; // Extracted chart data
130
- metadata: DocumentMetadata; // Document metadata with processing info
131
- }
132
- ```
133
-
134
- ## 🎯 Format-Specific Features
135
-
136
- ### 📄 PDF
137
- - ✅ **Text extraction** with layout enhancement
138
- - **Table detection** and formatting
139
- - ✅ **List recognition** (bullets, numbers)
140
- - ✅ **Heading detection** (ALL CAPS, colons)
141
- - **Image extraction** (text-only processing)
142
-
143
- ### 📝 DOCX
144
- - ✅ **Heading hierarchy** (H1-H6)
145
- - **Text formatting** (bold, italic)
146
- - ✅ **Complex tables** with merged cells
147
- - ✅ **Nested lists** with proper indentation
148
- - ✅ **Embedded images** and charts
149
- - ✅ **Cell styling** (alignment, colors)
150
- - ✅ **Font size preservation** and formatting
151
-
152
- ### 📊 XLSX
153
- - ✅ **Multiple worksheets** as separate sections
154
- - **Cell formatting** (bold, colors, alignment)
155
- - ✅ **Data type preservation**
156
- - ✅ **Chart extraction** to data tables
157
- - ✅ **Conditional formatting** notes
158
- - ✅ **Shared strings** handling for large files
159
-
160
- ### 🎬 PPTX
161
- - ✅ **Slide-by-slide** organization
162
- - **Text positioning** and layout
163
- - ✅ **Image placement** per slide
164
- - ✅ **Table extraction** from slides
165
- - ✅ **Multi-column layouts**
166
- - ✅ **Title extraction** from document properties
167
- - ✅ **Chart and image** inline embedding
168
-
169
- ### 🇰🇷 HWP (Korean)
170
- - ✅ **Binary format** parsing using hwp.js
171
- - **Korean text extraction** with proper encoding
172
- - ✅ **Image extraction** from embedded content
173
- - ✅ **Layout preservation** for Korean documents
174
- - ✅ **Copyright message filtering** for clean output
175
-
176
- ### 🇰🇷 HWPX (Korean XML)
177
- - ✅ **XML-based format** parsing with JSZip
178
- - **Multiple section support** for large documents
179
- - ✅ **Relationship mapping** for image references
180
- - ✅ **OWPML structure** parsing
181
- - ✅ **Enhanced Korean text** processing
182
- - ✅ **BinData image extraction** from ZIP archive
183
-
184
- ## 🖼️ Image Handling
185
-
186
- Images are automatically extracted and saved to the specified directory:
187
-
188
- ```typescript
189
- const result = await convert('./presentation.pptx', {
190
- imageDir: 'my-images'
191
- });
192
-
193
- // Result structure:
194
- // my-images/
195
- // ├── image_1.png
196
- // ├── image_2.jpg
197
- // └── chart_1.png
198
-
199
- // Markdown will contain:
200
- // ![Slide 1 Image](my-images/image_1.png)
201
- ```
202
-
203
- **Note:** PDF files are processed as text-only. Use dedicated PDF tools for image extraction if needed.
204
-
205
- ## 📊 Chart Conversion
206
-
207
- Charts are converted to Markdown tables:
208
-
209
- ```markdown
210
- #### Chart 1: Sales Data
211
-
212
- | Category | Q1 | Q2 | Q3 | Q4 |
213
- | --- | --- | --- | --- | --- |
214
- | Revenue | 100 | 150 | 200 | 250 |
215
- | Profit | 20 | 30 | 45 | 60 |
216
- ```
217
-
218
- ## 🛡️ Error Handling
219
-
220
- ```typescript
221
- import {
222
- convert,
223
- UnsupportedFormatError,
224
- FileNotFoundError,
225
- ParseError
226
- } from 'file2md';
227
-
228
- try {
229
- const result = await convert('./document.pdf');
230
- } catch (error) {
231
- if (error instanceof UnsupportedFormatError) {
232
- console.error('Unsupported file format');
233
- } else if (error instanceof FileNotFoundError) {
234
- console.error('File not found');
235
- } else if (error instanceof ParseError) {
236
- console.error('Failed to parse document:', error.message);
237
- }
238
- }
239
- ```
240
-
241
- ## 🧪 Advanced Usage
242
-
243
- ### Batch Processing
244
-
245
- ```typescript
246
- import { convert } from 'file2md';
247
- import { readdir } from 'fs/promises';
248
-
249
- async function convertFolder(folderPath: string) {
250
- const files = await readdir(folderPath);
251
- const results = [];
252
-
253
- for (const file of files) {
254
- if (file.match(/\.(pdf|docx|xlsx|pptx|hwp|hwpx)$/i)) {
255
- try {
256
- const result = await convert(`${folderPath}/${file}`, {
257
- imageDir: 'batch-images',
258
- extractImages: true
259
- });
260
- results.push({ file, success: true, result });
261
- } catch (error) {
262
- results.push({ file, success: false, error });
263
- }
264
- }
265
- }
266
-
267
- return results;
268
- }
269
- ```
270
-
271
- ### Large Document Processing
272
-
273
- ```typescript
274
- import { convert } from 'file2md';
275
-
276
- // Optimize for large documents
277
- const result = await convert('./large-document.pdf', {
278
- maxPages: 50, // Limit PDF processing
279
- preserveLayout: true // Keep layout analysis
280
- });
281
-
282
- // Enhanced PPTX processing
283
- const pptxResult = await convert('./presentation.pptx', {
284
- outputDir: 'slides', // Separate directory for slides
285
- extractCharts: true, // Extract chart data
286
- extractImages: true // Extract embedded images
287
- });
288
-
289
- // Performance metrics are available in metadata
290
- console.log('Performance Metrics:');
291
- console.log(`- Processing time: ${result.metadata.processingTime}ms`);
292
- console.log(`- Pages processed: ${result.metadata.pageCount}`);
293
- console.log(`- Images extracted: ${result.metadata.imageCount}`);
294
- console.log(`- File type: ${result.metadata.fileType}`);
295
- ```
296
-
297
- ## 📊 Supported Formats
298
-
299
- | Format | Extension | Layout | Images | Charts | Tables | Lists |
300
- |--------|-----------|---------|---------|---------|---------|--------|
301
- | PDF | `.pdf` | | | | | |
302
- | Word | `.docx` | ✅ | ✅ | ✅ | ✅ | ✅ |
303
- | Excel | `.xlsx` | ✅ | ❌ | | ✅ | |
304
- | PowerPoint | `.pptx` | ✅ | ✅ | ✅ | ✅ | |
305
- | HWP | `.hwp` | ✅ | | | | |
306
- | HWPX | `.hwpx` | ✅ | ✅ | | ❌ | ✅ |
307
-
308
- > **Note**: PDF processing focuses on text extraction with enhanced layout detection. For PDF image extraction, consider using dedicated PDF processing tools.
309
-
310
- ## 🌏 Korean Document Support
311
-
312
- file2md includes comprehensive support for Korean document formats:
313
-
314
- ### HWP (한글)
315
- - **Binary format** used by Hangul (한글) word processor
316
- - **Legacy format** still widely used in Korean organizations
317
- - **Full text extraction** with Korean character encoding
318
- - **Image and chart** extraction support
319
-
320
- ### HWPX (한글 XML)
321
- - **Modern XML-based** format, successor to HWP
322
- - **ZIP archive structure** with XML content files
323
- - **Enhanced parsing** with relationship mapping
324
- - **Multiple sections** and complex document support
325
-
326
- ### Usage Examples
327
-
328
- ```typescript
329
- // Convert Korean documents
330
- const koreanDocs = [
331
- 'report.hwp', // Legacy binary format
332
- 'document.hwpx', // Modern XML format
333
- 'presentation.pptx'
334
- ];
335
-
336
- for (const doc of koreanDocs) {
337
- const result = await convert(doc, {
338
- imageDir: 'korean-docs-images',
339
- preserveLayout: true
340
- });
341
-
342
- console.log(`📄 ${doc}: ${result.markdown.length} characters`);
343
- console.log(`🖼️ Images: ${result.images.length}`);
344
- console.log(`⏱️ Processed in ${result.metadata.processingTime}ms`);
345
- }
346
- ```
347
-
348
- ## 🔧 Performance & Configuration
349
-
350
- The library is optimized for performance with sensible defaults:
351
-
352
- - **Zero configuration** - Works out of the box
353
- - **Efficient processing** - Optimized for various document sizes
354
- - **Memory management** - Proper cleanup of temporary resources
355
- - **Type safety** - Full TypeScript support
356
-
357
- Performance metrics are included in the conversion result for monitoring and optimization.
358
-
359
- ## 🤝 Contributing
360
-
361
- Contributions are welcome! Please feel free to submit a Pull Request.
362
-
363
- 1. Fork the repository
364
- 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
365
- 3. Commit your changes (`git commit -m 'Add amazing feature'`)
366
- 4. Push to the branch (`git push origin feature/amazing-feature`)
367
- 5. Open a Pull Request
368
-
369
- ### Development Setup
370
-
371
- ```bash
372
- # Clone the repository
373
- git clone https://github.com/ricky-clevi/file2md.git
374
- cd file2md
375
-
376
- # Install dependencies
377
- npm install
378
-
379
- # Run tests
380
- npm test
381
-
382
- # Build the project
383
- npm run build
384
-
385
- # Run linting
386
- npm run lint
387
- ```
388
-
389
- ## 📄 License
390
-
391
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
392
-
393
- ## 🔗 Links
394
-
395
- - [npm package](https://www.npmjs.com/package/file2md)
396
- - [GitHub repository](https://github.com/ricky-clevi/file2md)
397
- - [Issues & Bug Reports](https://github.com/ricky-clevi/file2md/issues)
398
-
399
- ---
400
-
1
+ # file2md
2
+
3
+ [![npm version](https://badge.fury.io/js/file2md.svg)](https://badge.fury.io/js/file2md)
4
+ [![TypeScript](https://img.shields.io/badge/TypeScript-Ready-blue.svg)](https://www.typescriptlang.org/)
5
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
6
+
7
+ A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with **advanced layout preservation**, **image extraction**, **chart conversion**, and **Korean language support**.
8
+
9
+ **English** | [한국어](README.ko.md)
10
+
11
+ ## ✨ Features
12
+
13
+ - 🔄 **Multiple Format Support**: PDF, DOCX, XLSX, PPTX, HWP, HWPX
14
+ - 🎨 **Layout Preservation**: Maintains document structure, tables, and formatting
15
+ - 🖼️ **Image Extraction**: Extract embedded images from DOCX, PPTX, HWP documents
16
+ - 📊 **Chart Conversion**: Converts charts to Markdown tables
17
+ - 📝 **List & Table Support**: Proper nested lists and complex tables
18
+ - 🌏 **Korean Language Support**: Full support for HWP/HWPX Korean document formats
19
+ - 🔒 **Type Safety**: Full TypeScript support with comprehensive types
20
+ - ⚡ **Modern ESM**: ES2022 modules with CommonJS compatibility
21
+ - 🚀 **Zero Config**: Works out of the box
22
+ - 📄 **PDF Text Extraction**: Enhanced text extraction with layout detection
23
+
24
+ > **Note**: XLSX image extraction is planned but not yet supported.
25
+
26
+ ## 📦 Installation
27
+
28
+ ```bash
29
+ npm install file2md
30
+ ```
31
+
32
+ ## 🚀 Quick Start
33
+
34
+ ### TypeScript / ES Modules
35
+
36
+ ```typescript
37
+ import { convert } from 'file2md';
38
+
39
+ // Convert from file path
40
+ const result = await convert('./document.pdf');
41
+ console.log(result.markdown);
42
+
43
+ // Convert with options
44
+ const result = await convert('./presentation.pptx', {
45
+ imageDir: 'extracted-images',
46
+ preserveLayout: true,
47
+ extractCharts: true,
48
+ extractImages: true
49
+ });
50
+
51
+ console.log(`✅ Converted successfully!`);
52
+ console.log(`📄 Markdown length: ${result.markdown.length}`);
53
+ console.log(`🖼️ Images extracted: ${result.images.length}`);
54
+ console.log(`📊 Charts found: ${result.charts.length}`);
55
+ console.log(`⏱️ Processing time: ${result.metadata.processingTime}ms`);
56
+ ```
57
+
58
+ ### Korean Document Support (HWP/HWPX)
59
+
60
+ ```typescript
61
+ import { convert } from 'file2md';
62
+
63
+ // Convert Korean HWP document
64
+ const hwpResult = await convert('./document.hwp', {
65
+ imageDir: 'hwp-images',
66
+ preserveLayout: true,
67
+ extractImages: true
68
+ });
69
+
70
+ // Convert Korean HWPX document (XML-based format)
71
+ const hwpxResult = await convert('./document.hwpx', {
72
+ imageDir: 'hwpx-images',
73
+ preserveLayout: true,
74
+ extractImages: true
75
+ });
76
+
77
+ console.log(`🇰🇷 HWP content: ${hwpResult.markdown.substring(0, 100)}...`);
78
+ console.log(`📄 HWPX pages: ${hwpxResult.metadata.pageCount}`);
79
+ ```
80
+
81
+ ### CommonJS
82
+
83
+ ```javascript
84
+ const { convert } = require('file2md');
85
+
86
+ const result = await convert('./document.docx');
87
+ console.log(result.markdown);
88
+ ```
89
+
90
+ ### From Buffer
91
+
92
+ ```typescript
93
+ import { convert } from 'file2md';
94
+ import { readFile } from 'fs/promises';
95
+
96
+ const buffer = await readFile('./document.xlsx');
97
+ const result = await convert(buffer, {
98
+ imageDir: 'spreadsheet-images'
99
+ });
100
+ ```
101
+
102
+ ## 📋 API Reference
103
+
104
+ ### `convert(input, options?)`
105
+
106
+ **Parameters:**
107
+ - `input: string | Buffer` - File path or buffer containing document data
108
+ - `options?: ConvertOptions` - Conversion options
109
+
110
+ **Returns:** `Promise<ConversionResult>`
111
+
112
+ ### Options
113
+
114
+ ```typescript
115
+ interface ConvertOptions {
116
+ imageDir?: string; // Directory for extracted images (default: 'images')
117
+ outputDir?: string; // Output directory for slide screenshots (PPTX, falls back to imageDir)
118
+ preserveLayout?: boolean; // Maintain document layout (default: true)
119
+ extractCharts?: boolean; // Convert charts to tables (default: true)
120
+ extractImages?: boolean; // Extract embedded images (default: true)
121
+ maxPages?: number; // Max pages for PDFs (default: unlimited)
122
+ }
123
+ ```
124
+
125
+ ### Result
126
+
127
+ ```typescript
128
+ interface ConversionResult {
129
+ markdown: string; // Generated Markdown content
130
+ images: ImageData[]; // Extracted image information
131
+ charts: ChartData[]; // Extracted chart data
132
+ metadata: DocumentMetadata; // Document metadata with processing info
133
+ }
134
+ ```
135
+
136
+ ## 🎯 Format-Specific Features
137
+
138
+ ### 📄 PDF
139
+ - ✅ **Text extraction** with layout enhancement
140
+ - ✅ **Table detection** and formatting
141
+ - **List recognition** (bullets, numbers)
142
+ - ✅ **Heading detection** (ALL CAPS, colons)
143
+ - **Image extraction** (text-only processing)
144
+
145
+ ### 📝 DOCX
146
+ - ✅ **Heading hierarchy** (H1-H6)
147
+ - ✅ **Text formatting** (bold, italic)
148
+ - ✅ **Complex tables** with merged cells
149
+ - ✅ **Nested lists** with proper indentation
150
+ - ✅ **Embedded images** and charts
151
+ - ✅ **Cell styling** (alignment, colors)
152
+ - **Font size preservation** and formatting
153
+
154
+ ### 📊 XLSX
155
+ - ✅ **Multiple worksheets** as separate sections
156
+ - ✅ **Cell formatting** (bold, colors, alignment)
157
+ - ✅ **Data type preservation**
158
+ - ✅ **Chart extraction** to data tables
159
+ - ✅ **Conditional formatting** notes
160
+ - **Shared strings** handling for large files
161
+
162
+ ### 🎬 PPTX
163
+ - ✅ **Slide-by-slide** organization
164
+ - ✅ **Text positioning** and layout
165
+ - ✅ **Image placement** per slide
166
+ - ✅ **Table extraction** from slides
167
+ - ✅ **Multi-column layouts**
168
+ - ✅ **Title extraction** from document properties
169
+ - **Chart and image** inline embedding
170
+
171
+ ### 🇰🇷 HWP (Korean)
172
+ - ✅ **Binary format** parsing using hwp.js
173
+ - ✅ **Korean text extraction** with proper encoding
174
+ - ✅ **Image extraction** from embedded content
175
+ - ✅ **Layout preservation** for Korean documents
176
+ - **Copyright message filtering** for clean output
177
+
178
+ ### 🇰🇷 HWPX (Korean XML)
179
+ - ✅ **XML-based format** parsing with JSZip
180
+ - ✅ **Multiple section support** for large documents
181
+ - ✅ **Relationship mapping** for image references
182
+ - ✅ **OWPML structure** parsing
183
+ - ✅ **Enhanced Korean text** processing
184
+ - **BinData image extraction** from ZIP archive
185
+
186
+ ## 🖼️ Image Handling
187
+
188
+ Images are automatically extracted and saved to the specified directory:
189
+
190
+ ```typescript
191
+ const result = await convert('./presentation.pptx', {
192
+ imageDir: 'my-images'
193
+ });
194
+
195
+ // Result structure:
196
+ // my-images/
197
+ // ├── image_1.png
198
+ // ├── image_2.jpg
199
+ // └── chart_1.png
200
+
201
+ // Markdown will contain:
202
+ // ![Slide 1 Image](my-images/image_1.png)
203
+ ```
204
+
205
+ **Note:** PDF files are processed as text-only. Use dedicated PDF tools for image extraction if needed.
206
+
207
+ ## 📊 Chart Conversion
208
+
209
+ Charts are converted to Markdown tables:
210
+
211
+ ```markdown
212
+ #### Chart 1: Sales Data
213
+
214
+ | Category | Q1 | Q2 | Q3 | Q4 |
215
+ | --- | --- | --- | --- | --- |
216
+ | Revenue | 100 | 150 | 200 | 250 |
217
+ | Profit | 20 | 30 | 45 | 60 |
218
+ ```
219
+
220
+ ## 🛡️ Error Handling
221
+
222
+ ```typescript
223
+ import {
224
+ convert,
225
+ UnsupportedFormatError,
226
+ FileNotFoundError,
227
+ ParseError
228
+ } from 'file2md';
229
+
230
+ try {
231
+ const result = await convert('./document.pdf');
232
+ } catch (error) {
233
+ if (error instanceof UnsupportedFormatError) {
234
+ console.error('Unsupported file format');
235
+ } else if (error instanceof FileNotFoundError) {
236
+ console.error('File not found');
237
+ } else if (error instanceof ParseError) {
238
+ console.error('Failed to parse document:', error.message);
239
+ }
240
+ }
241
+ ```
242
+
243
+ ## 🧪 Advanced Usage
244
+
245
+ ### Batch Processing
246
+
247
+ ```typescript
248
+ import { convert } from 'file2md';
249
+ import { readdir } from 'fs/promises';
250
+
251
+ async function convertFolder(folderPath: string) {
252
+ const files = await readdir(folderPath);
253
+ const results = [];
254
+
255
+ for (const file of files) {
256
+ if (file.match(/\.(pdf|docx|xlsx|pptx|hwp|hwpx)$/i)) {
257
+ try {
258
+ const result = await convert(`${folderPath}/${file}`, {
259
+ imageDir: 'batch-images',
260
+ extractImages: true
261
+ });
262
+ results.push({ file, success: true, result });
263
+ } catch (error) {
264
+ results.push({ file, success: false, error });
265
+ }
266
+ }
267
+ }
268
+
269
+ return results;
270
+ }
271
+ ```
272
+
273
+ ### Large Document Processing
274
+
275
+ ```typescript
276
+ import { convert } from 'file2md';
277
+
278
+ // Optimize for large documents
279
+ const result = await convert('./large-document.pdf', {
280
+ maxPages: 50, // Limit PDF processing
281
+ preserveLayout: true // Keep layout analysis
282
+ });
283
+
284
+ // Enhanced PPTX processing
285
+ const pptxResult = await convert('./presentation.pptx', {
286
+ outputDir: 'slides', // Separate directory for slides
287
+ extractCharts: true, // Extract chart data
288
+ extractImages: true // Extract embedded images
289
+ });
290
+
291
+ // Performance metrics are available in metadata
292
+ console.log('Performance Metrics:');
293
+ console.log(`- Processing time: ${result.metadata.processingTime}ms`);
294
+ console.log(`- Pages processed: ${result.metadata.pageCount}`);
295
+ console.log(`- Images extracted: ${result.metadata.imageCount}`);
296
+ console.log(`- File type: ${result.metadata.fileType}`);
297
+ ```
298
+
299
+ ## 📊 Supported Formats
300
+
301
+ | Format | Extension | Layout | Images | Charts | Tables | Lists |
302
+ |--------|-----------|---------|---------|---------|---------|--------|
303
+ | PDF | `.pdf` | ✅ | ❌ | | ✅ | |
304
+ | Word | `.docx` | ✅ | ✅ | ✅ | ✅ | |
305
+ | Excel | `.xlsx` | ✅ | | | | |
306
+ | PowerPoint | `.pptx` | ✅ | ✅ | ✅ | | ❌ |
307
+ | HWP | `.hwp` | ✅ | ✅ | ❌ | ❌ | ✅ |
308
+ | HWPX | `.hwpx` | ✅ | ✅ | ❌ | ❌ | ✅ |
309
+
310
+ > **Note**: PDF processing focuses on text extraction with enhanced layout detection. For PDF image extraction, consider using dedicated PDF processing tools.
311
+
312
+ ## 🌏 Korean Document Support
313
+
314
+ file2md includes comprehensive support for Korean document formats:
315
+
316
+ ### HWP (한글)
317
+ - **Binary format** used by Hangul (한글) word processor
318
+ - **Legacy format** still widely used in Korean organizations
319
+ - **Full text extraction** with Korean character encoding
320
+ - **Image and chart** extraction support
321
+
322
+ ### HWPX (한글 XML)
323
+ - **Modern XML-based** format, successor to HWP
324
+ - **ZIP archive structure** with XML content files
325
+ - **Enhanced parsing** with relationship mapping
326
+ - **Multiple sections** and complex document support
327
+
328
+ ### Usage Examples
329
+
330
+ ```typescript
331
+ // Convert Korean documents
332
+ const koreanDocs = [
333
+ 'report.hwp', // Legacy binary format
334
+ 'document.hwpx', // Modern XML format
335
+ 'presentation.pptx'
336
+ ];
337
+
338
+ for (const doc of koreanDocs) {
339
+ const result = await convert(doc, {
340
+ imageDir: 'korean-docs-images',
341
+ preserveLayout: true
342
+ });
343
+
344
+ console.log(`📄 ${doc}: ${result.markdown.length} characters`);
345
+ console.log(`🖼️ Images: ${result.images.length}`);
346
+ console.log(`⏱️ Processed in ${result.metadata.processingTime}ms`);
347
+ }
348
+ ```
349
+
350
+ ## 🔧 Performance & Configuration
351
+
352
+ The library is optimized for performance with sensible defaults:
353
+
354
+ - **Zero configuration** - Works out of the box
355
+ - **Efficient processing** - Optimized for various document sizes
356
+ - **Memory management** - Proper cleanup of temporary resources
357
+ - **Type safety** - Full TypeScript support
358
+
359
+ Performance metrics are included in the conversion result for monitoring and optimization.
360
+
361
+ ## 🤝 Contributing
362
+
363
+ Contributions are welcome! Please feel free to submit a Pull Request.
364
+
365
+ 1. Fork the repository
366
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
367
+ 3. Commit your changes (`git commit -m 'Add amazing feature'`)
368
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
369
+ 5. Open a Pull Request
370
+
371
+ ### Development Setup
372
+
373
+ ```bash
374
+ # Clone the repository
375
+ git clone https://github.com/ricky-clevi/file2md.git
376
+ cd file2md
377
+
378
+ # Install dependencies
379
+ npm install
380
+
381
+ # Run tests
382
+ npm test
383
+
384
+ # Build the project
385
+ npm run build
386
+
387
+ # Run linting
388
+ npm run lint
389
+ ```
390
+
391
+ ## 📄 License
392
+
393
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
394
+
395
+ ## 🔗 Links
396
+
397
+ - [npm package](https://www.npmjs.com/package/file2md)
398
+ - [GitHub repository](https://github.com/ricky-clevi/file2md)
399
+ - [Issues & Bug Reports](https://github.com/ricky-clevi/file2md/issues)
400
+
401
+ ---
402
+
401
403
  **Made with ❤️ and TypeScript** • **🖼️ Enhanced with intelligent document parsing** • **🇰🇷 Korean document support**
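
The 1.4.55 README above lists image extraction for DOCX, PPTX, HWP and HWPX, notes that XLSX image extraction is planned but not yet supported, and describes PDF processing as text-only. A caller might want to guard on that before requesting images. The following is a minimal sketch based only on those statements: `canExtractImages` and `imageOptionsFor` are hypothetical helpers, not part of file2md, and the returned object merely mirrors the `imageDir` and `extractImages` fields of the README's `ConvertOptions`.

```typescript
// Hypothetical helpers (not part of file2md). The format list encodes only
// what the 1.4.55 README states about image extraction per format.
const IMAGE_EXTRACTION_FORMATS = new Set<string>(['docx', 'pptx', 'hwp', 'hwpx']);

// Returns true when the README lists image extraction for the file's extension.
function canExtractImages(filePath: string): boolean {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return false; // no extension at all
  const ext = filePath.slice(dot + 1).toLowerCase();
  return IMAGE_EXTRACTION_FORMATS.has(ext);
}

// Builds an options object shaped like the README's ConvertOptions fields.
// For formats without documented image extraction (PDF, XLSX) it turns
// extractImages off and omits imageDir.
function imageOptionsFor(
  filePath: string,
  imageDir: string
): { imageDir?: string; extractImages: boolean } {
  return canExtractImages(filePath)
    ? { imageDir, extractImages: true }
    : { extractImages: false };
}
```

The result could be passed as the second argument to `convert`, for example `convert(path, imageOptionsFor(path, 'images'))`, assuming the options behave as the README describes.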