@heripo/pdf-parser 0.1.0
- package/LICENSE +201 -0
- package/NOTICE +37 -0
- package/README.ko.md +426 -0
- package/README.md +426 -0
- package/dist/chunk-JVYF2SQS.js +495 -0
- package/dist/chunk-JVYF2SQS.js.map +1 -0
- package/dist/chunk-VUNV25KB.js +16 -0
- package/dist/chunk-VUNV25KB.js.map +1 -0
- package/dist/index.cjs +1323 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +129 -0
- package/dist/index.d.ts +129 -0
- package/dist/index.js +1295 -0
- package/dist/index.js.map +1 -0
- package/dist/token-HEEJ7XHP.js +63 -0
- package/dist/token-HEEJ7XHP.js.map +1 -0
- package/dist/token-util-MKHXD2JQ.js +6 -0
- package/dist/token-util-MKHXD2JQ.js.map +1 -0
- package/package.json +96 -0
package/README.md
ADDED
@@ -0,0 +1,426 @@
# @heripo/pdf-parser

> PDF parsing library with OCR support, built on the Docling SDK

**English** | [한국어](./README.ko.md)

> **Note**: Please check the [root README](../../README.md) first for the project overview, installation instructions, and roadmap.

`@heripo/pdf-parser` is a library for parsing PDF documents and running OCR on them, built on the Docling SDK. It is designed to handle documents with complex layouts, such as archaeological excavation reports.

## Table of Contents

- [Key Features](#key-features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [Why macOS Only?](#why-macos-only)
- [System Dependencies Details](#system-dependencies-details)
- [API Documentation](#api-documentation)
- [Troubleshooting](#troubleshooting)
- [Linux Support Status](#linux-support-status)
- [Related Packages](#related-packages)
- [License](#license)
- [Contributing](#contributing)
- [Issues and Support](#issues-and-support)
- [Project-Wide Information](#project-wide-information)

## Key Features

- **High-Quality OCR**: Document recognition using the Docling SDK
- **Apple Silicon Optimized**: GPU acceleration on M1/M2/M3/M4/M5 chips
- **Automatic Environment Setup**: Creates the Python virtual environment and installs docling-serve automatically
- **Image Extraction**: Automatic extraction and saving of images from PDFs
- **Flexible Configuration**: OCR, format, threading options, and other detailed settings

## Prerequisites

### System Requirements

- **macOS** with Apple Silicon (M1 or later) - Recommended for optimal performance
- **macOS** with Intel - Supported but slower
- **Linux/Windows** - Currently not supported

### Required Dependencies

#### 1. Node.js >= 22.0.0

```bash
brew install node
```

#### 2. pnpm >= 9.0.0

```bash
npm install -g pnpm
```

#### 3. Python 3.9 - 3.12

⚠️ **Important**: Python 3.13+ is not supported. Some Docling SDK dependencies are not compatible with Python 3.13.

```bash
# Install Python 3.11 (recommended)
brew install python@3.11

# Verify version
python3.11 --version
```
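
The supported range above can also be checked in code before a parser is constructed. The following is a minimal sketch, assuming the interpreter reports its version in the usual `Python X.Y.Z` form; `isSupportedPython` is a hypothetical helper and is not part of `@heripo/pdf-parser`.

```typescript
// Hypothetical helper, not part of @heripo/pdf-parser.
// Takes the text printed by `python3 --version` (for example "Python 3.11.4")
// and checks it against the 3.9 - 3.12 range documented above.
function isSupportedPython(versionOutput: string): boolean {
  const match = /^Python (\d+)\.(\d+)/.exec(versionOutput.trim());
  if (match === null) {
    return false; // Unrecognized output is treated as unsupported
  }
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return major === 3 && minor >= 9 && minor <= 12;
}

console.log(isSupportedPython('Python 3.11.4')); // true
console.log(isSupportedPython('Python 3.13.0')); // false
```

How the version string is obtained (for example by spawning the interpreter) is left out here on purpose, since it depends on your environment.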

#### 4. jq (JSON processing tool)

```bash
brew install jq
```

#### 5. lsof (port management)

Installed by default on macOS. Verify:

```bash
which lsof
```

### First Run Setup

The first time you use `@heripo/pdf-parser`, it automatically:

1. Creates a Python virtual environment at `~/.heripo/pdf-parser/venv`
2. Installs `docling-serve` and its dependencies
3. Starts a `docling-serve` process on a local port

This setup runs only once and may take 5-10 minutes, depending on your internet connection.

## Installation

```bash
# Install with npm
npm install @heripo/pdf-parser

# Install with pnpm
pnpm add @heripo/pdf-parser

# Install with yarn
yarn add @heripo/pdf-parser
```

## Usage

### Basic Usage

```typescript
import { Logger } from '@heripo/logger';
import { PDFParser } from '@heripo/pdf-parser';

const logger = Logger(...);

// Create PDFParser instance
const pdfParser = new PDFParser({
  pythonPath: 'python3.11', // Python executable path
  logger,
});

// Initialize (environment setup and start docling-serve)
await pdfParser.init();

// Parse PDF
const outputPath = await pdfParser.parse(
  'path/to/report.pdf', // Input PDF file
  'output-directory', // Output directory
  (resultPath) => {
    // Conversion complete callback
    console.log('PDF conversion complete:', resultPath);
  },
);

// Use results
console.log('Output path:', outputPath);
```

### Advanced Options

```typescript
const pdfParser = new PDFParser({
  pythonPath: 'python3.11',
  logger,

  // docling-serve configuration
  port: 5001, // Port to use (default: 5001)
  timeout: 10000000, // Timeout in milliseconds (10,000,000 ms is roughly 2.8 hours)

  // Use external docling-serve
  externalDoclingUrl: 'http://localhost:5000', // When using external server
});

// Specify conversion options
await pdfParser.parse('input.pdf', 'output', (result) => console.log(result), {
  // OCR settings
  doOcr: true, // Enable OCR (default: true)

  // Output formats
  formats: ['docling_json', 'md'], // Select output formats

  // PDF backend
  pdfBackend: 'dlparse_v2', // Default: 'dlparse_v2'
});
```
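
One way to choose between a locally managed docling-serve and an external one is to build the constructor options from the environment. This is only a sketch: `buildServerOptions` and the variable names `DOCLING_URL` and `DOCLING_PORT` are made up for illustration, and it assumes that `port` can be left out when `externalDoclingUrl` is set, which this README does not state.

```typescript
// Mirrors the documented constructor options, minus `logger`,
// so the sketch stays self-contained.
interface ServerOptions {
  pythonPath?: string;
  port?: number;
  timeout?: number;
  externalDoclingUrl?: string;
}

// Hypothetical helper. DOCLING_URL and DOCLING_PORT are illustrative names only.
function buildServerOptions(
  env: Record<string, string | undefined>,
): ServerOptions {
  const base: ServerOptions = {
    pythonPath: 'python3.11',
    timeout: 30 * 60 * 1000, // 30 minutes, expressed in milliseconds
  };

  const externalUrl = env['DOCLING_URL'];
  if (externalUrl !== undefined && externalUrl !== '') {
    // Assumption: with an external server the local port is not needed
    return { ...base, externalDoclingUrl: externalUrl };
  }

  // Fall back to the documented default port when nothing valid is given
  const parsedPort = Number(env['DOCLING_PORT']);
  const port =
    Number.isInteger(parsedPort) && parsedPort > 0 ? parsedPort : 5001;
  return { ...base, port };
}

console.log(buildServerOptions({}));
console.log(buildServerOptions({ DOCLING_URL: 'http://localhost:5000' }));
```

The returned object can be spread into the `PDFParser` constructor together with a `logger`. Passing the environment as a parameter, rather than reading it inside the function, keeps the helper easy to test.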

### Image Extraction

Automatically extracts images from PDFs:

```typescript
const outputPath = await pdfParser.parse(
  'report.pdf',
  'output',
  (resultPath) => {
    // Images are saved in output/images/ directory
    console.log('Image extraction complete:', resultPath);
  },
);
```

### Resource Cleanup

Clean up resources after work is complete:

```typescript
// Terminate docling-serve process
await pdfParser.shutdown();
```
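
To make sure `shutdown()` also runs when parsing fails, the calls can be wrapped in `try`/`finally`. The sketch below uses a hypothetical `withParser` helper and a stand-in object instead of a real `PDFParser`, so it runs without docling-serve installed. It relies only on the `init()` and `shutdown()` signatures documented in this README.

```typescript
// Minimal structural type covering the documented lifecycle methods.
interface ParserLifecycle {
  init(): Promise<void>;
  shutdown(): Promise<void>;
}

// Hypothetical helper: runs `work` between init() and shutdown(),
// and calls shutdown() even when `work` throws.
async function withParser<P extends ParserLifecycle, T>(
  parser: P,
  work: (parser: P) => Promise<T>,
): Promise<T> {
  await parser.init();
  try {
    return await work(parser);
  } finally {
    await parser.shutdown();
  }
}

// Stand-in object that records the order of calls.
const calls: string[] = [];
const fakeParser: ParserLifecycle = {
  init: async () => {
    calls.push('init');
  },
  shutdown: async () => {
    calls.push('shutdown');
  },
};

const demo = withParser(fakeParser, async () => {
  calls.push('work');
  return 'done';
});
demo.then((result) => console.log(result, calls));
```

With the real class, the same helper would be called with a `PDFParser` instance and a callback that invokes `parse(...)`. Note that if `init()` itself rejects, this sketch does not call `shutdown()`.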

## Why macOS Only?

`@heripo/pdf-parser` **intentionally targets macOS**. The key reason for this decision is the **Docling SDK's local OCR performance**.

### OCR Selection Background

Archaeological excavation report PDFs have the following characteristics:

- Scanned documents spanning hundreds of pages
- Layouts containing complex tables, diagrams, and photographs
- Precise text extraction is essential

### OCR Option Comparison

| Method                  | Performance | Cost | Description                                                |
| ----------------------- | ----------- | ---- | ---------------------------------------------------------- |
| **Docling (Local)**     | ★★★★★       | Free | Excellent performance on Apple Silicon with GPU acceleration |
| Cloud OCR (Google, AWS) | ★★★★        | $$$  | Tens of dollars per few hundred pages                      |
| Tesseract (Local)       | ★★          | Free | Low Korean recognition rate, lacks layout analysis         |

### Key Advantages

- **Cost**: Free to run locally, whereas cloud OCR costs tens of dollars per few hundred pages
- **Performance**: Fast processing with GPU acceleration on Apple Silicon (M1 or later)
- **Quality**: Accurate recognition even for complex documents
- **Privacy**: Documents are not sent to external servers

### Trade-off

- Optimal performance only in a macOS + Apple Silicon environment
- No current plans for Linux/Windows support (see "Linux Support Status" below)

## System Dependencies Details

`@heripo/pdf-parser` requires the following system-level dependencies:

| Dependency | Required Version | Installation               | Purpose                                     |
| ---------- | ---------------- | -------------------------- | ------------------------------------------- |
| Python     | 3.9 - 3.12       | `brew install python@3.11` | Docling SDK runtime                         |
| jq         | Any              | `brew install jq`          | JSON processing (conversion result parsing) |
| lsof       | Any              | Included with macOS        | docling-serve port management               |

> ⚠️ **Python 3.13+ is not supported.** Some Docling SDK dependencies are not compatible with Python 3.13.

### Checking Python Version

```bash
# Check installed Python version
python3 --version
python3.11 --version

# When multiple versions are installed
ls -la /opt/homebrew/bin/python* # Homebrew prefix on Apple Silicon
ls -la /usr/local/bin/python* # Homebrew prefix on Intel
```

### Checking jq Installation

```bash
# Check jq version
jq --version

# Check jq path
which jq
```

## API Documentation

### PDFParser Class

#### Constructor Options

```typescript
interface PDFParserOptions {
  pythonPath?: string; // Python executable path (default: 'python3')
  logger?: Logger; // Logger instance
  port?: number; // docling-serve port (default: 5001)
  timeout?: number; // Timeout (milliseconds, default: 10000000)
  externalDoclingUrl?: string; // External docling-serve URL (optional)
}
```

#### Methods

##### `init(): Promise<void>`

Sets up the Python environment and starts docling-serve.

```typescript
await pdfParser.init();
```

##### `parse(inputPath, outputDir, callback, options?): Promise<string>`

Parses a PDF file.

**Parameters:**

- `inputPath` (string): Input PDF file path
- `outputDir` (string): Output directory path
- `callback` (function): Callback function called when the conversion completes
- `options` (ConversionOptions, optional): Conversion options

**Returns:**

- `Promise<string>`: Output file path
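
Because `parse()` returns a promise, several PDFs can be processed one after another with a plain loop. The following sketch uses a hypothetical `parseAll` helper and a stand-in parser. The per-file output directory layout and the `result.json` file name are assumptions made for illustration, and running the files sequentially is a conservative choice, not a documented requirement.

```typescript
// Structural type matching the documented parse() signature (options omitted).
interface ParsesPdf {
  parse(
    inputPath: string,
    outputDir: string,
    callback: (resultPath: string) => void,
  ): Promise<string>;
}

// Hypothetical helper: parses files one at a time, gives each its own
// output directory, and collects the returned output paths.
async function parseAll(
  parser: ParsesPdf,
  inputs: string[],
  outputRoot: string,
): Promise<string[]> {
  const results: string[] = [];
  let index = 0;
  for (const input of inputs) {
    const outputDir = `${outputRoot}/${index}`;
    const resultPath = await parser.parse(input, outputDir, () => {});
    results.push(resultPath);
    index += 1;
  }
  return results;
}

// Stand-in parser so the sketch runs without docling-serve.
const fakePdfParser: ParsesPdf = {
  parse: async (_inputPath, outputDir, callback) => {
    const resultPath = `${outputDir}/result.json`; // made-up file name
    callback(resultPath);
    return resultPath;
  },
};

const batch = parseAll(fakePdfParser, ['a.pdf', 'b.pdf'], 'output');
batch.then((paths) => console.log(paths));
```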

##### `shutdown(): Promise<void>`

Terminates the docling-serve process.

```typescript
await pdfParser.shutdown();
```

### ConversionOptions

```typescript
interface ConversionOptions {
  doOcr?: boolean; // Enable OCR (default: true)
  formats?: string[]; // Output formats (default: ['docling_json'])
  pdfBackend?: string; // PDF backend (default: 'dlparse_v2')
}
```
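
The defaults listed in the comments above can be made explicit when you want to log or inspect the effective options. This is a sketch of how such a merge might look on the caller's side: `resolveConversionOptions` is a hypothetical helper, and the library's own handling of defaults may differ.

```typescript
// Same shape as the documented ConversionOptions interface.
interface ConversionOptions {
  doOcr?: boolean;
  formats?: string[];
  pdfBackend?: string;
}

// Defaults taken from the interface comments in this README.
const DEFAULT_CONVERSION_OPTIONS: Required<ConversionOptions> = {
  doOcr: true,
  formats: ['docling_json'],
  pdfBackend: 'dlparse_v2',
};

// Hypothetical helper: fills in any option the caller left out.
function resolveConversionOptions(
  options: ConversionOptions = {},
): Required<ConversionOptions> {
  return {
    doOcr: options.doOcr ?? DEFAULT_CONVERSION_OPTIONS.doOcr,
    // Copy the array so callers cannot mutate the shared defaults
    formats: options.formats ?? [...DEFAULT_CONVERSION_OPTIONS.formats],
    pdfBackend: options.pdfBackend ?? DEFAULT_CONVERSION_OPTIONS.pdfBackend,
  };
}

console.log(resolveConversionOptions({ formats: ['docling_json', 'md'] }));
```

Using `??` rather than `||` matters for `doOcr`: an explicit `false` must survive the merge instead of being replaced by the default `true`.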

## Troubleshooting

### Python Version Error

**Symptom**: `Python version X.Y is not supported`

**Solution**:

```bash
# Install Python 3.11
brew install python@3.11
```

Then specify the interpreter explicitly when creating `PDFParser`:

```typescript
const pdfParser = new PDFParser({
  pythonPath: 'python3.11',
  logger,
});
```

### jq Not Found

**Symptom**: `Command not found: jq`

**Solution**:

```bash
brew install jq
```

### Port Conflict

**Symptom**: `Port 5001 is already in use`

**Solution**: Configure a different port. To see which process currently holds the port, run `lsof -i :5001`.

```typescript
// Use a different port
const pdfParser = new PDFParser({
  pythonPath: 'python3.11',
  port: 5002, // Specify different port
  logger,
});
```
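
If the port is not known to be free in advance, one option is to try a short list of candidates in order. The sketch below assumes that `init()` rejects when the port is busy, which this README implies through the symptom message but does not state. `initOnFirstFreePort` and the stand-in parser are hypothetical.

```typescript
// Minimal structural type: anything with the documented init() method.
interface Startable {
  init(): Promise<void>;
}

// Hypothetical helper: tries each candidate port in order and returns the
// first parser whose init() resolves, together with the port it used.
async function initOnFirstFreePort<P extends Startable>(
  createParser: (port: number) => P,
  candidatePorts: number[],
): Promise<{ parser: P; port: number }> {
  let lastError: unknown = new Error('No candidate ports given');
  for (const port of candidatePorts) {
    const parser = createParser(port);
    try {
      await parser.init();
      return { parser, port };
    } catch (error) {
      lastError = error; // Assumption: a busy port makes init() reject
    }
  }
  throw lastError;
}

// Stand-in factory that simulates port 5001 being taken.
const busyPorts: number[] = [5001];
const createFakeParser = (port: number): Startable => ({
  init: async () => {
    if (busyPorts.indexOf(port) !== -1) {
      throw new Error(`Port ${port} is already in use`);
    }
  },
});

const started = initOnFirstFreePort(createFakeParser, [5001, 5002, 5003]);
started.then(({ port }) => console.log('Started on port', port));
```

With the real class, the factory would construct a `PDFParser` with the given `port`. This sketch retries on any `init()` error, not only on port conflicts, so a production version would likely want to inspect the error before moving on.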

### docling-serve Start Failure

**Symptom**: `Failed to start docling-serve`

**Solution**:

```bash
# Recreate virtual environment
rm -rf ~/.heripo/pdf-parser/venv
# Run init() again
```

## Linux Support Status

Currently **macOS only**. Linux support is **not entirely ruled out**, but because of OCR performance and cost-efficiency concerns, **there are no specific plans at this time**.

| Platform              | Status       | Notes                                           |
| --------------------- | ------------ | ----------------------------------------------- |
| macOS + Apple Silicon | ✅ Supported | Optimal performance, GPU acceleration           |
| macOS + Intel         | ✅ Supported | No GPU acceleration                             |
| Linux                 | ❓ TBD       | No current plans due to performance/cost issues |
| Windows               | ❓ TBD       | WSL2 Linux approach possible                    |

### Reason for No Linux Support

The Docling SDK's local OCR achieves both performance and cost efficiency by using Apple Metal GPU acceleration on macOS. We have not yet found an OCR solution on Linux that provides equivalent performance and cost efficiency.

### Ideas Welcome

If you have ideas for supporting Linux while maintaining both performance and cost efficiency, please suggest them via [GitHub Discussions](https://github.com/heripo-lab/heripo-engine/discussions) or Issues. The following information is particularly helpful:

- Experience with Korean document OCR on Linux
- OCR solutions capable of handling complex layouts (tables, diagrams)
- Cost estimates for processing hundreds of pages

## Related Packages

- [@heripo/document-processor](../document-processor) - Document structure analysis and LLM processing
- [@heripo/model](../model) - Data models and type definitions

## License

This package is distributed under the [Apache License 2.0](../../LICENSE).

## Contributing

Contributions are always welcome! Please see the [Contributing Guide](../../CONTRIBUTING.md).

## Issues and Support

- **Bug Reports**: [GitHub Issues](https://github.com/heripo-lab/heripo-engine/issues)
- **Discussions**: [GitHub Discussions](https://github.com/heripo-lab/heripo-engine/discussions)
- **Security Vulnerabilities**: See [Security Policy](../../SECURITY.md)

## Project-Wide Information

For project-wide information not covered in this package, see the [root README](../../README.md):

- **Citation and Attribution**: Academic citation (BibTeX) and attribution methods
- **Contributing Guidelines**: Development guidelines, commit rules, PR procedures
- **Community**: Issue tracker, discussions, security policy
- **Roadmap**: Project development plans

---

**heripo lab** | [GitHub](https://github.com/heripo-lab) | [heripo engine](https://github.com/heripo-lab/heripo-engine)