@heripo/pdf-parser 0.1.0

package/README.md ADDED
@@ -0,0 +1,426 @@
1
+ # @heripo/pdf-parser
2
+
3
+ > PDF parsing library - OCR support with Docling SDK
4
+
5
+ [![npm version](https://img.shields.io/npm/v/@heripo/pdf-parser.svg)](https://www.npmjs.com/package/@heripo/pdf-parser)
6
+ [![Node.js](https://img.shields.io/badge/Node.js-%3E%3D22-339933?logo=node.js&logoColor=white)](https://nodejs.org/)
7
+ [![Python](https://img.shields.io/badge/Python-3.9--3.12-3776AB?logo=python&logoColor=white)](https://www.python.org/)
8
+ ![coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)
9
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](../../LICENSE)
10
+
11
+ **English** | [한국어](./README.ko.md)
12
+
13
+ > **Note**: Please check the [root README](../../README.md) first for project overview, installation instructions, and roadmap.
14
+
15
+ `@heripo/pdf-parser` is a library for parsing PDF documents and running OCR on them, built on the Docling SDK. It is designed to handle documents with complex layouts, such as archaeological excavation reports.
16
+
17
+ ## Table of Contents
18
+
19
+ - [Key Features](#key-features)
20
+ - [Prerequisites](#prerequisites)
21
+ - [Installation](#installation)
22
+ - [Usage](#usage)
23
+ - [Why macOS Only?](#why-macos-only)
24
+ - [System Dependencies Details](#system-dependencies-details)
25
+ - [API Documentation](#api-documentation)
26
+ - [Troubleshooting](#troubleshooting)
27
+ - [License](#license)
28
+
29
+ ## Key Features
30
+
31
+ - **High-Quality OCR**: Document recognition using Docling SDK
32
+ - **Apple Silicon Optimized**: GPU acceleration on M1/M2/M3/M4/M5 chips
33
+ - **Automatic Environment Setup**: Automatic Python virtual environment and docling-serve installation
34
+ - **Image Extraction**: Automatic extraction and saving of images from PDFs
35
+ - **Flexible Configuration**: OCR, format, threading options, and other detailed settings
36
+
37
+ ## Prerequisites
38
+
39
+ ### System Requirements
40
+
41
+ - **macOS** with Apple Silicon (M1 or later) - Recommended for optimal performance
42
+ - **macOS** with Intel - Supported but slower
43
+ - **Linux/Windows** - Currently not supported
44
+
45
+ ### Required Dependencies
46
+
47
+ #### 1. Node.js >= 22.0.0
48
+
49
+ ```bash
50
+ brew install node
51
+ ```
52
+
53
+ #### 2. pnpm >= 9.0.0
54
+
55
+ ```bash
56
+ npm install -g pnpm
57
+ ```
58
+
59
+ #### 3. Python 3.9 - 3.12
60
+
61
+ ⚠️ **Important**: Python 3.13+ is not supported. Some Docling SDK dependencies are not compatible with Python 3.13.
62
+
63
+ ```bash
64
+ # Install Python 3.11 (recommended)
65
+ brew install python@3.11
66
+
67
+ # Verify version
68
+ python3.11 --version
69
+ ```
70
+
71
+ #### 4. jq (JSON processing tool)
72
+
73
+ ```bash
74
+ brew install jq
75
+ ```
76
+
77
+ #### 5. lsof (port management)
78
+
79
+ Installed by default on macOS. Verify:
80
+
81
+ ```bash
82
+ which lsof
83
+ ```
84
+
85
+ ### First Run Setup
86
+
87
+ When using `@heripo/pdf-parser` for the first time, it automatically:
88
+
89
+ 1. Creates a Python virtual environment at `~/.heripo/pdf-parser/venv`
90
+ 2. Installs `docling-serve` and its dependencies
91
+ 3. Starts a docling-serve process on a local port
92
+
93
+ This setup runs only once and may take 5-10 minutes, depending on your internet connection.
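To anticipate whether the next `init()` call is likely to trigger the long first-run setup, you can check for the virtual environment directory listed above. The sketch below is not part of `@heripo/pdf-parser`; `venvPath` and `isLikelyFirstRun` are hypothetical helper names, and the presence of the directory only suggests (it does not prove) that setup finished successfully.

```typescript
import { existsSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: builds the venv location documented in this README.
function venvPath(home: string = homedir()): string {
  return join(home, '.heripo', 'pdf-parser', 'venv');
}

// Hypothetical helper: true when the venv directory does not exist yet,
// which suggests init() will need to perform the one-time setup.
function isLikelyFirstRun(home: string = homedir()): boolean {
  return !existsSync(venvPath(home));
}

if (isLikelyFirstRun()) {
  console.log('First run detected: init() may take several minutes.');
}
```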
94
+
95
+ ## Installation
96
+
97
+ ```bash
98
+ # Install with npm
99
+ npm install @heripo/pdf-parser
100
+
101
+ # Install with pnpm
102
+ pnpm add @heripo/pdf-parser
103
+
104
+ # Install with yarn
105
+ yarn add @heripo/pdf-parser
106
+ ```
107
+
108
+ ## Usage
109
+
110
+ ### Basic Usage
111
+
112
+ ```typescript
113
+ import { Logger } from '@heripo/logger';
114
+ import { PDFParser } from '@heripo/pdf-parser';
115
+
116
+ const logger = Logger(...);
117
+
118
+ // Create PDFParser instance
119
+ const pdfParser = new PDFParser({
120
+ pythonPath: 'python3.11', // Python executable path
121
+ logger,
122
+ });
123
+
124
+ // Initialize (environment setup and start docling-serve)
125
+ await pdfParser.init();
126
+
127
+ // Parse PDF
128
+ const outputPath = await pdfParser.parse(
129
+ 'path/to/report.pdf', // Input PDF file
130
+ 'output-directory', // Output directory
131
+ (resultPath) => {
132
+ // Conversion complete callback
133
+ console.log('PDF conversion complete:', resultPath);
134
+ },
135
+ );
136
+
137
+ // Use results
138
+ console.log('Output path:', outputPath);
139
+ ```
140
+
141
+ ### Advanced Options
142
+
143
+ ```typescript
144
+ const pdfParser = new PDFParser({
145
+ pythonPath: 'python3.11',
146
+ logger,
147
+
148
+ // docling-serve configuration
149
+ port: 5001, // Port to use (default: 5001)
150
+ timeout: 10000000, // Timeout (milliseconds)
151
+
152
+ // Use external docling-serve
153
+ externalDoclingUrl: 'http://localhost:5000', // When using external server
154
+ });
155
+
156
+ // Specify conversion options
157
+ await pdfParser.parse('input.pdf', 'output', (result) => console.log(result), {
158
+ // OCR settings
159
+ doOcr: true, // Enable OCR (default: true)
160
+
161
+ // Output formats
162
+ formats: ['docling_json', 'md'], // Select output formats
163
+
164
+ // PDF parsing backend
165
+ pdfBackend: 'dlparse_v2', // PDF backend
166
+ });
167
+ ```
168
+
169
+ ### Image Extraction
170
+
171
+ Automatically extracts images from PDFs:
172
+
173
+ ```typescript
174
+ const outputPath = await pdfParser.parse(
175
+ 'report.pdf',
176
+ 'output',
177
+ (resultPath) => {
178
+ // Images are saved in output/images/ directory
179
+ console.log('Image extraction complete:', resultPath);
180
+ },
181
+ );
182
+ ```
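Once parsing has finished, the extracted images can be enumerated from the `images/` subdirectory mentioned in the comment above. This is a minimal sketch using only Node.js built-ins; `isImageFile` and `listExtractedImages` are hypothetical helpers, and the set of file extensions is an assumption because this README does not state which image formats are written.

```typescript
import { readdirSync } from 'node:fs';
import { extname, join } from 'node:path';

// Assumed extensions; adjust to whatever formats actually appear on disk.
const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg']);

// Hypothetical helper: decides by file extension only (case-insensitive).
function isImageFile(fileName: string): boolean {
  return IMAGE_EXTENSIONS.has(extname(fileName).toLowerCase());
}

// Hypothetical helper: returns sorted paths of image files under <outputDir>/images.
function listExtractedImages(outputDir: string): string[] {
  const imagesDir = join(outputDir, 'images');
  return readdirSync(imagesDir)
    .filter(isImageFile)
    .sort()
    .map((fileName) => join(imagesDir, fileName));
}
```

For example, `listExtractedImages('output')` would be called after the `parse()` promise above has resolved.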
183
+
184
+ ### Resource Cleanup
185
+
186
+ Clean up resources after work is complete:
187
+
188
+ ```typescript
189
+ // Terminate docling-serve process
190
+ await pdfParser.shutdown();
191
+ ```
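If parsing throws, a plain `shutdown()` call placed after it is never reached and the docling-serve process may be left running. One way to guard against that is a `try`/`finally` wrapper. The sketch below is an illustration, not part of the library: `withParser` and `ParserLifecycle` are hypothetical names, and the interface covers only the `init()` and `shutdown()` methods documented in this README.

```typescript
// Minimal structural type covering only the lifecycle methods documented here.
interface ParserLifecycle {
  init(): Promise<void>;
  shutdown(): Promise<void>;
}

// Hypothetical helper: runs `work` between init() and shutdown().
// shutdown() is called even when `work` throws; it is not called if init() itself fails.
async function withParser<P extends ParserLifecycle, T>(
  parser: P,
  work: (parser: P) => Promise<T>,
): Promise<T> {
  await parser.init();
  try {
    return await work(parser);
  } finally {
    await parser.shutdown();
  }
}
```

With a real instance this could be used as `await withParser(pdfParser, (p) => p.parse('report.pdf', 'output', () => {}))`.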
192
+
193
+ ## Why macOS Only?
194
+
195
+ `@heripo/pdf-parser` **intentionally relies heavily on macOS**. The key reason for this decision is **Docling SDK's local OCR performance**.
196
+
197
+ ### OCR Selection Background
198
+
199
+ Archaeological excavation report PDFs have the following characteristics:
200
+
201
+ - Scanned documents spanning hundreds of pages
202
+ - Layouts containing complex tables, diagrams, photographs
203
+ - Precise text extraction is essential
204
+
205
+ ### OCR Option Comparison
206
+
207
+ | Method | Performance | Cost | Description |
208
+ | ----------------------- | ----------- | ---- | ---------------------------------------------------------- |
209
+ | **Docling (Local)** | ★★★★★ | Free | Outstanding performance on Apple Silicon with GPU acceleration |
210
+ | Cloud OCR (Google, AWS) | ★★★★ | $$$ | Tens of dollars per few hundred pages |
211
+ | Tesseract (Local) | ★★ | Free | Low Korean recognition rate, lacking layout analysis |
212
+
213
+ ### Key Advantages
214
+
215
+ - **Cost**: Free to run locally, whereas cloud OCR costs tens of dollars per few hundred pages
216
+ - **Performance**: Fast processing with GPU acceleration on Apple Silicon
217
+ - **Quality**: Accurate recognition even for complex documents
218
+ - **Privacy**: Documents are not sent to external servers
219
+
220
+ ### Trade-off
221
+
222
+ - Optimal performance only on macOS with Apple Silicon
223
+ - No current plans for Linux/Windows support (see "Linux Support Status" below)
224
+
225
+ ## System Dependencies Details
226
+
227
+ `@heripo/pdf-parser` requires the following system-level dependencies:
228
+
229
+ | Dependency | Required Version | Installation | Purpose |
230
+ | ---------- | ---------------- | -------------------------- | ------------------------------------------- |
231
+ | Python | 3.9 - 3.12 | `brew install python@3.11` | Docling SDK runtime |
232
+ | jq | Any | `brew install jq` | JSON processing (conversion result parsing) |
233
+ | lsof | Any | Included with macOS | docling-serve port management |
234
+
235
+ > ⚠️ **Python 3.13+ is not supported.** Some Docling SDK dependencies are not compatible with Python 3.13.
236
+
237
+ ### Checking Python Version
238
+
239
+ ```bash
240
+ # Check installed Python version
241
+ python3 --version
242
+ python3.11 --version
243
+
244
+ # When multiple versions are installed (Homebrew prefix: /opt/homebrew on Apple Silicon, /usr/local on Intel)
245
+ ls -la "$(brew --prefix)"/bin/python*
246
+ ```
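The same 3.9 - 3.12 range can be checked from code before constructing a parser. The sketch below is a hypothetical helper, not something exported by `@heripo/pdf-parser`; it assumes you pass it the text printed by `python3 --version` (for example `Python 3.11.4`).

```typescript
// Hypothetical helper: true when the version text is within the supported 3.9 - 3.12 range.
function isSupportedPython(versionOutput: string): boolean {
  const match = /Python (\d+)\.(\d+)/.exec(versionOutput);
  if (match === null) {
    return false; // Output did not look like "Python X.Y[.Z]".
  }
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return major === 3 && minor >= 9 && minor <= 12;
}
```

Checking up front gives a clearer error than waiting for dependency installation to fail on an unsupported interpreter.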
247
+
248
+ ### Checking jq Installation
249
+
250
+ ```bash
251
+ # Check jq version
252
+ jq --version
253
+
254
+ # Check jq path
255
+ which jq
256
+ ```
257
+
258
+ ## API Documentation
259
+
260
+ ### PDFParser Class
261
+
262
+ #### Constructor Options
263
+
264
+ ```typescript
265
+ interface PDFParserOptions {
266
+ pythonPath?: string; // Python executable path (default: 'python3')
267
+ logger?: Logger; // Logger instance
268
+ port?: number; // docling-serve port (default: 5001)
269
+ timeout?: number; // Timeout (milliseconds, default: 10000000)
270
+ externalDoclingUrl?: string; // External docling-serve URL (optional)
271
+ }
272
+ ```
273
+
274
+ #### Methods
275
+
276
+ ##### `init(): Promise<void>`
277
+
278
+ Sets up the Python environment and starts docling-serve.
279
+
280
+ ```typescript
281
+ await pdfParser.init();
282
+ ```
283
+
284
+ ##### `parse(inputPath, outputDir, callback, options?): Promise<string>`
285
+
286
+ Parses a PDF file.
287
+
288
+ **Parameters:**
289
+
290
+ - `inputPath` (string): Input PDF file path
291
+ - `outputDir` (string): Output directory path
292
+ - `callback` (function): Callback invoked with the result path when conversion completes
293
+ - `options` (ConversionOptions, optional): Conversion options
294
+
295
+ **Returns:**
296
+
297
+ - `Promise<string>`: Output file path
298
+
299
+ ##### `shutdown(): Promise<void>`
300
+
301
+ Terminates the docling-serve process.
302
+
303
+ ```typescript
304
+ await pdfParser.shutdown();
305
+ ```
306
+
307
+ ### ConversionOptions
308
+
309
+ ```typescript
310
+ interface ConversionOptions {
311
+ doOcr?: boolean; // Enable OCR (default: true)
312
+ formats?: string[]; // Output formats (default: ['docling_json'])
313
+ pdfBackend?: string; // PDF backend (default: 'dlparse_v2')
314
+ }
315
+ ```
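To make the effective settings explicit in application code, the documented defaults can be merged with caller-supplied options. This is a sketch under stated assumptions: the interface is copied from the block above, the default values are taken from its comments rather than from the library source, and `withDocumentedDefaults` is a hypothetical helper.

```typescript
interface ConversionOptions {
  doOcr?: boolean;
  formats?: string[];
  pdfBackend?: string;
}

// Defaults as documented in this README; treat them as assumptions, not guarantees.
const DOCUMENTED_DEFAULTS: Required<ConversionOptions> = {
  doOcr: true,
  formats: ['docling_json'],
  pdfBackend: 'dlparse_v2',
};

// Hypothetical helper: fills in any option the caller left undefined.
function withDocumentedDefaults(
  options: ConversionOptions = {},
): Required<ConversionOptions> {
  return {
    doOcr: options.doOcr ?? DOCUMENTED_DEFAULTS.doOcr,
    // Copy the array so callers cannot mutate the shared defaults.
    formats: [...(options.formats ?? DOCUMENTED_DEFAULTS.formats)],
    pdfBackend: options.pdfBackend ?? DOCUMENTED_DEFAULTS.pdfBackend,
  };
}
```

Using `??` rather than `||` keeps an explicit `doOcr: false` from being overwritten by the default.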
316
+
317
+ ## Troubleshooting
318
+
319
+ ### Python Version Error
320
+
321
+ **Symptom**: `Python version X.Y is not supported`
322
+
323
+ **Solution**:
324
+
325
+ ```typescript
326
+ // First install Python 3.11 from a shell:
327
+ //   brew install python@3.11
328
+
329
+ // Then explicitly specify it in PDFParser
330
+ const pdfParser = new PDFParser({
331
+ pythonPath: 'python3.11',
332
+ logger,
333
+ });
334
+ ```
335
+
336
+ ### jq Not Found
337
+
338
+ **Symptom**: `Command not found: jq`
339
+
340
+ **Solution**:
341
+
342
+ ```bash
343
+ brew install jq
344
+ ```
345
+
346
+ ### Port Conflict
347
+
348
+ **Symptom**: `Port 5001 is already in use`
349
+
350
+ **Solution**:
351
+
352
+ ```typescript
353
+ // Use a different port
354
+ const pdfParser = new PDFParser({
355
+ pythonPath: 'python3.11',
356
+ port: 5002, // Specify different port
357
+ logger,
358
+ });
359
+ ```
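Instead of guessing an unused port, you can ask the operating system for one and pass it as the `port` option. The sketch below uses only Node.js's `node:net` module; `findFreePort` is a hypothetical helper and not part of `@heripo/pdf-parser`. Another process could still claim the port between this check and the moment docling-serve starts, so treat the result as a best effort.

```typescript
import { createServer } from 'node:net';

// Hypothetical helper: binds to port 0 so the OS picks a free port,
// reads the assigned port number, then releases it again.
function findFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.once('error', reject);
    server.listen(0, '127.0.0.1', () => {
      const address = server.address();
      if (address === null || typeof address === 'string') {
        server.close(() => reject(new Error('Could not determine the assigned port')));
        return;
      }
      const { port } = address;
      server.close(() => resolve(port));
    });
  });
}
```

The resolved value could then be used as `new PDFParser({ pythonPath: 'python3.11', port, logger })`.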
360
+
361
+ ### docling-serve Start Failure
362
+
363
+ **Symptom**: `Failed to start docling-serve`
364
+
365
+ **Solution**:
366
+
367
+ ```bash
368
+ # Recreate virtual environment
369
+ rm -rf ~/.heripo/pdf-parser/venv
370
+ # Run init() again
371
+ ```
372
+
373
+ ## Linux Support Status
374
+
375
+ Currently **macOS only**. Linux support is **not entirely ruled out**, but due to OCR performance and cost efficiency issues, **there are no specific plans at this time**.
376
+
377
+ | Platform | Status | Notes |
378
+ | --------------------- | ------------ | ----------------------------------------------- |
379
+ | macOS + Apple Silicon | ✅ Supported | Optimal performance, GPU acceleration |
380
+ | macOS + Intel | ✅ Supported | No GPU acceleration |
381
+ | Linux | ❓ TBD | No current plans due to performance/cost issues |
382
+ | Windows | ❓ TBD | WSL2 Linux approach possible |
383
+
384
+ ### Reason for No Linux Support
385
+
386
+ Docling SDK's local OCR achieves both performance and cost efficiency by utilizing Apple Metal GPU acceleration on macOS. We have not yet found an OCR solution on Linux that provides equivalent performance and cost efficiency.
387
+
388
+ ### Ideas Welcome
389
+
390
+ If you have ideas for supporting Linux while maintaining both performance and cost efficiency, please suggest them via [GitHub Discussions](https://github.com/heripo-lab/heripo-engine/discussions) or Issues. The following information is particularly helpful:
391
+
392
+ - Experience with Korean document OCR on Linux
393
+ - OCR solutions capable of handling complex layouts (tables, diagrams)
394
+ - Cost estimates for processing hundreds of pages
395
+
396
+ ## Related Packages
397
+
398
+ - [@heripo/document-processor](../document-processor) - Document structure analysis and LLM processing
399
+ - [@heripo/model](../model) - Data models and type definitions
400
+
401
+ ## License
402
+
403
+ This package is distributed under the [Apache License 2.0](../../LICENSE).
404
+
405
+ ## Contributing
406
+
407
+ Contributions are always welcome! Please see the [Contributing Guide](../../CONTRIBUTING.md).
408
+
409
+ ## Issues and Support
410
+
411
+ - **Bug Reports**: [GitHub Issues](https://github.com/heripo-lab/heripo-engine/issues)
412
+ - **Discussions**: [GitHub Discussions](https://github.com/heripo-lab/heripo-engine/discussions)
413
+ - **Security Vulnerabilities**: See [Security Policy](../../SECURITY.md)
414
+
415
+ ## Project-Wide Information
416
+
417
+ For project-wide information not covered in this package, see the [root README](../../README.md):
418
+
419
+ - **Citation and Attribution**: Academic citation (BibTeX) and attribution methods
420
+ - **Contributing Guidelines**: Development guidelines, commit rules, PR procedures
421
+ - **Community**: Issue tracker, discussions, security policy
422
+ - **Roadmap**: Project development plans
423
+
424
+ ---
425
+
426
+ **heripo lab** | [GitHub](https://github.com/heripo-lab) | [heripo engine](https://github.com/heripo-lab/heripo-engine)