@opendataloader/pdf 1.11.0 → 1.11.2
- package/README.md +92 -0
- package/lib/opendataloader-pdf-cli.jar +0 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -77,6 +77,18 @@ Building RAG pipelines? You've probably hit these problems:
 
 <br/>
 
+## Which Mode Should I Use?
+
+| Your Document | Mode | Setup |
+|---------------|------|-------|
+| Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` |
+| Complex or nested tables | Hybrid | + start hybrid server |
+| Scanned / image-based PDF | Hybrid + OCR | + `--force-ocr` on server |
+| Charts / figures needing text description | Hybrid + picture description | + `--enrich-picture-description` on server |
+| Mathematical formulas (LaTeX) | Hybrid + formula | + `--enrich-formula` on server |
+
+<br/>
+
 ## Output Formats
 
 | Format | Use Case |
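Review note: the mode table added in this hunk reads naturally as a lookup from document traits to a mode plus server flags. The sketch below is only an illustration of that reading. `Setup` and `recommend_setup` are hypothetical names, not part of the package; the flag strings are copied from the table, and combining several flags on one server is an assumption the table does not confirm.

```python
from dataclasses import dataclass


# Hypothetical helper, not part of opendataloader-pdf. It only encodes the
# "Which Mode Should I Use?" table from the README hunk.
@dataclass(frozen=True)
class Setup:
    mode: str            # "fast" or "hybrid"
    server_flags: tuple  # flags for the hybrid server; empty in fast mode


def recommend_setup(scanned=False, complex_tables=False,
                    describe_pictures=False, formulas=False):
    """Map document traits to the mode and server flags listed in the table."""
    flags = []
    if scanned:
        flags.append("--force-ocr")
    if describe_pictures:
        flags.append("--enrich-picture-description")
    if formulas:
        flags.append("--enrich-formula")
    # Any enrichment flag, or complex tables alone, implies the hybrid server.
    # Stacking flags is an assumption; the table lists one flag per row.
    if flags or complex_tables:
        return Setup("hybrid", tuple(flags))
    return Setup("fast", ())


print(recommend_setup(scanned=True))
```

A standard digital PDF with no special traits falls through to `Setup("fast", ())`, which matches the first table row.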
@@ -258,6 +270,28 @@ Output in HTML (MathJax/KaTeX compatible):
 
 > **Note**: Formula extraction requires `--hybrid-mode full` to route all pages to the backend where the formula enrichment model runs.
 
+### Scanned PDFs (OCR)
+
+For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend:
+
+```bash
+# Start backend with OCR enabled
+opendataloader-pdf-hybrid --port 5002 --force-ocr
+
+# Process scanned PDF
+opendataloader-pdf --hybrid docling-fast input-scanned.pdf
+```
+
+For non-English documents, specify the OCR language:
+
+```bash
+opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
+```
+
+> **Note**: Standard digital PDFs do not need `--force-ocr`. Use it only for scanned or image-based PDFs.
+
+> **Timeout**: OCR is CPU-intensive. For large scanned documents, increase the timeout: `opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 input-scanned.pdf`
+
 ### Picture / Chart Description (Alt Text)
 
 Generate AI-powered descriptions for images and charts in your PDFs. Useful for accessibility (alt text) and making visual content searchable in RAG pipelines.
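Review note: the OCR section in this hunk introduces a two-process workflow (backend server, then client). For anyone scripting it, a minimal sketch of assembling both command lines follows. The function names are hypothetical; the executables and flags are the ones shown in the hunk. The unit of `--hybrid-timeout` is not stated there, so the value is passed through unchanged.

```python
import shlex


def backend_command(port=5002, ocr_langs=None):
    """Build the hybrid-server command line with OCR forced on."""
    argv = ["opendataloader-pdf-hybrid", "--port", str(port), "--force-ocr"]
    if ocr_langs:
        # The hunk shows languages as one comma-separated value, e.g. "ko,en".
        argv += ["--ocr-lang", ",".join(ocr_langs)]
    return argv


def client_command(pdf_path, timeout=None):
    """Build the client command line that sends a PDF to the hybrid backend."""
    argv = ["opendataloader-pdf", "--hybrid", "docling-fast"]
    if timeout is not None:
        # Unit is whatever the CLI expects; the hunk uses 120000 without naming it.
        argv += ["--hybrid-timeout", str(timeout)]
    argv.append(pdf_path)
    return argv


# Print shell-ready strings; pass the lists to subprocess.run to execute them.
print(shlex.join(backend_command(ocr_langs=["ko", "en"])))
print(shlex.join(client_command("input-scanned.pdf", timeout=120000)))
```

Keeping the commands as argument lists avoids shell-quoting problems when a PDF path contains spaces.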
@@ -409,6 +443,64 @@ This means: consistent output (same input = same output), no GPU required, faste
 
 Enable hybrid mode with `pip install -U "opendataloader-pdf[hybrid]"`. This routes pages with complex tables to an AI backend (like docling-serve) while keeping simple pages fast and local. Table accuracy improves from 0.49 to 0.93 — matching or exceeding dedicated AI parsers while remaining faster and more cost-effective.
 
+### Does it work with scanned PDFs?
+
+Yes, via hybrid mode with OCR. Start the backend server with `--force-ocr`:
+
+Terminal 1: Start backend with OCR enabled
+
+```bash
+opendataloader-pdf-hybrid --port 5002 --force-ocr
+```
+
+Terminal 2: Process scanned PDF
+
+```bash
+opendataloader-pdf --hybrid docling-fast input-scanned.pdf
+```
+
+Or use in Python:
+
+```python
+opendataloader_pdf.convert(
+    input_path="scanned.pdf",
+    output_dir="output/",
+    hybrid="docling-fast"
+)
+```
+
+(Start the backend with `--force-ocr` before running.)
+
+For non-English documents, add `--ocr-lang`:
+
+```bash
+opendataloader-pdf-hybrid --port 5002 --ocr-lang "ko,en"
+```
+
+### Does it work with images and charts?
+
+Two levels of support:
+
+1. **Image extraction** (all modes): Embedded images are extracted to the output folder with bounding boxes. Use `--image-output external` (the default):
+
+```python
+opendataloader_pdf.convert(
+    input_path="document.pdf",
+    output_dir="output/",
+    image_output="external"  # Saves images as files with bounding boxes in JSON
+)
+```
+
+2. **AI chart descriptions** (hybrid only): Generate natural language descriptions of charts and figures for RAG search:
+
+```bash
+# Start backend with picture description enabled
+opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
+
+# Process with full backend mode (required for picture description)
+opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
+```
+
 <br/>
 
 ## Contributing
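Review note: the Python snippets added in this hunk call `opendataloader_pdf.convert` without showing the import. The sketch below makes the import explicit and gathers the keyword arguments seen in the hunk (`input_path`, `output_dir`, `hybrid`, `image_output`) in one place. `convert_kwargs` and `run_convert` are hypothetical helpers, and passing `hybrid` and `image_output` together is an assumption, since the hunk only shows them in separate calls.

```python
def convert_kwargs(input_path, output_dir, hybrid=None, image_output=None):
    """Collect keyword arguments for opendataloader_pdf.convert, skipping unset ones."""
    kwargs = {"input_path": input_path, "output_dir": output_dir}
    if hybrid is not None:
        kwargs["hybrid"] = hybrid              # e.g. "docling-fast", as in the hunk
    if image_output is not None:
        kwargs["image_output"] = image_output  # e.g. "external", as in the hunk
    return kwargs


def run_convert(**kwargs):
    """Call convert if the package is installed; return False otherwise."""
    try:
        import opendataloader_pdf  # the import the README snippets leave implicit
    except ImportError:
        return False
    opendataloader_pdf.convert(**kwargs)
    return True


# For a scanned PDF, the hybrid backend must already be running with --force-ocr.
scanned = convert_kwargs("scanned.pdf", "output/", hybrid="docling-fast")
print(scanned)
```

Calling `run_convert(**scanned)` then performs the conversion when the package is installed and the backend is running.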
package/lib/opendataloader-pdf-cli.jar
Binary file