@opendataloader/pdf 1.11.0 → 1.11.2

package/README.md CHANGED
@@ -77,6 +77,18 @@ Building RAG pipelines? You've probably hit these problems:
 
  <br/>
 
+ ## Which Mode Should I Use?
+
+ | Your Document | Mode | Setup |
+ |---------------|------|-------|
+ | Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` |
+ | Complex or nested tables | Hybrid | + start hybrid server |
+ | Scanned / image-based PDF | Hybrid + OCR | + `--force-ocr` on server |
+ | Charts / figures needing text description | Hybrid + picture description | + `--enrich-picture-description` on server |
+ | Mathematical formulas (LaTeX) | Hybrid + formula | + `--enrich-formula` on server |
+
+ <br/>
+
  ## Output Formats
 
  | Format | Use Case |
@@ -258,6 +270,28 @@ Output in HTML (MathJax/KaTeX compatible):
 
  > **Note**: Formula extraction requires `--hybrid-mode full` to route all pages to the backend where the formula enrichment model runs.
 
+ ### Scanned PDFs (OCR)
+
+ For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend:
+
+ ```bash
+ # Start backend with OCR enabled
+ opendataloader-pdf-hybrid --port 5002 --force-ocr
+
+ # Process scanned PDF
+ opendataloader-pdf --hybrid docling-fast input-scanned.pdf
+ ```
+
+ For non-English documents, specify the OCR language:
+
+ ```bash
+ opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
+ ```
+
+ > **Note**: Standard digital PDFs do not need `--force-ocr`. Use it only for scanned or image-based PDFs.
+
+ > **Timeout**: OCR is CPU-intensive. For large scanned documents, increase the timeout: `opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 input-scanned.pdf`
+
  ### Picture / Chart Description (Alt Text)
 
  Generate AI-powered descriptions for images and charts in your PDFs. Useful for accessibility (alt text) and making visual content searchable in RAG pipelines.
@@ -409,6 +443,64 @@ This means: consistent output (same input = same output), no GPU required, faste
 
  Enable hybrid mode with `pip install -U "opendataloader-pdf[hybrid]"`. This routes pages with complex tables to an AI backend (like docling-serve) while keeping simple pages fast and local. Table accuracy improves from 0.49 to 0.93 — matching or exceeding dedicated AI parsers while remaining faster and more cost-effective.
 
+ ### Does it work with scanned PDFs?
+
+ Yes, via hybrid mode with OCR. Start the backend server with `--force-ocr`:
+
+ Terminal 1: Start backend with OCR enabled
+
+ ```bash
+ opendataloader-pdf-hybrid --port 5002 --force-ocr
+ ```
+
+ Terminal 2: Process scanned PDF
+
+ ```bash
+ opendataloader-pdf --hybrid docling-fast input-scanned.pdf
+ ```
+
+ Or use in Python:
+
+ ```python
+ opendataloader_pdf.convert(
+ input_path="scanned.pdf",
+ output_dir="output/",
+ hybrid="docling-fast"
+ )
+ ```
+
+ (Start the backend with `--force-ocr` before running.)
+
+ For non-English documents, add `--ocr-lang`:
+
+ ```bash
+ opendataloader-pdf-hybrid --port 5002 --ocr-lang "ko,en"
+ ```
+
+ ### Does it work with images and charts?
+
+ Two levels of support:
+
+ 1. **Image extraction** (all modes): Embedded images are extracted to the output folder with bounding boxes. Use `--image-output external` (the default):
+
+ ```python
+ opendataloader_pdf.convert(
+ input_path="document.pdf",
+ output_dir="output/",
+ image_output="external" # Saves images as files with bounding boxes in JSON
+ )
+ ```
+
+ 2. **AI chart descriptions** (hybrid only): Generate natural language descriptions of charts and figures for RAG search:
+
+ ```bash
+ # Start backend with picture description enabled
+ opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
+
+ # Process with full backend mode (required for picture description)
+ opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
+ ```
+
  <br/>
 
  ## Contributing
Binary file
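
Reading note (not part of the package diff): the mode table and OCR sections added to the README above boil down to a small mapping from document type to backend and client flags. The sketch below only assembles command lines from flags that appear in the diff (`--force-ocr`, `--ocr-lang`, `--enrich-picture-description`, `--enrich-formula`, `--hybrid docling-fast`, `--hybrid-mode full`, `--hybrid-timeout`). The function names, the dictionary keys, and the assumption that fast mode is invoked as `opendataloader-pdf <file>` with no extra flags are hypothetical, and nothing here is executed against the real CLI.

```python
import shlex

# Extra backend flags per document type, following the README's mode table.
# None means "fast mode": no hybrid server is started at all.
# The keys are hypothetical labels, not names used by the package.
SERVER_FLAGS = {
    "digital": None,
    "complex_tables": [],
    "scanned": ["--force-ocr"],
    "charts": ["--enrich-picture-description"],
    "formulas": ["--enrich-formula"],
}

# The README notes say formula extraction and picture description
# require `--hybrid-mode full` on the client side.
FULL_MODE_TYPES = {"charts", "formulas"}


def server_command(doc_type, port=5002, ocr_lang=None):
    """Return the backend argv for a document type, or None for fast mode."""
    flags = SERVER_FLAGS[doc_type]
    if flags is None:
        return None
    cmd = ["opendataloader-pdf-hybrid", "--port", str(port), *flags]
    if ocr_lang:
        cmd += ["--ocr-lang", ocr_lang]
    return cmd


def client_command(doc_type, pdf_path, timeout_ms=None):
    """Return the client argv for a document type."""
    cmd = ["opendataloader-pdf"]
    if SERVER_FLAGS[doc_type] is not None:
        cmd += ["--hybrid", "docling-fast"]
        if doc_type in FULL_MODE_TYPES:
            cmd += ["--hybrid-mode", "full"]
        if timeout_ms is not None:
            cmd += ["--hybrid-timeout", str(timeout_ms)]
    cmd.append(pdf_path)
    return cmd


if __name__ == "__main__":
    # Print the two commands for a scanned Korean/English document.
    print(shlex.join(server_command("scanned", ocr_lang="ko,en")))
    print(shlex.join(client_command("scanned", "input-scanned.pdf", timeout_ms=120000)))
```

Returning argv lists rather than shell strings keeps quoting out of the picture; `shlex.join` is used only for display.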
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@opendataloader/pdf",
- "version": "1.11.0",
+ "version": "1.11.2",
  "description": "A Node.js wrapper for the opendataloader-pdf Java CLI.",
  "main": "./dist/index.cjs",
  "module": "./dist/index.js",