@opendataloader/pdf 1.12.0 → 2.0.1

package/README.md CHANGED
@@ -1,20 +1,47 @@
1
+ <!-- AI-AGENT-SUMMARY
2
+ name: opendataloader-pdf
3
+ category: PDF data extraction, PDF accessibility automation
4
+ license: Apache-2.0
5
+ solves: [PDF to structured data for RAG/LLM pipelines, automate PDF accessibility compliance — layout analysis + auto-tagging to Tagged PDF (first open-source end-to-end)]
6
+ input: PDF files (digital, scanned, tagged)
7
+ output: Markdown, JSON (with bounding boxes), HTML, Tagged PDF, PDF/UA (enterprise)
8
+ sdk: Python, Node.js, Java
9
+ requirements: Java 11+
10
+ pricing: open-source core (data extraction, layout analysis, auto-tagging to Tagged PDF), enterprise add-on (PDF/UA export, accessibility studio)
11
+ extraction-benchmark: #1 overall extraction accuracy (0.90) in hybrid mode, 0.93 table extraction accuracy, 0.05s/page local mode
12
+ accessibility-validation: PDF Association collaboration, Well-Tagged PDF specification, veraPDF automated validation
13
+ key-differentiators: [benchmark #1 PDF parser, deterministic output, bounding boxes for every element, XY-Cut++ reading order, AI safety filters, hybrid AI mode, first open-source PDF auto-tagging to Tagged PDF, PDF Association + Dual Lab (veraPDF) collaboration, Well-Tagged PDF spec compliance]
14
+ -->
15
+
1
16
  # OpenDataLoader PDF
2
17
 
3
- **PDF Parsing for RAG** - Convert to Markdown & JSON, Fast, Local, No GPU
18
+ **PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.**
4
19
 
5
- [![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
20
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
6
21
  [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
7
22
  [![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
8
23
  [![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
9
24
  [![Java](https://img.shields.io/badge/Java-11%2B-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf#java)
10
25
 
11
- Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes, all running locally on your machine.
26
+ 🔍 **PDF parser for AI data extraction** - Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.90 overall). Deterministic local mode + AI hybrid mode for complex pages.
27
+
28
+ - **How accurate is it?** — #1 in benchmarks: 0.90 overall, 0.93 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages ([benchmarks](#extraction-benchmarks))
29
+ - **Scanned PDFs and OCR?** — Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
30
+ - **Tables, formulas, images, charts?** — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions, all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
31
+ - **How do I use this for RAG?** — `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))
32
+
33
+ ♿ **PDF accessibility automation** — The same layout analysis engine also powers auto-tagging. First open-source tool to generate Tagged PDFs end-to-end (coming Q2 2026).
34
+
35
+ - **What's the problem?** — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--pdfua-conversion))
36
+ - **What's free?** — Layout analysis + auto-tagging (Q2 2026, Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency ([auto-tagging preview](#auto-tagging-preview-coming-q2-2026))
37
+ - **What about PDF/UA compliance?** — Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step ([pipeline](#accessibility-pipeline))
38
+ - **Why trust this?** — Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) ([veraPDF](https://verapdf.org) developers). Auto-tagging follows the Well-Tagged PDF specification, validated with veraPDF ([collaboration](https://opendataloader.org/docs/tagged-pdf-collaboration))
12
39
 
13
- **Why developers choose OpenDataLoader:**
14
- - **Deterministic** — Same input always produces same output (no LLM hallucinations)
15
- - **Fast** - Process 100+ pages per second on CPU
16
- - **Private** — 100% local, zero data transmission
17
- - **Accurate** - Bounding boxes for every element, correct multi-column reading order
40
+ ## Get Started in 30 Seconds
41
+
42
+ **Requires**: Java 11+ and Python 3.10+ ([Node.js](https://opendataloader.org/docs/quick-start-nodejs) | [Java](https://opendataloader.org/docs/quick-start-java) also available)
43
+
44
+ > Before you start: run `java -version`. If not found, install JDK 11+ from [Adoptium](https://adoptium.net/).
18
45
 
19
46
  ```bash
20
47
  pip install -U opendataloader-pdf
@@ -23,225 +50,185 @@ pip install -U opendataloader-pdf
23
50
  ```python
24
51
  import opendataloader_pdf
25
52
 
26
- # PDF to Markdown for RAG
53
+ # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
27
54
  opendataloader_pdf.convert(
28
- input_path="document.pdf",
55
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
29
56
  output_dir="output/",
30
57
  format="markdown,json"
31
58
  )
32
59
  ```
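Because each `convert()` call starts its own JVM process, it can help to gather every input first and hand them over in one call. The following is a minimal sketch, not part of the package: `collect_inputs` and `convert_batch` are hypothetical helper names, and the only library call used is `opendataloader_pdf.convert()` with the `input_path`, `output_dir`, and `format` arguments shown above.

```python
from pathlib import Path


def collect_inputs(*sources: str) -> list[str]:
    """Return a de-duplicated, order-preserving list of inputs for one convert() call.

    Folders are passed through unchanged, since the README shows "folder/"
    being accepted directly; other entries are kept only if they end in .pdf.
    """
    seen: set[str] = set()
    inputs: list[str] = []
    for source in sources:
        is_folder = source.endswith(("/", "\\")) or Path(source).is_dir()
        is_pdf = Path(source).suffix.lower() == ".pdf"
        if not (is_folder or is_pdf):
            continue  # skip anything that is neither a folder nor a PDF
        if source not in seen:
            seen.add(source)
            inputs.append(source)
    return inputs


def convert_batch(sources: list[str], output_dir: str = "output/") -> None:
    """Run a single convert() call for all inputs (one JVM start-up instead of many)."""
    # Imported lazily so collect_inputs() can be used without the package installed.
    import opendataloader_pdf

    inputs = collect_inputs(*sources)
    if inputs:
        opendataloader_pdf.convert(
            input_path=inputs,
            output_dir=output_dir,
            format="markdown,json",
        )
```

Calling `convert_batch(["file1.pdf", "file2.pdf", "folder/"])` would then issue exactly one `convert()` call, rather than one per file.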
33
60
 
34
- <br/>
35
-
36
- ## Why OpenDataLoader?
37
-
38
- Building RAG pipelines? You've probably hit these problems:
39
-
40
- | Problem | How We Solve It |
41
- |---------|-----------------|
42
- | **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
43
- | **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
44
- | **Headers/footers pollute context** | Auto-filtered before output |
45
- | **No coordinates for citations** | Bounding box for every element |
46
- | **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
47
- | **GPU required** | Pure CPU, rule-based — runs anywhere |
48
-
49
- <br/>
50
-
51
- ## Key Features
52
-
53
- ### For RAG & LLM Pipelines
54
-
55
- - **Structured Output** - JSON with semantic types (heading, paragraph, table, list, caption)
56
- - **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
57
- - **Reading Order** - XY-Cut++ algorithm handles multi-column layouts correctly
58
- - **Noise Filtering** - Headers, footers, hidden text, watermarks auto-removed
59
- - **LangChain Integration** - [Official document loader](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf)
60
-
61
- ### Performance & Privacy
61
+ ![OpenDataLoader PDF layout analysis — headings, tables, images detected with bounding boxes](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/samples/image/example_annotated_pdf.png)
62
+
63
+ *Annotated PDF output — each element (heading, paragraph, table, image) detected with bounding boxes and semantic type.*
64
+
65
+ ## What Problems Does This Solve?
66
+
67
+ | Problem | Solution | Status |
68
+ |---------|----------|--------|
69
+ | **PDF structure lost during parsing** — wrong reading order, broken tables, no element coordinates | Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order | Shipped |
70
+ | **Complex tables, scanned PDFs, formulas, charts** need AI-level understanding | Hybrid mode routes complex pages to AI backend (#1 in benchmarks) | Shipped |
71
+ | **PDF accessibility compliance** — EAA, ADA, Section 508 enforced. Manual remediation $50–200/doc | Auto-tagging: layout analysis → Tagged PDF (free, Q2 2026). Built with PDF Association & veraPDF validation. PDF/UA export (enterprise add-on) | Auto-tag: Q2 2026 |
72
+
73
+ ## Capability Matrix
74
+
75
+ | Capability | Supported | Tier |
76
+ |------------|-----------|------|
77
+ | **Data extraction** | | |
78
+ | Extract text with correct reading order | Yes | Free |
79
+ | Bounding boxes for every element | Yes | Free |
80
+ | Table extraction (simple borders) | Yes | Free |
81
+ | Table extraction (complex/borderless) | Yes | Free (Hybrid) |
82
+ | Heading hierarchy detection | Yes | Free |
83
+ | List detection (numbered, bulleted, nested) | Yes | Free |
84
+ | Image extraction with coordinates | Yes | Free |
85
+ | AI chart/image description | Yes | Free (Hybrid) |
86
+ | OCR for scanned PDFs | Yes | Free (Hybrid) |
87
+ | Formula extraction (LaTeX) | Yes | Free (Hybrid) |
88
+ | Tagged PDF structure extraction | Yes | Free |
89
+ | AI safety (prompt injection filtering) | Yes | Free |
90
+ | Header/footer/watermark filtering | Yes | Free |
91
+ | **Accessibility** | | |
92
+ | Auto-tagging → Tagged PDF for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0) |
93
+ | PDF/UA-1, PDF/UA-2 export | 💼 Available | Enterprise |
94
+ | Accessibility studio (visual editor) | 💼 Available | Enterprise |
95
+ | **Limitations** | | |
96
+ | Process Word/Excel/PPT | No | — |
97
+ | GPU required | No | — |
98
+
99
+ ## Extraction Benchmarks
100
+
101
+ **opendataloader-pdf [hybrid] ranks #1 overall (0.90)** across reading order, table, and heading extraction accuracy.
102
+
103
+ | Engine | Overall | Reading Order | Table | Heading | Speed (s/page) |
104
+ |--------|---------|---------------|-------|---------|----------------|
105
+ | **opendataloader [hybrid]** | **0.90** | **0.94** | **0.93** | **0.83** | 0.43 |
106
+ | opendataloader | 0.72 | 0.91 | 0.49 | 0.76 | **0.05** |
107
+ | docling | 0.86 | 0.90 | 0.89 | 0.80 | 0.73 |
108
+ | marker | 0.83 | 0.89 | 0.81 | 0.80 | 53.93 |
109
+ | mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96 |
110
+ | pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09 |
111
+ | markitdown | 0.29 | 0.88 | 0.00 | 0.00 | **0.04** |
112
+
113
+ > Scores normalized to [0, 1]. Higher is better for accuracy; lower is better for speed. **Bold** = best. [Full benchmark details](https://github.com/opendataloader-project/opendataloader-bench)
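To check which engine leads each column, the table above can be transcribed into a small data structure. This is only an illustrative sketch: the numbers are copied from the table, while the `BENCHMARKS` dictionary and the `best` helper are names introduced here.

```python
# Scores transcribed from the benchmark table; speed is seconds per page.
BENCHMARKS = {
    "opendataloader [hybrid]": {"overall": 0.90, "reading_order": 0.94, "table": 0.93, "heading": 0.83, "speed": 0.43},
    "opendataloader": {"overall": 0.72, "reading_order": 0.91, "table": 0.49, "heading": 0.76, "speed": 0.05},
    "docling": {"overall": 0.86, "reading_order": 0.90, "table": 0.89, "heading": 0.80, "speed": 0.73},
    "marker": {"overall": 0.83, "reading_order": 0.89, "table": 0.81, "heading": 0.80, "speed": 53.93},
    "mineru": {"overall": 0.82, "reading_order": 0.86, "table": 0.87, "heading": 0.74, "speed": 5.96},
    "pymupdf4llm": {"overall": 0.57, "reading_order": 0.89, "table": 0.40, "heading": 0.41, "speed": 0.09},
    "markitdown": {"overall": 0.29, "reading_order": 0.88, "table": 0.00, "heading": 0.00, "speed": 0.04},
}


def best(metric: str) -> list[str]:
    """Return the engine(s) with the best value: lowest for speed, highest otherwise."""
    pick = min if metric == "speed" else max
    target = pick(row[metric] for row in BENCHMARKS.values())
    return sorted(name for name, row in BENCHMARKS.items() if row[metric] == target)


for metric in ("overall", "reading_order", "table", "heading", "speed"):
    print(metric, best(metric))
```

Note that `best("speed")` selects markitdown (0.04 s/page); the table also bolds opendataloader's 0.05 s/page, which is the second-fastest value.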
62
114
 
63
- - **No GPU** — Fast, rule-based heuristics
64
- - **Local-First** — Your documents never leave your machine
65
- - **High Throughput** — Process thousands of PDFs efficiently
66
- - **Multi-Language SDK** — Python, Node.js, Java
67
-
68
- ### Document Understanding
69
-
70
- - **Tables** — Detects borders, handles merged cells
71
- - **Lists** — Numbered, bulleted, nested
72
- - **Headings** — Auto-detects hierarchy levels
73
- - **Images** — Extracts with captions linked
74
- - **Tagged PDF Support** — Uses native PDF structure when available
75
- - **AI Safety** — Auto-filters prompt injection content
76
-
77
- <br/>
115
+ [![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
78
116
 
79
117
  ## Which Mode Should I Use?
80
118
 
81
- | Your Document | Mode | Setup |
82
- |---------------|------|-------|
83
- | Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` |
84
- | Complex or nested tables | Hybrid | + start hybrid server |
85
- | Scanned / image-based PDF | Hybrid + OCR | + `--force-ocr` on server |
86
- | Charts / figures needing text description | Hybrid + picture description | + `--enrich-picture-description` on server |
87
- | Mathematical formulas (LaTeX) | Hybrid + formula | + `--enrich-formula` on server |
88
-
89
- <br/>
90
-
91
- ## Output Formats
92
-
93
- | Format | Use Case |
94
- |--------|----------|
95
- | **JSON** | Structured data with bounding boxes, semantic types |
96
- | **Markdown** | Clean text for LLM context, RAG chunks |
97
- | **HTML** | Web display with styling |
98
- | **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
99
-
100
- <br/>
101
-
102
- ## JSON Output Example
103
-
104
- ```json
105
- {
106
- "type": "heading",
107
- "id": 42,
108
- "level": "Title",
109
- "page number": 1,
110
- "bounding box": [72.0, 700.0, 540.0, 730.0],
111
- "heading level": 1,
112
- "font": "Helvetica-Bold",
113
- "font size": 24.0,
114
- "text color": "[0.0]",
115
- "content": "Introduction"
116
- }
117
- ```
118
-
119
- | Field | Description |
120
- |-------|-------------|
121
- | `type` | Element type: heading, paragraph, table, list, image, caption |
122
- | `id` | Unique identifier for cross-referencing |
123
- | `page number` | 1-indexed page reference |
124
- | `bounding box` | `[left, bottom, right, top]` in PDF points |
125
- | `heading level` | Heading depth (1+) |
126
- | `font`, `font size` | Typography info |
127
- | `content` | Extracted text |
128
-
129
- [Full JSON Schema →](https://opendataloader.org/docs/json-schema)
130
-
131
- <br/>
119
+ | Your Document | Mode | Install | Server Command | Client Command |
120
+ |---------------|------|---------|----------------|----------------|
121
+ | Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` | None needed | `opendataloader-pdf file1.pdf file2.pdf folder/` |
122
+ | Complex or nested tables | **Hybrid** | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
123
+ | Scanned / image-based PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
124
+ | Non-English scanned PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
125
+ | Mathematical formulas | Hybrid + formula | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-formula` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
126
+ | Charts needing description | Hybrid + picture | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-picture-description` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
127
+ | Untagged PDFs needing accessibility | Auto-tagging → Tagged PDF | Coming Q2 2026 | — | — |
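The rows above can also be expressed as a small command builder, which makes the pairing of server flags and client flags explicit. This is a sketch under stated assumptions: `server_command` and `client_command` are hypothetical helpers, every flag is taken from the table, and combining several server flags in one invocation (beyond the `--force-ocr --ocr-lang` pairing shown) is an assumption rather than documented behavior.

```python
import shlex
from typing import Iterable, Optional, Sequence

HYBRID_BACKEND = "docling-fast"  # backend name used in every hybrid row of the table


def server_command(
    port: Optional[int] = None,
    ocr: bool = False,
    ocr_langs: Sequence[str] = (),
    formula: bool = False,
    picture: bool = False,
) -> list[str]:
    """Build the argument list for the hybrid backend server."""
    args = ["opendataloader-pdf-hybrid"]
    if port is not None:
        args += ["--port", str(port)]
    if ocr:
        args.append("--force-ocr")
        if ocr_langs:
            args += ["--ocr-lang", ",".join(ocr_langs)]
    if formula:
        args.append("--enrich-formula")
    if picture:
        args.append("--enrich-picture-description")
    return args


def client_command(inputs: Iterable[str], hybrid: bool = True, enrichments: bool = False) -> list[str]:
    """Build the client argument list; enrichments need --hybrid-mode full."""
    args = ["opendataloader-pdf"]
    if hybrid:
        args += ["--hybrid", HYBRID_BACKEND]
        if enrichments:
            args += ["--hybrid-mode", "full"]
    return args + list(inputs)


# Example: the "Non-English scanned PDF" row.
print(shlex.join(server_command(port=5002, ocr=True, ocr_langs=["ko", "en"])))
print(shlex.join(client_command(["file1.pdf", "file2.pdf", "folder/"])))
```

Returning argument lists instead of strings keeps the result safe to pass to `subprocess.run` without shell quoting issues.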
132
128
 
133
129
  ## Quick Start
134
130
 
135
- - [Python](https://opendataloader.org/docs/quick-start-python)
136
- - [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
137
- - [Java](https://opendataloader.org/docs/quick-start-java)
131
+ ### Python
138
132
 
139
- <br/>
140
-
141
- ## Advanced Options
133
+ ```bash
134
+ pip install -U opendataloader-pdf
135
+ ```
142
136
 
143
137
  ```python
138
+ import opendataloader_pdf
139
+
140
+ # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
144
141
  opendataloader_pdf.convert(
145
- input_path="document.pdf",
142
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
146
143
  output_dir="output/",
147
- format="json,markdown,pdf",
148
-
149
- # Image output mode: "off", "embedded" (Base64), or "external" (default)
150
- image_output="embedded",
151
-
152
- # Image format: "png" or "jpeg"
153
- image_format="jpeg",
154
-
155
- # Tagged PDF
156
- use_struct_tree=True, # Use native PDF structure
144
+ format="markdown,json"
157
145
  )
158
146
  ```
159
147
 
160
- [Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
161
-
162
- <br/>
163
-
164
- ## AI Safety
165
-
166
- PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
167
-
168
- - Hidden text (transparent, zero-size)
169
- - Off-page content
170
- - Suspicious invisible layers
171
-
172
- This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
173
-
174
- <br/>
175
-
176
- ## Tagged PDF Support
148
+ ### Node.js
177
149
 
178
- **Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
179
-
180
- **OpenDataLoader leverages this:**
150
+ ```bash
151
+ npm install @opendataloader/pdf
152
+ ```
181
153
 
182
- - When a PDF has structure tags, we extract the **exact layout** the author intended
183
- - Headings, lists, tables, reading order — all preserved from the source
184
- - No guessing, no heuristics needed — **pixel-perfect semantic extraction**
154
+ ```typescript
155
+ import { convert } from '@opendataloader/pdf';
185
156
 
186
- ```python
187
- opendataloader_pdf.convert(
188
- input_path="accessible_document.pdf",
189
- use_struct_tree=True # Use native PDF structure tags
190
- )
157
+ await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
158
+ outputDir: 'output/',
159
+ format: 'markdown,json'
160
+ });
191
161
  ```
192
162
 
193
- Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
194
-
195
- [Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
163
+ ### Java
196
164
 
197
- <br/>
165
+ ```xml
166
+ <dependency>
167
+ <groupId>org.opendataloader</groupId>
168
+ <artifactId>opendataloader-pdf-core</artifactId>
169
+ </dependency>
170
+ ```
198
171
 
199
- ## Hybrid Mode
172
+ [Python Quick Start](https://opendataloader.org/docs/quick-start-python) | [Node.js Quick Start](https://opendataloader.org/docs/quick-start-nodejs) | [Java Quick Start](https://opendataloader.org/docs/quick-start-java)
200
173
 
201
- For documents with complex tables or OCR needs, enable hybrid mode to route challenging pages to an AI backend while keeping simple pages fast and local.
174
+ ## Hybrid Mode: #1 Accuracy for Complex PDFs
202
175
 
203
- **Results**: Table accuracy jumps from 0.49 → 0.93 (+90%) with acceptable speed trade-off.
176
+ Hybrid mode combines fast local Java processing with AI backends. Simple pages stay local (0.05s); complex pages route to AI for +90% table accuracy.
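The "+90%" figure and the speed trade-off follow directly from the benchmark numbers (table accuracy 0.49 local vs 0.93 hybrid; 0.05 s/page local vs 0.43 s/page hybrid). A quick arithmetic check, using helper names introduced here for illustration:

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


table_gain = relative_gain(0.49, 0.93)   # table accuracy, local -> hybrid
local_pages_per_second = 1 / 0.05        # local mode throughput
hybrid_slowdown = 0.43 / 0.05            # hybrid time per page relative to local

print(f"table accuracy gain: {table_gain:.0%}")
print(f"local throughput: {local_pages_per_second:.0f} pages/s")
print(f"hybrid is about {hybrid_slowdown:.1f}x slower per page")
```

So the accuracy gain is roughly 90%, at the cost of hybrid mode taking several times longer per page on the benchmark average.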
204
177
 
205
178
  ```bash
206
179
  pip install -U "opendataloader-pdf[hybrid]"
207
180
  ```
208
181
 
209
- Terminal 1: Start the backend server
182
+ **Terminal 1** - Start the backend server:
210
183
 
211
184
  ```bash
212
185
  opendataloader-pdf-hybrid --port 5002
213
186
  ```
214
187
 
215
- Terminal 2: Process PDFs with hybrid mode
188
+ **Terminal 2** - Process PDFs:
216
189
 
217
190
  ```bash
218
- opendataloader-pdf --hybrid docling-fast input.pdf
191
+ # Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
192
+ opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
219
193
  ```
220
194
 
221
- Or use in Python:
195
+ **Python:**
222
196
 
223
197
  ```python
198
+ # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
224
199
  opendataloader_pdf.convert(
225
- input_path="complex_tables.pdf",
200
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
226
201
  output_dir="output/",
227
- hybrid="docling-fast" # Routes complex pages to AI backend
202
+ hybrid="docling-fast"
228
203
  )
229
204
  ```
230
205
 
231
- - **Local-first**: Simple pages processed locally, complex pages routed to backend
232
- - **Fallback**: If backend unavailable, gracefully falls back to local processing
233
- - **Privacy**: Run the backend locally for 100% on-premise
206
+ ### OCR for Scanned PDFs
207
+
208
+ Start the backend with `--force-ocr` for image-based PDFs with no selectable text:
209
+
210
+ ```bash
211
+ opendataloader-pdf-hybrid --port 5002 --force-ocr
212
+ ```
213
+
214
+ For non-English documents, specify the language:
215
+
216
+ ```bash
217
+ opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
218
+ ```
219
+
220
+ Supported languages: `en`, `ko`, `ja`, `ch_sim`, `ch_tra`, `de`, `fr`, `ar`, and more.
234
221
 
235
222
  ### Formula Extraction (LaTeX)
236
223
 
237
- For PDFs containing mathematical formulas, enable formula enrichment to extract LaTeX representations:
224
+ Extract mathematical formulas as LaTeX from scientific PDFs:
238
225
 
239
226
  ```bash
240
- # Start backend with formula enrichment
227
+ # Server: enable formula enrichment
241
228
  opendataloader-pdf-hybrid --enrich-formula
242
229
 
243
- # Process with full backend mode (required for formula extraction)
244
- opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
230
+ # Batch all files in one call - each invocation spawns a JVM process, so repeated calls are slow
231
+ opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
245
232
  ```
246
233
 
247
234
  Output in JSON:
@@ -254,95 +241,113 @@ Output in JSON:
254
241
  }
255
242
  ```
256
243
 
257
- Output in Markdown:
258
- ```markdown
259
- $$
260
- \frac{f(x+h) - f(x)}{h}
261
- $$
262
- ```
263
-
264
- Output in HTML (MathJax/KaTeX compatible):
265
- ```html
266
- <div class="math-display">\[\frac{f(x+h) - f(x)}{h}\]</div>
267
- ```
268
-
269
- > **Note**: Formula extraction requires `--hybrid-mode full` to route all pages to the backend where the formula enrichment model runs.
244
+ > **Note**: Formula and picture description enrichments require `--hybrid-mode full` on the client side.
270
245
 
271
- ### Scanned PDFs (OCR)
246
+ ### Chart & Image Description
272
247
 
273
- For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend:
248
+ Generate AI descriptions for charts and images, useful for RAG search and accessibility alt text:
274
249
 
275
250
  ```bash
276
- # Start backend with OCR enabled
277
- opendataloader-pdf-hybrid --port 5002 --force-ocr
251
+ # Server
252
+ opendataloader-pdf-hybrid --enrich-picture-description
278
253
 
279
- # Process scanned PDF
280
- opendataloader-pdf --hybrid docling-fast input-scanned.pdf
254
+ # Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
255
+ opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
281
256
  ```
282
257
 
283
- For non-English documents, specify the OCR language:
284
-
285
- ```bash
286
- opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
258
+ Output in JSON:
259
+ ```json
260
+ {
261
+ "type": "picture",
262
+ "page number": 1,
263
+ "bounding box": [72.0, 400.0, 540.0, 650.0],
264
+ "description": "A bar chart showing waste generation by region from 2016 to 2030..."
265
+ }
287
266
  ```
288
267
 
289
- > **Note**: Standard digital PDFs do not need `--force-ocr`. Use it only for scanned or image-based PDFs.
268
+ > Uses SmolVLM (256M), a lightweight vision model. Custom prompts supported via `--picture-description-prompt`.
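One way to use these descriptions for RAG search is to pull them out of the JSON output and index them alongside the page number. The sketch below assumes the elements have already been loaded into a flat list of dictionaries shaped like the example above; how elements are nested in the real output file is not shown here, and `picture_captions` is a hypothetical helper.

```python
def picture_captions(elements: list[dict]) -> list[tuple[int, str]]:
    """Return (page number, description) pairs for pictures that have a description."""
    captions = []
    for element in elements:
        if element.get("type") == "picture" and element.get("description"):
            captions.append((element["page number"], element["description"]))
    return captions


# Sample data mirroring the JSON example above, plus elements that should be skipped.
sample = [
    {"type": "heading", "page number": 1, "content": "Introduction"},
    {
        "type": "picture",
        "page number": 1,
        "bounding box": [72.0, 400.0, 540.0, 650.0],
        "description": "A bar chart showing waste generation by region from 2016 to 2030...",
    },
    {"type": "picture", "page number": 2, "bounding box": [72.0, 100.0, 300.0, 250.0]},
]

for page, text in picture_captions(sample):
    print(f"page {page}: {text}")
```

Pictures without a `description` field (for example, when enrichment was not enabled) are simply skipped.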
290
269
 
291
- > **Timeout**: OCR is CPU-intensive. For large scanned documents, increase the timeout: `opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 input-scanned.pdf`
270
+ ### Hancom Data Loader Integration - Coming Soon
292
271
 
293
- ### Picture / Chart Description (Alt Text)
272
+ Enterprise-grade AI document analysis via [Hancom Data Loader](https://sdk.hancom.com/en/services/1?utm_source=github&utm_medium=readme&utm_campaign=opendataloader-pdf) - customer-customized models trained on your domain-specific documents. 30+ element types (tables, charts, formulas, captions, footnotes, etc.), VLM-based image/chart understanding, complex table extraction (merged cells, nested tables), SLA-backed OCR for scanned documents, and native HWP/HWPX support. Supports PDF, DOCX, XLSX, PPTX, HWP, PNG, JPG. [Live demo](https://livedemo.sdk.hancom.com/en/dataloader?utm_source=github&utm_medium=readme&utm_campaign=opendataloader-pdf)
294
273
 
295
- Generate AI-powered descriptions for images and charts in your PDFs. Useful for accessibility (alt text) and making visual content searchable in RAG pipelines.
274
+ [Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode)
296
275
 
297
- ```bash
298
- # Start backend with picture description
299
- opendataloader-pdf-hybrid --enrich-picture-description
276
+ ## Output Formats
300
277
 
301
- # Process with full backend mode (required for picture description)
302
- opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
303
- ```
278
+ | Format | Use Case |
279
+ |--------|----------|
280
+ | **JSON** | Structured data with bounding boxes, semantic types |
281
+ | **Markdown** | Clean text for LLM context, RAG chunks |
282
+ | **HTML** | Web display with styling |
283
+ | **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000)) |
284
+ | **Text** | Plain text extraction |
285
+
286
+ Combine formats: `format="json,markdown"`
287
+
288
+ ### JSON Output Example
304
289
 
305
- Output in JSON:
306
290
  ```json
307
291
  {
308
- "type": "picture",
292
+ "type": "heading",
293
+ "id": 42,
294
+ "level": "Title",
309
295
  "page number": 1,
310
- "bounding box": [72.0, 400.0, 540.0, 650.0],
311
- "description": "A bar chart showing waste generation by region from 2016 to 2030..."
296
+ "bounding box": [72.0, 700.0, 540.0, 730.0],
297
+ "heading level": 1,
298
+ "font": "Helvetica-Bold",
299
+ "font size": 24.0,
300
+ "text color": "[0.0]",
301
+ "content": "Introduction"
312
302
  }
313
303
  ```
314
304
 
315
- Output in Markdown:
316
- ```markdown
317
- ![image 1](document_images/imageFile1.png)
305
+ | Field | Description |
306
+ |-------|-------------|
307
+ | `type` | Element type: heading, paragraph, table, list, image, caption, formula |
308
+ | `id` | Unique identifier for cross-referencing |
309
+ | `page number` | 1-indexed page reference |
310
+ | `bounding box` | `[left, bottom, right, top]` in PDF points (72pt = 1 inch) |
311
+ | `heading level` | Heading depth (1+) |
312
+ | `content` | Extracted text |
313
+
314
+ [Full JSON Schema](https://opendataloader.org/docs/json-schema)
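Since `bounding box` is given as `[left, bottom, right, top]` in PDF points (72pt = 1 inch), highlighting a cited element on a rendered page image usually means converting to pixel coordinates with a top-left origin. The sketch below assumes the standard PDF convention of a bottom-left origin and a known page height in points (792 for US Letter is used purely as an example); `bbox_to_pixels` and `bbox_size_inches` are hypothetical helpers, not part of the SDK.

```python
POINTS_PER_INCH = 72.0


def bbox_to_pixels(bbox: list[float], page_height_pt: float, dpi: int = 150) -> tuple[int, int, int, int]:
    """Convert [left, bottom, right, top] in points to (x0, y0, x1, y1) pixels.

    Assumes a bottom-left origin in the PDF and a top-left origin in the image,
    so the vertical axis is flipped using the page height.
    """
    left, bottom, right, top = bbox
    scale = dpi / POINTS_PER_INCH
    x0 = round(left * scale)
    y0 = round((page_height_pt - top) * scale)
    x1 = round(right * scale)
    y1 = round((page_height_pt - bottom) * scale)
    return (x0, y0, x1, y1)


def bbox_size_inches(bbox: list[float]) -> tuple[float, float]:
    """Return (width, height) of the box in inches."""
    left, bottom, right, top = bbox
    return ((right - left) / POINTS_PER_INCH, (top - bottom) / POINTS_PER_INCH)


heading_bbox = [72.0, 700.0, 540.0, 730.0]  # from the JSON example above
print(bbox_to_pixels(heading_bbox, page_height_pt=792.0, dpi=144))
print(bbox_size_inches(heading_bbox))
```

The page height has to come from the PDF itself; it is not part of the element shown in the example.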
318
315
 
319
- *A bar chart showing waste generation by region from 2016 to 2030...*
320
- ```
316
+ ## Advanced Features
321
317
 
322
- Output in HTML:
323
- ```html
324
- <figure>
325
- <img src="document_images/imageFile1.png" alt="figure1">
326
- <figcaption>A bar chart showing waste generation by region from 2016 to 2030...</figcaption>
327
- </figure>
328
- ```
318
+ ### Tagged PDF Support
329
319
 
330
- You can also customize the prompt for better results with specific document types:
320
+ When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended: no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
331
321
 
332
- ```bash
333
- opendataloader-pdf-hybrid --enrich-picture-description \
334
- --picture-description-prompt "Describe this scientific figure in detail."
322
+ ```python
323
+ # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
324
+ opendataloader_pdf.convert(
325
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
326
+ output_dir="output/",
327
+ use_struct_tree=True # Use native PDF structure tags
328
+ )
335
329
  ```
336
330
 
337
- > **Note**: Picture description uses SmolVLM (256M), a lightweight vision model. Results are suitable for general context but may not capture precise data values from complex charts.
331
+ Most PDF parsers ignore structure tags entirely. [Learn more](https://opendataloader.org/docs/tagged-pdf)
332
+
333
+ ### AI Safety: Prompt Injection Protection
334
+
335
+ PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
336
+
337
+ - Hidden text (transparent, zero-size fonts)
338
+ - Off-page content
339
+ - Suspicious invisible layers
338
340
 
339
- [Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode)
341
+ To sanitize sensitive data (emails, URLs, phone numbers → placeholders), enable it explicitly:
340
342
 
341
- <br/>
343
+ ```bash
344
+ # Batch all files in one call — each invocation spawns a JVM process, so repeated calls are slow
345
+ opendataloader-pdf file1.pdf file2.pdf folder/ --sanitize
346
+ ```
342
347
 
343
- ## LangChain Integration
348
+ [AI Safety Guide](https://opendataloader.org/docs/ai-safety)
344
349
 
345
- OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
350
+ ### LangChain Integration
346
351
 
347
352
  ```bash
348
353
  pip install -U langchain-opendataloader-pdf
@@ -352,164 +357,225 @@ pip install -U langchain-opendataloader-pdf
352
357
  from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
353
358
 
354
359
  loader = OpenDataLoaderPDFLoader(
355
- file_path=["document.pdf"],
360
+ file_path=["file1.pdf", "file2.pdf", "folder/"],
356
361
  format="text"
357
362
  )
358
363
  documents = loader.load()
364
+ ```
365
+
366
+ [LangChain Docs](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf) | [GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf) | [PyPI](https://pypi.org/project/langchain-opendataloader-pdf/)
367
+
368
+ ### Advanced Options
359
369
 
360
- # Use with any LangChain pipeline
361
- for doc in documents:
362
- print(doc.page_content[:100])
370
+ ```python
371
+ # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
372
+ opendataloader_pdf.convert(
373
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
374
+ output_dir="output/",
375
+ format="json,markdown,pdf",
376
+ image_output="embedded", # "off", "embedded" (Base64), or "external" (default)
377
+ image_format="jpeg", # "png" or "jpeg"
378
+ use_struct_tree=True, # Use native PDF structure
379
+ )
363
380
  ```
364
381
 
365
- - [LangChain Documentation](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf)
366
- - [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
367
- - [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
382
+ [Full CLI Options Reference](https://opendataloader.org/docs/cli-options-reference)
368
383
 
369
- <br/>
384
+ ## PDF Accessibility & PDF/UA Conversion
370
385
 
371
- ## Benchmarks
386
+ **Problem**: Millions of existing PDFs lack structure tags, failing accessibility regulations (EAA, ADA/Section 508, Korea Digital Inclusion Act). Manual remediation costs $50–200 per document and doesn't scale.
372
387
 
373
- We continuously benchmark against real-world documents.
388
+ **OpenDataLoader's approach**: Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (developers of [veraPDF](https://verapdf.org), the industry-reference open-source PDF/A and PDF/UA validator). Auto-tagging follows the [Well-Tagged PDF specification](https://pdfa.org/resource/well-tagged-pdf/) and is validated programmatically using veraPDF — automated conformance checks against PDF accessibility standards, not manual review. No existing open-source tool generates Tagged PDFs end-to-end — most rely on proprietary SDKs for the tag-writing step. OpenDataLoader does it all under Apache 2.0. ([collaboration details](https://opendataloader.org/docs/tagged-pdf-collaboration))
374
389
 
375
- [View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
390
+ | Regulation | Deadline | Requirement |
391
+ |------------|----------|-------------|
392
+ | **European Accessibility Act (EAA)** | June 28, 2025 | Accessible digital products across the EU |
393
+ | **ADA & Section 508** | In effect | U.S. federal agencies and public accommodations |
394
+ | **Digital Inclusion Act** | In effect | South Korea digital service accessibility |
376
395
 
377
- ### Quick Comparison
396
+ ### Standards & Validation
378
397
 
379
- | Engine | Overall | Reading Order | Table | Heading | Speed (s/page) |
380
- |-----------------------------|----------|---------------|----------|----------|----------------|
381
- | **opendataloader** | 0.72 | 0.91 | 0.49 | 0.76 | **0.05** |
382
- | **opendataloader [hybrid]** | **0.90** | **0.94** | **0.93** | **0.83** | 0.43 |
383
- | docling | 0.86 | 0.90 | 0.89 | 0.80 | 0.73 |
384
- | marker | 0.83 | 0.89 | 0.81 | 0.80 | 53.93 |
385
- | mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96 |
386
- | pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09 |
387
- | markitdown | 0.29 | 0.88 | 0.00 | 0.00 | **0.04** |
398
+ | Aspect | Detail |
399
+ |--------|--------|
400
+ | **Specification** | [Well-Tagged PDF](https://pdfa.org/resource/well-tagged-pdf/) by PDF Association |
401
+ | **Validation** | [veraPDF](https://verapdf.org) industry-reference open-source PDF/A & PDF/UA validator |
402
+ | **Collaboration** | PDF Association + [Dual Lab](https://duallab.com) (veraPDF developers) co-develop tagging and validation |
403
+ | **License** | Auto-tagging → Tagged PDF: Apache 2.0 (free). PDF/UA export: Enterprise |
388
404
 
389
- > Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. **Bold** indicates best performance.
405
+ ### Accessibility Pipeline
390
406
 
391
- ### Visual Comparison
407
+ | Step | Feature | Status | Tier |
408
+ |------|---------|--------|------|
409
+ | 1. **Audit** | Read existing PDF tags, detect untagged PDFs | Shipped | Free |
410
+ | 2. **Auto-tag → Tagged PDF** | Generate structure tags for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0) |
411
+ | 3. **Export PDF/UA** | Convert to PDF/UA-1 or PDF/UA-2 compliant files | 💼 Available | Enterprise |
412
+ | 4. **Visual editing** | Accessibility studio — review and fix tags | 💼 Available | Enterprise |
392
413
 
393
- [![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
414
+ > **💼 Enterprise features** are available on request. [Contact us](https://opendataloader.org/contact) to get started.
394
415
 
416
+ ### Auto-Tagging Preview (Coming Q2 2026)
395
417
 
396
- <br/>
418
+ ```python
419
+ # API shape preview — available Q2 2026
420
+ opendataloader_pdf.convert(
421
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
422
+ output_dir="output/",
423
+ auto_tag=True # Generate structure tags for untagged PDFs
424
+ )
425
+ ```
397
426
 
398
- ## Roadmap
427
+ ### End-to-End Compliance Workflow
399
428
 
400
- See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
429
+ ```
430
+ Existing PDFs (untagged)
431
+
432
+
433
+ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
434
+ │ 1. Audit │───>│ 2. Auto-Tag │───>│ 3. Export │───>│ 4. Studio │
435
+ │ (check tags) │ │ (→ Tagged PDF) │ │ (PDF/UA) │ │ (visual editor) │
436
+ └─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
437
+ │ │ │ │
438
+ ▼ ▼ ▼ ▼
439
+ use_struct_tree auto_tag PDF/UA export Accessibility Studio
440
+ (Available now) (Q2 2026, Apache 2.0) (Enterprise) (Enterprise)
441
+ ```
401
442
 
402
- <br/>
443
+ [PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)
403
444
 
404
- ## Documentation
445
+ ## Roadmap
405
446
 
406
- - [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
407
- - [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
408
- - [CLI Options](https://opendataloader.org/docs/cli-options-reference)
409
- - [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
410
- - [AI Safety Features](https://opendataloader.org/docs/ai-safety)
447
+ | Feature | Timeline | Tier |
448
+ |---------|----------|------|
449
+ | **Auto-tagging Tagged PDF** — Generate Tagged PDFs from untagged PDFs | Q2 2026 | Free |
450
+ | **[Hancom Data Loader](https://sdk.hancom.com/en/services/1?utm_source=github&utm_medium=readme&utm_campaign=opendataloader-pdf)** — Enterprise AI document analysis, customer-customized models, VLM-based chart/image understanding, production-grade OCR | Q2-Q3 2026 | Free |
451
+ | **Structure validation** — Verify PDF tag trees | Q2 2026 | Planned |
411
452
 
412
- <br/>
453
+ [Full Roadmap](https://opendataloader.org/docs/upcoming-roadmap)
413
454
 
414
455
  ## Frequently Asked Questions
415
456
 
416
457
  ### What is the best PDF parser for RAG?
417
458
 
418
- For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
459
+ For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this — it outputs structured JSON with bounding boxes, handles multi-column layouts with XY-Cut++, and runs locally without GPU. In hybrid mode, it ranks #1 overall (0.90) in benchmarks.
460
+
461
+ ### What is the best open-source PDF parser?
462
+
463
+ OpenDataLoader PDF is the only open-source parser that combines: rule-based deterministic extraction (no GPU), bounding boxes for every element, XY-Cut++ reading order, built-in AI safety filters, native Tagged PDF support, and hybrid AI mode for complex documents. It ranks #1 in overall accuracy (0.90) while running locally on CPU.
419
464
 
420
465
  ### How do I extract tables from PDF for LLM?
421
466
 
422
- OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
467
+ OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode, which raises the table score from 0.49 to 0.93 TEDS (about a 90% relative improvement):
468
+
469
+ ```python
470
+ import opendataloader_pdf
+
+ # Batch all files in one call: each convert() spawns a JVM process, so repeated calls are slow
471
+ opendataloader_pdf.convert(
472
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
473
+ output_dir="output/",
474
+ format="json",
475
+ hybrid="docling-fast" # For complex tables
476
+ )
477
+ ```
478
+
479
+ ### How does it compare to docling, marker, or pymupdf4llm?
480
+
481
+ OpenDataLoader [hybrid] ranks #1 overall (0.90) across reading order, table, and heading accuracy. Key differences: docling (0.86) is strong but lacks bounding boxes and AI safety filters. marker (0.83) requires GPU and, at 53.93s/page, is over 100x slower than hybrid mode (0.43s/page). pymupdf4llm (0.57) is fast but has poor table (0.40) and heading (0.41) accuracy. OpenDataLoader is the only parser that combines deterministic local extraction, bounding boxes for every element, and built-in prompt injection protection. See [full benchmark](https://github.com/opendataloader-project/opendataloader-bench).
423
482
 
424
483
  ### Can I use this without sending data to the cloud?
425
484
 
426
- Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
485
+ Yes. OpenDataLoader runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The hybrid mode backend also runs locally on your machine. Ideal for legal, healthcare, and financial documents.
427
486
 
428
- ### What makes OpenDataLoader unique?
487
+ ### Does it support OCR for scanned PDFs?
429
488
 
430
- OpenDataLoader takes a different approach from many PDF parsers:
489
+ Yes, via hybrid mode. Install with `pip install "opendataloader-pdf[hybrid]"`, start the backend with `--force-ocr`, then process as usual. OCR supports many languages, including Korean, Japanese, Chinese, and Arabic, selected via `--ocr-lang`.
431
490
 
432
- - **Rule-based extraction** Deterministic output without GPU requirements
433
- - **Bounding boxes for all elements** — Essential for citation systems
434
- - **XY-Cut++ reading order** — Handles multi-column layouts correctly
435
- - **Built-in AI safety filters** — Protects against prompt injection
436
- - **Native Tagged PDF support** — Leverages accessibility metadata
491
+ ### Does it work with Korean, Japanese, or Chinese documents?
437
492
 
438
- This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
493
+ Yes. For digital PDFs, text extraction works out of the box. For scanned PDFs, use hybrid mode with `--force-ocr --ocr-lang "ko,en"` (or `ja`, `ch_sim`, `ch_tra`). Coming soon: [Hancom Data Loader](https://sdk.hancom.com/en/services/1?utm_source=github&utm_medium=readme&utm_campaign=opendataloader-pdf) integration — enterprise-grade AI document analysis with built-in production-grade OCR and customer-customized models optimized for your specific document types and workflows.
439
494
 
440
- ### How do I get better accuracy for complex tables?
495
+ ### How fast is it?
441
496
 
442
- Enable hybrid mode with `pip install -U "opendataloader-pdf[hybrid]"`. This routes pages with complex tables to an AI backend (like docling-serve) while keeping simple pages fast and local. Table accuracy improves from 0.49 to 0.93 matching or exceeding dedicated AI parsers while remaining faster and more cost-effective.
497
+ Local mode processes 20+ pages per second on CPU (0.05s/page). Hybrid mode processes 2+ pages per second (0.43s/page) with significantly higher accuracy for complex documents. No GPU required. Benchmarked on Apple M4. [Full benchmark details](https://github.com/opendataloader-project/opendataloader-bench)
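
As a rough illustration of what these per-page figures imply for a batch, the sketch below converts seconds per page into throughput and total time. The 0.05 and 0.43 values are the benchmark figures quoted above. The function name and the 1,000-page batch are made up for the example, and real throughput will vary with hardware and document mix.

```python
def estimate_batch_seconds(pages: int, seconds_per_page: float) -> float:
    """Estimate wall-clock time for a batch, ignoring JVM/backend startup cost."""
    return pages * seconds_per_page


LOCAL_S_PER_PAGE = 0.05   # local mode figure quoted in the README
HYBRID_S_PER_PAGE = 0.43  # hybrid mode figure quoted in the README

for mode, spp in [("local", LOCAL_S_PER_PAGE), ("hybrid", HYBRID_S_PER_PAGE)]:
    pages_per_second = 1 / spp
    total = estimate_batch_seconds(1000, spp)
    print(f"{mode}: {pages_per_second:.1f} pages/s, 1000 pages in about {total:.0f} s")
```

The estimate is a lower bound: it leaves out the per-call JVM startup mentioned elsewhere in the README, which is why batching files into one `convert()` call matters.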
443
498
 
444
- ### Does it work with scanned PDFs?
499
+ ### Does it handle multi-column layouts?
445
500
 
446
- Yes, via hybrid mode with OCR. Start the backend server with `--force-ocr`:
501
+ Yes. OpenDataLoader uses XY-Cut++ reading order analysis to correctly sequence text across multi-column pages, sidebars, and mixed layouts. This works in both local and hybrid modes without any configuration.
447
502
 
448
- Terminal 1: Start backend with OCR enabled
503
+ ### What is hybrid mode?
449
504
 
450
- ```bash
451
- opendataloader-pdf-hybrid --port 5002 --force-ocr
452
- ```
505
+ Hybrid mode combines fast local Java processing with an AI backend. Simple pages are processed locally (0.05s/page); complex pages (tables, scanned content, formulas, charts) are automatically routed to the AI backend for higher accuracy. The backend runs locally on your machine — no cloud required. See [Which Mode Should I Use?](#which-mode-should-i-use) and [Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode).
453
506
 
454
- Terminal 2: Process scanned PDF
507
+ ### Does it work with LangChain?
455
508
 
456
- ```bash
457
- opendataloader-pdf --hybrid docling-fast input-scanned.pdf
458
- ```
509
+ Yes. Install `langchain-opendataloader-pdf` for an official LangChain document loader integration. See [LangChain docs](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf).
459
510
 
460
- Or use in Python:
511
+ ### How do I chunk PDFs for RAG?
512
+
513
+ OpenDataLoader outputs structured Markdown with headings, tables, and lists preserved — ideal input for semantic chunking. Each element in JSON output includes `type`, `heading level`, and `page number`, so you can split by section or page boundary. For most RAG pipelines: parse with `format="markdown"` for text chunks, or `format="json"` when you need element-level control. Pair with LangChain's `RecursiveCharacterTextSplitter` or your own heading-based splitter for best results.
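
A heading-based splitter of the kind mentioned above might look like the following sketch. It assumes the parsed JSON has already been flattened into a list of element dicts carrying `type`, `heading level`, `page number`, and a `content` text field. The `content` key, the sample data, and the `split_by_heading` helper are illustrative assumptions, so check the JSON Schema Reference for the actual layout.

```python
from typing import Any, Optional


def split_by_heading(elements: list, max_level: int = 2) -> list:
    """Group elements into chunks, starting a new chunk at each heading
    whose level is <= max_level."""
    chunks: list = []
    current: Optional[dict] = None
    for el in elements:
        is_split_point = (
            el.get("type") == "heading"
            and el.get("heading level", 99) <= max_level
        )
        if is_split_point or current is None:
            # Content before the first heading lands in an untitled chunk.
            current = {
                "title": el.get("content", "") if is_split_point else "",
                "pages": set(),
                "texts": [],
            }
            chunks.append(current)
        page: Any = el.get("page number")
        if page is not None:
            current["pages"].add(page)
        current["texts"].append(el.get("content", ""))
    return [
        {"title": c["title"], "pages": sorted(c["pages"]), "text": "\n".join(c["texts"])}
        for c in chunks
    ]


# Hypothetical flattened elements, for illustration only.
sample = [
    {"type": "heading", "heading level": 1, "page number": 1, "content": "Introduction"},
    {"type": "paragraph", "page number": 1, "content": "PDFs are hard to parse."},
    {"type": "heading", "heading level": 1, "page number": 2, "content": "Method"},
    {"type": "paragraph", "page number": 2, "content": "We use layout analysis."},
    {"type": "paragraph", "page number": 3, "content": "Results follow."},
]

chunks = split_by_heading(sample)
for chunk in chunks:
    print(chunk["title"], chunk["pages"])
```

Keeping the page numbers on each chunk means a retrieved chunk can still be traced back to its pages, even when a section spans several of them.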
514
+
515
+ ### How do I cite PDF sources in RAG answers?
516
+
517
+ Every element in JSON output includes a `bounding box` (`[left, bottom, right, top]` in PDF points) and `page number`. When your RAG pipeline returns an answer, map the source chunk back to its bounding box to highlight the exact location in the original PDF. This enables "click to source" UX — users see which paragraph, table, or figure the answer came from. No other open-source parser provides bounding boxes for every element by default.
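
To draw a highlight in a viewer, the bounding box usually has to be converted from PDF points (origin at the bottom-left of the page) into pixel coordinates with the origin at the top-left. The sketch below shows that conversion, assuming the page height in points is known and the page is rendered at a uniform DPI. `bbox_to_pixels` and the sample values are illustrative and not part of the library.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert [left, bottom, right, top] in PDF points (bottom-left origin)
    to (x, y, width, height) in pixels (top-left origin)."""
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x = left * scale
    y = (page_height_pt - top) * scale  # flip the vertical axis
    width = (right - left) * scale
    height = (top - bottom) * scale
    return x, y, width, height


# Hypothetical element on a US Letter page (612 x 792 points).
element = {"page number": 1, "bounding box": [72.0, 700.0, 540.0, 730.0]}
print(bbox_to_pixels(element["bounding box"], page_height_pt=792.0))
```

The resulting rectangle can be passed to whatever rendering layer the application uses to overlay a highlight on the page image.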
518
+
519
+ ### How do I convert PDF to Markdown for LLM?
461
520
 
462
521
  ```python
522
+ import opendataloader_pdf
523
+
524
+ # Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
463
525
  opendataloader_pdf.convert(
464
- input_path="scanned.pdf",
526
+ input_path=["file1.pdf", "file2.pdf", "folder/"],
465
527
  output_dir="output/",
466
- hybrid="docling-fast"
528
+ format="markdown"
467
529
  )
468
530
  ```
469
531
 
470
- (Start the backend with `--force-ocr` before running.)
532
+ OpenDataLoader preserves heading hierarchy, table structure, and reading order in the Markdown output. For complex documents with borderless tables or scanned pages, use hybrid mode (`hybrid="docling-fast"`) for higher accuracy. The output is clean enough to feed directly into LLM context windows or RAG chunking pipelines.
471
533
 
472
- For non-English documents, add `--ocr-lang`:
534
+ ### Is there an automated PDF accessibility remediation tool?
473
535
 
474
- ```bash
475
- opendataloader-pdf-hybrid --port 5002 --ocr-lang "ko,en"
476
- ```
536
+ Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging (Q2 2026) converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.
477
537
 
478
- ### Does it work with images and charts?
538
+ ### Is this really the first open-source PDF auto-tagging tool?
479
539
 
480
- Two levels of support:
540
+ Yes. Existing tools either depend on proprietary SDKs for writing structure tags, only output non-PDF formats (e.g., Docling outputs Markdown/JSON but cannot produce Tagged PDFs), or require manual intervention. OpenDataLoader is the first to do layout analysis → tag generation → Tagged PDF output entirely under an open-source license (Apache 2.0), with no proprietary dependency. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, the industry-reference open-source PDF/A and PDF/UA validator.
481
541
 
482
- 1. **Image extraction** (all modes): Embedded images are extracted to the output folder with bounding boxes. Use `--image-output external` (the default):
542
+ ### How do I convert existing PDFs to PDF/UA?
483
543
 
484
- ```python
485
- opendataloader_pdf.convert(
486
- input_path="document.pdf",
487
- output_dir="output/",
488
- image_output="external" # Saves images as files with bounding boxes in JSON
489
- )
490
- ```
544
+ OpenDataLoader provides an end-to-end pipeline: audit existing PDFs for tags (`use_struct_tree=True`), auto-tag untagged PDFs into Tagged PDFs (Q2 2026, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step. [Contact us](https://opendataloader.org/contact) for enterprise integration.
491
545
 
492
- 2. **AI chart descriptions** (hybrid only): Generate natural language descriptions of charts and figures for RAG search:
546
+ ### How do I make my PDFs accessible for EAA compliance?
493
547
 
494
- ```bash
495
- # Start backend with picture description enabled
496
- opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
548
+ The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF will be open-sourced under Apache 2.0 (Q2 2026). PDF/UA export and accessibility studio are enterprise add-ons. See our [Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance).
497
549
 
498
- # Process with full backend mode (required for picture description)
499
- opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
500
- ```
550
+ ### Is OpenDataLoader PDF free?
551
+
552
+ The core library is **open-source under Apache 2.0** — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF (Q2 2026). We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.
553
+
554
+ ### Why did the license change from MPL 2.0 to Apache 2.0?
501
555
 
502
- <br/>
556
+ MPL 2.0 requires file-level copyleft, which often triggers legal review before enterprise adoption. Apache 2.0 is fully permissive — no copyleft obligations, easier to integrate into commercial projects. If you are using a pre-2.0 version, it remains under MPL 2.0 and you can continue using it. Upgrading to 2.0+ means your project follows Apache 2.0 terms, which are strictly more permissive — no additional obligations, no action needed on your side.
557
+
558
+ ## Documentation
559
+
560
+ - [Quick Start (Python)](https://opendataloader.org/docs/quick-start-python)
561
+ - [Quick Start (Node.js)](https://opendataloader.org/docs/quick-start-nodejs)
562
+ - [Quick Start (Java)](https://opendataloader.org/docs/quick-start-java)
563
+ - [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
564
+ - [CLI Options](https://opendataloader.org/docs/cli-options-reference)
565
+ - [Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode)
566
+ - [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
567
+ - [AI Safety Features](https://opendataloader.org/docs/ai-safety)
568
+ - [PDF Accessibility](https://opendataloader.org/docs/accessibility-compliance)
503
569
 
504
570
  ## Contributing
505
571
 
506
572
  We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
507
573
 
508
- <br/>
509
-
510
574
  ## License
511
575
 
512
- [Mozilla Public License 2.0](LICENSE)
576
+ [Apache License 2.0](LICENSE)
577
+
578
+ > **Note:** Versions prior to 2.0 are licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).
513
579
 
514
580
  ---
515
581