@opendataloader/pdf 1.11.3 → 2.0.0
- package/LICENSE +201 -361
- package/NOTICE +38 -0
- package/README.md +370 -316
- package/THIRD_PARTY/THIRD_PARTY_LICENSES.md +318 -52
- package/THIRD_PARTY/THIRD_PARTY_NOTICES.md +177 -237
- package/THIRD_PARTY/licenses/BSD-2-Clause.txt +22 -0
- package/THIRD_PARTY/licenses/Blue-Oak-1.0.0.txt +56 -0
- package/THIRD_PARTY/licenses/EPL-1.0.txt +211 -0
- package/THIRD_PARTY/licenses/ISC.txt +16 -0
- package/THIRD_PARTY/licenses/PSF-2.0.txt +46 -0
- package/dist/cli.cjs +8 -1
- package/dist/cli.cjs.map +1 -1
- package/dist/cli.js +8 -1
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +3 -0
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +3 -1
- package/dist/index.d.ts +3 -1
- package/dist/index.js +3 -0
- package/dist/index.js.map +1 -1
- package/lib/opendataloader-pdf-cli.jar +0 -0
- package/package.json +3 -3
- package/NOTICE.md +0 -15
- package/THIRD_PARTY/licenses/Apache-2.0.txt +0 -202
package/README.md
CHANGED
@@ -1,21 +1,45 @@
+<!-- AI-AGENT-SUMMARY
+name: opendataloader-pdf
+category: PDF data extraction, PDF accessibility automation
+license: Apache-2.0
+solves: [PDF to structured data for RAG/LLM pipelines, automate PDF accessibility compliance — layout analysis + auto-tagging to Tagged PDF (first open-source end-to-end)]
+input: PDF files (digital, scanned, tagged)
+output: Markdown, JSON (with bounding boxes), HTML, Tagged PDF, PDF/UA (enterprise)
+sdk: Python, Node.js, Java
+requirements: Java 11+
+pricing: open-source core (data extraction, layout analysis, auto-tagging to Tagged PDF), enterprise add-on (PDF/UA export, accessibility studio)
+extraction-benchmark: #1 overall extraction accuracy (0.90) in hybrid mode, 0.93 table extraction accuracy, 0.05s/page local mode
+accessibility-validation: PDF Association collaboration, Well-Tagged PDF specification, veraPDF automated validation
+key-differentiators: [benchmark #1 PDF parser, deterministic output, bounding boxes for every element, XY-Cut++ reading order, AI safety filters, hybrid AI mode, first open-source PDF auto-tagging to Tagged PDF, PDF Association + Dual Lab (veraPDF) collaboration, Well-Tagged PDF spec compliance]
+-->
+
 # OpenDataLoader PDF
 
-**PDF
+**PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.**
 
-[](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
 [](https://pypi.org/project/opendataloader-pdf/)
 [](https://www.npmjs.com/package/@opendataloader/pdf)
 [](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
-[](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
 [](https://github.com/opendataloader-project/opendataloader-pdf#java)
 
-
+🔍 **PDF parser for AI data extraction** — Extract Markdown, JSON (with bounding boxes), and HTML from any PDF. #1 in benchmarks (0.90 overall). Deterministic local mode + AI hybrid mode for complex pages.
+
+- **How accurate is it?** — #1 in benchmarks: 0.90 overall, 0.93 table accuracy across 200 real-world PDFs including multi-column and scientific papers. Deterministic local mode + AI hybrid mode for complex pages ([benchmarks](#extraction-benchmarks))
+- **Scanned PDFs and OCR?** — Yes. Built-in OCR (80+ languages) in hybrid mode. Works with poor-quality scans at 300 DPI+ ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
+- **Tables, formulas, images, charts?** — Yes. Complex/borderless tables, LaTeX formulas, and AI-generated picture/chart descriptions all via hybrid mode ([hybrid mode](#hybrid-mode-1-accuracy-for-complex-pdfs))
+- **How do I use this for RAG?** — `pip install opendataloader-pdf`, convert in 3 lines. Outputs structured Markdown for chunking, JSON with bounding boxes for source citations, and HTML. LangChain integration available. Python, Node.js, Java SDKs ([quick start](#get-started-in-30-seconds) | [LangChain](#langchain-integration))
+
+♿ **PDF accessibility automation** — The same layout analysis engine also powers auto-tagging. First open-source tool to generate Tagged PDFs end-to-end (coming Q2 2026).
+
+- **What's the problem?** — Accessibility regulations are now enforced worldwide. Manual PDF remediation costs $50–200 per document and doesn't scale ([regulations](#pdf-accessibility--pdfua-conversion))
+- **What's free?** — Layout analysis + auto-tagging (Q2 2026, Apache 2.0). Untagged PDF in → Tagged PDF out. No proprietary SDK dependency ([auto-tagging preview](#auto-tagging-preview-coming-q2-2026))
+- **What about PDF/UA compliance?** — Converting Tagged PDF to PDF/UA-1 or PDF/UA-2 is an enterprise add-on. Auto-tagging generates the Tagged PDF; PDF/UA export is the final step ([pipeline](#accessibility-pipeline))
+- **Why trust this?** — Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) ([veraPDF](https://verapdf.org) developers). Auto-tagging follows the Well-Tagged PDF specification, validated with veraPDF ([collaboration](https://opendataloader.org/docs/tagged-pdf-collaboration))
 
-
-
-
-- **Private** — 100% local, zero data transmission
-- **Accurate** — Bounding boxes for every element, correct multi-column reading order
+## Get Started in 30 Seconds
+
+**Requires**: Java 11+ and Python 3.9+ ([Node.js](https://opendataloader.org/docs/quick-start-nodejs) | [Java](https://opendataloader.org/docs/quick-start-java) also available)
 
 ```bash
 pip install -U opendataloader-pdf
@@ -24,226 +48,178 @@ pip install -U opendataloader-pdf
 ```python
 import opendataloader_pdf
 
-#
+# Batch all files in one call — each convert() spawns a JVM process, so repeated calls are slow
 opendataloader_pdf.convert(
-    input_path="
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
     format="markdown,json"
 )
 ```
 
-
-
-
-
-
-
-|
-
-
-
-|
-
-| **
-|
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+## What Problems Does This Solve?
+
+| Problem | Solution | Status |
+|---------|----------|--------|
+| **PDF structure lost during parsing** — wrong reading order, broken tables, no element coordinates | Deterministic local PDF to Markdown/JSON with bounding boxes, XY-Cut++ reading order | Shipped |
+| **Complex tables, scanned PDFs, formulas, charts** need AI-level understanding | Hybrid mode routes complex pages to AI backend (#1 in benchmarks) | Shipped |
+| **PDF accessibility compliance** — EAA, ADA, Section 508 enforced. Manual remediation $50–200/doc | Auto-tagging: layout analysis → Tagged PDF (free, Q2 2026). Built with PDF Association & veraPDF validation. PDF/UA export (enterprise add-on) | Auto-tag: Q2 2026 |
+
+## Capability Matrix
+
+| Capability | Supported | Tier |
+|------------|-----------|------|
+| **Data extraction** | | |
+| Extract text with correct reading order | Yes | Free |
+| Bounding boxes for every element | Yes | Free |
+| Table extraction (simple borders) | Yes | Free |
+| Table extraction (complex/borderless) | Yes | Free (Hybrid) |
+| Heading hierarchy detection | Yes | Free |
+| List detection (numbered, bulleted, nested) | Yes | Free |
+| Image extraction with coordinates | Yes | Free |
+| AI chart/image description | Yes | Free (Hybrid) |
+| OCR for scanned PDFs | Yes | Free (Hybrid) |
+| Formula extraction (LaTeX) | Yes | Free (Hybrid) |
+| Tagged PDF structure extraction | Yes | Free |
+| AI safety (prompt injection filtering) | Yes | Free |
+| Header/footer/watermark filtering | Yes | Free |
+| **Accessibility** | | |
+| Auto-tagging → Tagged PDF for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0) |
+| PDF/UA-1, PDF/UA-2 export | 💼 Available | Enterprise |
+| Accessibility studio (visual editor) | 💼 Available | Enterprise |
+| **Limitations** | | |
+| Process Word/Excel/PPT | No | — |
+| GPU required | No | — |
+
+## Extraction Benchmarks
+
+**opendataloader-pdf [hybrid] ranks #1 overall (0.90)** across reading order, table, and heading extraction accuracy.
+
+| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) |
+|--------|---------|---------------|-------|---------|----------------|
+| **opendataloader [hybrid]** | **0.90** | **0.94** | **0.93** | **0.83** | 0.43 |
+| opendataloader | 0.72 | 0.91 | 0.49 | 0.76 | **0.05** |
+| docling | 0.86 | 0.90 | 0.89 | 0.80 | 0.73 |
+| marker | 0.83 | 0.89 | 0.81 | 0.80 | 53.93 |
+| mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96 |
+| pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09 |
+| markitdown | 0.29 | 0.88 | 0.00 | 0.00 | **0.04** |
+
+> Scores normalized to [0, 1]. Higher is better for accuracy; lower is better for speed. **Bold** = best. [Full benchmark details](https://github.com/opendataloader-project/opendataloader-bench)
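The added README text elsewhere describes hybrid mode as giving "+90% table accuracy". That figure can be checked against the two opendataloader rows of the benchmark table (table score 0.49 in local mode, 0.93 in hybrid mode). A minimal sketch using only those published numbers; the helper name `relative_gain` is illustrative and not part of the package:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Table-accuracy scores copied from the benchmark table (local vs. hybrid).
LOCAL_TABLE, HYBRID_TABLE = 0.49, 0.93

gain = relative_gain(LOCAL_TABLE, HYBRID_TABLE)
print(f"table accuracy gain: {gain:.0%}")  # prints: table accuracy gain: 90%
```

Read this way, the "+90%" is a relative gain over local mode, not an absolute accuracy of 90%. According to the same table it costs 0.43 s/page instead of 0.05 s/page.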
 
-
-
-- **Tables** — Detects borders, handles merged cells
-- **Lists** — Numbered, bulleted, nested
-- **Headings** — Auto-detects hierarchy levels
-- **Images** — Extracts with captions linked
-- **Tagged PDF Support** — Uses native PDF structure when available
-- **AI Safety** — Auto-filters prompt injection content
-
-<br/>
+[](https://github.com/opendataloader-project/opendataloader-bench)
 
 ## Which Mode Should I Use?
 
-| Your Document | Mode |
-
-| Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` |
-| Complex or nested tables | Hybrid |
-| Scanned / image-based PDF | Hybrid + OCR |
-|
-| Mathematical formulas
-
-
-
-## Output Formats
-
-| Format | Use Case |
-|--------|----------|
-| **JSON** | Structured data with bounding boxes, semantic types |
-| **Markdown** | Clean text for LLM context, RAG chunks |
-| **HTML** | Web display with styling |
-| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
-
-<br/>
-
-## JSON Output Example
-
-```json
-{
-  "type": "heading",
-  "id": 42,
-  "level": "Title",
-  "page number": 1,
-  "bounding box": [72.0, 700.0, 540.0, 730.0],
-  "heading level": 1,
-  "font": "Helvetica-Bold",
-  "font size": 24.0,
-  "text color": "[0.0]",
-  "content": "Introduction"
-}
-```
-
-| Field | Description |
-|-------|-------------|
-| `type` | Element type: heading, paragraph, table, list, image, caption |
-| `id` | Unique identifier for cross-referencing |
-| `page number` | 1-indexed page reference |
-| `bounding box` | `[left, bottom, right, top]` in PDF points |
-| `heading level` | Heading depth (1+) |
-| `font`, `font size` | Typography info |
-| `content` | Extracted text |
-
-[Full JSON Schema →](https://opendataloader.org/docs/json-schema)
-
-<br/>
+| Your Document | Mode | Install | Server Command | Client Command |
+|---------------|------|---------|----------------|----------------|
+| Standard digital PDF | Fast (default) | `pip install opendataloader-pdf` | None needed | `opendataloader-pdf file1.pdf file2.pdf folder/` |
+| Complex or nested tables | **Hybrid** | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
+| Scanned / image-based PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
+| Non-English scanned PDF | Hybrid + OCR | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"` | `opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/` |
+| Mathematical formulas | Hybrid + formula | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-formula` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
+| Charts needing description | Hybrid + picture | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-picture-description` | `opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/` |
+| Untagged PDFs needing accessibility | Auto-tagging → Tagged PDF | Coming Q2 2026 | — | — |
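The new mode table pairs each document type with a server command and a client command. The sketch below shows one way that mapping could be encoded for a batch script. It uses only the executables and flags that appear in the table; the document-type keys (`"standard"`, `"scanned"`, and so on) and both helper functions are hypothetical names introduced here, not part of the package.

```python
from typing import Dict, List

# Server flags per document type, copied from the mode table.
# The dictionary keys are labels invented for this sketch.
SERVER_FLAGS: Dict[str, List[str]] = {
    "complex-tables": ["--port", "5002"],
    "scanned": ["--port", "5002", "--force-ocr"],
    "formulas": ["--enrich-formula"],
    "charts": ["--enrich-picture-description"],
}

# Enrichments that the table pairs with `--hybrid-mode full` on the client.
FULL_MODE = {"formulas", "charts"}


def server_command(doc_type: str) -> List[str]:
    """Build the backend argv; standard digital PDFs need no server."""
    if doc_type == "standard":
        return []
    if doc_type not in SERVER_FLAGS:
        raise ValueError(f"unknown document type: {doc_type}")
    return ["opendataloader-pdf-hybrid", *SERVER_FLAGS[doc_type]]


def client_command(doc_type: str, inputs: List[str]) -> List[str]:
    """Build the client argv for one batch of files and folders."""
    if doc_type == "standard":
        return ["opendataloader-pdf", *inputs]
    if doc_type not in SERVER_FLAGS:
        raise ValueError(f"unknown document type: {doc_type}")
    cmd = ["opendataloader-pdf", "--hybrid", "docling-fast"]
    if doc_type in FULL_MODE:
        cmd += ["--hybrid-mode", "full"]
    return cmd + inputs
```

Either list can be passed to `subprocess.run`. Passing all inputs in one client call matches the README's advice to batch files instead of invoking the converter repeatedly.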
 
 ## Quick Start
 
-
-- [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
-- [Docker](https://opendataloader.org/docs/quick-start-docker)
-- [Java](https://opendataloader.org/docs/quick-start-java)
-
-<br/>
+### Python
 
-
+```bash
+pip install -U opendataloader-pdf
+```
 
 ```python
+import opendataloader_pdf
+
 opendataloader_pdf.convert(
-    input_path="
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
-    format="
-
-    # Image output mode: "off", "embedded" (Base64), or "external" (default)
-    image_output="embedded",
-
-    # Image format: "png" or "jpeg"
-    image_format="jpeg",
-
-    # Tagged PDF
-    use_struct_tree=True, # Use native PDF structure
+    format="markdown,json"
 )
 ```
 
-
-
-<br/>
-
-## AI Safety
-
-PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
-
-- Hidden text (transparent, zero-size)
-- Off-page content
-- Suspicious invisible layers
-
-This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
-
-<br/>
-
-## Tagged PDF Support
-
-**Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
+### Node.js
 
-
+```bash
+npm install @opendataloader/pdf
+```
 
-
-
-- No guessing, no heuristics needed — **pixel-perfect semantic extraction**
+```typescript
+import { convert } from '@opendataloader/pdf';
 
-
-
-
-
-)
+await convert(['file1.pdf', 'file2.pdf', 'folder/'], {
+  outputDir: 'output/',
+  format: 'markdown,json'
+});
 ```
 
-
+### Java
 
-
-
-<
+```xml
+<dependency>
+  <groupId>org.opendataloader</groupId>
+  <artifactId>opendataloader-pdf-core</artifactId>
+</dependency>
+```
 
-
+[Python Quick Start](https://opendataloader.org/docs/quick-start-python) | [Node.js Quick Start](https://opendataloader.org/docs/quick-start-nodejs) | [Java Quick Start](https://opendataloader.org/docs/quick-start-java)
 
-
+## Hybrid Mode: #1 Accuracy for Complex PDFs
 
-
+Hybrid mode combines fast local Java processing with AI backends. Simple pages stay local (0.05s); complex pages route to AI for +90% table accuracy.
 
 ```bash
 pip install -U "opendataloader-pdf[hybrid]"
 ```
 
-Terminal 1
+**Terminal 1** — Start the backend server:
 
 ```bash
 opendataloader-pdf-hybrid --port 5002
 ```
 
-Terminal 2
+**Terminal 2** — Process PDFs:
 
 ```bash
-opendataloader-pdf --hybrid docling-fast
+opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf folder/
 ```
 
-
+**Python:**
 
 ```python
 opendataloader_pdf.convert(
-    input_path="
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
     output_dir="output/",
-    hybrid="docling-fast"
+    hybrid="docling-fast"
 )
 ```
 
-
-
-
+### OCR for Scanned PDFs
+
+Start the backend with `--force-ocr` for image-based PDFs with no selectable text:
+
+```bash
+opendataloader-pdf-hybrid --port 5002 --force-ocr
+```
+
+For non-English documents, specify the language:
+
+```bash
+opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
+```
+
+Supported languages: `en`, `ko`, `ja`, `ch_sim`, `ch_tra`, `de`, `fr`, `ar`, and more.
 
 ### Formula Extraction (LaTeX)
 
-
+Extract mathematical formulas as LaTeX from scientific PDFs:
 
 ```bash
-#
+# Server: enable formula enrichment
 opendataloader-pdf-hybrid --enrich-formula
 
-#
-opendataloader-pdf --hybrid docling-fast --hybrid-mode full
+# Client: must use full mode for enrichments
+opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
 ```
 
 Output in JSON:
@@ -256,95 +232,111 @@ Output in JSON:
 }
 ```
 
-
-```markdown
-$$
-\frac{f(x+h) - f(x)}{h}
-$$
-```
+> **Note**: Formula and picture description enrichments require `--hybrid-mode full` on the client side.
 
-
-```html
-<div class="math-display">\[\frac{f(x+h) - f(x)}{h}\]</div>
-```
+### Chart & Image Description
 
-
-
-### Scanned PDFs (OCR)
-
-For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend:
+Generate AI descriptions for charts and images — useful for RAG search and accessibility alt text:
 
 ```bash
-#
-opendataloader-pdf-hybrid --
+# Server
+opendataloader-pdf-hybrid --enrich-picture-description
 
-#
-opendataloader-pdf --hybrid docling-fast
+# Client (must use full mode)
+opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf file2.pdf folder/
 ```
 
-
-
-
-
+Output in JSON:
+```json
+{
+  "type": "picture",
+  "page number": 1,
+  "bounding box": [72.0, 400.0, 540.0, 650.0],
+  "description": "A bar chart showing waste generation by region from 2016 to 2030..."
+}
 ```
 
->
+> Uses SmolVLM (256M), a lightweight vision model. Custom prompts supported via `--picture-description-prompt`.
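The README says picture descriptions are useful for RAG search and alt text. The sketch below shows how they might be pulled out of the JSON output. It assumes the elements have already been loaded into a flat list of dictionaries shaped like the `picture` example; the top-level layout of a real output file is not shown in this diff and may differ. The helper name `picture_captions` is hypothetical.

```python
import json
from typing import Dict, List


def picture_captions(elements: List[dict]) -> List[Dict[str, object]]:
    """Collect page number and description for pictures that have one."""
    return [
        {"page": el.get("page number"), "text": el["description"]}
        for el in elements
        if el.get("type") == "picture" and el.get("description")
    ]


# Hand-written sample mirroring the field names in the example above.
sample = json.loads("""
[
  {"type": "picture", "page number": 1,
   "bounding box": [72.0, 400.0, 540.0, 650.0],
   "description": "A bar chart showing waste generation by region"},
  {"type": "paragraph", "page number": 1, "content": "Intro text"},
  {"type": "picture", "page number": 2}
]
""")

captions = picture_captions(sample)
```

Pictures without a `description` are skipped, which is what you would see if the server ran without `--enrich-picture-description` or the client without `--hybrid-mode full`.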
 
-
+### Hancom Data Loader Integration — Coming Soon
 
-
+Enterprise-grade AI document analysis via [Hancom Data Loader](https://sdk.hancom.com/services/1) — customer-customized models trained on your domain-specific documents. 30+ element types (tables, charts, formulas, captions, footnotes, etc.), VLM-based image/chart understanding, complex table extraction (merged cells, nested tables), and native HWP/HWPX support. Supports PDF, DOCX, XLSX, PPTX, HWP, PNG, JPG. [Live demo](https://livedemo.sdk.hancom.com/dataloader)
 
-
+[Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode)
 
-
-# Start backend with picture description
-opendataloader-pdf-hybrid --enrich-picture-description
+## Output Formats
 
-
-
-
+| Format | Use Case |
+|--------|----------|
+| **JSON** | Structured data with bounding boxes, semantic types |
+| **Markdown** | Clean text for LLM context, RAG chunks |
+| **HTML** | Web display with styling |
+| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000)) |
+| **Text** | Plain text extraction |
+
+Combine formats: `format="json,markdown"`
+
+### JSON Output Example
 
-Output in JSON:
 ```json
 {
-  "type": "
+  "type": "heading",
+  "id": 42,
+  "level": "Title",
   "page number": 1,
-  "bounding box": [72.0,
-  "
+  "bounding box": [72.0, 700.0, 540.0, 730.0],
+  "heading level": 1,
+  "font": "Helvetica-Bold",
+  "font size": 24.0,
+  "text color": "[0.0]",
+  "content": "Introduction"
 }
 ```
 
-
-
-
+| Field | Description |
+|-------|-------------|
+| `type` | Element type: heading, paragraph, table, list, image, caption, formula |
+| `id` | Unique identifier for cross-referencing |
+| `page number` | 1-indexed page reference |
+| `bounding box` | `[left, bottom, right, top]` in PDF points (72pt = 1 inch) |
+| `heading level` | Heading depth (1+) |
+| `content` | Extracted text |
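The updated field table states that `bounding box` is `[left, bottom, right, top]` in PDF points, with 72pt = 1 inch. A small sketch of that unit conversion, using the heading example's box as input; `bbox_size_inches` is an illustrative helper, not a function provided by the package:

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # from the field table: 72pt = 1 inch


def bbox_size_inches(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return (width, height) in inches for [left, bottom, right, top]."""
    left, bottom, right, top = bbox
    return ((right - left) / POINTS_PER_INCH, (top - bottom) / POINTS_PER_INCH)


# Bounding box of the "Introduction" heading in the JSON example.
width_in, height_in = bbox_size_inches([72.0, 700.0, 540.0, 730.0])
print(f"{width_in:.2f} x {height_in:.2f} in")  # prints: 6.50 x 0.42 in
```

Because `bottom` is smaller than `top` in this format, y grows upward as in PDF's default user space. Mapping a box onto a rendered page image, where y usually grows downward, therefore also needs the page height, which the example element does not include.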
 
-
-```
+[Full JSON Schema](https://opendataloader.org/docs/json-schema)
 
-
-```html
-<figure>
-<img src="document_images/imageFile1.png" alt="figure1">
-<figcaption>A bar chart showing waste generation by region from 2016 to 2030...</figcaption>
-</figure>
-```
+## Advanced Features
 
-
+### Tagged PDF Support
 
-
-
-
+When a PDF has structure tags, OpenDataLoader extracts the **exact layout** the author intended — no guessing, no heuristics. Headings, lists, tables, and reading order are preserved from the source.
+
+```python
+opendataloader_pdf.convert(
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
+    output_dir="output/",
+    use_struct_tree=True # Use native PDF structure tags
+)
 ```
 
-
+Most PDF parsers ignore structure tags entirely. [Learn more](https://opendataloader.org/docs/tagged-pdf)
 
-
+### AI Safety: Prompt Injection Protection
 
-
+PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
 
-
+- Hidden text (transparent, zero-size fonts)
+- Off-page content
+- Suspicious invisible layers
 
-
+To sanitize sensitive data (emails, URLs, phone numbers → placeholders), enable it explicitly:
+
+```bash
+opendataloader-pdf input.pdf --sanitize
+```
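To make "sensitive data → placeholders" concrete, here is a rough illustration of placeholder substitution. It is not the tool's implementation: the regular expressions, the `[EMAIL]` and `[URL]` tokens, and the `sanitize` function are all assumptions made for this sketch, and the actual patterns and placeholder format used by `--sanitize` are not shown in this diff. Phone numbers are left out because their formats vary too much for a short example.

```python
import re

# Illustrative patterns only; the tool's real rules are not documented here.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL = re.compile(r"https?://[^\s)>\]]+")


def sanitize(text: str) -> str:
    """Replace emails and URLs with placeholder tokens (hypothetical format)."""
    text = EMAIL.sub("[EMAIL]", text)
    return URL.sub("[URL]", text)


cleaned = sanitize("Mail jane@example.com or visit https://example.com/docs today")
```

The point of substituting before chunking is that extracted text sent to an LLM or stored in a vector index no longer carries the raw contact details.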
+
+[AI Safety Guide](https://opendataloader.org/docs/ai-safety)
+
+### LangChain Integration
 
 ```bash
 pip install -U langchain-opendataloader-pdf
@@ -354,164 +346,226 @@ pip install -U langchain-opendataloader-pdf
 from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
 
 loader = OpenDataLoaderPDFLoader(
-    file_path=["
+    file_path=["file1.pdf", "file2.pdf", "folder/"],
     format="text"
 )
 documents = loader.load()
+```
 
-
-
-
+[LangChain Docs](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf) | [GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf) | [PyPI](https://pypi.org/project/langchain-opendataloader-pdf/)
+
+### Advanced Options
+
+```python
+opendataloader_pdf.convert(
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
+    output_dir="output/",
+    format="json,markdown,pdf",
+    image_output="embedded", # "off", "embedded" (Base64), or "external" (default)
+    image_format="jpeg", # "png" or "jpeg"
+    use_struct_tree=True, # Use native PDF structure
+)
 ```
 
-
-- [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
-- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
+[Full CLI Options Reference](https://opendataloader.org/docs/cli-options-reference)
 
-
+## PDF Accessibility & PDF/UA Conversion
 
-
+**Problem**: Millions of existing PDFs lack structure tags, failing accessibility regulations (EAA, ADA/Section 508, Korea Digital Inclusion Act). Manual remediation costs $50–200 per document and doesn't scale.
 
-
+**OpenDataLoader's approach**: Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (developers of [veraPDF](https://verapdf.org), the industry-reference open-source PDF/A and PDF/UA validator). Auto-tagging follows the [Well-Tagged PDF specification](https://pdfa.org/resource/well-tagged-pdf/) and is validated programmatically using veraPDF — automated conformance checks against PDF accessibility standards, not manual review. No existing open-source tool generates Tagged PDFs end-to-end — most rely on proprietary SDKs for the tag-writing step. OpenDataLoader does it all under Apache 2.0. ([collaboration details](https://opendataloader.org/docs/tagged-pdf-collaboration))
 
-
+| Regulation | Deadline | Requirement |
+|------------|----------|-------------|
+| **European Accessibility Act (EAA)** | June 28, 2025 | Accessible digital products across the EU |
+| **ADA & Section 508** | In effect | U.S. federal agencies and public accommodations |
+| **Digital Inclusion Act** | In effect | South Korea digital service accessibility |
 
-###
+### Standards & Validation
 
-|
-
-| **
-| **
-|
-|
-| mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96 |
-| pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09 |
-| markitdown | 0.29 | 0.88 | 0.00 | 0.00 | **0.04** |
+| Aspect | Detail |
+|--------|--------|
+| **Specification** | [Well-Tagged PDF](https://pdfa.org/resource/well-tagged-pdf/) by PDF Association |
+| **Validation** | [veraPDF](https://verapdf.org) — industry-reference open-source PDF/A & PDF/UA validator |
+| **Collaboration** | PDF Association + [Dual Lab](https://duallab.com) (veraPDF developers) co-develop tagging and validation |
+| **License** | Auto-tagging → Tagged PDF: Apache 2.0 (free). PDF/UA export: Enterprise |
 
-
+### Accessibility Pipeline
 
-
+| Step | Feature | Status | Tier |
+|------|---------|--------|------|
+| 1. **Audit** | Read existing PDF tags, detect untagged PDFs | Shipped | Free |
+| 2. **Auto-tag → Tagged PDF** | Generate structure tags for untagged PDFs | Coming Q2 2026 | Free (Apache 2.0) |
+| 3. **Export PDF/UA** | Convert to PDF/UA-1 or PDF/UA-2 compliant files | 💼 Available | Enterprise |
+| 4. **Visual editing** | Accessibility studio — review and fix tags | 💼 Available | Enterprise |
 
-[
+> **💼 Enterprise features** are available on request. [Contact us](https://opendataloader.org/contact) to get started.
 
+### Auto-Tagging Preview (Coming Q2 2026)
 
-
+```python
+# API shape preview — available Q2 2026
+opendataloader_pdf.convert(
+    input_path=["file1.pdf", "file2.pdf", "folder/"],
+    output_dir="output/",
+    auto_tag=True # Generate structure tags for untagged PDFs
+)
+```
 
-
+### End-to-End Compliance Workflow
 
-
+```
+Existing PDFs (untagged)
+         │
+         ▼
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│  1. Audit       │───>│  2. Remediate   │───>│  3. Export      │
+│  (check tags)   │    │  (auto-tag)     │    │  (PDF/UA)       │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+         │                      │                      │
+         ▼                      ▼                      ▼
+  use_struct_tree           auto_tag             PDF/UA export
+  (Available now)     (Q2 2026, Apache 2.0)      (Enterprise)
+                                                       │
+                                                       ▼
+                                             PDF/UA-1 or PDF/UA-2
+                                               compliant output
+```
 
-
+[PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)
|
|
405
436
|
|
|
406
|
-
##
|
|
437
|
+
## Roadmap
|
|
407
438
|
|
|
408
|
-
|
|
409
|
-
|
|
410
|
-
-
|
|
411
|
-
|
|
412
|
-
|
|
439
|
+
| Feature | Timeline | Tier |
|
|
440
|
+
|---------|----------|------|
|
|
441
|
+
| **Auto-tagging → Tagged PDF** — Generate Tagged PDFs from untagged PDFs | Q2 2026 | Free |
|
|
442
|
+
| **[Hancom Data Loader](https://sdk.hancom.com/services/1)** — Enterprise AI document analysis, customer-customized models, VLM-based chart/image understanding | Q2-Q3 2026 | Free |
|
|
443
|
+
| **Structure validation** — Verify PDF tag trees | Q2 2026 | Planned |
|
|
413
444
|
|
|
414
|
-
|
|
445
|
+
[Full Roadmap](https://opendataloader.org/docs/upcoming-roadmap)
|
|
415
446
|
|
|
416
447
|
## Frequently Asked Questions
|
|
417
448
|
|
|
418
449
|
### What is the best PDF parser for RAG?
|
|
419
450
|
|
|
420
|
-
For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this
|
|
451
|
+
For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this — it outputs structured JSON with bounding boxes, handles multi-column layouts with XY-Cut++, and runs locally without GPU. In hybrid mode, it ranks #1 overall (0.90) in benchmarks.
|
|
452
|
+
|
|
453
|
+
### What is the best open-source PDF parser?
|
|
454
|
+
|
|
455
|
+
OpenDataLoader PDF is the only open-source parser that combines: rule-based deterministic extraction (no GPU), bounding boxes for every element, XY-Cut++ reading order, built-in AI safety filters, native Tagged PDF support, and hybrid AI mode for complex documents. It ranks #1 in overall accuracy (0.90) while running locally on CPU.
|
|
421
456
|
|
|
422
457
|
### How do I extract tables from PDF for LLM?
|
|
423
458
|
|
|
424
|
-
OpenDataLoader detects tables using
|
|
459
|
+
OpenDataLoader detects tables using border analysis and text clustering, preserving row/column structure. For complex tables, enable hybrid mode, which raises the TEDS score from 0.49 to 0.93 (a roughly 90% relative improvement):
|
|
460
|
+
|
|
461
|
+
```python
|
|
462
|
+
opendataloader_pdf.convert(
|
|
463
|
+
input_path=["file1.pdf", "file2.pdf", "folder/"],
|
|
464
|
+
output_dir="output/",
|
|
465
|
+
format="json",
|
|
466
|
+
hybrid="docling-fast" # For complex tables
|
|
467
|
+
)
|
|
468
|
+
```
|
|
469
|
+
|
|
470
|
+
### How does it compare to docling, marker, or pymupdf4llm?
|
|
471
|
+
|
|
472
|
+
OpenDataLoader [hybrid] ranks #1 overall (0.90) across reading order, table, and heading accuracy. Key differences: docling (0.86) is strong but lacks bounding boxes and AI safety filters. marker (0.83) requires GPU and is 100x slower (53.93s/page). pymupdf4llm (0.57) is fast but has poor table (0.40) and heading (0.41) accuracy. OpenDataLoader is the only parser that combines deterministic local extraction, bounding boxes for every element, and built-in prompt injection protection. See [full benchmark](https://github.com/opendataloader-project/opendataloader-bench).
|
|
425
473
|
|
|
426
474
|
### Can I use this without sending data to the cloud?
|
|
427
475
|
|
|
428
|
-
Yes. OpenDataLoader runs 100% locally
|
|
476
|
+
Yes. OpenDataLoader runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The hybrid mode backend also runs locally on your machine. Ideal for legal, healthcare, and financial documents.
|
|
429
477
|
|
|
430
|
-
###
|
|
478
|
+
### Does it support OCR for scanned PDFs?
|
|
431
479
|
|
|
432
|
-
|
|
480
|
+
Yes, via hybrid mode. Install with `pip install "opendataloader-pdf[hybrid]"`, start the backend with `--force-ocr`, then process as usual. Supports multiple languages including Korean, Japanese, Chinese, Arabic, and more via `--ocr-lang`.
|
|
433
481
|
|
|
434
|
-
|
|
435
|
-
- **Bounding boxes for all elements** — Essential for citation systems
|
|
436
|
-
- **XY-Cut++ reading order** — Handles multi-column layouts correctly
|
|
437
|
-
- **Built-in AI safety filters** — Protects against prompt injection
|
|
438
|
-
- **Native Tagged PDF support** — Leverages accessibility metadata
|
|
482
|
+
### Does it work with Korean, Japanese, or Chinese documents?
|
|
439
483
|
|
|
440
|
-
|
|
484
|
+
Yes. For digital PDFs, text extraction works out of the box. For scanned PDFs, use hybrid mode with `--force-ocr --ocr-lang "ko,en"` (or `ja`, `ch_sim`, `ch_tra`). Coming soon: [Hancom Data Loader](https://sdk.hancom.com/services/1) integration — enterprise-grade AI document analysis with customer-customized models optimized for your specific document types and workflows.
|
|
441
485
|
|
|
442
|
-
### How
|
|
486
|
+
### How fast is it?
|
|
443
487
|
|
|
444
|
-
|
|
488
|
+
Local mode processes 100+ pages per second on CPU (0.05s/page). Hybrid mode is 0.43s/page with significantly higher accuracy for complex documents. No GPU required. Benchmarked on Apple M4. [Full benchmark details](https://github.com/opendataloader-project/opendataloader-bench)
|
|
445
489
|
|
|
446
|
-
### Does it
|
|
490
|
+
### Does it handle multi-column layouts?
|
|
447
491
|
|
|
448
|
-
Yes,
|
|
492
|
+
Yes. OpenDataLoader uses XY-Cut++ reading order analysis to correctly sequence text across multi-column pages, sidebars, and mixed layouts. This works in both local and hybrid modes without any configuration.
|
|
449
493
|
|
|
450
|
-
|
|
494
|
+
### What is hybrid mode?
|
|
451
495
|
|
|
452
|
-
|
|
453
|
-
opendataloader-pdf-hybrid --port 5002 --force-ocr
|
|
454
|
-
```
|
|
496
|
+
Hybrid mode combines fast local Java processing with an AI backend. Simple pages are processed locally (0.05s/page); complex pages (tables, scanned content, formulas, charts) are automatically routed to the AI backend for higher accuracy. The backend runs locally on your machine — no cloud required. See [Which Mode Should I Use?](#which-mode-should-i-use) and [Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode).
|
|
455
497
|
|
|
456
|
-
|
|
498
|
+
### Does it work with LangChain?
|
|
457
499
|
|
|
458
|
-
|
|
459
|
-
opendataloader-pdf --hybrid docling-fast input-scanned.pdf
|
|
460
|
-
```
|
|
500
|
+
Yes. Install `langchain-opendataloader-pdf` for an official LangChain document loader integration. See [LangChain docs](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf).
|
|
461
501
|
|
|
462
|
-
|
|
502
|
+
### How do I chunk PDFs for RAG?
|
|
503
|
+
|
|
504
|
+
OpenDataLoader outputs structured Markdown with headings, tables, and lists preserved — ideal input for semantic chunking. Each element in JSON output includes `type`, `heading level`, and `page number`, so you can split by section or page boundary. For most RAG pipelines: parse with `format="markdown"` for text chunks, or `format="json"` when you need element-level control. Pair with LangChain's `RecursiveCharacterTextSplitter` or your own heading-based splitter for best results.
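
As a rough illustration of the "heading-based splitter" idea, the sketch below splits a Markdown string into chunks at ATX headings (`#` to `######`) and records the heading trail for each chunk. It is plain Python and does not call OpenDataLoader; the names `Chunk` and `split_by_heading` are hypothetical, and it assumes the Markdown output uses ATX-style headings.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # trail of enclosing headings, outermost first
    text: str                # body text under the innermost heading


# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading section (hypothetical helper)."""
    chunks: list[Chunk] = []
    path: list[str] = []
    body: list[str] = []
    in_fence = False  # lines inside ``` fences are never treated as headings

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(list(path), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks
```

Each `Chunk.heading_path` can be prepended to the chunk text before embedding, so a retrieved passage keeps its section context. Oversized sections can then be passed through a character-based splitter.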
|
|
505
|
+
|
|
506
|
+
### How do I cite PDF sources in RAG answers?
|
|
507
|
+
|
|
508
|
+
Every element in JSON output includes a `bounding box` (`[left, bottom, right, top]` in PDF points) and `page number`. When your RAG pipeline returns an answer, map the source chunk back to its bounding box to highlight the exact location in the original PDF. This enables "click to source" UX — users see which paragraph, table, or figure the answer came from. No other open-source parser provides bounding boxes for every element by default.
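
To draw a highlight on a rendered page image, the bounding box has to be converted from PDF points to pixels. The sketch below shows one way to do that. It assumes the box uses the usual PDF coordinate space (origin at the bottom-left corner, 72 points per inch) and that the page height in points is known from elsewhere; `PixelRect` and `bbox_to_pixels` are hypothetical names, not part of the OpenDataLoader API.

```python
from typing import NamedTuple, Sequence


class PixelRect(NamedTuple):
    x0: float  # left edge, pixels from the left of the image
    y0: float  # top edge, pixels from the top of the image
    x1: float  # right edge
    y1: float  # bottom edge


def bbox_to_pixels(
    bbox: Sequence[float], page_height_pt: float, dpi: float = 144.0
) -> PixelRect:
    """Convert [left, bottom, right, top] in PDF points to image pixels.

    Assumes a bottom-left PDF origin, so the y axis is flipped for images,
    whose origin is at the top-left.
    """
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    return PixelRect(
        x0=left * scale,
        y0=(page_height_pt - top) * scale,
        x1=right * scale,
        y1=(page_height_pt - bottom) * scale,
    )
```

For a US Letter page (792 points tall) rendered at 144 DPI, `bbox_to_pixels([72, 700, 300, 720], 792)` gives the rectangle to draw over the page image for that element.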
|
|
509
|
+
|
|
510
|
+
### How do I convert PDF to Markdown for LLM?
|
|
463
511
|
|
|
464
512
|
```python
|
|
513
|
+
import opendataloader_pdf
|
|
514
|
+
|
|
465
515
|
opendataloader_pdf.convert(
|
|
466
|
-
input_path="
|
|
516
|
+
input_path=["file1.pdf", "file2.pdf", "folder/"],
|
|
467
517
|
output_dir="output/",
|
|
468
|
-
|
|
518
|
+
format="markdown"
|
|
469
519
|
)
|
|
470
520
|
```
|
|
471
521
|
|
|
472
|
-
|
|
522
|
+
OpenDataLoader preserves heading hierarchy, table structure, and reading order in the Markdown output. For complex documents with borderless tables or scanned pages, use hybrid mode (`hybrid="docling-fast"`) for higher accuracy. The output is clean enough to feed directly into LLM context windows or RAG chunking pipelines.
|
|
473
523
|
|
|
474
|
-
|
|
524
|
+
### Is there an automated PDF accessibility remediation tool?
|
|
475
525
|
|
|
476
|
-
|
|
477
|
-
opendataloader-pdf-hybrid --port 5002 --ocr-lang "ko,en"
|
|
478
|
-
```
|
|
526
|
+
Yes. OpenDataLoader is the first open-source tool that automates PDF accessibility end-to-end. Built in collaboration with [PDF Association](https://pdfa.org) and [Dual Lab](https://duallab.com) (veraPDF developers), auto-tagging follows the Well-Tagged PDF specification and is validated programmatically using veraPDF. The layout analysis engine detects document structure (headings, tables, lists, reading order) and generates accessibility tags automatically. Auto-tagging (Q2 2026) converts untagged PDFs into Tagged PDFs under Apache 2.0 — no proprietary SDK dependency. For organizations needing full PDF/UA compliance, enterprise add-ons provide PDF/UA export and a visual tag editor. This replaces manual remediation workflows that typically cost $50–200+ per document.
|
|
479
527
|
|
|
480
|
-
###
|
|
528
|
+
### Is this really the first open-source PDF auto-tagging tool?
|
|
481
529
|
|
|
482
|
-
|
|
530
|
+
Yes. Existing tools either depend on proprietary SDKs for writing structure tags, only output non-PDF formats (e.g., Docling outputs Markdown/JSON but cannot produce Tagged PDFs), or require manual intervention. OpenDataLoader is the first to do layout analysis → tag generation → Tagged PDF output entirely under an open-source license (Apache 2.0), with no proprietary dependency. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, the industry-reference open-source PDF/A and PDF/UA validator.
|
|
483
531
|
|
|
484
|
-
|
|
532
|
+
### How do I convert existing PDFs to PDF/UA?
|
|
485
533
|
|
|
486
|
-
|
|
487
|
-
opendataloader_pdf.convert(
|
|
488
|
-
input_path="document.pdf",
|
|
489
|
-
output_dir="output/",
|
|
490
|
-
image_output="external" # Saves images as files with bounding boxes in JSON
|
|
491
|
-
)
|
|
492
|
-
```
|
|
534
|
+
OpenDataLoader provides an end-to-end pipeline in three steps: audit existing PDFs for tags (`use_struct_tree=True`), auto-tag untagged PDFs into Tagged PDFs (Q2 2026, free under Apache 2.0), and export as PDF/UA-1 or PDF/UA-2 (enterprise add-on). Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF; it produces the Tagged PDF that the PDF/UA export step then converts. [Contact us](https://opendataloader.org/contact) for enterprise integration.
|
|
493
535
|
|
|
494
|
-
|
|
536
|
+
### How do I make my PDFs accessible for EAA compliance?
|
|
495
537
|
|
|
496
|
-
|
|
497
|
-
# Start backend with picture description enabled
|
|
498
|
-
opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
|
|
538
|
+
The European Accessibility Act requires accessible digital products by June 28, 2025. OpenDataLoader supports the full remediation workflow: audit → auto-tag → Tagged PDF → PDF/UA export. Auto-tagging follows the PDF Association's Well-Tagged PDF specification and is validated using veraPDF, ensuring standards-compliant output. Auto-tagging to Tagged PDF will be open-sourced under Apache 2.0 (Q2 2026). PDF/UA export and accessibility studio are enterprise add-ons. See our [Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance).
|
|
499
539
|
|
|
500
|
-
|
|
501
|
-
|
|
502
|
-
|
|
540
|
+
### Is OpenDataLoader PDF free?
|
|
541
|
+
|
|
542
|
+
The core library is **open-source under Apache 2.0** — free for commercial use. This includes all extraction features (text, tables, images, OCR, formulas, charts via hybrid mode), AI safety filters, Tagged PDF support, and auto-tagging to Tagged PDF (Q2 2026). We are committed to keeping the core accessibility pipeline (layout analysis → auto-tagging → Tagged PDF) free and open-source. Enterprise add-ons (PDF/UA export, accessibility studio) are available for organizations needing end-to-end regulatory compliance.
|
|
543
|
+
|
|
544
|
+
### Why did the license change from MPL 2.0 to Apache 2.0?
|
|
503
545
|
|
|
504
|
-
|
|
546
|
+
MPL 2.0 requires file-level copyleft, which often triggers legal review before enterprise adoption. Apache 2.0 is permissive: it has no copyleft obligations and is easier to integrate into commercial projects. If you are using a pre-2.0 version, it remains under MPL 2.0 and you can continue using it. After upgrading to 2.0+, you use the library under Apache 2.0 terms, which drop the file-level copyleft requirement; the usual Apache 2.0 notice and attribution conditions still apply, and no other action is needed on your side.
|
|
547
|
+
|
|
548
|
+
## Documentation
|
|
549
|
+
|
|
550
|
+
- [Quick Start (Python)](https://opendataloader.org/docs/quick-start-python)
|
|
551
|
+
- [Quick Start (Node.js)](https://opendataloader.org/docs/quick-start-nodejs)
|
|
552
|
+
- [Quick Start (Java)](https://opendataloader.org/docs/quick-start-java)
|
|
553
|
+
- [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
|
|
554
|
+
- [CLI Options](https://opendataloader.org/docs/cli-options-reference)
|
|
555
|
+
- [Hybrid Mode Guide](https://opendataloader.org/docs/hybrid-mode)
|
|
556
|
+
- [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
|
|
557
|
+
- [AI Safety Features](https://opendataloader.org/docs/ai-safety)
|
|
558
|
+
- [PDF Accessibility](https://opendataloader.org/docs/accessibility-compliance)
|
|
505
559
|
|
|
506
560
|
## Contributing
|
|
507
561
|
|
|
508
562
|
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
|
|
509
563
|
|
|
510
|
-
<br/>
|
|
511
|
-
|
|
512
564
|
## License
|
|
513
565
|
|
|
514
|
-
[
|
|
566
|
+
[Apache License 2.0](LICENSE)
|
|
567
|
+
|
|
568
|
+
> **Note:** Versions prior to 2.0 are licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).
|
|
515
569
|
|
|
516
570
|
---
|
|
517
571
|
|