@opendataloader/pdf 1.3.0 → 1.4.2
- package/NOTICE.md +1 -1
- package/README.md +193 -369
- package/dist/cli.cjs +140 -65
- package/dist/cli.cjs.map +1 -1
- package/dist/cli.js +140 -65
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +102 -81
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +48 -12
- package/dist/index.d.ts +48 -12
- package/dist/index.js +101 -81
- package/dist/index.js.map +1 -1
- package/lib/opendataloader-pdf-cli.jar +0 -0
- package/package.json +2 -2
package/README.md
CHANGED
|
@@ -1,483 +1,307 @@
|
|
|
1
1
|
# OpenDataLoader PDF
|
|
2
2
|
|
|
3
|
+
**PDF to Markdown & JSON for RAG** — Fast, Local, No GPU Required
|
|
3
4
|
|
|
4
5
|
[License](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
|
|
5
|
-

|
|
6
|
-

|
|
7
|
-
[Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
|
|
8
6
|
[PyPI](https://pypi.org/project/opendataloader-pdf/)
|
|
9
7
|
[npm](https://www.npmjs.com/package/@opendataloader/pdf)
|
|
10
|
-
[
|
|
33
|
-
- 🏷️ **Tagged PDF** — Advanced data extraction technology based on Tagged PDF - [Learn more](https://opendataloader.org/docs/tagged-pdf)
|
|
34
|
-
- 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original
|
|
35
|
-
|
|
36
|
-
[Download Annotated PDF Sample](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/1901.03003_annotated.pdf)
|
|
37
|
-
|
|
38
|
-

|
|
39
|
-
|
|
40
|
-
<br/>
|
|
41
|
-
|
|
42
|
-
## 🚀 Upcoming Features
|
|
43
|
-
|
|
44
|
-
**Scheduled for November**
|
|
45
|
-
- ⚡ **Performance Improvement** — Enhance the inference skill for greater accuracy and speed.
|
|
46
|
-
- 📊 **Benchmarks & Datasets** — Publish transparent evaluations using open datasets and standardized metrics.
|
|
47
|
-
- 🎯 **Metrics** — Publish the calculation methods to transparently share benchmark results.
|
|
48
|
-
<br/>
|
|
49
|
-
|
|
50
|
-
**Scheduled for December**
|
|
51
|
-
- 🖨️ **OCR for scanned PDFs** — Extract data from image-only pages.
|
|
52
|
-
- 🧠 **Table AI option** — Higher accuracy for tables with borderless or merged cells.
|
|
53
|
-
<br/>
|
|
54
|
-
|
|
55
|
-
**Scheduled for 2026**
|
|
56
|
-
- 🛡️ **AI Red Teaming** — Transparent adversarial benchmarks with datasets and metrics, then reported regularly.
|
|
57
|
-
<br/>
|
|
58
|
-
|
|
59
|
-
## Prerequisites
|
|
60
|
-
|
|
61
|
-
- Java 11 or higher must be installed and available in your system's PATH.
|
|
62
|
-
- Python 3.9+
|
|
63
|
-
|
|
64
|
-
<br/>
|
|
8
|
+
[Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
|
|
9
|
+
[Docker](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
|
|
10
|
+
[Java](https://github.com/opendataloader-project/opendataloader-pdf#java)
|
|
65
11
|
|
|
66
|
-
|
|
12
|
+
Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
|
|
67
13
|
|
|
68
|
-
|
|
14
|
+
**Why developers choose OpenDataLoader:**
|
|
15
|
+
- **Deterministic** — Same input always produces same output (no LLM hallucinations)
|
|
16
|
+
- **Fast** — Process 100+ pages per second on CPU
|
|
17
|
+
- **Private** — 100% local, zero data transmission
|
|
18
|
+
- **Accurate** — Bounding boxes for every element, correct multi-column reading order
|
|
69
19
|
|
|
70
|
-
```
|
|
20
|
+
```bash
|
|
71
21
|
pip install -U opendataloader-pdf
|
|
72
22
|
```
|
|
73
23
|
|
|
74
|
-
### Usage
|
|
75
|
-
|
|
76
|
-
input_path can be either the path to a single document or the path to a folder.
|
|
77
|
-
|
|
78
24
|
```python
|
|
79
25
|
import opendataloader_pdf
|
|
80
26
|
|
|
27
|
+
# PDF to Markdown for RAG
|
|
81
28
|
opendataloader_pdf.convert(
|
|
82
|
-
input_path=
|
|
83
|
-
output_dir="
|
|
84
|
-
format="json
|
|
29
|
+
input_path="document.pdf",
|
|
30
|
+
output_dir="output/",
|
|
31
|
+
format="markdown,json"
|
|
85
32
|
)
|
|
86
33
|
```
|
|
87
34
|
|
|
88
|
-
If you want to run it via CLI, you can use the following command on the terminal:
|
|
89
|
-
|
|
90
|
-
```bash
|
|
91
|
-
opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json,html,pdf,markdown
|
|
92
|
-
```
|
|
93
|
-
|
|
94
|
-
### Function: convert()
|
|
95
|
-
|
|
96
|
-
The main function to process PDFs.
|
|
97
|
-
|
|
98
|
-
| Parameter | Type | Required | Default | Description |
|
|
99
|
-
|-------------------------|-----------------------| -------- |--------------|------------------------------------------------------------------------------------------------------------------------------------------|
|
|
100
|
-
| `input_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
|
|
101
|
-
| `output_dir` | `Optional[str]` | No | input folder | Directory where outputs are written. |
|
|
102
|
-
| `password` | `Optional[str]` | No | `None` | Password used for encrypted PDFs. |
|
|
103
|
-
| `format` | `Optional[Union[str, List[str]]]` | No | `None` | Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images) |
|
|
104
|
-
| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
|
|
105
|
-
| `content_safety_off` | `Optional[Union[str, List[str]]]` | No | `None` | Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg) |
|
|
106
|
-
| `keep_line_breaks` | `bool` | No | `False` | Preserves line breaks in text output when `True`. |
|
|
107
|
-
| `replace_invalid_chars` | `Optional[str]` | No | `None` | Replacement character for invalid or unrecognized characters (e.g., �, `\u0000`). |
|
|
108
|
-
| `use_struct_tree` | `bool ` | No | `False` | Enable processing structure tree (disabled by default). |
|
|
109
|
-
|
|
110
|
-
### Function: run()
|
|
111
|
-
|
|
112
|
-
Deprecated.
|
|
113
|
-
|
|
114
35
|
<br/>
|
|
115
36
|
|
|
116
|
-
##
|
|
117
|
-
|
|
118
|
-
**Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.
|
|
37
|
+
## Why OpenDataLoader?
|
|
119
38
|
|
|
120
|
-
|
|
39
|
+
Building RAG pipelines? You've probably hit these problems:
|
|
121
40
|
|
|
122
|
-
|
|
41
|
+
| Problem | How We Solve It |
|
|
42
|
+
|---------|-----------------|
|
|
43
|
+
| **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
|
|
44
|
+
| **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
|
|
45
|
+
| **Headers/footers pollute context** | Auto-filtered before output |
|
|
46
|
+
| **No coordinates for citations** | Bounding box for every element |
|
|
47
|
+
| **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
|
|
48
|
+
| **GPU required** | Pure CPU, rule-based — runs anywhere |
|
|
123
49
|
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
```sh
|
|
127
|
-
npm install @opendataloader/pdf
|
|
128
|
-
```
|
|
50
|
+
<br/>
|
|
129
51
|
|
|
130
|
-
|
|
52
|
+
## Key Features
|
|
131
53
|
|
|
132
|
-
|
|
54
|
+
### For RAG & LLM Pipelines
|
|
133
55
|
|
|
134
|
-
|
|
135
|
-
|
|
56
|
+
- **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)
|
|
57
|
+
- **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
|
|
58
|
+
- **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly
|
|
59
|
+
- **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed
|
|
60
|
+
- **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
|
|
136
61
|
|
|
137
|
-
|
|
138
|
-
try {
|
|
139
|
-
await convert(['path/to/document.pdf', 'path/to/folder'], {
|
|
140
|
-
outputDir: 'path/to/output',
|
|
141
|
-
format: 'json,html,pdf,markdown',
|
|
142
|
-
});
|
|
143
|
-
console.log('convert() complete');
|
|
144
|
-
} catch (error) {
|
|
145
|
-
console.error('Error processing PDF:', error);
|
|
146
|
-
}
|
|
147
|
-
}
|
|
62
|
+
### Performance & Privacy
|
|
148
63
|
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
|
|
64
|
+
- **No GPU** — Fast, rule-based heuristics
|
|
65
|
+
- **Local-First** — Your documents never leave your machine
|
|
66
|
+
- **High Throughput** — Process thousands of PDFs efficiently
|
|
67
|
+
- **Multi-Language SDK** — Python, Node.js, Java, Docker
|
|
152
68
|
|
|
153
|
-
|
|
69
|
+
### Document Understanding
|
|
154
70
|
|
|
155
|
-
|
|
71
|
+
- **Tables** — Detects borders, handles merged cells
|
|
72
|
+
- **Lists** — Numbered, bulleted, nested
|
|
73
|
+
- **Headings** — Auto-detects hierarchy levels
|
|
74
|
+
- **Images** — Extracts with captions linked
|
|
75
|
+
- **Tagged PDF Support** — Uses native PDF structure when available
|
|
76
|
+
- **AI Safety** — Auto-filters prompt injection content
|
|
156
77
|
|
|
157
|
-
|
|
158
|
-
|--------------------------------| ---------- | ----------- |------------------------------------------------------------------------------------------------------------------------------|
|
|
159
|
-
| `inputPaths` | `string[]` | — | One or more file paths or directories to process. |
|
|
160
|
-
| `options.outputDir` | `string` | `undefined` | Directory where outputs are written. |
|
|
161
|
-
| `options.password` | `string` | `undefined` | Password for encrypted PDFs. |
|
|
162
|
-
| `options.format` | `string \| string[]` | `undefined` | Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images) |
|
|
163
|
-
| `options.quiet` | `boolean` | `false` | Suppress CLI logging output and prevent streaming. |
|
|
164
|
-
| `options.contentSafetyOff` | `string \| string[]` | `undefined` | Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg) |
|
|
165
|
-
| `options.keepLineBreaks` | `boolean` | `false` | Preserve line breaks in text output. |
|
|
166
|
-
| `options.replaceInvalidChars` | `string` | `undefined` | Replacement character for invalid or unrecognized characters. |
|
|
167
|
-
| `options.useStructTree` | `boolean` | `false` | Enable processing structure tree (disabled by default). |
|
|
78
|
+
<br/>
|
|
168
79
|
|
|
169
|
-
|
|
80
|
+
## Output Formats
|
|
170
81
|
|
|
171
|
-
|
|
82
|
+
| Format | Use Case |
|
|
83
|
+
|--------|----------|
|
|
84
|
+
| **JSON** | Structured data with bounding boxes, semantic types |
|
|
85
|
+
| **Markdown** | Clean text for LLM context, RAG chunks |
|
|
86
|
+
| **HTML** | Web display with styling |
|
|
87
|
+
| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
|
|
172
88
|
|
|
173
|
-
|
|
89
|
+
<br/>
|
|
174
90
|
|
|
175
|
-
|
|
176
|
-
|
|
91
|
+
## JSON Output Example
|
|
92
|
+
|
|
93
|
+
```json
|
|
94
|
+
{
|
|
95
|
+
"type": "heading",
|
|
96
|
+
"id": 42,
|
|
97
|
+
"level": "Title",
|
|
98
|
+
"page number": 1,
|
|
99
|
+
"bounding box": [72.0, 700.0, 540.0, 730.0],
|
|
100
|
+
"heading level": 1,
|
|
101
|
+
"font": "Helvetica-Bold",
|
|
102
|
+
"font size": 24.0,
|
|
103
|
+
"text color": "[0.0]",
|
|
104
|
+
"content": "Introduction"
|
|
105
|
+
}
|
|
177
106
|
```
|
|
178
107
|
|
|
179
|
-
|
|
108
|
+
| Field | Description |
|
|
109
|
+
|-------|-------------|
|
|
110
|
+
| `type` | Element type: heading, paragraph, table, list, image, caption |
|
|
111
|
+
| `id` | Unique identifier for cross-referencing |
|
|
112
|
+
| `page number` | 1-indexed page reference |
|
|
113
|
+
| `bounding box` | `[left, bottom, right, top]` in PDF points |
|
|
114
|
+
| `heading level` | Heading depth (1+) |
|
|
115
|
+
| `font`, `font size` | Typography info |
|
|
116
|
+
| `content` | Extracted text |
|
|
180
117
|
|
|
181
|
-
|
|
182
|
-
-o, --output-dir <path> Directory where outputs are written
|
|
183
|
-
-p, --password <password> Password for encrypted PDFs
|
|
184
|
-
-f, --format <values> Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
|
|
185
|
-
-q, --quiet Suppress CLI logging output
|
|
186
|
-
--content-safety-off <modes> Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)
|
|
187
|
-
--keep-line-breaks Preserve line breaks in text output
|
|
188
|
-
--replace-invalid-chars <c> Replacement character for invalid or unrecognized characters
|
|
189
|
-
-h, --help Show usage information
|
|
190
|
-
--use-struct-tree Enable processing structure tree (disabled by default)
|
|
191
|
-
```
|
|
118
|
+
[Full JSON Schema →](https://opendataloader.org/docs/json-schema)
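
As a rough illustration of how the `bounding box` field might be used for citations, the sketch below parses a trimmed copy of the sample element above and derives its width, height, and a short citation label. The field names and the `[left, bottom, right, top]` order come from the table above; the helper names `box_size` and `citation_label`, and the label format, are hypothetical and not part of the package.

```python
import json

# Sample element, trimmed from the JSON output example above.
ELEMENT_JSON = """
{
  "type": "heading",
  "id": 42,
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "content": "Introduction"
}
"""


def box_size(bbox):
    """Return (width, height) in PDF points for [left, bottom, right, top]."""
    left, bottom, right, top = bbox
    return right - left, top - bottom


def citation_label(element):
    """Build a short citation string (hypothetical format, for illustration)."""
    left, bottom, right, top = element["bounding box"]
    return (
        f'{element["content"]!r} (page {element["page number"]}, '
        f"box {left:g},{bottom:g},{right:g},{top:g})"
    )


element = json.loads(ELEMENT_JSON)
width, height = box_size(element["bounding box"])
print(width, height)
print(citation_label(element))
```

Because the origin of PDF user space is normally the bottom-left corner, `top` is larger than `bottom`, which is why the height is computed as `top - bottom`.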
|
|
192
119
|
|
|
193
120
|
<br/>
|
|
194
121
|
|
|
195
|
-
##
|
|
196
|
-
|
|
197
|
-
For various example templates, including Gradle and Maven, please refer to [Examples](https://github.com/opendataloader-project/opendataloader-pdf-examples).
|
|
198
|
-
|
|
199
|
-
### Dependency
|
|
200
|
-
|
|
201
|
-
To include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.
|
|
202
|
-
|
|
203
|
-
Check for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).
|
|
204
|
-
|
|
205
|
-
```xml
|
|
206
|
-
<project>
|
|
207
|
-
<!-- other configurations... -->
|
|
208
|
-
|
|
209
|
-
<dependencies>
|
|
210
|
-
<dependency>
|
|
211
|
-
<groupId>org.opendataloader</groupId>
|
|
212
|
-
<artifactId>opendataloader-pdf-core</artifactId>
|
|
213
|
-
<version>1.3.0</version>
|
|
214
|
-
</dependency>
|
|
215
|
-
</dependencies>
|
|
216
|
-
|
|
217
|
-
<repositories>
|
|
218
|
-
<repository>
|
|
219
|
-
<snapshots>
|
|
220
|
-
<enabled>true</enabled>
|
|
221
|
-
</snapshots>
|
|
222
|
-
<id>vera-dev</id>
|
|
223
|
-
<name>Vera development</name>
|
|
224
|
-
<url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
|
|
225
|
-
</repository>
|
|
226
|
-
</repositories>
|
|
227
|
-
<pluginRepositories>
|
|
228
|
-
<pluginRepository>
|
|
229
|
-
<snapshots>
|
|
230
|
-
<enabled>false</enabled>
|
|
231
|
-
</snapshots>
|
|
232
|
-
<id>vera-dev</id>
|
|
233
|
-
<name>Vera development</name>
|
|
234
|
-
<url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
|
|
235
|
-
</pluginRepository>
|
|
236
|
-
</pluginRepositories>
|
|
237
|
-
|
|
238
|
-
<!-- other configurations... -->
|
|
239
|
-
</project>
|
|
240
|
-
```
|
|
241
|
-
|
|
122
|
+
## Quick Start
|
|
242
123
|
|
|
243
|
-
|
|
124
|
+
- [Python](https://opendataloader.org/docs/quick-start-python)
|
|
125
|
+
- [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
|
|
126
|
+
- [Docker](https://opendataloader.org/docs/quick-start-docker)
|
|
127
|
+
- [Java](https://opendataloader.org/docs/quick-start-java)
|
|
244
128
|
|
|
245
|
-
|
|
129
|
+
<br/>
|
|
246
130
|
|
|
247
|
-
|
|
248
|
-
import org.opendataloader.pdf.api.Config;
|
|
249
|
-
import org.opendataloader.pdf.api.OpenDataLoaderPDF;
|
|
131
|
+
## Advanced Options
|
|
250
132
|
|
|
251
|
-
|
|
133
|
+
```python
|
|
134
|
+
opendataloader_pdf.convert(
|
|
135
|
+
input_path="document.pdf",
|
|
136
|
+
output_dir="output/",
|
|
137
|
+
format="json,markdown,pdf",
|
|
252
138
|
|
|
253
|
-
|
|
139
|
+
# Reading order
|
|
140
|
+
reading_order="xycut", # XY-Cut++ for multi-column
|
|
254
141
|
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
config.setGeneratePDF(true);
|
|
259
|
-
config.setGenerateMarkdown(true);
|
|
260
|
-
config.setGenerateHtml(true);
|
|
142
|
+
# Images
|
|
143
|
+
embed_images=True, # Base64 in output
|
|
144
|
+
image_format="png",
|
|
261
145
|
|
|
262
|
-
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
//exception during processing
|
|
266
|
-
}
|
|
267
|
-
}
|
|
268
|
-
}
|
|
146
|
+
# Tagged PDF
|
|
147
|
+
use_struct_tree=True, # Use native PDF structure
|
|
148
|
+
)
|
|
269
149
|
```
|
|
270
150
|
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
The full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/)
|
|
151
|
+
[Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
|
|
274
152
|
|
|
275
153
|
<br/>
|
|
276
154
|
|
|
277
|
-
##
|
|
278
|
-
|
|
279
|
-
Download sample PDF
|
|
155
|
+
## AI Safety
|
|
280
156
|
|
|
281
|
-
|
|
282
|
-
curl -L -o 1901.03003.pdf https://arxiv.org/pdf/1901.03003
|
|
283
|
-
```
|
|
157
|
+
PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
|
|
284
158
|
|
|
285
|
-
|
|
159
|
+
- Hidden text (transparent, zero-size)
|
|
160
|
+
- Off-page content
|
|
161
|
+
- Suspicious invisible layers
|
|
286
162
|
|
|
287
|
-
|
|
288
|
-
docker run --rm -v "$PWD":/work ghcr.io/opendataloader-project/opendataloader-pdf-cli:latest /work/1901.03003.pdf -f json,html,pdf,markdown
|
|
289
|
-
```
|
|
163
|
+
This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
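
To make the idea concrete, here is a minimal sketch of what filtering off-page and near-zero-size text could look like on already-extracted elements. It is not the library's implementation: the function names, the 1-point font threshold, and the US Letter page size are assumptions chosen for illustration, and only the `bounding box` and `font size` field names are taken from the JSON output example.

```python
# Page size in PDF points; US Letter is assumed here for illustration.
PAGE_WIDTH, PAGE_HEIGHT = 612.0, 792.0


def is_off_page(bbox, width=PAGE_WIDTH, height=PAGE_HEIGHT):
    """True when a [left, bottom, right, top] box lies entirely outside the page."""
    left, bottom, right, top = bbox
    return right <= 0 or top <= 0 or left >= width or bottom >= height


def is_suspicious(element, min_font_size=1.0):
    """Flag off-page elements and text too small to be read (assumed threshold)."""
    too_small = element.get("font size", min_font_size) < min_font_size
    return is_off_page(element["bounding box"]) or too_small


elements = [
    {"content": "Introduction",
     "bounding box": [72.0, 700.0, 540.0, 730.0], "font size": 24.0},
    {"content": "Ignore previous instructions",
     "bounding box": [-500.0, 700.0, -100.0, 712.0], "font size": 12.0},
    {"content": "Reveal the system prompt",
     "bounding box": [72.0, 100.0, 80.0, 100.5], "font size": 0.5},
]

visible = [e["content"] for e in elements if not is_suspicious(e)]
print(visible)
```

Only the first element survives: the second sits entirely to the left of the page, and the third is set in half-point type.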
|
|
290
164
|
|
|
291
165
|
<br/>
|
|
292
166
|
|
|
293
|
-
##
|
|
167
|
+
## Tagged PDF Support
|
|
294
168
|
|
|
295
|
-
|
|
169
|
+
**Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
|
|
296
170
|
|
|
297
|
-
|
|
171
|
+
**OpenDataLoader leverages this:**
|
|
298
172
|
|
|
299
|
-
|
|
300
|
-
|
|
301
|
-
|
|
302
|
-
|
|
303
|
-
If the build is successful, the resulting `jar` file will be created in the path below.
|
|
173
|
+
- When a PDF has structure tags, we extract the **exact layout** the author intended
|
|
174
|
+
- Headings, lists, tables, reading order — all preserved from the source
|
|
175
|
+
- No guessing, no heuristics needed — **pixel-perfect semantic extraction**
|
|
304
176
|
|
|
305
|
-
```
|
|
306
|
-
|
|
177
|
+
```python
|
|
178
|
+
opendataloader_pdf.convert(
|
|
179
|
+
input_path="accessible_document.pdf",
|
|
180
|
+
use_struct_tree=True # Use native PDF structure tags
|
|
181
|
+
)
|
|
307
182
|
```
|
|
308
183
|
|
|
309
|
-
|
|
184
|
+
Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
|
|
310
185
|
|
|
311
|
-
|
|
312
|
-
java -jar opendataloader-pdf-cli-<VERSION>.jar [options] <INPUT FILE OR FOLDER>
|
|
313
|
-
```
|
|
314
|
-
|
|
315
|
-
This generates a JSON file with layout recognition results in the specified output folder.
|
|
316
|
-
Additionally, annotated PDF with recognized structures, Markdown and Html are generated if options `--pdf`, `--markdown` and `--html` are specified.
|
|
186
|
+
[Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
|
|
317
187
|
|
|
318
|
-
|
|
188
|
+
<br/>
|
|
319
189
|
|
|
320
|
-
|
|
321
|
-
The option `--content-safety-off` disables one or more content safety filters. Accepts a comma-separated list of filter names.
|
|
322
|
-
The option `--markdown-with-html` enables use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
|
|
323
|
-
The option `--markdown-with-images` enables inclusion of image references into the output Markdown.
|
|
324
|
-
The option `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.
|
|
325
|
-
The option `--use-struct-tree` enables processing structure tree (disabled by default).
|
|
326
|
-
The images are extracted from PDF as individual files and stored in a subfolder next to the Markdown output.
|
|
190
|
+
## LangChain Integration
|
|
327
191
|
|
|
328
|
-
|
|
192
|
+
OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
|
|
329
193
|
|
|
330
|
-
```
|
|
331
|
-
|
|
332
|
-
-o,--output-dir <arg> Specifies the output directory for generated files
|
|
333
|
-
-p,--password <arg> Specifies the password for an encrypted PDF
|
|
334
|
-
-f,--format <arg> Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
|
|
335
|
-
-q,--quiet Suppresses console logging output
|
|
336
|
-
--content-safety-off <arg> Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)
|
|
337
|
-
--keep-line-breaks Preserves original line breaks in the extracted text
|
|
338
|
-
--replace-invalid-chars <arg> Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
|
|
339
|
-
--use-struct-tree Enables processing structure tree (disabled by default)
|
|
194
|
+
```bash
|
|
195
|
+
pip install -U langchain-opendataloader-pdf
|
|
340
196
|
```
|
|
341
197
|
|
|
342
|
-
|
|
198
|
+
```python
|
|
199
|
+
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
|
|
343
200
|
|
|
344
|
-
|
|
345
|
-
|
|
346
|
-
|
|
347
|
-
|
|
348
|
-
|
|
349
|
-
|
|
350
|
-
|
|
201
|
+
loader = OpenDataLoaderPDFLoader(
|
|
202
|
+
file_path=["document.pdf"],
|
|
203
|
+
format="text"
|
|
204
|
+
)
|
|
205
|
+
documents = loader.load()
|
|
206
|
+
|
|
207
|
+
# Use with any LangChain pipeline
|
|
208
|
+
for doc in documents:
|
|
209
|
+
print(doc.page_content[:100])
|
|
351
210
|
```
|
|
352
211
|
|
|
353
|
-
|
|
212
|
+
- [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
|
|
213
|
+
- [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
|
|
214
|
+
- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
|
|
354
215
|
|
|
355
|
-
|
|
216
|
+
<br/>
|
|
356
217
|
|
|
357
|
-
|
|
358
|
-
|-------------------|---------|----------|------------------------------------|
|
|
359
|
-
| file name | string | no | Name of processed pdf file |
|
|
360
|
-
| number of pages | integer | no | Number of pages in pdf file |
|
|
361
|
-
| author | string | no | Author of pdf file |
|
|
362
|
-
| title | string | no | Title of pdf file |
|
|
363
|
-
| creation date | string | no | Creation date of pdf file |
|
|
364
|
-
| modification date | string | no | Modification date of pdf file |
|
|
365
|
-
| kids | array | no | Array of detected content elements |
|
|
218
|
+
## Benchmarks
|
|
366
219
|
|
|
367
|
-
|
|
220
|
+
We continuously benchmark against real-world documents.
|
|
368
221
|
|
|
369
|
-
|
|
370
|
-
|--------------|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
|
371
|
-
| id | integer | yes | Unique id of content element |
|
|
372
|
-
| level | string | yes | Level of content element |
|
|
373
|
-
| type | string | no | Type of content element<br/>Possible types: `footer`, `header`, `heading`, `line`, `table`, `table row`, `table cell`, `paragraph`, `list`, `list item`, `image`, `line art`, `caption`, `text block` |
|
|
374
|
-
| page number | integer | no | Page number of content element |
|
|
375
|
-
| bounding box | array | no | Bounding box of content element |
|
|
222
|
+
[View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
|
|
376
223
|
|
|
377
|
-
|
|
224
|
+
### Quick Comparison
|
|
378
225
|
|
|
379
|
-
|
|
|
380
|
-
|
|
381
|
-
|
|
|
382
|
-
|
|
|
383
|
-
|
|
|
384
|
-
|
|
|
226
|
+
| Engine | Accuracy | | Speed (s/page) | | Reading Order | | Table | | Heading | |
|
|
227
|
+
|--------------------|----------|------|----------------|------|---------------|------|----------|------|----------|------|
|
|
228
|
+
| **opendataloader** | 0.82 | #2 | **0.05** | #1 | **0.91** | #1 | 0.49 | #2 | 0.65 | #2 |
|
|
229
|
+
| docling | **0.88** | #1 | 0.73 | #4 | 0.90 | #2 | **0.89** | #1 | **0.80** | #1 |
|
|
230
|
+
| pymupdf4llm | 0.73 | #3 | 0.09 | #2 | 0.89 | #3 | 0.40 | #3 | 0.41 | #3 |
|
|
231
|
+
| markitdown | 0.58 | #4 | **0.04** | #1 | 0.88 | #4 | 0.00 | #4 | 0.00 | #4 |
|
|
385
232
|
|
|
386
|
-
|
|
233
|
+
> Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. **Bold** indicates best performance.
|
|
387
234
|
|
|
388
|
-
|
|
389
|
-
|-------------------|---------|----------|--------------------------------|
|
|
390
|
-
| number of rows | integer | no | Number of table rows |
|
|
391
|
-
| number of columns | integer | no | Number of table columns |
|
|
392
|
-
| rows | array | no | Array of table rows |
|
|
393
|
-
| previous table id | integer | yes | Id of previous connected table |
|
|
394
|
-
| next table id | integer | yes | Id of next connected table |
|
|
235
|
+
### When to Use Each Engine
|
|
395
236
|
|
|
396
|
-
|
|
237
|
+
| Use Case | Recommended Engine | Why |
|
|
238
|
+
|--------------------------|--------------------|--------------------------------------------------------|
|
|
239
|
+
| Best overall balance | **opendataloader** | Fast (0.05s/page) with high reading order accuracy |
|
|
240
|
+
| Maximum accuracy | docling | Highest scores for tables and headings, but 16x slower |
|
|
241
|
+
| Speed-critical pipelines | markitdown | Fastest, but no table/heading extraction |
|
|
242
|
+
| PyMuPDF ecosystem | pymupdf4llm | Good balance if already using PyMuPDF |
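
The trade-offs in this table can be sanity-checked against the rounded figures in the Quick Comparison table. The sketch below copies the seconds-per-page values by hand from that table and computes throughput and each engine's slowdown relative to opendataloader. With these rounded inputs docling comes out at about 14.6x, so the "16x" quoted above presumably reflects unrounded measurements.

```python
# Seconds per page, copied from the Quick Comparison table (rounded values).
SECONDS_PER_PAGE = {
    "opendataloader": 0.05,
    "docling": 0.73,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}


def pages_per_second(seconds_per_page):
    """Convert a per-page latency into throughput."""
    return 1.0 / seconds_per_page


def slowdown(engine, baseline="opendataloader"):
    """How many times slower `engine` is than `baseline` (below 1 means faster)."""
    return SECONDS_PER_PAGE[engine] / SECONDS_PER_PAGE[baseline]


for name, spp in SECONDS_PER_PAGE.items():
    print(f"{name:15s} {pages_per_second(spp):6.1f} pages/s  {slowdown(name):5.1f}x")
```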
|
|
397
243
|
|
|
398
|
-
|
|
399
|
-
|------------|---------|----------|----------------------|
|
|
400
|
-
| row number | integer | no | Number of table row |
|
|
401
|
-
| cells | array | no | Array of table cells |
|
|
244
|
+
### Visual Comparison
|
|
402
245
|
|
|
403
|
-
|
|
246
|
+
[Benchmark comparison chart](https://github.com/opendataloader-project/opendataloader-bench)
|
|
404
247
|
|
|
405
|
-
| Field | Type | Optional | Description |
|
|
406
|
-
|---------------|---------|----------|--------------------------------------|
|
|
407
|
-
| row number | integer | no | Row number of table cell |
|
|
408
|
-
| column number | integer | no | Column number of table cell |
|
|
409
|
-
| row span | integer | no | Row span of table cell |
|
|
410
|
-
| column span | integer | no | Column span of table cell |
|
|
411
|
-
| kids | array | no | Array of table cell content elements |
|
|
412
248
|
|
|
413
|
-
|
|
249
|
+
<br/>
|
|
414
250
|
|
|
415
|
-
|
|
416
|
-
|---------------|---------|----------|--------------------------|
|
|
417
|
-
| heading level | integer | no | Heading level of heading |
|
|
251
|
+
## Roadmap
|
|
418
252
|
|
|
419
|
-
|
|
253
|
+
See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
|
|
420
254
|
|
|
421
|
-
|
|
422
|
-
|----------------------|---------|----------|-------------------------------------|
|
|
423
|
-
| number of list items | integer | no | Number of list items |
|
|
424
|
-
| numbering style | string | no | Numbering style of this list |
|
|
425
|
-
| previous list id | integer | yes | Id of previous connected list |
|
|
426
|
-
| next list id | integer | yes | Id of next connected list |
|
|
427
|
-
| list items | array | no | Array of list item content elements |
|
|
255
|
+
<br/>
|
|
428
256
|
|
|
429
|
-
|
|
257
|
+
## Documentation
|
|
430
258
|
|
|
431
|
-
|
|
432
|
-
|
|
433
|
-
|
|
259
|
+
- [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
|
|
260
|
+
- [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
|
|
261
|
+
- [CLI Options](https://opendataloader.org/docs/cli-options-reference)
|
|
262
|
+
- [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
|
|
263
|
+
- [AI Safety Features](https://opendataloader.org/docs/ai-safety)
|
|
434
264
|
|
|
435
|
-
|
|
265
|
+
<br/>
|
|
436
266
|
|
|
437
|
-
|
|
438
|
-
|-------|-------|----------|-----------------------------------------|
|
|
439
|
-
| kids | array | no | Array of header/footer content elements |
|
|
267
|
+
## Frequently Asked Questions
|
|
440
268
|
|
|
441
|
-
|
|
269
|
+
### What is the best PDF parser for RAG?
|
|
442
270
|
|
|
443
|
-
|
|
444
|
-
|-------|-------|----------|--------------------------------------|
|
|
445
|
-
| kids | array | no | Array of text block content elements |
|
|
271
|
+
For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
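
For readers unfamiliar with the idea behind reading-order recovery, the sketch below implements the classic recursive XY-cut: split the page along its widest whitespace gap, then recurse into each part. This is the textbook technique only, not XY-Cut++ and not OpenDataLoader's code. The function names, the `min_gap` threshold, and the sample boxes are assumptions; boxes use the `[left, bottom, right, top]` order with a label appended.

```python
def largest_gap(boxes, lo, hi):
    """Widest empty interval in the projection of boxes onto one axis.

    `lo` and `hi` index the interval start and end inside each box.
    Returns (gap_size, cut_position), or (0.0, None) when there is no gap.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_size, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best_size:
            best_size, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_size, best_cut


def xy_cut(boxes, min_gap=5.0):
    """Order (left, bottom, right, top, label) boxes with a plain XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    x_size, x_cut = largest_gap(boxes, 0, 2)
    y_size, y_cut = largest_gap(boxes, 1, 3)
    if max(x_size, y_size) < min_gap:
        # No usable whitespace: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if x_size > y_size:
        first = [b for b in boxes if b[2] <= x_cut]   # left part
        second = [b for b in boxes if b[0] >= x_cut]  # right part
    else:
        first = [b for b in boxes if b[1] >= y_cut]   # upper part, read first
        second = [b for b in boxes if b[3] <= y_cut]  # lower part
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A full-width title above two columns, deliberately listed out of order.
BOXES = [
    (316, 300, 540, 510, "R2"),
    (72, 500, 296, 680, "L1"),
    (72, 700, 540, 730, "title"),
    (316, 520, 540, 680, "R1"),
    (72, 300, 296, 490, "L2"),
]

order = [b[4] for b in xy_cut(BOXES)]
naive = [b[4] for b in sorted(BOXES, key=lambda b: (-b[3], b[0]))]
print(order)
print(naive)
```

On this layout the cut-based order reads the whole left column before the right one, while the naive top-to-bottom sort interleaves the two columns.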
|
|
446
272
|
|
|
273
|
+
### How do I extract tables from PDF for LLM?
|
|
447
274
|
|
|
448
|
-
|
|
275
|
+
OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
|
|
449
276
|
|
|
450
|
-
|
|
277
|
+
### Can I use this without sending data to the cloud?
|
|
451
278
|
|
|
452
|
-
|
|
279
|
+
Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
|
|
453
280
|
|
|
454
|
-
|
|
281
|
+
### What makes OpenDataLoader unique?
|
|
455
282
|
|
|
456
|
-
|
|
457
|
-
Have questions or need a little help? We're here for you!🤗
|
|
283
|
+
OpenDataLoader takes a different approach from many PDF parsers:
|
|
458
284
|
|
|
459
|
-
-
|
|
460
|
-
-
|
|
285
|
+
- **Rule-based extraction** — Deterministic output without GPU requirements
|
|
286
|
+
- **Bounding boxes for all elements** — Essential for citation systems
|
|
287
|
+
- **XY-Cut++ reading order** — Handles multi-column layouts correctly
|
|
288
|
+
- **Built-in AI safety filters** — Protects against prompt injection
|
|
289
|
+
- **Native Tagged PDF support** — Leverages accessibility metadata
|
|
461
290
|
|
|
462
|
-
|
|
291
|
+
This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
|
|
463
292
|
|
|
464
|
-
|
|
293
|
+
<br/>
|
|
465
294
|
|
|
466
|
-
|
|
295
|
+
## Contributing
|
|
467
296
|
|
|
468
|
-
|
|
297
|
+
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
|
|
469
298
|
|
|
470
|
-
|
|
471
|
-
- **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
|
|
472
|
-
- **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company’s specific policies.
|
|
299
|
+
<br/>
|
|
473
300
|
|
|
474
|
-
##
|
|
301
|
+
## License
|
|
475
302
|
|
|
476
|
-
|
|
303
|
+
[Mozilla Public License 2.0](LICENSE)
|
|
477
304
|
|
|
478
|
-
|
|
305
|
+
---
|
|
479
306
|
|
|
480
|
-
|
|
481
|
-
- [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)
|
|
482
|
-
- [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)
|
|
483
|
-
- [licenses/](./THIRD_PARTY/licenses/)
|
|
307
|
+
**Found this useful?** Give us a star to help others discover OpenDataLoader.
|