@opendataloader/pdf 1.3.0 → 1.4.2

package/README.md CHANGED
@@ -1,483 +1,307 @@
1
1
  # OpenDataLoader PDF
2
2
 
3
+ **PDF to Markdown & JSON for RAG** — Fast, Local, No GPU Required
3
4
 
4
5
  [![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
5
- ![Java](https://img.shields.io/badge/Java-11+-blue.svg)
6
- ![Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
7
- [![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
8
6
  [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
9
7
  [![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
10
- [![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
11
- [![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
12
- [![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)
13
-
14
- <br/>
15
-
16
- **Safe, Open, High-Performance — PDF for AI**
17
-
18
- OpenDataLoader-PDF converts PDFs into JSON, Markdown or Html — ready to feed into modern AI stacks (LLMs, vector search, and RAG).
19
-
20
- It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
21
- Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
22
- AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
23
-
24
- <br/>
25
-
26
- ## 🌟 Key Features
27
-
28
- - 🧾 **Rich, Structured Output** — JSON, Markdown or Html
29
- - 🧩 **Layout Reconstruction** — Headings, Lists, Tables, Images, Reading Order
30
- - ⚡ **Fast & Lightweight** — Rule-Based Heuristic, High-Throughput, No GPU
31
- - 🔒 **Local-First Privacy** — Runs fully on your machine
32
- - 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content - [Learn more](https://opendataloader.org/docs/ai-safety)
33
- - 🏷️ **Tagged PDF** — Advanced data extraction technology based on Tagged PDF - [Learn more](https://opendataloader.org/docs/tagged-pdf)
34
- - 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original
35
-
36
- [Download Annotated PDF Sample](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/1901.03003_annotated.pdf)
37
-
38
- ![Annotated PDF Preview](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/example_annotated_pdf.png)
39
-
40
- <br/>
41
-
42
- ## 🚀 Upcoming Features
43
-
44
- **Scheduled for November**
45
- - ⚡ **Performance Improvement** — Enhance the inference skill for greater accuracy and speed.
46
- - 📊 **Benchmarks & Datasets** — Publish transparent evaluations using open datasets and standardized metrics.
47
- - 🎯 **Metrics** — Publish the calculation methods to transparently share benchmark results.
48
- <br/>
49
-
50
- **Scheduled for December**
51
- - 🖨️ **OCR for scanned PDFs** — Extract data from image-only pages.
52
- - 🧠 **Table AI option** — Higher accuracy for tables with borderless or merged cells.
53
- <br/>
54
-
55
- **Scheduled for 2026**
56
- - 🛡️ **AI Red Teaming** — Transparent adversarial benchmarks with datasets and metrics, then reported regularly.
57
- <br/>
58
-
59
- ## Prerequisites
60
-
61
- - Java 11 or higher must be installed and available in your system's PATH.
62
- - Python 3.9+
63
-
64
- <br/>
8
+ [![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
9
+ [![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
10
+ [![Java](https://img.shields.io/badge/Java-11%2B-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf#java)
65
11
 
66
- ## Python
12
+ Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
67
13
 
68
- ### Installation
14
+ **Why developers choose OpenDataLoader:**
15
+ - **Deterministic** — The same input always produces the same output (no LLM hallucinations)
16
+ - **Fast** — Process 100+ pages per second on CPU
17
+ - **Private** — 100% local, zero data transmission
18
+ - **Accurate** — Bounding boxes for every element, correct multi-column reading order
69
19
 
70
- ```sh
20
+ ```bash
71
21
  pip install -U opendataloader-pdf
72
22
  ```
73
23
 
74
- ### Usage
75
-
76
- input_path can be either the path to a single document or the path to a folder.
77
-
78
24
  ```python
79
25
  import opendataloader_pdf
80
26
 
27
+ # PDF to Markdown for RAG
81
28
  opendataloader_pdf.convert(
82
- input_path=["path/to/document.pdf", "path/to/folder"],
83
- output_dir="path/to/output",
84
- format="json,html,pdf,markdown"
29
+ input_path="document.pdf",
30
+ output_dir="output/",
31
+ format="markdown,json"
85
32
  )
86
33
  ```
87
34
 
88
- If you want to run it via CLI, you can use the following command on the terminal:
89
-
90
- ```bash
91
- opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json,html,pdf,markdown
92
- ```
93
-
94
- ### Function: convert()
95
-
96
- The main function to process PDFs.
97
-
98
- | Parameter | Type | Required | Default | Description |
99
- |-------------------------|-----------------------| -------- |--------------|------------------------------------------------------------------------------------------------------------------------------------------|
100
- | `input_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
101
- | `output_dir` | `Optional[str]` | No | input folder | Directory where outputs are written. |
102
- | `password` | `Optional[str]` | No | `None` | Password used for encrypted PDFs. |
103
- | `format` | `Optional[Union[str, List[str]]]` | No | `None` | Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images) |
104
- | `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
105
- | `content_safety_off` | `Optional[Union[str, List[str]]]` | No | `None` | Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg) |
106
- | `keep_line_breaks` | `bool` | No | `False` | Preserves line breaks in text output when `True`. |
107
- | `replace_invalid_chars` | `Optional[str]` | No | `None` | Replacement character for invalid or unrecognized characters (e.g., �, `\u0000`). |
108
- | `use_struct_tree` | `bool ` | No | `False` | Enable processing structure tree (disabled by default). |
109
-
110
- ### Function: run()
111
-
112
- Deprecated.
113
-
114
35
  <br/>
115
36
 
116
- ## Node.js / NPM
117
-
118
- **Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.
37
+ ## Why OpenDataLoader?
119
38
 
120
- ### Prerequisites
39
+ Building RAG pipelines? You've probably hit these problems:
121
40
 
122
- - Java 11 or higher must be installed and available in your system's PATH.
41
+ | Problem | How We Solve It |
42
+ |---------|-----------------|
43
+ | **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
44
+ | **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
45
+ | **Headers/footers pollute context** | Auto-filtered before output |
46
+ | **No coordinates for citations** | Bounding box for every element |
47
+ | **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
48
+ | **GPU required** | Pure CPU, rule-based — runs anywhere |
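
The reading-order row in the table above can be illustrated with a minimal sketch of a classic recursive XY-cut over bounding boxes. This is not the XY-Cut++ variant the README names, and it is not OpenDataLoader's implementation. The function names (`xy_cut`, `_widest_gap`) and the rule "cut at the widest gap, prefer a vertical cut on ties" are assumptions chosen for illustration. Boxes use the `[left, bottom, right, top]` convention from the JSON field table.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Optional[Tuple[float, float]]:
    """Return (gap_width, cut_position) for the widest empty band on one axis."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    reach = ordered[0][hi]  # farthest extent covered so far
    best = None
    for box in ordered[1:]:
        if box[lo] > reach:  # nothing overlaps the band (reach, box[lo])
            gap = box[lo] - reach
            if best is None or gap > best[0]:
                best = (gap, (reach + box[lo]) / 2)
        reach = max(reach, box[hi])
    return best


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Order boxes by recursively cutting the page at its widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap = _widest_gap(boxes, 0, 2)  # vertical cut, separates columns
    y_gap = _widest_gap(boxes, 1, 3)  # horizontal cut, separates rows
    if x_gap is None and y_gap is None:
        # No clean cut: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]
        left = [b for b in boxes if b[2] <= cut]
        right = [b for b in boxes if b[0] >= cut]
        return xy_cut(left) + xy_cut(right)
    cut = y_gap[1]
    upper = [b for b in boxes if b[1] >= cut]
    lower = [b for b in boxes if b[3] <= cut]
    return xy_cut(upper) + xy_cut(lower)  # y grows upward, so upper reads first


# Hypothetical page: a full-width title above two columns of two paragraphs each.
title = (72.0, 710.0, 540.0, 740.0)
a = (72.0, 400.0, 290.0, 680.0)   # left column, first paragraph
b = (72.0, 100.0, 290.0, 380.0)   # left column, second paragraph
c = (310.0, 400.0, 540.0, 680.0)  # right column, first paragraph
d = (310.0, 100.0, 540.0, 380.0)  # right column, second paragraph

ordered = xy_cut([d, title, b, c, a])
```

A plain top-to-bottom sort would interleave the two columns (a, c, b, d). The recursive cut peels the title off first, then finishes the left column before starting the right one.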
123
49
 
124
- ### Installation
125
-
126
- ```sh
127
- npm install @opendataloader/pdf
128
- ```
50
+ <br/>
129
51
 
130
- ### Usage
52
+ ## Key Features
131
53
 
132
- `inputPath` can be either the path to a single document or the path to a folder.
54
+ ### For RAG & LLM Pipelines
133
55
 
134
- ```typescript
135
- import { convert } from '@opendataloader/pdf';
56
+ - **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)
57
+ - **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
58
+ - **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly
59
+ - **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed
60
+ - **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
136
61
 
137
- async function main() {
138
- try {
139
- await convert(['path/to/document.pdf', 'path/to/folder'], {
140
- outputDir: 'path/to/output',
141
- format: 'json,html,pdf,markdown',
142
- });
143
- console.log('convert() complete');
144
- } catch (error) {
145
- console.error('Error processing PDF:', error);
146
- }
147
- }
62
+ ### Performance & Privacy
148
63
 
149
- main();
150
- ```
151
- ### Function: convert()
64
+ - **No GPU** — Fast, rule-based heuristics
65
+ - **Local-First** — Your documents never leave your machine
66
+ - **High Throughput** — Process thousands of PDFs efficiently
67
+ - **Multi-Language SDK** — Python, Node.js, Java, Docker
152
68
 
153
- `convert(inputPaths: string[], options?: ConvertOptions): Promise<string>`
69
+ ### Document Understanding
154
70
 
155
- Multi-input helper matching the Python wrapper.
71
+ - **Tables** — Detects borders, handles merged cells
72
+ - **Lists** — Numbered, bulleted, nested
73
+ - **Headings** — Auto-detects hierarchy levels
74
+ - **Images** — Extracts images and links them to their captions
75
+ - **Tagged PDF Support** — Uses native PDF structure when available
76
+ - **AI Safety** — Auto-filters prompt injection content
156
77
 
157
- | Property | Type | Default | Description |
158
- |--------------------------------| ---------- | ----------- |------------------------------------------------------------------------------------------------------------------------------|
159
- | `inputPaths` | `string[]` | — | One or more file paths or directories to process. |
160
- | `options.outputDir` | `string` | `undefined` | Directory where outputs are written. |
161
- | `options.password` | `string` | `undefined` | Password for encrypted PDFs. |
162
- | `options.format` | `string \| string[]` | `undefined` | Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images) |
163
- | `options.quiet` | `boolean` | `false` | Suppress CLI logging output and prevent streaming. |
164
- | `options.contentSafetyOff` | `string \| string[]` | `undefined` | Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg) |
165
- | `options.keepLineBreaks` | `boolean` | `false` | Preserve line breaks in text output. |
166
- | `options.replaceInvalidChars` | `string` | `undefined` | Replacement character for invalid or unrecognized characters. |
167
- | `options.useStructTree` | `boolean` | `false` | Enable processing structure tree (disabled by default). |
78
+ <br/>
168
79
 
169
- ### Function: run()
80
+ ## Output Formats
170
81
 
171
- Deprecated.
82
+ | Format | Use Case |
83
+ |--------|----------|
84
+ | **JSON** | Structured data with bounding boxes, semantic types |
85
+ | **Markdown** | Clean text for LLM context, RAG chunks |
86
+ | **HTML** | Web display with styling |
87
+ | **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
172
88
 
173
- ### CLI
89
+ <br/>
174
90
 
175
- ```bash
176
- npx @opendataloader/pdf path/to/document.pdf path/to/folder -o path/to/output -f json,html,pdf,markdown
91
+ ## JSON Output Example
92
+
93
+ ```json
94
+ {
95
+ "type": "heading",
96
+ "id": 42,
97
+ "level": "Title",
98
+ "page number": 1,
99
+ "bounding box": [72.0, 700.0, 540.0, 730.0],
100
+ "heading level": 1,
101
+ "font": "Helvetica-Bold",
102
+ "font size": 24.0,
103
+ "text color": "[0.0]",
104
+ "content": "Introduction"
105
+ }
177
106
  ```
178
107
 
179
- #### Available options
108
+ | Field | Description |
109
+ |-------|-------------|
110
+ | `type` | Element type: heading, paragraph, table, list, image, caption |
111
+ | `id` | Unique identifier for cross-referencing |
112
+ | `page number` | 1-indexed page reference |
113
+ | `bounding box` | `[left, bottom, right, top]` in PDF points |
114
+ | `heading level` | Heading depth (1+) |
115
+ | `font`, `font size` | Typography info |
116
+ | `content` | Extracted text |
180
117
 
181
- ```
182
- -o, --output-dir <path> Directory where outputs are written
183
- -p, --password <password> Password for encrypted PDFs
184
- -f, --format <values> Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
185
- -q, --quiet Suppress CLI logging output
186
- --content-safety-off <modes> Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)
187
- --keep-line-breaks Preserve line breaks in text output
188
- --replace-invalid-chars <c> Replacement character for invalid or unrecognized characters
189
- -h, --help Show usage information
190
- --use-struct-tree Enable processing structure tree (disabled by default)
191
- ```
118
+ [Full JSON Schema →](https://opendataloader.org/docs/json-schema)
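
As a small illustration of how the fields above might be consumed, the sketch below parses the example element, converts its bounding box from PDF coordinates (origin at the bottom-left) to a top-left rectangle, and builds a citation label. The helper names (`to_top_left`, `citation`) and the 792-point page height are assumptions for this example; they are not part of the package.

```python
import json

# The element from the JSON Output Example, trimmed to the fields used here.
raw = """
{
  "type": "heading",
  "id": 42,
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "content": "Introduction"
}
"""


def to_top_left(bbox, page_height):
    """Convert [left, bottom, right, top] to (x, y, width, height) with a top-left origin."""
    left, bottom, right, top = bbox
    return (left, page_height - top, right - left, top - bottom)


def citation(element):
    """Build a human-readable source pointer from page number and bounding box."""
    left, bottom, right, top = element["bounding box"]
    return (
        f'{element["content"]} '
        f'(p. {element["page number"]}, box {left:g},{bottom:g},{right:g},{top:g})'
    )


element = json.loads(raw)
PAGE_HEIGHT = 792.0  # assumption: US Letter page, 612 x 792 points
rect = to_top_left(element["bounding box"], PAGE_HEIGHT)
label = citation(element)
```

The top-left form is what most image and canvas APIs expect when highlighting a cited region on a rendered page.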
192
119
 
193
120
  <br/>
194
121
 
195
- ## Java
196
-
197
- For various example templates, including Gradle and Maven, please refer to [Examples](https://github.com/opendataloader-project/opendataloader-pdf-examples).
198
-
199
- ### Dependency
200
-
201
- To include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.
202
-
203
- Check for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).
204
-
205
- ```xml
206
- <project>
207
- <!-- other configurations... -->
208
-
209
- <dependencies>
210
- <dependency>
211
- <groupId>org.opendataloader</groupId>
212
- <artifactId>opendataloader-pdf-core</artifactId>
213
- <version>1.3.0</version>
214
- </dependency>
215
- </dependencies>
216
-
217
- <repositories>
218
- <repository>
219
- <snapshots>
220
- <enabled>true</enabled>
221
- </snapshots>
222
- <id>vera-dev</id>
223
- <name>Vera development</name>
224
- <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
225
- </repository>
226
- </repositories>
227
- <pluginRepositories>
228
- <pluginRepository>
229
- <snapshots>
230
- <enabled>false</enabled>
231
- </snapshots>
232
- <id>vera-dev</id>
233
- <name>Vera development</name>
234
- <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
235
- </pluginRepository>
236
- </pluginRepositories>
237
-
238
- <!-- other configurations... -->
239
- </project>
240
- ```
241
-
122
+ ## Quick Start
242
123
 
243
- ### Java code integration
124
+ - [Python](https://opendataloader.org/docs/quick-start-python)
125
+ - [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
126
+ - [Docker](https://opendataloader.org/docs/quick-start-docker)
127
+ - [Java](https://opendataloader.org/docs/quick-start-java)
244
128
 
245
- To integrate Layout recognition API into Java code, one can follow the sample code below.
129
+ <br/>
246
130
 
247
- ```java
248
- import org.opendataloader.pdf.api.Config;
249
- import org.opendataloader.pdf.api.OpenDataLoaderPDF;
131
+ ## Advanced Options
250
132
 
251
- import java.io.IOException;
133
+ ```python
134
+ opendataloader_pdf.convert(
135
+ input_path="document.pdf",
136
+ output_dir="output/",
137
+ format="json,markdown,pdf",
252
138
 
253
- public class Sample {
139
+ # Reading order
140
+ reading_order="xycut", # XY-Cut++ for multi-column
254
141
 
255
- public static void main(String[] args) {
256
- Config config = new Config();
257
- config.setOutputFolder("path/to/output");
258
- config.setGeneratePDF(true);
259
- config.setGenerateMarkdown(true);
260
- config.setGenerateHtml(true);
142
+ # Images
143
+ embed_images=True, # Base64 in output
144
+ image_format="png",
261
145
 
262
- try {
263
- OpenDataLoaderPDF.processFile("path/to/document.pdf", config);
264
- } catch (Exception exception) {
265
- //exception during processing
266
- }
267
- }
268
- }
146
+ # Tagged PDF
147
+ use_struct_tree=True, # Use native PDF structure
148
+ )
269
149
  ```
270
150
 
271
- ### API Documentation
272
-
273
- The full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/)
151
+ [Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
274
152
 
275
153
  <br/>
276
154
 
277
- ## Docker
278
-
279
- Download sample PDF
155
+ ## AI Safety
280
156
 
281
- ```sh
282
- curl -L -o 1901.03003.pdf https://arxiv.org/pdf/1901.03003
283
- ```
157
+ PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
284
158
 
285
- Run opendataloader-pdf in Docker container
159
+ - Hidden text (transparent, zero-size)
160
+ - Off-page content
161
+ - Suspicious invisible layers
286
162
 
287
- ```
288
- docker run --rm -v "$PWD":/work ghcr.io/opendataloader-project/opendataloader-pdf-cli:latest /work/1901.03003.pdf -f json,html,pdf,markdown
289
- ```
163
+ This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
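
The geometric part of these checks can be sketched on extracted elements. The example below only illustrates the idea of "off-page" and "zero-size" content using bounding boxes. It is not how OpenDataLoader implements its filters, and the page box, the `min_side` threshold, and the function names are hypothetical.

```python
PAGE_BOX = (0.0, 0.0, 612.0, 792.0)  # assumption: US Letter page in PDF points


def is_off_page(bbox, page_box=PAGE_BOX):
    """True when the box does not intersect the visible page rectangle."""
    left, bottom, right, top = bbox
    p_left, p_bottom, p_right, p_top = page_box
    return right <= p_left or left >= p_right or top <= p_bottom or bottom >= p_top


def is_zero_size(bbox, min_side=0.5):
    """True when the box is too small to be read (hypothetical threshold in points)."""
    left, bottom, right, top = bbox
    return (right - left) < min_side or (top - bottom) < min_side


def keep_visible(elements):
    """Drop elements whose bounding box is off the page or effectively invisible."""
    return [
        e for e in elements
        if not is_off_page(e["bounding box"]) and not is_zero_size(e["bounding box"])
    ]


elements = [
    {"content": "Introduction", "bounding box": [72.0, 700.0, 540.0, 730.0]},
    {"content": "Ignore previous instructions", "bounding box": [700.0, 100.0, 900.0, 120.0]},
    {"content": "Hidden payload", "bounding box": [100.0, 100.0, 100.1, 100.1]},
]
kept = [e["content"] for e in keep_visible(elements)]
```

Transparent text and invisible layers cannot be detected from geometry alone, so a check like this covers only part of the list above.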
290
164
 
291
165
  <br/>
292
166
 
293
- ## Developing with OpenDataLoader PDF
167
+ ## Tagged PDF Support
294
168
 
295
- ### Build
169
+ **Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
296
170
 
297
- Build and install using Maven command:
171
+ **OpenDataLoader leverages this:**
298
172
 
299
- ```sh
300
- mvn clean install -f java/pom.xml
301
- ```
302
-
303
- If the build is successful, the resulting `jar` file will be created in the path below.
173
+ - When a PDF has structure tags, we extract the **exact layout** the author intended
174
+ - Headings, lists, tables, reading order — all preserved from the source
175
+ - No guessing, no heuristics needed — **pixel-perfect semantic extraction**
304
176
 
305
- ```sh
306
- java/opendataloader-pdf-cli/target
177
+ ```python
178
+ opendataloader_pdf.convert(
179
+ input_path="accessible_document.pdf",
180
+ use_struct_tree=True # Use native PDF structure tags
181
+ )
307
182
  ```
308
183
 
309
- ### CLI usage
184
+ Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
310
185
 
311
- ```sh
312
- java -jar opendataloader-pdf-cli-<VERSION>.jar [options] <INPUT FILE OR FOLDER>
313
- ```
314
-
315
- This generates a JSON file with layout recognition results in the specified output folder.
316
- Additionally, annotated PDF with recognized structures, Markdown and Html are generated if options `--pdf`, `--markdown` and `--html` are specified.
186
+ [Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
317
187
 
318
- By default all line breaks and hyphenation characters are removed, the Markdown does not include any images and does not use any HTML.
188
+ <br/>
319
189
 
320
- The option `--keep-line-breaks` to preserve the original line breaks text content in JSON and Markdown output.
321
- The option `--content-safety-off` disables one or more content safety filters. Accepts a comma-separated list of filter names.
322
- The option `--markdown-with-html` enables use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
323
- The option `--markdown-with-images` enables inclusion of image references into the output Markdown.
324
- The option `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.
325
- The option `--use-struct-tree` enables processing structure tree (disabled by default).
326
- The images are extracted from PDF as individual files and stored in a subfolder next to the Markdown output.
190
+ ## LangChain Integration
327
191
 
328
- #### Available options:
192
+ OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
329
193
 
330
- ```
331
- Options:
332
- -o,--output-dir <arg> Specifies the output directory for generated files
333
- -p,--password <arg> Specifies the password for an encrypted PDF
334
- -f,--format <arg> Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
335
- -q,--quiet Suppresses console logging output
336
- --content-safety-off <arg> Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)
337
- --keep-line-breaks Preserves original line breaks in the extracted text
338
- --replace-invalid-chars <arg> Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
339
- --use-struct-tree Enables processing structure tree (disabled by default)
194
+ ```bash
195
+ pip install -U langchain-opendataloader-pdf
340
196
  ```
341
197
 
342
- The legacy options (for backward compatibility):
198
+ ```python
199
+ from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
343
200
 
344
- ```
345
- --no-json Disables the JSON output format
346
- --html Sets the data extraction output format to HTML
347
- --pdf Generates a new PDF file where the extracted layout data is visualized as annotations
348
- --markdown Sets the data extraction output format to Markdown
349
- --markdown-with-html Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
350
- --markdown-with-images Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
201
+ loader = OpenDataLoaderPDFLoader(
202
+ file_path=["document.pdf"],
203
+ format="text"
204
+ )
205
+ documents = loader.load()
206
+
207
+ # Use with any LangChain pipeline
208
+ for doc in documents:
209
+ print(doc.page_content[:100])
351
210
  ```
352
211
 
353
- ### Schema of the JSON output
212
+ - [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
213
+ - [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
214
+ - [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
354
215
 
355
- Root json node
216
+ <br/>
356
217
 
357
- | Field | Type | Optional | Description |
358
- |-------------------|---------|----------|------------------------------------|
359
- | file name | string | no | Name of processed pdf file |
360
- | number of pages | integer | no | Number of pages in pdf file |
361
- | author | string | no | Author of pdf file |
362
- | title | string | no | Title of pdf file |
363
- | creation date | string | no | Creation date of pdf file |
364
- | modification date | string | no | Modification date of pdf file |
365
- | kids | array | no | Array of detected content elements |
218
+ ## Benchmarks
366
219
 
367
- Common fields of content json nodes
220
+ We continuously benchmark against real-world documents.
368
221
 
369
- | Field | Type | Optional | Description |
370
- |--------------|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
371
- | id | integer | yes | Unique id of content element |
372
- | level | string | yes | Level of content element |
373
- | type | string | no | Type of content element<br/>Possible types: `footer`, `header`, `heading`, `line`, `table`, `table row`, `table cell`, `paragraph`, `list`, `list item`, `image`, `line art`, `caption`, `text block` |
374
- | page number | integer | no | Page number of content element |
375
- | bounding box | array | no | Bounding box of content element |
222
+ [View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
376
223
 
377
- Specific fields of text content json nodes (`caption`, `heading`, `paragraph`)
224
+ ### Quick Comparison
378
225
 
379
- | Field | Type | Optional | Description |
380
- |------------|--------|----------|-------------------|
381
- | font | string | no | Font name of text |
382
- | font size | double | no | Font size of text |
383
- | text color | array | no | Color of text |
384
- | content | string | no | Text value |
226
+ | Engine             | Accuracy | Rank | Speed (s/page) | Rank | Reading Order | Rank | Table    | Rank | Heading  | Rank |
227
+ |--------------------|----------|------|----------------|------|---------------|------|----------|------|----------|------|
228
+ | **opendataloader** | 0.82 | #2 | **0.05** | #1 | **0.91** | #1 | 0.49 | #2 | 0.65 | #2 |
229
+ | docling | **0.88** | #1 | 0.73 | #4 | 0.90 | #2 | **0.89** | #1 | **0.80** | #1 |
230
+ | pymupdf4llm | 0.73 | #3 | 0.09 | #2 | 0.89 | #3 | 0.40 | #3 | 0.41 | #3 |
231
+ | markitdown | 0.58 | #4 | **0.04** | #1 | 0.88 | #4 | 0.00 | #4 | 0.00 | #4 |
385
232
 
386
- Specific fields of `table` json nodes
233
+ > Accuracy, reading order, table, and heading scores are normalized to [0, 1], and higher is better. Speed is reported in seconds per page, and lower is better. **Bold** indicates best performance.
387
234
 
388
- | Field | Type | Optional | Description |
389
- |-------------------|---------|----------|--------------------------------|
390
- | number of rows | integer | no | Number of table rows |
391
- | number of columns | integer | no | Number of table columns |
392
- | rows | array | no | Array of table rows |
393
- | previous table id | integer | yes | Id of previous connected table |
394
- | next table id | integer | yes | Id of next connected table |
235
+ ### When to Use Each Engine
395
236
 
396
- Specific fields of `table row` json nodes
237
+ | Use Case | Recommended Engine | Why |
238
+ |--------------------------|--------------------|--------------------------------------------------------|
239
+ | Best overall balance | **opendataloader** | Fast (0.05s/page) with high reading order accuracy |
240
+ | Maximum accuracy | docling | Highest scores for tables and headings, but 16x slower |
241
+ | Speed-critical pipelines | markitdown | Fastest, but no table/heading extraction |
242
+ | PyMuPDF ecosystem | pymupdf4llm | Good balance if already using PyMuPDF |
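
To relate the speed column to throughput and relative slowdown, the sketch below recomputes both from the rounded seconds-per-page values in the Quick Comparison table. The variable and function names are illustrative. Because the inputs are rounded, the results are approximate: the docling ratio comes out near 14.6x here, so the 16x figure in the table above may reflect unrounded timings.

```python
# Seconds per page, copied from the Quick Comparison table (rounded values).
seconds_per_page = {
    "opendataloader": 0.05,
    "docling": 0.73,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}


def slowdown(engine, baseline="opendataloader"):
    """How many times slower `engine` is than `baseline`, per page."""
    return seconds_per_page[engine] / seconds_per_page[baseline]


# Pages per second is the reciprocal of seconds per page.
throughput = {name: round(1.0 / s, 1) for name, s in seconds_per_page.items()}
docling_ratio = round(slowdown("docling"), 1)
```

On these figures, 0.05 s/page corresponds to about 20 pages per second for a single run.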
397
243
 
398
- | Field | Type | Optional | Description |
399
- |------------|---------|----------|----------------------|
400
- | row number | integer | no | Number of table row |
401
- | cells | array | no | Array of table cells |
244
+ ### Visual Comparison
402
245
 
403
- Specific fields of `table cell` json nodes
246
+ [![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
404
247
 
405
- | Field | Type | Optional | Description |
406
- |---------------|---------|----------|--------------------------------------|
407
- | row number | integer | no | Row number of table cell |
408
- | column number | integer | no | Column number of table cell |
409
- | row span | integer | no | Row span of table cell |
410
- | column span | integer | no | Column span of table cell |
411
- | kids | array | no | Array of table cell content elements |
412
248
 
413
- Specific fields of `heading` json nodes
249
+ <br/>
414
250
 
415
- | Field | Type | Optional | Description |
416
- |---------------|---------|----------|--------------------------|
417
- | heading level | integer | no | Heading level of heading |
251
+ ## Roadmap
418
252
 
419
- Specific fields of `list` json nodes
253
+ See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
420
254
 
421
- | Field | Type | Optional | Description |
422
- |----------------------|---------|----------|-------------------------------------|
423
- | number of list items | integer | no | Number of list items |
424
- | numbering style | string | no | Numbering style of this list |
425
- | previous list id | integer | yes | Id of previous connected list |
426
- | next list id | integer | yes | Id of next connected list |
427
- | list items | array | no | Array of list item content elements |
255
+ <br/>
428
256
 
429
- Specific fields of `list item` json nodes
257
+ ## Documentation
430
258
 
431
- | Field | Type | Optional | Description |
432
- |-------|-------|----------|-------------------------------------|
433
- | kids | array | no | Array of list item content elements |
259
+ - [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
260
+ - [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
261
+ - [CLI Options](https://opendataloader.org/docs/cli-options-reference)
262
+ - [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
263
+ - [AI Safety Features](https://opendataloader.org/docs/ai-safety)
434
264
 
435
- Specific fields of `header` and `footer` json nodes
265
+ <br/>
436
266
 
437
- | Field | Type | Optional | Description |
438
- |-------|-------|----------|-----------------------------------------|
439
- | kids | array | no | Array of header/footer content elements |
267
+ ## Frequently Asked Questions
440
268
 
441
- Specific fields of `text block` json nodes
269
+ ### What is the best PDF parser for RAG?
442
270
 
443
- | Field | Type | Optional | Description |
444
- |-------|-------|----------|--------------------------------------|
445
- | kids | array | no | Array of text block content elements |
271
+ For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
446
272
 
273
+ ### How do I extract tables from PDF for LLM?
447
274
 
448
- ## 🤝 Contributing
275
+ OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
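
As a rough sketch of moving between the two forms, the function below turns a table node into a Markdown table. The field names (`rows`, `cells`, `kids`, `content`) are taken from the 1.3.0 schema tables removed in this diff, so treat them as an assumption and check the current JSON schema before relying on them. The helper names are hypothetical, and row and column spans are ignored for brevity.

```python
def cell_text(cell):
    """Join the text of all content elements inside one table cell."""
    return " ".join(kid.get("content", "") for kid in cell.get("kids", [])).strip()


def table_to_markdown(table):
    """Render a table node as a Markdown table, treating the first row as the header."""
    rows = [[cell_text(cell) for cell in row["cells"]] for row in table["rows"]]
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical table node with two rows and two columns.
table = {
    "type": "table",
    "rows": [
        {"cells": [{"kids": [{"content": "Engine"}]},
                   {"kids": [{"content": "Speed (s/page)"}]}]},
        {"cells": [{"kids": [{"content": "opendataloader"}]},
                   {"kids": [{"content": "0.05"}]}]},
    ],
}
markdown = table_to_markdown(table)
```

Requesting `markdown` output from `convert()` already produces formatted tables, so a helper like this is only useful when post-processing the JSON output.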
449
276
 
450
- We believe that great software is built together.
277
+ ### Can I use this without sending data to the cloud?
451
278
 
452
- Your contributions are vital to the success of this project.
279
+ Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
453
280
 
454
- Please read [CONTRIBUTING.md](https://github.com/hancom-inc/opendataloader-pdf/blob/main/CONTRIBUTING.md) for details on how to contribute.
281
+ ### What makes OpenDataLoader unique?
455
282
 
456
- ## 💖 Community & Support
457
- Have questions or need a little help? We're here for you!🤗
283
+ OpenDataLoader takes a different approach from many PDF parsers:
458
284
 
459
- - [GitHub Discussions](https://github.com/hancom-inc/opendataloader-pdf/discussions): For Q&A and general chats. Let's talk! 🗣️
460
- - [GitHub Issues](https://github.com/hancom-inc/opendataloader-pdf/issues): Found a bug? 🐛 Please report it here so we can fix it.
285
+ - **Rule-based extraction** — Deterministic output without GPU requirements
286
+ - **Bounding boxes for all elements** — Essential for citation systems
287
+ - **XY-Cut++ reading order** — Handles multi-column layouts correctly
288
+ - **Built-in AI safety filters** — Protects against prompt injection
289
+ - **Native Tagged PDF support** — Leverages accessibility metadata
461
290
 
462
- ## Our Branding and Trademarks
291
+ This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
463
292
 
464
- We love our brand and want to protect it!
293
+ <br/>
465
294
 
466
- This project may contain trademarks, logos, or brand names for our products and services.
295
+ ## Contributing
467
296
 
468
- To ensure everyone is on the same page, please remember these simple rules:
297
+ We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
469
298
 
470
- - **Authorized Use**: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
471
- - **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
472
- - **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company’s specific policies.
299
+ <br/>
473
300
 
474
- ## ⚖️ License
301
+ ## License
475
302
 
476
- This project is licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).
303
+ [Mozilla Public License 2.0](LICENSE)
477
304
 
478
- For the full license text, see [LICENSE](LICENSE).
305
+ ---
479
306
 
480
- For information on third-party libraries and components, see:
481
- - [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)
482
- - [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)
483
- - [licenses/](./THIRD_PARTY/licenses/)
307
+ **Found this useful?** Give us a star to help others discover OpenDataLoader.