@opendataloader/pdf 1.4.1 → 1.4.2

package/README.md CHANGED
@@ -1,176 +1,307 @@
1
1
  # OpenDataLoader PDF
2
2
 
3
+ **PDF to Markdown & JSON for RAG** — Fast, Local, No GPU Required
3
4
 
4
5
  [![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
5
- ![Java](https://img.shields.io/badge/Java-11+-blue.svg)
6
- ![Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
7
- [![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
8
6
  [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
9
7
  [![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
10
- [![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
11
- [![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
12
- [![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)
8
+ [![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
9
+ [![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
10
+ [![Java](https://img.shields.io/badge/Java-11%2B-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf#java)
13
11
 
14
- <br/>
12
+ Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
13
+
14
+ **Why developers choose OpenDataLoader:**
15
+ - **Deterministic** — Same input always produces same output (no LLM hallucinations)
16
+ - **Fast** — Process 100+ pages per second on CPU
17
+ - **Private** — 100% local, zero data transmission
18
+ - **Accurate** — Bounding boxes for every element, correct multi-column reading order
15
19
 
16
- **Safe, Open, High-Performance — PDF for AI**
20
+ ```bash
21
+ pip install -U opendataloader-pdf
22
+ ```
17
23
 
18
- OpenDataLoader-PDF converts PDFs into JSON, Markdown or Html — ready to feed into modern AI stacks (LLMs, vector search, and RAG).
24
+ ```python
25
+ import opendataloader_pdf
19
26
 
20
- It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
21
- Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
22
- AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
27
+ # PDF to Markdown for RAG
28
+ opendataloader_pdf.convert(
29
+ input_path="document.pdf",
30
+ output_dir="output/",
31
+ format="markdown,json"
32
+ )
33
+ ```
23
34
 
24
35
  <br/>
25
36
 
26
- ## 🌟 Key Features
37
+ ## Why OpenDataLoader?
27
38
 
28
- - 🧾 **Rich, Structured Output** JSON, Markdown or Html
29
- - 🧩 **Layout Reconstruction** — Headings, Lists, Tables, Images, Reading Order
30
- - ⚡ **Fast & Lightweight** — Rule-Based Heuristic, High-Throughput, No GPU
31
- - 🔒 **Local-First Privacy** — Runs fully on your machine
32
- - 🏷️ **Tagged PDF** — Advanced data extraction technology based on Tagged PDF - [Learn more](https://opendataloader.org/docs/tagged-pdf)
33
- - 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content - [Learn more](https://opendataloader.org/docs/ai-safety)
34
- - 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original - [See examples](https://opendataloader.org/demo/samples)
39
+ Building RAG pipelines? You've probably hit these problems:
35
40
 
36
- [![Annotated PDF Preview](https://github.com/opendataloader-project/opendataloader-pdf/raw/refs/heads/main/samples/image/example_annotated_pdf.png)](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)
41
+ | Problem | How We Solve It |
42
+ |---------|-----------------|
43
+ | **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
44
+ | **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
45
+ | **Headers/footers pollute context** | Auto-filtered before output |
46
+ | **No coordinates for citations** | Bounding box for every element |
47
+ | **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
48
+ | **GPU required** | Pure CPU, rule-based — runs anywhere |
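
To make the reading-order row concrete, here is a minimal sketch of the classic recursive XY-cut idea. It is not the XY-Cut++ variant the README names and not the library's implementation. The block tuples, the `min_gap` threshold, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# (left, bottom, right, top, label); PDF coordinates, so y grows upward.
# The label field is a hypothetical addition used only to show the result.
Block = Tuple[float, float, float, float, str]


def split_on_gaps(blocks: List[Block], lo: int, hi: int, min_gap: float) -> List[List[Block]]:
    """Group blocks along one axis, starting a new group at each empty gap >= min_gap."""
    ordered = sorted(blocks, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest edge covered so far
    for block in ordered[1:]:
        if block[lo] - reach >= min_gap:
            groups.append([block])
        else:
            groups[-1].append(block)
        reach = max(reach, block[hi])
    return groups


def xy_cut(blocks: List[Block], min_gap: float = 10.0) -> List[Block]:
    """Return blocks in reading order using recursive horizontal and vertical cuts."""
    if len(blocks) <= 1:
        return list(blocks)
    # Horizontal cuts: split into bands along y, then read the top band first.
    bands = split_on_gaps(blocks, lo=1, hi=3, min_gap=min_gap)
    if len(bands) > 1:
        return [b for band in reversed(bands) for b in xy_cut(band, min_gap)]
    # Vertical cuts: split into columns along x, then read the left column first.
    columns = split_on_gaps(blocks, lo=0, hi=2, min_gap=min_gap)
    if len(columns) > 1:
        return [b for col in columns for b in xy_cut(col, min_gap)]
    # No clean cut: fall back to top-to-bottom, then left-to-right.
    return sorted(blocks, key=lambda b: (-b[3], b[0]))


# A full-width title above two columns, given in scrambled order.
page = [
    (316.0, 100.0, 540.0, 330.0, "right-2"),
    (72.0, 400.0, 296.0, 680.0, "left-1"),
    (72.0, 700.0, 540.0, 730.0, "title"),
    (316.0, 350.0, 540.0, 680.0, "right-1"),
    (72.0, 100.0, 296.0, 380.0, "left-2"),
]
reading_order = [b[4] for b in xy_cut(page)]
print(reading_order)
```

A plain top-to-bottom sort would interleave `left-1` and `right-1` because both start at the same height. Cutting on whitespace gaps first keeps each column together.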
37
49
 
38
50
  <br/>
39
51
 
40
- - 📊 **Benchmark** — Continuously researched to deliver High-Performance & Quality - [GitHub](https://github.com/opendataloader-project/opendataloader-bench)
52
+ ## Key Features
53
+
54
+ ### For RAG & LLM Pipelines
55
+
56
+ - **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)
57
+ - **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
58
+ - **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly
59
+ - **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed
60
+ - **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
61
+
62
+ ### Performance & Privacy
63
+
64
+ - **No GPU** — Fast, rule-based heuristics
65
+ - **Local-First** — Your documents never leave your machine
66
+ - **High Throughput** — Process thousands of PDFs efficiently
67
+ - **Multi-Language SDK** — Python, Node.js, Java, Docker
68
+
69
+ ### Document Understanding
41
70
 
42
- [![Benchmark Preview](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
71
+ - **Tables** — Detects borders, handles merged cells
72
+ - **Lists** — Numbered, bulleted, nested
73
+ - **Headings** — Auto-detects hierarchy levels
74
+ - **Images** — Extracts with captions linked
75
+ - **Tagged PDF Support** — Uses native PDF structure when available
76
+ - **AI Safety** — Auto-filters prompt injection content
43
77
 
44
78
  <br/>
45
79
 
46
- ### 🚀 Upcoming Features
80
+ ## Output Formats
47
81
 
48
- **Scheduled for December**
49
- - 🖨️ **OCR for scanned PDFs** — Extract data from image-only pages.
50
- - 🧠 **Table AI option** Higher accuracy for tables with borderless or merged cells.
82
+ | Format | Use Case |
83
+ |--------|----------|
84
+ | **JSON** | Structured data with bounding boxes, semantic types |
85
+ | **Markdown** | Clean text for LLM context, RAG chunks |
86
+ | **HTML** | Web display with styling |
87
+ | **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
51
88
 
52
89
  <br/>
53
90
 
54
- ## Quick Start with Python
91
+ ## JSON Output Example
92
+
93
+ ```json
94
+ {
95
+ "type": "heading",
96
+ "id": 42,
97
+ "level": "Title",
98
+ "page number": 1,
99
+ "bounding box": [72.0, 700.0, 540.0, 730.0],
100
+ "heading level": 1,
101
+ "font": "Helvetica-Bold",
102
+ "font size": 24.0,
103
+ "text color": "[0.0]",
104
+ "content": "Introduction"
105
+ }
106
+ ```
55
107
 
56
- ### Prerequisites
108
+ | Field | Description |
109
+ |-------|-------------|
110
+ | `type` | Element type: heading, paragraph, table, list, image, caption |
111
+ | `id` | Unique identifier for cross-referencing |
112
+ | `page number` | 1-indexed page reference |
113
+ | `bounding box` | `[left, bottom, right, top]` in PDF points |
114
+ | `heading level` | Heading depth (1+) |
115
+ | `font`, `font size` | Typography info |
116
+ | `content` | Extracted text |
57
117
 
58
- - Java 11 or higher must be installed and available in your system's PATH.
59
- - Python 3.9+
118
+ [Full JSON Schema →](https://opendataloader.org/docs/json-schema)
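
As a hedged sketch of how the fields in the table above could feed a citation, the snippet below parses a trimmed copy of the sample element and turns its bounding box into a size in inches. The `describe` helper and the citation wording are hypothetical. Only the field names and values come from the sample, and the shape of a full output file is not assumed here.

```python
import json

# A trimmed copy of the sample element shown in the README.
raw = """
{
  "type": "heading",
  "id": 42,
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "content": "Introduction"
}
"""

POINTS_PER_INCH = 72.0  # default PDF user space: 1 point = 1/72 inch


def describe(element: dict) -> str:
    """Build a human-readable citation from one element (hypothetical helper)."""
    left, bottom, right, top = element["bounding box"]
    width_in = (right - left) / POINTS_PER_INCH
    height_in = (top - bottom) / POINTS_PER_INCH
    return (
        f'{element["type"]} #{element["id"]} on page {element["page number"]}: '
        f'"{element["content"]}" ({width_in:.2f} in x {height_in:.2f} in)'
    )


element = json.loads(raw)
citation = describe(element)
print(citation)
```

Because the box is `[left, bottom, right, top]`, width is `right - left` and height is `top - bottom`. A viewer that draws from the top-left corner would need to flip the y values against the page height.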
60
119
 
61
- ### Installation
120
+ <br/>
62
121
 
63
- ```sh
64
- pip install -U opendataloader-pdf
65
- ```
122
+ ## Quick Start
66
123
 
67
- ### Usage
124
+ - [Python](https://opendataloader.org/docs/quick-start-python)
125
+ - [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
126
+ - [Docker](https://opendataloader.org/docs/quick-start-docker)
127
+ - [Java](https://opendataloader.org/docs/quick-start-java)
68
128
 
69
- input_path can be either the path to a single document or the path to a folder.
129
+ <br/>
70
130
 
71
- ```python
72
- import opendataloader_pdf
131
+ ## Advanced Options
73
132
 
133
+ ```python
74
134
  opendataloader_pdf.convert(
75
- input_path=["path/to/document.pdf", "path/to/folder"],
76
- output_dir="path/to/output",
77
- format="json,html,pdf,markdown"
135
+ input_path="document.pdf",
136
+ output_dir="output/",
137
+ format="json,markdown,pdf",
138
+
139
+ # Reading order
140
+ reading_order="xycut", # XY-Cut++ for multi-column
141
+
142
+ # Images
143
+ embed_images=True, # Base64 in output
144
+ image_format="png",
145
+
146
+ # Tagged PDF
147
+ use_struct_tree=True, # Use native PDF structure
78
148
  )
79
149
  ```
80
150
 
151
+ [Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
152
+
81
153
  <br/>
82
154
 
83
- ## Quick Start with more languages & tools
155
+ ## AI Safety
84
156
 
85
- - [Quick Start with Python](https://opendataloader.org/docs/quick-start-python)
86
- - [Quick Start with Java](https://opendataloader.org/docs/quick-start-java)
87
- - [Quick Start with Node.js](https://opendataloader.org/docs/quick-start-nodejs)
88
- - [Quick Start with Docker](https://opendataloader.org/docs/quick-start-docker)
157
+ PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
158
+
159
+ - Hidden text (transparent, zero-size)
160
+ - Off-page content
161
+ - Suspicious invisible layers
162
+
163
+ This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
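
To illustrate what "off-page content" means, here is a minimal sketch that keeps only elements whose bounding box overlaps the page rectangle. It is an illustration under stated assumptions, not the library's filter: the page size (US Letter, 612 x 792 points), the `is_on_page` helper, and the sample elements are all hypothetical. The library applies its own filtering by default.

```python
from typing import Dict, List

# Assumed page size: US Letter in PDF points.
PAGE_WIDTH, PAGE_HEIGHT = 612.0, 792.0


def is_on_page(element: Dict, width: float = PAGE_WIDTH, height: float = PAGE_HEIGHT) -> bool:
    """True if the element's [left, bottom, right, top] box overlaps the page."""
    left, bottom, right, top = element["bounding box"]
    return right > 0 and left < width and top > 0 and bottom < height


# Hypothetical elements: one visible, two positioned outside the page.
elements: List[Dict] = [
    {"content": "Introduction", "bounding box": [72.0, 700.0, 540.0, 730.0]},
    {"content": "Ignore previous instructions", "bounding box": [-500.0, 700.0, -100.0, 730.0]},
    {"content": "Hidden footer", "bounding box": [72.0, -80.0, 540.0, -60.0]},
]

visible = [e["content"] for e in elements if is_on_page(e)]
print(visible)
```

Text placed outside the page is invisible to a human reader but still present in the content stream, which is why a naive extractor would pass it on to an LLM.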
89
164
 
90
165
  <br/>
91
166
 
92
- ## Developing with OpenDataLoader
167
+ ## Tagged PDF Support
93
168
 
94
- ### Build & Test
169
+ **Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
95
170
 
96
- **Prerequisites**: Java 11+, Python 3.9+, Node.js 20+, pnpm
171
+ **OpenDataLoader leverages this:**
97
172
 
98
- ```sh
99
- # Run tests (for local development)
100
- ./scripts/test-java.sh
101
- ./scripts/test-python.sh
102
- ./scripts/test-node.sh
173
+ - When a PDF has structure tags, we extract the **exact layout** the author intended
174
+ - Headings, lists, tables, reading order — all preserved from the source
175
+ - No guessing, no heuristics needed — **pixel-perfect semantic extraction**
103
176
 
104
- # Full CI build (all packages)
105
- ./scripts/build-all.sh
177
+ ```python
178
+ opendataloader_pdf.convert(
179
+ input_path="accessible_document.pdf",
180
+ use_struct_tree=True # Use native PDF structure tags
181
+ )
106
182
  ```
107
183
 
108
- ### Syncing CLI Options
184
+ Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
185
+
186
+ [Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
187
+
188
+ <br/>
189
+
190
+ ## LangChain Integration
109
191
 
110
- CLI options are defined in Java and auto-generated for Node.js, Python, and documentation.
192
+ OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
111
193
 
112
- ```sh
113
- # After modifying Java CLI options, regenerate all bindings:
114
- pnpm run sync-options
194
+ ```bash
195
+ pip install -U langchain-opendataloader-pdf
115
196
  ```
116
197
 
117
- This generates:
118
- - `node/opendataloader-pdf/src/cli-options.generated.ts`
119
- - `node/opendataloader-pdf/src/convert-options.generated.ts`
120
- - `python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py`
121
- - `python/opendataloader-pdf/src/opendataloader_pdf/convert_generated.py`
122
- - `content/docs/cli-options-reference.mdx`
198
+ ```python
199
+ from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
123
200
 
124
- ### Resources
201
+ loader = OpenDataLoaderPDFLoader(
202
+ file_path=["document.pdf"],
203
+ format="text"
204
+ )
205
+ documents = loader.load()
125
206
 
126
- - [CLI Options Reference](https://opendataloader.org/docs/cli-options-reference)
127
- - [Development](https://opendataloader.org/docs/development-workflow)
128
- - [Json Schema](https://opendataloader.org/docs/json-schema)
129
- - [Javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/index.html)
207
+ # Use with any LangChain pipeline
208
+ for doc in documents:
209
+ print(doc.page_content[:100])
210
+ ```
211
+
212
+ - [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
213
+ - [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
214
+ - [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
130
215
 
131
216
  <br/>
132
217
 
133
- ## 🤝 Contributing
218
+ ## Benchmarks
219
+
220
+ We continuously benchmark against real-world documents.
221
+
222
+ [View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
223
+
224
+ ### Quick Comparison
225
+
226
+ | Engine | Accuracy | Rank | Speed (s/page) | Rank | Reading Order | Rank | Table | Rank | Heading | Rank |
227
+ |--------------------|----------|------|----------------|------|---------------|------|----------|------|----------|------|
228
+ | **opendataloader** | 0.82 | #2 | **0.05** | #1 | **0.91** | #1 | 0.49 | #2 | 0.65 | #2 |
229
+ | docling | **0.88** | #1 | 0.73 | #4 | 0.90 | #2 | **0.89** | #1 | **0.80** | #1 |
230
+ | pymupdf4llm | 0.73 | #3 | 0.09 | #2 | 0.89 | #3 | 0.40 | #3 | 0.41 | #3 |
231
+ | markitdown | 0.58 | #4 | **0.04** | #1 | 0.88 | #4 | 0.00 | #4 | 0.00 | #4 |
232
+
233
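+ > Accuracy scores are normalized to [0, 1], and higher is better. Speed is measured in seconds per page, and lower is better. The `#n` columns give each engine's rank for the preceding metric. **Bold** marks the best result in each column.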
134
234
 
135
- We believe that great software is built together.
235
+ ### When to Use Each Engine
136
236
 
137
- Your contributions are vital to the success of this project.
237
+ | Use Case | Recommended Engine | Why |
238
+ |--------------------------|--------------------|--------------------------------------------------------|
239
+ | Best overall balance | **opendataloader** | Fast (0.05s/page) with high reading order accuracy |
240
+ | Maximum accuracy | docling | Highest scores for tables and headings, but about 15x slower (0.73 vs. 0.05 s/page) |
241
+ | Speed-critical pipelines | markitdown | Fastest, but no table/heading extraction |
242
+ | PyMuPDF ecosystem | pymupdf4llm | Good balance if already using PyMuPDF |
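
The speed figures can be turned into throughput and relative cost with simple arithmetic. The sketch below uses only the seconds-per-page values from the Quick Comparison table. The variable names are illustrative, and real throughput will depend on hardware and document mix.

```python
# Seconds per page, copied from the Quick Comparison table.
seconds_per_page = {
    "opendataloader": 0.05,
    "docling": 0.73,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}

baseline = seconds_per_page["opendataloader"]

# Pages per second is the reciprocal of seconds per page.
throughput = {name: round(1.0 / spp, 1) for name, spp in seconds_per_page.items()}

# Time per page relative to opendataloader (values above 1.0 are slower).
relative_time = {name: round(spp / baseline, 1) for name, spp in seconds_per_page.items()}

for name in seconds_per_page:
    print(f"{name:<15} {throughput[name]:6.1f} pages/s  {relative_time[name]:5.1f}x")
```

On these numbers, 0.05 s/page corresponds to 20 pages per second on the benchmark setup, and docling's 0.73 s/page is roughly 14.6 times that per-page time.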
243
+
244
+ ### Visual Comparison
245
+
246
+ [![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
138
247
 
139
- Please read [CONTRIBUTING.md](https://github.com/hancom-inc/opendataloader-pdf/blob/main/CONTRIBUTING.md) for details on how to contribute.
140
248
 
141
249
  <br/>
142
250
 
143
- ## 💖 Community & Support
251
+ ## Roadmap
252
+
253
+ See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
254
+
255
+ <br/>
144
256
 
145
- Have questions or need a little help? We're here for you!🤗
257
+ ## Documentation
146
258
 
147
- - [GitHub Discussions](https://github.com/hancom-inc/opendataloader-pdf/discussions): For Q&A and general chats. Let's talk! 🗣️
148
- - [GitHub Issues](https://github.com/hancom-inc/opendataloader-pdf/issues): Found a bug? 🐛 Please report it here so we can fix it.
149
- - [SUPPORT.md](SUPPORT.md): Learn about our issue guidelines and AI-powered issue processing system.
259
+ - [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
260
+ - [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
261
+ - [CLI Options](https://opendataloader.org/docs/cli-options-reference)
262
+ - [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
263
+ - [AI Safety Features](https://opendataloader.org/docs/ai-safety)
150
264
 
151
265
  <br/>
152
266
 
153
- ## Our Branding and Trademarks
267
+ ## Frequently Asked Questions
268
+
269
+ ### What is the best PDF parser for RAG?
270
+
271
+ For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
272
+
273
+ ### How do I extract tables from PDF for LLM?
154
274
 
155
- We love our brand and want to protect it!
275
+ OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
156
276
 
157
- This project may contain trademarks, logos, or brand names for our products and services.
277
+ ### Can I use this without sending data to the cloud?
278
+
279
+ Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
280
+
281
+ ### What makes OpenDataLoader unique?
282
+
283
+ OpenDataLoader takes a different approach from many PDF parsers:
284
+
285
+ - **Rule-based extraction** — Deterministic output without GPU requirements
286
+ - **Bounding boxes for all elements** — Essential for citation systems
287
+ - **XY-Cut++ reading order** — Handles multi-column layouts correctly
288
+ - **Built-in AI safety filters** — Protects against prompt injection
289
+ - **Native Tagged PDF support** — Leverages accessibility metadata
290
+
291
+ This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
292
+
293
+ <br/>
158
294
 
159
- To ensure everyone is on the same page, please remember these simple rules:
295
+ ## Contributing
160
296
 
161
- - **Authorized Use**: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
162
- - **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
163
- - **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company’s specific policies.
297
+ We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
164
298
 
165
299
  <br/>
166
300
 
167
- ## ⚖️ License
301
+ ## License
168
302
 
169
- This project is licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).
303
+ [Mozilla Public License 2.0](LICENSE)
170
304
 
171
- For the full license text, see [LICENSE](LICENSE).
305
+ ---
172
306
 
173
- For information on third-party libraries and components, see:
174
- - [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)
175
- - [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)
176
- - [licenses/](./THIRD_PARTY/licenses/)
307
+ **Found this useful?** Give us a star to help others discover OpenDataLoader.
Binary file
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@opendataloader/pdf",
3
- "version": "1.4.1",
3
+ "version": "1.4.2",
4
4
  "description": "A Node.js wrapper for the opendataloader-pdf Java CLI.",
5
5
  "main": "./dist/index.cjs",
6
6
  "module": "./dist/index.js",