npm - @opendataloader/pdf - Versions diffs - 1.4.1 → 1.4.3 - Mend

Files changed (3)

package/README.md +227 -96
package/lib/opendataloader-pdf-cli.jar +0 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -1,176 +1,307 @@
 # OpenDataLoader PDF
+**PDF Parsing for RAG** — Convert to Markdown & JSON, Fast, Local, No GPU
 [![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
-![Java](https://img.shields.io/badge/Java-11+-blue.svg)
-![Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
-[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
 [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
 [![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
-[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
-[![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
-[![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)
+[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
+[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
+[![Java](https://img.shields.io/badge/Java-11%2B-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf#java)
-<br/>
+Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
+**Why developers choose OpenDataLoader:**
+- **Deterministic** — Same input always produces same output (no LLM hallucinations)
+- **Fast** — Process 100+ pages per second on CPU
+- **Private** — 100% local, zero data transmission
+- **Accurate** — Bounding boxes for every element, correct multi-column reading order
-**Safe, Open, High-Performance — PDF for AI**
+```bash
+pip install -U opendataloader-pdf
+```
-OpenDataLoader-PDF converts PDFs into JSON, Markdown or Html — ready to feed into modern AI stacks (LLMs, vector search, and RAG).
+```python
+import opendataloader_pdf
-It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
-Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
-AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
+# PDF to Markdown for RAG
+opendataloader_pdf.convert(
+    input_path="document.pdf",
+    output_dir="output/",
+    format="markdown,json"
+)
+```
 <br/>
-## 🌟 Key Features
+## Why OpenDataLoader?
-- 🧾 **Rich, Structured Output** — JSON, Markdown or Html
-- 🧩 **Layout Reconstruction** — Headings, Lists, Tables, Images, Reading Order
-- ⚡ **Fast & Lightweight** — Rule-Based Heuristic, High-Throughput, No GPU
-- 🔒 **Local-First Privacy** — Runs fully on your machine
-- 🏷️ **Tagged PDF** — Advanced data extraction technology based on Tagged PDF - [Learn more](https://opendataloader.org/docs/tagged-pdf)
-- 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content - [Learn more](https://opendataloader.org/docs/ai-safety)
-- 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original - [See examples](https://opendataloader.org/demo/samples)
+Building RAG pipelines? You've probably hit these problems:
-[![Annotated PDF Preview](https://github.com/opendataloader-project/opendataloader-pdf/raw/refs/heads/main/samples/image/example_annotated_pdf.png)](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)
+| Problem | How We Solve It |
+|---------|-----------------|
+| **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
+| **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
+| **Headers/footers pollute context** | Auto-filtered before output |
+| **No coordinates for citations** | Bounding box for every element |
+| **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
+| **GPU required** | Pure CPU, rule-based — runs anywhere |
 <br/>
-- 📊 **Benchmark** — Continuously researched to deliver High-Performance & Quality - [GitHub](https://github.com/opendataloader-project/opendataloader-bench)
+## Key Features
+### For RAG & LLM Pipelines
+- **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)
+- **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
+- **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly
+- **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed
+- **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
+### Performance & Privacy
+- **No GPU** — Fast, rule-based heuristics
+- **Local-First** — Your documents never leave your machine
+- **High Throughput** — Process thousands of PDFs efficiently
+- **Multi-Language SDK** — Python, Node.js, Java, Docker
+### Document Understanding
-[![Benchmark Preview](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
+- **Tables** — Detects borders, handles merged cells
+- **Lists** — Numbered, bulleted, nested
+- **Headings** — Auto-detects hierarchy levels
+- **Images** — Extracts with captions linked
+- **Tagged PDF Support** — Uses native PDF structure when available
+- **AI Safety** — Auto-filters prompt injection content
 <br/>
-### 🚀 Upcoming Features
+## Output Formats
-**Scheduled for December**
-- 🖨️ **OCR for scanned PDFs** — Extract data from image-only pages.
-- 🧠 **Table AI option** — Higher accuracy for tables with borderless or merged cells.
+| Format | Use Case |
+|--------|----------|
+| **JSON** | Structured data with bounding boxes, semantic types |
+| **Markdown** | Clean text for LLM context, RAG chunks |
+| **HTML** | Web display with styling |
+| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
 <br/>
-## Quick Start with Python
+## JSON Output Example
+```json
+{
+  "type": "heading",
+  "id": 42,
+  "level": "Title",
+  "page number": 1,
+  "bounding box": [72.0, 700.0, 540.0, 730.0],
+  "heading level": 1,
+  "font": "Helvetica-Bold",
+  "font size": 24.0,
+  "text color": "[0.0]",
+  "content": "Introduction"
+}
+```
-### Prerequisites
+| Field | Description |
+|-------|-------------|
+| `type` | Element type: heading, paragraph, table, list, image, caption |
+| `id` | Unique identifier for cross-referencing |
+| `page number` | 1-indexed page reference |
+| `bounding box` | `[left, bottom, right, top]` in PDF points |
+| `heading level` | Heading depth (1+) |
+| `font`, `font size` | Typography info |
+| `content` | Extracted text |
-- Java 11 or higher must be installed and available in your system's PATH.
-- Python 3.9+
+[Full JSON Schema →](https://opendataloader.org/docs/json-schema)
-### Installation
+<br/>
-```sh
-pip install -U opendataloader-pdf
-```
+## Quick Start
-### Usage
+- [Python](https://opendataloader.org/docs/quick-start-python)
+- [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
+- [Docker](https://opendataloader.org/docs/quick-start-docker)
+- [Java](https://opendataloader.org/docs/quick-start-java)
-input_path can be either the path to a single document or the path to a folder.
+<br/>
-```python
-import opendataloader_pdf
+## Advanced Options
+```python
 opendataloader_pdf.convert(
-    input_path=["path/to/document.pdf", "path/to/folder"],
-    output_dir="path/to/output",
-    format="json,html,pdf,markdown"
+    input_path="document.pdf",
+    output_dir="output/",
+    format="json,markdown,pdf",
+    # Reading order
+    reading_order="xycut",           # XY-Cut++ for multi-column
+    # Images
+    embed_images=True,               # Base64 in output
+    image_format="png",
+    # Tagged PDF
+    use_struct_tree=True,            # Use native PDF structure
 )
 ```
+[Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
 <br/>
-## Quick Start with more languages & tools
+## AI Safety
-- [Quick Start with Python](https://opendataloader.org/docs/quick-start-python)
-- [Quick Start with Java](https://opendataloader.org/docs/quick-start-java)
-- [Quick Start with Node.js](https://opendataloader.org/docs/quick-start-nodejs)
-- [Quick Start with Docker](https://opendataloader.org/docs/quick-start-docker)
+PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
+- Hidden text (transparent, zero-size)
+- Off-page content
+- Suspicious invisible layers
+This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
 <br/>
-## Developing with OpenDataLoader
+## Tagged PDF Support
-### Build & Test
+**Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
-**Prerequisites**: Java 11+, Python 3.9+, Node.js 20+, pnpm
+**OpenDataLoader leverages this:**
-```sh
-# Run tests (for local development)
-./scripts/test-java.sh
-./scripts/test-python.sh
-./scripts/test-node.sh
+- When a PDF has structure tags, we extract the **exact layout** the author intended
+- Headings, lists, tables, reading order — all preserved from the source
+- No guessing, no heuristics needed — **pixel-perfect semantic extraction**
-# Full CI build (all packages)
-./scripts/build-all.sh
+```python
+opendataloader_pdf.convert(
+    input_path="accessible_document.pdf",
+    use_struct_tree=True  # Use native PDF structure tags
+)
 ```
-### Syncing CLI Options
+Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
+[Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
+<br/>
+## LangChain Integration
-CLI options are defined in Java and auto-generated for Node.js, Python, and documentation.
+OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
-```sh
-# After modifying Java CLI options, regenerate all bindings:
-pnpm run sync-options
+```bash
+pip install -U langchain-opendataloader-pdf
 ```
-This generates:
-- `node/opendataloader-pdf/src/cli-options.generated.ts`
-- `node/opendataloader-pdf/src/convert-options.generated.ts`
-- `python/opendataloader-pdf/src/opendataloader_pdf/cli_options_generated.py`
-- `python/opendataloader-pdf/src/opendataloader_pdf/convert_generated.py`
-- `content/docs/cli-options-reference.mdx`
+```python
+from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
-### Resources
+loader = OpenDataLoaderPDFLoader(
+    file_path=["document.pdf"],
+    format="text"
+)
+documents = loader.load()
-- [CLI Options Reference](https://opendataloader.org/docs/cli-options-reference)
-- [Development](https://opendataloader.org/docs/development-workflow)
-- [Json Schema](https://opendataloader.org/docs/json-schema)
-- [Javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/index.html)
+# Use with any LangChain pipeline
+for doc in documents:
+    print(doc.page_content[:100])
+```
+- [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
+- [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
+- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
 <br/>
-## 🤝 Contributing
+## Benchmarks
+We continuously benchmark against real-world documents.
+[View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
+### Quick Comparison
+| Engine             | Accuracy |      | Speed (s/page) |      | Reading Order |      | Table    |      | Heading  |      |
+|--------------------|----------|------|----------------|------|---------------|------|----------|------|----------|------|
+| **opendataloader** | 0.82     | #2   | **0.05**       | #1   | **0.91**      | #1   | 0.49     | #2   | 0.65     | #2   |
+| docling            | **0.88** | #1   | 0.73           | #4   | 0.90          | #2   | **0.89** | #1   | **0.80** | #1   |
+| pymupdf4llm        | 0.73     | #3   | 0.09           | #2   | 0.89          | #3   | 0.40     | #3   | 0.41     | #3   |
+| markitdown         | 0.58     | #4   | **0.04**       | #1   | 0.88          | #4   | 0.00     | #4   | 0.00     | #4   |
+> Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. **Bold** indicates best performance.
-We believe that great software is built together.
+### When to Use Each Engine
-Your contributions are vital to the success of this project.
+| Use Case                 | Recommended Engine | Why                                                    |
+|--------------------------|--------------------|--------------------------------------------------------|
+| Best overall balance     | **opendataloader** | Fast (0.05s/page) with high reading order accuracy     |
+| Maximum accuracy         | docling            | Highest scores for tables and headings, but 16x slower |
+| Speed-critical pipelines | markitdown         | Fastest, but no table/heading extraction               |
+| PyMuPDF ecosystem        | pymupdf4llm        | Good balance if already using PyMuPDF                  |
+### Visual Comparison
+[![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
-Please read [CONTRIBUTING.md](https://github.com/hancom-inc/opendataloader-pdf/blob/main/CONTRIBUTING.md) for details on how to contribute.
 <br/>
-## 💖 Community & Support
+## Roadmap
+See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
+<br/>
-Have questions or need a little help? We're here for you!🤗
+## Documentation
-- [GitHub Discussions](https://github.com/hancom-inc/opendataloader-pdf/discussions): For Q&A and general chats. Let's talk! 🗣️
-- [GitHub Issues](https://github.com/hancom-inc/opendataloader-pdf/issues): Found a bug? 🐛 Please report it here so we can fix it.
-- [SUPPORT.md](SUPPORT.md): Learn about our issue guidelines and AI-powered issue processing system.
+- [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
+- [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
+- [CLI Options](https://opendataloader.org/docs/cli-options-reference)
+- [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
+- [AI Safety Features](https://opendataloader.org/docs/ai-safety)
 <br/>
-## ✨ Our Branding and Trademarks
+## Frequently Asked Questions
+### What is the best PDF parser for RAG?
+For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
+### How do I extract tables from PDF for LLM?
-We love our brand and want to protect it!
+OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
-This project may contain trademarks, logos, or brand names for our products and services.
+### Can I use this without sending data to the cloud?
+Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
+### What makes OpenDataLoader unique?
+OpenDataLoader takes a different approach from many PDF parsers:
+- **Rule-based extraction** — Deterministic output without GPU requirements
+- **Bounding boxes for all elements** — Essential for citation systems
+- **XY-Cut++ reading order** — Handles multi-column layouts correctly
+- **Built-in AI safety filters** — Protects against prompt injection
+- **Native Tagged PDF support** — Leverages accessibility metadata
+This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
+<br/>
-To ensure everyone is on the same page, please remember these simple rules:
+## Contributing
-- **Authorized Use**: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
-- **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
-- **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company’s specific policies.
+We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
 <br/>
-## ⚖️ License
+## License
-This project is licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).
+[Mozilla Public License 2.0](LICENSE)
-For the full license text, see [LICENSE](LICENSE).
+---
-For information on third-party libraries and components, see:
-- [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)
-- [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)
-- [licenses/](./THIRD_PARTY/licenses/)
+**Found this useful?** Give us a star to help others discover OpenDataLoader.
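
Note on the new "JSON Output Example" section above: the README documents `bounding box` as `[left, bottom, right, top]` in PDF points and `page number` as 1-indexed. A minimal sketch of how a consumer might read one such element follows. It uses only the sample element and field names shown in the diff; the helper `describe_element` and its output keys are hypothetical and are not part of the package's API.

```python
import json

# Sample element copied from the README's "JSON Output Example" in this diff.
ELEMENT_JSON = """
{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "text color": "[0.0]",
  "content": "Introduction"
}
"""


def describe_element(element: dict) -> dict:
    """Hypothetical helper: summarize one element for use as a citation anchor.

    Assumes the field names and the [left, bottom, right, top] ordering
    documented in the README's field table.
    """
    left, bottom, right, top = element["bounding box"]
    return {
        "id": element["id"],
        "type": element["type"],
        "page": element["page number"],  # 1-indexed per the README
        "width_pt": right - left,        # extent in PDF points
        "height_pt": top - bottom,
        "text": element["content"],
    }


element = json.loads(ELEMENT_JSON)
summary = describe_element(element)
print(summary)
```

Several field names contain spaces (`page number`, `bounding box`), so they have to be read with subscript access rather than attribute-style access.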

package/lib/opendataloader-pdf-cli.jar CHANGED

Binary file

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@opendataloader/pdf",
-  "version": "1.4.1",
+  "version": "1.4.3",
   "description": "A Node.js wrapper for the opendataloader-pdf Java CLI.",
   "main": "./dist/index.cjs",
   "module": "./dist/index.js",