@opendataloader/pdf 1.3.0 → 1.4.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

package/NOTICE.md +1 -1
package/README.md +193 -369
package/dist/cli.cjs +140 -65
package/dist/cli.cjs.map +1 -1
package/dist/cli.js +140 -65
package/dist/cli.js.map +1 -1
package/dist/index.cjs +102 -81
package/dist/index.cjs.map +1 -1
package/dist/index.d.cts +48 -12
package/dist/index.d.ts +48 -12
package/dist/index.js +101 -81
package/dist/index.js.map +1 -1
package/lib/opendataloader-pdf-cli.jar +0 -0
package/package.json +2 -2

package/README.md CHANGED

@@ -1,483 +1,307 @@
 # OpenDataLoader PDF
+**PDF to Markdown & JSON for RAG** — Fast, Local, No GPU Required
 [![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
-![Java](https://img.shields.io/badge/Java-11+-blue.svg)
-![Python](https://img.shields.io/badge/Python-3.9+-blue.svg)
-[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
 [![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
 [![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
-[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
-[![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
-[![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)
-<br/>
-**Safe, Open, High-Performance — PDF for AI**
-OpenDataLoader-PDF converts PDFs into JSON, Markdown or Html — ready to feed into modern AI stacks (LLMs, vector search, and RAG).
-It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
-Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
-AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
-<br/>
-## 🌟 Key Features
-- 🧾 **Rich, Structured Output** — JSON, Markdown or Html
-- 🧩 **Layout Reconstruction** — Headings, Lists, Tables, Images, Reading Order
-- ⚡ **Fast & Lightweight** — Rule-Based Heuristic, High-Throughput, No GPU
-- 🔒 **Local-First Privacy** — Runs fully on your machine
-- 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content - [Learn more](https://opendataloader.org/docs/ai-safety)
-- 🏷️ **Tagged PDF** — Advanced data extraction technology based on Tagged PDF - [Learn more](https://opendataloader.org/docs/tagged-pdf)
-- 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original
-[Download Annotated PDF Sample](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/1901.03003_annotated.pdf)
-![Annotated PDF Preview](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/example_annotated_pdf.png)
-<br/>
-## 🚀 Upcoming Features
-**Scheduled for November**
-- ⚡ **Performance Improvement** — Enhance the inference skill for greater accuracy and speed.
-- 📊 **Benchmarks & Datasets** — Publish transparent evaluations using open datasets and standardized metrics.
-- 🎯 **Metrics** — Publish the calculation methods to transparently share benchmark results.
-<br/>
-**Scheduled for December**
-- 🖨️ **OCR for scanned PDFs** — Extract data from image-only pages.
-- 🧠 **Table AI option** — Higher accuracy for tables with borderless or merged cells.
-<br/>
-**Scheduled for 2026**
-- 🛡️ **AI Red Teaming** — Transparent adversarial benchmarks with datasets and metrics, then reported regularly.
-<br/>
-## Prerequisites
-- Java 11 or higher must be installed and available in your system's PATH.
-- Python 3.9+
-<br/>
+[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
+[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
+[![Java](https://img.shields.io/badge/Java-11%2B-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf#java)
-## Python
+Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
-### Installation
+**Why developers choose OpenDataLoader:**
+- **Deterministic** — Same input always produces same output (no LLM hallucinations)
+- **Fast** — Process 100+ pages per second on CPU
+- **Private** — 100% local, zero data transmission
+- **Accurate** — Bounding boxes for every element, correct multi-column reading order
-```sh
+```bash
 pip install -U opendataloader-pdf
 ```
-### Usage
-input_path can be either the path to a single document or the path to a folder.
 ```python
 import opendataloader_pdf
+# PDF to Markdown for RAG
 opendataloader_pdf.convert(
-    input_path=["path/to/document.pdf", "path/to/folder"],
-    output_dir="path/to/output",
-    format="json,html,pdf,markdown"
+    input_path="document.pdf",
+    output_dir="output/",
+    format="markdown,json"
 )
 ```
-If you want to run it via CLI, you can use the following command on the terminal:
-```bash
-opendataloader-pdf path/to/document.pdf path/to/folder -o path/to/output -f json,html,pdf,markdown
-```
-### Function: convert()
-The main function to process PDFs.
-| Parameter               | Type                  | Required | Default      | Description                                                                                                                              |
-|-------------------------|-----------------------| -------- |--------------|------------------------------------------------------------------------------------------------------------------------------------------|
-| `input_path`            | `List[str]`           | ✅ Yes    | —            | One or more PDF file paths or directories to process.                                                                                    |
-| `output_dir`            | `Optional[str]`       | No       | input folder | Directory where outputs are written.                                                                                                     |
-| `password`              | `Optional[str]`       | No       | `None`       | Password used for encrypted PDFs.                                                                                                        |
-| `format`                | `Optional[Union[str, List[str]]]` | No | `None`       | Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images) |
-| `quiet`                 | `bool`                | No       | `False`      | Suppresses CLI logging output when `True`.                                                                                               |
-| `content_safety_off`    | `Optional[Union[str, List[str]]]` | No | `None`       | Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)                       |
-| `keep_line_breaks`      | `bool`                | No       | `False`      | Preserves line breaks in text output when `True`.                                                                                        |
-| `replace_invalid_chars` | `Optional[str]`       | No       | `None`       | Replacement character for invalid or unrecognized characters (e.g., �, `\u0000`).                                                        |
-| `use_struct_tree`       | `bool `               | No       | `False`      | Enable processing structure tree (disabled by default).                                                                                  |
-### Function: run()
-Deprecated.
 <br/>
-## Node.js / NPM
-**Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.
+## Why OpenDataLoader?
-### Prerequisites
+Building RAG pipelines? You've probably hit these problems:
-- Java 11 or higher must be installed and available in your system's PATH.
+| Problem | How We Solve It |
+|---------|-----------------|
+| **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
+| **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
+| **Headers/footers pollute context** | Auto-filtered before output |
+| **No coordinates for citations** | Bounding box for every element |
+| **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
+| **GPU required** | Pure CPU, rule-based — runs anywhere |
-### Installation
-```sh
-npm install @opendataloader/pdf
-```
+<br/>
-### Usage
+## Key Features
-`inputPath` can be either the path to a single document or the path to a folder.
+### For RAG & LLM Pipelines
-```typescript
-import { convert } from '@opendataloader/pdf';
+- **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)
+- **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
+- **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly
+- **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed
+- **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
-async function main() {
-  try {
-    await convert(['path/to/document.pdf', 'path/to/folder'], {
-      outputDir: 'path/to/output',
-      format: 'json,html,pdf,markdown',
-    });
-    console.log('convert() complete');
-  } catch (error) {
-    console.error('Error processing PDF:', error);
-  }
-}
+### Performance & Privacy
-main();
-```
-### Function: convert()
+- **No GPU** — Fast, rule-based heuristics
+- **Local-First** — Your documents never leave your machine
+- **High Throughput** — Process thousands of PDFs efficiently
+- **Multi-Language SDK** — Python, Node.js, Java, Docker
-`convert(inputPaths: string[], options?: ConvertOptions): Promise<string>`
+### Document Understanding
-Multi-input helper matching the Python wrapper.
+- **Tables** — Detects borders, handles merged cells
+- **Lists** — Numbered, bulleted, nested
+- **Headings** — Auto-detects hierarchy levels
+- **Images** — Extracts with captions linked
+- **Tagged PDF Support** — Uses native PDF structure when available
+- **AI Safety** — Auto-filters prompt injection content
-| Property                       | Type       | Default     | Description                                                                                                                  |
-|--------------------------------| ---------- | ----------- |------------------------------------------------------------------------------------------------------------------------------|
-| `inputPaths`                   | `string[]` | —           | One or more file paths or directories to process.                                                                            |
-| `options.outputDir`            | `string`   | `undefined` | Directory where outputs are written.                                                                                         |
-| `options.password`             | `string`   | `undefined` | Password for encrypted PDFs.                                                                                                 |
-| `options.format`               | `string \| string[]` | `undefined` | Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images) |
-| `options.quiet`                | `boolean`  | `false`     | Suppress CLI logging output and prevent streaming.                                                                           |
-| `options.contentSafetyOff`     | `string \| string[]` | `undefined` | Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)                         |
-| `options.keepLineBreaks`       | `boolean`  | `false`     | Preserve line breaks in text output.                                                                                         |
-| `options.replaceInvalidChars`  | `string`   | `undefined` | Replacement character for invalid or unrecognized characters.                                                                |
-| `options.useStructTree`        | `boolean`  | `false`     | Enable processing structure tree (disabled by default).                                                                      |
+<br/>
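The new README credits XY-Cut++ for multi-column reading order but does not show how it works. As a rough illustration of the underlying idea only, here is a minimal sketch of a simplified recursive XY-cut over bounding boxes. It is not XY-Cut++ and not the package's implementation; the function names, the `min_gap` threshold, and the "prefer a column gutter, otherwise split at the widest horizontal gap" policy are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def widest_gap(spans: List[Tuple[float, float]], min_gap: float) -> Optional[float]:
    """Return the midpoint of the widest empty gap wider than min_gap, else None."""
    spans = sorted(spans)
    best_size, best_cut = min_gap, None
    reach = spans[0][1]  # furthest coordinate covered so far
    for start, end in spans[1:]:
        if start - reach > best_size:
            best_size, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursive cuts: columns left to right, rows top to bottom."""
    if len(boxes) <= 1:
        return list(boxes)
    # Prefer a vertical cut (a column gutter) so each column is read in full.
    x_cut = widest_gap([(b[0], b[2]) for b in boxes], min_gap)
    if x_cut is not None:
        left = [b for b in boxes if b[2] <= x_cut]
        right = [b for b in boxes if b[0] >= x_cut]
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)
    # Otherwise cut horizontally; PDF y grows upward, so the upper part reads first.
    y_cut = widest_gap([(b[1], b[3]) for b in boxes], min_gap)
    if y_cut is not None:
        upper = [b for b in boxes if b[1] >= y_cut]
        lower = [b for b in boxes if b[3] <= y_cut]
        return xy_cut(upper, min_gap) + xy_cut(lower, min_gap)
    # No clean cut: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (-b[3], b[0]))


# Hypothetical page: a full-width title above two columns of two paragraphs each.
title = (72.0, 720.0, 540.0, 750.0)
l1, l2 = (72.0, 600.0, 290.0, 690.0), (72.0, 500.0, 290.0, 580.0)
r1, r2 = (320.0, 600.0, 540.0, 690.0), (320.0, 500.0, 540.0, 580.0)

ordered = xy_cut([r2, l1, title, r1, l2])
print(ordered == [title, l1, l2, r1, r2])
```

A plain top-to-bottom sort would interleave the two columns (`l1`, `r1`, `l2`, `r2`); cutting at the gutter first keeps each column together. The sketch can still misorder pages where a full-width element sits below aligned columns, which is the kind of case more elaborate variants are meant to handle.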
-### Function: run()
+## Output Formats
-Deprecated.
+| Format | Use Case |
+|--------|----------|
+| **JSON** | Structured data with bounding boxes, semantic types |
+| **Markdown** | Clean text for LLM context, RAG chunks |
+| **HTML** | Web display with styling |
+| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
-### CLI
+<br/>
-```bash
-npx @opendataloader/pdf path/to/document.pdf path/to/folder -o path/to/output -f json,html,pdf,markdown
+## JSON Output Example
+```json
+{
+  "type": "heading",
+  "id": 42,
+  "level": "Title",
+  "page number": 1,
+  "bounding box": [72.0, 700.0, 540.0, 730.0],
+  "heading level": 1,
+  "font": "Helvetica-Bold",
+  "font size": 24.0,
+  "text color": "[0.0]",
+  "content": "Introduction"
+}
 ```
-#### Available options
+| Field | Description |
+|-------|-------------|
+| `type` | Element type: heading, paragraph, table, list, image, caption |
+| `id` | Unique identifier for cross-referencing |
+| `page number` | 1-indexed page reference |
+| `bounding box` | `[left, bottom, right, top]` in PDF points |
+| `heading level` | Heading depth (1+) |
+| `font`, `font size` | Typography info |
+| `content` | Extracted text |
-```
-  -o, --output-dir <path>             Directory where outputs are written
-  -p, --password <password>           Password for encrypted PDFs
-  -f, --format <values>               Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
-  -q, --quiet                         Suppress CLI logging output
-      --content-safety-off <modes>    Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)
-      --keep-line-breaks              Preserve line breaks in text output
-      --replace-invalid-chars <c>     Replacement character for invalid or unrecognized characters
-  -h, --help                          Show usage information
-      --use-struct-tree               Enable processing structure tree (disabled by default)
-```
+[Full JSON Schema →](https://opendataloader.org/docs/json-schema)
 <br/>
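To make the field table above concrete, here is a minimal sketch of consuming the JSON output in Python: it walks the element tree, picks out headings, and derives each element's size from its `bounding box`. The field names come from the README text in this diff; the root wrapper with a `kids` array follows the schema tables in the removed 1.3.0 README and is assumed to be unchanged in 1.4.2. The sample data and the helper names `iter_elements` and `bbox_size` are hypothetical.

```python
import json

# Sample shaped like the README's JSON example. The root object with "kids"
# follows the removed 1.3.0 schema tables (assumed, not confirmed for 1.4.2).
SAMPLE = """
{
  "file name": "document.pdf",
  "number of pages": 1,
  "kids": [
    {
      "type": "heading",
      "id": 42,
      "page number": 1,
      "bounding box": [72.0, 700.0, 540.0, 730.0],
      "heading level": 1,
      "content": "Introduction"
    },
    {
      "type": "paragraph",
      "id": 43,
      "page number": 1,
      "bounding box": [72.0, 640.0, 540.0, 690.0],
      "content": "Body text."
    }
  ]
}
"""


def iter_elements(node):
    """Yield every content element, descending into nested "kids" arrays."""
    for kid in node.get("kids", []):
        yield kid
        yield from iter_elements(kid)


def bbox_size(bbox):
    """Return (width, height) in PDF points for [left, bottom, right, top]."""
    left, bottom, right, top = bbox
    return right - left, top - bottom


doc = json.loads(SAMPLE)
headings = [e for e in iter_elements(doc) if e["type"] == "heading"]
for h in headings:
    width, height = bbox_size(h["bounding box"])
    # Page number plus box is enough to point a citation back at the source.
    print(f'p.{h["page number"]} "{h["content"]}" {width:.0f}x{height:.0f} pt')
```

In the removed 1.3.0 schema, tables nest their content under `rows` and `cells`, and lists under `list items`, so a walker meant for real output would need to descend into those arrays as well.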
-## Java
-For various example templates, including Gradle and Maven, please refer to [Examples](https://github.com/opendataloader-project/opendataloader-pdf-examples).
-### Dependency
-To include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.
-Check for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).
-```xml
-<project>
-    <!-- other configurations... -->
-    <dependencies>
-        <dependency>
-            <groupId>org.opendataloader</groupId>
-            <artifactId>opendataloader-pdf-core</artifactId>
-            <version>1.3.0</version>
-        </dependency>
-    </dependencies>
-    <repositories>
-        <repository>
-            <snapshots>
-                <enabled>true</enabled>
-            </snapshots>
-            <id>vera-dev</id>
-            <name>Vera development</name>
-            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
-        </repository>
-    </repositories>
-    <pluginRepositories>
-        <pluginRepository>
-            <snapshots>
-                <enabled>false</enabled>
-            </snapshots>
-            <id>vera-dev</id>
-            <name>Vera development</name>
-            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
-        </pluginRepository>
-    </pluginRepositories>
-    <!-- other configurations... -->
-</project>
-```
+## Quick Start
-### Java code integration
+- [Python](https://opendataloader.org/docs/quick-start-python)
+- [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
+- [Docker](https://opendataloader.org/docs/quick-start-docker)
+- [Java](https://opendataloader.org/docs/quick-start-java)
-To integrate Layout recognition API into Java code, one can follow the sample code below.
+<br/>
-```java
-import org.opendataloader.pdf.api.Config;
-import org.opendataloader.pdf.api.OpenDataLoaderPDF;
+## Advanced Options
-import java.io.IOException;
+```python
+opendataloader_pdf.convert(
+    input_path="document.pdf",
+    output_dir="output/",
+    format="json,markdown,pdf",
-public class Sample {
+    # Reading order
+    reading_order="xycut",           # XY-Cut++ for multi-column
-    public static void main(String[] args) {
-        Config config = new Config();
-        config.setOutputFolder("path/to/output");
-        config.setGeneratePDF(true);
-        config.setGenerateMarkdown(true);
-        config.setGenerateHtml(true);
+    # Images
+    embed_images=True,               # Base64 in output
+    image_format="png",
-        try {
-            OpenDataLoaderPDF.processFile("path/to/document.pdf", config);
-        } catch (Exception exception) {
-            //exception during processing
-        }
-    }
-}
+    # Tagged PDF
+    use_struct_tree=True,            # Use native PDF structure
+)
 ```
-### API Documentation
-The full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/)
+[Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
 <br/>
-## Docker
-Download sample PDF
+## AI Safety
-```sh
-curl -L -o 1901.03003.pdf https://arxiv.org/pdf/1901.03003
-```
+PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
-Run opendataloader-pdf in Docker container
+- Hidden text (transparent, zero-size)
+- Off-page content
+- Suspicious invisible layers
-```
-docker run --rm -v "$PWD":/work ghcr.io/opendataloader-project/opendataloader-pdf-cli:latest /work/1901.03003.pdf -f json,html,pdf,markdown
-```
+This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
 <br/>
-## Developing with OpenDataLoader PDF
+## Tagged PDF Support
-### Build
+**Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
-Build and install using Maven command:
+**OpenDataLoader leverages this:**
-```sh
-mvn clean install -f java/pom.xml
-```
-If the build is successful, the resulting `jar` file will be created in the path below.
+- When a PDF has structure tags, we extract the **exact layout** the author intended
+- Headings, lists, tables, reading order — all preserved from the source
+- No guessing, no heuristics needed — **pixel-perfect semantic extraction**
-```sh
-java/opendataloader-pdf-cli/target
+```python
+opendataloader_pdf.convert(
+    input_path="accessible_document.pdf",
+    use_struct_tree=True  # Use native PDF structure tags
+)
 ```
-### CLI usage
+Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
-```sh
-java -jar opendataloader-pdf-cli-<VERSION>.jar [options] <INPUT FILE OR FOLDER>
-```
-This generates a JSON file with layout recognition results in the specified output folder.
-Additionally, annotated PDF with recognized structures, Markdown and Html are generated if options `--pdf`, `--markdown` and `--html` are specified.
+[Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
-By default all line breaks and hyphenation characters are removed, the Markdown does not include any images and does not use any HTML.
+<br/>
-The option `--keep-line-breaks` to preserve the original line breaks text content in JSON and Markdown output.
-The option `--content-safety-off` disables one or more content safety filters. Accepts a comma-separated list of filter names.
-The option `--markdown-with-html` enables use of HTML in Markdown, which may improve Markdown preview in processors that support HTML tags.
-The option `--markdown-with-images` enables inclusion of image references into the output Markdown.
-The option `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.
-The option `--use-struct-tree` enables processing structure tree (disabled by default).
-The images are extracted from PDF as individual files and stored in a subfolder next to the Markdown output.
+## LangChain Integration
-#### Available options:
+OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
-```
-Options:
--o,--output-dir <arg>           Specifies the output directory for generated files
--p,--password <arg>             Specifies the password for an encrypted PDF
--f,--format <arg>               Comma-separated output formats to generate. (json, text, html, pdf, markdown, markdown-with-html, markdown-with-images)
--q,--quiet                      Suppresses console logging output
---content-safety-off <arg>      Comma-separated content safety filters to disable. (all, hidden-text, off-page, tiny, hidden-ocg)
---keep-line-breaks              Preserves original line breaks in the extracted text
---replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
---use-struct-tree               Enables processing structure tree (disabled by default)
+```bash
+pip install -U langchain-opendataloader-pdf
 ```
-The legacy options (for backward compatibility):
+```python
+from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
-```
---no-json                       Disables the JSON output format
---html                          Sets the data extraction output format to HTML
---pdf                           Generates a new PDF file where the extracted layout data is visualized as annotations
---markdown                      Sets the data extraction output format to Markdown
---markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
---markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
+loader = OpenDataLoaderPDFLoader(
+    file_path=["document.pdf"],
+    format="text"
+)
+documents = loader.load()
+# Use with any LangChain pipeline
+for doc in documents:
+    print(doc.page_content[:100])
 ```
-### Schema of the JSON output
+- [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
+- [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
+- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
-Root json node
+<br/>
-| Field             | Type    | Optional | Description                        |
-|-------------------|---------|----------|------------------------------------|
-| file name         | string  | no       | Name of processed pdf file         |
-| number of pages   | integer | no       | Number of pages in pdf file        |
-| author            | string  | no       | Author of pdf file                 |
-| title             | string  | no       | Title of pdf file                  |
-| creation date     | string  | no       | Creation date of pdf file          |
-| modification date | string  | no       | Modification date of pdf file      |
-| kids              | array   | no       | Array of detected content elements |
+## Benchmarks
-Common fields of content json nodes
+We continuously benchmark against real-world documents.
-| Field        | Type    | Optional | Description                                                                                                                                                                                           |
-|--------------|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| id           | integer | yes      | Unique id of content element                                                                                                                                                                          |
-| level        | string  | yes      | Level of content element                                                                                                                                                                              |
-| type         | string  | no       | Type of content element<br/>Possible types: `footer`, `header`, `heading`, `line`, `table`, `table row`, `table cell`, `paragraph`, `list`, `list item`, `image`, `line art`, `caption`, `text block` |
-| page number  | integer | no       | Page number of content element                                                                                                                                                                        |
-| bounding box | array   | no       | Bounding box of content element                                                                                                                                                                       |
+[View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
-Specific fields of text content json nodes (`caption`, `heading`, `paragraph`)
+### Quick Comparison
-| Field      | Type   | Optional | Description       |
-|------------|--------|----------|-------------------|
-| font       | string | no       | Font name of text |
-| font size  | double | no       | Font size of text |
-| text color | array  | no       | Color of text     |
-| content    | string | no       | Text value        |
+| Engine             | Accuracy |      | Speed (s/page) |      | Reading Order |      | Table    |      | Heading  |      |
+|--------------------|----------|------|----------------|------|---------------|------|----------|------|----------|------|
+| **opendataloader** | 0.82     | #2   | **0.05**       | #1   | **0.91**      | #1   | 0.49     | #2   | 0.65     | #2   |
+| docling            | **0.88** | #1   | 0.73           | #4   | 0.90          | #2   | **0.89** | #1   | **0.80** | #1   |
+| pymupdf4llm        | 0.73     | #3   | 0.09           | #2   | 0.89          | #3   | 0.40     | #3   | 0.41     | #3   |
+| markitdown         | 0.58     | #4   | **0.04**       | #1   | 0.88          | #4   | 0.00     | #4   | 0.00     | #4   |
-Specific fields of `table` json nodes
+> Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. **Bold** indicates best performance.
-| Field             | Type    | Optional | Description                    |
-|-------------------|---------|----------|--------------------------------|
-| number of rows    | integer | no       | Number of table rows           |
-| number of columns | integer | no       | Number of table columns        |
-| rows              | array   | no       | Array of table rows            |
-| previous table id | integer | yes      | Id of previous connected table |
-| next table id     | integer | yes      | Id of next connected table     |
+### When to Use Each Engine
-Specific fields of `table row` json nodes
+| Use Case                 | Recommended Engine | Why                                                    |
+|--------------------------|--------------------|--------------------------------------------------------|
+| Best overall balance     | **opendataloader** | Fast (0.05s/page) with high reading order accuracy     |
+| Maximum accuracy         | docling            | Highest scores for tables and headings, but 16x slower |
+| Speed-critical pipelines | markitdown         | Fastest, but no table/heading extraction               |
+| PyMuPDF ecosystem        | pymupdf4llm        | Good balance if already using PyMuPDF                  |
-| Field      | Type    | Optional | Description          |
-|------------|---------|----------|----------------------|
-| row number | integer | no       | Number of table row  |
-| cells      | array   | no       | Array of table cells |
+### Visual Comparison
-Specific fields of `table cell` json nodes
+[![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)
-| Field         | Type    | Optional | Description                          |
-|---------------|---------|----------|--------------------------------------|
-| row number    | integer | no       | Row number of table cell             |
-| column number | integer | no       | Column number of table cell          |
-| row span      | integer | no       | Row span of table cell               |
-| column span   | integer | no       | Column span of table cell            |
-| kids          | array   | no       | Array of table cell content elements |
-Specific fields of `heading` json nodes
+<br/>
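The speed column in the Quick Comparison table can be checked with a few lines of arithmetic. This sketch uses only the rounded seconds-per-page values printed in the table; the variable names are illustrative.

```python
# Seconds per page, copied from the Quick Comparison table.
speed = {
    "opendataloader": 0.05,
    "docling": 0.73,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}

# Lower is better for speed, so rank in ascending order.
ranking = sorted(speed, key=speed.get)

# How many times slower docling is than opendataloader, per the rounded values.
slowdown = speed["docling"] / speed["opendataloader"]

print(ranking)
print(round(slowdown, 1))
```

With the rounded figures, the ratio comes out near 14.6 rather than the 16x quoted in the "When to Use Each Engine" table, and markitdown ranks ahead of opendataloader on speed. The README's numbers may rest on unrounded timings from the linked benchmark repository; that is an assumption, not something this diff confirms.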
-| Field         | Type    | Optional | Description              |
-|---------------|---------|----------|--------------------------|
-| heading level | integer | no       | Heading level of heading |
+## Roadmap
-Specific fields of `list` json nodes
+See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
-| Field                | Type    | Optional | Description                         |
-|----------------------|---------|----------|-------------------------------------|
-| number of list items | integer | no       | Number of list items                |
-| numbering style      | string  | no       | Numbering style of this list        |
-| previous list id     | integer | yes      | Id of previous connected list       |
-| next list id         | integer | yes      | Id of next connected list           |
-| list items           | array   | no       | Array of list item content elements |
+<br/>
-Specific fields of `list item` json nodes
+## Documentation
-| Field | Type  | Optional | Description                         |
-|-------|-------|----------|-------------------------------------|
-| kids  | array | no       | Array of list item content elements |
+- [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
+- [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
+- [CLI Options](https://opendataloader.org/docs/cli-options-reference)
+- [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
+- [AI Safety Features](https://opendataloader.org/docs/ai-safety)
-Specific fields of `header` and `footer` json nodes
+<br/>
-| Field | Type  | Optional | Description                             |
-|-------|-------|----------|-----------------------------------------|
-| kids  | array | no       | Array of header/footer content elements |
+## Frequently Asked Questions
-Specific fields of `text block` json nodes
+### What is the best PDF parser for RAG?
-| Field | Type  | Optional | Description                          |
-|-------|-------|----------|--------------------------------------|
-| kids  | array | no       | Array of text block content elements |
+For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
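The reading-order claim in the added FAQ answer can be illustrated with a minimal sketch of the classic recursive XY-cut on bounding boxes. This is not the library's XY-Cut++ and does not use the package's API. The `Box` shape, the top-left coordinate origin, and the function names `splitByGaps` and `xyCut` are assumptions made for illustration only.

```typescript
// Classic recursive XY-cut over bounding boxes (NOT the library's XY-Cut++).
// Assumption: top-left origin, so y grows downward.
interface Box {
  id: string;
  left: number;
  top: number;
  right: number;
  bottom: number;
}

type Axis = "x" | "y";

// Split boxes into groups separated by empty gaps along one axis.
function splitByGaps(boxes: Box[], axis: Axis): Box[][] {
  const lo = (b: Box) => (axis === "x" ? b.left : b.top);
  const hi = (b: Box) => (axis === "x" ? b.right : b.bottom);
  const sorted = [...boxes].sort((a, b) => lo(a) - lo(b));
  const groups: Box[][] = [];
  let current: Box[] = [];
  let reach = -Infinity; // furthest extent covered so far
  for (const box of sorted) {
    if (current.length > 0 && lo(box) > reach) {
      groups.push(current); // an empty gap ends the current group
      current = [];
    }
    current.push(box);
    reach = Math.max(reach, hi(box));
  }
  if (current.length > 0) groups.push(current);
  return groups;
}

// Order boxes by cutting into horizontal bands first, then into columns.
function xyCut(boxes: Box[]): Box[] {
  if (boxes.length <= 1) return [...boxes];
  const axes: Axis[] = ["y", "x"];
  for (const axis of axes) {
    const groups = splitByGaps(boxes, axis);
    if (groups.length > 1) {
      const ordered: Box[] = [];
      for (const group of groups) ordered.push(...xyCut(group));
      return ordered;
    }
  }
  // No clean gap on either axis: fall back to top-to-bottom, left-to-right.
  return [...boxes].sort((a, b) => a.top - b.top || a.left - b.left);
}

// A full-width title above two columns, given in scrambled order.
const page: Box[] = [
  { id: "b2", left: 55, top: 75, right: 100, bottom: 90 },
  { id: "a1", left: 0, top: 20, right: 45, bottom: 50 },
  { id: "title", left: 0, top: 0, right: 100, bottom: 10 },
  { id: "b1", left: 55, top: 20, right: 100, bottom: 70 },
  { id: "a2", left: 0, top: 60, right: 45, bottom: 90 },
];

console.log(xyCut(page).map((b) => b.id).join(" > "));
// prints "title > a1 > a2 > b1 > b2"
```

The sketch reads the left column fully before the right one because the two columns share no horizontal gap but are separated by a vertical one. When paragraphs in both columns happen to line up in rows, trying the horizontal cut first would read row by row, which is one of the cases a more careful cut-selection strategy has to handle.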
+### How do I extract tables from PDF for LLM?
-## 🤝 Contributing
+OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
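As a rough sketch of how structured table cells could be turned into a Markdown table for an LLM prompt, the following assumes the `row number` and `column number` field names from the 1.3.0 schema tables removed in this diff; the current schema may differ. The `text` field is hypothetical and stands in for the flattened content of a cell's `kids`. Row and column spans are ignored, and this is not the package's own exporter.

```typescript
interface TableCell {
  "row number": number;    // field name from the removed 1.3.0 schema table
  "column number": number; // field name from the removed 1.3.0 schema table
  text: string;            // hypothetical: flattened text of the cell's `kids`
}

// Render cells as a Markdown table, treating the first row as the header.
function cellsToMarkdown(cells: TableCell[]): string {
  if (cells.length === 0) return "";
  const rows = cells.map((c) => c["row number"]);
  const cols = cells.map((c) => c["column number"]);
  // Normalize by the minimum so 0-based and 1-based numbering both work.
  const minRow = Math.min(...rows);
  const minCol = Math.min(...cols);
  const rowCount = Math.max(...rows) - minRow + 1;
  const colCount = Math.max(...cols) - minCol + 1;

  // Start from an empty grid so missing cells render as blanks.
  const grid: string[][] = [];
  for (let r = 0; r < rowCount; r++) {
    const row: string[] = [];
    for (let c = 0; c < colCount; c++) row.push("");
    grid.push(row);
  }
  for (const cell of cells) {
    // Collapse whitespace and escape pipes so text cannot break the table.
    grid[cell["row number"] - minRow][cell["column number"] - minCol] =
      cell.text.replace(/\s+/g, " ").trim().replace(/\|/g, "\\|");
  }

  const toLine = (row: string[]) => `| ${row.join(" | ")} |`;
  const separator = toLine(grid[0].map(() => "---"));
  const [header, ...body] = grid;
  return [toLine(header), separator, ...body.map(toLine)].join("\n");
}

const sampleCells: TableCell[] = [
  { "row number": 1, "column number": 1, text: "Name" },
  { "row number": 1, "column number": 2, text: "Qty" },
  { "row number": 2, "column number": 1, text: "A|B" },
  { "row number": 2, "column number": 2, text: "3" },
];

console.log(cellsToMarkdown(sampleCells));
```

Normalizing by the smallest row and column number avoids having to assume whether the numbering starts at 0 or 1.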
-We believe that great software is built together.
+### Can I use this without sending data to the cloud?
-Your contributions are vital to the success of this project.
+Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
-Please read [CONTRIBUTING.md](https://github.com/hancom-inc/opendataloader-pdf/blob/main/CONTRIBUTING.md) for details on how to contribute.
+### What makes OpenDataLoader unique?
-## 💖 Community & Support
-Have questions or need a little help? We're here for you!🤗
+OpenDataLoader takes a different approach from many PDF parsers:
-- [GitHub Discussions](https://github.com/hancom-inc/opendataloader-pdf/discussions): For Q&A and general chats. Let's talk! 🗣️
-- [GitHub Issues](https://github.com/hancom-inc/opendataloader-pdf/issues): Found a bug? 🐛 Please report it here so we can fix it.
+- **Rule-based extraction** — Deterministic output without GPU requirements
+- **Bounding boxes for all elements** — Essential for citation systems
+- **XY-Cut++ reading order** — Handles multi-column layouts correctly
+- **Built-in AI safety filters** — Protects against prompt injection
+- **Native Tagged PDF support** — Leverages accessibility metadata
-## ✨ Our Branding and Trademarks
+This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
-We love our brand and want to protect it!
+<br/>
-This project may contain trademarks, logos, or brand names for our products and services.
+## Contributing
-To ensure everyone is on the same page, please remember these simple rules:
+We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
-- **Authorized Use**: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
-- **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
-- **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company’s specific policies.
+<br/>
-## ⚖️ License
+## License
-This project is licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).
+[Mozilla Public License 2.0](LICENSE)
-For the full license text, see [LICENSE](LICENSE).
+---
-For information on third-party libraries and components, see:
-- [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)
-- [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)
-- [licenses/](./THIRD_PARTY/licenses/)
+**Found this useful?** Give us a star to help others discover OpenDataLoader.