pdfmux 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

pdfmux-0.2.0/.claude/launch.json +11 -0
pdfmux-0.2.0/.github/workflows/ci.yml +48 -0
pdfmux-0.2.0/.github/workflows/publish.yml +21 -0
pdfmux-0.2.0/.gitignore +29 -0
pdfmux-0.2.0/Dockerfile +18 -0
pdfmux-0.2.0/LICENSE +21 -0
pdfmux-0.2.0/PKG-INFO +385 -0
pdfmux-0.2.0/README.md +343 -0
pdfmux-0.2.0/docs/ARCHITECTURE.md +18 -0
pdfmux-0.2.0/docs/CHANGELOG.md +38 -0
pdfmux-0.2.0/pyproject.toml +66 -0
pdfmux-0.2.0/site/index.html +323 -0
pdfmux-0.2.0/src/pdfmux/__init__.py +3 -0
pdfmux-0.2.0/src/pdfmux/cli.py +176 -0
pdfmux-0.2.0/src/pdfmux/detect.py +126 -0
pdfmux-0.2.0/src/pdfmux/extractors/__init__.py +19 -0
pdfmux-0.2.0/src/pdfmux/extractors/fast.py +39 -0
pdfmux-0.2.0/src/pdfmux/extractors/llm.py +124 -0
pdfmux-0.2.0/src/pdfmux/extractors/ocr.py +91 -0
pdfmux-0.2.0/src/pdfmux/extractors/tables.py +59 -0
pdfmux-0.2.0/src/pdfmux/formatters/__init__.py +1 -0
pdfmux-0.2.0/src/pdfmux/formatters/csv_fmt.py +91 -0
pdfmux-0.2.0/src/pdfmux/formatters/json_fmt.py +51 -0
pdfmux-0.2.0/src/pdfmux/formatters/markdown.py +41 -0
pdfmux-0.2.0/src/pdfmux/mcp_server.py +197 -0
pdfmux-0.2.0/src/pdfmux/pipeline.py +236 -0
pdfmux-0.2.0/src/pdfmux/postprocess.py +109 -0
pdfmux-0.2.0/tests/conftest.py +79 -0
pdfmux-0.2.0/tests/test_detect.py +47 -0
pdfmux-0.2.0/tests/test_extractors.py +76 -0
pdfmux-0.2.0/tests/test_pipeline.py +80 -0

pdfmux-0.2.0/.claude/launch.json ADDED

@@ -0,0 +1,11 @@
+{
+  "version": "0.0.1",
+  "configurations": [
+    {
+      "name": "site",
+      "runtimeExecutable": "npx",
+      "runtimeArgs": ["serve", "site/", "-l", "3456"],
+      "port": 3456
+    }
+  ]
+}

pdfmux-0.2.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,48 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - run: pip install ruff
+      - run: ruff check src/ tests/
+      - run: ruff format --check src/ tests/
+  test:
+    runs-on: ubuntu-latest
+    needs: lint
+    strategy:
+      matrix:
+        python-version: ["3.11", "3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - run: pip install -e ".[dev]"
+      - run: pytest -v
+  build:
+    runs-on: ubuntu-latest
+    needs: test
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - run: pip install build
+      - run: python -m build
+      - uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/

pdfmux-0.2.0/.github/workflows/publish.yml ADDED

@@ -0,0 +1,21 @@
+name: Publish to PyPI
+on:
+  release:
+    types: [published]
+permissions:
+  id-token: write
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    environment: pypi
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - run: pip install build
+      - run: python -m build
+      - uses: pypa/gh-action-pypi-publish@release/v1

pdfmux-0.2.0/.gitignore ADDED

@@ -0,0 +1,29 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+.env
+.env.*
+.env.local
+.venv/
+venv/
+*.pem
+*.key
+credentials
+credentials.*
+secrets
+secrets.*
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+.wrangler/
+*.pdf
+!tests/fixtures/*.pdf

pdfmux-0.2.0/Dockerfile ADDED

@@ -0,0 +1,18 @@
+FROM python:3.11-slim AS builder
+WORKDIR /app
+COPY pyproject.toml README.md ./
+COPY src/ ./src/
+RUN pip install --no-cache-dir build && \
+    python -m build --wheel
+FROM python:3.11-slim
+WORKDIR /app
+COPY --from=builder /app/dist/*.whl /tmp/
+RUN pip install --no-cache-dir /tmp/*.whl && \
+    rm /tmp/*.whl
+ENTRYPOINT ["pdfmux"]
+CMD ["--help"]

pdfmux-0.2.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Nameet Potnis
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

pdfmux-0.2.0/PKG-INFO ADDED

@@ -0,0 +1,385 @@
+Metadata-Version: 2.4
+Name: pdfmux
+Version: 0.2.0
+Summary: The smart PDF-to-Markdown router. One command, zero config, best extractor per document.
+Project-URL: Homepage, https://pdfmux.com
+Project-URL: Repository, https://github.com/NameetP/pdfmux
+Project-URL: Issues, https://github.com/NameetP/pdfmux/issues
+Author: Nameet Potnis
+License-Expression: MIT
+License-File: LICENSE
+Keywords: ai,converter,extraction,llm,markdown,mcp,pdf
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Requires-Python: >=3.11
+Requires-Dist: mcp>=1.0.0
+Requires-Dist: pymupdf4llm>=0.0.10
+Requires-Dist: pymupdf>=1.24.0
+Requires-Dist: rich>=13.0.0
+Requires-Dist: typer>=0.9.0
+Provides-Extra: all
+Requires-Dist: docling>=2.0.0; extra == 'all'
+Requires-Dist: google-genai>=1.0.0; extra == 'all'
+Requires-Dist: surya-ocr>=0.6.0; extra == 'all'
+Provides-Extra: dev
+Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
+Requires-Dist: pytest>=8.0.0; extra == 'dev'
+Requires-Dist: ruff>=0.4.0; extra == 'dev'
+Provides-Extra: llm
+Requires-Dist: google-genai>=1.0.0; extra == 'llm'
+Provides-Extra: ocr
+Requires-Dist: surya-ocr>=0.6.0; extra == 'ocr'
+Provides-Extra: tables
+Requires-Dist: docling>=2.0.0; extra == 'tables'
+Description-Content-Type: text/markdown
+# pdfmux
+The smart PDF-to-Markdown router. One command, zero config.
+```
+PDF ──→ pdfmux ──→ Markdown
+         │
+         ├─ digital?  → PyMuPDF     (0.01s/pg, free)
+         ├─ tables?   → Docling     (0.3s/pg, free)
+         ├─ scanned?  → Surya OCR   (1-5s/pg, free)
+         └─ complex?  → Gemini Flash (2-5s/pg, ~$0.01)
+```
+We don't convert PDFs. We route them to whichever tool converts them best.
+90% of PDFs are digital — converted in milliseconds, for free.
+## Quick Start
+```bash
+pip install pdfmux
+pdfmux invoice.pdf
+# ✓ invoice.pdf → invoice.md (2 pages, 95% confidence, via pymupdf4llm)
+```
+That's it. No config, no flags, no API keys needed.
+## Install
+```bash
+# core (handles digital PDFs — the vast majority)
+pip install pdfmux
+# add table extraction (Docling — 97.9% table accuracy)
+pip install pdfmux[tables]
+# add scanned PDF support (Surya OCR)
+pip install pdfmux[ocr]
+# add LLM fallback for hardest cases (Gemini Flash)
+pip install pdfmux[llm]
+# everything
+pip install pdfmux[all]
+```
+Requires Python 3.11+.
+## Usage
+### Convert a single file
+```bash
+pdfmux invoice.pdf
+# ✓ invoice.pdf → invoice.md (2 pages, 95% confidence, via pymupdf4llm)
+```
+Output is written to the same directory with a `.md` extension by default.
+### Specify output location
+```bash
+pdfmux report.pdf -o ./converted/report.md
+```
+### Batch convert a directory
+```bash
+pdfmux ./docs/ -o ./output/
+# Converting 12 PDFs from ./docs/...
+#   ✓ invoice.pdf → invoice.md (95%)
+#   ✓ contract.pdf → contract.md (92%)
+#   ✓ scan.pdf → scan.md (87%)
+# Done: 12 converted, 0 failed
+```
+### Output formats
+```bash
+# markdown (default)
+pdfmux report.pdf
+# json — structured output with metadata
+pdfmux report.pdf -f json
+# csv — extracts tables only
+pdfmux data.pdf -f csv
+```
+### Quality presets
+```bash
+# fast — PyMuPDF only, no ML (instant, free)
+pdfmux report.pdf -q fast
+# standard — auto-detect and route (default)
+pdfmux report.pdf -q standard
+# high — use LLM for everything (slow, costs ~$0.01/doc)
+pdfmux report.pdf -q high
+```
+### Other options
+```bash
+# show confidence score in output
+pdfmux report.pdf --confidence
+# print to stdout instead of file
+pdfmux report.pdf --stdout
+```
+### All CLI options
+| Option | Short | Default | Description |
+|--------|-------|---------|-------------|
+| `--output` | `-o` | Same dir, `.md` ext | Output file or directory |
+| `--format` | `-f` | `markdown` | Output format: `markdown`, `json`, `csv` |
+| `--quality` | `-q` | `standard` | Quality: `fast`, `standard`, `high` |
+| `--confidence` | | `false` | Include confidence score in output |
+| `--stdout` | | `false` | Print to stdout instead of writing file |
+## How It Works
+### Detection
+pdfmux opens each PDF with PyMuPDF and classifies it by inspecting every page:
+```
+For each page:
+  ├─ Has >50 chars of extractable text?  → digital
+  ├─ Has embedded images but no text?    → scanned
+  └─ Empty or minimal content?           → digital (empty page)
+Classification:
+  ├─ ≥80% digital pages  → digital PDF
+  ├─ ≥80% scanned pages  → scanned PDF
+  └─ Otherwise            → mixed PDF
+Table detection:
+  ├─ Check for ruled line patterns (≥3 horizontal + ≥2 vertical lines)
+  └─ Check for tab-separated or multi-space aligned text patterns
+```
+### Routing
+Based on classification, pdfmux picks the best extractor:
+```
+classify(pdf)
+  │
+  ├─ quality=fast? ────────────────→ PyMuPDF (always)
+  ├─ quality=high? ────────────────→ Gemini Flash → Surya → PyMuPDF
+  │
+  └─ quality=standard (default):
+       ├─ digital, no tables ──────→ PyMuPDF
+       ├─ has tables ──────────────→ Docling → PyMuPDF fallback
+       ├─ scanned ─────────────────→ Surya OCR → PyMuPDF fallback
+       ├─ mixed ───────────────────→ PyMuPDF (digital pgs) + Surya (scanned pgs)
+       └─ default ─────────────────→ PyMuPDF
+```
+If an optional extractor isn't installed, pdfmux silently falls back to the next best option. No errors, no config.
+### Post-processing
+After extraction, every result goes through:
+1. **Cleanup** — remove control characters, fix broken hyphenation, normalize blank lines
+2. **Confidence scoring** — text completeness, encoding quality, structure preservation, whitespace sanity
+3. **Formatting** — heading normalization, list marker standardization, optional YAML frontmatter
+### Extractors
+| Tier | Extractor | What it handles | Speed | Cost | Install |
+|------|-----------|----------------|-------|------|---------|
+| Fast | PyMuPDF / pymupdf4llm | Digital PDFs with clean text | 0.01s/page | Free | Base |
+| Tables | Docling | Table-heavy documents | 0.3-3s/page | Free | `pdfmux[tables]` |
+| OCR | Surya | Scanned / image-based PDFs | 1-5s/page | Free | `pdfmux[ocr]` |
+| LLM | Gemini 2.5 Flash | Complex layouts, handwriting, edge cases | 2-5s/page | ~$0.01/doc | `pdfmux[llm]` |
+## Output Formats
+### Markdown (default)
+Clean markdown optimized for LLM consumption:
+```markdown
+# Quarterly Report
+Revenue for Q3 increased by 15% year-over-year...
+## Financial Summary
+| Metric | Q3 2025 | Q3 2024 |
+|--------|---------|---------|
+| Revenue | $12.3M | $10.7M |
+| Profit | $3.1M | $2.4M |
+```
+### JSON
+Structured output with metadata, useful for pipelines:
+```json
+{
+  "source": "report.pdf",
+  "converter": "pdfmux",
+  "extractor": "pymupdf4llm (fast)",
+  "page_count": 5,
+  "confidence": 0.95,
+  "warnings": [],
+  "content": "# Quarterly Report\n\nRevenue for Q3...",
+  "pages": [
+    { "page": 1, "content": "# Quarterly Report..." },
+    { "page": 2, "content": "## Financial Summary..." }
+  ]
+}
+```
+### CSV
+Extracts tables from the document into CSV format:
+```csv
+Metric,Q3 2025,Q3 2024
+Revenue,$12.3M,$10.7M
+Profit,$3.1M,$2.4M
+```
+Raises an error if no tables are found in the document.
+## MCP Server
+pdfmux includes a built-in MCP (Model Context Protocol) server so AI agents can read PDFs natively.
+```bash
+pdfmux serve
+```
+### Claude Desktop / Cursor
+Add to your config:
+```json
+{
+  "mcpServers": {
+    "pdfmux": {
+      "command": "pdfmux",
+      "args": ["serve"]
+    }
+  }
+}
+```
+### Claude Code
+```bash
+claude mcp add pdfmux -- pdfmux serve
+```
+### Tool
+The server exposes a single `convert_pdf` tool over stdio:
+```json
+{
+  "name": "convert_pdf",
+  "description": "Convert a PDF to Markdown/JSON/CSV",
+  "parameters": {
+    "file_path": "string — path to the PDF file",
+    "format": "string — markdown | json | csv (default: markdown)",
+    "quality": "string — fast | standard | high (default: standard)"
+  }
+}
+```
+Your agent calls it, gets the extracted text back. No setup required.
+## Environment Variables
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `GEMINI_API_KEY` | Only for `pdfmux[llm]` | Google Gemini API key for LLM extraction |
+| `GOOGLE_API_KEY` | Alternative | Alternative env var for Gemini API key |
+No environment variables are needed for the base install or the `tables`/`ocr` extras.
+## Why Not Just Use X?
+| Tool | Good at | Limitation |
+|------|---------|-----------|
+| Marker | GPU ML extraction | Overkill for simple digital PDFs, needs GPU |
+| Docling | Tables (97.9% accuracy) | Slow on non-table documents |
+| pymupdf4llm | Fast digital text | Can't handle scanned or complex layouts |
+| MinerU | Full ML pipeline | Heavy, complex setup |
+| MarkItDown | Microsoft tool, wide format support | Not optimized for any specific PDF type |
+| **pdfmux** | Picking the right tool automatically | — |
+pdfmux uses these tools. It doesn't compete with them — it orchestrates them.
+The key insight: no single extractor wins on everything. PyMuPDF is 100x faster on digital PDFs. Docling is better at tables. Surya handles scans. Gemini catches what everything else misses. pdfmux routes each document to the right one.
+## Project Structure
+```
+src/pdfmux/
+├── cli.py              # Typer CLI (convert, serve, version)
+├── pipeline.py         # Tiered routing logic
+├── detect.py           # PDF type detection
+├── postprocess.py      # Cleanup + confidence scoring
+├── mcp_server.py       # MCP server (stdio JSON-RPC)
+├── extractors/
+│   ├── fast.py         # PyMuPDF — handles 90% of PDFs
+│   ├── tables.py       # Docling — table-heavy docs
+│   ├── ocr.py          # Surya — scanned PDFs
+│   └── llm.py          # Gemini Flash — hardest cases
+└── formatters/
+    ├── markdown.py     # Markdown output
+    ├── json_fmt.py     # JSON output
+    └── csv_fmt.py      # CSV output (tables only)
+```
+## Development
+```bash
+git clone https://github.com/NameetP/pdfmux.git
+cd pdfmux
+python3.12 -m venv .venv && source .venv/bin/activate
+pip install -e ".[dev]"
+# run tests
+pytest
+# lint
+ruff check src/ tests/
+ruff format src/ tests/
+```
+## License
+[MIT](LICENSE)
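
Note on the PKG-INFO README above: its "Detection" and "Routing" sections give the rules only as diagrams. The sketch below restates those rules as plain Python so the thresholds are easier to check. It is not the code from `detect.py` or `pipeline.py`. The names `PageInfo`, `classify_page`, `classify_document`, and `extractor_chain` are hypothetical, the extractor labels are illustrative strings, and two readings are assumptions: a page with images and at most 50 characters is treated as scanned, and the branch order for `standard` quality follows the order of the routing diagram.

```python
from dataclasses import dataclass

TEXT_THRESHOLD = 50  # README: ">50 chars of extractable text" means digital
MAJORITY = 0.8       # README: a ">=80%" share of pages decides the document type


@dataclass
class PageInfo:
    """Hypothetical per-page summary; not a pdfmux type."""
    text_chars: int
    image_count: int


def classify_page(page: PageInfo) -> str:
    if page.text_chars > TEXT_THRESHOLD:
        return "digital"
    if page.image_count > 0:
        # Assumption: "no text" is read as "at or below the threshold".
        return "scanned"
    return "digital"  # empty or minimal pages count as digital


def classify_document(pages: list[PageInfo]) -> str:
    if not pages:
        return "digital"  # assumption: an empty document needs no OCR
    kinds = [classify_page(p) for p in pages]
    if kinds.count("digital") / len(kinds) >= MAJORITY:
        return "digital"
    if kinds.count("scanned") / len(kinds) >= MAJORITY:
        return "scanned"
    return "mixed"


def extractor_chain(doc_type: str, has_tables: bool,
                    quality: str = "standard") -> list[str]:
    """Return extractor labels in the order the README's routing diagram lists them."""
    if quality == "fast":
        return ["pymupdf"]
    if quality == "high":
        return ["gemini-flash", "surya", "pymupdf"]
    # quality == "standard": branches follow the diagram order (assumption)
    if doc_type == "digital" and not has_tables:
        return ["pymupdf"]
    if has_tables:
        return ["docling", "pymupdf"]
    if doc_type == "scanned":
        return ["surya", "pymupdf"]
    if doc_type == "mixed":
        # Not a fallback chain: PyMuPDF for digital pages, Surya for scanned pages
        return ["pymupdf", "surya"]
    return ["pymupdf"]
```

For example, a five-page document with four text pages and one image-only page sits exactly on the 80% boundary and is classified as digital under this reading, so in `standard` quality it would go to PyMuPDF alone unless tables are detected.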