glmmedia-ocr 0.1.0__tar.gz

Files changed (18)

glmmedia_ocr-0.1.0/PKG-INFO +851 -0
glmmedia_ocr-0.1.0/README.md +836 -0
glmmedia_ocr-0.1.0/pyproject.toml +27 -0
glmmedia_ocr-0.1.0/setup.cfg +4 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/__init__.py +3 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/__main__.py +4 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/cli.py +307 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/config.py +195 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/inputs.py +70 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/ollama.py +153 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/pipeline.py +178 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr/spinner.py +35 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr.egg-info/PKG-INFO +851 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr.egg-info/SOURCES.txt +16 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr.egg-info/dependency_links.txt +1 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr.egg-info/entry_points.txt +2 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr.egg-info/requires.txt +4 -0
glmmedia_ocr-0.1.0/src/glmmedia_ocr.egg-info/top_level.txt +1 -0

glmmedia_ocr-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,851 @@
+Metadata-Version: 2.4
+Name: glmmedia-ocr
+Version: 0.1.0
+Summary: Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama
+Author-email: dusy4 <dusy4@users.noreply.github.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/dusy4/glmmedia-ocr
+Project-URL: Bug Tracker, https://github.com/dusy4/glmmedia-ocr/issues
+Requires-Python: >=3.12
+Description-Content-Type: text/markdown
+Requires-Dist: glmocr[selfhosted]
+Requires-Dist: pypdfium2
+Requires-Dist: Pillow
+Requires-Dist: pyyaml
+# glmmedia-ocr
+Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama. Fully self-contained — zero ongoing maintenance after install.
+```bash
+npm install -g glmmedia-ocr
+glmmedia-ocr scan invoice.pdf
+# → invoice.md written
+```
+---
+## Table of Contents
+- [Requirements](#requirements)
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [CLI Reference](#cli-reference)
+- [How It Works](#how-it-works)
+- [Architecture](#architecture)
+- [Output Format](#output-format)
+- [Configuration](#configuration)
+- [GPU Support](#gpu-support)
+- [Troubleshooting](#troubleshooting)
+- [Project Structure](#project-structure)
+- [Under the Hood](#under-the-hood)
+- [License](#license)
+---
+## Requirements
+Only two things need to be on your machine before installing:
+| Requirement | Why | Where |
+|---|---|---|
+| **Python 3.12 or 3.13** | Runs the GLM-OCR SDK | [python.org](https://www.python.org/downloads/) |
+| **Ollama** (installed, not necessarily running) | Serves the `glm-ocr` model locally | [ollama.com/download](https://ollama.com/download) |
+That's it. Everything else — the Python virtual environment, all dependencies, and the Ollama process lifecycle — is managed automatically by the package.
+> **Note:** Python 3.14+ is not yet supported. The GLM-OCR SDK and its dependencies (PyTorch, Transformers) only publish wheels for Python 3.10–3.13.
+---
+## Installation
+### npm (recommended)
+```bash
+npm install -g glmmedia-ocr
+```
+This triggers a `postinstall` script that:
+1. Creates a dedicated Python virtual environment inside the package (`.venv/`)
+2. Installs `glmocr[selfhosted]` with **CPU-only PyTorch** into the venv
+3. Verifies the installation by importing the SDK
+The first install takes a few minutes while pip downloads ~1-2GB of dependencies. This is a one-time cost.
+### pip
+```bash
+pip install glmmedia-ocr
+```
+Or from source:
+```bash
+git clone https://github.com/dusy4/glmmedia-ocr.git
+cd glmmedia-ocr
+pip install .
+```
+This installs the same dependencies directly into your Python environment and registers the `glmmedia-ocr` CLI command. Both npm and pip packages provide the exact same functionality and CLI interface.
+### GPU install (optional)
+By default, the npm package installs CPU-only PyTorch to avoid GPU resource competition with Ollama. If you have a GPU and want to use it for layout detection:
+```bash
+# npm
+GLMOCR_GPU=1 npm install -g glmmedia-ocr
+# pip: on Linux, the default PyTorch wheel from PyPI already includes CUDA support
+pip install .
+```
+### Reinstall / repair
+```bash
+# npm
+npm rebuild glmmedia-ocr
+# pip
+pip install --force-reinstall .
+```
+---
+## Quick Start
+```bash
+# Single PDF
+glmmedia-ocr scan invoice.pdf
+# Single image
+glmmedia-ocr scan receipt.png
+# Multiple images
+glmmedia-ocr scan page1.png page2.png page3.png
+# Mixed PDFs and images
+glmmedia-ocr scan report.pdf page1.png page2.png
+# All images in a directory
+glmmedia-ocr scan ./images/
+# All images in directory + subdirectories
+glmmedia-ocr scan ./images/ --recursive
+# Shell glob
+glmmedia-ocr scan *.png
+# Custom output path
+glmmedia-ocr scan contract.pdf --output ./results/contract.md
+# Higher DPI for better OCR quality
+glmmedia-ocr scan receipt.pdf --dpi 300
+# Connect to a remote Ollama instance
+glmmedia-ocr scan report.pdf --ollama-host 192.168.1.100:11434
+# Faster processing with parallel workers
+glmmedia-ocr scan book.pdf --concurrency 2
+# Debug logging to see layout detection progress
+glmmedia-ocr scan document.pdf --log-level DEBUG
+```
+### First run
+On the very first run, the CLI will:
+1. Detect that Ollama is not running and start it automatically
+2. Detect that the `glm-ocr:latest` model is not pulled and download it (~2.2GB)
+3. Process your input
+4. Shut down Ollama on exit (since it started it)
+Subsequent runs skip steps 1 and 2 if Ollama is already running and the model is cached.
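
The model check in step 2 can also be approximated over Ollama's HTTP API instead of `ollama list`. The sketch below is an illustration, not the package's code: it assumes the server's `/api/tags` endpoint (which lists locally available models), and the helper names `fetch_tags` and `model_in_tags` are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def model_in_tags(tags_payload: dict, model: str = "glm-ocr:latest") -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(m.get("name") == model for m in tags_payload.get("models", []))


def fetch_tags(host: str = "localhost:11434", timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed /api/tags payload, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"http://{host}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: treat the server as not running.
        return None
```

A `None` result from `fetch_tags()` corresponds to step 1 (Ollama must be started), and a `False` result from `model_in_tags(...)` corresponds to step 2 (the model must be pulled).
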
+---
+## CLI Reference
+```
+glmmedia-ocr scan <input...> [options]
+Inputs:
+  <file.pdf>                   Single PDF file
+  <image.png>                  Single image file (PNG, JPEG, WebP, BMP, TIFF, GIF)
+  <img1.png> <img2.png> ...    Multiple image files
+  <directory>/                 Directory of images (use --recursive for subfolders)
+Input/Output:
+  --output <path>              Output .md path (default: auto-generated from input names)
+  --recursive                  Scan directories recursively for images
+Rendering:
+  --dpi <number>               Render DPI for PDFs (default: 200)
+  --image-format <format>      Image format: PNG, JPEG, WEBP (default: PNG)
+  --min-pixels <number>        Minimum image pixels (default: 12544)
+  --max-pixels <number>        Maximum image pixels (default: 71372800)
+  --patch-expand-factor <n>    Patch expansion factor (default: 1)
+  --t-patch-size <n>           T-patch size (default: 2)
+  --image-expect-length <n>    Image expect length (default: 6144)
+Generation:
+  --max-tokens <number>        Max generation tokens (default: 8192)
+  --temperature <float>        Sampling temperature (default: 0.0)
+  --top-p <float>              Top-p sampling (default: 0.00001)
+  --top-k <number>             Top-k sampling (default: 1)
+  --repetition-penalty <float> Repetition penalty (default: 1.1)
+Layout (PP-DocLayoutV3):
+  --layout-device <device>     Device: cpu, cuda, cuda:N (default: cpu)
+  --layout-model-dir <path>    Custom layout model directory
+  --layout-threshold <float>   Detection threshold (default: 0.3)
+  --layout-batch-size <n>      Layout batch size (default: 1)
+  --layout-use-polygon         Use polygon masks for cropping
+  --no-layout-nms              Disable layout NMS
+  --layout-merge-mode <mode>   Merge overlapping bboxes: large|small (default: large)
+  --layout-workers <n>         Layout workers (default: 1)
+Result formatting:
+  --output-format <format>     Output: markdown, json, both (default: markdown)
+  --no-merge-formula-numbers   Disable formula number merging
+  --no-merge-text-blocks       Disable text block merging
+  --no-format-bullet-points    Disable bullet point formatting
+Pipeline:
+  --concurrency <number>       Parallel OCR workers (default: 1)
+  --page-maxsize <number>      Page queue max size (default: 100)
+  --region-maxsize <number>    Region queue max size (default: 2000)
+Ollama / API:
+  --ollama-host <host>         Ollama host (default: localhost:11434)
+  --ollama-num-ctx <n>         Ollama num_ctx for glm-ocr (default: 8192; 0 = omit)
+  --api-scheme <scheme>        API scheme: http, https (default: auto)
+  --api-key <key>              API key for MaaS providers
+  --verify-ssl                 Enable SSL verification
+  --connect-timeout <seconds>  Connect timeout (default: 30)
+  --request-timeout <seconds>  Request timeout (default: 120)
+MaaS (Zhipu Cloud):
+  --maas                       Enable MaaS mode (disables local OCR)
+  --maas-api-url <url>         MaaS API URL
+  --maas-model <model>         MaaS model name
+  --maas-api-key <key>         MaaS API key
+  --no-maas-verify-ssl         Disable MaaS SSL verification
+  --maas-connect-timeout <s>   MaaS connect timeout (default: 30)
+  --maas-request-timeout <s>   MaaS request timeout (default: 300)
+  --maas-retry-attempts <n>    MaaS retry attempts (default: 2)
+Logging:
+  --log-level <level>          Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
+```
+### Flag Details
+#### Inputs
+| Input type | Description |
+|---|---|
+| `<file.pdf>` | One or more PDF files. Each page is preceded by a `<!-- PAGE N -->` marker in the output. |
+| `<image.png>` | One or more image files. Supported: PNG, JPEG, WebP, BMP, TIFF, GIF. |
+| `<file.pdf> <img.png>` | Mixed PDFs and images. Pages are merged in input order. |
+| `<directory>/` | Directory of images. Scans flat by default; use `--recursive` for subfolders. |
+#### Input/Output
+| Flag | Default | Description |
+|---|---|---|
+| `--output` | auto-generated | Where to write the Markdown output. Single input → `<name>.md`. Multiple inputs → `<name1>_<name2>_output.md`. `--output` overrides all. |
+| `--recursive` | off | When a directory is passed, recurse into subdirectories for images. |
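
The default naming rule for `--output` can be sketched as follows. This is a guess at the behavior from the table above, not the CLI's actual code: the function name is hypothetical, and it assumes every input stem is joined when more than two inputs are given and that the file lands in the current directory.

```python
from pathlib import Path


def default_output_path(inputs: list[str]) -> Path:
    """Derive the output .md path from input names when --output is not given."""
    stems = [Path(p).stem for p in inputs]
    if len(stems) == 1:
        return Path(f"{stems[0]}.md")
    # Assumption: all stems are joined; the real CLI may shorten long lists.
    return Path("_".join(stems) + "_output.md")
```

For example, `default_output_path(["report.pdf", "page1.png"])` gives `report_page1_output.md`.
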
+#### Rendering
+| Flag | Default | Description |
+|---|---|---|
+| `--dpi` | `200` | Resolution for rendering PDF pages to images. Higher DPI improves OCR accuracy but increases processing time and memory usage. Recommended: 200-300. |
+| `--image-format` | `PNG` | Format for images sent to the OCR API. `PNG` is lossless (best for code, diagrams). `JPEG` is smaller (best for text documents). `WEBP` is smallest but may not be supported by all backends. |
+| `--min-pixels` | `12544` | Minimum image pixel count (112×112). Images smaller than this are upscaled. |
+| `--max-pixels` | `71372800` | Maximum image pixel count. Images larger than this are downscaled. |
+| `--patch-expand-factor` | `1` | Patch expansion factor for image processing. |
+| `--t-patch-size` | `2` | T-patch size for image processing. |
+| `--image-expect-length` | `6144` | Expected image token length. |
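
To make the `--min-pixels` and `--max-pixels` bounds concrete, here is a minimal sketch of rescaling an image so its pixel count falls inside the budget while keeping the aspect ratio. It is an assumption about how such a clamp could work; the SDK's own resizing may differ (for example by snapping to patch-size multiples), and the function name is hypothetical.

```python
import math


def clamp_to_pixel_budget(width, height, min_pixels=12544, max_pixels=71372800):
    """Return (width, height) rescaled so width*height lies within the budget.

    Expects positive dimensions.
    """
    pixels = width * height
    if pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
        # Round up so the result does not fall below the minimum.
        return math.ceil(width * scale), math.ceil(height * scale)
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
        # Round down so the result does not exceed the maximum.
        return math.floor(width * scale), math.floor(height * scale)
    return width, height
```

A 56×56 thumbnail (3,136 pixels) is scaled by a factor of 2 to 112×112, which is exactly the 12,544-pixel minimum.
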
+#### Generation
+| Flag | Default | Description |
+|---|---|---|
+| `--max-tokens` | `8192` | Maximum tokens generated per region. Increase for very dense pages. |
+| `--temperature` | `0.0` | Sampling temperature. `0.0` = deterministic (recommended for OCR). |
+| `--top-p` | `0.00001` | Top-p (nucleus) sampling. Keep very low for OCR. |
+| `--top-k` | `1` | Top-k sampling. `1` = always pick the most likely token. |
+| `--repetition-penalty` | `1.1` | Penalty for repeating tokens. Prevents the model from getting stuck in loops. |
+#### Layout (PP-DocLayoutV3)
+| Flag | Default | Description |
+|---|---|---|
+| `--layout-device` | `cpu` | Device for the PP-DocLayoutV3 layout detection model. `cpu` avoids GPU memory competition with Ollama. Use `cuda` or `cuda:N` for GPU. |
+| `--layout-model-dir` | (SDK default) | Path to a custom PP-DocLayoutV3 model directory. Leave unset to use the SDK's built-in default. |
+| `--layout-threshold` | `0.3` | Confidence threshold for layout detection. Lower values detect more regions (may include false positives). |
+| `--layout-batch-size` | `1` | Max images per layout model forward pass. If you raised it and hit out-of-memory errors, set it back to `1`. |
+| `--layout-use-polygon` | off | Use polygon masks for region cropping instead of bounding boxes. More precise for rotated or staggered layouts. |
+| `--no-layout-nms` | off | Disable non-maximum suppression for layout detection. |
+| `--layout-merge-mode` | `large` | How to merge overlapping bounding boxes. `large` keeps the larger region, `small` keeps the smaller one. |
+| `--layout-workers` | `1` | Number of layout detection workers. |
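
The difference between the `large` and `small` merge modes can be illustrated with a small sketch. The overlap criterion (intersection divided by the smaller box's area) and the `min_overlap` threshold are assumptions made for this example; the layout model's real merging rule is not documented here.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    smaller = min(area(a), area(b))
    return inter / smaller if smaller else 0.0


def merge_overlapping(boxes, mode="large", min_overlap=0.9):
    """Among heavily overlapping boxes keep the larger ("large") or smaller ("small")."""
    # Visit preferred boxes first so they win over the boxes they overlap.
    ordered = sorted(boxes, key=area, reverse=(mode == "large"))
    kept = []
    for box in ordered:
        if all(overlap_ratio(box, k) < min_overlap for k in kept):
            kept.append(box)
    return kept
```

With a paragraph box nested inside a column box, `large` keeps the column and `small` keeps the paragraph; boxes that do not overlap are kept in both modes.
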
+#### Result Formatting
+| Flag | Default | Description |
+|---|---|---|
+| `--output-format` | `markdown` | Output format: `markdown`, `json`, or `both`. |
+| `--no-merge-formula-numbers` | off | Disable automatic merging of formula numbers with their equations. |
+| `--no-merge-text-blocks` | off | Disable automatic merging of adjacent text blocks. |
+| `--no-format-bullet-points` | off | Disable automatic bullet point formatting normalization. |
+#### Pipeline
+| Flag | Default | Description |
+|---|---|---|
+| `--concurrency` | `1` | Number of parallel OCR workers. Increase for faster processing on multi-page documents. Set to `1` for maximum stability with Ollama. |
+| `--page-maxsize` | `100` | Maximum number of pages queued for processing. |
+| `--region-maxsize` | `2000` | Maximum number of regions queued for OCR. |
+#### Ollama / API
+| Flag | Default | Description |
+|---|---|---|
+| `--ollama-host` | `localhost:11434` | Ollama server address. Use this to connect to a remote or non-standard Ollama instance. |
+| `--ollama-num-ctx` | `8192` | Ollama `num_ctx` parameter for glm-ocr. Prevents GGML tensor size crashes. Set to `0` to omit. |
+| `--api-scheme` | auto | API URL scheme: `http` or `https`. Auto-detects based on port (HTTPS if 443). |
+| `--api-key` | null | API key for MaaS providers (Zhipu, OpenAI, etc.). |
+| `--verify-ssl` | off | Enable SSL certificate verification for API requests. |
+| `--connect-timeout` | `30` | Connection timeout in seconds. |
+| `--request-timeout` | `120` | Request timeout in seconds. |
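
The `auto` behavior of `--api-scheme` can be sketched as below. This is a minimal illustration under stated assumptions: the function name is hypothetical, a missing port falls back to Ollama's default 11434, and IPv6 literals are not handled.

```python
def resolve_base_url(host: str, scheme: str = "auto", default_port: int = 11434) -> str:
    """Build a base URL from a host[:port] string and an http/https/auto scheme."""
    name, sep, port_text = host.rpartition(":")
    if sep and port_text.isdigit():
        port = int(port_text)
    else:
        # No explicit port: fall back to Ollama's default.
        name, port = host, default_port
    if scheme == "auto":
        # Port 443 implies HTTPS; everything else is treated as plain HTTP.
        scheme = "https" if port == 443 else "http"
    return f"{scheme}://{name}:{port}"
```

An explicit `http` or `https` value bypasses the port check, which matters for HTTPS endpoints served on a non-standard port.
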
+#### MaaS (Zhipu Cloud)
+| Flag | Default | Description |
+|---|---|---|
+| `--maas` | off | Enable MaaS mode. Sends requests directly to Zhipu's cloud API. Disables local OCR and Ollama checks. |
+| `--maas-api-url` | Zhipu default | MaaS API endpoint URL. |
+| `--maas-model` | `glm-ocr` | MaaS model name. |
+| `--maas-api-key` | null | MaaS API key (or set `ZHIPU_API_KEY` env var). |
+| `--no-maas-verify-ssl` | off | Disable SSL verification for MaaS requests. |
+| `--maas-connect-timeout` | `30` | MaaS connection timeout in seconds. |
+| `--maas-request-timeout` | `300` | MaaS request timeout in seconds. |
+| `--maas-retry-attempts` | `2` | Number of retry attempts for transient MaaS errors. |
+#### Logging
+| Flag | Default | Description |
+|---|---|---|
+| `--log-level` | `INFO` | Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. Use `DEBUG` to see detailed timing and layout detection progress. |
+---
+## How It Works
+### Startup Sequence
+```
+glmmedia-ocr scan invoice.pdf
+│
+├─ 1. Preflight Checks
+│   ├─ Python 3.12 or 3.13 found?
+│   ├─ Ollama binary on PATH? (skipped if --maas)
+│   └─ GLM-OCR SDK importable in managed venv?
+│
+├─ 2. Ollama Lifecycle (skipped if --maas)
+│   ├─ Is Ollama already running? (GET localhost:11434)
+│   ├─ If yes → use it, leave it running after exit
+│   └─ If no → spawn ollama serve, wait until healthy
+│
+├─ 3. Model Check (skipped if --maas)
+│   ├─ Is glm-ocr:latest pulled? (ollama list)
+│   └─ If no → ollama pull glm-ocr:latest (~2.2GB, one-time)
+│
+├─ 4. Pipeline Execution
+│   ├─ PDF: Render pages to images (pypdfium2, in-memory, capped to 2000px)
+│   │   Images: Load and cap to 2000px (no rendering step)
+│   ├─ Run layout detection (PP-DocLayoutV3) — progress logged to stderr
+│   ├─ OCR each region via Ollama (/api/generate) or MaaS
+│   └─ Merge results with page markers
+│
+└─ 5. Cleanup
+    ├─ Write output .md
+    └─ Shut down Ollama (only if CLI started it)
+```
+### Ollama Ownership Tracking
+The CLI tracks whether it started Ollama or found it already running:
+| Scenario | CLI behavior |
+|---|---|
+| Ollama was already running | Uses it, leaves it running on exit |
+| CLI started Ollama | Shuts it down on normal exit, SIGINT, or SIGTERM |
+| CLI crashes | Still shuts down Ollama via signal trap |
+This means you can run Ollama manually before using the CLI, and it won't be touched.
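
One way to express this ownership rule is a context manager that remembers whether it spawned the process. The sketch below is not the package's implementation: `OllamaGuard` is a hypothetical name, the health check is injected as a callable, and waiting until the server is healthy is omitted.

```python
import subprocess
from typing import Callable, Optional


class OllamaGuard:
    """Start Ollama only if needed, and stop it only if this guard started it."""

    def __init__(
        self,
        is_running: Callable[[], bool],
        spawn: Callable[[], subprocess.Popen] = lambda: subprocess.Popen(["ollama", "serve"]),
    ):
        self._is_running = is_running
        self._spawn = spawn
        self._proc: Optional[subprocess.Popen] = None  # set only when we own the process

    @property
    def owned(self) -> bool:
        return self._proc is not None

    def __enter__(self):
        if not self._is_running():
            self._proc = self._spawn()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and on exceptions, including KeyboardInterrupt (SIGINT).
        if self._proc is not None:
            self._proc.terminate()
            self._proc.wait(timeout=10)
            self._proc = None
        return False  # never swallow the original exception
```

Python does not run `__exit__` on an unhandled SIGTERM, so covering that case would need a `signal.signal(signal.SIGTERM, ...)` handler that raises `SystemExit`.
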
+---
+## Architecture
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     User (CLI)                              │
+│   glmmedia-ocr scan invoice.pdf  (or *.png, ./images/)     │
+└──────────────────────────┬──────────────────────────────────┘
+                           │
+┌──────────────────────────▼──────────────────────────────────┐
+│              bin/glmmedia-ocr.js (Node.js)                  │
+│                                                             │
+│  ┌─────────────┐  ┌──────────────┐  ┌───────────────────┐  │
+│  │  Preflight  │  │   Ollama     │  │   Model Check     │  │
+│  │  Checks     │  │  Lifecycle   │  │   (pull if needed)│  │
+│  └──────┬──────┘  └──────┬───────┘  └────────┬──────────┘  │
+│         │                │                    │              │
+│         └────────────────┼────────────────────┘              │
+│                          │                                   │
+│              ┌───────────▼────────────┐                      │
+│              │  Resolve inputs        │                      │
+│              │  (files, dirs, globs)  │                      │
+│              └───────────┬────────────┘                      │
+│                          │                                   │
+│              ┌───────────▼────────────┐                      │
+│              │  Generate config.yaml  │                      │
+│              │  (full SDK template)   │                      │
+│              └───────────┬────────────┘                      │
+│                          │                                   │
+│              ┌───────────▼────────────┐                      │
+│              │  Spawn Python Pipeline │                      │
+│              │  lib/pipeline.py       │                      │
+│              └───────────┬────────────┘                      │
+└──────────────────────────┼──────────────────────────────────┘
+                           │
+┌──────────────────────────▼──────────────────────────────────┐
+│              lib/pipeline.py (Python)                       │
+│                                                             │
+│  ┌──────────────────┐    ┌──────────────────────────────┐  │
+│  │  PDF: pypdfium2  │    │  GlmOcr SDK (selfhosted)     │  │
+│  │  Image: PIL open │───▶│  ┌────────────────────────┐  │  │
+│  │  (2000px cap)    │    │  │ PP-DocLayoutV3         │  │  │
+│  └──────────────────┘    │  │ (Transformers + CPU    │  │  │
+│                          │  │  PyTorch layout detect) │  │  │
+│                          │  └───────────┬────────────┘  │  │
+│                          │              │                │  │
+│                          │  ┌───────────▼────────────┐  │  │
+│                          │  │ OCRClient              │  │  │
+│                          │  │ → Ollama /api/generate │  │  │
+│                          │  └────────────────────────┘  │  │
+│                          └──────────────────────────────┘  │
+│                                     │                       │
+│                          ┌──────────▼────────────┐          │
+│                          │  Merge + Page Markers │          │
+│                          │  → output.md          │          │
+│                          └───────────────────────┘          │
+└─────────────────────────────────────────────────────────────┘
+```
+### Key Design Decisions
+| Decision | Rationale |
+|---|---|
+| **Managed `.venv`** | The package owns its Python environment. Never touches the user's global Python. Reproducible, isolated, self-contained. |
+| **CPU-only PyTorch by default** | Avoids GPU memory competition with Ollama. Smaller venv (~1-2GB vs 4GB+). Layout detection on CPU is fast enough for most documents. |
+| **Ollama `/api/generate` mode** | Official GLM-OCR recommendation for Ollama. More stable than the OpenAI-compatible endpoint for vision requests. |
+| **pypdfium2 for PDF rendering** | Ships its own PDFium binary in the wheel. Zero system dependencies. Renders directly to PIL images in-memory — no temp files, no subprocess calls. |
+| **2000px image cap** | Balances OCR quality with model stability. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS. Prevents GGML tensor size crashes on Ollama. |
+| **Full SDK config** | Generates a complete `config.yaml` matching the SDK's template on every run. SDK options are exposed as CLI flags (see [CLI Reference](#cli-reference)). |
+| **Per-page error tolerance** | A failed page gets a placeholder in the output. The rest of the document continues processing. |
+---
+## Output Format
+The output Markdown file contains clear page boundaries:
+```markdown
+<!-- PAGE 1 -->
+# Invoice
+**Invoice Number:** INV-2024-0042
+**Date:** January 15, 2024
+| Item | Quantity | Price |
+|------|----------|-------|
+| Widget A | 10 | $50.00 |
+| Widget B | 5 | $75.00 |
+**Total: $875.00**
+---
+<!-- PAGE 2 -->
+## Terms and Conditions
+1. Payment is due within 30 days.
+2. Late payments incur a 2% monthly fee.
+---
+```
+### Page Markers
+Each page is delimited by:
+- `<!-- PAGE N -->` — HTML comment identifying the page number
+- `---` — Markdown horizontal rule as a visual separator
+### Failed Pages
+If a page fails OCR (e.g., Ollama timeout, model error), it gets a placeholder:
+```markdown
+<!-- PAGE 4 -->
+<!-- PAGE 4: OCR failed — API request failed after 3 attempts -->
+---
+```
+The rest of the document continues processing normally.
+---
+## Configuration
+### Environment Variables
+| Variable | Default | Description |
+|---|---|---|
+| `GLMOCR_GPU` | `0` | Set to `1` during install to use GPU PyTorch instead of CPU-only. |
+### Internal Config (auto-generated)
+The CLI generates a temporary YAML config for each run. All SDK options are exposed as CLI flags:
+```yaml
+# Example of generated config (abbreviated)
+pipeline:
+  maas:
+    enabled: false
+  ocr_api:
+    api_host: localhost
+    api_port: 11434
+    api_path: /api/generate
+    api_mode: ollama_generate
+    model: glm-ocr:latest
+    connect_timeout: 30
+    request_timeout: 120
+  max_workers: 1
+  page_maxsize: 100
+  region_maxsize: 2000
+  page_loader:
+    max_tokens: 8192
+    temperature: 0.0
+    top_p: 0.00001
+    top_k: 1
+    repetition_penalty: 1.1
+    image_format: PNG
+    min_pixels: 12544
+    max_pixels: 71372800
+  result_formatter:
+    output_format: markdown
+    enable_merge_formula_numbers: true
+    enable_merge_text_blocks: true
+    enable_format_bullet_points: true
+  layout:
+    device: "cpu"
+    threshold: 0.3
+    batch_size: 1
+    use_polygon: false
+    layout_nms: true
+    layout_merge_bboxes_mode: large
+```
+This config is written to a temp directory before each run and cleaned up afterward. Users don't need to manage it manually.
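
As a rough sketch of how a few flags could map onto the abbreviated config above, the function below builds the nested structure as a plain dict. The function name is hypothetical, only a subset of keys is shown, and the mapping of `--concurrency` to `max_workers` is an assumption based on the matching defaults.

```python
def build_config(ollama_host="localhost:11434", concurrency=1,
                 layout_device="cpu", request_timeout=120):
    """Return a nested dict shaped like the abbreviated generated config."""
    host, _, port = ollama_host.partition(":")
    return {
        "pipeline": {
            "maas": {"enabled": False},
            "ocr_api": {
                "api_host": host,
                "api_port": int(port) if port else 11434,
                "api_path": "/api/generate",
                "api_mode": "ollama_generate",
                "model": "glm-ocr:latest",
                "connect_timeout": 30,
                "request_timeout": request_timeout,
            },
            # Assumption: --concurrency feeds max_workers.
            "max_workers": concurrency,
            "layout": {"device": layout_device, "threshold": 0.3, "batch_size": 1},
        }
    }
```

Since `pyyaml` is a declared dependency, such a dict could then be serialized with `yaml.safe_dump` into a temporary directory.
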
+---
+## GPU Support
+The default installation uses CPU-only PyTorch for layout detection. This is intentional:
+1. **No GPU competition** — Ollama loads the glm-ocr model into GPU VRAM. Running layout detection on the same GPU can cause OOM errors.
+2. **Smaller venv** — CPU PyTorch is ~500MB vs ~4GB for CUDA.
+3. **Fast enough** — PP-DocLayoutV3 is lightweight and runs quickly on CPU for typical document sizes.
+### Enabling GPU
+If you have ample GPU memory and want faster layout detection:
+```bash
+# Uninstall the CPU-only version
+npm uninstall -g glmmedia-ocr
+# Reinstall with GPU PyTorch
+GLMOCR_GPU=1 npm install -g glmmedia-ocr
+```
+Then use `--layout-device cuda` when scanning:
+```bash
+glmmedia-ocr scan document.pdf --layout-device cuda
+```
+### Recommended GPU Setup
+If running both Ollama (glm-ocr model) and layout detection on the same GPU:
+- **GPU with 12GB+ VRAM** — glm-ocr takes ~2.2GB, layout detection takes ~1-2GB
+- **Use `--concurrency 1`** — Avoids queuing multiple OCR requests that could spike memory
+- **Monitor with `nvidia-smi`** — Watch for OOM during processing
+---
+## Troubleshooting
+### Python not found or unsupported version
+```
+✗ Python 3.12+ not found on PATH. Install from python.org
+```
+**Fix:** Install Python 3.12 or 3.13 from [python.org](https://www.python.org/downloads/) and make sure it is on your PATH. Python 3.14+ is not yet supported because key dependencies (PyTorch, Transformers) do not publish 3.14 wheels.
+```bash
+# Verify
+python --version  # Should show 3.12.x or 3.13.x
+```
+### Ollama not found
+```
+✗ Ollama not found on PATH. Install from https://ollama.com/download
+```
+**Fix:** Install Ollama from [ollama.com/download](https://ollama.com/download).
+```bash
+# Verify
+ollama --version
+```
+### SDK installation failed
+```
+✗ GLM-OCR SDK installation failed. Run 'npm rebuild glmmedia-ocr' to retry.
+```
+**Fix:** Rebuild the package:
+```bash
+npm rebuild glmmedia-ocr
+```
+If that fails, try a clean reinstall:
+```bash
+npm uninstall -g glmmedia-ocr
+npm install -g glmmedia-ocr
+```
+### Model pull failed
+```
+✗ ollama pull failed with code 1
+```
+**Fix:** Check your internet connection and try again. The model is ~2.2GB and requires a stable connection.
+```bash
+# Manual pull to debug
+ollama pull glm-ocr:latest
+```
+### Ollama won't start
+```
+✗ Ollama did not become healthy within 15s
+```
+**Fix:** Start Ollama manually and check for errors:
+```bash
+ollama serve
+# In another terminal:
+ollama list
+```
+If Ollama is already running on a different port, use `--ollama-host`:
+```bash
+glmmedia-ocr scan document.pdf --ollama-host localhost:11435
+```
+### OCR timeout on large documents
+```
+Error: OCR failed — API request failed after 3 attempts
+```
+**Fix:** Increase the request timeout or reduce concurrency:
+```bash
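+# Allow slow pages more time (default: 120 seconds)
+glmmedia-ocr scan large-document.pdf --request-timeout 300
+# Reduce to single worker (most stable)
+glmmedia-ocr scan large-document.pdf --concurrency 1
+# If using a remote Ollama, ensure the network is stable
+glmmedia-ocr scan document.pdf --ollama-host 192.168.1.100:11434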
+```
+### Out of memory
+```
+Error: CUDA out of memory
+```
+**Fix:** Use CPU for layout detection:
+```bash
+glmmedia-ocr scan document.pdf --layout-device cpu
+```
+Or reduce concurrency:
+```bash
+glmmedia-ocr scan document.pdf --concurrency 1
+```
+### Corrupt or encrypted PDF
+```
+Error: Failed to render PDF: ...
+```
+**Fix:** Ensure the PDF is valid and not password-protected. The current version does not support encrypted PDFs. Use a tool like `qpdf` to decrypt first:
+```bash
+qpdf --decrypt --password=your-password input.pdf decrypted.pdf
+glmmedia-ocr scan decrypted.pdf
+```
+### No image files found in directory
+```
+✗ No image files found in directory: ./images/
+```
+**Fix:** Ensure the directory contains supported image files (PNG, JPEG, WebP, BMP, TIFF, GIF). Use `--recursive` if images are in subdirectories:
+```bash
+glmmedia-ocr scan ./images/ --recursive
+```
+### Input not found
+```
+✗ Input not found: ./missing.pdf
+```
+**Fix:** Check the file path and ensure the input exists.
+---
+## Project Structure
+```
+glmmedia-ocr/
+├── bin/
+│   └── glmmedia-ocr.js          # npm CLI entry point
+│                                # - Thin wrapper: finds .venv Python
+│                                # - Delegates to lib/pipeline.py
+│
+├── scripts/
+│   └── postinstall.js           # npm package setup
+│                                # - Creates .venv
+│                                # - pip install glmocr[selfhosted] + CPU torch
+│                                # - Verifies installation
+│
+├── lib/
+│   └── pipeline.py              # PDF/Image-to-Markdown pipeline (npm path)
+│                                # - pypdfium2: PDF → PIL images (2000px cap)
+│                                # - PIL: load images directly (2000px cap)
+│                                # - GlmOcr SDK: layout detection + OCR
+│                                # - Logging: surfaces SDK progress to stderr
+│                                # - Merge with page markers → .md
+│
+├── src/glmmedia_ocr/            # Pure Python CLI package (pip path)
+│   ├── __init__.py              # Package version
+│   ├── __main__.py              # python -m glmmedia_ocr entry
+│   ├── cli.py                   # Full CLI: args, Ollama, config, spinner
+│   ├── config.py                # Config YAML generation
+│   ├── inputs.py                # Input resolution (files, dirs, types)
+│   ├── ollama.py                # Ollama lifecycle management
+│   ├── pipeline.py              # Rendering + OCR + output
+│   └── spinner.py               # Animated terminal spinner
+│
+├── pyproject.toml               # Python package metadata + deps
+├── .venv/                       # Created at npm install time (gitignored)
+├── .gitignore
+├── package.json                 # npm package metadata
+└── README.md
+```
+### Distribution Channels
+| Channel | Entry point | Code path |
+|---|---|---|
+| **npm** | `bin/glmmedia-ocr.js` | JS wrapper → `lib/pipeline.py` |
+| **pip** | `src/glmmedia_ocr/cli.py` | Pure Python (full implementation) |
+Both provide the same CLI interface and functionality. They are independent implementations — changes to one should be mirrored in the other.
+### What's NOT Here
+| Not included | Why |
+|---|---|
+| `node_modules/` | Zero npm dependencies — uses Node.js built-ins only |
+| `vendor/poppler/` | pypdfium2 ships its own PDFium binary in its pip wheel |
+| `config.yaml` | Generated dynamically per run, cleaned up after |
+| `*.md` output files | Generated by the CLI, not part of the package |
+| `dist/`, `build/`, `*.egg-info/` | Build artifacts (gitignored) |
+---
+## Under the Hood
+### Input Resolution
+The CLI accepts PDFs, images, and directories. When a directory is passed, it collects all supported image files (flat or recursive with `--recursive`). Mixed input types (PDF + image) are supported — pages are merged in input order into a single output file with sequential `<!-- PAGE N -->` markers.
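
Directory scanning along these lines could look like the sketch below. It is illustrative only: the function name is hypothetical, the suffix set is inferred from the supported formats listed earlier, and sorting by path is an assumption about page order.

```python
from pathlib import Path

# Assumption: suffixes inferred from the supported formats (PNG, JPEG, WebP, BMP, TIFF, GIF).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff", ".gif"}


def collect_images(directory: str, recursive: bool = False) -> list[Path]:
    """Return supported image files in `directory`, sorted for a stable page order."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

The suffix comparison is case-insensitive, so `scan.JPG` and `scan.jpg` are both picked up.
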
+### PDF Rendering
+Uses **pypdfium2**, which bundles the PDFium engine (same as Chromium). Renders PDF pages directly to PIL images in-memory at the specified DPI. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS resampling. No temp files, no subprocess calls, no system dependencies.
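
The interaction between `--dpi` and the 2000px cap is easier to see with numbers. The helper below is a hypothetical sketch that only computes sizes; it assumes the cap is applied after rendering, as described above, and relies on the fact that PDF page dimensions are measured in points (72 per inch).

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int = 200, cap: int = 2000):
    """Pixel size of a PDF page rendered at `dpi`, then capped on its longest side."""
    # PDF user space is measured in points: 72 points per inch.
    width = round(width_pt * dpi / 72)
    height = round(height_pt * dpi / 72)
    longest = max(width, height)
    if longest > cap:
        # Downscale both sides by the same factor to preserve the aspect ratio.
        width = round(width * cap / longest)
        height = round(height * cap / longest)
    return width, height
```

A US Letter page (612×792 points) renders to 1700×2200 pixels at 200 DPI and is then capped to 1545×2000. Under this reading, 300 DPI ends at the same 1545×2000 after the cap, so raising the DPI mainly helps for pages smaller than Letter.
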
+### Image Loading
+Images are opened with PIL and capped to 2000px on their longest dimension via LANCZOS resampling. This ensures consistent quality while preventing GGML tensor size crashes on Ollama.
+### Layout Detection
+Uses **PP-DocLayoutV3** via HuggingFace Transformers. Detects text blocks, tables, formulas, images, and other regions on each page. Runs on CPU by default to avoid GPU memory competition with Ollama. Progress is logged to stderr when `--log-level DEBUG` is used.
+### OCR
+Each detected region is sent to the **glm-ocr** model via Ollama's native `/api/generate` endpoint. The model returns structured Markdown for each region.
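
A request body for `/api/generate` might be assembled as in the sketch below. The field names (`images`, `stream`, `options`, `repeat_penalty`, `num_predict`, `num_ctx`) come from Ollama's API; the function name and the way CLI defaults are mapped onto them are assumptions, and the prompt text is left to the caller because the SDK's actual prompts are not shown here.

```python
import base64


def build_generate_payload(image_bytes: bytes, prompt: str, *, max_tokens=8192,
                           temperature=0.0, top_p=0.00001, top_k=1,
                           repetition_penalty=1.1, num_ctx=8192) -> dict:
    """Build a JSON-serializable body for POST /api/generate."""
    options = {
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repetition_penalty,  # Ollama's name for the penalty
        "num_predict": max_tokens,             # Ollama's name for the token limit
    }
    if num_ctx:
        # Mirrors "--ollama-num-ctx 0 = omit": leave the key out entirely.
        options["num_ctx"] = num_ctx
    return {
        "model": "glm-ocr:latest",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": options,
    }
```

Setting `stream` to `False` asks Ollama for a single JSON response rather than a stream of chunks, which keeps per-region handling simple.
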
+### Result Merging
+Per-page results are merged with `<!-- PAGE N -->` markers and `---` separators. Failed pages get error placeholders instead of aborting the entire document.
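
The merging step can be sketched as a small pure function. This is an illustration of the documented output format, not the package's code; the function name and the `(markdown, error)` tuple shape are hypothetical.

```python
def merge_pages(pages):
    """Join per-page results into one Markdown string with page markers.

    `pages` is a list of (markdown, error) tuples; `error` is None on success.
    """
    chunks = []
    for number, (markdown, error) in enumerate(pages, start=1):
        chunks.append(f"<!-- PAGE {number} -->")
        if error is None:
            chunks.append(markdown.strip())
        else:
            # \u2014 is the dash used in the documented placeholder format.
            chunks.append(f"<!-- PAGE {number}: OCR failed \u2014 {error} -->")
        chunks.append("---")
    return "\n".join(chunks) + "\n"
```

Because a failure only changes the body of its own page, later pages keep their sequential numbers and the document is still written in full.
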
+---
+## License
+MIT