glmmedia-ocr 0.1.0__tar.gz

@@ -0,0 +1,851 @@
1
+ Metadata-Version: 2.4
2
+ Name: glmmedia-ocr
3
+ Version: 0.1.0
4
+ Summary: Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama
5
+ Author-email: dusy4 <dusy4@users.noreply.github.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/dusy4/glmmedia-ocr
8
+ Project-URL: Bug Tracker, https://github.com/dusy4/glmmedia-ocr/issues
9
+ Requires-Python: >=3.12
10
+ Description-Content-Type: text/markdown
11
+ Requires-Dist: glmocr[selfhosted]
12
+ Requires-Dist: pypdfium2
13
+ Requires-Dist: Pillow
14
+ Requires-Dist: pyyaml
15
+
16
+ # glmmedia-ocr
17
+
18
+ Convert PDFs and images to structured Markdown using local GLM-OCR + Ollama. Fully self-contained — zero ongoing maintenance after install.
19
+
20
+ ```bash
21
+ npm install -g glmmedia-ocr
22
+ glmmedia-ocr scan invoice.pdf
23
+ # → invoice.md written
24
+ ```
25
+
26
+ ---
27
+
28
+ ## Table of Contents
29
+
30
+ - [Requirements](#requirements)
31
+ - [Installation](#installation)
32
+ - [Quick Start](#quick-start)
33
+ - [CLI Reference](#cli-reference)
34
+ - [How It Works](#how-it-works)
35
+ - [Architecture](#architecture)
36
+ - [Output Format](#output-format)
37
+ - [Configuration](#configuration)
38
+ - [GPU Support](#gpu-support)
39
+ - [Troubleshooting](#troubleshooting)
40
+ - [Project Structure](#project-structure)
+ - [Under the Hood](#under-the-hood)
41
+ - [License](#license)
42
+
43
+ ---
44
+
45
+ ## Requirements
46
+
47
+ Only two things need to be on your machine before installing:
48
+
49
+ | Requirement | Why | Where |
50
+ |---|---|---|
51
+ | **Python 3.12 or 3.13** | Runs the GLM-OCR SDK | [python.org](https://www.python.org/downloads/) |
52
+ | **Ollama** (installed, not necessarily running) | Serves the `glm-ocr` model locally | [ollama.com/download](https://ollama.com/download) |
53
+
54
+ That's it. Everything else — the Python virtual environment, all dependencies, and the Ollama process lifecycle — is managed automatically by the package.
55
+
56
+ > **Note:** Python 3.14+ is not yet supported. The GLM-OCR SDK and its dependencies (PyTorch, Transformers) only publish wheels for Python 3.10–3.13.
57
+
58
+ ---
59
+
60
+ ## Installation
61
+
62
+ ### npm (recommended)
63
+
64
+ ```bash
65
+ npm install -g glmmedia-ocr
66
+ ```
67
+
68
+ This triggers a `postinstall` script that:
69
+
70
+ 1. Creates a dedicated Python virtual environment inside the package (`.venv/`)
71
+ 2. Installs `glmocr[selfhosted]` with **CPU-only PyTorch** into the venv
72
+ 3. Verifies the installation by importing the SDK
73
+
74
+ The first install takes a few minutes while pip downloads ~1-2GB of dependencies. This is a one-time cost.
75
+
76
+ ### pip
77
+
78
+ ```bash
79
+ pip install glmmedia-ocr
80
+ ```
81
+
82
+ Or from source:
83
+
84
+ ```bash
85
+ git clone https://github.com/dusy4/glmmedia-ocr.git
86
+ cd glmmedia-ocr
87
+ pip install .
88
+ ```
89
+
90
+ This installs the same dependencies directly into your Python environment and registers the `glmmedia-ocr` CLI command. The npm and pip packages provide the same functionality and the same command-line interface.
91
+
92
+ ### GPU install (optional)
93
+
94
+ By default, the npm package installs CPU-only PyTorch to avoid GPU resource competition with Ollama. If you have a GPU and want to use it for layout detection:
95
+
96
+ ```bash
97
+ # npm
98
+ GLMOCR_GPU=1 npm install -g glmmedia-ocr
99
+
100
+ # pip: on Linux x86_64, pip resolves CUDA-enabled PyTorch by default
101
+ pip install .
102
+ ```
103
+
104
+ ### Reinstall / repair
105
+
106
+ ```bash
107
+ # npm
108
+ npm rebuild glmmedia-ocr
109
+
110
+ # pip
111
+ pip install --force-reinstall .
112
+ ```
113
+
114
+ ---
115
+
116
+ ## Quick Start
117
+
118
+ ```bash
119
+ # Single PDF
120
+ glmmedia-ocr scan invoice.pdf
121
+
122
+ # Single image
123
+ glmmedia-ocr scan receipt.png
124
+
125
+ # Multiple images
126
+ glmmedia-ocr scan page1.png page2.png page3.png
127
+
128
+ # Mixed PDFs and images
129
+ glmmedia-ocr scan report.pdf page1.png page2.png
130
+
131
+ # All images in a directory
132
+ glmmedia-ocr scan ./images/
133
+
134
+ # All images in directory + subdirectories
135
+ glmmedia-ocr scan ./images/ --recursive
136
+
137
+ # Shell glob
138
+ glmmedia-ocr scan *.png
139
+
140
+ # Custom output path
141
+ glmmedia-ocr scan contract.pdf --output ./results/contract.md
142
+
143
+ # Higher DPI for better OCR quality
144
+ glmmedia-ocr scan receipt.pdf --dpi 300
145
+
146
+ # Connect to a remote Ollama instance
147
+ glmmedia-ocr scan report.pdf --ollama-host 192.168.1.100:11434
148
+
149
+ # Faster processing with parallel workers
150
+ glmmedia-ocr scan book.pdf --concurrency 2
151
+
152
+ # Debug logging to see layout detection progress
153
+ glmmedia-ocr scan document.pdf --log-level DEBUG
154
+ ```
155
+
156
+ ### First run
157
+
158
+ On the very first run, the CLI will:
159
+
160
+ 1. Detect that Ollama is not running and start it automatically
161
+ 2. Detect that the `glm-ocr:latest` model is not pulled and download it (~2.2GB)
162
+ 3. Process your input
163
+ 4. Shut down Ollama on exit (since it started it)
164
+
165
+ Subsequent runs skip steps 1 and 2 if Ollama is already running and the model is cached.
166
+
167
+ ---
168
+
169
+ ## CLI Reference
170
+
171
+ ```
172
+ glmmedia-ocr scan <input...> [options]
173
+
174
+ Inputs:
175
+ <file.pdf> Single PDF file
176
+ <image.png> Single image file (PNG, JPEG, WebP, BMP, TIFF, GIF)
177
+ <img1.png> <img2.png> ... Multiple image files
178
+ <directory>/ Directory of images (use --recursive for subfolders)
179
+
180
+ Input/Output:
181
+ --output <path> Output .md path (default: auto-generated from input names)
182
+ --recursive Scan directories recursively for images
183
+
184
+ Rendering:
185
+ --dpi <number> Render DPI for PDFs (default: 200)
186
+ --image-format <format> Image format: PNG, JPEG, WEBP (default: PNG)
187
+ --min-pixels <number> Minimum image pixels (default: 12544)
188
+ --max-pixels <number> Maximum image pixels (default: 71372800)
189
+ --patch-expand-factor <n> Patch expansion factor (default: 1)
190
+ --t-patch-size <n> T-patch size (default: 2)
191
+ --image-expect-length <n> Image expect length (default: 6144)
192
+
193
+ Generation:
194
+ --max-tokens <number> Max generation tokens (default: 8192)
195
+ --temperature <float> Sampling temperature (default: 0.0)
196
+ --top-p <float> Top-p sampling (default: 0.00001)
197
+ --top-k <number> Top-k sampling (default: 1)
198
+ --repetition-penalty <float> Repetition penalty (default: 1.1)
199
+
200
+ Layout (PP-DocLayoutV3):
201
+ --layout-device <device> Device: cpu, cuda, cuda:N (default: cpu)
202
+ --layout-model-dir <path> Custom layout model directory
203
+ --layout-threshold <float> Detection threshold (default: 0.3)
204
+ --layout-batch-size <n> Layout batch size (default: 1)
205
+ --layout-use-polygon Use polygon masks for cropping
206
+ --no-layout-nms Disable layout NMS
207
+ --layout-merge-mode <mode> Merge overlapping bboxes: large|small (default: large)
208
+ --layout-workers <n> Layout workers (default: 1)
209
+
210
+ Result formatting:
211
+ --output-format <format> Output: markdown, json, both (default: markdown)
212
+ --no-merge-formula-numbers Disable formula number merging
213
+ --no-merge-text-blocks Disable text block merging
214
+ --no-format-bullet-points Disable bullet point formatting
215
+
216
+ Pipeline:
217
+ --concurrency <number> Parallel OCR workers (default: 1)
218
+ --page-maxsize <number> Page queue max size (default: 100)
219
+ --region-maxsize <number> Region queue max size (default: 2000)
220
+
221
+ Ollama / API:
222
+ --ollama-host <host> Ollama host (default: localhost:11434)
223
+ --ollama-num-ctx <n> Ollama num_ctx for glm-ocr (default: 8192; 0 = omit)
224
+ --api-scheme <scheme> API scheme: http, https (default: auto)
225
+ --api-key <key> API key for MaaS providers
226
+ --verify-ssl Enable SSL verification
227
+ --connect-timeout <seconds> Connect timeout (default: 30)
228
+ --request-timeout <seconds> Request timeout (default: 120)
229
+
230
+ MaaS (Zhipu Cloud):
231
+ --maas Enable MaaS mode (disables local OCR)
232
+ --maas-api-url <url> MaaS API URL
233
+ --maas-model <model> MaaS model name
234
+ --maas-api-key <key> MaaS API key
235
+ --no-maas-verify-ssl Disable MaaS SSL verification
236
+ --maas-connect-timeout <s> MaaS connect timeout (default: 30)
237
+ --maas-request-timeout <s> MaaS request timeout (default: 300)
238
+ --maas-retry-attempts <n> MaaS retry attempts (default: 2)
239
+
240
+ Logging:
241
+ --log-level <level> Log level: DEBUG, INFO, WARNING, ERROR (default: INFO)
242
+ ```
243
+
244
+ ### Flag Details
245
+
246
+ #### Inputs
247
+
248
+ | Input type | Description |
249
+ |---|---|
250
+ | `<file.pdf>` | One or more PDF files. Each page becomes `<!-- PAGE N -->` in output. |
251
+ | `<image.png>` | One or more image files. Supported: PNG, JPEG, WebP, BMP, TIFF, GIF. |
252
+ | `<file.pdf> <img.png>` | Mixed PDFs and images. Pages are merged in input order. |
253
+ | `<directory>/` | Directory of images. Scans flat by default; use `--recursive` for subfolders. |
254
+
255
+ #### Input/Output
256
+
257
+ | Flag | Default | Description |
258
+ |---|---|---|
259
+ | `--output` | auto-generated | Where to write the Markdown output. Single input → `<name>.md`. Multiple inputs → `<name1>_<name2>_output.md`. `--output` overrides all. |
260
+ | `--recursive` | off | When a directory is passed, recurse into subdirectories for images. |
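The default naming rule in the `--output` row can be sketched as a small function. This is only an illustration of the documented pattern, not the package's code: `default_output_path` is a hypothetical name, and it assumes that every input stem is joined when more than two inputs are given and that the file is named relative to the current directory.

```python
from pathlib import Path


def default_output_path(inputs: list[str]) -> Path:
    """Derive a default .md name from the input names (assumed rule).

    Single input    -> <name>.md
    Multiple inputs -> <name1>_<name2>_output.md
    """
    if not inputs:
        raise ValueError("at least one input is required")
    # Path() drops a trailing slash, so "./images/" yields the stem "images".
    stems = [Path(item).stem for item in inputs]
    if len(stems) == 1:
        return Path(f"{stems[0]}.md")
    return Path("_".join(stems) + "_output.md")


if __name__ == "__main__":
    print(default_output_path(["invoice.pdf"]))
    print(default_output_path(["report.pdf", "page1.png"]))
```

Passing `--output` would bypass a function like this entirely, since the table states that it overrides the generated name.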
261
+
262
+ #### Rendering
263
+
264
+ | Flag | Default | Description |
265
+ |---|---|---|
266
+ | `--dpi` | `200` | Resolution for rendering PDF pages to images. Higher DPI improves OCR accuracy but increases processing time and memory usage. Recommended: 200-300. |
267
+ | `--image-format` | `PNG` | Format for images sent to the OCR API. `PNG` is lossless (best for code, diagrams). `JPEG` is smaller (best for text documents). `WEBP` is smallest but may not be supported by all backends. |
268
+ | `--min-pixels` | `12544` | Minimum image pixel count (112×112). Images smaller than this are upscaled. |
269
+ | `--max-pixels` | `71372800` | Maximum image pixel count. Images larger than this are downscaled. |
270
+ | `--patch-expand-factor` | `1` | Patch expansion factor for image processing. |
271
+ | `--t-patch-size` | `2` | T-patch size for image processing. |
272
+ | `--image-expect-length` | `6144` | Expected image token length. |
273
+
274
+ #### Generation
275
+
276
+ | Flag | Default | Description |
277
+ |---|---|---|
278
+ | `--max-tokens` | `8192` | Maximum tokens generated per region. Increase for very dense pages. |
279
+ | `--temperature` | `0.0` | Sampling temperature. `0.0` = deterministic (recommended for OCR). |
280
+ | `--top-p` | `0.00001` | Top-p (nucleus) sampling. Keep very low for OCR. |
281
+ | `--top-k` | `1` | Top-k sampling. `1` = always pick the most likely token. |
282
+ | `--repetition-penalty` | `1.1` | Penalty for repeating tokens. Prevents the model from getting stuck in loops. |
283
+
284
+ #### Layout (PP-DocLayoutV3)
285
+
286
+ | Flag | Default | Description |
287
+ |---|---|---|
288
+ | `--layout-device` | `cpu` | Device for the PP-DocLayoutV3 layout detection model. `cpu` avoids GPU memory competition with Ollama. Use `cuda` or `cuda:N` for GPU. |
289
+ | `--layout-model-dir` | (SDK default) | Path to a custom PP-DocLayoutV3 model directory. Leave unset to use the SDK's built-in default. |
290
+ | `--layout-threshold` | `0.3` | Confidence threshold for layout detection. Lower values detect more regions (may include false positives). |
291
+ | `--layout-batch-size` | `1` | Maximum number of images per layout-model forward pass. If you raise it and hit out-of-memory errors, set it back to `1`. |
292
+ | `--layout-use-polygon` | off | Use polygon masks for region cropping instead of bounding boxes. More precise for rotated or staggered layouts. |
293
+ | `--no-layout-nms` | off | Disable non-maximum suppression for layout detection. |
294
+ | `--layout-merge-mode` | `large` | How to merge overlapping bounding boxes. `large` keeps the larger region, `small` keeps the smaller one. |
295
+ | `--layout-workers` | `1` | Number of layout detection workers. |
296
+
297
+ #### Result Formatting
298
+
299
+ | Flag | Default | Description |
300
+ |---|---|---|
301
+ | `--output-format` | `markdown` | Output format: `markdown`, `json`, or `both`. |
302
+ | `--no-merge-formula-numbers` | off | Disable automatic merging of formula numbers with their equations. |
303
+ | `--no-merge-text-blocks` | off | Disable automatic merging of adjacent text blocks. |
304
+ | `--no-format-bullet-points` | off | Disable automatic bullet point formatting normalization. |
305
+
306
+ #### Pipeline
307
+
308
+ | Flag | Default | Description |
309
+ |---|---|---|
310
+ | `--concurrency` | `1` | Number of parallel OCR workers. Increase for faster processing on multi-page documents. Set to `1` for maximum stability with Ollama. |
311
+ | `--page-maxsize` | `100` | Maximum number of pages queued for processing. |
312
+ | `--region-maxsize` | `2000` | Maximum number of regions queued for OCR. |
313
+
314
+ #### Ollama / API
315
+
316
+ | Flag | Default | Description |
317
+ |---|---|---|
318
+ | `--ollama-host` | `localhost:11434` | Ollama server address. Use this to connect to a remote or non-standard Ollama instance. |
319
+ | `--ollama-num-ctx` | `8192` | Ollama `num_ctx` parameter for glm-ocr. Prevents GGML tensor size crashes. Set to `0` to omit. |
320
+ | `--api-scheme` | auto | API URL scheme: `http` or `https`. Auto-detects based on port (HTTPS if 443). |
321
+ | `--api-key` | null | API key for MaaS providers (Zhipu, OpenAI, etc.). |
322
+ | `--verify-ssl` | off | Enable SSL certificate verification for API requests. |
323
+ | `--connect-timeout` | `30` | Connection timeout in seconds. |
324
+ | `--request-timeout` | `120` | Request timeout in seconds. |
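As a rough sketch of how `--ollama-host` and `--api-scheme` could combine into a base URL, the function below applies the "HTTPS if 443" rule from the table. `resolve_endpoint` is a hypothetical helper; falling back to port 11434 when no port is given is an assumption, and IPv6 literals are not handled.

```python
def resolve_endpoint(host: str, scheme: str = "auto", default_port: int = 11434) -> str:
    """Build a base URL from a 'host[:port]' string (illustrative only)."""
    name, separator, port_text = host.rpartition(":")
    if separator:
        port = int(port_text)
    else:
        # No port in the string: assume Ollama's usual port.
        name, port = host, default_port
    if scheme == "auto":
        # Rule from the flag table: HTTPS only when the port is 443.
        scheme = "https" if port == 443 else "http"
    return f"{scheme}://{name}:{port}"


if __name__ == "__main__":
    print(resolve_endpoint("localhost:11434"))
    print(resolve_endpoint("ocr.example.com:443"))
```

An explicit `--api-scheme https` would skip the port check, which matters for HTTPS endpoints served on a non-standard port.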
325
+
326
+ #### MaaS (Zhipu Cloud)
327
+
328
+ | Flag | Default | Description |
329
+ |---|---|---|
330
+ | `--maas` | off | Enable MaaS mode. Sends requests directly to Zhipu's cloud API. Disables local OCR and Ollama checks. |
331
+ | `--maas-api-url` | Zhipu default | MaaS API endpoint URL. |
332
+ | `--maas-model` | `glm-ocr` | MaaS model name. |
333
+ | `--maas-api-key` | null | MaaS API key (or set `ZHIPU_API_KEY` env var). |
334
+ | `--no-maas-verify-ssl` | off | Disable SSL verification for MaaS requests. |
335
+ | `--maas-connect-timeout` | `30` | MaaS connection timeout in seconds. |
336
+ | `--maas-request-timeout` | `300` | MaaS request timeout in seconds. |
337
+ | `--maas-retry-attempts` | `2` | Number of retry attempts for transient MaaS errors. |
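One plausible reading of `--maas-retry-attempts` is "retries after the first try", which would make the default of `2` line up with the "failed after 3 attempts" message shown elsewhere in this README. The source does not confirm that interpretation, so treat the sketch below as an assumption; `call_with_retries` and its linear backoff are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    retry_attempts: int = 2,
    backoff_seconds: float = 0.5,
) -> T:
    """Call `request`, retrying transient errors up to `retry_attempts` times."""
    if retry_attempts < 0:
        raise ValueError("retry_attempts must be >= 0")
    total_attempts = retry_attempts + 1  # first try plus the retries
    for attempt in range(1, total_attempts + 1):
        try:
            return request()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == total_attempts:
                raise RuntimeError(
                    f"API request failed after {total_attempts} attempts"
                ) from exc
            # Linear backoff between tries (an arbitrary choice for this sketch).
            time.sleep(backoff_seconds * attempt)
    raise AssertionError("unreachable")
```

Only connection and timeout errors are retried here; anything else propagates immediately, which keeps configuration mistakes such as a bad API key from being retried pointlessly.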
338
+
339
+ #### Logging
340
+
341
+ | Flag | Default | Description |
342
+ |---|---|---|
343
+ | `--log-level` | `INFO` | Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`. Use `DEBUG` to see detailed timing and layout detection progress. |
344
+
345
+ ---
346
+
347
+ ## How It Works
348
+
349
+ ### Startup Sequence
350
+
351
+ ```
352
+ glmmedia-ocr scan invoice.pdf
353
+
354
+ ├─ 1. Preflight Checks
355
+ │ ├─ Python 3.12 or 3.13 found?
356
+ │ ├─ Ollama binary on PATH? (skipped if --maas)
357
+ │ └─ GLM-OCR SDK importable in managed venv?
358
+
359
+ ├─ 2. Ollama Lifecycle (skipped if --maas)
360
+ │ ├─ Is Ollama already running? (GET localhost:11434)
361
+ │ ├─ If yes → use it, leave it running after exit
362
+ │ └─ If no → spawn ollama serve, wait until healthy
363
+
364
+ ├─ 3. Model Check (skipped if --maas)
365
+ │ ├─ Is glm-ocr:latest pulled? (ollama list)
366
+ │ └─ If no → ollama pull glm-ocr:latest (~2.2GB, one-time)
367
+
368
+ ├─ 4. Pipeline Execution
369
+ │ ├─ PDF: Render pages to images (pypdfium2, in-memory, capped to 2000px)
370
+ │ │ Images: Load and cap to 2000px (no rendering step)
371
+ │ ├─ Run layout detection (PP-DocLayoutV3) — progress logged to stderr
372
+ │ ├─ OCR each region via Ollama (/api/generate) or MaaS
373
+ │ └─ Merge results with page markers
374
+
375
+ └─ 5. Cleanup
376
+ ├─ Write output .md
377
+ └─ Shut down Ollama (only if CLI started it)
378
+ ```
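Step 2 of the sequence above (probe Ollama, otherwise spawn it and wait until healthy) can be sketched with the standard library alone. The names `ollama_is_up` and `wait_until_healthy` are hypothetical; the 15-second default mirrors the timeout quoted under Troubleshooting, and the polling interval is an assumption.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if a GET on the Ollama root URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 15.0,
    interval: float = 0.5,
) -> bool:
    """Poll `check` until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

After spawning `ollama serve`, a caller would run `wait_until_healthy(ollama_is_up)` and report the "did not become healthy" error when it returns `False`. Keeping the probe injectable makes the polling loop testable without a running server.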
379
+
380
+ ### Ollama Ownership Tracking
381
+
382
+ The CLI tracks whether it started Ollama or found it already running:
383
+
384
+ | Scenario | CLI behavior |
385
+ |---|---|
386
+ | Ollama was already running | Uses it, leaves it running on exit |
387
+ | CLI started Ollama | Shuts it down on normal exit, SIGINT, or SIGTERM |
388
+ | CLI crashes | Still shuts down Ollama via signal trap |
389
+
390
+ This means you can run Ollama manually before using the CLI, and it won't be touched.
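The ownership rule in the table can be expressed as a small state holder: remember the child process only when this CLI spawned it, and stop only that process. This is a sketch under assumptions, not the package's `ollama.py`; `OllamaSession` and `install_signal_handlers` are hypothetical names, and the probe and spawn callables are injected so the logic stays independent of a real Ollama install.

```python
import signal
import subprocess
from typing import Callable, Optional


class OllamaSession:
    """Start Ollama only if needed, and stop it only if we started it."""

    def __init__(
        self,
        is_running: Callable[[], bool],
        spawn: Callable[[], subprocess.Popen],
    ) -> None:
        self._is_running = is_running
        self._spawn = spawn
        self._process: Optional[subprocess.Popen] = None  # set only when owned

    @property
    def owned(self) -> bool:
        return self._process is not None

    def start(self) -> None:
        if not self._is_running():
            # e.g. spawn=lambda: subprocess.Popen(["ollama", "serve"])
            self._process = self._spawn()

    def stop(self) -> None:
        if self._process is not None:
            self._process.terminate()
            self._process.wait(timeout=10)
            self._process = None


def install_signal_handlers(session: OllamaSession) -> None:
    """Shut down an owned server on SIGINT or SIGTERM, then exit."""

    def handler(signum, frame):
        session.stop()
        raise SystemExit(128 + signum)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

Wrapping the pipeline call in `try: ... finally: session.stop()` would cover normal exits and unhandled exceptions. A handler like this cannot run if the process is killed with SIGKILL.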
391
+
392
+ ---
393
+
394
+ ## Architecture
395
+
396
+ ```
397
+ ┌─────────────────────────────────────────────────────────────┐
398
+ │ User (CLI) │
399
+ │ glmmedia-ocr scan invoice.pdf (or *.png, ./images/) │
400
+ └──────────────────────────┬──────────────────────────────────┘
401
+
402
+ ┌──────────────────────────▼──────────────────────────────────┐
403
+ │ bin/glmmedia-ocr.js (Node.js) │
404
+ │ │
405
+ │ ┌─────────────┐ ┌──────────────┐ ┌───────────────────┐ │
406
+ │ │ Preflight │ │ Ollama │ │ Model Check │ │
407
+ │ │ Checks │ │ Lifecycle │ │ (pull if needed)│ │
408
+ │ └──────┬──────┘ └──────┬───────┘ └────────┬──────────┘ │
409
+ │ │ │ │ │
410
+ │ └────────────────┼────────────────────┘ │
411
+ │ │ │
412
+ │ ┌───────────▼────────────┐ │
413
+ │ │ Resolve inputs │ │
414
+ │ │ (files, dirs, globs) │ │
415
+ │ └───────────┬────────────┘ │
416
+ │ │ │
417
+ │ ┌───────────▼────────────┐ │
418
+ │ │ Generate config.yaml │ │
419
+ │ │ (full SDK template) │ │
420
+ │ └───────────┬────────────┘ │
421
+ │ │ │
422
+ │ ┌───────────▼────────────┐ │
423
+ │ │ Spawn Python Pipeline │ │
424
+ │ │ lib/pipeline.py │ │
425
+ │ └───────────┬────────────┘ │
426
+ └──────────────────────────┼──────────────────────────────────┘
427
+
428
+ ┌──────────────────────────▼──────────────────────────────────┐
429
+ │ lib/pipeline.py (Python) │
430
+ │ │
431
+ │ ┌──────────────────┐ ┌──────────────────────────────┐ │
432
+ │ │ PDF: pypdfium2 │ │ GlmOcr SDK (selfhosted) │ │
433
+ │ │ Image: PIL open │───▶│ ┌────────────────────────┐ │ │
434
+ │ │ (2000px cap) │ │ │ PP-DocLayoutV3 │ │ │
435
+ │ └──────────────────┘ │ │ (Transformers + CPU │ │ │
436
+ │ │ │ PyTorch layout detect) │ │ │
437
+ │ │ └───────────┬────────────┘ │ │
438
+ │ │ │ │ │
439
+ │ │ ┌───────────▼────────────┐ │ │
440
+ │ │ │ OCRClient │ │ │
441
+ │ │ │ → Ollama /api/generate │ │ │
442
+ │ │ └────────────────────────┘ │ │
443
+ │ └──────────────────────────────┘ │
444
+ │ │ │
445
+ │ ┌──────────▼────────────┐ │
446
+ │ │ Merge + Page Markers │ │
447
+ │ │ → output.md │ │
448
+ │ └───────────────────────┘ │
449
+ └─────────────────────────────────────────────────────────────┘
450
+ ```
451
+
452
+ ### Key Design Decisions
453
+
454
+ | Decision | Rationale |
455
+ |---|---|
456
+ | **Managed `.venv`** | The package owns its Python environment. Never touches the user's global Python. Reproducible, isolated, self-contained. |
457
+ | **CPU-only PyTorch by default** | Avoids GPU memory competition with Ollama. Smaller venv (~1-2GB vs 4GB+). Layout detection on CPU is fast enough for most documents. |
458
+ | **Ollama `/api/generate` mode** | Official GLM-OCR recommendation for Ollama. More stable than the OpenAI-compatible endpoint for vision requests. |
459
+ | **pypdfium2 for PDF rendering** | Ships its own PDFium binary in the wheel. Zero system dependencies. Renders directly to PIL images in-memory — no temp files, no subprocess calls. |
460
+ | **2000px image cap** | Balances OCR quality with model stability. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS. Prevents GGML tensor size crashes on Ollama. |
461
+ | **Full SDK config** | Generates a complete `config.yaml` matching the SDK's template on every run. All 50+ options are exposed as CLI flags. |
462
+ | **Per-page error tolerance** | A failed page gets a placeholder in the output. The rest of the document continues processing. |
463
+
464
+ ---
465
+
466
+ ## Output Format
467
+
468
+ The output Markdown file contains clear page boundaries:
469
+
470
+ ```markdown
471
+ <!-- PAGE 1 -->
472
+
473
+ # Invoice
474
+
475
+ **Invoice Number:** INV-2024-0042
476
+ **Date:** January 15, 2024
477
+
478
+ | Item | Quantity | Price |
479
+ |------|----------|-------|
480
+ | Widget A | 10 | $50.00 |
481
+ | Widget B | 5 | $75.00 |
482
+
483
+ **Total: $875.00**
484
+
485
+ ---
486
+
487
+ <!-- PAGE 2 -->
488
+
489
+ ## Terms and Conditions
490
+
491
+ 1. Payment is due within 30 days.
492
+ 2. Late payments incur a 2% monthly fee.
493
+
494
+ ---
495
+ ```
496
+
497
+ ### Page Markers
498
+
499
+ Each page is delimited by:
500
+
501
+ - `<!-- PAGE N -->` — HTML comment identifying the page number
502
+ - `---` — Markdown horizontal rule as a visual separator
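Because the markers are plain HTML comments, downstream tools can split the output back into pages. The sketch below is a consumer-side example, not part of the package; `split_pages` is a hypothetical helper that relies only on the two delimiters listed above.

```python
import re

# Matches a line that holds exactly "<!-- PAGE N -->".
PAGE_MARKER = re.compile(r"^<!-- PAGE (\d+) -->[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to its Markdown body, without the delimiters."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(markdown))
    for index, match in enumerate(matches):
        # A page runs from its marker to the next marker (or end of text).
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(markdown)
        body = markdown[match.end():end].strip()
        if body.endswith("---"):
            # Drop the trailing horizontal-rule separator.
            body = body[:-3].rstrip()
        pages[int(match.group(1))] = body
    return pages
```

The pattern requires ` -->` immediately after the page number, so a failure placeholder such as `<!-- PAGE 4: OCR failed ... -->` stays inside the page body instead of being treated as a new page.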
503
+
504
+ ### Failed Pages
505
+
506
+ If a page fails OCR (e.g., Ollama timeout, model error), it gets a placeholder:
507
+
508
+ ```markdown
509
+ <!-- PAGE 4 -->
510
+
511
+ <!-- PAGE 4: OCR failed — API request failed after 3 attempts -->
512
+
513
+ ---
514
+ ```
515
+
516
+ The rest of the document continues processing normally.
517
+
518
+ ---
519
+
520
+ ## Configuration
521
+
522
+ ### Environment Variables
523
+
524
+ | Variable | Default | Description |
525
+ |---|---|---|
526
+ | `GLMOCR_GPU` | `0` | Set to `1` during install to use GPU PyTorch instead of CPU-only. |
527
+
528
+ ### Internal Config (auto-generated)
529
+
530
+ The CLI generates a temporary YAML config for each run. All SDK options are exposed as CLI flags:
531
+
532
+ ```yaml
533
+ # Example of generated config (abbreviated)
534
+ pipeline:
535
+ maas:
536
+ enabled: false
537
+ ocr_api:
538
+ api_host: localhost
539
+ api_port: 11434
540
+ api_path: /api/generate
541
+ api_mode: ollama_generate
542
+ model: glm-ocr:latest
543
+ connect_timeout: 30
544
+ request_timeout: 120
545
+ max_workers: 1
546
+ page_maxsize: 100
547
+ region_maxsize: 2000
548
+ page_loader:
549
+ max_tokens: 8192
550
+ temperature: 0.0
551
+ top_p: 0.00001
552
+ top_k: 1
553
+ repetition_penalty: 1.1
554
+ image_format: PNG
555
+ min_pixels: 12544
556
+ max_pixels: 71372800
557
+ result_formatter:
558
+ output_format: markdown
559
+ enable_merge_formula_numbers: true
560
+ enable_merge_text_blocks: true
561
+ enable_format_bullet_points: true
562
+ layout:
563
+ device: "cpu"
564
+ threshold: 0.3
565
+ batch_size: 1
566
+ use_polygon: false
567
+ layout_nms: true
568
+ layout_merge_bboxes_mode: large
569
+ ```
570
+
571
+ This config is written to a temp directory before each run and cleaned up afterward. Users don't need to manage it manually.
572
+
573
+ ---
574
+
575
+ ## GPU Support
576
+
577
+ The default installation uses CPU-only PyTorch for layout detection. This is intentional:
578
+
579
+ 1. **No GPU competition** — Ollama loads the glm-ocr model into GPU VRAM. Running layout detection on the same GPU can cause OOM errors.
580
+ 2. **Smaller venv** — CPU-only PyTorch is ~500MB versus ~4GB for the CUDA build, which keeps the full venv at ~1-2GB instead of 4GB+.
581
+ 3. **Fast enough** — PP-DocLayoutV3 is lightweight and runs quickly on CPU for typical document sizes.
582
+
583
+ ### Enabling GPU
584
+
585
+ If you have ample GPU memory and want faster layout detection:
586
+
587
+ ```bash
588
+ # Uninstall the CPU-only version
589
+ npm uninstall -g glmmedia-ocr
590
+
591
+ # Reinstall with GPU PyTorch
592
+ GLMOCR_GPU=1 npm install -g glmmedia-ocr
593
+ ```
594
+
595
+ Then use `--layout-device cuda` when scanning:
596
+
597
+ ```bash
598
+ glmmedia-ocr scan document.pdf --layout-device cuda
599
+ ```
600
+
601
+ ### Recommended GPU Setup
602
+
603
+ If running both Ollama (glm-ocr model) and layout detection on the same GPU:
604
+
605
+ - **GPU with 12GB+ VRAM** — glm-ocr takes ~2.2GB, layout detection takes ~1-2GB
606
+ - **Use `--concurrency 1`** — Avoids queuing multiple OCR requests that could spike memory
607
+ - **Monitor with `nvidia-smi`** — Watch for OOM during processing
608
+
609
+ ---
610
+
611
+ ## Troubleshooting
612
+
613
+ ### Python not found or unsupported version
614
+
615
+ ```
616
+ ✗ Python 3.12+ not found on PATH. Install from python.org
617
+ ```
618
+
619
+ **Fix:** Install Python 3.12 or 3.13 from [python.org](https://www.python.org/downloads/). Make sure it's on your PATH. Python 3.14+ is not yet supported because key dependencies (PyTorch, Transformers) don't publish 3.14 wheels yet.
620
+
621
+ ```bash
622
+ # Verify
623
+ python --version # Should show 3.12.x or 3.13.x
624
+ ```
625
+
626
+ ### Ollama not found
627
+
628
+ ```
629
+ ✗ Ollama not found on PATH. Install from https://ollama.com/download
630
+ ```
631
+
632
+ **Fix:** Install Ollama from [ollama.com/download](https://ollama.com/download).
633
+
634
+ ```bash
635
+ # Verify
636
+ ollama --version
637
+ ```
638
+
639
+ ### SDK installation failed
640
+
641
+ ```
642
+ ✗ GLM-OCR SDK installation failed. Run 'npm rebuild glmmedia-ocr' to retry.
643
+ ```
644
+
645
+ **Fix:** Rebuild the package:
646
+
647
+ ```bash
648
+ npm rebuild glmmedia-ocr
649
+ ```
650
+
651
+ If that fails, try a clean reinstall:
652
+
653
+ ```bash
654
+ npm uninstall -g glmmedia-ocr
655
+ npm install -g glmmedia-ocr
656
+ ```
657
+
658
+ ### Model pull failed
659
+
660
+ ```
661
+ ✗ ollama pull failed with code 1
662
+ ```
663
+
664
+ **Fix:** Check your internet connection and try again. The model is ~2.2GB and requires a stable connection.
665
+
666
+ ```bash
667
+ # Manual pull to debug
668
+ ollama pull glm-ocr:latest
669
+ ```
670
+
671
+ ### Ollama won't start
672
+
673
+ ```
674
+ ✗ Ollama did not become healthy within 15s
675
+ ```
676
+
677
+ **Fix:** Start Ollama manually and check for errors:
678
+
679
+ ```bash
680
+ ollama serve
681
+ # In another terminal:
682
+ ollama list
683
+ ```
684
+
685
+ If Ollama is already running on a different port, use `--ollama-host`:
686
+
687
+ ```bash
688
+ glmmedia-ocr scan document.pdf --ollama-host localhost:11435
689
+ ```
690
+
691
+ ### OCR timeout on large documents
692
+
693
+ ```
694
+ Error: OCR failed — API request failed after 3 attempts
695
+ ```
696
+
697
+ **Fix:** Increase the request timeout or reduce concurrency:
698
+
699
+ ```bash
700
+ # Single worker (most stable) with a longer per-request timeout (default: 120 seconds)
700
+ glmmedia-ocr scan large-document.pdf --concurrency 1 --request-timeout 300
702
+
703
+ # If using a remote Ollama, ensure the network is stable
704
+ glmmedia-ocr scan document.pdf --ollama-host 192.168.1.100:11434
705
+ ```
706
+
707
+ ### Out of memory
708
+
709
+ ```
710
+ Error: CUDA out of memory
711
+ ```
712
+
713
+ **Fix:** Use CPU for layout detection:
714
+
715
+ ```bash
716
+ glmmedia-ocr scan document.pdf --layout-device cpu
717
+ ```
718
+
719
+ Or reduce concurrency:
720
+
721
+ ```bash
722
+ glmmedia-ocr scan document.pdf --concurrency 1
723
+ ```
724
+
725
+ ### Corrupt or encrypted PDF
726
+
727
+ ```
728
+ Error: Failed to render PDF: ...
729
+ ```
730
+
731
+ **Fix:** Ensure the PDF is valid and not password-protected. The current version does not support encrypted PDFs. Use a tool like `qpdf` to decrypt first:
732
+
733
+ ```bash
734
+ qpdf --decrypt --password=your-password input.pdf decrypted.pdf
735
+ glmmedia-ocr scan decrypted.pdf
736
+ ```
737
+
738
+ ### No image files found in directory
739
+
740
+ ```
741
+ ✗ No image files found in directory: ./images/
742
+ ```
743
+
744
+ **Fix:** Ensure the directory contains supported image files (PNG, JPEG, WebP, BMP, TIFF, GIF). Use `--recursive` if images are in subdirectories:
745
+
746
+ ```bash
747
+ glmmedia-ocr scan ./images/ --recursive
748
+ ```
749
+
750
+ ### Input not found
751
+
752
+ ```
753
+ ✗ Input not found: ./missing.pdf
754
+ ```
755
+
756
+ **Fix:** Check the file path and ensure the input exists.
757
+
758
+ ---
759
+
760
+ ## Project Structure
761
+
762
+ ```
763
+ glmmedia-ocr/
764
+ ├── bin/
765
+ │ └── glmmedia-ocr.js # npm CLI entry point
766
+ │ # - Thin wrapper: finds .venv Python
767
+ │ # - Delegates to lib/pipeline.py
768
+
769
+ ├── scripts/
770
+ │ └── postinstall.js # npm package setup
771
+ │ # - Creates .venv
772
+ │ # - pip install glmocr[selfhosted] + CPU torch
773
+ │ # - Verifies installation
774
+
775
+ ├── lib/
776
+ │ └── pipeline.py # PDF/Image-to-Markdown pipeline (npm path)
777
+ │ # - pypdfium2: PDF → PIL images (2000px cap)
778
+ │ # - PIL: load images directly (2000px cap)
779
+ │ # - GlmOcr SDK: layout detection + OCR
780
+ │ # - Logging: surfaces SDK progress to stderr
781
+ │ # - Merge with page markers → .md
782
+
783
+ ├── src/glmmedia_ocr/ # Pure Python CLI package (pip path)
784
+ │ ├── __init__.py # Package version
785
+ │ ├── __main__.py # python -m glmmedia_ocr entry
786
+ │ ├── cli.py # Full CLI: args, Ollama, config, spinner
787
+ │ ├── config.py # Config YAML generation
788
+ │ ├── inputs.py # Input resolution (files, dirs, types)
789
+ │ ├── ollama.py # Ollama lifecycle management
790
+ │ ├── pipeline.py # Rendering + OCR + output
791
+ │ └── spinner.py # Animated terminal spinner
792
+
793
+ ├── pyproject.toml # Python package metadata + deps
794
+ ├── .venv/ # Created at npm install time (gitignored)
795
+ ├── .gitignore
796
+ ├── package.json # npm package metadata
797
+ └── README.md
798
+ ```
799
+
800
+ ### Distribution Channels
801
+
802
+ | Channel | Entry point | Code path |
803
+ |---|---|---|
804
+ | **npm** | `bin/glmmedia-ocr.js` | JS wrapper → `lib/pipeline.py` |
805
+ | **pip** | `src/glmmedia_ocr/cli.py` | Pure Python (full implementation) |
806
+
807
+ Both provide the same command-line interface and functionality. They are independent implementations, so changes to one should be mirrored in the other.
808
+
809
+ ### What's NOT Here
810
+
811
+ | Not included | Why |
812
+ |---|---|
813
+ | `node_modules/` | Zero npm dependencies — uses Node.js built-ins only |
814
+ | `vendor/poppler/` | pypdfium2 ships its own PDFium binary in its pip wheel |
815
+ | `config.yaml` | Generated dynamically per run, cleaned up after |
816
+ | `*.md` output files | Generated by the CLI, not part of the package |
817
+ | `dist/`, `build/`, `*.egg-info/` | Build artifacts (gitignored) |
818
+
819
+ ---
820
+
821
+ ## Under the Hood
822
+
823
+ ### Input Resolution
824
+
825
+ The CLI accepts PDFs, images, and directories. When a directory is passed, it collects all supported image files (flat or recursive with `--recursive`). Mixed input types (PDF + image) are supported — pages are merged in input order into a single output file with sequential `<!-- PAGE N -->` markers.
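A minimal sketch of this resolution step, using only `pathlib`, might look like the following. `resolve_inputs` is a hypothetical name, the exact extension set is an assumption based on the formats listed in the CLI reference, and sorting directory contents by path is a choice made here for determinism.

```python
from pathlib import Path

# Assumed extensions for the documented formats (PNG, JPEG, WebP, BMP, TIFF, GIF).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff", ".gif"}


def resolve_inputs(inputs: list[str], recursive: bool = False) -> list[Path]:
    """Expand files and directories into an ordered list of PDFs and images."""
    resolved: list[Path] = []
    for raw in inputs:
        path = Path(raw)
        if path.is_dir():
            candidates = path.rglob("*") if recursive else path.iterdir()
            images = sorted(
                item for item in candidates
                if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES
            )
            if not images:
                raise FileNotFoundError(f"No image files found in directory: {raw}")
            resolved.extend(images)
        elif path.is_file():
            suffix = path.suffix.lower()
            if suffix != ".pdf" and suffix not in IMAGE_SUFFIXES:
                raise ValueError(f"Unsupported input type: {raw}")
            resolved.append(path)
        else:
            raise FileNotFoundError(f"Input not found: {raw}")
    return resolved
```

Inputs keep the order they were given on the command line, which is what allows pages to be numbered sequentially across mixed PDFs and images.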
826
+
827
+ ### PDF Rendering
828
+
829
+ Uses **pypdfium2**, which bundles the PDFium engine (same as Chromium). Renders PDF pages directly to PIL images in-memory at the specified DPI. Images exceeding 2000px on their longest dimension are downscaled via LANCZOS resampling. No temp files, no subprocess calls, no system dependencies.
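To make the DPI setting concrete: PDF page sizes are expressed in points (72 per inch), so the rendered pixel size is the page size multiplied by `dpi / 72`. The arithmetic below is a sketch; `rendered_size` is a hypothetical helper, and rounding to the nearest pixel is an assumption.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 200) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at `dpi`."""
    # Multiply before dividing so common page sizes stay exact in floating point.
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


if __name__ == "__main__":
    # US Letter is 612 x 792 points.
    print(rendered_size(612, 792, dpi=200))
    print(rendered_size(612, 792, dpi=300))
```

With pypdfium2, the same ratio is what would be passed as the `scale` argument of `page.render(...)`. Note that a Letter page at the default 200 DPI is 2200px on its long side, so under the 2000px cap described here it would already be downscaled after rendering.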
830
+
831
+ ### Image Loading
832
+
833
+ Images are opened with PIL and capped to 2000px on their longest dimension via LANCZOS resampling. This ensures consistent quality while preventing GGML tensor size crashes on Ollama.
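The cap itself is a proportional resize on the longest side. The sketch below computes only the target size, so it runs without Pillow; `capped_size` is a hypothetical name and the rounding behaviour is an assumption.

```python
def capped_size(width: int, height: int, cap: int = 2000) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most `cap`."""
    longest = max(width, height)
    if longest <= cap:
        return width, height  # already small enough, never upscale
    ratio = cap / longest
    # Keep the aspect ratio and never return a zero-sized dimension.
    return max(1, round(width * ratio)), max(1, round(height * ratio))


if __name__ == "__main__":
    print(capped_size(4000, 2000))
    print(capped_size(800, 600))
```

With Pillow, the result could then be applied as `image.resize(capped_size(*image.size), Image.Resampling.LANCZOS)`.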
834
+
835
+ ### Layout Detection
836
+
837
+ Uses **PP-DocLayoutV3** via HuggingFace Transformers. Detects text blocks, tables, formulas, images, and other regions on each page. Runs on CPU by default to avoid GPU memory competition with Ollama. Progress is logged to stderr when `--log-level DEBUG` is used.
838
+
839
+ ### OCR
840
+
841
+ Each detected region is sent to the **glm-ocr** model via Ollama's native `/api/generate` endpoint. The model returns structured Markdown for each region.
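For orientation, a request body for Ollama's `/api/generate` endpoint could be assembled as below. The option names (`num_predict`, `temperature`, `top_p`, `top_k`, `repeat_penalty`, `num_ctx`) are Ollama's own; how the SDK maps the CLI flags onto them is an assumption, and `build_generate_payload` and the prompt text are hypothetical.

```python
import base64
import json


def build_generate_payload(
    image_bytes: bytes,
    prompt: str,
    *,
    model: str = "glm-ocr:latest",
    max_tokens: int = 8192,
    temperature: float = 0.0,
    top_p: float = 0.00001,
    top_k: int = 1,
    repetition_penalty: float = 1.1,
    num_ctx: int = 8192,
) -> dict:
    """Build a non-streaming /api/generate body for one cropped region."""
    options = {
        "num_predict": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repetition_penalty,
    }
    if num_ctx:
        # Mirrors the documented "0 = omit" behaviour of --ollama-num-ctx.
        options["num_ctx"] = num_ctx
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": options,
    }


if __name__ == "__main__":
    body = build_generate_payload(b"\x89PNG...", "Text Recognition:")
    print(json.dumps(body)[:80])
```

The body would be sent as JSON in a POST to `http://<ollama-host>/api/generate`; with `"stream": False` the reply is a single JSON object whose `response` field holds the generated text.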
842
+
843
+ ### Result Merging
844
+
845
+ Per-page results are merged with `<!-- PAGE N -->` markers and `---` separators. Failed pages get error placeholders instead of aborting the entire document.
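The merge step can be illustrated with a short function that emits the documented layout, including the failure placeholder. `PageResult` and `merge_pages` are hypothetical names, and the exact blank-line spacing is inferred from the sample output in this README rather than from the package source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    markdown: Optional[str] = None  # OCR output for the page, if it succeeded
    error: Optional[str] = None     # failure reason, if it did not


def merge_pages(results: list[PageResult]) -> str:
    """Join per-page results with page markers and '---' separators."""
    chunks = []
    for number, result in enumerate(results, start=1):
        if result.error is not None:
            # \u2014 is the dash used in the documented placeholder format.
            body = f"<!-- PAGE {number}: OCR failed \u2014 {result.error} -->"
        else:
            body = (result.markdown or "").strip()
        chunks.append(f"<!-- PAGE {number} -->\n\n{body}\n\n---\n")
    return "\n".join(chunks)
```

Because a failed page still produces a chunk, page numbering in the output stays aligned with the input even when individual pages cannot be read.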
846
+
847
+ ---
848
+
849
+ ## License
850
+
851
+ MIT