pi-ocr 1.0.0 → 1.0.2

package/README.md CHANGED
@@ -2,393 +2,111 @@
 
  > ### ⚡ Zero setup. Works out of the box.
  >
- > Default backend: **MinerU** — a free cloud API. No install, no GPU, no API key.
- > Just `pi install npm:pi-ocr` and OCR anything.
+ > Default backend is **MinerU** — a free cloud API.
+ > No GPU, no API key, no pip install. Just `pi install` and `/ocr`.
 
- Multi-backend OCR for [Pi Coding Agent](https://pi.dev) extract text, LaTeX math formulas, and tables from images and PDFs. Choose the backend that fits your needs: free cloud API, local GPU, or pure Python.
-
- > Bridges the multimodal gap for non-vision LLMs like **DeepSeek**. When your model can't see images, `minimodel_ocr` acts as its eyes.
-
- ## Three Backends — One Tool
-
- | Backend | Type | Best For |
- |---|---|---|
- | 🦙 **Ollama** | Local GPU | Math formulas (LaTeX), privacy, offline |
- | ☁️ **MinerU** | Free cloud API | Complex PDFs, no GPU, zero setup |
- | 📐 **Pix2Text** | Local Python | Math formulas + text, free Mathpix alternative |
-
- Switch anytime with `/ocr` (no args) — a visual `SettingsList` menu lets you pick and configure everything without editing JSON:
-
- ```
- /ocr → opens settings: backend, model, PDF split toggle
- /ocr <file> [task] → OCR a file
- ```
-
- ## Features
-
- | | |
- |---|---|
- | 🔤 **Text** | General text recognition → Markdown |
- | 🧮 **Formulas** | Math formulas → LaTeX (Ollama glm-ocr: 94.6 OmniDocBench) |
- | 📊 **Tables** | Table structure → Markdown tables |
- | 🖼️ **Figures** | Diagrams and illustrations → descriptions |
- | 📄 **PDF** | Full PDF support across all backends |
- | 🎛️ **Any model** | Ollama works with glm-ocr, llama3.2-vision, minicpm-v, etc. |
- | ☁️ **Free cloud** | MinerU Agent API: no token, ≤10MB, ≤20 pages free |
- | 📦 **Auto-split** | MinerU splits PDFs >20 pages into free-tier chunks |
+ OCR for [Pi Coding Agent](https://pi.dev). Bridges the multimodal gap for non-vision LLMs like DeepSeek: when your model can't see images, `pi_ocr` reads them for you.
 
  ---
 
  ## Quickstart
 
- ### One command
-
  ```bash
  pi install npm:pi-ocr
+ /ocr ./screenshot.png
+ /ocr ./paper.pdf
  ```
 
- **That's it.** The default backend is ☁️ **MinerU** — a free cloud API with zero setup.
- Start OCR'ing immediately:
-
- ```
- /ocr ./scan.png
- /ocr ./document.pdf
- ```
-
- > 💡 Want offline OCR or math formulas? Switch backends anytime with `/ocr` (no args).
+ **That's all.** MinerU (free cloud API) is the default and requires nothing.
 
  ---
 
- ### Optional: set up other backends
+ ## Backends
 
- Only needed if you want to switch from the default MinerU.
+ Switch anytime with `/ocr` (no args).
+
+ | | Backend | Best for | Setup |
+ |---|---|---|---|
+ | ☁️ | **MinerU** (default) | PDFs, tables, general docs | None — works instantly |
+ | 🦙 | Ollama | Math formulas → LaTeX, offline | `brew install ollama && ollama pull glm-ocr` |
+ | 📐 | Pix2Text | Math + text, CPU-only | `pip install pix2text` |
 
  ---
 
- ### 🦙 Ollama setup
+ ## MinerU (default)
 
- #### macOS
+ Free cloud API. Handles PDF, PNG, JPG, Docx, PPTx, Xlsx.
 
- ```bash
- # 1. Install Ollama
- brew install ollama
+ **Limits:** ≤10MB per file, ≤20 pages per request.
 
- # 2. Pull the default OCR model (~2.2 GB)
- ollama pull glm-ocr
+ PDFs >20 pages can be auto-split (enabled by default in `/ocr` settings). Splitting needs `pip install pypdfium2`.
 
- # 3. Multi-page PDF support (optional but recommended)
- brew install poppler
- ```
+ ---
 
- > macOS uses built-in `sips` for single-page PDFs — zero extra deps for those.
- > Multi-page PDFs need `poppler` for the `pdftoppm` tool.
+ ## Ollama (optional, for math formulas)
 
- #### Linux
+ Local GPU OCR via [glm-ocr](https://ollama.com) — state-of-the-art formula recognition (94.6 OmniDocBench). Outputs LaTeX.
 
  ```bash
- # 1. Install Ollama
- curl -fsSL https://ollama.com/install.sh | sh
-
- # 2. Pull the default OCR model (~2.2 GB)
+ # macOS
+ brew install ollama
  ollama pull glm-ocr
+ brew install poppler # multi-page PDFs only
 
- # 3. PDF support (required on Linux)
- sudo apt install poppler-utils # Debian/Ubuntu
- sudo dnf install poppler-utils # Fedora
- sudo pacman -S poppler # Arch
+ # Linux
+ curl -fsSL https://ollama.com/install.sh | sh
+ ollama pull glm-ocr
+ sudo apt install poppler-utils
  ```
 
- #### Verify
-
- ```bash
- # Check Ollama is running and model is pulled
- ollama list | grep glm-ocr
- ```
+ Switch with `/ocr` → "OCR Backend" → ollama.
 
  ---
 
- ### 📐 Pix2Text setup
-
- Pix2Text runs entirely in Python. Handles images and PDFs via a single API call — no manual conversion needed.
-
- #### Step 1: Make sure you have Python 3.9+ (macOS/Linux)
-
- | System | Check |
- |---|---|
- | macOS/Linux | `python3 --version` |
-
- > ⚠️ **Important:** Know which Python you're using. Run `which python3` — if it shows `conda`, `brew`, or `/usr/bin/python3`, your `pip install` must target the same Python:
- > ```bash
- > # Conda Python
- > pip install pix2text
- >
- > # System Python (may need --user or sudo)
- > pip install --user pix2text
- >
- > # Brew Python (macOS)
- > /opt/homebrew/bin/pip3 install pix2text
- > ```
- > If unsure, use `python3 -m pip install ...` — this always installs for the active `python3`.
-
- #### Step 2: Install packages
+ ## Pix2Text (optional, for offline CPU)
 
- ```bash
- python3 -m pip install pix2text
- ```
-
- #### Step 3: Verify
+ Local Python OCR. Mathpix alternative — handles text + formulas on CPU.
 
  ```bash
- python3 -c "from pix2text import Pix2Text; print('OK')"
+ pip install pix2text
  ```
 
- > First run downloads ONNX models (~50MB) to `~/.pix2text/`.
-
- > First run downloads ONNX models (~50MB) to `~/.pix2text/`.
-
+ First run downloads ONNX models (~50MB). Switch with `/ocr` → "OCR Backend" → pix2text.
 
  ---
 
- ### ☁️ MinerU (default — already working!)
-
- **No setup required.** The free Agent API works immediately. No token, no account, no install.
-
- Free tier limits:
- - ≤ 10 MB per file
- - ≤ 20 pages per request
- - IP-based rate limiting
-
- For files >10MB, compress first at [ilovepdf.com/compress_pdf](https://ilovepdf.com/compress_pdf).
-
- ---
-
- ## Usage
-
- ### Settings UI
-
- ```
- /ocr
- ```
-
- Opens an interactive `SettingsList` with keyboard navigation:
+ ## Commands
 
- ```
- ┌─ OCR Settings ─────────────────────────────────┐
- │ OCR Backend [ollama / mineru / pix2text] │
- │ MinerU: Split PDF [ON / OFF] │
- │ Ollama Model [glm-ocr] │
- │ ↑↓ navigate • ← → toggle • enter select • esc close │
- └──────────────────────────────────────────────────┘
- ```
-
- - **Backend**: ← → to cycle ollama/mineru/pix2text — saves immediately
- - **MinerU Split**: ← → to toggle ON/OFF — when ON, PDFs >20 pages are auto-split
- - **Model**: Enter opens a sub-menu with recommended models + custom input
-
- ### OCR a file
-
- ```
- /ocr <file> [task] [model]
- ```
-
- | Example | Result |
+ | Command | |
  |---|---|
- | `/ocr ./scan.png` | Auto-detect all content |
- | `/ocr ./equation.jpg formula` | LaTeX formula output |
- | `/ocr ./contract.pdf text` | Text-only extraction |
- | `/ocr ./paper.pdf auto glm-ocr:q8_0` | Use specific model |
+ | `/ocr` | Open settings (backend, model, split toggle) |
+ | `/ocr <file> [task]` | OCR a file |
+ | `/ocr <file> formula` | Math LaTeX output |
 
  ### Tasks
 
- | Task | Description | Output format |
- |---|---|---|
- | `auto` | Full document OCR (default) | Markdown + LaTeX mixed |
- | `text` | Plain text recognition | Markdown |
- | `formula` | Math formula recognition | LaTeX |
- | `table` | Table structure recognition | Markdown tables |
- | `figure` | Figure / diagram description | Natural language |
-
- ### LLM-invoked (automatic)
-
- The extension registers a `minimodel_ocr` tool. The agent invokes it automatically:
-
- ```
- > What formula is written in this screenshot?
- > OCR this 50-page PDF into markdown.
- ```
-
- ---
-
- ## MinerU PDF Splitting
-
- When `MinerU: Split PDF >20 pages` is ON (default), large PDFs are automatically split into ≤20-page chunks. Each chunk is a separate request with 3s spacing:
-
- ```
- Splitting 85-page PDF into ≤20-page chunks…
- [1/5] uploading…
- [1/5] running (12s)
- [1/5] done
- [2/5] waiting rate limit…
- [2/5] uploading…
- [2/5] running (18s)
- [2/5] done
- ...
- ```
-
- ---
-
- ## Backend Comparison
-
- | | 🦙 Ollama | ☁️ MinerU | 📐 **Pix2Text** |
- |---|---|---|---|
- | **Setup** | Install Ollama + pull model | None | `pip install` 3 packages |
- | **GPU needed** | Recommended | No | No |
- | **Internet** | No | Yes | No (first run: yes) |
- | **Math formulas** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐ |
- | **Complex PDFs** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
- | **Chinese text** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
- | **File size limit** | None | 10MB (free) | None |
- | **Page limit** | None | 20/request (free) | None |
- | **Cost** | Free (local) | Free (rate-limited) | Free (local) |
-
- ---
-
- ## Supported File Types
-
- | Format | Ollama | MinerU | Pix2Text |
- |---|---|---|---|
- | PNG, JPG, GIF, WEBP, BMP, TIFF | ✅ | ✅ | ✅ |
- | PDF | ✅ | ✅ | ✅ |
- | Docx, PPTx, Xlsx | ❌ | ✅ | ❌ |
-
- ---
-
- ## PDF Support Details
-
- | Backend | Conversion method | System deps |
- |---|---|---|
- | **Ollama** | `sips` (macOS page 1) / `pdftoppm` (multi-page, Linux) | `poppler` (multi-page only) |
- | **MinerU** | Direct PDF upload — no conversion | None |
- | **Pix2Text** | Built-in `recognize_pdf()` — no system deps | None |
-
- ---
-
- ## Configuration
-
- All settings are persisted to `~/.pi/agent/settings.json`:
-
- ```json
- {
- "minimodelOcr": {
- "backend": "ollama",
- "model": "glm-ocr",
- "ollamaHost": "http://localhost:11434",
- "mineruSplitPdf": true
- }
- }
- ```
-
- Change settings via `/ocr` (interactive) or edit directly. Environment variables override file settings:
-
- ```bash
- export OLLAMA_HOST="http://localhost:11434"
- export OCR_MODEL="glm-ocr"
- ```
+ | Task | Output |
+ |---|---|
+ | `auto` (default) | Markdown + LaTeX |
+ | `text` | Plain Markdown |
+ | `formula` | LaTeX only |
+ | `table` | Markdown tables |
+ | `figure` | Description |
 
  ---
 
  ## Troubleshooting
 
- ### Ollama: "fetch failed" / "ECONNREFUSED"
-
- ```bash
- # Start Ollama in the background
- ollama serve
- ```
+ **"Is Ollama running?"** `ollama serve`
 
- ### Ollama: "model not found"
+ **MinerU 429** Rate limited. Wait a minute or switch backend.
 
- ```bash
- ollama pull glm-ocr
- ```
+ **"python3 not found" (Pix2Text)** → `python3 -m pip install pix2text`
 
- ### Pix2Text: "python3 not found"
-
- ```bash
- # Check your python:
- which python3 && python3 --version
-
- # If using conda:
- conda activate base && pip install pix2text
-
- # If using system python:
- python3 -m pip install --user pix2text
- ```
-
- ### Pix2Text: "No module named 'pix2text'"
-
- You likely installed with a different `pip` than your active `python3`:
-
- ```bash
- python3 -m pip install pix2text
- ```
- ```
-
- ### MinerU: "429 Too Many Requests"
-
- IP rate limit hit. Wait 1-2 minutes, or switch to Ollama/Pix2Text with `/ocr`.
-
- ### MinerU: "file page count exceeds lightweight API limit"
-
- Enable PDF splitting: `/ocr` → toggle "MinerU: Split PDF" to ON.
-
- ### MinerU: "File too large for free MinerU API"
-
- Compress the PDF at [ilovepdf.com/compress_pdf](https://ilovepdf.com/compress_pdf) or switch to a local backend with `/ocr`.
-
- ### macOS multi-page PDF: "pdftoppm not found"
-
- ```bash
- brew install poppler
- ```
-
- ### Linux multi-page PDF: "pdftoppm not found"
-
- ```bash
- # Debian/Ubuntu
- sudo apt install poppler-utils
-
- # Fedora
- sudo dnf install poppler-utils
-
- # Arch
- sudo pacman -S poppler
- ```
+ **"pdftoppm not found" (Ollama multi-page)** → `brew install poppler` (macOS) / `sudo apt install poppler-utils` (Linux)
 
  ---
 
- ## How It Works
-
- ```
- ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────────┐
- │ pi (DeepSeek) │────▶│ minimodel_ocr │────▶│ Ollama / MinerU │
- │ (no vision) │ │ pi extension │ │ / Pix2Text │
- └──────────────────┘ └──────────────────┘ └──────────────────────┘
- │ │ │
- │ "read this image" │ POST /api/generate │
- │────────────────────────▶│ (Ollama) │
- │ │ or POST /api/v1/agent │
- │ │ (MinerU) │
- │ │ or python3 subprocess │
- │ │ (Pix2Text) │
- │ │──────────────────────────▶│
- │ │ OCR text response │
- │ LaTeX / Markdown │◀──────────────────────────│
- │◀────────────────────────│ │
- ```
-
- The tool dispatches to your selected backend. Switch anytime with `/ocr` — no restart needed.
-
  ## License
 
  MIT
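
Note on the MinerU limit: the README on both sides of this diff states a cap of 20 pages per request, and the removed 1.0.0 text shows an 85-page PDF being processed as 5 chunks. The sketch below illustrates only the page-range arithmetic behind such a split. `chunkPageRanges` and `PageRange` are hypothetical names chosen for illustration; nothing here is taken from the pi-ocr source, and the actual PDF splitting (which the README says needs `pypdfium2`) is not shown.

```typescript
// Hypothetical helper, not part of pi-ocr: computes 1-based inclusive page
// ranges so that no chunk exceeds the per-request page limit.
interface PageRange {
  start: number; // first page of the chunk (1-based, inclusive)
  end: number;   // last page of the chunk (1-based, inclusive)
}

function chunkPageRanges(totalPages: number, maxPerChunk: number = 20): PageRange[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(maxPerChunk) || maxPerChunk < 1) {
    throw new RangeError("maxPerChunk must be a positive integer");
  }
  const ranges: PageRange[] = [];
  // Step through the document one chunk at a time; the last chunk may be short.
  for (let start = 1; start <= totalPages; start += maxPerChunk) {
    ranges.push({ start, end: Math.min(start + maxPerChunk - 1, totalPages) });
  }
  return ranges;
}

// Example: an 85-page PDF under the assumed 20-page limit.
console.log(chunkPageRanges(85));
```

For 85 pages this yields five ranges, the last covering pages 81 to 85, which is consistent with the `[1/5]` progress lines in the removed README text.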
@@ -1,7 +1,7 @@
  /**
- * pi-minimodel-ocr — Multi-backend OCR for Pi Coding Agent
+ * pi-ocr — Multi-backend OCR for Pi Coding Agent
  *
- * Registers a `minimodel_ocr` tool that the LLM can call to read images and PDFs
+ * Registers a `pi_ocr` tool that the LLM can call to read images and PDFs
  * using one of three backends:
  * - Ollama (local vision models like glm-ocr)
  * - MinerU API (free Agent API, ≤10MB, ≤20 pages)
@@ -19,7 +19,7 @@
  * Pix2Text: pip install pix2text
  * PDF tools: brew install poppler (macOS multi-page PDF for Ollama)
  *
- * Install: pi install npm:pi-minimodel-ocr
+ * Install: pi install npm:pi-ocr
  */
 
  import { Type } from "@earendil-works/pi-ai";
@@ -111,7 +111,7 @@ const ocrSchema = Type.Object({
  });
 
  const ocrTool = defineTool({
- name: "minimodel_ocr",
+ name: "pi_ocr",
  label: "Minimodel OCR",
  description:
  "Extract text, math formulas (LaTeX), and tables from images or PDFs using local Ollama vision models. " +
@@ -120,9 +120,9 @@ const ocrTool = defineTool({
  promptSnippet:
  "Extract text/formulas/tables from images and PDFs using local Ollama OCR",
  promptGuidelines: [
- "When the user asks about the content of an image or PDF, use minimodel_ocr to extract the text first.",
- "For mathematical documents, use minimodel_ocr with task='formula' or task='auto' to get LaTeX output.",
- "Use minimodel_ocr with task='auto' for general document OCR to extract all text, formulas, tables, and figures.",
+ "When the user asks about the content of an image or PDF, use pi_ocr to extract the text first.",
+ "For mathematical documents, use pi_ocr with task='formula' or task='auto' to get LaTeX output.",
+ "Use pi_ocr with task='auto' for general document OCR to extract all text, formulas, tables, and figures.",
  ],
  parameters: ocrSchema,
  async execute(_toolCallId, params, signal, onUpdate, _ctx) {
@@ -408,7 +408,7 @@ export default function ocrExtension(pi: ExtensionAPI) {
  const text = config.backend === "ollama"
  ? `OCR: ollama ${config.model}`
  : `OCR: ${config.backend}`;
- ctx.ui.setStatus("minimodel-ocr", text);
+ ctx.ui.setStatus("pi-ocr", text);
  }
 
  // ── Startup ────────────────────────────────────────────────────────────────
@@ -430,5 +430,5 @@ export default function ocrExtension(pi: ExtensionAPI) {
  }
  });
 
- console.log("[pi-ocr] Loaded — /ocr (file or settings), tool: minimodel_ocr, default: mineru");
+ console.log("[pi-ocr] Loaded — /ocr (file or settings), tool: pi_ocr, default: mineru");
  }
@@ -1,5 +1,5 @@
  /**
- * pi-minimodel-ocr — MinerU API backend
+ * pi-ocr — MinerU API backend
  *
  * Uses the free Agent Lightweight API (no token required):
  * - File ≤10MB, ≤20 pages → one free request
@@ -1,5 +1,5 @@
  /**
- * pi-minimodel-ocr — Ollama backend
+ * pi-ocr — Ollama backend
  *
  * Uses any locally-running Ollama vision model (default: glm-ocr) to OCR
  * images and PDFs. Converts PDF pages to PNG before sending to Ollama.
@@ -1,5 +1,5 @@
  /**
- * pi-minimodel-ocr — Pix2Text backend
+ * pi-ocr — Pix2Text backend
  *
  * Uses Pix2Text (https://github.com/breezedeus/Pix2Text) — an open-source
  * Python alternative to Mathpix. Recognizes layouts, text, math formulas (LaTeX),
@@ -1,5 +1,5 @@
  /**
- * pi-minimodel-ocr — shared types for OCR backends
+ * pi-ocr — shared types for OCR backends
  */
 
  export const TASKS = ["text", "formula", "table", "figure", "auto"] as const;
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "pi-ocr",
- "version": "1.0.0",
+ "version": "1.0.2",
  "description": "Pi extension: Zero-setup multi-backend OCR — MinerU (free cloud), Ollama (local GPU, LaTeX formulas), Pix2Text (local Python). Extract text, formulas, and tables from images and PDFs. Default: zero config, works out of the box.",
  "keywords": [
  "pi-package",