mimo2codex 0.1.16 → 0.1.17

@@ -0,0 +1,295 @@
1
+ # mimoskill · 详细介绍
2
+
3
+ > [English](./mimoskill.md) · 中文
4
+ >
5
+ > 回到:[README English](../README.md) · [README 中文](../README.zh.md)
6
+
7
+ `mimoskill/` 是仓库根目录下一捆**辅助脚本 + 参考文档**。它存在的原因是有些事 MiMo / DeepSeek / 大多数纯文本 LLM 原生做不了(图像生成、纯文本模型看图、…),而 Codex 在客户端硬编码了一些能力假设,代理层压根改不动。
8
+
9
+ 代理(mimo2codex)和 mimoskill **完全独立**:不跑 mimo2codex 也能用 mimoskill,反之亦然。两者通过**约定**协作:代理检测到能力缺口时,会在消息里塞占位文本,指向对应的 `mimoskill/scripts/*.py`。
10
+
11
+ ## 什么时候会触发?
12
+
13
+ > 一句话:**"模型能做的事 proxy 透传,模型做不了的事 mimoskill 兜底。"**
14
+
15
+ | 能力 | 当前 chat 模型能做 | 当前 chat 模型做不了 |
16
+ |---|---|---|
17
+ | 看图 / OCR / 识图 | proxy 透传图片给模型;**mimoskill 不触发** | proxy 剥掉图片、塞 `[N image attachment(s) omitted: … python3 mimoskill/scripts/ocr.py <path> …]` 占位文本;LLM 读到占位 + AGENTS.md 后 **去跑 `ocr.py`** |
18
+ | 图像生成 | 没有任何主流 chat 模型自带 image-gen | **mimoskill 永远触发** —— `scripts/generate_image.py` 或 `scripts/generate_pet.py` |
19
+ | 联网搜索 | proxy 在 MiMo `sk-*`(按量)key 下把 Codex 的 `web_search` 翻译成 MiMo 内置的;`tp-*`(套餐)key 与 DeepSeek 自动跳过 | `scripts/mimo_chat.py` 遵循同样规则 —— MiMo `sk-*` 自动启用,`tp-*` / pollinations 跳过。**无需参数** |
20
+ | TTS / ASR | Codex 没接 | `scripts/mimo_chat.py` 直接调 MiMo 的独立端点 |
21
+
22
+ 触发**发生在 LLM 这一层**,不在 proxy 层。proxy 只做协议翻译 + 最小兼容性修整(剥图、塞占位文本)。Codex 读 [AGENTS.md](../AGENTS.md) 和 [mimoskill/SKILL.md](../mimoskill/SKILL.md),看到占位文本或者用户意图后,自己决定调哪个脚本。脚本是独立子进程,**完全绕开 proxy** —— OCR 直接打 MiMo 或 pollinations,出图直接打 pollinations 或 OpenAI,等等。
23
+
24
+ ## 目录结构
25
+
26
+ ```
27
+ mimoskill/
28
+ ├── SKILL.md # 给 LLM 看的 skill 清单 —— 触发规则 + 决策树
29
+ ├── scripts/
30
+ │ ├── mimo_chat.py # 直接调 MiMo 聊天 / 视觉 / 联网搜索(纯标准库)
31
+ │ ├── ocr.py # OCR / 识图。MiMo 或免费 pollinations
32
+ │ ├── generate_image.py # 通用图像生成(任意风格 / 主题)
33
+ │ ├── generate_pet.py # Codex 宠物生成(chibi 贴纸风)
34
+ │ └── install_pet.sh # 把生成的 PNG 装到 Codex 的宠物目录
35
+ ├── references/
36
+ │ ├── models.md # MiMo 能力矩阵 + 字段坑
37
+ │ ├── ocr_workflow.md # 完整 OCR 模式参考、退出码、JSON 结构
38
+ │ └── pet_workflow.md # 单图 vs 多状态动画 bundle
39
+ └── assets/
40
+ └── pet_prompt_template.md # 调好的 chibi 贴纸提示词模板
41
+ ```
42
+
43
+ ## 脚本详解
44
+
45
+ ### `scripts/mimo_chat.py` —— 聊天 / 视觉(无 key 也能用)
46
+
47
+ 纯标准库 Python 脚本,单轮或流式聊天。两个引擎,跟 `ocr.py` 是同一套 `--engine auto|mimo|pollinations`:
48
+
49
+ | 引擎 | 需要 key | 备注 |
50
+ |---|---|---|
51
+ | `mimo` | 需要 `MIMO_API_KEY` | 最佳质量。`sk-*` key 自动启用 web_search(无需参数),TTS / ASR 也只能用这个 |
52
+ | `pollinations` | **不需要** | 免费公共端点 `text.pollinations.ai`。文本 + 视觉可用,联网搜索 / TTS / ASR 不可用 |
53
+
54
+ auto 选择:有 `MIMO_API_KEY` 用 mimo,否则 pollinations。**这个脚本现在不依赖任何 key**(纯文本 + 视觉场景)。
55
+
56
+ ```bash
57
+ # 零配置 —— 自动走 pollinations 兜底
58
+ python3 mimoskill/scripts/mimo_chat.py "讲个笑话"
59
+ python3 mimoskill/scripts/mimo_chat.py --image https://x/y.png "描述这张图"
60
+
61
+ # 最佳质量 + MiMo 原生能力(sk-* key 自动开 web_search,TTS、ASR)
62
+ export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
63
+ python3 mimoskill/scripts/mimo_chat.py "今天上海天气" # 自动带 web_search
64
+ python3 mimoskill/scripts/mimo_chat.py --model mimo-v2.5-pro --max-tokens 8000 --stream "写长一点"
65
+ ```
66
+
67
+ mimo 引擎自动处理好 MiMo 的各个坑:`max_completion_tokens`(不是 `max_tokens`)、图片必须配 `text` part、多轮 `reasoning_content` 回填、联网搜索插件调用。
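这些适配可以用一个最小草图示意(假设性示例:字段的组装方式以 `mimo_chat.py` 源码和官方文档为准,这里只演示思路):

```python
# 示意:把 OpenAI 风格的参数适配成 MiMo 兼容的 payload。
# 假设性草图,不是 mimo_chat.py 的真实实现。
def build_mimo_payload(prompt, image_url=None, max_tokens=None, model="mimo-v2.5-pro"):
    # MiMo 要求 image_url part 旁边必须有一个 text part,所以 text 总是先放进去
    content = [{"type": "text", "text": prompt}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    if max_tokens is not None:
        # MiMo 认 max_completion_tokens,不认 max_tokens
        payload["max_completion_tokens"] = max_tokens
    return payload

p = build_mimo_payload("描述这张图", image_url="https://x/y.png", max_tokens=800)
```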
68
+
69
+ | 参数 | 说明 |
70
+ |---|---|
71
+ | `--engine` | `auto` / `mimo` / `pollinations`(默认 auto) |
72
+ | `--model` | 默认 `mimo-v2.5-pro`(mimo 引擎)。视觉用 `mimo-v2.5` / `mimo-v2-omni` |
73
+ | `--pollinations-model` | 默认 `openai`(视觉能力)。可选 `openai-large` / `openai-fast` |
74
+ | `--image URL` | 附图。自动 bump 到视觉能力模型 |
75
+ | `--stream` | SSE 流式 |
76
+ | `--max-tokens N` | mimo 引擎映射到 `max_completion_tokens`,pollinations 映射到 `max_tokens` |
77
+ | `--temperature F` | 默认 0.7 |
78
+
79
+ ### `scripts/ocr.py` —— OCR / 识图
80
+
81
+ 非视觉 chat 模型场景下的兜底。**两个引擎**(`--engine auto` 自动选):
82
+
83
+ | 引擎 | 需要 key | 质量 | 备注 |
84
+ |---|---|---|---|
85
+ | `mimo` | 需要 `MIMO_API_KEY` | 最好 | 内部调 `mimo-v2.5`(视觉模型),与外层 chat 模型无关 |
86
+ | `pollinations` | **不需要** | 还行 | 免费公共端点 `text.pollinations.ai`。有 IP 限流,但无需注册 |
87
+
88
+ auto 选择:有 `MIMO_API_KEY` 用 mimo,否则 pollinations。所以**只配了 DeepSeek key**(或者啥都没配)的用户也能零配置用 OCR。
89
+
90
+ ```bash
91
+ # 零配置 —— 没设 MIMO_API_KEY 时自动走免费 pollinations
92
+ python3 mimoskill/scripts/ocr.py path/to/image.png
93
+
94
+ # 最佳质量 —— 设 MiMo key
95
+ export MIMO_API_KEY=sk-xxxx
96
+ python3 mimoskill/scripts/ocr.py path/to/image.png # auto -> mimo
97
+
98
+ # 强制走免费引擎(即便你有 MiMo key,比如想省额度)
99
+ python3 mimoskill/scripts/ocr.py --engine pollinations form.png
100
+
101
+ # 强制 MiMo —— 没设 key 直接报错(不静默降级)
102
+ python3 mimoskill/scripts/ocr.py --engine mimo form.png
103
+ ```
104
+
105
+ 四个输出模式:
106
+
107
+ | `--mode` | 输出 |
108
+ |---|---|
109
+ | `text`(默认) | 逐字 OCR —— 保留换行 + 阅读顺序 |
110
+ | `describe` | 2-4 句描述 |
111
+ | `structured` | 单个 JSON:`text` / `language` / `regions[]` / `summary` |
112
+ | `markdown` | 整张图重新渲染成 GitHub-flavored Markdown |
113
+
114
+ 输入形态(位置参数,0+ 个):
115
+ - 本地路径:`./scan.png`、`C:\foo.jpg`
116
+ - HTTP(S) URL:原样转发
117
+ - `data:image/...;base64,…`:原样转发
118
+ - `-` 或管道 stdin:从 stdin 读一张图的字节
119
+
120
+ magic-byte 嗅探 MIME(不信任扩展名):PNG / JPEG / GIF / WebP / BMP。多个位置参数会在**一次 upstream 调用**里批处理。
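嗅探逻辑大致如下(示意草图,真实判定以 `ocr.py` 源码为准;这几个 magic byte 序列本身是各图片格式的公开规范):

```python
# 示意:按 magic bytes 嗅探图片 MIME,不看文件扩展名
def sniff_mime(data):
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "image/png"
    if data[:3] == b"\xff\xd8\xff":
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data[:2] == b"BM":
        return "image/bmp"
    return None  # 未知类型:宁可报错,也不按扩展名瞎猜
```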
121
+
122
+ > 完整参考:[mimoskill/references/ocr_workflow.md](../mimoskill/references/ocr_workflow.md)(模式、退出码、JSON 结构、lang/prompt 参数、pollinations 细节)。
123
+
124
+ ### `scripts/generate_image.py` —— 通用图像生成
125
+
126
+ `generate_pet.py` 的薄包装:去掉了 chibi 宠物提示词模板,加上可选的 `--style` 常见风格预设。同样的 providers、同样的环境变量、同样的 auto 兜底策略。
127
+
128
+ ```bash
129
+ # 免费 —— 没设 OpenAI key 时 auto 走 pollinations
130
+ python3 mimoskill/scripts/generate_image.py --prompt "日式庭园,水彩,黎明" --out garden.png
131
+
132
+ # 高质量 —— 设 OpenAI key
133
+ export PET_OPENAI_API_KEY=sk-real-openai-key
134
+ python3 mimoskill/scripts/generate_image.py --prompt "..." --out art.png # auto -> gpt-image-1
135
+
136
+ # 风格预设
137
+ python3 mimoskill/scripts/generate_image.py --style anime --prompt "黄昏的神社" --out shrine.png
138
+ ```
139
+
140
+ | `--provider` | 后端 |
141
+ |---|---|
142
+ | `auto`(默认) | 有 `PET_OPENAI_API_KEY` 走 `gpt-image-1`,否则 `pollinations` |
143
+ | `pollinations` | 免费、无 key |
144
+ | `gpt-image-1` | OpenAI 官方图像生成 —— 最佳质量 |
145
+ | `replicate` | Replicate API(任意模型) |
146
+ | `local-sd` | 本地 Stable Diffusion |
147
+
148
+ > `PET_OPENAI_API_KEY` 故意**和 `MIMO_API_KEY`、`OPENAI_API_KEY` 分开** —— 只用于图像生成,泄露或不存在都不影响别的事。
149
+
150
+ ### `scripts/generate_pet.py` —— Codex 宠物生成
151
+
152
+ 同样的后端,但内置了一套调好的 chibi 贴纸提示词,围绕 `--description` 组装。输出尺寸 + 留白都按 Codex 宠物选择器适配。
153
+
154
+ ```bash
155
+ # 单张静态宠物(免费)
156
+ python3 mimoskill/scripts/generate_pet.py --description "chibi shiba 程序员" --out pet.png
157
+
158
+ # 多状态动画 bundle(idle / thinking / typing / sleeping)
159
+ python3 mimoskill/scripts/generate_pet.py --description "chibi 猫" --bundle ./shiba/
160
+ ```
161
+
162
+ 提示词模板在 [mimoskill/assets/pet_prompt_template.md](../mimoskill/assets/pet_prompt_template.md)。完整流程见 [mimoskill/references/pet_workflow.md](../mimoskill/references/pet_workflow.md)。
163
+
164
+ ### `scripts/install_pet.sh` —— 装宠物到 Codex
165
+
166
+ 自动探测 macOS / Linux / Windows 的宠物目录,把 PNG(或 bundle)拷过去。绕开 Codex 硬编码的宠物路径问题。
167
+
168
+ ```bash
169
+ bash mimoskill/scripts/install_pet.sh pet.png shiba
170
+ # 然后完全退出 + 重启 Codex(桌面端走系统托盘退出,不只是关窗口)
171
+ ```
172
+
173
+ ## 三种用法
174
+
175
+ ### 1. 直接调用(普通用户,零配置)
176
+
177
+ ```bash
178
+ python3 mimoskill/scripts/mimo_chat.py "..."
179
+ python3 mimoskill/scripts/ocr.py invoice.png # 无 key 也能跑,走免费 pollinations
180
+ python3 mimoskill/scripts/generate_image.py --prompt "..."
181
+ ```
182
+
183
+ 不需要注册 skill —— 就是普通 Python 脚本(纯标准库,不用 `pip install`)。
184
+
185
+ ### 2. 当 Claude Code 的 Skill 用
186
+
187
+ 软链到 `~/.claude/skills/`:
188
+
189
+ ```bash
190
+ ln -s "$(pwd)/mimoskill" ~/.claude/skills/mimoskill
191
+ ```
192
+
193
+ 之后 Claude 会读 [SKILL.md](../mimoskill/SKILL.md),遇到相关请求("帮我从这张图生成宠物"、"读一下这张截图的文字"、"让 MiMo 把这段话朗读了")自动路由到对应脚本。
194
+
195
+ ### 3. 当 Codex agent 指南
196
+
197
+ 仓库根的 [AGENTS.md](../AGENTS.md) 已经接好。Codex 每次启会话都会读,遇到生图 / 宠物 / OCR 任务会路由到 mimoskill 脚本 —— **不会**再去 `pip install openai`,也不会在用 MiMo / DeepSeek / Qwen / 任何非 OpenAI 上游时尝试调 OpenAI 的 `image_gen` 工具。
198
+
199
+ ## 环境变量
200
+
201
+ | 变量 | 谁用 | 说明 |
202
+ |---|---|---|
203
+ | `MIMO_API_KEY` | `mimo_chat.py`、`ocr.py`(engine=mimo / auto 时) | MiMo Chat / 视觉 key。两个脚本都**可选** —— 没设会自动走 pollinations |
204
+ | `MIMO_CHAT_ENGINE` | `mimo_chat.py` | `auto` / `mimo` / `pollinations` —— 等价于 `--engine` |
205
+ | `MIMO_BASE_URL` | `mimo_chat.py`、`ocr.py` | 默认 `https://api.xiaomimimo.com/v1` |
206
+ | `MIMO_MODEL` / `MIMO_OCR_MODEL` | `ocr.py` 模型 auto-pick | 没传 `--model` 时使用(必须视觉能力) |
207
+ | `MIMO_OCR_ENGINE` | `ocr.py` | `auto` / `mimo` / `pollinations` —— 等价于 `--engine` 参数 |
208
+ | `POLLINATIONS_MODEL` | `ocr.py` | 默认 `openai`(视觉能力)。可选 `openai-large`、`openai-fast` |
209
+ | `PET_OPENAI_API_KEY` | `generate_pet.py`、`generate_image.py` | 跟 `MIMO_API_KEY` / `OPENAI_API_KEY` 独立;只用于图像生成 |
210
+ | `REPLICATE_API_TOKEN` | `generate_*.py --provider replicate` | 仅 Replicate 后端时需要 |
211
+
212
+ ## 常用组合
213
+
214
+ ### 先 OCR 一张图,再用当前 chat 模型总结
215
+
216
+ ```bash
217
+ TEXT=$(python3 mimoskill/scripts/ocr.py invoice.png)
218
+ python3 mimoskill/scripts/mimo_chat.py "$(printf '总结这张发票:\n%s' "$TEXT")"
219
+ ```
220
+
221
+ 或者直接在 Codex 里:把图贴进去就行。proxy 剥图后留指向 `ocr.py` 的占位文本,Codex 自己跑脚本把文字喂回对话 —— **完全自动**。
222
+
223
+ ### 生成 `/hatch` 替代宠物(无 OpenAI key 也能用)
224
+
225
+ ```bash
226
+ python3 mimoskill/scripts/generate_pet.py --description "chibi shiba 程序员" --out pet.png
227
+ bash mimoskill/scripts/install_pet.sh pet.png shiba
228
+ # 完全退出 + 重启 Codex,宠物菜单里挑新的
229
+ ```
230
+
231
+ 想要更好质量,设 `PET_OPENAI_API_KEY=sk-真OpenAI-key`,auto 会切到 `gpt-image-1`。
232
+
233
+ ### 结构化 OCR + JSON 解析
234
+
235
+ ```bash
236
+ JSON=$(python3 mimoskill/scripts/ocr.py --mode structured invoice.png)
237
+ echo "$JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['summary'])"
238
+ ```
239
+
240
+ ### 多图批量 OCR(一次计费)
241
+
242
+ ```bash
243
+ python3 mimoskill/scripts/ocr.py page1.png page2.png page3.png
244
+ ```
245
+
246
+ 所有图在**单次** upstream 调用里处理,模型可跨图引用(如身份证正反面)。输出是按阅读顺序串联的一段文本。
247
+
248
+ ## 故障排查
249
+
250
+ <details>
251
+ <summary><b><code>MIMO_API_KEY</code> 未设置</b> —— ocr.py 退出码 3</summary>
252
+
253
+ 你显式传了 `--engine mimo`。要么去掉这个参数(`auto` 会自动降级到 pollinations),要么设 key:
254
+
255
+ ```bash
256
+ export MIMO_API_KEY=sk-xxxx
257
+ python3 mimoskill/scripts/ocr.py form.png
258
+ ```
259
+
260
+ </details>
261
+
262
+ <details>
263
+ <summary><b>Pollinations 返回 429 / 限流</b></summary>
264
+
265
+ 撞 IP 限流。等会儿再试,或者切到 `--engine mimo`(如果你有 MiMo key)。
266
+
267
+ </details>
268
+
269
+ <details>
270
+ <summary><b>Codex 跑 /hatch 时报 <code>image_gen tool not available</code></b></summary>
271
+
272
+ Codex 的 `/hatch` 在客户端硬编码调 OpenAI 的 `image_gen` 工具,代理拦不住。改用 `generate_pet.py`,见上文「生成 /hatch 替代宠物」。
273
+
274
+ </details>
275
+
276
+ <details>
277
+ <summary><b>报 <code>pip install openai</code> 错 / Codex 想装 openai</b></summary>
278
+
279
+ 是 Codex 想用 openai Python SDK 兜底图像生成。[AGENTS.md](../AGENTS.md) 已经预防这条路 —— 确认它在仓库根,且当前 Codex 会话已经读过(编辑完 AGENTS.md 后要开新会话)。
280
+
281
+ </details>
282
+
283
+ <details>
284
+ <summary><b>工具返回了图,但模型在工具结果里看不到图</b></summary>
285
+
286
+ 设计如此。Chat Completions 的 `tool` role 历史上只接受字符串 content —— `function_call_output` 里的图片 content part 会被 flatten 成 `[N image attachment(s) omitted from tool output: ...]` 占位文本(详见 [src/translate/reqToChat.ts](../src/translate/reqToChat.ts) 的 `toolOutputToString`)。要把图喂给 LLM,让工具把图存到本地、返回路径,下一轮用户消息再 `@path/to/screenshot.png` 让 ocr.py 类工具读出来 —— 这时如果 chat 模型不支持视觉,OCR 兜底机制就会接管。
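flatten 的大致形状如下(Python 示意,仅演示思路;真实实现是 `src/translate/reqToChat.ts` 里的 TypeScript 函数 `toolOutputToString`,占位文本的完整措辞以源码为准,这里照抄文档里的省略写法):

```python
# 示意:把工具输出里的图片 content part 压成占位文本,只保留字符串
def tool_output_to_string(parts):
    texts, omitted = [], 0
    for part in parts:
        if part.get("type") == "text":
            texts.append(part["text"])
        elif part.get("type") in ("image", "image_url"):
            omitted += 1  # tool role 只接受字符串,图片只能丢弃并计数
    if omitted:
        # 占位文本尾部在文档里是省略号,这里原样保留
        texts.append("[%d image attachment(s) omitted from tool output: ...]" % omitted)
    return "\n".join(texts)
```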
287
+
288
+ </details>
289
+
290
+ ## 设计取舍
291
+
292
+ - **不需要 `pip install`。** 所有脚本纯标准库。避免依赖漂移,任何裸 Python ≥ 3.8 都能跑。
293
+ - **网络操作明确。** 不偷偷重试备用端点。要 MiMo 又没 key 就直接报错 —— 而不是静默降级掩盖配错。
294
+ - **proxy 和 mimoskill 互不调用。** 两个独立进程,靠 `AGENTS.md` / `SKILL.md` 约定连接。这样两边都能独立测试 / 替换。
295
+ - **Pollinations 是无 key 逃生通道。** 在 `ocr.py`(视觉)、`generate_pet.py`(出图)、`generate_image.py`(出图)里都用作免费兜底。有 IP 限流但永远在线。项目把它当成一等公民,不是"降级模式"。
@@ -18,7 +18,7 @@ Trigger this skill when:
18
18
  - User asks "how do I generate a Codex pet" / "/hatch isn't working" / "image_gen tool not available"
19
19
  - User wants image generation as part of a MiMo-backed workflow
20
20
  - User pastes the Codex error: `the image generation tool (image_gen) is not available in this environment` or `the CLI fallback requires the openai Python package`
21
- - User wants to **OCR / read text from / describe / 识别 / 提取文字 from an image** while the active chat model is non-vision (e.g. mimo-v2.5-pro, mimo-v2-flash, or any third-party model without vision) — use `scripts/ocr.py` to fall back through `mimo-v2.5` without changing the chat model
21
+ - User wants to **OCR / read text from / describe / 识别 / 提取文字 from an image** while the active chat model is non-vision (e.g. mimo-v2.5-pro, mimo-v2-flash, deepseek-*, or any third-party text-only model) — use `scripts/ocr.py`. Works with or without a MiMo key (free pollinations fallback when `MIMO_API_KEY` is unset).
22
22
  - User sees the proxy's `[N image attachment(s) omitted: this model does not support image input …]` placeholder in their transcript
23
23
  - Anything in the `mimo2codex` repo that touches a feature MiMo doesn't support
24
24
 
@@ -38,7 +38,7 @@ Quick answer:
38
38
  | Audio chat | ✅ | `mimo-v2-omni` | input only |
39
39
  | Video understanding | ✅ | `mimo-v2-omni` | input only |
40
40
  | **Image generation** | ❌ | — | `scripts/generate_image.py` (general) or `scripts/generate_pet.py` (Codex pets) — see below |
41
- | OCR / 识图 (when chat model is non-vision) | ⚠️ via `mimo-v2.5` | `scripts/ocr.py` | always uses `mimo-v2.5` internally regardless of chat model |
41
+ | OCR / 识图 (when chat model is non-vision) | ⚠️ via `mimo-v2.5` or free pollinations | `scripts/ocr.py` | `--engine auto`: mimo if `MIMO_API_KEY` set, else pollinations (no key) |
42
42
  | Code interpreter / sandbox | ❌ | — | not provided |
43
43
 
44
44
  For the full capability matrix and examples, read [references/models.md](references/models.md).
@@ -48,7 +48,7 @@ For the full capability matrix and examples, read [references/models.md](referen
48
48
  ```
49
49
  Is it OCR / read text from image / describe / 识别 an image
50
50
  when the active chat model is non-vision?
51
- ├── Yes → use scripts/ocr.py (always routes through mimo-v2.5 internally)
51
+ ├── Yes → use scripts/ocr.py (mimo-v2.5 if MIMO_API_KEY set, else free pollinations)
52
52
  └── No
53
53
 
54
54
  Is it chat / vision / search / TTS / ASR with a vision-capable model?
@@ -60,45 +60,51 @@ when the active chat model is non-vision?
60
60
  └── No → see "General (non-pet) image generation" below (scripts/generate_image.py)
61
61
  ```
62
62
 
63
- ## Calling MiMo directly
63
+ ## Calling chat directly (works without any key)
64
64
 
65
- Use `scripts/mimo_chat.py` to send a single chat completion (or stream):
65
+ Use `scripts/mimo_chat.py` for one-shot or streaming chat. Two engines, `--engine auto` (default) picks `mimo` if `MIMO_API_KEY` is set, else `pollinations` (free, no key) — so **the script works without any key** for text and vision.
66
66
 
67
67
  ```bash
68
+ # Zero-setup — uses pollinations fallback when MIMO_API_KEY is unset
69
+ python3 mimoskill/scripts/mimo_chat.py "your prompt here"
70
+ python3 mimoskill/scripts/mimo_chat.py --image https://example.com/x.png "describe this"
71
+
72
+ # Best quality + MiMo-specific features (web search, TTS, ASR)
68
73
  export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
69
74
  python3 mimoskill/scripts/mimo_chat.py "your prompt here"
70
- python3 mimoskill/scripts/mimo_chat.py --model mimo-v2.5 --image https://example.com/x.png "describe this"
71
- python3 mimoskill/scripts/mimo_chat.py --search "今天上海天气?"
75
+ python3 mimoskill/scripts/mimo_chat.py "今天上海天气?" # web search auto-enabled on sk-* keys
72
76
  python3 mimoskill/scripts/mimo_chat.py --stream "tell me a story"
73
77
  ```
74
78
 
75
- The script handles all the MiMo-specific quirks — `max_completion_tokens` instead of `max_tokens`, the required `text` part next to `image_url`, web_search plugin invocation, `reasoning_content` round-tripping, etc.
79
+ When the mimo engine is active the script handles all MiMo-specific quirks — `max_completion_tokens` instead of `max_tokens`, the required `text` part next to `image_url`, `reasoning_content` round-tripping, etc. **Web search is auto-enabled on pay-as-you-go (`sk-*`) keys** — the `web_search` builtin is always included in the tools array and the model decides when to invoke it (`tool_choice: "auto"`). Token-plan (`tp-*`) keys skip web search (the endpoint doesn't support it). The pollinations engine doesn't support web search, TTS, or ASR (those are MiMo native features); it auto-switches to OpenAI-compat field names (`max_tokens`).
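A minimal sketch of the key-type gating described above (assumptions flagged in comments: the exact shape of the `web_search` tools-array entry is not specified here, and `mimo_chat.py`'s source is the authoritative implementation):

```python
# Sketch of web_search auto-enable per key type (mimo engine only).
# The {"type": "web_search"} entry shape is an assumption for illustration.
def chat_request(prompt, api_key):
    payload = {
        "model": "mimo-v2.5-pro",
        "messages": [{"role": "user", "content": prompt}],
    }
    if api_key.startswith("sk-"):
        # pay-as-you-go key: always include the builtin,
        # the model decides when to actually invoke it
        payload["tools"] = [{"type": "web_search"}]
        payload["tool_choice"] = "auto"
    # tp-* (token-plan) keys: the endpoint rejects web_search, so omit tools
    return payload
```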
76
80
 
77
81
  For non-trivial integrations, [references/models.md](references/models.md) and [the official MiMo OpenAI-compat doc](https://platform.xiaomimimo.com/docs/api/chat/openai-api) are the authoritative references.
78
82
 
79
83
  ## OCR / image recognition (when the chat model can't see images)
80
84
 
81
- If the user wants to **read text from an image** or **describe / 识别 an image** but the current chat model is non-vision (`mimo-v2.5-pro`, `mimo-v2.5-pro[1m]`, `mimo-v2-flash`, or any third-party model without vision), invoke `scripts/ocr.py`. It always uses `mimo-v2.5` internally the chat model stays untouched.
85
+ If the user wants to **read text from an image** or **describe / 识别 an image** but the current chat model is non-vision (`mimo-v2.5-pro`, `mimo-v2.5-pro[1m]`, `mimo-v2-flash`, `deepseek-*`, or any third-party text-only model), invoke `scripts/ocr.py`. Two engines, `--engine auto` (default) picks the right one:
86
+
87
+ - **`mimo`** — needs `MIMO_API_KEY`, uses `mimo-v2.5` regardless of the chat model. Best quality.
88
+ - **`pollinations`** — free public vision endpoint at `text.pollinations.ai`, **no key required**. Mirrors the same no-key fallback `generate_pet.py` uses. Rate-limited but always available — covers users who only have a DeepSeek key (or no key at all).
82
89
 
83
90
  The proxy silently drops image attachments on non-vision models (`src/translate/reqToChat.ts:48-72`) and leaves a `[N image attachment(s) omitted: …]` placeholder. **When you see that placeholder in the transcript, the right move is to run ocr.py and feed the text back into the conversation.** Don't ask the user to switch models.
84
91
 
85
92
  ```bash
86
- export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
87
-
88
- # verbatim OCR (default)
93
+ # Zero-setup — uses pollinations fallback when MIMO_API_KEY is unset
89
94
  python3 mimoskill/scripts/ocr.py path/to/image.png
90
-
91
- # 2-4 sentence description
92
95
  python3 mimoskill/scripts/ocr.py --mode describe https://example.com/x.png
93
-
94
- # structured JSON (text + regions + language + summary)
95
96
  python3 mimoskill/scripts/ocr.py --mode structured a.png b.jpg
96
-
97
- # re-render as GitHub-flavored Markdown (good for forms / receipts)
98
97
  cat scan.png | python3 mimoskill/scripts/ocr.py --mode markdown
98
+
99
+ # Best quality — set MiMo key, auto picks mimo
100
+ export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
101
+ python3 mimoskill/scripts/ocr.py path/to/image.png
102
+
103
+ # Force the free engine even when you have a MiMo key (e.g. to save quota)
104
+ python3 mimoskill/scripts/ocr.py --engine pollinations form.png
99
105
  ```
100
106
 
101
- `ocr.py` accepts local paths, http(s) URLs, `data:` URLs, or stdin bytes. Magic-byte sniffs the MIME (PNG / JPEG / GIF / WebP / BMP). Multiple positional args are batched into one MiMo call. Non-vision `--model` values are auto-coerced to `mimo-v2.5` with one stderr note.
107
+ `ocr.py` accepts local paths, http(s) URLs, `data:` URLs, or stdin bytes. Magic-byte sniffs the MIME (PNG / JPEG / GIF / WebP / BMP). Multiple positional args are batched into one upstream call. Non-vision `--model` values are auto-coerced to `mimo-v2.5` with one stderr note (mimo engine only; on pollinations use `--pollinations-model`).
102
108
 
103
109
  See [references/ocr_workflow.md](references/ocr_workflow.md) for full mode reference, exit codes, JSON shape for `--mode structured`, and the `--lang` / `--prompt` knobs.
104
110
 
@@ -1,26 +1,32 @@
1
1
  # OCR / image recognition workflow
2
2
 
3
3
  `mimoskill/scripts/ocr.py` is the fallback path for reading or describing
4
- images when the surrounding chat model can't see them. It always calls
5
- `mimo-v2.5` (MiMo's vision-capable model) internally, regardless of which
6
- model the rest of the conversation is using.
4
+ images when the surrounding chat model can't see them. Two engines:
5
+
6
+ | Engine | Needs API key? | Quality | Notes |
7
+ |---|---|---|---|
8
+ | `mimo` | yes (`MIMO_API_KEY`) | best | Calls `mimo-v2.5` regardless of the chat model used elsewhere. |
9
+ | `pollinations` | **no** | decent | Free public endpoint at `text.pollinations.ai`. Rate-limited but no signup. |
10
+
11
+ `--engine auto` (default) picks `mimo` if `MIMO_API_KEY` is set, else falls
12
+ back to `pollinations` so users with only a DeepSeek key (or no key at all)
13
+ still get OCR.
7
14
 
8
15
  ## TL;DR
9
16
 
10
17
  ```bash
11
- export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
12
-
13
- # default mode (text) — verbatim OCR
18
+ # Zero-setup — uses free pollinations fallback when MIMO_API_KEY is unset
14
19
  python3 mimoskill/scripts/ocr.py path/to/image.png
15
-
16
- # describe the image in 2-4 sentences
17
20
  python3 mimoskill/scripts/ocr.py --mode describe path/to/image.png
18
-
19
- # structured JSON (text + regions + language + summary)
20
21
  python3 mimoskill/scripts/ocr.py --mode structured a.png b.jpg
21
-
22
- # re-render as GitHub-flavored Markdown
23
22
  python3 mimoskill/scripts/ocr.py --mode markdown form.png
23
+
24
+ # Force the free engine even when you have a MiMo key (e.g. to save quota)
25
+ python3 mimoskill/scripts/ocr.py --engine pollinations form.png
26
+
27
+ # Best quality — set MiMo key
28
+ export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
29
+ python3 mimoskill/scripts/ocr.py path/to/image.png # auto -> mimo
24
30
  ```
25
31
 
26
32
  ## Why this skill exists
@@ -161,21 +167,39 @@ silently (one stderr line) rather than failing.
161
167
 
162
168
  ## When `MIMO_API_KEY` isn't set
163
169
 
164
- `ocr.py` exits with code `3` and this stderr message:
170
+ `--engine auto` (the default) silently falls back to `pollinations`:
165
171
 
166
172
  ```
167
- error: MIMO_API_KEY is not set; ocr.py needs MiMo V2.5 vision to read images.
168
- set one at https://platform.xiaomimimo.com/#/console/api-keys
169
- OR if you want fully-local OCR with no API key, install tesseract:
170
- macOS: brew install tesseract tesseract-lang
171
- Ubuntu: sudo apt install tesseract-ocr tesseract-ocr-chi-sim
172
- Windows: https://github.com/UB-Mannheim/tesseract/wiki
173
- then run: tesseract <image> - -l eng+chi_sim
174
- (tesseract is NOT installed or invoked by this skill; this is just a pointer.)
173
+ [engine] auto -> pollinations (free, no key). Set MIMO_API_KEY for higher quality (mimo-v2.5).
174
+ [ocr] engine=pollinations mode=text model=openai images=1
175
+ <extracted text>
176
+ ```
177
+
178
+ Exit code `3` is only raised when the user explicitly passes `--engine mimo`
179
+ without a key (passing the flag is treated as an assertion that MiMo should
180
+ be used; auto-falling-back would mask the misconfiguration).
181
+
182
+ If you'd rather use **fully-local OCR** with no network at all, install
183
+ tesseract and shell to it directly — this skill won't auto-invoke it:
184
+
185
+ ```bash
186
+ brew install tesseract tesseract-lang                   # macOS
187
+ sudo apt install tesseract-ocr tesseract-ocr-chi-sim    # Ubuntu
188
+ # Windows: installer at https://github.com/UB-Mannheim/tesseract/wiki
189
+ tesseract <image> - -l eng+chi_sim
175
190
  ```
176
191
 
177
- The tesseract pointer is **just a pointer** — this skill never auto-shells
178
- to it. Keeps the dependency surface predictable.
192
+ ## Pollinations specifics
193
+
194
+ - Endpoint: `https://text.pollinations.ai/openai` (OpenAI Chat Completions
195
+ compatible).
196
+ - Default model: `openai` (vision-capable). Override with
197
+ `--pollinations-model <name>` or `POLLINATIONS_MODEL=<name>`. Other
198
+ vision-capable picks include `openai-large`, `openai-fast`.
199
+ - No `Authorization` header is sent; the service is open. Rate limits apply
200
+ per-IP; if you hit them you'll see HTTP 429 in stderr — wait or retry.
201
+ - `reasoning_content` is normally empty for pollinations responses (the
202
+ underlying models don't expose chain-of-thought).
179
203
 
180
204
  ## Common pitfalls
181
205
 
@@ -194,9 +218,9 @@ to it. Keeps the dependency surface predictable.
194
218
  | Code | Meaning |
195
219
  |---|---|
196
220
  | 0 | Success |
197
- | 1 | MiMo HTTP error (error body printed to stderr) |
221
+ | 1 | Upstream HTTP error (MiMo or Pollinations; error body printed to stderr) |
198
222
  | 2 | argv / usage error (no image, mutually exclusive flags, etc.) |
199
- | 3 | `MIMO_API_KEY` not set |
223
+ | 3 | `--engine mimo` explicitly requested but `MIMO_API_KEY` not set |
200
224
  | 4 | Local image file not found / unreadable |
201
225
 
202
226
  ## Composing with `mimo_chat.py`