mimo2codex 0.1.16 → 0.1.17

@@ -0,0 +1,295 @@
1
+ # mimoskill · 详细介绍
2
+
3
+ > [English](./mimoskill.md) · 中文
4
+ >
5
+ > 回到:[README English](../README.md) · [README 中文](../README.zh.md)
6
+
7
+ `mimoskill/` 是仓库根目录下一捆**辅助脚本 + 参考文档**。它存在的原因是有些事 MiMo / DeepSeek / 大多数纯文本 LLM 原生做不了(图像生成、纯文本模型看图、…),而 Codex 在客户端硬编码了一些能力假设,代理层压根改不动。
8
+
9
+ 代理(mimo2codex)和 mimoskill **完全独立**:不跑 mimo2codex 也能用 mimoskill,反之亦然。两者通过**约定**协作:代理检测到能力缺口时,会在消息里塞占位文本,指向对应的 `mimoskill/scripts/*.py`。
10
+
11
+ ## 什么时候会触发?
12
+
13
+ > 一句话:**"模型能做的事 proxy 透传,模型做不了的事 mimoskill 兜底。"**
14
+
15
+ | 能力 | 当前 chat 模型能做 | 当前 chat 模型做不了 |
16
+ |---|---|---|
17
+ | 看图 / OCR / 识图 | proxy 透传图片给模型;**mimoskill 不触发** | proxy 剥掉图片、塞 `[N image attachment(s) omitted: … python3 mimoskill/scripts/ocr.py <path> …]` 占位文本;LLM 读到占位 + AGENTS.md 后 **去跑 `ocr.py`** |
18
+ | 图像生成 | 没有任何主流 chat 模型自带 image-gen | **mimoskill 永远触发** —— `scripts/generate_image.py` 或 `scripts/generate_pet.py` |
19
+ | 联网搜索 | proxy 在 MiMo `sk-*`(按量)key 下把 Codex 的 `web_search` 翻译成 MiMo 内置的;`tp-*`(套餐)key 与 DeepSeek 自动跳过 | `scripts/mimo_chat.py` 遵循同样规则 —— MiMo `sk-*` 自动启用,`tp-*` / pollinations 跳过。**无需参数** |
20
+ | TTS / ASR | Codex 没接 | `scripts/mimo_chat.py` 直接调 MiMo 的独立端点 |
21
+
22
+ 触发**发生在 LLM 这一层**,不在 proxy 层。proxy 只做协议翻译 + 最小兼容性修整(剥图、塞占位文本)。Codex 读 [AGENTS.md](../AGENTS.md) 和 [mimoskill/SKILL.md](../mimoskill/SKILL.md),看到占位文本或者用户意图后,自己决定调哪个脚本。脚本是独立子进程,**完全绕开 proxy** —— OCR 直接打 MiMo 或 pollinations,出图直接打 pollinations 或 OpenAI,等等。
23
+
24
+ ## 目录结构
25
+
26
+ ```
27
+ mimoskill/
28
+ ├── SKILL.md # 给 LLM 看的 skill 清单 —— 触发规则 + 决策树
29
+ ├── scripts/
30
+ │ ├── mimo_chat.py # 直接调 MiMo 聊天 / 视觉 / 联网搜索(纯标准库)
31
+ │ ├── ocr.py # OCR / 识图。MiMo 或免费 pollinations
32
+ │ ├── generate_image.py # 通用图像生成(任意风格 / 主题)
33
+ │ ├── generate_pet.py # Codex 宠物生成(chibi 贴纸风)
34
+ │ └── install_pet.sh # 把生成的 PNG 装到 Codex 的宠物目录
35
+ ├── references/
36
+ │ ├── models.md # MiMo 能力矩阵 + 字段坑
37
+ │ ├── ocr_workflow.md # 完整 OCR 模式参考、退出码、JSON 结构
38
+ │ └── pet_workflow.md # 单图 vs 多状态动画 bundle
39
+ └── assets/
40
+ └── pet_prompt_template.md # 调好的 chibi 贴纸提示词模板
41
+ ```
42
+
43
+ ## 脚本详解
44
+
45
+ ### `scripts/mimo_chat.py` —— 聊天 / 视觉(无 key 也能用)
46
+
47
+ 纯标准库 Python 脚本,单轮或流式聊天。两个引擎,跟 `ocr.py` 是同一套 `--engine auto|mimo|pollinations`:
48
+
49
+ | 引擎 | 需要 key | 备注 |
50
+ |---|---|---|
51
+ | `mimo` | 需要 `MIMO_API_KEY` | 最佳质量。`sk-*` key 自动启用 web_search(无需参数),TTS / ASR 也只能用这个 |
52
+ | `pollinations` | **不需要** | 免费公共端点 `text.pollinations.ai`。文本 + 视觉可用,联网搜索 / TTS / ASR 不可用 |
53
+
54
+ auto 选择:有 `MIMO_API_KEY` 用 mimo,否则 pollinations。**这个脚本现在不依赖任何 key**(纯文本 + 视觉场景)。
55
+
56
+ ```bash
57
+ # 零配置 —— 自动走 pollinations 兜底
58
+ python3 mimoskill/scripts/mimo_chat.py "讲个笑话"
59
+ python3 mimoskill/scripts/mimo_chat.py --image https://x/y.png "描述这张图"
60
+
61
+ # 最佳质量 + MiMo 原生能力(sk-* key 自动开 web_search,TTS、ASR)
62
+ export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
63
+ python3 mimoskill/scripts/mimo_chat.py "今天上海天气" # 自动带 web_search
64
+ python3 mimoskill/scripts/mimo_chat.py --model mimo-v2.5-pro --max-tokens 8000 --stream "写长一点"
65
+ ```
66
+
67
+ mimo 引擎自动处理好 MiMo 的各个坑:`max_completion_tokens`(不是 `max_tokens`)、图片必须配 `text` part、多轮 `reasoning_content` 回填、联网搜索插件调用。
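这些适配可以用一个最小草图示意(假设性示例:字段的组装方式以 `mimo_chat.py` 源码和官方文档为准,这里只演示思路):

```python
# 示意:把 OpenAI 风格的参数适配成 MiMo 兼容的 payload。
# 假设性草图,不是 mimo_chat.py 的真实实现。
def build_mimo_payload(prompt, image_url=None, max_tokens=None, model="mimo-v2.5-pro"):
    # MiMo 要求 image_url part 旁边必须有一个 text part,所以 text 总是先放进去
    content = [{"type": "text", "text": prompt}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    if max_tokens is not None:
        # MiMo 认 max_completion_tokens,不认 max_tokens
        payload["max_completion_tokens"] = max_tokens
    return payload

p = build_mimo_payload("描述这张图", image_url="https://x/y.png", max_tokens=800)
```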
68
+
69
+ | 参数 | 说明 |
70
+ |---|---|
71
+ | `--engine` | `auto` / `mimo` / `pollinations`(默认 auto) |
72
+ | `--model` | 默认 `mimo-v2.5-pro`(mimo 引擎)。视觉用 `mimo-v2.5` / `mimo-v2-omni` |
73
+ | `--pollinations-model` | 默认 `openai`(视觉能力)。可选 `openai-large` / `openai-fast` |
74
+ | `--image URL` | 附图。自动 bump 到视觉能力模型 |
75
+ | `--stream` | SSE 流式 |
76
+ | `--max-tokens N` | mimo 引擎映射到 `max_completion_tokens`,pollinations 映射到 `max_tokens` |
77
+ | `--temperature F` | 默认 0.7 |
78
+
79
+ ### `scripts/ocr.py` —— OCR / 识图
80
+
81
+ 非视觉 chat 模型场景下的兜底。**两个引擎**(`--engine auto` 自动选):
82
+
83
+ | 引擎 | 需要 key | 质量 | 备注 |
84
+ |---|---|---|---|
85
+ | `mimo` | 需要 `MIMO_API_KEY` | 最好 | 内部调 `mimo-v2.5`(视觉模型),与外层 chat 模型无关 |
86
+ | `pollinations` | **不需要** | 还行 | 免费公共端点 `text.pollinations.ai`。有 IP 限流,但无需注册 |
87
+
88
+ auto 选择:有 `MIMO_API_KEY` 用 mimo,否则 pollinations。所以**只配了 DeepSeek key**(或者啥都没配)的用户也能零配置用 OCR。
89
+
90
+ ```bash
91
+ # 零配置 —— 没设 MIMO_API_KEY 时自动走免费 pollinations
92
+ python3 mimoskill/scripts/ocr.py path/to/image.png
93
+
94
+ # 最佳质量 —— 设 MiMo key
95
+ export MIMO_API_KEY=sk-xxxx
96
+ python3 mimoskill/scripts/ocr.py path/to/image.png # auto -> mimo
97
+
98
+ # 强制走免费引擎(即便你有 MiMo key,比如想省额度)
99
+ python3 mimoskill/scripts/ocr.py --engine pollinations form.png
100
+
101
+ # 强制 MiMo —— 没设 key 直接报错(不静默降级)
102
+ python3 mimoskill/scripts/ocr.py --engine mimo form.png
103
+ ```
104
+
105
+ 四个输出模式:
106
+
107
+ | `--mode` | 输出 |
108
+ |---|---|
109
+ | `text`(默认) | 逐字 OCR —— 保留换行 + 阅读顺序 |
110
+ | `describe` | 2-4 句描述 |
111
+ | `structured` | 单个 JSON:`text` / `language` / `regions[]` / `summary` |
112
+ | `markdown` | 整张图重新渲染成 GitHub-flavored Markdown |
113
+
114
+ 输入形态(位置参数,0+ 个):
115
+ - 本地路径:`./scan.png`、`C:\foo.jpg`
116
+ - HTTP(S) URL:原样转发
117
+ - `data:image/...;base64,…`:原样转发
118
+ - `-` 或管道 stdin:从 stdin 读一张图的字节
119
+
120
+ magic-byte 嗅探 MIME(不信任扩展名):PNG / JPEG / GIF / WebP / BMP。多个位置参数会在**一次 upstream 调用**里批处理。
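嗅探逻辑大致如下(示意草图,真实判定以 `ocr.py` 源码为准;这几个 magic byte 序列本身是各图片格式的公开规范):

```python
# 示意:按 magic bytes 嗅探图片 MIME,不看文件扩展名
def sniff_mime(data):
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "image/png"
    if data[:3] == b"\xff\xd8\xff":
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data[:2] == b"BM":
        return "image/bmp"
    return None  # 未知类型:宁可报错,也不按扩展名瞎猜
```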
121
+
122
+ > 完整参考:[mimoskill/references/ocr_workflow.md](../mimoskill/references/ocr_workflow.md)(模式、退出码、JSON 结构、lang/prompt 参数、pollinations 细节)。
123
+
124
+ ### `scripts/generate_image.py` —— 通用图像生成
125
+
126
+ `generate_pet.py` 的薄包装:去掉了 chibi 宠物提示词模板,加上可选的 `--style` 常见风格预设。同样的 providers、同样的环境变量、同样的 auto 兜底策略。
127
+
128
+ ```bash
129
+ # 免费 —— 没设 OpenAI key 时 auto 走 pollinations
130
+ python3 mimoskill/scripts/generate_image.py --prompt "日式庭园,水彩,黎明" --out garden.png
131
+
132
+ # 高质量 —— 设 OpenAI key
133
+ export PET_OPENAI_API_KEY=sk-real-openai-key
134
+ python3 mimoskill/scripts/generate_image.py --prompt "..." --out art.png # auto -> gpt-image-1
135
+
136
+ # 风格预设
137
+ python3 mimoskill/scripts/generate_image.py --style anime --prompt "黄昏的神社" --out shrine.png
138
+ ```
139
+
140
+ | `--provider` | 后端 |
141
+ |---|---|
142
+ | `auto`(默认) | 有 `PET_OPENAI_API_KEY` 走 `gpt-image-1`,否则 `pollinations` |
143
+ | `pollinations` | 免费、无 key |
144
+ | `gpt-image-1` | OpenAI 官方图像生成 —— 最佳质量 |
145
+ | `replicate` | Replicate API(任意模型) |
146
+ | `local-sd` | 本地 Stable Diffusion |
147
+
148
+ > `PET_OPENAI_API_KEY` 故意**和 `MIMO_API_KEY`、`OPENAI_API_KEY` 分开** —— 只用于图像生成,泄露或不存在都不影响别的事。
149
+
150
+ ### `scripts/generate_pet.py` —— Codex 宠物生成
151
+
152
+ 同样的后端,但内置了一套调好的 chibi 贴纸提示词,围绕 `--description` 组装。输出尺寸 + 留白都按 Codex 宠物选择器适配。
153
+
154
+ ```bash
155
+ # 单张静态宠物(免费)
156
+ python3 mimoskill/scripts/generate_pet.py --description "chibi shiba 程序员" --out pet.png
157
+
158
+ # 多状态动画 bundle(idle / thinking / typing / sleeping)
159
+ python3 mimoskill/scripts/generate_pet.py --description "chibi 猫" --bundle ./shiba/
160
+ ```
161
+
162
+ 提示词模板在 [mimoskill/assets/pet_prompt_template.md](../mimoskill/assets/pet_prompt_template.md)。完整流程见 [mimoskill/references/pet_workflow.md](../mimoskill/references/pet_workflow.md)。
163
+
164
+ ### `scripts/install_pet.sh` —— 装宠物到 Codex
165
+
166
+ 自动探测 macOS / Linux / Windows 的宠物目录,把 PNG(或 bundle)拷过去。绕开 Codex 硬编码的宠物路径问题。
167
+
168
+ ```bash
169
+ bash mimoskill/scripts/install_pet.sh pet.png shiba
170
+ # 然后完全退出 + 重启 Codex(桌面端走系统托盘退出,不只是关窗口)
171
+ ```
172
+
173
+ ## 三种用法
174
+
175
+ ### 1. 直接调用(普通用户,零配置)
176
+
177
+ ```bash
178
+ python3 mimoskill/scripts/mimo_chat.py "..."
179
+ python3 mimoskill/scripts/ocr.py invoice.png # 无 key 也能跑,走免费 pollinations
180
+ python3 mimoskill/scripts/generate_image.py --prompt "..."
181
+ ```
182
+
183
+ 不需要注册 skill —— 就是普通 Python 脚本(纯标准库,不用 `pip install`)。
184
+
185
+ ### 2. 当 Claude Code 的 Skill 用
186
+
187
+ 软链到 `~/.claude/skills/`:
188
+
189
+ ```bash
190
+ ln -s "$(pwd)/mimoskill" ~/.claude/skills/mimoskill
191
+ ```
192
+
193
+ 之后 Claude 会读 [SKILL.md](../mimoskill/SKILL.md),遇到相关请求("帮我从这张图生成宠物"、"读一下这张截图的文字"、"让 MiMo 把这段话朗读了")自动路由到对应脚本。
194
+
195
+ ### 3. 当 Codex agent 指南
196
+
197
+ 仓库根的 [AGENTS.md](../AGENTS.md) 已经接好。Codex 每次启会话都会读,遇到生图 / 宠物 / OCR 任务会路由到 mimoskill 脚本 —— **不会**再去 `pip install openai`,也不会在用 MiMo / DeepSeek / Qwen / 任何非 OpenAI 上游时尝试调 OpenAI 的 `image_gen` 工具。
198
+
199
+ ## 环境变量
200
+
201
+ | 变量 | 谁用 | 说明 |
202
+ |---|---|---|
203
+ | `MIMO_API_KEY` | `mimo_chat.py`、`ocr.py`(engine=mimo / auto 时) | MiMo Chat / 视觉 key。两个脚本都**可选** —— 没设会自动走 pollinations |
204
+ | `MIMO_CHAT_ENGINE` | `mimo_chat.py` | `auto` / `mimo` / `pollinations` —— 等价于 `--engine` |
205
+ | `MIMO_BASE_URL` | `mimo_chat.py`、`ocr.py` | 默认 `https://api.xiaomimimo.com/v1` |
206
+ | `MIMO_MODEL` / `MIMO_OCR_MODEL` | `ocr.py` 模型 auto-pick | 没传 `--model` 时使用(必须视觉能力) |
207
+ | `MIMO_OCR_ENGINE` | `ocr.py` | `auto` / `mimo` / `pollinations` —— 等价于 `--engine` 参数 |
208
+ | `POLLINATIONS_MODEL` | `ocr.py` | 默认 `openai`(视觉能力)。可选 `openai-large`、`openai-fast` |
209
+ | `PET_OPENAI_API_KEY` | `generate_pet.py`、`generate_image.py` | 跟 `MIMO_API_KEY` / `OPENAI_API_KEY` 独立;只用于图像生成 |
210
+ | `REPLICATE_API_TOKEN` | `generate_*.py --provider replicate` | 仅 Replicate 后端时需要 |
211
+
212
+ ## 常用组合
213
+
214
+ ### 先 OCR 一张图,再用当前 chat 模型总结
215
+
216
+ ```bash
217
+ TEXT=$(python3 mimoskill/scripts/ocr.py invoice.png)
218
+ python3 mimoskill/scripts/mimo_chat.py "$(printf '总结这张发票:\n%s' "$TEXT")"
219
+ ```
220
+
221
+ 或者直接在 Codex 里:把图贴进去就行。proxy 剥图后留指向 `ocr.py` 的占位文本,Codex 自己跑脚本把文字喂回对话 —— **完全自动**。
222
+
223
+ ### 生成 `/hatch` 替代宠物(无 OpenAI key 也能用)
224
+
225
+ ```bash
226
+ python3 mimoskill/scripts/generate_pet.py --description "chibi shiba 程序员" --out pet.png
227
+ bash mimoskill/scripts/install_pet.sh pet.png shiba
228
+ # 完全退出 + 重启 Codex,宠物菜单里挑新的
229
+ ```
230
+
231
+ 想要更好质量,设 `PET_OPENAI_API_KEY=sk-真OpenAI-key`,auto 会切到 `gpt-image-1`。
232
+
233
+ ### 结构化 OCR + JSON 解析
234
+
235
+ ```bash
236
+ JSON=$(python3 mimoskill/scripts/ocr.py --mode structured invoice.png)
237
+ echo "$JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['summary'])"
238
+ ```
239
+
240
+ ### 多图批量 OCR(一次计费)
241
+
242
+ ```bash
243
+ python3 mimoskill/scripts/ocr.py page1.png page2.png page3.png
244
+ ```
245
+
246
+ 所有图在**单次** upstream 调用里处理,模型可跨图引用(如身份证正反面)。输出是按阅读顺序串联的一段文本。
247
+
248
+ ## 故障排查
249
+
250
+ <details>
251
+ <summary><b><code>MIMO_API_KEY</code> 未设置</b> —— ocr.py 退出码 3</summary>
252
+
253
+ 你显式传了 `--engine mimo`。要么去掉这个参数(`auto` 会自动降级到 pollinations),要么设 key:
254
+
255
+ ```bash
256
+ export MIMO_API_KEY=sk-xxxx
257
+ python3 mimoskill/scripts/ocr.py form.png
258
+ ```
259
+
260
+ </details>
261
+
262
+ <details>
263
+ <summary><b>Pollinations 返回 429 / 限流</b></summary>
264
+
265
+ 撞 IP 限流。等会儿再试,或者切到 `--engine mimo`(如果你有 MiMo key)。
266
+
267
+ </details>
268
+
269
+ <details>
270
+ <summary><b>Codex 跑 /hatch 时报 <code>image_gen tool not available</code></b></summary>
271
+
272
+ Codex 的 `/hatch` 在客户端硬编码调 OpenAI 的 `image_gen` 工具,代理拦不住。改用 `generate_pet.py`,见上文「生成 /hatch 替代宠物」。
273
+
274
+ </details>
275
+
276
+ <details>
277
+ <summary><b>报 <code>pip install openai</code> 错 / Codex 想装 openai</b></summary>
278
+
279
+ 是 Codex 想用 openai Python SDK 兜底图像生成。[AGENTS.md](../AGENTS.md) 已经预防这条路 —— 确认它在仓库根,且当前 Codex 会话已经读过(编辑完 AGENTS.md 后要开新会话)。
280
+
281
+ </details>
282
+
283
+ <details>
284
+ <summary><b>工具返回了图,但模型在工具结果里看不到图</b></summary>
285
+
286
+ 设计如此。Chat Completions 的 `tool` role 历史上只接受字符串 content —— `function_call_output` 里的图片 content part 会被 flatten 成 `[N image attachment(s) omitted from tool output: ...]` 占位文本(详见 [src/translate/reqToChat.ts](../src/translate/reqToChat.ts) 的 `toolOutputToString`)。要把图喂给 LLM,让工具把图存到本地、返回路径,下一轮用户消息再 `@path/to/screenshot.png` 让 ocr.py 类工具读出来 —— 这时如果 chat 模型不支持视觉,OCR 兜底机制就会接管。
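flatten 的大致形状如下(Python 示意,仅演示思路;真实实现是 `src/translate/reqToChat.ts` 里的 TypeScript 函数 `toolOutputToString`,占位文本的完整措辞以源码为准,这里照抄文档里的省略写法):

```python
# 示意:把工具输出里的图片 content part 压成占位文本,只保留字符串
def tool_output_to_string(parts):
    texts, omitted = [], 0
    for part in parts:
        if part.get("type") == "text":
            texts.append(part["text"])
        elif part.get("type") in ("image", "image_url"):
            omitted += 1  # tool role 只接受字符串,图片只能丢弃并计数
    if omitted:
        # 占位文本尾部在文档里是省略号,这里原样保留
        texts.append("[%d image attachment(s) omitted from tool output: ...]" % omitted)
    return "\n".join(texts)
```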
287
+
288
+ </details>
289
+
290
+ ## 设计取舍
291
+
292
+ - **不需要 `pip install`。** 所有脚本纯标准库。避免依赖漂移,任何裸 Python ≥ 3.8 都能跑。
293
+ - **网络操作明确。** 不偷偷重试备用端点。要 MiMo 又没 key 就直接报错 —— 而不是静默降级掩盖配错。
294
+ - **proxy 和 mimoskill 互不调用。** 两个独立进程,靠 `AGENTS.md` / `SKILL.md` 约定连接。这样两边都能独立测试 / 替换。
295
+ - **Pollinations 是无 key 逃生通道。** 在 `ocr.py`(视觉)、`generate_pet.py`(出图)、`generate_image.py`(出图)里都用作免费兜底。有 IP 限流但永远在线。项目把它当成一等公民,不是"降级模式"。
@@ -18,7 +18,7 @@ Trigger this skill when:
18
18
  - User asks "how do I generate a Codex pet" / "/hatch isn't working" / "image_gen tool not available"
19
19
  - User wants image generation as part of a MiMo-backed workflow
20
20
  - User pastes the Codex error: `the image generation tool (image_gen) is not available in this environment` or `the CLI fallback requires the openai Python package`
21
- - User wants to **OCR / read text from / describe / 识别 / 提取文字 from an image** while the active chat model is non-vision (e.g. mimo-v2.5-pro, mimo-v2-flash, or any third-party model without vision) — use `scripts/ocr.py` to fall back through `mimo-v2.5` without changing the chat model
21
+ - User wants to **OCR / read text from / describe / 识别 / 提取文字 from an image** while the active chat model is non-vision (e.g. mimo-v2.5-pro, mimo-v2-flash, deepseek-*, or any third-party text-only model) — use `scripts/ocr.py`. Works with or without a MiMo key (free pollinations fallback when `MIMO_API_KEY` is unset).
22
22
  - User sees the proxy's `[N image attachment(s) omitted: this model does not support image input …]` placeholder in their transcript
23
23
  - Anything in the `mimo2codex` repo that touches a feature MiMo doesn't support
24
24
 
@@ -38,7 +38,7 @@ Quick answer:
38
38
  | Audio chat | ✅ | `mimo-v2-omni` | input only |
39
39
  | Video understanding | ✅ | `mimo-v2-omni` | input only |
40
40
  | **Image generation** | ❌ | — | `scripts/generate_image.py` (general) or `scripts/generate_pet.py` (Codex pets) — see below |
41
- | OCR / 识图 (when chat model is non-vision) | ⚠️ via `mimo-v2.5` | `scripts/ocr.py` | always uses `mimo-v2.5` internally regardless of chat model |
41
+ | OCR / 识图 (when chat model is non-vision) | ⚠️ via `mimo-v2.5` or free pollinations | `scripts/ocr.py` | `--engine auto`: mimo if `MIMO_API_KEY` set, else pollinations (no key) |
42
42
  | Code interpreter / sandbox | ❌ | — | not provided |
43
43
 
44
44
  For the full capability matrix and examples, read [references/models.md](references/models.md).
@@ -48,7 +48,7 @@ For the full capability matrix and examples, read [references/models.md](referen
48
48
  ```
49
49
  Is it OCR / read text from image / describe / 识别 an image
50
50
  when the active chat model is non-vision?
51
- ├── Yes → use scripts/ocr.py (always routes through mimo-v2.5 internally)
51
+ ├── Yes → use scripts/ocr.py (mimo-v2.5 if MIMO_API_KEY set, else free pollinations)
52
52
  └── No
53
53
 
54
54
  Is it chat / vision / search / TTS / ASR with a vision-capable model?
@@ -60,45 +60,51 @@ when the active chat model is non-vision?
60
60
  └── No → see "General (non-pet) image generation" below (scripts/generate_image.py)
61
61
  ```
62
62
 
63
- ## Calling MiMo directly
63
+ ## Calling chat directly (works without any key)
64
64
 
65
- Use `scripts/mimo_chat.py` to send a single chat completion (or stream):
65
+ Use `scripts/mimo_chat.py` for one-shot or streaming chat. Two engines, `--engine auto` (default) picks `mimo` if `MIMO_API_KEY` is set, else `pollinations` (free, no key) — so **the script works without any key** for text and vision.
66
66
 
67
67
  ```bash
68
+ # Zero-setup — uses pollinations fallback when MIMO_API_KEY is unset
69
+ python3 mimoskill/scripts/mimo_chat.py "your prompt here"
70
+ python3 mimoskill/scripts/mimo_chat.py --image https://example.com/x.png "describe this"
71
+
72
+ # Best quality + MiMo-specific features (web search, TTS, ASR)
68
73
  export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
69
74
  python3 mimoskill/scripts/mimo_chat.py "your prompt here"
70
- python3 mimoskill/scripts/mimo_chat.py --model mimo-v2.5 --image https://example.com/x.png "describe this"
71
- python3 mimoskill/scripts/mimo_chat.py --search "今天上海天气?"
75
+ python3 mimoskill/scripts/mimo_chat.py "今天上海天气?" # web search auto-enabled on sk-* keys
72
76
  python3 mimoskill/scripts/mimo_chat.py --stream "tell me a story"
73
77
  ```
74
78
 
75
- The script handles all the MiMo-specific quirks — `max_completion_tokens` instead of `max_tokens`, the required `text` part next to `image_url`, web_search plugin invocation, `reasoning_content` round-tripping, etc.
79
+ When the mimo engine is active the script handles all MiMo-specific quirks — `max_completion_tokens` instead of `max_tokens`, the required `text` part next to `image_url`, `reasoning_content` round-tripping, etc. **Web search is auto-enabled on pay-as-you-go (`sk-*`) keys** — the `web_search` builtin is always included in the tools array and the model decides when to invoke it (`tool_choice: "auto"`). Token-plan (`tp-*`) keys skip web search (the endpoint doesn't support it). The pollinations engine doesn't support web search, TTS, or ASR (those are MiMo native features); it auto-switches to OpenAI-compat field names (`max_tokens`).
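A minimal sketch of the key-type gating described above (assumptions flagged in comments: the exact shape of the `web_search` tools-array entry is not specified here, and `mimo_chat.py`'s source is the authoritative implementation):

```python
# Sketch of web_search auto-enable per key type (mimo engine only).
# The {"type": "web_search"} entry shape is an assumption for illustration.
def chat_request(prompt, api_key):
    payload = {
        "model": "mimo-v2.5-pro",
        "messages": [{"role": "user", "content": prompt}],
    }
    if api_key.startswith("sk-"):
        # pay-as-you-go key: always include the builtin,
        # the model decides when to actually invoke it
        payload["tools"] = [{"type": "web_search"}]
        payload["tool_choice"] = "auto"
    # tp-* (token-plan) keys: the endpoint rejects web_search, so omit tools
    return payload
```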
76
80
 
77
81
  For non-trivial integrations, [references/models.md](references/models.md) and [the official MiMo OpenAI-compat doc](https://platform.xiaomimimo.com/docs/api/chat/openai-api) are the authoritative references.
78
82
 
79
83
  ## OCR / image recognition (when the chat model can't see images)
80
84
 
81
- If the user wants to **read text from an image** or **describe / 识别 an image** but the current chat model is non-vision (`mimo-v2.5-pro`, `mimo-v2.5-pro[1m]`, `mimo-v2-flash`, or any third-party model without vision), invoke `scripts/ocr.py`. It always uses `mimo-v2.5` internally the chat model stays untouched.
85
+ If the user wants to **read text from an image** or **describe / 识别 an image** but the current chat model is non-vision (`mimo-v2.5-pro`, `mimo-v2.5-pro[1m]`, `mimo-v2-flash`, `deepseek-*`, or any third-party text-only model), invoke `scripts/ocr.py`. Two engines, `--engine auto` (default) picks the right one:
86
+
87
+ - **`mimo`** — needs `MIMO_API_KEY`, uses `mimo-v2.5` regardless of the chat model. Best quality.
88
+ - **`pollinations`** — free public vision endpoint at `text.pollinations.ai`, **no key required**. Mirrors the same no-key fallback `generate_pet.py` uses. Rate-limited but always available — covers users who only have a DeepSeek key (or no key at all).
82
89
 
83
90
  The proxy silently drops image attachments on non-vision models (`src/translate/reqToChat.ts:48-72`) and leaves a `[N image attachment(s) omitted: …]` placeholder. **When you see that placeholder in the transcript, the right move is to run ocr.py and feed the text back into the conversation.** Don't ask the user to switch models.
84
91
 
85
92
  ```bash
86
- export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
87
-
88
- # verbatim OCR (default)
93
+ # Zero-setup — uses pollinations fallback when MIMO_API_KEY is unset
89
94
  python3 mimoskill/scripts/ocr.py path/to/image.png
90
-
91
- # 2-4 sentence description
92
95
  python3 mimoskill/scripts/ocr.py --mode describe https://example.com/x.png
93
-
94
- # structured JSON (text + regions + language + summary)
95
96
  python3 mimoskill/scripts/ocr.py --mode structured a.png b.jpg
96
-
97
- # re-render as GitHub-flavored Markdown (good for forms / receipts)
98
97
  cat scan.png | python3 mimoskill/scripts/ocr.py --mode markdown
98
+
99
+ # Best quality — set MiMo key, auto picks mimo
100
+ export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
101
+ python3 mimoskill/scripts/ocr.py path/to/image.png
102
+
103
+ # Force the free engine even when you have a MiMo key (e.g. to save quota)
104
+ python3 mimoskill/scripts/ocr.py --engine pollinations form.png
99
105
  ```
100
106
 
101
- `ocr.py` accepts local paths, http(s) URLs, `data:` URLs, or stdin bytes. Magic-byte sniffs the MIME (PNG / JPEG / GIF / WebP / BMP). Multiple positional args are batched into one MiMo call. Non-vision `--model` values are auto-coerced to `mimo-v2.5` with one stderr note.
107
+ `ocr.py` accepts local paths, http(s) URLs, `data:` URLs, or stdin bytes. Magic-byte sniffs the MIME (PNG / JPEG / GIF / WebP / BMP). Multiple positional args are batched into one upstream call. Non-vision `--model` values are auto-coerced to `mimo-v2.5` with one stderr note (mimo engine only; on pollinations use `--pollinations-model`).
102
108
 
103
109
  See [references/ocr_workflow.md](references/ocr_workflow.md) for full mode reference, exit codes, JSON shape for `--mode structured`, and the `--lang` / `--prompt` knobs.
104
110
 
@@ -1,26 +1,32 @@
1
1
  # OCR / image recognition workflow
2
2
 
3
3
  `mimoskill/scripts/ocr.py` is the fallback path for reading or describing
4
- images when the surrounding chat model can't see them. It always calls
5
- `mimo-v2.5` (MiMo's vision-capable model) internally, regardless of which
6
- model the rest of the conversation is using.
4
+ images when the surrounding chat model can't see them. Two engines:
5
+
6
+ | Engine | Needs API key? | Quality | Notes |
7
+ |---|---|---|---|
8
+ | `mimo` | yes (`MIMO_API_KEY`) | best | Calls `mimo-v2.5` regardless of the chat model used elsewhere. |
9
+ | `pollinations` | **no** | decent | Free public endpoint at `text.pollinations.ai`. Rate-limited but no signup. |
10
+
11
+ `--engine auto` (default) picks `mimo` if `MIMO_API_KEY` is set, else falls
12
+ back to `pollinations` so users with only a DeepSeek key (or no key at all)
13
+ still get OCR.
7
14
 
8
15
  ## TL;DR
9
16
 
10
17
  ```bash
11
- export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
12
-
13
- # default mode (text) — verbatim OCR
18
+ # Zero-setup — uses free pollinations fallback when MIMO_API_KEY is unset
14
19
  python3 mimoskill/scripts/ocr.py path/to/image.png
15
-
16
- # describe the image in 2-4 sentences
17
20
  python3 mimoskill/scripts/ocr.py --mode describe path/to/image.png
18
-
19
- # structured JSON (text + regions + language + summary)
20
21
  python3 mimoskill/scripts/ocr.py --mode structured a.png b.jpg
21
-
22
- # re-render as GitHub-flavored Markdown
23
22
  python3 mimoskill/scripts/ocr.py --mode markdown form.png
23
+
24
+ # Force the free engine even when you have a MiMo key (e.g. to save quota)
25
+ python3 mimoskill/scripts/ocr.py --engine pollinations form.png
26
+
27
+ # Best quality — set MiMo key
28
+ export MIMO_API_KEY=sk-xxxxxxxxxxxxxxxx
29
+ python3 mimoskill/scripts/ocr.py path/to/image.png # auto -> mimo
24
30
  ```
25
31
 
26
32
  ## Why this skill exists
@@ -161,21 +167,39 @@ silently (one stderr line) rather than failing.
161
167
 
162
168
  ## When `MIMO_API_KEY` isn't set
163
169
 
164
- `ocr.py` exits with code `3` and this stderr message:
170
+ `--engine auto` (the default) silently falls back to `pollinations`:
165
171
 
166
172
  ```
167
- error: MIMO_API_KEY is not set; ocr.py needs MiMo V2.5 vision to read images.
168
- set one at https://platform.xiaomimimo.com/#/console/api-keys
169
- OR if you want fully-local OCR with no API key, install tesseract:
170
- macOS: brew install tesseract tesseract-lang
171
- Ubuntu: sudo apt install tesseract-ocr tesseract-ocr-chi-sim
172
- Windows: https://github.com/UB-Mannheim/tesseract/wiki
173
- then run: tesseract <image> - -l eng+chi_sim
174
- (tesseract is NOT installed or invoked by this skill; this is just a pointer.)
173
+ [engine] auto -> pollinations (free, no key). Set MIMO_API_KEY for higher quality (mimo-v2.5).
174
+ [ocr] engine=pollinations mode=text model=openai images=1
175
+ <extracted text>
176
+ ```
177
+
178
+ Exit code `3` is only raised when the user explicitly passes `--engine mimo`
179
+ without a key (passing the flag is treated as an assertion that MiMo should
180
+ be used; auto-falling-back would mask the misconfiguration).
181
+
182
+ If you'd rather use **fully-local OCR** with no network at all, install
183
+ tesseract and shell to it directly — this skill won't auto-invoke it:
184
+
185
+ ```bash
186
+ brew install tesseract tesseract-lang                   # macOS
187
+ sudo apt install tesseract-ocr tesseract-ocr-chi-sim    # Ubuntu
188
+ # Windows: installer at https://github.com/UB-Mannheim/tesseract/wiki
189
+ tesseract <image> - -l eng+chi_sim
175
190
  ```
176
191
 
177
- The tesseract pointer is **just a pointer** — this skill never auto-shells
178
- to it. Keeps the dependency surface predictable.
192
+ ## Pollinations specifics
193
+
194
+ - Endpoint: `https://text.pollinations.ai/openai` (OpenAI Chat Completions
195
+ compatible).
196
+ - Default model: `openai` (vision-capable). Override with
197
+ `--pollinations-model <name>` or `POLLINATIONS_MODEL=<name>`. Other
198
+ vision-capable picks include `openai-large`, `openai-fast`.
199
+ - No `Authorization` header is sent; the service is open. Rate limits apply
200
+ per-IP; if you hit them you'll see HTTP 429 in stderr — wait or retry.
201
+ - `reasoning_content` is normally empty for pollinations responses (the
202
+ underlying models don't expose chain-of-thought).
179
203
 
180
204
  ## Common pitfalls
181
205
 
@@ -194,9 +218,9 @@ to it. Keeps the dependency surface predictable.
194
218
  | Code | Meaning |
195
219
  |---|---|
196
220
  | 0 | Success |
197
- | 1 | MiMo HTTP error (error body printed to stderr) |
221
+ | 1 | Upstream HTTP error (MiMo or Pollinations; error body printed to stderr) |
198
222
  | 2 | argv / usage error (no image, mutually exclusive flags, etc.) |
199
- | 3 | `MIMO_API_KEY` not set |
223
+ | 3 | `--engine mimo` explicitly requested but `MIMO_API_KEY` not set |
200
224
  | 4 | Local image file not found / unreadable |
201
225
 
202
226
  ## Composing with `mimo_chat.py`