npm - opencode-vision - Versions diffs - 0.1.0 → 0.2.1 - Mend

opencode-vision 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md CHANGED

@@ -23,26 +23,36 @@ delegates to a vision subagent, and parses a typed report.
 ## Install
-Add the plugin to your `~/.config/opencode/opencode.json`:
+Two parts: the plugin (registers subagents) and the skill (the SKILL.md the
+agent sees). Both are one-line commands.
+### 1. Install the plugin (subagent registration)
+Add to your `~/.config/opencode/opencode.json`:
 ```json
 {
   "$schema": "https://opencode.ai/config.json",
   "plugin": [
     "opencode-vision"
-  ],
-  "skills": {
-    "paths": [
-      "~/.cache/opencode/node_modules/opencode-vision"
-    ]
-  }
+  ]
 }
 ```
-opencode auto-installs the npm package via Bun on next launch — no separate
-`npm install` step needed. The skill ships inside the package (in `SKILL.md`),
-so point `skills.paths` at the installed package location so opencode's skill
-loader can find it.
+opencode auto-installs the npm package via Bun on next launch. The plugin's
+`config(cfg)` hook registers 10 `vision-*` subagents programmatically.
+### 2. Install the skill (SKILL.md discovery)
+```bash
+npx skills add WeZZard/skills -a opencode -g --skill vision
+```
+This uses the [open agent skills CLI](https://github.com/vercel-labs/skills)
+to fetch `SKILL.md` from this repo and drop it into
+`~/.agents/skills/vision/SKILL.md` — a directory opencode scans by default
+(along with `~/.config/opencode/skills/`). No `skills.paths` config entry
+needed.
 The old `~/.config/opencode/agents/visual-judge.md` subagent is removed —
 this plugin replaces it with 10 typed `vision-*` subagents. Delete the old
@@ -52,16 +62,16 @@ file if present:
 rm -f ~/.config/opencode/agents/visual-judge.md
 ```
-Restart opencode for the config to take effect.
+Restart opencode for both changes to take effect.
-> **Why `skills.paths` points at the installed package:** opencode's plugin
-> loader resolves the npm package to its `dist/index.js` entrypoint and
-> runs the `config(cfg)` hook that registers the 10 subagents. But opencode's
-> *skill* loader scans directories for `SKILL.md` — it does not look inside
-> npm packages automatically. So we point `skills.paths` at the installed
-> package directory, where `SKILL.md` ships as a published file. The path
-> above (`~/.cache/opencode/node_modules/opencode-vision`) is where Bun
-> caches opencode plugins; adjust if your cache lives elsewhere.
+> **Why two steps?** opencode's plugin loader resolves the npm package to its
+> `dist/index.js` entrypoint and runs the `config(cfg)` hook that registers
+> the 10 subagents. But opencode's *skill* loader scans filesystem directories
+> for `SKILL.md` — it does not look inside npm packages automatically. The
+> `npx skills` command bridges this gap by placing `SKILL.md` where opencode's
+> default skill scan finds it. This is a workaround for opencode bug #33896
+> (plugin-registered skills not discoverable); it will be withdrawn once the
+> upstream fix (PR #33918) ships.
 ## Verify
@@ -116,6 +126,9 @@ opencode/vision/                  # this sub-package, published as opencode-visi
     visual-judgment-request.v1.json
     visual-judgment-report.v1.json
   README.md                        # this file
+skills/vision/SKILL.md             # symlink → ../../opencode/vision/SKILL.md
+                                   # lets npx skills discover and install the skill
 ```
 ## Build & publish (maintainers)
@@ -162,4 +175,19 @@ Published via GitHub raw URLs (branch `main`):
 The files also live in this repo under `opencode/vision/schemas/` for
 editing. The URL is the canonical `$id`/`$schema` reference used by the
-SKILL.md and subagent body.
+SKILL.md and subagent body.
+## Withdraw the skill workaround
+When opencode bug [#33896](https://github.com/anomalyco/opencode/issues/33896)
+is fixed (PR [#33918](https://github.com/anomalyco/opencode/pull/33918)
+merged and shipped), the plugin can self-register the skill via the v2
+`ctx.skill.transform()` API. At that point the `npx skills`-installed file
+becomes redundant:
+```bash
+npx skills remove vision -a opencode -g
+```
+The plugin will then handle both subagent registration and skill discovery,
+making the install a single `"plugin": ["opencode-vision"]` line.

package/SKILL.md CHANGED

@@ -1,29 +1,49 @@
 ---
 name: vision
 description: >-
-  Use when you must verify, check, or evaluate what is visually rendered in
-  one or more images — e.g. "visually verify the screenshot shows a centered
-  button", "check the icon is visible", "does the layout match the design",
-  acceptance criteria mentioning on-screen state. Captures visual-judgment
-  intent from user prompts or MCP task outputs, classifies it into a typed
-  judgment, asks the user once per session which vision model to use,
-  assembles a versioned request, delegates to a vision subagent, and parses
-  the typed report. Requires locally-stored image files (cua-driver
-  screenshots, Playwright/chrome-devtools captures, user-provided paths).
+  Use when a tool result contains an image attachment the current model
+  cannot see (attachments[].mime = "image/png",
+  url = "data:image/png;base64,...") OR the user asks to visually
+  verify/check rendered content ("visually verify", "screenshot shows",
+  "centered/visible/hidden", "looks right", "matches the design").
+  Triggers on screenshots from chrome-devtools_take_screenshot,
+  playwright_browser_take_screenshot,
+  cua-driver_get_window_state/zoom/take_screenshot. Routes image bytes
+  to a vision-* subagent when the orchestrator's model is text-only
+  (e.g. glm-5.2, deepseek-v4-pro) and cannot see images itself.
+  Classifies intent into a typed judgment (presence/absence/alignment/
+  ordering/equality/layout/readability/state/diff/describe), asks the
+  user once per session which vision model to use, assembles a versioned
+  request, delegates, parses the typed report. Image paths from
+  screenshot_out_file/filePath; inline-only images saved to /tmp via node
+  (not shell echo, to avoid embedding image bytes in commands).
 ---
 # Vision — Visual Judgment Skill
-You are a text-only orchestrator (GLM 5.2). You cannot see images. When a
-task requires visual verification, you delegate to a vision subagent that
-returns a typed report. This skill defines the extraction pipeline:
+When a task requires visual verification and the orchestrator's model
+cannot see images, this skill routes image bytes to a vision subagent
+that returns a typed report. The extraction pipeline is:
 **Detect → Classify → Assemble → Pick model → Delegate → Parse**.
+## When NOT to invoke this skill
+If the orchestrator's model is itself vision-capable (e.g. you are
+running on `kimi-for-coding/k2p7`, `openai/gpt-5.5`,
+`ollama-cloud/gemini-3-flash-preview`, `opencode-go/qwen3.7-plus`, etc. —
+the same models listed in Step 4's mapping table), do NOT delegate to a
+vision subagent. Analyze the image attachment directly — you can see it.
+This skill is only for orchestrators whose model cannot see images
+(e.g. `ollama-cloud/glm-5.2`, `deepseek/deepseek-v4-pro`).
 ## Why this skill exists
-You are text-only (`attachment: false`). You cannot verify visual properties
-yourself — alignment, color, readability, layout. A vision subagent can.
-This skill gives you a stable contract for talking to one.
+Tool results in opencode can carry image attachments (`attachments[]`
+with `mime: "image/*"` and `url: "data:image/...;base64,..."`). A model
+trained without multimodal support sees the text part of these results
+but the image bytes are invisible to it. This skill recognizes when such
+an attachment is present and routes it to a vision subagent that can see
+it, giving you a stable typed contract for the exchange.
 ## The two schemas
@@ -34,7 +54,7 @@ This skill gives you a stable contract for talking to one.
 ## Step 1. Detect
-Visual-judgment intent arrives from two sources. Recognize both.
+Visual-judgment intent arrives from three sources. Recognize all three.
 ### Source A — explicit visual-judgment language in a user prompt
@@ -67,6 +87,27 @@ welcome header." The text describes structure, but "looks right" is a
 visual layout quality the text can't fully prove → you detect a
 visual-judgment need.
+### Source C — image attachment in a tool result
+When any tool result in the transcript contains an `attachments[]` entry
+with `mime` starting `image/`, that is an image the orchestrator cannot
+see. This is a trigger regardless of whether the user explicitly asked for
+visual verification — the image's mere presence means a visual judgment
+*could* be needed. Recognize these patterns:
+| Tool | Signal in result | File path available? |
+|---|---|---|
+| `chrome-devtools_take_screenshot` | `attachments[].mime = image/png` | Yes, if `filePath` was passed to the tool |
+| `playwright_browser_take_screenshot` | `attachments[].mime = image/png` | Yes, if `filename` was passed (saved to output dir) |
+| `cua-driver_get_window_state` | `screenshot` field (base64) + `screenshot_file_path` if `screenshot_out_file` was passed | Yes if `screenshot_out_file` set |
+| `cua-driver_zoom` | Cropped JPEG returned inline | **No** — inline only, must be saved to disk first (see 3f) |
+| `cua-driver_take_screenshot` | `attachments[].mime = image/png` | Yes if `filePath` set |
+**Gating rule**: auto-invoke only when the user's current task has a
+visual component (layout, alignment, presence, state, readability — see
+Step 2). If the task has no visual component, do nothing; note the image
+is available if needed later.
 ## Step 2. Classify
 Map the NL task to one of the 10 closed `judgment.type` values. Each has
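
Reviewer sketch (not part of the package): the Source C trigger and its gating rule can be read as a simple predicate. The example below is a minimal illustration in plain JavaScript, assuming a tool result shaped like `{ attachments: [{ mime, url }] }` as the added SKILL.md text describes. The function names `findImageAttachments` and `shouldInvokeVision` are hypothetical, and how the orchestrator decides that a task "has a visual component" is left as a boolean input.

```javascript
// Hypothetical helpers; the attachment shape follows the SKILL.md wording,
// everything else is an assumption made for illustration.

// Return every attachment whose MIME type starts with "image/".
function findImageAttachments(toolResult) {
  const attachments = Array.isArray(toolResult?.attachments)
    ? toolResult.attachments
    : [];
  return attachments.filter(
    (a) => typeof a?.mime === "string" && a.mime.startsWith("image/")
  );
}

// Gating rule: an image alone is not enough; the current task must also
// have a visual component (layout, alignment, presence, state, ...).
function shouldInvokeVision(toolResult, taskHasVisualComponent) {
  return Boolean(taskHasVisualComponent) &&
    findImageAttachments(toolResult).length > 0;
}
```

Under this reading, a screenshot returned during a non-visual task is noted but does not trigger delegation.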
@@ -147,9 +188,48 @@ screenshot-save instructions.
 ### 3e. Edge case — built-in computer-use MCP
 The built-in Claude Code `computer-use` MCP returns screenshots as inline
-base64 images, not file paths. You cannot see inline images (you are
-text-only), and the vision subagent needs a file path to `read`. Prefer
-`cua-driver` for desktop visual judgments — it has `screenshot_out_file`.
+base64 images, not file paths. The vision subagent needs a file path to
+`read`. Prefer `cua-driver` for desktop visual judgments — it has
+`screenshot_out_file`.
+### 3f. Inline-only image attachments (no file path)
+Some tool results return image attachments with
+`attachments[].url = "data:image/...;base64,..."` but **no file path** —
+e.g. `cua-driver_zoom` (inline-only, no path param), or
+`playwright_browser_take_screenshot` called without a `filename`. The
+vision subagent needs a file path to `read`. Save the inline image to
+disk first.
+**Prefer avoiding inline images altogether**: when calling
+`cua-driver_get_window_state`, always pass `screenshot_out_file` so a
+file path is available directly. When calling
+`chrome-devtools_take_screenshot` or `playwright_browser_take_screenshot`,
+always pass `filePath` / `filename`. This avoids the inline-only case
+entirely and is the safest path.
+If you must handle an inline-only image, write the base64 payload to a
+file using `node -e` (not `echo | base64 -d`, which embeds the raw
+image data in a shell command — screenshots may contain sensitive
+content like tokens or credentials):
+```
+If a tool result has attachments[].url starting "data:image/...;base64,"
+but no file path:
+  1. Extract the base64 payload from the data URL (the part after
+     ";base64,").
+  2. Write it to /tmp/vision-<random>.png using node, which avoids
+     passing the base64 through the shell:
+       node -e "require('fs').writeFileSync('/tmp/vision-<random>.png',
+       Buffer.from('<base64>','base64'))"
+     Or write a small script to /tmp and run it, passing the base64 via
+     stdin to avoid it appearing in the command line.
+  3. Use that path in the request's images[].path.
+```
+Do not use `echo "<base64>" | base64 -d` — it embeds the raw image
+bytes in the shell command, creating an exfiltration risk if the
+screenshot contains sensitive data.
 ## Step 4. Pick model (once per session)
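
Reviewer sketch (not part of the package): section 3f suggests writing a small script and passing the base64 via stdin so the payload never appears on the command line. One possible shape for such a script is shown below, using only Node's built-in `fs`, `os`, `path`, and `crypto` modules. The function names, the `vision-<random>` file naming, and the owner-only file mode are assumptions for illustration, not something the package ships.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");
const crypto = require("crypto");

// Split a data URL such as "data:image/png;base64,iVBOR..." into its
// image subtype and decoded bytes. Returns null for anything else.
function decodeImageDataUrl(dataUrl) {
  if (typeof dataUrl !== "string") return null;
  const match = /^data:image\/([a-z0-9.+-]+);base64,(.+)$/is.exec(dataUrl.trim());
  if (!match) return null;
  return {
    subtype: match[1].toLowerCase(),
    bytes: Buffer.from(match[2], "base64"),
  };
}

// Write the decoded image to a randomly named file and return its path,
// suitable for the request's images[].path.
function saveInlineImage(dataUrl, dir = os.tmpdir()) {
  const decoded = decodeImageDataUrl(dataUrl);
  if (!decoded) throw new Error("not an inline base64 image data URL");
  const ext = decoded.subtype === "jpeg" ? "jpg" : decoded.subtype;
  const name = `vision-${crypto.randomBytes(6).toString("hex")}.${ext}`;
  const file = path.join(dir, name);
  // Owner-only permissions, since screenshots may contain sensitive content.
  fs.writeFileSync(file, decoded.bytes, { mode: 0o600 });
  return file;
}

// Read the whole data URL from stdin (file descriptor 0) so it is never
// part of argv or the shell command text.
function saveInlineImageFromStdin() {
  return saveInlineImage(fs.readFileSync(0, "utf8"));
}
```

To use it as a script, one would call `console.log(saveInlineImageFromStdin())` at the bottom of the file and feed the data URL on stdin. Unlike the `node -e "... Buffer.from('<base64>', ...)"` form, this keeps the payload out of the process arguments.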

package/dist/index.js CHANGED

@@ -7,6 +7,7 @@ var candidateDirs = [bundleDir, join(bundleDir, "..")];
 var dataDir = candidateDirs.find((d) => existsSync(join(d, "vision-models.json")) && existsSync(join(d, "subagent-body.md"))) ?? bundleDir;
 var manifest = JSON.parse(readFileSync(join(dataDir, "vision-models.json"), "utf8"));
 var bodyTpl = readFileSync(join(dataDir, "subagent-body.md"), "utf8");
+var VISION_CAPABLE_ORCHESTRATORS = new Set(manifest.models.map((m) => `${m.provider}/${m.model_id}`));
 var PERMISSION = {
   edit: "deny",
   read: "allow",
@@ -23,6 +24,10 @@ function subagentName(entry) {
 }
 var plugin = async () => ({
   config: async (cfg) => {
+    const orchestrator = cfg.model;
+    if (orchestrator && VISION_CAPABLE_ORCHESTRATORS.has(orchestrator)) {
+      return;
+    }
     cfg.agent ??= {};
     for (const e of manifest.models) {
       const name = subagentName(e);

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "opencode-vision",
-  "version": "0.1.0",
+  "version": "0.2.1",
   "description": "Typed visual-judgment skill for opencode. Registers 10 vision subagents (one per top-tier vision model across OpenAI, Kimi for Coding, Ollama Cloud, and opencode-go) and a skill that teaches a text-only orchestrator to extract visual-judgment intent, classify it into a typed judgment, and delegate to a vision subagent with a versioned request/report contract.",
   "type": "module",
   "main": "./dist/index.js",
@@ -21,6 +21,7 @@
     "prebuild": "rm -rf dist",
     "build": "bun build ./plugin.ts --outfile ./dist/index.js --target node --format esm --packages external",
     "prepublishOnly": "bun run build",
+    "sync:skill": "cp SKILL.md ../../skills/vision/SKILL.md",
     "typecheck": "tsc --noEmit"
   },
   "keywords": [