opencode-vision 0.1.0
- package/README.md +165 -0
- package/SKILL.md +396 -0
- package/dist/index.js +44 -0
- package/package.json +55 -0
- package/schemas/visual-judgment-report.v1.json +88 -0
- package/schemas/visual-judgment-request.v1.json +236 -0
- package/subagent-body.md +45 -0
- package/vision-models.json +15 -0
package/README.md
ADDED
|
@@ -0,0 +1,165 @@
|
|
|
1
|
+
# vision — Visual Judgment Skill for opencode
|
|
2
|
+
|
|
3
|
+
A typed visual-judgment contract for text-only orchestrators (GLM 5.2).
|
|
4
|
+
The orchestrator captures a screenshot via a browser/computer-use MCP,
|
|
5
|
+
extracts the visual-judgment intent, assembles a versioned request JSON,
|
|
6
|
+
delegates to a vision subagent, and parses a typed report.
|
|
7
|
+
|
|
8
|
+
## What it gives you
|
|
9
|
+
|
|
10
|
+
- **10 vision subagents** registered programmatically at init — one per
|
|
11
|
+
top-tier vision model across OpenAI, Kimi for Coding, Ollama Cloud, and
|
|
12
|
+
opencode-go.
|
|
13
|
+
- **A stable typed contract** — two versioned JSON Schemas
|
|
14
|
+
(`visual-judgment-request.v1` / `visual-judgment-report.v1`) replace the
|
|
15
|
+
old "design your own schema" free-for-all.
|
|
16
|
+
- **Per-session model selection** — the skill asks the user once which
|
|
17
|
+
vision model to use, then reuses it for the rest of the session.
|
|
18
|
+
- **10 judgment types** — `presence`, `absence`, `alignment`, `ordering`,
|
|
19
|
+
`equality`, `layout`, `readability`, `state`, `diff`, `describe`.
|
|
20
|
+
- **MCP integration** — works with chrome-devtools, Playwright, and
|
|
21
|
+
cua-driver screenshots. Uses the a11y/AX tree when it answers the
|
|
22
|
+
question; delegates to a vision subagent only when pixels matter.
|
|
23
|
+
|
|
24
|
+
## Install
|
|
25
|
+
|
|
26
|
+
Add the plugin to your `~/.config/opencode/opencode.json`:
|
|
27
|
+
|
|
28
|
+
```json
|
|
29
|
+
{
|
|
30
|
+
"$schema": "https://opencode.ai/config.json",
|
|
31
|
+
"plugin": [
|
|
32
|
+
"opencode-vision"
|
|
33
|
+
],
|
|
34
|
+
"skills": {
|
|
35
|
+
"paths": [
|
|
36
|
+
"~/.cache/opencode/node_modules/opencode-vision"
|
|
37
|
+
]
|
|
38
|
+
}
|
|
39
|
+
}
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
opencode auto-installs the npm package via Bun on next launch — no separate
|
|
43
|
+
`npm install` step needed. The skill ships inside the package (in `SKILL.md`),
|
|
44
|
+
so point `skills.paths` at the installed package location so opencode's skill
|
|
45
|
+
loader can find it.
|
|
46
|
+
|
|
47
|
+
The old `~/.config/opencode/agents/visual-judge.md` subagent is removed —
|
|
48
|
+
this plugin replaces it with 10 typed `vision-*` subagents. Delete the old
|
|
49
|
+
file if present:
|
|
50
|
+
|
|
51
|
+
```bash
|
|
52
|
+
rm -f ~/.config/opencode/agents/visual-judge.md
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
Restart opencode for the config to take effect.
|
|
56
|
+
|
|
57
|
+
> **Why `skills.paths` points at the installed package:** opencode's plugin
|
|
58
|
+
> loader resolves the npm package to its `dist/index.js` entrypoint and
|
|
59
|
+
> runs the `config(cfg)` hook that registers the 10 subagents. But opencode's
|
|
60
|
+
> *skill* loader scans directories for `SKILL.md` — it does not look inside
|
|
61
|
+
> npm packages automatically. So we point `skills.paths` at the installed
|
|
62
|
+
> package directory, where `SKILL.md` ships as a published file. The path
|
|
63
|
+
> above (`~/.cache/opencode/node_modules/opencode-vision`) is where Bun
|
|
64
|
+
> caches opencode plugins; adjust if your cache lives elsewhere.
|
|
65
|
+
|
|
66
|
+
## Verify
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
opencode debug agent vision-openai-gpt-5.5
|
|
70
|
+
```
|
|
71
|
+
|
|
72
|
+
The output should show the registered subagent with `model: openai/gpt-5.5`,
|
|
73
|
+
`mode: subagent`.
|
|
74
|
+
|
|
75
|
+
To inspect all 10, run the command once per subagent:
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
opencode debug agent vision-openai-gpt-5.5
|
|
79
|
+
opencode debug agent vision-kimi-for-coding-k2p7
|
|
80
|
+
opencode debug agent vision-ollama-cloud-gemini-3-flash-preview
|
|
81
|
+
opencode debug agent vision-ollama-cloud-gemma4-31b
|
|
82
|
+
opencode debug agent vision-ollama-cloud-minimax-m3
|
|
83
|
+
opencode debug agent vision-ollama-cloud-qwen3.5-397b
|
|
84
|
+
opencode debug agent vision-opencode-go-kimi-k2.7-code
|
|
85
|
+
opencode debug agent vision-opencode-go-minimax-m3
|
|
86
|
+
opencode debug agent vision-opencode-go-qwen3.7-plus
|
|
87
|
+
opencode debug agent vision-opencode-go-mimo-v2.5
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
## Smoke test
|
|
91
|
+
|
|
92
|
+
Ask the orchestrator something visual:
|
|
93
|
+
|
|
94
|
+
> Visually verify the screenshot at /tmp/foo.png shows a centered button.
|
|
95
|
+
|
|
96
|
+
The orchestrator should:
|
|
97
|
+
1. Detect the visual-judgment intent.
|
|
98
|
+
2. Ask you (once) which vision model to use.
|
|
99
|
+
3. Assemble a `visual-judgment-request.v1` JSON with `judgment.type:
|
|
100
|
+
alignment`.
|
|
101
|
+
4. Delegate to the chosen `vision-*` subagent.
|
|
102
|
+
5. Parse the report and tell you pass/fail with the button's position.
|
|
103
|
+
|
|
104
|
+
## File layout (source)
|
|
105
|
+
|
|
106
|
+
```
|
|
107
|
+
opencode/vision/ # this sub-package, published as opencode-vision
|
|
108
|
+
package.json # npm package metadata; main -> dist/index.js
|
|
109
|
+
plugin.ts # source: registers 10 vision-* subagents via config(cfg)
|
|
110
|
+
dist/ # built on prepublishOnly (gitignored)
|
|
111
|
+
index.js # built bundle — the package entrypoint
|
|
112
|
+
vision-models.json # 10-entry manifest (one top-tier per provider × family)
|
|
113
|
+
subagent-body.md # shared subagent prompt template
|
|
114
|
+
SKILL.md # intent-capture protocol + per-session question + MCP integration
|
|
115
|
+
schemas/
|
|
116
|
+
visual-judgment-request.v1.json
|
|
117
|
+
visual-judgment-report.v1.json
|
|
118
|
+
README.md # this file
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
## Build & publish (maintainers)
|
|
122
|
+
|
|
123
|
+
```bash
|
|
124
|
+
cd opencode/vision
|
|
125
|
+
bun run build # builds dist/index.js
|
|
126
|
+
npm publish # runs prepublishOnly -> build -> publish
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
The `files` field in `package.json` controls what ships: `dist/`,
|
|
130
|
+
`SKILL.md`, `schemas/`, `subagent-body.md`, `vision-models.json`,
|
|
131
|
+
`README.md`. No source `.ts` or `node_modules` leak.
|
|
132
|
+
|
|
133
|
+
## Catalog (10 models, 4 providers)
|
|
134
|
+
|
|
135
|
+
Curation rule: one top-tier model per provider × vendor family; drop
|
|
136
|
+
non-reasoning, drop superseded within a provider, drop coding-specialized,
|
|
137
|
+
drop Pro/billing variants of the same family; keep cross-provider
|
|
138
|
+
duplicates.
|
|
139
|
+
|
|
140
|
+
| Provider | Model | Family |
|
|
141
|
+
|---|---|---|
|
|
142
|
+
| openai | gpt-5.5 | GPT-5.5 |
|
|
143
|
+
| kimi-for-coding | k2p7 | Kimi K2.7 |
|
|
144
|
+
| ollama-cloud | gemini-3-flash-preview | Gemini |
|
|
145
|
+
| ollama-cloud | gemma4:31b | Gemma |
|
|
146
|
+
| ollama-cloud | minimax-m3 | MiniMax |
|
|
147
|
+
| ollama-cloud | qwen3.5:397b | Qwen 3.5 |
|
|
148
|
+
| opencode-go | kimi-k2.7-code | Kimi K2.7 (cross-provider route) |
|
|
149
|
+
| opencode-go | minimax-m3 | MiniMax (cross-provider route) |
|
|
150
|
+
| opencode-go | qwen3.7-plus | Qwen 3.7 |
|
|
151
|
+
| opencode-go | mimo-v2.5 | MiMo |
|
|
152
|
+
|
|
153
|
+
To add a model: add one line to `vision-models.json` and restart opencode.
|
|
154
|
+
The plugin re-reads the manifest at init.
|
|
155
|
+
|
|
156
|
+
## Schemas
|
|
157
|
+
|
|
158
|
+
Published via GitHub raw URLs (branch `main`):
|
|
159
|
+
|
|
160
|
+
- Request: `https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json`
|
|
161
|
+
- Report: `https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json`
|
|
162
|
+
|
|
163
|
+
The files also live in this repo under `opencode/vision/schemas/` for
|
|
164
|
+
editing. The URL is the canonical `$id`/`$schema` reference used by the
|
|
165
|
+
SKILL.md and subagent body.
|
package/SKILL.md
ADDED
|
@@ -0,0 +1,396 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: vision
|
|
3
|
+
description: >-
|
|
4
|
+
Use when you must verify, check, or evaluate what is visually rendered in
|
|
5
|
+
one or more images — e.g. "visually verify the screenshot shows a centered
|
|
6
|
+
button", "check the icon is visible", "does the layout match the design",
|
|
7
|
+
acceptance criteria mentioning on-screen state. Captures visual-judgment
|
|
8
|
+
intent from user prompts or MCP task outputs, classifies it into a typed
|
|
9
|
+
judgment, asks the user once per session which vision model to use,
|
|
10
|
+
assembles a versioned request, delegates to a vision subagent, and parses
|
|
11
|
+
the typed report. Requires locally-stored image files (cua-driver
|
|
12
|
+
screenshots, Playwright/chrome-devtools captures, user-provided paths).
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
# Vision — Visual Judgment Skill
|
|
16
|
+
|
|
17
|
+
You are a text-only orchestrator (GLM 5.2). You cannot see images. When a
|
|
18
|
+
task requires visual verification, you delegate to a vision subagent that
|
|
19
|
+
returns a typed report. This skill defines the extraction pipeline:
|
|
20
|
+
**Detect → Classify → Assemble → Pick model → Delegate → Parse**.
|
|
21
|
+
|
|
22
|
+
## Why this skill exists
|
|
23
|
+
|
|
24
|
+
You are text-only (`attachment: false`). You cannot verify visual properties
|
|
25
|
+
yourself — alignment, color, readability, layout. A vision subagent can.
|
|
26
|
+
This skill gives you a stable contract for talking to one.
|
|
27
|
+
|
|
28
|
+
## The two schemas
|
|
29
|
+
|
|
30
|
+
- **Request** (what you emit, passed as the `task` prompt):
|
|
31
|
+
https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json
|
|
32
|
+
- **Report** (what the subagent returns):
|
|
33
|
+
https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json
|
|
34
|
+
|
|
35
|
+
## Step 1. Detect
|
|
36
|
+
|
|
37
|
+
Visual-judgment intent arrives from two sources. Recognize both.
|
|
38
|
+
|
|
39
|
+
### Source A — explicit visual-judgment language in a user prompt
|
|
40
|
+
|
|
41
|
+
Trigger lexicon (any of these suggests visual judgment):
|
|
42
|
+
- "visually verify", "visually check", "screenshot shows"
|
|
43
|
+
- "looks right", "looks wrong", "looks broken"
|
|
44
|
+
- "centered", "aligned", "overlapping", "misaligned"
|
|
45
|
+
- "visible", "hidden", "not showing"
|
|
46
|
+
- "readable", "legible", "too small", "low contrast"
|
|
47
|
+
- "on" / "off" / "checked" / "disabled" (for a control's visual state)
|
|
48
|
+
- "matches the design", "matches the mockup"
|
|
49
|
+
- acceptance criteria mentioning on-screen state
|
|
50
|
+
- a user-provided image path (e.g. `/tmp/foo.png`)
|
|
51
|
+
|
|
52
|
+
If the user's request contains image-attachment references or a path to a
|
|
53
|
+
screenshot or other image file, that is also a trigger.
|
|
54
|
+
|
|
55
|
+
### Source B — a gap between an MCP task output and a visual criterion
|
|
56
|
+
|
|
57
|
+
A `browser-use-*` or `computer-use-cua` subagent returned a screenshot
|
|
58
|
+
path plus a text description of what is on screen. But the user's
|
|
59
|
+
criterion is visual (positional, color, readability, layout) and the text
|
|
60
|
+
description cannot fully prove it. You recognize the gap and extract a
|
|
61
|
+
visual-judgment intent from the combination of user criterion + MCP output.
|
|
62
|
+
|
|
63
|
+
**Example**: user says "log into the app and check the dashboard looks
|
|
64
|
+
right." You spawn `browser-use-chrome-devtools` to navigate + screenshot.
|
|
65
|
+
It returns `/tmp/dashboard.png` + "sidebar with nav items, bar chart,
|
|
66
|
+
welcome header." The text describes structure, but "looks right" is a
|
|
67
|
+
visual layout quality the text can't fully prove → you detect a
|
|
68
|
+
visual-judgment need.
|
|
69
|
+
|
|
70
|
+
## Step 2. Classify
|
|
71
|
+
|
|
72
|
+
Map the NL task to one of the 10 closed `judgment.type` values. Each has
|
|
73
|
+
typed `parameters`.
|
|
74
|
+
|
|
75
|
+
| Type | When to use | Typed parameters |
|
|
76
|
+
|---|---|---|
|
|
77
|
+
| `presence` | Is X visible on screen? | `subject`, `expectation: present\|absent` |
|
|
78
|
+
| `absence` | Is X NOT visible? (dual of presence) | `subject`, `expectation: absent` |
|
|
79
|
+
| `alignment` | Is X centered / left-aligned / top-aligned along an axis? | `subject`, `axis`, `expectation`, `tolerance` |
|
|
80
|
+
| `ordering` | Are items in expected left-to-right or top-to-bottom order? | `direction: ltr\|ttb`, `expected[]` |
|
|
81
|
+
| `equality` | Do two images render the same thing? | `subjects[2]`, `threshold: exact\|perceptual` |
|
|
82
|
+
| `layout` | Open-ended structural check (arrangement, spacing) | `expectations` (NL) |
|
|
83
|
+
| `readability` | Is text legible? (contrast, size) | `subject` |
|
|
84
|
+
| `state` | Is a control in a given state? (toggle, checkbox) | `subject`, `expectedState` |
|
|
85
|
+
| `diff` | What changed between two screenshots? | `baseline`, `current` (image labels) |
|
|
86
|
+
| `describe` | Open-ended description of what's on screen | `focus` |
|
|
87
|
+
|
|
88
|
+
Worked examples (one per type) are in the appendix at the bottom of this
|
|
89
|
+
file. When in doubt, pick the most specific type that fits; fall back to
|
|
90
|
+
`describe` if nothing fits.
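
As a rough illustration of the classification table, the sketch below checks a `judgment` object against the ten type names and the parameter names listed per type. `checkJudgment` and `JUDGMENT_PARAMETERS` are hypothetical names, not part of the package, and treating every listed parameter as required is an assumption; the request schema remains the authority.

```javascript
// Parameter names per judgment type, copied from the Step 2 table.
// Assumption: every listed parameter is treated as required here.
const JUDGMENT_PARAMETERS = {
  presence: ["subject", "expectation"],
  absence: ["subject", "expectation"],
  alignment: ["subject", "axis", "expectation", "tolerance"],
  ordering: ["direction", "expected"],
  equality: ["subjects", "threshold"],
  layout: ["expectations"],
  readability: ["subject"],
  state: ["subject", "expectedState"],
  diff: ["baseline", "current"],
  describe: ["focus"],
};

// Hypothetical helper: returns a list of problems, empty when nothing is missing.
function checkJudgment(judgment) {
  if (!Object.hasOwn(JUDGMENT_PARAMETERS, judgment.type)) {
    return [`unknown judgment.type "${judgment.type}"; consider "describe"`];
  }
  const given = judgment.parameters ?? {};
  return JUDGMENT_PARAMETERS[judgment.type]
    .filter((key) => !(key in given))
    .map((key) => `missing parameter "${key}" for type "${judgment.type}"`);
}
```

Running a check like this before delegating can catch a mistyped type name without paying for a vision call.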
|
|
91
|
+
|
|
92
|
+
## Step 3. Assemble
|
|
93
|
+
|
|
94
|
+
Construct the `visual-judgment-request.v1` JSON object.
|
|
95
|
+
|
|
96
|
+
### 3a. Gather image paths
|
|
97
|
+
|
|
98
|
+
Image paths come from:
|
|
99
|
+
|
|
100
|
+
| Source | How to get the path |
|
|
101
|
+
|---|---|
|
|
102
|
+
| User-provided | Use the path the user gave (e.g. `/tmp/foo.png`). |
|
|
103
|
+
| chrome-devtools MCP | `chrome-devtools_take_screenshot({ filePath: "/tmp/shot.png" })` — saves PNG to disk. |
|
|
104
|
+
| Playwright MCP | `playwright_browser_take_screenshot({ filename: "shot.png" })` — saves to the configured output directory. |
|
|
105
|
+
| cua-driver MCP | `cua-driver_get_window_state({ pid, window_id, screenshot_out_file: "/tmp/win.png" })` — saves window screenshot to disk. Also returns the AX tree as text. |
|
|
106
|
+
| Browser-use subagent output | The subagent returns the path in its text response; extract it. |
|
|
107
|
+
|
|
108
|
+
For each image, assign a `label` (short, used in `observations[].imageLabel`)
|
|
109
|
+
and a `role` (`baseline` = before/reference, `current` = the thing under
|
|
110
|
+
test, `reference` = design target).
|
|
111
|
+
|
|
112
|
+
### 3b. Dual-track: a11y tree vs. visual judgment
|
|
113
|
+
|
|
114
|
+
Before delegating, check whether the text tree already answers the
|
|
115
|
+
question. All three MCPs (chrome-devtools, Playwright, cua-driver) return
|
|
116
|
+
an accessibility/AX tree alongside the screenshot. You can read that text
|
|
117
|
+
directly — no vision call needed.
|
|
118
|
+
|
|
119
|
+
| Criterion | Source | Delegate to vision? |
|
|
120
|
+
|---|---|---|
|
|
121
|
+
| "Button exists" | a11y tree (element present) | No |
|
|
122
|
+
| "Button is enabled/disabled" | a11y tree (`AXEnabled`) | No |
|
|
123
|
+
| "Button text says 'Submit'" | a11y tree (`AXTitle`/`AXValue`) | No |
|
|
124
|
+
| "Button is centered" | Screenshot (positional) | **Yes** |
|
|
125
|
+
| "Text is readable" | Screenshot (contrast/size) | **Yes** |
|
|
126
|
+
| "Toggle is blue" | Screenshot (color) | **Yes** |
|
|
127
|
+
| "Layout matches design" | Screenshot (structural) | **Yes** |
|
|
128
|
+
| "Two screenshots are identical" | Screenshot pair | **Yes** |
|
|
129
|
+
|
|
130
|
+
Use the cheap text source first. Only pay for a vision call when the text
|
|
131
|
+
tree cannot answer.
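
The dual-track rule can be sketched as a small lookup. The criterion kind names below are hypothetical labels for the first three rows of the table, and `needsVisionCall` is an illustrative helper, not something the package ships. Anything not known to be answerable from the text tree is assumed to need pixels.

```javascript
// Hypothetical labels for the three table rows the a11y/AX tree can answer.
const ANSWERABLE_FROM_TEXT_TREE = new Set([
  "existence",     // "Button exists"
  "enabled-state", // "Button is enabled/disabled"
  "text-content",  // "Button text says 'Submit'"
]);

// Returns true when the criterion should go to a vision subagent.
// Assumption: unrecognized kinds default to a vision call.
function needsVisionCall(criterionKind) {
  return !ANSWERABLE_FROM_TEXT_TREE.has(criterionKind);
}
```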
|
|
132
|
+
|
|
133
|
+
### 3c. Fill typed parameters + NL criteria
|
|
134
|
+
|
|
135
|
+
Fill `judgment.parameters` per the type (Step 2 table). If the typed
|
|
136
|
+
parameters cannot fully express the nuance, add a free-form `criteria`
|
|
137
|
+
string as a fallback for the subagent. Also set `responseContract` if you
|
|
138
|
+
want something specific back beyond the fixed report envelope.
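
A minimal sketch of the assembly step, using the field names that appear in the appendix examples. `assembleRequest` is a hypothetical helper, not part of the package, and the role check assumes the three roles named in 3a are the only valid ones.

```javascript
const REQUEST_SCHEMA_URL =
  "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json";

// Roles named in section 3a; assumed here to be the complete set.
const IMAGE_ROLES = ["baseline", "current", "reference"];

// Hypothetical helper: builds a request object shaped like the appendix examples.
function assembleRequest({ id, preferredModel, images, judgment, criteria, responseContract }) {
  for (const image of images) {
    if (!IMAGE_ROLES.includes(image.role)) {
      throw new Error(`unknown image role "${image.role}" for ${image.path}`);
    }
  }
  const request = { $schema: REQUEST_SCHEMA_URL, id, preferredModel, images, judgment };
  // The two free-form fields are added only when the caller supplies them.
  if (criteria) request.criteria = criteria;
  if (responseContract) request.responseContract = responseContract;
  return request;
}
```

Serializing the result with `JSON.stringify(request, null, 2)` yields text that can be handed to the subagent as its prompt.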
|
|
139
|
+
|
|
140
|
+
### 3d. Edge case — MCP output has no screenshot path
|
|
141
|
+
|
|
142
|
+
If a browser-use subagent returned only text (no path) but a visual
|
|
143
|
+
judgment is still needed, capture a screenshot yourself by driving the
|
|
144
|
+
MCP directly (see 3a table), or re-task the subagent with explicit
|
|
145
|
+
screenshot-save instructions.
|
|
146
|
+
|
|
147
|
+
### 3e. Edge case — built-in computer-use MCP
|
|
148
|
+
|
|
149
|
+
The built-in Claude Code `computer-use` MCP returns screenshots as inline
|
|
150
|
+
base64 images, not file paths. You cannot see inline images (you are
|
|
151
|
+
text-only), and the vision subagent needs a file path to `read`. Prefer
|
|
152
|
+
`cua-driver` for desktop visual judgments — it has `screenshot_out_file`.
|
|
153
|
+
|
|
154
|
+
## Step 4. Pick model (once per session)
|
|
155
|
+
|
|
156
|
+
opencode has no per-call model override and no LLM-set session variable.
|
|
157
|
+
The model choice is carried in your own context for the rest of the
|
|
158
|
+
session.
|
|
159
|
+
|
|
160
|
+
**On the first visual-judgment need in a session**, before delegating,
|
|
161
|
+
call the `question` tool once:
|
|
162
|
+
|
|
163
|
+
```
|
|
164
|
+
question({
|
|
165
|
+
questions: [{
|
|
166
|
+
header: "Vision model",
|
|
167
|
+
question: "I found several models that support vision tasks. Which model would you prefer for visual judgments this session?",
|
|
168
|
+
options: [
|
|
169
|
+
{ label: "openai/gpt-5.5", description: "Highest accuracy (Recommended)" },
|
|
170
|
+
{ label: "kimi-for-coding/k2p7", description: "Kimi K2.7 Code" },
|
|
171
|
+
{ label: "ollama-cloud/gemini-3-flash-preview", description: "Gemini 3 Flash, 1M context" },
|
|
172
|
+
{ label: "ollama-cloud/gemma4:31b", description: "Gemma 4 31B" },
|
|
173
|
+
{ label: "ollama-cloud/minimax-m3", description: "MiniMax M3" },
|
|
174
|
+
{ label: "ollama-cloud/qwen3.5:397b", description: "Qwen 3.5 397B" },
|
|
175
|
+
{ label: "opencode-go/kimi-k2.7-code", description: "Kimi K2.7 Code via opencode-go" },
|
|
176
|
+
{ label: "opencode-go/minimax-m3", description: "MiniMax M3 via opencode-go" },
|
|
177
|
+
{ label: "opencode-go/qwen3.7-plus", description: "Qwen 3.7 Plus, 1M context" },
|
|
178
|
+
{ label: "opencode-go/mimo-v2.5", description: "MiMo V2.5, 1M context" }
|
|
179
|
+
]
|
|
180
|
+
}]
|
|
181
|
+
})
|
|
182
|
+
```
|
|
183
|
+
|
|
184
|
+
The tool auto-adds an "Other" option (type your own). After the user
|
|
185
|
+
answers:
|
|
186
|
+
|
|
187
|
+
- Map the answer to a `subagent_type` via the table below.
|
|
188
|
+
- Remember the choice for the rest of the session. Do not ask again.
|
|
189
|
+
Reuse the chosen model for all subsequent visual judgments in this
|
|
190
|
+
session.
|
|
191
|
+
- If the user picks "Other" and types a model id, map it to the closest
|
|
192
|
+
matching `vision-*` subagent from the table, or fall back to
|
|
193
|
+
`vision-openai-gpt-5.5` if no match.
|
|
194
|
+
|
|
195
|
+
### `preferredModel → subagent_type` mapping table
|
|
196
|
+
|
|
197
|
+
```
|
|
198
|
+
openai/gpt-5.5 -> vision-openai-gpt-5.5
|
|
199
|
+
kimi-for-coding/k2p7 -> vision-kimi-for-coding-k2p7
|
|
200
|
+
ollama-cloud/gemini-3-flash-preview -> vision-ollama-cloud-gemini-3-flash-preview
|
|
201
|
+
ollama-cloud/gemma4:31b -> vision-ollama-cloud-gemma4-31b
|
|
202
|
+
ollama-cloud/minimax-m3 -> vision-ollama-cloud-minimax-m3
|
|
203
|
+
ollama-cloud/qwen3.5:397b -> vision-ollama-cloud-qwen3.5-397b
|
|
204
|
+
opencode-go/kimi-k2.7-code -> vision-opencode-go-kimi-k2.7-code
|
|
205
|
+
opencode-go/minimax-m3 -> vision-opencode-go-minimax-m3
|
|
206
|
+
opencode-go/qwen3.7-plus -> vision-opencode-go-qwen3.7-plus
|
|
207
|
+
opencode-go/mimo-v2.5 -> vision-opencode-go-mimo-v2.5
|
|
208
|
+
```
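
The table follows the same naming rule the bundled plugin applies in `dist/index.js`: `vision-` plus the provider, a hyphen, and the model id with `/` and `:` replaced by `-`. The sketch below applies that rule to a user answer. `toSubagentType` is a hypothetical helper, and it only handles exact matches; the "closest match" step for free-typed answers is left out and falls through to the default.

```javascript
const FALLBACK_SUBAGENT = "vision-openai-gpt-5.5";

// The ten provider/model ids from the mapping table.
const KNOWN_MODELS = [
  "openai/gpt-5.5",
  "kimi-for-coding/k2p7",
  "ollama-cloud/gemini-3-flash-preview",
  "ollama-cloud/gemma4:31b",
  "ollama-cloud/minimax-m3",
  "ollama-cloud/qwen3.5:397b",
  "opencode-go/kimi-k2.7-code",
  "opencode-go/minimax-m3",
  "opencode-go/qwen3.7-plus",
  "opencode-go/mimo-v2.5",
];

// Hypothetical helper: maps a preferredModel string to a subagent_type.
function toSubagentType(preferredModel) {
  if (!KNOWN_MODELS.includes(preferredModel)) return FALLBACK_SUBAGENT;
  const slash = preferredModel.indexOf("/");
  const provider = preferredModel.slice(0, slash);
  const modelId = preferredModel.slice(slash + 1);
  // Same substitution as subagentName() in the plugin bundle.
  return "vision-" + provider + "-" + modelId.replace(/[/:]/g, "-");
}
```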
|
|
209
|
+
|
|
210
|
+
## Step 5. Delegate
|
|
211
|
+
|
|
212
|
+
Spawn the subagent with the assembled request JSON as the `prompt`:
|
|
213
|
+
|
|
214
|
+
```
|
|
215
|
+
task({
|
|
216
|
+
subagent_type: "<mapped subagent_type>",
|
|
217
|
+
description: "<short, e.g. 'Verify Submit button is centered'>",
|
|
218
|
+
prompt: <the full visual-judgment-request.v1 JSON object>
|
|
219
|
+
})
|
|
220
|
+
```
|
|
221
|
+
|
|
222
|
+
## Step 6. Parse
|
|
223
|
+
|
|
224
|
+
The subagent returns a `visual-judgment-report.v1` JSON object. Branch on
|
|
225
|
+
`status` and `verdict`:
|
|
226
|
+
|
|
227
|
+
- `status: "ok"` + `verdict: "pass"` → criterion met. Report success to
|
|
228
|
+
the user, citing `observations[]` as evidence.
|
|
229
|
+
- `status: "ok"` + `verdict: "fail"` → criterion not met. Report failure,
|
|
230
|
+
citing the specific `observations[]` (e.g. "button is 42px right of
|
|
231
|
+
center"). Include `reasoning`.
|
|
232
|
+
- `status: "ok"` + `verdict: "inconclusive"` → informational (for `diff`
|
|
233
|
+
and `describe`) or genuinely undeterminable. Surface `observations[]`
|
|
234
|
+
and `diff[]` directly to the user.
|
|
235
|
+
- `status: "error"` → the subagent could not analyze the image(s). Check
|
|
236
|
+
`errors[]` (codes: `file_not_found`, `unsupported_format`,
|
|
237
|
+
`model_unavailable`). If `model_unavailable`, retry with a different
|
|
238
|
+
model from the mapping table, or re-ask the user.
|
|
239
|
+
- `status: "insufficient-evidence"` → the subagent analyzed the image but
|
|
240
|
+
cannot reach a verdict. Report this honestly; do not pretend a verdict.
|
|
241
|
+
|
|
242
|
+
Surface `observations[]` as citations so the user sees what the subagent
|
|
243
|
+
actually saw. Include `confidence` in your report to the user.
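
The branching above can be sketched as a single function over the parsed report. `summarizeReport` is a hypothetical helper, and the field shapes are assumptions drawn from the prose rather than from the report schema; in particular, each `errors[]` entry is assumed to carry a `code` property.

```javascript
// Hypothetical helper: reduces a parsed report to what the orchestrator acts on.
function summarizeReport(report) {
  switch (report.status) {
    case "ok":
      return {
        outcome: report.verdict, // "pass", "fail", or "inconclusive"
        evidence: report.observations ?? [],
        reasoning: report.reasoning ?? null,
        confidence: report.confidence ?? null,
        retryWithOtherModel: false,
      };
    case "error": {
      // Assumption: each error entry looks like { code: "file_not_found", ... }.
      const codes = (report.errors ?? []).map((e) => e.code);
      return {
        outcome: "error",
        codes,
        retryWithOtherModel: codes.includes("model_unavailable"),
      };
    }
    case "insufficient-evidence":
      // No verdict is invented here; the caller reports the gap as-is.
      return {
        outcome: "no-verdict",
        evidence: report.observations ?? [],
        retryWithOtherModel: false,
      };
    default:
      throw new Error(`unexpected report status: ${report.status}`);
  }
}
```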
|
|
244
|
+
|
|
245
|
+
## Two integration patterns
|
|
246
|
+
|
|
247
|
+
### Pattern 1 — Direct (simple "screenshot + judge")
|
|
248
|
+
|
|
249
|
+
Use when the browser/desktop interaction is trivial (just navigate and
|
|
250
|
+
look). You drive the MCP directly, capture one screenshot, delegate one
|
|
251
|
+
judgment.
|
|
252
|
+
|
|
253
|
+
```
|
|
254
|
+
You: chrome-devtools_navigate_page({ url: "http://localhost:3000" })
|
|
255
|
+
You: chrome-devtools_take_screenshot({ filePath: "/tmp/login.png" })
|
|
256
|
+
You: [assemble request with /tmp/login.png]
|
|
257
|
+
You: task({ subagent_type: "vision-openai-gpt-5.5", prompt: <request> })
|
|
258
|
+
```
|
|
259
|
+
|
|
260
|
+
### Pattern 2 — Two-phase (complex interaction, then judge)
|
|
261
|
+
|
|
262
|
+
Use when interaction is non-trivial (navigate, click, fill, navigate
|
|
263
|
+
again). You spawn a `browser-use-*` or `computer-use-cua` subagent to
|
|
264
|
+
perform the interaction and capture a screenshot. It returns the path.
|
|
265
|
+
You then delegate to a `vision-*` subagent.
|
|
266
|
+
|
|
267
|
+
```
|
|
268
|
+
You: task({
|
|
269
|
+
subagent_type: "browser-use-chrome-devtools",
|
|
270
|
+
prompt: "Navigate to /login, fill credentials, click Submit, wait for
|
|
271
|
+
dashboard, take a screenshot to /tmp/dashboard.png. Return
|
|
272
|
+
the file path and a brief text description."
|
|
273
|
+
})
|
|
274
|
+
-> subagent returns: "/tmp/dashboard.png, sidebar + chart + header"
|
|
275
|
+
You: [assemble request with /tmp/dashboard.png, judgment.type=layout]
|
|
276
|
+
You: task({ subagent_type: "vision-openai-gpt-5.5", prompt: <request> })
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
Separation of concerns: the browser subagent knows how to drive; the
|
|
280
|
+
vision subagent knows how to see.
|
|
281
|
+
|
|
282
|
+
---
|
|
283
|
+
|
|
284
|
+
## Appendix — worked examples per judgment type
|
|
285
|
+
|
|
286
|
+
### presence — "is X visible?"
|
|
287
|
+
```json
|
|
288
|
+
{
|
|
289
|
+
"$schema": "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json",
|
|
290
|
+
"id": "vj-001",
|
|
291
|
+
"preferredModel": "openai/gpt-5.5",
|
|
292
|
+
"images": [{ "path": "/tmp/login.png", "label": "login-screen", "role": "current" }],
|
|
293
|
+
"judgment": { "type": "presence", "parameters": { "subject": "Submit button", "expectation": "present" } },
|
|
294
|
+
"criteria": "A clickable button labeled 'Submit' or equivalent, within the login form area.",
|
|
295
|
+
"responseContract": "Return pass/fail and note the button's position if found."
|
|
296
|
+
}
|
|
297
|
+
```
|
|
298
|
+
|
|
299
|
+
### absence — "is X NOT visible?"
|
|
300
|
+
```json
|
|
301
|
+
{
|
|
302
|
+
"id": "vj-002", "preferredModel": "openai/gpt-5.5",
|
|
303
|
+
"images": [{ "path": "/tmp/post-logout.png", "label": "home", "role": "current" }],
|
|
304
|
+
"judgment": { "type": "absence", "parameters": { "subject": "error banner", "expectation": "absent" } },
|
|
305
|
+
"criteria": "No red/error banner at the top of the page or anywhere on screen."
|
|
306
|
+
}
|
|
307
|
+
```
|
|
308
|
+
|
|
309
|
+
### alignment — "is X centered on an axis?"
|
|
310
|
+
```json
|
|
311
|
+
{
|
|
312
|
+
"id": "vj-003", "preferredModel": "openai/gpt-5.5",
|
|
313
|
+
"images": [{ "path": "/tmp/header.png", "label": "header", "role": "current" }],
|
|
314
|
+
"judgment": { "type": "alignment", "parameters": { "subject": "logo", "axis": "horizontal", "expectation": "centered", "tolerance": "loose" } },
|
|
315
|
+
"criteria": "Logo should be roughly centered in the header band, allowing minor off-center within ~5%."
|
|
316
|
+
}
|
|
317
|
+
```
|
|
318
|
+
|
|
319
|
+
### ordering — "are items in expected LTR/TTB order?"
|
|
320
|
+
```json
|
|
321
|
+
{
|
|
322
|
+
"id": "vj-004", "preferredModel": "openai/gpt-5.5",
|
|
323
|
+
"images": [{ "path": "/tmp/nav.png", "label": "navbar", "role": "current" }],
|
|
324
|
+
"judgment": { "type": "ordering", "parameters": { "direction": "ltr", "expected": ["Home", "Products", "About", "Contact"] } },
|
|
325
|
+
"criteria": "Items read left-to-right in the specified order."
|
|
326
|
+
}
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
### equality — "do two images match?"
|
|
330
|
+
```json
|
|
331
|
+
{
|
|
332
|
+
"id": "vj-005", "preferredModel": "openai/gpt-5.5",
|
|
333
|
+
"images": [
|
|
334
|
+
{ "path": "/tmp/chart-v1.png", "label": "v1", "role": "baseline" },
|
|
335
|
+
{ "path": "/tmp/chart-v2.png", "label": "v2", "role": "current" }
|
|
336
|
+
],
|
|
337
|
+
"judgment": { "type": "equality", "parameters": { "subjects": ["v1", "v2"], "threshold": "perceptual" } },
|
|
338
|
+
"criteria": "Minor pixel-level anti-aliasing differences are acceptable; structural differences are not."
|
|
339
|
+
}
|
|
340
|
+
```
|
|
341
|
+
|
|
342
|
+
### layout — "does the structure match expectations?"
|
|
343
|
+
```json
|
|
344
|
+
{
|
|
345
|
+
"id": "vj-006", "preferredModel": "openai/gpt-5.5",
|
|
346
|
+
"images": [{ "path": "/tmp/form.png", "label": "signup-form", "role": "current" }],
|
|
347
|
+
"judgment": { "type": "layout", "parameters": { "expectations": "Fields stacked vertically; equal vertical gaps; labels above inputs." } },
|
|
348
|
+
"criteria": "Email, Password, Confirm Password fields in that top-to-bottom order."
|
|
349
|
+
}
|
|
350
|
+
```
|
|
351
|
+
|
|
352
|
+
### readability — "is the text legible?"
|
|
353
|
+
```json
|
|
354
|
+
{
|
|
355
|
+
"id": "vj-007", "preferredModel": "openai/gpt-5.5",
|
|
356
|
+
"images": [{ "path": "/tmp/page.png", "label": "page", "role": "current" }],
|
|
357
|
+
"judgment": { "type": "readability", "parameters": { "subject": "footer text" } },
|
|
358
|
+
"criteria": "Footer text should be readable at normal viewing distance; not blurry, not too small, sufficient contrast."
|
|
359
|
+
}
|
|
360
|
+
```
|
|
361
|
+
|
|
362
|
+
### state — "is the control in the expected state?"
|
|
363
|
+
```json
|
|
364
|
+
{
|
|
365
|
+
"id": "vj-008", "preferredModel": "openai/gpt-5.5",
|
|
366
|
+
"images": [{ "path": "/tmp/settings.png", "label": "settings-panel", "role": "current" }],
|
|
367
|
+
"judgment": { "type": "state", "parameters": { "subject": "notifications toggle", "expectedState": "on" } },
|
|
368
|
+
"criteria": "Toggle knob should be on the right side with the accent color (blue)."
|
|
369
|
+
}
|
|
370
|
+
```
|
|
371
|
+
|
|
372
|
+
### diff — "what changed between two screenshots?"
|
|
373
|
+
```json
|
|
374
|
+
{
|
|
375
|
+
"id": "vj-009", "preferredModel": "openai/gpt-5.5",
|
|
376
|
+
"images": [
|
|
377
|
+
{ "path": "/tmp/before.png", "label": "before", "role": "baseline" },
|
|
378
|
+
{ "path": "/tmp/after.png", "label": "after", "role": "current" }
|
|
379
|
+
],
|
|
380
|
+
"judgment": { "type": "diff", "parameters": { "baseline": "before", "current": "after" } },
|
|
381
|
+
"criteria": "Report all visual differences: added/removed/changed elements, color shifts, position changes."
|
|
382
|
+
}
|
|
383
|
+
```
|
|
384
|
+
|
|
385
|
+
### describe — "what's on screen?"
|
|
386
|
+
```json
|
|
387
|
+
{
|
|
388
|
+
"id": "vj-010", "preferredModel": "openai/gpt-5.5",
|
|
389
|
+
"images": [{ "path": "/tmp/screenshot.png", "label": "screen", "role": "current" }],
|
|
390
|
+
"judgment": { "type": "describe", "parameters": { "focus": "overall layout and primary UI elements" } },
|
|
391
|
+
"criteria": "Capture: app type, main regions, primary actions, color scheme."
|
|
392
|
+
}
|
|
393
|
+
```
|
|
394
|
+
|
|
395
|
+
For `diff` and `describe`, expect `verdict: "inconclusive"` — these are
|
|
396
|
+
informational, not pass/fail. Use `diff[]` and `observations[]` directly.
|
package/dist/index.js
ADDED
@@ -0,0 +1,44 @@
+// plugin.ts
+import { readFileSync, existsSync } from "node:fs";
+import { fileURLToPath } from "node:url";
+import { dirname, join } from "node:path";
+var bundleDir = dirname(fileURLToPath(import.meta.url));
+var candidateDirs = [bundleDir, join(bundleDir, "..")];
+var dataDir = candidateDirs.find((d) => existsSync(join(d, "vision-models.json")) && existsSync(join(d, "subagent-body.md"))) ?? bundleDir;
+var manifest = JSON.parse(readFileSync(join(dataDir, "vision-models.json"), "utf8"));
+var bodyTpl = readFileSync(join(dataDir, "subagent-body.md"), "utf8");
+var PERMISSION = {
+  edit: "deny",
+  read: "allow",
+  glob: "allow",
+  grep: "allow",
+  list: "allow",
+  external_directory: {
+    "/private/tmp/**": "allow",
+    "/private/var/folders/**": "allow"
+  }
+};
+function subagentName(entry) {
+  return "vision-" + entry.provider + "-" + entry.model_id.replace(/[/:]/g, "-");
+}
+var plugin = async () => ({
+  config: async (cfg) => {
+    cfg.agent ??= {};
+    for (const e of manifest.models) {
+      const name = subagentName(e);
+      cfg.agent[name] ??= {};
+      Object.assign(cfg.agent[name], {
+        description: `Visual judgment subagent (${e.name}). Consumes a visual-judgment-request.v1 JSON, analyzes images, emits a visual-judgment-report.v1 JSON. Not coupled to any screenshot tool or UI framework - works with any locally stored image.`,
+        mode: "subagent",
+        model: `${e.provider}/${e.model_id}`,
+        temperature: 0.1,
+        prompt: bodyTpl.replaceAll("{{model_name}}", e.name).replaceAll("{{provider}}", e.provider).replaceAll("{{model_id}}", e.model_id),
+        permission: PERMISSION
+      });
+    }
+  }
+});
+var plugin_default = { id: "vision", server: plugin };
+export {
+  plugin_default as default
+};
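The `config` hook in `dist/index.js` merges into an existing `cfg.agent[name]` entry instead of replacing it. The sketch below replays that merge in isolation to show which side wins. The one-entry model list, the short template string, and the user's `color` key are illustrative assumptions, not values shipped in the package, and the `description` and `permission` fields are left out for brevity.

```javascript
// Same naming rule as dist/index.js: "/" and ":" in the model id become "-".
function subagentName(entry) {
  return "vision-" + entry.provider + "-" + entry.model_id.replace(/[/:]/g, "-");
}

// Stand-alone replay of the loop inside the config hook (hypothetical helper name).
function registerAgents(cfg, models, bodyTpl) {
  cfg.agent ??= {};
  for (const e of models) {
    const name = subagentName(e);
    cfg.agent[name] ??= {};
    // Object.assign overwrites the listed keys and leaves unrelated keys alone.
    Object.assign(cfg.agent[name], {
      mode: "subagent",
      model: `${e.provider}/${e.model_id}`,
      temperature: 0.1,
      prompt: bodyTpl
        .replaceAll("{{model_name}}", e.name)
        .replaceAll("{{provider}}", e.provider)
        .replaceAll("{{model_id}}", e.model_id),
    });
  }
  return cfg;
}

// A user config that already overrides one agent (assumed shape).
const cfg = registerAgents(
  { agent: { "vision-openai-gpt-5.5": { temperature: 0.7, color: "#ff0000" } } },
  [{ provider: "openai", model_id: "gpt-5.5", name: "GPT-5.5" }],
  "Powered by {{model_name}} ({{provider}}/{{model_id}})."
);
const agent = cfg.agent["vision-openai-gpt-5.5"];
console.log(agent.temperature); // 0.1 (the plugin's value replaces the user's 0.7)
console.log(agent.color);       // #ff0000 (keys the plugin does not set survive)
console.log(agent.prompt);      // Powered by GPT-5.5 (openai/gpt-5.5).
```

Under these assumptions, a user cannot change `temperature`, `model`, or `prompt` for a registered vision subagent through a pre-existing config entry, because the plugin's values are assigned last.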
package/package.json
ADDED
@@ -0,0 +1,55 @@
+{
+  "name": "opencode-vision",
+  "version": "0.1.0",
+  "description": "Typed visual-judgment skill for opencode. Registers 10 vision subagents (one per top-tier vision model across OpenAI, Kimi for Coding, Ollama Cloud, and opencode-go) and a skill that teaches a text-only orchestrator to extract visual-judgment intent, classify it into a typed judgment, and delegate to a vision subagent with a versioned request/report contract.",
+  "type": "module",
+  "main": "./dist/index.js",
+  "exports": {
+    ".": {
+      "import": "./dist/index.js"
+    }
+  },
+  "files": [
+    "dist",
+    "SKILL.md",
+    "schemas",
+    "subagent-body.md",
+    "vision-models.json",
+    "README.md"
+  ],
+  "scripts": {
+    "prebuild": "rm -rf dist",
+    "build": "bun build ./plugin.ts --outfile ./dist/index.js --target node --format esm --packages external",
+    "prepublishOnly": "bun run build",
+    "typecheck": "tsc --noEmit"
+  },
+  "keywords": [
+    "opencode",
+    "opencode-plugin",
+    "opencode-ai",
+    "vision",
+    "visual-judgment",
+    "image-analysis",
+    "screenshot",
+    "subagent",
+    "skill"
+  ],
+  "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/WeZZard/skills.git",
+    "directory": "opencode/vision"
+  },
+  "homepage": "https://github.com/WeZZard/skills/tree/main/opencode/vision#readme",
+  "bugs": {
+    "url": "https://github.com/WeZZard/skills/issues"
+  },
+  "peerDependencies": {
+    "@opencode-ai/plugin": "^1.4.7"
+  },
+  "devDependencies": {
+    "@opencode-ai/plugin": "^1.4.7",
+    "@types/node": "^22.13.9",
+    "typescript": "^5.8.2"
+  }
+}
package/schemas/visual-judgment-report.v1.json
ADDED
@@ -0,0 +1,88 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json",
+  "title": "Visual Judgment Report v1",
+  "description": "Typed report envelope returned by a vision subagent after analyzing images against a visual-judgment-request.",
+  "type": "object",
+  "required": ["$schema", "id", "status"],
+  "additionalProperties": false,
+  "properties": {
+    "$schema": {
+      "const": "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json",
+      "description": "Schema version identifier."
+    },
+    "id": {
+      "type": "string",
+      "description": "Correlation id — MUST echo the request id."
+    },
+    "status": {
+      "enum": ["ok", "error", "insufficient-evidence"],
+      "description": "ok = analysis succeeded; error = could not analyze (image corrupt, model unavailable); insufficient-evidence = analyzed but cannot reach a verdict."
+    },
+    "verdict": {
+      "enum": ["pass", "fail", "inconclusive"],
+      "description": "pass = criterion met; fail = criterion not met; inconclusive = informational (diff/describe) or genuinely undeterminable."
+    },
+    "confidence": {
+      "type": "number",
+      "minimum": 0,
+      "maximum": 1,
+      "description": "0.0–1.0. How confident the subagent is in the verdict."
+    },
+    "observations": {
+      "type": "array",
+      "description": "Typed observations about what was seen in each image. Structure varies per judgment.type by convention.",
+      "items": {
+        "type": "object",
+        "required": ["imageLabel", "subject"],
+        "additionalProperties": false,
+        "properties": {
+          "imageLabel": { "type": "string", "description": "Label of the image this observation is about (matches request images[].label)." },
+          "subject": { "type": "string", "description": "What this observation describes, e.g. 'Submit button', 'footer text'." },
+          "properties": {
+            "type": "object",
+            "additionalProperties": true,
+            "description": "Type-specific findings, e.g. {found:true,position:'bottom-center'} for presence, {centerOffsetPx:42} for alignment, {knobPosition:'right',trackColor:'blue'} for state."
+          },
+          "note": { "type": "string", "description": "Free-form clarifying note." }
+        }
+      }
+    },
+    "diff": {
+      "type": "array",
+      "description": "Structured change list (populated for judgment.type=diff).",
+      "items": {
+        "type": "object",
+        "additionalProperties": false,
+        "properties": {
+          "from": { "type": "string", "description": "What was there in the baseline." },
+          "to": { "type": "string", "description": "What is there in the current." },
+          "description": { "type": "string", "description": "Human-readable description of the change." }
+        }
+      }
+    },
+    "reasoning": {
+      "type": "string",
+      "description": "One-paragraph justification linking observations to verdict."
+    },
+    "errors": {
+      "type": "array",
+      "description": "Populated when status=error. One entry per image that could not be analyzed.",
+      "items": {
+        "type": "object",
+        "required": ["imageLabel", "code"],
+        "additionalProperties": false,
+        "properties": {
+          "imageLabel": { "type": "string" },
+          "code": { "type": "string", "description": "e.g. 'file_not_found', 'unsupported_format', 'model_unavailable'." }
+        }
+      }
+    }
+  },
+  "allOf": [
+    {
+      "if": { "properties": { "status": { "const": "ok" } } },
+      "then": { "required": ["verdict", "confidence", "observations"] }
+    }
+  ]
+}
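The `allOf` clause in the report schema makes `verdict`, `confidence`, and `observations` mandatory only when `status` is `"ok"`. The hand-rolled sketch below illustrates that rule together with the enums and the confidence range. It is deliberately partial: it ignores `additionalProperties` and the nested item shapes, and `checkReport` is a hypothetical helper name. The id comparison comes from the `id` description ("MUST echo the request id"), which the schema itself cannot enforce. A full JSON Schema validator would be the safer choice in practice.

```javascript
const REPORT_SCHEMA =
  "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json";
const STATUSES = ["ok", "error", "insufficient-evidence"];
const VERDICTS = ["pass", "fail", "inconclusive"];

// Returns a list of human-readable problems; an empty list means the checks passed.
function checkReport(report, requestId) {
  const problems = [];
  for (const key of ["$schema", "id", "status"]) {
    if (!(key in report)) problems.push(`missing ${key}`);
  }
  if ("id" in report && report.id !== requestId) {
    problems.push("id does not echo the request id");
  }
  if ("status" in report && !STATUSES.includes(report.status)) {
    problems.push("unknown status");
  }
  // Mirrors the if/then pair: these keys are required only for status "ok".
  if (report.status === "ok") {
    for (const key of ["verdict", "confidence", "observations"]) {
      if (!(key in report)) problems.push(`status ok requires ${key}`);
    }
  }
  if ("verdict" in report && !VERDICTS.includes(report.verdict)) {
    problems.push("unknown verdict");
  }
  if ("confidence" in report) {
    const c = report.confidence;
    if (typeof c !== "number" || c < 0 || c > 1) problems.push("confidence out of range");
  }
  return problems;
}

const okButIncomplete = checkReport({ $schema: REPORT_SCHEMA, id: "req-1", status: "ok" }, "req-1");
const errorReport = checkReport(
  { $schema: REPORT_SCHEMA, id: "req-1", status: "error", errors: [{ imageLabel: "home", code: "file_not_found" }] },
  "req-1"
);
console.log(okButIncomplete.length); // 3
console.log(errorReport.length);     // 0
```

The second call shows why the requirement is conditional: an `error` report carries no verdict and still passes these checks.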
package/schemas/visual-judgment-request.v1.json
ADDED
@@ -0,0 +1,236 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json",
+  "title": "Visual Judgment Request v1",
+  "description": "Typed intent envelope emitted by the orchestrator and passed as the task prompt to a vision subagent.",
+  "type": "object",
+  "required": ["$schema", "id", "images", "judgment"],
+  "additionalProperties": false,
+  "properties": {
+    "$schema": {
+      "const": "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json",
+      "description": "Schema version identifier."
+    },
+    "id": {
+      "type": "string",
+      "description": "Correlation id (uuid recommended). The report must echo this id."
+    },
+    "preferredModel": {
+      "type": "string",
+      "pattern": "^.+/.+$",
+      "description": "Provider/model-id of the vision model the orchestrator selected (informational; the subagent is already bound to a model via its config)."
+    },
+    "images": {
+      "type": "array",
+      "minItems": 1,
+      "description": "One or more locally-stored image files to analyze.",
+      "items": {
+        "type": "object",
+        "required": ["path"],
+        "additionalProperties": false,
+        "properties": {
+          "path": {
+            "type": "string",
+            "description": "Absolute or project-relative path to the image file (PNG/JPEG/WebP)."
+          },
+          "label": {
+            "type": "string",
+            "description": "Short human-readable label for the image, used in observations[].imageLabel."
+          },
+          "role": {
+            "enum": ["baseline", "current", "reference"],
+            "description": "Role of this image in the judgment: baseline (before/reference), current (the thing under test), reference (design/target)."
+          }
+        }
+      }
+    },
+    "judgment": {
+      "description": "The typed judgment to perform. Exactly one type applies; its parameters are type-specific.",
+      "oneOf": [
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "presence" },
+            "parameters": {
+              "type": "object",
+              "required": ["subject", "expectation"],
+              "additionalProperties": false,
+              "properties": {
+                "subject": { "type": "string", "description": "The element/text/icon to find, e.g. 'Submit button'." },
+                "expectation": { "enum": ["present", "absent"], "description": "Whether the subject should be visible." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "absence" },
+            "parameters": {
+              "type": "object",
+              "required": ["subject", "expectation"],
+              "additionalProperties": false,
+              "properties": {
+                "subject": { "type": "string", "description": "The element that should NOT be visible." },
+                "expectation": { "enum": ["present", "absent"], "description": "Should be 'absent'." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "alignment" },
+            "parameters": {
+              "type": "object",
+              "required": ["subject", "axis"],
+              "additionalProperties": false,
+              "properties": {
+                "subject": { "type": "string" },
+                "axis": { "enum": ["horizontal", "vertical", "both"] },
+                "expectation": { "type": "string", "description": "e.g. 'centered', 'left-aligned', 'top'." },
+                "tolerance": { "enum": ["strict", "loose"], "description": "strict ≈ exact; loose ≈ within ~5%." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "ordering" },
+            "parameters": {
+              "type": "object",
+              "required": ["direction", "expected"],
+              "additionalProperties": false,
+              "properties": {
+                "direction": { "enum": ["ltr", "ttb"], "description": "Left-to-right or top-to-bottom." },
+                "expected": { "type": "array", "items": { "type": "string" }, "description": "Expected order of subjects." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "equality" },
+            "parameters": {
+              "type": "object",
+              "required": ["subjects"],
+              "additionalProperties": false,
+              "properties": {
+                "subjects": {
+                  "type": "array",
+                  "minItems": 2,
+                  "maxItems": 2,
+                  "items": { "type": "string" },
+                  "description": "Labels of the two images to compare."
+                },
+                "threshold": { "enum": ["exact", "perceptual"], "description": "exact = pixel-identical; perceptual = structurally same." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "layout" },
+            "parameters": {
+              "type": "object",
+              "required": ["expectations"],
+              "additionalProperties": false,
+              "properties": {
+                "expectations": { "type": "string", "description": "NL description of expected layout, e.g. 'sidebar on left, main content right, equal gaps'." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "readability" },
+            "parameters": {
+              "type": "object",
+              "required": ["subject"],
+              "additionalProperties": false,
+              "properties": {
+                "subject": { "type": "string", "description": "The text/region to check for legibility." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "state" },
+            "parameters": {
+              "type": "object",
+              "required": ["subject", "expectedState"],
+              "additionalProperties": false,
+              "properties": {
+                "subject": { "type": "string", "description": "The control, e.g. 'notifications toggle'." },
+                "expectedState": { "type": "string", "description": "e.g. 'on', 'off', 'checked', 'disabled'." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "diff" },
+            "parameters": {
+              "type": "object",
+              "required": ["baseline", "current"],
+              "additionalProperties": false,
+              "properties": {
+                "baseline": { "type": "string", "description": "Label of the before image." },
+                "current": { "type": "string", "description": "Label of the after image." }
+              }
+            }
+          }
+        },
+        {
+          "type": "object",
+          "required": ["type", "parameters"],
+          "additionalProperties": false,
+          "properties": {
+            "type": { "const": "describe" },
+            "parameters": {
+              "type": "object",
+              "required": ["focus"],
+              "additionalProperties": false,
+              "properties": {
+                "focus": { "type": "string", "description": "What to describe, e.g. 'overall layout and primary UI elements'." }
+              }
+            }
+          }
+        }
+      ]
+    },
+    "criteria": {
+      "type": "string",
+      "description": "Natural-language fallback for nuance the typed parameters cannot capture. Free-form guidance for the subagent."
+    },
+    "responseContract": {
+      "type": "string",
+      "description": "What the caller wants back, beyond the fixed report envelope. e.g. 'Return pass/fail and note the button position if found.'"
+    }
+  }
+}
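As an illustration of the envelope the request schema describes, the sketch below assembles a `diff` request and checks that the image labels named in `judgment.parameters` actually appear in `images[]`. The schema does not express that cross-reference, so the check is an extra safeguard assumed here, not something the package is known to do. `buildRequest`, `referencedLabels`, the id, and the file paths are all hypothetical.

```javascript
const REQUEST_SCHEMA =
  "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json";

// Only diff and equality refer to images by label; other types name on-screen subjects.
function referencedLabels(judgment) {
  if (judgment.type === "diff") {
    return [judgment.parameters.baseline, judgment.parameters.current];
  }
  if (judgment.type === "equality") {
    return judgment.parameters.subjects;
  }
  return [];
}

function buildRequest(id, images, judgment) {
  if (images.length < 1) {
    throw new Error("images must contain at least one entry"); // minItems: 1
  }
  const labels = new Set(images.map((img) => img.label));
  for (const ref of referencedLabels(judgment)) {
    if (!labels.has(ref)) {
      throw new Error(`judgment refers to unknown image label: ${ref}`);
    }
  }
  return { $schema: REQUEST_SCHEMA, id, images, judgment };
}

const request = buildRequest(
  "req-001",
  [
    { path: "/private/tmp/before.png", label: "before", role: "baseline" },
    { path: "/private/tmp/after.png", label: "after", role: "current" },
  ],
  { type: "diff", parameters: { baseline: "before", current: "after" } }
);
console.log(JSON.stringify(request.judgment));
// {"type":"diff","parameters":{"baseline":"before","current":"after"}}
```

Taking the id as an argument keeps the helper deterministic; the schema only recommends a UUID and does not require one.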
package/subagent-body.md
ADDED
@@ -0,0 +1,45 @@
+You are a visual judgment subagent powered by {{model_name}}
+({{provider}}/{{model_id}}).
+
+## Input
+
+You receive a `visual-judgment-request.v1` JSON object as your prompt.
+Read it, then read each image file listed in `images[].path` using the
+`read` tool. Analyze them against `judgment.type` and `judgment.parameters`.
+
+The request schema lives at:
+https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-request.v1.json
+
+## Output
+
+Emit a `visual-judgment-report.v1` JSON object — nothing else. No prose,
+no markdown fences, no commentary. The envelope is fixed:
+
+- `$schema`: "https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json"
+- `id`: echo the request id
+- `status`: "ok" | "error" | "insufficient-evidence"
+- `verdict`: "pass" | "fail" | "inconclusive" (only when status="ok")
+- `confidence`: 0.0-1.0
+- `observations[]`: typed per `judgment.type`
+- `diff[]`: structured change list (for judgment.type="diff")
+- `reasoning`: one-paragraph justification linking observations to verdict
+- `errors[]`: if any image could not be analyzed
+
+The report schema lives at:
+https://raw.githubusercontent.com/WeZZard/skills/main/opencode/vision/schemas/visual-judgment-report.v1.json
+
+## Rules
+
+- Report what you actually observe. Do not guess.
+- Be specific: positions, colors, sizes, alignment, visibility, ordering.
+- If a subject described in the request is not visible, say so in
+  `observations[].note`.
+- If you cannot analyze an image (corrupted, wrong format, file not found),
+  set `status: "error"` with an `errors[]` entry (code e.g. "file_not_found",
+  "unsupported_format").
+- For `diff` and `describe` judgments, set `verdict: "inconclusive"` —
+  these are informational, not pass/fail.
+- Validate your output against the report schema URL (best-effort if the
+  fetch fails — emit the envelope correctly regardless).
+- You MUST NOT spawn subagents. You are a leaf in the execution tree.
+- You MUST NOT run the graph engine or any orchestrator-only command.
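On the orchestrator side, the rules in this prompt suggest a defensive parse: the subagent is told to emit bare JSON, but a model may still wrap its reply in a Markdown fence. The sketch below assumes the reply arrives as a plain string. `parseReport` and `summarize` are hypothetical helpers, not part of the package, and the sample reply is invented for illustration.

```javascript
const FENCE = "`".repeat(3); // built at runtime so the sample stays fence-safe

// Parse a subagent reply, tolerating one accidental fence around the JSON.
function parseReport(raw) {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence line (with any language tag), then the closing fence.
    text = text.slice(text.indexOf("\n") + 1).trimEnd();
    if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  }
  return JSON.parse(text);
}

// diff and describe are informational, so their "inconclusive" verdict is not a failure.
function summarize(report, judgmentType) {
  if (report.status !== "ok") return `no verdict (${report.status})`;
  if (judgmentType === "diff" || judgmentType === "describe") {
    const items = report.diff ?? report.observations;
    return `informational: ${items.length} item(s)`;
  }
  return `${report.verdict} at confidence ${report.confidence}`;
}

const reply =
  FENCE + "json\n" +
  JSON.stringify({
    id: "req-001",
    status: "ok",
    verdict: "inconclusive",
    confidence: 0.8,
    observations: [],
    diff: [{ description: "Submit button moved below the form" }],
  }) +
  "\n" + FENCE;

const report = parseReport(reply);
console.log(summarize(report, "diff"));                 // informational: 1 item(s)
console.log(summarize({ status: "error" }, "presence")); // no verdict (error)
```

Branching on `status` before `verdict` matters because `verdict` is only guaranteed to be present when `status` is `"ok"`.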
package/vision-models.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "_comment": "Vision subagent manifest. Curation rule: one top-tier model per provider x vendor family; drop non-reasoning, drop superseded within a provider, drop coding-specialized, drop Pro/billing variants of the same family; keep cross-provider duplicates (same model via different route = different billing/rate/latency). Add a model = one line here + restart opencode.",
+  "models": [
+    { "provider": "openai", "model_id": "gpt-5.5", "name": "GPT-5.5" },
+    { "provider": "kimi-for-coding", "model_id": "k2p7", "name": "Kimi K2.7 Code" },
+    { "provider": "ollama-cloud", "model_id": "gemini-3-flash-preview", "name": "Gemini 3 Flash" },
+    { "provider": "ollama-cloud", "model_id": "gemma4:31b", "name": "Gemma 4 31B" },
+    { "provider": "ollama-cloud", "model_id": "minimax-m3", "name": "MiniMax M3" },
+    { "provider": "ollama-cloud", "model_id": "qwen3.5:397b", "name": "Qwen 3.5 397B" },
+    { "provider": "opencode-go", "model_id": "kimi-k2.7-code", "name": "Kimi K2.7 Code" },
+    { "provider": "opencode-go", "model_id": "minimax-m3", "name": "MiniMax M3" },
+    { "provider": "opencode-go", "model_id": "qwen3.7-plus", "name": "Qwen 3.7 Plus" },
+    { "provider": "opencode-go", "model_id": "mimo-v2.5", "name": "MiMo V2.5" }
+  ]
+}
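Because the manifest keeps cross-provider duplicates (for example `minimax-m3` on two providers), the generated agent names stay distinct only because the provider is part of the name. The sketch below applies the naming rule from `dist/index.js` to the provider and model id pairs listed in the manifest and confirms that the ten names are unique. The array layout is a convenience for this example, not the manifest's own format.

```javascript
// provider / model_id pairs copied from the manifest's "models" array.
const models = [
  ["openai", "gpt-5.5"],
  ["kimi-for-coding", "k2p7"],
  ["ollama-cloud", "gemini-3-flash-preview"],
  ["ollama-cloud", "gemma4:31b"],
  ["ollama-cloud", "minimax-m3"],
  ["ollama-cloud", "qwen3.5:397b"],
  ["opencode-go", "kimi-k2.7-code"],
  ["opencode-go", "minimax-m3"],
  ["opencode-go", "qwen3.7-plus"],
  ["opencode-go", "mimo-v2.5"],
];

// Same rule as subagentName(): "/" and ":" become "-", dots are left as they are.
const names = models.map(
  ([provider, modelId]) => "vision-" + provider + "-" + modelId.replace(/[/:]/g, "-")
);

console.log(names.length, new Set(names).size); // 10 10
console.log(names[3]);                          // vision-ollama-cloud-gemma4-31b
console.log(names[4], names[7]);
// vision-ollama-cloud-minimax-m3 vision-opencode-go-minimax-m3
```

A new manifest entry whose provider and sanitized model id collide with an existing pair would silently share one agent entry, so a uniqueness check like this one may be worth running before adding a line.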