@glrs-dev/harness-plugin-opencode 2.0.1 → 2.2.0

Files changed (44)
  1. package/CHANGELOG.md +72 -0
  2. package/README.md +39 -104
  3. package/dist/agents/prompts/build.md +18 -4
  4. package/dist/agents/prompts/build.open.md +18 -4
  5. package/dist/agents/prompts/{qa-thorough.md → code-reviewer-thorough.md} +34 -19
  6. package/dist/agents/prompts/code-reviewer.md +80 -0
  7. package/dist/agents/prompts/code-reviewer.open.md +68 -0
  8. package/dist/agents/prompts/gap-analyzer.md +2 -0
  9. package/dist/agents/prompts/plan-reviewer.md +3 -0
  10. package/dist/agents/prompts/plan.md +23 -4
  11. package/dist/agents/prompts/prime.md +146 -87
  12. package/dist/agents/prompts/research-auto.md +1 -1
  13. package/dist/agents/prompts/research-local.md +1 -1
  14. package/dist/agents/prompts/research-web.md +1 -1
  15. package/dist/agents/prompts/research.md +2 -0
  16. package/dist/agents/prompts/spec-reviewer.md +54 -0
  17. package/dist/agents/prompts/spec-reviewer.open.md +57 -0
  18. package/dist/agents/shared/index.ts +1 -0
  19. package/dist/agents/shared/ui-evaluation-ladder.md +50 -0
  20. package/dist/agents/shared/workflow-mechanics.md +5 -5
  21. package/dist/autopilot/prompt-template.md +80 -0
  22. package/dist/{chunk-VJUETC6A.js → chunk-PDMXYZM4.js} +53 -1
  23. package/dist/cli.js +1333 -1646
  24. package/dist/commands/prompts/fresh.md +27 -24
  25. package/dist/commands/prompts/review.md +3 -3
  26. package/dist/commands/prompts/ship.md +2 -0
  27. package/dist/index.js +106 -627
  28. package/dist/skills/adversarial-review-rubric/SKILL.md +47 -0
  29. package/dist/skills/code-quality/SKILL.md +1 -1
  30. package/dist/skills/root-cause-diagnosis/SKILL.md +24 -0
  31. package/dist/skills/spear-protocol/SKILL.md +166 -0
  32. package/package.json +1 -1
  33. package/dist/agents/prompts/pilot-assessor.md +0 -77
  34. package/dist/agents/prompts/pilot-builder.md +0 -40
  35. package/dist/agents/prompts/pilot-planner.md +0 -56
  36. package/dist/agents/prompts/pilot-scoper.md +0 -58
  37. package/dist/agents/prompts/qa-reviewer.md +0 -68
  38. package/dist/agents/prompts/qa-reviewer.open.md +0 -58
  39. package/dist/chunk-6CZPRUMJ.js +0 -869
  40. package/dist/chunk-DZG4D3OH.js +0 -54
  41. package/dist/chunk-OYRKOEXK.js +0 -88
  42. package/dist/commands/prompts/autopilot.md +0 -96
  43. package/dist/install-6775ZBDG.js +0 -13
  44. package/dist/paths-WZ23ZQOV.js +0 -18
package/CHANGELOG.md CHANGED
@@ -1,5 +1,77 @@
1
1
  # Changelog
2
2
 
3
+ ## 2.2.0
4
+
5
+ ### Minor Changes
6
+
7
+ - [#58](https://github.com/iceglober/glrs/pull/58) [`2720440`](https://github.com/iceglober/glrs/commit/2720440e76ed76f95a59b77525cb140bd673d669) Thanks [@iceglober](https://github.com/iceglober)! - Autopilot rewrite, pilot rip-out, Tier 1 visual capabilities, opencode-snip toggle, research-variant hiding.
8
+
9
+ **Breaking changes:**
10
+
11
+ - **Pilot subsystem removed.** The `glrs oc pilot` CLI subcommand, the four pilot agents (`pilot-scoper` / `planner` / `builder` / `assessor`), the pilot-planning skill references, the `pilot-plugin.ts` runtime enforcer, and all pilot state/docs are gone. Users on pilot should migrate to the CLI autopilot or plain PRIME workflow.
12
+ - **TUI `/autopilot` slash command removed.** Autopilot is now CLI-only: `glrs oc autopilot "<prompt>"`. Users who want autonomous looping run the CLI in any terminal; the TUI stays for interactive work.
13
+ - **Research-variant agents (`research-web`, `research-local`, `research-auto`) hidden from the primary-agent picker.** They now run only as subagents dispatched by `@research`. Users who previously selected them directly should select `@research` instead.
14
+
15
+ **New features:**
16
+
17
+ - **CLI autopilot (`glrs oc autopilot "<prompt>"`)** — Ralph-loop engine: sends your prompt each iteration, watches the agent's response for `<autopilot-done>` sentinel, retries the same prompt when absent. Budgets: 50 iterations / 4h / 3 zero-progress iterations / kill-switch file. Supports single-issue (`"ship ENG-1234"`) and multi-issue (`"ship every open ENG-* issue in project ROADMAP"`) prompts.
18
+ - **opencode-snip installer toggle** — new "Plugin add-ons" section in `glrs oc install` (parallel to existing MCP toggles). Opt-in adds `opencode-snip` to the user's `plugin` array via config-merge, no vendored code. Useful for token reduction on bash-heavy sessions. Requires the Go `snip` binary separately.
19
+ - **Tier 1 visual capabilities** — `@plan`, `@research`, `@gap-analyzer` now have Playwright MCP access (joining `@prime`, `@build`, `@assessor`, `@assessor-thorough`, `@plan-reviewer`). Enable via the installer's Playwright toggle.
20
+ - **UI evaluation ladder (graceful degradation)** — all visual-capable agents now carry a four-tier capability ladder (Playwright → curl → webfetch → source inspection). When Playwright is unavailable, agents fall through to the next tier and report which method they used. No hard failure on Playwright absence.
21
+
22
+ **Internal:**
23
+
24
+ - Server lifecycle helpers (`startServer` / `createSession` / `sendAndWait` / `getLastAssistantMessage`) moved from `src/pilot/server.ts` to `src/lib/opencode-server.ts` (consumed by the CLI autopilot).
25
+ - Agent roster reduced from 20 → 16. Net −5,308 lines across 91 files. Test count 536 → 462 (pilot tests removed, visual-capability tests added).
26
+
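The CLI autopilot entry above describes a loop that resends the same prompt until a sentinel appears or a budget runs out. A minimal sketch of that control flow follows, under stated assumptions: `runLoop`, `LoopDeps`, and `madeProgress` are hypothetical names, and how the real engine judges "zero progress" is not stated in the changelog, so it is left to an injected callback here. This is not the package's implementation.

```typescript
// Hypothetical sketch of a sentinel-driven retry loop with budgets.
// Budget values come from the changelog; everything else is illustrative.
type StopReason = "done" | "max-iterations" | "timeout" | "no-progress" | "kill-switch";

interface LoopDeps {
  send: (prompt: string) => string;          // stand-in for sending a prompt and reading the reply
  madeProgress: (reply: string) => boolean;  // assumption: the progress test is caller-defined
  killSwitchPresent: () => boolean;          // e.g. a check for a kill-switch file
  now: () => number;                         // millisecond clock, injectable for testing
}

interface Budgets {
  maxIterations: number;
  maxMillis: number;
  maxZeroProgress: number;
}

const DEFAULT_BUDGETS: Budgets = {
  maxIterations: 50,
  maxMillis: 4 * 60 * 60 * 1000, // 4h
  maxZeroProgress: 3,
};

const SENTINEL = "<autopilot-done>";

function runLoop(
  prompt: string,
  deps: LoopDeps,
  budgets: Budgets = DEFAULT_BUDGETS,
): { reason: StopReason; iterations: number } {
  const start = deps.now();
  let stalled = 0;
  for (let i = 1; i <= budgets.maxIterations; i++) {
    // Budgets are checked before each send so a stop request takes effect promptly.
    if (deps.killSwitchPresent()) return { reason: "kill-switch", iterations: i - 1 };
    if (deps.now() - start >= budgets.maxMillis) return { reason: "timeout", iterations: i - 1 };

    const reply = deps.send(prompt); // the same prompt is sent every iteration
    if (reply.includes(SENTINEL)) return { reason: "done", iterations: i };

    stalled = deps.madeProgress(reply) ? 0 : stalled + 1;
    if (stalled >= budgets.maxZeroProgress) return { reason: "no-progress", iterations: i };
  }
  return { reason: "max-iterations", iterations: budgets.maxIterations };
}
```

Injecting the clock and the kill-switch check keeps the loop testable without touching the filesystem or waiting on real time.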
27
+ - [#55](https://github.com/iceglober/glrs/pull/55) [`8099c49`](https://github.com/iceglober/glrs/commit/8099c498fa6a9c05c8880bfd09cb2c4fd7d1721c) Thanks [@iceglober](https://github.com/iceglober)! - Rename PRIME arc phases to SPEAR model (Scope → Plan → Execute → Assess → Resolve). Rename @qa-reviewer → @assessor, @qa-thorough → @assessor-thorough. Resolve stage auto-ships (pushes branch, opens PR) — /ship becomes a resume path for interrupted sessions.
28
+
29
+ - [#57](https://github.com/iceglober/glrs/pull/57) [`6212c48`](https://github.com/iceglober/glrs/commit/6212c483efa2cc8f0407bc6a0d8c23110498eb21) Thanks [@iceglober](https://github.com/iceglober)! - Restructure the SPEAR protocol (PRIME's five-stage arc) across four areas: Assess quality, failure discipline, skill modularity, and agent-contract hygiene.
30
+
31
+ **Breaking changes** (match the prior `@assessor` rename's hard-break pattern):
32
+
33
+ - `@assessor` is replaced by `@spec-reviewer` (first pass, returns `[PASS_SPEC]` or `[FAIL_SPEC]`) and `@code-reviewer` (second pass, runs only on PASS_SPEC, returns `[PASS]` / `[LOOP-TO-PLAN]` / `[FIX-INLINE]`). User configs referencing `@assessor` by name will fail to resolve — update to the appropriate replacement.
34
+ - `@assessor-thorough` is renamed to `@code-reviewer-thorough` (same role: opus-tier backstop for high-risk diffs that re-runs the full suite unconditionally).
35
+ - Registered agent count: 20 → 21.
36
+
37
+ **Assess rigor (two-stage review + MECE rubric):**
38
+
39
+ - Every Assess cycle now dispatches two subagents sequentially instead of one, roughly doubling the subagent calls per review cycle. The spec pass is cheaper; the code-quality pass runs only if spec passed.
40
+ - Assess delegations carry a five-dimension MECE rubric (Correctness, Completeness, Consistency, Safety, Scope) and a progressive-strictness signal (Level 1/2/3) that tightens across Assess iterations.
41
+ - PRs with red CI (typecheck, lint, or tests failing) now fail Assess regardless of whether the failure appears pre-existing. "Pre-existing" claims require three-part evidence: a specific commit SHA, `git log` output showing the failure pre-dates the branch, and merge-base reproduction. Claims without all three are auto-rejected.
42
+
43
+ **Failure discipline (no-defer policy):**
44
+
45
+ - The hard rule that allowed logging pre-existing failures to a plan's `## Open questions` section and deferring them is removed.
46
+ - `@build` now runs a mandatory root-cause diagnosis protocol on any unexpected test/lint/typecheck failure: merge-base reproduction, `git blame`, rationalization table countering common excuse patterns ("likely pre-existing", "unrelated to my change", etc.).
47
+ - If fixing a failure would require touching more than ~5 files outside the plan's `## File-level changes`, `@build` STOPs with a reorganization proposal for PRIME to present to the user — there is no autonomous deferral path.
48
+
49
+ **TDD enforcement:**
50
+
51
+ - For any plan with a `## Test plan` entry or a `tests:` field in the acceptance-criteria fence, `@build` now enforces TDD order: write the test first, verify it fails, then implement. Tests in a just-written RED state are explicitly carved out of the failure-diagnosis protocol — they're expected failures, not unexpected ones.
52
+
53
+ **New bundled skills:**
54
+
55
+ - `spear-protocol` — the full SPEAR stage logic (Bootstrap, Scope, Plan, Execute, Assess, Resolve). Loaded by PRIME at session start. Inline fallback retained in `prime.md` in case skill-loading is unavailable.
56
+ - `root-cause-diagnosis` — the failure-diagnosis protocol + rationalization table. Loaded by `@build` and its strict-executor variant on unexpected failures.
57
+ - `adversarial-review-rubric` — the MECE rubric, progressive strictness levels, Red-CI-blocks-merge rule, and three-part evidence test. Loaded by all Assess-layer agents before reviewing.
58
+
59
+ **Agent-contract changes:**
60
+
61
+ - `@build` gains a four-status return protocol: DONE / DONE_WITH_CONCERNS / NEEDS_CONTEXT / BLOCKED.
62
+ - `@build` now reports guidance deviations (item (e) of its return payload) when PRIME's Execute-prompt guidance permits multiple readings and `@build` picked one. Same "silence is not acceptable" bar as plan-file mutations.
63
+ - PRIME runs a pre-dispatch consistency check before every `@build` dispatch: re-read the Execute prompt against the plan and against any already-drafted follow-up prompts. Contradictions caught pre-dispatch avoid the downstream blame-misattribution pattern where faithful agent execution gets narrated as deviation.
64
+ - `@plan` bans placeholder phrases (TBD, TODO, "implement later", etc.) and runs a self-review checklist (spec coverage, placeholder scan, type/name consistency) before handing to `@plan-reviewer`.
65
+ - `@build`'s prompt is trimmed of orchestration context per the Minimal Contract principle (subagents perform worse when carrying parent-level workflow philosophy).
66
+
67
+ **Other refinements:**
68
+
69
+ - PRIME's Scope grounding dispatches parallel `@code-searcher` calls in a single message when grounding touches 3+ independent subsystems.
70
+ - PRIME's Plan stage detects multi-subsystem requests (3+ independent subsystems with no shared interface) and asks whether to split into separate plans.
71
+ - Delegation prompts apply the Minimal Contract minimality test: remove any sentence that doesn't help the subagent produce a better result. Non-goals prefer positive-instruction form ("Only modify files listed above") over negative lists when the positive form is shorter.
72
+
73
+ ## 2.1.0
74
+
3
75
  ## 2.0.1
4
76
 
5
77
  ### Patch Changes
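The two-stage Assess flow in the 2.2.0 notes above (a spec pass that returns `[PASS_SPEC]` or `[FAIL_SPEC]`, then a code pass that runs only on `PASS_SPEC`) can be sketched as a small gate. The verdict tags are taken from the changelog; the function names, the idea of parsing tags out of a reply string, and the error on a missing tag are assumptions made for illustration only.

```typescript
// Hypothetical sketch of the spec-then-code review gate; not the plugin's code.
const SPEC_TAGS = ["PASS_SPEC", "FAIL_SPEC"] as const;
const CODE_TAGS = ["PASS", "LOOP-TO-PLAN", "FIX-INLINE"] as const;

type SpecVerdict = (typeof SPEC_TAGS)[number];
type CodeVerdict = (typeof CODE_TAGS)[number];
type AssessOutcome = "FAIL_SPEC" | CodeVerdict;

// Finds the first allowed tag that appears in the reply as "[TAG]".
// Matching includes the closing bracket, so "[PASS]" does not match "[PASS_SPEC]".
function requireTag<T extends string>(reply: string, allowed: readonly T[]): T {
  const found = allowed.find((tag) => reply.includes(`[${tag}]`));
  if (found === undefined) {
    // Assumption: a reply without a verdict tag is treated as an error.
    throw new Error(`no verdict tag in reply; expected one of ${allowed.join(", ")}`);
  }
  return found;
}

function assess(runSpecReviewer: () => string, runCodeReviewer: () => string): AssessOutcome {
  const spec: SpecVerdict = requireTag(runSpecReviewer(), SPEC_TAGS);
  if (spec === "FAIL_SPEC") return "FAIL_SPEC"; // the code pass is skipped entirely
  return requireTag(runCodeReviewer(), CODE_TAGS);
}
```

Passing the reviewers in as callbacks makes the short-circuit visible: the second callback is never invoked when the spec pass fails.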
package/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # @glrs-dev/harness-plugin-opencode
2
2
 
3
- Opinionated agent harness for [OpenCode](https://opencode.ai). Agents, tools, slash commands, and an unattended pilot mode — one package.
3
+ Opinionated agent harness for [OpenCode](https://opencode.ai). Agents, tools, slash commands, and an unattended autopilot loop — one package.
4
4
 
5
5
  ## Quick start
6
6
 
@@ -21,7 +21,7 @@ bunx @glrs-dev/harness-plugin-opencode install
21
21
  opencode
22
22
  ```
23
23
 
24
- No global install. All [plugin features](#what-the-plugin-provides) load automatically. You won't have the `glrs-oc` CLI, but pilot commands will offer to install the plugin if you add the CLI later.
24
+ No global install. All [plugin features](#what-the-plugin-provides) load automatically. You won't have the `glrs-oc` CLI, but you can add it later.
25
25
 
26
26
  ### Verifying the published tarball
27
27
 
@@ -43,18 +43,18 @@ Open OpenCode in any repo. The `prime` agent handles everything end-to-end.
43
43
  ```
44
44
  /fresh ENG-1234
45
45
  ```
46
- Wipes the worktree, creates a branch from the ticket ref, and begins the five-phase workflow: understand → plan → execute → verify → handoff.
46
+ Wipes the worktree, creates a branch from the ticket ref, and begins the SPEAR workflow: scope → plan → execute → assess → resolve.
47
47
 
48
48
  **Start a task from a description:**
49
49
  ```
50
50
  /fresh add rate limiting to the upload endpoint
51
51
  ```
52
52
 
53
- **Go hands-off after the plan looks good:**
53
+ **Go hands-off with the Ralph loop (CLI, lights-out):**
54
54
  ```
55
- /autopilot ENG-1234
55
+ glrs oc autopilot "ship ENG-1234"
56
56
  ```
57
- Runs the full workflow unattended. Stops when all acceptance criteria are checked off. You review, then `/ship`.
57
+ Runs PRIME in a loop: sends your prompt each iteration, watches for `<autopilot-done>` in the response, exits when the sentinel appears or a budget is hit (50 iterations / 4h / 3 zero-progress iterations / kill-switch at `.agent/autopilot-disable`). Works with multi-issue prompts too: `glrs oc autopilot "ship every open issue in Linear project ENG-ROADMAP until the project is done"`. There is no TUI slash command — if you're in the TUI and don't want the loop, just type the task normally.
58
58
 
59
59
  **Ship when done:**
60
60
  ```
@@ -66,7 +66,7 @@ Squashes commits, pushes, opens a PR with the plan as the body.
66
66
  ```
67
67
  /review 87
68
68
  ```
69
- Read-only adversarial review. Fetches the diff, runs typecheck/lint, delegates to `@qa-reviewer`, outputs a structured verdict.
69
+ Read-only adversarial review. Fetches the diff, runs typecheck/lint, delegates to `@assessor`, outputs a structured verdict.
70
70
 
71
71
  **Deep codebase research:**
72
72
  ```
@@ -74,41 +74,21 @@ Read-only adversarial review. Fetches the diff, runs typecheck/lint, delegates t
74
74
  ```
75
75
  Spawns parallel subagents, synthesizes findings with exact file:line references.
76
76
 
77
- ### Autonomous (pilot CLI)
78
-
79
- For larger work that benefits from structured scoping and autonomous execution with self-assessment.
80
-
81
- ```bash
82
- # Scope interactively — spawns OpenCode TUI with the pilot-scoper agent
83
- glrs-oc pilot scope "Refactor the billing module into separate services"
84
-
85
- # Execute autonomously — Plan → Execute → Assess → Resolve (SPEAR loop)
86
- glrs-oc pilot go
87
-
88
- # Configure models and verify commands for this repo
89
- glrs-oc pilot configure
90
-
91
- # Check workflow status
92
- glrs-oc pilot status
93
- ```
94
-
95
- See [Pilot mode](#pilot-mode) for the full command reference.
96
-
97
77
  ---
98
78
 
99
79
  ## What the plugin provides
100
80
 
101
- 14 agents, 7 slash commands, 5 tools, 5 MCPs, 5 skill bundles, 4 sub-plugins. Details below.
81
+ 16 agents, 7 slash commands, 5 tools, 5 MCPs, 11 skill bundles, 3 sub-plugins. Details below.
102
82
 
103
83
  ### Agents
104
84
 
105
85
  | Agent | Tier | Role |
106
86
  |-------|------|------|
107
- | `prime` | deep | Five-phase end-to-end workflow (default agent) |
87
+ | `prime` | deep | SPEAR end-to-end workflow (default agent) |
108
88
  | `plan` | deep | Interactive planner with gap analysis and adversarial review |
109
89
  | `build` | mid | Plan executor |
110
- | `qa-reviewer` | mid | Fast adversarial code review |
111
- | `qa-thorough` | deep | Full-suite adversarial review |
90
+ | `assessor` | mid | Fast adversarial code review |
91
+ | `assessor-thorough` | deep | Full-suite adversarial review |
112
92
  | `plan-reviewer` | deep | Adversarial plan review |
113
93
  | `gap-analyzer` | deep | Identifies gaps in plans |
114
94
  | `architecture-advisor` | deep | Architecture guidance |
@@ -116,8 +96,8 @@ See [Pilot mode](#pilot-mode) for the full command reference.
116
96
  | `docs-maintainer` | mid | Documentation updates |
117
97
  | `lib-reader` | mid | Library/dependency reader |
118
98
  | `agents-md-writer` | mid | AGENTS.md generation |
119
- | `pilot-builder` | mid | Unattended task executor (pilot subsystem) |
120
- | `pilot-planner` | deep | Decomposes work into pilot.yaml DAGs |
99
+ | `research` | deep | Multi-workstream research orchestrator |
100
+ | `research-web` / `research-local` / `research-auto` | deep | Research subagents (dispatched by `@research`) |
121
101
 
122
102
  Tiers: **deep** = opus-class, **mid** = sonnet-class, **fast** = haiku-class. Override with [`harness.models`](#model-overrides).
123
103
 
@@ -126,13 +106,14 @@ Tiers: **deep** = opus-class, **mid** = sonnet-class, **fast** = haiku-class. Ov
126
106
  | Command | What it does |
127
107
  |---------|-------------|
128
108
  | `/fresh <ref>` | Wipe worktree, branch from ticket or description, start PRIME |
129
- | `/autopilot <ref>` | Hands-off PRIME run; stops when acceptance criteria pass |
130
109
  | `/ship <plan>` | Squash, push, open PR |
131
110
  | `/review <target>` | Read-only adversarial review (PR#, SHA, branch, or file) |
132
111
  | `/research <topic>` | Parallel codebase exploration with file:line citations |
133
112
  | `/init-deep` | Generate hierarchical AGENTS.md files |
134
113
  | `/costs` | Show running LLM spend totals |
135
114
 
115
+ Autopilot is CLI-only: `glrs oc autopilot "<prompt>"` (see above).
116
+
136
117
  ### Tools
137
118
 
138
119
  `ast_grep` · `tsc_check` · `eslint_check` · `todo_scan` · `comment_check`
@@ -149,94 +130,48 @@ Tiers: **deep** = opus-class, **mid** = sonnet-class, **fast** = haiku-class. Ov
149
130
 
150
131
  ### Sub-plugins
151
132
 
152
- - **autopilot** — idle-nudge loop driver (only activates via `/autopilot`)
153
133
  - **notify** — OS notifications when the agent asks a question
154
134
  - **cost-tracker** — LLM spend by provider/model at `~/.glorious/opencode/costs.json`
155
- - **pilot-plugin** — runtime invariant enforcement for pilot agents
135
+ - **tool-hooks** — post-edit verification loop (tsc, eslint) + output backpressure
156
136
 
157
137
  ### Skills
158
138
 
159
- `review-plan` · `web-design-guidelines` · `vercel-react-best-practices` · `vercel-composition-patterns` · `pilot-planning`
139
+ `adr` · `agent-estimation` · `code-quality` · `research` · `research-auto` · `research-local` · `research-web` · `review-plan` · `vercel-composition-patterns` · `vercel-react-best-practices` · `web-design-guidelines`
160
140
 
161
141
  ---
162
142
 
163
- ## Pilot mode
164
-
165
- Autonomous code execution using the SPEAR loop (Scope → Plan → Execute → Assess → Resolve). The user scopes interactively, then `pilot go` runs the rest autonomously with self-assessment and deployment-risk reflection.
166
-
167
- **Prerequisites:** `git` >= 2.5, `opencode` on PATH. Plugin must be installed (auto-prompted if missing).
168
-
169
- ### Commands
143
+ ## Enabling visual UI capabilities
170
144
 
171
- | Command | Description |
172
- |---------|-------------|
173
- | `glrs-oc pilot scope "<goal>"` | Interactive scoping session. Produces `scope.json` with framing + acceptance criteria. |
174
- | `glrs-oc pilot go` | Autonomous execution. Reads scope, runs Plan → Execute → Assess → Resolve. |
175
- | `glrs-oc pilot configure` | Interactive per-phase model selection, verify commands, assess cycles, Playwright toggle. |
176
- | `glrs-oc pilot status` | Workflow status from SQLite. `--workflow <id>`, `--json`. |
177
-
178
- ### SPEAR loop
179
-
180
- 1. **Scope** (interactive) — scoper agent interviews you, explores the codebase, produces acceptance criteria.
181
- 2. **Plan** (autonomous) — planner agent decomposes ACs into an ordered task list.
182
- 3. **Execute** (autonomous) — builder agent runs one task at a time, commits on verify pass.
183
- 4. **Assess** (autonomous) — assessor evaluates ACs + asks deployment-risk questions (what could break? unexpected consequences? what could go wrong?). If fail → re-plan the gap → re-execute → re-assess (bounded by `max_assess_cycles`).
184
- 5. **Resolve** (autonomous) — final summary with acknowledged risks.
145
+ The `@plan`, `@research`, `@gap-analyzer`, `@prime`, `@build`, `@assessor`, `@assessor-thorough`, and `@plan-reviewer` agents can verify web UIs, rendered output, and visual components when Playwright is available.
185
146
 
186
- ### State storage
187
-
188
- ```
189
- ~/.glorious/opencode/<repo>/pilot/
190
- state.sqlite # workflows + events
191
- current-scope.json # pointer to active scope
192
- scopes/<workflowId>/
193
- scope.json # framing + acceptance criteria
194
- plan.json # task list
195
- assessment-cycle-N.json # assessment reports
196
- ```
147
+ ### Enable Playwright MCP
197
148
 
198
- Repo identity derived from `git rev-parse --git-common-dir`; worktrees of the same repo share state. Override with `$GLORIOUS_PILOT_DIR`.
199
-
200
- ### Configuration
201
-
202
- Config lives at `.glrs/pilot.json` in your repo (not per-plan YAML):
149
+ During `glrs-oc install-plugin`, select **Playwright browser automation + visual UI verification (requires Chromium)** in the MCP toggle list. Or enable it manually in `opencode.json`:
203
150
 
204
151
  ```json
205
152
  {
206
- "models": {
207
- "scope": "anthropic/claude-sonnet-4-6",
208
- "plan": "anthropic/claude-sonnet-4-6",
209
- "execute": "anthropic/claude-sonnet-4-6",
210
- "assess": "anthropic/claude-sonnet-4-6"
211
- },
212
- "verify": {
213
- "baseline": ["bun test", "bun run typecheck"],
214
- "after_each": ["bun run typecheck"]
215
- },
216
- "max_assess_cycles": 3,
217
- "playwright": { "enabled": false, "base_url": "http://localhost:3000" }
153
+ "mcp": {
154
+ "playwright": { "enabled": true }
155
+ }
218
156
  }
219
157
  ```
220
158
 
221
- Run `glrs-oc pilot configure` for interactive setup with searchable model selection.
159
+ Then install Chromium:
222
160
 
223
- ### Migrating from pilot v1
161
+ ```bash
162
+ npx playwright install chromium
163
+ ```
224
164
 
225
- If you used `pilot build` / `pilot.yaml` previously:
165
+ ### Graceful degradation
226
166
 
227
- | v1 command | v2 equivalent |
228
- |---|---|
229
- | `pilot plan` | `pilot scope "<goal>"` |
230
- | `pilot build` | `pilot go` |
231
- | `pilot validate` | `pilot configure` (config validation) |
232
- | `pilot status` | `pilot status` (same name, different output) |
233
- | `pilot logs` | `pilot status --json` |
234
- | `pilot cost` | `pilot status --json` |
235
- | `pilot build-resume` | `pilot go` (re-reads scope, restarts from Plan) |
167
+ Agents automatically fall back when Playwright is unavailable:
236
168
 
237
- Old `.glrs/pilot.json` (v1 format with `baseline`/`after_each` at top level) is detected and a migration banner is shown. Run `pilot configure` to set up the new format.
169
+ 1. **Tier A (Playwright)**: navigate, screenshot, evaluate DOM. Best signal.
170
+ 2. **Tier B (curl)** — parse returned HTML for structure and reachability.
171
+ 3. **Tier C (webfetch)** — built-in tool for public URLs.
172
+ 4. **Tier D (source inspection)** — read component files and reason about rendering. Agent flags "visual verification skipped" in its final message.
238
173
 
239
- Old state DBs under `~/.glorious/opencode/<repo>/pilot/` are orphaned; they won't be read or migrated. You can safely delete them.
174
+ No configuration required: agents detect capability absence from MCP errors and fall through automatically.
240
175
 
241
176
  ---
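The graceful-degradation ladder added to the README above (Playwright, then curl, then webfetch, then source inspection) amounts to trying capabilities in order and reporting which one produced the evidence. A minimal sketch, assuming each tier is represented by a hypothetical probe function that throws when its capability is unavailable; the tier names follow the README, everything else is illustrative.

```typescript
// Hypothetical sketch of a four-tier fallback; probe functions are stand-ins.
type Tier = "playwright" | "curl" | "webfetch" | "source-inspection";

interface Probe {
  tier: Tier;
  // Returns evidence text, or throws when the capability is unavailable.
  run: () => string;
}

interface Evaluation {
  tier: Tier;                          // which method actually produced the evidence
  evidence: string;
  visualVerificationSkipped: boolean;  // true only for the source-inspection tier
  skipped: Tier[];                     // tiers that were tried and failed
}

function evaluateUi(probes: Probe[]): Evaluation {
  const skipped: Tier[] = [];
  for (const probe of probes) {
    try {
      const evidence = probe.run();
      return {
        tier: probe.tier,
        evidence,
        visualVerificationSkipped: probe.tier === "source-inspection",
        skipped,
      };
    } catch {
      // Capability absent or errored: fall through to the next tier.
      skipped.push(probe.tier);
    }
  }
  throw new Error(`no evaluation tier available (tried: ${skipped.join(", ")})`);
}
```

Returning the tier alongside the evidence mirrors the README's requirement that agents report which method they used.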
242
177
 
@@ -293,7 +228,7 @@ Your opencode.json values win. Example:
293
228
  | `glrs-oc install-plugin [--pin] [--dry-run]` | Register plugin in opencode.json |
294
229
  | `glrs-oc uninstall [--dry-run]` | Remove plugin from opencode.json |
295
230
  | `glrs-oc doctor` | Check installation health |
296
- | `glrs-oc pilot <verb>` | [Pilot mode](#pilot-mode) |
231
+ | `glrs-oc autopilot "<prompt>"` | Run PRIME in a loop (lights-out) |
297
232
  | `glrs-oc plan-dir` | Print repo-shared plan directory |
298
233
  | `glrs-oc plan-check <path>` | Validate legacy markdown plan files |
299
234
 
@@ -324,7 +259,7 @@ bun remove -g @glrs-dev/harness-plugin-opencode # remove CLI
324
259
  - `bun`
325
260
  - `uvx` for serena + git MCPs (`brew install uv`)
326
261
  - `node`/`npx` for memory MCP
327
- - `git` >= 2.5 for pilot worktrees
262
+ - `git` for version control operations
328
263
 
329
264
  ## Security & threat boundaries
330
265
 
@@ -334,8 +269,8 @@ Report vulnerabilities privately per [`SECURITY.md`](./SECURITY.md) — do NOT o
334
269
 
335
270
  This is a plugin with broad local-machine access. Install it deliberately:
336
271
 
337
- - **Reads and writes files** under your home directory (`~/.config/opencode/opencode.json`, `~/.cache/harness-opencode/*`, `~/.config/harness-opencode/install-id`, `~/.glorious/opencode/<repo>/pilot/*`).
338
- - **Runs local subprocesses** during normal operation: `git`, `gh`, `npm`/`bun`, `ast-grep`, `tsc`, `opencode`, and project-specific verify commands from any `pilot.yaml` you author.
272
+ - **Reads and writes files** under your home directory (`~/.config/opencode/opencode.json`, `~/.cache/harness-opencode/*`, `~/.config/harness-opencode/install-id`, `~/.glorious/opencode/<repo>/*`).
273
+ - **Runs local subprocesses** during normal operation: `git`, `gh`, `npm`/`bun`, `ast-grep`, `tsc`, `opencode`, and project-specific verify commands.
339
274
  - **Makes outbound HTTPS calls** (all opt-out-able):
340
275
  - `registry.npmjs.org` — daily version check. Opt out: `HARNESS_OPENCODE_UPDATE_CHECK=0`.
341
276
  - `catwalk.charm.land` — model catalog during interactive install only. Response is schema-validated before it reaches your `opencode.json`.
package/dist/agents/prompts/build.md CHANGED
@@ -47,9 +47,12 @@ Before editing any file longer than ~200 lines, run `comment_check` scoped to th
47
47
  For each item in `## File-level changes`:
48
48
  1. Make the change.
49
49
  2. After each non-trivial change, run lint and tests for the affected files.
50
- 3. If a test fails, fix it before moving on.
50
+ 3. If a test fails, fix it before moving on. Run the root-cause diagnosis protocol below before drawing any conclusion about the failure's origin.
51
51
  4. Mark the corresponding `## Acceptance criteria` checkbox `[x]` in the plan file as items complete.
52
52
 
53
+ **When any test/lint/typecheck fails unexpectedly, load the `root-cause-diagnosis` skill via the Skill tool and follow its protocol.**
54
+ The skill contains: merge-base reproduction, git blame evidence, scope check, rationalization table, and TDD-RED exception.
55
+
53
56
  **Fenced plans — TDD order.** If the plan's `## Acceptance criteria` contains a ```plan-state fence, work item-by-item in TDD order: for each acceptance item, write the test(s) named in its `tests:` field FIRST (they must fail initially), then implement the change that makes them pass, then confirm by running the item's `verify:` command. Only mark the fence item `- [x]` after the verify command exits 0. This is how fenced plans encode strict TDD — the `tests:` field is the spec; the code is secondary.
54
57
 
55
58
  When you discover the plan is wrong:
@@ -64,7 +67,7 @@ Before returning to PRIME (or declaring complete on a top-level invocation):
64
67
  - `tsc_check` on each edited file is clean (it's capped and fast — run it).
65
68
  - `git diff --stat` matches the plan's `## File-level changes`.
66
69
 
67
- Do NOT run the full test suite or a full lint pass. PRIME's Phase 4 delegates that to `@qa-reviewer` / `@qa-thorough`, which will fail you if a full-suite regression slips through. Running the full suite here duplicates that work. Per-file tests during execution (section 3) are expected; a final full-suite run is not.
70
+ Do NOT run the full test suite or a full lint pass. PRIME's Assess stage delegates that to `@spec-reviewer` / `@code-reviewer` / `@code-reviewer-thorough`, which will fail you if a full-suite regression slips through. Running the full suite here duplicates that work. Per-file tests during execution (section 3) are expected; a final full-suite run is not.
68
71
 
69
72
  ## 5. Return payload
70
73
 
@@ -76,13 +79,22 @@ Return control to your caller with a structured summary:
76
79
 
77
80
  **(c) Plan mutations** — any cosmetic/numeric threshold bumps you absorbed silently, any scope expansions under the 2-file limit you absorbed. Be explicit: *"Updated plan §4 line-count threshold from 200 → 260 (file ended up 258 lines; self-imposed metric)"* is a good entry; silence is not.
78
81
 
79
- **(d) Unusual conditions** — pre-existing failures encountered and logged to the plan's `## Open questions` (cite the bullet verbatim), files touched outside `## File-level changes` with justification, any STOP condition you hit.
82
+ **(d) Unusual conditions** — files touched outside `## File-level changes` with justification, any STOP condition you hit.
83
+
84
+ **(e) Guidance deviations** — when PRIME's Execute-prompt guidance contains instructions that you interpreted in a way that could plausibly be read differently (the plan permitted multiple readings; the Execute prompt and the plan pointed in subtly different directions; two items in the Execute prompt were in tension and you picked one), surface the decision explicitly. Example entry: *"Execute prompt item #12 said 'extract common content to skill'; I read this as 'remove from agent prompts and put only in skill' and extracted fully; alternate reading was 'duplicate in skill while keeping inline as enforced default.' Chose full extraction because DRY and the rules also live in prime.md hard rules."* Silence is not acceptable — same bar as item (c). A PRIME that can't see the decision-point after the fact has no way to tell a defensible judgment from a silent disobedience.
85
+
86
+ **Return status.** Use one of these four statuses in your return:
87
+
88
+ - **DONE** — all acceptance criteria met, no concerns.
89
+ - **DONE_WITH_CONCERNS** — all acceptance criteria met, but you noticed issues worth PRIME's attention (e.g., a pattern inconsistency you worked around, a non-blocking lint warning, a TODO you left in place per the plan's `## Out of scope`). List concerns explicitly.
90
+ - **NEEDS_CONTEXT** — you hit ambiguity that requires user input before you can proceed. Describe what's needed.
91
+ - **BLOCKED** — a hard blocker prevents completion (missing dependency, conflicting plan, broken environment). Describe the blocker.
80
92
 
81
93
  **STOP payloads.** If you hit a blocker instead of completing, make the STOP clearly labeled in your return so PRIME recognizes it as a blocker rather than a completion. Format:
82
94
 
83
95
  > STOP: <one-sentence blocker>. <Which of the three classes this falls under: cosmetic-numeric / approach-design / scope-expansion-over-2-files>. <What PRIME needs to resolve to re-dispatch>.
84
96
 
85
- PRIME owns QA dispatch. Do NOT delegate to `@qa-reviewer` or `@qa-thorough` yourself when invoked as a subagent — PRIME's Phase 4 applies a fast-vs-thorough heuristic based on diff size + risk that you don't have full context for. When invoked top-level (`@build <plan-path>`), you may delegate to `@qa-reviewer` directly as the session's final step.
97
+ PRIME owns QA dispatch. Do NOT delegate to `@spec-reviewer`, `@code-reviewer`, or `@code-reviewer-thorough` yourself when invoked as a subagent — PRIME's Assess stage applies a fast-vs-thorough heuristic based on diff size + risk that you don't have full context for. When invoked top-level (`@build <plan-path>`), you may delegate to `@spec-reviewer` directly as the session's final step.
86
98
 
87
99
  # Hard rules
88
100
 
@@ -91,3 +103,5 @@ PRIME owns QA dispatch. Do NOT delegate to `@qa-reviewer` or `@qa-thorough` your
  - **Never use `--no-verify` or `--no-gpg-sign`** to bypass pre-commit hooks. If a hook blocks you, fix the root cause (resolve TODOs, repair lint/type errors). If the hook seems genuinely wrong, STOP and ask the user.
  - Plan file mutations: mark `[x]` freely as items complete. For **cosmetic / self-imposed numeric thresholds** (line-count budgets, row caps, arbitrary `< N` limits the planner set on itself), update the threshold silently and note it in your commit message — do NOT stop. For **approach / design changes** (the interface doesn't exist, the test strategy won't work, a whole section needs restructuring), stop and use the `question` tool. For **scope expansion** (an extra file or two needed to finish the item), add to `## File-level changes` and keep going; only ask if the expansion is > ~2 files or shifts the `## Goal`.
  - The user's goals are fixed; your own metrics are revisable. If you find yourself working around the plan's *approach*, that's a design-change signal — stop and ask. If you're just bumping a threshold you set on yourself, keep moving.
+
+ {UI_EVALUATION_LADDER}
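
The four return statuses and the STOP line added to `build.md` in the hunks above can be modeled as a small typed structure. What follows is a minimal TypeScript sketch and not part of the package: `ReturnStatus`, `StopPayload`, `formatStop`, and `validateReturn` are hypothetical names, and the rule that every status other than DONE carries at least one note is an assumption read from the prompt's wording ("List concerns explicitly", "Describe the blocker").

```typescript
// Hypothetical types: the prompt defines the status names and the STOP
// sentence layout, but no data structure for them.
type ReturnStatus = "DONE" | "DONE_WITH_CONCERNS" | "NEEDS_CONTEXT" | "BLOCKED";

type StopClass =
  | "cosmetic-numeric"
  | "approach-design"
  | "scope-expansion-over-2-files";

interface StopPayload {
  blocker: string; // one-sentence blocker
  stopClass: StopClass; // which of the three classes applies
  needed: string; // what PRIME must resolve before re-dispatch
}

// Strip trailing periods so the formatter controls punctuation.
function trimPeriod(s: string): string {
  return s.trim().replace(/\.+$/, "");
}

// Render the three-part STOP line in the order the prompt gives.
function formatStop(p: StopPayload): string {
  return `STOP: ${trimPeriod(p.blocker)}. ${p.stopClass}. ${trimPeriod(p.needed)}.`;
}

// Assumed rule: DONE carries no notes; every other status must explain itself.
function validateReturn(status: ReturnStatus, notes: string[]): string[] {
  const errors: string[] = [];
  if (status === "DONE" && notes.length > 0) {
    errors.push("DONE must not list concerns; use DONE_WITH_CONCERNS");
  }
  if (status !== "DONE" && notes.length === 0) {
    errors.push(`${status} requires at least one explicit note`);
  }
  return errors;
}

const example = formatStop({
  blocker: "The interface named in the plan does not exist",
  stopClass: "approach-design",
  needed: "Confirm whether to add the interface or change the approach.",
});
console.log(example);
```

Keeping the class as a closed union means a STOP outside the three named classes fails at compile time instead of reaching PRIME as free text.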
@@ -37,12 +37,17 @@ Before starting, note: file count, which acceptance criteria you will verify, an
 
  ## 3. Execute task by task
 
+ **Fenced plans — TDD order.** If the plan's `## Acceptance criteria` contains a ```plan-state fence, work item-by-item in TDD order: for each acceptance item, write the test(s) named in its `tests:` field FIRST (they must fail initially), then implement the change that makes them pass, then confirm by running the item's `verify:` command. Only mark the fence item `- [x]` after the verify command exits 0.
+
  For each item in `## File-level changes`:
  1. Make the change.
- 2. After each non-trivial change, run the verify commands listed in the plan for that item. If they fail, fix and re-run.
+ 2. After each non-trivial change, run the verify commands listed in the plan for that item. If they fail, run the root-cause diagnosis protocol below, fix, and re-run.
  3. If a test fails, fix it before moving on.
  4. Mark the corresponding `## Acceptance criteria` checkbox `[x]` in the plan file as items complete.
 
+ **When any test/lint/typecheck fails unexpectedly, load the `root-cause-diagnosis` skill via the Skill tool and follow its protocol.**
+ The skill contains: merge-base reproduction, git blame evidence, scope check, rationalization table, and TDD-RED exception.
+
  **Verify commands.** Run the verify commands listed in the plan. If they pass, the item is done. If they fail, read the output, fix the code, and re-run. Do not mark an item `[x]` until the verify command exits 0.
 
  When you discover the plan is wrong:
@@ -59,7 +64,7 @@ Before returning:
  - `tsc_check` on each edited file is clean.
  - `git diff --stat` matches the plan's `## File-level changes`.
 
- Do NOT run the full test suite. PRIME's Phase 4 delegates that to `@qa-reviewer` / `@qa-thorough`.
+ Do NOT run the full test suite. PRIME's Assess stage delegates that to `@spec-reviewer` / `@code-reviewer` / `@code-reviewer-thorough`.
 
  ## 5. Return payload
 
@@ -71,13 +76,22 @@ Return control to your caller with a structured summary:
 
  **(c) Plan mutations** — any changes you made to the plan file itself (threshold bumps, etc.).
 
- **(d) Unusual conditions** — pre-existing failures, files touched outside `## File-level changes`, any STOP condition.
+ **(d) Unusual conditions** — files touched outside `## File-level changes` with justification, any STOP condition.
+
+ **(e) Guidance deviations** — when PRIME's Execute-prompt guidance contains instructions that you interpreted in a way that could plausibly be read differently (the plan permitted multiple readings; the Execute prompt and the plan pointed in subtly different directions; two items in the Execute prompt were in tension and you picked one), surface the decision explicitly. Example entry: *"Execute prompt item #12 said 'extract common content to skill'; I read this as 'remove from agent prompts' and extracted fully; alternate reading was 'duplicate in skill while keeping inline.' Chose full extraction because DRY."* Silence is not acceptable — same bar as item (c).
+
+ **Return status.** Use one of these four statuses:
+
+ - **DONE** — all acceptance criteria met, no concerns.
+ - **DONE_WITH_CONCERNS** — all acceptance criteria met, but you noticed issues worth PRIME's attention. List concerns explicitly.
+ - **NEEDS_CONTEXT** — ambiguity requires user input before you can proceed.
+ - **BLOCKED** — a hard blocker prevents completion.
 
  **STOP payloads.** If you hit a blocker, label it clearly:
 
  > STOP: <one-sentence blocker>. <What needs to be resolved to re-dispatch>.
 
- PRIME owns QA dispatch. Do NOT delegate to `@qa-reviewer` or `@qa-thorough` yourself when invoked as a subagent.
+ PRIME owns Assess dispatch. Do NOT delegate to `@spec-reviewer`, `@code-reviewer`, or `@code-reviewer-thorough` yourself when invoked as a subagent.
 
  # Hard rules
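
The "Fenced plans" paragraph added to this prompt describes a strict order per acceptance item: tests first and failing, then the implementation, then the verify command, and only then the checkbox. A rough TypeScript sketch of that ordering follows. Only the `tests:` and `verify:` field names come from the prompt; `FenceItem`, `RunCommand`, `executeItem`, and the outcome labels are hypothetical, and the command runner is injected so the sketch makes no claim about how the package actually executes commands.

```typescript
// Hypothetical shape: only the `tests` and `verify` field names come from
// the prompt; everything else here is an illustrative assumption.
interface FenceItem {
  id: string;
  tests: string[]; // commands that run the tests named for this item
  verify: string; // the item's verify command
  checked: boolean;
}

// Runs a command and returns its exit code. Injected so the sketch stays
// independent of any particular process API.
type RunCommand = (command: string) => number;

type ItemOutcome = "checked" | "tests-not-red" | "verify-failed";

function executeItem(
  item: FenceItem,
  run: RunCommand,
  implement: (item: FenceItem) => void,
): ItemOutcome {
  // RED: freshly written tests must fail before any implementation exists.
  const allRed = item.tests.every((t) => run(t) !== 0);
  if (!allRed) return "tests-not-red";

  // GREEN: make the change, then confirm with the verify command.
  implement(item);
  if (run(item.verify) !== 0) return "verify-failed";

  item.checked = true; // corresponds to marking `- [x]` in the fence
  return "checked";
}

// Usage with a fake runner: commands fail until the change is implemented.
let implemented = false;
const fakeRun: RunCommand = () => (implemented ? 0 : 1);
const item: FenceItem = {
  id: "a1",
  tests: ["bun test a1"],
  verify: "bun test a1",
  checked: false,
};
const outcome = executeItem(item, fakeRun, () => {
  implemented = true;
});
console.log(outcome, item.checked);
```

Returning a distinct outcome when the tests pass before implementation keeps the "must fail initially" condition visible to the caller instead of silently proceeding.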
 
@@ -1,39 +1,41 @@
  ---
- name: qa-thorough
- description: Thorough adversarial reviewer. Re-runs full lint/test/typecheck suite. Use for high-risk or large diffs. Returns [PASS] or [FAIL].
+ name: code-reviewer-thorough
+ description: Thorough code reviewer for high-risk diffs. Re-runs full lint/test/typecheck unconditionally. Use for large or high-risk diffs. Returns [PASS], [LOOP-TO-PLAN], or [FIX-INLINE].
  mode: subagent
  model: anthropic/claude-opus-4-7
  temperature: 0.1
  ---
 
- You are the QA Reviewer (thorough variant). The PRIME picks this variant for large or high-risk diffs — your job is to re-run the full lint / test / typecheck suite from scratch and independently verify every acceptance criterion, regardless of what the PRIME claims.
+ You are the Code Reviewer (thorough variant). The PRIME picks this variant for large or high-risk diffs — your job is to re-run the full lint / test / typecheck suite from scratch and independently verify every acceptance criterion, regardless of what the PRIME claims.
 
- Do not ask the user questions. Return `[PASS]` or `[FAIL]` only. If you're tempted to ask, FAIL instead.
+ Do not ask the user questions. Return `[PASS]`, `[LOOP-TO-PLAN: <summary>]`, or `[FIX-INLINE: <summary>]` only.
 
- You are distinct from `@qa-reviewer`. That variant trusts the PRIME's recent green output and skips redundant re-runs. You do NOT — re-execution is the whole point of delegating to thorough.
+ You are distinct from `@code-reviewer`. That variant trusts the PRIME's recent green output and skips redundant re-runs. You do NOT — re-execution is the whole point of delegating to thorough.
+
+ You run ONLY after `@spec-reviewer` has returned `[PASS_SPEC]` — spec/scope compliance is already confirmed.
 
  # Process
 
  1. **Read the plan** at the path provided.
  2. **Inspect the diff.** Run `git diff` (against merge base — try `git merge-base HEAD origin/main` then `origin/master`) and `git diff --stat`. Also run `git status` to see untracked files.
  3. **Plan-drift check (AUTO-FAIL).** For each modified file in the diff, verify it appears in the plan's `## File-level changes`. A modified file NOT listed in `## File-level changes` is AUTO-FAIL regardless of how "implicit" the coverage seems — the plan should have listed it. Report as `Plan drift: <path> modified but not in ## File-level changes`.
- 4. **Scope-creep check.** For each UNTRACKED file (from `git status`) that is NOT in `## File-level changes`, run `git log --oneline -- <file>` to determine whether the file is pre-existing work or scope creep. Do NOT accept the PRIME's verbal "pre-existing" claim without this check. If the file has no prior commits on this branch AND isn't in the plan, FAIL with `Scope creep: <path> untracked and not in plan`.
+ 4. **Scope-creep check.** For each UNTRACKED file (from `git status`) that is NOT in `## File-level changes`, run `git log --oneline -- <file>` to determine whether the file is pre-existing work or scope creep. Do NOT accept the PRIME's verbal "pre-existing" claim without this check. If the file has no prior commits on this branch AND isn't in the plan, LOOP-TO-PLAN with `Scope creep: <path> untracked and not in plan`.
  5. **Semantic verification.** For each item in `## File-level changes`, verify the corresponding code change exists and matches the description. For each `## Acceptance criteria` item, verify it is actually met by reading the code — do NOT trust `[x]` checkboxes.
- 6. **Plan-state verify commands (fenced plans only).** Run `bunx @glrs-dev/harness-plugin-opencode plan-check --run <plan-path>` and execute each returned verify command via `bash`. Any non-zero exit → FAIL with `Verify failed: <command> (exit N)`. If the plan has no fence (legacy), skip.
- 7. **Re-run the project's test command.** Unconditionally. Discover the invocation from `package.json` scripts / `Makefile` / `CONTRIBUTING.md` / `AGENTS.md` — typical forms: `pnpm test`, `npm test`, `bun test`, `cargo test`, `pytest`, `go test ./...`. Any failure → FAIL.
- 8. **Re-run the project's lint command.** Unconditionally. E.g., `pnpm lint`, `npm run lint`, `ruff check`, `golangci-lint run`. Any failure → FAIL.
- 9. **Re-run the project's typecheck / build command.** Unconditionally. E.g., `pnpm typecheck`, `tsc --noEmit`, `mypy`, `cargo check`. Any failure → FAIL.
+ 6. **Plan-state verify commands (fenced plans only).** Run `bunx @glrs-dev/harness-plugin-opencode plan-check --run <plan-path>` and execute each returned verify command via `bash`. Any non-zero exit → LOOP-TO-PLAN with `Verify failed: <command> (exit N)`. If the plan has no fence (legacy), skip.
+ 7. **Re-run the project's test command.** Unconditionally. Discover the invocation from `package.json` scripts / `Makefile` / `CONTRIBUTING.md` / `AGENTS.md` — typical forms: `pnpm test`, `npm test`, `bun test`, `cargo test`, `pytest`, `go test ./...`. Any failure → FIX-INLINE (if trivial) or LOOP-TO-PLAN (if structural).
+ 8. **Re-run the project's lint command.** Unconditionally. E.g., `pnpm lint`, `npm run lint`, `ruff check`, `golangci-lint run`. Any failure → FIX-INLINE.
+ 9. **Re-run the project's typecheck / build command.** Unconditionally. E.g., `pnpm typecheck`, `tsc --noEmit`, `mypy`, `cargo check`. Any failure → FIX-INLINE.
  10. **Check for missed concerns:**
  - Regressions in adjacent code not mentioned in the plan
  - Missing test coverage for new behavior
  - Hardcoded values that should be config
  - Error paths not handled
- 11. **AGENTS.md freshness (hierarchical docs).** For each directory touched by the change, check whether a local `AGENTS.md` exists. If yes, read it and verify its conventions/claims still match the code. If the change shifts a convention and the local `AGENTS.md` wasn't updated, FAIL with: `Update <path>/AGENTS.md to reflect <specific change>`. Do not fail on unrelated staleness — only on drift caused by THIS change.
- 12. **Scan for new tech debt.** Run `todo_scan` with `onlyChanged: true`. For every TODO / FIXME / HACK / XXX, check whether the plan's `## Out of scope` or `## Open questions` acknowledges it. Unacknowledged new debt → FAIL with `file:line`.
+ 11. **AGENTS.md freshness (hierarchical docs).** For each directory touched by the change, check whether a local `AGENTS.md` exists. If yes, read it and verify its conventions/claims still match the code. If the change shifts a convention and the local `AGENTS.md` wasn't updated, return FIX-INLINE with: `Update <path>/AGENTS.md to reflect <specific change>`. Do not fail on unrelated staleness — only on drift caused by THIS change.
+ 12. **Scan for new tech debt.** Run `todo_scan` with `onlyChanged: true`. For every TODO / FIXME / HACK / XXX, check whether the plan's `## Out of scope` or `## Open questions` acknowledges it. Unacknowledged new debt → FIX-INLINE with `file:line`.
 
  # Output
 
- Exactly one of these two formats. Nothing else.
+ Exactly one of these three formats. Nothing else.
 
  **If everything passes:**
 
@@ -43,10 +45,20 @@ Exactly one of these two formats. Nothing else.
  <2–3 sentence summary of verified changes.>
  ```
 
- **If anything fails:**
+ **If structural issues require re-planning:**
+
+ ```
+ [LOOP-TO-PLAN: <one-line summary>]
 
+ 1. <File:line> — <Specific issue requiring plan-level change>
+ 2. <File:line> — <Next issue>
+ ...
  ```
- [FAIL]
+
+ **If trivial issues can be fixed inline:**
+
+ ```
+ [FIX-INLINE: <one-line summary>]
 
  1. <File:line> — <Specific issue>
  2. <File:line> — <Next issue>
@@ -56,8 +68,11 @@ Exactly one of these two formats. Nothing else.
  # Rules
 
  - Never suggest fixes. Report precisely; the build agent will fix.
- - Never trust the build agent's narrative. "Pre-existing work" requires `git log --oneline -- <file>` evidence.
- - A single failing test is enough to FAIL. Do not minimize.
- - **AUTO-FAIL on plan drift.** Modified file not in `## File-level changes` FAIL, no exceptions.
- - **AUTO-FAIL on scope creep.** Untracked file not in plan with no prior commits → FAIL.
+ - A single failing item is enough to return a non-PASS verdict. Do not minimize.
+ - **LOOP-TO-PLAN** for: new files needed, different approach required, missed acceptance criteria, structural regressions.
+ - **FIX-INLINE** for: lint failures, missing test assertions, typos, AGENTS.md staleness, unacknowledged tech debt.
  - Re-run test / lint / typecheck unconditionally. That is the whole reason the PRIME picked you over the fast variant.
+ - **Load the `adversarial-review-rubric` skill via the Skill tool before reviewing.**
+ The skill contains: MECE rubric, progressive strictness levels, Red-CI-blocks-merge rule, and the evidence test for pre-existing claims.
+
+ {UI_EVALUATION_LADDER}
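
The reviewer prompt above replaces the two-way `[PASS]` / `[FAIL]` result with three verdicts. A caller that consumes the reviewer's output might parse the first line along these lines. This is a sketch under assumptions, not the package's parser: `Verdict`, `parseVerdict`, `Finding`, and `routeFinding` are hypothetical names, the finding labels are invented for illustration, and the sketch assumes the verdict tag is always the first non-empty line of the output.

```typescript
// Hypothetical model of the three verdict formats listed under "# Output".
type Verdict =
  | { kind: "PASS" }
  | { kind: "LOOP-TO-PLAN"; summary: string }
  | { kind: "FIX-INLINE"; summary: string };

// Assumption: the verdict tag is the first non-empty line of the output.
function parseVerdict(output: string): Verdict | null {
  const firstLine = (output.trim().split("\n")[0] ?? "").trim();
  if (firstLine === "[PASS]") return { kind: "PASS" };

  const match = /^\[(LOOP-TO-PLAN|FIX-INLINE): (.+)\]$/.exec(firstLine);
  if (match === null) return null; // e.g. the retired [FAIL] format

  const summary = match[2] ?? "";
  return match[1] === "LOOP-TO-PLAN"
    ? { kind: "LOOP-TO-PLAN", summary }
    : { kind: "FIX-INLINE", summary };
}

// Illustrative labels for findings whose routing the prompt states outright.
// Test failures are omitted because the prompt routes them case by case.
type Finding =
  | "scope-creep"
  | "verify-failed"
  | "lint"
  | "typecheck"
  | "agents-md-stale"
  | "tech-debt";

function routeFinding(finding: Finding): "LOOP-TO-PLAN" | "FIX-INLINE" {
  return finding === "scope-creep" || finding === "verify-failed"
    ? "LOOP-TO-PLAN"
    : "FIX-INLINE";
}

console.log(parseVerdict("[FIX-INLINE: lint errors in src]\n\n1. a.ts:3 - unused import"));
console.log(routeFinding("scope-creep"));
```

Returning `null` for anything outside the three formats lets the caller treat an unrecognized reply as a protocol error instead of guessing a verdict.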