npm - @glrs-dev/harness-plugin-opencode - Versions diffs - 2.0.1 → 2.2.0 - Mend

@glrs-dev/harness-plugin-opencode 2.0.1 → 2.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (44)

package/CHANGELOG.md +72 -0
package/README.md +39 -104
package/dist/agents/prompts/build.md +18 -4
package/dist/agents/prompts/build.open.md +18 -4
package/dist/agents/prompts/{qa-thorough.md → code-reviewer-thorough.md} +34 -19
package/dist/agents/prompts/code-reviewer.md +80 -0
package/dist/agents/prompts/code-reviewer.open.md +68 -0
package/dist/agents/prompts/gap-analyzer.md +2 -0
package/dist/agents/prompts/plan-reviewer.md +3 -0
package/dist/agents/prompts/plan.md +23 -4
package/dist/agents/prompts/prime.md +146 -87
package/dist/agents/prompts/research-auto.md +1 -1
package/dist/agents/prompts/research-local.md +1 -1
package/dist/agents/prompts/research-web.md +1 -1
package/dist/agents/prompts/research.md +2 -0
package/dist/agents/prompts/spec-reviewer.md +54 -0
package/dist/agents/prompts/spec-reviewer.open.md +57 -0
package/dist/agents/shared/index.ts +1 -0
package/dist/agents/shared/ui-evaluation-ladder.md +50 -0
package/dist/agents/shared/workflow-mechanics.md +5 -5
package/dist/autopilot/prompt-template.md +80 -0
package/dist/{chunk-VJUETC6A.js → chunk-PDMXYZM4.js} +53 -1
package/dist/cli.js +1333 -1646
package/dist/commands/prompts/fresh.md +27 -24
package/dist/commands/prompts/review.md +3 -3
package/dist/commands/prompts/ship.md +2 -0
package/dist/index.js +106 -627
package/dist/skills/adversarial-review-rubric/SKILL.md +47 -0
package/dist/skills/code-quality/SKILL.md +1 -1
package/dist/skills/root-cause-diagnosis/SKILL.md +24 -0
package/dist/skills/spear-protocol/SKILL.md +166 -0
package/package.json +1 -1
package/dist/agents/prompts/pilot-assessor.md +0 -77
package/dist/agents/prompts/pilot-builder.md +0 -40
package/dist/agents/prompts/pilot-planner.md +0 -56
package/dist/agents/prompts/pilot-scoper.md +0 -58
package/dist/agents/prompts/qa-reviewer.md +0 -68
package/dist/agents/prompts/qa-reviewer.open.md +0 -58
package/dist/chunk-6CZPRUMJ.js +0 -869
package/dist/chunk-DZG4D3OH.js +0 -54
package/dist/chunk-OYRKOEXK.js +0 -88
package/dist/commands/prompts/autopilot.md +0 -96
package/dist/install-6775ZBDG.js +0 -13
package/dist/paths-WZ23ZQOV.js +0 -18

package/CHANGELOG.md CHANGED

@@ -1,5 +1,77 @@
 # Changelog
+## 2.2.0
+### Minor Changes
+- [#58](https://github.com/iceglober/glrs/pull/58) [`2720440`](https://github.com/iceglober/glrs/commit/2720440e76ed76f95a59b77525cb140bd673d669) Thanks [@iceglober](https://github.com/iceglober)! - Autopilot rewrite, pilot rip-out, Tier 1 visual capabilities, opencode-snip toggle, research-variant hiding.
+  **Breaking changes:**
+  - **Pilot subsystem removed.** The `glrs oc pilot` CLI subcommand, the four pilot agents (`pilot-scoper` / `planner` / `builder` / `assessor`), the pilot-planning skill references, the `pilot-plugin.ts` runtime enforcer, and all pilot state/docs are gone. Users on pilot should migrate to the CLI autopilot or plain PRIME workflow.
+  - **TUI `/autopilot` slash command removed.** Autopilot is now CLI-only: `glrs oc autopilot "<prompt>"`. Users who want autonomous looping run the CLI in any terminal; the TUI stays for interactive work.
+  - **Research-variant agents (`research-web`, `research-local`, `research-auto`) hidden from the primary-agent picker.** They now run only as subagents dispatched by `@research`. Users who previously selected them directly should select `@research` instead.
+  **New features:**
+  - **CLI autopilot (`glrs oc autopilot "<prompt>"`)** — Ralph-loop engine: sends your prompt each iteration, watches the agent's response for `<autopilot-done>` sentinel, retries the same prompt when absent. Budgets: 50 iterations / 4h / 3 zero-progress iterations / kill-switch file. Supports single-issue (`"ship ENG-1234"`) and multi-issue (`"ship every open ENG-* issue in project ROADMAP"`) prompts.
+  - **opencode-snip installer toggle** — new "Plugin add-ons" section in `glrs oc install` (parallel to existing MCP toggles). Opt-in adds `opencode-snip` to the user's `plugin` array via config-merge, no vendored code. Useful for token reduction on bash-heavy sessions. Requires the Go `snip` binary separately.
+  - **Tier 1 visual capabilities** — `@plan`, `@research`, `@gap-analyzer` now have Playwright MCP access (joining `@prime`, `@build`, `@assessor`, `@assessor-thorough`, `@plan-reviewer`). Enable via the installer's Playwright toggle.
+  - **UI evaluation ladder (graceful degradation)** — all visual-capable agents now carry a four-tier capability ladder (Playwright → curl → webfetch → source inspection). When Playwright is unavailable, agents fall through to the next tier and report which method they used. No hard failure on Playwright absence.
+  **Internal:**
+  - Server lifecycle helpers (`startServer` / `createSession` / `sendAndWait` / `getLastAssistantMessage`) moved from `src/pilot/server.ts` to `src/lib/opencode-server.ts` (consumed by the CLI autopilot).
+  - Agent roster reduced from 20 → 16. Net −5,308 lines across 91 files. Test count 536 → 462 (pilot tests removed, visual-capability tests added).
+- [#55](https://github.com/iceglober/glrs/pull/55) [`8099c49`](https://github.com/iceglober/glrs/commit/8099c498fa6a9c05c8880bfd09cb2c4fd7d1721c) Thanks [@iceglober](https://github.com/iceglober)! - Rename PRIME arc phases to SPEAR model (Scope → Plan → Execute → Assess → Resolve). Rename @qa-reviewer → @assessor, @qa-thorough → @assessor-thorough. Resolve stage auto-ships (pushes branch, opens PR) — /ship becomes a resume path for interrupted sessions.
+- [#57](https://github.com/iceglober/glrs/pull/57) [`6212c48`](https://github.com/iceglober/glrs/commit/6212c483efa2cc8f0407bc6a0d8c23110498eb21) Thanks [@iceglober](https://github.com/iceglober)! - Restructure the SPEAR protocol (PRIME's five-stage arc) across four areas: Assess quality, failure discipline, skill modularity, and agent-contract hygiene.
+  **Breaking changes** (match the prior `@assessor` rename's hard-break pattern):
+  - `@assessor` is replaced by `@spec-reviewer` (first pass, returns `[PASS_SPEC]` or `[FAIL_SPEC]`) and `@code-reviewer` (second pass, runs only on PASS_SPEC, returns `[PASS]` / `[LOOP-TO-PLAN]` / `[FIX-INLINE]`). User configs referencing `@assessor` by name will fail to resolve — update to the appropriate replacement.
+  - `@assessor-thorough` is renamed to `@code-reviewer-thorough` (same role: opus-tier backstop for high-risk diffs that re-runs the full suite unconditionally).
+  - Registered agent count: 20 → 21.
+  **Assess rigor (two-stage review + MECE rubric):**
+  - Every Assess cycle now dispatches two subagents sequentially instead of one, roughly doubling the subagent calls per review cycle. The spec pass is cheaper; the code-quality pass runs only if spec passed.
+  - Assess delegations carry a five-dimension MECE rubric (Correctness, Completeness, Consistency, Safety, Scope) and a progressive-strictness signal (Level 1/2/3) that tightens across Assess iterations.
+  - PRs with red CI (typecheck, lint, or tests failing) now fail Assess regardless of whether the failure appears pre-existing. "Pre-existing" claims require three-part evidence: a specific commit SHA, `git log` output showing the failure pre-dates the branch, and merge-base reproduction. Claims without all three are auto-rejected.
+  **Failure discipline (no-defer policy):**
+  - The hard rule that allowed logging pre-existing failures to a plan's `## Open questions` section and deferring them is removed.
+  - `@build` now runs a mandatory root-cause diagnosis protocol on any unexpected test/lint/typecheck failure: merge-base reproduction, `git blame`, rationalization table countering common excuse patterns ("likely pre-existing", "unrelated to my change", etc.).
+  - If fixing a failure would require touching more than ~5 files outside the plan's `## File-level changes`, `@build` STOPs with a reorganization proposal for PRIME to present to the user — there is no autonomous deferral path.
+  **TDD enforcement:**
+  - For any plan with a `## Test plan` entry or a `tests:` field in the acceptance-criteria fence, `@build` now enforces TDD order: write the test first, verify it fails, then implement. Tests in a just-written RED state are explicitly carved out of the failure-diagnosis protocol — they're expected failures, not unexpected ones.
+  **New bundled skills:**
+  - `spear-protocol` — the full SPEAR stage logic (Bootstrap, Scope, Plan, Execute, Assess, Resolve). Loaded by PRIME at session start. Inline fallback retained in `prime.md` in case skill-loading is unavailable.
+  - `root-cause-diagnosis` — the failure-diagnosis protocol + rationalization table. Loaded by `@build` and its strict-executor variant on unexpected failures.
+  - `adversarial-review-rubric` — the MECE rubric, progressive strictness levels, Red-CI-blocks-merge rule, and three-part evidence test. Loaded by all Assess-layer agents before reviewing.
+  **Agent-contract changes:**
+  - `@build` gains a four-status return protocol: DONE / DONE_WITH_CONCERNS / NEEDS_CONTEXT / BLOCKED.
+  - `@build` now reports guidance deviations (item (e) of its return payload) when PRIME's Execute-prompt guidance permits multiple readings and `@build` picked one. Same "silence is not acceptable" bar as plan-file mutations.
+  - PRIME runs a pre-dispatch consistency check before every `@build` dispatch: re-read the Execute prompt against the plan and against any already-drafted follow-up prompts. Contradictions caught pre-dispatch avoid the downstream blame-misattribution pattern where faithful agent execution gets narrated as deviation.
+  - `@plan` bans placeholder phrases (TBD, TODO, "implement later", etc.) and runs a self-review checklist (spec coverage, placeholder scan, type/name consistency) before handing to `@plan-reviewer`.
+  - `@build`'s prompt is trimmed of orchestration context per the Minimal Contract principle (subagents perform worse when carrying parent-level workflow philosophy).
+  **Other refinements:**
+  - PRIME's Scope grounding dispatches parallel `@code-searcher` calls in a single message when grounding touches 3+ independent subsystems.
+  - PRIME's Plan stage detects multi-subsystem requests (3+ independent subsystems with no shared interface) and asks whether to split into separate plans.
+  - Delegation prompts apply the Minimal Contract minimality test: remove any sentence that doesn't help the subagent produce a better result. Non-goals prefer positive-instruction form ("Only modify files listed above") over negative lists when the positive form is shorter.
+## 2.1.0
 ## 2.0.1
 ### Patch Changes
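
Note on the CLI autopilot entry in the 2.2.0 changelog above: the loop it describes (resend the prompt, look for the `<autopilot-done>` sentinel, stop on one of four budgets) can be sketched roughly as follows. This is an illustrative sketch, not the package's implementation. `runAutopilot`, `Budgets`, and `LoopDeps` are hypothetical names. The real engine presumably talks to an OpenCode server asynchronously, while this sketch injects a synchronous `send`. The changelog does not define "zero-progress", so progress detection is an injected callback and the counter is assumed to track consecutive iterations. The order of the budget checks is also an assumption.

```typescript
// Hypothetical sketch of the loop described in the changelog; not the package's API.
type StopReason = "done" | "max-iterations" | "time-budget" | "zero-progress" | "kill-switch";

interface Budgets {
  maxIterations: number;   // changelog default: 50
  maxDurationMs: number;   // changelog default: 4h
  maxZeroProgress: number; // changelog default: 3 (assumed consecutive)
}

interface LoopDeps {
  send: (prompt: string) => string; // sends the prompt, returns the agent's reply
  madeProgress: () => boolean;      // assumption: the source does not say how progress is measured
  killSwitchPresent: () => boolean; // e.g. checks whether the kill-switch file exists
  now: () => number;                // clock in milliseconds
}

const SENTINEL = "<autopilot-done>";

function runAutopilot(
  prompt: string,
  budgets: Budgets,
  deps: LoopDeps,
): { reason: StopReason; iterations: number } {
  const startedAt = deps.now();
  let iterations = 0;
  let zeroProgress = 0;
  while (true) {
    // Budgets are checked before each send so that a tripped budget never costs another call.
    if (deps.killSwitchPresent()) return { reason: "kill-switch", iterations };
    if (iterations >= budgets.maxIterations) return { reason: "max-iterations", iterations };
    if (deps.now() - startedAt >= budgets.maxDurationMs) return { reason: "time-budget", iterations };

    const reply = deps.send(prompt); // the same prompt is sent on every iteration
    iterations += 1;
    if (reply.includes(SENTINEL)) return { reason: "done", iterations };

    zeroProgress = deps.madeProgress() ? 0 : zeroProgress + 1;
    if (zeroProgress >= budgets.maxZeroProgress) return { reason: "zero-progress", iterations };
  }
}
```

Injecting the clock, the kill-switch check, and the sender keeps the control flow testable without a running server or a real file on disk.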

package/README.md CHANGED

@@ -1,6 +1,6 @@
 # @glrs-dev/harness-plugin-opencode
-Opinionated agent harness for [OpenCode](https://opencode.ai). Agents, tools, slash commands, and an unattended pilot mode — one package.
+Opinionated agent harness for [OpenCode](https://opencode.ai). Agents, tools, slash commands, and an unattended autopilot loop — one package.
 ## Quick start
@@ -21,7 +21,7 @@ bunx @glrs-dev/harness-plugin-opencode install
 opencode
 ```
-No global install. All [plugin features](#what-the-plugin-provides) load automatically. You won't have the `glrs-oc` CLI, but pilot commands will offer to install the plugin if you add the CLI later.
+No global install. All [plugin features](#what-the-plugin-provides) load automatically. You won't have the `glrs-oc` CLI, but you can add it later.
 ### Verifying the published tarball
@@ -43,18 +43,18 @@ Open OpenCode in any repo. The `prime` agent handles everything end-to-end.
 ```
 /fresh ENG-1234
 ```
-Wipes the worktree, creates a branch from the ticket ref, and begins the five-phase workflow: understand → plan → execute → verify → handoff.
+Wipes the worktree, creates a branch from the ticket ref, and begins the SPEAR workflow: scope → plan → execute → assess → resolve.
 **Start a task from a description:**
 ```
 /fresh add rate limiting to the upload endpoint
 ```
-**Go hands-off after the plan looks good:**
+**Go hands-off with the Ralph loop (CLI, lights-out):**
 ```
-/autopilot ENG-1234
+glrs oc autopilot "ship ENG-1234"
 ```
-Runs the full workflow unattended. Stops when all acceptance criteria are checked off. You review, then `/ship`.
+Runs PRIME in a loop: sends your prompt each iteration, watches for `<autopilot-done>` in the response, exits when the sentinel appears or a budget is hit (50 iterations / 4h / 3 zero-progress iterations / kill-switch at `.agent/autopilot-disable`). Works with multi-issue prompts too: `glrs oc autopilot "ship every open issue in Linear project ENG-ROADMAP until the project is done"`. There is no TUI slash command — if you're in the TUI and don't want the loop, just type the task normally.
 **Ship when done:**
 ```
@@ -66,7 +66,7 @@ Squashes commits, pushes, opens a PR with the plan as the body.
 ```
 /review 87
 ```
-Read-only adversarial review. Fetches the diff, runs typecheck/lint, delegates to `@qa-reviewer`, outputs a structured verdict.
+Read-only adversarial review. Fetches the diff, runs typecheck/lint, delegates to `@assessor`, outputs a structured verdict.
 **Deep codebase research:**
 ```
@@ -74,41 +74,21 @@ Read-only adversarial review. Fetches the diff, runs typecheck/lint, delegates t
 ```
 Spawns parallel subagents, synthesizes findings with exact file:line references.
-### Autonomous (pilot CLI)
-For larger work that benefits from structured scoping and autonomous execution with self-assessment.
-```bash
-# Scope interactively — spawns OpenCode TUI with the pilot-scoper agent
-glrs-oc pilot scope "Refactor the billing module into separate services"
-# Execute autonomously — Plan → Execute → Assess → Resolve (SPEAR loop)
-glrs-oc pilot go
-# Configure models and verify commands for this repo
-glrs-oc pilot configure
-# Check workflow status
-glrs-oc pilot status
-```
-See [Pilot mode](#pilot-mode) for the full command reference.
 ---
 ## What the plugin provides
-14 agents, 7 slash commands, 5 tools, 5 MCPs, 5 skill bundles, 4 sub-plugins. Details below.
+16 agents, 7 slash commands, 5 tools, 5 MCPs, 11 skill bundles, 3 sub-plugins. Details below.
 ### Agents
 | Agent | Tier | Role |
 |-------|------|------|
-| `prime` | deep | Five-phase end-to-end workflow (default agent) |
+| `prime` | deep | SPEAR end-to-end workflow (default agent) |
 | `plan` | deep | Interactive planner with gap analysis and adversarial review |
 | `build` | mid | Plan executor |
-| `qa-reviewer` | mid | Fast adversarial code review |
-| `qa-thorough` | deep | Full-suite adversarial review |
+| `assessor` | mid | Fast adversarial code review |
+| `assessor-thorough` | deep | Full-suite adversarial review |
 | `plan-reviewer` | deep | Adversarial plan review |
 | `gap-analyzer` | deep | Identifies gaps in plans |
 | `architecture-advisor` | deep | Architecture guidance |
@@ -116,8 +96,8 @@ See [Pilot mode](#pilot-mode) for the full command reference.
 | `docs-maintainer` | mid | Documentation updates |
 | `lib-reader` | mid | Library/dependency reader |
 | `agents-md-writer` | mid | AGENTS.md generation |
-| `pilot-builder` | mid | Unattended task executor (pilot subsystem) |
-| `pilot-planner` | deep | Decomposes work into pilot.yaml DAGs |
+| `research` | deep | Multi-workstream research orchestrator |
+| `research-web` / `research-local` / `research-auto` | deep | Research subagents (dispatched by `@research`) |
 Tiers: **deep** = opus-class, **mid** = sonnet-class, **fast** = haiku-class. Override with [`harness.models`](#model-overrides).
@@ -126,13 +106,14 @@ Tiers: **deep** = opus-class, **mid** = sonnet-class, **fast** = haiku-class. Ov
 | Command | What it does |
 |---------|-------------|
 | `/fresh <ref>` | Wipe worktree, branch from ticket or description, start PRIME |
-| `/autopilot <ref>` | Hands-off PRIME run; stops when acceptance criteria pass |
 | `/ship <plan>` | Squash, push, open PR |
 | `/review <target>` | Read-only adversarial review (PR#, SHA, branch, or file) |
 | `/research <topic>` | Parallel codebase exploration with file:line citations |
 | `/init-deep` | Generate hierarchical AGENTS.md files |
 | `/costs` | Show running LLM spend totals |
+Autopilot is CLI-only: `glrs oc autopilot "<prompt>"` (see above).
 ### Tools
 `ast_grep` · `tsc_check` · `eslint_check` · `todo_scan` · `comment_check`
@@ -149,94 +130,48 @@ Tiers: **deep** = opus-class, **mid** = sonnet-class, **fast** = haiku-class. Ov
 ### Sub-plugins
-- **autopilot** — idle-nudge loop driver (only activates via `/autopilot`)
 - **notify** — OS notifications when the agent asks a question
 - **cost-tracker** — LLM spend by provider/model at `~/.glorious/opencode/costs.json`
-- **pilot-plugin** — runtime invariant enforcement for pilot agents
+- **tool-hooks** — post-edit verification loop (tsc, eslint) + output backpressure
 ### Skills
-`review-plan` · `web-design-guidelines` · `vercel-react-best-practices` · `vercel-composition-patterns` · `pilot-planning`
+`adr` · `agent-estimation` · `code-quality` · `research` · `research-auto` · `research-local` · `research-web` · `review-plan` · `vercel-composition-patterns` · `vercel-react-best-practices` · `web-design-guidelines`
 ---
-## Pilot mode
-Autonomous code execution using the SPEAR loop (Scope → Plan → Execute → Assess → Resolve). The user scopes interactively, then `pilot go` runs the rest autonomously with self-assessment and deployment-risk reflection.
-**Prerequisites:** `git` >= 2.5, `opencode` on PATH. Plugin must be installed (auto-prompted if missing).
-### Commands
+## Enabling visual UI capabilities
-| Command | Description |
-|---------|-------------|
-| `glrs-oc pilot scope "<goal>"` | Interactive scoping session. Produces `scope.json` with framing + acceptance criteria. |
-| `glrs-oc pilot go` | Autonomous execution. Reads scope, runs Plan → Execute → Assess → Resolve. |
-| `glrs-oc pilot configure` | Interactive per-phase model selection, verify commands, assess cycles, Playwright toggle. |
-| `glrs-oc pilot status` | Workflow status from SQLite. `--workflow <id>`, `--json`. |
-### SPEAR loop
-1. **Scope** (interactive) — scoper agent interviews you, explores the codebase, produces acceptance criteria.
-2. **Plan** (autonomous) — planner agent decomposes ACs into an ordered task list.
-3. **Execute** (autonomous) — builder agent runs one task at a time, commits on verify pass.
-4. **Assess** (autonomous) — assessor evaluates ACs + asks deployment-risk questions (what could break? unexpected consequences? what could go wrong?). If fail → re-plan the gap → re-execute → re-assess (bounded by `max_assess_cycles`).
-5. **Resolve** (autonomous) — final summary with acknowledged risks.
+The `@plan`, `@research`, `@gap-analyzer`, `@prime`, `@build`, `@assessor`, `@assessor-thorough`, and `@plan-reviewer` agents can verify web UIs, rendered output, and visual components when Playwright is available.
-### State storage
-```
-~/.glorious/opencode/<repo>/pilot/
-  state.sqlite              # workflows + events
-  current-scope.json        # pointer to active scope
-  scopes/<workflowId>/
-    scope.json              # framing + acceptance criteria
-    plan.json               # task list
-    assessment-cycle-N.json # assessment reports
-```
+### Enable Playwright MCP
-Repo identity derived from `git rev-parse --git-common-dir` — worktrees of the same repo share state. Override with `$GLORIOUS_PILOT_DIR`.
-### Configuration
-Config lives at `.glrs/pilot.json` in your repo (not per-plan YAML):
+During `glrs-oc install-plugin`, select **Playwright — browser automation + visual UI verification (requires Chromium)** in the MCP toggle list. Or enable it manually in `opencode.json`:
 ```json
 {
-  "models": {
-    "scope": "anthropic/claude-sonnet-4-6",
-    "plan": "anthropic/claude-sonnet-4-6",
-    "execute": "anthropic/claude-sonnet-4-6",
-    "assess": "anthropic/claude-sonnet-4-6"
-  },
-  "verify": {
-    "baseline": ["bun test", "bun run typecheck"],
-    "after_each": ["bun run typecheck"]
-  },
-  "max_assess_cycles": 3,
-  "playwright": { "enabled": false, "base_url": "http://localhost:3000" }
+  "mcp": {
+    "playwright": { "enabled": true }
+  }
 }
 ```
-Run `glrs-oc pilot configure` for interactive setup with searchable model selection.
+Then install Chromium:
-### Migrating from pilot v1
+```bash
+npx playwright install chromium
+```
-If you used `pilot build` / `pilot.yaml` previously:
+### Graceful degradation
-| v1 command | v2 equivalent |
-|---|---|
-| `pilot plan` | `pilot scope "<goal>"` |
-| `pilot build` | `pilot go` |
-| `pilot validate` | `pilot configure` (config validation) |
-| `pilot status` | `pilot status` (same name, different output) |
-| `pilot logs` | `pilot status --json` |
-| `pilot cost` | `pilot status --json` |
-| `pilot build-resume` | `pilot go` (re-reads scope, restarts from Plan) |
+Agents automatically fall back when Playwright is unavailable:
-Old `.glrs/pilot.json` (v1 format with `baseline`/`after_each` at top level) is detected and a migration banner is shown. Run `pilot configure` to set up the new format.
+1. **Tier A (Playwright)** — navigate, screenshot, evaluate DOM. Best signal.
+2. **Tier B (curl)** — parse returned HTML for structure and reachability.
+3. **Tier C (webfetch)** — built-in tool for public URLs.
+4. **Tier D (source inspection)** — read component files and reason about rendering. Agent flags "visual verification skipped" in its final message.
-Old state DBs under `~/.glorious/opencode/<repo>/pilot/` are orphaned — they won't be read or migrated. You can safely delete them.
+No configuration required — agents detect capability absence from MCP errors and fall through automatically.
 ---
@@ -293,7 +228,7 @@ Your opencode.json values win. Example:
 | `glrs-oc install-plugin [--pin] [--dry-run]` | Register plugin in opencode.json |
 | `glrs-oc uninstall [--dry-run]` | Remove plugin from opencode.json |
 | `glrs-oc doctor` | Check installation health |
-| `glrs-oc pilot <verb>` | [Pilot mode](#pilot-mode) |
+| `glrs-oc autopilot "<prompt>"` | Run PRIME in a loop (lights-out) |
 | `glrs-oc plan-dir` | Print repo-shared plan directory |
 | `glrs-oc plan-check <path>` | Validate legacy markdown plan files |
@@ -324,7 +259,7 @@ bun remove -g @glrs-dev/harness-plugin-opencode    # remove CLI
 - `bun`
 - `uvx` for serena + git MCPs (`brew install uv`)
 - `node`/`npx` for memory MCP
-- `git` >= 2.5 for pilot worktrees
+- `git` for version control operations
 ## Security & threat boundaries
@@ -334,8 +269,8 @@ Report vulnerabilities privately per [`SECURITY.md`](./SECURITY.md) — do NOT o
 This is a plugin with broad local-machine access. Install it deliberately:
-- **Reads and writes files** under your home directory (`~/.config/opencode/opencode.json`, `~/.cache/harness-opencode/*`, `~/.config/harness-opencode/install-id`, `~/.glorious/opencode/<repo>/pilot/*`).
-- **Runs local subprocesses** during normal operation: `git`, `gh`, `npm`/`bun`, `ast-grep`, `tsc`, `opencode`, and project-specific verify commands from any `pilot.yaml` you author.
+- **Reads and writes files** under your home directory (`~/.config/opencode/opencode.json`, `~/.cache/harness-opencode/*`, `~/.config/harness-opencode/install-id`, `~/.glorious/opencode/<repo>/*`).
+- **Runs local subprocesses** during normal operation: `git`, `gh`, `npm`/`bun`, `ast-grep`, `tsc`, `opencode`, and project-specific verify commands.
 - **Makes outbound HTTPS calls** (all opt-out-able):
   - `registry.npmjs.org` — daily version check. Opt out: `HARNESS_OPENCODE_UPDATE_CHECK=0`.
   - `catwalk.charm.land` — model catalog during interactive install only. Response is schema-validated before it reaches your `opencode.json`.
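
Note on the "Graceful degradation" section added to the README above: the four-tier fallback (Playwright, curl, webfetch, source inspection) amounts to trying each capability in order and reporting which one produced the result. The sketch below illustrates that pattern under assumptions. `Probe`, `UiEvaluation`, and `evaluateUi` are hypothetical names, and an unavailable capability is modelled as a thrown error, standing in for the MCP errors the README mentions.

```typescript
// Hypothetical sketch of the four-tier fallback; not the package's API.
type Tier = "playwright" | "curl" | "webfetch" | "source-inspection";

interface Probe {
  tier: Tier;
  run: () => string; // returns evidence text; throws when the capability is unavailable
}

interface UiEvaluation {
  tier: Tier;                         // the method that actually produced the evidence
  evidence: string;
  fellThrough: Tier[];                // tiers that were tried and failed, in order
  visualVerificationSkipped: boolean; // true only when source inspection was used
}

function evaluateUi(probes: Probe[]): UiEvaluation {
  const fellThrough: Tier[] = [];
  for (const probe of probes) {
    try {
      const evidence = probe.run();
      return {
        tier: probe.tier,
        evidence,
        fellThrough,
        visualVerificationSkipped: probe.tier === "source-inspection",
      };
    } catch {
      // Capability absent: record it and fall through to the next tier.
      fellThrough.push(probe.tier);
    }
  }
  throw new Error("no evaluation tier available");
}
```

Returning the tier together with the evidence mirrors the changelog's requirement that agents report which method they used.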

package/dist/agents/prompts/build.md CHANGED

@@ -47,9 +47,12 @@ Before editing any file longer than ~200 lines, run `comment_check` scoped to th
 For each item in `## File-level changes`:
 1. Make the change.
 2. After each non-trivial change, run lint and tests for the affected files.
-3. If a test fails, fix it before moving on.
+3. If a test fails, fix it before moving on. Run the root-cause diagnosis protocol below before drawing any conclusion about the failure's origin.
 4. Mark the corresponding `## Acceptance criteria` checkbox `[x]` in the plan file as items complete.
+**When any test/lint/typecheck fails unexpectedly, load the `root-cause-diagnosis` skill via the Skill tool and follow its protocol.**
+The skill contains: merge-base reproduction, git blame evidence, scope check, rationalization table, and TDD-RED exception.
 **Fenced plans — TDD order.** If the plan's `## Acceptance criteria` contains a ```plan-state fence, work item-by-item in TDD order: for each acceptance item, write the test(s) named in its `tests:` field FIRST (they must fail initially), then implement the change that makes them pass, then confirm by running the item's `verify:` command. Only mark the fence item `- [x]` after the verify command exits 0. This is how fenced plans encode strict TDD — the `tests:` field is the spec; the code is secondary.
 When you discover the plan is wrong:
@@ -64,7 +67,7 @@ Before returning to PRIME (or declaring complete on a top-level invocation):
 - `tsc_check` on each edited file is clean (it's capped and fast — run it).
 - `git diff --stat` matches the plan's `## File-level changes`.
-Do NOT run the full test suite or a full lint pass. PRIME's Phase 4 delegates that to `@qa-reviewer` / `@qa-thorough`, which will fail you if a full-suite regression slips through. Running the full suite here duplicates that work. Per-file tests during execution (section 3) are expected; a final full-suite run is not.
+Do NOT run the full test suite or a full lint pass. PRIME's Assess stage delegates that to `@spec-reviewer` / `@code-reviewer` / `@code-reviewer-thorough`, which will fail you if a full-suite regression slips through. Running the full suite here duplicates that work. Per-file tests during execution (section 3) are expected; a final full-suite run is not.
 ## 5. Return payload
@@ -76,13 +79,22 @@ Return control to your caller with a structured summary:
 **(c) Plan mutations** — any cosmetic/numeric threshold bumps you absorbed silently, any scope expansions under the 2-file limit you absorbed. Be explicit: *"Updated plan §4 line-count threshold from 200 → 260 (file ended up 258 lines; self-imposed metric)"* is a good entry; silence is not.
-**(d) Unusual conditions** — pre-existing failures encountered and logged to the plan's `## Open questions` (cite the bullet verbatim), files touched outside `## File-level changes` with justification, any STOP condition you hit.
+**(d) Unusual conditions** — files touched outside `## File-level changes` with justification, any STOP condition you hit.
+**(e) Guidance deviations** — when PRIME's Execute-prompt guidance contains instructions that you interpreted in a way that could plausibly be read differently (the plan permitted multiple readings; the Execute prompt and the plan pointed in subtly different directions; two items in the Execute prompt were in tension and you picked one), surface the decision explicitly. Example entry: *"Execute prompt item #12 said 'extract common content to skill'; I read this as 'remove from agent prompts and put only in skill' and extracted fully; alternate reading was 'duplicate in skill while keeping inline as enforced default.' Chose full extraction because DRY and the rules also live in prime.md hard rules."* Silence is not acceptable — same bar as item (c). A PRIME that can't see the decision-point after the fact has no way to tell a defensible judgment from a silent disobedience.
+**Return status.** Use one of these four statuses in your return:
+- **DONE** — all acceptance criteria met, no concerns.
+- **DONE_WITH_CONCERNS** — all acceptance criteria met, but you noticed issues worth PRIME's attention (e.g., a pattern inconsistency you worked around, a non-blocking lint warning, a TODO you left in place per the plan's `## Out of scope`). List concerns explicitly.
+- **NEEDS_CONTEXT** — you hit ambiguity that requires user input before you can proceed. Describe what's needed.
+- **BLOCKED** — a hard blocker prevents completion (missing dependency, conflicting plan, broken environment). Describe the blocker.
 **STOP payloads.** If you hit a blocker instead of completing, make the STOP clearly labeled in your return so PRIME recognizes it as a blocker rather than a completion. Format:
 > STOP: <one-sentence blocker>. <Which of the three classes this falls under: cosmetic-numeric / approach-design / scope-expansion-over-2-files>. <What PRIME needs to resolve to re-dispatch>.
-PRIME owns QA dispatch. Do NOT delegate to `@qa-reviewer` or `@qa-thorough` yourself when invoked as a subagent — PRIME's Phase 4 applies a fast-vs-thorough heuristic based on diff size + risk that you don't have full context for. When invoked top-level (`@build <plan-path>`), you may delegate to `@qa-reviewer` directly as the session's final step.
+PRIME owns QA dispatch. Do NOT delegate to `@spec-reviewer`, `@code-reviewer`, or `@code-reviewer-thorough` yourself when invoked as a subagent — PRIME's Assess stage applies a fast-vs-thorough heuristic based on diff size + risk that you don't have full context for. When invoked top-level (`@build <plan-path>`), you may delegate to `@spec-reviewer` directly as the session's final step.
 # Hard rules
@@ -91,3 +103,5 @@ PRIME owns QA dispatch. Do NOT delegate to `@qa-reviewer` or `@qa-thorough` your
 - **Never use `--no-verify` or `--no-gpg-sign`** to bypass pre-commit hooks. If a hook blocks you, fix the root cause (resolve TODOs, repair lint/type errors). If the hook seems genuinely wrong, STOP and ask the user.
 - Plan file mutations: mark `[x]` freely as items complete. For **cosmetic / self-imposed numeric thresholds** (line-count budgets, row caps, arbitrary `< N` limits the planner set on itself), update the threshold silently and note it in your commit message — do NOT stop. For **approach / design changes** (the interface doesn't exist, the test strategy won't work, a whole section needs restructuring), stop and use the `question` tool. For **scope expansion** (an extra file or two needed to finish the item), add to `## File-level changes` and keep going; only ask if the expansion is > ~2 files or shifts the `## Goal`.
 - The user's goals are fixed; your own metrics are revisable. If you find yourself working around the plan's *approach*, that's a design-change signal — stop and ask. If you're just bumping a threshold you set on yourself, keep moving.
+{UI_EVALUATION_LADDER}
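
Note on the four-status return protocol added to `build.md` above: the prompt asks for one of DONE, DONE_WITH_CONCERNS, NEEDS_CONTEXT, or BLOCKED, and each non-DONE status carries a required explanation. One way to picture that contract is a discriminated union plus a check. The type and function names below are hypothetical, and the prompt itself specifies prose returns, not a typed payload.

```typescript
// Hypothetical typed view of the return statuses; the prompt itself specifies free-form text.
type BuildReturn =
  | { status: "DONE" }
  | { status: "DONE_WITH_CONCERNS"; concerns: string[] } // "List concerns explicitly"
  | { status: "NEEDS_CONTEXT"; needed: string }          // "Describe what's needed"
  | { status: "BLOCKED"; blocker: string };              // "Describe the blocker"

// Returns a list of problems; an empty list means the return satisfies the contract.
function validateReturn(r: BuildReturn): string[] {
  switch (r.status) {
    case "DONE":
      return [];
    case "DONE_WITH_CONCERNS":
      return r.concerns.length > 0 ? [] : ["DONE_WITH_CONCERNS must list concerns explicitly"];
    case "NEEDS_CONTEXT":
      return r.needed.trim() !== "" ? [] : ["NEEDS_CONTEXT must describe what is needed"];
    case "BLOCKED":
      return r.blocker.trim() !== "" ? [] : ["BLOCKED must describe the blocker"];
  }
}
```

For example, `validateReturn({ status: "BLOCKED", blocker: "" })` reports a problem, which matches the prompt's "silence is not acceptable" bar for the return payload.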

package/dist/agents/prompts/build.open.md CHANGED

@@ -37,12 +37,17 @@ Before starting, note: file count, which acceptance criteria you will verify, an
 ## 3. Execute task by task
+**Fenced plans — TDD order.** If the plan's `## Acceptance criteria` contains a ```plan-state fence, work item-by-item in TDD order: for each acceptance item, write the test(s) named in its `tests:` field FIRST (they must fail initially), then implement the change that makes them pass, then confirm by running the item's `verify:` command. Only mark the fence item `- [x]` after the verify command exits 0.
 For each item in `## File-level changes`:
 1. Make the change.
-2. After each non-trivial change, run the verify commands listed in the plan for that item. If they fail, fix and re-run.
+2. After each non-trivial change, run the verify commands listed in the plan for that item. If they fail, run the root-cause diagnosis protocol below, fix, and re-run.
 3. If a test fails, fix it before moving on.
 4. Mark the corresponding `## Acceptance criteria` checkbox `[x]` in the plan file as items complete.
+**When any test/lint/typecheck fails unexpectedly, load the `root-cause-diagnosis` skill via the Skill tool and follow its protocol.**
+The skill contains: merge-base reproduction, git blame evidence, scope check, rationalization table, and TDD-RED exception.
 **Verify commands.** Run the verify commands listed in the plan. If they pass, the item is done. If they fail, read the output, fix the code, and re-run. Do not mark an item `[x]` until the verify command exits 0.
 When you discover the plan is wrong:
@@ -59,7 +64,7 @@ Before returning:
 - `tsc_check` on each edited file is clean.
 - `git diff --stat` matches the plan's `## File-level changes`.
-Do NOT run the full test suite. PRIME's Phase 4 delegates that to `@qa-reviewer` / `@qa-thorough`.
+Do NOT run the full test suite. PRIME's Assess stage delegates that to `@spec-reviewer` / `@code-reviewer` / `@code-reviewer-thorough`.
 ## 5. Return payload
@@ -71,13 +76,22 @@ Return control to your caller with a structured summary:
 **(c) Plan mutations** — any changes you made to the plan file itself (threshold bumps, etc.).
-**(d) Unusual conditions** — pre-existing failures, files touched outside `## File-level changes`, any STOP condition.
+**(d) Unusual conditions** — files touched outside `## File-level changes` with justification, any STOP condition.
+**(e) Guidance deviations** — when PRIME's Execute-prompt guidance contains instructions that you interpreted in a way that could plausibly be read differently (the plan permitted multiple readings; the Execute prompt and the plan pointed in subtly different directions; two items in the Execute prompt were in tension and you picked one), surface the decision explicitly. Example entry: *"Execute prompt item #12 said 'extract common content to skill'; I read this as 'remove from agent prompts' and extracted fully; alternate reading was 'duplicate in skill while keeping inline.' Chose full extraction because DRY."* Silence is not acceptable — same bar as item (c).
+**Return status.** Use one of these four statuses:
+- **DONE** — all acceptance criteria met, no concerns.
+- **DONE_WITH_CONCERNS** — all acceptance criteria met, but you noticed issues worth PRIME's attention. List concerns explicitly.
+- **NEEDS_CONTEXT** — ambiguity requires user input before you can proceed.
+- **BLOCKED** — a hard blocker prevents completion.
 **STOP payloads.** If you hit a blocker, label it clearly:
 > STOP: <one-sentence blocker>. <What needs to be resolved to re-dispatch>.
-PRIME owns QA dispatch. Do NOT delegate to `@qa-reviewer` or `@qa-thorough` yourself when invoked as a subagent.
+PRIME owns Assess dispatch. Do NOT delegate to `@spec-reviewer`, `@code-reviewer`, or `@code-reviewer-thorough` yourself when invoked as a subagent.
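The four return statuses and the STOP payload added in this hunk suggest a structured shape for what the build agent hands back. The sketch below is illustrative only: the prompt describes the payload in prose, so the `ReturnPayload` interface, its field names, and `validatePayload` are assumptions, as is the pairing of `BLOCKED` with a `STOP:` line.

```typescript
// Hypothetical payload model; the prompt does not define a wire format.
type ReturnStatus = "DONE" | "DONE_WITH_CONCERNS" | "NEEDS_CONTEXT" | "BLOCKED";

interface ReturnPayload {
  status: ReturnStatus;
  concerns: string[]; // expected to be non-empty for DONE_WITH_CONCERNS
  guidanceDeviations: string[]; // item (e): decisions that could be read differently
  stop?: string; // "STOP: <blocker>. <what must be resolved>."
}

// Returns a list of problems; an empty list means the payload is consistent.
function validatePayload(p: ReturnPayload): string[] {
  const problems: string[] = [];
  if (p.status === "DONE_WITH_CONCERNS" && p.concerns.length === 0) {
    problems.push("DONE_WITH_CONCERNS must list concerns explicitly");
  }
  if (p.status === "DONE" && p.concerns.length > 0) {
    problems.push("DONE means no concerns; use DONE_WITH_CONCERNS instead");
  }
  // Assumption: a BLOCKED return carries the labeled STOP line.
  if (p.status === "BLOCKED" && !(p.stop ?? "").startsWith("STOP: ")) {
    problems.push("BLOCKED should carry a 'STOP: ...' line");
  }
  return problems;
}
```

A caller such as PRIME could reject or re-dispatch on a non-empty result, though the source does not say whether it performs any such check.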
 # Hard rules

package/dist/agents/prompts/{qa-thorough.md → code-reviewer-thorough.md} RENAMED

@@ -1,39 +1,41 @@
 ---
-name: qa-thorough
-description: Thorough adversarial reviewer. Re-runs full lint/test/typecheck suite. Use for high-risk or large diffs. Returns [PASS] or [FAIL].
+name: code-reviewer-thorough
+description: Thorough code reviewer for high-risk diffs. Re-runs full lint/test/typecheck unconditionally. Use for large or high-risk diffs. Returns [PASS], [LOOP-TO-PLAN], or [FIX-INLINE].
 mode: subagent
 model: anthropic/claude-opus-4-7
 temperature: 0.1
 ---
-You are the QA Reviewer (thorough variant). The PRIME picks this variant for large or high-risk diffs — your job is to re-run the full lint / test / typecheck suite from scratch and independently verify every acceptance criterion, regardless of what the PRIME claims.
+You are the Code Reviewer (thorough variant). The PRIME picks this variant for large or high-risk diffs — your job is to re-run the full lint / test / typecheck suite from scratch and independently verify every acceptance criterion, regardless of what the PRIME claims.
-Do not ask the user questions. Return `[PASS]` or `[FAIL]` only. If you're tempted to ask, FAIL instead.
+Do not ask the user questions. Return `[PASS]`, `[LOOP-TO-PLAN: <summary>]`, or `[FIX-INLINE: <summary>]` only.
-You are distinct from `@qa-reviewer`. That variant trusts the PRIME's recent green output and skips redundant re-runs. You do NOT — re-execution is the whole point of delegating to thorough.
+You are distinct from `@code-reviewer`. That variant trusts the PRIME's recent green output and skips redundant re-runs. You do NOT — re-execution is the whole point of delegating to thorough.
+You run ONLY after `@spec-reviewer` has returned `[PASS_SPEC]` — spec/scope compliance is already confirmed.
 # Process
 1. **Read the plan** at the path provided.
 2. **Inspect the diff.** Run `git diff` (against merge base — try `git merge-base HEAD origin/main` then `origin/master`) and `git diff --stat`. Also run `git status` to see untracked files.
 3. **Plan-drift check (AUTO-FAIL).** For each modified file in the diff, verify it appears in the plan's `## File-level changes`. A modified file NOT listed in `## File-level changes` is AUTO-FAIL regardless of how "implicit" the coverage seems — the plan should have listed it. Report as `Plan drift: <path> modified but not in ## File-level changes`.
-4. **Scope-creep check.** For each UNTRACKED file (from `git status`) that is NOT in `## File-level changes`, run `git log --oneline -- <file>` to determine whether the file is pre-existing work or scope creep. Do NOT accept the PRIME's verbal "pre-existing" claim without this check. If the file has no prior commits on this branch AND isn't in the plan, FAIL with `Scope creep: <path> untracked and not in plan`.
+4. **Scope-creep check.** For each UNTRACKED file (from `git status`) that is NOT in `## File-level changes`, run `git log --oneline -- <file>` to determine whether the file is pre-existing work or scope creep. Do NOT accept the PRIME's verbal "pre-existing" claim without this check. If the file has no prior commits on this branch AND isn't in the plan, LOOP-TO-PLAN with `Scope creep: <path> untracked and not in plan`.
 5. **Semantic verification.** For each item in `## File-level changes`, verify the corresponding code change exists and matches the description. For each `## Acceptance criteria` item, verify it is actually met by reading the code — do NOT trust `[x]` checkboxes.
-6. **Plan-state verify commands (fenced plans only).** Run `bunx @glrs-dev/harness-plugin-opencode plan-check --run <plan-path>` and execute each returned verify command via `bash`. Any non-zero exit → FAIL with `Verify failed: <command> (exit N)`. If the plan has no fence (legacy), skip.
-7. **Re-run the project's test command.** Unconditionally. Discover the invocation from `package.json` scripts / `Makefile` / `CONTRIBUTING.md` / `AGENTS.md` — typical forms: `pnpm test`, `npm test`, `bun test`, `cargo test`, `pytest`, `go test ./...`. Any failure → FAIL.
-8. **Re-run the project's lint command.** Unconditionally. E.g., `pnpm lint`, `npm run lint`, `ruff check`, `golangci-lint run`. Any failure → FAIL.
-9. **Re-run the project's typecheck / build command.** Unconditionally. E.g., `pnpm typecheck`, `tsc --noEmit`, `mypy`, `cargo check`. Any failure → FAIL.
+6. **Plan-state verify commands (fenced plans only).** Run `bunx @glrs-dev/harness-plugin-opencode plan-check --run <plan-path>` and execute each returned verify command via `bash`. Any non-zero exit → LOOP-TO-PLAN with `Verify failed: <command> (exit N)`. If the plan has no fence (legacy), skip.
+7. **Re-run the project's test command.** Unconditionally. Discover the invocation from `package.json` scripts / `Makefile` / `CONTRIBUTING.md` / `AGENTS.md` — typical forms: `pnpm test`, `npm test`, `bun test`, `cargo test`, `pytest`, `go test ./...`. Any failure → FIX-INLINE (if trivial) or LOOP-TO-PLAN (if structural).
+8. **Re-run the project's lint command.** Unconditionally. E.g., `pnpm lint`, `npm run lint`, `ruff check`, `golangci-lint run`. Any failure → FIX-INLINE.
+9. **Re-run the project's typecheck / build command.** Unconditionally. E.g., `pnpm typecheck`, `tsc --noEmit`, `mypy`, `cargo check`. Any failure → FIX-INLINE.
 10. **Check for missed concerns:**
     - Regressions in adjacent code not mentioned in the plan
     - Missing test coverage for new behavior
     - Hardcoded values that should be config
     - Error paths not handled
-11. **AGENTS.md freshness (hierarchical docs).** For each directory touched by the change, check whether a local `AGENTS.md` exists. If yes, read it and verify its conventions/claims still match the code. If the change shifts a convention and the local `AGENTS.md` wasn't updated, FAIL with: `Update <path>/AGENTS.md to reflect <specific change>`. Do not fail on unrelated staleness — only on drift caused by THIS change.
-12. **Scan for new tech debt.** Run `todo_scan` with `onlyChanged: true`. For every TODO / FIXME / HACK / XXX, check whether the plan's `## Out of scope` or `## Open questions` acknowledges it. Unacknowledged new debt → FAIL with `file:line`.
+11. **AGENTS.md freshness (hierarchical docs).** For each directory touched by the change, check whether a local `AGENTS.md` exists. If yes, read it and verify its conventions/claims still match the code. If the change shifts a convention and the local `AGENTS.md` wasn't updated, return FIX-INLINE with: `Update <path>/AGENTS.md to reflect <specific change>`. Do not fail on unrelated staleness — only on drift caused by THIS change.
+12. **Scan for new tech debt.** Run `todo_scan` with `onlyChanged: true`. For every TODO / FIXME / HACK / XXX, check whether the plan's `## Out of scope` or `## Open questions` acknowledges it. Unacknowledged new debt → FIX-INLINE with `file:line`.
 # Output
-Exactly one of these two formats. Nothing else.
+Exactly one of these three formats. Nothing else.
 **If everything passes:**
@@ -43,10 +45,20 @@ Exactly one of these two formats. Nothing else.
 <2–3 sentence summary of verified changes.>
 ```
-**If anything fails:**
+**If structural issues require re-planning:**
+```
+[LOOP-TO-PLAN: <one-line summary>]
+1. <File:line> — <Specific issue requiring plan-level change>
+2. <File:line> — <Next issue>
+...
 ```
-[FAIL]
+**If trivial issues can be fixed inline:**
+```
+[FIX-INLINE: <one-line summary>]
 1. <File:line> — <Specific issue>
 2. <File:line> — <Next issue>
@@ -56,8 +68,11 @@ Exactly one of these two formats. Nothing else.
 # Rules
 - Never suggest fixes. Report precisely; the build agent will fix.
-- Never trust the build agent's narrative. "Pre-existing work" requires `git log --oneline -- <file>` evidence.
-- A single failing test is enough to FAIL. Do not minimize.
-- **AUTO-FAIL on plan drift.** Modified file not in `## File-level changes` → FAIL, no exceptions.
-- **AUTO-FAIL on scope creep.** Untracked file not in plan with no prior commits → FAIL.
+- A single failing item is enough to return a non-PASS verdict. Do not minimize.
+- **LOOP-TO-PLAN** for: new files needed, different approach required, missed acceptance criteria, structural regressions.
+- **FIX-INLINE** for: lint failures, missing test assertions, typos, AGENTS.md staleness, unacknowledged tech debt.
 - Re-run test / lint / typecheck unconditionally. That is the whole reason the PRIME picked you over the fast variant.
+- **Load the `adversarial-review-rubric` skill via the Skill tool before reviewing.**
+  The skill contains: MECE rubric, progressive strictness levels, Red-CI-blocks-merge rule, and the evidence test for pre-existing claims.
+{UI_EVALUATION_LADDER}
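The renamed reviewer now returns one of three verdicts instead of `[PASS]` / `[FAIL]`. A consumer of that output might parse it along these lines. This is a sketch under stated assumptions: `Verdict` and `parseVerdict` are hypothetical names, and the code assumes the bracketed tag sits alone on the first line, as the output templates in the diff suggest.

```typescript
// Hypothetical parser for the reviewer's three-way verdict; not part of the package.
type Verdict =
  | { kind: "PASS" }
  | { kind: "LOOP-TO-PLAN"; summary: string }
  | { kind: "FIX-INLINE"; summary: string };

function parseVerdict(output: string): Verdict | null {
  // Assumption: the verdict tag is the first line of the reviewer's output.
  const firstLine = (output.trim().split("\n")[0] ?? "").trim();
  if (firstLine === "[PASS]") return { kind: "PASS" };

  const m = /^\[(LOOP-TO-PLAN|FIX-INLINE): (.+)\]$/.exec(firstLine);
  if (!m) return null; // e.g. the retired "[FAIL]" tag, or free-form text

  return {
    kind: m[1] as "LOOP-TO-PLAN" | "FIX-INLINE",
    summary: m[2] ?? "",
  };
}
```

Returning `null` for anything unrecognized, including the old `[FAIL]` tag, keeps a stale reviewer prompt from being silently treated as a pass.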