npm - @mmerterden/multi-agent-pipeline - Versions diffs - 10.7.3 → 10.8.0 - Mend

@mmerterden/multi-agent-pipeline 10.7.3 → 10.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (58) hide show

package/pipeline/commands/multi-agent/refs/features/verify-by-test.md ADDED Viewed

@@ -0,0 +1,41 @@
+# Feature: Verify-by-Test Triage (Phase 4 Step 3.7)
+**Pattern**: reviewer findings are hypotheses; a failing repro test is proof. Adversarial-review research (Refute-or-Promote, 2026) found that plausible-but-wrong findings survive debate rounds but die on a single empirical test. This step converts the highest-stakes verdicts (accepted blocking) from judgment into evidence before the Phase 3 rework loop fires.
+**Gated by `prefs.global.verifyByTest.enabled`** (default: `false`). When enabled, after triage 3.6 and before Step 4, IF the validated triage output contains at least one `accepted` blocking finding:
+1. Dispatch ONE verifier sub-agent for the iteration (model: `verifyByTest.model`, default `sonnet`)  -  never one dispatch per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings, the diff hunks for their files, and the Phase 1 test conventions.
+2. Per finding, the verifier writes ONE minimal repro test asserting the correct behavior the finding claims is broken, then runs ONLY that test via the Phase 3 single-test invocation (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`) under `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
+3. Stamp each processed finding with a `verification` object (triage-output schema v3.2.0) and re-run `validate-triage.mjs` on the mutated triage file under the standard 3.2.1 gate protocol.
+4. Findings beyond `maxFindings` keep their judgment-only verdict (log `verify_by_test=cap-exceeded`).
+5. The whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600); on breach or verifier crash, remaining findings keep judgment-only verdicts and the pipeline proceeds. Never blocks.
+## Verdict table
+| Repro test outcome | `verification.result` | Action |
+|---|---|---|
+| Fails as the finding predicts | `confirmed` | Finding stays accepted blocking. Repro test KEPT in the worktree, recorded in `redTests[]`. |
+| Passes (not reproducible) | `not-reproduced` | Downgrade gated by `evidence-gate.mjs --claim test --status passed` on the test log (exit 0 required). Finding moves `accepted[]` -> `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`; repro test file deleted. Evidence-gate failure -> treat as `inconclusive`. |
+| Compile error / timeout / not unit-testable | `inconclusive` | Finding stays accepted blocking (judgment stands). Partial test deleted, cause in `verification.note`. |
+Downgrades go to `deferred`, never `rejected`: triage judged the issue real, and deferred items surface in the Phase 7 report for a human eye.
+## Red-test handoff to Phase 3
+`state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`. On the Phase 3 re-entry, each `redTests[]` entry IS the RED step of the rework TDD cycle for its finding: the reflection prompt instructs "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it." The dev agent must not write a duplicate failing test for those findings.
+## Cleanup invariant
+After Step 3.7, the only uncommitted verifier artifacts are the confirmed repro tests listed in `redTests[]` (committed later with the fix) and logs under `$WORKTREE/.pipeline/` (outside Phase 6 commit scope). `not-reproduced` and `inconclusive` test files are always deleted.
+## Telemetry
+One `review.verify_by_test` metric per iteration: `attempted`, `confirmed`, `downgraded`, `inconclusive`, `duration_ms` (plus `tokens_in/out` when available), forwarded to the tracker like all Phase 4 metrics. Timeout emits `triage=verify-by-test-timeout`.
+## Off by default reason
+Adds one Sonnet call plus up to `maxFindings` single-test runs (and build-lock contention on Xcode projects) per review iteration that has accepted blockers. On clean runs it never fires, but on noisy-reviewer repos it can add minutes per iteration. Flip on for security-critical work, release branches, or repos where reviewer false-positive rate is high.
+## Reference
+Wiring: `refs/phases/phase-4-review.md` Step 3.7. Schema: `pipeline/schemas/triage-output.schema.json` v3.2.0 (`$defs.verification`). Evidence gate: `pipeline/scripts/evidence-gate.mjs`. Prefs: `prefs.global.verifyByTest` in `pipeline/schemas/prefs.schema.json`.

package/pipeline/commands/multi-agent/refs/knowledge.md CHANGED Viewed

@@ -88,7 +88,7 @@ Phase 4: Review (CLI-aware parallel + triage)
     |-- Reviewer (gpt-5.4) -> edge cases, cross-provider diversity
     +-- Reviewer (sonnet)  -> quality + correctness + naming
        Context for ALL: Phase 1 files + Phase 2 decisions + Phase 3 diff
-       Then: single Opus triage pass filters false-positives
+       Then: single Fable triage pass filters false-positives
 ```
 ### Context-Aware Agent Prompting

package/pipeline/commands/multi-agent/refs/phases/log-format.md CHANGED Viewed

@@ -72,6 +72,16 @@ Verdict: unverified (2 reviewers)
 ## Test Scenarios (Jira)
 (User perspective: Precondition -> Steps -> Expected result)
+## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
+(v10.8.0, appended chronologically by the phase-boundary checkpoint  -  one block per phase transition, the LATEST block is authoritative. Written by the orchestrator from state it already holds; no LLM call, ~15 lines max. `resume.md` Step 3 and post-`/compact` re-grounding read this block FIRST. Format spec: `refs/phases/operations.md`.)
+- Done: {up to 3 bullets of completed outcomes}
+- Remaining: {ordered list of remaining phases / sub-steps}
+- Decisions: {key decisions later phases depend on}
+- Open findings: {accepted-but-unresolved review findings, or "none"}
+- Next: Phase {N+1} {name}, subStep {token or "start"}
 ```
 ### Cost Breakdown  -  emission contract
@@ -91,7 +101,7 @@ Every phase that dispatches a billable LLM agent MUST forward its token totals t
 ```bash
 pipeline/scripts/log-metric.sh "$TASK_ID" <phase-id> <event> \
-  model=<opus|sonnet|haiku|gpt-5.4> tokens_in=$IN tokens_out=$OUT tokens_cached=$CACHED duration_ms=$DUR
+  model=<fable|opus|sonnet|haiku|gpt-5.4> tokens_in=$IN tokens_out=$OUT tokens_cached=$CACHED duration_ms=$DUR
 LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" <phase-id> tokens \
   model=<...> tokens_in=$IN tokens_out=$OUT tokens_cached=$CACHED
 ```

package/pipeline/commands/multi-agent/refs/phases/modes.md CHANGED Viewed

@@ -85,7 +85,7 @@ Phase 0: Init -> Phase 3: Dev (self-contained) -> Phase 5: Test -> Phase 6: Comm
 | Phase 1 (Analysis)  | Parallel Explore agents                         | **SKIP**                                                                           | **SKIP**                                     | **SKIP**                                      | **SKIP**                                      |
 | Phase 2 (Planning)  | TaskCreate + architecture review + **Plan Approval Gate** (clarification max 2 rounds + approval loop) | **SKIP** (plan gate not applicable  -  fast path) | **SKIP** | **SKIP** | **SKIP** |
 | Phase 3 (Dev)       | Follows Phase 2 plan, TDD cycle (Sonnet)        | **Self-contained** (Opus): agent scans relevant files, implements with TDD, builds | Same as `--dev`                              | Same as `--dev`                               | Same as `--dev`                               |
-| Phase 4 (Review)    | Parallel review + Opus triage (Claude: 2-model / Copilot: 3-model) | **SKIP**                                                       | **SKIP**                                     | **SKIP**                                      | **SKIP**                                      |
+| Phase 4 (Review)    | Parallel review + Fable triage (Claude: 2-model / Copilot: 3-model) | **SKIP**                                                       | **SKIP**                                     | **SKIP**                                      | **SKIP**                                      |
 | Phase 5 (User Test) | Interactive prompt (ask user to test)           | **Interactive prompt** (ask user to test)                                          | **Skip** (autopilot suppresses interactive prompts) | **Interactive prompt** (same as `--dev`)      | **Skip** (same as `--dev autopilot`)          |
 | Phase 6 (Commit)    | Commit + PR                                     | Same  -  still asks (unless autopilot)                                               | Auto commit + push + PR                      | Same as `--dev`                               | Auto commit + push + PR                       |
 | Phase 7 (Report)    | Full report + channels multi-select             | Simplified  -  no review/analysis sections, BUT **channels menu still pauses** | Channels menu **STILL PAUSES** | Same as `--dev` | Channels menu **STILL PAUSES** |

package/pipeline/commands/multi-agent/refs/phases/operations.md CHANGED Viewed

@@ -93,8 +93,21 @@ This keeps orchestrator context lean and enables programmatic routing.
 **Proactive compaction + phase-boundary checkpoint**: the orchestrator follows ~2,500 lines of phase prose in one session; once context fills, it starts dropping steps  -  the single biggest cause of "it got stuck / skipped a step." Two defenses, both required on full-pipeline runs:
-- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a one-line progress summary to `agent-log.md`. The next phase reads state + log, not the back-conversation  -  so a transition is a clean re-entry point even if context is later compacted.
-- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit  -  it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` to re-ground.
+- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a structured handoff block to `agent-log.md` (format below). The next phase reads state + log, not the back-conversation  -  so a transition is a clean re-entry point even if context is later compacted.
+- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit  -  it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` AND the latest `## Handoff` block in `agent-log.md` to re-ground.
+**Handoff block (v10.8.0)**: the structured artifact the phase-boundary checkpoint appends to `agent-log.md`. Written by the orchestrator from state it already holds  -  no agent dispatch, no extra LLM call. Cap at ~15 lines; the latest block is authoritative (earlier ones are history). This is the fresh-context re-entry contract: a resume or post-compaction session rebuilds working context from the latest handoff + `agent-state.json` + git log, never from conversation memory.
+```markdown
+## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
+- Done: {up to 3 bullets of completed outcomes, e.g. "plan approved, 4 tasks", "build green, 12 tests added"}
+- Remaining: {ordered list of remaining phases / sub-steps}
+- Decisions: {key decisions later phases depend on, e.g. "used existing KeychainStore, no new wrapper"}
+- Open findings: {accepted-but-unresolved review findings, or "none"}
+- Next: Phase {N+1} {name}, subStep {token or "start"}
+```
+Full `agent-log.md` shape: `refs/phases/log-format.md`. Resume-side consumption: `resume.md` Step 3 reads the latest handoff FIRST, then falls back to per-phase findings for logs written before v10.8.
 **Sub-step checkpoints (long phases)**: Phase 3 (dev/TDD cycles) and Phase 7 (report/channels) can run many minutes; a crash mid-phase loses everything since the last phase boundary and forces a full phase re-run on resume. For these phases, also record `state.phases[<n>].subStep` (a short token: `red`, `green`, `build`, `pr-opened`, `confluence-synced`, ...) and the `files[]` written so far after each meaningful unit of work. On resume, re-enter the phase but skip units whose `subStep` is already recorded and whose `files[]` exist in the worktree  -  re-do only the unfinished tail, never the whole phase.

package/pipeline/commands/multi-agent/refs/phases/phase-1-analysis.md CHANGED Viewed

@@ -1,6 +1,6 @@
-### Phase 1: Analysis (Opus)
+### Phase 1: Analysis (Fable)
-> **TLDR**  -  Opus-driven codebase exploration. Detects if the issue is already fixed (git blame, closed PRs), then launches parallel Explore sub-agents to map the affected code paths. Outputs: impact analysis, stack detection (auto-selects platform guide), relevant files, risk areas. Feeds Phase 2 planning.
+> **TLDR**  -  Fable-driven codebase exploration (Opus when the fallback ladder engages). Detects if the issue is already fixed (git blame, closed PRs), then launches parallel Explore sub-agents to map the affected code paths. Outputs: impact analysis, stack detection (auto-selects platform guide), relevant files, risk areas. Feeds Phase 2 planning.
 <!-- progress-contract: applied -->
 Progress emission per `refs/progress-contract.md`  -  lines for each Explore dispatch, each finish, analyst synthesis start, `analysis.json` write.

package/pipeline/commands/multi-agent/refs/phases/phase-2-planning.md CHANGED Viewed

@@ -1,6 +1,6 @@
-### Phase 2: Planning (Opus)
+### Phase 2: Planning (Fable)
-> **TLDR**  -  Opus decomposes the analysis into concrete tasks with file-level targets, risk grading, and architecture review. Before Phase 3 a **Plan Approval Gate** runs in normal mode: if the Jira/issue description is ambiguous the orchestrator asks the user structured clarification questions (max 2 rounds)  -  once scope is clear it renders the plan and loops on free-text edit requests until the user approves or aborts. The gate is **skipped entirely** for `--dev`, `autopilot`, and `--dev autopilot` (their speed/zero-interaction contracts are preserved).
+> **TLDR**  -  Fable decomposes the analysis (Opus when the fallback ladder engages) into concrete tasks with file-level targets, risk grading, and architecture review. Before Phase 3 a **Plan Approval Gate** runs in normal mode: if the Jira/issue description is ambiguous the orchestrator asks the user structured clarification questions (max 2 rounds)  -  once scope is clear it renders the plan and loops on free-text edit requests until the user approves or aborts. The gate is **skipped entirely** for `--dev`, `autopilot`, and `--dev autopilot` (their speed/zero-interaction contracts are preserved).
 <!-- progress-contract: applied -->
 Progress emission per `refs/progress-contract.md`  -  lines for plan-draft start, clarification-ask, clarification-answer, plan render, plan-edit-request, plan-approved, plan-aborted.

package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md CHANGED Viewed

@@ -51,7 +51,7 @@ For non-component taskTypes (`bugfix`, `feature`, `refactor`, `chore`), continue
 #### Re-entry from Phase 4 triage
-Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings  -  Opus triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
+Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings  -  Fable triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
 When re-entering from Phase 4:
@@ -94,6 +94,7 @@ For each task (respecting dependency order):
 3. **TDD cycle** (Launch Agent with `model: "sonnet"`):
    **RED  -  Write ONE failing test first:**
+   - **Rework re-entry**: if `state.reviewIterations[-1].verifyByTest.redTests[]` exists (Phase 4 Step 3.7 ran), those failing repro tests ARE the RED step for their findings  -  make them green, do not write a duplicate failing test and do not delete or weaken them. See `refs/features/verify-by-test.md`.
    - Test framework: use whatever the project already uses. Detect from existing test files:
      - `@Test` / `#expect` → Swift Testing
      - `XCTestCase` / `XCTAssert` → XCTest
@@ -115,6 +116,7 @@ For each task (respecting dependency order):
    **GREEN  -  Minimal code to pass:**
    - Smallest change that makes the test green  -  no extras
+   - **Existing tests are immutable**: deleting, renaming, skipping, or weakening an existing assertion to reach green is a violation. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. (Deterministic backstop: the `test_lines_removed` signal in Phase 4 Step 1.75 flags test files that shrink.)
    - Run the same test again → must PASS
    - Run full test suite → no regressions:
      ```bash

package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md CHANGED Viewed

@@ -1,6 +1,6 @@
 ### Phase 4: Review (deterministic gates + parallel + triage)
-> **TLDR**  -  Three-stage review. Stage 1: deterministic gates (build + lint + test + secret scan) that MUST pass. Stage 2: AI models in parallel  -  reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers (Opus + Sonnet); Copilot CLI dispatches 3 reviewers (GPT-5.4 + Opus + Sonnet). Stage 3: Opus triage  -  evaluates raw findings, filters false-positives/out-of-scope, keeps only actionable items. Only triage-accepted blocking items loop back to Phase 3.
+> **TLDR**  -  Three-stage review. Stage 1: deterministic gates (build + lint + test + secret scan) that MUST pass. Stage 2: AI models in parallel  -  reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers (Fable + Sonnet); Copilot CLI dispatches 3 reviewers (GPT-5.4 + Opus + Sonnet — Fable 5 is not offered on Copilot CLI). Stage 3: Fable triage (Opus on Copilot CLI)  -  evaluates raw findings, filters false-positives/out-of-scope, keeps only actionable items. Only triage-accepted blocking items loop back to Phase 3.
 <!-- progress-contract: applied -->
 Progress emission per `refs/progress-contract.md`  -  lines for each gate, each reviewer dispatch + finish, triage start, triage verdict, fix dispatch.
@@ -127,6 +127,7 @@ echo "$RISK_JSON" | node pipeline/scripts/validate-diff-risk.mjs - >/dev/null 2>
 | `migration` | 4.0 | path matches DB schema / migration glob (.sql, /Migrations/, alembic/, prisma/migrations) |
 | `public_api` | 2.0 | added line declares `public func/class/struct/enum`, `@objc`, `open fun`, `@Composable`, or `export function/class/...` |
 | `no_test_change` | 2.5 | source file changed, no paired test file (`{Base}Tests.{ext}`, `{Base}.test.{ext}`, etc.) appears in the diff |
+| `test_lines_removed` | 3.0 | test-classified file whose diff removes more lines than it adds (immutable-test backstop, see `refs/rules.md`) |
 | `complexity_delta` | 1.5 | added control-flow tokens (`if`/`guard`/`switch`/`while`/`for`/ternary/`&&`/`\|\|`) |
 | `ui_critical` | 1.5 | path matches `*View.swift`, `*Screen.kt`, `*Configuration.swift`, etc. |
 | `loc_changed` | 1.0 | base sensitivity to total `+/-` lines |
@@ -181,17 +182,17 @@ Launch Agent instances **in parallel** using the shared `code-reviewer` subagent
 | Reviewer   | subagent_type     | Model               | Focus                             | Skills Referenced                             | Where it runs        |
 | ---------- | ----------------- | ------------------- | --------------------------------- | --------------------------------------------- | -------------------- |
-| Reviewer 1 | `code-reviewer`   | `claude-opus-4.6`   | Deep security + architecture      | `api-security-best-practices`, `architecture` | Both CLIs            |
+| Reviewer 1 | `code-reviewer`   | `claude-fable-5` (Claude Code) / `claude-opus-4-8` (Copilot CLI) | Deep security + architecture      | `api-security-best-practices`, `architecture` | Both CLIs            |
 | Reviewer 2 | `code-reviewer`   | `gpt-5.4`           | Edge cases, different perspective | cross-model diversity                         | **Copilot CLI only** |
-| Reviewer 3 | `code-reviewer`   | `claude-sonnet-4.6` | Quality + correctness + naming    | `clean-code`, stack-specific skill            | Both CLIs            |
+| Reviewer 3 | `code-reviewer`   | `claude-sonnet-4-6` | Quality + correctness + naming    | `clean-code`, stack-specific skill            | Both CLIs            |
 Each reviewer inherits the `code-reviewer` agent's focus areas (Security, Architecture, Quality, Performance) and output contract. The orchestrator overrides only the model and the stack-specific skill per-reviewer  -  no prompt duplication.
-**Model override wiring:** `code-reviewer.md` declares `preferredModel: fable`, so Reviewer 1 uses the persona default (Fable 5). Reviewer 2 (Copilot-only, `gpt-5.4`) and Reviewer 3 (`claude-sonnet-4.6`) set `PHASE_MODEL_OVERRIDE=<model>` before dispatch  -  the orchestrator exports `CLAUDE_CODE_SUBAGENT_MODEL` on Claude Code, or passes `--model` on Copilot CLI. Full precedence rule: `skills/shared/core/multi-agent/SKILL.md#agent-dispatch--per-persona-model-routing-v610`. Fable dispatches are subject to the fallback contract (`refs/features/model-fallback.md`): dispatch-error retry walks `fable -> opus -> sonnet` and budget-ceiling downgrade.
+**Model override wiring:** `code-reviewer.md` declares `preferredModel: fable`, so Reviewer 1 uses the persona default (Fable 5). Reviewer 2 (Copilot-only, `gpt-5.4`) and Reviewer 3 (`claude-sonnet-4-6`) set `PHASE_MODEL_OVERRIDE=<model>` before dispatch  -  the orchestrator exports `CLAUDE_CODE_SUBAGENT_MODEL` on Claude Code, or passes `--model` on Copilot CLI. Full precedence rule: `skills/shared/core/multi-agent/SKILL.md#agent-dispatch--per-persona-model-routing-v610`. Fable dispatches are subject to the fallback contract (`refs/features/model-fallback.md`): dispatch-error retry walks `fable -> opus -> sonnet` and budget-ceiling downgrade.
 **Stack-specific skills loaded per reviewer** (from Phase 1 `detectedStack`). On Claude Code, Reviewer 2 (GPT-5.4) is not dispatched  -  its skill column is ignored. On Copilot CLI all three columns are used.
-| Stack | Reviewer 1 (Opus) | Reviewer 2 (GPT-5.4  -  Copilot CLI only) | Reviewer 3 (Sonnet) |
+| Stack | Reviewer 1 (Fable / Opus on Copilot) | Reviewer 2 (GPT-5.4  -  Copilot CLI only) | Reviewer 3 (Sonnet) |
 |-------|-------------------|-----------------------------------------|---------------------|
 | iOS/Swift | `ios-security`, `swiftui-performance`, `hig-patterns` | `swift-concurrency`, `ios-accessibility` | `swiftui-pro`, `swift-testing` |
 | Android/Kotlin | `android-security`, `android-performance` | `compose-testing`, `android-architecture` | `compose-components`, `kotlin-coroutines-expert` |
@@ -204,11 +205,13 @@ Skills are injected into reviewer prompt context  -  the reviewer uses them as r
 **iOS/Swift  -  interaction & convention skills (conditional).** When the diff touches SwiftUI UI files (`*View.swift`, `*Screen.swift`, `*Configuration.swift`, `*+Modifiers.swift`), additionally inject the relevant `figma-common` convention skills as reference for the iOS reviewers: `figma-navigation`, `figma-overlays`, `figma-bottom-sheets` (interaction: emit-intent vs self-route/self-present; native-SwiftUI-first vs the project's `ui.*` custom system), and the enriched `figma-to-swiftui` accessibility rules (minimalism). These back the Step 1.5 iOS convention checks. Generic across SwiftUI projects  -  not tied to any one app. Omit when the diff has no SwiftUI UI changes (keeps the reviewer prompt lean).
-**Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry the Opus reviewer once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
+**Module review guides (conditional, all stacks).** A module in the repo may carry its own CLAUDE guide  -  a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths: for each changed file, walk its directory chain up to the repo root and collect `CLAUDE.md`, `*-CLAUDE.md`, and `AGENTS.md` files (root-level ones excluded  -  the host CLI already loads those). Dedupe, cap at 5 (log any dropped). Inject into every reviewer prompt with the directive: read each guide before reviewing and apply its rules/checklist to the changed files under its directory  -  a guide governs only its own subtree, and guide violations are findings triaged by severity like any other. No guides found → no-op, prompt stays lean. Same contract as `/multi-agent:review` Step 2b.
+**Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry Reviewer 1 once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
 #### Output contract  -  reviewer step
-Step 2 produces N reviewer-output objects (one per dispatched reviewer), each conforming to `pipeline/schemas/reviewer-output.schema.json`. They are persisted to `state.reviewIterations[<iteration>].reviewers[]` and consumed by Step 3 (Opus triage)  -  never by Phase 6 directly. The triage step (below) is the producer of the only review artifact Phase 6 reads, conforming to `pipeline/schemas/triage-output.schema.json`.
+Step 2 produces N reviewer-output objects (one per dispatched reviewer), each conforming to `pipeline/schemas/reviewer-output.schema.json`. They are persisted to `state.reviewIterations[<iteration>].reviewers[]` and consumed by Step 3 (Fable triage)  -  never by Phase 6 directly. The triage step (below) is the producer of the only review artifact Phase 6 reads, conforming to `pipeline/schemas/triage-output.schema.json`.
 **Subagent return format**  -  each reviewer returns JSON conforming to `pipeline/schemas/reviewer-output.schema.json`:
@@ -248,9 +251,11 @@ Exit 0 = valid. Exit 2 = contradiction (approved=true with blocking findings)  -
 **Off by default reason:** mixed-verdict cases are ~8% of runs in practice; the extra ~$0.20-$0.50 per run isn't worth automating for users who'd rather let triage resolve it cleanly. Users with high-stakes tasks (security-critical, release branches) can flip the flag.
-#### Step 3  -  Opus Triage (filter before acting)
+#### Step 3  -  Fable Triage (filter before acting)
+**CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag  -  reviewers hallucinate, misread scope, or repeat each other. Run Fable triage (Opus on Copilot CLI) to evaluate merged findings against task scope.
-**CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag  -  reviewers hallucinate, misread scope, or repeat each other. Run Opus triage to evaluate merged findings against task scope.
+Opt-in empirical layer: when `prefs.global.verifyByTest.enabled` is `true`, accepted blocking findings additionally go through Step 3.7 (verify-by-test), which tries to reproduce each one with a minimal failing test before the Phase 3 rework loop fires. Full wiring: `refs/features/verify-by-test.md`.
 ##### 3.1 Short-circuit: no findings
@@ -258,7 +263,7 @@ If merged findings `length === 0`, **skip triage**: write empty result `{"accept
 ##### 3.2 Launch triage agent
-Launch **1 Agent** (subagent_type: `general-purpose`, model: `opus`) with:
+Launch **1 Agent** (subagent_type: `general-purpose`, model: `fable` on Claude Code / `opus` on Copilot CLI) with:
 - Raw findings from Reviewer 1 + Reviewer 2 (merged JSON)
 - Task scope (Phase 1 analysis summary + Phase 2 plan)
@@ -307,11 +312,11 @@ Step 3 produces a single triage-output object conforming to `pipeline/schemas/tr
 Return ONLY valid JSON conforming to pipeline/schemas/triage-output.schema.json:
 {
-  "accepted":  [{ "severity": "blocking|important|suggestion", "file": "...", "line": N, "issue": "...", "fix": "...", "reviewer": "opus|sonnet" }],
+  "accepted":  [{ "severity": "blocking|important|suggestion", "file": "...", "line": N, "issue": "...", "fix": "...", "reviewer": "fable|opus|sonnet|gpt" }],
   "deferred":  [{ "finding": {...}, "reason": "..." }],
   "rejected":  [{ "finding": {...}, "reason": "..." }],
   "approved":  true|false,  // true if no accepted blocking items remain
-  "consensus": { "reviewerCount": N, "verdict": "unanimous-pass|unanimous-block|split|unverified", "disagreements": [{ "file": "...", "line": N, "issue": "...", "note": "Opus blocking, Sonnet approved" }] }  // optional, see 3.6
+  "consensus": { "reviewerCount": N, "verdict": "unanimous-pass|unanimous-block|split|unverified", "disagreements": [{ "file": "...", "line": N, "issue": "...", "note": "Fable blocking, Sonnet approved" }] }  // optional, see 3.6
 }
 ```
@@ -352,12 +357,12 @@ Failure fallback (timeout >120s, or agent crash before any JSON is produced): re
 Emit metrics per review pass for Phase 7 cost rollup:
 ```bash
-LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=opus duration_ms=$OPUS_DURATION tokens_in=$OPUS_IN tokens_out=$OPUS_OUT
+LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=fable duration_ms=$R1_DURATION tokens_in=$R1_IN tokens_out=$R1_OUT   # model=opus on Copilot CLI
 # GPT-5.4 metric emitted only on Copilot CLI (skip on Claude Code):
 [ "${CLI_HOST:-claude}" = "copilot" ] && \
   LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=gpt-5.4 duration_ms=$GPT_DURATION tokens_in=$GPT_IN tokens_out=$GPT_OUT
 LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=sonnet duration_ms=$SONNET_DURATION tokens_in=$SONNET_IN tokens_out=$SONNET_OUT
-LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.triage_call model=opus duration_ms=$TRIAGE_DURATION tokens_in=$TRIAGE_IN tokens_out=$TRIAGE_OUT
+LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.triage_call model=fable duration_ms=$TRIAGE_DURATION tokens_in=$TRIAGE_IN tokens_out=$TRIAGE_OUT
 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.completed raw_count=$RAW accepted=$ACC deferred=$DEF rejected=$REJ approved=$APPROVED duration_ms=$DURATION
 ```
@@ -365,31 +370,58 @@ pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.completed raw_count=$RAW acce
 ##### 3.5 Optional cross-check (single-point-of-failure mitigation)
-Opt-in via `prefs.global.triageCrossCheck.enabled` (default `false`). Sampled runs dispatch a **Sonnet** triage agent as second opinion, validated via `validate-triage.mjs` (same fallback rules). Disagreements logged as `triage.cross_check_diff`; `blockOnDisagreement` pauses for user (autopilot: proceed with Opus verdict). Doubles triage cost on sampled runs.
+Opt-in via `prefs.global.triageCrossCheck.enabled` (default `false`). Sampled runs dispatch a **Sonnet** triage agent as second opinion, validated via `validate-triage.mjs` (same fallback rules). Disagreements logged as `triage.cross_check_diff`; `blockOnDisagreement` pauses for user (autopilot: proceed with the Fable verdict). Doubles triage cost on sampled runs.
 ##### 3.6 Consensus surfacing (anti-correlation)
-**Rationale:** Reviewer 1 (Opus) and Reviewer 3 (Sonnet) share a base model family, so unanimous agreement on a *judgment call* is not independent confirmation  -  same-family models drift the same way on ambiguous prompts. Treating "both approved" as proof produces false-consensus passes. Triage therefore records a `consensus` block (schema v3.1.0) and surfaces disagreement and unverified agreement to the user rather than burying it.
+**Rationale:** Reviewer 1 (Fable) and Reviewer 3 (Sonnet) are both Anthropic Claude models, so unanimous agreement on a *judgment call* is not independent confirmation  -  same-family models drift the same way on ambiguous prompts. Treating "both approved" as proof produces false-consensus passes. Triage therefore records a `consensus` block (schema v3.1.0) and surfaces disagreement and unverified agreement to the user rather than burying it.
 After the triage verdict is computed, populate `triage.consensus`:
 1. `reviewerCount` = number of reviewers dispatched this iteration (`2` on Claude Code, `3` on Copilot CLI).
 2. Classify the iteration `verdict`:
    - `unanimous-block` -> all reviewers returned at least one overlapping `blocking` finding.
-   - `split` -> reviewers disagreed on existence or severity of one or more findings (the Step 2.5 disagreement definition). List each split in `disagreements[]` with a `note` naming who held which position (e.g. "Opus blocking, Sonnet approved").
+   - `split` -> reviewers disagreed on existence or severity of one or more findings (the Step 2.5 disagreement definition). List each split in `disagreements[]` with a `note` naming who held which position (e.g. "Fable blocking, Sonnet approved").
    - `unanimous-pass` -> all reviewers approved AND the diff is low-risk (no security/auth/concurrency surface per Phase 1 `touchedAreas`). Clear-cut; trust it.
    - `unverified` -> all reviewers approved BUT the diff touches a judgment-heavy surface (security, auth, concurrency, money, data migration). Agreement here may be correlated; do NOT treat it as a confirmed pass. Surface it.
 3. `disagreements[]` is populated for `split` and is also used to carry `unverified` notes (e.g. "both approved a keychain change  -  agreement unverified, confirm manually").
 **Surfacing (Step 4 + Phase 7):** When `verdict` is `split` or `unverified`, the disagreements are shown to the user at the Step 4 checkpoint (interactive modes) and always written to the Phase 7 report. Autopilot does not block on `unverified` (it logs `review.consensus=unverified` and proceeds), matching the maturity-check model  -  but the report records it so a human review can catch it. This is additive: omitting `consensus` is valid and means "not computed."
+##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
+**Rationale:** a triage verdict is still a judgment call; a failing repro test is proof. Debating a finding costs tokens, running it costs one test invocation and kills false positives that survive adversarial framing.
+**Gate:** runs only when `prefs.global.verifyByTest.enabled` is `true` AND the validated triage output contains at least one `accepted` finding with `severity: "blocking"`. Otherwise skip silently (no log noise). Full behavior spec: `refs/features/verify-by-test.md`.
+1. **Dispatch ONE verifier agent** (subagent_type: `general-purpose`, model: `prefs.global.verifyByTest.model`, default `sonnet`) for the iteration  -  not one per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings (ordered as triage returned them), the diff hunks for their files, and the project's test conventions from Phase 1. Findings beyond the cap keep their judgment-only verdict; log `verify_by_test=cap-exceeded count=<n>`.
+2. **Per finding, the verifier writes ONE minimal repro test** asserting the CORRECT behavior the finding claims is broken, following the platform test conventions (framework, naming, location per phase-3-dev.md), then runs ONLY that test using the platform single-test invocation from Phase 3 (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`), wrapped in `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
+3. **Verdict mapping (fails toward keeping blockers):**
+| Repro test outcome | `verification.result` | Action on finding |
+| --- | --- | --- |
+| Test FAILS as the finding predicts | `confirmed` | Stays `accepted` blocking. Repro test is KEPT in the worktree and recorded in `redTests[]` as the RED test for the Phase 3 rework loop. |
+| Test PASSES (defect not reproducible) | `not-reproduced` | Downgrade is evidence-gated: `node pipeline/scripts/evidence-gate.mjs --claim test --status passed --evidence "$WORKTREE/.pipeline/verify-<i>.test.log"` must exit 0. On exit 0: move finding from `accepted[]` to `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`, delete the repro test file. On exit non-zero: treat as `inconclusive`. |
+| Compile error, timeout, or defect not expressible as a unit test | `inconclusive` | Stays `accepted` blocking (judgment-only verdict stands). Partial test file deleted, cause noted in `verification.note`. |
+4. **Cleanup invariant:** after Step 3.7 the only verifier artifacts left in the worktree are the confirmed repro tests listed in `redTests[]` (they get committed with the fix, satisfying the TDD contract) and logs under `$WORKTREE/.pipeline/` (already outside Phase 6 commit scope).
+5. **Persist + re-validate:** stamp each verified finding with a `verification` object (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`, recompute `approved`, then re-run `validate-triage.mjs` on the mutated `$TRIAGE_FILE` under the same 3.2.1 gate protocol.
+6. **Timeout/fallback (mirrors 3.3):** the whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600). On verifier crash or budget breach: no retry; remaining findings keep judgment-only verdicts, log `triage=verify-by-test-timeout`, proceed to Step 4. Never blocks the pipeline.
+Telemetry (per 3.4 conventions, best-effort):
+```bash
+LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.verify_by_test \
+  attempted=$A confirmed=$C downgraded=$D inconclusive=$I duration_ms=$MS
+```
 #### Step 4  -  Consensus + Action (triage-driven)
 If `triage.consensus.verdict` is `split` or `unverified`, surface `consensus.disagreements[]` to the user before acting: interactive modes show the split and ask whether to treat the unverified agreement as a pass (picker-contract: Trust / Re-review / Treat-as-blocking); autopilot logs `review.consensus=<verdict>` and proceeds on the triage verdict. Never silently average a split into a pass.
 Act **only on triage.accepted**:
-- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items)
+- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items). When Step 3.7 ran and `state.reviewIterations[-1].verifyByTest.redTests[]` is non-empty, the reflection prompt cites each red test: "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it."
 - **accepted.important** → fix and re-review
 - **accepted.suggestion** → apply if reasonable
 - **deferred** → append to Phase 7 report as "follow-up items" (do not block)
@@ -430,7 +462,7 @@ for proj in $(jq -r '.projects[] | "\(.name)\t\(.worktreePath)\t\(.baseBranch)"'
 done
 ```
-Same 3 reviewers (Opus / GPT-5.4 / Sonnet) receive `COMBINED_DIFF` with a multi-repo prefix in the system prompt:
+Same reviewer set (Fable-or-Opus / GPT-5.4 / Sonnet) receive `COMBINED_DIFF` with a multi-repo prefix in the system prompt:
 ```
 This is a multi-repo task spanning {N} repos: {repo names}.

package/pipeline/commands/multi-agent/refs/progress-contract.md CHANGED Viewed

@@ -128,7 +128,7 @@ Every phase that dispatches a billable LLM agent MUST forward the call's token t
 ```bash
 LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" <phase-id> <event> \
-  model=<opus|sonnet|haiku|gpt-5.4> \
+  model=<fable|opus|sonnet|haiku|gpt-5.4> \
   tokens_in=$IN tokens_out=$OUT duration_ms=$DUR
 ```

package/pipeline/commands/multi-agent/refs/rules.md CHANGED Viewed

@@ -52,6 +52,7 @@ This is the single source of truth. When a contributor or model is unsure where
 - **NEVER** commit without passing build (all gates in Phase 4 Step 1 must be green).
 - **NEVER** commit without passing review (at least one AI reviewer must return `approved: true` with no blocking findings).
 - **NEVER** skip tests. Every public method, every error path, every edge case.
+- **NEVER** delete, rename, or weaken an existing test to get a green run. Existing tests are immutable during a task. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. Deterministic backstop: the `test_lines_removed` diff-risk signal (Phase 4 Step 1.75) flags test files that shrink.
 - **Follow existing code style and conventions.** Read neighbor files before writing new ones  -  match naming, structure, import order.
 - **Use design tokens, no magic numbers.** `16` → `.Spacing.spacing16`. `#E31837` → `Color.Primary.primary`. `.font(.system(size: 14))` → `.typographyStyle(.body1)`.
 - **Design system primitives before custom views.** Before writing a new SwiftUI / Compose / React / View / Configuration triplet inside a domain or feature module, grep the shared component library (project-specific path, e.g. `Common/UIComponents/`, `core-ui/`, `packages/ui/`) for an existing primitive that solves the same problem. New domain-level wrappers, custom modals, custom buttons, or hand-rolled toasts are forbidden when the design system already has an equivalent. If the primitive exists but lacks a modifier (placeholder, size, error binding), **add the modifier to the primitive** in its `+Modifiers` extension  -  do not fork the primitive into the consumer domain. The Figma `CodeConnectSnippet` is the authoritative pointer to which primitive to use.

package/pipeline/commands/multi-agent/refs/tracker-contract.md CHANGED Viewed

@@ -24,8 +24,7 @@ The agent detects which CLI it's running in and uses the appropriate visual mech
 ```
 1. system prompt mentions "Claude Code"             → claude-code
 2. system prompt mentions "Copilot" / "GitHub Copilot" → copilot
-3. system prompt mentions "Cursor"                   → cursor
-5. None of the above                                 → generic (bash stdout)
+3. None of the above                                 → generic (bash stdout)
 ```
 Visual mechanism per CLI:

package/pipeline/commands/multi-agent/resume.md CHANGED Viewed

@@ -23,10 +23,13 @@ Resume a paused or failed task from the last successful phase.
    - `haltReason`  -  if set, show it so the user knows why the run stopped; clear it on successful re-entry
    - `autopilot`  -  preserve the mode
-3. **Load context**  -  read prior-phase findings from `agent-log.md`:
-   - Phase 1 analysis → use it from Phase 2+
-   - Phase 2 plan → use it from Phase 3+
-   - Phase 3 code → already in the worktree
+3. **Load context**  -  rebuild working context from durable artifacts, never from conversation memory:
+   - **Handoff first (v10.8.0)**: read the LATEST `## Handoff` block in `agent-log.md`  -  it carries done/remaining/decisions/open-findings and the exact re-entry point (phase + subStep). When present, it is the primary context source; cross-check its `Next:` line against `state.currentPhase` and trust state on mismatch (state is the machine truth, handoff is the narrative).
+   - Fall back to per-phase findings for logs written before v10.8 (no handoff blocks):
+     - Phase 1 analysis → use it from Phase 2+
+     - Phase 2 plan → use it from Phase 3+
+     - Phase 3 code → already in the worktree
+   - Recent `git log --oneline -10` in the worktree grounds what was actually committed vs. claimed.
 4. **Continue the pipeline**  -  start from the next phase (same pipeline as the main multi-agent command).

package/pipeline/commands/multi-agent/review.md CHANGED Viewed

@@ -1,5 +1,5 @@
 ---
-description: "Run parallel review on a branch's diff or a Pull Request: 2 models on Claude Code (Opus + Sonnet), 3 models on Copilot CLI (GPT + Opus + Sonnet). On PR input, posts per-finding inline comments and sets approve/needs-work review state."
+description: "Run parallel review on a branch's diff or a Pull Request: 2 models on Claude Code (Fable + Sonnet), 3 models on Copilot CLI (GPT + Opus + Sonnet). On PR input, posts per-finding inline comments and sets approve/needs-work review state."
 argument-hint: "[#N | repo#N | PR-URL | branch]  -  optional: PR by number/URL, repo+number, or local branch. Supports GitHub and Bitbucket Server URLs. If omitted, the current branch is used."
 ---
@@ -109,18 +109,50 @@ The `credential-store.sh` wrapper handles macOS Keychain (`security`), Linux lib
 Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers can re-read it.
+### 2b. Module review guides  -  path-scoped convention files
+A module in the repo may carry its own CLAUDE guide  -  a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths:
+```bash
+# Changed file paths from the patch:
+grep -E '^\+\+\+ b/' "$DIFF_FILE" | sed 's|^+++ b/||' | sort -u > /tmp/multi-agent-review-${TASK_ID}-paths.txt
+# For each changed path, walk its directory chain up to the repo root and
+# collect guide files matching: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
+# Root-level CLAUDE.md/AGENTS.md are excluded  -  the host CLI already loads those.
+guides=()
+while IFS= read -r p; do
+  d=$(dirname "$p")
+  while [ "$d" != "." ] && [ "$d" != "/" ]; do
+    for g in "$d"/CLAUDE.md "$d"/*-CLAUDE.md "$d"/AGENTS.md; do
+      [ -e "$g" ] && guides+=("$g")
+    done
+    d=$(dirname "$d")
+  done
+done < /tmp/multi-agent-review-${TASK_ID}-paths.txt
+# dedupe, cap at 5 (log any dropped so truncation is never silent)
+```
+Existence checks are resolved against the local checkout when the cwd is the target repo. In PR mode without a local checkout, probe the candidate paths via the provider API instead (`gh api /repos/{o}/{r}/contents/{path}?ref={headSha}` / Bitbucket `GET /projects/{KEY}/repos/{slug}/browse/{path}?at={headSha}`) and fetch the matching files' raw content the same way. No hit → step is a silent no-op.
+Persist `agent-state.review.moduleGuides = [<repo-relative paths>]` and inject into every reviewer prompt (Step 3):
+> MODULE REVIEW GUIDES: before reviewing, read each of these guide files. Apply a guide's rules/checklist to every changed file under its directory. Guide violations are findings like any other  -  triage them by severity.
+Scope note: a guide governs only files under its own directory  -  a guide found under one module must not be applied to a sibling module's changes in the same PR.
 ### 3. Launch parallel reviewers  -  host-CLI dependent
 **Claude Code (2 in parallel):**
-- Agent 1: `claude-opus-4.6` → security + architecture
-- Agent 2: `claude-sonnet-4.6` → general quality
+- Agent 1: `claude-fable-5` → security + architecture
+- Agent 2: `claude-sonnet-4-6` → general quality
 **Copilot CLI (3 in parallel):**
-- Agent 1: `claude-opus-4.6` → security + architecture
+- Agent 1: `claude-opus-4-8` → security + architecture (Fable 5 is not offered on Copilot CLI)
 - Agent 2: `gpt-5.4` → edge cases, alternate perspective
-- Agent 3: `claude-sonnet-4.6` → general quality
+- Agent 3: `claude-sonnet-4-6` → general quality
-Each reviewer receives the diff plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
+Each reviewer receives the diff, the module review guides from Step 2b (when any were found), plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
 ### 4. Store-compliance cross-reference
@@ -137,7 +169,7 @@ Each finding gets the `ruleID` from the catalog plus the platform policy ref:
 Catalog-only  -  does NOT invoke binaries. For a full scan, use `/multi-agent:test "store-ready"`.
-### 5. Triage (Opus)
+### 5. Triage (Fable)
 Classify findings into:
 - 🔴 **Blocking** → must fix
@@ -152,10 +184,10 @@ Triage also marks each finding as `accepted` (real issue), `deferred` (real but
 🔍 Review Complete  ·  PR #1250  ·  3 files +120 -45
 | Model    | Verdict   | Blocking | Important | Suggestion |
 |----------|-----------|----------|-----------|------------|
-| Opus     | approved  | 0        | 1         | 3          |
+| Fable    | approved  | 0        | 1         | 3          |
 | Sonnet   | rejected  | 1        | 2         | 5          |
-Consensus: ⚠ DISAGREEMENT  -  see Opus triage
+Consensus: ⚠ DISAGREEMENT  -  see Fable triage
 ```
 This summary ALWAYS prints, regardless of input mode. The chat is the live conversation; on the PR side, the durable artifacts are inline comments + the review state (Step 7).

package/pipeline/commands/multi-agent/sync.md CHANGED Viewed

@@ -58,7 +58,7 @@ Run every step automatically:
 Step 0: FIGMA_SYNC  SKIP (deprecated  -  feedback_figma_source_deprecated)
 Step 1: PLATFORM    Detect macOS / Linux / Windows (Git Bash / WSL); export PLATFORM env
 Step 1.5: DETECT    Compare timestamps, find stale targets
-Step 2: COPILOT     Claude Code -> Copilot CLI (instructions + 34 sub-command skills)
+Step 2: COPILOT     Claude Code -> Copilot CLI (instructions + 35 sub-command skills)
 Step 3: REPO        Claude Code -> pipeline repo (genericized, personal data scrub, bash -n on all sh)
 Step 3c: PLUGINS    pipeline shared/external -> multi-agent-plugins marketplace (rebuild knowledge/,
                      bump changed plugins' patch version, commit + push the plugins repo)
@@ -277,11 +277,11 @@ This runs on the Claude <-> Copilot axis — the two CLIs the pipeline supports
 |-------------|-------------|
 | `~/.claude/commands/multi-agent/{cmd}.md` | `~/.copilot/skills/multi-agent-{cmd}/SKILL.md` |
-**34 commands are synced** (canonical inventory  -  must match `cross-cli-contract.md` section 1; drift = contract violation):
+**35 commands are synced** (canonical inventory  -  must match `cross-cli-contract.md` section 1; drift = contract violation):
 ```
 analysis, analysis-resolve, autopilot, build-optimize, channels, delete, dev,
-dev-autopilot, dev-local, dev-local-autopilot, diff-explain, garbage-collect,
+dev-autopilot, dev-local, dev-local-autopilot, diff-explain, finish, garbage-collect,
 help, issue, jira, kill, language, local, local-autopilot, log, manual-test,
 prune-logs, purge, refactor, resume, review, scan, search, setup, stack, status,
 sync, test, update

package/pipeline/commands/multi-agent.md CHANGED Viewed

@@ -1,5 +1,5 @@
 ---
-description: "Task orchestrator  -  full pipeline via Jira ID + branch or GitHub Issue URL: analysis, plan, TDD development, parallel review + Opus triage (CLI-aware: 2-model on Claude Code, 3-model on Copilot CLI), commit, log"
+description: "Task orchestrator  -  full pipeline via Jira ID + branch or GitHub Issue URL: analysis, plan, TDD development, parallel review + Fable triage (CLI-aware: 2-model on Claude Code, 3-model on Copilot CLI), commit, log"
 allowed-tools: Agent, Bash, Read, Write, Edit, Glob, Grep, TaskCreate, TaskUpdate, TaskList, TaskGet, AskUserQuestion, WebFetch, WebSearch, NotebookEdit, Skill
 ---
@@ -140,14 +140,14 @@ This command uses lazy loading for token efficiency. Read the relevant sub-file
 - Multiple stacks -> load all relevant guides
 **Agent definitions** (used in Phase 1 and Phase 4):
-- `$HOME/.claude/agents/code-reviewer.md`  -  Phase 4 reviewer persona (`preferredModel: opus`; Phase 4 overrides Reviewer 3 to `sonnet`)
+- `$HOME/.claude/agents/code-reviewer.md`  -  Phase 4 reviewer persona (`preferredModel: fable`; Phase 4 overrides Reviewer 3 to `sonnet`)
 - `$HOME/.claude/agents/explorer.md`  -  Phase 1 codebase scan persona (`preferredModel: sonnet`  -  scan work, cost-efficient)
-- `$HOME/.claude/agents/ios-architect.md`  -  iOS architecture review (`preferredModel: opus`)
-- `$HOME/.claude/agents/android-architect.md`  -  Android architecture review (`preferredModel: opus`)
-- `$HOME/.claude/agents/backend-architect.md`  -  Backend/API architecture review (`preferredModel: opus`)
+- `$HOME/.claude/agents/ios-architect.md`  -  iOS architecture review (`preferredModel: fable`)
+- `$HOME/.claude/agents/android-architect.md`  -  Android architecture review (`preferredModel: fable`)
+- `$HOME/.claude/agents/backend-architect.md`  -  Backend/API architecture review (`preferredModel: fable`)
 - `$HOME/.claude/agents/security-auditor.md`  -  Security audit (`preferredModel: opus`)
-**Per-persona model routing:** Before each Agent dispatch, the orchestrator reads `preferredModel` from the persona file and exports `CLAUDE_CODE_SUBAGENT_MODEL` (Claude Code) / passes `--model` (Copilot CLI). Precedence: per-dispatch `PHASE_MODEL_OVERRIDE` > persona `preferredModel` > `opus`. Full contract: `skills/shared/core/multi-agent/SKILL.md#agent-dispatch--per-persona-model-routing-v610`.
+**Per-persona model routing:** Before each Agent dispatch, the orchestrator reads `preferredModel` from the persona file and exports `CLAUDE_CODE_SUBAGENT_MODEL` (Claude Code) / passes `--model` (Copilot CLI). Precedence: per-dispatch `PHASE_MODEL_OVERRIDE` > persona `preferredModel` > `fable` (falls back per `refs/features/model-fallback.md`). Full contract: `skills/shared/core/multi-agent/SKILL.md#agent-dispatch--per-persona-model-routing-v610`.
 ---
@@ -247,7 +247,7 @@ When called with `review`:
 1. Detect current branch and project from cwd (or ask)
 2. Get diff: `git diff HEAD` (unstaged + staged)
 3. If no diff, get diff against base branch: `git diff origin/{baseBranch}...HEAD`
-4. Launch Phase 4 review (parallel + Opus triage  -  2-model on Claude Code, 3-model on Copilot CLI) on the diff
+4. Launch Phase 4 review (parallel + Fable triage  -  2-model on Claude Code, 3-model on Copilot CLI) on the diff
 5. No worktree, no state file  -  lightweight one-shot review
 6. Print findings to terminal

package/pipeline/schemas/agent-state.schema.json CHANGED Viewed

@@ -183,7 +183,7 @@
             "planEditRequests": {
               "type": "array",
               "items": { "type": "string" },
-              "description": "v5.3.0 Phase 2  -  free-text edit instructions the user typed between plan renders. Preserved verbatim for audit; Opus parses them conversationally to revise the plan."
+              "description": "v5.3.0 Phase 2  -  free-text edit instructions the user typed between plan renders. Preserved verbatim for audit; the planning model (Fable top tier) parses them conversationally to revise the plan."
             }
           }
         }