npm - @mmerterden/multi-agent-pipeline - Versions diffs - 10.7.4 → 10.9.0 - Mend

@mmerterden/multi-agent-pipeline 10.7.4 → 10.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29) hide show

package/CHANGELOG.md +93 -0
package/README.md +2 -0
package/docs/engineering.md +76 -0
package/docs/features.md +49 -33
package/package.json +1 -1
package/pipeline/commands/multi-agent/refs/features/verify-by-test.md +41 -0
package/pipeline/commands/multi-agent/refs/phases/log-format.md +10 -0
package/pipeline/commands/multi-agent/refs/phases/operations.md +15 -2
package/pipeline/commands/multi-agent/refs/phases/phase-0-init.md +9 -0
package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md +12 -0
package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md +12 -1
package/pipeline/commands/multi-agent/refs/rules.md +1 -0
package/pipeline/commands/multi-agent/resume.md +7 -4
package/pipeline/commands/multi-agent/review.md +33 -1
package/pipeline/schemas/diff-risk.schema.json +5 -4
package/pipeline/schemas/prefs.schema.json +59 -0
package/pipeline/schemas/token-budget.json +7 -7
package/pipeline/schemas/triage-output.schema.json +35 -4
package/pipeline/scripts/README.md +3 -0
package/pipeline/scripts/diff-risk-score.mjs +11 -1
package/pipeline/scripts/fixtures/diff-risk-test-removal.diff +40 -0
package/pipeline/scripts/fixtures/install-layout.tsv +3 -3
package/pipeline/scripts/smoke-diff-risk.sh +30 -1
package/pipeline/scripts/smoke-handoff-contract.sh +92 -0
package/pipeline/scripts/smoke-update-check.sh +122 -0
package/pipeline/scripts/smoke-verify-by-test.sh +148 -0
package/pipeline/scripts/update-check.sh +82 -0
package/pipeline/scripts/validate-diff-risk.mjs +2 -1
package/pipeline/scripts/validate-triage.mjs +31 -2

package/CHANGELOG.md CHANGED Viewed

@@ -14,6 +14,99 @@ Internal file-layout changes that don't affect the slash-command surface are sti
 ---
+## [10.9.0] - 2026-07-03
+### Added
+- **Update check at run start (Phase 0 Step 0.6, opt-out via
+  `prefs.global.updateCheck`).** Once per `ttlHours` window (default 24h,
+  cached, 3s-bounded curl straight to the npm registry - never `npm view`),
+  the pipeline compares the installed version against `dist-tags.latest`.
+  Newer version found: interactive modes ask ONE question ("Update now?" ->
+  yes runs the `/multi-agent:update` flow and the run continues; the update
+  takes full effect next session); autopilot logs one line and never asks
+  (zero-interaction contract). `autoUpdate: true` skips the question and
+  updates before the worktree is created. Offline, ahead-of-registry, and
+  failed checks are silent - the step never blocks. New
+  `pipeline/scripts/update-check.sh` + `smoke-update-check.sh` (13 assertions).
+- **Design fidelity contract (Phase 3, BLOCKING on Figma-referenced tasks).**
+  The analysis doc's captured frames are the 1:1 reference for every UI line:
+  a Code Connect-mapped component must be used verbatim (no sound-alikes,
+  missing modifiers go into `+Modifiers`, never forks), unmapped UI becomes a
+  NEW component under the standard architecture, and inter-component spacing
+  comes from the design's measured values mapped to tokens - never invented.
+  Missing design data halts with a re-run-analysis instruction; Figma MCP
+  stays forbidden outside analysis.
+- **`docs/engineering.md`** - version-free catalog of the discipline layer
+  (bounded loops, evidence gates, token-budgeted phase docs, immutable tests,
+  fresh-context handoffs, memory hedges, install hygiene), linked from README.
+### Changed
+- **Docs drop inline version markers.** `docs/features.md` headers and bullets
+  no longer carry `(vX.Y.Z)` tags - the docs describe the current state;
+  history lives in this CHANGELOG. Same principle applied to the website copy.
+- **Phase-doc token budgets recalibrated** (`token-budget.json`, total
+  46500 -> 50000) after the verify-by-test, update-check, immutable-test, and
+  design-fidelity contracts landed - Step 3.7 prose was compressed to a
+  pointer into `refs/features/verify-by-test.md` before recalibration.
+---
+## [10.8.0] - 2026-07-02
+Three additive loop-quality features, all sourced from the 2026 agentic-loop research sweep (Anthropic long-running-agent harness guidance + adversarial-review findings): empirical verification of blocking findings, an immutable-test contract, and structured phase-boundary handoffs.
+### Added
+- **Verify-by-test triage (Phase 4 Step 3.7, opt-in via `prefs.global.verifyByTest`).**
+  Accepted blocking findings are no longer judgment-final: one verifier agent
+  (default Sonnet, capped at `maxFindings`=3) writes a minimal repro test per
+  finding and runs ONLY that test. Test fails as predicted -> finding confirmed
+  and the repro test is handed to Phase 3 as the rework RED test
+  (`state.reviewIterations[-1].verifyByTest.redTests[]`). Test passes under
+  `evidence-gate.mjs` -> finding downgraded to `deferred` (never `rejected`).
+  Compile error / timeout -> `inconclusive`, judgment verdict stands. The whole
+  step is timeout-bounded and never blocks the pipeline. Schema
+  `triage-output` v3.2.0 adds the optional per-finding `verification` block;
+  `validate-triage.mjs` validates it. New feature doc
+  `refs/features/verify-by-test.md` + `smoke-verify-by-test.sh` (20 assertions).
+- **Immutable-test rule + `test_lines_removed` diff-risk signal.** New rule in
+  `refs/rules.md` and the Phase 3 GREEN step: deleting, renaming, or weakening
+  an existing test to reach green is a violation; a test changes only when the
+  task changes the spec it encodes, named in the commit body. Deterministic
+  backstop: `diff-risk-score.mjs` v1.1.0 emits `test_lines_removed` (w=3.0)
+  when a test-classified file removes more lines than it adds; wired into the
+  Phase 4 Step 1.75 signal table, `diff-risk.schema.json` v1.1.0, and
+  `validate-diff-risk.mjs`. New fixture `diff-risk-test-removal.diff` +
+  3 new `smoke-diff-risk.sh` assertions.
+- **Structured handoff blocks (fresh-context re-entry discipline).** The
+  phase-boundary checkpoint in `refs/phases/operations.md` now appends a
+  structured `## Handoff` block (Done / Remaining / Decisions / Open findings /
+  Next) to `agent-log.md` at every phase transition - written by the
+  orchestrator from state it already holds, no LLM call, ~15 lines. The
+  post-`/compact` re-grounding and `resume.md` Step 3 read the latest handoff
+  FIRST (state wins on mismatch), falling back to per-phase findings for
+  pre-v10.8 logs. Documented in `log-format.md`; guarded by
+  `smoke-handoff-contract.sh` (13 assertions).
+- **Module review guides in `/multi-agent:review` (Step 2b).** When a changed
+  file's module carries its own convention file (`CLAUDE.md`, `*-CLAUDE.md`,
+  `AGENTS.md` below repo root), the review discovers it deterministically from
+  the diff paths (capped at 5, truncation logged), injects it into every
+  reviewer prompt, and scopes each guide to files under its own directory.
+  Works with a local checkout or via provider API in PR mode.
+### Fixed
+- **`validate-triage.mjs` reviewer enum accepts `fable` and `gpt`.** The runtime
+  validator still rejected the schema-v3.1.0 reviewer values restored in
+  v10.7.4 (`fable` is Reviewer 1 on Claude Code, `gpt` on Copilot CLI), so any
+  accepted finding attributed to them failed the 3.2.1 gate. Enum now matches
+  `triage-output.schema.json`.
+---
 ## [10.7.4] - 2026-07-02
 Deep-consistency sweep: the v10.6.0 Fable restore and the v10.7.0 adapter removal are now reflected on every surface, and the test suite is green again end to end.

package/README.md CHANGED Viewed

@@ -51,6 +51,8 @@ One command runs 8 phases, with a gate between the risky ones:
 Under the hood: each task runs in its own **git worktree** (or the current branch with `:local`), commits use the **git identity routed from the repo's origin URL**, and **multi-repo** tasks get per-repo worktrees plus an integration build. Tokens stay in the OS keychain; nothing is committed or logged. `/multi-agent:review` can also review an existing GitHub/Bitbucket PR — per-finding inline comments anchored to `file:line` + an explicit Approve / Needs-Work state.
+The discipline behind all of this — bounded loops, evidence gates, token-budgeted phase docs, immutable tests, fresh-context handoffs — is catalogued in [docs/engineering.md](./docs/engineering.md). The full feature list lives in [docs/features.md](./docs/features.md).
 ## Modes
 | Mode | Command | Flow |

package/docs/engineering.md ADDED Viewed

@@ -0,0 +1,76 @@
+# Engineering Discipline
+The pipeline's output quality comes less from any single model call and more from the constraints wrapped around every call. This page lists those constraints - the internals that keep long agentic runs correct, cheap, and auditable. No version markers here; this document always describes the current state (history lives in the [CHANGELOG](../CHANGELOG.md)).
+## Loop discipline
+Every loop in the pipeline has a deterministic exit condition and a hard iteration cap. Nothing "keeps trying".
+- **TDD micro-loop** (Phase 3): red -> green -> refactor per task; build-fix retries capped at 3 attempts.
+- **Dev-critic loop** (Phase 3.5, opt-in): generator vs critic on deterministic criteria; max 2 iterations, round 3 escalates to the user.
+- **Review-fix macro-loop** (Phase 3 <-> 4): only triage-accepted blocking findings re-enter development; max 3 iterations, then hard-kill and escalate.
+- **Disagreement round** (Phase 4, opt-in): when reviewers split, exactly one rebuttal round - never more.
+- **Global rule**: any retry loop stops after 3 attempts and asks the user. No exceptions, no infinite loops.
+## Evidence over claims
+- **Default-FAIL evidence gate**: a "build passed" or "tests passed" claim is rejected unless the captured log actually shows success. Missing, empty, or failure-marked logs fail CLOSED.
+- **Verify-by-test** (opt-in): an accepted blocking review finding can be empirically confirmed - a verifier agent writes a minimal repro test and runs it. Test fails as predicted: the finding stands and the failing test is handed to the rework loop as its RED test. Test passes: the finding is downgraded, and even the downgrade must pass the evidence gate.
+- **Deterministic validator gates**: triage output, reviewer output, diff-risk output, and agent state are schema-validated by zero-dependency scripts whose exit codes decide the flow - one self-correction rework, then halt with a recovery hint. The LLM never grades its own homework.
+## Test integrity
+- **Immutable tests**: deleting, renaming, or weakening an existing test to get a green run is a rule violation. A test changes only when the task changes the spec it encodes, named in the commit body.
+- **Deterministic backstop**: the diff-risk scorer flags any test file whose diff removes more lines than it adds (`test_lines_removed`), so a shrinking test suite surfaces to reviewers even if the rule is ignored.
+- **Test-gap detection**: newly added public symbols with no paired test are reported before the test phase.
+## Context management
+- **Token-budgeted phase docs**: every phase document has a per-file and total token budget enforced by a smoke gate (`token-budget.json`). Growth is compressed first; budgets are recalibrated only when real contract text lands. This keeps the orchestrator's working context lean across an 8-phase run.
+- **Lazy loading**: only the active phase's document is in context; references load on demand.
+- **Structured handoff blocks**: at every phase boundary the orchestrator appends a Done / Remaining / Decisions / Open findings / Next block to the run log - written from state it already holds, no extra model call. Resume and post-compaction re-entry read the latest handoff plus the state file, never the conversation history.
+- **Proactive compaction**: past ~50% context usage the run compacts itself and re-grounds from durable artifacts instead of waiting for lossy auto-compaction.
+## Review architecture
+- **Deterministic gates before AI review**: build, lint, tests, and secret scan must be green before a single reviewer token is spent.
+- **Cross-model parallel review**: independent reviewers with different vantage points, then a triage pass that filters false positives and out-of-scope noise. Reviewer findings are raw signals, not commands - nothing auto-loops on a "blocking" tag.
+- **Consensus surfacing**: unanimous agreement between same-base-model reviewers on judgment-heavy surfaces is flagged as unverified instead of being trusted as independent confirmation.
+- **Diff risk scoring**: a sub-second, zero-LLM heuristic ranks changed files (security paths, migrations, missing tests, shrinking tests) and feeds reviewers a priority hint - advisory, never a gate.
+- **Prompt-cache-aware dispatch**: reviewer and triage prompts share a byte-identical leading block so parallel dispatches hit the prompt cache instead of re-billing the diff.
+## Design fidelity
+- **Single source of design truth**: Figma is fetched once, during analysis, through a 3-tier fallback chain (MCP -> REST -> user screenshot). Every later phase reads the frozen analysis artifact; calling Figma from dev or review phases is a contract violation enforced by a smoke gate.
+- **1:1 implementation contract**: a Code Connect-mapped component must be used verbatim (no sound-alike substitutes; missing modifiers are added to the component, never forked). Unmapped UI becomes a new component under the standard architecture. Spacing between components comes from the design's measured values mapped to tokens - never invented numbers.
+## Cost transparency
+- **Per-phase token ledger**: every billable call forwards token counts to the tracker; each run ends with a cost table naming the top cost driver and pricing prompt-cache reads at their discounted rate.
+- **Budget ceilings**: a cost budget can cap a run; crossing it triggers model downgrade or halt rather than silent overspend.
+- **Model fallback contract**: premium-tier dispatches degrade deterministically (top tier -> mid -> base) on dispatch errors, date gates, or budget ceilings - never by improvisation.
+## Memory without bias
+- **Learnings ledger**: durable per-repo facts and rejected review preferences replay into analysis and triage, so agents stop re-discovering and reviewers stop re-flagging the same things.
+- **Hedged injection**: every memory injection carries an explicit "context, not command" hedge, and blocking-severity rejections are never memorized - a wrong rejection must not permanently suppress a finding class.
+- **Prior-art caps**: triage sees at most a fixed number of highest-similarity past findings; memory is advisory context, not a verdict multiplier.
+## Update and install hygiene
+- **Wipe-then-copy install**: pipeline-owned directories are cleared before copying, so files deleted upstream never linger as ghost commands or dead scripts on user machines.
+- **Layout fingerprint**: a fixture pins the installed file layout; structural drift fails the suite.
+- **Advisory update check**: at run start (cached, bounded, offline-silent) the pipeline compares its installed version with the registry; interactive runs offer a one-question update, autopilot only logs. It never blocks and never self-modifies mid-run.
+- **Token-preserving uninstall**: removing the pipeline never touches the user's stored credentials.
+## Security and hygiene
+- **Secret scanning**: staged diffs are scanned with provider-prefix patterns plus Shannon-entropy detection before every commit; findings block.
+- **Personal-data genericization**: the public repo is generated from the private source with a scrub pass (names, hosts, project keys) and a zero-hit grep gate before push.
+- **Skill scanner**: tiered pattern scanning over skill content guards against supply-chain surprises.
+- **Intent guard**: a deterministic classifier separates "answer my question" from "change my code", so a conceptual question never spawns a worktree.
+## Cross-CLI parity
+- **One source of truth, two runtimes**: Claude Code and Copilot CLI command surfaces are byte-identical where shared; a parity smoke fails on drift.
+- **Portability gates**: every shell script passes `bash -n` and a GNU/BSD portability grep; the CI matrix runs macOS, Linux, and Windows.

package/docs/features.md CHANGED Viewed

@@ -43,7 +43,7 @@ Compose freely: `--dev --local autopilot` = shortest, least-friction path.
 Build commands, test runners, lint tools, and review focus areas all adapt to the detected stack.
-### Stack Selection (marketplace plugins, v10.5.0+)
+### Stack Selection (marketplace plugins)
 Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketplace. Selecting a stack enables the matching plugin(s) in the current repo's `.claude/settings.json` `enabledPlugins` — no skill copying, no session restart tricks, no directory shuffling. The `ai-common-engineering-toolkit` (accessibility audit, humanizer, Firebase) is always enabled alongside the stack plugin.
@@ -57,7 +57,7 @@ Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketpl
 /multi-agent:stack all         # every stack plugin
 ```
-### Task Type Detection (v2.0.0)
+### Task Type Detection
 Phase 0 Step 9 classifies every task before Phase 1 starts. Deterministic priority order: Figma URL → instruction file path → git diff heuristic → Jira issue type → branch name → description keywords → user prompt (autopilot defaults to `feature`).
@@ -71,26 +71,26 @@ Result persisted to `agent-state.taskType`:
 | `refactor`  | Phase 4 emphasizes behavior preservation; Phase 6 uses `refactor(...)` prefix |
 | `chore`     | Lightweight flow; Phase 6 uses `chore(...)` prefix                            |
-### SubPhase Convention (v2.0.0)
+### SubPhase Convention
-When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7 — Phase 6.5 was merged into Phase 7 REPORT in v5.1.0) — specialized work slots into its parent phase without inflating the count.
+When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7) — specialized work slots into its parent phase without inflating the count.
 ## PR & Review Flow
-### Default Reviewers (v1.6.1, hardened in v2.0.0)
+### Default Reviewers
 - **Bitbucket**: Fetches via `/rest/default-reviewers/1.0/`. Every PUT must re-send `reviewers`, `fromRef`, `toRef`, `draft` (regression guarded by smoke test).
 - **GitHub**: Honors `CODEOWNERS` + falls back to `prefs.projects[].githubDefaultReviewers`.
 - PR author always filtered out (Bitbucket: 409, GitHub: GraphQL error).
-### Draft vs Ready Prompt (v1.7.0)
+### Draft vs Ready Prompt
 Phase 6 asks `DRAFT or READY?` before creating the PR and persists the choice in `prefs.projects[].defaultPrMode`.
 - Bitbucket: `draft: true` flag (DC 8.x+) with `[DRAFT]` title fallback for older servers.
 - GitHub: `gh pr create --draft` + `gh pr ready` for promotion.
-### `channels` Command (v5.7.0 — replaces `enrich`)
+### `channels` Command
 Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post-hoc for fixes closed outside the pipeline:
@@ -102,9 +102,9 @@ Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post
 /multi-agent:channels ABC-1234 --channels jira,confluence --content test
 ```
-Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the pre-v5.7 `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
+Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the earlier `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
-### Body Preservation Contract (v1.6.1, verified by smoke test in v2.0.0)
+### Body Preservation Contract (smoke-verified)
 Every external-system body (PR description, Jira comment, GitHub issue) uses `jq -n --rawfile body body.md '{description: $body}'` → `curl --data-binary @payload.json`. No literal `\n` strings, no HTML entities (`&amp;`, `&lt;`, `&quot;`). UTF-8 preserved end-to-end. `scripts/smoke-add-detail.sh` runs 14 contract assertions.
@@ -137,7 +137,7 @@ The reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers in paralle
 **Fable Triage** (Phase 4 Step 3, Opus on Copilot CLI): Evaluates merged raw findings against task scope. Classifies each as `accepted` (fix now), `deferred` (out of scope, log for later), or `rejected` (false positive / noise). Only triage-accepted blocking items loop back to Phase 3.
-### Runtime Triage Validator (v2.3.0)
+### Runtime Triage Validator
 After triage returns, output is validated by `validate-triage.mjs`:
@@ -148,9 +148,25 @@ After triage returns, output is validated by `validate-triage.mjs`:
 | **2** | Over-rejection guard tripped — pause for human               |
 | **3** | Contradiction auto-corrected — proceed with corrected output |
-### Bidirectional Approved↔Blocking Auto-Correction (v2.6.1)
+### Bidirectional Approved↔Blocking Auto-Correction
-If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with `if`/`then` constraint in schema v3.0.0.
+If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with an `if`/`then` constraint in the schema itself.
+### Verify-by-Test Triage (Phase 4 Step 3.7, opt-in)
+A triage verdict is a judgment call; a failing repro test is proof. When `prefs.global.verifyByTest.enabled` is on, one verifier agent (default Sonnet) writes a minimal repro test per accepted blocking finding (cap: `maxFindings`=3) and runs only that test. Fails as predicted -> finding confirmed, the repro test becomes the Phase 3 rework RED test. Passes under `evidence-gate.mjs` -> finding downgraded to `deferred`. Compile error / timeout -> `inconclusive`, judgment stands. Timeout-bounded, never blocks. Full spec: `refs/features/verify-by-test.md`.
+### Immutable-Test Rule + `test_lines_removed` Signal
+Existing tests are immutable during a task: deleting, renaming, or weakening an assertion to reach green is a violation (`refs/rules.md`, Phase 3 GREEN step). A test changes only when the task changes the spec it encodes, named in the commit body. Deterministic backstop: `diff-risk-score.mjs` emits `test_lines_removed` (w=3.0) for any test-classified file whose diff removes more lines than it adds.
+### Update Check at Run Start
+Phase 0 Step 0.6, opt-out via `prefs.global.updateCheck`. Once per `ttlHours` window (cached, 3s-bounded curl to the npm registry), the installed version is compared against `dist-tags.latest`. Newer version: interactive modes ask "Update now?" (yes runs the `/multi-agent:update` flow, then the run continues); autopilot logs one line and never asks. `autoUpdate: true` updates silently before the worktree exists. Offline and failed checks are silent - never blocks.
+### Structured Handoff Blocks
+Every phase transition appends a `## Handoff` block (Done / Remaining / Decisions / Open findings / Next) to `agent-log.md` - orchestrator-written from existing state, no LLM call. `/multi-agent:resume` and post-`/compact` re-grounding read the latest handoff first, so long runs re-enter from durable artifacts instead of conversation memory (fresh-context discipline from Anthropic's long-running-agent harness guidance).
 ### Accessibility Code Review (Phase 4 Step 1.5)
@@ -162,7 +178,7 @@ If changes include UI files, reviewers check for:
 Pure code analysis — no simulator needed. Device-level audits run in Phase 5 when requested.
-### Status Enforcement (v1.5.0)
+### Status Enforcement
 Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation verify step that re-reads the field and retries once on silent `VALIDATION` failures (e.g. stale Projects V2 option IDs after a board rebuild).
@@ -175,49 +191,49 @@ Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation veri
 ## Testing & Quality
-### Schema-Validated State (v2.0.0)
+### Schema-Validated State
 All critical state files are schema-validated at read and write time:
 - `agent-state.schema.json` — validates `$HOME/.claude/logs/multi-agent/.../agent-state.json`
-- `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json` (extended v2.5.0)
-- `triage-output.schema.json` — validates triage output (v2.3.0, hardened v2.6.1, `if`/`then` constraint v3.0.0)
+- `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json`
+- `triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
-### Smoke Test Suites (v2.0.0–v2.5.0)
+### Smoke Test Suites
 10 suites: add-detail (14 body-preservation assertions), review-triage (validator exit codes), prefs (schema round-trip), state (agent-state lifecycle), metrics (telemetry emission), sync (instruction parity), secret-scan (hook patterns), phase-banner (terminal UI), token-budget (per-phase limits), phase-tracker (progress tracking).
-### Adversarial Eval Fixtures (v2.5.0–v2.6.0)
+### Adversarial Eval Fixtures
 10 fixtures that test triage resilience against adversarial reviewer output: over-rejection, hallucinated findings, contradictions, invalid JSON, schema violations, duplicate findings, scope creep, empty results, timeout simulation, and combined edge cases.
-### Sync Parity Check (v2.3.0)
+### Sync Parity Check
 Detects drift between Claude Code instructions (`~/.claude/commands/multi-agent.md`), Copilot CLI instructions (`~/.copilot/copilot-instructions.md`), and the repo's pipeline spec files. Reports discrepancies during Phase 0 Init.
-### Token Budget Enforcement (v3.0.0)
+### Token Budget Enforcement
 Per-phase token budgets prevent runaway sessions. If a phase exceeds its budget, the pipeline pauses and offers: continue (extend budget), skip phase, or abort. Budgets are configurable in `prefs.global.tokenBudgets`.
 ## Telemetry & Observability
-- **Pipeline Metrics** (v2.3.0): Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
-- **Cost Telemetry** (v2.5.0): Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
-- **Phase Tracker** (v2.4.0): Cross-CLI visual progress (current phase, elapsed time, iteration count).
-- **Phase Banner** (v2.4.0): Terminal UI for phase transitions with Unicode box-drawing characters.
-- **Per-task Cost Breakdown in agent-log.md** (v8.3.0): Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
+- **Pipeline Metrics**: Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
+- **Cost Telemetry**: Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
+- **Phase Tracker**: Cross-CLI visual progress (current phase, elapsed time, iteration count).
+- **Phase Banner**: Terminal UI for phase transitions with Unicode box-drawing characters.
+- **Per-task Cost Breakdown in agent-log.md**: Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
-### Diff Risk Scoring (v8.3.0)
+### Diff Risk Scoring
 `pipeline/scripts/diff-risk-score.mjs` runs at Phase 4 Step 1.75 — before reviewer dispatch. Heuristic, deterministic, sub-second, no LLM. Top-N risk-ranked files inject into each reviewer's prompt as a `${PRIORITY_FILES}` block; reviewers read those files first but still review the entire diff.
-Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
+Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
-### Test Gap Detection (v8.3.0)
+### Test Gap Detection
 `pipeline/scripts/test-gap-scan.mjs` runs at Phase 5 Step 0. Walks the diff for newly added public symbols and reports those with no paired test. Stack-specific rules ship for iOS, Android, Python, Node.js. iOS Views and Android `@Composable` symbols default to `important`; other public API additions to `suggestion`. Optional gating via `prefs.testGap.blockingThreshold` — when set, the report becomes a Phase 4 rework finding once `important + blocking` count exceeds the threshold.
-### Triage Memory (v8.3.0)
+### Triage Memory
 Per-repo append-only JSONL corpus at `~/.claude/memory/multi-agent/<repo-slug>/triage-corpus.jsonl`. Phase 7 ingests every triage output (idempotent), Phase 1 enriches the analysis with similar past tasks, Phase 4 triage attaches prior-art hits to each raw finding with an explicit bias hedge. Token-overlap recall, zero deps, Node-18-compatible. `/multi-agent:search "<text>" --semantic` routes the query to the corpus instead of agent-log grep. Toggle via `prefs.global.priorArtEnrichment.enabled` (default ON).
@@ -268,10 +284,10 @@ Token registry maps logical names (`jira`, `bitbucket`, `github`, `confluence`)
 ## Schemas & Validation
-- `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle (v2.0.0)
-- `pipeline/schemas/prefs.schema.json` — validates preferences (v2.0.0, extended v2.5.0)
-- `pipeline/schemas/triage-output.schema.json` — validates triage output (v2.3.0, hardened v2.6.1, `if`/`then` constraint v3.0.0)
+- `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle
+- `pipeline/schemas/prefs.schema.json` — validates preferences
+- `pipeline/schemas/triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
-## File Layout (v1.9.0)
+## File Layout
 `commands/multi-agent/{add-detail,setup,help}.md` → slash commands. `commands/multi-agent/refs/**` → guides, phase specs, rules (`refs:` prefix signals "not an action"). Modifier flags and ops stay inline in `multi-agent.md`.

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@mmerterden/multi-agent-pipeline",
-  "version": "10.7.4",
+  "version": "10.9.0",
   "description": "8-phase AI development pipeline with full orchestration on Claude Code and Copilot CLI. Analysis, planning, TDD, CLI-aware parallel review with consensus surfacing + Fable triage, default-FAIL evidence gates, secret + intent guards, per-phase cost ledger, persistent learnings memory, wiki generation, commit automation. Token-preserving uninstall.",
   "type": "module",
   "main": "index.js",

package/pipeline/commands/multi-agent/refs/features/verify-by-test.md ADDED Viewed

@@ -0,0 +1,41 @@
+# Feature: Verify-by-Test Triage (Phase 4 Step 3.7)
+**Pattern**: reviewer findings are hypotheses; a failing repro test is proof. Adversarial-review research (Refute-or-Promote, 2026) found that plausible-but-wrong findings survive debate rounds but die on a single empirical test. This step converts the highest-stakes verdicts (accepted blocking) from judgment into evidence before the Phase 3 rework loop fires.
+**Gated by `prefs.global.verifyByTest.enabled`** (default: `false`). When enabled, after triage 3.6 and before Step 4, IF the validated triage output contains at least one `accepted` blocking finding:
+1. Dispatch ONE verifier sub-agent for the iteration (model: `verifyByTest.model`, default `sonnet`)  -  never one dispatch per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings, the diff hunks for their files, and the Phase 1 test conventions.
+2. Per finding, the verifier writes ONE minimal repro test asserting the correct behavior the finding claims is broken, then runs ONLY that test via the Phase 3 single-test invocation (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`) under `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
+3. Stamp each processed finding with a `verification` object (triage-output schema v3.2.0) and re-run `validate-triage.mjs` on the mutated triage file under the standard 3.2.1 gate protocol.
+4. Findings beyond `maxFindings` keep their judgment-only verdict (log `verify_by_test=cap-exceeded`).
+5. The whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600); on breach or verifier crash, remaining findings keep judgment-only verdicts and the pipeline proceeds. Never blocks.
+## Verdict table
+| Repro test outcome | `verification.result` | Action |
+|---|---|---|
+| Fails as the finding predicts | `confirmed` | Finding stays accepted blocking. Repro test KEPT in the worktree, recorded in `redTests[]`. |
+| Passes (not reproducible) | `not-reproduced` | Downgrade gated by `evidence-gate.mjs --claim test --status passed` on the test log (exit 0 required). Finding moves `accepted[]` -> `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`; repro test file deleted. Evidence-gate failure -> treat as `inconclusive`. |
+| Compile error / timeout / not unit-testable | `inconclusive` | Finding stays accepted blocking (judgment stands). Partial test deleted, cause in `verification.note`. |
+Downgrades go to `deferred`, never `rejected`: triage judged the issue real, and deferred items surface in the Phase 7 report for a human eye.
+## Red-test handoff to Phase 3
+`state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`. On the Phase 3 re-entry, each `redTests[]` entry IS the RED step of the rework TDD cycle for its finding: the reflection prompt instructs "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it." The dev agent must not write a duplicate failing test for those findings.
+## Cleanup invariant
+After Step 3.7, the only uncommitted verifier artifacts are the confirmed repro tests listed in `redTests[]` (committed later with the fix) and logs under `$WORKTREE/.pipeline/` (outside Phase 6 commit scope). `not-reproduced` and `inconclusive` test files are always deleted.
+## Telemetry
+One `review.verify_by_test` metric per iteration: `attempted`, `confirmed`, `downgraded`, `inconclusive`, `duration_ms` (plus `tokens_in/out` when available), forwarded to the tracker like all Phase 4 metrics. Timeout emits `triage=verify-by-test-timeout`.
+## Off by default reason
+Adds one Sonnet call plus up to `maxFindings` single-test runs (and build-lock contention on Xcode projects) per review iteration that has accepted blockers. On clean runs it never fires, but on noisy-reviewer repos it can add minutes per iteration. Flip on for security-critical work, release branches, or repos where reviewer false-positive rate is high.
+## Reference
+Wiring: `refs/phases/phase-4-review.md` Step 3.7. Schema: `pipeline/schemas/triage-output.schema.json` v3.2.0 (`$defs.verification`). Evidence gate: `pipeline/scripts/evidence-gate.mjs`. Prefs: `prefs.global.verifyByTest` in `pipeline/schemas/prefs.schema.json`.

package/pipeline/commands/multi-agent/refs/phases/log-format.md CHANGED Viewed

@@ -72,6 +72,16 @@ Verdict: unverified (2 reviewers)
 ## Test Scenarios (Jira)
 (User perspective: Precondition -> Steps -> Expected result)
+## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
+(v10.8.0, appended chronologically by the phase-boundary checkpoint  -  one block per phase transition, the LATEST block is authoritative. Written by the orchestrator from state it already holds; no LLM call, ~15 lines max. `resume.md` Step 3 and post-`/compact` re-grounding read this block FIRST. Format spec: `refs/phases/operations.md`.)
+- Done: {up to 3 bullets of completed outcomes}
+- Remaining: {ordered list of remaining phases / sub-steps}
+- Decisions: {key decisions later phases depend on}
+- Open findings: {accepted-but-unresolved review findings, or "none"}
+- Next: Phase {N+1} {name}, subStep {token or "start"}
 ```
 ### Cost Breakdown  -  emission contract

package/pipeline/commands/multi-agent/refs/phases/operations.md CHANGED Viewed

@@ -93,8 +93,21 @@ This keeps orchestrator context lean and enables programmatic routing.
 **Proactive compaction + phase-boundary checkpoint**: the orchestrator follows ~2,500 lines of phase prose in one session; once context fills, it starts dropping steps  -  the single biggest cause of "it got stuck / skipped a step." Two defenses, both required on full-pipeline runs:
-- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a one-line progress summary to `agent-log.md`. The next phase reads state + log, not the back-conversation  -  so a transition is a clean re-entry point even if context is later compacted.
-- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit  -  it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` to re-ground.
+- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a structured handoff block to `agent-log.md` (format below). The next phase reads state + log, not the back-conversation  -  so a transition is a clean re-entry point even if context is later compacted.
+- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit  -  it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` AND the latest `## Handoff` block in `agent-log.md` to re-ground.
+**Handoff block (v10.8.0)**: the structured artifact the phase-boundary checkpoint appends to `agent-log.md`. Written by the orchestrator from state it already holds  -  no agent dispatch, no extra LLM call. Cap at ~15 lines; the latest block is authoritative (earlier ones are history). This is the fresh-context re-entry contract: a resume or post-compaction session rebuilds working context from the latest handoff + `agent-state.json` + git log, never from conversation memory.
+```markdown
+## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
+- Done: {up to 3 bullets of completed outcomes, e.g. "plan approved, 4 tasks", "build green, 12 tests added"}
+- Remaining: {ordered list of remaining phases / sub-steps}
+- Decisions: {key decisions later phases depend on, e.g. "used existing KeychainStore, no new wrapper"}
+- Open findings: {accepted-but-unresolved review findings, or "none"}
+- Next: Phase {N+1} {name}, subStep {token or "start"}
+```
+Full `agent-log.md` shape: `refs/phases/log-format.md`. Resume-side consumption: `resume.md` Step 3 reads the latest handoff FIRST, then falls back to per-phase findings for logs written before v10.8.
 **Sub-step checkpoints (long phases)**: Phase 3 (dev/TDD cycles) and Phase 7 (report/channels) can run many minutes; a crash mid-phase loses everything since the last phase boundary and forces a full phase re-run on resume. For these phases, also record `state.phases[<n>].subStep` (a short token: `red`, `green`, `build`, `pr-opened`, `confluence-synced`, ...) and the `files[]` written so far after each meaningful unit of work. On resume, re-enter the phase but skip units whose `subStep` is already recorded and whose `files[]` exist in the worktree  -  re-do only the unfinished tail, never the whole phase.

package/pipeline/commands/multi-agent/refs/phases/phase-0-init.md CHANGED Viewed

@@ -98,6 +98,15 @@ Log the resolved tier in the agent log:
 Full chain definition, REST endpoints, URL parsing, canonical-component contract: `pipeline/rules/figma-pipeline.md` "MUST: Figma access - 3-tier fallback chain (BLOCKING, pipeline-wide)". Do not duplicate it here.
+#### Step 0.6 - Update check (advisory, opt-out via `prefs.global.updateCheck.enabled`)
+Run `bash pipeline/scripts/update-check.sh` (cached per `updateCheck.ttlHours`, default 24h; 3s-bounded; every failure path silent). Empty output → continue. Output `<local>|<latest>` means a newer version exists:
+- **Interactive**: log `→ update available: v<local> -> v<latest>`, ask ONE AskUserQuestion  -  **Update now** (recommended) / **Continue without updating**. On *Update now* (or `updateCheck.autoUpdate: true`, which skips the question): run the `/multi-agent:update` flow, log `→ updated to v<latest>`, continue the run (note in the log: already-loaded phase docs finish this run on the old version; full effect next session). On *Continue*: no re-ask until the TTL expires.
+- **Autopilot**: never ask (zero-interaction contract). Log `→ update available: ... (log-only; run /multi-agent:update)` and continue  -  unless `autoUpdate: true`, then update silently first.
+Advisory  -  NEVER blocks. Must run BEFORE Step 6 (worktree creation) so an accepted update cannot mutate `~/.claude` under a mid-phase run.
 #### Step 1  -  Parse Input
 **Branch from input**: If the user provided a branch name after the issue reference (space-separated), store as `baseBranch` and skip Step 3. Otherwise Step 3 asks interactively.

package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md CHANGED Viewed

@@ -49,6 +49,16 @@ When Phase 0 Step 7 classified the task as `component`, Phase 3 delegates the en
 For non-component taskTypes (`bugfix`, `feature`, `refactor`, `chore`), continue with the standard TDD section below.
+#### Design fidelity contract (BLOCKING when the task carries a Figma reference)
+Applies to EVERY task whose analysis doc carries design ground truth (Section 6 rows, node IDs, screenshots)  -  component-dispatch AND TDD-path UI work alike. MCP stays forbidden here (analysis-only rule); the analysis doc IS the design:
+1. **1:1, not interpretation.** The captured frames are the reference for every UI line. Layout, variant states, copy, and colours come from the analysis doc's measured values  -  never eyeballed, never "close enough".
+2. **Code Connect decides the component.** A mapped component (`CodeConnectSnippet` / repo `*.figma.swift` / `*.figma.kt` row) MUST be used verbatim  -  sound-alike substitutes are forbidden; a missing modifier is added to the component's `+Modifiers` extension, never forked. No mapping exists -> write a NEW component per the Configuration/View/Modifiers architecture; never inline ad-hoc UI into the consumer screen.
+3. **Inter-component spacing is part of the design.** Gaps, paddings, and alignment BETWEEN components must match the frame's measured values, mapped to spacing tokens (`Spacing*`)  -  never invented numbers. A spacing/layout deviation is a `review_blocking` finding in Phase 4 Step 1.8, not a nitpick.
+Missing design data (variant, padding, copy) → HALT and instruct the user to re-run `/multi-agent:analysis`; never guess and never call Figma MCP from this phase.
 #### Re-entry from Phase 4 triage
 Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings  -  Fable triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
@@ -94,6 +104,7 @@ For each task (respecting dependency order):
 3. **TDD cycle** (Launch Agent with `model: "sonnet"`):
    **RED  -  Write ONE failing test first:**
+   - **Rework re-entry**: if `state.reviewIterations[-1].verifyByTest.redTests[]` exists (Phase 4 Step 3.7 ran), those failing repro tests ARE the RED step for their findings  -  make them green, do not write a duplicate failing test and do not delete or weaken them. See `refs/features/verify-by-test.md`.
    - Test framework: use whatever the project already uses. Detect from existing test files:
      - `@Test` / `#expect` → Swift Testing
      - `XCTestCase` / `XCTAssert` → XCTest
@@ -115,6 +126,7 @@ For each task (respecting dependency order):
    **GREEN  -  Minimal code to pass:**
    - Smallest change that makes the test green  -  no extras
+   - **Existing tests are immutable**: deleting, renaming, skipping, or weakening an existing assertion to reach green is a violation. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. (Deterministic backstop: the `test_lines_removed` signal in Phase 4 Step 1.75 flags test files that shrink.)
    - Run the same test again → must PASS
    - Run full test suite → no regressions:
      ```bash

package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md CHANGED Viewed

@@ -127,6 +127,7 @@ echo "$RISK_JSON" | node pipeline/scripts/validate-diff-risk.mjs - >/dev/null 2>
 | `migration` | 4.0 | path matches DB schema / migration glob (.sql, /Migrations/, alembic/, prisma/migrations) |
 | `public_api` | 2.0 | added line declares `public func/class/struct/enum`, `@objc`, `open fun`, `@Composable`, or `export function/class/...` |
 | `no_test_change` | 2.5 | source file changed, no paired test file (`{Base}Tests.{ext}`, `{Base}.test.{ext}`, etc.) appears in the diff |
+| `test_lines_removed` | 3.0 | test-classified file whose diff removes more lines than it adds (immutable-test backstop, see `refs/rules.md`) |
 | `complexity_delta` | 1.5 | added control-flow tokens (`if`/`guard`/`switch`/`while`/`for`/ternary/`&&`/`\|\|`) |
 | `ui_critical` | 1.5 | path matches `*View.swift`, `*Screen.kt`, `*Configuration.swift`, etc. |
 | `loc_changed` | 1.0 | base sensitivity to total `+/-` lines |
@@ -204,6 +205,8 @@ Skills are injected into reviewer prompt context  -  the reviewer uses them as r
 **iOS/Swift  -  interaction & convention skills (conditional).** When the diff touches SwiftUI UI files (`*View.swift`, `*Screen.swift`, `*Configuration.swift`, `*+Modifiers.swift`), additionally inject the relevant `figma-common` convention skills as reference for the iOS reviewers: `figma-navigation`, `figma-overlays`, `figma-bottom-sheets` (interaction: emit-intent vs self-route/self-present; native-SwiftUI-first vs the project's `ui.*` custom system), and the enriched `figma-to-swiftui` accessibility rules (minimalism). These back the Step 1.5 iOS convention checks. Generic across SwiftUI projects  -  not tied to any one app. Omit when the diff has no SwiftUI UI changes (keeps the reviewer prompt lean).
+**Module review guides (conditional, all stacks).** A module in the repo may carry its own CLAUDE guide  -  a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths: for each changed file, walk its directory chain up to the repo root and collect `CLAUDE.md`, `*-CLAUDE.md`, and `AGENTS.md` files (root-level ones excluded  -  the host CLI already loads those). Dedupe, cap at 5 (log any dropped). Inject into every reviewer prompt with the directive: read each guide before reviewing and apply its rules/checklist to the changed files under its directory  -  a guide governs only its own subtree, and guide violations are findings triaged by severity like any other. No guides found → no-op, prompt stays lean. Same contract as `/multi-agent:review` Step 2b.
 **Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry Reviewer 1 once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
 #### Output contract  -  reviewer step
@@ -252,6 +255,8 @@ Exit 0 = valid. Exit 2 = contradiction (approved=true with blocking findings)  -
 **CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag  -  reviewers hallucinate, misread scope, or repeat each other. Run Fable triage (Opus on Copilot CLI) to evaluate merged findings against task scope.
+Opt-in empirical layer: when `prefs.global.verifyByTest.enabled` is `true`, accepted blocking findings additionally go through Step 3.7 (verify-by-test), which tries to reproduce each one with a minimal failing test before the Phase 3 rework loop fires. Full wiring: `refs/features/verify-by-test.md`.
 ##### 3.1 Short-circuit: no findings
 If merged findings `length === 0`, **skip triage**: write empty result `{"accepted": [], "deferred": [], "rejected": [], "approved": true}`, log, proceed to Phase 5.
@@ -383,13 +388,19 @@ After the triage verdict is computed, populate `triage.consensus`:
 **Surfacing (Step 4 + Phase 7):** When `verdict` is `split` or `unverified`, the disagreements are shown to the user at the Step 4 checkpoint (interactive modes) and always written to the Phase 7 report. Autopilot does not block on `unverified` (it logs `review.consensus=unverified` and proceeds), matching the maturity-check model  -  but the report records it so a human review can catch it. This is additive: omitting `consensus` is valid and means "not computed."
+##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
+A triage verdict is judgment; a failing repro test is proof. Runs only when `prefs.global.verifyByTest.enabled` is `true` AND `accepted` contains a `blocking` finding; otherwise skip silently. **Full contract (verdict table, cleanup invariant, prompts): `refs/features/verify-by-test.md`  -  read it before executing this step.**
+Compressed flow: dispatch ONE verifier agent (model `verifyByTest.model`, default `sonnet`) for up to `maxFindings` (default 3) accepted blocking findings. Per finding it writes ONE minimal repro test and runs ONLY that test (Phase 3 single-test invocation, build lock, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`). Outcomes: test FAILS as predicted -> `confirmed`, finding stays blocking and the test is KEPT in `redTests[]` as the Phase 3 rework RED test; test PASSES -> `not-reproduced` ONLY if `evidence-gate.mjs --claim test --status passed` exits 0 on the log, finding moves to `deferred[]`, test deleted; compile error / timeout / not unit-testable -> `inconclusive`, judgment verdict stands. Stamp findings with `verification` (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = {attempted, confirmed, downgraded, inconclusive, redTests[]}`, recompute `approved`, re-run `validate-triage.mjs` under the 3.2.1 gate. Whole step bounded by `stepTimeoutSec` (default 600); on breach or crash remaining findings keep judgment verdicts  -  never blocks. Telemetry per 3.4: `review.verify_by_test attempted= confirmed= downgraded= inconclusive= duration_ms=`.
 #### Step 4  -  Consensus + Action (triage-driven)
 If `triage.consensus.verdict` is `split` or `unverified`, surface `consensus.disagreements[]` to the user before acting: interactive modes show the split and ask whether to treat the unverified agreement as a pass (picker-contract: Trust / Re-review / Treat-as-blocking); autopilot logs `review.consensus=<verdict>` and proceeds on the triage verdict. Never silently average a split into a pass.
 Act **only on triage.accepted**:
-- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items)
+- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items). When Step 3.7 ran and `state.reviewIterations[-1].verifyByTest.redTests[]` is non-empty, the reflection prompt cites each red test: "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it."
 - **accepted.important** → fix and re-review
 - **accepted.suggestion** → apply if reasonable
 - **deferred** → append to Phase 7 report as "follow-up items" (do not block)

package/pipeline/commands/multi-agent/refs/rules.md CHANGED Viewed

@@ -52,6 +52,7 @@ This is the single source of truth. When a contributor or model is unsure where
 - **NEVER** commit without passing build (all gates in Phase 4 Step 1 must be green).
 - **NEVER** commit without passing review (at least one AI reviewer must return `approved: true` with no blocking findings).
 - **NEVER** skip tests. Every public method, every error path, every edge case.
+- **NEVER** delete, rename, or weaken an existing test to get a green run. Existing tests are immutable during a task. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. Deterministic backstop: the `test_lines_removed` diff-risk signal (Phase 4 Step 1.75) flags test files that shrink.
 - **Follow existing code style and conventions.** Read neighbor files before writing new ones  -  match naming, structure, import order.
 - **Use design tokens, no magic numbers.** `16` → `.Spacing.spacing16`. `#E31837` → `Color.Primary.primary`. `.font(.system(size: 14))` → `.typographyStyle(.body1)`.
 - **Design system primitives before custom views.** Before writing a new SwiftUI / Compose / React / View / Configuration triplet inside a domain or feature module, grep the shared component library (project-specific path, e.g. `Common/UIComponents/`, `core-ui/`, `packages/ui/`) for an existing primitive that solves the same problem. New domain-level wrappers, custom modals, custom buttons, or hand-rolled toasts are forbidden when the design system already has an equivalent. If the primitive exists but lacks a modifier (placeholder, size, error binding), **add the modifier to the primitive** in its `+Modifiers` extension  -  do not fork the primitive into the consumer domain. The Figma `CodeConnectSnippet` is the authoritative pointer to which primitive to use.