npm - @mmerterden/multi-agent-pipeline - Versions diffs - 10.7.4 → 10.8.0 - Mend

@mmerterden/multi-agent-pipeline 10.7.4 → 10.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

package/CHANGELOG.md +54 -0
package/docs/features.md +13 -1
package/package.json +1 -1
package/pipeline/commands/multi-agent/refs/features/verify-by-test.md +41 -0
package/pipeline/commands/multi-agent/refs/phases/log-format.md +10 -0
package/pipeline/commands/multi-agent/refs/phases/operations.md +15 -2
package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md +2 -0
package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md +33 -1
package/pipeline/commands/multi-agent/refs/rules.md +1 -0
package/pipeline/commands/multi-agent/resume.md +7 -4
package/pipeline/commands/multi-agent/review.md +33 -1
package/pipeline/schemas/diff-risk.schema.json +5 -4
package/pipeline/schemas/prefs.schema.json +35 -0
package/pipeline/schemas/triage-output.schema.json +35 -4
package/pipeline/scripts/README.md +2 -0
package/pipeline/scripts/diff-risk-score.mjs +11 -1
package/pipeline/scripts/fixtures/diff-risk-test-removal.diff +40 -0
package/pipeline/scripts/fixtures/install-layout.tsv +3 -3
package/pipeline/scripts/smoke-diff-risk.sh +30 -1
package/pipeline/scripts/smoke-handoff-contract.sh +92 -0
package/pipeline/scripts/smoke-verify-by-test.sh +148 -0
package/pipeline/scripts/validate-diff-risk.mjs +2 -1
package/pipeline/scripts/validate-triage.mjs +31 -2

package/CHANGELOG.md CHANGED

@@ -14,6 +14,60 @@ Internal file-layout changes that don't affect the slash-command surface are sti
 ---
+## [10.8.0] - 2026-07-02
+Three additive loop-quality features, all sourced from the 2026 agentic-loop research sweep (Anthropic long-running-agent harness guidance + adversarial-review findings): empirical verification of blocking findings, an immutable-test contract, and structured phase-boundary handoffs.
+### Added
+- **Verify-by-test triage (Phase 4 Step 3.7, opt-in via `prefs.global.verifyByTest`).**
+  Accepted blocking findings are no longer judgment-final: one verifier agent
+  (default Sonnet, capped at `maxFindings`=3) writes a minimal repro test per
+  finding and runs ONLY that test. Test fails as predicted -> finding confirmed
+  and the repro test is handed to Phase 3 as the rework RED test
+  (`state.reviewIterations[-1].verifyByTest.redTests[]`). Test passes under
+  `evidence-gate.mjs` -> finding downgraded to `deferred` (never `rejected`).
+  Compile error / timeout -> `inconclusive`, judgment verdict stands. The whole
+  step is timeout-bounded and never blocks the pipeline. Schema
+  `triage-output` v3.2.0 adds the optional per-finding `verification` block;
+  `validate-triage.mjs` validates it. New feature doc
+  `refs/features/verify-by-test.md` + `smoke-verify-by-test.sh` (20 assertions).
+- **Immutable-test rule + `test_lines_removed` diff-risk signal.** New rule in
+  `refs/rules.md` and the Phase 3 GREEN step: deleting, renaming, or weakening
+  an existing test to reach green is a violation; a test changes only when the
+  task changes the spec it encodes, named in the commit body. Deterministic
+  backstop: `diff-risk-score.mjs` v1.1.0 emits `test_lines_removed` (w=3.0)
+  when a test-classified file removes more lines than it adds; wired into the
+  Phase 4 Step 1.75 signal table, `diff-risk.schema.json` v1.1.0, and
+  `validate-diff-risk.mjs`. New fixture `diff-risk-test-removal.diff` +
+  3 new `smoke-diff-risk.sh` assertions.
+- **Structured handoff blocks (fresh-context re-entry discipline).** The
+  phase-boundary checkpoint in `refs/phases/operations.md` now appends a
+  structured `## Handoff` block (Done / Remaining / Decisions / Open findings /
+  Next) to `agent-log.md` at every phase transition - written by the
+  orchestrator from state it already holds, no LLM call, ~15 lines. The
+  post-`/compact` re-grounding and `resume.md` Step 3 read the latest handoff
+  FIRST (state wins on mismatch), falling back to per-phase findings for
+  pre-v10.8 logs. Documented in `log-format.md`; guarded by
+  `smoke-handoff-contract.sh` (13 assertions).
+- **Module review guides in `/multi-agent:review` (Step 2b).** When a changed
+  file's module carries its own convention file (`CLAUDE.md`, `*-CLAUDE.md`,
+  `AGENTS.md` below repo root), the review discovers it deterministically from
+  the diff paths (capped at 5, truncation logged), injects it into every
+  reviewer prompt, and scopes each guide to files under its own directory.
+  Works with a local checkout or via provider API in PR mode.
+### Fixed
+- **`validate-triage.mjs` reviewer enum accepts `fable` and `gpt`.** The runtime
+  validator still rejected the schema-v3.1.0 reviewer values restored in
+  v10.7.4 (`fable` is Reviewer 1 on Claude Code, `gpt` on Copilot CLI), so any
+  accepted finding attributed to them failed the 3.2.1 gate. Enum now matches
+  `triage-output.schema.json`.
+---
 ## [10.7.4] - 2026-07-02
 Deep-consistency sweep: the v10.6.0 Fable restore and the v10.7.0 adapter removal are now reflected on every surface, and the test suite is green again end to end.

package/docs/features.md CHANGED

@@ -152,6 +152,18 @@ After triage returns, output is validated by `validate-triage.mjs`:
 If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with `if`/`then` constraint in schema v3.0.0.
+### Verify-by-Test Triage (Phase 4 Step 3.7, v10.8.0, opt-in)
+A triage verdict is a judgment call; a failing repro test is proof. When `prefs.global.verifyByTest.enabled` is on, one verifier agent (default Sonnet) writes a minimal repro test per accepted blocking finding (cap: `maxFindings`=3) and runs only that test. Fails as predicted -> finding confirmed, the repro test becomes the Phase 3 rework RED test. Passes under `evidence-gate.mjs` -> finding downgraded to `deferred`. Compile error / timeout -> `inconclusive`, judgment stands. Timeout-bounded, never blocks. Full spec: `refs/features/verify-by-test.md`.
+### Immutable-Test Rule + `test_lines_removed` Signal (v10.8.0)
+Existing tests are immutable during a task: deleting, renaming, or weakening an assertion to reach green is a violation (`refs/rules.md`, Phase 3 GREEN step). A test changes only when the task changes the spec it encodes, named in the commit body. Deterministic backstop: `diff-risk-score.mjs` emits `test_lines_removed` (w=3.0) for any test-classified file whose diff removes more lines than it adds.
+### Structured Handoff Blocks (v10.8.0)
+Every phase transition appends a `## Handoff` block (Done / Remaining / Decisions / Open findings / Next) to `agent-log.md` - orchestrator-written from existing state, no LLM call. `/multi-agent:resume` and post-`/compact` re-grounding read the latest handoff first, so long runs re-enter from durable artifacts instead of conversation memory (fresh-context discipline from Anthropic's long-running-agent harness guidance).
 ### Accessibility Code Review (Phase 4 Step 1.5)
 If changes include UI files, reviewers check for:
@@ -211,7 +223,7 @@ Per-phase token budgets prevent runaway sessions. If a phase exceeds its budget,
 `pipeline/scripts/diff-risk-score.mjs` runs at Phase 4 Step 1.75 — before reviewer dispatch. Heuristic, deterministic, sub-second, no LLM. Top-N risk-ranked files inject into each reviewer's prompt as a `${PRIORITY_FILES}` block; reviewers read those files first but still review the entire diff.
-Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
+Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (v10.8.0: test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
 ### Test Gap Detection (v8.3.0)
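
The `test_lines_removed` rule described in this hunk (a test-classified file whose diff removes more lines than it adds) can be sketched as below. This is an illustration only, not the code in `diff-risk-score.mjs`: the classifier regex `TEST_FILE_PATTERN`, the function names, and the shape of the returned objects are all assumptions, and the parser expects git-style input with `diff --git` headers.

```javascript
// Hypothetical test-file classifier; the real patterns used by
// diff-risk-score.mjs are not shown in this diff.
const TEST_FILE_PATTERN =
  /(Tests?\.[A-Za-z]+$)|(\.(test|spec)\.[A-Za-z]+$)|((^|\/)(tests?|__tests__)\/)/;

// Count added and removed lines per file in a git-style unified diff.
function countLinesPerFile(diffText) {
  const stats = new Map();
  let current = null;
  let inHunk = false;
  for (const line of diffText.split("\n")) {
    if (line.startsWith("diff --git ")) {
      // A new file section starts: forget the previous file and hunk.
      current = null;
      inHunk = false;
    } else if (!inHunk && line.startsWith("--- a/")) {
      current = line.slice("--- a/".length);
    } else if (!inHunk && line.startsWith("+++ b/")) {
      current = line.slice("+++ b/".length);
    } else if (line.startsWith("@@")) {
      inHunk = true;
      if (current !== null && !stats.has(current)) {
        stats.set(current, { added: 0, removed: 0 });
      }
    } else if (inHunk && current !== null) {
      if (line.startsWith("+")) stats.get(current).added += 1;
      else if (line.startsWith("-")) stats.get(current).removed += 1;
    }
  }
  return stats;
}

// Emit one signal per test-classified file that shrinks.
// The default weight 3.0 is the value quoted in the changelog.
function testLinesRemovedSignals(diffText, weight = 3.0) {
  const signals = [];
  for (const [file, { added, removed }] of countLinesPerFile(diffText)) {
    if (TEST_FILE_PATTERN.test(file) && removed > added) {
      signals.push({ signal: "test_lines_removed", file, added, removed, weight });
    }
  }
  return signals;
}
```

A source file that shrinks is deliberately not flagged here; only files matched by the (assumed) test classifier contribute to this signal.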

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@mmerterden/multi-agent-pipeline",
-  "version": "10.7.4",
+  "version": "10.8.0",
   "description": "8-phase AI development pipeline with full orchestration on Claude Code and Copilot CLI. Analysis, planning, TDD, CLI-aware parallel review with consensus surfacing + Fable triage, default-FAIL evidence gates, secret + intent guards, per-phase cost ledger, persistent learnings memory, wiki generation, commit automation. Token-preserving uninstall.",
   "type": "module",
   "main": "index.js",

package/pipeline/commands/multi-agent/refs/features/verify-by-test.md ADDED

@@ -0,0 +1,41 @@
+# Feature: Verify-by-Test Triage (Phase 4 Step 3.7)
+**Pattern**: reviewer findings are hypotheses; a failing repro test is proof. Adversarial-review research (Refute-or-Promote, 2026) found that plausible-but-wrong findings survive debate rounds but die on a single empirical test. This step converts the highest-stakes verdicts (accepted blocking) from judgment into evidence before the Phase 3 rework loop fires.
+**Gated by `prefs.global.verifyByTest.enabled`** (default: `false`). When enabled, after triage 3.6 and before Step 4, IF the validated triage output contains at least one `accepted` blocking finding:
+1. Dispatch ONE verifier sub-agent for the iteration (model: `verifyByTest.model`, default `sonnet`)  -  never one dispatch per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings, the diff hunks for their files, and the Phase 1 test conventions.
+2. Per finding, the verifier writes ONE minimal repro test asserting the correct behavior the finding claims is broken, then runs ONLY that test via the Phase 3 single-test invocation (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`) under `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
+3. Stamp each processed finding with a `verification` object (triage-output schema v3.2.0) and re-run `validate-triage.mjs` on the mutated triage file under the standard 3.2.1 gate protocol.
+4. Findings beyond `maxFindings` keep their judgment-only verdict (log `verify_by_test=cap-exceeded`).
+5. The whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600); on breach or verifier crash, remaining findings keep judgment-only verdicts and the pipeline proceeds. Never blocks.
+## Verdict table
+| Repro test outcome | `verification.result` | Action |
+|---|---|---|
+| Fails as the finding predicts | `confirmed` | Finding stays accepted blocking. Repro test KEPT in the worktree, recorded in `redTests[]`. |
+| Passes (not reproducible) | `not-reproduced` | Downgrade gated by `evidence-gate.mjs --claim test --status passed` on the test log (exit 0 required). Finding moves `accepted[]` -> `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`; repro test file deleted. Evidence-gate failure -> treat as `inconclusive`. |
+| Compile error / timeout / not unit-testable | `inconclusive` | Finding stays accepted blocking (judgment stands). Partial test deleted, cause in `verification.note`. |
+Downgrades go to `deferred`, never `rejected`: triage judged the issue real, and deferred items surface in the Phase 7 report for a human eye.
+## Red-test handoff to Phase 3
+`state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`. On the Phase 3 re-entry, each `redTests[]` entry IS the RED step of the rework TDD cycle for its finding: the reflection prompt instructs "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it." The dev agent must not write a duplicate failing test for those findings.
+## Cleanup invariant
+After Step 3.7, the only uncommitted verifier artifacts are the confirmed repro tests listed in `redTests[]` (committed later with the fix) and logs under `$WORKTREE/.pipeline/` (outside Phase 6 commit scope). `not-reproduced` and `inconclusive` test files are always deleted.
+## Telemetry
+One `review.verify_by_test` metric per iteration: `attempted`, `confirmed`, `downgraded`, `inconclusive`, `duration_ms` (plus `tokens_in/out` when available), forwarded to the tracker like all Phase 4 metrics. Timeout emits `triage=verify-by-test-timeout`.
+## Off by default reason
+Adds one Sonnet call plus up to `maxFindings` single-test runs (and build-lock contention on Xcode projects) per review iteration that has accepted blockers. On clean runs it never fires, but on noisy-reviewer repos it can add minutes per iteration. Flip on for security-critical work, release branches, or repos where reviewer false-positive rate is high.
+## Reference
+Wiring: `refs/phases/phase-4-review.md` Step 3.7. Schema: `pipeline/schemas/triage-output.schema.json` v3.2.0 (`$defs.verification`). Evidence gate: `pipeline/scripts/evidence-gate.mjs`. Prefs: `prefs.global.verifyByTest` in `pipeline/schemas/prefs.schema.json`.
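
The verdict table in this new file reduces to a small pure function, sketched below under stated assumptions. The outcome labels (`"failed-as-predicted"`, `"passed"`, and so on), the function name `mapVerdict`, and the returned field names are hypothetical; only the three `verification.result` values and the rule that an evidence-gate failure falls back to `inconclusive` come from the document.

```javascript
// Map a repro-test outcome to a verification result and the action on
// the finding. `outcome` labels are illustrative, not part of the schema:
//   "failed-as-predicted" | "passed" | "compile-error" | "timeout" | "not-testable"
// `evidenceGateExitCode` stands in for the exit code of evidence-gate.mjs
// when run on the test log; it only matters for "passed".
function mapVerdict(outcome, evidenceGateExitCode) {
  if (outcome === "failed-as-predicted") {
    // Finding stays accepted blocking; the repro test becomes a RED test.
    return { result: "confirmed", bucket: "accepted", keepTest: true };
  }
  if (outcome === "passed" && evidenceGateExitCode === 0) {
    // Downgrade goes to deferred, never rejected; the repro test is deleted.
    return { result: "not-reproduced", bucket: "deferred", keepTest: false };
  }
  // Compile error, timeout, not unit-testable, or a pass that the
  // evidence gate would not vouch for: the judgment verdict stands.
  return { result: "inconclusive", bucket: "accepted", keepTest: false };
}
```

Every branch except a gate-verified pass keeps the finding in `accepted`, so the mapping fails toward keeping blockers.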

package/pipeline/commands/multi-agent/refs/phases/log-format.md CHANGED

@@ -72,6 +72,16 @@ Verdict: unverified (2 reviewers)
 ## Test Scenarios (Jira)
 (User perspective: Precondition -> Steps -> Expected result)
+## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
+(v10.8.0, appended chronologically by the phase-boundary checkpoint  -  one block per phase transition, the LATEST block is authoritative. Written by the orchestrator from state it already holds; no LLM call, ~15 lines max. `resume.md` Step 3 and post-`/compact` re-grounding read this block FIRST. Format spec: `refs/phases/operations.md`.)
+- Done: {up to 3 bullets of completed outcomes}
+- Remaining: {ordered list of remaining phases / sub-steps}
+- Decisions: {key decisions later phases depend on}
+- Open findings: {accepted-but-unresolved review findings, or "none"}
+- Next: Phase {N+1} {name}, subStep {token or "start"}
 ```
 ### Cost Breakdown  -  emission contract

package/pipeline/commands/multi-agent/refs/phases/operations.md CHANGED

@@ -93,8 +93,21 @@ This keeps orchestrator context lean and enables programmatic routing.
 **Proactive compaction + phase-boundary checkpoint**: the orchestrator follows ~2,500 lines of phase prose in one session; once context fills, it starts dropping steps  -  the single biggest cause of "it got stuck / skipped a step." Two defenses, both required on full-pipeline runs:
-- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a one-line progress summary to `agent-log.md`. The next phase reads state + log, not the back-conversation  -  so a transition is a clean re-entry point even if context is later compacted.
-- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit  -  it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` to re-ground.
+- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a structured handoff block to `agent-log.md` (format below). The next phase reads state + log, not the back-conversation  -  so a transition is a clean re-entry point even if context is later compacted.
+- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit  -  it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` AND the latest `## Handoff` block in `agent-log.md` to re-ground.
+**Handoff block (v10.8.0)**: the structured artifact the phase-boundary checkpoint appends to `agent-log.md`. Written by the orchestrator from state it already holds  -  no agent dispatch, no extra LLM call. Cap at ~15 lines; the latest block is authoritative (earlier ones are history). This is the fresh-context re-entry contract: a resume or post-compaction session rebuilds working context from the latest handoff + `agent-state.json` + git log, never from conversation memory.
+```markdown
+## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
+- Done: {up to 3 bullets of completed outcomes, e.g. "plan approved, 4 tasks", "build green, 12 tests added"}
+- Remaining: {ordered list of remaining phases / sub-steps}
+- Decisions: {key decisions later phases depend on, e.g. "used existing KeychainStore, no new wrapper"}
+- Open findings: {accepted-but-unresolved review findings, or "none"}
+- Next: Phase {N+1} {name}, subStep {token or "start"}
+```
+Full `agent-log.md` shape: `refs/phases/log-format.md`. Resume-side consumption: `resume.md` Step 3 reads the latest handoff FIRST, then falls back to per-phase findings for logs written before v10.8.
 **Sub-step checkpoints (long phases)**: Phase 3 (dev/TDD cycles) and Phase 7 (report/channels) can run many minutes; a crash mid-phase loses everything since the last phase boundary and forces a full phase re-run on resume. For these phases, also record `state.phases[<n>].subStep` (a short token: `red`, `green`, `build`, `pr-opened`, `confluence-synced`, ...) and the `files[]` written so far after each meaningful unit of work. On resume, re-enter the phase but skip units whose `subStep` is already recorded and whose `files[]` exist in the worktree  -  re-do only the unfinished tail, never the whole phase.
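
Because the handoff block is written from state the orchestrator already holds, with no LLM call, it can be produced by plain string templating. The sketch below shows one way to do that. The function name `renderHandoff` and the shape of its input object are assumptions for illustration, and joining several items with `"; "` on one line is a simplification of the "up to 3 bullets" wording.

```javascript
// Render a "## Handoff" block from already-known state.
// The input field names (done, remaining, decisions, openFindings, next)
// are hypothetical; only the output line format follows the template above.
function renderHandoff({ phase, name, timestamp, done, remaining, decisions, openFindings, next }) {
  const list = (items, empty) => (items.length > 0 ? items.join("; ") : empty);
  return [
    `## Handoff - end of Phase ${phase} (${name}) - ${timestamp}`,
    `- Done: ${list(done.slice(0, 3), "none")}`, // cap at 3 completed outcomes
    `- Remaining: ${list(remaining, "none")}`,
    `- Decisions: ${list(decisions, "none")}`,
    `- Open findings: ${list(openFindings, "none")}`,
    `- Next: Phase ${next.phase} ${next.name}, subStep ${next.subStep || "start"}`,
  ].join("\n");
}

// Example usage with made-up values:
const exampleHandoff = renderHandoff({
  phase: 3,
  name: "dev",
  timestamp: "2026-07-02T10:00:00Z",
  done: ["build green", "12 tests added"],
  remaining: ["Phase 4 review", "Phase 5"],
  decisions: ["used existing KeychainStore, no new wrapper"],
  openFindings: [],
  next: { phase: 4, name: "review", subStep: "" },
});
```

The rendered block is six lines, comfortably inside the ~15 line cap the spec sets.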

package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md CHANGED

@@ -94,6 +94,7 @@ For each task (respecting dependency order):
 3. **TDD cycle** (Launch Agent with `model: "sonnet"`):
    **RED  -  Write ONE failing test first:**
+   - **Rework re-entry**: if `state.reviewIterations[-1].verifyByTest.redTests[]` exists (Phase 4 Step 3.7 ran), those failing repro tests ARE the RED step for their findings  -  make them green, do not write a duplicate failing test and do not delete or weaken them. See `refs/features/verify-by-test.md`.
    - Test framework: use whatever the project already uses. Detect from existing test files:
      - `@Test` / `#expect` → Swift Testing
      - `XCTestCase` / `XCTAssert` → XCTest
@@ -115,6 +116,7 @@ For each task (respecting dependency order):
    **GREEN  -  Minimal code to pass:**
    - Smallest change that makes the test green  -  no extras
+   - **Existing tests are immutable**: deleting, renaming, skipping, or weakening an existing assertion to reach green is a violation. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. (Deterministic backstop: the `test_lines_removed` signal in Phase 4 Step 1.75 flags test files that shrink.)
    - Run the same test again → must PASS
    - Run full test suite → no regressions:
      ```bash

package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md CHANGED

@@ -127,6 +127,7 @@ echo "$RISK_JSON" | node pipeline/scripts/validate-diff-risk.mjs - >/dev/null 2>
 | `migration` | 4.0 | path matches DB schema / migration glob (.sql, /Migrations/, alembic/, prisma/migrations) |
 | `public_api` | 2.0 | added line declares `public func/class/struct/enum`, `@objc`, `open fun`, `@Composable`, or `export function/class/...` |
 | `no_test_change` | 2.5 | source file changed, no paired test file (`{Base}Tests.{ext}`, `{Base}.test.{ext}`, etc.) appears in the diff |
+| `test_lines_removed` | 3.0 | test-classified file whose diff removes more lines than it adds (immutable-test backstop, see `refs/rules.md`) |
 | `complexity_delta` | 1.5 | added control-flow tokens (`if`/`guard`/`switch`/`while`/`for`/ternary/`&&`/`\|\|`) |
 | `ui_critical` | 1.5 | path matches `*View.swift`, `*Screen.kt`, `*Configuration.swift`, etc. |
 | `loc_changed` | 1.0 | base sensitivity to total `+/-` lines |
@@ -204,6 +205,8 @@ Skills are injected into reviewer prompt context  -  the reviewer uses them as r
 **iOS/Swift  -  interaction & convention skills (conditional).** When the diff touches SwiftUI UI files (`*View.swift`, `*Screen.swift`, `*Configuration.swift`, `*+Modifiers.swift`), additionally inject the relevant `figma-common` convention skills as reference for the iOS reviewers: `figma-navigation`, `figma-overlays`, `figma-bottom-sheets` (interaction: emit-intent vs self-route/self-present; native-SwiftUI-first vs the project's `ui.*` custom system), and the enriched `figma-to-swiftui` accessibility rules (minimalism). These back the Step 1.5 iOS convention checks. Generic across SwiftUI projects  -  not tied to any one app. Omit when the diff has no SwiftUI UI changes (keeps the reviewer prompt lean).
+**Module review guides (conditional, all stacks).** A module in the repo may carry its own CLAUDE guide  -  a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths: for each changed file, walk its directory chain up to the repo root and collect `CLAUDE.md`, `*-CLAUDE.md`, and `AGENTS.md` files (root-level ones excluded  -  the host CLI already loads those). Dedupe, cap at 5 (log any dropped). Inject into every reviewer prompt with the directive: read each guide before reviewing and apply its rules/checklist to the changed files under its directory  -  a guide governs only its own subtree, and guide violations are findings triaged by severity like any other. No guides found → no-op, prompt stays lean. Same contract as `/multi-agent:review` Step 2b.
 **Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry Reviewer 1 once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
 #### Output contract  -  reviewer step
@@ -252,6 +255,8 @@ Exit 0 = valid. Exit 2 = contradiction (approved=true with blocking findings)  -
 **CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag  -  reviewers hallucinate, misread scope, or repeat each other. Run Fable triage (Opus on Copilot CLI) to evaluate merged findings against task scope.
+Opt-in empirical layer: when `prefs.global.verifyByTest.enabled` is `true`, accepted blocking findings additionally go through Step 3.7 (verify-by-test), which tries to reproduce each one with a minimal failing test before the Phase 3 rework loop fires. Full wiring: `refs/features/verify-by-test.md`.
 ##### 3.1 Short-circuit: no findings
 If merged findings `length === 0`, **skip triage**: write empty result `{"accepted": [], "deferred": [], "rejected": [], "approved": true}`, log, proceed to Phase 5.
@@ -383,13 +388,40 @@ After the triage verdict is computed, populate `triage.consensus`:
 **Surfacing (Step 4 + Phase 7):** When `verdict` is `split` or `unverified`, the disagreements are shown to the user at the Step 4 checkpoint (interactive modes) and always written to the Phase 7 report. Autopilot does not block on `unverified` (it logs `review.consensus=unverified` and proceeds), matching the maturity-check model  -  but the report records it so a human review can catch it. This is additive: omitting `consensus` is valid and means "not computed."
+##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
+**Rationale:** a triage verdict is still a judgment call; a failing repro test is proof. Debating a finding costs tokens, running it costs one test invocation and kills false positives that survive adversarial framing.
+**Gate:** runs only when `prefs.global.verifyByTest.enabled` is `true` AND the validated triage output contains at least one `accepted` finding with `severity: "blocking"`. Otherwise skip silently (no log noise). Full behavior spec: `refs/features/verify-by-test.md`.
+1. **Dispatch ONE verifier agent** (subagent_type: `general-purpose`, model: `prefs.global.verifyByTest.model`, default `sonnet`) for the iteration  -  not one per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings (ordered as triage returned them), the diff hunks for their files, and the project's test conventions from Phase 1. Findings beyond the cap keep their judgment-only verdict; log `verify_by_test=cap-exceeded count=<n>`.
+2. **Per finding, the verifier writes ONE minimal repro test** asserting the CORRECT behavior the finding claims is broken, following the platform test conventions (framework, naming, location per phase-3-dev.md), then runs ONLY that test using the platform single-test invocation from Phase 3 (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`), wrapped in `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
+3. **Verdict mapping (fails toward keeping blockers):**
+| Repro test outcome | `verification.result` | Action on finding |
+| --- | --- | --- |
+| Test FAILS as the finding predicts | `confirmed` | Stays `accepted` blocking. Repro test is KEPT in the worktree and recorded in `redTests[]` as the RED test for the Phase 3 rework loop. |
+| Test PASSES (defect not reproducible) | `not-reproduced` | Downgrade is evidence-gated: `node pipeline/scripts/evidence-gate.mjs --claim test --status passed --evidence "$WORKTREE/.pipeline/verify-<i>.test.log"` must exit 0. On exit 0: move finding from `accepted[]` to `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`, delete the repro test file. On exit non-zero: treat as `inconclusive`. |
+| Compile error, timeout, or defect not expressible as a unit test | `inconclusive` | Stays `accepted` blocking (judgment-only verdict stands). Partial test file deleted, cause noted in `verification.note`. |
+4. **Cleanup invariant:** after Step 3.7 the only verifier artifacts left in the worktree are the confirmed repro tests listed in `redTests[]` (they get committed with the fix, satisfying the TDD contract) and logs under `$WORKTREE/.pipeline/` (already outside Phase 6 commit scope).
+5. **Persist + re-validate:** stamp each verified finding with a `verification` object (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`, recompute `approved`, then re-run `validate-triage.mjs` on the mutated `$TRIAGE_FILE` under the same 3.2.1 gate protocol.
+6. **Timeout/fallback (mirrors 3.3):** the whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600). On verifier crash or budget breach: no retry; remaining findings keep judgment-only verdicts, log `triage=verify-by-test-timeout`, proceed to Step 4. Never blocks the pipeline.
+Telemetry (per 3.4 conventions, best-effort):
+```bash
+LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.verify_by_test \
+  attempted=$A confirmed=$C downgraded=$D inconclusive=$I duration_ms=$MS
+```
 #### Step 4  -  Consensus + Action (triage-driven)
 If `triage.consensus.verdict` is `split` or `unverified`, surface `consensus.disagreements[]` to the user before acting: interactive modes show the split and ask whether to treat the unverified agreement as a pass (picker-contract: Trust / Re-review / Treat-as-blocking); autopilot logs `review.consensus=<verdict>` and proceeds on the triage verdict. Never silently average a split into a pass.
 Act **only on triage.accepted**:
-- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items)
+- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items). When Step 3.7 ran and `state.reviewIterations[-1].verifyByTest.redTests[]` is non-empty, the reflection prompt cites each red test: "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it."
 - **accepted.important** → fix and re-review
 - **accepted.suggestion** → apply if reasonable
 - **deferred** → append to Phase 7 report as "follow-up items" (do not block)
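
Step 3.7 items 1, 3 and 5 (cap at `maxFindings`, move not-reproduced findings to `deferred[]`, recompute `approved`) can be sketched as one pure transformation over the triage object. This is a rough illustration, not the pipeline's implementation: the function name, the `verdicts` lookup keyed by a finding `id`, and the `id`, `issue` and `reason` fields are assumptions. The `approved` rule used here (true only when no blocking finding remains accepted) follows the validator behavior described in `docs/features.md`.

```javascript
// Apply verifier results to a triage object and build the summary that
// would be persisted as verifyByTest. `verdicts` is a hypothetical lookup:
//   { [findingId]: { result, file, testRef } }
// where result is "confirmed" | "not-reproduced" | "inconclusive".
function applyVerifyByTest(triage, verdicts, maxFindings = 3) {
  const summary = { attempted: 0, confirmed: 0, downgraded: 0, inconclusive: 0, redTests: [] };
  const accepted = [];
  const deferred = [...triage.deferred];
  let blockingSeen = 0;
  let capExceeded = 0;

  for (const finding of triage.accepted) {
    if (finding.severity !== "blocking") { accepted.push(finding); continue; }
    blockingSeen += 1;
    if (blockingSeen > maxFindings) {
      // Beyond the cap: the judgment-only verdict stands.
      capExceeded += 1;
      accepted.push(finding);
      continue;
    }
    const verdict = verdicts[finding.id];
    if (!verdict) { accepted.push(finding); continue; } // verifier never reached it

    summary.attempted += 1;
    const stamped = { ...finding, verification: { result: verdict.result } };
    if (verdict.result === "confirmed") {
      summary.confirmed += 1;
      summary.redTests.push({ file: verdict.file, testRef: verdict.testRef, issue: finding.issue });
      accepted.push(stamped);
    } else if (verdict.result === "not-reproduced") {
      summary.downgraded += 1;
      deferred.push({
        ...stamped,
        reason: `verify-by-test: not reproduced - repro test ${verdict.testRef} passed`,
      });
    } else {
      summary.inconclusive += 1;
      accepted.push(stamped);
    }
  }

  // approved is true only when no blocking finding remains accepted.
  const approved = !accepted.some((f) => f.severity === "blocking");
  return { triage: { ...triage, accepted, deferred, approved }, summary, capExceeded };
}
```

The function never touches `rejected[]`, which matches the rule that a downgrade lands in `deferred`, never `rejected`.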

package/pipeline/commands/multi-agent/refs/rules.md CHANGED

@@ -52,6 +52,7 @@ This is the single source of truth. When a contributor or model is unsure where
 - **NEVER** commit without passing build (all gates in Phase 4 Step 1 must be green).
 - **NEVER** commit without passing review (at least one AI reviewer must return `approved: true` with no blocking findings).
 - **NEVER** skip tests. Every public method, every error path, every edge case.
+- **NEVER** delete, rename, or weaken an existing test to get a green run. Existing tests are immutable during a task. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. Deterministic backstop: the `test_lines_removed` diff-risk signal (Phase 4 Step 1.75) flags test files that shrink.
 - **Follow existing code style and conventions.** Read neighbor files before writing new ones  -  match naming, structure, import order.
 - **Use design tokens, no magic numbers.** `16` → `.Spacing.spacing16`. `#E31837` → `Color.Primary.primary`. `.font(.system(size: 14))` → `.typographyStyle(.body1)`.
 - **Design system primitives before custom views.** Before writing a new SwiftUI / Compose / React / View / Configuration triplet inside a domain or feature module, grep the shared component library (project-specific path, e.g. `Common/UIComponents/`, `core-ui/`, `packages/ui/`) for an existing primitive that solves the same problem. New domain-level wrappers, custom modals, custom buttons, or hand-rolled toasts are forbidden when the design system already has an equivalent. If the primitive exists but lacks a modifier (placeholder, size, error binding), **add the modifier to the primitive** in its `+Modifiers` extension  -  do not fork the primitive into the consumer domain. The Figma `CodeConnectSnippet` is the authoritative pointer to which primitive to use.

package/pipeline/commands/multi-agent/resume.md CHANGED

@@ -23,10 +23,13 @@ Resume a paused or failed task from the last successful phase.
    - `haltReason`  -  if set, show it so the user knows why the run stopped; clear it on successful re-entry
    - `autopilot`  -  preserve the mode
-3. **Load context**  -  read prior-phase findings from `agent-log.md`:
-   - Phase 1 analysis → use it from Phase 2+
-   - Phase 2 plan → use it from Phase 3+
-   - Phase 3 code → already in the worktree
+3. **Load context**  -  rebuild working context from durable artifacts, never from conversation memory:
+   - **Handoff first (v10.8.0)**: read the LATEST `## Handoff` block in `agent-log.md`  -  it carries done/remaining/decisions/open-findings and the exact re-entry point (phase + subStep). When present, it is the primary context source; cross-check its `Next:` line against `state.currentPhase` and trust state on mismatch (state is the machine truth, handoff is the narrative).
+   - Fall back to per-phase findings for logs written before v10.8 (no handoff blocks):
+     - Phase 1 analysis → use it from Phase 2+
+     - Phase 2 plan → use it from Phase 3+
+     - Phase 3 code → already in the worktree
+   - Recent `git log --oneline -10` in the worktree grounds what was actually committed vs. claimed.
 4. **Continue the pipeline**  -  start from the next phase (same pipeline as the main multi-agent command).
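
The handoff-first rule in Step 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the helper names `latestHandoff` and `resolveReentry` are hypothetical, `state.currentPhase` is assumed to be a number, and the `Next:` line is assumed to contain the text `Phase <n>` (the diff names the five handoff fields but not their exact format).

```javascript
// Hypothetical helpers. The "Next: Phase <n>, ..." format is an assumption.

// Return the fields of the LATEST "## Handoff" block in the log text, or
// null when the log has no handoff blocks (logs written before v10.8).
function latestHandoff(logText) {
  const lines = logText.split("\n");
  let start = -1;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("## Handoff")) start = i; // last match wins
  }
  if (start === -1) return null;

  const fields = {};
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) break; // the next section ends the block
    const m = lines[i].match(/^- (Done|Remaining|Decisions|Open findings|Next):\s*(.*)$/);
    if (m) fields[m[1]] = m[2];
  }
  return fields;
}

// Pick the re-entry phase. State is the machine truth: when the handoff's
// "Next:" line disagrees with state.currentPhase, state wins and the
// mismatch is reported so the caller can surface it.
function resolveReentry(state, handoff) {
  if (handoff === null) {
    return { phase: state.currentPhase, source: "per-phase-fallback", mismatch: false, context: null };
  }
  const m = /Phase\s+(\d+)/.exec(handoff["Next"] || "");
  const handoffPhase = m ? Number(m[1]) : null;
  const mismatch = handoffPhase !== null && handoffPhase !== state.currentPhase;
  return { phase: state.currentPhase, source: "handoff", mismatch, context: handoff };
}
```

In this sketch the handoff only supplies narrative context; the phase to resume from always comes from state, which matches the "trust state on mismatch" rule above.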

package/pipeline/commands/multi-agent/review.md CHANGED

@@ -109,6 +109,38 @@ The `credential-store.sh` wrapper handles macOS Keychain (`security`), Linux lib
 Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers can re-read it.
+### 2b. Module review guides  -  path-scoped convention files
+A module in the repo may carry its own CLAUDE guide  -  a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths:
+```bash
+# Changed file paths from the patch:
+grep -E '^\+\+\+ b/' "$DIFF_FILE" | sed 's|^+++ b/||' | sort -u > /tmp/multi-agent-review-${TASK_ID}-paths.txt
+# For each changed path, walk its directory chain up to the repo root and
+# collect guide files matching: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
+# Root-level CLAUDE.md/AGENTS.md are excluded  -  the host CLI already loads those.
+guides=()
+while IFS= read -r p; do
+  d=$(dirname "$p")
+  while [ "$d" != "." ] && [ "$d" != "/" ]; do
+    for g in "$d"/CLAUDE.md "$d"/*-CLAUDE.md "$d"/AGENTS.md; do
+      [ -e "$g" ] && guides+=("$g")
+    done
+    d=$(dirname "$d")
+  done
+done < /tmp/multi-agent-review-${TASK_ID}-paths.txt
+# dedupe, cap at 5 (log any dropped so truncation is never silent)
+```
+Existence checks are resolved against the local checkout when the cwd is the target repo. In PR mode without a local checkout, probe the candidate paths via the provider API instead (`gh api /repos/{o}/{r}/contents/{path}?ref={headSha}` / Bitbucket `GET /projects/{KEY}/repos/{slug}/browse/{path}?at={headSha}`) and fetch the matching files' raw content the same way. No hit → step is a silent no-op.
+Persist `agent-state.review.moduleGuides = [<repo-relative paths>]` and inject into every reviewer prompt (Step 3):
+> MODULE REVIEW GUIDES: before reviewing, read each of these guide files. Apply a guide's rules/checklist to every changed file under its directory. Guide violations are findings like any other  -  triage them by severity.
+Scope note: a guide governs only files under its own directory  -  a guide found under one module must not be applied to a sibling module's changes in the same PR.
 ### 3. Launch parallel reviewers  -  host-CLI dependent
 **Claude Code (2 in parallel):**
@@ -120,7 +152,7 @@ Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers ca
 - Agent 2: `gpt-5.4` → edge cases, alternate perspective
 - Agent 3: `claude-sonnet-4-6` → general quality
-Each reviewer receives the diff plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
+Each reviewer receives the diff, the module review guides from Step 2b (when any were found), plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
 ### 4. Store-compliance cross-reference
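
The shell snippet in Step 2b leaves the "dedupe, cap at 5" step as a comment. A minimal sketch of the whole discovery, including that step and the scope rule, might look like this. All names here are hypothetical: `listDir(dir)` is an assumed callback that returns the file names in a repo-relative directory, standing in for either a local directory read or a provider-API listing.

```javascript
// Hypothetical sketch; listDir is an assumed callback, not part of the package.

// Guide file names per Step 2b: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
function isGuideName(name) {
  return name === "CLAUDE.md" || name === "AGENTS.md" || name.endsWith("-CLAUDE.md");
}

// Walk each changed path's directory chain upward, excluding the repo root
// (the host CLI already loads root-level guides), then dedupe and cap.
function discoverGuides(changedPaths, listDir, cap = 5) {
  const seen = new Set();
  for (const p of changedPaths) {
    const parts = p.split("/").slice(0, -1); // directories only, file name dropped
    for (let depth = parts.length; depth >= 1; depth--) {
      const dir = parts.slice(0, depth).join("/");
      for (const name of listDir(dir)) {
        if (isGuideName(name)) seen.add(`${dir}/${name}`);
      }
    }
  }
  const all = [...seen]; // a Set dedupes and keeps first-seen order
  // Return the dropped guides too, so truncation is never silent.
  return { guides: all.slice(0, cap), dropped: all.slice(cap) };
}

// Scope rule: a guide governs only files under its own directory.
function guideApplies(guidePath, filePath) {
  const dir = guidePath.slice(0, guidePath.lastIndexOf("/") + 1); // keeps trailing "/"
  return filePath.startsWith(dir);
}
```

Keeping the trailing slash in `guideApplies` prevents a guide in `app/auth/` from matching a sibling such as `app/authz/`.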

package/pipeline/schemas/diff-risk.schema.json CHANGED

@@ -1,16 +1,16 @@
 {
   "$schema": "https://json-schema.org/draft/2020-12/schema",
   "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/diff-risk.schema.json",
-  "version": "1.0.0",
+  "version": "1.1.0",
   "title": "Multi-Agent Pipeline  -  Phase 4 diff risk score",
-  "description": "Output contract for diff-risk-score.mjs. Heuristic, deterministic, no LLM. Produced before Phase 4 Step 2 to give reviewer prompts a priority ordering  -  never used as a gate.",
+  "description": "Output contract for diff-risk-score.mjs. Heuristic, deterministic, no LLM. Produced before Phase 4 Step 2 to give reviewer prompts a priority ordering  -  never used as a gate. v1.1.0 adds the test_lines_removed signal (immutable-test backstop: a test file whose diff removes more lines than it adds).",
   "type": "object",
   "additionalProperties": false,
   "required": ["schemaVersion", "task", "totals", "files"],
   "properties": {
     "schemaVersion": {
       "type": "string",
-      "const": "1.0.0"
+      "const": "1.1.0"
     },
     "task": {
       "type": "object",
@@ -63,7 +63,8 @@
                     "no_test_change",
                     "complexity_delta",
                     "ui_critical",
-                    "migration"
+                    "migration",
+                    "test_lines_removed"
                   ]
                 },
                 "weight": { "type": "number" },

package/pipeline/schemas/prefs.schema.json CHANGED

@@ -701,6 +701,41 @@
           "default": false,
           "description": "v6.1.0+ \u2014 Phase 4 Step 2.5 rebuttal round. When reviewers disagree (mixed blocker/approved verdict), each reviewer is re-prompted with the others' opposing arguments for one additional round before triage. Lifts signal quality on ambiguous findings at ~1\u00d7 Step 2 token cost. Off by default \u2014 flip for security-critical or release-branch reviews."
         },
+        "verifyByTest": {
+          "type": "object",
+          "additionalProperties": false,
+          "description": "v10.8+ - Phase 4 Step 3.7 verify-by-test. When enabled, accepted BLOCKING findings are empirically validated before the Phase 3 rework loop: one verifier agent writes a minimal repro test per finding and runs only that test. Confirmed findings hand their failing test to Phase 3 as the RED step; non-reproducible findings are downgraded to deferred under evidence-gate. Only blocking findings are ever verified (fixed behavior, not a knob). Adds one model call plus up to maxFindings single-test runs per iteration with accepted blockers; default off. Flip on for security-critical work, release branches, or repos with noisy reviewers. Full spec: refs/features/verify-by-test.md.",
+          "properties": {
+            "enabled": {
+              "type": "boolean",
+              "default": false,
+              "description": "Master switch."
+            },
+            "maxFindings": {
+              "type": "integer",
+              "minimum": 1,
+              "maximum": 10,
+              "default": 3,
+              "description": "Max accepted blocking findings verified per review iteration. Findings beyond the cap keep their judgment-only verdict."
+            },
+            "model": {
+              "type": "string",
+              "enum": [
+                "sonnet",
+                "opus"
+              ],
+              "default": "sonnet",
+              "description": "Verifier agent model. Writing a minimal repro test is mechanical work; Sonnet is the cost-sane default."
+            },
+            "stepTimeoutSec": {
+              "type": "integer",
+              "minimum": 60,
+              "maximum": 1800,
+              "default": 600,
+              "description": "Wall-clock budget for the whole Step 3.7 pass. On breach, remaining findings keep judgment-only verdicts and the pipeline proceeds (never blocks)."
+            }
+          }
+        },
         "review": {
           "type": "object",
           "additionalProperties": false,

package/pipeline/schemas/triage-output.schema.json CHANGED

@@ -1,9 +1,9 @@
 {
   "$schema": "https://json-schema.org/draft/2020-12/schema",
   "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/triage-output.schema.json",
-  "version": "3.1.0",
+  "version": "3.2.0",
   "title": "Multi-Agent Pipeline  -  Phase 4 triage output",
-  "description": "Contract for the Opus triage agent's JSON output in Phase 4 Step 3. Triage consumes merged reviewer findings and splits them into accepted/deferred/rejected. Only `accepted` blocking/important items trigger Phase 3 rework. v3.1.0 adds the optional `consensus` block so triage can surface reviewer-agreement risk (false consensus among same-base-model reviewers) instead of silently merging.",
+  "description": "Contract for the Opus triage agent's JSON output in Phase 4 Step 3. Triage consumes merged reviewer findings and splits them into accepted/deferred/rejected. Only `accepted` blocking/important items trigger Phase 3 rework. v3.1.0 adds the optional `consensus` block so triage can surface reviewer-agreement risk (false consensus among same-base-model reviewers) instead of silently merging. v3.2.0 adds the optional per-finding `verification` block written by Phase 4 Step 3.7 (verify-by-test): the empirical repro-test outcome for accepted blocking findings.",
   "type": "object",
   "additionalProperties": false,
   "required": ["accepted", "deferred", "rejected", "approved"],
@@ -114,6 +114,35 @@
         }
       }
     },
+    "verification": {
+      "type": "object",
+      "additionalProperties": false,
+      "description": "v3.2.0 verify-by-test outcome (Phase 4 Step 3.7, opt-in via prefs.global.verifyByTest). confirmed = repro test failed as the finding predicts (finding stands, test kept as the Phase 3 RED test); not-reproduced = repro test passed under evidence-gate (finding downgraded to deferred); inconclusive = compile error / timeout / not unit-testable (judgment verdict stands).",
+      "required": ["result"],
+      "properties": {
+        "result": {
+          "type": "string",
+          "enum": ["confirmed", "not-reproduced", "inconclusive"]
+        },
+        "testRef": {
+          "type": "string",
+          "minLength": 1,
+          "description": "Single-test reference, e.g. 'AuthTests/LoginTests/testExpiredTokenRejected' or 'tests/test_auth.py::test_expired_token'."
+        },
+        "evidencePath": {
+          "type": "string",
+          "minLength": 1,
+          "description": "Path to the test-run log verified by evidence-gate.mjs, e.g. '.pipeline/verify-1.test.log'."
+        },
+        "note": { "type": "string" }
+      },
+      "if": {
+        "properties": { "result": { "enum": ["confirmed", "not-reproduced"] } }
+      },
+      "then": {
+        "required": ["result", "testRef", "evidencePath"]
+      }
+    },
     "rawFinding": {
       "type": "object",
       "additionalProperties": false,
@@ -124,7 +153,8 @@
         "line": { "type": "integer", "minimum": 0 },
         "issue": { "type": "string", "minLength": 4 },
         "fix": { "type": "string" },
-        "reviewer": { "$ref": "#/$defs/reviewer" }
+        "reviewer": { "$ref": "#/$defs/reviewer" },
+        "verification": { "$ref": "#/$defs/verification" }
       }
     },
     "acceptedFinding": {
@@ -144,7 +174,8 @@
               "type": "string",
               "minLength": 4,
               "description": "Concrete change the dev agent must make. Required for accepted items so Phase 3 re-entry has actionable direction."
-            }
+            },
+            "verification": { "$ref": "#/$defs/verification" }
           }
         }
       ]

package/pipeline/scripts/README.md CHANGED

@@ -22,6 +22,8 @@ Validate contracts. Each emits `══ <name> smoke: N passed, M failed ══`
 - `smoke-phase-6-multi.sh`  -  Phase 6 multi-repo commit/PR cross-linking
 - `smoke-phase-banner.sh` + `smoke-phase-tracker.sh`  -  Phase UI output contracts
 - `smoke-phase4-triage.sh`  -  Phase 4 reviewer → triage flow
+- `smoke-verify-by-test.sh`  -  Phase 4 Step 3.7 verify-by-test contract (v10.8.0)
+- `smoke-handoff-contract.sh`  -  phase-boundary structured handoff + handoff-first resume (v10.8.0)
 ### Schema + state
 - `smoke-schema-validation.sh`  -  all JSON schemas validate

package/pipeline/scripts/diff-risk-score.mjs CHANGED

@@ -15,6 +15,7 @@
  *   complexity_delta  -  added if/guard/case/switch/while count     w=1.5
  *   ui_critical       -  *View.swift / *Screen.kt / Configuration   w=1.5
  *   migration         -  DB schema / migration path                 w=4.0
+ *   test_lines_removed -  test file shrinks (removed > added)       w=3.0
  *
  * Inputs:
  *   --base <ref>     Base ref. Default: origin/main, fallback: main
@@ -275,6 +276,15 @@ function buildRow(stat, addedLines, allChangedPaths) {
     }
   }
+  // Test-lines-removed: a test-classified file whose diff removes more lines
+  // than it adds. Shrinking tests is the classic get-to-green shortcut the
+  // immutable-test rule forbids (refs/rules.md); surface it to reviewers.
+  if (isTestPath(path) && stat.removed > stat.added) {
+    const w = 3.0;
+    signals.push({ name: "test_lines_removed", weight: w, value: stat.removed - stat.added });
+    score += 12 * w;
+  }
   return {
     path,
     score: Math.round(score * 100) / 100,
@@ -306,7 +316,7 @@ function main() {
   };
   const out = {
-    schemaVersion: "1.0.0",
+    schemaVersion: "1.1.0",
     task: {
       id: TASK_ID,
       base: BASE || "(diff-file)",
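
To make the new signal's arithmetic concrete, here is a small standalone sketch of the `test_lines_removed` rule. It is not the script's code: `isTestPathSketch` is a hypothetical stand-in for the real `isTestPath()` classifier (which this diff does not show), and `testLinesRemovedSignal` is an illustrative name. The weight (3.0), the value (`removed - added`), and the score contribution (`12 * weight`) are taken from the hunk above.

```javascript
// Hypothetical stand-in for isTestPath(): matches a directory or file whose
// name ends in "Test"/"Tests". The real classifier may differ.
function isTestPathSketch(path) {
  return /(^|\/)[^/]*Tests?\//.test(path) || /Tests?\.[A-Za-z]+$/.test(path);
}

// Return the signal and its score contribution, or null when the rule does
// not fire (not a test file, or the file did not shrink).
function testLinesRemovedSignal(path, stat) {
  if (!isTestPathSketch(path) || stat.removed <= stat.added) return null;
  const weight = 3.0;
  return {
    signal: { name: "test_lines_removed", weight, value: stat.removed - stat.added },
    scoreDelta: 12 * weight, // same contribution as in buildRow above
  };
}

// Counts matching the test-removal fixture: 18 lines removed, 2 added.
const hit = testLinesRemovedSignal("MyAppTests/LoginViewModelTests.swift", { removed: 18, added: 2 });
console.log(hit.signal.value, hit.scoreDelta); // 16 36
```

Because the contribution is a flat `12 * 3.0 = 36` regardless of how many lines were removed, the `value` field is the only place the size of the shrink shows up.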

package/pipeline/scripts/fixtures/diff-risk-test-removal.diff ADDED

@@ -0,0 +1,40 @@
+diff --git a/MyAppTests/LoginViewModelTests.swift b/MyAppTests/LoginViewModelTests.swift
+index 1111111..2222222 100644
+--- a/MyAppTests/LoginViewModelTests.swift
++++ b/MyAppTests/LoginViewModelTests.swift
+@@ -10,30 +10,20 @@ final class LoginViewModelTests: XCTestCase {
+     func testLoginWithValidCredentials_Succeeds() {
+         let sut = LoginViewModel(service: MockAuthService())
++        sut.retryPolicy = .none
+         sut.login(email: "user@example.com", password: "correct")
++        XCTAssertTrue(sut.isAuthenticated)
+     }
+-
+-    func testLoginWithInvalidEmail_ShowsError() {
+-        let sut = LoginViewModel(service: MockAuthService())
+-        sut.login(email: "not-an-email", password: "irrelevant")
+-        XCTAssertEqual(sut.errorMessage, "Invalid email")
+-    }
+-
+-    func testLoginWithExpiredToken_Rejects() {
+-        let sut = LoginViewModel(service: MockAuthService(tokenState: .expired))
+-        sut.login(email: "user@example.com", password: "correct")
+-        XCTAssertFalse(sut.isAuthenticated)
+-    }
+-
+-    func testLogout_ClearsSession() {
+-        let sut = LoginViewModel(service: MockAuthService())
+-        sut.logout()
+-        XCTAssertNil(sut.session)
+-    }
+ }
+diff --git a/MyApp/Sources/Auth/LoginViewModel.swift b/MyApp/Sources/Auth/LoginViewModel.swift
+index 3333333..4444444 100644
+--- a/MyApp/Sources/Auth/LoginViewModel.swift
++++ b/MyApp/Sources/Auth/LoginViewModel.swift
+@@ -20,6 +20,8 @@ final class LoginViewModel {
+     func login(email: String, password: String) {
++        guard email.contains("@") else { return }
++        service.authenticate(email: email, password: password)
+     }
+ }

package/pipeline/scripts/fixtures/install-layout.tsv CHANGED

@@ -1,16 +1,16 @@
 .claude/CLAUDE.md	1
 .claude/agents	8
-.claude/commands	88
+.claude/commands	89
 .claude/lib	23
 .claude/multi-agent-preferences.json	1
 .claude/rules	12
 .claude/schemas	23
-.claude/scripts	167
+.claude/scripts	169
 .claude/settings.json	1
 .claude/skills	560
 .copilot/agents	8
 .copilot/copilot-instructions.md	1
 .copilot/lib	23
 .copilot/schemas	23
-.copilot/scripts	167
+.copilot/scripts	169
 .copilot/skills	596

package/pipeline/scripts/smoke-diff-risk.sh CHANGED

@@ -12,6 +12,7 @@
 #   8. phase-4-review.md ref doc declares Step 1.75 + diff-risk-score.mjs
 #   9. code-reviewer.md agent template carries the priority-files placeholder
 #   10. prefs.schema.json exposes diffRisk advisory toggle
+#   11. test-removal fixture fires the test_lines_removed signal (v1.1.0)
 #
 # Exit 0 = all pass, 1 = any failure.
@@ -26,6 +27,7 @@ REVIEWER="$ROOT/pipeline/agents/code-reviewer.md"
 PREFS="$ROOT/pipeline/schemas/prefs.schema.json"
 FIX_IOS="$ROOT/pipeline/scripts/fixtures/diff-risk-ios.diff"
 FIX_AND="$ROOT/pipeline/scripts/fixtures/diff-risk-android.diff"
+FIX_TESTRM="$ROOT/pipeline/scripts/fixtures/diff-risk-test-removal.diff"
 pass=0
 fail=0
@@ -38,10 +40,11 @@ printf '→ smoke-diff-risk (v8.3.0): pre-review risk scoring contract\n'
 [ -f "$SCHEMA" ]   || { record_fail "schema missing: $SCHEMA"; exit 1; }
 [ -f "$FIX_IOS" ]  || { record_fail "fixture missing: $FIX_IOS"; exit 1; }
 [ -f "$FIX_AND" ]  || { record_fail "fixture missing: $FIX_AND"; exit 1; }
+[ -f "$FIX_TESTRM" ] || { record_fail "fixture missing: $FIX_TESTRM"; exit 1; }
 # --- 1: iOS fixture produces JSON ---
 out_ios=$(node "$SCORE" --diff "$FIX_IOS" 2>/dev/null)
-if jq -e '.schemaVersion == "1.0.0"' <<< "$out_ios" >/dev/null 2>&1; then
+if jq -e '.schemaVersion == "1.1.0"' <<< "$out_ios" >/dev/null 2>&1; then
   record_pass "iOS fixture renders schema-versioned JSON"
 else
   record_fail "iOS fixture JSON malformed or missing schemaVersion"
@@ -150,6 +153,32 @@ else
   record_fail "prefs.schema.json missing global.diffRiskAdvisory"
 fi
+# --- 11: test_lines_removed signal fires on the test-removal fixture ---
+out_testrm=$(node "$SCORE" --diff "$FIX_TESTRM" 2>/dev/null)
+sig_value=$(jq -r '.files[] | select(.path == "MyAppTests/LoginViewModelTests.swift")
+                   | .signals[] | select(.name == "test_lines_removed") | .value' <<< "$out_testrm")
+if [ "$sig_value" = "16" ]; then
+  record_pass "test_lines_removed fires with value=16 (18 removed - 2 added)"
+else
+  record_fail "test_lines_removed should fire with value=16, got: ${sig_value:-missing}"
+fi
+sig_on_source=$(jq -r '[.files[] | select(.path == "MyApp/Sources/Auth/LoginViewModel.swift")
+                        | .signals[] | select(.name == "test_lines_removed")] | length' <<< "$out_testrm")
+if [ "$sig_on_source" = "0" ]; then
+  record_pass "test_lines_removed does not fire on source files"
+else
+  record_fail "test_lines_removed must only fire on test-classified paths"
+fi
+set +e
+echo "$out_testrm" | node "$VALIDATE" - >/dev/null 2>&1
+rc_testrm=$?
+set -e
+if [ "$rc_testrm" -eq 0 ]; then
+  record_pass "validator accepts output carrying test_lines_removed"
+else
+  record_fail "validator rejected test_lines_removed output (rc=$rc_testrm)"
+fi
 # --- Summary ---
 total=$((pass + fail))
 printf '\n→ smoke-diff-risk: %d/%d passed\n' "$pass" "$total"

package/pipeline/scripts/smoke-handoff-contract.sh ADDED

@@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# smoke-handoff-contract.sh
+#
+# Verifies the v10.8.0 structured-handoff contract (fresh-context re-entry):
+#   1. operations.md documents the Handoff block with all 5 required lines
+#   2. operations.md compaction trigger re-reads state AND the latest handoff
+#   3. log-format.md documents the Handoff section in the canonical log shape
+#   4. resume.md Step 3 reads the latest handoff FIRST with pre-v10.8 fallback
+#
+# Exit 0 = all pass, 1 = any failure.
+set -euo pipefail
+ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+OPS="$ROOT/pipeline/commands/multi-agent/refs/phases/operations.md"
+LOGFMT="$ROOT/pipeline/commands/multi-agent/refs/phases/log-format.md"
+RESUME="$ROOT/pipeline/commands/multi-agent/resume.md"
+pass=0
+fail=0
+failures=()
+record_pass() { pass=$((pass + 1)); printf '  \033[0;32mPASS\033[0m %s\n' "$1"; }
+record_fail() { fail=$((fail + 1)); failures+=("$1"); printf '  \033[0;31mFAIL\033[0m %s\n' "$1"; }
+printf '→ smoke-handoff-contract: structured handoff (fresh-context re-entry)\n'
+# 1. operations.md documents the Handoff block with the 5 required lines
+if [ ! -f "$OPS" ]; then
+  record_fail "operations.md missing"
+else
+  if grep -qF "Handoff block (v10.8.0)" "$OPS"; then
+    record_pass "operations.md documents the Handoff block"
+  else
+    record_fail "operations.md missing 'Handoff block (v10.8.0)' spec"
+  fi
+  for line in "- Done:" "- Remaining:" "- Decisions:" "- Open findings:" "- Next:"; do
+    if grep -qF -- "$line" "$OPS"; then
+      record_pass "operations.md handoff spec has '$line'"
+    else
+      record_fail "operations.md handoff spec missing '$line'"
+    fi
+  done
+  if grep -qF "no agent dispatch, no extra LLM call" "$OPS"; then
+    record_pass "operations.md states handoff is orchestrator-written (no LLM call)"
+  else
+    record_fail "operations.md must state the handoff costs no LLM call"
+  fi
+fi
+# 2. Compaction trigger re-reads state AND latest handoff
+if grep -qE 'agent-state\.json.*AND the latest.*Handoff' "$OPS"; then
+  record_pass "compaction trigger re-reads state + latest handoff"
+else
+  record_fail "operations.md compaction trigger must re-read agent-state.json AND the latest Handoff block"
+fi
+# 3. log-format.md documents the Handoff section
+if grep -qF "## Handoff - end of Phase" "$LOGFMT"; then
+  record_pass "log-format.md documents the Handoff section"
+else
+  record_fail "log-format.md missing the Handoff section"
+fi
+if grep -qF "LATEST block is authoritative" "$LOGFMT"; then
+  record_pass "log-format.md states latest-block-wins semantics"
+else
+  record_fail "log-format.md must state the latest handoff block is authoritative"
+fi
+# 4. resume.md reads handoff first, with fallback for older logs
+if grep -qE 'LATEST .?## Handoff.? block' "$RESUME"; then
+  record_pass "resume.md Step 3 reads the latest Handoff block first"
+else
+  record_fail "resume.md Step 3 must read the latest Handoff block first"
+fi
+if grep -qiF "fall back to per-phase findings" "$RESUME"; then
+  record_pass "resume.md keeps the pre-v10.8 per-phase fallback"
+else
+  record_fail "resume.md must keep the pre-v10.8 per-phase findings fallback"
+fi
+if grep -qF "trust state on mismatch" "$RESUME"; then
+  record_pass "resume.md defines state-wins conflict rule"
+else
+  record_fail "resume.md must define the handoff-vs-state conflict rule (state wins)"
+fi
+printf '\n══ handoff-contract smoke: %d passed, %d failed ══\n' "$pass" "$fail"
+if [ "$fail" -gt 0 ]; then
+  printf '\nFailures:\n'
+  for msg in "${failures[@]}"; do printf '  - %s\n' "$msg"; done
+  exit 1
+fi
+exit 0

package/pipeline/scripts/smoke-verify-by-test.sh ADDED

@@ -0,0 +1,148 @@
+#!/usr/bin/env bash
+# smoke-verify-by-test.sh
+#
+# Verifies the Phase 4 Step 3.7 verify-by-test contract:
+#   1. phase-4-review.md documents Step 3.7 with evidence-gate invocation + feature-doc pointer
+#   2. refs/features/verify-by-test.md exists and covers the verdict table + red-test handoff
+#   3. prefs.schema.json exposes global.verifyByTest.{enabled,maxFindings,model,stepTimeoutSec}
+#   4. verifyByTest.enabled defaults to false (opt-in, no surprise cost)
+#   5. triage-output.schema.json is v3.2.0 with the $defs.verification result enum
+#   6. validate-triage.mjs accepts a valid `confirmed` verification and rejects bad ones
+#   7. phase-3-dev.md documents the redTests rework re-entry
+#
+# Exit 0 = all pass, 1 = any failure.
+set -euo pipefail
+ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
+PHASE4_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-4-review.md"
+PHASE3_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md"
+FEATURE_DOC="$ROOT/pipeline/commands/multi-agent/refs/features/verify-by-test.md"
+PREFS_SCHEMA="$ROOT/pipeline/schemas/prefs.schema.json"
+TRIAGE_SCHEMA="$ROOT/pipeline/schemas/triage-output.schema.json"
+VALIDATOR="$ROOT/pipeline/scripts/validate-triage.mjs"
+pass=0
+fail=0
+failures=()
+record_pass() { pass=$((pass + 1)); printf '  \033[0;32mPASS\033[0m %s\n' "$1"; }
+record_fail() { fail=$((fail + 1)); failures+=("$1"); printf '  \033[0;31mFAIL\033[0m %s\n' "$1"; }
+printf '→ smoke-verify-by-test: Phase 4 Step 3.7 contract\n'
+# 1. Phase 4 doc documents Step 3.7
+if [ ! -f "$PHASE4_DOC" ]; then
+  record_fail "phase-4-review.md missing"
+else
+  if grep -qF "3.7 Verify-by-test" "$PHASE4_DOC"; then
+    record_pass "phase-4-review.md documents Step 3.7"
+  else
+    record_fail "phase-4-review.md missing Step 3.7 section"
+  fi
+  if grep -qF "evidence-gate.mjs --claim test --status passed" "$PHASE4_DOC"; then
+    record_pass "Step 3.7 downgrade is evidence-gated"
+  else
+    record_fail "Step 3.7 must gate downgrades via evidence-gate.mjs --claim test --status passed"
+  fi
+  if grep -qF "refs/features/verify-by-test.md" "$PHASE4_DOC"; then
+    record_pass "phase-4-review.md points to the feature doc"
+  else
+    record_fail "phase-4-review.md must reference refs/features/verify-by-test.md"
+  fi
+  if grep -qF "review.verify_by_test" "$PHASE4_DOC"; then
+    record_pass "Step 3.7 emits review.verify_by_test telemetry"
+  else
+    record_fail "Step 3.7 must document the review.verify_by_test metric"
+  fi
+fi
+# 2. Feature doc exists with verdict + handoff coverage
+if [ ! -f "$FEATURE_DOC" ]; then
+  record_fail "refs/features/verify-by-test.md missing"
+else
+  for token in "not-reproduced" "inconclusive" "redTests" "Off by default"; do
+    if grep -qF "$token" "$FEATURE_DOC"; then
+      record_pass "feature doc covers '$token'"
+    else
+      record_fail "feature doc missing '$token'"
+    fi
+  done
+fi
+# 3. Prefs schema exposes verifyByTest knobs
+for prop in enabled maxFindings model stepTimeoutSec; do
+  if jq -e ".properties.global.properties.verifyByTest.properties.${prop}" "$PREFS_SCHEMA" >/dev/null 2>&1; then
+    record_pass "prefs schema exposes verifyByTest.${prop}"
+  else
+    record_fail "prefs schema missing verifyByTest.${prop}"
+  fi
+done
+# 4. Off by default  -  preserves existing-user baseline
+if jq -e '.properties.global.properties.verifyByTest.properties.enabled
+          | has("default") and .default == false' "$PREFS_SCHEMA" >/dev/null 2>&1; then
+  record_pass "verifyByTest.enabled defaults to false (opt-in)"
+else
+  record_fail "verifyByTest.enabled must default to false"
+fi
+# 5. Triage schema version + verification enum
+schema_version=$(jq -r '.version // empty' "$TRIAGE_SCHEMA")
+if [ "$schema_version" = "3.2.0" ]; then
+  record_pass "triage-output schema version is 3.2.0"
+else
+  record_fail "triage-output schema version should be 3.2.0 (was: ${schema_version:-missing})"
+fi
+if jq -e '.["$defs"].verification.properties.result.enum
+          | (index("confirmed") != null
+             and index("not-reproduced") != null
+             and index("inconclusive") != null)' "$TRIAGE_SCHEMA" >/dev/null 2>&1; then
+  record_pass "schema verification.result enum complete"
+else
+  record_fail "schema \$defs.verification.result enum must be confirmed/not-reproduced/inconclusive"
+fi
+# 6. Behavioral validator round-trips
+valid_fixture='{"accepted":[{"severity":"blocking","file":"Sources/Auth/Login.swift","line":42,"issue":"expired token accepted as valid","fix":"reject tokens past expiry in validateToken()","reviewer":"fable","verification":{"result":"confirmed","testRef":"AuthTests/LoginTests/testExpiredTokenRejected","evidencePath":".pipeline/verify-1.test.log"}}],"deferred":[],"rejected":[],"approved":false}'
+if printf '%s' "$valid_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
+  record_pass "validator accepts confirmed verification with testRef+evidencePath"
+else
+  record_fail "validator rejected a valid confirmed verification"
+fi
+bad_result_fixture='{"accepted":[{"severity":"blocking","file":"a.swift","line":1,"issue":"some issue here","fix":"do the fix","reviewer":"fable","verification":{"result":"maybe"}}],"deferred":[],"rejected":[],"approved":false}'
+if printf '%s' "$bad_result_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
+  record_fail "validator must reject verification.result 'maybe'"
+else
+  record_pass "validator rejects bad verification.result"
+fi
+missing_ref_fixture='{"accepted":[{"severity":"blocking","file":"a.swift","line":1,"issue":"some issue here","fix":"do the fix","reviewer":"fable","verification":{"result":"confirmed"}}],"deferred":[],"rejected":[],"approved":false}'
+if printf '%s' "$missing_ref_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
+  record_fail "validator must reject confirmed verification without testRef/evidencePath"
+else
+  record_pass "validator rejects confirmed verification lacking testRef/evidencePath"
+fi
+# Reviewer enum parity: fable (Claude Code default) and gpt (Copilot CLI) accepted
+fable_fixture='{"accepted":[{"severity":"important","file":"a.swift","line":1,"issue":"some issue here","fix":"do the fix","reviewer":"gpt"}],"deferred":[],"rejected":[],"approved":true}'
+if printf '%s' "$fable_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
+  record_pass "validator accepts schema-allowed reviewers (fable/gpt)"
+else
+  record_fail "validator must accept reviewer values fable and gpt (schema v3.1.0 parity)"
+fi
+# 7. Phase 3 doc documents the red-test rework re-entry
+if grep -qF "verifyByTest.redTests" "$PHASE3_DOC"; then
+  record_pass "phase-3-dev.md documents redTests rework re-entry"
+else
+  record_fail "phase-3-dev.md must document verifyByTest.redTests re-entry"
+fi
+printf '\n══ verify-by-test smoke: %d passed, %d failed ══\n' "$pass" "$fail"
+if [ "$fail" -gt 0 ]; then
+  printf '\nFailures:\n'
+  for msg in "${failures[@]}"; do printf '  - %s\n' "$msg"; done
+  exit 1
+fi
+exit 0

package/pipeline/scripts/validate-diff-risk.mjs CHANGED

@@ -23,6 +23,7 @@ const ALLOWED_SIGNALS = new Set([
   "complexity_delta",
   "ui_critical",
   "migration",
+  "test_lines_removed",
 ]);
 function readInput() {
@@ -48,7 +49,7 @@ function validate(obj) {
   if (typeof obj !== "object" || obj === null || Array.isArray(obj)) {
     return ["root must be an object"];
   }
-  if (obj.schemaVersion !== "1.0.0") errors.push(`schemaVersion must be "1.0.0", got ${JSON.stringify(obj.schemaVersion)}`);
+  if (obj.schemaVersion !== "1.1.0") errors.push(`schemaVersion must be "1.1.0", got ${JSON.stringify(obj.schemaVersion)}`);
   if (typeof obj.task !== "object" || obj.task === null) {
     errors.push("task must be an object");

package/pipeline/scripts/validate-triage.mjs CHANGED

@@ -23,9 +23,10 @@
 import { readFileSync } from "node:fs";
-const ALLOWED_REVIEWERS = new Set(["opus", "sonnet"]);
+const ALLOWED_REVIEWERS = new Set(["fable", "opus", "sonnet", "gpt"]);
 const ALLOWED_SEVERITIES = new Set(["blocking", "important", "suggestion"]);
 const ALLOWED_CONSENSUS_VERDICTS = new Set(["unanimous-pass", "unanimous-block", "split", "unverified"]);
+const ALLOWED_VERIFICATION_RESULTS = new Set(["confirmed", "not-reproduced", "inconclusive"]);
 const OVER_REJECT_THRESHOLD = 0.8;
 const OVER_REJECT_MIN_FINDINGS = 5;
@@ -64,13 +65,41 @@ function validateRawFinding(f, label, errors) {
   if (typeof f.issue !== "string" || f.issue.length < 4) {
     errors.push(`${label}: issue must be a string ≥4 chars`);
   }
+  if (f.verification !== undefined) {
+    validateVerification(f.verification, `${label}.verification`, errors);
+  }
+}
+// v3.2.0 verify-by-test outcome (Phase 4 Step 3.7). Optional; when present:
+// result is required, and confirmed/not-reproduced additionally require
+// testRef + evidencePath (the empirical claims must be traceable).
+function validateVerification(v, label, errors) {
+  if (typeof v !== "object" || v === null || Array.isArray(v)) {
+    errors.push(`${label}: must be an object when present`);
+    return;
+  }
+  if (!ALLOWED_VERIFICATION_RESULTS.has(v.result)) {
+    errors.push(`${label}: bad result "${v.result}" (allowed: confirmed|not-reproduced|inconclusive)`);
+    return;
+  }
+  if (v.result === "confirmed" || v.result === "not-reproduced") {
+    if (typeof v.testRef !== "string" || v.testRef.length === 0) {
+      errors.push(`${label}: testRef required and non-empty when result is "${v.result}"`);
+    }
+    if (typeof v.evidencePath !== "string" || v.evidencePath.length === 0) {
+      errors.push(`${label}: evidencePath required and non-empty when result is "${v.result}"`);
+    }
+  }
+  if (v.note !== undefined && typeof v.note !== "string") {
+    errors.push(`${label}: note must be a string when present`);
+  }
 }
 function validateAccepted(f, i, errors) {
   validateRawFinding(f, `accepted[${i}]`, errors);
   if (!ALLOWED_REVIEWERS.has(f.reviewer)) {
     errors.push(
-      `accepted[${i}]: reviewer must be "opus" or "sonnet" (got ${JSON.stringify(f.reviewer)}; haiku was removed in v2.1.0)`,
+      `accepted[${i}]: reviewer must be one of fable|opus|sonnet|gpt (got ${JSON.stringify(f.reviewer)}; haiku was removed in v2.1.0)`,
     );
   }
   if (typeof f.fix !== "string" || f.fix.length < 4) {