@mmerterden/multi-agent-pipeline 10.7.4 → 10.8.0

package/CHANGELOG.md CHANGED
@@ -14,6 +14,60 @@ Internal file-layout changes that don't affect the slash-command surface are sti
14
14
 
15
15
  ---
16
16
 
17
+ ## [10.8.0] - 2026-07-02
18
+
19
+ Three additive loop-quality features, all sourced from the 2026 agentic-loop research sweep (Anthropic long-running-agent harness guidance + adversarial-review findings): empirical verification of blocking findings, an immutable-test contract, and structured phase-boundary handoffs.
20
+
21
+ ### Added
22
+
23
+ - **Verify-by-test triage (Phase 4 Step 3.7, opt-in via `prefs.global.verifyByTest`).**
24
+ Accepted blocking findings are no longer judgment-final: one verifier agent
25
+ (default Sonnet, capped at `maxFindings`=3) writes a minimal repro test per
26
+ finding and runs ONLY that test. Test fails as predicted -> finding confirmed
27
+ and the repro test is handed to Phase 3 as the rework RED test
28
+ (`state.reviewIterations[-1].verifyByTest.redTests[]`). Test passes under
29
+ `evidence-gate.mjs` -> finding downgraded to `deferred` (never `rejected`).
30
+ Compile error / timeout -> `inconclusive`, judgment verdict stands. The whole
31
+ step is timeout-bounded and never blocks the pipeline. Schema
32
+ `triage-output` v3.2.0 adds the optional per-finding `verification` block;
33
+ `validate-triage.mjs` validates it. New feature doc
34
+ `refs/features/verify-by-test.md` + `smoke-verify-by-test.sh` (20 assertions).
35
+ - **Immutable-test rule + `test_lines_removed` diff-risk signal.** New rule in
36
+ `refs/rules.md` and the Phase 3 GREEN step: deleting, renaming, or weakening
37
+ an existing test to reach green is a violation; a test changes only when the
38
+ task changes the spec it encodes, named in the commit body. Deterministic
39
+ backstop: `diff-risk-score.mjs` v1.1.0 emits `test_lines_removed` (w=3.0)
40
+ when a test-classified file removes more lines than it adds; wired into the
41
+ Phase 4 Step 1.75 signal table, `diff-risk.schema.json` v1.1.0, and
42
+ `validate-diff-risk.mjs`. New fixture `diff-risk-test-removal.diff` +
43
+ 3 new `smoke-diff-risk.sh` assertions.
44
+ - **Structured handoff blocks (fresh-context re-entry discipline).** The
45
+ phase-boundary checkpoint in `refs/phases/operations.md` now appends a
46
+ structured `## Handoff` block (Done / Remaining / Decisions / Open findings /
47
+ Next) to `agent-log.md` at every phase transition - written by the
48
+ orchestrator from state it already holds, no LLM call, ~15 lines. The
49
+ post-`/compact` re-grounding and `resume.md` Step 3 read the latest handoff
50
+ FIRST (state wins on mismatch), falling back to per-phase findings for
51
+ pre-v10.8 logs. Documented in `log-format.md`; guarded by
52
+ `smoke-handoff-contract.sh` (13 assertions).
53
+
54
+ - **Module review guides in `/multi-agent:review` (Step 2b).** When a changed
55
+ file's module carries its own convention file (`CLAUDE.md`, `*-CLAUDE.md`,
56
+ `AGENTS.md` below repo root), the review discovers it deterministically from
57
+ the diff paths (capped at 5, truncation logged), injects it into every
58
+ reviewer prompt, and scopes each guide to files under its own directory.
59
+ Works with a local checkout or via provider API in PR mode.
60
+
61
+ ### Fixed
62
+
63
+ - **`validate-triage.mjs` reviewer enum accepts `fable` and `gpt`.** The runtime
64
+ validator still rejected the schema-v3.1.0 reviewer values restored in
65
+ v10.7.4 (`fable` is Reviewer 1 on Claude Code, `gpt` on Copilot CLI), so any
66
+ accepted finding attributed to them failed the 3.2.1 gate. Enum now matches
67
+ `triage-output.schema.json`.
68
+
69
+ ---
70
+
17
71
  ## [10.7.4] - 2026-07-02
18
72
 
19
73
  Deep-consistency sweep: the v10.6.0 Fable restore and the v10.7.0 adapter removal are now reflected on every surface, and the test suite is green again end to end.
package/docs/features.md CHANGED
@@ -152,6 +152,18 @@ After triage returns, output is validated by `validate-triage.mjs`:
152
152
 
153
153
  If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with `if`/`then` constraint in schema v3.0.0.
154
154
 
155
+ ### Verify-by-Test Triage (Phase 4 Step 3.7, v10.8.0, opt-in)
156
+
157
+ A triage verdict is a judgment call; a failing repro test is proof. When `prefs.global.verifyByTest.enabled` is on, one verifier agent (default Sonnet) writes a minimal repro test per accepted blocking finding (cap: `maxFindings`=3) and runs only that test. Fails as predicted -> finding confirmed, the repro test becomes the Phase 3 rework RED test. Passes under `evidence-gate.mjs` -> finding downgraded to `deferred`. Compile error / timeout -> `inconclusive`, judgment stands. Timeout-bounded, never blocks. Full spec: `refs/features/verify-by-test.md`.
158
+
159
+ ### Immutable-Test Rule + `test_lines_removed` Signal (v10.8.0)
160
+
161
+ Existing tests are immutable during a task: deleting, renaming, or weakening an assertion to reach green is a violation (`refs/rules.md`, Phase 3 GREEN step). A test changes only when the task changes the spec it encodes, named in the commit body. Deterministic backstop: `diff-risk-score.mjs` emits `test_lines_removed` (w=3.0) for any test-classified file whose diff removes more lines than it adds.
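The `test_lines_removed` check described above could be approximated as follows. This is a minimal sketch, not the code of `diff-risk-score.mjs`: the `looksLikeTest` classifier and the `testLinesRemoved` function are hypothetical names, and the real script's test-classification rules are not shown in this diff.

```javascript
// Hypothetical test-file classifier (assumption: the real rules may differ).
const looksLikeTest = (path) =>
  /(^|\/)(tests?|__tests__)\//i.test(path) ||
  /(Tests?|\.test|\.spec|_test)\.[A-Za-z0-9]+$/.test(path);

// Strip the "a/" or "b/" prefix that git puts on diff header paths.
const stripPrefix = (raw) => raw.trim().replace(/^[ab]\//, "");

// Returns the test-classified files whose diff removes more lines than it adds.
function testLinesRemoved(diffText) {
  const counts = new Map();
  let oldPath = null;
  let current = null;
  for (const line of diffText.split("\n")) {
    if (line.startsWith("--- ")) {
      // Old-file header: remember the path in case the file was deleted.
      oldPath = stripPrefix(line.slice(4));
      current = null;
    } else if (line.startsWith("+++ ")) {
      const newPath = stripPrefix(line.slice(4));
      current = newPath === "/dev/null" ? oldPath : newPath;
      if (!counts.has(current)) counts.set(current, { added: 0, removed: 0 });
    } else if (current && line.startsWith("+")) {
      counts.get(current).added += 1;
    } else if (current && line.startsWith("-")) {
      counts.get(current).removed += 1;
    }
  }
  return [...counts.entries()]
    .filter(([path, c]) => looksLikeTest(path) && c.removed > c.added)
    .map(([path, c]) => ({ path, added: c.added, removed: c.removed }));
}
```

A line-based parser like this can misread a removed line whose content itself begins with `-- `; a production implementation would track hunk lengths from the `@@` headers instead.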
162
+
163
+ ### Structured Handoff Blocks (v10.8.0)
164
+
165
+ Every phase transition appends a `## Handoff` block (Done / Remaining / Decisions / Open findings / Next) to `agent-log.md` - orchestrator-written from existing state, no LLM call. `/multi-agent:resume` and post-`/compact` re-grounding read the latest handoff first, so long runs re-enter from durable artifacts instead of conversation memory (fresh-context discipline from Anthropic's long-running-agent harness guidance).
166
+
155
167
  ### Accessibility Code Review (Phase 4 Step 1.5)
156
168
 
157
169
  If changes include UI files, reviewers check for:
@@ -211,7 +223,7 @@ Per-phase token budgets prevent runaway sessions. If a phase exceeds its budget,
211
223
 
212
224
  `pipeline/scripts/diff-risk-score.mjs` runs at Phase 4 Step 1.75 — before reviewer dispatch. Heuristic, deterministic, sub-second, no LLM. Top-N risk-ranked files inject into each reviewer's prompt as a `${PRIORITY_FILES}` block; reviewers read those files first but still review the entire diff.
213
225
 
214
- Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
226
+ Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (v10.8.0: test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
215
227
 
216
228
  ### Test Gap Detection (v8.3.0)
217
229
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@mmerterden/multi-agent-pipeline",
3
- "version": "10.7.4",
3
+ "version": "10.8.0",
4
4
  "description": "8-phase AI development pipeline with full orchestration on Claude Code and Copilot CLI. Analysis, planning, TDD, CLI-aware parallel review with consensus surfacing + Fable triage, default-FAIL evidence gates, secret + intent guards, per-phase cost ledger, persistent learnings memory, wiki generation, commit automation. Token-preserving uninstall.",
5
5
  "type": "module",
6
6
  "main": "index.js",
@@ -0,0 +1,41 @@
1
+ # Feature: Verify-by-Test Triage (Phase 4 Step 3.7)
2
+
3
+ **Pattern**: reviewer findings are hypotheses; a failing repro test is proof. Adversarial-review research (Refute-or-Promote, 2026) found that plausible-but-wrong findings survive debate rounds but die on a single empirical test. This step converts the highest-stakes verdicts (accepted blocking) from judgment into evidence before the Phase 3 rework loop fires.
4
+
5
+ **Gated by `prefs.global.verifyByTest.enabled`** (default: `false`). When enabled, after triage 3.6 and before Step 4, IF the validated triage output contains at least one `accepted` blocking finding:
6
+
7
+ 1. Dispatch ONE verifier sub-agent for the iteration (model: `verifyByTest.model`, default `sonnet`) - never one dispatch per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings, the diff hunks for their files, and the Phase 1 test conventions.
8
+ 2. Per finding, the verifier writes ONE minimal repro test asserting the correct behavior the finding claims is broken, then runs ONLY that test via the Phase 3 single-test invocation (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`) under `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
9
+ 3. Stamp each processed finding with a `verification` object (triage-output schema v3.2.0) and re-run `validate-triage.mjs` on the mutated triage file under the standard 3.2.1 gate protocol.
10
+ 4. Findings beyond `maxFindings` keep their judgment-only verdict (log `verify_by_test=cap-exceeded`).
11
+ 5. The whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600); on breach or verifier crash, remaining findings keep judgment-only verdicts and the pipeline proceeds. Never blocks.
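The gate and the `maxFindings` cap from the steps above can be sketched as a pure function. The function name `planVerification` and the return shape are illustrative assumptions; only the pref keys (`enabled`, `maxFindings`), the default of 3, and the `severity: "blocking"` filter come from the text.

```javascript
// Sketch only: decides whether Step 3.7 runs and which findings the single
// verifier dispatch receives. Return shape is hypothetical.
function planVerification(prefs, triage) {
  const cfg = prefs?.global?.verifyByTest ?? {};
  const blockers = (triage.accepted ?? []).filter(
    (finding) => finding.severity === "blocking"
  );
  // Off by default, and a no-op when there is nothing blocking to verify.
  if (cfg.enabled !== true || blockers.length === 0) {
    return { run: false, selected: [], capExceeded: 0 };
  }
  const max = cfg.maxFindings ?? 3;
  return {
    run: true,
    // Findings keep the order triage returned them in.
    selected: blockers.slice(0, max),
    // Findings beyond the cap keep their judgment-only verdict.
    capExceeded: Math.max(0, blockers.length - max),
  };
}
```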
12
+
13
+ ## Verdict table
14
+
15
+ | Repro test outcome | `verification.result` | Action |
16
+ |---|---|---|
17
+ | Fails as the finding predicts | `confirmed` | Finding stays accepted blocking. Repro test KEPT in the worktree, recorded in `redTests[]`. |
18
+ | Passes (not reproducible) | `not-reproduced` | Downgrade gated by `evidence-gate.mjs --claim test --status passed` on the test log (exit 0 required). Finding moves `accepted[]` -> `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`; repro test file deleted. Evidence-gate failure -> treat as `inconclusive`. |
19
+ | Compile error / timeout / not unit-testable | `inconclusive` | Finding stays accepted blocking (judgment stands). Partial test deleted, cause in `verification.note`. |
20
+
21
+ Downgrades go to `deferred`, never `rejected`: triage judged the issue real, and deferred items surface in the Phase 7 report for a human eye.
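The verdict table reduces to a small decision function that fails toward keeping the blocker. The sketch below is an illustration under assumptions: the outcome tokens, the `mapVerdict` name, and the returned fields are hypothetical, while the three `verification.result` values and the evidence-gate exit-code rule come from the table.

```javascript
// outcome: "failed-as-predicted" | "passed" | "compile-error" | "timeout" |
//          "not-unit-testable"   (hypothetical tokens)
// evidenceGateExit: exit code of the evidence gate on the test log; only
//                   consulted when the repro test passed.
function mapVerdict(outcome, evidenceGateExit) {
  if (outcome === "failed-as-predicted") {
    // Proof of the defect: keep the test as the rework RED test.
    return { result: "confirmed", bucket: "accepted", keepTest: true };
  }
  if (outcome === "passed" && evidenceGateExit === 0) {
    // Downgrade to deferred, never rejected.
    return { result: "not-reproduced", bucket: "deferred", keepTest: false };
  }
  // Compile error, timeout, untestable, or a pass the evidence gate refused:
  // the judgment verdict stands.
  return { result: "inconclusive", bucket: "accepted", keepTest: false };
}
```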
22
+
23
+ ## Red-test handoff to Phase 3
24
+
25
+ `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`. On the Phase 3 re-entry, each `redTests[]` entry IS the RED step of the rework TDD cycle for its finding: the reflection prompt instructs "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it." The dev agent must not write a duplicate failing test for those findings.
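One way to derive the `verifyByTest` state object from stamped findings is sketched below. The `{file, testRef, issue}` entry shape and the counter names come from the text; the function name, the `verification.file` and `verification.testRef` field locations, and the reading of `attempted` as "findings that carry a `verification` object" are assumptions.

```javascript
// Sketch only: tallies verification results into the summary persisted at
// state.reviewIterations[-1].verifyByTest.
function summarizeVerification(findings) {
  const summary = {
    attempted: 0,
    confirmed: 0,
    downgraded: 0,
    inconclusive: 0,
    redTests: [],
  };
  for (const finding of findings) {
    const v = finding.verification;
    if (!v) continue; // beyond the cap or never reached: judgment-only
    summary.attempted += 1;
    if (v.result === "confirmed") {
      summary.confirmed += 1;
      // Confirmed repro tests become the RED step of the rework cycle.
      summary.redTests.push({ file: v.file, testRef: v.testRef, issue: finding.issue });
    } else if (v.result === "not-reproduced") {
      summary.downgraded += 1;
    } else {
      summary.inconclusive += 1;
    }
  }
  return summary;
}
```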
26
+
27
+ ## Cleanup invariant
28
+
29
+ After Step 3.7, the only uncommitted verifier artifacts are the confirmed repro tests listed in `redTests[]` (committed later with the fix) and logs under `$WORKTREE/.pipeline/` (outside Phase 6 commit scope). `not-reproduced` and `inconclusive` test files are always deleted.
30
+
31
+ ## Telemetry
32
+
33
+ One `review.verify_by_test` metric per iteration: `attempted`, `confirmed`, `downgraded`, `inconclusive`, `duration_ms` (plus `tokens_in/out` when available), forwarded to the tracker like all Phase 4 metrics. Timeout emits `triage=verify-by-test-timeout`.
34
+
35
+ ## Off by default reason
36
+
37
+ Adds one Sonnet call plus up to `maxFindings` single-test runs (and build-lock contention on Xcode projects) per review iteration that has accepted blockers. On clean runs it never fires, but on noisy-reviewer repos it can add minutes per iteration. Flip on for security-critical work, release branches, or repos where reviewer false-positive rate is high.
38
+
39
+ ## Reference
40
+
41
+ Wiring: `refs/phases/phase-4-review.md` Step 3.7. Schema: `pipeline/schemas/triage-output.schema.json` v3.2.0 (`$defs.verification`). Evidence gate: `pipeline/scripts/evidence-gate.mjs`. Prefs: `prefs.global.verifyByTest` in `pipeline/schemas/prefs.schema.json`.
@@ -72,6 +72,16 @@ Verdict: unverified (2 reviewers)
72
72
  ## Test Scenarios (Jira)
73
73
 
74
74
  (User perspective: Precondition -> Steps -> Expected result)
75
+
76
+ ## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
77
+
78
+ (v10.8.0, appended chronologically by the phase-boundary checkpoint - one block per phase transition, the LATEST block is authoritative. Written by the orchestrator from state it already holds; no LLM call, ~15 lines max. `resume.md` Step 3 and post-`/compact` re-grounding read this block FIRST. Format spec: `refs/phases/operations.md`.)
79
+
80
+ - Done: {up to 3 bullets of completed outcomes}
81
+ - Remaining: {ordered list of remaining phases / sub-steps}
82
+ - Decisions: {key decisions later phases depend on}
83
+ - Open findings: {accepted-but-unresolved review findings, or "none"}
84
+ - Next: Phase {N+1} {name}, subStep {token or "start"}
75
85
  ```
76
86
 
77
87
  ### Cost Breakdown - emission contract
@@ -93,8 +93,21 @@ This keeps orchestrator context lean and enables programmatic routing.
93
93
 
94
94
  **Proactive compaction + phase-boundary checkpoint**: the orchestrator follows ~2,500 lines of phase prose in one session; once context fills, it starts dropping steps - the single biggest cause of "it got stuck / skipped a step." Two defenses, both required on full-pipeline runs:
95
95
 
96
- - *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a one-line progress summary to `agent-log.md`. The next phase reads state + log, not the back-conversation - so a transition is a clean re-entry point even if context is later compacted.
97
- - *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit - it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` to re-ground.
96
+ - *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a structured handoff block to `agent-log.md` (format below). The next phase reads state + log, not the back-conversation - so a transition is a clean re-entry point even if context is later compacted.
97
+ - *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit - it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` AND the latest `## Handoff` block in `agent-log.md` to re-ground.
98
+
99
+ **Handoff block (v10.8.0)**: the structured artifact the phase-boundary checkpoint appends to `agent-log.md`. Written by the orchestrator from state it already holds - no agent dispatch, no extra LLM call. Cap at ~15 lines; the latest block is authoritative (earlier ones are history). This is the fresh-context re-entry contract: a resume or post-compaction session rebuilds working context from the latest handoff + `agent-state.json` + git log, never from conversation memory.
100
+
101
+ ```markdown
102
+ ## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
103
+ - Done: {up to 3 bullets of completed outcomes, e.g. "plan approved, 4 tasks", "build green, 12 tests added"}
104
+ - Remaining: {ordered list of remaining phases / sub-steps}
105
+ - Decisions: {key decisions later phases depend on, e.g. "used existing KeychainStore, no new wrapper"}
106
+ - Open findings: {accepted-but-unresolved review findings, or "none"}
107
+ - Next: Phase {N+1} {name}, subStep {token or "start"}
108
+ ```
109
+
110
+ Full `agent-log.md` shape: `refs/phases/log-format.md`. Resume-side consumption: `resume.md` Step 3 reads the latest handoff FIRST, then falls back to per-phase findings for logs written before v10.8.
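Because the block is written from state the orchestrator already holds, rendering it can be a plain string template. The sketch below assumes a hypothetical input object (`phase`, `name`, `timestamp`, `done`, `remaining`, `decisions`, `openFindings`, `nextName`, `nextSubStep`) and joins list items with `; `; neither the field names nor the separator are specified by the format above.

```javascript
// Sketch only: renders one handoff block in the documented shape.
function renderHandoff(h) {
  const join = (items) => (items.length > 0 ? items.join("; ") : "none");
  return [
    `## Handoff - end of Phase ${h.phase} (${h.name}) - ${h.timestamp}`,
    `- Done: ${join(h.done.slice(0, 3))}`, // capped at 3 outcomes
    `- Remaining: ${join(h.remaining)}`,
    `- Decisions: ${join(h.decisions)}`,
    `- Open findings: ${join(h.openFindings)}`,
    `- Next: Phase ${h.phase + 1} ${h.nextName}, subStep ${h.nextSubStep ?? "start"}`,
  ].join("\n");
}
```

At six lines per block, this stays well inside the ~15-line cap.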
98
111
 
99
112
  **Sub-step checkpoints (long phases)**: Phase 3 (dev/TDD cycles) and Phase 7 (report/channels) can run many minutes; a crash mid-phase loses everything since the last phase boundary and forces a full phase re-run on resume. For these phases, also record `state.phases[<n>].subStep` (a short token: `red`, `green`, `build`, `pr-opened`, `confluence-synced`, ...) and the `files[]` written so far after each meaningful unit of work. On resume, re-enter the phase but skip units whose `subStep` is already recorded and whose `files[]` exist in the worktree - re-do only the unfinished tail, never the whole phase.
100
113
 
@@ -94,6 +94,7 @@ For each task (respecting dependency order):
94
94
  3. **TDD cycle** (Launch Agent with `model: "sonnet"`):
95
95
 
96
96
  **RED - Write ONE failing test first:**
97
+ - **Rework re-entry**: if `state.reviewIterations[-1].verifyByTest.redTests[]` exists (Phase 4 Step 3.7 ran), those failing repro tests ARE the RED step for their findings - make them green, do not write a duplicate failing test and do not delete or weaken them. See `refs/features/verify-by-test.md`.
97
98
  - Test framework: use whatever the project already uses. Detect from existing test files:
98
99
  - `@Test` / `#expect` → Swift Testing
99
100
  - `XCTestCase` / `XCTAssert` → XCTest
@@ -115,6 +116,7 @@ For each task (respecting dependency order):
115
116
 
116
117
  **GREEN - Minimal code to pass:**
117
118
  - Smallest change that makes the test green - no extras
119
+ - **Existing tests are immutable**: deleting, renaming, skipping, or weakening an existing assertion to reach green is a violation. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. (Deterministic backstop: the `test_lines_removed` signal in Phase 4 Step 1.75 flags test files that shrink.)
118
120
  - Run the same test again → must PASS
119
121
  - Run full test suite → no regressions:
120
122
  ```bash
@@ -127,6 +127,7 @@ echo "$RISK_JSON" | node pipeline/scripts/validate-diff-risk.mjs - >/dev/null 2>
127
127
  | `migration` | 4.0 | path matches DB schema / migration glob (.sql, /Migrations/, alembic/, prisma/migrations) |
128
128
  | `public_api` | 2.0 | added line declares `public func/class/struct/enum`, `@objc`, `open fun`, `@Composable`, or `export function/class/...` |
129
129
  | `no_test_change` | 2.5 | source file changed, no paired test file (`{Base}Tests.{ext}`, `{Base}.test.{ext}`, etc.) appears in the diff |
130
+ | `test_lines_removed` | 3.0 | test-classified file whose diff removes more lines than it adds (immutable-test backstop, see `refs/rules.md`) |
130
131
  | `complexity_delta` | 1.5 | added control-flow tokens (`if`/`guard`/`switch`/`while`/`for`/ternary/`&&`/`\|\|`) |
131
132
  | `ui_critical` | 1.5 | path matches `*View.swift`, `*Screen.kt`, `*Configuration.swift`, etc. |
132
133
  | `loc_changed` | 1.0 | base sensitivity to total `+/-` lines |
@@ -204,6 +205,8 @@ Skills are injected into reviewer prompt context - the reviewer uses them as r
204
205
 
205
206
  **iOS/Swift - interaction & convention skills (conditional).** When the diff touches SwiftUI UI files (`*View.swift`, `*Screen.swift`, `*Configuration.swift`, `*+Modifiers.swift`), additionally inject the relevant `figma-common` convention skills as reference for the iOS reviewers: `figma-navigation`, `figma-overlays`, `figma-bottom-sheets` (interaction: emit-intent vs self-route/self-present; native-SwiftUI-first vs the project's `ui.*` custom system), and the enriched `figma-to-swiftui` accessibility rules (minimalism). These back the Step 1.5 iOS convention checks. Generic across SwiftUI projects - not tied to any one app. Omit when the diff has no SwiftUI UI changes (keeps the reviewer prompt lean).
206
207
 
208
+ **Module review guides (conditional, all stacks).** A module in the repo may carry its own CLAUDE guide - a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths: for each changed file, walk its directory chain up to the repo root and collect `CLAUDE.md`, `*-CLAUDE.md`, and `AGENTS.md` files (root-level ones excluded - the host CLI already loads those). Dedupe, cap at 5 (log any dropped). Inject into every reviewer prompt with the directive: read each guide before reviewing and apply its rules/checklist to the changed files under its directory - a guide governs only its own subtree, and guide violations are findings triaged by severity like any other. No guides found → no-op, prompt stays lean. Same contract as `/multi-agent:review` Step 2b.
209
+
207
210
  **Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry Reviewer 1 once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
208
211
 
209
212
  #### Output contract - reviewer step
@@ -252,6 +255,8 @@ Exit 0 = valid. Exit 2 = contradiction (approved=true with blocking findings) -
252
255
 
253
256
  **CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag - reviewers hallucinate, misread scope, or repeat each other. Run Fable triage (Opus on Copilot CLI) to evaluate merged findings against task scope.
254
257
 
258
+ Opt-in empirical layer: when `prefs.global.verifyByTest.enabled` is `true`, accepted blocking findings additionally go through Step 3.7 (verify-by-test), which tries to reproduce each one with a minimal failing test before the Phase 3 rework loop fires. Full wiring: `refs/features/verify-by-test.md`.
259
+
255
260
  ##### 3.1 Short-circuit: no findings
256
261
 
257
262
  If merged findings `length === 0`, **skip triage**: write empty result `{"accepted": [], "deferred": [], "rejected": [], "approved": true}`, log, proceed to Phase 5.
@@ -383,13 +388,40 @@ After the triage verdict is computed, populate `triage.consensus`:
383
388
 
384
389
  **Surfacing (Step 4 + Phase 7):** When `verdict` is `split` or `unverified`, the disagreements are shown to the user at the Step 4 checkpoint (interactive modes) and always written to the Phase 7 report. Autopilot does not block on `unverified` (it logs `review.consensus=unverified` and proceeds), matching the maturity-check model - but the report records it so a human review can catch it. This is additive: omitting `consensus` is valid and means "not computed."
385
390
 
391
+ ##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
392
+
393
+ **Rationale:** a triage verdict is still a judgment call; a failing repro test is proof. Debating a finding costs tokens, running it costs one test invocation and kills false positives that survive adversarial framing.
394
+
395
+ **Gate:** runs only when `prefs.global.verifyByTest.enabled` is `true` AND the validated triage output contains at least one `accepted` finding with `severity: "blocking"`. Otherwise skip silently (no log noise). Full behavior spec: `refs/features/verify-by-test.md`.
396
+
397
+ 1. **Dispatch ONE verifier agent** (subagent_type: `general-purpose`, model: `prefs.global.verifyByTest.model`, default `sonnet`) for the iteration - not one per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings (ordered as triage returned them), the diff hunks for their files, and the project's test conventions from Phase 1. Findings beyond the cap keep their judgment-only verdict; log `verify_by_test=cap-exceeded count=<n>`.
398
+ 2. **Per finding, the verifier writes ONE minimal repro test** asserting the CORRECT behavior the finding claims is broken, following the platform test conventions (framework, naming, location per phase-3-dev.md), then runs ONLY that test using the platform single-test invocation from Phase 3 (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`), wrapped in `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
399
+ 3. **Verdict mapping (fails toward keeping blockers):**
400
+
401
+ | Repro test outcome | `verification.result` | Action on finding |
402
+ | --- | --- | --- |
403
+ | Test FAILS as the finding predicts | `confirmed` | Stays `accepted` blocking. Repro test is KEPT in the worktree and recorded in `redTests[]` as the RED test for the Phase 3 rework loop. |
404
+ | Test PASSES (defect not reproducible) | `not-reproduced` | Downgrade is evidence-gated: `node pipeline/scripts/evidence-gate.mjs --claim test --status passed --evidence "$WORKTREE/.pipeline/verify-<i>.test.log"` must exit 0. On exit 0: move finding from `accepted[]` to `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`, delete the repro test file. On exit non-zero: treat as `inconclusive`. |
405
+ | Compile error, timeout, or defect not expressible as a unit test | `inconclusive` | Stays `accepted` blocking (judgment-only verdict stands). Partial test file deleted, cause noted in `verification.note`. |
406
+
407
+ 4. **Cleanup invariant:** after Step 3.7 the only verifier artifacts left in the worktree are the confirmed repro tests listed in `redTests[]` (they get committed with the fix, satisfying the TDD contract) and logs under `$WORKTREE/.pipeline/` (already outside Phase 6 commit scope).
408
+ 5. **Persist + re-validate:** stamp each verified finding with a `verification` object (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`, recompute `approved`, then re-run `validate-triage.mjs` on the mutated `$TRIAGE_FILE` under the same 3.2.1 gate protocol.
409
+ 6. **Timeout/fallback (mirrors 3.3):** the whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600). On verifier crash or budget breach: no retry; remaining findings keep judgment-only verdicts, log `triage=verify-by-test-timeout`, proceed to Step 4. Never blocks the pipeline.
410
+
411
+ Telemetry (per 3.4 conventions, best-effort):
412
+
413
+ ```bash
414
+ LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.verify_by_test \
415
+ attempted=$A confirmed=$C downgraded=$D inconclusive=$I duration_ms=$MS
416
+ ```
417
+
386
418
  #### Step 4 - Consensus + Action (triage-driven)
387
419
 
388
420
  If `triage.consensus.verdict` is `split` or `unverified`, surface `consensus.disagreements[]` to the user before acting: interactive modes show the split and ask whether to treat the unverified agreement as a pass (picker-contract: Trust / Re-review / Treat-as-blocking); autopilot logs `review.consensus=<verdict>` and proceeds on the triage verdict. Never silently average a split into a pass.
389
421
 
390
422
  Act **only on triage.accepted**:
391
423
 
392
- - **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items)
424
+ - **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items). When Step 3.7 ran and `state.reviewIterations[-1].verifyByTest.redTests[]` is non-empty, the reflection prompt cites each red test: "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it."
393
425
  - **accepted.important** → fix and re-review
394
426
  - **accepted.suggestion** → apply if reasonable
395
427
  - **deferred** → append to Phase 7 report as "follow-up items" (do not block)
@@ -52,6 +52,7 @@ This is the single source of truth. When a contributor or model is unsure where
52
52
  - **NEVER** commit without passing build (all gates in Phase 4 Step 1 must be green).
53
53
  - **NEVER** commit without passing review (at least one AI reviewer must return `approved: true` with no blocking findings).
54
54
  - **NEVER** skip tests. Every public method, every error path, every edge case.
55
+ - **NEVER** delete, rename, or weaken an existing test to get a green run. Existing tests are immutable during a task. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. Deterministic backstop: the `test_lines_removed` diff-risk signal (Phase 4 Step 1.75) flags test files that shrink.
55
56
  - **Follow existing code style and conventions.** Read neighbor files before writing new ones - match naming, structure, import order.
56
57
  - **Use design tokens, no magic numbers.** `16` → `.Spacing.spacing16`. `#E31837` → `Color.Primary.primary`. `.font(.system(size: 14))` → `.typographyStyle(.body1)`.
57
58
  - **Design system primitives before custom views.** Before writing a new SwiftUI / Compose / React / View / Configuration triplet inside a domain or feature module, grep the shared component library (project-specific path, e.g. `Common/UIComponents/`, `core-ui/`, `packages/ui/`) for an existing primitive that solves the same problem. New domain-level wrappers, custom modals, custom buttons, or hand-rolled toasts are forbidden when the design system already has an equivalent. If the primitive exists but lacks a modifier (placeholder, size, error binding), **add the modifier to the primitive** in its `+Modifiers` extension - do not fork the primitive into the consumer domain. The Figma `CodeConnectSnippet` is the authoritative pointer to which primitive to use.
@@ -23,10 +23,13 @@ Resume a paused or failed task from the last successful phase.
23
23
  - `haltReason` - if set, show it so the user knows why the run stopped; clear it on successful re-entry
24
24
  - `autopilot` - preserve the mode
25
25
 
26
- 3. **Load context** - read prior-phase findings from `agent-log.md`:
27
- - Phase 1 analysis → use it from Phase 2+
28
- - Phase 2 plan → use it from Phase 3+
29
- - Phase 3 code → already in the worktree
26
+ 3. **Load context** - rebuild working context from durable artifacts, never from conversation memory:
27
+ - **Handoff first (v10.8.0)**: read the LATEST `## Handoff` block in `agent-log.md` - it carries done/remaining/decisions/open-findings and the exact re-entry point (phase + subStep). When present, it is the primary context source; cross-check its `Next:` line against `state.currentPhase` and trust state on mismatch (state is the machine truth, handoff is the narrative).
28
+ - Fall back to per-phase findings for logs written before v10.8 (no handoff blocks):
29
+ - Phase 1 analysis → use it from Phase 2+
30
+ - Phase 2 plan → use it from Phase 3+
31
+ - Phase 3 code → already in the worktree
32
+ - Recent `git log --oneline -10` in the worktree grounds what was actually committed vs. claimed.
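The handoff-first lookup with "state wins on mismatch" could look like the sketch below. It assumes `state.currentPhase` is a number and that a handoff block runs until the next `## ` heading; the function names and the return shape are hypothetical.

```javascript
// Returns the text of the LAST "## Handoff" block in the log, or null.
function latestHandoff(logText) {
  const lines = logText.split("\n");
  let start = -1;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("## Handoff")) start = i;
  }
  if (start === -1) return null;
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) {
      end = i;
      break;
    }
  }
  return lines.slice(start, end).join("\n");
}

// Picks the context source and flags a handoff/state disagreement.
function resolveReentry(logText, state) {
  const block = latestHandoff(logText);
  if (block === null) {
    // Pre-v10.8 log: fall back to per-phase findings.
    return { source: "per-phase-findings", phase: state.currentPhase, mismatch: false };
  }
  const match = block.match(/^- Next: Phase (\d+)/m);
  const handoffPhase = match ? Number(match[1]) : null;
  return {
    source: "handoff",
    phase: state.currentPhase, // state is the machine truth
    mismatch: handoffPhase !== null && handoffPhase !== state.currentPhase,
    handoff: block,
  };
}
```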
30
33
 
31
34
  4. **Continue the pipeline** - start from the next phase (same pipeline as the main multi-agent command).
32
35
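The handoff-first rule in Step 3 can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the helper names `latestHandoff` and `resolveReentry` are hypothetical, the five bullet labels are taken from the smoke test in this release, and both the `Next: Phase <n> ...` wording and a numeric `state.currentPhase` are assumptions.

```javascript
// Sketch only. Assumed block shape: a "## Handoff" heading followed by
// "- Done:", "- Remaining:", "- Decisions:", "- Open findings:", "- Next:".

// Return the fields of the LAST "## Handoff" block, or null when the log has none.
function latestHandoff(logText) {
  const lines = logText.split("\n");
  let start = -1;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("## Handoff")) start = i; // later blocks win
  }
  if (start === -1) return null; // pre-v10.8 log without handoff blocks

  const fields = {};
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) break; // the next section ends the block
    const m = /^- (Done|Remaining|Decisions|Open findings|Next):\s*(.*)$/.exec(lines[i]);
    if (m) fields[m[1]] = m[2];
  }
  return fields;
}

// Decide where to re-enter. State is the machine truth: on mismatch the phase
// from state is used and the mismatch is only reported.
function resolveReentry(logText, state) {
  const handoff = latestHandoff(logText);
  if (handoff === null) {
    return { source: "per-phase-findings", phase: state.currentPhase, mismatch: false, handoff: null };
  }
  // Assumption: the Next line names the phase as "Phase <n>".
  const m = /Phase\s+(\d+)/.exec(handoff.Next || "");
  const handoffPhase = m ? Number(m[1]) : null;
  const mismatch = handoffPhase !== null && handoffPhase !== state.currentPhase;
  return { source: "handoff", phase: state.currentPhase, mismatch, handoff };
}
```

The returned `phase` always comes from state, so a stale or hand-edited handoff can add narrative context but cannot move the re-entry point.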
 
@@ -109,6 +109,38 @@ The `credential-store.sh` wrapper handles macOS Keychain (`security`), Linux lib
109
109
 
110
110
  Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers can re-read it.
111
111
 
112
+ ### 2b. Module review guides - path-scoped convention files
113
+
114
+ A module in the repo may carry its own CLAUDE guide - a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths:
115
+
116
+ ```bash
117
+ # Changed file paths from the patch:
118
+ grep -E '^\+\+\+ b/' "$DIFF_FILE" | sed 's|^+++ b/||' | sort -u > /tmp/multi-agent-review-${TASK_ID}-paths.txt
119
+
120
+ # For each changed path, walk its directory chain up to the repo root and
121
+ # collect guide files matching: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
122
+ # Root-level CLAUDE.md/AGENTS.md are excluded - the host CLI already loads those.
123
+ guides=()
124
+ while IFS= read -r p; do
125
+ d=$(dirname "$p")
126
+ while [ "$d" != "." ] && [ "$d" != "/" ]; do
127
+ for g in "$d"/CLAUDE.md "$d"/*-CLAUDE.md "$d"/AGENTS.md; do
128
+ [ -e "$g" ] && guides+=("$g")
129
+ done
130
+ d=$(dirname "$d")
131
+ done
132
+ done < /tmp/multi-agent-review-${TASK_ID}-paths.txt
133
+ # dedupe, cap at 5 (log any dropped so truncation is never silent)
134
+ ```
135
+
136
+ Existence checks are resolved against the local checkout when the cwd is the target repo. In PR mode without a local checkout, probe the candidate paths via the provider API instead (`gh api /repos/{o}/{r}/contents/{path}?ref={headSha}` / Bitbucket `GET /projects/{KEY}/repos/{slug}/browse/{path}?at={headSha}`) and fetch the matching files' raw content the same way. No hit → step is a silent no-op.
137
+
138
+ Persist `agent-state.review.moduleGuides = [<repo-relative paths>]` and inject into every reviewer prompt (Step 3):
139
+
140
+ > MODULE REVIEW GUIDES: before reviewing, read each of these guide files. Apply a guide's rules/checklist to every changed file under its directory. Guide violations are findings like any other - triage them by severity.
141
+
142
+ Scope note: a guide governs only files under its own directory - a guide found under one module must not be applied to a sibling module's changes in the same PR.
143
+
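The bash snippet in Step 2b leaves the dedupe and cap step as a comment. A minimal sketch of the whole discovery pass, including that step and the scope rule, might look like the following. The function names and the injected `listDir` callback are hypothetical; `listDir` stands in for either a local directory listing or a provider API probe and is assumed to return the file names directly inside a repo-relative directory.

```javascript
// Sketch only: collects module guides for a list of changed, repo-relative paths.
function discoverGuides(changedPaths, listDir, cap = 5) {
  const found = [];
  const seen = new Set();
  for (const p of changedPaths) {
    const parts = p.split("/").slice(0, -1); // directory chain of the changed file
    // Walk from the file's own directory upward; depth 0 (the repo root) is
    // skipped because the host CLI already loads root-level guides.
    for (let depth = parts.length; depth >= 1; depth--) {
      const dir = parts.slice(0, depth).join("/");
      for (const name of listDir(dir)) {
        const isGuide =
          name === "CLAUDE.md" || name === "AGENTS.md" || name.endsWith("-CLAUDE.md");
        const full = `${dir}/${name}`;
        if (isGuide && !seen.has(full)) {
          seen.add(full); // dedupe across changed files that share ancestors
          found.push(full);
        }
      }
    }
  }
  // Cap the list and return the overflow so the caller can log it.
  return { guides: found.slice(0, cap), dropped: found.slice(cap) };
}

// Scope rule: a guide governs only files under its own directory.
function guideApplies(guidePath, filePath) {
  const dir = guidePath.slice(0, guidePath.lastIndexOf("/") + 1); // keeps the trailing "/"
  return filePath.startsWith(dir);
}
```

Returning `dropped` alongside `guides` keeps truncation visible, and the trailing slash in `guideApplies` prevents a guide under `modules/auth/` from matching a sibling such as `modules/authz/`.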
112
144
  ### 3. Launch parallel reviewers - host-CLI dependent
113
145
 
114
146
  **Claude Code (2 in parallel):**
@@ -120,7 +152,7 @@ Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers ca
120
152
  - Agent 2: `gpt-5.4` → edge cases, alternate perspective
121
153
  - Agent 3: `claude-sonnet-4-6` → general quality
122
154
 
123
- Each reviewer receives the diff plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
155
+ Each reviewer receives the diff, the module review guides from Step 2b (when any were found), plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
124
156
 
125
157
  ### 4. Store-compliance cross-reference
126
158
 
@@ -1,16 +1,16 @@
1
1
  {
2
2
  "$schema": "https://json-schema.org/draft/2020-12/schema",
3
3
  "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/diff-risk.schema.json",
4
- "version": "1.0.0",
4
+ "version": "1.1.0",
5
5
  "title": "Multi-Agent Pipeline - Phase 4 diff risk score",
6
- "description": "Output contract for diff-risk-score.mjs. Heuristic, deterministic, no LLM. Produced before Phase 4 Step 2 to give reviewer prompts a priority ordering - never used as a gate.",
6
+ "description": "Output contract for diff-risk-score.mjs. Heuristic, deterministic, no LLM. Produced before Phase 4 Step 2 to give reviewer prompts a priority ordering - never used as a gate. v1.1.0 adds the test_lines_removed signal (immutable-test backstop: a test file whose diff removes more lines than it adds).",
7
7
  "type": "object",
8
8
  "additionalProperties": false,
9
9
  "required": ["schemaVersion", "task", "totals", "files"],
10
10
  "properties": {
11
11
  "schemaVersion": {
12
12
  "type": "string",
13
- "const": "1.0.0"
13
+ "const": "1.1.0"
14
14
  },
15
15
  "task": {
16
16
  "type": "object",
@@ -63,7 +63,8 @@
63
63
  "no_test_change",
64
64
  "complexity_delta",
65
65
  "ui_critical",
66
- "migration"
66
+ "migration",
67
+ "test_lines_removed"
67
68
  ]
68
69
  },
69
70
  "weight": { "type": "number" },
@@ -701,6 +701,41 @@
701
701
  "default": false,
702
702
  "description": "v6.1.0+ \u2014 Phase 4 Step 2.5 rebuttal round. When reviewers disagree (mixed blocker/approved verdict), each reviewer is re-prompted with the others' opposing arguments for one additional round before triage. Lifts signal quality on ambiguous findings at ~1\u00d7 Step 2 token cost. Off by default \u2014 flip for security-critical or release-branch reviews."
703
703
  },
704
+ "verifyByTest": {
705
+ "type": "object",
706
+ "additionalProperties": false,
707
+ "description": "v10.8+ - Phase 4 Step 3.7 verify-by-test. When enabled, accepted BLOCKING findings are empirically validated before the Phase 3 rework loop: one verifier agent writes a minimal repro test per finding and runs only that test. Confirmed findings hand their failing test to Phase 3 as the RED step; non-reproducible findings are downgraded to deferred under evidence-gate. Only blocking findings are ever verified (fixed behavior, not a knob). Adds one model call plus up to maxFindings single-test runs per iteration with accepted blockers; default off. Flip on for security-critical work, release branches, or repos with noisy reviewers. Full spec: refs/features/verify-by-test.md.",
708
+ "properties": {
709
+ "enabled": {
710
+ "type": "boolean",
711
+ "default": false,
712
+ "description": "Master switch."
713
+ },
714
+ "maxFindings": {
715
+ "type": "integer",
716
+ "minimum": 1,
717
+ "maximum": 10,
718
+ "default": 3,
719
+ "description": "Max accepted blocking findings verified per review iteration. Findings beyond the cap keep their judgment-only verdict."
720
+ },
721
+ "model": {
722
+ "type": "string",
723
+ "enum": [
724
+ "sonnet",
725
+ "opus"
726
+ ],
727
+ "default": "sonnet",
728
+ "description": "Verifier agent model. Writing a minimal repro test is mechanical work; Sonnet is the cost-sane default."
729
+ },
730
+ "stepTimeoutSec": {
731
+ "type": "integer",
732
+ "minimum": 60,
733
+ "maximum": 1800,
734
+ "default": 600,
735
+ "description": "Wall-clock budget for the whole Step 3.7 pass. On breach, remaining findings keep judgment-only verdicts and the pipeline proceeds (never blocks)."
736
+ }
737
+ }
738
+ },
704
739
  "review": {
705
740
  "type": "object",
706
741
  "additionalProperties": false,
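As an illustration of how the `verifyByTest` knobs above could be consumed, the sketch below applies the schema defaults, checks the documented ranges, and picks the findings to verify. The function names are hypothetical, and taking the first `maxFindings` blocking findings in list order is an assumption; the schema only states that findings beyond the cap keep their judgment-only verdict.

```javascript
// Defaults copied from the schema fragment above.
const VERIFY_BY_TEST_DEFAULTS = Object.freeze({
  enabled: false,
  maxFindings: 3,
  model: "sonnet",
  stepTimeoutSec: 600,
});

function inRange(v, min, max) {
  return Number.isInteger(v) && v >= min && v <= max;
}

// Merge user prefs over defaults and report schema-style violations.
function resolveVerifyByTest(globalPrefs) {
  const raw = (globalPrefs && globalPrefs.verifyByTest) || {};
  const errors = [];
  for (const key of Object.keys(raw)) {
    // additionalProperties is false in the schema, so unknown keys are errors.
    if (!(key in VERIFY_BY_TEST_DEFAULTS)) errors.push(`unknown key: ${key}`);
  }
  const cfg = { ...VERIFY_BY_TEST_DEFAULTS, ...raw };
  if (typeof cfg.enabled !== "boolean") errors.push("enabled must be a boolean");
  if (!inRange(cfg.maxFindings, 1, 10)) errors.push("maxFindings must be an integer in 1..10");
  if (!["sonnet", "opus"].includes(cfg.model)) errors.push("model must be sonnet or opus");
  if (!inRange(cfg.stepTimeoutSec, 60, 1800)) errors.push("stepTimeoutSec must be an integer in 60..1800");
  return { cfg, errors };
}

// Only blocking findings are ever verified; the rest keep their judgment verdict.
// Assumption: the cap is applied in list order.
function selectForVerification(accepted, cfg) {
  if (!cfg.enabled) return [];
  return accepted.filter((f) => f.severity === "blocking").slice(0, cfg.maxFindings);
}
```

With no `verifyByTest` entry in prefs, `resolveVerifyByTest` yields `enabled: false`, so `selectForVerification` returns an empty list and the step is skipped.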
@@ -1,9 +1,9 @@
1
1
  {
2
2
  "$schema": "https://json-schema.org/draft/2020-12/schema",
3
3
  "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/triage-output.schema.json",
4
- "version": "3.1.0",
4
+ "version": "3.2.0",
5
5
  "title": "Multi-Agent Pipeline - Phase 4 triage output",
6
- "description": "Contract for the Opus triage agent's JSON output in Phase 4 Step 3. Triage consumes merged reviewer findings and splits them into accepted/deferred/rejected. Only `accepted` blocking/important items trigger Phase 3 rework. v3.1.0 adds the optional `consensus` block so triage can surface reviewer-agreement risk (false consensus among same-base-model reviewers) instead of silently merging.",
6
+ "description": "Contract for the Opus triage agent's JSON output in Phase 4 Step 3. Triage consumes merged reviewer findings and splits them into accepted/deferred/rejected. Only `accepted` blocking/important items trigger Phase 3 rework. v3.1.0 adds the optional `consensus` block so triage can surface reviewer-agreement risk (false consensus among same-base-model reviewers) instead of silently merging. v3.2.0 adds the optional per-finding `verification` block written by Phase 4 Step 3.7 (verify-by-test): the empirical repro-test outcome for accepted blocking findings.",
7
7
  "type": "object",
8
8
  "additionalProperties": false,
9
9
  "required": ["accepted", "deferred", "rejected", "approved"],
@@ -114,6 +114,35 @@
114
114
  }
115
115
  }
116
116
  },
117
+ "verification": {
118
+ "type": "object",
119
+ "additionalProperties": false,
120
+ "description": "v3.2.0 verify-by-test outcome (Phase 4 Step 3.7, opt-in via prefs.global.verifyByTest). confirmed = repro test failed as the finding predicts (finding stands, test kept as the Phase 3 RED test); not-reproduced = repro test passed under evidence-gate (finding downgraded to deferred); inconclusive = compile error / timeout / not unit-testable (judgment verdict stands).",
121
+ "required": ["result"],
122
+ "properties": {
123
+ "result": {
124
+ "type": "string",
125
+ "enum": ["confirmed", "not-reproduced", "inconclusive"]
126
+ },
127
+ "testRef": {
128
+ "type": "string",
129
+ "minLength": 1,
130
+ "description": "Single-test reference, e.g. 'AuthTests/LoginTests/testExpiredTokenRejected' or 'tests/test_auth.py::test_expired_token'."
131
+ },
132
+ "evidencePath": {
133
+ "type": "string",
134
+ "minLength": 1,
135
+ "description": "Path to the test-run log verified by evidence-gate.mjs, e.g. '.pipeline/verify-1.test.log'."
136
+ },
137
+ "note": { "type": "string" }
138
+ },
139
+ "if": {
140
+ "properties": { "result": { "enum": ["confirmed", "not-reproduced"] } }
141
+ },
142
+ "then": {
143
+ "required": ["result", "testRef", "evidencePath"]
144
+ }
145
+ },
117
146
  "rawFinding": {
118
147
  "type": "object",
119
148
  "additionalProperties": false,
@@ -124,7 +153,8 @@
124
153
  "line": { "type": "integer", "minimum": 0 },
125
154
  "issue": { "type": "string", "minLength": 4 },
126
155
  "fix": { "type": "string" },
127
- "reviewer": { "$ref": "#/$defs/reviewer" }
156
+ "reviewer": { "$ref": "#/$defs/reviewer" },
157
+ "verification": { "$ref": "#/$defs/verification" }
128
158
  }
129
159
  },
130
160
  "acceptedFinding": {
@@ -144,7 +174,8 @@
144
174
  "type": "string",
145
175
  "minLength": 4,
146
176
  "description": "Concrete change the dev agent must make. Required for accepted items so Phase 3 re-entry has actionable direction."
147
- }
177
+ },
178
+ "verification": { "$ref": "#/$defs/verification" }
148
179
  }
149
180
  }
150
181
  ]
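The three `verification.result` values map a single repro-test run onto a finding's disposition. The sketch below shows one way to express that mapping. It is not the pipeline's code: `applyVerification` and the shape of `run` (`outcome`, `testRef`, `evidencePath`, `evidenceVerified`) are hypothetical, and treating a passing test without verified evidence as `inconclusive` is an assumption, since the schema only describes a pass "under evidence-gate".

```javascript
// Sketch only: turn one repro-test run into a verification block and a bucket.
// run.outcome is assumed to be one of:
//   "failed" | "passed" | "compile-error" | "timeout" | "not-testable"
function applyVerification(finding, run) {
  if (run.outcome === "failed") {
    // Test failed as the finding predicts: the finding stands and the test
    // becomes the RED test for the Phase 3 rework.
    return {
      bucket: "accepted",
      redTest: run.testRef,
      finding: {
        ...finding,
        verification: { result: "confirmed", testRef: run.testRef, evidencePath: run.evidencePath },
      },
    };
  }
  if (run.outcome === "passed" && run.evidenceVerified === true) {
    // Not reproducible: downgrade to deferred, never to rejected.
    return {
      bucket: "deferred",
      redTest: null,
      finding: {
        ...finding,
        verification: { result: "not-reproduced", testRef: run.testRef, evidencePath: run.evidencePath },
      },
    };
  }
  // Compile error, timeout, not unit-testable, or an unverified pass:
  // the judgment verdict stands, so the finding stays accepted.
  return {
    bucket: "accepted",
    redTest: null,
    finding: { ...finding, verification: { result: "inconclusive", note: run.outcome } },
  };
}
```

Only the `confirmed` and `not-reproduced` branches carry `testRef` and `evidencePath`, which matches the schema's `if`/`then` requirement; the `inconclusive` branch needs `result` alone.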
@@ -22,6 +22,8 @@ Validate contracts. Each emits `══ <name> smoke: N passed, M failed ══`
22
22
  - `smoke-phase-6-multi.sh` - Phase 6 multi-repo commit/PR cross-linking
23
23
  - `smoke-phase-banner.sh` + `smoke-phase-tracker.sh` - Phase UI output contracts
24
24
  - `smoke-phase4-triage.sh` - Phase 4 reviewer → triage flow
25
+ - `smoke-verify-by-test.sh` - Phase 4 Step 3.7 verify-by-test contract (v10.8.0)
26
+ - `smoke-handoff-contract.sh` - phase-boundary structured handoff + handoff-first resume (v10.8.0)
25
27
 
26
28
  ### Schema + state
27
29
  - `smoke-schema-validation.sh` - all JSON schemas validate
@@ -15,6 +15,7 @@
15
15
  * complexity_delta - added if/guard/case/switch/while count w=1.5
16
16
  * ui_critical - *View.swift / *Screen.kt / Configuration w=1.5
17
17
  * migration - DB schema / migration path w=4.0
18
+ * test_lines_removed - test file shrinks (removed > added) w=3.0
18
19
  *
19
20
  * Inputs:
20
21
  * --base <ref> Base ref. Default: origin/main, fallback: main
@@ -275,6 +276,15 @@ function buildRow(stat, addedLines, allChangedPaths) {
275
276
  }
276
277
  }
277
278
 
279
+ // Test-lines-removed: a test-classified file whose diff removes more lines
280
+ // than it adds. Shrinking tests is the classic get-to-green shortcut the
281
+ // immutable-test rule forbids (refs/rules.md); surface it to reviewers.
282
+ if (isTestPath(path) && stat.removed > stat.added) {
283
+ const w = 3.0;
284
+ signals.push({ name: "test_lines_removed", weight: w, value: stat.removed - stat.added });
285
+ score += 12 * w;
286
+ }
287
+
278
288
  return {
279
289
  path,
280
290
  score: Math.round(score * 100) / 100,
@@ -306,7 +316,7 @@ function main() {
306
316
  };
307
317
 
308
318
  const out = {
309
- schemaVersion: "1.0.0",
319
+ schemaVersion: "1.1.0",
310
320
  task: {
311
321
  id: TASK_ID,
312
322
  base: BASE || "(diff-file)",
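The `test_lines_removed` arithmetic can be reproduced outside the scorer with a small sketch like the one below. The constants (weight 3.0, score contribution `12 * w`, value `removed - added`) come from the hunk above; everything else is an assumption. In particular, `looksLikeTestPath` is a crude hypothetical stand-in for the real `isTestPath()`, whose rules are not shown in this diff, and `diffStats` is a simplified unified-diff counter rather than the script's own parser.

```javascript
// Hypothetical stand-in for isTestPath(): matches "test"/"tests" followed by
// "/" or ".", e.g. "MyAppTests/..." or "FooTests.swift". Deliberately crude.
function looksLikeTestPath(path) {
  return /tests?[\/.]/i.test(path);
}

// Count added/removed lines per file in a unified diff.
function diffStats(diffText) {
  const stats = new Map();
  let current = null;
  let inHunk = false;
  for (const line of diffText.split("\n")) {
    if (line.startsWith("diff --git ")) { inHunk = false; current = null; continue; }
    if (!inHunk && line.startsWith("+++ b/")) {
      current = { added: 0, removed: 0 };
      stats.set(line.slice("+++ b/".length), current);
      continue;
    }
    if (line.startsWith("@@")) { inHunk = true; continue; }
    if (!inHunk || current === null) continue; // skip file headers
    if (line.startsWith("+")) current.added++;
    else if (line.startsWith("-")) current.removed++;
  }
  return stats;
}

// Fire the signal only for test-classified files that shrink.
function testLinesRemovedSignal(path, stat) {
  if (!looksLikeTestPath(path) || stat.removed <= stat.added) return null;
  const weight = 3.0;
  return {
    signal: { name: "test_lines_removed", weight, value: stat.removed - stat.added },
    scoreDelta: 12 * weight,
  };
}
```

Under these assumptions, a test file with 18 removed and 2 added lines yields `value` 16 and adds 36 to the file's score, while a source file with the same counts yields no signal.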
@@ -0,0 +1,40 @@
1
+ diff --git a/MyAppTests/LoginViewModelTests.swift b/MyAppTests/LoginViewModelTests.swift
2
+ index 1111111..2222222 100644
3
+ --- a/MyAppTests/LoginViewModelTests.swift
4
+ +++ b/MyAppTests/LoginViewModelTests.swift
5
+ @@ -10,30 +10,20 @@ final class LoginViewModelTests: XCTestCase {
6
+ func testLoginWithValidCredentials_Succeeds() {
7
+ let sut = LoginViewModel(service: MockAuthService())
8
+ + sut.retryPolicy = .none
9
+ sut.login(email: "user@example.com", password: "correct")
10
+ + XCTAssertTrue(sut.isAuthenticated)
11
+ }
12
+ -
13
+ - func testLoginWithInvalidEmail_ShowsError() {
14
+ - let sut = LoginViewModel(service: MockAuthService())
15
+ - sut.login(email: "not-an-email", password: "irrelevant")
16
+ - XCTAssertEqual(sut.errorMessage, "Invalid email")
17
+ - }
18
+ -
19
+ - func testLoginWithExpiredToken_Rejects() {
20
+ - let sut = LoginViewModel(service: MockAuthService(tokenState: .expired))
21
+ - sut.login(email: "user@example.com", password: "correct")
22
+ - XCTAssertFalse(sut.isAuthenticated)
23
+ - }
24
+ -
25
+ - func testLogout_ClearsSession() {
26
+ - let sut = LoginViewModel(service: MockAuthService())
27
+ - sut.logout()
28
+ - XCTAssertNil(sut.session)
29
+ - }
30
+ }
31
+ diff --git a/MyApp/Sources/Auth/LoginViewModel.swift b/MyApp/Sources/Auth/LoginViewModel.swift
32
+ index 3333333..4444444 100644
33
+ --- a/MyApp/Sources/Auth/LoginViewModel.swift
34
+ +++ b/MyApp/Sources/Auth/LoginViewModel.swift
35
+ @@ -20,6 +20,8 @@ final class LoginViewModel {
36
+ func login(email: String, password: String) {
37
+ + guard email.contains("@") else { return }
38
+ + service.authenticate(email: email, password: password)
39
+ }
40
+ }
@@ -1,16 +1,16 @@
1
1
  .claude/CLAUDE.md 1
2
2
  .claude/agents 8
3
- .claude/commands 88
3
+ .claude/commands 89
4
4
  .claude/lib 23
5
5
  .claude/multi-agent-preferences.json 1
6
6
  .claude/rules 12
7
7
  .claude/schemas 23
8
- .claude/scripts 167
8
+ .claude/scripts 169
9
9
  .claude/settings.json 1
10
10
  .claude/skills 560
11
11
  .copilot/agents 8
12
12
  .copilot/copilot-instructions.md 1
13
13
  .copilot/lib 23
14
14
  .copilot/schemas 23
15
- .copilot/scripts 167
15
+ .copilot/scripts 169
16
16
  .copilot/skills 596
@@ -12,6 +12,7 @@
12
12
  # 8. phase-4-review.md ref doc declares Step 1.75 + diff-risk-score.mjs
13
13
  # 9. code-reviewer.md agent template carries the priority-files placeholder
14
14
  # 10. prefs.schema.json exposes diffRisk advisory toggle
15
+ # 11. test-removal fixture fires the test_lines_removed signal (v1.1.0)
15
16
  #
16
17
  # Exit 0 = all pass, 1 = any failure.
17
18
 
@@ -26,6 +27,7 @@ REVIEWER="$ROOT/pipeline/agents/code-reviewer.md"
26
27
  PREFS="$ROOT/pipeline/schemas/prefs.schema.json"
27
28
  FIX_IOS="$ROOT/pipeline/scripts/fixtures/diff-risk-ios.diff"
28
29
  FIX_AND="$ROOT/pipeline/scripts/fixtures/diff-risk-android.diff"
30
+ FIX_TESTRM="$ROOT/pipeline/scripts/fixtures/diff-risk-test-removal.diff"
29
31
 
30
32
  pass=0
31
33
  fail=0
@@ -38,10 +40,11 @@ printf '→ smoke-diff-risk (v8.3.0): pre-review risk scoring contract\n'
38
40
  [ -f "$SCHEMA" ] || { record_fail "schema missing: $SCHEMA"; exit 1; }
39
41
  [ -f "$FIX_IOS" ] || { record_fail "fixture missing: $FIX_IOS"; exit 1; }
40
42
  [ -f "$FIX_AND" ] || { record_fail "fixture missing: $FIX_AND"; exit 1; }
43
+ [ -f "$FIX_TESTRM" ] || { record_fail "fixture missing: $FIX_TESTRM"; exit 1; }
41
44
 
42
45
  # --- 1: iOS fixture produces JSON ---
43
46
  out_ios=$(node "$SCORE" --diff "$FIX_IOS" 2>/dev/null)
44
- if jq -e '.schemaVersion == "1.0.0"' <<< "$out_ios" >/dev/null 2>&1; then
47
+ if jq -e '.schemaVersion == "1.1.0"' <<< "$out_ios" >/dev/null 2>&1; then
45
48
  record_pass "iOS fixture renders schema-versioned JSON"
46
49
  else
47
50
  record_fail "iOS fixture JSON malformed or missing schemaVersion"
@@ -150,6 +153,32 @@ else
150
153
  record_fail "prefs.schema.json missing global.diffRiskAdvisory"
151
154
  fi
152
155
 
156
+ # --- 11: test_lines_removed signal fires on the test-removal fixture ---
157
+ out_testrm=$(node "$SCORE" --diff "$FIX_TESTRM" 2>/dev/null)
158
+ sig_value=$(jq -r '.files[] | select(.path == "MyAppTests/LoginViewModelTests.swift")
159
+ | .signals[] | select(.name == "test_lines_removed") | .value' <<< "$out_testrm")
160
+ if [ "$sig_value" = "16" ]; then
161
+ record_pass "test_lines_removed fires with value=16 (18 removed - 2 added)"
162
+ else
163
+ record_fail "test_lines_removed should fire with value=16, got: ${sig_value:-missing}"
164
+ fi
165
+ sig_on_source=$(jq -r '[.files[] | select(.path == "MyApp/Sources/Auth/LoginViewModel.swift")
166
+ | .signals[] | select(.name == "test_lines_removed")] | length' <<< "$out_testrm")
167
+ if [ "$sig_on_source" = "0" ]; then
168
+ record_pass "test_lines_removed does not fire on source files"
169
+ else
170
+ record_fail "test_lines_removed must only fire on test-classified paths"
171
+ fi
172
+ set +e
173
+ echo "$out_testrm" | node "$VALIDATE" - >/dev/null 2>&1
174
+ rc_testrm=$?
175
+ set -e
176
+ if [ "$rc_testrm" -eq 0 ]; then
177
+ record_pass "validator accepts output carrying test_lines_removed"
178
+ else
179
+ record_fail "validator rejected test_lines_removed output (rc=$rc_testrm)"
180
+ fi
181
+
153
182
  # --- Summary ---
154
183
  total=$((pass + fail))
155
184
  printf '\n→ smoke-diff-risk: %d/%d passed\n' "$pass" "$total"
@@ -0,0 +1,92 @@
1
+ #!/usr/bin/env bash
2
+ # smoke-handoff-contract.sh
3
+ #
4
+ # Verifies the v10.8.0 structured-handoff contract (fresh-context re-entry):
5
+ # 1. operations.md documents the Handoff block with all 5 required lines
6
+ # 2. operations.md compaction trigger re-reads state AND the latest handoff
7
+ # 3. log-format.md documents the Handoff section in the canonical log shape
8
+ # 4. resume.md Step 3 reads the latest handoff FIRST with pre-v10.8 fallback
9
+ #
10
+ # Exit 0 = all pass, 1 = any failure.
11
+
12
+ set -euo pipefail
13
+
14
+ ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
15
+ OPS="$ROOT/pipeline/commands/multi-agent/refs/phases/operations.md"
16
+ LOGFMT="$ROOT/pipeline/commands/multi-agent/refs/phases/log-format.md"
17
+ RESUME="$ROOT/pipeline/commands/multi-agent/resume.md"
18
+
19
+ pass=0
20
+ fail=0
21
+ failures=()
22
+ record_pass() { pass=$((pass + 1)); printf ' \033[0;32mPASS\033[0m %s\n' "$1"; }
23
+ record_fail() { fail=$((fail + 1)); failures+=("$1"); printf ' \033[0;31mFAIL\033[0m %s\n' "$1"; }
24
+
25
+ printf '→ smoke-handoff-contract: structured handoff (fresh-context re-entry)\n'
26
+
27
+ # 1. operations.md documents the Handoff block with the 5 required lines
28
+ if [ ! -f "$OPS" ]; then
29
+ record_fail "operations.md missing"
30
+ else
31
+ if grep -qF "Handoff block (v10.8.0)" "$OPS"; then
32
+ record_pass "operations.md documents the Handoff block"
33
+ else
34
+ record_fail "operations.md missing 'Handoff block (v10.8.0)' spec"
35
+ fi
36
+ for line in "- Done:" "- Remaining:" "- Decisions:" "- Open findings:" "- Next:"; do
37
+ if grep -qF -- "$line" "$OPS"; then
38
+ record_pass "operations.md handoff spec has '$line'"
39
+ else
40
+ record_fail "operations.md handoff spec missing '$line'"
41
+ fi
42
+ done
43
+ if grep -qF "no agent dispatch, no extra LLM call" "$OPS"; then
44
+ record_pass "operations.md states handoff is orchestrator-written (no LLM call)"
45
+ else
46
+ record_fail "operations.md must state the handoff costs no LLM call"
47
+ fi
48
+ fi
49
+
50
+ # 2. Compaction trigger re-reads state AND latest handoff
51
+ if grep -qE 'agent-state\.json.*AND the latest.*Handoff' "$OPS"; then
52
+ record_pass "compaction trigger re-reads state + latest handoff"
53
+ else
54
+ record_fail "operations.md compaction trigger must re-read agent-state.json AND the latest Handoff block"
55
+ fi
56
+
57
+ # 3. log-format.md documents the Handoff section
58
+ if grep -qF "## Handoff - end of Phase" "$LOGFMT"; then
59
+ record_pass "log-format.md documents the Handoff section"
60
+ else
61
+ record_fail "log-format.md missing the Handoff section"
62
+ fi
63
+ if grep -qF "LATEST block is authoritative" "$LOGFMT"; then
64
+ record_pass "log-format.md states latest-block-wins semantics"
65
+ else
66
+ record_fail "log-format.md must state the latest handoff block is authoritative"
67
+ fi
68
+
69
+ # 4. resume.md reads handoff first, with fallback for older logs
70
+ if grep -qE 'LATEST .?## Handoff.? block' "$RESUME"; then
71
+ record_pass "resume.md Step 3 reads the latest Handoff block first"
72
+ else
73
+ record_fail "resume.md Step 3 must read the latest Handoff block first"
74
+ fi
75
+ if grep -qiF "fall back to per-phase findings" "$RESUME"; then
76
+ record_pass "resume.md keeps the pre-v10.8 per-phase fallback"
77
+ else
78
+ record_fail "resume.md must keep the pre-v10.8 per-phase findings fallback"
79
+ fi
80
+ if grep -qF "trust state on mismatch" "$RESUME"; then
81
+ record_pass "resume.md defines state-wins conflict rule"
82
+ else
83
+ record_fail "resume.md must define the handoff-vs-state conflict rule (state wins)"
84
+ fi
85
+
86
+ printf '\n══ handoff-contract smoke: %d passed, %d failed ══\n' "$pass" "$fail"
87
+ if [ "$fail" -gt 0 ]; then
88
+ printf '\nFailures:\n'
89
+ for msg in "${failures[@]}"; do printf ' - %s\n' "$msg"; done
90
+ exit 1
91
+ fi
92
+ exit 0
@@ -0,0 +1,148 @@
1
+ #!/usr/bin/env bash
2
+ # smoke-verify-by-test.sh
3
+ #
4
+ # Verifies the Phase 4 Step 3.7 verify-by-test contract:
5
+ # 1. phase-4-review.md documents Step 3.7 with evidence-gate invocation + feature-doc pointer
6
+ # 2. refs/features/verify-by-test.md exists and covers the verdict table + red-test handoff
7
+ # 3. prefs.schema.json exposes global.verifyByTest.{enabled,maxFindings,model,stepTimeoutSec}
8
+ # 4. verifyByTest.enabled defaults to false (opt-in, no surprise cost)
9
+ # 5. triage-output.schema.json is v3.2.0 with the $defs.verification result enum
10
+ # 6. validate-triage.mjs accepts a valid `confirmed` verification and rejects bad ones
11
+ # 7. phase-3-dev.md documents the redTests rework re-entry
12
+ #
13
+ # Exit 0 = all pass, 1 = any failure.
14
+
15
+ set -euo pipefail
16
+
17
+ ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
18
+ PHASE4_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-4-review.md"
19
+ PHASE3_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md"
20
+ FEATURE_DOC="$ROOT/pipeline/commands/multi-agent/refs/features/verify-by-test.md"
21
+ PREFS_SCHEMA="$ROOT/pipeline/schemas/prefs.schema.json"
22
+ TRIAGE_SCHEMA="$ROOT/pipeline/schemas/triage-output.schema.json"
23
+ VALIDATOR="$ROOT/pipeline/scripts/validate-triage.mjs"
24
+
25
+ pass=0
26
+ fail=0
27
+ failures=()
28
+ record_pass() { pass=$((pass + 1)); printf ' \033[0;32mPASS\033[0m %s\n' "$1"; }
29
+ record_fail() { fail=$((fail + 1)); failures+=("$1"); printf ' \033[0;31mFAIL\033[0m %s\n' "$1"; }
30
+
31
+ printf '→ smoke-verify-by-test: Phase 4 Step 3.7 contract\n'
32
+
33
+ # 1. Phase 4 doc documents Step 3.7
34
+ if [ ! -f "$PHASE4_DOC" ]; then
35
+ record_fail "phase-4-review.md missing"
36
+ else
37
+ if grep -qF "3.7 Verify-by-test" "$PHASE4_DOC"; then
38
+ record_pass "phase-4-review.md documents Step 3.7"
39
+ else
40
+ record_fail "phase-4-review.md missing Step 3.7 section"
41
+ fi
42
+ if grep -qF "evidence-gate.mjs --claim test --status passed" "$PHASE4_DOC"; then
43
+ record_pass "Step 3.7 downgrade is evidence-gated"
44
+ else
45
+ record_fail "Step 3.7 must gate downgrades via evidence-gate.mjs --claim test --status passed"
46
+ fi
47
+ if grep -qF "refs/features/verify-by-test.md" "$PHASE4_DOC"; then
48
+ record_pass "phase-4-review.md points to the feature doc"
49
+ else
50
+ record_fail "phase-4-review.md must reference refs/features/verify-by-test.md"
51
+ fi
52
+ if grep -qF "review.verify_by_test" "$PHASE4_DOC"; then
53
+ record_pass "Step 3.7 emits review.verify_by_test telemetry"
54
+ else
55
+ record_fail "Step 3.7 must document the review.verify_by_test metric"
56
+ fi
57
+ fi
58
+
59
+ # 2. Feature doc exists with verdict + handoff coverage
60
+ if [ ! -f "$FEATURE_DOC" ]; then
61
+ record_fail "refs/features/verify-by-test.md missing"
62
+ else
63
+ for token in "not-reproduced" "inconclusive" "redTests" "Off by default"; do
64
+ if grep -qF "$token" "$FEATURE_DOC"; then
65
+ record_pass "feature doc covers '$token'"
66
+ else
67
+ record_fail "feature doc missing '$token'"
68
+ fi
69
+ done
70
+ fi
71
+
72
+ # 3. Prefs schema exposes verifyByTest knobs
73
+ for prop in enabled maxFindings model stepTimeoutSec; do
74
+ if jq -e ".properties.global.properties.verifyByTest.properties.${prop}" "$PREFS_SCHEMA" >/dev/null 2>&1; then
75
+ record_pass "prefs schema exposes verifyByTest.${prop}"
76
+ else
77
+ record_fail "prefs schema missing verifyByTest.${prop}"
78
+ fi
79
+ done
80
+
81
+ # 4. Off by default - preserves existing-user baseline
82
+ if jq -e '.properties.global.properties.verifyByTest.properties.enabled
83
+ | has("default") and .default == false' "$PREFS_SCHEMA" >/dev/null 2>&1; then
84
+ record_pass "verifyByTest.enabled defaults to false (opt-in)"
85
+ else
86
+ record_fail "verifyByTest.enabled must default to false"
87
+ fi
88
+
89
+ # 5. Triage schema version + verification enum
90
+ schema_version=$(jq -r '.version // empty' "$TRIAGE_SCHEMA")
91
+ if [ "$schema_version" = "3.2.0" ]; then
92
+ record_pass "triage-output schema version is 3.2.0"
93
+ else
94
+ record_fail "triage-output schema version should be 3.2.0 (was: ${schema_version:-missing})"
95
+ fi
96
+ if jq -e '.["$defs"].verification.properties.result.enum
97
+ | (index("confirmed") != null
98
+ and index("not-reproduced") != null
99
+ and index("inconclusive") != null)' "$TRIAGE_SCHEMA" >/dev/null 2>&1; then
100
+ record_pass "schema verification.result enum complete"
101
+ else
102
+ record_fail "schema \$defs.verification.result enum must be confirmed/not-reproduced/inconclusive"
103
+ fi
104
+
105
+ # 6. Behavioral validator round-trips
106
+ valid_fixture='{"accepted":[{"severity":"blocking","file":"Sources/Auth/Login.swift","line":42,"issue":"expired token accepted as valid","fix":"reject tokens past expiry in validateToken()","reviewer":"fable","verification":{"result":"confirmed","testRef":"AuthTests/LoginTests/testExpiredTokenRejected","evidencePath":".pipeline/verify-1.test.log"}}],"deferred":[],"rejected":[],"approved":false}'
107
+ if printf '%s' "$valid_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
108
+ record_pass "validator accepts confirmed verification with testRef+evidencePath"
109
+ else
110
+ record_fail "validator rejected a valid confirmed verification"
111
+ fi
112
+
113
+ bad_result_fixture='{"accepted":[{"severity":"blocking","file":"a.swift","line":1,"issue":"some issue here","fix":"do the fix","reviewer":"fable","verification":{"result":"maybe"}}],"deferred":[],"rejected":[],"approved":false}'
114
+ if printf '%s' "$bad_result_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
115
+ record_fail "validator must reject verification.result 'maybe'"
116
+ else
117
+ record_pass "validator rejects bad verification.result"
118
+ fi
119
+
120
+ missing_ref_fixture='{"accepted":[{"severity":"blocking","file":"a.swift","line":1,"issue":"some issue here","fix":"do the fix","reviewer":"fable","verification":{"result":"confirmed"}}],"deferred":[],"rejected":[],"approved":false}'
121
+ if printf '%s' "$missing_ref_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
122
+ record_fail "validator must reject confirmed verification without testRef/evidencePath"
123
+ else
124
+ record_pass "validator rejects confirmed verification lacking testRef/evidencePath"
125
+ fi
126
+
127
+ # Reviewer enum parity: fable (Claude Code default) and gpt (Copilot CLI) accepted
128
+ fable_fixture='{"accepted":[{"severity":"important","file":"a.swift","line":1,"issue":"some issue here","fix":"do the fix","reviewer":"gpt"}],"deferred":[],"rejected":[],"approved":true}'
129
+ if printf '%s' "$fable_fixture" | node "$VALIDATOR" - >/dev/null 2>&1; then
130
+ record_pass "validator accepts schema-allowed reviewers (fable/gpt)"
131
+ else
132
+ record_fail "validator must accept reviewer values fable and gpt (schema v3.1.0 parity)"
133
+ fi
134
+
135
+ # 7. Phase 3 doc documents the red-test rework re-entry
136
+ if grep -qF "verifyByTest.redTests" "$PHASE3_DOC"; then
137
+ record_pass "phase-3-dev.md documents redTests rework re-entry"
138
+ else
139
+ record_fail "phase-3-dev.md must document verifyByTest.redTests re-entry"
140
+ fi
141
+
142
+ printf '\n══ verify-by-test smoke: %d passed, %d failed ══\n' "$pass" "$fail"
143
+ if [ "$fail" -gt 0 ]; then
144
+ printf '\nFailures:\n'
145
+ for msg in "${failures[@]}"; do printf ' - %s\n' "$msg"; done
146
+ exit 1
147
+ fi
148
+ exit 0
@@ -23,6 +23,7 @@ const ALLOWED_SIGNALS = new Set([
23
23
  "complexity_delta",
24
24
  "ui_critical",
25
25
  "migration",
26
+ "test_lines_removed",
26
27
  ]);
27
28
 
28
29
  function readInput() {
@@ -48,7 +49,7 @@ function validate(obj) {
48
49
  if (typeof obj !== "object" || obj === null || Array.isArray(obj)) {
49
50
  return ["root must be an object"];
50
51
  }
51
- if (obj.schemaVersion !== "1.0.0") errors.push(`schemaVersion must be "1.0.0", got ${JSON.stringify(obj.schemaVersion)}`);
52
+ if (obj.schemaVersion !== "1.1.0") errors.push(`schemaVersion must be "1.1.0", got ${JSON.stringify(obj.schemaVersion)}`);
52
53
 
53
54
  if (typeof obj.task !== "object" || obj.task === null) {
54
55
  errors.push("task must be an object");
@@ -23,9 +23,10 @@
23
23
 
24
24
  import { readFileSync } from "node:fs";
25
25
 
26
- const ALLOWED_REVIEWERS = new Set(["opus", "sonnet"]);
26
+ const ALLOWED_REVIEWERS = new Set(["fable", "opus", "sonnet", "gpt"]);
27
27
  const ALLOWED_SEVERITIES = new Set(["blocking", "important", "suggestion"]);
28
28
  const ALLOWED_CONSENSUS_VERDICTS = new Set(["unanimous-pass", "unanimous-block", "split", "unverified"]);
29
+ const ALLOWED_VERIFICATION_RESULTS = new Set(["confirmed", "not-reproduced", "inconclusive"]);
29
30
  const OVER_REJECT_THRESHOLD = 0.8;
30
31
  const OVER_REJECT_MIN_FINDINGS = 5;
31
32
 
@@ -64,13 +65,41 @@ function validateRawFinding(f, label, errors) {
64
65
  if (typeof f.issue !== "string" || f.issue.length < 4) {
65
66
  errors.push(`${label}: issue must be a string ≥4 chars`);
66
67
  }
68
+ if (f.verification !== undefined) {
69
+ validateVerification(f.verification, `${label}.verification`, errors);
70
+ }
71
+ }
72
+
73
+ // v3.2.0 verify-by-test outcome (Phase 4 Step 3.7). Optional; when present:
74
+ // result is required, and confirmed/not-reproduced additionally require
75
+ // testRef + evidencePath (the empirical claims must be traceable).
76
+ function validateVerification(v, label, errors) {
77
+ if (typeof v !== "object" || v === null || Array.isArray(v)) {
78
+ errors.push(`${label}: must be an object when present`);
79
+ return;
80
+ }
81
+ if (!ALLOWED_VERIFICATION_RESULTS.has(v.result)) {
82
+ errors.push(`${label}: bad result "${v.result}" (allowed: confirmed|not-reproduced|inconclusive)`);
83
+ return;
84
+ }
85
+ if (v.result === "confirmed" || v.result === "not-reproduced") {
86
+ if (typeof v.testRef !== "string" || v.testRef.length === 0) {
87
+ errors.push(`${label}: testRef required and non-empty when result is "${v.result}"`);
88
+ }
89
+ if (typeof v.evidencePath !== "string" || v.evidencePath.length === 0) {
90
+ errors.push(`${label}: evidencePath required and non-empty when result is "${v.result}"`);
91
+ }
92
+ }
93
+ if (v.note !== undefined && typeof v.note !== "string") {
94
+ errors.push(`${label}: note must be a string when present`);
95
+ }
67
96
  }
68
97
 
69
98
  function validateAccepted(f, i, errors) {
70
99
  validateRawFinding(f, `accepted[${i}]`, errors);
71
100
  if (!ALLOWED_REVIEWERS.has(f.reviewer)) {
72
101
  errors.push(
73
- `accepted[${i}]: reviewer must be "opus" or "sonnet" (got ${JSON.stringify(f.reviewer)}; haiku was removed in v2.1.0)`,
102
+ `accepted[${i}]: reviewer must be one of fable|opus|sonnet|gpt (got ${JSON.stringify(f.reviewer)}; haiku was removed in v2.1.0)`,
74
103
  );
75
104
  }
76
105
  if (typeof f.fix !== "string" || f.fix.length < 4) {