@mmerterden/multi-agent-pipeline 10.7.4 → 10.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29) hide show
  1. package/CHANGELOG.md +93 -0
  2. package/README.md +2 -0
  3. package/docs/engineering.md +76 -0
  4. package/docs/features.md +49 -33
  5. package/package.json +1 -1
  6. package/pipeline/commands/multi-agent/refs/features/verify-by-test.md +41 -0
  7. package/pipeline/commands/multi-agent/refs/phases/log-format.md +10 -0
  8. package/pipeline/commands/multi-agent/refs/phases/operations.md +15 -2
  9. package/pipeline/commands/multi-agent/refs/phases/phase-0-init.md +9 -0
  10. package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md +12 -0
  11. package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md +12 -1
  12. package/pipeline/commands/multi-agent/refs/rules.md +1 -0
  13. package/pipeline/commands/multi-agent/resume.md +7 -4
  14. package/pipeline/commands/multi-agent/review.md +33 -1
  15. package/pipeline/schemas/diff-risk.schema.json +5 -4
  16. package/pipeline/schemas/prefs.schema.json +59 -0
  17. package/pipeline/schemas/token-budget.json +7 -7
  18. package/pipeline/schemas/triage-output.schema.json +35 -4
  19. package/pipeline/scripts/README.md +3 -0
  20. package/pipeline/scripts/diff-risk-score.mjs +11 -1
  21. package/pipeline/scripts/fixtures/diff-risk-test-removal.diff +40 -0
  22. package/pipeline/scripts/fixtures/install-layout.tsv +3 -3
  23. package/pipeline/scripts/smoke-diff-risk.sh +30 -1
  24. package/pipeline/scripts/smoke-handoff-contract.sh +92 -0
  25. package/pipeline/scripts/smoke-update-check.sh +122 -0
  26. package/pipeline/scripts/smoke-verify-by-test.sh +148 -0
  27. package/pipeline/scripts/update-check.sh +82 -0
  28. package/pipeline/scripts/validate-diff-risk.mjs +2 -1
  29. package/pipeline/scripts/validate-triage.mjs +31 -2
package/CHANGELOG.md CHANGED
@@ -14,6 +14,99 @@ Internal file-layout changes that don't affect the slash-command surface are sti
14
14
 
15
15
  ---
16
16
 
17
+ ## [10.9.0] - 2026-07-03
18
+
19
+ ### Added
20
+
21
+ - **Update check at run start (Phase 0 Step 0.6, opt-out via
22
+ `prefs.global.updateCheck`).** Once per `ttlHours` window (default 24h,
23
+ cached, 3s-bounded curl straight to the npm registry - never `npm view`),
24
+ the pipeline compares the installed version against `dist-tags.latest`.
25
+ Newer version found: interactive modes ask ONE question ("Update now?" ->
26
+ yes runs the `/multi-agent:update` flow and the run continues; the update
27
+ takes full effect next session); autopilot logs one line and never asks
28
+ (zero-interaction contract). `autoUpdate: true` skips the question and
29
+ updates before the worktree is created. Offline, ahead-of-registry, and
30
+ failed checks are silent - the step never blocks. New
31
+ `pipeline/scripts/update-check.sh` + `smoke-update-check.sh` (13 assertions).
32
+ - **Design fidelity contract (Phase 3, BLOCKING on Figma-referenced tasks).**
33
+ The analysis doc's captured frames are the 1:1 reference for every UI line:
34
+ a Code Connect-mapped component must be used verbatim (no sound-alikes,
35
+ missing modifiers go into `+Modifiers`, never forks), unmapped UI becomes a
36
+ NEW component under the standard architecture, and inter-component spacing
37
+ comes from the design's measured values mapped to tokens - never invented.
38
+ Missing design data halts with a re-run-analysis instruction; Figma MCP
39
+ stays forbidden outside analysis.
40
+ - **`docs/engineering.md`** - version-free catalog of the discipline layer
41
+ (bounded loops, evidence gates, token-budgeted phase docs, immutable tests,
42
+ fresh-context handoffs, memory hedges, install hygiene), linked from README.
43
+
44
+ ### Changed
45
+
46
+ - **Docs drop inline version markers.** `docs/features.md` headers and bullets
47
+ no longer carry `(vX.Y.Z)` tags - the docs describe the current state;
48
+ history lives in this CHANGELOG. Same principle applied to the website copy.
49
+ - **Phase-doc token budgets recalibrated** (`token-budget.json`, total
50
+ 46500 -> 50000) after the verify-by-test, update-check, immutable-test, and
51
+ design-fidelity contracts landed - Step 3.7 prose was compressed to a
52
+ pointer into `refs/features/verify-by-test.md` before recalibration.
53
+
54
+ ---
55
+
56
+ ## [10.8.0] - 2026-07-02
57
+
58
+ Three additive loop-quality features, all sourced from the 2026 agentic-loop research sweep (Anthropic long-running-agent harness guidance + adversarial-review findings): empirical verification of blocking findings, an immutable-test contract, and structured phase-boundary handoffs.
59
+
60
+ ### Added
61
+
62
+ - **Verify-by-test triage (Phase 4 Step 3.7, opt-in via `prefs.global.verifyByTest`).**
63
+ Accepted blocking findings are no longer judgment-final: one verifier agent
64
+ (default Sonnet, capped at `maxFindings`=3) writes a minimal repro test per
65
+ finding and runs ONLY that test. Test fails as predicted -> finding confirmed
66
+ and the repro test is handed to Phase 3 as the rework RED test
67
+ (`state.reviewIterations[-1].verifyByTest.redTests[]`). Test passes under
68
+ `evidence-gate.mjs` -> finding downgraded to `deferred` (never `rejected`).
69
+ Compile error / timeout -> `inconclusive`, judgment verdict stands. The whole
70
+ step is timeout-bounded and never blocks the pipeline. Schema
71
+ `triage-output` v3.2.0 adds the optional per-finding `verification` block;
72
+ `validate-triage.mjs` validates it. New feature doc
73
+ `refs/features/verify-by-test.md` + `smoke-verify-by-test.sh` (20 assertions).
74
+ - **Immutable-test rule + `test_lines_removed` diff-risk signal.** New rule in
75
+ `refs/rules.md` and the Phase 3 GREEN step: deleting, renaming, or weakening
76
+ an existing test to reach green is a violation; a test changes only when the
77
+ task changes the spec it encodes, named in the commit body. Deterministic
78
+ backstop: `diff-risk-score.mjs` v1.1.0 emits `test_lines_removed` (w=3.0)
79
+ when a test-classified file removes more lines than it adds; wired into the
80
+ Phase 4 Step 1.75 signal table, `diff-risk.schema.json` v1.1.0, and
81
+ `validate-diff-risk.mjs`. New fixture `diff-risk-test-removal.diff` +
82
+ 3 new `smoke-diff-risk.sh` assertions.
83
+ - **Structured handoff blocks (fresh-context re-entry discipline).** The
84
+ phase-boundary checkpoint in `refs/phases/operations.md` now appends a
85
+ structured `## Handoff` block (Done / Remaining / Decisions / Open findings /
86
+ Next) to `agent-log.md` at every phase transition - written by the
87
+ orchestrator from state it already holds, no LLM call, ~15 lines. The
88
+ post-`/compact` re-grounding and `resume.md` Step 3 read the latest handoff
89
+ FIRST (state wins on mismatch), falling back to per-phase findings for
90
+ pre-v10.8 logs. Documented in `log-format.md`; guarded by
91
+ `smoke-handoff-contract.sh` (13 assertions).
92
+
93
+ - **Module review guides in `/multi-agent:review` (Step 2b).** When a changed
94
+ file's module carries its own convention file (`CLAUDE.md`, `*-CLAUDE.md`,
95
+ `AGENTS.md` below repo root), the review discovers it deterministically from
96
+ the diff paths (capped at 5, truncation logged), injects it into every
97
+ reviewer prompt, and scopes each guide to files under its own directory.
98
+ Works with a local checkout or via provider API in PR mode.
99
+
100
+ ### Fixed
101
+
102
+ - **`validate-triage.mjs` reviewer enum accepts `fable` and `gpt`.** The runtime
103
+ validator still rejected the schema-v3.1.0 reviewer values restored in
104
+ v10.7.4 (`fable` is Reviewer 1 on Claude Code, `gpt` on Copilot CLI), so any
105
+ accepted finding attributed to them failed the 3.2.1 gate. Enum now matches
106
+ `triage-output.schema.json`.
107
+
108
+ ---
109
+
17
110
  ## [10.7.4] - 2026-07-02
18
111
 
19
112
  Deep-consistency sweep: the v10.6.0 Fable restore and the v10.7.0 adapter removal are now reflected on every surface, and the test suite is green again end to end.
package/README.md CHANGED
@@ -51,6 +51,8 @@ One command runs 8 phases, with a gate between the risky ones:
51
51
 
52
52
  Under the hood: each task runs in its own **git worktree** (or the current branch with `:local`), commits use the **git identity routed from the repo's origin URL**, and **multi-repo** tasks get per-repo worktrees plus an integration build. Tokens stay in the OS keychain; nothing is committed or logged. `/multi-agent:review` can also review an existing GitHub/Bitbucket PR — per-finding inline comments anchored to `file:line` + an explicit Approve / Needs-Work state.
53
53
 
54
+ The discipline behind all of this — bounded loops, evidence gates, token-budgeted phase docs, immutable tests, fresh-context handoffs — is catalogued in [docs/engineering.md](./docs/engineering.md). The full feature list lives in [docs/features.md](./docs/features.md).
55
+
54
56
  ## Modes
55
57
 
56
58
  | Mode | Command | Flow |
@@ -0,0 +1,76 @@
1
+ # Engineering Discipline
2
+
3
+ The pipeline's output quality comes less from any single model call and more from the constraints wrapped around every call. This page lists those constraints - the internals that keep long agentic runs correct, cheap, and auditable. No version markers here; this document always describes the current state (history lives in the [CHANGELOG](../CHANGELOG.md)).
4
+
5
+ ## Loop discipline
6
+
7
+ Every loop in the pipeline has a deterministic exit condition and a hard iteration cap. Nothing "keeps trying".
8
+
9
+ - **TDD micro-loop** (Phase 3): red -> green -> refactor per task; build-fix retries capped at 3 attempts.
10
+ - **Dev-critic loop** (Phase 3.5, opt-in): generator vs critic on deterministic criteria; max 2 iterations, round 3 escalates to the user.
11
+ - **Review-fix macro-loop** (Phase 3 <-> 4): only triage-accepted blocking findings re-enter development; max 3 iterations, then hard-kill and escalate.
12
+ - **Disagreement round** (Phase 4, opt-in): when reviewers split, exactly one rebuttal round - never more.
13
+ - **Global rule**: any retry loop stops after 3 attempts and asks the user. No exceptions, no infinite loops.
14
+
15
+ ## Evidence over claims
16
+
17
+ - **Default-FAIL evidence gate**: a "build passed" or "tests passed" claim is rejected unless the captured log actually shows success. Missing, empty, or failure-marked logs fail CLOSED.
18
+ - **Verify-by-test** (opt-in): an accepted blocking review finding can be empirically confirmed - a verifier agent writes a minimal repro test and runs it. Test fails as predicted: the finding stands and the failing test is handed to the rework loop as its RED test. Test passes: the finding is downgraded, and even the downgrade must pass the evidence gate.
19
+ - **Deterministic validator gates**: triage output, reviewer output, diff-risk output, and agent state are schema-validated by zero-dependency scripts whose exit codes decide the flow - one self-correction rework, then halt with a recovery hint. The LLM never grades its own homework.
20
+
21
+ ## Test integrity
22
+
23
+ - **Immutable tests**: deleting, renaming, or weakening an existing test to get a green run is a rule violation. A test changes only when the task changes the spec it encodes, named in the commit body.
24
+ - **Deterministic backstop**: the diff-risk scorer flags any test file whose diff removes more lines than it adds (`test_lines_removed`), so a shrinking test suite surfaces to reviewers even if the rule is ignored.
25
+ - **Test-gap detection**: newly added public symbols with no paired test are reported before the test phase.
26
+
27
+ ## Context management
28
+
29
+ - **Token-budgeted phase docs**: every phase document has a per-file and total token budget enforced by a smoke gate (`token-budget.json`). Growth is compressed first; budgets are recalibrated only when real contract text lands. This keeps the orchestrator's working context lean across an 8-phase run.
30
+ - **Lazy loading**: only the active phase's document is in context; references load on demand.
31
+ - **Structured handoff blocks**: at every phase boundary the orchestrator appends a Done / Remaining / Decisions / Open findings / Next block to the run log - written from state it already holds, no extra model call. Resume and post-compaction re-entry read the latest handoff plus the state file, never the conversation history.
32
+ - **Proactive compaction**: past ~50% context usage the run compacts itself and re-grounds from durable artifacts instead of waiting for lossy auto-compaction.
33
+
34
+ ## Review architecture
35
+
36
+ - **Deterministic gates before AI review**: build, lint, tests, and secret scan must be green before a single reviewer token is spent.
37
+ - **Cross-model parallel review**: independent reviewers with different vantage points, then a triage pass that filters false positives and out-of-scope noise. Reviewer findings are raw signals, not commands - nothing auto-loops on a "blocking" tag.
38
+ - **Consensus surfacing**: unanimous agreement between same-base-model reviewers on judgment-heavy surfaces is flagged as unverified instead of being trusted as independent confirmation.
39
+ - **Diff risk scoring**: a sub-second, zero-LLM heuristic ranks changed files (security paths, migrations, missing tests, shrinking tests) and feeds reviewers a priority hint - advisory, never a gate.
40
+ - **Prompt-cache-aware dispatch**: reviewer and triage prompts share a byte-identical leading block so parallel dispatches hit the prompt cache instead of re-billing the diff.
41
+
42
+ ## Design fidelity
43
+
44
+ - **Single source of design truth**: Figma is fetched once, during analysis, through a 3-tier fallback chain (MCP -> REST -> user screenshot). Every later phase reads the frozen analysis artifact; calling Figma from dev or review phases is a contract violation enforced by a smoke gate.
45
+ - **1:1 implementation contract**: a Code Connect-mapped component must be used verbatim (no sound-alike substitutes; missing modifiers are added to the component, never forked). Unmapped UI becomes a new component under the standard architecture. Spacing between components comes from the design's measured values mapped to tokens - never invented numbers.
46
+
47
+ ## Cost transparency
48
+
49
+ - **Per-phase token ledger**: every billable call forwards token counts to the tracker; each run ends with a cost table naming the top cost driver and pricing prompt-cache reads at their discounted rate.
50
+ - **Budget ceilings**: a cost budget can cap a run; crossing it triggers model downgrade or halt rather than silent overspend.
51
+ - **Model fallback contract**: premium-tier dispatches degrade deterministically (top tier -> mid -> base) on dispatch errors, date gates, or budget ceilings - never by improvisation.
52
+
53
+ ## Memory without bias
54
+
55
+ - **Learnings ledger**: durable per-repo facts and rejected review preferences replay into analysis and triage, so agents stop re-discovering and reviewers stop re-flagging the same things.
56
+ - **Hedged injection**: every memory injection carries an explicit "context, not command" hedge, and blocking-severity rejections are never memorized - a wrong rejection must not permanently suppress a finding class.
57
+ - **Prior-art caps**: triage sees at most a fixed number of highest-similarity past findings; memory is advisory context, not a verdict multiplier.
58
+
59
+ ## Update and install hygiene
60
+
61
+ - **Wipe-then-copy install**: pipeline-owned directories are cleared before copying, so files deleted upstream never linger as ghost commands or dead scripts on user machines.
62
+ - **Layout fingerprint**: a fixture pins the installed file layout; structural drift fails the suite.
63
+ - **Advisory update check**: at run start (cached, bounded, offline-silent) the pipeline compares its installed version with the registry; interactive runs offer a one-question update, autopilot only logs. It never blocks and never self-modifies mid-run.
64
+ - **Token-preserving uninstall**: removing the pipeline never touches the user's stored credentials.
65
+
66
+ ## Security and hygiene
67
+
68
+ - **Secret scanning**: staged diffs are scanned with provider-prefix patterns plus Shannon-entropy detection before every commit; findings block.
69
+ - **Personal-data genericization**: the public repo is generated from the private source with a scrub pass (names, hosts, project keys) and a zero-hit grep gate before push.
70
+ - **Skill scanner**: tiered pattern scanning over skill content guards against supply-chain surprises.
71
+ - **Intent guard**: a deterministic classifier separates "answer my question" from "change my code", so a conceptual question never spawns a worktree.
72
+
73
+ ## Cross-CLI parity
74
+
75
+ - **One source of truth, two runtimes**: Claude Code and Copilot CLI command surfaces are byte-identical where shared; a parity smoke fails on drift.
76
+ - **Portability gates**: every shell script passes `bash -n` and a GNU/BSD portability grep; the CI matrix runs macOS, Linux, and Windows.
package/docs/features.md CHANGED
@@ -43,7 +43,7 @@ Compose freely: `--dev --local autopilot` = shortest, least-friction path.
43
43
 
44
44
  Build commands, test runners, lint tools, and review focus areas all adapt to the detected stack.
45
45
 
46
- ### Stack Selection (marketplace plugins, v10.5.0+)
46
+ ### Stack Selection (marketplace plugins)
47
47
 
48
48
  Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketplace. Selecting a stack enables the matching plugin(s) in the current repo's `.claude/settings.json` `enabledPlugins` — no skill copying, no session restart tricks, no directory shuffling. The `ai-common-engineering-toolkit` (accessibility audit, humanizer, Firebase) is always enabled alongside the stack plugin.
49
49
 
@@ -57,7 +57,7 @@ Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketpl
57
57
  /multi-agent:stack all # every stack plugin
58
58
  ```
59
59
 
60
- ### Task Type Detection (v2.0.0)
60
+ ### Task Type Detection
61
61
 
62
62
  Phase 0 Step 9 classifies every task before Phase 1 starts. Deterministic priority order: Figma URL → instruction file path → git diff heuristic → Jira issue type → branch name → description keywords → user prompt (autopilot defaults to `feature`).
63
63
 
@@ -71,26 +71,26 @@ Result persisted to `agent-state.taskType`:
71
71
  | `refactor` | Phase 4 emphasizes behavior preservation; Phase 6 uses `refactor(...)` prefix |
72
72
  | `chore` | Lightweight flow; Phase 6 uses `chore(...)` prefix |
73
73
 
74
- ### SubPhase Convention (v2.0.0)
74
+ ### SubPhase Convention
75
75
 
76
- When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7 — Phase 6.5 was merged into Phase 7 REPORT in v5.1.0) — specialized work slots into its parent phase without inflating the count.
76
+ When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7) — specialized work slots into its parent phase without inflating the count.
77
77
 
78
78
  ## PR & Review Flow
79
79
 
80
- ### Default Reviewers (v1.6.1, hardened in v2.0.0)
80
+ ### Default Reviewers
81
81
 
82
82
  - **Bitbucket**: Fetches via `/rest/default-reviewers/1.0/`. Every PUT must re-send `reviewers`, `fromRef`, `toRef`, `draft` (regression guarded by smoke test).
83
83
  - **GitHub**: Honors `CODEOWNERS` + falls back to `prefs.projects[].githubDefaultReviewers`.
84
84
  - PR author always filtered out (Bitbucket: 409, GitHub: GraphQL error).
85
85
 
86
- ### Draft vs Ready Prompt (v1.7.0)
86
+ ### Draft vs Ready Prompt
87
87
 
88
88
  Phase 6 asks `DRAFT or READY?` before creating the PR and persists the choice in `prefs.projects[].defaultPrMode`.
89
89
 
90
90
  - Bitbucket: `draft: true` flag (DC 8.x+) with `[DRAFT]` title fallback for older servers.
91
91
  - GitHub: `gh pr create --draft` + `gh pr ready` for promotion.
92
92
 
93
- ### `channels` Command (v5.7.0 — replaces `enrich`)
93
+ ### `channels` Command
94
94
 
95
95
  Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post-hoc for fixes closed outside the pipeline:
96
96
 
@@ -102,9 +102,9 @@ Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post
102
102
  /multi-agent:channels ABC-1234 --channels jira,confluence --content test
103
103
  ```
104
104
 
105
- Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the pre-v5.7 `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
105
+ Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the earlier `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
106
106
 
107
- ### Body Preservation Contract (v1.6.1, verified by smoke test in v2.0.0)
107
+ ### Body Preservation Contract (smoke-verified)
108
108
 
109
109
  Every external-system body (PR description, Jira comment, GitHub issue) uses `jq -n --rawfile body body.md '{description: $body}'` → `curl --data-binary @payload.json`. No literal `\n` strings, no HTML entities (`&amp;`, `&lt;`, `&quot;`). UTF-8 preserved end-to-end. `scripts/smoke-add-detail.sh` runs 14 contract assertions.
110
110
 
@@ -137,7 +137,7 @@ The reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers in paralle
137
137
 
138
138
  **Fable Triage** (Phase 4 Step 3, Opus on Copilot CLI): Evaluates merged raw findings against task scope. Classifies each as `accepted` (fix now), `deferred` (out of scope, log for later), or `rejected` (false positive / noise). Only triage-accepted blocking items loop back to Phase 3.
139
139
 
140
- ### Runtime Triage Validator (v2.3.0)
140
+ ### Runtime Triage Validator
141
141
 
142
142
  After triage returns, output is validated by `validate-triage.mjs`:
143
143
 
@@ -148,9 +148,25 @@ After triage returns, output is validated by `validate-triage.mjs`:
148
148
  | **2** | Over-rejection guard tripped — pause for human |
149
149
  | **3** | Contradiction auto-corrected — proceed with corrected output |
150
150
 
151
- ### Bidirectional Approved↔Blocking Auto-Correction (v2.6.1)
151
+ ### Bidirectional Approved↔Blocking Auto-Correction
152
152
 
153
- If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with `if`/`then` constraint in schema v3.0.0.
153
+ If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with an `if`/`then` constraint in the schema itself.
154
+
155
+ ### Verify-by-Test Triage (Phase 4 Step 3.7, opt-in)
156
+
157
+ A triage verdict is a judgment call; a failing repro test is proof. When `prefs.global.verifyByTest.enabled` is on, one verifier agent (default Sonnet) writes a minimal repro test per accepted blocking finding (cap: `maxFindings`=3) and runs only that test. Fails as predicted -> finding confirmed, the repro test becomes the Phase 3 rework RED test. Passes under `evidence-gate.mjs` -> finding downgraded to `deferred`. Compile error / timeout -> `inconclusive`, judgment stands. Timeout-bounded, never blocks. Full spec: `refs/features/verify-by-test.md`.
158
+
159
+ ### Immutable-Test Rule + `test_lines_removed` Signal
160
+
161
+ Existing tests are immutable during a task: deleting, renaming, or weakening an assertion to reach green is a violation (`refs/rules.md`, Phase 3 GREEN step). A test changes only when the task changes the spec it encodes, named in the commit body. Deterministic backstop: `diff-risk-score.mjs` emits `test_lines_removed` (w=3.0) for any test-classified file whose diff removes more lines than it adds.
162
+
163
+ ### Update Check at Run Start
164
+
165
+ Phase 0 Step 0.6, opt-out via `prefs.global.updateCheck`. Once per `ttlHours` window (cached, 3s-bounded curl to the npm registry), the installed version is compared against `dist-tags.latest`. Newer version: interactive modes ask "Update now?" (yes runs the `/multi-agent:update` flow, then the run continues); autopilot logs one line and never asks. `autoUpdate: true` updates silently before the worktree exists. Offline and failed checks are silent - never blocks.
166
+
167
+ ### Structured Handoff Blocks
168
+
169
+ Every phase transition appends a `## Handoff` block (Done / Remaining / Decisions / Open findings / Next) to `agent-log.md` - orchestrator-written from existing state, no LLM call. `/multi-agent:resume` and post-`/compact` re-grounding read the latest handoff first, so long runs re-enter from durable artifacts instead of conversation memory (fresh-context discipline from Anthropic's long-running-agent harness guidance).
154
170
 
155
171
  ### Accessibility Code Review (Phase 4 Step 1.5)
156
172
 
@@ -162,7 +178,7 @@ If changes include UI files, reviewers check for:
162
178
 
163
179
  Pure code analysis — no simulator needed. Device-level audits run in Phase 5 when requested.
164
180
 
165
- ### Status Enforcement (v1.5.0)
181
+ ### Status Enforcement
166
182
 
167
183
  Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation verify step that re-reads the field and retries once on silent `VALIDATION` failures (e.g. stale Projects V2 option IDs after a board rebuild).
168
184
 
@@ -175,49 +191,49 @@ Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation veri
175
191
 
176
192
  ## Testing & Quality
177
193
 
178
- ### Schema-Validated State (v2.0.0)
194
+ ### Schema-Validated State
179
195
 
180
196
  All critical state files are schema-validated at read and write time:
181
197
 
182
198
  - `agent-state.schema.json` — validates `$HOME/.claude/logs/multi-agent/.../agent-state.json`
183
- - `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json` (extended v2.5.0)
184
- - `triage-output.schema.json` — validates triage output (v2.3.0, hardened v2.6.1, `if`/`then` constraint v3.0.0)
199
+ - `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json`
200
+ - `triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
185
201
 
186
- ### Smoke Test Suites (v2.0.0–v2.5.0)
202
+ ### Smoke Test Suites
187
203
 
188
204
  10 suites: add-detail (14 body-preservation assertions), review-triage (validator exit codes), prefs (schema round-trip), state (agent-state lifecycle), metrics (telemetry emission), sync (instruction parity), secret-scan (hook patterns), phase-banner (terminal UI), token-budget (per-phase limits), phase-tracker (progress tracking).
189
205
 
190
- ### Adversarial Eval Fixtures (v2.5.0–v2.6.0)
206
+ ### Adversarial Eval Fixtures
191
207
 
192
208
  10 fixtures that test triage resilience against adversarial reviewer output: over-rejection, hallucinated findings, contradictions, invalid JSON, schema violations, duplicate findings, scope creep, empty results, timeout simulation, and combined edge cases.
193
209
 
194
- ### Sync Parity Check (v2.3.0)
210
+ ### Sync Parity Check
195
211
 
196
212
  Detects drift between Claude Code instructions (`~/.claude/commands/multi-agent.md`), Copilot CLI instructions (`~/.copilot/copilot-instructions.md`), and the repo's pipeline spec files. Reports discrepancies during Phase 0 Init.
197
213
 
198
- ### Token Budget Enforcement (v3.0.0)
214
+ ### Token Budget Enforcement
199
215
 
200
216
  Per-phase token budgets prevent runaway sessions. If a phase exceeds its budget, the pipeline pauses and offers: continue (extend budget), skip phase, or abort. Budgets are configurable in `prefs.global.tokenBudgets`.
201
217
 
202
218
  ## Telemetry & Observability
203
219
 
204
- - **Pipeline Metrics** (v2.3.0): Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
205
- - **Cost Telemetry** (v2.5.0): Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
206
- - **Phase Tracker** (v2.4.0): Cross-CLI visual progress (current phase, elapsed time, iteration count).
207
- - **Phase Banner** (v2.4.0): Terminal UI for phase transitions with Unicode box-drawing characters.
208
- - **Per-task Cost Breakdown in agent-log.md** (v8.3.0): Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
220
+ - **Pipeline Metrics**: Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
221
+ - **Cost Telemetry**: Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
222
+ - **Phase Tracker**: Cross-CLI visual progress (current phase, elapsed time, iteration count).
223
+ - **Phase Banner**: Terminal UI for phase transitions with Unicode box-drawing characters.
224
+ - **Per-task Cost Breakdown in agent-log.md**: Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
209
225
 
210
- ### Diff Risk Scoring (v8.3.0)
226
+ ### Diff Risk Scoring
211
227
 
212
228
  `pipeline/scripts/diff-risk-score.mjs` runs at Phase 4 Step 1.75 — before reviewer dispatch. Heuristic, deterministic, sub-second, no LLM. Top-N risk-ranked files inject into each reviewer's prompt as a `${PRIORITY_FILES}` block; reviewers read those files first but still review the entire diff.
213
229
 
214
- Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
230
+ Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
215
231
 
216
- ### Test Gap Detection (v8.3.0)
232
+ ### Test Gap Detection
217
233
 
218
234
  `pipeline/scripts/test-gap-scan.mjs` runs at Phase 5 Step 0. Walks the diff for newly added public symbols and reports those with no paired test. Stack-specific rules ship for iOS, Android, Python, Node.js. iOS Views and Android `@Composable` symbols default to `important`; other public API additions to `suggestion`. Optional gating via `prefs.testGap.blockingThreshold` — when set, the report becomes a Phase 4 rework finding once `important + blocking` count exceeds the threshold.
219
235
 
220
- ### Triage Memory (v8.3.0)
236
+ ### Triage Memory
221
237
 
222
238
  Per-repo append-only JSONL corpus at `~/.claude/memory/multi-agent/<repo-slug>/triage-corpus.jsonl`. Phase 7 ingests every triage output (idempotent), Phase 1 enriches the analysis with similar past tasks, Phase 4 triage attaches prior-art hits to each raw finding with an explicit bias hedge. Token-overlap recall, zero deps, Node-18-compatible. `/multi-agent:search "<text>" --semantic` routes the query to the corpus instead of agent-log grep. Toggle via `prefs.global.priorArtEnrichment.enabled` (default ON).
223
239
 
@@ -268,10 +284,10 @@ Token registry maps logical names (`jira`, `bitbucket`, `github`, `confluence`)
268
284
 
269
285
  ## Schemas & Validation
270
286
 
271
- - `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle (v2.0.0)
272
- - `pipeline/schemas/prefs.schema.json` — validates preferences (v2.0.0, extended v2.5.0)
273
- - `pipeline/schemas/triage-output.schema.json` — validates triage output (v2.3.0, hardened v2.6.1, `if`/`then` constraint v3.0.0)
287
+ - `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle
288
+ - `pipeline/schemas/prefs.schema.json` — validates preferences
289
+ - `pipeline/schemas/triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
274
290
 
275
- ## File Layout (v1.9.0)
291
+ ## File Layout
276
292
 
277
293
  `commands/multi-agent/{add-detail,setup,help}.md` → slash commands. `commands/multi-agent/refs/**` → guides, phase specs, rules (`refs:` prefix signals "not an action"). Modifier flags and ops stay inline in `multi-agent.md`.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@mmerterden/multi-agent-pipeline",
3
- "version": "10.7.4",
3
+ "version": "10.9.0",
4
4
  "description": "8-phase AI development pipeline with full orchestration on Claude Code and Copilot CLI. Analysis, planning, TDD, CLI-aware parallel review with consensus surfacing + Fable triage, default-FAIL evidence gates, secret + intent guards, per-phase cost ledger, persistent learnings memory, wiki generation, commit automation. Token-preserving uninstall.",
5
5
  "type": "module",
6
6
  "main": "index.js",
@@ -0,0 +1,41 @@
1
+ # Feature: Verify-by-Test Triage (Phase 4 Step 3.7)
2
+
3
+ **Pattern**: reviewer findings are hypotheses; a failing repro test is proof. Adversarial-review research (Refute-or-Promote, 2026) found that plausible-but-wrong findings survive debate rounds but die on a single empirical test. This step converts the highest-stakes verdicts (accepted blocking) from judgment into evidence before the Phase 3 rework loop fires.
4
+
5
+ **Gated by `prefs.global.verifyByTest.enabled`** (default: `false`). When enabled, after triage 3.6 and before Step 4, IF the validated triage output contains at least one `accepted` blocking finding:
6
+
7
+ 1. Dispatch ONE verifier sub-agent for the iteration (model: `verifyByTest.model`, default `sonnet`) - never one dispatch per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings, the diff hunks for their files, and the Phase 1 test conventions.
8
+ 2. Per finding, the verifier writes ONE minimal repro test asserting the correct behavior the finding claims is broken, then runs ONLY that test via the Phase 3 single-test invocation (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`) under `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
9
+ 3. Stamp each processed finding with a `verification` object (triage-output schema v3.2.0) and re-run `validate-triage.mjs` on the mutated triage file under the standard 3.2.1 gate protocol.
10
+ 4. Findings beyond `maxFindings` keep their judgment-only verdict (log `verify_by_test=cap-exceeded`).
11
+ 5. The whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600); on breach or verifier crash, remaining findings keep judgment-only verdicts and the pipeline proceeds. Never blocks.
12
+
13
+ ## Verdict table
14
+
15
+ | Repro test outcome | `verification.result` | Action |
16
+ |---|---|---|
17
+ | Fails as the finding predicts | `confirmed` | Finding stays accepted blocking. Repro test KEPT in the worktree, recorded in `redTests[]`. |
18
+ | Passes (not reproducible) | `not-reproduced` | Downgrade gated by `evidence-gate.mjs --claim test --status passed` on the test log (exit 0 required). Finding moves `accepted[]` -> `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`; repro test file deleted. Evidence-gate failure -> treat as `inconclusive`. |
19
+ | Compile error / timeout / not unit-testable | `inconclusive` | Finding stays accepted blocking (judgment stands). Partial test deleted, cause in `verification.note`. |
20
+
21
+ Downgrades go to `deferred`, never `rejected`: triage judged the issue real, and deferred items surface in the Phase 7 report for a human eye.
22
+
23
+ ## Red-test handoff to Phase 3
24
+
25
+ `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`. On the Phase 3 re-entry, each `redTests[]` entry IS the RED step of the rework TDD cycle for its finding: the reflection prompt instructs "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it." The dev agent must not write a duplicate failing test for those findings.
26
+
27
+ ## Cleanup invariant
28
+
29
+ After Step 3.7, the only uncommitted verifier artifacts are the confirmed repro tests listed in `redTests[]` (committed later with the fix) and logs under `$WORKTREE/.pipeline/` (outside Phase 6 commit scope). `not-reproduced` and `inconclusive` test files are always deleted.
30
+
31
+ ## Telemetry
32
+
33
+ One `review.verify_by_test` metric per iteration: `attempted`, `confirmed`, `downgraded`, `inconclusive`, `duration_ms` (plus `tokens_in/out` when available), forwarded to the tracker like all Phase 4 metrics. Timeout emits `triage=verify-by-test-timeout`.
34
+
35
+ ## Off by default reason
36
+
37
+ Adds one Sonnet call plus up to `maxFindings` single-test runs (and build-lock contention on Xcode projects) per review iteration that has accepted blockers. On clean runs it never fires, but on noisy-reviewer repos it can add minutes per iteration. Flip on for security-critical work, release branches, or repos where reviewer false-positive rate is high.
38
+
39
+ ## Reference
40
+
41
+ Wiring: `refs/phases/phase-4-review.md` Step 3.7. Schema: `pipeline/schemas/triage-output.schema.json` v3.2.0 (`$defs.verification`). Evidence gate: `pipeline/scripts/evidence-gate.mjs`. Prefs: `prefs.global.verifyByTest` in `pipeline/schemas/prefs.schema.json`.
@@ -72,6 +72,16 @@ Verdict: unverified (2 reviewers)
72
72
  ## Test Scenarios (Jira)
73
73
 
74
74
  (User perspective: Precondition -> Steps -> Expected result)
75
+
76
+ ## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
77
+
78
+ (v10.8.0, appended chronologically by the phase-boundary checkpoint - one block per phase transition, the LATEST block is authoritative. Written by the orchestrator from state it already holds; no LLM call, ~15 lines max. `resume.md` Step 3 and post-`/compact` re-grounding read this block FIRST. Format spec: `refs/phases/operations.md`.)
79
+
80
+ - Done: {up to 3 bullets of completed outcomes}
81
+ - Remaining: {ordered list of remaining phases / sub-steps}
82
+ - Decisions: {key decisions later phases depend on}
83
+ - Open findings: {accepted-but-unresolved review findings, or "none"}
84
+ - Next: Phase {N+1} {name}, subStep {token or "start"}
75
85
  ```
76
86
 
77
87
  ### Cost Breakdown - emission contract
@@ -93,8 +93,21 @@ This keeps orchestrator context lean and enables programmatic routing.
93
93
 
94
94
  **Proactive compaction + phase-boundary checkpoint**: the orchestrator follows ~2,500 lines of phase prose in one session; once context fills, it starts dropping steps - the single biggest cause of "it got stuck / skipped a step." Two defenses, both required on full-pipeline runs:
95
95
 
96
- - *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a one-line progress summary to `agent-log.md`. The next phase reads state + log, not the back-conversation - so a transition is a clean re-entry point even if context is later compacted.
97
- - *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit - it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` to re-ground.
96
+ - *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a structured handoff block to `agent-log.md` (format below). The next phase reads state + log, not the back-conversation - so a transition is a clean re-entry point even if context is later compacted.
97
+ - *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit - it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` AND the latest `## Handoff` block in `agent-log.md` to re-ground.
98
+
99
+ **Handoff block (v10.8.0)**: the structured artifact the phase-boundary checkpoint appends to `agent-log.md`. Written by the orchestrator from state it already holds - no agent dispatch, no extra LLM call. Cap at ~15 lines; the latest block is authoritative (earlier ones are history). This is the fresh-context re-entry contract: a resume or post-compaction session rebuilds working context from the latest handoff + `agent-state.json` + git log, never from conversation memory.
100
+
101
+ ```markdown
102
+ ## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
103
+ - Done: {up to 3 bullets of completed outcomes, e.g. "plan approved, 4 tasks", "build green, 12 tests added"}
104
+ - Remaining: {ordered list of remaining phases / sub-steps}
105
+ - Decisions: {key decisions later phases depend on, e.g. "used existing KeychainStore, no new wrapper"}
106
+ - Open findings: {accepted-but-unresolved review findings, or "none"}
107
+ - Next: Phase {N+1} {name}, subStep {token or "start"}
108
+ ```
109
+
110
+ Full `agent-log.md` shape: `refs/phases/log-format.md`. Resume-side consumption: `resume.md` Step 3 reads the latest handoff FIRST, then falls back to per-phase findings for logs written before v10.8.
98
111
 
99
112
  **Sub-step checkpoints (long phases)**: Phase 3 (dev/TDD cycles) and Phase 7 (report/channels) can run many minutes; a crash mid-phase loses everything since the last phase boundary and forces a full phase re-run on resume. For these phases, also record `state.phases[<n>].subStep` (a short token: `red`, `green`, `build`, `pr-opened`, `confluence-synced`, ...) and the `files[]` written so far after each meaningful unit of work. On resume, re-enter the phase but skip units whose `subStep` is already recorded and whose `files[]` exist in the worktree - re-do only the unfinished tail, never the whole phase.
100
113
 
@@ -98,6 +98,15 @@ Log the resolved tier in the agent log:
98
98
 
99
99
  Full chain definition, REST endpoints, URL parsing, canonical-component contract: `pipeline/rules/figma-pipeline.md` "MUST: Figma access - 3-tier fallback chain (BLOCKING, pipeline-wide)". Do not duplicate it here.
100
100
 
101
+ #### Step 0.6 - Update check (advisory, opt-out via `prefs.global.updateCheck.enabled`)
102
+
103
+ Run `bash pipeline/scripts/update-check.sh` (cached per `updateCheck.ttlHours`, default 24h; 3s-bounded; every failure path silent). Empty output → continue. Output `<local>|<latest>` means a newer version exists:
104
+
105
+ - **Interactive**: log `→ update available: v<local> -> v<latest>`, ask ONE AskUserQuestion - **Update now** (recommended) / **Continue without updating**. On *Update now* (or `updateCheck.autoUpdate: true`, which skips the question): run the `/multi-agent:update` flow, log `→ updated to v<latest>`, continue the run (note in the log: already-loaded phase docs finish this run on the old version; full effect next session). On *Continue*: no re-ask until the TTL expires.
106
+ - **Autopilot**: never ask (zero-interaction contract). Log `→ update available: ... (log-only; run /multi-agent:update)` and continue - unless `autoUpdate: true`, then update silently first.
107
+
108
+ Advisory - NEVER blocks. Must run BEFORE Step 6 (worktree creation) so an accepted update cannot mutate `~/.claude` under a mid-phase run.
109
+
101
110
  #### Step 1 - Parse Input
102
111
 
103
112
  **Branch from input**: If the user provided a branch name after the issue reference (space-separated), store as `baseBranch` and skip Step 3. Otherwise Step 3 asks interactively.
@@ -49,6 +49,16 @@ When Phase 0 Step 7 classified the task as `component`, Phase 3 delegates the en
49
49
 
50
50
  For non-component taskTypes (`bugfix`, `feature`, `refactor`, `chore`), continue with the standard TDD section below.
51
51
 
52
+ #### Design fidelity contract (BLOCKING when the task carries a Figma reference)
53
+
54
+ Applies to EVERY task whose analysis doc carries design ground truth (Section 6 rows, node IDs, screenshots) - component-dispatch AND TDD-path UI work alike. MCP stays forbidden here (analysis-only rule); the analysis doc IS the design:
55
+
56
+ 1. **1:1, not interpretation.** The captured frames are the reference for every UI line. Layout, variant states, copy, and colours come from the analysis doc's measured values - never eyeballed, never "close enough".
57
+ 2. **Code Connect decides the component.** A mapped component (`CodeConnectSnippet` / repo `*.figma.swift` / `*.figma.kt` row) MUST be used verbatim - sound-alike substitutes are forbidden; a missing modifier is added to the component's `+Modifiers` extension, never forked. No mapping exists -> write a NEW component per the Configuration/View/Modifiers architecture; never inline ad-hoc UI into the consumer screen.
58
+ 3. **Inter-component spacing is part of the design.** Gaps, paddings, and alignment BETWEEN components must match the frame's measured values, mapped to spacing tokens (`Spacing*`) - never invented numbers. A spacing/layout deviation is a `review_blocking` finding in Phase 4 Step 1.8, not a nitpick.
59
+
60
+ Missing design data (variant, padding, copy) → HALT and instruct the user to re-run `/multi-agent:analysis`; never guess and never call Figma MCP from this phase.
61
+
52
62
  #### Re-entry from Phase 4 triage
53
63
 
54
64
  Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings - Fable triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
@@ -94,6 +104,7 @@ For each task (respecting dependency order):
94
104
  3. **TDD cycle** (Launch Agent with `model: "sonnet"`):
95
105
 
96
106
  **RED - Write ONE failing test first:**
107
+ - **Rework re-entry**: if `state.reviewIterations[-1].verifyByTest.redTests[]` exists (Phase 4 Step 3.7 ran), those failing repro tests ARE the RED step for their findings - make them green, do not write a duplicate failing test and do not delete or weaken them. See `refs/features/verify-by-test.md`.
97
108
  - Test framework: use whatever the project already uses. Detect from existing test files:
98
109
  - `@Test` / `#expect` → Swift Testing
99
110
  - `XCTestCase` / `XCTAssert` → XCTest
@@ -115,6 +126,7 @@ For each task (respecting dependency order):
115
126
 
116
127
  **GREEN - Minimal code to pass:**
117
128
  - Smallest change that makes the test green - no extras
129
+ - **Existing tests are immutable**: deleting, renaming, skipping, or weakening an existing assertion to reach green is a violation. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. (Deterministic backstop: the `test_lines_removed` signal in Phase 4 Step 1.75 flags test files that shrink.)
118
130
  - Run the same test again → must PASS
119
131
  - Run full test suite → no regressions:
120
132
  ```bash
@@ -127,6 +127,7 @@ echo "$RISK_JSON" | node pipeline/scripts/validate-diff-risk.mjs - >/dev/null 2>
127
127
  | `migration` | 4.0 | path matches DB schema / migration glob (.sql, /Migrations/, alembic/, prisma/migrations) |
128
128
  | `public_api` | 2.0 | added line declares `public func/class/struct/enum`, `@objc`, `open fun`, `@Composable`, or `export function/class/...` |
129
129
  | `no_test_change` | 2.5 | source file changed, no paired test file (`{Base}Tests.{ext}`, `{Base}.test.{ext}`, etc.) appears in the diff |
130
+ | `test_lines_removed` | 3.0 | test-classified file whose diff removes more lines than it adds (immutable-test backstop, see `refs/rules.md`) |
130
131
  | `complexity_delta` | 1.5 | added control-flow tokens (`if`/`guard`/`switch`/`while`/`for`/ternary/`&&`/`\|\|`) |
131
132
  | `ui_critical` | 1.5 | path matches `*View.swift`, `*Screen.kt`, `*Configuration.swift`, etc. |
132
133
  | `loc_changed` | 1.0 | base sensitivity to total `+/-` lines |
@@ -204,6 +205,8 @@ Skills are injected into reviewer prompt context - the reviewer uses them as r
204
205
 
205
206
  **iOS/Swift - interaction & convention skills (conditional).** When the diff touches SwiftUI UI files (`*View.swift`, `*Screen.swift`, `*Configuration.swift`, `*+Modifiers.swift`), additionally inject the relevant `figma-common` convention skills as reference for the iOS reviewers: `figma-navigation`, `figma-overlays`, `figma-bottom-sheets` (interaction: emit-intent vs self-route/self-present; native-SwiftUI-first vs the project's `ui.*` custom system), and the enriched `figma-to-swiftui` accessibility rules (minimalism). These back the Step 1.5 iOS convention checks. Generic across SwiftUI projects - not tied to any one app. Omit when the diff has no SwiftUI UI changes (keeps the reviewer prompt lean).
206
207
 
208
+ **Module review guides (conditional, all stacks).** A module in the repo may carry its own CLAUDE guide - a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths: for each changed file, walk its directory chain up to the repo root and collect `CLAUDE.md`, `*-CLAUDE.md`, and `AGENTS.md` files (root-level ones excluded - the host CLI already loads those). Dedupe, cap at 5 (log any dropped). Inject into every reviewer prompt with the directive: read each guide before reviewing and apply its rules/checklist to the changed files under its directory - a guide governs only its own subtree, and guide violations are findings triaged by severity like any other. No guides found → no-op, prompt stays lean. Same contract as `/multi-agent:review` Step 2b.
209
+
207
210
  **Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry Reviewer 1 once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
208
211
 
209
212
  #### Output contract - reviewer step
@@ -252,6 +255,8 @@ Exit 0 = valid. Exit 2 = contradiction (approved=true with blocking findings) -
252
255
 
253
256
  **CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag - reviewers hallucinate, misread scope, or repeat each other. Run Fable triage (Opus on Copilot CLI) to evaluate merged findings against task scope.
254
257
 
258
+ Opt-in empirical layer: when `prefs.global.verifyByTest.enabled` is `true`, accepted blocking findings additionally go through Step 3.7 (verify-by-test), which tries to reproduce each one with a minimal failing test before the Phase 3 rework loop fires. Full wiring: `refs/features/verify-by-test.md`.
259
+
255
260
  ##### 3.1 Short-circuit: no findings
256
261
 
257
262
  If merged findings `length === 0`, **skip triage**: write empty result `{"accepted": [], "deferred": [], "rejected": [], "approved": true}`, log, proceed to Phase 5.
@@ -383,13 +388,19 @@ After the triage verdict is computed, populate `triage.consensus`:
383
388
 
384
389
  **Surfacing (Step 4 + Phase 7):** When `verdict` is `split` or `unverified`, the disagreements are shown to the user at the Step 4 checkpoint (interactive modes) and always written to the Phase 7 report. Autopilot does not block on `unverified` (it logs `review.consensus=unverified` and proceeds), matching the maturity-check model - but the report records it so a human review can catch it. This is additive: omitting `consensus` is valid and means "not computed."
385
390
 
391
+ ##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
392
+
393
+ A triage verdict is judgment; a failing repro test is proof. Runs only when `prefs.global.verifyByTest.enabled` is `true` AND `accepted` contains a `blocking` finding; otherwise skip silently. **Full contract (verdict table, cleanup invariant, prompts): `refs/features/verify-by-test.md` - read it before executing this step.**
394
+
395
+ Compressed flow: dispatch ONE verifier agent (model `verifyByTest.model`, default `sonnet`) for up to `maxFindings` (default 3) accepted blocking findings. Per finding it writes ONE minimal repro test and runs ONLY that test (Phase 3 single-test invocation, build lock, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`). Outcomes: test FAILS as predicted -> `confirmed`, finding stays blocking and the test is KEPT in `redTests[]` as the Phase 3 rework RED test; test PASSES -> `not-reproduced` ONLY if `evidence-gate.mjs --claim test --status passed` exits 0 on the log, finding moves to `deferred[]`, test deleted; compile error / timeout / not unit-testable -> `inconclusive`, judgment verdict stands. Stamp findings with `verification` (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = {attempted, confirmed, downgraded, inconclusive, redTests[]}`, recompute `approved`, re-run `validate-triage.mjs` under the 3.2.1 gate. Whole step bounded by `stepTimeoutSec` (default 600); on breach or crash remaining findings keep judgment verdicts - never blocks. Telemetry per 3.4: `review.verify_by_test attempted= confirmed= downgraded= inconclusive= duration_ms=`.
396
+
386
397
  #### Step 4 - Consensus + Action (triage-driven)
387
398
 
388
399
  If `triage.consensus.verdict` is `split` or `unverified`, surface `consensus.disagreements[]` to the user before acting: interactive modes show the split and ask whether to treat the unverified agreement as a pass (picker-contract: Trust / Re-review / Treat-as-blocking); autopilot logs `review.consensus=<verdict>` and proceeds on the triage verdict. Never silently average a split into a pass.
389
400
 
390
401
  Act **only on triage.accepted**:
391
402
 
392
- - **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items)
403
+ - **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items). When Step 3.7 ran and `state.reviewIterations[-1].verifyByTest.redTests[]` is non-empty, the reflection prompt cites each red test: "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it."
393
404
  - **accepted.important** → fix and re-review
394
405
  - **accepted.suggestion** → apply if reasonable
395
406
  - **deferred** → append to Phase 7 report as "follow-up items" (do not block)
@@ -52,6 +52,7 @@ This is the single source of truth. When a contributor or model is unsure where
52
52
  - **NEVER** commit without passing build (all gates in Phase 4 Step 1 must be green).
53
53
  - **NEVER** commit without passing review (at least one AI reviewer must return `approved: true` with no blocking findings).
54
54
  - **NEVER** skip tests. Every public method, every error path, every edge case.
55
+ - **NEVER** delete, rename, or weaken an existing test to get a green run. Existing tests are immutable during a task. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. Deterministic backstop: the `test_lines_removed` diff-risk signal (Phase 4 Step 1.75) flags test files that shrink.
55
56
  - **Follow existing code style and conventions.** Read neighbor files before writing new ones - match naming, structure, import order.
56
57
  - **Use design tokens, no magic numbers.** `16` → `.Spacing.spacing16`. `#E31837` → `Color.Primary.primary`. `.font(.system(size: 14))` → `.typographyStyle(.body1)`.
57
58
  - **Design system primitives before custom views.** Before writing a new SwiftUI / Compose / React / View / Configuration triplet inside a domain or feature module, grep the shared component library (project-specific path, e.g. `Common/UIComponents/`, `core-ui/`, `packages/ui/`) for an existing primitive that solves the same problem. New domain-level wrappers, custom modals, custom buttons, or hand-rolled toasts are forbidden when the design system already has an equivalent. If the primitive exists but lacks a modifier (placeholder, size, error binding), **add the modifier to the primitive** in its `+Modifiers` extension - do not fork the primitive into the consumer domain. The Figma `CodeConnectSnippet` is the authoritative pointer to which primitive to use.