@mmerterden/multi-agent-pipeline 10.7.3 → 10.8.0
- package/CHANGELOG.md +73 -2
- package/docs/adr/0001-three-model-triage.md +2 -2
- package/docs/adr/0007-multi-tool-adapter-framework.md +1 -1
- package/docs/adr/README.md +2 -2
- package/docs/architecture.md +14 -14
- package/docs/features.md +35 -22
- package/docs/performance.md +3 -3
- package/index.js +3 -7
- package/install/templates/copilot-instructions.md +2 -2
- package/package.json +2 -5
- package/pipeline/agents/dev-critic.md +1 -1
- package/pipeline/claude-md-template.md +1 -1
- package/pipeline/commands/multi-agent/dev-autopilot.md +1 -1
- package/pipeline/commands/multi-agent/finish.md +2 -2
- package/pipeline/commands/multi-agent/help.md +12 -12
- package/pipeline/commands/multi-agent/local.md +1 -1
- package/pipeline/commands/multi-agent/refs/features/dev-critic.md +1 -1
- package/pipeline/commands/multi-agent/refs/features/model-fallback.md +7 -3
- package/pipeline/commands/multi-agent/refs/features/verify-by-test.md +41 -0
- package/pipeline/commands/multi-agent/refs/knowledge.md +1 -1
- package/pipeline/commands/multi-agent/refs/phases/log-format.md +11 -1
- package/pipeline/commands/multi-agent/refs/phases/modes.md +1 -1
- package/pipeline/commands/multi-agent/refs/phases/operations.md +15 -2
- package/pipeline/commands/multi-agent/refs/phases/phase-1-analysis.md +2 -2
- package/pipeline/commands/multi-agent/refs/phases/phase-2-planning.md +2 -2
- package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md +3 -1
- package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md +51 -19
- package/pipeline/commands/multi-agent/refs/progress-contract.md +1 -1
- package/pipeline/commands/multi-agent/refs/rules.md +1 -0
- package/pipeline/commands/multi-agent/refs/tracker-contract.md +1 -2
- package/pipeline/commands/multi-agent/resume.md +7 -4
- package/pipeline/commands/multi-agent/review.md +41 -9
- package/pipeline/commands/multi-agent/sync.md +3 -3
- package/pipeline/commands/multi-agent.md +7 -7
- package/pipeline/schemas/agent-state.schema.json +1 -1
- package/pipeline/schemas/diff-risk.schema.json +5 -4
- package/pipeline/schemas/prefs.schema.json +38 -3
- package/pipeline/schemas/reviewer-output.schema.json +1 -1
- package/pipeline/schemas/triage-output.schema.json +37 -6
- package/pipeline/scripts/README.md +3 -2
- package/pipeline/scripts/cost-budget-check.mjs +1 -1
- package/pipeline/scripts/cost-table.json +7 -0
- package/pipeline/scripts/diff-risk-score.mjs +11 -1
- package/pipeline/scripts/fixtures/diff-risk-test-removal.diff +40 -0
- package/pipeline/scripts/fixtures/install-layout.tsv +5 -5
- package/pipeline/scripts/smoke-diff-risk.sh +30 -1
- package/pipeline/scripts/smoke-handoff-contract.sh +92 -0
- package/pipeline/scripts/smoke-verify-by-test.sh +148 -0
- package/pipeline/scripts/uninstall.mjs +53 -57
- package/pipeline/scripts/validate-diff-risk.mjs +2 -1
- package/pipeline/scripts/validate-triage.mjs +31 -2
- package/pipeline/skills/shared/core/multi-agent/SKILL.md +11 -11
- package/pipeline/skills/shared/core/multi-agent-dev-autopilot/SKILL.md +1 -1
- package/pipeline/skills/shared/core/multi-agent-finish/SKILL.md +1 -1
- package/pipeline/skills/shared/core/multi-agent-help/SKILL.md +8 -8
- package/pipeline/skills/shared/core/multi-agent-review/SKILL.md +5 -5
- package/pipeline/skills/shared/core/multi-agent-sync/SKILL.md +7 -5
- package/pipeline/scripts/smoke-readme-counts.sh +0 -120
|
@@ -0,0 +1,41 @@
|
|
|
1
|
+
# Feature: Verify-by-Test Triage (Phase 4 Step 3.7)
|
|
2
|
+
|
|
3
|
+
**Pattern**: reviewer findings are hypotheses; a failing repro test is proof. Adversarial-review research (Refute-or-Promote, 2026) found that plausible-but-wrong findings survive debate rounds but die on a single empirical test. This step converts the highest-stakes verdicts (accepted blocking) from judgment into evidence before the Phase 3 rework loop fires.
|
|
4
|
+
|
|
5
|
+
**Gated by `prefs.global.verifyByTest.enabled`** (default: `false`). When enabled, after triage 3.6 and before Step 4, IF the validated triage output contains at least one `accepted` blocking finding:
|
|
6
|
+
|
|
7
|
+
1. Dispatch ONE verifier sub-agent for the iteration (model: `verifyByTest.model`, default `sonnet`) - never one dispatch per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings, the diff hunks for their files, and the Phase 1 test conventions.
|
|
8
|
+
2. Per finding, the verifier writes ONE minimal repro test asserting the correct behavior the finding claims is broken, then runs ONLY that test via the Phase 3 single-test invocation (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`) under `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
|
|
9
|
+
3. Stamp each processed finding with a `verification` object (triage-output schema v3.2.0) and re-run `validate-triage.mjs` on the mutated triage file under the standard 3.2.1 gate protocol.
|
|
10
|
+
4. Findings beyond `maxFindings` keep their judgment-only verdict (log `verify_by_test=cap-exceeded`).
|
|
11
|
+
5. The whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600); on breach or verifier crash, remaining findings keep judgment-only verdicts and the pipeline proceeds. Never blocks.
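
Editorial sketch (not part of the package): the cap in steps 1 and 4 can be read as a simple partition of the accepted blocking findings. The function name `selectForVerification` and the choice to keep input order are assumptions for illustration; only the `maxFindings` default of 3 and the `cap-exceeded` log token come from the text above.

```javascript
// Hypothetical helper: split accepted blocking findings into the batch handed
// to the single verifier dispatch and the overflow that keeps its
// judgment-only verdict.
function selectForVerification(accepted, maxFindings = 3) {
  const blocking = accepted.filter((f) => f.severity === "blocking");
  return {
    toVerify: blocking.slice(0, maxFindings),    // one dispatch covers all of these
    capExceeded: blocking.slice(maxFindings),    // logged as verify_by_test=cap-exceeded
  };
}

// Example: five accepted findings, four of them blocking.
const sampleAccepted = [
  { severity: "blocking", issue: "A" },
  { severity: "suggestion", issue: "B" },
  { severity: "blocking", issue: "C" },
  { severity: "blocking", issue: "D" },
  { severity: "blocking", issue: "E" },
];
const batch = selectForVerification(sampleAccepted);
console.log(batch.toVerify.map((f) => f.issue), batch.capExceeded.map((f) => f.issue));
```

Non-blocking findings never enter the batch, so they are neither verified nor counted against the cap.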
|
|
12
|
+
|
|
13
|
+
## Verdict table
|
|
14
|
+
|
|
15
|
+
| Repro test outcome | `verification.result` | Action |
|
|
16
|
+
|---|---|---|
|
|
17
|
+
| Fails as the finding predicts | `confirmed` | Finding stays accepted blocking. Repro test KEPT in the worktree, recorded in `redTests[]`. |
|
|
18
|
+
| Passes (not reproducible) | `not-reproduced` | Downgrade gated by `evidence-gate.mjs --claim test --status passed` on the test log (exit 0 required). Finding moves `accepted[]` -> `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`; repro test file deleted. Evidence-gate failure -> treat as `inconclusive`. |
|
|
19
|
+
| Compile error / timeout / not unit-testable | `inconclusive` | Finding stays accepted blocking (judgment stands). Partial test deleted, cause in `verification.note`. |
|
|
20
|
+
|
|
21
|
+
Downgrades go to `deferred`, never `rejected`: triage judged the issue real, and deferred items surface in the Phase 7 report for a human eye.
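
Editorial sketch (not part of the package): the verdict table above can be expressed as one small decision function. The outcome tokens (`failed-as-predicted`, `passed`) and the function name `classifyRepro` are hypothetical; the `verification.result` values, the target buckets, and the rule that an evidence-gate failure degrades to `inconclusive` are taken from the table.

```javascript
// Hypothetical mapping from a repro-test outcome to the verdict-table row.
// evidenceGateExitCode is the exit status of the evidence gate run on the test
// log; it only matters when the repro test passed.
function classifyRepro(outcome, evidenceGateExitCode) {
  if (outcome === "failed-as-predicted") {
    // Finding stays accepted blocking; the test is kept and recorded in redTests[].
    return { result: "confirmed", bucket: "accepted", keepTest: true };
  }
  if (outcome === "passed" && evidenceGateExitCode === 0) {
    // Downgrade goes to deferred, never rejected; the repro test is deleted.
    return { result: "not-reproduced", bucket: "deferred", keepTest: false };
  }
  // Compile error, timeout, not unit-testable, or a pass the evidence gate
  // could not confirm: judgment stands, the partial test is deleted.
  return { result: "inconclusive", bucket: "accepted", keepTest: false };
}

console.log(classifyRepro("passed", 0).bucket);
console.log(classifyRepro("passed", 1).result);
```

Note that only one path ever moves a finding out of `accepted[]`, which is what keeps the step from silently dropping real issues.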
|
|
22
|
+
|
|
23
|
+
## Red-test handoff to Phase 3
|
|
24
|
+
|
|
25
|
+
`state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`. On the Phase 3 re-entry, each `redTests[]` entry IS the RED step of the rework TDD cycle for its finding: the reflection prompt instructs "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it." The dev agent must not write a duplicate failing test for those findings.
|
|
26
|
+
|
|
27
|
+
## Cleanup invariant
|
|
28
|
+
|
|
29
|
+
After Step 3.7, the only uncommitted verifier artifacts are the confirmed repro tests listed in `redTests[]` (committed later with the fix) and logs under `$WORKTREE/.pipeline/` (outside Phase 6 commit scope). `not-reproduced` and `inconclusive` test files are always deleted.
|
|
30
|
+
|
|
31
|
+
## Telemetry
|
|
32
|
+
|
|
33
|
+
One `review.verify_by_test` metric per iteration: `attempted`, `confirmed`, `downgraded`, `inconclusive`, `duration_ms` (plus `tokens_in/out` when available), forwarded to the tracker like all Phase 4 metrics. Timeout emits `triage=verify-by-test-timeout`.
|
|
34
|
+
|
|
35
|
+
## Off by default reason
|
|
36
|
+
|
|
37
|
+
Adds one Sonnet call plus up to `maxFindings` single-test runs (and build-lock contention on Xcode projects) per review iteration that has accepted blockers. On clean runs it never fires, but on noisy-reviewer repos it can add minutes per iteration. Flip on for security-critical work, release branches, or repos where reviewer false-positive rate is high.
|
|
38
|
+
|
|
39
|
+
## Reference
|
|
40
|
+
|
|
41
|
+
Wiring: `refs/phases/phase-4-review.md` Step 3.7. Schema: `pipeline/schemas/triage-output.schema.json` v3.2.0 (`$defs.verification`). Evidence gate: `pipeline/scripts/evidence-gate.mjs`. Prefs: `prefs.global.verifyByTest` in `pipeline/schemas/prefs.schema.json`.
|
|
@@ -88,7 +88,7 @@ Phase 4: Review (CLI-aware parallel + triage)
|
|
|
88
88
|
|-- Reviewer (gpt-5.4) -> edge cases, cross-provider diversity
|
|
89
89
|
+-- Reviewer (sonnet) -> quality + correctness + naming
|
|
90
90
|
Context for ALL: Phase 1 files + Phase 2 decisions + Phase 3 diff
|
|
91
|
-
Then: single
|
|
91
|
+
Then: single Fable triage pass filters false-positives
|
|
92
92
|
```
|
|
93
93
|
|
|
94
94
|
### Context-Aware Agent Prompting
|
|
@@ -72,6 +72,16 @@ Verdict: unverified (2 reviewers)
|
|
|
72
72
|
## Test Scenarios (Jira)
|
|
73
73
|
|
|
74
74
|
(User perspective: Precondition -> Steps -> Expected result)
|
|
75
|
+
|
|
76
|
+
## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
|
|
77
|
+
|
|
78
|
+
(v10.8.0, appended chronologically by the phase-boundary checkpoint - one block per phase transition, the LATEST block is authoritative. Written by the orchestrator from state it already holds; no LLM call, ~15 lines max. `resume.md` Step 3 and post-`/compact` re-grounding read this block FIRST. Format spec: `refs/phases/operations.md`.)
|
|
79
|
+
|
|
80
|
+
- Done: {up to 3 bullets of completed outcomes}
|
|
81
|
+
- Remaining: {ordered list of remaining phases / sub-steps}
|
|
82
|
+
- Decisions: {key decisions later phases depend on}
|
|
83
|
+
- Open findings: {accepted-but-unresolved review findings, or "none"}
|
|
84
|
+
- Next: Phase {N+1} {name}, subStep {token or "start"}
|
|
75
85
|
```
|
|
76
86
|
|
|
77
87
|
### Cost Breakdown - emission contract
|
|
@@ -91,7 +101,7 @@ Every phase that dispatches a billable LLM agent MUST forward its token totals t
|
|
|
91
101
|
|
|
92
102
|
```bash
|
|
93
103
|
pipeline/scripts/log-metric.sh "$TASK_ID" <phase-id> <event> \
|
|
94
|
-
model=<opus|sonnet|haiku|gpt-5.4> tokens_in=$IN tokens_out=$OUT tokens_cached=$CACHED duration_ms=$DUR
|
|
104
|
+
model=<fable|opus|sonnet|haiku|gpt-5.4> tokens_in=$IN tokens_out=$OUT tokens_cached=$CACHED duration_ms=$DUR
|
|
95
105
|
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" <phase-id> tokens \
|
|
96
106
|
model=<...> tokens_in=$IN tokens_out=$OUT tokens_cached=$CACHED
|
|
97
107
|
```
|
|
@@ -85,7 +85,7 @@ Phase 0: Init -> Phase 3: Dev (self-contained) -> Phase 5: Test -> Phase 6: Comm
|
|
|
85
85
|
| Phase 1 (Analysis) | Parallel Explore agents | **SKIP** | **SKIP** | **SKIP** | **SKIP** |
|
|
86
86
|
| Phase 2 (Planning) | TaskCreate + architecture review + **Plan Approval Gate** (clarification max 2 rounds + approval loop) | **SKIP** (plan gate not applicable - fast path) | **SKIP** | **SKIP** | **SKIP** |
|
|
87
87
|
| Phase 3 (Dev) | Follows Phase 2 plan, TDD cycle (Sonnet) | **Self-contained** (Opus): agent scans relevant files, implements with TDD, builds | Same as `--dev` | Same as `--dev` | Same as `--dev` |
|
|
88
|
-
| Phase 4 (Review) | Parallel review +
|
|
88
|
+
| Phase 4 (Review) | Parallel review + Fable triage (Claude: 2-model / Copilot: 3-model) | **SKIP** | **SKIP** | **SKIP** | **SKIP** |
|
|
89
89
|
| Phase 5 (User Test) | Interactive prompt (ask user to test) | **Interactive prompt** (ask user to test) | **Skip** (autopilot suppresses interactive prompts) | **Interactive prompt** (same as `--dev`) | **Skip** (same as `--dev autopilot`) |
|
|
90
90
|
| Phase 6 (Commit) | Commit + PR | Same - still asks (unless autopilot) | Auto commit + push + PR | Same as `--dev` | Auto commit + push + PR |
|
|
91
91
|
| Phase 7 (Report) | Full report + channels multi-select | Simplified - no review/analysis sections, BUT **channels menu still pauses** | Channels menu **STILL PAUSES** | Same as `--dev` | Channels menu **STILL PAUSES** |
|
|
@@ -93,8 +93,21 @@ This keeps orchestrator context lean and enables programmatic routing.
|
|
|
93
93
|
|
|
94
94
|
**Proactive compaction + phase-boundary checkpoint**: the orchestrator follows ~2,500 lines of phase prose in one session; once context fills, it starts dropping steps - the single biggest cause of "it got stuck / skipped a step." Two defenses, both required on full-pipeline runs:
|
|
95
95
|
|
|
96
|
-
- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a
|
|
97
|
-
- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit - it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` to re-ground.
|
|
96
|
+
- *Phase-boundary checkpoint.* At every phase transition, before loading the next phase doc, write the durable state (`agent-state.json` phase status + `files[]` + `retryCount`) and append a structured handoff block to `agent-log.md` (format below). The next phase reads state + log, not the back-conversation - so a transition is a clean re-entry point even if context is later compacted.
|
|
97
|
+
- *Compaction trigger.* If conversation context exceeds ~50%, run `/compact` preserving "modified files, plan, open review findings, current phase + sub-step" before continuing. Don't wait for auto-compaction near the limit - it triggers exactly when context is worst and is lossy. After compaction, re-read `agent-state.json` AND the latest `## Handoff` block in `agent-log.md` to re-ground.
|
|
98
|
+
|
|
99
|
+
**Handoff block (v10.8.0)**: the structured artifact the phase-boundary checkpoint appends to `agent-log.md`. Written by the orchestrator from state it already holds - no agent dispatch, no extra LLM call. Cap at ~15 lines; the latest block is authoritative (earlier ones are history). This is the fresh-context re-entry contract: a resume or post-compaction session rebuilds working context from the latest handoff + `agent-state.json` + git log, never from conversation memory.
|
|
100
|
+
|
|
101
|
+
```markdown
|
|
102
|
+
## Handoff - end of Phase {N} ({name}) - {ISO timestamp}
|
|
103
|
+
- Done: {up to 3 bullets of completed outcomes, e.g. "plan approved, 4 tasks", "build green, 12 tests added"}
|
|
104
|
+
- Remaining: {ordered list of remaining phases / sub-steps}
|
|
105
|
+
- Decisions: {key decisions later phases depend on, e.g. "used existing KeychainStore, no new wrapper"}
|
|
106
|
+
- Open findings: {accepted-but-unresolved review findings, or "none"}
|
|
107
|
+
- Next: Phase {N+1} {name}, subStep {token or "start"}
|
|
108
|
+
```
|
|
109
|
+
|
|
110
|
+
Full `agent-log.md` shape: `refs/phases/log-format.md`. Resume-side consumption: `resume.md` Step 3 reads the latest handoff FIRST, then falls back to per-phase findings for logs written before v10.8.
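
Editorial sketch (not part of the package): one way a resume step could pick the authoritative handoff out of `agent-log.md`. The function name `latestHandoff` and the rule that a block ends at the next `## ` heading are assumptions; the heading prefix and the "latest block wins" rule come from the text above.

```javascript
// Hypothetical reader: return the text of the last "## Handoff - ..." block,
// or null when the log predates handoff blocks (the caller would then fall
// back to per-phase findings).
function latestHandoff(logText) {
  const lines = logText.split("\n");
  let start = -1;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("## Handoff - ")) start = i; // keep the last match
  }
  if (start === -1) return null;
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) { end = i; break; } // assumed block boundary
  }
  return lines.slice(start, end).join("\n").trim();
}

const sampleLog = [
  "## Handoff - end of Phase 2 (Planning) - 2026-01-01T10:00:00Z",
  "- Done: plan approved, 4 tasks",
  "- Next: Phase 3 Dev, subStep start",
  "",
  "## Handoff - end of Phase 3 (Dev) - 2026-01-01T11:00:00Z",
  "- Done: build green, 12 tests added",
  "- Next: Phase 4 Review, subStep start",
].join("\n");
console.log(latestHandoff(sampleLog));
```

Earlier blocks stay in the log as history; the reader simply ignores them.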
|
|
98
111
|
|
|
99
112
|
**Sub-step checkpoints (long phases)**: Phase 3 (dev/TDD cycles) and Phase 7 (report/channels) can run many minutes; a crash mid-phase loses everything since the last phase boundary and forces a full phase re-run on resume. For these phases, also record `state.phases[<n>].subStep` (a short token: `red`, `green`, `build`, `pr-opened`, `confluence-synced`, ...) and the `files[]` written so far after each meaningful unit of work. On resume, re-enter the phase but skip units whose `subStep` is already recorded and whose `files[]` exist in the worktree - re-do only the unfinished tail, never the whole phase.
|
|
100
113
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
|
-
### Phase 1: Analysis (
|
|
1
|
+
### Phase 1: Analysis (Fable)
|
|
2
2
|
|
|
3
|
-
> **TLDR** -
|
|
3
|
+
> **TLDR** - Fable-driven codebase exploration (Opus when the fallback ladder engages). Detects if the issue is already fixed (git blame, closed PRs), then launches parallel Explore sub-agents to map the affected code paths. Outputs: impact analysis, stack detection (auto-selects platform guide), relevant files, risk areas. Feeds Phase 2 planning.
|
|
4
4
|
|
|
5
5
|
<!-- progress-contract: applied -->
|
|
6
6
|
Progress emission per `refs/progress-contract.md` - lines for each Explore dispatch, each finish, analyst synthesis start, `analysis.json` write.
|
|
@@ -1,6 +1,6 @@
|
|
|
1
|
-
### Phase 2: Planning (
|
|
1
|
+
### Phase 2: Planning (Fable)
|
|
2
2
|
|
|
3
|
-
> **TLDR** -
|
|
3
|
+
> **TLDR** - Fable decomposes the analysis (Opus when the fallback ladder engages) into concrete tasks with file-level targets, risk grading, and architecture review. Before Phase 3 a **Plan Approval Gate** runs in normal mode: if the Jira/issue description is ambiguous the orchestrator asks the user structured clarification questions (max 2 rounds) - once scope is clear it renders the plan and loops on free-text edit requests until the user approves or aborts. The gate is **skipped entirely** for `--dev`, `autopilot`, and `--dev autopilot` (their speed/zero-interaction contracts are preserved).
|
|
4
4
|
|
|
5
5
|
<!-- progress-contract: applied -->
|
|
6
6
|
Progress emission per `refs/progress-contract.md` - lines for plan-draft start, clarification-ask, clarification-answer, plan render, plan-edit-request, plan-approved, plan-aborted.
|
|
@@ -51,7 +51,7 @@ For non-component taskTypes (`bugfix`, `feature`, `refactor`, `chore`), continue
|
|
|
51
51
|
|
|
52
52
|
#### Re-entry from Phase 4 triage
|
|
53
53
|
|
|
54
|
-
Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings -
|
|
54
|
+
Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings - Fable triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
|
|
55
55
|
|
|
56
56
|
When re-entering from Phase 4:
|
|
57
57
|
|
|
@@ -94,6 +94,7 @@ For each task (respecting dependency order):
|
|
|
94
94
|
3. **TDD cycle** (Launch Agent with `model: "sonnet"`):
|
|
95
95
|
|
|
96
96
|
**RED - Write ONE failing test first:**
|
|
97
|
+
- **Rework re-entry**: if `state.reviewIterations[-1].verifyByTest.redTests[]` exists (Phase 4 Step 3.7 ran), those failing repro tests ARE the RED step for their findings - make them green, do not write a duplicate failing test and do not delete or weaken them. See `refs/features/verify-by-test.md`.
|
|
97
98
|
- Test framework: use whatever the project already uses. Detect from existing test files:
|
|
98
99
|
- `@Test` / `#expect` → Swift Testing
|
|
99
100
|
- `XCTestCase` / `XCTAssert` → XCTest
|
|
@@ -115,6 +116,7 @@ For each task (respecting dependency order):
|
|
|
115
116
|
|
|
116
117
|
**GREEN - Minimal code to pass:**
|
|
117
118
|
- Smallest change that makes the test green - no extras
|
|
119
|
+
- **Existing tests are immutable**: deleting, renaming, skipping, or weakening an existing assertion to reach green is a violation. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. (Deterministic backstop: the `test_lines_removed` signal in Phase 4 Step 1.75 flags test files that shrink.)
|
|
118
120
|
- Run the same test again → must PASS
|
|
119
121
|
- Run full test suite → no regressions:
|
|
120
122
|
```bash
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
### Phase 4: Review (deterministic gates + parallel + triage)
|
|
2
2
|
|
|
3
|
-
> **TLDR** - Three-stage review. Stage 1: deterministic gates (build + lint + test + secret scan) that MUST pass. Stage 2: AI models in parallel - reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers (
|
|
3
|
+
> **TLDR** - Three-stage review. Stage 1: deterministic gates (build + lint + test + secret scan) that MUST pass. Stage 2: AI models in parallel - reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers (Fable + Sonnet); Copilot CLI dispatches 3 reviewers (GPT-5.4 + Opus + Sonnet — Fable 5 is not offered on Copilot CLI). Stage 3: Fable triage (Opus on Copilot CLI) - evaluates raw findings, filters false-positives/out-of-scope, keeps only actionable items. Only triage-accepted blocking items loop back to Phase 3.
|
|
4
4
|
|
|
5
5
|
<!-- progress-contract: applied -->
|
|
6
6
|
Progress emission per `refs/progress-contract.md` - lines for each gate, each reviewer dispatch + finish, triage start, triage verdict, fix dispatch.
|
|
@@ -127,6 +127,7 @@ echo "$RISK_JSON" | node pipeline/scripts/validate-diff-risk.mjs - >/dev/null 2>
|
|
|
127
127
|
| `migration` | 4.0 | path matches DB schema / migration glob (.sql, /Migrations/, alembic/, prisma/migrations) |
|
|
128
128
|
| `public_api` | 2.0 | added line declares `public func/class/struct/enum`, `@objc`, `open fun`, `@Composable`, or `export function/class/...` |
|
|
129
129
|
| `no_test_change` | 2.5 | source file changed, no paired test file (`{Base}Tests.{ext}`, `{Base}.test.{ext}`, etc.) appears in the diff |
|
|
130
|
+
| `test_lines_removed` | 3.0 | test-classified file whose diff removes more lines than it adds (immutable-test backstop, see `refs/rules.md`) |
|
|
130
131
|
| `complexity_delta` | 1.5 | added control-flow tokens (`if`/`guard`/`switch`/`while`/`for`/ternary/`&&`/`\|\|`) |
|
|
131
132
|
| `ui_critical` | 1.5 | path matches `*View.swift`, `*Screen.kt`, `*Configuration.swift`, etc. |
|
|
132
133
|
| `loc_changed` | 1.0 | base sensitivity to total `+/-` lines |
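
Editorial sketch (not part of the package): the `test_lines_removed` row describes a test-classified file whose diff removes more lines than it adds. The snippet below illustrates that check on a git-format unified diff. The `TEST_FILE` pattern and the function name `shrunkTestFiles` are hypothetical, and the real `diff-risk-score.mjs` may classify test files and parse diffs differently.

```javascript
// Hypothetical test-file classifier (FooTests.swift, foo.test.js, foo.spec.ts).
const TEST_FILE = /(Tests?\.[A-Za-z]+|\.test\.[A-Za-z]+|\.spec\.[A-Za-z]+)$/;

// Return the test files whose hunks remove more lines than they add.
// Assumes git-style output where each file section starts with "diff --git".
function shrunkTestFiles(diffText) {
  const counts = new Map();
  let file = null;
  let inHunk = false;
  for (const line of diffText.split("\n")) {
    if (line.startsWith("diff --git ")) {
      file = null;
      inHunk = false;
    } else if (!inHunk && line.startsWith("+++ ")) {
      file = line.slice(4).replace(/^b\//, "");
      counts.set(file, { added: 0, removed: 0 });
    } else if (line.startsWith("@@")) {
      inHunk = true;
    } else if (inHunk && file) {
      if (line.startsWith("+")) counts.get(file).added++;
      else if (line.startsWith("-")) counts.get(file).removed++;
    }
  }
  return [...counts]
    .filter(([path, c]) => TEST_FILE.test(path) && c.removed > c.added)
    .map(([path]) => path);
}

const sampleDiff = [
  "diff --git a/src/FooTests.swift b/src/FooTests.swift",
  "--- a/src/FooTests.swift",
  "+++ b/src/FooTests.swift",
  "@@ -1,4 +1,3 @@",
  " func testFoo() {",
  "-  XCTAssertEqual(foo(), 1)",
  "-  XCTAssertNotNil(bar())",
  "+  XCTAssertTrue(true)",
  " }",
  "diff --git a/src/Foo.swift b/src/Foo.swift",
  "--- a/src/Foo.swift",
  "+++ b/src/Foo.swift",
  "@@ -1,2 +1,1 @@",
  "-let a = 1",
  "-let b = 2",
  "+let a = 2",
].join("\n");
console.log(shrunkTestFiles(sampleDiff));
```

This simplification does not handle fully deleted test files (`+++ /dev/null`); a production check would need to track the old path as well.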
|
|
@@ -181,17 +182,17 @@ Launch Agent instances **in parallel** using the shared `code-reviewer` subagent
|
|
|
181
182
|
|
|
182
183
|
| Reviewer | subagent_type | Model | Focus | Skills Referenced | Where it runs |
|
|
183
184
|
| ---------- | ----------------- | ------------------- | --------------------------------- | --------------------------------------------- | -------------------- |
|
|
184
|
-
| Reviewer 1 | `code-reviewer` | `claude-opus-4
|
|
185
|
+
| Reviewer 1 | `code-reviewer` | `claude-fable-5` (Claude Code) / `claude-opus-4-8` (Copilot CLI) | Deep security + architecture | `api-security-best-practices`, `architecture` | Both CLIs |
|
|
185
186
|
| Reviewer 2 | `code-reviewer` | `gpt-5.4` | Edge cases, different perspective | cross-model diversity | **Copilot CLI only** |
|
|
186
|
-
| Reviewer 3 | `code-reviewer` | `claude-sonnet-4
|
|
187
|
+
| Reviewer 3 | `code-reviewer` | `claude-sonnet-4-6` | Quality + correctness + naming | `clean-code`, stack-specific skill | Both CLIs |
|
|
187
188
|
|
|
188
189
|
Each reviewer inherits the `code-reviewer` agent's focus areas (Security, Architecture, Quality, Performance) and output contract. The orchestrator overrides only the model and the stack-specific skill per-reviewer - no prompt duplication.
|
|
189
190
|
|
|
190
|
-
**Model override wiring:** `code-reviewer.md` declares `preferredModel: fable`, so Reviewer 1 uses the persona default (Fable 5). Reviewer 2 (Copilot-only, `gpt-5.4`) and Reviewer 3 (`claude-sonnet-4
|
|
191
|
+
**Model override wiring:** `code-reviewer.md` declares `preferredModel: fable`, so Reviewer 1 uses the persona default (Fable 5). Reviewer 2 (Copilot-only, `gpt-5.4`) and Reviewer 3 (`claude-sonnet-4-6`) set `PHASE_MODEL_OVERRIDE=<model>` before dispatch - the orchestrator exports `CLAUDE_CODE_SUBAGENT_MODEL` on Claude Code, or passes `--model` on Copilot CLI. Full precedence rule: `skills/shared/core/multi-agent/SKILL.md#agent-dispatch--per-persona-model-routing-v610`. Fable dispatches are subject to the fallback contract (`refs/features/model-fallback.md`): a dispatch-error retry that walks `fable -> opus -> sonnet`, plus the budget-ceiling downgrade.
|
|
191
192
|
|
|
192
193
|
**Stack-specific skills loaded per reviewer** (from Phase 1 `detectedStack`). On Claude Code, Reviewer 2 (GPT-5.4) is not dispatched - its skill column is ignored. On Copilot CLI all three columns are used.
|
|
193
194
|
|
|
194
|
-
| Stack | Reviewer 1 (Opus) | Reviewer 2 (GPT-5.4 - Copilot CLI only) | Reviewer 3 (Sonnet) |
|
|
195
|
+
| Stack | Reviewer 1 (Fable / Opus on Copilot) | Reviewer 2 (GPT-5.4 - Copilot CLI only) | Reviewer 3 (Sonnet) |
|
|
195
196
|
|-------|-------------------|-----------------------------------------|---------------------|
|
|
196
197
|
| iOS/Swift | `ios-security`, `swiftui-performance`, `hig-patterns` | `swift-concurrency`, `ios-accessibility` | `swiftui-pro`, `swift-testing` |
|
|
197
198
|
| Android/Kotlin | `android-security`, `android-performance` | `compose-testing`, `android-architecture` | `compose-components`, `kotlin-coroutines-expert` |
|
|
@@ -204,11 +205,13 @@ Skills are injected into reviewer prompt context - the reviewer uses them as r
|
|
|
204
205
|
|
|
205
206
|
**iOS/Swift - interaction & convention skills (conditional).** When the diff touches SwiftUI UI files (`*View.swift`, `*Screen.swift`, `*Configuration.swift`, `*+Modifiers.swift`), additionally inject the relevant `figma-common` convention skills as reference for the iOS reviewers: `figma-navigation`, `figma-overlays`, `figma-bottom-sheets` (interaction: emit-intent vs self-route/self-present; native-SwiftUI-first vs the project's `ui.*` custom system), and the enriched `figma-to-swiftui` accessibility rules (minimalism). These back the Step 1.5 iOS convention checks. Generic across SwiftUI projects - not tied to any one app. Omit when the diff has no SwiftUI UI changes (keeps the reviewer prompt lean).
|
|
206
207
|
|
|
207
|
-
**
|
|
208
|
+
**Module review guides (conditional, all stacks).** A module in the repo may carry its own CLAUDE guide - a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths: for each changed file, walk its directory chain up to the repo root and collect `CLAUDE.md`, `*-CLAUDE.md`, and `AGENTS.md` files (root-level ones excluded - the host CLI already loads those). Dedupe, cap at 5 (log any dropped). Inject into every reviewer prompt with the directive: read each guide before reviewing and apply its rules/checklist to the changed files under its directory - a guide governs only its own subtree, and guide violations are findings triaged by severity like any other. No guides found → no-op, prompt stays lean. Same contract as `/multi-agent:review` Step 2b.
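
Editorial sketch (not part of the package): the discovery rule above (walk each changed file's directory chain, collect guide files, skip root-level ones, dedupe, cap at 5) written as a pure function. It takes the repo's file list as input instead of touching the filesystem; the function name `discoverGuides`, that input shape, and the nearest-directory-first ordering are assumptions.

```javascript
// Guide file names from the paragraph above: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
const GUIDE_NAME = /^(CLAUDE\.md|.+-CLAUDE\.md|AGENTS\.md)$/;

// Hypothetical discovery over repo-relative POSIX paths.
function discoverGuides(changedPaths, repoFiles, cap = 5) {
  const found = new Set(); // Set dedupes guides shared by several changed files
  for (const changed of changedPaths) {
    const dirs = changed.split("/").slice(0, -1);
    // depth 0 would be the repo root, which is excluded (the host CLI loads it)
    for (let depth = dirs.length; depth >= 1; depth--) {
      const dir = dirs.slice(0, depth).join("/");
      for (const f of repoFiles) {
        const slash = f.lastIndexOf("/");
        if (slash === -1) continue; // root-level file
        if (f.slice(0, slash) === dir && GUIDE_NAME.test(f.slice(slash + 1))) {
          found.add(f);
        }
      }
    }
  }
  const all = [...found];
  return { guides: all.slice(0, cap), dropped: all.slice(cap) }; // log dropped
}

const repoFiles = [
  "CLAUDE.md",
  "modules/auth/CLAUDE.md",
  "modules/auth/src/ios-CLAUDE.md",
  "modules/pay/AGENTS.md",
  "modules/auth/src/Login.swift",
];
console.log(discoverGuides(["modules/auth/src/Login.swift"], repoFiles));
```

In the example, `modules/pay/AGENTS.md` is not collected because no changed file lives under `modules/pay`, matching the rule that a guide governs only its own subtree.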
|
|
209
|
+
|
|
210
|
+
**Dispatch timeout (required, mirrors triage 3.3).** Reviewers run in parallel and triage waits on all of them, so one stalled reviewer hangs the phase. Bound each reviewer dispatch by `REVIEWER_TIMEOUT_SECONDS` (default 180). If a reviewer has not returned by the budget: log `review.reviewer_timeout reviewer=<name>`, treat that reviewer as absent, and proceed to triage with the reviewers that did return. The merged-findings count and `consensus.reviewerCount` reflect only the reviewers that returned. If **zero** reviewers return, retry Reviewer 1 once; on a second total failure HALT with `ERR: no reviewer returned within ${REVIEWER_TIMEOUT_SECONDS}s; resume with /multi-agent:resume #N.`. The Step 2.5 rebuttal round uses the same per-dispatch timeout. Never block indefinitely on a slow or dead reviewer dispatch.
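
Editorial sketch (not part of the package): bounding each parallel reviewer dispatch so that a stalled one is treated as absent. `withTimeout`, `collectReviews`, and the shape of the `dispatches` map are hypothetical; the 180-second default mirrors `REVIEWER_TIMEOUT_SECONDS` above. The zero-reviewer retry of Reviewer 1 and the HALT path are left out of the sketch, and reviewer crashes are treated the same as timeouts.

```javascript
// Reject if the promise has not settled within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// dispatches: { reviewerName: () => Promise<reviewerOutput> } (assumed shape).
async function collectReviews(dispatches, timeoutSeconds = 180) {
  const names = Object.keys(dispatches);
  const settled = await Promise.allSettled(
    names.map((n) => withTimeout(dispatches[n](), timeoutSeconds * 1000))
  );
  const returned = [];
  const absent = [];
  settled.forEach((s, i) => {
    if (s.status === "fulfilled") returned.push({ reviewer: names[i], output: s.value });
    else absent.push(names[i]); // would be logged as review.reviewer_timeout reviewer=<name>
  });
  // reviewerCount reflects only the reviewers that actually returned.
  return { returned, absent, reviewerCount: returned.length };
}
```

Triage would then run on `returned` only, which keeps one dead dispatch from hanging the whole phase.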
|
|
208
211
|
|
|
209
212
|
#### Output contract - reviewer step
|
|
210
213
|
|
|
211
|
-
Step 2 produces N reviewer-output objects (one per dispatched reviewer), each conforming to `pipeline/schemas/reviewer-output.schema.json`. They are persisted to `state.reviewIterations[<iteration>].reviewers[]` and consumed by Step 3 (
|
|
214
|
+
Step 2 produces N reviewer-output objects (one per dispatched reviewer), each conforming to `pipeline/schemas/reviewer-output.schema.json`. They are persisted to `state.reviewIterations[<iteration>].reviewers[]` and consumed by Step 3 (Fable triage) - never by Phase 6 directly. The triage step (below) is the producer of the only review artifact Phase 6 reads, conforming to `pipeline/schemas/triage-output.schema.json`.
|
|
212
215
|
|
|
213
216
|
**Subagent return format** - each reviewer returns JSON conforming to `pipeline/schemas/reviewer-output.schema.json`:
|
|
214
217
|
|
|
@@ -248,9 +251,11 @@ Exit 0 = valid. Exit 2 = contradiction (approved=true with blocking findings) -
|
|
|
248
251
|
|
|
249
252
|
**Off by default reason:** mixed-verdict cases are ~8% of runs in practice; the extra ~$0.20-$0.50 per run isn't worth automating for users who'd rather let triage resolve it cleanly. Users with high-stakes tasks (security-critical, release branches) can flip the flag.
|
|
250
253
|
|
|
251
|
-
#### Step 3 -
|
|
254
|
+
#### Step 3 - Fable Triage (filter before acting)
|
|
255
|
+
|
|
256
|
+
**CRITICAL**: Reviewer findings are **raw signals**, not commands. Never auto-loop on every "blocking" tag - reviewers hallucinate, misread scope, or repeat each other. Run Fable triage (Opus on Copilot CLI) to evaluate merged findings against task scope.
|
|
252
257
|
|
|
253
|
-
|
|
258
|
+
Opt-in empirical layer: when `prefs.global.verifyByTest.enabled` is `true`, accepted blocking findings additionally go through Step 3.7 (verify-by-test), which tries to reproduce each one with a minimal failing test before the Phase 3 rework loop fires. Full wiring: `refs/features/verify-by-test.md`.
|
|
254
259
|
|
|
255
260
|
##### 3.1 Short-circuit: no findings
|
|
256
261
|
|
|
@@ -258,7 +263,7 @@ If merged findings `length === 0`, **skip triage**: write empty result `{"accept
|
|
|
258
263
|
|
|
259
264
|
##### 3.2 Launch triage agent
|
|
260
265
|
|
|
261
|
-
Launch **1 Agent** (subagent_type: `general-purpose`, model: `opus`) with:
|
|
266
|
+
Launch **1 Agent** (subagent_type: `general-purpose`, model: `fable` on Claude Code / `opus` on Copilot CLI) with:
|
|
262
267
|
|
|
263
268
|
- Raw findings from Reviewer 1 + Reviewer 2 (merged JSON)
|
|
264
269
|
- Task scope (Phase 1 analysis summary + Phase 2 plan)
|
|
@@ -307,11 +312,11 @@ Step 3 produces a single triage-output object conforming to `pipeline/schemas/tr
|
|
|
307
312
|
|
|
308
313
|
Return ONLY valid JSON conforming to pipeline/schemas/triage-output.schema.json:
|
|
309
314
|
{
|
|
310
|
-
"accepted": [{ "severity": "blocking|important|suggestion", "file": "...", "line": N, "issue": "...", "fix": "...", "reviewer": "opus|sonnet" }],
|
|
315
|
+
"accepted": [{ "severity": "blocking|important|suggestion", "file": "...", "line": N, "issue": "...", "fix": "...", "reviewer": "fable|opus|sonnet|gpt" }],
|
|
311
316
|
"deferred": [{ "finding": {...}, "reason": "..." }],
|
|
312
317
|
"rejected": [{ "finding": {...}, "reason": "..." }],
|
|
313
318
|
"approved": true|false, // true if no accepted blocking items remain
|
|
314
|
-
"consensus": { "reviewerCount": N, "verdict": "unanimous-pass|unanimous-block|split|unverified", "disagreements": [{ "file": "...", "line": N, "issue": "...", "note": "
|
|
319
|
+
"consensus": { "reviewerCount": N, "verdict": "unanimous-pass|unanimous-block|split|unverified", "disagreements": [{ "file": "...", "line": N, "issue": "...", "note": "Fable blocking, Sonnet approved" }] } // optional, see 3.6
|
|
315
320
|
}
|
|
316
321
|
```
|
|
317
322
|
|
|
@@ -352,12 +357,12 @@ Failure fallback (timeout >120s, or agent crash before any JSON is produced): re
|
|
|
352
357
|
Emit metrics per review pass for Phase 7 cost rollup:
|
|
353
358
|
|
|
354
359
|
```bash
|
|
355
|
-
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=
|
|
360
|
+
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=fable duration_ms=$R1_DURATION tokens_in=$R1_IN tokens_out=$R1_OUT # model=opus on Copilot CLI
|
|
356
361
|
# GPT-5.4 metric emitted only on Copilot CLI (skip on Claude Code):
|
|
357
362
|
[ "${CLI_HOST:-claude}" = "copilot" ] && \
|
|
358
363
|
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=gpt-5.4 duration_ms=$GPT_DURATION tokens_in=$GPT_IN tokens_out=$GPT_OUT
|
|
359
364
|
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.reviewer_call model=sonnet duration_ms=$SONNET_DURATION tokens_in=$SONNET_IN tokens_out=$SONNET_OUT
|
|
360
|
-
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.triage_call model=
|
|
365
|
+
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.triage_call model=fable duration_ms=$TRIAGE_DURATION tokens_in=$TRIAGE_IN tokens_out=$TRIAGE_OUT
|
|
361
366
|
pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.completed raw_count=$RAW accepted=$ACC deferred=$DEF rejected=$REJ approved=$APPROVED duration_ms=$DURATION
|
|
362
367
|
```
|
|
363
368
|
|
|
@@ -365,31 +370,58 @@ pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.completed raw_count=$RAW acce
|
|
|
365
370
|
|
|
366
371
|
##### 3.5 Optional cross-check (single-point-of-failure mitigation)
|
|
367
372
|
|
|
368
|
-
Opt-in via `prefs.global.triageCrossCheck.enabled` (default `false`). Sampled runs dispatch a **Sonnet** triage agent as second opinion, validated via `validate-triage.mjs` (same fallback rules). Disagreements logged as `triage.cross_check_diff`; `blockOnDisagreement` pauses for user (autopilot: proceed with
|
|
373
|
+
Opt-in via `prefs.global.triageCrossCheck.enabled` (default `false`). Sampled runs dispatch a **Sonnet** triage agent as second opinion, validated via `validate-triage.mjs` (same fallback rules). Disagreements logged as `triage.cross_check_diff`; `blockOnDisagreement` pauses for user (autopilot: proceed with the Fable verdict). Doubles triage cost on sampled runs.
|
|
369
374
|
|
|
370
375
|
##### 3.6 Consensus surfacing (anti-correlation)
|
|
371
376
|
|
|
372
|
-
**Rationale:** Reviewer 1 (
|
|
377
|
+
**Rationale:** Reviewer 1 (Fable) and Reviewer 3 (Sonnet) are both Anthropic Claude models, so unanimous agreement on a *judgment call* is not independent confirmation - same-family models drift the same way on ambiguous prompts. Treating "both approved" as proof produces false-consensus passes. Triage therefore records a `consensus` block (schema v3.1.0) and surfaces disagreement and unverified agreement to the user rather than burying it.
|
|
373
378
|
|
|
374
379
|
After the triage verdict is computed, populate `triage.consensus`:
|
|
375
380
|
|
|
376
381
|
1. `reviewerCount` = number of reviewers dispatched this iteration (`2` on Claude Code, `3` on Copilot CLI).
|
|
377
382
|
2. Classify the iteration `verdict`:
|
|
378
383
|
- `unanimous-block` -> all reviewers returned at least one overlapping `blocking` finding.
|
|
379
|
-
- `split` -> reviewers disagreed on existence or severity of one or more findings (the Step 2.5 disagreement definition). List each split in `disagreements[]` with a `note` naming who held which position (e.g. "
|
|
384
|
+
- `split` -> reviewers disagreed on existence or severity of one or more findings (the Step 2.5 disagreement definition). List each split in `disagreements[]` with a `note` naming who held which position (e.g. "Fable blocking, Sonnet approved").
|
|
380
385
|
- `unanimous-pass` -> all reviewers approved AND the diff is low-risk (no security/auth/concurrency surface per Phase 1 `touchedAreas`). Clear-cut; trust it.
|
|
381
386
|
- `unverified` -> all reviewers approved BUT the diff touches a judgment-heavy surface (security, auth, concurrency, money, data migration). Agreement here may be correlated; do NOT treat it as a confirmed pass. Surface it.
|
|
382
387
|
3. `disagreements[]` is populated for `split` and is also used to carry `unverified` notes (e.g. "both approved a keychain change - agreement unverified, confirm manually").
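
The verdict rules above can be sketched as a small decision function. This is only an illustration, not the pipeline's implementation: the function name, the four flag arguments, and the choice to check for a split before anything else are all assumptions made for the sketch.

```bash
# Hypothetical helper: classify the iteration verdict from four flags.
# Arguments (each "1" or "0"; names are illustrative, not from the pipeline):
#   $1 all_approved       every reviewer approved
#   $2 overlapping_block  every reviewer raised an overlapping blocking finding
#   $3 disagreement       reviewers differ on existence/severity of a finding
#   $4 judgment_heavy     diff touches security/auth/concurrency/money/migration
classify_consensus() {
  if [ "$3" = "1" ]; then
    echo "split"
  elif [ "$2" = "1" ]; then
    echo "unanimous-block"
  elif [ "$1" = "1" ] && [ "$4" = "1" ]; then
    echo "unverified"
  elif [ "$1" = "1" ]; then
    echo "unanimous-pass"
  else
    # Assumption: anything unclassifiable is surfaced, never averaged into a pass.
    echo "split"
  fi
}

classify_consensus 1 0 0 1   # all approved on a judgment-heavy diff
classify_consensus 1 0 0 0   # all approved on a low-risk diff
```

Checking `disagreement` first means a split is always surfaced, even when the reviewers also share a blocking finding; a real implementation may order these differently.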
|
|
383
388
|
|
|
384
389
|
**Surfacing (Step 4 + Phase 7):** When `verdict` is `split` or `unverified`, the disagreements are shown to the user at the Step 4 checkpoint (interactive modes) and always written to the Phase 7 report. Autopilot does not block on `unverified` (it logs `review.consensus=unverified` and proceeds), matching the maturity-check model - but the report records it so a human review can catch it. This is additive: omitting `consensus` is valid and means "not computed."
|
|
385
390
|
|
|
391
|
+
##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
|
|
392
|
+
|
|
393
|
+
**Rationale:** a triage verdict is still a judgment call; a failing repro test is proof. Debating a finding costs tokens; running it costs one test invocation and kills false positives that survive adversarial framing.
|
|
394
|
+
|
|
395
|
+
**Gate:** runs only when `prefs.global.verifyByTest.enabled` is `true` AND the validated triage output contains at least one `accepted` finding with `severity: "blocking"`. Otherwise skip silently (no log noise). Full behavior spec: `refs/features/verify-by-test.md`.
|
|
396
|
+
|
|
397
|
+
1. **Dispatch ONE verifier agent** (subagent_type: `general-purpose`, model: `prefs.global.verifyByTest.model`, default `sonnet`) for the iteration - not one per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings (ordered as triage returned them), the diff hunks for their files, and the project's test conventions from Phase 1. Findings beyond the cap keep their judgment-only verdict; log `verify_by_test=cap-exceeded count=<n>`.
|
|
398
|
+
2. **Per finding, the verifier writes ONE minimal repro test** asserting the CORRECT behavior the finding claims is broken, following the platform test conventions (framework, naming, location per phase-3-dev.md), then runs ONLY that test using the platform single-test invocation from Phase 3 (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`), wrapped in `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
|
|
399
|
+
3. **Verdict mapping (fails toward keeping blockers):**
|
|
400
|
+
|
|
401
|
+
| Repro test outcome | `verification.result` | Action on finding |
|
|
402
|
+
| --- | --- | --- |
|
|
403
|
+
| Test FAILS as the finding predicts | `confirmed` | Stays `accepted` blocking. Repro test is KEPT in the worktree and recorded in `redTests[]` as the RED test for the Phase 3 rework loop. |
|
|
404
|
+
| Test PASSES (defect not reproducible) | `not-reproduced` | Downgrade is evidence-gated: `node pipeline/scripts/evidence-gate.mjs --claim test --status passed --evidence "$WORKTREE/.pipeline/verify-<i>.test.log"` must exit 0. On exit 0: move finding from `accepted[]` to `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`, delete the repro test file. On exit non-zero: treat as `inconclusive`. |
|
|
405
|
+
| Compile error, timeout, or defect not expressible as a unit test | `inconclusive` | Stays `accepted` blocking (judgment-only verdict stands). Partial test file deleted, cause noted in `verification.note`. |
|
|
406
|
+
|
|
407
|
+
4. **Cleanup invariant:** after Step 3.7 the only verifier artifacts left in the worktree are the confirmed repro tests listed in `redTests[]` (they get committed with the fix, satisfying the TDD contract) and logs under `$WORKTREE/.pipeline/` (already outside Phase 6 commit scope).
|
|
408
|
+
5. **Persist + re-validate:** stamp each verified finding with a `verification` object (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`, recompute `approved`, then re-run `validate-triage.mjs` on the mutated `$TRIAGE_FILE` under the same 3.2.1 gate protocol.
|
|
409
|
+
6. **Timeout/fallback (mirrors 3.3):** the whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600). On verifier crash or budget breach: no retry; remaining findings keep judgment-only verdicts, log `triage=verify-by-test-timeout`, proceed to Step 4. Never blocks the pipeline.
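
As a rough sketch of the verdict mapping in step 3, the table can be read as two small functions. The function names and the `fail`/`pass` outcome labels are hypothetical; only the three `verification.result` values and the evidence-gate rule come from the text above.

```bash
# Hypothetical helper mirroring the verdict table in step 3.
#   $1 repro test outcome: "fail" (fails as the finding predicts), "pass",
#      or anything else (compile error, timeout, not expressible as a unit test)
#   $2 exit code of the evidence gate; only consulted when the test passed
map_verification() {
  case "$1" in
    fail) echo "confirmed" ;;
    pass)
      if [ "${2:-1}" -eq 0 ]; then
        echo "not-reproduced"
      else
        echo "inconclusive"      # gate refused the evidence: keep the blocker
      fi ;;
    *) echo "inconclusive" ;;
  esac
}

# Only an evidence-backed "not-reproduced" result leaves accepted[].
finding_bucket() {
  case "$1" in
    not-reproduced) echo "deferred" ;;
    *)              echo "accepted" ;;
  esac
}
```

Every path other than a passing test with a zero gate exit keeps the finding in `accepted[]`, which is what "fails toward keeping blockers" amounts to in practice.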
|
|
410
|
+
|
|
411
|
+
Telemetry (per 3.4 conventions, best-effort):
|
|
412
|
+
|
|
413
|
+
```bash
|
|
414
|
+
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.verify_by_test \
|
|
415
|
+
attempted=$A confirmed=$C downgraded=$D inconclusive=$I duration_ms=$MS
|
|
416
|
+
```
|
|
417
|
+
|
|
386
418
|
#### Step 4 - Consensus + Action (triage-driven)
|
|
387
419
|
|
|
388
420
|
If `triage.consensus.verdict` is `split` or `unverified`, surface `consensus.disagreements[]` to the user before acting: interactive modes show the split and ask whether to treat the unverified agreement as a pass (picker-contract: Trust / Re-review / Treat-as-blocking); autopilot logs `review.consensus=<verdict>` and proceeds on the triage verdict. Never silently average a split into a pass.
|
|
389
421
|
|
|
390
422
|
Act **only on triage.accepted**:
|
|
391
423
|
|
|
392
|
-
- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items)
|
|
424
|
+
- **accepted.blocking** → back to Phase 3 (max 3 iterations, with reflection prompt citing only accepted items). When Step 3.7 ran and `state.reviewIterations[-1].verifyByTest.redTests[]` is non-empty, the reflection prompt cites each red test: "a failing repro test already exists at <testRef>; make it green; do not delete or weaken it."
|
|
393
425
|
- **accepted.important** → fix and re-review
|
|
394
426
|
- **accepted.suggestion** → apply if reasonable
|
|
395
427
|
- **deferred** → append to Phase 7 report as "follow-up items" (do not block)
|
|
@@ -430,7 +462,7 @@ for proj in $(jq -r '.projects[] | "\(.name)\t\(.worktreePath)\t\(.baseBranch)"'
|
|
|
430
462
|
done
|
|
431
463
|
```
|
|
432
464
|
|
|
433
|
-
Same
|
|
465
|
+
The same reviewer set (Fable-or-Opus / GPT-5.4 / Sonnet) receives `COMBINED_DIFF` with a multi-repo prefix in the system prompt:
|
|
434
466
|
|
|
435
467
|
```
|
|
436
468
|
This is a multi-repo task spanning {N} repos: {repo names}.
|
|
@@ -128,7 +128,7 @@ Every phase that dispatches a billable LLM agent MUST forward the call's token t
|
|
|
128
128
|
|
|
129
129
|
```bash
|
|
130
130
|
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" <phase-id> <event> \
|
|
131
|
-
model=<opus|sonnet|haiku|gpt-5.4> \
|
|
131
|
+
model=<fable|opus|sonnet|haiku|gpt-5.4> \
|
|
132
132
|
tokens_in=$IN tokens_out=$OUT duration_ms=$DUR
|
|
133
133
|
```
|
|
134
134
|
|
|
@@ -52,6 +52,7 @@ This is the single source of truth. When a contributor or model is unsure where
|
|
|
52
52
|
- **NEVER** commit without passing build (all gates in Phase 4 Step 1 must be green).
|
|
53
53
|
- **NEVER** commit without passing review (at least one AI reviewer must return `approved: true` with no blocking findings).
|
|
54
54
|
- **NEVER** skip tests. Every public method, every error path, every edge case.
|
|
55
|
+
- **NEVER** delete, rename, or weaken an existing test to get a green run. Existing tests are immutable during a task. A test may change only when the task itself changes the spec that test encodes, and the commit body must name the changed test and the spec change. Deterministic backstop: the `test_lines_removed` diff-risk signal (Phase 4 Step 1.75) flags test files that shrink.
|
|
55
56
|
- **Follow existing code style and conventions.** Read neighbor files before writing new ones - match naming, structure, import order.
|
|
56
57
|
- **Use design tokens, no magic numbers.** `16` → `.Spacing.spacing16`. `#E31837` → `Color.Primary.primary`. `.font(.system(size: 14))` → `.typographyStyle(.body1)`.
|
|
57
58
|
- **Design system primitives before custom views.** Before writing a new SwiftUI / Compose / React / View / Configuration triplet inside a domain or feature module, grep the shared component library (project-specific path, e.g. `Common/UIComponents/`, `core-ui/`, `packages/ui/`) for an existing primitive that solves the same problem. New domain-level wrappers, custom modals, custom buttons, or hand-rolled toasts are forbidden when the design system already has an equivalent. If the primitive exists but lacks a modifier (placeholder, size, error binding), **add the modifier to the primitive** in its `+Modifiers` extension - do not fork the primitive into the consumer domain. The Figma `CodeConnectSnippet` is the authoritative pointer to which primitive to use.
|
|
@@ -24,8 +24,7 @@ The agent detects which CLI it's running in and uses the appropriate visual mech
|
|
|
24
24
|
```
|
|
25
25
|
1. system prompt mentions "Claude Code" → claude-code
|
|
26
26
|
2. system prompt mentions "Copilot" / "GitHub Copilot" → copilot
|
|
27
|
-
3.
|
|
28
|
-
5. None of the above → generic (bash stdout)
|
|
27
|
+
3. None of the above → generic (bash stdout)
|
|
29
28
|
```
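
The detection order above can be expressed as a single `case` statement. This is a minimal sketch: the function name is hypothetical, and it assumes the system prompt text is available as a plain string argument, which the source does not specify.

```bash
# Hypothetical helper: map system-prompt text to a host identifier.
# Patterns are tried top to bottom, matching the numbered order above.
detect_cli_host() {
  case "$1" in
    *"Claude Code"*) echo "claude-code" ;;
    *"Copilot"*)     echo "copilot" ;;     # also covers "GitHub Copilot"
    *)               echo "generic" ;;     # bash stdout fallback
  esac
}

detect_cli_host "You are running inside GitHub Copilot CLI"
```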
|
|
30
29
|
|
|
31
30
|
Visual mechanism per CLI:
|
|
@@ -23,10 +23,13 @@ Resume a paused or failed task from the last successful phase.
|
|
|
23
23
|
- `haltReason` - if set, show it so the user knows why the run stopped; clear it on successful re-entry
|
|
24
24
|
- `autopilot` - preserve the mode
|
|
25
25
|
|
|
26
|
-
3. **Load context** -
|
|
27
|
-
-
|
|
28
|
-
-
|
|
29
|
-
|
|
26
|
+
3. **Load context** - rebuild working context from durable artifacts, never from conversation memory:
|
|
27
|
+
- **Handoff first (v10.8.0)**: read the LATEST `## Handoff` block in `agent-log.md` - it carries done/remaining/decisions/open-findings and the exact re-entry point (phase + subStep). When present, it is the primary context source; cross-check its `Next:` line against `state.currentPhase` and trust state on mismatch (state is the machine truth, handoff is the narrative).
|
|
28
|
+
- Fall back to per-phase findings for logs written before v10.8 (no handoff blocks):
|
|
29
|
+
- Phase 1 analysis → use it from Phase 2+
|
|
30
|
+
- Phase 2 plan → use it from Phase 3+
|
|
31
|
+
- Phase 3 code → already in the worktree
|
|
32
|
+
- Recent `git log --oneline -10` in the worktree grounds what was actually committed vs. claimed.
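
One way the handoff-first lookup might work is sketched below. The layout of a `## Handoff` block is not shown in this diff, so the sketch assumes a block runs from its heading to the next `## ` heading or end of file; both function names are hypothetical.

```bash
# Hypothetical helper: print the LAST "## Handoff" block of a log file.
# Assumes a block ends at the next "## " heading or at end of file.
latest_handoff() {
  awk '
    /^## Handoff/ { buf = $0 "\n"; capture = 1; next }  # newer block replaces older
    /^## /        { capture = 0 }                       # any other H2 ends the block
    capture       { buf = buf $0 "\n" }
    END           { printf "%s", buf }
  ' "$1"
}

# Hypothetical cross-check: state wins when the handoff disagrees with it.
#   $1 phase named by the handoff "Next:" line, $2 state.currentPhase
reentry_phase() {
  if [ -n "$1" ] && [ "$1" != "$2" ]; then
    echo "handoff/state mismatch: handoff=$1 state=$2, trusting state" >&2
  fi
  echo "$2"
}
```

A log written before v10.8 has no handoff block, so `latest_handoff` prints nothing and the caller would fall back to the per-phase findings.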
|
|
30
33
|
|
|
31
34
|
4. **Continue the pipeline** - start from the next phase (same pipeline as the main multi-agent command).
|
|
32
35
|
|
|
@@ -1,5 +1,5 @@
|
|
|
1
1
|
---
|
|
2
|
-
description: "Run parallel review on a branch's diff or a Pull Request: 2 models on Claude Code (
|
|
2
|
+
description: "Run parallel review on a branch's diff or a Pull Request: 2 models on Claude Code (Fable + Sonnet), 3 models on Copilot CLI (GPT + Opus + Sonnet). On PR input, posts per-finding inline comments and sets approve/needs-work review state."
|
|
3
3
|
argument-hint: "[#N | repo#N | PR-URL | branch] - optional: PR by number/URL, repo+number, or local branch. Supports GitHub and Bitbucket Server URLs. If omitted, the current branch is used."
|
|
4
4
|
---
|
|
5
5
|
|
|
@@ -109,18 +109,50 @@ The `credential-store.sh` wrapper handles macOS Keychain (`security`), Linux lib
|
|
|
109
109
|
|
|
110
110
|
Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers can re-read it.
|
|
111
111
|
|
|
112
|
+
### 2b. Module review guides - path-scoped convention files
|
|
113
|
+
|
|
114
|
+
A module in the repo may carry its own CLAUDE guide - a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths:
|
|
115
|
+
|
|
116
|
+
```bash
|
|
117
|
+
# Changed file paths from the patch:
|
|
118
|
+
grep -E '^\+\+\+ b/' "$DIFF_FILE" | sed 's|^+++ b/||' | sort -u > /tmp/multi-agent-review-${TASK_ID}-paths.txt
|
|
119
|
+
|
|
120
|
+
# For each changed path, walk its directory chain up to the repo root and
|
|
121
|
+
# collect guide files matching: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
|
|
122
|
+
# Root-level CLAUDE.md/AGENTS.md are excluded - the host CLI already loads those.
|
|
123
|
+
guides=()
|
|
124
|
+
while IFS= read -r p; do
|
|
125
|
+
d=$(dirname "$p")
|
|
126
|
+
while [ "$d" != "." ] && [ "$d" != "/" ]; do
|
|
127
|
+
for g in "$d"/CLAUDE.md "$d"/*-CLAUDE.md "$d"/AGENTS.md; do
|
|
128
|
+
[ -e "$g" ] && guides+=("$g")
|
|
129
|
+
done
|
|
130
|
+
d=$(dirname "$d")
|
|
131
|
+
done
|
|
132
|
+
done < /tmp/multi-agent-review-${TASK_ID}-paths.txt
|
|
133
|
+
# dedupe, cap at 5 (log any dropped so truncation is never silent)
|
|
134
|
+
```
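
The block above ends with a comment for the dedupe-and-cap step but does not show it. A minimal sketch of that step follows, assuming the collected paths are passed one per line on stdin; the function name and the wording of the log message are illustrative only.

```bash
# Hypothetical helper: read guide paths on stdin, emit at most $1 unique paths
# on stdout (first-seen order) and report every dropped path on stderr.
cap_guides() {
  awk -v max="${1:-5}" '
    $0 == ""   { next }                   # ignore blank lines (empty input)
    seen[$0]++ { next }                   # skip duplicates
    kept < max { print; kept++; next }    # keep the first max unique paths
    { print "module-guide dropped (cap " max "): " $0 > "/dev/stderr" }
  '
}

printf '%s\n' a/CLAUDE.md a/CLAUDE.md a/b/AGENTS.md | cap_guides 5
```

With the `guides` array from the block above, the call would look like `printf '%s\n' "${guides[@]}" | cap_guides 5`. Writing each dropped path to stderr keeps the truncation visible, as the original comment asks.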
|
|
135
|
+
|
|
136
|
+
Existence checks are resolved against the local checkout when the cwd is the target repo. In PR mode without a local checkout, probe the candidate paths via the provider API instead (`gh api /repos/{o}/{r}/contents/{path}?ref={headSha}` / Bitbucket `GET /projects/{KEY}/repos/{slug}/browse/{path}?at={headSha}`) and fetch the matching files' raw content the same way. No hit → step is a silent no-op.
|
|
137
|
+
|
|
138
|
+
Persist `agent-state.review.moduleGuides = [<repo-relative paths>]` and inject into every reviewer prompt (Step 3):
|
|
139
|
+
|
|
140
|
+
> MODULE REVIEW GUIDES: before reviewing, read each of these guide files. Apply a guide's rules/checklist to every changed file under its directory. Guide violations are findings like any other - triage them by severity.
|
|
141
|
+
|
|
142
|
+
Scope note: a guide governs only files under its own directory - a guide found under one module must not be applied to a sibling module's changes in the same PR.
|
|
143
|
+
|
|
112
144
|
### 3. Launch parallel reviewers - host-CLI dependent
|
|
113
145
|
|
|
114
146
|
**Claude Code (2 in parallel):**
|
|
115
|
-
- Agent 1: `claude-
|
|
116
|
-
- Agent 2: `claude-sonnet-4
|
|
147
|
+
- Agent 1: `claude-fable-5` → security + architecture
|
|
148
|
+
- Agent 2: `claude-sonnet-4-6` → general quality
|
|
117
149
|
|
|
118
150
|
**Copilot CLI (3 in parallel):**
|
|
119
|
-
- Agent 1: `claude-opus-4
|
|
151
|
+
- Agent 1: `claude-opus-4-8` → security + architecture (Fable 5 is not offered on Copilot CLI)
|
|
120
152
|
- Agent 2: `gpt-5.4` → edge cases, alternate perspective
|
|
121
|
-
- Agent 3: `claude-sonnet-4
|
|
153
|
+
- Agent 3: `claude-sonnet-4-6` → general quality
|
|
122
154
|
|
|
123
|
-
Each reviewer receives the diff plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
|
|
155
|
+
Each reviewer receives the diff, the module review guides from Step 2b (when any were found), plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
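
For illustration, the host-dependent reviewer set could be selected like this. The model identifiers are the ones listed above; the function name is hypothetical, and the `claude`/`copilot` host values are assumed from the `CLI_HOST` convention used in the Phase 4 metric snippet.

```bash
# Hypothetical helper: print the reviewer models for a host, one per line.
#   $1 host identifier ("copilot" or anything else, treated as Claude Code)
reviewer_models() {
  case "${1:-claude}" in
    copilot) printf '%s\n' claude-opus-4-8 gpt-5.4 claude-sonnet-4-6 ;;
    *)       printf '%s\n' claude-fable-5 claude-sonnet-4-6 ;;
  esac
}

reviewer_models "${CLI_HOST:-claude}"
```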
|
|
124
156
|
|
|
125
157
|
### 4. Store-compliance cross-reference
|
|
126
158
|
|
|
@@ -137,7 +169,7 @@ Each finding gets the `ruleID` from the catalog plus the platform policy ref:
|
|
|
137
169
|
|
|
138
170
|
Catalog-only - does NOT invoke binaries. For a full scan, use `/multi-agent:test "store-ready"`.
|
|
139
171
|
|
|
140
|
-
### 5. Triage (
|
|
172
|
+
### 5. Triage (Fable)
|
|
141
173
|
|
|
142
174
|
Classify findings into:
|
|
143
175
|
- 🔴 **Blocking** → must fix
|
|
@@ -152,10 +184,10 @@ Triage also marks each finding as `accepted` (real issue), `deferred` (real but
|
|
|
152
184
|
🔍 Review Complete · PR #1250 · 3 files +120 -45
|
|
153
185
|
| Model | Verdict | Blocking | Important | Suggestion |
|
|
154
186
|
|----------|-----------|----------|-----------|------------|
|
|
155
|
-
|
|
|
187
|
+
| Fable | approved | 0 | 1 | 3 |
|
|
156
188
|
| Sonnet | rejected | 1 | 2 | 5 |
|
|
157
189
|
|
|
158
|
-
Consensus: ⚠ DISAGREEMENT - see
|
|
190
|
+
Consensus: ⚠ DISAGREEMENT - see Fable triage
|
|
159
191
|
```
|
|
160
192
|
|
|
161
193
|
This summary ALWAYS prints, regardless of input mode. The chat is the live conversation; on the PR side, the durable artifacts are inline comments + the review state (Step 7).
|
|
@@ -58,7 +58,7 @@ Run every step automatically:
|
|
|
58
58
|
Step 0: FIGMA_SYNC SKIP (deprecated - feedback_figma_source_deprecated)
|
|
59
59
|
Step 1: PLATFORM Detect macOS / Linux / Windows (Git Bash / WSL); export PLATFORM env
|
|
60
60
|
Step 1.5: DETECT Compare timestamps, find stale targets
|
|
61
|
-
Step 2: COPILOT Claude Code -> Copilot CLI (instructions +
|
|
61
|
+
Step 2: COPILOT Claude Code -> Copilot CLI (instructions + 35 sub-command skills)
|
|
62
62
|
Step 3: REPO Claude Code -> pipeline repo (genericized, personal data scrub, bash -n on all sh)
|
|
63
63
|
Step 3c: PLUGINS pipeline shared/external -> multi-agent-plugins marketplace (rebuild knowledge/,
|
|
64
64
|
bump changed plugins' patch version, commit + push the plugins repo)
|
|
@@ -277,11 +277,11 @@ This runs on the Claude <-> Copilot axis — the two CLIs the pipeline supports
|
|
|
277
277
|
|-------------|-------------|
|
|
278
278
|
| `~/.claude/commands/multi-agent/{cmd}.md` | `~/.copilot/skills/multi-agent-{cmd}/SKILL.md` |
|
|
279
279
|
|
|
280
|
-
**
|
|
280
|
+
**35 commands are synced** (canonical inventory - must match `cross-cli-contract.md` section 1; drift = contract violation):
|
|
281
281
|
|
|
282
282
|
```
|
|
283
283
|
analysis, analysis-resolve, autopilot, build-optimize, channels, delete, dev,
|
|
284
|
-
dev-autopilot, dev-local, dev-local-autopilot, diff-explain, garbage-collect,
|
|
284
|
+
dev-autopilot, dev-local, dev-local-autopilot, diff-explain, finish, garbage-collect,
|
|
285
285
|
help, issue, jira, kill, language, local, local-autopilot, log, manual-test,
|
|
286
286
|
prune-logs, purge, refactor, resume, review, scan, search, setup, stack, status,
|
|
287
287
|
sync, test, update
|
|
@@ -1,5 +1,5 @@
|
|
|
1
1
|
---
|
|
2
|
-
description: "Task orchestrator - full pipeline via Jira ID + branch or GitHub Issue URL: analysis, plan, TDD development, parallel review +
|
|
2
|
+
description: "Task orchestrator - full pipeline via Jira ID + branch or GitHub Issue URL: analysis, plan, TDD development, parallel review + Fable triage (CLI-aware: 2-model on Claude Code, 3-model on Copilot CLI), commit, log"
|
|
3
3
|
allowed-tools: Agent, Bash, Read, Write, Edit, Glob, Grep, TaskCreate, TaskUpdate, TaskList, TaskGet, AskUserQuestion, WebFetch, WebSearch, NotebookEdit, Skill
|
|
4
4
|
---
|
|
5
5
|
|
|
@@ -140,14 +140,14 @@ This command uses lazy loading for token efficiency. Read the relevant sub-file
|
|
|
140
140
|
- Multiple stacks -> load all relevant guides
|
|
141
141
|
|
|
142
142
|
**Agent definitions** (used in Phase 1 and Phase 4):
|
|
143
|
-
- `$HOME/.claude/agents/code-reviewer.md` - Phase 4 reviewer persona (`preferredModel:
|
|
143
|
+
- `$HOME/.claude/agents/code-reviewer.md` - Phase 4 reviewer persona (`preferredModel: fable`; Phase 4 overrides Reviewer 3 to `sonnet`)
|
|
144
144
|
- `$HOME/.claude/agents/explorer.md` - Phase 1 codebase scan persona (`preferredModel: sonnet` - scan work, cost-efficient)
|
|
145
|
-
- `$HOME/.claude/agents/ios-architect.md` - iOS architecture review (`preferredModel:
|
|
146
|
-
- `$HOME/.claude/agents/android-architect.md` - Android architecture review (`preferredModel:
|
|
147
|
-
- `$HOME/.claude/agents/backend-architect.md` - Backend/API architecture review (`preferredModel:
|
|
145
|
+
- `$HOME/.claude/agents/ios-architect.md` - iOS architecture review (`preferredModel: fable`)
|
|
146
|
+
- `$HOME/.claude/agents/android-architect.md` - Android architecture review (`preferredModel: fable`)
|
|
147
|
+
- `$HOME/.claude/agents/backend-architect.md` - Backend/API architecture review (`preferredModel: fable`)
|
|
148
148
|
- `$HOME/.claude/agents/security-auditor.md` - Security audit (`preferredModel: opus`)
|
|
149
149
|
|
|
150
|
-
**Per-persona model routing:** Before each Agent dispatch, the orchestrator reads `preferredModel` from the persona file and exports `CLAUDE_CODE_SUBAGENT_MODEL` (Claude Code) / passes `--model` (Copilot CLI). Precedence: per-dispatch `PHASE_MODEL_OVERRIDE` > persona `preferredModel` > `
|
|
150
|
+
**Per-persona model routing:** Before each Agent dispatch, the orchestrator reads `preferredModel` from the persona file and exports `CLAUDE_CODE_SUBAGENT_MODEL` (Claude Code) / passes `--model` (Copilot CLI). Precedence: per-dispatch `PHASE_MODEL_OVERRIDE` > persona `preferredModel` > `fable` (falls back per `refs/features/model-fallback.md`). Full contract: `skills/shared/core/multi-agent/SKILL.md#agent-dispatch--per-persona-model-routing-v610`.
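
The precedence chain can be sketched as a three-way fallback. This is a sketch under stated assumptions: the function name is hypothetical, and how `preferredModel` is read from the persona file is left out because the file format is not shown here.

```bash
# Hypothetical helper: resolve the model for one Agent dispatch.
#   $1 per-dispatch override, e.g. the value of PHASE_MODEL_OVERRIDE (may be empty)
#   $2 persona preferredModel (may be empty)
resolve_model() {
  if [ -n "$1" ]; then
    echo "$1"          # the per-dispatch override always wins
  elif [ -n "$2" ]; then
    echo "$2"          # otherwise use the persona preference
  else
    echo "fable"       # default tier; availability fallback is handled elsewhere
  fi
}

resolve_model "${PHASE_MODEL_OVERRIDE:-}" "sonnet"
```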
|
|
151
151
|
|
|
152
152
|
---
|
|
153
153
|
|
|
@@ -247,7 +247,7 @@ When called with `review`:
|
|
|
247
247
|
1. Detect current branch and project from cwd (or ask)
|
|
248
248
|
2. Get diff: `git diff HEAD` (unstaged + staged)
|
|
249
249
|
3. If no diff, get diff against base branch: `git diff origin/{baseBranch}...HEAD`
|
|
250
|
-
4. Launch Phase 4 review (parallel +
|
|
250
|
+
4. Launch Phase 4 review (parallel + Fable triage - 2-model on Claude Code, 3-model on Copilot CLI) on the diff
|
|
251
251
|
5. No worktree, no state file - lightweight one-shot review
|
|
252
252
|
6. Print findings to terminal
|
|
253
253
|
|
|
@@ -183,7 +183,7 @@
|
|
|
183
183
|
"planEditRequests": {
|
|
184
184
|
"type": "array",
|
|
185
185
|
"items": { "type": "string" },
|
|
186
|
-
"description": "v5.3.0 Phase 2 - free-text edit instructions the user typed between plan renders. Preserved verbatim for audit;
|
|
186
|
+
"description": "v5.3.0 Phase 2 - free-text edit instructions the user typed between plan renders. Preserved verbatim for audit; the planning model (Fable top tier) parses them conversationally to revise the plan."
|
|
187
187
|
}
|
|
188
188
|
}
|
|
189
189
|
}
|