@mmerterden/multi-agent-pipeline 10.7.4 → 10.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. package/CHANGELOG.md +93 -0
  2. package/README.md +2 -0
  3. package/docs/engineering.md +76 -0
  4. package/docs/features.md +49 -33
  5. package/package.json +1 -1
  6. package/pipeline/commands/multi-agent/refs/features/verify-by-test.md +41 -0
  7. package/pipeline/commands/multi-agent/refs/phases/log-format.md +10 -0
  8. package/pipeline/commands/multi-agent/refs/phases/operations.md +15 -2
  9. package/pipeline/commands/multi-agent/refs/phases/phase-0-init.md +9 -0
  10. package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md +12 -0
  11. package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md +12 -1
  12. package/pipeline/commands/multi-agent/refs/rules.md +1 -0
  13. package/pipeline/commands/multi-agent/resume.md +7 -4
  14. package/pipeline/commands/multi-agent/review.md +33 -1
  15. package/pipeline/schemas/diff-risk.schema.json +5 -4
  16. package/pipeline/schemas/prefs.schema.json +59 -0
  17. package/pipeline/schemas/token-budget.json +7 -7
  18. package/pipeline/schemas/triage-output.schema.json +35 -4
  19. package/pipeline/scripts/README.md +3 -0
  20. package/pipeline/scripts/diff-risk-score.mjs +11 -1
  21. package/pipeline/scripts/fixtures/diff-risk-test-removal.diff +40 -0
  22. package/pipeline/scripts/fixtures/install-layout.tsv +3 -3
  23. package/pipeline/scripts/smoke-diff-risk.sh +30 -1
  24. package/pipeline/scripts/smoke-handoff-contract.sh +92 -0
  25. package/pipeline/scripts/smoke-update-check.sh +122 -0
  26. package/pipeline/scripts/smoke-verify-by-test.sh +148 -0
  27. package/pipeline/scripts/update-check.sh +82 -0
  28. package/pipeline/scripts/validate-diff-risk.mjs +2 -1
  29. package/pipeline/scripts/validate-triage.mjs +31 -2
@@ -23,10 +23,13 @@ Resume a paused or failed task from the last successful phase.
23
23
  - `haltReason` - if set, show it so the user knows why the run stopped; clear it on successful re-entry
24
24
  - `autopilot` - preserve the mode
25
25
 
26
- 3. **Load context** - read prior-phase findings from `agent-log.md`:
27
- - Phase 1 analysis → use it from Phase 2+
28
- - Phase 2 plan → use it from Phase 3+
29
- - Phase 3 code → already in the worktree
26
+ 3. **Load context** - rebuild working context from durable artifacts, never from conversation memory:
27
+ - **Handoff first (v10.8.0)**: read the LATEST `## Handoff` block in `agent-log.md` - it carries done/remaining/decisions/open-findings and the exact re-entry point (phase + subStep). When present, it is the primary context source; cross-check its `Next:` line against `state.currentPhase` and trust state on mismatch (state is the machine truth, handoff is the narrative).
28
+ - Fall back to per-phase findings for logs written before v10.8 (no handoff blocks):
29
+ - Phase 1 analysis → use it from Phase 2+
30
+ - Phase 2 plan → use it from Phase 3+
31
+ - Phase 3 code → already in the worktree
32
+ - Recent `git log --oneline -10` in the worktree grounds what was actually committed vs. claimed.
30
33
 
31
34
  4. **Continue the pipeline** - start from the next phase (same pipeline as the main multi-agent command).
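The handoff-first rule in Step 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `latestHandoff` and `resolveReentry` are hypothetical names, `state.currentPhase` is assumed to be a number, and the `Next:` line is assumed to contain the text `Phase <n>`. The diff itself only confirms the `## Handoff` heading and the five `- Done:` / `- Remaining:` / `- Decisions:` / `- Open findings:` / `- Next:` lines.

```javascript
// Hypothetical sketch of handoff-first context loading (resume Step 3).
// Assumptions: a handoff block starts at a "## Handoff" heading and runs to
// the next "## " heading; the "Next:" line mentions "Phase <n>".

const FIELDS = ["Done", "Remaining", "Decisions", "Open findings", "Next"];

// Return the fields of the LAST "## Handoff" block in the log, or null.
function latestHandoff(logText) {
  const lines = logText.split("\n");
  let start = -1;
  lines.forEach((line, i) => {
    if (line.startsWith("## Handoff")) start = i; // keep overwriting: latest wins
  });
  if (start === -1) return null; // pre-v10.8 log: caller falls back to per-phase findings

  const block = {};
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) break; // next section ends the block
    for (const field of FIELDS) {
      const prefix = `- ${field}:`;
      if (lines[i].startsWith(prefix)) {
        block[field] = lines[i].slice(prefix.length).trim();
      }
    }
  }
  return block;
}

// State is the machine truth: on mismatch, re-enter at state.currentPhase.
function resolveReentry(handoff, state) {
  if (!handoff) return { phase: state.currentPhase, source: "state" };
  const match = /Phase (\d+)/.exec(handoff["Next"] || "");
  const handoffPhase = match ? Number(match[1]) : null;
  if (handoffPhase === state.currentPhase) {
    return { phase: handoffPhase, source: "handoff" };
  }
  return { phase: state.currentPhase, source: "state" };
}
```

When the two sources agree, the handoff supplies the narrative context. When they disagree, the sketch keeps only the phase recorded in state, mirroring the "trust state on mismatch" rule.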
32
35
 
@@ -109,6 +109,38 @@ The `credential-store.sh` wrapper handles macOS Keychain (`security`), Linux lib
109
109
 
110
110
  Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers can re-read it.
111
111
 
112
+ ### 2b. Module review guides - path-scoped convention files
113
+
114
+ A module in the repo may carry its own CLAUDE guide - a convention/checklist file living somewhere in the module's directory tree that the host CLI never auto-loads. When a changed file's module has such a guide, the review must consult it. Discovery is deterministic, from the diff's changed paths:
115
+
116
+ ```bash
117
+ # Changed file paths from the patch:
118
+ grep -E '^\+\+\+ b/' "$DIFF_FILE" | sed 's|^+++ b/||' | sort -u > /tmp/multi-agent-review-${TASK_ID}-paths.txt
119
+
120
+ # For each changed path, walk its directory chain up to the repo root and
121
+ # collect guide files matching: CLAUDE.md, *-CLAUDE.md, AGENTS.md.
122
+ # Root-level CLAUDE.md/AGENTS.md are excluded - the host CLI already loads those.
123
+ guides=()
124
+ while IFS= read -r p; do
125
+ d=$(dirname "$p")
126
+ while [ "$d" != "." ] && [ "$d" != "/" ]; do
127
+ for g in "$d"/CLAUDE.md "$d"/*-CLAUDE.md "$d"/AGENTS.md; do
128
+ [ -e "$g" ] && guides+=("$g")
129
+ done
130
+ d=$(dirname "$d")
131
+ done
132
+ done < /tmp/multi-agent-review-${TASK_ID}-paths.txt
133
+ # dedupe, cap at 5 (log any dropped so truncation is never silent)
134
+ ```
135
+
136
+ Existence checks are resolved against the local checkout when the cwd is the target repo. In PR mode without a local checkout, probe the candidate paths via the provider API instead (`gh api /repos/{o}/{r}/contents/{path}?ref={headSha}` / Bitbucket `GET /projects/{KEY}/repos/{slug}/browse/{path}?at={headSha}`) and fetch the matching files' raw content the same way. No hit → step is a silent no-op.
137
+
138
+ Persist `agent-state.review.moduleGuides = [<repo-relative paths>]` and inject into every reviewer prompt (Step 3):
139
+
140
+ > MODULE REVIEW GUIDES: before reviewing, read each of these guide files. Apply a guide's rules/checklist to every changed file under its directory. Guide violations are findings like any other - triage them by severity.
141
+
142
+ Scope note: a guide governs only files under its own directory - a guide found under one module must not be applied to a sibling module's changes in the same PR.
143
+
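The shell snippet in Step 2b leaves deduplication and the cap of 5 as a comment, and the scope note is prose only. A minimal sketch of both follows, under stated assumptions: `capGuides` and `guidesFor` are hypothetical names, paths are repo-relative with `/` separators, and the choice of which guides to drop when the cap is hit (here, the ones discovered last) is mine rather than something the diff specifies.

```javascript
// Hypothetical helpers for Step 2b; not part of the package.

// Dedupe in discovery order, keep at most `cap`, and report what was dropped
// so truncation is never silent.
function capGuides(guides, cap = 5) {
  const unique = [...new Set(guides)];
  return { kept: unique.slice(0, cap), dropped: unique.slice(cap) };
}

// A guide governs only files under its own directory, so a guide found in one
// module is never applied to a sibling module's changes.
function guidesFor(changedPath, guides) {
  return guides.filter((guide) => {
    const dir = guide.slice(0, guide.lastIndexOf("/") + 1); // e.g. "modules/auth/"
    return dir !== "" && changedPath.startsWith(dir); // "" = root-level guide, excluded
  });
}
```

Keeping the trailing slash on the directory prefix keeps `modules/auth/` from matching a path under `modules/authz/`.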
112
144
  ### 3. Launch parallel reviewers - host-CLI dependent
113
145
 
114
146
  **Claude Code (2 in parallel):**
@@ -120,7 +152,7 @@ Save the diff to `/tmp/multi-agent-review-${TASK_ID}-diff.patch` so reviewers ca
120
152
  - Agent 2: `gpt-5.4` → edge cases, alternate perspective
121
153
  - Agent 3: `claude-sonnet-4-6` → general quality
122
154
 
123
- Each reviewer receives the diff plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
155
+ Each reviewer receives the diff, the module review guides from Step 2b (when any were found), plus the standard reviewer system prompt (see `refs/phases/phase-4-review.md` for the prompt contract). Output: structured `findings[]` per reviewer.
124
156
 
125
157
  ### 4. Store-compliance cross-reference
126
158
 
@@ -1,16 +1,16 @@
1
1
  {
2
2
  "$schema": "https://json-schema.org/draft/2020-12/schema",
3
3
  "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/diff-risk.schema.json",
4
- "version": "1.0.0",
4
+ "version": "1.1.0",
5
5
  "title": "Multi-Agent Pipeline - Phase 4 diff risk score",
6
- "description": "Output contract for diff-risk-score.mjs. Heuristic, deterministic, no LLM. Produced before Phase 4 Step 2 to give reviewer prompts a priority ordering - never used as a gate.",
6
+ "description": "Output contract for diff-risk-score.mjs. Heuristic, deterministic, no LLM. Produced before Phase 4 Step 2 to give reviewer prompts a priority ordering - never used as a gate. v1.1.0 adds the test_lines_removed signal (immutable-test backstop: a test file whose diff removes more lines than it adds).",
7
7
  "type": "object",
8
8
  "additionalProperties": false,
9
9
  "required": ["schemaVersion", "task", "totals", "files"],
10
10
  "properties": {
11
11
  "schemaVersion": {
12
12
  "type": "string",
13
- "const": "1.0.0"
13
+ "const": "1.1.0"
14
14
  },
15
15
  "task": {
16
16
  "type": "object",
@@ -63,7 +63,8 @@
63
63
  "no_test_change",
64
64
  "complexity_delta",
65
65
  "ui_critical",
66
- "migration"
66
+ "migration",
67
+ "test_lines_removed"
67
68
  ]
68
69
  },
69
70
  "weight": { "type": "number" },
@@ -701,6 +701,65 @@
701
701
  "default": false,
702
702
  "description": "v6.1.0+ \u2014 Phase 4 Step 2.5 rebuttal round. When reviewers disagree (mixed blocker/approved verdict), each reviewer is re-prompted with the others' opposing arguments for one additional round before triage. Lifts signal quality on ambiguous findings at ~1\u00d7 Step 2 token cost. Off by default \u2014 flip for security-critical or release-branch reviews."
703
703
  },
704
+ "updateCheck": {
705
+ "type": "object",
706
+ "additionalProperties": false,
707
+ "description": "v10.9+ - Phase 0 Step 0.6 advisory version check. Once per ttlHours window, a bounded (3s) registry read compares the installed version against dist-tags.latest. Newer version found: interactive modes ask 'Update now / Continue' (yes runs the /multi-agent:update flow, then the run continues); autopilot logs one line and never asks. Offline/failed checks are silent; the step never blocks the pipeline.",
708
+ "properties": {
709
+ "enabled": {
710
+ "type": "boolean",
711
+ "default": true,
712
+ "description": "Master switch. On by default - the cost is at most one 3s-bounded curl per ttlHours."
713
+ },
714
+ "ttlHours": {
715
+ "type": "integer",
716
+ "minimum": 1,
717
+ "maximum": 168,
718
+ "default": 24,
719
+ "description": "Cache window for the registry read."
720
+ },
721
+ "autoUpdate": {
722
+ "type": "boolean",
723
+ "default": false,
724
+ "description": "When true, skip the question and run the update flow automatically before the run starts (interactive AND autopilot). Off by default - self-modifying ~/.claude without asking is a surprise."
725
+ }
726
+ }
727
+ },
728
+ "verifyByTest": {
729
+ "type": "object",
730
+ "additionalProperties": false,
731
+ "description": "v10.8+ - Phase 4 Step 3.7 verify-by-test. When enabled, accepted BLOCKING findings are empirically validated before the Phase 3 rework loop: one verifier agent writes a minimal repro test per finding and runs only that test. Confirmed findings hand their failing test to Phase 3 as the RED step; non-reproducible findings are downgraded to deferred under evidence-gate. Only blocking findings are ever verified (fixed behavior, not a knob). Adds one model call plus up to maxFindings single-test runs per iteration with accepted blockers; default off. Flip on for security-critical work, release branches, or repos with noisy reviewers. Full spec: refs/features/verify-by-test.md.",
732
+ "properties": {
733
+ "enabled": {
734
+ "type": "boolean",
735
+ "default": false,
736
+ "description": "Master switch."
737
+ },
738
+ "maxFindings": {
739
+ "type": "integer",
740
+ "minimum": 1,
741
+ "maximum": 10,
742
+ "default": 3,
743
+ "description": "Max accepted blocking findings verified per review iteration. Findings beyond the cap keep their judgment-only verdict."
744
+ },
745
+ "model": {
746
+ "type": "string",
747
+ "enum": [
748
+ "sonnet",
749
+ "opus"
750
+ ],
751
+ "default": "sonnet",
752
+ "description": "Verifier agent model. Writing a minimal repro test is mechanical work; Sonnet is the cost-sane default."
753
+ },
754
+ "stepTimeoutSec": {
755
+ "type": "integer",
756
+ "minimum": 60,
757
+ "maximum": 1800,
758
+ "default": 600,
759
+ "description": "Wall-clock budget for the whole Step 3.7 pass. On breach, remaining findings keep judgment-only verdicts and the pipeline proceeds (never blocks)."
760
+ }
761
+ }
762
+ },
704
763
  "review": {
705
764
  "type": "object",
706
765
  "additionalProperties": false,
@@ -3,15 +3,15 @@
3
3
  "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/token-budget.json",
4
4
  "description": "Per-phase token budget for lazy-loaded pipeline docs. Enforced by smoke-token-budget.sh.",
5
5
  "phases": {
6
- "phase-0-init": { "max_tokens": 11250, "warn_tokens": 9900 },
6
+ "phase-0-init": { "max_tokens": 12400, "warn_tokens": 10900 },
7
7
  "phase-1-analysis": { "max_tokens": 3750, "warn_tokens": 3300 },
8
8
  "phase-2-planning": { "max_tokens": 6500, "warn_tokens": 5750 },
9
- "phase-3-dev": { "max_tokens": 7650, "warn_tokens": 6750 },
10
- "phase-4-review": { "max_tokens": 11100, "warn_tokens": 9750 },
11
- "phase-5-test": { "max_tokens": 2300, "warn_tokens": 2000 },
12
- "phase-6-commit": { "max_tokens": 5550, "warn_tokens": 4900 },
9
+ "phase-3-dev": { "max_tokens": 7900, "warn_tokens": 6950 },
10
+ "phase-4-review": { "max_tokens": 13250, "warn_tokens": 11650 },
11
+ "phase-5-test": { "max_tokens": 2550, "warn_tokens": 2250 },
12
+ "phase-6-commit": { "max_tokens": 6150, "warn_tokens": 5400 },
13
13
  "phase-7-report": { "max_tokens": 5600, "warn_tokens": 4950 }
14
14
  },
15
- "total_max_tokens": 46500,
16
- "note": "Token estimate = ceil(chars / 4). Per-phase budget rule: warn = current+10% (rounded to nearest 50), max = current+25%. Gives ~6 edit cycles of headroom before warn trips - intentionally quiet under normal maintenance, loud when a phase grows unusually. Only the active phase is loaded (lazy). Recalibrated at v10.0.0 after the validator/consistency/simplifier/lesson gate contracts landed in phases 1-4 (prose already compressed; the residual growth is the gate contracts themselves)."
15
+ "total_max_tokens": 50000,
16
+ "note": "Token estimate = ceil(chars / 4). Per-phase budget rule: warn = current+10% (rounded to nearest 50), max = current+25%. Gives ~6 edit cycles of headroom before warn trips - intentionally quiet under normal maintenance, loud when a phase grows unusually. Only the active phase is loaded (lazy). Recalibrated at v10.0.0 after the validator/consistency/simplifier/lesson gate contracts landed in phases 1-4. Recalibrated again at v10.9.0 after the verify-by-test (Phase 4 Step 3.7), update-check (Phase 0 Step 0.6), immutable-test (Phase 3 GREEN) and redTests re-entry contracts landed - Step 3.7 prose was compressed to a pointer into refs/features/verify-by-test.md before the recalibration."
17
17
  }
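The budget rule quoted in the `note` field can be written out as a small calculation. This is a sketch only: `budgetFor` is a hypothetical name, and applying the "rounded to nearest 50" step to both thresholds is an assumption, since the note attaches it explicitly only to `warn`.

```javascript
// Sketch of the per-phase budget rule from the "note" field.
// Assumption: rounding to the nearest 50 applies to warn AND max.

const roundTo50 = (n) => Math.round(n / 50) * 50;

function budgetFor(chars) {
  const tokens = Math.ceil(chars / 4); // token estimate = ceil(chars / 4)
  return {
    tokens,
    warn_tokens: roundTo50((tokens * 110) / 100), // current + 10%
    max_tokens: roundTo50((tokens * 125) / 100), // current + 25%
  };
}
```

For a hypothetical 42,400-character phase doc this yields 10,600 tokens, a warn threshold of 11,650 and a max of 13,250. Those thresholds coincide with the new `phase-4-review` row, although the actual character count of that file is not shown in the diff.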
@@ -1,9 +1,9 @@
1
1
  {
2
2
  "$schema": "https://json-schema.org/draft/2020-12/schema",
3
3
  "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/triage-output.schema.json",
4
- "version": "3.1.0",
4
+ "version": "3.2.0",
5
5
  "title": "Multi-Agent Pipeline - Phase 4 triage output",
6
- "description": "Contract for the Opus triage agent's JSON output in Phase 4 Step 3. Triage consumes merged reviewer findings and splits them into accepted/deferred/rejected. Only `accepted` blocking/important items trigger Phase 3 rework. v3.1.0 adds the optional `consensus` block so triage can surface reviewer-agreement risk (false consensus among same-base-model reviewers) instead of silently merging.",
6
+ "description": "Contract for the Opus triage agent's JSON output in Phase 4 Step 3. Triage consumes merged reviewer findings and splits them into accepted/deferred/rejected. Only `accepted` blocking/important items trigger Phase 3 rework. v3.1.0 adds the optional `consensus` block so triage can surface reviewer-agreement risk (false consensus among same-base-model reviewers) instead of silently merging. v3.2.0 adds the optional per-finding `verification` block written by Phase 4 Step 3.7 (verify-by-test): the empirical repro-test outcome for accepted blocking findings.",
7
7
  "type": "object",
8
8
  "additionalProperties": false,
9
9
  "required": ["accepted", "deferred", "rejected", "approved"],
@@ -114,6 +114,35 @@
114
114
  }
115
115
  }
116
116
  },
117
+ "verification": {
118
+ "type": "object",
119
+ "additionalProperties": false,
120
+ "description": "v3.2.0 verify-by-test outcome (Phase 4 Step 3.7, opt-in via prefs.global.verifyByTest). confirmed = repro test failed as the finding predicts (finding stands, test kept as the Phase 3 RED test); not-reproduced = repro test passed under evidence-gate (finding downgraded to deferred); inconclusive = compile error / timeout / not unit-testable (judgment verdict stands).",
121
+ "required": ["result"],
122
+ "properties": {
123
+ "result": {
124
+ "type": "string",
125
+ "enum": ["confirmed", "not-reproduced", "inconclusive"]
126
+ },
127
+ "testRef": {
128
+ "type": "string",
129
+ "minLength": 1,
130
+ "description": "Single-test reference, e.g. 'AuthTests/LoginTests/testExpiredTokenRejected' or 'tests/test_auth.py::test_expired_token'."
131
+ },
132
+ "evidencePath": {
133
+ "type": "string",
134
+ "minLength": 1,
135
+ "description": "Path to the test-run log verified by evidence-gate.mjs, e.g. '.pipeline/verify-1.test.log'."
136
+ },
137
+ "note": { "type": "string" }
138
+ },
139
+ "if": {
140
+ "properties": { "result": { "enum": ["confirmed", "not-reproduced"] } }
141
+ },
142
+ "then": {
143
+ "required": ["result", "testRef", "evidencePath"]
144
+ }
145
+ },
117
146
  "rawFinding": {
118
147
  "type": "object",
119
148
  "additionalProperties": false,
@@ -124,7 +153,8 @@
124
153
  "line": { "type": "integer", "minimum": 0 },
125
154
  "issue": { "type": "string", "minLength": 4 },
126
155
  "fix": { "type": "string" },
127
- "reviewer": { "$ref": "#/$defs/reviewer" }
156
+ "reviewer": { "$ref": "#/$defs/reviewer" },
157
+ "verification": { "$ref": "#/$defs/verification" }
128
158
  }
129
159
  },
130
160
  "acceptedFinding": {
@@ -144,7 +174,8 @@
144
174
  "type": "string",
145
175
  "minLength": 4,
146
176
  "description": "Concrete change the dev agent must make. Required for accepted items so Phase 3 re-entry has actionable direction."
147
- }
177
+ },
178
+ "verification": { "$ref": "#/$defs/verification" }
148
179
  }
149
180
  }
150
181
  ]
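The conditional in the `verification` definition (`testRef` and `evidencePath` become required once `result` is `confirmed` or `not-reproduced`) can be mirrored by a hand-written check. The helper below is a hypothetical, partial sketch: `verificationErrors` is an illustrative name, it does not type-check `note` or optional fields on `inconclusive` results, and the package's own validator (`validate-triage.mjs`) is not shown in this diff.

```javascript
// Hand-rolled check mirroring the `verification` definition's if/then rule.
// Hypothetical helper; covers only the rules named in the comments.

const RESULTS = ["confirmed", "not-reproduced", "inconclusive"];
const ALLOWED = ["result", "testRef", "evidencePath", "note"];

function verificationErrors(v) {
  const errors = [];
  for (const key of Object.keys(v)) {
    // additionalProperties: false
    if (!ALLOWED.includes(key)) errors.push(`unknown property: ${key}`);
  }
  if (!RESULTS.includes(v.result)) {
    errors.push("result must be one of " + RESULTS.join(", "));
  }
  // confirmed / not-reproduced mean a test actually ran, so the test
  // reference and the evidence log are both required (minLength 1).
  if (v.result === "confirmed" || v.result === "not-reproduced") {
    for (const key of ["testRef", "evidencePath"]) {
      if (typeof v[key] !== "string" || v[key].length < 1) {
        errors.push(`${key} is required for ${v.result}`);
      }
    }
  }
  return errors;
}
```

An `inconclusive` result passes with nothing but `result` (and optionally `note`), which matches the schema's intent that compile errors or timeouts leave the judgment verdict standing without a test reference.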
@@ -22,6 +22,9 @@ Validate contracts. Each emits `══ <name> smoke: N passed, M failed ══`
22
22
  - `smoke-phase-6-multi.sh` - Phase 6 multi-repo commit/PR cross-linking
23
23
  - `smoke-phase-banner.sh` + `smoke-phase-tracker.sh` - Phase UI output contracts
24
24
  - `smoke-phase4-triage.sh` - Phase 4 reviewer → triage flow
25
+ - `smoke-verify-by-test.sh` - Phase 4 Step 3.7 verify-by-test contract (v10.8.0)
26
+ - `smoke-handoff-contract.sh` - phase-boundary structured handoff + handoff-first resume (v10.8.0)
27
+ - `smoke-update-check.sh` - Phase 0 Step 0.6 advisory update-check contract (v10.9.0)
25
28
 
26
29
  ### Schema + state
27
30
  - `smoke-schema-validation.sh` - all JSON schemas validate
@@ -15,6 +15,7 @@
15
15
  * complexity_delta - added if/guard/case/switch/while count w=1.5
16
16
  * ui_critical - *View.swift / *Screen.kt / Configuration w=1.5
17
17
  * migration - DB schema / migration path w=4.0
18
+ * test_lines_removed - test file shrinks (removed > added) w=3.0
18
19
  *
19
20
  * Inputs:
20
21
  * --base <ref> Base ref. Default: origin/main, fallback: main
@@ -275,6 +276,15 @@ function buildRow(stat, addedLines, allChangedPaths) {
275
276
  }
276
277
  }
277
278
 
279
+ // Test-lines-removed: a test-classified file whose diff removes more lines
280
+ // than it adds. Shrinking tests is the classic get-to-green shortcut the
281
+ // immutable-test rule forbids (refs/rules.md); surface it to reviewers.
282
+ if (isTestPath(path) && stat.removed > stat.added) {
283
+ const w = 3.0;
284
+ signals.push({ name: "test_lines_removed", weight: w, value: stat.removed - stat.added });
285
+ score += 12 * w;
286
+ }
287
+
278
288
  return {
279
289
  path,
280
290
  score: Math.round(score * 100) / 100,
@@ -306,7 +316,7 @@ function main() {
306
316
  };
307
317
 
308
318
  const out = {
309
- schemaVersion: "1.0.0",
319
+ schemaVersion: "1.1.0",
310
320
  task: {
311
321
  id: TASK_ID,
312
322
  base: BASE || "(diff-file)",
@@ -0,0 +1,40 @@
1
+ diff --git a/MyAppTests/LoginViewModelTests.swift b/MyAppTests/LoginViewModelTests.swift
2
+ index 1111111..2222222 100644
3
+ --- a/MyAppTests/LoginViewModelTests.swift
4
+ +++ b/MyAppTests/LoginViewModelTests.swift
5
+ @@ -10,30 +10,20 @@ final class LoginViewModelTests: XCTestCase {
6
+ func testLoginWithValidCredentials_Succeeds() {
7
+ let sut = LoginViewModel(service: MockAuthService())
8
+ + sut.retryPolicy = .none
9
+ sut.login(email: "user@example.com", password: "correct")
10
+ + XCTAssertTrue(sut.isAuthenticated)
11
+ }
12
+ -
13
+ - func testLoginWithInvalidEmail_ShowsError() {
14
+ - let sut = LoginViewModel(service: MockAuthService())
15
+ - sut.login(email: "not-an-email", password: "irrelevant")
16
+ - XCTAssertEqual(sut.errorMessage, "Invalid email")
17
+ - }
18
+ -
19
+ - func testLoginWithExpiredToken_Rejects() {
20
+ - let sut = LoginViewModel(service: MockAuthService(tokenState: .expired))
21
+ - sut.login(email: "user@example.com", password: "correct")
22
+ - XCTAssertFalse(sut.isAuthenticated)
23
+ - }
24
+ -
25
+ - func testLogout_ClearsSession() {
26
+ - let sut = LoginViewModel(service: MockAuthService())
27
+ - sut.logout()
28
+ - XCTAssertNil(sut.session)
29
+ - }
30
+ }
31
+ diff --git a/MyApp/Sources/Auth/LoginViewModel.swift b/MyApp/Sources/Auth/LoginViewModel.swift
32
+ index 3333333..4444444 100644
33
+ --- a/MyApp/Sources/Auth/LoginViewModel.swift
34
+ +++ b/MyApp/Sources/Auth/LoginViewModel.swift
35
+ @@ -20,6 +20,8 @@ final class LoginViewModel {
36
+ func login(email: String, password: String) {
37
+ + guard email.contains("@") else { return }
38
+ + service.authenticate(email: email, password: password)
39
+ }
40
+ }
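To make the `test_lines_removed` arithmetic concrete, the sketch below counts added and removed lines per file in a unified diff and applies the condition from `buildRow` (test-classified path, `removed > added`, weight 3.0, score contribution `12 * w`). The names `diffStats` and `testLinesRemovedSignal` are hypothetical, and `isTestPath` here is a stand-in pattern: the real classifier in `diff-risk-score.mjs` is not shown in this diff.

```javascript
// Sketch of the test_lines_removed signal; not the package's code.

// Stand-in classifier (assumption): a directory or file name ending in Test(s).
function isTestPath(path) {
  return /(^|\/)[^/]*Tests?\//.test(path) || /Tests?\.[A-Za-z]+$/.test(path);
}

// Count "+" / "-" body lines per file in a unified diff. Header lines
// ("--- a/...", "+++ b/...") appear before the first "@@" and are not counted.
function diffStats(diffText) {
  const stats = {};
  let current = null;
  let inHunk = false;
  for (const line of diffText.split("\n")) {
    if (line.startsWith("diff --git ")) {
      current = null;
      inHunk = false;
    } else if (!inHunk && line.startsWith("+++ b/")) {
      current = line.slice("+++ b/".length);
      stats[current] = { added: 0, removed: 0 };
    } else if (line.startsWith("@@")) {
      inHunk = current !== null;
    } else if (inHunk && line.startsWith("+")) {
      stats[current].added += 1;
    } else if (inHunk && line.startsWith("-")) {
      stats[current].removed += 1;
    }
  }
  return stats;
}

// Mirrors the condition in buildRow: only test files that shrink.
function testLinesRemovedSignal(path, stat) {
  if (!isTestPath(path) || stat.removed <= stat.added) return null;
  const weight = 3.0;
  return {
    signal: { name: "test_lines_removed", weight, value: stat.removed - stat.added },
    scoreDelta: 12 * weight,
  };
}
```

On the fixture above, `MyAppTests/LoginViewModelTests.swift` has 18 removed and 2 added lines, so the signal value is 16 (the number smoke check 11 expects) and the score contribution is 12 × 3.0 = 36. The source file adds more than it removes and is not test-classified, so it gets no signal.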
@@ -1,16 +1,16 @@
1
1
  .claude/CLAUDE.md 1
2
2
  .claude/agents 8
3
- .claude/commands 88
3
+ .claude/commands 89
4
4
  .claude/lib 23
5
5
  .claude/multi-agent-preferences.json 1
6
6
  .claude/rules 12
7
7
  .claude/schemas 23
8
- .claude/scripts 167
8
+ .claude/scripts 171
9
9
  .claude/settings.json 1
10
10
  .claude/skills 560
11
11
  .copilot/agents 8
12
12
  .copilot/copilot-instructions.md 1
13
13
  .copilot/lib 23
14
14
  .copilot/schemas 23
15
- .copilot/scripts 167
15
+ .copilot/scripts 171
16
16
  .copilot/skills 596
@@ -12,6 +12,7 @@
12
12
  # 8. phase-4-review.md ref doc declares Step 1.75 + diff-risk-score.mjs
13
13
  # 9. code-reviewer.md agent template carries the priority-files placeholder
14
14
  # 10. prefs.schema.json exposes diffRisk advisory toggle
15
+ # 11. test-removal fixture fires the test_lines_removed signal (v1.1.0)
15
16
  #
16
17
  # Exit 0 = all pass, 1 = any failure.
17
18
 
@@ -26,6 +27,7 @@ REVIEWER="$ROOT/pipeline/agents/code-reviewer.md"
26
27
  PREFS="$ROOT/pipeline/schemas/prefs.schema.json"
27
28
  FIX_IOS="$ROOT/pipeline/scripts/fixtures/diff-risk-ios.diff"
28
29
  FIX_AND="$ROOT/pipeline/scripts/fixtures/diff-risk-android.diff"
30
+ FIX_TESTRM="$ROOT/pipeline/scripts/fixtures/diff-risk-test-removal.diff"
29
31
 
30
32
  pass=0
31
33
  fail=0
@@ -38,10 +40,11 @@ printf '→ smoke-diff-risk (v8.3.0): pre-review risk scoring contract\n'
38
40
  [ -f "$SCHEMA" ] || { record_fail "schema missing: $SCHEMA"; exit 1; }
39
41
  [ -f "$FIX_IOS" ] || { record_fail "fixture missing: $FIX_IOS"; exit 1; }
40
42
  [ -f "$FIX_AND" ] || { record_fail "fixture missing: $FIX_AND"; exit 1; }
43
+ [ -f "$FIX_TESTRM" ] || { record_fail "fixture missing: $FIX_TESTRM"; exit 1; }
41
44
 
42
45
  # --- 1: iOS fixture produces JSON ---
43
46
  out_ios=$(node "$SCORE" --diff "$FIX_IOS" 2>/dev/null)
44
- if jq -e '.schemaVersion == "1.0.0"' <<< "$out_ios" >/dev/null 2>&1; then
47
+ if jq -e '.schemaVersion == "1.1.0"' <<< "$out_ios" >/dev/null 2>&1; then
45
48
  record_pass "iOS fixture renders schema-versioned JSON"
46
49
  else
47
50
  record_fail "iOS fixture JSON malformed or missing schemaVersion"
@@ -150,6 +153,32 @@ else
150
153
  record_fail "prefs.schema.json missing global.diffRiskAdvisory"
151
154
  fi
152
155
 
156
+ # --- 11: test_lines_removed signal fires on the test-removal fixture ---
157
+ out_testrm=$(node "$SCORE" --diff "$FIX_TESTRM" 2>/dev/null)
158
+ sig_value=$(jq -r '.files[] | select(.path == "MyAppTests/LoginViewModelTests.swift")
159
+ | .signals[] | select(.name == "test_lines_removed") | .value' <<< "$out_testrm")
160
+ if [ "$sig_value" = "16" ]; then
161
+ record_pass "test_lines_removed fires with value=16 (18 removed - 2 added)"
162
+ else
163
+ record_fail "test_lines_removed should fire with value=16, got: ${sig_value:-missing}"
164
+ fi
165
+ sig_on_source=$(jq -r '[.files[] | select(.path == "MyApp/Sources/Auth/LoginViewModel.swift")
166
+ | .signals[] | select(.name == "test_lines_removed")] | length' <<< "$out_testrm")
167
+ if [ "$sig_on_source" = "0" ]; then
168
+ record_pass "test_lines_removed does not fire on source files"
169
+ else
170
+ record_fail "test_lines_removed must only fire on test-classified paths"
171
+ fi
172
+ set +e
173
+ echo "$out_testrm" | node "$VALIDATE" - >/dev/null 2>&1
174
+ rc_testrm=$?
175
+ set -e
176
+ if [ "$rc_testrm" -eq 0 ]; then
177
+ record_pass "validator accepts output carrying test_lines_removed"
178
+ else
179
+ record_fail "validator rejected test_lines_removed output (rc=$rc_testrm)"
180
+ fi
181
+
153
182
  # --- Summary ---
154
183
  total=$((pass + fail))
155
184
  printf '\n→ smoke-diff-risk: %d/%d passed\n' "$pass" "$total"
@@ -0,0 +1,92 @@
1
+ #!/usr/bin/env bash
2
+ # smoke-handoff-contract.sh
3
+ #
4
+ # Verifies the v10.8.0 structured-handoff contract (fresh-context re-entry):
5
+ # 1. operations.md documents the Handoff block with all 5 required lines
6
+ # 2. operations.md compaction trigger re-reads state AND the latest handoff
7
+ # 3. log-format.md documents the Handoff section in the canonical log shape
8
+ # 4. resume.md Step 3 reads the latest handoff FIRST with pre-v10.8 fallback
9
+ #
10
+ # Exit 0 = all pass, 1 = any failure.
11
+
12
+ set -euo pipefail
13
+
14
+ ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
15
+ OPS="$ROOT/pipeline/commands/multi-agent/refs/phases/operations.md"
16
+ LOGFMT="$ROOT/pipeline/commands/multi-agent/refs/phases/log-format.md"
17
+ RESUME="$ROOT/pipeline/commands/multi-agent/resume.md"
18
+
19
+ pass=0
20
+ fail=0
21
+ failures=()
22
+ record_pass() { pass=$((pass + 1)); printf ' \033[0;32mPASS\033[0m %s\n' "$1"; }
23
+ record_fail() { fail=$((fail + 1)); failures+=("$1"); printf ' \033[0;31mFAIL\033[0m %s\n' "$1"; }
24
+
25
+ printf '→ smoke-handoff-contract: structured handoff (fresh-context re-entry)\n'
26
+
27
+ # 1. operations.md documents the Handoff block with the 5 required lines
28
+ if [ ! -f "$OPS" ]; then
29
+ record_fail "operations.md missing"
30
+ else
31
+ if grep -qF "Handoff block (v10.8.0)" "$OPS"; then
32
+ record_pass "operations.md documents the Handoff block"
33
+ else
34
+ record_fail "operations.md missing 'Handoff block (v10.8.0)' spec"
35
+ fi
36
+ for line in "- Done:" "- Remaining:" "- Decisions:" "- Open findings:" "- Next:"; do
37
+ if grep -qF -- "$line" "$OPS"; then
38
+ record_pass "operations.md handoff spec has '$line'"
39
+ else
40
+ record_fail "operations.md handoff spec missing '$line'"
41
+ fi
42
+ done
43
+ if grep -qF "no agent dispatch, no extra LLM call" "$OPS"; then
44
+ record_pass "operations.md states handoff is orchestrator-written (no LLM call)"
45
+ else
46
+ record_fail "operations.md must state the handoff costs no LLM call"
47
+ fi
48
+ fi
49
+
50
+ # 2. Compaction trigger re-reads state AND latest handoff
51
+ if grep -qE 'agent-state\.json.*AND the latest.*Handoff' "$OPS"; then
52
+ record_pass "compaction trigger re-reads state + latest handoff"
53
+ else
54
+ record_fail "operations.md compaction trigger must re-read agent-state.json AND the latest Handoff block"
55
+ fi
56
+
57
+ # 3. log-format.md documents the Handoff section
58
+ if grep -qF "## Handoff - end of Phase" "$LOGFMT"; then
59
+ record_pass "log-format.md documents the Handoff section"
60
+ else
61
+ record_fail "log-format.md missing the Handoff section"
62
+ fi
63
+ if grep -qF "LATEST block is authoritative" "$LOGFMT"; then
64
+ record_pass "log-format.md states latest-block-wins semantics"
65
+ else
66
+ record_fail "log-format.md must state the latest handoff block is authoritative"
67
+ fi
68
+
69
+ # 4. resume.md reads handoff first, with fallback for older logs
70
+ if grep -qE 'LATEST .?## Handoff.? block' "$RESUME"; then
71
+ record_pass "resume.md Step 3 reads the latest Handoff block first"
72
+ else
73
+ record_fail "resume.md Step 3 must read the latest Handoff block first"
74
+ fi
75
+ if grep -qiF "fall back to per-phase findings" "$RESUME"; then
76
+ record_pass "resume.md keeps the pre-v10.8 per-phase fallback"
77
+ else
78
+ record_fail "resume.md must keep the pre-v10.8 per-phase findings fallback"
79
+ fi
80
+ if grep -qF "trust state on mismatch" "$RESUME"; then
81
+ record_pass "resume.md defines state-wins conflict rule"
82
+ else
83
+ record_fail "resume.md must define the handoff-vs-state conflict rule (state wins)"
84
+ fi
85
+
86
+ printf '\n══ handoff-contract smoke: %d passed, %d failed ══\n' "$pass" "$fail"
87
+ if [ "$fail" -gt 0 ]; then
88
+ printf '\nFailures:\n'
89
+ for msg in "${failures[@]}"; do printf ' - %s\n' "$msg"; done
90
+ exit 1
91
+ fi
92
+ exit 0
@@ -0,0 +1,122 @@
1
+ #!/usr/bin/env bash
2
+ # smoke-update-check.sh
3
+ #
4
+ # Verifies the Phase 0 Step 0.6 advisory update-check contract:
5
+ # 1. update-check.sh exists, parses (bash -n), and honors the advisory contract
6
+ # offline: exit 0 + empty stdout when the registry is unreachable
7
+ # 2. Cached path: fresh cache short-circuits without a network call
8
+ # 3. Newer latest -> "<local>|<latest>"; same or older latest -> silent
9
+ # 4. prefs.schema.json exposes updateCheck.{enabled,ttlHours,autoUpdate}
10
+ # with the documented defaults (enabled=true, autoUpdate=false)
11
+ # 5. phase-0-init.md documents Step 0.6 with the autopilot log-only rule
12
+ #
13
+ # Exit 0 = all pass, 1 = any failure.
14
+
15
+ set -euo pipefail
16
+
17
+ ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
18
+ SCRIPT="$ROOT/pipeline/scripts/update-check.sh"
19
+ PREFS_SCHEMA="$ROOT/pipeline/schemas/prefs.schema.json"
20
+ PHASE0_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-0-init.md"
21
+
22
+ pass=0
23
+ fail=0
24
+ failures=()
25
+ record_pass() { pass=$((pass + 1)); printf ' \033[0;32mPASS\033[0m %s\n' "$1"; }
26
+ record_fail() { fail=$((fail + 1)); failures+=("$1"); printf ' \033[0;31mFAIL\033[0m %s\n' "$1"; }
27
+
28
+ printf '→ smoke-update-check: Phase 0 Step 0.6 advisory contract\n'
29
+
30
+ tmpdir=$(mktemp -d)
31
+ trap 'rm -rf "$tmpdir"' EXIT
32
+ CACHE="$tmpdir/update-check-cache"
33
+
34
+ # 1. Script parses
35
+ if bash -n "$SCRIPT" 2>/dev/null; then
36
+ record_pass "update-check.sh parses (bash -n)"
37
+ else
38
+ record_fail "update-check.sh has syntax errors"
39
+ fi
40
+
41
+ now=$(date +%s)
42
+
43
+ # 2. Fresh cache with newer latest -> reports, no network needed
44
+ printf '%s|99.0.0\n' "$now" > "$CACHE"
45
+ out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
46
+ if [ "$rc" -eq 0 ] && [ "$out" = "10.0.0|99.0.0" ]; then
47
+ record_pass "newer cached latest -> '<local>|<latest>' (exit 0)"
48
+ else
49
+ record_fail "newer cached latest should print '10.0.0|99.0.0' (got '$out', rc=$rc)"
50
+ fi
51
+
52
+ # 3. Same version -> silent
53
+ printf '%s|10.0.0\n' "$now" > "$CACHE"
54
+ out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
55
+ if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
56
+ record_pass "same version -> silent"
57
+ else
58
+ record_fail "same version should be silent (got '$out', rc=$rc)"
59
+ fi
60
+
61
+ # 3b. Local ahead of registry (dev machine) -> silent
62
+ printf '%s|10.0.0\n' "$now" > "$CACHE"
63
+ out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.1.0 2>/dev/null); rc=$?
64
+ if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
65
+ record_pass "local ahead of registry -> silent (no downgrade prompt)"
66
+ else
67
+ record_fail "local-ahead should be silent (got '$out', rc=$rc)"
68
+ fi
69
+
70
+ # 3c. Offline + stale cache -> silent exit 0 (advisory: never blocks)
71
+ printf '0|10.0.0\n' > "$CACHE"
72
+ out=$(UPDATE_CHECK_CACHE="$CACHE" http_proxy="http://127.0.0.1:1" https_proxy="http://127.0.0.1:1" \
73
+ bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
74
+ if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
75
+ record_pass "offline + stale cache -> silent exit 0"
76
+ else
77
+ record_fail "offline should be a silent no-op (got '$out', rc=$rc)"
78
+ fi
79
+
80
+ # 4. Prefs schema knobs + defaults
81
+ for prop in enabled ttlHours autoUpdate; do
82
+ if jq -e ".properties.global.properties.updateCheck.properties.${prop}" "$PREFS_SCHEMA" >/dev/null 2>&1; then
83
+ record_pass "prefs schema exposes updateCheck.${prop}"
84
+ else
85
+ record_fail "prefs schema missing updateCheck.${prop}"
86
+ fi
87
+ done
88
+ if jq -e '.properties.global.properties.updateCheck.properties.enabled.default == true' "$PREFS_SCHEMA" >/dev/null 2>&1; then
89
+ record_pass "updateCheck.enabled defaults to true (notify-only, bounded cost)"
90
+ else
91
+ record_fail "updateCheck.enabled must default to true"
92
+ fi
93
+ if jq -e '.properties.global.properties.updateCheck.properties.autoUpdate | has("default") and .default == false' "$PREFS_SCHEMA" >/dev/null 2>&1; then
94
+ record_pass "updateCheck.autoUpdate defaults to false (no silent self-modify)"
95
+ else
96
+ record_fail "updateCheck.autoUpdate must default to false"
97
+ fi
98
+
99
+ # 5. Phase 0 doc wiring
100
+ if grep -qF "Step 0.6 - Update check" "$PHASE0_DOC"; then
101
+ record_pass "phase-0-init.md documents Step 0.6"
102
+ else
103
+ record_fail "phase-0-init.md missing Step 0.6"
104
+ fi
105
+ if grep -qF "update-check.sh" "$PHASE0_DOC"; then
106
+ record_pass "phase-0-init.md invokes update-check.sh"
107
+ else
108
+ record_fail "phase-0-init.md must invoke update-check.sh"
109
+ fi
110
+ if grep -qF "log-only" "$PHASE0_DOC" && grep -qF "never ask (zero-interaction contract)" "$PHASE0_DOC"; then
111
+ record_pass "phase-0-init.md states the autopilot log-only rule"
112
+ else
113
+ record_fail "phase-0-init.md must state autopilot never asks (log-only)"
114
+ fi
115
+
116
+ printf '\n══ update-check smoke: %d passed, %d failed ══\n' "$pass" "$fail"
117
+ if [ "$fail" -gt 0 ]; then
118
+ printf '\nFailures:\n'
119
+ for msg in "${failures[@]}"; do printf ' - %s\n' "$msg"; done
120
+ exit 1
121
+ fi
122
+ exit 0
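The output contract this smoke test exercises (print `<local>|<latest>` only when the registry is ahead, stay silent otherwise, and trust a fresh cache) can be sketched in a few functions. This is not a port of `update-check.sh`, whose body is not shown here. The function names are hypothetical, versions are assumed to be plain `MAJOR.MINOR.PATCH` without pre-release tags, and treating "age strictly below `ttlHours`" as fresh is an assumption. The `<epoch-seconds>|<latest>` cache line format is taken from what the smoke test writes.

```javascript
// Sketch of the advisory update-check contract; hypothetical helpers.

// Numeric comparison per component, so "10.9.0" < "10.10.0".
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const x = pa[i] || 0;
    const y = pb[i] || 0;
    if (x !== y) return x < y ? -1 : 1;
  }
  return 0;
}

// Cache line format: "<epoch-seconds>|<latest>".
function isCacheFresh(cacheLine, nowSec, ttlHours = 24) {
  const epoch = Number(cacheLine.split("|")[0]);
  return Number.isFinite(epoch) && nowSec - epoch < ttlHours * 3600;
}

// "<local>|<latest>" when the registry is ahead, "" (silent) otherwise.
function updateNotice(local, latest) {
  return compareVersions(local, latest) < 0 ? `${local}|${latest}` : "";
}
```

Comparing components numerically rather than as strings matters for this package's own version range: a lexicographic comparison would rank `10.10.0` below `10.9.0` and suppress a legitimate notice.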