@mmerterden/multi-agent-pipeline 10.8.0 → 10.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -14,6 +14,45 @@ Internal file-layout changes that don't affect the slash-command surface are sti
14
14
 
15
15
  ---
16
16
 
17
+ ## [10.9.0] - 2026-07-03
18
+
19
+ ### Added
20
+
21
+ - **Update check at run start (Phase 0 Step 0.6, opt-out via
22
+ `prefs.global.updateCheck`).** Once per `ttlHours` window (default 24h,
23
+ cached, 3s-bounded curl straight to the npm registry - never `npm view`),
24
+ the pipeline compares the installed version against `dist-tags.latest`.
25
+ Newer version found: interactive modes ask ONE question ("Update now?" ->
26
+ yes runs the `/multi-agent:update` flow and the run continues; the update
27
+ takes full effect next session); autopilot logs one line and never asks
28
+ (zero-interaction contract). `autoUpdate: true` skips the question and
29
+ updates before the worktree is created. Offline, ahead-of-registry, and
30
+ failed checks are silent - the step never blocks. New
31
+ `pipeline/scripts/update-check.sh` + `smoke-update-check.sh` (13 assertions).
32
+ - **Design fidelity contract (Phase 3, BLOCKING on Figma-referenced tasks).**
33
+ The analysis doc's captured frames are the 1:1 reference for every UI line:
34
+ a Code Connect-mapped component must be used verbatim (no sound-alikes,
35
+ missing modifiers go into `+Modifiers`, never forks), unmapped UI becomes a
36
+ NEW component under the standard architecture, and inter-component spacing
37
+ comes from the design's measured values mapped to tokens - never invented.
38
+ Missing design data halts with a re-run-analysis instruction; Figma MCP
39
+ stays forbidden outside analysis.
40
+ - **`docs/engineering.md`** - version-free catalog of the discipline layer
41
+ (bounded loops, evidence gates, token-budgeted phase docs, immutable tests,
42
+ fresh-context handoffs, memory hedges, install hygiene), linked from README.
43
+
44
+ ### Changed
45
+
46
+ - **Docs drop inline version markers.** `docs/features.md` headers and bullets
47
+ no longer carry `(vX.Y.Z)` tags - the docs describe the current state;
48
+ history lives in this CHANGELOG. Same principle applied to the website copy.
49
+ - **Phase-doc token budgets recalibrated** (`token-budget.json`, total
50
+ 46500 -> 50000) after the verify-by-test, update-check, immutable-test, and
51
+ design-fidelity contracts landed - Step 3.7 prose was compressed to a
52
+ pointer into `refs/features/verify-by-test.md` before recalibration.
53
+
54
+ ---
55
+
17
56
  ## [10.8.0] - 2026-07-02
18
57
 
19
58
  Three additive loop-quality features, all sourced from the 2026 agentic-loop research sweep (Anthropic long-running-agent harness guidance + adversarial-review findings): empirical verification of blocking findings, an immutable-test contract, and structured phase-boundary handoffs.
package/README.md CHANGED
@@ -51,6 +51,8 @@ One command runs 8 phases, with a gate between the risky ones:
51
51
 
52
52
  Under the hood: each task runs in its own **git worktree** (or the current branch with `:local`), commits use the **git identity routed from the repo's origin URL**, and **multi-repo** tasks get per-repo worktrees plus an integration build. Tokens stay in the OS keychain; nothing is committed or logged. `/multi-agent:review` can also review an existing GitHub/Bitbucket PR — per-finding inline comments anchored to `file:line` + an explicit Approve / Needs-Work state.
53
53
 
54
+ The discipline behind all of this — bounded loops, evidence gates, token-budgeted phase docs, immutable tests, fresh-context handoffs — is catalogued in [docs/engineering.md](./docs/engineering.md). The full feature list lives in [docs/features.md](./docs/features.md).
55
+
54
56
  ## Modes
55
57
 
56
58
  | Mode | Command | Flow |
@@ -0,0 +1,76 @@
1
+ # Engineering Discipline
2
+
3
+ The pipeline's output quality comes less from any single model call and more from the constraints wrapped around every call. This page lists those constraints - the internals that keep long agentic runs correct, cheap, and auditable. No version markers here; this document always describes the current state (history lives in the [CHANGELOG](../CHANGELOG.md)).
4
+
5
+ ## Loop discipline
6
+
7
+ Every loop in the pipeline has a deterministic exit condition and a hard iteration cap. Nothing "keeps trying".
8
+
9
+ - **TDD micro-loop** (Phase 3): red -> green -> refactor per task; build-fix retries capped at 3 attempts.
10
+ - **Dev-critic loop** (Phase 3.5, opt-in): generator vs critic on deterministic criteria; max 2 iterations, round 3 escalates to the user.
11
+ - **Review-fix macro-loop** (Phase 3 <-> 4): only triage-accepted blocking findings re-enter development; max 3 iterations, then hard-kill and escalate.
12
+ - **Disagreement round** (Phase 4, opt-in): when reviewers split, exactly one rebuttal round - never more.
13
+ - **Global rule**: any retry loop stops after 3 attempts and asks the user. No exceptions, no infinite loops.
14
+
15
+ ## Evidence over claims
16
+
17
+ - **Default-FAIL evidence gate**: a "build passed" or "tests passed" claim is rejected unless the captured log actually shows success. Missing, empty, or failure-marked logs fail CLOSED.
18
+ - **Verify-by-test** (opt-in): an accepted blocking review finding can be empirically confirmed - a verifier agent writes a minimal repro test and runs it. Test fails as predicted: the finding stands and the failing test is handed to the rework loop as its RED test. Test passes: the finding is downgraded, and even the downgrade must pass the evidence gate.
19
+ - **Deterministic validator gates**: triage output, reviewer output, diff-risk output, and agent state are schema-validated by zero-dependency scripts whose exit codes decide the flow - one self-correction rework, then halt with a recovery hint. The LLM never grades its own homework.
20
+
21
+ ## Test integrity
22
+
23
+ - **Immutable tests**: deleting, renaming, or weakening an existing test to get a green run is a rule violation. A test changes only when the task changes the spec it encodes, named in the commit body.
24
+ - **Deterministic backstop**: the diff-risk scorer flags any test file whose diff removes more lines than it adds (`test_lines_removed`), so a shrinking test suite surfaces to reviewers even if the rule is ignored.
25
+ - **Test-gap detection**: newly added public symbols with no paired test are reported before the test phase.
26
+
27
+ ## Context management
28
+
29
+ - **Token-budgeted phase docs**: every phase document has a per-file and total token budget enforced by a smoke gate (`token-budget.json`). Growth is compressed first; budgets are recalibrated only when real contract text lands. This keeps the orchestrator's working context lean across an 8-phase run.
30
+ - **Lazy loading**: only the active phase's document is in context; references load on demand.
31
+ - **Structured handoff blocks**: at every phase boundary the orchestrator appends a Done / Remaining / Decisions / Open findings / Next block to the run log - written from state it already holds, no extra model call. Resume and post-compaction re-entry read the latest handoff plus the state file, never the conversation history.
32
+ - **Proactive compaction**: past ~50% context usage the run compacts itself and re-grounds from durable artifacts instead of waiting for lossy auto-compaction.
33
+
34
+ ## Review architecture
35
+
36
+ - **Deterministic gates before AI review**: build, lint, tests, and secret scan must be green before a single reviewer token is spent.
37
+ - **Cross-model parallel review**: independent reviewers with different vantage points, then a triage pass that filters false positives and out-of-scope noise. Reviewer findings are raw signals, not commands - nothing auto-loops on a "blocking" tag.
38
+ - **Consensus surfacing**: unanimous agreement between same-base-model reviewers on judgment-heavy surfaces is flagged as unverified instead of being trusted as independent confirmation.
39
+ - **Diff risk scoring**: a sub-second, zero-LLM heuristic ranks changed files (security paths, migrations, missing tests, shrinking tests) and feeds reviewers a priority hint - advisory, never a gate.
40
+ - **Prompt-cache-aware dispatch**: reviewer and triage prompts share a byte-identical leading block so parallel dispatches hit the prompt cache instead of re-billing the diff.
41
+
42
+ ## Design fidelity
43
+
44
+ - **Single source of design truth**: Figma is fetched once, during analysis, through a 3-tier fallback chain (MCP -> REST -> user screenshot). Every later phase reads the frozen analysis artifact; calling Figma from dev or review phases is a contract violation enforced by a smoke gate.
45
+ - **1:1 implementation contract**: a Code Connect-mapped component must be used verbatim (no sound-alike substitutes; missing modifiers are added to the component, never forked). Unmapped UI becomes a new component under the standard architecture. Spacing between components comes from the design's measured values mapped to tokens - never invented numbers.
46
+
47
+ ## Cost transparency
48
+
49
+ - **Per-phase token ledger**: every billable call forwards token counts to the tracker; each run ends with a cost table naming the top cost driver and pricing prompt-cache reads at their discounted rate.
50
+ - **Budget ceilings**: a cost budget can cap a run; crossing it triggers model downgrade or halt rather than silent overspend.
51
+ - **Model fallback contract**: premium-tier dispatches degrade deterministically (top tier -> mid -> base) on dispatch errors, date gates, or budget ceilings - never by improvisation.
52
+
53
+ ## Memory without bias
54
+
55
+ - **Learnings ledger**: durable per-repo facts and rejected review preferences replay into analysis and triage, so agents stop re-discovering and reviewers stop re-flagging the same things.
56
+ - **Hedged injection**: every memory injection carries an explicit "context, not command" hedge, and blocking-severity rejections are never memorized - a wrong rejection must not permanently suppress a finding class.
57
+ - **Prior-art caps**: triage sees at most a fixed number of highest-similarity past findings; memory is advisory context, not a verdict multiplier.
58
+
59
+ ## Update and install hygiene
60
+
61
+ - **Wipe-then-copy install**: pipeline-owned directories are cleared before copying, so files deleted upstream never linger as ghost commands or dead scripts on user machines.
62
+ - **Layout fingerprint**: a fixture pins the installed file layout; structural drift fails the suite.
63
+ - **Advisory update check**: at run start (cached, bounded, offline-silent) the pipeline compares its installed version with the registry; interactive runs offer a one-question update, autopilot only logs. It never blocks and never self-modifies mid-run.
64
+ - **Token-preserving uninstall**: removing the pipeline never touches the user's stored credentials.
65
+
66
+ ## Security and hygiene
67
+
68
+ - **Secret scanning**: staged diffs are scanned with provider-prefix patterns plus Shannon-entropy detection before every commit; findings block.
69
+ - **Personal-data genericization**: the public repo is generated from the private source with a scrub pass (names, hosts, project keys) and a zero-hit grep gate before push.
70
+ - **Skill scanner**: tiered pattern scanning over skill content guards against supply-chain surprises.
71
+ - **Intent guard**: a deterministic classifier separates "answer my question" from "change my code", so a conceptual question never spawns a worktree.
72
+
73
+ ## Cross-CLI parity
74
+
75
+ - **One source of truth, two runtimes**: Claude Code and Copilot CLI command surfaces are byte-identical where shared; a parity smoke fails on drift.
76
+ - **Portability gates**: every shell script passes `bash -n` and a GNU/BSD portability grep; the CI matrix runs macOS, Linux, and Windows.
package/docs/features.md CHANGED
@@ -43,7 +43,7 @@ Compose freely: `--dev --local autopilot` = shortest, least-friction path.
43
43
 
44
44
  Build commands, test runners, lint tools, and review focus areas all adapt to the detected stack.
45
45
 
46
- ### Stack Selection (marketplace plugins, v10.5.0+)
46
+ ### Stack Selection (marketplace plugins)
47
47
 
48
48
  Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketplace. Selecting a stack enables the matching plugin(s) in the current repo's `.claude/settings.json` `enabledPlugins` — no skill copying, no session restart tricks, no directory shuffling. The `ai-common-engineering-toolkit` (accessibility audit, humanizer, Firebase) is always enabled alongside the stack plugin.
49
49
 
@@ -57,7 +57,7 @@ Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketpl
57
57
  /multi-agent:stack all # every stack plugin
58
58
  ```
59
59
 
60
- ### Task Type Detection (v2.0.0)
60
+ ### Task Type Detection
61
61
 
62
62
  Phase 0 Step 9 classifies every task before Phase 1 starts. Deterministic priority order: Figma URL → instruction file path → git diff heuristic → Jira issue type → branch name → description keywords → user prompt (autopilot defaults to `feature`).
63
63
 
@@ -71,26 +71,26 @@ Result persisted to `agent-state.taskType`:
71
71
  | `refactor` | Phase 4 emphasizes behavior preservation; Phase 6 uses `refactor(...)` prefix |
72
72
  | `chore` | Lightweight flow; Phase 6 uses `chore(...)` prefix |
73
73
 
74
- ### SubPhase Convention (v2.0.0)
74
+ ### SubPhase Convention
75
75
 
76
- When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7 — Phase 6.5 was merged into Phase 7 REPORT in v5.1.0) — specialized work slots into its parent phase without inflating the count.
76
+ When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7) — specialized work slots into its parent phase without inflating the count.
77
77
 
78
78
  ## PR & Review Flow
79
79
 
80
- ### Default Reviewers (v1.6.1, hardened in v2.0.0)
80
+ ### Default Reviewers
81
81
 
82
82
  - **Bitbucket**: Fetches via `/rest/default-reviewers/1.0/`. Every PUT must re-send `reviewers`, `fromRef`, `toRef`, `draft` (regression guarded by smoke test).
83
83
  - **GitHub**: Honors `CODEOWNERS` + falls back to `prefs.projects[].githubDefaultReviewers`.
84
84
  - PR author always filtered out (Bitbucket: 409, GitHub: GraphQL error).
85
85
 
86
- ### Draft vs Ready Prompt (v1.7.0)
86
+ ### Draft vs Ready Prompt
87
87
 
88
88
  Phase 6 asks `DRAFT or READY?` before creating the PR and persists the choice in `prefs.projects[].defaultPrMode`.
89
89
 
90
90
  - Bitbucket: `draft: true` flag (DC 8.x+) with `[DRAFT]` title fallback for older servers.
91
91
  - GitHub: `gh pr create --draft` + `gh pr ready` for promotion.
92
92
 
93
- ### `channels` Command (v5.7.0 — replaces `enrich`)
93
+ ### `channels` Command
94
94
 
95
95
  Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post-hoc for fixes closed outside the pipeline:
96
96
 
@@ -102,9 +102,9 @@ Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post
102
102
  /multi-agent:channels ABC-1234 --channels jira,confluence --content test
103
103
  ```
104
104
 
105
- Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the pre-v5.7 `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
105
+ Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the earlier `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
106
106
 
107
- ### Body Preservation Contract (v1.6.1, verified by smoke test in v2.0.0)
107
+ ### Body Preservation Contract (smoke-verified)
108
108
 
109
109
  Every external-system body (PR description, Jira comment, GitHub issue) uses `jq -n --rawfile body body.md '{description: $body}'` → `curl --data-binary @payload.json`. No literal `\n` strings, no HTML entities (`&amp;`, `&lt;`, `&quot;`). UTF-8 preserved end-to-end. `scripts/smoke-add-detail.sh` runs 14 contract assertions.
110
110
 
@@ -137,7 +137,7 @@ The reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers in paralle
137
137
 
138
138
  **Fable Triage** (Phase 4 Step 3, Opus on Copilot CLI): Evaluates merged raw findings against task scope. Classifies each as `accepted` (fix now), `deferred` (out of scope, log for later), or `rejected` (false positive / noise). Only triage-accepted blocking items loop back to Phase 3.
139
139
 
140
- ### Runtime Triage Validator (v2.3.0)
140
+ ### Runtime Triage Validator
141
141
 
142
142
  After triage returns, output is validated by `validate-triage.mjs`:
143
143
 
@@ -148,19 +148,23 @@ After triage returns, output is validated by `validate-triage.mjs`:
148
148
  | **2** | Over-rejection guard tripped — pause for human |
149
149
  | **3** | Contradiction auto-corrected — proceed with corrected output |
150
150
 
151
- ### Bidirectional Approved↔Blocking Auto-Correction (v2.6.1)
151
+ ### Bidirectional Approved↔Blocking Auto-Correction
152
152
 
153
- If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with `if`/`then` constraint in schema v3.0.0.
153
+ If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with an `if`/`then` constraint in the schema itself.
154
154
 
155
- ### Verify-by-Test Triage (Phase 4 Step 3.7, v10.8.0, opt-in)
155
+ ### Verify-by-Test Triage (Phase 4 Step 3.7, opt-in)
156
156
 
157
157
  A triage verdict is a judgment call; a failing repro test is proof. When `prefs.global.verifyByTest.enabled` is on, one verifier agent (default Sonnet) writes a minimal repro test per accepted blocking finding (cap: `maxFindings`=3) and runs only that test. Fails as predicted -> finding confirmed, the repro test becomes the Phase 3 rework RED test. Passes under `evidence-gate.mjs` -> finding downgraded to `deferred`. Compile error / timeout -> `inconclusive`, judgment stands. Timeout-bounded, never blocks. Full spec: `refs/features/verify-by-test.md`.
158
158
 
159
- ### Immutable-Test Rule + `test_lines_removed` Signal (v10.8.0)
159
+ ### Immutable-Test Rule + `test_lines_removed` Signal
160
160
 
161
161
  Existing tests are immutable during a task: deleting, renaming, or weakening an assertion to reach green is a violation (`refs/rules.md`, Phase 3 GREEN step). A test changes only when the task changes the spec it encodes, named in the commit body. Deterministic backstop: `diff-risk-score.mjs` emits `test_lines_removed` (w=3.0) for any test-classified file whose diff removes more lines than it adds.
162
162
 
163
- ### Structured Handoff Blocks (v10.8.0)
163
+ ### Update Check at Run Start
164
+
165
+ Phase 0 Step 0.6, opt-out via `prefs.global.updateCheck`. Once per `ttlHours` window (cached, 3s-bounded curl to the npm registry), the installed version is compared against `dist-tags.latest`. Newer version: interactive modes ask "Update now?" (yes runs the `/multi-agent:update` flow, then the run continues); autopilot logs one line and never asks. `autoUpdate: true` updates silently before the worktree exists. Offline and failed checks are silent - never blocks.
166
+
167
+ ### Structured Handoff Blocks
164
168
 
165
169
  Every phase transition appends a `## Handoff` block (Done / Remaining / Decisions / Open findings / Next) to `agent-log.md` - orchestrator-written from existing state, no LLM call. `/multi-agent:resume` and post-`/compact` re-grounding read the latest handoff first, so long runs re-enter from durable artifacts instead of conversation memory (fresh-context discipline from Anthropic's long-running-agent harness guidance).
166
170
 
@@ -174,7 +178,7 @@ If changes include UI files, reviewers check for:
174
178
 
175
179
  Pure code analysis — no simulator needed. Device-level audits run in Phase 5 when requested.
176
180
 
177
- ### Status Enforcement (v1.5.0)
181
+ ### Status Enforcement
178
182
 
179
183
  Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation verify step that re-reads the field and retries once on silent `VALIDATION` failures (e.g. stale Projects V2 option IDs after a board rebuild).
180
184
 
@@ -187,49 +191,49 @@ Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation veri
187
191
 
188
192
  ## Testing & Quality
189
193
 
190
- ### Schema-Validated State (v2.0.0)
194
+ ### Schema-Validated State
191
195
 
192
196
  All critical state files are schema-validated at read and write time:
193
197
 
194
198
  - `agent-state.schema.json` — validates `$HOME/.claude/logs/multi-agent/.../agent-state.json`
195
- - `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json` (extended v2.5.0)
196
- - `triage-output.schema.json` — validates triage output (v2.3.0, hardened v2.6.1, `if`/`then` constraint v3.0.0)
199
+ - `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json`
200
+ - `triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
197
201
 
198
- ### Smoke Test Suites (v2.0.0–v2.5.0)
202
+ ### Smoke Test Suites
199
203
 
200
204
  10 suites: add-detail (14 body-preservation assertions), review-triage (validator exit codes), prefs (schema round-trip), state (agent-state lifecycle), metrics (telemetry emission), sync (instruction parity), secret-scan (hook patterns), phase-banner (terminal UI), token-budget (per-phase limits), phase-tracker (progress tracking).
201
205
 
202
- ### Adversarial Eval Fixtures (v2.5.0–v2.6.0)
206
+ ### Adversarial Eval Fixtures
203
207
 
204
208
  10 fixtures that test triage resilience against adversarial reviewer output: over-rejection, hallucinated findings, contradictions, invalid JSON, schema violations, duplicate findings, scope creep, empty results, timeout simulation, and combined edge cases.
205
209
 
206
- ### Sync Parity Check (v2.3.0)
210
+ ### Sync Parity Check
207
211
 
208
212
  Detects drift between Claude Code instructions (`~/.claude/commands/multi-agent.md`), Copilot CLI instructions (`~/.copilot/copilot-instructions.md`), and the repo's pipeline spec files. Reports discrepancies during Phase 0 Init.
209
213
 
210
- ### Token Budget Enforcement (v3.0.0)
214
+ ### Token Budget Enforcement
211
215
 
212
216
  Per-phase token budgets prevent runaway sessions. If a phase exceeds its budget, the pipeline pauses and offers: continue (extend budget), skip phase, or abort. Budgets are configurable in `prefs.global.tokenBudgets`.
213
217
 
214
218
  ## Telemetry & Observability
215
219
 
216
- - **Pipeline Metrics** (v2.3.0): Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
217
- - **Cost Telemetry** (v2.5.0): Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
218
- - **Phase Tracker** (v2.4.0): Cross-CLI visual progress (current phase, elapsed time, iteration count).
219
- - **Phase Banner** (v2.4.0): Terminal UI for phase transitions with Unicode box-drawing characters.
220
- - **Per-task Cost Breakdown in agent-log.md** (v8.3.0): Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
220
+ - **Pipeline Metrics**: Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
221
+ - **Cost Telemetry**: Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
222
+ - **Phase Tracker**: Cross-CLI visual progress (current phase, elapsed time, iteration count).
223
+ - **Phase Banner**: Terminal UI for phase transitions with Unicode box-drawing characters.
224
+ - **Per-task Cost Breakdown in agent-log.md**: Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
221
225
 
222
- ### Diff Risk Scoring (v8.3.0)
226
+ ### Diff Risk Scoring
223
227
 
224
228
  `pipeline/scripts/diff-risk-score.mjs` runs at Phase 4 Step 1.75 — before reviewer dispatch. Heuristic, deterministic, sub-second, no LLM. Top-N risk-ranked files inject into each reviewer's prompt as a `${PRIORITY_FILES}` block; reviewers read those files first but still review the entire diff.
225
229
 
226
- Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (v10.8.0: test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
230
+ Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
227
231
 
228
- ### Test Gap Detection (v8.3.0)
232
+ ### Test Gap Detection
229
233
 
230
234
  `pipeline/scripts/test-gap-scan.mjs` runs at Phase 5 Step 0. Walks the diff for newly added public symbols and reports those with no paired test. Stack-specific rules ship for iOS, Android, Python, Node.js. iOS Views and Android `@Composable` symbols default to `important`; other public API additions to `suggestion`. Optional gating via `prefs.testGap.blockingThreshold` — when set, the report becomes a Phase 4 rework finding once `important + blocking` count exceeds the threshold.
231
235
 
232
- ### Triage Memory (v8.3.0)
236
+ ### Triage Memory
233
237
 
234
238
  Per-repo append-only JSONL corpus at `~/.claude/memory/multi-agent/<repo-slug>/triage-corpus.jsonl`. Phase 7 ingests every triage output (idempotent), Phase 1 enriches the analysis with similar past tasks, Phase 4 triage attaches prior-art hits to each raw finding with an explicit bias hedge. Token-overlap recall, zero deps, Node-18-compatible. `/multi-agent:search "<text>" --semantic` routes the query to the corpus instead of agent-log grep. Toggle via `prefs.global.priorArtEnrichment.enabled` (default ON).
235
239
 
@@ -280,10 +284,10 @@ Token registry maps logical names (`jira`, `bitbucket`, `github`, `confluence`)
280
284
 
281
285
  ## Schemas & Validation
282
286
 
283
- - `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle (v2.0.0)
284
- - `pipeline/schemas/prefs.schema.json` — validates preferences (v2.0.0, extended v2.5.0)
285
- - `pipeline/schemas/triage-output.schema.json` — validates triage output (v2.3.0, hardened v2.6.1, `if`/`then` constraint v3.0.0)
287
+ - `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle
288
+ - `pipeline/schemas/prefs.schema.json` — validates preferences
289
+ - `pipeline/schemas/triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
286
290
 
287
- ## File Layout (v1.9.0)
291
+ ## File Layout
288
292
 
289
293
  `commands/multi-agent/{add-detail,setup,help}.md` → slash commands. `commands/multi-agent/refs/**` → guides, phase specs, rules (`refs:` prefix signals "not an action"). Modifier flags and ops stay inline in `multi-agent.md`.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@mmerterden/multi-agent-pipeline",
3
- "version": "10.8.0",
3
+ "version": "10.9.0",
4
4
  "description": "8-phase AI development pipeline with full orchestration on Claude Code and Copilot CLI. Analysis, planning, TDD, CLI-aware parallel review with consensus surfacing + Fable triage, default-FAIL evidence gates, secret + intent guards, per-phase cost ledger, persistent learnings memory, wiki generation, commit automation. Token-preserving uninstall.",
5
5
  "type": "module",
6
6
  "main": "index.js",
@@ -98,6 +98,15 @@ Log the resolved tier in the agent log:
98
98
 
99
99
  Full chain definition, REST endpoints, URL parsing, canonical-component contract: `pipeline/rules/figma-pipeline.md` "MUST: Figma access - 3-tier fallback chain (BLOCKING, pipeline-wide)". Do not duplicate it here.
100
100
 
101
+ #### Step 0.6 - Update check (advisory, opt-out via `prefs.global.updateCheck.enabled`)
102
+
103
+ Run `bash pipeline/scripts/update-check.sh` (cached per `updateCheck.ttlHours`, default 24h; 3s-bounded; every failure path silent). Empty output → continue. Output `<local>|<latest>` means a newer version exists:
104
+
105
+ - **Interactive**: log `→ update available: v<local> -> v<latest>`, ask ONE AskUserQuestion - **Update now** (recommended) / **Continue without updating**. On *Update now* (or `updateCheck.autoUpdate: true`, which skips the question): run the `/multi-agent:update` flow, log `→ updated to v<latest>`, continue the run (note in the log: already-loaded phase docs finish this run on the old version; full effect next session). On *Continue*: no re-ask until the TTL expires.
106
+ - **Autopilot**: never ask (zero-interaction contract). Log `→ update available: ... (log-only; run /multi-agent:update)` and continue - unless `autoUpdate: true`, then update silently first.
107
+
108
+ Advisory - NEVER blocks. Must run BEFORE Step 6 (worktree creation) so an accepted update cannot mutate `~/.claude` under a mid-phase run.
109
+
101
110
  #### Step 1 - Parse Input
102
111
 
103
112
  **Branch from input**: If the user provided a branch name after the issue reference (space-separated), store as `baseBranch` and skip Step 3. Otherwise Step 3 asks interactively.
@@ -49,6 +49,16 @@ When Phase 0 Step 7 classified the task as `component`, Phase 3 delegates the en
49
49
 
50
50
  For non-component taskTypes (`bugfix`, `feature`, `refactor`, `chore`), continue with the standard TDD section below.
51
51
 
52
+ #### Design fidelity contract (BLOCKING when the task carries a Figma reference)
53
+
54
+ Applies to EVERY task whose analysis doc carries design ground truth (Section 6 rows, node IDs, screenshots) - component-dispatch AND TDD-path UI work alike. MCP stays forbidden here (analysis-only rule); the analysis doc IS the design:
55
+
56
+ 1. **1:1, not interpretation.** The captured frames are the reference for every UI line. Layout, variant states, copy, and colours come from the analysis doc's measured values - never eyeballed, never "close enough".
57
+ 2. **Code Connect decides the component.** A mapped component (`CodeConnectSnippet` / repo `*.figma.swift` / `*.figma.kt` row) MUST be used verbatim - sound-alike substitutes are forbidden; a missing modifier is added to the component's `+Modifiers` extension, never forked. No mapping exists -> write a NEW component per the Configuration/View/Modifiers architecture; never inline ad-hoc UI into the consumer screen.
58
+ 3. **Inter-component spacing is part of the design.** Gaps, paddings, and alignment BETWEEN components must match the frame's measured values, mapped to spacing tokens (`Spacing*`) - never invented numbers. A spacing/layout deviation is a `review_blocking` finding in Phase 4 Step 1.8, not a nitpick.
59
+
60
+ Missing design data (variant, padding, copy) → HALT and instruct the user to re-run `/multi-agent:analysis`; never guess and never call Figma MCP from this phase.
61
+
52
62
  #### Re-entry from Phase 4 triage
53
63
 
54
64
  Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings - Fable triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
@@ -390,30 +390,9 @@ After the triage verdict is computed, populate `triage.consensus`:
390
390
 
391
391
  ##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
392
392
 
393
- **Rationale:** a triage verdict is still a judgment call; a failing repro test is proof. Debating a finding costs tokens, running it costs one test invocation and kills false positives that survive adversarial framing.
393
+ A triage verdict is judgment; a failing repro test is proof. Runs only when `prefs.global.verifyByTest.enabled` is `true` AND `accepted` contains a `blocking` finding; otherwise skip silently. **Full contract (verdict table, cleanup invariant, prompts): `refs/features/verify-by-test.md` - read it before executing this step.**
394
394
 
395
- **Gate:** runs only when `prefs.global.verifyByTest.enabled` is `true` AND the validated triage output contains at least one `accepted` finding with `severity: "blocking"`. Otherwise skip silently (no log noise). Full behavior spec: `refs/features/verify-by-test.md`.
396
-
397
- 1. **Dispatch ONE verifier agent** (subagent_type: `general-purpose`, model: `prefs.global.verifyByTest.model`, default `sonnet`) for the iteration - not one per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings (ordered as triage returned them), the diff hunks for their files, and the project's test conventions from Phase 1. Findings beyond the cap keep their judgment-only verdict; log `verify_by_test=cap-exceeded count=<n>`.
398
- 2. **Per finding, the verifier writes ONE minimal repro test** asserting the CORRECT behavior the finding claims is broken, following the platform test conventions (framework, naming, location per phase-3-dev.md), then runs ONLY that test using the platform single-test invocation from Phase 3 (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`), wrapped in `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
399
- 3. **Verdict mapping (fails toward keeping blockers):**
400
-
401
- | Repro test outcome | `verification.result` | Action on finding |
402
- | --- | --- | --- |
403
- | Test FAILS as the finding predicts | `confirmed` | Stays `accepted` blocking. Repro test is KEPT in the worktree and recorded in `redTests[]` as the RED test for the Phase 3 rework loop. |
404
- | Test PASSES (defect not reproducible) | `not-reproduced` | Downgrade is evidence-gated: `node pipeline/scripts/evidence-gate.mjs --claim test --status passed --evidence "$WORKTREE/.pipeline/verify-<i>.test.log"` must exit 0. On exit 0: move finding from `accepted[]` to `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`, delete the repro test file. On exit non-zero: treat as `inconclusive`. |
405
- | Compile error, timeout, or defect not expressible as a unit test | `inconclusive` | Stays `accepted` blocking (judgment-only verdict stands). Partial test file deleted, cause noted in `verification.note`. |
406
-
407
- 4. **Cleanup invariant:** after Step 3.7 the only verifier artifacts left in the worktree are the confirmed repro tests listed in `redTests[]` (they get committed with the fix, satisfying the TDD contract) and logs under `$WORKTREE/.pipeline/` (already outside Phase 6 commit scope).
408
- 5. **Persist + re-validate:** stamp each verified finding with a `verification` object (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`, recompute `approved`, then re-run `validate-triage.mjs` on the mutated `$TRIAGE_FILE` under the same 3.2.1 gate protocol.
409
- 6. **Timeout/fallback (mirrors 3.3):** the whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600). On verifier crash or budget breach: no retry; remaining findings keep judgment-only verdicts, log `triage=verify-by-test-timeout`, proceed to Step 4. Never blocks the pipeline.
410
-
411
- Telemetry (per 3.4 conventions, best-effort):
412
-
413
- ```bash
414
- LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.verify_by_test \
415
- attempted=$A confirmed=$C downgraded=$D inconclusive=$I duration_ms=$MS
416
- ```
395
+ Compressed flow: dispatch ONE verifier agent (model `verifyByTest.model`, default `sonnet`) for up to `maxFindings` (default 3) accepted blocking findings. Per finding it writes ONE minimal repro test and runs ONLY that test (Phase 3 single-test invocation, build lock, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`). Outcomes: test FAILS as predicted -> `confirmed`, finding stays blocking and the test is KEPT in `redTests[]` as the Phase 3 rework RED test; test PASSES -> `not-reproduced` ONLY if `evidence-gate.mjs --claim test --status passed` exits 0 on the log, finding moves to `deferred[]`, test deleted; compile error / timeout / not unit-testable -> `inconclusive`, judgment verdict stands. Stamp findings with `verification` (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = {attempted, confirmed, downgraded, inconclusive, redTests[]}`, recompute `approved`, re-run `validate-triage.mjs` under the 3.2.1 gate. Whole step bounded by `stepTimeoutSec` (default 600); on breach or crash remaining findings keep judgment verdicts - never blocks. Telemetry per 3.4: `review.verify_by_test attempted= confirmed= downgraded= inconclusive= duration_ms=`.
417
396
 
418
397
  #### Step 4 - Consensus + Action (triage-driven)
419
398
 
@@ -701,6 +701,30 @@
701
701
  "default": false,
702
702
  "description": "v6.1.0+ \u2014 Phase 4 Step 2.5 rebuttal round. When reviewers disagree (mixed blocker/approved verdict), each reviewer is re-prompted with the others' opposing arguments for one additional round before triage. Lifts signal quality on ambiguous findings at ~1\u00d7 Step 2 token cost. Off by default \u2014 flip for security-critical or release-branch reviews."
703
703
  },
704
+ "updateCheck": {
705
+ "type": "object",
706
+ "additionalProperties": false,
707
+ "description": "v10.9+ - Phase 0 Step 0.6 advisory version check. Once per ttlHours window, a bounded (3s) registry read compares the installed version against dist-tags.latest. Newer version found: interactive modes ask 'Update now / Continue' (yes runs the /multi-agent:update flow, then the run continues); autopilot logs one line and never asks. Offline/failed checks are silent; the step never blocks the pipeline.",
708
+ "properties": {
709
+ "enabled": {
710
+ "type": "boolean",
711
+ "default": true,
712
+ "description": "Master switch. On by default - the cost is at most one 3s-bounded curl per ttlHours."
713
+ },
714
+ "ttlHours": {
715
+ "type": "integer",
716
+ "minimum": 1,
717
+ "maximum": 168,
718
+ "default": 24,
719
+ "description": "Cache window for the registry read."
720
+ },
721
+ "autoUpdate": {
722
+ "type": "boolean",
723
+ "default": false,
724
+ "description": "When true, skip the question and run the update flow automatically before the run starts (interactive AND autopilot). Off by default - self-modifying ~/.claude without asking is a surprise."
725
+ }
726
+ }
727
+ },
704
728
  "verifyByTest": {
705
729
  "type": "object",
706
730
  "additionalProperties": false,
@@ -3,15 +3,15 @@
3
3
  "$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/token-budget.json",
4
4
  "description": "Per-phase token budget for lazy-loaded pipeline docs. Enforced by smoke-token-budget.sh.",
5
5
  "phases": {
6
- "phase-0-init": { "max_tokens": 11250, "warn_tokens": 9900 },
6
+ "phase-0-init": { "max_tokens": 12400, "warn_tokens": 10900 },
7
7
  "phase-1-analysis": { "max_tokens": 3750, "warn_tokens": 3300 },
8
8
  "phase-2-planning": { "max_tokens": 6500, "warn_tokens": 5750 },
9
- "phase-3-dev": { "max_tokens": 7650, "warn_tokens": 6750 },
10
- "phase-4-review": { "max_tokens": 11100, "warn_tokens": 9750 },
11
- "phase-5-test": { "max_tokens": 2300, "warn_tokens": 2000 },
12
- "phase-6-commit": { "max_tokens": 5550, "warn_tokens": 4900 },
9
+ "phase-3-dev": { "max_tokens": 7900, "warn_tokens": 6950 },
10
+ "phase-4-review": { "max_tokens": 13250, "warn_tokens": 11650 },
11
+ "phase-5-test": { "max_tokens": 2550, "warn_tokens": 2250 },
12
+ "phase-6-commit": { "max_tokens": 6150, "warn_tokens": 5400 },
13
13
  "phase-7-report": { "max_tokens": 5600, "warn_tokens": 4950 }
14
14
  },
15
- "total_max_tokens": 46500,
16
- "note": "Token estimate = ceil(chars / 4). Per-phase budget rule: warn = current+10% (rounded to nearest 50), max = current+25%. Gives ~6 edit cycles of headroom before warn trips - intentionally quiet under normal maintenance, loud when a phase grows unusually. Only the active phase is loaded (lazy). Recalibrated at v10.0.0 after the validator/consistency/simplifier/lesson gate contracts landed in phases 1-4 (prose already compressed; the residual growth is the gate contracts themselves)."
15
+ "total_max_tokens": 50000,
16
+ "note": "Token estimate = ceil(chars / 4). Per-phase budget rule: warn = current+10% (rounded to nearest 50), max = current+25%. Gives ~6 edit cycles of headroom before warn trips - intentionally quiet under normal maintenance, loud when a phase grows unusually. Only the active phase is loaded (lazy). Recalibrated at v10.0.0 after the validator/consistency/simplifier/lesson gate contracts landed in phases 1-4. Recalibrated again at v10.9.0 after the verify-by-test (Phase 4 Step 3.7), update-check (Phase 0 Step 0.6), immutable-test (Phase 3 GREEN) and redTests re-entry contracts landed - Step 3.7 prose was compressed to a pointer into refs/features/verify-by-test.md before the recalibration."
17
17
  }
@@ -24,6 +24,7 @@ Validate contracts. Each emits `══ <name> smoke: N passed, M failed ══`
24
24
  - `smoke-phase4-triage.sh` - Phase 4 reviewer → triage flow
25
25
  - `smoke-verify-by-test.sh` - Phase 4 Step 3.7 verify-by-test contract (v10.8.0)
26
26
  - `smoke-handoff-contract.sh` - phase-boundary structured handoff + handoff-first resume (v10.8.0)
27
+ - `smoke-update-check.sh` - Phase 0 Step 0.6 advisory update-check contract (v10.9.0)
27
28
 
28
29
  ### Schema + state
29
30
  - `smoke-schema-validation.sh` - all JSON schemas validate
@@ -5,12 +5,12 @@
5
5
  .claude/multi-agent-preferences.json 1
6
6
  .claude/rules 12
7
7
  .claude/schemas 23
8
- .claude/scripts 169
8
+ .claude/scripts 171
9
9
  .claude/settings.json 1
10
10
  .claude/skills 560
11
11
  .copilot/agents 8
12
12
  .copilot/copilot-instructions.md 1
13
13
  .copilot/lib 23
14
14
  .copilot/schemas 23
15
- .copilot/scripts 169
15
+ .copilot/scripts 171
16
16
  .copilot/skills 596
@@ -0,0 +1,122 @@
1
+ #!/usr/bin/env bash
2
+ # smoke-update-check.sh
3
+ #
4
+ # Verifies the Phase 0 Step 0.6 advisory update-check contract:
5
+ # 1. update-check.sh exists, parses (bash -n), and honors the advisory contract
6
+ # offline: exit 0 + empty stdout when the registry is unreachable
7
+ # 2. Cached path: fresh cache short-circuits without a network call
8
+ # 3. Newer latest -> "<local>|<latest>"; same or older latest -> silent
9
+ # 4. prefs.schema.json exposes updateCheck.{enabled,ttlHours,autoUpdate}
10
+ # with the documented defaults (enabled=true, autoUpdate=false)
11
+ # 5. phase-0-init.md documents Step 0.6 with the autopilot log-only rule
12
+ #
13
+ # Exit 0 = all pass, 1 = any failure.
14
+
15
+ set -euo pipefail
16
+
17
+ ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
18
+ SCRIPT="$ROOT/pipeline/scripts/update-check.sh"
19
+ PREFS_SCHEMA="$ROOT/pipeline/schemas/prefs.schema.json"
20
+ PHASE0_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-0-init.md"
21
+
22
+ pass=0
23
+ fail=0
24
+ failures=()
25
+ record_pass() { pass=$((pass + 1)); printf ' \033[0;32mPASS\033[0m %s\n' "$1"; }
26
+ record_fail() { fail=$((fail + 1)); failures+=("$1"); printf ' \033[0;31mFAIL\033[0m %s\n' "$1"; }
27
+
28
+ printf '→ smoke-update-check: Phase 0 Step 0.6 advisory contract\n'
29
+
30
+ tmpdir=$(mktemp -d)
31
+ trap 'rm -rf "$tmpdir"' EXIT
32
+ CACHE="$tmpdir/update-check-cache"
33
+
34
+ # 1. Script parses
35
+ if bash -n "$SCRIPT" 2>/dev/null; then
36
+ record_pass "update-check.sh parses (bash -n)"
37
+ else
38
+ record_fail "update-check.sh has syntax errors"
39
+ fi
40
+
41
+ now=$(date +%s)
42
+
43
+ # 2. Fresh cache with newer latest -> reports, no network needed
44
+ printf '%s|99.0.0\n' "$now" > "$CACHE"
45
+ out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
46
+ if [ "$rc" -eq 0 ] && [ "$out" = "10.0.0|99.0.0" ]; then
47
+ record_pass "newer cached latest -> '<local>|<latest>' (exit 0)"
48
+ else
49
+ record_fail "newer cached latest should print '10.0.0|99.0.0' (got '$out', rc=$rc)"
50
+ fi
51
+
52
+ # 3. Same version -> silent
53
+ printf '%s|10.0.0\n' "$now" > "$CACHE"
54
+ out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
55
+ if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
56
+ record_pass "same version -> silent"
57
+ else
58
+ record_fail "same version should be silent (got '$out', rc=$rc)"
59
+ fi
60
+
61
+ # 3b. Local ahead of registry (dev machine) -> silent
62
+ printf '%s|10.0.0\n' "$now" > "$CACHE"
63
+ out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.1.0 2>/dev/null); rc=$?
64
+ if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
65
+ record_pass "local ahead of registry -> silent (no downgrade prompt)"
66
+ else
67
+ record_fail "local-ahead should be silent (got '$out', rc=$rc)"
68
+ fi
69
+
70
+ # 3c. Offline + stale cache -> silent exit 0 (advisory: never blocks)
71
+ printf '0|10.0.0\n' > "$CACHE"
72
+ out=$(UPDATE_CHECK_CACHE="$CACHE" http_proxy="http://127.0.0.1:1" https_proxy="http://127.0.0.1:1" \
73
+ bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
74
+ if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
75
+ record_pass "offline + stale cache -> silent exit 0"
76
+ else
77
+ record_fail "offline should be a silent no-op (got '$out', rc=$rc)"
78
+ fi
79
+
80
+ # 4. Prefs schema knobs + defaults
81
+ for prop in enabled ttlHours autoUpdate; do
82
+ if jq -e ".properties.global.properties.updateCheck.properties.${prop}" "$PREFS_SCHEMA" >/dev/null 2>&1; then
83
+ record_pass "prefs schema exposes updateCheck.${prop}"
84
+ else
85
+ record_fail "prefs schema missing updateCheck.${prop}"
86
+ fi
87
+ done
88
+ if jq -e '.properties.global.properties.updateCheck.properties.enabled.default == true' "$PREFS_SCHEMA" >/dev/null 2>&1; then
89
+ record_pass "updateCheck.enabled defaults to true (notify-only, bounded cost)"
90
+ else
91
+ record_fail "updateCheck.enabled must default to true"
92
+ fi
93
+ if jq -e '.properties.global.properties.updateCheck.properties.autoUpdate | has("default") and .default == false' "$PREFS_SCHEMA" >/dev/null 2>&1; then
94
+ record_pass "updateCheck.autoUpdate defaults to false (no silent self-modify)"
95
+ else
96
+ record_fail "updateCheck.autoUpdate must default to false"
97
+ fi
98
+
99
+ # 5. Phase 0 doc wiring
100
+ if grep -qF "Step 0.6 - Update check" "$PHASE0_DOC"; then
101
+ record_pass "phase-0-init.md documents Step 0.6"
102
+ else
103
+ record_fail "phase-0-init.md missing Step 0.6"
104
+ fi
105
+ if grep -qF "update-check.sh" "$PHASE0_DOC"; then
106
+ record_pass "phase-0-init.md invokes update-check.sh"
107
+ else
108
+ record_fail "phase-0-init.md must invoke update-check.sh"
109
+ fi
110
+ if grep -qF "log-only" "$PHASE0_DOC" && grep -qF "never ask (zero-interaction contract)" "$PHASE0_DOC"; then
111
+ record_pass "phase-0-init.md states the autopilot log-only rule"
112
+ else
113
+ record_fail "phase-0-init.md must state autopilot never asks (log-only)"
114
+ fi
115
+
116
+ printf '\n══ update-check smoke: %d passed, %d failed ══\n' "$pass" "$fail"
117
+ if [ "$fail" -gt 0 ]; then
118
+ printf '\nFailures:\n'
119
+ for msg in "${failures[@]}"; do printf ' - %s\n' "$msg"; done
120
+ exit 1
121
+ fi
122
+ exit 0
@@ -0,0 +1,82 @@
1
+ #!/usr/bin/env bash
2
+ # update-check.sh - cached advisory version check (Phase 0 Step 0.6)
3
+ #
4
+ # Compares the locally installed pipeline version against the npm registry's
5
+ # dist-tags.latest. Cached with a TTL so at most one network call per TTL
6
+ # window; the call is bounded by a short timeout and every failure path is
7
+ # silent - this script NEVER blocks or fails the pipeline.
8
+ #
9
+ # stdout: "<local>|<latest>" when a newer version exists, nothing otherwise.
10
+ # Exit code: always 0 (advisory contract).
11
+ #
12
+ # Usage:
13
+ # bash pipeline/scripts/update-check.sh # auto-detect local version
14
+ # bash pipeline/scripts/update-check.sh --local 10.8.0 # explicit local version
15
+ # bash pipeline/scripts/update-check.sh --ttl-hours 24 # cache window (default 24)
16
+ # bash pipeline/scripts/update-check.sh --force # ignore cache
17
+ #
18
+ # Cache file: ~/.claude/logs/multi-agent/.update-check ("epoch|latest").
19
+ # Registry read is a plain curl - never `npm view` (a user-level .npmrc scope
20
+ # mapping can silently reroute npm to a different registry; curl cannot lie).
21
+
22
+ set -euo pipefail
23
+
24
+ PKG="@mmerterden/multi-agent-pipeline"
25
+ REGISTRY_URL="https://registry.npmjs.org/${PKG/\//%2F}"
26
+ CACHE_FILE="${UPDATE_CHECK_CACHE:-$HOME/.claude/logs/multi-agent/.update-check}"
27
+ TTL_HOURS=24
28
+ LOCAL_VERSION=""
29
+ FORCE=0
30
+
31
+ while [ $# -gt 0 ]; do
32
+ case "$1" in
33
+ --local) LOCAL_VERSION="${2:-}"; shift 2 ;;
34
+ --ttl-hours) TTL_HOURS="${2:-24}"; shift 2 ;;
35
+ --force) FORCE=1; shift ;;
36
+ *) shift ;;
37
+ esac
38
+ done
39
+
40
+ # Local version: explicit arg, else read from the pipeline repo clone.
41
+ if [ -z "$LOCAL_VERSION" ]; then
42
+ for candidate in "$HOME/multi-agent-pipeline" "$HOME/dev/multi-agent-pipeline" "$HOME/projects/multi-agent-pipeline"; do
43
+ if [ -f "$candidate/package.json" ]; then
44
+ LOCAL_VERSION=$(node -p "require('$candidate/package.json').version" 2>/dev/null || true)
45
+ [ -n "$LOCAL_VERSION" ] && break
46
+ fi
47
+ done
48
+ fi
49
+ [ -z "$LOCAL_VERSION" ] && exit 0 # cannot determine local version -> silent no-op
50
+
51
+ now=$(date +%s)
52
+ latest=""
53
+
54
+ # Fresh cache?
55
+ if [ "$FORCE" -eq 0 ] && [ -f "$CACHE_FILE" ]; then
56
+ cached_epoch=$(cut -d'|' -f1 "$CACHE_FILE" 2>/dev/null || echo 0)
57
+ cached_latest=$(cut -d'|' -f2 "$CACHE_FILE" 2>/dev/null || echo "")
58
+ case "$cached_epoch" in (*[!0-9]*|"") cached_epoch=0 ;; esac
59
+ if [ $((now - cached_epoch)) -lt $((TTL_HOURS * 3600)) ] && [ -n "$cached_latest" ]; then
60
+ latest="$cached_latest"
61
+ fi
62
+ fi
63
+
64
+ # Stale or missing cache -> one bounded registry call (silent on any failure).
65
+ if [ -z "$latest" ]; then
66
+ latest=$(curl -sm 3 "$REGISTRY_URL" 2>/dev/null \
67
+ | { command -v jq >/dev/null 2>&1 && jq -r '."dist-tags".latest // empty' \
68
+ || sed -n 's/.*"latest":"\([^"]*\)".*/\1/p'; } | head -1) || true
69
+ [ -z "$latest" ] && exit 0
70
+ mkdir -p "$(dirname "$CACHE_FILE")" 2>/dev/null || exit 0
71
+ printf '%s|%s\n' "$now" "$latest" > "$CACHE_FILE" 2>/dev/null || true
72
+ fi
73
+
74
+ [ "$latest" = "$LOCAL_VERSION" ] && exit 0
75
+
76
+ # Update available only when latest sorts strictly ABOVE local (a dev machine
77
+ # running ahead of the registry must not see an "update" prompt).
78
+ highest=$(printf '%s\n%s\n' "$LOCAL_VERSION" "$latest" | sort -t. -k1,1n -k2,2n -k3,3n | tail -1)
79
+ if [ "$highest" = "$latest" ]; then
80
+ printf '%s|%s\n' "$LOCAL_VERSION" "$latest"
81
+ fi
82
+ exit 0