@mmerterden/multi-agent-pipeline 10.8.0 → 10.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +39 -0
- package/README.md +2 -0
- package/docs/engineering.md +76 -0
- package/docs/features.md +40 -36
- package/package.json +1 -1
- package/pipeline/commands/multi-agent/refs/phases/phase-0-init.md +9 -0
- package/pipeline/commands/multi-agent/refs/phases/phase-3-dev.md +10 -0
- package/pipeline/commands/multi-agent/refs/phases/phase-4-review.md +2 -23
- package/pipeline/schemas/prefs.schema.json +24 -0
- package/pipeline/schemas/token-budget.json +7 -7
- package/pipeline/scripts/README.md +1 -0
- package/pipeline/scripts/fixtures/install-layout.tsv +2 -2
- package/pipeline/scripts/smoke-update-check.sh +122 -0
- package/pipeline/scripts/update-check.sh +82 -0
package/CHANGELOG.md
CHANGED
|
@@ -14,6 +14,45 @@ Internal file-layout changes that don't affect the slash-command surface are sti
|
|
|
14
14
|
|
|
15
15
|
---
|
|
16
16
|
|
|
17
|
+
## [10.9.0] - 2026-07-03
|
|
18
|
+
|
|
19
|
+
### Added
|
|
20
|
+
|
|
21
|
+
- **Update check at run start (Phase 0 Step 0.6, opt-out via
|
|
22
|
+
`prefs.global.updateCheck`).** Once per `ttlHours` window (default 24h,
|
|
23
|
+
cached, 3s-bounded curl straight to the npm registry - never `npm view`),
|
|
24
|
+
the pipeline compares the installed version against `dist-tags.latest`.
|
|
25
|
+
Newer version found: interactive modes ask ONE question ("Update now?" ->
|
|
26
|
+
yes runs the `/multi-agent:update` flow and the run continues; the update
|
|
27
|
+
takes full effect next session); autopilot logs one line and never asks
|
|
28
|
+
(zero-interaction contract). `autoUpdate: true` skips the question and
|
|
29
|
+
updates before the worktree is created. Offline, ahead-of-registry, and
|
|
30
|
+
failed checks are silent - the step never blocks. New
|
|
31
|
+
`pipeline/scripts/update-check.sh` + `smoke-update-check.sh` (13 assertions).
|
|
32
|
+
- **Design fidelity contract (Phase 3, BLOCKING on Figma-referenced tasks).**
|
|
33
|
+
The analysis doc's captured frames are the 1:1 reference for every UI line:
|
|
34
|
+
a Code Connect-mapped component must be used verbatim (no sound-alikes,
|
|
35
|
+
missing modifiers go into `+Modifiers`, never forks), unmapped UI becomes a
|
|
36
|
+
NEW component under the standard architecture, and inter-component spacing
|
|
37
|
+
comes from the design's measured values mapped to tokens - never invented.
|
|
38
|
+
Missing design data halts with a re-run-analysis instruction; Figma MCP
|
|
39
|
+
stays forbidden outside analysis.
|
|
40
|
+
- **`docs/engineering.md`** - version-free catalog of the discipline layer
|
|
41
|
+
(bounded loops, evidence gates, token-budgeted phase docs, immutable tests,
|
|
42
|
+
fresh-context handoffs, memory hedges, install hygiene), linked from README.
|
|
43
|
+
|
|
44
|
+
### Changed
|
|
45
|
+
|
|
46
|
+
- **Docs drop inline version markers.** `docs/features.md` headers and bullets
|
|
47
|
+
no longer carry `(vX.Y.Z)` tags - the docs describe the current state;
|
|
48
|
+
history lives in this CHANGELOG. Same principle applied to the website copy.
|
|
49
|
+
- **Phase-doc token budgets recalibrated** (`token-budget.json`, total
|
|
50
|
+
46500 -> 50000) after the verify-by-test, update-check, immutable-test, and
|
|
51
|
+
design-fidelity contracts landed - Step 3.7 prose was compressed to a
|
|
52
|
+
pointer into `refs/features/verify-by-test.md` before recalibration.
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
17
56
|
## [10.8.0] - 2026-07-02
|
|
18
57
|
|
|
19
58
|
Three additive loop-quality features, all sourced from the 2026 agentic-loop research sweep (Anthropic long-running-agent harness guidance + adversarial-review findings): empirical verification of blocking findings, an immutable-test contract, and structured phase-boundary handoffs.
|
package/README.md
CHANGED
|
@@ -51,6 +51,8 @@ One command runs 8 phases, with a gate between the risky ones:
|
|
|
51
51
|
|
|
52
52
|
Under the hood: each task runs in its own **git worktree** (or the current branch with `:local`), commits use the **git identity routed from the repo's origin URL**, and **multi-repo** tasks get per-repo worktrees plus an integration build. Tokens stay in the OS keychain; nothing is committed or logged. `/multi-agent:review` can also review an existing GitHub/Bitbucket PR — per-finding inline comments anchored to `file:line` + an explicit Approve / Needs-Work state.
|
|
53
53
|
|
|
54
|
+
The discipline behind all of this — bounded loops, evidence gates, token-budgeted phase docs, immutable tests, fresh-context handoffs — is catalogued in [docs/engineering.md](./docs/engineering.md). The full feature list lives in [docs/features.md](./docs/features.md).
|
|
55
|
+
|
|
54
56
|
## Modes
|
|
55
57
|
|
|
56
58
|
| Mode | Command | Flow |
|
|
@@ -0,0 +1,76 @@
|
|
|
1
|
+
# Engineering Discipline
|
|
2
|
+
|
|
3
|
+
The pipeline's output quality comes less from any single model call and more from the constraints wrapped around every call. This page lists those constraints - the internals that keep long agentic runs correct, cheap, and auditable. No version markers here; this document always describes the current state (history lives in the [CHANGELOG](../CHANGELOG.md)).
|
|
4
|
+
|
|
5
|
+
## Loop discipline
|
|
6
|
+
|
|
7
|
+
Every loop in the pipeline has a deterministic exit condition and a hard iteration cap. Nothing "keeps trying".
|
|
8
|
+
|
|
9
|
+
- **TDD micro-loop** (Phase 3): red -> green -> refactor per task; build-fix retries capped at 3 attempts.
|
|
10
|
+
- **Dev-critic loop** (Phase 3.5, opt-in): generator vs critic on deterministic criteria; max 2 iterations, round 3 escalates to the user.
|
|
11
|
+
- **Review-fix macro-loop** (Phase 3 <-> 4): only triage-accepted blocking findings re-enter development; max 3 iterations, then hard-kill and escalate.
|
|
12
|
+
- **Disagreement round** (Phase 4, opt-in): when reviewers split, exactly one rebuttal round - never more.
|
|
13
|
+
- **Global rule**: any retry loop stops after 3 attempts and asks the user. No exceptions, no infinite loops.
|
|
14
|
+
|
|
15
|
+
## Evidence over claims
|
|
16
|
+
|
|
17
|
+
- **Default-FAIL evidence gate**: a "build passed" or "tests passed" claim is rejected unless the captured log actually shows success. Missing, empty, or failure-marked logs fail CLOSED.
|
|
18
|
+
- **Verify-by-test** (opt-in): an accepted blocking review finding can be empirically confirmed - a verifier agent writes a minimal repro test and runs it. Test fails as predicted: the finding stands and the failing test is handed to the rework loop as its RED test. Test passes: the finding is downgraded, and even the downgrade must pass the evidence gate.
|
|
19
|
+
- **Deterministic validator gates**: triage output, reviewer output, diff-risk output, and agent state are schema-validated by zero-dependency scripts whose exit codes decide the flow - one self-correction rework, then halt with a recovery hint. The LLM never grades its own homework.
|
|
20
|
+
|
|
21
|
+
## Test integrity
|
|
22
|
+
|
|
23
|
+
- **Immutable tests**: deleting, renaming, or weakening an existing test to get a green run is a rule violation. A test changes only when the task changes the spec it encodes, named in the commit body.
|
|
24
|
+
- **Deterministic backstop**: the diff-risk scorer flags any test file whose diff removes more lines than it adds (`test_lines_removed`), so a shrinking test suite surfaces to reviewers even if the rule is ignored.
|
|
25
|
+
- **Test-gap detection**: newly added public symbols with no paired test are reported before the test phase.
|
|
26
|
+
|
|
27
|
+
## Context management
|
|
28
|
+
|
|
29
|
+
- **Token-budgeted phase docs**: every phase document has a per-file and total token budget enforced by a smoke gate (`token-budget.json`). Growth is compressed first; budgets are recalibrated only when real contract text lands. This keeps the orchestrator's working context lean across an 8-phase run.
|
|
30
|
+
- **Lazy loading**: only the active phase's document is in context; references load on demand.
|
|
31
|
+
- **Structured handoff blocks**: at every phase boundary the orchestrator appends a Done / Remaining / Decisions / Open findings / Next block to the run log - written from state it already holds, no extra model call. Resume and post-compaction re-entry read the latest handoff plus the state file, never the conversation history.
|
|
32
|
+
- **Proactive compaction**: past ~50% context usage the run compacts itself and re-grounds from durable artifacts instead of waiting for lossy auto-compaction.
|
|
33
|
+
|
|
34
|
+
## Review architecture
|
|
35
|
+
|
|
36
|
+
- **Deterministic gates before AI review**: build, lint, tests, and secret scan must be green before a single reviewer token is spent.
|
|
37
|
+
- **Cross-model parallel review**: independent reviewers with different vantage points, then a triage pass that filters false positives and out-of-scope noise. Reviewer findings are raw signals, not commands - nothing auto-loops on a "blocking" tag.
|
|
38
|
+
- **Consensus surfacing**: unanimous agreement between same-base-model reviewers on judgment-heavy surfaces is flagged as unverified instead of being trusted as independent confirmation.
|
|
39
|
+
- **Diff risk scoring**: a sub-second, zero-LLM heuristic ranks changed files (security paths, migrations, missing tests, shrinking tests) and feeds reviewers a priority hint - advisory, never a gate.
|
|
40
|
+
- **Prompt-cache-aware dispatch**: reviewer and triage prompts share a byte-identical leading block so parallel dispatches hit the prompt cache instead of re-billing the diff.
|
|
41
|
+
|
|
42
|
+
## Design fidelity
|
|
43
|
+
|
|
44
|
+
- **Single source of design truth**: Figma is fetched once, during analysis, through a 3-tier fallback chain (MCP -> REST -> user screenshot). Every later phase reads the frozen analysis artifact; calling Figma from dev or review phases is a contract violation enforced by a smoke gate.
|
|
45
|
+
- **1:1 implementation contract**: a Code Connect-mapped component must be used verbatim (no sound-alike substitutes; missing modifiers are added to the component, never forked). Unmapped UI becomes a new component under the standard architecture. Spacing between components comes from the design's measured values mapped to tokens - never invented numbers.
|
|
46
|
+
|
|
47
|
+
## Cost transparency
|
|
48
|
+
|
|
49
|
+
- **Per-phase token ledger**: every billable call forwards token counts to the tracker; each run ends with a cost table naming the top cost driver and pricing prompt-cache reads at their discounted rate.
|
|
50
|
+
- **Budget ceilings**: a cost budget can cap a run; crossing it triggers model downgrade or halt rather than silent overspend.
|
|
51
|
+
- **Model fallback contract**: premium-tier dispatches degrade deterministically (top tier -> mid -> base) on dispatch errors, date gates, or budget ceilings - never by improvisation.
|
|
52
|
+
|
|
53
|
+
## Memory without bias
|
|
54
|
+
|
|
55
|
+
- **Learnings ledger**: durable per-repo facts and rejected review preferences replay into analysis and triage, so agents stop re-discovering and reviewers stop re-flagging the same things.
|
|
56
|
+
- **Hedged injection**: every memory injection carries an explicit "context, not command" hedge, and blocking-severity rejections are never memorized - a wrong rejection must not permanently suppress a finding class.
|
|
57
|
+
- **Prior-art caps**: triage sees at most a fixed number of highest-similarity past findings; memory is advisory context, not a verdict multiplier.
|
|
58
|
+
|
|
59
|
+
## Update and install hygiene
|
|
60
|
+
|
|
61
|
+
- **Wipe-then-copy install**: pipeline-owned directories are cleared before copying, so files deleted upstream never linger as ghost commands or dead scripts on user machines.
|
|
62
|
+
- **Layout fingerprint**: a fixture pins the installed file layout; structural drift fails the suite.
|
|
63
|
+
- **Advisory update check**: at run start (cached, bounded, offline-silent) the pipeline compares its installed version with the registry; interactive runs offer a one-question update, autopilot only logs. It never blocks and never self-modifies mid-run.
|
|
64
|
+
- **Token-preserving uninstall**: removing the pipeline never touches the user's stored credentials.
|
|
65
|
+
|
|
66
|
+
## Security and hygiene
|
|
67
|
+
|
|
68
|
+
- **Secret scanning**: staged diffs are scanned with provider-prefix patterns plus Shannon-entropy detection before every commit; findings block.
|
|
69
|
+
- **Personal-data genericization**: the public repo is generated from the private source with a scrub pass (names, hosts, project keys) and a zero-hit grep gate before push.
|
|
70
|
+
- **Skill scanner**: tiered pattern scanning over skill content guards against supply-chain surprises.
|
|
71
|
+
- **Intent guard**: a deterministic classifier separates "answer my question" from "change my code", so a conceptual question never spawns a worktree.
|
|
72
|
+
|
|
73
|
+
## Cross-CLI parity
|
|
74
|
+
|
|
75
|
+
- **One source of truth, two runtimes**: Claude Code and Copilot CLI command surfaces are byte-identical where shared; a parity smoke fails on drift.
|
|
76
|
+
- **Portability gates**: every shell script passes `bash -n` and a GNU/BSD portability grep; the CI matrix runs macOS, Linux, and Windows.
|
package/docs/features.md
CHANGED
|
@@ -43,7 +43,7 @@ Compose freely: `--dev --local autopilot` = shortest, least-friction path.
|
|
|
43
43
|
|
|
44
44
|
Build commands, test runners, lint tools, and review focus areas all adapt to the detected stack.
|
|
45
45
|
|
|
46
|
-
### Stack Selection (marketplace plugins
|
|
46
|
+
### Stack Selection (marketplace plugins)
|
|
47
47
|
|
|
48
48
|
Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketplace. Selecting a stack enables the matching plugin(s) in the current repo's `.claude/settings.json` `enabledPlugins` — no skill copying, no session restart tricks, no directory shuffling. The `ai-common-engineering-toolkit` (accessibility audit, humanizer, Firebase) is always enabled alongside the stack plugin.
|
|
49
49
|
|
|
@@ -57,7 +57,7 @@ Stack skill sets ship as versioned plugins in the `multi-agent-plugins` marketpl
|
|
|
57
57
|
/multi-agent:stack all # every stack plugin
|
|
58
58
|
```
|
|
59
59
|
|
|
60
|
-
### Task Type Detection
|
|
60
|
+
### Task Type Detection
|
|
61
61
|
|
|
62
62
|
Phase 0 Step 9 classifies every task before Phase 1 starts. Deterministic priority order: Figma URL → instruction file path → git diff heuristic → Jira issue type → branch name → description keywords → user prompt (autopilot defaults to `feature`).
|
|
63
63
|
|
|
@@ -71,26 +71,26 @@ Result persisted to `agent-state.taskType`:
|
|
|
71
71
|
| `refactor` | Phase 4 emphasizes behavior preservation; Phase 6 uses `refactor(...)` prefix |
|
|
72
72
|
| `chore` | Lightweight flow; Phase 6 uses `chore(...)` prefix |
|
|
73
73
|
|
|
74
|
-
### SubPhase Convention
|
|
74
|
+
### SubPhase Convention
|
|
75
75
|
|
|
76
|
-
When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7
|
|
76
|
+
When a specialized skill takes over a main pipeline phase, progress is reported as SubPhases (e.g. `SubPhase 3.0: Init`, `SubPhase 3.1: Gather`). The top-level pipeline stays fixed at 8 phases (0-7) — specialized work slots into its parent phase without inflating the count.
|
|
77
77
|
|
|
78
78
|
## PR & Review Flow
|
|
79
79
|
|
|
80
|
-
### Default Reviewers
|
|
80
|
+
### Default Reviewers
|
|
81
81
|
|
|
82
82
|
- **Bitbucket**: Fetches via `/rest/default-reviewers/1.0/`. Every PUT must re-send `reviewers`, `fromRef`, `toRef`, `draft` (regression guarded by smoke test).
|
|
83
83
|
- **GitHub**: Honors `CODEOWNERS` + falls back to `prefs.projects[].githubDefaultReviewers`.
|
|
84
84
|
- PR author always filtered out (Bitbucket: 409, GitHub: GraphQL error).
|
|
85
85
|
|
|
86
|
-
### Draft vs Ready Prompt
|
|
86
|
+
### Draft vs Ready Prompt
|
|
87
87
|
|
|
88
88
|
Phase 6 asks `DRAFT or READY?` before creating the PR and persists the choice in `prefs.projects[].defaultPrMode`.
|
|
89
89
|
|
|
90
90
|
- Bitbucket: `draft: true` flag (DC 8.x+) with `[DRAFT]` title fallback for older servers.
|
|
91
91
|
- GitHub: `gh pr create --draft` + `gh pr ready` for promotion.
|
|
92
92
|
|
|
93
|
-
### `channels` Command
|
|
93
|
+
### `channels` Command
|
|
94
94
|
|
|
95
95
|
Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post-hoc for fixes closed outside the pipeline:
|
|
96
96
|
|
|
@@ -102,9 +102,9 @@ Multi-channel reporter — Phase 7 delegates to it, and it's also invocable post
|
|
|
102
102
|
/multi-agent:channels ABC-1234 --channels jira,confluence --content test
|
|
103
103
|
```
|
|
104
104
|
|
|
105
|
-
Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the
|
|
105
|
+
Multi-select **channels** (Jira / Confluence / Wiki / PR description) × multi-select **content** (normal analysis / test scenarios / auto-diff summary / manual note). Each body runs through the humanizer skill per-channel. Bitbucket PR updates use the reviewer-preserving PUT pattern (title + description + reviewers + fromRef + toRef + version mandatory). Replaces the earlier `enrich` command — all its capabilities (diff auto-summarize, manual mode, reviewer-preserving) are preserved; Confluence + Wiki are new.
|
|
106
106
|
|
|
107
|
-
### Body Preservation Contract (
|
|
107
|
+
### Body Preservation Contract (smoke-verified)
|
|
108
108
|
|
|
109
109
|
Every external-system body (PR description, Jira comment, GitHub issue) uses `jq -n --rawfile body body.md '{description: $body}'` → `curl --data-binary @payload.json`. No literal `\n` strings, no HTML entities (`&`, `<`, `"`). UTF-8 preserved end-to-end. `scripts/smoke-add-detail.sh` runs 14 contract assertions.
|
|
110
110
|
|
|
@@ -137,7 +137,7 @@ The reviewer set is **CLI-aware**: Claude Code dispatches 2 reviewers in paralle
|
|
|
137
137
|
|
|
138
138
|
**Fable Triage** (Phase 4 Step 3, Opus on Copilot CLI): Evaluates merged raw findings against task scope. Classifies each as `accepted` (fix now), `deferred` (out of scope, log for later), or `rejected` (false positive / noise). Only triage-accepted blocking items loop back to Phase 3.
|
|
139
139
|
|
|
140
|
-
### Runtime Triage Validator
|
|
140
|
+
### Runtime Triage Validator
|
|
141
141
|
|
|
142
142
|
After triage returns, output is validated by `validate-triage.mjs`:
|
|
143
143
|
|
|
@@ -148,19 +148,23 @@ After triage returns, output is validated by `validate-triage.mjs`:
|
|
|
148
148
|
| **2** | Over-rejection guard tripped — pause for human |
|
|
149
149
|
| **3** | Contradiction auto-corrected — proceed with corrected output |
|
|
150
150
|
|
|
151
|
-
### Bidirectional Approved↔Blocking Auto-Correction
|
|
151
|
+
### Bidirectional Approved↔Blocking Auto-Correction
|
|
152
152
|
|
|
153
|
-
If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with `if`/`then` constraint in schema
|
|
153
|
+
If triage returns `approved: false` but has no blocking items, the validator forces `approved: true`. Conversely, if `approved: true` but blocking items exist, it forces `approved: false`. Hardened with an `if`/`then` constraint in the schema itself.
|
|
154
154
|
|
|
155
|
-
### Verify-by-Test Triage (Phase 4 Step 3.7,
|
|
155
|
+
### Verify-by-Test Triage (Phase 4 Step 3.7, opt-in)
|
|
156
156
|
|
|
157
157
|
A triage verdict is a judgment call; a failing repro test is proof. When `prefs.global.verifyByTest.enabled` is on, one verifier agent (default Sonnet) writes a minimal repro test per accepted blocking finding (cap: `maxFindings`=3) and runs only that test. Fails as predicted -> finding confirmed, the repro test becomes the Phase 3 rework RED test. Passes under `evidence-gate.mjs` -> finding downgraded to `deferred`. Compile error / timeout -> `inconclusive`, judgment stands. Timeout-bounded, never blocks. Full spec: `refs/features/verify-by-test.md`.
|
|
158
158
|
|
|
159
|
-
### Immutable-Test Rule + `test_lines_removed` Signal
|
|
159
|
+
### Immutable-Test Rule + `test_lines_removed` Signal
|
|
160
160
|
|
|
161
161
|
Existing tests are immutable during a task: deleting, renaming, or weakening an assertion to reach green is a violation (`refs/rules.md`, Phase 3 GREEN step). A test changes only when the task changes the spec it encodes, named in the commit body. Deterministic backstop: `diff-risk-score.mjs` emits `test_lines_removed` (w=3.0) for any test-classified file whose diff removes more lines than it adds.
|
|
162
162
|
|
|
163
|
-
###
|
|
163
|
+
### Update Check at Run Start
|
|
164
|
+
|
|
165
|
+
Phase 0 Step 0.6, opt-out via `prefs.global.updateCheck`. Once per `ttlHours` window (cached, 3s-bounded curl to the npm registry), the installed version is compared against `dist-tags.latest`. Newer version: interactive modes ask "Update now?" (yes runs the `/multi-agent:update` flow, then the run continues); autopilot logs one line and never asks. `autoUpdate: true` updates silently before the worktree exists. Offline and failed checks are silent - never blocks.
|
|
166
|
+
|
|
167
|
+
### Structured Handoff Blocks
|
|
164
168
|
|
|
165
169
|
Every phase transition appends a `## Handoff` block (Done / Remaining / Decisions / Open findings / Next) to `agent-log.md` - orchestrator-written from existing state, no LLM call. `/multi-agent:resume` and post-`/compact` re-grounding read the latest handoff first, so long runs re-enter from durable artifacts instead of conversation memory (fresh-context discipline from Anthropic's long-running-agent harness guidance).
|
|
166
170
|
|
|
@@ -174,7 +178,7 @@ If changes include UI files, reviewers check for:
|
|
|
174
178
|
|
|
175
179
|
Pure code analysis — no simulator needed. Device-level audits run in Phase 5 when requested.
|
|
176
180
|
|
|
177
|
-
### Status Enforcement
|
|
181
|
+
### Status Enforcement
|
|
178
182
|
|
|
179
183
|
Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation verify step that re-reads the field and retries once on silent `VALIDATION` failures (e.g. stale Projects V2 option IDs after a board rebuild).
|
|
180
184
|
|
|
@@ -187,49 +191,49 @@ Phase 3 marks issue-tracker status update as MANDATORY with a post-mutation veri
|
|
|
187
191
|
|
|
188
192
|
## Testing & Quality
|
|
189
193
|
|
|
190
|
-
### Schema-Validated State
|
|
194
|
+
### Schema-Validated State
|
|
191
195
|
|
|
192
196
|
All critical state files are schema-validated at read and write time:
|
|
193
197
|
|
|
194
198
|
- `agent-state.schema.json` — validates `$HOME/.claude/logs/multi-agent/.../agent-state.json`
|
|
195
|
-
- `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json`
|
|
196
|
-
- `triage-output.schema.json` — validates triage output (
|
|
199
|
+
- `prefs.schema.json` — validates `$HOME/.claude/multi-agent-preferences.json`
|
|
200
|
+
- `triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
|
|
197
201
|
|
|
198
|
-
### Smoke Test Suites
|
|
202
|
+
### Smoke Test Suites
|
|
199
203
|
|
|
200
204
|
10 suites: add-detail (14 body-preservation assertions), review-triage (validator exit codes), prefs (schema round-trip), state (agent-state lifecycle), metrics (telemetry emission), sync (instruction parity), secret-scan (hook patterns), phase-banner (terminal UI), token-budget (per-phase limits), phase-tracker (progress tracking).
|
|
201
205
|
|
|
202
|
-
### Adversarial Eval Fixtures
|
|
206
|
+
### Adversarial Eval Fixtures
|
|
203
207
|
|
|
204
208
|
10 fixtures that test triage resilience against adversarial reviewer output: over-rejection, hallucinated findings, contradictions, invalid JSON, schema violations, duplicate findings, scope creep, empty results, timeout simulation, and combined edge cases.
|
|
205
209
|
|
|
206
|
-
### Sync Parity Check
|
|
210
|
+
### Sync Parity Check
|
|
207
211
|
|
|
208
212
|
Detects drift between Claude Code instructions (`~/.claude/commands/multi-agent.md`), Copilot CLI instructions (`~/.copilot/copilot-instructions.md`), and the repo's pipeline spec files. Reports discrepancies during Phase 0 Init.
|
|
209
213
|
|
|
210
|
-
### Token Budget Enforcement
|
|
214
|
+
### Token Budget Enforcement
|
|
211
215
|
|
|
212
216
|
Per-phase token budgets prevent runaway sessions. If a phase exceeds its budget, the pipeline pauses and offers: continue (extend budget), skip phase, or abort. Budgets are configurable in `prefs.global.tokenBudgets`.
|
|
213
217
|
|
|
214
218
|
## Telemetry & Observability
|
|
215
219
|
|
|
216
|
-
- **Pipeline Metrics
|
|
217
|
-
- **Cost Telemetry
|
|
218
|
-
- **Phase Tracker
|
|
219
|
-
- **Phase Banner
|
|
220
|
-
- **Per-task Cost Breakdown in agent-log.md
|
|
220
|
+
- **Pipeline Metrics**: Structured metrics to `metrics.jsonl` via `log-metric.sh`. Aggregated by `aggregate-metrics.mjs`.
|
|
221
|
+
- **Cost Telemetry**: Per-phase token cost tracking (`tokens_in`, `tokens_out`, `model`, `duration_ms`). Omitted fields handled gracefully.
|
|
222
|
+
- **Phase Tracker**: Cross-CLI visual progress (current phase, elapsed time, iteration count).
|
|
223
|
+
- **Phase Banner**: Terminal UI for phase transitions with Unicode box-drawing characters.
|
|
224
|
+
- **Per-task Cost Breakdown in agent-log.md**: Phase 7 appends a 4-column block (Phase · Model · Tokens in/out · Est. USD) to every run's `agent-log.md`. Sourced from `phase-tracker.sh tokens` accumulators × `cost-table.json` prices. Independent of the channels-side `reportContent.costSummary` toggle. The `LOG_METRIC_FORWARD_TO_TRACKER=1` env flag mirrors `tokens_in`/`tokens_out`/`model` from `log-metric.sh` into the tracker so JSONL metrics and the cost block stay in sync from one call site.
|
|
221
225
|
|
|
222
|
-
### Diff Risk Scoring
|
|
226
|
+
### Diff Risk Scoring
|
|
223
227
|
|
|
224
228
|
`pipeline/scripts/diff-risk-score.mjs` runs at Phase 4 Step 1.75 — before reviewer dispatch. Heuristic, deterministic, sub-second, no LLM. Top-N risk-ranked files inject into each reviewer's prompt as a `${PRIORITY_FILES}` block; reviewers read those files first but still review the entire diff.
|
|
225
229
|
|
|
226
|
-
Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (
|
|
230
|
+
Signals + weights: `security_path` ×3, `migration` ×4, `public_api` ×2, `no_test_change` ×2.5, `test_lines_removed` ×3 (test file shrinks - immutable-test backstop), `complexity_delta` ×1.5, `ui_critical` ×1.5, `loc_changed` ×1. Toggle via `prefs.global.diffRiskAdvisory` (default ON).
|
|
227
231
|
|
|
228
|
-
### Test Gap Detection
|
|
232
|
+
### Test Gap Detection
|
|
229
233
|
|
|
230
234
|
`pipeline/scripts/test-gap-scan.mjs` runs at Phase 5 Step 0. Walks the diff for newly added public symbols and reports those with no paired test. Stack-specific rules ship for iOS, Android, Python, Node.js. iOS Views and Android `@Composable` symbols default to `important`; other public API additions to `suggestion`. Optional gating via `prefs.testGap.blockingThreshold` — when set, the report becomes a Phase 4 rework finding once `important + blocking` count exceeds the threshold.
|
|
231
235
|
|
|
232
|
-
### Triage Memory
|
|
236
|
+
### Triage Memory
|
|
233
237
|
|
|
234
238
|
Per-repo append-only JSONL corpus at `~/.claude/memory/multi-agent/<repo-slug>/triage-corpus.jsonl`. Phase 7 ingests every triage output (idempotent), Phase 1 enriches the analysis with similar past tasks, Phase 4 triage attaches prior-art hits to each raw finding with an explicit bias hedge. Token-overlap recall, zero deps, Node-18-compatible. `/multi-agent:search "<text>" --semantic` routes the query to the corpus instead of agent-log grep. Toggle via `prefs.global.priorArtEnrichment.enabled` (default ON).
|
|
235
239
|
|
|
@@ -280,10 +284,10 @@ Token registry maps logical names (`jira`, `bitbucket`, `github`, `confluence`)
|
|
|
280
284
|
|
|
281
285
|
## Schemas & Validation
|
|
282
286
|
|
|
283
|
-
- `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle
|
|
284
|
-
- `pipeline/schemas/prefs.schema.json` — validates preferences
|
|
285
|
-
- `pipeline/schemas/triage-output.schema.json` — validates triage output (
|
|
287
|
+
- `pipeline/schemas/agent-state.schema.json` — validates agent state lifecycle
|
|
288
|
+
- `pipeline/schemas/prefs.schema.json` — validates preferences
|
|
289
|
+
- `pipeline/schemas/triage-output.schema.json` — validates triage output (contradiction `if`/`then` constraint built in)
|
|
286
290
|
|
|
287
|
-
## File Layout
|
|
291
|
+
## File Layout
|
|
288
292
|
|
|
289
293
|
`commands/multi-agent/{add-detail,setup,help}.md` → slash commands. `commands/multi-agent/refs/**` → guides, phase specs, rules (`refs:` prefix signals "not an action"). Modifier flags and ops stay inline in `multi-agent.md`.
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@mmerterden/multi-agent-pipeline",
|
|
3
|
-
"version": "10.
|
|
3
|
+
"version": "10.9.0",
|
|
4
4
|
"description": "8-phase AI development pipeline with full orchestration on Claude Code and Copilot CLI. Analysis, planning, TDD, CLI-aware parallel review with consensus surfacing + Fable triage, default-FAIL evidence gates, secret + intent guards, per-phase cost ledger, persistent learnings memory, wiki generation, commit automation. Token-preserving uninstall.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"main": "index.js",
|
|
@@ -98,6 +98,15 @@ Log the resolved tier in the agent log:
|
|
|
98
98
|
|
|
99
99
|
Full chain definition, REST endpoints, URL parsing, canonical-component contract: `pipeline/rules/figma-pipeline.md` "MUST: Figma access - 3-tier fallback chain (BLOCKING, pipeline-wide)". Do not duplicate it here.
|
|
100
100
|
|
|
101
|
+
#### Step 0.6 - Update check (advisory, opt-out via `prefs.global.updateCheck.enabled`)
|
|
102
|
+
|
|
103
|
+
Run `bash pipeline/scripts/update-check.sh` (cached per `updateCheck.ttlHours`, default 24h; 3s-bounded; every failure path silent). Empty output → continue. Output `<local>|<latest>` means a newer version exists:
|
|
104
|
+
|
|
105
|
+
- **Interactive**: log `→ update available: v<local> -> v<latest>`, ask ONE AskUserQuestion - **Update now** (recommended) / **Continue without updating**. On *Update now* (or `updateCheck.autoUpdate: true`, which skips the question): run the `/multi-agent:update` flow, log `→ updated to v<latest>`, continue the run (note in the log: already-loaded phase docs finish this run on the old version; full effect next session). On *Continue*: no re-ask until the TTL expires.
|
|
106
|
+
- **Autopilot**: never ask (zero-interaction contract). Log `→ update available: ... (log-only; run /multi-agent:update)` and continue - unless `autoUpdate: true`, then update silently first.
|
|
107
|
+
|
|
108
|
+
Advisory - NEVER blocks. Must run BEFORE Step 6 (worktree creation) so an accepted update cannot mutate `~/.claude` under a mid-phase run.
|
|
109
|
+
|
|
101
110
|
#### Step 1 - Parse Input
|
|
102
111
|
|
|
103
112
|
**Branch from input**: If the user provided a branch name after the issue reference (space-separated), store as `baseBranch` and skip Step 3. Otherwise Step 3 asks interactively.
|
|
@@ -49,6 +49,16 @@ When Phase 0 Step 7 classified the task as `component`, Phase 3 delegates the en
|
|
|
49
49
|
|
|
50
50
|
For non-component taskTypes (`bugfix`, `feature`, `refactor`, `chore`), continue with the standard TDD section below.
|
|
51
51
|
|
|
52
|
+
#### Design fidelity contract (BLOCKING when the task carries a Figma reference)
|
|
53
|
+
|
|
54
|
+
Applies to EVERY task whose analysis doc carries design ground truth (Section 6 rows, node IDs, screenshots) - component-dispatch AND TDD-path UI work alike. MCP stays forbidden here (analysis-only rule); the analysis doc IS the design:
|
|
55
|
+
|
|
56
|
+
1. **1:1, not interpretation.** The captured frames are the reference for every UI line. Layout, variant states, copy, and colours come from the analysis doc's measured values - never eyeballed, never "close enough".
|
|
57
|
+
2. **Code Connect decides the component.** A mapped component (`CodeConnectSnippet` / repo `*.figma.swift` / `*.figma.kt` row) MUST be used verbatim - sound-alike substitutes are forbidden; a missing modifier is added to the component's `+Modifiers` extension, never forked. No mapping exists -> write a NEW component per the Configuration/View/Modifiers architecture; never inline ad-hoc UI into the consumer screen.
|
|
58
|
+
3. **Inter-component spacing is part of the design.** Gaps, paddings, and alignment BETWEEN components must match the frame's measured values, mapped to spacing tokens (`Spacing*`) - never invented numbers. A spacing/layout deviation is a `review_blocking` finding in Phase 4 Step 1.8, not a nitpick.
|
|
59
|
+
|
|
60
|
+
Missing design data (variant, padding, copy) → HALT and instruct the user to re-run `/multi-agent:analysis`; never guess and never call Figma MCP from this phase.
|
|
61
|
+
|
|
52
62
|
#### Re-entry from Phase 4 triage
|
|
53
63
|
|
|
54
64
|
Phase 3 runs twice in the pipeline lifetime: first for initial development, then optionally for rework after Phase 4 review. **Phase 3 never acts on raw reviewer output.** It only consumes `triage.accepted` findings - Fable triage in Phase 4 already filtered false-positives, deferred out-of-scope items, and rejected noise.
|
|
@@ -390,30 +390,9 @@ After the triage verdict is computed, populate `triage.consensus`:
|
|
|
390
390
|
|
|
391
391
|
##### 3.7 Verify-by-test (opt-in, empirical validation of blocking findings)
|
|
392
392
|
|
|
393
|
-
|
|
393
|
+
A triage verdict is judgment; a failing repro test is proof. Runs only when `prefs.global.verifyByTest.enabled` is `true` AND `accepted` contains a `blocking` finding; otherwise skip silently. **Full contract (verdict table, cleanup invariant, prompts): `refs/features/verify-by-test.md` - read it before executing this step.**
|
|
394
394
|
|
|
395
|
-
|
|
396
|
-
|
|
397
|
-
1. **Dispatch ONE verifier agent** (subagent_type: `general-purpose`, model: `prefs.global.verifyByTest.model`, default `sonnet`) for the iteration - not one per finding. Input: up to `verifyByTest.maxFindings` (default 3) accepted blocking findings (ordered as triage returned them), the diff hunks for their files, and the project's test conventions from Phase 1. Findings beyond the cap keep their judgment-only verdict; log `verify_by_test=cap-exceeded count=<n>`.
|
|
398
|
-
2. **Per finding, the verifier writes ONE minimal repro test** asserting the CORRECT behavior the finding claims is broken, following the platform test conventions (framework, naming, location per phase-3-dev.md), then runs ONLY that test using the platform single-test invocation from Phase 3 (`xcodebuild test -only-testing:`, `pytest {file}::{name}`, `npm test -- --testPathPattern=`, `./gradlew test --tests`), wrapped in `acquire_build_lock`/`release_build_lock`, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`.
|
|
399
|
-
3. **Verdict mapping (fails toward keeping blockers):**
|
|
400
|
-
|
|
401
|
-
| Repro test outcome | `verification.result` | Action on finding |
|
|
402
|
-
| --- | --- | --- |
|
|
403
|
-
| Test FAILS as the finding predicts | `confirmed` | Stays `accepted` blocking. Repro test is KEPT in the worktree and recorded in `redTests[]` as the RED test for the Phase 3 rework loop. |
|
|
404
|
-
| Test PASSES (defect not reproducible) | `not-reproduced` | Downgrade is evidence-gated: `node pipeline/scripts/evidence-gate.mjs --claim test --status passed --evidence "$WORKTREE/.pipeline/verify-<i>.test.log"` must exit 0. On exit 0: move finding from `accepted[]` to `deferred[]` with reason `verify-by-test: not reproduced - repro test <testRef> passed`, delete the repro test file. On exit non-zero: treat as `inconclusive`. |
|
|
405
|
-
| Compile error, timeout, or defect not expressible as a unit test | `inconclusive` | Stays `accepted` blocking (judgment-only verdict stands). Partial test file deleted, cause noted in `verification.note`. |
|
|
406
|
-
|
|
407
|
-
4. **Cleanup invariant:** after Step 3.7 the only verifier artifacts left in the worktree are the confirmed repro tests listed in `redTests[]` (they get committed with the fix, satisfying the TDD contract) and logs under `$WORKTREE/.pipeline/` (already outside Phase 6 commit scope).
|
|
408
|
-
5. **Persist + re-validate:** stamp each verified finding with a `verification` object (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = { attempted, confirmed, downgraded, inconclusive, redTests: [{file, testRef, issue}] }`, recompute `approved`, then re-run `validate-triage.mjs` on the mutated `$TRIAGE_FILE` under the same 3.2.1 gate protocol.
|
|
409
|
-
6. **Timeout/fallback (mirrors 3.3):** the whole step is bounded by `verifyByTest.stepTimeoutSec` (default 600). On verifier crash or budget breach: no retry; remaining findings keep judgment-only verdicts, log `triage=verify-by-test-timeout`, proceed to Step 4. Never blocks the pipeline.
|
|
410
|
-
|
|
411
|
-
Telemetry (per 3.4 conventions, best-effort):
|
|
412
|
-
|
|
413
|
-
```bash
|
|
414
|
-
LOG_METRIC_FORWARD_TO_TRACKER=1 pipeline/scripts/log-metric.sh "$TASK_ID" 4 review.verify_by_test \
|
|
415
|
-
attempted=$A confirmed=$C downgraded=$D inconclusive=$I duration_ms=$MS
|
|
416
|
-
```
|
|
395
|
+
Compressed flow: dispatch ONE verifier agent (model `verifyByTest.model`, default `sonnet`) for up to `maxFindings` (default 3) accepted blocking findings. Per finding it writes ONE minimal repro test and runs ONLY that test (Phase 3 single-test invocation, build lock, log tee'd to `$WORKTREE/.pipeline/verify-<i>.test.log`). Outcomes: test FAILS as predicted -> `confirmed`, finding stays blocking and the test is KEPT in `redTests[]` as the Phase 3 rework RED test; test PASSES -> `not-reproduced` ONLY if `evidence-gate.mjs --claim test --status passed` exits 0 on the log, finding moves to `deferred[]`, test deleted; compile error / timeout / not unit-testable -> `inconclusive`, judgment verdict stands. Stamp findings with `verification` (schema v3.2.0), persist `state.reviewIterations[-1].verifyByTest = {attempted, confirmed, downgraded, inconclusive, redTests[]}`, recompute `approved`, re-run `validate-triage.mjs` under the 3.2.1 gate. Whole step bounded by `stepTimeoutSec` (default 600); on breach or crash remaining findings keep judgment verdicts - never blocks. Telemetry per 3.4: `review.verify_by_test attempted= confirmed= downgraded= inconclusive= duration_ms=`.
|
|
417
396
|
|
|
418
397
|
#### Step 4 - Consensus + Action (triage-driven)
|
|
419
398
|
|
|
@@ -701,6 +701,30 @@
|
|
|
701
701
|
"default": false,
|
|
702
702
|
"description": "v6.1.0+ \u2014 Phase 4 Step 2.5 rebuttal round. When reviewers disagree (mixed blocker/approved verdict), each reviewer is re-prompted with the others' opposing arguments for one additional round before triage. Lifts signal quality on ambiguous findings at ~1\u00d7 Step 2 token cost. Off by default \u2014 flip for security-critical or release-branch reviews."
|
|
703
703
|
},
|
|
704
|
+
"updateCheck": {
|
|
705
|
+
"type": "object",
|
|
706
|
+
"additionalProperties": false,
|
|
707
|
+
"description": "v10.9+ - Phase 0 Step 0.6 advisory version check. Once per ttlHours window, a bounded (3s) registry read compares the installed version against dist-tags.latest. Newer version found: interactive modes ask 'Update now / Continue' (yes runs the /multi-agent:update flow, then the run continues); autopilot logs one line and never asks. Offline/failed checks are silent; the step never blocks the pipeline.",
|
|
708
|
+
"properties": {
|
|
709
|
+
"enabled": {
|
|
710
|
+
"type": "boolean",
|
|
711
|
+
"default": true,
|
|
712
|
+
"description": "Master switch. On by default - the cost is at most one 3s-bounded curl per ttlHours."
|
|
713
|
+
},
|
|
714
|
+
"ttlHours": {
|
|
715
|
+
"type": "integer",
|
|
716
|
+
"minimum": 1,
|
|
717
|
+
"maximum": 168,
|
|
718
|
+
"default": 24,
|
|
719
|
+
"description": "Cache window for the registry read."
|
|
720
|
+
},
|
|
721
|
+
"autoUpdate": {
|
|
722
|
+
"type": "boolean",
|
|
723
|
+
"default": false,
|
|
724
|
+
"description": "When true, skip the question and run the update flow automatically before the run starts (interactive AND autopilot). Off by default - self-modifying ~/.claude without asking is a surprise."
|
|
725
|
+
}
|
|
726
|
+
}
|
|
727
|
+
},
|
|
704
728
|
"verifyByTest": {
|
|
705
729
|
"type": "object",
|
|
706
730
|
"additionalProperties": false,
|
|
@@ -3,15 +3,15 @@
|
|
|
3
3
|
"$id": "https://github.com/mmerterden/multi-agent-pipeline/pipeline/schemas/token-budget.json",
|
|
4
4
|
"description": "Per-phase token budget for lazy-loaded pipeline docs. Enforced by smoke-token-budget.sh.",
|
|
5
5
|
"phases": {
|
|
6
|
-
"phase-0-init": { "max_tokens":
|
|
6
|
+
"phase-0-init": { "max_tokens": 12400, "warn_tokens": 10900 },
|
|
7
7
|
"phase-1-analysis": { "max_tokens": 3750, "warn_tokens": 3300 },
|
|
8
8
|
"phase-2-planning": { "max_tokens": 6500, "warn_tokens": 5750 },
|
|
9
|
-
"phase-3-dev": { "max_tokens":
|
|
10
|
-
"phase-4-review": { "max_tokens":
|
|
11
|
-
"phase-5-test": { "max_tokens":
|
|
12
|
-
"phase-6-commit": { "max_tokens":
|
|
9
|
+
"phase-3-dev": { "max_tokens": 7900, "warn_tokens": 6950 },
|
|
10
|
+
"phase-4-review": { "max_tokens": 13250, "warn_tokens": 11650 },
|
|
11
|
+
"phase-5-test": { "max_tokens": 2550, "warn_tokens": 2250 },
|
|
12
|
+
"phase-6-commit": { "max_tokens": 6150, "warn_tokens": 5400 },
|
|
13
13
|
"phase-7-report": { "max_tokens": 5600, "warn_tokens": 4950 }
|
|
14
14
|
},
|
|
15
|
-
"total_max_tokens":
|
|
16
|
-
"note": "Token estimate = ceil(chars / 4). Per-phase budget rule: warn = current+10% (rounded to nearest 50), max = current+25%. Gives ~6 edit cycles of headroom before warn trips - intentionally quiet under normal maintenance, loud when a phase grows unusually. Only the active phase is loaded (lazy). Recalibrated at v10.0.0 after the validator/consistency/simplifier/lesson gate contracts landed in phases 1-4 (prose
|
|
15
|
+
"total_max_tokens": 50000,
|
|
16
|
+
"note": "Token estimate = ceil(chars / 4). Per-phase budget rule: warn = current+10% (rounded to nearest 50), max = current+25%. Gives ~6 edit cycles of headroom before warn trips - intentionally quiet under normal maintenance, loud when a phase grows unusually. Only the active phase is loaded (lazy). Recalibrated at v10.0.0 after the validator/consistency/simplifier/lesson gate contracts landed in phases 1-4. Recalibrated again at v10.9.0 after the verify-by-test (Phase 4 Step 3.7), update-check (Phase 0 Step 0.6), immutable-test (Phase 3 GREEN) and redTests re-entry contracts landed - Step 3.7 prose was compressed to a pointer into refs/features/verify-by-test.md before the recalibration."
|
|
17
17
|
}
|
|
@@ -24,6 +24,7 @@ Validate contracts. Each emits `══ <name> smoke: N passed, M failed ══`
|
|
|
24
24
|
- `smoke-phase4-triage.sh` - Phase 4 reviewer → triage flow
|
|
25
25
|
- `smoke-verify-by-test.sh` - Phase 4 Step 3.7 verify-by-test contract (v10.8.0)
|
|
26
26
|
- `smoke-handoff-contract.sh` - phase-boundary structured handoff + handoff-first resume (v10.8.0)
|
|
27
|
+
- `smoke-update-check.sh` - Phase 0 Step 0.6 advisory update-check contract (v10.9.0)
|
|
27
28
|
|
|
28
29
|
### Schema + state
|
|
29
30
|
- `smoke-schema-validation.sh` - all JSON schemas validate
|
|
@@ -5,12 +5,12 @@
|
|
|
5
5
|
.claude/multi-agent-preferences.json 1
|
|
6
6
|
.claude/rules 12
|
|
7
7
|
.claude/schemas 23
|
|
8
|
-
.claude/scripts
|
|
8
|
+
.claude/scripts 171
|
|
9
9
|
.claude/settings.json 1
|
|
10
10
|
.claude/skills 560
|
|
11
11
|
.copilot/agents 8
|
|
12
12
|
.copilot/copilot-instructions.md 1
|
|
13
13
|
.copilot/lib 23
|
|
14
14
|
.copilot/schemas 23
|
|
15
|
-
.copilot/scripts
|
|
15
|
+
.copilot/scripts 171
|
|
16
16
|
.copilot/skills 596
|
|
@@ -0,0 +1,122 @@
|
|
|
1
|
+
#!/usr/bin/env bash
|
|
2
|
+
# smoke-update-check.sh
|
|
3
|
+
#
|
|
4
|
+
# Verifies the Phase 0 Step 0.6 advisory update-check contract:
|
|
5
|
+
# 1. update-check.sh exists, parses (bash -n), and honors the advisory contract
|
|
6
|
+
# offline: exit 0 + empty stdout when the registry is unreachable
|
|
7
|
+
# 2. Cached path: fresh cache short-circuits without a network call
|
|
8
|
+
# 3. Newer latest -> "<local>|<latest>"; same or older latest -> silent
|
|
9
|
+
# 4. prefs.schema.json exposes updateCheck.{enabled,ttlHours,autoUpdate}
|
|
10
|
+
# with the documented defaults (enabled=true, autoUpdate=false)
|
|
11
|
+
# 5. phase-0-init.md documents Step 0.6 with the autopilot log-only rule
|
|
12
|
+
#
|
|
13
|
+
# Exit 0 = all pass, 1 = any failure.
|
|
14
|
+
|
|
15
|
+
set -euo pipefail
|
|
16
|
+
|
|
17
|
+
ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
|
|
18
|
+
SCRIPT="$ROOT/pipeline/scripts/update-check.sh"
|
|
19
|
+
PREFS_SCHEMA="$ROOT/pipeline/schemas/prefs.schema.json"
|
|
20
|
+
PHASE0_DOC="$ROOT/pipeline/commands/multi-agent/refs/phases/phase-0-init.md"
|
|
21
|
+
|
|
22
|
+
pass=0
|
|
23
|
+
fail=0
|
|
24
|
+
failures=()
|
|
25
|
+
record_pass() { pass=$((pass + 1)); printf ' \033[0;32mPASS\033[0m %s\n' "$1"; }
|
|
26
|
+
record_fail() { fail=$((fail + 1)); failures+=("$1"); printf ' \033[0;31mFAIL\033[0m %s\n' "$1"; }
|
|
27
|
+
|
|
28
|
+
printf '→ smoke-update-check: Phase 0 Step 0.6 advisory contract\n'
|
|
29
|
+
|
|
30
|
+
tmpdir=$(mktemp -d)
|
|
31
|
+
trap 'rm -rf "$tmpdir"' EXIT
|
|
32
|
+
CACHE="$tmpdir/update-check-cache"
|
|
33
|
+
|
|
34
|
+
# 1. Script parses
|
|
35
|
+
if bash -n "$SCRIPT" 2>/dev/null; then
|
|
36
|
+
record_pass "update-check.sh parses (bash -n)"
|
|
37
|
+
else
|
|
38
|
+
record_fail "update-check.sh has syntax errors"
|
|
39
|
+
fi
|
|
40
|
+
|
|
41
|
+
now=$(date +%s)
|
|
42
|
+
|
|
43
|
+
# 2. Fresh cache with newer latest -> reports, no network needed
|
|
44
|
+
printf '%s|99.0.0\n' "$now" > "$CACHE"
|
|
45
|
+
out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
|
|
46
|
+
if [ "$rc" -eq 0 ] && [ "$out" = "10.0.0|99.0.0" ]; then
|
|
47
|
+
record_pass "newer cached latest -> '<local>|<latest>' (exit 0)"
|
|
48
|
+
else
|
|
49
|
+
record_fail "newer cached latest should print '10.0.0|99.0.0' (got '$out', rc=$rc)"
|
|
50
|
+
fi
|
|
51
|
+
|
|
52
|
+
# 3. Same version -> silent
|
|
53
|
+
printf '%s|10.0.0\n' "$now" > "$CACHE"
|
|
54
|
+
out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
|
|
55
|
+
if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
|
|
56
|
+
record_pass "same version -> silent"
|
|
57
|
+
else
|
|
58
|
+
record_fail "same version should be silent (got '$out', rc=$rc)"
|
|
59
|
+
fi
|
|
60
|
+
|
|
61
|
+
# 3b. Local ahead of registry (dev machine) -> silent
|
|
62
|
+
printf '%s|10.0.0\n' "$now" > "$CACHE"
|
|
63
|
+
out=$(UPDATE_CHECK_CACHE="$CACHE" bash "$SCRIPT" --local 10.1.0 2>/dev/null); rc=$?
|
|
64
|
+
if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
|
|
65
|
+
record_pass "local ahead of registry -> silent (no downgrade prompt)"
|
|
66
|
+
else
|
|
67
|
+
record_fail "local-ahead should be silent (got '$out', rc=$rc)"
|
|
68
|
+
fi
|
|
69
|
+
|
|
70
|
+
# 3c. Offline + stale cache -> silent exit 0 (advisory: never blocks)
|
|
71
|
+
printf '0|10.0.0\n' > "$CACHE"
|
|
72
|
+
out=$(UPDATE_CHECK_CACHE="$CACHE" http_proxy="http://127.0.0.1:1" https_proxy="http://127.0.0.1:1" \
|
|
73
|
+
bash "$SCRIPT" --local 10.0.0 2>/dev/null); rc=$?
|
|
74
|
+
if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
|
|
75
|
+
record_pass "offline + stale cache -> silent exit 0"
|
|
76
|
+
else
|
|
77
|
+
record_fail "offline should be a silent no-op (got '$out', rc=$rc)"
|
|
78
|
+
fi
|
|
79
|
+
|
|
80
|
+
# 4. Prefs schema knobs + defaults
|
|
81
|
+
for prop in enabled ttlHours autoUpdate; do
|
|
82
|
+
if jq -e ".properties.global.properties.updateCheck.properties.${prop}" "$PREFS_SCHEMA" >/dev/null 2>&1; then
|
|
83
|
+
record_pass "prefs schema exposes updateCheck.${prop}"
|
|
84
|
+
else
|
|
85
|
+
record_fail "prefs schema missing updateCheck.${prop}"
|
|
86
|
+
fi
|
|
87
|
+
done
|
|
88
|
+
if jq -e '.properties.global.properties.updateCheck.properties.enabled.default == true' "$PREFS_SCHEMA" >/dev/null 2>&1; then
|
|
89
|
+
record_pass "updateCheck.enabled defaults to true (notify-only, bounded cost)"
|
|
90
|
+
else
|
|
91
|
+
record_fail "updateCheck.enabled must default to true"
|
|
92
|
+
fi
|
|
93
|
+
if jq -e '.properties.global.properties.updateCheck.properties.autoUpdate | has("default") and .default == false' "$PREFS_SCHEMA" >/dev/null 2>&1; then
|
|
94
|
+
record_pass "updateCheck.autoUpdate defaults to false (no silent self-modify)"
|
|
95
|
+
else
|
|
96
|
+
record_fail "updateCheck.autoUpdate must default to false"
|
|
97
|
+
fi
|
|
98
|
+
|
|
99
|
+
# 5. Phase 0 doc wiring
|
|
100
|
+
if grep -qF "Step 0.6 - Update check" "$PHASE0_DOC"; then
|
|
101
|
+
record_pass "phase-0-init.md documents Step 0.6"
|
|
102
|
+
else
|
|
103
|
+
record_fail "phase-0-init.md missing Step 0.6"
|
|
104
|
+
fi
|
|
105
|
+
if grep -qF "update-check.sh" "$PHASE0_DOC"; then
|
|
106
|
+
record_pass "phase-0-init.md invokes update-check.sh"
|
|
107
|
+
else
|
|
108
|
+
record_fail "phase-0-init.md must invoke update-check.sh"
|
|
109
|
+
fi
|
|
110
|
+
if grep -qF "log-only" "$PHASE0_DOC" && grep -qF "never ask (zero-interaction contract)" "$PHASE0_DOC"; then
|
|
111
|
+
record_pass "phase-0-init.md states the autopilot log-only rule"
|
|
112
|
+
else
|
|
113
|
+
record_fail "phase-0-init.md must state autopilot never asks (log-only)"
|
|
114
|
+
fi
|
|
115
|
+
|
|
116
|
+
printf '\n══ update-check smoke: %d passed, %d failed ══\n' "$pass" "$fail"
|
|
117
|
+
if [ "$fail" -gt 0 ]; then
|
|
118
|
+
printf '\nFailures:\n'
|
|
119
|
+
for msg in "${failures[@]}"; do printf ' - %s\n' "$msg"; done
|
|
120
|
+
exit 1
|
|
121
|
+
fi
|
|
122
|
+
exit 0
|
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
#!/usr/bin/env bash
|
|
2
|
+
# update-check.sh - cached advisory version check (Phase 0 Step 0.6)
|
|
3
|
+
#
|
|
4
|
+
# Compares the locally installed pipeline version against the npm registry's
|
|
5
|
+
# dist-tags.latest. Cached with a TTL so at most one network call per TTL
|
|
6
|
+
# window; the call is bounded by a short timeout and every failure path is
|
|
7
|
+
# silent - this script NEVER blocks or fails the pipeline.
|
|
8
|
+
#
|
|
9
|
+
# stdout: "<local>|<latest>" when a newer version exists, nothing otherwise.
|
|
10
|
+
# Exit code: always 0 (advisory contract).
|
|
11
|
+
#
|
|
12
|
+
# Usage:
|
|
13
|
+
# bash pipeline/scripts/update-check.sh # auto-detect local version
|
|
14
|
+
# bash pipeline/scripts/update-check.sh --local 10.8.0 # explicit local version
|
|
15
|
+
# bash pipeline/scripts/update-check.sh --ttl-hours 24 # cache window (default 24)
|
|
16
|
+
# bash pipeline/scripts/update-check.sh --force # ignore cache
|
|
17
|
+
#
|
|
18
|
+
# Cache file: ~/.claude/logs/multi-agent/.update-check ("epoch|latest").
|
|
19
|
+
# Registry read is a plain curl - never `npm view` (a user-level .npmrc scope
|
|
20
|
+
# mapping can silently reroute npm to a different registry; curl cannot lie).
|
|
21
|
+
|
|
22
|
+
set -euo pipefail
|
|
23
|
+
|
|
24
|
+
PKG="@mmerterden/multi-agent-pipeline"
|
|
25
|
+
REGISTRY_URL="https://registry.npmjs.org/${PKG/\//%2F}"
|
|
26
|
+
CACHE_FILE="${UPDATE_CHECK_CACHE:-$HOME/.claude/logs/multi-agent/.update-check}"
|
|
27
|
+
TTL_HOURS=24
|
|
28
|
+
LOCAL_VERSION=""
|
|
29
|
+
FORCE=0
|
|
30
|
+
|
|
31
|
+
while [ $# -gt 0 ]; do
|
|
32
|
+
case "$1" in
|
|
33
|
+
--local) LOCAL_VERSION="${2:-}"; shift 2 ;;
|
|
34
|
+
--ttl-hours) TTL_HOURS="${2:-24}"; shift 2 ;;
|
|
35
|
+
--force) FORCE=1; shift ;;
|
|
36
|
+
*) shift ;;
|
|
37
|
+
esac
|
|
38
|
+
done
|
|
39
|
+
|
|
40
|
+
# Local version: explicit arg, else read from the pipeline repo clone.
|
|
41
|
+
if [ -z "$LOCAL_VERSION" ]; then
|
|
42
|
+
for candidate in "$HOME/multi-agent-pipeline" "$HOME/dev/multi-agent-pipeline" "$HOME/projects/multi-agent-pipeline"; do
|
|
43
|
+
if [ -f "$candidate/package.json" ]; then
|
|
44
|
+
LOCAL_VERSION=$(node -p "require('$candidate/package.json').version" 2>/dev/null || true)
|
|
45
|
+
[ -n "$LOCAL_VERSION" ] && break
|
|
46
|
+
fi
|
|
47
|
+
done
|
|
48
|
+
fi
|
|
49
|
+
[ -z "$LOCAL_VERSION" ] && exit 0 # cannot determine local version -> silent no-op
|
|
50
|
+
|
|
51
|
+
now=$(date +%s)
|
|
52
|
+
latest=""
|
|
53
|
+
|
|
54
|
+
# Fresh cache?
|
|
55
|
+
if [ "$FORCE" -eq 0 ] && [ -f "$CACHE_FILE" ]; then
|
|
56
|
+
cached_epoch=$(cut -d'|' -f1 "$CACHE_FILE" 2>/dev/null || echo 0)
|
|
57
|
+
cached_latest=$(cut -d'|' -f2 "$CACHE_FILE" 2>/dev/null || echo "")
|
|
58
|
+
case "$cached_epoch" in (*[!0-9]*|"") cached_epoch=0 ;; esac
|
|
59
|
+
if [ $((now - cached_epoch)) -lt $((TTL_HOURS * 3600)) ] && [ -n "$cached_latest" ]; then
|
|
60
|
+
latest="$cached_latest"
|
|
61
|
+
fi
|
|
62
|
+
fi
|
|
63
|
+
|
|
64
|
+
# Stale or missing cache -> one bounded registry call (silent on any failure).
|
|
65
|
+
if [ -z "$latest" ]; then
|
|
66
|
+
latest=$(curl -sm 3 "$REGISTRY_URL" 2>/dev/null \
|
|
67
|
+
| { command -v jq >/dev/null 2>&1 && jq -r '."dist-tags".latest // empty' \
|
|
68
|
+
|| sed -n 's/.*"latest":"\([^"]*\)".*/\1/p'; } | head -1) || true
|
|
69
|
+
[ -z "$latest" ] && exit 0
|
|
70
|
+
mkdir -p "$(dirname "$CACHE_FILE")" 2>/dev/null || exit 0
|
|
71
|
+
printf '%s|%s\n' "$now" "$latest" > "$CACHE_FILE" 2>/dev/null || true
|
|
72
|
+
fi
|
|
73
|
+
|
|
74
|
+
[ "$latest" = "$LOCAL_VERSION" ] && exit 0
|
|
75
|
+
|
|
76
|
+
# Update available only when latest sorts strictly ABOVE local (a dev machine
|
|
77
|
+
# running ahead of the registry must not see an "update" prompt).
|
|
78
|
+
highest=$(printf '%s\n%s\n' "$LOCAL_VERSION" "$latest" | sort -t. -k1,1n -k2,2n -k3,3n | tail -1)
|
|
79
|
+
if [ "$highest" = "$latest" ]; then
|
|
80
|
+
printf '%s|%s\n' "$LOCAL_VERSION" "$latest"
|
|
81
|
+
fi
|
|
82
|
+
exit 0
|