@hegemonart/get-design-done 1.27.1 → 1.27.6
- package/.claude-plugin/marketplace.json +2 -2
- package/.claude-plugin/plugin.json +1 -1
- package/CHANGELOG.md +95 -0
- package/SKILL.md +1 -0
- package/agents/design-reflector.md +52 -0
- package/agents/perf-analyzer.md +166 -0
- package/hooks/budget-enforcer.ts +249 -5
- package/hooks/gdd-precompact-snapshot.js +334 -0
- package/hooks/gdd-sessionstart-recap.js +281 -0
- package/hooks/hooks.json +18 -0
- package/package.json +2 -2
- package/reference/bandit-integration.md +163 -0
- package/reference/perf-budget.md +142 -0
- package/reference/registry.json +14 -0
- package/reference/retrieval-contract.md +16 -0
- package/scripts/lib/bandit-arbitrage.cjs +423 -0
- package/scripts/lib/bandit-router/integration.cjs +309 -0
- package/scripts/lib/cache/gdd-cache-manager.cjs +292 -0
- package/scripts/lib/discuss-parallel-runner/index.ts +5 -1
- package/scripts/lib/explore-parallel-runner/index.ts +5 -1
- package/scripts/lib/parallelism-engine/concurrency-tuner.cjs +259 -0
- package/scripts/lib/parallelism-engine/concurrency-tuner.d.cts +53 -0
- package/scripts/lib/perf-analyzer/cost-regression.cjs +299 -0
- package/scripts/lib/perf-analyzer/index.cjs +139 -0
- package/scripts/lib/prompt-dedup/index.cjs +161 -0
- package/scripts/lib/session-runner/index.ts +206 -0
- package/skills/bandit-status/SKILL.md +129 -0
- package/skills/peers/SKILL.md +27 -8
|
@@ -5,14 +5,14 @@
|
|
|
5
5
|
},
|
|
6
6
|
"metadata": {
|
|
7
7
|
"description": "Get Design Done — 5-stage agent-orchestrated design pipeline with 9 connections, handoff-first workflow, bidirectional Figma write-back, 22+ specialized agents, queryable knowledge layer (intel store, dependency analysis, learnings extraction), and a self-improvement loop (reflector, frontmatter + budget feedback, global-skills layer). v1.20.0 ships the SDK foundation: gdd-state MCP server (11 typed tools), lockfile-safe STATE.md mutations, event stream, and resilience primitives (jittered-backoff, rate-guard, error-classifier, iteration-budget) for rate-limit + 429 + context-overflow recovery. Full CI/CD pipeline (Node 22/24 × Linux/macOS/Windows) and release automation (auto-tag + GitHub Release + release-time smoke test).",
|
|
8
|
-
"version": "1.27.
|
|
8
|
+
"version": "1.27.6"
|
|
9
9
|
},
|
|
10
10
|
"plugins": [
|
|
11
11
|
{
|
|
12
12
|
"name": "get-design-done",
|
|
13
13
|
"source": "./",
|
|
14
14
|
"description": "Agent-orchestrated 5-stage design pipeline: Brief → Explore → Plan → Design → Verify. 22+ specialized agents, 9 connections (Figma, Refero, Preview, Storybook, Chromatic, Figma Writer, Graphify, Pinterest, Claude Design), Claude Design handoff, bidirectional Figma write-back, and a queryable intel store (.design/intel/) for dependency and learnings queries. Standalone commands: style, darkmode, compare, figma-write, graphify, handoff, analyze-dependencies, skill-manifest, extract-learnings. Embeds NNG heuristics, WCAG thresholds, typographic systems, motion framework, and anti-pattern catalog. Ships with a full CI/CD pipeline (Node 22/24 × Linux/macOS/Windows) and release automation. Optimization layer (v1.0.4.1, retroactive): gdd-router + gdd-cache-manager skills, PreToolUse budget-enforcer hook, tier-aware agent frontmatter, lazy checker gates, streaming synthesizer, /gdd:warm-cache + /gdd:optimize commands, and cost telemetry at .design/telemetry/costs.jsonl — targeting 50-70% per-task token-cost reduction with no quality-floor regression. v1.20.0 SDK foundation: gdd-state MCP server (11 typed tools), lockfile-safe STATE.md mutations, event stream at .design/telemetry/events.jsonl, resilience primitives (jittered-backoff, rate-guard, error-classifier, iteration-budget) with rate-limit + 429 + context-overflow recovery, and TypeScript toolchain.",
|
|
15
|
-
"version": "1.27.
|
|
15
|
+
"version": "1.27.6",
|
|
16
16
|
"author": {
|
|
17
17
|
"name": "hegemonart"
|
|
18
18
|
},
|
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "get-design-done",
|
|
3
3
|
"short_name": "gdd",
|
|
4
|
-
"version": "1.27.
|
|
4
|
+
"version": "1.27.6",
|
|
5
5
|
"description": "Agent-orchestrated 5-stage design pipeline: Brief → Explore → Plan → Design → Verify. 22+ specialized agents, 9 connections (Figma, Refero, Preview, Storybook, Chromatic, Figma Writer, Graphify, Pinterest, Claude Design), handoff-first workflow via Claude Design bundles, bidirectional Figma write-back (annotations, Code Connect), queryable intel store (`.design/intel/`) for O(1) design surface lookups, and self-improvement loop (reflector agent, frontmatter + budget feedback, global-skills layer at `~/.claude/gdd/global-skills/`). Standalone commands: style, darkmode, compare, figma-write, graphify, handoff, analyze-dependencies, skill-manifest, extract-learnings, reflect, apply-reflections. Embeds NNG heuristics, WCAG thresholds, typographic systems, motion framework, and anti-pattern catalog. Ships with a full CI/CD pipeline (Node 22/24 × Linux/macOS/Windows, lint + schema + frontmatter + stale-ref + shellcheck + gitleaks + injection-scan + blocking size-budget) and release automation (auto-tag + GitHub Release + release-time smoke test). Optimization layer (v1.0.4.1, retroactive): gdd-router + gdd-cache-manager skills, PreToolUse budget-enforcer hook, tier-aware agent frontmatter, lazy checker gates, streaming synthesizer, /gdd:warm-cache + /gdd:optimize commands, and cost telemetry at .design/telemetry/costs.jsonl — targeting 50-70% per-task token-cost reduction with no quality-floor regression. v1.20.0 SDK foundation: gdd-state MCP server (11 typed tools), lockfile-safe STATE.md mutations, event stream at .design/telemetry/events.jsonl, resilience primitives (jittered-backoff, rate-guard, error-classifier, iteration-budget) with rate-limit + 429 + context-overflow recovery, and TypeScript toolchain.",
|
|
6
6
|
"author": {
|
|
7
7
|
"name": "hegemonart",
|
package/CHANGELOG.md
CHANGED
|
@@ -4,6 +4,101 @@ All notable changes to get-design-done are documented here. Versions follow [sem
|
|
|
4
4
|
|
|
5
5
|
---
|
|
6
6
|
|
|
7
|
+
## [1.27.6] — 2026-05-18
|
|
8
|
+
|
|
9
|
+
### Added
|
|
10
|
+
|
|
11
|
+
- **Phase 27.6 — Pipeline Performance + Token-Cost Optimization** (6 plans). After 27.5 wired the bandit into production routing, telemetry from `.design/telemetry/{costs,trajectories,events}.jsonl` finally measures real spawns; this phase converts that telemetry into concrete optimizations.
|
|
12
|
+
- `agents/perf-analyzer.md` + `scripts/lib/perf-analyzer/` (Plan 27.6-01) — reflector-tier agent that reads telemetry cross-cycle and surfaces top-3 token-cost regressions per agent + cache-hit-rate deltas + p95 latency spikes. Spawned by `/gdd:reflect` or `/gdd:audit`, NOT per-cycle (D-04).
|
|
13
|
+
- `reference/perf-budget.md` + `tests/perf-budget.test.cjs` (Plan 27.6-02) — per-agent budget table + CI regression gate that fails on >25% regression vs baseline across 3 cycles (D-01). Thresholds configurable via `.design/budget.json#perf_regression_threshold`.
|
|
14
|
+
- `scripts/lib/cache/gdd-cache-manager.cjs` (Plan 27.6-03) — cache-warming heuristic refinement: multiplicative `recency × frequency × cost` score (D-06) + top-N ranking + LRU eviction within warmed set + false-positive event emission when >20% of warmed entries evict before use (D-02).
|
|
15
|
+
- `scripts/lib/parallelism-engine/concurrency-tuner.cjs` (Plan 27.6-04) — data-driven concurrency resolver reading `parallelism.verdict` events; default becomes `min(cpu-1, last_observed_optimum)` capped at 8 (D-07). Both explore-parallel-runner and discuss-parallel-runner now use the resolver when `opts.concurrency` is omitted.
|
|
16
|
+
- `hooks/gdd-precompact-snapshot.js` + `hooks/gdd-sessionstart-recap.js` (Plan 27.6-05) — Storybloq §4.6 transplant. PreCompact hook writes atomic snapshots to `.design/snapshots/<ts>.json` (D-08; retention last-10 LRU); SessionStart recap emits markdown to stderr + JSON sidecar at `.design/snapshots/last-recap.json` (D-09). Harness-aware: Codex no-op with stderr notice (D-10, Phase 45 dep for full path).
|
|
17
|
+
- `scripts/lib/prompt-dedup/index.cjs` + `reference/retrieval-contract.md` extension (Plan 27.6-06) — D-11 dedup: when ≥ 3 agents in same cycle read same `reference/*.md`, the retrieval-contract preamble adds a "shared context loaded once" marker.
|
|
18
|
+
- `docs/PERF-OPTIMIZATION.md` (Plan 27.6-06) — operator guide covering all 6 plans, 12 D-XX decisions, the CI regression gate, perf-analyzer proposals, cache-warming tuning, concurrency resolver, snapshot/recap hooks, Codex no-op fallback, prompt-dedup, recalibration process, and troubleshooting.
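The multiplicative warming score, top-N ranking, and false-positive check described for Plan 27.6-03 can be sketched roughly as follows. This is an illustrative sketch only, not the shipped `gdd-cache-manager.cjs`: the exponential half-life decay used for recency, the entry field names (`key`, `lastUsedMs`, `hits`, `costUsd`), and the `topN` default are all assumptions.

```javascript
// Hypothetical sketch of a multiplicative warming score (D-06).
// The half-life decay, field names, and defaults are assumptions for illustration.
function warmingScore(entry, nowMs, halfLifeMs = 60 * 60 * 1000) {
  const ageMs = Math.max(0, nowMs - entry.lastUsedMs);
  // Recency is 1 for a just-used entry and halves every halfLifeMs.
  const recency = Math.pow(0.5, ageMs / halfLifeMs);
  // Multiplicative: a zero in any factor removes the entry from contention.
  return recency * entry.hits * entry.costUsd;
}

// Rank all candidates by score and keep the keys of the top N.
function pickWarmSet(entries, nowMs, topN = 3) {
  return entries
    .map((e) => ({ key: e.key, score: warmingScore(e, nowMs) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((e) => e.key);
}

// D-02: share of warmed entries that were evicted before being used.
function falsePositiveRate(warmedCount, evictedUnusedCount) {
  return warmedCount === 0 ? 0 : evictedUnusedCount / warmedCount;
}
```

Under D-02, a caller would compare `falsePositiveRate(...)` against the 20% tolerance and emit the false-positive event when the rate exceeds it.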
|
|
19
|
+
|
|
20
|
+
### Decisions locked
|
|
21
|
+
|
|
22
|
+
- D-01: Regression-gate threshold = 25% (configurable via `.design/budget.json#perf_regression_threshold`).
|
|
23
|
+
- D-02: Cache-warming false-positive tolerance = 20% (configurable via `.design/budget.json#cache_warming_falsepositive_threshold`).
|
|
24
|
+
- D-03: Baseline data = synthetic cycle replay; real-cycle calibration in a follow-up patch after 1-2 production cycles.
|
|
25
|
+
- D-04: `perf-analyzer` is reflector-tier (not per-cycle).
|
|
26
|
+
- D-05: Per-agent budgets = current p50 + 25% buffer initially.
|
|
27
|
+
- D-06: Cache-warming heuristic = multiplicative `recency × frequency × cost`.
|
|
28
|
+
- D-07: Parallel-mapper concurrency reads `parallelism.verdict` events; default = `min(cpu-1, last_optimum)` capped at 8.
|
|
29
|
+
- D-08: PreCompact snapshot uses `scripts/lib/lockfile.cjs` for atomicity (atomic `.tmp` + rename); retention last-10 LRU.
|
|
30
|
+
- D-09: SessionStart recap emits markdown to stderr + JSON sidecar to `.design/snapshots/last-recap.json`.
|
|
31
|
+
- D-10: Codex no-op fallback (stderr notice; Phase 45 dep for full path).
|
|
32
|
+
- D-11: Prompt-dedup injects at Phase 14.5 retrieval-contract preamble (≥ 3 agents reading same ref → shared-context marker).
|
|
33
|
+
- D-12: 4 manifests lockstep + CHANGELOG + OFF_CADENCE + baseline at `test-fixture/baselines/phase-27-6/`.
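As a rough illustration of D-07, the default `min(cpu-1, last_optimum)` capped at 8 could be computed as below. This is a sketch under stated assumptions, not the shipped `concurrency-tuner.cjs`: the function name, the floor of 1, and the fallback to `cpu-1` when no optimum has been observed are hypothetical choices that the changelog does not specify.

```javascript
const os = require('node:os');

// Hypothetical sketch of the D-07 default. The name resolveConcurrency, the
// floor of 1, and the missing-optimum fallback are assumptions for illustration.
function resolveConcurrency(lastOptimum, cpuCount = os.availableParallelism()) {
  // Leave one core free, but never go below a single worker.
  const cpuBound = Math.max(1, cpuCount - 1);
  // Without a usable observed optimum, fall back to the CPU bound alone.
  const observed =
    Number.isInteger(lastOptimum) && lastOptimum > 0 ? lastOptimum : cpuBound;
  // Hard cap at 8, as stated in D-07.
  return Math.min(cpuBound, observed, 8);
}
```

In this sketch the value would only be consulted when the caller omits `opts.concurrency`, matching the behavior described for the two parallel runners.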
|
|
34
|
+
|
|
35
|
+
### Out of scope (deferred)
|
|
36
|
+
|
|
37
|
+
- Cross-runtime cost arbitrage (Phase 26 territory).
|
|
38
|
+
- Per-call model substitution (Phase 23.5 bandit territory).
|
|
39
|
+
- Rewriting reference files (Phase 46 territory).
|
|
40
|
+
- Codex `pre-large-context-action` interception (Phase 45 dep).
|
|
41
|
+
- Cache-warming auto-tuning of heuristic weights — measurement-gated follow-up.
|
|
42
|
+
- Real-cycle baseline calibration — deferred to follow-up patch.
|
|
43
|
+
|
|
44
|
+
### Test coverage
|
|
45
|
+
|
|
46
|
+
- `tests/perf-analyzer-cost-regression.test.cjs` — ≥10 tests for detection rules (Plan 27.6-01).
|
|
47
|
+
- `tests/perf-budget.test.cjs` — ≥6 tests for CI gate including cold-start tolerance (Plan 27.6-02).
|
|
48
|
+
- `tests/gdd-cache-manager-warming.test.cjs` — ≥6 tests for warming heuristic (Plan 27.6-03).
|
|
49
|
+
- `tests/concurrency-tuner.test.cjs` — ≥5 tests for D-07 algorithm (Plan 27.6-04).
|
|
50
|
+
- `tests/gdd-precompact-snapshot.test.cjs` — ≥6 tests including atomicity + harness fallback (Plan 27.6-05).
|
|
51
|
+
- `tests/gdd-sessionstart-recap.test.cjs` — ≥4 tests for diff + Codex no-op (Plan 27.6-05).
|
|
52
|
+
- `tests/prompt-dedup.test.cjs` — 12 tests for D-11 threshold + cycle scoping + alphabetic sort + malformed-event filter (Plan 27.6-06).
|
|
53
|
+
- `tests/phase-27-6-baseline.test.cjs` — version-agnostic regression baseline (Plan 27.6-06).
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
## [1.27.5] — 2026-05-17
|
|
58
|
+
|
|
59
|
+
### Added
|
|
60
|
+
|
|
61
|
+
- **Phase 27.5 — Bandit Production Integration** (6 plans). Wires Phase 23.5's bandit posterior + Phase 27-07's `delegate?` dimension into a real production routing path. After v1.27.5, `default-tier:` becomes a default (cold-start prior), not a final answer — the bandit picks the final tier from measurement when `adaptive_mode: full`.
|
|
62
|
+
- `scripts/lib/bandit-router/integration.cjs` (Plan 27.5-01) — thin shim exposing `consultBandit({agent, bin, delegate, agentFrontmatter, adaptiveMode}) → {tier, decision_log}` and `recordOutcome({agent, bin, delegate, tier, status, costUsd, adaptiveMode}) → void`. Hides `pull` vs `pullWithDelegate` choice. Best-effort posterior write per D-04.
|
|
63
|
+
- `hooks/budget-enforcer.ts` (Plan 27.5-02) — bandit consultation per Agent spawn after `resolved_models` is computed, before SDK call. Overrides `resolved_models[agent]` via `tier-resolver.cjs` when the bandit picks a different tier than the router emitted. Emits `bandit.tier_selected` event per spawn. Respects `tier_override:` frontmatter bypass (D-05), `adaptive_mode` gate (D-07), and the 80% auto-downgrade guard.
|
|
64
|
+
- `scripts/lib/session-runner/index.ts` (Plan 27.5-03) — calls `recordOutcome()` after every `emit('session.completed', ...)` site (4 call sites: rate-limited, peer-success, turn-cap-zero, terminal retry-exit). Adds 3 optional fields to `SessionRunnerOptions`: `agent`, `bin`, `tier`. Posterior write is best-effort; missing fields silent.
|
|
65
|
+
- `agents/design-reflector.md` Section 8 (Plan 27.5-04) — bandit-arbitrage analysis surfaces "agent X frontmatter says sonnet but bandit picks opus" as `[FRONTMATTER]` proposals after 3+ pulls with credible interval < 0.05 and ≥ 50% mean delta vs second-best tier (D-10). New module `scripts/lib/bandit-arbitrage.cjs` mirrors Phase 26-06's cost-arbitrage shape.
|
|
66
|
+
- `skills/peers/SKILL.md` Step 5 + new `skills/bandit-status/SKILL.md` (Plan 27.5-05) — `/gdd:peers` now reads canonical posterior path `.design/telemetry/posterior.json` and renders real per-peer reward-delta when posterior is populated. New read-only `/gdd:bandit-status` skill surfaces per-`(agent, bin, delegate, tier)` posterior snapshots (alpha/beta/mean/stddev/count/last-used). Strictly read-only per D-11.
|
|
67
|
+
- `docs/BANDIT-INTEGRATION.md` + `reference/bandit-integration.md` (Plan 27.5-06) — operator guide + developer cheat sheet.
|
|
68
|
+
|
|
69
|
+
### Decisions locked
|
|
70
|
+
|
|
71
|
+
- D-01: `hooks/budget-enforcer.ts` is the bandit consultation site (single canonical routing decision point).
|
|
72
|
+
- D-02: Per-spawn timing, after `resolved_models` computed, before SDK call.
|
|
73
|
+
- D-03: Override `resolved_models[agent]` with bandit tier through `tier-resolver.cjs`. Preserve `model_tier_overrides[agent]` unchanged (back-compat).
|
|
74
|
+
- D-04: `update()` called in session-runner's terminal-emit path after `session.completed`. Best-effort posterior write — errors swallowed.
|
|
75
|
+
- D-05: `tier_override:` frontmatter is the explicit per-agent bandit-bypass surface.
|
|
76
|
+
- D-06: Posterior path stays at `.design/telemetry/posterior.json` (Phase 23.5 D-08 unchanged).
|
|
77
|
+
- D-07: Bandit consultation gated by `adaptive_mode` (static + hedge silent; full active).
|
|
78
|
+
- D-08: Reward function unchanged from Phase 23.5 (two-stage lexicographic correctness + cost).
|
|
79
|
+
- D-09: Cold-start prior for the 5 peer delegate arms uses neutral `TIER_PRIOR` (no bias toward any peer).
|
|
80
|
+
- D-10: Reflector bandit-arbitrage reuses `cost-arbitrage.cjs` shape (50% threshold, mirror Phase 26-06).
|
|
81
|
+
- D-11: `/gdd:bandit-status` is read-only (use `/gdd:bandit-reset` from Phase 23.5 to mutate).
|
|
82
|
+
- D-12: All 6 plans land together with one CHANGELOG block; 4 manifests bump lockstep.
|
|
83
|
+
|
|
84
|
+
### Out of scope (deferred)
|
|
85
|
+
|
|
86
|
+
- Auto-failover when bandit recommends a delegate not in `enabled_peers` — bandit stays advisory.
|
|
87
|
+
- Cross-cycle posterior decay — Phase 23.5 D-12 already specifies discounted Thompson sampling.
|
|
88
|
+
- Per-task bandit dimensions beyond `(agent, bin, delegate)` — needs convergence proof first.
|
|
89
|
+
- Removing frontmatter `default-tier:` — additive only; deprecation is Phase 30+.
|
|
90
|
+
- Bandit-driven complexity_class selection — different decision domain.
|
|
91
|
+
|
|
92
|
+
### Test coverage
|
|
93
|
+
|
|
94
|
+
- `tests/bandit-router-integration.test.cjs` — 25+ tests covering all 5 paths × adaptive_mode × tier_override × delegate (Plan 27.5-01).
|
|
95
|
+
- `tests/budget-enforcer-bandit.test.cjs` — 8+ tests for hook consultation branches (Plan 27.5-02).
|
|
96
|
+
- `tests/session-runner-bandit-outcome.test.cjs` — 6+ tests for recordOutcome paths (Plan 27.5-03).
|
|
97
|
+
- `tests/bandit-arbitrage.test.cjs` — 6+ tests for reflector analyzer (Plan 27.5-04).
|
|
98
|
+
- `tests/phase-27-5-baseline.test.cjs` — manifests + baseline + integration-exports regression (Plan 27.5-06).
|
|
99
|
+
|
|
100
|
+
---
|
|
101
|
+
|
|
7
102
|
## [1.27.1] — 2026-04-30
|
|
8
103
|
|
|
9
104
|
Phase 27 wiring patch — closes the production-integration gaps left by v1.27.0's "structural ship". v1.27.0 landed all peer-CLI library code + tests + docs but the helpers were exported without callers, so `delegate_to:` on agent frontmatter was validated and then ignored at runtime. v1.27.1 wires the four integration points so delegation actually fires for users who set `delegate_to:` AND allowlist the peer.
|
package/SKILL.md
CHANGED
|
@@ -89,6 +89,7 @@ Each stage produces artifacts in `.design/` inside the current project.
|
|
|
89
89
|
| `skill-manifest [--refresh]` | `get-design-done:skill-manifest` | List or refresh the local skill manifest used by the router for discovery |
|
|
90
90
|
| `quality-gate` | `get-design-done:quality-gate` | Phase 25 — parallel lint/type/test/visual command runner; classifies failures via quality-gate-runner agent |
|
|
91
91
|
| `turn-closeout` | `get-design-done:turn-closeout` | Phase 25 — Stop-hook mirror skill; finalizes per-turn STATE blocks and emits closeout events |
|
|
92
|
+
| `bandit-status` | `get-design-done:bandit-status` | Phase 27.5 — read-only diagnostic surface for the bandit posterior; per-(agent, bin, delegate, tier) snapshots (alpha, beta, mean, stddev, count, last-used). Use `/gdd:bandit-reset` to mutate. |
|
|
92
93
|
| `peers` | `get-design-done:peers` | Phase 27 — `/gdd:peers` capability matrix command; shows installed peer-CLIs (codex/gemini/cursor/copilot/qwen), allowlist status, claimed roles, posterior delta vs local |
|
|
93
94
|
| `peer-cli-customize` | `get-design-done:peer-cli-customize` | Phase 27 — rewire role→peer mappings on a per-agent basis (edits frontmatter `delegate_to:` directly) |
|
|
94
95
|
| `peer-cli-add` | `get-design-done:peer-cli-add` | Phase 27 — guided ladder for adding a brand-new peer (verification ladder + adapter scaffolding + capability-matrix update) |
|
|
@@ -151,6 +151,58 @@ Render each `cost_arbitrage` entry into the Proposals section as a `[BUDGET]`-ta
|
|
|
151
151
|
|
|
152
152
|
---
|
|
153
153
|
|
|
154
|
+
### 8. Bandit-arbitrage analysis (Phase 27.5 — D-10)
|
|
155
|
+
|
|
156
|
+
**Why this exists:** Phase 27.5 (v1.27.5) wired the bandit posterior + delegate dimension into production. The posterior now accumulates per-`(agent, bin, delegate, tier)` win-rates from real spawns. Once the posterior has enough data, the bandit's best-arm tier for an agent may differ from that agent's frontmatter `default-tier:` — a measurement signal that the frontmatter is stale. This section surfaces that signal as a `[FRONTMATTER]` proposal.
|
|
157
|
+
|
|
158
|
+
**Data sources:**
|
|
159
|
+
|
|
160
|
+
- `.design/telemetry/posterior.json` — the bandit posterior file written by Phase 23.5's `bandit-router.cjs` + Phase 27.5-02/03's production callers. Path matches `bandit-router.cjs`'s `DEFAULT_POSTERIOR_PATH`. If the file does not exist, skip this section with note "posterior.json not found — Phase 27.5 wiring required."
|
|
161
|
+
- `agents/*.md` — read each agent's frontmatter `default-tier:` value. The reflector already parses frontmatter in Section 3 ("Agent Performance"); reuse that parse pass and build a `{agent: defaultTier}` map keyed by the agent's `name:` field.
|
|
162
|
+
|
|
163
|
+
**The rule:**
|
|
164
|
+
|
|
165
|
+
For each `(agent, bin)` slice in the posterior (defaulting to `delegate='none'` arms — focuses on local-call routing):
|
|
166
|
+
|
|
167
|
+
1. Compute per-tier posterior mean = `α / (α + β)` and stddev = `sqrt(αβ / ((α+β)² · (α+β+1)))`.
|
|
168
|
+
2. Identify `posterior_best_tier = argmax(mean)` across the tiers present in the slice.
|
|
169
|
+
3. Gates (all must hold to emit):
|
|
170
|
+
- `sum(arm.count)` across the slice's tier rows >= 3 (D-10's "3+ cycles" proxy).
|
|
171
|
+
- `(best_mean - second_best_mean) / second_best_mean >= 0.5` (50% delta heuristic).
|
|
172
|
+
- `stddev(best_tier) < 0.05` (credible interval narrow enough).
|
|
173
|
+
- `frontmatter[agent].default-tier !== posterior_best_tier` (the actual stale signal).
|
|
174
|
+
4. If all gates hold, emit a structured `bandit_arbitrage` proposal.
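The rule above can be sketched as a small function. This is a minimal illustration, not the shipped `scripts/lib/bandit-arbitrage.cjs`: the input shape (`{ tier: { alpha, beta, count } }` for one `(agent, bin)` slice), the function names, and the trimmed return value are assumptions.

```javascript
// Hypothetical sketch of the Section 8 gates for a single (agent, bin) slice.
// Assumed input: { haiku: { alpha, beta, count }, sonnet: {...}, ... }.
function betaMean({ alpha, beta }) {
  return alpha / (alpha + beta);
}

function betaStddev({ alpha, beta }) {
  const n = alpha + beta;
  return Math.sqrt((alpha * beta) / (n * n * (n + 1)));
}

function checkSlice(tiers, frontmatterTier) {
  const rows = Object.entries(tiers).map(([tier, arm]) => ({
    tier,
    mean: betaMean(arm),
    stddev: betaStddev(arm),
    count: arm.count,
  }));
  // Single-tier history is silent: no comparison is possible.
  if (rows.length < 2) return null;

  rows.sort((a, b) => b.mean - a.mean);
  const [best, second] = rows;
  const pulls = rows.reduce((sum, r) => sum + r.count, 0);

  const allGatesHold =
    pulls >= 3 &&
    second.mean > 0 && // guard the division below
    (best.mean - second.mean) / second.mean >= 0.5 &&
    best.stddev < 0.05 &&
    frontmatterTier !== best.tier;

  return allGatesHold
    ? { posterior_best_tier: best.tier, pull_count: pulls }
    : null;
}
```

For scale, posterior means of 0.95 and 0.62 give a relative delta of about 0.53, which clears the 0.5 gate; means of 0.95 and 0.70 (about 0.36) would not.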
|
|
175
|
+
|
|
176
|
+
**Important guardrails (failure modes the rule must avoid):**
|
|
177
|
+
|
|
178
|
+
- **Single-tier-only history is silent.** If only one tier has been pulled for `(agent, bin)`, no comparison is possible — emit nothing rather than a misleading "winner" proposal.
|
|
179
|
+
- **Wide credible intervals are silent.** Bandit posteriors are noisy early on; the 0.05 stddev gate ensures we only surface signals where the bandit is confident.
|
|
180
|
+
- **The 50% threshold is a starting heuristic.** Same discipline as cost-arbitrage Section 7 — bandit-learning over which arbitrage proposals were APPLIED (and whether the posterior subsequently shifted) is a separate (future) phase.
|
|
181
|
+
- **delegateFilter='none' is the v1.27.5 default.** Arbitrage analysis on the 5 peer-delegate slices is left for a future plan; current peer data is too sparse to credibly disagree with frontmatter.
|
|
182
|
+
|
|
183
|
+
**Helper:** `scripts/lib/bandit-arbitrage.cjs` exports `analyze(posterior, options) → proposals[]` implementing the above rule deterministically. The executor agent following this skill loads the posterior via `bandit-router.loadPosterior()`, builds the `{agent: defaultTier}` map from `agents/*.md` frontmatter, and passes both to `analyze()`. No re-derivation of the rule in prose — call the helper.
|
|
184
|
+
|
|
185
|
+
**Proposal output shape** (one entry per stale-frontmatter signal, JSON-serializable for `/gdd:apply-reflections`):
|
|
186
|
+
|
|
187
|
+
```json
|
|
188
|
+
{
|
|
189
|
+
"type": "bandit_arbitrage",
|
|
190
|
+
"agent": "design-verifier",
|
|
191
|
+
"bin": "medium",
|
|
192
|
+
"current_frontmatter_tier": "sonnet",
|
|
193
|
+
"posterior_best_tier": "opus",
|
|
194
|
+
"posterior_mean": { "haiku": 0.50, "sonnet": 0.62, "opus": 0.95 },
|
|
195
|
+
"posterior_stddev": { "haiku": 0.04, "sonnet": 0.03, "opus": 0.02 },
|
|
196
|
+
"pull_count": 18,
|
|
197
|
+
"proposal": "design-verifier (medium bin) frontmatter says sonnet but bandit picks opus (posterior mean 0.950 vs 0.620, 18 pulls, stddev 0.020) — update frontmatter or add tier_override: sonnet if intentional",
|
|
198
|
+
"evidence": "posterior_cred_int_narrow"
|
|
199
|
+
}
|
|
200
|
+
```
|
|
201
|
+
|
|
202
|
+
Render each `bandit_arbitrage` entry into the Proposals section as a `[FRONTMATTER]`-tagged proposal carrying the structured payload verbatim. `/gdd:apply-reflections` routes the proposal to either (a) an `agents/<name>.md` frontmatter `default-tier:` update OR (b) a new `tier_override: <existing-tier>` add when the operator explicitly wants to keep the existing default-tier despite the measured drift.
|
|
203
|
+
|
|
204
|
+
---
|
|
205
|
+
|
|
154
206
|
## Proposals
|
|
155
207
|
|
|
156
208
|
After all sections, write a **Proposals** section. Number proposals sequentially. Every proposal must include evidence — no vague observations.
|
|
@@ -0,0 +1,166 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: perf-analyzer
|
|
3
|
+
description: Cross-cycle performance reflector. Reads .design/telemetry/{costs,trajectories,events}.jsonl and surfaces top-3 token-cost regressions per agent + cache-hit-rate deltas + p95 latency spikes. Spawned by /gdd:reflect or /gdd:audit (NOT per-cycle). Phase 27.6 D-04.
|
|
4
|
+
tools: Read, Write, Bash, Grep, Glob
|
|
5
|
+
color: yellow
|
|
6
|
+
model: inherit
|
|
7
|
+
default-tier: opus
|
|
8
|
+
tier-rationale: "Phase 27.6 reflector — analyzes cross-cycle telemetry, proposes pipeline-level perf improvements; opus matches design-reflector tier per D-04"
|
|
9
|
+
size_budget: XL
|
|
10
|
+
parallel-safe: never
|
|
11
|
+
typical-duration-seconds: 45
|
|
12
|
+
reads-only: false
|
|
13
|
+
writes:
|
|
14
|
+
- ".design/perf/*.md"
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
@reference/shared-preamble.md
|
|
18
|
+
|
|
19
|
+
# perf-analyzer
|
|
20
|
+
|
|
21
|
+
## Role
|
|
22
|
+
|
|
23
|
+
You are a cross-cycle performance reflector. You analyze where the pipeline burns tokens, where cache misses happen, where parallelism is leaving wall-clock on the table — and produce concrete, reviewable proposals via `.design/perf/<cycle-slug>.md`. You never auto-apply anything; the operator reviews via `/gdd:apply-reflections` (Phase 11 wiring).
|
|
24
|
+
|
|
25
|
+
You run **cross-cycle, not per-cycle** (Phase 27.6 D-04). Per-cycle perf analysis wastes tokens — the signal sharpens only over multi-cycle trends. Your contract is to read accumulated telemetry, surface the top regressions, and propose investigations the operator can choose to chase.
|
|
26
|
+
|
|
27
|
+
## When to Run
|
|
28
|
+
|
|
29
|
+
Spawn this agent from:
|
|
30
|
+
|
|
31
|
+
- `/gdd:reflect` — on-demand reflection (Phase 11)
|
|
32
|
+
- `/gdd:audit` — end-of-cycle audit roll-up
|
|
33
|
+
- `/gdd:perf` — direct invocation (if/when added; currently the two above suffice)
|
|
34
|
+
|
|
35
|
+
**Do NOT spawn from any per-cycle stage** (brief / explore / plan / design / verify). Per-cycle invocation violates D-04 and wastes tokens — the analysis needs `>= 3` cycles of accumulated data to be meaningful (D-01). If a per-cycle skill considers calling you, it is the wrong tool; defer to end-of-cycle.
|
|
36
|
+
|
|
37
|
+
## Required Reading
|
|
38
|
+
|
|
39
|
+
The orchestrating skill supplies a `<required_reading>` block in the prompt. Read every listed file before acting.
|
|
40
|
+
|
|
41
|
+
Minimum expected inputs (skip gracefully if absent, note what's missing in the output):
|
|
42
|
+
|
|
43
|
+
- `.design/telemetry/costs.jsonl` — per-agent-spawn cost data (Phase 10.1)
|
|
44
|
+
- `.design/telemetry/trajectories/*.jsonl` — agent wall-time data (Phase 22)
|
|
45
|
+
- `.design/telemetry/events.jsonl` — full event stream (Phase 22)
|
|
46
|
+
- `reference/perf-budget.md` — per-agent budgets + baseline pointers (Phase 27.6-02, may not exist yet on first run; skip gracefully)
|
|
47
|
+
- `test-fixture/baselines/phase-27-6/perf-baseline.json` — synthetic baseline (Phase 27.6 D-03, exists after 27.6-06 closeout)
|
|
48
|
+
|
|
49
|
+
Helper library (use Bash to require):
|
|
50
|
+
|
|
51
|
+
- `scripts/lib/perf-analyzer/index.cjs` — `loadCosts({path, sinceCycle?})`, `loadTrajectories({dir})`
|
|
52
|
+
- `scripts/lib/perf-analyzer/cost-regression.cjs` — `detectCostRegressions({rows, baseline, thresholdPct, cyclesRequired})`, `computeCacheHitDelta(...)`, `computeP95Spikes(...)`
|
|
53
|
+
|
|
54
|
+
The helper library is a CommonJS module with no external deps — safe to require from Bash without dragging the gdd-state MCP graph.
|
|
55
|
+
|
|
56
|
+
## Output
|
|
57
|
+
|
|
58
|
+
Write `.design/perf/<cycle-slug>.md`. If `--dry-run` is set in the spawning prompt, print proposals to stdout only — do not write the file.
|
|
59
|
+
|
|
60
|
+
Terminate with `## PERF ANALYSIS COMPLETE`.
|
|
61
|
+
|
|
62
|
+
## 1. Top-3 Token-Cost Regressions
|
|
63
|
+
|
|
64
|
+
Use `scripts/lib/perf-analyzer/cost-regression.cjs::detectCostRegressions` over `loadCosts({})`. Threshold = 25% (Phase 27.6 D-01 default; read `.design/budget.json#perf_regression_threshold` if present for an override). Minimum 3 distinct cycles required (D-01). Top-3 cap is enforced by the library.
|
|
65
|
+
|
|
66
|
+
For each regression, render a `[REGRESSION]` proposal:
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
[REGRESSION] perf-analyzer-{agent}-{slug}
|
|
70
|
+
- agent: <agent>
|
|
71
|
+
- baseline_p50_usd: <number>
|
|
72
|
+
- current_p50_usd: <number>
|
|
73
|
+
- delta_pct: <number>%
|
|
74
|
+
- cycles_observed: <count>
|
|
75
|
+
- hypothesis: <one-line plausible cause; e.g., "added reference reads per spawn", "tier upgrade from sonnet→opus">
|
|
76
|
+
- next_action: <one-line operator action; e.g., "/gdd:perf-investigate <agent>", "consider tier_override: sonnet">
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
For each regression, emit a `perf.regression_detected` event via `appendEvent` from the Phase 22 event stream:
|
|
80
|
+
|
|
81
|
+
```javascript
|
|
82
|
+
// Pseudo-instruction for the executor — the agent runs Bash with this shape
|
|
83
|
+
const { appendEvent } = require('./scripts/lib/event-stream');
|
|
84
|
+
appendEvent({
|
|
85
|
+
type: 'perf.regression_detected',
|
|
86
|
+
timestamp: new Date().toISOString(),
|
|
87
|
+
sessionId: process.env.GDD_SESSION_ID ?? 'perf-analyzer',
|
|
88
|
+
payload: { agent, baseline_p50_usd, current_p50_usd, delta_pct, cycles_observed },
|
|
89
|
+
});
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
The `perf.regression_detected` event type is additive to the Phase 22 registry — the writer accepts unknown types (per `scripts/lib/event-stream/types.ts` envelope invariant: "unknown types are allowed; validation is structural, not a closed enum").
|
|
93
|
+
|
|
94
|
+
If `detectCostRegressions` returns `summary.regressions_count === 0`, write a single line: `No token-cost regressions detected (threshold 25%, >=3 cycles).` and skip event emission for this section.
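To make the detection rule concrete, here is a rough sketch of the kind of check this section describes. It is not the library's `detectCostRegressions`: the function name `findRegressions`, the row fields (`agent`, `cycle`, `cost_usd`), and the shape of the baseline map are assumptions chosen for illustration.

```javascript
// Hypothetical sketch of a p50 regression check; not the shipped library code.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// rows: [{ agent, cycle, cost_usd }]; baselineP50: { [agent]: usd } (assumed shapes).
function findRegressions(rows, baselineP50, thresholdPct = 25, cyclesRequired = 3) {
  const byAgent = new Map();
  for (const row of rows) {
    if (!byAgent.has(row.agent)) {
      byAgent.set(row.agent, { costs: [], cycles: new Set() });
    }
    const bucket = byAgent.get(row.agent);
    bucket.costs.push(row.cost_usd);
    bucket.cycles.add(row.cycle);
  }

  const found = [];
  for (const [agent, { costs, cycles }] of byAgent) {
    const base = baselineP50[agent];
    // Skip agents with too few distinct cycles or no usable baseline.
    if (cycles.size < cyclesRequired || !(base > 0)) continue;
    const current = median(costs);
    const deltaPct = ((current - base) / base) * 100;
    if (deltaPct > thresholdPct) {
      found.push({
        agent,
        baseline_p50_usd: base,
        current_p50_usd: current,
        delta_pct: deltaPct,
        cycles_observed: cycles.size,
      });
    }
  }
  // Keep only the three largest regressions.
  return found.sort((a, b) => b.delta_pct - a.delta_pct).slice(0, 3);
}
```

The returned objects reuse the field names from the `[REGRESSION]` template so that each one maps directly onto a proposal.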
|
|
95
|
+
|
|
96
|
+
## 2. Cache-Hit-Rate Deltas
|
|
97
|
+
|
|
98
|
+
Use `computeCacheHitDelta` over the same row set. Report agents whose `delta_pct < -20` (hit rate dropped by more than 20%) as `[CACHE-MISS]` proposals:
|
|
99
|
+
|
|
100
|
+
```
|
|
101
|
+
[CACHE-MISS] perf-analyzer-{agent}-cache-{slug}
|
|
102
|
+
- agent: <agent>
|
|
103
|
+
- baseline_hit_rate: <0..1>
|
|
104
|
+
- current_hit_rate: <0..1>
|
|
105
|
+
- delta_pct: <negative number>%
|
|
106
|
+
- cycles_observed: <count>
|
|
107
|
+
- hypothesis: <one-line cause; e.g., "preamble churn invalidated prefix cache", "new reference reads broke cache key">
|
|
108
|
+
- next_action: <one-line; e.g., "/gdd:cache-investigate <agent>", "audit shared-preamble.md drift">
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
If no agent crosses the -20% threshold, write a single line acknowledging that the cache hit rates are within tolerance.
|
|
112
|
+
|
|
113
|
+
## 3. p95 Latency Spikes
|
|
114
|
+
|
|
115
|
+
Use `computeP95Spikes` over `loadTrajectories({})`. Report any agent with `multiplier >= 1.5` as a `[LATENCY-SPIKE]` proposal:
|
|
116
|
+
|
|
117
|
+
```
|
|
118
|
+
[LATENCY-SPIKE] perf-analyzer-{agent}-p95-{slug}
|
|
119
|
+
- agent: <agent>
|
|
120
|
+
- baseline_p95_ms: <number>
|
|
121
|
+
- current_p95_ms: <number>
|
|
122
|
+
- multiplier: <number>x
|
|
123
|
+
- cycles_observed: <count>
|
|
124
|
+
- hypothesis: <one-line; e.g., "model upgrade increased latency", "Bash tool blocked on lock">
|
|
125
|
+
- next_action: <one-line; e.g., "/gdd:trace-agent <agent>", "review recent tool-args distribution">
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
If no agent crosses the 1.5x threshold, write a single line confirming p95 wall-time is stable.
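For orientation, the `multiplier` compared against 1.5 can be read as the ratio of current to baseline p95 wall-time. The sketch below assumes a nearest-rank percentile and plain arrays of millisecond durations; the actual `computeP95Spikes` may define both differently.

```javascript
// Hypothetical sketch; the nearest-rank percentile is an assumption.
function p95(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // 1-based nearest rank
  return sorted[Math.max(0, rank - 1)];
}

// Ratio of current to baseline p95; 1.5 or more would count as a spike here.
function latencyMultiplier(baselineMs, currentMs) {
  return p95(currentMs) / p95(baselineMs);
}
```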
|
|
129
|
+
|
|
130
|
+
## 4. Roll-up Summary
|
|
131
|
+
|
|
132
|
+
At the bottom, print a single table for at-a-glance cycle review:
|
|
133
|
+
|
|
134
|
+
| Metric | Value |
|
|
135
|
+
| ----------------------------------- | ----- |
|
|
136
|
+
| regressions_count | N |
|
|
137
|
+
| cache_miss_count | N |
|
|
138
|
+
| latency_spike_count | N |
|
|
139
|
+
| agents_evaluated | N |
|
|
140
|
+
| agents_skipped_insufficient_data | N |
|
|
141
|
+
| threshold_pct | 25 |
|
|
142
|
+
| cycles_required | 3 |
|
|
143
|
+
|
|
144
|
+
The numbers come straight from `detectCostRegressions().summary` and the lengths of the cache-miss / latency-spike arrays. Do not synthesize counts — read them from the library output.
|
|
145
|
+
|
|
146
|
+
## What This Agent Does NOT Do
|
|
147
|
+
|
|
148
|
+
- Does NOT auto-tune heuristics (out of scope per CONTEXT.md "auto-tuning of heuristic weights").
|
|
149
|
+
- Does NOT modify model selection (Phase 23.5 bandit territory; 27.5 wired the bandit, 27.6 only measures outcomes).
|
|
150
|
+
- Does NOT rewrite reference files (Phase 46 territory — canonical reference index).
|
|
151
|
+
- Does NOT analyze cross-runtime cost arbitrage (Phase 26 territory).
|
|
152
|
+
- Does NOT run on every cycle. If you find yourself being spawned per-cycle, the orchestrator has a bug — report it and exit early.
|
|
153
|
+
|
|
154
|
+
Stay within the cross-cycle measurement loop. Surface proposals; the operator reviews and applies.
|
|
155
|
+
|
|
156
|
+
## Record
|
|
157
|
+
|
|
158
|
+
At run-end, append one JSONL line to `.design/intel/insights.jsonl`:
|
|
159
|
+
|
|
160
|
+
```json
|
|
161
|
+
{"ts":"<ISO-8601>","agent":"perf-analyzer","cycle":"<cycle from STATE.md>","stage":"reflection","one_line_insight":"<top regression hypothesis or 'no regressions detected'>","artifacts_written":[".design/perf/<cycle-slug>.md"]}
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
Schema: `reference/schemas/insight-line.schema.json`. The `artifacts_written` array MUST list the per-cycle perf proposal file. If no proposals were generated (cold-start tolerance), still write the `.md` (with a "no regressions detected" body) and emit the line with the artifact path.
|
|
165
|
+
|
|
166
|
+
## PERF ANALYSIS COMPLETE
|