bigpowers 2.26.0 → 2.28.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/.pi/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "bigpowers",
3
- "version": "2.26.0",
4
- "description": "68 skills — 61 agent skills for spec-driven, test-first software development by solo developers",
3
+ "version": "2.28.0",
4
+ "description": "70 skills — 61 agent skills for spec-driven, test-first software development by solo developers",
5
5
  "keywords": [
6
6
  "pi-package"
7
7
  ],
@@ -0,0 +1,88 @@
1
+ ---
2
+ description: Evaluate an incoming project plan against bigpowers principles and conventions, surface gaps, and produce a READY/NOT READY verdict before engagement begins. Use when a new project arrives, when adapting a foreign plan, or before running seed-conventions on an unfamiliar codebase.
3
+ ---
4
+
5
+
6
+ # Audit Plan
7
+
8
+ > **HARD GATE** — Do NOT start build skills (kickoff-branch, develop-tdd) until audit-plan returns a READY verdict. A plan missing test commands, scope boundaries, or success criteria will produce drift and rework downstream.
9
+
10
+ Assess an incoming project plan for alignment with bigpowers principles, identify what's missing, and produce a structured readiness report before any skill execution begins.
11
+
12
+ ## Three lenses
13
+
14
+ ### 1. Principles alignment
15
+ - Are stories vertical slices (not horizontal layers)?
16
+ - Is scope bounded — explicit in_scope + out_of_scope?
17
+ - Are success criteria defined (how do we know we're done)?
18
+ - Are HARD GATE candidates identifiable (critical decision points)?
19
+ - Is there a domain language / ubiquitous terminology?
20
+
21
+ ### 2. Conventions completeness
22
+ - Does `CLAUDE.md` or `AGENTS.md` exist?
23
+ - Does `CONVENTIONS.md` exist?
24
+ - Is the `specs/` directory layout in place?
25
+ - Are commit conventions documented (Conventional Commits)?
26
+ - Is the git workflow mode identified (`solo-git` | `team-pr`)?
27
+
28
+ ### 3. Bigpowers pre-flight (must all be answered before build)
29
+ | Question | Why |
30
+ |----------|-----|
31
+ | What is the **test command**? | `develop-tdd` verify steps require it |
32
+ | What is the **build command**? | `verify-work` mechanical gate |
33
+ | What is the **lint command**? | `audit-code` lint gate |
34
+ | What is the **typecheck command**? | `verify-work` typecheck gate |
35
+ | What **CI platform** is in use? | `wire-ci` configuration |
36
+ | **Solo or team**? | `release-branch` integration mode |
37
+ | Primary **language + framework**? | model routing + conventions |
38
+ | **Greenfield or existing** codebase? | determines whether to run `seed-conventions` or `migrate-spec` first |
39
+
40
+ ## Process
41
+
42
+ 1. **Ingest the plan** — accept a file path, pasted PRD text, or existing `specs/` artifacts. Read `CLAUDE.md` and `CONVENTIONS.md` if present.
43
+
44
+ 2. **Score each lens** — for every item above, mark:
45
+ - ✅ Present and adequate
46
+ - ⚠️ Present but incomplete — note what's missing
47
+ - ❌ Absent
48
+
49
+ 3. **Close gaps conversationally** — for each ❌ or ⚠️, ask one question at a time. Record each answer before moving to the next.
50
+
51
+ 4. **Write `specs/PLAN-AUDIT.md`**:
52
+
53
+ ```markdown
54
+ # Plan Audit — <project>
55
+ **Date:** YYYY-MM-DD · **Verdict:** READY | NOT READY
56
+
57
+ ## Principles Alignment
58
+ | Check | Status | Note |
59
+ | Vertical slices | ✅ | 4 stories, each shippable |
60
+ | Scope bounded | ⚠️ | in_scope present; out_of_scope missing |
61
+
62
+ ## Conventions Completeness
63
+ | Check | Status | Note |
64
+
65
+ ## Pre-flight Answers
66
+ | Command | Value |
67
+ | test | `npm test` |
68
+ | build | `npm run build` |
69
+
70
+ ## Open Gaps
71
+ - [ ] Add out_of_scope to scope definition (run scope-work)
72
+ - [ ] Create CLAUDE.md (run seed-conventions)
73
+
74
+ ## Verdict
75
+ READY — proceed with survey-context
76
+ NOT READY — N gaps remain; close before proceeding
77
+ ```
78
+
79
+ 5. **Recommend next skill**:
80
+ - READY → `survey-context`
81
+ - Needs bootstrapping → `seed-conventions`
82
+ - Needs spec elaboration → `elaborate-spec`
83
+ - Has foreign spec format → `migrate-spec`
84
+ - Plan assumptions need challenging → `grill-me`
85
+
86
+ ## Verify
87
+
88
+ → verify: `test -f specs/PLAN-AUDIT.md && grep -q 'Verdict' specs/PLAN-AUDIT.md && echo OK || echo FAIL`
@@ -71,7 +71,26 @@ Summarize your understanding in 3–5 bullet points aligned with [countable-stor
71
71
 
72
72
  Ask: "Is this an accurate summary? Anything missing or wrong?"
73
73
 
74
- ### 5. Suggest next skill
74
+ ### 5. Write specs/planning-context.yaml
75
+
76
+ After the user confirms the summary in step 4, persist the key decisions:
77
+
78
+ ```yaml
79
+ # specs/planning-context.yaml — written by elaborate-spec; consumed by scope-work and slice-tasks
80
+ feature_name: "<from step 1>"
81
+ problem_statement: "<one paragraph>"
82
+ constraints:
83
+ - "<constraint 1>"
84
+ out_of_scope:
85
+ - "<excluded item 1>"
86
+ key_decisions:
87
+ - decision: "<what was decided>"
88
+ rationale: "<why>"
89
+ ```
90
+
91
+ If `specs/planning-context.yaml` already exists, ask: `"Planning context from a prior session exists. Update it? [Y/n]"`. Overwrite on Y; leave unchanged on N.
92
+
93
+ ### 6. Suggest next skill
75
94
 
76
95
  Once the spec is clear, recommend the next step:
77
96
  - If domain model needs work → `model-domain`
@@ -9,16 +9,23 @@ description: Benchmark-gated skill evolution — consume bigpowers-benchmark rep
9
9
 
10
10
  ## Loop
11
11
 
12
- 1. Run `bigpowers-benchmark` (external repo); save report path in state.yaml.
13
- 2. Identify target skill + measurable gap from report.
14
- 3. `plan-work`minimal change proposal with verify commands.
15
- 4. Edit via `craft-skill` / direct SKILL.md edit; run `sync-skills.sh`.
16
- 5. Re-run benchmark; compare scores.
17
- 6. Record decision in `specs/adr/` + `session-state`; revert if regression.
12
+ 1. **Establish baseline** — Run `run-benchmark <skill> --baseline`. If no definition exists at `specs/benchmarks/<skill>.yaml`, create one following `specs/benchmarks/SCHEMA.md` first. Save report path in `state.yaml`. If `specs/benchmarks/reports/BASELINE-<skill>.yaml` already exists, skip this step.
13
+
14
+ 2. **Identify gap** Read the baseline report (`specs/benchmarks/reports/BASELINE-<skill>.yaml`). Find scenarios with `result: FAIL` or low `pass_at_k`. This is the measurable gap.
15
+
16
+ 3. **`plan-work`** Write a minimal change proposal targeting the failing scenarios. Include verify commands.
17
+
18
+ 4. **Edit** via `craft-skill` / direct SKILL.md edit; run `bash scripts/sync-skills.sh`.
19
+
20
+ 5. **Re-run benchmark** — `run-benchmark <skill>`. Compare new `pass_at_k` against baseline.
21
+ - **IMPROVED or STABLE** → advance to step 6.
22
+ - **REGRESSION** (`new pass_at_k < baseline`) → revert the change and loop back to step 3.
23
+
24
+ 6. **Record decision** — Write `specs/adr/NNNN-evolve-<skill>.md` with before/after `pass_at_k` scores. Update `session-state`.
18
25
 
19
26
  ## Verify
20
27
 
21
- → verify: benchmark report shows post-change score baseline (document paths in state.yaml)
28
+ → verify: `grep -c 'run-benchmark\|pass_at_k\|BASELINE-' evolve-skill/SKILL.md | awk '{if($1>=2) print "OK"; else print "FAIL"}'`
22
29
 
23
30
  See [REFERENCE.md](REFERENCE.md) for ADR template.
24
31
 
@@ -6,6 +6,8 @@ description: "Streamlined fast-path for trivial data-only fixes — no TDD, no b
6
6
 
7
7
  # Quick Fix
8
8
 
9
+ > **HARD GATE** — ALL entry criteria must pass before invoking quick-fix. If any guardrail triggers during execution, abort immediately and fall back to `investigate-bug`. Do NOT use quick-fix for logic changes, multi-file edits, or diffs > 5 lines.
10
+
9
11
  Fast-track for trivial data-only fixes that do not require the full bug-fix chain.
10
12
 
11
13
  When a bug fix is purely data — an add-missing-key, a typo correction, a config value update — the standard 6-skill chain (investigate-bug → diagnose-root → develop-tdd → kickoff-branch → verify-work → release-branch) is wasteful overhead. Quick-fix collapses it to 2 skills: **quick-fix** then **release-branch**.
@@ -0,0 +1,69 @@
1
+ ---
2
+ description: Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions.
3
+ ---
4
+
5
+
6
+ # Run Benchmark
7
+
8
+ > **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
9
+
10
+ Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
11
+
12
+ ## Usage
13
+
14
+ ```bash
15
+ # Benchmark a single skill
16
+ run-benchmark <skill-name>
17
+
18
+ # Benchmark all skills with definitions
19
+ run-benchmark --all
20
+
21
+ # Pin current results as baseline
22
+ run-benchmark <skill-name> --baseline
23
+ ```
24
+
25
+ ## Process
26
+
27
+ 1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
28
+
29
+ 2. **Run each scenario** — For each scenario in `scenarios[]`:
30
+ - **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
31
+ - **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
32
+
33
+ 3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
34
+
35
+ 4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
36
+
37
+ ```yaml
38
+ skill: survey-context
39
+ run_date: "2026-06-22"
40
+ pass_at_k: 0.83
41
+ total_scenarios: 3
42
+ passed: 2
43
+ failed: 1
44
+ scenarios:
45
+ - id: s01
46
+ name: "detects active epic from state.yaml"
47
+ result: PASS
48
+ weight: 1.0
49
+ - id: s02
50
+ name: "reads release-plan.yaml and reports next epic"
51
+ result: PASS
52
+ weight: 1.0
53
+ - id: s03
54
+ name: "handles missing state.yaml gracefully"
55
+ result: FAIL
56
+ weight: 0.5
57
+ failure_note: "crashed instead of suggesting state.yaml creation"
58
+ ```
59
+
60
+ 5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
61
+
62
+ 6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
63
+ - `IMPROVED: 0.67 → 0.83`
64
+ - `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
65
+ - `STABLE: 0.83 = 0.83`
66
+
67
+ ## Verify
68
+
69
+ → verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
@@ -36,6 +36,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
36
36
 
37
37
  2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
38
38
 
39
+ 2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
40
+ ```bash
41
+ test -f specs/planning-context.yaml && python3 -c "
42
+ import yaml, datetime
43
+ d = yaml.safe_load(open('specs/planning-context.yaml'))
44
+ written = d.get('written_at','')
45
+ if written:
46
+ age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
47
+ print(f'Context age: {age:.1f}h')
48
+ " 2>/dev/null || echo "No context or no written_at"
49
+ ```
50
+ - If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
51
+ - If context is **≥ 24h old** or absent, run elaborate-spec normally.
52
+ - On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
53
+
39
54
  3. **Invoke the matching skill** — Run the skill that matches the workflow key:
40
55
  - `survey-context` — where are we?
41
56
  - `scope-work` — what's in and out?
@@ -52,6 +67,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
52
67
 
53
68
  In `specs/planning-status.yaml`:
54
69
  ```yaml
70
+ context_capsule: # written by elaborate-spec; cleared on cycle completion
71
+ written_at: "2026-06-22T03:00:00Z"
72
+ written_by: elaborate-spec
73
+ feature_name: "add dark mode"
55
74
  workflows:
56
75
  survey-context:
57
76
  required: true
@@ -17,6 +17,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
17
17
 
18
18
  ## Process
19
19
 
20
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
21
+ ```bash
22
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
23
+ ```
24
+ Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
25
+
20
26
  1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
21
27
 
22
28
  2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
@@ -17,6 +17,12 @@ Produce **epic capsule story tasks** in `specs/epics/eNN-slug/` — vertical sli
17
17
 
18
18
  ## Process
19
19
 
20
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it first:
21
+ ```bash
22
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
23
+ ```
24
+ Use `feature_name`, `constraints`, and `out_of_scope` to inform slice boundaries. `key_decisions` in the file may constrain how stories are cut (e.g., "no external deps" constrains slice 2). If absent, proceed normally.
25
+
20
26
  1. **Read context** — Read `specs/product/SCOPE_LATEST.yaml` and/or `specs/release-plan.yaml`. Understand what the epic delivers end-to-end.
21
27
 
22
28
  2. **Cut tracer-bullet slices** — Identify the thinnest possible vertical path through the stack that delivers user value. Start with this slice; it will catch integration issues first. For example:
@@ -14,7 +14,8 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
14
14
  | Mode | Scope |
15
15
  |------|-------|
16
16
  | **Quick Scan** | Skills changed since last tag or in current diff |
17
- | **Full** | All 62 skills per SKILL-INDEX.md + catalog audit |
17
+ | **Full** | All skills per SKILL-INDEX.md + catalog audit |
18
+ | **--verify** | Run `bash scripts/run-skill-verify.sh` and append health results to the stocktake report |
18
19
 
19
20
  ## Process
20
21
 
@@ -26,6 +27,7 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
26
27
  - Skills with zero calls (potential dead weight)
27
28
  - Skills with high average time (candidates for `evolve-skill`)
28
29
  5. Critical findings → `plan-work` story; cosmetic → `evolve-skill` candidate.
30
+ 6. **--verify mode:** Run `bash scripts/run-skill-verify.sh` and append a `## Verify Health` section to the stocktake report: `"N/68 PASS, M FAIL, K SKIP"`. FAIL skills are critical findings and go straight to `plan-work`.
29
31
 
30
32
  ### Skill timing data (`metrics.skill_timings`)
31
33
 
@@ -0,0 +1,90 @@
1
+ ---
2
+ name: audit-plan
3
+ description: "Evaluate an incoming project plan against bigpowers principles and conventions, surface gaps, and produce a READY/NOT READY verdict before engagement begins. Use when a new project arrives, when adapting a foreign plan, or before running seed-conventions on an unfamiliar codebase."
4
+ model: sonnet
5
+ ---
6
+
7
+
8
+ # Audit Plan
9
+
10
+ > **HARD GATE** — Do NOT start build skills (kickoff-branch, develop-tdd) until audit-plan returns a READY verdict. A plan missing test commands, scope boundaries, or success criteria will produce drift and rework downstream.
11
+
12
+ Assess an incoming project plan for alignment with bigpowers principles, identify what's missing, and produce a structured readiness report before any skill execution begins.
13
+
14
+ ## Three lenses
15
+
16
+ ### 1. Principles alignment
17
+ - Are stories vertical slices (not horizontal layers)?
18
+ - Is scope bounded — explicit in_scope + out_of_scope?
19
+ - Are success criteria defined (how do we know we're done)?
20
+ - Are HARD GATE candidates identifiable (critical decision points)?
21
+ - Is there a domain language / ubiquitous terminology?
22
+
23
+ ### 2. Conventions completeness
24
+ - Does `CLAUDE.md` or `AGENTS.md` exist?
25
+ - Does `CONVENTIONS.md` exist?
26
+ - Is the `specs/` directory layout in place?
27
+ - Are commit conventions documented (Conventional Commits)?
28
+ - Is the git workflow mode identified (`solo-git` | `team-pr`)?
29
+
30
+ ### 3. Bigpowers pre-flight (must all be answered before build)
31
+ | Question | Why |
32
+ |----------|-----|
33
+ | What is the **test command**? | `develop-tdd` verify steps require it |
34
+ | What is the **build command**? | `verify-work` mechanical gate |
35
+ | What is the **lint command**? | `audit-code` lint gate |
36
+ | What is the **typecheck command**? | `verify-work` typecheck gate |
37
+ | What **CI platform** is in use? | `wire-ci` configuration |
38
+ | **Solo or team**? | `release-branch` integration mode |
39
+ | Primary **language + framework**? | model routing + conventions |
40
+ | **Greenfield or existing** codebase? | determines whether to run `seed-conventions` or `migrate-spec` first |
41
+
42
+ ## Process
43
+
44
+ 1. **Ingest the plan** — accept a file path, pasted PRD text, or existing `specs/` artifacts. Read `CLAUDE.md` and `CONVENTIONS.md` if present.
45
+
46
+ 2. **Score each lens** — for every item above, mark:
47
+ - ✅ Present and adequate
48
+ - ⚠️ Present but incomplete — note what's missing
49
+ - ❌ Absent
50
+
51
+ 3. **Close gaps conversationally** — for each ❌ or ⚠️, ask one question at a time. Record each answer before moving to the next.
52
+
53
+ 4. **Write `specs/PLAN-AUDIT.md`**:
54
+
55
+ ```markdown
56
+ # Plan Audit — <project>
57
+ **Date:** YYYY-MM-DD · **Verdict:** READY | NOT READY
58
+
59
+ ## Principles Alignment
60
+ | Check | Status | Note |
61
+ | Vertical slices | ✅ | 4 stories, each shippable |
62
+ | Scope bounded | ⚠️ | in_scope present; out_of_scope missing |
63
+
64
+ ## Conventions Completeness
65
+ | Check | Status | Note |
66
+
67
+ ## Pre-flight Answers
68
+ | Command | Value |
69
+ | test | `npm test` |
70
+ | build | `npm run build` |
71
+
72
+ ## Open Gaps
73
+ - [ ] Add out_of_scope to scope definition (run scope-work)
74
+ - [ ] Create CLAUDE.md (run seed-conventions)
75
+
76
+ ## Verdict
77
+ READY — proceed with survey-context
78
+ NOT READY — N gaps remain; close before proceeding
79
+ ```
80
+
81
+ 5. **Recommend next skill**:
82
+ - READY → `survey-context`
83
+ - Needs bootstrapping → `seed-conventions`
84
+ - Needs spec elaboration → `elaborate-spec`
85
+ - Has foreign spec format → `migrate-spec`
86
+ - Plan assumptions need challenging → `grill-me`
87
+
88
+ ## Verify
89
+
90
+ → verify: `test -f specs/PLAN-AUDIT.md && grep -q 'Verdict' specs/PLAN-AUDIT.md && echo OK || echo FAIL`
@@ -73,7 +73,26 @@ Summarize your understanding in 3–5 bullet points aligned with [countable-stor
73
73
 
74
74
  Ask: "Is this an accurate summary? Anything missing or wrong?"
75
75
 
76
- ### 5. Suggest next skill
76
+ ### 5. Write specs/planning-context.yaml
77
+
78
+ After the user confirms the summary in step 4, persist the key decisions:
79
+
80
+ ```yaml
81
+ # specs/planning-context.yaml — written by elaborate-spec; consumed by scope-work and slice-tasks
82
+ feature_name: "<from step 1>"
83
+ problem_statement: "<one paragraph>"
84
+ constraints:
85
+ - "<constraint 1>"
86
+ out_of_scope:
87
+ - "<excluded item 1>"
88
+ key_decisions:
89
+ - decision: "<what was decided>"
90
+ rationale: "<why>"
91
+ ```
92
+
93
+ If `specs/planning-context.yaml` already exists, ask: `"Planning context from a prior session exists. Update it? [Y/n]"`. Overwrite on Y; leave unchanged on N.
94
+
95
+ ### 6. Suggest next skill
77
96
 
78
97
  Once the spec is clear, recommend the next step:
79
98
  - If domain model needs work → `model-domain`
@@ -11,16 +11,23 @@ model: opus
11
11
 
12
12
  ## Loop
13
13
 
14
- 1. Run `bigpowers-benchmark` (external repo); save report path in state.yaml.
15
- 2. Identify target skill + measurable gap from report.
16
- 3. `plan-work`minimal change proposal with verify commands.
17
- 4. Edit via `craft-skill` / direct SKILL.md edit; run `sync-skills.sh`.
18
- 5. Re-run benchmark; compare scores.
19
- 6. Record decision in `specs/adr/` + `session-state`; revert if regression.
14
+ 1. **Establish baseline** — Run `run-benchmark <skill> --baseline`. If no definition exists at `specs/benchmarks/<skill>.yaml`, create one following `specs/benchmarks/SCHEMA.md` first. Save report path in `state.yaml`. If `specs/benchmarks/reports/BASELINE-<skill>.yaml` already exists, skip this step.
15
+
16
+ 2. **Identify gap** Read the baseline report (`specs/benchmarks/reports/BASELINE-<skill>.yaml`). Find scenarios with `result: FAIL` or low `pass_at_k`. This is the measurable gap.
17
+
18
+ 3. **`plan-work`** Write a minimal change proposal targeting the failing scenarios. Include verify commands.
19
+
20
+ 4. **Edit** via `craft-skill` / direct SKILL.md edit; run `bash scripts/sync-skills.sh`.
21
+
22
+ 5. **Re-run benchmark** — `run-benchmark <skill>`. Compare new `pass_at_k` against baseline.
23
+ - **IMPROVED or STABLE** → advance to step 6.
24
+ - **REGRESSION** (`new pass_at_k < baseline`) → revert the change and loop back to step 3.
25
+
26
+ 6. **Record decision** — Write `specs/adr/NNNN-evolve-<skill>.md` with before/after `pass_at_k` scores. Update `session-state`.
20
27
 
21
28
  ## Verify
22
29
 
23
- → verify: benchmark report shows post-change score baseline (document paths in state.yaml)
30
+ → verify: `grep -c 'run-benchmark\|pass_at_k\|BASELINE-' evolve-skill/SKILL.md | awk '{if($1>=2) print "OK"; else print "FAIL"}'`
24
31
 
25
32
  See [REFERENCE.md](REFERENCE.md) for ADR template.
26
33
 
@@ -8,6 +8,8 @@ model: sonnet
8
8
 
9
9
  # Quick Fix
10
10
 
11
+ > **HARD GATE** — ALL entry criteria must pass before invoking quick-fix. If any guardrail triggers during execution, abort immediately and fall back to `investigate-bug`. Do NOT use quick-fix for logic changes, multi-file edits, or diffs > 5 lines.
12
+
11
13
  Fast-track for trivial data-only fixes that do not require the full bug-fix chain.
12
14
 
13
15
  When a bug fix is purely data — an add-missing-key, a typo correction, a config value update — the standard 6-skill chain (investigate-bug → diagnose-root → develop-tdd → kickoff-branch → verify-work → release-branch) is wasteful overhead. Quick-fix collapses it to 2 skills: **quick-fix** then **release-branch**.
@@ -0,0 +1,71 @@
1
+ ---
2
+ name: run-benchmark
3
+ description: "Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions."
4
+ model: haiku
5
+ ---
6
+
7
+
8
+ # Run Benchmark
9
+
10
+ > **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
11
+
12
+ Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
13
+
14
+ ## Usage
15
+
16
+ ```bash
17
+ # Benchmark a single skill
18
+ run-benchmark <skill-name>
19
+
20
+ # Benchmark all skills with definitions
21
+ run-benchmark --all
22
+
23
+ # Pin current results as baseline
24
+ run-benchmark <skill-name> --baseline
25
+ ```
26
+
27
+ ## Process
28
+
29
+ 1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
30
+
31
+ 2. **Run each scenario** — For each scenario in `scenarios[]`:
32
+ - **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
33
+ - **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
34
+
35
+ 3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
36
+
37
+ 4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
38
+
39
+ ```yaml
40
+ skill: survey-context
41
+ run_date: "2026-06-22"
42
+ pass_at_k: 0.83
43
+ total_scenarios: 3
44
+ passed: 2
45
+ failed: 1
46
+ scenarios:
47
+ - id: s01
48
+ name: "detects active epic from state.yaml"
49
+ result: PASS
50
+ weight: 1.0
51
+ - id: s02
52
+ name: "reads release-plan.yaml and reports next epic"
53
+ result: PASS
54
+ weight: 1.0
55
+ - id: s03
56
+ name: "handles missing state.yaml gracefully"
57
+ result: FAIL
58
+ weight: 0.5
59
+ failure_note: "crashed instead of suggesting state.yaml creation"
60
+ ```
61
+
62
+ 5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
63
+
64
+ 6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
65
+ - `IMPROVED: 0.67 → 0.83`
66
+ - `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
67
+ - `STABLE: 0.83 = 0.83`
68
+
69
+ ## Verify
70
+
71
+ → verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
@@ -38,6 +38,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
38
38
 
39
39
  2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
40
40
 
41
+ 2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
42
+ ```bash
43
+ test -f specs/planning-context.yaml && python3 -c "
44
+ import yaml, datetime
45
+ d = yaml.safe_load(open('specs/planning-context.yaml'))
46
+ written = d.get('written_at','')
47
+ if written:
48
+ age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
49
+ print(f'Context age: {age:.1f}h')
50
+ " 2>/dev/null || echo "No context or no written_at"
51
+ ```
52
+ - If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
53
+ - If context is **≥ 24h old** or absent, run elaborate-spec normally.
54
+ - On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
55
+
41
56
  3. **Invoke the matching skill** — Run the skill that matches the workflow key:
42
57
  - `survey-context` — where are we?
43
58
  - `scope-work` — what's in and out?
@@ -54,6 +69,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
54
69
 
55
70
  In `specs/planning-status.yaml`:
56
71
  ```yaml
72
+ context_capsule: # written by elaborate-spec; cleared on cycle completion
73
+ written_at: "2026-06-22T03:00:00Z"
74
+ written_by: elaborate-spec
75
+ feature_name: "add dark mode"
57
76
  workflows:
58
77
  survey-context:
59
78
  required: true
@@ -19,6 +19,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
19
19
 
20
20
  ## Process
21
21
 
22
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
23
+ ```bash
24
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
25
+ ```
26
+ Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
27
+
22
28
  1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
23
29
 
24
30
  2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
@@ -19,6 +19,12 @@ Produce **epic capsule story tasks** in `specs/epics/eNN-slug/` — vertical sli
19
19
 
20
20
  ## Process
21
21
 
22
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it first:
23
+ ```bash
24
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
25
+ ```
26
+ Use `feature_name`, `constraints`, and `out_of_scope` to inform slice boundaries. `key_decisions` in the file may constrain how stories are cut (e.g., "no external deps" constrains slice 2). If absent, proceed normally.
27
+
22
28
  1. **Read context** — Read `specs/product/SCOPE_LATEST.yaml` and/or `specs/release-plan.yaml`. Understand what the epic delivers end-to-end.
23
29
 
24
30
  2. **Cut tracer-bullet slices** — Identify the thinnest possible vertical path through the stack that delivers user value. Start with this slice; it will catch integration issues first. For example:
@@ -16,7 +16,8 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
16
16
  | Mode | Scope |
17
17
  |------|-------|
18
18
  | **Quick Scan** | Skills changed since last tag or in current diff |
19
- | **Full** | All 62 skills per SKILL-INDEX.md + catalog audit |
19
+ | **Full** | All skills per SKILL-INDEX.md + catalog audit |
20
+ | **--verify** | Run `bash scripts/run-skill-verify.sh` and append health results to the stocktake report |
20
21
 
21
22
  ## Process
22
23
 
@@ -28,6 +29,7 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
28
29
  - Skills with zero calls (potential dead weight)
29
30
  - Skills with high average time (candidates for `evolve-skill`)
30
31
  5. Critical findings → `plan-work` story; cosmetic → `evolve-skill` candidate.
32
+ 6. **--verify mode:** Run `bash scripts/run-skill-verify.sh` and append a `## Verify Health` section to the stocktake report: `"N/68 PASS, M FAIL, K SKIP"`. FAIL skills are critical findings and go straight to `plan-work`.
31
33
 
32
34
  ### Skill timing data (`metrics.skill_timings`)
33
35