bigpowers 2.25.0 → 2.27.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45) hide show
  1. package/.pi/package.json +2 -2
  2. package/.pi/prompts/build-epic.md +22 -0
  3. package/.pi/prompts/compose-workflow.md +9 -7
  4. package/.pi/prompts/develop-tdd.md +20 -0
  5. package/.pi/prompts/elaborate-spec.md +20 -1
  6. package/.pi/prompts/evolve-skill.md +14 -7
  7. package/.pi/prompts/kickoff-branch.md +26 -1
  8. package/.pi/prompts/quick-fix.md +2 -0
  9. package/.pi/prompts/run-benchmark.md +69 -0
  10. package/.pi/prompts/run-planning.md +19 -0
  11. package/.pi/prompts/scope-work.md +6 -0
  12. package/.pi/prompts/slice-tasks.md +6 -0
  13. package/.pi/prompts/stocktake-skills.md +3 -1
  14. package/.pi/prompts/verify-work.md +30 -0
  15. package/.pi/skills/build-epic/SKILL.md +22 -0
  16. package/.pi/skills/compose-workflow/SKILL.md +9 -7
  17. package/.pi/skills/develop-tdd/SKILL.md +20 -0
  18. package/.pi/skills/elaborate-spec/SKILL.md +20 -1
  19. package/.pi/skills/evolve-skill/SKILL.md +14 -7
  20. package/.pi/skills/kickoff-branch/SKILL.md +26 -1
  21. package/.pi/skills/quick-fix/SKILL.md +2 -0
  22. package/.pi/skills/run-benchmark/SKILL.md +71 -0
  23. package/.pi/skills/run-planning/SKILL.md +19 -0
  24. package/.pi/skills/scope-work/SKILL.md +6 -0
  25. package/.pi/skills/slice-tasks/SKILL.md +6 -0
  26. package/.pi/skills/stocktake-skills/SKILL.md +3 -1
  27. package/.pi/skills/verify-work/SKILL.md +30 -0
  28. package/CHANGELOG.md +25 -0
  29. package/SKILL-INDEX.md +2 -2
  30. package/build-epic/SKILL.md +22 -0
  31. package/compose-workflow/SKILL.md +9 -7
  32. package/develop-tdd/SKILL.md +20 -0
  33. package/elaborate-spec/SKILL.md +20 -1
  34. package/evolve-skill/SKILL.md +14 -7
  35. package/kickoff-branch/SKILL.md +26 -1
  36. package/package.json +1 -1
  37. package/quick-fix/SKILL.md +2 -0
  38. package/run-benchmark/SKILL.md +70 -0
  39. package/run-planning/SKILL.md +19 -0
  40. package/scope-work/SKILL.md +6 -0
  41. package/scripts/run-skill-verify.sh +57 -0
  42. package/skills-lock.json +17 -12
  43. package/slice-tasks/SKILL.md +6 -0
  44. package/stocktake-skills/SKILL.md +3 -1
  45. package/verify-work/SKILL.md +30 -0
@@ -0,0 +1,71 @@
1
+ ---
2
+ name: run-benchmark
3
+ description: "Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions."
4
+ model: haiku
5
+ ---
6
+
7
+
8
+ # Run Benchmark
9
+
10
+ > **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
11
+
12
+ Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
13
+
14
+ ## Usage
15
+
16
+ ```bash
17
+ # Benchmark a single skill
18
+ run-benchmark <skill-name>
19
+
20
+ # Benchmark all skills with definitions
21
+ run-benchmark --all
22
+
23
+ # Pin current results as baseline
24
+ run-benchmark <skill-name> --baseline
25
+ ```
26
+
27
+ ## Process
28
+
29
+ 1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
30
+
31
+ 2. **Run each scenario** — For each scenario in `scenarios[]`:
32
+ - **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
33
+ - **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
34
+
35
+ 3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
36
+
37
+ 4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
38
+
39
+ ```yaml
40
+ skill: survey-context
41
+ run_date: "2026-06-22"
42
+ pass_at_k: 0.83
43
+ total_scenarios: 3
44
+ passed: 2
45
+ failed: 1
46
+ scenarios:
47
+ - id: s01
48
+ name: "detects active epic from state.yaml"
49
+ result: PASS
50
+ weight: 1.0
51
+ - id: s02
52
+ name: "reads release-plan.yaml and reports next epic"
53
+ result: PASS
54
+ weight: 1.0
55
+ - id: s03
56
+ name: "handles missing state.yaml gracefully"
57
+ result: FAIL
58
+ weight: 0.5
59
+ failure_note: "crashed instead of suggesting state.yaml creation"
60
+ ```
61
+
62
+ 5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
63
+
64
+ 6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
65
+ - `IMPROVED: 0.67 → 0.83`
66
+ - `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
67
+ - `STABLE: 0.83 = 0.83`
68
+
69
+ ## Verify
70
+
71
+ → verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
@@ -38,6 +38,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
38
38
 
39
39
  2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
40
40
 
41
+ 2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
42
+ ```bash
43
+ test -f specs/planning-context.yaml && python3 -c "
44
+ import yaml, datetime
45
+ d = yaml.safe_load(open('specs/planning-context.yaml'))
46
+ written = d.get('written_at','')
47
+ if written:
48
+ age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
49
+ print(f'Context age: {age:.1f}h')
50
+ " 2>/dev/null || echo "No context or no written_at"
51
+ ```
52
+ - If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
53
+ - If context is **≥ 24h old** or absent, run elaborate-spec normally.
54
+ - On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
55
+
41
56
  3. **Invoke the matching skill** — Run the skill that matches the workflow key:
42
57
  - `survey-context` — where are we?
43
58
  - `scope-work` — what's in and out?
@@ -54,6 +69,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
54
69
 
55
70
  In `specs/planning-status.yaml`:
56
71
  ```yaml
72
+ context_capsule: # written by elaborate-spec; cleared on cycle completion
73
+ written_at: "2026-06-22T03:00:00Z"
74
+ written_by: elaborate-spec
75
+ feature_name: "add dark mode"
57
76
  workflows:
58
77
  survey-context:
59
78
  required: true
@@ -19,6 +19,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
19
19
 
20
20
  ## Process
21
21
 
22
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
23
+ ```bash
24
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
25
+ ```
26
+ Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
27
+
22
28
  1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
23
29
 
24
30
  2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
@@ -19,6 +19,12 @@ Produce **epic capsule story tasks** in `specs/epics/eNN-slug/` — vertical sli
19
19
 
20
20
  ## Process
21
21
 
22
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it first:
23
+ ```bash
24
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
25
+ ```
26
+ Use `feature_name`, `constraints`, and `out_of_scope` to inform slice boundaries. `key_decisions` in the file may constrain how stories are cut (e.g., "no external deps" constrains slice 2). If absent, proceed normally.
27
+
22
28
  1. **Read context** — Read `specs/product/SCOPE_LATEST.yaml` and/or `specs/release-plan.yaml`. Understand what the epic delivers end-to-end.
23
29
 
24
30
  2. **Cut tracer-bullet slices** — Identify the thinnest possible vertical path through the stack that delivers user value. Start with this slice; it will catch integration issues first. For example:
@@ -16,7 +16,8 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
16
16
  | Mode | Scope |
17
17
  |------|-------|
18
18
  | **Quick Scan** | Skills changed since last tag or in current diff |
19
- | **Full** | All 62 skills per SKILL-INDEX.md + catalog audit |
19
+ | **Full** | All skills per SKILL-INDEX.md + catalog audit |
20
+ | **--verify** | Run `bash scripts/run-skill-verify.sh` and append health results to the stocktake report |
20
21
 
21
22
  ## Process
22
23
 
@@ -28,6 +29,7 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
28
29
  - Skills with zero calls (potential dead weight)
29
30
  - Skills with high average time (candidates for `evolve-skill`)
30
31
  5. Critical findings → `plan-work` story; cosmetic → `evolve-skill` candidate.
32
+ 6. **--verify mode:** Run `bash scripts/run-skill-verify.sh` and append a `## Verify Health` section to the stocktake report: `"N/68 PASS, M FAIL, K SKIP"`. FAIL skills are critical findings and go straight to `plan-work`.
31
33
 
32
34
  ### Skill timing data (`metrics.skill_timings`)
33
35
 
@@ -17,6 +17,7 @@ Review answers "is the code good?"; Verify answers "does the built thing do what
17
17
 
18
18
  - Default: full UAT plus gaps loop
19
19
  - --smoke: Cold-start only plus one happy-path flow. Use for hotfixes.
20
+ - --cli: CLI tool verification — replaces cold-start with binary smoke checklist. Use for CLI tools with no server process.
20
21
 
21
22
  ## Process
22
23
 
@@ -25,6 +26,12 @@ Review answers "is the code good?"; Verify answers "does the built thing do what
25
26
  0. **Branch check** — must not be `main`/`master`.
26
27
 
27
28
  1. Read active story tasks from `specs/epics/<capsule>/eNNsYY-tasks.yaml` and story spec from `specs/epics/<capsule>/eNNsYY-<slug>.md` (countable-story-format, Gherkin in §17).
29
+ 1a. **Pre-UAT verify validation** — for each task's `verify:` command, run it and detect pattern mismatches before UAT begins. If a grep/awk/jq command fails, check whether the pattern is wrong vs. a genuine failure:
30
+ ```bash
31
+ # For a failing grep -q 'PATTERN' FILE, check what is actually in FILE
32
+ grep 'PATTERN' FILE || grep -n '' FILE | head -20 # show nearest lines
33
+ ```
34
+ Report: `"Pattern 'X' not found. Nearest match: 'Y' at line N"` and ask `"Update verify command? [Y/n]"`. Fix before proceeding — a mismatched verify command produces false failures during UAT.
28
35
  2. **Cold-start smoke** (if app): stop server, clear caches, boot from scratch.
29
36
  3. **AGENTS.md preflight** — before running default checks, call `bash scripts/bp-read-agents.sh` to detect project-specific commands. If `BP_PREFLIGHT` is set, run it instead of the default mechanical gates (or in addition to them if the project requires both). Output: `"Using preflight from AGENTS.md: <cmd>"`. Fall back to `CLAUDE.md` commands if AGENTS.md is absent.
30
37
  4. Mechanical gates: build → typecheck → lint → tests (from `CLAUDE.md` or AGENTS.md).
@@ -92,6 +99,29 @@ phases:
92
99
 
93
100
  > **HARD GATE** — Verification evidence MUST be persisted before marking the story done. No evidence = not verified.
94
101
 
102
+ ## --cli mode
103
+
104
+ For CLI tools where cold-start smoke (stop server / clear caches) does not apply. Auto-detected when the project has no server process (no `listen()`, no `server.js`, no blocking `main()`); or explicitly activated with `--cli`.
105
+
106
+ **Auto-detect binary name:**
107
+ ```bash
108
+ # Cargo.toml
109
+ BINARY=$(grep '^name' Cargo.toml | head -1 | awk -F'"' '{print $2}')
110
+ # package.json
111
+ BINARY=$(node -e "console.log(require('./package.json').bin && Object.keys(require('./package.json').bin)[0] || '')" 2>/dev/null)
112
+ # Makefile
113
+ BINARY=$(grep '^BIN\s*=' Makefile 2>/dev/null | awk '{print $3}')
114
+ ```
115
+
116
+ **CLI verification checklist (replaces cold-start smoke):**
117
+
118
+ 1. `--help` smoke: `$BINARY --help` → assert output contains "Usage"
119
+ 2. `--version` check: `$BINARY --version` → assert version matches manifest (Cargo.toml / package.json)
120
+ 3. Happy-path: run documented example command from README.md → assert non-empty output
121
+ 4. Edge case: `$BINARY --invalid-flag` → assert exit code ≠ 0 and error message printed
122
+
123
+ No "stop server" or "clear caches" steps are executed in `--cli` mode. Steps 3–6 of the default process (mechanical gates, UAT, gaps loop) still run unchanged.
124
+
95
125
  ## Verify
96
126
 
97
127
  → verify: `test -f specs/verifications/<story_id>-verify.yaml && echo "Evidence persisted"`
package/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
1
+ # [2.27.0](https://github.com/danielvm-git/bigpowers/compare/v2.26.0...v2.27.0) (2026-06-22)
2
+
3
+
4
+ ### Bug Fixes
5
+
6
+ * **quick-fix:** add HARD GATE callout — completes 35/35 skills ([9782b85](https://github.com/danielvm-git/bigpowers/commit/9782b85de96fabc77c14fe8f7be3fd30b98e9cfb))
7
+
8
+
9
+ ### Features
10
+
11
+ * **2.9.0:** implement e22/e23/e24 — Depth release ([c33d66c](https://github.com/danielvm-git/bigpowers/commit/c33d66c6a386a6c273ab1f4c2a71981309e1cda6))
12
+ * **plan-release:** scope 2.9.0 Depth — e22/e23/e24 ([5a64698](https://github.com/danielvm-git/bigpowers/commit/5a646981c520f4874c4f2807d53af40e17b093fb))
13
+
14
+ # [2.26.0](https://github.com/danielvm-git/bigpowers/compare/v2.25.0...v2.26.0) (2026-06-22)
15
+
16
+
17
+ ### Bug Fixes
18
+
19
+ * **compose-workflow:** respond to code review — fix recipes and docs ([17ac105](https://github.com/danielvm-git/bigpowers/commit/17ac105371c25fd111821b3711648eb4009db314))
20
+
21
+
22
+ ### Features
23
+
24
+ * **e20:** Build-Epic Ergonomics & Flow Polish — all stories done ([80b4b9a](https://github.com/danielvm-git/bigpowers/commit/80b4b9acd049ae2521c29e1d66fc0be5b509cac9))
25
+
1
26
  # [2.25.0](https://github.com/danielvm-git/bigpowers/compare/v2.24.0...v2.25.0) (2026-06-22)
2
27
 
3
28
 
package/SKILL-INDEX.md CHANGED
@@ -3,8 +3,8 @@
3
3
  > **DO NOT EDIT** — This file is auto-generated by `scripts/generate-skill-index.sh`.
4
4
  > Edit `SKILL.md` source files or `skills-lock.json` instead. Run `bash scripts/sync-skills.sh` to regenerate.
5
5
 
6
- **Generated:** 2026-06-22T02:29:38Z
7
- **Skills:** 68
6
+ **Generated:** 2026-06-22T11:54:00Z
7
+ **Skills:** 69
8
8
 
9
9
  ---
10
10
 
@@ -47,6 +47,28 @@ After step 5 (verify-work) completes successfully, step 6 runs `audit-code` auto
47
47
  4. **Audit artifact:** Full audit report saved to `specs/verifications/AUDIT-<epic>-<story>.md` regardless of pass/fail, for reviewer traceability.
48
48
  5. **Enforce F.I.R.S.T:** After audit-code passes, run `enforce-first --quick` on new/modified tests. Append F.I.R.S.T violations (if any) to the audit report. Failing F.I.R.S.T criteria trigger the same loop-back to step 4.
49
49
 
50
+ ## --fast mode
51
+
52
+ Coalesces read-and-report steps to reduce token overhead. Activate with `build-epic --fast`.
53
+
54
+ | Normal | --fast | Change |
55
+ |--------|--------|--------|
56
+ | Step 1 (survey-context) | 1+2 together | survey + plan in one invocation |
57
+ | Step 2 (plan-work) | (absorbed into 1) | — |
58
+ | Step 3 (kickoff-branch) | Step 2 | unchanged, sequential |
59
+ | Step 4 (develop-tdd) | Step 3 | unchanged, sequential |
60
+ | Step 5 (verify-work) | Step 4 | unchanged, sequential |
61
+ | Step 6 (audit-code) | 5+6 together | audit + commit-message in one invocation |
62
+ | Step 7 (commit-message) | (absorbed into 6) | — |
63
+ | Step 8 (release-branch) | Step 7 | unchanged, sequential |
64
+
65
+ **Total invocations:** 8 → 6 per story.
66
+
67
+ **Rules:**
68
+ - Steps 3/4/5/8 (kickoff, develop, verify, release) still run sequentially — they require user interaction or branch state.
69
+ - `--fast` does NOT skip any checklist items; it only coalesces steps that are pure read-and-report.
70
+ - Record `epic_cycle.fast_mode: true` in `state.yaml` when this flag is active.
71
+
50
72
  ## Handoff
51
73
 
52
74
  Write `handoff.next_skill` and `handoff.context` in `state.yaml` when pausing mid-epic.
@@ -11,13 +11,15 @@ model: sonnet
11
11
  ## Process
12
12
 
13
13
  1. Interview: goal, phases, which skills, gates between steps.
14
- 2. Write `specs/WORKFLOW-<name>.md`:
15
- - Trigger ("Use when...")
16
- - Ordered steps: `skill artefact → verify`
17
- - HARD GATEs between phases
14
+ 2. Write `specs/workflows/<name>.yaml`:
15
+ - `name`, `command`, `description`, `skills[]`, `verify`
16
+ - Optional: `args` for skill-specific arguments
18
17
  3. Register in state.yaml Active Decisions.
19
18
  4. Optional: reference from `orchestrate-project` Ad-Hoc mode.
20
19
 
20
+ > **Prefer the YAML recipe format** over the legacy `specs/WORKFLOW-<name>.md` markdown format.
21
+ > YAML recipes are command-mappable, machine-readable, and listed in the Standard Recipe Library.
22
+
21
23
  ## Standard Recipe Library
22
24
 
23
25
  Pre-built recipes in `specs/workflows/` map agentic stack commands to skill chains.
@@ -25,13 +27,13 @@ Reference them in AGENTS.md so `/command` directly invokes the matching recipe.
25
27
 
26
28
  | Command | Workflow | Skill chain |
27
29
  |---------|----------|-------------|
28
- | `/check-stack` | check-stack | survey-context → assess-impact → verify-work |
30
+ | `/check-stack` | check-stack | survey-context → assess-impact → setup-environment |
29
31
  | `/ship` | ship | audit-code → commit-message → release-branch |
30
32
  | `/tdd` | tdd | develop-tdd → enforce-first |
31
33
  | `/code-review` | code-review | audit-code → request-review → respond-review |
32
- | `/security` | security | audit-code (security focus) → request-review |
34
+ | `/security` | security | audit-code → request-review |
33
35
  | `/plan` | plan | survey-context → research-first → plan-work |
34
- | `/build-fix` | build-fix | investigate-bug → diagnose-root → quick-fix → validate-fix |
36
+ | `/build-fix` | build-fix | investigate-bug → diagnose-root → develop-tdd → validate-fix |
35
37
  | `/e2e` | e2e | smoke-test → verify-work |
36
38
 
37
39
  Add to `AGENTS.md`:
@@ -149,6 +149,26 @@ Once all tests pass: locate the Verification Script in the active epic capsule,
149
149
  [ ] verify: command passes
150
150
  ```
151
151
 
152
+ ## --config mode
153
+
154
+ For pure-config tasks (update package.json, edit YAML, tweak manifest) where there is no test infrastructure to write against. The RED state is "verify command fails"; GREEN is "verify command passes."
155
+
156
+ **When to use:** task has a runnable `verify:` command and the deliverable is a config file change with no new behavior to unit-test. Invoke as `develop-tdd --config`.
157
+
158
+ **Cycle:**
159
+
160
+ ```
161
+ RED: Run verify command → it fails (expected)
162
+ GREEN: Apply config change → verify passes
163
+ COMMIT: commit: chore(<scope>): <change>
164
+ ```
165
+
166
+ **Rules:**
167
+ - Skips test-writing phase entirely — do NOT write a test file for config tasks.
168
+ - `verify:` command is **required** and must be runnable (no placeholder).
169
+ - Commit message follows Conventional Commits (`chore:` or `feat:` as appropriate).
170
+ - Still runs full `verify-work` after all tasks complete.
171
+
152
172
  ## Handoff
153
173
 
154
174
  Gate: READY -> next: verify-work
@@ -72,7 +72,26 @@ Summarize your understanding in 3–5 bullet points aligned with [countable-stor
72
72
 
73
73
  Ask: "Is this an accurate summary? Anything missing or wrong?"
74
74
 
75
- ### 5. Suggest next skill
75
+ ### 5. Write specs/planning-context.yaml
76
+
77
+ After the user confirms the summary in step 4, persist the key decisions:
78
+
79
+ ```yaml
80
+ # specs/planning-context.yaml — written by elaborate-spec; consumed by scope-work and slice-tasks
81
+ feature_name: "<from step 1>"
82
+ problem_statement: "<one paragraph>"
83
+ constraints:
84
+ - "<constraint 1>"
85
+ out_of_scope:
86
+ - "<excluded item 1>"
87
+ key_decisions:
88
+ - decision: "<what was decided>"
89
+ rationale: "<why>"
90
+ ```
91
+
92
+ If `specs/planning-context.yaml` already exists, ask: `"Planning context from a prior session exists. Update it? [Y/n]"`. Overwrite on Y; leave unchanged on N.
93
+
94
+ ### 6. Suggest next skill
76
95
 
77
96
  Once the spec is clear, recommend the next step:
78
97
  - If domain model needs work → `model-domain`
@@ -10,15 +10,22 @@ model: opus
10
10
 
11
11
  ## Loop
12
12
 
13
- 1. Run `bigpowers-benchmark` (external repo); save report path in state.yaml.
14
- 2. Identify target skill + measurable gap from report.
15
- 3. `plan-work`minimal change proposal with verify commands.
16
- 4. Edit via `craft-skill` / direct SKILL.md edit; run `sync-skills.sh`.
17
- 5. Re-run benchmark; compare scores.
18
- 6. Record decision in `specs/adr/` + `session-state`; revert if regression.
13
+ 1. **Establish baseline** — Run `run-benchmark <skill> --baseline`. If no definition exists at `specs/benchmarks/<skill>.yaml`, create one following `specs/benchmarks/SCHEMA.md` first. Save report path in `state.yaml`. If `specs/benchmarks/reports/BASELINE-<skill>.yaml` already exists, skip this step.
14
+
15
+ 2. **Identify gap** Read the baseline report (`specs/benchmarks/reports/BASELINE-<skill>.yaml`). Find scenarios with `result: FAIL` or low `pass_at_k`. This is the measurable gap.
16
+
17
+ 3. **`plan-work`** Write a minimal change proposal targeting the failing scenarios. Include verify commands.
18
+
19
+ 4. **Edit** via `craft-skill` / direct SKILL.md edit; run `bash scripts/sync-skills.sh`.
20
+
21
+ 5. **Re-run benchmark** — `run-benchmark <skill>`. Compare new `pass_at_k` against baseline.
22
+ - **IMPROVED or STABLE** → advance to step 6.
23
+ - **REGRESSION** (`new pass_at_k < baseline`) → revert the change and loop back to step 3.
24
+
25
+ 6. **Record decision** — Write `specs/adr/NNNN-evolve-<skill>.md` with before/after `pass_at_k` scores. Update `session-state`.
19
26
 
20
27
  ## Verify
21
28
 
22
- → verify: benchmark report shows post-change score baseline (document paths in state.yaml)
29
+ → verify: `grep -c 'run-benchmark\|pass_at_k\|BASELINE-' evolve-skill/SKILL.md | awk '{if($1>=2) print "OK"; else print "FAIL"}'`
23
30
 
24
31
  See [REFERENCE.md](REFERENCE.md) for ADR template.
@@ -32,7 +32,32 @@ git status # working tree MUST be clean
32
32
  git log --oneline -5
33
33
  ```
34
34
 
35
- If working tree is dirty, ask the user to stash or commit first. If not on `$DEFAULT` after checkout, stop and fix before continuing.
35
+ **Spec-only pre-kickoff** before enforcing the clean-tree gate, check whether dirty files are spec artifacts:
36
+
37
+ ```bash
38
+ DIRTY=$(git status --porcelain | awk '{print $2}')
39
+ NON_SPEC=$(echo "$DIRTY" | grep -v '^specs/' || true)
40
+
41
+ if [ -z "$DIRTY" ]; then
42
+ : # clean — proceed
43
+ elif [ -z "$NON_SPEC" ]; then
44
+ # spec-only dirty tree — offer auto-commit
45
+ echo "Dirty spec artifacts: $(echo $DIRTY | tr '\n' ' ')"
46
+ read -p "Commit spec artifacts before kickoff? [Y/n]: " CONFIRM
47
+ CONFIRM=${CONFIRM:-Y}
48
+ if [[ "$CONFIRM" =~ ^[Yy] ]]; then
49
+ git add specs/
50
+ git commit -m "chore(state): checkpoint before kickoff"
51
+ fi
52
+ else
53
+ echo "Dirty tree: $NON_SPEC (not a spec artifact). Stash or commit before proceeding."
54
+ exit 1
55
+ fi
56
+ ```
57
+
58
+ - **Spec artifacts** match `specs/` — state.yaml, epics YAMLs, execution-status.yaml, etc.
59
+ - **Non-spec dirty files** (src/, scripts/, SKILL.md, …) still enforce the full clean-tree gate.
60
+ - If not on `$DEFAULT` after checkout, stop and fix before continuing.
36
61
 
37
62
  ### 3. Pre-flight & Conflict Resolution
38
63
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "bigpowers",
3
- "version": "2.25.0",
3
+ "version": "2.27.0",
4
4
  "description": "61 agent skills for spec-driven, test-first software development by solo developers",
5
5
  "main": "index.js",
6
6
  "scripts": {
@@ -7,6 +7,8 @@ model: sonnet
7
7
 
8
8
  # Quick Fix
9
9
 
10
+ > **HARD GATE** — ALL entry criteria must pass before invoking quick-fix. If any guardrail triggers during execution, abort immediately and fall back to `investigate-bug`. Do NOT use quick-fix for logic changes, multi-file edits, or diffs > 5 lines.
11
+
10
12
  Fast-track for trivial data-only fixes that do not require the full bug-fix chain.
11
13
 
12
14
  When a bug fix is purely data — an add-missing-key, a typo correction, a config value update — the standard 6-skill chain (investigate-bug → diagnose-root → develop-tdd → kickoff-branch → verify-work → release-branch) is wasteful overhead. Quick-fix collapses it to 2 skills: **quick-fix** then **release-branch**.
@@ -0,0 +1,70 @@
1
+ ---
2
+ name: run-benchmark
3
+ model: haiku
4
+ description: Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions.
5
+ ---
6
+
7
+ # Run Benchmark
8
+
9
+ > **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
10
+
11
+ Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
12
+
13
+ ## Usage
14
+
15
+ ```bash
16
+ # Benchmark a single skill
17
+ run-benchmark <skill-name>
18
+
19
+ # Benchmark all skills with definitions
20
+ run-benchmark --all
21
+
22
+ # Pin current results as baseline
23
+ run-benchmark <skill-name> --baseline
24
+ ```
25
+
26
+ ## Process
27
+
28
+ 1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
29
+
30
+ 2. **Run each scenario** — For each scenario in `scenarios[]`:
31
+ - **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
32
+ - **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
33
+
34
+ 3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
35
+
36
+ 4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
37
+
38
+ ```yaml
39
+ skill: survey-context
40
+ run_date: "2026-06-22"
41
+ pass_at_k: 0.83
42
+ total_scenarios: 3
43
+ passed: 2
44
+ failed: 1
45
+ scenarios:
46
+ - id: s01
47
+ name: "detects active epic from state.yaml"
48
+ result: PASS
49
+ weight: 1.0
50
+ - id: s02
51
+ name: "reads release-plan.yaml and reports next epic"
52
+ result: PASS
53
+ weight: 1.0
54
+ - id: s03
55
+ name: "handles missing state.yaml gracefully"
56
+ result: FAIL
57
+ weight: 0.5
58
+ failure_note: "crashed instead of suggesting state.yaml creation"
59
+ ```
60
+
61
+ 5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
62
+
63
+ 6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
64
+ - `IMPROVED: 0.67 → 0.83`
65
+ - `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
66
+ - `STABLE: 0.83 = 0.83`
67
+
68
+ ## Verify
69
+
70
+ → verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
@@ -37,6 +37,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
37
37
 
38
38
  2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
39
39
 
40
+ 2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
41
+ ```bash
42
+ test -f specs/planning-context.yaml && python3 -c "
43
+ import yaml, datetime
44
+ d = yaml.safe_load(open('specs/planning-context.yaml'))
45
+ written = d.get('written_at','')
46
+ if written:
47
+ age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
48
+ print(f'Context age: {age:.1f}h')
49
+ " 2>/dev/null || echo "No context or no written_at"
50
+ ```
51
+ - If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
52
+ - If context is **≥ 24h old** or absent, run elaborate-spec normally.
53
+ - On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
54
+
40
55
  3. **Invoke the matching skill** — Run the skill that matches the workflow key:
41
56
  - `survey-context` — where are we?
42
57
  - `scope-work` — what's in and out?
@@ -53,6 +68,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
53
68
 
54
69
  In `specs/planning-status.yaml`:
55
70
  ```yaml
71
+ context_capsule: # written by elaborate-spec; cleared on cycle completion
72
+ written_at: "2026-06-22T03:00:00Z"
73
+ written_by: elaborate-spec
74
+ feature_name: "add dark mode"
56
75
  workflows:
57
76
  survey-context:
58
77
  required: true
@@ -18,6 +18,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
18
18
 
19
19
  ## Process
20
20
 
21
+ 0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
22
+ ```bash
23
+ test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
24
+ ```
25
+ Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
26
+
21
27
  1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
22
28
 
23
29
  2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
@@ -0,0 +1,57 @@
1
+ #!/usr/bin/env bash
2
+ # Run all SKILL.md → verify: commands and report PASS/FAIL/SKIP.
3
+ # Exit 0 only when zero FAILs.
4
+ # Usage: bash scripts/run-skill-verify.sh [skill-name]
5
+ # No args: runs all skills
6
+ # With arg: runs only the named skill
7
+
8
+ set -uo pipefail
9
+
10
+ REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
11
+ cd "$REPO_ROOT"
12
+
13
+ PASS=0; FAIL=0; SKIP=0
14
+ TARGET="${1:-}"
15
+
16
+ run_skill() {
17
+ local skill_md="$1"
18
+ local skill
19
+ skill=$(dirname "$skill_md")
20
+
21
+ local cmd
22
+ cmd=$(grep '→ verify:' "$skill_md" 2>/dev/null | head -1 | sed 's/.*→ verify: *//')
23
+
24
+ if [ -z "$cmd" ]; then
25
+ echo "SKIP: $skill"
26
+ SKIP=$((SKIP + 1))
27
+ return
28
+ fi
29
+
30
+ local output
31
+ if output=$(timeout 10 bash -c "$cmd" 2>&1); then
32
+ echo "PASS: $skill"
33
+ PASS=$((PASS + 1))
34
+ else
35
+ echo "FAIL: $skill — $cmd"
36
+ echo " output: $(echo "$output" | head -1)"
37
+ FAIL=$((FAIL + 1))
38
+ fi
39
+ }
40
+
41
+ if [ -n "$TARGET" ]; then
42
+ if [ -f "$TARGET/SKILL.md" ]; then
43
+ run_skill "$TARGET/SKILL.md"
44
+ else
45
+ echo "ERROR: $TARGET/SKILL.md not found"
46
+ exit 1
47
+ fi
48
+ else
49
+ for skill_md in */SKILL.md; do
50
+ run_skill "$skill_md"
51
+ done
52
+ fi
53
+
54
+ echo ""
55
+ echo "Results: $PASS PASS, $FAIL FAIL, $SKIP SKIP"
56
+
57
+ [ "$FAIL" -eq 0 ]