bigpowers 2.26.0 → 2.27.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.pi/package.json +2 -2
- package/.pi/prompts/elaborate-spec.md +20 -1
- package/.pi/prompts/evolve-skill.md +14 -7
- package/.pi/prompts/quick-fix.md +2 -0
- package/.pi/prompts/run-benchmark.md +69 -0
- package/.pi/prompts/run-planning.md +19 -0
- package/.pi/prompts/scope-work.md +6 -0
- package/.pi/prompts/slice-tasks.md +6 -0
- package/.pi/prompts/stocktake-skills.md +3 -1
- package/.pi/skills/elaborate-spec/SKILL.md +20 -1
- package/.pi/skills/evolve-skill/SKILL.md +14 -7
- package/.pi/skills/quick-fix/SKILL.md +2 -0
- package/.pi/skills/run-benchmark/SKILL.md +71 -0
- package/.pi/skills/run-planning/SKILL.md +19 -0
- package/.pi/skills/scope-work/SKILL.md +6 -0
- package/.pi/skills/slice-tasks/SKILL.md +6 -0
- package/.pi/skills/stocktake-skills/SKILL.md +3 -1
- package/CHANGELOG.md +13 -0
- package/SKILL-INDEX.md +2 -2
- package/elaborate-spec/SKILL.md +20 -1
- package/evolve-skill/SKILL.md +14 -7
- package/package.json +1 -1
- package/quick-fix/SKILL.md +2 -0
- package/run-benchmark/SKILL.md +70 -0
- package/run-planning/SKILL.md +19 -0
- package/scope-work/SKILL.md +6 -0
- package/scripts/run-skill-verify.sh +57 -0
- package/skills-lock.json +12 -7
- package/slice-tasks/SKILL.md +6 -0
- package/stocktake-skills/SKILL.md +3 -1
package/.pi/package.json
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "bigpowers",
|
|
3
|
-
"version": "2.
|
|
4
|
-
"description": "
|
|
3
|
+
"version": "2.27.0",
|
|
4
|
+
"description": "69 skills — 61 agent skills for spec-driven, test-first software development by solo developers",
|
|
5
5
|
"keywords": [
|
|
6
6
|
"pi-package"
|
|
7
7
|
],
|
|
@@ -71,7 +71,26 @@ Summarize your understanding in 3–5 bullet points aligned with [countable-stor
|
|
|
71
71
|
|
|
72
72
|
Ask: "Is this an accurate summary? Anything missing or wrong?"
|
|
73
73
|
|
|
74
|
-
### 5.
|
|
74
|
+
### 5. Write specs/planning-context.yaml
|
|
75
|
+
|
|
76
|
+
After the user confirms the summary in step 4, persist the key decisions:
|
|
77
|
+
|
|
78
|
+
```yaml
|
|
79
|
+
# specs/planning-context.yaml — written by elaborate-spec; consumed by scope-work and slice-tasks
|
|
80
|
+
feature_name: "<from step 1>"
|
|
81
|
+
problem_statement: "<one paragraph>"
|
|
82
|
+
constraints:
|
|
83
|
+
- "<constraint 1>"
|
|
84
|
+
out_of_scope:
|
|
85
|
+
- "<excluded item 1>"
|
|
86
|
+
key_decisions:
|
|
87
|
+
- decision: "<what was decided>"
|
|
88
|
+
rationale: "<why>"
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
If `specs/planning-context.yaml` already exists, ask: `"Planning context from a prior session exists. Update it? [Y/n]"`. Overwrite on Y; leave unchanged on N.
|
|
92
|
+
|
|
93
|
+
### 6. Suggest next skill
|
|
75
94
|
|
|
76
95
|
Once the spec is clear, recommend the next step:
|
|
77
96
|
- If domain model needs work → `model-domain`
|
|
@@ -9,16 +9,23 @@ description: Benchmark-gated skill evolution — consume bigpowers-benchmark rep
|
|
|
9
9
|
|
|
10
10
|
## Loop
|
|
11
11
|
|
|
12
|
-
1. Run `
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
12
|
+
1. **Establish baseline** — Run `run-benchmark <skill> --baseline`. If no definition exists at `specs/benchmarks/<skill>.yaml`, create one following `specs/benchmarks/SCHEMA.md` first. Save report path in `state.yaml`. If `specs/benchmarks/reports/BASELINE-<skill>.yaml` already exists, skip this step.
|
|
13
|
+
|
|
14
|
+
2. **Identify gap** — Read the baseline report (`specs/benchmarks/reports/BASELINE-<skill>.yaml`). Find scenarios with `result: FAIL` or low `pass_at_k`. This is the measurable gap.
|
|
15
|
+
|
|
16
|
+
3. **`plan-work`** — Write a minimal change proposal targeting the failing scenarios. Include verify commands.
|
|
17
|
+
|
|
18
|
+
4. **Edit** via `craft-skill` / direct SKILL.md edit; run `bash scripts/sync-skills.sh`.
|
|
19
|
+
|
|
20
|
+
5. **Re-run benchmark** — `run-benchmark <skill>`. Compare new `pass_at_k` against baseline.
|
|
21
|
+
- **IMPROVED or STABLE** → advance to step 6.
|
|
22
|
+
- **REGRESSION** (`new pass_at_k < baseline`) → revert the change and loop back to step 3.
|
|
23
|
+
|
|
24
|
+
6. **Record decision** — Write `specs/adr/NNNN-evolve-<skill>.md` with before/after `pass_at_k` scores. Update `session-state`.
|
|
18
25
|
|
|
19
26
|
## Verify
|
|
20
27
|
|
|
21
|
-
→ verify:
|
|
28
|
+
→ verify: `grep -c 'run-benchmark\|pass_at_k\|BASELINE-' evolve-skill/SKILL.md | awk '{if($1>=2) print "OK"; else print "FAIL"}'`
|
|
22
29
|
|
|
23
30
|
See [REFERENCE.md](REFERENCE.md) for ADR template.
|
|
24
31
|
|
package/.pi/prompts/quick-fix.md
CHANGED
|
@@ -6,6 +6,8 @@ description: "Streamlined fast-path for trivial data-only fixes — no TDD, no b
|
|
|
6
6
|
|
|
7
7
|
# Quick Fix
|
|
8
8
|
|
|
9
|
+
> **HARD GATE** — ALL entry criteria must pass before invoking quick-fix. If any guardrail triggers during execution, abort immediately and fall back to `investigate-bug`. Do NOT use quick-fix for logic changes, multi-file edits, or diffs > 5 lines.
|
|
10
|
+
|
|
9
11
|
Fast-track for trivial data-only fixes that do not require the full bug-fix chain.
|
|
10
12
|
|
|
11
13
|
When a bug fix is purely data — an add-missing-key, a typo correction, a config value update — the standard 6-skill chain (investigate-bug → diagnose-root → develop-tdd → kickoff-branch → verify-work → release-branch) is wasteful overhead. Quick-fix collapses it to 2 skills: **quick-fix** then **release-branch**.
|
|
@@ -0,0 +1,69 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions.
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
|
|
6
|
+
# Run Benchmark
|
|
7
|
+
|
|
8
|
+
> **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
|
|
9
|
+
|
|
10
|
+
Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
|
|
11
|
+
|
|
12
|
+
## Usage
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
# Benchmark a single skill
|
|
16
|
+
run-benchmark <skill-name>
|
|
17
|
+
|
|
18
|
+
# Benchmark all skills with definitions
|
|
19
|
+
run-benchmark --all
|
|
20
|
+
|
|
21
|
+
# Pin current results as baseline
|
|
22
|
+
run-benchmark <skill-name> --baseline
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
## Process
|
|
26
|
+
|
|
27
|
+
1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
|
|
28
|
+
|
|
29
|
+
2. **Run each scenario** — For each scenario in `scenarios[]`:
|
|
30
|
+
- **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
|
|
31
|
+
- **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
|
|
32
|
+
|
|
33
|
+
3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
|
|
34
|
+
|
|
35
|
+
4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
|
|
36
|
+
|
|
37
|
+
```yaml
|
|
38
|
+
skill: survey-context
|
|
39
|
+
run_date: "2026-06-22"
|
|
40
|
+
pass_at_k: 0.83
|
|
41
|
+
total_scenarios: 3
|
|
42
|
+
passed: 2
|
|
43
|
+
failed: 1
|
|
44
|
+
scenarios:
|
|
45
|
+
- id: s01
|
|
46
|
+
name: "detects active epic from state.yaml"
|
|
47
|
+
result: PASS
|
|
48
|
+
weight: 1.0
|
|
49
|
+
- id: s02
|
|
50
|
+
name: "reads release-plan.yaml and reports next epic"
|
|
51
|
+
result: PASS
|
|
52
|
+
weight: 1.0
|
|
53
|
+
- id: s03
|
|
54
|
+
name: "handles missing state.yaml gracefully"
|
|
55
|
+
result: FAIL
|
|
56
|
+
weight: 0.5
|
|
57
|
+
failure_note: "crashed instead of suggesting state.yaml creation"
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
|
|
61
|
+
|
|
62
|
+
6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
|
|
63
|
+
- `IMPROVED: 0.67 → 0.83`
|
|
64
|
+
- `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
|
|
65
|
+
- `STABLE: 0.83 = 0.83`
|
|
66
|
+
|
|
67
|
+
## Verify
|
|
68
|
+
|
|
69
|
+
→ verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
|
|
@@ -36,6 +36,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
|
|
|
36
36
|
|
|
37
37
|
2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
|
|
38
38
|
|
|
39
|
+
2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
|
|
40
|
+
```bash
|
|
41
|
+
test -f specs/planning-context.yaml && python3 -c "
|
|
42
|
+
import yaml, datetime
|
|
43
|
+
d = yaml.safe_load(open('specs/planning-context.yaml'))
|
|
44
|
+
written = d.get('written_at','')
|
|
45
|
+
if written:
|
|
46
|
+
age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
|
|
47
|
+
print(f'Context age: {age:.1f}h')
|
|
48
|
+
" 2>/dev/null || echo "No context or no written_at"
|
|
49
|
+
```
|
|
50
|
+
- If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
|
|
51
|
+
- If context is **≥ 24h old** or absent, run elaborate-spec normally.
|
|
52
|
+
- On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
|
|
53
|
+
|
|
39
54
|
3. **Invoke the matching skill** — Run the skill that matches the workflow key:
|
|
40
55
|
- `survey-context` — where are we?
|
|
41
56
|
- `scope-work` — what's in and out?
|
|
@@ -52,6 +67,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
|
|
|
52
67
|
|
|
53
68
|
In `specs/planning-status.yaml`:
|
|
54
69
|
```yaml
|
|
70
|
+
context_capsule: # written by elaborate-spec; cleared on cycle completion
|
|
71
|
+
written_at: "2026-06-22T03:00:00Z"
|
|
72
|
+
written_by: elaborate-spec
|
|
73
|
+
feature_name: "add dark mode"
|
|
55
74
|
workflows:
|
|
56
75
|
survey-context:
|
|
57
76
|
required: true
|
|
@@ -17,6 +17,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
|
|
|
17
17
|
|
|
18
18
|
## Process
|
|
19
19
|
|
|
20
|
+
0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
|
|
21
|
+
```bash
|
|
22
|
+
test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
|
|
23
|
+
```
|
|
24
|
+
Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
|
|
25
|
+
|
|
20
26
|
1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
|
|
21
27
|
|
|
22
28
|
2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
|
|
@@ -17,6 +17,12 @@ Produce **epic capsule story tasks** in `specs/epics/eNN-slug/` — vertical sli
|
|
|
17
17
|
|
|
18
18
|
## Process
|
|
19
19
|
|
|
20
|
+
0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it first:
|
|
21
|
+
```bash
|
|
22
|
+
test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
|
|
23
|
+
```
|
|
24
|
+
Use `feature_name`, `constraints`, and `out_of_scope` to inform slice boundaries. `key_decisions` in the file may constrain how stories are cut (e.g., "no external deps" constrains slice 2). If absent, proceed normally.
|
|
25
|
+
|
|
20
26
|
1. **Read context** — Read `specs/product/SCOPE_LATEST.yaml` and/or `specs/release-plan.yaml`. Understand what the epic delivers end-to-end.
|
|
21
27
|
|
|
22
28
|
2. **Cut tracer-bullet slices** — Identify the thinnest possible vertical path through the stack that delivers user value. Start with this slice; it will catch integration issues first. For example:
|
|
@@ -14,7 +14,8 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
|
|
|
14
14
|
| Mode | Scope |
|
|
15
15
|
|------|-------|
|
|
16
16
|
| **Quick Scan** | Skills changed since last tag or in current diff |
|
|
17
|
-
| **Full** | All
|
|
17
|
+
| **Full** | All skills per SKILL-INDEX.md + catalog audit |
|
|
18
|
+
| **--verify** | Run `bash scripts/run-skill-verify.sh` and append health results to the stocktake report |
|
|
18
19
|
|
|
19
20
|
## Process
|
|
20
21
|
|
|
@@ -26,6 +27,7 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
|
|
|
26
27
|
- Skills with zero calls (potential dead weight)
|
|
27
28
|
- Skills with high average time (candidates for `evolve-skill`)
|
|
28
29
|
5. Critical findings → `plan-work` story; cosmetic → `evolve-skill` candidate.
|
|
30
|
+
6. **--verify mode:** Run `bash scripts/run-skill-verify.sh` and append a `## Verify Health` section to the stocktake report: `"N/68 PASS, M FAIL, K SKIP"`. FAIL skills are critical findings and go straight to `plan-work`.
|
|
29
31
|
|
|
30
32
|
### Skill timing data (`metrics.skill_timings`)
|
|
31
33
|
|
|
@@ -73,7 +73,26 @@ Summarize your understanding in 3–5 bullet points aligned with [countable-stor
|
|
|
73
73
|
|
|
74
74
|
Ask: "Is this an accurate summary? Anything missing or wrong?"
|
|
75
75
|
|
|
76
|
-
### 5.
|
|
76
|
+
### 5. Write specs/planning-context.yaml
|
|
77
|
+
|
|
78
|
+
After the user confirms the summary in step 4, persist the key decisions:
|
|
79
|
+
|
|
80
|
+
```yaml
|
|
81
|
+
# specs/planning-context.yaml — written by elaborate-spec; consumed by scope-work and slice-tasks
|
|
82
|
+
feature_name: "<from step 1>"
|
|
83
|
+
problem_statement: "<one paragraph>"
|
|
84
|
+
constraints:
|
|
85
|
+
- "<constraint 1>"
|
|
86
|
+
out_of_scope:
|
|
87
|
+
- "<excluded item 1>"
|
|
88
|
+
key_decisions:
|
|
89
|
+
- decision: "<what was decided>"
|
|
90
|
+
rationale: "<why>"
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
If `specs/planning-context.yaml` already exists, ask: `"Planning context from a prior session exists. Update it? [Y/n]"`. Overwrite on Y; leave unchanged on N.
|
|
94
|
+
|
|
95
|
+
### 6. Suggest next skill
|
|
77
96
|
|
|
78
97
|
Once the spec is clear, recommend the next step:
|
|
79
98
|
- If domain model needs work → `model-domain`
|
|
@@ -11,16 +11,23 @@ model: opus
|
|
|
11
11
|
|
|
12
12
|
## Loop
|
|
13
13
|
|
|
14
|
-
1. Run `
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
14
|
+
1. **Establish baseline** — Run `run-benchmark <skill> --baseline`. If no definition exists at `specs/benchmarks/<skill>.yaml`, create one following `specs/benchmarks/SCHEMA.md` first. Save report path in `state.yaml`. If `specs/benchmarks/reports/BASELINE-<skill>.yaml` already exists, skip this step.
|
|
15
|
+
|
|
16
|
+
2. **Identify gap** — Read the baseline report (`specs/benchmarks/reports/BASELINE-<skill>.yaml`). Find scenarios with `result: FAIL` or low `pass_at_k`. This is the measurable gap.
|
|
17
|
+
|
|
18
|
+
3. **`plan-work`** — Write a minimal change proposal targeting the failing scenarios. Include verify commands.
|
|
19
|
+
|
|
20
|
+
4. **Edit** via `craft-skill` / direct SKILL.md edit; run `bash scripts/sync-skills.sh`.
|
|
21
|
+
|
|
22
|
+
5. **Re-run benchmark** — `run-benchmark <skill>`. Compare new `pass_at_k` against baseline.
|
|
23
|
+
- **IMPROVED or STABLE** → advance to step 6.
|
|
24
|
+
- **REGRESSION** (`new pass_at_k < baseline`) → revert the change and loop back to step 3.
|
|
25
|
+
|
|
26
|
+
6. **Record decision** — Write `specs/adr/NNNN-evolve-<skill>.md` with before/after `pass_at_k` scores. Update `session-state`.
|
|
20
27
|
|
|
21
28
|
## Verify
|
|
22
29
|
|
|
23
|
-
→ verify:
|
|
30
|
+
→ verify: `grep -c 'run-benchmark\|pass_at_k\|BASELINE-' evolve-skill/SKILL.md | awk '{if($1>=2) print "OK"; else print "FAIL"}'`
|
|
24
31
|
|
|
25
32
|
See [REFERENCE.md](REFERENCE.md) for ADR template.
|
|
26
33
|
|
|
@@ -8,6 +8,8 @@ model: sonnet
|
|
|
8
8
|
|
|
9
9
|
# Quick Fix
|
|
10
10
|
|
|
11
|
+
> **HARD GATE** — ALL entry criteria must pass before invoking quick-fix. If any guardrail triggers during execution, abort immediately and fall back to `investigate-bug`. Do NOT use quick-fix for logic changes, multi-file edits, or diffs > 5 lines.
|
|
12
|
+
|
|
11
13
|
Fast-track for trivial data-only fixes that do not require the full bug-fix chain.
|
|
12
14
|
|
|
13
15
|
When a bug fix is purely data — an add-missing-key, a typo correction, a config value update — the standard 6-skill chain (investigate-bug → diagnose-root → develop-tdd → kickoff-branch → verify-work → release-branch) is wasteful overhead. Quick-fix collapses it to 2 skills: **quick-fix** then **release-branch**.
|
|
@@ -0,0 +1,71 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: run-benchmark
|
|
3
|
+
description: "Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions."
|
|
4
|
+
model: haiku
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
|
|
8
|
+
# Run Benchmark
|
|
9
|
+
|
|
10
|
+
> **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
|
|
11
|
+
|
|
12
|
+
Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
|
|
13
|
+
|
|
14
|
+
## Usage
|
|
15
|
+
|
|
16
|
+
```bash
|
|
17
|
+
# Benchmark a single skill
|
|
18
|
+
run-benchmark <skill-name>
|
|
19
|
+
|
|
20
|
+
# Benchmark all skills with definitions
|
|
21
|
+
run-benchmark --all
|
|
22
|
+
|
|
23
|
+
# Pin current results as baseline
|
|
24
|
+
run-benchmark <skill-name> --baseline
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
## Process
|
|
28
|
+
|
|
29
|
+
1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
|
|
30
|
+
|
|
31
|
+
2. **Run each scenario** — For each scenario in `scenarios[]`:
|
|
32
|
+
- **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
|
|
33
|
+
- **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
|
|
34
|
+
|
|
35
|
+
3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
|
|
36
|
+
|
|
37
|
+
4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
|
|
38
|
+
|
|
39
|
+
```yaml
|
|
40
|
+
skill: survey-context
|
|
41
|
+
run_date: "2026-06-22"
|
|
42
|
+
pass_at_k: 0.83
|
|
43
|
+
total_scenarios: 3
|
|
44
|
+
passed: 2
|
|
45
|
+
failed: 1
|
|
46
|
+
scenarios:
|
|
47
|
+
- id: s01
|
|
48
|
+
name: "detects active epic from state.yaml"
|
|
49
|
+
result: PASS
|
|
50
|
+
weight: 1.0
|
|
51
|
+
- id: s02
|
|
52
|
+
name: "reads release-plan.yaml and reports next epic"
|
|
53
|
+
result: PASS
|
|
54
|
+
weight: 1.0
|
|
55
|
+
- id: s03
|
|
56
|
+
name: "handles missing state.yaml gracefully"
|
|
57
|
+
result: FAIL
|
|
58
|
+
weight: 0.5
|
|
59
|
+
failure_note: "crashed instead of suggesting state.yaml creation"
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
|
|
63
|
+
|
|
64
|
+
6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
|
|
65
|
+
- `IMPROVED: 0.67 → 0.83`
|
|
66
|
+
- `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
|
|
67
|
+
- `STABLE: 0.83 = 0.83`
|
|
68
|
+
|
|
69
|
+
## Verify
|
|
70
|
+
|
|
71
|
+
→ verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
|
|
@@ -38,6 +38,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
|
|
|
38
38
|
|
|
39
39
|
2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
|
|
40
40
|
|
|
41
|
+
2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
|
|
42
|
+
```bash
|
|
43
|
+
test -f specs/planning-context.yaml && python3 -c "
|
|
44
|
+
import yaml, datetime
|
|
45
|
+
d = yaml.safe_load(open('specs/planning-context.yaml'))
|
|
46
|
+
written = d.get('written_at','')
|
|
47
|
+
if written:
|
|
48
|
+
age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
|
|
49
|
+
print(f'Context age: {age:.1f}h')
|
|
50
|
+
" 2>/dev/null || echo "No context or no written_at"
|
|
51
|
+
```
|
|
52
|
+
- If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
|
|
53
|
+
- If context is **≥ 24h old** or absent, run elaborate-spec normally.
|
|
54
|
+
- On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
|
|
55
|
+
|
|
41
56
|
3. **Invoke the matching skill** — Run the skill that matches the workflow key:
|
|
42
57
|
- `survey-context` — where are we?
|
|
43
58
|
- `scope-work` — what's in and out?
|
|
@@ -54,6 +69,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
|
|
|
54
69
|
|
|
55
70
|
In `specs/planning-status.yaml`:
|
|
56
71
|
```yaml
|
|
72
|
+
context_capsule: # written by elaborate-spec; cleared on cycle completion
|
|
73
|
+
written_at: "2026-06-22T03:00:00Z"
|
|
74
|
+
written_by: elaborate-spec
|
|
75
|
+
feature_name: "add dark mode"
|
|
57
76
|
workflows:
|
|
58
77
|
survey-context:
|
|
59
78
|
required: true
|
|
@@ -19,6 +19,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
|
|
|
19
19
|
|
|
20
20
|
## Process
|
|
21
21
|
|
|
22
|
+
0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
|
|
23
|
+
```bash
|
|
24
|
+
test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
|
|
25
|
+
```
|
|
26
|
+
Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
|
|
27
|
+
|
|
22
28
|
1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
|
|
23
29
|
|
|
24
30
|
2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
|
|
@@ -19,6 +19,12 @@ Produce **epic capsule story tasks** in `specs/epics/eNN-slug/` — vertical sli
|
|
|
19
19
|
|
|
20
20
|
## Process
|
|
21
21
|
|
|
22
|
+
0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it first:
|
|
23
|
+
```bash
|
|
24
|
+
test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
|
|
25
|
+
```
|
|
26
|
+
Use `feature_name`, `constraints`, and `out_of_scope` to inform slice boundaries. `key_decisions` in the file may constrain how stories are cut (e.g., "no external deps" constrains slice 2). If absent, proceed normally.
|
|
27
|
+
|
|
22
28
|
1. **Read context** — Read `specs/product/SCOPE_LATEST.yaml` and/or `specs/release-plan.yaml`. Understand what the epic delivers end-to-end.
|
|
23
29
|
|
|
24
30
|
2. **Cut tracer-bullet slices** — Identify the thinnest possible vertical path through the stack that delivers user value. Start with this slice; it will catch integration issues first. For example:
|
|
@@ -16,7 +16,8 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
|
|
|
16
16
|
| Mode | Scope |
|
|
17
17
|
|------|-------|
|
|
18
18
|
| **Quick Scan** | Skills changed since last tag or in current diff |
|
|
19
|
-
| **Full** | All
|
|
19
|
+
| **Full** | All skills per SKILL-INDEX.md + catalog audit |
|
|
20
|
+
| **--verify** | Run `bash scripts/run-skill-verify.sh` and append health results to the stocktake report |
|
|
20
21
|
|
|
21
22
|
## Process
|
|
22
23
|
|
|
@@ -28,6 +29,7 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
|
|
|
28
29
|
- Skills with zero calls (potential dead weight)
|
|
29
30
|
- Skills with high average time (candidates for `evolve-skill`)
|
|
30
31
|
5. Critical findings → `plan-work` story; cosmetic → `evolve-skill` candidate.
|
|
32
|
+
6. **--verify mode:** Run `bash scripts/run-skill-verify.sh` and append a `## Verify Health` section to the stocktake report: `"N/68 PASS, M FAIL, K SKIP"`. FAIL skills are critical findings and go straight to `plan-work`.
|
|
31
33
|
|
|
32
34
|
### Skill timing data (`metrics.skill_timings`)
|
|
33
35
|
|
package/CHANGELOG.md
CHANGED
|
@@ -1,3 +1,16 @@
|
|
|
1
|
+
# [2.27.0](https://github.com/danielvm-git/bigpowers/compare/v2.26.0...v2.27.0) (2026-06-22)
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
### Bug Fixes
|
|
5
|
+
|
|
6
|
+
* **quick-fix:** add HARD GATE callout — completes 35/35 skills ([9782b85](https://github.com/danielvm-git/bigpowers/commit/9782b85de96fabc77c14fe8f7be3fd30b98e9cfb))
|
|
7
|
+
|
|
8
|
+
|
|
9
|
+
### Features
|
|
10
|
+
|
|
11
|
+
* **2.9.0:** implement e22/e23/e24 — Depth release ([c33d66c](https://github.com/danielvm-git/bigpowers/commit/c33d66c6a386a6c273ab1f4c2a71981309e1cda6))
|
|
12
|
+
* **plan-release:** scope 2.9.0 Depth — e22/e23/e24 ([5a64698](https://github.com/danielvm-git/bigpowers/commit/5a646981c520f4874c4f2807d53af40e17b093fb))
|
|
13
|
+
|
|
1
14
|
# [2.26.0](https://github.com/danielvm-git/bigpowers/compare/v2.25.0...v2.26.0) (2026-06-22)
|
|
2
15
|
|
|
3
16
|
|
package/SKILL-INDEX.md
CHANGED
|
@@ -3,8 +3,8 @@
|
|
|
3
3
|
> **DO NOT EDIT** — This file is auto-generated by `scripts/generate-skill-index.sh`.
|
|
4
4
|
> Edit `SKILL.md` source files or `skills-lock.json` instead. Run `bash scripts/sync-skills.sh` to regenerate.
|
|
5
5
|
|
|
6
|
-
**Generated:** 2026-06-
|
|
7
|
-
**Skills:**
|
|
6
|
+
**Generated:** 2026-06-22T11:54:00Z
|
|
7
|
+
**Skills:** 69
|
|
8
8
|
|
|
9
9
|
---
|
|
10
10
|
|
package/elaborate-spec/SKILL.md
CHANGED
|
@@ -72,7 +72,26 @@ Summarize your understanding in 3–5 bullet points aligned with [countable-stor
|
|
|
72
72
|
|
|
73
73
|
Ask: "Is this an accurate summary? Anything missing or wrong?"
|
|
74
74
|
|
|
75
|
-
### 5.
|
|
75
|
+
### 5. Write specs/planning-context.yaml
|
|
76
|
+
|
|
77
|
+
After the user confirms the summary in step 4, persist the key decisions:
|
|
78
|
+
|
|
79
|
+
```yaml
|
|
80
|
+
# specs/planning-context.yaml — written by elaborate-spec; consumed by scope-work and slice-tasks
|
|
81
|
+
feature_name: "<from step 1>"
|
|
82
|
+
problem_statement: "<one paragraph>"
|
|
83
|
+
constraints:
|
|
84
|
+
- "<constraint 1>"
|
|
85
|
+
out_of_scope:
|
|
86
|
+
- "<excluded item 1>"
|
|
87
|
+
key_decisions:
|
|
88
|
+
- decision: "<what was decided>"
|
|
89
|
+
rationale: "<why>"
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
If `specs/planning-context.yaml` already exists, ask: `"Planning context from a prior session exists. Update it? [Y/n]"`. Overwrite on Y; leave unchanged on N.
|
|
93
|
+
|
|
94
|
+
### 6. Suggest next skill
|
|
76
95
|
|
|
77
96
|
Once the spec is clear, recommend the next step:
|
|
78
97
|
- If domain model needs work → `model-domain`
|
package/evolve-skill/SKILL.md
CHANGED
|
@@ -10,15 +10,22 @@ model: opus
|
|
|
10
10
|
|
|
11
11
|
## Loop
|
|
12
12
|
|
|
13
|
-
1. Run `
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
13
|
+
1. **Establish baseline** — Run `run-benchmark <skill> --baseline`. If no definition exists at `specs/benchmarks/<skill>.yaml`, create one following `specs/benchmarks/SCHEMA.md` first. Save report path in `state.yaml`. If `specs/benchmarks/reports/BASELINE-<skill>.yaml` already exists, skip this step.
|
|
14
|
+
|
|
15
|
+
2. **Identify gap** — Read the baseline report (`specs/benchmarks/reports/BASELINE-<skill>.yaml`). Find scenarios with `result: FAIL` or low `pass_at_k`. This is the measurable gap.
|
|
16
|
+
|
|
17
|
+
3. **`plan-work`** — Write a minimal change proposal targeting the failing scenarios. Include verify commands.
|
|
18
|
+
|
|
19
|
+
4. **Edit** via `craft-skill` / direct SKILL.md edit; run `bash scripts/sync-skills.sh`.
|
|
20
|
+
|
|
21
|
+
5. **Re-run benchmark** — `run-benchmark <skill>`. Compare new `pass_at_k` against baseline.
|
|
22
|
+
- **IMPROVED or STABLE** → advance to step 6.
|
|
23
|
+
- **REGRESSION** (`new pass_at_k < baseline`) → revert the change and loop back to step 3.
|
|
24
|
+
|
|
25
|
+
6. **Record decision** — Write `specs/adr/NNNN-evolve-<skill>.md` with before/after `pass_at_k` scores. Update `session-state`.
|
|
19
26
|
|
|
20
27
|
## Verify
|
|
21
28
|
|
|
22
|
-
→ verify:
|
|
29
|
+
→ verify: `grep -c 'run-benchmark\|pass_at_k\|BASELINE-' evolve-skill/SKILL.md | awk '{if($1>=2) print "OK"; else print "FAIL"}'`
|
|
23
30
|
|
|
24
31
|
See [REFERENCE.md](REFERENCE.md) for ADR template.
|
package/package.json
CHANGED
package/quick-fix/SKILL.md
CHANGED
|
@@ -7,6 +7,8 @@ model: sonnet
|
|
|
7
7
|
|
|
8
8
|
# Quick Fix
|
|
9
9
|
|
|
10
|
+
> **HARD GATE** — ALL entry criteria must pass before invoking quick-fix. If any guardrail triggers during execution, abort immediately and fall back to `investigate-bug`. Do NOT use quick-fix for logic changes, multi-file edits, or diffs > 5 lines.
|
|
11
|
+
|
|
10
12
|
Fast-track for trivial data-only fixes that do not require the full bug-fix chain.
|
|
11
13
|
|
|
12
14
|
When a bug fix is purely data — an add-missing-key, a typo correction, a config value update — the standard 6-skill chain (investigate-bug → diagnose-root → develop-tdd → kickoff-branch → verify-work → release-branch) is wasteful overhead. Quick-fix collapses it to 2 skills: **quick-fix** then **release-branch**.
|
|
@@ -0,0 +1,70 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: run-benchmark
|
|
3
|
+
model: haiku
|
|
4
|
+
description: Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions.
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Run Benchmark
|
|
8
|
+
|
|
9
|
+
> **HARD GATE** — Do NOT use benchmark scores to declare a skill "good" or "bad" in isolation. Benchmarks measure relative quality vs. a baseline — they catch regressions, they do not certify correctness.
|
|
10
|
+
|
|
11
|
+
Reads benchmark definitions from `specs/benchmarks/`, executes each scenario's grader, and writes a structured `pass@k` report that `evolve-skill` consumes.
|
|
12
|
+
|
|
13
|
+
## Usage
|
|
14
|
+
|
|
15
|
+
```bash
|
|
16
|
+
# Benchmark a single skill
|
|
17
|
+
run-benchmark <skill-name>
|
|
18
|
+
|
|
19
|
+
# Benchmark all skills with definitions
|
|
20
|
+
run-benchmark --all
|
|
21
|
+
|
|
22
|
+
# Pin current results as baseline
|
|
23
|
+
run-benchmark <skill-name> --baseline
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
## Process
|
|
27
|
+
|
|
28
|
+
1. **Locate definition** — Read `specs/benchmarks/<skill>.yaml`. If absent, report: `"No benchmark definition found for <skill>. Create specs/benchmarks/<skill>.yaml first."` and stop.
|
|
29
|
+
|
|
30
|
+
2. **Run each scenario** — For each scenario in `scenarios[]`:
|
|
31
|
+
- **Code grader:** Run `grader.command` in repo root via `bash -c`. Exit 0 → PASS. Non-zero → FAIL. Timeout: 15 seconds.
|
|
32
|
+
- **Rubric grader:** Present each criterion to the agent as a yes/no question about the scenario output. ≥ 80% yes → PASS, else FAIL.
|
|
33
|
+
|
|
34
|
+
3. **Calculate pass@k** — `pass@k = sum(weight of PASS scenarios) / sum(all weights)`. Round to 2 decimal places.
|
|
35
|
+
|
|
36
|
+
4. **Write report** to `specs/benchmarks/reports/BENCHMARK-<skill>-<YYYY-MM-DD>.yaml`:
|
|
37
|
+
|
|
38
|
+
```yaml
|
|
39
|
+
skill: survey-context
|
|
40
|
+
run_date: "2026-06-22"
|
|
41
|
+
pass_at_k: 0.83
|
|
42
|
+
total_scenarios: 3
|
|
43
|
+
passed: 2
|
|
44
|
+
failed: 1
|
|
45
|
+
scenarios:
|
|
46
|
+
- id: s01
|
|
47
|
+
name: "detects active epic from state.yaml"
|
|
48
|
+
result: PASS
|
|
49
|
+
weight: 1.0
|
|
50
|
+
- id: s02
|
|
51
|
+
name: "reads release-plan.yaml and reports next epic"
|
|
52
|
+
result: PASS
|
|
53
|
+
weight: 1.0
|
|
54
|
+
- id: s03
|
|
55
|
+
name: "handles missing state.yaml gracefully"
|
|
56
|
+
result: FAIL
|
|
57
|
+
weight: 0.5
|
|
58
|
+
failure_note: "crashed instead of suggesting state.yaml creation"
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
5. **Baseline mode** (`--baseline`) — Copy the report to `specs/benchmarks/reports/BASELINE-<skill>.yaml`. This is the reference point for regression checks in `evolve-skill`.
|
|
62
|
+
|
|
63
|
+
6. **Compare to baseline** — If a `BASELINE-<skill>.yaml` exists, compare `pass_at_k`. Report:
|
|
64
|
+
- `IMPROVED: 0.67 → 0.83`
|
|
65
|
+
- `REGRESSION: 0.83 → 0.67 — do NOT ship this change`
|
|
66
|
+
- `STABLE: 0.83 = 0.83`
|
|
67
|
+
|
|
68
|
+
## Verify
|
|
69
|
+
|
|
70
|
+
→ verify: `test -f run-benchmark/SKILL.md && grep -q 'pass_at_k\|pass.at.k' run-benchmark/SKILL.md && echo OK || echo FAIL`
|
package/run-planning/SKILL.md
CHANGED
|
@@ -37,6 +37,21 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
|
|
|
37
37
|
|
|
38
38
|
2. **Find next step** — Find the first workflow key with `status: pending`. If the key is `optional`, check if the user wants to run it. If not, mark it `skipped`.
|
|
39
39
|
|
|
40
|
+
2a. **Context capsule check** — Before invoking `elaborate-spec`, check whether a fresh `specs/planning-context.yaml` exists:
|
|
41
|
+
```bash
|
|
42
|
+
test -f specs/planning-context.yaml && python3 -c "
|
|
43
|
+
import yaml, datetime
|
|
44
|
+
d = yaml.safe_load(open('specs/planning-context.yaml'))
|
|
45
|
+
written = d.get('written_at','')
|
|
46
|
+
if written:
|
|
47
|
+
age = (datetime.datetime.now(datetime.timezone.utc) - datetime.datetime.fromisoformat(written)).total_seconds() / 3600
|
|
48
|
+
print(f'Context age: {age:.1f}h')
|
|
49
|
+
" 2>/dev/null || echo "No context or no written_at"
|
|
50
|
+
```
|
|
51
|
+
- If context is **< 24h old**, ask: `"Planning context from Xh ago exists for '<feature_name>'. Re-run elaborate-spec? [y/N]"`. Skip elaborate-spec on N.
|
|
52
|
+
- If context is **≥ 24h old** or absent, run elaborate-spec normally.
|
|
53
|
+
- On planning cycle completion (all required keys done), clear the capsule: delete `specs/planning-context.yaml` and set `planning-status.yaml` `context_capsule: null`.
|
|
54
|
+
|
|
40
55
|
3. **Invoke the matching skill** — Run the skill that matches the workflow key:
|
|
41
56
|
- `survey-context` — where are we?
|
|
42
57
|
- `scope-work` — what's in and out?
|
|
@@ -53,6 +68,10 @@ Each key maps to a skill invocation. Optional keys can be skipped; required keys
|
|
|
53
68
|
|
|
54
69
|
In `specs/planning-status.yaml`:
|
|
55
70
|
```yaml
|
|
71
|
+
context_capsule: # written by elaborate-spec; cleared on cycle completion
|
|
72
|
+
written_at: "2026-06-22T03:00:00Z"
|
|
73
|
+
written_by: elaborate-spec
|
|
74
|
+
feature_name: "add dark mode"
|
|
56
75
|
workflows:
|
|
57
76
|
survey-context:
|
|
58
77
|
required: true
|
package/scope-work/SKILL.md
CHANGED
|
@@ -18,6 +18,12 @@ Turn the current conversation into a bounded PRD at `specs/product/SCOPE_LATEST.
|
|
|
18
18
|
|
|
19
19
|
## Process
|
|
20
20
|
|
|
21
|
+
0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it before doing anything else:
|
|
22
|
+
```bash
|
|
23
|
+
test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
|
|
24
|
+
```
|
|
25
|
+
Pre-populate `feature_name`, `constraints`, and `out_of_scope` from the file. Skip re-asking questions already answered by elaborate-spec. If the file is absent, proceed normally.
|
|
26
|
+
|
|
21
27
|
1. **Gather context** — Read existing `specs/` artifacts (`release-plan.yaml`, `plans/TECH_STACK_LATEST.md`, `requirements/VISION_LATEST.yaml` if any). Understand what the project is building and why.
|
|
22
28
|
|
|
23
29
|
2. **Interview (if needed)** — Clarify: What is the goal? Who are the users? What is definitely in scope? What is explicitly out of scope? What constraints exist (time, budget, tech)? How will success be measured?
|
|
@@ -0,0 +1,57 @@
|
|
|
1
|
+
#!/usr/bin/env bash
|
|
2
|
+
# Run all SKILL.md → verify: commands and report PASS/FAIL/SKIP.
|
|
3
|
+
# Exit 0 only when zero FAILs.
|
|
4
|
+
# Usage: bash scripts/run-skill-verify.sh [skill-name]
|
|
5
|
+
# No args: runs all skills
|
|
6
|
+
# With arg: runs only the named skill
|
|
7
|
+
|
|
8
|
+
set -uo pipefail
|
|
9
|
+
|
|
10
|
+
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
|
|
11
|
+
cd "$REPO_ROOT"
|
|
12
|
+
|
|
13
|
+
PASS=0; FAIL=0; SKIP=0
|
|
14
|
+
TARGET="${1:-}"
|
|
15
|
+
|
|
16
|
+
run_skill() {
|
|
17
|
+
local skill_md="$1"
|
|
18
|
+
local skill
|
|
19
|
+
skill=$(dirname "$skill_md")
|
|
20
|
+
|
|
21
|
+
local cmd
|
|
22
|
+
cmd=$(grep '→ verify:' "$skill_md" 2>/dev/null | head -1 | sed 's/.*→ verify: *//')
|
|
23
|
+
|
|
24
|
+
if [ -z "$cmd" ]; then
|
|
25
|
+
echo "SKIP: $skill"
|
|
26
|
+
SKIP=$((SKIP + 1))
|
|
27
|
+
return
|
|
28
|
+
fi
|
|
29
|
+
|
|
30
|
+
local output
|
|
31
|
+
if output=$(timeout 10 bash -c "$cmd" 2>&1); then
|
|
32
|
+
echo "PASS: $skill"
|
|
33
|
+
PASS=$((PASS + 1))
|
|
34
|
+
else
|
|
35
|
+
echo "FAIL: $skill — $cmd"
|
|
36
|
+
echo " output: $(echo "$output" | head -1)"
|
|
37
|
+
FAIL=$((FAIL + 1))
|
|
38
|
+
fi
|
|
39
|
+
}
|
|
40
|
+
|
|
41
|
+
if [ -n "$TARGET" ]; then
|
|
42
|
+
if [ -f "$TARGET/SKILL.md" ]; then
|
|
43
|
+
run_skill "$TARGET/SKILL.md"
|
|
44
|
+
else
|
|
45
|
+
echo "ERROR: $TARGET/SKILL.md not found"
|
|
46
|
+
exit 1
|
|
47
|
+
fi
|
|
48
|
+
else
|
|
49
|
+
for skill_md in */SKILL.md; do
|
|
50
|
+
run_skill "$skill_md"
|
|
51
|
+
done
|
|
52
|
+
fi
|
|
53
|
+
|
|
54
|
+
echo ""
|
|
55
|
+
echo "Results: $PASS PASS, $FAIL FAIL, $SKIP SKIP"
|
|
56
|
+
|
|
57
|
+
[ "$FAIL" -eq 0 ]
|
package/skills-lock.json
CHANGED
|
@@ -93,7 +93,7 @@
|
|
|
93
93
|
},
|
|
94
94
|
"elaborate-spec": {
|
|
95
95
|
"description": "Refine a rough idea into a clear, detailed specification through dialogue. Does not produce code. Use when user has a vague idea, wants to think through a feature before planning, or needs to turn \"I want X\" into a concrete spec.",
|
|
96
|
-
"sha256": "
|
|
96
|
+
"sha256": "9edc17057449444d",
|
|
97
97
|
"path": "elaborate-spec/SKILL.md"
|
|
98
98
|
},
|
|
99
99
|
"enforce-first": {
|
|
@@ -103,7 +103,7 @@
|
|
|
103
103
|
},
|
|
104
104
|
"evolve-skill": {
|
|
105
105
|
"description": "Benchmark-gated skill evolution — consume bigpowers-benchmark report, propose plan-work change, edit skill via craft-skill, re-run benchmark, record ADR. Use when a skill underperforms on benchmark or stocktake finds systemic gap.",
|
|
106
|
-
"sha256": "
|
|
106
|
+
"sha256": "e2d127c4ae0b5af7",
|
|
107
107
|
"path": "evolve-skill/SKILL.md"
|
|
108
108
|
},
|
|
109
109
|
"execute-plan": {
|
|
@@ -198,7 +198,7 @@
|
|
|
198
198
|
},
|
|
199
199
|
"quick-fix": {
|
|
200
200
|
"description": "\"Streamlined fast-path for trivial data-only fixes — no TDD, no branching ceremony. Collapses 6 skills into 2 for changes that are purely data with no logic risk. Aborts with fallback to investigate-bug if guardrails trigger.\"",
|
|
201
|
-
"sha256": "
|
|
201
|
+
"sha256": "2b9f1cd6557b6256",
|
|
202
202
|
"path": "quick-fix/SKILL.md"
|
|
203
203
|
},
|
|
204
204
|
"release-branch": {
|
|
@@ -226,6 +226,11 @@
|
|
|
226
226
|
"sha256": "c16bbe4854a0d665",
|
|
227
227
|
"path": "respond-review/SKILL.md"
|
|
228
228
|
},
|
|
229
|
+
"run-benchmark": {
|
|
230
|
+
"description": "Run skill quality benchmarks from specs/benchmarks/ definitions and write pass@k reports. Use before and after evolve-skill to prove quality changes are improvements, not regressions.",
|
|
231
|
+
"sha256": "e27f4e682e505b19",
|
|
232
|
+
"path": "run-benchmark/SKILL.md"
|
|
233
|
+
},
|
|
229
234
|
"run-evals": {
|
|
230
235
|
"description": "Eval-Driven Development — define capability and regression evals before building; code graders use verify commands, model graders use explicit rubrics; log pass@k. Use before develop-tdd on new features, or when measuring agent capability over runs.",
|
|
231
236
|
"sha256": "b3cd89a7e440c94f",
|
|
@@ -233,12 +238,12 @@
|
|
|
233
238
|
},
|
|
234
239
|
"run-planning": {
|
|
235
240
|
"description": "\"DISCOVER-PHASE ADVANCER — Drive the discover-phase checklist (specs/planning-status.yaml) through survey-context → scope-work → research-first → elaborate-spec → plan-release → slice-tasks. NOT a duplicate of plan-work or the planning spine; it orchestrates the pre-coding discover phase only.\"",
|
|
236
|
-
"sha256": "
|
|
241
|
+
"sha256": "a2e7c028e7f817de",
|
|
237
242
|
"path": "run-planning/SKILL.md"
|
|
238
243
|
},
|
|
239
244
|
"scope-work": {
|
|
240
245
|
"description": "\"PLANNING SPINE STEP 1 of 3 — Scope the work: define what is in and out of scope and save as specs/product/SCOPE_LATEST.yaml. Use before slice-tasks or plan-release on any new initiative. Not a substitute for slice-tasks (step 2) or plan-work (step 3).\"",
|
|
241
|
-
"sha256": "
|
|
246
|
+
"sha256": "d3cb167d8a5296be",
|
|
242
247
|
"path": "scope-work/SKILL.md"
|
|
243
248
|
},
|
|
244
249
|
"search-skills": {
|
|
@@ -268,7 +273,7 @@
|
|
|
268
273
|
},
|
|
269
274
|
"slice-tasks": {
|
|
270
275
|
"description": "\"PLANNING SPINE STEP 2 of 3 — Slice the work: break a scoped PRD into vertical-slice stories in specs/epics/. Use after scope-work (step 1), before plan-work (step 3). Not a substitute for scope-work or plan-work.\"",
|
|
271
|
-
"sha256": "
|
|
276
|
+
"sha256": "7948164e218541ea",
|
|
272
277
|
"path": "slice-tasks/SKILL.md"
|
|
273
278
|
},
|
|
274
279
|
"smoke-test": {
|
|
@@ -283,7 +288,7 @@
|
|
|
283
288
|
},
|
|
284
289
|
"stocktake-skills": {
|
|
285
290
|
"description": "Sequential subagent batch audit of the bigpowers skill catalog — Quick Scan (changed only) or Full (all skills). Use during sustain phase, before a major release, or when catalog drift is suspected.",
|
|
286
|
-
"sha256": "
|
|
291
|
+
"sha256": "6e73b2d2cf0cfbe1",
|
|
287
292
|
"path": "stocktake-skills/SKILL.md"
|
|
288
293
|
},
|
|
289
294
|
"survey-context": {
|
package/slice-tasks/SKILL.md
CHANGED
|
@@ -18,6 +18,12 @@ Produce **epic capsule story tasks** in `specs/epics/eNN-slug/` — vertical sli
|
|
|
18
18
|
|
|
19
19
|
## Process
|
|
20
20
|
|
|
21
|
+
0. **Read planning-context.yaml** — If `specs/planning-context.yaml` exists, read it first:
|
|
22
|
+
```bash
|
|
23
|
+
test -f specs/planning-context.yaml && echo "Context found" || echo "No context — starting fresh"
|
|
24
|
+
```
|
|
25
|
+
Use `feature_name`, `constraints`, and `out_of_scope` to inform slice boundaries. `key_decisions` in the file may constrain how stories are cut (e.g., "no external deps" constrains slice 2). If absent, proceed normally.
|
|
26
|
+
|
|
21
27
|
1. **Read context** — Read `specs/product/SCOPE_LATEST.yaml` and/or `specs/release-plan.yaml`. Understand what the epic delivers end-to-end.
|
|
22
28
|
|
|
23
29
|
2. **Cut tracer-bullet slices** — Identify the thinnest possible vertical path through the stack that delivers user value. Start with this slice; it will catch integration issues first. For example:
|
|
@@ -15,7 +15,8 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
|
|
|
15
15
|
| Mode | Scope |
|
|
16
16
|
|------|-------|
|
|
17
17
|
| **Quick Scan** | Skills changed since last tag or in current diff |
|
|
18
|
-
| **Full** | All
|
|
18
|
+
| **Full** | All skills per SKILL-INDEX.md + catalog audit |
|
|
19
|
+
| **--verify** | Run `bash scripts/run-skill-verify.sh` and append health results to the stocktake report |
|
|
19
20
|
|
|
20
21
|
## Process
|
|
21
22
|
|
|
@@ -27,6 +28,7 @@ Audit SKILL.md catalog for drift, stale triggers, missing HARD GATEs, and INDEX
|
|
|
27
28
|
- Skills with zero calls (potential dead weight)
|
|
28
29
|
- Skills with high average time (candidates for `evolve-skill`)
|
|
29
30
|
5. Critical findings → `plan-work` story; cosmetic → `evolve-skill` candidate.
|
|
31
|
+
6. **--verify mode:** Run `bash scripts/run-skill-verify.sh` and append a `## Verify Health` section to the stocktake report: `"N/68 PASS, M FAIL, K SKIP"`. FAIL skills are critical findings and go straight to `plan-work`.
|
|
30
32
|
|
|
31
33
|
### Skill timing data (`metrics.skill_timings`)
|
|
32
34
|
|