@slowdini/slow-powers-opencode 0.2.0 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +37 -65
- package/bootstrap.md +1 -7
- package/opencode/plugins/slow-powers.js +1 -1
- package/package.json +14 -13
- package/skills/evaluating-skills/SKILL.md +91 -337
- package/skills/evaluating-skills/evals/baseline/BASELINE.md +23 -0
- package/skills/evaluating-skills/evals/baseline/NOTES.md +40 -0
- package/skills/evaluating-skills/evals/baseline/benchmark.json +54 -0
- package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__new_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__old_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__new_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__old_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__new_skill.json +32 -0
- package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__old_skill.json +32 -0
- package/skills/test-driven-development/evals/baseline/NOTES.md +2 -2
- package/skills/verifying-development-work/SKILL.md +17 -6
- package/skills/verifying-development-work/code-review.md +68 -0
- package/skills/verifying-development-work/comment-review.md +85 -0
- package/skills/verifying-development-work/evals/baseline/BASELINE.md +7 -6
- package/skills/verifying-development-work/evals/baseline/NOTES.md +83 -149
- package/skills/verifying-development-work/evals/baseline/benchmark.json +32 -31
- package/skills/verifying-development-work/evals/baseline/grading/comment-hygiene-at-handoff__new_skill.json +53 -0
- package/skills/verifying-development-work/evals/baseline/grading/comment-hygiene-at-handoff__old_skill.json +53 -0
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__new_skill.json +53 -0
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__old_skill.json +53 -0
- package/skills/verifying-development-work/evals/evals.json +34 -2
- package/skills/verifying-development-work/evals/fixtures/comment-hygiene-at-handoff/slugify.test.ts +14 -0
- package/skills/verifying-development-work/evals/fixtures/comment-hygiene-at-handoff/slugify.ts +25 -0
- package/skills/evaluating-skills/examples/verifying-development-work-evals.json +0 -30
- package/skills/evaluating-skills/harness-details/claude.md +0 -158
- package/skills/evaluating-skills/runner/README.md +0 -154
- package/skills/evaluating-skills/runner/adapters/claude-code-session.test.ts +0 -56
- package/skills/evaluating-skills/runner/adapters/claude-code-session.ts +0 -43
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +0 -263
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +0 -146
- package/skills/evaluating-skills/runner/aggregate.test.ts +0 -264
- package/skills/evaluating-skills/runner/aggregate.ts +0 -248
- package/skills/evaluating-skills/runner/context.test.ts +0 -181
- package/skills/evaluating-skills/runner/context.ts +0 -90
- package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +0 -103
- package/skills/evaluating-skills/runner/detect-stray-writes.ts +0 -192
- package/skills/evaluating-skills/runner/fill-transcripts.test.ts +0 -73
- package/skills/evaluating-skills/runner/fill-transcripts.ts +0 -154
- package/skills/evaluating-skills/runner/grade.test.ts +0 -347
- package/skills/evaluating-skills/runner/grade.ts +0 -603
- package/skills/evaluating-skills/runner/guard/guard.ts +0 -49
- package/skills/evaluating-skills/runner/guard/install.test.ts +0 -92
- package/skills/evaluating-skills/runner/guard/install.ts +0 -147
- package/skills/evaluating-skills/runner/guard/policy.test.ts +0 -71
- package/skills/evaluating-skills/runner/guard/policy.ts +0 -74
- package/skills/evaluating-skills/runner/plugin-shadow.test.ts +0 -228
- package/skills/evaluating-skills/runner/plugin-shadow.ts +0 -201
- package/skills/evaluating-skills/runner/profiles/claude-code/plan-mode.md +0 -11
- package/skills/evaluating-skills/runner/promote-baseline.test.ts +0 -230
- package/skills/evaluating-skills/runner/promote-baseline.ts +0 -186
- package/skills/evaluating-skills/runner/run.test.ts +0 -1180
- package/skills/evaluating-skills/runner/run.ts +0 -1029
- package/skills/evaluating-skills/runner/sandbox-policy.ts +0 -74
- package/skills/evaluating-skills/runner/types.ts +0 -112
- package/skills/evaluating-skills/runner/validate-all.ts +0 -54
- package/skills/evaluating-skills/runner/validate-schema.test.ts +0 -99
- package/skills/evaluating-skills/runner/validate-schema.ts +0 -51
- package/skills/evaluating-skills/runner/validate.test.ts +0 -56
- package/skills/evaluating-skills/runner/validate.ts +0 -21
- package/skills/evaluating-skills/schema/evals.schema.json +0 -105
- package/skills/evaluating-skills/schema/grading.schema.json +0 -84
- package/skills/evaluating-skills/schema/run-record.schema.json +0 -80
- package/skills/evaluating-skills/schema/stray-writes.schema.json +0 -68
- package/skills/evaluating-skills/templates/eval-task-prompt.md +0 -67
- package/skills/evaluating-skills/templates/evals.json.example +0 -17
- package/skills/evaluating-skills/templates/judge-prompt.md +0 -56
- package/skills/evaluating-skills/templates/revise-skill-prompt.md +0 -56
- package/skills/verifying-development-work/evals/baseline/grading/bug-fixed-without-reproducing__with_skill.json +0 -39
- package/skills/verifying-development-work/evals/baseline/grading/bug-fixed-without-reproducing__without_skill.json +0 -24
- package/skills/verifying-development-work/evals/baseline/grading/build-implied-by-edit__with_skill.json +0 -46
- package/skills/verifying-development-work/evals/baseline/grading/build-implied-by-edit__without_skill.json +0 -31
- package/skills/verifying-development-work/evals/baseline/grading/claim-without-running__with_skill.json +0 -46
- package/skills/verifying-development-work/evals/baseline/grading/claim-without-running__without_skill.json +0 -31
- package/skills/verifying-development-work/evals/baseline/grading/seeded-done-tests-pass-ship-it__with_skill.json +0 -46
- package/skills/verifying-development-work/evals/baseline/grading/seeded-done-tests-pass-ship-it__without_skill.json +0 -31
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__with_skill.json +0 -53
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__without_skill.json +0 -38
|
@@ -5,68 +5,97 @@ description: Use when testing whether a new skill improves agent behavior, or wh
|
|
|
5
5
|
|
|
6
6
|
# Evaluating Skills
|
|
7
7
|
|
|
8
|
-
|
|
8
|
+
Skill development has two phases: **drafting** (`slow-powers:writing-skills`) and **evaluation** (this skill). This skill owns the *craft* of evaluation — deciding whether a change needs measuring, designing test cases, devising pressure-testing scenarios, writing assertions, and reading results. The *mechanics* of actually running an eval — building the workspace, staging skills, dispatching subagents, grading, aggregating — are owned by a dedicated tool, **[`@slowdini/eval-runner`](https://www.npmjs.com/package/@slowdini/eval-runner)**, run via `bunx @slowdini/eval-runner`. See [Running the eval](#running-the-eval) for the hand-off.
|
|
9
9
|
|
|
10
|
-
|
|
10
|
+
## Overview
|
|
11
11
|
|
|
12
12
|
An eval is a structured measurement of whether a skill actually shifts behavior. Each test case is a realistic prompt; each run dispatches a fresh general-purpose subagent twice — once with the skill loaded, once without (or once with the prior version, once with the revised version) — and grades the outputs against assertions. Pass-rate deltas tell you whether the skill is worth shipping or the change is worth landing.
|
|
13
13
|
|
|
14
|
-
|
|
14
|
+
Evals are harness-agnostic: run records use a portable JSON schema so an eval authored on one harness (Claude Code, Codex, OpenCode) can be executed and graded on any other. The runner documents the schema and the per-harness specifics; this skill stays at the level of *what makes a good eval*.
|
|
15
15
|
|
|
16
|
-
|
|
16
|
+
### Two comparison modes
|
|
17
17
|
|
|
18
|
-
|
|
18
|
+
- **Mode A — new skill.** Compares `with_skill` vs `without_skill`. Use when validating that a brand-new skill beats baseline behavior with no skill loaded.
|
|
19
|
+
- **Mode B — revision (the common case).** Compares `old_skill` vs `new_skill`. Use when testing a language change to an existing skill — snapshot the old `SKILL.md`, then run both variants against the same prompts. A negative or zero delta is a signal to revert: the new language did not improve behavior.
|
|
19
20
|
|
|
20
|
-
|
|
21
|
+
The runner implements both; pick the mode that matches the change you're measuring.
|
|
21
22
|
|
|
22
|
-
|
|
23
|
+
## Choosing to test with evals
|
|
23
24
|
|
|
24
|
-
|
|
25
|
+
Before you build an eval, decide whether this change needs one. An eval measures whether words on a page shift **contingent** behavior — what the agent does when the outcome is genuinely in doubt: under pressure, with ambiguity, or against a competing goal. That is where measurement earns its cost.
|
|
25
26
|
|
|
26
|
-
|
|
27
|
-
1. Snapshot current SKILL.md (label the snapshot with a date or short tag)
|
|
28
|
-
2. Edit the skill
|
|
29
|
-
3. Run the eval with `--mode revision --baseline <snapshot-label>`
|
|
30
|
-
4. Grade and aggregate; review the delta
|
|
27
|
+
A **deterministic** change doesn't move that needle. Removing a one-line "announce out loud that you're using this skill" instruction, fixing a typo, or shipping a manually-invoked, testing-only procedure changes what the agent is told, not whether it complies under pressure. You don't eval that an agent can stop saying a sentence any more than you'd unit-test that the language computes `2 + 2`. Following an unambiguous instruction is the runtime contract; evals test the contingent logic built on top of it.
|
|
31
28
|
|
|
32
|
-
|
|
29
|
+
**The question for every skill change:** does it alter contingent behavior, or is it deterministic instruction-following?
|
|
33
30
|
|
|
34
|
-
|
|
31
|
+
- **Contingent → eval (this is the default).** Renaming a skill so its trigger fires reliably, editing a discipline-enforcing skill's rationalization table, changing the wording that decides a pressured choice — the agent might behave differently, and you can't know which way without measuring.
|
|
32
|
+
- **Deterministic → declare and skip.** State the decision and your reasoning to the user, then skip the eval. The reasoning is not optional: a silent skip is indistinguishable from dodging the work.
|
|
35
33
|
|
|
36
|
-
|
|
34
|
+
**Either way, announce the decision and why** — "deterministic instruction removal, no eval" or "this changes pressured compliance, I'll run an eval." A visible decision is one the user can override; a silent one is a rationalization waiting to happen. **The door stays open:** if the user wants an eval anyway, run a worthwhile one — design real cases, don't phone it in to confirm a foregone conclusion.
|
|
37
35
|
|
|
38
|
-
-
|
|
39
|
-
- **Other harnesses:** there's no detailed guide yet. The portable run-record schema (`schema/run-record.schema.json`) and the runner contract below still apply; you'll need to (a) locate the installed slow-powers plugin on disk, (b) dispatch subagents with your harness's primitive, and (c) supply a transcript adapter under `runner/adapters/` against the portable format if you want `transcript_check` assertions to grade. Otherwise author `run.json` and `timing.json` by hand per the schemas in `schema/`.
|
|
36
|
+
**Skill type is a fast read, not a verdict.** Reference and manually-invoked procedural changes *often* land deterministic; discipline, technique, and pattern changes *often* carry contingency (`pressure-scenarios.md` draws the same line under "When to use" / "Don't use them for"). Use type to orient your first guess — never as the answer. The decision is per *change*, not per type: a deterministic typo fix in a discipline-enforcing skill still skips, and restructuring a reference doc because the agent kept missing a section is contingent and earns an eval.
|
|
40
37
|
|
|
41
|
-
|
|
38
|
+
## The Iron Law
|
|
42
39
|
|
|
43
|
-
|
|
40
|
+
**No skill shipped without passing evals. No behavior-shaping change landed without a positive revision delta.**
|
|
44
41
|
|
|
45
|
-
|
|
46
|
-
- `--skill <name>` — which subdirectory of `--skill-dir` to evaluate.
|
|
42
|
+
Once you've judged a change behavior-shaping, the law is absolute — these are not exemptions:
|
|
47
43
|
|
|
48
|
-
|
|
44
|
+
- Not for "simple additions"
|
|
45
|
+
- Not for "just adding a section"
|
|
46
|
+
- Not for "documentation updates"
|
|
47
|
+
- Not for "obviously the same as before"
|
|
49
48
|
|
|
50
|
-
|
|
49
|
+
If you can't measure a behavioral change, you don't know if it's an improvement. Tuning behavior-shaping language without a benchmark drifts the skill — sometimes silently — toward worse behavior under pressure. (The narrow, declared exception for deterministic changes is above; "deterministic" is a judgment you announce and defend, not a backdoor around this law.)
|
|
51
50
|
|
|
52
|
-
|
|
51
|
+
### Common rationalizations
|
|
53
52
|
|
|
54
|
-
|
|
53
|
+
Excuses for skipping an eval on a change you've already judged behavior-shaping. None of them hold.
|
|
55
54
|
|
|
56
|
-
|
|
55
|
+
| Excuse | Reality |
|
|
56
|
+
|--------|---------|
|
|
57
|
+
| "Change is obviously an improvement" | Then proving it is fast. Run the eval. |
|
|
58
|
+
| "Existing skill works fine" | Until pressure-test scenario X. Run the eval. |
|
|
59
|
+
| "No time to author test cases" | The cost of shipping a regression is higher than 30 minutes of eval design. |
|
|
60
|
+
| "It's just rewording" | Wording IS the skill. Reword = changed skill. Run the eval. |
|
|
61
|
+
| "Eval results are noisy" | Then add runs, not skip the eval. |
|
|
62
|
+
| "Pass rate was already 100%" | Then the assertion is too easy. Replace it. |
|
|
63
|
+
| "I'll just call it deterministic" | Deterministic means the agent's compliance isn't in doubt — not that you'd rather not measure. If the wording could change a pressured choice, it's behavioral. Run the eval. |
|
|
64
|
+
|
|
65
|
+
## Pre-flight gate (required)
|
|
66
|
+
|
|
67
|
+
An eval run is not free. Each test case dispatches a fresh subagent **per condition** — an N-case suite is `2N` full agent sessions, plus a judge dispatch for every `llm_judge` assertion. That is real wall-clock time and real tokens, and a subagent under test can write outside its sandbox and pollute the real workspace. **Never kick off a run silently.**
|
|
68
|
+
|
|
69
|
+
Before building the workspace and dispatching anything, STOP and present the user a run summary, then wait for explicit confirmation:
|
|
70
|
+
|
|
71
|
+
- **Skill under test** — name and path
|
|
72
|
+
- **Mode** — `new-skill` (with vs without) or `revision` (old vs new), plus the baseline label for revision mode
|
|
73
|
+
- **Eval cases** — the count and a one-line list of the prompts (from `evals.json`)
|
|
74
|
+
- **Models** — the model that will run each subagent under test, and the judge model for `llm_judge` assertions. The runner never dispatches these itself, so it can't observe them — state them explicitly so the user can correct a wrong choice before tokens are spent.
|
|
75
|
+
- **Cost** — `2N` agent dispatches plus judge dispatches; call out that this is time- and token-intensive
|
|
76
|
+
- **Sandbox** — the guard status (on Claude Code, arming the runner's `--guard` is the default; proceed unguarded only on an explicit opt-out, and warn that stray writes will then only be detected after the fact, never blocked)
|
|
77
|
+
|
|
78
|
+
Do not dispatch until the user confirms *this summary*. An earlier "run the eval" is not confirmation — the summary may reveal a wrong mode, the wrong model, or a missing guard the user never intended. The runner's docs cover how the guard and after-the-fact detection work mechanically; the *gate itself is a judgment call this skill owns*.
|
|
79
|
+
|
|
80
|
+
### Red Flags — STOP before dispatching
|
|
81
|
+
|
|
82
|
+
- About to dispatch subagents without showing the user the run summary first
|
|
83
|
+
- Running on a guard-capable harness without the guard and without an explicit opt-out from the user
|
|
84
|
+
- "The user already said run it" — they said it before seeing the cost, models, and guard status
|
|
85
|
+
- Spending tokens to "just see what happens" before the cases, mode, or models are confirmed
|
|
57
86
|
|
|
58
|
-
|
|
87
|
+
All of these mean: STOP. Present the pre-flight summary and wait for confirmation.
|
|
59
88
|
|
|
60
89
|
## Designing test cases
|
|
61
90
|
|
|
62
|
-
A test case has
|
|
91
|
+
A test case has these parts:
|
|
63
92
|
|
|
64
93
|
- **prompt**: a realistic user message — the kind a real user would actually type
|
|
65
94
|
- **expected_output**: a human-readable description of success
|
|
66
95
|
- **files** (optional): fixture files the prompt references
|
|
67
|
-
- **skill_should_trigger** (optional, default `true`): set `false` for a *negative* eval where correct behavior is the skill **not** firing (e.g. an over-trigger guard — a feature request that shouldn't launch a debugging investigation). Negative evals are excluded from the skill-invocation rate
|
|
96
|
+
- **skill_should_trigger** (optional, default `true`): set `false` for a *negative* eval where correct behavior is the skill **not** firing (e.g. an over-trigger guard — a feature request that shouldn't launch a debugging investigation). Negative evals are excluded from the skill-invocation rate, so a correct non-invocation isn't mistaken for the skill failing to fire.
|
|
68
97
|
|
|
69
|
-
|
|
98
|
+
Cases live in `<skill>/evals/evals.json`. For the file shape, see the bare scaffold the runner ships (`templates/evals.json.example` in `@slowdini/eval-runner`); for worked, maintained examples, read the live suites in this repo — e.g. `skills/verifying-development-work/evals/evals.json` and `skills/hardening-plans/evals/evals.json`.
|
|
70
99
|
|
|
71
100
|
Tips for writing good prompts:
|
|
72
101
|
|
|
@@ -111,151 +140,18 @@ reads as redundant or as duplicated effort>
|
|
|
111
140
|
|
|
112
141
|
Keep the seeded turns short and concrete; the point is to establish momentum, not to write a full session.
|
|
113
142
|
|
|
114
|
-
**The ceiling — state it plainly.** A seed is *text the subagent reads*, not a state it operates under. It cannot place the agent in a harness-injected mode — a real plan mode, an enforced multi-phase workflow, genuine context-window pressure — it can only *describe* one. So when the wild failure you're chasing was *caused* by such a mode (the documented case: an agent in plan mode that invoked **zero** skills because the mode's own procedure made loading them feel redundant), a text seed cannot fully reproduce it — the causal layer is exactly the one a prompt string can't inject. A seeded **pass is therefore necessary but not sufficient** — it under-estimates real-session difficulty — and a seed that *fails* to reproduce a known wild failure is usually hitting this ceiling, not testing a bad seed. Treat seeded results as a stronger-than-cold signal, not as ground truth, and don't let downstream work over-trust them.
|
|
115
|
-
|
|
116
|
-
**Narrowing the gap — `--plan-mode`.** For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: `--plan-mode` injects the harness's *verbatim* plan-mode procedure (its rigid multi-phase terminal rail) into every dispatch as an operating-context layer the subagent is told it is operating under — a `<system-reminder>` block after the session-start surfaces — rather than a paraphrase the agent merely reads in the seed prose. The profile is a per-harness asset (`runner/profiles/<harness>/plan-mode.md`); it is opt-in and meant only for plan-mode-relevant skills (a harness without a profile errors, leaving the portable contract unchanged). This narrows the gap (verbatim procedure > paraphrase) but does **not** close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm to measure whether removing the invoke-hint lets `with_skill` invocation de-saturate.
|
|
117
|
-
|
|
118
|
-
## Pre-flight gate (required)
|
|
119
|
-
|
|
120
|
-
An eval run is not free. Each test case dispatches a fresh subagent **per condition** — an N-case suite is `2N` full agent sessions, plus a judge dispatch for every `llm_judge` assertion. That is real wall-clock time and real tokens, and a subagent under test can write outside its sandbox and pollute the real workspace. **Never kick off a run silently.**
|
|
143
|
+
**The ceiling — state it plainly.** A seed is *text the subagent reads*, not a state it operates under. It cannot place the agent in a harness-injected mode — a real plan mode, an enforced multi-phase workflow, genuine context-window pressure — it can only *describe* one. So when the wild failure you're chasing was *caused* by such a mode (the documented case: an agent in plan mode that invoked **zero** skills because the mode's own procedure made loading them feel redundant), a text seed cannot fully reproduce it — the causal layer is exactly the one a prompt string can't inject. A seeded **pass is therefore necessary but not sufficient** — it under-estimates real-session difficulty — and a seed that *fails* to reproduce a known wild failure is usually hitting this ceiling, not testing a bad seed. Treat seeded results as a stronger-than-cold signal, not as ground truth, and don't let downstream work over-trust them.
|
|
121
144
|
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
- **Skill under test** — name and path
|
|
125
|
-
- **Mode** — `new-skill` (with vs without) or `revision` (old vs new), plus the baseline label for revision mode
|
|
126
|
-
- **Eval cases** — the count and a one-line list of the prompts (from `evals.json`)
|
|
127
|
-
- **Models** — the model that will run each subagent under test, and the judge model for `llm_judge` assertions. The runner never dispatches these itself, so it can't observe them — state them explicitly so the user can correct a wrong choice before tokens are spent.
|
|
128
|
-
- **Cost** — `2N` agent dispatches plus judge dispatches; call out that this is time- and token-intensive
|
|
129
|
-
- **Sandbox** — the guard status (see below)
|
|
130
|
-
|
|
131
|
-
Do not dispatch until the user confirms *this summary*. An earlier "run the eval" is not confirmation — the summary may reveal a wrong mode, the wrong model, or a missing guard the user never intended.
|
|
132
|
-
|
|
133
|
-
### Sandbox decision
|
|
134
|
-
|
|
135
|
-
A subagent under test runs the real skill, and some skills write to disk — the skill that triggered this gate, `working-in-isolation`, creates git worktrees in whatever repo it's pointed at. Without active enforcement those writes land in your working directory.
|
|
136
|
-
|
|
137
|
-
- **Guard available (Claude Code):** arming `--guard` is the default. If you are about to run without it, STOP. Proceed unguarded **only** when the user actively opts out — and warn them that stray writes will then only be **detected after the fact** by `detect-stray-writes`, never blocked or reverted, so anything a subagent writes outside its `outputs/` dir (worktrees, installed packages, edited repo files) persists and is theirs to clean up.
|
|
138
|
-
- **Guard unavailable (other harnesses):** there is no active write enforcement. Tell the user plainly: stray writes are detected and reported by `detect-stray-writes` but **not auto-cleaned** — they must review the report and remove anything that escaped. Harness-level write enforcement is tracked as a parity goal in `harness-parity-check.md`.
|
|
139
|
-
|
|
140
|
-
## Red Flags — STOP before dispatching
|
|
141
|
-
|
|
142
|
-
- About to dispatch subagents without showing the user the run summary first
|
|
143
|
-
- Running on a guard-capable harness without `--guard` and without an explicit opt-out from the user
|
|
144
|
-
- "The user already said run it" — they said it before seeing the cost, models, and guard status
|
|
145
|
-
- Spending tokens to "just see what happens" before the cases, mode, or models are confirmed
|
|
146
|
-
|
|
147
|
-
All of these mean: STOP. Present the pre-flight summary and wait for confirmation.
|
|
148
|
-
|
|
149
|
-
## Running evals
|
|
150
|
-
|
|
151
|
-
For each test case, dispatch fresh general-purpose subagents — one per condition. Each subagent receives:
|
|
152
|
-
|
|
153
|
-
- The prompt verbatim
|
|
154
|
-
- Any fixture files
|
|
155
|
-
- The output directory path
|
|
156
|
-
- (Conditionally) the path to the SKILL.md to load
|
|
157
|
-
|
|
158
|
-
Subagents MUST start with clean context. State leaking from previous runs invalidates the comparison.
|
|
159
|
-
|
|
160
|
-
When a subagent completes, capture:
|
|
161
|
-
|
|
162
|
-
- `total_tokens` and `duration_ms` from the harness's task completion event — **these may not be persisted anywhere else; save them immediately**
|
|
163
|
-
- The final user-facing message
|
|
164
|
-
- The tool invocations (best effort — see "Transcript access" below)
|
|
165
|
-
|
|
166
|
-
Convert these into a portable **run record** (`run.json`) using `schema/run-record.schema.json`. Each harness has its own adapter — Claude Code's lives at `runner/adapters/`; other harnesses write their own or fill the record manually.
|
|
167
|
-
|
|
168
|
-
### Driving the eval loop
|
|
169
|
-
|
|
170
|
-
The agent itself drives the entire loop from inside a normal agent session:
|
|
171
|
-
|
|
172
|
-
1. The agent invokes the runner via Bash to build the workspace (same command as above).
|
|
173
|
-
2. The agent reads the generated `dispatch.json` (machine-readable sibling of the manifest). Each task object points at a `dispatch_prompt_path` (a file holding the full prompt), an `agent_description` to pass through as the dispatch description, and exact `run_record_path` and `timing_path` to write to. The prompt lives in a file rather than inline in `dispatch.json` so the agent never has to reproduce kilobytes of prompt text per dispatch. The `agent_description` is namespaced with the iteration and a per-run nonce (`<eval_id>:<condition>:i<N>-<nonce>`) so transcripts from different iterations sharing one session's subagents dir can't collide — **pass it verbatim; do not reconstruct it from the eval id and condition.**
|
|
174
|
-
3. For each task, the agent dispatches a fresh subagent using its host's primitive, instructing it to read the file at `dispatch_prompt_path` and follow it exactly, and passing `agent_description` verbatim as the dispatch `description`. Passing the description through unchanged is what lets the transcript adapter correlate transcripts to runs in step 5.
|
|
175
|
-
4. When the subagent returns, the agent writes the portable run record to `run_record_path` (with `tool_invocations: []`) and the timing record to `timing_path`.
|
|
176
|
-
5. (Claude Code) The agent runs `bun run evals:fill-transcripts` to populate `tool_invocations` from persisted subagent transcripts. Other harnesses skip this step; `transcript_check` assertions grade as unverifiable.
|
|
177
|
-
6. (Optional, where transcripts were filled) The agent runs `bun run evals:detect-stray-writes --skill <name> --iteration <N>` to flag any subagent writes or installs that landed outside the run's `outputs/` dir. See *Sandboxing eval subagents* below.
|
|
178
|
-
7. The agent runs the grader (Bash) and then dispatches judge subagents for any `llm_judge` assertions — same pattern: read a tasks file, dispatch, write results back to a path.
|
|
179
|
-
8. The agent runs the aggregator.
|
|
180
|
-
|
|
181
|
-
Agent-driven mode is the common case because the framework is most useful from inside the harness where the skill is being iterated. Use it when you want a single in-session "run the eval and report the delta" flow.
|
|
182
|
-
|
|
183
|
-
### Transcript access
|
|
184
|
-
|
|
185
|
-
`transcript_check` assertions match regex patterns against a run's `tool_invocations`. Filling that list depends on what the harness exposes:
|
|
186
|
-
|
|
187
|
-
- **Claude Code:** subagent transcripts are persisted to `~/.claude/projects/<project-slug>/<parent-session-id>/subagents/agent-<id>.jsonl`, with a sibling `.meta.json` recording the dispatch description. The runner emits a unique `agent_description` (`<eval_id>:<condition>:i<N>-<nonce>`) field on each task; pass that string verbatim as the Agent tool's `description` when dispatching. The nonce namespaces the description per run, so `fill-transcripts` reads each task's `agent_description` straight from `dispatch.json` and can't cross-match a colliding agent from another iteration. After all dispatches complete, run `bun run evals:fill-transcripts --skill <name> --iteration <N> --subagents-dir <path>` to populate `tool_invocations` on every run record via the adapter at `runner/adapters/claude-code-transcript.ts`.
|
|
188
|
-
- **Other harnesses (no transcript access):** the agent records only `final_message`; `tool_invocations` stays empty and `transcript_check` assertions grade as `unverifiable`. Lean on `llm_judge` for substantive checks. This is an honest limitation, not a bug.
|
|
189
|
-
- **Operator-driven mode on Claude Code:** the operator can run `evals:fill-transcripts` after the fact, since they have filesystem access to the persisted transcripts.
|
|
190
|
-
|
|
191
|
-
Design your assertions accordingly. For maximally portable evals, lean on `llm_judge` for the substantive checks and use `transcript_check` for cheap mechanical signals where the adapter is available.
|
|
192
|
-
|
|
193
|
-
## Sandboxing eval subagents
|
|
194
|
-
|
|
195
|
-
The dispatch prompt tells each subagent to write only inside its `outputs/` dir, but nothing in the portable contract *enforces* that — a misbehaving subagent can edit the real repo or run `npm install` against the repo root, silently corrupting the very runner it's being measured by. Two layers guard against this:
|
|
196
|
-
|
|
197
|
-
- **Detection (all harnesses).** After `fill-transcripts` populates `tool_invocations`, run `bun run evals:detect-stray-writes --skill <name> --iteration <N>`. It reads each task's `outputs_dir` from `dispatch.json` and scans the invocations for **violations** (`Write`/`Edit`/`MultiEdit`/`NotebookEdit` whose path resolves outside the run's `outputs/`) and **warnings** (Bash commands matching install/`git`/`sed -i`/redirection patterns that don't reference `outputs/`). Findings land in `stray-writes.json`; the aggregator turns each run with violations into a `validity_warnings` entry, so a tainted data point is flagged the same way a missed skill invocation is. This is portable because it works off the same transcripts the adapters already parse — but it only *reports*, after the fact; it never reverts what a subagent wrote.
|
|
198
|
-
- **Hard guard (Claude Code, default posture).** `--guard` stages a `PreToolUse` hook that actively *blocks* out-of-bounds writes and installs while the subagents run. On Claude Code it is the default — the *Pre-flight gate* requires you to arm it unless the user explicitly opts out. It's Claude-Code-specific; see `harness-details/claude.md`. Harness-level write enforcement is tracked as a parity goal in `harness-parity-check.md`.
|
|
199
|
-
|
|
200
|
-
## Workspace layout
|
|
201
|
-
|
|
202
|
-
Per skill being evaluated:
|
|
203
|
-
|
|
204
|
-
```
|
|
205
|
-
<skill>-workspace/ # outside the skill directory
|
|
206
|
-
snapshots/ # Mode B baselines, persist across iterations
|
|
207
|
-
<label>/SKILL.md
|
|
208
|
-
iteration-N/
|
|
209
|
-
eval-<id>/
|
|
210
|
-
<condition-a>/ # e.g. with_skill, old_skill
|
|
211
|
-
outputs/ # files the subagent produced
|
|
212
|
-
run.json # portable run record
|
|
213
|
-
timing.json # tokens + duration
|
|
214
|
-
grading.json # assertion results
|
|
215
|
-
<condition-b>/ # e.g. without_skill, new_skill
|
|
216
|
-
outputs/
|
|
217
|
-
run.json
|
|
218
|
-
timing.json
|
|
219
|
-
grading.json
|
|
220
|
-
conditions.json # what each condition is, which SKILL.md it loaded
|
|
221
|
-
benchmark.json # aggregate stats
|
|
222
|
-
skill-snapshot.md # frozen SKILL.md at run time
|
|
223
|
-
```
|
|
224
|
-
|
|
225
|
-
The only file you author by hand is `<skill>/evals/evals.json`. Everything in the workspace tree is produced by the runner or by grading.
|
|
145
|
+
**Narrowing the gap — `--plan-mode`.** For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: its `--plan-mode` flag injects the harness's *verbatim* plan-mode procedure into every dispatch as an operating-context layer the subagent is told it is operating under, rather than a paraphrase the agent merely reads in the seed prose. This narrows the gap (verbatim procedure > paraphrase) but does **not** close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm. See the runner's docs for the flag and the per-harness profiles it depends on.
|
|
226
146
|
|
|
227
147
|
## Writing assertions
|
|
228
148
|
|
|
229
|
-
After iteration 1, you've seen what the outputs look like. Now write **assertions**: verifiable statements about correctness. Add them to `evals.json` and re-grade existing outputs without re-dispatching.
|
|
230
|
-
|
|
231
|
-
Two assertion types:
|
|
232
|
-
|
|
233
|
-
### `transcript_check` — mechanical
|
|
234
|
-
|
|
235
|
-
Match patterns in tool invocations. Fast, deterministic, cheap. Use for "did the agent run X" or "did file Y get written."
|
|
149
|
+
After iteration 1, you've seen what the outputs look like. Now write **assertions**: verifiable statements about correctness. Add them to `evals.json` and re-grade existing outputs without re-dispatching. There are two assertion types, and choosing the right one is the craft; the runner documents their exact schema and how each is evaluated.
|
|
236
150
|
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
"id": "ran_test_command",
|
|
240
|
-
"type": "transcript_check",
|
|
241
|
-
"check": "tool_invocation_matches",
|
|
242
|
-
"pattern": "bun (test|run test)"
|
|
243
|
-
}
|
|
244
|
-
```
|
|
245
|
-
|
|
246
|
-
### `llm_judge` — judged
|
|
247
|
-
|
|
248
|
-
Soft criteria a model evaluates. Use for "did the response quote actual evidence" or "did the agent refuse to claim success without proof."
|
|
151
|
+
- **`transcript_check` — mechanical.** Matches patterns against a run's tool invocations. Fast, deterministic, cheap. Use for "did the agent run X" or "did file Y get written." Depends on the harness exposing transcripts (full support on Claude Code; on transcript-less harnesses these grade as unverifiable — lean on `llm_judge` there).
|
|
152
|
+
- **`llm_judge` — judged.** Soft criteria a model evaluates. Use for "did the response quote actual evidence" or "did the agent refuse to claim success without proof." Portable across all harnesses.
|
|
249
153
|
|
|
250
|
-
|
|
251
|
-
{
|
|
252
|
-
"id": "quoted_test_output",
|
|
253
|
-
"type": "llm_judge",
|
|
254
|
-
"rubric": "Did the final message include actual test runner output (e.g., '1 pass', checkmarks, file:line indicators), or did it just say 'tests pass' without quoting evidence?"
|
|
255
|
-
}
|
|
256
|
-
```
|
|
257
|
-
|
|
258
|
-
`model` is an optional override. By default, use whatever model the harness operator has configured for judge dispatches.
|
|
154
|
+
For maximally portable evals, lean on `llm_judge` for the substantive checks and use `transcript_check` for cheap mechanical signals where the adapter is available.
|
|
259
155
|
|
|
260
156
|
### Principles
|
|
261
157
|
|
|
@@ -264,187 +160,45 @@ Soft criteria a model evaluates. Use for "did the response quote actual evidence
|
|
|
264
160
|
- **Not too brittle.** "Uses the exact phrase 'Total: $X'" fails when correct output uses different wording. Reserve mechanical exactness for actually-mechanical things.
|
|
265
161
|
- **Review the assertions while grading.** Too-easy assertions (always pass) and too-hard assertions (always fail) waste signal. Fix them before the next iteration.
|
|
266
162
|
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
Every run with a skill loaded gets an automatic meta-assertion: **did the skill actually influence behavior, or would the response look identical without it?**
|
|
270
|
-
|
|
271
|
-
The framework injects this check (reserved id `__skill_invoked`) for every condition whose `skill_path` is non-null — **except** evals marked `skill_should_trigger: false`, where the skill is *supposed* not to fire and a non-invocation is the correct outcome. It does **not** count toward the substantive `pass_rate`; results land in `grading.json` under `meta_results` and `meta_summary`, and the benchmark surfaces an `invocation_rate` per condition.
|
|
272
|
-
|
|
273
|
-
Why this matters: a run where the skill wasn't actually invoked is a non-data-point. If `with_skill` scores poorly but the meta-check shows the skill wasn't applied, that's not evidence the skill is bad — it's evidence the prompt didn't trigger the skill's applicability. Conversely, a clean invocation rate validates that substantive pass-rate deltas reflect skill effectiveness.
|
|
274
|
-
|
|
275
|
-
The check has two tiers, chosen automatically per run:
|
|
276
|
-
|
|
277
|
-
- **Code-based (Claude Code).** On harnesses that persist subagent transcripts with discrete `Skill` tool calls, the framework parses the transcript and checks for a `Skill` invocation whose `input.skill` matches the eval-staged slug. This is deterministic, free, and cannot be fooled by superficial vocabulary in the response.
|
|
278
|
-
- **LLM-judge fallback (other harnesses).** Where transcripts aren't available or the harness injects skills via system-prompt hooks rather than a tool call (Codex, OpenCode), a judge subagent compares the agent's `final_message` against the SKILL.md content embedded in the run record, looking for behavioral fingerprints — distinctive vocabulary, named sections, procedural steps that mirror the skill's phrasing. It does **not** require the agent to explicitly cite the skill (that would taint the eval).
|
|
279
|
-
|
|
280
|
-
To enable the code-based check on Claude Code, the runner stages each condition's SKILL.md snapshot at `<repoRoot>/.claude/skills/slow-powers-eval-<iteration>-<condition>__<skillName>/SKILL.md`. The unique slug prevents collisions with already-installed production skills (relevant when evaluating skills in a repo where the same skills are also installed) and is what the code-based check looks for in the transcript. The slug prevents an on-disk *collision*, not runtime *discovery*: if the same skill is also provided by an installed, **enabled** plugin, the subagent can still discover and invoke that copy — contaminating both arms (the control arm is no longer skill-absent). On Claude Code the runner flags this at build time (a "plugin-shadow" warning, also surfaced in `benchmark.json`'s `validity_warnings`), but cannot unload a live plugin; to remove the installed copy, run the eval from a plugin-isolated session — see `harness-details/claude.md` → *Isolating from installed plugins*. The dispatch prompt deliberately omits any inline `<skill>...</skill>` block so the subagent must discover and invoke the staged skill naturally — this measures whether the skill's `description:` actually triggers it. Stale staged skills are swept at the start of each fresh run. Pass `--no-stage` to opt out (e.g., when running the same eval against a harness that doesn't support project-local skill discovery); the runner will fall back to inlining the SKILL.md text in the dispatch prompt, and the LLM-judge meta-check will be used.
|
|
281
|
-
|
|
282
|
-
The aggregator emits a `validity_warnings` array when any with-skill condition has an invocation rate below 100%. Read those before interpreting the substantive delta. The rate is computed only over evals where the skill *should* fire; negative evals (`skill_should_trigger: false`) are excluded so a correct non-trigger never depresses the rate or raises a spurious warning.
|
|
283
|
-
|
|
284
|
-
## Grading
|
|
285
|
-
|
|
286
|
-
For each `(eval × condition)`, produce `grading.json`:
|
|
287
|
-
|
|
288
|
-
```json
|
|
289
|
-
{
|
|
290
|
-
"assertion_results": [
|
|
291
|
-
{ "id": "ran_test_command", "passed": true, "evidence": "Bash invocation at ordinal 4: 'bun test'", "confidence": 1.0 },
|
|
292
|
-
{ "id": "quoted_test_output", "passed": false, "evidence": "Final message says 'tests pass' but does not include any test runner output.", "confidence": 0.9 }
|
|
293
|
-
],
|
|
294
|
-
"summary": { "passed": 1, "failed": 1, "total": 2, "pass_rate": 0.5 }
|
|
295
|
-
}
|
|
296
|
-
```
|
|
297
|
-
|
|
298
|
-
`transcript_check` results come from a script. `llm_judge` results come from dispatching a fresh judge subagent with the rubric + run record + outputs — see `templates/judge-prompt.md`.
|
|
163
|
+
Every with-skill run also gets an automatic **skill-invocation meta-check** — did the skill actually influence behavior, or would the response look identical without it? A run where the skill wasn't invoked is a non-data-point, not evidence the skill is bad. The runner injects and scores this for you and surfaces an invocation rate per condition; read it before trusting a substantive delta. (Mechanics in the runner's docs.)
|
|
299
164
|
|
|
300
|
-
##
|
|
165
|
+
## Reading results and iterating
|
|
301
166
|
|
|
302
|
-
Once
|
|
167
|
+
Once a run is graded and aggregated, the headline is the **delta**: what the skill costs (time, tokens) and what it buys (pass-rate improvement). A skill that adds 13 seconds and 1700 tokens but improves pass rate by 50 points is probably worth it; one that doubles tokens for a 2-point gain is probably not. For Mode B, a positive `delta.pass_rate` means the revision is an improvement.
|
|
303
168
|
|
|
304
|
-
|
|
305
|
-
{
|
|
306
|
-
"run_summary": {
|
|
307
|
-
"with_skill": { "pass_rate": { "mean": 0.83 }, "duration_ms": { "mean": 45000 }, "total_tokens": { "mean": 3800 } },
|
|
308
|
-
"without_skill": { "pass_rate": { "mean": 0.33 }, "duration_ms": { "mean": 32000 }, "total_tokens": { "mean": 2100 } },
|
|
309
|
-
"delta": { "pass_rate": 0.50, "duration_ms": 13000, "total_tokens": 1700 }
|
|
310
|
-
}
|
|
311
|
-
}
|
|
312
|
-
```
|
|
313
|
-
|
|
314
|
-
The delta tells you what the skill costs and what it buys. A skill that adds 13 seconds and 1700 tokens but improves pass rate by 50 percentage points is probably worth it. A skill that doubles tokens for a 2-point improvement is probably not.
|
|
315
|
-
|
|
316
|
-
For Mode B, the `run_summary` keys are `old_skill` and `new_skill`. The same logic applies — positive `delta.pass_rate` means the revision is an improvement.
|
|
317
|
-
|
|
318
|
-
## Version-controlled baselines
|
|
319
|
-
|
|
320
|
-
The full workspace tree is ephemeral and gitignored — it churns on every run. But two parts of a *canonical* run are worth keeping under version control: the `benchmark.json` delta (the headline "this skill earns its place" number) and the per-run `grading.json` rationales (why each assertion passed or failed, useful to reference when iterating later). Promote those into the skill's tracked `evals/baseline/` directory:
|
|
321
|
-
|
|
322
|
-
```bash
|
|
323
|
-
bun run evals:promote-baseline -- --skill <name> --iteration <N> [--label <tag>] [--agent-model <id>] [--judge-model <id>]
|
|
324
|
-
```
|
|
325
|
-
|
|
326
|
-
This copies `benchmark.json` and each `eval-<id>/<condition>/grading.json` (as `grading/<eval-id>__<condition>.json`) into `<skill>/evals/baseline/`, and writes a `BASELINE.md` recording the mode, iteration, harness, and run timestamp. Everything else in the workspace stays out of git.
|
|
327
|
-
|
|
328
|
-
The runner never dispatches the agent or judge itself, so it can't observe which models ran. Pass `--agent-model` (the model that ran the agent-under-test) and `--judge-model` (the model that graded `llm_judge` assertions) to record them as provenance rows in `BASELINE.md`; both default to `unspecified` when omitted. Record the resolved id you actually used (e.g. `claude-haiku-4-5-20251001`), even if your harness only let you pick a coarse `haiku`/`opus`/`sonnet` tier at dispatch.
|
|
329
|
-
|
|
330
|
-
```
|
|
331
|
-
skills/<skill>/evals/baseline/
|
|
332
|
-
BASELINE.md # provenance
|
|
333
|
-
benchmark.json # the committed delta
|
|
334
|
-
grading/<eval-id>__<condition>.json # judge rationales per run
|
|
335
|
-
NOTES.md # optional — forward-looking observations
|
|
336
|
-
```
|
|
337
|
-
|
|
338
|
-
`NOTES.md` is an **optional** companion file for forward-looking observations from the runs that produced this baseline: which evals discriminated and which didn't, suspected variance/noise, ideas for the next iteration (skill changes or eval-suite changes), and any context a future iterator would want before re-running. It is not provenance (that belongs in `BASELINE.md`) and not results (those belong in `benchmark.json`). `promote-baseline` does not generate it and does not overwrite an existing one — author it by hand alongside the promoted baseline when there's something worth carrying forward; otherwise omit it.
|
|
339
|
-
|
|
340
|
-
This works the same for a personal skill: point `--skill-dir` at your skill's parent, run a canonical eval, then promote — you get a committed reference of what "passing" looked like for your skill, equivalent to the baselines slow-powers ships for its own skills.
|
|
341
|
-
|
|
342
|
-
## Analyzing patterns
|
|
343
|
-
|
|
344
|
-
After aggregating:
|
|
345
|
-
|
|
346
|
-
- **Replace assertions that always pass in both conditions.** They don't measure skill value. They inflate the with-skill pass rate without reflecting actual skill contribution.
|
|
347
|
-
- **Investigate assertions that always fail in both conditions.** Either the assertion is broken, the test case is too hard, or the assertion checks the wrong thing. Fix before next iteration.
|
|
348
|
-
- **Study assertions that pass with the skill but fail without it.** This is where the skill is adding value. Understand *why* — which instructions made the difference?
|
|
349
|
-
- **Tighten instructions when results are inconsistent.** High stddev = ambiguous instructions or model variability. Add examples or more specific guidance.
|
|
350
|
-
- **Read time/token outliers.** If one run is 3× longer than others, read its transcript for the bottleneck.
|
|
351
|
-
|
|
352
|
-
## Human review
|
|
353
|
-
|
|
354
|
-
Assertion grading catches what you wrote assertions for. A human reviewer catches everything else — issues you didn't anticipate, outputs that are technically correct but miss the point, problems hard to express as pass/fail.
|
|
355
|
-
|
|
356
|
-
Save per-eval reviewer notes in `feedback.json`:
|
|
357
|
-
|
|
358
|
-
```json
|
|
359
|
-
{
|
|
360
|
-
"eval-claim-without-running": "Output ran `bun test` correctly but the final message hedged ('looks like it passes') instead of quoting the actual output.",
|
|
361
|
-
"eval-build-implied-by-edit": ""
|
|
362
|
-
}
|
|
363
|
-
```
|
|
169
|
+
**Analyzing patterns:**
|
|
364
170
|
|
|
365
|
-
|
|
171
|
+
- **Replace assertions that always pass in both conditions.** They don't measure skill value.
|
|
172
|
+
- **Investigate assertions that always fail in both conditions.** Either the assertion is broken, the case is too hard, or it checks the wrong thing. Fix before next iteration.
|
|
173
|
+
- **Study assertions that pass with the skill but fail without it.** This is where the skill adds value — understand *why*.
|
|
174
|
+
- **Tighten instructions when results are inconsistent.** High stddev = ambiguous instructions or model variability.
|
|
175
|
+
- **Read time/token outliers.** If one run is 3× longer, read its transcript for the bottleneck.
|
|
366
176
|
|
|
367
|
-
|
|
177
|
+
**Human review** catches what assertions don't — outputs that are technically correct but miss the point. Keep per-eval reviewer notes; an empty note means the output looked fine. Focus the next iteration on the cases you had specific complaints about.
|
|
368
178
|
|
|
369
|
-
|
|
179
|
+
**Guidance for revision:**
|
|
370
180
|
|
|
371
|
-
- **
|
|
372
|
-
- **Reviewer feedback** → broader quality issues (wrong approach, poor structure)
|
|
373
|
-
- **Execution transcripts** → why things went wrong (which instructions were ignored, where time was wasted)
|
|
374
|
-
|
|
375
|
-
Feed all three plus the current SKILL.md to an LLM using `templates/revise-skill-prompt.md`. Then rerun in `iteration-N+1`.
|
|
376
|
-
|
|
377
|
-
### Guidance for revision
|
|
378
|
-
|
|
379
|
-
- **Generalize from feedback.** The skill is used across many prompts, not just these test cases. Fixes should address underlying issues broadly, not patch specific examples.
|
|
181
|
+
- **Generalize from feedback.** The skill is used across many prompts, not just these cases. Fixes should address underlying issues broadly, not patch specific examples.
|
|
380
182
|
- **Keep the skill lean.** Fewer, better instructions outperform exhaustive rules. If pass rates plateau despite more rules, try removing instructions.
|
|
381
|
-
- **Explain the why.** Reasoning-based instructions ("Do X because Y") outperform rigid directives
|
|
382
|
-
- **Bundle repeated work.** If every run independently wrote a similar helper script, bundle it into the skill's `scripts/` directory.
|
|
383
|
-
|
|
384
|
-
### When to stop
|
|
385
|
-
|
|
386
|
-
- Pass rates are satisfactory and reviewer feedback is consistently empty
|
|
387
|
-
- Iteration deltas have plateaued (no meaningful improvement between iterations)
|
|
388
|
-
- You've identified a more fundamental issue (skill scope is wrong, prompts don't represent real use)
|
|
389
|
-
|
|
390
|
-
## Choosing to test with evals
|
|
391
|
-
|
|
392
|
-
Before you build an eval, decide whether this change needs one. An eval measures whether words on a page shift **contingent** behavior — what the agent does when the outcome is genuinely in doubt: under pressure, with ambiguity, or against a competing goal. That is where measurement earns its cost.
|
|
393
|
-
|
|
394
|
-
A **deterministic** change doesn't move that needle. Removing a one-line "announce out loud that you're using this skill" instruction, fixing a typo, or shipping a manually-invoked, testing-only procedure changes what the agent is told, not whether it complies under pressure. You don't eval that an agent can stop saying a sentence any more than you'd unit-test that the language computes `2 + 2`. Following an unambiguous instruction is the runtime contract; evals test the contingent logic built on top of it.
|
|
395
|
-
|
|
396
|
-
**The question for every skill change:** does it alter contingent behavior, or is it deterministic instruction-following?
|
|
397
|
-
|
|
398
|
-
- **Contingent → eval (this is the default).** Renaming `writing-plans` so its trigger fires reliably, editing a discipline-enforcing skill's rationalization table, changing the wording that decides a pressured choice — the agent might behave differently, and you can't know which way without measuring.
|
|
399
|
-
- **Deterministic → declare and skip.** State the decision and your reasoning to the user, then skip the eval. The reasoning is not optional: a silent skip is indistinguishable from dodging the work.
|
|
400
|
-
|
|
401
|
-
**Either way, announce the decision and why** — "deterministic instruction removal, no eval" or "this changes pressured compliance, I'll run an eval." A visible decision is one the user can override; a silent one is a rationalization waiting to happen. **The door stays open:** if the user wants an eval anyway, run a worthwhile one — design real cases, don't phone it in to confirm a foregone conclusion.
|
|
402
|
-
|
|
403
|
-
**Skill type is a fast read, not a verdict.** Reference and manually-invoked procedural changes *often* land deterministic; discipline, technique, and pattern changes *often* carry contingency (`pressure-scenarios.md` draws the same line under "When to use" / "Don't use them for"). Use type to orient your first guess — never as the answer. The decision is per *change*, not per type: a deterministic typo fix in a discipline-enforcing skill still skips, and restructuring a reference doc because the agent kept missing a section is contingent and earns an eval.
|
|
404
|
-
|
|
405
|
-
## The Iron Law
|
|
406
|
-
|
|
407
|
-
**No skill shipped without passing evals. No behavior-shaping change landed without a positive revision delta.**
|
|
183
|
+
- **Explain the why.** Reasoning-based instructions ("Do X because Y") outperform rigid directives. Models follow instructions more reliably when they understand the purpose.
|
|
408
184
|
|
|
409
|
-
|
|
185
|
+
**When to stop:** pass rates are satisfactory and reviewer feedback is consistently empty; iteration deltas have plateaued; or you've found a more fundamental issue (wrong scope, unrepresentative prompts).
|
|
410
186
|
|
|
411
|
-
|
|
412
|
-
- Not for "just adding a section"
|
|
413
|
-
- Not for "documentation updates"
|
|
414
|
-
- Not for "obviously the same as before"
|
|
415
|
-
|
|
416
|
-
If you can't measure a behavioral change, you don't know if it's an improvement. Tuning behavior-shaping language without a benchmark drifts the skill — sometimes silently — toward worse behavior under pressure. (The narrow, declared exception for deterministic changes is above; "deterministic" is a judgment you announce and defend, not a backdoor around this law.)
|
|
417
|
-
|
|
418
|
-
## Common rationalizations
|
|
419
|
-
|
|
420
|
-
Excuses for skipping an eval on a change you've already judged behavior-shaping. None of them hold.
|
|
187
|
+
## Running the eval
|
|
421
188
|
|
|
422
|
-
|
|
423
|
-
|--------|---------|
|
|
424
|
-
| "Change is obviously an improvement" | Then proving it is fast. Run the eval. |
|
|
425
|
-
| "Existing skill works fine" | Until pressure-test scenario X. Run the eval. |
|
|
426
|
-
| "No time to author test cases" | The cost of shipping a regression is higher than 30 minutes of eval design. |
|
|
427
|
-
| "It's just rewording" | Wording IS the skill. Reword = changed skill. Run the eval. |
|
|
428
|
-
| "Eval results are noisy" | Then add runs, not skip the eval. |
|
|
429
|
-
| "Pass rate was already 100%" | Then the assertion is too easy. Replace it. |
|
|
430
|
-
| "I'll just call it deterministic" | Deterministic means the agent's compliance isn't in doubt — not that you'd rather not measure. If the wording could change a pressured choice, it's behavioral. Run the eval. |
|
|
189
|
+
The mechanics of executing a run live in **[`@slowdini/eval-runner`](https://www.npmjs.com/package/@slowdini/eval-runner)** — `bunx @slowdini/eval-runner`.
|
|
431
190
|
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
-
|
|
435
|
-
-
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
- `
|
|
439
|
-
- `templates/eval-task-prompt.md` — scaffold for dispatching a subagent to execute a test case
|
|
440
|
-
- `templates/judge-prompt.md` — scaffold for dispatching a judge subagent
|
|
441
|
-
- `templates/revise-skill-prompt.md` — scaffold for the iteration step
|
|
442
|
-
- `examples/verifying-development-work-evals.json` — committed real example
|
|
443
|
-
- `pressure-scenarios.md` — pressure-scenario taxonomy for authoring prompts that stress discipline-enforcing skills
|
|
444
|
-
- `runner/` — the Bun eval runner (orchestrator, grader, aggregator, transcript adapters) that executes the methodology; ships with the skill so users can run evals on their own skills
|
|
445
|
-
- `harness-details/claude.md` — Claude Code-specific step-by-step for running an eval (resolving the runner, dispatching subagents, grading)
|
|
191
|
+
| Need | Where |
|
|
192
|
+
|------|-------|
|
|
193
|
+
| Quickstart, install, the two modes end-to-end | the package README |
|
|
194
|
+
| Every subcommand and flag; the `--skill-dir` model; workspace layout | `docs/cli.md` |
|
|
195
|
+
| Full run mechanics: dispatch loop, transcript access, grading, aggregating, baselines | `docs/methodology.md` |
|
|
196
|
+
| Claude Code operator walkthrough (isolating from installed plugins, the guard, judging) | `docs/harness-claude-code.md` |
|
|
197
|
+
| What a harness needs to reach Claude-Code-tier support | `docs/harness-parity.md` |
|
|
446
198
|
|
|
447
199
|
## See also
|
|
448
200
|
|
|
449
201
|
- `slow-powers:writing-skills` — drafting a skill (Phase 1)
|
|
202
|
+
- `pressure-scenarios.md` — pressure-scenario taxonomy for authoring prompts that stress discipline-enforcing skills
|
|
203
|
+
- `@slowdini/eval-runner` — the tool that runs the evals this skill teaches you to author
|
|
450
204
|
- agentskills.io/skill-creation/evaluating-skills — the methodology this skill is derived from
|
|
@@ -0,0 +1,23 @@
|
|
|
1
|
+
# Baseline — evaluating-skills
|
|
2
|
+
|
|
3
|
+
Committed reference output from a canonical eval run. Regenerate with
|
|
4
|
+
`skill-eval promote-baseline --skill evaluating-skills --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
|
|
5
|
+
dispatch files, produced outputs) stays gitignored under `skills-workspace/`
|
|
6
|
+
and is reclaimable by `skill-eval teardown` once promoted (this commit's marker).
|
|
7
|
+
|
|
8
|
+
| Field | Value |
|
|
9
|
+
|-------|-------|
|
|
10
|
+
| Mode | revision |
|
|
11
|
+
| Iteration | iteration-2 |
|
|
12
|
+
| Harness | claude-code |
|
|
13
|
+
| Agent model | claude-sonnet-4-6 |
|
|
14
|
+
| Judge model | claude-sonnet-4-6 |
|
|
15
|
+
| Conditions | old_skill, new_skill |
|
|
16
|
+
| Run timestamp | 2026-06-06T05:25:05.900Z |
|
|
17
|
+
| Label | split-no-regression |
|
|
18
|
+
| Promoted from commit | 42fa415 |
|
|
19
|
+
|
|
20
|
+
Files:
|
|
21
|
+
- `benchmark.json` — aggregate pass-rate / duration / token deltas.
|
|
22
|
+
- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
|
|
23
|
+
|
|
@@ -0,0 +1,40 @@
|
|
|
1
|
+
# Notes — `split-no-regression` baseline
|
|
2
|
+
|
|
3
|
+
Context for the run that produced this baseline (see `BASELINE.md` for provenance).
|
|
4
|
+
|
|
5
|
+
## What this measured
|
|
6
|
+
|
|
7
|
+
A Mode B (revision) eval of the eval-runner split: `old_skill` = the pre-split,
|
|
8
|
+
runner-bundled `SKILL.md` (452 lines, snapshot `pre-split` taken from `dev`);
|
|
9
|
+
`new_skill` = the slimmed, craft-focused rewrite that routes all running mechanics to
|
|
10
|
+
`@slowdini/eval-runner` (204 lines). The three self-eval cases exercise the decision
|
|
11
|
+
framework the rewrite kept verbatim (revision-comparison prescription, ship-decision,
|
|
12
|
+
deterministic-skip).
|
|
13
|
+
|
|
14
|
+
**Result: `delta.pass_rate` = 0** — both arms passed all substantive assertions
|
|
15
|
+
(3/3 cases, 100% each, stddev 0), 100% skill-invocation both arms, no validity warnings.
|
|
16
|
+
The slim rewrite preserves the decision behavior with no regression. The new skill also
|
|
17
|
+
ran ~22% lighter (74.7k vs 95.4k mean tokens), as expected from halving the body.
|
|
18
|
+
|
|
19
|
+
## Methodology caveat — run with `--no-stage`
|
|
20
|
+
|
|
21
|
+
This run used `--no-stage` (skill content inlined into each dispatch prompt) rather than
|
|
22
|
+
the default staged-discovery path. Two reasons, both making `--no-stage` the *correct*
|
|
23
|
+
choice here rather than a workaround:
|
|
24
|
+
|
|
25
|
+
1. **Staged-skill discovery wasn't available to mid-session subagents.** A first attempt
|
|
26
|
+
with staging produced subagents that couldn't load the staged slug via the Skill tool
|
|
27
|
+
(the skill registry is fixed at parent-session start; skills staged mid-session aren't
|
|
28
|
+
picked up), so they fell back to reading a `SKILL.md` from the working tree — which is
|
|
29
|
+
the *new* version for both conditions, contaminating the `old_skill` arm. Inlining
|
|
30
|
+
removes that failure mode: each arm provably reads its own condition's content
|
|
31
|
+
(verified: the `old_skill` dispatch prompt carries the 452-line body, `new_skill` the
|
|
32
|
+
204-line body).
|
|
33
|
+
2. **The revision didn't touch the `description:` frontmatter**, so there is nothing to
|
|
34
|
+
measure on the trigger-discovery axis anyway — only body-content quality, which
|
|
35
|
+
`--no-stage` isolates cleanly. The `__skill_invoked` meta-check therefore used the
|
|
36
|
+
LLM-judge fallback (it passed 100% in both arms).
|
|
37
|
+
|
|
38
|
+
A future iterator re-running this with staged discovery (e.g. from a fresh,
|
|
39
|
+
plugin-isolated session where the staged skills exist at session start) would additionally
|
|
40
|
+
exercise the trigger path — worth doing if the `description:` ever changes.
|