@slowdini/slow-powers-opencode 0.3.0 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (85) hide show
  1. package/README.md +34 -72
  2. package/bootstrap.md +1 -7
  3. package/opencode/plugins/slow-powers.js +69 -5
  4. package/package.json +14 -17
  5. package/skills/evaluating-skills/SKILL.md +90 -338
  6. package/skills/evaluating-skills/evals/baseline/BASELINE.md +23 -0
  7. package/skills/evaluating-skills/evals/baseline/NOTES.md +40 -0
  8. package/skills/evaluating-skills/evals/baseline/benchmark.json +54 -0
  9. package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__new_skill.json +39 -0
  10. package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__old_skill.json +39 -0
  11. package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__new_skill.json +39 -0
  12. package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__old_skill.json +39 -0
  13. package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__new_skill.json +32 -0
  14. package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__old_skill.json +32 -0
  15. package/skills/hardening-plans/SKILL.md +29 -7
  16. package/skills/hardening-plans/evals/baseline/BASELINE.md +11 -6
  17. package/skills/hardening-plans/evals/baseline/NOTES.md +72 -58
  18. package/skills/hardening-plans/evals/baseline/benchmark.json +25 -25
  19. package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__new_skill.json +2 -2
  20. package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__old_skill.json +2 -2
  21. package/skills/hardening-plans/evals/baseline/grading/docs-refactor-plan-mode__new_skill.json +39 -0
  22. package/skills/hardening-plans/evals/baseline/grading/docs-refactor-plan-mode__old_skill.json +39 -0
  23. package/skills/hardening-plans/evals/baseline/grading/oauth-task-breakdown-cold__new_skill.json +39 -0
  24. package/skills/hardening-plans/evals/baseline/grading/oauth-task-breakdown-cold__old_skill.json +39 -0
  25. package/skills/hardening-plans/evals/baseline/grading/research-plan-no-required-skill__new_skill.json +32 -0
  26. package/skills/hardening-plans/evals/baseline/grading/research-plan-no-required-skill__old_skill.json +32 -0
  27. package/skills/hardening-plans/evals/baseline/grading/seeded-plan-mode-todo-app-adversarial__new_skill.json +39 -0
  28. package/skills/hardening-plans/evals/baseline/grading/seeded-plan-mode-todo-app-adversarial__old_skill.json +39 -0
  29. package/skills/hardening-plans/evals/baseline/grading/seeded-plan-mode-todo-app__new_skill.json +39 -0
  30. package/skills/hardening-plans/evals/baseline/grading/seeded-plan-mode-todo-app__old_skill.json +39 -0
  31. package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__new_skill.json +3 -3
  32. package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__old_skill.json +8 -8
  33. package/skills/hardening-plans/evals/baseline/grading/structural-refactor-cold__new_skill.json +39 -0
  34. package/skills/hardening-plans/evals/baseline/grading/structural-refactor-cold__old_skill.json +39 -0
  35. package/skills/hardening-plans/evals/evals.json +46 -0
  36. package/skills/test-driven-development/evals/baseline/NOTES.md +2 -2
  37. package/skills/evaluating-skills/examples/verifying-development-work-evals.json +0 -30
  38. package/skills/evaluating-skills/harness-details/claude.md +0 -194
  39. package/skills/evaluating-skills/harness-parity.md +0 -155
  40. package/skills/evaluating-skills/runner/README.md +0 -163
  41. package/skills/evaluating-skills/runner/adapters/claude-code-session.test.ts +0 -56
  42. package/skills/evaluating-skills/runner/adapters/claude-code-session.ts +0 -43
  43. package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +0 -485
  44. package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +0 -242
  45. package/skills/evaluating-skills/runner/aggregate.test.ts +0 -484
  46. package/skills/evaluating-skills/runner/aggregate.ts +0 -269
  47. package/skills/evaluating-skills/runner/context.test.ts +0 -181
  48. package/skills/evaluating-skills/runner/context.ts +0 -90
  49. package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +0 -396
  50. package/skills/evaluating-skills/runner/detect-stray-writes.ts +0 -288
  51. package/skills/evaluating-skills/runner/fill-transcripts.test.ts +0 -73
  52. package/skills/evaluating-skills/runner/fill-transcripts.ts +0 -154
  53. package/skills/evaluating-skills/runner/grade.test.ts +0 -347
  54. package/skills/evaluating-skills/runner/grade.ts +0 -603
  55. package/skills/evaluating-skills/runner/guard/guard.ts +0 -49
  56. package/skills/evaluating-skills/runner/guard/install.test.ts +0 -92
  57. package/skills/evaluating-skills/runner/guard/install.ts +0 -147
  58. package/skills/evaluating-skills/runner/guard/policy.test.ts +0 -128
  59. package/skills/evaluating-skills/runner/guard/policy.ts +0 -74
  60. package/skills/evaluating-skills/runner/plugin-shadow.test.ts +0 -228
  61. package/skills/evaluating-skills/runner/plugin-shadow.ts +0 -201
  62. package/skills/evaluating-skills/runner/profiles/claude-code/plan-mode.md +0 -11
  63. package/skills/evaluating-skills/runner/promote-baseline.test.ts +0 -281
  64. package/skills/evaluating-skills/runner/promote-baseline.ts +0 -204
  65. package/skills/evaluating-skills/runner/record-runs.test.ts +0 -314
  66. package/skills/evaluating-skills/runner/record-runs.ts +0 -209
  67. package/skills/evaluating-skills/runner/run.test.ts +0 -1703
  68. package/skills/evaluating-skills/runner/run.ts +0 -1388
  69. package/skills/evaluating-skills/runner/sandbox-policy.ts +0 -94
  70. package/skills/evaluating-skills/runner/types.ts +0 -121
  71. package/skills/evaluating-skills/runner/validate-all.ts +0 -54
  72. package/skills/evaluating-skills/runner/validate-schema.test.ts +0 -99
  73. package/skills/evaluating-skills/runner/validate-schema.ts +0 -51
  74. package/skills/evaluating-skills/runner/validate.test.ts +0 -56
  75. package/skills/evaluating-skills/runner/validate.ts +0 -21
  76. package/skills/evaluating-skills/runner/workspace-teardown.test.ts +0 -227
  77. package/skills/evaluating-skills/runner/workspace-teardown.ts +0 -136
  78. package/skills/evaluating-skills/schema/evals.schema.json +0 -105
  79. package/skills/evaluating-skills/schema/grading.schema.json +0 -84
  80. package/skills/evaluating-skills/schema/run-record.schema.json +0 -80
  81. package/skills/evaluating-skills/schema/stray-writes.schema.json +0 -80
  82. package/skills/evaluating-skills/templates/eval-task-prompt.md +0 -69
  83. package/skills/evaluating-skills/templates/evals.json.example +0 -17
  84. package/skills/evaluating-skills/templates/judge-prompt.md +0 -56
  85. package/skills/evaluating-skills/templates/revise-skill-prompt.md +0 -56
@@ -5,70 +5,97 @@ description: Use when testing whether a new skill improves agent behavior, or wh
5
5
 
6
6
  # Evaluating Skills
7
7
 
8
- ## Overview
8
+ Skill development has two phases: **drafting** (`slow-powers:writing-skills`) and **evaluation** (this skill). This skill owns the *craft* of evaluation — deciding whether a change needs measuring, designing test cases, devising pressure-testing scenarios, writing assertions, and reading results. The *mechanics* of actually running an eval — building the workspace, staging skills, dispatching subagents, grading, aggregating — are owned by a dedicated tool, **[`@slowdini/eval-runner`](https://www.npmjs.com/package/@slowdini/eval-runner)**, run via `bunx @slowdini/eval-runner`. See [Running the eval](#running-the-eval) for the hand-off.
9
9
 
10
- Skill development has two phases: **drafting** (`slow-powers:writing-skills`) and **evaluation** (this skill).
10
+ ## Overview
11
11
 
12
12
  An eval is a structured measurement of whether a skill actually shifts behavior. Each test case is a realistic prompt; each run dispatches a fresh general-purpose subagent twice — once with the skill loaded, once without (or once with the prior version, once with the revised version) — and grades the outputs against assertions. Pass-rate deltas tell you whether the skill is worth shipping or the change is worth landing.
13
13
 
14
- This skill is harness-agnostic. Run records use a portable JSON schema so evals authored on one harness (Claude Code, Codex, OpenCode) can be executed and graded on any other. See `schema/run-record.schema.json`.
14
+ Evals are harness-agnostic: run records use a portable JSON schema so an eval authored on one harness (Claude Code, Codex, OpenCode) can be executed and graded on any other. The runner documents the schema and the per-harness specifics; this skill stays at the level of *what makes a good eval*.
15
15
 
16
- ## Two comparison modes
16
+ ### Two comparison modes
17
17
 
18
- ### Mode A — New skill comparison
18
+ - **Mode A — new skill.** Compares `with_skill` vs `without_skill`. Use when validating that a brand-new skill beats baseline behavior with no skill loaded.
19
+ - **Mode B — revision (the common case).** Compares `old_skill` vs `new_skill`. Use when testing a language change to an existing skill — snapshot the old `SKILL.md`, then run both variants against the same prompts. A negative or zero delta is a signal to revert: the new language did not improve behavior.
19
20
 
20
- Compares `with_skill/` vs `without_skill/`. Use when validating a brand-new skill actually beats baseline behavior with no skill loaded. This is the agentskills.io default.
21
+ The runner implements both; pick the mode that matches the change you're measuring.
21
22
 
22
- ### Mode B Revision comparison
23
+ ## Choosing to test with evals
23
24
 
24
- Compares `old_skill/` vs `new_skill/`. **This is the common case.** Use when testing a language change to an existing skillsnapshot the old SKILL.md as a baseline, then run both variants against the same prompts.
25
+ Before you build an eval, decide whether this change needs one. An eval measures whether words on a page shift **contingent** behavior what the agent does when the outcome is genuinely in doubt: under pressure, with ambiguity, or against a competing goal. That is where measurement earns its cost.
25
26
 
26
- Mode B workflow (edit-first the usual order):
27
- 1. Edit the skill (the new version is now in the working tree)
28
- 2. Snapshot the old version straight from git: `snapshot --label <tag> --ref HEAD` (any commit/tag/branch works; `--ref` reads git without touching the working tree)
29
- 3. Run the eval with `--mode revision --baseline <snapshot-label>`
30
- 4. Grade and aggregate; review the delta
27
+ A **deterministic** change doesn't move that needle. Removing a one-line "announce out loud that you're using this skill" instruction, fixing a typo, or shipping a manually-invoked, testing-only procedure changes what the agent is told, not whether it complies under pressure. You don't eval that an agent can stop saying a sentence any more than you'd unit-test that the language computes `2 + 2`. Following an unambiguous instruction is the runtime contract; evals test the contingent logic built on top of it.
31
28
 
32
- If you snapshot *before* editing, omit `--ref` in step 2 (it reads the working tree) and do it ahead of step 1.
29
+ **The question for every skill change:** does it alter contingent behavior, or is it deterministic instruction-following?
33
30
 
34
- A negative or zero delta is a signal to revert the change — the new language did not improve behavior.
31
+ - **Contingent eval (this is the default).** Renaming a skill so its trigger fires reliably, editing a discipline-enforcing skill's rationalization table, changing the wording that decides a pressured choice — the agent might behave differently, and you can't know which way without measuring.
32
+ - **Deterministic → declare and skip.** State the decision and your reasoning to the user, then skip the eval. The reasoning is not optional: a silent skip is indistinguishable from dodging the work.
35
33
 
36
- ## Running an eval on a skill
34
+ **Either way, announce the decision and why** — "deterministic instruction removal, no eval" or "this changes pressured compliance, I'll run an eval." A visible decision is one the user can override; a silent one is a rationalization waiting to happen. **The door stays open:** if the user wants an eval anyway, run a worthwhile one — design real cases, don't phone it in to confirm a foregone conclusion.
37
35
 
38
- The eval runner ships with this skill (under `runner/`), so you can evaluate any skillincluding your own personal skills not just the ones in this repo. The detailed steps depend on your harness:
36
+ **Skill type is a fast read, not a verdict.** Reference and manually-invoked procedural changes *often* land deterministic; discipline, technique, and pattern changes *often* carry contingency (`pressure-scenarios.md` draws the same line under "When to use" / "Don't use them for"). Use type to orient your first guess never as the answer. The decision is per *change*, not per type: a deterministic typo fix in a discipline-enforcing skill still skips, and restructuring a reference doc because the agent kept missing a section is contingent and earns an eval.
39
37
 
40
- - **Claude Code:** follow `harness-details/claude.md` end-to-end (resolving the bundled runner, dispatching subagents via the Task tool, locating transcripts, grading).
41
- - **Other harnesses:** there's no detailed guide yet. The portable run-record schema (`schema/run-record.schema.json`) and the runner contract below still apply; you'll need to (a) locate the installed slow-powers plugin on disk, (b) dispatch subagents with your harness's primitive, and (c) supply a transcript adapter under `runner/adapters/` against the portable format if you want `transcript_check` assertions to grade. Otherwise author `run.json` and `timing.json` by hand per the schemas in `schema/`.
38
+ ## The Iron Law
42
39
 
43
- ### The runner contract (all harnesses)
40
+ **No skill shipped without passing evals. No behavior-shaping change landed without a positive revision delta.**
44
41
 
45
- The runner takes two required flags:
42
+ Once you've judged a change behavior-shaping, the law is absolute — these are not exemptions:
46
43
 
47
- - `--skill-dir <path>` — a directory containing one or more skill folders. **This directory is the eval's test environment.** Every skill in it is staged for the subagent: the skill-under-test under a unique slug, every *other* skill under its natural name.
48
- - `--skill <name>` which subdirectory of `--skill-dir` to evaluate.
44
+ - Not for "simple additions"
45
+ - Not for "just adding a section"
46
+ - Not for "documentation updates"
47
+ - Not for "obviously the same as before"
49
48
 
50
- Optional flags: `--bootstrap <path>` (see *Bootstrap content* below), `--workspace-dir <path>` (defaults to `<cwd>/skills-workspace`), `--mode new-skill|revision`, `--baseline <label>`, `--only <id,...>` / `--skip <id,...>` (run only / all-but the named eval ids for cost-conscious reduced-set runs without editing `evals.json`; mutually exclusive, errors on an unknown id), `--harness`, `--no-stage`, `--dry-run`, `--guard` (Claude Code only — arm the write guard; see *Sandboxing eval subagents*), `--plan-mode` (Claude Code only — inject the harness's verbatim plan-mode procedure as an operating-context layer; opt-in, for plan-mode-relevant skills only; see *Seeding conversation context (and its ceiling)*).
49
+ If you can't measure a behavioral change, you don't know if it's an improvement. Tuning behavior-shaping language without a benchmark drifts the skill sometimes silentlytoward worse behavior under pressure. (The narrow, declared exception for deterministic changes is above; "deterministic" is a judgment you announce and defend, not a backdoor around this law.)
51
50
 
52
- Each iteration lands under `<workspace-dir>/<skill>/iteration-N/` with the same tree described in *Workspace layout* below, plus a machine-readable `dispatch.json` and a human-readable `dispatch-manifest.md`. The end product is `benchmark.json`: read its `run_summary`, `delta`, and `validity_warnings`.
51
+ ### Common rationalizations
53
52
 
54
- #### What gets staged
53
+ Excuses for skipping an eval on a change you've already judged behavior-shaping. None of them hold.
55
54
 
56
- The runner stages every skill it finds under `--skill-dir`. The skill-under-test goes under a unique slug for the `__skill_invoked` meta-check — with its sibling asset files (any non-`SKILL.md`, non-`evals/` content) copied alongside, so a multi-file skill whose `SKILL.md` links a companion doc (e.g. `[code-review.md](code-review.md)`) still resolves once staged; sibling skills stage under their natural names so cross-references resolve. **If your `--skill-dir` contains only your one skill, the eval runs in isolation** — references like "REQUIRED SUB-SKILL: `slow-powers:test-driven-development`" won't resolve, and your assertions must not depend on a sibling skill firing. To include other skills as siblings, copy or symlink them into `--skill-dir` before running.
55
+ | Excuse | Reality |
56
+ |--------|---------|
57
+ | "Change is obviously an improvement" | Then proving it is fast. Run the eval. |
58
+ | "Existing skill works fine" | Until pressure-test scenario X. Run the eval. |
59
+ | "No time to author test cases" | The cost of shipping a regression is higher than 30 minutes of eval design. |
60
+ | "It's just rewording" | Wording IS the skill. Reword = changed skill. Run the eval. |
61
+ | "Eval results are noisy" | Then add runs, not skip the eval. |
62
+ | "Pass rate was already 100%" | Then the assertion is too easy. Replace it. |
63
+ | "I'll just call it deterministic" | Deterministic means the agent's compliance isn't in doubt — not that you'd rather not measure. If the wording could change a pressured choice, it's behavioral. Run the eval. |
64
+
65
+ ## Pre-flight gate (required)
66
+
67
+ An eval run is not free. Each test case dispatches a fresh subagent **per condition** — an N-case suite is `2N` full agent sessions, plus a judge dispatch for every `llm_judge` assertion. That is real wall-clock time and real tokens, and a subagent under test can write outside its sandbox and pollute the real workspace. **Never kick off a run silently.**
68
+
69
+ Before building the workspace and dispatching anything, STOP and present the user a run summary, then wait for explicit confirmation:
70
+
71
+ - **Skill under test** — name and path
72
+ - **Mode** — `new-skill` (with vs without) or `revision` (old vs new), plus the baseline label for revision mode
73
+ - **Eval cases** — the count and a one-line list of the prompts (from `evals.json`)
74
+ - **Models** — the model that will run each subagent under test, and the judge model for `llm_judge` assertions. The runner never dispatches these itself, so it can't observe them — state them explicitly so the user can correct a wrong choice before tokens are spent.
75
+ - **Cost** — `2N` agent dispatches plus judge dispatches; call out that this is time- and token-intensive
76
+ - **Sandbox** — the guard status (on Claude Code, arming the runner's `--guard` is the default; proceed unguarded only on an explicit opt-out, and warn that stray writes will then only be detected after the fact, never blocked)
77
+
78
+ Do not dispatch until the user confirms *this summary*. An earlier "run the eval" is not confirmation — the summary may reveal a wrong mode, the wrong model, or a missing guard the user never intended. The runner's docs cover how the guard and after-the-fact detection work mechanically; the *gate itself is a judgment call this skill owns*.
57
79
 
58
- #### Bootstrap content
80
+ ### Red Flags — STOP before dispatching
59
81
 
60
- Every dispatch prompt includes an available-skills block listing the skills staged for this eval (auto-built by the runner), rendered in the harness's native presentation so the dispatch reads like a real session rather than an eval. If you also want product-specific framing prepended — instruction priority rules, planning guidelines, anything you'd put in a SessionStart hook — author a Markdown file and pass it via `--bootstrap <path>`. The runner emits the file verbatim inside a `<session-start-context>` block, before the available-skills block. Omit `--bootstrap` and the dispatch carries only the available-skills block, nothing else.
82
+ - About to dispatch subagents without showing the user the run summary first
83
+ - Running on a guard-capable harness without the guard and without an explicit opt-out from the user
84
+ - "The user already said run it" — they said it before seeing the cost, models, and guard status
85
+ - Spending tokens to "just see what happens" before the cases, mode, or models are confirmed
86
+
87
+ All of these mean: STOP. Present the pre-flight summary and wait for confirmation.
61
88
 
62
89
  ## Designing test cases
63
90
 
64
- A test case has three parts:
91
+ A test case has these parts:
65
92
 
66
93
  - **prompt**: a realistic user message — the kind a real user would actually type
67
94
  - **expected_output**: a human-readable description of success
68
95
  - **files** (optional): fixture files the prompt references
69
- - **skill_should_trigger** (optional, default `true`): set `false` for a *negative* eval where correct behavior is the skill **not** firing (e.g. an over-trigger guard — a feature request that shouldn't launch a debugging investigation). Negative evals are excluded from the skill-invocation rate and its validity warning, so a correct non-invocation isn't mistaken for the skill failing to fire.
96
+ - **skill_should_trigger** (optional, default `true`): set `false` for a *negative* eval where correct behavior is the skill **not** firing (e.g. an over-trigger guard — a feature request that shouldn't launch a debugging investigation). Negative evals are excluded from the skill-invocation rate, so a correct non-invocation isn't mistaken for the skill failing to fire.
70
97
 
71
- Stored in `<skill>/evals/evals.json`. See `templates/evals.json.example` and `examples/verifying-development-work-evals.json`.
98
+ Cases live in `<skill>/evals/evals.json`. For the file shape, see the bare scaffold the runner ships (`templates/evals.json.example` in `@slowdini/eval-runner`); for worked, maintained examples, read the live suites in this repo — e.g. `skills/verifying-development-work/evals/evals.json` and `skills/hardening-plans/evals/evals.json`.
72
99
 
73
100
  Tips for writing good prompts:
74
101
 
@@ -113,151 +140,18 @@ reads as redundant or as duplicated effort>
113
140
 
114
141
  Keep the seeded turns short and concrete; the point is to establish momentum, not to write a full session.
115
142
 
116
- **The ceiling — state it plainly.** A seed is *text the subagent reads*, not a state it operates under. It cannot place the agent in a harness-injected mode — a real plan mode, an enforced multi-phase workflow, genuine context-window pressure — it can only *describe* one. So when the wild failure you're chasing was *caused* by such a mode (the documented case: an agent in plan mode that invoked **zero** skills because the mode's own procedure made loading them feel redundant), a text seed cannot fully reproduce it — the causal layer is exactly the one a prompt string can't inject. A seeded **pass is therefore necessary but not sufficient** — it under-estimates real-session difficulty — and a seed that *fails* to reproduce a known wild failure is usually hitting this ceiling, not testing a bad seed. Treat seeded results as a stronger-than-cold signal, not as ground truth, and don't let downstream work over-trust them. Faithfully reproducing a mode-caused failure needs a real harness mode the runner can't inject today — track that as a parity goal.
143
+ **The ceiling — state it plainly.** A seed is *text the subagent reads*, not a state it operates under. It cannot place the agent in a harness-injected mode — a real plan mode, an enforced multi-phase workflow, genuine context-window pressure — it can only *describe* one. So when the wild failure you're chasing was *caused* by such a mode (the documented case: an agent in plan mode that invoked **zero** skills because the mode's own procedure made loading them feel redundant), a text seed cannot fully reproduce it — the causal layer is exactly the one a prompt string can't inject. A seeded **pass is therefore necessary but not sufficient** — it under-estimates real-session difficulty — and a seed that *fails* to reproduce a known wild failure is usually hitting this ceiling, not testing a bad seed. Treat seeded results as a stronger-than-cold signal, not as ground truth, and don't let downstream work over-trust them.
117
144
 
118
- **Narrowing the gap — `--plan-mode`.** For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: `--plan-mode` injects the harness's *verbatim* plan-mode procedure (its rigid multi-phase terminal rail) into every dispatch as an operating-context layer the subagent is told it is operating under — a `<system-reminder>` block after the session-start surfaces — rather than a paraphrase the agent merely reads in the seed prose. The profile is a per-harness asset (`runner/profiles/<harness>/plan-mode.md`); it is opt-in and meant only for plan-mode-relevant skills (a harness without a profile errors, leaving the portable contract unchanged). This narrows the gap (verbatim procedure > paraphrase) but does **not** close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm to measure whether removing the invoke-hint lets `with_skill` invocation de-saturate.
119
-
120
- ## Pre-flight gate (required)
121
-
122
- An eval run is not free. Each test case dispatches a fresh subagent **per condition** — an N-case suite is `2N` full agent sessions, plus a judge dispatch for every `llm_judge` assertion. That is real wall-clock time and real tokens, and a subagent under test can write outside its sandbox and pollute the real workspace. **Never kick off a run silently.**
123
-
124
- Before building the workspace and dispatching anything, STOP and present the user a run summary, then wait for explicit confirmation:
125
-
126
- - **Skill under test** — name and path
127
- - **Mode** — `new-skill` (with vs without) or `revision` (old vs new), plus the baseline label for revision mode
128
- - **Eval cases** — the count and a one-line list of the prompts (from `evals.json`)
129
- - **Models** — the model that will run each subagent under test, and the judge model for `llm_judge` assertions. The runner never dispatches these itself, so it can't observe them — state them explicitly so the user can correct a wrong choice before tokens are spent.
130
- - **Cost** — `2N` agent dispatches plus judge dispatches; call out that this is time- and token-intensive
131
- - **Sandbox** — the guard status (see below)
132
-
133
- Do not dispatch until the user confirms *this summary*. An earlier "run the eval" is not confirmation — the summary may reveal a wrong mode, the wrong model, or a missing guard the user never intended.
134
-
135
- ### Sandbox decision
136
-
137
- A subagent under test runs the real skill, and some skills write to disk — the skill that triggered this gate, `working-in-isolation`, creates git worktrees in whatever repo it's pointed at. Without active enforcement those writes land in your working directory.
138
-
139
- - **Guard available (Claude Code):** arming `--guard` is the default. If you are about to run without it, STOP. Proceed unguarded **only** when the user actively opts out — and warn them that stray writes will then only be **detected after the fact** by `detect-stray-writes`, never blocked or reverted, so anything a subagent writes outside its `outputs/` dir (worktrees, installed packages, edited repo files) persists and is theirs to clean up.
140
- - **Guard unavailable (other harnesses):** there is no active write enforcement. Tell the user plainly: stray writes are detected and reported by `detect-stray-writes` but **not auto-cleaned** — they must review the report and remove anything that escaped. Harness-level write enforcement is tracked as a parity goal in `harness-parity.md`.
141
-
142
- ## Red Flags — STOP before dispatching
143
-
144
- - About to dispatch subagents without showing the user the run summary first
145
- - Running on a guard-capable harness without `--guard` and without an explicit opt-out from the user
146
- - "The user already said run it" — they said it before seeing the cost, models, and guard status
147
- - Spending tokens to "just see what happens" before the cases, mode, or models are confirmed
148
-
149
- All of these mean: STOP. Present the pre-flight summary and wait for confirmation.
150
-
151
- ## Running evals
152
-
153
- For each test case, dispatch fresh general-purpose subagents — one per condition. Each subagent receives:
154
-
155
- - The prompt verbatim
156
- - Any fixture files
157
- - The output directory path
158
- - (Conditionally) the path to the SKILL.md to load
159
-
160
- Subagents MUST start with clean context. State leaking from previous runs invalidates the comparison.
161
-
162
- Each run needs a portable **run record** (`run.json`, matching `schema/run-record.schema.json`) and a timing record (`timing.json`) holding:
163
-
164
- - `total_tokens` and `duration_ms`
165
- - The final user-facing message
166
- - The tool invocations (best effort — see "Transcript access" below)
167
-
168
- On a harness with persisted transcripts (Claude Code), `record-runs` assembles both records from disk after the dispatches — nothing is captured by hand. On a transcript-less harness, capture them manually when each subagent completes: tokens/duration come from the harness's task completion event (**these may not be persisted anywhere else; save them immediately**), and the record is written via that harness's adapter or by hand.
169
-
170
- ### Driving the eval loop
171
-
172
- The agent itself drives the entire loop from inside a normal agent session:
173
-
174
- 1. The agent invokes the runner via Bash to build the workspace (same command as above).
175
- 2. The agent reads the generated `dispatch.json` (machine-readable sibling of the manifest). Each task object points at a `dispatch_prompt_path` (a file holding the full prompt), an `agent_description` to pass through as the dispatch description, and exact `run_record_path` and `timing_path` to write to. The prompt lives in a file rather than inline in `dispatch.json` so the agent never has to reproduce kilobytes of prompt text per dispatch. The `agent_description` is namespaced with the iteration and a per-run nonce (`<eval_id>:<condition>:i<N>-<nonce>`) so transcripts from different iterations sharing one session's subagents dir can't collide — **pass it verbatim; do not reconstruct it from the eval id and condition.**
176
- 3. For each task, the agent dispatches a fresh subagent using its host's primitive, instructing it to read the file at `dispatch_prompt_path` and follow it exactly, and passing `agent_description` verbatim as the dispatch `description`. Passing the description through unchanged is what lets the transcript adapter correlate transcripts to runs in step 4.
177
- 4. (Claude Code) After all dispatches return, the agent runs `bun run evals:ingest` once — a fixed-order chain of record-runs (assembles every task's `run.json` from `dispatch.json` + the subagent's own `outputs/final-message.md` + the persisted transcript, and backfills `timing.json` with transcript-derived tokens/duration, `"source": "transcript"`; never clobbers a record that already exists), fill-transcripts, detect-stray-writes (see *Sandboxing eval subagents* below), and the grader. It stops where only the agent can act: dispatching a judge subagent for each `llm_judge` assertion — same pattern as step 3: read a tasks file, dispatch, write results back to a path.
178
- 5. (Claude Code) After the judges return, the agent runs `bun run evals:finalize` — grade `--finalize` then the aggregator — and reads the benchmark.
179
- 6. (Other harnesses) The portable path is the same loop run by hand: when each subagent returns, the agent writes the run record to `run_record_path` and the timing record to `timing_path` itself (without a transcript adapter, `tool_invocations` stays `[]` and `transcript_check` assertions grade as unverifiable), then runs the grader, dispatches judges, finalizes, and aggregates as individual commands.
180
-
181
- Agent-driven mode is the common case because the framework is most useful from inside the harness where the skill is being iterated. Use it when you want a single in-session "run the eval and report the delta" flow.
182
-
183
- ### Transcript access
184
-
185
- `transcript_check` assertions match regex patterns against a run's `tool_invocations`. Filling that list depends on what the harness exposes:
186
-
187
- - **Claude Code:** subagent transcripts are persisted to `~/.claude/projects/<project-slug>/<parent-session-id>/subagents/agent-<id>.jsonl`, with a sibling `.meta.json` recording the dispatch description. The runner emits a unique `agent_description` (`<eval_id>:<condition>:i<N>-<nonce>`) field on each task; pass that string verbatim as the Agent tool's `description` when dispatching. The nonce namespaces the description per run, so `fill-transcripts` reads each task's `agent_description` straight from `dispatch.json` and can't cross-match a colliding agent from another iteration. After all dispatches complete, run `bun run evals:fill-transcripts --skill <name> --iteration <N> --subagents-dir <path>` to populate `tool_invocations` on every run record via the adapter at `runner/adapters/claude-code-transcript.ts`.
188
- - **Other harnesses (no transcript access):** the agent records only `final_message`; `tool_invocations` stays empty and `transcript_check` assertions grade as `unverifiable`. Lean on `llm_judge` for substantive checks. This is an honest limitation, not a bug.
189
- - **Operator-driven mode on Claude Code:** the operator can run `evals:fill-transcripts` after the fact, since they have filesystem access to the persisted transcripts.
190
-
191
- Design your assertions accordingly. For maximally portable evals, lean on `llm_judge` for the substantive checks and use `transcript_check` for cheap mechanical signals where the adapter is available.
192
-
193
- ## Sandboxing eval subagents
194
-
195
- The dispatch prompt tells each subagent to write only inside its `outputs/` dir, but nothing in the portable contract *enforces* that — a misbehaving subagent can edit the real repo or run `npm install` against the repo root, silently corrupting the very runner it's being measured by. Two layers guard against this:
196
-
197
- - **Detection (all harnesses).** After `fill-transcripts` populates `tool_invocations`, run `bun run evals:detect-stray-writes --skill <name> --iteration <N>`. It reads each task's `outputs_dir` from `dispatch.json` and scans the invocations for **violations** (`Write`/`Edit`/`MultiEdit`/`NotebookEdit` whose path resolves outside the run's `outputs/`) and **warnings** (Bash commands matching install/`git`/`sed -i`/redirection patterns that don't reference `outputs/`). Findings land in `stray-writes.json`; the aggregator turns each run with violations into a `validity_warnings` entry, so a tainted data point is flagged the same way a missed skill invocation is. This is portable because it works off the same transcripts the adapters already parse — but it only *reports*, after the fact; it never reverts what a subagent wrote.
198
- - **Hard guard (Claude Code, default posture).** `--guard` stages a `PreToolUse` hook that actively *blocks* out-of-bounds writes and installs while the subagents run — including `git worktree add` and Bash that creates files under `.claude` or a bare `skills/`. On Claude Code it is the default — the *Pre-flight gate* requires you to arm it unless the user explicitly opts out. It's Claude-Code-specific; see `harness-details/claude.md`. Harness-level write enforcement is tracked as a parity goal in `harness-parity.md`.
199
-
200
- A run ends with teardown (`bun run evals:teardown --skill <name>`, or the `teardown` runner command): it disarms the guard, removes the staged skill set the runner created under `<cwd>/.claude/skills/`, **and** reclaims the skill's `skills-workspace/` artifacts, so a completed run leaves nothing behind that wasn't meant to be committed. Pre-existing project skills and `.claude/settings.json` are left intact. Teardown only deletes what's safe: iterations whose results are committed (it keys off the `.promoted.json` marker `promote-baseline` drops) and snapshots reproducible from a git ref. Iterations with results you haven't promoted, and working-tree snapshots, are **preserved** with a warning telling you to promote or discard them. Pass the same `--workspace-dir` you ran with if you used a custom one.
201
-
202
- ## Workspace layout
203
-
204
- Per skill being evaluated:
205
-
206
- ```
207
- <skill>-workspace/ # outside the skill directory
208
- snapshots/ # Mode B baselines, persist across iterations
209
- <label>/SKILL.md
210
- iteration-N/
211
- eval-<id>/
212
- <condition-a>/ # e.g. with_skill, old_skill
213
- outputs/ # files the subagent produced
214
- run.json # portable run record
215
- timing.json # tokens + duration
216
- grading.json # assertion results
217
- <condition-b>/ # e.g. without_skill, new_skill
218
- outputs/
219
- run.json
220
- timing.json
221
- grading.json
222
- conditions.json # what each condition is, which SKILL.md it loaded
223
- benchmark.json # aggregate stats
224
- skill-snapshot.md # frozen SKILL.md at run time
225
- ```
226
-
227
- The only file you author by hand is `<skill>/evals/evals.json`. Everything in the workspace tree is produced by the runner or by grading.
145
+ **Narrowing the gap — `--plan-mode`.** For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: its `--plan-mode` flag injects the harness's *verbatim* plan-mode procedure into every dispatch as an operating-context layer the subagent is told it is operating under, rather than a paraphrase the agent merely reads in the seed prose. This narrows the gap (verbatim procedure > paraphrase) but does **not** close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm. See the runner's docs for the flag and the per-harness profiles it depends on.
228
146
 
229
147
  ## Writing assertions
230
148
 
231
- After iteration 1, you've seen what the outputs look like. Now write **assertions**: verifiable statements about correctness. Add them to `evals.json` and re-grade existing outputs without re-dispatching.
232
-
233
- Two assertion types:
234
-
235
- ### `transcript_check` — mechanical
236
-
237
- Match patterns in tool invocations. Fast, deterministic, cheap. Use for "did the agent run X" or "did file Y get written."
149
+ After iteration 1, you've seen what the outputs look like. Now write **assertions**: verifiable statements about correctness. Add them to `evals.json` and re-grade existing outputs without re-dispatching. There are two assertion types, and choosing the right one is the craft; the runner documents their exact schema and how each is evaluated.
238
150
 
239
- ```json
240
- {
241
- "id": "ran_test_command",
242
- "type": "transcript_check",
243
- "check": "tool_invocation_matches",
244
- "pattern": "bun (test|run test)"
245
- }
246
- ```
247
-
248
- ### `llm_judge` — judged
249
-
250
- Soft criteria a model evaluates. Use for "did the response quote actual evidence" or "did the agent refuse to claim success without proof."
151
+ - **`transcript_check` — mechanical.** Matches patterns against a run's tool invocations. Fast, deterministic, cheap. Use for "did the agent run X" or "did file Y get written." Depends on the harness exposing transcripts (full support on Claude Code; on transcript-less harnesses these grade as unverifiable — lean on `llm_judge` there).
152
+ - **`llm_judge` — judged.** Soft criteria a model evaluates. Use for "did the response quote actual evidence" or "did the agent refuse to claim success without proof." Portable across all harnesses.
251
153
 
252
- ```json
253
- {
254
- "id": "quoted_test_output",
255
- "type": "llm_judge",
256
- "rubric": "Did the final message include actual test runner output (e.g., '1 pass', checkmarks, file:line indicators), or did it just say 'tests pass' without quoting evidence?"
257
- }
258
- ```
259
-
260
- `model` is an optional override. By default, use whatever model the harness operator has configured for judge dispatches.
154
+ For maximally portable evals, lean on `llm_judge` for the substantive checks and use `transcript_check` for cheap mechanical signals where the adapter is available.
261
155
 
262
156
  ### Principles
263
157
 
@@ -266,187 +160,45 @@ Soft criteria a model evaluates. Use for "did the response quote actual evidence
266
160
  - **Not too brittle.** "Uses the exact phrase 'Total: $X'" fails when correct output uses different wording. Reserve mechanical exactness for actually-mechanical things.
267
161
  - **Review the assertions while grading.** Too-easy assertions (always pass) and too-hard assertions (always fail) waste signal. Fix them before the next iteration.
268
162
 
269
- ## Skill-invocation meta-check
270
-
271
- Every run with a skill loaded gets an automatic meta-assertion: **did the skill actually influence behavior, or would the response look identical without it?**
272
-
273
- The framework injects this check (reserved id `__skill_invoked`) for every condition whose `skill_path` is non-null — **except** evals marked `skill_should_trigger: false`, where the skill is *supposed* not to fire and a non-invocation is the correct outcome. It does **not** count toward the substantive `pass_rate`; results land in `grading.json` under `meta_results` and `meta_summary`, and the benchmark surfaces an `invocation_rate` per condition.
274
-
275
- Why this matters: a run where the skill wasn't actually invoked is a non-data-point. If `with_skill` scores poorly but the meta-check shows the skill wasn't applied, that's not evidence the skill is bad — it's evidence the prompt didn't trigger the skill's applicability. Conversely, a clean invocation rate validates that substantive pass-rate deltas reflect skill effectiveness.
276
-
277
- The check has two tiers, chosen automatically per run:
278
-
279
- - **Code-based (Claude Code).** On harnesses that persist subagent transcripts with discrete `Skill` tool calls, the framework parses the transcript and checks for a `Skill` invocation whose `input.skill` matches the eval-staged slug. This is deterministic, free, and cannot be fooled by superficial vocabulary in the response.
280
- - **LLM-judge fallback (other harnesses).** Where transcripts aren't available or the harness injects skills via system-prompt hooks rather than a tool call (Codex, OpenCode), a judge subagent compares the agent's `final_message` against the SKILL.md content embedded in the run record, looking for behavioral fingerprints — distinctive vocabulary, named sections, procedural steps that mirror the skill's phrasing. It does **not** require the agent to explicitly cite the skill (that would taint the eval).
281
-
282
- To enable the code-based check on Claude Code, the runner stages each condition's SKILL.md snapshot (plus the skill's sibling asset files) at `<repoRoot>/.claude/skills/slow-powers-eval-<iteration>-<condition>__<skillName>/SKILL.md`. The unique slug prevents collisions with already-installed production skills (relevant when evaluating skills in a repo where the same skills are also installed) and is what the code-based check looks for in the transcript. The slug prevents an on-disk *collision*, not runtime *discovery*: if the same skill is also provided by an installed, **enabled** plugin, the subagent can still discover and invoke that copy — contaminating both arms (the control arm is no longer skill-absent). On Claude Code the runner flags this at build time (a "plugin-shadow" warning, also surfaced in `benchmark.json`'s `validity_warnings`), but cannot unload a live plugin; to remove the installed copy, run the eval from a plugin-isolated session — see `harness-details/claude.md` → *Isolating from installed plugins*. The dispatch prompt deliberately omits any inline `<skill>...</skill>` block so the subagent must discover and invoke the staged skill naturally — this measures whether the skill's `description:` actually triggers it. Stale staged skills are swept at the start of each fresh run. Pass `--no-stage` to opt out (e.g., when running the same eval against a harness that doesn't support project-local skill discovery); the runner will fall back to inlining the SKILL.md text in the dispatch prompt, and the LLM-judge meta-check will be used. The inline fallback carries only the SKILL.md text — sibling asset files aren't inlined — so a multi-file skill whose behavior depends on a linked companion doc needs the staged path, not `--no-stage`.
283
-
284
- The aggregator emits a `validity_warnings` array when any with-skill condition has an invocation rate below 100%. Read those before interpreting the substantive delta. The rate is computed only over evals where the skill *should* fire; negative evals (`skill_should_trigger: false`) are excluded so a correct non-trigger never depresses the rate or raises a spurious warning.
285
-
286
- ## Grading
287
-
288
- For each `(eval × condition)`, produce `grading.json`:
289
-
290
- ```json
291
- {
292
- "assertion_results": [
293
- { "id": "ran_test_command", "passed": true, "evidence": "Bash invocation at ordinal 4: 'bun test'", "confidence": 1.0 },
294
- { "id": "quoted_test_output", "passed": false, "evidence": "Final message says 'tests pass' but does not include any test runner output.", "confidence": 0.9 }
295
- ],
296
- "summary": { "passed": 1, "failed": 1, "total": 2, "pass_rate": 0.5 }
297
- }
298
- ```
299
-
300
- `transcript_check` results come from a script. `llm_judge` results come from dispatching a fresh judge subagent with the rubric + run record + outputs — see `templates/judge-prompt.md`.
163
+ Every with-skill run also gets an automatic **skill-invocation meta-check** — did the skill actually influence behavior, or would the response look identical without it? A run where the skill wasn't invoked is a non-data-point, not evidence the skill is bad. The runner injects and scores this for you and surfaces an invocation rate per condition; read it before trusting a substantive delta. (Mechanics in the runner's docs.)
301
164
 
302
- ## Aggregating
165
+ ## Reading results and iterating
303
166
 
304
- Once every run is graded, compute `benchmark.json`:
167
+ Once a run is graded and aggregated, the headline is the **delta**: what the skill costs (time, tokens) and what it buys (pass-rate improvement). A skill that adds 13 seconds and 1700 tokens but improves pass rate by 50 points is probably worth it; one that doubles tokens for a 2-point gain is probably not. For Mode B, a positive `delta.pass_rate` means the revision is an improvement.
305
168
 
306
- ```json
307
- {
308
- "run_summary": {
309
- "with_skill": { "pass_rate": { "mean": 0.83 }, "duration_ms": { "mean": 45000 }, "total_tokens": { "mean": 3800 } },
310
- "without_skill": { "pass_rate": { "mean": 0.33 }, "duration_ms": { "mean": 32000 }, "total_tokens": { "mean": 2100 } },
311
- "delta": { "pass_rate": 0.50, "duration_ms": 13000, "total_tokens": 1700 }
312
- }
313
- }
314
- ```
315
-
316
- The delta tells you what the skill costs and what it buys. A skill that adds 13 seconds and 1700 tokens but improves pass rate by 50 percentage points is probably worth it. A skill that doubles tokens for a 2-point improvement is probably not.
317
-
318
- For Mode B, the `run_summary` keys are `old_skill` and `new_skill`. The same logic applies — positive `delta.pass_rate` means the revision is an improvement.
319
-
320
- ## Version-controlled baselines
321
-
322
- The full workspace tree is ephemeral and gitignored — it churns on every run. But two parts of a *canonical* run are worth keeping under version control: the `benchmark.json` delta (the headline "this skill earns its place" number) and the per-run `grading.json` rationales (why each assertion passed or failed, useful to reference when iterating later). Promote those into the skill's tracked `evals/baseline/` directory:
323
-
324
- ```bash
325
- bun run evals:promote-baseline -- --skill <name> --iteration <N> [--label <tag>] [--agent-model <id>] [--judge-model <id>]
326
- ```
327
-
328
- This copies `benchmark.json` and each `eval-<id>/<condition>/grading.json` (as `grading/<eval-id>__<condition>.json`) into `<skill>/evals/baseline/`, and writes a `BASELINE.md` recording the mode, iteration, harness, and run timestamp. Everything else in the workspace stays out of git.
329
-
330
- The runner never dispatches the agent or judge itself, so it can't observe which models ran. Pass `--agent-model` (the model that ran the agent-under-test) and `--judge-model` (the model that graded `llm_judge` assertions) to record them as provenance rows in `BASELINE.md`; both default to `unspecified` when omitted. Record the resolved id you actually used (e.g. `claude-haiku-4-5-20251001`), even if your harness only let you pick a coarse `haiku`/`opus`/`sonnet` tier at dispatch.
331
-
332
- ```
333
- skills/<skill>/evals/baseline/
334
- BASELINE.md # provenance
335
- benchmark.json # the committed delta
336
- grading/<eval-id>__<condition>.json # judge rationales per run
337
- NOTES.md # optional — forward-looking observations
338
- ```
339
-
340
- `NOTES.md` is an **optional** companion file for forward-looking observations from the runs that produced this baseline: which evals discriminated and which didn't, suspected variance/noise, ideas for the next iteration (skill changes or eval-suite changes), and any context a future iterator would want before re-running. It is not provenance (that belongs in `BASELINE.md`) and not results (those belong in `benchmark.json`). `promote-baseline` does not generate it and does not overwrite an existing one — author it by hand alongside the promoted baseline when there's something worth carrying forward; otherwise omit it.
341
-
342
- This works the same for a personal skill: point `--skill-dir` at your skill's parent, run a canonical eval, then promote — you get a committed reference of what "passing" looked like for your skill, equivalent to the baselines slow-powers ships for its own skills.
343
-
344
- ## Analyzing patterns
345
-
346
- After aggregating:
347
-
348
- - **Replace assertions that always pass in both conditions.** They don't measure skill value. They inflate the with-skill pass rate without reflecting actual skill contribution.
349
- - **Investigate assertions that always fail in both conditions.** Either the assertion is broken, the test case is too hard, or the assertion checks the wrong thing. Fix before next iteration.
350
- - **Study assertions that pass with the skill but fail without it.** This is where the skill is adding value. Understand *why* — which instructions made the difference?
351
- - **Tighten instructions when results are inconsistent.** High stddev = ambiguous instructions or model variability. Add examples or more specific guidance.
352
- - **Read time/token outliers.** If one run is 3× longer than others, read its transcript for the bottleneck.
353
-
354
- ## Human review
355
-
356
- Assertion grading catches what you wrote assertions for. A human reviewer catches everything else — issues you didn't anticipate, outputs that are technically correct but miss the point, problems hard to express as pass/fail.
357
-
358
- Save per-eval reviewer notes in `feedback.json`:
359
-
360
- ```json
361
- {
362
- "eval-claim-without-running": "Output ran `bun test` correctly but the final message hedged ('looks like it passes') instead of quoting the actual output.",
363
- "eval-build-implied-by-edit": ""
364
- }
365
- ```
169
+ **Analyzing patterns:**
366
170
 
367
- Empty string = output looked fine. Focus the next iteration's improvements on the test cases where you had specific complaints.
171
+ - **Replace assertions that always pass in both conditions.** They don't measure skill value.
172
+ - **Investigate assertions that always fail in both conditions.** Either the assertion is broken, the case is too hard, or it checks the wrong thing. Fix before next iteration.
173
+ - **Study assertions that pass with the skill but fail without it.** This is where the skill adds value — understand *why*.
174
+ - **Tighten instructions when results are inconsistent.** High stddev = ambiguous instructions or model variability.
175
+ - **Read time/token outliers.** If one run is 3× longer, read its transcript for the bottleneck.
368
176
 
369
- ## Iterating
177
+ **Human review** catches what assertions don't — outputs that are technically correct but miss the point. Keep per-eval reviewer notes; an empty note means the output looked fine. Focus the next iteration on the cases you had specific complaints about.
370
178
 
371
- After iteration N, you have three signals:
179
+ **Guidance for revision:**
372
180
 
373
- - **Failed assertions** specific gaps (missing instruction, unclear rule, uncovered case)
374
- - **Reviewer feedback** → broader quality issues (wrong approach, poor structure)
375
- - **Execution transcripts** → why things went wrong (which instructions were ignored, where time was wasted)
376
-
377
- Feed all three plus the current SKILL.md to an LLM using `templates/revise-skill-prompt.md`. Then rerun in `iteration-N+1`.
378
-
379
- ### Guidance for revision
380
-
381
- - **Generalize from feedback.** The skill is used across many prompts, not just these test cases. Fixes should address underlying issues broadly, not patch specific examples.
181
+ - **Generalize from feedback.** The skill is used across many prompts, not just these cases. Fixes should address underlying issues broadly, not patch specific examples.
382
182
  - **Keep the skill lean.** Fewer, better instructions outperform exhaustive rules. If pass rates plateau despite more rules, try removing instructions.
383
- - **Explain the why.** Reasoning-based instructions ("Do X because Y") outperform rigid directives ("ALWAYS do X"). Models follow instructions more reliably when they understand the purpose.
384
- - **Bundle repeated work.** If every run independently wrote a similar helper script, bundle it into the skill's `scripts/` directory.
385
-
386
- ### When to stop
387
-
388
- - Pass rates are satisfactory and reviewer feedback is consistently empty
389
- - Iteration deltas have plateaued (no meaningful improvement between iterations)
390
- - You've identified a more fundamental issue (skill scope is wrong, prompts don't represent real use)
391
-
392
- ## Choosing to test with evals
393
-
394
- Before you build an eval, decide whether this change needs one. An eval measures whether words on a page shift **contingent** behavior — what the agent does when the outcome is genuinely in doubt: under pressure, with ambiguity, or against a competing goal. That is where measurement earns its cost.
395
-
396
- A **deterministic** change doesn't move that needle. Removing a one-line "announce out loud that you're using this skill" instruction, fixing a typo, or shipping a manually-invoked, testing-only procedure changes what the agent is told, not whether it complies under pressure. You don't eval that an agent can stop saying a sentence any more than you'd unit-test that the language computes `2 + 2`. Following an unambiguous instruction is the runtime contract; evals test the contingent logic built on top of it.
397
-
398
- **The question for every skill change:** does it alter contingent behavior, or is it deterministic instruction-following?
399
-
400
- - **Contingent → eval (this is the default).** Renaming `writing-plans` so its trigger fires reliably, editing a discipline-enforcing skill's rationalization table, changing the wording that decides a pressured choice — the agent might behave differently, and you can't know which way without measuring.
401
- - **Deterministic → declare and skip.** State the decision and your reasoning to the user, then skip the eval. The reasoning is not optional: a silent skip is indistinguishable from dodging the work.
402
-
403
- **Either way, announce the decision and why** — "deterministic instruction removal, no eval" or "this changes pressured compliance, I'll run an eval." A visible decision is one the user can override; a silent one is a rationalization waiting to happen. **The door stays open:** if the user wants an eval anyway, run a worthwhile one — design real cases, don't phone it in to confirm a foregone conclusion.
404
-
405
- **Skill type is a fast read, not a verdict.** Reference and manually-invoked procedural changes *often* land deterministic; discipline, technique, and pattern changes *often* carry contingency (`pressure-scenarios.md` draws the same line under "When to use" / "Don't use them for"). Use type to orient your first guess — never as the answer. The decision is per *change*, not per type: a deterministic typo fix in a discipline-enforcing skill still skips, and restructuring a reference doc because the agent kept missing a section is contingent and earns an eval.
406
-
407
- ## The Iron Law
408
-
409
- **No skill shipped without passing evals. No behavior-shaping change landed without a positive revision delta.**
183
+ - **Explain the why.** Reasoning-based instructions ("Do X because Y") outperform rigid directives. Models follow instructions more reliably when they understand the purpose.
410
184
 
411
- Once you've judged a change behavior-shaping, the law is absolute these are not exemptions:
185
+ **When to stop:** pass rates are satisfactory and reviewer feedback is consistently empty; iteration deltas have plateaued; or you've found a more fundamental issue (wrong scope, unrepresentative prompts).
412
186
 
413
- - Not for "simple additions"
414
- - Not for "just adding a section"
415
- - Not for "documentation updates"
416
- - Not for "obviously the same as before"
417
-
418
- If you can't measure a behavioral change, you don't know if it's an improvement. Tuning behavior-shaping language without a benchmark drifts the skill — sometimes silently — toward worse behavior under pressure. (The narrow, declared exception for deterministic changes is above; "deterministic" is a judgment you announce and defend, not a backdoor around this law.)
419
-
420
- ## Common rationalizations
421
-
422
- Excuses for skipping an eval on a change you've already judged behavior-shaping. None of them hold.
187
+ ## Running the eval
423
188
 
424
- | Excuse | Reality |
425
- |--------|---------|
426
- | "Change is obviously an improvement" | Then proving it is fast. Run the eval. |
427
- | "Existing skill works fine" | Until pressure-test scenario X. Run the eval. |
428
- | "No time to author test cases" | The cost of shipping a regression is higher than 30 minutes of eval design. |
429
- | "It's just rewording" | Wording IS the skill. Reword = changed skill. Run the eval. |
430
- | "Eval results are noisy" | Then add runs, not skip the eval. |
431
- | "Pass rate was already 100%" | Then the assertion is too easy. Replace it. |
432
- | "I'll just call it deterministic" | Deterministic means the agent's compliance isn't in doubt — not that you'd rather not measure. If the wording could change a pressured choice, it's behavioral. Run the eval. |
189
+ The mechanics of executing a run live in **[`@slowdini/eval-runner`](https://www.npmjs.com/package/@slowdini/eval-runner)** — `bunx @slowdini/eval-runner`.
433
190
 
434
- ## Bundled assets
435
-
436
- - `schema/evals.schema.json` validate `evals.json` shape
437
- - `schema/grading.schema.json` validate `grading.json` shape
438
- - `schema/run-record.schema.json` portable run record format (cross-harness key)
439
- - `schema/stray-writes.schema.json` validate the `evals:detect-stray-writes` report shape
440
- - `templates/evals.json.example` — reference eval definition
441
- - `templates/eval-task-prompt.md` — scaffold for dispatching a subagent to execute a test case
442
- - `templates/judge-prompt.md` — scaffold for dispatching a judge subagent
443
- - `templates/revise-skill-prompt.md` — scaffold for the iteration step
444
- - `examples/verifying-development-work-evals.json` — committed real example
445
- - `pressure-scenarios.md` — pressure-scenario taxonomy for authoring prompts that stress discipline-enforcing skills
446
- - `runner/` — the Bun eval runner (orchestrator, grader, aggregator, transcript adapters) that executes the methodology; ships with the skill so users can run evals on their own skills
447
- - `harness-details/claude.md` — Claude Code-specific step-by-step for running an eval (resolving the runner, dispatching subagents, grading)
191
+ | Need | Where |
192
+ |------|-------|
193
+ | Quickstart, install, the two modes end-to-end | the package README |
194
+ | Every subcommand and flag; the `--skill-dir` model; workspace layout | `docs/cli.md` |
195
+ | Full run mechanics: dispatch loop, transcript access, grading, aggregating, baselines | `docs/methodology.md` |
196
+ | Claude Code operator walkthrough (isolating from installed plugins, the guard, judging) | `docs/harness-claude-code.md` |
197
+ | What a harness needs to reach Claude-Code-tier support | `docs/harness-parity.md` |
448
198
 
449
199
  ## See also
450
200
 
451
201
  - `slow-powers:writing-skills` — drafting a skill (Phase 1)
202
+ - `pressure-scenarios.md` — pressure-scenario taxonomy for authoring prompts that stress discipline-enforcing skills
203
+ - `@slowdini/eval-runner` — the tool that runs the evals this skill teaches you to author
452
204
  - agentskills.io/skill-creation/evaluating-skills — the methodology this skill is derived from
@@ -0,0 +1,23 @@
1
+ # Baseline — evaluating-skills
2
+
3
+ Committed reference output from a canonical eval run. Regenerate with
4
+ `skill-eval promote-baseline --skill evaluating-skills --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
5
+ dispatch files, produced outputs) stays gitignored under `skills-workspace/`
6
+ and is reclaimable by `skill-eval teardown` once promoted (this commit's marker).
7
+
8
+ | Field | Value |
9
+ |-------|-------|
10
+ | Mode | revision |
11
+ | Iteration | iteration-2 |
12
+ | Harness | claude-code |
13
+ | Agent model | claude-sonnet-4-6 |
14
+ | Judge model | claude-sonnet-4-6 |
15
+ | Conditions | old_skill, new_skill |
16
+ | Run timestamp | 2026-06-06T05:25:05.900Z |
17
+ | Label | split-no-regression |
18
+ | Promoted from commit | 42fa415 |
19
+
20
+ Files:
21
+ - `benchmark.json` — aggregate pass-rate / duration / token deltas.
22
+ - `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
23
+
@@ -0,0 +1,40 @@
1
+ # Notes — `split-no-regression` baseline
2
+
3
+ Context for the run that produced this baseline (see `BASELINE.md` for provenance).
4
+
5
+ ## What this measured
6
+
7
+ A Mode B (revision) eval of the eval-runner split: `old_skill` = the pre-split,
8
+ runner-bundled `SKILL.md` (452 lines, snapshot `pre-split` taken from `dev`);
9
+ `new_skill` = the slimmed, craft-focused rewrite that routes all running mechanics to
10
+ `@slowdini/eval-runner` (204 lines). The three self-eval cases exercise the decision
11
+ framework the rewrite kept verbatim (revision-comparison prescription, ship-decision,
12
+ deterministic-skip).
13
+
14
+ **Result: `delta.pass_rate` = 0** — both arms passed all substantive assertions
15
+ (3/3 cases, 100% each, stddev 0), 100% skill-invocation both arms, no validity warnings.
16
+ The slim rewrite preserves the decision behavior with no regression. The new skill also
17
+ ran ~22% lighter (74.7k vs 95.4k mean tokens), as expected from halving the body.
18
+
19
+ ## Methodology caveat — run with `--no-stage`
20
+
21
+ This run used `--no-stage` (skill content inlined into each dispatch prompt) rather than
22
+ the default staged-discovery path. Two reasons, both making `--no-stage` the *correct*
23
+ choice here rather than a workaround:
24
+
25
+ 1. **Staged-skill discovery wasn't available to mid-session subagents.** A first attempt
26
+ with staging produced subagents that couldn't load the staged slug via the Skill tool
27
+ (the skill registry is fixed at parent-session start; skills staged mid-session aren't
28
+ picked up), so they fell back to reading a `SKILL.md` from the working tree — which is
29
+ the *new* version for both conditions, contaminating the `old_skill` arm. Inlining
30
+ removes that failure mode: each arm provably reads its own condition's content
31
+ (verified: the `old_skill` dispatch prompt carries the 452-line body, `new_skill` the
32
+ 204-line body).
33
+ 2. **The revision didn't touch the `description:` frontmatter**, so there is nothing to
34
+ measure on the trigger-discovery axis anyway — only body-content quality, which
35
+ `--no-stage` isolates cleanly. The `__skill_invoked` meta-check therefore used the
36
+ LLM-judge fallback (it passed 100% in both arms).
37
+
38
+ A future iterator re-running this with staged discovery (e.g. from a fresh,
39
+ plugin-isolated session where the staged skills exist at session start) would additionally
40
+ exercise the trigger path — worth doing if the `description:` ever changes.