@slowdini/slow-powers-opencode 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (131) hide show
  1. package/LICENSE +22 -0
  2. package/README.md +174 -0
  3. package/bootstrap.md +16 -0
  4. package/opencode/plugins/slow-powers.js +86 -0
  5. package/package.json +66 -0
  6. package/skills/auditing-slow-powers-usage/SKILL.md +157 -0
  7. package/skills/auditing-slow-powers-usage/evals/baseline/BASELINE.md +22 -0
  8. package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md +72 -0
  9. package/skills/auditing-slow-powers-usage/evals/baseline/benchmark.json +53 -0
  10. package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__with_skill.json +53 -0
  11. package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__without_skill.json +38 -0
  12. package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__with_skill.json +53 -0
  13. package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__without_skill.json +38 -0
  14. package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__with_skill.json +17 -0
  15. package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__without_skill.json +17 -0
  16. package/skills/auditing-slow-powers-usage/evals/evals.json +74 -0
  17. package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md +39 -0
  18. package/skills/auditing-slow-powers-usage/evals/fixtures/audits-completed-session/session-summary.md +33 -0
  19. package/skills/evaluating-skills/SKILL.md +448 -0
  20. package/skills/evaluating-skills/evals/evals.json +52 -0
  21. package/skills/evaluating-skills/evals/fixtures/iron-law/candidate-skill.md +13 -0
  22. package/skills/evaluating-skills/examples/verification-before-completion-evals.json +30 -0
  23. package/skills/evaluating-skills/harness-details/claude.md +135 -0
  24. package/skills/evaluating-skills/pressure-scenarios.md +163 -0
  25. package/skills/evaluating-skills/runner/README.md +140 -0
  26. package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +263 -0
  27. package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +146 -0
  28. package/skills/evaluating-skills/runner/aggregate.test.ts +188 -0
  29. package/skills/evaluating-skills/runner/aggregate.ts +228 -0
  30. package/skills/evaluating-skills/runner/context.test.ts +181 -0
  31. package/skills/evaluating-skills/runner/context.ts +90 -0
  32. package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +103 -0
  33. package/skills/evaluating-skills/runner/detect-stray-writes.ts +192 -0
  34. package/skills/evaluating-skills/runner/fill-transcripts.test.ts +73 -0
  35. package/skills/evaluating-skills/runner/fill-transcripts.ts +154 -0
  36. package/skills/evaluating-skills/runner/grade.test.ts +347 -0
  37. package/skills/evaluating-skills/runner/grade.ts +603 -0
  38. package/skills/evaluating-skills/runner/guard/guard.ts +49 -0
  39. package/skills/evaluating-skills/runner/guard/install.test.ts +92 -0
  40. package/skills/evaluating-skills/runner/guard/install.ts +147 -0
  41. package/skills/evaluating-skills/runner/guard/policy.test.ts +71 -0
  42. package/skills/evaluating-skills/runner/guard/policy.ts +74 -0
  43. package/skills/evaluating-skills/runner/promote-baseline.test.ts +230 -0
  44. package/skills/evaluating-skills/runner/promote-baseline.ts +186 -0
  45. package/skills/evaluating-skills/runner/run.test.ts +716 -0
  46. package/skills/evaluating-skills/runner/run.ts +814 -0
  47. package/skills/evaluating-skills/runner/sandbox-policy.ts +74 -0
  48. package/skills/evaluating-skills/runner/types.ts +104 -0
  49. package/skills/evaluating-skills/runner/validate-all.ts +54 -0
  50. package/skills/evaluating-skills/runner/validate-schema.test.ts +99 -0
  51. package/skills/evaluating-skills/runner/validate-schema.ts +51 -0
  52. package/skills/evaluating-skills/runner/validate.test.ts +56 -0
  53. package/skills/evaluating-skills/runner/validate.ts +21 -0
  54. package/skills/evaluating-skills/schema/evals.schema.json +105 -0
  55. package/skills/evaluating-skills/schema/grading.schema.json +84 -0
  56. package/skills/evaluating-skills/schema/run-record.schema.json +80 -0
  57. package/skills/evaluating-skills/schema/stray-writes.schema.json +68 -0
  58. package/skills/evaluating-skills/templates/eval-task-prompt.md +71 -0
  59. package/skills/evaluating-skills/templates/evals.json.example +17 -0
  60. package/skills/evaluating-skills/templates/judge-prompt.md +56 -0
  61. package/skills/evaluating-skills/templates/revise-skill-prompt.md +56 -0
  62. package/skills/finishing-a-development-branch/SKILL.md +96 -0
  63. package/skills/finishing-a-development-branch/evals/evals.json +41 -0
  64. package/skills/finishing-a-development-branch/evals/fixtures/finish/package.json +4 -0
  65. package/skills/finishing-a-development-branch/evals/fixtures/finish/sum.test.ts +5 -0
  66. package/skills/hardening-plans/SKILL.md +72 -0
  67. package/skills/hardening-plans/evals/baseline/BASELINE.md +22 -0
  68. package/skills/hardening-plans/evals/baseline/NOTES.md +58 -0
  69. package/skills/hardening-plans/evals/baseline/benchmark.json +54 -0
  70. package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__new_skill.json +39 -0
  71. package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__old_skill.json +39 -0
  72. package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__new_skill.json +24 -0
  73. package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__old_skill.json +24 -0
  74. package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__new_skill.json +46 -0
  75. package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__old_skill.json +46 -0
  76. package/skills/hardening-plans/evals/evals.json +114 -0
  77. package/skills/systematic-debugging/CREATION-LOG.md +119 -0
  78. package/skills/systematic-debugging/SKILL.md +84 -0
  79. package/skills/systematic-debugging/condition-based-waiting-example.ts +164 -0
  80. package/skills/systematic-debugging/condition-based-waiting.md +115 -0
  81. package/skills/systematic-debugging/defense-in-depth.md +122 -0
  82. package/skills/systematic-debugging/evals/baseline/BASELINE.md +22 -0
  83. package/skills/systematic-debugging/evals/baseline/benchmark.json +51 -0
  84. package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__with_skill.json +17 -0
  85. package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__without_skill.json +17 -0
  86. package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json +46 -0
  87. package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json +31 -0
  88. package/skills/systematic-debugging/evals/evals.json +45 -0
  89. package/skills/systematic-debugging/evals/fixtures/order-bug/orderHandler.ts +9 -0
  90. package/skills/systematic-debugging/evals/fixtures/order-bug/repro.ts +10 -0
  91. package/skills/systematic-debugging/find-polluter.sh +63 -0
  92. package/skills/systematic-debugging/root-cause-tracing.md +169 -0
  93. package/skills/systematic-debugging/test-academic.md +14 -0
  94. package/skills/systematic-debugging/test-pressure-1.md +58 -0
  95. package/skills/systematic-debugging/test-pressure-2.md +68 -0
  96. package/skills/systematic-debugging/test-pressure-3.md +69 -0
  97. package/skills/test-driven-development/SKILL.md +93 -0
  98. package/skills/test-driven-development/evals/baseline/BASELINE.md +22 -0
  99. package/skills/test-driven-development/evals/baseline/NOTES.md +74 -0
  100. package/skills/test-driven-development/evals/baseline/benchmark.json +51 -0
  101. package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__with_skill.json +53 -0
  102. package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__without_skill.json +38 -0
  103. package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__with_skill.json +32 -0
  104. package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__without_skill.json +17 -0
  105. package/skills/test-driven-development/evals/evals.json +77 -0
  106. package/skills/test-driven-development/evals/fixtures/slugify/package.json +4 -0
  107. package/skills/test-driven-development/evals/fixtures/slugify/utils.ts +7 -0
  108. package/skills/test-driven-development/testing-anti-patterns.md +299 -0
  109. package/skills/using-git-worktrees/SKILL.md +70 -0
  110. package/skills/using-git-worktrees/evals/evals.json +40 -0
  111. package/skills/verification-before-completion/SKILL.md +65 -0
  112. package/skills/verification-before-completion/evals/baseline/BASELINE.md +22 -0
  113. package/skills/verification-before-completion/evals/baseline/NOTES.md +75 -0
  114. package/skills/verification-before-completion/evals/baseline/benchmark.json +51 -0
  115. package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__with_skill.json +39 -0
  116. package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__without_skill.json +24 -0
  117. package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__with_skill.json +46 -0
  118. package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__without_skill.json +31 -0
  119. package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__with_skill.json +46 -0
  120. package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__without_skill.json +31 -0
  121. package/skills/verification-before-completion/evals/evals.json +77 -0
  122. package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/api.ts +1 -0
  123. package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/consumer.ts +3 -0
  124. package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/tsconfig.json +23 -0
  125. package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.test.ts +10 -0
  126. package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.ts +1 -0
  127. package/skills/writing-skills/SKILL.md +306 -0
  128. package/skills/writing-skills/evals/evals.json +40 -0
  129. package/skills/writing-skills/graphviz-conventions.dot +172 -0
  130. package/skills/writing-skills/persuasion-principles.md +187 -0
  131. package/skills/writing-skills/scripts/render-graphs.js +181 -0
@@ -0,0 +1,135 @@
1
+ # Running an eval on a skill — Claude Code
2
+
3
+ This is the Claude Code-specific walkthrough for `evaluating-skills`. The runner contract (`--skill-dir`, `--skill`, `--bootstrap`, modes, what gets staged) is described in `../SKILL.md`; this file tells you exactly how to drive it inside Claude Code.
4
+
5
+ Use this when a user, working from their own skill folder, asks to run an eval (e.g. "run an eval on this skill to check if a change reduces token usage").
6
+
7
+ ## Step 1 — Resolve the bundled runner
8
+
9
+ The runner ships inside the installed slow-powers plugin. Resolve its path once per session and reuse it. Use `find` rather than a shell glob so the command behaves the same under bash and zsh (a bare glob with no match errors under zsh):
10
+
11
+ ```bash
12
+ SLOW_POWERS_RUNNER_ROOT="$(find ~/.claude/plugins/cache -maxdepth 6 -type d -path '*/slow-powers/*/skills/evaluating-skills/runner' 2>/dev/null | sort | tail -1)"
13
+ # Fallback for dev/marketplace installs:
14
+ [ -z "$SLOW_POWERS_RUNNER_ROOT" ] && SLOW_POWERS_RUNNER_ROOT="$(find ~/.claude/plugins/marketplaces -maxdepth 6 -type d -path '*/slow-powers/*/skills/evaluating-skills/runner' 2>/dev/null | sort | tail -1)"
15
+ echo "$SLOW_POWERS_RUNNER_ROOT"
16
+ ```
17
+
18
+ (`sort | tail -1` prefers the lexically-latest version directory when several are installed.)
19
+
20
+ If this is empty, the plugin isn't installed at the canonical path. Tell the user to clone the slow-powers repo and run from there (`bun run evals -- --skill <name> --mode <mode>`), or to reinstall the plugin.
21
+
22
+ ## Step 2 — Check the prerequisite
23
+
24
+ ```bash
25
+ bun --version
26
+ ```
27
+
28
+ If `bun` is missing, the runner can't execute. Tell the user to install it: `curl -fsSL https://bun.sh/install | bash` (or `brew install bun`), then retry.
29
+
30
+ The runner depends on one package, `ajv` (runtime schema validation). `bun` auto-installs it on first run, so no manual step is normally needed. In an offline/airgapped environment where auto-install can't reach the registry, run `bun install` once (in the slow-powers repo, or wherever `package.json` lives) before the first eval.
31
+
32
+ ## Step 3 — Detect the skill folder
33
+
34
+ The user typically opens Claude Code inside their skill folder. Confirm it:
35
+
36
+ ```bash
37
+ ls SKILL.md evals/evals.json 2>/dev/null
38
+ ```
39
+
40
+ - No `SKILL.md`: ask the user for the path to their skill folder.
41
+ - No `evals/evals.json`: go to Step 5 to author one.
42
+
43
+ ## Step 4 — Derive `--skill-dir` and `--skill`
44
+
45
+ `--skill-dir` is the **parent** directory that holds skill folders; `--skill` is the skill folder's name. If the current directory is the skill folder itself:
46
+
47
+ - `--skill` = the basename of the current directory (e.g. `mr-review`)
48
+ - `--skill-dir` = the parent directory
49
+
50
+ Confirm these with the user before running. Remember: every skill inside `--skill-dir` is staged as a sibling. If the user wants their skill evaluated in isolation, `--skill-dir` should contain only that one skill (the common case). If they want slow-powers skills available as siblings, they must copy or symlink them into `--skill-dir` first.
51
+
52
+ ## Step 5 — Author `evals/evals.json` (only if missing)
53
+
54
+ Read the template at `${SLOW_POWERS_RUNNER_ROOT}/../templates/evals.json.example` and walk the user through writing 2–3 realistic prompts, following the "Designing test cases" guidance in `../SKILL.md`. Save it to `<skill-folder>/evals/evals.json`. Don't write assertions yet — see the methodology.
55
+
56
+ ## Step 6 — Gitignore the workspace
57
+
58
+ The runner writes artifacts to `<CWD>/skills-workspace/`. Keep it out of version control:
59
+
60
+ ```bash
61
+ grep -qxF 'skills-workspace/' .gitignore 2>/dev/null || echo 'skills-workspace/' >> .gitignore
62
+ ```
63
+
64
+ If the folder isn't a git repo, skip this and warn the user that artifacts will accumulate under `skills-workspace/`.
65
+
66
+ ## Step 7 — Pre-flight confirmation & sandbox decision
67
+
68
+ This is a required gate (see *Pre-flight gate* in `../SKILL.md`). Do not run the build or dispatch anything until the user has confirmed.
69
+
70
+ 1. Read `<skill-folder>/evals/evals.json` and assemble the run summary:
71
+ - **Skill under test** — `<name>` and its path
72
+ - **Mode** — `new-skill` or `revision` (+ baseline label)
73
+ - **Eval cases** — count and a one-line list of the prompts
74
+ - **Models** — the model you'll dispatch each subagent under test with (Step 9) and the judge model for `llm_judge` assertions (Step 10). State them explicitly; the runner can't observe them, so this is the user's chance to correct a wrong choice before tokens are spent.
75
+ - **Cost** — `2 × <case count>` agent dispatches plus a judge dispatch per `llm_judge` assertion; flag it as time- and token-intensive.
76
+ - **Sandbox** — you will arm `--guard` (the default on Claude Code).
77
+ 2. Present the summary and **wait for explicit confirmation.** An earlier "run the eval" doesn't count — the summary may surface a wrong mode, model, or a guard the user didn't intend.
78
+ 3. Default to `--guard`. Drop it **only** if the user actively opts out, and then warn them: without the guard, stray writes (e.g. worktrees a skill-under-test creates in this repo) are only *detected* post-hoc by `detect-stray-writes` in Step 10 — never blocked or reverted — and are theirs to clean up.
79
+
80
+ ## Step 8 — Run the workspace build
81
+
82
+ Run from the skill folder (so `CWD` is the eval root and staging lands at `<CWD>/.claude/skills/`).
83
+
84
+ `--guard` is on in the commands below because it's the default posture (Step 7). It stages a `PreToolUse` hook into `.claude/settings.local.json` that *blocks* subagent writes/installs outside the eval sandbox (the workspace, the staged-skills dir, and `$TMPDIR`) while dispatches run. The hook is gated by a marker that auto-expires after 6h and is torn down at the start of the next run; to remove it immediately, run `bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" teardown-guard --skill-dir <skill-dir> --skill <name>` (or `bun run evals:teardown-guard` in the slow-powers repo).
85
+
86
+ New-skill mode (with vs without):
87
+
88
+ ```bash
89
+ bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" --skill-dir <skill-dir> --skill <name> --mode new-skill --guard
90
+ ```
91
+
92
+ Revision mode (test a change to an existing skill):
93
+
94
+ ```bash
95
+ bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" snapshot --skill-dir <skill-dir> --skill <name> --label baseline
96
+ # ...edit the SKILL.md...
97
+ bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" --skill-dir <skill-dir> --skill <name> --mode revision --baseline baseline --guard
98
+ ```
99
+
100
+ Add `--bootstrap <path>` if the user has authored a framing file they want prepended to every dispatch. Without it, dispatches carry only the auto-built staged-skills inventory.
101
+
102
+ Only when the user has opted out of the guard, drop `--guard` from the command above and rely on the post-hoc `detect-stray-writes` step in Step 10 instead — it reports stray writes but does not clean them up.
103
+
104
+ ## Step 9 — Drive the dispatches
105
+
106
+ Read `<CWD>/skills-workspace/<name>/iteration-<N>/dispatch.json`. For each task object:
107
+
108
+ 1. Dispatch a fresh subagent via the **Task tool** with the prompt `Read the file at <dispatch_prompt_path> and follow its instructions exactly.` (substituting the task's `dispatch_prompt_path`), and pass `agent_description` verbatim as the description. The full prompt lives in that file rather than inline in `dispatch.json`, so you never reproduce ~KB of text per dispatch. The description is namespaced with the iteration and a per-run nonce (`<eval_id>:<condition>:i<N>-<nonce>`) — pass it through unchanged; do not reconstruct it. Passing it verbatim is what lets transcript correlation work in Step 10 without cross-matching an agent from another iteration.
109
+ 2. When the subagent returns, write the portable run record to `run_record_path` and the timing record (`{ "total_tokens": <n>, "duration_ms": <n>}`) to `timing_path`. Capture tokens/duration from the task completion event — they may not be persisted elsewhere. The run record must satisfy `schema/run-record.schema.json` (validated by `grade`/`fill-transcripts`/`detect-stray-writes`): set `eval_id`, `condition`, `skill_path` (the task's `skill_path`, `null` on the `without_skill` arm), `prompt` (the task's `user_prompt`), `files` (the task's `fixtures`, `[]` if none), `final_message` (the subagent's reply), and `tool_invocations: []` (populated later from the transcript).
110
+
111
+ ## Step 10 — Fill transcripts, grade, aggregate
112
+
113
+ Claude Code persists subagent transcripts under `~/.claude/projects/<project-slug>/<parent-session-id>/subagents/`. Find that directory for the current session, then:
114
+
115
+ ```bash
116
+ bun run "$SLOW_POWERS_RUNNER_ROOT/fill-transcripts.ts" --skill-dir <skill-dir> --skill <name> --iteration <N> \
117
+ --subagents-dir ~/.claude/projects/<project-slug>/<parent-session-id>/subagents/
118
+
119
+ # Optional: flag any subagent writes/installs that escaped the outputs/ dir.
120
+ bun run "$SLOW_POWERS_RUNNER_ROOT/detect-stray-writes.ts" --skill-dir <skill-dir> --skill <name> --iteration <N>
121
+
122
+ bun run "$SLOW_POWERS_RUNNER_ROOT/grade.ts" --skill-dir <skill-dir> --skill <name> --iteration <N>
123
+ # Dispatch a fresh judge subagent for each emitted judge task — prompt it with `Read the file at <dispatch_prompt_path> and follow its instructions exactly.` (the prompt tells the judge where to write its response). Then:
124
+ bun run "$SLOW_POWERS_RUNNER_ROOT/grade.ts" --skill-dir <skill-dir> --skill <name> --iteration <N> --finalize
125
+
126
+ bun run "$SLOW_POWERS_RUNNER_ROOT/aggregate.ts" --skill-dir <skill-dir> --skill <name> --iteration <N>
127
+ ```
128
+
129
+ ## Step 11 — Present results
130
+
131
+ Read `<CWD>/skills-workspace/<name>/iteration-<N>/benchmark.json`. Surface to the user:
132
+
133
+ - `run_summary` per condition (pass rate, tokens, duration)
134
+ - `delta` (what the skill/change costs and what it buys — for a token-reduction eval, focus on `delta.total_tokens` alongside `delta.pass_rate`)
135
+ - `validity_warnings` (read these before trusting the delta — a low skill-invocation rate means the result may not reflect the skill at all)
@@ -0,0 +1,163 @@
1
+ # Pressure Scenarios for Skill Evals
2
+
3
+ **Load this reference when:** authoring `prompt` fields in `evals.json` for a discipline-enforcing skill (TDD, verification-before-completion, designing-before-coding, etc.) and you need realistic prompts that stress agents toward rationalization.
4
+
5
+ ## Why pressure scenarios
6
+
7
+ Discipline-enforcing skills don't fail because agents don't understand them. They fail under pressure — when the agent wants a shortcut and rationalizes one. A test prompt without pressure is an academic question; the agent recites the skill and "passes" without proving anything.
8
+
9
+ Real evals for discipline-enforcing skills must put the agent under combined pressure that mirrors real work: a deadline, sunk cost, an authority instruction, exhaustion. That's where rationalizations emerge, which is where the skill must hold.
10
+
11
+ ## When to use these scenarios
12
+
13
+ Use pressure scenarios for skills that:
14
+
15
+ - Enforce discipline (TDD, testing requirements, verification before completion)
16
+ - Have compliance costs (time, effort, rework)
17
+ - Could be rationalized away ("just this once")
18
+ - Contradict immediate goals (speed over quality)
19
+
20
+ Don't use them for:
21
+
22
+ - Pure reference skills (API docs, syntax guides)
23
+ - Technique skills where the question is "can the agent apply this," not "will the agent comply"
24
+ - Skills agents have no incentive to bypass
25
+
26
+ ## Pressure types
27
+
28
+ | Pressure | Example |
29
+ |----------|---------|
30
+ | **Time** | Emergency, deadline, deploy window closing |
31
+ | **Sunk cost** | Hours of work, "waste" to delete |
32
+ | **Authority** | Senior says skip it, manager overrides |
33
+ | **Economic** | Job, promotion, company survival at stake |
34
+ | **Exhaustion** | End of day, already tired, want to go home |
35
+ | **Social** | Looking dogmatic, seeming inflexible |
36
+ | **Pragmatic** | "Being pragmatic vs dogmatic" |
37
+
38
+ The strongest test prompts combine 3+ pressure types. Single-pressure prompts can be resisted; multi-pressure prompts force a rationalization, which is exactly what you want to observe.
39
+
40
+ For research on why authority/scarcity/commitment principles increase compliance pressure, see `../writing-skills/persuasion-principles.md`.
41
+
42
+ ## Writing a pressure prompt
43
+
44
+ ### Anti-example — no pressure
45
+
46
+ > You need to implement a feature. What does the skill say?
47
+
48
+ This is academic. The agent recites the skill. You learn nothing about whether it would apply the skill under stress.
49
+
50
+ ### Anti-example — single pressure
51
+
52
+ > Production is down. $10k/min lost. Manager says add a 2-line fix now. 5 minutes until deploy window. What do you do?
53
+
54
+ Time + authority + economic. Better, but agents resist single combined pressure.
55
+
56
+ ### Good example — multiple pressures with forced choice
57
+
58
+ > You spent 3 hours, 200 lines, manually tested. It works.
59
+ > It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
60
+ > Just realized you forgot TDD.
61
+ >
62
+ > Options:
63
+ > A) Delete 200 lines, start fresh tomorrow with TDD
64
+ > B) Commit now, add tests tomorrow
65
+ > C) Write tests now (30 min), then commit
66
+ >
67
+ > Choose A, B, or C. Be honest.
68
+
69
+ Sunk cost + time + exhaustion + social + a forced explicit choice. The agent cannot defer ("I'd ask my human partner") without picking. The rationalization emerges in the reasoning for the choice.
70
+
71
+ ### Elements of a good pressure prompt
72
+
73
+ 1. **Concrete options.** Force A/B/C choice rather than open-ended response.
74
+ 2. **Real constraints.** Specific times, file paths, named consequences — not "a project" but `/tmp/payment-system`.
75
+ 3. **Make the agent act.** "What do you do?" not "What should you do?"
76
+ 4. **No easy outs.** Can't escape by asking a human or saying "depends."
77
+ 5. **Framing as real work.** Lead with "IMPORTANT: This is a real scenario. You must choose and act." Agents that believe it's a quiz answer the textbook; agents that believe it's real surface their actual decision process.
78
+
79
+ ## Using pressure prompts in evals
80
+
81
+ In `evals.json`, the `prompt` field of a test case is where the pressure scenario lives. Pair it with:
82
+
83
+ - **expected_output**: describe the disciplined response — "Agent chooses A (delete and restart with TDD); refuses to commit untested code; cites the skill's rule."
84
+ - **assertions**: an `llm_judge` rubric that scores whether the agent followed the rule under pressure, plus (if your harness supports it) a `transcript_check` for the mechanical signal — e.g., "Did the agent run `git rm` or instruct deletion?"
85
+
86
+ When grading, look for these signs the skill held:
87
+
88
+ 1. Agent chose the disciplined option.
89
+ 2. Agent cited the skill's rule as justification.
90
+ 3. Agent acknowledged the temptation but followed the rule anyway.
91
+
92
+ Look for these signs the skill leaked:
93
+
94
+ 1. Agent found a new rationalization not addressed in the skill ("This case is different because…").
95
+ 2. Agent created a "hybrid approach" — partial compliance.
96
+ 3. Agent asked permission but argued strongly for violation.
97
+
98
+ The leaked-skill cases are the highest-value signal for the next iteration: they tell you exactly which loophole to plug in the SKILL.md.
99
+
100
+ ## Capturing rationalizations for the iteration loop
101
+
102
+ When a run fails (agent picks the wrong option, or picks the right option but cites weak reasoning), capture the agent's rationalization **verbatim** in `feedback.json`. Don't paraphrase. The exact wording is what you'll use to add an explicit counter to the skill's rationalization table.
103
+
104
+ Common rationalizations agents produce under pressure:
105
+
106
+ - "This case is different because…"
107
+ - "I'm following the spirit not the letter"
108
+ - "The PURPOSE is X, and I'm achieving X differently"
109
+ - "Being pragmatic means adapting"
110
+ - "Deleting X hours is wasteful"
111
+ - "Keep as reference while writing tests first"
112
+ - "I already manually tested it"
113
+
114
+ Each verbatim quote becomes a row in the skill's rationalization table:
115
+
116
+ | Excuse | Reality |
117
+ |--------|---------|
118
+ | "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
119
+
120
+ Then re-run the eval. If the new version of the skill holds under the same prompt, the loophole is plugged.
121
+
122
+ ## Meta-testing — when iteration isn't moving the needle
123
+
124
+ If revisions don't improve the with-skill pass rate, ask the failing agent directly:
125
+
126
+ > You read the skill and chose Option C anyway. How could the skill have been written differently to make it crystal clear that Option A was the only acceptable answer?
127
+
128
+ Three possible responses:
129
+
130
+ 1. **"The skill WAS clear, I chose to ignore it"** — not a documentation problem. Add a stronger foundational principle ("Violating the letter is violating the spirit"). Re-eval.
131
+ 2. **"The skill should have said X"** — documentation problem. Add their suggestion verbatim. Re-eval.
132
+ 3. **"I didn't see section Y"** — organization problem. Move the key point earlier or make it more prominent. Re-eval.
133
+
134
+ ## When the skill is bulletproof
135
+
136
+ A discipline-enforcing skill is bulletproof when:
137
+
138
+ - Agent chooses the correct option under maximum pressure.
139
+ - Agent cites skill sections as justification.
140
+ - Agent acknowledges the temptation but follows the rule anyway.
141
+ - Meta-testing reveals "skill was clear, I should follow it."
142
+
143
+ A skill is NOT bulletproof if:
144
+
145
+ - Agent finds new rationalizations across runs.
146
+ - Agent argues the skill is wrong.
147
+ - Agent creates "hybrid approaches."
148
+ - Agent asks permission but argues strongly for violation.
149
+
150
+ ## Common mistakes
151
+
152
+ **Weak prompts (single pressure).** Agents resist single pressure and break under multiple. Combine 3+ pressures (time + sunk cost + exhaustion).
153
+
154
+ **Not capturing exact rationalizations.** "Agent was wrong" doesn't tell you what to prevent. Document exact wording verbatim.
155
+
156
+ **Vague counters (generic guardrails).** "Don't cheat" doesn't work. "Don't keep as reference" does. Each rationalization row in the table needs to address one specific excuse.
157
+
158
+ **Stopping after one iteration.** A skill that holds once is not yet bulletproof. Continue iterating until no new rationalizations emerge across runs.
159
+
160
+ ## See also
161
+
162
+ - `SKILL.md` (this directory) — the methodology that uses these prompts
163
+ - `../writing-skills/persuasion-principles.md` — research foundation for why pressure prompts work
@@ -0,0 +1,140 @@
1
+ # Skill Evals Runner
2
+
3
+ Supporting code for the skill eval framework defined in `skills/evaluating-skills/`. This runner ships **with** the skill (it lives under the skill directory and is included in the published plugin), so plugin users can run evals on their own skills, not just slow-powers maintainers.
4
+
5
+ The methodology lives in `SKILL.md` and is harness-agnostic. This runner is Bun + Claude Code-aware: it knows how to translate Claude Code transcript shapes into the portable `run.json` format. Harness-specific operator instructions live in `../harness-details/<harness>.md`.
6
+
7
+ ## The `--skill-dir` model
8
+
9
+ Every command takes two required flags:
10
+
11
+ - `--skill-dir <path>` — a directory that contains one or more skill folders (each with a `SKILL.md`). **This directory is the eval's test environment.** Every skill inside it is staged for the eval: the skill-under-test under a unique slug, every *other* skill under its natural name (so cross-references resolve).
12
+ - `--skill <name>` — the subdirectory of `--skill-dir` to evaluate.
13
+
14
+ Consequences of treating the directory as the environment:
15
+
16
+ - **Internal use** points `--skill-dir` at the repo's `./skills`, so the skill-under-test sees every other slow-powers skill as a sibling — the realistic install. The npm scripts bake this in (`--skill-dir ./skills`), so maintainers keep using `bun run evals -- --skill <name> --mode <mode>` unchanged.
17
+ - **A user evaluating one personal skill** points `--skill-dir` at the directory holding it. If that directory contains only their skill, the eval runs in isolation — no sibling skills are staged. To include slow-powers skills as siblings, the user copies or symlinks them into `--skill-dir`.
18
+
19
+ Other flags:
20
+
21
+ - `--bootstrap <path>` (optional) — a Markdown file prepended verbatim to every dispatch prompt inside `<session-start-context>`. Use it for product-specific framing (instruction priority, planning guidelines — anything a SessionStart hook would inject). Internal runs pass `--bootstrap ./bootstrap.md`. Omit it and dispatches carry only the auto-built staged-skills inventory.
22
+ - `--workspace-dir <path>` (optional) — where iteration artifacts are written. Defaults to `<CWD>/skills-workspace`.
23
+ - `--harness claude-code` (optional, default `claude-code`; the only supported harness).
24
+ - `--no-stage`, `--dry-run`, `--iteration <N>`, `--mode <new-skill|revision>`, `--baseline <label>`, `--label <label>` — as before.
25
+
26
+ Staging is written under the current working directory: `<CWD>/.claude/skills/`. A subagent dispatched from that CWD discovers the staged skills there. Run the commands from the directory you want to be the eval root (the repo root for internal use; your skill folder or its parent for personal use).
27
+
28
+ ## Driving the loop
29
+
30
+ Every run produces both a `dispatch-manifest.md` (human-readable) and a `dispatch.json` (machine-readable). An agent in a session reads `dispatch.json`, dispatches each task itself, and writes the run/timing records to the paths in each task.
31
+
32
+ ## Quickstart (internal / repo use)
33
+
34
+ Maintainers run from the repo root; the npm scripts supply `--skill-dir ./skills` and `--bootstrap ./bootstrap.md`.
35
+
36
+ ### Mode A — Evaluate a new skill (with vs without)
37
+
38
+ ```bash
39
+ # 1. Author skills/<name>/evals/evals.json with 2-3 prompts.
40
+
41
+ # 2. Build the iteration-1 workspace.
42
+ bun run evals -- --skill <name> --mode new-skill
43
+
44
+ # 3. Read skills-workspace/<name>/iteration-1/dispatch.json and dispatch each
45
+ # task as a fresh general-purpose subagent, writing run.json + timing.json
46
+ # to the paths in each task.
47
+
48
+ # 4. Fill tool_invocations from subagent transcripts:
49
+ bun run evals:fill-transcripts -- --skill <name> --iteration 1 \
50
+ --subagents-dir ~/.claude/projects/<project-slug>/<parent-session-id>/subagents/
51
+
52
+ # 5. Grade:
53
+ bun run evals:grade -- --skill <name> --iteration 1
54
+ # (After judge subagents complete and their responses are written, finalize:)
55
+ bun run evals:grade -- --skill <name> --iteration 1 --finalize
56
+
57
+ # 6. Aggregate:
58
+ bun run evals:aggregate -- --skill <name> --iteration 1
59
+
60
+ # 7. Read skills-workspace/<name>/iteration-1/benchmark.json.
61
+
62
+ # 8. (Optional) Promote this run's benchmark + judge rationales into the
63
+ # skill's version-controlled evals/baseline/ directory:
64
+ bun run evals:promote-baseline -- --skill <name> --iteration 1
65
+ ```
66
+
67
+ ### Mode B — Evaluate a language change to an existing skill
68
+
69
+ ```bash
70
+ # 1. Snapshot current SKILL.md before editing.
71
+ bun run evals:snapshot -- --skill <name> --label baseline-2026-05-24
72
+
73
+ # 2. Edit skills/<name>/SKILL.md.
74
+
75
+ # 3. Build the iteration-N workspace, comparing snapshot vs current.
76
+ bun run evals -- --skill <name> --mode revision --baseline baseline-2026-05-24
77
+
78
+ # 4-7. Same as Mode A.
79
+ ```
80
+
81
+ ### Dry run (workspace prep only)
82
+
83
+ ```bash
84
+ bun run evals -- --skill <name> --mode new-skill --dry-run
85
+ ```
86
+
87
+ ## Quickstart (running an eval on your own skill)
88
+
89
+ If you have the slow-powers plugin installed and a personal skill, you do **not** run the npm scripts. The skill's `SKILL.md` routes you to `../harness-details/<harness>.md`, which gives the full command sequence (resolving the installed runner path, invoking `run.ts` directly with `--skill-dir`/`--skill`, dispatching subagents, grading). On Claude Code, see `../harness-details/claude.md`.
90
+
91
+ ## Layout
92
+
93
+ - `context.ts` — `detectRunContext(argv)` builds the `RunContext` every command shares: resolves `--skill-dir`/`--skill`, enumerates sibling skills, resolves `--bootstrap`/`--workspace-dir`, and derives `stageRoot` (CWD) and `workspaceRoot`.
94
+ - `run.ts` — orchestrator; builds workspace tree, snapshots SKILL.md, emits dispatch manifest. On Claude Code (default), also stages each condition's snapshot at `<stageRoot>/.claude/skills/slow-powers-eval-<iteration>-<condition>__<skillName>/SKILL.md` so the subagent can discover and invoke it via the Skill tool, stages every *other* skill found in `--skill-dir` at its natural name so cross-references resolve, and builds the `<session-start-context>` block (see *Environment parity* below). Pass `--no-stage` to opt out and fall back to inlining the SKILL.md into the dispatch prompt. Also handles the `snapshot` subcommand.
95
+ - `grade.ts` — evaluates `transcript_check` assertions directly (regex against `tool_invocations`), emits judge-task files for `llm_judge` assertions, then finalizes by merging judge responses into per-run `grading.json`. The `__skill_invoked` meta-check is code-based on Claude Code when the staged-skill slug is known and `tool_invocations` is populated (deterministic scan for a `Skill` tool call with matching slug); it falls back to an LLM judge looking for behavioral fingerprints when either signal is missing.
96
+ - `aggregate.ts` — reads grading.json + timing.json from an iteration, writes `benchmark.json` with pass-rate / duration / token stats keyed by condition name.
97
+ - `promote-baseline.ts` — copies the durable subset of an iteration (`benchmark.json` + each run's `grading.json` + a `BASELINE.md` provenance file) into the skill's version-controlled `evals/baseline/`. Flags: `--skill-dir`/`--skill` (as everywhere), `--iteration <N>` (required), `--label <tag>` (optional, recorded in provenance). Everything else in the workspace stays gitignored.
98
+ - `fill-transcripts.ts` — walks the iteration tree, matches each `(eval, condition)` to a subagent transcript by description, parses the transcript with the appropriate adapter, populates `tool_invocations` in `run.json`.
99
+ - `adapters/claude-code-transcript.ts` — reads a Claude Code subagent JSONL and returns `ToolInvocation[]`. Also exposes `listSubagents` / `findByDescription` for the fill-transcripts CLI.
100
+ - `types.ts` — shared TypeScript types matching `../schema/*.json`.
101
+ - `validate.ts` / `validate-all.ts` — validator for `evals.json` against the JSON Schema rules. `validate-all.ts` takes `--skill-dir` and validates every skill's `evals.json` in it.
102
+
103
+ ## Environment parity
104
+
105
+ A subagent that runs an eval should start in an environment that mirrors a real install of the plugin under evaluation. Otherwise the result depends on the operator's local install state (whether they happen to have the plugin loaded into their parent session, which version, etc.) rather than the skill being measured. The runner produces this parity explicitly so results reproduce on a clean checkout or in CI.
106
+
107
+ Parity has two parts, both applied when `--no-stage` is NOT set (the default `--harness claude-code`):
108
+
109
+ 1. **A staged-skills inventory is built into every dispatch prompt.** The runner lists the skills actually staged for the eval — the skill-under-test plus the siblings found in `--skill-dir` — inside the `<session-start-context>` block as a Markdown bullet list. This tells the subagent what is discoverable, independent of any `--bootstrap` file.
110
+ 2. **Every skill in `--skill-dir` is staged.** The skill-under-test is staged under its unique slug (`<stageRoot>/.claude/skills/slow-powers-eval-<iteration>-<condition>__<skillName>/`); every *other* skill in `--skill-dir` is copied to `<stageRoot>/.claude/skills/<name>/` at its natural name (excluding each skill's `evals/` subdir). Natural names matter because cross-references inside skill bodies (e.g. "REQUIRED SUB-SKILL: Use `slow-powers:test-driven-development`") only resolve cleanly to natural-name entries.
111
+
112
+ `--bootstrap` is **separate** from parity. It injects product-specific framing (the file's verbatim contents) ahead of the staged-skills inventory. Internal runs pass `./bootstrap.md`; that file contains its own "Active Skills Directory" list, which overlaps the auto-built inventory. That small duplication is intentional — it avoids maintaining a second bootstrap file in lockstep with the runner.
113
+
114
+ The runner records what it staged in `<stageRoot>/.claude/skills/.slow-powers-eval-manifest.json` so cleanup is reversible. Any pre-existing entry with a colliding name is backed up to a temp directory (recorded in the manifest) before being overwritten, and restored on the next `cleanupStagedSkills()` call. The prefix sweep (`slow-powers-eval-*` entries) still runs first so a crashed prior run is recovered even if the manifest itself was never written.
115
+
116
+ The skill-under-test is **not** staged under its natural name — only under its unique slug. This preserves the `__skill_invoked` meta-check semantics: the check matches `Skill` invocations against the unique slug, so a `Skill` call to a natural-name sibling never false-positives as "the skill under test was invoked."
117
+
118
+ For the **`without_skill` / baseline condition** in this realistic environment, the subagent's dispatch block reflects "this skill is unavailable, others remain" rather than the legacy "no skill is loaded." The baseline measures the incremental value of the skill-under-test on top of the rest of the environment — not its absolute value vs. no skills at all. With `--no-stage` (or a `--skill-dir` containing only the skill-under-test and no `--bootstrap`), the legacy "no skill is loaded" wording is preserved.
119
+
120
+ **Cross-harness breadcrumbs.** Environment parity is implemented for Claude Code. Other harnesses have their own skill-discovery mechanisms; their maintainers know them best. Sketches:
121
+
122
+ - **Codex.** Declares `"skills": "./skills/"` in its `plugin.json`, so the harness scans a directory at start-up. Sibling staging would write to whatever staging path that harness reads from — analogous to `stageSiblingSkills()` but pointed at the right directory. Bootstrap can be prepended to the dispatch prompt the same way.
123
+ - **OpenCode.** Installed via npm package; the package's own directory is the discoverable surface. Sibling staging would copy into that directory, or — if the harness loads from `node_modules` directly — into a parallel staging path the harness is configured to scan.
124
+ - **General fallback.** Harnesses without project-local discovery should keep using `--no-stage`; the inline `<skill>` block in the dispatch prompt is the only skill the subagent sees. Bootstrap is omitted in this mode because its references to other skills would mislead the agent.
125
+
126
+ The committed per-skill baselines (`skills/<skill>/evals/baseline/`) plus the `transcript_check` assertions in the baseline eval suite give other harnesses a concrete target to reproduce: a harness whose adapter populates `tool_invocations` faithfully should be able to re-run a skill's eval and land close to the committed `benchmark.json` delta. See `harness-parity-check.md` — the transcript adapter is a parity target, and evals are not production functionality, so a harness can aim high here without risking user-facing behavior.
127
+
128
+ **Operational notes.** Do not run two `run.ts` invocations concurrently against the same CWD — they race on `<stageRoot>/.claude/skills/` and the manifest.
129
+
130
+ ## Why this lives in the skill
131
+
132
+ The runner is bundled as a [supporting file](https://code.claude.com/docs/en/skills#add-supporting-files) of `evaluating-skills` so it ships in the published plugin. Methodology (the SKILL.md prose and the portable schemas) and the orchestration code that executes it travel together; a plugin user can run an eval on their own skill without cloning this repo. The portable run-record schema remains the abstraction that lets the methodology work across harnesses, while this runner stays Bun + Claude-Code-aware.
133
+
134
+ ## Caveats
135
+
136
+ - Ships a Claude Code transcript adapter. Other harnesses must populate `tool_invocations` manually or write their own adapter against `../schema/run-record.schema.json`. Without an adapter, `transcript_check` assertions grade as `unverifiable` and the `__skill_invoked` meta-check falls back to the LLM judge.
137
+ - Skill staging writes to `<stageRoot>/.claude/skills/slow-powers-eval-*/`. The runner sweeps these directories at the start of each fresh run; a crashed run may leave stale entries that the next run will reap.
138
+ - Grading dispatch is operator/agent-driven (the host dispatches judge subagents per the manifest).
139
+ - Single-run evals only for now; the schema supports multi-run later.
140
+ - Snapshot retention is manual — delete `<workspace>/<skill>/snapshots/<label>/` when no longer needed.