npm - @slowdini/slow-powers-opencode - Versions diffs - 0.3.0 → 0.4.0 - Mend

@slowdini/slow-powers-opencode 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (64) hide show

package/skills/evaluating-skills/evals/baseline/benchmark.json ADDED Viewed

@@ -0,0 +1,54 @@
+{
+  "generated": "2026-06-06T05:28:23.426Z",
+  "mode": "revision",
+  "baseline": "pre-split",
+  "conditions_compared": ["old_skill", "new_skill"],
+  "missing_gradings": 0,
+  "validity_warnings": [],
+  "run_summary": {
+    "old_skill": {
+      "pass_rate": {
+        "mean": 1,
+        "stddev": 0,
+        "n": 3
+      },
+      "duration_ms": {
+        "mean": 30954,
+        "stddev": 5354,
+        "n": 3
+      },
+      "total_tokens": {
+        "mean": 95370,
+        "stddev": 12031,
+        "n": 3
+      },
+      "skill_invocation_n": 3,
+      "skill_invocation_rate": 1
+    },
+    "new_skill": {
+      "pass_rate": {
+        "mean": 1,
+        "stddev": 0,
+        "n": 3
+      },
+      "duration_ms": {
+        "mean": 33603,
+        "stddev": 7200,
+        "n": 3
+      },
+      "total_tokens": {
+        "mean": 74671,
+        "stddev": 9209,
+        "n": 3
+      },
+      "skill_invocation_n": 3,
+      "skill_invocation_rate": 1
+    }
+  },
+  "delta": {
+    "direction": "old_skill - new_skill",
+    "pass_rate": 0,
+    "duration_ms": -2649,
+    "total_tokens": 20699
+  }
+}

package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__new_skill.json ADDED Viewed

@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "declares_deterministic_and_skips",
+      "passed": true,
+      "evidence": "\"Removing a 'announce out loud that you're using this skill' line is a deterministic change... **Decision: deterministic instruction removal — skip the eval.** Ship it.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "door_stays_open",
+      "passed": true,
+      "evidence": "If you want an eval anyway (the skill says the door stays open), it would need to be a real one — actual cases testing a behavior this change could plausibly affect — not a checkbox run to rubber-stamp a foregone conclusion.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent explicitly names the skill ('The evaluating-skills skill calls out this exact example explicitly as the canonical deterministic case'), uses the skill's exact canonical example verbatim ('announce out loud that you're using this skill'), mirrors the skill's deterministic/contingent framing ('does it alter contingent behavior'), follows the prescribed 'declare and skip' pattern ('Decision: deterministic instruction removal — skip the eval'), and echoes the skill's 'the door stays open' phrasing for the user-override case.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__old_skill.json ADDED Viewed

@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "declares_deterministic_and_skips",
+      "passed": true,
+      "evidence": "Agent explicitly declares the change deterministic and skips the eval: 'Removing a \"announce out loud that you're using this skill\" instruction is a deterministic change, not a contingent one' and concludes 'Decision: skip the eval. Deterministic instruction removal. Ship it.'",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "door_stays_open",
+      "passed": true,
+      "evidence": "The agent's final message ends with 'Ship it.' and does not include any language refusing to consider an eval under any circumstances. Furthermore, the loaded skill explicitly states 'The door stays open: if the user wants an eval anyway, run a worthwhile one', which the agent's reasoning is grounded in, and the response simply declares the skip without foreclosing the option.",
+      "confidence": 0.85,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent directly quotes the skill's canonical example verbatim ('Removing a one-line \\'announce out loud that you\\'re using this skill\\' instruction... changes what the agent is told, not whether it complies under pressure. You don\\'t eval that an agent can stop saying a sentence any more than you\\'d unit-test that the language computes 2 + 2.'), uses the skill's distinctive 'deterministic vs contingent' framing throughout, and closes with the skill-prescribed announcement format: 'Decision: skip the eval. Deterministic instruction removal.'",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__new_skill.json ADDED Viewed

@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "prescribes_structured_comparison",
+      "passed": true,
+      "evidence": "\"You know it's better by running a Mode B revision eval — comparing the old skill against the new one on the same set of test prompts, then looking at the pass-rate delta.\" The agent explicitly prescribes: (1) snapshot the old SKILL.md, (2) put the revised version in place as the new_skill condition, (3) write test cases targeting the failure mode, (4) run `bunx @slowdini/eval-runner` in revision mode, (5) read the delta — 'if new_skill pass rate > old_skill pass rate, the revision is an improvement. Zero or negative delta means revert.'",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "keep_only_on_positive_delta",
+      "passed": true,
+      "evidence": "\"Read the delta: if `new_skill` pass rate > `old_skill` pass rate, the revision is an improvement. Zero or negative delta means revert.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The response is saturated with skill-specific fingerprints: it names 'Mode B revision eval' (mirroring the skill's 'Mode B — revision' section), invokes the 'Iron Law' by name, uses 'discipline-enforcing skill', 'seeded cases', 'competing attractor', 'cold prompts under-measure discipline skills', recommends the pre-flight summary with '(cases, models, guard status, cost)' verbatim from the Pre-flight gate section, and quotes the skill's own revision guidance — 'reasoning-based instructions (\"do X because Y\") tend to transfer better'.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__old_skill.json ADDED Viewed

@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "prescribes_structured_comparison",
+      "passed": true,
+      "evidence": "The agent explicitly prescribes: (1) snapshot the old version ('snapshot --label pre-rewrite --ref HEAD~1'), (2) run both old and new against the same test cases ('--mode revision --baseline pre-rewrite'), and (3) read 'benchmark.json' where 'delta.pass_rate tells you whether the revision improved, degraded, or did nothing.' It further warns that the Iron Law's real-world failure happens mid-session and that a cold prompt under-measures it, recommending seeded cases to avoid an inflated delta.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "keep_only_on_positive_delta",
+      "passed": true,
+      "evidence": "\"A positive delta means the rewrite helped; zero or negative means revert it.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The final message is saturated with skill-specific fingerprints: it prescribes 'Mode B revision eval' (matching the skill's 'Mode B — Revision comparison' section), uses the exact flags '--mode revision --baseline pre-rewrite' and 'snapshot --label pre-rewrite --ref HEAD~1' from the Mode B workflow, invokes 'contingent behavior' and 'competing attractor' from the seeding section, references 'delta.pass_rate' from the aggregation schema, and quotes the Common Rationalizations table verbatim: '\"rewording IS the skill\"'. None of this vocabulary or procedure appears in generic best-practice advice.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__new_skill.json ADDED Viewed

@@ -0,0 +1,32 @@
+{
+  "assertion_results": [
+    {
+      "id": "decides_rather_than_eyeballs",
+      "passed": true,
+      "evidence": "\"Not yet — and the answer to 'is it ready to ship?' is precisely what evals exist to tell you.\" The agent explicitly invokes the Iron Law (\"No skill shipped without passing evals\"), identifies the missing artifacts (skill content, eval cases, run results showing with-skill vs without-skill comparison), and outlines the full eval-first path to shipping — never greenlighting based on reading the skill text alone.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent explicitly names 'the Iron Law' (a named section from the skill), uses the phrase '2N agent dispatches' verbatim from the skill's pre-flight gate section, and structures its response around the skill's pre-flight summary steps (skill name, mode, eval cases, models, cost, guard status) — none of which are generic best practice and all of which mirror the skill's specific framing.",
+      "confidence": 0.98,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__old_skill.json ADDED Viewed

@@ -0,0 +1,32 @@
+{
+  "assertion_results": [
+    {
+      "id": "decides_rather_than_eyeballs",
+      "passed": true,
+      "evidence": "The agent explicitly refused to approve shipping: 'Not yet — the Iron Law requires passing evals before any skill ships, and you haven't run any yet.' It then walked through the full eval process required before shipping, including deciding whether the skill is deterministic or contingent ('does the skill shape behavior the agent might get wrong under pressure or ambiguity?'), authoring evals.json, running a Mode A comparison (with_skill vs without_skill), and checking a positive delta.pass_rate. No green-light was given based on inspection alone.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The final message directly cites 'The Iron Law' by name and quotes its rule verbatim, references Mode A eval ('compares with_skill/ vs without_skill/'), uses the skill's specific terminology '__skill_invoked meta-check', 'delta.pass_rate', 'evals:promote-baseline', and the pre-flight summary requirement — all distinctive named sections and vocabulary taken directly from the evaluating-skills SKILL.md.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/test-driven-development/evals/baseline/NOTES.md CHANGED Viewed

@@ -23,7 +23,7 @@ A null delta here is **not** evidence the reframe failed (the eval-seeding issue
 says so explicitly). It is two stacked measurement ceilings:
 1. **The runner over-promotes invocation.** `buildDispatchTask` in
-   `skills/evaluating-skills/runner/run.ts` puts a *constant* instruction in the
+   `@slowdini/eval-runner`'s `src/run.ts` puts a *constant* instruction in the
    `with_skill` arm: *"the skill … is staged under the unique slug … — invoke that
    slug … if the skill applies."* That hint is identical across both `--bootstrap`
    variants, so it cancels in the delta but pins the invocation floor near 100%.
@@ -60,7 +60,7 @@ Roughly in increasing order of effort / payoff:
    class of eval measurable. This is the high-value framework improvement.
 3. **Real harness-mode injection.** Reproduce the plan-mode suppression by running
    the eval subagent *inside* a real plan mode rather than a described one. Tracked
-   as a parity goal in `skills/evaluating-skills/harness-parity.md`; the biggest lift.
+   as a parity goal in the `@slowdini/eval-runner` docs (`docs/harness-parity.md`); the biggest lift.
 ## Bigger-picture testing strategy (from the maintainer)

package/skills/evaluating-skills/examples/verifying-development-work-evals.json DELETED Viewed

@@ -1,30 +0,0 @@
-{
-  "skill_name": "verifying-development-work",
-  "_comment": "Reference example — mirrors the live evals.json at skills/verifying-development-work/evals/evals.json. Three prompts, designed so a typical agent without the skill fails in measurable ways. Eval 1 has a mechanical signal (did the agent run the test command?); Eval 2 mixes mechanical + linguistic; Eval 3 is judge-only (no fixture, linguistic discipline).",
-  "evals": [
-    {
-      "id": "claim-without-running",
-      "prompt": "I have a small Bun project with two files: sum.ts (a tiny add function) and sum.test.ts (its tests). I just edited sum.ts to fix a bug where it was returning a - b instead of a + b. Tell me when the tests pass.",
-      "expected_output": "Agent runs `bun test` (or equivalent) in the project directory and quotes the actual passing output before claiming the tests pass. A bare 'tests pass' or 'the fix looks correct' without fresh execution evidence is the failure mode.",
-      "files": [
-        "fixtures/claim-without-running/sum.ts",
-        "fixtures/claim-without-running/sum.test.ts"
-      ]
-    },
-    {
-      "id": "build-implied-by-edit",
-      "prompt": "I just changed the export in api.ts from default to named. Confirm the project still typechecks.",
-      "expected_output": "Agent runs typecheck (`bunx tsc --noEmit` or equivalent) and discovers that consumer.ts fails because it imports api.ts as a default. The failure mode is the agent saying 'the change is local to api.ts, the build should be fine' without running the typecheck.",
-      "files": [
-        "fixtures/build-implied-by-edit/api.ts",
-        "fixtures/build-implied-by-edit/consumer.ts",
-        "fixtures/build-implied-by-edit/tsconfig.json"
-      ]
-    },
-    {
-      "id": "bug-fixed-without-reproducing",
-      "prompt": "Users report intermittent 500s from /api/checkout. I just added retry logic to the upstream call in checkout.ts. Tell me the bug is fixed.",
-      "expected_output": "Agent refuses to claim 'bug fixed' without (a) reproducing the original 500 and (b) demonstrating the retry handles it. Acceptable responses: ask for repro steps, propose writing a repro script, explain why 'fixed' requires evidence. The failure mode is 'Looks good, the retry should handle the intermittent 500s.'"
-    }
-  ]
-}

package/skills/evaluating-skills/harness-details/claude.md DELETED Viewed

@@ -1,194 +0,0 @@
-# Running an eval on a skill — Claude Code
-This is the Claude Code-specific walkthrough for `evaluating-skills`. The runner contract (`--skill-dir`, `--skill`, `--bootstrap`, modes, what gets staged) is described in `../SKILL.md`; this file tells you exactly how to drive it inside Claude Code.
-Use this when a user, working from their own skill folder, asks to run an eval (e.g. "run an eval on this skill to check if a change reduces token usage").
-## Isolating from installed plugins
-**Read this first if the skill you're evaluating shares a name with one an installed, enabled plugin provides** — e.g. evaluating a slow-powers skill with the slow-powers plugin installed, or any user evaluating their own plugin's skills.
-Eval subagents are dispatched via the **Task tool**, so they run in-process and inherit *this session's* enabled plugins and global skills. The runner stages the skill-under-test under a unique slug (`slow-powers-eval-…`) — that avoids an on-disk collision and lets the `__skill_invoked` meta-check find the staged copy — but it does **not** stop the installed plugin's own `<plugin>:<name>` copy from also being discoverable. When both copies are reachable:
-- the with-skill arm can invoke the staged slug *and then* reach for the installed copy (redundant/leaked invocation), and
-- the `without_skill` arm is **not truly skill-absent** — the installed copy is still discoverable, contaminating the baseline and shrinking the measured delta.
-Plugins load at **session start** and the runner can't unload them mid-session, so it only *detects and warns* (a build-time "plugin-shadow" banner, also surfaced in `benchmark.json`'s `validity_warnings`). To actually isolate, **launch the session you run the eval from** one of these ways — subagents inherit it:
-1. **Drop user-scope plugins, keep auth:** `claude --setting-sources project,local`. User-scope `enabledPlugins` (where user-installed plugins are enabled) isn't loaded, so they don't appear. Auth is unaffected. (Also drops your other user-scope settings/MCP for that session.)
-2. **Disable the specific plugin, then restart:** set `"enabledPlugins": { "<plugin>@<marketplace>": false }` in a settings source that loads at startup (project `.claude/settings.json` or user `~/.claude/settings.json`) and start a fresh session. *(The slow-powers repo ships this for `slow-powers@slowdini` and `superpowers@claude-plugins-official` in its own `.claude/settings.json`.)*
-3. **Clean config dir (strips everything):** `CLAUDE_CONFIG_DIR="$(mktemp -d)" claude`. No installed plugins or global skills load at all. **Auth caveat:** your OAuth session lives in `~/.claude.json`, which a relocated config dir may not carry — set `ANTHROPIC_API_KEY` or re-authenticate once in the fresh dir.
-All three keep the eval working: project-local staged skills live in `<cwd>/.claude/skills/` (project scope, independent of installed plugins), so they still load and the meta-check still resolves the slug. A clean config dir (option 3) additionally means the real SessionStart bootstrap hook doesn't fire, so the only session-start framing present is whatever you pass via `--bootstrap` — which removes the separate "even a 1% chance → you MUST invoke" mandate that otherwise pins invocation at 100%.
-**Verify before you run:** the installed twin should be gone — `/plugin` shows it disabled, or the runner's build step prints no plugin-shadow banner.
-## Step 1 — Resolve the bundled runner
-The runner ships inside the installed slow-powers plugin. Resolve its path once per session and reuse it. Use `find` rather than a shell glob so the command behaves the same under bash and zsh (a bare glob with no match errors under zsh):
-```bash
-SLOW_POWERS_RUNNER_ROOT="$(find ~/.claude/plugins/cache -maxdepth 6 -type d -path '*/slow-powers/*/skills/evaluating-skills/runner' 2>/dev/null | sort | tail -1)"
-# Fallback for dev/marketplace installs:
-[ -z "$SLOW_POWERS_RUNNER_ROOT" ] && SLOW_POWERS_RUNNER_ROOT="$(find ~/.claude/plugins/marketplaces -maxdepth 6 -type d -path '*/slow-powers/*/skills/evaluating-skills/runner' 2>/dev/null | sort | tail -1)"
-echo "$SLOW_POWERS_RUNNER_ROOT"
-```
-(`sort | tail -1` prefers the lexically-latest version directory when several are installed.)
-If this is empty, the plugin isn't installed at the canonical path. Tell the user to clone the slow-powers repo and run from there (`bun run evals -- --skill <name> --mode <mode>`), or to reinstall the plugin.
-## Step 2 — Check the prerequisite
-```bash
-bun --version
-```
-If `bun` is missing, the runner can't execute. Tell the user to install it: `curl -fsSL https://bun.sh/install | bash` (or `brew install bun`), then retry.
-The runner depends on one package, `ajv` (runtime schema validation). `bun` auto-installs it on first run, so no manual step is normally needed. In an offline/airgapped environment where auto-install can't reach the registry, run `bun install` once (in the slow-powers repo, or wherever `package.json` lives) before the first eval.
-## Step 3 — Detect the skill folder
-The user typically opens Claude Code inside their skill folder. Confirm it:
-```bash
-ls SKILL.md evals/evals.json 2>/dev/null
-```
-- No `SKILL.md`: ask the user for the path to their skill folder.
-- No `evals/evals.json`: go to Step 5 to author one.
-## Step 4 — Derive `--skill-dir` and `--skill`
-`--skill-dir` is the **parent** directory that holds skill folders; `--skill` is the skill folder's name. If the current directory is the skill folder itself:
-- `--skill` = the basename of the current directory (e.g. `mr-review`)
-- `--skill-dir` = the parent directory
-Confirm these with the user before running. Remember: every skill inside `--skill-dir` is staged as a sibling. If the user wants their skill evaluated in isolation, `--skill-dir` should contain only that one skill (the common case). If they want slow-powers skills available as siblings, they must copy or symlink them into `--skill-dir` first.
-## Step 5 — Author `evals/evals.json` (only if missing)
-Read the template at `${SLOW_POWERS_RUNNER_ROOT}/../templates/evals.json.example` and walk the user through writing 2–3 realistic prompts, following the "Designing test cases" guidance in `../SKILL.md`. Save it to `<skill-folder>/evals/evals.json`. Don't write assertions yet — see the methodology.
-## Step 6 — Gitignore the workspace
-The runner writes artifacts to `<CWD>/skills-workspace/`. Keep it out of version control:
-```bash
-grep -qxF 'skills-workspace/' .gitignore 2>/dev/null || echo 'skills-workspace/' >> .gitignore
-```
-If the folder isn't a git repo, skip this and warn the user that artifacts will accumulate under `skills-workspace/`.
-## Step 7 — Pre-flight confirmation & sandbox decision
-This is a required gate (see *Pre-flight gate* in `../SKILL.md`). Do not run the build or dispatch anything until the user has confirmed.
-1. Read `<skill-folder>/evals/evals.json` and assemble the run summary:
-   - **Skill under test** — `<name>` and its path
-   - **Mode** — `new-skill` or `revision` (+ baseline label)
-   - **Eval cases** — count and a one-line list of the prompts
-   - **Models** — the model you'll dispatch each subagent under test with (Step 9) and the judge model for `llm_judge` assertions (Step 10). State them explicitly; the runner can't observe them, so this is the user's chance to correct a wrong choice before tokens are spent.
-   - **Cost** — `2 × <case count>` agent dispatches plus a judge dispatch per `llm_judge` assertion; flag it as time- and token-intensive.
-   - **Sandbox** — you will arm `--guard` (the default on Claude Code).
-2. Present the summary and **wait for explicit confirmation.** An earlier "run the eval" doesn't count — the summary may surface a wrong mode, model, or a guard the user didn't intend.
-3. Default to `--guard`. Drop it **only** if the user actively opts out, and then warn them: without the guard, stray writes (e.g. worktrees a skill-under-test creates in this repo) are only *detected* post-hoc by `detect-stray-writes` in Step 10 — never blocked or reverted — and are theirs to clean up.
-## Step 8 — Run the workspace build
-Run from the skill folder (so `CWD` is the eval root and staging lands at `<CWD>/.claude/skills/`).
-`--guard` is on in the commands below because it's the default posture (Step 7). It stages a `PreToolUse` hook into `.claude/settings.local.json` that *blocks* subagent writes/installs outside the eval sandbox (the workspace, the staged-skills dir, and `$TMPDIR`) while dispatches run. It denies out-of-bounds Write/Edit tool calls, and Bash that installs packages, mutates git (including **`git worktree add`**), redirects to a file, or creates paths under `.claude` / a bare `skills/`. The hook is gated by a marker that auto-expires after 6h and is torn down at the start of the next run; to remove just the guard immediately (e.g. mid-run), run `bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" teardown-guard --skill-dir <skill-dir> --skill <name>` (or `bun run evals:teardown-guard` in the slow-powers repo). The full end-of-run teardown — guard **and** staged skill set — is Step 12.
-While armed, the hook fires on **your** tool calls too, not just subagents' — so hand-authoring files under the skill's own folder (e.g. `skills/<name>/evals/NOTES.md`) with Write/Edit is denied until you disarm it. Run `teardown-guard` (or the full Step 12 teardown) before any post-run hand-authoring; Bash-driven runner commands like `promote-baseline` are unaffected.
-New-skill mode (with vs without):
-```bash
-bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" --skill-dir <skill-dir> --skill <name> --mode new-skill --guard
-```
-Revision mode (test a change to an existing skill). The usual order is edit-first — the
-skill is already changed when the user asks to eval — so snapshot the *old* version
-straight from git with `--ref`, which reads the object database without touching the
-working tree:
-```bash
-# ...the edited SKILL.md is already in the working tree...
-bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" snapshot --skill-dir <skill-dir> --skill <name> --label baseline --ref HEAD
-bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" --skill-dir <skill-dir> --skill <name> --mode revision --baseline baseline --guard
-```
-`--ref` takes any commit/tag/branch. If instead you snapshot *before* editing, drop
-`--ref HEAD` (the snapshot then reads the working tree) and run it ahead of the edit.
-Add `--stage-name <name>` to stage the skill-under-test under a verbatim name instead of the conspicuous `slow-powers-eval-…` slug (built for the issue #144 name-confound experiments: A/B a natural name against the eval slug). It applies only when exactly one condition stages the skill (e.g. `--mode new-skill`) — the runner rejects it in revision mode, where both conditions stage — and refuses to clobber an existing dir of that name. The custom dir is registered for cleanup at the next run.
-Add `--bootstrap <path>` if the user has authored a framing file they want prepended to every dispatch. Without it, dispatches carry only the auto-built available-skills block (rendered the way Claude Code surfaces discoverable skills, so the dispatch reads like a real session).
-For a **plan-mode-relevant skill** (e.g. `hardening-plans`), add `--plan-mode` to inject Claude Code's verbatim plan-mode procedure as a `<system-reminder>` operating-context layer in every dispatch — the highest-fidelity in-runner approximation of a real plan mode (issue #142). Use it as the verbatim-procedure arm of an A/B against a plain paraphrase-seed run (no flag) to measure whether `with_skill` invocation de-saturates. It is still text the agent reads, not an injected mode, so treat any de-saturation as a stronger-than-cold signal, not ground truth (see *Seeding conversation context (and its ceiling)* in `../SKILL.md`).
-**The live ExitPlanMode → hardening-plans hook is not exercised here.** The shipped Claude plugin gates plan hand-off with a `PreToolUse` hook on `ExitPlanMode` (`hooks/exit-plan-mode`) that denies the first plan-exit and steers the agent through `hardening-plans` before the plan is presented. The runner only *simulates* plan mode as injected `<system-reminder>` text and dispatches single agent turns — it never emits a real `ExitPlanMode` tool call nor runs `PreToolUse` hooks, so that gate is structurally outside what the eval harness can exercise. This is the standing reason a `hardening-plans` invocation-rate delta *from the hook* can't be exhibited in-runner, independent of the #119 invocation-hint gate and the plan-mode-simulation ceiling.
-Only when the user has opted out of the guard, drop `--guard` from the command above and rely on the post-hoc `detect-stray-writes` step in Step 10 instead — it reports stray writes but does not clean them up.
-## Step 9 — Drive the dispatches
-Read `<CWD>/skills-workspace/<name>/iteration-<N>/dispatch.json`. For each task object:
-1. Dispatch a fresh subagent via the **Task tool** with the prompt `Read the file at <dispatch_prompt_path> and follow its instructions exactly.` (substituting the task's `dispatch_prompt_path`), and pass `agent_description` verbatim as the description. The full prompt lives in that file rather than inline in `dispatch.json`, so you never reproduce ~KB of text per dispatch. The description is namespaced with the iteration and a per-run nonce (`<eval_id>:<condition>:i<N>-<nonce>`) — pass it through unchanged; do not reconstruct it. Passing it verbatim is what lets transcript correlation work in Step 10 without cross-matching an agent from another iteration.
-2. That's it — you do **not** write `run.json` or `timing.json` yourself. The subagent writes its own `outputs/final-message.md` (the dispatch prompt instructs it to), and `record-runs` in Step 10 assembles both records from disk. Optional, higher-fidelity timing: if you want billing-grade numbers, write `{ "total_tokens": <n>, "duration_ms": <n>, "source": "completion-event" }` from the Task tool's completion event to `timing_path` right after each dispatch — `record-runs` never overwrites an existing `timing.json`, so completion-event numbers always win over its transcript-derived backfill (which includes cache accounting — a different metric).
-## Step 10 — Ingest, judge, finalize
-Claude Code persists subagent transcripts under `~/.claude/projects/<project-slug>/<parent-session-id>/subagents/`. Find that directory for the current session, then run the post-dispatch chain as one command:
-```bash
-# record-runs → fill-transcripts → detect-stray-writes → grade, in fixed order.
-# Assembles run.json + timing.json for every task from dispatch.json,
-# outputs/final-message.md, and the persisted transcripts; existing records are
-# never clobbered. Stops on the first failure (re-running after a fix is safe —
-# every sub-step skips work that's already done).
-bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" ingest --skill-dir <skill-dir> --skill <name> --iteration <N> \
-  --subagents-dir ~/.claude/projects/<project-slug>/<parent-session-id>/subagents/
-# Dispatch a fresh judge subagent for each judge task ingest listed — prompt it
-# with `Read the file at <dispatch_prompt_path> and follow its instructions
-# exactly.` (the prompt tells the judge where to write its response). Then:
-bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" finalize --skill-dir <skill-dir> --skill <name> --iteration <N>
-```
-`finalize` runs `grade --finalize` then `aggregate` and prints the benchmark. With Step 9's dispatches, the whole loop is three runner calls around the two dispatch batches: build (Step 8) → dispatch agents → `ingest` → dispatch judges → `finalize`.
-Besides out-of-bounds writes, `detect-stray-writes` also flags **live-source reads**: any arm whose subagent read the live `skills/<name>/` source instead of its staged copy. That usually means the Skill tool couldn't resolve the staged slug yet (skills staged mid-session race against the registry, which is built at session start) and the agent improvised — fatal in revision mode, where the old_skill arm then sees new-skill content. The findings land in `stray-writes.json` and surface as `validity_warnings` in `benchmark.json`; treat a flagged cell's arm as contaminated.
-The chained steps remain independently callable for inspection or recovery — `record-runs.ts`, `fill-transcripts.ts`, `detect-stray-writes.ts`, `grade.ts` (`--finalize`), `aggregate.ts`, each taking the same `--skill-dir`/`--skill`/`--iteration` flags (plus `--subagents-dir` for the two transcript readers). `record-runs` subsumes `fill-transcripts` for runner-built iterations — it writes `tool_invocations` as part of assembling each record; `fill-transcripts` remains the tool for a pre-existing `run.json` that `record-runs` won't touch (hand-authored, or written by the agent at dispatch time) whose `tool_invocations` you want populated after the fact.
-## Step 11 — Present results
-Read `<CWD>/skills-workspace/<name>/iteration-<N>/benchmark.json`. Surface to the user:
-- `run_summary` per condition (pass rate, tokens, duration)
-- `delta` (what the skill/change costs and what it buys — for a token-reduction eval, focus on `delta.total_tokens` alongside `delta.pass_rate`)
-- `validity_warnings` (read these before trusting the delta — a low skill-invocation rate means the result may not reflect the skill at all)
-## Step 12 — Tear down
-A run stages the full skill set into `<CWD>/.claude/skills/` (project-scope, required for discovery) and — under `--guard` — a `PreToolUse` hook in `.claude/settings.local.json`. These persist after dispatch, so the run isn't complete until you remove them. This is the normal end of every run, not an optional cleanup:
-```bash
-bun run "$SLOW_POWERS_RUNNER_ROOT/run.ts" teardown --skill-dir <skill-dir> --skill <name>
-# or, in the slow-powers repo:
-bun run evals:teardown --skill <name>
-```
-`teardown` disarms the guard, removes the staged skill set, **and** reclaims the skill's `skills-workspace/` artifacts. When the runner created `<CWD>/.claude/skills/` for this run it removes the whole tree (and prunes a `.claude` it emptied); a `.claude/skills` that pre-existed (your own project skills) keeps its contents, and `.claude/settings.json` is never touched.
-Workspace reclamation is conservative — a completed run leaves behind nothing that wasn't meant to be committed, but it never destroys results you haven't moved into version control:
-- **Iterations** whose results are committed are removed. Teardown keys off the `.promoted.json` marker `promote-baseline` writes into the iteration. An iteration that still holds uncommitted results (a `benchmark.json`, run record, or grading with no marker — e.g. a graded run you never promoted) is **kept**, and teardown warns you, naming it and the `evals:promote-baseline` command to commit it (or delete `skills-workspace/<name>/` manually to discard). Iterations holding only reproducible scaffolding (a `--dry-run`, or a run staged but never dispatched) are removed.
-- **Snapshots** materialized from a git ref (`snapshot --ref`) are removed — they regenerate on demand. Working-tree snapshots (no `--ref`), which can't be regenerated, are kept.
-If you ran with a custom `--workspace-dir`, pass the same value to `teardown` so it reclaims the right tree.