npm - @slowdini/slow-powers-opencode - Versions diffs - 0.1.0 - Mend

@slowdini/slow-powers-opencode 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (131) hide show

package/LICENSE ADDED Viewed

@@ -0,0 +1,22 @@
+MIT License
+Copyright (c) 2025 Jesse Vincent
+Copyright (c) 2026 Max Haarhaus (Slow-powers fork)
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED Viewed

@@ -0,0 +1,174 @@
+# Slow-powers
+Slow-powers gives your agent superpowers. It's a complete software development
+methodology for coding agents — a set of composable skills plus a bootstrap
+that ensures the agent reaches for them at the right moments.
+## About this fork
+Slow-powers is a fork of [obra/superpowers](https://github.com/obra/superpowers)
+at v5.1.0. We preserve the overall workflow of superpowers, while fixing bugs
+and clarifying skill content.
+## Quickstart
+Give your agent superpowers with slow-powers: [Claude Code](#claude-code) · [Codex CLI](#codex-cli) · [OpenCode](#opencode). Support varies per harness — see the [feature support](#feature-support) table.
+## Feature support
+| Harness         | Status   | Notes                                                          |
+|-----------------|----------|----------------------------------------------------------------|
+| Claude Code     | Full     | Reference implementation                                       |
+| Codex CLI       | Partial  | Plugin manifest + shared hooks; no eval transcript adapter     |
+| OpenCode        | Partial  | JS plugin with bootstrap injection; no eval transcript adapter |
+Contributors closing parity gaps should follow [`harness-parity-check.md`](./harness-parity-check.md): it audits which Slow-powers features are wired up for a given harness and preps an agent to close one gap.
+## How it works
+Slow-powers integrates directly into your agent's session, providing a highly disciplined set of technical execution utilities. It enforces strict test-driven development (TDD), systematic scientific debugging, rigorous verification checks, safe workspace isolation via git worktrees, and clean branch-finishing hygiene. It also enhances native agent planning phases with strict rules: banning placeholders, enforcing atomic task granularity, and requiring TDD-first checklists.
+## Installation
+Installation differs by harness. If you use more than one, install
+Slow-powers separately for each.
+### Claude Code
+```
+/plugin marketplace add slowdini/slow-powers
+/plugin install slow-powers@slow-powers
+```
+### Codex CLI
+```bash
+codex plugin marketplace add slowdini/slow-powers
+codex plugin add slow-powers@slowdini
+```
+You can also browse and install it interactively: run `codex`, open
+`/plugins`, choose the `slowdini` marketplace, and install `slow-powers`.
+Start a new Codex thread after installing so the bundled skills are loaded.
+Slow-powers includes a plugin-bundled `SessionStart` hook for bootstrap
+context. Codex hooks are stable, but plugin hooks must be reviewed and trusted
+before Codex runs them.
+### OpenCode
+Add Slow-powers to the `plugin` array in your `opencode.json` (global or project-level):
+```json
+{
+  "plugin": ["@slowdini/slow-powers-opencode"]
+}
+```
+This installs the latest version from npm. To track the `main` branch instead, use the GitHub install path:
+```json
+{
+  "plugin": ["github:slowdini/slow-powers#main"]
+}
+```
+## The Core Execution Utilities
+Slow-powers provides a set of highly focused, execution-level skills that ensure your agent operates with maximum discipline:
+1. **`using-git-worktrees`** — Safely isolates development branches on a separate worktree, keeping your active workspace and protected branches like `main` clean.
+2. **`test-driven-development`** — Enforces a strict RED-GREEN-REFACTOR cycle, ensuring all production code is backed by failing test verification first.
+3. **`systematic-debugging`** — Guides the agent to locate the root cause of failures via scientific hypothesis testing, avoiding "guess-and-check" thrashing.
+4. **`verification-before-completion`** — Requires running actual test/build commands and presenting concrete evidence before making any success claims.
+5. **`finishing-a-development-branch`** — Manages local branch hygiene, runs final test verifications, and cleans up git worktrees.
+6. **`writing-skills`** — Handles future custom skill authoring and updates.
+## What's inside
+**Testing & Verification** — `test-driven-development`, `verification-before-completion`
+**Debugging** — `systematic-debugging`
+**Workspace & Git Hygiene** — `using-git-worktrees`, `finishing-a-development-branch`
+**Meta & Extension** — `writing-skills`
+## Philosophy
+- Test-Driven Development — write tests first, always
+- Systematic over ad-hoc — process over guessing
+- Complexity reduction — simplicity as a primary goal
+- Evidence over claims — verify before declaring success
+## Repository structure
+Flat layout — skills and assets live at root, harness-specific integration lives in top-level directories:
+- `skills/` — All slow-powers skills
+- `assets/` — Icons and images shared across harnesses
+- `tests/` — Cross-cutting and harness-specific tests
+- `.claude-plugin/` — Claude Code plugin manifest and hooks
+- `.codex-plugin/` — OpenAI Codex plugin manifest
+- `opencode/` — OpenCode plugin and installation docs
+- `.claude-plugin/marketplace.json` — Claude Code marketplace registry
+- `package.json` — OpenCode plugin manifest + dev tooling
+- `harness-parity-check.md` — Instructions for an agent in any harness to audit feature gaps and prep to close one
+## Releasing
+Releases are cut from `dev` and tagged from `main`:
+1. Merge feature PRs into `dev` after CI passes.
+2. When ready to ship, trigger the **Release PR** workflow with the next
+   version number. It bumps every manifest via `scripts/bump-version.ts`,
+   commits to `dev`, and opens a `dev → main` PR.
+3. Review the release PR (full test matrix runs on it) and merge.
+4. Merging to `main` automatically tags `vX.Y.Z`, creates the GitHub release,
+   and publishes `@slowdini/slow-powers-opencode` to npm (with provenance).
+   Notes come from the release PR body, or auto-generated if empty.
+See `.github/workflows/` for the workflow definitions.
+### Required secrets
+Only one secret is needed. Configure it in **Settings → Secrets and variables →
+Actions**:
+| Secret | Type | Used by | Scope / permissions |
+|--------|------|---------|---------------------|
+| `RELEASE_PR_TOKEN` | GitHub PAT (fine-grained or classic) | `release-pr.yml` | Push to `dev` (Contents: write) and open PRs (Pull requests: write). Required so the release PR triggers CI — PRs opened by the default `GITHUB_TOKEN` do not. |
+The npm publish needs **no secret**. `release.yml` publishes via npm
+[trusted publishing](https://docs.npmjs.com/trusted-publishers) (OIDC): auth is
+minted per run from the workflow's `permissions: id-token: write`, and provenance
+is generated automatically. `GITHUB_TOKEN` (auto-provided by Actions) covers the
+tag push and `gh release create`.
+### npm trusted publishing setup (one-time)
+Trusted publishing is configured on the package, so the package must exist on npm
+first. Bootstrap it once:
+1. **Create the package with a manual first publish** from a maintainer machine.
+   The `prepublishOnly` guard expects CI, so set `CI=true`:
+   ```bash
+   npm login
+   CI=true npm publish --access public
+   ```
+   This publishes the current `package.json` version (e.g. `0.1.0`) and creates
+   `@slowdini/slow-powers-opencode` on npm.
+2. **Configure the trusted publisher** at npmjs.com → the package → Settings →
+   Trusted publishing → GitHub Actions:
+   - Organization or user: `slowdini`
+   - Repository: `slow-powers`
+   - Workflow filename: `release.yml` (filename only, not a path)
+   - Environment: leave blank
+   - Allowed actions: `npm publish`
+3. **Subsequent releases are fully automated and tokenless.** Cut the next release
+   at a higher version (the Release PR workflow enforces strictly-greater); merging
+   to `main` publishes via OIDC with provenance.
+## License
+MIT — see [`LICENSE`](./LICENSE).

package/bootstrap.md ADDED Viewed

@@ -0,0 +1,16 @@
+# Instructions for using Slow-powers Skills
+These skills are quality gates on procedures you already run. They don't grant abilities — they enhance how you execute work you already know how to do.
+When you reach a gate moment — about to code, debug, claim done, finish a branch — the matching skill's description surfaces it. Load it then, even if your procedure already feels complete. That "feels complete" is the gate's target.
+<EXTREMELY-IMPORTANT>
+If a skill applies to your current task, you MUST load and follow the skill. Do not skip or shortcut execution.
+</EXTREMELY-IMPORTANT>
+## Instruction Priority
+Slow-powers skills override default system behavior where they conflict, but user instructions always take precedence:
+1. **User's explicit instructions** (CLAUDE.md, AGENTS.md, direct requests) — highest priority
+2. **Slow-powers skills / bootstrap guidelines** — override default system prompt behavior where they conflict
+3. **Default system prompt** — lowest priority

package/opencode/plugins/slow-powers.js ADDED Viewed

@@ -0,0 +1,86 @@
+/**
+ * Slow-powers plugin for OpenCode.ai
+ *
+ * Injects slow-powers bootstrap context via system prompt transform.
+ * Auto-registers skills directory via config hook (no symlinks needed).
+ */
+import fs from "node:fs";
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+const __dirname = path.dirname(fileURLToPath(import.meta.url));
+const slowPowersSkillsDir = path.resolve(__dirname, "../../skills");
+const bootstrapPath = path.resolve(__dirname, "../../bootstrap.md");
+// First line of bootstrap.md — used as an idempotency check so we don't
+// re-inject when OpenCode reruns the transform on an already-transformed
+// message array. Specific enough that user prompts won't accidentally match.
+const bootstrapLeadingPhrase = "# Instructions for using Slow-powers Skills";
+// Module-level cache for bootstrap content.
+// The bootstrap.md file does not change during a session, so reading it
+// once eliminates redundant fs work on every agent step.
+let _bootstrapCache; // undefined = not yet loaded, null = file missing
+export const SlowPowersPlugin = async ({
+  client: _client,
+  directory: _directory,
+}) => {
+  // Helper to load bootstrap content (cached after first call)
+  const getBootstrapContent = () => {
+    if (_bootstrapCache !== undefined) return _bootstrapCache;
+    if (!fs.existsSync(bootstrapPath)) {
+      _bootstrapCache = null;
+      return null;
+    }
+    _bootstrapCache = fs.readFileSync(bootstrapPath, "utf8");
+    return _bootstrapCache;
+  };
+  return {
+    // Inject skills path into live config so OpenCode discovers slow-powers skills
+    // without requiring manual symlinks or config file edits.
+    // This works because Config.get() returns a cached singleton — modifications
+    // here are visible when skills are lazily discovered later.
+    config: async (config) => {
+      config.skills = config.skills || {};
+      config.skills.paths = config.skills.paths || [];
+      if (
+        fs.existsSync(slowPowersSkillsDir) &&
+        !config.skills.paths.includes(slowPowersSkillsDir)
+      ) {
+        config.skills.paths.push(slowPowersSkillsDir);
+      }
+    },
+    // Inject bootstrap into the first user message of each session.
+    // Using a user message instead of a system message avoids:
+    //   1. Token bloat from system messages repeated every turn (#750)
+    //   2. Multiple system messages breaking Qwen and other models (#894)
+    //
+    // The hook fires on every agent step (not just every turn) because
+    // opencode's prompt.ts reloads messages from DB each step.  Fresh message
+    // arrays may need injection again, so getBootstrapContent() must not do
+    // repeated disk work.
+    "experimental.chat.messages.transform": async (_input, output) => {
+      const bootstrap = getBootstrapContent();
+      if (!bootstrap || !output.messages.length) return;
+      const firstUser = output.messages.find((m) => m.info.role === "user");
+      if (!firstUser?.parts.length) return;
+      // Guard: skip only when the leading part is the bootstrap we injected.
+      // This prevents double injection when OpenCode passes an already
+      // transformed in-memory message array through the hook again.
+      if (
+        firstUser.parts[0]?.type === "text" &&
+        firstUser.parts[0].text.startsWith(bootstrapLeadingPhrase)
+      )
+        return;
+      firstUser.parts.unshift({ type: "text", text: bootstrap });
+    },
+  };
+};

package/package.json ADDED Viewed

@@ -0,0 +1,66 @@
+{
+  "name": "@slowdini/slow-powers-opencode",
+  "version": "0.1.0",
+  "description": "Slow-powers — structured development workflows for coding agents (TDD, debugging, verification, git hygiene)",
+  "type": "module",
+  "main": "./opencode/plugins/slow-powers.js",
+  "files": [
+    "opencode/",
+    "skills/",
+    "bootstrap.md",
+    "LICENSE",
+    "README.md"
+  ],
+  "keywords": [
+    "opencode",
+    "opencode-plugin",
+    "tdd",
+    "debugging",
+    "verification",
+    "git-worktrees",
+    "skills",
+    "agent-workflow"
+  ],
+  "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/slowdini/slow-powers.git"
+  },
+  "homepage": "https://github.com/slowdini/slow-powers#readme",
+  "bugs": {
+    "url": "https://github.com/slowdini/slow-powers/issues"
+  },
+  "publishConfig": {
+    "access": "public",
+    "registry": "https://registry.npmjs.org/"
+  },
+  "scripts": {
+    "test": "bun test --path-ignore-patterns='skills-workspace/**'",
+    "evals": "bun run skills/evaluating-skills/runner/run.ts --skill-dir ./skills --bootstrap ./bootstrap.md",
+    "evals:snapshot": "bun run skills/evaluating-skills/runner/run.ts snapshot --skill-dir ./skills",
+    "evals:validate": "bun run skills/evaluating-skills/runner/validate-all.ts --skill-dir ./skills",
+    "evals:fill-transcripts": "bun run skills/evaluating-skills/runner/fill-transcripts.ts --skill-dir ./skills",
+    "evals:detect-stray-writes": "bun run skills/evaluating-skills/runner/detect-stray-writes.ts --skill-dir ./skills",
+    "evals:teardown-guard": "bun run skills/evaluating-skills/runner/run.ts teardown-guard --skill-dir ./skills",
+    "evals:grade": "bun run skills/evaluating-skills/runner/grade.ts --skill-dir ./skills",
+    "evals:aggregate": "bun run skills/evaluating-skills/runner/aggregate.ts --skill-dir ./skills",
+    "evals:promote-baseline": "bun run skills/evaluating-skills/runner/promote-baseline.ts --skill-dir ./skills",
+    "version": "bun scripts/bump-version.ts",
+    "check": "biome check --write .",
+    "check:ci": "biome check --error-on-warnings .",
+    "typecheck": "tsc --noEmit",
+    "setup": "husky",
+    "prepublishOnly": "node -e \"if (process.env.CI !== 'true') { console.error('Publishing should be done via CI'); process.exit(1); }\""
+  },
+  "devDependencies": {
+    "@biomejs/biome": "2.4.16",
+    "@types/bun": "^1.3.14",
+    "@types/node": "^25.9.1",
+    "husky": "^9.1.7",
+    "lint-staged": "^17.0.4",
+    "typescript": "^6.0.3"
+  },
+  "dependencies": {
+    "ajv": "^8.20.0"
+  }
+}

package/skills/auditing-slow-powers-usage/SKILL.md ADDED Viewed

@@ -0,0 +1,157 @@
+---
+name: auditing-slow-powers-usage
+description: Use only when a slow-powers developer explicitly asks for a post-session audit of how slow-powers skills were used during the session just completed. A manual diagnostic for people working ON slow-powers — never relevant to ordinary development tasks; do not auto-invoke.
+---
+# Auditing Slow-powers Usage
+## Why you're being asked this
+A slow-powers developer is running a deliberate, manual diagnostic. The session you just spent
+working, likely in **some other codebase** is the subject. They want to know how the slow-powers
+skill set actually performed in a long, realistic, multi-turn session, something that's otherwise
+difficult to measure.
+This is a check on **slow-powers**, not on your work. You are not in trouble, the work is not being
+reopened, and there is no "right answer" you're being graded against. Report honestly and
+specifically. Your report seeds new pressure tests and live spot-checks of the plugin.
+## Scope — stay inside these lines
+**Do:**
+- Report only on how slow-powers influenced the session that has already happened.
+- Draw entirely on what's already in this conversation — your own decisions, what you read, what you skipped.
+**Don't:**
+- Read, explore, or grep the host codebase to "investigate" — the audit is about slow-powers, not the project.
+- Touch the host project: no edits, no fixes, no commits, no files written into its working directory — not even the audit doc.
+- Re-open, redo, second-guess, or "improve" the work you just delivered.
+- Propose changes to the host project. That's out of scope even if you spot something.
+**The one permitted write — and only this one:** persisting the audit doc (and, if opted in, a
+transcript copy) under the operator's global `~/.slow-powers-audits/` folder, as described in
+*The report* below. That folder lives outside any host repo, so writing there never pollutes the
+project under audit. Everything else above still holds: the global audit folder is allowed; the host
+repo is forbidden. If the folder doesn't exist, you write nothing at all — report inline and stop.
+## Reporting rules
+These rules are the point of the audit. Follow them exactly.
+- **Report what you decided and why, *at the time*.** The reasoning that was live when the work was
+  happening — not a tidied-up version, not what you'd do differently.
+- **No after-the-fact remediation or apology language.** Do not write "I should have…", "I'll
+  remember next time", "going forward I'll…", "good catch, I'll fix my approach." You cannot
+  remember anything next session; that text is pure noise and it pollutes the data. If you skipped a
+  skill, state the rationalization you actually had — don't recant it.
+- **Be honest about slow-powers's downsides.** Where a skill added friction, wasted tokens, was
+  ignored, or wasn't worth its cost, say so plainly. A glowing report that hides cost is useless.
+- **Mark uncertainty instead of fabricating.** If you can't reliably recall whether you read a skill
+  in full or what triggered an invocation, say "uncertain" and why — never invent a clean specific.
+## The report
+Build the full report under these exact headings, in this order. Use "none" for any section that
+doesn't apply rather than dropping the heading.
+### Where the report goes
+These audits are most valuable when they accumulate as durable artifacts, so the report can be
+persisted — but only when the operator has opted in, and never inside the host repo.
+Probe (read-only) whether the global folder `~/.slow-powers-audits/` exists. **Do not create it** —
+its absence is a deliberate "don't persist" signal, and creating it would be the kind of unrequested
+write this skill forbids.
+- **Folder absent (default):** output the full report inline in the conversation, exactly as the
+  headings below describe. Write nothing to disk.
+- **Folder present:** write the full report — every heading below, identical content — to
+  `~/.slow-powers-audits/audit-<YYYYMMDD-HHMMSS>-<repo-basename>.md` (use the host repo's directory
+  name for `<repo-basename>`). Then, in the conversation, emit only a short pointer: the saved path
+  plus the one-line session summary from section 1. Don't reprint the whole report inline — the doc
+  is the artifact.
+Either way the report content is the same; only its destination changes.
+### Optionally attaching the session transcript
+A persisted report is a summary; the raw session transcript is the highest-fidelity record of how
+slow-powers actually competed for your attention. Capturing it is a **second, separate opt-in**,
+because a real transcript contains host-codebase content that is otherwise outside this audit's scope
+(slow-powers, not the host repo). Only do this when **both** of these are true:
+1. `~/.slow-powers-audits/` exists (report persistence is on), **and**
+2. `~/.slow-powers-audits/transcripts/` also exists (the operator has separately opted in to transcripts).
+When both hold, **copy** your current session's transcript file as-is into
+`~/.slow-powers-audits/transcripts/audit-<YYYYMMDD-HHMMSS>-<repo-basename>.transcript.jsonl`, matching
+the report doc's timestamp and repo name.
+- **Copy the file on disk — never read it into your context.** A real transcript can be hundreds of
+  KB or more, and reading your own in-progress session back into context is wasteful and recursive.
+  A filesystem copy moves it without loading it. Reading it would defeat the point.
+- **Harness-dependent; degrade gracefully.** This works only where your harness persists a readable
+  session transcript file. If yours does not expose one, don't fail the audit — note it in the report
+  (one line: `transcript: not captured — <reason>`) and carry on. The report itself is unaffected.
+### 1. Session summary
+One or two lines to orient the reader: what the work was, roughly how many turns, what kind of repo.
+Orientation only — no analysis here.
+### 2. Skills invoked
+A table, one row per slow-powers skill you actually loaded:
+| Skill | What inspired invoking it (the signal at the time) | Read in full? | Followed authentically? |
+|-------|----------------------------------------------------|---------------|-------------------------|
+- *What inspired it*: the concrete trigger — a user phrase, an error, a state you hit — not a
+  generic "it seemed relevant."
+- *Read in full*: yes / partial / no. Partial and no are fine and useful; report them honestly.
+- *Followed authentically*: "yes", or "deviated — <how and why>". Describe deviations factually.
+### 3. Skills considered but skipped
+A table, one row per skill you thought about using and then chose not to:
+| Skill | Why it came to mind | Rationalization for skipping (your actual reasoning at the time) |
+|-------|---------------------|------------------------------------------------------------------|
+Quote your live reasoning where you can. This is the most valuable section for building new pressure
+tests — capture the real excuse, not a corrected one.
+### 4. Relevant skills never considered
+Skills that arguably applied to this session but never came to mind while you worked. Distinct from
+section 3 — these are blind spots, not deliberate skips. Best effort; mark uncertainty.
+### 5. Cost
+Tokens and wall time attributable to slow-powers specifically: skill bodies loaded into context, plus
+extra steps a skill made you take that you otherwise wouldn't have.
+> Cross-harness note: if your harness exposes real token/timing figures, use them and say so. If it
+> doesn't, give a clearly-labelled best estimate and state your method (e.g. "≈X skills loaded at
+> ≈Y tokens each; +Z tool calls for the worktree setup").
+### 6. Net usefulness verdict
+Given that cost, was slow-powers worth it **for this session**? Don't hand-wave. Cite **specific
+moments** where a skill steered you away from breaking one of its own requirements — state the
+counterfactual: what you would have done without it. Then call out the neutral or net-negative
+moments too. Land on a clear verdict.
+### 7. Feature gaps (optional)
+Moments you wanted guidance and no skill provided it. These are candidate new-skill ideas — include
+only if real.
+### 8. Confidence & caveats
+Where your recall is shaky or a figure is a guess. Be specific about what you're unsure of.
+## Example: a good section-3 row vs. a bad one
+✅ Good — reports the live decision and reasoning:
+> | test-driven-development | I was about to add a new parser branch | "The change is two lines and I can eyeball it; the user said the demo is in five minutes, so I wrote the code first and planned to backfill a test." |
+❌ Bad — recants, apologizes, promises future behavior (do not do this):
+> | test-driven-development | Adding a parser branch | "I skipped it, which was a mistake — I should have written the test first and I'll make sure to follow TDD next time." |
+The good row is data we can turn into a pressure test. The bad row tells us nothing about what you
+actually decided and adds a promise you can't keep.

package/skills/auditing-slow-powers-usage/evals/baseline/BASELINE.md ADDED Viewed

@@ -0,0 +1,22 @@
+# Baseline — auditing-slow-powers-usage
+Committed reference output from a canonical eval run. Regenerate with
+`bun run evals:promote-baseline -- --skill auditing-slow-powers-usage --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
+dispatch files, produced outputs) stays gitignored under `skills-workspace/`.
+| Field | Value |
+|-------|-------|
+| Mode | new-skill |
+| Iteration | iteration-1 |
+| Harness | claude-code |
+| Agent model | claude-haiku-4-5-20251001 |
+| Judge model | claude-sonnet-4-6 |
+| Conditions | with_skill, without_skill |
+| Run timestamp | 2026-05-29T01:53:18.024Z |
+| Label | v1-haiku-sonnet |
+| Promoted from commit | 1149a95 |
+Files:
+- `benchmark.json` — aggregate pass-rate / duration / token deltas.
+- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.

package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md ADDED Viewed

@@ -0,0 +1,72 @@
+# Baseline notes — auditing-slow-powers-usage (iteration-1, v1-haiku-sonnet)
+Forward-looking observations from the run that produced this baseline. Provenance is in
+`BASELINE.md`; numbers are in `benchmark.json`. This file is the "what a future iterator should
+know" companion.
+## Why this baseline exists despite a negative delta
+Headline delta is `pass_rate −0.084` (with_skill 0.833 vs without_skill 0.917). We promoted anyway
+because the run validates the skill's **mechanics** well enough to ship a first version:
+- **Skill-invocation rate = 100%** on both positive cases. The `description:` actually triggers the
+  skill under haiku (our weakest agent), and the code-based meta-check confirmed a real `Skill`
+  invocation — the comparison is valid, not a non-data-point.
+- **Negative guard holds.** `ordinary-dev-task-no-audit` correctly did NOT fire the audit in either
+  arm — the anti-auto-invoke scoping works.
+- The negative delta is **within noise** at n=1 per cell (pass-rate stddev 0.118). Treat the
+  magnitude as not-yet-significant; treat the *named failures* below as the real to-do list.
+## Which assertions discriminated
+- **`blindspot_in_never_considered` — the discriminating, and most concerning, assertion.**
+  with_skill FAILED, without_skill PASSED on `audits-blindspot-session`. The skill-loaded haiku
+  pushed never-considered skills (TDD / worktrees / verification) into **section 3
+  (considered-then-skipped) with fabricated at-the-time rationalizations** — the exact failure mode
+  the skill's section-3-vs-section-4 distinction is meant to prevent — and dropped
+  `using-git-worktrees` from section 4 entirely. The no-skill agent happened to classify them
+  correctly as blind spots. **This is the #1 revision target.** Hypothesis: the SKILL.md distinction
+  between "deliberately skipped (had a live rationalization)" and "never came to mind (blind spot)"
+  isn't landing under a weak model. Candidate fix: sharpen that boundary, possibly with an explicit
+  "if you cannot recall an at-the-time rationalization, it belongs in section 4, not section 3" rule.
+- **`report_has_required_sections` — partially an artifact.** with_skill FAILED case A by omitting
+  the cost estimate from `final-message.md`; it had written the full 8-section report (incl. cost)
+  into a *separate* `audit-report.md`. The without_skill agent dumped everything into
+  `final-message.md` and passed. So this FAIL is part real omission, part output-routing: the judge
+  grades `final_message` + outputs, and a split report can lose a section. Consider whether the skill
+  should instruct the audit be delivered as one inline report (it currently says "Output the report
+  directly in the conversation"), and/or whether the eval harness should concatenate all outputs
+  before judging. Also note the without_skill case-B run FAILED this same assertion (missing cost +
+  no "none" placeholder) — so this assertion is genuinely exercising the "all five sections" bar in
+  both directions, not always-pass.
+## Which assertions did NOT discriminate (candidates to harden later)
+- **`no_remediation_language`: 4/4 PASS** across both conditions. Good news behaviorally (the hardest
+  discipline rule held even without the skill, under haiku), but it means this assertion isn't
+  *measuring skill value* on these fixtures. To make it discriminate, a future fixture could bait
+  remediation language harder (e.g. a session where the user explicitly scolds the agent, tempting an
+  apology).
+- **`no_host_codebase_changes`: 4/4 PASS.** Same story — no run tried to touch host code. Low signal;
+  keep as a safety floor, don't expect it to move the delta.
+- **`no_audit_report_emitted` (negative guard): 2/2 PASS.** Working as intended; not a discriminator
+  by design.
+## Suspected noise / validity caveats
+- **n=1 per (eval × condition).** Single-run only (framework limitation). Re-run with more iterations
+  before drawing conclusions from any sub-10-point delta.
+- **One tainted data point:** `ordinary-dev-task-no-audit/with_skill` escaped its sandbox and wrote
+  `src/cli.ts` + `src/cli.test.ts` into the real repo (run was without `--guard`; stray files were
+  deleted post-run). The assertion still passed, so pass-rate is unaffected, but **run with `--guard`
+  next time** to keep the host repo clean — haiku is prone to this.
+## Ideas for the next iteration
+1. **Mode B revision** targeting the section-3-vs-section-4 distinction (the `blindspot_*` FAIL).
+   Snapshot current SKILL.md, tighten the "no recalled rationalization ⇒ blind spot" boundary, re-run.
+2. Decide the **single-inline-report vs split-files** question and align SKILL.md + assertion so the
+   cost-section FAIL reflects real behavior, not output routing.
+3. Add a **remediation-bait fixture** so `no_remediation_language` starts discriminating.
+4. Consider a **stronger agent model** (sonnet) for a parallel baseline — this run deliberately used
+   haiku (weakest) as a floor; a sonnet agent baseline would show the skill's ceiling.