npm - @slowdini/slow-powers-opencode - Versions diffs - 0.1.4 → 0.1.5 - Mend

@slowdini/slow-powers-opencode 0.1.4 → 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

package/README.md CHANGED

@@ -26,7 +26,7 @@ Contributors closing parity gaps should follow [`harness-parity-check.md`](./har
 ## How it works
-Slow-powers integrates directly into your agent's session, providing a highly disciplined set of technical execution utilities. It enforces strict test-driven development (TDD), systematic scientific debugging, rigorous verification checks, safe workspace isolation via git worktrees, and clean branch-finishing hygiene. It also enhances native agent planning phases with strict rules: banning placeholders, enforcing atomic task granularity, and requiring TDD-first checklists.
+Slow-powers integrates directly into your agent's session, providing a highly disciplined set of technical execution utilities. It enforces strict test-driven development (TDD), systematic scientific debugging, rigorous verification checks, safe workspace isolation so new work doesn't collide with existing work, and clean branch-finishing hygiene. It also enhances native agent planning phases with strict rules: banning placeholders, enforcing atomic task granularity, and requiring TDD-first checklists.
 ## Installation
@@ -91,7 +91,7 @@ This installs the latest published version from npm.
 Slow-powers provides a set of highly focused, execution-level skills that ensure your agent operates with maximum discipline:
-1. **`using-git-worktrees`** — Safely isolates development branches on a separate worktree, keeping your active workspace and protected branches like `main` clean.
+1. **`working-in-isolation`** — Establishes an isolated workspace so new work doesn't collide with existing or in-progress work, keeping protected branches like `main` clean.
 2. **`test-driven-development`** — Enforces a strict RED-GREEN-REFACTOR cycle, ensuring all production code is backed by failing test verification first.
 3. **`systematic-debugging`** — Guides the agent to locate the root cause of failures via scientific hypothesis testing, avoiding "guess-and-check" thrashing.
 4. **`verification-before-completion`** — Requires running actual test/build commands and presenting concrete evidence before making any success claims.
@@ -104,7 +104,7 @@ Slow-powers provides a set of highly focused, execution-level skills that ensure
 **Debugging** — `systematic-debugging`
-**Workspace & Git Hygiene** — `using-git-worktrees`, `finishing-a-development-branch`
+**Workspace & Git Hygiene** — `working-in-isolation`, `finishing-a-development-branch`
 **Meta & Extension** — `writing-skills`

package/bootstrap.md CHANGED

@@ -14,26 +14,25 @@ When you reach a gate moment — about to code, hand off a plan, debug, claim do
 **Invoke relevant or requested skills BEFORE any response or action.** Even a 1% chance a skill might apply means that you should invoke the skill to check. If an invoked skill turns out to be wrong for the situation, you don't need to use it.
-```dot
-digraph skill_flow {
-    "User message received" [shape=doublecircle];
-    "Might any skill apply?" [shape=diamond];
-    "Invoke skill mechanism" [shape=box];
-    "Announce: 'Using [skill] to [purpose]'" [shape=box];
-    "Has checklist?" [shape=diamond];
-    "Create todo per item with persistent task tracker" [shape=box];
-    "Follow skill exactly" [shape=box];
-    "Respond (including clarifications)" [shape=doublecircle];
-    "User message received" -> "Might any skill apply?";
-    "Might any skill apply?" -> "Invoke skill mechanism" [label="yes, even 1%"];
-    "Might any skill apply?" -> "Respond (including clarifications)" [label="definitely not"];
-    "Invoke skill mechanism" -> "Announce: 'Using [skill] to [purpose]'";
-    "Announce: 'Using [skill] to [purpose]'" -> "Has checklist?";
-    "Has checklist?" -> "Create todo per item with persistent task tracker" [label="yes"];
-    "Has checklist?" -> "Follow skill exactly" [label="no"];
-    "Create todo per item with persistent task tracker" -> "Follow skill exactly";
-}
+```mermaid
+flowchart TD
+    start([User message received])
+    apply{Might any skill apply?}
+    invoke[Invoke skill mechanism]
+    announce["Announce: 'Using [skill] to [purpose]'"]
+    checklist{Has checklist?}
+    todos[Create todo per item with persistent task tracker]
+    follow[Follow skill exactly]
+    respond(["Respond (including clarifications)"])
+    start --> apply
+    apply -->|yes, even 1%| invoke
+    apply -->|definitely not| respond
+    invoke --> announce
+    announce --> checklist
+    checklist -->|yes| todos
+    checklist -->|no| follow
+    todos --> follow
 ```
 ## Red Flags

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@slowdini/slow-powers-opencode",
-  "version": "0.1.4",
+  "version": "0.1.5",
   "description": "Slow-powers — structured development workflows for coding agents (TDD, debugging, verification, git hygiene)",
   "type": "module",
   "main": "./opencode/plugins/slow-powers.js",

package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md CHANGED

@@ -4,6 +4,14 @@ Forward-looking observations from the run that produced this baseline. Provenanc
 `BASELINE.md`; numbers are in `benchmark.json`. This file is the "what a future iterator should
 know" companion.
+> **⚠️ Baseline is stale (as of the `working-in-isolation` rename, #156).** The fixtures and
+> `evals.json` rubrics were updated to rename `using-git-worktrees` → `working-in-isolation`, but
+> the committed `grading/*.json` and the observations below were produced against the *old* name and
+> are **not** re-graded — they're kept verbatim as the historical record. References to
+> `using-git-worktrees` / "worktrees" in this file and in `grading/*.json` describe that past run;
+> they are not live skill references. Re-run this eval to refresh the baseline before drawing new
+> conclusions from it.
 ## Why this baseline exists despite a negative delta
 Headline delta is `pass_rate −0.084` (with_skill 0.833 vs without_skill 0.917). We promoted anyway

package/skills/auditing-slow-powers-usage/evals/evals.json CHANGED

@@ -32,7 +32,7 @@
     {
       "id": "audits-blindspot-session",
       "prompt": "Just finished a session over in the payments-gateway repo — notes are in session-summary.md. I'm working on slow-powers and want a read on how the skills did. Please run the post-session slow-powers usage audit on it.",
-      "expected_output": "The agent produces the structured audit report. The distinguishing feature of this session is that the agent went straight from the feature request to editing source on the current branch and NEVER considered the skills that applied — test-driven-development (a new branch of refund logic with an existing test suite), using-git-worktrees (a feature change made directly on the checked-out branch), and verification-before-completion (claimed done without running the ~12s suite). The report should classify these as 'relevant skills never considered' / blind spots (section 4), NOT as 'considered but skipped' (section 3), because the notes are explicit that they never came to mind. Sections that don't apply (e.g. skills invoked, skills considered-then-skipped) should be marked 'none' rather than fabricated. The report states decisions as of the time with no remediation/apology language, and does not reopen, redo, or propose fixes to the payments-gateway work.",
+      "expected_output": "The agent produces the structured audit report. The distinguishing feature of this session is that the agent went straight from the feature request to editing source on the current branch and NEVER considered the skills that applied — test-driven-development (a new branch of refund logic with an existing test suite), working-in-isolation (a feature change made directly on the checked-out branch), and verification-before-completion (claimed done without running the ~12s suite). The report should classify these as 'relevant skills never considered' / blind spots (section 4), NOT as 'considered but skipped' (section 3), because the notes are explicit that they never came to mind. Sections that don't apply (e.g. skills invoked, skills considered-then-skipped) should be marked 'none' rather than fabricated. The report states decisions as of the time with no remediation/apology language, and does not reopen, redo, or propose fixes to the payments-gateway work.",
       "files": ["fixtures/audits-blindspot-session/session-summary.md"],
       "assertions": [
         {
@@ -43,7 +43,7 @@
         {
           "id": "blindspot_in_never_considered",
           "type": "llm_judge",
-          "rubric": "The session notes state the agent never thought about test-driven-development, using-git-worktrees, or verification-before-completion (it went straight from request to editing source on the current branch). Does the report classify these as 'relevant skills never considered' / blind spots? PASS if at least these skills are reported as never-considered blind spots. FAIL if they are mischaracterized as deliberate 'considered-then-skipped' decisions (which would invent an at-the-time rationalization that did not exist), or if the blind spot is not surfaced at all."
+          "rubric": "The session notes state the agent never thought about test-driven-development, working-in-isolation, or verification-before-completion (it went straight from request to editing source on the current branch). Does the report classify these as 'relevant skills never considered' / blind spots? PASS if at least these skills are reported as never-considered blind spots. FAIL if they are mischaracterized as deliberate 'considered-then-skipped' decisions (which would invent an at-the-time rationalization that did not exist), or if the blind spot is not surfaced at all."
         },
         {
           "id": "no_remediation_language",

package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md CHANGED

@@ -33,7 +33,7 @@ enough that I was confident in it." The user didn't push further.
 Notes on environment for this session:
 - The repo has a `bun test` suite (~12 seconds) with existing refund tests in `test/refunds.test.ts`.
 - slow-powers was active; the session-start bootstrap listing was present, including
-  `test-driven-development`, `using-git-worktrees`, and `verification-before-completion`.
+  `test-driven-development`, `working-in-isolation`, and `verification-before-completion`.
 - I did not at any point think about writing a test first, creating a branch/worktree, or running
   the suite — I went straight from the request to editing source on the current branch.
 - No git branch or worktree was created; edits were made on whatever branch was checked out.

package/skills/evaluating-skills/SKILL.md CHANGED

@@ -132,7 +132,7 @@ Do not dispatch until the user confirms *this summary*. An earlier "run the eval
 ### Sandbox decision
-A subagent under test runs the real skill, and some skills write to disk — the skill that triggered this gate, `using-git-worktrees`, creates git worktrees in whatever repo it's pointed at. Without active enforcement those writes land in your working directory.
+A subagent under test runs the real skill, and some skills write to disk — the skill that triggered this gate, `working-in-isolation`, creates git worktrees in whatever repo it's pointed at. Without active enforcement those writes land in your working directory.
 - **Guard available (Claude Code):** arming `--guard` is the default. If you are about to run without it, STOP. Proceed unguarded **only** when the user actively opts out — and warn them that stray writes will then only be **detected after the fact** by `detect-stray-writes`, never blocked or reverted, so anything a subagent writes outside its `outputs/` dir (worktrees, installed packages, edited repo files) persists and is theirs to clean up.
 - **Guard unavailable (other harnesses):** there is no active write enforcement. Tell the user plainly: stray writes are detected and reported by `detect-stray-writes` but **not auto-cleaned** — they must review the report and remove anything that escaped. Harness-level write enforcement is tracked as a parity goal in `harness-parity-check.md`.

package/skills/evaluating-skills/evals/evals.json CHANGED

@@ -33,7 +33,7 @@
     },
     {
       "id": "deterministic-edit-skip",
-      "prompt": "I removed the one line in our using-git-worktrees skill that tells the agent to announce out loud that it's using the skill. Nothing else changed. Do I need to run an eval before I ship this?",
+      "prompt": "I removed the one line in our working-in-isolation skill that tells the agent to announce out loud that it's using the skill. Nothing else changed. Do I need to run an eval before I ship this?",
       "expected_output": "The agent recognizes this as a deterministic instruction change — removing a one-line directive the agent reliably follows, not wording that decides a pressured or ambiguous choice — and concludes an eval is not warranted, stating that decision and its reasoning. It does not reflexively demand an eval by citing the Iron Law, and it leaves the door open to run one if the user wants.",
       "assertions": [
         {

package/skills/finishing-a-development-branch/SKILL.md CHANGED

@@ -85,7 +85,7 @@ git branch -D <feature-branch>
 ### Step 5: Clean Up Git Worktrees (Options 1 & 4 only)
-> **REQUIRED BACKGROUND:** You must understand `slow-powers:using-git-worktrees` for workspace isolation and worktree management.
+> **REQUIRED BACKGROUND:** You must understand `slow-powers:working-in-isolation` for workspace isolation and worktree management.
 If the workspace is a worktree that you created (under `.worktrees/`, `worktrees/`, or `~/.config/slow-powers/worktrees/`), clean it up from the main repository root:
 ```bash

package/skills/systematic-debugging/condition-based-waiting.md CHANGED

@@ -8,17 +8,16 @@ Flaky tests often guess at timing with arbitrary delays. This creates race condi
 ## When to Use
-```dot
-digraph when_to_use {
-    "Test uses setTimeout/sleep?" [shape=diamond];
-    "Testing timing behavior?" [shape=diamond];
-    "Document WHY timeout needed" [shape=box];
-    "Use condition-based waiting" [shape=box];
-    "Test uses setTimeout/sleep?" -> "Testing timing behavior?" [label="yes"];
-    "Testing timing behavior?" -> "Document WHY timeout needed" [label="yes"];
-    "Testing timing behavior?" -> "Use condition-based waiting" [label="no"];
-}
+```mermaid
+flowchart TD
+    sleep{Test uses setTimeout/sleep?}
+    timing{Testing timing behavior?}
+    document[Document WHY timeout needed]
+    use[Use condition-based waiting]
+    sleep -->|yes| timing
+    timing -->|yes| document
+    timing -->|no| use
 ```
 **Use when:**

package/skills/systematic-debugging/root-cause-tracing.md CHANGED

@@ -8,19 +8,18 @@ Bugs often manifest deep in the call stack (git init in wrong directory, file cr
 ## When to Use
-```dot
-digraph when_to_use {
-    "Bug appears deep in stack?" [shape=diamond];
-    "Can trace backwards?" [shape=diamond];
-    "Fix at symptom point" [shape=box];
-    "Trace to original trigger" [shape=box];
-    "BETTER: Also add defense-in-depth" [shape=box];
-    "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
-    "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
-    "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
-    "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
-}
+```mermaid
+flowchart TD
+    deep{Bug appears deep in stack?}
+    trace{Can trace backwards?}
+    symptom[Fix at symptom point]
+    origin[Trace to original trigger]
+    defense["BETTER: Also add defense-in-depth"]
+    deep -->|yes| trace
+    trace -->|yes| origin
+    trace -->|no - dead end| symptom
+    origin --> defense
 ```
 **Use when:**
@@ -129,26 +128,25 @@ Runs tests one-by-one, stops at first polluter. See script for usage.
 ## Key Principle
-```dot
-digraph principle {
-    "Found immediate cause" [shape=ellipse];
-    "Can trace one level up?" [shape=diamond];
-    "Trace backwards" [shape=box];
-    "Is this the source?" [shape=diamond];
-    "Fix at source" [shape=box];
-    "Add validation at each layer" [shape=box];
-    "Bug impossible" [shape=doublecircle];
-    "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];
-    "Found immediate cause" -> "Can trace one level up?";
-    "Can trace one level up?" -> "Trace backwards" [label="yes"];
-    "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
-    "Trace backwards" -> "Is this the source?";
-    "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
-    "Is this the source?" -> "Fix at source" [label="yes"];
-    "Fix at source" -> "Add validation at each layer";
-    "Add validation at each layer" -> "Bug impossible";
-}
+```mermaid
+flowchart TD
+    found(Found immediate cause)
+    canTrace{Can trace one level up?}
+    back[Trace backwards]
+    isSource{Is this the source?}
+    fix[Fix at source]
+    validate[Add validation at each layer]
+    impossible([Bug impossible])
+    never{{NEVER fix just the symptom}}
+    found --> canTrace
+    canTrace -->|yes| back
+    canTrace -->|no| never
+    back --> isSource
+    isSource -->|no - keeps going| back
+    isSource -->|yes| fix
+    fix --> validate
+    validate --> impossible
 ```
 **NEVER fix just where the error appears.** Trace back to find the original trigger.

package/skills/working-in-isolation/SKILL.md ADDED

@@ -0,0 +1,58 @@
+---
+name: working-in-isolation
+description: Use when you're about to start code changes — a feature, bugfix, or refactor — to establish an isolated workspace so your work doesn't collide with existing or in-progress work.
+---
+# Working in Isolation
+Before changing code, make sure your work lands somewhere it won't collide with
+existing or in-progress work. Decide the workspace based on the git state.
+When in doubt, pause and ask the user.
+## Decision: where does this work go?
+Check the current state, then take the **first** matching rule:
+```bash
+git branch --show-current      # current branch
+git status --porcelain         # empty = clean tree
+git worktree list              # >1 entry = worktrees already exist
+```
+1. **The user named a workspace** (explicit command, or a configured preference)
+   → follow it.
+2. **Dirty tree (staged or unstaged changes) OR worktrees already exist**
+   → a human or another agent is mid-work here. Use a **new worktree** so your
+   changes can't collide with theirs.
+3. **On `dev` / `main` / `master`** → sync with origin and **check out a new
+   branch**. Keeps the base clean and makes the work easy to review.
+4. **On any other branch** → **work in place.** The user already isolated this
+   workspace; adding a worktree is needless ceremony.
+> **Hard rule: never make changes while on `dev` / `main` / `master`.** If you
+> find yourself on a base branch, branch (rule 3) or worktree (rule 2) first.
+## Creating a worktree (rule 2)
+Prefer the agent platform's **native isolation tool** if it has one. Otherwise
+fall back to a git worktree:
+```bash
+git worktree add .worktrees/<branch-name> -b <branch-name>
+cd .worktrees/<branch-name>
+```
+Keep the worktree out of version control: if `.worktrees/` isn't already
+git-ignored, add it to `.gitignore` and commit that first. If worktree creation
+fails (sandbox or permission limits), say so and fall back to checking out a
+branch in place (rule 3).
+## After the workspace is set
+Install dependencies and run the existing test suite once, to confirm a clean
+baseline before you write anything.
+Use the project-appropriate commands to verify the baseline is clean - lint, test, build.
+If the baseline is already failing, report it before starting — you need to know
+which failures you introduced.
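
The "first matching rule" ordering in the added `SKILL.md` can be sketched as a small pure function. This is an illustrative sketch only, not part of the package: `GitState`, `choose_workspace`, and the returned action strings are hypothetical names, and the three inputs are assumed to be the outputs of the three `git` commands the skill lists.

```python
from dataclasses import dataclass
from typing import Optional

# Base branches named in the skill's rule 3 and hard rule.
BASE_BRANCHES = {"dev", "main", "master"}


@dataclass
class GitState:
    """Hypothetical container for the state the skill tells the agent to check."""
    branch: str                           # output of `git branch --show-current`
    porcelain: str                        # output of `git status --porcelain`
    worktree_count: int                   # entries printed by `git worktree list`
    user_workspace: Optional[str] = None  # workspace the user named, if any


def choose_workspace(state: GitState) -> str:
    """Return the action of the first matching rule, in the skill's order."""
    # Rule 1: an explicit user instruction wins over everything else.
    if state.user_workspace:
        return f"follow user: {state.user_workspace}"
    # Rule 2: dirty tree, or more than one worktree entry, means someone is mid-work.
    if state.porcelain.strip() or state.worktree_count > 1:
        return "new worktree"
    # Rule 3: never edit on a base branch; check out a new branch instead.
    if state.branch in BASE_BRANCHES:
        return "new branch"
    # Rule 4: any other branch is already isolated.
    return "work in place"
```

Because rule 2 is checked before rule 3, a dirty `main` yields a worktree rather than a plain branch checkout, which matches the order given in the skill. The sketch does not special-case an empty branch name (detached HEAD); the skill's own guidance for unclear situations is to pause and ask the user.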

package/skills/working-in-isolation/evals/baseline/BASELINE.md ADDED

@@ -0,0 +1,22 @@
+# Baseline — working-in-isolation
+Committed reference output from a canonical eval run. Regenerate with
+`bun run evals:promote-baseline -- --skill working-in-isolation --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
+dispatch files, produced outputs) stays gitignored under `skills-workspace/`.
+| Field | Value |
+|-------|-------|
+| Mode | new-skill |
+| Iteration | iteration-3 |
+| Harness | claude-code |
+| Agent model | claude-sonnet-4-6 |
+| Judge model | claude-sonnet-4-6 |
+| Conditions | with_skill, without_skill |
+| Run timestamp | 2026-06-03T07:33:13.084Z |
+| Label | (none) |
+| Promoted from commit | e428b0e |
+Files:
+- `benchmark.json` — aggregate pass-rate / duration / token deltas.
+- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.

package/skills/working-in-isolation/evals/baseline/NOTES.md ADDED

@@ -0,0 +1,67 @@
+# Baseline notes — working-in-isolation
+Forward-looking observations from the canonical run (`new-skill`, iteration-3,
+`claude-sonnet-4-6` agent + judge). Provenance is in `BASELINE.md`; headline
+numbers are in `benchmark.json`. This file is the "what a future iterator should
+know" companion.
+## Headline
+`with_skill` 0.80 vs `without_skill` 0.20 → **+0.60 pass-rate delta**, skill
+invocation **100% (5/5)**, **0 validity warnings**. Cost: +8.2s, +1.2k tokens.
+## Which cases discriminated
+| Case | with | without | Notes |
+|------|------|---------|-------|
+| `base-branch-checkout` | 100% | 0% | The most important check (never edit on `main`). Clean +100%. |
+| `dirty-tree-worktree` | 100% | 0% | +100% **this run**. The `without` arm did *not* isolate here — see variance note. |
+| `seeded-on-main-momentum` | 100% | 0% | +100%. Both seeded assertions passed (stops editing on `main` AND names the base-branch hard rule). |
+| `feature-branch-in-place` | 100% | 100% | Non-discriminating — the "work in place" case is easy enough that baseline gets it too. Candidate for a harder variant. |
+| `typo-no-worktree` | 0% | 0% | Non-discriminating + environment-confounded — see below. |
+## Caveats a re-runner must know
+- **`typo-no-worktree` is confounded by the real repo's branch state.** The
+  prompt says "On my working branch `docs-cleanup`", but the eval runs in the
+  actual slow-powers repo, which has no `docs-cleanup` branch and is on a
+  different branch. Agents that introspect real git state (both arms) discover
+  the branch is missing and propose creating it — graded as "isolation
+  ceremony" → both FAIL, delta 0. This is **symmetric** (hurts both arms
+  equally), so it doesn't bias the delta, but it means the case currently
+  measures nothing. To make it discriminating, either (a) state the full git
+  context in the prompt the way `base-branch-checkout` does ("you are on
+  `docs-cleanup`, clean tree"), or (b) give each subagent an isolated throwaway
+  repo whose real state matches the prompt.
+- **Iteration-2 vs iteration-3 — why the delta jumped (+0.30 → +0.60).**
+  Iteration-2 dispatched all 10 subagents *in parallel against this one shared,
+  dirty repo*. Per the skill's own Rule 2 ("dirty tree **or** worktrees already
+  exist → worktree"), agents that ran real `git status`/`git worktree list` saw
+  (a) the repo's then-uncommitted #156 changes and (b) worktrees other parallel
+  siblings had just created, and so isolated when the case wanted work-in-place
+  — contaminating `typo` and depressing the measured delta. Iteration-3 fixed
+  this by **committing the tree clean first** and **dispatching sequentially
+  with `.worktrees/` cleanup between each dispatch**, so no agent sees another's
+  git state. Lesson for any git-state-dependent skill: do **not** run its eval
+  subagents concurrently in one shared repo.
+- **The write guard does not block worktree creation.** `runner/sandbox-policy.ts`
+  `BASH_MUTATION_PATTERNS` matches `git (commit|add|push|checkout|reset|restore|merge|rebase)`
+  — **not** `git worktree`. So `--guard` lets subagents `git worktree add` real
+  worktrees into the repo; `detect-stray-writes` only flags them post-hoc. We
+  cleaned them by hand both runs. Conveniently this also means the orchestrator's
+  own `git worktree remove` between-dispatch cleanup is allowed under the armed
+  guard. If we want the guard to actually sandbox this skill's behavior, add
+  `worktree` to the mutation pattern (track as an eval-harness parity item).
+## Variance / next-iteration ideas
+- `without_skill` on `dirty-tree-worktree` is **unstable**: iteration-2 it
+  isolated (PASS), iteration-3 it didn't (FAIL). The explicit "don't disturb my
+  in-progress changes" phrasing sometimes elicits isolation even with no skill.
+  Add runs (n>1 per condition) before trusting that case's delta.
+- `feature-branch-in-place` passes in both arms — replace or harden it (e.g.
+  add a competing attractor) so it earns its slot.
+- Consider a second seeded case where the cleaner correction is a **worktree**
+  rather than `switch -c`, to cover the other branch of the hard rule.
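
The write-guard caveat in the notes above can be checked with a short regex sketch. It assumes the alternation quoted in the notes is the whole pattern; the real `BASH_MUTATION_PATTERNS` in `runner/sandbox-policy.ts` is not shown in this diff and may differ. `is_blocked` and the `WITH_WORKTREE` variant are hypothetical names used only for illustration.

```python
import re

# Alternation as quoted in NOTES.md; assumed, not copied from sandbox-policy.ts.
QUOTED_PATTERN = re.compile(
    r"git (commit|add|push|checkout|reset|restore|merge|rebase)"
)

# Hypothetical variant with `worktree` appended, as the notes propose.
WITH_WORKTREE = re.compile(
    r"git (commit|add|push|checkout|reset|restore|merge|rebase|worktree)"
)


def is_blocked(pattern: re.Pattern, command: str) -> bool:
    """True if the command text contains a match for the mutation pattern."""
    return pattern.search(command) is not None


# `git worktree add` slips past the quoted pattern: `add` is not directly
# preceded by `git `, so no alternative matches.
print(is_blocked(QUOTED_PATTERN, "git worktree add .worktrees/x -b x"))  # False
print(is_blocked(WITH_WORKTREE, "git worktree add .worktrees/x -b x"))   # True
```

Under this sketch, the widened pattern would also match `git worktree remove` and the read-only `git worktree list`, so the between-dispatch cleanup the notes describe as "allowed under the armed guard" would presumably need a separate allowance.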

package/skills/working-in-isolation/evals/baseline/benchmark.json ADDED

@@ -0,0 +1,51 @@
+{
+  "generated": "2026-06-03T07:50:45.496Z",
+  "mode": "new-skill",
+  "conditions_compared": ["with_skill", "without_skill"],
+  "missing_gradings": 0,
+  "validity_warnings": [],
+  "run_summary": {
+    "with_skill": {
+      "pass_rate": {
+        "mean": 0.8,
+        "stddev": 0.4,
+        "n": 5
+      },
+      "duration_ms": {
+        "mean": 47222,
+        "stddev": 13874,
+        "n": 5
+      },
+      "total_tokens": {
+        "mean": 16696,
+        "stddev": 917,
+        "n": 5
+      },
+      "skill_invocation_n": 5,
+      "skill_invocation_rate": 1
+    },
+    "without_skill": {
+      "pass_rate": {
+        "mean": 0.2,
+        "stddev": 0.4,
+        "n": 5
+      },
+      "duration_ms": {
+        "mean": 39003,
+        "stddev": 12238,
+        "n": 5
+      },
+      "total_tokens": {
+        "mean": 15475,
+        "stddev": 1473,
+        "n": 5
+      }
+    }
+  },
+  "delta": {
+    "direction": "with_skill - without_skill",
+    "pass_rate": 0.6,
+    "duration_ms": 8219,
+    "total_tokens": 1221
+  }
+}
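
As a cross-check, the aggregate figures in `benchmark.json` can be reproduced from the per-case pass rates listed in the `NOTES.md` table. This sketch assumes each case contributes one pass rate per condition and that the reported `stddev` is a population standard deviation; the second assumption is an inference from the numbers, not something the diff states.

```python
from statistics import mean, pstdev, stdev

# Per-case pass rates from the NOTES.md table, in table order:
# base-branch-checkout, dirty-tree-worktree, seeded-on-main-momentum,
# feature-branch-in-place, typo-no-worktree.
with_skill = [1.0, 1.0, 1.0, 1.0, 0.0]
without_skill = [0.0, 0.0, 0.0, 1.0, 0.0]

# Means and pass-rate delta (with_skill - without_skill).
with_mean = round(mean(with_skill), 3)
without_mean = round(mean(without_skill), 3)
pass_rate_delta = round(with_mean - without_mean, 3)

# Population stddev reproduces the reported 0.4; sample stddev would not.
with_pstdev = round(pstdev(with_skill), 3)
with_sample_stdev = round(stdev(with_skill), 3)

# Duration and token deltas from the reported means.
duration_delta_ms = 47222 - 39003
token_delta = 16696 - 15475

print(with_mean, without_mean, pass_rate_delta)
print(with_pstdev, with_sample_stdev)
print(duration_delta_ms, token_delta)
```

With only five cases per condition and a standard deviation of 0.4 on each arm, the +0.6 delta rests on a small sample, which is consistent with the notes' advice to add runs before trusting individual cases.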

package/skills/working-in-isolation/evals/baseline/grading/base-branch-checkout__with_skill.json ADDED

@@ -0,0 +1,46 @@
+{
+  "assertion_results": [
+    {
+      "id": "did_not_edit_on_base_branch",
+      "passed": true,
+      "evidence": "The agent made no edits to src/http/client.ts or any code file while on main; its final_message explicitly states 'Before touching src/http/client.ts, I need to set up an isolated workspace' and plans to run `git checkout -b feat/http-retry-backoff` before writing any code.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "checked_out_branch",
+      "passed": true,
+      "evidence": "The agent explicitly proposed 'git checkout -b feat/http-retry-backoff' after syncing with origin via 'git pull origin main', and stated 'Making changes directly on `main` is prohibited' — demonstrating recognition of the base-branch rule and choosing a new branch rather than a worktree or working in place.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "ran_branch_command",
+      "passed": true,
+      "evidence": "matched ordinal 2: Bash {\"command\":\"git branch --show-current && git status --porcelain && git worktree list\",\"description\":\"Check current branch, tree cleanliness, and worktrees\"}",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "summary": {
+    "passed": 3,
+    "failed": 0,
+    "total": 3,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "Skill invocation verified from transcript.",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/working-in-isolation/evals/baseline/grading/base-branch-checkout__without_skill.json ADDED

@@ -0,0 +1,31 @@
+{
+  "assertion_results": [
+    {
+      "id": "did_not_edit_on_base_branch",
+      "passed": false,
+      "evidence": "The agent never issued a git checkout or branch-creation command (no such tool invocation exists in the record) and its final message claims 'I've added retry-with-backoff to `src/http/client.ts`' — confirming it made (or purported to make) the edit while still on `main`.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "checked_out_branch",
+      "passed": false,
+      "evidence": "The agent never checked out a new branch. Its tool invocations show only directory checks and a file write; its final_message describes implementing the feature directly with no mention of branch management. It worked in place on `main`.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "ran_branch_command",
+      "passed": false,
+      "evidence": "no tool invocation matched /git (checkout -b|switch -c|branch )/ across 5 invocation(s)",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "summary": {
+    "passed": 0,
+    "failed": 3,
+    "total": 3,
+    "pass_rate": 0
+  }
+}
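
The `ran_branch_command` evidence above quotes the regex used by the `transcript_check` grader. The following sketch shows how that pattern behaves on commands that appear elsewhere in this diff. The regex is copied from the evidence string; the matching function `first_match_ordinal` and its 1-based ordinal convention are hypothetical, since the grader's implementation is not part of this diff.

```python
import re
from typing import List, Optional

# Pattern as quoted in the `ran_branch_command` evidence string.
BRANCH_COMMAND = re.compile(r"git (checkout -b|switch -c|branch )")


def first_match_ordinal(commands: List[str]) -> Optional[int]:
    """Return the 1-based position of the first matching command, or None."""
    for ordinal, command in enumerate(commands, start=1):
        if BRANCH_COMMAND.search(command):
            return ordinal
    return None


# Commands taken from the grading evidence in this diff.
state_check = "git branch --show-current && git status --porcelain && git worktree list"
create_branch = "git checkout -b feat/http-retry-backoff"
sync = "git pull origin main"

print(first_match_ordinal([sync, create_branch]))  # 2
print(first_match_ordinal([state_check]))          # 1
print(first_match_ordinal([sync]))                 # None
```

Under this sketch, the read-only `git branch --show-current` satisfies the pattern through its `branch ` alternative, so a match on that command shows the agent inspected the branch rather than created one. That may be worth keeping in mind when reading the `with_skill` result, whose matched command is the state check.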

package/skills/working-in-isolation/evals/baseline/grading/dirty-tree-worktree__with_skill.json ADDED

@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "isolated_in_worktree",
+      "passed": true,
+      "evidence": "Agent ran `git worktree add .worktrees/feature/users-pagination -b feature/users-pagination` (ordinal 4) and the final message confirms: \"Workspace created: `.worktrees/feature/users-pagination` (branch `feature/users-pagination`) ... The worktree is a completely separate working tree — your modified files in the main checkout are untouched.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "created_worktree",
+      "passed": true,
+      "evidence": "matched ordinal 4: Bash {\"command\":\"git worktree add /Users/maximilianhaarhaus/Projects/slow-powers/.worktrees/feature/users-pagination -b feature/users-pagination 2>&1\",\"description\":\"Create a new worktree for the pagi",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "Skill invocation verified from transcript.",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}

package/skills/working-in-isolation/evals/baseline/grading/dirty-tree-worktree__without_skill.json ADDED

@@ -0,0 +1,24 @@
+{
+  "assertion_results": [
+    {
+      "id": "isolated_in_worktree",
+      "passed": false,
+      "evidence": "The agent never created or entered a worktree. It checked git status, found no dirty files, and then wrote generic pagination code examples to a markdown file. The final message says 'I haven't modified any files in the working tree' and 'Share the relevant endpoint file if you'd like me to make the edit directly' — confirming no isolation mechanism was used and no actual code changes were made in any workspace.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "created_worktree",
+      "passed": false,
+      "evidence": "no tool invocation matched /git worktree add/ across 5 invocation(s)",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "summary": {
+    "passed": 0,
+    "failed": 2,
+    "total": 2,
+    "pass_rate": 0
+  }
+}