@slowdini/slow-powers-opencode 0.1.4 → 0.1.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. package/README.md +3 -3
  2. package/bootstrap.md +19 -20
  3. package/package.json +1 -1
  4. package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md +8 -0
  5. package/skills/auditing-slow-powers-usage/evals/evals.json +2 -2
  6. package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md +1 -1
  7. package/skills/evaluating-skills/SKILL.md +1 -1
  8. package/skills/evaluating-skills/evals/evals.json +1 -1
  9. package/skills/finishing-a-development-branch/SKILL.md +1 -1
  10. package/skills/systematic-debugging/condition-based-waiting.md +10 -11
  11. package/skills/systematic-debugging/root-cause-tracing.md +31 -33
  12. package/skills/working-in-isolation/SKILL.md +58 -0
  13. package/skills/working-in-isolation/evals/baseline/BASELINE.md +22 -0
  14. package/skills/working-in-isolation/evals/baseline/NOTES.md +67 -0
  15. package/skills/working-in-isolation/evals/baseline/benchmark.json +51 -0
  16. package/skills/working-in-isolation/evals/baseline/grading/base-branch-checkout__with_skill.json +46 -0
  17. package/skills/working-in-isolation/evals/baseline/grading/base-branch-checkout__without_skill.json +31 -0
  18. package/skills/working-in-isolation/evals/baseline/grading/dirty-tree-worktree__with_skill.json +39 -0
  19. package/skills/working-in-isolation/evals/baseline/grading/dirty-tree-worktree__without_skill.json +24 -0
  20. package/skills/working-in-isolation/evals/baseline/grading/feature-branch-in-place__with_skill.json +32 -0
  21. package/skills/working-in-isolation/evals/baseline/grading/feature-branch-in-place__without_skill.json +17 -0
  22. package/skills/working-in-isolation/evals/baseline/grading/seeded-on-main-momentum__with_skill.json +39 -0
  23. package/skills/working-in-isolation/evals/baseline/grading/seeded-on-main-momentum__without_skill.json +24 -0
  24. package/skills/working-in-isolation/evals/baseline/grading/typo-no-worktree__with_skill.json +32 -0
  25. package/skills/working-in-isolation/evals/baseline/grading/typo-no-worktree__without_skill.json +17 -0
  26. package/skills/working-in-isolation/evals/evals.json +87 -0
  27. package/skills/writing-skills/SKILL.md +179 -195
  28. package/skills/using-git-worktrees/SKILL.md +0 -70
  29. package/skills/using-git-worktrees/evals/evals.json +0 -40
  30. package/skills/writing-skills/graphviz-conventions.dot +0 -172
  31. package/skills/writing-skills/scripts/render-graphs.js +0 -181
package/README.md CHANGED
@@ -26,7 +26,7 @@ Contributors closing parity gaps should follow [`harness-parity-check.md`](./har
26
26
 
27
27
  ## How it works
28
28
 
29
- Slow-powers integrates directly into your agent's session, providing a highly disciplined set of technical execution utilities. It enforces strict test-driven development (TDD), systematic scientific debugging, rigorous verification checks, safe workspace isolation via git worktrees, and clean branch-finishing hygiene. It also enhances native agent planning phases with strict rules: banning placeholders, enforcing atomic task granularity, and requiring TDD-first checklists.
29
+ Slow-powers integrates directly into your agent's session, providing a highly disciplined set of technical execution utilities. It enforces strict test-driven development (TDD), systematic scientific debugging, rigorous verification checks, safe workspace isolation so new work doesn't collide with existing work, and clean branch-finishing hygiene. It also enhances native agent planning phases with strict rules: banning placeholders, enforcing atomic task granularity, and requiring TDD-first checklists.
30
30
 
31
31
  ## Installation
32
32
 
@@ -91,7 +91,7 @@ This installs the latest published version from npm.
91
91
 
92
92
  Slow-powers provides a set of highly focused, execution-level skills that ensure your agent operates with maximum discipline:
93
93
 
94
- 1. **`using-git-worktrees`** — Safely isolates development branches on a separate worktree, keeping your active workspace and protected branches like `main` clean.
94
+ 1. **`working-in-isolation`** — Establishes an isolated workspace so new work doesn't collide with existing or in-progress work, keeping protected branches like `main` clean.
95
95
  2. **`test-driven-development`** — Enforces a strict RED-GREEN-REFACTOR cycle, ensuring all production code is backed by failing test verification first.
96
96
  3. **`systematic-debugging`** — Guides the agent to locate the root cause of failures via scientific hypothesis testing, avoiding "guess-and-check" thrashing.
97
97
  4. **`verification-before-completion`** — Requires running actual test/build commands and presenting concrete evidence before making any success claims.
@@ -104,7 +104,7 @@ Slow-powers provides a set of highly focused, execution-level skills that ensure
104
104
 
105
105
  **Debugging** — `systematic-debugging`
106
106
 
107
- **Workspace & Git Hygiene** — `using-git-worktrees`, `finishing-a-development-branch`
107
+ **Workspace & Git Hygiene** — `working-in-isolation`, `finishing-a-development-branch`
108
108
 
109
109
  **Meta & Extension** — `writing-skills`
110
110
 
package/bootstrap.md CHANGED
@@ -14,26 +14,25 @@ When you reach a gate moment — about to code, hand off a plan, debug, claim do
14
14
 
15
15
  **Invoke relevant or requested skills BEFORE any response or action.** Even a 1% chance a skill might apply means that you should invoke the skill to check. If an invoked skill turns out to be wrong for the situation, you don't need to use it.
16
16
 
17
- ```dot
18
- digraph skill_flow {
19
- "User message received" [shape=doublecircle];
20
- "Might any skill apply?" [shape=diamond];
21
- "Invoke skill mechanism" [shape=box];
22
- "Announce: 'Using [skill] to [purpose]'" [shape=box];
23
- "Has checklist?" [shape=diamond];
24
- "Create todo per item with persistent task tracker" [shape=box];
25
- "Follow skill exactly" [shape=box];
26
- "Respond (including clarifications)" [shape=doublecircle];
27
-
28
- "User message received" -> "Might any skill apply?";
29
- "Might any skill apply?" -> "Invoke skill mechanism" [label="yes, even 1%"];
30
- "Might any skill apply?" -> "Respond (including clarifications)" [label="definitely not"];
31
- "Invoke skill mechanism" -> "Announce: 'Using [skill] to [purpose]'";
32
- "Announce: 'Using [skill] to [purpose]'" -> "Has checklist?";
33
- "Has checklist?" -> "Create todo per item with persistent task tracker" [label="yes"];
34
- "Has checklist?" -> "Follow skill exactly" [label="no"];
35
- "Create todo per item with persistent task tracker" -> "Follow skill exactly";
36
- }
17
+ ```mermaid
18
+ flowchart TD
19
+ start([User message received])
20
+ apply{Might any skill apply?}
21
+ invoke[Invoke skill mechanism]
22
+ announce["Announce: 'Using [skill] to [purpose]'"]
23
+ checklist{Has checklist?}
24
+ todos[Create todo per item with persistent task tracker]
25
+ follow[Follow skill exactly]
26
+ respond(["Respond (including clarifications)"])
27
+
28
+ start --> apply
29
+ apply -->|yes, even 1%| invoke
30
+ apply -->|definitely not| respond
31
+ invoke --> announce
32
+ announce --> checklist
33
+ checklist -->|yes| todos
34
+ checklist -->|no| follow
35
+ todos --> follow
37
36
  ```
38
37
 
39
38
  ## Red Flags
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@slowdini/slow-powers-opencode",
3
- "version": "0.1.4",
3
+ "version": "0.1.5",
4
4
  "description": "Slow-powers — structured development workflows for coding agents (TDD, debugging, verification, git hygiene)",
5
5
  "type": "module",
6
6
  "main": "./opencode/plugins/slow-powers.js",
package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md CHANGED
@@ -4,6 +4,14 @@ Forward-looking observations from the run that produced this baseline. Provenanc
4
4
  `BASELINE.md`; numbers are in `benchmark.json`. This file is the "what a future iterator should
5
5
  know" companion.
6
6
 
7
+ > **⚠️ Baseline is stale (as of the `working-in-isolation` rename, #156).** The fixtures and
8
+ > `evals.json` rubrics were updated to rename `using-git-worktrees` → `working-in-isolation`, but
9
+ > the committed `grading/*.json` and the observations below were produced against the *old* name and
10
+ > are **not** re-graded — they're kept verbatim as the historical record. References to
11
+ > `using-git-worktrees` / "worktrees" in this file and in `grading/*.json` describe that past run;
12
+ > they are not live skill references. Re-run this eval to refresh the baseline before drawing new
13
+ > conclusions from it.
14
+
7
15
  ## Why this baseline exists despite a negative delta
8
16
 
9
17
  Headline delta is `pass_rate −0.084` (with_skill 0.833 vs without_skill 0.917). We promoted anyway
package/skills/auditing-slow-powers-usage/evals/evals.json CHANGED
@@ -32,7 +32,7 @@
32
32
  {
33
33
  "id": "audits-blindspot-session",
34
34
  "prompt": "Just finished a session over in the payments-gateway repo — notes are in session-summary.md. I'm working on slow-powers and want a read on how the skills did. Please run the post-session slow-powers usage audit on it.",
35
- "expected_output": "The agent produces the structured audit report. The distinguishing feature of this session is that the agent went straight from the feature request to editing source on the current branch and NEVER considered the skills that applied — test-driven-development (a new branch of refund logic with an existing test suite), using-git-worktrees (a feature change made directly on the checked-out branch), and verification-before-completion (claimed done without running the ~12s suite). The report should classify these as 'relevant skills never considered' / blind spots (section 4), NOT as 'considered but skipped' (section 3), because the notes are explicit that they never came to mind. Sections that don't apply (e.g. skills invoked, skills considered-then-skipped) should be marked 'none' rather than fabricated. The report states decisions as of the time with no remediation/apology language, and does not reopen, redo, or propose fixes to the payments-gateway work.",
35
+ "expected_output": "The agent produces the structured audit report. The distinguishing feature of this session is that the agent went straight from the feature request to editing source on the current branch and NEVER considered the skills that applied — test-driven-development (a new branch of refund logic with an existing test suite), working-in-isolation (a feature change made directly on the checked-out branch), and verification-before-completion (claimed done without running the ~12s suite). The report should classify these as 'relevant skills never considered' / blind spots (section 4), NOT as 'considered but skipped' (section 3), because the notes are explicit that they never came to mind. Sections that don't apply (e.g. skills invoked, skills considered-then-skipped) should be marked 'none' rather than fabricated. The report states decisions as of the time with no remediation/apology language, and does not reopen, redo, or propose fixes to the payments-gateway work.",
36
36
  "files": ["fixtures/audits-blindspot-session/session-summary.md"],
37
37
  "assertions": [
38
38
  {
@@ -43,7 +43,7 @@
43
43
  {
44
44
  "id": "blindspot_in_never_considered",
45
45
  "type": "llm_judge",
46
- "rubric": "The session notes state the agent never thought about test-driven-development, using-git-worktrees, or verification-before-completion (it went straight from request to editing source on the current branch). Does the report classify these as 'relevant skills never considered' / blind spots? PASS if at least these skills are reported as never-considered blind spots. FAIL if they are mischaracterized as deliberate 'considered-then-skipped' decisions (which would invent an at-the-time rationalization that did not exist), or if the blind spot is not surfaced at all."
46
+ "rubric": "The session notes state the agent never thought about test-driven-development, working-in-isolation, or verification-before-completion (it went straight from request to editing source on the current branch). Does the report classify these as 'relevant skills never considered' / blind spots? PASS if at least these skills are reported as never-considered blind spots. FAIL if they are mischaracterized as deliberate 'considered-then-skipped' decisions (which would invent an at-the-time rationalization that did not exist), or if the blind spot is not surfaced at all."
47
47
  },
48
48
  {
49
49
  "id": "no_remediation_language",
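package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md CHANGED
@@ -33,7 +33,7 @@ enough that I was confident in it." The user didn't push further.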
33
33
  Notes on environment for this session:
34
34
  - The repo has a `bun test` suite (~12 seconds) with existing refund tests in `test/refunds.test.ts`.
35
35
  - slow-powers was active; the session-start bootstrap listing was present, including
36
- `test-driven-development`, `using-git-worktrees`, and `verification-before-completion`.
36
+ `test-driven-development`, `working-in-isolation`, and `verification-before-completion`.
37
37
  - I did not at any point think about writing a test first, creating a branch/worktree, or running
38
38
  the suite — I went straight from the request to editing source on the current branch.
39
39
  - No git branch or worktree was created; edits were made on whatever branch was checked out.
package/skills/evaluating-skills/SKILL.md CHANGED
@@ -132,7 +132,7 @@ Do not dispatch until the user confirms *this summary*. An earlier "run the eval
132
132
 
133
133
  ### Sandbox decision
134
134
 
135
- A subagent under test runs the real skill, and some skills write to disk — the skill that triggered this gate, `using-git-worktrees`, creates git worktrees in whatever repo it's pointed at. Without active enforcement those writes land in your working directory.
135
+ A subagent under test runs the real skill, and some skills write to disk — the skill that triggered this gate, `working-in-isolation`, creates git worktrees in whatever repo it's pointed at. Without active enforcement those writes land in your working directory.
136
136
 
137
137
  - **Guard available (Claude Code):** arming `--guard` is the default. If you are about to run without it, STOP. Proceed unguarded **only** when the user actively opts out — and warn them that stray writes will then only be **detected after the fact** by `detect-stray-writes`, never blocked or reverted, so anything a subagent writes outside its `outputs/` dir (worktrees, installed packages, edited repo files) persists and is theirs to clean up.
138
138
  - **Guard unavailable (other harnesses):** there is no active write enforcement. Tell the user plainly: stray writes are detected and reported by `detect-stray-writes` but **not auto-cleaned** — they must review the report and remove anything that escaped. Harness-level write enforcement is tracked as a parity goal in `harness-parity-check.md`.
package/skills/evaluating-skills/evals/evals.json CHANGED
@@ -33,7 +33,7 @@
33
33
  },
34
34
  {
35
35
  "id": "deterministic-edit-skip",
36
- "prompt": "I removed the one line in our using-git-worktrees skill that tells the agent to announce out loud that it's using the skill. Nothing else changed. Do I need to run an eval before I ship this?",
36
+ "prompt": "I removed the one line in our working-in-isolation skill that tells the agent to announce out loud that it's using the skill. Nothing else changed. Do I need to run an eval before I ship this?",
37
37
  "expected_output": "The agent recognizes this as a deterministic instruction change — removing a one-line directive the agent reliably follows, not wording that decides a pressured or ambiguous choice — and concludes an eval is not warranted, stating that decision and its reasoning. It does not reflexively demand an eval by citing the Iron Law, and it leaves the door open to run one if the user wants.",
38
38
  "assertions": [
39
39
  {
package/skills/finishing-a-development-branch/SKILL.md CHANGED
@@ -85,7 +85,7 @@ git branch -D <feature-branch>
85
85
 
86
86
  ### Step 5: Clean Up Git Worktrees (Options 1 & 4 only)
87
87
 
88
- > **REQUIRED BACKGROUND:** You must understand `slow-powers:using-git-worktrees` for workspace isolation and worktree management.
88
+ > **REQUIRED BACKGROUND:** You must understand `slow-powers:working-in-isolation` for workspace isolation and worktree management.
89
89
 
90
90
  If the workspace is a worktree that you created (under `.worktrees/`, `worktrees/`, or `~/.config/slow-powers/worktrees/`), clean it up from the main repository root:
91
91
  ```bash
package/skills/systematic-debugging/condition-based-waiting.md CHANGED
@@ -8,17 +8,16 @@ Flaky tests often guess at timing with arbitrary delays. This creates race condi
8
8
 
9
9
  ## When to Use
10
10
 
11
- ```dot
12
- digraph when_to_use {
13
- "Test uses setTimeout/sleep?" [shape=diamond];
14
- "Testing timing behavior?" [shape=diamond];
15
- "Document WHY timeout needed" [shape=box];
16
- "Use condition-based waiting" [shape=box];
17
-
18
- "Test uses setTimeout/sleep?" -> "Testing timing behavior?" [label="yes"];
19
- "Testing timing behavior?" -> "Document WHY timeout needed" [label="yes"];
20
- "Testing timing behavior?" -> "Use condition-based waiting" [label="no"];
21
- }
11
+ ```mermaid
12
+ flowchart TD
13
+ sleep{Test uses setTimeout/sleep?}
14
+ timing{Testing timing behavior?}
15
+ document[Document WHY timeout needed]
16
+ use[Use condition-based waiting]
17
+
18
+ sleep -->|yes| timing
19
+ timing -->|yes| document
20
+ timing -->|no| use
22
21
  ```
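As a rough illustration of the "Use condition-based waiting" branch in the flowchart above, the sketch below polls a condition instead of sleeping for a fixed delay. The names `waitFor`, `timeoutMs`, and `intervalMs` are assumptions made for this example; the skill's own helper may look different.

```typescript
// Polls `condition` until it returns true or `timeoutMs` elapses.
// `waitFor`, `timeoutMs`, and `intervalMs` are illustrative names, not the skill's API.
async function waitFor(
  condition: () => boolean,
  timeoutMs: number = 1000,
  intervalMs: number = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      // Fail loudly with the budget that was exceeded, rather than hanging.
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    // Yield briefly before checking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A test would then write `await waitFor(() => queue.length > 0)` (with a hypothetical `queue`) rather than guessing how long the queue takes to fill. The timeout bounds the wait, so a broken condition still fails the test.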
23
22
 
24
23
  **Use when:**
package/skills/systematic-debugging/root-cause-tracing.md CHANGED
@@ -8,19 +8,18 @@ Bugs often manifest deep in the call stack (git init in wrong directory, file cr
8
8
 
9
9
  ## When to Use
10
10
 
11
- ```dot
12
- digraph when_to_use {
13
- "Bug appears deep in stack?" [shape=diamond];
14
- "Can trace backwards?" [shape=diamond];
15
- "Fix at symptom point" [shape=box];
16
- "Trace to original trigger" [shape=box];
17
- "BETTER: Also add defense-in-depth" [shape=box];
18
-
19
- "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
20
- "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
21
- "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
22
- "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
23
- }
11
+ ```mermaid
12
+ flowchart TD
13
+ deep{Bug appears deep in stack?}
14
+ trace{Can trace backwards?}
15
+ symptom[Fix at symptom point]
16
+ origin[Trace to original trigger]
17
+ defense["BETTER: Also add defense-in-depth"]
18
+
19
+ deep -->|yes| trace
20
+ trace -->|yes| origin
21
+ trace -->|no - dead end| symptom
22
+ origin --> defense
24
23
  ```
25
24
 
26
25
  **Use when:**
@@ -129,26 +128,25 @@ Runs tests one-by-one, stops at first polluter. See script for usage.
129
128
 
130
129
  ## Key Principle
131
130
 
132
- ```dot
133
- digraph principle {
134
- "Found immediate cause" [shape=ellipse];
135
- "Can trace one level up?" [shape=diamond];
136
- "Trace backwards" [shape=box];
137
- "Is this the source?" [shape=diamond];
138
- "Fix at source" [shape=box];
139
- "Add validation at each layer" [shape=box];
140
- "Bug impossible" [shape=doublecircle];
141
- "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];
142
-
143
- "Found immediate cause" -> "Can trace one level up?";
144
- "Can trace one level up?" -> "Trace backwards" [label="yes"];
145
- "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
146
- "Trace backwards" -> "Is this the source?";
147
- "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
148
- "Is this the source?" -> "Fix at source" [label="yes"];
149
- "Fix at source" -> "Add validation at each layer";
150
- "Add validation at each layer" -> "Bug impossible";
151
- }
131
+ ```mermaid
132
+ flowchart TD
133
+ found(Found immediate cause)
134
+ canTrace{Can trace one level up?}
135
+ back[Trace backwards]
136
+ isSource{Is this the source?}
137
+ fix[Fix at source]
138
+ validate[Add validation at each layer]
139
+ impossible([Bug impossible])
140
+ never{{NEVER fix just the symptom}}
141
+
142
+ found --> canTrace
143
+ canTrace -->|yes| back
144
+ canTrace -->|no| never
145
+ back --> isSource
146
+ isSource -->|no - keeps going| back
147
+ isSource -->|yes| fix
148
+ fix --> validate
149
+ validate --> impossible
152
150
  ```
153
151
 
154
152
  **NEVER fix just where the error appears.** Trace back to find the original trigger.
package/skills/working-in-isolation/SKILL.md ADDED
@@ -0,0 +1,58 @@
1
+ ---
2
+ name: working-in-isolation
3
+ description: Use when you're about to start code changes — a feature, bugfix, or refactor — to establish an isolated workspace so your work doesn't collide with existing or in-progress work.
4
+ ---
5
+
6
+ # Working in Isolation
7
+
8
+ Before changing code, make sure your work lands somewhere it won't collide with
9
+ existing or in-progress work. Decide the workspace based on the git state.
10
+ When in doubt, pause and ask the user.
11
+
12
+ ## Decision: where does this work go?
13
+
14
+ Check the current state, then take the **first** matching rule:
15
+
16
+ ```bash
17
+ git branch --show-current # current branch
18
+ git status --porcelain # empty = clean tree
19
+ git worktree list # >1 entry = worktrees already exist
20
+ ```
21
+
22
+ 1. **The user named a workspace** (explicit command, or a configured preference)
23
+ → follow it.
24
+ 2. **Dirty tree (staged or unstaged changes) OR worktrees already exist**
25
+ → a human or another agent is mid-work here. Use a **new worktree** so your
26
+ changes can't collide with theirs.
27
+ 3. **On `dev` / `main` / `master`** → sync with origin and **check out a new
28
+ branch**. Keeps the base clean and makes the work easy to review.
29
+ 4. **On any other branch** → **work in place.** The user already isolated this
30
+ workspace; adding a worktree is needless ceremony.
31
+
32
+ > **Hard rule: never make changes while on `dev` / `main` / `master`.** If you
33
+ > find yourself on a base branch, branch (rule 3) or worktree (rule 2) first.
34
+
35
+ ## Creating a worktree (rule 2)
36
+
37
+ Prefer the agent platform's **native isolation tool** if it has one. Otherwise
38
+ fall back to a git worktree:
39
+
40
+ ```bash
41
+ git worktree add .worktrees/<branch-name> -b <branch-name>
42
+ cd .worktrees/<branch-name>
43
+ ```
44
+
45
+ Keep the worktree out of version control: if `.worktrees/` isn't already
46
+ git-ignored, add it to `.gitignore` and commit that first. If worktree creation
47
+ fails (sandbox or permission limits), say so and fall back to checking out a
48
+ branch in place (rule 3).
49
+
50
+ ## After the workspace is set
51
+
52
+ Install dependencies and run the existing test suite once, to confirm a clean
53
+ baseline before you write anything.
54
+
55
+ Use the project-appropriate commands to verify the baseline is clean - lint, test, build.
56
+
57
+ If the baseline is already failing, report it before starting — you need to know
58
+ which failures you introduced.
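The first-match rule in the skill above can be sketched as a pure function. This is a minimal illustration, not the skill's implementation: `GitState`, `Workspace`, and `chooseWorkspace` are hypothetical names, and the inputs are assumed to come from the three git commands the skill lists. The "when in doubt, pause and ask the user" step is not modeled.

```typescript
// Hypothetical model of the state read from the three git commands in the skill.
interface GitState {
  branch: string;              // output of `git branch --show-current`
  porcelain: string;           // output of `git status --porcelain` (empty = clean tree)
  worktreeCount: number;       // number of entries in `git worktree list`
  userNamedWorkspace?: string; // explicit user instruction or configured preference, if any
}

type Workspace =
  | { kind: "user-specified"; where: string }
  | { kind: "new-worktree" }
  | { kind: "new-branch" }
  | { kind: "in-place" };

const BASE_BRANCHES = new Set<string>(["dev", "main", "master"]);

// Applies the four rules in order and returns the first match.
function chooseWorkspace(state: GitState): Workspace {
  if (state.userNamedWorkspace !== undefined) {
    return { kind: "user-specified", where: state.userNamedWorkspace }; // rule 1
  }
  const dirty = state.porcelain.trim().length > 0;
  if (dirty || state.worktreeCount > 1) {
    return { kind: "new-worktree" }; // rule 2
  }
  if (BASE_BRANCHES.has(state.branch)) {
    return { kind: "new-branch" }; // rule 3
  }
  return { kind: "in-place" }; // rule 4
}
```

Because rule 2 is checked before rule 3, a dirty tree on `main` yields a worktree rather than a plain branch checkout, which matches the ordering in the skill text.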
package/skills/working-in-isolation/evals/baseline/BASELINE.md ADDED
@@ -0,0 +1,22 @@
1
+ # Baseline — working-in-isolation
2
+
3
+ Committed reference output from a canonical eval run. Regenerate with
4
+ `bun run evals:promote-baseline -- --skill working-in-isolation --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
5
+ dispatch files, produced outputs) stays gitignored under `skills-workspace/`.
6
+
7
+ | Field | Value |
8
+ |-------|-------|
9
+ | Mode | new-skill |
10
+ | Iteration | iteration-3 |
11
+ | Harness | claude-code |
12
+ | Agent model | claude-sonnet-4-6 |
13
+ | Judge model | claude-sonnet-4-6 |
14
+ | Conditions | with_skill, without_skill |
15
+ | Run timestamp | 2026-06-03T07:33:13.084Z |
16
+ | Label | (none) |
17
+ | Promoted from commit | e428b0e |
18
+
19
+ Files:
20
+ - `benchmark.json` — aggregate pass-rate / duration / token deltas.
21
+ - `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
22
+
package/skills/working-in-isolation/evals/baseline/NOTES.md ADDED
@@ -0,0 +1,67 @@
1
+ # Baseline notes — working-in-isolation
2
+
3
+ Forward-looking observations from the canonical run (`new-skill`, iteration-3,
4
+ `claude-sonnet-4-6` agent + judge). Provenance is in `BASELINE.md`; headline
5
+ numbers are in `benchmark.json`. This file is the "what a future iterator should
6
+ know" companion.
7
+
8
+ ## Headline
9
+
10
+ `with_skill` 0.80 vs `without_skill` 0.20 → **+0.60 pass-rate delta**, skill
11
+ invocation **100% (5/5)**, **0 validity warnings**. Cost: +8.2s, +1.2k tokens.
12
+
13
+ ## Which cases discriminated
14
+
15
+ | Case | with | without | Notes |
16
+ |------|------|---------|-------|
17
+ | `base-branch-checkout` | 100% | 0% | The most important check (never edit on `main`). Clean +100%. |
18
+ | `dirty-tree-worktree` | 100% | 0% | +100% **this run**. The `without` arm did *not* isolate here — see variance note. |
19
+ | `seeded-on-main-momentum` | 100% | 0% | +100%. Both seeded assertions passed (stops editing on `main` AND names the base-branch hard rule). |
20
+ | `feature-branch-in-place` | 100% | 100% | Non-discriminating — the "work in place" case is easy enough that baseline gets it too. Candidate for a harder variant. |
21
+ | `typo-no-worktree` | 0% | 0% | Non-discriminating + environment-confounded — see below. |
22
+
23
+ ## Caveats a re-runner must know
24
+
25
+ - **`typo-no-worktree` is confounded by the real repo's branch state.** The
26
+ prompt says "On my working branch `docs-cleanup`", but the eval runs in the
27
+ actual slow-powers repo, which has no `docs-cleanup` branch and is on a
28
+ different branch. Agents that introspect real git state (both arms) discover
29
+ the branch is missing and propose creating it — graded as "isolation
30
+ ceremony" → both FAIL, delta 0. This is **symmetric** (hurts both arms
31
+ equally), so it doesn't bias the delta, but it means the case currently
32
+ measures nothing. To make it discriminating, either (a) state the full git
33
+ context in the prompt the way `base-branch-checkout` does ("you are on
34
+ `docs-cleanup`, clean tree"), or (b) give each subagent an isolated throwaway
35
+ repo whose real state matches the prompt.
36
+
37
+ - **Iteration-2 vs iteration-3 — why the delta jumped (+0.30 → +0.60).**
38
+ Iteration-2 dispatched all 10 subagents *in parallel against this one shared,
39
+ dirty repo*. Per the skill's own Rule 2 ("dirty tree **or** worktrees already
40
+ exist → worktree"), agents that ran real `git status`/`git worktree list` saw
41
+ (a) the repo's then-uncommitted #156 changes and (b) worktrees other parallel
42
+ siblings had just created, and so isolated when the case wanted work-in-place
43
+ — contaminating `typo` and depressing the measured delta. Iteration-3 fixed
44
+ this by **committing the tree clean first** and **dispatching sequentially
45
+ with `.worktrees/` cleanup between each dispatch**, so no agent sees another's
46
+ git state. Lesson for any git-state-dependent skill: do **not** run its eval
47
+ subagents concurrently in one shared repo.
48
+
49
+ - **The write guard does not block worktree creation.** `runner/sandbox-policy.ts`
50
+ `BASH_MUTATION_PATTERNS` matches `git (commit|add|push|checkout|reset|restore|merge|rebase)`
51
+ — **not** `git worktree`. So `--guard` lets subagents `git worktree add` real
52
+ worktrees into the repo; `detect-stray-writes` only flags them post-hoc. We
53
+ cleaned them by hand both runs. Conveniently this also means the orchestrator's
54
+ own `git worktree remove` between-dispatch cleanup is allowed under the armed
55
+ guard. If we want the guard to actually sandbox this skill's behavior, add
56
+ `worktree` to the mutation pattern (track as an eval-harness parity item).
57
+
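The write-guard caveat above can be checked with a small sketch. It uses only the pattern quoted in the note. The real `BASH_MUTATION_PATTERNS` in `runner/sandbox-policy.ts` may contain more entries, and `extendedPattern` is a hypothetical version of the fix the note suggests, not code from the package.

```typescript
// Pattern as quoted in the note; the actual list in sandbox-policy.ts may be longer.
const currentPattern = /git (commit|add|push|checkout|reset|restore|merge|rebase)/;

// Hypothetical extension that also treats `git worktree` as a mutation.
const extendedPattern = /git (commit|add|push|checkout|reset|restore|merge|rebase|worktree)/;

const command = "git worktree add .worktrees/feature-x -b feature-x";

// "git " is followed by "worktree", which is not in the current alternation,
// so the current pattern does not flag the command.
const blockedToday = currentPattern.test(command);

// With `worktree` added to the alternation, the same command is flagged.
const blockedIfExtended = extendedPattern.test(command);
```

One trade-off follows from the note itself: the extended pattern would also match the orchestrator's own `git worktree remove` cleanup, which the current guard allows.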
58
+ ## Variance / next-iteration ideas
59
+
60
+ - `without_skill` on `dirty-tree-worktree` is **unstable**: iteration-2 it
61
+ isolated (PASS), iteration-3 it didn't (FAIL). The explicit "don't disturb my
62
+ in-progress changes" phrasing sometimes elicits isolation even with no skill.
63
+ Add runs (n>1 per condition) before trusting that case's delta.
64
+ - `feature-branch-in-place` passes in both arms — replace or harden it (e.g.
65
+ add a competing attractor) so it earns its slot.
66
+ - Consider a second seeded case where the cleaner correction is a **worktree**
67
+ rather than `switch -c`, to cover the other branch of the hard rule.
package/skills/working-in-isolation/evals/baseline/benchmark.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "generated": "2026-06-03T07:50:45.496Z",
3
+ "mode": "new-skill",
4
+ "conditions_compared": ["with_skill", "without_skill"],
5
+ "missing_gradings": 0,
6
+ "validity_warnings": [],
7
+ "run_summary": {
8
+ "with_skill": {
9
+ "pass_rate": {
10
+ "mean": 0.8,
11
+ "stddev": 0.4,
12
+ "n": 5
13
+ },
14
+ "duration_ms": {
15
+ "mean": 47222,
16
+ "stddev": 13874,
17
+ "n": 5
18
+ },
19
+ "total_tokens": {
20
+ "mean": 16696,
21
+ "stddev": 917,
22
+ "n": 5
23
+ },
24
+ "skill_invocation_n": 5,
25
+ "skill_invocation_rate": 1
26
+ },
27
+ "without_skill": {
28
+ "pass_rate": {
29
+ "mean": 0.2,
30
+ "stddev": 0.4,
31
+ "n": 5
32
+ },
33
+ "duration_ms": {
34
+ "mean": 39003,
35
+ "stddev": 12238,
36
+ "n": 5
37
+ },
38
+ "total_tokens": {
39
+ "mean": 15475,
40
+ "stddev": 1473,
41
+ "n": 5
42
+ }
43
+ }
44
+ },
45
+ "delta": {
46
+ "direction": "with_skill - without_skill",
47
+ "pass_rate": 0.6,
48
+ "duration_ms": 8219,
49
+ "total_tokens": 1221
50
+ }
51
+ }
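The aggregate figures in this file can be reproduced from the per-case pass rates listed in the "Which cases discriminated" table of `NOTES.md`. The sketch below is an illustration, not the harness's aggregation code. It assumes a population standard deviation (dividing by n), which is an inference from the reported 0.4 rather than something the diff confirms.

```typescript
// Per-case pass rates copied from the NOTES.md table, in table order:
// base-branch-checkout, dirty-tree-worktree, seeded-on-main-momentum,
// feature-branch-in-place, typo-no-worktree.
const withSkill: number[] = [1, 1, 1, 1, 0];
const withoutSkill: number[] = [0, 0, 0, 1, 0];

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divides by n). Assumption: this form is used
// because it reproduces the reported 0.4; the sample form would give about 0.447.
function populationStddev(values: number[]): number {
  const m = mean(values);
  const variance =
    values.reduce((sum, v) => sum + (v - m) * (v - m), 0) / values.length;
  return Math.sqrt(variance);
}

// Direction matches the file: with_skill - without_skill.
const passRateDelta = mean(withSkill) - mean(withoutSkill);
```

With n = 5 and a standard deviation of 0.4 in each arm, a single case flipping moves the mean by 0.2, which is consistent with the NOTES.md advice to add runs before trusting per-case deltas.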
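package/skills/working-in-isolation/evals/baseline/grading/base-branch-checkout__with_skill.json ADDED
@@ -0,0 +1,46 @@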
1
+ {
2
+ "assertion_results": [
3
+ {
4
+ "id": "did_not_edit_on_base_branch",
5
+ "passed": true,
6
+ "evidence": "The agent made no edits to src/http/client.ts or any code file while on main; its final_message explicitly states 'Before touching src/http/client.ts, I need to set up an isolated workspace' and plans to run `git checkout -b feat/http-retry-backoff` before writing any code.",
7
+ "confidence": 1,
8
+ "grader": "llm_judge"
9
+ },
10
+ {
11
+ "id": "checked_out_branch",
12
+ "passed": true,
13
+ "evidence": "The agent explicitly proposed 'git checkout -b feat/http-retry-backoff' after syncing with origin via 'git pull origin main', and stated 'Making changes directly on `main` is prohibited' — demonstrating recognition of the base-branch rule and choosing a new branch rather than a worktree or working in place.",
14
+ "confidence": 1,
15
+ "grader": "llm_judge"
16
+ },
17
+ {
18
+ "id": "ran_branch_command",
19
+ "passed": true,
20
+ "evidence": "matched ordinal 2: Bash {\"command\":\"git branch --show-current && git status --porcelain && git worktree list\",\"description\":\"Check current branch, tree cleanliness, and worktrees\"}",
21
+ "confidence": 1,
22
+ "grader": "transcript_check"
23
+ }
24
+ ],
25
+ "summary": {
26
+ "passed": 3,
27
+ "failed": 0,
28
+ "total": 3,
29
+ "pass_rate": 1
30
+ },
31
+ "meta_results": [
32
+ {
33
+ "id": "__skill_invoked",
34
+ "passed": true,
35
+ "evidence": "Skill invocation verified from transcript.",
36
+ "confidence": 1,
37
+ "grader": "transcript_check"
38
+ }
39
+ ],
40
+ "meta_summary": {
41
+ "passed": 1,
42
+ "failed": 0,
43
+ "total": 1,
44
+ "skill_invoked": true
45
+ }
46
+ }
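package/skills/working-in-isolation/evals/baseline/grading/base-branch-checkout__without_skill.json ADDED
@@ -0,0 +1,31 @@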
1
+ {
2
+ "assertion_results": [
3
+ {
4
+ "id": "did_not_edit_on_base_branch",
5
+ "passed": false,
6
+ "evidence": "The agent never issued a git checkout or branch-creation command (no such tool invocation exists in the record) and its final message claims 'I've added retry-with-backoff to `src/http/client.ts`' — confirming it made (or purported to make) the edit while still on `main`.",
7
+ "confidence": 1,
8
+ "grader": "llm_judge"
9
+ },
10
+ {
11
+ "id": "checked_out_branch",
12
+ "passed": false,
13
+ "evidence": "The agent never checked out a new branch. Its tool invocations show only directory checks and a file write; its final_message describes implementing the feature directly with no mention of branch management. It worked in place on `main`.",
14
+ "confidence": 1,
15
+ "grader": "llm_judge"
16
+ },
17
+ {
18
+ "id": "ran_branch_command",
19
+ "passed": false,
20
+ "evidence": "no tool invocation matched /git (checkout -b|switch -c|branch )/ across 5 invocation(s)",
21
+ "confidence": 1,
22
+ "grader": "transcript_check"
23
+ }
24
+ ],
25
+ "summary": {
26
+ "passed": 0,
27
+ "failed": 3,
28
+ "total": 3,
29
+ "pass_rate": 0
30
+ }
31
+ }
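The `ran_branch_command` assertion in the two grading files above is a regex check over recorded tool invocations. The sketch below shows roughly how such a check behaves, using the pattern quoted in the evidence string. `firstMatchOrdinal` and the sample transcripts are hypothetical, and the 1-based ordinal is an assumption about how the harness numbers invocations.

```typescript
// Pattern copied from the transcript_check evidence string in the grading file.
const branchCommand = /git (checkout -b|switch -c|branch )/;

// Returns the ordinal (assumed 1-based) of the first matching invocation,
// or null when no invocation matches.
function firstMatchOrdinal(commands: string[], pattern: RegExp): number | null {
  const index = commands.findIndex((command) => pattern.test(command));
  return index === -1 ? null : index + 1;
}

// Hypothetical transcripts for illustration; these are not the recorded runs.
const withBranch: string[] = [
  "ls",
  "git branch --show-current && git status --porcelain && git worktree list",
  "git checkout -b feat/http-retry-backoff",
];
const withoutBranch: string[] = ["ls src", "cat src/http/client.ts"];

const matched = firstMatchOrdinal(withBranch, branchCommand);
const unmatched = firstMatchOrdinal(withoutBranch, branchCommand);
```

Under this pattern the read-only `git branch --show-current` already matches, because of the `branch ` alternative. That appears to be what the `with_skill` run matched at ordinal 2, so the check alone may not distinguish inspecting a branch from creating one.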
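package/skills/working-in-isolation/evals/baseline/grading/dirty-tree-worktree__with_skill.json ADDED
@@ -0,0 +1,39 @@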
1
+ {
2
+ "assertion_results": [
3
+ {
4
+ "id": "isolated_in_worktree",
5
+ "passed": true,
6
+ "evidence": "Agent ran `git worktree add .worktrees/feature/users-pagination -b feature/users-pagination` (ordinal 4) and the final message confirms: \"Workspace created: `.worktrees/feature/users-pagination` (branch `feature/users-pagination`) ... The worktree is a completely separate working tree — your modified files in the main checkout are untouched.\"",
7
+ "confidence": 1,
8
+ "grader": "llm_judge"
9
+ },
10
+ {
11
+ "id": "created_worktree",
12
+ "passed": true,
13
+ "evidence": "matched ordinal 4: Bash {\"command\":\"git worktree add /Users/maximilianhaarhaus/Projects/slow-powers/.worktrees/feature/users-pagination -b feature/users-pagination 2>&1\",\"description\":\"Create a new worktree for the pagi",
14
+ "confidence": 1,
15
+ "grader": "transcript_check"
16
+ }
17
+ ],
18
+ "summary": {
19
+ "passed": 2,
20
+ "failed": 0,
21
+ "total": 2,
22
+ "pass_rate": 1
23
+ },
24
+ "meta_results": [
25
+ {
26
+ "id": "__skill_invoked",
27
+ "passed": true,
28
+ "evidence": "Skill invocation verified from transcript.",
29
+ "confidence": 1,
30
+ "grader": "transcript_check"
31
+ }
32
+ ],
33
+ "meta_summary": {
34
+ "passed": 1,
35
+ "failed": 0,
36
+ "total": 1,
37
+ "skill_invoked": true
38
+ }
39
+ }
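package/skills/working-in-isolation/evals/baseline/grading/dirty-tree-worktree__without_skill.json ADDED
@@ -0,0 +1,24 @@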
1
+ {
2
+ "assertion_results": [
3
+ {
4
+ "id": "isolated_in_worktree",
5
+ "passed": false,
6
+ "evidence": "The agent never created or entered a worktree. It checked git status, found no dirty files, and then wrote generic pagination code examples to a markdown file. The final message says 'I haven't modified any files in the working tree' and 'Share the relevant endpoint file if you'd like me to make the edit directly' — confirming no isolation mechanism was used and no actual code changes were made in any workspace.",
7
+ "confidence": 1,
8
+ "grader": "llm_judge"
9
+ },
10
+ {
11
+ "id": "created_worktree",
12
+ "passed": false,
13
+ "evidence": "no tool invocation matched /git worktree add/ across 5 invocation(s)",
14
+ "confidence": 1,
15
+ "grader": "transcript_check"
16
+ }
17
+ ],
18
+ "summary": {
19
+ "passed": 0,
20
+ "failed": 2,
21
+ "total": 2,
22
+ "pass_rate": 0
23
+ }
24
+ }