@slowdini/slow-powers-opencode 0.4.4 → 0.5.0
- package/README.md +2 -2
- package/package.json +14 -14
- package/skills/evaluating-skills/SKILL.md +6 -6
- package/skills/evaluating-skills/evals/baseline/BASELINE.md +2 -3
- package/skills/hardening-plans/evals/baseline/BASELINE.md +2 -3
- package/skills/{systematic-debugging → investigating-bugs}/SKILL.md +5 -7
- package/skills/{systematic-debugging → investigating-bugs}/condition-based-waiting-example.ts +3 -3
- package/skills/{systematic-debugging → investigating-bugs}/condition-based-waiting.md +1 -9
- package/skills/investigating-bugs/evals/baseline/BASELINE.md +23 -0
- package/skills/investigating-bugs/evals/baseline/benchmark.json +51 -0
- package/skills/investigating-bugs/evals/baseline/grading/feature-request-no-debugging__with_skill.json +17 -0
- package/skills/investigating-bugs/evals/baseline/grading/feature-request-no-debugging__without_skill.json +17 -0
- package/skills/investigating-bugs/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json +46 -0
- package/skills/investigating-bugs/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json +31 -0
- package/skills/investigating-bugs/evals/baseline/grading/seeded-stacked-guess-investigate-first__with_skill.json +46 -0
- package/skills/investigating-bugs/evals/baseline/grading/seeded-stacked-guess-investigate-first__without_skill.json +31 -0
- package/skills/investigating-bugs/evals/baseline/grading/seeded-three-fix-limit-stop__with_skill.json +39 -0
- package/skills/investigating-bugs/evals/baseline/grading/seeded-three-fix-limit-stop__without_skill.json +24 -0
- package/skills/investigating-bugs/evals/evals.json +89 -0
- package/skills/test-driven-development/SKILL.md +2 -0
- package/skills/verifying-development-work/SKILL.md +37 -20
- package/skills/verifying-development-work/code-review.md +49 -10
- package/skills/verifying-development-work/evals/baseline/NOTES.md +4 -4
- package/skills/verifying-development-work/evals/evals.json +57 -5
- package/skills/verifying-development-work/evals/fixtures/grown-long-file/field-validators.test.ts +47 -0
- package/skills/verifying-development-work/evals/fixtures/grown-long-file/field-validators.ts +532 -0
- package/skills/verifying-development-work/long-files.md +141 -0
- package/skills/working-in-isolation/SKILL.md +16 -2
- package/skills/working-in-isolation/evals/evals.json +4 -4
- package/skills/writing-skills/SKILL.md +2 -2
- package/skills/systematic-debugging/CREATION-LOG.md +0 -119
- package/skills/systematic-debugging/defense-in-depth.md +0 -122
- package/skills/systematic-debugging/evals/baseline/BASELINE.md +0 -22
- package/skills/systematic-debugging/evals/baseline/benchmark.json +0 -51
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__with_skill.json +0 -17
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__without_skill.json +0 -17
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json +0 -46
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json +0 -31
- package/skills/systematic-debugging/evals/evals.json +0 -45
- package/skills/systematic-debugging/find-polluter.sh +0 -63
- package/skills/systematic-debugging/root-cause-tracing.md +0 -167
- package/skills/systematic-debugging/test-academic.md +0 -14
- package/skills/systematic-debugging/test-pressure-1.md +0 -58
- package/skills/systematic-debugging/test-pressure-2.md +0 -68
- package/skills/systematic-debugging/test-pressure-3.md +0 -69
- package/skills/verifying-development-work/comment-review.md +0 -85
- /package/skills/{systematic-debugging → investigating-bugs}/evals/fixtures/order-bug/orderHandler.ts +0 -0
- /package/skills/{systematic-debugging → investigating-bugs}/evals/fixtures/order-bug/repro.ts +0 -0
package/README.md
CHANGED
|
@@ -69,7 +69,7 @@ opencode plugin @slowdini/slow-powers-opencode -g
|
|
|
69
69
|
Slow-powers provides a set of highly focused skills that ensure your agent operates with maximum discipline:
|
|
70
70
|
|
|
71
71
|
1. **`hardening-plans`** — Instructs the agent to re-review any plans before it hands them back to you, looking for hallucinations, logical inconsistencies, and other common plan mistakes.
|
|
72
|
-
2. **`
|
|
72
|
+
2. **`investigating-bugs`** — Guides the agent to locate the root cause of failures via scientific hypothesis testing, avoiding "guess-and-check" thrashing.
|
|
73
73
|
3. **`working-in-isolation`** — Establishes an isolated workspace (worktree or branch) so new work doesn't collide with existing or in-progress work, keeping protected branches like `main` clean.
|
|
74
74
|
4. **`test-driven-development`** — Enforces a strict RED-GREEN-REFACTOR cycle, ensuring all code is backed by failing test verification first.
|
|
75
75
|
5. **`verifying-development-work`** — Requires running actual test/build commands and presenting concrete evidence before any success claim, with a final review pass over the change, code AND comments, before work is handed back.
|
|
@@ -82,7 +82,7 @@ The skills declare prerequisite / next-step gates so the agent follows an intend
|
|
|
82
82
|
|
|
83
83
|
**Plan mode:** plan mode → `hardening-plans` → `working-in-isolation` → `test-driven-development` → `verifying-development-work`
|
|
84
84
|
|
|
85
|
-
**Debugging:** (`working-in-isolation`) → `
|
|
85
|
+
**Debugging:** (`working-in-isolation`) → `investigating-bugs` → `verifying-development-work`
|
|
86
86
|
|
|
87
87
|
## Philosophy
|
|
88
88
|
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@slowdini/slow-powers-opencode",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.5.0",
|
|
4
4
|
"description": "Slow-powers — structured development workflows for coding agents (TDD, debugging, verification, git hygiene)",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"main": "./opencode/plugins/slow-powers.js",
|
|
@@ -36,19 +36,19 @@
|
|
|
36
36
|
},
|
|
37
37
|
"scripts": {
|
|
38
38
|
"test": "bun test --path-ignore-patterns='skills-workspace/**'",
|
|
39
|
-
"evals": "
|
|
40
|
-
"evals:snapshot": "
|
|
41
|
-
"evals:validate": "
|
|
42
|
-
"evals:ingest": "
|
|
43
|
-
"evals:finalize": "
|
|
44
|
-
"evals:record-runs": "
|
|
45
|
-
"evals:fill-transcripts": "
|
|
46
|
-
"evals:detect-stray-writes": "
|
|
47
|
-
"evals:teardown-guard": "
|
|
48
|
-
"evals:teardown": "
|
|
49
|
-
"evals:grade": "
|
|
50
|
-
"evals:aggregate": "
|
|
51
|
-
"evals:promote-baseline": "
|
|
39
|
+
"evals": "eval-magic run --skill-dir ./skills --bootstrap ./bootstrap.md",
|
|
40
|
+
"evals:snapshot": "eval-magic snapshot --skill-dir ./skills",
|
|
41
|
+
"evals:validate": "eval-magic validate --skill-dir ./skills",
|
|
42
|
+
"evals:ingest": "eval-magic ingest --skill-dir ./skills",
|
|
43
|
+
"evals:finalize": "eval-magic finalize --skill-dir ./skills",
|
|
44
|
+
"evals:record-runs": "eval-magic record-runs --skill-dir ./skills",
|
|
45
|
+
"evals:fill-transcripts": "eval-magic fill-transcripts --skill-dir ./skills",
|
|
46
|
+
"evals:detect-stray-writes": "eval-magic detect-stray-writes --skill-dir ./skills",
|
|
47
|
+
"evals:teardown-guard": "eval-magic teardown-guard --skill-dir ./skills",
|
|
48
|
+
"evals:teardown": "eval-magic teardown --skill-dir ./skills",
|
|
49
|
+
"evals:grade": "eval-magic grade --skill-dir ./skills",
|
|
50
|
+
"evals:aggregate": "eval-magic aggregate --skill-dir ./skills",
|
|
51
|
+
"evals:promote-baseline": "eval-magic promote-baseline --skill-dir ./skills",
|
|
52
52
|
"check": "biome check --write .",
|
|
53
53
|
"check:ci": "biome check --error-on-warnings .",
|
|
54
54
|
"typecheck": "tsc --noEmit",
|
|
package/skills/evaluating-skills/SKILL.md
CHANGED
@@ -5,7 +5,7 @@ description: Use when testing whether a new skill improves agent behavior, or wh
|
|
|
5
5
|
|
|
6
6
|
# Evaluating Skills
|
|
7
7
|
|
|
8
|
-
Skill development has two phases: **drafting** (`slow-powers:writing-skills`) and **evaluation** (this skill). This skill owns the *craft* of evaluation — deciding whether a change needs measuring, designing test cases, devising pressure-testing scenarios, writing assertions, and reading results. The *mechanics* of actually running an eval — building the workspace, staging skills, dispatching subagents, grading, aggregating — are owned by a dedicated tool, **[eval-magic](https://github.com/slowdini/eval-magic)**, which ships as a dependency-less prebuilt binary you invoke as `
|
|
8
|
+
Skill development has two phases: **drafting** (`slow-powers:writing-skills`) and **evaluation** (this skill). This skill owns the *craft* of evaluation — deciding whether a change needs measuring, designing test cases, devising pressure-testing scenarios, writing assertions, and reading results. The *mechanics* of actually running an eval — building the workspace, staging skills, dispatching subagents, grading, aggregating — are owned by a dedicated tool, **[eval-magic](https://github.com/slowdini/eval-magic)**, which ships as a dependency-less prebuilt binary you invoke as `eval-magic`. See [Running the eval](#running-the-eval) for the hand-off.
|
|
9
9
|
|
|
10
10
|
## Overview
|
|
11
11
|
|
|
@@ -95,7 +95,7 @@ A test case has these parts:
|
|
|
95
95
|
- **files** (optional): fixture files the prompt references
|
|
96
96
|
- **skill_should_trigger** (optional, default `true`): set `false` for a *negative* eval where correct behavior is the skill **not** firing (e.g. an over-trigger guard — a feature request that shouldn't launch a debugging investigation). Negative evals are excluded from the skill-invocation rate, so a correct non-invocation isn't mistaken for the skill failing to fire.
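
The exclusion described for `skill_should_trigger` can be sketched as follows. This is a minimal illustration, assuming a hypothetical `EvalRun` record shape and a hypothetical `skillInvocationRate` helper; eval-magic's actual data model and aggregation code may differ.

```typescript
// Hypothetical record shape for illustration only.
interface EvalRun {
  evalId: string;
  skillShouldTrigger?: boolean; // treated as true when omitted
  skillInvoked: boolean;
}

// Invocation rate over positive evals only. Negative evals
// (skillShouldTrigger === false) are left out of the denominator, so a
// correct non-invocation is not counted as the skill failing to fire.
function skillInvocationRate(runs: EvalRun[]): { n: number; rate: number | null } {
  const positive = runs.filter((run) => run.skillShouldTrigger ?? true);
  if (positive.length === 0) {
    return { n: 0, rate: null }; // nothing to measure
  }
  const invoked = positive.filter((run) => run.skillInvoked).length;
  return { n: positive.length, rate: invoked / positive.length };
}

// Four runs, one of them a negative eval where the skill correctly stayed quiet.
const exampleRuns: EvalRun[] = [
  { evalId: "bug-a", skillInvoked: true },
  { evalId: "bug-b", skillInvoked: true },
  { evalId: "bug-c", skillInvoked: true },
  { evalId: "feature-request", skillShouldTrigger: false, skillInvoked: false },
];

console.log(skillInvocationRate(exampleRuns)); // n is 3, rate is 1
```

Counting the negative eval in the denominator would instead report 3 of 4, which is the misreading the field is meant to prevent.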
|
|
97
97
|
|
|
98
|
-
Cases live in `<skill>/evals/evals.json`. For the file shape, see the author-template example in the eval-magic README and validate against the bundled schema with `
|
|
98
|
+
Cases live in `<skill>/evals/evals.json`. For the file shape, see the author-template example in the eval-magic README and validate against the bundled schema with `eval-magic validate`; for worked, maintained examples, read the live suites in this repo — e.g. `skills/verifying-development-work/evals/evals.json` and `skills/hardening-plans/evals/evals.json`.
|
|
99
99
|
|
|
100
100
|
Tips for writing good prompts:
|
|
101
101
|
|
|
@@ -142,7 +142,7 @@ Keep the seeded turns short and concrete; the point is to establish momentum, no
|
|
|
142
142
|
|
|
143
143
|
**The ceiling — state it plainly.** A seed is *text the subagent reads*, not a state it operates under. It cannot place the agent in a harness-injected mode — a real plan mode, an enforced multi-phase workflow, genuine context-window pressure — it can only *describe* one. So when the wild failure you're chasing was *caused* by such a mode (the documented case: an agent in plan mode that invoked **zero** skills because the mode's own procedure made loading them feel redundant), a text seed cannot fully reproduce it — the causal layer is exactly the one a prompt string can't inject. A seeded **pass is therefore necessary but not sufficient** — it under-estimates real-session difficulty — and a seed that *fails* to reproduce a known wild failure is usually hitting this ceiling, not testing a bad seed. Treat seeded results as a stronger-than-cold signal, not as ground truth, and don't let downstream work over-trust them.
|
|
144
144
|
|
|
145
|
-
**Narrowing the gap — `--plan-mode`.** For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: its `--plan-mode` flag injects the harness's *verbatim* plan-mode procedure into every dispatch as an operating-context layer the subagent is told it is operating under, rather than a paraphrase the agent merely reads in the seed prose. This narrows the gap (verbatim procedure > paraphrase) but does **not** close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm. See `
|
|
145
|
+
**Narrowing the gap — `--plan-mode`.** For the documented plan-mode case, the runner offers the highest-fidelity in-runner approximation: its `--plan-mode` flag injects the harness's *verbatim* plan-mode procedure into every dispatch as an operating-context layer the subagent is told it is operating under, rather than a paraphrase the agent merely reads in the seed prose. This narrows the gap (verbatim procedure > paraphrase) but does **not** close it: it is still text the agent reads, not an injected mode, so the necessary-not-sufficient ceiling above stands unchanged. Use it as the strongest in-runner signal and pair it with a paraphrase-seed arm. See `eval-magic run --help` for the flag and the per-harness profiles it depends on.
|
|
146
146
|
|
|
147
147
|
## Writing assertions
|
|
148
148
|
|
|
@@ -186,12 +186,12 @@ Once a run is graded and aggregated, the headline is the **delta**: what the ski
|
|
|
186
186
|
|
|
187
187
|
## Running the eval
|
|
188
188
|
|
|
189
|
-
The mechanics of executing a run live in **[eval-magic](https://github.com/slowdini/eval-magic)** — the `
|
|
189
|
+
The mechanics of executing a run live in **[eval-magic](https://github.com/slowdini/eval-magic)** — the `eval-magic` binary. eval-magic's README is the complete operating guide, and every flag is documented in the tool's own help.
|
|
190
190
|
|
|
191
191
|
| Need | Where |
|
|
192
192
|
|------|-------|
|
|
193
193
|
| Quickstart, install, the two modes end-to-end | the eval-magic README |
|
|
194
|
-
| Every subcommand and flag; the `--skill-dir` model; workspace layout | `
|
|
194
|
+
| Every subcommand and flag; the `--skill-dir` model; workspace layout | `eval-magic --help` and `eval-magic <subcommand> --help` |
|
|
195
195
|
| Full run mechanics: dispatch loop, transcript access, grading, aggregating, baselines | the eval-magic README |
|
|
196
196
|
| Claude Code & Codex harness specifics — isolating from installed plugins, the guard, judging | the README's Harnesses section |
|
|
197
197
|
| What a harness needs to reach Claude-Code-tier support | `docs/harness-parity.md` |
|
|
@@ -200,5 +200,5 @@ The mechanics of executing a run live in **[eval-magic](https://github.com/slowd
|
|
|
200
200
|
|
|
201
201
|
- `slow-powers:writing-skills` — drafting a skill (Phase 1)
|
|
202
202
|
- `pressure-scenarios.md` — pressure-scenario taxonomy for authoring prompts that stress discipline-enforcing skills
|
|
203
|
-
- eval-magic (the `
|
|
203
|
+
- eval-magic (the `eval-magic` tool) — runs the evals this skill teaches you to author
|
|
204
204
|
- agentskills.io/skill-creation/evaluating-skills — the methodology this skill is derived from
|
|
package/skills/evaluating-skills/evals/baseline/BASELINE.md
CHANGED
@@ -1,9 +1,9 @@
|
|
|
1
1
|
# Baseline — evaluating-skills
|
|
2
2
|
|
|
3
3
|
Committed reference output from a canonical eval run. Regenerate with
|
|
4
|
-
`
|
|
4
|
+
`eval-magic promote-baseline --skill evaluating-skills --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
|
|
5
5
|
dispatch files, produced outputs) stays gitignored under `skills-workspace/`
|
|
6
|
-
and is reclaimable by `
|
|
6
|
+
and is reclaimable by `eval-magic teardown` once promoted (this commit's marker).
|
|
7
7
|
|
|
8
8
|
| Field | Value |
|
|
9
9
|
|-------|-------|
|
|
@@ -20,4 +20,3 @@ and is reclaimable by `skill-eval teardown` once promoted (this commit's marker)
|
|
|
20
20
|
Files:
|
|
21
21
|
- `benchmark.json` — aggregate pass-rate / duration / token deltas.
|
|
22
22
|
- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
|
|
23
|
-
|
|
package/skills/hardening-plans/evals/baseline/BASELINE.md
CHANGED
@@ -1,9 +1,9 @@
|
|
|
1
1
|
# Baseline — hardening-plans
|
|
2
2
|
|
|
3
3
|
Committed reference output from a canonical eval run. Regenerate with
|
|
4
|
-
`
|
|
4
|
+
`eval-magic promote-baseline --skill hardening-plans --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
|
|
5
5
|
dispatch files, produced outputs) stays gitignored under `skills-workspace/`
|
|
6
|
-
and is reclaimable by `
|
|
6
|
+
and is reclaimable by `eval-magic teardown` once promoted (this commit's marker).
|
|
7
7
|
|
|
8
8
|
| Field | Value |
|
|
9
9
|
|-------|-------|
|
|
@@ -24,4 +24,3 @@ the named-hand-off requirement). `new_skill` = the working tree at promotion
|
|
|
24
24
|
Files:
|
|
25
25
|
- `benchmark.json` — aggregate pass-rate / duration / token deltas.
|
|
26
26
|
- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
|
|
27
|
-
|
|
package/skills/{systematic-debugging → investigating-bugs}/SKILL.md
RENAMED
@@ -1,9 +1,9 @@
|
|
|
1
1
|
---
|
|
2
|
-
name:
|
|
2
|
+
name: investigating-bugs
|
|
3
3
|
description: Use when encountering any bug, test failure, build error, or unexpected behavior.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
#
|
|
6
|
+
# Investigating Bugs
|
|
7
7
|
|
|
8
8
|
Avoid "guess-and-check" coding. Always identify the root cause before making changes.
|
|
9
9
|
|
|
@@ -22,11 +22,13 @@ Avoid "guess-and-check" coding. Always identify the root cause before making cha
|
|
|
22
22
|
Before changing any code:
|
|
23
23
|
1. **Read Error Messages and Stack Traces:** Read every line of the error. Note the exact file, line number, and error codes.
|
|
24
24
|
2. **Reproduce Consistently:** Identify the exact steps, inputs, or environment needed to trigger the bug. If it cannot be reproduced, gather more logs instead of guessing.
|
|
25
|
+
* For flaky tests that pass sometimes and fail under load, the cause is usually arbitrary `sleep`/timeout delays. Wait on the actual condition, not a guessed duration — see `condition-based-waiting.md` in this directory.
|
|
25
26
|
3. **Check Recent Changes:** Run a git diff. Analyze recent commits, dependency additions, or config changes.
|
|
26
27
|
4. **Gather Evidence (Multi-Component Systems):**
|
|
27
28
|
* Log inputs and outputs at every component boundary.
|
|
28
29
|
* Instrument the layers step-by-step (e.g., Workflow -> Build Script -> Runtime -> DB) to pinpoint exactly where the state breaks.
|
|
29
30
|
5. **Trace Data Flow:** Trace variables backward from the failure point to their source. Fix the bug at the source, not the symptom.
|
|
31
|
+
* When manual tracing dead-ends, instrument the suspect operation: log the key inputs, relevant environment, and a captured stack trace (`new Error().stack`) *just before* it runs. In tests, write to stderr — a logger may be suppressed. Read the captured stack to find the original caller, then remove the instrumentation.
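
The instrumentation step in item 5 can be sketched like this. It is a minimal illustration under stated assumptions: `traceSuspectCall` and `writeReport` are hypothetical names that do not come from the skill, and the environment values are whatever the caller chooses to pass in.

```typescript
// Hypothetical helper, not part of the skill: records context right before a suspect call.
function traceSuspectCall(
  label: string,
  inputs: Record<string, unknown>,
  environment: Record<string, unknown> = {},
): string {
  // Capturing the stack here shows who called into the suspect operation.
  const stack = new Error(`trace: ${label}`).stack ?? "(stack unavailable)";
  const report = [
    `[trace] ${label}`,
    `inputs: ${JSON.stringify(inputs)}`,
    `environment: ${JSON.stringify(environment)}`,
    stack,
  ].join("\n");
  // Write to stderr: a test runner may suppress logger output.
  console.error(report);
  return report;
}

// Example: instrument a hypothetical write that sometimes targets the wrong directory.
function writeReport(targetDir: string): void {
  traceSuspectCall("writeReport", { targetDir });
  // ... the suspect operation would run here ...
}

writeReport("/tmp/reports");
```

Once the captured stack has identified the original caller, the helper call is deleted again, as the step instructs.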
|
|
30
32
|
|
|
31
33
|
---
|
|
32
34
|
|
|
@@ -56,14 +58,12 @@ Before changing any code:
|
|
|
56
58
|
4. **The Three-Fix Limit (Architectural Check):**
|
|
57
59
|
* If you attempt **three separate fixes** and the bug remains: **STOP.**
|
|
58
60
|
* This is a strong signal that the issue is architectural (e.g., wrong model assumptions, coupled state, race conditions).
|
|
59
|
-
* Re-evaluate the system architecture and discuss the approach with
|
|
61
|
+
* Re-evaluate the system architecture and discuss the approach with the user before attempting a fourth patch.
|
|
60
62
|
|
|
61
63
|
---
|
|
62
64
|
|
|
63
65
|
## Common Rationalizations
|
|
64
66
|
|
|
65
|
-
> **Note:** The rationalizations below are prospective — they represent likely excuses an agent might produce under pressure, but they have not yet been validated through actual eval runs. After running pressure-test evals, replace or augment these with verbatim quotes from failed runs.
|
|
66
|
-
|
|
67
67
|
| Excuse | Reality |
|
|
68
68
|
|--------|---------|
|
|
69
69
|
| "This is an emergency, we don't have time" | 5 minutes of investigation beats 5 hours of chasing symptoms. |
|
|
@@ -77,8 +77,6 @@ Before changing any code:
|
|
|
77
77
|
|
|
78
78
|
## Red Flags — STOP and Reset
|
|
79
79
|
|
|
80
|
-
> **Note:** The red flags below are prospective — they represent likely warning signs, but they have not yet been validated through actual eval runs.
|
|
81
|
-
|
|
82
80
|
- Writing a fix before reproducing the bug or reading the full stack trace
|
|
83
81
|
- "Let's just try changing X to see if it works"
|
|
84
82
|
- Stacking multiple speculative fixes on top of each other
|
package/skills/{systematic-debugging → investigating-bugs}/condition-based-waiting-example.ts
RENAMED
|
@@ -1,6 +1,6 @@
|
|
|
1
|
-
// Complete implementation of condition-based waiting utilities
|
|
2
|
-
//
|
|
3
|
-
//
|
|
1
|
+
// Complete implementation of condition-based waiting utilities.
|
|
2
|
+
// Domain-specific example (event-driven thread manager) showing how the generic
|
|
3
|
+
// waitFor pattern specializes into reusable helpers for a real codebase.
|
|
4
4
|
|
|
5
5
|
import type { ThreadManager } from "~/threads/thread-manager";
|
|
6
6
|
import type { LaceEvent, LaceEventType } from "~/threads/types";
|
|
package/skills/{systematic-debugging → investigating-bugs}/condition-based-waiting.md
RENAMED
@@ -78,7 +78,7 @@ async function waitFor<T>(
|
|
|
78
78
|
}
|
|
79
79
|
```
|
|
80
80
|
|
|
81
|
-
See `condition-based-waiting-example.ts` in this directory for complete implementation with domain-specific helpers (`waitForEvent`, `waitForEventCount`, `waitForEventMatch`)
|
|
81
|
+
See `condition-based-waiting-example.ts` in this directory for a complete implementation with domain-specific helpers (`waitForEvent`, `waitForEventCount`, `waitForEventMatch`).
|
|
82
82
|
|
|
83
83
|
## Common Mistakes
|
|
84
84
|
|
|
@@ -104,11 +104,3 @@ await new Promise(r => setTimeout(r, 200)); // Then: wait for timed behavior
|
|
|
104
104
|
1. First wait for triggering condition
|
|
105
105
|
2. Based on known timing (not guessing)
|
|
106
106
|
3. Comment explaining WHY
|
|
107
|
-
|
|
108
|
-
## Real-World Impact
|
|
109
|
-
|
|
110
|
-
From debugging session (2025-10-03):
|
|
111
|
-
- Fixed 15 flaky tests across 3 files
|
|
112
|
-
- Pass rate: 60% → 100%
|
|
113
|
-
- Execution time: 40% faster
|
|
114
|
-
- No more race conditions
|
|
package/skills/investigating-bugs/evals/baseline/BASELINE.md
ADDED
@@ -0,0 +1,23 @@
|
|
|
1
|
+
# Baseline — investigating-bugs
|
|
2
|
+
|
|
3
|
+
Committed reference output from a canonical eval run. Regenerate with
|
|
4
|
+
`eval-magic promote-baseline --skill investigating-bugs --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
|
|
5
|
+
dispatch files, produced outputs) stays gitignored under `skills-workspace/`
|
|
6
|
+
and is reclaimable by `eval-magic teardown` once promoted (this commit's marker).
|
|
7
|
+
|
|
8
|
+
| Field | Value |
|
|
9
|
+
|-------|-------|
|
|
10
|
+
| Mode | new-skill |
|
|
11
|
+
| Iteration | iteration-1 |
|
|
12
|
+
| Harness | claude-code |
|
|
13
|
+
| Agent model | unspecified |
|
|
14
|
+
| Judge model | unspecified |
|
|
15
|
+
| Conditions | with_skill, without_skill |
|
|
16
|
+
| Run timestamp | 2026-06-11T04:22:46.590Z |
|
|
17
|
+
| Label | issue-207 rename + pressure validation (Mode A, sonnet-4-6) |
|
|
18
|
+
| Promoted from commit | 37289e4 |
|
|
19
|
+
|
|
20
|
+
Files:
|
|
21
|
+
- `benchmark.json` — aggregate pass-rate / duration / token deltas.
|
|
22
|
+
- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
|
|
23
|
+
|
|
package/skills/investigating-bugs/evals/baseline/benchmark.json
ADDED
@@ -0,0 +1,51 @@
|
|
|
1
|
+
{
|
|
2
|
+
"generated": "2026-06-11T04:28:59.764Z",
|
|
3
|
+
"mode": "new-skill",
|
|
4
|
+
"conditions_compared": ["with_skill", "without_skill"],
|
|
5
|
+
"missing_gradings": 0,
|
|
6
|
+
"validity_warnings": [],
|
|
7
|
+
"run_summary": {
|
|
8
|
+
"with_skill": {
|
|
9
|
+
"pass_rate": {
|
|
10
|
+
"mean": 1.0,
|
|
11
|
+
"stddev": 0.0,
|
|
12
|
+
"n": 4
|
|
13
|
+
},
|
|
14
|
+
"duration_ms": {
|
|
15
|
+
"mean": 33639.0,
|
|
16
|
+
"stddev": 10185.0,
|
|
17
|
+
"n": 4
|
|
18
|
+
},
|
|
19
|
+
"total_tokens": {
|
|
20
|
+
"mean": 87928.0,
|
|
21
|
+
"stddev": 26986.0,
|
|
22
|
+
"n": 4
|
|
23
|
+
},
|
|
24
|
+
"skill_invocation_n": 3,
|
|
25
|
+
"skill_invocation_rate": 1.0
|
|
26
|
+
},
|
|
27
|
+
"without_skill": {
|
|
28
|
+
"pass_rate": {
|
|
29
|
+
"mean": 0.917,
|
|
30
|
+
"stddev": 0.144,
|
|
31
|
+
"n": 4
|
|
32
|
+
},
|
|
33
|
+
"duration_ms": {
|
|
34
|
+
"mean": 24224.0,
|
|
35
|
+
"stddev": 4179.0,
|
|
36
|
+
"n": 4
|
|
37
|
+
},
|
|
38
|
+
"total_tokens": {
|
|
39
|
+
"mean": 72100.0,
|
|
40
|
+
"stddev": 10863.0,
|
|
41
|
+
"n": 4
|
|
42
|
+
}
|
|
43
|
+
}
|
|
44
|
+
},
|
|
45
|
+
"delta": {
|
|
46
|
+
"direction": "with_skill - without_skill",
|
|
47
|
+
"pass_rate": 0.083,
|
|
48
|
+
"duration_ms": 9415.0,
|
|
49
|
+
"total_tokens": 15828.0
|
|
50
|
+
}
|
|
51
|
+
}
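
The `delta` block above is consistent with subtracting the `without_skill` means from the `with_skill` means, in the stated direction. The sketch below reproduces that arithmetic. The `ConditionMeans` type and `computeDelta` function are hypothetical, and rounding the pass rate to three decimals is an assumption made to match the figures shown, not a documented eval-magic behavior.

```typescript
// Hypothetical shape holding the per-condition means from run_summary.
interface ConditionMeans {
  passRate: number;
  durationMs: number;
  totalTokens: number;
}

// Round to three decimals (an assumption, chosen to match the reported 0.083).
function round3(value: number): number {
  return Math.round(value * 1000) / 1000;
}

// Direction: with_skill - without_skill.
function computeDelta(withSkill: ConditionMeans, withoutSkill: ConditionMeans): ConditionMeans {
  return {
    passRate: round3(withSkill.passRate - withoutSkill.passRate),
    durationMs: withSkill.durationMs - withoutSkill.durationMs,
    totalTokens: withSkill.totalTokens - withoutSkill.totalTokens,
  };
}

const baselineDelta = computeDelta(
  { passRate: 1.0, durationMs: 33639, totalTokens: 87928 },
  { passRate: 0.917, durationMs: 24224, totalTokens: 72100 },
);

console.log(baselineDelta);
// → { passRate: 0.083, durationMs: 9415, totalTokens: 15828 }
```

Read this way, the skill raised the mean pass rate by about 8 points at a cost of roughly 9.4 seconds and 15.8k tokens per run.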
|
|
package/skills/investigating-bugs/evals/baseline/grading/feature-request-no-debugging__with_skill.json
ADDED
@@ -0,0 +1,17 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "did_not_overtrigger_debugging",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "The agent responded with a direct feature implementation plan ('Here's how to add a `--verbose` flag to your CLI that prints each file as it's processed') covering argument parsing, threading the flag through processing logic, and wiring it up at the entry point — with no mention of bug reproduction, stack traces, root cause investigation, or any debugging phases.",
|
|
7
|
+
"confidence": 1.0,
|
|
8
|
+
"grader": "llm_judge"
|
|
9
|
+
}
|
|
10
|
+
],
|
|
11
|
+
"summary": {
|
|
12
|
+
"passed": 1,
|
|
13
|
+
"failed": 0,
|
|
14
|
+
"total": 1,
|
|
15
|
+
"pass_rate": 1.0
|
|
16
|
+
}
|
|
17
|
+
}
|
|
package/skills/investigating-bugs/evals/baseline/grading/feature-request-no-debugging__without_skill.json
ADDED
@@ -0,0 +1,17 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "did_not_overtrigger_debugging",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "The agent responded with straightforward implementation guidance: 'To add a `--verbose` flag that prints each file as it's processed, here's what needs to change' — covering argument parsing, threading the flag into the processing loop, and updating help text. No debugging phases, bug reproduction, stack trace analysis, or failure hypotheses appear anywhere in the response.",
|
|
7
|
+
"confidence": 1.0,
|
|
8
|
+
"grader": "llm_judge"
|
|
9
|
+
}
|
|
10
|
+
],
|
|
11
|
+
"summary": {
|
|
12
|
+
"passed": 1,
|
|
13
|
+
"failed": 0,
|
|
14
|
+
"total": 1,
|
|
15
|
+
"pass_rate": 1.0
|
|
16
|
+
}
|
|
17
|
+
}
|
|
package/skills/investigating-bugs/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json
ADDED
@@ -0,0 +1,46 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "inspected_before_patching",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "matched ordinal 1: Read {\"file_path\":\"/Users/maximilianhaarhaus/Projects/slow-powers/skills-workspace/investigating-bugs/iteration-1/eval-null-id-crash-investigate-first/with_skill/inputs/orderHandler.ts\"}",
|
|
7
|
+
"confidence": 1.0,
|
|
8
|
+
"grader": "transcript_check"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "root_cause_before_fix",
|
|
12
|
+
"passed": true,
|
|
13
|
+
"evidence": "Tool invocations show the agent read orderHandler.ts (ordinal 1) and repro.ts (ordinal 2) before writing any fix (ordinal 5). The final message explicitly states 'Before touching anything, I investigated the root cause' and explains WHY customer is undefined: 'The Order type declares customer as optional (customer?:), so guest-checkout orders arrive without it. When they do, order.customer is undefined and accessing .id throws.'",
|
|
14
|
+
"confidence": 0.95,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
},
|
|
17
|
+
{
|
|
18
|
+
"id": "fix_targets_source_not_symptom",
|
|
19
|
+
"passed": true,
|
|
20
|
+
"evidence": "The agent explicitly named the root cause: 'The Order type declares customer as optional (customer?:), so guest-checkout orders arrive without it. When they do, order.customer is undefined and accessing .id throws.' It also distinguished source from symptom: 'This is deterministic — every guest checkout triggers it. The intermittency you see in production is just the rate of guest-checkout orders, not non-determinism in the code.' The fix (a guard in describeOrder itself) follows directly from this diagnosis, and the agent noted 'The fix lives at the root cause (the function's own missing guard), not at every call site.'",
|
|
21
|
+
"confidence": 1.0,
|
|
22
|
+
"grader": "llm_judge"
|
|
23
|
+
}
|
|
24
|
+
],
|
|
25
|
+
"summary": {
|
|
26
|
+
"passed": 3,
|
|
27
|
+
"failed": 0,
|
|
28
|
+
"total": 3,
|
|
29
|
+
"pass_rate": 1.0
|
|
30
|
+
},
|
|
31
|
+
"meta_results": [
|
|
32
|
+
{
|
|
33
|
+
"id": "__skill_invoked",
|
|
34
|
+
"passed": true,
|
|
35
|
+
"evidence": "The agent's final_message opens with 'Before touching anything, I investigated the root cause' — a direct echo of the skill's Iron Law ('NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST'). It then follows the skill's Phase 1 (tracing data flow back to the optional `customer?:` field), Phase 3 (formulating a single hypothesis: 'every guest checkout triggers it'), and Phase 4 (writing a failing test before implementing the fix, then verifying no regressions). The claim 'The intermittency you see in production is just the rate of guest-checkout orders, not non-determinism in the code' mirrors the skill's Phase 1 Step 2 instruction to 'Reproduce Consistently' and explain apparent flakiness.",
|
|
36
|
+
"confidence": 0.97,
|
|
37
|
+
"grader": "llm_judge"
|
|
38
|
+
}
|
|
39
|
+
],
|
|
40
|
+
"meta_summary": {
|
|
41
|
+
"passed": 1,
|
|
42
|
+
"failed": 0,
|
|
43
|
+
"total": 1,
|
|
44
|
+
"skill_invoked": true
|
|
45
|
+
}
|
|
46
|
+
}
|
|
package/skills/investigating-bugs/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json
ADDED
@@ -0,0 +1,31 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "inspected_before_patching",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "matched ordinal 1: Read {\"file_path\":\"/Users/maximilianhaarhaus/Projects/slow-powers/skills-workspace/investigating-bugs/iteration-1/eval-null-id-crash-investigate-first/without_skill/inputs/orderHandler.ts\"}",
|
|
7
|
+
"confidence": 1.0,
|
|
8
|
+
"grader": "transcript_check"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "root_cause_before_fix",
|
|
12
|
+
"passed": false,
|
|
13
|
+
"evidence": "Tool invocations show the agent read orderHandler.ts (ordinal 1) and repro.ts (ordinal 2) and then immediately wrote the patched file (ordinal 4) without ever running repro.ts or executing any code to reproduce the crash. No Bash invocation ran the repro script or constructed a customer-less order; the agent inferred the root cause from reading alone and applied the fix in the same pass.",
|
|
14
|
+
"confidence": 0.95,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
},
|
|
17
|
+
{
|
|
18
|
+
"id": "fix_targets_source_not_symptom",
|
|
19
|
+
"passed": true,
|
|
20
|
+
"evidence": "\"Guest-checkout orders arrive without a `customer`, which triggers the `TypeError` intermittently.\" — the agent explicitly named the root cause (guest-checkout orders legitimately have no customer) and identified the code's wrong assumption before applying the guard.",
|
|
21
|
+
"confidence": 0.95,
|
|
22
|
+
"grader": "llm_judge"
|
|
23
|
+
}
|
|
24
|
+
],
|
|
25
|
+
"summary": {
|
|
26
|
+
"passed": 2,
|
|
27
|
+
"failed": 1,
|
|
28
|
+
"total": 3,
|
|
29
|
+
"pass_rate": 0.6666666666666666
|
|
30
|
+
}
|
|
31
|
+
}
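
The `summary` object in a grading file like the one above can be derived from its `assertion_results`. The sketch below shows one way to do that, assuming a hypothetical `summarize` helper; the field names mirror the JSON shown, but this is not eval-magic's own code.

```typescript
// Minimal view of one assertion result; other fields from the JSON are omitted.
interface AssertionResult {
  id: string;
  passed: boolean;
}

interface GradingSummary {
  passed: number;
  failed: number;
  total: number;
  pass_rate: number;
}

// Count passes and failures, then divide. An empty list yields a pass_rate of 0
// here, which is a choice made for this sketch only.
function summarize(results: AssertionResult[]): GradingSummary {
  const passed = results.filter((result) => result.passed).length;
  const total = results.length;
  return {
    passed,
    failed: total - passed,
    total,
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

// The three assertions graded above: two passed, one failed.
const nullIdWithoutSkill = summarize([
  { id: "inspected_before_patching", passed: true },
  { id: "root_cause_before_fix", passed: false },
  { id: "fix_targets_source_not_symptom", passed: true },
]);

console.log(nullIdWithoutSkill.pass_rate); // → 0.6666666666666666
```

The unrounded `0.6666666666666666` in the file is what a plain floating-point division of 2 by 3 produces.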
|
|
package/skills/investigating-bugs/evals/baseline/grading/seeded-stacked-guess-investigate-first__with_skill.json
ADDED
@@ -0,0 +1,46 @@
+{
+  "assertion_results": [
+    {
+      "id": "investigated_before_second_patch",
+      "passed": true,
+      "evidence": "matched ordinal 1: Read {\"file_path\":\"/Users/maximilianhaarhaus/Projects/slow-powers/skills-workspace/investigating-bugs/iteration-1/eval-seeded-stacked-guess-investigate-first/with_skill/inputs/orderHandler.ts\"}",
+      "confidence": 1.0,
+      "grader": "transcript_check"
+    },
+    {
+      "id": "rejects_stacked_guess",
+      "passed": true,
+      "evidence": "Agent explicitly refused: 'I will not add another guard on top of the existing patch. That would be a third speculative change on an uninvestigated root cause.' It then investigated the root cause, identifying that customer is optional in the type and guest-checkout orders legitimately arrive with no customer field, before proposing a targeted fix.",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "root_cause_named",
+      "passed": true,
+      "evidence": "The root cause is a **missing branch for guest-checkout orders in `describeOrder`**. The type correctly marks `customer` as optional (these orders are valid and intentional — guest checkouts), but the function was written as if a customer is always present.",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 3,
+    "failed": 0,
+    "total": 3,
+    "pass_rate": 1.0
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent explicitly invokes multiple skill fingerprints: it names the 'Red Flag pattern' verbatim ('each \"fix\" only shifts the bug to a new location'), quotes the skill's reset instruction ('STOP. Revert changes. Return to Phase 1.'), labels its investigation section 'Root cause investigation (Phase 1)' and its fix section 'Phase 4', and cites the skill's iron-law framing by refusing to add a third speculative change — all language and structure drawn directly from the skill content.",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
@@ -0,0 +1,31 @@
+{
+  "assertion_results": [
+    {
+      "id": "investigated_before_second_patch",
+      "passed": true,
+      "evidence": "matched ordinal 1: Read {\"file_path\":\"/Users/maximilianhaarhaus/Projects/slow-powers/skills-workspace/investigating-bugs/iteration-1/eval-seeded-stacked-guess-investigate-first/without_skill/inputs/orderHandler.ts\"}",
+      "confidence": 1.0,
+      "grader": "transcript_check"
+    },
+    {
+      "id": "rejects_stacked_guess",
+      "passed": true,
+      "evidence": "The agent explicitly declined to add the guard without investigation: 'we've now applied two patches in a row based on guesses, and each one revealed a deeper problem. Before we add a third guard, it's worth spending two minutes understanding what's actually happening.' It then read the source files and correctly identified that describeOrder is a logging function, not a save path, and asked to see the order-save and invoice-generation paths before changing any code.",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "root_cause_named",
+      "passed": true,
+      "evidence": "The repro script shows that guest-checkout orders ({ id: \"A-1002\" }) legitimately have no customer — that's expected behaviour for guest checkouts.",
+      "confidence": 0.95,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 3,
+    "failed": 0,
+    "total": 3,
+    "pass_rate": 1.0
+  }
+}
@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "stops_at_three_fixes",
+      "passed": true,
+      "evidence": "\"I can't ship a fourth patch. Three separate fixes failed, which is the signal to stop — not to keep stacking guesses. [...] The Three-Fix Limit has triggered.\" The agent explicitly declines the fourth timeout/retry patch and redirects to root-cause investigation (reproduce deterministically, read failure output, check git history).",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "names_architectural_cause",
+      "passed": true,
+      "evidence": "\"That pattern means we're chasing a symptom, not fixing a root cause\" and \"whether this is a timing issue, a data issue, or a race condition\" — the agent explicitly names race condition as a candidate root cause and frames the problem as architectural rather than a matter of waiting longer.",
+      "confidence": 0.95,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1.0
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent explicitly names the skill's rule verbatim: '**The Three-Fix Limit has triggered.**' — a direct quote of the Phase 4 heading 'The Three-Fix Limit (Architectural Check)' — and structures its refusal around that rule, matching the skill's prescribed behavior of stopping after three failed fixes and redirecting to root-cause investigation.",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
@@ -0,0 +1,24 @@
+{
+  "assertion_results": [
+    {
+      "id": "stops_at_three_fixes",
+      "passed": true,
+      "evidence": "Agent explicitly declines the fourth patch: 'three fixes in a row haven't worked, and adding more timeouts and retries on top of them is very likely to make things worse or just mask the problem until it resurfaces in production.' It then pivots to root-cause diagnosis, proposing to check whether the batch insert completes for all 3 records and whether the test is using the correct DB connection.",
+      "confidence": 1.0,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "names_architectural_cause",
+      "passed": true,
+      "evidence": "Agent names two architectural root causes: 'Are we looking at the right DB connection? If the test and the batch handler are using different connections or the read is hitting a read replica, the transaction commit won’t help.' and 'Is the batch insert itself completing for all 3 records? If the insert is silently failing or swallowing an error for one record, no amount of waiting will fix it.' Both go well beyond 'wait longer' framing.",
+      "confidence": 0.95,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1.0
+  }
+}