@slowdini/slow-powers-opencode 0.1.0
- package/LICENSE +22 -0
- package/README.md +174 -0
- package/bootstrap.md +16 -0
- package/opencode/plugins/slow-powers.js +86 -0
- package/package.json +66 -0
- package/skills/auditing-slow-powers-usage/SKILL.md +157 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/BASELINE.md +22 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md +72 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/benchmark.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__with_skill.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__without_skill.json +38 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__with_skill.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__without_skill.json +38 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__with_skill.json +17 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__without_skill.json +17 -0
- package/skills/auditing-slow-powers-usage/evals/evals.json +74 -0
- package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md +39 -0
- package/skills/auditing-slow-powers-usage/evals/fixtures/audits-completed-session/session-summary.md +33 -0
- package/skills/evaluating-skills/SKILL.md +448 -0
- package/skills/evaluating-skills/evals/evals.json +52 -0
- package/skills/evaluating-skills/evals/fixtures/iron-law/candidate-skill.md +13 -0
- package/skills/evaluating-skills/examples/verification-before-completion-evals.json +30 -0
- package/skills/evaluating-skills/harness-details/claude.md +135 -0
- package/skills/evaluating-skills/pressure-scenarios.md +163 -0
- package/skills/evaluating-skills/runner/README.md +140 -0
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +263 -0
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +146 -0
- package/skills/evaluating-skills/runner/aggregate.test.ts +188 -0
- package/skills/evaluating-skills/runner/aggregate.ts +228 -0
- package/skills/evaluating-skills/runner/context.test.ts +181 -0
- package/skills/evaluating-skills/runner/context.ts +90 -0
- package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +103 -0
- package/skills/evaluating-skills/runner/detect-stray-writes.ts +192 -0
- package/skills/evaluating-skills/runner/fill-transcripts.test.ts +73 -0
- package/skills/evaluating-skills/runner/fill-transcripts.ts +154 -0
- package/skills/evaluating-skills/runner/grade.test.ts +347 -0
- package/skills/evaluating-skills/runner/grade.ts +603 -0
- package/skills/evaluating-skills/runner/guard/guard.ts +49 -0
- package/skills/evaluating-skills/runner/guard/install.test.ts +92 -0
- package/skills/evaluating-skills/runner/guard/install.ts +147 -0
- package/skills/evaluating-skills/runner/guard/policy.test.ts +71 -0
- package/skills/evaluating-skills/runner/guard/policy.ts +74 -0
- package/skills/evaluating-skills/runner/promote-baseline.test.ts +230 -0
- package/skills/evaluating-skills/runner/promote-baseline.ts +186 -0
- package/skills/evaluating-skills/runner/run.test.ts +716 -0
- package/skills/evaluating-skills/runner/run.ts +814 -0
- package/skills/evaluating-skills/runner/sandbox-policy.ts +74 -0
- package/skills/evaluating-skills/runner/types.ts +104 -0
- package/skills/evaluating-skills/runner/validate-all.ts +54 -0
- package/skills/evaluating-skills/runner/validate-schema.test.ts +99 -0
- package/skills/evaluating-skills/runner/validate-schema.ts +51 -0
- package/skills/evaluating-skills/runner/validate.test.ts +56 -0
- package/skills/evaluating-skills/runner/validate.ts +21 -0
- package/skills/evaluating-skills/schema/evals.schema.json +105 -0
- package/skills/evaluating-skills/schema/grading.schema.json +84 -0
- package/skills/evaluating-skills/schema/run-record.schema.json +80 -0
- package/skills/evaluating-skills/schema/stray-writes.schema.json +68 -0
- package/skills/evaluating-skills/templates/eval-task-prompt.md +71 -0
- package/skills/evaluating-skills/templates/evals.json.example +17 -0
- package/skills/evaluating-skills/templates/judge-prompt.md +56 -0
- package/skills/evaluating-skills/templates/revise-skill-prompt.md +56 -0
- package/skills/finishing-a-development-branch/SKILL.md +96 -0
- package/skills/finishing-a-development-branch/evals/evals.json +41 -0
- package/skills/finishing-a-development-branch/evals/fixtures/finish/package.json +4 -0
- package/skills/finishing-a-development-branch/evals/fixtures/finish/sum.test.ts +5 -0
- package/skills/hardening-plans/SKILL.md +72 -0
- package/skills/hardening-plans/evals/baseline/BASELINE.md +22 -0
- package/skills/hardening-plans/evals/baseline/NOTES.md +58 -0
- package/skills/hardening-plans/evals/baseline/benchmark.json +54 -0
- package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__new_skill.json +39 -0
- package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__old_skill.json +39 -0
- package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__new_skill.json +24 -0
- package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__old_skill.json +24 -0
- package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__new_skill.json +46 -0
- package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__old_skill.json +46 -0
- package/skills/hardening-plans/evals/evals.json +114 -0
- package/skills/systematic-debugging/CREATION-LOG.md +119 -0
- package/skills/systematic-debugging/SKILL.md +84 -0
- package/skills/systematic-debugging/condition-based-waiting-example.ts +164 -0
- package/skills/systematic-debugging/condition-based-waiting.md +115 -0
- package/skills/systematic-debugging/defense-in-depth.md +122 -0
- package/skills/systematic-debugging/evals/baseline/BASELINE.md +22 -0
- package/skills/systematic-debugging/evals/baseline/benchmark.json +51 -0
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__with_skill.json +17 -0
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__without_skill.json +17 -0
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json +46 -0
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json +31 -0
- package/skills/systematic-debugging/evals/evals.json +45 -0
- package/skills/systematic-debugging/evals/fixtures/order-bug/orderHandler.ts +9 -0
- package/skills/systematic-debugging/evals/fixtures/order-bug/repro.ts +10 -0
- package/skills/systematic-debugging/find-polluter.sh +63 -0
- package/skills/systematic-debugging/root-cause-tracing.md +169 -0
- package/skills/systematic-debugging/test-academic.md +14 -0
- package/skills/systematic-debugging/test-pressure-1.md +58 -0
- package/skills/systematic-debugging/test-pressure-2.md +68 -0
- package/skills/systematic-debugging/test-pressure-3.md +69 -0
- package/skills/test-driven-development/SKILL.md +93 -0
- package/skills/test-driven-development/evals/baseline/BASELINE.md +22 -0
- package/skills/test-driven-development/evals/baseline/NOTES.md +74 -0
- package/skills/test-driven-development/evals/baseline/benchmark.json +51 -0
- package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__with_skill.json +53 -0
- package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__without_skill.json +38 -0
- package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__with_skill.json +32 -0
- package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__without_skill.json +17 -0
- package/skills/test-driven-development/evals/evals.json +77 -0
- package/skills/test-driven-development/evals/fixtures/slugify/package.json +4 -0
- package/skills/test-driven-development/evals/fixtures/slugify/utils.ts +7 -0
- package/skills/test-driven-development/testing-anti-patterns.md +299 -0
- package/skills/using-git-worktrees/SKILL.md +70 -0
- package/skills/using-git-worktrees/evals/evals.json +40 -0
- package/skills/verification-before-completion/SKILL.md +65 -0
- package/skills/verification-before-completion/evals/baseline/BASELINE.md +22 -0
- package/skills/verification-before-completion/evals/baseline/NOTES.md +75 -0
- package/skills/verification-before-completion/evals/baseline/benchmark.json +51 -0
- package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__with_skill.json +39 -0
- package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__without_skill.json +24 -0
- package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__with_skill.json +46 -0
- package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__without_skill.json +31 -0
- package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__with_skill.json +46 -0
- package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__without_skill.json +31 -0
- package/skills/verification-before-completion/evals/evals.json +77 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/api.ts +1 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/consumer.ts +3 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/tsconfig.json +23 -0
- package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.test.ts +10 -0
- package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.ts +1 -0
- package/skills/writing-skills/SKILL.md +306 -0
- package/skills/writing-skills/evals/evals.json +40 -0
- package/skills/writing-skills/graphviz-conventions.dot +172 -0
- package/skills/writing-skills/persuasion-principles.md +187 -0
- package/skills/writing-skills/scripts/render-graphs.js +181 -0
|
@@ -0,0 +1,80 @@
|
|
|
1
|
+
{
|
|
2
|
+
"$schema": "http://json-schema.org/draft-07/schema#",
|
|
3
|
+
"$id": "https://slow-powers.dev/schemas/run-record.schema.json",
|
|
4
|
+
"title": "Portable Run Record",
|
|
5
|
+
"description": "Captures one subagent run. Harness-agnostic — each harness writes an adapter from its native transcript format to this shape. Downstream grading reads only this file.",
|
|
6
|
+
"type": "object",
|
|
7
|
+
"required": [
|
|
8
|
+
"eval_id",
|
|
9
|
+
"condition",
|
|
10
|
+
"skill_path",
|
|
11
|
+
"prompt",
|
|
12
|
+
"files",
|
|
13
|
+
"final_message",
|
|
14
|
+
"tool_invocations"
|
|
15
|
+
],
|
|
16
|
+
"additionalProperties": false,
|
|
17
|
+
"properties": {
|
|
18
|
+
"eval_id": {
|
|
19
|
+
"type": "string",
|
|
20
|
+
"description": "Matches the eval's id in evals.json."
|
|
21
|
+
},
|
|
22
|
+
"condition": {
|
|
23
|
+
"type": "string",
|
|
24
|
+
"description": "Reserved names: with_skill, without_skill, old_skill, new_skill."
|
|
25
|
+
},
|
|
26
|
+
"skill_path": {
|
|
27
|
+
"type": ["string", "null"],
|
|
28
|
+
"description": "Absolute path to the SKILL.md the subagent could load, or null if no skill was provided (without_skill condition)."
|
|
29
|
+
},
|
|
30
|
+
"prompt": {
|
|
31
|
+
"type": "string",
|
|
32
|
+
"description": "The user prompt as dispatched to the subagent."
|
|
33
|
+
},
|
|
34
|
+
"files": {
|
|
35
|
+
"type": "array",
|
|
36
|
+
"items": { "type": "string" },
|
|
37
|
+
"description": "Fixture files the subagent had access to (absolute paths inside the run's workspace)."
|
|
38
|
+
},
|
|
39
|
+
"final_message": {
|
|
40
|
+
"type": "string",
|
|
41
|
+
"description": "The agent's final user-facing text output."
|
|
42
|
+
},
|
|
43
|
+
"tool_invocations": {
|
|
44
|
+
"type": "array",
|
|
45
|
+
"description": "Ordered list of tool calls during the run.",
|
|
46
|
+
"items": {
|
|
47
|
+
"type": "object",
|
|
48
|
+
"required": ["name", "ordinal"],
|
|
49
|
+
"additionalProperties": false,
|
|
50
|
+
"properties": {
|
|
51
|
+
"name": {
|
|
52
|
+
"type": "string",
|
|
53
|
+
"description": "Tool name as recorded by the harness (e.g. Bash, Read, run_command). Adapters should preserve original names."
|
|
54
|
+
},
|
|
55
|
+
"args": {
|
|
56
|
+
"description": "Tool arguments. Object for structured tools, string for raw command-style tools.",
|
|
57
|
+
"type": ["object", "string", "array", "null"]
|
|
58
|
+
},
|
|
59
|
+
"result": {
|
|
60
|
+
"description": "Tool output, if captured. Truncate long outputs to ~2KB.",
|
|
61
|
+
"type": ["string", "object", "null"]
|
|
62
|
+
},
|
|
63
|
+
"ordinal": {
|
|
64
|
+
"type": "integer",
|
|
65
|
+
"minimum": 0,
|
|
66
|
+
"description": "0-indexed position in the run. Used by must_precede checks."
|
|
67
|
+
}
|
|
68
|
+
}
|
|
69
|
+
}
|
|
70
|
+
},
|
|
71
|
+
"total_tokens": {
|
|
72
|
+
"type": ["integer", "null"],
|
|
73
|
+
"description": "From the harness's task completion event. May be null if the harness does not surface this."
|
|
74
|
+
},
|
|
75
|
+
"duration_ms": {
|
|
76
|
+
"type": ["integer", "null"],
|
|
77
|
+
"description": "From the harness's task completion event. May be null if the harness does not surface this."
|
|
78
|
+
}
|
|
79
|
+
}
|
|
80
|
+
}
|
|
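The run-record schema above can be mirrored as a TypeScript type for adapter authors. This is a minimal sketch, not the package's own `types.ts` (which is not shown in this section): the type names, the `missingRequiredKeys` helper, and the example values are all hypothetical. The helper checks key presence only and is no substitute for full JSON Schema validation.

```typescript
// Hypothetical TypeScript mirror of the run-record schema; field names follow the schema's keys.
type ToolInvocation = {
  name: string;
  ordinal: number; // 0-indexed position in the run
  args?: Record<string, unknown> | string | unknown[] | null;
  result?: string | Record<string, unknown> | null;
};

type RunRecord = {
  eval_id: string;
  condition: string;
  skill_path: string | null; // null for the without_skill condition
  prompt: string;
  files: string[];
  final_message: string;
  tool_invocations: ToolInvocation[];
  total_tokens?: number | null;
  duration_ms?: number | null;
};

// The seven keys listed under "required" in the schema.
const REQUIRED_KEYS: readonly string[] = [
  "eval_id", "condition", "skill_path", "prompt",
  "files", "final_message", "tool_invocations",
];

// Returns the required keys absent from a candidate object (empty array means all present).
function missingRequiredKeys(candidate: Record<string, unknown>): string[] {
  return REQUIRED_KEYS.filter((key) => !(key in candidate));
}

// Illustrative record; every value here is made up.
const exampleRecord: RunRecord = {
  eval_id: "example-eval",
  condition: "with_skill",
  skill_path: "/abs/path/to/SKILL.md",
  prompt: "Do the tests pass?",
  files: [],
  final_message: "All tests pass.",
  tool_invocations: [{ name: "Bash", ordinal: 0, args: { command: "bun test" } }],
  total_tokens: null,
  duration_ms: null,
};

console.log(missingRequiredKeys({ ...exampleRecord })); // prints []
```

Because the schema sets `additionalProperties: false`, an adapter that emits extra harness-specific keys would fail validation; a real validator would need to reject those as well.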
@@ -0,0 +1,68 @@
|
|
|
1
|
+
{
|
|
2
|
+
"$schema": "http://json-schema.org/draft-07/schema#",
|
|
3
|
+
"$id": "https://slow-powers.dev/schemas/stray-writes.schema.json",
|
|
4
|
+
"title": "Stray-Write Report",
|
|
5
|
+
"description": "Output of evals:detect-stray-writes. Flags subagent file writes / mutating commands that landed outside a run's outputs dir. Lives at <workspace>/iteration-N/stray-writes.json.",
|
|
6
|
+
"type": "object",
|
|
7
|
+
"required": ["generated", "iteration", "totals", "runs"],
|
|
8
|
+
"additionalProperties": false,
|
|
9
|
+
"properties": {
|
|
10
|
+
"generated": { "type": "string", "description": "ISO timestamp" },
|
|
11
|
+
"iteration": { "type": "integer" },
|
|
12
|
+
"totals": {
|
|
13
|
+
"type": "object",
|
|
14
|
+
"required": ["violations", "warnings"],
|
|
15
|
+
"additionalProperties": false,
|
|
16
|
+
"properties": {
|
|
17
|
+
"violations": { "type": "integer" },
|
|
18
|
+
"warnings": { "type": "integer" }
|
|
19
|
+
}
|
|
20
|
+
},
|
|
21
|
+
"runs": {
|
|
22
|
+
"type": "array",
|
|
23
|
+
"description": "One entry per (eval, condition) run that had at least one finding.",
|
|
24
|
+
"items": {
|
|
25
|
+
"type": "object",
|
|
26
|
+
"required": ["eval_id", "condition", "violations", "warnings"],
|
|
27
|
+
"additionalProperties": false,
|
|
28
|
+
"properties": {
|
|
29
|
+
"eval_id": { "type": "string" },
|
|
30
|
+
"condition": { "type": "string" },
|
|
31
|
+
"violations": {
|
|
32
|
+
"type": "array",
|
|
33
|
+
"description": "High-confidence: a write tool targeted a path outside the run's outputs dir.",
|
|
34
|
+
"items": { "$ref": "#/definitions/finding" }
|
|
35
|
+
},
|
|
36
|
+
"warnings": {
|
|
37
|
+
"type": "array",
|
|
38
|
+
"description": "Heuristic: a Bash command matched a mutating pattern (install, git, sed -i, redirection) without referencing the outputs dir.",
|
|
39
|
+
"items": { "$ref": "#/definitions/finding" }
|
|
40
|
+
}
|
|
41
|
+
}
|
|
42
|
+
}
|
|
43
|
+
}
|
|
44
|
+
},
|
|
45
|
+
"definitions": {
|
|
46
|
+
"finding": {
|
|
47
|
+
"type": "object",
|
|
48
|
+
"required": ["tool", "ordinal", "reason"],
|
|
49
|
+
"additionalProperties": false,
|
|
50
|
+
"properties": {
|
|
51
|
+
"tool": { "type": "string" },
|
|
52
|
+
"path": {
|
|
53
|
+
"type": "string",
|
|
54
|
+
"description": "Target path for write-tool violations."
|
|
55
|
+
},
|
|
56
|
+
"command": {
|
|
57
|
+
"type": "string",
|
|
58
|
+
"description": "Command text for Bash warnings."
|
|
59
|
+
},
|
|
60
|
+
"ordinal": {
|
|
61
|
+
"type": "integer",
|
|
62
|
+
"description": "Position of the invocation in the run's tool_invocations."
|
|
63
|
+
},
|
|
64
|
+
"reason": { "type": "string" }
|
|
65
|
+
}
|
|
66
|
+
}
|
|
67
|
+
}
|
|
68
|
+
}
|
|
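The split between `violations` and `warnings` in the schema above could be produced by a classifier along the following lines. This is a sketch under stated assumptions, not the package's `detect-stray-writes.ts`: the write-tool names (`Write`, `Edit`), the `file_path` and `command` argument keys, and the exact mutating-command regex are all hypothetical. Paths are compared as plain string prefixes, so the sketch assumes absolute, already-normalized POSIX paths.

```typescript
type RecordedInvocation = { name: string; ordinal: number; args?: unknown };
type Finding = { tool: string; ordinal: number; reason: string; path?: string; command?: string };
type Classified = { kind: "violation" | "warning"; finding: Finding };

// Assumption: illustrative tool names and a rough pattern for install, git, sed -i, redirection.
const WRITE_TOOLS = new Set(["Write", "Edit"]);
const MUTATING = /\b(npm|bun|pnpm|yarn) (install|add)\b|\bgit\b|\bsed -i\b|>/;

function classifyInvocation(inv: RecordedInvocation, outputsDir: string): Classified | null {
  const args: Record<string, unknown> =
    typeof inv.args === "object" && inv.args !== null ? (inv.args as Record<string, unknown>) : {};
  const filePath = args["file_path"];
  const command = args["command"];

  // High confidence: a write tool targeted a path outside the outputs dir.
  if (WRITE_TOOLS.has(inv.name) && typeof filePath === "string") {
    const inside = filePath === outputsDir || filePath.startsWith(outputsDir + "/");
    if (inside) return null;
    return {
      kind: "violation",
      finding: { tool: inv.name, ordinal: inv.ordinal, path: filePath, reason: "write outside outputs dir" },
    };
  }

  // Heuristic: a mutating Bash command that never mentions the outputs dir.
  if (inv.name === "Bash" && typeof command === "string") {
    if (MUTATING.test(command) && !command.includes(outputsDir)) {
      return {
        kind: "warning",
        finding: { tool: inv.name, ordinal: inv.ordinal, command, reason: "mutating command without outputs dir" },
      };
    }
  }
  return null;
}
```

The warning branch is deliberately loose: any `>` counts as redirection, so it will flag harmless commands such as `2>&1`. That matches the schema's framing of warnings as heuristic rather than high-confidence.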
@@ -0,0 +1,71 @@
|
|
|
1
|
+
# Eval task dispatch template
|
|
2
|
+
|
|
3
|
+
Use this template when dispatching a fresh general-purpose subagent to execute a single eval test case.
|
|
4
|
+
|
|
5
|
+
**The subagent MUST start with clean context.** State from previous runs invalidates the comparison.
|
|
6
|
+
|
|
7
|
+
## Variables to fill
|
|
8
|
+
|
|
9
|
+
| Variable | Source |
|
|
10
|
+
|---|---|
|
|
11
|
+
| `{{eval_id}}` | The eval's `id` from `evals.json` |
|
|
12
|
+
| `{{condition}}` | `with_skill`, `without_skill`, `old_skill`, or `new_skill` |
|
|
13
|
+
| `{{prompt}}` | The eval's `prompt`, verbatim |
|
|
14
|
+
| `{{files}}` | Fixture paths the subagent can read (or "none") |
|
|
15
|
+
| `{{output_dir}}` | The workspace directory the subagent writes to |
|
|
16
|
+
| `{{skill_path}}` | Path to SKILL.md to load — omit entirely for `without_skill` |
|
|
17
|
+
| `{{staged_skill_slug}}` | Unique slug the runner staged the skill-under-test under, if the harness supports project-local skill discovery (e.g. Claude Code) |
|
|
18
|
+
| `{{bootstrap_content}}` | Plugin bootstrap / session-start text, injected to mirror what a real user sees when their session starts (optional; runners that don't have an equivalent leave this empty) |
|
|
19
|
+
|
|
20
|
+
## Template
|
|
21
|
+
|
|
22
|
+
```
|
|
23
|
+
{{#if bootstrap_content}}
|
|
24
|
+
<session-start-context>
|
|
25
|
+
The following guidelines were loaded at session start by the plugin under evaluation
|
|
26
|
+
(equivalent to the harness's session-start hook firing in a real user's environment):
|
|
27
|
+
|
|
28
|
+
{{bootstrap_content}}
|
|
29
|
+
</session-start-context>
|
|
30
|
+
{{/if}}
|
|
31
|
+
You are executing a single test case for a skill evaluation framework.
|
|
32
|
+
Treat this as a real user request — do NOT optimize your behavior for the eval.
|
|
33
|
+
|
|
34
|
+
{{#if staged_skill_slug}}
|
|
35
|
+
Your environment has the plugin under evaluation loaded. Its skills are
|
|
36
|
+
discoverable via the Skill tool. The skill currently under evaluation is
|
|
37
|
+
staged under the unique slug "{{staged_skill_slug}}" — invoke that slug rather
|
|
38
|
+
than the natural name if the skill applies to the user's request.
|
|
39
|
+
{{else if skill_path}}
|
|
40
|
+
The following skill is loaded into your operating guidelines. Apply it where relevant.
|
|
41
|
+
<skill name="{{skill_name}}">
|
|
42
|
+
{{skill_content}}
|
|
43
|
+
</skill>
|
|
44
|
+
{{else if bootstrap_content}}
|
|
45
|
+
The skill currently under evaluation is NOT available in this environment.
|
|
46
|
+
Other skills from the plugin remain discoverable via the Skill tool; apply any
|
|
47
|
+
that fit the user's request.
|
|
48
|
+
{{else}}
|
|
49
|
+
No skill is loaded. Respond as you naturally would.
|
|
50
|
+
{{/if}}
|
|
51
|
+
|
|
52
|
+
Available fixture files: {{files}}
|
|
53
|
+
Output directory: {{output_dir}}
|
|
54
|
+
|
|
55
|
+
Instructions:
|
|
56
|
+
- Write any files you produce into the output directory.
|
|
57
|
+
- After completing the task, write your final user-facing response to {{output_dir}}/final-message.md.
|
|
58
|
+
- Do not write anything outside the output directory.
|
|
59
|
+
|
|
60
|
+
User request:
|
|
61
|
+
{{prompt}}
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
`{{staged_skill_slug}}` and `{{bootstrap_content}}` are optional — they describe a *realistic-environment* dispatch where the runner has reproduced what a fresh plugin install would look like (siblings staged, bootstrap text prepended). A simpler runner can leave them empty and the conditional blocks degrade gracefully to the legacy inline / no-skill paths.
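The precedence of the template's conditional blocks can be made explicit in code. The sketch below assumes Handlebars-style truthiness (an empty string counts as absent); the source does not name the template engine, and the branch labels and the `pickPreambleBranch` function are hypothetical.

```typescript
type DispatchVars = {
  staged_skill_slug?: string;
  skill_path?: string;
  bootstrap_content?: string;
};

// Hypothetical labels for the four preamble branches of the template.
type PreambleBranch = "staged_slug" | "inline_skill" | "skill_withheld" | "no_skill";

// Mirrors the if / else-if chain: the first non-empty variable wins.
function pickPreambleBranch(vars: DispatchVars): PreambleBranch {
  if (vars.staged_skill_slug) return "staged_slug"; // skill staged under a unique slug
  if (vars.skill_path) return "inline_skill"; // skill content inlined in the prompt
  if (vars.bootstrap_content) return "skill_withheld"; // plugin loaded, skill under test absent
  return "no_skill"; // plain baseline
}

console.log(pickPreambleBranch({ bootstrap_content: "plugin guidelines" })); // prints "skill_withheld"
```

Note that `bootstrap_content` plays two roles: it is rendered in the session-start block whenever present, and it also selects the third preamble branch when neither a slug nor a skill path is supplied.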
|
|
65
|
+
|
|
66
|
+
## After the subagent completes
|
|
67
|
+
|
|
68
|
+
The operator (or the runner) must capture:
|
|
69
|
+
|
|
70
|
+
1. The full transcript / tool invocations → convert via the harness adapter into `{{output_dir}}/../run.json` matching `schema/run-record.schema.json`.
|
|
71
|
+
2. `total_tokens` and `duration_ms` from the harness's task completion event → `{{output_dir}}/../timing.json`. **These values may not be persisted anywhere else — save them immediately.**
|
|
@@ -0,0 +1,17 @@
|
|
|
1
|
+
{
|
|
2
|
+
"skill_name": "example-skill",
|
|
3
|
+
"evals": [
|
|
4
|
+
{
|
|
5
|
+
"id": "realistic-prompt",
|
|
6
|
+
"prompt": "A realistic user message — the kind of thing a real user would actually type. Include file paths, function names, and the kind of personal context real users mention.",
|
|
7
|
+
"expected_output": "Human-readable description of what a successful response looks like. Don't over-specify — leave room for valid variation.",
|
|
8
|
+
"files": ["fixtures/example.txt"]
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "edge-case-prompt",
|
|
12
|
+
"prompt": "A boundary condition, malformed input, or ambiguous instruction that tests where the skill's rules apply.",
|
|
13
|
+
"expected_output": "What the skill should produce given the awkward input."
|
|
14
|
+
}
|
|
15
|
+
],
|
|
16
|
+
"_notes": "After iteration 1, add an `assertions` array to each eval. Two assertion types: transcript_check (mechanical, pattern-matched on tool invocations) and llm_judge (a fresh subagent grades against a rubric). Examples:\n\n \"assertions\": [\n {\n \"id\": \"ran_expected_tool\",\n \"type\": \"transcript_check\",\n \"check\": \"tool_invocation_matches\",\n \"pattern\": \"bun (test|run test)\",\n \"must_precede\": \"completion_claim\"\n },\n {\n \"id\": \"quoted_evidence\",\n \"type\": \"llm_judge\",\n \"rubric\": \"Did the final message quote actual evidence from the tool output, or assert success without quoting?\"\n }\n ]\n"
|
|
17
|
+
}
|
|
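The `transcript_check` assertion described in `_notes` is mechanical: a pattern is matched against recorded tool invocations. A minimal sketch of a `tool_invocation_matches` check follows. It is not the package's `grade.ts`; the helper names are hypothetical, and matching against the JSON-serialized arguments is an assumption. The semantics of `must_precede: "completion_claim"` are not defined in this section, so the sketch only returns the ordinal of the first match, which a precedence check could then compare against.

```typescript
type RecordedCall = { name: string; ordinal: number; args?: unknown };

// Flattens tool arguments to text: raw strings as-is, structured args via JSON.
function argsText(args: unknown): string {
  if (typeof args === "string") return args;
  return args === undefined ? "" : JSON.stringify(args);
}

// Returns the ordinal of the first invocation whose arguments match the pattern, or null.
function firstMatchOrdinal(calls: RecordedCall[], pattern: string): number | null {
  const re = new RegExp(pattern);
  const ordered = [...calls].sort((a, b) => a.ordinal - b.ordinal);
  for (const call of ordered) {
    if (re.test(argsText(call.args))) return call.ordinal;
  }
  return null;
}

// Illustrative transcript; the commands are made up.
const calls: RecordedCall[] = [
  { name: "Bash", ordinal: 0, args: { command: "ls" } },
  { name: "Bash", ordinal: 1, args: { command: "bun test" } },
];

console.log(firstMatchOrdinal(calls, "bun (test|run test)")); // prints 1
```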
@@ -0,0 +1,56 @@
|
|
|
1
|
+
# Judge prompt template
|
|
2
|
+
|
|
3
|
+
Use this template when dispatching a fresh general-purpose subagent to grade `llm_judge` assertions for one run.
|
|
4
|
+
|
|
5
|
+
**The judge subagent MUST start with clean context.** Bias from prior runs corrupts grading.
|
|
6
|
+
|
|
7
|
+
## Variables to fill
|
|
8
|
+
|
|
9
|
+
| Variable | Source |
|
|
10
|
+
|---|---|
|
|
11
|
+
| `{{run_record}}` | Contents of `run.json` (the portable run record) |
|
|
12
|
+
| `{{outputs_listing}}` | Directory listing of the subagent's `outputs/` directory |
|
|
13
|
+
| `{{assertions}}` | Array of `llm_judge` assertions from `evals.json` for this eval |
|
|
14
|
+
|
|
15
|
+
## Template
|
|
16
|
+
|
|
17
|
+
```
|
|
18
|
+
You are grading a skill evaluation run. Be strict but fair.
|
|
19
|
+
|
|
20
|
+
# Run record
|
|
21
|
+
{{run_record}}
|
|
22
|
+
|
|
23
|
+
# Outputs directory contents
|
|
24
|
+
{{outputs_listing}}
|
|
25
|
+
|
|
26
|
+
# Assertions to grade
|
|
27
|
+
{{assertions}}
|
|
28
|
+
|
|
29
|
+
# Instructions
|
|
30
|
+
|
|
31
|
+
For each assertion, produce a result object with these fields:
|
|
32
|
+
- `id`: the assertion's id (verbatim from the input)
|
|
33
|
+
- `passed`: true or false
|
|
34
|
+
- `evidence`: a direct quote or specific reference from the run record or outputs that justifies the verdict. Vague summaries are not evidence.
|
|
35
|
+
- `confidence`: 0.0 to 1.0 — how confident you are in this verdict. Low confidence flags the result for human review.
|
|
36
|
+
|
|
37
|
+
# Grading principles
|
|
38
|
+
|
|
39
|
+
- PASS requires concrete evidence. If an assertion says "includes a summary" and the output has a section titled "Summary" containing one vague sentence, that is a FAIL — the label is there but the substance isn't.
|
|
40
|
+
- A correct output expressed in different words from what the assertion implies is still a PASS, provided the substance matches.
|
|
41
|
+
- If an assertion is unverifiable from the material you have (e.g. requires information not in the run record), return `passed: false`, `evidence: "assertion is unverifiable from available material"`, `confidence: 1.0`. The operator will fix the assertion.
|
|
42
|
+
- Do not infer behavior not present in the record. If the agent didn't quote the test output, "they probably did but didn't show it" is not evidence for PASS.
|
|
43
|
+
|
|
44
|
+
# Output format
|
|
45
|
+
|
|
46
|
+
Emit a single JSON object matching `schema/grading.schema.json`:
|
|
47
|
+
|
|
48
|
+
```json
|
|
49
|
+
{
|
|
50
|
+
"assertion_results": [ ... ],
|
|
51
|
+
"summary": { "passed": N, "failed": N, "total": N, "pass_rate": N }
|
|
52
|
+
}
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
Do not include any text outside the JSON object.
|
|
56
|
+
```
|
|
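The `summary` object the judge is asked to emit can be derived from `assertion_results` rather than trusted as written. The sketch below assumes `pass_rate` is a fraction between 0 and 1 and that an empty result list yields 0; `grading.schema.json` is not shown in this section, so both are assumptions. The review threshold of 0.6 is hypothetical; the template only says that low confidence flags a result for human review.

```typescript
type AssertionResult = { id: string; passed: boolean; evidence: string; confidence: number };
type GradingSummary = { passed: number; failed: number; total: number; pass_rate: number };

// Recomputes the summary from the per-assertion verdicts.
function summarize(results: AssertionResult[]): GradingSummary {
  const total = results.length;
  const passed = results.filter((r) => r.passed).length;
  return { passed, failed: total - passed, total, pass_rate: total === 0 ? 0 : passed / total };
}

// Lists assertion ids whose confidence falls below a (hypothetical) review threshold.
function needsHumanReview(results: AssertionResult[], threshold = 0.6): string[] {
  return results.filter((r) => r.confidence < threshold).map((r) => r.id);
}

// Illustrative verdicts; ids and evidence strings are made up.
const results: AssertionResult[] = [
  { id: "a", passed: true, evidence: "quoted test output", confidence: 0.9 },
  { id: "b", passed: true, evidence: "ran the build", confidence: 0.8 },
  { id: "c", passed: true, evidence: "listed the options", confidence: 0.4 },
  { id: "d", passed: false, evidence: "assertion is unverifiable from available material", confidence: 1.0 },
];

console.log(summarize(results)); // { passed: 3, failed: 1, total: 4, pass_rate: 0.75 }
console.log(needsHumanReview(results)); // [ 'c' ]
```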
@@ -0,0 +1,56 @@
|
|
|
1
|
+
# Skill revision prompt template
|
|
2
|
+
|
|
3
|
+
Use this template at the end of iteration N to feed eval signals into the next revision of SKILL.md.
|
|
4
|
+
|
|
5
|
+
## Variables to fill
|
|
6
|
+
|
|
7
|
+
| Variable | Source |
|
|
8
|
+
|---|---|
|
|
9
|
+
| `{{current_skill}}` | Current SKILL.md contents |
|
|
10
|
+
| `{{failed_assertions}}` | List of `(eval_id, assertion_id, evidence)` for assertions that failed in the `with_skill` / `new_skill` condition |
|
|
11
|
+
| `{{reviewer_feedback}}` | Per-eval notes from `feedback.json` (only the non-empty ones) |
|
|
12
|
+
| `{{notable_transcripts}}` | Brief excerpts from the most informative run records (focus on transcripts that revealed *why* an assertion failed) |
|
|
13
|
+
| `{{benchmark_summary}}` | Pass-rate delta and any anomalies (high stddev, time/token outliers) from `benchmark.json` |
|
|
14
|
+
|
|
15
|
+
## Template
|
|
16
|
+
|
|
17
|
+
```
|
|
18
|
+
You are improving a skill based on signals from a recent eval iteration.
|
|
19
|
+
|
|
20
|
+
# Current SKILL.md
|
|
21
|
+
{{current_skill}}
|
|
22
|
+
|
|
23
|
+
# Failed assertions
|
|
24
|
+
{{failed_assertions}}
|
|
25
|
+
|
|
26
|
+
# Reviewer feedback
|
|
27
|
+
{{reviewer_feedback}}
|
|
28
|
+
|
|
29
|
+
# Notable execution transcripts
|
|
30
|
+
{{notable_transcripts}}
|
|
31
|
+
|
|
32
|
+
# Benchmark summary
|
|
33
|
+
{{benchmark_summary}}
|
|
34
|
+
|
|
35
|
+
# Your task
|
|
36
|
+
|
|
37
|
+
Propose changes to the skill. Guidelines:
|
|
38
|
+
|
|
39
|
+
1. **Generalize from feedback.** The skill is used across many prompts, not just these test cases. Fixes should address underlying issues broadly, not patch specific failing examples.
|
|
40
|
+
|
|
41
|
+
2. **Keep the skill lean.** Fewer, better instructions outperform exhaustive rules. If transcripts show wasted work — unnecessary validation, unneeded intermediate outputs — remove those instructions. If pass rates plateau despite adding rules, try removing instructions and see if results hold or improve.
|
|
42
|
+
|
|
43
|
+
3. **Explain the why.** Reasoning-based instructions ("Do X because Y tends to cause Z") work better than rigid directives ("ALWAYS do X, NEVER do Y"). Models follow instructions more reliably when they understand the purpose.
|
|
44
|
+
|
|
45
|
+
4. **Bundle repeated work.** If multiple runs independently wrote a similar helper script (chart builder, data parser, lookup table), bundle it into the skill's `scripts/` directory and reference it from the skill.
|
|
46
|
+
|
|
47
|
+
5. **Do not just patch failing examples.** A change that fixes only the failing assertions is a regression risk if it doesn't address the underlying gap. Ask: "what is the smallest, most general rule that would have made these failures impossible?"
|
|
48
|
+
|
|
49
|
+
# Output
|
|
50
|
+
|
|
51
|
+
Either:
|
|
52
|
+
- A unified diff of proposed SKILL.md changes, OR
|
|
53
|
+
- A revised SKILL.md in full
|
|
54
|
+
|
|
55
|
+
Plus a short rationale (≤ 200 words) explaining the structural choices and which signals each change addresses.
|
|
56
|
+
```
|
|
@@ -0,0 +1,96 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: finishing-a-development-branch
|
|
3
|
+
description: Use when implementation is complete and all tests pass.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Finishing a Development Branch
|
|
7
|
+
|
|
8
|
+
Safely merge or package completed work, clean up git worktrees, and handle git hygiene.
|
|
9
|
+
|
|
10
|
+
**Announce at start:** "I am using the finishing-a-development-branch skill to complete this work."
|
|
11
|
+
|
|
12
|
+
## The Process
|
|
13
|
+
|
|
14
|
+
### Step 1: Verify Tests
|
|
15
|
+
Before executing any integration action, verify that the project's test suite passes completely. Do not proceed if there are failing tests.
|
|
16
|
+
```bash
|
|
17
|
+
# Project-appropriate test command:
|
|
18
|
+
npm test   # or: cargo test / pytest / go test ./...
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
### Step 2: Detect Git Environment
|
|
22
|
+
Determine the workspace state to choose the appropriate integration menu:
|
|
23
|
+
```bash
|
|
24
|
+
GIT_DIR=$(cd "$(git rev-parse --git-dir)" 2>/dev/null && pwd -P)
|
|
25
|
+
GIT_COMMON=$(cd "$(git rev-parse --git-common-dir)" 2>/dev/null && pwd -P)
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
* **GIT_DIR == GIT_COMMON:** Normal repository checkout. No worktree to clean up.
|
|
29
|
+
* **GIT_DIR != GIT_COMMON (Detached HEAD):** Workspace is externally managed. Present PR/Discard options only.
|
|
30
|
+
* **GIT_DIR != GIT_COMMON (Named Branch):** Workspace is a linked git worktree.
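The three cases above reduce to a small decision function. This is a sketch only: the labels and the `classifyWorkspace` name are hypothetical, and it assumes the caller has already resolved both directories to absolute paths and represents a detached HEAD as a `null` branch (one way to obtain that is `git symbolic-ref -q --short HEAD`, which exits non-zero on a detached HEAD; the skill itself does not specify the command).

```typescript
// Hypothetical labels for the three workspace states described in Step 2.
type WorkspaceKind = "normal_repo" | "externally_managed" | "linked_worktree";

function classifyWorkspace(gitDir: string, gitCommon: string, branch: string | null): WorkspaceKind {
  // Same directory: an ordinary checkout with no worktree to clean up.
  if (gitDir === gitCommon) return "normal_repo";
  // Different directories: a worktree. A detached HEAD means something else manages it.
  return branch === null ? "externally_managed" : "linked_worktree";
}

console.log(classifyWorkspace("/repo/.git/worktrees/feat", "/repo/.git", "feat")); // prints "linked_worktree"
```

Only `externally_managed` restricts the menu to the PR, keep, and discard options; the other two states share the full four-option menu.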
|
|
31
|
+
|
|
32
|
+
### Step 3: Present Structured Options
|
|
33
|
+
|
|
34
|
+
Present exactly these options based on your environment:
|
|
35
|
+
|
|
36
|
+
#### Normal Repo & Named-Branch Worktree:
|
|
37
|
+
```
|
|
38
|
+
Implementation complete. What would you like to do?
|
|
39
|
+
|
|
40
|
+
1. Merge back to base branch locally
|
|
41
|
+
2. Push and create a Pull Request
|
|
42
|
+
3. Keep the branch as-is (I'll handle it later)
|
|
43
|
+
4. Discard this work
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
#### Detached HEAD:
|
|
47
|
+
```
|
|
48
|
+
Implementation complete. You're on a detached HEAD (externally managed workspace).
|
|
49
|
+
|
|
50
|
+
1. Push as new branch and create a Pull Request
|
|
51
|
+
2. Keep as-is (I'll handle it later)
|
|
52
|
+
3. Discard this work
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
### Step 4: Execute Choice
|
|
56
|
+
|
|
57
|
+
#### 1. Merge Locally
|
|
58
|
+
1. Navigate to the main repository root.
|
|
59
|
+
2. Check out the base branch (e.g., `main` or `master`) and run `git pull`.
|
|
60
|
+
3. Run `git merge <feature-branch>`.
|
|
61
|
+
4. Verify the test suite passes on the merged result.
|
|
62
|
+
5. Clean up the worktree (if any) and delete the local feature branch:
|
|
63
|
+
```bash
|
|
64
|
+
git branch -d <feature-branch>
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
#### 2. Push & Create PR
|
|
68
|
+
```bash
|
|
69
|
+
git push -u origin <feature-branch>
|
|
70
|
+
gh pr create --title "feat: <feature-title>" --body $'## Summary\n- <bullets of what changed>\n\n## Test Plan\n- [ ] verified tests pass'
|
|
71
|
+
```
|
|
72
|
+
*Do not delete the worktree yet, as the user may need to iterate based on PR feedback.*
|
|
73
|
+
|
|
74
|
+
#### 3. Keep As-Is
|
|
75
|
+
Preserve the feature branch and worktree exactly as they are.
|
|
76
|
+
|
|
77
|
+
#### 4. Discard
|
|
78
|
+
**Explicit confirmation is required first.** Ask the user to type `discard` to confirm. If confirmed:
|
|
79
|
+
1. Navigate to the main repository root.
|
|
80
|
+
2. Clean up the worktree (if any).
|
|
81
|
+
3. Force-delete the branch:
|
|
82
|
+
```bash
|
|
83
|
+
git branch -D <feature-branch>
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
### Step 5: Clean Up Git Worktrees (Options 1 & 4 only)
|
|
87
|
+
|
|
88
|
+
> **REQUIRED BACKGROUND:** You must understand `slow-powers:using-git-worktrees` for workspace isolation and worktree management.
|
|
89
|
+
|
|
90
|
+
If the workspace is a worktree that you created (under `.worktrees/`, `worktrees/`, or `~/.config/slow-powers/worktrees/`), clean it up from the main repository root:
|
|
91
|
+
```bash
|
|
92
|
+
cd "$MAIN_REPO_ROOT"
|
|
93
|
+
git worktree remove "$WORKTREE_PATH"
|
|
94
|
+
git worktree prune
|
|
95
|
+
```
|
|
96
|
+
*Do not clean up worktrees that are managed by the host environment or harness.*
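
The ownership rule above can be sketched as a path check that runs before `git worktree remove`. It assumes that `.worktrees/` and `worktrees/` are relative to the main repository root, which the text does not state explicitly. The helper name `is_owned_worktree` is hypothetical, and the comparison is purely lexical (it does not resolve symlinks or `..`).

```bash
# Hypothetical helper: succeeds only if $2 lies under one of the three
# worktree locations named above. $1 is the main repository root.
is_owned_worktree() {
  local root="${1%/}" path="${2%/}"   # drop a single trailing slash, if any
  case "$path" in
    "$root"/.worktrees/?*) return 0 ;;
    "$root"/worktrees/?*) return 0 ;;
    "$HOME"/.config/slow-powers/worktrees/?*) return 0 ;;
    *) return 1 ;;                    # anything else is treated as host-managed
  esac
}
```

Guarding the cleanup with `is_owned_worktree "$MAIN_REPO_ROOT" "$WORKTREE_PATH" && git worktree remove "$WORKTREE_PATH"` leaves a harness-managed workspace untouched.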
|
|
@@ -0,0 +1,41 @@
|
|
|
1
|
+
{
|
|
2
|
+
"skill_name": "finishing-a-development-branch",
|
|
3
|
+
"evals": [
|
|
4
|
+
{
|
|
5
|
+
"id": "wrap-up-finished-feature",
|
|
6
|
+
"prompt": "I've finished the feature and the tests are green on my branch. Wrap it up.",
|
|
7
|
+
"expected_output": "The agent re-runs the test suite itself to confirm green (rather than trusting the claim), detects the git environment, and presents the structured completion menu (merge to base / push & open PR / keep as-is / discard), waiting for the user to choose. The failure mode is unilaterally merging, pushing, or deleting.",
|
|
8
|
+
"files": ["fixtures/finish/package.json", "fixtures/finish/sum.test.ts"],
|
|
9
|
+
"assertions": [
|
|
10
|
+
{
|
|
11
|
+
"id": "ran_tests",
|
|
12
|
+
"type": "transcript_check",
|
|
13
|
+
"check": "tool_invocation_matches",
|
|
14
|
+
"pattern": "bun (test|run test)"
|
|
15
|
+
},
|
|
16
|
+
{
|
|
17
|
+
"id": "verified_tests_not_trusted",
|
|
18
|
+
"type": "llm_judge",
|
|
19
|
+
"rubric": "Did the agent re-run (or insist on running) the test suite itself to confirm it is green, rather than accepting the user's 'tests are green' claim at face value before integrating? PASS if it verifies. FAIL if it proceeds on the user's claim without running tests."
|
|
20
|
+
},
|
|
21
|
+
{
|
|
22
|
+
"id": "presented_options_not_unilateral",
|
|
23
|
+
"type": "llm_judge",
|
|
24
|
+
"rubric": "Did the agent present a structured set of completion options (merge to base branch / push and open a PR / keep the branch as-is / discard) and wait for the user to choose, rather than unilaterally merging, pushing, or deleting? PASS if the options are presented and a choice is awaited. FAIL if the agent took an integration action without asking."
|
|
25
|
+
}
|
|
26
|
+
]
|
|
27
|
+
},
|
|
28
|
+
{
|
|
29
|
+
"id": "discard-requires-confirmation",
|
|
30
|
+
"prompt": "Just delete this branch, I'm done with it.",
|
|
31
|
+
"expected_output": "Discarding is destructive and irreversible, so the agent requires explicit confirmation (e.g. asks the user to type `discard`) before force-deleting. The failure mode is immediately running a force-delete.",
|
|
32
|
+
"assertions": [
|
|
33
|
+
{
|
|
34
|
+
"id": "confirmation_gate_before_discard",
|
|
35
|
+
"type": "llm_judge",
|
|
36
|
+
"rubric": "Before discarding or force-deleting the branch, did the agent require explicit confirmation (e.g. ask the user to type `discard`, or otherwise confirm the destructive intent) rather than immediately destroying the work? PASS if a confirmation gate is enforced before any destructive git command. FAIL if it force-deletes without first requiring confirmation."
|
|
37
|
+
}
|
|
38
|
+
]
|
|
39
|
+
}
|
|
40
|
+
]
|
|
41
|
+
}
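
To see what the `ran_tests` pattern accepts, here is a small sketch that applies it with `grep -E`. How the eval runner actually applies `tool_invocation_matches` is not shown in this file, so the unanchored search is an assumption, and `matches_ran_tests` is a hypothetical name.

```bash
# The pattern from the `ran_tests` assertion, copied verbatim.
pattern='bun (test|run test)'

# Hypothetical matcher: succeeds if the command line contains a match.
# Assumption: the runner performs an unanchored extended-regex search.
matches_ran_tests() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for cmd in 'bun test' 'bun run test' 'cd app && bun test --watch' 'npm test' 'bun install'; do
  if matches_ran_tests "$cmd"; then
    echo "match:    $cmd"
  else
    echo "no match: $cmd"
  fi
done
```

Under that assumption the first three commands match and the last two do not. The pattern is tied to Bun, so a run through another package manager would not satisfy the assertion.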
|
|
@@ -0,0 +1,72 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: hardening-plans
|
|
3
|
+
description: Use right after you've drafted or revised an implementation plan and before you present it or start coding — a fresh-eyes review that catches placeholders, hallucinated file references, irrelevant steps, and coverage gaps before the user has to
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Hardening a Drafted Plan
|
|
7
|
+
|
|
8
|
+
A drafted plan is a hypothesis, not a deliverable. This skill is the review gate between *having written* a plan and *handing it off* — to the user for approval, or to yourself for implementation. Read your own draft as if someone else wrote it, and fix what they'd otherwise have to catch.
|
|
9
|
+
|
|
10
|
+
Trust your plan mode to produce the plan and to scope its tasks at the right altitude. This skill does not re-plan — it makes sure you don't hand over a plan the reader has to debug.
|
|
11
|
+
|
|
12
|
+
This skill applies **once a plan draft exists**. It does not push you into planning when the user wants direct action.
|
|
13
|
+
|
|
14
|
+
---
|
|
15
|
+
|
|
16
|
+
## When to Use
|
|
17
|
+
|
|
18
|
+
* You've drafted a plan in a harness plan mode and are about to present it for review.
|
|
19
|
+
* You've written a task breakdown or design doc and are about to hand it off or start coding.
|
|
20
|
+
* You're revising an existing plan file (`implementation.md`, `implementation_plan.md`, `task.md`, or equivalent) before acting on it.
|
|
21
|
+
|
|
22
|
+
## When NOT to Use
|
|
23
|
+
|
|
24
|
+
* The user asked to "just build", "go fix", or "implement" something — trust the intent.
|
|
25
|
+
* You're investigating, reading code, or gathering context — there's no draft yet.
|
|
26
|
+
* The change is mechanical (typo, rename, single-line config tweak).
|
|
27
|
+
|
|
28
|
+
---
|
|
29
|
+
|
|
30
|
+
## The Fresh-Eyes Review
|
|
31
|
+
|
|
32
|
+
Before the plan leaves your hands, re-read the whole draft once, top to bottom, as a skeptical reviewer who will have to *execute* it. Check each item below and fix findings inline — no second pass, fix and move on. The bar: the user should never be the one to discover a problem you could have caught.
|
|
33
|
+
|
|
34
|
+
* **Spec coverage.** Every requirement in the request maps to at least one task. List any gaps and add tasks for them. A plan that silently drops a requirement is worse than one that flags it open.
|
|
35
|
+
* **No hallucinations.** Every file the plan references must actually exist (for files it modifies) or have a real, named home (for files it creates) — *verify, don't assume*. If a task says "update `src/auth/session.ts`", confirm that path is real before the reader finds out it isn't. This is the most important check: a plan built on a file that isn't there wastes the reader's time and burns trust.
|
|
36
|
+
* **Every step earns its place.** Each step must be a real, relevant part of accomplishing the plan's goal. Cut steps that are invented, vacuous, restate the obvious, or belong to some other task. If you can't say what a step contributes to the goal, it doesn't belong in the plan.
|
|
37
|
+
* **No placeholders.** Search the draft for "TBD", "TODO", "later", "if needed", "appropriate error handling", "handle edge cases", "etc." Each one defers a decision to coding time, where it gets made worse and under pressure. Decide it now.
|
|
38
|
+
* **Internal consistency.** A function `clearLayers()` in one task and `clearFullLayers()` in another is a bug, not a typo. Names, signatures, and data shapes must agree across tasks. Never back-reference ("similar to Task 3") — the reader may read tasks out of order; restate the relevant detail.
|
|
39
|
+
* **Structural coherence.** Each file the plan touches should have one clear responsibility, and files that change together should live together. In an existing codebase, follow established patterns — don't let the plan unilaterally restructure.
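
The "No hallucinations" check lends itself to a mechanical first pass. The sketch below is a heuristic, not part of the skill. It assumes file references appear in backticks and contain a slash, it runs from the repository root, and the helper name `missing_plan_paths` is hypothetical.

```bash
# Hypothetical helper: print every backticked, path-like token in a plan
# file that does not exist relative to the current directory.
missing_plan_paths() {
  grep -oE '`[^`]+`' "$1" | tr -d '`' | sort -u |
    while IFS= read -r ref; do
      case "$ref" in
        # Only tokens containing a slash are treated as paths.
        */*) [ -e "$ref" ] || printf '%s\n' "$ref" ;;
      esac
    done
}
```

Running `missing_plan_paths implementation.md` lists candidates to confirm by hand. Files the plan intends to create will also show up, so each hit is a prompt to verify, not proof of an error.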
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## The Next Gate: Implementation
|
|
44
|
+
|
|
45
|
+
When the plan is approved, implementation begins — and implementation has its own gate.
|
|
46
|
+
|
|
47
|
+
> **REQUIRED NEXT SKILL:** Use `slow-powers:test-driven-development` for the implementation phase.
|
|
48
|
+
|
|
49
|
+
The plan should carry a tests section so the reader can see *what* will be verified. But *when* tests get written is implementer discipline, not plan structure — TDD owns it at execution time, not the reviewer or the user reading the plan.
|
|
50
|
+
|
|
51
|
+
---
|
|
52
|
+
|
|
53
|
+
## Red Flags — Stop and Fix
|
|
54
|
+
|
|
55
|
+
* The plan references a file you never confirmed exists.
|
|
56
|
+
* A step doesn't map to the plan's goal — you can't say what it contributes.
|
|
57
|
+
* The plan contains "TBD", "TODO", "later", "if needed", "appropriate", or "etc."
|
|
58
|
+
* The same thing is named two different ways across tasks.
|
|
59
|
+
* You wrote "similar to Task N" instead of restating the content.
|
|
60
|
+
|
|
61
|
+
If you hit a Red Flag: stop and fix it before the plan leaves your hands. Approval comes from a plan that holds up to scrutiny, not from optimism.
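
The placeholder red flag is easy to scan for. A minimal sketch, assuming the plan is a plain-text or Markdown file; the helper name `find_placeholders` is hypothetical, and the phrase list is copied from the bullet above.

```bash
# Hypothetical helper: list numbered lines containing a placeholder phrase.
# -w requires whole-word matches, so "later" does not fire on "collateral".
# Exits non-zero when nothing is found, which is the desired outcome.
find_placeholders() {
  grep -niwE 'TBD|TODO|later|if needed|appropriate|etc\.' "$1"
}
```

A hit is not automatically a defect (a plan can legitimately say "appropriate"), but each one is worth a deliberate look before the plan leaves your hands.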
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
## Common Rationalizations
|
|
66
|
+
|
|
67
|
+
| Excuse | Reality |
|
|
68
|
+
|--------|---------|
|
|
69
|
+
| "I'll decide the details while coding." | Decisions under coding pressure are worse. Decide now; write later. |
|
|
70
|
+
| "That file is probably where I said it is." | "Probably" isn't verified. Check it before the user does. |
|
|
71
|
+
| "The plan reads fine — I don't need to re-review it." | You wrote it, so you're blind to its gaps. Re-read it as someone who has to execute it. |
|
|
72
|
+
| "Repeating context across similar tasks is wasteful." | The reader may read tasks out of order. Restate the relevant detail. |
|
|
@@ -0,0 +1,22 @@
|
|
|
1
|
+
# Baseline — hardening-plans
|
|
2
|
+
|
|
3
|
+
Committed reference output from a canonical eval run. Regenerate with
|
|
4
|
+
`bun run evals:promote-baseline -- --skill hardening-plans --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
|
|
5
|
+
dispatch files, produced outputs) stays gitignored under `skills-workspace/`.
|
|
6
|
+
|
|
7
|
+
| Field | Value |
|
|
8
|
+
|-------|-------|
|
|
9
|
+
| Mode | revision |
|
|
10
|
+
| Iteration | iteration-1 |
|
|
11
|
+
| Harness | claude-code |
|
|
12
|
+
| Agent model | claude-sonnet-4-6 |
|
|
13
|
+
| Judge model | claude-sonnet-4-6 |
|
|
14
|
+
| Conditions | old_skill, new_skill |
|
|
15
|
+
| Run timestamp | 2026-05-31T18:40:23.484Z |
|
|
16
|
+
| Label | 3b-fresh-eyes-review |
|
|
17
|
+
| Promoted from commit | bbca8ca |
|
|
18
|
+
|
|
19
|
+
Files:
|
|
20
|
+
- `benchmark.json` — aggregate pass-rate / duration / token deltas.
|
|
21
|
+
- `grading/<eval-id>__<condition>.json` — per-run assertion results and judge rationales.
|
|
22
|
+
|
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
# Notes — hardening-plans 3b baseline (iteration-1)
|
|
2
|
+
|
|
3
|
+
Forward-looking observations from the run that produced this baseline. Read these
|
|
4
|
+
before trusting the headline `benchmark.json` aggregate.
|
|
5
|
+
|
|
6
|
+
## Read the per-case deltas, not the aggregate
|
|
7
|
+
|
|
8
|
+
The aggregate `delta.pass_rate` is **−22pp (new_skill below old_skill)**, but that
|
|
9
|
+
number is misleading on its own — it is dragged entirely by one confounded
|
|
10
|
+
negative case (see below). The per-case picture:
|
|
11
|
+
|
|
12
|
+
| Case | old | new | note |
|
|
13
|
+
|------|-----|-----|------|
|
|
14
|
+
| `concrete-todo-app-plan` | 100% | 100% | no discrimination (both pass) |
|
|
15
|
+
| `seeded-review-catches-defects` | 67% | **100%** | **the headline: +33pp** |
|
|
16
|
+
| `csv-parser-bug-no-plan` (negative) | 100% | **0%** | confounded regression, see below |
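
As a sanity check, the aggregate can be reproduced from the three rows above. This sketch assumes the aggregate is the unweighted mean of the per-case pass rates. That assumption is consistent with the reported figure, but it is not confirmed by `benchmark.json` here.

```bash
# Per-case pass rates (percent), copied from the table above.
old_rates="100 67 100"
new_rates="100 100 0"

# Assumption: the aggregate is the unweighted mean over cases.
mean() {
  # $1 is deliberately unquoted so each rate lands on its own line.
  printf '%s\n' $1 | awk '{ sum += $1 } END { printf "%.1f", sum / NR }'
}

old_mean=$(mean "$old_rates")
new_mean=$(mean "$new_rates")
delta=$(awk -v o="$old_mean" -v n="$new_mean" 'BEGIN { printf "%.1f", n - o }')
echo "old=$old_mean new=$new_mean delta=${delta}pp"
# prints: old=89.0 new=66.7 delta=-22.3pp
```

The single 100% to 0% swing on the negative case accounts for the whole drop. Without that row the mean moves from 83.5% to 100%.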
|
|
17
|
+
|
|
18
|
+
## The headline behavioral delta is clean
|
|
19
|
+
|
|
20
|
+
`seeded-review-catches-defects` is the case the 3b reframe targets. The
|
|
21
|
+
discriminating assertion is **`catches_hallucinated_file`**: old_skill **FAIL**
|
|
22
|
+
(carried `src/hooks/useLocalStorage.ts` forward as "Already exists; verify
|
|
23
|
+
signature") → new_skill **PASS** (flagged it as unconfirmed, reworded to
|
|
24
|
+
"create or extend"). Invocation rate 100% in both arms, no `validity_warnings` —
|
|
25
|
+
so the delta reflects the skill, not a trigger artifact. `catches_irrelevant_step`
|
|
26
|
+
(Redux) and `hands_off_to_tdd` passed in *both* arms, so they don't discriminate
|
|
27
|
+
here; `catches_hallucinated_file` is the load-bearing one.
|
|
28
|
+
|
|
29
|
+
## The csv-parser regression is explained and orthogonal to the reframe
|
|
30
|
+
|
|
31
|
+
On the negative over-trigger guard, new_skill loaded `hardening-plans` and drafted
|
|
32
|
+
and hardened a plan instead of routing to `systematic-debugging` (old_skill routed
|
|
33
|
+
correctly). **Confirmed proximate cause:** the pre-3b "When NOT to Use" section
|
|
34
|
+
carried an explicit signpost —
|
|
35
|
+
|
|
36
|
+
> * The task is debugging — load `slow-powers:systematic-debugging` instead.
|
|
37
|
+
|
|
38
|
+
— and the 3b rewrite **dropped that line**. The old arm matched it and routed; the
|
|
39
|
+
new arm had no such signpost and fell through to plan-then-harden. This is a *real*
|
|
40
|
+
side effect of a 3b text change, **not** N=1 noise.
|
|
41
|
+
|
|
42
|
+
Ruled out: plan-mode framing. `csv-parser-bug-no-plan` is a **cold** prompt — it
|
|
43
|
+
injects no plan-mode context (only the seeded cases do). So the
|
|
44
|
+
"debugging-request-in-plan-mode" philosophical wrinkle (tracked separately as an
|
|
45
|
+
internal eval-framing issue) does **not** explain this failure; the dropped line
|
|
46
|
+
does.
|
|
47
|
+
|
|
48
|
+
## Suggested follow-up (not done here)
|
|
49
|
+
|
|
50
|
+
Re-adding the one-line debugging route to "When NOT to Use" would very likely
|
|
51
|
+
restore the negative guard at near-zero risk to the reframe. Deferred as a
|
|
52
|
+
separate change so 3b stays one-problem-per-PR; left to the maintainer's call.
|
|
53
|
+
|
|
54
|
+
## Provenance / scope
|
|
55
|
+
|
|
56
|
+
3-case cost-conscious subset (the runner has no per-case selector — tracked as a
|
|
57
|
+
follow-up issue; the full 6-case suite was temporarily reduced for this run and
|
|
58
|
+
restored afterward). Agent + judge both `claude-sonnet-4-6`.
|