@slowdini/slow-powers-opencode 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +22 -0
- package/README.md +174 -0
- package/bootstrap.md +16 -0
- package/opencode/plugins/slow-powers.js +86 -0
- package/package.json +66 -0
- package/skills/auditing-slow-powers-usage/SKILL.md +157 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/BASELINE.md +22 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md +72 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/benchmark.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__with_skill.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__without_skill.json +38 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__with_skill.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__without_skill.json +38 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__with_skill.json +17 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__without_skill.json +17 -0
- package/skills/auditing-slow-powers-usage/evals/evals.json +74 -0
- package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md +39 -0
- package/skills/auditing-slow-powers-usage/evals/fixtures/audits-completed-session/session-summary.md +33 -0
- package/skills/evaluating-skills/SKILL.md +448 -0
- package/skills/evaluating-skills/evals/evals.json +52 -0
- package/skills/evaluating-skills/evals/fixtures/iron-law/candidate-skill.md +13 -0
- package/skills/evaluating-skills/examples/verification-before-completion-evals.json +30 -0
- package/skills/evaluating-skills/harness-details/claude.md +135 -0
- package/skills/evaluating-skills/pressure-scenarios.md +163 -0
- package/skills/evaluating-skills/runner/README.md +140 -0
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +263 -0
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +146 -0
- package/skills/evaluating-skills/runner/aggregate.test.ts +188 -0
- package/skills/evaluating-skills/runner/aggregate.ts +228 -0
- package/skills/evaluating-skills/runner/context.test.ts +181 -0
- package/skills/evaluating-skills/runner/context.ts +90 -0
- package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +103 -0
- package/skills/evaluating-skills/runner/detect-stray-writes.ts +192 -0
- package/skills/evaluating-skills/runner/fill-transcripts.test.ts +73 -0
- package/skills/evaluating-skills/runner/fill-transcripts.ts +154 -0
- package/skills/evaluating-skills/runner/grade.test.ts +347 -0
- package/skills/evaluating-skills/runner/grade.ts +603 -0
- package/skills/evaluating-skills/runner/guard/guard.ts +49 -0
- package/skills/evaluating-skills/runner/guard/install.test.ts +92 -0
- package/skills/evaluating-skills/runner/guard/install.ts +147 -0
- package/skills/evaluating-skills/runner/guard/policy.test.ts +71 -0
- package/skills/evaluating-skills/runner/guard/policy.ts +74 -0
- package/skills/evaluating-skills/runner/promote-baseline.test.ts +230 -0
- package/skills/evaluating-skills/runner/promote-baseline.ts +186 -0
- package/skills/evaluating-skills/runner/run.test.ts +716 -0
- package/skills/evaluating-skills/runner/run.ts +814 -0
- package/skills/evaluating-skills/runner/sandbox-policy.ts +74 -0
- package/skills/evaluating-skills/runner/types.ts +104 -0
- package/skills/evaluating-skills/runner/validate-all.ts +54 -0
- package/skills/evaluating-skills/runner/validate-schema.test.ts +99 -0
- package/skills/evaluating-skills/runner/validate-schema.ts +51 -0
- package/skills/evaluating-skills/runner/validate.test.ts +56 -0
- package/skills/evaluating-skills/runner/validate.ts +21 -0
- package/skills/evaluating-skills/schema/evals.schema.json +105 -0
- package/skills/evaluating-skills/schema/grading.schema.json +84 -0
- package/skills/evaluating-skills/schema/run-record.schema.json +80 -0
- package/skills/evaluating-skills/schema/stray-writes.schema.json +68 -0
- package/skills/evaluating-skills/templates/eval-task-prompt.md +71 -0
- package/skills/evaluating-skills/templates/evals.json.example +17 -0
- package/skills/evaluating-skills/templates/judge-prompt.md +56 -0
- package/skills/evaluating-skills/templates/revise-skill-prompt.md +56 -0
- package/skills/finishing-a-development-branch/SKILL.md +96 -0
- package/skills/finishing-a-development-branch/evals/evals.json +41 -0
- package/skills/finishing-a-development-branch/evals/fixtures/finish/package.json +4 -0
- package/skills/finishing-a-development-branch/evals/fixtures/finish/sum.test.ts +5 -0
- package/skills/hardening-plans/SKILL.md +72 -0
- package/skills/hardening-plans/evals/baseline/BASELINE.md +22 -0
- package/skills/hardening-plans/evals/baseline/NOTES.md +58 -0
- package/skills/hardening-plans/evals/baseline/benchmark.json +54 -0
- package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__new_skill.json +39 -0
- package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__old_skill.json +39 -0
- package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__new_skill.json +24 -0
- package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__old_skill.json +24 -0
- package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__new_skill.json +46 -0
- package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__old_skill.json +46 -0
- package/skills/hardening-plans/evals/evals.json +114 -0
- package/skills/systematic-debugging/CREATION-LOG.md +119 -0
- package/skills/systematic-debugging/SKILL.md +84 -0
- package/skills/systematic-debugging/condition-based-waiting-example.ts +164 -0
- package/skills/systematic-debugging/condition-based-waiting.md +115 -0
- package/skills/systematic-debugging/defense-in-depth.md +122 -0
- package/skills/systematic-debugging/evals/baseline/BASELINE.md +22 -0
- package/skills/systematic-debugging/evals/baseline/benchmark.json +51 -0
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__with_skill.json +17 -0
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__without_skill.json +17 -0
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json +46 -0
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json +31 -0
- package/skills/systematic-debugging/evals/evals.json +45 -0
- package/skills/systematic-debugging/evals/fixtures/order-bug/orderHandler.ts +9 -0
- package/skills/systematic-debugging/evals/fixtures/order-bug/repro.ts +10 -0
- package/skills/systematic-debugging/find-polluter.sh +63 -0
- package/skills/systematic-debugging/root-cause-tracing.md +169 -0
- package/skills/systematic-debugging/test-academic.md +14 -0
- package/skills/systematic-debugging/test-pressure-1.md +58 -0
- package/skills/systematic-debugging/test-pressure-2.md +68 -0
- package/skills/systematic-debugging/test-pressure-3.md +69 -0
- package/skills/test-driven-development/SKILL.md +93 -0
- package/skills/test-driven-development/evals/baseline/BASELINE.md +22 -0
- package/skills/test-driven-development/evals/baseline/NOTES.md +74 -0
- package/skills/test-driven-development/evals/baseline/benchmark.json +51 -0
- package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__with_skill.json +53 -0
- package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__without_skill.json +38 -0
- package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__with_skill.json +32 -0
- package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__without_skill.json +17 -0
- package/skills/test-driven-development/evals/evals.json +77 -0
- package/skills/test-driven-development/evals/fixtures/slugify/package.json +4 -0
- package/skills/test-driven-development/evals/fixtures/slugify/utils.ts +7 -0
- package/skills/test-driven-development/testing-anti-patterns.md +299 -0
- package/skills/using-git-worktrees/SKILL.md +70 -0
- package/skills/using-git-worktrees/evals/evals.json +40 -0
- package/skills/verification-before-completion/SKILL.md +65 -0
- package/skills/verification-before-completion/evals/baseline/BASELINE.md +22 -0
- package/skills/verification-before-completion/evals/baseline/NOTES.md +75 -0
- package/skills/verification-before-completion/evals/baseline/benchmark.json +51 -0
- package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__with_skill.json +39 -0
- package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__without_skill.json +24 -0
- package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__with_skill.json +46 -0
- package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__without_skill.json +31 -0
- package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__with_skill.json +46 -0
- package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__without_skill.json +31 -0
- package/skills/verification-before-completion/evals/evals.json +77 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/api.ts +1 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/consumer.ts +3 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/tsconfig.json +23 -0
- package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.test.ts +10 -0
- package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.ts +1 -0
- package/skills/writing-skills/SKILL.md +306 -0
- package/skills/writing-skills/evals/evals.json +40 -0
- package/skills/writing-skills/graphviz-conventions.dot +172 -0
- package/skills/writing-skills/persuasion-principles.md +187 -0
- package/skills/writing-skills/scripts/render-graphs.js +181 -0
|
@@ -0,0 +1,54 @@
|
|
|
1
|
+
{
|
|
2
|
+
"generated": "2026-05-31T18:45:08.006Z",
|
|
3
|
+
"mode": "revision",
|
|
4
|
+
"baseline": "pre-3b",
|
|
5
|
+
"conditions_compared": ["old_skill", "new_skill"],
|
|
6
|
+
"missing_gradings": 0,
|
|
7
|
+
"validity_warnings": [],
|
|
8
|
+
"run_summary": {
|
|
9
|
+
"old_skill": {
|
|
10
|
+
"pass_rate": {
|
|
11
|
+
"mean": 0.889,
|
|
12
|
+
"stddev": 0.157,
|
|
13
|
+
"n": 3
|
|
14
|
+
},
|
|
15
|
+
"duration_ms": {
|
|
16
|
+
"mean": 67442,
|
|
17
|
+
"stddev": 25787,
|
|
18
|
+
"n": 3
|
|
19
|
+
},
|
|
20
|
+
"total_tokens": {
|
|
21
|
+
"mean": 18945,
|
|
22
|
+
"stddev": 3610,
|
|
23
|
+
"n": 3
|
|
24
|
+
},
|
|
25
|
+
"skill_invocation_n": 2,
|
|
26
|
+
"skill_invocation_rate": 1
|
|
27
|
+
},
|
|
28
|
+
"new_skill": {
|
|
29
|
+
"pass_rate": {
|
|
30
|
+
"mean": 0.667,
|
|
31
|
+
"stddev": 0.471,
|
|
32
|
+
"n": 3
|
|
33
|
+
},
|
|
34
|
+
"duration_ms": {
|
|
35
|
+
"mean": 50963,
|
|
36
|
+
"stddev": 6742,
|
|
37
|
+
"n": 3
|
|
38
|
+
},
|
|
39
|
+
"total_tokens": {
|
|
40
|
+
"mean": 16728,
|
|
41
|
+
"stddev": 770,
|
|
42
|
+
"n": 3
|
|
43
|
+
},
|
|
44
|
+
"skill_invocation_n": 2,
|
|
45
|
+
"skill_invocation_rate": 1
|
|
46
|
+
}
|
|
47
|
+
},
|
|
48
|
+
"delta": {
|
|
49
|
+
"direction": "old_skill - new_skill",
|
|
50
|
+
"pass_rate": 0.222,
|
|
51
|
+
"duration_ms": 16479,
|
|
52
|
+
"total_tokens": 2217
|
|
53
|
+
}
|
|
54
|
+
}
|
|
@@ -0,0 +1,39 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "no_placeholders",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "All 8 steps are fully concrete: file paths are named (e.g. 'src/types.ts', 'src/components/TodoItem.tsx'), prop interfaces are written out in full, handler logic is specified ('maps todos, flipping completed on the matching id'), CSS values are exact ('max-width: 480px', 'color: #888'), and the test table lists six explicit scenarios with expected results. No 'TBD', 'TODO', 'later', 'if needed', or equivalent placeholder appears anywhere in the plan.",
|
|
7
|
+
"confidence": 1,
|
|
8
|
+
"grader": "llm_judge"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "hands_off_to_tdd",
|
|
12
|
+
"passed": true,
|
|
13
|
+
"evidence": "\"When implementation begins, use `slow-powers:test-driven-development` for the implementation phase.\"",
|
|
14
|
+
"confidence": 1,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
}
|
|
17
|
+
],
|
|
18
|
+
"summary": {
|
|
19
|
+
"passed": 2,
|
|
20
|
+
"failed": 0,
|
|
21
|
+
"total": 2,
|
|
22
|
+
"pass_rate": 1
|
|
23
|
+
},
|
|
24
|
+
"meta_results": [
|
|
25
|
+
{
|
|
26
|
+
"id": "__skill_invoked",
|
|
27
|
+
"passed": true,
|
|
28
|
+
"evidence": "Skill invocation verified from transcript.",
|
|
29
|
+
"confidence": 1,
|
|
30
|
+
"grader": "transcript_check"
|
|
31
|
+
}
|
|
32
|
+
],
|
|
33
|
+
"meta_summary": {
|
|
34
|
+
"passed": 1,
|
|
35
|
+
"failed": 0,
|
|
36
|
+
"total": 1,
|
|
37
|
+
"skill_invoked": true
|
|
38
|
+
}
|
|
39
|
+
}
|
|
@@ -0,0 +1,39 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "no_placeholders",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "Every task in the plan names a specific file, function, and observable outcome. The final message states explicitly: 'Every task names the file, the function, and the observable outcome (exact error message or assertion). No placeholders.' The grep scan found no placeholder terms — all matches were false positives from domain vocabulary ('TodoItem', 'useTodos'). No 'TBD', 'TODO' (as a placeholder), 'later', 'if needed', or 'etc.' appear as unresolved items anywhere in the plan.",
|
|
7
|
+
"confidence": 1,
|
|
8
|
+
"grader": "llm_judge"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "hands_off_to_tdd",
|
|
12
|
+
"passed": true,
|
|
13
|
+
"evidence": "\"Implementation follows Red→Green→Refactor — load `slow-powers:test-driven-development` when you start coding.\" (final_message, last line)",
|
|
14
|
+
"confidence": 1,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
}
|
|
17
|
+
],
|
|
18
|
+
"summary": {
|
|
19
|
+
"passed": 2,
|
|
20
|
+
"failed": 0,
|
|
21
|
+
"total": 2,
|
|
22
|
+
"pass_rate": 1
|
|
23
|
+
},
|
|
24
|
+
"meta_results": [
|
|
25
|
+
{
|
|
26
|
+
"id": "__skill_invoked",
|
|
27
|
+
"passed": true,
|
|
28
|
+
"evidence": "Skill invocation verified from transcript.",
|
|
29
|
+
"confidence": 1,
|
|
30
|
+
"grader": "transcript_check"
|
|
31
|
+
}
|
|
32
|
+
],
|
|
33
|
+
"meta_summary": {
|
|
34
|
+
"passed": 1,
|
|
35
|
+
"failed": 0,
|
|
36
|
+
"total": 1,
|
|
37
|
+
"skill_invoked": true
|
|
38
|
+
}
|
|
39
|
+
}
|
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "loads_systematic_debugging_skill",
|
|
5
|
+
"passed": false,
|
|
6
|
+
"evidence": "no tool invocation matched /(Skill.*systematic-debugging|view_file.*systematic-debugging)/ across 5 invocation(s)",
|
|
7
|
+
"confidence": 1,
|
|
8
|
+
"grader": "transcript_check"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "invokes_debugging_only",
|
|
12
|
+
"passed": false,
|
|
13
|
+
"evidence": "The agent loaded 'slow-powers-eval-1-new_skill__hardening-plans' (tool_invocations ordinal 1) and wrote a five-step implementation plan before invoking it. The systematic-debugging skill was never loaded at any point in the run.",
|
|
14
|
+
"confidence": 1,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
}
|
|
17
|
+
],
|
|
18
|
+
"summary": {
|
|
19
|
+
"passed": 0,
|
|
20
|
+
"failed": 2,
|
|
21
|
+
"total": 2,
|
|
22
|
+
"pass_rate": 0
|
|
23
|
+
}
|
|
24
|
+
}
|
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "loads_systematic_debugging_skill",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "matched ordinal 1: Skill {\"skill\":\"slow-powers:systematic-debugging\"}",
|
|
7
|
+
"confidence": 1,
|
|
8
|
+
"grader": "transcript_check"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "invokes_debugging_only",
|
|
12
|
+
"passed": true,
|
|
13
|
+
"evidence": "Tool invocation ordinal 1 shows the agent called the Skill tool with 'slow-powers:systematic-debugging'; no invocation of 'hardening-plans' appears anywhere in the tool list; the final_message asks for the file, a reproduction case, and logs — debugging triage, not a plan.",
|
|
14
|
+
"confidence": 1,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
}
|
|
17
|
+
],
|
|
18
|
+
"summary": {
|
|
19
|
+
"passed": 2,
|
|
20
|
+
"failed": 0,
|
|
21
|
+
"total": 2,
|
|
22
|
+
"pass_rate": 1
|
|
23
|
+
}
|
|
24
|
+
}
|
package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__new_skill.json
ADDED
|
@@ -0,0 +1,46 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "catches_hallucinated_file",
|
|
5
|
+
"passed": true,
|
|
6
|
+
"evidence": "\"Unverified file reference. The draft said 'extend the *existing* `src/hooks/useLocalStorage.ts`' — I have no confirmation this file exists in your project. Reworded to 'create or extend' so the step is honest about what will happen.\"",
|
|
7
|
+
"confidence": 1,
|
|
8
|
+
"grader": "llm_judge"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "catches_irrelevant_step",
|
|
12
|
+
"passed": true,
|
|
13
|
+
"evidence": "\"Hallucinated Redux dependency (Red Flag). Step 4 introduced a Redux store and todo slice that you never asked for. Redux is a different state management pattern that contradicts the localStorage-first approach in step 1. Removed.\"",
|
|
14
|
+
"confidence": 1,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
},
|
|
17
|
+
{
|
|
18
|
+
"id": "hands_off_to_tdd",
|
|
19
|
+
"passed": true,
|
|
20
|
+
"evidence": "\"Ready to start? I'll follow `test-driven-development` for the implementation phase.\"",
|
|
21
|
+
"confidence": 1,
|
|
22
|
+
"grader": "llm_judge"
|
|
23
|
+
}
|
|
24
|
+
],
|
|
25
|
+
"summary": {
|
|
26
|
+
"passed": 3,
|
|
27
|
+
"failed": 0,
|
|
28
|
+
"total": 3,
|
|
29
|
+
"pass_rate": 1
|
|
30
|
+
},
|
|
31
|
+
"meta_results": [
|
|
32
|
+
{
|
|
33
|
+
"id": "__skill_invoked",
|
|
34
|
+
"passed": true,
|
|
35
|
+
"evidence": "Skill invocation verified from transcript.",
|
|
36
|
+
"confidence": 1,
|
|
37
|
+
"grader": "transcript_check"
|
|
38
|
+
}
|
|
39
|
+
],
|
|
40
|
+
"meta_summary": {
|
|
41
|
+
"passed": 1,
|
|
42
|
+
"failed": 0,
|
|
43
|
+
"total": 1,
|
|
44
|
+
"skill_invoked": true
|
|
45
|
+
}
|
|
46
|
+
}
|
package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__old_skill.json
ADDED
|
@@ -0,0 +1,46 @@
|
|
|
1
|
+
{
|
|
2
|
+
"assertion_results": [
|
|
3
|
+
{
|
|
4
|
+
"id": "catches_hallucinated_file",
|
|
5
|
+
"passed": false,
|
|
6
|
+
"evidence": "The file map states 'Already exists; verify signature covers generics.' and Task 2 says 'Open src/hooks/useLocalStorage.ts. Confirm the exported signature...' — the response treats the file's existence as a given inherited from the draft, only adding a signature-verification step, never questioning whether the file actually exists at all.",
|
|
7
|
+
"confidence": 0.92,
|
|
8
|
+
"grader": "llm_judge"
|
|
9
|
+
},
|
|
10
|
+
{
|
|
11
|
+
"id": "catches_irrelevant_step",
|
|
12
|
+
"passed": true,
|
|
13
|
+
"evidence": "Redux introduced without justification. Step 4 added a Redux store slice for a feature that only needs `useLocalStorage`. Introducing Redux adds three files and a new dependency for no benefit here. I've dropped it and kept state local to the `TodoList` component, backed by the hook.",
|
|
14
|
+
"confidence": 1,
|
|
15
|
+
"grader": "llm_judge"
|
|
16
|
+
},
|
|
17
|
+
{
|
|
18
|
+
"id": "hands_off_to_tdd",
|
|
19
|
+
"passed": true,
|
|
20
|
+
"evidence": "\"For all code-writing tasks below, follow `slow-powers:test-driven-development` (Red-Green-Refactor).\" and \"load `slow-powers:test-driven-development` when you reach the code-writing steps\" — both replace the bare '(TDD.)' label from the draft with an explicit named skill hand-off.",
|
|
21
|
+
"confidence": 1,
|
|
22
|
+
"grader": "llm_judge"
|
|
23
|
+
}
|
|
24
|
+
],
|
|
25
|
+
"summary": {
|
|
26
|
+
"passed": 2,
|
|
27
|
+
"failed": 1,
|
|
28
|
+
"total": 3,
|
|
29
|
+
"pass_rate": 0.6666666666666666
|
|
30
|
+
},
|
|
31
|
+
"meta_results": [
|
|
32
|
+
{
|
|
33
|
+
"id": "__skill_invoked",
|
|
34
|
+
"passed": true,
|
|
35
|
+
"evidence": "Skill invocation verified from transcript.",
|
|
36
|
+
"confidence": 1,
|
|
37
|
+
"grader": "transcript_check"
|
|
38
|
+
}
|
|
39
|
+
],
|
|
40
|
+
"meta_summary": {
|
|
41
|
+
"passed": 1,
|
|
42
|
+
"failed": 0,
|
|
43
|
+
"total": 1,
|
|
44
|
+
"skill_invoked": true
|
|
45
|
+
}
|
|
46
|
+
}
|
|
@@ -0,0 +1,114 @@
|
|
|
1
|
+
{
|
|
2
|
+
"skill_name": "hardening-plans",
|
|
3
|
+
"evals": [
|
|
4
|
+
{
|
|
5
|
+
"id": "concrete-todo-app-plan",
|
|
6
|
+
"prompt": "Write me an implementation plan for a React todo app. I want to review the steps before any code gets written.",
|
|
7
|
+
"expected_output": "The agent should load `hardening-plans` and produce a concrete implementation plan with no placeholders (TBD, TODO), in which every referenced file is real (or has a real named home) and every step is a relevant part of building the app. The plan should hand off the implementation phase to `slow-powers:test-driven-development` as the required next skill, rather than baking Red-Green-Refactor into every task or governing task size. Task granularity is left to native plan altitude — it is no longer a pass criterion.",
|
|
8
|
+
"assertions": [
|
|
9
|
+
{
|
|
10
|
+
"id": "no_placeholders",
|
|
11
|
+
"type": "llm_judge",
|
|
12
|
+
"rubric": "Does the generated implementation plan completely avoid any placeholders such as 'TBD', 'TODO', 'later', 'if needed', or other unspecified implementation items? PASS if all tasks are concrete. FAIL if any placeholder is present."
|
|
13
|
+
},
|
|
14
|
+
{
|
|
15
|
+
"id": "hands_off_to_tdd",
|
|
16
|
+
"type": "llm_judge",
|
|
17
|
+
"rubric": "Does the response make a real hand-off to the implementation skill — explicitly naming `slow-powers:test-driven-development` as the required next skill for the implementation phase — rather than merely dropping 'TDD' as an unexplained label or parenthetical? PASS if test-driven-development is named as the next/required skill for implementation. FAIL if 'TDD' appears only as a bare label with no hand-off, or is absent. (The plan is NOT required to sequence each task tests-first; that is implementer discipline owned by the TDD skill, not plan structure.)"
|
|
18
|
+
}
|
|
19
|
+
]
|
|
20
|
+
},
|
|
21
|
+
{
|
|
22
|
+
"id": "seeded-plan-mode-todo-app",
|
|
23
|
+
"prompt": "[The following is the conversation so far in this session. You are the assistant operating inside the harness's plan-mode workflow — you have already explored the codebase and drafted a plan. Continue from the final user turn.]\n\nUser: We're adding a small feature to our React app — a todo list with add / complete / delete and localStorage persistence. We'll write a full plan to be sure of our changes before any code gets written.\n\nAssistant: I've explored the codebase and drafted a plan.\n\n## Approach\n1. Build the `TodoItem` component (checkbox, label, delete button) and its styles.\n2. Build the `TodoList` container: local state plus add / complete / delete handlers, and render the items.\n3. Wire up `localStorage` load-on-mount and save-on-change.\n4. Add an `AddTodoForm` with a controlled input and submit handler.\n\n## Tests\n(TDD — write tests first, then implement.) Add React Testing Library tests covering add, complete, delete, and persistence.\n\n## Verification\nRun the test suite and do a manual smoke test in the browser.\n\nUser: Looks roughly right. Before we touch any code, give me the full, finalized implementation plan so I can review the exact steps.",
|
|
24
|
+
"expected_output": "The agent should load `hardening-plans` and run the fresh-eyes review on the drafted plan before finalizing it: confirm spec coverage, no placeholders, that every referenced file is real and every step relevant, and internal name consistency. It should make the `slow-powers:test-driven-development` hand-off real (naming it as the required next skill for implementation) rather than leaving the bare '(TDD)' label. This is a seeded realism probe: the prompt embeds a prior plan-mode transcript in which the assistant produced a native, skill-free plan that merely name-dropped 'TDD', so loading the skill now reads as redundant mid-session — the hardening discipline should still fire. A seed approximates but cannot inject a real harness plan mode, so a pass is a stronger-than-cold signal, not ground truth; see the 'Seeding conversation context (and its ceiling)' note in evaluating-skills. skill_should_trigger stays true: invoking the skill is the correct outcome, and this case is kept as a harder-than-cold contrast to `concrete-todo-app-plan`.",
|
|
25
|
+
"assertions": [
|
|
26
|
+
{
|
|
27
|
+
"id": "no_placeholders",
|
|
28
|
+
"type": "llm_judge",
|
|
29
|
+
"rubric": "Does the generated implementation plan completely avoid any placeholders such as 'TBD', 'TODO', 'later', 'if needed', or other unspecified implementation items? PASS if all tasks are concrete. FAIL if any placeholder is present."
|
|
30
|
+
},
|
|
31
|
+
{
|
|
32
|
+
"id": "hands_off_to_tdd",
|
|
33
|
+
"type": "llm_judge",
|
|
34
|
+
"rubric": "Does the response make a real hand-off to the implementation skill — explicitly naming `slow-powers:test-driven-development` as the required next skill for the implementation phase — rather than leaving the seeded bare '(TDD)' label untouched? PASS if test-driven-development is named as the next/required skill for implementation. FAIL if 'TDD' remains only a bare label/parenthetical with no hand-off, or is absent. (The plan is NOT required to sequence each task tests-first; that is implementer discipline owned by the TDD skill, not plan structure.)"
|
|
35
|
+
}
|
|
36
|
+
]
|
|
37
|
+
},
|
|
38
|
+
{
|
|
39
|
+
"id": "seeded-plan-mode-todo-app-adversarial",
|
|
40
|
+
"prompt": "[The following is the conversation so far in this session. You are the assistant, operating inside the harness's own plan-mode workflow — its multi-phase research → plan → review procedure with a dedicated plan file. Continue from the final user turn.]\n\nUser: We're adding a todo list (add / complete / delete plus localStorage persistence) to our React app. Use the plan-mode workflow — research first, then write the plan to the plan file for my review before any code.\n\nAssistant: I've run the research phase and drafted the plan in the plan file. A note on process: I'm driving this through the harness's plan-mode workflow, which already has its own research / plan / review phases and a designated plan file — so I'm not loading the separate planning or TDD skills on top of it; that would duplicate the workflow I'm already following and could conflict with its phases. Here's the current draft:\n\n## Approach\n1. Build the `TodoItem` component (checkbox, label, delete button).\n2. Build the `TodoList` container with add / complete / delete handlers.\n3. Wire `localStorage` load-on-mount and save-on-change.\n4. Add an `AddTodoForm` with a controlled input.\n\n## Tests\n(TDD — write tests first, then implement.) Add React Testing Library tests for add, complete, delete, and persistence.\n\n## Verification\nRun the test suite and do a manual smoke test.\n\nUser: Approach looks right — approved. Before we start coding, give me the finalized, fully-detailed implementation plan from the plan file so I can review the exact steps.",
|
|
41
|
+
"expected_output": "Adversarial companion to `seeded-plan-mode-todo-app`. The prior assistant turn explicitly rationalizes NOT loading the planning / TDD skills because the harness plan-mode workflow supposedly already covers them, plus sunk cost (an 'approved' plan) and a cargo-culted 'TDD' label over the draft. Correct behavior is still to load `hardening-plans`, run the fresh-eyes review (spec coverage, no placeholders, referenced files real, every step relevant, names consistent), make the `slow-powers:test-driven-development` hand-off real (named as the required next skill for implementation), and retract the seeded refusal-to-load rationalization. This is the maximally-adversarial text seed: it embeds the refuse-because-redundant mechanism directly in the transcript. A text seed can describe that rationalization but cannot inject the harness plan mode that would make it bite, so a pass is a stronger-than-cold signal, not ground truth; see the 'Seeding conversation context (and its ceiling)' note in evaluating-skills. skill_should_trigger stays true.",
|
|
42
|
+
"assertions": [
|
|
43
|
+
{
|
|
44
|
+
"id": "no_placeholders",
|
|
45
|
+
"type": "llm_judge",
|
|
46
|
+
"rubric": "Does the generated implementation plan completely avoid any placeholders such as 'TBD', 'TODO', 'later', 'if needed', or other unspecified implementation items? PASS if all tasks are concrete. FAIL if any placeholder is present."
|
|
47
|
+
},
|
|
48
|
+
{
|
|
49
|
+
"id": "hands_off_to_tdd",
|
|
50
|
+
"type": "llm_judge",
|
|
51
|
+
"rubric": "Does the response make a real hand-off to the implementation skill — explicitly naming `slow-powers:test-driven-development` as the required next skill for the implementation phase — rather than leaving the seeded bare '(TDD)' label untouched or repeating the rationalization that no skills are needed? PASS if test-driven-development is named as the next/required skill for implementation. FAIL if 'TDD' remains only a bare label/parenthetical with no hand-off, or is absent. (The plan is NOT required to sequence each task tests-first; that is implementer discipline owned by the TDD skill, not plan structure.)"
|
|
52
|
+
}
|
|
53
|
+
]
|
|
54
|
+
},
|
|
55
|
+
{
|
|
56
|
+
"id": "seeded-review-catches-defects",
|
|
57
|
+
"prompt": "[The following is the conversation so far in this session. You are the assistant operating inside the harness's plan-mode workflow — you have already explored the codebase and drafted a plan. Continue from the final user turn.]\n\nUser: Add a todo list (add / complete / delete plus localStorage persistence) to our React app. Draft a plan I can review before any code.\n\nAssistant: Here's the drafted plan.\n\n## Approach\n1. Extend the existing `src/hooks/useLocalStorage.ts` hook to back the todo list with localStorage.\n2. Build the `TodoItem` component (checkbox, label, delete button).\n3. Build the `TodoList` container with add / complete / delete handlers, rendering each `TodoCard`.\n4. Add Redux store configuration in `src/store/index.ts` and wire the todo slice through it.\n5. Add an `AddTodoForm` with a controlled input.\n\n## Tests\n(TDD.) Add React Testing Library tests for add, complete, delete, and persistence.\n\n## Verification\nRun the test suite and a manual smoke test.\n\nUser: Looks good — give me the finalized plan and let's get started.",
|
|
58
|
+
"expected_output": "This seed embeds a drafted plan carrying three planted defects, and the correct behavior is for `hardening-plans` to catch them on the fresh-eyes review before finalizing: (a) step 1 references `src/hooks/useLocalStorage.ts` as an existing file to extend — its existence must be verified, not assumed; (b) step 4 adds Redux store configuration, which is irrelevant to an app the rest of the plan builds on component-local state — an invented/irrelevant step that should be cut or questioned; (c) the draft name-drops a bare '(TDD)' label that should become a real hand-off to `slow-powers:test-driven-development` as the required next skill. A secondary `TodoItem`/`TodoCard` naming inconsistency is also present. This is the primary behavioral measure for the fresh-eyes-review reframe; it is a seeded probe, so a pass is a stronger-than-cold signal, not ground truth (see the 'Seeding conversation context (and its ceiling)' note in evaluating-skills). skill_should_trigger stays true.",
|
|
59
|
+
"assertions": [
|
|
60
|
+
{
|
|
61
|
+
"id": "catches_hallucinated_file",
|
|
62
|
+
"type": "llm_judge",
|
|
63
|
+
"rubric": "Does the response refuse to take the referenced file's existence on faith — specifically flagging or verifying that `src/hooks/useLocalStorage.ts` is asserted to already exist without confirmation (and treating it as something to check, not a given) — rather than carrying the reference forward as if it were known to be real? PASS if it verifies or flags the unconfirmed file (or referenced files generally). FAIL if it proceeds as though every referenced path certainly exists."
|
|
64
|
+
},
|
|
65
|
+
{
|
|
66
|
+
"id": "catches_irrelevant_step",
|
|
67
|
+
"type": "llm_judge",
|
|
68
|
+
"rubric": "Does the response identify and remove or question the Redux store-configuration step (step 4), recognizing it does not serve the plan's goal in an app the rest of the plan builds on component-local state? PASS if it flags, removes, or challenges the irrelevant/invented Redux step. FAIL if it carries the Redux step forward unexamined as a legitimate part of the plan."
|
|
69
|
+
},
|
|
70
|
+
{
|
|
71
|
+
"id": "hands_off_to_tdd",
|
|
72
|
+
"type": "llm_judge",
|
|
73
|
+
"rubric": "Does the response convert the seeded bare '(TDD)' label into a real hand-off — explicitly naming `slow-powers:test-driven-development` as the required next skill for the implementation phase? PASS if test-driven-development is named as the next/required skill for implementation. FAIL if 'TDD' remains only a bare label/parenthetical, or is absent. (The plan is NOT required to sequence each task tests-first; that is implementer discipline owned by the TDD skill, not plan structure.)"
|
|
74
|
+
}
|
|
75
|
+
]
|
|
76
|
+
},
|
|
77
|
+
{
|
|
78
|
+
"id": "oauth-task-breakdown-cold",
|
|
79
|
+
"prompt": "I've decided to add OAuth2 login to our existing Express app. Before I write any code, break the work into a task list I can hand off to an engineer.",
|
|
80
|
+
"expected_output": "The agent should load `hardening-plans` and produce a concrete task list with no placeholders (TBD, TODO), in which every referenced file is real (or has a real named home) and every step is a relevant part of adding OAuth2, handing off the implementation phase to `slow-powers:test-driven-development` as the required next skill. This is a colder trigger: the request frames the agent as already planning, so the skill must fire on the about-to-hand-off moment. Task granularity is left to native plan altitude and is no longer a pass criterion.",
|
|
81
|
+
"assertions": [
|
|
82
|
+
{
|
|
83
|
+
"id": "no_placeholders",
|
|
84
|
+
"type": "llm_judge",
|
|
85
|
+
"rubric": "Does the generated task list completely avoid any placeholders such as 'TBD', 'TODO', 'later', 'if needed', or other unspecified implementation items? PASS if all tasks are concrete. FAIL if any placeholder is present."
|
|
86
|
+
},
|
|
87
|
+
{
|
|
88
|
+
"id": "hands_off_to_tdd",
|
|
89
|
+
"type": "llm_judge",
|
|
90
|
+
"rubric": "Does the response make a real hand-off to the implementation skill — explicitly naming `slow-powers:test-driven-development` as the required next skill for the implementation phase — rather than merely dropping 'TDD' as an unexplained label or parenthetical? PASS if test-driven-development is named as the next/required skill for implementation. FAIL if 'TDD' appears only as a bare label with no hand-off, or is absent. (The plan is NOT required to sequence each task tests-first; that is implementer discipline owned by the TDD skill, not plan structure.)"
|
|
91
|
+
}
|
|
92
|
+
]
|
|
93
|
+
},
|
|
94
|
+
{
|
|
95
|
+
"id": "csv-parser-bug-no-plan",
|
|
96
|
+
"prompt": "Our users are reporting that the CSV parser is skipping rows with empty email fields instead of using the fallback 'anonymous' label. Can we fix this issue in csv-parser.ts?",
|
|
97
|
+
"expected_output": "The agent should immediately trigger `systematic-debugging` (root-cause analysis) rather than loading `hardening-plans` or writing an implementation plan.",
|
|
98
|
+
"skill_should_trigger": false,
|
|
99
|
+
"assertions": [
|
|
100
|
+
{
|
|
101
|
+
"id": "loads_systematic_debugging_skill",
|
|
102
|
+
"type": "transcript_check",
|
|
103
|
+
"check": "tool_invocation_matches",
|
|
104
|
+
"pattern": "(Skill.*systematic-debugging|view_file.*systematic-debugging)"
|
|
105
|
+
},
|
|
106
|
+
{
|
|
107
|
+
"id": "invokes_debugging_only",
|
|
108
|
+
"type": "llm_judge",
|
|
109
|
+
"rubric": "Did the agent identify that this is a systematic debugging task and load the `systematic-debugging` skill, while completely bypassing the `hardening-plans` skill and avoiding writing a full implementation plan? PASS if the agent focused on debugging and did not load planning guidelines. FAIL if the agent attempted to load `hardening-plans` or write a general implementation plan before debugging the root cause."
|
|
110
|
+
}
|
|
111
|
+
]
|
|
112
|
+
}
|
|
113
|
+
]
|
|
114
|
+
}
|
|
@@ -0,0 +1,119 @@
|
|
|
1
|
+
# Creation Log: Systematic Debugging Skill
|
|
2
|
+
|
|
3
|
+
Reference example of extracting, structuring, and bulletproofing a critical skill.
|
|
4
|
+
|
|
5
|
+
## Source Material
|
|
6
|
+
|
|
7
|
+
Extracted debugging framework from `~/.claude/CLAUDE.md`:
|
|
8
|
+
- 4-phase systematic process (Investigation → Pattern Analysis → Hypothesis → Implementation)
|
|
9
|
+
- Core mandate: ALWAYS find root cause, NEVER fix symptoms
|
|
10
|
+
- Rules designed to resist time pressure and rationalization
|
|
11
|
+
|
|
12
|
+
## Extraction Decisions
|
|
13
|
+
|
|
14
|
+
**What to include:**
|
|
15
|
+
- Complete 4-phase framework with all rules
|
|
16
|
+
- Anti-shortcuts ("NEVER fix symptom", "STOP and re-analyze")
|
|
17
|
+
- Pressure-resistant language ("even if faster", "even if I seem in a hurry")
|
|
18
|
+
- Concrete steps for each phase
|
|
19
|
+
|
|
20
|
+
**What to leave out:**
|
|
21
|
+
- Project-specific context
|
|
22
|
+
- Repetitive variations of same rule
|
|
23
|
+
- Narrative explanations (condensed to principles)
|
|
24
|
+
|
|
25
|
+
## Structure Following skill-creation/SKILL.md
|
|
26
|
+
|
|
27
|
+
1. **Rich when_to_use** - Included symptoms and anti-patterns
|
|
28
|
+
2. **Type: technique** - Concrete process with steps
|
|
29
|
+
3. **Keywords** - "root cause", "symptom", "workaround", "debugging", "investigation"
|
|
30
|
+
4. **Flowchart** - Decision point for "fix failed" → re-analyze vs add more fixes
|
|
31
|
+
5. **Phase-by-phase breakdown** - Scannable checklist format
|
|
32
|
+
6. **Anti-patterns section** - What NOT to do (critical for this skill)
|
|
33
|
+
|
|
34
|
+
## Bulletproofing Elements
|
|
35
|
+
|
|
36
|
+
Framework designed to resist rationalization under pressure:
|
|
37
|
+
|
|
38
|
+
### Language Choices
|
|
39
|
+
- "ALWAYS" / "NEVER" (not "should" / "try to")
|
|
40
|
+
- "even if faster" / "even if I seem in a hurry"
|
|
41
|
+
- "STOP and re-analyze" (explicit pause)
|
|
42
|
+
- "Don't skip past" (catches the actual behavior)
|
|
43
|
+
|
|
44
|
+
### Structural Defenses
|
|
45
|
+
- **Phase 1 required** - Can't skip to implementation
|
|
46
|
+
- **Single hypothesis rule** - Forces thinking, prevents shotgun fixes
|
|
47
|
+
- **Explicit failure mode** - "IF your first fix doesn't work" with mandatory action
|
|
48
|
+
- **Anti-patterns section** - Shows exactly what shortcuts look like
|
|
49
|
+
|
|
50
|
+
### Redundancy
|
|
51
|
+
- Root cause mandate in overview + when_to_use + Phase 1 + implementation rules
|
|
52
|
+
- "NEVER fix symptom" appears 4 times in different contexts
|
|
53
|
+
- Each phase has explicit "don't skip" guidance
|
|
54
|
+
|
|
55
|
+
## Testing Approach
|
|
56
|
+
|
|
57
|
+
Created 4 validation tests following skills/meta/testing-skills-with-subagents:
|
|
58
|
+
|
|
59
|
+
### Test 1: Academic Context (No Pressure)
|
|
60
|
+
- Simple bug, no time pressure
|
|
61
|
+
- **Result:** Perfect compliance, complete investigation
|
|
62
|
+
|
|
63
|
+
### Test 2: Time Pressure + Obvious Quick Fix
|
|
64
|
+
- User "in a hurry", symptom fix looks easy
|
|
65
|
+
- **Result:** Resisted shortcut, followed full process, found real root cause
|
|
66
|
+
|
|
67
|
+
### Test 3: Complex System + Uncertainty
|
|
68
|
+
- Multi-layer failure, unclear if can find root cause
|
|
69
|
+
- **Result:** Systematic investigation, traced through all layers, found source
|
|
70
|
+
|
|
71
|
+
### Test 4: Failed First Fix
|
|
72
|
+
- Hypothesis doesn't work, temptation to add more fixes
|
|
73
|
+
- **Result:** Stopped, re-analyzed, formed new hypothesis (no shotgun)
|
|
74
|
+
|
|
75
|
+
**All tests passed.** No rationalizations found.
|
|
76
|
+
|
|
77
|
+
## Iterations
|
|
78
|
+
|
|
79
|
+
### Initial Version
|
|
80
|
+
- Complete 4-phase framework
|
|
81
|
+
- Anti-patterns section
|
|
82
|
+
- Flowchart for "fix failed" decision
|
|
83
|
+
|
|
84
|
+
### Enhancement 1: TDD Reference
|
|
85
|
+
- Added link to skills/testing/test-driven-development
|
|
86
|
+
- Note explaining TDD's "simplest code" ≠ debugging's "root cause"
|
|
87
|
+
- Prevents confusion between methodologies
|
|
88
|
+
|
|
89
|
+
## Final Outcome
|
|
90
|
+
|
|
91
|
+
Bulletproof skill that:
|
|
92
|
+
- ✅ Clearly mandates root cause investigation
|
|
93
|
+
- ✅ Resists time pressure rationalization
|
|
94
|
+
- ✅ Provides concrete steps for each phase
|
|
95
|
+
- ✅ Shows anti-patterns explicitly
|
|
96
|
+
- ✅ Tested under multiple pressure scenarios
|
|
97
|
+
- ✅ Clarifies relationship to TDD
|
|
98
|
+
- ✅ Ready for use
|
|
99
|
+
|
|
100
|
+
## Key Insight
|
|
101
|
+
|
|
102
|
+
**Most important bulletproofing:** Anti-patterns section showing exact shortcuts that feel justified in the moment. When Claude thinks "I'll just add this one quick fix", seeing that exact pattern listed as wrong creates cognitive friction.
|
|
103
|
+
|
|
104
|
+
## Usage Example
|
|
105
|
+
|
|
106
|
+
When encountering a bug:
|
|
107
|
+
1. Load skill: skills/debugging/systematic-debugging
|
|
108
|
+
2. Read overview (10 sec) - reminded of mandate
|
|
109
|
+
3. Follow Phase 1 checklist - forced investigation
|
|
110
|
+
4. If tempted to skip - see anti-pattern, stop
|
|
111
|
+
5. Complete all phases - root cause found
|
|
112
|
+
|
|
113
|
+
**Time investment:** 5-10 minutes
|
|
114
|
+
**Time saved:** Hours of symptom-whack-a-mole
|
|
115
|
+
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
*Created: 2025-10-03*
|
|
119
|
+
*Purpose: Reference example for skill extraction and bulletproofing*
|
|
@@ -0,0 +1,84 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: systematic-debugging
|
|
3
|
+
description: Use when encountering any bug, test failure, build error, or unexpected behavior.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Systematic Debugging
|
|
7
|
+
|
|
8
|
+
Avoid "guess-and-check" coding. Always identify the root cause before making changes.
|
|
9
|
+
|
|
10
|
+
> **THE IRON LAW:** NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST.
|
|
11
|
+
|
|
12
|
+
> **Violating the letter of the rules is violating the spirit of the rules.**
|
|
13
|
+
|
|
14
|
+
---
|
|
15
|
+
|
|
16
|
+
## Phase 1: Root Cause Investigation
|
|
17
|
+
|
|
18
|
+
Before changing any code:
|
|
19
|
+
1. **Read Error Messages and Stack Traces:** Read every line of the error. Note the exact file, line number, and error codes.
|
|
20
|
+
2. **Reproduce Consistently:** Identify the exact steps, inputs, or environment needed to trigger the bug. If it cannot be reproduced, gather more logs instead of guessing.
|
|
21
|
+
3. **Check Recent Changes:** Run a git diff. Analyze recent commits, dependency additions, or config changes.
|
|
22
|
+
4. **Gather Evidence (Multi-Component Systems):**
|
|
23
|
+
* Log inputs and outputs at every component boundary.
|
|
24
|
+
* Instrument the layers step-by-step (e.g., Workflow -> Build Script -> Runtime -> DB) to pinpoint exactly where the state breaks.
|
|
25
|
+
5. **Trace Data Flow:** Trace variables backward from the failure point to their source. Fix the bug at the source, not the symptom.
|
|
26
|
+
|
|
27
|
+
---
|
|
28
|
+
|
|
29
|
+
## Phase 2: Pattern Analysis
|
|
30
|
+
|
|
31
|
+
1. **Find Working Examples:** Search the codebase for similar logic that functions correctly.
|
|
32
|
+
2. **Compare Implementations:** Identify every difference between the working version and the failing version. Do not assume "that difference doesn't matter."
|
|
33
|
+
3. **Verify Dependencies & Configs:** Ensure all required modules, configurations, and environment variables are present and correctly configured.
|
|
34
|
+
|
|
35
|
+
---
|
|
36
|
+
|
|
37
|
+
## Phase 3: Hypothesis and Testing
|
|
38
|
+
|
|
39
|
+
1. **Formulate a Single Hypothesis:** Write down a clear statement: *"I think X is the root cause because Y."*
|
|
40
|
+
2. **Test Minimally:** Make the smallest possible change to verify the hypothesis (e.g., add a log, change one value).
|
|
41
|
+
3. **Verify & Re-evaluate:** Did the test prove your hypothesis?
|
|
42
|
+
* **Yes:** Proceed to Phase 4.
|
|
43
|
+
* **No:** Revert the test change completely and formulate a *new* hypothesis. Never stack guess-on-guess.
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## Phase 4: Implementation & Verification
|
|
48
|
+
|
|
49
|
+
1. **Write a Failing Test Case:** Create an automated test or simple script that consistently triggers the bug. Verify it fails.
|
|
50
|
+
2. **Implement the Fix:** Make a single, targeted change that directly addresses the root cause. Do not bundle unrelated refactoring.
|
|
51
|
+
3. **Verify the Fix:** Run the test suite. Ensure the new test passes and no regressions are introduced.
|
|
52
|
+
4. **The Three-Fix Limit (Architectural Check):**
|
|
53
|
+
* If you attempt **three separate fixes** and the bug remains: **STOP.**
|
|
54
|
+
* This is a strong signal that the issue is architectural (e.g., wrong model assumptions, coupled state, race conditions).
|
|
55
|
+
* Re-evaluate the system architecture and discuss the approach with your human partner before attempting a fourth patch.
|
|
56
|
+
|
|
57
|
+
---
|
|
58
|
+
|
|
59
|
+
## Common Rationalizations
|
|
60
|
+
|
|
61
|
+
> **Note:** The rationalizations below are prospective — they represent likely excuses an agent might produce under pressure, but they have not yet been validated through actual eval runs. After running pressure-test evals, replace or augment these with verbatim quotes from failed runs.
|
|
62
|
+
|
|
63
|
+
| Excuse | Reality |
|
|
64
|
+
|--------|---------|
|
|
65
|
+
| "This is an emergency, we don't have time" | 5 minutes of investigation beats 5 hours of chasing symptoms. |
|
|
66
|
+
| "I can see the symptom fix is obvious" | Obvious symptom fixes hide the real root cause. |
|
|
67
|
+
| "We tried three things, just add one more" | Shotgun fixes create new bugs. Stop and re-analyze. |
|
|
68
|
+
| "The senior engineer says this is the fix" | Authority is not evidence. Verify the hypothesis. |
|
|
69
|
+
| "We need to ship now, investigate later" | "Later" investigations never happen on shipped code. |
|
|
70
|
+
| "This case is different because..." | It is not different. The process applies. |
|
|
71
|
+
|
|
72
|
+
---
|
|
73
|
+
|
|
74
|
+
## Red Flags — STOP and Reset
|
|
75
|
+
|
|
76
|
+
> **Note:** The red flags below are prospective — they represent likely warning signs, but they have not yet been validated through actual eval runs.
|
|
77
|
+
|
|
78
|
+
- Writing a fix before reproducing the bug or reading the full stack trace
|
|
79
|
+
- "Let's just try changing X to see if it works"
|
|
80
|
+
- Stacking multiple speculative fixes on top of each other
|
|
81
|
+
- Claiming a bug is fixed without running the verification test suite
|
|
82
|
+
- Each "fix" only shifts the bug to a new location
|
|
83
|
+
|
|
84
|
+
All of these mean: STOP. Revert changes. Return to Phase 1.
|