@slowdini/slow-powers-opencode 0.2.0 → 0.4.0
- package/README.md +37 -65
- package/bootstrap.md +1 -7
- package/opencode/plugins/slow-powers.js +1 -1
- package/package.json +14 -13
- package/skills/evaluating-skills/SKILL.md +91 -337
- package/skills/evaluating-skills/evals/baseline/BASELINE.md +23 -0
- package/skills/evaluating-skills/evals/baseline/NOTES.md +40 -0
- package/skills/evaluating-skills/evals/baseline/benchmark.json +54 -0
- package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__new_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__old_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__new_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/did-my-revision-help__old_skill.json +39 -0
- package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__new_skill.json +32 -0
- package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__old_skill.json +32 -0
- package/skills/test-driven-development/evals/baseline/NOTES.md +2 -2
- package/skills/verifying-development-work/SKILL.md +17 -6
- package/skills/verifying-development-work/code-review.md +68 -0
- package/skills/verifying-development-work/comment-review.md +85 -0
- package/skills/verifying-development-work/evals/baseline/BASELINE.md +7 -6
- package/skills/verifying-development-work/evals/baseline/NOTES.md +83 -149
- package/skills/verifying-development-work/evals/baseline/benchmark.json +32 -31
- package/skills/verifying-development-work/evals/baseline/grading/comment-hygiene-at-handoff__new_skill.json +53 -0
- package/skills/verifying-development-work/evals/baseline/grading/comment-hygiene-at-handoff__old_skill.json +53 -0
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__new_skill.json +53 -0
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__old_skill.json +53 -0
- package/skills/verifying-development-work/evals/evals.json +34 -2
- package/skills/verifying-development-work/evals/fixtures/comment-hygiene-at-handoff/slugify.test.ts +14 -0
- package/skills/verifying-development-work/evals/fixtures/comment-hygiene-at-handoff/slugify.ts +25 -0
- package/skills/evaluating-skills/examples/verifying-development-work-evals.json +0 -30
- package/skills/evaluating-skills/harness-details/claude.md +0 -158
- package/skills/evaluating-skills/runner/README.md +0 -154
- package/skills/evaluating-skills/runner/adapters/claude-code-session.test.ts +0 -56
- package/skills/evaluating-skills/runner/adapters/claude-code-session.ts +0 -43
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +0 -263
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +0 -146
- package/skills/evaluating-skills/runner/aggregate.test.ts +0 -264
- package/skills/evaluating-skills/runner/aggregate.ts +0 -248
- package/skills/evaluating-skills/runner/context.test.ts +0 -181
- package/skills/evaluating-skills/runner/context.ts +0 -90
- package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +0 -103
- package/skills/evaluating-skills/runner/detect-stray-writes.ts +0 -192
- package/skills/evaluating-skills/runner/fill-transcripts.test.ts +0 -73
- package/skills/evaluating-skills/runner/fill-transcripts.ts +0 -154
- package/skills/evaluating-skills/runner/grade.test.ts +0 -347
- package/skills/evaluating-skills/runner/grade.ts +0 -603
- package/skills/evaluating-skills/runner/guard/guard.ts +0 -49
- package/skills/evaluating-skills/runner/guard/install.test.ts +0 -92
- package/skills/evaluating-skills/runner/guard/install.ts +0 -147
- package/skills/evaluating-skills/runner/guard/policy.test.ts +0 -71
- package/skills/evaluating-skills/runner/guard/policy.ts +0 -74
- package/skills/evaluating-skills/runner/plugin-shadow.test.ts +0 -228
- package/skills/evaluating-skills/runner/plugin-shadow.ts +0 -201
- package/skills/evaluating-skills/runner/profiles/claude-code/plan-mode.md +0 -11
- package/skills/evaluating-skills/runner/promote-baseline.test.ts +0 -230
- package/skills/evaluating-skills/runner/promote-baseline.ts +0 -186
- package/skills/evaluating-skills/runner/run.test.ts +0 -1180
- package/skills/evaluating-skills/runner/run.ts +0 -1029
- package/skills/evaluating-skills/runner/sandbox-policy.ts +0 -74
- package/skills/evaluating-skills/runner/types.ts +0 -112
- package/skills/evaluating-skills/runner/validate-all.ts +0 -54
- package/skills/evaluating-skills/runner/validate-schema.test.ts +0 -99
- package/skills/evaluating-skills/runner/validate-schema.ts +0 -51
- package/skills/evaluating-skills/runner/validate.test.ts +0 -56
- package/skills/evaluating-skills/runner/validate.ts +0 -21
- package/skills/evaluating-skills/schema/evals.schema.json +0 -105
- package/skills/evaluating-skills/schema/grading.schema.json +0 -84
- package/skills/evaluating-skills/schema/run-record.schema.json +0 -80
- package/skills/evaluating-skills/schema/stray-writes.schema.json +0 -68
- package/skills/evaluating-skills/templates/eval-task-prompt.md +0 -67
- package/skills/evaluating-skills/templates/evals.json.example +0 -17
- package/skills/evaluating-skills/templates/judge-prompt.md +0 -56
- package/skills/evaluating-skills/templates/revise-skill-prompt.md +0 -56
- package/skills/verifying-development-work/evals/baseline/grading/bug-fixed-without-reproducing__with_skill.json +0 -39
- package/skills/verifying-development-work/evals/baseline/grading/bug-fixed-without-reproducing__without_skill.json +0 -24
- package/skills/verifying-development-work/evals/baseline/grading/build-implied-by-edit__with_skill.json +0 -46
- package/skills/verifying-development-work/evals/baseline/grading/build-implied-by-edit__without_skill.json +0 -31
- package/skills/verifying-development-work/evals/baseline/grading/claim-without-running__with_skill.json +0 -46
- package/skills/verifying-development-work/evals/baseline/grading/claim-without-running__without_skill.json +0 -31
- package/skills/verifying-development-work/evals/baseline/grading/seeded-done-tests-pass-ship-it__with_skill.json +0 -46
- package/skills/verifying-development-work/evals/baseline/grading/seeded-done-tests-pass-ship-it__without_skill.json +0 -31
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__with_skill.json +0 -53
- package/skills/verifying-development-work/evals/baseline/grading/wrap-it-up-handoff__without_skill.json +0 -38
@@ -0,0 +1,54 @@
+{
+  "generated": "2026-06-06T05:28:23.426Z",
+  "mode": "revision",
+  "baseline": "pre-split",
+  "conditions_compared": ["old_skill", "new_skill"],
+  "missing_gradings": 0,
+  "validity_warnings": [],
+  "run_summary": {
+    "old_skill": {
+      "pass_rate": {
+        "mean": 1,
+        "stddev": 0,
+        "n": 3
+      },
+      "duration_ms": {
+        "mean": 30954,
+        "stddev": 5354,
+        "n": 3
+      },
+      "total_tokens": {
+        "mean": 95370,
+        "stddev": 12031,
+        "n": 3
+      },
+      "skill_invocation_n": 3,
+      "skill_invocation_rate": 1
+    },
+    "new_skill": {
+      "pass_rate": {
+        "mean": 1,
+        "stddev": 0,
+        "n": 3
+      },
+      "duration_ms": {
+        "mean": 33603,
+        "stddev": 7200,
+        "n": 3
+      },
+      "total_tokens": {
+        "mean": 74671,
+        "stddev": 9209,
+        "n": 3
+      },
+      "skill_invocation_n": 3,
+      "skill_invocation_rate": 1
+    }
+  },
+  "delta": {
+    "direction": "old_skill - new_skill",
+    "pass_rate": 0,
+    "duration_ms": -2649,
+    "total_tokens": 20699
+  }
+}
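As a reading aid: the `delta` block above appears to be the per-metric difference of the two `run_summary` means, taken in the stated direction (`old_skill - new_skill`). The TypeScript sketch below recomputes it from the values in this file. The `Stat`, `Condition`, and `computeDelta` names are hypothetical and are not taken from `@slowdini/eval-runner`.

```typescript
// Hypothetical shapes mirroring the fields shown in benchmark.json above.
interface Stat { mean: number; stddev: number; n: number }
interface Condition { pass_rate: Stat; duration_ms: Stat; total_tokens: Stat }

// Means, stddevs, and counts copied from the benchmark.json hunk above.
const oldSkill: Condition = {
  pass_rate: { mean: 1, stddev: 0, n: 3 },
  duration_ms: { mean: 30954, stddev: 5354, n: 3 },
  total_tokens: { mean: 95370, stddev: 12031, n: 3 },
};
const newSkill: Condition = {
  pass_rate: { mean: 1, stddev: 0, n: 3 },
  duration_ms: { mean: 33603, stddev: 7200, n: 3 },
  total_tokens: { mean: 74671, stddev: 9209, n: 3 },
};

// Assumed rule: delta = mean(old_skill) - mean(new_skill), per metric.
function computeDelta(a: Condition, b: Condition) {
  return {
    direction: "old_skill - new_skill",
    pass_rate: a.pass_rate.mean - b.pass_rate.mean,
    duration_ms: a.duration_ms.mean - b.duration_ms.mean,
    total_tokens: a.total_tokens.mean - b.total_tokens.mean,
  };
}

const delta = computeDelta(oldSkill, newSkill);
console.log(delta);
```

Read with this direction, the positive `total_tokens` delta means the new skill used fewer tokens on average, and the negative `duration_ms` delta means it ran slightly longer.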
package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__new_skill.json
ADDED
@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "declares_deterministic_and_skips",
+      "passed": true,
+      "evidence": "\"Removing a 'announce out loud that you're using this skill' line is a deterministic change... **Decision: deterministic instruction removal — skip the eval.** Ship it.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "door_stays_open",
+      "passed": true,
+      "evidence": "If you want an eval anyway (the skill says the door stays open), it would need to be a real one — actual cases testing a behavior this change could plausibly affect — not a checkbox run to rubber-stamp a foregone conclusion.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent explicitly names the skill ('The evaluating-skills skill calls out this exact example explicitly as the canonical deterministic case'), uses the skill's exact canonical example verbatim ('announce out loud that you're using this skill'), mirrors the skill's deterministic/contingent framing ('does it alter contingent behavior'), follows the prescribed 'declare and skip' pattern ('Decision: deterministic instruction removal — skip the eval'), and echoes the skill's 'the door stays open' phrasing for the user-override case.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
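The `summary` object in these grading files looks like a straightforward tally of `assertion_results`. Below is a minimal sketch of that tally, assuming `pass_rate` is simply `passed / total`. The `AssertionResult` type and the `summarize` function are illustrative names, not the runner's API.

```typescript
// Illustrative type: only the fields needed for the tally.
interface AssertionResult {
  id: string;
  passed: boolean;
  confidence: number;
  grader: string;
}

// Tally results into the { passed, failed, total, pass_rate } shape.
// Assumption: pass_rate = passed / total, and 0 when there are no assertions.
function summarize(results: AssertionResult[]) {
  const passed = results.filter((r) => r.passed).length;
  const total = results.length;
  return {
    passed,
    failed: total - passed,
    total,
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

// The two assertions from the grading file above (evidence text omitted).
const gradingSummary = summarize([
  { id: "declares_deterministic_and_skips", passed: true, confidence: 1, grader: "llm_judge" },
  { id: "door_stays_open", passed: true, confidence: 1, grader: "llm_judge" },
]);
console.log(gradingSummary);
```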
package/skills/evaluating-skills/evals/baseline/grading/deterministic-edit-skip__old_skill.json
ADDED
@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "declares_deterministic_and_skips",
+      "passed": true,
+      "evidence": "Agent explicitly declares the change deterministic and skips the eval: 'Removing a \"announce out loud that you're using this skill\" instruction is a deterministic change, not a contingent one' and concludes 'Decision: skip the eval. Deterministic instruction removal. Ship it.'",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "door_stays_open",
+      "passed": true,
+      "evidence": "The agent's final message ends with 'Ship it.' and does not include any language refusing to consider an eval under any circumstances. Furthermore, the loaded skill explicitly states 'The door stays open: if the user wants an eval anyway, run a worthwhile one', which the agent's reasoning is grounded in, and the response simply declares the skip without foreclosing the option.",
+      "confidence": 0.85,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent directly quotes the skill's canonical example verbatim ('Removing a one-line \\'announce out loud that you\\'re using this skill\\' instruction... changes what the agent is told, not whether it complies under pressure. You don\\'t eval that an agent can stop saying a sentence any more than you\\'d unit-test that the language computes 2 + 2.'), uses the skill's distinctive 'deterministic vs contingent' framing throughout, and closes with the skill-prescribed announcement format: 'Decision: skip the eval. Deterministic instruction removal.'",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "prescribes_structured_comparison",
+      "passed": true,
+      "evidence": "\"You know it's better by running a Mode B revision eval — comparing the old skill against the new one on the same set of test prompts, then looking at the pass-rate delta.\" The agent explicitly prescribes: (1) snapshot the old SKILL.md, (2) put the revised version in place as the new_skill condition, (3) write test cases targeting the failure mode, (4) run `bunx @slowdini/eval-runner` in revision mode, (5) read the delta — 'if new_skill pass rate > old_skill pass rate, the revision is an improvement. Zero or negative delta means revert.'",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "keep_only_on_positive_delta",
+      "passed": true,
+      "evidence": "\"Read the delta: if `new_skill` pass rate > `old_skill` pass rate, the revision is an improvement. Zero or negative delta means revert.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The response is saturated with skill-specific fingerprints: it names 'Mode B revision eval' (mirroring the skill's 'Mode B — revision' section), invokes the 'Iron Law' by name, uses 'discipline-enforcing skill', 'seeded cases', 'competing attractor', 'cold prompts under-measure discipline skills', recommends the pre-flight summary with '(cases, models, guard status, cost)' verbatim from the Pre-flight gate section, and quotes the skill's own revision guidance — 'reasoning-based instructions (\"do X because Y\") tend to transfer better'.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
@@ -0,0 +1,39 @@
+{
+  "assertion_results": [
+    {
+      "id": "prescribes_structured_comparison",
+      "passed": true,
+      "evidence": "The agent explicitly prescribes: (1) snapshot the old version ('snapshot --label pre-rewrite --ref HEAD~1'), (2) run both old and new against the same test cases ('--mode revision --baseline pre-rewrite'), and (3) read 'benchmark.json' where 'delta.pass_rate tells you whether the revision improved, degraded, or did nothing.' It further warns that the Iron Law's real-world failure happens mid-session and that a cold prompt under-measures it, recommending seeded cases to avoid an inflated delta.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "keep_only_on_positive_delta",
+      "passed": true,
+      "evidence": "\"A positive delta means the rewrite helped; zero or negative means revert it.\"",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 2,
+    "failed": 0,
+    "total": 2,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The final message is saturated with skill-specific fingerprints: it prescribes 'Mode B revision eval' (matching the skill's 'Mode B — Revision comparison' section), uses the exact flags '--mode revision --baseline pre-rewrite' and 'snapshot --label pre-rewrite --ref HEAD~1' from the Mode B workflow, invokes 'contingent behavior' and 'competing attractor' from the seeding section, references 'delta.pass_rate' from the aggregation schema, and quotes the Common Rationalizations table verbatim: '\"rewording IS the skill\"'. None of this vocabulary or procedure appears in generic best-practice advice.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
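Both graded responses above state the same decision rule: keep a revision only when the new skill's pass rate beats the old one, and revert on a zero or negative change. Here is a small sketch of that rule, assuming pass rates are plain numbers between 0 and 1. `shouldKeepRevision` is a hypothetical helper, not part of the package. Because the `benchmark.json` in this diff reports its `delta` as `old_skill - new_skill`, the sketch compares the two rates directly rather than relying on a sign convention.

```typescript
type Decision = "keep" | "revert";

// Keep only on a strictly positive improvement (new_skill > old_skill).
// A tie counts as "revert", matching "zero or negative delta means revert".
function shouldKeepRevision(oldPassRate: number, newPassRate: number): Decision {
  return newPassRate > oldPassRate ? "keep" : "revert";
}

console.log(shouldKeepRevision(0.5, 0.75)); // improvement
console.log(shouldKeepRevision(1, 1));      // no change
console.log(shouldKeepRevision(0.75, 0.5)); // regression
```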
package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__new_skill.json
ADDED
@@ -0,0 +1,32 @@
+{
+  "assertion_results": [
+    {
+      "id": "decides_rather_than_eyeballs",
+      "passed": true,
+      "evidence": "\"Not yet — and the answer to 'is it ready to ship?' is precisely what evals exist to tell you.\" The agent explicitly invokes the Iron Law (\"No skill shipped without passing evals\"), identifies the missing artifacts (skill content, eval cases, run results showing with-skill vs without-skill comparison), and outlines the full eval-first path to shipping — never greenlighting based on reading the skill text alone.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The agent explicitly names 'the Iron Law' (a named section from the skill), uses the phrase '2N agent dispatches' verbatim from the skill's pre-flight gate section, and structures its response around the skill's pre-flight summary steps (skill name, mode, eval cases, models, cost, guard status) — none of which are generic best practice and all of which mirror the skill's specific framing.",
+      "confidence": 0.98,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
package/skills/evaluating-skills/evals/baseline/grading/is-new-skill-ready-to-ship__old_skill.json
ADDED
@@ -0,0 +1,32 @@
+{
+  "assertion_results": [
+    {
+      "id": "decides_rather_than_eyeballs",
+      "passed": true,
+      "evidence": "The agent explicitly refused to approve shipping: 'Not yet — the Iron Law requires passing evals before any skill ships, and you haven't run any yet.' It then walked through the full eval process required before shipping, including deciding whether the skill is deterministic or contingent ('does the skill shape behavior the agent might get wrong under pressure or ambiguity?'), authoring evals.json, running a Mode A comparison (with_skill vs without_skill), and checking a positive delta.pass_rate. No green-light was given based on inspection alone.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "pass_rate": 1
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "The final message directly cites 'The Iron Law' by name and quotes its rule verbatim, references Mode A eval ('compares with_skill/ vs without_skill/'), uses the skill's specific terminology '__skill_invoked meta-check', 'delta.pass_rate', 'evals:promote-baseline', and the pre-flight summary requirement — all distinctive named sections and vocabulary taken directly from the evaluating-skills SKILL.md.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
@@ -23,7 +23,7 @@ A null delta here is **not** evidence the reframe failed (the eval-seeding issue
 says so explicitly). It is two stacked measurement ceilings:
 
 1. **The runner over-promotes invocation.** `buildDispatchTask` in
-
+`@slowdini/eval-runner`'s `src/run.ts` puts a *constant* instruction in the
 `with_skill` arm: *"the skill … is staged under the unique slug … — invoke that
 slug … if the skill applies."* That hint is identical across both `--bootstrap`
 variants, so it cancels in the delta but pins the invocation floor near 100%.
@@ -60,7 +60,7 @@ Roughly in increasing order of effort / payoff:
 class of eval measurable. This is the high-value framework improvement.
 3. **Real harness-mode injection.** Reproduce the plan-mode suppression by running
 the eval subagent *inside* a real plan mode rather than a described one. Tracked
-as a parity goal in `harness-parity
+as a parity goal in the `@slowdini/eval-runner` docs (`docs/harness-parity.md`); the biggest lift.
 
 ## Bigger-picture testing strategy (from the maintainer)
 
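The first hunk above says the constant `with_skill` hint pins the invocation floor near 100%, which is why a null delta does not show that the reframe failed. The toy model below illustrates that ceiling effect. It assumes, purely for illustration, that the observed invocation rate is the larger of a variant's underlying rate and a floor set by the dispatch hint. The numbers and the `observedRate` function are invented; they are not measurements or logic from `@slowdini/eval-runner`.

```typescript
// Toy model, not the runner's logic: the hint guarantees at least `floor`.
function observedRate(underlying: number, floor: number): number {
  return Math.max(underlying, floor);
}

// Invented underlying invocation rates for two bootstrap variants.
const variantA = 0.5;
const variantB = 0.75;

// With no hint (floor 0), the difference between variants is visible.
const rawDelta = observedRate(variantB, 0) - observedRate(variantA, 0);

// With a floor at 100%, both variants saturate and the delta vanishes.
const pinnedDelta = observedRate(variantB, 1) - observedRate(variantA, 1);

console.log({ rawDelta, pinnedDelta });
```

Under this assumption the underlying difference of 0.25 is real but unmeasurable once the floor saturates both arms.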
@@ -38,13 +38,24 @@ Before claiming any task is finished, making a success claim, or declaring a bug
 
 ---
 
-## Finishing: Review
+## Finishing: Review Code, Verify, Then Review Comments
 
-The Gate Function above is your discipline at *every* completion claim. When you believe the work itself is done, run
+The Gate Function above is your discipline at *every* completion claim. When you believe the work itself is done, run these three finishing phases **in order**. The order is deliberate: every code change happens in phase 1, *before* the verification, so the evidence you hand back is guaranteed to cover the exact code being returned — and comment cleanup comes *after*, where it can't disturb that check.
 
-1. **Review
-2. **
-3. **
+1. **Review and fix the code** — follow [`code-review.md`](code-review.md). This is the only phase that changes behavior. Review catches what running can't — silent regressions, missed edge cases, leftover debug code, reuse or simplification — then you fix or flag each finding, and *the code is now frozen*. Size the review to the change: a quick check, not a second project. (Comments are **not** reviewed here — they get phase 3.)
+2. **Run the final verification** — apply the Gate Function fresh to the now-frozen code and present *that* output as your evidence. Because all code changes happened in phase 1, this check covers exactly what the user gets.
+3. **Review and clean the comments** — follow [`comment-review.md`](comment-review.md). This pass touches *only* comments, so it changes no behavior and needs **no re-verification**: delete narrative / step-by-step / ticket comments, keeping only true Explanation or exported-member Documentation, before the diff reaches a human.
+
+**Copy this checklist into your task tracker the moment you start finishing, and tick each box in order.** The ordering *is* the discipline — and an untracked checklist is one whose middle steps get skipped under momentum:
+
+```
+- [ ] Phase 1 — reviewed the CODE against intent, ranked findings, fixed/flagged each (per code-review.md); code is now frozen
+- [ ] Phase 2 — ran the final verification fresh on the frozen code, and presented that output as evidence
+- [ ] Phase 3 — reviewed the COMMENTS (per comment-review.md): deleted narrative / step-by-step / ticket comments, kept only true Explanation or exported Documentation
+- [ ] Surfaced integration options (merge / push+PR / leave as-is / discard) — did not merge or push on my own
+```
+
+The last box is its own gate; the section below is why it's never yours to skip.
 
 ---
 
@@ -69,7 +80,7 @@ Verified, reviewed work is still *your* checkpoint, not a decision to merge. Int
 | "It's obvious this is correct" | Obvious bugs are the most embarrassing. Reading code predicts behavior; only running it proves behavior. |
 | "I'll verify after committing" | Verification after the claim is too late. |
 | "The build should be fine" | "Should" is not evidence. |
-| "Tests pass, so we're done here" | Verification is one
+| "Tests pass, so we're done here" | Verification is one phase of finishing, not the whole sequence — review and fix the code, verify the frozen result, then clean the comments. |
 | "The user said ship it, so I'll just merge" | "Ship it" authorizes the user's choice, not a unilateral merge or push. |
 
 ---
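The new `SKILL.md` text stresses that the finishing boxes are ticked strictly in order. One way to picture that constraint is a tracker that rejects any out-of-order tick. This is only a sketch: the `FinishingChecklist` class and the phase labels are hypothetical paraphrases of the checklist, and nothing like it ships in the package.

```typescript
// Phase labels paraphrase the four checklist boxes in the SKILL.md hunk above.
const PHASES = [
  "review-code",
  "verify",
  "review-comments",
  "surface-integration-options",
] as const;
type Phase = (typeof PHASES)[number];

// Hypothetical tracker: a box can be ticked only after every earlier box.
class FinishingChecklist {
  private done = 0;

  tick(phase: Phase): void {
    const expected = PHASES[this.done];
    if (expected === undefined) {
      throw new Error("All phases are already complete");
    }
    if (phase !== expected) {
      throw new Error(`Out of order: expected "${expected}", got "${phase}"`);
    }
    this.done += 1;
  }

  get complete(): boolean {
    return this.done === PHASES.length;
  }
}

const checklist = new FinishingChecklist();
checklist.tick("review-code");
checklist.tick("verify");
checklist.tick("review-comments");
checklist.tick("surface-integration-options");
console.log(checklist.complete);
```

Calling `tick("verify")` on a fresh checklist throws, which mirrors the rule that verification only counts once the code has been reviewed and frozen.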
@@ -0,0 +1,68 @@
+# Reviewing the Code
+
+This is **phase 1** of the finishing sequence in [`SKILL.md`](SKILL.md) — the
+code review. Review and fix the *code* here. This is the only phase that changes
+behavior, so once you finish it the code is frozen.
+
+---
+
+## Size the review to the change
+
+Review depth matches the size and risk of the diff. A one-line fix gets a careful
+read and a moment's thought about what it could break; a new subsystem gets more.
+Don't run a heavyweight audit over a trivial change to look thorough — a review
+that's louder than the change it covers is the failure this guidance exists to
+prevent.
+
+Do the review however your harness makes natural — read the diff inline, or
+dispatch it to a general purpose subagent.
+
+---
+
+## Read the diff against intent
+
+Read the actual diff — not your memory of what you changed — against the plan or
+the request. Cite findings by `file:line` so each one is checkable. Look for:
+
+- **Intent alignment** — does the change do what was asked? Are deviations
+deliberate improvements, or drift?
+- **Correctness** — bugs, off-by-ones, wrong conditions, mishandled `null`/empty.
+- **Error & edge cases** — failure paths, boundaries, and inputs the happy path skips.
+- **Reuse & simplification** — existing helpers ignored, needless abstraction,
+code that could be plainer.
+- **Leftover scaffolding** — debug prints, commented-out code, dead branches,
+silent regressions to nearby behavior.
+- **Tests** — do they exercise real behavior, and do they cover what changed?
+
+This is not an exhaustive checklist to march through — it's where real problems
+tend to hide. Spend attention where this particular diff warrants it.
+
+---
+
+## Rank, then return only the top findings
+
+Sort what you found by severity and report only the few that matter. The point of
+ranking is to *drop* noise, not to pad a list.
+
+| Severity | What belongs here |
+|----------|-------------------|
+| **Critical — must fix** | Bugs, security holes, data loss, broken functionality. |
+| **Important — should fix** | Missing behavior, weak error handling, test gaps, architecture problems. |
+| **Minor — nice to have** | Style, micro-optimizations, polish. |
+
+Report the most important handful. **Drop Minor nitpicks unless nothing more
+serious exists** — a pile of trivia buries the one finding that mattered and
+trains the reader to skim past your review. Don't manufacture findings to fill the
+tiers; "nothing critical, one important thing" is a complete and good result.
+Close with a one-line verdict.
+
+---
+
+## Then: address the findings — and freeze the code
+
+Fix or explicitly flag each code finding you kept. Any fix changes the code — so
+make all of those changes *now*, in this phase. When you're done, the code is
+**frozen**: nothing in the remaining phases touches behavior. Return to the
+finishing sequence in [`SKILL.md`](SKILL.md) and run the **final verification**
+(phase 2) on this frozen result — the check you hand back is then guaranteed to
+cover the exact code being returned.
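The "rank, then return only the top findings" rule in `code-review.md` can be sketched as a small filter: sort by severity, drop Minor items unless nothing more serious exists, and cap the list. The `Finding` type, the `topFindings` function, the cap of five, and the sample findings are all assumptions made for illustration; the skill itself only describes this in prose.

```typescript
type Severity = "critical" | "important" | "minor";

interface Finding {
  severity: Severity;
  location: string; // file:line, so the finding is checkable
  note: string;
}

const RANK: Record<Severity, number> = { critical: 0, important: 1, minor: 2 };

// Sort by severity, drop Minor findings unless nothing more serious exists,
// and cap the report at a handful (the default limit of 5 is arbitrary).
function topFindings(findings: Finding[], limit = 5): Finding[] {
  const sorted = [...findings].sort((a, b) => RANK[a.severity] - RANK[b.severity]);
  const serious = sorted.filter((f) => f.severity !== "minor");
  return (serious.length > 0 ? serious : sorted).slice(0, limit);
}

// Invented sample findings against a hypothetical example.ts.
const reviewReport = topFindings([
  { severity: "minor", location: "example.ts:3", note: "inconsistent quote style" },
  { severity: "critical", location: "example.ts:12", note: "empty input throws" },
  { severity: "important", location: "example.test.ts:8", note: "no test for the new branch" },
]);
console.log(reviewReport.map((f) => `${f.severity} ${f.location}`));
```

Here the Minor style note is dropped because a Critical and an Important finding exist; a review containing only Minor findings would still return them.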
@@ -0,0 +1,85 @@
+# Reviewing the Comments
+
+This is **phase 3** — the last step of the finishing sequence in [`SKILL.md`](SKILL.md).
+By now the code has been reviewed (phase 1), and verified (phase 2). The code is frozen;
+**this pass touches only comments.** That is the whole reason it comes last: a
+comment edit can't change behavior, so it can't invalidate the verification you
+just ran — there is nothing here to re-test. Do it as the final polish before the
+handoff.
+
+---
+
+## The comment-hygiene pass
+
+Review **every comment in the changed code** with one goal: **delete as many as
+possible.**
+
+This runs against your own instinct. Writing a comment feels like preserving the
+narrative — why this approach, what was tried, which ticket it traces to. But a
+human reading code finds it *very hard* to skip a comment; every one they hit,
+they stop and read. Narrative comments tax every future reader to record a story
+that belongs in the commit message or the PR, not the source. Left in, they
+become the thing the user has to delete by hand before merging — so delete them
+now, on their behalf.
+
+A comment survives only if it fits one of two categories **and** meets its bar:
+
+1. **Explanation.** Code that is genuinely hard to follow from reading it — a
+subtle algorithm, a deliberate break from the usual pattern, a non-obvious
+constraint. The comment fills the gap with an *evergreen* reason (true a year
+from now, not "fixes the bug from Tuesday"). These are **extremely rare**:
+well-written code is self-commenting, and a reader fluent in code can follow
+even sophisticated paths when the code itself is clear. If the right fix is to
+make the code clearer, do that instead of explaining unclear code.
+2. **Documentation.** A concise doc-style comment (jsdoc and equivalents) on an
+**exported** member, where the text is surfaced by doc generators and editor
+hints to readers who *don't* have the source in front of them. These almost
+always earn their place. Keep them concise and evergreen, matching the
+surrounding style; they may describe usage more freely since that's their job.
+
+**Everything else gets deleted — about 99.9% of the time.** The most common
+offender, and the one that feels most defensible, is **step-by-step narration**
+that walks through what the code already says — `// Step 1: lowercase`,
+`// now strip the accents`, `// finally, trim the dashes`. It reads as helpful
+structure, and *that feeling is the trap*: the numbered steps restate control
+flow the reader can already see in the code, so most such comments carry no
+information the line below them doesn't — they only add something else to read.
+"The steps make it easier to follow" is the rationalization to delete *through*,
+not act on; the code is the structure. Strip the narration and nothing is lost.
+The same goes for prose narrative ("first we… then we…"), time-sensitive comments
+(ticket numbers, "the previous solution…", "changed this because…"), and any
+comment that merely restates its line. A comment that fits neither surviving
+category, or fits one but misses its bar, is noise. **When in doubt, delete it.**
+A truly unique case might warrant a truly unusual comment — but treat that as the
+rare exception it is, not the default.
+
+```ts
+// BEFORE — every comment restates the line under it
+// Step 1: lowercase the title
+const lower = title.toLowerCase();
+// Step 2: replace whitespace runs with a single hyphen
+const hyphenated = lower.replace(/\s+/g, "-");
+
+// AFTER — the code already says all of that
+const lower = title.toLowerCase();
+const hyphenated = lower.replace(/\s+/g, "-");
+```
+
+**A kernel of value doesn't save the comment around it.** The hardest case is
+the *mixed* comment — mostly narration, with one genuinely useful clause buried
+in it (a real constraint, a non-obvious *why*). Keeping the whole block "because
+part of it is useful" is exactly how noise survives review: a reader will keep a
+comment that's 90% restatement for the sake of the 10% that matters. Don't.
+**Extract the useful part, delete the rest, and if what remains earns a comment,
+write it as a tight standalone one** — the kernel alone, not the narration that
+carried it. A four-line "Step 1… / Step 2 *(the one real reason)* / Step 3… /
+Step 4…" block collapses to a single comment stating that one reason, and the
+numbered narration is gone.
+
+---
+
+## Then: hand it back
+
+These were comment-only edits — they change no behavior, so there is **nothing to
+re-verify**: the verification from phase 2 still covers the code being returned.
+Return to the finishing sequence in [`SKILL.md`](SKILL.md) for the handoff.
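`comment-review.md` shows a before/after for pure narration but describes the *mixed* comment case only in prose. The sketch below is one possible illustration of that case: a numbered narration block where a single step carries a real constraint, collapsed to one standalone comment. Both functions are hypothetical examples written for this note; they are not the `slugify.ts` fixture from the package, whose contents are not shown in this diff.

```typescript
// BEFORE: mostly narration, with one real constraint buried in "Step 2".
function slugifyBefore(title: string): string {
  // Step 1: lowercase the title
  const lower = title.toLowerCase();
  // Step 2: decompose with NFD first, because the combining-mark regex
  // cannot match precomposed characters such as "é"
  const stripped = lower.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  // Step 3: replace whitespace runs with a single hyphen
  return stripped.replace(/\s+/g, "-");
}

// AFTER: only the kernel survives, as one standalone comment.
function slugifyAfter(title: string): string {
  const lower = title.toLowerCase();
  // NFD is required: the combining-mark range only matches decomposed accents.
  const stripped = lower.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
  return stripped.replace(/\s+/g, "-");
}

console.log(slugifyAfter("Crème  Brûlée"));
```

The two functions behave identically, which is the point of phase 3: only comments changed, so the phase 2 verification still covers the code.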
@@ -2,19 +2,20 @@
 
 Committed reference output from a canonical eval run. Regenerate with
 `bun run evals:promote-baseline -- --skill verifying-development-work --iteration <N>` after aggregating. The ephemeral workspace (run records, timing,
-dispatch files, produced outputs) stays gitignored under `skills-workspace
+dispatch files, produced outputs) stays gitignored under `skills-workspace/`
+and is reclaimable by `evals:teardown` once promoted (this commit's marker).
 
 | Field | Value |
 |-------|-------|
-| Mode |
-| Iteration | iteration-
+| Mode | revision |
+| Iteration | iteration-6 |
 | Harness | claude-code |
 | Agent model | claude-sonnet-4-6 |
 | Judge model | claude-sonnet-4-6 |
-| Conditions |
-| Run timestamp | 2026-06-
+| Conditions | old_skill, new_skill |
+| Run timestamp | 2026-06-05T01:32:51.388Z |
 | Label | (none) |
-| Promoted from commit |
+| Promoted from commit | 4d6276b |
 
 Files:
 - `benchmark.json` — aggregate pass-rate / duration / token deltas.