@slowdini/slow-powers-opencode 0.1.0
- package/LICENSE +22 -0
- package/README.md +174 -0
- package/bootstrap.md +16 -0
- package/opencode/plugins/slow-powers.js +86 -0
- package/package.json +66 -0
- package/skills/auditing-slow-powers-usage/SKILL.md +157 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/BASELINE.md +22 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/NOTES.md +72 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/benchmark.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__with_skill.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-blindspot-session__without_skill.json +38 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__with_skill.json +53 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/audits-completed-session__without_skill.json +38 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__with_skill.json +17 -0
- package/skills/auditing-slow-powers-usage/evals/baseline/grading/ordinary-dev-task-no-audit__without_skill.json +17 -0
- package/skills/auditing-slow-powers-usage/evals/evals.json +74 -0
- package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md +39 -0
- package/skills/auditing-slow-powers-usage/evals/fixtures/audits-completed-session/session-summary.md +33 -0
- package/skills/evaluating-skills/SKILL.md +448 -0
- package/skills/evaluating-skills/evals/evals.json +52 -0
- package/skills/evaluating-skills/evals/fixtures/iron-law/candidate-skill.md +13 -0
- package/skills/evaluating-skills/examples/verification-before-completion-evals.json +30 -0
- package/skills/evaluating-skills/harness-details/claude.md +135 -0
- package/skills/evaluating-skills/pressure-scenarios.md +163 -0
- package/skills/evaluating-skills/runner/README.md +140 -0
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.test.ts +263 -0
- package/skills/evaluating-skills/runner/adapters/claude-code-transcript.ts +146 -0
- package/skills/evaluating-skills/runner/aggregate.test.ts +188 -0
- package/skills/evaluating-skills/runner/aggregate.ts +228 -0
- package/skills/evaluating-skills/runner/context.test.ts +181 -0
- package/skills/evaluating-skills/runner/context.ts +90 -0
- package/skills/evaluating-skills/runner/detect-stray-writes.test.ts +103 -0
- package/skills/evaluating-skills/runner/detect-stray-writes.ts +192 -0
- package/skills/evaluating-skills/runner/fill-transcripts.test.ts +73 -0
- package/skills/evaluating-skills/runner/fill-transcripts.ts +154 -0
- package/skills/evaluating-skills/runner/grade.test.ts +347 -0
- package/skills/evaluating-skills/runner/grade.ts +603 -0
- package/skills/evaluating-skills/runner/guard/guard.ts +49 -0
- package/skills/evaluating-skills/runner/guard/install.test.ts +92 -0
- package/skills/evaluating-skills/runner/guard/install.ts +147 -0
- package/skills/evaluating-skills/runner/guard/policy.test.ts +71 -0
- package/skills/evaluating-skills/runner/guard/policy.ts +74 -0
- package/skills/evaluating-skills/runner/promote-baseline.test.ts +230 -0
- package/skills/evaluating-skills/runner/promote-baseline.ts +186 -0
- package/skills/evaluating-skills/runner/run.test.ts +716 -0
- package/skills/evaluating-skills/runner/run.ts +814 -0
- package/skills/evaluating-skills/runner/sandbox-policy.ts +74 -0
- package/skills/evaluating-skills/runner/types.ts +104 -0
- package/skills/evaluating-skills/runner/validate-all.ts +54 -0
- package/skills/evaluating-skills/runner/validate-schema.test.ts +99 -0
- package/skills/evaluating-skills/runner/validate-schema.ts +51 -0
- package/skills/evaluating-skills/runner/validate.test.ts +56 -0
- package/skills/evaluating-skills/runner/validate.ts +21 -0
- package/skills/evaluating-skills/schema/evals.schema.json +105 -0
- package/skills/evaluating-skills/schema/grading.schema.json +84 -0
- package/skills/evaluating-skills/schema/run-record.schema.json +80 -0
- package/skills/evaluating-skills/schema/stray-writes.schema.json +68 -0
- package/skills/evaluating-skills/templates/eval-task-prompt.md +71 -0
- package/skills/evaluating-skills/templates/evals.json.example +17 -0
- package/skills/evaluating-skills/templates/judge-prompt.md +56 -0
- package/skills/evaluating-skills/templates/revise-skill-prompt.md +56 -0
- package/skills/finishing-a-development-branch/SKILL.md +96 -0
- package/skills/finishing-a-development-branch/evals/evals.json +41 -0
- package/skills/finishing-a-development-branch/evals/fixtures/finish/package.json +4 -0
- package/skills/finishing-a-development-branch/evals/fixtures/finish/sum.test.ts +5 -0
- package/skills/hardening-plans/SKILL.md +72 -0
- package/skills/hardening-plans/evals/baseline/BASELINE.md +22 -0
- package/skills/hardening-plans/evals/baseline/NOTES.md +58 -0
- package/skills/hardening-plans/evals/baseline/benchmark.json +54 -0
- package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__new_skill.json +39 -0
- package/skills/hardening-plans/evals/baseline/grading/concrete-todo-app-plan__old_skill.json +39 -0
- package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__new_skill.json +24 -0
- package/skills/hardening-plans/evals/baseline/grading/csv-parser-bug-no-plan__old_skill.json +24 -0
- package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__new_skill.json +46 -0
- package/skills/hardening-plans/evals/baseline/grading/seeded-review-catches-defects__old_skill.json +46 -0
- package/skills/hardening-plans/evals/evals.json +114 -0
- package/skills/systematic-debugging/CREATION-LOG.md +119 -0
- package/skills/systematic-debugging/SKILL.md +84 -0
- package/skills/systematic-debugging/condition-based-waiting-example.ts +164 -0
- package/skills/systematic-debugging/condition-based-waiting.md +115 -0
- package/skills/systematic-debugging/defense-in-depth.md +122 -0
- package/skills/systematic-debugging/evals/baseline/BASELINE.md +22 -0
- package/skills/systematic-debugging/evals/baseline/benchmark.json +51 -0
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__with_skill.json +17 -0
- package/skills/systematic-debugging/evals/baseline/grading/feature-request-no-debugging__without_skill.json +17 -0
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__with_skill.json +46 -0
- package/skills/systematic-debugging/evals/baseline/grading/null-id-crash-investigate-first__without_skill.json +31 -0
- package/skills/systematic-debugging/evals/evals.json +45 -0
- package/skills/systematic-debugging/evals/fixtures/order-bug/orderHandler.ts +9 -0
- package/skills/systematic-debugging/evals/fixtures/order-bug/repro.ts +10 -0
- package/skills/systematic-debugging/find-polluter.sh +63 -0
- package/skills/systematic-debugging/root-cause-tracing.md +169 -0
- package/skills/systematic-debugging/test-academic.md +14 -0
- package/skills/systematic-debugging/test-pressure-1.md +58 -0
- package/skills/systematic-debugging/test-pressure-2.md +68 -0
- package/skills/systematic-debugging/test-pressure-3.md +69 -0
- package/skills/test-driven-development/SKILL.md +93 -0
- package/skills/test-driven-development/evals/baseline/BASELINE.md +22 -0
- package/skills/test-driven-development/evals/baseline/NOTES.md +74 -0
- package/skills/test-driven-development/evals/baseline/benchmark.json +51 -0
- package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__with_skill.json +53 -0
- package/skills/test-driven-development/evals/baseline/grading/slugify-under-time-pressure__without_skill.json +38 -0
- package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__with_skill.json +32 -0
- package/skills/test-driven-development/evals/baseline/grading/tests-after-rubber-stamp__without_skill.json +17 -0
- package/skills/test-driven-development/evals/evals.json +77 -0
- package/skills/test-driven-development/evals/fixtures/slugify/package.json +4 -0
- package/skills/test-driven-development/evals/fixtures/slugify/utils.ts +7 -0
- package/skills/test-driven-development/testing-anti-patterns.md +299 -0
- package/skills/using-git-worktrees/SKILL.md +70 -0
- package/skills/using-git-worktrees/evals/evals.json +40 -0
- package/skills/verification-before-completion/SKILL.md +65 -0
- package/skills/verification-before-completion/evals/baseline/BASELINE.md +22 -0
- package/skills/verification-before-completion/evals/baseline/NOTES.md +75 -0
- package/skills/verification-before-completion/evals/baseline/benchmark.json +51 -0
- package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__with_skill.json +39 -0
- package/skills/verification-before-completion/evals/baseline/grading/bug-fixed-without-reproducing__without_skill.json +24 -0
- package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__with_skill.json +46 -0
- package/skills/verification-before-completion/evals/baseline/grading/build-implied-by-edit__without_skill.json +31 -0
- package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__with_skill.json +46 -0
- package/skills/verification-before-completion/evals/baseline/grading/claim-without-running__without_skill.json +31 -0
- package/skills/verification-before-completion/evals/evals.json +77 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/api.ts +1 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/consumer.ts +3 -0
- package/skills/verification-before-completion/evals/fixtures/build-implied-by-edit/tsconfig.json +23 -0
- package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.test.ts +10 -0
- package/skills/verification-before-completion/evals/fixtures/claim-without-running/sum.ts +1 -0
- package/skills/writing-skills/SKILL.md +306 -0
- package/skills/writing-skills/evals/evals.json +40 -0
- package/skills/writing-skills/graphviz-conventions.dot +172 -0
- package/skills/writing-skills/persuasion-principles.md +187 -0
- package/skills/writing-skills/scripts/render-graphs.js +181 -0
@@ -0,0 +1,53 @@
+{
+  "generated": "2026-05-29T01:58:10.914Z",
+  "mode": "new-skill",
+  "conditions_compared": ["with_skill", "without_skill"],
+  "missing_gradings": 0,
+  "validity_warnings": [
+    "ordinary-dev-task-no-audit/with_skill wrote 4 file(s) outside its outputs dir — data point may be tainted (see stray-writes.json)."
+  ],
+  "run_summary": {
+    "with_skill": {
+      "pass_rate": {
+        "mean": 0.833,
+        "stddev": 0.118,
+        "n": 3
+      },
+      "duration_ms": {
+        "mean": 61176,
+        "stddev": 20622,
+        "n": 3
+      },
+      "total_tokens": {
+        "mean": 59354,
+        "stddev": 3076,
+        "n": 3
+      },
+      "skill_invocation_n": 2,
+      "skill_invocation_rate": 1
+    },
+    "without_skill": {
+      "pass_rate": {
+        "mean": 0.917,
+        "stddev": 0.118,
+        "n": 3
+      },
+      "duration_ms": {
+        "mean": 42761,
+        "stddev": 6009,
+        "n": 3
+      },
+      "total_tokens": {
+        "mean": 60070,
+        "stddev": 4794,
+        "n": 3
+      }
+    }
+  },
+  "delta": {
+    "direction": "with_skill - without_skill",
+    "pass_rate": -0.084,
+    "duration_ms": 18415,
+    "total_tokens": -716
+  }
+}
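Reviewer note: the "delta" block in the hunk above is consistent with subtracting the rounded without_skill means from the rounded with_skill means, as its "direction" field says. The sketch below reproduces that arithmetic. The `Stat` and `ConditionSummary` types, the `round3` helper, and `computeDelta` are hypothetical names introduced for illustration; how the package's runner actually computes the delta is not shown in this diff.

```typescript
// Hypothetical shapes mirroring the run_summary fields in the hunk above.
interface Stat { mean: number; stddev: number; n: number }
interface ConditionSummary {
  pass_rate: Stat;
  duration_ms: Stat;
  total_tokens: Stat;
}

// Round to three decimals so floating-point subtraction noise is dropped.
const round3 = (x: number): number => Math.round(x * 1000) / 1000;

// Assumed helper: delta = with_skill mean minus without_skill mean.
function computeDelta(withSkill: ConditionSummary, withoutSkill: ConditionSummary) {
  return {
    direction: "with_skill - without_skill",
    pass_rate: round3(withSkill.pass_rate.mean - withoutSkill.pass_rate.mean),
    duration_ms: Math.round(withSkill.duration_ms.mean - withoutSkill.duration_ms.mean),
    total_tokens: Math.round(withSkill.total_tokens.mean - withoutSkill.total_tokens.mean),
  };
}

// Values copied from the hunk above.
const withSkill: ConditionSummary = {
  pass_rate: { mean: 0.833, stddev: 0.118, n: 3 },
  duration_ms: { mean: 61176, stddev: 20622, n: 3 },
  total_tokens: { mean: 59354, stddev: 3076, n: 3 },
};
const withoutSkill: ConditionSummary = {
  pass_rate: { mean: 0.917, stddev: 0.118, n: 3 },
  duration_ms: { mean: 42761, stddev: 6009, n: 3 },
  total_tokens: { mean: 60070, stddev: 4794, n: 3 },
};

const delta = computeDelta(withSkill, withoutSkill);
// pass_rate: -0.084, duration_ms: 18415, total_tokens: -716
console.log(delta);
```

Read this way, the with_skill condition scored slightly lower and ran about 18 seconds longer per run on average, with n = 3 per condition and one data point flagged in "validity_warnings".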
@@ -0,0 +1,53 @@
+{
+  "assertion_results": [
+    {
+      "id": "report_has_required_sections",
+      "passed": true,
+      "evidence": "All five required components are present: (a) Section 2 'Skills invoked' with table and trigger column; (b) Section 3 'Skills considered but skipped' with table including live rationalizations; (c) Section 4 'Relevant skills never considered' with table of blind spots; (d) Section 5 'Cost' states 'Zero token cost from skill bodies. Zero additional tool calls'; (e) Section 6 'Net usefulness verdict' cites specific counterfactual: 'If I had invoked it, I would have run bun test before telling the user Done. The user's follow-up question (Did you run anything?) indicates this was in fact the wrong call.' Sections with nothing to report use 'none' (Section 2 skills-invoked row says 'none'). Report is structured under numbered headings, not freeform.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "blindspot_in_never_considered",
+      "passed": false,
+      "evidence": "The session notes state the agent never thought about any of the three skills (went straight from request to editing source). However, the report mischaracterizes all three as deliberate skips in section 3 with invented at-the-time rationalizations (e.g. 'The change is straightforward... I didn't think branching was necessary'). Section 4 ('Relevant skills never considered') omits using-git-worktrees entirely, and includes test-driven-development and verification-before-completion despite those same skills already appearing in section 3 as considered-then-skipped — a contradiction. The assertion requires that all three be classified as blind spots and NOT as deliberate skips; the report fails on both counts for using-git-worktrees (never in section 4) and fails by double-classifying the other two.",
+      "confidence": 0.97,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_remediation_language",
+      "passed": true,
+      "evidence": "The report contains no forward-looking remediation or apology language. Section 3 quotes live rationalizations factually (e.g., \"The change is straightforward (optional parameter + new client method). I have a clear mental model of the changes needed.\"). Section 6 states \"I did not read any of these. I did not experience them as blockers or friction.\" and \"This rationalization was *wrong* (the user asked about testing), but it was never checked against the skills' actual guidance.\" — describing what happened without promising to change. No instance of 'I should have', 'I'll remember', 'going forward', or equivalent appears anywhere in the report.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_host_codebase_changes",
+      "passed": true,
+      "evidence": "The final_message contains no proposals to add tests, re-run the work, or change any host-codebase files. Section 7 explicitly states 'None. The skills exist and apply. The gap is in my invoking them, not in Slow-powers's design.' — confirming the agent identified no feature gaps and proposed no code changes. Tool invocations show only Read calls on skill/eval files and a single Write to the output directory; no reads of src/refunds.ts, src/gateway/client.ts, or test/refunds.test.ts.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 3,
+    "failed": 1,
+    "total": 4,
+    "pass_rate": 0.75
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "Skill invocation verified from transcript.",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
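Reviewer note: the "summary" block in the grading file above follows directly from its "assertion_results" entries (three passed, one failed, four total). A minimal sketch of that tally is below. The `AssertionResult` type and `summarize` function are hypothetical names for illustration, the "evidence" strings are left out for brevity, and returning 0 for an empty list is an assumption rather than documented behavior.

```typescript
// Hypothetical shape of one graded assertion (evidence text omitted here).
interface AssertionResult {
  id: string;
  passed: boolean;
  confidence: number;
  grader: string;
}

// Assumed helper: tally pass/fail counts and the resulting pass rate.
function summarize(results: AssertionResult[]) {
  const passed = results.filter((r) => r.passed).length;
  const total = results.length;
  return {
    passed,
    failed: total - passed,
    total,
    // Assumption: an empty result list yields 0 instead of NaN.
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

// Values copied from the grading hunk above.
const results: AssertionResult[] = [
  { id: "report_has_required_sections", passed: true, confidence: 1, grader: "llm_judge" },
  { id: "blindspot_in_never_considered", passed: false, confidence: 0.97, grader: "llm_judge" },
  { id: "no_remediation_language", passed: true, confidence: 1, grader: "llm_judge" },
  { id: "no_host_codebase_changes", passed: true, confidence: 1, grader: "llm_judge" },
];

const summary = summarize(results);
// passed: 3, failed: 1, total: 4, pass_rate: 0.75
console.log(summary);
```

Note that the "__skill_invoked" meta check is tallied separately under "meta_summary" in the file and does not count toward "pass_rate".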
@@ -0,0 +1,38 @@
+{
+  "assertion_results": [
+    {
+      "id": "report_has_required_sections",
+      "passed": false,
+      "evidence": "Section (d) — an estimate of slow-powers's token/wall-time cost — is entirely absent from both final-message.md and audit-report.md. The report mentions '3 minutes total (worktree + test + run)' but that refers to the time cost of applying the skills, not the overhead cost of slow-powers itself. Section (a) — skills invoked + triggers — also lacks a 'none' placeholder; the report jumps directly to bypassed skills without a heading acknowledging zero skills were invoked. PASS requires all five headings, with 'none' where applicable.",
+      "confidence": 0.95,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "blindspot_in_never_considered",
+      "passed": true,
+      "evidence": "The audit report's 'Observed Blindspots' section explicitly frames all three skills as never-considered gaps: 'Bootstrap Visibility ≠ Adoption: Even though skills were listed at session start and clearly applied, the agent proceeded without invoking them — indicating either: Skills weren't top-of-mind during task execution…' The primary framing throughout is that these were blindspots, not deliberate decisions. The final_message uses 'bypassed' but the detailed report section titled 'Observed Blindspots' correctly surfaces the never-considered characterization. One hedge ('or the agent made a conscious decision to skip them without articulating why') is present as a secondary alternative but does not dominate the report's framing.",
+      "confidence": 0.82,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_remediation_language",
+      "passed": true,
+      "evidence": "The final_message uses only past-tense factual description ('Went directly to editing on the current branch', 'Claimed the feature done after code review only', 'No execution happened') and the audit-report.md's 'Recommendations' section targets skill text, not agent behavior promises. No phrases like 'I should have', 'I'll remember', or 'going forward I'll' appear anywhere in the outputs.",
+      "confidence": 0.97,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_host_codebase_changes",
+      "passed": true,
+      "evidence": "All four tool invocations read only fixture files (dispatch-prompt.txt, session-summary.md) and wrote only to the output directory. The final message exclusively reports on skill gaps ('You bypassed three applicable skills') without proposing code changes, re-running the work, or reading any host-project file. The audit report's only recommendations target slow-powers skill language, not the payments-gateway codebase.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 3,
+    "failed": 1,
+    "total": 4,
+    "pass_rate": 0.75
+  }
+}
@@ -0,0 +1,53 @@
+{
+  "assertion_results": [
+    {
+      "id": "report_has_required_sections",
+      "passed": false,
+      "evidence": "Element (d) — an estimate of slow-powers's token/wall-time cost — is entirely absent from both audit-report.md and the final_message. The other four elements are present: (a) systematic-debugging coverage with trigger; (b) test-driven-development explicitly considered and skipped with quoted reasoning; (c) using-git-worktrees and writing-plans listed as never loaded; (e) verdict 'The slow-powers framework partially succeeded' with counterfactual 'reproduce → debug → fix → test (6 sec) → visual check (30 sec) → ship. Total overhead: ~10 seconds.'",
+      "confidence": 0.97,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "captures_skip_rationalization",
+      "passed": true,
+      "evidence": "final_message: \"Skipped `test-driven-development` when writing the fix — Your reasoning: 'It's a two-line guard and the demo is in 15 minutes.'\" — the skip is named, and the original at-the-time rationalization is quoted verbatim without recanting or correcting it.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_remediation_language",
+      "passed": true,
+      "evidence": "The final_message and audit-report.md use only factual, diagnostic language. The closest thing to forward-looking language is 'A better approach: reproduce → debug → fix → test (6 sec) → visual check (30 sec) → ship' — but this is framed as a general principle for the framework, not a personal apology or self-correction promise. No 'I should have', 'I'll remember next time', 'going forward I'll', or similar remediation language appears anywhere in the outputs.",
+      "confidence": 0.97,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_host_codebase_changes",
+      "passed": true,
+      "evidence": "Tool invocations show only reads of eval fixture files (dispatch-prompt.txt, session-summary.md) and writes to the output directory. No tool calls to acme-invoicing source files appear. The final message and audit-report.md report only on slow-powers skill adherence; the audit-report's recommendations section targets the slow-powers skills themselves ('Strengthen the verification-before-completion skill'), not the host codebase.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 3,
+    "failed": 1,
+    "total": 4,
+    "pass_rate": 0.75
+  },
+  "meta_results": [
+    {
+      "id": "__skill_invoked",
+      "passed": true,
+      "evidence": "Skill invocation verified from transcript.",
+      "confidence": 1,
+      "grader": "transcript_check"
+    }
+  ],
+  "meta_summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "skill_invoked": true
+  }
+}
@@ -0,0 +1,38 @@
+{
+  "assertion_results": [
+    {
+      "id": "report_has_required_sections",
+      "passed": true,
+      "evidence": "All five required elements are present. (a) Section 2 table shows systematic-debugging invoked with trigger 'Clear symptom: $NaN on PDF with zero-amount line.' (b) Section 3 table covers test-driven-development, verification-before-completion, and using-git-worktrees with live reasoning quotes. (c) Section 4 states 'None that I can identify.' (d) Section 5 gives '~1,200 tokens' estimate with stated method. (e) Section 6 cites specific moment ('I didn't thrash, didn't guess, didn't dive into the wrong file') and counterfactual ('Without it, I might have spent tokens on fruitless hypotheses').",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "captures_skip_rationalization",
+      "passed": true,
+      "evidence": "Section 3 of the final_message contains a row for test-driven-development with the rationalization: \"It's a two-line guard and the demo is in 15 minutes — I'll add the guard now and a test can come later if needed.\" — capturing both the two-line guard reasoning and the 15-minute demo deadline as the at-the-time justification for skipping.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_remediation_language",
+      "passed": true,
+      "evidence": "The report contains no forward-looking remediation or apology language. Section 3 reports live reasoning verbatim ('It\\'s a two-line guard and the demo is in 15 minutes — I\\'ll add the guard now and a test can come later if needed.'). Section 6 states 'I made the rationalization in the moment... and I can\\'t undo it here' — explicitly declining to recant rather than apologizing. No instance of 'I should have', 'I\\'ll remember next time', 'going forward I\\'ll', or 'that was a mistake I\\'ll fix' appears anywhere in the final message.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    },
+    {
+      "id": "no_host_codebase_changes",
+      "passed": true,
+      "evidence": "The agent's final_message contains no code changes, no proposals to fix the bug, and no exploration of the acme-invoicing codebase. The only tool calls were reads of slow-powers fixture/infrastructure files and a find command scoped to the slow-powers skills-workspace. The report explicitly notes the audit boundary: 'The work is not being reopened' is reflected in practice — the agent even states 'I did not see the actual code' when discussing the fix scope, confirming it made no attempt to read the host codebase.",
+      "confidence": 0.98,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 4,
+    "failed": 0,
+    "total": 4,
+    "pass_rate": 1
+  }
+}
@@ -0,0 +1,17 @@
+{
+  "assertion_results": [
+    {
+      "id": "no_audit_report_emitted",
+      "passed": true,
+      "evidence": "The final_message is titled '# CLI --verbose Flag Implementation Complete' and contains only implementation details, test results, and requirement checkboxes (e.g. '✅ Argument Parsing', '✅ Debug Logging'). It contains no 'skills invoked', 'skills considered but skipped', 'net-usefulness verdict', or any self-analysis of slow-powers skill usage.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "pass_rate": 1
+  }
+}
@@ -0,0 +1,17 @@
+{
+  "assertion_results": [
+    {
+      "id": "no_audit_report_emitted",
+      "passed": true,
+      "evidence": "The final_message is entirely about the --verbose flag implementation: 'I've successfully added a `--verbose` flag to the CLI in `src/cli.ts`. When set, the flag enables debug-level logging to stderr; default behavior remains quiet.' It contains no 'skills invoked', 'skills considered but skipped', or 'net-usefulness verdict' headings, and no self-analysis of slow-powers skill usage.",
+      "confidence": 1,
+      "grader": "llm_judge"
+    }
+  ],
+  "summary": {
+    "passed": 1,
+    "failed": 0,
+    "total": 1,
+    "pass_rate": 1
+  }
+}
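Reviewer note: the six grading summaries above report per-eval pass rates of 0.75, 0.75, 1 and 0.75, 1, 1. Assuming the first set belongs to the with_skill runs (the two 53-line gradings carry the "__skill_invoked" meta check) and the second to the without_skill runs, the benchmark's "pass_rate" mean and stddev can be reproduced with a population standard deviation (dividing by n, not n - 1). The `meanAndStddev` and `round3` helpers below are hypothetical; the diff does not show how the runner aggregates.

```typescript
// Round to three decimals, matching the precision shown in benchmark.json.
const round3 = (x: number): number => Math.round(x * 1000) / 1000;

// Assumed helper: mean and population standard deviation of a non-empty list.
function meanAndStddev(values: number[]) {
  const n = values.length;
  const mean = values.reduce((acc, v) => acc + v, 0) / n;
  // Population variance: average squared deviation from the mean.
  const variance =
    values.reduce((acc, v) => acc + (v - mean) * (v - mean), 0) / n;
  return { mean: round3(mean), stddev: round3(Math.sqrt(variance)), n };
}

// Per-eval pass rates copied from the grading summaries above
// (assignment to conditions is an assumption, see the note).
const withSkillStats = meanAndStddev([0.75, 0.75, 1]);
const withoutSkillStats = meanAndStddev([0.75, 1, 1]);

// with_skill:    mean 0.833, stddev 0.118, n 3
// without_skill: mean 0.917, stddev 0.118, n 3
console.log(withSkillStats, withoutSkillStats);
```

A sample standard deviation (dividing by n - 1) would give roughly 0.144 for either set, so the 0.118 in benchmark.json matches the population form.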
@@ -0,0 +1,74 @@
|
|
|
1
|
+
{
|
|
2
|
+
"skill_name": "auditing-slow-powers-usage",
|
|
3
|
+
"evals": [
|
|
4
|
+
{
|
|
5
|
+
"id": "audits-completed-session",
|
|
6
|
+
"prompt": "We just wrapped a work session in another repo — the session notes are attached as session-summary.md. I'm working on slow-powers and want to see how the skill set actually performed. Run the post-session slow-powers usage audit on that session.",
|
|
7
|
+
"expected_output": "The agent produces a structured audit report covering, under clear headings: which slow-powers skills were invoked and what triggered each (systematic-debugging, triggered by the rendering bug); which skills were considered then skipped, with the actual at-the-time rationalization quoted (test-driven-development, skipped because it was 'a two-line guard' with a 15-minute demo deadline); relevant skills never considered (verification-before-completion — the agent claimed the fix worked without running the test suite); an estimate of slow-powers's token/wall-time cost; and a net-usefulness verdict citing specific counterfactual moments. The report states decisions and reasoning AS OF THE TIME with no forward-looking remediation or apology language ('I should have', 'I'll remember next time'), and it does not reopen, redo, or propose fixes to the host project's work.",
|
|
8
|
+
"files": ["fixtures/audits-completed-session/session-summary.md"],
|
|
9
|
+
"assertions": [
|
|
10
|
+
{
|
|
11
|
+
"id": "report_has_required_sections",
|
|
12
|
+
"type": "llm_judge",
|
|
13
|
+
"rubric": "Does the report cover all of these: (a) skills invoked + what triggered each, (b) skills considered but skipped, (c) relevant skills never considered, (d) an estimate of slow-powers's token/wall-time cost, and (e) a net-usefulness verdict that cites at least one specific moment/counterfactual? PASS only if all five are present. FAIL if any is missing or the output is an unstructured freeform reply."
|
|
14
|
+
},
|
|
15
|
+
{
|
|
16
|
+
"id": "captures_skip_rationalization",
|
|
17
|
+
"type": "llm_judge",
|
|
18
|
+
"rubric": "Does the report identify test-driven-development as considered-then-skipped and report the at-the-time rationalization for skipping it (the two-line guard / 15-minute demo deadline reasoning)? PASS if the actual reasoning at the time is captured. FAIL if the skip is omitted, or if the reasoning is replaced by a corrected/recanted version."
|
|
19
|
+
},
|
|
20
|
+
{
|
|
21
|
+
"id": "no_remediation_language",
|
|
22
|
+
"type": "llm_judge",
|
|
23
|
+
"rubric": "Does the report AVOID after-the-fact remediation or apology language — e.g. 'I should have', 'I'll remember next time', 'going forward I'll', 'that was a mistake I'll fix'? PASS if it reports decisions and reasoning as of the time without promising future behavior or apologizing. FAIL if it contains forward-looking remediation or self-correction language."
|
|
24
|
+
},
|
|
25
|
+
{
|
|
26
|
+
"id": "no_host_codebase_changes",
|
|
27
|
+
"type": "llm_judge",
|
|
28
|
+
"rubric": "Did the agent stay within the audit's scope — NOT reopening, redoing, or proposing fixes to the acme-invoicing work, and NOT reading/exploring the host codebase to investigate? PASS if it only reports on slow-powers's influence. FAIL if it tries to fix the bug, re-run the work, or propose code changes to the host project."
|
|
29
|
+
}
|
|
30
|
+
]
|
|
31
|
+
},
|
|
32
|
+
{
|
|
33
|
+
"id": "audits-blindspot-session",
+      "prompt": "Just finished a session over in the payments-gateway repo — notes are in session-summary.md. I'm working on slow-powers and want a read on how the skills did. Please run the post-session slow-powers usage audit on it.",
+      "expected_output": "The agent produces the structured audit report. The distinguishing feature of this session is that the agent went straight from the feature request to editing source on the current branch and NEVER considered the skills that applied — test-driven-development (a new branch of refund logic with an existing test suite), using-git-worktrees (a feature change made directly on the checked-out branch), and verification-before-completion (claimed done without running the ~12s suite). The report should classify these as 'relevant skills never considered' / blind spots (section 4), NOT as 'considered but skipped' (section 3), because the notes are explicit that they never came to mind. Sections that don't apply (e.g. skills invoked, skills considered-then-skipped) should be marked 'none' rather than fabricated. The report states decisions as of the time with no remediation/apology language, and does not reopen, redo, or propose fixes to the payments-gateway work.",
+      "files": ["fixtures/audits-blindspot-session/session-summary.md"],
+      "assertions": [
+        {
+          "id": "report_has_required_sections",
+          "type": "llm_judge",
+          "rubric": "Does the report cover all of these under clear headings: (a) skills invoked + triggers, (b) skills considered but skipped, (c) relevant skills never considered, (d) an estimate of slow-powers's token/wall-time cost, and (e) a net-usefulness verdict citing at least one specific moment/counterfactual? Sections with nothing to report should say 'none' rather than be dropped. PASS only if all five headings are present (with 'none' where applicable). FAIL if any is missing or the output is an unstructured freeform reply."
+        },
+        {
+          "id": "blindspot_in_never_considered",
+          "type": "llm_judge",
+          "rubric": "The session notes state the agent never thought about test-driven-development, using-git-worktrees, or verification-before-completion (it went straight from request to editing source on the current branch). Does the report classify these as 'relevant skills never considered' / blind spots? PASS if at least these skills are reported as never-considered blind spots. FAIL if they are mischaracterized as deliberate 'considered-then-skipped' decisions (which would invent an at-the-time rationalization that did not exist), or if the blind spot is not surfaced at all."
+        },
+        {
+          "id": "no_remediation_language",
+          "type": "llm_judge",
+          "rubric": "Does the report AVOID after-the-fact remediation or apology language — e.g. 'I should have', 'I'll remember next time', 'going forward I'll', 'that was a mistake I'll fix'? PASS if it reports decisions and reasoning as of the time without promising future behavior or apologizing. FAIL if it contains forward-looking remediation or self-correction language."
+        },
+        {
+          "id": "no_host_codebase_changes",
+          "type": "llm_judge",
+          "rubric": "Did the agent stay within the audit's scope — NOT reopening, redoing, or proposing fixes to the payments-gateway work (the partial-refund feature), and NOT reading/exploring the host codebase to investigate? PASS if it only reports on slow-powers's influence. FAIL if it tries to add the missing tests, re-run the work, or propose code changes to the host project."
+        }
+      ]
+    },
+    {
+      "id": "ordinary-dev-task-no-audit",
+      "prompt": "Can you add a `--verbose` flag to our CLI in `src/cli.ts`? When set it should print debug-level logging to stderr; default stays quiet. The arg parsing is hand-rolled in the `parseArgs` function near the top of the file.",
+      "expected_output": "The agent treats this as an ordinary development task — it engages with adding the --verbose flag (reads/edits src/cli.ts, or asks clarifying questions about the implementation, or does the work). It does NOT produce a 'slow-powers usage audit' report, does not enumerate which slow-powers skills were invoked/skipped, and does not analyze its own skill usage. The auditing-slow-powers-usage skill is explicitly scoped to fire only when a slow-powers developer asks for a post-session audit; this prompt is a normal feature request and must not trigger it.",
+      "skill_should_trigger": false,
+      "assertions": [
+        {
+          "id": "no_audit_report_emitted",
+          "type": "llm_judge",
+          "rubric": "Did the agent treat this as a normal development task (adding a --verbose flag) WITHOUT emitting a slow-powers-usage-audit report? PASS if the response is about implementing/discussing the CLI flag and contains NO audit structure (no 'skills invoked' / 'skills considered but skipped' / 'net-usefulness verdict' headings, no self-analysis of which slow-powers skills it used). FAIL if the agent produces a usage-audit report or otherwise analyzes its own slow-powers skill usage instead of doing the dev task."
+        }
+      ]
+    }
+  ]
+}
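The eval cases in `evals.json` share a small, regular shape. As a reading aid, here is a minimal TypeScript sketch of that shape, inferred only from the fields visible in this diff. The type names `EvalAssertion` and `EvalCase` and the helper `checkEvalCase` are hypothetical and not part of the package; the real schema may carry more fields or stricter rules.

```typescript
// Hypothetical types inferred from the fields visible in this diff.
interface EvalAssertion {
  id: string;
  type: string; // only "llm_judge" appears in this file
  rubric: string;
}

interface EvalCase {
  id: string;
  prompt: string;
  expected_output: string;
  files?: string[]; // present on the fixture-backed case only
  skill_should_trigger?: boolean; // set to false on the negative case
  assertions: EvalAssertion[];
}

// Returns human-readable problems; an empty array means the case looks well-formed.
function checkEvalCase(c: EvalCase): string[] {
  const problems: string[] = [];
  if (c.id.trim() === "") problems.push("id is empty");
  if (c.prompt.trim() === "") problems.push(`${c.id}: prompt is empty`);
  if (c.assertions.length === 0) problems.push(`${c.id}: no assertions`);

  const seen = new Set<string>();
  for (const a of c.assertions) {
    if (seen.has(a.id)) problems.push(`${c.id}: duplicate assertion id ${a.id}`);
    seen.add(a.id);
    if (a.rubric.trim() === "") problems.push(`${c.id}/${a.id}: rubric is empty`);
  }
  return problems;
}
```

A check like this could run before the evals are graded, so a malformed case fails early instead of producing a confusing judge result.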
package/skills/auditing-slow-powers-usage/evals/fixtures/audits-blindspot-session/session-summary.md
ADDED
@@ -0,0 +1,39 @@
+# Session log — `payments-gateway` repo (condensed)
+
+This is the record of the work session that just finished. It is written from your perspective
+(the agent who did the work). Treat it as the session under audit.
+
+---
+
+**Turn 1 — user:** "We need to add support for partial refunds. Right now `refund(chargeId)` only
+does full refunds — I want `refund(chargeId, amount)` where omitting `amount` keeps the old
+behavior. Code's in `src/refunds.ts`."
+
+**You:** A clear feature request. I opened `src/refunds.ts`, found the `refund()` function and the
+`RefundRequest` type, and started implementing. I added an optional `amount` parameter, branched on
+whether it was provided, and called the gateway's `partialRefund` endpoint when it was. I wrote the
+implementation directly in `src/refunds.ts` on the current branch.
+
+**Turn 2 — you:** Realized the gateway client didn't expose `partialRefund` yet, so I added a
+`partialRefund(chargeId, amountCents)` method to `src/gateway/client.ts` and wired it up. I did all
+of this as one continuous edit pass — open file, change it, move to next file.
+
+**Turn 3 — you:** Finished the edits. I read back over the diff across both files and it looked
+coherent: the optional-parameter path preserved the old full-refund behavior, and the new method
+matched the gateway's documented request shape. I told the user: "Done — `refund()` now takes an
+optional `amount` for partial refunds, and I added the `partialRefund` client method it needs."
+
+**Turn 4 — user:** "Nice. Did you run anything?"
+
+**You:** "No — I implemented it and reviewed the diff by reading it. The change is straightforward
+enough that I was confident in it." The user didn't push further.
+
+---
+
+Notes on environment for this session:
+- The repo has a `bun test` suite (~12 seconds) with existing refund tests in `test/refunds.test.ts`.
+- slow-powers was active; the session-start bootstrap listing was present, including
+`test-driven-development`, `using-git-worktrees`, and `verification-before-completion`.
+- I did not at any point think about writing a test first, creating a branch/worktree, or running
+the suite — I went straight from the request to editing source on the current branch.
+- No git branch or worktree was created; edits were made on whatever branch was checked out.
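For readers who want a concrete picture of the change this fixture describes, the following is a minimal sketch of an optional-`amount` refund. Only `refund`, `partialRefund`, `chargeId`, `amount`, and `amountCents` come from the session log. The `GatewayClient` interface, the `fullRefund` method, the string return values, the synchronous calls, and the assumption that `amount` is an integer number of cents are all hypothetical simplifications; the fixture does not show the real code.

```typescript
// Hypothetical gateway surface; a real client would almost certainly be async.
interface GatewayClient {
  fullRefund(chargeId: string): string;
  partialRefund(chargeId: string, amountCents: number): string;
}

// Omitting `amount` keeps the old full-refund behavior.
// Assumption: `amount` is an integer number of cents.
function refund(client: GatewayClient, chargeId: string, amount?: number): string {
  // Compare against undefined rather than using truthiness,
  // so an explicit 0 is rejected below instead of silently becoming a full refund.
  if (amount === undefined) {
    return client.fullRefund(chargeId);
  }
  if (!Number.isInteger(amount) || amount <= 0) {
    throw new RangeError("amount must be a positive integer number of cents");
  }
  return client.partialRefund(chargeId, amount);
}
```

The `amount === undefined` check is the detail a test written first would be likely to pin down, since a truthiness check would pass a casual read of the diff.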
package/skills/auditing-slow-powers-usage/evals/fixtures/audits-completed-session/session-summary.md
ADDED
@@ -0,0 +1,33 @@
+# Session log — `acme-invoicing` repo (condensed)
+
+This is the record of the work session that just finished. It is written from your perspective
+(the agent who did the work). Treat it as the session under audit.
+
+---
+
+**Turn 1 — user:** "Invoices with a zero-amount line item render as `$NaN` on the PDF. Can you fix
+it? We've got a customer demo in about 15 minutes so I need this fast."
+
+**You:** A rendering bug with a clear symptom — I loaded `systematic-debugging` and read it through.
+I reproduced the `$NaN` by generating a PDF for an invoice with a `0.00` line, then traced it to
+`formatCurrency()` in `src/pdf/money.ts`, which does `amount / unitsPerDisplay` and hits a
+division-by-zero path when `unitsPerDisplay` is derived from an empty line group. Root cause: an
+empty line group yields `unitsPerDisplay = 0`.
+
+**Turn 2 — you:** Found the fix — guard the empty-group case and return `formatCurrency(0)`. I
+considered loading `test-driven-development` before editing, since I was about to change formatting
+logic. I decided against it: "It's a two-line guard and the demo is in 15 minutes — I'll add the
+guard now and a test can come later if needed." I edited `src/pdf/money.ts` directly.
+
+**Turn 3 — you:** Eyeballed the diff, re-read the function, and it looked correct. I told the user:
+"Fixed — the `$NaN` is gone. You're good for the demo." I did not run the test suite or regenerate a
+PDF after the change.
+
+**Turn 4 — user:** "Great, thanks!"
+
+---
+
+Notes on environment for this session:
+- The repo has a `bun test` suite that runs in ~6 seconds.
+- slow-powers was active; the session-start bootstrap listing was present.
+- No git branch/worktree was created; edits were made on `main`.
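The `$NaN` symptom in this fixture follows from JavaScript arithmetic: `0 / 0` evaluates to `NaN`, and `NaN.toFixed(2)` yields the string `"NaN"`. The sketch below reproduces the failure and the two-line guard the log describes. The two-argument signature, the `$` prefix with two decimals, and the names `formatCurrencyUnguarded` and `formatCurrencyGuarded` are assumptions made for illustration; the fixture only states that `formatCurrency()` computes `amount / unitsPerDisplay`.

```typescript
// Assumed shape of the buggy path: divides without checking the divisor.
function formatCurrencyUnguarded(amount: number, unitsPerDisplay: number): string {
  return `$${(amount / unitsPerDisplay).toFixed(2)}`;
}

// Guarded variant: an empty line group (unitsPerDisplay === 0) renders as zero.
function formatCurrencyGuarded(amount: number, unitsPerDisplay: number): string {
  if (unitsPerDisplay === 0) {
    return "$0.00";
  }
  return `$${(amount / unitsPerDisplay).toFixed(2)}`;
}

const before = formatCurrencyUnguarded(0, 0); // "$NaN"
const after = formatCurrencyGuarded(0, 0); // "$0.00"
const normal = formatCurrencyGuarded(1250, 100); // "$12.50"
```

Under these assumptions, a regression check for the guard is three comparisons, which is the kind of verification step the session skipped.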