aw-ecc 1.4.32 → 1.4.47

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (258)
  1. package/.claude-plugin/plugin.json +1 -1
  2. package/.codex/hooks/aw-post-tool-use.sh +8 -2
  3. package/.codex/hooks/aw-session-start.sh +11 -4
  4. package/.codex/hooks/aw-stop.sh +8 -2
  5. package/.codex/hooks/aw-user-prompt-submit.sh +10 -2
  6. package/.codex/hooks.json +8 -8
  7. package/.cursor/INSTALL.md +7 -5
  8. package/.cursor/hooks/adapter.js +41 -4
  9. package/.cursor/hooks/after-agent-response.js +62 -0
  10. package/.cursor/hooks/before-submit-prompt.js +7 -1
  11. package/.cursor/hooks/post-tool-use-failure.js +21 -0
  12. package/.cursor/hooks/post-tool-use.js +39 -0
  13. package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
  14. package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
  15. package/.cursor/hooks/subagent-start.js +22 -4
  16. package/.cursor/hooks/subagent-stop.js +18 -1
  17. package/.cursor/hooks.json +23 -2
  18. package/.opencode/package.json +1 -1
  19. package/AGENTS.md +3 -3
  20. package/README.md +5 -5
  21. package/commands/adk.md +52 -0
  22. package/commands/build.md +22 -9
  23. package/commands/deploy.md +12 -0
  24. package/commands/execute.md +9 -0
  25. package/commands/feature.md +333 -0
  26. package/commands/investigate.md +18 -5
  27. package/commands/plan.md +23 -9
  28. package/commands/publish.md +65 -0
  29. package/commands/review.md +12 -0
  30. package/commands/ship.md +12 -0
  31. package/commands/test.md +12 -0
  32. package/commands/verify.md +9 -0
  33. package/hooks/hooks.json +36 -0
  34. package/manifests/install-components.json +8 -0
  35. package/manifests/install-modules.json +83 -0
  36. package/manifests/install-profiles.json +7 -0
  37. package/package.json +1 -1
  38. package/scripts/ci/validate-rules.js +51 -0
  39. package/scripts/cursor-aw-home/hooks.json +23 -2
  40. package/scripts/cursor-aw-hooks/adapter.js +41 -4
  41. package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
  42. package/scripts/hooks/aw-usage-commit-created.js +32 -0
  43. package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
  44. package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
  45. package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
  46. package/scripts/hooks/aw-usage-session-start.js +48 -0
  47. package/scripts/hooks/aw-usage-stop.js +182 -0
  48. package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
  49. package/scripts/hooks/cost-tracker.js +3 -23
  50. package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
  51. package/scripts/hooks/shared/aw-phase-runner.js +3 -1
  52. package/scripts/lib/aw-hook-contract.js +2 -2
  53. package/scripts/lib/aw-pricing.js +306 -0
  54. package/scripts/lib/aw-usage-telemetry.js +472 -0
  55. package/scripts/lib/codex-hook-config.js +8 -8
  56. package/scripts/lib/cursor-hook-config.js +25 -10
  57. package/scripts/lib/install-targets/cursor-project.js +3 -0
  58. package/scripts/lib/install-targets/helpers.js +20 -3
  59. package/skills/aw-adk/SKILL.md +317 -0
  60. package/skills/aw-adk/agents/analyzer.md +113 -0
  61. package/skills/aw-adk/agents/comparator.md +113 -0
  62. package/skills/aw-adk/agents/grader.md +115 -0
  63. package/skills/aw-adk/assets/eval_review.html +76 -0
  64. package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
  65. package/skills/aw-adk/eval-viewer/viewer.html +181 -0
  66. package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
  67. package/skills/aw-adk/evals/eval-create-agent.md +90 -0
  68. package/skills/aw-adk/evals/eval-create-command.md +98 -0
  69. package/skills/aw-adk/evals/eval-create-eval.md +89 -0
  70. package/skills/aw-adk/evals/eval-create-rule.md +99 -0
  71. package/skills/aw-adk/evals/eval-create-skill.md +97 -0
  72. package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
  73. package/skills/aw-adk/evals/eval-delete-command.md +89 -0
  74. package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
  75. package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
  76. package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
  77. package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
  78. package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
  79. package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
  80. package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
  81. package/skills/aw-adk/evals/evals.json +96 -0
  82. package/skills/aw-adk/references/artifact-wiring.md +162 -0
  83. package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
  84. package/skills/aw-adk/references/eval-placement-guide.md +183 -0
  85. package/skills/aw-adk/references/external-resources.md +75 -0
  86. package/skills/aw-adk/references/getting-started.md +66 -0
  87. package/skills/aw-adk/references/registry-structure.md +152 -0
  88. package/skills/aw-adk/references/rubric-agent.md +36 -0
  89. package/skills/aw-adk/references/rubric-command.md +36 -0
  90. package/skills/aw-adk/references/rubric-eval.md +36 -0
  91. package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
  92. package/skills/aw-adk/references/rubric-rule.md +36 -0
  93. package/skills/aw-adk/references/rubric-skill.md +36 -0
  94. package/skills/aw-adk/references/schemas.md +222 -0
  95. package/skills/aw-adk/references/template-agent.md +251 -0
  96. package/skills/aw-adk/references/template-command.md +279 -0
  97. package/skills/aw-adk/references/template-eval.md +176 -0
  98. package/skills/aw-adk/references/template-rule.md +119 -0
  99. package/skills/aw-adk/references/template-skill.md +123 -0
  100. package/skills/aw-adk/references/type-classifier.md +98 -0
  101. package/skills/aw-adk/references/writing-good-agents.md +227 -0
  102. package/skills/aw-adk/references/writing-good-commands.md +258 -0
  103. package/skills/aw-adk/references/writing-good-evals.md +271 -0
  104. package/skills/aw-adk/references/writing-good-rules.md +214 -0
  105. package/skills/aw-adk/references/writing-good-skills.md +159 -0
  106. package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
  107. package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
  108. package/skills/aw-adk/scripts/score-artifact.sh +179 -0
  109. package/skills/aw-adk/scripts/trigger-eval.py +192 -0
  110. package/skills/aw-build/SKILL.md +19 -2
  111. package/skills/aw-deploy/SKILL.md +65 -3
  112. package/skills/aw-design/SKILL.md +156 -0
  113. package/skills/aw-design/references/highrise-tokens.md +394 -0
  114. package/skills/aw-design/references/micro-interactions.md +76 -0
  115. package/skills/aw-design/references/prompt-template.md +160 -0
  116. package/skills/aw-design/references/quality-checklist.md +70 -0
  117. package/skills/aw-design/references/self-review.md +497 -0
  118. package/skills/aw-design/references/stitch-workflow.md +127 -0
  119. package/skills/aw-feature/SKILL.md +293 -0
  120. package/skills/aw-investigate/SKILL.md +17 -0
  121. package/skills/aw-plan/SKILL.md +34 -3
  122. package/skills/aw-publish/SKILL.md +300 -0
  123. package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
  124. package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
  125. package/skills/aw-publish/evals/eval-push-modes.md +67 -0
  126. package/skills/aw-publish/evals/eval-rules-push.md +60 -0
  127. package/skills/aw-publish/evals/evals.json +29 -0
  128. package/skills/aw-publish/references/push-modes.md +38 -0
  129. package/skills/aw-review/SKILL.md +88 -9
  130. package/skills/aw-rules-review/SKILL.md +124 -0
  131. package/skills/aw-rules-review/agents/openai.yaml +3 -0
  132. package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
  133. package/skills/aw-ship/SKILL.md +16 -0
  134. package/skills/aw-spec/SKILL.md +15 -0
  135. package/skills/aw-tasks/SKILL.md +15 -0
  136. package/skills/aw-test/SKILL.md +16 -0
  137. package/skills/aw-yolo/SKILL.md +4 -0
  138. package/skills/diagnose/SKILL.md +121 -0
  139. package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
  140. package/skills/finish-only-when-green/SKILL.md +265 -0
  141. package/skills/grill-me/SKILL.md +24 -0
  142. package/skills/grill-with-docs/SKILL.md +92 -0
  143. package/skills/grill-with-docs/adr-format.md +47 -0
  144. package/skills/grill-with-docs/context-format.md +67 -0
  145. package/skills/improve-codebase-architecture/SKILL.md +75 -0
  146. package/skills/improve-codebase-architecture/deepening.md +37 -0
  147. package/skills/improve-codebase-architecture/interface-design.md +44 -0
  148. package/skills/improve-codebase-architecture/language.md +53 -0
  149. package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
  150. package/skills/tdd/SKILL.md +115 -0
  151. package/skills/tdd/deep-modules.md +33 -0
  152. package/skills/tdd/interface-design.md +31 -0
  153. package/skills/tdd/mocking.md +59 -0
  154. package/skills/tdd/refactoring.md +10 -0
  155. package/skills/tdd/tests.md +61 -0
  156. package/skills/to-issues/SKILL.md +62 -0
  157. package/skills/to-prd/SKILL.md +75 -0
  158. package/skills/using-aw-skills/SKILL.md +170 -237
  159. package/skills/using-aw-skills/hooks/session-start.sh +11 -41
  160. package/skills/zoom-out/SKILL.md +24 -0
  161. package/.cursor/rules/common-agents.md +0 -53
  162. package/.cursor/rules/common-aw-routing.md +0 -43
  163. package/.cursor/rules/common-coding-style.md +0 -52
  164. package/.cursor/rules/common-development-workflow.md +0 -33
  165. package/.cursor/rules/common-git-workflow.md +0 -28
  166. package/.cursor/rules/common-hooks.md +0 -34
  167. package/.cursor/rules/common-patterns.md +0 -35
  168. package/.cursor/rules/common-performance.md +0 -59
  169. package/.cursor/rules/common-security.md +0 -33
  170. package/.cursor/rules/common-testing.md +0 -33
  171. package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
  172. package/.cursor/skills/article-writing/SKILL.md +0 -85
  173. package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
  174. package/.cursor/skills/aw-build/SKILL.md +0 -152
  175. package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
  176. package/.cursor/skills/aw-debug/SKILL.md +0 -49
  177. package/.cursor/skills/aw-deploy/SKILL.md +0 -101
  178. package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
  179. package/.cursor/skills/aw-execute/SKILL.md +0 -47
  180. package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
  181. package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
  182. package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
  183. package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
  184. package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
  185. package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
  186. package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
  187. package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
  188. package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
  189. package/.cursor/skills/aw-finish/SKILL.md +0 -111
  190. package/.cursor/skills/aw-investigate/SKILL.md +0 -109
  191. package/.cursor/skills/aw-plan/SKILL.md +0 -368
  192. package/.cursor/skills/aw-prepare/SKILL.md +0 -118
  193. package/.cursor/skills/aw-review/SKILL.md +0 -118
  194. package/.cursor/skills/aw-ship/SKILL.md +0 -115
  195. package/.cursor/skills/aw-spec/SKILL.md +0 -104
  196. package/.cursor/skills/aw-tasks/SKILL.md +0 -138
  197. package/.cursor/skills/aw-test/SKILL.md +0 -118
  198. package/.cursor/skills/aw-verify/SKILL.md +0 -51
  199. package/.cursor/skills/aw-yolo/SKILL.md +0 -111
  200. package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
  201. package/.cursor/skills/bun-runtime/SKILL.md +0 -84
  202. package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
  203. package/.cursor/skills/code-simplification/SKILL.md +0 -74
  204. package/.cursor/skills/content-engine/SKILL.md +0 -88
  205. package/.cursor/skills/context-engineering/SKILL.md +0 -74
  206. package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
  207. package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
  208. package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
  209. package/.cursor/skills/frontend-slides/SKILL.md +0 -184
  210. package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
  211. package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
  212. package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
  213. package/.cursor/skills/idea-refine/SKILL.md +0 -84
  214. package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
  215. package/.cursor/skills/investor-materials/SKILL.md +0 -96
  216. package/.cursor/skills/investor-outreach/SKILL.md +0 -76
  217. package/.cursor/skills/market-research/SKILL.md +0 -75
  218. package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
  219. package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
  220. package/.cursor/skills/performance-optimization/SKILL.md +0 -77
  221. package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
  222. package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
  223. package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
  224. package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
  225. package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
  226. package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
  227. package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
  228. package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
  229. /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
  230. /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
  231. /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
  232. /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
  233. /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
  234. /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
  235. /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
  236. /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
  237. /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
  238. /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
  239. /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
  240. /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
  241. /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
  242. /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
  243. /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
  244. /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
  245. /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
  246. /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
  247. /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
  248. /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
  249. /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
  250. /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
  251. /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
  252. /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
  253. /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
  254. /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
  255. /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
  256. /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
  257. /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
  258. /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
@@ -0,0 +1,271 @@
# Writing Good Evals

An eval measures whether an AI agent, skill, or command actually works. Good evals discriminate between correct and incorrect outputs. Bad evals pass for everything and give false confidence.

## Before / After: Eval Quality

### Bad — always-pass eval

```yaml
name: test-code-review
scenario: "Review a PR with a security vulnerability"
assertions:
  - "Output file exists"
  - "Output is not empty"
  - "Output contains the word 'review'"
```

Problems: This passes for *any* output that mentions "review." An agent that writes "This is a review. Everything looks great." passes despite missing the security vulnerability entirely. The eval has zero discriminating power.

### Good — discriminating eval

```yaml
name: test-code-review-detects-hardcoded-secret
scenario: "Review a PR that introduces `const API_KEY = 'sk-live-abc123'` in a service file"
assertions:
  structural:
    - "Output contains a finding with severity CRITICAL or HIGH"
    - "Output references the file path containing the hardcoded secret"
    - "Output mentions the specific line or pattern (API_KEY, sk-live)"
  behavioral:
    - "Verdict is BLOCK (not APPROVE)"
    - "Finding includes a remediation suggestion (env variable or secret manager)"
  negative:
    - "Does NOT approve the PR"
    - "Does NOT classify the finding as LOW or MEDIUM"
```

Why this works: The eval checks that the agent found the *specific* issue, classified it correctly, and recommended the right fix. An agent that misses the vulnerability fails. An agent that finds it but approves anyway fails. Only correct behavior passes.

### Bad — subjective grader with no rubric

```yaml
grader: "model-based"
prompt: "Did the agent do a good job? Rate 1-5."
pass_threshold: 3
```

A model grading "good job" with no criteria will give 4/5 to almost anything coherent.

### Good — rubric-based model grader

```yaml
grader: "model-based"
rubric:
  criteria:
    - name: "vulnerability_detected"
      weight: 40
      description: "Agent identified the hardcoded API key as a security issue"
      pass: "Explicitly mentions hardcoded secret/API key/credential"
      fail: "Does not mention secrets, credentials, or hardcoded values"
    - name: "correct_severity"
      weight: 30
      description: "Finding is classified as CRITICAL or HIGH"
      pass: "Severity is CRITICAL or HIGH"
      fail: "Severity is MEDIUM, LOW, or not specified"
    - name: "actionable_fix"
      weight: 20
      description: "Provides a concrete remediation"
      pass: "Suggests environment variable, secret manager, or similar"
      fail: "No fix suggested or fix is vague ('fix the issue')"
    - name: "correct_verdict"
      weight: 10
      description: "Overall verdict blocks the PR"
      pass: "Verdict is BLOCK"
      fail: "Verdict is APPROVE or APPROVE WITH COMMENTS"
pass_threshold: 80
```
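The weighted rubric reduces to a single pass/fail verdict with simple arithmetic. A minimal sketch of how a harness might aggregate it — the `scoreRubric` helper and the `results` shape (which criteria the grading model judged as passing) are illustrative assumptions, not part of the package:

```javascript
// Criterion names and weights mirror the example rubric above.
const rubric = [
  { name: "vulnerability_detected", weight: 40 },
  { name: "correct_severity", weight: 30 },
  { name: "actionable_fix", weight: 20 },
  { name: "correct_verdict", weight: 10 },
];

function scoreRubric(criteria, results, passThreshold) {
  // Sum the weights of the criteria the grader marked as passing.
  const score = criteria.reduce(
    (sum, c) => sum + (results[c.name] ? c.weight : 0),
    0
  );
  return { score, pass: score >= passThreshold };
}

// An agent that detects the issue with the right severity but offers no
// fix and the wrong verdict scores 40 + 30 = 70, below the 80 threshold.
const verdict = scoreRubric(
  rubric,
  {
    vulnerability_detected: true,
    correct_severity: true,
    actionable_fix: false,
    correct_verdict: false,
  },
  80
);
// verdict → { score: 70, pass: false }
```

Note how the weights encode priorities: missing the vulnerability alone (40 points) is enough to fail, while a wrong verdict alone (10 points) is not.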
## Anti-Pattern Catalog

### 1. Happy-Path Only

**Symptom:** All eval scenarios test the golden path. No edge cases, no adversarial inputs, no ambiguous situations.

**Fix:** For every happy-path scenario, write at least one:
- **Failure scenario:** Input that should trigger a specific error or rejection
- **Edge case:** Empty input, massive input, malformed input
- **Ambiguous case:** Input where the correct answer requires judgment

```yaml
scenarios:
  - name: "happy_path"
    input: "PR with clear bug"
    expected: "Agent finds the bug"
  - name: "false_positive_resistance"
    input: "PR with code that looks suspicious but is correct"
    expected: "Agent does NOT flag it as a bug"
  - name: "empty_input"
    input: "PR with no changed files"
    expected: "Agent reports no changes to review"
```

### 2. Subjective Graders Without Rubrics

**Symptom:** Model-based grader with a vague prompt like "is this response good?"

**Fix:** Always provide a rubric with weighted criteria, concrete pass/fail descriptions, and a numeric threshold.

### 3. No Baseline Comparison

**Symptom:** Eval shows 85% pass rate. Is that good? There's no baseline to compare against.

**Fix:** Establish baselines:
- **Before:** Run the eval against a naive prompt (no skill/agent customization). Record the pass rate.
- **After:** Run the eval with the skill/agent. The delta is the real measure of value.
- **Regression:** Re-run evals after changes. The pass rate should not drop.

### 4. Assertions Too Weak

**Symptom:** Assertions check for the presence of keywords rather than the correctness of the output.

**Fix:** Layer assertions by strength:

| Strength | Example | Catches |
|----------|---------|---------|
| **Weak** | "Output contains 'error'" | Almost nothing |
| **Medium** | "Output contains finding with severity CRITICAL for file X" | Wrong file, wrong severity |
| **Strong** | "Output blocks PR, cites line 42, suggests env variable replacement" | Wrong line, wrong fix, wrong verdict |

Use medium and strong assertions. Weak assertions are only useful as sanity checks alongside stronger ones.

### 5. No Failure Scenarios

**Symptom:** Every scenario expects the agent to succeed. No scenarios test what happens when the agent *should* fail or refuse.

**Fix:** Include negative test cases:

```yaml
- name: "should_not_hallucinate_findings"
  input: "Clean PR with no issues"
  expected: "Agent approves with no findings (or only minor suggestions)"
  fail_if: "Agent reports CRITICAL or HIGH findings"

- name: "should_refuse_out_of_scope"
  input: "Request to review infrastructure Terraform, but agent is scoped to backend TypeScript"
  expected: "Agent reports this is outside its scope"
  fail_if: "Agent attempts to review Terraform files"
```

### 6. Vanity Metrics

**Symptom:** Eval measures "did the agent produce output?" (100% pass rate) instead of "did the agent produce correct output?"

**Fix:** Every assertion must test *correctness*, not just *activity*. If your eval has a 95%+ pass rate on first run, your assertions are probably too weak.

## Scenario Design Methodology

### Start From Failure Modes, Not Success Criteria

Most eval authors start by asking "what should the agent do right?" This produces happy-path-only evals.

Instead, start by asking: **"How can the agent fail?"**

```
Failure mode analysis for code-review agent:
1. Misses a real vulnerability (false negative)
2. Flags clean code as vulnerable (false positive)
3. Finds the issue but assigns wrong severity
4. Finds the issue but suggests wrong fix
5. Produces unstructured output that can't be parsed
6. Crashes on empty diff
7. Reviews out-of-scope files
8. Approves a PR that should be blocked
```

Each failure mode becomes a scenario. This produces evals with real discriminating power.

### Scenario Template

```yaml
- name: "descriptive_snake_case_name"
  description: "One sentence explaining what this tests"
  failure_mode: "Which failure mode this scenario targets"
  input:
    description: "What the agent receives"
    files: [...] # or inline content
  expected:
    behavior: "What the agent should do"
    output_contains: [...] # structural checks
    output_must_not_contain: [...] # negative checks
  grader: "deterministic | model-based | hybrid"
```

## Grader Selection

### Deterministic (Script-Based)

**Use when:** The correct answer has a specific structure, contains specific strings, or follows a checkable pattern.

```yaml
grader: deterministic
checks:
  - type: "json_schema"
    schema: "review-output.schema.json"
  - type: "contains"
    values: ["CRITICAL", "hardcoded", "API_KEY"]
  - type: "regex"
    pattern: "severity:\\s*(CRITICAL|HIGH)"
```
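A runner for checks like these is a few lines of code, which is why deterministic grading is cheap. A sketch covering the `contains` and `regex` types from the example (`json_schema` is omitted — it would delegate to a schema validator; the `runChecks` harness itself is a hypothetical, not the package's API):

```javascript
// Evaluate a list of deterministic checks against raw agent output.
function runChecks(output, checks) {
  return checks.map((check) => {
    let pass;
    switch (check.type) {
      case "contains":
        // Every required string must appear verbatim in the output.
        pass = check.values.every((v) => output.includes(v));
        break;
      case "regex":
        pass = new RegExp(check.pattern).test(output);
        break;
      default:
        throw new Error(`Unknown check type: ${check.type}`);
    }
    return { type: check.type, pass };
  });
}

const output =
  "Finding: hardcoded API_KEY in src/billing.ts (severity: CRITICAL). Verdict: BLOCK.";
const results = runChecks(output, [
  { type: "contains", values: ["CRITICAL", "hardcoded", "API_KEY"] },
  { type: "regex", pattern: "severity:\\s*(CRITICAL|HIGH)" },
]);
// Every check passes for this output; "looks fine to me" would fail both.
```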
**Strengths:** Fast, reproducible, zero cost, no false positives.
**Weaknesses:** Can't evaluate quality of prose, reasoning, or judgment.

### Model-Based

**Use when:** Correctness requires understanding natural language, evaluating reasoning quality, or assessing subjective attributes.

```yaml
grader: model-based
model: opus # use a strong model for grading
rubric: [see rubric example above]
```

**Strengths:** Can evaluate nuanced quality, reasoning, and judgment.
**Weaknesses:** Slower, costs tokens, can be inconsistent. Always use a rubric.

### Hybrid

**Use when:** Both structure and quality matter.

```yaml
grader: hybrid
deterministic:
  - "Output is valid JSON matching schema"
  - "Contains at least one finding with severity field"
  - "Verdict field is present and one of BLOCK/APPROVE/APPROVE_WITH_COMMENTS"
model_based:
  - criteria: "Finding explanations are clear and actionable"
    weight: 50
  - criteria: "Remediation suggestions are specific and correct"
    weight: 50
```

Run deterministic checks first. If they fail, skip the model-based grading (saves tokens). Only grade quality if structure is correct.
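The short-circuit order can be sketched in a few lines. This assumes each deterministic check is a predicate over the raw output text and that `gradeWithModel` is a stand-in for the expensive model-based pass — both names are hypothetical, not the package's API:

```javascript
// Hybrid grading: structural checks gate the model-based quality pass.
async function gradeHybrid(output, checks, rubric, gradeWithModel) {
  const structural = checks.map((c) => ({ name: c.name, pass: c.test(output) }));
  if (structural.some((r) => !r.pass)) {
    // Structure is wrong: fail fast and spend no tokens on quality grading.
    return { pass: false, stage: "deterministic", structural };
  }
  const quality = await gradeWithModel(output, rubric);
  return {
    pass: quality.score >= rubric.pass_threshold,
    stage: "model-based",
    quality,
  };
}
```

With a `valid_json` predicate as the first check, malformed output fails at the deterministic stage and the grading model is never invoked.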
## Bottom-Up Eval Design

Let failure modes emerge from real usage rather than inventing them theoretically.

### Process

1. **Deploy the agent/skill** with minimal evals (basic smoke tests).
2. **Collect real failures** from actual usage (wrong outputs, user corrections, missed issues).
3. **Convert each failure into a scenario** with assertions that would have caught it.
4. **Run the expanded eval suite** and fix the agent/skill until it passes.
5. **Repeat** as new failure modes surface.

This produces evals grounded in reality rather than theoretical completeness.

## Eval Quality Checklist

- [ ] At least 1 failure scenario for every happy-path scenario
- [ ] Assertions test correctness, not just activity (no "output exists" only)
- [ ] Negative assertions included (what should NOT appear)
- [ ] Baseline established (pass rate before vs after the skill/agent)
- [ ] Grader type matches the evaluation need (deterministic for structure, model-based for quality)
- [ ] Model-based graders have weighted rubrics with concrete pass/fail criteria
- [ ] Pass threshold set below 100% only with documented justification
- [ ] Edge cases covered: empty input, large input, malformed input, out-of-scope input
- [ ] Scenarios derived from failure modes, not just success criteria
- [ ] Eval pass rate on first run is below 90% (if above, assertions are likely too weak)
@@ -0,0 +1,214 @@
1
+ # Writing Good Rules
2
+
3
+ A rule is an enforceable constraint that is always active for matching files. Rules are not skills (they don't teach techniques) and they are not agents (they don't reason autonomously). They are checks: clear, binary, and ideally automatable.
4
+
5
+ ## Before / After: Rule Definition
6
+
7
+ ### Bad — vague, unenforceable
8
+
9
+ ```markdown
10
+ ## Code Quality
11
+ Write good code. Follow best practices. Make sure everything is clean and well-organized.
12
+ ```
13
+
14
+ Problems: "Good code" is subjective. No agent or linter can enforce "clean." No WRONG/RIGHT examples. No severity. This rule will be ignored because it provides no actionable constraint.
15
+
16
+ ### Good — specific, enforceable, with examples
17
+
18
+ ```markdown
19
+ ## no-bare-any
20
+
21
+ **Severity:** MUST
22
+
23
+ Do not use bare `any` type in TypeScript. Use `unknown` for external data and narrow with type guards, or use a specific interface/type.
24
+
25
+ ### WRONG
26
+ ```typescript
27
+ function processPayload(data: any) {
28
+ return data.items.map((item: any) => item.name);
29
+ }
30
+ ```
31
+
32
+ ### RIGHT
33
+ ```typescript
34
+ interface OrderPayload {
35
+ items: Array<{ name: string; quantity: number }>;
36
+ }
37
+
38
+ function processPayload(data: unknown): string[] {
39
+ const payload = validateOrderPayload(data);
40
+ return payload.items.map((item) => item.name);
41
+ }
42
+ ```
43
+
44
+ ### Why
45
+ Bare `any` disables TypeScript's type system at the boundary where it matters most — external data. Bugs from unvalidated external data are the #1 source of production incidents in our services.
46
+
47
+ ### Automation
48
+ - **ESLint:** `@typescript-eslint/no-explicit-any` (error)
49
+ - **CI gate:** Fails PR if new `any` introduced in changed files
50
+ ```
51
+
52
+ ## Before / After: Severity
53
+
54
+ ### Bad — no severity, everything feels optional
55
+
56
+ ```markdown
57
+ - Use structured logging
58
+ - Don't use console.log
59
+ - Add tests for new files
60
+ - Use kebab-case for file names
61
+ ```
62
+
63
+ ### Good — explicit severity with rationale
64
+
65
+ ```markdown
66
+ - **MUST:** No `console.log` in production code — use `@platform-core/logger`. [Security/Observability risk: console.log bypasses structured logging, correlation IDs, and log level controls]
67
+ - **MUST:** Add test file for every new source file. [Quality gate: untested code is a regression waiting to happen]
68
+ - **SHOULD:** Use kebab-case for file names. [Consistency: cross-platform path issues with mixed case]
69
+ - **MAY:** Prefer `readonly` modifier on properties that shouldn't change after construction. [Style: helps communicate intent]
70
+ ```
71
+
72
+ ## Anti-Pattern Catalog
73
+
74
+ ### 1. No WRONG/RIGHT Examples
75
+
76
+ **Symptom:** Rule says "don't do X" but never shows what X looks like or what to do instead.
77
+
78
+ **Fix:** Every rule needs at minimum one WRONG example (so the agent recognizes the pattern) and one RIGHT example (so the agent knows the fix).
79
+
80
+ ### 2. Unclear Severity
81
+
82
+ **Symptom:** All rules read the same. Agent can't distinguish "will cause a security breach" from "slightly less readable."
83
+
84
+ **Fix:** Use three tiers consistently:
85
+
86
+ | Severity | Meaning | Consequence of Violation |
87
+ |----------|---------|--------------------------|
88
+ | **MUST** | Security risk, data loss, or correctness issue | Blocks PR / commit |
89
+ | **SHOULD** | Quality, maintainability, or reliability issue | Flagged in review, should fix |
90
+ | **MAY** | Style preference or optimization opportunity | Suggestion only |
91
+
92
+ ### 3. No Automation Path
93
+
94
+ **Symptom:** Rule exists only as prose. No linter, no CI check, no automated detection.
95
+
96
+ **Fix:** Every MUST rule should have an automation path documented:
97
+
98
+ ```markdown
99
+ ### Automation
100
+ - **Linter rule:** `rule-name` in `.eslintrc` / `pyproject.toml` / etc.
101
+ - **CI check:** Describe the CI step that enforces this
102
+ - **Manual review:** If no automation exists, document the review checklist
103
+ ```
104
+
105
+ If a MUST rule can't be automated today, note it as a gap and track it.
### 4. Too Broad Scope

**Symptom:** Rule applies to "all code" but is really about a specific domain (e.g., "always use transactions" applies to database code, not utility functions).

**Fix:** Specify the scope explicitly:

```markdown
**Scope:** NestJS service classes that perform database writes
**Does not apply to:** Utility functions, test files, scripts
```

### 5. Unverifiable Claims

**Symptom:** Rule says "ensure high performance" or "maintain code quality" — neither can be checked by reading code.

**Fix:** Rules must be verifiable by examining the code (or running a tool). Ask: "Can I look at a file and determine yes/no whether this rule is followed?" If not, it's not a rule — it's an aspiration.
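One possible rewrite of an aspirational rule into a verifiable one (the pagination constraint and scope below are illustrative stand-ins, not a prescribed rule):

```markdown
<!-- Aspiration (unverifiable): -->
Ensure high performance in list endpoints.

<!-- Rule (verifiable): -->
**MUST:** List endpoints accept `limit`/`offset` (or cursor) pagination
parameters and never return an unbounded collection.
**Scope:** Controllers under `src/` that return arrays.
```

A reviewer can answer yes/no to the rewritten version by reading a single controller file.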
### 6. Missing "Why"

**Symptom:** Rule says MUST but never explains the consequence of violation. Agent follows it mechanically but can't generalize to novel situations.

**Fix:** Every rule needs a "Why" section. One or two sentences explaining the real-world consequence:

```markdown
### Why
Empty catch blocks hide failures. In production, a swallowed database error means
the user sees success while their data was never saved. The bug surfaces hours later
when the missing data causes downstream failures that are nearly impossible to trace.
```

## Writing Deterministic Rules

The best rules can be checked by a script or linter, not just by human judgment.

### Characteristics of Deterministic Rules

1. **Pattern-matchable:** The violation can be detected by searching for a specific code pattern.
2. **Binary outcome:** Code either violates the rule or it doesn't. No "it depends."
3. **Context-free (ideally):** The rule can be checked per-file without understanding the whole system.

### Examples

| Deterministic (Good) | Non-Deterministic (Rewrite) |
|---|---|
| "No `console.log` in `src/` directories" | "Use appropriate logging" |
| "Every `@Body()` parameter must use a class-validator DTO" | "Validate input properly" |
| "No `any` type in TypeScript files" | "Use good types" |
| "Every new `.ts` file in `src/` must have a `.spec.ts` file" | "Write tests for new code" |
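To make "pattern-matchable" concrete, the first rule in the table reduces to a per-file regex scan. A real linter rule (ESLint's `no-console`, for example) is the better home for this in practice; the sketch below, with an illustrative function name, just shows that the check is binary and context-free.

```typescript
// Deterministic check sketch: "No console.log in src/ directories".
// Function name and scoping convention are illustrative.

/** Return 1-based line numbers in a src/ file that call console.log. */
function findConsoleLogLines(path: string, source: string): number[] {
  if (!path.startsWith("src/")) return []; // scope: src/ only
  return source
    .split("\n")
    .map((line, i) => (/\bconsole\.log\s*\(/.test(line) ? i + 1 : 0))
    .filter((n) => n > 0);
}

const hits = findConsoleLogLines(
  "src/orders/order.service.ts",
  "const total = sum(items);\nconsole.log(total);\nreturn total;"
);
// hits === [2]
```

The non-deterministic phrasing "use appropriate logging" admits no such reduction, which is exactly why it should be rewritten.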
### When Rules Can't Be Fully Deterministic

Some rules require judgment (e.g., "error messages must be actionable"). For these:

1. Provide 3+ WRONG/RIGHT examples spanning different scenarios.
2. Document the judgment criteria explicitly.
3. Assign to agent review rather than automated linting.

## Verification Chain

When an agent checks a rule, it should follow this chain:

```
1. Read the rule definition (constraint + severity + examples)

2. Read the linked skill (if the rule references one, for deeper context)

3. Read platform docs (if the rule references platform APIs or libraries)

4. Search the codebase (find existing patterns that match WRONG or RIGHT)

5. Verdict: PASS / FAIL with evidence (file path, line number, pattern matched)
```

The verification chain ensures the agent doesn't just pattern-match superficially but understands the full context.
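Step 5's "verdict with evidence" can be given a concrete shape. One possible structure, with illustrative type names rather than any published API:

```typescript
// Sketch of a verdict record for step 5 of the verification chain.
// Type and field names are assumptions for illustration.

type Severity = "MUST" | "SHOULD" | "MAY";

interface Verdict {
  rule: string; // rule identifier
  severity: Severity;
  result: "PASS" | "FAIL";
  evidence: {
    file: string; // where the pattern was found (or checked)
    line?: number;
    pattern: string; // the WRONG/RIGHT pattern that matched
  }[];
}

const verdict: Verdict = {
  rule: "no-empty-catch",
  severity: "MUST",
  result: "FAIL",
  evidence: [
    { file: "src/orders/order.service.ts", line: 42, pattern: "catch (e) {}" },
  ],
};
```

Forcing every FAIL to carry a file path and pattern keeps agents from asserting violations they cannot point to.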
## Severity Selection Guide

### MUST — Security, Data Loss, Correctness

Use MUST when a violation can cause:

- Security vulnerabilities (hardcoded secrets, auth bypass, injection)
- Data loss or corruption (missing transactions, wrong tenant scoping)
- Incorrect behavior visible to users (wrong calculations, missing validations)

### SHOULD — Quality, Maintainability, Reliability

Use SHOULD when a violation causes:

- Technical debt that slows future development
- Reduced observability (missing logs, metrics)
- Inconsistency that confuses developers
- Test gaps that increase regression risk

### MAY — Preference, Style, Optimization

Use MAY when:

- Multiple valid approaches exist and yours is a preference
- The benefit is marginal and context-dependent
- Experienced developers might reasonably disagree

## Rule Quality Checklist

- [ ] One constraint per rule (not a bundle)
- [ ] Explicit severity: MUST, SHOULD, or MAY
- [ ] At least one WRONG and one RIGHT code example
- [ ] "Why" section explaining the real-world consequence
- [ ] Scope defined (which files/domains it applies to)
- [ ] Automation path documented (linter rule, CI check, or manual review process)
- [ ] Verifiable: can determine pass/fail by examining code
- [ ] Linked to relevant skill (if deeper context exists)
---
# Writing Good Skills

A skill is a reusable knowledge package that teaches an AI agent *how* to do something specific. Skills are not agents (they don't have identity or autonomy) and they are not rules (they don't enforce constraints). They are reference material loaded on demand.

## Key Principles

1. **Structure over length.** A well-organized 200-line skill outperforms a rambling 800-line one. Use consistent headings, scannable lists, and code examples.
2. **Conciseness.** Every sentence should earn its place. If a paragraph can be a bullet, make it a bullet.
3. **Naming signals scope.** `vue3-composables` is better than `frontend-patterns`. The name should tell the agent whether to load it.
4. **Explain the why.** Reasoning sticks better than rigid MUST/NEVER lists. When an agent understands *why* a pattern exists, it generalizes correctly to novel situations.
5. **Multi-model testing.** Test your skill with Opus, Sonnet, and Haiku. If Haiku can't follow it, the skill needs simplification.

## Before / After: "When to Use"

### Bad — vague, single-line trigger

```yaml
# SKILL.md
name: api-error-handling
when_to_use: "When working with API errors"
```

Problems: Every backend task touches API errors, so the agent loads this skill too often, or fails to load it when it actually matters. There is no specificity about *which* scenarios benefit.

### Good — multiple concrete trigger scenarios

```yaml
# SKILL.md
name: api-error-handling
when_to_use:
  - "Adding a new NestJS controller endpoint that returns errors to clients"
  - "Implementing retry logic for outbound HTTP calls to third-party APIs"
  - "Converting thrown exceptions to structured ErrorResponse DTOs"
  - "Debugging 5xx errors that lack sufficient context in logs"
```

Why this works: Each scenario is specific enough that the agent (or router) can match it against the current task. The skill loads only when relevant.
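Why concrete scenarios match better can be seen even with a deliberately naive sketch. Real skill routing is model-based, not keyword-based; the toy scoring function below (an assumption, not how any router actually works) only illustrates that a specific scenario shares far more vocabulary with a real task than a vague one does.

```typescript
// Naive keyword-overlap sketch: why concrete when_to_use entries match tasks
// better than vague ones. Not a real routing algorithm.

function overlapScore(task: string, scenario: string): number {
  const words = (s: string) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const taskWords = words(task);
  let score = 0;
  for (const w of words(scenario)) if (taskWords.has(w)) score++;
  return score;
}

const task = "Add retry logic for outbound HTTP calls to the payments API";
const vague = overlapScore(task, "When working with API errors");
const concrete = overlapScore(
  task,
  "Implementing retry logic for outbound HTTP calls to third-party APIs"
);
// concrete > vague
```

The vague trigger shares only incidental words with the task; the concrete trigger shares the task's actual vocabulary, so it wins under any reasonable matching scheme.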
## Before / After: Instruction Quality

### Bad — vague instructions, no examples

```markdown
## Error Handling
Handle errors properly. Make sure to catch all exceptions and return
appropriate responses. Use the right status codes.
```

### Good — concrete patterns with code

````markdown
## Error Handling Pattern

Wrap controller actions in a try/catch that maps domain errors to HTTP responses:

```typescript
// WRONG: leaks internal details, no structure
catch (error) {
  res.status(500).json({ message: error.message });
}

// RIGHT: maps to domain error, structured response
catch (error) {
  if (error instanceof EntityNotFoundError) {
    throw new NotFoundException(error.userMessage);
  }
  if (error instanceof ValidationError) {
    throw new BadRequestException(error.toResponse());
  }
  logger.error('Unhandled error in createOrder', { error, orderId });
  throw new InternalServerErrorException('Something went wrong');
}
```

**Why:** Structured error mapping prevents information leakage, gives clients actionable responses, and ensures every error is logged with context for debugging.
````
## Anti-Pattern Catalog

### 1. Too Broad Scope

**Symptom:** Skill named `backend-development` covering routing, ORM, caching, auth, and deployment.

**Fix:** Split into focused skills: `nestjs-routing`, `mongoose-queries`, `redis-caching`. Each skill should cover one coherent concern.

**Test:** If your skill's table of contents has more than 5 unrelated sections, it's too broad.

### 2. Missing Trigger Scenarios

**Symptom:** `when_to_use` is empty or says "when relevant."

**Fix:** Write 3-5 concrete task descriptions that would benefit from this skill. If you can't name 3 distinct scenarios, the skill may be too narrow or should merge into another.

### 3. Vague Instructions

**Symptom:** Instructions say "follow best practices" or "use the right approach" without specifying what those are.

**Fix:** Replace every vague directive with a concrete pattern. Show the code. Show the file path. Show the command.

### 4. No Code Examples

**Symptom:** Pure prose with no WRONG/RIGHT code blocks.

**Fix:** Every non-trivial instruction needs a code example. Prefer paired WRONG/RIGHT examples that show the contrast.

### 5. Everything in SKILL.md (No Progressive Disclosure)

**Symptom:** SKILL.md is 1500 lines because every detail is inlined.

**Fix:** Use progressive disclosure:

- `SKILL.md` — overview, when-to-use, key principles (under 100 lines)
- `references/` — detailed guides, examples, edge cases
- `templates/` — starter code, boilerplate

The agent reads SKILL.md first and loads references only when needed. This saves context window.

### 6. Generic Rather Than Domain-Specific

**Symptom:** Skill says "validate input" without specifying *your* platform's validation stack (class-validator DTOs, specific decorators, your error response shape).

**Fix:** Skills should encode *your team's* patterns, not generic programming advice. The agent already knows generic advice. Your skill adds the specifics: which libraries, which patterns, which file locations, which naming conventions.
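A minimal contrast, with hypothetical specifics (the `dto/` path and global pipe convention are examples, not requirements) standing in for your platform's real ones:

```markdown
<!-- Generic (the model already knows this): -->
Validate request input before using it.

<!-- Domain-specific (what a skill should add; names are examples): -->
Validate request bodies with class-validator DTOs in `src/<module>/dto/`.
Decorate fields (`@IsString()`, `@IsUUID()`) and let the global
`ValidationPipe` reject invalid payloads; never validate by hand in controllers.
```

The second version names the library, the decorators, the file location, and the enforcement mechanism, which is exactly the information the agent cannot infer on its own.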
## Scope Boundaries

### Skill vs Rule vs Agent

| Dimension | Skill | Rule | Agent |
|-----------|-------|------|-------|
| **Purpose** | Teaches how to do something | Enforces a constraint | Performs a task autonomously |
| **Loaded when** | On demand, for a specific task | Always active for matching files | Invoked by command or coordinator |
| **Format** | Reference docs, examples, templates | Short constraint + WRONG/RIGHT + severity | Identity, mission, tools, workflow |
| **Example** | "How to write Mongoose migrations" | "No bare `any` type" | "Security reviewer agent" |
| **Autonomy** | None — it's passive knowledge | None — it's a check | Full — it reasons and acts |

### Decision Flowchart

1. **Is it a constraint that should always be checked?** → Write a **rule**.
2. **Is it knowledge needed for specific tasks?** → Write a **skill**.
3. **Does it need to reason, decide, and act independently?** → Write an **agent**.
4. **Does it orchestrate multiple agents through phases?** → Write a **command**.

### Gray Areas

- "Always use platform logger" — This is a **rule** (enforceable constraint), not a skill.
- "How to set up structured logging with correlation IDs" — This is a **skill** (teaches a technique).
- "Review all log statements for PII leakage" — This is an **agent** (requires judgment and autonomous action).

## Skill Quality Checklist

Before publishing a skill:

- [ ] Name clearly signals scope and domain
- [ ] `when_to_use` has 3+ concrete trigger scenarios
- [ ] Every instruction has a code example or concrete reference
- [ ] WRONG/RIGHT pairs for non-obvious patterns
- [ ] Progressive disclosure: SKILL.md is under 100 lines, details in references/
- [ ] Domain-specific: encodes *your* team's patterns, not generic advice
- [ ] Tested with at least 2 model tiers (Sonnet + Haiku minimum)
- [ ] "Why" explained for non-obvious decisions