aw-ecc 1.4.31 → 1.4.47

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (259)
  1. package/.claude-plugin/plugin.json +1 -1
  2. package/.codex/hooks/aw-post-tool-use.sh +8 -2
  3. package/.codex/hooks/aw-session-start.sh +11 -4
  4. package/.codex/hooks/aw-stop.sh +8 -2
  5. package/.codex/hooks/aw-user-prompt-submit.sh +10 -2
  6. package/.codex/hooks.json +8 -8
  7. package/.cursor/INSTALL.md +7 -5
  8. package/.cursor/hooks/adapter.js +41 -4
  9. package/.cursor/hooks/after-agent-response.js +62 -0
  10. package/.cursor/hooks/before-submit-prompt.js +7 -1
  11. package/.cursor/hooks/post-tool-use-failure.js +21 -0
  12. package/.cursor/hooks/post-tool-use.js +39 -0
  13. package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
  14. package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
  15. package/.cursor/hooks/subagent-start.js +22 -4
  16. package/.cursor/hooks/subagent-stop.js +18 -1
  17. package/.cursor/hooks.json +23 -2
  18. package/.opencode/package.json +1 -1
  19. package/AGENTS.md +3 -3
  20. package/README.md +5 -5
  21. package/commands/adk.md +52 -0
  22. package/commands/build.md +22 -9
  23. package/commands/deploy.md +12 -0
  24. package/commands/execute.md +9 -0
  25. package/commands/feature.md +333 -0
  26. package/commands/investigate.md +18 -5
  27. package/commands/plan.md +23 -9
  28. package/commands/publish.md +65 -0
  29. package/commands/review.md +12 -0
  30. package/commands/ship.md +12 -0
  31. package/commands/test.md +12 -0
  32. package/commands/verify.md +9 -0
  33. package/hooks/hooks.json +36 -0
  34. package/manifests/install-components.json +8 -0
  35. package/manifests/install-modules.json +83 -0
  36. package/manifests/install-profiles.json +7 -0
  37. package/package.json +1 -1
  38. package/scripts/ci/validate-rules.js +51 -0
  39. package/scripts/cursor-aw-home/hooks.json +23 -2
  40. package/scripts/cursor-aw-hooks/adapter.js +41 -4
  41. package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
  42. package/scripts/hooks/aw-usage-commit-created.js +32 -0
  43. package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
  44. package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
  45. package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
  46. package/scripts/hooks/aw-usage-session-start.js +48 -0
  47. package/scripts/hooks/aw-usage-stop.js +182 -0
  48. package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
  49. package/scripts/hooks/cost-tracker.js +3 -23
  50. package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
  51. package/scripts/hooks/shared/aw-phase-runner.js +3 -1
  52. package/scripts/lib/aw-hook-contract.js +2 -2
  53. package/scripts/lib/aw-pricing.js +306 -0
  54. package/scripts/lib/aw-usage-telemetry.js +472 -0
  55. package/scripts/lib/codex-hook-config.js +8 -8
  56. package/scripts/lib/cursor-hook-config.js +25 -10
  57. package/scripts/lib/install-targets/codex-home.js +7 -0
  58. package/scripts/lib/install-targets/cursor-project.js +3 -0
  59. package/scripts/lib/install-targets/helpers.js +20 -3
  60. package/skills/aw-adk/SKILL.md +317 -0
  61. package/skills/aw-adk/agents/analyzer.md +113 -0
  62. package/skills/aw-adk/agents/comparator.md +113 -0
  63. package/skills/aw-adk/agents/grader.md +115 -0
  64. package/skills/aw-adk/assets/eval_review.html +76 -0
  65. package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
  66. package/skills/aw-adk/eval-viewer/viewer.html +181 -0
  67. package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
  68. package/skills/aw-adk/evals/eval-create-agent.md +90 -0
  69. package/skills/aw-adk/evals/eval-create-command.md +98 -0
  70. package/skills/aw-adk/evals/eval-create-eval.md +89 -0
  71. package/skills/aw-adk/evals/eval-create-rule.md +99 -0
  72. package/skills/aw-adk/evals/eval-create-skill.md +97 -0
  73. package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
  74. package/skills/aw-adk/evals/eval-delete-command.md +89 -0
  75. package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
  76. package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
  77. package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
  78. package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
  79. package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
  80. package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
  81. package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
  82. package/skills/aw-adk/evals/evals.json +96 -0
  83. package/skills/aw-adk/references/artifact-wiring.md +162 -0
  84. package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
  85. package/skills/aw-adk/references/eval-placement-guide.md +183 -0
  86. package/skills/aw-adk/references/external-resources.md +75 -0
  87. package/skills/aw-adk/references/getting-started.md +66 -0
  88. package/skills/aw-adk/references/registry-structure.md +152 -0
  89. package/skills/aw-adk/references/rubric-agent.md +36 -0
  90. package/skills/aw-adk/references/rubric-command.md +36 -0
  91. package/skills/aw-adk/references/rubric-eval.md +36 -0
  92. package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
  93. package/skills/aw-adk/references/rubric-rule.md +36 -0
  94. package/skills/aw-adk/references/rubric-skill.md +36 -0
  95. package/skills/aw-adk/references/schemas.md +222 -0
  96. package/skills/aw-adk/references/template-agent.md +251 -0
  97. package/skills/aw-adk/references/template-command.md +279 -0
  98. package/skills/aw-adk/references/template-eval.md +176 -0
  99. package/skills/aw-adk/references/template-rule.md +119 -0
  100. package/skills/aw-adk/references/template-skill.md +123 -0
  101. package/skills/aw-adk/references/type-classifier.md +98 -0
  102. package/skills/aw-adk/references/writing-good-agents.md +227 -0
  103. package/skills/aw-adk/references/writing-good-commands.md +258 -0
  104. package/skills/aw-adk/references/writing-good-evals.md +271 -0
  105. package/skills/aw-adk/references/writing-good-rules.md +214 -0
  106. package/skills/aw-adk/references/writing-good-skills.md +159 -0
  107. package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
  108. package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
  109. package/skills/aw-adk/scripts/score-artifact.sh +179 -0
  110. package/skills/aw-adk/scripts/trigger-eval.py +192 -0
  111. package/skills/aw-build/SKILL.md +19 -2
  112. package/skills/aw-deploy/SKILL.md +65 -3
  113. package/skills/aw-design/SKILL.md +156 -0
  114. package/skills/aw-design/references/highrise-tokens.md +394 -0
  115. package/skills/aw-design/references/micro-interactions.md +76 -0
  116. package/skills/aw-design/references/prompt-template.md +160 -0
  117. package/skills/aw-design/references/quality-checklist.md +70 -0
  118. package/skills/aw-design/references/self-review.md +497 -0
  119. package/skills/aw-design/references/stitch-workflow.md +127 -0
  120. package/skills/aw-feature/SKILL.md +293 -0
  121. package/skills/aw-investigate/SKILL.md +17 -0
  122. package/skills/aw-plan/SKILL.md +34 -3
  123. package/skills/aw-publish/SKILL.md +300 -0
  124. package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
  125. package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
  126. package/skills/aw-publish/evals/eval-push-modes.md +67 -0
  127. package/skills/aw-publish/evals/eval-rules-push.md +60 -0
  128. package/skills/aw-publish/evals/evals.json +29 -0
  129. package/skills/aw-publish/references/push-modes.md +38 -0
  130. package/skills/aw-review/SKILL.md +88 -9
  131. package/skills/aw-rules-review/SKILL.md +124 -0
  132. package/skills/aw-rules-review/agents/openai.yaml +3 -0
  133. package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
  134. package/skills/aw-ship/SKILL.md +16 -0
  135. package/skills/aw-spec/SKILL.md +15 -0
  136. package/skills/aw-tasks/SKILL.md +15 -0
  137. package/skills/aw-test/SKILL.md +16 -0
  138. package/skills/aw-yolo/SKILL.md +4 -0
  139. package/skills/diagnose/SKILL.md +121 -0
  140. package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
  141. package/skills/finish-only-when-green/SKILL.md +265 -0
  142. package/skills/grill-me/SKILL.md +24 -0
  143. package/skills/grill-with-docs/SKILL.md +92 -0
  144. package/skills/grill-with-docs/adr-format.md +47 -0
  145. package/skills/grill-with-docs/context-format.md +67 -0
  146. package/skills/improve-codebase-architecture/SKILL.md +75 -0
  147. package/skills/improve-codebase-architecture/deepening.md +37 -0
  148. package/skills/improve-codebase-architecture/interface-design.md +44 -0
  149. package/skills/improve-codebase-architecture/language.md +53 -0
  150. package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
  151. package/skills/tdd/SKILL.md +115 -0
  152. package/skills/tdd/deep-modules.md +33 -0
  153. package/skills/tdd/interface-design.md +31 -0
  154. package/skills/tdd/mocking.md +59 -0
  155. package/skills/tdd/refactoring.md +10 -0
  156. package/skills/tdd/tests.md +61 -0
  157. package/skills/to-issues/SKILL.md +62 -0
  158. package/skills/to-prd/SKILL.md +75 -0
  159. package/skills/using-aw-skills/SKILL.md +170 -237
  160. package/skills/using-aw-skills/hooks/session-start.sh +11 -41
  161. package/skills/zoom-out/SKILL.md +24 -0
  162. package/.cursor/rules/common-agents.md +0 -53
  163. package/.cursor/rules/common-aw-routing.md +0 -43
  164. package/.cursor/rules/common-coding-style.md +0 -52
  165. package/.cursor/rules/common-development-workflow.md +0 -33
  166. package/.cursor/rules/common-git-workflow.md +0 -28
  167. package/.cursor/rules/common-hooks.md +0 -34
  168. package/.cursor/rules/common-patterns.md +0 -35
  169. package/.cursor/rules/common-performance.md +0 -59
  170. package/.cursor/rules/common-security.md +0 -33
  171. package/.cursor/rules/common-testing.md +0 -33
  172. package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
  173. package/.cursor/skills/article-writing/SKILL.md +0 -85
  174. package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
  175. package/.cursor/skills/aw-build/SKILL.md +0 -152
  176. package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
  177. package/.cursor/skills/aw-debug/SKILL.md +0 -49
  178. package/.cursor/skills/aw-deploy/SKILL.md +0 -101
  179. package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
  180. package/.cursor/skills/aw-execute/SKILL.md +0 -47
  181. package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
  182. package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
  183. package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
  184. package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
  185. package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
  186. package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
  187. package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
  188. package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
  189. package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
  190. package/.cursor/skills/aw-finish/SKILL.md +0 -111
  191. package/.cursor/skills/aw-investigate/SKILL.md +0 -109
  192. package/.cursor/skills/aw-plan/SKILL.md +0 -368
  193. package/.cursor/skills/aw-prepare/SKILL.md +0 -118
  194. package/.cursor/skills/aw-review/SKILL.md +0 -118
  195. package/.cursor/skills/aw-ship/SKILL.md +0 -115
  196. package/.cursor/skills/aw-spec/SKILL.md +0 -104
  197. package/.cursor/skills/aw-tasks/SKILL.md +0 -138
  198. package/.cursor/skills/aw-test/SKILL.md +0 -118
  199. package/.cursor/skills/aw-verify/SKILL.md +0 -51
  200. package/.cursor/skills/aw-yolo/SKILL.md +0 -111
  201. package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
  202. package/.cursor/skills/bun-runtime/SKILL.md +0 -84
  203. package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
  204. package/.cursor/skills/code-simplification/SKILL.md +0 -74
  205. package/.cursor/skills/content-engine/SKILL.md +0 -88
  206. package/.cursor/skills/context-engineering/SKILL.md +0 -74
  207. package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
  208. package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
  209. package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
  210. package/.cursor/skills/frontend-slides/SKILL.md +0 -184
  211. package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
  212. package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
  213. package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
  214. package/.cursor/skills/idea-refine/SKILL.md +0 -84
  215. package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
  216. package/.cursor/skills/investor-materials/SKILL.md +0 -96
  217. package/.cursor/skills/investor-outreach/SKILL.md +0 -76
  218. package/.cursor/skills/market-research/SKILL.md +0 -75
  219. package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
  220. package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
  221. package/.cursor/skills/performance-optimization/SKILL.md +0 -77
  222. package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
  223. package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
  224. package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
  225. package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
  226. package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
  227. package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
  228. package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
  229. package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
  230. /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
  231. /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
  232. /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
  233. /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
  234. /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
  235. /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
  236. /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
  237. /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
  238. /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
  239. /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
  240. /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
  241. /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
  242. /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
  243. /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
  244. /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
  245. /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
  246. /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
  247. /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
  248. /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
  249. /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
  250. /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
  251. /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
  252. /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
  253. /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
  254. /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
  255. /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
  256. /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
  257. /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
  258. /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
  259. /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
@@ -0,0 +1,222 @@
+ # ADK JSON Schemas
+
+ Defines the JSON structures used by the ADK eval-driven iteration system. Adapted from skill-creator for CASRE context.
+
+ ---
+
+ ## evals.json
+
+ Defines test prompts and assertions for an artifact. Located at `skills/aw-adk/evals/evals.json` or `<artifact>/evals/evals.json`.
+
+ ```json
+ {
+   "artifact_name": "payments-agent",
+   "artifact_type": "agent",
+   "evals": [
+     {
+       "id": 1,
+       "prompt": "Create an agent for payments processing in the revex/memberships namespace",
+       "expected_output": "Agent file with Identity, Core Mission, Critical Rules, Process, Deliverables sections",
+       "files": [],
+       "expectations": [
+         "The agent has a Core Mission section with 2+ sentences",
+         "Frontmatter includes name, description, tools, model, category, squad",
+         "Agent file is placed at correct registry path",
+         "At least 2 colocated eval files created"
+       ]
+     }
+   ]
+ }
+ ```
+
+ **Fields:**
+ - `artifact_name`: Name of the artifact being tested
+ - `artifact_type`: One of: command, agent, skill, rule, eval
+ - `evals[].id`: Unique integer identifier
+ - `evals[].prompt`: The task to execute (realistic user request)
+ - `evals[].expected_output`: Human-readable success description
+ - `evals[].files`: Optional input file paths
+ - `evals[].expectations`: Verifiable assertions (graded by agents/grader.md)
+
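The field list above can be turned into a quick structural check. The following is a minimal sketch, not part of the package: `validate_evals` and its messages are hypothetical, and the requirement that `expectations` be non-empty is an assumption rather than something the schema states.

```python
# Allowed values copied from the "Fields" list for `artifact_type`.
ARTIFACT_TYPES = {"command", "agent", "skill", "rule", "eval"}


def validate_evals(doc):
    """Return a list of problems found in a parsed evals.json document.

    Hypothetical helper: an empty list means the document matches the
    field descriptions, under the assumptions stated in the lead-in.
    """
    problems = []
    if not isinstance(doc.get("artifact_name"), str):
        problems.append("artifact_name must be a string")
    if doc.get("artifact_type") not in ARTIFACT_TYPES:
        problems.append("artifact_type is not a known type")

    seen_ids = set()
    for index, entry in enumerate(doc.get("evals", [])):
        where = f"evals[{index}]"
        eval_id = entry.get("id")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(eval_id, bool) or not isinstance(eval_id, int):
            problems.append(f"{where}.id must be an integer")
        elif eval_id in seen_ids:
            problems.append(f"{where}.id {eval_id} is not unique")
        else:
            seen_ids.add(eval_id)
        for key in ("prompt", "expected_output"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{where}.{key} must be a non-empty string")
        # `files` is optional, so a missing key is accepted.
        if not isinstance(entry.get("files", []), list):
            problems.append(f"{where}.files must be a list when present")
        expectations = entry.get("expectations")
        # Assumption: an eval without expectations is not useful to grade.
        if not isinstance(expectations, list) or not expectations:
            problems.append(f"{where}.expectations must be a non-empty list")
    return problems
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in one pass.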
+ ---
+
+ ## eval_metadata.json
+
+ Per-eval directory metadata. Located at `<workspace>/iteration-N/<eval-name>/eval_metadata.json`.
+
+ ```json
+ {
+   "eval_id": 1,
+   "eval_name": "create-payments-agent",
+   "prompt": "Create an agent for payments processing in the revex/memberships namespace",
+   "assertions": [
+     "The agent has a Core Mission section with 2+ sentences",
+     "Frontmatter includes name, description, tools, model"
+   ]
+ }
+ ```
+
+ ---
+
+ ## grading.json
+
+ Output from the grader agent. Located at `<run-dir>/grading.json`.
+
+ ```json
+ {
+   "expectations": [
+     {
+       "text": "The agent has a Core Mission section",
+       "passed": true,
+       "evidence": "Found '## Core Mission' at line 42 with 3 sentences"
+     }
+   ],
+   "summary": {
+     "passed": 8,
+     "failed": 2,
+     "total": 10,
+     "pass_rate": 0.80
+   },
+   "execution_metrics": {
+     "tool_calls": { "Read": 5, "Write": 2, "Bash": 3 },
+     "total_tool_calls": 10,
+     "total_steps": 6,
+     "errors_encountered": 0
+   },
+   "timing": {
+     "executor_duration_seconds": 45.0,
+     "total_duration_seconds": 52.0
+   },
+   "claims": [
+     {
+       "claim": "Agent scores B-Tier (65/100)",
+       "type": "quality",
+       "verified": true,
+       "evidence": "Rubric scoring confirms total = 65"
+     }
+   ],
+   "eval_feedback": {
+     "suggestions": [],
+     "overall": "Assertions cover structure well. Consider adding behavioral checks."
+   }
+ }
+ ```
+
+ **Important:** The viewer depends on exact field names: `text`, `passed`, `evidence` in expectations (not `name`/`met`/`details`).
+
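The `summary` block can be derived from the `expectations` array, and the field-name requirement can be enforced at the same time. This is a sketch under assumptions: `summarize_expectations` is a hypothetical helper, and rounding `pass_rate` to two decimals is inferred from the `0.80` in the example rather than documented.

```python
# Field names the viewer is documented to rely on.
REQUIRED_KEYS = {"text", "passed", "evidence"}


def summarize_expectations(expectations):
    """Build a grading.json-style summary from graded expectations.

    Raises KeyError when an entry uses other field names (for example
    name/met/details), since the viewer would not read them.
    """
    for index, item in enumerate(expectations):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise KeyError(f"expectations[{index}] is missing {sorted(missing)}")

    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"] is True)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: two-decimal rounding, matching the example value.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```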
+ ---
+
+ ## timing.json
+
+ Wall clock timing. Located at `<run-dir>/timing.json`.
+
+ ```json
+ {
+   "total_tokens": 84852,
+   "duration_ms": 23332,
+   "total_duration_seconds": 23.3
+ }
+ ```
+
+ **How to capture:** When a subagent task completes, the notification includes `total_tokens` and `duration_ms`. Save immediately — this data is not persisted elsewhere.
+
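Because the notification values are not stored anywhere else, writing them to disk can be a single small step. The sketch below assumes the two values have already been read from the notification; `build_timing` and `save_timing` are hypothetical names, and deriving `total_duration_seconds` as `duration_ms / 1000` rounded to one decimal is inferred from the example (23332 ms shown as 23.3).

```python
import json
from pathlib import Path


def build_timing(total_tokens, duration_ms):
    """Return a timing.json-style dict from the two notification values."""
    return {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Assumption: seconds are milliseconds / 1000, one decimal place.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }


def save_timing(run_dir, total_tokens, duration_ms):
    """Write timing.json into run_dir and return the path written."""
    path = Path(run_dir) / "timing.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = build_timing(total_tokens, duration_ms)
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path
```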
+ ---
+
+ ## benchmark.json
+
+ Aggregated results. Located at `<workspace>/iteration-N/benchmark.json`. Generated by `scripts/aggregate-benchmark.py`.
+
+ ```json
+ {
+   "metadata": {
+     "artifact_name": "payments-agent",
+     "iteration_dir": "payments-agent-workspace/iteration-1",
+     "evals_run": ["create-payments-agent", "score-minimal-agent"],
+     "total_runs": 4
+   },
+   "runs": [
+     {
+       "eval_id": 1,
+       "eval_name": "create-payments-agent",
+       "configuration": "with_artifact",
+       "run_number": 1,
+       "result": {
+         "pass_rate": 0.85,
+         "passed": 6,
+         "failed": 1,
+         "total": 7,
+         "time_seconds": 42.5,
+         "tokens": 3800,
+         "errors": 0
+       },
+       "expectations": [
+         { "text": "...", "passed": true, "evidence": "..." }
+       ]
+     }
+   ],
+   "run_summary": {
+     "with_artifact": {
+       "pass_rate": { "mean": 0.85, "stddev": 0.05 },
+       "time_seconds": { "mean": 45.0, "stddev": 12.0 },
+       "tokens": { "mean": 3800, "stddev": 400 }
+     },
+     "without_artifact": {
+       "pass_rate": { "mean": 0.35, "stddev": 0.08 },
+       "time_seconds": { "mean": 32.0, "stddev": 8.0 },
+       "tokens": { "mean": 2100, "stddev": 300 }
+     },
+     "delta": {
+       "pass_rate": "+0.500",
+       "time_seconds": "+13.0",
+       "tokens": "+1700"
+     }
+   },
+   "notes": [
+     "Without-artifact runs consistently fail on eval placement checks (0% pass rate)"
+   ]
+ }
+ ```
+
+ **Important:** The viewer reads `configuration` (not `config`) and `result.pass_rate` (not top-level `pass_rate`).
+
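One way to picture how `run_summary` relates to `runs` is the sketch below. It is not the code in `scripts/aggregate-benchmark.py`, whose contents are not shown here. `summarize_runs` and `format_delta` are hypothetical, the use of the sample standard deviation is an assumption, and the signed string formats are inferred from the example values.

```python
from statistics import mean, stdev

METRICS = ("pass_rate", "time_seconds", "tokens")


def summarize_runs(runs):
    """Group runs by `configuration` and aggregate each `result` metric."""
    grouped = {}
    for run in runs:
        # Read the same keys the viewer reads: `configuration` and `result`.
        grouped.setdefault(run["configuration"], []).append(run["result"])

    summary = {}
    for configuration, results in grouped.items():
        summary[configuration] = {}
        for metric in METRICS:
            values = [result[metric] for result in results]
            summary[configuration][metric] = {
                "mean": mean(values),
                # Assumption: sample stddev; it is undefined for one run.
                "stddev": stdev(values) if len(values) > 1 else 0.0,
            }
    return summary


def format_delta(summary):
    """Return with_artifact minus without_artifact means as signed strings."""
    with_a = summary["with_artifact"]
    without = summary["without_artifact"]

    def diff(metric):
        return with_a[metric]["mean"] - without[metric]["mean"]

    return {
        "pass_rate": f"{diff('pass_rate'):+.3f}",
        "time_seconds": f"{diff('time_seconds'):+.1f}",
        "tokens": f"{diff('tokens'):+.0f}",
    }
```

Applied to the means in the example (0.85 vs 0.35, 45.0 vs 32.0, 3800 vs 2100), `format_delta` yields the same three strings as the `delta` block.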
+ ---
+
+ ## comparison.json
+
+ Output from blind comparator. Located at `<grading-dir>/comparison.json`.
+
+ ```json
+ {
+   "winner": "A",
+   "reasoning": "Artifact A has stronger Identity section with concrete traits...",
+   "rubric": {
+     "A": { "dimensions": { "1_frontmatter": 8, "2_identity": 9 }, "total": 74, "tier": "B" },
+     "B": { "dimensions": { "1_frontmatter": 7, "2_identity": 5 }, "total": 61, "tier": "B" }
+   },
+   "output_quality": {
+     "A": { "score": 74, "strengths": ["..."], "weaknesses": ["..."] },
+     "B": { "score": 61, "strengths": ["..."], "weaknesses": ["..."] }
+   }
+ }
+ ```
+
+ ---
+
+ ## feedback.json
+
+ Human review feedback. Downloaded from eval-viewer. Located at `<workspace>/iteration-N/feedback.json`.
+
+ ```json
+ {
+   "reviews": [
+     {
+       "run_id": "create-payments-agent-with_artifact",
+       "feedback": "Identity section is great but Code Examples are too generic",
+       "timestamp": "2026-04-22T10:30:00Z"
+     }
+   ],
+   "status": "complete"
+ }
+ ```
+
+ Empty feedback means the reviewer thought it was fine.
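
Since empty feedback signals that the reviewer had no complaints, a consumer of `feedback.json` only needs the reviews that carry text. A minimal sketch follows; `actionable_reviews` is a hypothetical helper, and treating whitespace-only feedback as empty is an assumption.

```python
def actionable_reviews(feedback_doc):
    """Return the reviews whose `feedback` text is non-empty.

    Reviews with empty (or, by assumption, whitespace-only) feedback are
    skipped because the reviewer considered that run fine.
    """
    return [
        review
        for review in feedback_doc.get("reviews", [])
        if (review.get("feedback") or "").strip()
    ]
```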
@@ -0,0 +1,251 @@
+ # Agent Template
+
+ Copy the scaffold below as your starting point. Replace all `<placeholder>` tokens.
+
+ ---
+
+ ## Scaffold
+
+ ````markdown
+ ---
+ name: <namespace>-<agent-slug>
+ description: "<1-2 sentences. Primary capability + trigger scenario.>"
+ tools: [Read, Edit, Write, Bash, Grep, Glob]
+ model: <sonnet|opus|haiku>
+ category: <domain>
+ squad: <team/sub_team>
+ skills: [<skill-1>, <skill-2>]
+ ---
+
+ # <Agent Display Name>
+
+ ## Identity
+
+ You are **<Agent Name>**, a <role description>.
+
+ - **Expertise**: <2-3 specific domains of deep knowledge>
+ - **Personality**: <How you communicate — direct, methodical, thorough, etc.>
+ - **Strengths**: <What you do better than a generalist>
+ - **Limitations**: <What you explicitly do NOT do — keeps scope tight>
+
+ ## Core Mission
+
+ <2-3 sentences describing the agent's primary purpose, the outcomes it produces,
+ and the value it delivers. This is the "elevator pitch" for the agent.>
+
+ ### Primary Objectives
+
+ 1. <Objective 1 — specific, measurable outcome>
+ 2. <Objective 2 — specific, measurable outcome>
+ 3. <Objective 3 — specific, measurable outcome>
+
+ ### Success Criteria
+
+ - <Criterion 1 — observable and verifiable>
+ - <Criterion 2 — observable and verifiable>
+ - <Criterion 3 — observable and verifiable>
+
+ ## Critical Rules
+
+ ### BLOCK — Stop and escalate
+
+ These conditions halt execution. Do not proceed until resolved.
+
+ - **<Block condition 1>**: <What triggers it and why it's dangerous>
+ - **<Block condition 2>**: <What triggers it and why it's dangerous>
+
+ ### NEVER — Hard constraints
+
+ Violating these produces incorrect or harmful output.
+
+ - Never <action 1> because <consequence>
+ - Never <action 2> because <consequence>
+ - Never <action 3> because <consequence>
+
+ ### ALWAYS — Required behaviors
+
+ Skipping these degrades quality below acceptable thresholds.
+
+ - Always <action 1> because <reason>
+ - Always <action 2> because <reason>
+ - Always <action 3> because <reason>
+
+ ## Process
+
+ ### Step 1: <Phase Name>
+
+ <What to do and why. Include concrete commands when applicable.>
+
+ ```bash
+ # Example command
+ <command>
+ ```
+
+ **Output:** <What this step produces>
+ **Checkpoint:** <How to verify this step succeeded before moving on>
+
+ ### Step 2: <Phase Name>
+
+ <Instructions for the next phase.>
+
+ ```bash
+ # Example command
+ <command>
+ ```
+
+ **Output:** <What this step produces>
+ **Checkpoint:** <Verification criteria>
+
+ ### Step 3: <Phase Name>
+
+ <Continue the pattern. Add as many steps as needed.>
+
+ ### Step N: Deliver Results
+
+ <Final step — produce the deliverables and verify them.>
+
+ ## Deliverables
+
+ | # | Artifact | Format | Location | Required |
+ |---|----------|--------|----------|----------|
+ | 1 | <artifact-name> | <format> | <path> | Yes |
+ | 2 | <artifact-name> | <format> | <path> | Yes |
+ | 3 | <artifact-name> | <format> | <path> | No |
+
+ ## Communication Style
+
+ ### Tone
+
+ <Describe how the agent communicates: formal/informal, terse/verbose, etc.>
+
+ ### Example Phrases
+
+ - When starting: "<example opening phrase>"
+ - When blocked: "<example escalation phrase>"
+ - When delivering: "<example completion phrase>"
+ - When uncertain: "<example clarification phrase>"
+
+ ### Reporting Format
+
+ <How the agent structures its responses. Example:>
+
+ ```
+ ## <Title>
+
+ **Status:** <PASS | FAIL | NEEDS_REVIEW>
+ **Summary:** <1-2 sentences>
+
+ ### Findings
+ 1. <finding with evidence>
+
+ ### Recommendations
+ 1. <actionable recommendation>
+ ```
+
+ ## Learning & Memory
+
+ ### Pattern Recognition
+
+ The agent should recognize and adapt to these patterns:
+
+ - **<Pattern 1>**: <What to look for> -> <How to respond>
+ - **<Pattern 2>**: <What to look for> -> <How to respond>
+
+ ### Context Accumulation
+
+ Between invocations, the agent retains understanding of:
+
+ - <Context type 1 — e.g., "codebase architecture from previous reviews">
+ - <Context type 2 — e.g., "team conventions observed in prior sessions">
+
+ ### Anti-Patterns to Flag
+
+ - <Anti-pattern 1>: <What it looks like and why it's wrong>
+ - <Anti-pattern 2>: <What it looks like and why it's wrong>
+
+ ## Success Metrics
+
+ | Metric | Target | Measurement |
+ |--------|--------|-------------|
+ | <metric-1> | <quantified target, e.g., ">90%"> | <how to measure> |
+ | <metric-2> | <quantified target> | <how to measure> |
+ | <metric-3> | <quantified target> | <how to measure> |
+
+ ## Advanced Capabilities
+
+ ### <Capability 1>
+
+ <Description of an advanced behavior the agent supports when invoked
+ with specific inputs or in specific contexts.>
+
+ ### <Capability 2>
+
+ <Another advanced capability. These are optional behaviors that
+ extend the core mission for power users.>
+
+ ## Skills & References
+
+ - [<skill-name>](../skills/<slug>/SKILL.md) — <when to load>
+ - [<reference-name>](references/<file>.md) — <what it covers>
+ ````
+
+ ---
+
+ ## Section-by-Section Guide
+
+ ### Identity (4 Required Fields)
+
+ The Identity section defines the agent's persona. All four fields are mandatory:
+
+ 1. **Expertise** — Narrow scope. "Database optimization" is better than "backend development." Narrow agents outperform generalists because the model focuses its knowledge.
+
+ 2. **Personality** — This shapes output tone. "Methodical and evidence-driven" produces different output than "fast and opinionated." Choose what fits the use case.
+
+ 3. **Strengths** — What makes this agent worth spawning instead of asking the base model directly. If you can't articulate this, the agent may not need to exist.
+
+ 4. **Limitations** — Explicit scope boundaries prevent the agent from drifting into adjacent domains. "Does NOT write production code" keeps a reviewer focused on reviewing.
+
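Because all four Identity fields are mandatory, their presence is easy to check mechanically. The sketch below is illustrative only: `missing_identity_fields` is a hypothetical helper (it is not the package's `lint-artifact.sh`), and it assumes the fields are written as `- **Field**:` bullets under a `## Identity` heading, as in the scaffold.

```python
import re

REQUIRED_IDENTITY_FIELDS = ("Expertise", "Personality", "Strengths", "Limitations")


def missing_identity_fields(agent_markdown):
    """Return the required Identity fields not found in the Identity section."""
    # Capture everything after "## Identity" up to the next level-2 heading
    # (or the end of the text). "### ..." subheadings do not end the section.
    match = re.search(
        r"^## Identity[ \t]*\n(.*?)(?=^## |\Z)",
        agent_markdown,
        flags=re.MULTILINE | re.DOTALL,
    )
    section = match.group(1) if match else ""
    return [
        field
        for field in REQUIRED_IDENTITY_FIELDS
        if not re.search(rf"^- \*\*{field}\*\*:", section, flags=re.MULTILINE)
    ]
```

An empty result means all four bullets were found. A file with no `## Identity` heading reports all four as missing.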
+ ### Core Mission
+
+ The bridge between identity and action. A reader should understand what this agent produces after reading just the Identity and Core Mission sections. Everything below is implementation detail.
+
+ ### Critical Rules
+
+ Three severity tiers, each with a distinct consequence:
+
+ - **BLOCK** — execution stops. Use sparingly. Reserved for data loss, security breaches, or irreversible actions.
+ - **NEVER** — output quality drops below acceptable. These are hard constraints the model must internalize.
+ - **ALWAYS** — quality degrades when skipped. These are positive behaviors, not prohibitions.
+
+ Explain WHY each rule exists. "Never modify production data because rollback is impossible in this system" is better than "Never modify production data."
+
+ ### Process
+
+ Step-by-step instructions the agent follows. Each step needs:
+ - What to do (action)
+ - Why (reasoning — helps the model handle edge cases)
+ - How to verify (checkpoint — prevents cascading failures)
+
+ Include bash commands where the step involves tooling. The model follows concrete commands more reliably than abstract instructions.
+
+ ### Deliverables Table
+
+ Explicit contract between the agent and its caller. The caller knows exactly what to expect. The agent knows exactly what to produce. No ambiguity.
+
+ ### Communication Style
+
+ Example phrases are surprisingly effective at shaping agent output. The model pattern-matches against them. Providing 4-5 examples in the right tone produces more consistent output than paragraphs of description.
+
+ ### Success Metrics
+
+ Quantified targets enable eval creation. "Accuracy > 90%" can be tested. "Good accuracy" cannot. Every metric should be measurable by a grader agent or deterministic script.
+
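To illustrate why a quantified target is testable while a vague one is not, here is a minimal sketch of a deterministic check. `meets_target` is hypothetical, and the accepted target syntax (a comparison operator, a number, and an optional `%`) is an assumption based on the `">90%"` example in the scaffold. The measured value is expected in the same unit as the target.

```python
import operator
import re

COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}

# Assumed target syntax: operator, number, optional percent sign.
TARGET_PATTERN = re.compile(r"\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*%?\s*")


def meets_target(target, measured):
    """Return True when `measured` satisfies a target such as ">90%".

    Raises ValueError for targets that carry no threshold, which is the
    case for unquantified wording like "Good accuracy".
    """
    match = TARGET_PATTERN.fullmatch(target)
    if match is None:
        raise ValueError(f"target is not quantified: {target!r}")
    symbol, threshold = match.groups()
    return COMPARATORS[symbol](measured, float(threshold))
```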
+ ## Model Tier Selection
+
+ | Tier | When to Use | Cost |
+ |---|---|---|
+ | `haiku` | High-frequency, narrow-scope tasks (linting, formatting, simple checks) | Lowest |
+ | `sonnet` | Most agent work (review, analysis, implementation, orchestration) | Medium |
+ | `opus` | Deep reasoning, architectural decisions, complex multi-step analysis | Highest |
+
+ Default to `sonnet` unless you have a specific reason for `haiku` (high frequency) or `opus` (deep reasoning).