aw-ecc 1.4.31 → 1.4.47

Files changed (259)
  1. package/.claude-plugin/plugin.json +1 -1
  2. package/.codex/hooks/aw-post-tool-use.sh +8 -2
  3. package/.codex/hooks/aw-session-start.sh +11 -4
  4. package/.codex/hooks/aw-stop.sh +8 -2
  5. package/.codex/hooks/aw-user-prompt-submit.sh +10 -2
  6. package/.codex/hooks.json +8 -8
  7. package/.cursor/INSTALL.md +7 -5
  8. package/.cursor/hooks/adapter.js +41 -4
  9. package/.cursor/hooks/after-agent-response.js +62 -0
  10. package/.cursor/hooks/before-submit-prompt.js +7 -1
  11. package/.cursor/hooks/post-tool-use-failure.js +21 -0
  12. package/.cursor/hooks/post-tool-use.js +39 -0
  13. package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
  14. package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
  15. package/.cursor/hooks/subagent-start.js +22 -4
  16. package/.cursor/hooks/subagent-stop.js +18 -1
  17. package/.cursor/hooks.json +23 -2
  18. package/.opencode/package.json +1 -1
  19. package/AGENTS.md +3 -3
  20. package/README.md +5 -5
  21. package/commands/adk.md +52 -0
  22. package/commands/build.md +22 -9
  23. package/commands/deploy.md +12 -0
  24. package/commands/execute.md +9 -0
  25. package/commands/feature.md +333 -0
  26. package/commands/investigate.md +18 -5
  27. package/commands/plan.md +23 -9
  28. package/commands/publish.md +65 -0
  29. package/commands/review.md +12 -0
  30. package/commands/ship.md +12 -0
  31. package/commands/test.md +12 -0
  32. package/commands/verify.md +9 -0
  33. package/hooks/hooks.json +36 -0
  34. package/manifests/install-components.json +8 -0
  35. package/manifests/install-modules.json +83 -0
  36. package/manifests/install-profiles.json +7 -0
  37. package/package.json +1 -1
  38. package/scripts/ci/validate-rules.js +51 -0
  39. package/scripts/cursor-aw-home/hooks.json +23 -2
  40. package/scripts/cursor-aw-hooks/adapter.js +41 -4
  41. package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
  42. package/scripts/hooks/aw-usage-commit-created.js +32 -0
  43. package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
  44. package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
  45. package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
  46. package/scripts/hooks/aw-usage-session-start.js +48 -0
  47. package/scripts/hooks/aw-usage-stop.js +182 -0
  48. package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
  49. package/scripts/hooks/cost-tracker.js +3 -23
  50. package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
  51. package/scripts/hooks/shared/aw-phase-runner.js +3 -1
  52. package/scripts/lib/aw-hook-contract.js +2 -2
  53. package/scripts/lib/aw-pricing.js +306 -0
  54. package/scripts/lib/aw-usage-telemetry.js +472 -0
  55. package/scripts/lib/codex-hook-config.js +8 -8
  56. package/scripts/lib/cursor-hook-config.js +25 -10
  57. package/scripts/lib/install-targets/codex-home.js +7 -0
  58. package/scripts/lib/install-targets/cursor-project.js +3 -0
  59. package/scripts/lib/install-targets/helpers.js +20 -3
  60. package/skills/aw-adk/SKILL.md +317 -0
  61. package/skills/aw-adk/agents/analyzer.md +113 -0
  62. package/skills/aw-adk/agents/comparator.md +113 -0
  63. package/skills/aw-adk/agents/grader.md +115 -0
  64. package/skills/aw-adk/assets/eval_review.html +76 -0
  65. package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
  66. package/skills/aw-adk/eval-viewer/viewer.html +181 -0
  67. package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
  68. package/skills/aw-adk/evals/eval-create-agent.md +90 -0
  69. package/skills/aw-adk/evals/eval-create-command.md +98 -0
  70. package/skills/aw-adk/evals/eval-create-eval.md +89 -0
  71. package/skills/aw-adk/evals/eval-create-rule.md +99 -0
  72. package/skills/aw-adk/evals/eval-create-skill.md +97 -0
  73. package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
  74. package/skills/aw-adk/evals/eval-delete-command.md +89 -0
  75. package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
  76. package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
  77. package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
  78. package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
  79. package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
  80. package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
  81. package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
  82. package/skills/aw-adk/evals/evals.json +96 -0
  83. package/skills/aw-adk/references/artifact-wiring.md +162 -0
  84. package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
  85. package/skills/aw-adk/references/eval-placement-guide.md +183 -0
  86. package/skills/aw-adk/references/external-resources.md +75 -0
  87. package/skills/aw-adk/references/getting-started.md +66 -0
  88. package/skills/aw-adk/references/registry-structure.md +152 -0
  89. package/skills/aw-adk/references/rubric-agent.md +36 -0
  90. package/skills/aw-adk/references/rubric-command.md +36 -0
  91. package/skills/aw-adk/references/rubric-eval.md +36 -0
  92. package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
  93. package/skills/aw-adk/references/rubric-rule.md +36 -0
  94. package/skills/aw-adk/references/rubric-skill.md +36 -0
  95. package/skills/aw-adk/references/schemas.md +222 -0
  96. package/skills/aw-adk/references/template-agent.md +251 -0
  97. package/skills/aw-adk/references/template-command.md +279 -0
  98. package/skills/aw-adk/references/template-eval.md +176 -0
  99. package/skills/aw-adk/references/template-rule.md +119 -0
  100. package/skills/aw-adk/references/template-skill.md +123 -0
  101. package/skills/aw-adk/references/type-classifier.md +98 -0
  102. package/skills/aw-adk/references/writing-good-agents.md +227 -0
  103. package/skills/aw-adk/references/writing-good-commands.md +258 -0
  104. package/skills/aw-adk/references/writing-good-evals.md +271 -0
  105. package/skills/aw-adk/references/writing-good-rules.md +214 -0
  106. package/skills/aw-adk/references/writing-good-skills.md +159 -0
  107. package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
  108. package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
  109. package/skills/aw-adk/scripts/score-artifact.sh +179 -0
  110. package/skills/aw-adk/scripts/trigger-eval.py +192 -0
  111. package/skills/aw-build/SKILL.md +19 -2
  112. package/skills/aw-deploy/SKILL.md +65 -3
  113. package/skills/aw-design/SKILL.md +156 -0
  114. package/skills/aw-design/references/highrise-tokens.md +394 -0
  115. package/skills/aw-design/references/micro-interactions.md +76 -0
  116. package/skills/aw-design/references/prompt-template.md +160 -0
  117. package/skills/aw-design/references/quality-checklist.md +70 -0
  118. package/skills/aw-design/references/self-review.md +497 -0
  119. package/skills/aw-design/references/stitch-workflow.md +127 -0
  120. package/skills/aw-feature/SKILL.md +293 -0
  121. package/skills/aw-investigate/SKILL.md +17 -0
  122. package/skills/aw-plan/SKILL.md +34 -3
  123. package/skills/aw-publish/SKILL.md +300 -0
  124. package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
  125. package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
  126. package/skills/aw-publish/evals/eval-push-modes.md +67 -0
  127. package/skills/aw-publish/evals/eval-rules-push.md +60 -0
  128. package/skills/aw-publish/evals/evals.json +29 -0
  129. package/skills/aw-publish/references/push-modes.md +38 -0
  130. package/skills/aw-review/SKILL.md +88 -9
  131. package/skills/aw-rules-review/SKILL.md +124 -0
  132. package/skills/aw-rules-review/agents/openai.yaml +3 -0
  133. package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
  134. package/skills/aw-ship/SKILL.md +16 -0
  135. package/skills/aw-spec/SKILL.md +15 -0
  136. package/skills/aw-tasks/SKILL.md +15 -0
  137. package/skills/aw-test/SKILL.md +16 -0
  138. package/skills/aw-yolo/SKILL.md +4 -0
  139. package/skills/diagnose/SKILL.md +121 -0
  140. package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
  141. package/skills/finish-only-when-green/SKILL.md +265 -0
  142. package/skills/grill-me/SKILL.md +24 -0
  143. package/skills/grill-with-docs/SKILL.md +92 -0
  144. package/skills/grill-with-docs/adr-format.md +47 -0
  145. package/skills/grill-with-docs/context-format.md +67 -0
  146. package/skills/improve-codebase-architecture/SKILL.md +75 -0
  147. package/skills/improve-codebase-architecture/deepening.md +37 -0
  148. package/skills/improve-codebase-architecture/interface-design.md +44 -0
  149. package/skills/improve-codebase-architecture/language.md +53 -0
  150. package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
  151. package/skills/tdd/SKILL.md +115 -0
  152. package/skills/tdd/deep-modules.md +33 -0
  153. package/skills/tdd/interface-design.md +31 -0
  154. package/skills/tdd/mocking.md +59 -0
  155. package/skills/tdd/refactoring.md +10 -0
  156. package/skills/tdd/tests.md +61 -0
  157. package/skills/to-issues/SKILL.md +62 -0
  158. package/skills/to-prd/SKILL.md +75 -0
  159. package/skills/using-aw-skills/SKILL.md +170 -237
  160. package/skills/using-aw-skills/hooks/session-start.sh +11 -41
  161. package/skills/zoom-out/SKILL.md +24 -0
  162. package/.cursor/rules/common-agents.md +0 -53
  163. package/.cursor/rules/common-aw-routing.md +0 -43
  164. package/.cursor/rules/common-coding-style.md +0 -52
  165. package/.cursor/rules/common-development-workflow.md +0 -33
  166. package/.cursor/rules/common-git-workflow.md +0 -28
  167. package/.cursor/rules/common-hooks.md +0 -34
  168. package/.cursor/rules/common-patterns.md +0 -35
  169. package/.cursor/rules/common-performance.md +0 -59
  170. package/.cursor/rules/common-security.md +0 -33
  171. package/.cursor/rules/common-testing.md +0 -33
  172. package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
  173. package/.cursor/skills/article-writing/SKILL.md +0 -85
  174. package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
  175. package/.cursor/skills/aw-build/SKILL.md +0 -152
  176. package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
  177. package/.cursor/skills/aw-debug/SKILL.md +0 -49
  178. package/.cursor/skills/aw-deploy/SKILL.md +0 -101
  179. package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
  180. package/.cursor/skills/aw-execute/SKILL.md +0 -47
  181. package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
  182. package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
  183. package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
  184. package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
  185. package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
  186. package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
  187. package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
  188. package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
  189. package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
  190. package/.cursor/skills/aw-finish/SKILL.md +0 -111
  191. package/.cursor/skills/aw-investigate/SKILL.md +0 -109
  192. package/.cursor/skills/aw-plan/SKILL.md +0 -368
  193. package/.cursor/skills/aw-prepare/SKILL.md +0 -118
  194. package/.cursor/skills/aw-review/SKILL.md +0 -118
  195. package/.cursor/skills/aw-ship/SKILL.md +0 -115
  196. package/.cursor/skills/aw-spec/SKILL.md +0 -104
  197. package/.cursor/skills/aw-tasks/SKILL.md +0 -138
  198. package/.cursor/skills/aw-test/SKILL.md +0 -118
  199. package/.cursor/skills/aw-verify/SKILL.md +0 -51
  200. package/.cursor/skills/aw-yolo/SKILL.md +0 -111
  201. package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
  202. package/.cursor/skills/bun-runtime/SKILL.md +0 -84
  203. package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
  204. package/.cursor/skills/code-simplification/SKILL.md +0 -74
  205. package/.cursor/skills/content-engine/SKILL.md +0 -88
  206. package/.cursor/skills/context-engineering/SKILL.md +0 -74
  207. package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
  208. package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
  209. package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
  210. package/.cursor/skills/frontend-slides/SKILL.md +0 -184
  211. package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
  212. package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
  213. package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
  214. package/.cursor/skills/idea-refine/SKILL.md +0 -84
  215. package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
  216. package/.cursor/skills/investor-materials/SKILL.md +0 -96
  217. package/.cursor/skills/investor-outreach/SKILL.md +0 -76
  218. package/.cursor/skills/market-research/SKILL.md +0 -75
  219. package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
  220. package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
  221. package/.cursor/skills/performance-optimization/SKILL.md +0 -77
  222. package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
  223. package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
  224. package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
  225. package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
  226. package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
  227. package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
  228. package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
  229. package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
  230. /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
  231. /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
  232. /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
  233. /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
  234. /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
  235. /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
  236. /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
  237. /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
  238. /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
  239. /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
  240. /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
  241. /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
  242. /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
  243. /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
  244. /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
  245. /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
  246. /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
  247. /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
  248. /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
  249. /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
  250. /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
  251. /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
  252. /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
  253. /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
  254. /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
  255. /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
  256. /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
  257. /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
  258. /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
  259. /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
@@ -0,0 +1,98 @@
+ ---
+ name: eval-create-command
+ target: skill/aw-adk
+ category: functional
+ difficulty: advanced
+ ---
+
+ # Eval: Create Command — Multi-Phase with Human Checkpoint
+
+ ## Task
+
+ Test that the ADK creates a command with proper phase structure, agent roster, and — critically — generates evals that cover the command's own structure (human checkpoints, parallel agents, mid-pipeline failures). This eval targets the gap where the ADK created commands but derived evals from generic categories instead of the artifact's structure.
+
+ ### Prompt
+
+ ```
+ Create a command for database migration workflow in the platform/data namespace. It should have 4 phases: (1) pre-migration validation — check schema compatibility and generate migration plan, (2) backup — snapshot current state, (3) migrate — apply migration scripts with progress tracking, (4) post-migration verification — validate data integrity and rollback if checks fail. Phase 3 must have a human approval checkpoint before executing destructive changes. Create new agents for each phase within platform/data.
+ ```
+
+ ## Context
+
+ | Field | Value |
+ |-------|-------|
+ | **Namespace** | `platform/data` |
+ | **Domain** | `data` |
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
+ | **Target type** | `command` |
+
+ ## Expected Outcomes
+
+ - [ ] **Type classified correctly** — identified as `command`
+ - [ ] **Interview conducted** — asked about workflow phases, agents, human checkpoints, namespace
+ - [ ] **Path resolved** — target at `.aw/.aw_registry/platform/data/commands/database-migration.md`
+ - [ ] **Command has AW-PROTOCOL reference** and skill loading gate
+ - [ ] **Agent roster table present** — with phase, agent name, model columns
+ - [ ] **Phase structure** — numbered phases with input/output/checkpoint/on-failure
+ - [ ] **Human checkpoint** — at least one phase blocks for human approval (migration is destructive)
+ - [ ] **CHECKPOINT output shown**
+ - [ ] **Lint ran and passed** — no phantom_agent errors
+ - [ ] **Scoring performed** — rubric-command.md read, 10-dimension score table
+ - [ ] **2+ evals created** — colocated at `commands/evals/<slug>/eval-*.md`
+ - [ ] **Evals derived from structure** — at least one eval covers the human checkpoint (approve AND reject paths)
+ - [ ] **Dependency chain eval present** — at least one eval validates all agents in roster exist
+ - [ ] **`aw link` ran**
+
+ ## Grading Criteria
+
+ ### PASS (all conditions met)
+
+ - All 14 outcomes checked
+ - Evals exercise the command's own phases, not just generic happy-path/failure
+
+ ### PARTIAL (9+ of 14)
+
+ - Command created with correct structure
+ - But evals are generic (no checkpoint-specific or dependency-chain evals)
+
+ ### FAIL (below 9)
+
+ - No phase structure
+ - No human checkpoint for a destructive workflow
+ - Steps 5-14 skipped
+ - Evals missing entirely
+
+ ## Evaluation Method
+
+ **Type:** hybrid
+
+ ### Deterministic Checks
+
+ ```bash
+ # Verify command file exists
+ test -f ".aw/.aw_registry/platform/data/commands/database-migration.md" || echo "FAIL: file not found"
+
+ # Check for phase structure
+ grep -q "## Phase" ".aw/.aw_registry/platform/data/commands/database-migration.md" || echo "FAIL: no phases"
+
+ # Check for agent roster
+ grep -q "Agent Roster" ".aw/.aw_registry/platform/data/commands/database-migration.md" || echo "FAIL: no agent roster"
+
+ # Run lint
+ bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh ".aw/.aw_registry/platform/data/commands/database-migration.md" command
+
+ # Verify evals exist
+ ls .aw/.aw_registry/platform/data/commands/evals/database-migration/eval-*.md 2>/dev/null | wc -l | grep -q "[2-9]" || echo "FAIL: fewer than 2 evals"
+ ```
+
+ ### Model-Based Checks
+
+ - Does at least one eval test the human checkpoint with both approve and reject paths?
+ - Is the phase structure appropriate for a migration (pre-check, backup, migrate, validate, rollback)?
+ - Did the executor output a CHECKPOINT step?
+
+ ## Baseline Expectations
+
+ - Without ADK: Command created but evals are generic (happy-path only), no checkpoint-specific evals.
+ - With ADK: Structure-derived evals covering human gates, dependency chains, and mid-pipeline failures.
+ - **Expected delta:** +2 structure-specific evals vs. generic-only
@@ -0,0 +1,89 @@
+ ---
+ name: eval-create-eval
+ target: skill/aw-adk
+ category: functional
+ difficulty: intermediate
+ ---
+
+ # Eval: Create Eval — Standalone Eval for Existing Artifact
+
+ ## Task
+
+ Test that the ADK can create evals for an existing artifact (not as part of a create flow, but standalone). This tests the eval-specific interview, correct colocated placement, and eval quality (not always-pass, has failure scenarios, discriminating assertions).
+
+ ### Prompt
+
+ ```
+ Create evals for the existing integrity-verifier agent in the revex/reselling namespace (it's at .aw/.aw_registry/revex/reselling/backend/agents/integrity-verifier.md). Create at least 2 evals — one happy path testing successful data integrity verification, and one failure scenario where the agent encounters corrupted or mismatched records. Use hybrid grading (deterministic for structure, model-based for content quality).
+ ```
+
+ ## Context
+
+ | Field | Value |
+ |-------|-------|
+ | **Namespace** | `revex/reselling` |
+ | **Domain** | `backend` |
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
+ | **Target type** | `eval` |
+
+ ## Expected Outcomes
+
+ - [ ] **Type classified correctly** — identified as `eval`
+ - [ ] **Interview conducted** — asked about: which parent artifact, what scenarios, what grader type
+ - [ ] **Parent artifact located** — the ADK reads the existing agent to understand what to test
+ - [ ] **2+ eval files created** — at `agents/evals/payments-processor/eval-*.md`
+ - [ ] **Colocated placement** — evals are in the agent's `evals/` directory, not a centralized location
+ - [ ] **Happy path covered** — at least one eval tests the agent working correctly
+ - [ ] **Failure scenario covered** — at least one eval tests error handling or edge cases
+ - [ ] **Eval frontmatter correct** — each eval has `target:`, `type: eval`, `purpose:`
+ - [ ] **Assertions are discriminating** — at least one negative assertion ("does NOT contain/skip X")
+ - [ ] **Grading criteria clear** — PASS/PARTIAL/FAIL with specific thresholds
+ - [ ] **CHECKPOINT output shown**
+ - [ ] **Lint ran** on the eval files
+
+ ## Grading Criteria
+
+ ### PASS (all conditions met)
+
+ - All 12 outcomes checked
+ - Evals are specific to the payments-processor agent (not generic template output)
+ - At least one negative assertion present
+
+ ### PARTIAL (8+ of 12)
+
+ - Evals created but generic (not tailored to the agent's domain)
+ - OR placed in wrong directory
+
+ ### FAIL (below 8)
+
+ - No evals created
+ - Evals placed in centralized location instead of colocated
+ - All assertions are always-pass (no discriminating checks)
+
+ ## Evaluation Method
+
+ **Type:** hybrid
+
+ ### Deterministic Checks
+
+ ```bash
+ # Verify evals exist at correct colocated path
+ ls .aw/.aw_registry/revex/reselling/*/agents/evals/payments-processor/eval-*.md 2>/dev/null | wc -l | grep -q "[2-9]" || echo "FAIL: fewer than 2 evals"
+
+ # Verify frontmatter
+ for f in .aw/.aw_registry/revex/reselling/*/agents/evals/payments-processor/eval-*.md; do
+ grep -q "^target:" "$f" || echo "FAIL: $f missing target"
+ done
+ ```
+
+ ### Model-Based Checks
+
+ - Are eval scenarios specific to payments processing (not generic)?
+ - Do assertions discriminate — would a clearly wrong output fail them?
+ - Did the executor read the parent agent before writing evals?
+
+ ## Baseline Expectations
+
+ - Without ADK: Generic eval stubs with always-pass assertions, possibly in wrong directory.
+ - With ADK: Domain-specific evals with discriminating assertions, correctly colocated.
+ - **Expected delta:** +30% assertion discrimination rate
@@ -0,0 +1,99 @@
+ ---
+ name: eval-create-rule
+ target: skill/aw-adk
+ category: functional
+ difficulty: intermediate
+ ---
+
+ # Eval: Create Rule — Full Flow Including AGENTS.md Update
+
+ ## Task
+
+ Test that the ADK follows the complete rule creation flow — including the three registry updates that rules uniquely require: reference file, rule-manifest.json entry, AND AGENTS.md bullet point. Also tests that rules are not treated as "simpler" than other types — they must go through lint, scoring, and eval creation like any other CASRE type.
+
+ ### Prompt
+
+ ```
+ Create a rule called no-unbounded-cache-ttl for the data domain. It prevents Redis/Memorystore cache keys without expiry — every SET must include EX or PX. Severity: MUST. WRONG: redis.set("user:123", data) with no TTL. RIGHT: redis.set("user:123", data, "EX", 3600). File patterns: *.service.ts, *.repository.ts, *.cache.ts. Exception: distributed locks using Redlock which manage their own TTL internally.
+ ```
+
+ ## Context
+
+ | Field | Value |
+ |-------|-------|
+ | **Namespace** | `platform` (rules are always platform-scoped) |
+ | **Domain** | `data` |
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
+ | **Target type** | `rule` |
+
+ ## Expected Outcomes
+
+ - [ ] **Type classified correctly** — identified as `rule`
+ - [ ] **Interview conducted** — asked about: what it prevents, domain, severity, WRONG/RIGHT examples, file patterns, exceptions (6 questions per ADK)
+ - [ ] **Reference file created** — at `.aw/.aw_rules/platform/data/references/no-unbounded-redis-cache.md` (or similar slug)
+ - [ ] **Reference has WRONG/RIGHT examples** — concrete, copy-pasteable code (not pseudocode)
+ - [ ] **Reference has severity and paths frontmatter** — `severity: MUST`, `paths:` with relevant globs
+ - [ ] **CHECKPOINT output shown** — remaining steps printed before continuing
+ - [ ] **Lint ran** — `lint-artifact.sh` executed on the rule file
+ - [ ] **Scoring performed** — rubric-rule.md read, 10-dimension score table output
+ - [ ] **2+ evals created** — for the rule itself
+ - [ ] **rule-manifest.json updated** — new entry with id, severity, domains, rule path, description, principle
+ - [ ] **AGENTS.md bullet added** — `.aw/.aw_rules/platform/data/AGENTS.md` has a new bullet in the Always/Never section AND a reference link
+ - [ ] **`aw link` ran** (or acknowledged that rules don't need `aw link` — they're live immediately via hook)
+
+ ## Grading Criteria
+
+ ### PASS (all conditions met)
+
+ - All 12 outcomes checked
+ - Rule went through full lint/score/eval flow (not treated as "just a doc")
+ - All three registry updates performed (reference + manifest + AGENTS.md)
+
+ ### PARTIAL (8+ of 12)
+
+ - Rule created with correct structure
+ - But some flow steps skipped (no lint, no score, or no evals)
+ - OR manifest updated but AGENTS.md bullet missing
+
+ ### FAIL (below 8)
+
+ - Skipped directly from scaffold to "done" (steps 5-14 dropped)
+ - No AGENTS.md update (rule would never be enforced at runtime)
+ - No WRONG/RIGHT examples in the rule
+
+ ## Evaluation Method
+
+ **Type:** hybrid
+
+ ### Deterministic Checks
+
+ ```bash
+ # Verify reference file exists
+ find .aw/.aw_rules/platform/data/references/ -name "*redis*" -o -name "*cache*" | head -1 | xargs test -f || echo "FAIL: rule reference not found"
+
+ # Verify WRONG/RIGHT examples
+ grep -qi "WRONG\|Never" "<rule-path>" || echo "FAIL: no WRONG examples"
+ grep -qi "RIGHT\|Always" "<rule-path>" || echo "FAIL: no RIGHT examples"
+
+ # Verify manifest entry
+ grep -q "unbounded" .aw/.aw_rules/rule-manifest.json || echo "FAIL: not in manifest"
+
+ # Verify AGENTS.md bullet
+ grep -qi "redis\|cache\|unbounded" .aw/.aw_rules/platform/data/AGENTS.md || echo "FAIL: not in AGENTS.md"
+
+ # Run lint
+ bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh "<rule-path>" rule
+ ```
+
+ ### Model-Based Checks
+
+ - Did the executor show a CHECKPOINT step (not skip straight to writing)?
+ - Are WRONG/RIGHT examples concrete Redis code (not generic placeholders)?
+ - Does the score table show 10 dimensions?
+ - Did the executor create evals for the rule?
+
+ ## Baseline Expectations
+
+ - Without ADK: Rule reference created, maybe manifest updated, but AGENTS.md bullet missing (rule never enforced). No lint, no score, no evals.
+ - With ADK: Full three-update flow, lint-validated, scored, with colocated evals.
+ - **Expected delta:** 3/3 registry updates vs. 1-2/3 without ADK
@@ -0,0 +1,97 @@
1
+ ---
2
+ name: eval-create-skill
3
+ target: skill/aw-adk
4
+ category: functional
5
+ difficulty: intermediate
6
+ ---
7
+
8
+ # Eval: Create Skill — Full Flow Compliance
9
+
10
+ ## Task
11
+
12
+ Test that the ADK follows all 14 create flow steps when creating a skill. The prompt asks for a realistic skill in a known namespace. The eval checks that no steps are skipped — especially CHECKPOINT, LINT, SCORE, and EVAL GATE, which have historically been dropped for "simpler" artifact types.
13
+
14
+ ### Prompt
15
+
16
+ ```
17
+ Create a skill for MongoDB query patterns in the platform/data namespace. It should help developers write performant Mongoose queries — covering index-aware query construction, aggregation pipeline patterns, pagination with cursor-based approaches, and common anti-patterns like unbounded find(). Target audience is backend engineers using NestJS with Mongoose. No scripts or references needed beyond inline examples.
18
+ ```
19
+
20
+ ## Context
21
+
22
+ | Field | Value |
23
+ |-------|-------|
24
+ | **Namespace** | `platform/data` |
25
+ | **Domain** | `data` |
26
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
27
+ | **Target type** | `skill` |
28
+
29
+ ## Expected Outcomes
30
+
31
+ The executor's output must satisfy ALL of the following:
32
+
33
+ - [ ] **Type classified correctly** — identified as `skill` (not agent, command, or rule)
34
+ - [ ] **Interview conducted** — asked at least 3 questions before scaffolding (when to use, what domain knowledge, namespace confirmation)
35
+ - [ ] **Path resolved correctly** — target path is `.aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md`
36
+ - [ ] **SKILL.md created** with frontmatter fields: `name`, `description`, `trigger`
37
+ - [ ] **Required sections present** — at minimum: "When to Use", a guide/instructions section, and "References"
38
+ - [ ] **CHECKPOINT output shown** — the executor printed remaining steps (LINT → SCORE → EVALS → REGISTRY → SYNC) before continuing
39
+ - [ ] **Lint ran** — `lint-artifact.sh` was executed on the created file
40
+ - [ ] **Scoring performed** — rubric-skill.md was read and a score table with 10 dimensions was output
41
+ - [ ] **Score is B-Tier (60+) minimum** — or the executor iterated to fix gaps
42
+ - [ ] **2+ evals created** — colocated at `skills/mongodb-query-patterns/evals/eval-*.md`
43
+ - [ ] **Evals cover happy + failure** — at least one eval tests a failure or edge case scenario
44
+ - [ ] **`aw link` ran** — sync step was not skipped
45
+ - [ ] **No phantom dependencies** — any referenced artifacts actually exist
46
+
47
+ ## Grading Criteria
48
+
49
+ ### PASS (all conditions met)
50
+
51
+ - All 13 expected outcomes checked
52
+ - Content is domain-specific (MongoDB, not generic placeholder)
53
+ - Full flow executed in order
54
+
55
+ ### PARTIAL (8+ of 13)
56
+
57
+ - Artifact created with correct structure
58
+ - But some steps skipped (e.g., no checkpoint, no lint, or no evals)
59
+
60
+ ### FAIL (below 8)
61
+
62
+ - Steps 5-14 skipped entirely (wrote artifact → jumped to "done")
63
+ - Wrong type classification
64
+ - Wrong filesystem path
65
+
66
+ ## Evaluation Method
67
+
68
+ **Type:** hybrid
69
+
70
+ ### Deterministic Checks
71
+
72
+ ```bash
73
+ # Verify SKILL.md exists at correct path
74
+ test -f ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: file not found"
75
+
76
+ # Verify required frontmatter
77
+ grep -q "^name:" ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: missing name"
78
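+ grep -q "^description:" ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: missing description"; grep -q "^trigger:" ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: missing trigger"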
79
+
80
+ # Verify evals exist
81
+ [ "$(ls .aw/.aw_registry/platform/data/skills/mongodb-query-patterns/evals/eval-*.md 2>/dev/null | wc -l)" -ge 2 ] || echo "FAIL: fewer than 2 evals"
82
+
83
+ # Run lint
84
+ bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" skill
85
+ ```
86
+
87
+ ### Model-Based Checks
88
+
89
+ - Did the executor output a CHECKPOINT before lint/score/eval steps?
90
+ - Is the content MongoDB-specific (not generic foo/bar)?
91
+ - Does the score table show 10 dimensions with justified scores?
92
+
93
+ ## Baseline Expectations
94
+
95
+ - Without ADK: Model creates a markdown file but skips lint, scoring, evals, and registry updates. No structured flow.
96
+ - With ADK: All 14 steps followed. Structured artifact with colocated evals and correct placement.
97
+ - **Expected delta:** +50% step completion rate
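
The deterministic checks in this eval can also be written as small reusable helpers that compare the eval count numerically and loop over the expected frontmatter fields. This is a minimal sketch, assuming the registry layout shown in the eval; `count_evals` and `check_frontmatter` are hypothetical names and are not part of aw-ecc.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of aw-ecc.

# Print the number of eval-*.md files colocated with a skill directory.
# Prints 0 when the evals/ directory does not exist.
count_evals() {
  local skill_dir="$1"
  find "$skill_dir/evals" -maxdepth 1 -type f -name 'eval-*.md' 2>/dev/null \
    | wc -l | tr -d ' '
}

# Print one FAIL line per field that has no "field:" line in the file.
# Returns non-zero if any field is missing.
check_frontmatter() {
  local file="$1"; shift
  local field status=0
  for field in "$@"; do
    if ! grep -q "^${field}:" "$file" 2>/dev/null; then
      echo "FAIL: missing ${field}"
      status=1
    fi
  done
  return "$status"
}
```

With `skill_dir` set to the skill's registry path, a caller could run `[ "$(count_evals "$skill_dir")" -ge 2 ] || echo "FAIL: fewer than 2 evals"` followed by `check_frontmatter "$skill_dir/SKILL.md" name description trigger`. Comparing with `-ge` avoids matching on individual digits of the count.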
@@ -0,0 +1,79 @@
1
+ ---
2
+ name: eval-delete-agent
3
+ target: skill/aw-adk
4
+ category: functional
5
+ difficulty: intermediate
6
+ ---
7
+
8
+ # Eval: Delete Agent — Full Cleanup Including Colocated Evals
9
+
10
+ ## Task
11
+
12
+ Test that the ADK's delete mode removes the agent file, its colocated evals, and warns about any commands that reference it in their roster.
13
+
14
+ ### Prompt
15
+
16
+ ```
17
+ First, create a temporary agent called temp-cleanup-test in the platform/services namespace. It's a simple agent for testing deletion — tools: Read, Grep. Model: haiku. Skills: []. Description: "Temporary agent for delete flow testing."
18
+
19
+ After the agent and its evals are created, delete it using the ADK delete flow. Confirm when prompted.
20
+ ```
21
+
22
+ ## Context
23
+
24
+ | Field | Value |
25
+ |-------|-------|
26
+ | **Namespace** | `platform/services` |
27
+ | **Domain** | `services` |
28
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
29
+ | **Target type** | `agent` (create then delete) |
30
+
31
+ ## Expected Outcomes
32
+
33
+ - [ ] **Agent created first** at `.aw/.aw_registry/platform/services/agents/temp-cleanup-test.md`
34
+ - [ ] **Evals created** in colocated directory
35
+ - [ ] **Delete flow initiated** — ADK switches to delete mode
36
+ - [ ] **Inventory shown** — lists the agent file + eval files that will be deleted
37
+ - [ ] **Dependency check** — scans for commands referencing this agent in their roster
38
+ - [ ] **User confirmation requested** — asks before deleting
39
+ - [ ] **Agent file deleted**
40
+ - [ ] **Colocated evals deleted** — entire `evals/temp-cleanup-test/` directory removed
41
+ - [ ] **`aw link` ran** after deletion
42
+
43
+ ## Grading Criteria
44
+
45
+ ### PASS
46
+
47
+ - All 9 outcomes met
48
+ - No orphaned files remain after deletion
49
+
50
+ ### PARTIAL
51
+
52
+ - Agent deleted but evals left behind
53
+ - OR no confirmation requested before deletion
54
+
55
+ ### FAIL
56
+
57
+ - Agent not deleted
58
+ - Deleted without showing inventory
59
+ - No `aw link` after deletion
60
+
61
+ ## Evaluation Method
62
+
63
+ **Type:** deterministic
64
+
65
+ ### Deterministic Checks
66
+
67
+ ```bash
68
+ # After delete, verify agent is gone
69
+ test ! -f ".aw/.aw_registry/platform/services/agents/temp-cleanup-test.md" || echo "FAIL: agent still exists"
70
+
71
+ # Verify evals directory is gone
72
+ test ! -d ".aw/.aw_registry/platform/services/agents/evals/temp-cleanup-test" || echo "FAIL: eval directory still exists"
73
+ ```
74
+
75
+ ## Baseline Expectations
76
+
77
+ - Without ADK: Manual file deletion, evals likely left orphaned.
78
+ - With ADK: Full inventory, dependency check, clean removal, sync.
79
+ - **Expected delta:** 0 orphaned files with ADK vs. likely orphans without
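
The two `test` checks in this eval cover the known agent and eval paths. A broader sweep for anything that still carries the agent's name may catch orphans in unexpected places. The following is a sketch under that assumption; `find_orphans` and `assert_no_orphans` are hypothetical names, not aw-ecc commands.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of aw-ecc.

# List every file or directory under a root whose name contains the artifact name.
find_orphans() {
  local root="$1" name="$2"
  find "$root" -name "*${name}*" 2>/dev/null
}

# Print a FAIL report and return non-zero if any matching path is left behind.
assert_no_orphans() {
  local leftovers
  leftovers="$(find_orphans "$1" "$2")"
  if [ -n "$leftovers" ]; then
    echo "FAIL: orphaned paths remain:"
    echo "$leftovers"
    return 1
  fi
}
```

For this eval the call would look like `assert_no_orphans .aw/.aw_registry/platform/services temp-cleanup-test`.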
@@ -0,0 +1,89 @@
1
+ ---
2
+ name: eval-delete-command
3
+ target: skill/aw-adk
4
+ category: functional
5
+ difficulty: intermediate
6
+ ---
7
+
8
+ # Eval: Delete Command — Agent Roster Inventory + Shared Agent Handling
9
+
10
+ ## Task
11
+
12
+ Test that deleting a command inventories its agent roster and asks the user whether each agent should also be deleted (they may be shared with other commands) or just left in place.
13
+
14
+ ### Prompt
15
+
16
+ ```
17
+ First, create a temporary command called temp-pipeline-test in the platform/data namespace. It has 2 phases: (1) validate — check input data format, (2) process — transform and store data. Create new agents for each phase: temp-pipeline-validator and temp-pipeline-processor. Both in platform/data, model: haiku, tools: Read, Bash.
18
+
19
+ After the command and agents are created, delete the command temp-pipeline-test using the ADK delete flow. When asked about the agents, say "delete both — they're not shared." Confirm deletion when prompted.
20
+ ```
21
+
22
+ ## Context
23
+
24
+ | Field | Value |
25
+ |-------|-------|
26
+ | **Namespace** | `platform/data` |
27
+ | **Domain** | `data` |
28
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
29
+ | **Target type** | `command` (create then delete) |
30
+
31
+ ## Expected Outcomes
32
+
33
+ - [ ] **Command created** at `.aw/.aw_registry/platform/data/commands/temp-pipeline-test.md`
34
+ - [ ] **2 agents created** for the command's phases
35
+ - [ ] **Delete flow initiated** for the command
36
+ - [ ] **Inventory shown** — lists command file + colocated evals
37
+ - [ ] **Agent roster identified** — lists the 2 agents in the roster
38
+ - [ ] **User asked per agent** — "These agents are in the roster. Delete them too or leave them?"
39
+ - [ ] **Command file + evals deleted**
40
+ - [ ] **Both agents deleted** (per user instruction)
41
+ - [ ] **No phantom references remain** — no command referencing deleted agents, no agents referencing deleted command
42
+ - [ ] **`aw link` ran**
43
+
44
+ ## Grading Criteria
45
+
46
+ ### PASS
47
+
48
+ - All 10 outcomes met
49
+ - No orphaned files remain
50
+
51
+ ### PARTIAL
52
+
53
+ - Command deleted but agents left without asking
54
+ - OR agents deleted without confirming with user
55
+
56
+ ### FAIL
57
+
58
+ - Command not deleted
59
+ - Agents silently deleted: not listed in the inventory and the user was never asked
60
+ - Agents left behind AND no mention of them in inventory
61
+
62
+ ## Evaluation Method
63
+
64
+ **Type:** hybrid
65
+
66
+ ### Deterministic Checks
67
+
68
+ ```bash
69
+ # Command should be gone
70
+ test ! -f ".aw/.aw_registry/platform/data/commands/temp-pipeline-test.md" || echo "FAIL: command still exists"
71
+
72
+ # Command evals should be gone
73
+ test ! -d ".aw/.aw_registry/platform/data/commands/evals/temp-pipeline-test" || echo "FAIL: command evals remain"
74
+
75
+ # Agents should be gone (user said delete both)
76
+ test ! -f ".aw/.aw_registry/platform/data/agents/temp-pipeline-validator.md" || echo "FAIL: validator agent still exists"
77
+ test ! -f ".aw/.aw_registry/platform/data/agents/temp-pipeline-processor.md" || echo "FAIL: processor agent still exists"
78
+ ```
79
+
80
+ ### Model-Based Checks
81
+
82
+ - Did the ADK ask about each agent before deleting?
83
+ - Did it present the choice (delete vs. leave) rather than assuming?
84
+
85
+ ## Baseline Expectations
86
+
87
+ - Without ADK: Command deleted, agents orphaned (still exist but nothing invokes them).
88
+ - With ADK: Full roster inventory, per-agent confirmation, clean removal.
89
+ - **Expected delta:** 0 orphaned agents with ADK vs. 2 without
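
The "No phantom references remain" outcome has no deterministic check in this eval. One way to approximate it is a plain text search for the deleted names across the namespace. This is a sketch only: `find_phantom_refs` is a hypothetical helper, and it assumes references appear as literal artifact names in the registry files.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration only; not part of aw-ecc.

# Print the files under a root that still mention any of the given names.
# Names are matched as fixed strings, and each file is listed once.
find_phantom_refs() {
  local root="$1"; shift
  local name
  for name in "$@"; do
    grep -rlF -- "$name" "$root" 2>/dev/null
  done | sort -u
}
```

A caller could then run `[ -z "$(find_phantom_refs .aw/.aw_registry/platform/data temp-pipeline-test temp-pipeline-validator temp-pipeline-processor)" ] || echo "FAIL: phantom references remain"`.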
@@ -0,0 +1,86 @@
1
+ ---
2
+ name: eval-delete-rule
3
+ target: skill/aw-adk
4
+ category: functional
5
+ difficulty: intermediate
6
+ ---
7
+
8
+ # Eval: Delete Rule — Registry Cleanup (Manifest + AGENTS.md)
9
+
10
+ ## Task
11
+
12
+ Test that the ADK's delete mode for rules removes the reference file AND cleans up both the rule-manifest.json entry and the AGENTS.md bullet. Rules have the most complex cleanup because they touch 3 registry locations.
13
+
14
+ ### Prompt
15
+
16
+ ```
17
+ First, create a temporary rule called no-temp-test-pattern for the universal domain. It prevents using temporary test patterns in production code. Severity: SHOULD. WRONG: if (process.env.TEMP_TEST) { skipValidation(); }. RIGHT: remove temp test flags before merging. File patterns: *.ts, *.js. No exceptions.
18
+
19
+ After the rule is created (including manifest + AGENTS.md updates), delete it using the ADK delete flow. Confirm when prompted.
20
+ ```
21
+
22
+ ## Context
23
+
24
+ | Field | Value |
25
+ |-------|-------|
26
+ | **Namespace** | `platform` |
27
+ | **Domain** | `universal` |
28
+ | **Target artifact** | `skills/aw-adk/SKILL.md` |
29
+ | **Target type** | `rule` (create then delete) |
30
+
31
+ ## Expected Outcomes
32
+
33
+ - [ ] **Rule created first** with reference file, manifest entry, AGENTS.md bullet
34
+ - [ ] **Delete flow initiated**
35
+ - [ ] **Inventory shown** — lists: reference file, manifest entry, AGENTS.md bullet, colocated evals
36
+ - [ ] **User confirmation requested**
37
+ - [ ] **Reference file deleted**
38
+ - [ ] **rule-manifest.json entry removed**
39
+ - [ ] **AGENTS.md bullet removed**
40
+ - [ ] **Colocated evals deleted**
41
+ - [ ] **`aw link` ran** after deletion
42
+
43
+ ## Grading Criteria
44
+
45
+ ### PASS
46
+
47
+ - All 9 outcomes met
48
+ - rule-manifest.json has no trace of the deleted rule
49
+ - AGENTS.md has no trace of the deleted rule
50
+
51
+ ### PARTIAL
52
+
53
+ - Reference file deleted but manifest or AGENTS.md not cleaned up
54
+ - OR evals left behind
55
+
56
+ ### FAIL
57
+
58
+ - Rule not deleted
59
+ - Manifest entry left (rule would still appear in enforcement system)
60
+ - AGENTS.md bullet left (rule would still be enforced at runtime)
61
+
62
+ ## Evaluation Method
63
+
64
+ **Type:** deterministic
65
+
66
+ ### Deterministic Checks
67
+
68
+ ```bash
69
+ # After delete, verify reference file is gone
70
+ find .aw/.aw_rules/platform/universal/references/ -name "*temp-test*" 2>/dev/null | grep -q . && echo "FAIL: reference still exists"
71
+
72
+ # Verify manifest cleaned
73
+ grep -q "temp-test" .aw/.aw_rules/rule-manifest.json 2>/dev/null && echo "FAIL: still in manifest"
74
+
75
+ # Verify AGENTS.md cleaned
76
+ grep -qi "temp.test" .aw/.aw_rules/platform/universal/AGENTS.md 2>/dev/null && echo "FAIL: still in AGENTS.md"
77
+
78
+ # Verify evals cleaned
79
+ find .aw/.aw_rules/platform/universal/ -path "*/evals/*temp-test*" 2>/dev/null | grep -q . && echo "FAIL: eval files remain"
80
+ ```
81
+
82
+ ## Baseline Expectations
83
+
84
+ - Without ADK: Reference file deleted manually, manifest and AGENTS.md likely not cleaned — rule remains enforced as a ghost.
85
+ - With ADK: Full 3-location cleanup, no ghost rules.
86
+ - **Expected delta:** 3/3 cleanup vs. 1/3 without ADK
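
The checks in this eval use the `cmd && echo "FAIL"` form, which prints a message but does not give the caller a single pass/fail exit status. A possible wrapper that repeats the same four checks and aggregates the result is sketched below. `check_rule_cleanup` is a hypothetical name, and the sketch assumes one search pattern for all locations, whereas the original uses `temp.test` for AGENTS.md.

```bash
#!/usr/bin/env bash
# Hypothetical runner for illustration only; not part of aw-ecc.

# Usage: check_rule_cleanup <rules_root> <domain_path> <pattern>
# Prints one FAIL line per location that still references the rule and
# returns non-zero if any check fails.
check_rule_cleanup() {
  local rules_root="$1" domain="$2" pattern="$3"
  local failures=0

  # Reference file for the rule
  if find "$rules_root/$domain/references" -name "*${pattern}*" 2>/dev/null | grep -q .; then
    echo "FAIL: reference still exists"; failures=$((failures + 1))
  fi
  # Manifest entry
  if grep -q -- "$pattern" "$rules_root/rule-manifest.json" 2>/dev/null; then
    echo "FAIL: still in manifest"; failures=$((failures + 1))
  fi
  # AGENTS.md bullet (case-insensitive)
  if grep -qi -- "$pattern" "$rules_root/$domain/AGENTS.md" 2>/dev/null; then
    echo "FAIL: still in AGENTS.md"; failures=$((failures + 1))
  fi
  # Colocated evals
  if find "$rules_root/$domain" -path "*/evals/*${pattern}*" 2>/dev/null | grep -q .; then
    echo "FAIL: eval files remain"; failures=$((failures + 1))
  fi

  [ "$failures" -eq 0 ]
}
```

For this eval the call would be `check_rule_cleanup .aw/.aw_rules platform/universal temp-test`.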