aw-ecc 1.4.32 → 1.4.48
- package/.claude-plugin/plugin.json +1 -1
- package/.cursor/INSTALL.md +7 -5
- package/.cursor/hooks/adapter.js +41 -4
- package/.cursor/hooks/after-agent-response.js +62 -0
- package/.cursor/hooks/before-submit-prompt.js +7 -1
- package/.cursor/hooks/post-tool-use-failure.js +21 -0
- package/.cursor/hooks/post-tool-use.js +39 -0
- package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
- package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
- package/.cursor/hooks/subagent-start.js +22 -4
- package/.cursor/hooks/subagent-stop.js +18 -1
- package/.cursor/hooks.json +23 -2
- package/.opencode/package.json +1 -1
- package/AGENTS.md +3 -3
- package/README.md +5 -5
- package/commands/adk.md +52 -0
- package/commands/build.md +22 -9
- package/commands/deploy.md +12 -0
- package/commands/execute.md +9 -0
- package/commands/feature.md +333 -0
- package/commands/investigate.md +18 -5
- package/commands/plan.md +23 -9
- package/commands/publish.md +65 -0
- package/commands/review.md +12 -0
- package/commands/ship.md +12 -0
- package/commands/test.md +12 -0
- package/commands/verify.md +9 -0
- package/hooks/hooks.json +36 -0
- package/manifests/install-components.json +8 -0
- package/manifests/install-modules.json +83 -0
- package/manifests/install-profiles.json +7 -0
- package/package.json +2 -2
- package/scripts/ci/validate-rules.js +51 -0
- package/scripts/cursor-aw-home/hooks.json +23 -2
- package/scripts/cursor-aw-hooks/adapter.js +41 -4
- package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
- package/scripts/hooks/aw-usage-commit-created.js +32 -0
- package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
- package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
- package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
- package/scripts/hooks/aw-usage-session-start.js +48 -0
- package/scripts/hooks/aw-usage-stop.js +182 -0
- package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
- package/scripts/hooks/cost-tracker.js +3 -23
- package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
- package/scripts/hooks/shared/aw-phase-runner.js +3 -1
- package/scripts/lib/aw-hook-contract.js +2 -2
- package/scripts/lib/aw-pricing.js +306 -0
- package/scripts/lib/aw-usage-telemetry.js +472 -0
- package/scripts/lib/codex-hook-config.js +8 -8
- package/scripts/lib/cursor-hook-config.js +25 -10
- package/scripts/lib/install-targets/cursor-project.js +3 -0
- package/scripts/lib/install-targets/helpers.js +20 -3
- package/skills/aw-adk/SKILL.md +317 -0
- package/skills/aw-adk/agents/analyzer.md +113 -0
- package/skills/aw-adk/agents/comparator.md +113 -0
- package/skills/aw-adk/agents/grader.md +115 -0
- package/skills/aw-adk/assets/eval_review.html +76 -0
- package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
- package/skills/aw-adk/eval-viewer/viewer.html +181 -0
- package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
- package/skills/aw-adk/evals/eval-create-agent.md +90 -0
- package/skills/aw-adk/evals/eval-create-command.md +98 -0
- package/skills/aw-adk/evals/eval-create-eval.md +89 -0
- package/skills/aw-adk/evals/eval-create-rule.md +99 -0
- package/skills/aw-adk/evals/eval-create-skill.md +97 -0
- package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
- package/skills/aw-adk/evals/eval-delete-command.md +89 -0
- package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
- package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
- package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
- package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
- package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
- package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
- package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
- package/skills/aw-adk/evals/evals.json +96 -0
- package/skills/aw-adk/references/artifact-wiring.md +162 -0
- package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
- package/skills/aw-adk/references/eval-placement-guide.md +183 -0
- package/skills/aw-adk/references/external-resources.md +75 -0
- package/skills/aw-adk/references/getting-started.md +66 -0
- package/skills/aw-adk/references/registry-structure.md +152 -0
- package/skills/aw-adk/references/rubric-agent.md +36 -0
- package/skills/aw-adk/references/rubric-command.md +36 -0
- package/skills/aw-adk/references/rubric-eval.md +36 -0
- package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
- package/skills/aw-adk/references/rubric-rule.md +36 -0
- package/skills/aw-adk/references/rubric-skill.md +36 -0
- package/skills/aw-adk/references/schemas.md +222 -0
- package/skills/aw-adk/references/template-agent.md +251 -0
- package/skills/aw-adk/references/template-command.md +279 -0
- package/skills/aw-adk/references/template-eval.md +176 -0
- package/skills/aw-adk/references/template-rule.md +119 -0
- package/skills/aw-adk/references/template-skill.md +123 -0
- package/skills/aw-adk/references/type-classifier.md +98 -0
- package/skills/aw-adk/references/writing-good-agents.md +227 -0
- package/skills/aw-adk/references/writing-good-commands.md +258 -0
- package/skills/aw-adk/references/writing-good-evals.md +271 -0
- package/skills/aw-adk/references/writing-good-rules.md +214 -0
- package/skills/aw-adk/references/writing-good-skills.md +159 -0
- package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
- package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
- package/skills/aw-adk/scripts/score-artifact.sh +179 -0
- package/skills/aw-adk/scripts/trigger-eval.py +192 -0
- package/skills/aw-build/SKILL.md +19 -2
- package/skills/aw-deploy/SKILL.md +65 -3
- package/skills/aw-design/SKILL.md +156 -0
- package/skills/aw-design/references/highrise-tokens.md +394 -0
- package/skills/aw-design/references/micro-interactions.md +76 -0
- package/skills/aw-design/references/prompt-template.md +160 -0
- package/skills/aw-design/references/quality-checklist.md +70 -0
- package/skills/aw-design/references/self-review.md +497 -0
- package/skills/aw-design/references/stitch-workflow.md +127 -0
- package/skills/aw-feature/SKILL.md +293 -0
- package/skills/aw-investigate/SKILL.md +17 -0
- package/skills/aw-plan/SKILL.md +34 -3
- package/skills/aw-publish/SKILL.md +300 -0
- package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
- package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
- package/skills/aw-publish/evals/eval-push-modes.md +67 -0
- package/skills/aw-publish/evals/eval-rules-push.md +60 -0
- package/skills/aw-publish/evals/evals.json +29 -0
- package/skills/aw-publish/references/push-modes.md +38 -0
- package/skills/aw-review/SKILL.md +88 -9
- package/skills/aw-rules-review/SKILL.md +124 -0
- package/skills/aw-rules-review/agents/openai.yaml +3 -0
- package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
- package/skills/aw-ship/SKILL.md +16 -0
- package/skills/aw-spec/SKILL.md +15 -0
- package/skills/aw-tasks/SKILL.md +15 -0
- package/skills/aw-test/SKILL.md +16 -0
- package/skills/aw-yolo/SKILL.md +4 -0
- package/skills/diagnose/SKILL.md +121 -0
- package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
- package/skills/finish-only-when-green/SKILL.md +265 -0
- package/skills/grill-me/SKILL.md +24 -0
- package/skills/grill-with-docs/SKILL.md +92 -0
- package/skills/grill-with-docs/adr-format.md +47 -0
- package/skills/grill-with-docs/context-format.md +67 -0
- package/skills/improve-codebase-architecture/SKILL.md +75 -0
- package/skills/improve-codebase-architecture/deepening.md +37 -0
- package/skills/improve-codebase-architecture/interface-design.md +44 -0
- package/skills/improve-codebase-architecture/language.md +53 -0
- package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
- package/skills/tdd/SKILL.md +115 -0
- package/skills/tdd/deep-modules.md +33 -0
- package/skills/tdd/interface-design.md +31 -0
- package/skills/tdd/mocking.md +59 -0
- package/skills/tdd/refactoring.md +10 -0
- package/skills/tdd/tests.md +61 -0
- package/skills/to-issues/SKILL.md +62 -0
- package/skills/to-prd/SKILL.md +75 -0
- package/skills/using-aw-skills/SKILL.md +170 -237
- package/skills/using-aw-skills/hooks/session-start.sh +11 -41
- package/skills/zoom-out/SKILL.md +24 -0
- package/.codex/hooks/aw-post-tool-use.sh +0 -6
- package/.codex/hooks/aw-pre-tool-use.sh +0 -6
- package/.codex/hooks/aw-session-start.sh +0 -25
- package/.codex/hooks/aw-stop.sh +0 -6
- package/.codex/hooks/aw-user-prompt-submit.sh +0 -10
- package/.codex/hooks.json +0 -62
- package/.cursor/rules/common-agents.md +0 -53
- package/.cursor/rules/common-aw-routing.md +0 -43
- package/.cursor/rules/common-coding-style.md +0 -52
- package/.cursor/rules/common-development-workflow.md +0 -33
- package/.cursor/rules/common-git-workflow.md +0 -28
- package/.cursor/rules/common-hooks.md +0 -34
- package/.cursor/rules/common-patterns.md +0 -35
- package/.cursor/rules/common-performance.md +0 -59
- package/.cursor/rules/common-security.md +0 -33
- package/.cursor/rules/common-testing.md +0 -33
- package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
- package/.cursor/skills/article-writing/SKILL.md +0 -85
- package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
- package/.cursor/skills/aw-build/SKILL.md +0 -152
- package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
- package/.cursor/skills/aw-debug/SKILL.md +0 -49
- package/.cursor/skills/aw-deploy/SKILL.md +0 -101
- package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
- package/.cursor/skills/aw-execute/SKILL.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
- package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
- package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
- package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
- package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
- package/.cursor/skills/aw-finish/SKILL.md +0 -111
- package/.cursor/skills/aw-investigate/SKILL.md +0 -109
- package/.cursor/skills/aw-plan/SKILL.md +0 -368
- package/.cursor/skills/aw-prepare/SKILL.md +0 -118
- package/.cursor/skills/aw-review/SKILL.md +0 -118
- package/.cursor/skills/aw-ship/SKILL.md +0 -115
- package/.cursor/skills/aw-spec/SKILL.md +0 -104
- package/.cursor/skills/aw-tasks/SKILL.md +0 -138
- package/.cursor/skills/aw-test/SKILL.md +0 -118
- package/.cursor/skills/aw-verify/SKILL.md +0 -51
- package/.cursor/skills/aw-yolo/SKILL.md +0 -111
- package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
- package/.cursor/skills/bun-runtime/SKILL.md +0 -84
- package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
- package/.cursor/skills/code-simplification/SKILL.md +0 -74
- package/.cursor/skills/content-engine/SKILL.md +0 -88
- package/.cursor/skills/context-engineering/SKILL.md +0 -74
- package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
- package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
- package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
- package/.cursor/skills/frontend-slides/SKILL.md +0 -184
- package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
- package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
- package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
- package/.cursor/skills/idea-refine/SKILL.md +0 -84
- package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
- package/.cursor/skills/investor-materials/SKILL.md +0 -96
- package/.cursor/skills/investor-outreach/SKILL.md +0 -76
- package/.cursor/skills/market-research/SKILL.md +0 -75
- package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
- package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
- package/.cursor/skills/performance-optimization/SKILL.md +0 -77
- package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
- package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
- package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
- package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
- package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
- package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
- package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
- package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
- /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
- /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
- /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
- /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
- /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
- /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
- /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
- /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
- /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
- /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
- /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
- /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
- /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
- /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
- /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
- /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
- /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
- /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
- /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
- /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
- /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
- /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
- /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
- /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
- /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
- /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
- /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
- /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
- /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
- /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
@@ -0,0 +1,98 @@
+---
+name: eval-create-command
+target: skill/aw-adk
+category: functional
+difficulty: advanced
+---
+
+# Eval: Create Command — Multi-Phase with Human Checkpoint
+
+## Task
+
+Test that the ADK creates a command with proper phase structure, agent roster, and — critically — generates evals that cover the command's own structure (human checkpoints, parallel agents, mid-pipeline failures). This eval targets the gap where the ADK created commands but derived evals from generic categories instead of the artifact's structure.
+
+### Prompt
+
+```
+Create a command for database migration workflow in the platform/data namespace. It should have 4 phases: (1) pre-migration validation — check schema compatibility and generate migration plan, (2) backup — snapshot current state, (3) migrate — apply migration scripts with progress tracking, (4) post-migration verification — validate data integrity and rollback if checks fail. Phase 3 must have a human approval checkpoint before executing destructive changes. Create new agents for each phase within platform/data.
+```
+
+## Context
+
+| Field | Value |
+|-------|-------|
+| **Namespace** | `platform/data` |
+| **Domain** | `data` |
+| **Target artifact** | `skills/aw-adk/SKILL.md` |
+| **Target type** | `command` |
+
+## Expected Outcomes
+
+- [ ] **Type classified correctly** — identified as `command`
+- [ ] **Interview conducted** — asked about workflow phases, agents, human checkpoints, namespace
+- [ ] **Path resolved** — target at `.aw/.aw_registry/platform/data/commands/database-migration.md`
+- [ ] **Command has AW-PROTOCOL reference** and skill loading gate
+- [ ] **Agent roster table present** — with phase, agent name, model columns
+- [ ] **Phase structure** — numbered phases with input/output/checkpoint/on-failure
+- [ ] **Human checkpoint** — at least one phase blocks for human approval (migration is destructive)
+- [ ] **CHECKPOINT output shown**
+- [ ] **Lint ran and passed** — no phantom_agent errors
+- [ ] **Scoring performed** — rubric-command.md read, 10-dimension score table
+- [ ] **2+ evals created** — colocated at `commands/evals/<slug>/eval-*.md`
+- [ ] **Evals derived from structure** — at least one eval covers the human checkpoint (approve AND reject paths)
+- [ ] **Dependency chain eval present** — at least one eval validates all agents in roster exist
+- [ ] **`aw link` ran**
+
+## Grading Criteria
+
+### PASS (all conditions met)
+
+- All 14 outcomes checked
+- Evals exercise the command's own phases, not just generic happy-path/failure
+
+### PARTIAL (9+ of 14)
+
+- Command created with correct structure
+- But evals are generic (no checkpoint-specific or dependency-chain evals)
+
+### FAIL (below 9)
+
+- No phase structure
+- No human checkpoint for a destructive workflow
+- Steps 5-14 skipped
+- Evals missing entirely
+
+## Evaluation Method
+
+**Type:** hybrid
+
+### Deterministic Checks
+
+```bash
+# Verify command file exists
+test -f ".aw/.aw_registry/platform/data/commands/database-migration.md" || echo "FAIL: file not found"
+
+# Check for phase structure
+grep -q "## Phase" ".aw/.aw_registry/platform/data/commands/database-migration.md" || echo "FAIL: no phases"
+
+# Check for agent roster
+grep -q "Agent Roster" ".aw/.aw_registry/platform/data/commands/database-migration.md" || echo "FAIL: no agent roster"
+
+# Run lint
+bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh ".aw/.aw_registry/platform/data/commands/database-migration.md" command
+
+# Verify evals exist
+ls .aw/.aw_registry/platform/data/commands/evals/database-migration/eval-*.md 2>/dev/null | wc -l | grep -q "[2-9]" || echo "FAIL: fewer than 2 evals"
+```
+
+### Model-Based Checks
+
+- Does at least one eval test the human checkpoint with both approve and reject paths?
+- Is the phase structure appropriate for a migration (pre-check, backup, migrate, validate, rollback)?
+- Did the executor output a CHECKPOINT step?
+
+## Baseline Expectations
+
+- Without ADK: Command created but evals are generic (happy-path only), no checkpoint-specific evals.
+- With ADK: Structure-derived evals covering human gates, dependency chains, and mid-pipeline failures.
+- **Expected delta:** +2 structure-specific evals vs. generic-only
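Reviewer note on the "Verify evals exist" check in the eval-create-command file: `wc -l | grep -q "[2-9]"` matches digits rather than comparing numbers, so a count of 10 or 11 would be reported as "fewer than 2 evals". A minimal sketch of a numeric comparison is shown here. The helper names `count_evals` and `has_min_evals` are hypothetical and are not part of the package.

```shell
# Hypothetical helper: count files named eval-*.md directly inside a directory.
count_evals() {
  local dir="$1" n=0 f
  for f in "$dir"/eval-*.md; do
    # When nothing matches, the glob stays literal, so test for existence.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Hypothetical helper: succeed when the directory holds at least $2 eval files.
has_min_evals() {
  [ "$(count_evals "$1")" -ge "$2" ]
}
```

Under these assumptions the check could read `has_min_evals .aw/.aw_registry/platform/data/commands/evals/database-migration 2 || echo "FAIL: fewer than 2 evals"`, which behaves the same for 2 files as for 12.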
@@ -0,0 +1,89 @@
+---
+name: eval-create-eval
+target: skill/aw-adk
+category: functional
+difficulty: intermediate
+---
+
+# Eval: Create Eval — Standalone Eval for Existing Artifact
+
+## Task
+
+Test that the ADK can create evals for an existing artifact (not as part of a create flow, but standalone). This tests the eval-specific interview, correct colocated placement, and eval quality (not always-pass, has failure scenarios, discriminating assertions).
+
+### Prompt
+
+```
+Create evals for the existing integrity-verifier agent in the revex/reselling namespace (it's at .aw/.aw_registry/revex/reselling/backend/agents/integrity-verifier.md). Create at least 2 evals — one happy path testing successful data integrity verification, and one failure scenario where the agent encounters corrupted or mismatched records. Use hybrid grading (deterministic for structure, model-based for content quality).
+```
+
+## Context
+
+| Field | Value |
+|-------|-------|
+| **Namespace** | `revex/reselling` |
+| **Domain** | `backend` |
+| **Target artifact** | `skills/aw-adk/SKILL.md` |
+| **Target type** | `eval` |
+
+## Expected Outcomes
+
+- [ ] **Type classified correctly** — identified as `eval`
+- [ ] **Interview conducted** — asked about: which parent artifact, what scenarios, what grader type
+- [ ] **Parent artifact located** — the ADK reads the existing agent to understand what to test
+- [ ] **2+ eval files created** — at `agents/evals/payments-processor/eval-*.md`
+- [ ] **Colocated placement** — evals are in the agent's `evals/` directory, not a centralized location
+- [ ] **Happy path covered** — at least one eval tests the agent working correctly
+- [ ] **Failure scenario covered** — at least one eval tests error handling or edge cases
+- [ ] **Eval frontmatter correct** — each eval has `target:`, `type: eval`, `purpose:`
+- [ ] **Assertions are discriminating** — at least one negative assertion ("does NOT contain/skip X")
+- [ ] **Grading criteria clear** — PASS/PARTIAL/FAIL with specific thresholds
+- [ ] **CHECKPOINT output shown**
+- [ ] **Lint ran** on the eval files
+
+## Grading Criteria
+
+### PASS (all conditions met)
+
+- All 12 outcomes checked
+- Evals are specific to the payments-processor agent (not generic template output)
+- At least one negative assertion present
+
+### PARTIAL (8+ of 12)
+
+- Evals created but generic (not tailored to the agent's domain)
+- OR placed in wrong directory
+
+### FAIL (below 8)
+
+- No evals created
+- Evals placed in centralized location instead of colocated
+- All assertions are always-pass (no discriminating checks)
+
+## Evaluation Method
+
+**Type:** hybrid
+
+### Deterministic Checks
+
+```bash
+# Verify evals exist at correct colocated path
+ls .aw/.aw_registry/revex/reselling/*/agents/evals/payments-processor/eval-*.md 2>/dev/null | wc -l | grep -q "[2-9]" || echo "FAIL: fewer than 2 evals"
+
+# Verify frontmatter
+for f in .aw/.aw_registry/revex/reselling/*/agents/evals/payments-processor/eval-*.md; do
+  grep -q "^target:" "$f" || echo "FAIL: $f missing target"
+done
+```
+
+### Model-Based Checks
+
+- Are eval scenarios specific to payments processing (not generic)?
+- Do assertions discriminate — would a clearly wrong output fail them?
+- Did the executor read the parent agent before writing evals?
+
+## Baseline Expectations
+
+- Without ADK: Generic eval stubs with always-pass assertions, possibly in wrong directory.
+- With ADK: Domain-specific evals with discriminating assertions, correctly colocated.
+- **Expected delta:** +30% assertion discrimination rate
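Reviewer note on the "Verify frontmatter" check in the eval-create-eval file: the expected outcomes ask for `target:`, `type: eval`, and `purpose:` in each eval, while the deterministic check only greps for `target:` anywhere in the file. A rough sketch of a stricter check, limited to the leading `---` block, could look like the following. The function names `frontmatter` and `missing_keys` are hypothetical, and the sketch assumes the frontmatter starts on the first line of the file.

```shell
# Hypothetical helper: print the lines between the opening '---' on line 1
# and the next '---'. Prints nothing when the file has no leading frontmatter.
frontmatter() {
  awk 'NR == 1 && $0 == "---" { in_fm = 1; next }
       in_fm && $0 == "---"   { exit }
       in_fm                  { print }' "$1"
}

# Hypothetical helper: print each requested key that is absent from the
# frontmatter of file $1, one per line. Empty output means all keys exist.
missing_keys() {
  local file="$1" key
  shift
  for key in "$@"; do
    frontmatter "$file" | grep -q "^${key}:" || echo "$key"
  done
}
```

A loop over the eval files could then report `missing_keys "$f" target type purpose` and treat any non-empty output as a failure. Checking that `type` is exactly `eval` would need one more comparison, which this sketch leaves out.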
@@ -0,0 +1,99 @@
+---
+name: eval-create-rule
+target: skill/aw-adk
+category: functional
+difficulty: intermediate
+---
+
+# Eval: Create Rule — Full Flow Including AGENTS.md Update
+
+## Task
+
+Test that the ADK follows the complete rule creation flow — including the three registry updates that rules uniquely require: reference file, rule-manifest.json entry, AND AGENTS.md bullet point. Also tests that rules are not treated as "simpler" than other types — they must go through lint, scoring, and eval creation like any other CASRE type.
+
+### Prompt
+
+```
+Create a rule called no-unbounded-cache-ttl for the data domain. It prevents Redis/Memorystore cache keys without expiry — every SET must include EX or PX. Severity: MUST. WRONG: redis.set("user:123", data) with no TTL. RIGHT: redis.set("user:123", data, "EX", 3600). File patterns: *.service.ts, *.repository.ts, *.cache.ts. Exception: distributed locks using Redlock which manage their own TTL internally.
+```
+
+## Context
+
+| Field | Value |
+|-------|-------|
+| **Namespace** | `platform` (rules are always platform-scoped) |
+| **Domain** | `data` |
+| **Target artifact** | `skills/aw-adk/SKILL.md` |
+| **Target type** | `rule` |
+
+## Expected Outcomes
+
+- [ ] **Type classified correctly** — identified as `rule`
+- [ ] **Interview conducted** — asked about: what it prevents, domain, severity, WRONG/RIGHT examples, file patterns, exceptions (6 questions per ADK)
+- [ ] **Reference file created** — at `.aw/.aw_rules/platform/data/references/no-unbounded-redis-cache.md` (or similar slug)
+- [ ] **Reference has WRONG/RIGHT examples** — concrete, copy-pasteable code (not pseudocode)
+- [ ] **Reference has severity and paths frontmatter** — `severity: MUST`, `paths:` with relevant globs
+- [ ] **CHECKPOINT output shown** — remaining steps printed before continuing
+- [ ] **Lint ran** — `lint-artifact.sh` executed on the rule file
+- [ ] **Scoring performed** — rubric-rule.md read, 10-dimension score table output
+- [ ] **2+ evals created** — for the rule itself
+- [ ] **rule-manifest.json updated** — new entry with id, severity, domains, rule path, description, principle
+- [ ] **AGENTS.md bullet added** — `.aw/.aw_rules/platform/data/AGENTS.md` has a new bullet in the Always/Never section AND a reference link
+- [ ] **`aw link` ran** (or acknowledged that rules don't need `aw link` — they're live immediately via hook)
+
+## Grading Criteria
+
+### PASS (all conditions met)
+
+- All 12 outcomes checked
+- Rule went through full lint/score/eval flow (not treated as "just a doc")
+- All three registry updates performed (reference + manifest + AGENTS.md)
+
+### PARTIAL (8+ of 12)
+
+- Rule created with correct structure
+- But some flow steps skipped (no lint, no score, or no evals)
+- OR manifest updated but AGENTS.md bullet missing
+
+### FAIL (below 8)
+
+- Skipped directly from scaffold to "done" (steps 5-14 dropped)
+- No AGENTS.md update (rule would never be enforced at runtime)
+- No WRONG/RIGHT examples in the rule
+
+## Evaluation Method
+
+**Type:** hybrid
+
+### Deterministic Checks
+
+```bash
+# Verify reference file exists
+find .aw/.aw_rules/platform/data/references/ -name "*redis*" -o -name "*cache*" | head -1 | xargs test -f || echo "FAIL: rule reference not found"
+
+# Verify WRONG/RIGHT examples
+grep -qi "WRONG\|Never" "<rule-path>" || echo "FAIL: no WRONG examples"
+grep -qi "RIGHT\|Always" "<rule-path>" || echo "FAIL: no RIGHT examples"
+
+# Verify manifest entry
+grep -q "unbounded" .aw/.aw_rules/rule-manifest.json || echo "FAIL: not in manifest"
+
+# Verify AGENTS.md bullet
+grep -qi "redis\|cache\|unbounded" .aw/.aw_rules/platform/data/AGENTS.md || echo "FAIL: not in AGENTS.md"
+
+# Run lint
+bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh "<rule-path>" rule
+```
+
+### Model-Based Checks
+
+- Did the executor show a CHECKPOINT step (not skip straight to writing)?
+- Are WRONG/RIGHT examples concrete Redis code (not generic placeholders)?
+- Does the score table show 10 dimensions?
+- Did the executor create evals for the rule?
+
+## Baseline Expectations
+
+- Without ADK: Rule reference created, maybe manifest updated, but AGENTS.md bullet missing (rule never enforced). No lint, no score, no evals.
+- With ADK: Full three-update flow, lint-validated, scored, with colocated evals.
+- **Expected delta:** 3/3 registry updates vs. 1-2/3 without ADK
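Reviewer note on the "Verify reference file exists" check in the eval-create-rule file: as far as I can tell, the `find ... | head -1 | xargs test -f` pipeline cannot fail when nothing matches. GNU xargs still runs `test -f` once with no operand, and `test` with a single non-empty argument evaluates to true. BSD xargs skips the command and exits 0. A sketch that fails explicitly on an empty result is shown here. `find_rule_reference` is a hypothetical name, and the two name patterns are copied from the original check.

```shell
# Hypothetical helper: print the first regular file under directory $1 whose
# name contains "redis" or "cache". Returns non-zero when the directory is
# missing or when no file matches.
find_rule_reference() {
  local dir="$1" match
  [ -d "$dir" ] || return 1
  match="$(find "$dir" -type f \( -name '*redis*' -o -name '*cache*' \) | head -n 1)"
  [ -n "$match" ] || return 1
  echo "$match"
}
```

Under these assumptions, `rule_path="$(find_rule_reference .aw/.aw_rules/platform/data/references)" || echo "FAIL: rule reference not found"` would both surface the failure and capture a path that could stand in for the `<rule-path>` placeholder used by the later `grep` and lint checks.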
@@ -0,0 +1,97 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-create-skill
|
|
3
|
+
target: skill/aw-adk
|
|
4
|
+
category: functional
|
|
5
|
+
difficulty: intermediate
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Eval: Create Skill — Full Flow Compliance
|
|
9
|
+
|
|
10
|
+
## Task
|
|
11
|
+
|
|
12
|
+
Test that the ADK follows all 14 create flow steps when creating a skill. The prompt asks for a realistic skill in a known namespace. The eval checks that no steps are skipped — especially CHECKPOINT, LINT, SCORE, and EVAL GATE, which have historically been dropped for "simpler" artifact types.
|
|
13
|
+
|
|
14
|
+
### Prompt
|
|
15
|
+
|
|
16
|
+
```
|
|
17
|
+
Create a skill for MongoDB query patterns in the platform/data namespace. It should help developers write performant Mongoose queries — covering index-aware query construction, aggregation pipeline patterns, pagination with cursor-based approaches, and common anti-patterns like unbounded find(). Target audience is backend engineers using NestJS with Mongoose. No scripts or references needed beyond inline examples.
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
## Context
|
|
21
|
+
|
|
22
|
+
| Field | Value |
|
|
23
|
+
|-------|-------|
|
|
24
|
+
| **Namespace** | `platform/data` |
|
|
25
|
+
| **Domain** | `data` |
|
|
26
|
+
| **Target artifact** | `skills/aw-adk/SKILL.md` |
|
|
27
|
+
| **Target type** | `skill` |
|
|
28
|
+
|
|
29
|
+
## Expected Outcomes
|
|
30
|
+
|
|
31
|
+
The executor's output must satisfy ALL of the following:
|
|
32
|
+
|
|
33
|
+
- [ ] **Type classified correctly** — identified as `skill` (not agent, command, or rule)
|
|
34
|
+
- [ ] **Interview conducted** — asked at least 3 questions before scaffolding (when to use, what domain knowledge, namespace confirmation)
|
|
35
|
+
- [ ] **Path resolved correctly** — target path is `.aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md`
|
|
36
|
+
- [ ] **SKILL.md created** with frontmatter fields: `name`, `description`, `trigger`
|
|
37
|
+
- [ ] **Required sections present** — at minimum: "When to Use", a guide/instructions section, and "References"
|
|
38
|
+
- [ ] **CHECKPOINT output shown** — the executor printed remaining steps (LINT → SCORE → EVALS → REGISTRY → SYNC) before continuing
|
|
39
|
+
- [ ] **Lint ran** — `lint-artifact.sh` was executed on the created file
|
|
40
|
+
- [ ] **Scoring performed** — rubric-skill.md was read and a score table with 10 dimensions was output
|
|
41
|
+
- [ ] **Score is B-Tier (60+) minimum** — or the executor iterated to fix gaps
|
|
42
|
+
- [ ] **2+ evals created** — colocated at `skills/mongodb-query-patterns/evals/eval-*.md`
|
|
43
|
+
- [ ] **Evals cover happy + failure** — at least one eval tests a failure or edge case scenario
|
|
44
|
+
- [ ] **`aw link` ran** — sync step was not skipped
|
|
45
|
+
- [ ] **No phantom dependencies** — any referenced artifacts actually exist
|
|
46
|
+
|
|
47
|
+
## Grading Criteria
|
|
48
|
+
|
|
49
|
+
### PASS (all conditions met)
|
|
50
|
+
|
|
51
|
+
- All 13 expected outcomes checked
|
|
52
|
+
- Content is domain-specific (MongoDB, not generic placeholder)
|
|
53
|
+
- Full flow executed in order
|
|
54
|
+
|
|
55
|
+
### PARTIAL (8+ of 13)
|
|
56
|
+
|
|
57
|
+
- Artifact created with correct structure
|
|
58
|
+
- But some steps skipped (e.g., no checkpoint, no lint, or no evals)
|
|
59
|
+
|
|
60
|
+
### FAIL (below 8)
|
|
61
|
+
|
|
62
|
+
- Steps 5-14 skipped entirely (wrote artifact → jumped to "done")
|
|
63
|
+
- Wrong type classification
|
|
64
|
+
- Wrong filesystem path
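
The count thresholds in the grading criteria above can be sketched as a small helper. This is only an illustration: `grade_outcomes` is a hypothetical name that is not part of the package, and it covers only the numeric part of the criteria (13 of 13 for PASS, 8 or more for PARTIAL). The qualitative PASS conditions, domain-specific content and in-order execution, still need a separate judgment.

```bash
# Hypothetical helper (not part of aw-ecc): maps a count of checked
# outcomes to a grade, using the thresholds stated in this eval.
grade_outcomes() {
  local checked="$1" total="${2:-13}" partial_min="${3:-8}"
  if [ "$checked" -ge "$total" ]; then
    echo "PASS"
  elif [ "$checked" -ge "$partial_min" ]; then
    echo "PARTIAL"
  else
    echo "FAIL"
  fi
}

grade_outcomes 13   # prints PASS
grade_outcomes 9    # prints PARTIAL
grade_outcomes 5    # prints FAIL
```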
|
|
65
|
+
|
|
66
|
+
## Evaluation Method
|
|
67
|
+
|
|
68
|
+
**Type:** hybrid
|
|
69
|
+
|
|
70
|
+
### Deterministic Checks
|
|
71
|
+
|
|
72
|
+
```bash
|
|
73
|
+
# Verify SKILL.md exists at correct path
|
|
74
|
+
test -f ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: file not found"
|
|
75
|
+
|
|
76
|
+
# Verify required frontmatter
|
|
77
|
+
grep -q "^name:" ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: missing name"
|
|
78
|
+
grep -q "^trigger:" ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" || echo "FAIL: missing trigger"
|
|
79
|
+
|
|
80
|
+
# Verify evals exist
|
|
81
|
+
[ "$(ls .aw/.aw_registry/platform/data/skills/mongodb-query-patterns/evals/eval-*.md 2>/dev/null | wc -l)" -ge 2 ] || echo "FAIL: fewer than 2 evals"
|
|
82
|
+
|
|
83
|
+
# Run lint
|
|
84
|
+
bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns/SKILL.md" skill
|
|
85
|
+
```
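
The checks above print `FAIL` lines but do not set an exit status, so a harness has to parse their output. A minimal sketch of an aggregating wrapper follows. The function names `count_evals` and `run_skill_checks` are hypothetical, and the `description` check is an assumption drawn from the frontmatter fields listed under Expected Outcomes rather than from the original script.

```bash
# Hypothetical helpers (not part of aw-ecc).

# Counts eval-*.md files directly under <skill_dir>/evals.
# Prints 0 when the directory is missing or empty.
count_evals() {
  local dir="$1/evals" n=0 f
  for f in "$dir"/eval-*.md; do
    [ -f "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# Runs the file, frontmatter, and eval-count checks for one skill directory.
# Returns 0 only when every check passes. A missing SKILL.md also
# fails each frontmatter check.
run_skill_checks() {
  local skill_dir="$1" failures=0 field
  local skill_md="$skill_dir/SKILL.md"
  [ -f "$skill_md" ] || { echo "FAIL: file not found"; failures=$((failures + 1)); }
  for field in name description trigger; do
    grep -q "^$field:" "$skill_md" 2>/dev/null \
      || { echo "FAIL: missing $field"; failures=$((failures + 1)); }
  done
  [ "$(count_evals "$skill_dir")" -ge 2 ] \
    || { echo "FAIL: fewer than 2 evals"; failures=$((failures + 1)); }
  [ "$failures" -eq 0 ]
}
```

A call such as `run_skill_checks ".aw/.aw_registry/platform/data/skills/mongodb-query-patterns"` would then yield a single exit status for the harness, in addition to the `FAIL` lines.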
|
|
86
|
+
|
|
87
|
+
### Model-Based Checks
|
|
88
|
+
|
|
89
|
+
- Did the executor output a CHECKPOINT before lint/score/eval steps?
|
|
90
|
+
- Is the content MongoDB-specific (not generic foo/bar)?
|
|
91
|
+
- Does the score table show 10 dimensions with justified scores?
|
|
92
|
+
|
|
93
|
+
## Baseline Expectations
|
|
94
|
+
|
|
95
|
+
- Without ADK: Model creates a markdown file but skips lint, scoring, evals, and registry updates. No structured flow.
|
|
96
|
+
- With ADK: All 14 steps followed. Structured artifact with colocated evals and correct placement.
|
|
97
|
+
- **Expected delta:** +50% step completion rate
|
|
@@ -0,0 +1,79 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-delete-agent
|
|
3
|
+
target: skill/aw-adk
|
|
4
|
+
category: functional
|
|
5
|
+
difficulty: intermediate
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Eval: Delete Agent — Full Cleanup Including Colocated Evals
|
|
9
|
+
|
|
10
|
+
## Task
|
|
11
|
+
|
|
12
|
+
Test that the ADK's delete mode removes the agent file, its colocated evals, and warns about any commands that reference it in their roster.
|
|
13
|
+
|
|
14
|
+
### Prompt
|
|
15
|
+
|
|
16
|
+
```
|
|
17
|
+
First, create a temporary agent called temp-cleanup-test in the platform/services namespace. It's a simple agent for testing deletion — tools: Read, Grep. Model: haiku. Skills: []. Description: "Temporary agent for delete flow testing."
|
|
18
|
+
|
|
19
|
+
After the agent and its evals are created, delete it using the ADK delete flow. Confirm when prompted.
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
## Context
|
|
23
|
+
|
|
24
|
+
| Field | Value |
|
|
25
|
+
|-------|-------|
|
|
26
|
+
| **Namespace** | `platform/services` |
|
|
27
|
+
| **Domain** | `services` |
|
|
28
|
+
| **Target artifact** | `skills/aw-adk/SKILL.md` |
|
|
29
|
+
| **Target type** | `agent` (create then delete) |
|
|
30
|
+
|
|
31
|
+
## Expected Outcomes
|
|
32
|
+
|
|
33
|
+
- [ ] **Agent created first** at `.aw/.aw_registry/platform/services/agents/temp-cleanup-test.md`
|
|
34
|
+
- [ ] **Evals created** in colocated directory
|
|
35
|
+
- [ ] **Delete flow initiated** — ADK switches to delete mode
|
|
36
|
+
- [ ] **Inventory shown** — lists the agent file + eval files that will be deleted
|
|
37
|
+
- [ ] **Dependency check** — scans for commands referencing this agent in their roster
|
|
38
|
+
- [ ] **User confirmation requested** — asks before deleting
|
|
39
|
+
- [ ] **Agent file deleted**
|
|
40
|
+
- [ ] **Colocated evals deleted** — entire `evals/temp-cleanup-test/` directory removed
|
|
41
|
+
- [ ] **`aw link` ran** after deletion
|
|
42
|
+
|
|
43
|
+
## Grading Criteria
|
|
44
|
+
|
|
45
|
+
### PASS
|
|
46
|
+
|
|
47
|
+
- All 9 outcomes met
|
|
48
|
+
- No orphaned files remain after deletion
|
|
49
|
+
|
|
50
|
+
### PARTIAL
|
|
51
|
+
|
|
52
|
+
- Agent deleted but evals left behind
|
|
53
|
+
- OR no confirmation requested before deletion
|
|
54
|
+
|
|
55
|
+
### FAIL
|
|
56
|
+
|
|
57
|
+
- Agent not deleted
|
|
58
|
+
- Agent deleted without showing inventory
|
|
59
|
+
- No `aw link` after deletion
|
|
60
|
+
|
|
61
|
+
## Evaluation Method
|
|
62
|
+
|
|
63
|
+
**Type:** deterministic
|
|
64
|
+
|
|
65
|
+
### Deterministic Checks
|
|
66
|
+
|
|
67
|
+
```bash
|
|
68
|
+
# After delete, verify agent is gone
|
|
69
|
+
test ! -f ".aw/.aw_registry/platform/services/agents/temp-cleanup-test.md" || echo "FAIL: agent still exists"
|
|
70
|
+
|
|
71
|
+
# Verify evals directory is gone
|
|
72
|
+
test ! -d ".aw/.aw_registry/platform/services/agents/evals/temp-cleanup-test" || echo "FAIL: eval directory still exists"
|
|
73
|
+
```
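
The two checks above test specific paths. The PASS condition "no orphaned files remain" could also be approximated with a broader scan, sketched below. `list_orphans` is a hypothetical helper, and treating any file or directory whose name contains the artifact name as a leftover is an assumption about how orphans would show up.

```bash
# Hypothetical helper (not part of aw-ecc): lists every file or directory
# under the registry root whose basename contains the artifact name.
# Empty output means no leftovers were found by this name-based scan.
list_orphans() {
  local name="$1" registry_root="${2:-.aw/.aw_registry}"
  find "$registry_root" -name "*$name*" 2>/dev/null
}
```

For this eval, `list_orphans temp-cleanup-test` printing nothing would be consistent with a clean delete. It does not detect references inside other files.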
|
|
74
|
+
|
|
75
|
+
## Baseline Expectations
|
|
76
|
+
|
|
77
|
+
- Without ADK: Manual file deletion, evals likely left orphaned.
|
|
78
|
+
- With ADK: Full inventory, dependency check, clean removal, sync.
|
|
79
|
+
- **Expected delta:** 0 orphaned files with ADK vs. likely orphans without
|
|
@@ -0,0 +1,89 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-delete-command
|
|
3
|
+
target: skill/aw-adk
|
|
4
|
+
category: functional
|
|
5
|
+
difficulty: intermediate
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Eval: Delete Command — Agent Roster Inventory + Shared Agent Handling
|
|
9
|
+
|
|
10
|
+
## Task
|
|
11
|
+
|
|
12
|
+
Test that deleting a command inventories its agent roster and asks the user whether each agent should also be deleted (they may be shared with other commands) or just left in place.
|
|
13
|
+
|
|
14
|
+
### Prompt
|
|
15
|
+
|
|
16
|
+
```
|
|
17
|
+
First, create a temporary command called temp-pipeline-test in the platform/data namespace. It has 2 phases: (1) validate — check input data format, (2) process — transform and store data. Create new agents for each phase: temp-pipeline-validator and temp-pipeline-processor. Both in platform/data, model: haiku, tools: Read, Bash.
|
|
18
|
+
|
|
19
|
+
After the command and agents are created, delete the command temp-pipeline-test using the ADK delete flow. When asked about the agents, say "delete both — they're not shared." Confirm deletion when prompted.
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
## Context
|
|
23
|
+
|
|
24
|
+
| Field | Value |
|
|
25
|
+
|-------|-------|
|
|
26
|
+
| **Namespace** | `platform/data` |
|
|
27
|
+
| **Domain** | `data` |
|
|
28
|
+
| **Target artifact** | `skills/aw-adk/SKILL.md` |
|
|
29
|
+
| **Target type** | `command` (create then delete) |
|
|
30
|
+
|
|
31
|
+
## Expected Outcomes
|
|
32
|
+
|
|
33
|
+
- [ ] **Command created** at `.aw/.aw_registry/platform/data/commands/temp-pipeline-test.md`
|
|
34
|
+
- [ ] **2 agents created** for the command's phases
|
|
35
|
+
- [ ] **Delete flow initiated** for the command
|
|
36
|
+
- [ ] **Inventory shown** — lists command file + colocated evals
|
|
37
|
+
- [ ] **Agent roster identified** — lists the 2 agents in the roster
|
|
38
|
+
- [ ] **User asked per agent** — "These agents are in the roster. Delete them too or leave them?"
|
|
39
|
+
- [ ] **Command file + evals deleted**
|
|
40
|
+
- [ ] **Both agents deleted** (per user instruction)
|
|
41
|
+
- [ ] **No phantom references remain** — no command referencing deleted agents, no agents referencing deleted command
|
|
42
|
+
- [ ] **`aw link` ran**
|
|
43
|
+
|
|
44
|
+
## Grading Criteria
|
|
45
|
+
|
|
46
|
+
### PASS
|
|
47
|
+
|
|
48
|
+
- All 10 outcomes met
|
|
49
|
+
- No orphaned files remain
|
|
50
|
+
|
|
51
|
+
### PARTIAL
|
|
52
|
+
|
|
53
|
+
- Command deleted but agents left without asking
|
|
54
|
+
- OR agents deleted without confirming with user
|
|
55
|
+
|
|
56
|
+
### FAIL
|
|
57
|
+
|
|
58
|
+
- Command not deleted
|
|
59
|
+
- Agents silently deleted without asking
|
|
60
|
+
- Agents left behind AND no mention of them in inventory
|
|
61
|
+
|
|
62
|
+
## Evaluation Method
|
|
63
|
+
|
|
64
|
+
**Type:** hybrid
|
|
65
|
+
|
|
66
|
+
### Deterministic Checks
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
# Command should be gone
|
|
70
|
+
test ! -f ".aw/.aw_registry/platform/data/commands/temp-pipeline-test.md" || echo "FAIL: command still exists"
|
|
71
|
+
|
|
72
|
+
# Command evals should be gone
|
|
73
|
+
test ! -d ".aw/.aw_registry/platform/data/commands/evals/temp-pipeline-test" || echo "FAIL: command evals remain"
|
|
74
|
+
|
|
75
|
+
# Agents should be gone (user said delete both)
|
|
76
|
+
test ! -f ".aw/.aw_registry/platform/data/agents/temp-pipeline-validator.md" || echo "FAIL: validator agent still exists"
|
|
77
|
+
test ! -f ".aw/.aw_registry/platform/data/agents/temp-pipeline-processor.md" || echo "FAIL: processor agent still exists"
|
|
78
|
+
```
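
The checks above cover the "delete both" path only. Deciding whether a roster agent is shared with another command could be sketched as follows. `is_shared_agent` is a hypothetical helper, and it assumes that a command references an agent by mentioning the agent's name somewhere in a `.md` file under a `commands/` directory; the actual roster format is not shown in this diff.

```bash
# Hypothetical helper (not part of aw-ecc). Succeeds (exit 0) when some
# command file other than the one being deleted mentions the agent name.
# Files under evals/ are skipped so the command's own evals do not count.
# Note: this is a plain substring match on the agent name.
is_shared_agent() {
  local agent="$1" deleting_cmd="$2" registry_root="${3:-.aw/.aw_registry}"
  find "$registry_root" -type f -path "*/commands/*" ! -path "*/evals/*" \
    -name "*.md" ! -name "$deleting_cmd.md" \
    -exec grep -lF -- "$agent" {} + 2>/dev/null | grep -q .
}
```

Under these assumptions, `is_shared_agent temp-pipeline-validator temp-pipeline-test` failing (non-zero exit) would suggest the agent is safe to offer for deletion, while the user still makes the final choice.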
|
|
79
|
+
|
|
80
|
+
### Model-Based Checks
|
|
81
|
+
|
|
82
|
+
- Did the ADK ask about each agent before deleting?
|
|
83
|
+
- Did it present the choice (delete vs. leave) rather than assuming?
|
|
84
|
+
|
|
85
|
+
## Baseline Expectations
|
|
86
|
+
|
|
87
|
+
- Without ADK: Command deleted, agents orphaned (still exist but nothing invokes them).
|
|
88
|
+
- With ADK: Full roster inventory, per-agent confirmation, clean removal.
|
|
89
|
+
- **Expected delta:** 0 orphaned agents with ADK vs. 2 without
|
|
@@ -0,0 +1,86 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-delete-rule
|
|
3
|
+
target: skill/aw-adk
|
|
4
|
+
category: functional
|
|
5
|
+
difficulty: intermediate
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Eval: Delete Rule — Registry Cleanup (Manifest + AGENTS.md)
|
|
9
|
+
|
|
10
|
+
## Task
|
|
11
|
+
|
|
12
|
+
Test that the ADK's delete mode for rules removes the reference file AND cleans up both the rule-manifest.json entry and the AGENTS.md bullet. Rules have the most complex cleanup because they touch 3 registry locations.
|
|
13
|
+
|
|
14
|
+
### Prompt
|
|
15
|
+
|
|
16
|
+
```
|
|
17
|
+
First, create a temporary rule called no-temp-test-pattern for the universal domain. It prevents using temporary test patterns in production code. Severity: SHOULD. WRONG: if (process.env.TEMP_TEST) { skipValidation(); }. RIGHT: remove temp test flags before merging. File patterns: *.ts, *.js. No exceptions.
|
|
18
|
+
|
|
19
|
+
After the rule is created (including manifest + AGENTS.md updates), delete it using the ADK delete flow. Confirm when prompted.
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
## Context
|
|
23
|
+
|
|
24
|
+
| Field | Value |
|
|
25
|
+
|-------|-------|
|
|
26
|
+
| **Namespace** | `platform` |
|
|
27
|
+
| **Domain** | `universal` |
|
|
28
|
+
| **Target artifact** | `skills/aw-adk/SKILL.md` |
|
|
29
|
+
| **Target type** | `rule` (create then delete) |
|
|
30
|
+
|
|
31
|
+
## Expected Outcomes
|
|
32
|
+
|
|
33
|
+
- [ ] **Rule created first** with reference file, manifest entry, AGENTS.md bullet
|
|
34
|
+
- [ ] **Delete flow initiated**
|
|
35
|
+
- [ ] **Inventory shown** — lists: reference file, manifest entry, AGENTS.md bullet, colocated evals
|
|
36
|
+
- [ ] **User confirmation requested**
|
|
37
|
+
- [ ] **Reference file deleted**
|
|
38
|
+
- [ ] **rule-manifest.json entry removed**
|
|
39
|
+
- [ ] **AGENTS.md bullet removed**
|
|
40
|
+
- [ ] **Colocated evals deleted**
|
|
41
|
+
- [ ] **`aw link` ran** after deletion
|
|
42
|
+
|
|
43
|
+
## Grading Criteria
|
|
44
|
+
|
|
45
|
+
### PASS
|
|
46
|
+
|
|
47
|
+
- All 9 outcomes met
|
|
48
|
+
- rule-manifest.json has no trace of the deleted rule
|
|
49
|
+
- AGENTS.md has no trace of the deleted rule
|
|
50
|
+
|
|
51
|
+
### PARTIAL
|
|
52
|
+
|
|
53
|
+
- Reference file deleted but manifest or AGENTS.md not cleaned up
|
|
54
|
+
- OR evals left behind
|
|
55
|
+
|
|
56
|
+
### FAIL
|
|
57
|
+
|
|
58
|
+
- Rule not deleted
|
|
59
|
+
- Manifest entry left (rule would still appear in enforcement system)
|
|
60
|
+
- AGENTS.md bullet left (rule would still be enforced at runtime)
|
|
61
|
+
|
|
62
|
+
## Evaluation Method
|
|
63
|
+
|
|
64
|
+
**Type:** deterministic
|
|
65
|
+
|
|
66
|
+
### Deterministic Checks
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
# After delete, verify reference file is gone
|
|
70
|
+
find .aw/.aw_rules/platform/universal/references/ -name "*temp-test*" 2>/dev/null | grep -q . && echo "FAIL: reference still exists"
|
|
71
|
+
|
|
72
|
+
# Verify manifest cleaned
|
|
73
|
+
grep -q "temp-test" .aw/.aw_rules/rule-manifest.json 2>/dev/null && echo "FAIL: still in manifest"
|
|
74
|
+
|
|
75
|
+
# Verify AGENTS.md cleaned
|
|
76
|
+
grep -qi "temp.test" .aw/.aw_rules/platform/universal/AGENTS.md 2>/dev/null && echo "FAIL: still in AGENTS.md"
|
|
77
|
+
|
|
78
|
+
# Verify evals cleaned
|
|
79
|
+
find .aw/.aw_rules/platform/universal/ -path "*/evals/*temp-test*" 2>/dev/null | grep -q . && echo "FAIL: eval files remain"
|
|
80
|
+
```
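
The "3/3 cleanup" figure used in this eval can be computed directly from the three locations checked above. The sketch below is an illustration only: `count_clean_locations` is a hypothetical helper, it uses a case-insensitive fixed-string match for both the manifest and AGENTS.md (the original checks use slightly different patterns), and, like the original checks, it treats a missing file as clean.

```bash
# Hypothetical helper (not part of aw-ecc): reports how many of the three
# rule locations no longer mention the pattern, as "<clean>/3".
# Paths mirror the ones used in the deterministic checks above.
count_clean_locations() {
  local pattern="$1" rules_root="${2:-.aw/.aw_rules}" clean=0
  local domain_dir="$rules_root/platform/universal"
  # 1. Reference file: clean when no matching file name remains.
  [ -z "$(find "$domain_dir/references" -name "*$pattern*" 2>/dev/null)" ] \
    && clean=$((clean + 1))
  # 2. Manifest entry: clean when the pattern is absent (or the file is missing).
  grep -qiF -- "$pattern" "$rules_root/rule-manifest.json" 2>/dev/null \
    || clean=$((clean + 1))
  # 3. AGENTS.md bullet: clean when the pattern is absent (or the file is missing).
  grep -qiF -- "$pattern" "$domain_dir/AGENTS.md" 2>/dev/null \
    || clean=$((clean + 1))
  echo "$clean/3"
}
```

Running `count_clean_locations temp-test` after the delete flow would be expected to print `3/3` if all three locations were cleaned.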
|
|
81
|
+
|
|
82
|
+
## Baseline Expectations
|
|
83
|
+
|
|
84
|
+
- Without ADK: Reference file deleted manually, manifest and AGENTS.md likely not cleaned — rule remains enforced as a ghost.
|
|
85
|
+
- With ADK: Full 3-location cleanup, no ghost rules.
|
|
86
|
+
- **Expected delta:** 3/3 cleanup vs. 1/3 without ADK
|