aw-ecc 1.4.32 → 1.4.48
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/.cursor/INSTALL.md +7 -5
- package/.cursor/hooks/adapter.js +41 -4
- package/.cursor/hooks/after-agent-response.js +62 -0
- package/.cursor/hooks/before-submit-prompt.js +7 -1
- package/.cursor/hooks/post-tool-use-failure.js +21 -0
- package/.cursor/hooks/post-tool-use.js +39 -0
- package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
- package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
- package/.cursor/hooks/subagent-start.js +22 -4
- package/.cursor/hooks/subagent-stop.js +18 -1
- package/.cursor/hooks.json +23 -2
- package/.opencode/package.json +1 -1
- package/AGENTS.md +3 -3
- package/README.md +5 -5
- package/commands/adk.md +52 -0
- package/commands/build.md +22 -9
- package/commands/deploy.md +12 -0
- package/commands/execute.md +9 -0
- package/commands/feature.md +333 -0
- package/commands/investigate.md +18 -5
- package/commands/plan.md +23 -9
- package/commands/publish.md +65 -0
- package/commands/review.md +12 -0
- package/commands/ship.md +12 -0
- package/commands/test.md +12 -0
- package/commands/verify.md +9 -0
- package/hooks/hooks.json +36 -0
- package/manifests/install-components.json +8 -0
- package/manifests/install-modules.json +83 -0
- package/manifests/install-profiles.json +7 -0
- package/package.json +2 -2
- package/scripts/ci/validate-rules.js +51 -0
- package/scripts/cursor-aw-home/hooks.json +23 -2
- package/scripts/cursor-aw-hooks/adapter.js +41 -4
- package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
- package/scripts/hooks/aw-usage-commit-created.js +32 -0
- package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
- package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
- package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
- package/scripts/hooks/aw-usage-session-start.js +48 -0
- package/scripts/hooks/aw-usage-stop.js +182 -0
- package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
- package/scripts/hooks/cost-tracker.js +3 -23
- package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
- package/scripts/hooks/shared/aw-phase-runner.js +3 -1
- package/scripts/lib/aw-hook-contract.js +2 -2
- package/scripts/lib/aw-pricing.js +306 -0
- package/scripts/lib/aw-usage-telemetry.js +472 -0
- package/scripts/lib/codex-hook-config.js +8 -8
- package/scripts/lib/cursor-hook-config.js +25 -10
- package/scripts/lib/install-targets/cursor-project.js +3 -0
- package/scripts/lib/install-targets/helpers.js +20 -3
- package/skills/aw-adk/SKILL.md +317 -0
- package/skills/aw-adk/agents/analyzer.md +113 -0
- package/skills/aw-adk/agents/comparator.md +113 -0
- package/skills/aw-adk/agents/grader.md +115 -0
- package/skills/aw-adk/assets/eval_review.html +76 -0
- package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
- package/skills/aw-adk/eval-viewer/viewer.html +181 -0
- package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
- package/skills/aw-adk/evals/eval-create-agent.md +90 -0
- package/skills/aw-adk/evals/eval-create-command.md +98 -0
- package/skills/aw-adk/evals/eval-create-eval.md +89 -0
- package/skills/aw-adk/evals/eval-create-rule.md +99 -0
- package/skills/aw-adk/evals/eval-create-skill.md +97 -0
- package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
- package/skills/aw-adk/evals/eval-delete-command.md +89 -0
- package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
- package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
- package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
- package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
- package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
- package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
- package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
- package/skills/aw-adk/evals/evals.json +96 -0
- package/skills/aw-adk/references/artifact-wiring.md +162 -0
- package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
- package/skills/aw-adk/references/eval-placement-guide.md +183 -0
- package/skills/aw-adk/references/external-resources.md +75 -0
- package/skills/aw-adk/references/getting-started.md +66 -0
- package/skills/aw-adk/references/registry-structure.md +152 -0
- package/skills/aw-adk/references/rubric-agent.md +36 -0
- package/skills/aw-adk/references/rubric-command.md +36 -0
- package/skills/aw-adk/references/rubric-eval.md +36 -0
- package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
- package/skills/aw-adk/references/rubric-rule.md +36 -0
- package/skills/aw-adk/references/rubric-skill.md +36 -0
- package/skills/aw-adk/references/schemas.md +222 -0
- package/skills/aw-adk/references/template-agent.md +251 -0
- package/skills/aw-adk/references/template-command.md +279 -0
- package/skills/aw-adk/references/template-eval.md +176 -0
- package/skills/aw-adk/references/template-rule.md +119 -0
- package/skills/aw-adk/references/template-skill.md +123 -0
- package/skills/aw-adk/references/type-classifier.md +98 -0
- package/skills/aw-adk/references/writing-good-agents.md +227 -0
- package/skills/aw-adk/references/writing-good-commands.md +258 -0
- package/skills/aw-adk/references/writing-good-evals.md +271 -0
- package/skills/aw-adk/references/writing-good-rules.md +214 -0
- package/skills/aw-adk/references/writing-good-skills.md +159 -0
- package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
- package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
- package/skills/aw-adk/scripts/score-artifact.sh +179 -0
- package/skills/aw-adk/scripts/trigger-eval.py +192 -0
- package/skills/aw-build/SKILL.md +19 -2
- package/skills/aw-deploy/SKILL.md +65 -3
- package/skills/aw-design/SKILL.md +156 -0
- package/skills/aw-design/references/highrise-tokens.md +394 -0
- package/skills/aw-design/references/micro-interactions.md +76 -0
- package/skills/aw-design/references/prompt-template.md +160 -0
- package/skills/aw-design/references/quality-checklist.md +70 -0
- package/skills/aw-design/references/self-review.md +497 -0
- package/skills/aw-design/references/stitch-workflow.md +127 -0
- package/skills/aw-feature/SKILL.md +293 -0
- package/skills/aw-investigate/SKILL.md +17 -0
- package/skills/aw-plan/SKILL.md +34 -3
- package/skills/aw-publish/SKILL.md +300 -0
- package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
- package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
- package/skills/aw-publish/evals/eval-push-modes.md +67 -0
- package/skills/aw-publish/evals/eval-rules-push.md +60 -0
- package/skills/aw-publish/evals/evals.json +29 -0
- package/skills/aw-publish/references/push-modes.md +38 -0
- package/skills/aw-review/SKILL.md +88 -9
- package/skills/aw-rules-review/SKILL.md +124 -0
- package/skills/aw-rules-review/agents/openai.yaml +3 -0
- package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
- package/skills/aw-ship/SKILL.md +16 -0
- package/skills/aw-spec/SKILL.md +15 -0
- package/skills/aw-tasks/SKILL.md +15 -0
- package/skills/aw-test/SKILL.md +16 -0
- package/skills/aw-yolo/SKILL.md +4 -0
- package/skills/diagnose/SKILL.md +121 -0
- package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
- package/skills/finish-only-when-green/SKILL.md +265 -0
- package/skills/grill-me/SKILL.md +24 -0
- package/skills/grill-with-docs/SKILL.md +92 -0
- package/skills/grill-with-docs/adr-format.md +47 -0
- package/skills/grill-with-docs/context-format.md +67 -0
- package/skills/improve-codebase-architecture/SKILL.md +75 -0
- package/skills/improve-codebase-architecture/deepening.md +37 -0
- package/skills/improve-codebase-architecture/interface-design.md +44 -0
- package/skills/improve-codebase-architecture/language.md +53 -0
- package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
- package/skills/tdd/SKILL.md +115 -0
- package/skills/tdd/deep-modules.md +33 -0
- package/skills/tdd/interface-design.md +31 -0
- package/skills/tdd/mocking.md +59 -0
- package/skills/tdd/refactoring.md +10 -0
- package/skills/tdd/tests.md +61 -0
- package/skills/to-issues/SKILL.md +62 -0
- package/skills/to-prd/SKILL.md +75 -0
- package/skills/using-aw-skills/SKILL.md +170 -237
- package/skills/using-aw-skills/hooks/session-start.sh +11 -41
- package/skills/zoom-out/SKILL.md +24 -0
- package/.codex/hooks/aw-post-tool-use.sh +0 -6
- package/.codex/hooks/aw-pre-tool-use.sh +0 -6
- package/.codex/hooks/aw-session-start.sh +0 -25
- package/.codex/hooks/aw-stop.sh +0 -6
- package/.codex/hooks/aw-user-prompt-submit.sh +0 -10
- package/.codex/hooks.json +0 -62
- package/.cursor/rules/common-agents.md +0 -53
- package/.cursor/rules/common-aw-routing.md +0 -43
- package/.cursor/rules/common-coding-style.md +0 -52
- package/.cursor/rules/common-development-workflow.md +0 -33
- package/.cursor/rules/common-git-workflow.md +0 -28
- package/.cursor/rules/common-hooks.md +0 -34
- package/.cursor/rules/common-patterns.md +0 -35
- package/.cursor/rules/common-performance.md +0 -59
- package/.cursor/rules/common-security.md +0 -33
- package/.cursor/rules/common-testing.md +0 -33
- package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
- package/.cursor/skills/article-writing/SKILL.md +0 -85
- package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
- package/.cursor/skills/aw-build/SKILL.md +0 -152
- package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
- package/.cursor/skills/aw-debug/SKILL.md +0 -49
- package/.cursor/skills/aw-deploy/SKILL.md +0 -101
- package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
- package/.cursor/skills/aw-execute/SKILL.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
- package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
- package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
- package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
- package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
- package/.cursor/skills/aw-finish/SKILL.md +0 -111
- package/.cursor/skills/aw-investigate/SKILL.md +0 -109
- package/.cursor/skills/aw-plan/SKILL.md +0 -368
- package/.cursor/skills/aw-prepare/SKILL.md +0 -118
- package/.cursor/skills/aw-review/SKILL.md +0 -118
- package/.cursor/skills/aw-ship/SKILL.md +0 -115
- package/.cursor/skills/aw-spec/SKILL.md +0 -104
- package/.cursor/skills/aw-tasks/SKILL.md +0 -138
- package/.cursor/skills/aw-test/SKILL.md +0 -118
- package/.cursor/skills/aw-verify/SKILL.md +0 -51
- package/.cursor/skills/aw-yolo/SKILL.md +0 -111
- package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
- package/.cursor/skills/bun-runtime/SKILL.md +0 -84
- package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
- package/.cursor/skills/code-simplification/SKILL.md +0 -74
- package/.cursor/skills/content-engine/SKILL.md +0 -88
- package/.cursor/skills/context-engineering/SKILL.md +0 -74
- package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
- package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
- package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
- package/.cursor/skills/frontend-slides/SKILL.md +0 -184
- package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
- package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
- package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
- package/.cursor/skills/idea-refine/SKILL.md +0 -84
- package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
- package/.cursor/skills/investor-materials/SKILL.md +0 -96
- package/.cursor/skills/investor-outreach/SKILL.md +0 -76
- package/.cursor/skills/market-research/SKILL.md +0 -75
- package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
- package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
- package/.cursor/skills/performance-optimization/SKILL.md +0 -77
- package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
- package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
- package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
- package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
- package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
- package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
- package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
- package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
- /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
- /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
- /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
- /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
- /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
- /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
- /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
- /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
- /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
- /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
- /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
- /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
- /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
- /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
- /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
- /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
- /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
- /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
- /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
- /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
- /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
- /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
- /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
- /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
- /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
- /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
- /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
- /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
- /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
- /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
@@ -0,0 +1,76 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>ADK Trigger Eval Review</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; background: #0d1117; color: #c9d1d9; padding: 20px; max-width: 900px; margin: 0 auto; }
+    h1 { margin-bottom: 8px; }
+    .subtitle { color: #8b949e; margin-bottom: 20px; }
+    .description { background: #161b22; border: 1px solid #30363d; border-radius: 8px; padding: 12px; margin-bottom: 20px; font-family: monospace; font-size: 13px; }
+    .eval-item { background: #161b22; border: 1px solid #30363d; border-radius: 8px; padding: 12px; margin-bottom: 8px; display: flex; align-items: center; gap: 12px; }
+    .eval-item textarea { flex: 1; background: #0d1117; border: 1px solid #30363d; border-radius: 4px; color: #c9d1d9; padding: 6px 8px; font-family: inherit; resize: vertical; min-height: 36px; }
+    .toggle { cursor: pointer; padding: 4px 10px; border-radius: 12px; font-size: 12px; font-weight: 600; border: none; }
+    .toggle.trigger { background: #1b4332; color: #2dd4bf; }
+    .toggle.no-trigger { background: #4c1d1d; color: #f87171; }
+    .remove-btn { cursor: pointer; color: #f87171; background: none; border: none; font-size: 18px; }
+    .add-btn, .export-btn { padding: 8px 16px; border-radius: 6px; border: none; cursor: pointer; font-size: 14px; margin-right: 8px; margin-top: 12px; }
+    .add-btn { background: #21262d; color: #c9d1d9; }
+    .export-btn { background: #238636; color: white; }
+    .add-btn:hover { background: #30363d; }
+    .export-btn:hover { background: #2ea043; }
+  </style>
+</head>
+<body>
+
+  <h1>Trigger Eval Review: <span id="skill-name">__SKILL_NAME_PLACEHOLDER__</span></h1>
+  <p class="subtitle">Review and edit trigger eval queries. Toggle should-trigger, add/remove entries, then export.</p>
+
+  <div class="description" id="current-description">__SKILL_DESCRIPTION_PLACEHOLDER__</div>
+
+  <div id="eval-list"></div>
+
+  <button class="add-btn" onclick="addItem(true)">+ Should Trigger</button>
+  <button class="add-btn" onclick="addItem(false)">+ Should NOT Trigger</button>
+  <button class="export-btn" onclick="exportEvalSet()">Export Eval Set</button>
+
+  <script>
+    let evalData = __EVAL_DATA_PLACEHOLDER__;
+
+    function render() {
+      const list = document.getElementById('eval-list');
+      list.innerHTML = '';
+      evalData.forEach((item, i) => {
+        const div = document.createElement('div');
+        div.className = 'eval-item';
+        div.innerHTML = `
+          <button class="toggle ${item.should_trigger ? 'trigger' : 'no-trigger'}" onclick="toggleTrigger(${i})">
+            ${item.should_trigger ? 'TRIGGER' : 'NO TRIGGER'}
+          </button>
+          <textarea oninput="updateQuery(${i}, this.value)">${item.query}</textarea>
+          <button class="remove-btn" onclick="removeItem(${i})">×</button>
+        `;
+        list.appendChild(div);
+      });
+    }
+
+    function toggleTrigger(i) { evalData[i].should_trigger = !evalData[i].should_trigger; render(); }
+    function updateQuery(i, val) { evalData[i].query = val; }
+    function removeItem(i) { evalData.splice(i, 1); render(); }
+    function addItem(shouldTrigger) { evalData.push({ query: '', should_trigger: shouldTrigger }); render(); document.querySelector('.eval-item:last-child textarea').focus(); }
+
+    function exportEvalSet() {
+      const filtered = evalData.filter(e => e.query.trim());
+      const blob = new Blob([JSON.stringify(filtered, null, 2)], { type: 'application/json' });
+      const a = document.createElement('a');
+      a.href = URL.createObjectURL(blob);
+      a.download = 'eval_set.json';
+      a.click();
+    }
+
+    render();
+  </script>
+</body>
+</html>
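The template in this hunk carries three placeholders (`__SKILL_NAME_PLACEHOLDER__`, `__SKILL_DESCRIPTION_PLACEHOLDER__`, `__EVAL_DATA_PLACEHOLDER__`), but the code that fills them is not shown here. The following is a minimal sketch, assuming a Python script performs the substitution. `fill_template`, `json_for_script`, and the shortened `TEMPLATE` string are hypothetical names used for illustration and are not taken from the package. The sketch escapes the two text values for HTML and serializes the eval list so that it cannot close the surrounding `<script>` element early.

```python
import html
import json
import re

# Hypothetical, shortened stand-in for the template; only the three
# placeholder names are taken from the diff.
TEMPLATE = """<h1>Trigger Eval Review: <span id="skill-name">__SKILL_NAME_PLACEHOLDER__</span></h1>
<div class="description" id="current-description">__SKILL_DESCRIPTION_PLACEHOLDER__</div>
<script>
let evalData = __EVAL_DATA_PLACEHOLDER__;
</script>"""

PLACEHOLDER_RE = re.compile(r"__(?:SKILL_NAME|SKILL_DESCRIPTION|EVAL_DATA)_PLACEHOLDER__")


def json_for_script(value) -> str:
    """Serialize to JSON that is safe to inline inside a <script> element."""
    # "</" could end the script element early, so emit it as "<\/", which
    # both JSON and JavaScript read back as "</".
    return json.dumps(value).replace("</", "<\\/")


def fill_template(template: str, skill_name: str, description: str, eval_data: list) -> str:
    """Substitute all three placeholders in a single pass (assumed helper)."""
    values = {
        "__SKILL_NAME_PLACEHOLDER__": html.escape(skill_name),
        "__SKILL_DESCRIPTION_PLACEHOLDER__": html.escape(description),
        "__EVAL_DATA_PLACEHOLDER__": json_for_script(eval_data),
    }
    # One pass means placeholder-like text inside a substituted value is
    # never substituted a second time.
    return PLACEHOLDER_RE.sub(lambda match: values[match.group(0)], template)


if __name__ == "__main__":
    sample = [{"query": "create a new agent", "should_trigger": True}]
    print(fill_template(TEMPLATE, "aw-adk", "Example description", sample))
```

Each entry uses the `query` and `should_trigger` keys that the template's own script reads and exports.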
@@ -0,0 +1,164 @@
+#!/usr/bin/env python3
+"""
+generate_review.py — Generates side-by-side review UI for ADK eval outputs
+
+Usage:
+    python skills/aw-adk/eval-viewer/generate_review.py <workspace>/iteration-N \\
+        --artifact-name "my-agent" \\
+        [--benchmark <workspace>/iteration-N/benchmark.json] \\
+        [--previous-workspace <workspace>/iteration-<N-1>] \\
+        [--static <output_path>]
+
+Opens an HTML review interface showing:
+  - Outputs tab: per-case review with feedback textbox
+  - Benchmark tab: quantitative stats comparison
+
+If --static is provided, writes standalone HTML instead of starting a server.
+
+Adapted from skill-creator's eval-viewer/generate_review.py for CASRE context.
+"""
+
+import argparse
+import http.server
+import json
+import os
+import sys
+import threading
+import webbrowser
+from pathlib import Path
+
+
+def load_json(path: str) -> dict:
+    try:
+        with open(path, "r") as f:
+            return json.load(f)
+    except (FileNotFoundError, json.JSONDecodeError):
+        return {}
+
+
+def collect_review_data(iteration_dir: str, previous_dir: str = None) -> list[dict]:
+    """Collect all eval outputs for review."""
+    reviews = []
+    iteration_path = Path(iteration_dir)
+
+    for eval_dir in sorted(iteration_path.iterdir()):
+        if not eval_dir.is_dir():
+            continue
+
+        metadata = load_json(str(eval_dir / "eval_metadata.json"))
+        eval_name = metadata.get("eval_name", eval_dir.name)
+        prompt = metadata.get("prompt", "")
+
+        for config_dir in sorted(eval_dir.iterdir()):
+            if not config_dir.is_dir():
+                continue
+
+            # Read outputs
+            outputs_dir = config_dir / "outputs"
+            output_files = {}
+            if outputs_dir.exists():
+                for f in sorted(outputs_dir.iterdir()):
+                    if f.is_file():
+                        try:
+                            output_files[f.name] = f.read_text()
+                        except UnicodeDecodeError:
+                            output_files[f.name] = f"[Binary file: {f.name}]"
+
+            # Read grading
+            grading = load_json(str(config_dir / "grading.json"))
+
+            # Read previous output if available
+            previous_output = None
+            if previous_dir:
+                prev_config = Path(previous_dir) / eval_dir.name / config_dir.name / "outputs"
+                if prev_config.exists():
+                    previous_output = {}
+                    for f in sorted(prev_config.iterdir()):
+                        if f.is_file():
+                            try:
+                                previous_output[f.name] = f.read_text()
+                            except UnicodeDecodeError:
+                                previous_output[f.name] = f"[Binary file: {f.name}]"
+
+            # Read previous feedback
+            previous_feedback = None
+            if previous_dir:
+                prev_feedback_path = Path(previous_dir) / "feedback.json"
+                prev_feedback = load_json(str(prev_feedback_path))
+                run_id = f"{eval_dir.name}-{config_dir.name}"
+                for review in prev_feedback.get("reviews", []):
+                    if review.get("run_id") == run_id:
+                        previous_feedback = review.get("feedback", "")
+
+            reviews.append({
+                "run_id": f"{eval_dir.name}-{config_dir.name}",
+                "eval_name": eval_name,
+                "configuration": config_dir.name,
+                "prompt": prompt,
+                "outputs": output_files,
+                "grading": grading,
+                "previous_output": previous_output,
+                "previous_feedback": previous_feedback,
+            })
+
+    return reviews
+
+
+def generate_html(reviews: list[dict], benchmark: dict = None, artifact_name: str = "artifact") -> str:
+    """Generate the review HTML page."""
+    template_path = Path(__file__).parent / "viewer.html"
+
+    if template_path.exists():
+        html = template_path.read_text()
+        html = html.replace("__REVIEW_DATA_PLACEHOLDER__", json.dumps(reviews))
+        html = html.replace("__BENCHMARK_DATA_PLACEHOLDER__", json.dumps(benchmark or {}))
+        html = html.replace("__ARTIFACT_NAME_PLACEHOLDER__", artifact_name)
+        return html
+
+    # Fallback: minimal HTML
+    return f"""<!DOCTYPE html>
+<html>
+<head><title>ADK Review: {artifact_name}</title></head>
+<body>
+<h1>ADK Eval Review: {artifact_name}</h1>
+<p>{len(reviews)} test cases loaded. See console for data.</p>
+<script>
+const reviewData = {json.dumps(reviews, indent=2)};
+const benchmarkData = {json.dumps(benchmark or {}, indent=2)};
+console.log('Review data:', reviewData);
+console.log('Benchmark data:', benchmarkData);
+</script>
+</body>
+</html>"""
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Generate ADK eval review UI")
+    parser.add_argument("iteration_dir", help="Path to iteration directory")
+    parser.add_argument("--artifact-name", default="artifact", help="Name of the artifact being reviewed")
+    parser.add_argument("--benchmark", help="Path to benchmark.json")
+    parser.add_argument("--previous-workspace", help="Path to previous iteration directory")
+    parser.add_argument("--static", help="Write standalone HTML to this path instead of starting server")
+    args = parser.parse_args()
+
+    reviews = collect_review_data(args.iteration_dir, args.previous_workspace)
+    benchmark = load_json(args.benchmark) if args.benchmark else None
+
+    html = generate_html(reviews, benchmark, args.artifact_name)
+
+    if args.static:
+        with open(args.static, "w") as f:
+            f.write(html)
+        print(f"Wrote static review to {args.static}")
+    else:
+        # Write temp file and open in browser
+        import tempfile
+        tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".html", delete=False, prefix="adk-review-")
+        tmp.write(html)
+        tmp.close()
+        print(f"Opening review in browser: {tmp.name}")
+        webbrowser.open(f"file://{tmp.name}")
+
+
+if __name__ == "__main__":
+    main()
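`collect_review_data()` implies a directory layout without spelling it out: an iteration directory holds one subdirectory per eval (with `eval_metadata.json`), each of which holds one subdirectory per configuration (with `grading.json` and an `outputs/` folder), and a previous iteration may carry a `feedback.json` keyed by `run_id`. The sketch below builds such a tree in a temporary directory, as one reading of the script rather than documented behavior. The names `eval-create-agent`, `with_skill`, and `agent.md` and the helper functions are hypothetical; the JSON keys are the ones the script and its viewer template read.

```python
import json
import tempfile
from pathlib import Path


def build_sample_workspace(root: Path) -> Path:
    """Create a tiny iteration directory in the shape the script walks."""
    iteration = root / "iteration-1"
    # Directory names at the eval and configuration levels are made up;
    # the script accepts any directory names there.
    eval_dir = iteration / "eval-create-agent"
    config_dir = eval_dir / "with_skill"
    outputs = config_dir / "outputs"
    outputs.mkdir(parents=True)

    (outputs / "agent.md").write_text("# Example agent\n")
    (eval_dir / "eval_metadata.json").write_text(
        json.dumps({"eval_name": "Create agent", "prompt": "Create an agent"})
    )
    (config_dir / "grading.json").write_text(
        json.dumps({
            "expectations": [{"text": "File exists", "passed": True, "evidence": "agent.md"}],
            "summary": {"passed": 1, "total": 1},
        })
    )
    # Read from the *previous* iteration directory when one is supplied.
    (iteration / "feedback.json").write_text(
        json.dumps({"reviews": [{"run_id": "eval-create-agent-with_skill", "feedback": "ok"}]})
    )
    return iteration


def list_run_ids(iteration: Path) -> list[str]:
    """Return '<eval dir>-<config dir>' for every configuration directory."""
    run_ids = []
    for eval_dir in sorted(p for p in iteration.iterdir() if p.is_dir()):
        for config_dir in sorted(p for p in eval_dir.iterdir() if p.is_dir()):
            run_ids.append(f"{eval_dir.name}-{config_dir.name}")
    return run_ids


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(list_run_ids(build_sample_workspace(Path(tmp))))
```

A tree like this could then be passed to the script as its positional `iteration_dir` argument, with `--static` pointing at an output path.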
|
|
@@ -0,0 +1,181 @@
|
|
|
1
|
+
<!DOCTYPE html>
|
|
2
|
+
<html lang="en">
|
|
3
|
+
<head>
|
|
4
|
+
<meta charset="UTF-8">
|
|
5
|
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
|
6
|
+
<title>ADK Eval Review: __ARTIFACT_NAME_PLACEHOLDER__</title>
|
|
7
|
+
<style>
|
|
8
|
+
* { box-sizing: border-box; margin: 0; padding: 0; }
|
|
9
|
+
body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; background: #0d1117; color: #c9d1d9; padding: 20px; }
|
|
10
|
+
.tabs { display: flex; gap: 4px; margin-bottom: 20px; }
|
|
11
|
+
.tab { padding: 8px 16px; background: #161b22; border: 1px solid #30363d; border-radius: 6px 6px 0 0; cursor: pointer; color: #8b949e; }
|
|
12
|
+
.tab.active { background: #0d1117; border-bottom-color: transparent; color: #c9d1d9; }
|
|
13
|
+
.panel { display: none; }
|
|
14
|
+
.panel.active { display: block; }
|
|
15
|
+
.nav { display: flex; gap: 8px; margin-bottom: 16px; align-items: center; }
|
|
16
|
+
.nav button { padding: 6px 12px; background: #21262d; border: 1px solid #30363d; border-radius: 6px; color: #c9d1d9; cursor: pointer; }
|
|
17
|
+
.nav button:hover { background: #30363d; }
|
|
18
|
+
.nav span { color: #8b949e; }
|
|
19
|
+
.card { background: #161b22; border: 1px solid #30363d; border-radius: 8px; padding: 16px; margin-bottom: 16px; }
|
|
20
|
+
.card h3 { color: #58a6ff; margin-bottom: 8px; }
|
|
21
|
+
.card pre { background: #0d1117; padding: 12px; border-radius: 4px; overflow-x: auto; font-size: 13px; white-space: pre-wrap; }
|
|
22
|
+
.feedback textarea { width: 100%; height: 80px; background: #0d1117; border: 1px solid #30363d; border-radius: 6px; color: #c9d1d9; padding: 8px; font-family: inherit; resize: vertical; }
|
|
23
|
+
.badge { display: inline-block; padding: 2px 8px; border-radius: 12px; font-size: 12px; font-weight: 600; }
|
|
24
|
+
.badge.pass { background: #1b4332; color: #2dd4bf; }
|
|
25
|
+
.badge.fail { background: #4c1d1d; color: #f87171; }
|
|
26
|
+
.config-label { font-size: 12px; color: #8b949e; text-transform: uppercase; letter-spacing: 1px; }
|
|
27
|
+
details { margin-top: 8px; }
|
|
28
|
+
details summary { cursor: pointer; color: #8b949e; }
|
|
29
|
+
.submit-btn { padding: 10px 24px; background: #238636; border: none; border-radius: 6px; color: white; font-size: 14px; cursor: pointer; margin-top: 16px; }
|
|
30
|
+
.submit-btn:hover { background: #2ea043; }
|
|
31
|
+
table { width: 100%; border-collapse: collapse; }
|
|
32
|
+
th, td { padding: 8px 12px; text-align: left; border-bottom: 1px solid #21262d; }
|
|
33
|
+
th { color: #8b949e; font-weight: 600; }
|
|
34
|
+
</style>
|
|
35
|
+
</head>
|
|
36
|
+
<body>
|
|
37
|
+
|
|
38
|
+
<h1 style="margin-bottom: 20px;">ADK Eval Review: <span id="artifact-name"></span></h1>
|
|
39
|
+
|
|
40
|
+
<div class="tabs">
|
|
41
|
+
<div class="tab active" onclick="switchTab('outputs')">Outputs</div>
|
|
42
|
+
<div class="tab" onclick="switchTab('benchmark')">Benchmark</div>
|
|
43
|
+
</div>
|
|
44
|
+
|
|
45
|
+
<div id="outputs-panel" class="panel active">
|
|
46
|
+
<div class="nav">
|
|
47
|
+
<button onclick="prev()">← Prev</button>
|
|
48
|
+
<span id="counter">1 / 1</span>
|
|
49
|
+
<button onclick="next()">Next →</button>
|
|
50
|
+
</div>
|
|
51
|
+
<div id="review-content"></div>
|
|
52
|
+
<button class="submit-btn" onclick="submitAll()">Submit All Reviews</button>
|
|
53
|
+
</div>
|
|
54
|
+
|
|
55
|
+
<div id="benchmark-panel" class="panel">
|
|
56
|
+
<div id="benchmark-content"></div>
|
|
57
|
+
</div>
|
|
58
|
+
|
|
59
|
+
<script>
|
|
60
|
+
const reviews = __REVIEW_DATA_PLACEHOLDER__;
|
|
61
|
+
const benchmark = __BENCHMARK_DATA_PLACEHOLDER__;
|
|
62
|
+
const artifactName = "__ARTIFACT_NAME_PLACEHOLDER__";
|
|
63
|
+
let currentIndex = 0;
|
|
64
|
+
const feedbackStore = {};
|
|
65
|
+
|
|
66
|
+
document.getElementById('artifact-name').textContent = artifactName;
|
|
67
|
+
|
|
68
|
+
function switchTab(tab) {
|
|
69
|
+
document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
|
|
70
|
+
document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
|
|
71
|
+
event.target.classList.add('active');
|
|
72
|
+
document.getElementById(tab + '-panel').classList.add('active');
|
|
73
|
+
}
|
|
74
|
+
|
|
75
|
+
function renderReview(index) {
|
|
76
|
+
if (!reviews.length) { document.getElementById('review-content').innerHTML = '<p>No review data.</p>'; return; }
|
|
77
|
+
const r = reviews[index];
|
|
78
|
+
document.getElementById('counter').textContent = `${index + 1} / ${reviews.length}`;
|
|
79
|
+
|
|
80
|
+
let html = '';
|
|
81
|
+
|
|
82
|
+
// Prompt
|
|
83
|
+
html += `<div class="card"><h3>Prompt</h3><pre>${escapeHtml(r.prompt || 'N/A')}</pre></div>`;
|
|
84
|
+
|
|
85
|
+
// Config label
|
|
86
|
+
html += `<div class="config-label">${r.configuration}</div>`;
|
|
87
|
+
|
|
88
|
+
// Outputs
|
|
89
|
+
if (r.outputs && Object.keys(r.outputs).length) {
|
|
90
|
+
for (const [name, content] of Object.entries(r.outputs)) {
|
|
91
|
+
html += `<div class="card"><h3>${escapeHtml(name)}</h3><pre>${escapeHtml(content)}</pre></div>`;
|
|
92
|
+
}
|
|
93
|
+
}
|
|
94
|
+
|
|
95
|
+
// Previous output
|
|
96
|
+
if (r.previous_output) {
|
|
97
|
+
html += `<details><summary>Previous Output</summary>`;
|
|
98
|
+
for (const [name, content] of Object.entries(r.previous_output)) {
|
|
99
|
+
html += `<div class="card"><h3>${escapeHtml(name)}</h3><pre>${escapeHtml(content)}</pre></div>`;
|
|
100
|
+
}
|
|
101
|
+
html += `</details>`;
|
|
102
|
+
}
|
|
103
|
+
|
|
104
|
+
// Grading
|
|
105
|
+
if (r.grading && r.grading.expectations) {
|
|
106
|
+
html += `<details><summary>Formal Grades (${r.grading.summary?.passed || 0}/${r.grading.summary?.total || 0} passed)</summary>`;
|
|
107
|
+
for (const exp of r.grading.expectations) {
|
|
108
|
+
const badge = exp.passed ? '<span class="badge pass">PASS</span>' : '<span class="badge fail">FAIL</span>';
|
|
109
|
+
html += `<div class="card">${badge} ${escapeHtml(exp.text)}<pre>${escapeHtml(exp.evidence || '')}</pre></div>`;
|
|
110
|
+
}
|
|
111
|
+
html += `</details>`;
|
|
112
|
+
}
|
|
113
|
+
|
|
114
|
+
// Previous feedback
|
|
115
|
+
if (r.previous_feedback) {
|
|
116
|
+
html += `<div class="card"><h3>Previous Feedback</h3><pre>${escapeHtml(r.previous_feedback)}</pre></div>`;
|
|
117
|
+
}
|
|
118
|
+
|
|
119
|
+
// Feedback textarea
|
|
120
|
+
const savedFeedback = feedbackStore[r.run_id] || '';
|
|
121
|
+
html += `<div class="feedback"><h3 style="color:#58a6ff;margin-bottom:8px;">Feedback</h3>`;
|
|
122
|
+
html += `<textarea id="feedback-input" placeholder="Leave feedback (empty = looks good)" oninput="saveFeedback()">${escapeHtml(savedFeedback)}</textarea></div>`;
|
|
123
|
+
|
|
124
|
+
document.getElementById('review-content').innerHTML = html;
|
|
125
|
+
}
|
|
126
|
+
|
|
127
|
+
function saveFeedback() {
|
|
128
|
+
const r = reviews[currentIndex];
|
|
129
|
+
const input = document.getElementById('feedback-input'); if (r && input) feedbackStore[r.run_id] = input.value;
|
|
130
|
+
}
|
|
131
|
+
|
|
132
|
+
function prev() { if (currentIndex > 0) { saveFeedback(); currentIndex--; renderReview(currentIndex); } }
|
|
133
|
+
function next() { if (currentIndex < reviews.length - 1) { saveFeedback(); currentIndex++; renderReview(currentIndex); } }
|
|
134
|
+
|
|
135
|
+
function submitAll() {
|
|
136
|
+
saveFeedback();
|
|
137
|
+
const feedback = { reviews: reviews.map(r => ({ run_id: r.run_id, feedback: feedbackStore[r.run_id] || '', timestamp: new Date().toISOString() })), status: 'complete' };
|
|
138
|
+
const blob = new Blob([JSON.stringify(feedback, null, 2)], { type: 'application/json' });
|
|
139
|
+
const a = document.createElement('a');
|
|
140
|
+
a.href = URL.createObjectURL(blob);
|
|
141
|
+
a.download = 'feedback.json';
|
|
142
|
+
a.click();
|
|
143
|
+
alert('Feedback saved! Copy feedback.json to your workspace directory.');
|
|
144
|
+
}
|
|
145
|
+
|
|
146
|
+
function renderBenchmark() {
|
|
147
|
+
const el = document.getElementById('benchmark-content');
|
|
148
|
+
if (!benchmark || !benchmark.run_summary) { el.innerHTML = '<p>No benchmark data.</p>'; return; }
|
|
149
|
+
|
|
150
|
+
let html = '<div class="card"><h3>Summary</h3><table><tr><th>Config</th><th>Pass Rate</th><th>Time</th><th>Tokens</th></tr>';
|
|
151
|
+
for (const [config, stats] of Object.entries(benchmark.run_summary)) {
|
|
152
|
+
if (config === 'delta') continue;
|
|
153
|
+
const pr = stats.pass_rate || {};
|
|
154
|
+
const ts = stats.time_seconds || {};
|
|
155
|
+
const tk = stats.tokens || {};
|
|
156
|
+
html += `<tr><td>${escapeHtml(config)}</td><td>${((pr.mean||0)*100).toFixed(1)}% ± ${((pr.stddev||0)*100).toFixed(1)}%</td><td>${(ts.mean||0).toFixed(1)}s</td><td>${(tk.mean||0).toFixed(0)}</td></tr>`;
|
|
157
|
+
}
|
|
158
|
+
if (benchmark.run_summary.delta) {
|
|
159
|
+
const d = benchmark.run_summary.delta;
|
|
160
|
+
html += `<tr style="font-weight:bold;color:#58a6ff"><td>Delta</td><td>${escapeHtml(d.pass_rate)}</td><td>${escapeHtml(d.time_seconds)}s</td><td>${escapeHtml(d.tokens)}</td></tr>`;
|
|
161
|
+
}
|
|
162
|
+
html += '</table></div>';
|
|
163
|
+
|
|
164
|
+
if (benchmark.notes && benchmark.notes.length) {
|
|
165
|
+
html += '<div class="card"><h3>Analyst Notes</h3><ul>';
|
|
166
|
+
for (const note of benchmark.notes) { html += `<li>${escapeHtml(note)}</li>`; }
|
|
167
|
+
html += '</ul></div>';
|
|
168
|
+
}
|
|
169
|
+
|
|
170
|
+
el.innerHTML = html;
|
|
171
|
+
}
|
|
172
|
+
|
|
173
|
+
function escapeHtml(s) { return String(s).replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;'); }
|
|
174
|
+
|
|
175
|
+
document.addEventListener('keydown', e => { if (e.key === 'ArrowLeft') prev(); if (e.key === 'ArrowRight') next(); });
|
|
176
|
+
|
|
177
|
+
renderReview(0);
|
|
178
|
+
renderBenchmark();
|
|
179
|
+
</script>
|
|
180
|
+
</body>
|
|
181
|
+
</html>
|
|
@@ -0,0 +1,84 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-colocated-placement
|
|
3
|
+
target: skill/aw-adk
|
|
4
|
+
category: structural
|
|
5
|
+
difficulty: basic
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Eval: Colocated Placement — Evals Land in Correct Directory
|
|
9
|
+
|
|
10
|
+
## Task
|
|
11
|
+
|
|
12
|
+
Test that the ADK places evals in the correct colocated directory for each artifact type. Each CASRE type has a different eval placement pattern. This eval creates an artifact and checks that its evals end up in the right location — not in a centralized `evals/` directory.
|
|
13
|
+
|
|
14
|
+
### Prompt
|
|
15
|
+
|
|
16
|
+
```
|
|
17
|
+
Create an agent for API rate limiting in the platform/services namespace. It should enforce per-tenant rate limits on HTTP endpoints using sliding window counters in Redis. Tools: Read, Bash, Grep, Glob. Model: sonnet. No existing skills to reference — use skills: [].
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
## Context
|
|
21
|
+
|
|
22
|
+
| Field | Value |
|
|
23
|
+
|-------|-------|
|
|
24
|
+
| **Namespace** | `platform/services` |
|
|
25
|
+
| **Domain** | `services` |
|
|
26
|
+
| **Target artifact** | `skills/aw-adk/SKILL.md` |
|
|
27
|
+
| **Target type** | `agent` |
|
|
28
|
+
|
|
29
|
+
## Expected Outcomes
|
|
30
|
+
|
|
31
|
+
- [ ] **Agent created** at `.aw/.aw_registry/platform/services/agents/api-rate-limiter.md` (or similar slug)
|
|
32
|
+
- [ ] **Evals created** at `.aw/.aw_registry/platform/services/agents/evals/api-rate-limiter/eval-*.md`
|
|
33
|
+
- [ ] **NOT placed** at a top-level `evals/` directory
|
|
34
|
+
- [ ] **NOT placed** at `.aw/.aw_registry/platform/services/evals/` (wrong nesting)
|
|
35
|
+
- [ ] **Each eval has `target:` frontmatter** referencing the parent agent
|
|
36
|
+
- [ ] **At least 2 eval files** created
|
|
37
|
+
|
|
38
|
+
## Grading Criteria
|
|
39
|
+
|
|
40
|
+
### PASS
|
|
41
|
+
|
|
42
|
+
- Evals are in the correct colocated path: `agents/evals/<slug>/eval-*.md`
|
|
43
|
+
- 2+ eval files exist
|
|
44
|
+
- All have correct `target:` frontmatter
|
|
45
|
+
|
|
46
|
+
### PARTIAL
|
|
47
|
+
|
|
48
|
+
- Evals created but in a slightly wrong path (e.g., `agents/evals/eval-*.md` without the slug subdirectory)
|
|
49
|
+
|
|
50
|
+
### FAIL
|
|
51
|
+
|
|
52
|
+
- Evals in a centralized location
|
|
53
|
+
- No evals created
|
|
54
|
+
- Evals reference wrong parent artifact
|
|
55
|
+
|
|
56
|
+
## Evaluation Method
|
|
57
|
+
|
|
58
|
+
**Type:** deterministic
|
|
59
|
+
|
|
60
|
+
### Deterministic Checks
|
|
61
|
+
|
|
62
|
+
```bash
|
|
63
|
+
# Find the agent file
|
|
64
|
+
AGENT_PATH=$(find .aw/.aw_registry/platform/services/agents/ -name "*rate-limit*" -not -path "*/evals/*" | head -1)
|
|
65
|
+
SLUG=$(basename "$AGENT_PATH" .md)
|
|
66
|
+
|
|
67
|
+
# Verify evals are colocated
|
|
68
|
+
EVAL_COUNT=$(ls .aw/.aw_registry/platform/services/agents/evals/"$SLUG"/eval-*.md 2>/dev/null | wc -l)
|
|
69
|
+
[[ "$EVAL_COUNT" -ge 2 ]] || echo "FAIL: expected 2+ evals at agents/evals/$SLUG/, found $EVAL_COUNT"
|
|
70
|
+
|
|
71
|
+
# Verify no centralized placement
|
|
72
|
+
ls .aw/.aw_registry/platform/services/evals/ 2>/dev/null && echo "WARN: centralized evals/ exists"
|
|
73
|
+
|
|
74
|
+
# Verify target frontmatter
|
|
75
|
+
for f in .aw/.aw_registry/platform/services/agents/evals/"$SLUG"/eval-*.md; do
|
|
76
|
+
grep -q "^target:" "$f" || echo "FAIL: $f missing target frontmatter"
|
|
77
|
+
done
|
|
78
|
+
```
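
The checks above report failures only by printing, so a harness that looks at the exit status may not notice them. A minimal sketch of a wrapper that returns a non-zero status on any failure follows. The function name `check_colocated_evals` and its two arguments are hypothetical, and it treats a sibling `evals/` directory as a failure rather than a warning, which is an assumption based on the Expected Outcomes list.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the package):
#   check_colocated_evals <agents-dir> <slug>
# Prints PASS/FAIL lines and returns non-zero if any check fails.
check_colocated_evals() {
  local agents_dir="$1" slug="$2" failed=0 count=0 f
  local eval_dir="$agents_dir/evals/$slug"

  # Count eval files with a glob instead of parsing ls output.
  for f in "$eval_dir"/eval-*.md; do
    [ -e "$f" ] || continue
    count=$((count + 1))
    # Each eval must declare a target in its frontmatter.
    if ! grep -q '^target:' "$f"; then
      echo "FAIL: $f missing target frontmatter"
      failed=1
    fi
  done

  if [ "$count" -lt 2 ]; then
    echo "FAIL: expected 2+ evals at $eval_dir, found $count"
    failed=1
  fi

  # An evals/ directory next to agents/ is the wrong nesting
  # (assumption: treated here as a failure, not a warning).
  if [ -d "$(dirname "$agents_dir")/evals" ]; then
    echo "FAIL: centralized evals/ directory exists"
    failed=1
  fi

  [ "$failed" -eq 0 ] && echo "PASS"
  return "$failed"
}
```

It could be called as `check_colocated_evals .aw/.aw_registry/platform/services/agents "$SLUG"`, reusing the `SLUG` value derived earlier.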
|
|
79
|
+
|
|
80
|
+
## Baseline Expectations
|
|
81
|
+
|
|
82
|
+
- Without ADK: Evals placed arbitrarily or not created at all.
|
|
83
|
+
- With ADK: Correct colocated placement per eval-placement-guide.md.
|
|
84
|
+
- **Expected delta:** 100% correct placement with ADK
|
|
@@ -0,0 +1,90 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: eval-create-agent
|
|
3
|
+
target: skill/aw-adk
|
|
4
|
+
category: functional
|
|
5
|
+
difficulty: intermediate
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Eval: Create Agent — Phantom Dependency Detection
|
|
9
|
+
|
|
10
|
+
## Task
|
|
11
|
+
|
|
12
|
+
Test that the ADK creates an agent with valid skill references (no phantom dependencies) and follows the full create flow. This eval specifically targets the phantom skill problem — agents listing skills in `skills:` frontmatter that don't exist in the registry.
|
|
13
|
+
|
|
14
|
+
### Prompt
|
|
15
|
+
|
|
16
|
+
```
|
|
17
|
+
Create an agent for payments processing in the revex/reselling namespace, under the backend domain. It should validate payment webhook signatures, reconcile transaction records against Stripe events, and flag discrepancies. Tools needed: Read, Bash, Grep, Glob. Model: sonnet. No existing skills to reference — use skills: [].
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
## Context
|
|
21
|
+
|
|
22
|
+
| Field | Value |
|
|
23
|
+
|-------|-------|
|
|
24
|
+
| **Namespace** | `revex/reselling` |
|
|
25
|
+
| **Domain** | `backend` |
|
|
26
|
+
| **Target artifact** | `skills/aw-adk/SKILL.md` |
|
|
27
|
+
| **Target type** | `agent` |
|
|
28
|
+
|
|
29
|
+
## Expected Outcomes
|
|
30
|
+
|
|
31
|
+
- [ ] **Type classified correctly** — identified as `agent`
|
|
32
|
+
- [ ] **Interview conducted** — asked about agent's purpose, tools needed, model, skills
|
|
33
|
+
- [ ] **Path resolved** — target at `.aw/.aw_registry/revex/reselling/<domain>/agents/payments-processor.md` (domain may vary)
|
|
34
|
+
- [ ] **Agent created** with frontmatter: `name`, `description`, `tools`, `model`, `category`, `squad`, `skills`
|
|
35
|
+
- [ ] **No phantom skills** — every entry in `skills:` frontmatter either exists in the registry OR `skills: []` is used
|
|
36
|
+
- [ ] **Identity section present** — agent has a clear identity/mission section
|
|
37
|
+
- [ ] **CHECKPOINT output shown** — remaining steps printed before continuing
|
|
38
|
+
- [ ] **Lint ran and passed** — `lint-artifact.sh` executed, no phantom_skill errors
|
|
39
|
+
- [ ] **Scoring performed** — rubric-agent.md read, 10-dimension score table output
|
|
40
|
+
- [ ] **2+ evals created** — colocated at `agents/evals/<slug>/eval-*.md`
|
|
41
|
+
- [ ] **Evals derive from agent structure** — at least one eval exercises the agent's specific domain (payments), not generic checks
|
|
42
|
+
- [ ] **`aw link` ran**
|
|
43
|
+
|
|
44
|
+
## Grading Criteria
|
|
45
|
+
|
|
46
|
+
### PASS (all conditions met)
|
|
47
|
+
|
|
48
|
+
- All 12 outcomes checked
|
|
49
|
+
- Zero phantom dependencies
|
|
50
|
+
- Agent content is payments-domain-specific
|
|
51
|
+
|
|
52
|
+
### PARTIAL (8+ of 12)
|
|
53
|
+
|
|
54
|
+
- Agent created correctly but some flow steps skipped
|
|
55
|
+
- OR agent has phantom skills but lint caught them
|
|
56
|
+
|
|
57
|
+
### FAIL (below 8)
|
|
58
|
+
|
|
59
|
+
- Phantom skills in `skills:` frontmatter that lint didn't catch
|
|
60
|
+
- Steps 5-14 skipped
|
|
61
|
+
- Wrong type classification
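
The thresholds above can be sketched as a small mapping from a count of met outcomes to a grade. This is only an illustration: the function name `grade_run` and its arguments are hypothetical, and it does not cover the judgment calls in the rubric (domain specificity, type classification), which would still need a reviewer or a model-based check.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the package):
#   grade_run <outcomes-met> <phantoms-not-caught-by-lint>
# Maps the counts to PASS / PARTIAL / FAIL using the 12 and 8 thresholds.
grade_run() {
  local met="$1" uncaught_phantoms="$2"
  if [ "$uncaught_phantoms" -gt 0 ] || [ "$met" -lt 8 ]; then
    # Uncaught phantom skills or fewer than 8 outcomes is a FAIL.
    echo "FAIL"
  elif [ "$met" -eq 12 ]; then
    echo "PASS"
  else
    # 8 to 11 outcomes met, with no uncaught phantoms.
    echo "PARTIAL"
  fi
}
```

For example, `grade_run 9 0` prints `PARTIAL`.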
|
|
62
|
+
|
|
63
|
+
## Evaluation Method
|
|
64
|
+
|
|
65
|
+
**Type:** hybrid
|
|
66
|
+
|
|
67
|
+
### Deterministic Checks
|
|
68
|
+
|
|
69
|
+
```bash
|
|
70
|
+
# Run lint — will catch phantom skills
|
|
71
|
+
bash ~/.aw-ecc/skills/aw-adk/scripts/lint-artifact.sh "<agent-path>" agent
|
|
72
|
+
|
|
73
|
+
# Verify frontmatter fields
|
|
74
|
+
grep -q "^name:" "<agent-path>" || echo "FAIL: missing name"
|
|
75
|
+
grep -q "^tools:" "<agent-path>" || echo "FAIL: missing tools"
|
|
76
|
+
grep -q "^model:" "<agent-path>" || echo "FAIL: missing model"
|
|
77
|
+
grep -q "^skills:" "<agent-path>" || echo "FAIL: missing skills field"
|
|
78
|
+
```
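
The checks above cover four of the seven frontmatter fields listed under Expected Outcomes, and `grep` matches anywhere in the file. A minimal sketch that checks all seven fields inside the frontmatter block only might look like this. The function name `check_frontmatter` is hypothetical, and the sketch assumes the frontmatter is delimited by the first pair of `---` lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the package):
#   check_frontmatter <agent-file>
# Returns non-zero if any expected frontmatter field is missing.
check_frontmatter() {
  local file="$1" field missing=0 front
  # Keep only the lines between the first pair of '---' delimiters.
  front=$(awk '/^---[[:space:]]*$/ { n++; next } n == 1 { print } n >= 2 { exit }' "$file")
  for field in name description tools model category squad skills; do
    if ! printf '%s\n' "$front" | grep -q "^${field}:"; then
      echo "FAIL: missing ${field}"
      missing=1
    fi
  done
  return "$missing"
}
```

It would be called with the same `<agent-path>` as the checks above, and it only confirms that the `skills:` key exists; whether each listed skill is real is still left to `lint-artifact.sh`.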
|
|
79
|
+
|
|
80
|
+
### Model-Based Checks
|
|
81
|
+
|
|
82
|
+
- Are skill references valid (either empty or pointing to real skills)?
|
|
83
|
+
- Is the agent's identity specific to payments processing?
|
|
84
|
+
- Did the executor show the CHECKPOINT step?
|
|
85
|
+
|
|
86
|
+
## Baseline Expectations
|
|
87
|
+
|
|
88
|
+
- Without ADK: Agent created with phantom skill references, no lint validation, no evals.
|
|
89
|
+
- With ADK: Phantom-free agent, lint-validated, scored, with colocated evals.
|
|
90
|
+
- **Expected delta:** zero phantom dependencies vs. 2-3 phantoms without ADK
|