aw-ecc 1.4.32 → 1.4.48
- package/.claude-plugin/plugin.json +1 -1
- package/.cursor/INSTALL.md +7 -5
- package/.cursor/hooks/adapter.js +41 -4
- package/.cursor/hooks/after-agent-response.js +62 -0
- package/.cursor/hooks/before-submit-prompt.js +7 -1
- package/.cursor/hooks/post-tool-use-failure.js +21 -0
- package/.cursor/hooks/post-tool-use.js +39 -0
- package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
- package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
- package/.cursor/hooks/subagent-start.js +22 -4
- package/.cursor/hooks/subagent-stop.js +18 -1
- package/.cursor/hooks.json +23 -2
- package/.opencode/package.json +1 -1
- package/AGENTS.md +3 -3
- package/README.md +5 -5
- package/commands/adk.md +52 -0
- package/commands/build.md +22 -9
- package/commands/deploy.md +12 -0
- package/commands/execute.md +9 -0
- package/commands/feature.md +333 -0
- package/commands/investigate.md +18 -5
- package/commands/plan.md +23 -9
- package/commands/publish.md +65 -0
- package/commands/review.md +12 -0
- package/commands/ship.md +12 -0
- package/commands/test.md +12 -0
- package/commands/verify.md +9 -0
- package/hooks/hooks.json +36 -0
- package/manifests/install-components.json +8 -0
- package/manifests/install-modules.json +83 -0
- package/manifests/install-profiles.json +7 -0
- package/package.json +2 -2
- package/scripts/ci/validate-rules.js +51 -0
- package/scripts/cursor-aw-home/hooks.json +23 -2
- package/scripts/cursor-aw-hooks/adapter.js +41 -4
- package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
- package/scripts/hooks/aw-usage-commit-created.js +32 -0
- package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
- package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
- package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
- package/scripts/hooks/aw-usage-session-start.js +48 -0
- package/scripts/hooks/aw-usage-stop.js +182 -0
- package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
- package/scripts/hooks/cost-tracker.js +3 -23
- package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
- package/scripts/hooks/shared/aw-phase-runner.js +3 -1
- package/scripts/lib/aw-hook-contract.js +2 -2
- package/scripts/lib/aw-pricing.js +306 -0
- package/scripts/lib/aw-usage-telemetry.js +472 -0
- package/scripts/lib/codex-hook-config.js +8 -8
- package/scripts/lib/cursor-hook-config.js +25 -10
- package/scripts/lib/install-targets/cursor-project.js +3 -0
- package/scripts/lib/install-targets/helpers.js +20 -3
- package/skills/aw-adk/SKILL.md +317 -0
- package/skills/aw-adk/agents/analyzer.md +113 -0
- package/skills/aw-adk/agents/comparator.md +113 -0
- package/skills/aw-adk/agents/grader.md +115 -0
- package/skills/aw-adk/assets/eval_review.html +76 -0
- package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
- package/skills/aw-adk/eval-viewer/viewer.html +181 -0
- package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
- package/skills/aw-adk/evals/eval-create-agent.md +90 -0
- package/skills/aw-adk/evals/eval-create-command.md +98 -0
- package/skills/aw-adk/evals/eval-create-eval.md +89 -0
- package/skills/aw-adk/evals/eval-create-rule.md +99 -0
- package/skills/aw-adk/evals/eval-create-skill.md +97 -0
- package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
- package/skills/aw-adk/evals/eval-delete-command.md +89 -0
- package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
- package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
- package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
- package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
- package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
- package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
- package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
- package/skills/aw-adk/evals/evals.json +96 -0
- package/skills/aw-adk/references/artifact-wiring.md +162 -0
- package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
- package/skills/aw-adk/references/eval-placement-guide.md +183 -0
- package/skills/aw-adk/references/external-resources.md +75 -0
- package/skills/aw-adk/references/getting-started.md +66 -0
- package/skills/aw-adk/references/registry-structure.md +152 -0
- package/skills/aw-adk/references/rubric-agent.md +36 -0
- package/skills/aw-adk/references/rubric-command.md +36 -0
- package/skills/aw-adk/references/rubric-eval.md +36 -0
- package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
- package/skills/aw-adk/references/rubric-rule.md +36 -0
- package/skills/aw-adk/references/rubric-skill.md +36 -0
- package/skills/aw-adk/references/schemas.md +222 -0
- package/skills/aw-adk/references/template-agent.md +251 -0
- package/skills/aw-adk/references/template-command.md +279 -0
- package/skills/aw-adk/references/template-eval.md +176 -0
- package/skills/aw-adk/references/template-rule.md +119 -0
- package/skills/aw-adk/references/template-skill.md +123 -0
- package/skills/aw-adk/references/type-classifier.md +98 -0
- package/skills/aw-adk/references/writing-good-agents.md +227 -0
- package/skills/aw-adk/references/writing-good-commands.md +258 -0
- package/skills/aw-adk/references/writing-good-evals.md +271 -0
- package/skills/aw-adk/references/writing-good-rules.md +214 -0
- package/skills/aw-adk/references/writing-good-skills.md +159 -0
- package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
- package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
- package/skills/aw-adk/scripts/score-artifact.sh +179 -0
- package/skills/aw-adk/scripts/trigger-eval.py +192 -0
- package/skills/aw-build/SKILL.md +19 -2
- package/skills/aw-deploy/SKILL.md +65 -3
- package/skills/aw-design/SKILL.md +156 -0
- package/skills/aw-design/references/highrise-tokens.md +394 -0
- package/skills/aw-design/references/micro-interactions.md +76 -0
- package/skills/aw-design/references/prompt-template.md +160 -0
- package/skills/aw-design/references/quality-checklist.md +70 -0
- package/skills/aw-design/references/self-review.md +497 -0
- package/skills/aw-design/references/stitch-workflow.md +127 -0
- package/skills/aw-feature/SKILL.md +293 -0
- package/skills/aw-investigate/SKILL.md +17 -0
- package/skills/aw-plan/SKILL.md +34 -3
- package/skills/aw-publish/SKILL.md +300 -0
- package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
- package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
- package/skills/aw-publish/evals/eval-push-modes.md +67 -0
- package/skills/aw-publish/evals/eval-rules-push.md +60 -0
- package/skills/aw-publish/evals/evals.json +29 -0
- package/skills/aw-publish/references/push-modes.md +38 -0
- package/skills/aw-review/SKILL.md +88 -9
- package/skills/aw-rules-review/SKILL.md +124 -0
- package/skills/aw-rules-review/agents/openai.yaml +3 -0
- package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
- package/skills/aw-ship/SKILL.md +16 -0
- package/skills/aw-spec/SKILL.md +15 -0
- package/skills/aw-tasks/SKILL.md +15 -0
- package/skills/aw-test/SKILL.md +16 -0
- package/skills/aw-yolo/SKILL.md +4 -0
- package/skills/diagnose/SKILL.md +121 -0
- package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
- package/skills/finish-only-when-green/SKILL.md +265 -0
- package/skills/grill-me/SKILL.md +24 -0
- package/skills/grill-with-docs/SKILL.md +92 -0
- package/skills/grill-with-docs/adr-format.md +47 -0
- package/skills/grill-with-docs/context-format.md +67 -0
- package/skills/improve-codebase-architecture/SKILL.md +75 -0
- package/skills/improve-codebase-architecture/deepening.md +37 -0
- package/skills/improve-codebase-architecture/interface-design.md +44 -0
- package/skills/improve-codebase-architecture/language.md +53 -0
- package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
- package/skills/tdd/SKILL.md +115 -0
- package/skills/tdd/deep-modules.md +33 -0
- package/skills/tdd/interface-design.md +31 -0
- package/skills/tdd/mocking.md +59 -0
- package/skills/tdd/refactoring.md +10 -0
- package/skills/tdd/tests.md +61 -0
- package/skills/to-issues/SKILL.md +62 -0
- package/skills/to-prd/SKILL.md +75 -0
- package/skills/using-aw-skills/SKILL.md +170 -237
- package/skills/using-aw-skills/hooks/session-start.sh +11 -41
- package/skills/zoom-out/SKILL.md +24 -0
- package/.codex/hooks/aw-post-tool-use.sh +0 -6
- package/.codex/hooks/aw-pre-tool-use.sh +0 -6
- package/.codex/hooks/aw-session-start.sh +0 -25
- package/.codex/hooks/aw-stop.sh +0 -6
- package/.codex/hooks/aw-user-prompt-submit.sh +0 -10
- package/.codex/hooks.json +0 -62
- package/.cursor/rules/common-agents.md +0 -53
- package/.cursor/rules/common-aw-routing.md +0 -43
- package/.cursor/rules/common-coding-style.md +0 -52
- package/.cursor/rules/common-development-workflow.md +0 -33
- package/.cursor/rules/common-git-workflow.md +0 -28
- package/.cursor/rules/common-hooks.md +0 -34
- package/.cursor/rules/common-patterns.md +0 -35
- package/.cursor/rules/common-performance.md +0 -59
- package/.cursor/rules/common-security.md +0 -33
- package/.cursor/rules/common-testing.md +0 -33
- package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
- package/.cursor/skills/article-writing/SKILL.md +0 -85
- package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
- package/.cursor/skills/aw-build/SKILL.md +0 -152
- package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
- package/.cursor/skills/aw-debug/SKILL.md +0 -49
- package/.cursor/skills/aw-deploy/SKILL.md +0 -101
- package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
- package/.cursor/skills/aw-execute/SKILL.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
- package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
- package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
- package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
- package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
- package/.cursor/skills/aw-finish/SKILL.md +0 -111
- package/.cursor/skills/aw-investigate/SKILL.md +0 -109
- package/.cursor/skills/aw-plan/SKILL.md +0 -368
- package/.cursor/skills/aw-prepare/SKILL.md +0 -118
- package/.cursor/skills/aw-review/SKILL.md +0 -118
- package/.cursor/skills/aw-ship/SKILL.md +0 -115
- package/.cursor/skills/aw-spec/SKILL.md +0 -104
- package/.cursor/skills/aw-tasks/SKILL.md +0 -138
- package/.cursor/skills/aw-test/SKILL.md +0 -118
- package/.cursor/skills/aw-verify/SKILL.md +0 -51
- package/.cursor/skills/aw-yolo/SKILL.md +0 -111
- package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
- package/.cursor/skills/bun-runtime/SKILL.md +0 -84
- package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
- package/.cursor/skills/code-simplification/SKILL.md +0 -74
- package/.cursor/skills/content-engine/SKILL.md +0 -88
- package/.cursor/skills/context-engineering/SKILL.md +0 -74
- package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
- package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
- package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
- package/.cursor/skills/frontend-slides/SKILL.md +0 -184
- package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
- package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
- package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
- package/.cursor/skills/idea-refine/SKILL.md +0 -84
- package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
- package/.cursor/skills/investor-materials/SKILL.md +0 -96
- package/.cursor/skills/investor-outreach/SKILL.md +0 -76
- package/.cursor/skills/market-research/SKILL.md +0 -75
- package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
- package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
- package/.cursor/skills/performance-optimization/SKILL.md +0 -77
- package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
- package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
- package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
- package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
- package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
- package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
- package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
- package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
- /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
- /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
- /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
- /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
- /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
- /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
- /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
- /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
- /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
- /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
- /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
- /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
- /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
- /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
- /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
- /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
- /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
- /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
- /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
- /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
- /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
- /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
- /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
- /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
- /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
- /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
- /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
- /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
- /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
- /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
package/skills/aw-adk/references/schemas.md
@@ -0,0 +1,222 @@
# ADK JSON Schemas

Defines the JSON structures used by the ADK eval-driven iteration system. Adapted from skill-creator for CASRE context.

---

## evals.json

Defines test prompts and assertions for an artifact. Located at `skills/aw-adk/evals/evals.json` or `<artifact>/evals/evals.json`.

```json
{
  "artifact_name": "payments-agent",
  "artifact_type": "agent",
  "evals": [
    {
      "id": 1,
      "prompt": "Create an agent for payments processing in the revex/memberships namespace",
      "expected_output": "Agent file with Identity, Core Mission, Critical Rules, Process, Deliverables sections",
      "files": [],
      "expectations": [
        "The agent has a Core Mission section with 2+ sentences",
        "Frontmatter includes name, description, tools, model, category, squad",
        "Agent file is placed at correct registry path",
        "At least 2 colocated eval files created"
      ]
    }
  ]
}
```

**Fields:**
- `artifact_name`: Name of the artifact being tested
- `artifact_type`: One of: command, agent, skill, rule, eval
- `evals[].id`: Unique integer identifier
- `evals[].prompt`: The task to execute (realistic user request)
- `evals[].expected_output`: Human-readable success description
- `evals[].files`: Optional input file paths
- `evals[].expectations`: Verifiable assertions (graded by agents/grader.md)

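The field list above can be checked mechanically before a run. The following is a minimal sketch, not the package's own validator: `validate_evals` and `ALLOWED_TYPES` are hypothetical names, and treating `expectations` as required and non-empty is an assumption.

```python
# Allowed values for artifact_type, taken from the field list above.
ALLOWED_TYPES = {"command", "agent", "skill", "rule", "eval"}


def validate_evals(doc):
    """Return a list of problems; an empty list means the document looks valid."""
    problems = []
    name = doc.get("artifact_name")
    if not isinstance(name, str) or not name:
        problems.append("artifact_name must be a non-empty string")
    if doc.get("artifact_type") not in ALLOWED_TYPES:
        problems.append("artifact_type must be one of " + ", ".join(sorted(ALLOWED_TYPES)))
    seen_ids = set()
    for index, entry in enumerate(doc.get("evals", [])):
        eval_id = entry.get("id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(eval_id, int) or isinstance(eval_id, bool):
            problems.append(f"evals[{index}].id must be an integer")
        elif eval_id in seen_ids:
            problems.append(f"evals[{index}].id {eval_id} is not unique")
        else:
            seen_ids.add(eval_id)
        for key in ("prompt", "expected_output"):
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"evals[{index}].{key} must be a non-empty string")
        # files is optional; expectations is treated as required (an assumption).
        if not isinstance(entry.get("files", []), list):
            problems.append(f"evals[{index}].files must be a list")
        expectations = entry.get("expectations")
        if not isinstance(expectations, list) or not expectations:
            problems.append(f"evals[{index}].expectations must be a non-empty list")
    return problems


example = {
    "artifact_name": "payments-agent",
    "artifact_type": "agent",
    "evals": [
        {
            "id": 1,
            "prompt": "Create an agent for payments processing",
            "expected_output": "Agent file with the required sections",
            "files": [],
            "expectations": ["The agent has a Core Mission section"],
        }
    ],
}
print(validate_evals(example))  # prints []
```

Returning a list of problems instead of raising on the first one lets an author fix every issue in a single pass.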
---

## eval_metadata.json

Per-eval directory metadata. Located at `<workspace>/iteration-N/<eval-name>/eval_metadata.json`.

```json
{
  "eval_id": 1,
  "eval_name": "create-payments-agent",
  "prompt": "Create an agent for payments processing in the revex/memberships namespace",
  "assertions": [
    "The agent has a Core Mission section with 2+ sentences",
    "Frontmatter includes name, description, tools, model"
  ]
}
```

---

## grading.json

Output from the grader agent. Located at `<run-dir>/grading.json`.

```json
{
  "expectations": [
    {
      "text": "The agent has a Core Mission section",
      "passed": true,
      "evidence": "Found '## Core Mission' at line 42 with 3 sentences"
    }
  ],
  "summary": {
    "passed": 8,
    "failed": 2,
    "total": 10,
    "pass_rate": 0.80
  },
  "execution_metrics": {
    "tool_calls": { "Read": 5, "Write": 2, "Bash": 3 },
    "total_tool_calls": 10,
    "total_steps": 6,
    "errors_encountered": 0
  },
  "timing": {
    "executor_duration_seconds": 45.0,
    "total_duration_seconds": 52.0
  },
  "claims": [
    {
      "claim": "Agent scores B-Tier (65/100)",
      "type": "quality",
      "verified": true,
      "evidence": "Rubric scoring confirms total = 65"
    }
  ],
  "eval_feedback": {
    "suggestions": [],
    "overall": "Assertions cover structure well. Consider adding behavioral checks."
  }
}
```

**Important:** The viewer depends on exact field names: `text`, `passed`, `evidence` in expectations (not `name`/`met`/`details`).

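As an illustration of the field-name requirement, the `summary` object could be derived from the graded expectations roughly as follows. This is a sketch under assumptions: `summarize_expectations` is a hypothetical helper, and rounding `pass_rate` to two decimals is a guess based on the example value `0.80`.

```python
def summarize_expectations(expectations):
    """Build a summary object from graded expectations.

    Raises KeyError when an entry lacks text/passed/evidence, which is what
    happens if a grader emits name/met/details instead.
    """
    required = ("text", "passed", "evidence")
    for index, item in enumerate(expectations):
        missing = [key for key in required if key not in item]
        if missing:
            raise KeyError(f"expectations[{index}] is missing {missing}")
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"] is True)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Two-decimal rounding is an assumption, not a documented rule.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }


# Ten graded expectations, of which the first eight pass.
graded = [
    {"text": f"check {i}", "passed": i < 8, "evidence": "..."}
    for i in range(10)
]
print(summarize_expectations(graded))
```

Failing loudly on a wrong field name surfaces the problem at grading time, before a viewer silently renders an empty result.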
---

## timing.json

Wall clock timing. Located at `<run-dir>/timing.json`.

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

**How to capture:** When a subagent task completes, the notification includes `total_tokens` and `duration_ms`. Save immediately — this data is not persisted elsewhere.

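A minimal sketch of the capture step, assuming the two notification values are already available as plain integers. `save_timing` is a hypothetical helper, and deriving `total_duration_seconds` as `duration_ms / 1000` rounded to one decimal is an assumption that happens to match the example values.

```python
import json
import tempfile
from pathlib import Path


def save_timing(run_dir, total_tokens, duration_ms):
    """Write timing.json into run_dir and return the saved object."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Assumed derivation: milliseconds to seconds, one decimal place.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    path = Path(run_dir) / "timing.json"
    path.write_text(json.dumps(timing, indent=2) + "\n", encoding="utf-8")
    return timing


# Usage with a throwaway directory standing in for <run-dir>.
with tempfile.TemporaryDirectory() as run_dir:
    saved = save_timing(run_dir, total_tokens=84852, duration_ms=23332)
    print(saved["total_duration_seconds"])  # prints 23.3
```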
---

## benchmark.json

Aggregated results. Located at `<workspace>/iteration-N/benchmark.json`. Generated by `scripts/aggregate-benchmark.py`.

```json
{
  "metadata": {
    "artifact_name": "payments-agent",
    "iteration_dir": "payments-agent-workspace/iteration-1",
    "evals_run": ["create-payments-agent", "score-minimal-agent"],
    "total_runs": 4
  },
  "runs": [
    {
      "eval_id": 1,
      "eval_name": "create-payments-agent",
      "configuration": "with_artifact",
      "run_number": 1,
      "result": {
        "pass_rate": 0.85,
        "passed": 6,
        "failed": 1,
        "total": 7,
        "time_seconds": 42.5,
        "tokens": 3800,
        "errors": 0
      },
      "expectations": [
        { "text": "...", "passed": true, "evidence": "..." }
      ]
    }
  ],
  "run_summary": {
    "with_artifact": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    },
    "without_artifact": {
      "pass_rate": { "mean": 0.35, "stddev": 0.08 },
      "time_seconds": { "mean": 32.0, "stddev": 8.0 },
      "tokens": { "mean": 2100, "stddev": 300 }
    },
    "delta": {
      "pass_rate": "+0.500",
      "time_seconds": "+13.0",
      "tokens": "+1700"
    }
  },
  "notes": [
    "Without-artifact runs consistently fail on eval placement checks (0% pass rate)"
  ]
}
```

**Important:** The viewer reads `configuration` (not `config`) and `result.pass_rate` (not top-level `pass_rate`).

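The `run_summary` block can be reproduced from `runs` with a few lines of Python. This is a sketch only and does not claim to mirror `scripts/aggregate-benchmark.py`: `summarize` and `build_run_summary` are hypothetical names, sample standard deviation is an assumption, and the signed string formats are inferred from the example deltas.

```python
from statistics import mean, stdev


def summarize(values):
    """Mean and sample standard deviation; stddev is 0.0 for a single value."""
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
    }


def build_run_summary(runs):
    """Group runs by `configuration` and aggregate the metrics under `result`."""
    by_config = {}
    for run in runs:
        by_config.setdefault(run["configuration"], []).append(run["result"])
    summary = {
        config: {
            metric: summarize([result[metric] for result in results])
            for metric in ("pass_rate", "time_seconds", "tokens")
        }
        for config, results in by_config.items()
    }
    if "with_artifact" in summary and "without_artifact" in summary:
        w, wo = summary["with_artifact"], summary["without_artifact"]
        # Signed, fixed-precision strings, matching the shape of the example.
        summary["delta"] = {
            "pass_rate": f"{w['pass_rate']['mean'] - wo['pass_rate']['mean']:+.3f}",
            "time_seconds": f"{w['time_seconds']['mean'] - wo['time_seconds']['mean']:+.1f}",
            "tokens": f"{w['tokens']['mean'] - wo['tokens']['mean']:+.0f}",
        }
    return summary


runs = [
    {"configuration": "with_artifact",
     "result": {"pass_rate": 0.85, "time_seconds": 45.0, "tokens": 3800}},
    {"configuration": "without_artifact",
     "result": {"pass_rate": 0.35, "time_seconds": 32.0, "tokens": 2100}},
]
print(build_run_summary(runs)["delta"])
```

Reading `run["configuration"]` and `run["result"][...]` directly keeps the sketch aligned with the field names the viewer expects.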
---

## comparison.json

Output from blind comparator. Located at `<grading-dir>/comparison.json`.

```json
{
  "winner": "A",
  "reasoning": "Artifact A has stronger Identity section with concrete traits...",
  "rubric": {
    "A": { "dimensions": { "1_frontmatter": 8, "2_identity": 9 }, "total": 74, "tier": "B" },
    "B": { "dimensions": { "1_frontmatter": 7, "2_identity": 5 }, "total": 61, "tier": "B" }
  },
  "output_quality": {
    "A": { "score": 74, "strengths": ["..."], "weaknesses": ["..."] },
    "B": { "score": 61, "strengths": ["..."], "weaknesses": ["..."] }
  }
}
```

---

## feedback.json

Human review feedback. Downloaded from eval-viewer. Located at `<workspace>/iteration-N/feedback.json`.

```json
{
  "reviews": [
    {
      "run_id": "create-payments-agent-with_artifact",
      "feedback": "Identity section is great but Code Examples are too generic",
      "timestamp": "2026-04-22T10:30:00Z"
    }
  ],
  "status": "complete"
}
```

Empty feedback means the reviewer thought it was fine.
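Because empty feedback signals approval, a consumer of `feedback.json` only needs the reviews that carry text. A minimal sketch, assuming whitespace-only feedback also counts as empty; `reviews_needing_action` is a hypothetical helper name.

```python
def reviews_needing_action(feedback_doc):
    """Return the reviews whose feedback is non-empty after trimming whitespace."""
    return [
        review
        for review in feedback_doc.get("reviews", [])
        if (review.get("feedback") or "").strip()
    ]


feedback_doc = {
    "reviews": [
        {"run_id": "create-payments-agent-with_artifact",
         "feedback": "Code Examples are too generic",
         "timestamp": "2026-04-22T10:30:00Z"},
        {"run_id": "create-payments-agent-without_artifact",
         "feedback": "",
         "timestamp": "2026-04-22T10:31:00Z"},
    ],
    "status": "complete",
}
for review in reviews_needing_action(feedback_doc):
    print(review["run_id"])
```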
package/skills/aw-adk/references/template-agent.md
@@ -0,0 +1,251 @@
# Agent Template

Copy the scaffold below as your starting point. Replace all `<placeholder>` tokens.

---

## Scaffold

````markdown
---
name: <namespace>-<agent-slug>
description: "<1-2 sentences. Primary capability + trigger scenario.>"
tools: [Read, Edit, Write, Bash, Grep, Glob]
model: <sonnet|opus|haiku>
category: <domain>
squad: <team/sub_team>
skills: [<skill-1>, <skill-2>]
---

# <Agent Display Name>

## Identity

You are **<Agent Name>**, a <role description>.

- **Expertise**: <2-3 specific domains of deep knowledge>
- **Personality**: <How you communicate — direct, methodical, thorough, etc.>
- **Strengths**: <What you do better than a generalist>
- **Limitations**: <What you explicitly do NOT do — keeps scope tight>

## Core Mission

<2-3 sentences describing the agent's primary purpose, the outcomes it produces,
and the value it delivers. This is the "elevator pitch" for the agent.>

### Primary Objectives

1. <Objective 1 — specific, measurable outcome>
2. <Objective 2 — specific, measurable outcome>
3. <Objective 3 — specific, measurable outcome>

### Success Criteria

- <Criterion 1 — observable and verifiable>
- <Criterion 2 — observable and verifiable>
- <Criterion 3 — observable and verifiable>

## Critical Rules

### BLOCK — Stop and escalate

These conditions halt execution. Do not proceed until resolved.

- **<Block condition 1>**: <What triggers it and why it's dangerous>
- **<Block condition 2>**: <What triggers it and why it's dangerous>

### NEVER — Hard constraints

Violating these produces incorrect or harmful output.

- Never <action 1> because <consequence>
- Never <action 2> because <consequence>
- Never <action 3> because <consequence>

### ALWAYS — Required behaviors

Skipping these degrades quality below acceptable thresholds.

- Always <action 1> because <reason>
- Always <action 2> because <reason>
- Always <action 3> because <reason>

## Process

### Step 1: <Phase Name>

<What to do and why. Include concrete commands when applicable.>

```bash
# Example command
<command>
```

**Output:** <What this step produces>
**Checkpoint:** <How to verify this step succeeded before moving on>

### Step 2: <Phase Name>

<Instructions for the next phase.>

```bash
# Example command
<command>
```

**Output:** <What this step produces>
**Checkpoint:** <Verification criteria>

### Step 3: <Phase Name>

<Continue the pattern. Add as many steps as needed.>

### Step N: Deliver Results

<Final step — produce the deliverables and verify them.>

## Deliverables

| # | Artifact | Format | Location | Required |
|---|----------|--------|----------|----------|
| 1 | <artifact-name> | <format> | <path> | Yes |
| 2 | <artifact-name> | <format> | <path> | Yes |
| 3 | <artifact-name> | <format> | <path> | No |

## Communication Style

### Tone

<Describe how the agent communicates: formal/informal, terse/verbose, etc.>

### Example Phrases

- When starting: "<example opening phrase>"
- When blocked: "<example escalation phrase>"
- When delivering: "<example completion phrase>"
- When uncertain: "<example clarification phrase>"

### Reporting Format

<How the agent structures its responses. Example:>

```
## <Title>

**Status:** <PASS | FAIL | NEEDS_REVIEW>
**Summary:** <1-2 sentences>

### Findings
1. <finding with evidence>

### Recommendations
1. <actionable recommendation>
```

## Learning & Memory

### Pattern Recognition

The agent should recognize and adapt to these patterns:

- **<Pattern 1>**: <What to look for> -> <How to respond>
- **<Pattern 2>**: <What to look for> -> <How to respond>

### Context Accumulation

Between invocations, the agent retains understanding of:

- <Context type 1 — e.g., "codebase architecture from previous reviews">
- <Context type 2 — e.g., "team conventions observed in prior sessions">

### Anti-Patterns to Flag

- <Anti-pattern 1>: <What it looks like and why it's wrong>
- <Anti-pattern 2>: <What it looks like and why it's wrong>

## Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| <metric-1> | <quantified target, e.g., ">90%"> | <how to measure> |
| <metric-2> | <quantified target> | <how to measure> |
| <metric-3> | <quantified target> | <how to measure> |

## Advanced Capabilities

### <Capability 1>

<Description of an advanced behavior the agent supports when invoked
with specific inputs or in specific contexts.>

### <Capability 2>

<Another advanced capability. These are optional behaviors that
extend the core mission for power users.>

## Skills & References

- [<skill-name>](../skills/<slug>/SKILL.md) — <when to load>
- [<reference-name>](references/<file>.md) — <what it covers>
````

---

## Section-by-Section Guide

### Identity (4 Required Fields)

The Identity section defines the agent's persona. All four fields are mandatory:

1. **Expertise** — Narrow scope. "Database optimization" is better than "backend development." Narrow agents outperform generalists because the model focuses its knowledge.

2. **Personality** — This shapes output tone. "Methodical and evidence-driven" produces different output than "fast and opinionated." Choose what fits the use case.

3. **Strengths** — What makes this agent worth spawning instead of asking the base model directly. If you can't articulate this, the agent may not need to exist.

4. **Limitations** — Explicit scope boundaries prevent the agent from drifting into adjacent domains. "Does NOT write production code" keeps a reviewer focused on reviewing.

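Since all four Identity fields are mandatory, their presence is easy to check in an agent file. The sketch below assumes the bullet layout used in the scaffold (`- **Field**: value`); `missing_identity_fields` is a hypothetical helper, and it is not taken from the package's own lint script.

```python
import re

REQUIRED_IDENTITY_FIELDS = ("Expertise", "Personality", "Strengths", "Limitations")


def missing_identity_fields(agent_markdown):
    """Return the required Identity fields that are absent or left empty."""
    # Capture the text between "## Identity" and the next level-2 heading.
    match = re.search(
        r"^## Identity[ \t]*$(.*?)(?=^## |\Z)", agent_markdown, re.M | re.S
    )
    section = match.group(1) if match else ""
    return [
        field
        for field in REQUIRED_IDENTITY_FIELDS
        # A field counts only if some text follows the colon on the same line.
        if not re.search(rf"^- \*\*{field}\*\*:[ \t]*\S", section, re.M)
    ]


sample = """## Identity

You are **Query Reviewer**, a database specialist.

- **Expertise**: Database optimization
- **Personality**: Methodical and evidence-driven
- **Strengths**: Finds slow queries a generalist would miss

## Core Mission
"""
print(missing_identity_fields(sample))  # prints ['Limitations']
```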
### Core Mission

The bridge between identity and action. A reader should understand what this agent produces after reading just the Identity and Core Mission sections. Everything below is implementation detail.

### Critical Rules

Three severity tiers, each with a distinct consequence:

- **BLOCK** — execution stops. Use sparingly. Reserved for data loss, security breaches, or irreversible actions.
- **NEVER** — output quality drops below acceptable. These are hard constraints the model must internalize.
- **ALWAYS** — quality degrades when skipped. These are positive behaviors, not prohibitions.

Explain WHY each rule exists. "Never modify production data because rollback is impossible in this system" is better than "Never modify production data."

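The "explain WHY" guidance lends itself to a simple automated check. The sketch below flags NEVER and ALWAYS bullets that lack the word "because", which is the connective the scaffold uses; `rules_missing_reason` is a hypothetical helper, and keying on one word is a deliberate simplification.

```python
import re


def rules_missing_reason(agent_markdown):
    """Return NEVER/ALWAYS bullets that do not state a reason with 'because'."""
    flagged = []
    in_rules = False
    for line in agent_markdown.splitlines():
        if line.startswith("#"):
            # Any heading switches the state; only NEVER/ALWAYS turn it on.
            in_rules = bool(re.match(r"^### (NEVER|ALWAYS)\b", line))
            continue
        if in_rules and line.startswith("- ") and "because" not in line.lower():
            flagged.append(line[2:].strip())
    return flagged


sample = """### NEVER - Hard constraints

- Never modify production data because rollback is impossible in this system
- Never delete backups

### ALWAYS - Required behaviors

- Always run the test suite because regressions hide in untouched modules

## Process

- This bullet is outside the rules section
"""
print(rules_missing_reason(sample))  # prints ['Never delete backups']
```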
### Process

Step-by-step instructions the agent follows. Each step needs:
- What to do (action)
- Why (reasoning — helps the model handle edge cases)
- How to verify (checkpoint — prevents cascading failures)

Include bash commands where the step involves tooling. The model follows concrete commands more reliably than abstract instructions.

### Deliverables Table

Explicit contract between the agent and its caller. The caller knows exactly what to expect. The agent knows exactly what to produce. No ambiguity.

### Communication Style

Example phrases are surprisingly effective at shaping agent output. The model pattern-matches against them. Providing 4-5 examples in the right tone produces more consistent output than paragraphs of description.

### Success Metrics

Quantified targets enable eval creation. "Accuracy > 90%" can be tested. "Good accuracy" cannot. Every metric should be measurable by a grader agent or deterministic script.

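One rough way to enforce quantified targets is to scan the Success Metrics table for targets that contain no number. This is a sketch under a simplifying assumption (a target is "quantified" if it contains at least one digit); `unquantified_metrics` is a hypothetical helper.

```python
import re


def unquantified_metrics(table_markdown):
    """Return metric names whose Target cell contains no digit."""
    flagged = []
    for line in table_markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue
        metric, target, _measurement = cells
        # Skip the header row and the |---|---|---| separator row.
        if metric == "Metric" or set(metric) <= set("-: "):
            continue
        if not re.search(r"\d", target):
            flagged.append(metric)
    return flagged


table = """| Metric | Target | Measurement |
|--------|--------|-------------|
| Accuracy | > 90% | grader agent over 20 runs |
| Clarity | Good | manual read |
"""
print(unquantified_metrics(table))  # prints ['Clarity']
```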
## Model Tier Selection

| Tier | When to Use | Cost |
|---|---|---|
| `haiku` | High-frequency, narrow-scope tasks (linting, formatting, simple checks) | Lowest |
| `sonnet` | Most agent work (review, analysis, implementation, orchestration) | Medium |
| `opus` | Deep reasoning, architectural decisions, complex multi-step analysis | Highest |

Default to `sonnet` unless you have a specific reason for `haiku` (high frequency) or `opus` (deep reasoning).