aw-ecc 1.4.31 → 1.4.47
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/.codex/hooks/aw-post-tool-use.sh +8 -2
- package/.codex/hooks/aw-session-start.sh +11 -4
- package/.codex/hooks/aw-stop.sh +8 -2
- package/.codex/hooks/aw-user-prompt-submit.sh +10 -2
- package/.codex/hooks.json +8 -8
- package/.cursor/INSTALL.md +7 -5
- package/.cursor/hooks/adapter.js +41 -4
- package/.cursor/hooks/after-agent-response.js +62 -0
- package/.cursor/hooks/before-submit-prompt.js +7 -1
- package/.cursor/hooks/post-tool-use-failure.js +21 -0
- package/.cursor/hooks/post-tool-use.js +39 -0
- package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
- package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
- package/.cursor/hooks/subagent-start.js +22 -4
- package/.cursor/hooks/subagent-stop.js +18 -1
- package/.cursor/hooks.json +23 -2
- package/.opencode/package.json +1 -1
- package/AGENTS.md +3 -3
- package/README.md +5 -5
- package/commands/adk.md +52 -0
- package/commands/build.md +22 -9
- package/commands/deploy.md +12 -0
- package/commands/execute.md +9 -0
- package/commands/feature.md +333 -0
- package/commands/investigate.md +18 -5
- package/commands/plan.md +23 -9
- package/commands/publish.md +65 -0
- package/commands/review.md +12 -0
- package/commands/ship.md +12 -0
- package/commands/test.md +12 -0
- package/commands/verify.md +9 -0
- package/hooks/hooks.json +36 -0
- package/manifests/install-components.json +8 -0
- package/manifests/install-modules.json +83 -0
- package/manifests/install-profiles.json +7 -0
- package/package.json +1 -1
- package/scripts/ci/validate-rules.js +51 -0
- package/scripts/cursor-aw-home/hooks.json +23 -2
- package/scripts/cursor-aw-hooks/adapter.js +41 -4
- package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
- package/scripts/hooks/aw-usage-commit-created.js +32 -0
- package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
- package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
- package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
- package/scripts/hooks/aw-usage-session-start.js +48 -0
- package/scripts/hooks/aw-usage-stop.js +182 -0
- package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
- package/scripts/hooks/cost-tracker.js +3 -23
- package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
- package/scripts/hooks/shared/aw-phase-runner.js +3 -1
- package/scripts/lib/aw-hook-contract.js +2 -2
- package/scripts/lib/aw-pricing.js +306 -0
- package/scripts/lib/aw-usage-telemetry.js +472 -0
- package/scripts/lib/codex-hook-config.js +8 -8
- package/scripts/lib/cursor-hook-config.js +25 -10
- package/scripts/lib/install-targets/codex-home.js +7 -0
- package/scripts/lib/install-targets/cursor-project.js +3 -0
- package/scripts/lib/install-targets/helpers.js +20 -3
- package/skills/aw-adk/SKILL.md +317 -0
- package/skills/aw-adk/agents/analyzer.md +113 -0
- package/skills/aw-adk/agents/comparator.md +113 -0
- package/skills/aw-adk/agents/grader.md +115 -0
- package/skills/aw-adk/assets/eval_review.html +76 -0
- package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
- package/skills/aw-adk/eval-viewer/viewer.html +181 -0
- package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
- package/skills/aw-adk/evals/eval-create-agent.md +90 -0
- package/skills/aw-adk/evals/eval-create-command.md +98 -0
- package/skills/aw-adk/evals/eval-create-eval.md +89 -0
- package/skills/aw-adk/evals/eval-create-rule.md +99 -0
- package/skills/aw-adk/evals/eval-create-skill.md +97 -0
- package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
- package/skills/aw-adk/evals/eval-delete-command.md +89 -0
- package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
- package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
- package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
- package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
- package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
- package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
- package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
- package/skills/aw-adk/evals/evals.json +96 -0
- package/skills/aw-adk/references/artifact-wiring.md +162 -0
- package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
- package/skills/aw-adk/references/eval-placement-guide.md +183 -0
- package/skills/aw-adk/references/external-resources.md +75 -0
- package/skills/aw-adk/references/getting-started.md +66 -0
- package/skills/aw-adk/references/registry-structure.md +152 -0
- package/skills/aw-adk/references/rubric-agent.md +36 -0
- package/skills/aw-adk/references/rubric-command.md +36 -0
- package/skills/aw-adk/references/rubric-eval.md +36 -0
- package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
- package/skills/aw-adk/references/rubric-rule.md +36 -0
- package/skills/aw-adk/references/rubric-skill.md +36 -0
- package/skills/aw-adk/references/schemas.md +222 -0
- package/skills/aw-adk/references/template-agent.md +251 -0
- package/skills/aw-adk/references/template-command.md +279 -0
- package/skills/aw-adk/references/template-eval.md +176 -0
- package/skills/aw-adk/references/template-rule.md +119 -0
- package/skills/aw-adk/references/template-skill.md +123 -0
- package/skills/aw-adk/references/type-classifier.md +98 -0
- package/skills/aw-adk/references/writing-good-agents.md +227 -0
- package/skills/aw-adk/references/writing-good-commands.md +258 -0
- package/skills/aw-adk/references/writing-good-evals.md +271 -0
- package/skills/aw-adk/references/writing-good-rules.md +214 -0
- package/skills/aw-adk/references/writing-good-skills.md +159 -0
- package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
- package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
- package/skills/aw-adk/scripts/score-artifact.sh +179 -0
- package/skills/aw-adk/scripts/trigger-eval.py +192 -0
- package/skills/aw-build/SKILL.md +19 -2
- package/skills/aw-deploy/SKILL.md +65 -3
- package/skills/aw-design/SKILL.md +156 -0
- package/skills/aw-design/references/highrise-tokens.md +394 -0
- package/skills/aw-design/references/micro-interactions.md +76 -0
- package/skills/aw-design/references/prompt-template.md +160 -0
- package/skills/aw-design/references/quality-checklist.md +70 -0
- package/skills/aw-design/references/self-review.md +497 -0
- package/skills/aw-design/references/stitch-workflow.md +127 -0
- package/skills/aw-feature/SKILL.md +293 -0
- package/skills/aw-investigate/SKILL.md +17 -0
- package/skills/aw-plan/SKILL.md +34 -3
- package/skills/aw-publish/SKILL.md +300 -0
- package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
- package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
- package/skills/aw-publish/evals/eval-push-modes.md +67 -0
- package/skills/aw-publish/evals/eval-rules-push.md +60 -0
- package/skills/aw-publish/evals/evals.json +29 -0
- package/skills/aw-publish/references/push-modes.md +38 -0
- package/skills/aw-review/SKILL.md +88 -9
- package/skills/aw-rules-review/SKILL.md +124 -0
- package/skills/aw-rules-review/agents/openai.yaml +3 -0
- package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
- package/skills/aw-ship/SKILL.md +16 -0
- package/skills/aw-spec/SKILL.md +15 -0
- package/skills/aw-tasks/SKILL.md +15 -0
- package/skills/aw-test/SKILL.md +16 -0
- package/skills/aw-yolo/SKILL.md +4 -0
- package/skills/diagnose/SKILL.md +121 -0
- package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
- package/skills/finish-only-when-green/SKILL.md +265 -0
- package/skills/grill-me/SKILL.md +24 -0
- package/skills/grill-with-docs/SKILL.md +92 -0
- package/skills/grill-with-docs/adr-format.md +47 -0
- package/skills/grill-with-docs/context-format.md +67 -0
- package/skills/improve-codebase-architecture/SKILL.md +75 -0
- package/skills/improve-codebase-architecture/deepening.md +37 -0
- package/skills/improve-codebase-architecture/interface-design.md +44 -0
- package/skills/improve-codebase-architecture/language.md +53 -0
- package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
- package/skills/tdd/SKILL.md +115 -0
- package/skills/tdd/deep-modules.md +33 -0
- package/skills/tdd/interface-design.md +31 -0
- package/skills/tdd/mocking.md +59 -0
- package/skills/tdd/refactoring.md +10 -0
- package/skills/tdd/tests.md +61 -0
- package/skills/to-issues/SKILL.md +62 -0
- package/skills/to-prd/SKILL.md +75 -0
- package/skills/using-aw-skills/SKILL.md +170 -237
- package/skills/using-aw-skills/hooks/session-start.sh +11 -41
- package/skills/zoom-out/SKILL.md +24 -0
- package/.cursor/rules/common-agents.md +0 -53
- package/.cursor/rules/common-aw-routing.md +0 -43
- package/.cursor/rules/common-coding-style.md +0 -52
- package/.cursor/rules/common-development-workflow.md +0 -33
- package/.cursor/rules/common-git-workflow.md +0 -28
- package/.cursor/rules/common-hooks.md +0 -34
- package/.cursor/rules/common-patterns.md +0 -35
- package/.cursor/rules/common-performance.md +0 -59
- package/.cursor/rules/common-security.md +0 -33
- package/.cursor/rules/common-testing.md +0 -33
- package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
- package/.cursor/skills/article-writing/SKILL.md +0 -85
- package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
- package/.cursor/skills/aw-build/SKILL.md +0 -152
- package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
- package/.cursor/skills/aw-debug/SKILL.md +0 -49
- package/.cursor/skills/aw-deploy/SKILL.md +0 -101
- package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
- package/.cursor/skills/aw-execute/SKILL.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
- package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
- package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
- package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
- package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
- package/.cursor/skills/aw-finish/SKILL.md +0 -111
- package/.cursor/skills/aw-investigate/SKILL.md +0 -109
- package/.cursor/skills/aw-plan/SKILL.md +0 -368
- package/.cursor/skills/aw-prepare/SKILL.md +0 -118
- package/.cursor/skills/aw-review/SKILL.md +0 -118
- package/.cursor/skills/aw-ship/SKILL.md +0 -115
- package/.cursor/skills/aw-spec/SKILL.md +0 -104
- package/.cursor/skills/aw-tasks/SKILL.md +0 -138
- package/.cursor/skills/aw-test/SKILL.md +0 -118
- package/.cursor/skills/aw-verify/SKILL.md +0 -51
- package/.cursor/skills/aw-yolo/SKILL.md +0 -111
- package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
- package/.cursor/skills/bun-runtime/SKILL.md +0 -84
- package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
- package/.cursor/skills/code-simplification/SKILL.md +0 -74
- package/.cursor/skills/content-engine/SKILL.md +0 -88
- package/.cursor/skills/context-engineering/SKILL.md +0 -74
- package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
- package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
- package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
- package/.cursor/skills/frontend-slides/SKILL.md +0 -184
- package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
- package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
- package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
- package/.cursor/skills/idea-refine/SKILL.md +0 -84
- package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
- package/.cursor/skills/investor-materials/SKILL.md +0 -96
- package/.cursor/skills/investor-outreach/SKILL.md +0 -76
- package/.cursor/skills/market-research/SKILL.md +0 -75
- package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
- package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
- package/.cursor/skills/performance-optimization/SKILL.md +0 -77
- package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
- package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
- package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
- package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
- package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
- package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
- package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
- package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
- /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
- /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
- /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
- /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
- /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
- /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
- /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
- /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
- /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
- /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
- /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
- /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
- /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
- /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
- /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
- /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
- /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
- /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
- /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
- /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
- /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
- /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
- /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
- /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
- /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
- /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
- /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
- /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
- /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
- /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
+++ package/skills/aw-adk/references/registry-structure.md
@@ -0,0 +1,152 @@

# Registry Structure & Path Resolution

How CASRE artifacts are organized in the AW registry and rules system.

## Two Namespace Models

### Platform namespace (shared, read-only)

Platform is flat — no team/sub_team level. Domains are the primary organization:

```
.aw/                                   # ← root anchor — all registry paths start here
  .aw_registry/
    platform/
      <domain>/                        # core, frontend, infra, data, sdet, review, design, services
        skills/<slug>/SKILL.md
        agents/<slug>.md
        commands/<slug>.md
        evals/<type>/<slug>/eval-*.md  # colocated evals
```

**Domains:** core, frontend, infra, data, sdet, review, design, services

### Team namespaces

Teams always have `<team>/<sub_team>` and optionally nest further by domain:

```
.aw/
  .aw_registry/
    <team>/                            # crm, leadgen, revex, mobile
      <sub_team>/                      # users, events, forms, courses, memberships, core
        [<domain>/]                    # OPTIONAL: backend, core, frontend, infra, design, product, quality
          skills/<slug>/SKILL.md
          agents/<slug>.md
          commands/<slug>.md
```

**Key differences:**

- **`platform`** uses `platform/<domain>/CAS` — no team layer
- **Teams** use `<team>/<sub_team>/CAS` or `<team>/<sub_team>/<domain>/CAS`
- Some teams put CAS directly: `crm/users/agents/` (no domain nesting)
- Others nest fully: `leadgen/events/backend/agents/` (with domain)

### Rules structure (separate hierarchy)

Rules live outside the registry in `.aw/.aw_rules/`:

```
.aw/
  .aw_rules/                           # Platform-level (shared)
    platform/
      <domain>/                        # universal, security, frontend, backend, data, infra, sdet, mobile
        AGENTS.md                      # Main rules file
        references/<slug>.md           # Individual rule references
        evals/<slug>/eval-*.md         # Colocated rule evals
        <stack>/                       # Optional stack overlays
          AGENTS.md
          references/<slug>.md
    rule-manifest.json                 # Registry of all platform rules

  .aw_rules.local/                     # Team/repo-level (planned)
    rule-manifest-local.json
    render.json                        # Target mapping between platform + local layers
    <path>/AGENTS.md                   # e.g., apps/billing/AGENTS.md
```

**Stack overlays:** `backend/nestjs/`, `backend/go-connect/`, `frontend/vue/`, `frontend/nuxt/`

## Path Resolution Flow

Use this decision tree to **construct** the exact target path. The rules are deterministic — walk the tree, substitute your variables, and you'll have the path. Searching is unnecessary because every combination of namespace + domain + type produces exactly one path.

```
1. Is this a RULE?
   ├── YES → Platform rule?
   │   ├── YES → .aw/.aw_rules/platform/<domain>/references/<slug>.md
   │   └── NO  → .aw/.aw_rules.local/<path>/AGENTS.md
   └── NO → continue

2. Is this PLATFORM work?
   ├── YES → Ask: which domain?
   │        (core, frontend, infra, data, sdet, review, design, services)
   │        → .aw/.aw_registry/platform/<domain>/<type>/<slug>
   └── NO → continue

3. Is this TEAM work?
   → Ask: which team? (crm, leadgen, revex, mobile)
   → Ask: which sub_team? (users, events, courses, memberships, core, etc.)
   → Ask: need domain nesting? (optional)
      ├── NO domain  → .aw/.aw_registry/<team>/<sub_team>/<type>/<slug>
      └── YES domain → .aw/.aw_registry/<team>/<sub_team>/<domain>/<type>/<slug>
```
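Because the tree is deterministic, path construction reduces to a few conditionals. A minimal sketch, assuming hypothetical names (`registryPath` and its parameter names are illustrative, not part of the package):

```javascript
// Illustrative only: build a registry path from the decision-tree variables.
// For a non-platform rule, `path` is the repo-relative location (e.g. "apps/billing").
function registryPath({ isRule, isPlatform, team, subTeam, domain, type, slug, path }) {
  if (isRule) {
    return isPlatform
      ? `.aw/.aw_rules/platform/${domain}/references/${slug}.md`
      : `.aw/.aw_rules.local/${path}/AGENTS.md`;
  }
  const base = ".aw/.aw_registry";
  if (isPlatform) return `${base}/platform/${domain}/${type}/${slug}`;
  // Team work: domain nesting is optional.
  return domain
    ? `${base}/${team}/${subTeam}/${domain}/${type}/${slug}`
    : `${base}/${team}/${subTeam}/${type}/${slug}`;
}
```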
## Naming Conventions

`aw link` creates symlinks across all IDE directories (`.claude/`, `.cursor/`, `.codex/`) using **all-hyphens** names. The formula comes from `link.mjs`: it takes every directory name between the namespace root and the artifact type directory, appends the artifact slug, and joins them all with hyphens. When a team uses optional domain nesting (e.g., `revex/reselling/backend/agents/`), the domain becomes an extra segment in the name.

| Type | Platform | Team (flat) | Team (with domain) | Stage/core |
|---|---|---|---|---|
| Command | `aw:platform-core-plan` | `aw:revex-reselling-<slug>` | `aw:revex-reselling-backend-<slug>` | `aw:build` |
| Agent | `platform-data-db-engineer` | `revex-reselling-redis-reviewer` | `revex-reselling-backend-<slug>` | `code-reviewer` |
| Skill | `platform-core-architecture-design` | `revex-reselling-redis-patterns` | `revex-reselling-backend-<slug>` | `skill-creator` |
| Rule | `<domain>/<slug>` | repo-local path | repo-local path | — |

Platform and team (flat) examples are real names from the live system. When in doubt, verify against what `aw link` actually created — the symlinks are the source of truth:

- Skills: `ls ~/.claude/skills/` (or `.cursor/`, `.codex/`)
- Agents: `ls ~/.claude/agents/`
- Commands: `ls ~/.claude/commands/aw/`

**Examples of cross-artifact references (all use the hyphen-joined symlink name):**

```yaml
# Agent loading skills
skills:
  - revex-reselling-redis-patterns
  - platform-data-mongodb-patterns

# Command loading agents
agents:
  - platform-review-security-reviewer
  - platform-review-performance-reviewer

# subagent_type in commands/skills
subagent_type: "platform-review-security-reviewer"
```

These all match the symlink names in `~/.claude/skills/` and `~/.claude/agents/` that `aw link` creates.
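The hyphen-join formula can be sketched in a few lines. This is an illustrative reimplementation of the rule described above, not the actual `link.mjs` code; `symlinkName` and `TYPE_DIRS` are hypothetical names:

```javascript
// Illustrative: derive the symlink name from a registry-relative artifact path.
// Everything before the type directory becomes namespace segments; the entry
// right after the type directory is the slug. All segments join with hyphens.
const TYPE_DIRS = new Set(["skills", "agents", "commands"]);

function symlinkName(registryRelPath) {
  const parts = registryRelPath.replace(/\.md$/, "").split("/");
  const typeIdx = parts.findIndex((p) => TYPE_DIRS.has(p));
  const namespaceSegs = parts.slice(0, typeIdx); // dirs before the type dir
  const slug = parts[typeIdx + 1];               // artifact slug
  return [...namespaceSegs, slug].join("-");
}

// symlinkName("platform/data/agents/db-engineer.md") → "platform-data-db-engineer"
// symlinkName("revex/reselling/backend/agents/cache-reviewer.md")
//   → "revex-reselling-backend-cache-reviewer"
```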
## Colocated Eval Placement

Evals live next to their parent artifact, not in a top-level `evals/` directory:

| Parent Type | Eval Location |
|---|---|
| Skill | `skills/<slug>/evals/eval-*.md` (inside skill dir) |
| Agent | `agents/evals/<slug>/eval-*.md` (sibling evals/ dir) |
| Command | `commands/evals/<slug>/eval-*.md` (sibling evals/ dir) |
| Rule | `rules/evals/<slug>/eval-*.md` or within `.aw/.aw_rules/` references |
| Eval (meta) | `evals/evals/eval-*.md` (self-referential) |

## How `aw pull` Maps to IDE Paths

After `aw pull`, registry artifacts are synced to IDE-local locations:

| Registry | Claude Code | Cursor | Codex |
|---|---|---|---|
| `skills/<slug>/SKILL.md` | `.claude/skills/<slug>/SKILL.md` | `.cursor/rules/<slug>.mdc` | `.codex/<slug>/` |
| `agents/<slug>.md` | `.claude/agents/<slug>.md` | `.cursor/rules/<slug>.mdc` | `.codex/<slug>/` |
| `commands/<slug>.md` | `.claude/commands/<slug>.md` | N/A (manual) | N/A |

`skills-lock.json` tracks installed skills with SHA256 integrity hashes (like package-lock.json for skills).
+++ package/skills/aw-adk/references/rubric-agent.md
@@ -0,0 +1,36 @@

# Rubric: Agent Quality (10 dimensions, /100)

Evaluates the quality of an AW agent definition file.

## Scoring Table

| # | Dimension | 0 (Missing) | 5 (Partial) | 10 (Complete) |
|---|-----------|-------------|-------------|---------------|
| 1 | **Identity & Personality** | No identity defined | Has role but missing personality, memory, or experience | Full 4-field identity: role, personality, memory, experience |
| 2 | **Core Mission** | Missing or vague | 1 generic sentence | 2-3 specific sentences naming domain, outcomes, scope boundaries |
| 3 | **Critical Rules** | Missing or soft suggestions | 1-2 rules without BLOCK/NEVER keywords | 3-5 hard rules with BLOCK/NEVER/ALWAYS and measurable thresholds |
| 4 | **Process / Workflow** | Missing | Numbered list without code or commands | Step-by-step with input/output per step, code examples, bash commands |
| 5 | **Deliverables** | Missing | Table with names only | Table with format + quality bar + inline template for each deliverable |
| 6 | **Communication Style** | Missing or 1 sentence | 2 personality traits described | 3-4 example phrases showing distinct voice and tone |
| 7 | **Code Examples** | No code blocks | 1 generic block | 2+ domain-specific blocks showing good vs. bad patterns |
| 8 | **Learning & Memory** | Missing | Lists what to remember | Pattern recognition + anti-patterns + cross-session learning protocol |
| 9 | **Success Metrics** | Missing or vague | 2-3 metrics without numbers | 4-5 quantified targets with explicit thresholds |
| 10 | **Advanced Capabilities** | Missing | 1-2 bullets | 3+ mastery areas showing growth trajectory from basic to expert |

## Tier Thresholds

| Tier | Score | Interpretation |
|------|-------|----------------|
| **S** | 90-100 | Production-grade. Agent is distinctive, reliable, and self-improving. |
| **A** | 75-89 | Strong. Minor gaps in voice, metrics, or advanced capabilities. |
| **B** | 60-74 | Functional. Usable but lacks personality depth or measurable targets. |
| **C** | 40-59 | Draft. Core mission exists but workflow and deliverables need work. |
| **D** | 0-39 | Stub or broken. Not usable without significant rewrite. |

## How to use this rubric

1. Open the agent definition file.
2. Score each dimension independently (0, 5, or 10).
3. Sum all 10 scores for a total out of 100.
4. Map the total to a tier using the thresholds above.
5. Prioritize fixing dimensions that scored 0 -- these represent missing sections that block agent effectiveness.
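The scoring procedure above is easy to automate. A sketch with hypothetical names (`gradeAgent` is not part of the package); the tier thresholds are copied from the table:

```javascript
// Illustrative: sum ten 0/5/10 dimension scores and map the total to a tier.
function gradeAgent(dimensionScores) {
  if (dimensionScores.length !== 10) throw new Error("expected 10 dimensions");
  const total = dimensionScores.reduce((sum, s) => sum + s, 0);
  let tier;
  if (total >= 90) tier = "S";
  else if (total >= 75) tier = "A";
  else if (total >= 60) tier = "B";
  else if (total >= 40) tier = "C";
  else tier = "D";
  return { total, tier };
}

// gradeAgent([10, 10, 10, 5, 10, 5, 10, 5, 10, 5]) → { total: 80, tier: "A" }
```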
+++ package/skills/aw-adk/references/rubric-command.md
@@ -0,0 +1,36 @@

# Rubric: Command Quality (10 dimensions, /100)

Evaluates the quality of an AW command definition file.

## Scoring Table

| # | Dimension | 0 (Missing) | 5 (Partial) | 10 (Complete) |
|---|-----------|-------------|-------------|---------------|
| 1 | **Frontmatter** | Missing | `name` + `description` present | `name` + `description` + `argument-hint` + `mcp` list |
| 2 | **Protocol Reference** | Missing | Generic mention of a protocol | Explicit AW-PROTOCOL reference + skill loading gate requirement |
| 3 | **Agent Roster** | Missing | Informal mentions of agents in prose | Table with phase, agent name, tier, and skills loaded per agent |
| 4 | **Skill Loading Gate** | Missing | Partial mention of skill loading | Full blocking gate: agent -> skills -> learnings chain before execution |
| 5 | **Phase Structure** | No phases defined | Phases listed without I/O contracts | Numbered phases with input, output, and human checkpoints per phase |
| 6 | **Output Format** | Missing | Vague description | Concrete markdown template with placeholders and field names |
| 7 | **Human Checkpoints** | None defined | 1 vague mention of review | Explicit gates with approval criteria and proceed/halt behavior |
| 8 | **Learning / Transparency** | Missing | Mentions learning broadly | Per-step .md artifacts + transparency JSON + learnings append protocol |
| 9 | **User Communication** | Silent execution | Some status messages | Status message per phase with progress indicator format |
| 10 | **Error Handling** | Missing | "Retry if failed" | Per-phase failure paths, fallback agents, and halt conditions |

## Tier Thresholds

| Tier | Score | Interpretation |
|------|-------|----------------|
| **S** | 90-100 | Production-grade. Fully orchestrated with checkpoints and error recovery. |
| **A** | 75-89 | Strong. Minor gaps in error handling or transparency artifacts. |
| **B** | 60-74 | Functional. Phases exist but missing checkpoints or fallback paths. |
| **C** | 40-59 | Draft. Basic structure present but no orchestration rigor. |
| **D** | 0-39 | Stub or broken. Not executable as a command. |

## How to use this rubric

1. Open the command definition file.
2. Score each dimension independently (0, 5, or 10).
3. Sum all 10 scores for a total out of 100.
4. Map the total to a tier using the thresholds above.
5. Commands scoring below B-tier should not be shipped to users -- focus on phase structure and error handling first.
+++ package/skills/aw-adk/references/rubric-eval.md
@@ -0,0 +1,36 @@

# Rubric: Eval Quality (10 dimensions, /100)

Evaluates the quality of an AW eval definition file used for benchmarking agent and skill performance.

## Scoring Table

| # | Dimension | 0 (Missing) | 5 (Partial) | 10 (Complete) |
|---|-----------|-------------|-------------|---------------|
| 1 | **Frontmatter** | Missing | `name` + `target` present | `name` + `target` + `category` + `difficulty` |
| 2 | **Task Clarity** | Missing or vague | Describes intent in general terms | Concrete task with specific artifact type, namespace, and domain |
| 3 | **Context** | Missing | Minimal (just a path or name) | Namespace, domain, target path, existing related work, key packages |
| 4 | **Expected Outcomes** | Missing | 1-2 vague expectations | 4+ specific, verifiable checkboxes with concrete acceptance criteria |
| 5 | **Grading Criteria** | Missing | Just PASS/FAIL | PASS/PARTIAL/FAIL with clear thresholds for each grade |
| 6 | **Evaluation Method** | Missing | Unspecified or implied | Explicit: deterministic, model-based, or hybrid with rationale |
| 7 | **Scenario Diversity** | Single happy path only | Happy path + 1 edge case | Happy + failure + edge case + adversarial scenario |
| 8 | **False-Pass Resistance** | Would pass a clearly wrong artifact | Some specificity in assertions | Assertions target quality dimensions, not just existence checks |
| 9 | **Reproducibility** | Non-deterministic or vague setup | Mostly reproducible with minor variance | Fully deterministic or clearly bounded stochastic with seed/tolerance |
| 10 | **Baseline Tracking** | No baseline mentioned | Mentions baseline informally | Explicit with/without comparison methodology and stored baseline scores |

## Tier Thresholds

| Tier | Score | Interpretation |
|------|-------|----------------|
| **S** | 90-100 | Production-grade. Reliable, reproducible, and resistant to false passes. |
| **A** | 75-89 | Strong. Good coverage; minor gaps in diversity or baseline tracking. |
| **B** | 60-74 | Functional. Tests the right thing but may miss edge cases or allow false passes. |
| **C** | 40-59 | Draft. Task is defined but grading criteria or scenarios need work. |
| **D** | 0-39 | Stub or broken. Not usable for benchmarking. |

## How to use this rubric

1. Open the eval definition file.
2. Score each dimension independently (0, 5, or 10).
3. Sum all 10 scores for a total out of 100.
4. Map the total to a tier using the thresholds above.
5. Evals below B-tier should not be included in benchmark suites -- prioritize false-pass resistance and scenario diversity to avoid misleading results.
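To make the false-pass-resistance idea concrete: a deterministic grader should assert on quality-bearing content, not the mere existence of sections. The sketch below is purely illustrative (the section names, thresholds, and the `gradeEvalArtifact` helper are hypothetical, not from the package):

```javascript
// Illustrative deterministic check over an eval's markdown source.
function gradeEvalArtifact(markdown) {
  const checks = [
    // A presence check alone is weak (would pass boilerplate)...
    /^## Grading Criteria/m.test(markdown),
    // ...so also require PASS/PARTIAL/FAIL thresholds to be spelled out,
    /PASS/.test(markdown) && /PARTIAL/.test(markdown) && /FAIL/.test(markdown),
    // ...and at least 4 concrete expected-outcome checkboxes.
    (markdown.match(/^- \[ \]/gm) || []).length >= 4,
  ];
  const passed = checks.filter(Boolean).length;
  if (passed === checks.length) return "PASS";
  return passed > 0 ? "PARTIAL" : "FAIL";
}
```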
@@ -0,0 +1,132 @@
# Rubric: Meta-Evaluation

Score an eval's quality across 5 dimensions. Total: /50.

## Dimensions

### 1. Scenario Diversity (0-10)

How broadly does the eval cover the artifact's behavior space?

| Score | Description |
|-------|-------------|
| 0 | Single happy-path scenario only |
| 2 | Happy path with minor input variations |
| 5 | Happy path + at least one edge case or failure scenario |
| 7 | Happy path + failure + edge case coverage |
| 10 | Happy path + failure + edge case + adversarial/ambiguous inputs |

**What to look for:**
- Does the eval test what happens when required fields are missing?
- Are boundary conditions exercised (empty input, maximum length, special characters)?
- Is there at least one scenario where the artifact should refuse or fail gracefully?
- Are adversarial prompts included (conflicting instructions, prompt injection attempts)?

### 2. Grader Determinism (0-10)

How reproducible are the eval results across runs?

| Score | Description |
|-------|-------------|
| 0 | Fully subjective -- human judgment with no rubric |
| 3 | Model-based grader with vague instructions ("is this good?") |
| 5 | Model-based grader with specific criteria and examples |
| 7 | Mix of deterministic checks and bounded model grading |
| 10 | Fully deterministic script or model grader with explicit pass/fail boundaries |

**What to look for:**
- Are pass/fail criteria unambiguous enough that two reviewers would agree?
- Does the grader use exact-match, regex, or structured checks where possible?
- If model-based, does the grader prompt include concrete examples of pass and fail?
- Could you automate this grader in CI without human intervention?
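A 10-score grader in this dimension's terms is one you could run in CI unattended. A minimal sketch of deterministic structured checks -- the section names and thresholds here are hypothetical, not from any real eval:

```javascript
// Deterministic grader sketch: regex and structured checks with explicit
// pass/fail boundaries, so two reviewers (or two runs) always agree.
function grade(text) {
  const checks = [
    { name: "has-summary-section", pass: /^## Summary$/m.test(text) },
    { name: "at-least-3-scenarios", pass: (text.match(/^### Scenario /gm) || []).length >= 3 },
    { name: "no-placeholder-text", pass: !/TODO|TBD/.test(text) },
  ];
  return {
    pass: checks.every(c => c.pass),
    failed: checks.filter(c => !c.pass).map(c => c.name),
  };
}

const ok = grade("## Summary\n### Scenario A\n### Scenario B\n### Scenario C\n");
console.log(ok.pass); // prints: true
```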
### 3. False-Pass Resistance (0-10)

Would a clearly wrong artifact still pass the eval?

| Score | Description |
|-------|-------------|
| 0 | A boilerplate or empty artifact would pass |
| 3 | Checks for presence of output but not correctness |
| 5 | Some specificity -- checks named sections or keywords |
| 7 | Targeted assertions on content, structure, and relationships |
| 10 | Assertions that demonstrably fail on known-wrong artifacts |

**What to look for:**
- Does the eval check content meaning, not just content existence?
- Are there negative assertions (artifact must NOT contain X)?
- Would a copy-paste of the prompt back as output pass? If yes, score low.
- Does the eval verify relationships between parts (e.g., summary matches detail)?
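One way to operationalize the 10-score bar ("assertions that demonstrably fail on known-wrong artifacts") is a probe that feeds deliberately wrong outputs to the grader and requires every one to fail. A sketch, with hypothetical known-wrong artifacts:

```javascript
// False-pass probe: a grader earns trust only if every known-wrong
// artifact fails it. The wrong artifacts below are illustrative.
function resistsFalsePasses(grader, prompt) {
  const knownWrong = [
    "",                      // empty artifact
    prompt,                  // the prompt echoed back as output
    "Lorem ipsum dolor sit", // boilerplate filler
  ];
  return knownWrong.every(artifact => grader(artifact).pass === false);
}

// A grader with a content assertion survives the probe...
const grader = text => ({ pass: text.includes("## Summary") });
console.log(resistsFalsePasses(grader, "Write a summary")); // prints: true

// ...an existence-only grader does not.
const weak = text => ({ pass: text.length >= 0 });
console.log(resistsFalsePasses(weak, "Write a summary")); // prints: false
```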
### 4. Criteria Specificity (0-10)

How precisely are success criteria defined?

| Score | Description |
|-------|-------------|
| 0 | Vague ("output should be good", "looks correct") |
| 3 | Named qualities without measurement ("should be comprehensive") |
| 5 | Named sections or fields to check, with qualitative descriptions |
| 7 | Quantified thresholds for most criteria (counts, percentages, sizes) |
| 10 | All criteria have quantified thresholds with evidence requirements |

**What to look for:**
- Can you turn each criterion into a binary pass/fail without interpretation?
- Are numeric thresholds stated (e.g., "at least 3 scenarios", "under 200 words")?
- Does the eval require evidence citations or examples in the output?
- Are edge cases in the criteria themselves addressed (what counts as "partial")?

### 5. Baseline Tracking (0-10)

Does the eval support measuring improvement over time?

| Score | Description |
|-------|-------------|
| 0 | No baseline, no comparison methodology |
| 3 | Mentions that results should be compared, but gives no method |
| 5 | References a baseline or before/after comparison |
| 7 | Explicit with/without methodology with defined metrics |
| 10 | Reproducible baseline with versioned artifacts, stored results, and a diff method |

**What to look for:**
- Is there a "before" snapshot or known-good baseline to compare against?
- Can you re-run the eval on an older version and get comparable results?
- Does the eval track scores over time (not just pass/fail)?
- Is the comparison methodology documented enough for someone else to reproduce?

## Tier Thresholds

| Tier | Score | Interpretation |
|------|-------|----------------|
| **S** | 45-50 | Production-grade eval. Ship it. |
| **A** | 38-44 | Strong eval. Minor gaps acceptable for non-critical artifacts. |
| **B** | 30-37 | Adequate eval. Acceptable for a first iteration; improve before relying on it for regressions. |
| **C** | 20-29 | Weak eval. Provides some signal but has significant blind spots. |
| **D** | 0-19 | Eval provides negligible quality signal. Rewrite before using. |

## How to Use

1. **Score each eval after writing it.** Fill in the 5 dimensions honestly. If you authored the eval, have someone else score it -- author bias inflates scores by 5-10 points on average.

2. **Gate on tier before merging.** New evals for skills and agents should be B-tier or above. Critical-path artifacts (commands, platform rules) should target A-tier.

3. **Use dimension scores to guide improvements.** A low Scenario Diversity score is fixed by adding scenarios; a low Grader Determinism score requires rewriting the grader. Address the lowest-scoring dimension first for maximum impact.

4. **Re-score after changes.** When you modify an eval, re-run the rubric. Track the score in the eval's frontmatter or a changelog comment.

5. **Compare across evals.** Use the tier distribution to gauge overall eval maturity for a project. If most evals are C-tier, invest in eval quality before adding more artifacts.

### Scoring Template

```markdown
## Meta-Eval Score

| Dimension | Score | Notes |
|-----------|-------|-------|
| Scenario Diversity | /10 | |
| Grader Determinism | /10 | |
| False-Pass Resistance | /10 | |
| Criteria Specificity | /10 | |
| Baseline Tracking | /10 | |
| **Total** | **/50** | **Tier: ?** |
```
@@ -0,0 +1,36 @@
# Rubric: Rule Quality (10 dimensions, /100)

Evaluates the quality of an AW rule definition file. Expands the original 5-dimension /50 rubric from aw-rules to 10 dimensions /100.

## Scoring Table

| # | Dimension | 0 (Missing) | 5 (Partial) | 10 (Complete) |
|---|-----------|-------------|-------------|---------------|
| 1 | **Frontmatter** | Missing | `id` + `severity` present | `id` + `severity` + `domains` + `paths` glob patterns |
| 2 | **Rule Statement** | Missing | Vague or ambiguous phrasing | One clear sentence stating the requirement and why it matters |
| 3 | **WRONG Example** | Missing | Generic or hypothetical violation | Real violation pattern drawn from actual codebase conventions |
| 4 | **RIGHT Example** | Missing | Generic fix | Verified fix grounded in platform docs or referenced skills |
| 5 | **Skill Link** | Missing | Wrong or broken link | Correct link to an existing, relevant skill |
| 6 | **Severity Justification** | Missing | Just states MUST/SHOULD without rationale | Explains why this severity: risk, blast radius, precedent |
| 7 | **Automation Path** | Missing | "Can be linted" without specifics | Specific lint rule, pre-commit hook, or CI script that enforces this |
| 8 | **Scope Precision** | Too broad ("all code") | Domain-scoped (e.g., "backend" or "frontend") | Path-scoped with glob patterns (e.g., `services/*/src/**/*.ts`) |
| 9 | **Exceptions** | No mention of edge cases | "Edge cases exist" without detail | Explicit exceptions listed with justification for each |
| 10 | **Manifest Entry** | Missing from rule-manifest.json | Incomplete entry (missing fields) | Full entry: id, severity, domains, description, principle |

## Tier Thresholds

| Tier | Score | Interpretation |
|------|-------|----------------|
| **S** | 90-100 | Production-grade. Enforceable, scoped, and documented with automation. |
| **A** | 75-89 | Strong. Clear rule with examples; minor gaps in automation or exceptions. |
| **B** | 60-74 | Functional. Rule is understandable but lacks an enforcement path or scope precision. |
| **C** | 40-59 | Draft. Statement exists but missing examples or justification. |
| **D** | 0-39 | Stub or broken. Not enforceable or understandable. |

## How to use this rubric

1. Open the rule file and its corresponding entry in `rule-manifest.json`.
2. Score each dimension independently (0, 5, or 10).
3. Sum all 10 scores for a total out of 100.
4. Map the total to a tier using the thresholds above.
5. Rules below B-tier should not be added to the active manifest -- prioritize adding WRONG/RIGHT examples and an automation path.
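Dimension 10 lends itself to automation (this release adds a `scripts/ci/validate-rules.js` CI script). The sketch below is a hypothetical completeness check in the rubric's own terms, not that script:

```javascript
// Score a rule's manifest entry (dimension 10): 0 if absent from
// rule-manifest.json, 5 if fields are missing, 10 if complete.
// The required-field list mirrors the rubric's "Full entry" column.
const REQUIRED = ["id", "severity", "domains", "description", "principle"];

function manifestEntryScore(entry) {
  if (!entry) return 0; // missing from rule-manifest.json
  const missing = REQUIRED.filter(f => !(f in entry) || entry[f] == null);
  return missing.length === 0 ? 10 : 5; // complete vs. incomplete entry
}

const entry = {
  id: "no-raw-sql",           // hypothetical rule id
  severity: "MUST",
  domains: ["backend"],
  description: "Use the query builder, never string-built SQL.",
  principle: "Injection safety",
};
console.log(manifestEntryScore(entry)); // prints: 10
```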
@@ -0,0 +1,36 @@
# Rubric: Skill Quality (10 dimensions, /100)

Evaluates the quality of an AW skill file (SKILL.md + references/).

## Scoring Table

| # | Dimension | 0 (Missing) | 5 (Partial) | 10 (Complete) |
|---|-----------|-------------|-------------|---------------|
| 1 | **Frontmatter** | Missing | `name` + `description` present | `name` + `description` + `trigger` with "use when" clause |
| 2 | **Purpose Statement** | Missing | 1 vague sentence | 2-3 specific sentences naming domain, scope, and outcomes |
| 3 | **When to Use** | Missing | 1 trigger scenario | 3+ trigger scenarios covering distinct use cases |
| 4 | **Instructions** | Missing or vague | Numbered list without detail | Step-by-step with concrete actions, commands, decision points |
| 5 | **Code Examples** | No code blocks | 1 generic code block | 3+ domain-specific code blocks with context |
| 6 | **Checklists** | Missing | Basic bullet list | Items with pass/fail criteria, severity, and fix guidance |
| 7 | **References** | No references | Internal links only | Links to platform-docs + reference files in references/ |
| 8 | **Progressive Disclosure** | Everything in SKILL.md, or too sparse | SKILL.md + some refs | SKILL.md < 5k words, detailed content in references/ |
| 9 | **Domain Specificity** | Generic / framework-agnostic | Mentions package names | Actual package names, design tokens, API patterns with versions |
| 10 | **Output Format** | No output template | Vague description of output | Concrete markdown template with field names and placeholders |

## Tier Thresholds

| Tier | Score | Interpretation |
|------|-------|----------------|
| **S** | 90-100 | Production-grade. Ready for cross-team adoption. |
| **A** | 75-89 | Strong. Minor gaps in examples or references. |
| **B** | 60-74 | Functional. Usable but missing depth in 2-3 dimensions. |
| **C** | 40-59 | Draft. Needs significant work before team use. |
| **D** | 0-39 | Stub or broken. Not usable without a rewrite. |

## How to use this rubric

1. Open the skill's `SKILL.md` and its `references/` directory.
2. Score each dimension independently (0, 5, or 10).
3. Sum all 10 scores for a total out of 100.
4. Map the total to a tier using the thresholds above.
5. Record dimension-level scores to guide targeted improvements -- focus on any dimension scoring 0 or 5 first.
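The progressive-disclosure check (dimension 8) can be approximated mechanically. A simplified sketch -- the scoring logic below is an assumption for illustration, not how the package actually grades skills:

```javascript
// Approximate dimension 8: SKILL.md under ~5k words with depth pushed
// into references/. Real scoring would also judge whether the split
// is sensible; this only checks the measurable parts.
function wordCount(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

function progressiveDisclosureScore(skillMd, referenceFileCount) {
  if (wordCount(skillMd) < 5000 && referenceFileCount > 0) return 10;
  if (referenceFileCount > 0) return 5; // refs exist but SKILL.md is too heavy
  return 0; // everything crammed into SKILL.md (or no references at all)
}

console.log(progressiveDisclosureScore("A short, focused skill doc.", 3)); // prints: 10
```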