aw-ecc 1.4.31 → 1.4.47
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/.codex/hooks/aw-post-tool-use.sh +8 -2
- package/.codex/hooks/aw-session-start.sh +11 -4
- package/.codex/hooks/aw-stop.sh +8 -2
- package/.codex/hooks/aw-user-prompt-submit.sh +10 -2
- package/.codex/hooks.json +8 -8
- package/.cursor/INSTALL.md +7 -5
- package/.cursor/hooks/adapter.js +41 -4
- package/.cursor/hooks/after-agent-response.js +62 -0
- package/.cursor/hooks/before-submit-prompt.js +7 -1
- package/.cursor/hooks/post-tool-use-failure.js +21 -0
- package/.cursor/hooks/post-tool-use.js +39 -0
- package/.cursor/hooks/shared/aw-phase-definitions.js +53 -0
- package/.cursor/hooks/shared/aw-phase-runner.js +3 -1
- package/.cursor/hooks/subagent-start.js +22 -4
- package/.cursor/hooks/subagent-stop.js +18 -1
- package/.cursor/hooks.json +23 -2
- package/.opencode/package.json +1 -1
- package/AGENTS.md +3 -3
- package/README.md +5 -5
- package/commands/adk.md +52 -0
- package/commands/build.md +22 -9
- package/commands/deploy.md +12 -0
- package/commands/execute.md +9 -0
- package/commands/feature.md +333 -0
- package/commands/investigate.md +18 -5
- package/commands/plan.md +23 -9
- package/commands/publish.md +65 -0
- package/commands/review.md +12 -0
- package/commands/ship.md +12 -0
- package/commands/test.md +12 -0
- package/commands/verify.md +9 -0
- package/hooks/hooks.json +36 -0
- package/manifests/install-components.json +8 -0
- package/manifests/install-modules.json +83 -0
- package/manifests/install-profiles.json +7 -0
- package/package.json +1 -1
- package/scripts/ci/validate-rules.js +51 -0
- package/scripts/cursor-aw-home/hooks.json +23 -2
- package/scripts/cursor-aw-hooks/adapter.js +41 -4
- package/scripts/cursor-aw-hooks/before-submit-prompt.js +7 -1
- package/scripts/hooks/aw-usage-commit-created.js +32 -0
- package/scripts/hooks/aw-usage-post-tool-use-failure.js +56 -0
- package/scripts/hooks/aw-usage-post-tool-use.js +242 -0
- package/scripts/hooks/aw-usage-prompt-submit.js +112 -0
- package/scripts/hooks/aw-usage-session-start.js +48 -0
- package/scripts/hooks/aw-usage-stop.js +182 -0
- package/scripts/hooks/aw-usage-telemetry-send.js +84 -0
- package/scripts/hooks/cost-tracker.js +3 -23
- package/scripts/hooks/shared/aw-phase-definitions.js +53 -0
- package/scripts/hooks/shared/aw-phase-runner.js +3 -1
- package/scripts/lib/aw-hook-contract.js +2 -2
- package/scripts/lib/aw-pricing.js +306 -0
- package/scripts/lib/aw-usage-telemetry.js +472 -0
- package/scripts/lib/codex-hook-config.js +8 -8
- package/scripts/lib/cursor-hook-config.js +25 -10
- package/scripts/lib/install-targets/codex-home.js +7 -0
- package/scripts/lib/install-targets/cursor-project.js +3 -0
- package/scripts/lib/install-targets/helpers.js +20 -3
- package/skills/aw-adk/SKILL.md +317 -0
- package/skills/aw-adk/agents/analyzer.md +113 -0
- package/skills/aw-adk/agents/comparator.md +113 -0
- package/skills/aw-adk/agents/grader.md +115 -0
- package/skills/aw-adk/assets/eval_review.html +76 -0
- package/skills/aw-adk/eval-viewer/generate_review.py +164 -0
- package/skills/aw-adk/eval-viewer/viewer.html +181 -0
- package/skills/aw-adk/evals/eval-colocated-placement.md +84 -0
- package/skills/aw-adk/evals/eval-create-agent.md +90 -0
- package/skills/aw-adk/evals/eval-create-command.md +98 -0
- package/skills/aw-adk/evals/eval-create-eval.md +89 -0
- package/skills/aw-adk/evals/eval-create-rule.md +99 -0
- package/skills/aw-adk/evals/eval-create-skill.md +97 -0
- package/skills/aw-adk/evals/eval-delete-agent.md +79 -0
- package/skills/aw-adk/evals/eval-delete-command.md +89 -0
- package/skills/aw-adk/evals/eval-delete-rule.md +86 -0
- package/skills/aw-adk/evals/eval-delete-skill.md +90 -0
- package/skills/aw-adk/evals/eval-meta-eval-coverage.md +78 -0
- package/skills/aw-adk/evals/eval-meta-eval-determinism.md +81 -0
- package/skills/aw-adk/evals/eval-meta-eval-false-pass.md +81 -0
- package/skills/aw-adk/evals/eval-score-accuracy.md +95 -0
- package/skills/aw-adk/evals/eval-type-redirect.md +68 -0
- package/skills/aw-adk/evals/evals.json +96 -0
- package/skills/aw-adk/references/artifact-wiring.md +162 -0
- package/skills/aw-adk/references/cross-ide-mapping.md +71 -0
- package/skills/aw-adk/references/eval-placement-guide.md +183 -0
- package/skills/aw-adk/references/external-resources.md +75 -0
- package/skills/aw-adk/references/getting-started.md +66 -0
- package/skills/aw-adk/references/registry-structure.md +152 -0
- package/skills/aw-adk/references/rubric-agent.md +36 -0
- package/skills/aw-adk/references/rubric-command.md +36 -0
- package/skills/aw-adk/references/rubric-eval.md +36 -0
- package/skills/aw-adk/references/rubric-meta-eval.md +132 -0
- package/skills/aw-adk/references/rubric-rule.md +36 -0
- package/skills/aw-adk/references/rubric-skill.md +36 -0
- package/skills/aw-adk/references/schemas.md +222 -0
- package/skills/aw-adk/references/template-agent.md +251 -0
- package/skills/aw-adk/references/template-command.md +279 -0
- package/skills/aw-adk/references/template-eval.md +176 -0
- package/skills/aw-adk/references/template-rule.md +119 -0
- package/skills/aw-adk/references/template-skill.md +123 -0
- package/skills/aw-adk/references/type-classifier.md +98 -0
- package/skills/aw-adk/references/writing-good-agents.md +227 -0
- package/skills/aw-adk/references/writing-good-commands.md +258 -0
- package/skills/aw-adk/references/writing-good-evals.md +271 -0
- package/skills/aw-adk/references/writing-good-rules.md +214 -0
- package/skills/aw-adk/references/writing-good-skills.md +159 -0
- package/skills/aw-adk/scripts/aggregate-benchmark.py +190 -0
- package/skills/aw-adk/scripts/lint-artifact.sh +211 -0
- package/skills/aw-adk/scripts/score-artifact.sh +179 -0
- package/skills/aw-adk/scripts/trigger-eval.py +192 -0
- package/skills/aw-build/SKILL.md +19 -2
- package/skills/aw-deploy/SKILL.md +65 -3
- package/skills/aw-design/SKILL.md +156 -0
- package/skills/aw-design/references/highrise-tokens.md +394 -0
- package/skills/aw-design/references/micro-interactions.md +76 -0
- package/skills/aw-design/references/prompt-template.md +160 -0
- package/skills/aw-design/references/quality-checklist.md +70 -0
- package/skills/aw-design/references/self-review.md +497 -0
- package/skills/aw-design/references/stitch-workflow.md +127 -0
- package/skills/aw-feature/SKILL.md +293 -0
- package/skills/aw-investigate/SKILL.md +17 -0
- package/skills/aw-plan/SKILL.md +34 -3
- package/skills/aw-publish/SKILL.md +300 -0
- package/skills/aw-publish/evals/eval-confirmation-gate.md +60 -0
- package/skills/aw-publish/evals/eval-intent-detection.md +111 -0
- package/skills/aw-publish/evals/eval-push-modes.md +67 -0
- package/skills/aw-publish/evals/eval-rules-push.md +60 -0
- package/skills/aw-publish/evals/evals.json +29 -0
- package/skills/aw-publish/references/push-modes.md +38 -0
- package/skills/aw-review/SKILL.md +88 -9
- package/skills/aw-rules-review/SKILL.md +124 -0
- package/skills/aw-rules-review/agents/openai.yaml +3 -0
- package/skills/aw-rules-review/scripts/generate-review-template.mjs +323 -0
- package/skills/aw-ship/SKILL.md +16 -0
- package/skills/aw-spec/SKILL.md +15 -0
- package/skills/aw-tasks/SKILL.md +15 -0
- package/skills/aw-test/SKILL.md +16 -0
- package/skills/aw-yolo/SKILL.md +4 -0
- package/skills/diagnose/SKILL.md +121 -0
- package/skills/diagnose/scripts/hitl-loop.template.sh +41 -0
- package/skills/finish-only-when-green/SKILL.md +265 -0
- package/skills/grill-me/SKILL.md +24 -0
- package/skills/grill-with-docs/SKILL.md +92 -0
- package/skills/grill-with-docs/adr-format.md +47 -0
- package/skills/grill-with-docs/context-format.md +67 -0
- package/skills/improve-codebase-architecture/SKILL.md +75 -0
- package/skills/improve-codebase-architecture/deepening.md +37 -0
- package/skills/improve-codebase-architecture/interface-design.md +44 -0
- package/skills/improve-codebase-architecture/language.md +53 -0
- package/skills/local-ghl-setup-from-screenshot/SKILL.md +538 -0
- package/skills/tdd/SKILL.md +115 -0
- package/skills/tdd/deep-modules.md +33 -0
- package/skills/tdd/interface-design.md +31 -0
- package/skills/tdd/mocking.md +59 -0
- package/skills/tdd/refactoring.md +10 -0
- package/skills/tdd/tests.md +61 -0
- package/skills/to-issues/SKILL.md +62 -0
- package/skills/to-prd/SKILL.md +75 -0
- package/skills/using-aw-skills/SKILL.md +170 -237
- package/skills/using-aw-skills/hooks/session-start.sh +11 -41
- package/skills/zoom-out/SKILL.md +24 -0
- package/.cursor/rules/common-agents.md +0 -53
- package/.cursor/rules/common-aw-routing.md +0 -43
- package/.cursor/rules/common-coding-style.md +0 -52
- package/.cursor/rules/common-development-workflow.md +0 -33
- package/.cursor/rules/common-git-workflow.md +0 -28
- package/.cursor/rules/common-hooks.md +0 -34
- package/.cursor/rules/common-patterns.md +0 -35
- package/.cursor/rules/common-performance.md +0 -59
- package/.cursor/rules/common-security.md +0 -33
- package/.cursor/rules/common-testing.md +0 -33
- package/.cursor/skills/api-and-interface-design/SKILL.md +0 -75
- package/.cursor/skills/article-writing/SKILL.md +0 -85
- package/.cursor/skills/aw-brainstorm/SKILL.md +0 -115
- package/.cursor/skills/aw-build/SKILL.md +0 -152
- package/.cursor/skills/aw-build/evals/build-stage-cases.json +0 -28
- package/.cursor/skills/aw-debug/SKILL.md +0 -49
- package/.cursor/skills/aw-deploy/SKILL.md +0 -101
- package/.cursor/skills/aw-deploy/evals/deploy-stage-cases.json +0 -32
- package/.cursor/skills/aw-execute/SKILL.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-code.md +0 -47
- package/.cursor/skills/aw-execute/references/mode-docs.md +0 -28
- package/.cursor/skills/aw-execute/references/mode-infra.md +0 -44
- package/.cursor/skills/aw-execute/references/mode-migration.md +0 -58
- package/.cursor/skills/aw-execute/references/worker-implementer.md +0 -26
- package/.cursor/skills/aw-execute/references/worker-parallel-worker.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-quality-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/references/worker-spec-reviewer.md +0 -23
- package/.cursor/skills/aw-execute/scripts/build-worker-bundle.js +0 -229
- package/.cursor/skills/aw-finish/SKILL.md +0 -111
- package/.cursor/skills/aw-investigate/SKILL.md +0 -109
- package/.cursor/skills/aw-plan/SKILL.md +0 -368
- package/.cursor/skills/aw-prepare/SKILL.md +0 -118
- package/.cursor/skills/aw-review/SKILL.md +0 -118
- package/.cursor/skills/aw-ship/SKILL.md +0 -115
- package/.cursor/skills/aw-spec/SKILL.md +0 -104
- package/.cursor/skills/aw-tasks/SKILL.md +0 -138
- package/.cursor/skills/aw-test/SKILL.md +0 -118
- package/.cursor/skills/aw-verify/SKILL.md +0 -51
- package/.cursor/skills/aw-yolo/SKILL.md +0 -111
- package/.cursor/skills/browser-testing-with-devtools/SKILL.md +0 -81
- package/.cursor/skills/bun-runtime/SKILL.md +0 -84
- package/.cursor/skills/ci-cd-and-automation/SKILL.md +0 -71
- package/.cursor/skills/code-simplification/SKILL.md +0 -74
- package/.cursor/skills/content-engine/SKILL.md +0 -88
- package/.cursor/skills/context-engineering/SKILL.md +0 -74
- package/.cursor/skills/deprecation-and-migration/SKILL.md +0 -75
- package/.cursor/skills/documentation-and-adrs/SKILL.md +0 -75
- package/.cursor/skills/documentation-lookup/SKILL.md +0 -90
- package/.cursor/skills/frontend-slides/SKILL.md +0 -184
- package/.cursor/skills/frontend-slides/STYLE_PRESETS.md +0 -330
- package/.cursor/skills/frontend-ui-engineering/SKILL.md +0 -68
- package/.cursor/skills/git-workflow-and-versioning/SKILL.md +0 -75
- package/.cursor/skills/idea-refine/SKILL.md +0 -84
- package/.cursor/skills/incremental-implementation/SKILL.md +0 -75
- package/.cursor/skills/investor-materials/SKILL.md +0 -96
- package/.cursor/skills/investor-outreach/SKILL.md +0 -76
- package/.cursor/skills/market-research/SKILL.md +0 -75
- package/.cursor/skills/mcp-server-patterns/SKILL.md +0 -67
- package/.cursor/skills/nextjs-turbopack/SKILL.md +0 -44
- package/.cursor/skills/performance-optimization/SKILL.md +0 -77
- package/.cursor/skills/security-and-hardening/SKILL.md +0 -70
- package/.cursor/skills/using-aw-skills/SKILL.md +0 -290
- package/.cursor/skills/using-aw-skills/evals/skill-trigger-cases.tsv +0 -25
- package/.cursor/skills/using-aw-skills/evals/test-skill-triggers.sh +0 -171
- package/.cursor/skills/using-aw-skills/hooks/hooks.json +0 -9
- package/.cursor/skills/using-aw-skills/hooks/session-start.sh +0 -67
- package/.cursor/skills/using-platform-skills/SKILL.md +0 -163
- package/.cursor/skills/using-platform-skills/evals/platform-selection-cases.json +0 -52
- /package/.cursor/rules/{golang-coding-style.md → golang-coding-style.mdc} +0 -0
- /package/.cursor/rules/{golang-hooks.md → golang-hooks.mdc} +0 -0
- /package/.cursor/rules/{golang-patterns.md → golang-patterns.mdc} +0 -0
- /package/.cursor/rules/{golang-security.md → golang-security.mdc} +0 -0
- /package/.cursor/rules/{golang-testing.md → golang-testing.mdc} +0 -0
- /package/.cursor/rules/{kotlin-coding-style.md → kotlin-coding-style.mdc} +0 -0
- /package/.cursor/rules/{kotlin-hooks.md → kotlin-hooks.mdc} +0 -0
- /package/.cursor/rules/{kotlin-patterns.md → kotlin-patterns.mdc} +0 -0
- /package/.cursor/rules/{kotlin-security.md → kotlin-security.mdc} +0 -0
- /package/.cursor/rules/{kotlin-testing.md → kotlin-testing.mdc} +0 -0
- /package/.cursor/rules/{php-coding-style.md → php-coding-style.mdc} +0 -0
- /package/.cursor/rules/{php-hooks.md → php-hooks.mdc} +0 -0
- /package/.cursor/rules/{php-patterns.md → php-patterns.mdc} +0 -0
- /package/.cursor/rules/{php-security.md → php-security.mdc} +0 -0
- /package/.cursor/rules/{php-testing.md → php-testing.mdc} +0 -0
- /package/.cursor/rules/{python-coding-style.md → python-coding-style.mdc} +0 -0
- /package/.cursor/rules/{python-hooks.md → python-hooks.mdc} +0 -0
- /package/.cursor/rules/{python-patterns.md → python-patterns.mdc} +0 -0
- /package/.cursor/rules/{python-security.md → python-security.mdc} +0 -0
- /package/.cursor/rules/{python-testing.md → python-testing.mdc} +0 -0
- /package/.cursor/rules/{swift-coding-style.md → swift-coding-style.mdc} +0 -0
- /package/.cursor/rules/{swift-hooks.md → swift-hooks.mdc} +0 -0
- /package/.cursor/rules/{swift-patterns.md → swift-patterns.mdc} +0 -0
- /package/.cursor/rules/{swift-security.md → swift-security.mdc} +0 -0
- /package/.cursor/rules/{swift-testing.md → swift-testing.mdc} +0 -0
- /package/.cursor/rules/{typescript-coding-style.md → typescript-coding-style.mdc} +0 -0
- /package/.cursor/rules/{typescript-hooks.md → typescript-hooks.mdc} +0 -0
- /package/.cursor/rules/{typescript-patterns.md → typescript-patterns.mdc} +0 -0
- /package/.cursor/rules/{typescript-security.md → typescript-security.mdc} +0 -0
- /package/.cursor/rules/{typescript-testing.md → typescript-testing.mdc} +0 -0
package/skills/aw-adk/references/writing-good-evals.md
@@ -0,0 +1,271 @@

# Writing Good Evals

An eval measures whether an AI agent, skill, or command actually works. Good evals discriminate between correct and incorrect outputs. Bad evals pass for everything and give false confidence.

## Before / After: Eval Quality

### Bad — always-pass eval

```yaml
name: test-code-review
scenario: "Review a PR with a security vulnerability"
assertions:
  - "Output file exists"
  - "Output is not empty"
  - "Output contains the word 'review'"
```

Problems: This passes for *any* output that mentions "review." An agent that writes "This is a review. Everything looks great." passes despite missing the security vulnerability entirely. The eval has zero discriminating power.

### Good — discriminating eval

```yaml
name: test-code-review-detects-hardcoded-secret
scenario: "Review a PR that introduces `const API_KEY = 'sk-live-abc123'` in a service file"
assertions:
  structural:
    - "Output contains a finding with severity CRITICAL or HIGH"
    - "Output references the file path containing the hardcoded secret"
    - "Output mentions the specific line or pattern (API_KEY, sk-live)"
  behavioral:
    - "Verdict is BLOCK (not APPROVE)"
    - "Finding includes a remediation suggestion (env variable or secret manager)"
  negative:
    - "Does NOT approve the PR"
    - "Does NOT classify the finding as LOW or MEDIUM"
```

Why this works: The eval checks that the agent found the *specific* issue, classified it correctly, and recommended the right fix. An agent that misses the vulnerability fails. An agent that finds it but approves anyway fails. Only correct behavior passes.

### Bad — subjective grader with no rubric

```yaml
grader: "model-based"
prompt: "Did the agent do a good job? Rate 1-5."
pass_threshold: 3
```

A model grading "good job" with no criteria will give 4/5 to almost anything coherent.

### Good — rubric-based model grader

```yaml
grader: "model-based"
rubric:
  criteria:
    - name: "vulnerability_detected"
      weight: 40
      description: "Agent identified the hardcoded API key as a security issue"
      pass: "Explicitly mentions hardcoded secret/API key/credential"
      fail: "Does not mention secrets, credentials, or hardcoded values"
    - name: "correct_severity"
      weight: 30
      description: "Finding is classified as CRITICAL or HIGH"
      pass: "Severity is CRITICAL or HIGH"
      fail: "Severity is MEDIUM, LOW, or not specified"
    - name: "actionable_fix"
      weight: 20
      description: "Provides a concrete remediation"
      pass: "Suggests environment variable, secret manager, or similar"
      fail: "No fix suggested or fix is vague ('fix the issue')"
    - name: "correct_verdict"
      weight: 10
      description: "Overall verdict blocks the PR"
      pass: "Verdict is BLOCK"
      fail: "Verdict is APPROVE or APPROVE WITH COMMENTS"
pass_threshold: 80
```

## Anti-Pattern Catalog

### 1. Happy-Path Only

**Symptom:** All eval scenarios test the golden path. No edge cases, no adversarial inputs, no ambiguous situations.

**Fix:** For every happy-path scenario, write at least one:
- **Failure scenario:** Input that should trigger a specific error or rejection
- **Edge case:** Empty input, massive input, malformed input
- **Ambiguous case:** Input where the correct answer requires judgment

```yaml
scenarios:
  - name: "happy_path"
    input: "PR with clear bug"
    expected: "Agent finds the bug"
  - name: "false_positive_resistance"
    input: "PR with code that looks suspicious but is correct"
    expected: "Agent does NOT flag it as a bug"
  - name: "empty_input"
    input: "PR with no changed files"
    expected: "Agent reports no changes to review"
```

### 2. Subjective Graders Without Rubrics

**Symptom:** Model-based grader with a vague prompt like "is this response good?"

**Fix:** Always provide a rubric with weighted criteria, concrete pass/fail descriptions, and a numeric threshold.

### 3. No Baseline Comparison

**Symptom:** Eval shows 85% pass rate. Is that good? There's no baseline to compare against.

**Fix:** Establish baselines:
- **Before:** Run eval against a naive prompt (no skill/agent customization). Record pass rate.
- **After:** Run eval with the skill/agent. The delta is the real measure of value.
- **Regression:** Re-run evals after changes. Pass rate should not drop.
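
The before-vs-after comparison reduces to simple arithmetic. A minimal sketch (the `passRate` helper and the sample result arrays are illustrative, not part of any real harness):

```typescript
// Illustrative baseline-vs-skill comparison; result arrays are made up.
function passRate(results: boolean[]): number {
  return results.filter(Boolean).length / results.length;
}

const baseline = passRate([true, false, false, true]);  // naive prompt
const withSkill = passRate([true, true, true, false]);  // same eval, with the skill
const lift = withSkill - baseline;                      // the real measure of value
console.log({ baseline, withSkill, lift });
```

The `lift` number, not the raw pass rate, is what to report and track across versions.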

### 4. Assertions Too Weak

**Symptom:** Assertions check for the presence of keywords rather than the correctness of the output.

**Fix:** Layer assertions by strength:

| Strength | Example | Catches |
|----------|---------|---------|
| **Weak** | "Output contains 'error'" | Almost nothing |
| **Medium** | "Output contains finding with severity CRITICAL for file X" | Wrong file, wrong severity |
| **Strong** | "Output blocks PR, cites line 42, suggests env variable replacement" | Wrong line, wrong fix, wrong verdict |

Use medium and strong assertions. Weak assertions are only useful as sanity checks alongside stronger ones.

### 5. No Failure Scenarios

**Symptom:** Every scenario expects the agent to succeed. No scenarios test what happens when the agent *should* fail or refuse.

**Fix:** Include negative test cases:

```yaml
- name: "should_not_hallucinate_findings"
  input: "Clean PR with no issues"
  expected: "Agent approves with no findings (or only minor suggestions)"
  fail_if: "Agent reports CRITICAL or HIGH findings"

- name: "should_refuse_out_of_scope"
  input: "Request to review infrastructure Terraform, but agent is scoped to backend TypeScript"
  expected: "Agent reports this is outside its scope"
  fail_if: "Agent attempts to review Terraform files"
```

### 6. Vanity Metrics

**Symptom:** Eval measures "did the agent produce output?" (100% pass rate) instead of "did the agent produce correct output?"

**Fix:** Every assertion must test *correctness*, not just *activity*. If your eval has a 95%+ pass rate on first run, your assertions are probably too weak.

## Scenario Design Methodology

### Start From Failure Modes, Not Success Criteria

Most eval authors start by asking "what should the agent do right?" This produces happy-path-only evals.

Instead, start by asking: **"How can the agent fail?"**

```
Failure mode analysis for code-review agent:
1. Misses a real vulnerability (false negative)
2. Flags clean code as vulnerable (false positive)
3. Finds the issue but assigns wrong severity
4. Finds the issue but suggests wrong fix
5. Produces unstructured output that can't be parsed
6. Crashes on empty diff
7. Reviews out-of-scope files
8. Approves a PR that should be blocked
```

Each failure mode becomes a scenario. This produces evals with real discriminating power.

### Scenario Template

```yaml
- name: "descriptive_snake_case_name"
  description: "One sentence explaining what this tests"
  failure_mode: "Which failure mode this scenario targets"
  input:
    description: "What the agent receives"
    files: [...] # or inline content
  expected:
    behavior: "What the agent should do"
    output_contains: [...] # structural checks
    output_must_not_contain: [...] # negative checks
  grader: "deterministic | model-based | hybrid"
```

## Grader Selection

### Deterministic (Script-Based)

**Use when:** The correct answer has a specific structure, contains specific strings, or follows a checkable pattern.

```yaml
grader: deterministic
checks:
  - type: "json_schema"
    schema: "review-output.schema.json"
  - type: "contains"
    values: ["CRITICAL", "hardcoded", "API_KEY"]
  - type: "regex"
    pattern: "severity:\\s*(CRITICAL|HIGH)"
```

**Strengths:** Fast, reproducible, zero cost, no false positives.
**Weaknesses:** Can't evaluate quality of prose, reasoning, or judgment.
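
A minimal sketch of how the `contains` and `regex` check types might execute (the `Check` type and `runChecks` are illustrative names, not a real harness API):

```typescript
// Minimal deterministic check runner; shapes mirror the YAML above.
type Check =
  | { type: "contains"; values: string[] }
  | { type: "regex"; pattern: string };

function runChecks(output: string, checks: Check[]): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const check of checks) {
    if (check.type === "contains") {
      // Every listed substring must appear in the output.
      for (const v of check.values) {
        if (!output.includes(v)) failures.push(`missing substring: ${v}`);
      }
    } else if (!new RegExp(check.pattern).test(output)) {
      failures.push(`no match for pattern: ${check.pattern}`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```

Because every check is a pure string operation, the same inputs always produce the same verdict, which is exactly what makes this grader reproducible.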

### Model-Based

**Use when:** Correctness requires understanding natural language, evaluating reasoning quality, or assessing subjective attributes.

```yaml
grader: model-based
model: opus # use a strong model for grading
rubric: [see rubric example above]
```

**Strengths:** Can evaluate nuanced quality, reasoning, and judgment.
**Weaknesses:** Slower, costs tokens, can be inconsistent. Always use a rubric.

### Hybrid

**Use when:** Both structure and quality matter.

```yaml
grader: hybrid
deterministic:
  - "Output is valid JSON matching schema"
  - "Contains at least one finding with severity field"
  - "Verdict field is present and one of BLOCK/APPROVE/APPROVE_WITH_COMMENTS"
model_based:
  - criteria: "Finding explanations are clear and actionable"
    weight: 50
  - criteria: "Remediation suggestions are specific and correct"
    weight: 50
```

Run deterministic checks first. If they fail, skip the model-based grading (saves tokens). Only grade quality if structure is correct.
|
|
245
|
+
|
|
246
|
+
## Bottom-Up Eval Design
|
|
247
|
+
|
|
248
|
+
Let failure modes emerge from real usage rather than inventing them theoretically.
|
|
249
|
+
|
|
250
|
+
### Process
|
|
251
|
+
|
|
252
|
+
1. **Deploy the agent/skill** with minimal evals (basic smoke tests).
|
|
253
|
+
2. **Collect real failures** from actual usage (wrong outputs, user corrections, missed issues).
|
|
254
|
+
3. **Convert each failure into a scenario** with assertions that would have caught it.
|
|
255
|
+
4. **Run the expanded eval suite** and fix the agent/skill until it passes.
|
|
256
|
+
5. **Repeat** as new failure modes surface.
|
|
257
|
+
|
|
258
|
+
This produces evals grounded in reality rather than theoretical completeness.
|
|
259
|
+
|
|
260
|
+
## Eval Quality Checklist
|
|
261
|
+
|
|
262
|
+
- [ ] At least 1 failure scenario for every happy-path scenario
|
|
263
|
+
- [ ] Assertions test correctness, not just activity (no "output exists" only)
|
|
264
|
+
- [ ] Negative assertions included (what should NOT appear)
|
|
265
|
+
- [ ] Baseline established (pass rate before vs after the skill/agent)
|
|
266
|
+
- [ ] Grader type matches the evaluation need (deterministic for structure, model-based for quality)
|
|
267
|
+
- [ ] Model-based graders have weighted rubrics with concrete pass/fail criteria
|
|
268
|
+
- [ ] Pass threshold set below 100% only with documented justification
|
|
269
|
+
- [ ] Edge cases covered: empty input, large input, malformed input, out-of-scope input
|
|
270
|
+
- [ ] Scenarios derived from failure modes, not just success criteria
|
|
271
|
+
- [ ] Eval pass rate on first run is below 90% (if above, assertions are likely too weak)
|
|
@@ -0,0 +1,214 @@
|
|
|
1
|
+
# Writing Good Rules
|
|
2
|
+
|
|
3
|
+
A rule is an enforceable constraint that is always active for matching files. Rules are not skills (they don't teach techniques) and they are not agents (they don't reason autonomously). They are checks: clear, binary, and ideally automatable.
|
|
4
|
+
|
|
5
|
+
## Before / After: Rule Definition
|
|
6
|
+
|
|
7
|
+
### Bad — vague, unenforceable
|
|
8
|
+
|
|
9
|
+
```markdown
|
|
10
|
+
## Code Quality
|
|
11
|
+
Write good code. Follow best practices. Make sure everything is clean and well-organized.
|
|
12
|
+
```
|
|
13
|
+
|
|
14
|
+
Problems: "Good code" is subjective. No agent or linter can enforce "clean." No WRONG/RIGHT examples. No severity. This rule will be ignored because it provides no actionable constraint.
|
|
15
|
+
|
|
16
|
+
### Good — specific, enforceable, with examples
|
|
17
|
+
|
|
18
|
+
```markdown
|
|
19
|
+
## no-bare-any
|
|
20
|
+
|
|
21
|
+
**Severity:** MUST
|
|
22
|
+
|
|
23
|
+
Do not use bare `any` type in TypeScript. Use `unknown` for external data and narrow with type guards, or use a specific interface/type.
|
|
24
|
+
|
|
25
|
+
### WRONG
|
|
26
|
+
```typescript
|
|
27
|
+
function processPayload(data: any) {
|
|
28
|
+
return data.items.map((item: any) => item.name);
|
|
29
|
+
}
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
### RIGHT
|
|
33
|
+
```typescript
|
|
34
|
+
interface OrderPayload {
|
|
35
|
+
items: Array<{ name: string; quantity: number }>;
|
|
36
|
+
}
|
|
37
|
+
|
|
38
|
+
function processPayload(data: unknown): string[] {
|
|
39
|
+
const payload = validateOrderPayload(data);
|
|
40
|
+
return payload.items.map((item) => item.name);
|
|
41
|
+
}
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
### Why
|
|
45
|
+
Bare `any` disables TypeScript's type system at the boundary where it matters most — external data. Bugs from unvalidated external data are the #1 source of production incidents in our services.
|
|
46
|
+
|
|
47
|
+
### Automation
|
|
48
|
+
- **ESLint:** `@typescript-eslint/no-explicit-any` (error)
|
|
49
|
+
- **CI gate:** Fails PR if new `any` introduced in changed files
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
## Before / After: Severity
|
|
53
|
+
|
|
54
|
+
### Bad — no severity, everything feels optional
|
|
55
|
+
|
|
56
|
+
```markdown
|
|
57
|
+
- Use structured logging
|
|
58
|
+
- Don't use console.log
|
|
59
|
+
- Add tests for new files
|
|
60
|
+
- Use kebab-case for file names
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
### Good — explicit severity with rationale
|
|
64
|
+
|
|
65
|
+
```markdown
|
|
66
|
+
- **MUST:** No `console.log` in production code — use `@platform-core/logger`. [Security/Observability risk: console.log bypasses structured logging, correlation IDs, and log level controls]
|
|
67
|
+
- **MUST:** Add test file for every new source file. [Quality gate: untested code is a regression waiting to happen]
|
|
68
|
+
- **SHOULD:** Use kebab-case for file names. [Consistency: cross-platform path issues with mixed case]
|
|
69
|
+
- **MAY:** Prefer `readonly` modifier on properties that shouldn't change after construction. [Style: helps communicate intent]
|
|
70
|
+
```

## Anti-Pattern Catalog

### 1. No WRONG/RIGHT Examples

**Symptom:** Rule says "don't do X" but never shows what X looks like or what to do instead.

**Fix:** Every rule needs at minimum one WRONG example (so the agent recognizes the pattern) and one RIGHT example (so the agent knows the fix).

### 2. Unclear Severity

**Symptom:** All rules read the same. The agent can't distinguish "will cause a security breach" from "slightly less readable."

**Fix:** Use three tiers consistently:

| Severity | Meaning | Consequence of Violation |
|----------|---------|--------------------------|
| **MUST** | Security risk, data loss, or correctness issue | Blocks PR / commit |
| **SHOULD** | Quality, maintainability, or reliability issue | Flagged in review, should fix |
| **MAY** | Style preference or optimization opportunity | Suggestion only |

### 3. No Automation Path

**Symptom:** Rule exists only as prose. No linter, no CI check, no automated detection.

**Fix:** Every MUST rule should have an automation path documented:

```markdown
### Automation
- **Linter rule:** `rule-name` in `.eslintrc` / `pyproject.toml` / etc.
- **CI check:** Describe the CI step that enforces this
- **Manual review:** If no automation exists, document the review checklist
```

If a MUST rule can't be automated today, note it as a gap and track it.
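
As a concrete illustration of an automation path, the `any` CI gate mentioned above could be backed by a small script. This is a sketch only: the function name and the match patterns are assumptions for illustration, not this package's actual implementation, and a real gate would scan only the added lines of changed files.

```typescript
// Hypothetical sketch: report 1-based line numbers that introduce an explicit `any`.
function findExplicitAny(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    const code = line.split("//")[0]; // ignore trailing line comments
    // Match `: any` annotations, `as any` assertions, and `<any>` type arguments.
    if (/:\s*any\b|\bas\s+any\b|<any>/.test(code)) {
      hits.push(index + 1);
    }
  });
  return hits;
}
```

The linter remains the primary enforcement; a script like this only backs the "no new `any` in changed files" gate in CI.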

### 4. Too Broad Scope

**Symptom:** Rule applies to "all code" but is really about a specific domain (e.g., "always use transactions" applies to database code, not utility functions).

**Fix:** Specify the scope explicitly:

```markdown
**Scope:** NestJS service classes that perform database writes
**Does not apply to:** Utility functions, test files, scripts
```

### 5. Unverifiable Claims

**Symptom:** Rule says "ensure high performance" or "maintain code quality" — neither can be checked by reading code.

**Fix:** Rules must be verifiable by examining the code (or running a tool). Ask: "Can I look at a file and determine yes/no whether this rule is followed?" If not, it's not a rule — it's an aspiration.

### 6. Missing "Why"

**Symptom:** Rule says MUST but never explains the consequence of violation. The agent follows it mechanically but can't generalize to novel situations.

**Fix:** Every rule needs a "Why" section. One or two sentences explaining the real-world consequence:

```markdown
### Why
Empty catch blocks hide failures. In production, a swallowed database error means
the user sees success while their data was never saved. The bug surfaces hours later
when the missing data causes downstream failures that are nearly impossible to trace.
```
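
The failure mode that "Why" describes can be shown in a few lines. This is illustrative only: `save` is a stand-in callback, not a real repository API.

```typescript
// WRONG pattern: a swallowed error turns into a false success.
function saveUnsafe(save: () => void): boolean {
  try {
    save();
  } catch {
    // the database error vanishes here
  }
  return true; // caller sees success even though nothing was saved
}

// RIGHT (sketch): surface the failure instead of hiding it.
function saveSafe(save: () => void): boolean {
  try {
    save();
    return true;
  } catch {
    return false;
  }
}
```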

## Writing Deterministic Rules

The best rules can be checked by a script or linter, not just by human judgment.

### Characteristics of Deterministic Rules

1. **Pattern-matchable:** The violation can be detected by searching for a specific code pattern.
2. **Binary outcome:** Code either violates the rule or it doesn't. No "it depends."
3. **Context-free (ideally):** The rule can be checked per-file without understanding the whole system.

### Examples

| Deterministic (Good) | Non-Deterministic (Rewrite) |
|---|---|
| "No `console.log` in `src/` directories" | "Use appropriate logging" |
| "Every `@Body()` parameter must use a class-validator DTO" | "Validate input properly" |
| "No `any` type in TypeScript files" | "Use good types" |
| "Every new `.ts` file in `src/` must have a `.spec.ts` file" | "Write tests for new code" |

### When Rules Can't Be Fully Deterministic

Some rules require judgment (e.g., "error messages must be actionable"). For these:
1. Provide 3+ WRONG/RIGHT examples spanning different scenarios.
2. Document the judgment criteria explicitly.
3. Assign to agent review rather than automated linting.

## Verification Chain

When an agent checks a rule, it should follow this chain:

```
1. Read the rule definition (constraint + severity + examples)
   ↓
2. Read the linked skill (if the rule references one, for deeper context)
   ↓
3. Read platform docs (if the rule references platform APIs or libraries)
   ↓
4. Search the codebase (find existing patterns that match WRONG or RIGHT)
   ↓
5. Verdict: PASS / FAIL with evidence (file path, line number, pattern matched)
```

The verification chain ensures the agent doesn't just pattern-match superficially but understands the full context.
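
Step 5's output is easiest to act on when it has a fixed shape. One possible encoding, where the field names and the example file path are assumptions for illustration rather than a defined schema:

```typescript
// Illustrative shape for a rule-check verdict with its supporting evidence.
interface RuleVerdict {
  rule: string;                // e.g. the rule's identifier
  verdict: "PASS" | "FAIL";
  evidence: Array<{ file: string; line: number; pattern: string }>; // empty on PASS
}

// Hypothetical FAIL verdict for the no-explicit-any rule.
const example: RuleVerdict = {
  rule: "no-explicit-any",
  verdict: "FAIL",
  evidence: [{ file: "src/user.service.ts", line: 42, pattern: ": any" }],
};
```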

## Severity Selection Guide

### MUST — Security, Data Loss, Correctness

Use MUST when violation can cause:
- Security vulnerabilities (hardcoded secrets, auth bypass, injection)
- Data loss or corruption (missing transactions, wrong tenant scoping)
- Incorrect behavior visible to users (wrong calculations, missing validations)

### SHOULD — Quality, Maintainability, Reliability

Use SHOULD when violation causes:
- Technical debt that slows future development
- Reduced observability (missing logs, metrics)
- Inconsistency that confuses developers
- Test gaps that increase regression risk

### MAY — Preference, Style, Optimization

Use MAY when:
- Multiple valid approaches exist and yours is a preference
- The benefit is marginal and context-dependent
- Experienced developers might reasonably disagree

## Rule Quality Checklist

- [ ] One constraint per rule (not a bundle)
- [ ] Explicit severity: MUST, SHOULD, or MAY
- [ ] At least one WRONG and one RIGHT code example
- [ ] "Why" section explaining real-world consequence
- [ ] Scope defined (which files/domains it applies to)
- [ ] Automation path documented (linter rule, CI check, or manual review process)
- [ ] Verifiable: can determine pass/fail by examining code
- [ ] Linked to relevant skill (if deeper context exists)
@@ -0,0 +1,159 @@
# Writing Good Skills

A skill is a reusable knowledge package that teaches an AI agent *how* to do something specific. Skills are not agents (they don't have identity or autonomy) and they are not rules (they don't enforce constraints). They are reference material loaded on demand.

## Key Principles

1. **Structure over length.** A well-organized 200-line skill outperforms a rambling 800-line one. Use consistent headings, scannable lists, and code examples.
2. **Conciseness.** Every sentence should earn its place. If a paragraph can be a bullet, make it a bullet.
3. **Naming signals scope.** `vue3-composables` is better than `frontend-patterns`. The name should tell the agent whether to load it.
4. **Explain the why.** Reasoning sticks better than rigid MUST/NEVER lists. When an agent understands *why* a pattern exists, it generalizes correctly to novel situations.
5. **Multi-model testing.** Test your skill with Opus, Sonnet, and Haiku. If Haiku can't follow it, the skill needs simplification.

## Before / After: "When to Use"

### Bad — vague, single-line trigger

```yaml
# SKILL.md
name: api-error-handling
when_to_use: "When working with API errors"
```

Problems: Every backend task touches API errors, so the agent either loads this skill too often or misses it when it matters. No specificity about *which* scenarios benefit.

### Good — multiple concrete trigger scenarios

```yaml
# SKILL.md
name: api-error-handling
when_to_use:
  - "Adding a new NestJS controller endpoint that returns errors to clients"
  - "Implementing retry logic for outbound HTTP calls to third-party APIs"
  - "Converting thrown exceptions to structured ErrorResponse DTOs"
  - "Debugging 5xx errors that lack sufficient context in logs"
```

Why this works: Each scenario is specific enough that the agent (or router) can match it against the current task. The skill loads only when relevant.
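
One way to picture that matching, purely as an illustration and not this plugin's actual router logic, is word overlap between the task description and each trigger scenario:

```typescript
// Toy scorer: fraction of a trigger's words that also appear in the task description.
function scoreTrigger(task: string, trigger: string): number {
  const words = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const taskWords = words(task);
  const triggerWords = words(trigger);
  let overlap = 0;
  triggerWords.forEach((w) => {
    if (taskWords.has(w)) overlap += 1;
  });
  return overlap / triggerWords.size;
}
```

For a retry-logic task, the specific trigger outscores the vague "When working with API errors" one, which is why concrete scenarios load at the right time.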

## Before / After: Instruction Quality

### Bad — vague instructions, no examples

```markdown
## Error Handling
Handle errors properly. Make sure to catch all exceptions and return
appropriate responses. Use the right status codes.
```

### Good — concrete patterns with code

````markdown
## Error Handling Pattern

Wrap controller actions in a try/catch that maps domain errors to HTTP responses:

```typescript
// WRONG: leaks internal details, no structure
catch (error) {
  res.status(500).json({ message: error.message });
}

// RIGHT: maps to domain error, structured response
catch (error) {
  if (error instanceof EntityNotFoundError) {
    throw new NotFoundException(error.userMessage);
  }
  if (error instanceof ValidationError) {
    throw new BadRequestException(error.toResponse());
  }
  logger.error('Unhandled error in createOrder', { error, orderId });
  throw new InternalServerErrorException('Something went wrong');
}
```

**Why:** Structured error mapping prevents information leakage, gives clients actionable responses, and ensures every error is logged with context for debugging.
````

## Anti-Pattern Catalog

### 1. Too Broad Scope

**Symptom:** Skill named `backend-development` covering routing, ORM, caching, auth, and deployment.

**Fix:** Split into focused skills: `nestjs-routing`, `mongoose-queries`, `redis-caching`. Each skill should cover one coherent concern.

**Test:** If your skill's table of contents has more than 5 unrelated sections, it's too broad.

### 2. Missing Trigger Scenarios

**Symptom:** `when_to_use` is empty or says "when relevant."

**Fix:** Write 3-5 concrete task descriptions that would benefit from this skill. If you can't name 3 distinct scenarios, the skill may be too narrow or should merge into another.

### 3. Vague Instructions

**Symptom:** Instructions say "follow best practices" or "use the right approach" without specifying what those are.

**Fix:** Replace every vague directive with a concrete pattern. Show the code. Show the file path. Show the command.

### 4. No Code Examples

**Symptom:** Pure prose with no WRONG/RIGHT code blocks.

**Fix:** Every non-trivial instruction needs a code example. Prefer paired WRONG/RIGHT examples that show the contrast.

### 5. Everything in SKILL.md (No Progressive Disclosure)

**Symptom:** SKILL.md is 1500 lines because every detail is inlined.

**Fix:** Use progressive disclosure:
- `SKILL.md` — overview, when-to-use, key principles (under 100 lines)
- `references/` — detailed guides, examples, edge cases
- `templates/` — starter code, boilerplate

The agent reads SKILL.md first and loads references only when needed. This saves context window.

### 6. Generic Rather Than Domain-Specific

**Symptom:** Skill says "validate input" without specifying *your* platform's validation stack (class-validator DTOs, specific decorators, your error response shape).

**Fix:** Skills should encode *your team's* patterns, not generic programming advice. The agent already knows generic advice. Your skill adds the specifics: which libraries, which patterns, which file locations, which naming conventions.

## Scope Boundaries

### Skill vs Rule vs Agent

| Dimension | Skill | Rule | Agent |
|-----------|-------|------|-------|
| **Purpose** | Teaches how to do something | Enforces a constraint | Performs a task autonomously |
| **Loaded when** | On demand, for a specific task | Always active for matching files | Invoked by command or coordinator |
| **Format** | Reference docs, examples, templates | Short constraint + WRONG/RIGHT + severity | Identity, mission, tools, workflow |
| **Example** | "How to write Mongoose migrations" | "No bare `any` type" | "Security reviewer agent" |
| **Autonomy** | None — it's passive knowledge | None — it's a check | Full — it reasons and acts |

### Decision Flowchart

1. **Is it a constraint that should always be checked?** → Write a **rule**.
2. **Is it knowledge needed for specific tasks?** → Write a **skill**.
3. **Does it need to reason, decide, and act independently?** → Write an **agent**.
4. **Does it orchestrate multiple agents through phases?** → Write a **command**.

### Gray Areas

- "Always use platform logger" — This is a **rule** (enforceable constraint), not a skill.
- "How to set up structured logging with correlation IDs" — This is a **skill** (teaches a technique).
- "Review all log statements for PII leakage" — This is an **agent** (requires judgment and autonomous action).

## Skill Quality Checklist

Before publishing a skill:

- [ ] Name clearly signals scope and domain
- [ ] `when_to_use` has 3+ concrete trigger scenarios
- [ ] Every instruction has a code example or concrete reference
- [ ] WRONG/RIGHT pairs for non-obvious patterns
- [ ] Progressive disclosure: SKILL.md is under 100 lines, details in references/
- [ ] Domain-specific: encodes *your* team's patterns, not generic advice
- [ ] Tested with at least 2 model tiers (Sonnet + Haiku minimum)
- [ ] "Why" explained for non-obvious decisions