npm - hatch3r - Versions diffs - 1.7.1 → 1.8.0 - Mend

hatch3r 1.7.1 → 1.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (189) hide show

package/README.md +38 -12
package/agents/hatch3r-a11y-auditor.md +4 -0
package/agents/hatch3r-architect.md +4 -0
package/agents/hatch3r-ci-watcher.md +4 -0
package/agents/hatch3r-context-rules.md +26 -6
package/agents/hatch3r-creator.md +6 -1
package/agents/hatch3r-dependency-auditor.md +4 -0
package/agents/hatch3r-devops.md +4 -0
package/agents/hatch3r-docs-writer.md +4 -0
package/agents/hatch3r-fixer.md +4 -0
package/agents/hatch3r-handoff-loader.md +243 -0
package/agents/hatch3r-handoff-preparer.md +134 -0
package/agents/hatch3r-implementer.md +12 -0
package/agents/hatch3r-learnings-loader.md +5 -1
package/agents/hatch3r-lint-fixer.md +4 -0
package/agents/hatch3r-perf-profiler.md +8 -0
package/agents/hatch3r-researcher.md +4 -0
package/agents/hatch3r-reviewer.md +94 -0
package/agents/hatch3r-security-auditor.md +24 -0
package/agents/hatch3r-test-writer.md +4 -0
package/agents/modes/requirements-elicitation.md +4 -1
package/agents/modes/similar-implementation.md +6 -0
package/agents/modes/user-flows.md +76 -0
package/agents/shared/quality-charter.md +128 -0
package/agents/shared/user-content-templates.md +31 -1
package/commands/hatch3r-agent-customize.md +4 -0
package/commands/hatch3r-api-spec.md +7 -0
package/commands/hatch3r-benchmark.md +7 -0
package/commands/hatch3r-board-fill.md +8 -0
package/commands/hatch3r-board-groom.md +4 -0
package/commands/hatch3r-board-init.md +51 -0
package/commands/hatch3r-board-pickup.md +8 -0
package/commands/hatch3r-board-refresh.md +4 -0
package/commands/hatch3r-board-shared.md +6 -6
package/commands/hatch3r-bug-plan.md +7 -0
package/commands/hatch3r-codebase-map.md +8 -0
package/commands/hatch3r-command-customize.md +4 -0
package/commands/hatch3r-context-health.md +5 -0
package/commands/hatch3r-create.md +59 -4
package/commands/hatch3r-debug.md +7 -0
package/commands/hatch3r-dep-audit.md +4 -0
package/commands/hatch3r-feature-plan.md +7 -0
package/commands/hatch3r-handoff.md +133 -0
package/commands/hatch3r-healthcheck.md +4 -0
package/commands/hatch3r-hooks.md +4 -0
package/commands/hatch3r-learn.md +16 -0
package/commands/hatch3r-migration-plan.md +7 -0
package/commands/hatch3r-onboard.md +7 -0
package/commands/hatch3r-pr-resolve.md +12 -1
package/commands/hatch3r-project-spec.md +8 -0
package/commands/hatch3r-quick-change.md +11 -2
package/commands/hatch3r-recipe.md +4 -0
package/commands/hatch3r-refactor-plan.md +7 -0
package/commands/hatch3r-release.md +5 -0
package/commands/hatch3r-revision.md +7 -0
package/commands/hatch3r-roadmap.md +8 -0
package/commands/hatch3r-rule-customize.md +4 -0
package/commands/hatch3r-security-audit.md +4 -0
package/commands/hatch3r-skill-customize.md +4 -0
package/commands/hatch3r-test-plan.md +7 -0
package/commands/hatch3r-workflow.md +11 -1
package/dist/cli/index.js +4814 -1130
package/dist/cli/index.js.map +1 -1
package/package.json +10 -5
package/rules/hatch3r-accessibility-standards.md +21 -0
package/rules/hatch3r-accessibility-standards.mdc +21 -0
package/rules/hatch3r-agent-orchestration-detail.md +3 -0
package/rules/hatch3r-agent-orchestration-detail.mdc +3 -0
package/rules/hatch3r-agent-orchestration.md +34 -3
package/rules/hatch3r-agent-orchestration.mdc +34 -3
package/rules/hatch3r-ai-evals.md +158 -0
package/rules/hatch3r-ai-evals.mdc +154 -0
package/rules/hatch3r-ai-ux-patterns.md +131 -0
package/rules/hatch3r-ai-ux-patterns.mdc +127 -0
package/rules/hatch3r-api-design.md +67 -9
package/rules/hatch3r-api-design.mdc +67 -9
package/rules/hatch3r-api-versioning.md +119 -0
package/rules/hatch3r-api-versioning.mdc +115 -0
package/rules/hatch3r-auth-patterns.md +170 -0
package/rules/hatch3r-auth-patterns.mdc +166 -0
package/rules/hatch3r-component-conventions.md +30 -0
package/rules/hatch3r-component-conventions.mdc +30 -0
package/rules/hatch3r-container-hardening.md +131 -0
package/rules/hatch3r-container-hardening.mdc +127 -0
package/rules/hatch3r-contract-testing.md +117 -0
package/rules/hatch3r-contract-testing.mdc +113 -0
package/rules/hatch3r-deep-context.md +2 -0
package/rules/hatch3r-deep-context.mdc +2 -0
package/rules/hatch3r-dependency-management.md +73 -1
package/rules/hatch3r-dependency-management.mdc +72 -0
package/rules/hatch3r-design-system-detection.md +142 -0
package/rules/hatch3r-design-system-detection.mdc +138 -0
package/rules/hatch3r-event-schema-evolution.md +90 -0
package/rules/hatch3r-event-schema-evolution.mdc +86 -0
package/rules/hatch3r-handoff-readiness.md +45 -0
package/rules/hatch3r-handoff-readiness.mdc +40 -0
package/rules/hatch3r-i18n.md +13 -0
package/rules/hatch3r-i18n.mdc +13 -0
package/rules/hatch3r-iteration-summary.md +2 -0
package/rules/hatch3r-iteration-summary.mdc +2 -0
package/rules/hatch3r-migrations.md +61 -16
package/rules/hatch3r-migrations.mdc +61 -16
package/rules/hatch3r-observability-logging.md +1 -1
package/rules/hatch3r-observability-logging.mdc +1 -1
package/rules/hatch3r-observability-metrics.md +1 -1
package/rules/hatch3r-observability-metrics.mdc +1 -1
package/rules/hatch3r-observability-tracing-detail.md +8 -149
package/rules/hatch3r-observability-tracing-detail.mdc +7 -149
package/rules/hatch3r-observability-tracing.md +154 -6
package/rules/hatch3r-observability-tracing.mdc +154 -6
package/rules/hatch3r-observability.md +1 -0
package/rules/hatch3r-observability.mdc +1 -0
package/rules/hatch3r-operability.md +149 -0
package/rules/hatch3r-operability.mdc +145 -0
package/rules/hatch3r-passkey-server.md +181 -0
package/rules/hatch3r-passkey-server.mdc +177 -0
package/rules/hatch3r-progressive-delivery.md +120 -0
package/rules/hatch3r-progressive-delivery.mdc +116 -0
package/rules/hatch3r-resilience-patterns.md +154 -0
package/rules/hatch3r-resilience-patterns.mdc +150 -0
package/rules/hatch3r-secrets-management.md +29 -0
package/rules/hatch3r-secrets-management.mdc +29 -0
package/rules/hatch3r-testing.md +139 -43
package/rules/hatch3r-testing.mdc +139 -43
package/rules/hatch3r-ux-states-and-flows.md +149 -0
package/rules/hatch3r-ux-states-and-flows.mdc +145 -0
package/skills/hatch3r-a11y-audit/SKILL.md +14 -0
package/skills/hatch3r-agent-customize/SKILL.md +10 -0
package/skills/hatch3r-ai-feature/SKILL.md +136 -0
package/skills/hatch3r-api-spec/SKILL.md +73 -0
package/skills/hatch3r-architecture-review/SKILL.md +14 -0
package/skills/hatch3r-bug-fix/SKILL.md +5 -0
package/skills/hatch3r-ci-pipeline/SKILL.md +14 -0
package/skills/hatch3r-cli-aichat/SKILL.md +84 -0
package/skills/hatch3r-cli-ast-grep/SKILL.md +85 -0
package/skills/hatch3r-cli-az-devops/SKILL.md +89 -0
package/skills/hatch3r-cli-bat/SKILL.md +85 -0
package/skills/hatch3r-cli-comby/SKILL.md +85 -0
package/skills/hatch3r-cli-csvkit/SKILL.md +84 -0
package/skills/hatch3r-cli-delta/SKILL.md +86 -0
package/skills/hatch3r-cli-difftastic/SKILL.md +84 -0
package/skills/hatch3r-cli-docker/SKILL.md +89 -0
package/skills/hatch3r-cli-duckdb/SKILL.md +84 -0
package/skills/hatch3r-cli-fd/SKILL.md +85 -0
package/skills/hatch3r-cli-fzf/SKILL.md +84 -0
package/skills/hatch3r-cli-gh/SKILL.md +90 -0
package/skills/hatch3r-cli-glab/SKILL.md +89 -0
package/skills/hatch3r-cli-jq/SKILL.md +89 -0
package/skills/hatch3r-cli-lazygit/SKILL.md +78 -0
package/skills/hatch3r-cli-llm/SKILL.md +84 -0
package/skills/hatch3r-cli-miller/SKILL.md +84 -0
package/skills/hatch3r-cli-mods/SKILL.md +84 -0
package/skills/hatch3r-cli-overview/SKILL.md +60 -0
package/skills/hatch3r-cli-playwright/SKILL.md +89 -0
package/skills/hatch3r-cli-podman/SKILL.md +84 -0
package/skills/hatch3r-cli-qsv/SKILL.md +91 -0
package/skills/hatch3r-cli-ripgrep/SKILL.md +85 -0
package/skills/hatch3r-cli-rtk/SKILL.md +91 -0
package/skills/hatch3r-cli-sd/SKILL.md +85 -0
package/skills/hatch3r-cli-stagehand/SKILL.md +111 -0
package/skills/hatch3r-cli-taplo/SKILL.md +84 -0
package/skills/hatch3r-cli-yq/SKILL.md +85 -0
package/skills/hatch3r-cli-zstd/SKILL.md +85 -0
package/skills/hatch3r-command-customize/SKILL.md +10 -0
package/skills/hatch3r-context-health/SKILL.md +14 -0
package/skills/hatch3r-cost-tracking/SKILL.md +14 -0
package/skills/hatch3r-customize/SKILL.md +17 -0
package/skills/hatch3r-dep-audit/SKILL.md +14 -0
package/skills/hatch3r-design-system-detect/SKILL.md +164 -0
package/skills/hatch3r-feature/SKILL.md +2 -0
package/skills/hatch3r-gh-agentic-workflows/SKILL.md +13 -0
package/skills/hatch3r-handoff-prepare/SKILL.md +160 -0
package/skills/hatch3r-handoff-resume/SKILL.md +171 -0
package/skills/hatch3r-incident-response/SKILL.md +14 -0
package/skills/hatch3r-issue-workflow/SKILL.md +5 -0
package/skills/hatch3r-logical-refactor/SKILL.md +14 -0
package/skills/hatch3r-migration/SKILL.md +14 -0
package/skills/hatch3r-observability-verify/SKILL.md +134 -0
package/skills/hatch3r-perf-audit/SKILL.md +14 -0
package/skills/hatch3r-pr-creation/SKILL.md +14 -0
package/skills/hatch3r-qa-validation/SKILL.md +18 -0
package/skills/hatch3r-recipe/SKILL.md +14 -0
package/skills/hatch3r-refactor/SKILL.md +14 -0
package/skills/hatch3r-release/SKILL.md +14 -0
package/skills/hatch3r-reliability-verify/SKILL.md +146 -0
package/skills/hatch3r-rule-customize/SKILL.md +10 -0
package/skills/hatch3r-skill-customize/SKILL.md +10 -0
package/skills/hatch3r-ui-ux-verify/SKILL.md +138 -0
package/skills/hatch3r-visual-refactor/SKILL.md +15 -1

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "hatch3r",
-  "version": "1.7.1",
+  "version": "1.8.0",
   "description": "Battle-tested agentic coding setup framework. One command to hatch your agent stack -- agents, skills, rules, commands, and MCP for every major AI coding tool.",
   "type": "module",
   "exports": {
@@ -22,15 +22,19 @@
     "test:watch": "vitest",
     "inventory": "tsx scripts/inventory.ts",
     "inventory:check-docs": "tsx scripts/inventory.ts --check-docs",
-    "validate:rule-parity": "tsx scripts/validate-rule-parity.ts",
-    "validate:efficiency": "tsx scripts/validate-efficiency-invariants.ts",
-    "validate": "npm run validate:rule-parity && npm run validate:efficiency",
+    "validate:rule-parity": "tsx scripts/validate-rule-parity.ts && tsx scripts/validate-rule-pillar-currency.ts",
+    "validate:efficiency": "tsx scripts/validate-efficiency-invariants.ts && tsx scripts/validate-bridge-budget.ts && tsx scripts/validate-fanout-emission.ts",
+    "validate:cli-skills": "tsx scripts/validate-cli-skills.ts",
+    "validate:wiring": "tsx scripts/validate-wiring.ts",
+    "generate:cli-skills": "tsx scripts/generate-cli-skills.ts",
+    "validate": "npm run validate:rule-parity && npm run validate:efficiency && npm run validate:cli-skills && npm run validate:wiring",
     "audit:validate-registry": "tsx scripts/validate-finding-registry.ts",
     "audit:migrate": "tsx scripts/migrate-finding-registry.ts",
     "audit:archive": "tsx scripts/audit-archive.ts",
     "audit:find": "tsx scripts/audit-find.ts",
     "audit:reset": "tsx scripts/clean-audit-workspace.ts",
-    "lockfile:check": "lockfile-lint --path package-lock.json --type npm --allowed-hosts npm --validate-https"
+    "lockfile:check": "lockfile-lint --path package-lock.json --type npm --allowed-hosts npm --validate-https",
+    "mcp:cve-check": "tsx scripts/check-mcp-cves.ts"
   },
   "keywords": [
     "agents",
@@ -88,6 +92,7 @@
     "commander": "^14.0.3",
     "inquirer": "^13.3.2",
     "ora": "^9.3.0",
+    "p-limit": "^3.1.0",
     "proper-lockfile": "^4.1.2",
     "update-notifier": "^7.3.1",
     "yaml": "^2.8.3"

package/rules/hatch3r-accessibility-standards.md CHANGED Viewed

@@ -78,3 +78,24 @@ All user-facing features must meet WCAG 2.2 Level AA conformance. This is the ba
 - Run automated accessibility checks (axe-core, Lighthouse) in CI.
 - Automated tools catch ~30% of accessibility issues. Manual testing is required for keyboard flows, screen reader experience, and cognitive accessibility.
 - Maintain an accessibility test checklist per component type (form, modal, navigation, data table).
+## WCAG 2.2 New Success Criteria (Mandatory Audit Items)
+WCAG 2.2 (W3C Recommendation, October 2023) added nine Success Criteria; three apply to most UI surfaces and are listed here as required audit items.
+- **SC 2.5.8 Target Size (Minimum) — AA:** every pointer-target's hit area is at least 24 by 24 CSS pixels. Smaller targets are permitted only when surrounded by ≥24px of spacing (the spacing exception) or when the larger inline target is also reachable (the inline exception). Test densest UI elements (sidebar collapse, table-row checkboxes, icon-only toolbar buttons).
+- **SC 2.4.11 Focus Not Obscured (Minimum) — AA:** when a control receives keyboard focus, the focused element must not be entirely hidden by author-created content. Common violations: sticky headers, persistent cookie banners, chatbot launcher widgets, fixed action bars. Mitigate with `scroll-margin-top` on focus targets equal to the sticky header height, or by collapsing sticky chrome on focus.
+- **SC 2.5.7 Dragging Movements — AA:** any drag operation (reorder, slider thumb drag, map pan, kanban card move) must offer a single-pointer non-drag alternative. Examples: list reorder with up/down arrow buttons; slider with numeric input or +/- buttons; map pan with arrow-key navigation.
+Reference: https://www.w3.org/TR/WCAG22/
+## Mobile and Touch Accessibility
+Touch surfaces have stricter target and spacing requirements than pointer-only surfaces; native platform guidance supersedes WCAG 2.5.8 on touch.
+- Touch targets: 44 by 44 points on iOS (Apple Human Interface Guidelines), 48 by 48 dp on Android (Material Design 3). These supersede WCAG 2.5.8's 24-pixel minimum on touch-only surfaces.
+- Spacing between interactive elements: ≥8 pixels of separation to prevent mis-taps; ≥4 mm on physical-density screens.
+- Avoid tap-to-reveal patterns (hover tooltips, pure-hover dropdowns) — touch has no hover state. Replace with permanent visible labels or with long-press that has a visible affordance and announces itself to screen readers.
+- Apply `env(safe-area-inset-*)` padding on full-bleed surfaces so content clears notches, home indicators, and rounded corners on iOS and Android edge devices.
+- Support Dynamic Type (iOS) and rem-based font scaling — declare body text in `rem` or `em` units, never `px`, so OS-level font size settings cascade.
+- Zoom to 200% and 400% (per WCAG 1.4.4 and 1.4.10 Reflow) must remain functional with no horizontal scroll trap. Audit for `width: 100vw` and fixed pixel widths that break reflow.

package/rules/hatch3r-accessibility-standards.mdc CHANGED Viewed

@@ -74,3 +74,24 @@ All user-facing features must meet WCAG 2.2 Level AA conformance. This is the ba
 - Run automated accessibility checks (axe-core, Lighthouse) in CI.
 - Automated tools catch ~30% of accessibility issues. Manual testing is required for keyboard flows, screen reader experience, and cognitive accessibility.
 - Maintain an accessibility test checklist per component type (form, modal, navigation, data table).
+## WCAG 2.2 New Success Criteria (Mandatory Audit Items)
+WCAG 2.2 (W3C Recommendation, October 2023) added nine Success Criteria; three apply to most UI surfaces and are listed here as required audit items.
+- **SC 2.5.8 Target Size (Minimum) — AA:** every pointer-target's hit area is at least 24 by 24 CSS pixels. Smaller targets are permitted only when surrounded by ≥24px of spacing (the spacing exception) or when the larger inline target is also reachable (the inline exception). Test densest UI elements (sidebar collapse, table-row checkboxes, icon-only toolbar buttons).
+- **SC 2.4.11 Focus Not Obscured (Minimum) — AA:** when a control receives keyboard focus, the focused element must not be entirely hidden by author-created content. Common violations: sticky headers, persistent cookie banners, chatbot launcher widgets, fixed action bars. Mitigate with `scroll-margin-top` on focus targets equal to the sticky header height, or by collapsing sticky chrome on focus.
+- **SC 2.5.7 Dragging Movements — AA:** any drag operation (reorder, slider thumb drag, map pan, kanban card move) must offer a single-pointer non-drag alternative. Examples: list reorder with up/down arrow buttons; slider with numeric input or +/- buttons; map pan with arrow-key navigation.
+Reference: https://www.w3.org/TR/WCAG22/
+## Mobile and Touch Accessibility
+Touch surfaces have stricter target and spacing requirements than pointer-only surfaces; native platform guidance supersedes WCAG 2.5.8 on touch.
+- Touch targets: 44 by 44 points on iOS (Apple Human Interface Guidelines), 48 by 48 dp on Android (Material Design 3). These supersede WCAG 2.5.8's 24-pixel minimum on touch-only surfaces.
+- Spacing between interactive elements: ≥8 pixels of separation to prevent mis-taps; ≥4 mm on physical-density screens.
+- Avoid tap-to-reveal patterns (hover tooltips, pure-hover dropdowns) — touch has no hover state. Replace with permanent visible labels or with long-press that has a visible affordance and announces itself to screen readers.
+- Apply `env(safe-area-inset-*)` padding on full-bleed surfaces so content clears notches, home indicators, and rounded corners on iOS and Android edge devices.
+- Support Dynamic Type (iOS) and rem-based font scaling — declare body text in `rem` or `em` units, never `px`, so OS-level font size settings cascade.
+- Zoom to 200% and 400% (per WCAG 1.4.4 and 1.4.10 Reflow) must remain functional with no horizontal scroll trap. Audit for `width: 100vw` and fixed pixel widths that break reflow.

package/rules/hatch3r-agent-orchestration-detail.md CHANGED Viewed

@@ -5,8 +5,11 @@ description: Extended orchestration reference — PipelineContext schemas, resil
 scope: conditional
 globs: "**/.agents/**,**/pipeline/**,**/*orchestrat*,**/*agent*"
 tags: [core]
+precedence: normal
 quality_charter: agents/shared/quality-charter.md
 cache_friendly: true
+detail_rule: true
+consumed_by: hatch3r-agent-orchestration
 ---
 # Agent Orchestration — Extended Reference

package/rules/hatch3r-agent-orchestration-detail.mdc CHANGED Viewed

@@ -2,6 +2,9 @@
 description: Extended orchestration reference — PipelineContext schemas, resilience protocols, observability integration, and auto-mode guardrails
 globs: ["**/.agents/**", "**/pipeline/**", "**/*orchestrat*", "**/*agent*"]
 alwaysApply: false
+precedence: normal
+detail_rule: true
+consumed_by: hatch3r-agent-orchestration
 ---
 # Agent Orchestration — Extended Reference

package/rules/hatch3r-agent-orchestration.md CHANGED Viewed

@@ -4,6 +4,7 @@ type: rule
 description: Mandatory agent delegation, skill loading, and subagent usage directives for ALL tasks in ALL contexts
 scope: always
 tags: [core]
+precedence: high
 quality_charter: agents/shared/quality-charter.md
 cache_friendly: true
 ---
@@ -48,7 +49,7 @@ Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatc
 Score task complexity per the `hatch3r-deep-context` rule before Phase 1. Apply the resulting tier:
-- **Tier 2 (Standard):** Present elicitation questions inline. Await answers before Phase 2.
+- **Tier 2 hard gate (B1).** Before Phase 2, run `hatch3r-researcher` with `requirements-elicitation:quick` mode to detect ambiguity and ask the user via the platform-native question tool (per `agents/shared/user-question-protocol.md`). Orchestrator awaits answers and integrates them into the Phase 1 brief; do not begin Phase 2 with unresolved questions. Tier 1 is exempt only when scope is single-file, single-concern, and acceptance is testable from the user message alone.
 - **Tier 3 (Deep):** Present Pre-Implementation Summary and ASK for confirmation. Do NOT proceed until all unresolved questions are answered.
 ## Mandatory Delegation Directives
@@ -103,6 +104,28 @@ Examples:
 A missing header on a tracked Tier >= 2 task is a self-detectable drift signal — the user may halt the turn and request re-grounding. The header also functions as a per-reply cache prime: rendering it forces the orchestrator to re-resolve which phase it is in before choosing tools. Tier 1 tasks, read-only answers, and chat-only iterations do NOT require the header.
+### End-of-Turn Delegation Attestation
+When the turn is on a tracked task at Tier >= 2 AND caused at least one file mutation, the orchestrator MUST emit a closing block immediately before the Iteration Summary. The block enumerates every file mutated this turn, the spawning sub-agent invocation, and the `delegation_proof_id` returned by that sub-agent.
+Format:
+```
+[hatch3r-delegation-attestation]
+files_mutated_this_turn:
+  - <relative path>: via <agent-name> (proof: <delegation_proof_id>)
+mutating_subagent_invocations: <integer>
+inline_edits_by_orchestrator: none | <carve-out: hatch3r-quick-change Tier-1 + queued re-delegation>
+```
+Rules:
+- Each `files_mutated_this_turn` row MUST cite the spawning sub-agent invocation and quote the `delegation_proof_id` returned by that sub-agent verbatim. Unattributable rows are self-declared P8 B2 violations and the orchestrator MUST queue re-delegation in the next turn.
+- `inline_edits_by_orchestrator: none` is the only acceptable value outside the `hatch3r-quick-change` Tier-1 carve-out declared in the "Inline implementation" definition above.
+- Tier 1 read-only and chat-only turns are exempt — same scope as the Per-Turn Pipeline-State Header.
+- Missing block on a Tier >= 2 mutating turn is a self-detectable drift signal — the user may halt the turn and re-ground per the same protocol as the missing-header signal.
+- The block is consumed by reviewers and the next orchestrator turn; it sits beside the Iteration Summary, not inside it, preserving the existing 5-field iteration-summary contract verbatim.
 ### Mandatory Delegation Directive (No Inline Implementation)
 Restating with maximum clarity for sub-agent prompt inclusion: the orchestrator MUST NOT call `Edit`, `Write`, `MultiEdit`, `NotebookEdit`, `replace_string_in_file`, `multi_replace_string_in_file`, `create_file`, `str_replace_based_edit_tool`, `apply_patch`, or any platform-equivalent code-writing tool from its own turn. The only path for code mutation is the Task tool spawning `hatch3r-implementer` (Phase 2) or `hatch3r-fixer` (Phase 3). Carve-out: `hatch3r-quick-change` Tier 1 trivial items per its declared scope. No other carve-out exists. Violations are bypass mode (see issue #73) — surface them by halting the turn and re-delegating.
@@ -132,14 +155,14 @@ For multi-sub-task implementations, the implementer performs a lightweight mini-
 1. Spawn `hatch3r-reviewer` with diff and acceptance criteria. Reviewer includes blast radius summary.
 2. Critical/Warning findings: spawn `hatch3r-fixer` with full reviewer output.
-3. Re-review after fixes. Repeat until 0 Critical + 0 Warning, or max 3 iterations.
+3. Re-review after fixes. Repeat until 0 Critical + 0 Warning, or max 4 iterations (matches `DEFAULT_MAX_REVIEW_ITERATIONS` in `src/pipeline/reviewLoop.ts`; raised from 3 to 4 in Cycle 7.5 W2B2 finding H26 so the oscillation detector becomes reachable in default config). The rule default and the code constant are kept in sync by `src/__tests__/pipeline/reviewLoop.test.ts` (CI-enforced).
 4. **Confirmation pass** after clean review: lightweight re-review for fix-driven regressions and acceptance criteria completeness. The confirmation pass checks only: (a) no new test failures compared to Phase 2 baseline, (b) no type errors introduced, (c) acceptance criteria from the issue are still met. It does not re-run the full review checklist.
 5. Max iterations reached: surface to user with a structured summary: iteration count, remaining Critical findings (with file:line), remaining Warning findings, and a recommendation (fix manually vs. accept risk). Never present raw reviewer output without summarization.
 6. **Review gate confidence signal:** When the review loop exits with a clean verdict, record the iteration count in `PipelineContext.reviewResult.iterations`. Clean-on-first-pass (iteration 1) signals higher confidence than clean-after-multiple-iterations (iteration 2-3). Phase 4 specialists and the orchestrator should factor this into their risk assessment.
 **Phase 4 — Final Quality** (after review loop is clean):
-Launch parallel subagents -- no artificial concurrency limit.
+Launch Phase 4 specialists in parallel, bounded by `max_phase4_parallel` (default `3`, override via `HATCH3R_MAX_PHASE4_PARALLEL` env var; valid range 1-16, values outside the range fall back to default with a logged warning). The bound exists to cap per-orchestrator concurrent context cost — it does not soften the P8 B2 directive that fan-out scales with task decomposition. When the number of applicable specialists exceeds `max_phase4_parallel`, batch them by severity-descending priority: `CRITICAL → HIGH → MEDIUM → LOW` (severity is the worst-case finding class the specialist is expected to surface, per the `hatch3r-test-writer` / `hatch3r-security-auditor` always-on baseline → CRITICAL, conditional UI/security/perf → HIGH, docs/lint → MEDIUM, low-impact specialists → LOW). Within the same severity bucket, dispatch order is the trigger-table order in the table above. Each batch runs to completion (all specialists return SUCCESS/PARTIAL/FAILED) before the next batch starts; the validation pass below runs once after the final batch.
 - **Always** (except when Phase Skip Criteria applies — see below)**:** `hatch3r-test-writer`, `hatch3r-security-auditor`
 - **Evaluate:** `hatch3r-docs-writer` (when APIs/architecture/UX affected)
@@ -253,6 +276,14 @@ ALL three must hold:
 **Default:** When in doubt, serialize. For typical hatch3r tasks (1–5 sub-tasks) the DAG-scheduling overhead often outweighs concurrency gain.
+### Cost-Dominance Principle
+Token cost of sub-agent invocation never justifies serialization of independent work. The three safety conditions (read-only or disjoint writes, deterministic aggregation, no shared mutable state) govern WHEN parallelism is safe; cost does not govern WHETHER to parallelize. When in doubt, fan out. Serialization is only valid on true dependency edges.
+### Scaling Heuristic
+Sub-agent count tracks task decomposition: N independent modules → N parallel Phase-2 implementers; M specialist gates → M parallel Phase-4 specialists; K independent research questions → K parallel researcher modes. Orchestrators emit `sub_agents_spawned: {count, rationale}` in their structured output.
 ## Cross-Phase Error Propagation
 When a phase produces a non-SUCCESS status, the orchestrator must propagate error context to downstream phases rather than silently dropping it:

package/rules/hatch3r-agent-orchestration.mdc CHANGED Viewed

@@ -1,6 +1,7 @@
 ---
 description: Mandatory agent delegation, skill loading, and subagent usage directives for ALL tasks in ALL contexts
 alwaysApply: true
+precedence: high
 ---
 # Agent Orchestration
@@ -43,7 +44,7 @@ Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatc
 Score task complexity per the `hatch3r-deep-context` rule before Phase 1. Apply the resulting tier:
-- **Tier 2 (Standard):** Present elicitation questions inline. Await answers before Phase 2.
+- **Tier 2 hard gate (B1).** Before Phase 2, run `hatch3r-researcher` with `requirements-elicitation:quick` mode to detect ambiguity and ask the user via the platform-native question tool (per `agents/shared/user-question-protocol.md`). Orchestrator awaits answers and integrates them into the Phase 1 brief; do not begin Phase 2 with unresolved questions. Tier 1 is exempt only when scope is single-file, single-concern, and acceptance is testable from the user message alone.
 - **Tier 3 (Deep):** Present Pre-Implementation Summary and ASK for confirmation. Do NOT proceed until all unresolved questions are answered.
 ## Mandatory Delegation Directives
@@ -98,6 +99,28 @@ Examples:
 A missing header on a tracked Tier >= 2 task is a self-detectable drift signal — the user may halt the turn and request re-grounding. The header also functions as a per-reply cache prime: rendering it forces the orchestrator to re-resolve which phase it is in before choosing tools. Tier 1 tasks, read-only answers, and chat-only iterations do NOT require the header.
+### End-of-Turn Delegation Attestation
+When the turn is on a tracked task at Tier >= 2 AND caused at least one file mutation, the orchestrator MUST emit a closing block immediately before the Iteration Summary. The block enumerates every file mutated this turn, the spawning sub-agent invocation, and the `delegation_proof_id` returned by that sub-agent.
+Format:
+```
+[hatch3r-delegation-attestation]
+files_mutated_this_turn:
+  - <relative path>: via <agent-name> (proof: <delegation_proof_id>)
+mutating_subagent_invocations: <integer>
+inline_edits_by_orchestrator: none | <carve-out: hatch3r-quick-change Tier-1 + queued re-delegation>
+```
+Rules:
+- Each `files_mutated_this_turn` row MUST cite the spawning sub-agent invocation and quote the `delegation_proof_id` returned by that sub-agent verbatim. Unattributable rows are self-declared P8 B2 violations and the orchestrator MUST queue re-delegation in the next turn.
+- `inline_edits_by_orchestrator: none` is the only acceptable value outside the `hatch3r-quick-change` Tier-1 carve-out declared in the "Inline implementation" definition above.
+- Tier 1 read-only and chat-only turns are exempt — same scope as the Per-Turn Pipeline-State Header.
+- Missing block on a Tier >= 2 mutating turn is a self-detectable drift signal — the user may halt the turn and re-ground per the same protocol as the missing-header signal.
+- The block is consumed by reviewers and the next orchestrator turn; it sits beside the Iteration Summary, not inside it, preserving the existing 5-field iteration-summary contract verbatim.
 ### Mandatory Delegation Directive (No Inline Implementation)
 Restating with maximum clarity for sub-agent prompt inclusion: the orchestrator MUST NOT call `Edit`, `Write`, `MultiEdit`, `NotebookEdit`, `replace_string_in_file`, `multi_replace_string_in_file`, `create_file`, `str_replace_based_edit_tool`, `apply_patch`, or any platform-equivalent code-writing tool from its own turn. The only path for code mutation is the Task tool spawning `hatch3r-implementer` (Phase 2) or `hatch3r-fixer` (Phase 3). Carve-out: `hatch3r-quick-change` Tier 1 trivial items per its declared scope. No other carve-out exists. Violations are bypass mode (see issue #73) — surface them by halting the turn and re-delegating.
@@ -127,14 +150,14 @@ For multi-sub-task implementations, the implementer performs a lightweight mini-
 1. Spawn `hatch3r-reviewer` with diff and acceptance criteria. Reviewer includes blast radius summary.
 2. Critical/Warning findings: spawn `hatch3r-fixer` with full reviewer output.
-3. Re-review after fixes. Repeat until 0 Critical + 0 Warning, or max 3 iterations.
+3. Re-review after fixes. Repeat until 0 Critical + 0 Warning, or max 4 iterations (matches `DEFAULT_MAX_REVIEW_ITERATIONS` in `src/pipeline/reviewLoop.ts`; raised from 3 to 4 in Cycle 7.5 W2B2 finding H26 so the oscillation detector becomes reachable in default config). The rule default and the code constant are kept in sync by `src/__tests__/pipeline/reviewLoop.test.ts` (CI-enforced).
 4. **Confirmation pass** after clean review: lightweight re-review for fix-driven regressions and acceptance criteria completeness. The confirmation pass checks only: (a) no new test failures compared to Phase 2 baseline, (b) no type errors introduced, (c) acceptance criteria from the issue are still met. It does not re-run the full review checklist.
 5. Max iterations reached: surface to user with a structured summary: iteration count, remaining Critical findings (with file:line), remaining Warning findings, and a recommendation (fix manually vs. accept risk). Never present raw reviewer output without summarization.
 6. **Review gate confidence signal:** When the review loop exits with a clean verdict, record the iteration count in `PipelineContext.reviewResult.iterations`. Clean-on-first-pass (iteration 1) signals higher confidence than clean-after-multiple-iterations (iteration 2-3). Phase 4 specialists and the orchestrator should factor this into their risk assessment.
 **Phase 4 — Final Quality** (after review loop is clean):
-Launch parallel subagents -- no artificial concurrency limit.
+Launch Phase 4 specialists in parallel, bounded by `max_phase4_parallel` (default `3`, override via `HATCH3R_MAX_PHASE4_PARALLEL` env var; valid range 1-16, values outside the range fall back to default with a logged warning). The bound exists to cap per-orchestrator concurrent context cost — it does not soften the P8 B2 directive that fan-out scales with task decomposition. When the number of applicable specialists exceeds `max_phase4_parallel`, batch them by severity-descending priority: `CRITICAL → HIGH → MEDIUM → LOW` (severity is the worst-case finding class the specialist is expected to surface, per the `hatch3r-test-writer` / `hatch3r-security-auditor` always-on baseline → CRITICAL, conditional UI/security/perf → HIGH, docs/lint → MEDIUM, low-impact specialists → LOW). Within the same severity bucket, dispatch order is the trigger-table order in the table above. Each batch runs to completion (all specialists return SUCCESS/PARTIAL/FAILED) before the next batch starts; the validation pass below runs once after the final batch.
 - **Always** (except when Phase Skip Criteria applies — see below)**:** `hatch3r-test-writer`, `hatch3r-security-auditor`
 - **Evaluate:** `hatch3r-docs-writer` (when APIs/architecture/UX affected)
@@ -248,6 +271,14 @@ ALL three must hold:
 **Default:** When in doubt, serialize. For typical hatch3r tasks (1–5 sub-tasks) the DAG-scheduling overhead often outweighs concurrency gain.
+### Cost-Dominance Principle
+Token cost of sub-agent invocation never justifies serialization of independent work. The three safety conditions (read-only or disjoint writes, deterministic aggregation, no shared mutable state) govern WHEN parallelism is safe; cost does not govern WHETHER to parallelize. When in doubt, fan out. Serialization is only valid on true dependency edges.
+### Scaling Heuristic
+Sub-agent count tracks task decomposition: N independent modules → N parallel Phase-2 implementers; M specialist gates → M parallel Phase-4 specialists; K independent research questions → K parallel researcher modes. Orchestrators emit `sub_agents_spawned: {count, rationale}` in their structured output.
 ## Cross-Phase Error Propagation
 When a phase produces a non-SUCCESS status, the orchestrator must propagate error context to downstream phases rather than silently dropping it:

package/rules/hatch3r-ai-evals.md ADDED Viewed

@@ -0,0 +1,158 @@
+---
+id: hatch3r-ai-evals
+type: rule
+description: AI feature evaluation, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI for end-user projects shipping LLM features
+scope: "**/ai/**,**/llm/**,**/chat/**,**/assistant/**,**/agents/**,**/copilot/**,**/evals/**,**/prompts/**,**/rag/**"
+tags: [review, implementation, ai]
+quality_charter: agents/shared/quality-charter.md
+cache_friendly: true
+---
+# AI Feature Evaluation and Cost Governance (2026)
+## Scope
+This rule governs the BACKEND half of an LLM-driven feature: eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, hallucination-as-SLI, tool-use evals, safety/red-team, and audit logging. The FRONTEND half (streaming UI, tool-call cards, human-approval gates, cancel/abort/undo, citations) is the paired companion rule `rules/hatch3r-ai-ux-patterns.md`. Apply both rules for any LLM-driven feature — UI-only or backend-only coverage is a regression.
+Detection mirrors the companion rule: this rule activates when the project imports an LLM SDK (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`, `cohere-ai`, `ai`, `@ai-sdk/*`, `langchain`, `llama-index`) or contains files under `ai/`, `llm/`, `chat/`, `assistant/`, `agents/`, `copilot/`, `evals/`, `prompts/`, `rag/`.
+## Eval Harness Mandate
+Every AI feature has an automated eval harness committed to the repo before the feature ships. Hand-rolled "ask the model and eyeball the answer" is a regression in 2026.
+Pick one tool by task class:
+- **promptfoo** — broad coverage, declarative YAML, model-comparison defaults
+- **DeepEval** — pytest-style assertions for CI gate integration
+- **RAGAS** — retrieval-augmented generation metrics (context_precision, context_recall, faithfulness, answer_relevance)
+- **Inspect** — UK AISI framework for safety and agentic evals
+- **braintrust** — SaaS + OSS hybrid, run history retained per prompt version
+- **TruLens** — observability-coupled, runs evals against live traces
+- **Arize Phoenix** — open-source observability with eval modules
+Document the chosen tool in `evals/README.md` so the agent picks the same tool on every future change.
+## Golden Dataset Versioning
+- Eval cases live in repo at `evals/<feature>/golden.{jsonl,csv}`.
+- Minimum 20 cases per feature; one input plus one expected output (or graded rubric) per row.
+- Dataset version in the filename (`golden.v3.jsonl`) — never overwrite v2 in place.
+- Ground truth refreshed quarterly; refresh PR includes a diff of changed expected outputs and the rationale.
+- Hand-curated edge cases (failure modes the model has historically tripped on) live in `evals/<feature>/edge.jsonl` alongside the golden set.
+## Eval Metrics
+Match the metric to the task class:
+- Classification → exact-match accuracy + per-class precision/recall.
+- Open-ended generation → rubric-scored LLM-as-judge with a 50-example calibration set sanity-checking judge agreement against human labels; recompute calibration every model change.
+- Retrieval/RAG → RAGAS metrics: context_precision, context_recall, faithfulness, answer_relevance.
+- Pairwise comparison → win-rate against the prior prompt version (>=55% required to adopt).
+- Refusal calibration → refusal-rate as an explicit SLI (refusals on prohibited inputs vs false-positive refusals on benign inputs).
+## Regression Gating
+- Every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**` runs the eval set in CI.
+- PR blocked when any metric drops below the per-feature threshold defined in `evals/<feature>/thresholds.json`.
+- Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) runs the full eval with a 5% accuracy budget; cross over 5% requires a sign-off comment from a named reviewer and a 24-hour canary at 5% traffic.
+- Eval failure is treated as a test failure — never an advisory warning.
+## Prompt Versioning
+- Prompts are first-class artifacts at `prompts/<feature>/v<N>.{md,txt}` with an SHA-256 hash recorded in the eval thresholds file.
+- Runtime emits `prompt_version` + `prompt_hash` per request as log + span attribute.
+- A/B framework supports concurrent versions behind a flag — traffic split recorded per request so eval delta is computable.
+- Hash drift between repo prompt and deployed prompt is a P0 incident.
+## Cost Telemetry per Request
+Every LLM call logs: `tokens_in`, `tokens_out`, `cache_hit` (boolean + cached_tokens count), `model`, `cost_usd`, `latency_ms`, `cost_center` (feature ID), `prompt_version`, `prompt_hash`, `user_id_hash`.
+Aggregate dashboards in the observability stack — cross-reference `rules/hatch3r-observability-metrics.md` and `rules/hatch3r-observability-tracing-detail.md` for the SLI/SLO vocabulary, and `skills/hatch3r-observability-verify` for the wiring checklist. Per-feature budget alerts fire at 50%, 75%, and 90% of monthly budget; abuse-detection alert at 10x user p99 cost over a 1-hour window.
+## Prompt Caching (Anthropic)
+- Apply `cache_control` breakpoints to long system prompts, tool definitions, and large RAG context blocks above 1024 tokens (the Claude Opus/Sonnet 4.x minimum cache size; 2048 tokens for Haiku 4.x).
+- Up to 4 breakpoints per request, longest-TTL-first.
+- 5-minute TTL costs 1.25x the standard write rate; 1-hour TTL costs 2x the standard write rate. Reads are 0.1x base.
+- Track cache_hit ratio per feature; <30% hit ratio on a stable prompt is a sign the prefix is changing — investigate before next deploy.
+## OpenAI Prompt Caching
+- Automatic for requests over 1024 tokens with a stable deterministic prefix; ~50% discount on cached input tokens.
+- Cache duration documented by OpenAI as 5-10 minutes idle, up to 1 hour during off-peak.
+- Order the request prefix deterministically — same system prompt, same tool definitions, same retrieved-doc order — or the cache misses silently.
+## Model Router and Fallback
+Every LLM call wraps in a retry-with-decorrelated-jitter plus model-fallback chain:
+1. **Primary** — production-quality model (e.g. Sonnet 4.7).
+2. **Secondary** — faster/cheaper model (e.g. Haiku 4.5, GPT-5-mini, or a local Llama variant).
+3. **Static fallback** — cached response from a similar prior request or a canned templated reply that names the failure ("Search temporarily unavailable — retry in a minute").
+Cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the circuit-breaker + retry-with-decorrelated-jitter primitives the chain reuses. Each fallback path has its own eval suite — a silent quality cliff between primary and secondary is a regression.
+## Hallucination and Groundedness as SLI
+Track per feature and treat as service-level indicators with explicit SLO targets:
+- `ai.hallucination_rate` — % of responses producing claims not present in retrieved sources. SLO: <5% on the golden set.
+- `ai.citation_precision` — % of citations whose source span verifiably contains the cited claim. SLO: >95% on retrieved claims.
+- `ai.refusal_rate` — overall refusal-rate; track false-positive refusals separately.
+- `ai.groundedness_score` — average RAGAS faithfulness across the golden set. SLO: >0.85.
+Cross-reference `rules/hatch3r-observability-metrics.md` for the SLI/SLO authoring template.
+## Safety and Red-Team
+Every feature exercising user-controlled prompts runs adversarial eval on a schedule (weekly minimum, on every prompt-version bump):
+- **Garak** — open-source jailbreak / prompt-injection / PII-leakage probe suite.
+- **PyRIT** — Microsoft red-team framework, scenario-driven.
+- **Inspect-redteam** — UK AISI safety eval modules.
+Block release on regression. PII-leakage tests use a synthetic-PII corpus (never production PII). Prompt-injection tests cover OWASP Top 10 for Agentic Applications 2026.
+## Tool-Use Evals
+When the AI feature uses tools (per the companion UX rule), the eval suite covers:
+- Tool-selection accuracy — correct tool chosen for each input class.
+- Argument validity — emitted args satisfy the tool schema (Zod, JSON Schema, Pydantic).
+- Chain correctness — multi-tool plans reach the goal with the minimum required step count plus a 20% tolerance.
+- Latency budget — p99 tool-execution time within the budget from `rules/hatch3r-performance-budgets.md`.
+Methodology aligned with **BFCL v4** (Berkeley Function Calling Leaderboard) and **tau-bench** (multi-turn tool-use benchmark).
+## OpenTelemetry GenAI Semantic Conventions
+Every LLM call emits an OpenTelemetry span named `gen_ai.<operation>` with the attributes prescribed by the OpenTelemetry GenAI semantic conventions: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.temperature`, `gen_ai.tool.name` (when tools used). Cross-reference Slice 2 observability rules for the broader span taxonomy.
+## User-Feedback Loop
+- Every AI response surface emits a thumbs-up/down control wired to a feedback queue.
+- A monthly triage job promotes thumbs-down examples into regression-test fixtures in `evals/<feature>/edge.jsonl`.
+- Promotion is a manual review step — never auto-promote raw user feedback (it contains noise and adversarial labels).
+## Audit Logging
+- LLM inputs + outputs logged to a compliance store separate from APM with a 30-90 day retention window (configurable per data-classification policy).
+- PII redaction before persistence — same redaction primitive as the framework's data-classification pipeline.
+- Reproducibility key: `model` + `prompt_hash` + `seed` (when the model exposes a seed parameter) + `temperature`. Without all four, the response is non-reproducible.
+## Eval-Driven Development Workflow
+Write eval before prompt, measure baseline, write prompt, measure delta, iterate until eval threshold passes. Cross-reference `skills/hatch3r-ai-feature/SKILL.md` for the step-by-step workflow.
+## References
+- promptfoo — `promptfoo.dev`
+- DeepEval — `github.com/confident-ai/deepeval`
+- RAGAS — `docs.ragas.io`
+- Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
+- Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
+- OpenAI prompt caching guide — `platform.openai.com/docs/guides/prompt-caching`
+- OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
+- Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`
+- tau-bench — `github.com/sierra-research/tau-bench`
+- OWASP Top 10 for LLM Applications (Agentic 2026) — `genai.owasp.org`

package/rules/hatch3r-ai-evals.mdc ADDED Viewed

@@ -0,0 +1,154 @@
+---
+description: AI feature evaluation, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI for end-user projects shipping LLM features
+globs: ["**/ai/**", "**/llm/**", "**/chat/**", "**/assistant/**", "**/agents/**", "**/copilot/**", "**/evals/**", "**/prompts/**", "**/rag/**"]
+alwaysApply: false
+---
+# AI Feature Evaluation and Cost Governance (2026)
+## Scope
+This rule governs the BACKEND half of an LLM-driven feature: eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, hallucination-as-SLI, tool-use evals, safety/red-team, and audit logging. The FRONTEND half (streaming UI, tool-call cards, human-approval gates, cancel/abort/undo, citations) is the paired companion rule `rules/hatch3r-ai-ux-patterns.md`. Apply both rules for any LLM-driven feature — UI-only or backend-only coverage is a regression.
+Detection mirrors the companion rule: this rule activates when the project imports an LLM SDK (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`, `cohere-ai`, `ai`, `@ai-sdk/*`, `langchain`, `llama-index`) or contains files under `ai/`, `llm/`, `chat/`, `assistant/`, `agents/`, `copilot/`, `evals/`, `prompts/`, `rag/`.
+## Eval Harness Mandate
+Every AI feature has an automated eval harness committed to the repo before the feature ships. Hand-rolled "ask the model and eyeball the answer" is a regression in 2026.
+Pick one tool by task class:
+- **promptfoo** — broad coverage, declarative YAML, model-comparison defaults
+- **DeepEval** — pytest-style assertions for CI gate integration
+- **RAGAS** — retrieval-augmented generation metrics (context_precision, context_recall, faithfulness, answer_relevance)
+- **Inspect** — UK AISI framework for safety and agentic evals
+- **braintrust** — SaaS + OSS hybrid, run history retained per prompt version
+- **TruLens** — observability-coupled, runs evals against live traces
+- **Arize Phoenix** — open-source observability with eval modules
+Document the chosen tool in `evals/README.md` so the agent picks the same tool on every future change.
+## Golden Dataset Versioning
+- Eval cases live in repo at `evals/<feature>/golden.{jsonl,csv}`.
+- Minimum 20 cases per feature; one input plus one expected output (or graded rubric) per row.
+- Dataset version in the filename (`golden.v3.jsonl`) — never overwrite v2 in place.
+- Ground truth refreshed quarterly; refresh PR includes a diff of changed expected outputs and the rationale.
+- Hand-curated edge cases (failure modes the model has historically tripped on) live in `evals/<feature>/edge.jsonl` alongside the golden set.
+## Eval Metrics
+Match the metric to the task class:
+- Classification → exact-match accuracy + per-class precision/recall.
+- Open-ended generation → rubric-scored LLM-as-judge with a 50-example calibration set sanity-checking judge agreement against human labels; recompute calibration every model change.
+- Retrieval/RAG → RAGAS metrics: context_precision, context_recall, faithfulness, answer_relevance.
+- Pairwise comparison → win-rate against the prior prompt version (>=55% required to adopt).
+- Refusal calibration → refusal-rate as an explicit SLI (refusals on prohibited inputs vs false-positive refusals on benign inputs).
+## Regression Gating
+- Every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**` runs the eval set in CI.
+- PR blocked when any metric drops below the per-feature threshold defined in `evals/<feature>/thresholds.json`.
+- Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) runs the full eval with a 5% accuracy budget; cross over 5% requires a sign-off comment from a named reviewer and a 24-hour canary at 5% traffic.
+- Eval failure is treated as a test failure — never an advisory warning.
+## Prompt Versioning
+- Prompts are first-class artifacts at `prompts/<feature>/v<N>.{md,txt}` with an SHA-256 hash recorded in the eval thresholds file.
+- Runtime emits `prompt_version` + `prompt_hash` per request as log + span attribute.
+- A/B framework supports concurrent versions behind a flag — traffic split recorded per request so eval delta is computable.
+- Hash drift between repo prompt and deployed prompt is a P0 incident.
+## Cost Telemetry per Request
+Every LLM call logs: `tokens_in`, `tokens_out`, `cache_hit` (boolean + cached_tokens count), `model`, `cost_usd`, `latency_ms`, `cost_center` (feature ID), `prompt_version`, `prompt_hash`, `user_id_hash`.
+Aggregate dashboards in the observability stack — cross-reference `rules/hatch3r-observability-metrics.md` and `rules/hatch3r-observability-tracing-detail.md` for the SLI/SLO vocabulary, and `skills/hatch3r-observability-verify` for the wiring checklist. Per-feature budget alerts fire at 50%, 75%, and 90% of monthly budget; abuse-detection alert at 10x user p99 cost over a 1-hour window.
+## Prompt Caching (Anthropic)
+- Apply `cache_control` breakpoints to long system prompts, tool definitions, and large RAG context blocks above 1024 tokens (the Claude Opus/Sonnet 4.x minimum cache size; 2048 tokens for Haiku 4.x).
+- Up to 4 breakpoints per request, longest-TTL-first.
+- 5-minute TTL costs 1.25x the standard write rate; 1-hour TTL costs 2x the standard write rate. Reads are 0.1x base.
+- Track cache_hit ratio per feature; <30% hit ratio on a stable prompt is a sign the prefix is changing — investigate before next deploy.
+## OpenAI Prompt Caching
+- Automatic for requests over 1024 tokens with a stable deterministic prefix; ~50% discount on cached input tokens.
+- Cache duration documented by OpenAI as 5-10 minutes idle, up to 1 hour during off-peak.
+- Order the request prefix deterministically — same system prompt, same tool definitions, same retrieved-doc order — or the cache misses silently.
+## Model Router and Fallback
+Every LLM call wraps in a retry-with-decorrelated-jitter plus model-fallback chain:
+1. **Primary** — production-quality model (e.g. Sonnet 4.7).
+2. **Secondary** — faster/cheaper model (e.g. Haiku 4.5, GPT-5-mini, or a local Llama variant).
+3. **Static fallback** — cached response from a similar prior request or a canned templated reply that names the failure ("Search temporarily unavailable — retry in a minute").
+Cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the circuit-breaker + retry-with-decorrelated-jitter primitives the chain reuses. Each fallback path has its own eval suite — a silent quality cliff between primary and secondary is a regression.
+## Hallucination and Groundedness as SLI
+Track per feature and treat as service-level indicators with explicit SLO targets:
+- `ai.hallucination_rate` — % of responses producing claims not present in retrieved sources. SLO: <5% on the golden set.
+- `ai.citation_precision` — % of citations whose source span verifiably contains the cited claim. SLO: >95% on retrieved claims.
+- `ai.refusal_rate` — overall refusal-rate; track false-positive refusals separately.
+- `ai.groundedness_score` — average RAGAS faithfulness across the golden set. SLO: >0.85.
+Cross-reference `rules/hatch3r-observability-metrics.md` for the SLI/SLO authoring template.
+## Safety and Red-Team
+Every feature exercising user-controlled prompts runs adversarial eval on a schedule (weekly minimum, on every prompt-version bump):
+- **Garak** — open-source jailbreak / prompt-injection / PII-leakage probe suite.
+- **PyRIT** — Microsoft red-team framework, scenario-driven.
+- **Inspect-redteam** — UK AISI safety eval modules.
+Block release on regression. PII-leakage tests use a synthetic-PII corpus (never production PII). Prompt-injection tests cover OWASP Top 10 for Agentic Applications 2026.
+## Tool-Use Evals
+When the AI feature uses tools (per the companion UX rule), the eval suite covers:
+- Tool-selection accuracy — correct tool chosen for each input class.
+- Argument validity — emitted args satisfy the tool schema (Zod, JSON Schema, Pydantic).
+- Chain correctness — multi-tool plans reach the goal with the minimum required step count plus a 20% tolerance.
+- Latency budget — p99 tool-execution time within the budget from `rules/hatch3r-performance-budgets.md`.
+Methodology aligned with **BFCL v4** (Berkeley Function Calling Leaderboard) and **tau-bench** (multi-turn tool-use benchmark).
+## OpenTelemetry GenAI Semantic Conventions
+Every LLM call emits an OpenTelemetry span named `gen_ai.<operation>` with the attributes prescribed by the OpenTelemetry GenAI semantic conventions: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.temperature`, `gen_ai.tool.name` (when tools used). Cross-reference Slice 2 observability rules for the broader span taxonomy.
+## User-Feedback Loop
+- Every AI response surface emits a thumbs-up/down control wired to a feedback queue.
+- A monthly triage job promotes thumbs-down examples into regression-test fixtures in `evals/<feature>/edge.jsonl`.
+- Promotion is a manual review step — never auto-promote raw user feedback (it contains noise and adversarial labels).
+## Audit Logging
+- LLM inputs + outputs logged to a compliance store separate from APM with a 30-90 day retention window (configurable per data-classification policy).
+- PII redaction before persistence — same redaction primitive as the framework's data-classification pipeline.
+- Reproducibility key: `model` + `prompt_hash` + `seed` (when the model exposes a seed parameter) + `temperature`. Without all four, the response is non-reproducible.
+## Eval-Driven Development Workflow
+Write eval before prompt, measure baseline, write prompt, measure delta, iterate until eval threshold passes. Cross-reference `skills/hatch3r-ai-feature/SKILL.md` for the step-by-step workflow.
+## References
+- promptfoo — `promptfoo.dev`
+- DeepEval — `github.com/confident-ai/deepeval`
+- RAGAS — `docs.ragas.io`
+- Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
+- Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
+- OpenAI prompt caching guide — `platform.openai.com/docs/guides/prompt-caching`
+- OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
+- Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`
+- tau-bench — `github.com/sierra-research/tau-bench`
+- OWASP Top 10 for LLM Applications (Agentic 2026) — `genai.owasp.org`