hatch3r 1.7.0 → 1.7.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +38 -12
- package/agents/hatch3r-a11y-auditor.md +4 -0
- package/agents/hatch3r-architect.md +5 -1
- package/agents/hatch3r-ci-watcher.md +4 -0
- package/agents/hatch3r-context-rules.md +4 -0
- package/agents/hatch3r-creator.md +4 -0
- package/agents/hatch3r-dependency-auditor.md +4 -0
- package/agents/hatch3r-devops.md +4 -0
- package/agents/hatch3r-docs-writer.md +4 -0
- package/agents/hatch3r-fixer.md +5 -1
- package/agents/hatch3r-handoff-loader.md +243 -0
- package/agents/hatch3r-handoff-preparer.md +134 -0
- package/agents/hatch3r-implementer.md +5 -1
- package/agents/hatch3r-learnings-loader.md +4 -0
- package/agents/hatch3r-lint-fixer.md +4 -0
- package/agents/hatch3r-perf-profiler.md +8 -0
- package/agents/hatch3r-researcher.md +5 -1
- package/agents/hatch3r-reviewer.md +92 -0
- package/agents/hatch3r-security-auditor.md +24 -0
- package/agents/hatch3r-test-writer.md +4 -0
- package/agents/modes/requirements-elicitation.md +5 -1
- package/agents/modes/similar-implementation.md +6 -0
- package/agents/modes/user-flows.md +76 -0
- package/agents/shared/quality-charter.md +129 -0
- package/agents/shared/user-question-protocol.md +95 -0
- package/commands/board/shared-azure-devops.md +2 -0
- package/commands/board/shared-github.md +17 -0
- package/commands/board/shared-gitlab.md +4 -0
- package/commands/hatch3r-board-fill.md +2 -1
- package/commands/hatch3r-board-pickup.md +1 -1
- package/commands/hatch3r-board-shared.md +21 -0
- package/commands/hatch3r-create.md +2 -0
- package/commands/hatch3r-handoff.md +126 -0
- package/commands/hatch3r-pr-resolve.md +672 -0
- package/commands/hatch3r-quick-change.md +5 -3
- package/commands/hatch3r-report.md +167 -0
- package/commands/hatch3r-revision.md +1 -1
- package/commands/hatch3r-workflow.md +3 -1
- package/dist/cli/index.js +3144 -979
- package/dist/cli/index.js.map +1 -1
- package/package.json +4 -2
- package/rules/hatch3r-accessibility-standards.md +21 -0
- package/rules/hatch3r-accessibility-standards.mdc +21 -0
- package/rules/hatch3r-agent-orchestration.md +32 -1
- package/rules/hatch3r-agent-orchestration.mdc +32 -1
- package/rules/hatch3r-ai-evals.md +158 -0
- package/rules/hatch3r-ai-evals.mdc +154 -0
- package/rules/hatch3r-ai-ux-patterns.md +131 -0
- package/rules/hatch3r-ai-ux-patterns.mdc +127 -0
- package/rules/hatch3r-api-design.md +67 -9
- package/rules/hatch3r-api-design.mdc +67 -9
- package/rules/hatch3r-api-versioning.md +119 -0
- package/rules/hatch3r-api-versioning.mdc +115 -0
- package/rules/hatch3r-auth-patterns.md +170 -0
- package/rules/hatch3r-auth-patterns.mdc +166 -0
- package/rules/hatch3r-component-conventions.md +30 -0
- package/rules/hatch3r-component-conventions.mdc +30 -0
- package/rules/hatch3r-container-hardening.md +131 -0
- package/rules/hatch3r-container-hardening.mdc +127 -0
- package/rules/hatch3r-contract-testing.md +117 -0
- package/rules/hatch3r-contract-testing.mdc +113 -0
- package/rules/hatch3r-deep-context.md +3 -1
- package/rules/hatch3r-deep-context.mdc +3 -1
- package/rules/hatch3r-dependency-management.md +73 -1
- package/rules/hatch3r-dependency-management.mdc +72 -0
- package/rules/hatch3r-design-system-detection.md +142 -0
- package/rules/hatch3r-design-system-detection.mdc +138 -0
- package/rules/hatch3r-event-schema-evolution.md +90 -0
- package/rules/hatch3r-event-schema-evolution.mdc +86 -0
- package/rules/hatch3r-handoff-readiness.md +45 -0
- package/rules/hatch3r-handoff-readiness.mdc +40 -0
- package/rules/hatch3r-i18n.md +13 -0
- package/rules/hatch3r-i18n.mdc +13 -0
- package/rules/hatch3r-iteration-summary.md +2 -0
- package/rules/hatch3r-iteration-summary.mdc +2 -0
- package/rules/hatch3r-migrations.md +61 -16
- package/rules/hatch3r-migrations.mdc +61 -16
- package/rules/hatch3r-observability-logging.md +1 -1
- package/rules/hatch3r-observability-logging.mdc +1 -1
- package/rules/hatch3r-observability-metrics.md +1 -1
- package/rules/hatch3r-observability-metrics.mdc +1 -1
- package/rules/hatch3r-observability-tracing-detail.md +1 -1
- package/rules/hatch3r-observability-tracing-detail.mdc +1 -1
- package/rules/hatch3r-observability-tracing.md +1 -1
- package/rules/hatch3r-observability-tracing.mdc +1 -1
- package/rules/hatch3r-observability.md +1 -0
- package/rules/hatch3r-observability.mdc +1 -0
- package/rules/hatch3r-operability.md +149 -0
- package/rules/hatch3r-operability.mdc +145 -0
- package/rules/hatch3r-passkey-server.md +181 -0
- package/rules/hatch3r-passkey-server.mdc +177 -0
- package/rules/hatch3r-progressive-delivery.md +120 -0
- package/rules/hatch3r-progressive-delivery.mdc +116 -0
- package/rules/hatch3r-resilience-patterns.md +154 -0
- package/rules/hatch3r-resilience-patterns.mdc +150 -0
- package/rules/hatch3r-secrets-management.md +29 -0
- package/rules/hatch3r-secrets-management.mdc +29 -0
- package/rules/hatch3r-testing.md +139 -43
- package/rules/hatch3r-testing.mdc +139 -43
- package/rules/hatch3r-ux-states-and-flows.md +149 -0
- package/rules/hatch3r-ux-states-and-flows.mdc +145 -0
- package/skills/hatch3r-a11y-audit/SKILL.md +14 -0
- package/skills/hatch3r-ai-feature/SKILL.md +134 -0
- package/skills/hatch3r-api-spec/SKILL.md +5 -0
- package/skills/hatch3r-architecture-review/SKILL.md +14 -0
- package/skills/hatch3r-bug-fix/SKILL.md +5 -0
- package/skills/hatch3r-ci-pipeline/SKILL.md +14 -0
- package/skills/hatch3r-cli-aichat/SKILL.md +84 -0
- package/skills/hatch3r-cli-ast-grep/SKILL.md +85 -0
- package/skills/hatch3r-cli-az-devops/SKILL.md +89 -0
- package/skills/hatch3r-cli-bat/SKILL.md +85 -0
- package/skills/hatch3r-cli-comby/SKILL.md +85 -0
- package/skills/hatch3r-cli-csvkit/SKILL.md +84 -0
- package/skills/hatch3r-cli-delta/SKILL.md +86 -0
- package/skills/hatch3r-cli-difftastic/SKILL.md +84 -0
- package/skills/hatch3r-cli-docker/SKILL.md +89 -0
- package/skills/hatch3r-cli-duckdb/SKILL.md +84 -0
- package/skills/hatch3r-cli-fd/SKILL.md +85 -0
- package/skills/hatch3r-cli-fzf/SKILL.md +84 -0
- package/skills/hatch3r-cli-gh/SKILL.md +90 -0
- package/skills/hatch3r-cli-glab/SKILL.md +89 -0
- package/skills/hatch3r-cli-jq/SKILL.md +85 -0
- package/skills/hatch3r-cli-lazygit/SKILL.md +78 -0
- package/skills/hatch3r-cli-llm/SKILL.md +84 -0
- package/skills/hatch3r-cli-miller/SKILL.md +84 -0
- package/skills/hatch3r-cli-mods/SKILL.md +84 -0
- package/skills/hatch3r-cli-overview/SKILL.md +60 -0
- package/skills/hatch3r-cli-playwright/SKILL.md +89 -0
- package/skills/hatch3r-cli-podman/SKILL.md +84 -0
- package/skills/hatch3r-cli-ripgrep/SKILL.md +85 -0
- package/skills/hatch3r-cli-rtk/SKILL.md +91 -0
- package/skills/hatch3r-cli-sd/SKILL.md +85 -0
- package/skills/hatch3r-cli-stagehand/SKILL.md +79 -0
- package/skills/hatch3r-cli-taplo/SKILL.md +84 -0
- package/skills/hatch3r-cli-xsv/SKILL.md +89 -0
- package/skills/hatch3r-cli-yq/SKILL.md +85 -0
- package/skills/hatch3r-cli-zstd/SKILL.md +85 -0
- package/skills/hatch3r-context-health/SKILL.md +14 -0
- package/skills/hatch3r-cost-tracking/SKILL.md +14 -0
- package/skills/hatch3r-customize/SKILL.md +14 -0
- package/skills/hatch3r-dep-audit/SKILL.md +14 -0
- package/skills/hatch3r-design-system-detect/SKILL.md +162 -0
- package/skills/hatch3r-feature/SKILL.md +2 -0
- package/skills/hatch3r-gh-agentic-workflows/SKILL.md +13 -0
- package/skills/hatch3r-handoff-prepare/SKILL.md +160 -0
- package/skills/hatch3r-handoff-resume/SKILL.md +171 -0
- package/skills/hatch3r-incident-response/SKILL.md +14 -0
- package/skills/hatch3r-issue-workflow/SKILL.md +5 -0
- package/skills/hatch3r-logical-refactor/SKILL.md +14 -0
- package/skills/hatch3r-migration/SKILL.md +14 -0
- package/skills/hatch3r-observability-verify/SKILL.md +133 -0
- package/skills/hatch3r-perf-audit/SKILL.md +14 -0
- package/skills/hatch3r-pr-creation/SKILL.md +14 -0
- package/skills/hatch3r-qa-validation/SKILL.md +18 -0
- package/skills/hatch3r-recipe/SKILL.md +14 -0
- package/skills/hatch3r-refactor/SKILL.md +14 -0
- package/skills/hatch3r-release/SKILL.md +14 -0
- package/skills/hatch3r-reliability-verify/SKILL.md +144 -0
- package/skills/hatch3r-ui-ux-verify/SKILL.md +136 -0
- package/skills/hatch3r-visual-refactor/SKILL.md +15 -1
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "hatch3r",
|
|
3
|
-
"version": "1.7.
|
|
3
|
+
"version": "1.7.5",
|
|
4
4
|
"description": "Battle-tested agentic coding setup framework. One command to hatch your agent stack -- agents, skills, rules, commands, and MCP for every major AI coding tool.",
|
|
5
5
|
"type": "module",
|
|
6
6
|
"exports": {
|
|
@@ -24,7 +24,9 @@
|
|
|
24
24
|
"inventory:check-docs": "tsx scripts/inventory.ts --check-docs",
|
|
25
25
|
"validate:rule-parity": "tsx scripts/validate-rule-parity.ts",
|
|
26
26
|
"validate:efficiency": "tsx scripts/validate-efficiency-invariants.ts",
|
|
27
|
-
"validate": "
|
|
27
|
+
"validate:cli-skills": "tsx scripts/validate-cli-skills.ts",
|
|
28
|
+
"generate:cli-skills": "tsx scripts/generate-cli-skills.ts",
|
|
29
|
+
"validate": "npm run validate:rule-parity && npm run validate:efficiency && npm run validate:cli-skills",
|
|
28
30
|
"audit:validate-registry": "tsx scripts/validate-finding-registry.ts",
|
|
29
31
|
"audit:migrate": "tsx scripts/migrate-finding-registry.ts",
|
|
30
32
|
"audit:archive": "tsx scripts/audit-archive.ts",
|
|
@@ -78,3 +78,24 @@ All user-facing features must meet WCAG 2.2 Level AA conformance. This is the ba
|
|
|
78
78
|
- Run automated accessibility checks (axe-core, Lighthouse) in CI.
|
|
79
79
|
- Automated tools catch ~30% of accessibility issues. Manual testing is required for keyboard flows, screen reader experience, and cognitive accessibility.
|
|
80
80
|
- Maintain an accessibility test checklist per component type (form, modal, navigation, data table).
|
|
81
|
+
|
|
82
|
+
## WCAG 2.2 New Success Criteria (Mandatory Audit Items)
|
|
83
|
+
|
|
84
|
+
WCAG 2.2 (W3C Recommendation, October 2023) added nine Success Criteria; three apply to most UI surfaces and are listed here as required audit items.
|
|
85
|
+
|
|
86
|
+
- **SC 2.5.8 Target Size (Minimum) — AA:** every pointer-target's hit area is at least 24 by 24 CSS pixels. Smaller targets are permitted only when surrounded by ≥24px of spacing (the spacing exception) or when the larger inline target is also reachable (the inline exception). Test densest UI elements (sidebar collapse, table-row checkboxes, icon-only toolbar buttons).
|
|
87
|
+
- **SC 2.4.11 Focus Not Obscured (Minimum) — AA:** when a control receives keyboard focus, the focused element must not be entirely hidden by author-created content. Common violations: sticky headers, persistent cookie banners, chatbot launcher widgets, fixed action bars. Mitigate with `scroll-margin-top` on focus targets equal to the sticky header height, or by collapsing sticky chrome on focus.
|
|
88
|
+
- **SC 2.5.7 Dragging Movements — AA:** any drag operation (reorder, slider thumb drag, map pan, kanban card move) must offer a single-pointer non-drag alternative. Examples: list reorder with up/down arrow buttons; slider with numeric input or +/- buttons; map pan with arrow-key navigation.
|
|
89
|
+
|
|
90
|
+
Reference: https://www.w3.org/TR/WCAG22/
|
|
91
|
+
|
|
92
|
+
## Mobile and Touch Accessibility
|
|
93
|
+
|
|
94
|
+
Touch surfaces have stricter target and spacing requirements than pointer-only surfaces; native platform guidance supersedes WCAG 2.5.8 on touch.
|
|
95
|
+
|
|
96
|
+
- Touch targets: 44 by 44 points on iOS (Apple Human Interface Guidelines), 48 by 48 dp on Android (Material Design 3). These supersede WCAG 2.5.8's 24-pixel minimum on touch-only surfaces.
|
|
97
|
+
- Spacing between interactive elements: ≥8 pixels of separation to prevent mis-taps; ≥4 mm on physical-density screens.
|
|
98
|
+
- Avoid tap-to-reveal patterns (hover tooltips, pure-hover dropdowns) — touch has no hover state. Replace with permanent visible labels or with long-press that has a visible affordance and announces itself to screen readers.
|
|
99
|
+
- Apply `env(safe-area-inset-*)` padding on full-bleed surfaces so content clears notches, home indicators, and rounded corners on iOS and Android edge devices.
|
|
100
|
+
- Support Dynamic Type (iOS) and rem-based font scaling — declare body text in `rem` or `em` units, never `px`, so OS-level font size settings cascade.
|
|
101
|
+
- Zoom to 200% and 400% (per WCAG 1.4.4 and 1.4.10 Reflow) must remain functional with no horizontal scroll trap. Audit for `width: 100vw` and fixed pixel widths that break reflow.
|
|
@@ -74,3 +74,24 @@ All user-facing features must meet WCAG 2.2 Level AA conformance. This is the ba
|
|
|
74
74
|
- Run automated accessibility checks (axe-core, Lighthouse) in CI.
|
|
75
75
|
- Automated tools catch ~30% of accessibility issues. Manual testing is required for keyboard flows, screen reader experience, and cognitive accessibility.
|
|
76
76
|
- Maintain an accessibility test checklist per component type (form, modal, navigation, data table).
|
|
77
|
+
|
|
78
|
+
## WCAG 2.2 New Success Criteria (Mandatory Audit Items)
|
|
79
|
+
|
|
80
|
+
WCAG 2.2 (W3C Recommendation, October 2023) added nine Success Criteria; three apply to most UI surfaces and are listed here as required audit items.
|
|
81
|
+
|
|
82
|
+
- **SC 2.5.8 Target Size (Minimum) — AA:** every pointer-target's hit area is at least 24 by 24 CSS pixels. Smaller targets are permitted only when surrounded by ≥24px of spacing (the spacing exception) or when the larger inline target is also reachable (the inline exception). Test densest UI elements (sidebar collapse, table-row checkboxes, icon-only toolbar buttons).
|
|
83
|
+
- **SC 2.4.11 Focus Not Obscured (Minimum) — AA:** when a control receives keyboard focus, the focused element must not be entirely hidden by author-created content. Common violations: sticky headers, persistent cookie banners, chatbot launcher widgets, fixed action bars. Mitigate with `scroll-margin-top` on focus targets equal to the sticky header height, or by collapsing sticky chrome on focus.
|
|
84
|
+
- **SC 2.5.7 Dragging Movements — AA:** any drag operation (reorder, slider thumb drag, map pan, kanban card move) must offer a single-pointer non-drag alternative. Examples: list reorder with up/down arrow buttons; slider with numeric input or +/- buttons; map pan with arrow-key navigation.
|
|
85
|
+
|
|
86
|
+
Reference: https://www.w3.org/TR/WCAG22/
|
|
87
|
+
|
|
88
|
+
## Mobile and Touch Accessibility
|
|
89
|
+
|
|
90
|
+
Touch surfaces have stricter target and spacing requirements than pointer-only surfaces; native platform guidance supersedes WCAG 2.5.8 on touch.
|
|
91
|
+
|
|
92
|
+
- Touch targets: 44 by 44 points on iOS (Apple Human Interface Guidelines), 48 by 48 dp on Android (Material Design 3). These supersede WCAG 2.5.8's 24-pixel minimum on touch-only surfaces.
|
|
93
|
+
- Spacing between interactive elements: ≥8 pixels of separation to prevent mis-taps; ≥4 mm on physical-density screens.
|
|
94
|
+
- Avoid tap-to-reveal patterns (hover tooltips, pure-hover dropdowns) — touch has no hover state. Replace with permanent visible labels or with long-press that has a visible affordance and announces itself to screen readers.
|
|
95
|
+
- Apply `env(safe-area-inset-*)` padding on full-bleed surfaces so content clears notches, home indicators, and rounded corners on iOS and Android edge devices.
|
|
96
|
+
- Support Dynamic Type (iOS) and rem-based font scaling — declare body text in `rem` or `em` units, never `px`, so OS-level font size settings cascade.
|
|
97
|
+
- Zoom to 200% and 400% (per WCAG 1.4.4 and 1.4.10 Reflow) must remain functional with no horizontal scroll trap. Audit for `width: 100vw` and fixed pixel widths that break reflow.
|
|
@@ -19,6 +19,8 @@ Hatch3r's orchestration uses a **phase-gated pipeline** (Research, Implement, Re
|
|
|
19
19
|
|
|
20
20
|
This rule applies to EVERY context without exception: board-pickup (epic, sub-issue, standalone, batch), workflow command (full/quick), plain chat, issue references, and natural language requests. The full sub-agent pipeline is mandatory — never implement code inline without sub-agents.
|
|
21
21
|
|
|
22
|
+
**"Inline implementation" defined.** Inline implementation means calling any code-writing tool — `Edit`, `Write`, `MultiEdit`, `NotebookEdit`, `replace_string_in_file`, `multi_replace_string_in_file`, `create_file`, `str_replace_based_edit_tool`, `apply_patch`, or any platform equivalent — from the orchestrator turn itself, rather than from inside a spawned `hatch3r-implementer` (Phase 2) or `hatch3r-fixer` (Phase 3) sub-agent. The only carve-out is `hatch3r-quick-change` for Tier 1 single-line trivial edits per its declared scope.
|
|
23
|
+
|
|
22
24
|
## Universal Sub-Agent Pipeline
|
|
23
25
|
|
|
24
26
|
Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatch3r-researcher`), **Phase 2 — Implement** (`hatch3r-implementer`), **Phase 3 — Review Loop** (`hatch3r-reviewer` + `hatch3r-fixer`), **Phase 4 — Final Quality** (parallel specialists). See Mandatory Delegation Directives below.
|
|
@@ -46,7 +48,7 @@ Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatc
|
|
|
46
48
|
|
|
47
49
|
Score task complexity per the `hatch3r-deep-context` rule before Phase 1. Apply the resulting tier:
|
|
48
50
|
|
|
49
|
-
- **Tier 2 (
|
|
51
|
+
- **Tier 2 hard gate (B1).** Before Phase 2, run `hatch3r-researcher` with `requirements-elicitation:quick` mode to detect ambiguity and ask the user via the platform-native question tool (per `agents/shared/user-question-protocol.md`). Orchestrator awaits answers and integrates them into the Phase 1 brief; do not begin Phase 2 with unresolved questions. Tier 1 is exempt only when scope is single-file, single-concern, and acceptance is testable from the user message alone.
|
|
50
52
|
- **Tier 3 (Deep):** Present Pre-Implementation Summary and ASK for confirmation. Do NOT proceed until all unresolved questions are answered.
|
|
51
53
|
|
|
52
54
|
## Mandatory Delegation Directives
|
|
@@ -84,6 +86,27 @@ Spawn `hatch3r-implementer` via Task tool for ALL code changes. Never implement
|
|
|
84
86
|
|
|
85
87
|
**Implementer prompt enrichment (Tier 2+):** Include `similar-implementation` findings as "Reference Conventions", resolved `requirements-elicitation` answers as "Resolved Requirements", and blast radius data (Tier 3 only).
|
|
86
88
|
|
|
89
|
+
### Per-Turn Pipeline-State Header
|
|
90
|
+
|
|
91
|
+
Whenever a tracked task is active at Tier 2 or Tier 3 (deep-context score >= 3), the orchestrator MUST emit a single-line pipeline-state header at the very start of every assistant turn that touches the task. Format:
|
|
92
|
+
|
|
93
|
+
```
|
|
94
|
+
[hatch3r-pipeline: phase {1|2|3|4} | last: {agent} → {SUCCESS|PARTIAL|FAILED|BLOCKED|n/a} | next: {agent or "user-confirmation" or "complete"}]
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
Examples:
|
|
98
|
+
|
|
99
|
+
- `[hatch3r-pipeline: phase 1 | last: n/a | next: hatch3r-researcher]`
|
|
100
|
+
- `[hatch3r-pipeline: phase 2 | last: hatch3r-researcher → SUCCESS | next: hatch3r-implementer]`
|
|
101
|
+
- `[hatch3r-pipeline: phase 3 | last: hatch3r-reviewer → PARTIAL | next: hatch3r-fixer]`
|
|
102
|
+
- `[hatch3r-pipeline: phase 3 | last: hatch3r-implementer → SUCCESS | next: user-confirmation]`
|
|
103
|
+
|
|
104
|
+
A missing header on a tracked Tier >= 2 task is a self-detectable drift signal — the user may halt the turn and request re-grounding. The header also functions as a per-reply cache prime: rendering it forces the orchestrator to re-resolve which phase it is in before choosing tools. Tier 1 tasks, read-only answers, and chat-only iterations do NOT require the header.
|
|
105
|
+
|
|
106
|
+
### Mandatory Delegation Directive (No Inline Implementation)
|
|
107
|
+
|
|
108
|
+
Restating with maximum clarity for sub-agent prompt inclusion: the orchestrator MUST NOT call `Edit`, `Write`, `MultiEdit`, `NotebookEdit`, `replace_string_in_file`, `multi_replace_string_in_file`, `create_file`, `str_replace_based_edit_tool`, `apply_patch`, or any platform-equivalent code-writing tool from its own turn. The only path for code mutation is the Task tool spawning `hatch3r-implementer` (Phase 2) or `hatch3r-fixer` (Phase 3). Carve-out: `hatch3r-quick-change` Tier 1 trivial items per its declared scope. No other carve-out exists. Violations are bypass mode (see issue #73) — surface them by halting the turn and re-delegating.
|
|
109
|
+
|
|
87
110
|
### Mid-Implementation Research Gap Checkpoint
|
|
88
111
|
|
|
89
112
|
At the midpoint of Phase 2 (after initial files are modified but before completion), the implementer MUST evaluate whether research gaps exist. This prevents discovering missing context too late in the pipeline.
|
|
@@ -230,6 +253,14 @@ ALL three must hold:
|
|
|
230
253
|
|
|
231
254
|
**Default:** When in doubt, serialize. For typical hatch3r tasks (1–5 sub-tasks) the DAG-scheduling overhead often outweighs concurrency gain.
|
|
232
255
|
|
|
256
|
+
### Cost-Dominance Principle
|
|
257
|
+
|
|
258
|
+
Token cost of sub-agent invocation never justifies serialization of independent work. The three safety conditions (read-only or disjoint writes, deterministic aggregation, no shared mutable state) govern WHEN parallelism is safe; cost does not govern WHETHER to parallelize. When in doubt, fan out. Serialization is only valid on true dependency edges.
|
|
259
|
+
|
|
260
|
+
### Scaling Heuristic
|
|
261
|
+
|
|
262
|
+
Sub-agent count tracks task decomposition: N independent modules → N parallel Phase-2 implementers; M specialist gates → M parallel Phase-4 specialists; K independent research questions → K parallel researcher modes. Orchestrators emit `sub_agents_spawned: {count, rationale}` in their structured output.
|
|
263
|
+
|
|
233
264
|
## Cross-Phase Error Propagation
|
|
234
265
|
|
|
235
266
|
When a phase produces a non-SUCCESS status, the orchestrator must propagate error context to downstream phases rather than silently dropping it:
|
|
@@ -14,6 +14,8 @@ Hatch3r's orchestration uses a **phase-gated pipeline** (Research, Implement, Re
|
|
|
14
14
|
|
|
15
15
|
This rule applies to EVERY context without exception: board-pickup (epic, sub-issue, standalone, batch), workflow command (full/quick), plain chat, issue references, and natural language requests. The full sub-agent pipeline is mandatory — never implement code inline without sub-agents.
|
|
16
16
|
|
|
17
|
+
**"Inline implementation" defined.** Inline implementation means calling any code-writing tool — `Edit`, `Write`, `MultiEdit`, `NotebookEdit`, `replace_string_in_file`, `multi_replace_string_in_file`, `create_file`, `str_replace_based_edit_tool`, `apply_patch`, or any platform equivalent — from the orchestrator turn itself, rather than from inside a spawned `hatch3r-implementer` (Phase 2) or `hatch3r-fixer` (Phase 3) sub-agent. The only carve-out is `hatch3r-quick-change` for Tier 1 single-line trivial edits per its declared scope.
|
|
18
|
+
|
|
17
19
|
## Universal Sub-Agent Pipeline
|
|
18
20
|
|
|
19
21
|
Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatch3r-researcher`), **Phase 2 — Implement** (`hatch3r-implementer`), **Phase 3 — Review Loop** (`hatch3r-reviewer` + `hatch3r-fixer`), **Phase 4 — Final Quality** (parallel specialists). See Mandatory Delegation Directives below.
|
|
@@ -41,7 +43,7 @@ Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatc
|
|
|
41
43
|
|
|
42
44
|
Score task complexity per the `hatch3r-deep-context` rule before Phase 1. Apply the resulting tier:
|
|
43
45
|
|
|
44
|
-
- **Tier 2 (
|
|
46
|
+
- **Tier 2 hard gate (B1).** Before Phase 2, run `hatch3r-researcher` with `requirements-elicitation:quick` mode to detect ambiguity and ask the user via the platform-native question tool (per `agents/shared/user-question-protocol.md`). Orchestrator awaits answers and integrates them into the Phase 1 brief; do not begin Phase 2 with unresolved questions. Tier 1 is exempt only when scope is single-file, single-concern, and acceptance is testable from the user message alone.
|
|
45
47
|
- **Tier 3 (Deep):** Present Pre-Implementation Summary and ASK for confirmation. Do NOT proceed until all unresolved questions are answered.
|
|
46
48
|
|
|
47
49
|
## Mandatory Delegation Directives
|
|
@@ -79,6 +81,27 @@ Spawn `hatch3r-implementer` via Task tool for ALL code changes. Never implement
|
|
|
79
81
|
|
|
80
82
|
**Implementer prompt enrichment (Tier 2+):** Include `similar-implementation` findings as "Reference Conventions", resolved `requirements-elicitation` answers as "Resolved Requirements", and blast radius data (Tier 3 only).
|
|
81
83
|
|
|
84
|
+
### Per-Turn Pipeline-State Header
|
|
85
|
+
|
|
86
|
+
Whenever a tracked task is active at Tier 2 or Tier 3 (deep-context score >= 3), the orchestrator MUST emit a single-line pipeline-state header at the very start of every assistant turn that touches the task. Format:
|
|
87
|
+
|
|
88
|
+
```
|
|
89
|
+
[hatch3r-pipeline: phase {1|2|3|4} | last: {agent} → {SUCCESS|PARTIAL|FAILED|BLOCKED|n/a} | next: {agent or "user-confirmation" or "complete"}]
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
Examples:
|
|
93
|
+
|
|
94
|
+
- `[hatch3r-pipeline: phase 1 | last: n/a | next: hatch3r-researcher]`
|
|
95
|
+
- `[hatch3r-pipeline: phase 2 | last: hatch3r-researcher → SUCCESS | next: hatch3r-implementer]`
|
|
96
|
+
- `[hatch3r-pipeline: phase 3 | last: hatch3r-reviewer → PARTIAL | next: hatch3r-fixer]`
|
|
97
|
+
- `[hatch3r-pipeline: phase 3 | last: hatch3r-implementer → SUCCESS | next: user-confirmation]`
|
|
98
|
+
|
|
99
|
+
A missing header on a tracked Tier >= 2 task is a self-detectable drift signal — the user may halt the turn and request re-grounding. The header also functions as a per-reply cache prime: rendering it forces the orchestrator to re-resolve which phase it is in before choosing tools. Tier 1 tasks, read-only answers, and chat-only iterations do NOT require the header.
|
|
100
|
+
|
|
101
|
+
### Mandatory Delegation Directive (No Inline Implementation)
|
|
102
|
+
|
|
103
|
+
Restating with maximum clarity for sub-agent prompt inclusion: the orchestrator MUST NOT call `Edit`, `Write`, `MultiEdit`, `NotebookEdit`, `replace_string_in_file`, `multi_replace_string_in_file`, `create_file`, `str_replace_based_edit_tool`, `apply_patch`, or any platform-equivalent code-writing tool from its own turn. The only path for code mutation is the Task tool spawning `hatch3r-implementer` (Phase 2) or `hatch3r-fixer` (Phase 3). Carve-out: `hatch3r-quick-change` Tier 1 trivial items per its declared scope. No other carve-out exists. Violations are bypass mode (see issue #73) — surface them by halting the turn and re-delegating.
|
|
104
|
+
|
|
82
105
|
### Mid-Implementation Research Gap Checkpoint
|
|
83
106
|
|
|
84
107
|
At the midpoint of Phase 2 (after initial files are modified but before completion), the implementer MUST evaluate whether research gaps exist. This prevents discovering missing context too late in the pipeline.
|
|
@@ -225,6 +248,14 @@ ALL three must hold:
|
|
|
225
248
|
|
|
226
249
|
**Default:** When in doubt, serialize. For typical hatch3r tasks (1–5 sub-tasks) the DAG-scheduling overhead often outweighs concurrency gain.
|
|
227
250
|
|
|
251
|
+
### Cost-Dominance Principle
|
|
252
|
+
|
|
253
|
+
Token cost of sub-agent invocation never justifies serialization of independent work. The three safety conditions (read-only or disjoint writes, deterministic aggregation, no shared mutable state) govern WHEN parallelism is safe; cost does not govern WHETHER to parallelize. When in doubt, fan out. Serialization is only valid on true dependency edges.
|
|
254
|
+
|
|
255
|
+
### Scaling Heuristic
|
|
256
|
+
|
|
257
|
+
Sub-agent count tracks task decomposition: N independent modules → N parallel Phase-2 implementers; M specialist gates → M parallel Phase-4 specialists; K independent research questions → K parallel researcher modes. Orchestrators emit `sub_agents_spawned: {count, rationale}` in their structured output.
|
|
258
|
+
|
|
228
259
|
## Cross-Phase Error Propagation
|
|
229
260
|
|
|
230
261
|
When a phase produces a non-SUCCESS status, the orchestrator must propagate error context to downstream phases rather than silently dropping it:
|
|
@@ -0,0 +1,158 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-ai-evals
|
|
3
|
+
type: rule
|
|
4
|
+
description: AI feature evaluation, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI for end-user projects shipping LLM features
|
|
5
|
+
scope: "**/ai/**,**/llm/**,**/chat/**,**/assistant/**,**/agents/**,**/copilot/**,**/evals/**,**/prompts/**,**/rag/**"
|
|
6
|
+
tags: [review, implementation, ai]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
8
|
+
cache_friendly: true
|
|
9
|
+
---
|
|
10
|
+
# AI Feature Evaluation and Cost Governance (2026)
|
|
11
|
+
|
|
12
|
+
## Scope
|
|
13
|
+
|
|
14
|
+
This rule governs the BACKEND half of an LLM-driven feature: eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, hallucination-as-SLI, tool-use evals, safety/red-team, and audit logging. The FRONTEND half (streaming UI, tool-call cards, human-approval gates, cancel/abort/undo, citations) is the paired companion rule `rules/hatch3r-ai-ux-patterns.md`. Apply both rules for any LLM-driven feature — UI-only or backend-only coverage is a regression.
|
|
15
|
+
|
|
16
|
+
Detection mirrors the companion rule: this rule activates when the project imports an LLM SDK (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`, `cohere-ai`, `ai`, `@ai-sdk/*`, `langchain`, `llama-index`) or contains files under `ai/`, `llm/`, `chat/`, `assistant/`, `agents/`, `copilot/`, `evals/`, `prompts/`, `rag/`.
|
|
17
|
+
|
|
18
|
+
## Eval Harness Mandate
|
|
19
|
+
|
|
20
|
+
Every AI feature has an automated eval harness committed to the repo before the feature ships. Hand-rolled "ask the model and eyeball the answer" is a regression in 2026.
|
|
21
|
+
|
|
22
|
+
Pick one tool by task class:
|
|
23
|
+
|
|
24
|
+
- **promptfoo** — broad coverage, declarative YAML, model-comparison defaults
|
|
25
|
+
- **DeepEval** — pytest-style assertions for CI gate integration
|
|
26
|
+
- **RAGAS** — retrieval-augmented generation metrics (context_precision, context_recall, faithfulness, answer_relevance)
|
|
27
|
+
- **Inspect** — UK AISI framework for safety and agentic evals
|
|
28
|
+
- **braintrust** — SaaS + OSS hybrid, run history retained per prompt version
|
|
29
|
+
- **TruLens** — observability-coupled, runs evals against live traces
|
|
30
|
+
- **Arize Phoenix** — open-source observability with eval modules
|
|
31
|
+
|
|
32
|
+
Document the chosen tool in `evals/README.md` so the agent picks the same tool on every future change.
|
|
33
|
+
|
|
34
|
+
## Golden Dataset Versioning
|
|
35
|
+
|
|
36
|
+
- Eval cases live in repo at `evals/<feature>/golden.{jsonl,csv}`.
|
|
37
|
+
- Minimum 20 cases per feature; one input plus one expected output (or graded rubric) per row.
|
|
38
|
+
- Dataset version in the filename (`golden.v3.jsonl`) — never overwrite v2 in place.
|
|
39
|
+
- Ground truth refreshed quarterly; refresh PR includes a diff of changed expected outputs and the rationale.
|
|
40
|
+
- Hand-curated edge cases (failure modes the model has historically tripped on) live in `evals/<feature>/edge.jsonl` alongside the golden set.
|
|
41
|
+
|
|
42
|
+
## Eval Metrics
|
|
43
|
+
|
|
44
|
+
Match the metric to the task class:
|
|
45
|
+
|
|
46
|
+
- Classification → exact-match accuracy + per-class precision/recall.
|
|
47
|
+
- Open-ended generation → rubric-scored LLM-as-judge with a 50-example calibration set sanity-checking judge agreement against human labels; recompute calibration every model change.
|
|
48
|
+
- Retrieval/RAG → RAGAS metrics: context_precision, context_recall, faithfulness, answer_relevance.
|
|
49
|
+
- Pairwise comparison → win-rate against the prior prompt version (>=55% required to adopt).
|
|
50
|
+
- Refusal calibration → refusal-rate as an explicit SLI (refusals on prohibited inputs vs false-positive refusals on benign inputs).
|
|
51
|
+
|
|
52
|
+
## Regression Gating
|
|
53
|
+
|
|
54
|
+
- Every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**` runs the eval set in CI.
|
|
55
|
+
- PR blocked when any metric drops below the per-feature threshold defined in `evals/<feature>/thresholds.json`.
|
|
56
|
+
- Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) runs the full eval with a 5% accuracy budget; cross over 5% requires a sign-off comment from a named reviewer and a 24-hour canary at 5% traffic.
|
|
57
|
+
- Eval failure is treated as a test failure — never an advisory warning.
|
|
58
|
+
|
|
59
|
+
## Prompt Versioning
|
|
60
|
+
|
|
61
|
+
- Prompts are first-class artifacts at `prompts/<feature>/v<N>.{md,txt}` with an SHA-256 hash recorded in the eval thresholds file.
|
|
62
|
+
- Runtime emits `prompt_version` + `prompt_hash` per request as log + span attribute.
|
|
63
|
+
- A/B framework supports concurrent versions behind a flag — traffic split recorded per request so eval delta is computable.
|
|
64
|
+
- Hash drift between repo prompt and deployed prompt is a P0 incident.
|
|
65
|
+
|
|
66
|
+
## Cost Telemetry per Request
|
|
67
|
+
|
|
68
|
+
Every LLM call logs: `tokens_in`, `tokens_out`, `cache_hit` (boolean + cached_tokens count), `model`, `cost_usd`, `latency_ms`, `cost_center` (feature ID), `prompt_version`, `prompt_hash`, `user_id_hash`.
|
|
69
|
+
|
|
70
|
+
Aggregate dashboards in the observability stack — cross-reference `rules/hatch3r-observability-metrics.md` and `rules/hatch3r-observability-tracing-detail.md` for the SLI/SLO vocabulary, and `skills/hatch3r-observability-verify` for the wiring checklist. Per-feature budget alerts fire at 50%, 75%, and 90% of monthly budget; abuse-detection alert at 10x user p99 cost over a 1-hour window.
|
|
71
|
+
|
|
72
|
+
## Prompt Caching (Anthropic)
|
|
73
|
+
|
|
74
|
+
- Apply `cache_control` breakpoints to long system prompts, tool definitions, and large RAG context blocks above 1024 tokens (the Claude Opus/Sonnet 4.x minimum cache size; 2048 tokens for Haiku 4.x).
|
|
75
|
+
- Up to 4 breakpoints per request, longest-TTL-first.
|
|
76
|
+
- 5-minute TTL costs 1.25x the standard write rate; 1-hour TTL costs 2x the standard write rate. Reads are 0.1x base.
|
|
77
|
+
- Track cache_hit ratio per feature; <30% hit ratio on a stable prompt is a sign the prefix is changing — investigate before next deploy.
|
|
78
|
+
|
|
79
|
+
## OpenAI Prompt Caching
|
|
80
|
+
|
|
81
|
+
- Automatic for requests over 1024 tokens with a stable deterministic prefix; ~50% discount on cached input tokens.
|
|
82
|
+
- Cache duration documented by OpenAI as 5-10 minutes idle, up to 1 hour during off-peak.
|
|
83
|
+
- Order the request prefix deterministically — same system prompt, same tool definitions, same retrieved-doc order — or the cache misses silently.
|
|
84
|
+
|
|
85
|
+
## Model Router and Fallback
|
|
86
|
+
|
|
87
|
+
Every LLM call wraps in a retry-with-decorrelated-jitter plus model-fallback chain:
|
|
88
|
+
|
|
89
|
+
1. **Primary** — production-quality model (e.g. Sonnet 4.7).
|
|
90
|
+
2. **Secondary** — faster/cheaper model (e.g. Haiku 4.5, GPT-5-mini, or a local Llama variant).
|
|
91
|
+
3. **Static fallback** — cached response from a similar prior request or a canned templated reply that names the failure ("Search temporarily unavailable — retry in a minute").
|
|
92
|
+
|
|
93
|
+
Cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the circuit-breaker + retry-with-decorrelated-jitter primitives the chain reuses. Each fallback path has its own eval suite — a silent quality cliff between primary and secondary is a regression.
|
|
94
|
+
|
|
95
|
+
## Hallucination and Groundedness as SLI
|
|
96
|
+
|
|
97
|
+
Track per feature and treat as service-level indicators with explicit SLO targets:
|
|
98
|
+
|
|
99
|
+
- `ai.hallucination_rate` — % of responses producing claims not present in retrieved sources. SLO: <5% on the golden set.
|
|
100
|
+
- `ai.citation_precision` — % of citations whose source span verifiably contains the cited claim. SLO: >95% on retrieved claims.
|
|
101
|
+
- `ai.refusal_rate` — overall refusal-rate; track false-positive refusals separately.
|
|
102
|
+
- `ai.groundedness_score` — average RAGAS faithfulness across the golden set. SLO: >0.85.
|
|
103
|
+
|
|
104
|
+
Cross-reference `rules/hatch3r-observability-metrics.md` for the SLI/SLO authoring template.
|
|
105
|
+
|
|
106
|
+
## Safety and Red-Team
|
|
107
|
+
|
|
108
|
+
Every feature exercising user-controlled prompts runs adversarial eval on a schedule (weekly minimum, on every prompt-version bump):
|
|
109
|
+
|
|
110
|
+
- **Garak** — open-source jailbreak / prompt-injection / PII-leakage probe suite.
|
|
111
|
+
- **PyRIT** — Microsoft red-team framework, scenario-driven.
|
|
112
|
+
- **Inspect-redteam** — UK AISI safety eval modules.
|
|
113
|
+
|
|
114
|
+
Block release on regression. PII-leakage tests use a synthetic-PII corpus (never production PII). Prompt-injection tests cover OWASP Top 10 for Agentic Applications 2026.
|
|
115
|
+
|
|
116
|
+
## Tool-Use Evals
|
|
117
|
+
|
|
118
|
+
When the AI feature uses tools (per the companion UX rule), the eval suite covers:
|
|
119
|
+
|
|
120
|
+
- Tool-selection accuracy — correct tool chosen for each input class.
|
|
121
|
+
- Argument validity — emitted args satisfy the tool schema (Zod, JSON Schema, Pydantic).
|
|
122
|
+
- Chain correctness — multi-tool plans reach the goal with the minimum required step count plus a 20% tolerance.
|
|
123
|
+
- Latency budget — p99 tool-execution time within the budget from `rules/hatch3r-performance-budgets.md`.
|
|
124
|
+
|
|
125
|
+
Methodology aligned with **BFCL v4** (Berkeley Function Calling Leaderboard) and **tau-bench** (multi-turn tool-use benchmark).
|
|
126
|
+
|
|
127
|
+
## OpenTelemetry GenAI Semantic Conventions
|
|
128
|
+
|
|
129
|
+
Every LLM call emits an OpenTelemetry span named `gen_ai.<operation>` with the attributes prescribed by the OpenTelemetry GenAI semantic conventions: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.temperature`, `gen_ai.tool.name` (when tools used). Cross-reference Slice 2 observability rules for the broader span taxonomy.
|
|
130
|
+
|
|
131
|
+
## User-Feedback Loop
|
|
132
|
+
|
|
133
|
+
- Every AI response surface emits a thumbs-up/down control wired to a feedback queue.
|
|
134
|
+
- A monthly triage job promotes thumbs-down examples into regression-test fixtures in `evals/<feature>/edge.jsonl`.
|
|
135
|
+
- Promotion is a manual review step — never auto-promote raw user feedback (it contains noise and adversarial labels).
|
|
136
|
+
|
|
137
|
+
## Audit Logging
|
|
138
|
+
|
|
139
|
+
- LLM inputs + outputs logged to a compliance store separate from APM with a 30-90 day retention window (configurable per data-classification policy).
|
|
140
|
+
- PII redaction before persistence — same redaction primitive as the framework's data-classification pipeline.
|
|
141
|
+
- Reproducibility key: `model` + `prompt_hash` + `seed` (when the model exposes a seed parameter) + `temperature`. Without all four, the response is non-reproducible.
|
|
142
|
+
|
|
143
|
+
## Eval-Driven Development Workflow
|
|
144
|
+
|
|
145
|
+
Write eval before prompt, measure baseline, write prompt, measure delta, iterate until eval threshold passes. Cross-reference `skills/hatch3r-ai-feature/SKILL.md` for the step-by-step workflow.
|
|
146
|
+
|
|
147
|
+
## References
|
|
148
|
+
|
|
149
|
+
- promptfoo — `promptfoo.dev`
|
|
150
|
+
- DeepEval — `github.com/confident-ai/deepeval`
|
|
151
|
+
- RAGAS — `docs.ragas.io`
|
|
152
|
+
- Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
|
|
153
|
+
- Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
|
|
154
|
+
- OpenAI prompt caching guide — `platform.openai.com/docs/guides/prompt-caching`
|
|
155
|
+
- OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
|
|
156
|
+
- Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`
|
|
157
|
+
- tau-bench — `github.com/sierra-research/tau-bench`
|
|
158
|
+
- OWASP Top 10 for LLM Applications (Agentic 2026) — `genai.owasp.org`
|
|
@@ -0,0 +1,154 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: AI feature evaluation, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI for end-user projects shipping LLM features
|
|
3
|
+
globs: ["**/ai/**", "**/llm/**", "**/chat/**", "**/assistant/**", "**/agents/**", "**/copilot/**", "**/evals/**", "**/prompts/**", "**/rag/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# AI Feature Evaluation and Cost Governance (2026)
|
|
7
|
+
|
|
8
|
+
## Scope
|
|
9
|
+
|
|
10
|
+
This rule governs the BACKEND half of an LLM-driven feature: eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, hallucination-as-SLI, tool-use evals, safety/red-team, and audit logging. The FRONTEND half (streaming UI, tool-call cards, human-approval gates, cancel/abort/undo, citations) is the paired companion rule `rules/hatch3r-ai-ux-patterns.md`. Apply both rules for any LLM-driven feature — UI-only or backend-only coverage is a regression.
|
|
11
|
+
|
|
12
|
+
Detection mirrors the companion rule: this rule activates when the project imports an LLM SDK (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`, `cohere-ai`, `ai`, `@ai-sdk/*`, `langchain`, `llama-index`) or contains files under `ai/`, `llm/`, `chat/`, `assistant/`, `agents/`, `copilot/`, `evals/`, `prompts/`, `rag/`.
|
|
13
|
+
|
|
14
|
+
## Eval Harness Mandate
|
|
15
|
+
|
|
16
|
+
Every AI feature has an automated eval harness committed to the repo before the feature ships. Hand-rolled "ask the model and eyeball the answer" is a regression in 2026.
|
|
17
|
+
|
|
18
|
+
Pick one tool by task class:
|
|
19
|
+
|
|
20
|
+
- **promptfoo** — broad coverage, declarative YAML, model-comparison defaults
|
|
21
|
+
- **DeepEval** — pytest-style assertions for CI gate integration
|
|
22
|
+
- **RAGAS** — retrieval-augmented generation metrics (context_precision, context_recall, faithfulness, answer_relevance)
|
|
23
|
+
- **Inspect** — UK AISI framework for safety and agentic evals
|
|
24
|
+
- **braintrust** — SaaS + OSS hybrid, run history retained per prompt version
|
|
25
|
+
- **TruLens** — observability-coupled, runs evals against live traces
|
|
26
|
+
- **Arize Phoenix** — open-source observability with eval modules
|
|
27
|
+
|
|
28
|
+
Document the chosen tool in `evals/README.md` so the agent picks the same tool on every future change.
|
|
29
|
+
|
|
30
|
+
## Golden Dataset Versioning
|
|
31
|
+
|
|
32
|
+
- Eval cases live in repo at `evals/<feature>/golden.{jsonl,csv}`.
|
|
33
|
+
- Minimum 20 cases per feature; one input plus one expected output (or graded rubric) per row.
|
|
34
|
+
- Dataset version in the filename (`golden.v3.jsonl`) — never overwrite v2 in place.
|
|
35
|
+
- Ground truth refreshed quarterly; refresh PR includes a diff of changed expected outputs and the rationale.
|
|
36
|
+
- Hand-curated edge cases (failure modes the model has historically tripped on) live in `evals/<feature>/edge.jsonl` alongside the golden set.
|
|
37
|
+
|
|
38
|
+
## Eval Metrics
|
|
39
|
+
|
|
40
|
+
Match the metric to the task class:
|
|
41
|
+
|
|
42
|
+
- Classification → exact-match accuracy + per-class precision/recall.
|
|
43
|
+
- Open-ended generation → rubric-scored LLM-as-judge with a 50-example calibration set sanity-checking judge agreement against human labels; recompute calibration every model change.
|
|
44
|
+
- Retrieval/RAG → RAGAS metrics: context_precision, context_recall, faithfulness, answer_relevance.
|
|
45
|
+
- Pairwise comparison → win-rate against the prior prompt version (>=55% required to adopt).
|
|
46
|
+
- Refusal calibration → refusal-rate as an explicit SLI (refusals on prohibited inputs vs false-positive refusals on benign inputs).
|
|
47
|
+
|
|
48
|
+
## Regression Gating
|
|
49
|
+
|
|
50
|
+
- Every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**` runs the eval set in CI.
|
|
51
|
+
- PR blocked when any metric drops below the per-feature threshold defined in `evals/<feature>/thresholds.json`.
|
|
52
|
+
- Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) runs the full eval with a 5% accuracy budget; cross over 5% requires a sign-off comment from a named reviewer and a 24-hour canary at 5% traffic.
|
|
53
|
+
- Eval failure is treated as a test failure — never an advisory warning.
|
|
54
|
+
|
|
55
|
+
## Prompt Versioning
|
|
56
|
+
|
|
57
|
+
- Prompts are first-class artifacts at `prompts/<feature>/v<N>.{md,txt}` with an SHA-256 hash recorded in the eval thresholds file.
|
|
58
|
+
- Runtime emits `prompt_version` + `prompt_hash` per request as log + span attribute.
|
|
59
|
+
- A/B framework supports concurrent versions behind a flag — traffic split recorded per request so eval delta is computable.
|
|
60
|
+
- Hash drift between repo prompt and deployed prompt is a P0 incident.
|
|
61
|
+
|
|
62
|
+
## Cost Telemetry per Request
|
|
63
|
+
|
|
64
|
+
Every LLM call logs: `tokens_in`, `tokens_out`, `cache_hit` (boolean + cached_tokens count), `model`, `cost_usd`, `latency_ms`, `cost_center` (feature ID), `prompt_version`, `prompt_hash`, `user_id_hash`.
|
|
65
|
+
|
|
66
|
+
Aggregate dashboards in the observability stack — cross-reference `rules/hatch3r-observability-metrics.md` and `rules/hatch3r-observability-tracing-detail.md` for the SLI/SLO vocabulary, and `skills/hatch3r-observability-verify` for the wiring checklist. Per-feature budget alerts fire at 50%, 75%, and 90% of monthly budget; abuse-detection alert at 10x user p99 cost over a 1-hour window.
|
|
67
|
+
|
|
68
|
+
## Prompt Caching (Anthropic)
|
|
69
|
+
|
|
70
|
+
- Apply `cache_control` breakpoints to long system prompts, tool definitions, and large RAG context blocks above 1024 tokens (the Claude Opus/Sonnet 4.x minimum cache size; 2048 tokens for Haiku 4.x).
|
|
71
|
+
- Up to 4 breakpoints per request, longest-TTL-first.
|
|
72
|
+
- 5-minute TTL costs 1.25x the standard write rate; 1-hour TTL costs 2x the standard write rate. Reads are 0.1x base.
|
|
73
|
+
- Track cache_hit ratio per feature; <30% hit ratio on a stable prompt is a sign the prefix is changing — investigate before next deploy.
|
|
74
|
+
|
|
75
|
+
## OpenAI Prompt Caching
|
|
76
|
+
|
|
77
|
+
- Automatic for requests over 1024 tokens with a stable deterministic prefix; ~50% discount on cached input tokens.
|
|
78
|
+
- Cache duration documented by OpenAI as 5-10 minutes idle, up to 1 hour during off-peak.
|
|
79
|
+
- Order the request prefix deterministically — same system prompt, same tool definitions, same retrieved-doc order — or the cache misses silently.
|
|
80
|
+
|
|
81
|
+
## Model Router and Fallback
|
|
82
|
+
|
|
83
|
+
Every LLM call wraps in a retry-with-decorrelated-jitter plus model-fallback chain:
|
|
84
|
+
|
|
85
|
+
1. **Primary** — production-quality model (e.g. Sonnet 4.7).
|
|
86
|
+
2. **Secondary** — faster/cheaper model (e.g. Haiku 4.5, GPT-5-mini, or a local Llama variant).
|
|
87
|
+
3. **Static fallback** — cached response from a similar prior request or a canned templated reply that names the failure ("Search temporarily unavailable — retry in a minute").
|
|
88
|
+
|
|
89
|
+
Cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the circuit-breaker + retry-with-decorrelated-jitter primitives the chain reuses. Each fallback path has its own eval suite — a silent quality cliff between primary and secondary is a regression.
|
|
90
|
+
|
|
91
|
+
## Hallucination and Groundedness as SLI
|
|
92
|
+
|
|
93
|
+
Track per feature and treat as service-level indicators with explicit SLO targets:
|
|
94
|
+
|
|
95
|
+
- `ai.hallucination_rate` — % of responses producing claims not present in retrieved sources. SLO: <5% on the golden set.
|
|
96
|
+
- `ai.citation_precision` — % of citations whose source span verifiably contains the cited claim. SLO: >95% on retrieved claims.
|
|
97
|
+
- `ai.refusal_rate` — overall refusal-rate; track false-positive refusals separately.
|
|
98
|
+
- `ai.groundedness_score` — average RAGAS faithfulness across the golden set. SLO: >0.85.
|
|
99
|
+
|
|
100
|
+
Cross-reference `rules/hatch3r-observability-metrics.md` for the SLI/SLO authoring template.
|
|
101
|
+
|
|
102
|
+
## Safety and Red-Team
|
|
103
|
+
|
|
104
|
+
Every feature exercising user-controlled prompts runs adversarial eval on a schedule (weekly minimum, on every prompt-version bump):
|
|
105
|
+
|
|
106
|
+
- **Garak** — open-source jailbreak / prompt-injection / PII-leakage probe suite.
|
|
107
|
+
- **PyRIT** — Microsoft red-team framework, scenario-driven.
|
|
108
|
+
- **Inspect-redteam** — UK AISI safety eval modules.
|
|
109
|
+
|
|
110
|
+
Block release on regression. PII-leakage tests use a synthetic-PII corpus (never production PII). Prompt-injection tests cover OWASP Top 10 for Agentic Applications 2026.
|
|
111
|
+
|
|
112
|
+
## Tool-Use Evals
|
|
113
|
+
|
|
114
|
+
When the AI feature uses tools (per the companion UX rule), the eval suite covers:
|
|
115
|
+
|
|
116
|
+
- Tool-selection accuracy — correct tool chosen for each input class.
|
|
117
|
+
- Argument validity — emitted args satisfy the tool schema (Zod, JSON Schema, Pydantic).
|
|
118
|
+
- Chain correctness — multi-tool plans reach the goal with the minimum required step count plus a 20% tolerance.
|
|
119
|
+
- Latency budget — p99 tool-execution time within the budget from `rules/hatch3r-performance-budgets.md`.
|
|
120
|
+
|
|
121
|
+
Methodology aligned with **BFCL v4** (Berkeley Function Calling Leaderboard) and **tau-bench** (multi-turn tool-use benchmark).
|
|
122
|
+
|
|
123
|
+
## OpenTelemetry GenAI Semantic Conventions
|
|
124
|
+
|
|
125
|
+
Every LLM call emits an OpenTelemetry span named `gen_ai.<operation>` with the attributes prescribed by the OpenTelemetry GenAI semantic conventions: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.temperature`, `gen_ai.tool.name` (when tools used). Cross-reference Slice 2 observability rules for the broader span taxonomy.
|
|
126
|
+
|
|
127
|
+
## User-Feedback Loop
|
|
128
|
+
|
|
129
|
+
- Every AI response surface emits a thumbs-up/down control wired to a feedback queue.
|
|
130
|
+
- A monthly triage job promotes thumbs-down examples into regression-test fixtures in `evals/<feature>/edge.jsonl`.
|
|
131
|
+
- Promotion is a manual review step — never auto-promote raw user feedback (it contains noise and adversarial labels).
|
|
132
|
+
|
|
133
|
+
## Audit Logging
|
|
134
|
+
|
|
135
|
+
- LLM inputs + outputs logged to a compliance store separate from APM with a 30-90 day retention window (configurable per data-classification policy).
|
|
136
|
+
- PII redaction before persistence — same redaction primitive as the framework's data-classification pipeline.
|
|
137
|
+
- Reproducibility key: `model` + `prompt_hash` + `seed` (when the model exposes a seed parameter) + `temperature`. Without all four, the response is non-reproducible.
|
|
138
|
+
|
|
139
|
+
## Eval-Driven Development Workflow
|
|
140
|
+
|
|
141
|
+
Write eval before prompt, measure baseline, write prompt, measure delta, iterate until eval threshold passes. Cross-reference `skills/hatch3r-ai-feature/SKILL.md` for the step-by-step workflow.
|
|
142
|
+
|
|
143
|
+
## References
|
|
144
|
+
|
|
145
|
+
- promptfoo — `promptfoo.dev`
|
|
146
|
+
- DeepEval — `github.com/confident-ai/deepeval`
|
|
147
|
+
- RAGAS — `docs.ragas.io`
|
|
148
|
+
- Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
|
|
149
|
+
- Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
|
|
150
|
+
- OpenAI prompt caching guide — `platform.openai.com/docs/guides/prompt-caching`
|
|
151
|
+
- OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
|
|
152
|
+
- Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`
|
|
153
|
+
- tau-bench — `github.com/sierra-research/tau-bench`
|
|
154
|
+
- OWASP Top 10 for LLM Applications (Agentic 2026) — `genai.owasp.org`
|