hatch3r 1.7.1 → 1.7.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (150) hide show
  1. package/README.md +37 -11
  2. package/agents/hatch3r-a11y-auditor.md +4 -0
  3. package/agents/hatch3r-architect.md +4 -0
  4. package/agents/hatch3r-ci-watcher.md +4 -0
  5. package/agents/hatch3r-context-rules.md +4 -0
  6. package/agents/hatch3r-creator.md +4 -0
  7. package/agents/hatch3r-dependency-auditor.md +4 -0
  8. package/agents/hatch3r-devops.md +4 -0
  9. package/agents/hatch3r-docs-writer.md +4 -0
  10. package/agents/hatch3r-fixer.md +4 -0
  11. package/agents/hatch3r-handoff-loader.md +243 -0
  12. package/agents/hatch3r-handoff-preparer.md +134 -0
  13. package/agents/hatch3r-implementer.md +4 -0
  14. package/agents/hatch3r-learnings-loader.md +4 -0
  15. package/agents/hatch3r-lint-fixer.md +4 -0
  16. package/agents/hatch3r-perf-profiler.md +8 -0
  17. package/agents/hatch3r-researcher.md +4 -0
  18. package/agents/hatch3r-reviewer.md +92 -0
  19. package/agents/hatch3r-security-auditor.md +24 -0
  20. package/agents/hatch3r-test-writer.md +4 -0
  21. package/agents/modes/requirements-elicitation.md +4 -1
  22. package/agents/modes/similar-implementation.md +6 -0
  23. package/agents/modes/user-flows.md +76 -0
  24. package/agents/shared/quality-charter.md +128 -0
  25. package/commands/hatch3r-board-fill.md +1 -0
  26. package/commands/hatch3r-create.md +2 -0
  27. package/commands/hatch3r-handoff.md +126 -0
  28. package/commands/hatch3r-pr-resolve.md +4 -0
  29. package/commands/hatch3r-quick-change.md +4 -2
  30. package/commands/hatch3r-workflow.md +2 -0
  31. package/dist/cli/index.js +2321 -460
  32. package/dist/cli/index.js.map +1 -1
  33. package/package.json +4 -2
  34. package/rules/hatch3r-accessibility-standards.md +21 -0
  35. package/rules/hatch3r-accessibility-standards.mdc +21 -0
  36. package/rules/hatch3r-agent-orchestration.md +9 -1
  37. package/rules/hatch3r-agent-orchestration.mdc +9 -1
  38. package/rules/hatch3r-ai-evals.md +158 -0
  39. package/rules/hatch3r-ai-evals.mdc +154 -0
  40. package/rules/hatch3r-ai-ux-patterns.md +131 -0
  41. package/rules/hatch3r-ai-ux-patterns.mdc +127 -0
  42. package/rules/hatch3r-api-design.md +67 -9
  43. package/rules/hatch3r-api-design.mdc +67 -9
  44. package/rules/hatch3r-api-versioning.md +119 -0
  45. package/rules/hatch3r-api-versioning.mdc +115 -0
  46. package/rules/hatch3r-auth-patterns.md +170 -0
  47. package/rules/hatch3r-auth-patterns.mdc +166 -0
  48. package/rules/hatch3r-component-conventions.md +30 -0
  49. package/rules/hatch3r-component-conventions.mdc +30 -0
  50. package/rules/hatch3r-container-hardening.md +131 -0
  51. package/rules/hatch3r-container-hardening.mdc +127 -0
  52. package/rules/hatch3r-contract-testing.md +117 -0
  53. package/rules/hatch3r-contract-testing.mdc +113 -0
  54. package/rules/hatch3r-deep-context.md +2 -0
  55. package/rules/hatch3r-deep-context.mdc +2 -0
  56. package/rules/hatch3r-dependency-management.md +73 -1
  57. package/rules/hatch3r-dependency-management.mdc +72 -0
  58. package/rules/hatch3r-design-system-detection.md +142 -0
  59. package/rules/hatch3r-design-system-detection.mdc +138 -0
  60. package/rules/hatch3r-event-schema-evolution.md +90 -0
  61. package/rules/hatch3r-event-schema-evolution.mdc +86 -0
  62. package/rules/hatch3r-handoff-readiness.md +45 -0
  63. package/rules/hatch3r-handoff-readiness.mdc +40 -0
  64. package/rules/hatch3r-i18n.md +13 -0
  65. package/rules/hatch3r-i18n.mdc +13 -0
  66. package/rules/hatch3r-migrations.md +61 -16
  67. package/rules/hatch3r-migrations.mdc +61 -16
  68. package/rules/hatch3r-observability-logging.md +1 -1
  69. package/rules/hatch3r-observability-logging.mdc +1 -1
  70. package/rules/hatch3r-observability-metrics.md +1 -1
  71. package/rules/hatch3r-observability-metrics.mdc +1 -1
  72. package/rules/hatch3r-observability-tracing-detail.md +1 -1
  73. package/rules/hatch3r-observability-tracing-detail.mdc +1 -1
  74. package/rules/hatch3r-observability-tracing.md +1 -1
  75. package/rules/hatch3r-observability-tracing.mdc +1 -1
  76. package/rules/hatch3r-observability.md +1 -0
  77. package/rules/hatch3r-observability.mdc +1 -0
  78. package/rules/hatch3r-operability.md +149 -0
  79. package/rules/hatch3r-operability.mdc +145 -0
  80. package/rules/hatch3r-passkey-server.md +181 -0
  81. package/rules/hatch3r-passkey-server.mdc +177 -0
  82. package/rules/hatch3r-progressive-delivery.md +120 -0
  83. package/rules/hatch3r-progressive-delivery.mdc +116 -0
  84. package/rules/hatch3r-resilience-patterns.md +154 -0
  85. package/rules/hatch3r-resilience-patterns.mdc +150 -0
  86. package/rules/hatch3r-secrets-management.md +29 -0
  87. package/rules/hatch3r-secrets-management.mdc +29 -0
  88. package/rules/hatch3r-testing.md +139 -43
  89. package/rules/hatch3r-testing.mdc +139 -43
  90. package/rules/hatch3r-ux-states-and-flows.md +149 -0
  91. package/rules/hatch3r-ux-states-and-flows.mdc +145 -0
  92. package/skills/hatch3r-a11y-audit/SKILL.md +14 -0
  93. package/skills/hatch3r-ai-feature/SKILL.md +134 -0
  94. package/skills/hatch3r-api-spec/SKILL.md +5 -0
  95. package/skills/hatch3r-architecture-review/SKILL.md +14 -0
  96. package/skills/hatch3r-bug-fix/SKILL.md +5 -0
  97. package/skills/hatch3r-ci-pipeline/SKILL.md +14 -0
  98. package/skills/hatch3r-cli-aichat/SKILL.md +84 -0
  99. package/skills/hatch3r-cli-ast-grep/SKILL.md +85 -0
  100. package/skills/hatch3r-cli-az-devops/SKILL.md +89 -0
  101. package/skills/hatch3r-cli-bat/SKILL.md +85 -0
  102. package/skills/hatch3r-cli-comby/SKILL.md +85 -0
  103. package/skills/hatch3r-cli-csvkit/SKILL.md +84 -0
  104. package/skills/hatch3r-cli-delta/SKILL.md +86 -0
  105. package/skills/hatch3r-cli-difftastic/SKILL.md +84 -0
  106. package/skills/hatch3r-cli-docker/SKILL.md +89 -0
  107. package/skills/hatch3r-cli-duckdb/SKILL.md +84 -0
  108. package/skills/hatch3r-cli-fd/SKILL.md +85 -0
  109. package/skills/hatch3r-cli-fzf/SKILL.md +84 -0
  110. package/skills/hatch3r-cli-gh/SKILL.md +90 -0
  111. package/skills/hatch3r-cli-glab/SKILL.md +89 -0
  112. package/skills/hatch3r-cli-jq/SKILL.md +85 -0
  113. package/skills/hatch3r-cli-lazygit/SKILL.md +78 -0
  114. package/skills/hatch3r-cli-llm/SKILL.md +84 -0
  115. package/skills/hatch3r-cli-miller/SKILL.md +84 -0
  116. package/skills/hatch3r-cli-mods/SKILL.md +84 -0
  117. package/skills/hatch3r-cli-overview/SKILL.md +60 -0
  118. package/skills/hatch3r-cli-playwright/SKILL.md +89 -0
  119. package/skills/hatch3r-cli-podman/SKILL.md +84 -0
  120. package/skills/hatch3r-cli-ripgrep/SKILL.md +85 -0
  121. package/skills/hatch3r-cli-rtk/SKILL.md +91 -0
  122. package/skills/hatch3r-cli-sd/SKILL.md +85 -0
  123. package/skills/hatch3r-cli-stagehand/SKILL.md +79 -0
  124. package/skills/hatch3r-cli-taplo/SKILL.md +84 -0
  125. package/skills/hatch3r-cli-xsv/SKILL.md +89 -0
  126. package/skills/hatch3r-cli-yq/SKILL.md +85 -0
  127. package/skills/hatch3r-cli-zstd/SKILL.md +85 -0
  128. package/skills/hatch3r-context-health/SKILL.md +14 -0
  129. package/skills/hatch3r-cost-tracking/SKILL.md +14 -0
  130. package/skills/hatch3r-customize/SKILL.md +14 -0
  131. package/skills/hatch3r-dep-audit/SKILL.md +14 -0
  132. package/skills/hatch3r-design-system-detect/SKILL.md +162 -0
  133. package/skills/hatch3r-feature/SKILL.md +2 -0
  134. package/skills/hatch3r-gh-agentic-workflows/SKILL.md +13 -0
  135. package/skills/hatch3r-handoff-prepare/SKILL.md +160 -0
  136. package/skills/hatch3r-handoff-resume/SKILL.md +171 -0
  137. package/skills/hatch3r-incident-response/SKILL.md +14 -0
  138. package/skills/hatch3r-issue-workflow/SKILL.md +5 -0
  139. package/skills/hatch3r-logical-refactor/SKILL.md +14 -0
  140. package/skills/hatch3r-migration/SKILL.md +14 -0
  141. package/skills/hatch3r-observability-verify/SKILL.md +133 -0
  142. package/skills/hatch3r-perf-audit/SKILL.md +14 -0
  143. package/skills/hatch3r-pr-creation/SKILL.md +14 -0
  144. package/skills/hatch3r-qa-validation/SKILL.md +18 -0
  145. package/skills/hatch3r-recipe/SKILL.md +14 -0
  146. package/skills/hatch3r-refactor/SKILL.md +14 -0
  147. package/skills/hatch3r-release/SKILL.md +14 -0
  148. package/skills/hatch3r-reliability-verify/SKILL.md +144 -0
  149. package/skills/hatch3r-ui-ux-verify/SKILL.md +136 -0
  150. package/skills/hatch3r-visual-refactor/SKILL.md +15 -1
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "hatch3r",
3
- "version": "1.7.1",
3
+ "version": "1.7.5",
4
4
  "description": "Battle-tested agentic coding setup framework. One command to hatch your agent stack -- agents, skills, rules, commands, and MCP for every major AI coding tool.",
5
5
  "type": "module",
6
6
  "exports": {
@@ -24,7 +24,9 @@
24
24
  "inventory:check-docs": "tsx scripts/inventory.ts --check-docs",
25
25
  "validate:rule-parity": "tsx scripts/validate-rule-parity.ts",
26
26
  "validate:efficiency": "tsx scripts/validate-efficiency-invariants.ts",
27
- "validate": "npm run validate:rule-parity && npm run validate:efficiency",
27
+ "validate:cli-skills": "tsx scripts/validate-cli-skills.ts",
28
+ "generate:cli-skills": "tsx scripts/generate-cli-skills.ts",
29
+ "validate": "npm run validate:rule-parity && npm run validate:efficiency && npm run validate:cli-skills",
28
30
  "audit:validate-registry": "tsx scripts/validate-finding-registry.ts",
29
31
  "audit:migrate": "tsx scripts/migrate-finding-registry.ts",
30
32
  "audit:archive": "tsx scripts/audit-archive.ts",
@@ -78,3 +78,24 @@ All user-facing features must meet WCAG 2.2 Level AA conformance. This is the ba
78
78
  - Run automated accessibility checks (axe-core, Lighthouse) in CI.
79
79
  - Automated tools catch ~30% of accessibility issues. Manual testing is required for keyboard flows, screen reader experience, and cognitive accessibility.
80
80
  - Maintain an accessibility test checklist per component type (form, modal, navigation, data table).
81
+
82
+ ## WCAG 2.2 New Success Criteria (Mandatory Audit Items)
83
+
84
+ WCAG 2.2 (W3C Recommendation, October 2023) added nine Success Criteria; three apply to most UI surfaces and are listed here as required audit items.
85
+
86
+ - **SC 2.5.8 Target Size (Minimum) — AA:** every pointer-target's hit area is at least 24 by 24 CSS pixels. Smaller targets are permitted only when surrounded by ≥24px of spacing (the spacing exception) or when the larger inline target is also reachable (the inline exception). Test densest UI elements (sidebar collapse, table-row checkboxes, icon-only toolbar buttons).
87
+ - **SC 2.4.11 Focus Not Obscured (Minimum) — AA:** when a control receives keyboard focus, the focused element must not be entirely hidden by author-created content. Common violations: sticky headers, persistent cookie banners, chatbot launcher widgets, fixed action bars. Mitigate with `scroll-margin-top` on focus targets equal to the sticky header height, or by collapsing sticky chrome on focus.
88
+ - **SC 2.5.7 Dragging Movements — AA:** any drag operation (reorder, slider thumb drag, map pan, kanban card move) must offer a single-pointer non-drag alternative. Examples: list reorder with up/down arrow buttons; slider with numeric input or +/- buttons; map pan with arrow-key navigation.
89
+
90
+ Reference: https://www.w3.org/TR/WCAG22/
91
+
92
+ ## Mobile and Touch Accessibility
93
+
94
+ Touch surfaces have stricter target and spacing requirements than pointer-only surfaces; native platform guidance supersedes WCAG 2.5.8 on touch.
95
+
96
+ - Touch targets: 44 by 44 points on iOS (Apple Human Interface Guidelines), 48 by 48 dp on Android (Material Design 3). These supersede WCAG 2.5.8's 24-pixel minimum on touch-only surfaces.
97
+ - Spacing between interactive elements: ≥8 pixels of separation to prevent mis-taps; ≥4 mm on physical-density screens.
98
+ - Avoid tap-to-reveal patterns (hover tooltips, pure-hover dropdowns) — touch has no hover state. Replace with permanent visible labels or with long-press that has a visible affordance and announces itself to screen readers.
99
+ - Apply `env(safe-area-inset-*)` padding on full-bleed surfaces so content clears notches, home indicators, and rounded corners on iOS and Android edge devices.
100
+ - Support Dynamic Type (iOS) and rem-based font scaling — declare body text in `rem` or `em` units, never `px`, so OS-level font size settings cascade.
101
+ - Zoom to 200% and 400% (per WCAG 1.4.4 and 1.4.10 Reflow) must remain functional with no horizontal scroll trap. Audit for `width: 100vw` and fixed pixel widths that break reflow.
@@ -74,3 +74,24 @@ All user-facing features must meet WCAG 2.2 Level AA conformance. This is the ba
74
74
  - Run automated accessibility checks (axe-core, Lighthouse) in CI.
75
75
  - Automated tools catch ~30% of accessibility issues. Manual testing is required for keyboard flows, screen reader experience, and cognitive accessibility.
76
76
  - Maintain an accessibility test checklist per component type (form, modal, navigation, data table).
77
+
78
+ ## WCAG 2.2 New Success Criteria (Mandatory Audit Items)
79
+
80
+ WCAG 2.2 (W3C Recommendation, October 2023) added nine Success Criteria; three apply to most UI surfaces and are listed here as required audit items.
81
+
82
+ - **SC 2.5.8 Target Size (Minimum) — AA:** every pointer-target's hit area is at least 24 by 24 CSS pixels. Smaller targets are permitted only when surrounded by ≥24px of spacing (the spacing exception) or when the larger inline target is also reachable (the inline exception). Test densest UI elements (sidebar collapse, table-row checkboxes, icon-only toolbar buttons).
83
+ - **SC 2.4.11 Focus Not Obscured (Minimum) — AA:** when a control receives keyboard focus, the focused element must not be entirely hidden by author-created content. Common violations: sticky headers, persistent cookie banners, chatbot launcher widgets, fixed action bars. Mitigate with `scroll-margin-top` on focus targets equal to the sticky header height, or by collapsing sticky chrome on focus.
84
+ - **SC 2.5.7 Dragging Movements — AA:** any drag operation (reorder, slider thumb drag, map pan, kanban card move) must offer a single-pointer non-drag alternative. Examples: list reorder with up/down arrow buttons; slider with numeric input or +/- buttons; map pan with arrow-key navigation.
85
+
86
+ Reference: https://www.w3.org/TR/WCAG22/
87
+
88
+ ## Mobile and Touch Accessibility
89
+
90
+ Touch surfaces have stricter target and spacing requirements than pointer-only surfaces; native platform guidance supersedes WCAG 2.5.8 on touch.
91
+
92
+ - Touch targets: 44 by 44 points on iOS (Apple Human Interface Guidelines), 48 by 48 dp on Android (Material Design 3). These supersede WCAG 2.5.8's 24-pixel minimum on touch-only surfaces.
93
+ - Spacing between interactive elements: ≥8 pixels of separation to prevent mis-taps; ≥4 mm on physical-density screens.
94
+ - Avoid tap-to-reveal patterns (hover tooltips, pure-hover dropdowns) — touch has no hover state. Replace with permanent visible labels or with long-press that has a visible affordance and announces itself to screen readers.
95
+ - Apply `env(safe-area-inset-*)` padding on full-bleed surfaces so content clears notches, home indicators, and rounded corners on iOS and Android edge devices.
96
+ - Support Dynamic Type (iOS) and rem-based font scaling — declare body text in `rem` or `em` units, never `px`, so OS-level font size settings cascade.
97
+ - Zoom to 200% and 400% (per WCAG 1.4.4 and 1.4.10 Reflow) must remain functional with no horizontal scroll trap. Audit for `width: 100vw` and fixed pixel widths that break reflow.
@@ -48,7 +48,7 @@ Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatc
48
48
 
49
49
  Score task complexity per the `hatch3r-deep-context` rule before Phase 1. Apply the resulting tier:
50
50
 
51
- - **Tier 2 (Standard):** Present elicitation questions inline. Await answers before Phase 2.
51
+ - **Tier 2 hard gate (B1).** Before Phase 2, run `hatch3r-researcher` with `requirements-elicitation:quick` mode to detect ambiguity and ask the user via the platform-native question tool (per `agents/shared/user-question-protocol.md`). Orchestrator awaits answers and integrates them into the Phase 1 brief; do not begin Phase 2 with unresolved questions. Tier 1 is exempt only when scope is single-file, single-concern, and acceptance is testable from the user message alone.
52
52
  - **Tier 3 (Deep):** Present Pre-Implementation Summary and ASK for confirmation. Do NOT proceed until all unresolved questions are answered.
53
53
 
54
54
  ## Mandatory Delegation Directives
@@ -253,6 +253,14 @@ ALL three must hold:
253
253
 
254
254
  **Default:** When in doubt, serialize. For typical hatch3r tasks (1–5 sub-tasks) the DAG-scheduling overhead often outweighs concurrency gain.
255
255
 
256
+ ### Cost-Dominance Principle
257
+
258
+ Token cost of sub-agent invocation never justifies serialization of independent work. The three safety conditions (read-only or disjoint writes, deterministic aggregation, no shared mutable state) govern WHEN parallelism is safe; cost does not govern WHETHER to parallelize. When in doubt, fan out. Serialization is only valid on true dependency edges.
259
+
260
+ ### Scaling Heuristic
261
+
262
+ Sub-agent count tracks task decomposition: N independent modules → N parallel Phase-2 implementers; M specialist gates → M parallel Phase-4 specialists; K independent research questions → K parallel researcher modes. Orchestrators emit `sub_agents_spawned: {count, rationale}` in their structured output.
263
+
256
264
  ## Cross-Phase Error Propagation
257
265
 
258
266
  When a phase produces a non-SUCCESS status, the orchestrator must propagate error context to downstream phases rather than silently dropping it:
@@ -43,7 +43,7 @@ Every task MUST follow this four-phase pipeline: **Phase 1 — Research** (`hatc
43
43
 
44
44
  Score task complexity per the `hatch3r-deep-context` rule before Phase 1. Apply the resulting tier:
45
45
 
46
- - **Tier 2 (Standard):** Present elicitation questions inline. Await answers before Phase 2.
46
+ - **Tier 2 hard gate (B1).** Before Phase 2, run `hatch3r-researcher` with `requirements-elicitation:quick` mode to detect ambiguity and ask the user via the platform-native question tool (per `agents/shared/user-question-protocol.md`). Orchestrator awaits answers and integrates them into the Phase 1 brief; do not begin Phase 2 with unresolved questions. Tier 1 is exempt only when scope is single-file, single-concern, and acceptance is testable from the user message alone.
47
47
  - **Tier 3 (Deep):** Present Pre-Implementation Summary and ASK for confirmation. Do NOT proceed until all unresolved questions are answered.
48
48
 
49
49
  ## Mandatory Delegation Directives
@@ -248,6 +248,14 @@ ALL three must hold:
248
248
 
249
249
  **Default:** When in doubt, serialize. For typical hatch3r tasks (1–5 sub-tasks) the DAG-scheduling overhead often outweighs concurrency gain.
250
250
 
251
+ ### Cost-Dominance Principle
252
+
253
+ Token cost of sub-agent invocation never justifies serialization of independent work. The three safety conditions (read-only or disjoint writes, deterministic aggregation, no shared mutable state) govern WHEN parallelism is safe; cost does not govern WHETHER to parallelize. When in doubt, fan out. Serialization is only valid on true dependency edges.
254
+
255
+ ### Scaling Heuristic
256
+
257
+ Sub-agent count tracks task decomposition: N independent modules → N parallel Phase-2 implementers; M specialist gates → M parallel Phase-4 specialists; K independent research questions → K parallel researcher modes. Orchestrators emit `sub_agents_spawned: {count, rationale}` in their structured output.
258
+
251
259
  ## Cross-Phase Error Propagation
252
260
 
253
261
  When a phase produces a non-SUCCESS status, the orchestrator must propagate error context to downstream phases rather than silently dropping it:
@@ -0,0 +1,158 @@
1
+ ---
2
+ id: hatch3r-ai-evals
3
+ type: rule
4
+ description: AI feature evaluation, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI for end-user projects shipping LLM features
5
+ scope: "**/ai/**,**/llm/**,**/chat/**,**/assistant/**,**/agents/**,**/copilot/**,**/evals/**,**/prompts/**,**/rag/**"
6
+ tags: [review, implementation, ai]
7
+ quality_charter: agents/shared/quality-charter.md
8
+ cache_friendly: true
9
+ ---
10
+ # AI Feature Evaluation and Cost Governance (2026)
11
+
12
+ ## Scope
13
+
14
+ This rule governs the BACKEND half of an LLM-driven feature: eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, hallucination-as-SLI, tool-use evals, safety/red-team, and audit logging. The FRONTEND half (streaming UI, tool-call cards, human-approval gates, cancel/abort/undo, citations) is the paired companion rule `rules/hatch3r-ai-ux-patterns.md`. Apply both rules for any LLM-driven feature — UI-only or backend-only coverage is a regression.
15
+
16
+ Detection mirrors the companion rule: this rule activates when the project imports an LLM SDK (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`, `cohere-ai`, `ai`, `@ai-sdk/*`, `langchain`, `llama-index`) or contains files under `ai/`, `llm/`, `chat/`, `assistant/`, `agents/`, `copilot/`, `evals/`, `prompts/`, `rag/`.
17
+
18
+ ## Eval Harness Mandate
19
+
20
+ Every AI feature has an automated eval harness committed to the repo before the feature ships. Hand-rolled "ask the model and eyeball the answer" is a regression in 2026.
21
+
22
+ Pick one tool by task class:
23
+
24
+ - **promptfoo** — broad coverage, declarative YAML, model-comparison defaults
25
+ - **DeepEval** — pytest-style assertions for CI gate integration
26
+ - **RAGAS** — retrieval-augmented generation metrics (context_precision, context_recall, faithfulness, answer_relevance)
27
+ - **Inspect** — UK AISI framework for safety and agentic evals
28
+ - **braintrust** — SaaS + OSS hybrid, run history retained per prompt version
29
+ - **TruLens** — observability-coupled, runs evals against live traces
30
+ - **Arize Phoenix** — open-source observability with eval modules
31
+
32
+ Document the chosen tool in `evals/README.md` so the agent picks the same tool on every future change.
33
+
34
+ ## Golden Dataset Versioning
35
+
36
+ - Eval cases live in repo at `evals/<feature>/golden.{jsonl,csv}`.
37
+ - Minimum 20 cases per feature; one input plus one expected output (or graded rubric) per row.
38
+ - Dataset version in the filename (`golden.v3.jsonl`) — never overwrite v2 in place.
39
+ - Ground truth refreshed quarterly; refresh PR includes a diff of changed expected outputs and the rationale.
40
+ - Hand-curated edge cases (failure modes the model has historically tripped on) live in `evals/<feature>/edge.jsonl` alongside the golden set.
41
+
42
+ ## Eval Metrics
43
+
44
+ Match the metric to the task class:
45
+
46
+ - Classification → exact-match accuracy + per-class precision/recall.
47
+ - Open-ended generation → rubric-scored LLM-as-judge with a 50-example calibration set sanity-checking judge agreement against human labels; recompute calibration every model change.
48
+ - Retrieval/RAG → RAGAS metrics: context_precision, context_recall, faithfulness, answer_relevance.
49
+ - Pairwise comparison → win-rate against the prior prompt version (>=55% required to adopt).
50
+ - Refusal calibration → refusal-rate as an explicit SLI (refusals on prohibited inputs vs false-positive refusals on benign inputs).
51
+
52
+ ## Regression Gating
53
+
54
+ - Every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**` runs the eval set in CI.
55
+ - PR blocked when any metric drops below the per-feature threshold defined in `evals/<feature>/thresholds.json`.
56
+ - Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) runs the full eval with a 5% accuracy budget; cross over 5% requires a sign-off comment from a named reviewer and a 24-hour canary at 5% traffic.
57
+ - Eval failure is treated as a test failure — never an advisory warning.
58
+
59
+ ## Prompt Versioning
60
+
61
+ - Prompts are first-class artifacts at `prompts/<feature>/v<N>.{md,txt}` with an SHA-256 hash recorded in the eval thresholds file.
62
+ - Runtime emits `prompt_version` + `prompt_hash` per request as log + span attribute.
63
+ - A/B framework supports concurrent versions behind a flag — traffic split recorded per request so eval delta is computable.
64
+ - Hash drift between repo prompt and deployed prompt is a P0 incident.
65
+
66
+ ## Cost Telemetry per Request
67
+
68
+ Every LLM call logs: `tokens_in`, `tokens_out`, `cache_hit` (boolean + cached_tokens count), `model`, `cost_usd`, `latency_ms`, `cost_center` (feature ID), `prompt_version`, `prompt_hash`, `user_id_hash`.
69
+
70
+ Aggregate dashboards in the observability stack — cross-reference `rules/hatch3r-observability-metrics.md` and `rules/hatch3r-observability-tracing-detail.md` for the SLI/SLO vocabulary, and `skills/hatch3r-observability-verify` for the wiring checklist. Per-feature budget alerts fire at 50%, 75%, and 90% of monthly budget; abuse-detection alert at 10x user p99 cost over a 1-hour window.
71
+
72
+ ## Prompt Caching (Anthropic)
73
+
74
+ - Apply `cache_control` breakpoints to long system prompts, tool definitions, and large RAG context blocks above 1024 tokens (the Claude Opus/Sonnet 4.x minimum cache size; 2048 tokens for Haiku 4.x).
75
+ - Up to 4 breakpoints per request, longest-TTL-first.
76
+ - 5-minute TTL costs 1.25x the standard write rate; 1-hour TTL costs 2x the standard write rate. Reads are 0.1x base.
77
+ - Track cache_hit ratio per feature; <30% hit ratio on a stable prompt is a sign the prefix is changing — investigate before next deploy.
78
+
79
+ ## OpenAI Prompt Caching
80
+
81
+ - Automatic for requests over 1024 tokens with a stable deterministic prefix; ~50% discount on cached input tokens.
82
+ - Cache duration documented by OpenAI as 5-10 minutes idle, up to 1 hour during off-peak.
83
+ - Order the request prefix deterministically — same system prompt, same tool definitions, same retrieved-doc order — or the cache misses silently.
84
+
85
+ ## Model Router and Fallback
86
+
87
+ Every LLM call wraps in a retry-with-decorrelated-jitter plus model-fallback chain:
88
+
89
+ 1. **Primary** — production-quality model (e.g. Sonnet 4.7).
90
+ 2. **Secondary** — faster/cheaper model (e.g. Haiku 4.5, GPT-5-mini, or a local Llama variant).
91
+ 3. **Static fallback** — cached response from a similar prior request or a canned templated reply that names the failure ("Search temporarily unavailable — retry in a minute").
92
+
93
+ Cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the circuit-breaker + retry-with-decorrelated-jitter primitives the chain reuses. Each fallback path has its own eval suite — a silent quality cliff between primary and secondary is a regression.
94
+
95
+ ## Hallucination and Groundedness as SLI
96
+
97
+ Track per feature and treat as service-level indicators with explicit SLO targets:
98
+
99
+ - `ai.hallucination_rate` — % of responses producing claims not present in retrieved sources. SLO: <5% on the golden set.
100
+ - `ai.citation_precision` — % of citations whose source span verifiably contains the cited claim. SLO: >95% on retrieved claims.
101
+ - `ai.refusal_rate` — overall refusal-rate; track false-positive refusals separately.
102
+ - `ai.groundedness_score` — average RAGAS faithfulness across the golden set. SLO: >0.85.
103
+
104
+ Cross-reference `rules/hatch3r-observability-metrics.md` for the SLI/SLO authoring template.
105
+
106
+ ## Safety and Red-Team
107
+
108
+ Every feature exercising user-controlled prompts runs adversarial eval on a schedule (weekly minimum, on every prompt-version bump):
109
+
110
+ - **Garak** — open-source jailbreak / prompt-injection / PII-leakage probe suite.
111
+ - **PyRIT** — Microsoft red-team framework, scenario-driven.
112
+ - **Inspect-redteam** — UK AISI safety eval modules.
113
+
114
+ Block release on regression. PII-leakage tests use a synthetic-PII corpus (never production PII). Prompt-injection tests cover OWASP Top 10 for Agentic Applications 2026.
115
+
116
+ ## Tool-Use Evals
117
+
118
+ When the AI feature uses tools (per the companion UX rule), the eval suite covers:
119
+
120
+ - Tool-selection accuracy — correct tool chosen for each input class.
121
+ - Argument validity — emitted args satisfy the tool schema (Zod, JSON Schema, Pydantic).
122
+ - Chain correctness — multi-tool plans reach the goal with the minimum required step count plus a 20% tolerance.
123
+ - Latency budget — p99 tool-execution time within the budget from `rules/hatch3r-performance-budgets.md`.
124
+
125
+ Methodology aligned with **BFCL v4** (Berkeley Function Calling Leaderboard) and **tau-bench** (multi-turn tool-use benchmark).
126
+
127
+ ## OpenTelemetry GenAI Semantic Conventions
128
+
129
+ Every LLM call emits an OpenTelemetry span named `gen_ai.<operation>` with the attributes prescribed by the OpenTelemetry GenAI semantic conventions: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.temperature`, `gen_ai.tool.name` (when tools used). Cross-reference Slice 2 observability rules for the broader span taxonomy.
130
+
131
+ ## User-Feedback Loop
132
+
133
+ - Every AI response surface emits a thumbs-up/down control wired to a feedback queue.
134
+ - A monthly triage job promotes thumbs-down examples into regression-test fixtures in `evals/<feature>/edge.jsonl`.
135
+ - Promotion is a manual review step — never auto-promote raw user feedback (it contains noise and adversarial labels).
136
+
137
+ ## Audit Logging
138
+
139
+ - LLM inputs + outputs logged to a compliance store separate from APM with a 30-90 day retention window (configurable per data-classification policy).
140
+ - PII redaction before persistence — same redaction primitive as the framework's data-classification pipeline.
141
+ - Reproducibility key: `model` + `prompt_hash` + `seed` (when the model exposes a seed parameter) + `temperature`. Without all four, the response is non-reproducible.
142
+
143
+ ## Eval-Driven Development Workflow
144
+
145
+ Write eval before prompt, measure baseline, write prompt, measure delta, iterate until eval threshold passes. Cross-reference `skills/hatch3r-ai-feature/SKILL.md` for the step-by-step workflow.
146
+
147
+ ## References
148
+
149
+ - promptfoo — `promptfoo.dev`
150
+ - DeepEval — `github.com/confident-ai/deepeval`
151
+ - RAGAS — `docs.ragas.io`
152
+ - Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
153
+ - Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
154
+ - OpenAI prompt caching guide — `platform.openai.com/docs/guides/prompt-caching`
155
+ - OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
156
+ - Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`
157
+ - tau-bench — `github.com/sierra-research/tau-bench`
158
+ - OWASP Top 10 for LLM Applications (Agentic 2026) — `genai.owasp.org`
@@ -0,0 +1,154 @@
1
+ ---
2
+ description: AI feature evaluation, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI for end-user projects shipping LLM features
3
+ globs: ["**/ai/**", "**/llm/**", "**/chat/**", "**/assistant/**", "**/agents/**", "**/copilot/**", "**/evals/**", "**/prompts/**", "**/rag/**"]
4
+ alwaysApply: false
5
+ ---
6
+ # AI Feature Evaluation and Cost Governance (2026)
7
+
8
+ ## Scope
9
+
10
+ This rule governs the BACKEND half of an LLM-driven feature: eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, hallucination-as-SLI, tool-use evals, safety/red-team, and audit logging. The FRONTEND half (streaming UI, tool-call cards, human-approval gates, cancel/abort/undo, citations) is the paired companion rule `rules/hatch3r-ai-ux-patterns.md`. Apply both rules for any LLM-driven feature — UI-only or backend-only coverage is a regression.
11
+
12
+ Detection mirrors the companion rule: this rule activates when the project imports an LLM SDK (`openai`, `@anthropic-ai/sdk`, `@google/generative-ai`, `cohere-ai`, `ai`, `@ai-sdk/*`, `langchain`, `llama-index`) or contains files under `ai/`, `llm/`, `chat/`, `assistant/`, `agents/`, `copilot/`, `evals/`, `prompts/`, `rag/`.
13
+
14
+ ## Eval Harness Mandate
15
+
16
+ Every AI feature has an automated eval harness committed to the repo before the feature ships. Hand-rolled "ask the model and eyeball the answer" is a regression in 2026.
17
+
18
+ Pick one tool by task class:
19
+
20
+ - **promptfoo** — broad coverage, declarative YAML, model-comparison defaults
21
+ - **DeepEval** — pytest-style assertions for CI gate integration
22
+ - **RAGAS** — retrieval-augmented generation metrics (context_precision, context_recall, faithfulness, answer_relevance)
23
+ - **Inspect** — UK AISI framework for safety and agentic evals
24
+ - **braintrust** — SaaS + OSS hybrid, run history retained per prompt version
25
+ - **TruLens** — observability-coupled, runs evals against live traces
26
+ - **Arize Phoenix** — open-source observability with eval modules
27
+
28
+ Document the chosen tool in `evals/README.md` so the agent picks the same tool on every future change.
29
+
30
+ ## Golden Dataset Versioning
31
+
32
+ - Eval cases live in repo at `evals/<feature>/golden.{jsonl,csv}`.
33
+ - Minimum 20 cases per feature; one input plus one expected output (or graded rubric) per row.
34
+ - Dataset version in the filename (`golden.v3.jsonl`) — never overwrite v2 in place.
35
+ - Ground truth refreshed quarterly; refresh PR includes a diff of changed expected outputs and the rationale.
36
+ - Hand-curated edge cases (failure modes the model has historically tripped on) live in `evals/<feature>/edge.jsonl` alongside the golden set.
37
+
38
+ ## Eval Metrics
39
+
40
+ Match the metric to the task class:
41
+
42
+ - Classification → exact-match accuracy + per-class precision/recall.
43
+ - Open-ended generation → rubric-scored LLM-as-judge with a 50-example calibration set sanity-checking judge agreement against human labels; recompute calibration every model change.
44
+ - Retrieval/RAG → RAGAS metrics: context_precision, context_recall, faithfulness, answer_relevance.
45
+ - Pairwise comparison → win-rate against the prior prompt version (>=55% required to adopt).
46
+ - Refusal calibration → refusal-rate as an explicit SLI (refusals on prohibited inputs vs false-positive refusals on benign inputs).
47
+
48
+ ## Regression Gating
49
+
50
+ - Every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**` runs the eval set in CI.
51
+ - PR blocked when any metric drops below the per-feature threshold defined in `evals/<feature>/thresholds.json`.
52
+ - Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) runs the full eval with a 5% accuracy budget; cross over 5% requires a sign-off comment from a named reviewer and a 24-hour canary at 5% traffic.
53
+ - Eval failure is treated as a test failure — never an advisory warning.
54
+
55
+ ## Prompt Versioning
56
+
57
+ - Prompts are first-class artifacts at `prompts/<feature>/v<N>.{md,txt}` with an SHA-256 hash recorded in the eval thresholds file.
58
+ - Runtime emits `prompt_version` + `prompt_hash` per request as log + span attribute.
59
+ - A/B framework supports concurrent versions behind a flag — traffic split recorded per request so eval delta is computable.
60
+ - Hash drift between repo prompt and deployed prompt is a P0 incident.
61
+
62
+ ## Cost Telemetry per Request
63
+
64
+ Every LLM call logs: `tokens_in`, `tokens_out`, `cache_hit` (boolean + cached_tokens count), `model`, `cost_usd`, `latency_ms`, `cost_center` (feature ID), `prompt_version`, `prompt_hash`, `user_id_hash`.
65
+
66
+ Aggregate dashboards in the observability stack — cross-reference `rules/hatch3r-observability-metrics.md` and `rules/hatch3r-observability-tracing-detail.md` for the SLI/SLO vocabulary, and `skills/hatch3r-observability-verify` for the wiring checklist. Per-feature budget alerts fire at 50%, 75%, and 90% of monthly budget; abuse-detection alert at 10x user p99 cost over a 1-hour window.
67
+
68
+ ## Prompt Caching (Anthropic)
69
+
70
+ - Apply `cache_control` breakpoints to long system prompts, tool definitions, and large RAG context blocks above 1024 tokens (the Claude Opus/Sonnet 4.x minimum cache size; 2048 tokens for Haiku 4.x).
71
+ - Up to 4 breakpoints per request, longest-TTL-first.
72
+ - 5-minute TTL costs 1.25x the standard write rate; 1-hour TTL costs 2x the standard write rate. Reads are 0.1x base.
73
+ - Track cache_hit ratio per feature; <30% hit ratio on a stable prompt is a sign the prefix is changing — investigate before next deploy.
74
+
75
+ ## OpenAI Prompt Caching
76
+
77
+ - Automatic for requests over 1024 tokens with a stable deterministic prefix; ~50% discount on cached input tokens.
78
+ - Cache duration documented by OpenAI as 5-10 minutes idle, up to 1 hour during off-peak.
79
+ - Order the request prefix deterministically — same system prompt, same tool definitions, same retrieved-doc order — or the cache misses silently.
80
+
81
+ ## Model Router and Fallback
82
+
83
+ Every LLM call wraps in a retry-with-decorrelated-jitter plus model-fallback chain:
84
+
85
+ 1. **Primary** — production-quality model (e.g. Sonnet 4.7).
86
+ 2. **Secondary** — faster/cheaper model (e.g. Haiku 4.5, GPT-5-mini, or a local Llama variant).
87
+ 3. **Static fallback** — cached response from a similar prior request or a canned templated reply that names the failure ("Search temporarily unavailable — retry in a minute").
88
+
89
+ Cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the circuit-breaker + retry-with-decorrelated-jitter primitives the chain reuses. Each fallback path has its own eval suite — a silent quality cliff between primary and secondary is a regression.
90
+
91
+ ## Hallucination and Groundedness as SLI
92
+
93
+ Track per feature and treat as service-level indicators with explicit SLO targets:
94
+
95
+ - `ai.hallucination_rate` — % of responses producing claims not present in retrieved sources. SLO: <5% on the golden set.
96
+ - `ai.citation_precision` — % of citations whose source span verifiably contains the cited claim. SLO: >95% on retrieved claims.
97
+ - `ai.refusal_rate` — overall refusal-rate; track false-positive refusals separately.
98
+ - `ai.groundedness_score` — average RAGAS faithfulness across the golden set. SLO: >0.85.
99
+
100
+ Cross-reference `rules/hatch3r-observability-metrics.md` for the SLI/SLO authoring template.
101
+
102
+ ## Safety and Red-Team
103
+
104
+ Every feature exercising user-controlled prompts runs adversarial eval on a schedule (weekly minimum, on every prompt-version bump):
105
+
106
+ - **Garak** — open-source jailbreak / prompt-injection / PII-leakage probe suite.
107
+ - **PyRIT** — Microsoft red-team framework, scenario-driven.
108
+ - **Inspect-redteam** — UK AISI safety eval modules.
109
+
110
+ Block release on regression. PII-leakage tests use a synthetic-PII corpus (never production PII). Prompt-injection tests cover OWASP Top 10 for Agentic Applications 2026.
111
+
112
+ ## Tool-Use Evals
113
+
114
+ When the AI feature uses tools (per the companion UX rule), the eval suite covers:
115
+
116
+ - Tool-selection accuracy — correct tool chosen for each input class.
117
+ - Argument validity — emitted args satisfy the tool schema (Zod, JSON Schema, Pydantic).
118
+ - Chain correctness — multi-tool plans reach the goal with the minimum required step count plus a 20% tolerance.
119
+ - Latency budget — p99 tool-execution time within the budget from `rules/hatch3r-performance-budgets.md`.
120
+
121
+ Methodology aligned with **BFCL v4** (Berkeley Function Calling Leaderboard) and **tau-bench** (multi-turn tool-use benchmark).
122
+
123
+ ## OpenTelemetry GenAI Semantic Conventions
124
+
125
+ Every LLM call emits an OpenTelemetry span named `gen_ai.<operation>` with the attributes prescribed by the OpenTelemetry GenAI semantic conventions: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.temperature`, `gen_ai.tool.name` (when tools used). Cross-reference Slice 2 observability rules for the broader span taxonomy.
126
+
127
+ ## User-Feedback Loop
128
+
129
+ - Every AI response surface emits a thumbs-up/down control wired to a feedback queue.
130
+ - A monthly triage job promotes thumbs-down examples into regression-test fixtures in `evals/<feature>/edge.jsonl`.
131
+ - Promotion is a manual review step — never auto-promote raw user feedback (it contains noise and adversarial labels).
132
+
133
+ ## Audit Logging
134
+
135
+ - LLM inputs + outputs logged to a compliance store separate from APM with a 30-90 day retention window (configurable per data-classification policy).
136
+ - PII redaction before persistence — same redaction primitive as the framework's data-classification pipeline.
137
+ - Reproducibility key: `model` + `prompt_hash` + `seed` (when the model exposes a seed parameter) + `temperature`. Without all four, the response is non-reproducible.
138
+
139
+ ## Eval-Driven Development Workflow
140
+
141
+ Write eval before prompt, measure baseline, write prompt, measure delta, iterate until eval threshold passes. Cross-reference `skills/hatch3r-ai-feature/SKILL.md` for the step-by-step workflow.
142
+
143
+ ## References
144
+
145
+ - promptfoo — `promptfoo.dev`
146
+ - DeepEval — `github.com/confident-ai/deepeval`
147
+ - RAGAS — `docs.ragas.io`
148
+ - Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
149
+ - Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
150
+ - OpenAI prompt caching guide — `platform.openai.com/docs/guides/prompt-caching`
151
+ - OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
152
+ - Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`
153
+ - tau-bench — `github.com/sierra-research/tau-bench`
154
+ - OWASP Top 10 for LLM Applications (Agentic 2026) — `genai.owasp.org`
@@ -0,0 +1,131 @@
1
+ ---
2
+ id: hatch3r-ai-ux-patterns
3
+ type: rule
4
+ description: 2026 AI/agentic UX patterns for end-user projects shipping AI features — streaming, tool-call UI, human-approval gates, cancel/abort/undo, citations
5
+ scope: "**/*.vue,**/*.jsx,**/*.tsx,**/*.svelte,**/ai/**,**/chat/**,**/assistant/**,**/agents/**,**/llm/**,**/copilot/**"
6
+ tags: [ux, ai, frontend]
7
+ quality_charter: agents/shared/quality-charter.md
8
+ cache_friendly: true
9
+ ---
10
+ # AI/Agentic UX Patterns (2026)
11
+
12
+ ## Scope
13
+
14
+ This rule applies when the end-user project ships LLM-driven UI — chat, assistant, copilot, agent dashboards, generative UI surfaces. It does NOT govern the LLM backend itself (model selection, prompt engineering, retrieval pipeline). For non-AI UX rules (loading, empty, error, partial states; form patterns; microcopy), cross-reference `rules/hatch3r-ux-states-and-flows.md`. When both rules apply to the same surface, the non-AI rule sets the baseline and this rule layers AI-specific behavior on top.
15
+
16
+ Backend companion: `rules/hatch3r-ai-evals.md` defines eval harness, prompt versioning, cost telemetry, prompt caching, model fallback, and hallucination-as-SLI. Apply both rules for any LLM-driven feature.
17
+
18
+ Detection: this rule activates when the project imports an AI SDK (`ai`, `@ai-sdk/*`, `openai`, `@anthropic-ai/sdk`, `@google/generative-ai`), or contains files under `ai/`, `chat/`, `assistant/`, `agents/`, `llm/`, `copilot/`. Adding any of these to a project that previously had no AI surface triggers the full ruleset on the next agent run.
19
+
20
+ ## Streaming-First Defaults
21
+
22
+ - Every LLM-driven surface uses framework-agnostic streaming hooks: `useChat`, `useCompletion`, or `useObject` from Vercel AI SDK UI (or equivalent for non-React stacks). Hand-rolled SSE or `fetch` against the model endpoint is a regression in 2026.
23
+ - Render progressive tokens from the moment the request starts. Pair with a skeleton state during the pre-token window (request-sent, first-token-pending). A blank surface during model latency is a regression.
24
+ - Render markdown incrementally without re-parsing the whole buffer on every chunk (Vercel AI Elements `MessageResponse` pattern). Use a streaming-safe markdown renderer that tracks last-rendered offset and appends from there.
25
+ - Emit chunk-level error boundaries. If a stream fails mid-flight, retain prior tokens, render an inline retry control adjacent to the truncated response, and do not blank the surface or reset the message body.
26
+ - Indicate completion explicitly via a final-token marker or `onFinish` callback transition. Stale "thinking" indicators after stream-end erode trust and waste user attention.
27
+ - Non-streaming responses on LLM-driven surfaces are a regression. The narrow exception is structured-output endpoints that must return a single validated JSON payload (`useObject` already streams partial objects — prefer that).
28
+ - Typing indicator policy: show only while no token has arrived; switch to streamed content the moment the first delta lands. Two-second indicator timeouts that linger past first-token are a regression.
29
+ - Backpressure: when the model emits faster than the renderer can paint, batch deltas to one paint per animation frame (`requestAnimationFrame`) rather than dropping tokens or queuing into an unbounded buffer.
30
+
31
+ ## Generative UI (RSC Streaming)
32
+
33
+ - For dynamic dashboards, cards, and surfaces driven by tool calls, use the `streamUI()` RSC pattern (or framework equivalent). The model returns a tool invocation, the server renders an RSC fragment per tool, the client merges fragments into the thread without re-hydrating.
34
+ - Build on shadcn primitives via Vercel AI Elements — 20+ shadcn-based React components for AI surfaces: messages, input, reasoning panel, tool-call display, response actions. Reuse before authoring; bespoke equivalents are a duplication regression.
35
+ - Keep AI-generated UI off the client JS bundle. RSC streaming keeps the payload server-side; do not ship a client-side renderer that duplicates server-rendered output.
36
+ - Each RSC fragment is independently hydratable — never wrap two tool outputs in a single fragment that requires both to complete before render.
37
+ - For non-React stacks, the equivalent is server-rendered HTML fragments streamed via Server-Sent Events and merged into the DOM with explicit anchor elements per tool call. Do not bypass the server-render boundary with raw `innerHTML` of model output.
38
+
39
+ ## Tool-Call UI Cards
40
+
41
+ Every tool invocation is user-visible by default. Hidden tool calls are a regression — transparency is the trust contract with the user.
42
+
43
+ - Render each tool call as a structured card with: tool name, args (collapsed by default, expandable on click), status (`pending` → `in-progress` → `complete` / `failed`), elapsed duration, result preview (truncated, expand to full).
44
+ - Indicate state through both color and icon — a single channel (color only) fails for color-vision-deficient users.
45
+ - Args display redacts secrets (API keys, tokens, bearer credentials) and PII per project data-classification rules. Show a `[redacted]` placeholder so the user knows a value was scrubbed, not omitted.
46
+ - Async sub-flows — a tool that triggers more tools — render as nested cards. Do not collapse a multi-tool sub-flow to a single line; the user must see every external action.
47
+ - Failed tool calls show the error message in plain language plus a single-click retry control. Stack traces and raw API responses belong behind an "Details" toggle, not in the user-visible card body.
48
+ - Result preview length: cap at 200 characters or 5 lines (whichever first); always paired with an "Expand" affordance. Long binary or base64 payloads display size + content-type, not the raw bytes.
49
+ - Tool-call cards are keyboard-focusable and announce status transitions via `aria-live="polite"` regions so screen-reader users perceive state changes without polling the visual surface.
50
+
51
+ ## Human-Approval Gates for Side-Effects
52
+
53
+ Required for any tool that:
54
+
55
+ - Mutates external state (database write, email send, post to Slack/GitHub/external service)
56
+ - Spends money (paid-API call, transaction, billable inference run on a third-party provider)
57
+ - Modifies the user's environment (file write, shell command, git push, package install)
58
+
59
+ Gate pattern:
60
+
61
+ 1. Pause execution before invoking the tool.
62
+ 2. Render an approval card showing tool name, full args (no collapse), diff preview if the tool mutates a file, named target ("Send email to alice@example.com", not "Send email").
63
+ 3. Require explicit user confirm (`Approve` button, not Enter-to-dismiss).
64
+ 4. On approve, invoke the tool and transition the card to `in-progress`. On deny, surface a cancel-reason field and persist the rejection in the run transcript.
65
+
66
+ Read-only tools (search, fetch, list, read) do not require approval — gating them adds friction without trust value.
67
+
68
+ Destructive irreversible actions (drop table, delete repo, permanent send) disclose irreversibility on the approval card with a typed-confirmation pattern (user types the named target to enable the confirm button).
69
+
70
+ Auto-approve scopes (the "trusted tool" pattern) live behind a per-tool setting with a named scope ("Auto-approve git status, never git push"). Default to opt-in, never opt-out. Scopes are revocable from a single settings surface and revocation takes effect for in-flight runs, not just future runs.
71
+
72
+ ## Cancel / Abort / Undo
73
+
74
+ Three distinct affordances — do not collapse them into a single control:
75
+
76
+ - **Cancel** — every long-running tool call (>2s expected duration) and every streaming response renders a visible cancel control adjacent to the surface. Wire to an `AbortController` covering both the model stream and the tool worker. Cancellation produces a clean stop with a "Cancelled by user" marker in the transcript — not a hang, not a silent terminate.
77
+ - **Abort** — distinct from cancel. Aborts the entire agent task (a multi-tool plan). Persists prior tool results so the user can resume from any completed step. Renders as a separate control on the agent-task header, not the in-flight tool card.
78
+ - **Undo** — every reversible action (file write, message send, database row insert, configuration change) renders an undo affordance for 5-30s after completion (named target shown: "Undo: deleted file foo.txt"). Maps to a compensating tool call (delete-after-write, recall-after-send, soft-delete-restore). Destructive irreversible actions disclose this on the approval card per the previous section.
79
+ - Keyboard shortcuts: cancel = `Esc` while the surface is focused; abort = `Esc` then confirm (double-tap pattern) to prevent accidental task termination; undo = `Cmd/Ctrl+Z` while the undo affordance is visible.
80
+ - All three controls remain operable after the user navigates away from the surface — running streams and undo windows persist in a thread-level state, not component-local state. A user closing the panel does not cancel the in-flight task.
81
+
82
+ ## Citations and Grounding
83
+
84
+ - Factual claims from retrieval show inline citations with span-level linkage. Hovering or clicking the citation highlights the source passage in the retrieved document.
85
+ - Cite the retrieval source URL or document anchor, not just the document name — "Knowledge Base / Onboarding" without an anchor is unverifiable.
86
+ - Ungroundable claims are flagged in the response surface ("I'm not certain — this isn't in your retrieved sources") rather than silently emitted. Span-level verification is preferred when the model exposes it.
87
+ - Do not show uncalibrated confidence badges. A "90% confident" indicator without empirical calibration data actively misleads — drop it or replace with grounding framing ("Based on [doc]" or "Not found in your data").
88
+ - When retrieval returns zero results, say so explicitly. Silent fall-through to model parametric memory is a regression — the user cannot distinguish retrieved from confabulated content.
89
+ - Citation rendering: superscript number link in the response body that opens a side panel (or popover on hover) with source title, URL, and highlighted span. Avoid footnote-only patterns that force the user to scroll.
90
+ - Source freshness indicator: when retrieval metadata includes document timestamps, surface the source date on the citation popover. Stale sources flagged at >12 months by default; threshold configurable per project domain.
91
+
92
+ ## Multi-Step Agent UX
93
+
94
+ - Render the agent's plan as an ordered checklist before execution. Allow the user to edit step text, reorder, or skip individual steps before approving the plan. A pre-execution plan view is required for any agent run with >3 steps.
95
+ - As each step completes, mark with status (success/failure/skipped) plus elapsed duration. Failed steps show the failure cause and a retry control; retry restarts from the failed step, not from the beginning.
96
+ - Streaming reasoning (when the model exposes it via a reasoning channel) renders in a collapsed "Reasoning" panel that the user opens on click. Never inline reasoning with the user-facing message body — it dilutes the response and conflates the user-facing answer with internal model state.
97
+ - Maintain a thread or run ID per conversation for resumption, audit trail, and observability backlink. The run ID is visible in the UI (footer or thread metadata) so users and support can reference a specific run.
98
+ - Plan revision: when the model revises its plan mid-run (adds, removes, or reorders remaining steps), diff the previous plan against the new plan and surface the diff to the user before continuing. Silent plan mutation defeats the pre-execution review.
99
+ - Parallel steps: when the plan contains independent steps that the agent runs in parallel, render them as a horizontal group with shared progress, not a flat vertical list — the user must see the parallelism to interpret elapsed time.
100
+
101
+ ## Failure Modes and Degradation
102
+
103
+ - **Model timeout** — render a fallback card with a single-click recovery ("Model is slow — retry?" or "Switching to faster model"). Do not silently retry; the user must know the model failed.
104
+ - **Rate limit** — show a countdown timer and queue position when available. Surface vendor rate-limit headers, not raw API error JSON. "Try again in 47s" beats "429 Too Many Requests".
105
+ - **Token budget exceeded** — indicate context truncation in-line ("Conversation truncated — earlier turns dropped") and offer a summary/restart control. Do not silently truncate; the user must know which turns are still in context.
106
+ - **Tool unavailability** — degrade gracefully. Disable the affected feature with a tooltip explaining why ("Slack integration offline — reconnect in Settings"), not a crash or an unhandled exception bubbled to the user.
107
+ - **Stream interruption mid-response** — retain the partial response, render a "Stream interrupted" marker, offer continue-from-here (resume the stream from the last received token) and restart-from-scratch (drop partial, restart) as two separate controls.
108
+ - **Content-policy refusal** — when the model refuses to answer (safety filter, policy violation), render the refusal verbatim in a distinct style (not a normal assistant message), and surface a single-click "Report this" affordance so users can flag false-positive refusals.
109
+ - **Stale session** — if the user returns to a tab where the stream has died (network sleep, tab discard), detect on focus and offer a "Reconnect" control rather than silently re-issuing the request and double-billing the user.
110
+
111
+ ## Verification Gate
112
+
113
+ Before declaring an AI surface "done":
114
+
115
+ - Streaming verified end-to-end: scripted Playwright test that asserts progressive token render (first token <1s after request, last token marked complete).
116
+ - Tool-call card snapshot per state (`pending`, `in-progress`, `complete`, `failed`) — missing any state is a blocker.
117
+ - Approval-gate test: simulate a side-effecting tool, assert execution pauses, approval card renders required fields, deny path persists rejection in transcript.
118
+ - Cancel/abort/undo each have at least one end-to-end test exercising the AbortController wire-up; cancellation produces a transcript marker, not a silent terminate.
119
+ - Citation rendering: snapshot of a grounded response shows inline citation with source URL or anchor; an ungroundable claim renders the flag string, not silent emission.
120
+ - Failure-mode coverage: at least one test per failure mode in this rule (timeout, rate limit, token budget, tool unavailability, stream interruption, content-policy refusal, stale session). Missing coverage is a blocker.
121
+ - Accessibility: axe-core scan against every AI surface state (idle, streaming, tool-call pending, approval card open, error) returns 0 serious or critical violations.
122
+
123
+ ## References
124
+
125
+ - Vercel — Introducing AI Elements (`vercel.com/changelog/introducing-ai-elements`)
126
+ - Vercel AI SDK UI docs (`ai-sdk.dev/docs/ai-sdk-ui`) — `useChat`, `useCompletion`, `useObject`
127
+ - Vercel — AI SDK 3 Generative UI / `streamUI()` (`vercel.com/blog/ai-sdk-3-generative-ui`)
128
+ - Anthropic — Building agents with the Claude Agent SDK (`anthropic.com/engineering/building-agents-with-the-claude-agent-sdk`)
129
+ - OpenAI Apps SDK — UI guidelines (`developers.openai.com/apps-sdk/concepts/ui-guidelines`)
130
+ - ClarityArc — Hallucination, grounding, and citation in enterprise systems
131
+ - Lakera — LLM hallucinations in 2026