npm - hatch3r - Versions diffs - 1.7.0 → 1.7.5 - Mend

hatch3r 1.7.0 → 1.7.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (160) hide show

package/README.md +38 -12
package/agents/hatch3r-a11y-auditor.md +4 -0
package/agents/hatch3r-architect.md +5 -1
package/agents/hatch3r-ci-watcher.md +4 -0
package/agents/hatch3r-context-rules.md +4 -0
package/agents/hatch3r-creator.md +4 -0
package/agents/hatch3r-dependency-auditor.md +4 -0
package/agents/hatch3r-devops.md +4 -0
package/agents/hatch3r-docs-writer.md +4 -0
package/agents/hatch3r-fixer.md +5 -1
package/agents/hatch3r-handoff-loader.md +243 -0
package/agents/hatch3r-handoff-preparer.md +134 -0
package/agents/hatch3r-implementer.md +5 -1
package/agents/hatch3r-learnings-loader.md +4 -0
package/agents/hatch3r-lint-fixer.md +4 -0
package/agents/hatch3r-perf-profiler.md +8 -0
package/agents/hatch3r-researcher.md +5 -1
package/agents/hatch3r-reviewer.md +92 -0
package/agents/hatch3r-security-auditor.md +24 -0
package/agents/hatch3r-test-writer.md +4 -0
package/agents/modes/requirements-elicitation.md +5 -1
package/agents/modes/similar-implementation.md +6 -0
package/agents/modes/user-flows.md +76 -0
package/agents/shared/quality-charter.md +129 -0
package/agents/shared/user-question-protocol.md +95 -0
package/commands/board/shared-azure-devops.md +2 -0
package/commands/board/shared-github.md +17 -0
package/commands/board/shared-gitlab.md +4 -0
package/commands/hatch3r-board-fill.md +2 -1
package/commands/hatch3r-board-pickup.md +1 -1
package/commands/hatch3r-board-shared.md +21 -0
package/commands/hatch3r-create.md +2 -0
package/commands/hatch3r-handoff.md +126 -0
package/commands/hatch3r-pr-resolve.md +672 -0
package/commands/hatch3r-quick-change.md +5 -3
package/commands/hatch3r-report.md +167 -0
package/commands/hatch3r-revision.md +1 -1
package/commands/hatch3r-workflow.md +3 -1
package/dist/cli/index.js +3144 -979
package/dist/cli/index.js.map +1 -1
package/package.json +4 -2
package/rules/hatch3r-accessibility-standards.md +21 -0
package/rules/hatch3r-accessibility-standards.mdc +21 -0
package/rules/hatch3r-agent-orchestration.md +32 -1
package/rules/hatch3r-agent-orchestration.mdc +32 -1
package/rules/hatch3r-ai-evals.md +158 -0
package/rules/hatch3r-ai-evals.mdc +154 -0
package/rules/hatch3r-ai-ux-patterns.md +131 -0
package/rules/hatch3r-ai-ux-patterns.mdc +127 -0
package/rules/hatch3r-api-design.md +67 -9
package/rules/hatch3r-api-design.mdc +67 -9
package/rules/hatch3r-api-versioning.md +119 -0
package/rules/hatch3r-api-versioning.mdc +115 -0
package/rules/hatch3r-auth-patterns.md +170 -0
package/rules/hatch3r-auth-patterns.mdc +166 -0
package/rules/hatch3r-component-conventions.md +30 -0
package/rules/hatch3r-component-conventions.mdc +30 -0
package/rules/hatch3r-container-hardening.md +131 -0
package/rules/hatch3r-container-hardening.mdc +127 -0
package/rules/hatch3r-contract-testing.md +117 -0
package/rules/hatch3r-contract-testing.mdc +113 -0
package/rules/hatch3r-deep-context.md +3 -1
package/rules/hatch3r-deep-context.mdc +3 -1
package/rules/hatch3r-dependency-management.md +73 -1
package/rules/hatch3r-dependency-management.mdc +72 -0
package/rules/hatch3r-design-system-detection.md +142 -0
package/rules/hatch3r-design-system-detection.mdc +138 -0
package/rules/hatch3r-event-schema-evolution.md +90 -0
package/rules/hatch3r-event-schema-evolution.mdc +86 -0
package/rules/hatch3r-handoff-readiness.md +45 -0
package/rules/hatch3r-handoff-readiness.mdc +40 -0
package/rules/hatch3r-i18n.md +13 -0
package/rules/hatch3r-i18n.mdc +13 -0
package/rules/hatch3r-iteration-summary.md +2 -0
package/rules/hatch3r-iteration-summary.mdc +2 -0
package/rules/hatch3r-migrations.md +61 -16
package/rules/hatch3r-migrations.mdc +61 -16
package/rules/hatch3r-observability-logging.md +1 -1
package/rules/hatch3r-observability-logging.mdc +1 -1
package/rules/hatch3r-observability-metrics.md +1 -1
package/rules/hatch3r-observability-metrics.mdc +1 -1
package/rules/hatch3r-observability-tracing-detail.md +1 -1
package/rules/hatch3r-observability-tracing-detail.mdc +1 -1
package/rules/hatch3r-observability-tracing.md +1 -1
package/rules/hatch3r-observability-tracing.mdc +1 -1
package/rules/hatch3r-observability.md +1 -0
package/rules/hatch3r-observability.mdc +1 -0
package/rules/hatch3r-operability.md +149 -0
package/rules/hatch3r-operability.mdc +145 -0
package/rules/hatch3r-passkey-server.md +181 -0
package/rules/hatch3r-passkey-server.mdc +177 -0
package/rules/hatch3r-progressive-delivery.md +120 -0
package/rules/hatch3r-progressive-delivery.mdc +116 -0
package/rules/hatch3r-resilience-patterns.md +154 -0
package/rules/hatch3r-resilience-patterns.mdc +150 -0
package/rules/hatch3r-secrets-management.md +29 -0
package/rules/hatch3r-secrets-management.mdc +29 -0
package/rules/hatch3r-testing.md +139 -43
package/rules/hatch3r-testing.mdc +139 -43
package/rules/hatch3r-ux-states-and-flows.md +149 -0
package/rules/hatch3r-ux-states-and-flows.mdc +145 -0
package/skills/hatch3r-a11y-audit/SKILL.md +14 -0
package/skills/hatch3r-ai-feature/SKILL.md +134 -0
package/skills/hatch3r-api-spec/SKILL.md +5 -0
package/skills/hatch3r-architecture-review/SKILL.md +14 -0
package/skills/hatch3r-bug-fix/SKILL.md +5 -0
package/skills/hatch3r-ci-pipeline/SKILL.md +14 -0
package/skills/hatch3r-cli-aichat/SKILL.md +84 -0
package/skills/hatch3r-cli-ast-grep/SKILL.md +85 -0
package/skills/hatch3r-cli-az-devops/SKILL.md +89 -0
package/skills/hatch3r-cli-bat/SKILL.md +85 -0
package/skills/hatch3r-cli-comby/SKILL.md +85 -0
package/skills/hatch3r-cli-csvkit/SKILL.md +84 -0
package/skills/hatch3r-cli-delta/SKILL.md +86 -0
package/skills/hatch3r-cli-difftastic/SKILL.md +84 -0
package/skills/hatch3r-cli-docker/SKILL.md +89 -0
package/skills/hatch3r-cli-duckdb/SKILL.md +84 -0
package/skills/hatch3r-cli-fd/SKILL.md +85 -0
package/skills/hatch3r-cli-fzf/SKILL.md +84 -0
package/skills/hatch3r-cli-gh/SKILL.md +90 -0
package/skills/hatch3r-cli-glab/SKILL.md +89 -0
package/skills/hatch3r-cli-jq/SKILL.md +85 -0
package/skills/hatch3r-cli-lazygit/SKILL.md +78 -0
package/skills/hatch3r-cli-llm/SKILL.md +84 -0
package/skills/hatch3r-cli-miller/SKILL.md +84 -0
package/skills/hatch3r-cli-mods/SKILL.md +84 -0
package/skills/hatch3r-cli-overview/SKILL.md +60 -0
package/skills/hatch3r-cli-playwright/SKILL.md +89 -0
package/skills/hatch3r-cli-podman/SKILL.md +84 -0
package/skills/hatch3r-cli-ripgrep/SKILL.md +85 -0
package/skills/hatch3r-cli-rtk/SKILL.md +91 -0
package/skills/hatch3r-cli-sd/SKILL.md +85 -0
package/skills/hatch3r-cli-stagehand/SKILL.md +79 -0
package/skills/hatch3r-cli-taplo/SKILL.md +84 -0
package/skills/hatch3r-cli-xsv/SKILL.md +89 -0
package/skills/hatch3r-cli-yq/SKILL.md +85 -0
package/skills/hatch3r-cli-zstd/SKILL.md +85 -0
package/skills/hatch3r-context-health/SKILL.md +14 -0
package/skills/hatch3r-cost-tracking/SKILL.md +14 -0
package/skills/hatch3r-customize/SKILL.md +14 -0
package/skills/hatch3r-dep-audit/SKILL.md +14 -0
package/skills/hatch3r-design-system-detect/SKILL.md +162 -0
package/skills/hatch3r-feature/SKILL.md +2 -0
package/skills/hatch3r-gh-agentic-workflows/SKILL.md +13 -0
package/skills/hatch3r-handoff-prepare/SKILL.md +160 -0
package/skills/hatch3r-handoff-resume/SKILL.md +171 -0
package/skills/hatch3r-incident-response/SKILL.md +14 -0
package/skills/hatch3r-issue-workflow/SKILL.md +5 -0
package/skills/hatch3r-logical-refactor/SKILL.md +14 -0
package/skills/hatch3r-migration/SKILL.md +14 -0
package/skills/hatch3r-observability-verify/SKILL.md +133 -0
package/skills/hatch3r-perf-audit/SKILL.md +14 -0
package/skills/hatch3r-pr-creation/SKILL.md +14 -0
package/skills/hatch3r-qa-validation/SKILL.md +18 -0
package/skills/hatch3r-recipe/SKILL.md +14 -0
package/skills/hatch3r-refactor/SKILL.md +14 -0
package/skills/hatch3r-release/SKILL.md +14 -0
package/skills/hatch3r-reliability-verify/SKILL.md +144 -0
package/skills/hatch3r-ui-ux-verify/SKILL.md +136 -0
package/skills/hatch3r-visual-refactor/SKILL.md +15 -1

package/rules/hatch3r-ux-states-and-flows.md ADDED Viewed

@@ -0,0 +1,149 @@
+---
+id: hatch3r-ux-states-and-flows
+type: rule
+description: Four-state surface contract (loading/empty/error/partial), user-flow decomposition before implementation, and microcopy + tone discipline for end-user UIs
+scope: "**/*.vue,**/*.jsx,**/*.tsx,**/*.svelte,**/components/**,**/*.html,**/messages/**,**/locales/**,**/copy/**"
+tags: [ux, ui, frontend]
+quality_charter: agents/shared/quality-charter.md
+cache_friendly: true
+---
+# UX States and Flows
+## User-Flow Decomposition (Mandatory Before Implementation)
+Every user story produces three explicit flows before any UI code is written. Skipping this step is a regression — the implementer rejects specs lacking flow decomposition.
+1. **Happy Path** — numbered sequence of user actions paired with system responses and decision points covering the success route end-to-end. Includes entry point (route, prior screen, deep link), authentication state assumed, and exit point (resulting state, next screen).
+2. **Alternative Paths** — every documented branch (filters applied, multi-step forms, optional sections, permission tiers, locale variants). Each branch is numbered and references the decision point in the Happy Path where it diverges. Re-converges named.
+3. **Error-Recovery Path** — what the user does when a step fails (network drop, validation failure, expired session, permission denied, server outage, conflict on concurrent edit). Names the recovery action (retry, undo, refresh credential, contact support, drop changes) at every failure point.
+Each flow is a numbered list with three columns: user action, system response, decision/branch. Reference `agents/modes/user-flows.md` for the canonical template. A flow is incomplete if any documented decision point has no specified branch outcome. The implementer treats acceptance criteria as a downstream artifact of the flow set — flows come first, criteria are derived.
+Flow decomposition is also the contract between the feature designer and the reviewer: the reviewer walks every flow during code review and asserts each numbered step maps to a code path. A code path with no flow step is dead code or undocumented surface; a flow step with no code path is a missing implementation.
+## Four-State Surface Contract
+Every async view renders all four states. Missing any state on an async view is a blocker — NN/g 2025 measured 92% of AI-generated dashboards omit empty state and 78% omit error state, so this is the regression bar to beat.
+| State | Trigger | Rendering rules |
+|-------|---------|-----------------|
+| Loading | Request in flight >300ms | Skeleton matches final content dimensions (explicit width/height) to keep CLS <=0.1; spinner only for short modal actions <500ms; loading state announced via `aria-busy="true"` on the container |
+| Empty | Successful response, 0 results | Distinguish three sub-types below: cold start, active filter, no permission. Each has a distinct heading, visual cue, and CTA |
+| Error | Network failure, server error, validation failure | Plain-language cause + retry CTA + fallback or contact path; no "Oops" or "Something went wrong" — corrective verb required; render in `role="alert"` for assistive tech announcement |
+| Partial | Some data succeeded, some failed | Banner naming the failed subset + degraded data rendered for the succeeded subset + retry option for the failed subset; never silently drop the failed subset |
+State machines drive these transitions — model the view as `idle | loading | success | empty | error | partial` and assert every state has a render path in the component. Snapshot every state in tests (Storybook play function or vitest + Testing Library): missing snapshots are blockers.
+Transitions between states have their own rules. Loading -> success swaps the skeleton for content without layout shift (the skeleton was already the final dimensions). Loading -> error replaces the skeleton with the error surface and announces it via `role="alert"`. Loading -> empty replaces the skeleton with the sub-typed empty surface (cold / filter / permission). Success -> partial overlays the banner without re-rendering the succeeded subset. Every transition has a measurable duration and an announcement strategy — never silent.
+## First-Run vs Filter vs Network — Content Structure
+| State type | Heading level | Visual cue | CTA | Example |
+|------------|---------------|------------|-----|---------|
+| Cold start | h2 | Onboarding illustration | "Create your first <noun>" | "No <items> yet — create your first to begin" |
+| Active filter | h3 inside results | Filter icon | "Clear filters" | "No results match these filters" |
+| Network error | role=alert | Error icon | "Retry" + "Contact support" | "Couldn't load <items> — check your connection" |
+| No permission | h3 | Lock icon | "Request access" | "You don't have access to <thing>" |
+Cold start and active filter are not interchangeable — a first-run view directs the user to the onboarding action; an active-filter view offers "Clear filters" because the data exists behind the filter. Rendering "Clear filters" on a cold-start view is a usability defect (the button has no effect). Detect the sub-type from a state predicate (item count vs filter count vs auth scope), not from a single boolean.
+Predicate ordering matters: check permission first (no-permission overrides everything), then filter state (active-filter overrides cold-start), then total count (cold-start when zero items exist anywhere). Inverting the order renders the wrong empty sub-type and the wrong CTA — a frequent source of "the button does nothing" bug reports.
+## Form UX
+- Inline validation timing: validate on `blur`. After the first error on a field appears, re-validate on every keystroke with a 150ms debounce so the error tracks the user's correction in flight.
+- Error recovery: clear the inline error immediately when the field becomes valid (no debounce on clear); re-enable the submit button in real time so the user sees recovery without re-tabbing.
+- Error summary on submit: render at the top of the form with anchor links to each invalid field. Move focus to the summary heading (not to the first error field) so screen-reader users hear the count and list before being dropped into one field. Anchor link target is the input itself; the link text is the field label, not "Error 1".
+- Required attributes per input: `autocomplete` (e.g., `email`, `tel`, `cc-number`, `one-time-code`), `inputmode` (e.g., `numeric`, `decimal`, `email`), `aria-describedby` pointing at the error or help text, and `aria-invalid="true"` when invalid. `enterkeyhint` on multi-step forms maps the soft-keyboard enter key to the next action.
+- Never use a disabled submit button as the only error signal — users assume the page is broken. Allow submission and render validation outcome on submit. A submit attempt with invalid fields renders the summary, sets focus on it, and leaves the submit button operable for retry.
+- Identifier-first auth: separate email and password screens; offer passkeys via WebAuthn Conditional UI (`autocomplete="username webauthn"`) so the OS surfaces the credential inline in autofill rather than behind a separate button. Passkey enrollment copy states the user-visible benefit in one sentence ("Sign in with your fingerprint — no password to remember").
+- Positive confirmation for non-trivial fields (email, password strength, username availability): when valid, render an inline check or short affirmation. Pair with a screen-reader-only `aria-live="polite"` announcement.
+- Multi-step forms render a step indicator with current step, total steps, and per-step labels. Each step is a separate `<form>` with its own validation and submit; never collect all fields on one screen behind tabs. Browser back button maps to previous step without data loss.
+- Autosave drafts where the user expects persistence (long-form composition, settings panels). Surface autosave state with a passive indicator ("Saved 2s ago" or "Saving..."), never with a blocking modal.
+## Error-Recovery Patterns
+- Retry button retries the same operation with the same input. Disable for 1s after click to avoid duplicate submits; re-enable on completion.
+- Fallback offers an alternative path when retry will not help: cached view, contact support, alternate route. Render fallback below retry, not above.
+- Undo for reversible mutations: surface a toast with the undo action and a >=10-second timeout. Persist the undo action across screen navigations until the timeout fires or the user confirms.
+- Conflict on concurrent edit: surface a diff with three options (keep mine, keep theirs, merge); never overwrite silently.
+- Session expiration: re-auth inline (modal or banner with credential field), preserving the user's in-flight data. Forced re-auth that drops form data is a defect.
+## Microcopy and Tone
+Anchored to GOV.UK Design System (error pattern, plain English) and IBM Carbon (voice + tone).
+- Plain language; second person; corrective verb on every error string ("Enter your email address", not "Email is required"). The verb names the action the user takes to recover.
+- No jargon visible to end users: never expose "FIDO2", "WebAuthn", "HTTP 500", "null", "undefined", "401". Translate the implementation detail to the action the user can take. Implementation labels live in logs and error reports, not in UI strings.
+- CTA labels are action-oriented and specific: "Save changes", "Create account", "Delete project" — not "Submit", "OK", "Confirm". Destructive verbs name the object: "Delete project Foo permanently".
+- Error tone explains cause and offers recovery; never blame the user. "We couldn't save your changes — try again." not "You did something wrong." Distinguish system-fault from user-fault wording — for user-fault, name the field; for system-fault, use first-person plural ("we couldn't").
+- ICU MessageFormat for every translatable string. Never concatenate strings to build sentences — singular, plural, and gender variants all live in one message ID with explicit categories. Translator memory stays stable across releases when keys are stable; rename keys is a breaking change.
+- Cross-reference `rules/hatch3r-i18n.md` for translation mechanics and the Microcopy & Tone subsection that defines voice consistency across locales. Reserve 30-40% extra width in layouts for languages like DE and FI.
+- Treat AI as a copy assistant, not author — AI generates tone variants and the human approves final voice strings. Brand voice strings are owned by a named maintainer, not by a model.
+- Forbidden filler in any user-visible string: "Oops", "Whoops", "Something went wrong", "An unexpected error occurred". Lint these via a string-scan check in CI against the locale files. Replacements name the failure and the action — "Couldn't load <items>. Check your connection and retry."
+- Empty-state copy follows the same plain-language rule: name what is absent and the action to fill it. "No drafts yet — start writing." beats "You have no drafts." because the action is in the same sentence.
+- Truncation strategy: never mid-word in single-line truncations; use `text-overflow: ellipsis` with a tooltip or expand affordance for the full string. Multi-line truncation uses `-webkit-line-clamp` with a "Read more" affordance below.
+- Time and date formatting: respect the user's locale via `Intl.DateTimeFormat`; never hard-code "MM/DD/YYYY" or "12:34 PM" in UI strings. Relative time ("3 minutes ago") for recent events; absolute for events >24h old.
+- Numbers and units follow the locale's separator and grouping. Currency renders the user-selected currency, not the merchant's default. Round to the precision the user can act on — financial transactions to two decimals, weights to one.
+## Perceived Performance
+- Skeleton show threshold: 300ms first-content latency. Below 300ms render the final view directly; above 300ms render skeleton first to avoid spinner flash. Implement via a delayed `setTimeout(showSkeleton, 300)` cleared on early response.
+- Skeleton matches final content dimensions (explicit `width`, `height`, or `aspect-ratio`) so the layout does not shift when real content arrives — CLS <=0.1 baseline. Test by measuring CLS on the route under a throttled network profile (Slow 4G or 4x CPU).
+- Optimistic UI for safe mutations (add to list, mark done, toggle like): apply the change in local state immediately and reconcile on server response. On reject, roll back the local state AND surface a non-blocking toast with retry. Use `useOptimistic` in React or framework equivalent.
+- Pessimistic UI for destructive or financial actions (delete, payment, irreversible state change): wait for server confirmation before reflecting the change. Surface a progress indicator on the action itself, not on the page.
+- Stale-while-revalidate for non-critical data: render the cached value immediately, refresh in the background, surface a non-blocking "Refresh available" affordance when the fresh value differs from the cached value. Never overwrite the rendered view silently — the user is mid-task on the stale data.
+- 2026 Core Web Vitals gates at p75 of real-user data: LCP <=2.5s, INP <=200ms, CLS <=0.1. These are gating, not aspirational. Break long tasks (>50ms) with `scheduler.yield()` or `requestIdleCallback` to keep INP within budget.
+- Streaming UI for long-running AI or tool responses: render tokens as they arrive without re-parsing the full buffer per chunk. Show a typing indicator only while no token has arrived; switch to streamed content on first delta. Provide cancel/abort at every streamed surface.
+- Loading copy names the phase when total work is knowable ("Uploading 3 of 7..."), or names the activity when not ("Encoding..."). Generic "Loading..." is the fallback for sub-300ms cases that escape the skeleton threshold.
+## Streaming and Long-Running Operations
+- Streaming text (AI responses, log tails, real-time data) renders incrementally. First-token latency target <=500ms; render the typing indicator only until first token, then switch to streamed content.
+- Tool calls and side-effectful operations render as collapsible structured cards with status (pending, running, done, failed), arguments summary, duration, result preview. Hidden tool calls are a trust regression.
+- Human-in-the-loop approval required for any side-effectful tool (write, send, pay, deploy). Default approval mode is opt-in; auto-accept lives behind an explicit setting with named scope and one-click revocation.
+- Cancel/abort available at every long-running call. Abort triggers cleanup (rollback partial state, close streams, release locks). Show the user the post-abort state — never leave the UI in a "Cancelling..." limbo.
+- Citation rendering for retrieval-grounded claims: inline link to the source span. Spans the model cannot ground from retrieved evidence are flagged visibly, not silently emitted.
+## Mobile and Touch
+- Touch target minimum: 44pt on iOS HIG, 48dp on Material 3. The platform minimum supersedes WCAG SC 2.5.8 (24x24 CSS px) on touch surfaces — use the stricter bound.
+- Spacing between adjacent interactive elements: >=8px to avoid mis-taps. Measure on the rendered DOM, not on the design comp — padding and margins count toward hit area only when they belong to the same interactive element.
+- Avoid tap-to-reveal (hover-only tooltip) on touch surfaces. Use permanent labels or long-press with a visible affordance. Hover-only tooltips fail WCAG SC 1.4.13 Content on Hover or Focus on touch.
+- Apply `env(safe-area-inset-*)` on full-bleed surfaces so primary CTAs do not sit under the iOS home indicator, notch, or Android gesture nav. Native equivalents: `safeAreaInsets` on iOS, `WindowInsets` on Android.
+- Dynamic Type on iOS (`UIFont.preferredFont(forTextStyle:)` or SwiftUI `Font.body`) and rem-based font scaling on Android/web. Never use hard-coded `px` for text — fails text-resize verification at 200% zoom (WCAG SC 1.4.4) and 400% reflow (SC 1.4.10).
+- Drag-only gestures (range slider, kanban reorder, swipe-to-delete) require a non-drag alternative per WCAG SC 2.5.7 — keyboard arrows, click-to-place, explicit "Delete" button.
+- Pointer-event vs touch-event: prefer pointer events (`pointerdown`, `pointermove`, `pointerup`) so the same handler covers mouse, touch, pen. Synthesize click events on tap with a 300ms tolerance only on legacy mobile browsers — modern engines emit click immediately.
+- Pinch-zoom: never disable user zoom on the viewport meta tag. WCAG SC 1.4.4 fails on disabled zoom. Use `maximum-scale=5` to allow up to 5x zoom while suppressing zoom-on-focus jitter.
+- Orientation lock: respect WCAG SC 1.3.4 — content adapts to portrait and landscape unless an essential exception applies (e.g., piano keyboard, blueprint editor). Document the exception in the spec.
+## Heuristic Verification Gate
+Run every check below before declaring a feature done. A "done" claim without these checks is not done.
+- Nielsen 10 heuristics 5-minute walkthrough by a reviewer on the key user flow. Flag any heuristic violation (visibility of status, match to real world, user control, consistency, error prevention, recognition over recall, flexibility, minimalist design, error recovery, help/documentation); resolve before ship.
+- Keyboard-only run through the flow (no mouse, no touch). Every step reachable, every focus state visible (>=2px outline at >=3:1 contrast), no traps, no off-screen focus.
+- Screen-reader smoke test on the key route — one human pass per release using VoiceOver (macOS/iOS) or NVDA (Windows). Document the trace: which landmarks announced, which form labels heard, which dynamic updates registered.
+- First-time-user walkthrough for P0 features: measure task-completion time and observe error recovery. Note where the user paused, backtracked, or asked for help. Three observed users per P0 feature minimum.
+- State-coverage check: snapshot every async view in each of {loading, empty, error, partial, success} via Storybook play or component tests. Missing snapshots block the gate.
+- Microcopy lint: a CI string-scan rejects banned filler ("Oops", "Whoops", "Something went wrong") in any locale file; rejects `disabled` on submit buttons without an accompanying error-state announcement; flags strings without a corrective verb on error keys (`error.*`, `validation.*`).
+- Reduced-motion pass: toggle `prefers-reduced-motion: reduce` on the route and verify all non-essential transitions are removed; essential loaders simplify rather than disable.
+- Reflow pass: zoom the route to 200% and 400% in a 1280x1024 viewport — no horizontal scroll, no clipped content, no overlapping interactive elements.
+## References
+- WCAG 2.2 AA — new success criteria SC 2.5.8 (Target Size), SC 2.4.11 (Focus Not Obscured), SC 2.5.7 (Dragging Movements), SC 1.4.13 (Content on Hover or Focus)
+- NN/g state-omission research 2025 — empty-state and error-state omission rates in AI-generated UIs (92% empty omitted, 78% error omitted)
+- GOV.UK Design System — error message and error summary components, plain English readability guidance
+- IBM Carbon Design System — voice and tone content guidelines, error message patterns
+- Baymard Institute — inline form validation research and disabled-submit findings (top-10 form-abandonment driver)
+- Google Core Web Vitals 2026 thresholds — LCP, INP, CLS at p75 real-user data
+- Nielsen Norman Group — 10 usability heuristics for user interface design
+- Mailchimp Content Style Guide — voice and tone baseline, clarity over cleverness
+- Apple Human Interface Guidelines — Dynamic Type, safe area, touch target sizing
+- Material 3 Design System — touch target sizing (48dp), state layers, ripple feedback
+- W3C WAI-ARIA Authoring Practices — accessible widget patterns, live regions, focus management
+- FIDO Alliance / Passkey Central — WebAuthn Conditional UI design guidelines for passwordless auth
+- Vercel AI Elements — streaming message component patterns for AI chat surfaces
+- Anthropic engineering blog — agent UI patterns, human-in-the-loop approval, tool-call transparency
+- OpenAI Apps SDK UI guidelines — citation rendering, tool-result presentation

package/rules/hatch3r-ux-states-and-flows.mdc ADDED Viewed

@@ -0,0 +1,145 @@
+---
+description: Four-state surface contract (loading/empty/error/partial), user-flow decomposition before implementation, and microcopy + tone discipline for end-user UIs
+globs: ["**/*.vue", "**/*.jsx", "**/*.tsx", "**/*.svelte", "**/components/**", "**/*.html", "**/messages/**", "**/locales/**", "**/copy/**"]
+alwaysApply: false
+---
+# UX States and Flows
+## User-Flow Decomposition (Mandatory Before Implementation)
+Every user story produces three explicit flows before any UI code is written. Skipping this step is a regression — the implementer rejects specs lacking flow decomposition.
+1. **Happy Path** — numbered sequence of user actions paired with system responses and decision points covering the success route end-to-end. Includes entry point (route, prior screen, deep link), authentication state assumed, and exit point (resulting state, next screen).
+2. **Alternative Paths** — every documented branch (filters applied, multi-step forms, optional sections, permission tiers, locale variants). Each branch is numbered and references the decision point in the Happy Path where it diverges. Re-converges named.
+3. **Error-Recovery Path** — what the user does when a step fails (network drop, validation failure, expired session, permission denied, server outage, conflict on concurrent edit). Names the recovery action (retry, undo, refresh credential, contact support, drop changes) at every failure point.
+Each flow is a numbered list with three columns: user action, system response, decision/branch. Reference `agents/modes/user-flows.md` for the canonical template. A flow is incomplete if any documented decision point has no specified branch outcome. The implementer treats acceptance criteria as a downstream artifact of the flow set — flows come first, criteria are derived.
+Flow decomposition is also the contract between the feature designer and the reviewer: the reviewer walks every flow during code review and asserts each numbered step maps to a code path. A code path with no flow step is dead code or undocumented surface; a flow step with no code path is a missing implementation.
+## Four-State Surface Contract
+Every async view renders all four states. Missing any state on an async view is a blocker — NN/g 2025 measured 92% of AI-generated dashboards omit empty state and 78% omit error state, so this is the regression bar to beat.
+| State | Trigger | Rendering rules |
+|-------|---------|-----------------|
+| Loading | Request in flight >300ms | Skeleton matches final content dimensions (explicit width/height) to keep CLS <=0.1; spinner only for short modal actions <500ms; loading state announced via `aria-busy="true"` on the container |
+| Empty | Successful response, 0 results | Distinguish three sub-types below: cold start, active filter, no permission. Each has a distinct heading, visual cue, and CTA |
+| Error | Network failure, server error, validation failure | Plain-language cause + retry CTA + fallback or contact path; no "Oops" or "Something went wrong" — corrective verb required; render in `role="alert"` for assistive tech announcement |
+| Partial | Some data succeeded, some failed | Banner naming the failed subset + degraded data rendered for the succeeded subset + retry option for the failed subset; never silently drop the failed subset |
+State machines drive these transitions — model the view as `idle | loading | success | empty | error | partial` and assert every state has a render path in the component. Snapshot every state in tests (Storybook play function or vitest + Testing Library): missing snapshots are blockers.
+Transitions between states have their own rules. Loading -> success swaps the skeleton for content without layout shift (the skeleton was already the final dimensions). Loading -> error replaces the skeleton with the error surface and announces it via `role="alert"`. Loading -> empty replaces the skeleton with the sub-typed empty surface (cold / filter / permission). Success -> partial overlays the banner without re-rendering the succeeded subset. Every transition has a measurable duration and an announcement strategy — never silent.
+## First-Run vs Filter vs Network — Content Structure
+| State type | Heading level | Visual cue | CTA | Example |
+|------------|---------------|------------|-----|---------|
+| Cold start | h2 | Onboarding illustration | "Create your first <noun>" | "No <items> yet — create your first to begin" |
+| Active filter | h3 inside results | Filter icon | "Clear filters" | "No results match these filters" |
+| Network error | role=alert | Error icon | "Retry" + "Contact support" | "Couldn't load <items> — check your connection" |
+| No permission | h3 | Lock icon | "Request access" | "You don't have access to <thing>" |
+Cold start and active filter are not interchangeable — a first-run view directs the user to the onboarding action; an active-filter view offers "Clear filters" because the data exists behind the filter. Rendering "Clear filters" on a cold-start view is a usability defect (the button has no effect). Detect the sub-type from a state predicate (item count vs filter count vs auth scope), not from a single boolean.
+Predicate ordering matters: check permission first (no-permission overrides everything), then filter state (active-filter overrides cold-start), then total count (cold-start when zero items exist anywhere). Inverting the order renders the wrong empty sub-type and the wrong CTA — a frequent source of "the button does nothing" bug reports.
+## Form UX
+- Inline validation timing: validate on `blur`. After the first error on a field appears, re-validate on every keystroke with a 150ms debounce so the error tracks the user's correction in flight.
+- Error recovery: clear the inline error immediately when the field becomes valid (no debounce on clear); re-enable the submit button in real time so the user sees recovery without re-tabbing.
+- Error summary on submit: render at the top of the form with anchor links to each invalid field. Move focus to the summary heading (not to the first error field) so screen-reader users hear the count and list before being dropped into one field. Anchor link target is the input itself; the link text is the field label, not "Error 1".
+- Required attributes per input: `autocomplete` (e.g., `email`, `tel`, `cc-number`, `one-time-code`), `inputmode` (e.g., `numeric`, `decimal`, `email`), `aria-describedby` pointing at the error or help text, and `aria-invalid="true"` when invalid. `enterkeyhint` on multi-step forms maps the soft-keyboard enter key to the next action.
+- Never use a disabled submit button as the only error signal — users assume the page is broken. Allow submission and render validation outcome on submit. A submit attempt with invalid fields renders the summary, sets focus on it, and leaves the submit button operable for retry.
+- Identifier-first auth: separate email and password screens; offer passkeys via WebAuthn Conditional UI (`autocomplete="username webauthn"`) so the OS surfaces the credential inline in autofill rather than behind a separate button. Passkey enrollment copy states the user-visible benefit in one sentence ("Sign in with your fingerprint — no password to remember").
+- Positive confirmation for non-trivial fields (email, password strength, username availability): when valid, render an inline check or short affirmation. Pair with a screen-reader-only `aria-live="polite"` announcement.
+- Multi-step forms render a step indicator with current step, total steps, and per-step labels. Each step is a separate `<form>` with its own validation and submit; never collect all fields on one screen behind tabs. Browser back button maps to previous step without data loss.
+- Autosave drafts where the user expects persistence (long-form composition, settings panels). Surface autosave state with a passive indicator ("Saved 2s ago" or "Saving..."), never with a blocking modal.
+## Error-Recovery Patterns
+- Retry button retries the same operation with the same input. Disable for 1s after click to avoid duplicate submits; re-enable on completion.
+- Fallback offers an alternative path when retry will not help: cached view, contact support, alternate route. Render fallback below retry, not above.
+- Undo for reversible mutations: surface a toast with the undo action and a >=10-second timeout. Persist the undo action across screen navigations until the timeout fires or the user confirms.
+- Conflict on concurrent edit: surface a diff with three options (keep mine, keep theirs, merge); never overwrite silently.
+- Session expiration: re-auth inline (modal or banner with credential field), preserving the user's in-flight data. Forced re-auth that drops form data is a defect.
+## Microcopy and Tone
+Anchored to GOV.UK Design System (error pattern, plain English) and IBM Carbon (voice + tone).
+- Plain language; second person; corrective verb on every error string ("Enter your email address", not "Email is required"). The verb names the action the user takes to recover.
+- No jargon visible to end users: never expose "FIDO2", "WebAuthn", "HTTP 500", "null", "undefined", "401". Translate the implementation detail to the action the user can take. Implementation labels live in logs and error reports, not in UI strings.
+- CTA labels are action-oriented and specific: "Save changes", "Create account", "Delete project" — not "Submit", "OK", "Confirm". Destructive verbs name the object: "Delete project Foo permanently".
+- Error tone explains cause and offers recovery; never blame the user. "We couldn't save your changes — try again." not "You did something wrong." Distinguish system-fault from user-fault wording — for user-fault, name the field; for system-fault, use first-person plural ("we couldn't").
+- ICU MessageFormat for every translatable string. Never concatenate strings to build sentences — singular, plural, and gender variants all live in one message ID with explicit categories. Translator memory stays stable across releases when keys are stable; rename keys is a breaking change.
+- Cross-reference `rules/hatch3r-i18n.md` for translation mechanics and the Microcopy & Tone subsection that defines voice consistency across locales. Reserve 30-40% extra width in layouts for languages like DE and FI.
+- Treat AI as a copy assistant, not author — AI generates tone variants and the human approves final voice strings. Brand voice strings are owned by a named maintainer, not by a model.
+- Forbidden filler in any user-visible string: "Oops", "Whoops", "Something went wrong", "An unexpected error occurred". Lint these via a string-scan check in CI against the locale files. Replacements name the failure and the action — "Couldn't load <items>. Check your connection and retry."
+- Empty-state copy follows the same plain-language rule: name what is absent and the action to fill it. "No drafts yet — start writing." beats "You have no drafts." because the action is in the same sentence.
+- Truncation strategy: never mid-word in single-line truncations; use `text-overflow: ellipsis` with a tooltip or expand affordance for the full string. Multi-line truncation uses `-webkit-line-clamp` with a "Read more" affordance below.
+- Time and date formatting: respect the user's locale via `Intl.DateTimeFormat`; never hard-code "MM/DD/YYYY" or "12:34 PM" in UI strings. Relative time ("3 minutes ago") for recent events; absolute for events >24h old.
+- Numbers and units follow the locale's separator and grouping. Currency renders the user-selected currency, not the merchant's default. Round to the precision the user can act on — financial transactions to two decimals, weights to one.
+## Perceived Performance
+- Skeleton show threshold: 300ms first-content latency. Below 300ms render the final view directly; above 300ms render skeleton first to avoid spinner flash. Implement via a delayed `setTimeout(showSkeleton, 300)` cleared on early response.
+- Skeleton matches final content dimensions (explicit `width`, `height`, or `aspect-ratio`) so the layout does not shift when real content arrives — CLS <=0.1 baseline. Test by measuring CLS on the route under a throttled network profile (Slow 4G or 4x CPU).
+- Optimistic UI for safe mutations (add to list, mark done, toggle like): apply the change in local state immediately and reconcile on server response. On reject, roll back the local state AND surface a non-blocking toast with retry. Use `useOptimistic` in React or framework equivalent.
+- Pessimistic UI for destructive or financial actions (delete, payment, irreversible state change): wait for server confirmation before reflecting the change. Surface a progress indicator on the action itself, not on the page.
+- Stale-while-revalidate for non-critical data: render the cached value immediately, refresh in the background, surface a non-blocking "Refresh available" affordance when the fresh value differs from the cached value. Never overwrite the rendered view silently — the user is mid-task on the stale data.
+- 2026 Core Web Vitals gates at p75 of real-user data: LCP <=2.5s, INP <=200ms, CLS <=0.1. These are gating, not aspirational. Break long tasks (>50ms) with `scheduler.yield()` or `requestIdleCallback` to keep INP within budget.
+- Streaming UI for long-running AI or tool responses: render tokens as they arrive without re-parsing the full buffer per chunk. Show a typing indicator only while no token has arrived; switch to streamed content on first delta. Provide cancel/abort at every streamed surface.
+- Loading copy names the phase when total work is knowable ("Uploading 3 of 7..."), or names the activity when not ("Encoding..."). Generic "Loading..." is the fallback for sub-300ms cases that escape the skeleton threshold.
+## Streaming and Long-Running Operations
+- Streaming text (AI responses, log tails, real-time data) renders incrementally. First-token latency target <=500ms; render the typing indicator only until first token, then switch to streamed content.
+- Tool calls and side-effectful operations render as collapsible structured cards with status (pending, running, done, failed), arguments summary, duration, result preview. Hidden tool calls are a trust regression.
+- Human-in-the-loop approval required for any side-effectful tool (write, send, pay, deploy). Default approval mode is opt-in; auto-accept lives behind an explicit setting with named scope and one-click revocation.
+- Cancel/abort available at every long-running call. Abort triggers cleanup (rollback partial state, close streams, release locks). Show the user the post-abort state — never leave the UI in a "Cancelling..." limbo.
+- Citation rendering for retrieval-grounded claims: inline link to the source span. Spans the model cannot ground from retrieved evidence are flagged visibly, not silently emitted.
+## Mobile and Touch
+- Touch target minimum: 44pt on iOS HIG, 48dp on Material 3. The platform minimum supersedes WCAG SC 2.5.8 (24x24 CSS px) on touch surfaces — use the stricter bound.
+- Spacing between adjacent interactive elements: >=8px to avoid mis-taps. Measure on the rendered DOM, not on the design comp — padding and margins count toward hit area only when they belong to the same interactive element.
+- Avoid tap-to-reveal (hover-only tooltip) on touch surfaces. Use permanent labels or long-press with a visible affordance. Hover-only tooltips fail WCAG SC 1.4.13 Content on Hover or Focus on touch.
+- Apply `env(safe-area-inset-*)` on full-bleed surfaces so primary CTAs do not sit under the iOS home indicator, notch, or Android gesture nav. Native equivalents: `safeAreaInsets` on iOS, `WindowInsets` on Android.
+- Dynamic Type on iOS (`UIFont.preferredFont(forTextStyle:)` or SwiftUI `Font.body`) and rem-based font scaling on Android/web. Never use hard-coded `px` for text — fails text-resize verification at 200% zoom (WCAG SC 1.4.4) and 400% reflow (SC 1.4.10).
+- Drag-only gestures (range slider, kanban reorder, swipe-to-delete) require a non-drag alternative per WCAG SC 2.5.7 — keyboard arrows, click-to-place, explicit "Delete" button.
+- Pointer-event vs touch-event: prefer pointer events (`pointerdown`, `pointermove`, `pointerup`) so the same handler covers mouse, touch, pen. Synthesize click events on tap with a 300ms tolerance only on legacy mobile browsers — modern engines emit click immediately.
+- Pinch-zoom: never disable user zoom on the viewport meta tag. WCAG SC 1.4.4 fails on disabled zoom. Use `maximum-scale=5` to allow up to 5x zoom while suppressing zoom-on-focus jitter.
+- Orientation lock: respect WCAG SC 1.3.4 — content adapts to portrait and landscape unless an essential exception applies (e.g., piano keyboard, blueprint editor). Document the exception in the spec.
+## Heuristic Verification Gate
+Run every check below before declaring a feature done. A "done" claim without these checks is not done.
+- Nielsen 10 heuristics 5-minute walkthrough by a reviewer on the key user flow. Flag any heuristic violation (visibility of status, match to real world, user control, consistency, error prevention, recognition over recall, flexibility, minimalist design, error recovery, help/documentation); resolve before ship.
+- Keyboard-only run through the flow (no mouse, no touch). Every step reachable, every focus state visible (>=2px outline at >=3:1 contrast), no traps, no off-screen focus.
+- Screen-reader smoke test on the key route — one human pass per release using VoiceOver (macOS/iOS) or NVDA (Windows). Document the trace: which landmarks announced, which form labels heard, which dynamic updates registered.
+- First-time-user walkthrough for P0 features: measure task-completion time and observe error recovery. Note where the user paused, backtracked, or asked for help. Three observed users per P0 feature minimum.
+- State-coverage check: snapshot every async view in each of {loading, empty, error, partial, success} via Storybook play or component tests. Missing snapshots block the gate.
+- Microcopy lint: a CI string-scan rejects banned filler ("Oops", "Whoops", "Something went wrong") in any locale file; rejects `disabled` on submit buttons without an accompanying error-state announcement; flags strings without a corrective verb on error keys (`error.*`, `validation.*`).
+- Reduced-motion pass: toggle `prefers-reduced-motion: reduce` on the route and verify all non-essential transitions are removed; essential loaders simplify rather than disable.
+- Reflow pass: zoom the route to 200% and 400% in a 1280x1024 viewport — no horizontal scroll, no clipped content, no overlapping interactive elements.
+## References
+- WCAG 2.2 AA — new success criteria SC 2.5.8 (Target Size), SC 2.4.11 (Focus Not Obscured), SC 2.5.7 (Dragging Movements), SC 1.4.13 (Content on Hover or Focus)
+- NN/g state-omission research 2025 — empty-state and error-state omission rates in AI-generated UIs (92% empty omitted, 78% error omitted)
+- GOV.UK Design System — error message and error summary components, plain English readability guidance
+- IBM Carbon Design System — voice and tone content guidelines, error message patterns
+- Baymard Institute — inline form validation research and disabled-submit findings (top-10 form-abandonment driver)
+- Google Core Web Vitals 2026 thresholds — LCP, INP, CLS at p75 real-user data
+- Nielsen Norman Group — 10 usability heuristics for user interface design
+- Mailchimp Content Style Guide — voice and tone baseline, clarity over cleverness
+- Apple Human Interface Guidelines — Dynamic Type, safe area, touch target sizing
+- Material 3 Design System — touch target sizing (48dp), state layers, ripple feedback
+- W3C WAI-ARIA Authoring Practices — accessible widget patterns, live regions, focus management
+- FIDO Alliance / Passkey Central — WebAuthn Conditional UI design guidelines for passwordless auth
+- Vercel AI Elements — streaming message component patterns for AI chat surfaces
+- Anthropic engineering blog — agent UI patterns, human-in-the-loop approval, tool-call transparency
+- OpenAI Apps SDK UI guidelines — citation rendering, tool-result presentation

package/skills/hatch3r-a11y-audit/SKILL.md CHANGED Viewed

@@ -12,6 +12,7 @@ cache_friendly: true
 ```
 Task Progress:
+- [ ] Step 0: Detect ambiguity (P8 B1)
 - [ ] Step 1: Read accessibility requirements from rules and specs
 - [ ] Step 2: Automated scan — run axe-core or similar on all pages/components
 - [ ] Step 3: Manual audit — load references/manual-audit-checklist.md
@@ -20,6 +21,19 @@ Task Progress:
 - [ ] Step 6: Verify fixes with re-scan and manual check
 ```
+## Step 0 — Detect Ambiguity (P8 B1)
+Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md`. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: WCAG conformance target (A vs AA vs AAA), scope (single component vs full app), surfaces to audit (which routes), fix authority (audit-only vs audit-and-fix), and screen-reader pass scope (per release vs per audit).
+## Fan-out Discipline (P8 B2)
+This skill delegates per task size:
+- Tier 1 (trivial single-file): inline execution acceptable.
+- Tier 2 (multi-file or multi-concern): spawn parallel sub-agents per concern via the Task tool.
+- Tier 3 (multi-module / high-risk): one fresh sub-agent per independent module or gate; orchestrator integrates only.
+Never under-fan-out to save tokens. Token cost is dominated by quality and completeness gains. Emit `sub_agents_spawned: { count, rationale }` in your output.
 ## Progressive Disclosure
 - **Main skill file (this):** Workflow steps, automated scan, fix process, DoD.

package/skills/hatch3r-ai-feature/SKILL.md ADDED Viewed

@@ -0,0 +1,134 @@
+---
+id: hatch3r-ai-feature
+type: skill
+description: Eval-driven development workflow for shipping AI features — write eval before prompt, measure, iterate, ship with caching + cost telemetry + model fallback + hallucination SLI
+tags: [implementation, ai]
+quality_charter: agents/shared/quality-charter.md
+---
+# AI Feature Workflow (Eval-Driven)
+## Quick Start
+Run this skill before shipping any LLM-driven feature. It defines the canonical eval-driven loop (write eval, write prompt, measure, iterate) and the production-readiness gates. Skipping any of the 9 steps = the feature is not done.
+This skill is the implementation counterpart to `rules/hatch3r-ai-evals.md` (backend governance) and `rules/hatch3r-ai-ux-patterns.md` (UI governance). The rules define the bar; this skill defines the route to clearing the bar.
+## Step 0 — Detect Ambiguity (P8 B1)
+Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md`. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: task class (classification vs open-ended vs RAG vs agentic), model pin (Sonnet vs Opus vs Haiku), eval threshold values, budget per request (cost cap), and fallback policy (graceful degrade vs hard fail).
+## Fan-out Discipline (P8 B2)
+This skill delegates per task size:
+- Tier 1 (trivial single-file): inline execution acceptable.
+- Tier 2 (multi-file or multi-concern): spawn parallel sub-agents per concern via the Task tool.
+- Tier 3 (multi-module / high-risk): one fresh sub-agent per independent module or gate; orchestrator integrates only.
+Never under-fan-out to save tokens. Token cost is dominated by quality and completeness gains. Emit `sub_agents_spawned: { count, rationale }` in your output.
+## Step 1: Define the task and success criteria
+- Write down what "right" looks like in one paragraph — the user input class, the expected output shape, the failure modes you want to catch.
+- Hand-author 20+ golden examples in `evals/<feature>/golden.jsonl` with `input` + `expected_output` (or a graded rubric when the task is open-ended).
+- Save the threshold per metric in `evals/<feature>/thresholds.json`. Without an explicit threshold, "passing the eval" is undefined.
+- Cross-reference `rules/hatch3r-ai-evals.md` Golden Dataset Versioning for filename and refresh policy.
+- Source diversity matters more than count beyond 20 — include adversarial inputs, edge cases from prior incidents, and at least 3 examples per known input class.
+- Label every example with the input class so per-class accuracy is computable in Step 4.
+## Step 2: Pick eval tool and metric
+Match the task class to the tool:
+- Classification → promptfoo with exact-match assertions.
+- Open-ended generation → DeepEval or braintrust with LLM-as-judge + a 50-example human-labeled calibration set.
+- Retrieval/RAG → RAGAS (context_precision, context_recall, faithfulness, answer_relevance).
+- Tool-use / agentic → Inspect or BFCL-style harness.
+- Safety/red-team → Garak or PyRIT scheduled weekly.
+Pin the choice in `evals/README.md` so the next agent run picks the same tool.
+## Step 3: Write the prompt
+- Author the prompt at `prompts/<feature>/v1.md` with frontmatter `{ id, version: 1, model_pinned, eval_set }`.
+- Commit; record SHA-256 hash in `evals/<feature>/thresholds.json`.
+- If the system prompt + tool definitions + RAG context exceed 1024 tokens, apply Anthropic `cache_control` breakpoints (or rely on OpenAI's automatic prefix cache for ≥1024-token deterministic prefixes). Longest-TTL block first.
+## Step 4: Run eval; iterate prompt
+- Run `npx promptfoo eval` (or the chosen tool's CLI) against the golden set.
+- Read the per-metric report. If below threshold, modify the prompt, bump to `v2.md`, re-hash, re-run.
+- Treat each prompt revision like a code commit — small, named, testable.
+- Stop iterating when every metric clears its threshold in `thresholds.json` and the pairwise win-rate vs the prior version is >=55%.
+- Capture the eval report artifact in CI so the PR reviewer can read per-case pass/fail without re-running the suite locally.
+- If iteration count exceeds 10 versions without convergence, escalate — the task may need decomposition (one sub-prompt per input class) or a retrieval-grounded approach.
+## Step 5: Wire production telemetry
+- Per-request log line emits `model`, `tokens_in`, `tokens_out`, `cache_hit`, `cached_tokens`, `cost_usd`, `latency_ms`, `prompt_version`, `prompt_hash`, `cost_center`.
+- Per-request OpenTelemetry span follows the OTel GenAI semantic conventions (`gen_ai.*` attributes).
+- Aggregate dashboards: cost-per-request, hallucination_rate, citation_precision, refusal_rate, cache_hit_ratio.
+- Cross-reference `skills/hatch3r-observability-verify` for the per-feature dashboard checklist.
+## Step 6: Wire fallback chain
+- Primary model (e.g. Sonnet 4.7) → secondary (cheaper/faster, e.g. Haiku 4.5) → static fallback (cached or canned).
+- Wrap in circuit-breaker + retry-with-decorrelated-jitter — cross-reference `rules/hatch3r-resilience-patterns.md` (Slice 8) for the primitives.
+- Run the eval suite against the secondary path too — a silent quality cliff between primary and secondary is a regression.
+- Static fallback text names the failure mode in user-readable language ("AI is briefly unavailable — retry in a minute") rather than dumping a stack trace into the UI.
+## Step 7: Add CI gate
+- Eval runs on every PR that touches `**/prompts/**`, `**/rag/**`, `**/ai/**`, `**/llm/**`.
+- PR blocks when any metric drops below the threshold in `evals/<feature>/thresholds.json`.
+- Model-version upgrade (Sonnet to Opus, 4.6 to 4.7) triggers a full eval with a 5% accuracy budget; cross over 5% requires a named-reviewer sign-off + 24-hour canary at 5% traffic.
+## Step 8: Production verification
+First 24 hours after deploy, monitor:
+- `ai.hallucination_rate` — SLO <5% on golden set; alert if 7-day rolling rate >5%.
+- `ai.refusal_rate` — track false-positive refusal rate separately.
+- `ai.cost_per_request_usd` — p50/p95/p99 vs feature budget; alert at 50%/75%/90% of monthly budget.
+- `ai.latency_ms` — first-token-latency p95 + total-response-latency p99.
+- `ai.cache_hit_ratio` — should match the dev-environment baseline within 10%; a drop indicates prefix drift.
+- `ai.tokens_per_request` — p95 should be within 20% of the eval-time distribution; a spike signals retrieval growth or prompt drift.
+Cross-reference `skills/hatch3r-observability-verify`.
+## Step 9: Feedback loop
+- Wire user thumbs-down to a feedback queue per response.
+- Monthly triage job promotes thumbs-down examples into regression fixtures in `evals/<feature>/edge.jsonl`.
+- Promotion is a manual review step — raw user feedback contains noise and adversarial labels.
+- Capture an optional free-text comment with each thumbs-down; the comment is the highest-signal feature for triage clustering.
+- Track feedback volume per response surface — a sudden spike in thumbs-down rate signals an upstream prompt or retrieval regression and gates a rollback.
+## Verdict
+All 9 steps complete = the AI feature is "done". Anything less = not done. The orchestrator running this skill emits a single-line verdict per step (`STEP_N: PASS|FAIL <evidence-path>`) and aggregates them. One FAIL on any step blocks release.
+Evidence paths point at concrete artifacts: the golden set (`evals/<feature>/golden.jsonl`), the prompt version (`prompts/<feature>/v<N>.md`), the eval report (`evals/<feature>/report-<run-id>.json`), and the dashboard URL for production SLI verification. Verdicts without evidence paths are not accepted by the gate.
+## When this skill runs
+- After `hatch3r-implementer` finishes the surrounding non-AI feature code, before `hatch3r-qa-validation`.
+- On every PR that introduces a new LLM call or modifies an existing prompt, model, or retrieval pipeline.
+- Step 8 (production verification) executes against the post-deploy environment, not the PR branch.
+## Cross-References
+- `rules/hatch3r-ai-evals.md` — backend governance (eval, cost, caching, fallback, SLI).
+- `rules/hatch3r-ai-ux-patterns.md` — frontend UX patterns (streaming, tool-call cards, citations).
+- `skills/hatch3r-ui-ux-verify/SKILL.md` — UI verification gate for AI surfaces.
+- `skills/hatch3r-observability-verify` — observability wiring checklist.
+- `rules/hatch3r-resilience-patterns.md` (Slice 8) — circuit-breaker + retry primitives reused in the fallback chain.
+## References
+- promptfoo — `promptfoo.dev`
+- DeepEval — `github.com/confident-ai/deepeval`
+- RAGAS — `docs.ragas.io`
+- Inspect (UK AISI) — `github.com/UKGovernmentBEIS/inspect_ai`
+- Anthropic prompt caching guide — `docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
+- OpenTelemetry GenAI semantic conventions — `opentelemetry.io/docs/specs/semconv/gen-ai/`
+- Berkeley Function Calling Leaderboard (BFCL v4) — `gorilla.cs.berkeley.edu/leaderboard.html`

package/skills/hatch3r-api-spec/SKILL.md CHANGED Viewed

@@ -14,6 +14,7 @@ cache_friendly: true
 ```
 Task Progress:
+- [ ] Step 0: Detect ambiguity (P8 B1)
 - [ ] Step 1: Inventory existing endpoints
 - [ ] Step 2: Generate OpenAPI spec
 - [ ] Step 3: Validate schemas
@@ -21,6 +22,10 @@ Task Progress:
 - [ ] Step 5: Verify spec accuracy
 ```
+## Step 0 — Detect Ambiguity (P8 B1)
+Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md`. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: OpenAPI version (3.0 vs 3.1), spec output path, auth scheme (Bearer vs OAuth2 vs API key), breaking-change policy (block vs version vs document), and target consumers (SDK clients vs human docs vs both).
 ## Step 1: Inventory Existing Endpoints
 - Scan route definitions across the codebase (controllers, handlers, route files).

package/skills/hatch3r-architecture-review/SKILL.md CHANGED Viewed

@@ -12,6 +12,7 @@ cache_friendly: true
 ```
 Task Progress:
+- [ ] Step 0: Detect ambiguity (P8 B1)
 - [ ] Step 1: Read existing ADRs and the template
 - [ ] Step 2: Define the decision context — problem, constraints, options
 - [ ] Step 3: Evaluate options — pros/cons, prototype if needed, check ADR constraints
@@ -20,6 +21,19 @@ Task Progress:
 - [ ] Step 6: Update affected specs or docs to reference the new ADR
 ```
+## Step 0 — Detect Ambiguity (P8 B1)
+Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md`. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: problem framing (what decision needs to be made), constraint set (mandatory vs preferred), evaluation horizon (short-term vs long-term cost), supersedes which prior ADR, and ADR status target (PROPOSED for discussion vs ACCEPTED for binding decision).
+## Fan-out Discipline (P8 B2)
+This skill delegates per task size:
+- Tier 1 (trivial single-file): inline execution acceptable.
+- Tier 2 (multi-file or multi-concern): spawn parallel sub-agents per concern via the Task tool.
+- Tier 3 (multi-module / high-risk): one fresh sub-agent per independent module or gate; orchestrator integrates only.
+Never under-fan-out to save tokens. Token cost is dominated by quality and completeness gains. Emit `sub_agents_spawned: { count, rationale }` in your output.
 ## Step 1: Read Existing ADRs and Template
 - Read all ADRs in project docs to understand current architecture and constraints.

package/skills/hatch3r-bug-fix/SKILL.md CHANGED Viewed

@@ -14,6 +14,7 @@ cache_friendly: true
 ```
 Task Progress:
+- [ ] Step 0: Detect ambiguity (P8 B1)
 - [ ] Step 1: Read the issue and relevant specs
 - [ ] Step 2: Produce a diagnosis plan
 - [ ] Step 2b: Browser reproduction (if UI bug)
@@ -25,6 +26,10 @@ Task Progress:
 - [ ] Step 6: Open PR
 ```
+## Step 0 — Detect Ambiguity (P8 B1)
+Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md`. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: reproduction steps incomplete, expected vs actual behavior unstated, severity unclear (P0/P1 vs P2/P3), affected environment unknown (staging vs prod), or fix may require schema/API change with downstream consumers.
 ## Step 1: Read Inputs
 - Parse the issue body: problem description, reproduction steps, expected/actual behavior, severity, affected area.

package/skills/hatch3r-ci-pipeline/SKILL.md CHANGED Viewed

@@ -14,6 +14,7 @@ cache_friendly: true
 ```
 Task Progress:
+- [ ] Step 0: Detect ambiguity (P8 B1)
 - [ ] Step 1: Audit existing pipeline
 - [ ] Step 2: Design stage structure
 - [ ] Step 3: Optimize test parallelization
@@ -21,6 +22,19 @@ Task Progress:
 - [ ] Step 5: Implement and validate
 ```
+## Step 0 — Detect Ambiguity (P8 B1)
+Before any work, scan the invocation for unresolved questions in scope, intent, acceptance criteria, target environment, or irreversibility. If any are found, ask the user via the platform-native question tool per `agents/shared/user-question-protocol.md`. Do not proceed under silent assumption. Default path, not an exception. Triggers for THIS skill: CI platform (GitHub Actions vs GitLab vs CircleCI vs Azure Pipelines), pipeline duration target, runner sizing budget, deploy gate (auto vs manual approval for prod), and artifact retention policy.
+## Fan-out Discipline (P8 B2)
+This skill delegates per task size:
+- Tier 1 (trivial single-file): inline execution acceptable.
+- Tier 2 (multi-file or multi-concern): spawn parallel sub-agents per concern via the Task tool.
+- Tier 3 (multi-module / high-risk): one fresh sub-agent per independent module or gate; orchestrator integrates only.
+Never under-fan-out to save tokens. Token cost is dominated by quality and completeness gains. Emit `sub_agents_spawned: { count, rationale }` in your output.
 ## Step 1: Audit Existing Pipeline
 - Map the current pipeline stages, their dependencies, and execution times.