hatch3r 1.7.0 → 1.7.5
- package/README.md +38 -12
- package/agents/hatch3r-a11y-auditor.md +4 -0
- package/agents/hatch3r-architect.md +5 -1
- package/agents/hatch3r-ci-watcher.md +4 -0
- package/agents/hatch3r-context-rules.md +4 -0
- package/agents/hatch3r-creator.md +4 -0
- package/agents/hatch3r-dependency-auditor.md +4 -0
- package/agents/hatch3r-devops.md +4 -0
- package/agents/hatch3r-docs-writer.md +4 -0
- package/agents/hatch3r-fixer.md +5 -1
- package/agents/hatch3r-handoff-loader.md +243 -0
- package/agents/hatch3r-handoff-preparer.md +134 -0
- package/agents/hatch3r-implementer.md +5 -1
- package/agents/hatch3r-learnings-loader.md +4 -0
- package/agents/hatch3r-lint-fixer.md +4 -0
- package/agents/hatch3r-perf-profiler.md +8 -0
- package/agents/hatch3r-researcher.md +5 -1
- package/agents/hatch3r-reviewer.md +92 -0
- package/agents/hatch3r-security-auditor.md +24 -0
- package/agents/hatch3r-test-writer.md +4 -0
- package/agents/modes/requirements-elicitation.md +5 -1
- package/agents/modes/similar-implementation.md +6 -0
- package/agents/modes/user-flows.md +76 -0
- package/agents/shared/quality-charter.md +129 -0
- package/agents/shared/user-question-protocol.md +95 -0
- package/commands/board/shared-azure-devops.md +2 -0
- package/commands/board/shared-github.md +17 -0
- package/commands/board/shared-gitlab.md +4 -0
- package/commands/hatch3r-board-fill.md +2 -1
- package/commands/hatch3r-board-pickup.md +1 -1
- package/commands/hatch3r-board-shared.md +21 -0
- package/commands/hatch3r-create.md +2 -0
- package/commands/hatch3r-handoff.md +126 -0
- package/commands/hatch3r-pr-resolve.md +672 -0
- package/commands/hatch3r-quick-change.md +5 -3
- package/commands/hatch3r-report.md +167 -0
- package/commands/hatch3r-revision.md +1 -1
- package/commands/hatch3r-workflow.md +3 -1
- package/dist/cli/index.js +3144 -979
- package/dist/cli/index.js.map +1 -1
- package/package.json +4 -2
- package/rules/hatch3r-accessibility-standards.md +21 -0
- package/rules/hatch3r-accessibility-standards.mdc +21 -0
- package/rules/hatch3r-agent-orchestration.md +32 -1
- package/rules/hatch3r-agent-orchestration.mdc +32 -1
- package/rules/hatch3r-ai-evals.md +158 -0
- package/rules/hatch3r-ai-evals.mdc +154 -0
- package/rules/hatch3r-ai-ux-patterns.md +131 -0
- package/rules/hatch3r-ai-ux-patterns.mdc +127 -0
- package/rules/hatch3r-api-design.md +67 -9
- package/rules/hatch3r-api-design.mdc +67 -9
- package/rules/hatch3r-api-versioning.md +119 -0
- package/rules/hatch3r-api-versioning.mdc +115 -0
- package/rules/hatch3r-auth-patterns.md +170 -0
- package/rules/hatch3r-auth-patterns.mdc +166 -0
- package/rules/hatch3r-component-conventions.md +30 -0
- package/rules/hatch3r-component-conventions.mdc +30 -0
- package/rules/hatch3r-container-hardening.md +131 -0
- package/rules/hatch3r-container-hardening.mdc +127 -0
- package/rules/hatch3r-contract-testing.md +117 -0
- package/rules/hatch3r-contract-testing.mdc +113 -0
- package/rules/hatch3r-deep-context.md +3 -1
- package/rules/hatch3r-deep-context.mdc +3 -1
- package/rules/hatch3r-dependency-management.md +73 -1
- package/rules/hatch3r-dependency-management.mdc +72 -0
- package/rules/hatch3r-design-system-detection.md +142 -0
- package/rules/hatch3r-design-system-detection.mdc +138 -0
- package/rules/hatch3r-event-schema-evolution.md +90 -0
- package/rules/hatch3r-event-schema-evolution.mdc +86 -0
- package/rules/hatch3r-handoff-readiness.md +45 -0
- package/rules/hatch3r-handoff-readiness.mdc +40 -0
- package/rules/hatch3r-i18n.md +13 -0
- package/rules/hatch3r-i18n.mdc +13 -0
- package/rules/hatch3r-iteration-summary.md +2 -0
- package/rules/hatch3r-iteration-summary.mdc +2 -0
- package/rules/hatch3r-migrations.md +61 -16
- package/rules/hatch3r-migrations.mdc +61 -16
- package/rules/hatch3r-observability-logging.md +1 -1
- package/rules/hatch3r-observability-logging.mdc +1 -1
- package/rules/hatch3r-observability-metrics.md +1 -1
- package/rules/hatch3r-observability-metrics.mdc +1 -1
- package/rules/hatch3r-observability-tracing-detail.md +1 -1
- package/rules/hatch3r-observability-tracing-detail.mdc +1 -1
- package/rules/hatch3r-observability-tracing.md +1 -1
- package/rules/hatch3r-observability-tracing.mdc +1 -1
- package/rules/hatch3r-observability.md +1 -0
- package/rules/hatch3r-observability.mdc +1 -0
- package/rules/hatch3r-operability.md +149 -0
- package/rules/hatch3r-operability.mdc +145 -0
- package/rules/hatch3r-passkey-server.md +181 -0
- package/rules/hatch3r-passkey-server.mdc +177 -0
- package/rules/hatch3r-progressive-delivery.md +120 -0
- package/rules/hatch3r-progressive-delivery.mdc +116 -0
- package/rules/hatch3r-resilience-patterns.md +154 -0
- package/rules/hatch3r-resilience-patterns.mdc +150 -0
- package/rules/hatch3r-secrets-management.md +29 -0
- package/rules/hatch3r-secrets-management.mdc +29 -0
- package/rules/hatch3r-testing.md +139 -43
- package/rules/hatch3r-testing.mdc +139 -43
- package/rules/hatch3r-ux-states-and-flows.md +149 -0
- package/rules/hatch3r-ux-states-and-flows.mdc +145 -0
- package/skills/hatch3r-a11y-audit/SKILL.md +14 -0
- package/skills/hatch3r-ai-feature/SKILL.md +134 -0
- package/skills/hatch3r-api-spec/SKILL.md +5 -0
- package/skills/hatch3r-architecture-review/SKILL.md +14 -0
- package/skills/hatch3r-bug-fix/SKILL.md +5 -0
- package/skills/hatch3r-ci-pipeline/SKILL.md +14 -0
- package/skills/hatch3r-cli-aichat/SKILL.md +84 -0
- package/skills/hatch3r-cli-ast-grep/SKILL.md +85 -0
- package/skills/hatch3r-cli-az-devops/SKILL.md +89 -0
- package/skills/hatch3r-cli-bat/SKILL.md +85 -0
- package/skills/hatch3r-cli-comby/SKILL.md +85 -0
- package/skills/hatch3r-cli-csvkit/SKILL.md +84 -0
- package/skills/hatch3r-cli-delta/SKILL.md +86 -0
- package/skills/hatch3r-cli-difftastic/SKILL.md +84 -0
- package/skills/hatch3r-cli-docker/SKILL.md +89 -0
- package/skills/hatch3r-cli-duckdb/SKILL.md +84 -0
- package/skills/hatch3r-cli-fd/SKILL.md +85 -0
- package/skills/hatch3r-cli-fzf/SKILL.md +84 -0
- package/skills/hatch3r-cli-gh/SKILL.md +90 -0
- package/skills/hatch3r-cli-glab/SKILL.md +89 -0
- package/skills/hatch3r-cli-jq/SKILL.md +85 -0
- package/skills/hatch3r-cli-lazygit/SKILL.md +78 -0
- package/skills/hatch3r-cli-llm/SKILL.md +84 -0
- package/skills/hatch3r-cli-miller/SKILL.md +84 -0
- package/skills/hatch3r-cli-mods/SKILL.md +84 -0
- package/skills/hatch3r-cli-overview/SKILL.md +60 -0
- package/skills/hatch3r-cli-playwright/SKILL.md +89 -0
- package/skills/hatch3r-cli-podman/SKILL.md +84 -0
- package/skills/hatch3r-cli-ripgrep/SKILL.md +85 -0
- package/skills/hatch3r-cli-rtk/SKILL.md +91 -0
- package/skills/hatch3r-cli-sd/SKILL.md +85 -0
- package/skills/hatch3r-cli-stagehand/SKILL.md +79 -0
- package/skills/hatch3r-cli-taplo/SKILL.md +84 -0
- package/skills/hatch3r-cli-xsv/SKILL.md +89 -0
- package/skills/hatch3r-cli-yq/SKILL.md +85 -0
- package/skills/hatch3r-cli-zstd/SKILL.md +85 -0
- package/skills/hatch3r-context-health/SKILL.md +14 -0
- package/skills/hatch3r-cost-tracking/SKILL.md +14 -0
- package/skills/hatch3r-customize/SKILL.md +14 -0
- package/skills/hatch3r-dep-audit/SKILL.md +14 -0
- package/skills/hatch3r-design-system-detect/SKILL.md +162 -0
- package/skills/hatch3r-feature/SKILL.md +2 -0
- package/skills/hatch3r-gh-agentic-workflows/SKILL.md +13 -0
- package/skills/hatch3r-handoff-prepare/SKILL.md +160 -0
- package/skills/hatch3r-handoff-resume/SKILL.md +171 -0
- package/skills/hatch3r-incident-response/SKILL.md +14 -0
- package/skills/hatch3r-issue-workflow/SKILL.md +5 -0
- package/skills/hatch3r-logical-refactor/SKILL.md +14 -0
- package/skills/hatch3r-migration/SKILL.md +14 -0
- package/skills/hatch3r-observability-verify/SKILL.md +133 -0
- package/skills/hatch3r-perf-audit/SKILL.md +14 -0
- package/skills/hatch3r-pr-creation/SKILL.md +14 -0
- package/skills/hatch3r-qa-validation/SKILL.md +18 -0
- package/skills/hatch3r-recipe/SKILL.md +14 -0
- package/skills/hatch3r-refactor/SKILL.md +14 -0
- package/skills/hatch3r-release/SKILL.md +14 -0
- package/skills/hatch3r-reliability-verify/SKILL.md +144 -0
- package/skills/hatch3r-ui-ux-verify/SKILL.md +136 -0
- package/skills/hatch3r-visual-refactor/SKILL.md +15 -1
|
@@ -0,0 +1,145 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Operability patterns in user code — liveness / readiness / startup probes, graceful shutdown, feature flags, runbook URL annotations, health endpoints
|
|
3
|
+
globs: ["**/services/**", "**/handlers/**", "**/health*", "**/probes/**", "**/k8s/**", "**/manifests/**", "**/charts/**", "**/feature*", "**/flags/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# Operability
|
|
7
|
+
|
|
8
|
+
## Liveness, Readiness, and Startup Probes (Kubernetes)
|
|
9
|
+
|
|
10
|
+
Three probe kinds; each answers a different question and triggers a different action. Conflating them is the most common operability bug — a liveness probe that reaches downstream causes pod-restart cascades on dependency outages.
|
|
11
|
+
|
|
12
|
+
- **Liveness** — shallow self-check: process alive, event loop responsive, no deadlock on the request handler. NEVER checks external dependencies. Failure → kubelet restarts the container. Default: `httpGet /health/live`, `periodSeconds: 10`, `failureThreshold: 3`, `timeoutSeconds: 1`.
|
|
13
|
+
- **Readiness** — deep dependency check: DB reachable, cache reachable, required downstream healthy, migrations applied. Failure → endpoint controller removes the pod from Service load-balancer rotation; pod stays running and recovers when dependencies return. Default: `httpGet /health/ready`, `periodSeconds: 5`, `failureThreshold: 2`, `timeoutSeconds: 2`.
|
|
14
|
+
- **Startup** — same checks as readiness but with a longer initial allowance for slow-starting apps (large model load, warm cache build, JIT warmup, schema migration). Liveness and readiness probes are held off until the startup probe succeeds once, then take over. Default: `httpGet /health/startup`, `periodSeconds: 5`, `failureThreshold: 60` (5s × 60 = 5 minutes), `timeoutSeconds: 2`.
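
The defaults above imply how long each probe tolerates consecutive failures before the kubelet acts. A minimal sketch of that arithmetic follows; the `Probe` class and its method name are hypothetical helpers for illustration, not part of any Kubernetes client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical container for the probe defaults listed above."""
    path: str
    period_seconds: int
    failure_threshold: int
    timeout_seconds: int

    def failure_window_seconds(self) -> int:
        # Approximate time of consecutive failures before the kubelet acts.
        return self.period_seconds * self.failure_threshold


# Values copied from the defaults in the list above.
LIVENESS = Probe("/health/live", 10, 3, 1)
READINESS = Probe("/health/ready", 5, 2, 2)
STARTUP = Probe("/health/startup", 5, 60, 2)

for probe in (LIVENESS, READINESS, STARTUP):
    print(probe.path, probe.failure_window_seconds(), "seconds")
```

With these numbers a hung process is restarted after roughly 30s, an unready pod leaves rotation after roughly 10s, and a slow starter gets 300s before it is killed.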
|
|
15
|
+
|
|
16
|
+
Concrete examples per ecosystem:
|
|
17
|
+
|
|
18
|
+
- **Node Express** — `/health/live`, `/health/ready`, `/health/startup` on the same router with different handler chains. The live handler returns 200 unconditionally if the event loop is responsive; the ready handler awaits `Promise.allSettled` over DB ping, cache ping, downstream ping with per-check 1s timeout.
|
|
19
|
+
- **Spring Boot Actuator** — `/actuator/health/liveness` and `/actuator/health/readiness` exposed out of the box; add startup via a custom `HealthIndicator`. Configure `management.endpoint.health.probes.enabled=true`.
|
|
20
|
+
- **Go (net/http)** — three handlers on `http.ServeMux`; ready handler aggregates checks via `errgroup.Group.Wait()` with per-check `context.WithTimeout`.
|
|
21
|
+
|
|
22
|
+
Anti-pattern: a single `/health` endpoint that both Kubernetes probes hit. The pod is killed during DB outage because the liveness probe failed on the same deep check the readiness probe is supposed to own.
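
To make the split concrete, here is a framework-agnostic sketch of separate live and ready handlers. The function names, the `(status, body)` return shape, and the check callables are assumptions for illustration; a real service would wire these into its HTTP router.

```python
import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple

Check = Callable[[], None]  # a check raises on failure and returns None on success


def live() -> Tuple[int, str]:
    """Shallow self-check: never touches external dependencies."""
    return 200, json.dumps({"status": "ok"})


def ready(checks: Dict[str, Check], timeout_s: float = 1.0) -> Tuple[int, str]:
    """Deep check: run all dependency checks concurrently, each with a timeout."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    failures = []
    try:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout_s)
            except FutureTimeout:
                failures.append(f"{name} timed out")
            except Exception as exc:
                failures.append(f"{name} unreachable: {exc}")
    finally:
        # Do not block the probe response on a check that is still hanging.
        pool.shutdown(wait=False)
    if failures:
        return 503, json.dumps({"status": "down", "reason": "; ".join(failures)})
    return 200, json.dumps({"status": "ok"})
```

Because `live()` takes no dependency checks at all, a database outage can only ever flip readiness, never liveness.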
|
|
23
|
+
|
|
24
|
+
## Graceful Shutdown
|
|
25
|
+
|
|
26
|
+
Handle `SIGTERM` per the sequence below. Skipping the preStop delay causes the well-known endpoint-propagation race that drops up to several seconds of in-flight traffic.
|
|
27
|
+
|
|
28
|
+
- Step 1: Stop accepting new connections (close the HTTP listener; stop pulling from the queue).
|
|
29
|
+
- Step 2: Mark `/health/ready` to return 503 (the endpoint controller removes the pod from the Service).
|
|
30
|
+
- Step 3: Allow a `preStop` window of 1–3s for endpoint-removal propagation. Kubernetes does NOT wait for endpoint removal to propagate before delivering SIGTERM (the two proceed in parallel), so without an explicit `preStop` `sleep`, traffic continues to arrive for the first 1–3s of shutdown.
|
|
31
|
+
- Step 4: Drain in-flight requests (configurable deadline 30–45s; cap by `terminationGracePeriodSeconds`).
|
|
32
|
+
- Step 5: Close DB connections, drain queue consumers, flush log buffers, close OpenTelemetry exporters.
|
|
33
|
+
- Step 6: Exit 0.
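
The sequence above can be sketched as an ordered list of named steps run from a `SIGTERM` handler. The `run_shutdown` helper is hypothetical, and returning a non-zero code when a step fails is a design choice of this sketch rather than something the rule prescribes.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_shutdown(steps: List[Step], log: Callable[[str], None] = print) -> int:
    """Run shutdown steps in order and return the process exit code.

    A failing step is logged but does not prevent later cleanup steps.
    """
    exit_code = 0
    for name, action in steps:
        try:
            action()
            log(f"shutdown: {name} done")
        except Exception as exc:
            log(f"shutdown: {name} failed: {exc}")
            exit_code = 1
    return exit_code


# Typical wiring (the step callables are application-specific assumptions):
#   signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(run_shutdown(steps)))
```

Keeping the steps as data makes the order reviewable and lets a unit test assert that the listener closes before the drain starts.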
|
|
34
|
+
|
|
35
|
+
Pod manifest baseline:
|
|
36
|
+
|
|
37
|
+
```
terminationGracePeriodSeconds: 45
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 3"]
```
|
|
44
|
+
|
|
45
|
+
A force kill (SIGKILL) when the grace period expires defeats the drain. If 45s is insufficient, raise `terminationGracePeriodSeconds` rather than skipping the drain, but investigate why the service holds long-running requests at shutdown.
|
|
46
|
+
|
|
47
|
+
For queue consumers (Kafka, NATS, SQS): on SIGTERM stop pulling new messages first, finish in-flight processing, then commit offsets and disconnect. Skipping the commit step on shutdown produces duplicate processing on the next pod's first poll.
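
A minimal sketch of that consumer ordering, using an in-memory stand-in for the broker. `InMemoryQueue` and `consume` are hypothetical names; real Kafka, NATS, or SQS clients expose different APIs.

```python
import threading
from collections import deque
from typing import Callable, Iterable, Optional


class InMemoryQueue:
    """Hypothetical stand-in for a broker client (not a real Kafka/NATS/SQS API)."""

    def __init__(self, messages: Iterable[str]) -> None:
        self._messages = deque(messages)
        self.committed = 0

    def poll(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None

    def commit(self, offset: int) -> None:
        self.committed = offset

    def remaining(self) -> int:
        return len(self._messages)


def consume(queue: InMemoryQueue, handle: Callable[[str], None], stop: threading.Event) -> int:
    """Process messages until `stop` is set, then commit before returning."""
    processed = 0
    while not stop.is_set():          # checked before every pull: no new work after SIGTERM
        message = queue.poll()
        if message is None:
            break
        handle(message)               # in-flight work always runs to completion
        processed += 1
    queue.commit(processed)           # commit before disconnecting to avoid duplicates
    return processed
```

A `SIGTERM` handler would only call `stop.set()`; the loop itself owns the finish-then-commit ordering.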
|
|
48
|
+
|
|
49
|
+
## Feature Flags — OpenFeature
|
|
50
|
+
|
|
51
|
+
OpenFeature is a CNCF Incubating project (pre-1.0 spec) whose SDKs wrap any provider (LaunchDarkly, Unleash, Flagsmith, GrowthBook, Statsig) behind a consistent API. New code targets OpenFeature; provider-specific SDK imports become a finding.
|
|
52
|
+
|
|
53
|
+
Flag types by lifecycle:
|
|
54
|
+
|
|
55
|
+
- **Release flag** — gate new code path during rollout. Remove within 1 sprint of full enablement; otherwise it becomes flag debt.
|
|
56
|
+
- **Experiment flag** — A/B test traffic split for measurement. Retire after the experiment concludes and the decision is shipped.
|
|
57
|
+
- **Ops flag** — kill switch for risky features and feature-level circuit breakers. Permanent by design. Cross-reference the kill-switch pattern below.
|
|
58
|
+
- **Permission flag** — entitlement gating (per-tenant feature availability). Prefer reusing the authorization layer where possible; flag only the surface that the auth system does not naturally cover.
|
|
59
|
+
|
|
60
|
+
Flag-debt budget: 50–100 active flags per service. Run a monthly cleanup cadence — list flags with `enabled = true for 100% of traffic for >30 days` and either remove them or convert to permanent config.
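
The monthly cleanup query can be sketched as a filter over flag metadata. The `Flag` record and its field names are assumptions; the actual data would come from the flag provider's export or admin API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Flag:
    """Hypothetical flag metadata record."""
    name: str
    rollout_percent: int
    fully_enabled_since: Optional[date]
    kind: str = "release"  # "release", "experiment", "ops", or "permission"


def stale_flags(flags: List[Flag], today: date, max_age_days: int = 30) -> List[str]:
    """Names of flags at 100% for longer than max_age_days.

    Ops flags are skipped because kill switches are permanent by design.
    """
    stale = []
    for flag in flags:
        if flag.kind == "ops" or flag.rollout_percent != 100:
            continue
        if flag.fully_enabled_since is None:
            continue
        if (today - flag.fully_enabled_since).days > max_age_days:
            stale.append(flag.name)
    return stale
```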
|
|
61
|
+
|
|
62
|
+
Default-off for new release flags; default-on only for kill switches, so that losing provider connectivity does not silently turn off a feature that only the on-call should disable. Evaluate flags at request entry once and pass the resolved value down; re-evaluating mid-request risks split-brain behavior when the flag flips during the call.
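
One way to evaluate once per request is to resolve every flag into a read-only snapshot at the entry point. This is a sketch only: `resolve_flags` and the `evaluate(name, default)` callable are hypothetical and stand in for whatever boolean-evaluation call the flag SDK provides.

```python
from types import MappingProxyType
from typing import Callable, Mapping


def resolve_flags(
    evaluate: Callable[[str, bool], bool],
    defaults: Mapping[str, bool],
) -> Mapping[str, bool]:
    """Evaluate each flag exactly once and return a read-only snapshot."""
    return MappingProxyType(
        {name: evaluate(name, default) for name, default in defaults.items()}
    )


# Release flags default to off, kill switches default to on.
DEFAULTS = {"new-checkout": False, "payments-kill-switch": True}
```

Handlers further down the call chain receive the snapshot instead of the flag client, so a flip during the request cannot produce mixed behavior.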
|
|
63
|
+
|
|
64
|
+
## Kill-Switch Pattern
|
|
65
|
+
|
|
66
|
+
Every risky feature ships with a kill switch (Ops flag) and a documented procedure for flipping it. On-call must be able to disable the feature without redeploying. Document the flag name in `docs/runbooks/<service>.md` alongside the alert.
|
|
67
|
+
|
|
68
|
+
Test the kill switch on every release — a kill switch that nobody has flipped in production is unverified. Quarterly drill: flip the flag, observe the metric drop, restore.
|
|
69
|
+
|
|
70
|
+
Cross-reference `rules/hatch3r-progressive-delivery.md` — kill switch is the rollback mechanism when the staged rollout has already finished and the regression is discovered after 100%.
|
|
71
|
+
|
|
72
|
+
## Runbook URL on Every Alert
|
|
73
|
+
|
|
74
|
+
An alert without a runbook is a finding under this rule. Every alert carries a runbook URL annotation:
|
|
75
|
+
|
|
76
|
+
- **Prometheus:** `annotations.runbook_url: "https://internal.example.com/runbooks/<alert-name>.md"`.
|
|
77
|
+
- **Datadog / Grafana:** equivalent `runbook_url` or `notification_message` template field.
|
|
78
|
+
|
|
79
|
+
Runbook format:
|
|
80
|
+
|
|
81
|
+
- **Symptoms** — what the on-call sees on the dashboard or in the alert payload.
|
|
82
|
+
- **Triage** — first three commands or queries to narrow the cause.
|
|
83
|
+
- **Mitigation** — kill switch, rollback command, scale action.
|
|
84
|
+
- **Root cause** — links to past postmortems for similar symptoms.
|
|
85
|
+
- **Follow-ups** — open issues, related dashboards, owner team.
|
|
86
|
+
|
|
87
|
+
Empirical observation from 2024–2026 incidents: LLM-driven auto-diagnosis quality is dominated by runbook quality rather than by model choice. A high-fidelity runbook with concrete commands beats a generic prompt against a frontier model on the same dataset.
|
|
88
|
+
|
|
89
|
+
Runbooks live in the service repository under `docs/runbooks/<alert-name>.md` so they ship with the code that emits the alert. Renaming an alert without updating the runbook URL produces a 404 link from the alert payload — a CI check on the alert manifest catches this on every PR that touches alert names.
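
Such a CI check could look like the sketch below. It assumes the alert manifest has already been parsed into dictionaries shaped like Prometheus alerting rules (`alert`, `annotations.runbook_url`) and that the set of file names under `docs/runbooks/` has been collected; `missing_runbooks` is a hypothetical helper.

```python
import posixpath
from typing import Dict, Iterable, List, Set
from urllib.parse import urlparse


def missing_runbooks(alerts: Iterable[Dict], runbook_files: Set[str]) -> List[str]:
    """Return one problem string per alert whose runbook is absent or dangling."""
    problems = []
    for alert in alerts:
        name = alert.get("alert", "<unnamed>")
        url = alert.get("annotations", {}).get("runbook_url")
        if not url:
            problems.append(f"{name}: no runbook_url annotation")
            continue
        filename = posixpath.basename(urlparse(url).path)
        if filename not in runbook_files:
            problems.append(f"{name}: runbook {filename} not found in docs/runbooks/")
    return problems
```

The CI job fails the PR when the returned list is non-empty.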
|
|
90
|
+
|
|
91
|
+
## Health Endpoint Conventions
|
|
92
|
+
|
|
93
|
+
- `/health/live` — 200 + `{ "status": "ok", "version": "<semver>", "build_sha": "<short-sha>" }` on success; 503 + `{ "status": "down", "reason": "<diagnostic>" }` on failure.
|
|
94
|
+
- `/health/ready` — same shape; 503 includes the failing downstream name (e.g. `reason: "postgres-primary unreachable"`).
|
|
95
|
+
- `/health/startup` — same as ready but with allowance for warmup.
|
|
96
|
+
|
|
97
|
+
Never expose secrets, connection strings, or per-request internals from health endpoints — they are unauthenticated by default. Detailed diagnostics live behind authentication on a separate admin endpoint.
|
|
98
|
+
|
|
99
|
+
Probe response time budget: under 100ms for `/health/live`, under 500ms for `/health/ready` and `/health/startup`. A probe response that exceeds the kubelet's `timeoutSeconds` counts as a failure and causes pod restart cascades unrelated to the underlying issue.
|
|
100
|
+
|
|
101
|
+
Cache the ready-check result for 1–2s. With the 5s readiness period above, every replica is polled on its own schedule, so checking each downstream on every poll multiplies dependency load by the replica count.
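
A small time-based cache is enough for this. The sketch below wraps any zero-argument ready check; `CachedCheck` is a hypothetical helper, the injectable `clock` exists only to make it testable, and the class is not thread-safe as written.

```python
import time
from typing import Callable, Optional, Tuple

Result = Tuple[int, str]


class CachedCheck:
    """Cache the result of an expensive ready check for ttl_s seconds."""

    def __init__(
        self,
        check: Callable[[], Result],
        ttl_s: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[Result] = None
        self._expires_at = 0.0

    def __call__(self) -> Result:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._check()          # hit the dependencies
            self._expires_at = now + self._ttl_s
        return self._value                        # otherwise serve the cached result
```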
|
|
102
|
+
|
|
103
|
+
## Capacity Planning
|
|
104
|
+
|
|
105
|
+
- Nightly load test in CI against a staging environment representative of production. Compare baseline-vs-current p50, p95, p99 latency and error rate; fail the run on a 20% regression vs the previous green baseline.
|
|
106
|
+
- Saturation tracking — alert when CPU sustained above 70% for 15 minutes, memory above 80% for 15 minutes, connection pool above 80% utilization for 5 minutes.
|
|
107
|
+
- Rightsizing — target 20–30% headroom on CPU and memory at peak. Below 10% headroom is undersized; above 50% sustained is oversized and a cost finding.
|
|
108
|
+
- HPA (Horizontal Pod Autoscaler) target CPU at 60–70% — leaves room for the scale-up lag to bring new replicas online before the existing pods saturate.
|
|
109
|
+
- Cold-start budget for serverless and JVM services: measure p95 cold-start latency; provision the warm-pool to keep cold-start traffic below 1% of total. KEDA or platform autoscaler keeps the warm-pool at the right size.
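
The regression gate and the rightsizing bands above reduce to a few comparisons. In this sketch the function names and the "acceptable" label for headroom between the named bands are assumptions; the thresholds are the ones stated in the list.

```python
def latency_regressed(baseline_ms: float, current_ms: float, tolerance: float = 0.20) -> bool:
    """True when the current value is more than `tolerance` worse than the baseline."""
    return current_ms > baseline_ms * (1.0 + tolerance)


def classify_headroom(peak_utilization_percent: int) -> str:
    """Map peak CPU or memory utilization to the rightsizing bands above."""
    headroom = 100 - peak_utilization_percent
    if headroom < 10:
        return "undersized"
    if headroom > 50:
        return "oversized"
    if 20 <= headroom <= 30:
        return "on target"
    return "acceptable"
```

For example, a p95 that moves from 200ms to 250ms exceeds the 240ms limit and fails the nightly run.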
|
|
110
|
+
|
|
111
|
+
## Multi-Region and Disaster Recovery
|
|
112
|
+
|
|
113
|
+
- RTO (Recovery Time Objective) and RPO (Recovery Point Objective) documented per service tier in the service catalog.
|
|
114
|
+
- Active-active for tier-1 services targeting RTO under 5 minutes and RPO under 1 minute.
|
|
115
|
+
- Active-passive acceptable for tier-2 services targeting RTO under 1 hour.
|
|
116
|
+
- Failover drill quarterly; the drill is the test of the runbook. A runbook that has not been executed in a drill is a draft, not a runbook.
|
|
117
|
+
- Data residency — honor regional data boundaries (EU-US DPF, regional cloud regions). Cross-region replication respects the residency contract.
|
|
118
|
+
- DNS-based failover (Route 53, Cloud DNS) has a propagation tail measured in minutes; for sub-minute RTO use load-balancer-level failover (cross-region target groups) or anycast.
|
|
119
|
+
|
|
120
|
+
## 2024–2026 Outage Lessons
|
|
121
|
+
|
|
122
|
+
- **CrowdStrike, July 2024** — global config push without canary. Mitigation: staged rollout (cross-reference `rules/hatch3r-progressive-delivery.md`).
|
|
123
|
+
- **AWS us-east-1, October 2025** — cascading failure across services with hidden us-east-1 control-plane dependency. Mitigation: dependency mapping, multi-region for control plane.
|
|
124
|
+
- **Azure East-US2, September 2025** — single-region outage on customer-perceived multi-region service. Mitigation: active-active across regions.
|
|
125
|
+
- **Cloudflare, November 2025** — config-change incident. Mitigation: treat config as code, canary config changes.
|
|
126
|
+
|
|
127
|
+
Most outages have an organizational root cause (change management, capacity planning, ownership). The patterns above defend against the technical leg of the failure; organizational defenses are out of scope for this rule.
|
|
128
|
+
|
|
129
|
+
Postmortem cadence: within 5 business days of incident close, blameless, owned by the on-call rotation, action items tracked in the service backlog with target due dates. Postmortem reviewers cross-check that the runbook was updated and that an automated detection (alert, dashboard, or unit test) was added for the failure mode that escaped detection.
|
|
130
|
+
|
|
131
|
+
## Cross-References
|
|
132
|
+
|
|
133
|
+
- `rules/hatch3r-resilience-patterns.md` — circuit breakers, retries, timeouts.
|
|
134
|
+
- `rules/hatch3r-progressive-delivery.md` — canary, blue-green, kill-switch usage during rollout.
|
|
135
|
+
- `rules/hatch3r-observability-metrics.md` — SLOs, RED metrics, burn-rate alerts feed the runbook.
|
|
136
|
+
|
|
137
|
+
## References
|
|
138
|
+
|
|
139
|
+
- Kubernetes probes — `kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes`
|
|
140
|
+
- Kubernetes pod termination — `kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination`
|
|
141
|
+
- Google SRE workbook, graceful-shutdown chapter — `sre.google/workbook`
|
|
142
|
+
- OpenFeature specification — `openfeature.dev`
|
|
143
|
+
- LaunchDarkly docs — `docs.launchdarkly.com`
|
|
144
|
+
- Unleash docs — `docs.getunleash.io`
|
|
145
|
+
- 2024–2026 outage postmortems: CrowdStrike Jul 2024, AWS us-east-1 Oct 2025, Azure East-US2 Sep 2025, Cloudflare Nov 2025.
|
|
@@ -0,0 +1,181 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-passkey-server
|
|
3
|
+
type: rule
|
|
4
|
+
description: Server-side WebAuthn / passkey ceremony — registration, authentication, attestation, counter, RP-ID, recovery, FIDO CXP/CXF awareness
|
|
5
|
+
scope: "**/auth/**,**/passkey*,**/webauthn*,**/fido*,**/credentials/**,**/api/**,**/handlers/**"
|
|
6
|
+
tags: [security, implementation]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
8
|
+
cache_friendly: true
|
|
9
|
+
---
|
|
10
|
+
# Passkey Server — WebAuthn Ceremony
|
|
11
|
+
|
|
12
|
+
## Scope
|
|
13
|
+
|
|
14
|
+
Server-side WebAuthn / passkey implementation. Companion to `rules/hatch3r-ux-states-and-flows.md` and `rules/hatch3r-ai-ux-patterns.md` (frontend Conditional UI), and to `rules/hatch3r-auth-patterns.md` (auth model, AAL mapping, step-up). WebAuthn Level 3 (W3C Candidate Recommendation, Jan 2026) is the spec baseline.
|
|
15
|
+
|
|
16
|
+
## Server Libraries — Use Vetted, Never Hand-Roll
|
|
17
|
+
|
|
18
|
+
| Runtime | Library |
|
|
19
|
+
|---------|---------|
|
|
20
|
+
| Node.js / Deno / Bun | `@simplewebauthn/server` |
|
|
21
|
+
| JVM (Java / Kotlin) | `webauthn4j` |
|
|
22
|
+
| Go | `go-webauthn/webauthn` |
|
|
23
|
+
| Python | `py_webauthn` (Duo) |
|
|
24
|
+
| Rust | `webauthn-rs` |
|
|
25
|
+
| .NET | `Fido2NetLib` |
|
|
26
|
+
|
|
27
|
+
Hand-rolling the ceremony (signature parsing, CBOR decoding, attestation chain validation) is a Critical anti-pattern.
|
|
28
|
+
|
|
29
|
+
## Registration Ceremony
|
|
30
|
+
|
|
31
|
+
Two HTTP endpoints. Always.
|
|
32
|
+
|
|
33
|
+
### `POST /webauthn/register/options`
|
|
34
|
+
|
|
35
|
+
Server generates and returns `PublicKeyCredentialCreationOptions`:
|
|
36
|
+
|
|
37
|
+
- `challenge` — 32 random bytes, cached server-side with a 5-minute TTL keyed to the user session.
|
|
38
|
+
- `rp.id` — registrable domain (e.g., `example.com`). Never include scheme or port.
|
|
39
|
+
- `rp.name` — human-readable RP name shown in the OS passkey UI.
|
|
40
|
+
- `user.id` — server-side opaque ID (16-64 random bytes), NOT email and NOT username. Stable per account; never reused after deletion.
|
|
41
|
+
- `user.name`, `user.displayName` — visible identifiers (e.g., email + display name).
|
|
42
|
+
- `pubKeyCredParams` — `[{type: "public-key", alg: -7 /* ES256 */}, {type: "public-key", alg: -257 /* RS256 */}]`.
|
|
43
|
+
- `authenticatorSelection.userVerification: "required"` for new accounts.
|
|
44
|
+
- `authenticatorSelection.residentKey: "required"` to enable discoverable / passkey-first UX.
|
|
45
|
+
- `attestation: "none"` for consumer apps. `"direct"` only when high-assurance attestation is required (enterprise / regulated).
|
|
46
|
+
- `excludeCredentials` — array of the user's already-registered credential IDs to prevent duplicate registration on the same authenticator.
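
For orientation, the options listed above can be assembled as a plain dictionary. A vetted library from the table above would normally generate this; the sketch only shows the shape. The function name is hypothetical, and sending binary fields as base64url strings is an assumption about how the client decodes them.

```python
import base64
import secrets
from typing import Dict, List, Tuple


def b64url(data: bytes) -> str:
    """Unpadded base64url encoding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def registration_options(
    rp_id: str,
    rp_name: str,
    user_id: bytes,
    user_name: str,
    display_name: str,
    existing_credential_ids: List[bytes],
) -> Tuple[Dict, bytes]:
    """Return (options, challenge); the caller caches the challenge for 5 minutes."""
    challenge = secrets.token_bytes(32)
    options = {
        "challenge": b64url(challenge),
        "rp": {"id": rp_id, "name": rp_name},
        "user": {"id": b64url(user_id), "name": user_name, "displayName": display_name},
        "pubKeyCredParams": [
            {"type": "public-key", "alg": -7},    # ES256
            {"type": "public-key", "alg": -257},  # RS256
        ],
        "authenticatorSelection": {"userVerification": "required", "residentKey": "required"},
        "attestation": "none",
        "excludeCredentials": [
            {"type": "public-key", "id": b64url(cid)} for cid in existing_credential_ids
        ],
    }
    return options, challenge
```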
|
|
47
|
+
|
|
48
|
+
### `POST /webauthn/register/verify`
|
|
49
|
+
|
|
50
|
+
Client returns `AuthenticatorAttestationResponse`. Server verifies:
|
|
51
|
+
|
|
52
|
+
- Challenge matches the cached challenge for this session; consume after verification.
|
|
53
|
+
- `origin` is in the allowlist (typically `https://{rp.id}` + known origins).
|
|
54
|
+
- RP-ID hash in `authData` matches SHA-256 of `rp.id`.
|
|
55
|
+
- Signature validates against the attestation public key (when attestation requested).
|
|
56
|
+
- Attestation chain validates against FIDO Metadata Service when `attestation: "direct"`.
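
Challenge storage is the part of the ceremony the application usually owns. Below is a minimal in-memory sketch of a single-use challenge cache with a 5-minute TTL. `ChallengeStore` is a hypothetical name, and discarding the challenge on any verification attempt (not only a successful one) is a stricter choice made by this sketch; production code would use a shared cache rather than a process-local dict.

```python
import hmac
import secrets
import time
from typing import Callable, Dict, Tuple


class ChallengeStore:
    """Single-use, time-limited challenges keyed by session ID."""

    def __init__(self, ttl_s: float = 300.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[bytes, float]] = {}

    def issue(self, session_id: str) -> bytes:
        challenge = secrets.token_bytes(32)
        self._entries[session_id] = (challenge, self._clock() + self._ttl_s)
        return challenge

    def consume(self, session_id: str, presented: bytes) -> bool:
        # pop() makes the challenge single-use: a replay finds nothing.
        entry = self._entries.pop(session_id, None)
        if entry is None:
            return False
        challenge, expires_at = entry
        if self._clock() >= expires_at:
            return False
        return hmac.compare_digest(challenge, presented)
```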
|
|
57
|
+
|
|
58
|
+
Persist on success: `credential_id` (base64url), `public_key` (COSE-encoded), `counter`, `aaguid`, `transports[]`, `backup_eligible`, `backup_state`, `user_id` (FK), `created_at`.
|
|
59
|
+
|
|
60
|
+
## Authentication Ceremony
|
|
61
|
+
|
|
62
|
+
### `POST /webauthn/login/options`
|
|
63
|
+
|
|
64
|
+
Server returns `PublicKeyCredentialRequestOptions`:
|
|
65
|
+
|
|
66
|
+
- `challenge` — fresh 32 random bytes, cached with 5-minute TTL keyed to the login attempt ID.
|
|
67
|
+
- `rpId` — same RP ID as registration. In request options this is a top-level `rpId` string, not the `rp.id` member used in creation options.
|
|
68
|
+
- `allowCredentials` — array of the user's credential IDs when identifier is known. Empty array (or omit) for discoverable login (passkey-first UX); the OS picker handles credential selection.
|
|
69
|
+
- `userVerification: "required"` for the AAL2-required path.
|
|
70
|
+
|
|
71
|
+
### `POST /webauthn/login/verify`
|
|
72
|
+
|
|
73
|
+
Client returns `AuthenticatorAssertionResponse`. Server verifies:
|
|
74
|
+
|
|
75
|
+
- Challenge matches; consume after verification.
|
|
76
|
+
- `origin` is in the allowlist.
|
|
77
|
+
- RP-ID hash matches SHA-256 of `rp.id`.
|
|
78
|
+
- Signature validates against the stored public key for the asserted credential ID.
|
|
79
|
+
- Counter check (see below).
|
|
80
|
+
|
|
81
|
+
On success, update the stored counter and `backup_state`; issue the session.
|
|
82
|
+
|
|
83
|
+
## RP-ID Rules
|
|
84
|
+
|
|
85
|
+
- `rp.id` must equal the origin's effective domain or be a registrable-domain suffix of it. For `app.example.com`, `rp.id` may be `example.com` (broader) or `app.example.com` (narrower). A broader RP-ID allows credentials to work across subdomains.
|
|
86
|
+
- Credentials are bound to the RP-ID used at registration. Changing the RP-ID later requires re-registration.
|
|
87
|
+
- Multi-region domains: plan the RP-ID before launch. Use Related Origin Requests (WebAuthn L3) for legitimate multi-origin scenarios.
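
The relationship between an origin and an acceptable RP-ID can be sketched as a suffix test on the host name. This is an illustration only: `rp_id_allowed_for_origin` is a hypothetical helper, it does not consult the Public Suffix List (so it would wrongly accept `com`), and it ignores the `localhost` development exception.

```python
from urllib.parse import urlsplit


def rp_id_allowed_for_origin(rp_id: str, origin: str) -> bool:
    """True when rp_id equals the origin host or is a dot-separated suffix of it."""
    parts = urlsplit(origin)
    host = (parts.hostname or "").lower()
    rp_id = rp_id.lower()
    if parts.scheme != "https" or not host or not rp_id:
        return False
    return host == rp_id or host.endswith("." + rp_id)
```

The leading dot in the suffix test is what keeps `evilexample.com` from matching `example.com`.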
|
|
88
|
+
|
|
89
|
+
## Counter Checking — Clone Detection
|
|
90
|
+
|
|
91
|
+
Authenticators that implement a signature counter increment it on each assertion. Some always report 0 (document expected behavior per AAGUID).
|
|
92
|
+
|
|
93
|
+
- On registration, persist the initial counter.
|
|
94
|
+
- On authentication, the asserted counter MUST be greater than the stored counter, unless both are 0 (the authenticator does not implement a counter).
|
|
95
|
+
- If asserted counter <= stored counter (and not both 0): treat as possible clone. Reject the assertion, notify the user via email, and require recovery.
|
|
96
|
+
- Always update the stored counter after successful authentication.
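
The counter rules above fit in one small function. The enum and function names are hypothetical; vetted libraries typically perform or expose this comparison themselves.

```python
from enum import Enum


class CounterResult(Enum):
    OK = "ok"                          # counter advanced: accept and store the new value
    UNSUPPORTED = "unsupported"        # both zero: authenticator has no counter
    POSSIBLE_CLONE = "possible_clone"  # counter did not advance: reject


def check_counter(stored: int, asserted: int) -> CounterResult:
    """Apply the clone-detection rules to a stored and an asserted counter."""
    if stored == 0 and asserted == 0:
        return CounterResult.UNSUPPORTED
    if asserted > stored:
        return CounterResult.OK
    return CounterResult.POSSIBLE_CLONE
```

Only `POSSIBLE_CLONE` triggers the reject, notify, and recover path; `UNSUPPORTED` is treated as a normal success with no counter signal.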
|
|
97
|
+
|
|
98
|
+
## AAGUID Handling
|
|
99
|
+
|
|
100
|
+
- `aaguid` (Authenticator Attestation GUID) identifies the authenticator model family.
|
|
101
|
+
- When `attestation: "none"`, do NOT use AAGUID for security decisions — it is unattested and can be spoofed.
|
|
102
|
+
- Use AAGUID for analytics (which device families users prefer) and for cross-referencing the FIDO Metadata Service when `attestation: "direct"`.
|
|
103
|
+
|
|
104
|
+
## Backup State Flags
|
|
105
|
+
|
|
106
|
+
WebAuthn L3 introduces two single-bit flags in `authData`:
|
|
107
|
+
|
|
108
|
+
- `backup_eligible` (BE) — credential CAN be backed up / synced (passkey).
|
|
109
|
+
- `backup_state` (BS) — credential IS currently backed up.
|
|
110
|
+
|
|
111
|
+
Persist both. Display "passkey synced across your devices" UI hint when `BS=1` (cross-reference frontend `rules/hatch3r-ux-states-and-flows.md`). Flag `BE=0` credentials as device-bound (security key) in recovery flows — losing the device means losing the credential.
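
For reference, both flags sit in the single flags byte that follows the 32-byte RP-ID hash in `authData`, with the 4-byte big-endian signature counter after it. The sketch below reads only that fixed 37-byte header to show where the values come from; it is not a substitute for a vetted library, and the type and function names are hypothetical.

```python
import struct
from dataclasses import dataclass

FLAG_UP = 0x01  # bit 0: user present
FLAG_UV = 0x04  # bit 2: user verified
FLAG_BE = 0x08  # bit 3: backup eligible
FLAG_BS = 0x10  # bit 4: backup state


@dataclass(frozen=True)
class AuthDataHeader:
    rp_id_hash: bytes
    user_present: bool
    user_verified: bool
    backup_eligible: bool
    backup_state: bool
    sign_count: int


def parse_auth_data_header(auth_data: bytes) -> AuthDataHeader:
    """Read the fixed header: rpIdHash (32) + flags (1) + signCount (4)."""
    if len(auth_data) < 37:
        raise ValueError("authData shorter than the 37-byte fixed header")
    flags = auth_data[32]
    (sign_count,) = struct.unpack(">I", auth_data[33:37])
    return AuthDataHeader(
        rp_id_hash=auth_data[:32],
        user_present=bool(flags & FLAG_UP),
        user_verified=bool(flags & FLAG_UV),
        backup_eligible=bool(flags & FLAG_BE),
        backup_state=bool(flags & FLAG_BS),
        sign_count=sign_count,
    )
```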
|
|
112
|
+
|
|
113
|
+
## Recovery Patterns
|
|
114
|
+
|
|
115
|
+
- **Multiple passkeys per user** — support is REQUIRED. Encourage registration of >=2 authenticators (laptop sync + phone, or laptop + hardware key) during onboarding. The UI surfaces a missing second passkey as a recoverable warning.
|
|
116
|
+
- **Account recovery flow** — verified email + identity-proofing question + step-up auth via remaining passkey. Never plain SMS. Document the recovery SLA.
|
|
117
|
+
- **Admin recovery** — enterprise tenants get an admin-initiated recovery path with audit trail; consumer accounts do not.
|
|
118
|
+
- **FIDO CXP/CXF awareness** — the Credential Exchange Protocol and Credential Exchange Format (Feb 2026 draft) enable passkey migration across managers (1Password, Apple Keychain, Google Password Manager, Bitwarden). Plan the migration UX so users discover that exporting from manager A to manager B is supported.
|
|
119
|
+
|
|
120
|
+
## User Verification (UV)
|
|
121
|
+
|
|
122
|
+
- `userVerification: "required"` — authenticator must verify the user via biometric or PIN. Replaces the password factor. Use for AAL2 and all sensitive operations.
|
|
123
|
+
- `userVerification: "preferred"` — authenticator verifies if able, otherwise presence-only. Acceptable for AAL1.
|
|
124
|
+
- `userVerification: "discouraged"` — presence only. Rare; only for tap-to-confirm flows.
|
|
125
|
+
- Map per AAL level — cross-reference `rules/hatch3r-auth-patterns.md` MFA section.
|
|
126
|
+
|
|
127
|
+
## Step-Up Authentication
|
|
128
|
+
|
|
129
|
+
High-risk operations require a fresh WebAuthn assertion even when the session is valid:
|
|
130
|
+
|
|
131
|
+
- Delete account, change email, change password.
|
|
132
|
+
- Rotate API key, add OAuth client, change billing.
|
|
133
|
+
- Transfer funds above the project's risk threshold.
|
|
134
|
+
|
|
135
|
+
Issue a short-lived (5-15 min) step-up token scoped to the operation. Re-assertion uses `userVerification: "required"` regardless of session AAL.
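
One possible shape for such a token is an HMAC-signed payload carrying the user, the operation, and an expiry. This is a sketch under stated assumptions: the token format, claim names, and both function names are hypothetical, and a project that already issues signed tokens would reuse that mechanism instead.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_step_up_token(secret: bytes, user_id: str, operation: str,
                        ttl_s: int = 300, now: Optional[float] = None) -> str:
    """Sign {sub, op, exp}; issue only after a fresh assertion has verified."""
    issued_at = time.time() if now is None else now
    payload = json.dumps(
        {"sub": user_id, "op": operation, "exp": int(issued_at) + ttl_s},
        separators=(",", ":"), sort_keys=True,
    ).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode("ascii") + "."
            + base64.urlsafe_b64encode(signature).decode("ascii"))


def verify_step_up_token(secret: bytes, token: str, user_id: str, operation: str,
                         now: Optional[float] = None) -> bool:
    """Accept only an unexpired token bound to this user and this operation."""
    current = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return False
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(payload)
    return (claims.get("sub") == user_id and claims.get("op") == operation
            and current < claims.get("exp", 0))
```

Binding the token to a single `op` value means a token minted for changing billing cannot be replayed to delete the account.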
|
|
136
|
+
|
|
137
|
+
## Discoverable Credentials (Passkey-First UX)
|
|
138
|
+
|
|
139
|
+
- Register with `residentKey: "required"` so the credential is stored on the authenticator with a username hint.
|
|
140
|
+
- Authenticate with empty `allowCredentials` and frontend `mediation: "conditional"` (Conditional UI) so the username field offers passkeys directly.
|
|
141
|
+
- Cross-reference `rules/hatch3r-ux-states-and-flows.md` for the frontend autocomplete attribute pattern.
|
|
142
|
+
|
|
143
|
+
## Telemetry
|
|
144
|
+
|
|
145
|
+
Track per-feature:
|
|
146
|
+
|
|
147
|
+
- Registration success rate, registration failure rate by reason.
|
|
148
|
+
- Authentication success rate, authentication failure rate by reason.
|
|
149
|
+
- Time-to-completion for registration and assertion (P50/P95).
|
|
150
|
+
- AAGUID distribution.
|
|
151
|
+
- `backup_state` rate (sync vs device-bound).
|
|
152
|
+
- Counter-mismatch incidents (potential clone).
|
|
153
|
+
|
|
154
|
+
Cross-reference `skills/hatch3r-observability-verify` for instrumentation patterns.
|
|
155
|
+
|
|
156
|
+
## Migration From Passwords
|
|
157
|
+
|
|
158
|
+
Opt-in flow after a successful password+TOTP login:
|
|
159
|
+
|
|
160
|
+
1. Prompt "Add a passkey for faster sign-in?" with one-tap registration.
|
|
161
|
+
2. On success, mark the account `passkey_enabled: true`.
|
|
162
|
+
3. After >=1 passkey is registered AND the user has confirmed that the passkey works in >=2 sessions, prompt "Switch to passkey-only sign-in?"
|
|
163
|
+
4. On opt-in confirmation, disable password sign-in for the account. Keep a recovery path (email + secondary passkey).
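The opt-in steps above can be sketched as a small decision function. This is an illustration under assumptions: `AccountState` and its field names are hypothetical, and how a "confirmed" passkey session is counted is left to the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountState:
    passkey_count: int               # registered passkeys
    confirmed_passkey_sessions: int  # sessions where passkey sign-in succeeded
    password_enabled: bool = True


def next_migration_prompt(state: AccountState) -> Optional[str]:
    """Return the prompt to show after a successful login, or None."""
    if not state.password_enabled:
        return None  # already passkey-only
    if state.passkey_count == 0:
        return "Add a passkey for faster sign-in?"
    if state.confirmed_passkey_sessions >= 2:
        return "Switch to passkey-only sign-in?"
    return None  # passkey exists but is not yet proven across sessions
```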
|
|
164
|
+
|
|
165
|
+
## Anti-Patterns
|
|
166
|
+
|
|
167
|
+
- Never store the private key server-side. Passkeys are public-key only — the private key never leaves the authenticator.
|
|
168
|
+
- Never use email as `user.id`. Use a server-side opaque ID; email changes break credential binding otherwise.
|
|
169
|
+
- Never accept attestation as user identity. Attestation identifies the authenticator model, not the human.
|
|
170
|
+
- Never skip the origin check. RP-ID alone does not prevent phishing — origin allowlist does.
|
|
171
|
+
- Never log credential public keys or signatures in plaintext audit logs.
|
|
172
|
+
- Never accept a counter <= stored counter as success. The only exception is when both values are 0, for authenticators that do not implement a counter.
|
|
173
|
+
|
|
174
|
+
## References
|
|
175
|
+
|
|
176
|
+
- W3C Web Authentication Level 3 (Candidate Recommendation, Jan 2026).
|
|
177
|
+
- FIDO Alliance — Client to Authenticator Protocol (CTAP) 2.2.
|
|
178
|
+
- FIDO Alliance — Credential Exchange Protocol / Credential Exchange Format (CXP / CXF, Feb 2026 draft).
|
|
179
|
+
- `@simplewebauthn/server` documentation.
|
|
180
|
+
- MDN — Web Authentication API.
|
|
181
|
+
- FIDO Metadata Service (MDS3) for attestation verification.
|
|
@@ -0,0 +1,177 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Server-side WebAuthn / passkey ceremony — registration, authentication, attestation, counter, RP-ID, recovery, FIDO CXP/CXF awareness
|
|
3
|
+
globs: ["**/auth/**", "**/passkey*", "**/webauthn*", "**/fido*", "**/credentials/**", "**/api/**", "**/handlers/**"]
|
|
4
|
+
alwaysApply: false
|
|
5
|
+
---
|
|
6
|
+
# Passkey Server — WebAuthn Ceremony
|
|
7
|
+
|
|
8
|
+
## Scope
|
|
9
|
+
|
|
10
|
+
Server-side WebAuthn / passkey implementation. Companion to `rules/hatch3r-ux-states-and-flows.md` and `rules/hatch3r-ai-ux-patterns.md` (frontend Conditional UI), and to `rules/hatch3r-auth-patterns.md` (auth model, AAL mapping, step-up). WebAuthn Level 3 (W3C Candidate Recommendation, Jan 2026) is the spec baseline.
|
|
11
|
+
|
|
12
|
+
## Server Libraries — Use Vetted, Never Hand-Roll
|
|
13
|
+
|
|
14
|
+
| Runtime | Library |
|
|
15
|
+
|---------|---------|
|
|
16
|
+
| Node.js / Deno / Bun | `@simplewebauthn/server` |
|
|
17
|
+
| JVM (Java / Kotlin) | `webauthn4j` |
|
|
18
|
+
| Go | `go-webauthn/webauthn` |
|
|
19
|
+
| Python | `py_webauthn` (Duo) |
|
|
20
|
+
| Rust | `webauthn-rs` |
|
|
21
|
+
| .NET | `Fido2NetLib` |
|
|
22
|
+
|
|
23
|
+
Hand-rolling the ceremony (signature parsing, CBOR decoding, attestation chain validation) is a Critical anti-pattern.
|
|
24
|
+
|
|
25
|
+
## Registration Ceremony
|
|
26
|
+
|
|
27
|
+
Two HTTP endpoints. Always.
|
|
28
|
+
|
|
29
|
+
### `POST /webauthn/register/options`
|
|
30
|
+
|
|
31
|
+
Server generates and returns `PublicKeyCredentialCreationOptions`:
|
|
32
|
+
|
|
33
|
+
- `challenge` — 32 random bytes, cached server-side with a 5-minute TTL keyed to the user session.
|
|
34
|
+
- `rp.id` — registrable domain (e.g., `example.com`). Never include scheme or port.
|
|
35
|
+
- `rp.name` — human-readable RP name shown in the OS passkey UI.
|
|
36
|
+
- `user.id` — server-side opaque ID (16-64 random bytes), NOT email and NOT username. Stable per account; never reused after deletion.
|
|
37
|
+
- `user.name`, `user.displayName` — visible identifiers (e.g., email + display name).
|
|
38
|
+
- `pubKeyCredParams` — `[{type: "public-key", alg: -7 /* ES256 */}, {type: "public-key", alg: -257 /* RS256 */}]`.
|
|
39
|
+
- `authenticatorSelection.userVerification: "required"` for new accounts.
|
|
40
|
+
- `authenticatorSelection.residentKey: "required"` to enable discoverable / passkey-first UX.
|
|
41
|
+
- `attestation: "none"` for consumer apps. `"direct"` only when high-assurance attestation is required (enterprise / regulated).
|
|
42
|
+
- `excludeCredentials` — array of the user's already-registered credential IDs to prevent duplicate registration on the same authenticator.
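A minimal sketch of the challenge handling described above, assuming an in-memory store. `ChallengeStore` and its method names are hypothetical, and a multi-instance deployment would likely use a shared cache instead. Building the options object itself stays with the vetted library.

```python
import hmac
import secrets
import time
from typing import Callable, Dict, Tuple


class ChallengeStore:
    """Single-use challenges keyed by session, expiring after a TTL."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending: Dict[str, Tuple[bytes, float]] = {}

    def issue(self, session_id: str) -> bytes:
        """Create 32 random bytes and remember them for this session."""
        challenge = secrets.token_bytes(32)
        self._pending[session_id] = (challenge, self._clock() + self._ttl)
        return challenge

    def consume(self, session_id: str, presented: bytes) -> bool:
        """Check the presented challenge; the entry is removed on any attempt."""
        entry = self._pending.pop(session_id, None)
        if entry is None:
            return False
        challenge, expires_at = entry
        return self._clock() < expires_at and hmac.compare_digest(challenge, presented)
```

Removing the entry on every attempt, successful or not, is a deliberate choice in this sketch: a failed verification forces a fresh challenge rather than allowing retries against the same one.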
|
|
43
|
+
|
|
44
|
+
### `POST /webauthn/register/verify`
|
|
45
|
+
|
|
46
|
+
Client returns `AuthenticatorAttestationResponse`. Server verifies:
|
|
47
|
+
|
|
48
|
+
- Challenge matches the cached challenge for this session; consume after verification.
|
|
49
|
+
- `origin` is in the allowlist (typically `https://{rp.id}` + known origins).
|
|
50
|
+
- RP-ID hash in `authData` matches SHA-256 of `rp.id`.
|
|
51
|
+
- Signature validates against the attestation public key (when attestation requested).
|
|
52
|
+
- Attestation chain validates against FIDO Metadata Service when `attestation: "direct"`.
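For illustration only, the first three checks above look roughly like the following. The vetted library performs them in production, so this is not a replacement for it; the function names are hypothetical. The sketch relies on two facts from the WebAuthn spec: `clientDataJSON` carries `type`, `challenge` (base64url), and `origin`, and the first 32 bytes of `authData` are the RP-ID hash.

```python
import base64
import hashlib
import hmac
import json
from typing import Set


def client_data_ok(client_data_json: bytes, expected_type: str,
                   expected_challenge: bytes, allowed_origins: Set[str]) -> bool:
    """Check ceremony type, challenge, and origin from clientDataJSON."""
    try:
        data = json.loads(client_data_json)
        if not isinstance(data, dict):
            return False
        encoded = str(data.get("challenge", ""))
        # base64url values arrive without padding; restore it before decoding.
        challenge = base64.urlsafe_b64decode(encoded + "=" * (-len(encoded) % 4))
    except ValueError:
        return False
    return (data.get("type") == expected_type
            and data.get("origin") in allowed_origins
            and hmac.compare_digest(challenge, expected_challenge))


def rp_id_hash_ok(auth_data: bytes, rp_id: str) -> bool:
    """authData starts with SHA-256(rp.id); 37 bytes is the minimum length."""
    expected = hashlib.sha256(rp_id.encode()).digest()
    return len(auth_data) >= 37 and hmac.compare_digest(auth_data[:32], expected)
```

For registration the expected type is `"webauthn.create"`; for authentication it is `"webauthn.get"`.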
|
|
53
|
+
|
|
54
|
+
Persist on success: `credential_id` (base64url), `public_key` (COSE-encoded), `counter`, `aaguid`, `transports[]`, `backup_eligible`, `backup_state`, `user_id` (FK), `created_at`.
|
|
55
|
+
|
|
56
|
+
## Authentication Ceremony
|
|
57
|
+
|
|
58
|
+
### `POST /webauthn/login/options`
|
|
59
|
+
|
|
60
|
+
Server returns `PublicKeyCredentialRequestOptions`:
|
|
61
|
+
|
|
62
|
+
- `challenge` — fresh 32 random bytes, cached with 5-minute TTL keyed to the login attempt ID.
|
|
63
|
+
- `rp.id` — same registrable domain as registration.
|
|
64
|
+
- `allowCredentials` — array of the user's credential IDs when identifier is known. Empty array (or omit) for discoverable login (passkey-first UX); the OS picker handles credential selection.
|
|
65
|
+
- `userVerification: "required"` for the AAL2-required path.
|
|
66
|
+
|
|
67
|
+
### `POST /webauthn/login/verify`
|
|
68
|
+
|
|
69
|
+
Client returns `AuthenticatorAssertionResponse`. Server verifies:
|
|
70
|
+
|
|
71
|
+
- Challenge matches; consume after verification.
|
|
72
|
+
- `origin` is in the allowlist.
|
|
73
|
+
- RP-ID hash matches SHA-256 of `rp.id`.
|
|
74
|
+
- Signature validates against the stored public key for the asserted credential ID.
|
|
75
|
+
- Counter check (see below).
|
|
76
|
+
|
|
77
|
+
On success, update the stored counter and `backup_state`; issue the session.
|
|
78
|
+
|
|
79
|
+
## RP-ID Rules
|
|
80
|
+
|
|
81
|
+
- `rp.id` is the registrable domain. For `app.example.com`, `rp.id` may be `example.com` (broader) or `app.example.com` (narrower). Broader RP-ID allows credentials to work across subdomains.
|
|
82
|
+
- Credentials are bound to the RP-ID used at registration. Changing the RP-ID later requires re-registration.
|
|
83
|
+
- Multi-region domains: plan the RP-ID before launch. Use Related Origin Requests (WebAuthn L3) for legitimate multi-origin scenarios.
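A rough sketch of the suffix relationship between an origin and a candidate `rp.id`, useful as a configuration sanity check. The function name is hypothetical, and the sketch deliberately omits the public-suffix check that browsers apply (so `rp_id="com"` would pass here but not in a browser).

```python
from urllib.parse import urlsplit


def rp_id_valid_for_origin(origin: str, rp_id: str) -> bool:
    """True when rp_id equals the origin host or is a parent domain of it."""
    parts = urlsplit(origin)
    host = (parts.hostname or "").lower()
    rp = rp_id.lower()
    if parts.scheme != "https" or not host or not rp:
        return False
    # Require a dot boundary so "evilexample.com" does not match "example.com".
    return host == rp or host.endswith("." + rp)
```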
|
|
84
|
+
|
|
85
|
+
## Counter Checking — Clone Detection
|
|
86
|
+
|
|
87
|
+
Most authenticators increment their signature counter on each assertion; some always report 0. Document the expected behavior per AAGUID.
|
|
88
|
+
|
|
89
|
+
- On registration, persist the initial counter.
|
|
90
|
+
- On authentication, the asserted counter MUST be greater than the stored counter, unless both are 0.
|
|
91
|
+
- If asserted counter <= stored counter (and not both 0): treat as possible clone. Reject the assertion, notify the user via email, and require recovery.
|
|
92
|
+
- Always update the stored counter after successful authentication.
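The counter rules above reduce to a three-way decision. A minimal sketch, with hypothetical names (`CounterResult`, `check_counter`); the caller is assumed to reject the assertion, notify the user, and require recovery on `POSSIBLE_CLONE`.

```python
from enum import Enum


class CounterResult(Enum):
    OK = "ok"                          # counter advanced; store the new value
    NOT_SUPPORTED = "not_supported"    # both 0: authenticator has no counter
    POSSIBLE_CLONE = "possible_clone"  # counter did not advance


def check_counter(stored: int, asserted: int) -> CounterResult:
    """Classify an asserted signature counter against the stored one."""
    if stored == 0 and asserted == 0:
        return CounterResult.NOT_SUPPORTED
    if asserted > stored:
        return CounterResult.OK
    return CounterResult.POSSIBLE_CLONE
```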
|
|
93
|
+
|
|
94
|
+
## AAGUID Handling
|
|
95
|
+
|
|
96
|
+
- `aaguid` (Authenticator Attestation GUID) identifies the authenticator model family.
|
|
97
|
+
- When `attestation: "none"`, do NOT use AAGUID for security decisions — it is unattested and can be spoofed.
|
|
98
|
+
- Use AAGUID for analytics (which device families users prefer) and for cross-referencing the FIDO Metadata Service when `attestation: "direct"`.
|
|
99
|
+
|
|
100
|
+
## Backup State Flags
|
|
101
|
+
|
|
102
|
+
WebAuthn L3 introduces two single-bit flags in `authData`:
|
|
103
|
+
|
|
104
|
+
- `backup_eligible` (BE) — credential CAN be backed up / synced (passkey).
|
|
105
|
+
- `backup_state` (BS) — credential IS currently backed up.
|
|
106
|
+
|
|
107
|
+
Persist both. Display "passkey synced across your devices" UI hint when `BS=1` (cross-reference frontend `rules/hatch3r-ux-states-and-flows.md`). Flag `BE=0` credentials as device-bound (security key) in recovery flows — losing the device means losing the credential.
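As an illustration of where these bits live, the flags byte sits at offset 32 of `authData`, right after the RP-ID hash. The vetted library normally exposes the parsed values, so this sketch (with the hypothetical name `parse_flags`) is only meant to show the layout.

```python
from typing import Dict

# Bit masks for the authData flags byte (WebAuthn Level 3).
FLAG_UP = 0x01  # user present
FLAG_UV = 0x04  # user verified
FLAG_BE = 0x08  # backup eligible
FLAG_BS = 0x10  # backup state


def parse_flags(auth_data: bytes) -> Dict[str, bool]:
    """Extract UP, UV, BE, and BS from the flags byte of authData."""
    if len(auth_data) < 37:
        raise ValueError("authData is shorter than the 37-byte minimum")
    flags = auth_data[32]
    parsed = {
        "user_present": bool(flags & FLAG_UP),
        "user_verified": bool(flags & FLAG_UV),
        "backup_eligible": bool(flags & FLAG_BE),
        "backup_state": bool(flags & FLAG_BS),
    }
    # A credential cannot be backed up without being backup-eligible.
    if parsed["backup_state"] and not parsed["backup_eligible"]:
        raise ValueError("invalid flags: BS=1 with BE=0")
    return parsed
```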
|
|
108
|
+
|
|
109
|
+
## Recovery Patterns
|
|
110
|
+
|
|
111
|
+
- **Multiple passkeys per user** — REQUIRED. Encourage registration of >=2 authenticators (laptop sync + phone, or laptop + hardware key) during onboarding. UI surfaces missing-second-passkey state as a recoverable warning.
|
|
112
|
+
- **Account recovery flow** — verified email + identity-proofing question + step-up auth via remaining passkey. Never plain SMS. Document the recovery SLA.
|
|
113
|
+
- **Admin recovery** — enterprise tenants get an admin-initiated recovery path with audit trail; consumer accounts do not.
|
|
114
|
+
- **FIDO CXP/CXF awareness**: the Credential Exchange Protocol and Credential Exchange Format (Feb 2026 draft) enable passkey migration across managers (1Password, Apple Keychain, Google Password Manager, Bitwarden). Plan the migration UX so users can discover that exporting from manager A to manager B is supported.
|
|
115
|
+
|
|
116
|
+
## User Verification (UV)
|
|
117
|
+
|
|
118
|
+
- `userVerification: "required"` — authenticator must verify the user via biometric or PIN. Replaces the password factor. Use for AAL2 and all sensitive operations.
|
|
119
|
+
- `userVerification: "preferred"` — authenticator verifies if able, otherwise presence-only. Acceptable for AAL1.
|
|
120
|
+
- `userVerification: "discouraged"` — presence only. Rare; only for tap-to-confirm flows.
|
|
121
|
+
- Map per AAL level — cross-reference `rules/hatch3r-auth-patterns.md` MFA section.
|
|
122
|
+
|
|
123
|
+
## Step-Up Authentication
|
|
124
|
+
|
|
125
|
+
High-risk operations require a fresh WebAuthn assertion even when the session is valid:
|
|
126
|
+
|
|
127
|
+
- Delete account, change email, change password.
|
|
128
|
+
- Rotate API key, add OAuth client, change billing.
|
|
129
|
+
- Transfer funds above the project's risk threshold.
|
|
130
|
+
|
|
131
|
+
Issue a short-lived (5-15 min) step-up token scoped to the operation. Re-assertion uses `userVerification: "required"` regardless of session AAL.
|
|
132
|
+
|
|
133
|
+
## Discoverable Credentials (Passkey-First UX)
|
|
134
|
+
|
|
135
|
+
- Register with `residentKey: "required"` so the credential is stored on the authenticator with a username hint.
|
|
136
|
+
- Authenticate with empty `allowCredentials` and frontend `mediation: "conditional"` (Conditional UI) so the username field offers passkeys directly.
|
|
137
|
+
- Cross-reference `rules/hatch3r-ux-states-and-flows.md` for the frontend autocomplete attribute pattern.
|
|
138
|
+
|
|
139
|
+
## Telemetry
|
|
140
|
+
|
|
141
|
+
Track per-feature:
|
|
142
|
+
|
|
143
|
+
- Registration success rate, registration failure rate by reason.
|
|
144
|
+
- Authentication success rate, authentication failure rate by reason.
|
|
145
|
+
- Time-to-completion for registration and assertion (P50/P95).
|
|
146
|
+
- AAGUID distribution.
|
|
147
|
+
- `backup_state` rate (sync vs device-bound).
|
|
148
|
+
- Counter-mismatch incidents (potential clone).
|
|
149
|
+
|
|
150
|
+
Cross-reference `skills/hatch3r-observability-verify` for instrumentation patterns.
|
|
151
|
+
|
|
152
|
+
## Migration From Passwords
|
|
153
|
+
|
|
154
|
+
Opt-in flow after a successful password+TOTP login:
|
|
155
|
+
|
|
156
|
+
1. Prompt "Add a passkey for faster sign-in?" with one-tap registration.
|
|
157
|
+
2. On success, mark the account `passkey_enabled: true`.
|
|
158
|
+
3. After >=1 passkey is registered AND the user has confirmed that the passkey works in >=2 sessions, prompt "Switch to passkey-only sign-in?"
|
|
159
|
+
4. On opt-in confirmation, disable password sign-in for the account. Keep a recovery path (email + secondary passkey).
|
|
160
|
+
|
|
161
|
+
## Anti-Patterns
|
|
162
|
+
|
|
163
|
+
- Never store the private key server-side. Passkeys are public-key only — the private key never leaves the authenticator.
|
|
164
|
+
- Never use email as `user.id`. Use a server-side opaque ID; email changes break credential binding otherwise.
|
|
165
|
+
- Never accept attestation as user identity. Attestation identifies the authenticator model, not the human.
|
|
166
|
+
- Never skip the origin check. RP-ID alone does not prevent phishing — origin allowlist does.
|
|
167
|
+
- Never log credential public keys or signatures in plaintext audit logs.
|
|
168
|
+
- Never accept a counter <= stored counter as success. The only exception is when both values are 0, for authenticators that do not implement a counter.
|
|
169
|
+
|
|
170
|
+
## References
|
|
171
|
+
|
|
172
|
+
- W3C Web Authentication Level 3 (Candidate Recommendation, Jan 2026).
|
|
173
|
+
- FIDO Alliance — Client to Authenticator Protocol (CTAP) 2.2.
|
|
174
|
+
- FIDO Alliance — Credential Exchange Protocol / Credential Exchange Format (CXP / CXF, Feb 2026 draft).
|
|
175
|
+
- `@simplewebauthn/server` documentation.
|
|
176
|
+
- MDN — Web Authentication API.
|
|
177
|
+
- FIDO Metadata Service (MDS3) for attestation verification.
|
|
@@ -0,0 +1,120 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-progressive-delivery
|
|
3
|
+
type: rule
|
|
4
|
+
description: Progressive delivery — canary, blue-green, feature-flag rollout with auto-rollback on SLO burn; staged rollout to prevent CrowdStrike-class incidents
|
|
5
|
+
scope: "**/.github/workflows/**,**/deploy/**,**/k8s/**,**/manifests/**,**/argo/**,**/flagger/**,**/spinnaker/**,**/rollout*"
|
|
6
|
+
tags: [devops]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
8
|
+
cache_friendly: true
|
|
9
|
+
---
|
|
10
|
+
# Progressive Delivery
|
|
11
|
+
|
|
12
|
+
## Three Rollout Strategies
|
|
13
|
+
|
|
14
|
+
Choose one per service based on risk profile and resource budget:
|
|
15
|
+
|
|
16
|
+
- **Canary** — shift traffic in increments (1% → 10% → 50% → 100%) over hours; analyze metrics at each stage; auto-rollback on SLO burn. Argo Rollouts is the most-deployed Kubernetes-native option; Flagger is the Flux-aligned alternative. Lowest resource overhead; gradual exposure.
|
|
17
|
+
- **Blue-green** — bring the new version fully online alongside the old version; switch traffic atomically; keep old version warm for fast rollback. Higher resource cost (2x at switchover); near-instant rollback. Suits stateful upgrades and small fleets.
|
|
18
|
+
- **Feature-flag rollout** — deploy the new code dark (path exists but is gated); flip an OpenFeature flag to release. Decouples deploy from release; pairs with the kill-switch pattern in `rules/hatch3r-operability.md`.
|
|
19
|
+
|
|
20
|
+
A service may combine strategies — deploy via blue-green for infrastructure changes, then gate user-visible behavior behind a feature flag for incremental release.
|
|
21
|
+
|
|
22
|
+
## Canary Analysis
|
|
23
|
+
|
|
24
|
+
Automate the success / fail decision at each stage; never rely on a human eyeballing the dashboard.
|
|
25
|
+
|
|
26
|
+
- **Argo Rollouts AnalysisTemplate** — declarative Prometheus / Datadog / New Relic / Wavefront queries with success conditions.
|
|
27
|
+
- **Flagger metric checks** — Prometheus / Datadog / CloudWatch / Stackdriver via `MetricTemplate`.
|
|
28
|
+
- **Datadog deployment tracking** — auto-rollback triggered by deployment-tagged anomaly detection.
|
|
29
|
+
- **Spinnaker / Kayenta** — Mann-Whitney U test on baseline-vs-canary metrics; statistical rigor for low-traffic services where raw thresholds give noisy signals.
|
|
30
|
+
|
|
31
|
+
Metrics gated at every stage:
|
|
32
|
+
|
|
33
|
+
- Error-rate ratio (canary vs control) — fail on >1.2x baseline for 2 consecutive intervals.
|
|
34
|
+
- p95 / p99 latency (canary vs control) — fail on >1.2x baseline for 2 consecutive intervals.
|
|
35
|
+
- Business KPIs (checkout success rate, signup completion) — fail on >5% drop from baseline.
|
|
36
|
+
- Saturation metrics (CPU, memory) — fail on canary above 90% when control is below 70%.
|
|
37
|
+
|
|
38
|
+
Requiring two consecutive intervals (typically 1-minute scrape windows) keeps single-scrape noise from tripping a false rollback.
|
|
39
|
+
|
|
40
|
+
Baseline definition: a stable subset of production pods running the prior version, receiving the same traffic mix as the canary. Comparing canary against absolute thresholds rather than against a live baseline produces false positives during traffic spikes and false negatives during quiet windows.
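A minimal sketch of the ratio gate with the consecutive-interval rule, comparing the canary against the live control series rather than an absolute threshold. The function name and signature are hypothetical; in practice the rollout controller's analysis template expresses the same condition.

```python
from typing import Sequence


def breaches_ratio_gate(canary: Sequence[float], control: Sequence[float],
                        max_ratio: float = 1.2, consecutive: int = 2) -> bool:
    """True when canary exceeds max_ratio x control for N intervals in a row."""
    streak = 0
    for canary_value, control_value in zip(canary, control):
        # Multiplying instead of dividing avoids a zero-baseline division.
        if canary_value > max_ratio * control_value:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0
    return False
```

With per-interval error rates as input, a single noisy scrape resets the streak and does not fail the stage, while two breaches in a row do.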
|
|
41
|
+
|
|
42
|
+
## SLO-Burn Auto-Rollback
|
|
43
|
+
|
|
44
|
+
Wire canary analysis to the service's SLO (cross-reference `rules/hatch3r-observability-metrics.md`). If a multi-window multi-burn-rate alert fires during a canary stage, rollback automatically — do not wait for the next stage gate.
|
|
45
|
+
|
|
46
|
+
Two-window burn-rate alert pattern: trigger rollback when the error budget burns at 14.4x over both a 1-hour window and a 5-minute window (the Google SRE "fast burn" pattern; the short window confirms the burn is still happening). Slow-burn alerts (for example, 1x over both a 3-day window and a 6-hour window) do not auto-rollback; they page the on-call.
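A short sketch of the arithmetic behind burn-rate alerts. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); at a 99.9% SLO, an error ratio of 1.44% is a 14.4x burn, and sustaining 14.4x for one hour consumes 14.4 / 720 = 2% of a 30-day budget. The function names are hypothetical, and the thresholds are left as parameters so they can follow the service's own alert definitions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio expressed as a multiple of the error budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 720.0) -> float:
    """Fraction of the period's error budget spent at this burn rate."""
    return rate * window_hours / period_hours


def should_roll_back(short_error_ratio: float, long_error_ratio: float,
                     slo_target: float, short_threshold: float,
                     long_threshold: float) -> bool:
    """Fire only when both windows exceed their burn-rate thresholds."""
    return (burn_rate(short_error_ratio, slo_target) >= short_threshold
            and burn_rate(long_error_ratio, slo_target) >= long_threshold)
```

Requiring both windows means a brief spike that has already ended (long window high, short window recovered) does not trigger a rollback.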
|
|
47
|
+
|
|
48
|
+
## Staged Rollout Cadence
|
|
49
|
+
|
|
50
|
+
Default cadence for non-trivial production deploys, enforced by the rollout controller:
|
|
51
|
+
|
|
52
|
+
- Stage 1: 1% for 30 minutes minimum.
|
|
53
|
+
- Stage 2: 10% for 1 hour minimum.
|
|
54
|
+
- Stage 3: 50% for 2 hours minimum.
|
|
55
|
+
- Stage 4: 100%.
|
|
56
|
+
|
|
57
|
+
Override only with explicit on-call approval, documented in the deploy log. The CrowdStrike July 2024 incident involved a global content update pushed without staged exposure; a 1% canary stage would likely have surfaced the kernel crash while it affected only a small fraction of hosts.
|
|
58
|
+
|
|
59
|
+
Trivial deploys (docs-only, config-only with prior canary, dependency patch with no behavior change) may skip stages with a documented justification in the PR.
|
|
60
|
+
|
|
61
|
+
Stage holding times scale with peak-hour traffic — a 30-minute stage at 03:00 may see less traffic than a 5-minute stage at 13:00. Tune cadence per service so each stage observes at least 1000 canary requests; below that the metrics-gate is statistically underpowered.
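The traffic-scaling rule can be made concrete with a small calculation: hold each stage for the longer of the cadence minimum and the time needed to observe the minimum request count at the canary's traffic share. The function name and parameters are hypothetical, and the request rate is assumed to be a recent average for the deploy window.

```python
import math


def stage_hold_seconds(total_rps: float, canary_percent: float,
                       min_hold_seconds: int, min_requests: int = 1000) -> int:
    """Seconds to hold a stage so the canary sees at least min_requests."""
    if total_rps <= 0 or canary_percent <= 0:
        raise ValueError("total_rps and canary_percent must be positive")
    # Canary request rate is total_rps * canary_percent / 100.
    needed = math.ceil(min_requests * 100 / (total_rps * canary_percent))
    return max(min_hold_seconds, needed)
```

For example, at 50 requests per second a 1% canary receives 0.5 requests per second, so 1000 requests take 2000 seconds, which is longer than the 30-minute (1800-second) minimum for stage 1.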
|
|
62
|
+
|
|
63
|
+
## Blast-Radius Reasoning
|
|
64
|
+
|
|
65
|
+
Every PR description includes the blast-radius block:
|
|
66
|
+
|
|
67
|
+
- **Services affected** — list of deployed services.
|
|
68
|
+
- **Regions** — list of target regions and rollout order.
|
|
69
|
+
- **Traffic %** — peak traffic share over the deploy window.
|
|
70
|
+
- **Rollback time** — measured rollback duration (under 5 minutes is the target).
|
|
71
|
+
- **Rollback command** — exact CLI invocation, copy-pasteable.
|
|
72
|
+
|
|
73
|
+
A reviewer-enforced checklist item rejects PRs missing the block. Blast-radius is the artifact the incident commander reads at 03:00.
|
|
74
|
+
|
|
75
|
+
## Deploy Windows and Change Calendar
|
|
76
|
+
|
|
77
|
+
- No production deploys: Friday 12:00 PM through Monday 09:00 AM in the timezone of the on-call region.
|
|
78
|
+
- No production deploys during named freezes (Black Friday window, end-of-quarter close, major event launch).
|
|
79
|
+
- Hotfixes require explicit incident-commander sign-off, captured in the deploy log.
|
|
80
|
+
|
|
81
|
+
Deploy-window violations are a finding even when the deploy succeeded — the on-call coverage assumption is what is being tested, not the deploy.
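A minimal sketch of the weekly window check, assuming the caller passes the current time already converted to the on-call region's timezone. The function name is hypothetical, and named freezes and the hotfix sign-off path are out of scope here.

```python
from datetime import datetime, time


def deploy_allowed(local_now: datetime) -> bool:
    """False from Friday 12:00 through Monday 09:00 local time."""
    weekday = local_now.weekday()  # Monday is 0, Sunday is 6
    clock = local_now.time()
    if weekday == 4 and clock >= time(12, 0):
        return False  # Friday afternoon
    if weekday in (5, 6):
        return False  # Saturday and Sunday
    if weekday == 0 and clock < time(9, 0):
        return False  # Monday before 09:00
    return True
```

A pipeline gate could call this before promoting a build and fail the job with a pointer to the hotfix procedure when it returns `False`.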
|
|
82
|
+
|
|
83
|
+
## Rollback Playbook
|
|
84
|
+
|
|
85
|
+
Every deploy ships a rollback command in the service runbook. Examples:
|
|
86
|
+
|
|
87
|
+
- Argo Rollouts: `kubectl argo rollouts abort <rollout> -n <ns>` then `kubectl argo rollouts undo <rollout> -n <ns>`.
|
|
88
|
+
- Flagger canary: `kubectl patch canary <name> -n <ns> --type merge -p '{"spec":{"skipAnalysis":false}}'` then trigger a revert deployment.
|
|
89
|
+
- Kubernetes deployment (no rollout controller): `kubectl rollout undo deployment/<name> -n <ns>`.
|
|
90
|
+
|
|
91
|
+
Time-to-rollback target: under 5 minutes from rollback decision to traffic on the previous version. If the rollback path was not exercised in the most recent quarterly drill, the deploy is blocked until the path is verified.
|
|
92
|
+
|
|
93
|
+
Database migrations follow expand-contract: the new code reads both old and new schema; the migration runs as a separate deploy; the old code path is removed in a subsequent release. Rollback of a deploy that owns its own destructive migration is not possible — separate the migration from the deploy.
|
|
94
|
+
|
|
95
|
+
## Config-Change Canary
|
|
96
|
+
|
|
97
|
+
Treat configuration as code — same staged rollout cadence, same auto-rollback wiring. CrowdStrike July 2024 and Cloudflare November 2025 both originated in config changes pushed globally without canary.
|
|
98
|
+
|
|
99
|
+
- Feature flags rolled out via OpenFeature targeting rules (1% → 10% → 50% → 100% by user / tenant / region) cover application-level config.
|
|
100
|
+
- Infrastructure config (kernel modules, network policies, service-mesh routing) needs a canary cluster or canary node pool, not a global push.
|
|
101
|
+
|
|
102
|
+
A "config tweak" is a deploy by another name; the change-management envelope is identical.
|
|
103
|
+
|
|
104
|
+
GitOps-managed clusters (Flux, Argo CD) gain canary semantics for free via Flagger or Argo Rollouts. Direct `kubectl apply` to the cluster bypasses the controller and produces audit-trail gaps — block direct cluster writes outside of break-glass procedures.
|
|
105
|
+
|
|
106
|
+
## Cross-References
|
|
107
|
+
|
|
108
|
+
- `rules/hatch3r-operability.md` — kill switches and runbooks for post-rollout recovery.
|
|
109
|
+
- `rules/hatch3r-observability-metrics.md` — SLO definitions and burn-rate alerts that gate the canary.
|
|
110
|
+
- `rules/hatch3r-resilience-patterns.md` — circuit breakers shield the user from a failing canary while the rollback completes.
|
|
111
|
+
|
|
112
|
+
## References
|
|
113
|
+
|
|
114
|
+
- Argo Rollouts — `argoproj.github.io/argo-rollouts`
|
|
115
|
+
- Flagger — `flagger.app`
|
|
116
|
+
- Spinnaker — `spinnaker.io`
|
|
117
|
+
- Kayenta — `github.com/spinnaker/kayenta`
|
|
118
|
+
- Datadog deployment tracking — `docs.datadoghq.com/tracing/services/deployment_tracking`
|
|
119
|
+
- Google SRE workbook, alerting chapter (multi-window multi-burn-rate) — `sre.google/workbook/alerting-on-slos`
|
|
120
|
+
- 2024–2026 outage postmortems: CrowdStrike Jul 2024, AWS us-east-1 Oct 2025, Azure East-US2 Sep 2025, Cloudflare Nov 2025.
|