hatch3r 1.7.0 → 1.7.5
- package/README.md +38 -12
- package/agents/hatch3r-a11y-auditor.md +4 -0
- package/agents/hatch3r-architect.md +5 -1
- package/agents/hatch3r-ci-watcher.md +4 -0
- package/agents/hatch3r-context-rules.md +4 -0
- package/agents/hatch3r-creator.md +4 -0
- package/agents/hatch3r-dependency-auditor.md +4 -0
- package/agents/hatch3r-devops.md +4 -0
- package/agents/hatch3r-docs-writer.md +4 -0
- package/agents/hatch3r-fixer.md +5 -1
- package/agents/hatch3r-handoff-loader.md +243 -0
- package/agents/hatch3r-handoff-preparer.md +134 -0
- package/agents/hatch3r-implementer.md +5 -1
- package/agents/hatch3r-learnings-loader.md +4 -0
- package/agents/hatch3r-lint-fixer.md +4 -0
- package/agents/hatch3r-perf-profiler.md +8 -0
- package/agents/hatch3r-researcher.md +5 -1
- package/agents/hatch3r-reviewer.md +92 -0
- package/agents/hatch3r-security-auditor.md +24 -0
- package/agents/hatch3r-test-writer.md +4 -0
- package/agents/modes/requirements-elicitation.md +5 -1
- package/agents/modes/similar-implementation.md +6 -0
- package/agents/modes/user-flows.md +76 -0
- package/agents/shared/quality-charter.md +129 -0
- package/agents/shared/user-question-protocol.md +95 -0
- package/commands/board/shared-azure-devops.md +2 -0
- package/commands/board/shared-github.md +17 -0
- package/commands/board/shared-gitlab.md +4 -0
- package/commands/hatch3r-board-fill.md +2 -1
- package/commands/hatch3r-board-pickup.md +1 -1
- package/commands/hatch3r-board-shared.md +21 -0
- package/commands/hatch3r-create.md +2 -0
- package/commands/hatch3r-handoff.md +126 -0
- package/commands/hatch3r-pr-resolve.md +672 -0
- package/commands/hatch3r-quick-change.md +5 -3
- package/commands/hatch3r-report.md +167 -0
- package/commands/hatch3r-revision.md +1 -1
- package/commands/hatch3r-workflow.md +3 -1
- package/dist/cli/index.js +3144 -979
- package/dist/cli/index.js.map +1 -1
- package/package.json +4 -2
- package/rules/hatch3r-accessibility-standards.md +21 -0
- package/rules/hatch3r-accessibility-standards.mdc +21 -0
- package/rules/hatch3r-agent-orchestration.md +32 -1
- package/rules/hatch3r-agent-orchestration.mdc +32 -1
- package/rules/hatch3r-ai-evals.md +158 -0
- package/rules/hatch3r-ai-evals.mdc +154 -0
- package/rules/hatch3r-ai-ux-patterns.md +131 -0
- package/rules/hatch3r-ai-ux-patterns.mdc +127 -0
- package/rules/hatch3r-api-design.md +67 -9
- package/rules/hatch3r-api-design.mdc +67 -9
- package/rules/hatch3r-api-versioning.md +119 -0
- package/rules/hatch3r-api-versioning.mdc +115 -0
- package/rules/hatch3r-auth-patterns.md +170 -0
- package/rules/hatch3r-auth-patterns.mdc +166 -0
- package/rules/hatch3r-component-conventions.md +30 -0
- package/rules/hatch3r-component-conventions.mdc +30 -0
- package/rules/hatch3r-container-hardening.md +131 -0
- package/rules/hatch3r-container-hardening.mdc +127 -0
- package/rules/hatch3r-contract-testing.md +117 -0
- package/rules/hatch3r-contract-testing.mdc +113 -0
- package/rules/hatch3r-deep-context.md +3 -1
- package/rules/hatch3r-deep-context.mdc +3 -1
- package/rules/hatch3r-dependency-management.md +73 -1
- package/rules/hatch3r-dependency-management.mdc +72 -0
- package/rules/hatch3r-design-system-detection.md +142 -0
- package/rules/hatch3r-design-system-detection.mdc +138 -0
- package/rules/hatch3r-event-schema-evolution.md +90 -0
- package/rules/hatch3r-event-schema-evolution.mdc +86 -0
- package/rules/hatch3r-handoff-readiness.md +45 -0
- package/rules/hatch3r-handoff-readiness.mdc +40 -0
- package/rules/hatch3r-i18n.md +13 -0
- package/rules/hatch3r-i18n.mdc +13 -0
- package/rules/hatch3r-iteration-summary.md +2 -0
- package/rules/hatch3r-iteration-summary.mdc +2 -0
- package/rules/hatch3r-migrations.md +61 -16
- package/rules/hatch3r-migrations.mdc +61 -16
- package/rules/hatch3r-observability-logging.md +1 -1
- package/rules/hatch3r-observability-logging.mdc +1 -1
- package/rules/hatch3r-observability-metrics.md +1 -1
- package/rules/hatch3r-observability-metrics.mdc +1 -1
- package/rules/hatch3r-observability-tracing-detail.md +1 -1
- package/rules/hatch3r-observability-tracing-detail.mdc +1 -1
- package/rules/hatch3r-observability-tracing.md +1 -1
- package/rules/hatch3r-observability-tracing.mdc +1 -1
- package/rules/hatch3r-observability.md +1 -0
- package/rules/hatch3r-observability.mdc +1 -0
- package/rules/hatch3r-operability.md +149 -0
- package/rules/hatch3r-operability.mdc +145 -0
- package/rules/hatch3r-passkey-server.md +181 -0
- package/rules/hatch3r-passkey-server.mdc +177 -0
- package/rules/hatch3r-progressive-delivery.md +120 -0
- package/rules/hatch3r-progressive-delivery.mdc +116 -0
- package/rules/hatch3r-resilience-patterns.md +154 -0
- package/rules/hatch3r-resilience-patterns.mdc +150 -0
- package/rules/hatch3r-secrets-management.md +29 -0
- package/rules/hatch3r-secrets-management.mdc +29 -0
- package/rules/hatch3r-testing.md +139 -43
- package/rules/hatch3r-testing.mdc +139 -43
- package/rules/hatch3r-ux-states-and-flows.md +149 -0
- package/rules/hatch3r-ux-states-and-flows.mdc +145 -0
- package/skills/hatch3r-a11y-audit/SKILL.md +14 -0
- package/skills/hatch3r-ai-feature/SKILL.md +134 -0
- package/skills/hatch3r-api-spec/SKILL.md +5 -0
- package/skills/hatch3r-architecture-review/SKILL.md +14 -0
- package/skills/hatch3r-bug-fix/SKILL.md +5 -0
- package/skills/hatch3r-ci-pipeline/SKILL.md +14 -0
- package/skills/hatch3r-cli-aichat/SKILL.md +84 -0
- package/skills/hatch3r-cli-ast-grep/SKILL.md +85 -0
- package/skills/hatch3r-cli-az-devops/SKILL.md +89 -0
- package/skills/hatch3r-cli-bat/SKILL.md +85 -0
- package/skills/hatch3r-cli-comby/SKILL.md +85 -0
- package/skills/hatch3r-cli-csvkit/SKILL.md +84 -0
- package/skills/hatch3r-cli-delta/SKILL.md +86 -0
- package/skills/hatch3r-cli-difftastic/SKILL.md +84 -0
- package/skills/hatch3r-cli-docker/SKILL.md +89 -0
- package/skills/hatch3r-cli-duckdb/SKILL.md +84 -0
- package/skills/hatch3r-cli-fd/SKILL.md +85 -0
- package/skills/hatch3r-cli-fzf/SKILL.md +84 -0
- package/skills/hatch3r-cli-gh/SKILL.md +90 -0
- package/skills/hatch3r-cli-glab/SKILL.md +89 -0
- package/skills/hatch3r-cli-jq/SKILL.md +85 -0
- package/skills/hatch3r-cli-lazygit/SKILL.md +78 -0
- package/skills/hatch3r-cli-llm/SKILL.md +84 -0
- package/skills/hatch3r-cli-miller/SKILL.md +84 -0
- package/skills/hatch3r-cli-mods/SKILL.md +84 -0
- package/skills/hatch3r-cli-overview/SKILL.md +60 -0
- package/skills/hatch3r-cli-playwright/SKILL.md +89 -0
- package/skills/hatch3r-cli-podman/SKILL.md +84 -0
- package/skills/hatch3r-cli-ripgrep/SKILL.md +85 -0
- package/skills/hatch3r-cli-rtk/SKILL.md +91 -0
- package/skills/hatch3r-cli-sd/SKILL.md +85 -0
- package/skills/hatch3r-cli-stagehand/SKILL.md +79 -0
- package/skills/hatch3r-cli-taplo/SKILL.md +84 -0
- package/skills/hatch3r-cli-xsv/SKILL.md +89 -0
- package/skills/hatch3r-cli-yq/SKILL.md +85 -0
- package/skills/hatch3r-cli-zstd/SKILL.md +85 -0
- package/skills/hatch3r-context-health/SKILL.md +14 -0
- package/skills/hatch3r-cost-tracking/SKILL.md +14 -0
- package/skills/hatch3r-customize/SKILL.md +14 -0
- package/skills/hatch3r-dep-audit/SKILL.md +14 -0
- package/skills/hatch3r-design-system-detect/SKILL.md +162 -0
- package/skills/hatch3r-feature/SKILL.md +2 -0
- package/skills/hatch3r-gh-agentic-workflows/SKILL.md +13 -0
- package/skills/hatch3r-handoff-prepare/SKILL.md +160 -0
- package/skills/hatch3r-handoff-resume/SKILL.md +171 -0
- package/skills/hatch3r-incident-response/SKILL.md +14 -0
- package/skills/hatch3r-issue-workflow/SKILL.md +5 -0
- package/skills/hatch3r-logical-refactor/SKILL.md +14 -0
- package/skills/hatch3r-migration/SKILL.md +14 -0
- package/skills/hatch3r-observability-verify/SKILL.md +133 -0
- package/skills/hatch3r-perf-audit/SKILL.md +14 -0
- package/skills/hatch3r-pr-creation/SKILL.md +14 -0
- package/skills/hatch3r-qa-validation/SKILL.md +18 -0
- package/skills/hatch3r-recipe/SKILL.md +14 -0
- package/skills/hatch3r-refactor/SKILL.md +14 -0
- package/skills/hatch3r-release/SKILL.md +14 -0
- package/skills/hatch3r-reliability-verify/SKILL.md +144 -0
- package/skills/hatch3r-ui-ux-verify/SKILL.md +136 -0
- package/skills/hatch3r-visual-refactor/SKILL.md +15 -1
package/rules/hatch3r-i18n.md
CHANGED
|
@@ -93,3 +93,16 @@ ICU MessageFormat 2.0 reached Final Candidate status in CLDR 46.1 (January 2025)
|
|
|
93
93
|
- **Migration strategy:** New translation keys should use MF2 syntax. Existing MF1 keys can be migrated incrementally — both syntaxes can coexist during transition.
|
|
94
94
|
- **Tooling:** Verify that your translation management system (TMS) supports MF2 syntax before migrating. Test with a small key set first.
|
|
95
95
|
- **Stability:** The MF2 specification has stability guarantees post-approval (mid-2025). Syntax and semantics will not change incompatibly after that point.
|
|
96
|
+
|
|
97
|
+
## Microcopy and Tone
|
|
98
|
+
|
|
99
|
+
Translation strings are user-facing copy — write them as product copy, not as technical labels.
|
|
100
|
+
|
|
101
|
+
- Use plain language. Default to second person ("you", "your") for end-user surfaces.
|
|
102
|
+
- Use a corrective verb in error messages: "Try again", "Reconnect", "Enter a valid email" — not "Error" or "Oops".
|
|
103
|
+
- Never expose to end users: protocol acronyms ("FIDO2", "WebAuthn"), raw HTTP status codes ("500", "401"), language sentinel values (`null`, `undefined`), or internal record/ID strings. Translate these into a user-visible cause + recovery.
|
|
104
|
+
- CTA labels are action-oriented and specific: "Save changes" beats "Submit"; "Delete project" beats "Confirm"; "Send invite" beats "OK".
|
|
105
|
+
- Error tone explains the cause and offers a recovery path. Do not blame the user. Replace "You entered an invalid value" with "This field needs a valid email address — for example, name@example.com".
|
|
106
|
+
- Use ICU MessageFormat (1.0 or 2.0 per the MF2 section above) for every plural, gender, and select pattern. Never concatenate translated fragments to build a sentence — each complete sentence is a single translation key with its own placeholders.
|
|
107
|
+
- Tone source-of-truth: the GOV.UK Design System content style guide (https://design-system.service.gov.uk/styles/) and IBM Carbon Design System voice and tone guidance (https://carbondesignsystem.com/guidelines/content/general/) — cite both when reviewing tone or training a translator.
|
|
108
|
+
- Cross-reference `rules/hatch3r-ux-states-and-flows.md` Microcopy subsection for the state-driven copy patterns (loading, empty, error, partial) that share this tone contract.
|
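As a rough illustration of the microcopy rules in this hunk (complete sentences as single keys, no raw status codes shown to users), the sketch below maps internal failures to catalog keys. The catalog layout, the key names, the sample wording, and the `translate` / `user_message` helpers are all hypothetical, and Python's `str.format` only stands in for a real ICU MessageFormat library.

```python
# Hypothetical message catalog: every entry is one complete sentence with
# named placeholders, so translators can reorder words freely.
CATALOG = {
    "en": {
        "invite.sent": "Invite sent to {email}.",
        "error.session_expired": "Your session ended. Sign in again to continue.",
        "error.server": "We couldn't save your changes. Try again in a few minutes.",
        "error.unknown": "Something went wrong. Try again.",
    },
    "de": {
        "invite.sent": "Einladung an {email} gesendet.",
    },
}

# Internal signals (HTTP status codes here) map to keys, never to display text.
STATUS_TO_KEY = {401: "error.session_expired", 500: "error.server"}


def translate(locale: str, key: str, **params: str) -> str:
    """Look up a whole sentence; fall back to English when the locale lacks the key."""
    template = CATALOG.get(locale, {}).get(key) or CATALOG["en"][key]
    return template.format(**params)


def user_message(status: int, locale: str = "en") -> str:
    """Turn an internal status code into a cause-plus-recovery sentence."""
    key = STATUS_TO_KEY.get(status, "error.unknown")
    return translate(locale, key)
```

The German entry shows why fragments should not be concatenated: the placeholder sits mid-sentence there, which a "prefix + email" construction could not express.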
package/rules/hatch3r-i18n.mdc
CHANGED
|
@@ -88,3 +88,16 @@ ICU MessageFormat 2.0 reached Final Candidate status in CLDR 46.1 (January 2025)
|
|
|
88
88
|
- **Migration strategy:** New translation keys should use MF2 syntax. Existing MF1 keys can be migrated incrementally — both syntaxes can coexist during transition.
|
|
89
89
|
- **Tooling:** Verify that your translation management system (TMS) supports MF2 syntax before migrating. Test with a small key set first.
|
|
90
90
|
- **Stability:** The MF2 specification has stability guarantees post-approval (mid-2025). Syntax and semantics will not change incompatibly after that point.
|
|
91
|
+
|
|
92
|
+
## Microcopy and Tone
|
|
93
|
+
|
|
94
|
+
Translation strings are user-facing copy — write them as product copy, not as technical labels.
|
|
95
|
+
|
|
96
|
+
- Use plain language. Default to second person ("you", "your") for end-user surfaces.
|
|
97
|
+
- Use a corrective verb in error messages: "Try again", "Reconnect", "Enter a valid email" — not "Error" or "Oops".
|
|
98
|
+
- Never expose to end users: protocol acronyms ("FIDO2", "WebAuthn"), raw HTTP status codes ("500", "401"), language sentinel values (`null`, `undefined`), or internal record/ID strings. Translate these into a user-visible cause + recovery.
|
|
99
|
+
- CTA labels are action-oriented and specific: "Save changes" beats "Submit"; "Delete project" beats "Confirm"; "Send invite" beats "OK".
|
|
100
|
+
- Error tone explains the cause and offers a recovery path. Do not blame the user. Replace "You entered an invalid value" with "This field needs a valid email address — for example, name@example.com".
|
|
101
|
+
- Use ICU MessageFormat (1.0 or 2.0 per the MF2 section above) for every plural, gender, and select pattern. Never concatenate translated fragments to build a sentence — each complete sentence is a single translation key with its own placeholders.
|
|
102
|
+
- Tone source-of-truth: the GOV.UK Design System content style guide (https://design-system.service.gov.uk/styles/) and IBM Carbon Design System voice and tone guidance (https://carbondesignsystem.com/guidelines/content/general/) — cite both when reviewing tone or training a translator.
|
|
103
|
+
- Cross-reference `rules/hatch3r-ux-states-and-flows.md` Microcopy subsection for the state-driven copy patterns (loading, empty, error, partial) that share this tone contract.
|
package/rules/hatch3r-iteration-summary.md
CHANGED
|
@@ -16,6 +16,8 @@ Every iteration with the user ends with the canonical block defined below — no
|
|
|
16
16
|
|
|
17
17
|
Every user-facing iteration, regardless of size — multi-step coding tasks, single-file edits, read-only answers, failed or blocked attempts. No exceptions.
|
|
18
18
|
|
|
19
|
+
The per-turn pipeline-state header (defined in `hatch3r-agent-orchestration` → Per-Turn Pipeline-State Header) is a separate start-of-turn artifact and does not replace this end-of-turn block.
|
|
20
|
+
|
|
19
21
|
## The Required Block
|
|
20
22
|
|
|
21
23
|
Use this exact shape with these exact field names:
|
package/rules/hatch3r-iteration-summary.mdc
CHANGED
|
@@ -11,6 +11,8 @@ Every iteration with the user ends with the canonical block defined below — no
|
|
|
11
11
|
|
|
12
12
|
Every user-facing iteration, regardless of size — multi-step coding tasks, single-file edits, read-only answers, failed or blocked attempts. No exceptions.
|
|
13
13
|
|
|
14
|
+
The per-turn pipeline-state header (defined in `hatch3r-agent-orchestration` → Per-Turn Pipeline-State Header) is a separate start-of-turn artifact and does not replace this end-of-turn block.
|
|
15
|
+
|
|
14
16
|
## The Required Block
|
|
15
17
|
|
|
16
18
|
Use this exact shape with these exact field names:
|
package/rules/hatch3r-migrations.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
id: hatch3r-migrations
|
|
3
3
|
type: rule
|
|
4
|
-
description: Database migration and schema change patterns
|
|
4
|
+
description: Database migration and schema change patterns — expand-contract, online DDL, backfills, compatibility windows, reversibility, multi-region, tooling
|
|
5
5
|
scope: "**/migrations/**,**/*migration*,**/migrate/**,**/seeds/**,**/seeders/**,**/prisma/migrations/**,**/drizzle/**,**/knex/**"
|
|
6
6
|
tags: [implementation, brownfield]
|
|
7
7
|
quality_charter: agents/shared/quality-charter.md
|
|
@@ -9,23 +9,68 @@ cache_friendly: true
|
|
|
9
9
|
---
|
|
10
10
|
# Migrations
|
|
11
11
|
|
|
12
|
-
- Schema changes must be backward-compatible. Add fields with defaults; never remove or rename without migration.
|
|
13
12
|
- Migration scripts live in a dedicated `migrations/` directory. One script per migration.
|
|
14
|
-
- Every migration is idempotent
|
|
15
|
-
-
|
|
16
|
-
-
|
|
17
|
-
- Order: deploy new code (handles old + new schema) → run migration → remove old schema handling.
|
|
18
|
-
- Document schema changes in project data model spec.
|
|
19
|
-
- Rollback plan required for every migration. Never run destructive migrations without backup verification.
|
|
20
|
-
- Hot documents must stay within size limits after migration.
|
|
13
|
+
- Every migration is idempotent (re-running produces the same result). Use a version column, `migratedAt` timestamp, or migration ledger row to skip already-applied work.
|
|
14
|
+
- Test every migration against an emulator or staging dataset before production. Verify data integrity after each step, not just at the end.
|
|
15
|
+
- Document the schema change in the project data model spec. Hot documents must stay within size limits after migration.
|
|
21
16
|
|
|
22
|
-
##
|
|
17
|
+
## Expand-Contract Pattern (mandatory for non-trivial schema changes)
|
|
23
18
|
|
|
24
|
-
-
|
|
25
|
-
- Include count checks: the number of records processed should match the number of records in the source collection. Log discrepancies as errors, not warnings.
|
|
26
|
-
- For large datasets, migrate in batches with progress checkpoints. If a batch fails, resume from the last checkpoint rather than restarting the entire migration.
|
|
19
|
+
Non-trivial = anything beyond pure-additive nullable columns on small tables, or any rename/drop/type-change. Use a 3-deploy cadence; split Migrate into two deploys when dual-write is required (4 deploys total).
|
|
27
20
|
|
|
28
|
-
|
|
21
|
+
1. **Deploy 1 — Expand.** Add new column nullable, add new table, or `CREATE INDEX CONCURRENTLY`. Add new constraints with `NOT VALID` first. Old code paths still work. No app behavior change in this deploy.
|
|
22
|
+
2. **Deploy 2 — Migrate (backfill + dual-write).** Run a batched, idempotent, resumable backfill job. If the change is a column rename / type swap, app code writes to both old and new columns during this phase. Validate row counts and per-block checksums on the new shape before proceeding.
|
|
23
|
+
3. **Deploy 3 — Contract.** Switch reads to the new shape (feature-flag-gated; flip is the rollback). Drop the old column, old table, or old index. Wait at least one full release cycle plus one on-call rotation between Expand and Contract — old code must remain executable to roll back inside the deploy window.
|
|
29
24
|
|
|
30
|
-
|
|
31
|
-
|
|
25
|
+
Hard rules: never rename a column in a single step; never add a `NOT NULL` column to a populated table without a default or a deferred `SET NOT NULL NOT VALID` → `VALIDATE`; every phase must be valid in isolation so that any deploy is independently rollbackable.
|
|
26
|
+
|
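The three deploys above can be sketched for a column rename. This is a minimal illustration under stated assumptions, not the package's implementation: it uses SQLite from the Python standard library, a hypothetical `users` table being renamed from `fullname` to `display_name`, and a plain boolean in place of a real feature flag.

```python
import sqlite3


def expand(conn):
    # Deploy 1: add the new column as nullable; old code paths keep working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def write_name(conn, user_id, name):
    # Deploy 2: dual-write so readers of either column see the same value.
    conn.execute(
        "UPDATE users SET fullname = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )


def backfill(conn):
    # Deploy 2: idempotent backfill; rows that already have a value are skipped.
    # A single statement is used only because the demo table is tiny.
    conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def read_name(conn, user_id, use_new_shape):
    # Deploy 3: the flag selects the read path; flipping it back is the rollback.
    old, new = conn.execute(
        "SELECT fullname, display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if use_new_shape and new is not None:
        return new
    return old  # fall back to the old shape while the new one is unpopulated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO users (id, fullname) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
expand(conn)
write_name(conn, 1, "Ada L.")
backfill(conn)
```

Dropping `fullname` is deliberately left out of the sketch: per the rule, it belongs to a later deploy, after the compatibility window has passed.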
|
27
|
+
## Online Schema Changes
|
|
28
|
+
|
|
29
|
+
Set `lock_timeout` and `statement_timeout` before every DDL statement to bound blast radius. Selection by engine:
|
|
30
|
+
|
|
31
|
+
- **Postgres 18.x.** Use `CREATE INDEX CONCURRENTLY` (outside any transaction block — disable the migration tool's transaction wrapper). On failure, the index is left `INVALID`; emit a `DROP INDEX IF EXISTS` + retry step. For FK and CHECK constraints, use `ALTER TABLE ... ADD CONSTRAINT ... NOT VALID` followed later by `VALIDATE CONSTRAINT` (skips full scan, downgrades to `SHARE UPDATE EXCLUSIVE`). Postgres 18 also supports `SET NOT NULL NOT VALID` for column nullability. Use `pg_repack` 1.5.x for bloat removal instead of `VACUUM FULL`. Avoid `ALTER TABLE ... ADD COLUMN ... DEFAULT non_constant_expression` on large tables — it rewrites every row.
|
|
32
|
+
- **MySQL 8.4 LTS.** `ALGORITHM=INSTANT` is the default for many metadata ops (ADD COLUMN at end, RENAME COLUMN, some index meta) — verify against the 8.4 online DDL operations matrix. Hard limit: 64 row versions per table in 8.4. When `INSTANT` is rejected, fall back to `ALGORITHM=INPLACE`. For `ALGORITHM=COPY` operations on large tables, use `gh-ost` v1.1.8 (trigger-free, binlog-based, checkpoint + resume + revert) when the table has no incoming FKs and the cluster is not Galera / Percona XtraDB. Use `pt-online-schema-change` when FKs are present (`--alter-foreign-keys-method`) or under Galera. `lhm` is unmaintained — do not propose it for new code.
|
|
33
|
+
|
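For the Postgres path, the ordering of statements matters more than any single one. The sketch below only builds the SQL strings in the order the rule describes; it does not connect to a database. The helper names, the index and constraint naming scheme, and the default timeout values are assumptions for illustration, and the statements would need to run outside a transaction block.

```python
import re

_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")
_TIMEOUT = re.compile(r"^\d+(ms|s|min|h)$")


def _ident(name: str) -> str:
    """Accept only plain lowercase identifiers so names can be interpolated safely."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def _timeouts(lock_timeout: str, statement_timeout: str) -> list[str]:
    """Bound lock waits and total runtime before any DDL statement."""
    for value in (lock_timeout, statement_timeout):
        if not _TIMEOUT.match(value):
            raise ValueError(f"unexpected timeout value: {value!r}")
    return [
        f"SET lock_timeout = '{lock_timeout}'",
        f"SET statement_timeout = '{statement_timeout}'",
    ]


def concurrent_index_plan(table, column, lock_timeout="5s", statement_timeout="30min"):
    table, column = _ident(table), _ident(column)
    index = f"{table}_{column}_idx"  # hypothetical naming scheme
    return _timeouts(lock_timeout, statement_timeout) + [
        # Clears an INVALID index left behind by an earlier failed attempt.
        f"DROP INDEX CONCURRENTLY IF EXISTS {index}",
        f"CREATE INDEX CONCURRENTLY {index} ON {table} ({column})",
    ]


def foreign_key_plan(table, column, ref_table, lock_timeout="5s", statement_timeout="30min"):
    table, column, ref_table = _ident(table), _ident(column), _ident(ref_table)
    name = f"{table}_{column}_fkey"  # hypothetical naming scheme
    return _timeouts(lock_timeout, statement_timeout) + [
        # Step 1: register the constraint without checking existing rows.
        f"ALTER TABLE {table} ADD CONSTRAINT {name} "
        f"FOREIGN KEY ({column}) REFERENCES {ref_table} NOT VALID",
        # Step 2 (later): check existing rows under a weaker lock.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {name}",
    ]
```

Calling `concurrent_index_plan("orders", "customer_id")` yields four statements: the two timeouts, the cleanup drop, then the concurrent build.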
|
34
|
+
## Backfill Jobs
|
|
35
|
+
|
|
36
|
+
Every backfill must be batched, idempotent, resumable, throttled, and observable.
|
|
37
|
+
|
|
38
|
+
- **Batched.** Order by PK or a monotonic key. Chunk by `id BETWEEN ? AND ?` (range), not `LIMIT/OFFSET` — offsets drift under concurrent writes. Default chunk 1k–10k rows; tune by table width.
|
|
39
|
+
- **Idempotent.** Write `UPDATE ... SET new = f(old) WHERE id = ? AND new IS NULL` (or upsert with a deterministic source-derived value). Re-running on the same range must produce the same final state.
|
|
40
|
+
- **Resumable.** Persist the last-processed boundary (`last_id` or timestamp cursor) to a control table after each batch commit. Resume from the checkpoint on restart; never restart from zero on partial failure.
|
|
41
|
+
- **Throttled.** Poll replication lag (`pg_stat_replication`, `SHOW REPLICA STATUS`) between batches; pause when lag exceeds 30 seconds or the SLO threshold. Cap concurrency at the IO budget of the slowest replica.
|
|
42
|
+
- **Observable.** Emit `migration.backfill.rows_processed` (counter), `migration.backfill.error_rate` (counter), `migration.backfill.eta_seconds` (gauge), and `migration.backfill.current_boundary` (gauge). Wire dashboards before launch. Avoid single mega-DML — one `UPDATE` over 50M+ rows produces multi-hour locks and table bloat.
|
|
43
|
+
|
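The batched, idempotent, and resumable properties above can be sketched together. This is an illustration under assumptions rather than the package's code: it uses SQLite from the Python standard library, a hypothetical `users` table with `email` and `email_lower` columns, and a hypothetical `backfill_checkpoint` control table. Throttling and metrics are omitted.

```python
import sqlite3


def run_backfill(conn, job="users_email_lower", chunk=1000, max_batches=None):
    """Backfill users.email_lower in id ranges, resuming from the stored checkpoint."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_checkpoint "
        "(job TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM backfill_checkpoint WHERE job = ?", (job,)
    ).fetchone()
    last_id = row[0] if row else 0
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]

    batches = 0
    while last_id < max_id and (max_batches is None or batches < max_batches):
        upper = min(last_id + chunk, max_id)
        with conn:  # one transaction per batch
            # Range chunking by primary key, not LIMIT/OFFSET.
            # "email_lower IS NULL" makes a re-run over the same range a no-op.
            conn.execute(
                "UPDATE users SET email_lower = lower(email) "
                "WHERE id > ? AND id <= ? AND email_lower IS NULL",
                (last_id, upper),
            )
            conn.execute(
                "INSERT OR REPLACE INTO backfill_checkpoint (job, last_id) VALUES (?, ?)",
                (job, upper),
            )
        last_id = upper
        batches += 1
    return last_id
```

The sketch writes the checkpoint in the same transaction as the batch, which is one way to keep the two from drifting apart; `max_batches` exists only to simulate an interrupted run.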
|
44
|
+
## Compatibility Window
|
|
45
|
+
|
|
46
|
+
Schema changes deploy before the code that depends on them when widening (add column, add table, add index). Schema changes deploy after the code that no longer depends on them when narrowing (drop column, drop table). During the window, app code reads both shapes — the new shape if populated, fall back to the old shape otherwise. Rollback compatibility (old code remains executable against the current schema) must hold for at least 1 full release cycle plus 1 on-call rotation — minimum 7 calendar days, longer when the on-call rotation is longer.
|
|
47
|
+
|
|
48
|
+
## Reversibility
|
|
49
|
+
|
|
50
|
+
Every migration ships a tested down-migration script. Forward-only migrations are permitted only when the operation is data-destructive (e.g., a `DROP COLUMN` after Contract) — these require an explicit `IRREVERSIBLE: <reason>` annotation in the migration header and reviewer sign-off. A compensating forward migration that restores the prior shape is acceptable in place of a down-script for tools that lack reversibility (Prisma Migrate, Drizzle Kit — surface the gap to the reviewer). Default for every migration: reversible.
|
|
51
|
+
|
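The annotation requirement above lends itself to a simple CI check. The sketch below is a toy stand-in, not Atlas or `squawk`: the regular expressions, the `lint_migration` name, and the choice of a `--` SQL comment as the header format are assumptions, and it will miss forms such as `DROP` without the `COLUMN` keyword.

```python
import re

# Statements treated as data-destructive in this toy check.
DESTRUCTIVE = re.compile(r"\b(DROP\s+(TABLE|COLUMN)|TRUNCATE)\b", re.IGNORECASE)
# Header line such as: -- IRREVERSIBLE: column unused since Contract
ANNOTATION = re.compile(r"^--[ \t]*IRREVERSIBLE:[ \t]*\S", re.MULTILINE)


def lint_migration(sql: str) -> list[str]:
    """Return lint errors for the SQL text of one migration file."""
    errors = []
    # Strip line comments before scanning so a commented-out DROP is not flagged.
    code = re.sub(r"--[^\n]*", "", sql)
    if DESTRUCTIVE.search(code) and not ANNOTATION.search(sql):
        errors.append("destructive statement without an 'IRREVERSIBLE: <reason>' header")
    return errors
```

A CI job could run this over every changed migration file and fail the build when the returned list is non-empty.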
|
52
|
+
## Data Integrity Verification
|
|
53
|
+
|
|
54
|
+
Apply layered verification from cheapest to most thorough; stop at the cheapest layer that detects no drift.
|
|
55
|
+
|
|
56
|
+
1. **Pre-migration backup drill.** Full restore to staging plus a smoke query within 24 hours prior to a destructive migration. "Backup exists" is not verification.
|
|
57
|
+
2. **Row-count parity per chunk.** Source rows processed equals target rows written. Log discrepancies as errors, not warnings.
|
|
58
|
+
3. **Aggregate checks.** SUM, MIN, MAX, COUNT(DISTINCT) on numeric and date columns per partition or batch.
|
|
59
|
+
4. **Per-block checksums.** SHA-256 or MD5 over concatenated key columns for blocks of N rows (e.g., `md5(string_agg(id::text || col::text, ',' ORDER BY id))`).
|
|
60
|
+
5. **Cross-system diff.** Datafold Reconcile, dbt-data-diff, or a hand-rolled sample-then-drill comparison for value-level differences.
|
|
61
|
+
6. **Canary dual-read.** Read both shapes in production for 24–72 hours before cutover; shadow-diff and alert on mismatch.
|
|
62
|
+
7. **Reconciliation control table.** Per-batch row count plus checksum stored alongside the checkpoint; auto-stop the backfill on drift above the configured threshold.
|
|
63
|
+
|
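Layers 2 and 4 (row-count parity and per-block checksums) can be combined in one pass. The sketch below is a minimal illustration with SQLite and `hashlib` from the Python standard library; the function names are hypothetical, table and column names are assumed to come from trusted migration code, and the hashing is done client-side rather than with a database function such as `md5(string_agg(...))`.

```python
import hashlib
import sqlite3


def block_checksums(conn, table, columns, block_size=1000):
    """Return [(first_id, row_count, sha256_hex)] for consecutive blocks ordered by id."""
    cols = ", ".join(["id", *columns])
    cursor = conn.execute(f"SELECT {cols} FROM {table} ORDER BY id")
    blocks = []
    while True:
        rows = cursor.fetchmany(block_size)
        if not rows:
            break
        digest = hashlib.sha256()
        for row in rows:
            # repr() keeps NULL (None) distinct from the strings 'None' and ''.
            digest.update(repr(row).encode("utf-8"))
            digest.update(b"\n")
        blocks.append((rows[0][0], len(rows), digest.hexdigest()))
    return blocks


def first_drift(source_blocks, target_blocks):
    """Return the first_id of the first mismatching block, or None when all match."""
    for src, dst in zip(source_blocks, target_blocks):
        if src != dst:  # differs in start id, row count, or checksum
            return src[0]
    if len(source_blocks) != len(target_blocks):
        longer = max(source_blocks, target_blocks, key=len)
        return longer[min(len(source_blocks), len(target_blocks))][0]
    return None
```

Because each tuple carries the row count as well as the digest, a missing row and a changed value both surface as drift in the same block, which narrows the follow-up value-level diff to N rows.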
|
64
|
+
## Multi-Region & Replica Lag
|
|
65
|
+
|
|
66
|
+
- Pause backfill writes when any replica lag exceeds 30 seconds (or the project's lag SLO, whichever is lower). Resume only after lag returns to baseline for 5 consecutive minutes.
|
|
67
|
+
- Roll migrations across regions sequentially; never alter an active partition during the peak traffic window of any region.
|
|
68
|
+
- FK validation (`VALIDATE CONSTRAINT`) reads the entire dependent table — schedule outside peak read windows on replica-heavy topologies.
|
|
69
|
+
- For Postgres major-version upgrades, use native logical replication (PG17+ preserves slots through `pg_upgrade`); advance sequences manually at cutover — logical replication does not replicate sequences, DDL, or large objects.
|
|
70
|
+
- For ongoing cross-system replication, prefer Debezium (when Kafka is already deployed) or AWS DMS (managed, AWS-native). DMS hard limit: 200 tasks per replication instance — relevant for schema-per-tenant designs.
|
|
71
|
+
|
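The pause and resume rule in the first bullet is a small state machine. The sketch below is one possible reading of it, with assumptions called out: the `LagGate` class is hypothetical, "baseline" is modeled as a fixed `baseline_s` threshold (the rule does not define one), and the caller supplies both the measured lag and the current time, however those are obtained.

```python
class LagGate:
    """Pause above max_lag_s; resume once lag has stayed at baseline for settle_s."""

    def __init__(self, max_lag_s=30.0, baseline_s=5.0, settle_s=300.0):
        self.max_lag_s = max_lag_s    # pause threshold (30 seconds in the rule)
        self.baseline_s = baseline_s  # assumed definition of "baseline"
        self.settle_s = settle_s      # 5 consecutive minutes
        self.paused = False
        self._calm_since = None

    def allow_batch(self, lag_s: float, now_s: float) -> bool:
        """Record one lag sample and report whether the next batch may run."""
        if not self.paused:
            if lag_s > self.max_lag_s:
                self.paused = True
                self._calm_since = None
            return not self.paused

        if lag_s <= self.baseline_s:
            if self._calm_since is None:
                self._calm_since = now_s  # start of the calm streak
            if now_s - self._calm_since >= self.settle_s:
                self.paused = False
                self._calm_since = None
        else:
            self._calm_since = None  # any sample above baseline resets the streak
        return not self.paused
```

Passing the clock in as `now_s` keeps the gate deterministic in tests; a backfill loop would call `allow_batch` with the worst replica's lag before each batch and sleep while it returns `False`.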
|
72
|
+
## Tooling Mandate
|
|
73
|
+
|
|
74
|
+
Pick one schema-management tool per project and commit the schema declaration to the repo. Greenfield default: Atlas (50+ destructive/locking linters, auto-generated down migrations, GitHub Actions approval policies) or dbmate (plain-SQL portability with first-class `-- migrate:down`). Existing-project default: whatever already ships migrations in the repo. Acceptable tools: Atlas, Prisma Migrate (forward-only — surface to reviewer), Drizzle Kit (forward-only — surface), Flyway 11+, Liquibase 4.27+ Pro, sqitch, Alembic, Knex, dbmate, Bytebase. Run a migration linter in CI — Atlas analyze, `squawk` for raw Postgres SQL — fail the PR on destructive operations without an explicit `IRREVERSIBLE:` annotation.
|
|
75
|
+
|
|
76
|
+
Cross-references: see `hatch3r-data-classification` (PII / encrypted-column migration requirements), `hatch3r-feature-flags` (read-path switchover gating), `hatch3r-observability-metrics` (backfill progress metrics).
|
package/rules/hatch3r-migrations.mdc
CHANGED
|
@@ -1,27 +1,72 @@
|
|
|
1
1
|
---
|
|
2
|
-
description: Database migration and schema change patterns
|
|
2
|
+
description: Database migration and schema change patterns — expand-contract, online DDL, backfills, compatibility windows, reversibility, multi-region, tooling
|
|
3
3
|
globs: ["**/migrations/**", "**/*migration*", "**/migrate/**", "**/seeds/**", "**/seeders/**", "**/prisma/migrations/**", "**/drizzle/**", "**/knex/**"]
|
|
4
4
|
alwaysApply: false
|
|
5
5
|
---
|
|
6
6
|
# Migrations
|
|
7
7
|
|
|
8
|
-
- Schema changes must be backward-compatible. Add fields with defaults; never remove or rename without migration.
|
|
9
8
|
- Migration scripts live in a dedicated `migrations/` directory. One script per migration.
|
|
10
|
-
- Every migration is idempotent
|
|
11
|
-
-
|
|
12
|
-
-
|
|
13
|
-
- Order: deploy new code (handles old + new schema) → run migration → remove old schema handling.
|
|
14
|
-
- Document schema changes in project data model spec.
|
|
15
|
-
- Rollback plan required for every migration. Never run destructive migrations without backup verification.
|
|
16
|
-
- Hot documents must stay within size limits after migration.
|
|
9
|
+
- Every migration is idempotent (re-running produces the same result). Use a version column, `migratedAt` timestamp, or migration ledger row to skip already-applied work.
|
|
10
|
+
- Test every migration against an emulator or staging dataset before production. Verify data integrity after each step, not just at the end.
|
|
11
|
+
- Document the schema change in the project data model spec. Hot documents must stay within size limits after migration.
|
|
17
12
|
|
|
18
|
-
##
|
|
13
|
+
## Expand-Contract Pattern (mandatory for non-trivial schema changes)
|
|
19
14
|
|
|
20
|
-
-
|
|
21
|
-
- Include count checks: the number of records processed should match the number of records in the source collection. Log discrepancies as errors, not warnings.
|
|
22
|
-
- For large datasets, migrate in batches with progress checkpoints. If a batch fails, resume from the last checkpoint rather than restarting the entire migration.
|
|
15
|
+
Non-trivial = anything beyond pure-additive nullable columns on small tables, or any rename/drop/type-change. Use a 3-deploy cadence; split Migrate into two deploys when dual-write is required (4 deploys total).
|
|
23
16
|
|
|
24
|
-
|
|
17
|
+
1. **Deploy 1 — Expand.** Add new column nullable, add new table, or `CREATE INDEX CONCURRENTLY`. Add new constraints with `NOT VALID` first. Old code paths still work. No app behavior change in this deploy.
|
|
18
|
+
2. **Deploy 2 — Migrate (backfill + dual-write).** Run a batched, idempotent, resumable backfill job. If the change is a column rename / type swap, app code writes to both old and new columns during this phase. Validate row counts and per-block checksums on the new shape before proceeding.
|
|
19
|
+
3. **Deploy 3 — Contract.** Switch reads to the new shape (feature-flag-gated; flip is the rollback). Drop the old column, old table, or old index. Wait at least one full release cycle plus one on-call rotation between Expand and Contract — old code must remain executable to roll back inside the deploy window.
|
|
25
20
|
|
|
26
|
-
|
|
27
|
-
|
|
21
|
+
Hard rules: never rename a column in a single step; never add a `NOT NULL` column to a populated table without a default or a deferred `SET NOT NULL NOT VALID` → `VALIDATE`; every phase must be valid in isolation so that any deploy is independently rollbackable.
|
|
22
|
+
|
|
23
|
+
## Online Schema Changes
|
|
24
|
+
|
|
25
|
+
Set `lock_timeout` and `statement_timeout` before every DDL statement to bound blast radius. Selection by engine:
|
|
26
|
+
|
|
27
|
+
- **Postgres 18.x.** Use `CREATE INDEX CONCURRENTLY` (outside any transaction block — disable the migration tool's transaction wrapper). On failure, the index is left `INVALID`; emit a `DROP INDEX IF EXISTS` + retry step. For FK and CHECK constraints, use `ALTER TABLE ... ADD CONSTRAINT ... NOT VALID` followed later by `VALIDATE CONSTRAINT` (skips full scan, downgrades to `SHARE UPDATE EXCLUSIVE`). Postgres 18 also supports `SET NOT NULL NOT VALID` for column nullability. Use `pg_repack` 1.5.x for bloat removal instead of `VACUUM FULL`. Avoid `ALTER TABLE ... ADD COLUMN ... DEFAULT non_constant_expression` on large tables — it rewrites every row.
|
|
28
|
+
- **MySQL 8.4 LTS.** `ALGORITHM=INSTANT` is the default for many metadata ops (ADD COLUMN at end, RENAME COLUMN, some index meta) — verify against the 8.4 online DDL operations matrix. Hard limit: 64 row versions per table in 8.4. When `INSTANT` is rejected, fall back to `ALGORITHM=INPLACE`. For `ALGORITHM=COPY` operations on large tables, use `gh-ost` v1.1.8 (trigger-free, binlog-based, checkpoint + resume + revert) when the table has no incoming FKs and the cluster is not Galera / Percona XtraDB. Use `pt-online-schema-change` when FKs are present (`--alter-foreign-keys-method`) or under Galera. `lhm` is unmaintained — do not propose it for new code.
|
|
29
|
+
|
|
30
|
+
## Backfill Jobs
|
|
31
|
+
|
|
32
|
+
Every backfill must be batched, idempotent, resumable, throttled, and observable.
|
|
33
|
+
|
|
34
|
+
- **Batched.** Order by PK or a monotonic key. Chunk by `id BETWEEN ? AND ?` (range), not `LIMIT/OFFSET` — offsets drift under concurrent writes. Default chunk 1k–10k rows; tune by table width.
|
|
35
|
+
- **Idempotent.** Write `UPDATE ... SET new = f(old) WHERE id = ? AND new IS NULL` (or upsert with a deterministic source-derived value). Re-running on the same range must produce the same final state.
|
|
36
|
+
- **Resumable.** Persist the last-processed boundary (`last_id` or timestamp cursor) to a control table after each batch commit. Resume from the checkpoint on restart; never restart from zero on partial failure.
|
|
37
|
+
- **Throttled.** Poll replication lag (`pg_stat_replication`, `SHOW REPLICA STATUS`) between batches; pause when lag exceeds 30 seconds or the SLO threshold. Cap concurrency at the IO budget of the slowest replica.
|
|
38
|
+
- **Observable.** Emit `migration.backfill.rows_processed` (counter), `migration.backfill.error_rate` (counter), `migration.backfill.eta_seconds` (gauge), and `migration.backfill.current_boundary` (gauge). Wire dashboards before launch. Avoid single mega-DML — one `UPDATE` over 50M+ rows produces multi-hour locks and table bloat.
|
|
39
|
+
|
|
40
|
+
## Compatibility Window
|
|
41
|
+
|
|
42
|
+
Schema changes deploy before the code that depends on them when widening (add column, add table, add index). Schema changes deploy after the code that no longer depends on them when narrowing (drop column, drop table). During the window, app code reads both shapes — the new shape if populated, fall back to the old shape otherwise. Rollback compatibility (old code remains executable against the current schema) must hold for at least 1 full release cycle plus 1 on-call rotation — minimum 7 calendar days, longer when the on-call rotation is longer.
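A dual-shape read during the window might look like the sketch below. The column names (`display_name` as the new shape, `first_name` and `last_name` as the old shape) are hypothetical and only illustrate the "new if populated, otherwise old" rule.

```python
def read_display_name(row):
    """Prefer the new column when it is populated; fall back to the old shape."""
    new_value = row.get("display_name")
    if new_value is not None:
        return new_value
    # Old shape: still present until the narrowing migration ships.
    return f"{row.get('first_name', '')} {row.get('last_name', '')}".strip()
```

Because old code never reads `display_name`, this reader stays compatible with a rollback for as long as the old columns exist.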
|
|
43
|
+
|
|
44
|
+
## Reversibility
|
|
45
|
+
|
|
46
|
+
Every migration ships a tested down-migration script. Forward-only migrations are permitted only when the operation is data-destructive (e.g., a `DROP COLUMN` after Contract) — these require an explicit `IRREVERSIBLE: <reason>` annotation in the migration header and reviewer sign-off. A compensating forward migration that restores the prior shape is acceptable in place of a down-script for tools that lack reversibility (Prisma Migrate, Drizzle Kit — surface the gap to the reviewer). Default for every migration: reversible.
|
|
47
|
+
|
|
48
|
+
## Data Integrity Verification
|
|
49
|
+
|
|
50
|
+
Apply layered verification from cheapest to most thorough; stop at the cheapest layer that detects no drift.
|
|
51
|
+
|
|
52
|
+
1. **Pre-migration backup drill.** Full restore to staging plus a smoke query within 24 hours prior to a destructive migration. "Backup exists" is not verification.
|
|
53
|
+
2. **Row-count parity per chunk.** Source rows processed equals target rows written. Log discrepancies as errors, not warnings.
|
|
54
|
+
3. **Aggregate checks.** SUM, MIN, MAX, COUNT(DISTINCT) on numeric and date columns per partition or batch.
|
|
55
|
+
4. **Per-block checksums.** SHA-256 or MD5 over concatenated key columns for blocks of N rows (e.g., `md5(string_agg(id::text || col::text, ',' ORDER BY id))`).
|
|
56
|
+
5. **Cross-system diff.** Datafold Reconcile, dbt-data-diff, or a hand-rolled sample-then-drill comparison for value-level differences.
|
|
57
|
+
6. **Canary dual-read.** Read both shapes in production for 24–72 hours before cutover; shadow-diff and alert on mismatch.
|
|
58
|
+
7. **Reconciliation control table.** Per-batch row count plus checksum stored alongside the checkpoint; auto-stop the backfill on drift above the configured threshold.
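As a rough illustration of the per-block checksum layer, the following sketch hashes blocks of rows on both sides and reports the key ranges that disagree. It assumes rows arrive as tuples whose first element is the primary key; the function names and block size are illustrative, and a real run would usually compute the digests inside the database as in the `string_agg` example above.

```python
import hashlib


def block_checksums(rows, block_size=1000):
    """Map (first_id, last_id) of each block to a SHA-256 digest of its rows."""
    ordered = sorted(rows, key=lambda r: r[0])
    checksums = {}
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        digest = hashlib.sha256()
        for row in block:
            # repr() keeps None distinct from '' and from the string 'None'.
            digest.update(repr(tuple(row)).encode("utf-8"))
            digest.update(b"\n")
        checksums[(block[0][0], block[-1][0])] = digest.hexdigest()
    return checksums


def drifted_blocks(source_rows, target_rows, block_size=1000):
    """Return the sorted block ranges whose digests differ or exist on one side only."""
    src = block_checksums(source_rows, block_size)
    dst = block_checksums(target_rows, block_size)
    return sorted(k for k in src.keys() | dst.keys() if src.get(k) != dst.get(k))
```

Only the drifted ranges then need the more expensive value-level comparison from the cross-system diff layer.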
|
|
59
|
+
|
|
60
|
+
## Multi-Region & Replica Lag
|
|
61
|
+
|
|
62
|
+
- Pause backfill writes when any replica lag exceeds 30 seconds (or the project's lag SLO, whichever is lower). Resume only after lag returns to baseline for 5 consecutive minutes.
|
|
63
|
+
- Roll migrations across regions sequentially; never alter an active partition during the peak traffic window of any region.
|
|
64
|
+
- FK validation (`VALIDATE CONSTRAINT`) reads the entire dependent table — schedule outside peak read windows on replica-heavy topologies.
|
|
65
|
+
- For Postgres major-version upgrades, use native logical replication (PG17+ preserves slots through `pg_upgrade`); advance sequences manually at cutover — logical replication does not replicate sequences, DDL, or large objects.
|
|
66
|
+
- For ongoing cross-system replication, prefer Debezium (when Kafka is already deployed) or AWS DMS (managed, AWS-native). DMS hard limit: 200 tasks per replication instance — relevant for schema-per-tenant designs.
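The pause-and-resume rule in the first bullet is a small hysteresis state machine. The sketch below is one possible shape for it; the `LagGate` name and the 5-second `baseline_s` value are assumptions, since the rule does not define what "baseline" is numerically.

```python
class LagGate:
    """Pause backfill writes on high replica lag; resume after a stable period."""

    def __init__(self, pause_above_s=30.0, baseline_s=5.0, stable_for_s=300.0):
        self.pause_above_s = pause_above_s
        self.baseline_s = baseline_s      # assumed definition of "baseline"
        self.stable_for_s = stable_for_s  # 5 consecutive minutes
        self.paused = False
        self._stable_since = None

    def observe(self, max_lag_s, now_s):
        """Feed the worst lag across replicas; return True if writes may proceed."""
        if max_lag_s > self.pause_above_s:
            self.paused = True
            self._stable_since = None
        elif self.paused:
            if max_lag_s <= self.baseline_s:
                if self._stable_since is None:
                    self._stable_since = now_s
                if now_s - self._stable_since >= self.stable_for_s:
                    self.paused = False
                    self._stable_since = None
            else:
                # Lag is under the pause threshold but not yet at baseline:
                # restart the stability clock.
                self._stable_since = None
        return not self.paused
```

The backfill loop would call `observe` between batches with the maximum lag reported by any replica and a monotonic timestamp.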
|
|
67
|
+
|
|
68
|
+
## Tooling Mandate
|
|
69
|
+
|
|
70
|
+
Pick one schema-management tool per project and commit the schema declaration to the repo. Greenfield default: Atlas (50+ destructive/locking linters, auto-generated down migrations, GitHub Actions approval policies) or dbmate (plain-SQL portability with first-class `-- migrate:down`). Existing-project default: whatever already ships migrations in the repo. Acceptable tools: Atlas, Prisma Migrate (forward-only — surface to reviewer), Drizzle Kit (forward-only — surface), Flyway 11+, Liquibase 4.27+ Pro, sqitch, Alembic, Knex, dbmate, Bytebase. Run a migration linter in CI — Atlas analyze, `squawk` for raw Postgres SQL — fail the PR on destructive operations without an explicit `IRREVERSIBLE:` annotation.
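To make the CI gate concrete, here is a deliberately naive sketch of the annotation check. It is not a substitute for Atlas or `squawk`: the regular expressions only catch the most obvious destructive statements, and the `-- IRREVERSIBLE: <reason>` SQL-comment form of the header annotation is an assumption.

```python
import re

# Obvious destructive statements only; real linters parse the SQL.
DESTRUCTIVE = re.compile(r"\b(DROP\s+(TABLE|COLUMN)|TRUNCATE)\b", re.IGNORECASE)
# The annotation must carry a non-empty reason on the same line.
ANNOTATION = re.compile(r"^--[ \t]*IRREVERSIBLE:[ \t]*\S", re.MULTILINE)


def lint_migration(sql):
    """Return a list of findings for one migration file's contents."""
    statements = "\n".join(
        line for line in sql.splitlines() if not line.lstrip().startswith("--")
    )
    if DESTRUCTIVE.search(statements) and not ANNOTATION.search(sql):
        return ["destructive operation without an IRREVERSIBLE: <reason> annotation"]
    return []
```

A CI step would run this over every changed migration file and fail the PR when any list is non-empty.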
|
|
71
|
+
|
|
72
|
+
Cross-references: see `hatch3r-data-classification` (PII / encrypted-column migration requirements), `hatch3r-feature-flags` (read-path switchover gating), `hatch3r-observability-metrics` (backfill progress metrics).
|
|
@@ -3,7 +3,7 @@ id: hatch3r-observability-logging
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Structured logging and error reporting conventions for the project
|
|
5
5
|
scope: conditional
|
|
6
|
-
globs: "**/*log*,**/*logger*,**/*logging*,**/*error*,**/observability/**"
|
|
6
|
+
globs: "**/*log*,**/*logger*,**/*logging*,**/*error*,**/observability/**,**/routes/**,**/handlers/**,**/services/**,**/api/**,**/middleware/**,**/controllers/**,**/lib/**"
|
|
7
7
|
tags: [devops]
|
|
8
8
|
quality_charter: agents/shared/quality-charter.md
|
|
9
9
|
cache_friendly: true
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Structured logging and error reporting conventions for the project
|
|
3
|
-
globs: ["**/*log*", "**/*logger*", "**/*logging*", "**/*error*", "**/observability/**"]
|
|
3
|
+
globs: ["**/*log*", "**/*logger*", "**/*logging*", "**/*error*", "**/observability/**", "**/routes/**", "**/handlers/**", "**/services/**", "**/api/**", "**/middleware/**", "**/controllers/**", "**/lib/**"]
|
|
4
4
|
alwaysApply: false
|
|
5
5
|
---
|
|
6
6
|
# Observability -- Logging & Error Reporting
|
|
@@ -3,7 +3,7 @@ id: hatch3r-observability-metrics
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Metrics, SLO/SLI definitions, alerting, and dashboard conventions for the project
|
|
5
5
|
scope: conditional
|
|
6
|
-
globs: "**/*metric*,**/*slo*,**/*sli*,**/*alert*,**/*dashboard*,**/observability/**"
|
|
6
|
+
globs: "**/*metric*,**/*slo*,**/*sli*,**/*alert*,**/*dashboard*,**/observability/**,**/routes/**,**/handlers/**,**/services/**,**/api/**,**/middleware/**,**/controllers/**,**/lib/**"
|
|
7
7
|
tags: [devops]
|
|
8
8
|
quality_charter: agents/shared/quality-charter.md
|
|
9
9
|
cache_friendly: true
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Metrics, SLO/SLI definitions, alerting, and dashboard conventions for the project
|
|
3
|
-
globs: ["**/*metric*", "**/*slo*", "**/*sli*", "**/*alert*", "**/*dashboard*", "**/observability/**"]
|
|
3
|
+
globs: ["**/*metric*", "**/*slo*", "**/*sli*", "**/*alert*", "**/*dashboard*", "**/observability/**", "**/routes/**", "**/handlers/**", "**/services/**", "**/api/**", "**/middleware/**", "**/controllers/**", "**/lib/**"]
|
|
4
4
|
alwaysApply: false
|
|
5
5
|
---
|
|
6
6
|
# Observability -- Metrics, SLOs & Alerting
|
|
@@ -3,7 +3,7 @@ id: hatch3r-observability-tracing-detail
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Extended tracing reference -- AI agent instrumentation, tool call audit trails, LLM request tracing, and correlation ID patterns
|
|
5
5
|
scope: conditional
|
|
6
|
-
globs: "**/*trac*,**/*span*,**/*telemetry*,**/*otel*,**/*agent*,**/observability/**"
|
|
6
|
+
globs: "**/*trac*,**/*span*,**/*telemetry*,**/*otel*,**/*agent*,**/observability/**,**/routes/**,**/handlers/**,**/services/**,**/api/**,**/middleware/**,**/controllers/**,**/lib/**"
|
|
7
7
|
tags: [devops]
|
|
8
8
|
quality_charter: agents/shared/quality-charter.md
|
|
9
9
|
cache_friendly: true
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Extended tracing reference -- AI agent instrumentation, tool call audit trails, LLM request tracing, and correlation ID patterns
|
|
3
|
-
globs: ["**/*trac*", "**/*span*", "**/*telemetry*", "**/*otel*", "**/*agent*", "**/observability/**"]
|
|
3
|
+
globs: ["**/*trac*", "**/*span*", "**/*telemetry*", "**/*otel*", "**/*agent*", "**/observability/**", "**/routes/**", "**/handlers/**", "**/services/**", "**/api/**", "**/middleware/**", "**/controllers/**", "**/lib/**"]
|
|
4
4
|
alwaysApply: false
|
|
5
5
|
---
|
|
6
6
|
# Observability -- Tracing Extended Reference
|
|
@@ -3,7 +3,7 @@ id: hatch3r-observability-tracing
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: Distributed tracing and OpenTelemetry core conventions for the project
|
|
5
5
|
scope: conditional
|
|
6
|
-
globs: "**/*trac*,**/*span*,**/*telemetry*,**/*otel*,**/observability/**"
|
|
6
|
+
globs: "**/*trac*,**/*span*,**/*telemetry*,**/*otel*,**/observability/**,**/routes/**,**/handlers/**,**/services/**,**/api/**,**/middleware/**,**/controllers/**,**/lib/**"
|
|
7
7
|
tags: [devops]
|
|
8
8
|
quality_charter: agents/shared/quality-charter.md
|
|
9
9
|
cache_friendly: true
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Distributed tracing and OpenTelemetry core conventions for the project
|
|
3
|
-
globs: ["**/*trac*", "**/*span*", "**/*telemetry*", "**/*otel*", "**/observability/**"]
|
|
3
|
+
globs: ["**/*trac*", "**/*span*", "**/*telemetry*", "**/*otel*", "**/observability/**", "**/routes/**", "**/handlers/**", "**/services/**", "**/api/**", "**/middleware/**", "**/controllers/**", "**/lib/**"]
|
|
4
4
|
alwaysApply: false
|
|
5
5
|
---
|
|
6
6
|
# Observability -- Distributed Tracing & OpenTelemetry
|
|
@@ -3,6 +3,7 @@ id: hatch3r-observability
|
|
|
3
3
|
type: rule
|
|
4
4
|
description: "[Deprecated] Observability conventions -- split into hatch3r-observability-logging, hatch3r-observability-metrics, and hatch3r-observability-tracing"
|
|
5
5
|
scope: conditional
|
|
6
|
+
globs: "**/routes/**,**/handlers/**,**/services/**,**/api/**,**/middleware/**,**/controllers/**,**/lib/**,**/observability/**"
|
|
6
7
|
tags: [devops]
|
|
7
8
|
quality_charter: agents/shared/quality-charter.md
|
|
8
9
|
deprecated: true
|
|
@@ -1,5 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: "[Deprecated] Observability conventions -- split into hatch3r-observability-logging, hatch3r-observability-metrics, and hatch3r-observability-tracing"
|
|
3
|
+
globs: ["**/routes/**", "**/handlers/**", "**/services/**", "**/api/**", "**/middleware/**", "**/controllers/**", "**/lib/**", "**/observability/**"]
|
|
3
4
|
alwaysApply: false
|
|
4
5
|
---
|
|
5
6
|
# Observability (Deprecated Redirect)
|
|
@@ -0,0 +1,149 @@
|
|
|
1
|
+
---
|
|
2
|
+
id: hatch3r-operability
|
|
3
|
+
type: rule
|
|
4
|
+
description: Operability patterns in user code — liveness / readiness / startup probes, graceful shutdown, feature flags, runbook URL annotations, health endpoints
|
|
5
|
+
scope: "**/services/**,**/handlers/**,**/health*,**/probes/**,**/k8s/**,**/manifests/**,**/charts/**,**/feature*,**/flags/**"
|
|
6
|
+
tags: [implementation, devops]
|
|
7
|
+
quality_charter: agents/shared/quality-charter.md
|
|
8
|
+
cache_friendly: true
|
|
9
|
+
---
|
|
10
|
+
# Operability
|
|
11
|
+
|
|
12
|
+
## Liveness, Readiness, and Startup Probes (Kubernetes)
|
|
13
|
+
|
|
14
|
+
Three probe kinds; each answers a different question and triggers a different action. Conflating them is the most common operability bug — a liveness probe that reaches downstream causes pod-restart cascades on dependency outages.
|
|
15
|
+
|
|
16
|
+
- **Liveness** — shallow self-check: process alive, event loop responsive, no deadlock on the request handler. NEVER checks external dependencies. Failure → kubelet restarts the pod. Default: `httpGet /health/live`, `periodSeconds: 10`, `failureThreshold: 3`, `timeoutSeconds: 1`.
|
|
17
|
+
- **Readiness** — deep dependency check: DB reachable, cache reachable, required downstream healthy, migrations applied. Failure → endpoint controller removes the pod from Service load-balancer rotation; pod stays running and recovers when dependencies return. Default: `httpGet /health/ready`, `periodSeconds: 5`, `failureThreshold: 2`, `timeoutSeconds: 2`.
|
|
18
|
+
- **Startup** — same checks as readiness but with a longer initial allowance for slow-starting apps (large model load, warm cache build, JIT warmup, schema migration). Once it succeeds, liveness and readiness take over. Default: `httpGet /health/startup`, `periodSeconds: 5`, `failureThreshold: 60` (allowing 5 minutes), `timeoutSeconds: 2`.
|
|
19
|
+
|
|
20
|
+
Concrete examples per ecosystem:
|
|
21
|
+
|
|
22
|
+
- **Node Express** — `/health/live`, `/health/ready`, `/health/startup` on the same router with different handler chains. The live handler returns 200 unconditionally if the event loop is responsive; the ready handler awaits `Promise.allSettled` over DB ping, cache ping, downstream ping with per-check 1s timeout.
|
|
23
|
+
- **Spring Boot Actuator** — `/actuator/health/liveness` and `/actuator/health/readiness` exposed out of the box; add startup via a custom `HealthIndicator`. Configure `management.endpoint.health.probes.enabled=true`.
|
|
24
|
+
- **Go (net/http)** — three handlers on `http.ServeMux`; ready handler aggregates checks via `errgroup.Group.Wait()` with per-check `context.WithTimeout`.
|
|
25
|
+
|
|
26
|
+
Anti-pattern: a single `/health` endpoint that both Kubernetes probes hit. The pod is killed during a DB outage because the liveness probe fails on the same deep check that the readiness probe is supposed to own.
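The split between a shallow liveness handler and a deep readiness handler can be sketched framework-free as below. The handler names, the `(status_code, body)` return shape, and the check names are illustrative assumptions; only the per-check timeout and the "liveness never touches dependencies" rule come from the text above.

```python
import asyncio


async def run_checks(checks, timeout_s=1.0):
    """Run named async checks concurrently, each under its own timeout."""

    async def guarded(name, check):
        try:
            await asyncio.wait_for(check(), timeout=timeout_s)
            return name, None
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:  # any failure marks the dependency as down
            return name, str(exc) or type(exc).__name__

    results = await asyncio.gather(*(guarded(n, c) for n, c in checks.items()))
    return {name: err for name, err in results if err is not None}


def live():
    """Liveness: no dependency calls; answering at all shows the loop is responsive."""
    return 200, {"status": "ok"}


async def ready(checks, timeout_s=1.0):
    """Readiness: 503 with the failing dependency name if any check fails."""
    failures = await run_checks(checks, timeout_s)
    if failures:
        name = sorted(failures)[0]
        return 503, {"status": "down", "reason": f"{name} {failures[name]}"}
    return 200, {"status": "ok"}
```

With this split, a database outage flips `ready` to 503 and takes the pod out of rotation, while `live` keeps returning 200 so the kubelet does not restart it.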
|
|
27
|
+
|
|
28
|
+
## Graceful Shutdown
|
|
29
|
+
|
|
30
|
+
Handle `SIGTERM` per the sequence below. Skipping the preStop delay causes the well-known endpoint-propagation race that drops up to several seconds of in-flight traffic.
|
|
31
|
+
|
|
32
|
+
- Step 1: Stop accepting new connections (close the HTTP listener; stop pulling from the queue).
|
|
33
|
+
- Step 2: Mark `/health/ready` to return 503 (the endpoint controller removes the pod from the Service).
|
|
34
|
+
- Step 3: Wait a `preStop` window of 1–3s for endpoint-removal propagation. Kubernetes does NOT propagate endpoint removal before delivering SIGTERM; without an explicit `preStop` `sleep`, traffic continues to arrive for the first 1–3s of shutdown.
|
|
35
|
+
- Step 4: Drain in-flight requests (configurable deadline 30–45s; cap by `terminationGracePeriodSeconds`).
|
|
36
|
+
- Step 5: Close DB connections, drain queue consumers, flush log buffers, close OpenTelemetry exporters.
|
|
37
|
+
- Step 6: Exit 0.
|
|
38
|
+
|
|
39
|
+
Pod manifest baseline:
|
|
40
|
+
|
|
41
|
+
```
|
|
42
|
+
terminationGracePeriodSeconds: 45
|
|
43
|
+
lifecycle:
|
|
44
|
+
  preStop:
|
|
45
|
+
    exec:
|
|
46
|
+
      command: ["sh", "-c", "sleep 3"]
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
A force kill after the grace period defeats the drain. If 45s is insufficient, raise `terminationGracePeriodSeconds` rather than skip the drain, but investigate why the service holds long-running requests at shutdown.
|
|
50
|
+
|
|
51
|
+
For queue consumers (Kafka, NATS, SQS): on SIGTERM stop pulling new messages first, finish in-flight processing, then commit offsets and disconnect. Skipping the commit step on shutdown produces duplicate processing on the next pod's first poll.
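The SIGTERM sequence in this section can be sketched as a small coordinator with injected callbacks. Everything named here (`GracefulShutdown`, the callback parameters, the in-process delay standing in for the `preStop` sleep, and the non-zero return when the drain misses its deadline) is an assumption for illustration rather than a prescribed API.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Runs the shutdown steps in order; callbacks are supplied by the service."""

    def __init__(self, stop_accepting, drain, close_resources,
                 propagation_delay_s=3.0, drain_deadline_s=30.0, sleep=time.sleep):
        self.ready = True  # the /health/ready handler returns 503 when False
        self.requested = threading.Event()
        self._stop_accepting = stop_accepting
        self._drain = drain
        self._close_resources = close_resources
        self._propagation_delay_s = propagation_delay_s
        self._drain_deadline_s = drain_deadline_s
        self._sleep = sleep

    def install(self):
        # Keep the signal handler trivial: it only sets a flag.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.requested.set())

    def run(self):
        self._stop_accepting()                          # Step 1
        self.ready = False                              # Step 2
        self._sleep(self._propagation_delay_s)          # Step 3
        drained = self._drain(self._drain_deadline_s)   # Step 4
        self._close_resources()                         # Step 5
        return 0 if drained else 1                      # Step 6 (exit code)
```

A service would call `install()` at startup, block its main thread on `requested.wait()`, and then pass the result of `run()` to `sys.exit`. For a queue consumer, `stop_accepting` stops polling and `drain` finishes in-flight messages and commits offsets.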
|
|
52
|
+
|
|
53
|
+
## Feature Flags — OpenFeature
|
|
54
|
+
|
|
55
|
+
OpenFeature is a CNCF Incubating project (pre-1.0 spec) whose SDKs wrap any provider (LaunchDarkly, Unleash, Flagsmith, GrowthBook, Statsig) behind a consistent API. New code targets OpenFeature; provider-specific SDK imports become a finding.
|
|
56
|
+
|
|
57
|
+
Flag types by lifecycle:
|
|
58
|
+
|
|
59
|
+
- **Release flag** — gate new code path during rollout. Remove within 1 sprint of full enablement; otherwise it becomes flag debt.
|
|
60
|
+
- **Experiment flag** — A/B test traffic split for measurement. Retire after the experiment concludes and the decision is shipped.
|
|
61
|
+
- **Ops flag** — kill switch for risky features and feature-level circuit breakers. Permanent by design. Cross-reference the kill-switch pattern below.
|
|
62
|
+
- **Permission flag** — entitlement gating (per-tenant feature availability). Prefer reusing the authorization layer where possible; flag only the surface that the auth system does not naturally cover.
|
|
63
|
+
|
|
64
|
+
Flag-debt budget: 50–100 active flags per service. Run a monthly cleanup cadence — list flags with `enabled = true for 100% of traffic for >30 days` and either remove them or convert to permanent config.
|
|
65
|
+
|
|
66
|
+
Default-off for new release flags; default-on only for kill switches, so that losing provider connectivity does not silently turn off a feature that only the on-call should disable. Evaluate flags once at request entry and pass the resolved value down; re-evaluating mid-request risks split-brain behavior when the flag flips during the call.
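A sketch of "resolve once, pass down" with the two default policies follows. It deliberately avoids any real SDK: `lookup` is a hypothetical callable standing in for the flag client, and the flag keys and field names are invented for the example.

```python
from dataclasses import dataclass

# Fallbacks used when the provider cannot be reached (hypothetical flag keys).
DEFAULTS = {
    "new-checkout": False,            # release flag: default off
    "recommendations-enabled": True,  # kill switch: default on
}


@dataclass(frozen=True)
class RequestFlags:
    """Immutable snapshot of flag values for a single request."""
    new_checkout: bool
    recommendations_enabled: bool


def resolve_flags(lookup):
    """Evaluate every flag once; fall back to the default on provider errors."""

    def get(key):
        try:
            return bool(lookup(key))
        except Exception:
            return DEFAULTS[key]

    return RequestFlags(
        new_checkout=get("new-checkout"),
        recommendations_enabled=get("recommendations-enabled"),
    )
```

The request handler calls `resolve_flags` at entry and hands the frozen `RequestFlags` object to everything downstream, so a flag flip in the middle of the request cannot produce mixed behavior.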
|
|
67
|
+
|
|
68
|
+
## Kill-Switch Pattern
|
|
69
|
+
|
|
70
|
+
Every risky feature ships with a kill switch (Ops flag) and a documented procedure for flipping it. On-call must be able to disable the feature without redeploying. Document the flag name in `docs/runbooks/<service>.md` alongside the alert.
|
|
71
|
+
|
|
72
|
+
Test the kill switch on every release — a kill switch that nobody has flipped in production is unverified. Quarterly drill: flip the flag, observe the metric drop, restore.
|
|
73
|
+
|
|
74
|
+
Cross-reference `rules/hatch3r-progressive-delivery.md` — kill switch is the rollback mechanism when the staged rollout has already finished and the regression is discovered after 100%.
|
|
75
|
+
|
|
76
|
+
## Runbook URL on Every Alert
|
|
77
|
+
|
|
78
|
+
An alert without a runbook is a finding under this rule. Every alert carries a runbook URL annotation:
|
|
79
|
+
|
|
80
|
+
- **Prometheus:** `annotations.runbook_url: "https://internal.example.com/runbooks/<alert-name>.md"`.
|
|
81
|
+
- **Datadog / Grafana:** equivalent `runbook_url` or `notification_message` template field.
|
|
82
|
+
|
|
83
|
+
Runbook format:
|
|
84
|
+
|
|
85
|
+
- **Symptoms** — what the on-call sees on the dashboard or in the alert payload.
|
|
86
|
+
- **Triage** — first three commands or queries to narrow the cause.
|
|
87
|
+
- **Mitigation** — kill switch, rollback command, scale action.
|
|
88
|
+
- **Root cause** — links to past postmortems for similar symptoms.
|
|
89
|
+
- **Follow-ups** — open issues, related dashboards, owner team.
|
|
90
|
+
|
|
91
|
+
Empirical observation from 2024–2026 incidents: LLM-driven auto-diagnosis quality is dominated by runbook quality rather than by model choice. A high-fidelity runbook with concrete commands beats a generic prompt against a frontier model on the same dataset.
|
|
92
|
+
|
|
93
|
+
Runbooks live in the service repository under `docs/runbooks/<alert-name>.md` so they ship with the code that emits the alert. Renaming an alert without updating the runbook URL produces a 404 link from the alert payload — a CI check on the alert manifest catches this on every PR that touches alert names.
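The CI check mentioned above could be as small as the sketch below. It assumes alert rules have already been parsed into dictionaries with Prometheus-style `alert` and `annotations` keys, that the set of runbook files in the repository is passed in, and that runbook URLs share the example base URL used earlier; the function name is illustrative.

```python
def missing_runbooks(alert_rules, existing_paths,
                     base_url="https://internal.example.com/runbooks/"):
    """Return findings for alerts whose runbook_url is absent or points at no file."""
    problems = []
    for rule in alert_rules:
        name = rule.get("alert", "<unnamed>")
        url = rule.get("annotations", {}).get("runbook_url")
        if not url:
            problems.append(f"{name}: missing runbook_url")
            continue
        if not url.startswith(base_url):
            problems.append(f"{name}: unexpected runbook location {url}")
            continue
        # Map the URL back to the file that should ship in the repository.
        path = "docs/runbooks/" + url[len(base_url):]
        if path not in existing_paths:
            problems.append(f"{name}: {path} not found")
    return problems
```

Failing the build on a non-empty result catches both a forgotten annotation and a renamed alert whose URL still points at the old file name.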
|
|
94
|
+
|
|
95
|
+
## Health Endpoint Conventions
|
|
96
|
+
|
|
97
|
+
- `/health/live` — 200 + `{ "status": "ok", "version": "<semver>", "build_sha": "<short-sha>" }` on success; 503 + `{ "status": "down", "reason": "<diagnostic>" }` on failure.
|
|
98
|
+
- `/health/ready` — same shape; 503 includes the failing downstream name (e.g. `reason: "postgres-primary unreachable"`).
|
|
99
|
+
- `/health/startup` — same as ready but with allowance for warmup.
|
|
100
|
+
|
|
101
|
+
Never expose secrets, connection strings, or per-request internals from health endpoints — they are unauthenticated by default. Detailed diagnostics live behind authentication on a separate admin endpoint.
|
|
102
|
+
|
|
103
|
+
Probe response time budget: under 100ms for `/health/live`, under 500ms for `/health/ready` and `/health/startup`. A probe that exceeds the kubelet's `timeoutSeconds` counts as a failure and can cause pod restart cascades unrelated to the underlying issue.
|
|
104
|
+
|
|
105
|
+
Cache the ready-check result for 1–2s. With the 5s readiness `periodSeconds` used above (the Kubernetes default is 10s), checking each downstream on every poll multiplies dependency load by the replica count.
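A short-lived cache around the readiness check might look like this minimal sketch. The class name and the injectable `clock` parameter are assumptions added to keep the example testable; it is not thread-safe as written.

```python
import time


class CachedReadiness:
    """Re-run the underlying check at most once per TTL window."""

    def __init__(self, check, ttl_s=2.0, clock=time.monotonic):
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._expires_at = None
        self._value = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache miss or expired entry: hit the dependencies once.
            self._value = self._check()
            self._expires_at = now + self._ttl_s
        return self._value
```

The `/health/ready` handler would call `get()` instead of the raw check, so probes from the kubelet and any other poller arriving within the same window share a single round of dependency calls.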
|
|
106
|
+
|
|
107
|
+
## Capacity Planning
|
|
108
|
+
|
|
109
|
+
- Nightly load test in CI against a staging environment representative of production. Compare baseline-vs-current p50, p95, p99 latency and error rate; fail the run on a 20% regression vs the previous green baseline.
|
|
110
|
+
- Saturation tracking — alert when CPU sustained above 70% for 15 minutes, memory above 80% for 15 minutes, connection pool above 80% utilization for 5 minutes.
|
|
111
|
+
- Rightsizing — target 20–30% headroom on CPU and memory at peak. Below 10% headroom is undersized; above 50% sustained is oversized and a cost finding.
|
|
112
|
+
- HPA (Horizontal Pod Autoscaler) target CPU at 60–70% — leaves room for the scale-up lag to bring new replicas online before the existing pods saturate.
|
|
113
|
+
- Cold-start budget for serverless and JVM services: measure p95 cold-start latency; provision the warm-pool to keep cold-start traffic below 1% of total. KEDA or platform autoscaler keeps the warm-pool at the right size.
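The regression gate and the headroom bands above reduce to simple arithmetic, sketched here. The function names, the metric keys, and the "acceptable" label for values between the named bands are assumptions made for the example.

```python
def regressed_metrics(baseline, current, tolerance=0.20):
    """Return metrics that are more than `tolerance` worse than the green baseline."""
    return sorted(
        name for name, base in baseline.items()
        if current[name] > base * (1 + tolerance)
    )


def headroom_verdict(peak_utilization_pct):
    """Classify peak CPU or memory utilization using the rightsizing bands."""
    headroom = 100 - peak_utilization_pct
    if headroom < 10:
        return "undersized"
    if headroom > 50:
        return "oversized"
    if 20 <= headroom <= 30:
        return "on target"
    return "acceptable"  # between the named bands; label is an assumption
```

For example, a p95 of 150 ms against a 120 ms baseline exceeds the 144 ms limit and fails the run, while a service peaking at 75% CPU has 25% headroom and sits on target.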
|
|
114
|
+
|
|
115
|
+
## Multi-Region and Disaster Recovery
|
|
116
|
+
|
|
117
|
+
- RTO (Recovery Time Objective) and RPO (Recovery Point Objective) documented per service tier in the service catalog.
|
|
118
|
+
- Active-active for tier-1 services targeting RTO under 5 minutes and RPO under 1 minute.
|
|
119
|
+
- Active-passive acceptable for tier-2 services targeting RTO under 1 hour.
|
|
120
|
+
- Failover drill quarterly; the drill is the test of the runbook. A runbook that has not been executed in a drill is a draft, not a runbook.
|
|
121
|
+
- Data residency — honor regional data boundaries (EU-US DPF, region-pinned cloud deployments). Cross-region replication must respect the residency contract.
|
|
122
|
+
- DNS-based failover (Route 53, Cloud DNS) has a propagation tail measured in minutes; for sub-minute RTO use load-balancer-level failover (cross-region target groups) or anycast.
|
|
123
|
+
|
|
124
|
+
## 2024–2026 Outage Lessons
|
|
125
|
+
|
|
126
|
+
- **CrowdStrike, July 2024** — global config push without canary. Mitigation: staged rollout (cross-reference `rules/hatch3r-progressive-delivery.md`).
|
|
127
|
+
- **AWS us-east-1, October 2025** — cascading failure across services with hidden us-east-1 control-plane dependency. Mitigation: dependency mapping, multi-region for control plane.
|
|
128
|
+
- **Azure East-US2, September 2025** — single-region outage on customer-perceived multi-region service. Mitigation: active-active across regions.
|
|
129
|
+
- **Cloudflare, November 2025** — config-change incident. Mitigation: treat config as code, canary config changes.
|
|
130
|
+
|
|
131
|
+
Most outages have an organizational root cause (change management, capacity planning, ownership). The patterns above defend against the technical leg of the failure; organizational defenses are out of scope for this rule.
|
|
132
|
+
|
|
133
|
+
Postmortem cadence: within 5 business days of incident close, blameless, owned by the on-call rotation, action items tracked in the service backlog with target due dates. Postmortem reviewers cross-check that the runbook was updated and that an automated detection (alert, dashboard, or unit test) was added for the failure mode that escaped detection.
|
|
134
|
+
|
|
135
|
+
## Cross-References
|
|
136
|
+
|
|
137
|
+
- `rules/hatch3r-resilience-patterns.md` — circuit breakers, retries, timeouts.
|
|
138
|
+
- `rules/hatch3r-progressive-delivery.md` — canary, blue-green, kill-switch usage during rollout.
|
|
139
|
+
- `rules/hatch3r-observability-metrics.md` — SLOs, RED metrics, burn-rate alerts feed the runbook.
|
|
140
|
+
|
|
141
|
+
## References
|
|
142
|
+
|
|
143
|
+
- Kubernetes probes — `kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes`
|
|
144
|
+
- Kubernetes pod termination — `kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination`
|
|
145
|
+
- Google SRE workbook, graceful-shutdown chapter — `sre.google/workbook`
|
|
146
|
+
- OpenFeature specification — `openfeature.dev`
|
|
147
|
+
- LaunchDarkly docs — `docs.launchdarkly.com`
|
|
148
|
+
- Unleash docs — `docs.getunleash.io`
|
|
149
|
+
- 2024–2026 outage postmortems: CrowdStrike Jul 2024, AWS us-east-1 Oct 2025, Azure East-US2 Sep 2025, Cloudflare Nov 2025.
|