npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.0 - Mend

sanook-cli 0.4.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (235)

package/.env.example +19 -0
package/CHANGELOG.md +144 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +394 -51
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +2 -2
package/dist/providers/keys.js +3 -2
package/dist/providers/registry.js +133 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +218 -27
package/dist/ui/banner.js +4 -9
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/setup.js +6 -5
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/design-authorization-model/SKILL.md ADDED

@@ -0,0 +1,119 @@
+---
+name: design-authorization-model
+description: Designs an authorization model — RBAC/ABAC/ReBAC, multi-tenant isolation, resource ownership, and policy-as-code (OPA/Cedar/Oso) — keeping authZ decisions separate from authN identity in a centralized, testable policy layer enforced down to the data tier.
+when_to_use: A system needs roles/permissions, multi-tenant data isolation, or per-resource access rules beyond a logged-in check. Distinct from auth-jwt-session (who you are — tokens/sessions), security-review (audit), and rate-limiting (request volume).
+---
+## When to Use
+Reach for this skill when the question is **"is this caller allowed to do this to this resource?"** — not "who is this caller?":
+- "Add roles and permissions" / "only admins can delete, members can edit, viewers read"
+- "Tenants must not see each other's data" / "isolate orgs / workspaces"
+- "Owner can share a doc with specific users" (Google-Drive-style) → relationship graph
+- "Permissions depend on attributes" — department, resource status, time, region
+- "Stop scattering `if user.role == 'admin'` across 40 handlers — centralize it"
+- An IDOR/cross-tenant leak found in review (a user fetched another org's record by id)
+NOT this skill:
+- Issuing/verifying tokens, sessions, refresh rotation, OAuth/OIDC login → **auth-jwt-session** (authN establishes identity; this skill consumes that identity to make the access decision)
+- Auditing existing code for access-control holes by severity → **security-review**
+- Capping request *rate/volume* per caller → **rate-limiting**
+- Recording *who did what when* for compliance/forensics → **build-audit-logging**
+- GDPR data-subject rights, lawful basis, PII mapping → **map-privacy-data-gdpr**
+- Fixing injection/XSS/SSRF in web code → **remediate-web-vulnerabilities**
+## Steps
+1. **Pick the model by the shape of the access rule — do not default to RBAC for everything.**
+   | Model | Decide by | Use when | Engine fit |
+   |---|---|---|---|
+   | **RBAC** | role → permission set | Fixed, coarse tiers (admin/editor/viewer); permissions don't depend on the specific row | DB tables, Casbin, Cedar |
+   | **ABAC** | attributes of subject+resource+context | Rules vary by field/status/time/region (`owner.dept == doc.dept AND time < embargo`) | **OPA/Rego**, Cedar |
+   | **ReBAC** | relationship/ownership graph | Per-resource sharing, nesting (`folder→doc`), "users this owner invited" — Drive/GitHub-style | **OpenFGA / SpiceDB** (Zanzibar), Oso |
+   Default: start **RBAC for app-wide roles**, add **ReBAC** the moment you need per-resource sharing or hierarchy, add **ABAC** conditions for field/context rules. They compose — RBAC roles can be relations in a ReBAC graph. Don't roll a bespoke nested-`if` engine; pick one of the named tools.
+2. **Separate authN from authZ — the decision is its own layer.** AuthN hands you a verified principal (`{user_id, tenant_id, roles}` from the validated token — see **auth-jwt-session**). AuthZ takes `(principal, action, resource)` → `allow|deny`. Never re-derive identity inside the policy, and never let the policy trust unverified claims.
+3. **Centralize the decision behind one `authorize()` call — never inline `if role ==`.** Every protected operation calls the same checkpoint; scattered checks drift and leak.
+   ```python
+   # ONE entry point. Engine (OPA/Cedar/Oso/OpenFGA) behind it.
+   def authorize(principal, action, resource):
+       decision = engine.check(
+           subject=principal.user_id,
+           tenant=principal.tenant_id,      # from token, NEVER from the request body
+           action=action,                   # "document:delete"
+           resource=resource,               # {id, type, tenant_id, owner_id, status}
+       )
+       if not decision.allow:               # deny by default — no rule matched = deny
+           raise Forbidden(action, resource.id)
+       return decision
+   ```
+4. **Enforce multi-tenant isolation on every query — and derive `tenant_id` from the token, never the client.** A client-supplied tenant/org id is an attacker-controlled cross-tenant key. Scope every read/write by the token's tenant; treat a missing tenant scope as a bug, not a default-all.
+   ```sql
+   -- Defense in depth: Postgres Row-Level Security so a forgotten WHERE can't leak.
+   ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
+   ALTER TABLE documents FORCE ROW LEVEL SECURITY;          -- applies to table owner too
+   CREATE POLICY tenant_isolation ON documents
+     USING (tenant_id = current_setting('app.tenant_id')::uuid);
+   ```
+   Set `app.tenant_id` per request/connection from the verified token (`SET LOCAL app.tenant_id = ...` inside the request transaction). App-layer `WHERE tenant_id = $1` is the primary guard; RLS is the backstop for the day someone forgets it.
+5. **Deny by default, least privilege, deny wins.** No matching allow rule ⇒ deny. Start every role at zero permissions and add. When allow and deny rules overlap, **explicit deny beats allow**. Write this into the policy, don't rely on convention.
+6. **Make policy versioned, code-reviewed, and unit-tested — policy-as-code.** Keep `.rego` / Cedar / `policy.polar` in the repo, PR-reviewed like app code. Example OPA/Rego with the three non-negotiables baked in:
+   ```rego
+   package authz
+   default allow := false                          # deny by default
+   default deny := false                           # bare `deny` is always defined
+   allow if {                                       # owner can do anything to own resource
+     input.resource.tenant_id == input.principal.tenant_id   # same-tenant gate, always
+     input.resource.owner_id == input.principal.user_id
+   }
+   allow if {                                       # role grants the action
+     input.resource.tenant_id == input.principal.tenant_id
+     some role in input.principal.roles
+     grants[role][_] == input.action               # e.g. grants.editor[] = "document:edit"
+   }
+   deny if input.resource.status == "locked"        # explicit deny condition
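+   default final_allow := false                    # undefined never reads as allow
+   final_allow if {                                 # deny wins over any allow
+     allow                                          # Rego has no `and`: body lines are ANDed
+     not deny
+   }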
+   ```
+   Wire the API to read `final_allow`, not `allow`. Run `opa test policy/ -v` in CI. The same input schema feeds both the running engine and the tests.
+7. **Pass the decision an explicit resource snapshot, fetched tenant-scoped first.** Load the resource (already filtered by tenant in the query) before checking, so the policy sees real `owner_id`/`status`/`tenant_id`. Checking by id alone, then fetching unscoped, reintroduces the IDOR.
+8. **Verify with an allow/deny matrix per role × action — including explicit cross-tenant denial** (see Verify) before shipping.
+## Common Errors
+- **Trusting a client-supplied `tenant_id`/`org_id`** from body, query, or header. It's the cross-tenant skeleton key. Derive tenant solely from the verified token; ignore any tenant field in the request.
+- **IDOR — checking the role but not the ownership/tenant of *this* row.** `can_edit_documents` is true, but the doc belongs to another tenant. Always bind the check to the specific resource's `tenant_id`/`owner_id`, fetched tenant-scoped.
+- **Inline `if user.role == 'admin'` scattered across handlers.** They drift, one gets missed, and a new action ships unguarded. Route every check through the single `authorize()` checkpoint.
+- **Role explosion (`editor_us_finance_readonly`).** Combinatorial roles that should be attributes. Move per-field/context rules to ABAC conditions; keep roles coarse.
+- **Allow-by-default / "fail open."** A request that matches no rule slips through, or an engine error returns allow. Set `default allow := false` and treat engine errors/timeouts as deny.
+- **Reading `allow` instead of the deny-wins result.** Exposing `allow` to the API skips the explicit-deny rule. Have the engine return `final_allow` (`allow and not deny`) so a locked/blocked resource can't be reached through a permissive role.
+- **AuthZ in the frontend only.** Hiding a button is UX, not security — the API is the enforcement boundary. Every server endpoint authorizes independently.
+- **Roles baked into the JWT and never refreshed.** Revoking a role doesn't take effect until the token expires. Check permissions against current state (or keep token TTL short and re-resolve roles server-side).
+- **No DB-tier backstop.** One forgotten `WHERE tenant_id` leaks every tenant. Enable Postgres RLS with `FORCE` so the data tier denies even when the app forgets.
+- **Confused-deputy / unscoped service calls.** A worker or internal service queries with god privileges on behalf of a user without carrying the user's tenant/permission scope. Propagate the principal; don't let internal callers bypass `authorize()`.
+- **Policy with no tests.** Untested Rego/Cedar rots silently. Ship the allow/deny matrix as `opa test` cases alongside the policy.
+## Verify
+1. **Allow/deny matrix — every role × action.** For each role (admin/editor/viewer/none) × each action (create/read/update/delete/share), assert the decision matches the intended table. Every cell is a test case, run in CI (`opa test policy/ -v` or the engine's harness).
+2. **Cross-tenant denial (the critical one).** User in tenant A requests a resource in tenant B by its real id → **403/deny**, for *every* action, including read. Do this both through the API and by querying the DB with `app.tenant_id` set to A — RLS must return zero rows.
+3. **IDOR probe.** As a non-owner same-tenant user, attempt update/delete on a resource you don't own and your role doesn't permit → deny. Then as owner → allow. Confirms the check binds to the resource, not just the role.
+4. **Deny by default.** Invent a brand-new action string with no policy rule → deny (not allow). Proves nothing slips through unmatched.
+5. **Deny wins.** A resource in `status = "locked"` (or a user under an explicit deny) → deny even when a role would otherwise allow. Assert against `final_allow`, the value the API consumes.
+6. **RLS backstop.** Run a `SELECT` that *omits* the app-layer tenant filter against a session with `app.tenant_id` set → still returns only that tenant's rows. Proves the data tier holds when the app forgets.
+7. **Centralization.** `grep -rnE 'role *== *|isAdmin|\.role\b' src/` finds zero authorization branches outside the policy layer — every decision goes through `authorize()`.
+8. **Privilege escalation negative test.** A user cannot grant themselves a role/permission or modify a policy they shouldn't (the "edit roles" action is itself authorized).
+Done = the role × action matrix passes in CI, cross-tenant and IDOR probes are denied at both the API and DB tier (RLS enforced), the policy is versioned with `default allow := false` and deny-wins (API reads `final_allow`), and `grep` finds no authorization logic outside the centralized layer.
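
As a rough illustration of the allow/deny matrix in the Verify list above, the following Python sketch mirrors the Rego rules in plain code so the matrix can be exercised without a policy engine. It is not part of the package diff. `GRANTS`, `Principal`, `Resource`, `final_allow`, and `matrix_mismatches` are hypothetical names chosen for this example, and the grant table is an assumed one.

```python
from dataclasses import dataclass

# Assumed role -> action grants (illustrative only, not taken from the package).
GRANTS = {
    "admin": {"document:read", "document:edit", "document:delete"},
    "editor": {"document:read", "document:edit"},
    "viewer": {"document:read"},
}


@dataclass(frozen=True)
class Principal:
    user_id: str
    tenant_id: str
    roles: tuple = ()


@dataclass(frozen=True)
class Resource:
    id: str
    tenant_id: str
    owner_id: str
    status: str = "active"


def allow(p: Principal, action: str, r: Resource) -> bool:
    """Deny by default: only an explicit match returns True."""
    if r.tenant_id != p.tenant_id:      # same-tenant gate, checked first
        return False
    if r.owner_id == p.user_id:         # owner may act on their own resource
        return True
    return any(action in GRANTS.get(role, set()) for role in p.roles)


def deny(p: Principal, action: str, r: Resource) -> bool:
    """Explicit deny condition, independent of any allow rule."""
    return r.status == "locked"


def final_allow(p: Principal, action: str, r: Resource) -> bool:
    """Deny wins over any allow."""
    return allow(p, action, r) and not deny(p, action, r)


def matrix_mismatches(expected: dict, resource: Resource) -> list:
    """Return every (role, action) cell whose decision differs from the table."""
    bad = []
    for (role, action), want in expected.items():
        # A same-tenant caller who is not the owner, holding exactly one role.
        caller = Principal("someone-else", resource.tenant_id, (role,))
        if final_allow(caller, action, resource) != want:
            bad.append((role, action))
    return bad
```

An empty list from `matrix_mismatches` means every cell of the intended table holds. Cross-tenant and locked-resource cases are best asserted separately against `final_allow`, since they vary the principal or resource rather than the role.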

package/skills/design-backup-dr-recovery/SKILL.md ADDED

@@ -0,0 +1,113 @@
+---
+name: design-backup-dr-recovery
+description: Designs and validates backup, point-in-time-recovery, and disaster-recovery strategy for datastores — sets RPO/RTO targets, configures snapshot plus continuous WAL/binlog/oplog archiving for PITR, 3-2-1 immutable retention, automated test-restores, and cross-region replica failover with split-brain fencing.
+when_to_use: When a stateful service needs a credible answer to "what if the database is lost or corrupted" — setting RPO/RTO, wiring snapshots + continuous log archiving for PITR, designing cross-region failover, scheduling tested restores, or auditing a never-restore-tested backup. Distinct from db-migration-safety (forward schema change safety) and incident-response-sre (running the live outage, not designing recoverability).
+---
+## When to Use
+Reach for this skill when the question is **"can we get the data back, and how fast"** — not how to change the schema:
+- "Set RPO/RTO for this database and prove we can hit them"
+- "We have nightly snapshots but no way to restore to 2:47pm — add PITR"
+- "Stand up cross-region DR / a warm standby we can promote"
+- "Our backups have never been restore-tested — audit and fix that"
+- "Recover a single dropped table without rolling back the whole DB"
+- "Defend backups against ransomware / a fat-fingered `DROP DATABASE`"
+NOT this skill:
+- Making a forward schema migration safe/reversible (expand-contract, online DDL) → db-migration-safety
+- Running the live incident — paging, comms, mitigation timeline → incident-response-sre
+- Protecting/rotating the backup-store credentials and KMS keys → secrets-management
+- Alerting that a backup job failed / dashboards for restore lag → observability-instrument
+- Trimming snapshot/storage spend → cloud-cost-optimize
+## Steps
+1. **Set RPO and RTO per datastore from business impact — these two numbers drive every later choice.** RPO = max tolerable data loss (how far back you may rewind). RTO = max tolerable downtime (how long restore may take). Pick a tier, don't invent per-DB:
+   | Tier | Example data | RPO | RTO | Implied mechanism |
+   |---|---|---|---|---|
+   | Tier 0 (money/orders) | payments ledger, auth | ≤ seconds | ≤ minutes | sync replica + continuous WAL, automated promotion |
+   | Tier 1 (core app) | primary OLTP DB | ≤ 5 min | ≤ 1 hr | snapshot + async WAL archiving (PITR), warm standby |
+   | Tier 2 (supporting) | analytics, search index | ≤ 1 hr | ≤ 4 hr | hourly snapshot, rebuild-from-source allowed |
+   | Tier 3 (derived/cache) | caches, rebuildable views | n/a | n/a | no backup — document the rebuild procedure instead |
+   RPO ≤ snapshot interval is a lie unless you also archive logs continuously (step 2). Write the chosen numbers down; an untargeted "we back up nightly" has an implicit 24h RPO nobody agreed to.
+2. **Two backup layers: periodic base + continuous log archiving. Snapshot-only cannot do PITR.** A snapshot gets you to *snapshot time*; the log stream replays forward to any timestamp in between.
+   | Engine | Base backup | Continuous log (the PITR engine) | Restore = base + replay |
+   |---|---|---|---|
+   | PostgreSQL | `pg_basebackup` / disk snapshot | WAL via `archive_command` → object store (pgBackRest/WAL-G) | `restore_command` + `recovery_target_time` |
+   | MySQL/MariaDB | `xtrabackup` / `mysqldump` | binlog (`log_bin`, `binlog_format=ROW`) shipped off-host | restore base, `mysqlbinlog --stop-datetime` apply |
+   | MongoDB | `mongodump` / filesystem snapshot | oplog (replica set required) | restore + `--oplogReplay --oplogLimit` |
+   | SQLite | `.backup` / file copy | WAL file is local-only — ship full DB on a cron | copy file (no true PITR) |
+   | Managed (RDS/Cloud SQL) | automated snapshots | provider-managed transaction logs | "restore to point in time" API |
+   Default for any Tier 0/1 SQL store: **pgBackRest/WAL-G (Postgres) or Percona XtraBackup + binlog (MySQL)** with logs archived every ≤60s. Logical dumps (`pg_dump`/`mysqldump`) are a *secondary* portable copy, not your primary — they're slow to restore and lock/strain a large live DB.
+3. **Retention and layout: 3-2-1 with at least one immutable copy.** 3 copies, 2 media/accounts, 1 off-site/cross-region. Make ≥1 copy **immutable** so ransomware or a compromised admin can't delete it:
+   - Object-lock the bucket: S3 Object Lock **Compliance mode** (`--object-lock-mode COMPLIANCE`), or GCS bucket retention lock, or Azure immutable blob. Compliance mode = nobody, including root, can delete before expiry.
+   - Put backups in a **separate account/project** from production with write-only (no-delete) IAM for the backup writer — same-account backups die with the account.
+   - Lifecycle: hot (last 7d, fast restore) → warm (30d) → cold/Glacier (90–365d per compliance). Cold tiers add hours to RTO — never put your RTO-critical recent backups in Glacier.
+   - Retention must cover **detection lag**: corruption found on day 10 needs a day-9 good copy, so retain > realistic time-to-detect.
+4. **Verify restorability automatically — an untested backup is a hypothesis, not a backup.** Schedule a job that restores to a *scratch* environment and validates, on every backup or at least nightly:
+   ```bash
+   # nightly restore drill (Postgres / pgBackRest), exits non-zero on any failure
+   set -euo pipefail                                   # without this a failed step would not abort the drill
+   pgbackrest --stanza=main --type=time \
+     --target="$(date -u -d '10 min ago' +'%Y-%m-%d %H:%M:%S+00')" restore   # explicit UTC offset
+   pg_ctl start -D "$PGDATA" -w -t 600
+   # validate: structural + content, not just "it started"
+   psql -tAc "SELECT count(*) FROM orders"            | grep -qE '^[0-9]+$'
+   [ "$(psql -tAc "SELECT pg_catalog.pg_database_size('app')")" -gt 0 ]   # size must be > 0
+   RESTORE_SECS=$SECONDS; echo "restore took ${RESTORE_SECS}s (RTO budget: 3600s)"
+   [ "$RESTORE_SECS" -le 3600 ] || { echo "RTO BREACH"; exit 1; }
+   ```
+   Validate **content** (row counts vs a known watermark, `pg_amcheck`/`CHECKSUM TABLE`, app-level invariant query), measure wall-clock restore time, and **fail the job loud** (page) if it breaks or exceeds RTO. The restore time you measure here *is* your real RTO — the planned number is fiction until measured.
+5. **Have a procedure for each recovery shape — they are not the same command.**
+   - **Full restore (host lost):** provision, restore latest base, replay logs to "now", re-point app.
+   - **PITR (bad deploy/poison write at 14:32):** restore base before 14:32, replay to `recovery_target_time = '14:31:59'` with `recovery_target_action = 'pause'`, inspect, then promote. Recover to *just before* the bad event.
+   - **Single-table / logical restore:** restore into a throwaway instance, `pg_dump -t orders` (or `mysqldump --no-create-info`) that table, load into prod — never restore the whole cluster to fix one table.
+   - **Corruption:** do **not** overwrite the only good copy. Restore to a new instance, run `pg_amcheck`/`mongod --repair`/`CHECK TABLE`, diff, cut over only after validation. Promote a healthy replica only after confirming the corruption didn't already replicate.
+6. **Cross-region/replica DR: pick sync vs async deliberately, and fence against split-brain.**
+   | | Sync replication | Async replication |
+   |---|---|---|
+   | RPO | ~0 (no committed loss) | replica lag (seconds–minutes) |
+   | Write latency | + cross-region RTT every commit | none (local commit) |
+   | Use for | Tier 0 only, regions < ~10ms apart | everything else (default) |
+   Default to **async** unless RPO≈0 is mandated and you accept the write-latency tax. Failover = promote standby + cut traffic over. **Split-brain is the real danger**: if the old primary comes back and also takes writes, you get divergent histories that can't be merged. Enforce a quorum/leader-election (Patroni + etcd/Consul, Orchestrator, or RDS Multi-AZ which fences for you) and **STONITH-fence** the old primary (revoke its network/credentials) *before* promoting. Cut traffic via low-TTL DNS (≤30s) or, better, a connection proxy (PgBouncer/HAProxy/ProxySQL) that flips backends instantly — DNS TTL caching makes raw DNS failover slow and uneven.
+7. **Write the runbook with exact commands, and rehearse it (game day).** The runbook lists per scenario: detection signal → exact restore/promote commands (copy-pasteable, with placeholders) → validation queries → traffic-cutover step → rollback-of-the-rollback. Store it **outside** the system it recovers (it's useless if it lives only in the DB that's down). Schedule a **DR drill quarterly** (Tier 0: monthly) that actually fails over to the standby/restored copy under timing — measure RTO/RPO against target, file the gaps. A runbook never executed end-to-end is presumed broken.
+## Common Errors
+- **Never restore-testing.** The #1 cause of "we had backups but couldn't recover." A backup that has never been restored is unproven; automate the drill (step 4) so success/failure is observed continuously, not discovered during the outage.
+- **Snapshot-only, calling it PITR.** Nightly snapshots = up to 24h RPO and you can only land on snapshot boundaries. PITR requires continuous WAL/binlog/oplog archiving (step 2). If asked for "restore to any second," snapshots alone cannot.
+- **Same blast radius.** Backups in the same account/region/bucket as prod die with it — one compromised credential, one region outage, one `DROP` and both the data and its backup are gone. Cross-account + cross-region + immutable is the point.
+- **No immutability → ransomware/insider wipes the backups too.** Mutable backups are deleted in the same attack that hit prod. Use object-lock Compliance mode / retention lock on ≥1 copy.
+- **Replica treated as a backup.** A replica faithfully replicates `DELETE FROM users` and corruption in milliseconds. Replication is for availability/failover; it is **not** a backup and gives zero protection against logical errors. You need both.
+- **Logical dump as the primary backup for a large DB.** `pg_dump`/`mysqldump` of a multi-TB DB takes hours to restore and strains/locks the live DB while running — blows RTO. Use physical base + log archiving; keep logical dumps as a secondary portable copy only.
+- **RTO ignores restore *and* warm-up.** Real RTO = provision + transfer + restore + log replay + cache/index warm-up + cutover. Cold-tier (Glacier) retrieval alone can be hours. Measure end-to-end; don't quote the `restore` command's runtime.
+- **Failover with no split-brain fencing.** Promoting a standby while the old primary still accepts writes forks history irrecoverably. Fence (STONITH) the old primary and use quorum-based promotion before flipping traffic.
+- **DNS-only cutover with long TTL.** A 300s+ TTL means clients keep hitting the dead primary long past promotion. Use TTL ≤30s, or a connection proxy that switches backends instantly.
+- **Backup job "succeeds" but the file is empty/corrupt.** Exit-0 ≠ valid backup. Verify object size > expected floor, checksum, and a test-restore — not just the job's return code.
+- **Retention shorter than detection lag.** Corruption noticed on day 10 with 7-day retention = no clean copy exists. Retain past your realistic time-to-detect, and keep a longer-interval cold copy.
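The "exit-0 is not a valid backup" check above can be sketched as a small post-backup gate. This is a minimal illustration, assuming the backup is a single local file and that the expected SHA-256 was recorded when the backup was written; `validate_backup`, `min_bytes`, and `expected_sha256` are hypothetical names, and a real pipeline would still follow this with a test-restore.

```python
import hashlib
import os


def validate_backup(path: str, min_bytes: int, expected_sha256: str) -> list[str]:
    """Return a list of problems; an empty list means the cheap checks passed."""
    problems = []

    # Size floor: catches the empty or truncated file that a "successful" job left behind.
    size = os.path.getsize(path)
    if size < min_bytes:
        problems.append(f"size {size} below floor {min_bytes}")

    # Checksum: stream the file so large backups do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        problems.append("checksum mismatch")

    return problems
```

A non-empty result would page rather than log; passing both checks only proves the bytes arrived intact, not that they restore.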
+## Verify
+1. **RPO/RTO are written and tiered.** Every stateful datastore has an explicit RPO and RTO number tied to a business tier (step 1) — not an implicit "nightly."
+2. **PITR proven, not assumed.** Restore to an *arbitrary* timestamp between two base backups (e.g. 14:31:59 yesterday) lands the data at that second — proves continuous log archiving works, not just snapshots.
+3. **Automated restore drill is green and timed.** The nightly/per-backup test-restore to scratch passes (structural + content + invariant checks) and its measured wall-clock ≤ RTO budget; a failure or RTO breach **pages**.
+4. **3-2-1 + immutability holds.** ≥3 copies across ≥2 accounts/regions, ≥1 with object-lock Compliance/retention-lock that even root cannot delete before expiry — confirm by attempting (and failing) to delete a locked object.
+5. **Independent blast radius.** Deleting/encrypting the prod bucket/account leaves a usable backup intact in another account/region.
+6. **Each recovery shape has a tested path:** full restore, PITR-to-timestamp, single-table logical restore, and corruption-to-new-instance — each with copy-pasteable commands in the runbook.
+7. **Failover fences and cuts over fast.** A drill promotion fences the old primary (it cannot take writes post-promotion) and traffic moves via ≤30s-TTL DNS or a proxy; no split-brain divergence after.
+8. **Game day actually ran.** A dated DR drill within the cadence (≤1 quarter; Tier 0 ≤1 month) failed over end-to-end, measured RPO/RTO vs target, and logged the gaps.
+Done = every datastore has written RPO/RTO targets, PITR (base + continuous logs) restoring to an arbitrary timestamp, an automated restore drill that is green and within RTO, ≥1 immutable cross-account/region copy, and a runbook proven by a dated end-to-end DR drill — restore time **measured**, never merely planned.

package/skills/design-event-sourcing-cqrs/SKILL.md ADDED

@@ -0,0 +1,143 @@
+---
+name: design-event-sourcing-cqrs
+description: Designs event-sourced and CQRS systems — past-tense immutable event schemas, aggregate boundaries with command→validate→emit→apply and expected-version optimistic concurrency, append-only per-stream event store with outbox publishing, rebuildable idempotent projections, snapshotting, and versioned upcasting for event evolution.
+when_to_use: When you need an audit-complete, replayable, append-only domain model (ledgers, order/workflow state machines, compliance) or are splitting write commands from read queries, or fixing event-sourcing pain (projection lag, frozen event shapes, slow rebuilds, lost ordering). For plain CRUD use db-migration-safety; for the messaging transport use message-queue-jobs.
+---
+## When to Use
+Reach for this skill when the domain needs **the history of changes as first-class truth**, not just the current row:
+- "We need a full audit trail / who-changed-what-when that nobody can edit after the fact"
+- "Model an order / loan / subscription as a state machine with replayable transitions"
+- "Build a ledger or balance that must reconcile to zero from its entries"
+- "Separate the write side (commands) from a denormalized read side (queries)"
+- "Time-travel: rebuild what the state *was* at any past moment"
+- Fixing existing pain: projection lag, "we can't change the shape of a 2-year-old event", multi-hour rebuilds, lost per-aggregate ordering, eventual-consistency bugs in the UI
+NOT this skill:
+- Plain CRUD with mutable rows and no replay need → **db-migration-safety** (and stop here — event sourcing is the wrong tool for simple CRUD)
+- The broker/transport that *carries* events (Kafka/SQS/RabbitMQ delivery, retries, DLQ) → **message-queue-jobs**
+- A read-only cache layer to cut DB load → **caching-strategy** (a projection is a system of record for reads; a cache is disposable)
+- Syncing offline client state with conflict resolution → **build-offline-first-sync**
+- Recording *why you chose* event sourcing as a decision → **write-adr**
+- Tuning the projection's query/index once it exists → **optimize-sql-query**
+- Wiring client UI state to the read API → **manage-client-server-state**
+## Steps
+1. **First, decide if event sourcing is even warranted — most apps should not use it.** Adopt it only when ≥1 of these is a hard requirement, and accept the listed cost:
+   | Driver (need ≥1) | Why ES wins | Cost you take on |
+   |---|---|---|
+   | Audit/compliance: immutable, complete history | Events *are* the audit log, tamper-evident | More moving parts than a table |
+   | Temporal queries / "state as of T" | Replay to any point | Rebuild + snapshot machinery |
+   | Complex state machine w/ many transitions | Each transition = one fact | Up-front modelling effort |
+   | Multiple read shapes from one write model | CQRS projections, independent scaling | Eventual consistency everywhere |
+   | Debugging by replaying real history | Deterministic reproduction | Replay must stay deterministic forever |
+   If none apply → use a normal table with CRUD and an `updated_at`; **do not event-source CRUD.** CQRS (split read/write models) is independently useful and does **not** require event sourcing — you can do CQRS over a normal DB.
+2. **Model events as immutable, past-tense facts — name them as business outcomes, never CRUD verbs.** `OrderPlaced`, `PaymentCaptured`, `FundsWithdrawn`, `ShipmentDispatched` — not `OrderUpdated`/`OrderSaved`/`SetStatus`. An event records *what happened*, is append-only, and never carries read-model concerns (no denormalized display strings, no joined names, no computed totals the reader could derive). Event payload contract:
+   ```jsonc
+   {
+     "event_id": "uuid-v4",                 // unique; the consumer dedup key (idempotency)
+     "event_type": "FundsWithdrawn",        // past tense, business fact
+     "event_version": 1,                    // schema version of THIS type
+     "aggregate_id": "acct-9c1f",           // the stream key
+     "aggregate_type": "Account",
+     "sequence": 42,                        // per-aggregate, gap-free, monotonic = the version
+     "occurred_at": "2026-06-15T09:30:00Z", // business time captured at emit, NEVER now() in apply
+     "data": { "amount_cents": 5000, "currency": "USD" },
+     "metadata": { "causation_id": "...", "correlation_id": "...", "actor": "user-7" }
+   }
+   ```
+   Keep `data` minimal and self-contained: only facts the writer *decided*, expressed in raw value types. Put tracing/identity in `metadata`, never in `data`.
+3. **Draw aggregate boundaries = the consistency boundary, and keep them small.** An aggregate is the unit that enforces an invariant in a single transaction (e.g. "balance never goes negative"). One command mutates exactly **one** aggregate atomically. Command flow is always **load → validate → emit → apply**:
+   ```
+   handle(cmd):
+     events = load_stream(cmd.aggregate_id)        # replay history
+     state  = events.reduce(apply, initial())      # rebuild current state in memory
+     if not invariant_holds(state, cmd):           # VALIDATE against rebuilt state
+        raise Rejected(reason)                      # rejection is NOT an event
+     new = decide(state, cmd)                       # EMIT new past-tense events
+     append(cmd.aggregate_id, new,
+            expected_version = state.version)        # optimistic concurrency
+     return new
+   ```
+   Rules: validation reads only the aggregate's own rebuilt state (no cross-aggregate reads, no querying a projection to decide). Cross-aggregate consistency is achieved *eventually* via a process manager/saga reacting to events, not in one transaction. A giant aggregate ("the whole tenant") serializes all writes — split it.
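The command flow in step 3 could look roughly like this in Python. It is a sketch under stated assumptions, not a reference implementation: the store is an in-memory dict, `FundsDeposited` is a hypothetical companion to the `FundsWithdrawn` event named earlier, and `InMemoryStore`, `AccountState`, and `handle_withdraw` are illustrative names.

```python
from dataclasses import dataclass


class Rejected(Exception):
    """Command refused; nothing is appended to the stream."""


class ConcurrencyConflict(Exception):
    """Another writer appended to the stream first."""


@dataclass(frozen=True)
class Event:
    event_type: str
    amount_cents: int


@dataclass(frozen=True)
class AccountState:
    balance_cents: int = 0
    version: int = 0  # sequence of the last applied event


def apply(state: AccountState, event: Event) -> AccountState:
    """Pure fold: no clock, no randomness, no I/O."""
    sign = 1 if event.event_type == "FundsDeposited" else -1
    return AccountState(state.balance_cents + sign * event.amount_cents, state.version + 1)


class InMemoryStore:
    def __init__(self):
        self.streams: dict[str, list[Event]] = {}

    def load_stream(self, aggregate_id: str) -> list[Event]:
        return list(self.streams.get(aggregate_id, []))

    def append(self, aggregate_id: str, events: list[Event], expected_version: int) -> None:
        stream = self.streams.setdefault(aggregate_id, [])
        if len(stream) != expected_version:  # someone else appended since we loaded
            raise ConcurrencyConflict(aggregate_id)
        stream.extend(events)


def handle_withdraw(store: InMemoryStore, aggregate_id: str, amount_cents: int) -> list[Event]:
    state = AccountState()
    for event in store.load_stream(aggregate_id):         # load + rebuild state
        state = apply(state, event)
    if amount_cents > state.balance_cents:                 # validate the invariant
        raise Rejected("insufficient funds")
    new_events = [Event("FundsWithdrawn", amount_cents)]   # emit past-tense facts
    store.append(aggregate_id, new_events, expected_version=state.version)
    return new_events
```

Note that the rejection path raises and appends nothing, and that the handler never reads anything but its own stream.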
+4. **Make the store append-only, ordered per stream, with expected-version concurrency.** One stream per aggregate; `sequence` is gap-free and monotonic *within a stream* (do not assume a global total order across streams). Append is a conditional insert:
+   ```sql
+   CREATE TABLE events (
+     global_position BIGSERIAL PRIMARY KEY,        -- store-wide read order for projectors/relay
+     event_id        UUID NOT NULL UNIQUE,         -- carried to broker; consumer dedup key
+     aggregate_id    TEXT NOT NULL,
+     aggregate_type  TEXT NOT NULL,
+     sequence        INT  NOT NULL,                -- per-stream version: append uses expected_version+1
+     event_type      TEXT NOT NULL,
+     event_version   INT  NOT NULL,
+     data            JSONB NOT NULL,
+     metadata        JSONB NOT NULL,
+     occurred_at     TIMESTAMPTZ NOT NULL,         -- business time, set by writer (not now())
+     recorded_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
+     UNIQUE (aggregate_id, sequence)               -- THIS enforces optimistic concurrency
+   );
+   ```
+   An aggregate's `version` == the `sequence` of its last appended event. The append SQL inserts rows with `sequence = expected_version + 1, +2, …`. The `UNIQUE(aggregate_id, sequence)` violation = a concurrent writer won the race → catch it (`23505` in Postgres), reload, re-validate, retry (or return `409 Conflict` to the caller). `event_id` must be persisted, not regenerated — it's what every downstream consumer dedupes on. **Never** `UPDATE`/`DELETE` an event row; corrections are new compensating events (`ChargeRefunded`, not a delete).
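To make the conditional insert concrete, here is a runnable sketch that uses Python's built-in `sqlite3` as a stand-in for Postgres. That substitution is an assumption for the sake of a self-contained example: the table is trimmed to the columns that matter, and the conflict surfaces as `sqlite3.IntegrityError` rather than SQLSTATE `23505`. `open_store` and `append` are hypothetical helper names.

```python
import json
import sqlite3
import uuid


class ConcurrencyConflict(Exception):
    """UNIQUE(aggregate_id, sequence) was violated: another writer won the race."""


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE events (
             global_position INTEGER PRIMARY KEY AUTOINCREMENT,
             event_id        TEXT NOT NULL UNIQUE,
             aggregate_id    TEXT NOT NULL,
             sequence        INTEGER NOT NULL,
             event_type      TEXT NOT NULL,
             data            TEXT NOT NULL,
             UNIQUE (aggregate_id, sequence))"""
    )
    return conn


def append(conn, aggregate_id, expected_version, events):
    """Insert events at expected_version+1, +2, ... in one transaction."""
    rows = [
        (str(uuid.uuid4()), aggregate_id, expected_version + i, etype, json.dumps(data))
        for i, (etype, data) in enumerate(events, start=1)
    ]
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.executemany(
                "INSERT INTO events (event_id, aggregate_id, sequence, event_type, data)"
                " VALUES (?, ?, ?, ?, ?)",
                rows,
            )
    except sqlite3.IntegrityError as exc:
        # Caller should reload the stream, re-validate, and retry (or return 409).
        raise ConcurrencyConflict(aggregate_id) from exc
```

The `event_id` is generated once at append time and stored, so any later relay or consumer sees the same value.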
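+5. **Publish via the outbox/transactional pattern; never dual-write.** Writing to the event store *and* publishing to the broker as two separate operations loses or duplicates events on crash. Instead, the event row **is** the outbox: appending the events is the aggregate write, so a single DB transaction covers both. A separate relay then polls `events` ordered by `global_position` (or uses CDC/`LISTEN`) and pushes to the broker, tracking a high-water mark. Because `BIGSERIAL` values are assigned at insert time rather than commit time, a row with a lower `global_position` can become visible after a higher one; a relay that blindly advances its high-water mark can skip it, so handle late-committing rows explicitly. Consumers must be idempotent (dedupe on `event_id`) because the relay guarantees only **at-least-once** delivery.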
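A relay and an idempotent consumer might be sketched as follows. Everything here is illustrative: the log is a plain list of dicts, `publish` is a hypothetical broker callback, and the `seen` set stands in for whatever durable dedup table a real consumer would use.

```python
class Relay:
    """Reads the event log in global_position order and publishes past a high-water mark."""

    def __init__(self, log, publish):
        self.log = log              # list of event dicts, sorted by global_position
        self.publish = publish      # hypothetical broker callback
        self.high_water_mark = 0

    def run_once(self):
        for event in self.log:
            if event["global_position"] <= self.high_water_mark:
                continue
            self.publish(event)
            # A crash after publish but before this line re-publishes the event
            # on restart: that is the at-least-once guarantee.
            self.high_water_mark = event["global_position"]


class IdempotentConsumer:
    def __init__(self):
        self.seen = set()           # stand-in for a durable dedup table
        self.total_cents = 0

    def handle(self, event):
        if event["event_id"] in self.seen:   # duplicate delivery is a no-op
            return
        self.seen.add(event["event_id"])
        self.total_cents += event["amount_cents"]
```

In a real consumer the dedup record and the side effect would need to commit together; otherwise a crash between them reintroduces the duplicate.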
+6. **Build read models as rebuildable, idempotent projections — and surface eventual consistency.** A projection subscribes to the event stream in `global_position` order and writes a denormalized read table. Two non-negotiables:
+   - **Idempotent**: store the last processed `global_position` per projection; on replay skip anything `<=` it, and make each apply an upsert keyed by the event's natural id so re-delivery is a no-op.
+   - **Rebuildable from zero**: a projection must be reconstructable by `TRUNCATE read_table; reset checkpoint to 0; replay all`. If it can't, it's a hidden write model — fix it.
+   Reads are stale by the projection lag (ms→s). Make that explicit: return a version/`as_of` position with reads, and for read-your-writes either route the writer to a freshly-projected read or have the client wait until the projection checkpoint ≥ the position its write returned. Do not pretend the read side is synchronous.
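The two projection rules above (idempotent, rebuildable from zero) and the `as_of` marker can be illustrated with an in-memory projector. This is a sketch only: `BalanceProjection` is a hypothetical name, the read table is a dict, and `FundsDeposited` is an assumed event type.

```python
class BalanceProjection:
    def __init__(self):
        self.checkpoint = 0   # last processed global_position
        self.rows = {}        # aggregate_id -> balance_cents (the read table)

    def apply(self, event):
        if event["global_position"] <= self.checkpoint:
            return            # already processed: re-delivery is a no-op
        delta = event["amount_cents"]
        if event["event_type"] == "FundsWithdrawn":
            delta = -delta
        key = event["aggregate_id"]
        self.rows[key] = self.rows.get(key, 0) + delta
        self.checkpoint = event["global_position"]

    def rebuild(self, log):
        """Equivalent of TRUNCATE + reset checkpoint + replay all."""
        self.rows.clear()
        self.checkpoint = 0
        for event in log:
            self.apply(event)

    def read(self, aggregate_id):
        # Returning the checkpoint lets callers detect staleness explicitly.
        return {"balance_cents": self.rows.get(aggregate_id, 0), "as_of": self.checkpoint}
```

Against a real database, the row update and the checkpoint update would belong in the same transaction so the two can never disagree.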
+7. **Bound replay with snapshots — but rebuild must still work from zero without them.** When a hot aggregate has thousands of events, replaying all of them per command gets slow. Snapshot = a serialized aggregate state at a known `sequence`, stored in a separate `snapshots` table. Load = newest snapshot ≤ head, then replay only events after it. Defaults: snapshot every **N=100–500** events per aggregate, keep the latest 1–2, and treat snapshots as a **disposable cache** — they're derived, deletable, and a full rebuild from event 0 must produce byte-identical state. Never let business logic read from a snapshot that the event log couldn't reproduce.
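A minimal sketch of snapshot-assisted loading, assuming state is a `(balance_cents, version)` tuple and events carry a hypothetical `delta_cents` field. The point it illustrates is that loading with and without a snapshot must agree.

```python
from functools import reduce

SNAPSHOT_EVERY = 100  # illustrative; the text suggests 100-500 events


def apply(state, event):
    balance_cents, version = state
    return (balance_cents + event["delta_cents"], version + 1)


def load(events, snapshot=None):
    """events: the full stream ordered by sequence (1-based).
    snapshot: a previously saved (balance_cents, version) state, or None."""
    start = snapshot if snapshot is not None else (0, 0)
    tail = events[start[1]:]          # only events after the snapshot's version
    return reduce(apply, tail, start)


def maybe_snapshot(state):
    """Return the state to persist as a snapshot, or None if it is not time yet."""
    version = state[1]
    return state if version > 0 and version % SNAPSHOT_EVERY == 0 else None
```

Deleting every snapshot changes only how long `load` takes, never what it returns.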
+8. **Evolve schemas by versioning + upcasting, with lenient deserialization — you can never edit old events.** Old events are immutable history; you migrate them *on read*. Bump `event_version` for any non-additive change and register an upcaster chain that transforms `v1 → v2 → … → current` before the event reaches `apply`:
+   | Change | Safe? | How |
+   |---|---|---|
+   | Add optional field w/ default | ✅ additive | Lenient deserializer fills default; no version bump needed |
+   | Rename field | ⚠️ | Bump version; upcaster maps old→new name |
+   | Split/merge fields, change units (dollars→cents) | ⚠️ | Bump version; upcaster computes new shape |
+   | Remove a field still read by a projector | ❌ | Keep reading it via upcaster default; never drop in place |
+   | Change the *meaning* of an event type | ❌ | Introduce a **new** event type; leave the old one |
+   Deserialize leniently (ignore unknown fields, default missing ones) so a forward-deployed reader survives a slightly newer/older payload during rollout.
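An upcaster chain can be as small as a lookup table of per-version functions. The sketch below assumes two hypothetical schema changes to `FundsWithdrawn` (whole dollars to cents in v2, `currency` renamed to `currency_code` in v3); neither change comes from the source, they only exercise the table's "change units" and "rename field" rows.

```python
def v1_to_v2(data):
    # Hypothetical v1 stored whole dollars and had no currency field.
    return {"amount_cents": data["amount_dollars"] * 100, "currency": data.get("currency", "USD")}


def v2_to_v3(data):
    # Hypothetical rename: currency -> currency_code.
    out = {k: v for k, v in data.items() if k != "currency"}
    out["currency_code"] = data.get("currency", "USD")
    return out


UPCASTERS = {("FundsWithdrawn", 1): v1_to_v2, ("FundsWithdrawn", 2): v2_to_v3}
CURRENT_VERSION = {"FundsWithdrawn": 3}


def upcast(event):
    """Return a copy of the event at the current schema version; the stored event is untouched."""
    etype, version, data = event["event_type"], event["event_version"], dict(event["data"])
    while version < CURRENT_VERSION[etype]:
        data = UPCASTERS[(etype, version)](data)
        version += 1
    return {**event, "event_version": version, "data": data}


def deserialize_withdrawal(data):
    """Lenient read: unknown fields are ignored, missing optional fields get defaults."""
    return {"amount_cents": data["amount_cents"], "currency_code": data.get("currency_code", "USD")}
```

Because each step only knows about its neighbouring version, adding v4 means writing one new function rather than revisiting the old ones.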
+9. **Detect and repair projection drift.** Projections silently diverge (a bug skipped an event, a deploy reset a checkpoint wrong). Build a reconciliation job that recomputes a checksum/aggregate from the event log and compares to the read model; on mismatch, rebuild that projection from zero (it's safe because projections are idempotent + rebuildable). A blue/green projection swap (build the new table fully, then atomically repoint reads) lets you rebuild without downtime.
+## Common Errors
+- **Event-sourcing plain CRUD.** No audit/temporal/state-machine need → you bought replay/snapshot/upcasting machinery for nothing. Use a table.
+- **CRUD-named events** (`OrderUpdated`, `EntitySaved`, `SetField`). They carry no business meaning and force readers to diff state. Name the *fact*: `OrderShipped`, `PriceReduced`.
+- **Read concerns leaking into events** — denormalized display names, joined data, computed totals. The event is now coupled to a read shape and breaks when the read model changes. Store only the writer's decided facts.
+- **Giant aggregate.** "Account" containing every transaction of every user serializes all writes and replays forever. Scope the aggregate to the smallest invariant boundary.
+- **No expected-version on append.** Two concurrent commands both read version 41 and both write 42 → lost update / broken invariant. Enforce `UNIQUE(aggregate_id, sequence)` and retry on conflict.
+- **Dual-write to store and broker.** A crash between the two loses or duplicates events. Use the outbox (the event row) + a relay; make consumers idempotent.
+- **Non-deterministic replay** — `apply` calls `now()`, `random()`, or a remote service, so rebuild ≠ original. Capture all nondeterminism *into the event* at emit time; `apply` must be a pure fold.
+- **Non-idempotent projector.** Re-delivery (at-least-once) double-counts. Track per-projection `global_position` and make applies upserts keyed by a natural id.
+- **Validating against a projection instead of the rebuilt aggregate.** The projection is stale, so the invariant check races. Always rebuild the aggregate's own state from its stream to decide.
+- **Treating rejections as events.** A failed/declined command must not append `OrderRejected` unless the *rejection itself is a meaningful business fact*; otherwise return an error — don't pollute the log.
+- **Editing or deleting old events to "fix" them.** Destroys auditability and breaks every existing projection's replay. Append a compensating event instead.
+- **Snapshot used as source of truth.** If the log can't reproduce the snapshot, a snapshot bug becomes permanent corruption. Snapshots are a disposable cache.
+- **Assuming a global event order across aggregates.** Per-stream order is guaranteed; cross-stream is not. Don't build invariants that need two streams ordered together — use a saga.
+## Verify
+1. **Round-trip determinism:** replay an aggregate's full stream twice into fresh in-memory state → byte-identical result; replaying with vs without a snapshot → identical state.
+2. **Optimistic concurrency:** fire two commands against the same aggregate at the same `expected_version` **in parallel** → exactly one commits, the other gets the `UNIQUE(aggregate_id, sequence)` violation (`23505`) surfaced as `409 Conflict` and succeeds only after reload+retry. The stream has no gap and no duplicated `sequence`.
+3. **Projection rebuild:** `TRUNCATE read_table`, reset checkpoint to 0, replay all events → read model is bit-identical to its pre-truncate state. This proves it's rebuildable, not a hidden write model.
+4. **Idempotent projector:** replay the same event slice twice → read rows and the checkpoint are unchanged after the second pass (no double counts).
+5. **Outbox at-least-once:** kill the relay mid-publish, restart → every event reaches the broker at least once, consumers dedupe on `event_id`, no event lost.
+6. **Upcasting:** feed a stored `event_version: 1` payload through the upcaster chain → it deserializes to current shape and `apply` accepts it; a lenient-deserialize test with an unknown extra field still loads.
+7. **Drift detection:** intentionally skip one event in a projection → the reconciliation checksum job flags the mismatch, and a rebuild from zero repairs it.
+8. **Eventual consistency surfaced:** a write returns a position; a read issued before the projector catches up is detectably stale (returns an older `as_of`/version), and the read-your-writes path waits for checkpoint ≥ that position.
+Done = replay is deterministic (1), concurrent appends conflict-detect with gap-free sequences (2), every projection rebuilds from zero idempotently (3,4), publishing is at-least-once with idempotent consumers (5), old event versions upcast cleanly (6), and projection drift is both detectable and auto-repairable (7) — all under parallel load, with eventual consistency made explicit to readers (8).

package/skills/design-multi-tenancy/SKILL.md ADDED

@@ -0,0 +1,100 @@
+---
+name: design-multi-tenancy
+description: Architects a SaaS so many customer orgs share infrastructure without leaking into each other — picking an isolation model (shared schema + Postgres RLS, schema-per-tenant, or database-per-tenant) against an explicit cost/blast-radius/ops tradeoff table, resolving and propagating tenant context from request to DB session, and enforcing isolation in depth (app-layer query scoping PLUS RLS as the safety net) so a single forgotten tenant filter can't cross-leak. Also covers per-tenant quotas/noisy-neighbor mitigation, fan-out migrations across thousands of tenants, tenant offboarding (export + hard delete), optional per-tenant keys, and safe cross-tenant admin features.
+when_to_use: Building or hardening a multi-tenant SaaS where many customer organizations share infra and must be isolated from one another — choosing an isolation model, stopping cross-tenant data leaks, scoping every query by tenant, or scaling migrations/quotas across many tenants. Distinct from design-relational-schema (general table/normalization modeling — this is the tenancy/isolation layer built on top of that), design-authorization-model (what a user may do WITHIN one tenant — RBAC/ABAC — vs separating tenants from each other), and map-privacy-data-gdpr (PII rights/consent — referenced for export/delete mechanics but not the focus).
+---
+## When to Use
+Reach for this skill when the question is **"how do I keep tenant A's data away from tenant B while they share the same stack?"** — the isolation architecture, not the per-user permissions inside one org:
+- "We're going multi-tenant — shared tables with a `tenant_id`, a schema per customer, or a DB per customer?"
+- "How do I make sure one missing `WHERE tenant_id` can't leak another org's data?"
+- "Resolve the tenant from the subdomain / `X-Tenant` header / JWT org claim and scope every query to it"
+- "We have 4,000 tenants — how do I run a schema migration across all of them safely?"
+- "Enterprise customer wants their data in a separate database / their own encryption key"
+- "One big tenant is hammering the DB and starving everyone else (noisy neighbor)"
+- "Build admin impersonation / global analytics without accidentally bypassing isolation"
+NOT this skill:
+- Designing the tables/keys/normalization themselves (PKs, 1:N, constraints) → **design-relational-schema** (this skill adds the `tenant_id` + RLS layer on top of that model)
+- Roles/permissions for users *within* a single tenant (admin vs viewer, per-resource sharing) → **design-authorization-model** (authZ within a tenant ≠ isolating tenants from each other)
+- DSAR export format, consent capture, lawful basis, erasure-across-backups policy → **map-privacy-data-gdpr** (referenced in step 7 for the offboarding mechanics)
+- The mechanics of capping request *rate/volume* per caller (token bucket, 429, Redis counters) → **rate-limiting** (referenced in step 6 for per-tenant quotas)
+- Running one risky `ALTER` on one large live table safely (locks, backfill) → **db-migration-safety** (referenced in step 8 for the fan-out)
+- Cache patterns/TTLs/stampede in general → **caching-strategy** (referenced in step 9 for tenant-keyed caches)
+## Steps
+1. **Pick the isolation model from a tradeoff table — default shared+RLS, escalate per tenant only when a reason demands it.** The three models are not all-or-nothing; a mature SaaS often runs *hybrid pods*.
+   | Dimension | Shared schema + RLS | Schema-per-tenant | Database-per-tenant |
+   |---|---|---|---|
+   | **Isolation** | Logical (one bug from leak) | Stronger (namespace) | Strongest (physical) |
+   | **Cost / tenant** | Lowest (one DB, shared) | Low–medium | Highest (conn pool, idle DB, backups each) |
+   | **Ops / migration burden** | One migration, all tenants | Loop over N schemas | Loop over N databases (heaviest) |
+   | **Blast radius** | All tenants (shared) | Per-schema | Per-tenant only |
+   | **Noisy neighbor** | Worst — shared buffers/CPU/locks | Some sharing | Isolated resources |
+   | **Per-tenant restore / PITR** | Hard (row-level surgery) | Medium | Trivial (restore that DB) |
+   | **Tenant count it scales to** | 100k+ | hundreds–low thousands | tens–low hundreds |
+   Pick: **shared schema + RLS by default** (cheapest, scales to many small tenants); **schema-per-tenant** when you want per-tenant restore/customization without N databases' connection overhead; **database-per-tenant** for enterprise/compliance (HIPAA/SOC2 data-residency), per-tenant encryption/restore, or one tenant so large it deserves its own resources. **Hybrid pods:** small tenants share a pool, large/enterprise tenants get dedicated DBs — a `tenant → shard/connection` routing map (a "tenant catalog" table in a control-plane DB) decides at request time. Write the decision in an ADR (see write-adr); migrating models later is a large project.
+2. **Add `tenant_id` to every tenant-owned table: non-null, indexed, first column of composite indexes.** In the shared model, every table carrying tenant data gets `tenant_id uuid NOT NULL REFERENCES tenant(id)`. Make it the **leading column** of relevant indexes and most composite PKs/uniques (`UNIQUE (tenant_id, email)` not `UNIQUE (email)`, because email is unique *per tenant*, not globally). Global/system tables (plans, feature flags, the `tenant` registry itself) have no `tenant_id`. Never let `tenant_id` be nullable or carry a default; a row with a null or defaulted tenant is an isolation hole.
+3. **Resolve tenant context at the edge, from a trusted source — never from a client-supplied body field.** Map the inbound request to exactly one tenant:
+   | Source | How | Note |
+   |---|---|---|
+   | **Subdomain** | `acme.app.com` → `acme` | Friendly; needs wildcard DNS/TLS; map slug→tenant_id in catalog |
+   | **`X-Tenant` header** | API/service-to-service | Trust only if the caller is authenticated; never accept it from an unauthenticated browser request |
+   | **JWT `org`/`tenant` claim** | from the verified token | **Most trustworthy** — signed, can't be forged client-side |
+   Resolve once at the edge (middleware), validate the tenant is active, and store it in an **immutable request context** (not a mutable global). The cardinal rule: derive `tenant_id` from the **authenticated identity**, never from a request body/query param — a client-supplied `tenant_id` is a cross-tenant skeleton key (this is exactly the gap **design-authorization-model** warns about). If subdomain and token disagree, reject.
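Edge resolution as described in step 3 might look like the sketch below. The catalog contents, the `tenant` claim name, the `app.com` base domain, and the `resolve_tenant` helper are all assumptions for illustration; token verification is assumed to have happened before this function is called.

```python
from dataclasses import dataclass


class TenantResolutionError(Exception):
    pass


@dataclass(frozen=True)  # frozen = immutable request context
class TenantContext:
    tenant_id: str


# Hypothetical tenant catalog: slug -> (tenant_id, active)
CATALOG = {"acme": ("t-001", True), "globex": ("t-002", False)}


def resolve_tenant(host: str, verified_claims: dict, base_domain: str = "app.com") -> TenantContext:
    """Derive the tenant from the verified token and cross-check it against the subdomain."""
    claim_tenant = verified_claims.get("tenant")
    if not claim_tenant:
        raise TenantResolutionError("no tenant claim in verified token")

    suffix = "." + base_domain
    if not host.endswith(suffix):
        raise TenantResolutionError("unexpected host")
    entry = CATALOG.get(host[: -len(suffix)])
    if entry is None:
        raise TenantResolutionError("unknown tenant slug")

    tenant_id, active = entry
    if not active:
        raise TenantResolutionError("tenant is not active")
    if tenant_id != claim_tenant:  # subdomain and token disagree: reject
        raise TenantResolutionError("subdomain and token disagree")
    return TenantContext(tenant_id)
```

Nothing in the request body or query string is consulted, so a client cannot choose its tenant by editing a parameter.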
+4. **Defense in depth — app-layer scoping is the primary guard, Postgres RLS is the safety net.** The #1 production multi-tenancy bug is a single query that forgot its tenant filter → cross-tenant leak. You need **both** layers because each fails differently:
+   - **App layer (primary):** every query is scoped through a **tenant-aware repository / ORM global filter** so developers physically can't write an unscoped query. Don't rely on each engineer remembering `WHERE tenant_id = $1` — inject it centrally (e.g. an ORM global scope, a base repository that always appends the filter, a query builder that refuses to run without a tenant).
+   - **DB layer (backstop):** Postgres Row-Level Security catches the day someone bypasses the repository or writes raw SQL.
+   ```sql
+   ALTER TABLE document ENABLE ROW LEVEL SECURITY;
+   ALTER TABLE document FORCE ROW LEVEL SECURITY;        -- applies to the table owner too
+   CREATE POLICY tenant_isolation ON document
+     USING       (tenant_id = current_setting('app.tenant_id')::uuid)   -- read/update/delete visibility
+     WITH CHECK  (tenant_id = current_setting('app.tenant_id')::uuid);  -- blocks INSERT into another tenant
+   ```
+   `FORCE` is non-negotiable (without it the table owner — usually your app's role — bypasses RLS). `WITH CHECK` stops a write that *sets* a foreign `tenant_id`. The app role must **not** have `BYPASSRLS`.
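For the app-layer half of step 4, one possible shape is a repository that cannot be constructed without a tenant and that adds the filter itself. This sketch uses `sqlite3` purely so it runs standalone (SQLite has no RLS, so only the app layer is shown); `TenantScopedRepo` and its methods are hypothetical names.

```python
import sqlite3


class TenantScopedRepo:
    """Every query this class issues carries the tenant filter; callers cannot omit it."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("refusing to build a repository without a tenant")
        self._conn = conn
        self._tenant_id = tenant_id

    def add_document(self, doc_id: int, title: str) -> None:
        # tenant_id comes from the repository, never from the caller's payload.
        self._conn.execute(
            "INSERT INTO document (id, tenant_id, title) VALUES (?, ?, ?)",
            (doc_id, self._tenant_id, title),
        )

    def list_documents(self):
        return self._conn.execute(
            "SELECT id, title FROM document WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        ).fetchall()

    def get_document(self, doc_id: int):
        return self._conn.execute(
            "SELECT id, title FROM document WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, doc_id),
        ).fetchone()
```

Looking up another tenant's document by its real id returns nothing here, which is the behaviour the RLS policy then backs up at the database tier.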
+5. **Set the RLS variable with `SET LOCAL` inside the transaction — the connection-pool caveat that breaks naive RLS.** RLS reads `current_setting('app.tenant_id')`. You must set it per request — but **how** depends on the pooler:
+   - With **PgBouncer in transaction mode** (the common setup), a connection is handed to a *different* tenant's request the instant your transaction commits. A session-level `SET app.tenant_id = ...` therefore **leaks** the previous tenant's value into the next request — a catastrophic cross-tenant bug.
+   - Fix: set it **transaction-scoped** so it auto-resets at commit/rollback:
+     ```sql
+     BEGIN;
+     SET LOCAL app.tenant_id = '...';   -- reset automatically at COMMIT/ROLLBACK; never plain SET
+     -- ... all queries in this request ...
+     COMMIT;
+     ```
+   - Equivalent: `SELECT set_config('app.tenant_id', $1, true)` (the `true` = local). Every tenant request must run inside a transaction that begins with `SET LOCAL`. Assert in the repository that the var is set before any query runs, so a missing context fails closed (returns zero rows / errors) rather than leaking.
+6. **Per-tenant quotas + noisy-neighbor mitigation.** In a shared model one tenant can starve the rest. Enforce **per-tenant rate limits and quotas keyed by `tenant_id`** (token bucket / sliding window — see **rate-limiting**), plus resource guards: statement timeouts, max connections per tenant, row/storage caps, background-job concurrency caps per tenant. For chronic offenders or very-large tenants, move them to a **dedicated pod/DB** (step 1's hybrid). Track per-tenant usage metrics (queries/sec, storage, CPU) so you can detect and isolate a noisy neighbor before it causes an incident.
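A per-tenant token bucket, one of the mechanisms mentioned in step 6, can be sketched in a few lines. This is an in-process illustration with an injected clock so it is deterministic; `TenantRateLimiter` is a hypothetical name, and a multi-instance deployment would presumably keep the buckets in a shared store instead.

```python
class TenantRateLimiter:
    """Token bucket keyed by tenant_id: each tenant refills and drains independently."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self._buckets = {}  # tenant_id -> (tokens, last_timestamp)

    def allow(self, tenant_id: str, now: float) -> bool:
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self._buckets[tenant_id] = (tokens, now)
            return False  # caller would typically respond 429
        self._buckets[tenant_id] = (tokens - 1, now)
        return True
```

Because the buckets are separate, one tenant exhausting its quota has no effect on another tenant's next request.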
+7. **Tenant offboarding — clean per-tenant export and verifiable hard delete.** Deletion and export are isolation-critical and a GDPR obligation (mechanics: **map-privacy-data-gdpr**):
+   - **Export:** dump all rows where `tenant_id = $1` across every table to a machine-readable archive. Database-per-tenant makes this a `pg_dump` of one DB; shared schema requires a tenant-scoped export of every table (drive it from a registry of tenant-owned tables so none is missed).
+   - **Hard delete:** in shared schema, `DELETE` cascades by `tenant_id` (rely on `ON DELETE CASCADE` from the `tenant` row, or a deterministic ordered delete) — and don't forget derived data: caches, search indexes, object storage, analytics warehouse, backups' retention policy. Database/schema-per-tenant: `DROP DATABASE`/`DROP SCHEMA` is the cleanest, most auditable erasure. Verify deletion (assert zero rows remain for the tenant) and log it for compliance.
+8. **Migrations across thousands of tenants: online, batched, versioned, idempotent.** Schema changes work under every isolation model, but they *scale* differently:
+   - **Shared schema:** one migration changes all tenants at once — fast, but a bad migration's blast radius is everyone. Use online/safe DDL (see **db-migration-safety**: avoid long table locks, backfill in batches, add indexes `CONCURRENTLY`).
+   - **Schema-per-tenant / DB-per-tenant:** loop the migration over every schema/database. This must be **batched, resumable, and idempotent** — track each tenant's schema version in the catalog, run N at a time, record success/failure per tenant, and be able to retry only the failures. A 4,000-tenant migration that aborts at tenant 2,500 must resume, not restart. Roll out behind a flag and canary on a few tenants first.
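The resumable fan-out for schema-per-tenant or DB-per-tenant could be driven by a loop like the one below. It is a sketch: the catalog is a dict of `tenant_id -> schema_version`, `migrate_one` is a hypothetical callback that applies the migration to one tenant, and each batch runs sequentially here where a real runner might run it concurrently.

```python
def migrate_all(catalog, target_version, migrate_one, batch_size=50):
    """Migrate every tenant below target_version; return {tenant_id: error} for failures.

    catalog is mutated as tenants succeed, so a re-run picks up only what is left.
    """
    pending = sorted(t for t, v in catalog.items() if v < target_version)
    failures = {}
    for start in range(0, len(pending), batch_size):
        for tenant_id in pending[start:start + batch_size]:
            try:
                migrate_one(tenant_id)
                catalog[tenant_id] = target_version   # record success per tenant
            except Exception as exc:                  # keep going; retry failures later
                failures[tenant_id] = str(exc)
    return failures
```

Calling it a second time after fixing the cause retries only the tenants still below the target version, which is the resume-not-restart property the step asks for.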
+9. **Cache and search keyed by tenant; cross-tenant admin features that don't bypass isolation.**
+   - **Caching:** every cache key includes `tenant_id` (`doc:{tenant_id}:{id}`) so tenant A can never read tenant B's cached value, and invalidation can be per-tenant (see **caching-strategy**). Same for search indexes (per-tenant index or a mandatory tenant filter on every query).
+   - **Admin / impersonation:** when support impersonates a tenant, **set the same `app.tenant_id` and go through the same scoped path**; don't add a "god query" that ignores RLS. Use a separate, audited DB role with `BYPASSRLS` *only* for narrow platform operations, and **log every impersonation** (see **build-audit-logging**). **Global analytics** (cross-tenant metrics) is the one legitimate cross-tenant read: run it through a dedicated read-only role/replica with explicit tenant aggregation, isolated from the application path, never by relaxing the app's RLS.
+10. **Test tenant isolation as a first-class, automated guarantee — the leak test is the one that matters.** Ship these as CI tests, not manual checks:
+    - **Cross-tenant denial:** seed data for tenant A and tenant B; with context set to A, assert that *every* read/list/get of B's resources (by real id) returns **zero rows / 404 / deny** — including raw SQL paths, the cache, and search.
+    - **RLS backstop:** run a query that *omits* the app-layer filter against a session with `app.tenant_id = A` → still returns only A's rows. Proves the data tier holds when the app forgets.
+    - **Fuzz the `tenant_id`:** randomize/swap the tenant in the request context and assert no other tenant's data is ever returned and no write lands in the wrong tenant (`WITH CHECK` holds).
+    - **Pool leak test:** run interleaved requests for A then B over a transaction-mode pool and assert B never sees A's `SET LOCAL` value.
+    - **Optional per-tenant keys:** if tenants have their own encryption keys (envelope encryption, key per tenant in a KMS), test that the wrong key can't decrypt another tenant's data and that key deletion = crypto-shredded data.
+    Done = an isolation model chosen against the tradeoff table and recorded in an ADR; `tenant_id` non-null + indexed on every tenant table (and leading in uniques); tenant context derived from the verified identity at the edge and propagated as immutable request state; every query scoped at the app layer **and** RLS (`FORCE` + `WITH CHECK`, app role without `BYPASSRLS`) enforced, with the var set via `SET LOCAL` per transaction; per-tenant quotas in place; migrations fan out batched/resumable/versioned; export + verified hard delete defined; caches/search/admin/analytics tenant-keyed; and an automated cross-tenant leak test (plus fuzz + pool-leak) passing in CI.