sanook-cli 0.4.0 → 0.5.1

Files changed (238)
  1. package/.env.example +19 -0
  2. package/CHANGELOG.md +173 -0
  3. package/README.md +153 -20
  4. package/README.th.md +136 -0
  5. package/dist/agentContext.js +4 -0
  6. package/dist/approval.js +6 -0
  7. package/dist/bin.js +405 -57
  8. package/dist/brain.js +92 -59
  9. package/dist/brand.js +47 -0
  10. package/dist/checkpoint.js +37 -0
  11. package/dist/commands.js +86 -6
  12. package/dist/compaction.js +76 -5
  13. package/dist/config.js +100 -12
  14. package/dist/cost.js +60 -3
  15. package/dist/doctor.js +92 -0
  16. package/dist/gateway/auth.js +2 -2
  17. package/dist/gateway/ledger.js +2 -2
  18. package/dist/gateway/scheduler.js +1 -0
  19. package/dist/gateway/serve.js +6 -4
  20. package/dist/gateway/server.js +10 -2
  21. package/dist/git.js +11 -2
  22. package/dist/hooks.js +43 -17
  23. package/dist/knowledge.js +48 -49
  24. package/dist/loop.js +182 -66
  25. package/dist/lsp/client.js +173 -0
  26. package/dist/lsp/framing.js +56 -0
  27. package/dist/lsp/index.js +138 -0
  28. package/dist/lsp/servers.js +82 -0
  29. package/dist/mcp-server.js +244 -0
  30. package/dist/mcp.js +184 -29
  31. package/dist/memory-store.js +559 -0
  32. package/dist/memory.js +143 -29
  33. package/dist/orchestrate.js +150 -0
  34. package/dist/providers/codex.js +21 -7
  35. package/dist/providers/keys.js +3 -2
  36. package/dist/providers/models.js +22 -6
  37. package/dist/providers/registry.js +155 -1
  38. package/dist/repomap.js +93 -0
  39. package/dist/search/chunk.js +158 -0
  40. package/dist/search/embed-store.js +187 -0
  41. package/dist/search/engine.js +203 -0
  42. package/dist/search/fuse.js +35 -0
  43. package/dist/search/index-core.js +187 -0
  44. package/dist/search/indexer.js +241 -0
  45. package/dist/search/store.js +77 -0
  46. package/dist/session.js +42 -8
  47. package/dist/skill-install.js +10 -10
  48. package/dist/skills.js +12 -9
  49. package/dist/summarize.js +31 -0
  50. package/dist/tools/bash.js +21 -2
  51. package/dist/tools/diagnostics.js +41 -0
  52. package/dist/tools/edit.js +29 -7
  53. package/dist/tools/index.js +8 -1
  54. package/dist/tools/list.js +7 -2
  55. package/dist/tools/permission.js +90 -9
  56. package/dist/tools/read.js +23 -4
  57. package/dist/tools/remember.js +1 -1
  58. package/dist/tools/sandbox.js +61 -0
  59. package/dist/tools/search.js +105 -4
  60. package/dist/tools/task.js +195 -29
  61. package/dist/tools/timeout.js +35 -0
  62. package/dist/tools/util.js +10 -0
  63. package/dist/tools/write.js +6 -4
  64. package/dist/trust.js +89 -0
  65. package/dist/ui/app.js +228 -31
  66. package/dist/ui/banner.js +4 -9
  67. package/dist/ui/brain-wizard.js +2 -2
  68. package/dist/ui/history.js +30 -0
  69. package/dist/ui/mentions.js +44 -0
  70. package/dist/ui/render.js +55 -15
  71. package/dist/ui/setup.js +97 -12
  72. package/dist/ui/useEditor.js +83 -0
  73. package/dist/update.js +114 -0
  74. package/dist/worktree.js +173 -0
  75. package/package.json +11 -5
  76. package/scripts/postinstall.mjs +33 -0
  77. package/second-brain/.agents/_Index.md +30 -0
  78. package/second-brain/.agents/skills/_Index.md +30 -0
  79. package/second-brain/.agents/workflows/_Index.md +30 -0
  80. package/second-brain/AGENTS.md +4 -4
  81. package/second-brain/Acceptance/_Index.md +30 -0
  82. package/second-brain/Acceptance/golden-case-template.md +39 -0
  83. package/second-brain/Areas/_Index.md +30 -0
  84. package/second-brain/Bugs/System-OS/_Index.md +30 -0
  85. package/second-brain/Bugs/_Index.md +30 -0
  86. package/second-brain/CLAUDE.md +4 -1
  87. package/second-brain/Checklists/_Index.md +30 -0
  88. package/second-brain/Checklists/preflight-postflight-template.md +29 -0
  89. package/second-brain/Distillations/_Index.md +30 -0
  90. package/second-brain/Entities/_Index.md +30 -0
  91. package/second-brain/Entities/entity-template.md +33 -0
  92. package/second-brain/Evals/_Index.md +30 -0
  93. package/second-brain/Evals/correction-pairs.md +24 -0
  94. package/second-brain/Evals/failure-taxonomy.md +24 -0
  95. package/second-brain/Evals/golden-set.md +25 -0
  96. package/second-brain/Evals/quality-ledger.md +23 -0
  97. package/second-brain/Evals/self-eval-rubric.md +23 -0
  98. package/second-brain/GEMINI.md +4 -4
  99. package/second-brain/Goals/_Index.md +30 -0
  100. package/second-brain/Handoffs/_Index.md +30 -0
  101. package/second-brain/Home.md +7 -0
  102. package/second-brain/Intake/Raw Sources/_Index.md +30 -0
  103. package/second-brain/Intake/_Index.md +30 -0
  104. package/second-brain/Intake/_Quarantine/_Index.md +30 -0
  105. package/second-brain/Learning/_Index.md +30 -0
  106. package/second-brain/Playbooks/_Index.md +30 -0
  107. package/second-brain/Playbooks/playbook-template.md +23 -0
  108. package/second-brain/Projects/_Index.md +30 -0
  109. package/second-brain/Prompts/_Index.md +30 -0
  110. package/second-brain/README.md +2 -1
  111. package/second-brain/Research/_Index.md +30 -0
  112. package/second-brain/Retrospectives/_Index.md +30 -0
  113. package/second-brain/Reviews/_Index.md +30 -0
  114. package/second-brain/Runbooks/_Index.md +30 -0
  115. package/second-brain/Runbooks/eval-loop.md +24 -0
  116. package/second-brain/Sessions/_Index.md +30 -0
  117. package/second-brain/Shared/AI-Context-Index.md +20 -0
  118. package/second-brain/Shared/AI-Threads/_Index.md +30 -0
  119. package/second-brain/Shared/Archive/_Index.md +30 -0
  120. package/second-brain/Shared/Assets/_Index.md +30 -0
  121. package/second-brain/Shared/Context-Packs/_Index.md +30 -0
  122. package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
  123. package/second-brain/Shared/Coordination/NOW.md +28 -0
  124. package/second-brain/Shared/Coordination/_Index.md +30 -0
  125. package/second-brain/Shared/Coordination/agent-registry.md +24 -0
  126. package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
  127. package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
  128. package/second-brain/Shared/Coordination/task-board.md +32 -0
  129. package/second-brain/Shared/Core-Facts/_Index.md +30 -0
  130. package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
  131. package/second-brain/Shared/Glossary/_Index.md +30 -0
  132. package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
  133. package/second-brain/Shared/Operating-State/_Index.md +30 -0
  134. package/second-brain/Shared/Prompting/_Index.md +30 -0
  135. package/second-brain/Shared/Provenance/_Index.md +30 -0
  136. package/second-brain/Shared/Rules/_Index.md +30 -0
  137. package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
  138. package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
  139. package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
  140. package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
  141. package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
  142. package/second-brain/Shared/Rules/rules-formatting.md +34 -0
  143. package/second-brain/Shared/Scripts/_Index.md +30 -0
  144. package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
  145. package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
  146. package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
  147. package/second-brain/Shared/User-Memory/_Index.md +30 -0
  148. package/second-brain/Shared/User-Persona/_Index.md +30 -0
  149. package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
  150. package/second-brain/Shared/Working-Memory/_Index.md +30 -0
  151. package/second-brain/Shared/_Index.md +30 -0
  152. package/second-brain/Shared/mcp-servers/_Index.md +30 -0
  153. package/second-brain/Skills/_Index.md +30 -0
  154. package/second-brain/Templates/_Index.md +30 -0
  155. package/second-brain/Templates/bug.md +2 -0
  156. package/second-brain/Templates/handoff.md +2 -0
  157. package/second-brain/Templates/session.md +2 -0
  158. package/second-brain/Tools/_Index.md +30 -0
  159. package/second-brain/Traces/_Index.md +30 -0
  160. package/second-brain/Vault Structure Map.md +33 -1
  161. package/second-brain/copilot/_Index.md +30 -0
  162. package/skills/audit-license-compliance/SKILL.md +117 -0
  163. package/skills/author-codemod/SKILL.md +110 -0
  164. package/skills/build-audit-logging/SKILL.md +112 -0
  165. package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
  166. package/skills/build-cli-tool/SKILL.md +108 -0
  167. package/skills/build-data-table/SKILL.md +141 -0
  168. package/skills/build-native-mobile-ui/SKILL.md +154 -0
  169. package/skills/build-offline-first-sync/SKILL.md +118 -0
  170. package/skills/build-realtime-channel/SKILL.md +122 -0
  171. package/skills/build-vector-search/SKILL.md +131 -0
  172. package/skills/compose-local-dev-stack/SKILL.md +149 -0
  173. package/skills/configure-bundler-build/SKILL.md +166 -0
  174. package/skills/configure-dns-tls/SKILL.md +142 -0
  175. package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
  176. package/skills/configure-security-headers-csp/SKILL.md +122 -0
  177. package/skills/contract-testing/SKILL.md +140 -0
  178. package/skills/datetime-timezone-correctness/SKILL.md +125 -0
  179. package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
  180. package/skills/debug-flaky-tests/SKILL.md +128 -0
  181. package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
  182. package/skills/deliver-webhooks/SKILL.md +116 -0
  183. package/skills/design-api-pagination/SKILL.md +144 -0
  184. package/skills/design-authorization-model/SKILL.md +119 -0
  185. package/skills/design-backup-dr-recovery/SKILL.md +113 -0
  186. package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
  187. package/skills/design-multi-tenancy/SKILL.md +100 -0
  188. package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
  189. package/skills/design-relational-schema/SKILL.md +129 -0
  190. package/skills/design-search-index-infra/SKILL.md +151 -0
  191. package/skills/design-state-machine/SKILL.md +108 -0
  192. package/skills/design-token-system/SKILL.md +109 -0
  193. package/skills/distributed-locks-leases/SKILL.md +120 -0
  194. package/skills/encrypt-sensitive-data/SKILL.md +148 -0
  195. package/skills/feature-flags-rollout/SKILL.md +130 -0
  196. package/skills/file-upload-object-storage/SKILL.md +107 -0
  197. package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
  198. package/skills/harden-llm-app-reliability/SKILL.md +126 -0
  199. package/skills/i18n-localization-setup/SKILL.md +113 -0
  200. package/skills/idempotency-keys/SKILL.md +107 -0
  201. package/skills/implement-push-notifications/SKILL.md +142 -0
  202. package/skills/ingest-webhook-secure/SKILL.md +120 -0
  203. package/skills/integrate-oauth-oidc/SKILL.md +126 -0
  204. package/skills/load-stress-test/SKILL.md +129 -0
  205. package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
  206. package/skills/model-nosql-data/SKILL.md +118 -0
  207. package/skills/money-decimal-arithmetic/SKILL.md +123 -0
  208. package/skills/monitor-ml-drift/SKILL.md +109 -0
  209. package/skills/numeric-precision-units/SKILL.md +144 -0
  210. package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
  211. package/skills/optimize-react-rerenders/SKILL.md +124 -0
  212. package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
  213. package/skills/payments-billing-integration/SKILL.md +114 -0
  214. package/skills/pin-toolchain-versions/SKILL.md +116 -0
  215. package/skills/plan-strangler-migration/SKILL.md +95 -0
  216. package/skills/property-based-testing/SKILL.md +108 -0
  217. package/skills/publish-package-registry/SKILL.md +130 -0
  218. package/skills/recover-git-state/SKILL.md +119 -0
  219. package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
  220. package/skills/resilience-timeouts-retries/SKILL.md +104 -0
  221. package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
  222. package/skills/rewrite-git-history/SKILL.md +109 -0
  223. package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
  224. package/skills/schema-evolution-compatibility/SKILL.md +121 -0
  225. package/skills/send-transactional-email/SKILL.md +126 -0
  226. package/skills/serve-deploy-ml-model/SKILL.md +107 -0
  227. package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
  228. package/skills/setup-devcontainer-env/SKILL.md +131 -0
  229. package/skills/setup-lint-format-precommit/SKILL.md +140 -0
  230. package/skills/setup-monorepo-tooling/SKILL.md +125 -0
  231. package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
  232. package/skills/structured-output-llm/SKILL.md +86 -0
  233. package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
  234. package/skills/test-data-factories/SKILL.md +158 -0
  235. package/skills/threat-model-stride/SKILL.md +123 -0
  236. package/skills/train-evaluate-ml-model/SKILL.md +109 -0
  237. package/skills/unicode-text-correctness/SKILL.md +109 -0
  238. package/skills/visual-regression-testing/SKILL.md +120 -0
@@ -0,0 +1,116 @@
1
+ ---
2
+ name: deliver-webhooks
3
+ description: Builds the producer side of webhooks — you dispatch signed events to customers' HTTPS endpoints. Sign every payload with HMAC-SHA256 over "{timestamp}.{raw_body}" in a versioned signature header with per-endpoint secrets and rotation overlap; deliver at-least-once with exponential backoff + jitter over hours, then dead-letter with manual replay; send thin events (id, type, ts, minimal data) so consumers re-fetch the resource; isolate delivery per endpoint so one broken target can't stall everyone; ship a stable event id + sequence number so consumers can dedup and not assume order; verify endpoints on registration; and lock down the target URL against SSRF (HTTPS-only, block internal/link-local/metadata IPs, re-resolve on each send). Use when your service must reliably and verifiably push events out to third-party subscribers.
4
+ when_to_use: You are the SENDER pushing events to customers' webhook URLs (the Stripe/GitHub side) — building event dispatch, payload signing, the retry+DLQ schedule, endpoint registration, or a delivery-history/replay UI. Distinct from ingest-webhook-secure (the RECEIVER side — verifying inbound signatures and safely processing webhooks you consume) and message-queue-jobs (the general internal job system used here as the delivery substrate; this skill adds the webhook-specific signing, replay, SSRF, and consumer-ergonomics layer on top).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when **your service emits events that third parties subscribe to** and you must deliver them verifiably and reliably:
10
+
11
+ - "Let customers register a webhook URL and we POST events to it when X happens"
12
+ - "How do we sign payloads so receivers can verify the request really came from us?"
13
+ - "One customer's endpoint is down/slow and it's backing up deliveries for everyone"
14
+ - "A delivery failed — retry it with backoff, then dead-letter, with a replay button in the dashboard"
15
+ - "Rotate a customer's signing secret without dropping any deliveries"
16
+ - "Add a delivery-history / attempts log to the customer dashboard"
17
+ - "Someone registered `http://169.254.169.254/...` as their webhook URL" (SSRF)
18
+
19
+ NOT this skill:
20
+ - *Receiving* and verifying inbound webhooks from a provider (Stripe→you) → ingest-webhook-secure (that's the mirror image: you verify their signature; here you produce yours)
21
+ - The underlying job queue / worker / DLQ plumbing (SQS/Kafka/BullMQ/Celery) → message-queue-jobs — used here as the substrate; this skill is the webhook policy on top
22
+ - The retry/backoff/jitter/circuit-breaker math for outbound calls → resilience-timeouts-retries (the primitive this skill's delivery schedule is built on)
23
+ - Per-endpoint request-rate caps / token bucket → rate-limiting (this skill references it for per-target throttling, doesn't reimplement it)
24
+ - Making the *consumer* safe to re-process a redelivered event → idempotency-keys (your job is to send a stable id + sequence so they *can*)
25
+ - Where the per-endpoint signing secret is stored/encrypted at rest → secrets-management
26
+ - Delivery metrics/traces/dashboards plumbing → observability-instrument
27
+
28
+ ## Steps
29
+
30
+ 1. **Sign every payload — HMAC-SHA256 over the exact bytes you send, with a per-endpoint secret.** Compute the signature over `"{timestamp}.{raw_body}"` (the same bytes on the wire — serialize ONCE, sign that buffer, send that buffer; never re-serialize between signing and sending or receivers' verification breaks). Put it in a versioned header so you can add schemes later without breaking verifiers:
31
+
32
+ ```
33
+ X-Webhook-Id: evt_01HZ... # stable unique event id (also a dedup key for the consumer)
34
+ X-Webhook-Timestamp: 1718409600 # unix seconds; part of the signed preimage
35
+ X-Webhook-Signature: t=1718409600,v1=5257a8... # v1 = hex HMAC-SHA256(secret, "{t}.{body}")
36
+ ```
37
+ ```python
38
+ import hashlib, hmac, json, time
39
+ raw = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode() # serialize ONCE
40
+ t = str(int(time.time()))
41
+ sigs = [f"v1={hmac.new(s, f'{t}.'.encode()+raw, hashlib.sha256).hexdigest()}"
42
+ for s in active_secrets_for(endpoint)] # one secret per endpoint; >1 during rotation
43
+ headers = {"X-Webhook-Id": event_id, "X-Webhook-Timestamp": t,
44
+ "X-Webhook-Signature": "t=" + t + "," + ",".join(sigs)}
45
+ ```
46
+ One secret **per endpoint** (never a global secret — a leak then compromises every customer). Document the verify recipe for receivers and point them at ingest-webhook-secure.
47
+
48
+ 2. **Support secret rotation with an overlap window — send multiple signatures.** Rotation = generate a new secret, mark both old+new *active*, and sign with **both** during the overlap (`v1=<old>,v1=<new>`). The receiver accepts a request if *any* signature matches, so they can swap secrets at their leisure. After the documented overlap (e.g. 24–72 h, or when the customer confirms), retire the old secret. Without overlap, rotation drops every in-flight delivery. Store secrets via secrets-management; show "rotate" + "reveal once" in the dashboard.
49
+
50
+ 3. **Include a signed timestamp + document a tolerance window so receivers can reject replays.** The `t` you sign lets a receiver drop a captured-and-replayed request that's older than their tolerance (recommend **±300 s**). You can't enforce it — but you must (a) put `t` *inside* the signed preimage (not just a loose header), (b) keep your senders' clocks NTP-synced so legit deliveries land inside the window, and (c) document the recommended tolerance so receivers implement it. Drift on your side = false replay rejections at every customer.
51
+
52
+ 4. **Deliver at-least-once with exponential backoff + jitter over hours/days, then dead-letter.** Treat **any non-2xx, timeout, or connection error as retryable**; 2xx (any) = delivered, stop. Build the schedule on resilience-timeouts-retries (full jitter, per-attempt timeout ~10 s). Give up after N attempts (e.g. 8–15) spread across a long horizon, then move to a **dead-letter store** with a manual **replay** button.
53
+
54
+ | Attempt | Delay (base, +jitter) | Cumulative |
55
+ |---|---|---|
56
+ | 1 | immediate | 0 |
57
+ | 2 | ~30 s | ~30 s |
58
+ | 3 | ~2 m | ~3 m |
59
+ | 4 | ~10 m | ~13 m |
60
+ | 5 | ~1 h | ~1.2 h |
61
+ | 6 | ~3 h | ~4 h |
62
+ | 7–N | ~6 h, capped | up to ~1–3 days |
63
+
64
+ After exhaustion → DLQ row with last status/error; surface it in the dashboard with a one-click "replay" that re-enqueues the *same event* (same id + sequence) so consumer dedup still works. Auto-disable an endpoint that's failed for days and email the owner.
65
+
66
+ 5. **Send a STABLE unique event id and a per-endpoint SEQUENCE number — you WILL re-deliver and you do NOT guarantee order.** The `event_id` is generated once at event creation and is identical across every retry of that event (it's the consumer's dedup key → idempotency-keys). Add a monotonic `sequence` (per endpoint or per resource) **and** the `timestamp` so consumers can detect/repair reordering. Explicitly document: *delivery is at-least-once and unordered — retries and parallel sends mean `updated` can arrive before `created`; dedup on `event_id`, order on `sequence`/`timestamp`, never on arrival order.* Don't pretend ordering you can't deliver.
67
+
68
+ 6. **Send THIN events; let the consumer fetch the full resource.** Payload = `{ id, type, timestamp, sequence, data: { id, <a few key fields> } }` — enough to route and decide, not the whole object. Then the consumer GETs `/v1/orders/{id}` from your API for current truth. This avoids (a) **stale payloads** (the resource changed between enqueue and delivery), (b) **oversized bodies** that blow timeouts, and (c) **leaking** fields the subscriber shouldn't see / that bloat your audit logs. For events that are inherently terminal facts (`invoice.finalized` snapshot), a fuller payload is fine — but default thin.
69
+
70
+ 7. **Version the event schema and keep it stable.** Give every event a `type` (`order.created`) and an explicit schema version (`"api_version": "2026-06-01"` on the event, or `v1` in the signature header). **Add fields, never repurpose/remove** within a version; breaking changes = a new version that customers opt into. Publish a typed catalog of event types + JSON schema. Stripe-style dated versions or a coarse `v1/v2` both work — pick one and document it.
71
+
72
+ 8. **Verify the endpoint on registration before sending real traffic.** When a customer adds a URL, prove they control it and it works: send a **test/`ping` event** (or a challenge-response where they echo a token) and require a **2xx within timeout** before marking the endpoint active. This blocks typos, dead URLs, and registering *someone else's* endpoint to flood it. Re-verify on URL change. Keep the endpoint in `pending` until the test succeeds.
73
+
74
+ 9. **Lock the target down against SSRF — your dispatcher is a server-side request to a customer-controlled URL.** This is the highest-severity bug in any webhook sender. On registration AND on **every send** (DNS rebinding defeats register-time-only checks):
75
+
76
+ - **HTTPS only.** Reject `http://`, `file://`, `gopher://`, etc.
77
+ - **Resolve the hostname yourself, then block** loopback/private/link-local/metadata ranges: `127.0.0.0/8`, `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `169.254.0.0/16` (incl. cloud metadata `169.254.169.254`), `::1`, `fc00::/7`, `fe80::/10`, `0.0.0.0`. Block by *resolved IP*, not by string-matching the hostname.
78
+ - **Pin the connection to the IP you validated** (connect to the checked IP / re-resolve and re-check just before connect) so a TOCTOU re-resolution can't swap in an internal IP between check and connect.
79
+ - **Disable redirect following** (or re-validate every hop) — a 302 to `http://169.254.169.254` bypasses a register-time check.
80
+ - Egress from a **locked-down network** (no access to internal services / metadata endpoint) as defense in depth. Cap response body read and per-attempt timeout. See remediate-web-vulnerabilities for the SSRF class.
81
+
82
+ 10. **Isolate delivery per endpoint/customer so one bad target can't stall the rest.** Run delivery on message-queue-jobs, but partition: a **per-endpoint queue/lane** with **bounded concurrency**, so a customer whose endpoint hangs/5xx's only backs up *their* lane. Add a **per-endpoint rate cap** (→ rate-limiting) to respect receivers' limits, and a **circuit breaker** (→ resilience-timeouts-retries) that fast-fails (straight to the retry schedule) while an endpoint is consistently down instead of burning worker time on it. Never deliver all customers off one shared FIFO — head-of-line blocking will take down delivery for everyone the moment one endpoint goes slow.
83
+
84
+ 11. **Observability — per-attempt logs + a customer-facing delivery history and replay UI.** Log **every attempt** (not just final): event id, endpoint, attempt #, response status, latency, error, next-retry-at. Emit metrics per endpoint: **success rate, attempt count, p50/p95 latency, dead-letter count** (→ observability-instrument). Expose to the customer a **delivery history** showing each event, its attempts, response codes/bodies (truncated), and a **manual replay** button — this is what turns "your webhooks are broken" support tickets into self-service. Alert the owner when an endpoint's success rate craters or it gets auto-disabled.
85
+
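The SSRF checks in step 9 can be sketched as follows. This is a minimal illustration, not the package's implementation: `is_blocked_ip`, `resolve_all`, and `validate_target` are hypothetical names, the blocked ranges are copied from the list in step 9, and the injectable `resolver` parameter is an assumption added so the check can be exercised without real DNS.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Ranges taken from step 9: loopback, private, link-local/metadata, unspecified.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "0.0.0.0/32", "::1/128", "fc00::/7", "fe80::/10",
)]

def is_blocked_ip(addr: str) -> bool:
    """True if the resolved address falls inside a blocked range."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (::ffff:10.0.0.1) so it cannot dodge the v4 ranges.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return any(ip in net for net in BLOCKED_NETWORKS)

def resolve_all(host: str) -> list[str]:
    """Default resolver: every address the hostname currently resolves to."""
    infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def validate_target(url: str, resolver=resolve_all) -> str:
    """Run at registration AND before every send; returns the IP to pin to."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("webhook target must be an https:// URL")
    addrs = resolver(parts.hostname)
    if not addrs:
        raise ValueError("hostname did not resolve")
    # Reject if ANY resolved address is internal, not just the first one.
    if any(is_blocked_ip(a) for a in addrs):
        raise ValueError("target resolves to a blocked address")
    return addrs[0]
```

The caller would then connect to the returned IP (sending the original hostname for SNI and `Host`) with redirect following disabled, so the address that was checked is the address that is contacted.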
86
+ ## Common Errors
87
+
88
+ - **Re-serializing the body between signing and sending.** Sign a buffer, then a middleware/pretty-printer/key-reorder changes the bytes → every receiver's HMAC fails. Serialize ONCE, sign and send the *same* bytes.
89
+ - **One global signing secret for all endpoints.** A single leak compromises every customer and forces a flag-day rotation. Use one secret per endpoint.
90
+ - **Rotation with no overlap.** Swap the secret atomically → every in-flight and newly-signed delivery fails verification at the receiver. Send both old+new signatures during the overlap window, retire old after.
91
+ - **Timestamp not inside the signature (or unsynced clocks).** A loose `X-Timestamp` header an attacker can edit gives no replay protection; NTP-unsynced senders make legit deliveries fall outside receivers' tolerance. Sign `"{t}.{body}"`; keep clocks synced.
92
+ - **No per-attempt timeout (waiting indefinitely for a slow 2xx).** A hanging endpoint pins a worker forever. Bound each attempt (~10 s) and count a timeout as a retryable failure.
93
+ - **No dead-letter / no replay.** After N retries the event vanishes silently and the customer never knows. DLQ + a manual replay that re-sends the same event id.
94
+ - **Assuming/promising ordered delivery.** Retries + parallel lanes reorder events; consumers that apply in arrival order corrupt state. Send `sequence` + `timestamp`, document "unordered, at-least-once," and tell consumers to dedup + order on those fields.
95
+ - **Fat payloads with the full resource.** Stale by delivery time, oversized, and leak fields. Send thin + an `id`; let the consumer re-fetch.
96
+ - **SSRF: validating the URL string but not the resolved IP, or only at registration.** `http://internal.svc`, a hostname that resolves to `169.254.169.254`, or DNS-rebinding after registration all hit internal services / cloud metadata. Resolve + block private/link-local/metadata ranges on **every** send, pin to the validated IP, disable redirects.
97
+ - **Following redirects blindly.** A `302 → http://169.254.169.254/latest/meta-data/` turns a clean external URL into an SSRF. Don't follow, or re-validate each hop.
98
+ - **Shared global delivery queue.** One slow endpoint head-of-line-blocks everyone. Partition per endpoint with bounded concurrency + a breaker.
99
+ - **No endpoint verification on registration.** Typo'd/dead/hijacked URLs accepted; you send real events into the void or at a victim. Require a challenge / test event returning 2xx before activating.
100
+ - **Per-attempt logs missing.** Only logging final outcome makes "why did delivery 5 fail at 14:03" undebuggable. Log every attempt with status/latency/error and expose it to the customer.
101
+
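The retry schedule from step 4, together with the timeout and dead-letter errors listed above, can be sketched as a small pure function. The base delays and the 6 h cap come from the table in step 4. `MAX_ATTEMPTS = 10`, the 25% additive jitter, and the function names are assumptions made for illustration; the skill itself defers the jitter math to resilience-timeouts-retries and mentions full jitter, which this sketch does not implement.

```python
import random

# Base delay before attempt N (index N-1), in seconds, from the step 4 table.
BASE_DELAYS_S = [0, 30, 120, 600, 3600, 10800]
CAP_S = 21600            # ~6 h cap for attempt 7 and later
MAX_ATTEMPTS = 10        # assumption: the text suggests 8-15
JITTER_FRACTION = 0.25   # assumption: additive jitter of up to 25% of the base

def is_delivered(status):
    """Any 2xx counts as delivered; None models a timeout or connection error."""
    return status is not None and 200 <= status < 300

def delay_before_attempt(attempt, rng=random):
    """Seconds to wait before the given 1-based attempt.

    Returns None once the attempts are exhausted, which is the signal to
    move the event to the dead-letter store.
    """
    if attempt > MAX_ATTEMPTS:
        return None
    if attempt <= len(BASE_DELAYS_S):
        base = BASE_DELAYS_S[attempt - 1]
    else:
        base = CAP_S
    # Spread retries out so many failed deliveries do not all fire at once.
    return base + rng.uniform(0, JITTER_FRACTION * base)
```

A worker would call `is_delivered` on each response (or `None` on timeout) and, on failure, re-enqueue the same event id with `delay_before_attempt(attempt + 1)`.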
102
+ ## Verify
103
+
104
+ 1. **Signature round-trips:** an independent verifier (the ingest-webhook-secure recipe) recomputes `HMAC("{t}.{body}")` over the received raw bytes and matches `v1` exactly; flipping one body byte fails verification.
105
+ 2. **Per-endpoint secrets:** two endpoints get different signatures for the same event; a secret leaked from one does not verify the other.
106
+ 3. **Rotation overlap:** during rotation the request carries two `v1` signatures; a receiver holding the old secret AND one holding the new secret both verify; after overlap, only the new verifies.
107
+ 4. **Replay window honored:** delivered `t` is inside ±tolerance of real time (clock-sync check); a receiver rejecting `t` older than tolerance still accepts your live deliveries.
108
+ 5. **Retry schedule + DLQ:** point an event at an endpoint that returns 500 → it retries with growing, jittered delays, stops after N attempts, lands in the DLQ; the replay button re-sends the **same** event id and a now-healthy endpoint accepts it once.
109
+ 6. **Stable id + sequence:** every retry of one event carries the identical `event_id`; `sequence` is monotonic per endpoint; deliver two events out of order and confirm the consumer can reorder via `sequence`/`timestamp`.
110
+ 7. **Thin payload:** body contains id/type/timestamp/sequence + minimal data only; the documented re-fetch returns current truth even after the resource changed post-enqueue.
111
+ 8. **Endpoint isolation:** make endpoint A hang/timeout; endpoint B keeps receiving on time (assert B's delivery latency is unaffected) — proves no head-of-line blocking.
112
+ 9. **SSRF blocked:** registering/sending to `http://x` (non-HTTPS), `https://127.0.0.1`, `https://169.254.169.254`, a `10.x`/`::1` address, or a hostname that resolves into a private range is rejected — at registration AND on send (test DNS-rebinding: hostname resolves public at register, private at send → still blocked); a 302 to a metadata IP is not followed.
113
+ 10. **Registration verification:** a URL that never returns 2xx to the test event stays `pending`/inactive and receives no real events.
114
+ 11. **Observability:** every attempt produces a log row (event id, endpoint, attempt #, status, latency); the customer dashboard lists deliveries + attempts and the replay button works.
115
+
116
+ Done = each event is HMAC-signed per-endpoint over the exact bytes with a signed timestamp and rotation overlap, delivery is at-least-once with jittered backoff → DLQ → manual replay, events are thin and carry a stable id + sequence so consumers dedup and reorder, the target URL is HTTPS-only and SSRF-blocked (private/link-local/metadata ranges, re-checked on every send), endpoints are verified before activation, delivery is isolated per endpoint, and every attempt is logged and visible to the customer with a working replay — all proven by checks 1–11.
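Checks 1 to 4 can be exercised with a small round-trip harness. The sketch below assumes the `t=...,v1=...` header layout shown in step 1 and the ±300 s tolerance recommended in step 3; the `sign` and `verify` helpers are hypothetical names, and the receiver side is only an approximation of what the ingest-webhook-secure recipe would do.

```python
import hashlib
import hmac
import time

def sign(secrets, raw, t):
    """Sender side: one v1 signature per active secret (two during rotation)."""
    preimage = f"{t}.".encode() + raw
    sigs = [hmac.new(s, preimage, hashlib.sha256).hexdigest() for s in secrets]
    return f"t={t}," + ",".join(f"v1={sig}" for sig in sigs)

def verify(header, raw, secret, now=None, tolerance=300):
    """Receiver side: accept if the timestamp is fresh and ANY v1 matches."""
    now = time.time() if now is None else now
    t, candidates = None, []
    for part in header.split(","):
        key, _, value = part.partition("=")
        if key == "t":
            t = value
        elif key == "v1":
            candidates.append(value)
    if t is None or not t.isdigit():
        return False
    if abs(now - int(t)) > tolerance:
        return False  # outside the replay window
    expected = hmac.new(secret, f"{t}.".encode() + raw, hashlib.sha256).hexdigest()
    # Constant-time comparison against every signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

During the overlap window the sender passes both secrets to `sign`, so a receiver holding either the old or the new secret verifies the same request.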
@@ -0,0 +1,144 @@
1
+ ---
2
+ name: design-api-pagination
3
+ description: Designs paginated list endpoints that stay correct and fast under concurrent writes — cursor/keyset pagination over a stable total ordering with a unique tie-break key (e.g. ORDER BY created_at DESC, id DESC and WHERE (created_at,id) < (?,?)), opaque base64url-encoded cursors that bind sort+filter so they can't be tampered or reused across queries, a sane page_size default (20-50) and hard cap (100), and the {data, next_cursor, has_more} envelope (fetch limit+1 to compute has_more without a COUNT) — instead of OFFSET/LIMIT, which gets O(n) slow on deep pages and skips/duplicates rows when items are inserted or deleted mid-scan; covers REST and GraphQL Relay connections (edges/node/cursor + pageInfo.hasNextPage/endCursor), forward+backward paging, and why total counts are expensive and usually optional.
4
+ when_to_use: Building or fixing a list/feed/search endpoint that returns many rows and needs paging, an infinite-scroll or "load more" API, a stable cursor under live inserts/deletes, or migrating a slow OFFSET endpoint to keyset; or implementing a GraphQL Relay connection. Distinct from api-design-review (reviews the whole API surface/REST conventions; this owns the pagination mechanics specifically) and optimize-sql-query (builds the covering composite index that makes the keyset WHERE/ORDER BY fast; this decides the cursor/ordering contract that index must serve).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when an endpoint returns a list that's too big for one response and must page through it correctly:
10
+
11
+ - "Add pagination to this list/feed/search endpoint" / "support infinite scroll / load-more"
12
+ - "Our `?page=500` query takes 8 seconds — deep OFFSET is killing us"
13
+ - "Users see duplicate or missing rows while scrolling a live feed" (rows inserted/deleted mid-scan)
14
+ - "Design the cursor — should it be opaque? what goes in it?"
15
+ - "Implement a GraphQL Relay connection (edges/pageInfo/cursors)"
16
+ - "We need stable ordering with a tie-break so pages don't shuffle"
17
+ - "Do we have to return a total count?" (usually no — it's the expensive part)
18
+
19
+ NOT this skill:
20
+ - Reviewing the whole REST/HTTP API surface — resource naming, status codes, versioning, error shape → api-design-review (this skill is only the pagination contract within it)
21
+ - Defining the serialized field types / GraphQL schema contract in general → rest-graphql-contract (this skill specifies the connection/cursor shape it slots into)
22
+ - Building the composite/covering index that makes the keyset `WHERE (a,b) < (?,?)` fast, EXPLAIN-tuning the scan → optimize-sql-query (this skill defines the ordering the index must support)
23
+ - Caching list responses / CDN / ETag for pages → caching-strategy
24
+ - Rate-limiting how many pages a client can pull → rate-limiting
25
+ - Throttling/queuing expensive list jobs → message-queue-jobs
26
+ - Designing the underlying table/keys → design-relational-schema (this skill consumes the unique key it needs as a tie-break)
27
+
28
+ ## Steps
29
+
30
+ 1. **Default to keyset (cursor) pagination; reach for OFFSET only for small, static, jump-to-page-N admin tables.** The two models:
31
+
32
+ | | Offset/limit | Keyset/cursor |
33
+ |---|---|---|
34
+ | Query | `ORDER BY ... LIMIT 20 OFFSET 980` | `WHERE (sort_key,id) < (?,?) ORDER BY ... LIMIT 20` |
35
+ | Deep-page cost | **O(offset)** — DB scans + discards all skipped rows | **O(1) in page depth** w/ index (one B-tree seek, ~O(log n) in table size) — seeks straight to the cursor |
36
+ | Concurrent insert/delete | **skips or duplicates** rows (offset shifts under you) | stable — anchored to a value, not a position |
37
+ | Jump to page N | yes | no (sequential only) |
38
+ | Total pages | derivable (needs COUNT) | not directly |
39
+
40
+ Offset is fine for a 200-row config table behind admin; for any feed, search, timeline, or table that grows or is written concurrently, **keyset is the default**.
41
+
42
+ 2. **Pick a stable total ordering with a unique tie-break — this is the whole game.** The `ORDER BY` columns must be (a) the user-visible sort and (b) **made total** by appending a unique, immutable column (the PK) so no two rows compare equal. A non-unique sort (`ORDER BY created_at` alone) lets rows with the same timestamp straddle a page boundary → duplicates or skips.
43
+
44
+ ```sql
45
+ -- newest-first feed, made total by id tie-break
46
+ ORDER BY created_at DESC, id DESC
47
+ ```
48
+ The cursor encodes the **full** sort tuple of the last row returned: `(created_at, id)`. Use row-value comparison so it's one index seek:
49
+ ```sql
50
+ WHERE (created_at, id) < (:last_created_at, :last_id) -- DESC page
51
+ ORDER BY created_at DESC, id DESC
52
+ LIMIT :page_size + 1;
53
+ ```
54
+ For ASC use `>`. For **mixed** directions (`created_at DESC, name ASC`) row-value syntax doesn't apply — expand the boolean predicate explicitly:
55
+ ```sql
56
+ WHERE created_at < :c
57
+ OR (created_at = :c AND name > :n)
58
+ OR (created_at = :c AND name = :n AND id > :id)
59
+ ```
60
+ Tie-break key must be **unique and never-updated** (PK, not a mutable slug). If the sort column itself is mutable (e.g. `updated_at`), a row can move pages — acceptable for "recently updated" feeds, surprising for "newest"; document it.
61
+
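The keyset query from step 2 can be exercised end to end with a small sketch. This is a minimal illustration, not a reference implementation: it assumes an in-memory SQLite database (row-value comparison needs SQLite 3.15 or newer), and the `posts` table, `fetch_page`, and `demo` names are hypothetical.

```python
import sqlite3


def fetch_page(conn, page_size, after=None):
    """Return up to page_size rows, newest first.

    `after` is the (created_at, id) tuple of the last row already returned,
    or None for the first page.
    """
    if after is None:
        sql = ("SELECT created_at, id FROM posts "
               "ORDER BY created_at DESC, id DESC LIMIT ?")
        params = (page_size,)
    else:
        # Row-value comparison: one predicate over the full sort tuple.
        sql = ("SELECT created_at, id FROM posts "
               "WHERE (created_at, id) < (?, ?) "
               "ORDER BY created_at DESC, id DESC LIMIT ?")
        params = (after[0], after[1], page_size)
    return conn.execute(sql, params).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at INTEGER NOT NULL)"
    )
    # Ten rows sharing only two timestamps, so ties straddle page boundaries.
    conn.executemany(
        "INSERT INTO posts (id, created_at) VALUES (?, ?)",
        [(i, 1000 + i % 2) for i in range(1, 11)],
    )
    seen, after = [], None
    while True:
        rows = fetch_page(conn, 3, after)
        if not rows:
            return seen
        seen.extend(rows)
        after = rows[-1]  # full sort tuple of the last row returned
        if len(seen) == 3:
            # Simulate a concurrent insert of a newer row mid-scan.
            conn.execute("INSERT INTO posts (id, created_at) VALUES (11, 2000)")
```

Because the cursor is anchored to the `(created_at, id)` value rather than a position, the row inserted mid-scan does not shift any later page, and the duplicate timestamps do not cause repeats.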
62
+ 3. **Make cursors opaque and self-describing — base64url-encode a small payload, never expose raw offsets/ids.** A cursor is a token the client echoes back verbatim; it is NOT a stable id and clients must not parse it. Encode the sort tuple plus enough to detect misuse:
63
+
64
+ ```json
65
+ { "k": [1718409600000, 84213], "d": "desc", "f": "a1b2c3" }
66
+ // k = sort-key tuple of last row, d = direction, f = hash of filter+sort params
67
+ ```
68
+ `base64url(JSON)` → `eyJrIjpb...`. Rules:
69
+ - **Opaque:** document it as "treat as opaque; do not construct or parse." Lets you change the internal format later without breaking clients.
70
+ - **Bind the query shape:** include `f` = a hash (or the canonical filter/sort) the cursor was created under. On the next request, **reject (400) if the client changed `filter`/`sort` but reused the cursor** — a cursor is only valid against the exact query that produced it.
71
+ - **Don't trust it for authz:** re-apply tenant/visibility filters on every page; never assume the cursor proves access. Tamper-resistance optional — sign (HMAC) only if a forged cursor could leak data past a filter; usually re-filtering server-side is enough.
72
+ - **Cross-page consistency:** a cursor's sort tuple is independent of position, so inserts/deletes between pages don't shift it — the core win over offset.
73
+
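A sketch of the cursor rules in step 3, using only the Python standard library. The function names, the `InvalidCursor` exception, and the 12-character SHA-256 fingerprint are illustrative assumptions; the payload keys `k`, `d`, `f` follow the example above.

```python
import base64
import hashlib
import json


class InvalidCursor(ValueError):
    """Raised for malformed or mismatched cursors; map it to HTTP 400."""


def query_fingerprint(filters: dict, sort: str) -> str:
    # Canonical JSON so the same filter/sort always hashes to the same value.
    canonical = json.dumps(
        {"filters": filters, "sort": sort}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def encode_cursor(sort_key: list, direction: str, filters: dict, sort: str) -> str:
    payload = {"k": sort_key, "d": direction, "f": query_fingerprint(filters, sort)}
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    # base64url without padding keeps the token URL-safe.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_cursor(token: str, filters: dict, sort: str) -> dict:
    try:
        padded = token + "=" * (-len(token) % 4)
        payload = json.loads(base64.urlsafe_b64decode(padded.encode("ascii")))
    except ValueError as exc:  # bad base64, bad UTF-8, or bad JSON
        raise InvalidCursor("malformed cursor") from exc
    if not isinstance(payload, dict):
        raise InvalidCursor("malformed cursor")
    if payload.get("f") != query_fingerprint(filters, sort):
        raise InvalidCursor("cursor was issued for a different filter/sort")
    return payload
```

The fingerprint only detects accidental reuse across queries. It is not a signature, so tenant and visibility filters still have to be re-applied on every page.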
74
+ 4. **Set a page-size default and a hard cap; fetch `limit + 1` to compute `has_more` cheaply.** Never let the client ask for unbounded rows (DoS / memory blowup).
75
+
76
+ | Param | Value |
77
+ |---|---|
78
+ | default `page_size` | 20–50 |
79
+ | hard max | 100 (clamp, don't 400 — `min(requested, 100)`) |
80
+ | `limit + 1` trick | query `page_size + 1`; if you got the extra row, `has_more = true`, drop it from `data`, its predecessor's key is `next_cursor` |
81
+
82
+ This avoids a separate `COUNT(*)` just to know if there's a next page. Reject `page_size <= 0`.
83
+
84
+ 5. **Return the `{data, next_cursor, has_more}` envelope; make `next_cursor` null at the end.** Stable, minimal contract:
85
+ ```json
86
+ {
87
+ "data": [ /* page_size items */ ],
88
+ "next_cursor": "eyJrIjpbMTcxODQw...", // null when has_more=false
89
+ "has_more": true
90
+ }
91
+ ```
92
+ - `next_cursor` is the cursor of the **last returned row** (after dropping the `+1` probe). Client passes it back as `?cursor=...`.
93
+ - When `has_more=false`, `next_cursor=null` and clients stop — don't make them request an empty page to discover the end.
94
+ - **Omit total by default.** A `total_count` forces a full `COUNT(*)` (often as slow as the data scan) and is meaningless under concurrent writes. Offer it only as an opt-in (`?include_total=true`), cache it, or return an estimate (`reltuples` / approximate count).
95
+
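Steps 4 and 5 combine into a few lines of glue code. The sketch below assumes the caller has already run the query with `LIMIT page_size + 1`; `clamp_page_size`, `build_page`, and the `make_cursor` callback are hypothetical names, and the default of 20 is one value from the suggested 20–50 range.

```python
from typing import Any, Callable, Optional, Sequence

DEFAULT_PAGE_SIZE = 20
MAX_PAGE_SIZE = 100


def clamp_page_size(requested: Optional[int]) -> int:
    """Apply the default, reject non-positive sizes, clamp to the hard cap."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    if requested <= 0:
        raise ValueError("page_size must be positive")
    return min(requested, MAX_PAGE_SIZE)


def build_page(
    rows: Sequence[Any], page_size: int, make_cursor: Callable[[Any], str]
) -> dict:
    """Build the envelope from rows fetched with LIMIT page_size + 1."""
    has_more = len(rows) > page_size
    data = list(rows[:page_size])  # drop the probe row, if present
    # The cursor comes from the last *kept* row, never from the probe row.
    next_cursor = make_cursor(data[-1]) if has_more else None
    return {"data": data, "next_cursor": next_cursor, "has_more": has_more}
```

For example, `build_page([1, 2, 3], 2, str)` keeps `[1, 2]`, reports `has_more` as true, and derives the cursor from `2`.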
96
+ 6. **Support backward paging when the UI needs "previous" — flip comparison + order, then re-reverse.** For bidirectional cursors carry the direction in the token. To page backward from a cursor: flip `<`→`>` (and the `ORDER BY` direction), `LIMIT n+1`, then **reverse the returned slice in memory** so the client receives rows in the same display order as a forward page. Track both `has_next` and `has_prev`. Relay's `pageInfo` (next step) formalizes this with `hasNextPage`/`hasPreviousPage` + `startCursor`/`endCursor`.
97
+
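The flip-then-reverse mechanics of backward paging can be shown on an in-memory list standing in for the table. This is only a sketch under that assumption: `page_backward` is a hypothetical helper, rows are bare sort-key tuples, and the `sorted(...)` call plays the role of the flipped `ORDER BY ... ASC LIMIT n+1` query.

```python
def page_backward(rows_desc, before_key, page_size):
    """Return the page that precedes `before_key` in a newest-first listing.

    rows_desc: all sort-key tuples, in display (descending) order.
    """
    # Flipped query: key > cursor, ascending order, LIMIT page_size + 1.
    flipped = sorted(r for r in rows_desc if r > before_key)[: page_size + 1]
    has_prev = len(flipped) > page_size
    kept = flipped[:page_size]  # drop the probe row
    kept.reverse()              # restore display (descending) order
    return kept, has_prev
```

Without the final `reverse()`, the client would receive the previous page upside down relative to every forward page.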
98
+ 7. **For GraphQL, follow the Relay Cursor Connections spec exactly — don't invent a connection shape.** REST and GraphQL share the same keyset engine; only the envelope differs. Relay structure:
99
+ ```graphql
100
+ type PostConnection {
101
+ edges: [PostEdge!]!
102
+ pageInfo: PageInfo!
103
+ }
104
+ type PostEdge { node: Post! cursor: String! } # per-row opaque cursor
105
+ type PageInfo {
106
+ hasNextPage: Boolean! hasPreviousPage: Boolean!
107
+ startCursor: String endCursor: String
108
+ }
109
+ # query args: first/after (forward), last/before (backward)
110
+ ```
111
+ - `first: 20, after: "<cursor>"` is forward; `last: 20, before: "<cursor>"` is backward. Each `edge.cursor` is the opaque keyset token for that node.
112
+ - `pageInfo.endCursor` ↔ REST `next_cursor`; `hasNextPage` ↔ `has_more`. `totalCount` is a **separate, optional** field — same COUNT cost caveat (step 5).
113
+ - Don't mix `first` with `last`, or `after` with `before`, in one request — reject it.
114
+
115
+ 8. **Hand the ordering to `optimize-sql-query` and make sure a composite index covers it.** Keyset is only O(1) if a B-tree index matches the `ORDER BY` columns **in order and direction**: `CREATE INDEX ON posts (created_at DESC, id DESC)` (plus any equality filter columns as a **leading prefix**: `(tenant_id, created_at DESC, id DESC)` for a per-tenant feed). Without it the DB sorts the whole table per page and you've gained nothing. Verify with `EXPLAIN` that it's an index range scan with no `Sort` node and `Rows Removed by Filter ≈ 0`.
116
+
117
+ ## Common Errors
118
+
119
+ - **OFFSET for deep pages on a growing table.** `OFFSET 100000` scans and throws away 100k rows; latency grows linearly with page depth. Fix: keyset/cursor (step 1).
120
+ - **Non-unique `ORDER BY` (no tie-break).** `ORDER BY created_at` with duplicate timestamps → rows straddle page boundaries, appear twice or vanish. Fix: append the unique PK to make ordering total (step 2).
121
+ - **Exposing a raw offset/id/timestamp as the "cursor."** Clients build their own, you can't change the format, and they construct invalid ones. Fix: opaque base64url token, documented as un-parseable (step 3).
122
+ - **Cursor not bound to the query.** Client keeps the cursor but switches `sort` or `filter` → garbage page or skipped rows. Fix: encode a filter/sort hash in the cursor and 400 on mismatch (step 3).
123
+ - **No page-size cap.** `?page_size=1000000` OOMs the server. Fix: default 20–50, clamp to max 100 (step 4).
124
+ - **Separate `COUNT(*)` on every page for `has_more`.** Doubles DB load. Fix: fetch `limit + 1` and check for the extra row (step 4).
125
+ - **Mandatory `total_count`.** Forces a full count scan, and it's wrong under concurrent writes anyway. Fix: omit by default; opt-in / cached / estimated (step 5).
126
+ - **Sorting on a mutable column without telling anyone.** Ordering by `updated_at` lets a row jump pages mid-scroll → silent dup/skip. Fix: prefer immutable sort (created_at/id); if mutable, document the behavior.
127
+ - **Missing/mismatched index.** Keyset query without a composite index matching column order+direction → full sort per page, no speedup. Fix: index the exact `(filter…, sort…, id)` tuple, verify no `Sort` in EXPLAIN (step 8).
128
+ - **Row-value comparison with mixed sort directions.** `(a,b) < (?,?)` is wrong when `a` and `b` sort opposite ways. Fix: expand to the explicit OR-chain predicate (step 2).
129
+ - **GraphQL inventing its own `{items, nextPage}` instead of Relay connections.** Breaks Relay/Apollo client cache + tooling assumptions. Fix: follow edges/node/cursor + pageInfo (step 7).
130
+ - **Off-by-one at the boundary.** Forgetting to drop the `+1` probe row leaks it into `data` and as the cursor. Fix: slice to `page_size`, derive `next_cursor` from the last *kept* row.
131
+
132
+ ## Verify
133
+
134
+ 1. **Deep page is fast:** request the millionth row's page; latency ≈ first page (constant), not linear. `EXPLAIN ANALYZE` shows an index range scan, no `Sort` node, near-zero rows filtered.
135
+ 2. **Stable under inserts:** start paging, insert/delete rows ahead of and behind the cursor mid-scan; assert no row appears twice and no row that existed when the scan began (and was not deleted) is skipped, which is the offset failure mode.
136
+ 3. **Tie-break holds:** seed many rows with identical sort values (same `created_at`); page through and assert every row appears exactly once across page boundaries.
137
+ 4. **Cursor is opaque + bound:** decode shows no client-meaningful offset; reusing a cursor with a changed `filter`/`sort` returns 400, not a corrupt page.
138
+ 5. **Page size enforced:** `page_size=1000000` returns ≤100; `page_size=0`/negative is rejected.
139
+ 6. **End-of-list is clean:** the last page returns `has_more=false` and `next_cursor=null`; clients never need an extra empty request to detect the end.
140
+ 7. **`has_more` without COUNT:** confirm the query plan fetches `limit+1` and runs no `COUNT(*)` unless `include_total` is explicitly set.
141
+ 8. **Bidirectional round-trip:** page forward N then backward N lands on the original rows in the original display order (slice was re-reversed correctly).
142
+ 9. **Relay conformance (GraphQL):** `first/after` and `last/before` work; `pageInfo.hasNextPage`/`endCursor` are correct; mixing `first` with `last` (or `after` with `before`) is rejected, as in step 7; `totalCount` is opt-in.
143
+
144
+ Done = list endpoints use keyset cursors over a unique-tie-break total ordering, cursors are opaque base64url tokens bound to their query, page size is defaulted and hard-capped, `has_more` comes from `limit+1` (no mandatory COUNT), the `{data,next_cursor,has_more}` (or Relay connection) envelope is stable, a composite index backs the ordering, and the consistency/perf tests in checks 1–9 pass under concurrent writes.
@@ -0,0 +1,119 @@
1
+ ---
2
+ name: design-authorization-model
3
+ description: Designs an authorization model — RBAC/ABAC/ReBAC, multi-tenant isolation, resource ownership, and policy-as-code (OPA/Cedar/Oso) — keeping authZ decisions separate from authN identity in a centralized, testable policy layer enforced down to the data tier.
4
+ when_to_use: A system needs roles/permissions, multi-tenant data isolation, or per-resource access rules beyond a logged-in check. Distinct from auth-jwt-session (who you are — tokens/sessions), security-review (audit), and rate-limiting (request volume).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the question is **"is this caller allowed to do this to this resource?"** — not "who is this caller?":
10
+
11
+ - "Add roles and permissions" / "only admins can delete, members can edit, viewers read"
12
+ - "Tenants must not see each other's data" / "isolate orgs / workspaces"
13
+ - "Owner can share a doc with specific users" (Google-Drive-style) → relationship graph
14
+ - "Permissions depend on attributes" — department, resource status, time, region
15
+ - "Stop scattering `if user.role == 'admin'` across 40 handlers — centralize it"
16
+ - An IDOR/cross-tenant leak found in review (a user fetched another org's record by id)
17
+
18
+ NOT this skill:
19
+ - Issuing/verifying tokens, sessions, refresh rotation, OAuth/OIDC login → **auth-jwt-session** (authN establishes identity; this skill consumes that identity to make the access decision)
20
+ - Auditing existing code for access-control holes by severity → **security-review**
21
+ - Capping request *rate/volume* per caller → **rate-limiting**
22
+ - Recording *who did what when* for compliance/forensics → **build-audit-logging**
23
+ - GDPR data-subject rights, lawful basis, PII mapping → **map-privacy-data-gdpr**
24
+ - Fixing injection/XSS/SSRF in web code → **remediate-web-vulnerabilities**
25
+
26
+ ## Steps
27
+
28
+ 1. **Pick the model by the shape of the access rule — do not default to RBAC for everything.**
29
+
30
+ | Model | Decide by | Use when | Engine fit |
31
+ |---|---|---|---|
32
+ | **RBAC** | role → permission set | Fixed, coarse tiers (admin/editor/viewer); permissions don't depend on the specific row | DB tables, Casbin, Cedar |
33
+ | **ABAC** | attributes of subject+resource+context | Rules vary by field/status/time/region (`owner.dept == doc.dept AND time < embargo`) | **OPA/Rego**, Cedar |
34
+ | **ReBAC** | relationship/ownership graph | Per-resource sharing, nesting (`folder→doc`), "users this owner invited" — Drive/GitHub-style | **OpenFGA / SpiceDB** (Zanzibar), Oso |
35
+
36
+ Default: start **RBAC for app-wide roles**, add **ReBAC** the moment you need per-resource sharing or hierarchy, add **ABAC** conditions for field/context rules. They compose — RBAC roles can be relations in a ReBAC graph. Don't roll a bespoke nested-`if` engine; pick one of the named tools.
37
+
38
+ 2. **Separate authN from authZ — the decision is its own layer.** AuthN hands you a verified principal (`{user_id, tenant_id, roles}` from the validated token — see **auth-jwt-session**). AuthZ takes `(principal, action, resource)` → `allow|deny`. Never re-derive identity inside the policy, and never let the policy trust unverified claims.
39
+
40
+ 3. **Centralize the decision behind one `authorize()` call — never inline `if role ==`.** Every protected operation calls the same checkpoint; scattered checks drift and leak.
41
+
42
+ ```python
43
+ # ONE entry point. Engine (OPA/Cedar/Oso/OpenFGA) behind it.
44
+ def authorize(principal, action, resource):
45
+     decision = engine.check(
46
+         subject=principal.user_id,
47
+         tenant=principal.tenant_id,  # from token, NEVER from the request body
48
+         action=action,               # "document:delete"
49
+         resource=resource,           # {id, type, tenant_id, owner_id, status}
50
+     )
51
+     if not decision.allow:           # deny by default — no rule matched = deny
52
+         raise Forbidden(action, resource.id)
53
+     return decision
54
+ ```
55
+
56
+ 4. **Enforce multi-tenant isolation on every query — and derive `tenant_id` from the token, never the client.** A client-supplied tenant/org id is an attacker-controlled cross-tenant key. Scope every read/write by the token's tenant; treat a missing tenant scope as a bug, not a default-all.
57
+
58
+ ```sql
59
+ -- Defense in depth: Postgres Row-Level Security so a forgotten WHERE can't leak.
60
+ ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
61
+ ALTER TABLE documents FORCE ROW LEVEL SECURITY; -- applies to table owner too
62
+ CREATE POLICY tenant_isolation ON documents
63
+ USING (tenant_id = current_setting('app.tenant_id')::uuid);
64
+ ```
65
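+ Set `app.tenant_id` per request from the verified token inside the request transaction (`SET LOCAL app.tenant_id = ...`, or `SELECT set_config('app.tenant_id', $1, true)` when the value has to be a bind parameter, since `SET` does not accept them). Connect as a role that is neither superuser nor `BYPASSRLS`, because those roles skip RLS even with `FORCE`. App-layer `WHERE tenant_id = $1` is the primary guard; RLS is the backstop for the day someone forgets it.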
66
+
67
+ 5. **Deny by default, least privilege, deny wins.** No matching allow rule ⇒ deny. Start every role at zero permissions and add. When allow and deny rules overlap, **explicit deny beats allow**. Write this into the policy, don't rely on convention.
68
+
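The three rules in step 5 can be sketched as plain Python to make the evaluation order concrete. This illustrates the logic only and is not a substitute for a policy engine; the `GRANTS` table, the `Principal`/`Resource` dataclasses, and `is_allowed` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical role -> allowed actions table; every role starts from nothing.
GRANTS = {
    "admin": {"document:read", "document:edit", "document:delete"},
    "editor": {"document:read", "document:edit"},
    "viewer": {"document:read"},
}


@dataclass(frozen=True)
class Principal:
    user_id: str
    tenant_id: str
    roles: tuple = ()


@dataclass(frozen=True)
class Resource:
    id: str
    tenant_id: str
    owner_id: str
    status: str = "active"


def is_allowed(principal: Principal, action: str, resource: Resource) -> bool:
    # Same-tenant gate comes first: cross-tenant access is never allowed.
    if resource.tenant_id != principal.tenant_id:
        return False
    # Explicit deny wins over any allow rule below.
    if resource.status == "locked":
        return False
    # Allow rule 1: the owner may act on their own resource.
    if resource.owner_id == principal.user_id:
        return True
    # Allow rule 2: some role grants the action. No match falls through to deny.
    return any(action in GRANTS.get(role, ()) for role in principal.roles)
```

Checking deny conditions before allow rules is one simple way to guarantee that a permissive role can never override a lock.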
69
+ 6. **Make policy versioned, code-reviewed, and unit-tested — policy-as-code.** Keep `.rego` / Cedar / `policy.polar` in the repo, PR-reviewed like app code. Example OPA/Rego with the three non-negotiables baked in:
70
+
71
+ ```rego
72
+ package authz
73
+ default allow := false # deny by default
74
+ default deny := false # bare `deny` is always defined
75
+
76
+ allow if { # owner can do anything to own resource
77
+ input.resource.tenant_id == input.principal.tenant_id # same-tenant gate, always
78
+ input.resource.owner_id == input.principal.user_id
79
+ }
80
+ allow if { # role grants the action
81
+ input.resource.tenant_id == input.principal.tenant_id
82
+ some role in input.principal.roles
83
+ grants[role][_] == input.action # e.g. grants.editor[] = "document:edit"
84
+ }
85
+ deny if input.resource.status == "locked" # explicit deny condition
86
+ final_allow if { allow; not deny } # deny wins over any allow; an undefined result must be read as deny
87
+ ```
88
+ Wire the API to read `final_allow`, not `allow`. Run `opa test policy/ -v` in CI. The same input schema feeds both the running engine and the tests.
89
+
90
+ 7. **Pass the decision an explicit resource snapshot, fetched tenant-scoped first.** Load the resource (already filtered by tenant in the query) before checking, so the policy sees real `owner_id`/`status`/`tenant_id`. Checking by id alone, then fetching unscoped, reintroduces the IDOR.
91
+
92
+ 8. **Verify with an allow/deny matrix per role × action — including explicit cross-tenant denial** (see Verify) before shipping.
93
+
94
+ ## Common Errors
95
+
96
+ - **Trusting a client-supplied `tenant_id`/`org_id`** from body, query, or header. It's the cross-tenant skeleton key. Derive tenant solely from the verified token; ignore any tenant field in the request.
97
+ - **IDOR — checking the role but not the ownership/tenant of *this* row.** `can_edit_documents` is true, but the doc belongs to another tenant. Always bind the check to the specific resource's `tenant_id`/`owner_id`, fetched tenant-scoped.
98
+ - **Inline `if user.role == 'admin'` scattered across handlers.** They drift, one gets missed, and a new action ships unguarded. Route every check through the single `authorize()` checkpoint.
99
+ - **Role explosion (`editor_us_finance_readonly`).** Combinatorial roles that should be attributes. Move per-field/context rules to ABAC conditions; keep roles coarse.
100
+ - **Allow-by-default / "fail open."** A request that matches no rule slips through, or an engine error returns allow. Set `default allow := false` and treat engine errors/timeouts as deny.
101
+ - **Reading `allow` instead of the deny-wins result.** Exposing `allow` to the API skips the explicit-deny rule. Have the engine return `final_allow` (`allow and not deny`) so a locked/blocked resource can't be reached through a permissive role.
102
+ - **AuthZ in the frontend only.** Hiding a button is UX, not security — the API is the enforcement boundary. Every server endpoint authorizes independently.
103
+ - **Roles baked into the JWT and never refreshed.** Revoking a role doesn't take effect until the token expires. Check permissions against current state (or keep token TTL short and re-resolve roles server-side).
104
+ - **No DB-tier backstop.** One forgotten `WHERE tenant_id` leaks every tenant. Enable Postgres RLS with `FORCE` so the data tier denies even when the app forgets.
105
+ - **Confused-deputy / unscoped service calls.** A worker or internal service queries with god privileges on behalf of a user without carrying the user's tenant/permission scope. Propagate the principal; don't let internal callers bypass `authorize()`.
106
+ - **Policy with no tests.** Untested Rego/Cedar rots silently. Ship the allow/deny matrix as `opa test` cases alongside the policy.
107
+
108
+ ## Verify
109
+
110
+ 1. **Allow/deny matrix — every role × action.** For each role (admin/editor/viewer/none) × each action (create/read/update/delete/share), assert the decision matches the intended table. Every cell is a test case, run in CI (`opa test policy/ -v` or the engine's harness).
111
+ 2. **Cross-tenant denial (the critical one).** User in tenant A requests a resource in tenant B by its real id → **403/deny**, for *every* action, including read. Do this both through the API and by selecting that row directly in the DB with `app.tenant_id` set to A — RLS must return zero rows.
112
+ 3. **IDOR probe.** As a non-owner same-tenant user, attempt update/delete on a resource you don't own and your role doesn't permit → deny. Then as owner → allow. Confirms the check binds to the resource, not just the role.
113
+ 4. **Deny by default.** Invent a brand-new action string with no policy rule → deny (not allow). Proves nothing slips through unmatched.
114
+ 5. **Deny wins.** A resource in `status = "locked"` (or a user under an explicit deny) → deny even when a role would otherwise allow. Assert against `final_allow`, the value the API consumes.
115
+ 6. **RLS backstop.** Run a `SELECT` that *omits* the app-layer tenant filter against a session with `app.tenant_id` set → still returns only that tenant's rows. Proves the data tier holds when the app forgets.
116
+ 7. **Centralization.** `grep -rnE 'role *== *|isAdmin|\.role\b' src/` finds zero authorization branches outside the policy layer — every decision goes through `authorize()`.
117
+ 8. **Privilege escalation negative test.** A user cannot grant themselves a role/permission or modify a policy they shouldn't (the "edit roles" action is itself authorized).
118
+
119
+ Done = the role × action matrix passes in CI, cross-tenant and IDOR probes are denied at both the API and DB tier (RLS enforced), the policy is versioned with `default allow := false` and deny-wins (API reads `final_allow`), and `grep` finds no authorization logic outside the centralized layer.
@@ -0,0 +1,113 @@
1
+ ---
2
+ name: design-backup-dr-recovery
3
+ description: Designs and validates backup, point-in-time-recovery, and disaster-recovery strategy for datastores — sets RPO/RTO targets, configures snapshot plus continuous WAL/binlog/oplog archiving for PITR, 3-2-1 immutable retention, automated test-restores, and cross-region replica failover with split-brain fencing.
4
+ when_to_use: When a stateful service needs a credible answer to "what if the database is lost or corrupted" — setting RPO/RTO, wiring snapshots + continuous log archiving for PITR, designing cross-region failover, scheduling tested restores, or auditing a never-restore-tested backup. Distinct from db-migration-safety (forward schema change safety) and incident-response-sre (running the live outage, not designing recoverability).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the question is **"can we get the data back, and how fast"** — not how to change the schema:
10
+
11
+ - "Set RPO/RTO for this database and prove we can hit them"
12
+ - "We have nightly snapshots but no way to restore to 2:47pm — add PITR"
13
+ - "Stand up cross-region DR / a warm standby we can promote"
14
+ - "Our backups have never been restore-tested — audit and fix that"
15
+ - "Recover a single dropped table without rolling back the whole DB"
16
+ - "Defend backups against ransomware / a fat-fingered `DROP DATABASE`"
17
+
18
+ NOT this skill:
19
+ - Making a forward schema migration safe/reversible (expand-contract, online DDL) → db-migration-safety
20
+ - Running the live incident — paging, comms, mitigation timeline → incident-response-sre
21
+ - Protecting/rotating the backup-store credentials and KMS keys → secrets-management
22
+ - Alerting that a backup job failed / dashboards for restore lag → observability-instrument
23
+ - Trimming snapshot/storage spend → cloud-cost-optimize
24
+
25
+ ## Steps
26
+
27
+ 1. **Set RPO and RTO per datastore from business impact — these two numbers drive every later choice.** RPO = max tolerable data loss (how far back you may rewind). RTO = max tolerable downtime (how long restore may take). Pick a tier, don't invent per-DB:
28
+
29
+ | Tier | Example data | RPO | RTO | Implied mechanism |
30
+ |---|---|---|---|---|
31
+ | Tier 0 (money/orders) | payments ledger, auth | ≤ seconds | ≤ minutes | sync replica + continuous WAL, automated promotion |
32
+ | Tier 1 (core app) | primary OLTP DB | ≤ 5 min | ≤ 1 hr | snapshot + async WAL archiving (PITR), warm standby |
33
+ | Tier 2 (supporting) | analytics, search index | ≤ 1 hr | ≤ 4 hr | hourly snapshot, rebuild-from-source allowed |
34
+ | Tier 3 (derived/cache) | caches, rebuildable views | n/a | n/a | no backup — document the rebuild procedure instead |
35
+
36
+ An RPO shorter than the snapshot interval is a lie unless you also archive logs continuously (step 2). Write the chosen numbers down; an untargeted "we back up nightly" has an implicit 24h RPO nobody agreed to.
37
+
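The arithmetic behind that warning is small enough to write down. The sketch below is a simplified model that ignores upload latency and archive failures; the function names are hypothetical, and intervals are in seconds.

```python
from typing import Optional


def worst_case_rpo_seconds(
    snapshot_interval_s: int, log_archive_interval_s: Optional[int] = None
) -> int:
    """Worst-case data loss if the primary dies just before the next backup.

    With snapshots only, everything since the last snapshot is lost.
    With continuous log archiving, only the un-archived log tail is lost.
    """
    if log_archive_interval_s is None:
        return snapshot_interval_s
    return min(snapshot_interval_s, log_archive_interval_s)


def meets_rpo(
    target_s: int, snapshot_interval_s: int, log_archive_interval_s: Optional[int] = None
) -> bool:
    return worst_case_rpo_seconds(snapshot_interval_s, log_archive_interval_s) <= target_s
```

Under this model a nightly snapshot alone cannot meet the Tier 1 target of 5 minutes, while the same snapshot plus logs archived every 60 seconds can.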
38
+ 2. **Two backup layers: periodic base + continuous log archiving. Snapshot-only cannot do PITR.** A snapshot gets you to *snapshot time*; the log stream replays forward to any timestamp in between.
39
+
40
+ | Engine | Base backup | Continuous log (the PITR engine) | Restore = base + replay |
41
+ |---|---|---|---|
42
+ | PostgreSQL | `pg_basebackup` / disk snapshot | WAL via `archive_command` → object store (pgBackRest/WAL-G) | `restore_command` + `recovery_target_time` |
43
+ | MySQL/MariaDB | `xtrabackup` / `mysqldump` | binlog (`log_bin`, `binlog_format=ROW`) shipped off-host | restore base, `mysqlbinlog --stop-datetime` apply |
44
+ | MongoDB | `mongodump` / filesystem snapshot | oplog (replica set required) | restore + `--oplogReplay --oplogLimit` |
45
+ | SQLite | `.backup` / file copy | WAL file is local-only — ship full DB on a cron | copy file (no true PITR) |
46
+ | Managed (RDS/Cloud SQL) | automated snapshots | provider-managed transaction logs | "restore to point in time" API |
47
+
48
+ Default for any Tier 0/1 SQL store: **pgBackRest/WAL-G (Postgres) or Percona XtraBackup + binlog (MySQL)** with logs archived every ≤60s. Logical dumps (`pg_dump`/`mysqldump`) are a *secondary* portable copy, not your primary — they're slow to restore and lock/strain a large live DB.
49
+
50
+ 3. **Retention and layout: 3-2-1 with at least one immutable copy.** 3 copies, 2 media/accounts, 1 off-site/cross-region. Make ≥1 copy **immutable** so ransomware or a compromised admin can't delete it:
51
+ - Object-lock the bucket: S3 Object Lock **Compliance mode** (`--object-lock-mode COMPLIANCE`), or GCS bucket retention lock, or Azure immutable blob. Compliance mode = nobody, including root, can delete before expiry.
52
+ - Put backups in a **separate account/project** from production with write-only (no-delete) IAM for the backup writer — same-account backups die with the account.
53
+ - Lifecycle: hot (last 7d, fast restore) → warm (30d) → cold/Glacier (90–365d per compliance). Cold tiers add hours to RTO — never put your RTO-critical recent backups in Glacier.
54
+ - Retention must cover **detection lag**: corruption found on day 10 needs a day-9 good copy, so retain > realistic time-to-detect.
55
+
56
+ 4. **Verify restorability automatically — an untested backup is a hypothesis, not a backup.** Schedule a job that restores to a *scratch* environment and validates, on every backup or at least nightly:
57
+ ```bash
58
+ set -euo pipefail # nightly restore drill (Postgres / pgBackRest): abort with non-zero status on any failed step
59
+ pgbackrest --stanza=main --type=time \
60
+ --target="$(date -u -d '10 min ago' +'%Y-%m-%d %H:%M:%S')" restore
61
+ pg_ctl start -D "$PGDATA" -w -t 600
62
+ # validate: structural + content, not just "it started"
63
+ psql -tAc "SELECT count(*) FROM orders" | grep -qE '^[0-9]+$'
64
+ psql -tAc "SELECT pg_catalog.pg_database_size('app')" | grep -qE '^[1-9][0-9]*$' # size must be > 0
65
+ RESTORE_SECS=$SECONDS; echo "restore took ${RESTORE_SECS}s (RTO budget: 3600s)"
66
+ [ "$RESTORE_SECS" -le 3600 ] || { echo "RTO BREACH"; exit 1; }
67
+ ```
68
+ Validate **content** (row counts vs a known watermark, `pg_amcheck`/`CHECKSUM TABLE`, app-level invariant query), measure wall-clock restore time, and **fail the job loud** (page) if it breaks or exceeds RTO. The restore time you measure here *is* your real RTO — the planned number is fiction until measured.
69
+
70
+ 5. **Have a procedure for each recovery shape — they are not the same command.**
71
+ - **Full restore (host lost):** provision, restore latest base, replay logs to "now", re-point app.
72
+ - **PITR (bad deploy/poison write at 14:32):** restore base before 14:32, replay to `recovery_target_time = '14:31:59'`, `recovery_target_action = 'pause'` (this replaced `pause_at_recovery_target` in PostgreSQL 9.5), inspect, then promote. Recover to *just before* the bad event.
73
+ - **Single-table / logical restore:** restore into a throwaway instance, `pg_dump -t orders` (or `mysqldump --no-create-info`) that table, load into prod — never restore the whole cluster to fix one table.
74
+ - **Corruption:** do **not** overwrite the only good copy. Restore to a new instance, run `pg_amcheck`/`mongod --repair`/`CHECK TABLE`, diff, cut over only after validation. Promote a healthy replica only after confirming the corruption didn't already replicate.
75
+
76
+ 6. **Cross-region/replica DR: pick sync vs async deliberately, and fence against split-brain.**
77
+
78
+ | | Sync replication | Async replication |
79
+ |---|---|---|
80
+ | RPO | ~0 (no committed loss) | replica lag (seconds–minutes) |
81
+ | Write latency | + cross-region RTT every commit | none (local commit) |
82
+ | Use for | Tier 0 only, regions < ~10ms apart | everything else (default) |
83
+
84
+ Default to **async** unless RPO≈0 is mandated and you accept the write-latency tax. Failover = promote standby + cut traffic over. **Split-brain is the real danger**: if the old primary comes back and also takes writes, you get divergent histories that can't be merged. Enforce a quorum/leader-election (Patroni + etcd/Consul, Orchestrator, or RDS Multi-AZ which fences for you) and **STONITH-fence** the old primary (revoke its network/credentials) *before* promoting. Cut traffic via low-TTL DNS (≤30s) or, better, a connection proxy (PgBouncer/HAProxy/ProxySQL) that flips backends instantly — DNS TTL caching makes raw DNS failover slow and uneven.
85
+
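The ordering rule in step 6 (fence first, then promote, then cut traffic) can be made explicit in the failover driver. This is a sketch under stated assumptions: `failover` and its three callables are hypothetical stand-ins for whatever your cluster manager or scripts actually do, and the fence step is assumed to return `True` only when fencing is confirmed.

```python
from typing import Callable


def failover(
    fence_old_primary: Callable[[], bool],
    promote_standby: Callable[[], None],
    cut_traffic: Callable[[], None],
) -> bool:
    """Fence, then promote, then cut traffic. Abort if fencing is unconfirmed."""
    if not fence_old_primary():
        # Never promote while the old primary might still accept writes:
        # that is exactly how split-brain starts.
        return False
    promote_standby()
    cut_traffic()
    return True
```

Returning `False` without promoting leaves the decision to a human; tools such as Patroni handle this ordering internally, so a hand-rolled driver like this would only apply where no such manager is in place.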
86
+ 7. **Write the runbook with exact commands, and rehearse it (game day).** The runbook lists per scenario: detection signal → exact restore/promote commands (copy-pasteable, with placeholders) → validation queries → traffic-cutover step → rollback-of-the-rollback. Store it **outside** the system it recovers (it's useless if it lives only in the DB that's down). Schedule a **DR drill quarterly** (Tier 0: monthly) that actually fails over to the standby/restored copy under timing — measure RTO/RPO against target, file the gaps. A runbook never executed end-to-end is presumed broken.
87
+
88
+ ## Common Errors
89
+
90
+ - **Never testing restores.** The #1 cause of "we had backups but couldn't recover." A backup that has never been restored is unproven; automate the drill (step 4) so success/failure is observed continuously, not discovered during the outage.
91
+ - **Snapshot-only, calling it PITR.** Nightly snapshots = up to 24h RPO and you can only land on snapshot boundaries. PITR requires continuous WAL/binlog/oplog archiving (step 2). If asked for "restore to any second," snapshots alone cannot deliver it.
92
+ - **Same blast radius.** Backups in the same account/region/bucket as prod die with it — one compromised credential, one region outage, one `DROP` and both the data and its backup are gone. Cross-account + cross-region + immutable is the point.
93
+ - **No immutability → ransomware/insider wipes the backups too.** Mutable backups are deleted in the same attack that hit prod. Use object-lock Compliance mode / retention lock on ≥1 copy.
94
+ - **Replica treated as a backup.** A replica faithfully replicates `DELETE FROM users` and corruption in milliseconds. Replication is for availability/failover; it is **not** a backup and gives zero protection against logical errors. You need both.
95
+ - **Logical dump as the primary backup for a large DB.** `pg_dump`/`mysqldump` of a multi-TB DB takes hours to restore and strains/locks the live DB while running — blows RTO. Use physical base + log archiving; keep logical dumps as a secondary portable copy only.
96
+ - **RTO ignores restore *and* warm-up.** Real RTO = provision + transfer + restore + log replay + cache/index warm-up + cutover. Cold-tier (Glacier) retrieval alone can be hours. Measure end-to-end; don't quote the `restore` command's runtime.
97
+ - **Failover with no split-brain fencing.** Promoting a standby while the old primary still accepts writes forks history irrecoverably. Fence (STONITH) the old primary and use quorum-based promotion before flipping traffic.
98
+ - **DNS-only cutover with long TTL.** A 300s+ TTL means clients keep hitting the dead primary long past promotion. Use TTL ≤30s, or a connection proxy that switches backends instantly.
99
+ - **Backup job "succeeds" but the file is empty/corrupt.** Exit-0 ≠ valid backup. Verify object size > expected floor, checksum, and a test-restore — not just the job's return code.
100
+ - **Retention shorter than detection lag.** Corruption noticed on day 10 with 7-day retention = no clean copy exists. Retain past your realistic time-to-detect, and keep a longer-interval cold copy.
101
+
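The "exit-0 is not a valid backup" error above can be guarded against with a cheap pre-check before the full test-restore. The sketch below is one possible shape: `backup_looks_valid`, the size floor, and the expected SHA-256 (assumed to be recorded by the backup job when it wrote the file) are all hypothetical names and inputs.

```python
import hashlib
import os


def backup_looks_valid(path: str, min_bytes: int, expected_sha256: str) -> bool:
    """Cheap sanity check: size above a floor and checksum matches the recorded one."""
    if os.path.getsize(path) < min_bytes:
        return False  # empty or truncated file, regardless of the job's exit code
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large backups do not need to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

Passing this check only shows the file is intact and plausibly sized; it does not replace the test-restore, which is what proves the backup is usable.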
102
+ ## Verify
103
+
104
+ 1. **RPO/RTO are written and tiered.** Every stateful datastore has an explicit RPO and RTO number tied to a business tier (step 1) — not an implicit "nightly."
105
+ 2. **PITR proven, not assumed.** A restore to an *arbitrary* timestamp between two base backups (e.g. 14:31:59 yesterday) lands the data at that second, which proves that continuous log archiving works and not just snapshots.
106
+ 3. **Automated restore drill is green and timed.** The nightly/per-backup test-restore to scratch passes (structural + content + invariant checks) and its measured wall-clock ≤ RTO budget; a failure or RTO breach **pages**.
107
+ 4. **3-2-1 + immutability holds.** ≥3 copies across ≥2 accounts/regions, ≥1 with object-lock Compliance/retention-lock that even root cannot delete before expiry — confirm by attempting (and failing) to delete a locked object.
108
+ 5. **Independent blast radius.** Deleting/encrypting the prod bucket/account leaves a usable backup intact in another account/region.
109
+ 6. **Each recovery shape has a tested path:** full restore, PITR-to-timestamp, single-table logical restore, and corruption-to-new-instance — each with copy-pasteable commands in the runbook.
110
+ 7. **Failover fences and cuts over fast.** A drill promotion fences the old primary (it cannot take writes post-promotion) and traffic moves via ≤30s-TTL DNS or a proxy; no split-brain divergence after.
111
+ 8. **Game day actually ran.** A dated DR drill within the cadence (≤1 quarter; Tier 0 ≤1 month) failed over end-to-end, measured RPO/RTO vs target, and logged the gaps.
112
+
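The cadence in item 8 is easy to check mechanically. The sketch below assumes the date of the last end-to-end drill is recorded somewhere; `drill_is_current` and the day limits (31 days for Tier 0, 92 days for everything else, as approximations of one month and one quarter) are assumptions for illustration.

```python
from datetime import date

# Approximate cadence limits in days: one month for Tier 0, one quarter otherwise.
MAX_DRILL_AGE_DAYS = {"tier0": 31, "default": 92}


def drill_is_current(last_drill: date, today: date, tier: str = "default") -> bool:
    """True if the last end-to-end DR drill falls within the cadence for this tier."""
    limit = MAX_DRILL_AGE_DAYS.get(tier, MAX_DRILL_AGE_DAYS["default"])
    age_days = (today - last_drill).days
    return 0 <= age_days <= limit
```

A check like this could run alongside the nightly restore drill so that an overdue game day shows up as a failure rather than being noticed during an audit.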
113
+ Done = every datastore has written RPO/RTO targets, PITR (base + continuous logs) restoring to an arbitrary timestamp, an automated restore drill that is green and within RTO, ≥1 immutable cross-account/region copy, and a runbook proven by a dated end-to-end DR drill — restore time **measured**, never merely planned.