sanook-cli 0.4.0 → 0.5.0

Files changed (235)
  1. package/.env.example +19 -0
  2. package/CHANGELOG.md +144 -0
  3. package/README.md +153 -20
  4. package/README.th.md +136 -0
  5. package/dist/agentContext.js +4 -0
  6. package/dist/approval.js +6 -0
  7. package/dist/bin.js +394 -51
  8. package/dist/brain.js +92 -59
  9. package/dist/brand.js +47 -0
  10. package/dist/checkpoint.js +37 -0
  11. package/dist/commands.js +86 -6
  12. package/dist/compaction.js +76 -5
  13. package/dist/config.js +100 -12
  14. package/dist/cost.js +60 -3
  15. package/dist/doctor.js +92 -0
  16. package/dist/gateway/auth.js +2 -2
  17. package/dist/gateway/ledger.js +2 -2
  18. package/dist/gateway/scheduler.js +1 -0
  19. package/dist/gateway/serve.js +6 -4
  20. package/dist/gateway/server.js +10 -2
  21. package/dist/git.js +11 -2
  22. package/dist/hooks.js +43 -17
  23. package/dist/knowledge.js +48 -49
  24. package/dist/loop.js +182 -66
  25. package/dist/lsp/client.js +173 -0
  26. package/dist/lsp/framing.js +56 -0
  27. package/dist/lsp/index.js +138 -0
  28. package/dist/lsp/servers.js +82 -0
  29. package/dist/mcp-server.js +244 -0
  30. package/dist/mcp.js +184 -29
  31. package/dist/memory-store.js +559 -0
  32. package/dist/memory.js +143 -29
  33. package/dist/orchestrate.js +150 -0
  34. package/dist/providers/codex.js +2 -2
  35. package/dist/providers/keys.js +3 -2
  36. package/dist/providers/registry.js +133 -1
  37. package/dist/repomap.js +93 -0
  38. package/dist/search/chunk.js +158 -0
  39. package/dist/search/embed-store.js +187 -0
  40. package/dist/search/engine.js +203 -0
  41. package/dist/search/fuse.js +35 -0
  42. package/dist/search/index-core.js +187 -0
  43. package/dist/search/indexer.js +241 -0
  44. package/dist/search/store.js +77 -0
  45. package/dist/session.js +42 -8
  46. package/dist/skill-install.js +10 -10
  47. package/dist/skills.js +12 -9
  48. package/dist/summarize.js +31 -0
  49. package/dist/tools/bash.js +21 -2
  50. package/dist/tools/diagnostics.js +41 -0
  51. package/dist/tools/edit.js +29 -7
  52. package/dist/tools/index.js +8 -1
  53. package/dist/tools/list.js +7 -2
  54. package/dist/tools/permission.js +90 -9
  55. package/dist/tools/read.js +23 -4
  56. package/dist/tools/remember.js +1 -1
  57. package/dist/tools/sandbox.js +61 -0
  58. package/dist/tools/search.js +105 -4
  59. package/dist/tools/task.js +195 -29
  60. package/dist/tools/timeout.js +35 -0
  61. package/dist/tools/util.js +10 -0
  62. package/dist/tools/write.js +6 -4
  63. package/dist/trust.js +89 -0
  64. package/dist/ui/app.js +218 -27
  65. package/dist/ui/banner.js +4 -9
  66. package/dist/ui/history.js +30 -0
  67. package/dist/ui/mentions.js +44 -0
  68. package/dist/ui/setup.js +6 -5
  69. package/dist/ui/useEditor.js +83 -0
  70. package/dist/update.js +114 -0
  71. package/dist/worktree.js +173 -0
  72. package/package.json +11 -5
  73. package/scripts/postinstall.mjs +33 -0
  74. package/second-brain/.agents/_Index.md +30 -0
  75. package/second-brain/.agents/skills/_Index.md +30 -0
  76. package/second-brain/.agents/workflows/_Index.md +30 -0
  77. package/second-brain/AGENTS.md +4 -4
  78. package/second-brain/Acceptance/_Index.md +30 -0
  79. package/second-brain/Acceptance/golden-case-template.md +39 -0
  80. package/second-brain/Areas/_Index.md +30 -0
  81. package/second-brain/Bugs/System-OS/_Index.md +30 -0
  82. package/second-brain/Bugs/_Index.md +30 -0
  83. package/second-brain/CLAUDE.md +4 -1
  84. package/second-brain/Checklists/_Index.md +30 -0
  85. package/second-brain/Checklists/preflight-postflight-template.md +29 -0
  86. package/second-brain/Distillations/_Index.md +30 -0
  87. package/second-brain/Entities/_Index.md +30 -0
  88. package/second-brain/Entities/entity-template.md +33 -0
  89. package/second-brain/Evals/_Index.md +30 -0
  90. package/second-brain/Evals/correction-pairs.md +24 -0
  91. package/second-brain/Evals/failure-taxonomy.md +24 -0
  92. package/second-brain/Evals/golden-set.md +25 -0
  93. package/second-brain/Evals/quality-ledger.md +23 -0
  94. package/second-brain/Evals/self-eval-rubric.md +23 -0
  95. package/second-brain/GEMINI.md +4 -4
  96. package/second-brain/Goals/_Index.md +30 -0
  97. package/second-brain/Handoffs/_Index.md +30 -0
  98. package/second-brain/Home.md +7 -0
  99. package/second-brain/Intake/Raw Sources/_Index.md +30 -0
  100. package/second-brain/Intake/_Index.md +30 -0
  101. package/second-brain/Intake/_Quarantine/_Index.md +30 -0
  102. package/second-brain/Learning/_Index.md +30 -0
  103. package/second-brain/Playbooks/_Index.md +30 -0
  104. package/second-brain/Playbooks/playbook-template.md +23 -0
  105. package/second-brain/Projects/_Index.md +30 -0
  106. package/second-brain/Prompts/_Index.md +30 -0
  107. package/second-brain/README.md +2 -1
  108. package/second-brain/Research/_Index.md +30 -0
  109. package/second-brain/Retrospectives/_Index.md +30 -0
  110. package/second-brain/Reviews/_Index.md +30 -0
  111. package/second-brain/Runbooks/_Index.md +30 -0
  112. package/second-brain/Runbooks/eval-loop.md +24 -0
  113. package/second-brain/Sessions/_Index.md +30 -0
  114. package/second-brain/Shared/AI-Context-Index.md +20 -0
  115. package/second-brain/Shared/AI-Threads/_Index.md +30 -0
  116. package/second-brain/Shared/Archive/_Index.md +30 -0
  117. package/second-brain/Shared/Assets/_Index.md +30 -0
  118. package/second-brain/Shared/Context-Packs/_Index.md +30 -0
  119. package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
  120. package/second-brain/Shared/Coordination/NOW.md +28 -0
  121. package/second-brain/Shared/Coordination/_Index.md +30 -0
  122. package/second-brain/Shared/Coordination/agent-registry.md +24 -0
  123. package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
  124. package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
  125. package/second-brain/Shared/Coordination/task-board.md +32 -0
  126. package/second-brain/Shared/Core-Facts/_Index.md +30 -0
  127. package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
  128. package/second-brain/Shared/Glossary/_Index.md +30 -0
  129. package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
  130. package/second-brain/Shared/Operating-State/_Index.md +30 -0
  131. package/second-brain/Shared/Prompting/_Index.md +30 -0
  132. package/second-brain/Shared/Provenance/_Index.md +30 -0
  133. package/second-brain/Shared/Rules/_Index.md +30 -0
  134. package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
  135. package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
  136. package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
  137. package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
  138. package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
  139. package/second-brain/Shared/Rules/rules-formatting.md +34 -0
  140. package/second-brain/Shared/Scripts/_Index.md +30 -0
  141. package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
  142. package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
  143. package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
  144. package/second-brain/Shared/User-Memory/_Index.md +30 -0
  145. package/second-brain/Shared/User-Persona/_Index.md +30 -0
  146. package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
  147. package/second-brain/Shared/Working-Memory/_Index.md +30 -0
  148. package/second-brain/Shared/_Index.md +30 -0
  149. package/second-brain/Shared/mcp-servers/_Index.md +30 -0
  150. package/second-brain/Skills/_Index.md +30 -0
  151. package/second-brain/Templates/_Index.md +30 -0
  152. package/second-brain/Templates/bug.md +2 -0
  153. package/second-brain/Templates/handoff.md +2 -0
  154. package/second-brain/Templates/session.md +2 -0
  155. package/second-brain/Tools/_Index.md +30 -0
  156. package/second-brain/Traces/_Index.md +30 -0
  157. package/second-brain/Vault Structure Map.md +33 -1
  158. package/second-brain/copilot/_Index.md +30 -0
  159. package/skills/audit-license-compliance/SKILL.md +117 -0
  160. package/skills/author-codemod/SKILL.md +110 -0
  161. package/skills/build-audit-logging/SKILL.md +112 -0
  162. package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
  163. package/skills/build-cli-tool/SKILL.md +108 -0
  164. package/skills/build-data-table/SKILL.md +141 -0
  165. package/skills/build-native-mobile-ui/SKILL.md +154 -0
  166. package/skills/build-offline-first-sync/SKILL.md +118 -0
  167. package/skills/build-realtime-channel/SKILL.md +122 -0
  168. package/skills/build-vector-search/SKILL.md +131 -0
  169. package/skills/compose-local-dev-stack/SKILL.md +149 -0
  170. package/skills/configure-bundler-build/SKILL.md +166 -0
  171. package/skills/configure-dns-tls/SKILL.md +142 -0
  172. package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
  173. package/skills/configure-security-headers-csp/SKILL.md +122 -0
  174. package/skills/contract-testing/SKILL.md +140 -0
  175. package/skills/datetime-timezone-correctness/SKILL.md +125 -0
  176. package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
  177. package/skills/debug-flaky-tests/SKILL.md +128 -0
  178. package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
  179. package/skills/deliver-webhooks/SKILL.md +116 -0
  180. package/skills/design-api-pagination/SKILL.md +144 -0
  181. package/skills/design-authorization-model/SKILL.md +119 -0
  182. package/skills/design-backup-dr-recovery/SKILL.md +113 -0
  183. package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
  184. package/skills/design-multi-tenancy/SKILL.md +100 -0
  185. package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
  186. package/skills/design-relational-schema/SKILL.md +129 -0
  187. package/skills/design-search-index-infra/SKILL.md +151 -0
  188. package/skills/design-state-machine/SKILL.md +108 -0
  189. package/skills/design-token-system/SKILL.md +109 -0
  190. package/skills/distributed-locks-leases/SKILL.md +120 -0
  191. package/skills/encrypt-sensitive-data/SKILL.md +148 -0
  192. package/skills/feature-flags-rollout/SKILL.md +130 -0
  193. package/skills/file-upload-object-storage/SKILL.md +107 -0
  194. package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
  195. package/skills/harden-llm-app-reliability/SKILL.md +126 -0
  196. package/skills/i18n-localization-setup/SKILL.md +113 -0
  197. package/skills/idempotency-keys/SKILL.md +107 -0
  198. package/skills/implement-push-notifications/SKILL.md +142 -0
  199. package/skills/ingest-webhook-secure/SKILL.md +120 -0
  200. package/skills/integrate-oauth-oidc/SKILL.md +126 -0
  201. package/skills/load-stress-test/SKILL.md +129 -0
  202. package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
  203. package/skills/model-nosql-data/SKILL.md +118 -0
  204. package/skills/money-decimal-arithmetic/SKILL.md +123 -0
  205. package/skills/monitor-ml-drift/SKILL.md +109 -0
  206. package/skills/numeric-precision-units/SKILL.md +144 -0
  207. package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
  208. package/skills/optimize-react-rerenders/SKILL.md +124 -0
  209. package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
  210. package/skills/payments-billing-integration/SKILL.md +114 -0
  211. package/skills/pin-toolchain-versions/SKILL.md +116 -0
  212. package/skills/plan-strangler-migration/SKILL.md +95 -0
  213. package/skills/property-based-testing/SKILL.md +108 -0
  214. package/skills/publish-package-registry/SKILL.md +130 -0
  215. package/skills/recover-git-state/SKILL.md +119 -0
  216. package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
  217. package/skills/resilience-timeouts-retries/SKILL.md +104 -0
  218. package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
  219. package/skills/rewrite-git-history/SKILL.md +109 -0
  220. package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
  221. package/skills/schema-evolution-compatibility/SKILL.md +121 -0
  222. package/skills/send-transactional-email/SKILL.md +126 -0
  223. package/skills/serve-deploy-ml-model/SKILL.md +107 -0
  224. package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
  225. package/skills/setup-devcontainer-env/SKILL.md +131 -0
  226. package/skills/setup-lint-format-precommit/SKILL.md +140 -0
  227. package/skills/setup-monorepo-tooling/SKILL.md +125 -0
  228. package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
  229. package/skills/structured-output-llm/SKILL.md +86 -0
  230. package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
  231. package/skills/test-data-factories/SKILL.md +158 -0
  232. package/skills/threat-model-stride/SKILL.md +123 -0
  233. package/skills/train-evaluate-ml-model/SKILL.md +109 -0
  234. package/skills/unicode-text-correctness/SKILL.md +109 -0
  235. package/skills/visual-regression-testing/SKILL.md +120 -0
@@ -0,0 +1,107 @@
1
+ ---
2
+ name: file-upload-object-storage
3
+ description: Implements secure file/image/video upload to object storage via short-lived presigned URLs or POST policies, with content-type + size validation, magic-byte verification, non-guessable tenant-scoped key namespacing, multipart/resumable transfer, private buckets with signed-URL access, and post-upload scan/transcode + lifecycle cleanup.
4
+ when_to_use: User is adding file/image/video upload, generating presigned/direct upload URLs, handling large/resumable/multipart transfers, validating uploads, controlling object access, or serving via signed URLs/CDN. Distinct from auth-jwt-session (who the caller is — this consumes that identity) and secrets-management (storing the bucket credentials themselves).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when bytes flow **from a client into a bucket** and back out under access control:
10
+
11
+ - "Let users upload an avatar / document / video to object storage (S3, GCS, R2, Azure Blob)"
12
+ - "Generate a presigned URL / POST policy so the browser uploads direct-to-bucket"
13
+ - "Handle large/resumable uploads, multipart with retry"
14
+ - "Validate uploaded files — block executables, cap size, check the real type"
15
+ - "Keep these files private and serve them with a time-limited signed URL / behind a CDN"
16
+ - "Scan/resize/transcode after upload; clean up orphaned objects when a record is deleted"
17
+
18
+ NOT this skill:
19
+ - Deciding *who* may request an upload URL or read an object → auth-jwt-session (this skill consumes the authenticated identity; it does not establish it)
20
+ - Where the bucket's own access keys / service-account JSON live → secrets-management
21
+ - Broad storage-class/egress cost modeling across services → cloud-cost-optimize (this covers only per-object lifecycle/tiering for uploads)
22
+ - Front-end image perf (responsive `srcset`, lazy-load, format) → optimize-core-web-vitals
23
+ - Validating the *other* form fields around the file → build-form-validation
24
+ - App-response caching to cut load → caching-strategy (CDN here is for object delivery, not API responses)
25
+
26
+ ## Steps
27
+
28
+ 1. **Default: client uploads direct-to-bucket via a short-lived presigned credential — never stream bytes through your app server.** Proxying uploads burns app memory/bandwidth and adds a hop. Your server only mints a credential and records metadata. Pick the credential type:
29
+
30
+ | Mechanism | Use when | Constrains |
+ |---|---|---|
+ | **Presigned POST policy** (S3 `generate_presigned_post`, R2 same) | Browser/HTML form, single file, **want server-enforced size + type** | `Content-Type`, `content-length-range`, exact key or key prefix, expiry |
+ | Presigned PUT URL | Simple programmatic single-shot PUT | content-type + expiry only (no size cap pre-upload) |
+ | **Multipart presigned** (`create_multipart_upload` + per-part `upload_part` URLs) | File > ~100 MB, flaky networks, resumable | per-part, parallel, retryable |
+ | GCS resumable session (`x-goog-resumable: start`) / Azure Block Blob (`PutBlock`+`PutBlockList`) | GCS/Azure large or resumable | session URL + range |
36
+
37
+ Default for web uploads = **presigned POST policy** because it is the only mechanism in this table that enforces a size ceiling *before* bytes land. Expiry: **5 minutes** (`ExpiresIn=300`). Generate one per upload and treat it as single-use.
38
+
39
+ 2. **Validate on the boundary — twice. Never trust client-supplied MIME or extension.** A `.jpg` can be a polyglot HTML/JS or a zip bomb. Enforce in two places:
40
+ - **In the policy (pre-upload, hard ceiling):** size via `content-length-range` and a `Content-Type` allowlist condition. Example POST policy conditions:
41
+ ```python
+ import boto3
+
+ s3 = boto3.client("s3")
+ # BUCKET, key, tenant_id, user_id and content_type come from your request handler.
+ fields = {"Content-Type": content_type, "x-amz-meta-owner": str(user_id)}
+ conditions = [
+     {"bucket": BUCKET},
+     ["starts-with", "$key", f"tenants/{tenant_id}/uploads/"],
+     {"Content-Type": content_type},  # must equal the one you signed
+     {"x-amz-meta-owner": str(user_id)},  # every signed field needs a matching condition
+     ["content-length-range", 1, 10 * 1024 * 1024],  # 1 B .. 10 MiB
+ ]
+ post = s3.generate_presigned_post(BUCKET, key, Fields=fields,
+                                   Conditions=conditions, ExpiresIn=300)
+ ```
52
+ S3 rejects with `400 EntityTooLarge` (size) or `403 AccessDenied` (other policy violations) if the upload breaks the policy, so your app server never sees the oversized body.
53
+ - **Server-side after upload, by magic bytes:** on the upload event (step 6), read the **first 256–512 bytes** and sniff the real type (`file --mime-type -`, `python-magic`, Go `net/http.DetectContentType`, Node `file-type`). If the sniffed type is not in your allowlist, delete the object and mark the record `rejected`. Allowlist concrete types (`image/jpeg image/png image/webp image/gif application/pdf video/mp4`) — never a denylist.
54
+
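The magic-byte check in step 2 can be sketched with the standard library alone. This is a minimal illustration, not a replacement for `python-magic` or `file`: the signature table is deliberately small, the names `SIGNATURES`, `sniff_mime`, and `is_allowed` are hypothetical, and the `ftyp` test is coarse (it also matches other ISO-BMFF containers such as MOV or HEIC).

```python
from typing import Optional

# (offset, magic bytes, MIME type). Illustrative only, not exhaustive.
SIGNATURES = [
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"GIF87a", "image/gif"),
    (0, b"GIF89a", "image/gif"),
    (0, b"%PDF-", "application/pdf"),
    (4, b"ftyp", "video/mp4"),  # coarse: any ISO-BMFF file matches
]

ALLOWED = {
    "image/jpeg", "image/png", "image/webp",
    "image/gif", "application/pdf", "video/mp4",
}


def sniff_mime(head: bytes) -> Optional[str]:
    """Guess the real type from the first bytes of the object."""
    # WebP is a RIFF container: b"RIFF" + 4-byte size + b"WEBP".
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    for offset, magic, mime in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return mime
    return None  # unknown type: treat as not allowed


def is_allowed(head: bytes) -> bool:
    """True only when the sniffed type is on the allowlist."""
    return sniff_mime(head) in ALLOWED
```

The caller would pass the first 256 to 512 bytes of the stored object and delete the object when `is_allowed` returns `False`, regardless of the `Content-Type` the client declared.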
55
+ 3. **Design the key/namespace: non-guessable, tenant-scoped, no user-controlled path segments.** Pattern: `tenants/{tenant_id}/{kind}/{uuidv4}{ext}` — e.g. `tenants/9f3.../avatars/0b1e7d2a-...-.webp`.
56
+ - Use a **server-generated UUIDv4/ULID** as the object id; never the raw filename. This blocks enumeration *and* path traversal (`../../etc/passwd`, leading `/`, `%2e%2e`).
57
+ - Keep only the extension: discard the client filename entirely and append a known-good extension derived from the **validated** type, not the client name.
58
+ - Store the real filename, owner, sniffed content-type, size, and `status` in **your DB** — the bucket is a blob store, not a database. The DB row is the source of truth; the object is referenced by key.
59
+ - Same prefix discipline lets one IAM policy / lifecycle rule target `tenants/*/uploads/`.
60
+
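A possible shape for the key builder described in step 3, assuming tenant ids are UUIDs. The names `EXT_BY_TYPE`, `KINDS`, and `build_object_key` are hypothetical; the point is that no client-supplied string ever reaches the key.

```python
import uuid

# Extension is derived from the validated (sniffed) type, never the filename.
EXT_BY_TYPE = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/webp": ".webp",
    "image/gif": ".gif",
    "application/pdf": ".pdf",
    "video/mp4": ".mp4",
}

# Hypothetical fixed set of object kinds; anything else is rejected.
KINDS = {"avatars", "uploads", "documents"}


def build_object_key(tenant_id: uuid.UUID, kind: str, validated_type: str) -> str:
    """Return tenants/{tenant_id}/{kind}/{uuid4}{ext} from server-side values only."""
    if not isinstance(tenant_id, uuid.UUID):
        raise TypeError("tenant_id must be a UUID, not a free-form string")
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if validated_type not in EXT_BY_TYPE:
        raise ValueError(f"type not on allowlist: {validated_type!r}")
    object_id = uuid.uuid4()  # non-guessable, server-generated
    return f"tenants/{tenant_id}/{kind}/{object_id}{EXT_BY_TYPE[validated_type]}"
```

Because every segment comes from a closed set or a UUID, traversal strings such as `../` cannot appear in the result; the original filename would live only in the DB row.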
61
+ 4. **Large files: multipart/resumable with part retry + an abort-incomplete lifecycle rule.** Parts: **8–16 MiB** each (S3 min 5 MiB except last; max 10,000 parts). Upload parts in parallel (3–4 at a time), retry a failed part by re-PUTting just that part number, then `complete_multipart_upload` with the `{PartNumber, ETag}` list. **Failed/abandoned multipart uploads keep billable orphaned parts forever** — add the lifecycle rule:
62
+ ```json
+ { "Rules": [{
+     "ID": "abort-incomplete-mpu",
+     "Status": "Enabled",
+     "Filter": { "Prefix": "tenants/" },
+     "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
+ }] }
+ ```
67
+ GCS resumable / Azure uncommitted blocks need the equivalent (resumable sessions expire in 7 days; Azure uncommitted blocks GC after 7 days).
68
+
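The part-size arithmetic in step 4 can be sketched as below. It assumes the limits quoted above (5 MiB minimum part except the last, at most 10,000 parts) and a default of 8 MiB; `plan_parts` is a hypothetical helper that only computes byte ranges and does not call any storage API.

```python
from typing import List, Tuple

MIB = 1024 * 1024
MIN_PART = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000   # S3 maximum number of parts per upload


def plan_parts(total_size: int, part_size: int = 8 * MIB) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_offset, length) for each part, numbered from 1."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the requested one would need more than MAX_PARTS parts.
    min_size_for_cap = -(-total_size // MAX_PARTS)  # integer ceiling division
    part_size = max(part_size, min_size_for_cap)

    parts = []
    offset = 0
    number = 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Retrying a failed part then means re-sending only the `(start_offset, length)` range for that part number; the other entries in the plan are unaffected.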
69
+ 5. **Access control: private buckets by default; serve via time-limited signed URLs.** Block all public access (`PublicAccessBlockConfiguration` all-true on S3; Uniform bucket-level access + no `allUsers` on GCS). Two-bucket split:
70
+ - **`public-assets`** bucket/prefix → genuinely public, immutable, cacheable (logos, released static media) → fronted by CDN, long `Cache-Control: public, max-age=31536000, immutable`.
71
+ - **`private-data`** bucket → user docs, originals → **never public**. Read access = a signed GET URL minted **per request after an authz check**, short expiry (**60–300 s**). For many objects on one page (galleries) use **signed cookies** (CloudFront) / signed-URL prefix to avoid signing each object.
72
+ The authz check (does this user own/may-read this key?) lives in **your** code before signing — the signature only proves the URL wasn't tampered with, not that the requester is entitled. That ownership decision is auth-jwt-session territory; this skill just enforces it at sign time.
73
+
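A rough sketch of "authorize, then sign" from step 5. Ownership is simplified here to a tenant-prefix match, which is an assumption; a real check would consult the DB row for the key. `mint_read_url` is a hypothetical name, and the signer is injected so the sketch stays independent of any SDK.

```python
from typing import Callable, Optional


def mint_read_url(
    user_tenant_id: str,
    object_key: str,
    sign: Callable[[str, int], str],
    expires_s: int = 120,
) -> Optional[str]:
    """Return a signed GET URL, or None when the caller may not read the key."""
    if not 60 <= expires_s <= 300:
        raise ValueError("expiry must stay within 60-300 seconds")
    # Simplified authz: the key must live under the caller's tenant prefix.
    # The trailing slash stops tenant "t1" from matching "t10".
    prefix = f"tenants/{user_tenant_id}/"
    if not object_key.startswith(prefix) or ".." in object_key.split("/"):
        return None  # no URL is issued at all
    return sign(object_key, expires_s)
```

With boto3, the injected signer could be a small wrapper around `s3.generate_presigned_url("get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=ttl)`; the denial path never reaches it.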
74
+ 6. **Post-upload pipeline: react to the upload event, go `pending → ready`.** Insert the DB row as `status=pending` when you mint the URL. Fire on the storage event (**S3 → EventBridge/SNS/SQS or Lambda; GCS → Pub/Sub; Azure → Event Grid**) — do **not** trust a client "I'm done" callback as the only signal. In the handler: (a) magic-byte validate (step 2), (b) AV scan (e.g. ClamAV / `clamdscan`) for any user-shared file, (c) derive — resize/strip-EXIF for images, transcode for video (`ffmpeg` → HLS/MP4), (d) on success flip `status=ready`, on failure delete object + `status=rejected`. Clients only ever see/serve `ready` objects.
75
+
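The `pending → ready` handler in step 6 might look roughly like this. All collaborators (`sniff`, `scan_clean`, `derive`, `delete_object`) are injected, hypothetical callables standing in for the magic-byte check, AV scan, transcode step, and bucket delete; `UploadRecord` is a stand-in for the DB row.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class UploadRecord:
    key: str
    status: str = "pending"  # pending -> ready | rejected


def handle_upload_event(
    record: UploadRecord,
    head: bytes,
    sniff: Callable[[bytes], Optional[str]],
    scan_clean: Callable[[str], bool],
    derive: Callable[[str], None],
    delete_object: Callable[[str], None],
    allowed: Set[str],
) -> str:
    """Run validate -> scan -> derive; only this handler may set status to ready."""
    if record.status != "pending":
        return record.status  # event redelivered: do nothing twice
    try:
        ok = sniff(head) in allowed and scan_clean(record.key)
        if ok:
            derive(record.key)
    except Exception:
        ok = False  # any pipeline failure ends in rejected
    if ok:
        record.status = "ready"
    else:
        delete_object(record.key)
        record.status = "rejected"
    return record.status
```

Returning early for non-pending records keeps the handler safe under at-least-once event delivery, which queue-based storage notifications commonly use.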
76
+ 7. **Lifecycle & cost: expire temp uploads, tier cold objects, CDN the hot reads.** Separate a `tmp/` prefix for unconfirmed uploads with a **1-day expiry** lifecycle rule (the orphan from an abandoned form never lingers). Transition originals not read in 30/90 days to Infrequent-Access / Nearline / Cool. Serve hot public reads through a CDN (CloudFront/Cloudflare/Fastly) with an origin-access identity so the bucket stays private to the world but readable by the CDN.
77
+
78
+ 8. **Cleanup orphaned objects on record delete.** Deleting the DB row must enqueue a delete of its object key(s) — including derived renditions/thumbnails. Do it **transactionally-ish**: delete the row, then on commit enqueue an idempotent delete job (retry-safe; a missing key is success). A nightly **reconcile** sweep (list bucket prefix vs DB keys) catches drift in both directions — objects with no row, rows with no object.
79
+
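The nightly reconcile sweep in step 8 reduces to a two-way set difference. A minimal sketch, assuming both key listings fit in memory; `reconcile` is a hypothetical name and the listings would come from a bucket-prefix listing and a DB query.

```python
from typing import Iterable, Set, Tuple


def reconcile(bucket_keys: Iterable[str], db_keys: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (objects with no DB row, DB rows with no object)."""
    bucket = set(bucket_keys)
    db = set(db_keys)
    orphan_objects = bucket - db   # candidates for deletion
    missing_objects = db - bucket  # records that would 404
    return orphan_objects, missing_objects
```

Orphan objects can be fed to the same idempotent delete job used on row delete; missing objects usually need a human look, since they indicate data loss rather than leftover cost.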
80
+ ## Common Errors
81
+
82
+ - **Proxying upload bytes through the app server.** OOMs on large files, doubles bandwidth. Mint a presigned credential; let the client PUT/POST straight to the bucket.
83
+ - **Trusting `Content-Type`/extension from the client.** Spoofable; enables stored-XSS via a `.jpg` that's really HTML served inline. Verify magic bytes server-side and serve user files with `Content-Disposition: attachment` + `X-Content-Type-Options: nosniff`.
84
+ - **Putting the user's filename in the key.** Invites path traversal and enumeration. Key on a server UUID; keep the display name in the DB.
85
+ - **Public bucket / public ACL "just to make it work."** Leaks every object and lets anyone overwrite. Block public access; use signed URLs. A `?`-less object URL that loads in incognito is a finding.
86
+ - **Presigned URL with hours/days expiry.** A leaked long-lived URL is a permanent backdoor. Cap at minutes; mint per request.
87
+ - **No `content-length-range` in the POST policy.** Client uploads a 5 GB file and you pay for it. Always set a size ceiling in the policy, not just a client-side JS check.
88
+ - **Authz only at URL-mint time, never re-checked.** Object IDs leak in logs/referers; a stale signed URL outlives the user's permission. Keep expiry short and re-authorize on every mint.
89
+ - **Forgetting `AbortIncompleteMultipartUpload`.** Failed multipart uploads accrue invisible, billable parts indefinitely. Add the 1-day abort rule on day one.
90
+ - **Marking `ready` before the scan/transcode finishes.** Serves unscanned malware or a half-written object. Flip to `ready` only from the post-upload handler.
91
+ - **Deleting the DB row but not the object (or vice-versa).** Orphans cost money and leak data; missing objects 404 live records. Enqueue an idempotent object delete on row delete + run a reconcile sweep.
92
+ - **CORS not configured on the bucket.** Browser direct-PUT/POST fails preflight. Set `AllowedMethods` (`PUT POST`), `AllowedOrigins` (your exact origins, not `*`), and expose `ETag` for multipart.
93
+ - **Same bucket for public assets and private originals.** One misconfig exposes everything. Split public-asset and private-data buckets with different policies.
94
+
95
+ ## Verify
96
+
97
+ 1. **Direct-to-bucket works:** client obtains a presigned POST/PUT and uploads with no bytes touching the app server (confirm app logs show only the mint call, not the body).
98
+ 2. **Size ceiling enforced server-side:** upload `max+1` bytes → bucket rejects (`400 EntityTooLarge`/policy violation); the body never reaches you. A client that disables the JS size check still cannot exceed the policy.
99
+ 3. **Type spoof blocked:** upload an HTML/EXE file renamed `.png` with `Content-Type: image/png` → passes the policy but the magic-byte check deletes it and sets `status=rejected`; it is never served `ready`.
100
+ 4. **Key is non-guessable + traversal-proof:** the stored key is a server UUID under `tenants/{id}/...`; a key containing `../`, a leading `/`, or the raw filename is rejected/never produced.
101
+ 5. **Privacy:** the raw object URL (no signature) returns `403` in an incognito session; a freshly signed URL returns `200`; the **same** URL after `Expires` returns `403`.
102
+ 6. **Cross-tenant denied:** user A's signed-URL request for user B's key fails the authz check at mint time (no URL issued), not just at read time.
103
+ 7. **Resumable:** kill the network mid-multipart, resume → only missing parts re-upload, `complete` succeeds, final object bytes match the source checksum (`aws s3 cp` the object back down, then `sha256sum`).
104
+ 8. **Orphan hygiene:** an abandoned multipart upload is gone after the abort window; a `tmp/` object expires per its 1-day rule; deleting a record removes its object + renditions (verify via `aws s3 ls`/reconcile sweep shows no drift).
105
+ 9. **Post-upload state machine:** a fresh upload is `pending`, becomes `ready` only after scan+derive complete, and a malicious/corrupt file ends `rejected` with the object deleted.
106
+
107
+ Done = uploads go direct-to-bucket under a ≤5-min single-use credential with a server-enforced size cap and magic-byte type check; objects are private, tenant-scoped, non-guessable, and readable only via short-lived signed URLs after an ownership check; large uploads resume and abandoned parts auto-abort; and every record delete (plus a reconcile sweep) leaves zero orphaned objects.
@@ -0,0 +1,111 @@
1
+ ---
2
+ name: fuzz-dynamic-security-test
3
+ description: Sets up dynamic security testing — coverage-guided fuzzing of parsers and input handlers (libFuzzer/cargo-fuzz/AFL++/go test -fuzz/atheris) and DAST scanning of a running app (OWASP ZAP/nuclei) — wired into CI with seed corpora, crash minimization, baseline suppression, and regression-corpus commits.
4
+ when_to_use: Hardening code that parses untrusted input, or a running web app, with active runtime testing that drives real inputs to provoke crashes/vulns. Distinct from write-tests (functional-correctness tests), security-review (static code audit), remediate-web-vulnerabilities (fixing a known vuln), and load-stress-test (performance under load).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when you want to **actively drive inputs at code or a running app** to provoke crashes/vulns, not reason about them statically:
10
+
11
+ - "Fuzz this parser / deserializer / protocol decoder / image or PDF loader for crashes"
12
+ - "Set up cargo-fuzz / libFuzzer / `go test -fuzz` / atheris / AFL++ with a seed corpus and run it in CI"
13
+ - "An input crashes / hangs / OOMs — minimize it and add a regression test"
14
+ - "Run OWASP ZAP / nuclei against staging, authenticated, and triage the findings"
15
+ - "Wire short fuzz on PR + long nightly fuzz, and DAST on every staging deploy"
16
+
17
+ NOT this skill:
18
+ - Functional correctness / example-based unit tests → write-tests
19
+ - Reading code by eye for injection/authz/secret bugs (no execution) → security-review
20
+ - Fixing a *specific known* SQLi/XSS/SSRF you already found → remediate-web-vulnerabilities
21
+ - Measuring latency/throughput/breaking point under concurrency → load-stress-test
22
+ - Reviewing an authorization *design* rather than testing it at runtime → design-authorization-model
23
+
24
+ If a finding is confirmed, hand the fix to remediate-web-vulnerabilities; this skill *finds and reproduces*, it does not patch app logic.
25
+
26
+ ## Steps
27
+
28
+ 1. **Pick the right tool for the target — do not write a fuzzer by hand.** Coverage-guided engines mutate toward new code paths; random byte-spray finds nothing. Match the language:
29
+
30
+ | Target | Engine | Harness entry | Sanitizers |
+ |---|---|---|---|
+ | C/C++ | **libFuzzer** (clang `-fsanitize=fuzzer`) | `LLVMFuzzerTestOneInput(const uint8_t*, size_t)` | ASan + UBSan (+ MSan separately) |
+ | Rust | **cargo-fuzz** (libFuzzer under the hood) | `fuzz_target!(\|data: &[u8]\| { ... })` | ASan on by default |
+ | Go | **native `go test -fuzz`** | `func FuzzX(f *testing.F)` + `f.Fuzz(...)` | race + built-in checks |
+ | Python | **atheris** (libFuzzer bindings) | `atheris.Setup` + `TestOneInput(data)` | native-ext ASan optional |
+ | JS/TS | **Jazzer.js** | `module.exports.fuzz = (data) => {...}` | n/a (catches throws) |
+ | Out-of-process C binary | **AFL++** (`afl-fuzz -i in -o out`) | feed stdin/file | persistent mode + cmplog |
38
+
39
+ Default to the **in-process libFuzzer-family** engine for the language; reach for AFL++ only when you can't instrument the target (closed binary, weird build).
40
+
41
+ 2. **Fuzz the smallest deterministic boundary, structure-aware.** Target one pure `bytes → parsed value` function — the deserializer, the codec/protocol decode, the template/expression parser — not the whole HTTP handler. Make it deterministic (no clock/network/RNG/global state). For structured formats, decode the byte buffer into typed inputs with `arbitrary` (Rust) / `FuzzedDataProvider` (C++/atheris) so mutations stay valid-ish and reach deep logic instead of dying at the length check. Rust example:
42
+
43
+ ```rust
+ #![no_main]
+ use libfuzzer_sys::fuzz_target;
+ use arbitrary::Arbitrary;
+
+ #[derive(Arbitrary, Debug)]
+ struct Input { name: String, depth: u8, body: Vec<u8> }
+
+ fuzz_target!(|inp: Input| {
+     // Never unwrap() inside a harness on the parser's own error path:
+     // a clean Err is correct, only a panic/abort is a finding.
+     let _ = my_parser::parse(&inp.name, inp.depth, &inp.body);
+ });
+ ```
57
+
58
+ 3. **Seed the corpus and add a dictionary — this multiplies coverage.** Drop real, valid sample files into `corpus/<target>/` (one input per file). Add a `.dict` of format tokens/magic bytes (`"PDF"`, `"\xFF\xD8"`, keywords) and pass `-dict=tokens.dict`. Without seeds the fuzzer wastes hours rediscovering the file header. Keep the corpus in the repo so CI starts warm.
59
+
60
+ 4. **Run, then minimize every crash before committing it.** Run locally first (`cargo fuzz run target -- -max_total_time=300` / `go test -fuzz=FuzzX -fuzztime=5m`). On a crash, **minimize the input** (`cargo fuzz tmin`, libFuzzer `-minimize_crash=1 -runs=100000`, AFL++ `afl-tmin`) so the repro is small and the root cause is obvious. Add `-rss_limit_mb`, `-timeout=`, and `-max_len=` so OOMs and hangs are reported as findings, not killed silently.
61
+
62
+ 5. **Commit each crash as a regression seed — this is the deliverable.** Copy the minimized input to `corpus/<target>/crash-<hash>` (or Go's `testdata/fuzz/FuzzX/`). It now re-runs on every fuzz invocation, so the bug can't silently return. This is what turns a one-off crash into a permanent test. Open a finding (step 7) linking the seed.
63
+
64
+ 6. **For a running app, run DAST against staging — never prod.** Stand up a disposable staging instance with seeded test data, then:
65
+ - **nuclei** for known-CVE/misconfig templates: `nuclei -u https://staging.app -severity medium,high,critical -rl 50`.
66
+ - **ZAP** for app-aware crawling: baseline first, then authenticated active scan. Authenticate (ZAP context + auth script, or pass a session cookie/Bearer) so the scanner reaches logged-in routes; an unauthenticated scan never sees anything behind the login page, which is usually most of the surface.
67
+ - **Baseline-suppress accepted findings** instead of muting the whole rule: keep a `zap-baseline.conf` / nuclei exclude list of triaged-and-accepted IDs so the gate only fails on *new* findings. Tune `-rl`/throttle so the active scan doesn't DoS staging.
68
+
69
+ 7. **Triage every finding: reproduce → severity → dedupe → file.** Re-run the exact input/request to confirm it's real (drop scanner false positives — reflected param that's actually encoded, "missing header" on an internal-only route). Rate by realistic impact (RCE/memory-corruption/authn-bypass = Critical; reflected-but-encoded = noise). Dedupe by crash stack / vuln class, not by input bytes — 500 inputs hitting one `parse()` panic are one bug. File with the minimized repro and the committed seed path.
70
+
71
+ 8. **Wire into CI in two tiers + a DAST stage.** Cheap on every PR, deep on a schedule:
72
+
73
+ ```yaml
74
+ # PR: smoke-fuzz only the changed/corpus seeds — must finish in <2 min, gates merge
75
+ pr-fuzz:
76
+ run: cargo fuzz run parser -- -runs=0 corpus/parser # replay corpus, no mutation
77
+ # Nightly: long mutation run, upload new crashes as artifacts, file on failure
78
+ nightly-fuzz:
79
+ run: cargo fuzz run parser -- -max_total_time=3600 -timeout=10 -rss_limit_mb=2048
80
+ # Per staging deploy: DAST gate
81
+ dast:
82
+ run: nuclei -u "$STAGING_URL" -severity high,critical -eid accepted.txt -ni # accepted.txt: triaged template IDs to exclude
83
+ ```
84
+
85
+ PR job replays the corpus (deterministic, fast, catches regressions); nightly does the expensive mutation. Never put an unbounded mutation run on the PR critical path.
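
The dedupe rule from step 7 (bucket by crash stack, not by input bytes) can be sketched as below. This is a minimal illustration, not part of any fuzzer: the `CrashReport` shape and the frame format are assumptions, and stripping line numbers buckets by function, which may merge two distinct bugs in the same function.

```ts
// Hypothetical crash-report shape; real fuzzers emit stack traces as text
// that you would parse into frames first.
interface CrashReport {
  inputPath: string;
  stack: string[]; // innermost frame first, e.g. "parse at src/lib.rs:42"
}

// Signature = top N frames with addresses and trailing line/column numbers
// removed, so the same bug reached by different inputs lands in one bucket.
function stackSignature(stack: string[], topFrames = 3): string {
  return stack
    .slice(0, topFrames)
    .map((f) => f.replace(/0x[0-9a-f]+/gi, "").replace(/:\d+(:\d+)?$/, "").trim())
    .join(" <- ");
}

function bucketCrashes(reports: CrashReport[]): Map<string, CrashReport[]> {
  const buckets = new Map<string, CrashReport[]>();
  for (const report of reports) {
    const sig = stackSignature(report.stack);
    const bucket = buckets.get(sig) ?? [];
    bucket.push(report);
    buckets.set(sig, bucket);
  }
  return buckets;
}
```

File one finding per bucket and attach the smallest minimized input from that bucket as the committed seed.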
86
+
87
+ ## Common Errors
88
+
89
+ - **`unwrap()`/`expect()` in the harness on the parser's own error path.** Every malformed input then "crashes" — pure noise. A returned `Err` is correct behavior; only a panic/abort/sanitizer-trip in the *code under test* is a finding.
90
+ - **No seed corpus and no dictionary.** The fuzzer burns the whole budget rediscovering the file magic/header and never reaches real logic. Seed with valid samples; add a token `.dict`.
91
+ - **Non-deterministic harness.** Reading the clock, network, RNG, or mutating global state makes crashes non-reproducible and corrupts coverage feedback. The harness must be a pure function of `data`.
92
+ - **Committing the raw crashing input, not the minimized one.** A 4 MB repro hides the root cause and bloats the corpus. Always `tmin`/`-minimize_crash` first.
93
+ - **Fuzzing the whole HTTP handler instead of the parser.** Network/auth/DB setup dwarfs the parse step, so mutations rarely reach it — throughput collapses to a few execs/sec. Target the pure decode boundary in-process.
94
+ - **No `-rss_limit_mb`/`-timeout`/`-max_len`.** OOMs and infinite loops get OS-killed and look like a hung job instead of a reported memory/hang bug. Set explicit limits.
95
+ - **Sanitizers off (release build).** Use-after-free, OOB read, and integer-UB pass silently without ASan/UBSan — you only catch hard segfaults. Build the fuzz target with sanitizers on.
96
+ - **Running ZAP/nuclei against production.** Active scans send malicious payloads, mutate data, and can take the service down. Always a disposable staging instance with test data.
97
+ - **Unauthenticated DAST scan.** Misses every logged-in route — the high-value surface. Configure auth (context/script/session token) and verify the scanner is actually inside a session.
98
+ - **Muting a whole scanner rule to clear noise.** Hides future real hits of that class. Suppress the specific accepted finding ID in a baseline file so only *new* findings fail the gate.
99
+ - **Unbounded mutation fuzz on the PR job.** Blocks every merge for an hour or times out. PR replays the corpus; the long mutation run goes nightly.
100
+
101
+ ## Verify
102
+
103
+ 1. **Engine actually mutates and gains coverage:** a short run shows rising `cov:`/`ft:` counters and a steady, non-zero `exec/s` (libFuzzer), not a flat line. Flat coverage means the harness rejects inputs at the door (wrong shape, missing `arbitrary`/`FuzzedDataProvider`).
104
+ 2. **Planted-bug catch:** add a deliberate `assert!`/OOB/`panic!` on a specific byte pattern (or use a target with a known-CVE-style flaw), run the fuzzer, and confirm it finds and minimizes the input within minutes. A fuzzer that can't catch a planted bug catches nothing.
105
+ 3. **Every crash yields a committed seed:** for each crash found, the minimized input lives in the corpus/`testdata` and is tracked in git. Re-running the harness over the corpus reproduces the crash deterministically.
106
+ 4. **Regression gate works:** with the seed committed but the bug *un*fixed, the PR corpus-replay job fails; after the fix it passes — proving the seed actually guards the regression.
107
+ 5. **DAST reproduced + authenticated:** at least one scanner finding is independently re-sent (curl/HTTP client) and reproduces; the scan log shows it traversed authenticated routes (logged-in paths visited), not just the login page.
108
+ 6. **Baseline suppression is scoped:** a previously-accepted finding is silenced by its specific ID, while a freshly introduced vuln of a *different* class still fails the gate (suppression didn't blanket the rule).
109
+ 7. **CI tiers honored:** PR fuzz finishes under its time cap (corpus replay only); nightly runs the bounded mutation budget and uploads any new crash as an artifact + files it.
110
+
111
+ Done = the engine provably mutates toward new coverage, a planted bug or known-CVE pattern is caught and minimized, every crash is committed as a regression seed that fails-then-passes across the fix, and DAST runs authenticated against staging with scoped baseline suppression and a two-tier CI wiring.
@@ -0,0 +1,126 @@
1
+ ---
2
+ name: harden-llm-app-reliability
3
+ description: Hardens LLM API calls for production with per-call timeouts and cancellation, exponential-backoff-plus-full-jitter retries on 429/500/529 that honor Retry-After, model fallback, one-round structured-output repair, refusal/stop_reason handling, and a circuit-breaker degraded mode so a flaky provider never breaks the feature.
4
+ when_to_use: Shipping an LLM feature where provider errors, timeouts, rate limits, or refusals must not crash the UX. Distinct from optimize-llm-cost-latency (speed/spend), defend-llm-prompt-injection (security of inputs), and rate-limiting (protecting your own API from callers, not surviving a provider's limits).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the failure mode you fear is **the provider**, not your code or your callers:
10
+
11
+ - "The model call sometimes hangs / times out and the request just spins forever"
12
+ - "We get 429s / 529s / 500s in bursts and the feature errors out"
13
+ - "Wrap the LLM call so a bad response or refusal degrades gracefully instead of throwing"
14
+ - "Add fallback to a cheaper/other model when the primary is down or refuses"
15
+ - "JSON-mode output is occasionally malformed and crashes the parser"
16
+ - "Mid-stream the connection drops and the user sees half an answer"
17
+
18
+ NOT this skill:
19
+ - Making calls *cheaper or faster* (model routing for cost, prompt caching, token trimming) → optimize-llm-cost-latency
20
+ - Defending the prompt against injection / untrusted-content attacks → defend-llm-prompt-injection
21
+ - Limiting how often *your callers* hit *your* API (token bucket, quotas, your own 429s) → rate-limiting
22
+ - Designing the prompt + structured-output schema itself → prompt-engineering
23
+ - Measuring output quality across prompt/model changes → llm-eval-harness
24
+ - Offloading the whole LLM job to a durable background queue with DLQ → message-queue-jobs
25
+
26
+ This skill is the resilience wrapper *around* one logical LLM call. It assumes the prompt is already written.
27
+
28
+ ## Steps
29
+
30
+ 1. **Wrap every call in a timeout + cancellation token. No naked `await`.** A hung socket must die on a deadline you own, not the SDK default (often 600s+). Two clocks: a per-attempt timeout (the request) and a total deadline (all retries combined). Stream long calls so the per-attempt timeout measures *time-to-first-byte*, not total generation.
31
+
32
+ ```ts
33
+ const TOTAL_DEADLINE_MS = 30_000; // whole operation, retries included
34
+ const PER_ATTEMPT_MS = 12_000; // one HTTP attempt (TTFB for streams)
35
+
36
+ // remainingMs = total deadline left for this attempt (computed by the caller, step 2)
37
+ async function callWithDeadline(fn, remainingMs) {
38
+ const ctrl = new AbortController();
39
+ const budget = Math.max(0, Math.min(PER_ATTEMPT_MS, remainingMs));
40
+ const t = setTimeout(() => ctrl.abort(), budget);
41
+ try { return await fn(ctrl.signal); }
42
+ finally { clearTimeout(t); }
43
+ }
44
+ ```
45
+ Pass `signal` into the SDK (`client.messages.create({...}, { signal })`). On the user side wire the inbound request's abort signal through so a closed browser tab cancels the upstream call instead of burning tokens.
46
+
47
+ 2. **Retry only what's retryable, with exponential backoff + full jitter, and honor `Retry-After`.** Classify the error before you retry — retrying a 400 is just slower failure.
48
+
49
+ | Status / condition | Retry? | Wait |
50
+ |---|---|---|
51
+ | 429 rate-limited | Yes | `Retry-After` header if present, else backoff |
52
+ | 529 overloaded (Anthropic) / 503 | Yes | backoff + jitter |
53
+ | 500 / 502 / 504 / gateway | Yes | backoff + jitter |
54
+ | Network reset / timeout / ECONNRESET | Yes | backoff + jitter |
55
+ | 408 request timeout | Yes | backoff |
56
+ | 400 / 422 bad request | **No** | fix the request, not the retry |
57
+ | 401 / 403 auth | **No** | rotate key / fix scope |
58
+ | 413 too large | **No** | trim input |
59
+ | Refusal / `stop_reason` | **No retry — fall back** (step 4) | — |
60
+
61
+ Defaults: **max 4 attempts**, base 500ms, cap 8s, **full jitter** (`sleep = random(0, min(cap, base * 2**attempt))`). Full jitter beats fixed/equal backoff because synchronized clients (a 429 storm) otherwise retry in lockstep and re-stampede. Always clamp the wait to the remaining total deadline — never sleep past it.
62
+
63
+ ```ts
64
+ const start = Date.now();
65
+ const elapsed = () => Date.now() - start;
66
+ for (let attempt = 0; attempt < 4; attempt++) {
67
+ try { return await callWithDeadline(fn, TOTAL_DEADLINE_MS - elapsed()); }
68
+ catch (e) {
69
+ if (!isRetryable(e) || attempt === 3 || elapsed() > TOTAL_DEADLINE_MS) throw e;
70
+ const ra = retryAfterMs(e); // parse header, seconds or http-date
71
+ const backoff = Math.random() * Math.min(8000, 500 * 2 ** attempt);
72
+ await sleep(Math.min(ra ?? backoff, TOTAL_DEADLINE_MS - elapsed()));
73
+ }
74
+ }
75
+ ```
76
+ LLM calls are **non-idempotent and billed** — a retry after a *partial* success double-charges. Only retry attempts that demonstrably failed before producing a usable response (connection error, non-2xx, timeout-before-first-byte). Never retry a call that already streamed a full body.
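
The retry loop above calls `isRetryable`, `retryAfterMs`, and `sleep` without defining them. One possible shape for those helpers is sketched here. The `ProviderError` fields are an assumption (SDKs differ in how they expose status and headers, and header keys are assumed lowercased), so adapt the accessors to your client.

```ts
// Assumed error shape; not the API of any specific SDK.
interface ProviderError {
  status?: number;
  code?: string; // e.g. "ECONNRESET", "ETIMEDOUT"
  name?: string; // "AbortError" when the per-attempt timeout fired
  headers?: Record<string, string>; // assumed lowercase keys
}

const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504, 529]);
const RETRYABLE_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "EPIPE"]);

function isRetryable(e: ProviderError): boolean {
  if (e.status !== undefined) return RETRYABLE_STATUS.has(e.status);
  // Caveat: a user-initiated cancel also surfaces as AbortError; check the
  // inbound request's signal before treating an abort as retryable.
  if (e.name === "AbortError") return true;
  return e.code !== undefined && RETRYABLE_CODES.has(e.code);
}

// Retry-After is delta-seconds or an HTTP-date; undefined if absent or unparseable.
function retryAfterMs(e: ProviderError, now: number = Date.now()): number | undefined {
  const raw = e.headers?.["retry-after"];
  if (raw === undefined) return undefined;
  const trimmed = raw.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const at = Date.parse(trimmed);
  return Number.isNaN(at) ? undefined : Math.max(0, at - now);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, Math.max(0, ms)));
```

Returning `undefined` rather than `0` for a missing header keeps the `ra ?? backoff` expression in the loop falling through to the jittered backoff.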
77
+
78
+ 3. **Validate structured output; repair once; then fail safe — never crash on malformed JSON.** When you asked for JSON, do not feed the raw model string straight into `JSON.parse` + a schema and let it throw to the user.
79
+ - Parse → validate against the schema (Zod / Pydantic / JSON Schema).
80
+ - On failure, **one** repair round: send the model the broken output + the validator error, ask for corrected JSON only. Strip code fences and prose first.
81
+ - Still invalid → return a typed safe default (e.g. `{ status: "unavailable" }`) or route to degraded mode. Log the raw output. Do **not** loop repairs (cost + latency blowup).
82
+
83
+ Prefer the SDK's native enforcement (tool/`tool_choice` forcing, strict JSON mode) over free-text + regex — it eliminates most repairs. Repair is the safety net, not the plan.
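
A minimal sketch of the parse, validate, repair-once, then safe-default flow from step 3. It is library-free on purpose: `Validator` stands in for a Zod/Pydantic-style schema check, and `repair` is a hypothetical callback that re-asks the model for corrected JSON. For brevity it receives only the broken output, not the validator error the step recommends sending.

```ts
type Validator<T> = (value: unknown) => T | undefined; // undefined = schema mismatch

// Strip Markdown fence markers and any prose around the outermost JSON object.
function extractJson(raw: string): string {
  const unfenced = raw.replace(/`{3}(?:json)?/gi, "");
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  return start >= 0 && end > start ? unfenced.slice(start, end + 1) : unfenced.trim();
}

function tryParse<T>(raw: string, validate: Validator<T>): T | undefined {
  try {
    return validate(JSON.parse(extractJson(raw)));
  } catch {
    return undefined; // malformed JSON is an expected outcome, not an exception
  }
}

async function parseWithOneRepair<T>(
  raw: string,
  validate: Validator<T>,
  repair: (broken: string) => Promise<string>, // hypothetical model re-ask
  safeDefault: T,
): Promise<T> {
  const first = tryParse(raw, validate);
  if (first !== undefined) return first;
  try {
    const second = tryParse(await repair(raw), validate);
    if (second !== undefined) return second;
  } catch {
    // the repair call itself failed; fall through to the safe default
  }
  return safeDefault; // exactly one repair round, never a loop
}
```

Log the raw output at the point where `safeDefault` is returned so malformed responses stay debuggable.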
84
+
85
+ 4. **Fall back to another model on persistent failure or refusal.** When the primary is exhausted (retries spent, circuit open) or returns a refusal, try a fallback before giving up. Order by capability-then-availability, e.g. primary Sonnet → fallback Haiku, or cross-provider if you run multi-vendor.
86
+ - A **refusal** (`stop_reason: "refusal"`, or the model declining) is not a transport error — do not retry the same model; either fall back or return the refusal as a first-class result.
87
+ - Treat `stop_reason: "max_tokens"` as a *truncated* (not failed) result: the JSON is incomplete — repair or raise `max_tokens` and retry once, don't ship the cutoff.
88
+ - Cap fallback depth at 1–2 models. Record which model actually served the response.
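
The fallback rules above can be sketched as a small chain. The `ModelResult` shape is an assumption modelled on a `stop_reason`-style field, and `call` is a hypothetical function assumed to already apply the timeout and retry logic from steps 1 and 2.

```ts
// Hypothetical response shape; map your SDK's response onto it.
interface ModelResult {
  model: string; // which model actually served the response
  text: string;
  stopReason: "end_turn" | "max_tokens" | "refusal";
}

type Outcome =
  | { kind: "ok"; result: ModelResult }
  | { kind: "truncated"; result: ModelResult } // caller repairs or re-calls once
  | { kind: "refused"; result: ModelResult } // surfaced as a first-class result
  | { kind: "unavailable" };

async function callWithFallback(
  models: string[], // primary first, then one or two fallbacks
  call: (model: string) => Promise<ModelResult>,
): Promise<Outcome> {
  let lastRefusal: ModelResult | undefined;
  for (const model of models) {
    try {
      const result = await call(model);
      if (result.stopReason === "refusal") {
        lastRefusal = result; // do not retry the same model; move on
        continue;
      }
      if (result.stopReason === "max_tokens") return { kind: "truncated", result };
      return { kind: "ok", result };
    } catch {
      continue; // this model's retries are spent; try the next one
    }
  }
  return lastRefusal ? { kind: "refused", result: lastRefusal } : { kind: "unavailable" };
}
```

Returning a tagged outcome instead of throwing lets the caller choose between repair, degraded mode, and showing the refusal.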
89
+
90
+ 5. **Stream with a heartbeat; discard partials on mid-stream error.** Long generations should stream so the user sees progress and you detect stalls. Set an **inter-chunk idle timeout** (e.g. 20s with no new token → abort) — a stream can hang open without erroring. If the stream errors or aborts mid-way, **discard the accumulated partial** and either retry from scratch (step 2 rules) or degrade; never persist or render a half-message as if complete. Buffer to a scratch variable and only commit on the terminal `message_stop`.
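
One way to sketch the inter-chunk idle timeout and the buffer-then-commit rule from step 5. It works on any `AsyncIterable<string>`; treating the SDK stream as an iterable of text chunks is an assumption, and a real integration would also abort the underlying request through its `AbortController`.

```ts
// Resolves with the full text only if the stream ends cleanly.
// Rejects if no chunk arrives within idleMs or if the stream errors;
// in both cases the accumulated partial is dropped, never returned.
async function collectWithIdleTimeout(
  stream: AsyncIterable<string>,
  idleMs: number,
): Promise<string> {
  const it = stream[Symbol.asyncIterator]();
  let scratch = ""; // not exposed until the stream completes
  try {
    while (true) {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const idle = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("stream idle timeout")), idleMs);
      });
      try {
        const next = await Promise.race([it.next(), idle]);
        if (next.done) return scratch; // terminal event: commit
        scratch += next.value;
      } finally {
        clearTimeout(timer); // the idle clock restarts for every chunk
      }
    }
  } catch (err) {
    // Best-effort cleanup of the source; we do not wait for it to finish.
    void it.return?.()?.catch(() => {});
    throw err;
  }
}
```

Because the function only resolves on clean completion, the caller can render or persist its result without checking for a half-message.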
91
+
92
+ 6. **Circuit-breaker around the provider → degraded mode.** Per-provider breaker: after N consecutive failures (e.g. 5) or a failure rate over a window, **open** the circuit and stop calling for a cooldown (e.g. 30s), then **half-open** one probe. While open, skip the doomed call and serve degraded mode immediately: a cached previous answer, a canned/templated response, or a clear "this feature is temporarily unavailable" — chosen per feature, decided *before* the incident. This stops a provider outage from turning into 30s timeouts on every request and exhausting your own connection pool.
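
A minimal consecutive-failure breaker matching the closed, open, half-open description in step 6. The class and method names are illustrative. It tracks consecutive failures only (no failure-rate window), and the clock is injectable so the cooldown can be tested without waiting.

```ts
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 5, cooldownMs = 30_000, now: () => number = Date.now) {
    this.threshold = threshold; // consecutive failures before opening
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True if a call may proceed. After the cooldown, exactly one probe is
  // let through (half-open); further calls are refused until it reports back.
  allow(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
      return true;
    }
    return this.state === "closed";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  onFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

Usage: check `allow()` before the call; if it returns `false`, serve the pre-chosen degraded response immediately, otherwise run the call and report `onSuccess()` or `onFailure()`. Keep one instance per provider and route.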
93
+
94
+ 7. **Never lose user input on failure.** Before the call, persist the user's prompt/turn so any failure path (timeout, all-retries-exhausted, circuit open) returns a retryable state, not a black hole. The user should be able to resend with one tap, or the system auto-resumes — input is never silently dropped. For expensive multi-step agent runs, checkpoint so you resume from the failed step, not step 1.
95
+
96
+ ## Common Errors
97
+
98
+ - **Relying on the SDK default timeout.** It's often minutes. A spike of hung sockets exhausts your connection pool and takes the whole service down. Set an explicit per-attempt timeout you own.
99
+ - **Retrying non-retryable errors.** Looping on a 400/401/413 wastes the deadline and (for auth) can lock the key. Classify first; only retry 408/429/5xx/network.
100
+ - **Fixed or equal-jitter backoff.** All clients that got 429'd retry at the same instant and re-stampede the provider. Use full jitter: `random(0, min(cap, base·2^n))`.
101
+ - **Ignoring `Retry-After`.** The provider told you exactly when to come back; backoff math that retries sooner just earns another 429. Parse the header (seconds *or* HTTP-date) and prefer it.
102
+ - **Retrying a partially-streamed call.** It already cost tokens and may have half-applied a side effect; the retry double-charges and can double-act. Only retry failures that occurred before a usable response.
103
+ - **`JSON.parse` straight onto the response.** One malformed token throws an unhandled exception to the user. Always validate, repair once, then fail to a typed default.
104
+ - **Infinite repair loop.** Re-asking the model until JSON is valid can run forever and 10x the bill. Exactly one repair round, then degrade.
105
+ - **Treating a refusal as a 5xx.** Retrying the identical prompt on the same model just refuses again. Fall back or surface it; don't burn retries.
106
+ - **Shipping a `max_tokens` cutoff as complete.** Truncated JSON silently corrupts downstream. Check `stop_reason`; repair or re-call with higher limit.
107
+ - **Rendering the mid-stream partial.** A dropped stream leaves a half-answer the user reads as final. Buffer and only commit on `message_stop`; discard on error.
108
+ - **No circuit breaker.** During a provider outage every request pays the full timeout × retries before failing — your latency and pool collapse. Trip the breaker and serve degraded mode fast.
109
+ - **Dropping user input on the failure path.** The user retypes everything. Persist the turn before the call; make every failure resumable.
110
+ - **Sharing one breaker/timeout budget across unrelated features.** A flaky batch job opens the circuit for your latency-critical chat path. Scope breakers per provider+route.
111
+
112
+ ## Verify
113
+
114
+ Prove resilience with **fault injection**, not hope. Force each failure and assert the wrapper holds — don't wait for prod to hit them.
115
+
116
+ 1. **Forced 429 storm:** Stub the client to return `429` with `Retry-After: 2` for the first 3 calls, then `200`. Assert: exactly 4 attempts, waits honor `Retry-After` (≈2s, not the backoff curve), final result returned, total stays under the deadline.
117
+ 2. **Forced timeout:** Stub a response slower than `PER_ATTEMPT_MS`. Assert: the attempt aborts at the deadline (not the SDK default), the `AbortController` fired, and either a retry or a clean degraded response — never a hang.
118
+ 3. **Non-retryable:** Stub a `400`. Assert: **zero** retries, immediate failure, deadline barely consumed.
119
+ 4. **Malformed JSON:** Stub output that fails the schema, then valid on the repair call. Assert: exactly one repair round, valid object returned. Then stub it invalid twice → assert the typed safe default, no thrown exception.
120
+ 5. **Refusal / cutoff:** Stub `stop_reason: "refusal"` → assert fallback model is tried (no same-model retry). Stub `stop_reason: "max_tokens"` → assert truncation is detected, not shipped as complete.
121
+ 6. **Mid-stream drop:** Start a stream, kill the connection after 2 chunks. Assert: the partial is discarded (not rendered/persisted), and retry-or-degrade fires.
122
+ 7. **Circuit breaker:** Force N consecutive failures → assert the circuit opens, subsequent calls return degraded mode **immediately** (no timeout wait), then half-open probes and closes on recovery.
123
+ 8. **Input preservation:** Trigger total failure → assert the user's input is still retrievable/resumable, returned as retryable state, never silently lost.
124
+ 9. **Idempotency/billing:** Assert a fully-streamed-then-errored response is **not** retried (no double charge).
125
+
126
+ Done = fault-injection tests 1–9 pass, every LLM call has an explicit per-attempt timeout + total deadline, retries use full-jitter backoff that honors `Retry-After` and never fires on non-retryable or already-served calls, malformed/refused/truncated output degrades to a typed safe path instead of throwing, the circuit breaker serves degraded mode under a forced outage without paying timeouts, and no failure path loses user input.
@@ -0,0 +1,113 @@
1
+ ---
2
+ name: i18n-localization-setup
3
+ description: Externalizes user-facing text into message catalogs keyed by stable IDs and wires locale-correct rendering — ICU MessageFormat plurals/gender/select, named-placeholder interpolation, Intl/CLDR number/date/list/relative-time formatting, RTL/bidi via logical CSS, and an extract→translate→compile pipeline with pseudo-localization.
4
+ when_to_use: Making a product support multiple languages/locales, or auditing existing i18n — hardcoded UI strings, sentence concatenation, English-only `if(n===1)` plurals, missing RTL, locale-blind number/date formatting, or wiring i18next/react-intl/gettext/Rails i18n/Fluent. Distinct from style-responsive-tailwind (visual layout) and audit-accessibility-wcag (a11y conformance — i18n only owns translatable a11y *attribute text*).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this when text must render correctly in **more than one language/locale**, not just look right:
10
+
11
+ - "Add Spanish/Arabic/Japanese — what's the right way to externalize strings?"
12
+ - "Our plurals break in Polish/Russian" or "we do `count === 1 ? 'item' : 'items'` everywhere"
13
+ - "Dates show as `6/15/2026` for everyone" / numbers keep `,` as the thousands separator even in `de-DE`
14
+ - "Arabic/Hebrew layout is broken — everything's still left-to-right"
15
+ - "Translators can't reorder words — we concatenate `'Deleted ' + n + ' files'`"
16
+ - "Set up the extraction pipeline: extract → PO/XLIFF/JSON → translate → compile" + catch missing keys before ship
17
+ - Auditing an app that's "already i18n'd" for the traps below
18
+
19
+ NOT this skill:
20
+ - Visual responsive layout, breakpoints, container sizing → **style-responsive-tailwind** (i18n owns *logical* CSS props + `dir`, not the design system)
21
+ - WCAG conformance, screen-reader semantics, contrast → **audit-accessibility-wcag** (i18n only owns making `aria-label`/`alt`/`title` *translatable*)
22
+ - `hreflang`, localized URLs, sitemap per-locale, canonical → **audit-technical-seo**
23
+ - Validating/parsing user-entered locale data (phone, postal) → **build-form-validation**
24
+ - Wrapping a single component's copy as you build it → **build-react-component** (use this skill when standing up the *catalog system*)
25
+ - UTC storage, DST, IANA conversion math behind a displayed timestamp → **datetime-timezone-correctness** (i18n only *formats* the instant per locale; it doesn't compute it)
26
+
27
+ ## Steps
28
+
29
+ 1. **Externalize every user-facing string into a catalog keyed by a stable ID — kill concatenation.** A string is translatable if a human ever reads it: labels, buttons, errors, emails, `alt`/`aria-label`/`title`/`placeholder`, `<title>`, push/toast text. Key by **semantic ID**, never by English source (English changes → key shouldn't). Co-locate by feature: `checkout.cart.empty`, not `string_447`.
30
+
31
+ ```jsonc
32
+ // en.json — one message = one full sentence with named placeholders
33
+ { "checkout.items": "{count, plural, one {# item} other {# items}}",
34
+ "profile.greeting": "Welcome back, {name}!" }
35
+ ```
36
+ Never build sentences from fragments. `t('deleted') + ' ' + n + ' ' + t('files')` is **untranslatable** — word order, plural agreement, and gender all vary by language. One key = one whole sentence.
37
+
38
+ 2. **Pluralize with ICU MessageFormat / CLDR categories — never `if (n === 1)`.** English has 2 forms; Arabic has **6** (zero/one/two/few/many/other), Polish/Russian have 4. Provide every category the *target* locale's CLDR rules require; `other` is the mandatory fallback. Same mechanism for gender/choice via `select`. Use `#` for the count (auto-formatted per locale), not `{count}` re-interpolated.
39
+
40
+ | Need | ICU construct | Anti-pattern it replaces |
41
+ |---|---|---|
42
+ | Count agreement | `{n, plural, one {…} few {…} many {…} other {…}}` | `n === 1 ? 'x' : 'xs'` |
43
+ | Ordinals (1st/2nd) | `{n, selectordinal, one {#st} two {#nd} few {#rd} other {#th}}` | string-suffix hacks |
44
+ | Gender / enum | `{g, select, female {…} male {…} other {…}}` | branching in code, concatenating |
45
+ | Money/percent inside text | `{amt, number, ::currency/EUR}` | manual `$` + `toFixed(2)` |
46
+
47
+ Translators supply the categories *their* language needs — don't hardcode the English set into the message shape.
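
To see the CLDR categories that a message must cover, `Intl.PluralRules` can be queried directly. The sketch below is a simplified stand-in for an ICU MessageFormat library, not a replacement: `pluralize` and `categoriesFor` are illustrative names, and it assumes a runtime with full ICU data (the default in current Node.js builds and browsers).

```ts
type PluralCategory = "zero" | "one" | "two" | "few" | "many" | "other";
type PluralForms = Partial<Record<PluralCategory, string>> & { other: string };

// Picks the form for n by the locale's CLDR rule, falling back to `other`,
// and substitutes `#` with the locale-formatted count.
function pluralize(locale: string, n: number, forms: PluralForms): string {
  const category = new Intl.PluralRules(locale).select(n) as PluralCategory;
  const template = forms[category] ?? forms.other;
  return template.replace("#", new Intl.NumberFormat(locale).format(n));
}

// Lists which category each count falls into for a locale.
const categoriesFor = (locale: string, counts: number[]): string[] =>
  counts.map((n) => new Intl.PluralRules(locale).select(n));

console.log(categoriesFor("ar", [0, 1, 2, 5, 11, 100]).join(","));
// zero,one,two,few,many,other
console.log(categoriesFor("pl", [1, 2, 5]).join(","));
// one,few,many
console.log(pluralize("en", 1000, { one: "# item", other: "# items" }));
// 1,000 items
```

The Arabic row shows why an English-shaped `one`/`other` message cannot be translated: the translator needs four more slots.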
48
+
49
+ 3. **Interpolate with named placeholders so translators can reorder.** `"{count} {unit} remaining"` lets a translator emit `"quedan {count} {unit}"`. Positional `{0}`/`%s`/`printf` ordering is fixed and breaks under reordering — use named only. Pass an explicit values object: `t('checkout.items', { count })`. Escape literal braces per ICU (`'{'`). Auto-escape interpolated values for the sink (HTML) to avoid injection.
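
A small sketch of why named placeholders survive reordering, with values escaped for an HTML sink. `interpolate` is an illustrative helper that handles plain `{name}` placeholders only; ICU `plural`/`select` arguments and ICU quote-escaping need a real MessageFormat library.

```ts
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
};

const escapeHtml = (s: string): string =>
  s.replace(/[&<>"']/g, (c) => HTML_ESCAPES[c]);

// Substitutes {name} from an explicit values object; a missing value throws
// so the gap is caught in tests rather than rendered to a user.
function interpolate(template: string, values: Record<string, string | number>): string {
  return template.replace(/\{(\w+)\}/g, (_match, name: string) => {
    if (!(name in values)) throw new Error(`missing value for placeholder "${name}"`);
    return escapeHtml(String(values[name]));
  });
}

console.log(interpolate("{count} {unit} remaining", { count: 3, unit: "files" }));
// 3 files remaining
console.log(interpolate("quedan {count} {unit}", { count: 3, unit: "archivos" }));
// quedan 3 archivos
```

The same call site serves both templates because the translator, not the code, decides where each value goes.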
50
+
51
+ 4. **Format numbers/dates/lists/units via `Intl` (CLDR) in the user's locale — never roll your own.** Locale decides separators (`1,234.5` vs `1.234,5`), date order, AM/PM vs 24h, RTL digit shaping, list conjunctions. Always pass the resolved locale explicitly; relying on the host default is non-deterministic.
52
+
53
+ ```js
54
+ new Intl.NumberFormat(locale, { style: 'currency', currency: 'JPY' }).format(1234) // en-US: "¥1,234"
55
+ new Intl.DateTimeFormat(locale, { dateStyle: 'long' }).format(d) // es: "15 de junio de 2026"
56
+ new Intl.RelativeTimeFormat(locale, { numeric: 'auto' }).format(-1, 'day') // en: "yesterday"
57
+ new Intl.ListFormat(locale, { type: 'conjunction' }).format(['a','b','c']) // en-US: "a, b, and c"
58
+ ```
59
+ `currency` is data, not locale — `de` user paying USD shows `1.234,00 $`. Never store formatted strings; format at render time from raw numbers + ISO/epoch timestamps. Default time storage to **UTC**, convert to the user's IANA timezone for display (the conversion/DST math itself lives in **datetime-timezone-correctness**).
60
+
61
+ 5. **Make layout direction-agnostic: logical CSS + `dir` + bidi isolation.** Set `<html dir="rtl" lang="ar">` from the locale (RTL set: ar, he, fa, ur). Replace physical properties with logical ones so one stylesheet serves both directions:
62
+
63
+ | Physical (breaks RTL) | Logical (correct both) |
64
+ |---|---|
65
+ | `margin-left` / `padding-right` | `margin-inline-start` / `padding-inline-end` |
66
+ | `left` / `right` | `inset-inline-start` / `inset-inline-end` |
67
+ | `text-align: left` | `text-align: start` |
68
+ | `border-left` | `border-inline-start` |
69
+
70
+ Wrap user/dynamic content of unknown direction in `<bdi>` or `unicode-bidi: isolate` so an Arabic username doesn't scramble surrounding LTR punctuation. Mirror directional icons (back/forward arrows) via `[dir=rtl]` or transform; don't mirror logos.
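
Deriving `lang` and `dir` from the resolved locale might look like the sketch below. The RTL set is the short list from step 5, not an exhaustive one, and `dirFor`/`htmlAttrs` are illustrative names. `Intl.Locale` is used only to canonicalize the tag and read its base language.

```ts
// Base languages listed in step 5; extend for other RTL languages you ship.
const RTL_LANGUAGES = new Set(["ar", "he", "fa", "ur"]);

function dirFor(localeTag: string): "rtl" | "ltr" {
  const { language } = new Intl.Locale(localeTag);
  return RTL_LANGUAGES.has(language) ? "rtl" : "ltr";
}

// Values to set on <html> whenever the active locale changes.
function htmlAttrs(localeTag: string): { lang: string; dir: "rtl" | "ltr" } {
  return { lang: new Intl.Locale(localeTag).toString(), dir: dirFor(localeTag) };
}

console.log(htmlAttrs("ar-EG")); // { lang: 'ar-EG', dir: 'rtl' }
console.log(htmlAttrs("en-US")); // { lang: 'en-US', dir: 'ltr' }
```

Keying on the base language means regional variants such as `ar-EG` or `he-IL` need no extra entries.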
71
+
72
+ 6. **Stand up the pipeline: extract → catalog → translate → compile, gated by pseudo-loc + missing-key detection.** Source code is the single source of truth for *keys*; translators own values. Pick the format by toolchain:
73
+
74
+ | Format | Use with | Plurals |
75
+ |---|---|---|
76
+ | **JSON / ICU** | i18next, react-intl/FormatJS | native ICU |
77
+ | **PO/POT** (gettext) | Rails (`gettext`), Python, PHP, C | `Plural-Forms` header (`nplurals`, `plural`) |
78
+ | **XLIFF** | Angular, enterprise TMS handoff | ICU or `<plural>` |
79
+ | **FTL** (Fluent) | Mozilla stack, attribute-rich UI | built-in selectors |
80
+
81
+ Pipeline: (1) `extract` keys from source (`i18next-parser`, `formatjs extract`, `xgettext`) → POT/template; (2) merge into per-locale catalogs without dropping existing translations; (3) translate / push to TMS; (4) `compile` to runtime bundles (`formatjs compile`, `msgfmt`). Generate a **pseudo-locale** (`en-XA`: `[!!! Ŝéàŕçĥ ~~~]`) — accent + ~40% length padding + bracket markers — to surface hardcoded strings, truncation, and concatenation in CI before any human translates. Fail the build on missing keys / unknown ICU vars.
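
The pseudo-locale and missing-key gate can be sketched in a few lines. This is a simplified illustration, not how any particular tool does it: everything inside `{...}` is left untouched, so text nested in ICU plural bodies is not accented, and the accent table covers only a handful of letters.

```ts
const ACCENTS: Record<string, string> = {
  a: "à", c: "ç", e: "é", h: "ĥ", i: "í", o: "ö", r: "ŕ", s: "ŝ", u: "ü",
  A: "À", C: "Ç", E: "É", I: "Í", O: "Ö", S: "Ŝ", U: "Ü",
};

// Accents text outside braces, pads by ~40% of the source length, and wraps
// the result in bracket markers so truncation and concatenation are visible.
function pseudoLocalize(message: string): string {
  let depth = 0;
  let out = "";
  for (const ch of message) {
    if (ch === "{") depth++;
    out += depth === 0 ? (ACCENTS[ch] ?? ch) : ch;
    if (ch === "}") depth = Math.max(0, depth - 1);
  }
  const padding = "~".repeat(Math.ceil(message.length * 0.4));
  return `[!!! ${out} ${padding}]`;
}

// Keys present in the source catalog but absent from a target catalog.
function missingKeys(source: Record<string, string>, target: Record<string, string>): string[] {
  return Object.keys(source).filter((key) => !(key in target)).sort();
}

console.log(pseudoLocalize("Search")); // [!!! Ŝéàŕçĥ ~~~]
```

In CI, generate `en-XA` by mapping `pseudoLocalize` over the source catalog, and fail the build when `missingKeys` returns a non-empty list for any shipped locale.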
82
+
83
+ 7. **Negotiate locale, fall back, and support runtime switching.** Resolve in priority order: explicit user setting → URL/cookie → `Accept-Language` → app default. Match with BCP-47 lookup (`fr-CA` → `fr` → default); never 404 on an unsupported region — fall back to base language, then to source locale. Lazy-load the active locale's bundle (don't ship all 30); switching locale re-renders messages **and** updates `lang`/`dir` on `<html>`. Use a real BCP-47 matcher — `@formatjs/intl-localematcher` (`match()`) or `accept-language-parser` for the header, canonicalized via `Intl.getCanonicalLocales` — never naive string equality (there is no `Intl.LocaleMatcher` global; locale matching is the `localeMatcher` *option* on `Intl` constructors or a library).
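
To illustrate the `fr-CA` → `fr` → default fallback, here is a minimal truncation-based lookup in the spirit of RFC 4647. It is a sketch for understanding the behavior, not a substitute for a maintained matcher such as `@formatjs/intl-localematcher`: it does not parse `Accept-Language` q-values or handle extension subtags specially, and `resolveLocale` is an illustrative name.

```ts
function canonical(tag: string): string | undefined {
  try {
    return Intl.getCanonicalLocales(tag)[0];
  } catch {
    return undefined; // malformed tag from a header or cookie
  }
}

// requested: tags in priority order (user setting, URL/cookie, header, ...).
// supported: locales you actually ship, assumed to be valid BCP-47 tags.
function resolveLocale(requested: string[], supported: string[], fallback: string): string {
  const available = new Map(
    Intl.getCanonicalLocales(supported).map((tag) => [tag.toLowerCase(), tag] as const),
  );
  for (const raw of requested) {
    const tag = canonical(raw);
    if (tag === undefined) continue;
    const parts = tag.split("-");
    while (parts.length > 0) {
      const hit = available.get(parts.join("-").toLowerCase());
      if (hit !== undefined) return hit;
      parts.pop(); // drop the last subtag: fr-CA -> fr
    }
  }
  return fallback;
}

console.log(resolveLocale(["fr-CA"], ["en", "fr"], "en")); // fr
console.log(resolveLocale(["ja"], ["en", "fr"], "en")); // en
```

An unsupported or malformed request always ends at `fallback`, so no input can produce an error page.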
84
+
85
+ ## Common Errors
86
+
87
+ - **`count === 1 ? x : xs`.** Breaks every language with ≠2 plural forms (Arabic, Polish, Russian, Welsh). Use ICU `plural` with CLDR categories.
88
+ - **Sentence concatenation** (`t('sent') + name + t('a_msg')`). Word order/agreement/gender vary; translators can't fix it. One key = one full sentence with named placeholders.
89
+ - **Keying by English source text.** Editing the copy silently orphans the translation. Key by stable semantic ID.
90
+ - **Hand-formatted numbers/dates** (`'$' + n.toFixed(2)`, `MM/DD/YYYY`). Wrong separators/order/currency per locale. Use `Intl.NumberFormat`/`DateTimeFormat` with an explicit locale.
91
+ - **Conflating locale with currency/timezone.** A `de` user can pay in USD in `America/New_York`. Format with the user's *locale* but the transaction's *currency* and the event's *timezone*; store UTC + ISO currency code.
92
+ - **Physical CSS** (`margin-left`, `float: right`). Layout breaks in RTL. Use logical properties + `dir`.
93
+ - **No bidi isolation.** An RTL name/number injected into LTR text reorders adjacent punctuation/brackets. Wrap unknown-direction content in `<bdi>`/`unicode-bidi: isolate`.
94
+ - **Forgetting non-`textContent` text.** `alt`, `aria-label`, `title`, `placeholder`, `<title>`, email subjects, validation messages are all translatable — and untranslated `aria-label` regresses a11y.
95
+ - **No length budget.** German/Finnish run ~35% longer than English; pseudo-loc padding exposes truncation/overflow before translators do.
96
+ - **Locale-blind sort/case.** JS `.sort()` compares UTF-16 code units (`Ä` sorts after `Z`); Turkish `i`↔`İ`/`ı` breaks `toUpperCase()`. Use `Intl.Collator(locale)` for sorting and `toLocaleUpperCase(locale)` for case.
97
+ - **Inventing `Intl.LocaleMatcher`.** No such global exists — locale matching is the `localeMatcher` option on `Intl` constructors or a library (`@formatjs/intl-localematcher`). Don't string-compare BCP-47 tags.
98
+ - **Shipping all locales eagerly / hard 404 on unknown region.** Lazy-load active locale; fall back `region → language → source`, never error.
99
+ - **Pluralizing the count with `#` but re-interpolating `{count}` raw.** `#` is locale-formatted (`1,000`); a separate `{count}` isn't. Use `#` inside `plural`.
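
The sort and case pitfalls in the list above are easy to reproduce. A short sketch, assuming a runtime with full ICU data; `sortLabels` and `upperFor` are illustrative wrappers around the standard `Intl.Collator` and `toLocaleUpperCase` APIs.

```ts
// Sorts a copy of the labels using the locale's collation rules.
function sortLabels(labels: string[], locale: string): string[] {
  return [...labels].sort(new Intl.Collator(locale).compare);
}

const upperFor = (text: string, locale: string): string => text.toLocaleUpperCase(locale);

console.log(["z", "ä", "a"].sort()); // [ 'a', 'z', 'ä' ] (code-unit order)
console.log(sortLabels(["z", "ä", "a"], "de")); // [ 'a', 'ä', 'z' ]
console.log(sortLabels(["z", "ä", "a"], "sv")); // [ 'a', 'z', 'ä' ]
console.log("i".toUpperCase(), upperFor("i", "tr")); // I İ
```

German and Swedish disagree on where `ä` belongs, which is why the collator must receive the resolved locale rather than a host default.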
100
+
101
+ ## Verify
102
+
103
+ 1. **No hardcoded strings:** lint (`eslint-plugin-formatjs` `no-literal-string-in-jsx`, `eslint-plugin-i18next` `no-literal-string`) reports zero user-facing literals outside the catalog.
104
+ 2. **Pseudo-loc pass:** run UI in `en-XA` — every visible string is accented+bracketed (no bare English = no missed key), nothing truncates or overflows, no concatenated fragments appear.
105
+ 3. **Plural matrix:** render the count message at `n = 0,1,2,5,11,100` in `en`, `pl` (4 forms), and `ar` (6 forms); each picks the CLDR-correct category. `if(n===1)` cannot pass this.
106
+ 4. **Reordering:** a target locale that reverses placeholder order renders correctly (proves named, not positional, interpolation).
107
+ 5. **Formatting:** the same number/date/currency/list renders per-locale separators/order (`1,234.5`↔`1.234,5`, `06/15`↔`15/06`, currency symbol placement) — assert against `Intl` golden strings.
108
+ 6. **RTL:** load `ar`/`he` → `<html dir="rtl">`, layout mirrors via logical props, directional icons flip, bidi-isolated names don't scramble punctuation.
109
+ 7. **Missing-key gate:** delete a key from a non-source catalog → CI fails (or falls back to source) — it must never render a raw key like `checkout.items` to a user.
110
+ 8. **Negotiation:** `Accept-Language: fr-CA` with only `fr` available resolves to `fr` (not default/404) via a real matcher; switching locale at runtime updates messages **and** `lang`/`dir`.
111
+ 9. **Sort/case:** a localized list sorts via `Intl.Collator(locale)` (e.g. Swedish `å/ä/ö` last); Turkish case round-trips with `toLocaleUpperCase('tr')`.
112
+
113
+ Done = zero hardcoded user-facing strings, pseudo-loc clean, the plural matrix passes for a 4-form and a 6-form locale, RTL renders with logical CSS + bidi isolation, all formatting goes through `Intl` with explicit locale, locale negotiation uses a real BCP-47 matcher, and CI fails on any missing key or unknown ICU variable.
@@ -0,0 +1,107 @@
1
+ ---
2
+ name: idempotency-keys
3
+ description: Makes operations safe to repeat so retries and at-least-once delivery don't double-charge or double-create — idempotency by design first (PUT/upsert, conditional writes with version/ETag/If-Match, natural deterministic keys, set-don't-increment) and by key second (client Idempotency-Key header, a dedup table keyed unique on the key that stores request fingerprint + status + response and replays the SAME response, 409 in-progress lock for concurrent duplicates, 422 on key-reuse-with-different-body), plus consumer-side dedup (processed-event-id store / dedup window), the outbox pattern for atomic write+publish, and DB mechanics (ON CONFLICT, SELECT FOR UPDATE / advisory locks). Effectively-once via dedup, because exactly-once delivery is a myth.
4
+ when_to_use: An operation can run more than once and must not have double effects — a POST that creates/charges behind a client/proxy/SDK retry, an at-least-once queue or webhook consumer that may redeliver, a job that may run twice, or you're adding an Idempotency-Key header or a dedup table. Distinct from resilience-timeouts-retries (decides WHEN/how to retry; this skill makes the target safe to retry into) and deliver-webhooks (the sender side — at-least-once delivery + signed retries; this skill is what makes the receiver safe under that redelivery).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the same operation may execute more than once and a second execution must NOT produce a second effect:
10
+
11
+ - "A POST timed out, the client retried, and we charged/created twice"
12
+ - "Our SDK/proxy/load balancer retries — make the create idempotent"
13
+ - "Add an Idempotency-Key header so replays return the original response"
14
+ - "The queue/webhook is at-least-once; the consumer ran the same event twice"
15
+ - "Make this job safe to run twice" / "dedup redelivered events"
16
+ - "Atomically write a row AND publish an event without dual-write loss" (outbox)
17
+
18
+ NOT this skill:
19
+ - Deciding the retry policy itself — backoff, jitter, retry budget, circuit breaker, which errors are retryable → resilience-timeouts-retries (it generates the duplicate calls; this skill absorbs them safely)
20
+ - The webhook *sender*: at-least-once dispatch, signing, retry schedule, DLQ for failed deliveries → deliver-webhooks
21
+ - The webhook *receiver's* signature/replay-window verification (HMAC over raw body, timestamp window) → ingest-webhook-secure (this skill is the dedup-on-event-id half it hands off to)
22
+ - Building the queue/worker, DLQ, poison-message handling → message-queue-jobs (this skill specifies the *idempotent consumer* it needs)
23
+ - Idempotent PSP charges + subscription/proration/ledger reconciliation → payments-billing-integration (it owns billing state and calls this skill's key pattern for money-mutating calls)
24
+ - The rounding/allocation math of the amounts → money-decimal-arithmetic
25
+
26
+ ## Steps
27
+
28
+ 1. **Make it idempotent BY DESIGN before reaching for a key — that's cheaper and self-healing.** A surprising amount of "double effect" disappears if the operation is naturally repeatable:
29
+
30
+ | Technique | How | Why it's idempotent |
31
+ |---|---|---|
32
+ | **PUT / upsert** to a client-chosen id | `PUT /orders/{client_uuid}` → `INSERT ... ON CONFLICT (id) DO NOTHING/UPDATE` | second call hits the same row, no new row |
33
+ | **Conditional write** (optimistic concurrency) | `If-Match: <etag>` / `WHERE version = N` → bump version | stale retry's precondition fails → no double-apply |
34
+ | **Natural / deterministic key** | derive id from stable inputs (`hash(order_id+sku)`, not `uuid()`) | same inputs → same id → conflict, not insert |
35
+ | **Set, don't increment** | `balance = 100` not `balance += 10`; `status = 'paid'` | reapplying the same set is a no-op |
36
+ | **DELETE / "ensure absent"** | delete-by-id, "cancel if active" | already-gone is success, not error |
37
+
38
+ Increments, "append a row", and server-generated ids on POST are the *non*-idempotent shapes that force you to step 2.
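As a rough illustration of the table above, the sketch below combines a deterministic key, an upsert, and set-don't-increment using Python's built-in `sqlite3`. The `order_lines` table and the `line_id` / `put_line` helpers are hypothetical names invented for this example; they are not part of any real schema.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines ("
    "id TEXT PRIMARY KEY, order_id TEXT, sku TEXT, qty INTEGER)"
)


def line_id(order_id: str, sku: str) -> str:
    """Deterministic id: the same inputs always hash to the same key."""
    return hashlib.sha256(f"{order_id}:{sku}".encode()).hexdigest()


def put_line(order_id: str, sku: str, qty: int) -> None:
    """Upsert that SETS qty (never increments), so a replay changes nothing."""
    with conn:  # one transaction per call
        conn.execute(
            "INSERT INTO order_lines (id, order_id, sku, qty) VALUES (?, ?, ?, ?) "
            "ON CONFLICT (id) DO UPDATE SET qty = excluded.qty",
            (line_id(order_id, sku), order_id, sku, qty),
        )


put_line("o-1", "sku-9", 3)
put_line("o-1", "sku-9", 3)  # simulated retry of the same logical operation
rows = conn.execute("SELECT order_id, sku, qty FROM order_lines").fetchall()
```

After the simulated retry the table still holds a single row, because the second call conflicts on the derived primary key and re-sets the same value.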
39
+
40
+ 2. **For non-idempotent POSTs, use a client-supplied Idempotency-Key.** The client (not the server, not per-retry) generates ONE key for a logical operation and sends it on the original request AND every retry — header `Idempotency-Key: <opaque-uuid>`. The key must be **stable across retries and unique per operation**: generate it once before the first send, store it with the in-flight request, reuse it on retry. This is the Stripe model and the reference semantics to copy.
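A minimal client-side sketch of that rule, assuming a hypothetical `send(payload, headers)` transport callable (not a real SDK function): the key is created once, outside the retry loop, and reused on every attempt.

```python
import uuid


def send_with_retries(send, payload, max_attempts=3):
    """Retry on timeout, reusing ONE Idempotency-Key for the whole operation."""
    key = str(uuid.uuid4())  # generated once, before the first attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(payload, headers={"Idempotency-Key": key})
        except TimeoutError as exc:
            last_error = exc  # same key goes out again on the next attempt
    raise last_error


# Fake transport for illustration: times out twice, then succeeds.
seen_keys = []


def flaky_send(payload, headers):
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) < 3:
        raise TimeoutError("simulated timeout")
    return {"status": 201}


result = send_with_retries(flaky_send, {"amount": 100})
```

All three attempts carry the same header value, which is what lets the server recognize them as one logical operation.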
41
+
42
+ 3. **Persist the key with a dedup table — fingerprint, status, and the stored response.** One row per key:
43
+
44
+ ```sql
45
+ CREATE TABLE idempotency_keys (
46
+ id_key text NOT NULL,
47
+ scope text NOT NULL, -- e.g. (user_id || ':' || endpoint)
48
+ request_hash text NOT NULL, -- SHA-256 of canonical request body+route
49
+ status text NOT NULL, -- 'in_progress' | 'completed'
50
+ response_code int,
51
+ response_body jsonb,
52
+ created_at timestamptz NOT NULL DEFAULT now(),
53
+ expires_at timestamptz NOT NULL, -- TTL: now() + 24h..7d
54
+ PRIMARY KEY (scope, id_key) -- UNIQUE on (scope, key)
55
+ );
56
+ ```
57
+ Scope the key per `(user/tenant, endpoint)` so one client's key can't collide with or replay another's. Retention 24h–7d (Stripe = 24h); a TTL/cron purges expired rows so storage is bounded.
58
+
59
+ 4. **Server flow — claim, execute, store, replay — atomically.** On each request:
60
+ 1. Compute `request_hash` from the canonical (sorted/normalized) body + method + route.
61
+ 2. **Atomically claim** the key: `INSERT INTO idempotency_keys (scope, id_key, request_hash, status, expires_at) VALUES (..., 'in_progress', ...) ON CONFLICT (scope, id_key) DO NOTHING`. The insert *is* the lock.
62
+ 3. **If the insert won** (0 conflicts): run the real operation, then `UPDATE ... SET status='completed', response_code, response_body` and return the response.
63
+ 4. **If it conflicted**, read the existing row:
64
+ - `status='completed'` **and** `request_hash` matches → return the **stored** `response_code`/`response_body` verbatim (the replay path — same result, no re-execution).
65
+ - `status='in_progress'` → a concurrent duplicate is still running → return **409 Conflict** (or `425`-style "in progress"); the client should retry-after, not re-execute.
66
+ - `request_hash` **differs** (same key, different body), whatever the stored `status` → return **422 Unprocessable Entity**: the key was reused for a different operation, so never run it.
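The claim / execute / store / replay flow above can be sketched with Python's built-in `sqlite3`. This is an illustration under stated assumptions, not a reference implementation: the table is a trimmed version of the one in step 3 (no `expires_at`), `charges` is a hypothetical stand-in for the real side effect, and the claim is committed on its own so an `in_progress` row is visible to duplicates, which means crash recovery between execute and store is not handled here.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE idempotency_keys (
           scope TEXT NOT NULL, id_key TEXT NOT NULL, request_hash TEXT NOT NULL,
           status TEXT NOT NULL, response_code INTEGER, response_body TEXT,
           PRIMARY KEY (scope, id_key))"""
)
charges = []  # hypothetical stand-in for the real side effect


def request_hash(method, route, body):
    """SHA-256 over method + route + canonical (key-sorted, compact) JSON body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{method} {route} {canonical}".encode()).hexdigest()


def handle(scope, id_key, method, route, body):
    fingerprint = request_hash(method, route, body)
    with conn:  # commit the claim so a concurrent duplicate sees 'in_progress'
        cur = conn.execute(
            "INSERT INTO idempotency_keys (scope, id_key, request_hash, status) "
            "VALUES (?, ?, ?, 'in_progress') ON CONFLICT (scope, id_key) DO NOTHING",
            (scope, id_key, fingerprint),
        )
    if cur.rowcount == 0:  # lost the claim: inspect the existing row
        stored_hash, status, code, stored_body = conn.execute(
            "SELECT request_hash, status, response_code, response_body "
            "FROM idempotency_keys WHERE scope = ? AND id_key = ?",
            (scope, id_key),
        ).fetchone()
        if stored_hash != fingerprint:
            return 422, {"error": "Idempotency-Key reused with a different request"}
        if status == "in_progress":
            return 409, {"error": "original request still in progress"}
        return code, json.loads(stored_body)  # replay the stored response

    # Won the claim: run the operation once, then record its response.
    charges.append(body["amount"])
    code, response = 201, {"charge_id": len(charges), "amount": body["amount"]}
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'completed', response_code = ?, "
            "response_body = ? WHERE scope = ? AND id_key = ?",
            (code, json.dumps(response), scope, id_key),
        )
    return code, response


first = handle("user-1:/charges", "key-abc", "POST", "/charges", {"amount": 100})
replay = handle("user-1:/charges", "key-abc", "POST", "/charges", {"amount": 100})
reused = handle("user-1:/charges", "key-abc", "POST", "/charges", {"amount": 999})
```

In this sketch the fingerprint is compared before the status, so a reused key with a different body gets 422 even while the original is still in progress; that ordering is a design choice of the example.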
67
+
68
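+ 5. **Hold a lock for the in-flight window so concurrent duplicates don't both execute.** The `INSERT ... ON CONFLICT DO NOTHING` claim handles most of it, because the unique index serializes the competing inserts. If you read-then-write instead, two parallel retries can both see "no row" and both run. `SELECT ... FOR UPDATE` cannot prevent that on its own, since it only locks a row that already exists; use it to serialize work on an existing key row, and take a Postgres advisory lock `pg_advisory_xact_lock(hashtext(scope||id_key))` around the whole claim+execute when the row may not exist yet. Wrap claim + business write + result-store in **one transaction** (or make the business write itself idempotent via step 1) so a crash between execute and store doesn't lose the recorded response. Note the trade-off: inside a single transaction a concurrent duplicate blocks on the uncommitted claim and then takes the replay path, so the `409` branch of step 4 is only observable when the `in_progress` claim is committed before the operation runs.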
69
+
70
+ 6. **Consumer-side dedup for at-least-once queues and webhooks — "exactly-once delivery" is a myth; you get effectively-once.** Brokers (SQS, Kafka, RabbitMQ) and webhook senders redeliver on ack timeout, so the *consumer* must be idempotent. Two patterns:
71
+ - **Processed-event-id store:** a `processed_events(event_id PRIMARY KEY, processed_at)` table. Before handling, `INSERT ... ON CONFLICT DO NOTHING`; if 0 rows inserted, it's a duplicate → ack and skip. Dedup on the **provider's stable event id** (not your own per-receipt uuid). Pairs with ingest-webhook-secure / message-queue-jobs.
72
+ - **Dedup window:** a bounded TTL set (Redis `SET key value NX EX <window>`) when full history is too large; only safe if redelivery is bounded within the window.
73
+
74
+ Best of all: make the *handler* naturally idempotent (step 1: upsert by event-derived key, set-don't-increment) so even a missed dedup is harmless.
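The processed-event-id pattern above might look like the following `sqlite3` sketch. The `accounts` table, the event shape, and the `consume` helper are assumptions made for illustration. The dedup insert and the business effect share one transaction, so the event id is recorded only if the effect commits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events ("
    "event_id TEXT PRIMARY KEY, processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts (id, balance) VALUES ('acct-1', 0)")
conn.commit()


def consume(event) -> bool:
    """Return True if the event was handled, False if it was a duplicate.

    The caller acks the message only after this function returns.
    """
    with conn:  # dedup row and business effect commit (or roll back) together
        cur = conn.execute(
            "INSERT INTO processed_events (event_id) VALUES (?) "
            "ON CONFLICT (event_id) DO NOTHING",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # already processed: skip the effect, still ack
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (event["amount"], event["account"]),
        )
    return True


event = {"id": "evt_123", "account": "acct-1", "amount": 10}
results = [consume(event), consume(event)]  # second call simulates redelivery
balance = conn.execute("SELECT balance FROM accounts WHERE id = 'acct-1'").fetchone()[0]
```

Even though the handler itself is an increment, the redelivered event is skipped, so the balance reflects a single application.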
75
+
76
+ 7. **Atomic write + publish → outbox pattern (no dual-write).** Writing the DB row and publishing the event as two separate calls can crash between them (row saved, event lost — or vice versa). Instead, in **one transaction** write the business row AND an `outbox` row; a separate relay polls/CDC-tails the outbox and publishes (at-least-once → consumers dedup per step 6). The transaction guarantees the event is recorded iff the state changed.
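A small `sqlite3` sketch of the outbox idea, with hypothetical `orders` / `outbox` tables and a `publish` callable standing in for a real broker client. The business row and the outbox row are written in one transaction; a separate relay pass publishes unpublished rows and marks them afterwards, which gives at-least-once publishing.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, event_id TEXT UNIQUE NOT NULL, "
    "payload TEXT NOT NULL, published_at TEXT)"
)


def place_order(order_id: str) -> None:
    """Write the order and its event together: both commit or neither does."""
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (
                f"order-placed:{order_id}",
                json.dumps({"type": "order.placed", "order_id": order_id}),
            ),
        )


def relay_once(publish) -> int:
    """Publish every unpublished outbox row in order; return how many were found."""
    rows = conn.execute(
        "SELECT seq, event_id, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        # If publish raises, the row stays unpublished and a later pass retries it.
        publish(event_id, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE seq = ?",
                (seq,),
            )
    return len(rows)


published = []
place_order("o-1")
first_pass = relay_once(lambda event_id, payload: published.append(event_id))
second_pass = relay_once(lambda event_id, payload: published.append(event_id))
```

Because a crash between `publish` and the `UPDATE` would cause the same event to be published again, the stable `event_id` is what consumers dedup on.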
77
+
78
+ 8. **DB mechanics cheat-sheet.**
79
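+ - Postgres/SQLite: `INSERT ... ON CONFLICT (cols) DO NOTHING` (claim/dedup) or `DO UPDATE SET ...` (upsert). MySQL: `INSERT ... ON DUPLICATE KEY UPDATE`. The **UNIQUE constraint/index is what makes it safe**: Postgres and SQLite reject an `ON CONFLICT (cols)` target that matches no unique index, while a target-less `ON CONFLICT DO NOTHING`, or MySQL's `ON DUPLICATE KEY UPDATE` on a table with no unique key, inserts every time and silently doesn't dedup.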
80
+ - Serialize the in-flight window with `SELECT ... FOR UPDATE` (row) or `pg_advisory_xact_lock` (cross-row/logical) — released at transaction end.
81
+ - Check the *result* of the upsert (rows affected, or the Postgres-specific `RETURNING (xmax = 0)` on a `DO UPDATE`) to know whether you inserted or hit an existing row.
82
+
83
+ ## Common Errors
84
+
85
+ - **Generating the key per-retry (`uuid()` / `now()` inside the retry loop).** Every attempt gets a fresh key → zero dedup → still double-charges. Fix: generate ONCE before the first send; reuse the identical key on every retry.
86
+ - **No request-fingerprint check.** Same key replayed with a *different* body silently returns the old response (or runs the new op). Fix: store `request_hash`; on mismatch return 422, never execute.
87
+ - **Racing duplicates with no lock.** Two parallel retries both `SELECT` (no row), both execute, both insert. Fix: atomic `INSERT ... ON CONFLICT DO NOTHING` as the claim, or `FOR UPDATE` / advisory lock around read-modify-write.
88
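+ - **`ON CONFLICT` / upsert without a UNIQUE index on the key.** A target-less `ON CONFLICT DO NOTHING` or MySQL `ON DUPLICATE KEY UPDATE` never conflicts → no dedup, duplicate rows; Postgres/SQLite with an explicit conflict target fail with an error instead. Fix: enforce a unique constraint on `(scope, id_key)` (or the natural key).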
89
+ - **Unbounded key storage.** The dedup table grows forever. Fix: `expires_at` + a purge job; pick 24h–7d retention.
90
+ - **Treating a non-idempotent op as idempotent.** Retrying `balance += 10` or "append row" doubles the effect even *with* a key if you don't replay the stored response. Fix: replay the stored response on hit; or redesign to set-don't-increment (step 1).
91
+ - **Recording the result in a separate step from the business write.** Crash in between → next retry re-executes a completed op. Fix: same transaction, or idempotent business write so re-execution is a no-op.
92
+ - **Believing the broker gives exactly-once.** "Exactly-once delivery" doesn't exist over a network; redelivery happens. Fix: idempotent consumer + processed-event-id dedup = effectively-once.
93
+ - **Dual-write (DB then publish, two calls).** A crash loses one side. Fix: outbox in the same transaction + a relay.
94
+ - **Acking before the work is durable.** Ack-then-process loses the message on a crash. Fix: process (idempotently) and commit, *then* ack.
95
+
96
+ ## Verify
97
+
98
+ 1. **Duplicate POST is a no-op:** send the same request with the same `Idempotency-Key` twice → exactly one effect (one charge/row) and the second response is byte-identical to the first.
99
+ 2. **Concurrent duplicates:** fire N parallel requests with the same key → exactly one executes; the rest get the stored response or `409 in-progress`, never a second effect. (This is the race test — run it against the real shared store.)
100
+ 3. **Key reuse, different body:** same key + changed payload → `422`, and no operation runs.
101
+ 4. **Per-retry-key bug guard:** confirm the client generates the key once and reuses it (grep the retry path for `uuid()`/`now()` *inside* the loop).
102
+ 5. **Consumer redelivery:** deliver the same event id to the queue/webhook consumer twice → handled once (processed-events insert conflicts on the second); effect is identical to single delivery.
103
+ 6. **By-design ops:** issue the same `PUT`/upsert / conditional write twice → one row, version advances once; a stale `If-Match` retry is rejected, not double-applied.
104
+ 7. **Outbox atomicity:** kill the process between the business write and publish → on restart the relay still publishes (event recorded iff state changed); no orphan event, no lost event.
105
+ 8. **Retention bounded:** expired keys are purged; an old key past TTL behaves as a fresh request (documented), and the table doesn't grow without bound.
106
+
107
+ Done = duplicate and concurrent requests produce exactly one effect with an identical replayed response, same-key/different-body returns 422, the in-flight window is locked, consumers dedup at-least-once delivery on stable event ids, write+publish is atomic via the outbox, and key storage is TTL-bounded, all proven by checks 1–8 above (including the parallel and redelivery tests).