sanook-cli 0.4.0 → 0.5.0

Files changed (235)
  1. package/.env.example +19 -0
  2. package/CHANGELOG.md +144 -0
  3. package/README.md +153 -20
  4. package/README.th.md +136 -0
  5. package/dist/agentContext.js +4 -0
  6. package/dist/approval.js +6 -0
  7. package/dist/bin.js +394 -51
  8. package/dist/brain.js +92 -59
  9. package/dist/brand.js +47 -0
  10. package/dist/checkpoint.js +37 -0
  11. package/dist/commands.js +86 -6
  12. package/dist/compaction.js +76 -5
  13. package/dist/config.js +100 -12
  14. package/dist/cost.js +60 -3
  15. package/dist/doctor.js +92 -0
  16. package/dist/gateway/auth.js +2 -2
  17. package/dist/gateway/ledger.js +2 -2
  18. package/dist/gateway/scheduler.js +1 -0
  19. package/dist/gateway/serve.js +6 -4
  20. package/dist/gateway/server.js +10 -2
  21. package/dist/git.js +11 -2
  22. package/dist/hooks.js +43 -17
  23. package/dist/knowledge.js +48 -49
  24. package/dist/loop.js +182 -66
  25. package/dist/lsp/client.js +173 -0
  26. package/dist/lsp/framing.js +56 -0
  27. package/dist/lsp/index.js +138 -0
  28. package/dist/lsp/servers.js +82 -0
  29. package/dist/mcp-server.js +244 -0
  30. package/dist/mcp.js +184 -29
  31. package/dist/memory-store.js +559 -0
  32. package/dist/memory.js +143 -29
  33. package/dist/orchestrate.js +150 -0
  34. package/dist/providers/codex.js +2 -2
  35. package/dist/providers/keys.js +3 -2
  36. package/dist/providers/registry.js +133 -1
  37. package/dist/repomap.js +93 -0
  38. package/dist/search/chunk.js +158 -0
  39. package/dist/search/embed-store.js +187 -0
  40. package/dist/search/engine.js +203 -0
  41. package/dist/search/fuse.js +35 -0
  42. package/dist/search/index-core.js +187 -0
  43. package/dist/search/indexer.js +241 -0
  44. package/dist/search/store.js +77 -0
  45. package/dist/session.js +42 -8
  46. package/dist/skill-install.js +10 -10
  47. package/dist/skills.js +12 -9
  48. package/dist/summarize.js +31 -0
  49. package/dist/tools/bash.js +21 -2
  50. package/dist/tools/diagnostics.js +41 -0
  51. package/dist/tools/edit.js +29 -7
  52. package/dist/tools/index.js +8 -1
  53. package/dist/tools/list.js +7 -2
  54. package/dist/tools/permission.js +90 -9
  55. package/dist/tools/read.js +23 -4
  56. package/dist/tools/remember.js +1 -1
  57. package/dist/tools/sandbox.js +61 -0
  58. package/dist/tools/search.js +105 -4
  59. package/dist/tools/task.js +195 -29
  60. package/dist/tools/timeout.js +35 -0
  61. package/dist/tools/util.js +10 -0
  62. package/dist/tools/write.js +6 -4
  63. package/dist/trust.js +89 -0
  64. package/dist/ui/app.js +218 -27
  65. package/dist/ui/banner.js +4 -9
  66. package/dist/ui/history.js +30 -0
  67. package/dist/ui/mentions.js +44 -0
  68. package/dist/ui/setup.js +6 -5
  69. package/dist/ui/useEditor.js +83 -0
  70. package/dist/update.js +114 -0
  71. package/dist/worktree.js +173 -0
  72. package/package.json +11 -5
  73. package/scripts/postinstall.mjs +33 -0
  74. package/second-brain/.agents/_Index.md +30 -0
  75. package/second-brain/.agents/skills/_Index.md +30 -0
  76. package/second-brain/.agents/workflows/_Index.md +30 -0
  77. package/second-brain/AGENTS.md +4 -4
  78. package/second-brain/Acceptance/_Index.md +30 -0
  79. package/second-brain/Acceptance/golden-case-template.md +39 -0
  80. package/second-brain/Areas/_Index.md +30 -0
  81. package/second-brain/Bugs/System-OS/_Index.md +30 -0
  82. package/second-brain/Bugs/_Index.md +30 -0
  83. package/second-brain/CLAUDE.md +4 -1
  84. package/second-brain/Checklists/_Index.md +30 -0
  85. package/second-brain/Checklists/preflight-postflight-template.md +29 -0
  86. package/second-brain/Distillations/_Index.md +30 -0
  87. package/second-brain/Entities/_Index.md +30 -0
  88. package/second-brain/Entities/entity-template.md +33 -0
  89. package/second-brain/Evals/_Index.md +30 -0
  90. package/second-brain/Evals/correction-pairs.md +24 -0
  91. package/second-brain/Evals/failure-taxonomy.md +24 -0
  92. package/second-brain/Evals/golden-set.md +25 -0
  93. package/second-brain/Evals/quality-ledger.md +23 -0
  94. package/second-brain/Evals/self-eval-rubric.md +23 -0
  95. package/second-brain/GEMINI.md +4 -4
  96. package/second-brain/Goals/_Index.md +30 -0
  97. package/second-brain/Handoffs/_Index.md +30 -0
  98. package/second-brain/Home.md +7 -0
  99. package/second-brain/Intake/Raw Sources/_Index.md +30 -0
  100. package/second-brain/Intake/_Index.md +30 -0
  101. package/second-brain/Intake/_Quarantine/_Index.md +30 -0
  102. package/second-brain/Learning/_Index.md +30 -0
  103. package/second-brain/Playbooks/_Index.md +30 -0
  104. package/second-brain/Playbooks/playbook-template.md +23 -0
  105. package/second-brain/Projects/_Index.md +30 -0
  106. package/second-brain/Prompts/_Index.md +30 -0
  107. package/second-brain/README.md +2 -1
  108. package/second-brain/Research/_Index.md +30 -0
  109. package/second-brain/Retrospectives/_Index.md +30 -0
  110. package/second-brain/Reviews/_Index.md +30 -0
  111. package/second-brain/Runbooks/_Index.md +30 -0
  112. package/second-brain/Runbooks/eval-loop.md +24 -0
  113. package/second-brain/Sessions/_Index.md +30 -0
  114. package/second-brain/Shared/AI-Context-Index.md +20 -0
  115. package/second-brain/Shared/AI-Threads/_Index.md +30 -0
  116. package/second-brain/Shared/Archive/_Index.md +30 -0
  117. package/second-brain/Shared/Assets/_Index.md +30 -0
  118. package/second-brain/Shared/Context-Packs/_Index.md +30 -0
  119. package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
  120. package/second-brain/Shared/Coordination/NOW.md +28 -0
  121. package/second-brain/Shared/Coordination/_Index.md +30 -0
  122. package/second-brain/Shared/Coordination/agent-registry.md +24 -0
  123. package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
  124. package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
  125. package/second-brain/Shared/Coordination/task-board.md +32 -0
  126. package/second-brain/Shared/Core-Facts/_Index.md +30 -0
  127. package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
  128. package/second-brain/Shared/Glossary/_Index.md +30 -0
  129. package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
  130. package/second-brain/Shared/Operating-State/_Index.md +30 -0
  131. package/second-brain/Shared/Prompting/_Index.md +30 -0
  132. package/second-brain/Shared/Provenance/_Index.md +30 -0
  133. package/second-brain/Shared/Rules/_Index.md +30 -0
  134. package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
  135. package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
  136. package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
  137. package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
  138. package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
  139. package/second-brain/Shared/Rules/rules-formatting.md +34 -0
  140. package/second-brain/Shared/Scripts/_Index.md +30 -0
  141. package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
  142. package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
  143. package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
  144. package/second-brain/Shared/User-Memory/_Index.md +30 -0
  145. package/second-brain/Shared/User-Persona/_Index.md +30 -0
  146. package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
  147. package/second-brain/Shared/Working-Memory/_Index.md +30 -0
  148. package/second-brain/Shared/_Index.md +30 -0
  149. package/second-brain/Shared/mcp-servers/_Index.md +30 -0
  150. package/second-brain/Skills/_Index.md +30 -0
  151. package/second-brain/Templates/_Index.md +30 -0
  152. package/second-brain/Templates/bug.md +2 -0
  153. package/second-brain/Templates/handoff.md +2 -0
  154. package/second-brain/Templates/session.md +2 -0
  155. package/second-brain/Tools/_Index.md +30 -0
  156. package/second-brain/Traces/_Index.md +30 -0
  157. package/second-brain/Vault Structure Map.md +33 -1
  158. package/second-brain/copilot/_Index.md +30 -0
  159. package/skills/audit-license-compliance/SKILL.md +117 -0
  160. package/skills/author-codemod/SKILL.md +110 -0
  161. package/skills/build-audit-logging/SKILL.md +112 -0
  162. package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
  163. package/skills/build-cli-tool/SKILL.md +108 -0
  164. package/skills/build-data-table/SKILL.md +141 -0
  165. package/skills/build-native-mobile-ui/SKILL.md +154 -0
  166. package/skills/build-offline-first-sync/SKILL.md +118 -0
  167. package/skills/build-realtime-channel/SKILL.md +122 -0
  168. package/skills/build-vector-search/SKILL.md +131 -0
  169. package/skills/compose-local-dev-stack/SKILL.md +149 -0
  170. package/skills/configure-bundler-build/SKILL.md +166 -0
  171. package/skills/configure-dns-tls/SKILL.md +142 -0
  172. package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
  173. package/skills/configure-security-headers-csp/SKILL.md +122 -0
  174. package/skills/contract-testing/SKILL.md +140 -0
  175. package/skills/datetime-timezone-correctness/SKILL.md +125 -0
  176. package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
  177. package/skills/debug-flaky-tests/SKILL.md +128 -0
  178. package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
  179. package/skills/deliver-webhooks/SKILL.md +116 -0
  180. package/skills/design-api-pagination/SKILL.md +144 -0
  181. package/skills/design-authorization-model/SKILL.md +119 -0
  182. package/skills/design-backup-dr-recovery/SKILL.md +113 -0
  183. package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
  184. package/skills/design-multi-tenancy/SKILL.md +100 -0
  185. package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
  186. package/skills/design-relational-schema/SKILL.md +129 -0
  187. package/skills/design-search-index-infra/SKILL.md +151 -0
  188. package/skills/design-state-machine/SKILL.md +108 -0
  189. package/skills/design-token-system/SKILL.md +109 -0
  190. package/skills/distributed-locks-leases/SKILL.md +120 -0
  191. package/skills/encrypt-sensitive-data/SKILL.md +148 -0
  192. package/skills/feature-flags-rollout/SKILL.md +130 -0
  193. package/skills/file-upload-object-storage/SKILL.md +107 -0
  194. package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
  195. package/skills/harden-llm-app-reliability/SKILL.md +126 -0
  196. package/skills/i18n-localization-setup/SKILL.md +113 -0
  197. package/skills/idempotency-keys/SKILL.md +107 -0
  198. package/skills/implement-push-notifications/SKILL.md +142 -0
  199. package/skills/ingest-webhook-secure/SKILL.md +120 -0
  200. package/skills/integrate-oauth-oidc/SKILL.md +126 -0
  201. package/skills/load-stress-test/SKILL.md +129 -0
  202. package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
  203. package/skills/model-nosql-data/SKILL.md +118 -0
  204. package/skills/money-decimal-arithmetic/SKILL.md +123 -0
  205. package/skills/monitor-ml-drift/SKILL.md +109 -0
  206. package/skills/numeric-precision-units/SKILL.md +144 -0
  207. package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
  208. package/skills/optimize-react-rerenders/SKILL.md +124 -0
  209. package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
  210. package/skills/payments-billing-integration/SKILL.md +114 -0
  211. package/skills/pin-toolchain-versions/SKILL.md +116 -0
  212. package/skills/plan-strangler-migration/SKILL.md +95 -0
  213. package/skills/property-based-testing/SKILL.md +108 -0
  214. package/skills/publish-package-registry/SKILL.md +130 -0
  215. package/skills/recover-git-state/SKILL.md +119 -0
  216. package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
  217. package/skills/resilience-timeouts-retries/SKILL.md +104 -0
  218. package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
  219. package/skills/rewrite-git-history/SKILL.md +109 -0
  220. package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
  221. package/skills/schema-evolution-compatibility/SKILL.md +121 -0
  222. package/skills/send-transactional-email/SKILL.md +126 -0
  223. package/skills/serve-deploy-ml-model/SKILL.md +107 -0
  224. package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
  225. package/skills/setup-devcontainer-env/SKILL.md +131 -0
  226. package/skills/setup-lint-format-precommit/SKILL.md +140 -0
  227. package/skills/setup-monorepo-tooling/SKILL.md +125 -0
  228. package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
  229. package/skills/structured-output-llm/SKILL.md +86 -0
  230. package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
  231. package/skills/test-data-factories/SKILL.md +158 -0
  232. package/skills/threat-model-stride/SKILL.md +123 -0
  233. package/skills/train-evaluate-ml-model/SKILL.md +109 -0
  234. package/skills/unicode-text-correctness/SKILL.md +109 -0
  235. package/skills/visual-regression-testing/SKILL.md +120 -0
@@ -0,0 +1,122 @@
1
+ ---
2
+ name: configure-security-headers-csp
3
+ description: Configures HTTP response security headers and a strict, nonce/hash-based Content-Security-Policy — script-src with a per-request nonce or sha256 hash plus 'strict-dynamic' (so you can drop host allowlists and 'unsafe-inline'), object-src 'none', base-uri 'none', frame-ancestors to control framing, a Report-Only rollout via report-to/report-uri before enforcing, plus HSTS with includeSubDomains+preload, X-Content-Type-Options: nosniff, Referrer-Policy, a deny-by-default Permissions-Policy, correct CORS (echo a single allowed origin, never wildcard '*' together with Access-Control-Allow-Credentials), and cookie flags Secure+HttpOnly+SameSite. Eliminates inline-script XSS sinks, clickjacking, MIME-sniffing, mixed content, and credentialed-CORS leaks by policy rather than per-bug patching.
4
+ when_to_use: Hardening a web app's HTTP responses — adding or tightening CSP, fixing a console "Refused to execute inline script" after enabling CSP, rolling out HSTS/preload, setting frame-ancestors/Referrer-Policy/Permissions-Policy, or getting CORS and cookie flags right. Distinct from remediate-web-vulnerabilities (finds and fixes a specific bug like a reflected XSS or open redirect; this skill sets the defense-in-depth headers that contain whole bug classes) and setup-cdn-edge-waf (the CDN/WAF edge layer that can inject or override these headers; this skill defines the header values that layer should serve).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the task is **setting HTTP response headers and CSP as defense-in-depth policy**, not chasing one specific vulnerability:
10
+
11
+ - "Add a Content-Security-Policy" / "our CSP uses 'unsafe-inline' — make it strict"
12
+ - "After turning on CSP the page broke: Refused to execute inline script / Refused to load the stylesheet"
13
+ - "Enable HSTS / submit the domain to the preload list"
14
+ - "Stop the site from being framed / set frame-ancestors / X-Frame-Options"
15
+ - "Set Referrer-Policy and lock down Permissions-Policy (camera, geolocation, FLoC)"
16
+ - "Our CORS sends `Access-Control-Allow-Origin: *` with credentials — is that safe?" (no)
17
+ - "Cookies missing Secure/HttpOnly/SameSite" / harden the Set-Cookie flags
18
+
19
+ NOT this skill:
20
+ - Finding/fixing a concrete bug — reflected/stored XSS sink, open redirect, SSRF, SQLi — and sanitizing the offending code path → remediate-web-vulnerabilities (this skill is the header *containment* layer that limits the blast radius of such bugs)
21
+ - Configuring the CDN/WAF/edge that injects, caches, or overrides these headers, or rate-limits at the edge → setup-cdn-edge-waf (this skill defines the header *values* it should emit)
22
+ - TLS certs, cipher suites, OCSP, ACME issuance, the TLS handshake behind HSTS → configure-dns-tls (HSTS only asserts TLS is mandatory; it doesn't provision it)
23
+ - Reverse-proxy/load-balancer routing where you might *also* add these headers (nginx/Envoy/Traefik) → configure-reverse-proxy-lb (this skill says *which* headers; that one places them in the proxy)
24
+ - The OAuth/OIDC redirect, token, and session-cookie *protocol* → integrate-oauth-oidc / auth-jwt-session (this skill only hardens the cookie *flags* and CORS around them)
25
+ - Structured threat enumeration (STRIDE) or a full audit pass → threat-model-stride / security-review
26
+ - Active fuzzing/DAST to prove a bypass → fuzz-dynamic-security-test
27
+
28
+ ## Steps
29
+
30
+ 1. **Default to a strict, nonce- or hash-based CSP; host allowlists are obsolete and bypassable.** Allowlist CSPs (`script-src 'self' cdn.example.com`) are trivially defeated via JSONP endpoints, open redirects, or AngularJS on an allowlisted host (Google's own research found ~94% of allowlist CSPs bypassable). The strict pattern:
31
+
32
+ ```
33
+ Content-Security-Policy:
34
+ script-src 'nonce-{RANDOM}' 'strict-dynamic' https: 'unsafe-inline';
35
+ object-src 'none';
36
+ base-uri 'none';
37
+ require-trusted-types-for 'script';
38
+ report-uri /csp-report; report-to csp
39
+ ```
40
+ - **`'strict-dynamic'`** lets a nonced/hashed script load further scripts it creates, so you don't enumerate every CDN. When present, browsers that understand it **ignore** `https:` and `'unsafe-inline'` — those are *fallbacks for old browsers only*, not a real relaxation.
41
+ - **`object-src 'none'`** kills Flash/`<object>` injection; **`base-uri 'none'`** stops `<base href>` from rewriting relative script URLs.
42
+ - You usually don't need `default-src` micromanaged once `script-src` is strict; the dangerous directive is script execution.
43
+
44
+ 2. **Generate a fresh 128-bit nonce per response and stamp it on every inline `<script>`.** The nonce must be cryptographically random and **unique per HTTP response** (never reuse, never hardcode) — a static nonce is equivalent to `'unsafe-inline'`.
45
+
46
+ | Stack | Generate | Apply |
47
+ |---|---|---|
48
+ | Express | `res.locals.nonce = crypto.randomBytes(16).toString('base64')` | helmet `contentSecurityPolicy` with a per-request directive function such as `` (req, res) => `'nonce-${res.locals.nonce}'` ``; `<script nonce="<%= nonce %>">` |
49
+ | Next.js | nonce in `middleware.ts`, pass via header | Next injects nonce into its own scripts when CSP header has a nonce |
50
+ | Django | `django-csp` `@csp_update` / `{{ request.csp_nonce }}` | `<script nonce="{{ request.csp_nonce }}">` |
51
+ | Rails | `config.content_security_policy_nonce_generator` | `javascript_tag nonce: true` / `nonce: true` in tags |
52
+ | Go/Caddy/nginx | per-request var (sub_filter or middleware) | template the nonce into markup |
53
+
54
+ For **static/cached HTML** where you can't inject a per-response nonce, use **`'sha256-...'` hashes** of each inline script's exact bytes instead (compute at build time). Nonces require dynamic rendering; hashes work on a CDN.
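The nonce and hash sources from steps 1 and 2 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any framework's API: the helper names `make_nonce`, `script_hash`, and `strict_csp` are hypothetical, and the policy string mirrors only the `script-src`, `object-src`, and `base-uri` parts of the strict pattern above.

```python
import base64
import hashlib
import secrets


def make_nonce() -> str:
    """Return a fresh 128-bit nonce (base64) meant for exactly one HTTP response."""
    return base64.b64encode(secrets.token_bytes(16)).decode("ascii")


def script_hash(inline_script: str) -> str:
    """Return the CSP source expression for an inline script's exact bytes."""
    digest = hashlib.sha256(inline_script.encode("utf-8")).digest()
    return "'sha256-" + base64.b64encode(digest).decode("ascii") + "'"


def strict_csp(script_sources: list[str]) -> str:
    """Assemble a strict policy around the given nonce or hash sources."""
    sources = " ".join(script_sources)
    return (
        f"script-src {sources} 'strict-dynamic' https: 'unsafe-inline'; "
        "object-src 'none'; base-uri 'none'"
    )


if __name__ == "__main__":
    nonce = make_nonce()
    print(strict_csp([f"'nonce-{nonce}'"]))      # dynamic page: nonce per response
    print(strict_csp([script_hash("init();")]))  # static page: hash at build time
```

The hash covers the text between `<script>` and `</script>` byte for byte, so any whitespace change in the inline script changes the hash and must be recomputed at build time.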
55
+
56
+ 3. **Roll out in Report-Only first — never flip enforcing CSP straight to prod.** Ship `Content-Security-Policy-Report-Only` (same policy) alongside any existing enforced policy, collect violations for days/weeks, fix legitimate breakage, then promote to the enforcing header. Wire reporting with the modern `report-to` (a `Reporting-Endpoints` header naming a collector) **and** keep deprecated `report-uri` for older browsers:
57
+
58
+ ```
59
+ Reporting-Endpoints: csp="https://example.com/csp-report"
60
+ Content-Security-Policy-Report-Only: script-src 'nonce-...' 'strict-dynamic'; report-to csp; report-uri /csp-report
61
+ ```
62
+ Expect noise from browser extensions injecting inline scripts — triage by `blocked-uri`/`source-file`; don't widen the policy to silence extension reports.
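Triage of extension noise can be automated before a human looks at the reports. The sketch below assumes the legacy `report-uri` JSON shape (a top-level `csp-report` object with `blocked-uri` and `source-file` keys); reports delivered via `report-to` use a different envelope and are not handled here. The function name and the scheme list are illustrative assumptions, not an exhaustive catalogue.

```python
import json

# Assumed, non-exhaustive list of URL schemes used by browser extensions.
EXTENSION_SCHEMES = ("chrome-extension:", "moz-extension:", "safari-web-extension:")


def is_extension_noise(raw_body: str) -> bool:
    """Return True when a report-uri payload points at a browser extension."""
    report = json.loads(raw_body).get("csp-report", {})
    fields = (report.get("blocked-uri", ""), report.get("source-file", ""))
    # str.startswith accepts a tuple, so one call checks every scheme.
    return any(str(value).startswith(EXTENSION_SCHEMES) for value in fields)
```

Reports flagged this way can be counted and dropped; everything else is a candidate for a real fix in the page or the policy.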
63
+
64
+ 4. **Set `frame-ancestors` to control framing; it supersedes X-Frame-Options.** Use `frame-ancestors 'none'` (no framing) or `frame-ancestors 'self' https://trusted.example.com` (allow specific embedders). Browsers honor `frame-ancestors` over the legacy `X-Frame-Options: DENY|SAMEORIGIN` when both exist; keep `X-Frame-Options: DENY` only as a fallback for ancient clients. XFO cannot allowlist multiple origins, so `frame-ancestors` is the real control.
65
+
66
+ 5. **Pin the rest of the header set — each closes a specific class.**
67
+
68
+ | Header | Value (strong default) | Closes |
69
+ |---|---|---|
70
+ | `Strict-Transport-Security` | `max-age=63072000; includeSubDomains; preload` | SSL-strip / downgrade; mandates HTTPS for 2y |
71
+ | `X-Content-Type-Options` | `nosniff` | MIME-sniffing a JSON/text response into executable HTML/JS |
72
+ | `Referrer-Policy` | `strict-origin-when-cross-origin` (or `no-referrer`) | leaking full URL + query (tokens) in `Referer` |
73
+ | `Permissions-Policy` | `camera=(), microphone=(), geolocation=(), interest-cohort=()` | abuse of powerful features; deny-by-default `()` = nobody |
74
+ | `Cross-Origin-Opener-Policy` | `same-origin` | cross-window attacks; required (with COEP) to re-enable `SharedArrayBuffer` |
75
+ | `Cross-Origin-Resource-Policy` | `same-origin` (or `same-site`) | side-channel/Spectre cross-origin reads (data leak) |
76
+ | `X-Frame-Options` | `DENY` (legacy fallback only) | clickjacking on old browsers (else use `frame-ancestors`) |
77
+
78
+ HSTS rules: only send over HTTPS; `includeSubDomains` covers every subdomain (verify they're all HTTPS first); **`preload` is a near-irreversible commitment**: once on the browser preload list, removal takes months, so don't add it until you're certain all subdomains are HTTPS-only. Submit at hstspreload.org. `Permissions-Policy` replaces the old `Feature-Policy`; `interest-cohort=()` opted out of FLoC, which Chrome has since discontinued, while its successor, the Topics API, is controlled by the separate `browsing-topics=()` directive.
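A quick pre-submission check of an HSTS value might look like the sketch below. It assumes a one-year minimum `max-age` (31536000 seconds, matching the "≥ 1 year" check in the Verify section) and only inspects the header string; it cannot confirm that every subdomain actually serves HTTPS. The function names are hypothetical.

```python
ONE_YEAR = 365 * 24 * 3600  # 31536000 seconds; the table's 63072000 is two years


def parse_hsts(value: str) -> dict:
    """Split a Strict-Transport-Security value into lowercase directives."""
    directives = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def preload_ready(value: str, min_age: int = ONE_YEAR) -> bool:
    """Check max-age, includeSubDomains, and preload on the header string only."""
    directives = parse_hsts(value)
    max_age = directives.get("max-age", "")
    return (
        max_age.isdigit()
        and int(max_age) >= min_age
        and "includesubdomains" in directives
        and "preload" in directives
    )
```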
79
+
80
+ 6. **CORS: echo exactly one validated origin — NEVER wildcard with credentials.** The single most common CORS vuln is `Access-Control-Allow-Origin: *` (or reflecting `Origin` blindly) **together with** `Access-Control-Allow-Credentials: true`, which lets any site read authenticated responses.
81
+ - Browsers reject `*` combined with credentials because the spec **forbids** it, but reflecting the `Origin` header unchecked reopens the same hole in a form browsers accept. **Validate the incoming `Origin` against an allowlist**, and only then echo it back: `Access-Control-Allow-Origin: <that exact origin>` + `Vary: Origin`.
82
+ - Never trust substring/regex matches like `endsWith('example.com')` (matches `https://evilexample.com`) or `startsWith('https://example.com')` (matches `https://example.com.evil.com`). Match the full origin against an explicit set.
83
+ - If you don't need credentials, prefer `Access-Control-Allow-Origin: *` **without** credentials — that's safe and simpler. Don't reflect `null` (sandboxed iframes/`file://` send `Origin: null` — allowlisting `null` is exploitable).
84
+ - Set `Vary: Origin` whenever the ACAO value depends on the request, or a cache will serve one origin's allowed-response to another.
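The CORS rules above reduce to exact set membership. The following is a framework-agnostic sketch; `ALLOWED_ORIGINS` and `cors_headers` are hypothetical names and the origins are placeholders.

```python
# Placeholder allowlist: full origins (scheme + host + optional port), no patterns.
ALLOWED_ORIGINS = frozenset({"https://app.example.com", "https://admin.example.com"})


def cors_headers(origin, allowed=ALLOWED_ORIGINS):
    """Return CORS response headers for a credentialed endpoint."""
    # Vary: Origin is always set because the response depends on the request origin.
    headers = {"Vary": "Origin"}
    # Exact match only; "null" and a missing Origin never receive an ACAO header.
    if origin is not None and origin != "null" and origin in allowed:
        headers["Access-Control-Allow-Origin"] = origin
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers
```

Because the check is membership in a set of complete origins, lookalikes such as `https://app.example.com.evil.com` fall through to the no-ACAO branch without any string surgery.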
85
+
86
+ 7. **Harden cookies: `Secure; HttpOnly; SameSite` — and `__Host-` for session cookies.** Every session/auth cookie:
87
+
88
+ ```
89
+ Set-Cookie: __Host-session=...; Secure; HttpOnly; SameSite=Lax; Path=/
90
+ ```
91
+ - **`Secure`** — only sent over HTTPS. **`HttpOnly`** — invisible to `document.cookie`, so an XSS can't exfiltrate it. **`SameSite=Lax`** (default-safe; blocks cross-site POST CSRF) or **`Strict`** for the most sensitive; use `SameSite=None; Secure` only for genuine cross-site cookies (and then you need CSRF defense).
92
+ - The **`__Host-` prefix** forces `Secure`, `Path=/`, and no `Domain` — the browser rejects the cookie if those aren't met, preventing subdomain cookie-fixation. Use it for session cookies. `__Secure-` is the weaker variant (just requires `Secure`).
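The `__Host-` requirements can be checked mechanically. This sketch builds a session cookie string and mirrors the prefix rules named above (`Secure`, `Path=/`, no `Domain`); both function names are hypothetical, and the cookie value is assumed to contain no `;` characters.

```python
def session_cookie(name: str, value: str, same_site: str = "Lax") -> str:
    """Build a __Host- prefixed Set-Cookie value with the hardened flags."""
    if same_site not in ("Lax", "Strict"):
        raise ValueError("use Lax or Strict for session cookies")
    return f"__Host-{name}={value}; Secure; HttpOnly; SameSite={same_site}; Path=/"


def host_prefix_ok(set_cookie: str) -> bool:
    """Check the __Host- rules: Secure present, Path=/, and no Domain attribute."""
    first, *attrs = [part.strip() for part in set_cookie.split(";")]
    if not first.startswith("__Host-"):
        return False
    seen = {}
    for attr in attrs:
        key, _, val = attr.partition("=")
        seen[key.strip().lower()] = val.strip()
    return "secure" in seen and seen.get("path") == "/" and "domain" not in seen
```

A check like this fits in a unit test around the login handler, so a later refactor that adds `Domain=` is caught in CI instead of by a browser silently rejecting the cookie.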
93
+
94
+ 8. **Set headers once, at the right layer, and don't let them get clobbered.** Prefer a single source of truth: app middleware (helmet / `secure_headers` / `django-csp`) **or** the edge/proxy, not both fighting. If a CDN/WAF (setup-cdn-edge-waf) or reverse proxy (configure-reverse-proxy-lb) also injects headers, confirm which wins (proxies often *append*, producing duplicate/conflicting CSP; the browser then enforces the **intersection**, which can silently break the page). Apply headers to **all** responses including errors, redirects, and API/JSON. Use **helmet** (Express), **`secure_headers`** gem (Rails), **`django-csp` + `SecurityMiddleware`** (Django), or a maintained security-header middleware such as **`unrolled/secure`** (Go) rather than hand-rolling.
95
+
96
+ ## Common Errors
97
+
98
+ - **Allowlist CSP with `'unsafe-inline'`.** `script-src 'self' 'unsafe-inline'` provides essentially zero XSS protection — inline injected scripts run. Fix: nonce/hash + `'strict-dynamic'`, drop `'unsafe-inline'` (keep it only as the old-browser fallback that strict-dynamic neutralizes).
99
+ - **Reusing or hardcoding the nonce.** A static/cached nonce = `'unsafe-inline'`; the attacker just reads it from the page and reuses it. Fix: fresh CSPRNG nonce per response; for cacheable HTML use hashes instead.
100
+ - **Flipping enforcing CSP straight to prod.** You blank-screen real users on day one. Fix: `-Report-Only` first, collect via `report-to`/`report-uri`, fix breakage, then enforce.
101
+ - **`'unsafe-eval'` left in to satisfy a library.** Re-opens `eval`/`Function` injection. Fix: move to a CSP-compatible build (no runtime eval); add Trusted Types (`require-trusted-types-for 'script'`) instead of loosening.
102
+ - **CSP restricts only `script-src`, missing `object-src`/`base-uri`.** A `<base>` hijack or `<object>` injection bypasses a script-only policy. Fix: always add `object-src 'none'; base-uri 'none'`.
103
+ - **`Access-Control-Allow-Origin: *` (or reflected Origin) with `Allow-Credentials: true`.** Browsers refuse the `*` form for credentialed requests, and the reflected form lets any website read the victim's authenticated data. Fix: allowlist + echo the single matched origin + `Vary: Origin`; or drop credentials and use `*`.
104
+ - **Substring origin matching.** `origin.endsWith('example.com')` allows `https://notexample.com`, and `origin.startsWith('https://example.com')` allows `https://example.com.evil.com`. Fix: exact full-origin set membership.
105
+ - **HSTS `preload` added prematurely / without `includeSubDomains`.** A non-HTTPS subdomain becomes unreachable, and preload removal takes months. Fix: confirm every subdomain is HTTPS-only before `includeSubDomains; preload`; ramp `max-age` up gradually.
106
+ - **Setting HSTS over plain HTTP.** Ignored by browsers and a sign of misconfig. Fix: emit HSTS only on HTTPS responses; redirect HTTP→HTTPS first.
107
+ - **Cookies without `HttpOnly`/`Secure`/`SameSite`.** XSS steals the session; CSRF rides it; it leaks over HTTP. Fix: `__Host-name=...; Secure; HttpOnly; SameSite=Lax`.
108
+ - **Duplicate CSP headers from app + proxy.** The browser enforces every CSP header it receives, so the effective policy is the *intersection*: stricter than either owner intended, and it breaks the page silently. Fix: one owner of the header; verify the response has a single CSP.
109
+ - **Missing `nosniff` on an endpoint that returns user content.** Without it the browser may sniff the body as HTML or script and execute it. Fix: `X-Content-Type-Options: nosniff` on every response and a correct `Content-Type`.
110
+
111
+ ## Verify
112
+
113
+ 1. **Scan the live headers:** run the response through `securityheaders.com` / Mozilla Observatory, or `curl -sI https://site`, and confirm that `Content-Security-Policy` (including a `frame-ancestors` directive), `Strict-Transport-Security`, `X-Content-Type-Options: nosniff`, `Referrer-Policy`, and `Permissions-Policy` are each present exactly once, with no duplicates.
114
+ 2. **CSP is strict:** the policy contains a `'nonce-...'` or `'sha256-...'` in `script-src` with `'strict-dynamic'` and **no** standalone `'unsafe-inline'`/`'unsafe-eval'` that a modern browser honors; `object-src 'none'` and `base-uri 'none'` present. Validate with Google's CSP Evaluator.
115
+ 3. **Nonce is per-response:** fetch the page twice — the nonce value differs each time and matches the inline `<script nonce=...>` tags.
116
+ 4. **Report-Only worked:** the violation collector received reports and they were triaged before enforcing; the enforced policy doesn't blank the app (load the real pages, check the console for `Refused to...`).
117
+ 5. **CORS is safe:** `curl -H 'Origin: https://evil.com' -I` to a credentialed endpoint returns **no** `Access-Control-Allow-Origin` for `evil.com` (or omits credentials); an allowlisted origin gets that exact origin echoed plus `Vary: Origin`. No `*`+credentials anywhere.
118
+ 6. **Cookies hardened:** `Set-Cookie` on the session cookie shows `Secure; HttpOnly; SameSite=...` (and `__Host-` prefix for session); inspect in DevTools → Application → Cookies.
119
+ 7. **HSTS sane:** `Strict-Transport-Security` only on HTTPS, `max-age` ≥ 1 year, `includeSubDomains` only if every subdomain is HTTPS; `preload` only when committed (verify at hstspreload.org).
120
+ 8. **Clickjacking blocked:** attempt to frame the site from another origin → blocked by `frame-ancestors`; `X-Content-Type-Options: nosniff` confirmed so a `text/plain` API body isn't sniffed to HTML.
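Parts of the checklist above can be scripted. The sketch below audits a list of `(name, value)` response header pairs, however they were captured, for missing or duplicated headers and for two CSP directives; the `REQUIRED` tuple and the `audit` function are hypothetical, and the substring checks are a coarse first pass rather than a CSP parser.

```python
REQUIRED = (
    "content-security-policy",
    "strict-transport-security",
    "x-content-type-options",
    "referrer-policy",
    "permissions-policy",
)


def audit(header_pairs):
    """Return a list of findings for (name, value) response header pairs."""
    seen = {}
    for name, value in header_pairs:
        # Header names are case-insensitive, so normalise before counting.
        seen.setdefault(name.lower(), []).append(value)
    findings = []
    for name in REQUIRED:
        count = len(seen.get(name, []))
        if count == 0:
            findings.append(f"missing {name}")
        elif count > 1:
            findings.append(f"duplicate {name} ({count} copies)")
    csp = " ".join(seen.get("content-security-policy", []))
    if csp and "frame-ancestors" not in csp:
        findings.append("CSP lacks frame-ancestors")
    if csp and "'strict-dynamic'" not in csp:
        findings.append("CSP lacks 'strict-dynamic'")
    return findings
```

An empty list means the coarse checks passed; the nonce-per-response, CORS, and cookie checks still need the manual steps listed above.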
121
+
122
+ Done = a strict nonce/hash CSP with `'strict-dynamic'` and no honored `'unsafe-inline'`, rolled out via Report-Only then enforced; HSTS (preload only when safe), nosniff, frame-ancestors, Referrer-Policy and a deny-by-default Permissions-Policy all present exactly once; CORS validates origin against an allowlist and never pairs `*`/reflected-origin with credentials; and session cookies carry Secure+HttpOnly+SameSite (`__Host-` prefixed) — all proven by the header scan, CSP evaluator, and CORS/cookie checks above.
@@ -0,0 +1,140 @@
1
+ ---
2
+ name: contract-testing
3
+ description: Implements consumer-driven contract testing so services deploy independently without a full integration environment — the consumer's unit tests record concrete request/response expectations against a stub (Pact `pact-jvm`/`pact-js`/`pact-python`, or Spring Cloud Contract DSL), the resulting contract (pact file / Spring stub jar) is published to a broker (Pact Broker / PactFlow) tagged by consumer version + branch + environment, the provider replays every expectation against its real app in CI with provider states (`@State` / `Given`) seeding data, and `pact-broker can-i-deploy --pacticipant X --version <git-sha> --to-environment production` gates the pipeline — plus webhook-triggered provider verification on contract change, bi-directional contracts (verify a provider's OpenAPI against consumer pacts without running the provider), pending/WIP pacts so a new consumer expectation never breaks the provider build, and version pinning via the consumer's git SHA with `record-deployment`/`record-release`.
4
+ when_to_use: You have ≥2 services that talk over HTTP/messages and want to catch integration breakage in fast unit-speed CI instead of a brittle shared E2E env — adding Pact or Spring Cloud Contract, wiring a Pact broker, gating deploys with can-i-deploy, or deciding consumer-driven vs bi-directional contracts. Distinct from rest-graphql-contract (defines the API spec/schema itself — OpenAPI/GraphQL SDL/JSON Schema; this skill tests that two specific deployed versions actually agree) and schema-evolution-compatibility (the back/forward-compat rules a change must obey; this skill is the CI mechanism that proves a given consumer↔provider pair still satisfies them).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when two or more independently deployed services integrate and you want integration confidence at unit-test speed, not via a fragile end-to-end stack:
10
+
11
+ - "Provider changed a field and a consumer broke in prod — catch it in CI before merge"
12
+ - "Our shared staging/E2E env is flaky and slow; we want to test integration without it"
13
+ - "Add Pact / Spring Cloud Contract between our frontend/BFF and the API"
14
+ - "Gate the deploy: don't ship the provider until every consumer's contract still passes"
15
+ - "We already have an OpenAPI spec — verify the provider matches it AND the consumers (bi-directional)"
16
+ - "A new consumer's expectation shouldn't be able to red the provider's build (pending pacts)"
17
+ - "Mobile app v3 is still live; how do we know the provider didn't drop a field v3 needs?"
18
+
19
+ NOT this skill:
20
+ - Authoring the API spec/schema (OpenAPI, GraphQL SDL, JSON Schema, field types, pagination shape) → rest-graphql-contract (defines *what* the API is; this skill proves two running versions *agree* on it)
21
+ - The back/forward-compatibility *rules* (additive-only, never-remove-required, default-on-new-optional) → schema-evolution-compatibility (the policy; this skill is the per-pair CI enforcement of it)
22
+ - gRPC/protobuf service definition and codegen → design-protobuf-grpc-service (you can still Pact-test gRPC via message pacts, but the `.proto` itself lives there)
23
+ - General API design / breaking-change review of a diff → api-design-review
24
+ - Browser/UI end-to-end flows across the whole app → write-playwright-e2e (this skill *replaces* most cross-service E2E with isolated pair contracts)
25
+ - Structuring the unit-test suite itself / assertions / fixtures → write-tests, test-data-factories (this skill specifies the contract interactions; those build the surrounding suite/data)
26
+ - Wiring the CI stages / runners / caching → cicd-pipeline-author; the deploy gate's release flow → deploy-release (this skill supplies the can-i-deploy check those stages run)
27
+
28
+ ## Steps
29
+
30
+ 1. **Pick consumer-driven (Pact) when consumers know what they need; bi-directional/spec-driven when the provider already owns an OpenAPI/GraphQL spec.** They are not interchangeable:
31
+
32
+ | Approach | How it works | Use when | Limitation |
33
+ |---|---|---|---|
34
+ | **Consumer-driven (Pact)** | consumer's tests *generate* expectations; provider *replays* them against the real app | consumers drive the API; you want to know exactly which fields are used | provider must run verification against real code; needs provider states |
35
+ | **Bi-directional (PactFlow)** | provider's OpenAPI is verified as a "provider contract"; consumer pacts compared statically against it — provider need not run | provider already has a trustworthy spec; can't run full provider verification | only as good as the spec; a spec that lies passes |
36
+ | **Spring Cloud Contract** | contracts in Groovy/YAML DSL live with the *provider*; generate provider tests + a stub jar consumers run against | JVM-heavy estate, provider-owned contracts, message + HTTP | JVM-centric; less natural for polyglot consumers |
37
+
38
+ Default to **consumer-driven Pact** for polyglot HTTP/message estates; **Spring Cloud Contract** for an all-JVM shop; add **bi-directional** when a provider can't feasibly run verification but has a real OpenAPI.
39
+
40
+ 2. **Write the consumer test against a Pact mock — assert on the request you send and matchers (not literals) for the response.** The consumer test spins up Pact's local mock server, you exercise your real client code against it, and Pact records the interaction. Use **matchers** so the contract pins *structure/type*, not brittle example values:
41
+
42
+ ```js
43
+ // pact-js v3+ (V3/V4 spec)
44
+ const { PactV3, MatchersV3: M } = require('@pact-foundation/pact');
45
+ const provider = new PactV3({ consumer: 'web-bff', provider: 'orders-api' });
46
+
47
+ provider
48
+ .given('order 42 exists') // provider state — seeds data later
49
+ .uponReceiving('a request for order 42')
50
+ .withRequest({ method: 'GET', path: '/orders/42',
51
+ headers: { Accept: 'application/json' } })
52
+ .willRespondWith({ status: 200,
53
+ headers: { 'Content-Type': M.regex('application/json.*', 'application/json') },
54
+ body: { id: M.integer(42), total: M.decimal(19.99),
55
+ status: M.regex('PAID|PENDING', 'PAID'),
56
+ items: M.eachLike({ sku: M.string('ABC'), qty: M.integer(1) }) } });
57
+
58
+ await provider.executeTest(mock => new OrdersClient(mock.url).getOrder(42));
59
+ ```
60
+ Rules: assert only on **fields the consumer actually reads** (Pact verifies the provider returns *at least* these — extra provider fields are fine; that's how providers stay free to add). Use `integer/decimal/string/regex/eachLike/like`, never hardcoded values, or any data change reds the provider. One `given(...)` per distinct precondition; the string must match a provider state handler exactly.
61
+
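The "at least these fields, matched by type" rule above can be sketched in plain Python. This is a toy model for illustration only, not Pact's matching engine: the `satisfies` helper and the type-based expectation format are hypothetical names invented here.

```python
# Toy model of consumer-driven matching (NOT Pact's real algorithm).
# An expectation maps each field the consumer reads to a type, a nested
# dict, or a one-element list describing every item of an array.

def satisfies(expected, actual):
    """Return True if `actual` has at least the expected shape."""
    if isinstance(expected, type):
        # bool is a subclass of int in Python; reject it for numeric fields.
        if isinstance(actual, bool) and expected is not bool:
            return False
        return isinstance(actual, expected)
    if isinstance(expected, dict):
        # Extra keys in `actual` are ignored: providers stay free to add fields.
        return isinstance(actual, dict) and all(
            key in actual and satisfies(sub, actual[key])
            for key, sub in expected.items()
        )
    if isinstance(expected, list):
        # eachLike-style: a non-empty array whose items all match the template.
        template = expected[0]
        return (isinstance(actual, list) and len(actual) > 0
                and all(satisfies(template, item) for item in actual))
    return expected == actual  # a literal: brittle, breaks on any data change


expectation = {"id": int, "total": float, "items": [{"sku": str, "qty": int}]}

response = {"id": 7, "total": 5.0, "currency": "THB",
            "items": [{"sku": "X", "qty": 2, "note": "gift"}]}
```

Here `satisfies(expectation, response)` holds even though the response carries extra `currency` and `note` fields, while a literal expectation such as `{"total": 19.99}` fails as soon as the data changes.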
62
+ 3. **Run the consumer test in normal unit CI; it emits a pact JSON file as a side effect — there is no provider involved here.** `npm test` / `mvn test` / `pytest` produces `pacts/web-bff-orders-api.json`. This runs at unit speed, no network, no provider deployed. The pact file is the deliverable.
63
+
64
+ 4. **Publish the pact to a broker, tagged with the consumer's git SHA + branch + (later) environments.** The broker is the exchange point; never email pact files around.
65
+
66
+ ```bash
67
+ pact-broker publish ./pacts \
68
+ --consumer-app-version $(git rev-parse --short HEAD) \
69
+ --branch $GIT_BRANCH \
70
+ --broker-base-url $PACT_BROKER_URL --broker-token $PACT_BROKER_TOKEN
71
+ ```
72
+ **Version MUST be the git SHA (or `<semver>+<sha>`), not a timestamp or "latest"** — can-i-deploy reasons about specific versions, and a non-unique version corrupts the matrix. `--branch` enables WIP/pending-pact selection. Self-host the OSS **Pact Broker** (Docker, Postgres-backed) or use hosted **PactFlow** (adds bi-directional + WIP UI).
73
+
74
+ 5. **Provider verification: replay every consumer's pact against the real running provider, seeding data via provider-state handlers.** The provider pulls pacts from the broker by **consumer version selectors** (not "all pacts ever") and runs them against a real instance:
75
+
76
+ ```java
77
+ // pact-jvm JUnit5
78
+ @Provider("orders-api")
79
+ @PactBroker(url="${PACT_BROKER_URL}") // selectors come from the static method below
80
+ class OrdersApiPactTest {
81
+ @PactBrokerConsumerVersionSelectors // pacts live in any env + main branch
82
+ public static SelectorBuilder selectors() { return new SelectorBuilder().deployedOrReleased().mainBranch(); }
83
+ @State("order 42 exists") // matches given(...) string EXACTLY
84
+ void seedOrder42() { db.insertOrder(42, "PAID"); } // arrange real data
85
+ @TestTemplate @ExtendWith(PactVerificationInvocationContextProvider.class)
86
+ void verify(PactVerificationContext ctx) { ctx.verifyInteraction(); }
87
+ }
88
+ ```
89
+ Verify against the **real app + a test DB**, not mocks — the point is to prove the actual provider satisfies the expectation. **`@State` handlers are mandatory and must be idempotent**; they set up exactly the data the interaction needs and clean up after. A missing/misnamed state handler fails verification with "state not found".
90
+
91
+ 6. **Publish verification results back to the broker so the matrix is complete on both sides.** Set `pact.verifier.publishResults=true` (pact-jvm) / `publishVerificationResult: true` (pact-js) **only in CI, keyed to the provider's git SHA**. This is what lets can-i-deploy answer "has provider@sha verified consumer@sha?"; without it the matrix has holes, can-i-deploy reports the pair as unknown, and the gate stays blocked even though nothing is actually broken.
92
+
93
+ 7. **Gate every deploy with `can-i-deploy` against the target environment — this is the whole payoff.** Before shipping either side, ask the broker whether this version is compatible with everything currently in the target env:
94
+
95
+ ```bash
96
+ pact-broker can-i-deploy \
97
+ --pacticipant orders-api --version $(git rev-parse --short HEAD) \
98
+ --to-environment production --retry-while-unknown 30 --retry-interval 10
99
+ # exit 0 = safe to deploy; non-zero = a consumer in prod would break → fail the stage
100
+ ```
101
+ `--retry-while-unknown` waits for in-flight verifications instead of failing on a race. After a successful deploy, record it so the matrix tracks what's live:
102
+ ```bash
103
+ pact-broker record-deployment --pacticipant orders-api \
104
+ --version $(git rev-parse --short HEAD) --environment production
105
+ ```
106
+ Use `record-deployment` for environments you replace-in-place (one version live), `record-release`/`record-support-ended` for things like mobile apps where **multiple versions are live at once** — that's how you stop the provider dropping a field old app builds still need.
107
+
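The gate's reasoning can be sketched as a lookup over the verification matrix. This is a simplified model under stated assumptions, not the Pact Broker's implementation: `can_i_deploy`, the tuple-keyed `verified` dict, and the `deployed` map are all hypothetical structures chosen for illustration.

```python
# Toy model of the compatibility matrix (NOT the Pact Broker's code).
# verified[(consumer, consumer_version, provider, provider_version)] -> bool
# A missing key means "no verification result published yet".

def can_i_deploy(name, version, deployed, verified, integrations):
    """Return (ok, reasons) for deploying `name`@`version` to an environment.

    deployed:     {app: set of versions currently live in the environment}
    integrations: set of (consumer, provider) pairs that share a contract
    """
    reasons = []
    for consumer, provider in integrations:
        if name == provider:
            pairs = [(cv, version) for cv in deployed.get(consumer, ())]
        elif name == consumer:
            pairs = [(version, pv) for pv in deployed.get(provider, ())]
        else:
            continue
        for cv, pv in pairs:
            result = verified.get((consumer, cv, provider, pv))
            if result is not True:
                state = "unknown" if result is None else "failed"
                reasons.append(f"{consumer}@{cv} x {provider}@{pv}: {state}")
    return (not reasons, reasons)


integrations = {("web-bff", "orders-api")}
deployed = {"web-bff": {"a1", "a2"}}          # two consumer versions live at once
verified = {("web-bff", "a1", "orders-api", "p9"): True}

ok_before, why = can_i_deploy("orders-api", "p9", deployed, verified, integrations)
verified[("web-bff", "a2", "orders-api", "p9")] = True
ok_after, _ = can_i_deploy("orders-api", "p9", deployed, verified, integrations)
```

With two consumer versions recorded as live, the provider is blocked until it has a green result against both, which is the multi-version case that `record-release` is meant to cover.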
108
+ 8. **Trigger provider re-verification automatically on contract change via broker webhooks.** Configure a broker **webhook** on `contract_content_changed` / `contract_requiring_verification_published` to POST to the provider's CI (GitHub Actions `repository_dispatch`, GitLab pipeline trigger). New consumer expectation published → provider pipeline runs verification → result published → consumer's can-i-deploy unblocks. Without this the loop is manual and contracts rot.
109
+
110
+ 9. **Use pending pacts + WIP pacts so a new/changed consumer expectation can't red the provider's main build.** Enable `enablePending: true` and `includeWipPactsSince: <date>` in the provider's verifier options (pact-js option names; other implementations expose equivalents). A brand-new consumer expectation is verified but reported as **pending**: failures are visible but **non-blocking** for the provider until it verifies green once, at which point it becomes blocking. This decouples teams: a consumer can publish a forward-looking contract without breaking the provider's release, and the provider opts in when ready. Pair with branch-based selectors so you verify against `main` + `deployedOrReleased`, not every stale feature-branch pact.
111
+
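The pending rule reduces to one question per pact: has this provider ever verified it green? A minimal sketch follows, assuming a hypothetical `build_outcome` helper and a plain set of previously green pact ids. The real broker tracks this per pact content and provider branch.

```python
# Toy model of pending-pact semantics (an illustration, not Pact's code).

def build_outcome(results, verified_green_before):
    """Decide the provider build status.

    results:               {pact_id: True if verification passed}
    verified_green_before: pact ids this provider has already verified
                           successfully at least once
    """
    blocking_failures = [p for p, ok in results.items()
                         if not ok and p in verified_green_before]
    pending_failures = [p for p, ok in results.items()
                        if not ok and p not in verified_green_before]
    status = "failed" if blocking_failures else "passed"
    return status, pending_failures
```

A failing pact that was never green is reported but does not fail the build. Once it has passed once, the same failure becomes blocking.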
112
+ 10. **For async/messaging, use message pacts; for the provider's own spec, optionally add a bi-directional contract.** **Message pacts**: the consumer asserts on a *message body* it can handle (no HTTP mock); the provider verifies its producer function emits a matching message, using the same broker and the same can-i-deploy. **Bi-directional**: publish the provider's OpenAPI as a provider contract (`pactflow publish-provider-contract openapi.yaml`); PactFlow statically cross-validates consumer pacts against it, so the provider needn't run verification. Accept the tradeoff that a wrong spec passes (mitigate by also asserting the spec in the provider's own tests).
113
+
114
+ ## Common Errors
115
+
116
+ - **Asserting on literal example values instead of matchers.** Hardcoding `total: 19.99` means any data change reds provider verification. Fix: `M.decimal()/integer()/regex()/eachLike()` — pin type/structure, not the example.
117
+ - **Consumer over-specifies fields it doesn't use.** Asserting on every response field couples you to the provider's full shape and blocks its additive changes. Fix: assert only the fields the consumer reads; extra provider fields must pass.
118
+ - **Provider state string ≠ `@State`/`Given` handler.** `given('order exists')` vs `@State("order 42 exists")` → "no state handler" verification failure. Fix: keep the strings byte-identical; treat them as a shared contract.
119
+ - **Verifying the provider against mocks/in-memory stubs.** Defeats the purpose — you prove the mock matches, not the real app. Fix: run verification against the real provider + test DB seeded by state handlers.
120
+ - **Versioning pacts with `latest`/timestamps instead of the git SHA.** can-i-deploy's matrix needs unique, reproducible versions; "latest" makes the gate meaningless. Fix: `--consumer-app-version <git-sha>`, branch via `--branch`.
121
+ - **Not publishing verification results (or publishing from local dev).** Holes in the matrix → can-i-deploy can't answer → the pair shows as unknown and the gate blocks or retries until it times out. Fix: publish results only from CI, keyed to the provider SHA.
122
+ - **Skipping can-i-deploy and just deploying.** Contracts that aren't gated provide false safety. Fix: make can-i-deploy a required pipeline stage that fails the deploy on non-zero exit; add `record-deployment` after.
123
+ - **No pending pacts → new consumer expectation reds the provider main build.** Teams get blocked on each other and disable Pact in frustration. Fix: `enablePending` + WIP pacts; new expectations are non-blocking until first green.
124
+ - **Treating Pact as full-coverage E2E.** Pact verifies the request/response *shape* per interaction, not business correctness or multi-hop flows. Fix: keep a thin layer of true E2E for critical journeys; Pact replaces the broad, flaky middle.
125
+ - **Forgetting multi-version consumers (mobile).** `record-deployment` assumes one live version, so older app builds still in the wild drop out of the matrix. Fix: `record-release`/`record-support-ended` so can-i-deploy keeps every supported app version in the matrix.
126
+ - **Webhook not configured → manual verification loop.** Contracts published but provider never re-verifies, so the broker shows stale green. Fix: `contract_requiring_verification_published` webhook → provider CI dispatch.
127
+
128
+ ## Verify
129
+
130
+ 1. **Consumer test produces a pact at unit speed:** running the consumer suite emits `pacts/<consumer>-<provider>.json` with matchers (not literals), no provider or network involved.
131
+ 2. **Provider verification replays real interactions:** the provider's verification task pulls pacts from the broker, runs against the real app + seeded DB via every `@State`/`Given` handler, and all interactions pass (or are explicitly pending).
132
+ 3. **Matrix is complete both ways:** the broker shows the consumer pact *and* a published verification result for the provider's version — no "unverified" holes.
133
+ 4. **Gate actually blocks:** introduce a breaking provider change (drop/rename a consumed field), run `can-i-deploy --to-environment production` → it exits non-zero and the deploy stage fails; revert → exit 0.
134
+ 5. **Additive change is safe:** add a new optional field on the provider → consumer pact still verifies green and can-i-deploy passes (proves extra fields don't break consumers).
135
+ 6. **Pending pacts don't red main:** publish a new consumer expectation the provider doesn't yet satisfy → provider build reports it pending/non-blocking, not failed; once provider implements it and verifies, it becomes blocking.
136
+ 7. **Versions are git SHAs:** every publish/verify/record uses `git rev-parse` versions; grep CI for `latest`/timestamp versions and remove them.
137
+ 8. **Webhook closes the loop:** publishing a changed contract auto-triggers the provider's verification pipeline; the broker reflects the fresh result without manual intervention.
138
+ 9. **Multi-version handled (if applicable):** `record-release` keeps every supported mobile/app version in the matrix; can-i-deploy refuses a provider change that breaks any still-supported version.
139
+
140
+ Done = consumers generate matcher-based pacts at unit speed, the provider replays them against the real app with idempotent state handlers, verification results and deployments are recorded to the broker keyed by git SHA, every deploy is gated by can-i-deploy against the target environment, new expectations land as non-blocking pending pacts, and contract changes auto-trigger provider re-verification via webhook — proven by the breaking-change-blocks / additive-change-passes / pending-doesn't-red tests in checks 4–6.
@@ -0,0 +1,125 @@
1
+ ---
2
+ name: datetime-timezone-correctness
3
+ description: Implements and fixes correct date/time handling — UTC/instant storage, IANA timezone and DST conversion (gaps and overlaps), explicit ISO-8601 parsing/formatting, calendar-vs-elapsed duration math, DST-stable RRULE recurrence, and monotonic-vs-wall-clock duration measurement.
4
+ when_to_use: Code stores, parses, compares, adds to, or displays timestamps; or a bug is off-by-an-hour/day, a DST transition, a date-boundary or leap-day error, an ambiguous/nonexistent local time, recurrence/expiry, or wall-clock vs monotonic duration. Distinct from regex-build (validating a date *string's* shape) and message-queue-jobs (scheduling/firing the job, not computing its time).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when the bug or task is about **what a timestamp means**, not how it looks on screen:
10
+
11
+ - "Reminder fires an hour early/late twice a year" / "off by one hour after the clock change"
12
+ - "Event lands on the wrong day for users in another timezone"
13
+ - "Token/trial expires a day early" or "expiry compares a naive datetime to an aware one"
14
+ - "Picking `datetime.now()` vs `utcnow()`, naive vs aware, or `Date` vs Temporal/Luxon/java.time/chrono"
15
+ - "Recurring 9am meeting drifts to 8am / 10am" (DST-unstable RRULE)
16
+ - "Elapsed-time metric goes negative or huge" (used wall clock, NTP stepped it)
17
+ - "Parsing `01/02/2026` flips day and month" / "`+0000` got dropped on parse"
18
+ - Leap-day / leap-second / Feb-29 arithmetic, "add 1 month to Jan 31"
19
+
20
+ NOT this skill:
21
+ - Validating that a *string* matches a date format (regex, positive/negative cases) → regex-build
22
+ - Scheduling, enqueuing, retrying, or actually *firing* a job at a time → message-queue-jobs
23
+ - Adding type hints so `Aware` vs `Naive` is a compile-time error → type-safety-strict
24
+ - Column-level checks that a dataset's date field is non-null/in-range → validate-data-quality
25
+ - A concurrency race where two threads read a clock out of order → async-concurrency-correctness
26
+
27
+ ## Steps
28
+
29
+ 1. **Cardinal rule: store and transport an absolute instant; convert to local only at the display edge.** Persist UTC or an offset-bearing instant. Local wall-clock time is for input and output only — never the source of truth.
30
+
31
+ | Concept stored | Right type | Wrong type | Example |
32
+ |---|---|---|---|
33
+ | A moment that happened/will happen | UTC instant / `timestamptz` / `Instant` | naive local datetime, "string + separate tz column" | log entry, `created_at`, fired-at |
34
+ | A wall-clock appointment a human set | local datetime **+ IANA zone id** (e.g. `America/New_York`) | UTC instant alone (loses the user's intent across DST law changes) | "9:00am every Mon", future calendar event |
35
+ | A pure date with no time | date-only type (`LocalDate`) | midnight-UTC instant (shifts a day under any offset) | birthday, invoice due date, holiday |
36
+ | Elapsed time / a timeout | monotonic duration (see step 7) | difference of two wall-clock timestamps | request latency, cache TTL countdown |
37
+
38
+ Store the **zone id** (`Europe/London`), never a fixed offset (`+01:00`) or abbreviation (`BST`/`CST` — ambiguous, and offset changes at DST). Schema default: `TIMESTAMPTZ` in Postgres, never `TIMESTAMP` (Postgres `timestamp` is naive and silently drops the zone).
39
+
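As a sketch of the "local datetime + IANA zone id" row, the record below keeps the human's intent and derives the instant only at conversion time. The `appointment` dict layout and the `to_instant` helper are assumptions made for this example, not a prescribed schema.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A wall-clock appointment: keep the human's intent (local time + zone id).
appointment = {"local": "2026-07-06T09:00:00", "zone": "Europe/London"}

def to_instant(record):
    """Derive the UTC instant at conversion time from local time + IANA zone."""
    zone = ZoneInfo(record["zone"])
    local = datetime.fromisoformat(record["local"]).replace(tzinfo=zone)
    return local.astimezone(timezone.utc)

summer = to_instant(appointment)  # London is on BST (UTC+1) in July
winter = to_instant({"local": "2026-12-07T09:00:00", "zone": "Europe/London"})  # GMT
```

The same "9:00" maps to 08:00Z in summer and 09:00Z in winter, which a stored `+01:00` offset could not express.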
40
+ 2. **Audit naive vs aware; forbid the silent-local default.** Grep the hotspots and replace every implicit-local call:
41
+
42
+ | Language | Banned (naive / implicit-local) | Use instead |
43
+ |---|---|---|
44
+ | Python | `datetime.now()`, `datetime.utcnow()`, `datetime.fromtimestamp(ts)`, `datetime.strptime(...)` (naive) | `datetime.now(timezone.utc)`, `datetime.fromtimestamp(ts, tz=ZoneInfo("UTC"))`, attach `ZoneInfo` |
45
+ | JS/TS | `new Date("2026-01-02")`, `Date.parse`, `new Date(y,m,d)`, `getHours()`/`setHours()` | Temporal (`Temporal.ZonedDateTime`, `Instant`) or Luxon `DateTime.fromISO(s,{zone})` |
46
+ | Java | `new Date()`, `Calendar`, `SimpleDateFormat`, `LocalDateTime` for an instant | `Instant`, `ZonedDateTime`, `OffsetDateTime`, `DateTimeFormatter`, `java.time` |
47
+ | Rust | `chrono::Local::now`, `NaiveDateTime` as an instant | `Utc::now()`, `DateTime<Utc>`, `chrono-tz` `Tz` |
48
+ | Go | `time.Parse` without a layout offset | `time.Now().UTC()`, `time.LoadLocation`, RFC3339 layout |
49
+
50
+ `utcnow()` is banned because it returns a **naive** datetime with no `tzinfo` attached: ordering it against an aware one raises `TypeError`, `==` against an aware one is always `False`, and comparing it to another naive value silently assumes both share the same (unstated) zone. A timestamp is correct only when its type carries a zone.
51
+
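The naive/aware distinction can be checked at runtime. The sketch below uses only the standard library; `is_aware` and `require_aware` are hypothetical helper names, offered as one possible boundary guard.

```python
from datetime import datetime, timezone

aware = datetime.now(timezone.utc)      # carries tzinfo=UTC
naive = aware.replace(tzinfo=None)      # the kind of value utcnow() hands back

def is_aware(dt):
    """Aware per the datetime docs: tzinfo is set and utcoffset() is not None."""
    return dt.tzinfo is not None and dt.utcoffset() is not None

try:
    naive < aware                       # ordering a naive against an aware value
    ordering_error = False
except TypeError:
    ordering_error = True

def require_aware(dt):
    """Boundary guard: refuse naive datetimes instead of guessing a zone."""
    if not is_aware(dt):
        raise ValueError("naive datetime: attach an explicit zone first")
    return dt.astimezone(timezone.utc)
```

Calling `require_aware` wherever timestamps enter the system turns a silent zone assumption into an immediate error.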
52
+ 3. **Convert through a real IANA tzdb, and resolve DST gaps and overlaps explicitly — never let the library pick silently.** Use the OS/`tzdata` IANA database (Python `zoneinfo`, JS Temporal/Luxon, Java `java.time`, Rust `chrono-tz`, Go `time/tzdata`). Two wall-clock times per year are not 1:1 with instants:
53
+
54
+ - **Spring-forward GAP** — `2026-03-08 02:30 America/New_York` does not exist. Decide: shift forward to `03:30` (calendar apps), or reject. Never let it round to a random instant.
55
+ - **Fall-back OVERLAP** — `2026-11-01 01:30 America/New_York` occurs twice. Decide which: earlier (`fold=0`, first occurrence, larger offset) or later (`fold=1`).
56
+
57
+ ```python
58
+ from datetime import datetime, timezone
59
+ from zoneinfo import ZoneInfo
60
+ ny = ZoneInfo("America/New_York")
61
+
62
+ wall = datetime(2026, 3, 8, 2, 30, tzinfo=ny) # gap — does NOT exist
63
+ inst = wall.astimezone(timezone.utc) # zoneinfo skips ahead silently
64
+ back = inst.astimezone(ny) # -> 03:30, NOT 02:30 => round-trip changed it
65
+ assert back != wall # detect: round-trip mismatch == gap/overlap hit
66
+
67
+ amb = datetime(2026, 11, 1, 1, 30, tzinfo=ny) # overlap, two instants
68
+ first = amb.replace(fold=0).astimezone(timezone.utc) # 05:30Z (EDT)
69
+ second = amb.replace(fold=1).astimezone(timezone.utc) # 06:30Z (EST)
70
+ assert first != second # 1h apart — pick fold deliberately
71
+ ```
72
+ Detection rule that works in any language: convert local→instant→local; if you don't get the original wall time back, you hit a gap or overlap — handle it, don't ship it.
73
+
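The round-trip detection rule can be wrapped in a small classifier. This is a sketch that relies on `zoneinfo` and the `fold` attribute; the `classify_local` name and its three string labels are assumptions for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def classify_local(naive, zone):
    """Label a naive wall-clock time in `zone` as 'ok', 'gap', or 'overlap'."""
    first = naive.replace(tzinfo=zone, fold=0)
    second = naive.replace(tzinfo=zone, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "ok"                      # exactly one instant for this wall time
    # The two folds disagree, so this wall time sits on a transition.
    round_trip = first.astimezone(timezone.utc).astimezone(zone)
    if round_trip.replace(tzinfo=None) == naive:
        return "overlap"                 # wall time exists twice
    return "gap"                         # wall time never exists

ny = ZoneInfo("America/New_York")
labels = [
    classify_local(datetime(2026, 3, 8, 2, 30), ny),
    classify_local(datetime(2026, 11, 1, 1, 30), ny),
    classify_local(datetime(2026, 6, 15, 12, 0), ny),
]
```

A caller can then apply its documented policy (shift forward, reject, or pick a fold) instead of accepting whatever the library chose.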
74
+ 4. **Parse and format with an explicit format string and explicit offset handling; reject locale/heuristic parsers.** Default wire format is **RFC 3339 / ISO-8601 with offset**: `2026-06-15T13:45:30Z` or `...+07:00`. Rules:
75
+ - Parse with a fixed pattern (`strptime`/`DateTimeFormatter.ofPattern`/explicit layout), never a "smart" parser (`Date.parse`, `dateutil.parser.parse` for machine data — they guess `MM/DD` vs `DD/MM` by locale and flip silently).
76
+ - A timestamp string **without** an offset is incomplete: either require one, or attach a documented zone — never assume the server's local zone.
77
+ - When emitting for machines, always include the offset (`Z` for UTC). For humans, format in their zone with the zone name shown.
78
+ - Date-only fields parse to a date type, not midnight in some zone.
79
+
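A fixed-pattern parser that refuses offset-less input might look like the sketch below. It assumes whole-second timestamps only (no fractional seconds), and `parse_instant` / `format_instant` are hypothetical names.

```python
from datetime import datetime, timezone

# %z accepts "Z" and "+07:00" when parsing (Python 3.7+).
RFC3339_SECONDS = "%Y-%m-%dT%H:%M:%S%z"

def parse_instant(text):
    """Parse a fixed-pattern timestamp; a missing offset is an error, not a guess."""
    parsed = datetime.strptime(text, RFC3339_SECONDS)  # ValueError on mismatch
    return parsed.astimezone(timezone.utc)

def format_instant(instant):
    """Emit UTC with an explicit 'Z'. Expects an aware datetime."""
    return instant.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

bangkok = parse_instant("2026-06-15T13:45:30+07:00")
```

Because the pattern ends in `%z`, a string such as `2026-06-15T13:45:30` fails to parse rather than being assigned the server's local zone.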
80
+ 5. **Duration & arithmetic: separate elapsed (fixed) from calendar (variable), and pin the order of operations.**
81
+ - **Elapsed / exact:** add seconds/`Duration`/`timedelta` to an **instant**. `now + 24h` is exactly 86400s and may land on a different wall-clock hour across DST — that is correct for "24 hours from now."
82
+ - **Calendar:** "tomorrow", "+1 day", "+1 month" are zoned wall-clock ops — do them on a `ZonedDateTime`/`LocalDate`, then convert. "+1 day" across spring-forward is 23h of real time, and that is correct for "same time tomorrow."
83
+ - **Month/year overflow:** "Jan 31 + 1 month" must clamp to Feb 28/29, not roll to Mar 3. Use a library that clamps (`java.time` `plusMonths`, Luxon `plus({months:1})`, `dateutil.relativedelta`) — never naïve day-count math.
84
+ - **Day boundaries** ("events on 2026-06-15") are `[startOfDay, startOfDay+1d)` **in the user's zone** converted to instants, then a half-open `>= lo AND < hi` range — never `date(timestamp)` in SQL (that truncates in the server's zone). Always half-open, never `BETWEEN`: its closed upper bound double-counts the next midnight.
85
+ - **Business days / "3 working days":** iterate calendar days in the relevant zone, skip weekends + a holiday set; don't approximate as `+72h`.
86
+
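Two of the calendar rules above, month clamping and half-open day ranges, can be sketched with the standard library. `add_months` and `day_bounds_utc` are hypothetical helpers. The day-bounds version assumes local midnight exists in the zone, which does not hold for zones whose DST transition happens at midnight.

```python
import calendar
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

def add_months(d, months):
    """Calendar month addition that clamps to the last day of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def day_bounds_utc(day, zone_id):
    """Half-open [lo, hi) UTC range covering one local calendar day."""
    zone = ZoneInfo(zone_id)
    lo = datetime.combine(day, time.min, tzinfo=zone)
    hi = datetime.combine(day + timedelta(days=1), time.min, tzinfo=zone)
    return lo.astimezone(timezone.utc), hi.astimezone(timezone.utc)

lo, hi = day_bounds_utc(date(2026, 3, 8), "America/New_York")
```

On the spring-forward date the range spans 23 hours of real time, so a query filtering on `ts >= lo AND ts < hi` still returns exactly that local day's events.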
87
+ 6. **Recurrence (RRULE / iCal RFC 5545): expand against the zone, not against fixed offsets, so a recurring local time stays put across DST.** Store the rule with `DTSTART;TZID=America/New_York:20260105T090000` + `RRULE:FREQ=WEEKLY;BYDAY=MO`. Expand each occurrence as a **wall-clock time in that zone**, then convert each to an instant individually — so "every Monday 9:00am" is always 9am local even though its UTC offset shifts at DST. Handle `EXDATE` exclusions and `UNTIL` (which is in UTC per spec). If an occurrence lands in a spring-forward gap, apply the same gap policy as step 3 (shift forward). Never precompute occurrences as fixed-offset UTC instants — they drift an hour after the next DST change.
88
+
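Expanding in the zone rather than in UTC can be sketched as follows. This is not an RRULE engine (no `EXDATE`, `UNTIL`, or gap policy); `weekly_occurrences` is a hypothetical helper that only shows the order of operations.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

def weekly_occurrences(first_day, wall_time, zone_id, count):
    """Step on the local calendar, then convert each occurrence to UTC."""
    zone = ZoneInfo(zone_id)
    for week in range(count):
        local_day = first_day + timedelta(weeks=week)   # calendar step, no zone math
        local = datetime.combine(local_day, wall_time, tzinfo=zone)
        yield local, local.astimezone(timezone.utc)

# Mondays at 09:00 New York time, spanning the 2026-03-08 spring-forward.
occurrences = list(
    weekly_occurrences(date(2026, 3, 2), time(9, 0), "America/New_York", 3)
)
local_hours = [local.hour for local, _ in occurrences]
utc_hours = [utc.hour for _, utc in occurrences]
```

The local hour stays at 9 for every occurrence while the UTC hour moves from 14 to 13 after the transition, which is the shift a precomputed fixed-offset series would miss.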
89
+ 7. **Clock source: monotonic for elapsed, wall clock for timestamps; assume the wall clock jumps.**
90
+
91
+ | Need | Use | Never |
92
+ |---|---|---|
93
+ | Duration / latency / timeout / "has 30s passed" | monotonic clock: `time.monotonic()` (Python), `performance.now()` (JS), `System.nanoTime()` (Java), `std::time::Instant::now()` (Rust) | subtracting two wall-clock timestamps (NTP/leap/DST can step it → negative or huge) |
94
+ | "When did this happen" / persisted time / display | wall clock — `time.time()`/UTC instant | monotonic (meaningless across processes/reboots) |
95
+
96
+ Wall clock is not monotonic: NTP slews/steps it, users change it, VMs pause. A duration measured by wall clock can go **negative**. Measure every interval, retry/backoff window, and benchmark with the monotonic source. For leap seconds, prefer a smeared-time NTP source over special-casing `:60`.
97
+
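A minimal sketch of the split between the two clock sources: durations and deadlines read the monotonic clock, while the persisted "when" is a UTC wall-clock instant. The `timed` function and `Deadline` class are hypothetical names.

```python
import time
from datetime import datetime, timezone

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    return result, elapsed

class Deadline:
    """A timeout that is unaffected by wall-clock steps."""

    def __init__(self, seconds):
        self._expires = time.monotonic() + seconds

    def expired(self):
        return time.monotonic() >= self._expires

result, elapsed = timed(sum, range(1000))
happened_at = datetime.now(timezone.utc)   # wall clock: for persistence/display only
```

The monotonic reading is only meaningful within one process, so `elapsed` is what goes into metrics and `happened_at` is what gets stored.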
98
+ 8. **Test across the boundaries that break naive code, and migrate off deprecated APIs.** Freeze the clock (`freezegun`, `@sinonjs/fake-timers`, `Clock.fixed`) and parametrize the zone (run the suite under `TZ=UTC` *and* `TZ=America/New_York` *and* `TZ=Pacific/Kiritimati` UTC+14). Cover: spring-forward gap, fall-back overlap, Dec 31 → Jan 1 rollover in a non-UTC zone, Feb 29 leap day, Jan 31 + 1 month, an RRULE crossing a DST date, and a monotonic-vs-wall duration during a simulated clock step. Replace any `SimpleDateFormat`/`utcnow`/`new Date(string)`/`chrono::Local` found in step 2.
99
+
100
+ ## Common Errors
101
+
102
+ - **`datetime.utcnow()`** — returns naive; downstream it's treated as local and shifts by the server offset. Fix: `datetime.now(timezone.utc)`.
103
+ - **`new Date("2026-06-15")`** — JS parses a date-only ISO string as **UTC midnight**, so it prints as the *previous day* west of UTC. Fix: parse with Temporal/Luxon and an explicit zone, or treat as a date type.
104
+ - **Storing offset `+01:00` instead of zone id `Europe/London`** — the offset is wrong the other half of the year and can't survive a tzdb law change. Fix: store the IANA id; derive the offset at conversion time.
105
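+ - **Postgres `TIMESTAMP` (without `TZ`) for an instant**: any offset in the input is silently ignored, and the stored value is reinterpreted in the session's `TimeZone` whenever it is cast to or compared with a `timestamptz`. Fix: `TIMESTAMPTZ`.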
106
+ - **Comparing naive to aware** — Python raises `TypeError`; some languages compare them as both-local and lie. Fix: normalize both to aware-UTC before comparing.
107
+ - **`date_trunc`/`CAST(ts AS date)` for "which day"** — truncates in the server zone, so a 23:30-local event lands on the wrong date. Fix: convert to the user zone first (`ts AT TIME ZONE 'America/New_York'`), then truncate.
108
+ - **`+ timedelta(days=1)` on a UTC instant, expecting "same wall time tomorrow"**: it adds exactly 24h, which is off by an hour across DST. Fix: do the `+1 day` on a zoned/local value, then convert to instant.
109
+ - **Jan 31 + 1 month = Mar 3** — naive 30/31-day math overflows February. Fix: a clamping API (`relativedelta`, `plusMonths`, Luxon `plus`).
110
+ - **RRULE expanded as fixed-offset UTC** — every occurrence drifts an hour after the next DST change. Fix: expand in the `TZID` zone, convert each occurrence individually.
111
+ - **Elapsed time from wall clock** — NTP step makes the delta negative or enormous, poisoning metrics/timeouts. Fix: monotonic clock for all durations.
112
+ - **`SimpleDateFormat`/`dateutil.parser.parse`/`Date.parse` on machine data** — locale-dependent `MM/DD` vs `DD/MM` guessing silently swaps day and month. Fix: a fixed explicit pattern.
113
+ - **Dropping the offset on parse** — `2026-06-15T08:00:00+07:00` parsed as naive becomes 08:00 in the wrong zone (7h error). Fix: a parser that retains the offset and converts to UTC.
114
+
115
+ ## Verify
116
+
117
+ - **Round-trip is stable:** parse → store as UTC → format back yields the same instant for a sample including a `+07:00` and a `-05:00` input. No value silently re-zoned.
118
+ - **DST gap handled:** constructing `02:30` on the spring-forward date in a DST zone applies the documented policy (shift-forward or reject) — it does not silently produce an arbitrary instant; the local→instant→local round-trip mismatch is detected.
119
+ - **DST overlap handled:** the fall-back `01:30` is resolvable to **both** instants via fold/disambiguation, and the code picks one deliberately (asserted 1h apart).
120
+ - **Suite green under multiple zones:** the full test run passes under `TZ=UTC`, `TZ=America/New_York`, and `TZ=Pacific/Kiritimati` (UTC+14) — proving no hidden local-zone assumption.
121
+ - **Boundary cases pass:** Dec 31→Jan 1 in a non-UTC zone, Feb 29 leap day, Jan 31 + 1 month clamps to Feb, and a weekly RRULE crossing a DST date keeps its local wall time.
122
+ - **Duration uses monotonic:** a simulated wall-clock backward step does not produce a negative or absurd elapsed value (proving the monotonic source).
123
+ - **Grep clean:** no `utcnow`/`SimpleDateFormat`/`new Date(<string>)`/`chrono::Local`/naive `strptime` remains in instant-handling paths.
124
+
125
+ Done = every stored/transported timestamp is a zone-carrying instant (UTC) or an explicit local-time-plus-IANA-zone, all parse/format uses an explicit offset-aware format, all durations use the monotonic clock, and the test suite passes under ≥3 timezones across the gap, overlap, leap-day, month-overflow, and recurrence boundaries.
@@ -0,0 +1,134 @@
1
+ ---
2
+ name: debug-ci-pipeline-failure
3
+ description: Debugs a red CI job to root cause instead of blind-rerunning; reproduce locally in the SAME image (`act -j <job>`, `gitlab-runner exec`, `circleci local execute`, or `docker run` the exact pinned digest), read the full log + the real exit code (124=timeout, 137=OOM/SIGKILL, 143=SIGTERM, 139=segfault), then classify into flaky / env-drift / poisoned-or-stale cache / resource-OOM / missing-secret / timeout / test-ordering / network, and confirm with a targeted experiment; diff local-vs-CI env (`printenv | sort`, tool `--version`, lockfile hash), run clean (no cache, `--no-cache`/clear key) vs cached, isolate ONE matrix leg, bisect with `git bisect run`, re-run with debug logging (`ACTIONS_STEP_DEBUG=true`, `CI_DEBUG_TRACE=true`, `set -x`) or open an interactive runner (`tmate`/`debug with SSH`/`--privileged` shell); then fix the cause (pin the digest, scope the cache key, raise the limit, randomize then fix test order) not the symptom.
4
+ when_to_use: A CI/CD job that passes locally is red on the runner, fails intermittently, or broke without a relevant code change — green-on-my-machine/red-on-CI, an OOM/timeout/exit-137, a cache or matrix-only failure, or you're tempted to just hit "Re-run job". Distinct from cicd-pipeline-author (designs/authors the pipeline YAML from scratch; this debugs an existing one that's failing) and debug-flaky-tests (fixes the nondeterministic TEST itself — one of several causes here; this skill first classifies whether flakiness, env drift, cache, or limits is even the cause).
5
+ ---
6
+
7
+ ## When to Use
8
+
9
+ Reach for this skill when a CI job is failing and you need to find out WHY before touching anything:
10
+
11
+ - "It's green locally but red on CI" / "passes on my machine, fails on the runner"
12
+ - "The job got killed — exit 137 / OOM / 'Process completed with exit code 137'"
13
+ - "It only fails sometimes — re-running makes it go green" (don't stop there — classify it)
14
+ - "One matrix leg (py3.12 / arm64 / windows) fails, the rest pass"
15
+ - "Nothing in my diff touches this — it broke on its own" (env/cache/upstream drift)
16
+ - "The job hangs and gets cancelled after N minutes" (timeout vs deadlock)
17
+ - "I keep hitting Re-run and hoping" — STOP, reproduce and root-cause instead
18
+
19
+ NOT this skill:
20
+ - Authoring the pipeline YAML, stages, caching strategy, runners from scratch → cicd-pipeline-author (this skill debugs the pipeline it produced)
21
+ - Fixing the nondeterministic test itself (shared state, time/random, async races, order-dependence) once you've confirmed flakiness is the cause → debug-flaky-tests (this skill decides *whether* it's a flaky test vs env/cache/limit, then hands off)
22
+ - General "why does this code crash" root-causing unrelated to CI → debug-root-cause (this skill is CI-runner-specific: images, caches, runners, matrices)
23
+ - Choosing/pinning the toolchain & language versions as a deliverable → pin-toolchain-versions (this skill *detects* version drift as a cause and tells you to pin)
24
+ - Designing the cache key/layers/TTL as a strategy → caching-strategy (this skill *invalidates* a poisoned cache to confirm it's the cause)
25
+ - Debugging a failing K8s pod/job workload (not a CI runner) → k8s-debug-workload
26
+ - A missing secret that's really a vault/rotation/scoping problem → secrets-management (this skill detects "secret empty on CI" as a class; that one fixes how secrets are stored/injected)
27
+ - Standing up a reproducible local dev container to match CI → compose-local-dev-stack / setup-devcontainer-env
28
+ - A production incident/postmortem (not a build) → incident-response-sre
29
+
30
+ ## Steps
31
+
32
+ 1. **Read the log top-to-bottom and grab the REAL exit code before theorizing.** The last red line is rarely the cause. Scroll up to the first error and check the process exit code, which names the failure class:
33
+
34
+ | Exit code | Means | Likely cause |
35
+ |---|---|---|
36
+ | `1` / `2` | generic failure / misuse | real test/build error — read the actual assertion |
37
+ | `124` | command timed out (`timeout` wrapper) | step exceeded its time budget |
38
+ | `137` | `128+9` = SIGKILL | **OOM-killed** (almost always memory limit) or job cancelled |
39
+ | `139` | `128+11` = SIGSEGV | native segfault (bad binary/arch mismatch) |
40
+ | `143` | `128+15` = SIGTERM | timeout/cancel signalled gracefully |
41
+ | `125` | docker run failed | image/entrypoint problem, not your code |
42
+
43
+ In GitHub Actions, re-run failed jobs (`gh run rerun <id> --failed`) only AFTER you know why. Download the raw log (`gh run view <id> --log-failed`, GitLab "Complete Raw"); the web UI truncates and folds groups.
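The exit-code table in step 1 can be turned into a small lookup, sketched below. The function name `classify_exit` is hypothetical (it is not part of any CI tool), and the fallback for other codes above 128 assumes the usual shell convention of 128 + signal number.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a process exit code to the failure class from the table.
classify_exit() {
  local code="$1"
  case "$code" in
    0)   echo "success" ;;
    1|2) echo "generic failure or misuse: read the actual assertion" ;;
    124) echo "timeout (timeout wrapper)" ;;
    125) echo "docker run failed: image or entrypoint problem" ;;
    137) echo "SIGKILL: likely OOM-killed or job cancelled" ;;
    139) echo "SIGSEGV: native segfault" ;;
    143) echo "SIGTERM: timeout or cancel" ;;
    *)
      # Assumption: any other code above 128 is 128 + signal number.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "unclassified exit code $code"
      fi
      ;;
  esac
}

# Example: capture the status of a command that exits 137, then classify it.
code=0
( exit 137 ) || code=$?
classify_exit "$code"
```

This only gives a first guess at the class; the log and a targeted experiment still have to confirm it.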
44
+
45
+ 2. **Reproduce locally in the SAME image, not your laptop.** "Green on my machine" proves nothing if your machine isn't the runner. Run the actual job in its actual container:
46
+
47
+ | CI | Local reproduce |
48
+ |---|---|
49
+ | GitHub Actions | `act -j <job> --container-architecture linux/amd64` (use the runner image: `-P ubuntu-latest=catthehacker/ubuntu:act-latest`) |
50
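+ | GitLab CI | `gitlab-runner exec docker <job>` (deprecated since 15.7 and removed from newer runner releases, so it may need an older runner binary); otherwise pull the exact `image:` and run the job's `script:` by hand. Note that `glab ci run` triggers a remote pipeline, not a local run |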
51
+ | CircleCI | `circleci local execute --job <job>` |
52
+ | any | `docker run --rm -it <image>@<digest>` then run the steps by hand |
53
+
54
+ Pin the image by **digest** (`@sha256:…`), not a moving tag — `ubuntu-latest`/`node:20` drift between your pull and the runner's. If it reproduces in the container but not on your host, the delta IS the bug (next step).
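As a rough guard for the "digest, not a moving tag" rule, the sketch below refuses to build a reproduce command for an unpinned image. Both function names are hypothetical, the `/work` mount point is an assumption, and the all-zero digest is a placeholder rather than a real image.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only for references like name@sha256:<64 hex chars>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Hypothetical helper: print (not run) a docker command for a pinned image.
repro_command() {
  if is_digest_pinned "$1"; then
    echo "docker run --rm -it -v \"\$PWD\":/work -w /work $1 sh"
  else
    echo "refusing: '$1' is a moving tag, pin it by digest first" >&2
    return 1
  fi
}

# Placeholder digest (64 zeros) for illustration only; use the digest from the CI log.
pinned="node@sha256:$(printf '%064d' 0)"
repro_command "$pinned"
repro_command "node:20" || echo "node:20 rejected"
```

One way to find the digest of an image you have already pulled is `docker inspect --format '{{index .RepoDigests 0}}' <image>`.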
55
+
56
+ 3. **Diff the environment — env-drift is the #1 silent cause.** Inside the reproduced container vs your host, compare:
57
+ ```bash
58
+ printenv | sort > /tmp/ci.env # capture on CI (add `printenv|sort` as a debug step)
59
+ <tool> --version # node/python/go/java/gcc — exact patch
60
+ sha256sum package-lock.json poetry.lock go.sum # lockfile parity
61
+ uname -m && cat /etc/os-release # arch (arm64 vs amd64!) + distro
62
+ locale && echo $TZ # LANG/LC_ALL/TZ change sort & date tests
63
+ ```
64
+ Classic drifts: tool tag floated (`actions/setup-node@v4` with no exact version), `npm ci` vs `npm install` (lockfile ignored), `$PATH` ordering picks a different binary, `TZ`/`LANG` unset on the runner breaks date/sort tests, CI sets `CI=true` which flips test behavior. Fix = **pin** (digest + lockfile + exact tool version) → pin-toolchain-versions.
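Once both dumps exist, comparing them can be sketched as below. The helper name `env_drift` and the sample variables are made up for illustration; real input would be the saved `printenv | sort` output from the runner and from your host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: show only the variables that differ between two env dumps.
# Lines starting with '<' exist only in the first file, '>' only in the second.
env_drift() {
  local a b
  a="$(mktemp)"; b="$(mktemp)"
  sort "$1" > "$a"
  sort "$2" > "$b"
  # diff exits 1 when the files differ; that is the expected case here.
  diff "$a" "$b" | grep '^[<>]' || true
  rm -f "$a" "$b"
}

# Demo with made-up dumps standing in for real `printenv | sort` output.
tmp="$(mktemp -d)"
printf '%s\n' 'LANG=en_US.UTF-8' 'NODE_VERSION=20.11.1' 'TZ=Asia/Bangkok' > "$tmp/local.env"
printf '%s\n' 'CI=true' 'LANG=C' 'NODE_VERSION=20.11.1' > "$tmp/ci.env"
env_drift "$tmp/local.env" "$tmp/ci.env"
rm -rf "$tmp"
```

Env dumps can contain credentials, so treat the files like secrets and delete them after the comparison.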
65
+
66
+ 4. **Classify the failure — match symptom to cause, then run ONE experiment to confirm.** Don't guess and re-run; prove it:
67
+
68
+ | Class | Tell-tale | Confirm by |
69
+ |---|---|---|
70
+ | **Flaky test** | passes on re-run, no code change, intermittent | re-run the SAME commit 10×; randomize order → debug-flaky-tests |
71
+ | **Env/version drift** | green local, red CI; broke with no relevant diff | the env diff in step 3 |
72
+ | **Poisoned/stale cache** | broke after a dep bump or cache-key collision; "works in clean checkout" | run with cache disabled (step 5) |
73
+ | **Resource / OOM** | exit 137, "Killed", slow then dead | raise mem / lower parallelism; watch RSS |
74
+ | **Missing/empty secret** | only on fork PRs, only on protected branches, `***` blank | `echo "len=${#SECRET}"` (never the value) |
75
+ | **Timeout / deadlock** | exit 124/143, "cancelled after Nm" | run with `timeout` + thread dump on hang |
76
+ | **Test-ordering** | fails only in CI's shard/order, passes in isolation | run that ONE test alone; then full suite |
77
+ | **Network/flaky registry** | `ETIMEDOUT`/`ECONNRESET`/429 to npm/pypi/ghcr | retry; check it's not a hard dep on a live service |
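The "re-run the SAME commit 10×" confirmation from the table can be sketched as a loop that counts outcomes. `flake_rate` is a hypothetical name; the command after the count is whatever runs your failing test.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command N times and count passes and failures.
# All pass or all fail suggests a deterministic result; a mix suggests intermittency.
flake_rate() {
  local runs="$1" passed=0 failed=0 i=0
  shift
  while [ "$i" -lt "$runs" ]; do
    if "$@" >/dev/null 2>&1; then
      passed=$((passed + 1))
    else
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "passed=$passed failed=$failed"
}

# Example with a command that always succeeds.
flake_rate 3 true   # prints: passed=3 failed=0
```

A mixed count only says the failure is intermittent; it does not yet separate a flaky test from a resource limit or a flaky network dependency.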
78
+
79
+ 5. **Run clean-vs-cached to convict the cache.** A poisoned or stale cache produces the "works in a fresh checkout, fails in CI" pattern, because CI restored a bad layer. Force a clean run and compare:
80
+ - **GitHub Actions:** bump the cache `key` (e.g. `-v2`), or `gh cache delete <key>`, or set `actions/cache` to a key that won't hit. Re-run with `ACTIONS_STEP_DEBUG=true`.
81
+ - **GitLab:** set `cache: []` on the job to skip the cache / clear via "Clear runner caches"; or change `cache:key`.
82
+ - **Docker layer cache:** `docker build --no-cache --pull`; suspect a stale base layer if a `RUN apt-get`/`pip install` silently uses old pins.
83
+ - **Package managers:** `npm ci` (not `install`), `pip install --no-cache-dir`, `go clean -cache`.
84
+
85
+ If clean is green and cached is red → the cache is the cause: the key is too coarse (not keyed on lockfile hash) or restores across incompatible refs → caching-strategy to re-scope the key. If BOTH fail, the cache is innocent — move on.
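Keying the cache on the lockfile hash can be illustrated with the sketch below. `cache_key` and the `npm-v1` prefix are hypothetical; in GitHub Actions the same idea is usually expressed with `hashFiles(...)` in the `key`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive a cache key from a prefix plus the lockfile's SHA-256.
cache_key() {
  local prefix="$1" lockfile="$2" hash
  hash="$(sha256sum "$lockfile" | cut -d' ' -f1)"
  echo "${prefix}-${hash}"
}

# Demo: any change to the lockfile yields a new key, so a stale cache cannot be restored.
tmp="$(mktemp -d)"
echo '{"lockfileVersion": 3}' > "$tmp/package-lock.json"
before="$(cache_key npm-v1 "$tmp/package-lock.json")"
echo '{"lockfileVersion": 3, "bumped": true}' > "$tmp/package-lock.json"
after="$(cache_key npm-v1 "$tmp/package-lock.json")"
if [ "$before" != "$after" ]; then
  echo "lockfile changed, so the cache key changed"
fi
rm -rf "$tmp"
```

Bumping the prefix (for example to `npm-v2`) remains the manual escape hatch when the cache is poisoned but the lockfile has not changed.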
86
+
87
+ 6. **Isolate ONE matrix leg.** A matrix-only failure (only `windows-latest`, only `py3.12`, only `arm64`) is a portability bug, not a flake. Temporarily pin the matrix to the failing leg (`include:` just that combo) so you iterate on one red job, not 12. Common per-leg causes: path separators / line endings (CRLF) on Windows, glibc vs musl (alpine) for native deps, arch-specific wheels/binaries on arm64, a stdlib behavior that changed in the new language minor. Fix the portability issue, then restore the full matrix.
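For the Windows-only case, one quick probe is to check whether a checked-out file carries CRLF line endings. This is a minimal sketch with a hypothetical helper name, and it covers only one of the per-leg causes listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file contains at least one carriage return.
has_crlf() {
  local cr
  cr="$(printf '\r')"
  grep -q "$cr" "$1"
}

# Demo: one file written with CRLF endings, one with LF endings.
tmp="$(mktemp -d)"
printf 'line1\r\nline2\r\n' > "$tmp/windows.txt"
printf 'line1\nline2\n' > "$tmp/unix.txt"
for f in "$tmp/windows.txt" "$tmp/unix.txt"; do
  if has_crlf "$f"; then
    echo "$(basename "$f"): CRLF"
  else
    echo "$(basename "$f"): LF"
  fi
done
rm -rf "$tmp"
```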
88
+
89
+ 7. **Crank up debug logging and `set -x`.** The default log hides what ran. Turn on tracing:
90
+
91
+ | CI | Debug switch |
92
+ |---|---|
93
+ | GitHub Actions | secrets/vars `ACTIONS_STEP_DEBUG=true` and `ACTIONS_RUNNER_DEBUG=true` |
94
+ | GitLab CI | `variables: CI_DEBUG_TRACE: "true"` (⚠ leaks env — protected branch only) |
95
+ | shell steps | add `set -euxo pipefail` to see every command + fail-fast on the real line |
96
+ | any | echo `printenv\|sort`, `df -h`, `free -m`, `nproc`, tool `--version` as a debug step |
97
+
98
+ `set -x` shows what actually ran; `-e` + `pipefail` fix a whole class of "silent failure" where an early command in a pipe failed but its exit code was masked by the last one.
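The masking effect is easy to see in isolation. The sketch below runs the same failing pipeline in two child `bash` processes, one without and one with `pipefail`; the two function names are hypothetical.

```bash
#!/usr/bin/env bash
# Without pipefail, the pipeline's status is that of the last command (cat), so 0.
masked_status() {
  bash -c 'false | cat; echo "$?"'
}

# With pipefail, the failing `false` decides the pipeline's status, so 1.
unmasked_status() {
  bash -c 'set -o pipefail; false | cat; echo "$?"'
}

echo "without pipefail: $(masked_status)"    # prints: without pipefail: 0
echo "with pipefail:    $(unmasked_status)"  # prints: with pipefail:    1
```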
99
+
100
+ 8. **Drop into an interactive runner when logs aren't enough.** For "I can't reproduce it locally and the log is opaque," open a live shell ON the runner:
101
+ - **GitHub Actions:** add a `mxschmitt/action-tmate@v3` step gated with `if: ${{ failure() }}` → SSH into the live runner mid-job; or run `tmate` in a manual `workflow_dispatch`.
102
+ - **GitLab:** interactive web terminal / `gitlab-runner --debug`; CircleCI: "Rerun job with SSH".
103
+ - Inside: re-run the failing command by hand, inspect `/tmp`, check mounted caches, `cat` the generated config, `ps`/`top` for the OOM, `dmesg | tail` for the kill. Tear it down — don't leave a runner pinned.
104
+
105
+ 9. **Convict resource/OOM with real numbers.** Exit 137 + "Killed" = the kernel OOM-killer. Don't just set `continue-on-error`; measure. GitHub-hosted standard Linux runners have historically offered ~7 GB / 2 cores (check the current specs for your plan and repo visibility); self-hosted/container jobs have a `--memory` cgroup limit. Add `/usr/bin/time -v <cmd>` (Max RSS), or `while true; do free -m; sleep 5; done &` to watch growth. Fixes: lower test parallelism (`-j2`, `--maxWorkers=2`, `pytest -n2`), raise the container/runner memory limit, split the job, or fix the actual leak. Node OOM specifically → `NODE_OPTIONS=--max-old-space-size=4096`.
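To know what limit the job is actually running under, a debug step can read the cgroup memory limit. The sketch below assumes a Linux runner and checks the common cgroup v2 and v1 paths; `mem_limit_bytes` and `to_mib` are hypothetical names, and a runner may expose neither file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the cgroup memory limit, or "unknown" if none is readable.
# cgroup v2 reports "max" when no limit is set.
mem_limit_bytes() {
  local f
  for f in /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory/memory.limit_in_bytes; do
    if [ -r "$f" ]; then
      cat "$f"
      return 0
    fi
  done
  echo "unknown"
}

# Hypothetical helper: convert bytes to whole MiB.
to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

limit="$(mem_limit_bytes)"
case "$limit" in
  ''|*[!0-9]*) echo "memory limit: $limit (no numeric cgroup limit found)" ;;
  *)           echo "memory limit: $(to_mib "$limit") MiB" ;;
esac
```

Comparing this number with the Max RSS reported by `/usr/bin/time -v` shows how close the step is to being killed.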
106
+
107
+ 10. **Fix the ROOT CAUSE, then re-run to confirm — never ship a blind re-run as the fix.** A green re-run on a flaky/poisoned/under-provisioned job is a false negative that WILL recur. The closing move per class: pin the digest+lockfile+tool version (drift), re-scope or invalidate the cache key (cache), raise the limit or cut parallelism (OOM), randomize-then-pin test order / fix the shared state (ordering/flake → debug-flaky-tests), add a retry-with-backoff ONLY for genuinely external network calls (and nothing else). Then re-run the SAME commit ≥3× to prove it's deterministically green. Quarantining a flaky test (skip + tracking issue) is acceptable as a *stopgap* to unblock the pipeline — but it's a TODO, not the fix.
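For the one case where a retry is legitimate (a genuinely external network call), retry-with-backoff can be sketched as follows. `retry_backoff` is a hypothetical helper, and the attempt count and starting delay are arbitrary illustration values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command up to MAX times, doubling the delay each time.
# Usage: retry_backoff MAX INITIAL_DELAY_SECONDS command [args...]
retry_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Demo with a stand-in command; in CI this would wrap only the external fetch.
if retry_backoff 3 0 true; then
  echo "succeeded"
fi
```

Wrapping a test command or a build step in this helper would be exactly the masking the step warns against.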
108
+
109
+ ## Common Errors
110
+
111
+ - **Hitting "Re-run job" until it's green and calling it fixed.** That's hiding a flaky/OOM/cache bug; it recurs and erodes trust in CI. Fix: classify (step 4) and fix the cause; re-run only to *confirm*.
112
+ - **"It passes on my machine" as proof.** Your laptop isn't the runner (arch, tool version, env, cache). Fix: reproduce in the exact image/digest (step 2).
113
+ - **Reading only the last red line.** The real error is usually higher; the last line is often a downstream symptom. Fix: read top-down, find the FIRST error + the exit code.
114
+ - **Treating exit 137 as a code bug.** It's OOM/kill, not your assertion. Fix: measure RSS, raise mem or cut parallelism (step 9).
115
+ - **Floating tags (`node:20`, `ubuntu-latest`, `@v4` minor).** They drift between your pull and CI's → "broke with no diff." Fix: pin by digest + lockfile + exact version (pin-toolchain-versions).
116
+ - **`npm install` / non-`ci` installs in CI.** These can ignore or rewrite the lockfile → different deps than local. Fix: `npm ci`, `--frozen-lockfile` (Yarn 1, pnpm), `poetry install` against a committed `poetry.lock`.
117
+ - **Blaming the test when it's the cache.** A stale restored layer breaks a build that passes from a clean checkout. Fix: clean-vs-cached run (step 5) before touching the test.
118
+ - **Debugging the whole matrix at once.** 12 red legs hide which is the real bug. Fix: isolate the one failing leg (step 6).
119
+ - **`CI_DEBUG_TRACE`/printing secrets to debug.** Leaks credentials into logs. Fix: trace on protected branches only; print `${#SECRET}` length, never the value.
120
+ - **No `pipefail`, so a failed mid-pipe command exits 0.** Silent green on a broken step. Fix: `set -euo pipefail` in every shell step.
121
+ - **Empty secret on fork PRs read as a code bug.** By design, secrets aren't exposed to `pull_request` workflows triggered from forks. Fix: recognize the class, use `pull_request_target` carefully or a label gate, not a value hunt.
122
+ - **Adding broad retries to mask flakiness.** Retrying a deterministic bug just burns minutes. Fix: retry ONLY external network I/O; fix logic/ordering/resource causes.
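The "print the length, never the value" check for secrets can be sketched as a tiny helper. `secret_len` and the `DEMO_*` variable names are hypothetical; it reads the variable by name from the exported environment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report only the length of an exported variable, never its value.
secret_len() {
  local value
  value="$(printenv "$1" || true)"   # printenv exits non-zero when the variable is unset
  echo "$1 len=${#value}"
}

# Demo with a stand-in value; a length of 0 means the secret never reached the job.
export DEMO_TOKEN="not-a-real-secret"
unset DEMO_MISSING_TOKEN
secret_len DEMO_TOKEN           # prints: DEMO_TOKEN len=17
secret_len DEMO_MISSING_TOKEN   # prints: DEMO_MISSING_TOKEN len=0
```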
123
+
124
+ ## Verify
125
+
126
+ 1. **Reproduced in-image:** the failure reproduces with `act`/`gitlab-runner exec`/`docker run @digest` (or you've proven via env-diff exactly what the runner has that you don't) — not just observed in the web UI.
127
+ 2. **Classified, not guessed:** you can name the class (flaky / env-drift / cache / OOM / secret / timeout / ordering / network) AND state the experiment that confirmed it (clean-vs-cached, isolated test, RSS measurement, env diff).
128
+ 3. **Exit code accounted for:** you read the real exit code and it's consistent with the diagnosis (137→OOM, 124/143→timeout, 1/2→real error).
129
+ 4. **Root cause fix, not a re-run:** the diff pins/invalidates/scopes/limits the actual cause; there's no bare "Re-run" or blanket `continue-on-error` standing in for a fix.
130
+ 5. **Determinism proven:** the SAME commit is re-run ≥3× (and for a flake, ≥10×) and is green every time — not green once after N reds.
131
+ 6. **No new leak:** debug tracing is off (or gated to protected branches), no secret value was printed, and interactive runners were torn down.
132
+ 7. **Matrix restored:** if you isolated a leg, the full matrix is back and all legs pass.
133
+
134
+ Done = the failure was reproduced in the runner's actual image, classified into one named cause confirmed by a targeted experiment (env diff / clean-vs-cached / isolated test / RSS), and fixed at the root (pin, cache-key, limit, order) — proven by the same commit going green ≥3× with no blind re-run, no masking, and no leaked secrets.