npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.0 - Mend

sanook-cli 0.4.0 → 0.5.0

Files changed (235)

package/.env.example +19 -0
package/CHANGELOG.md +144 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +394 -51
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +2 -2
package/dist/providers/keys.js +3 -2
package/dist/providers/registry.js +133 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +218 -27
package/dist/ui/banner.js +4 -9
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/setup.js +6 -5
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/design-token-system/SKILL.md ADDED

@@ -0,0 +1,109 @@
+---
+name: design-token-system
+description: Architects a framework-agnostic design-token system with primitive/semantic/component tiers, theming and multi-brand/dark-mode alias contracts, and multi-platform export (CSS vars, Tailwind, JS/TS, iOS/Android) from one W3C-DTCG source via Style Dictionary.
+when_to_use: Setting up or refactoring a token architecture, building a theme/multi-brand/dark-mode system, exporting one token source to web + native, or adopting Style Dictionary / the W3C Design Tokens format. Distinct from style-responsive-tailwind (consuming tokens in markup) and brainstorm-design (choosing the palette/visual direction).
+---
+## When to Use
+Reach for this skill when the problem is the **token architecture and export pipeline**, not a single component's styling:
+- "Set up design tokens / a theme system from scratch"
+- "Add dark mode without forking every color"
+- "Support multiple brands / white-label from one codebase"
+- "Export the same tokens to CSS, Tailwind, and our iOS + Android apps"
+- "Adopt Style Dictionary / the W3C Design Tokens (DTCG) format"
+- "We have 300 hardcoded hex/px values — give us a governed token layer"
+NOT this skill:
+- Writing the markup/utility classes that *consume* tokens → style-responsive-tailwind
+- Picking the actual palette, type pairing, or visual mood → brainstorm-design
+- Translating one Figma frame into a component → implement-from-design
+- Building the React component that renders from tokens → build-react-component
+- Wiring a cross-platform app shell/build → scaffold-cross-platform-app
+- Certifying contrast ratios meet WCAG → audit-accessibility-wcag (this skill *structures* color; it does not verify contrast)
+## Steps
+1. **Structure tokens in tiers (three at most) and never let a component read a primitive.** This is the whole architecture; get it wrong and theming becomes impractical.
+   | Tier | Names mean | References | Example | Rule |
+   |---|---|---|---|---|
+   | **Primitive** (global/core) | nothing — raw scale | literal values only | `blue.500 = #2563EB`, `space.4 = 16px` | No semantics. Never themed. Never imported by components. |
+   | **Semantic** (alias) | role/intent | → primitives | `color.bg.surface → gray.50`, `color.intent.danger → red.600` | The *only* layer that swaps per theme/brand. |
+   | **Component** (scoped) | one part | → semantics | `button.primary.bg → color.intent.brand` | Optional; add only when a component overrides a semantic. |
+   Default to **2 tiers (primitive + semantic)**; add component tokens only where a component genuinely diverges. Components and Tailwind/CSS consume **semantic tokens only**.
+2. **One source of truth in W3C DTCG JSON.** Use the spec's `$value` / `$type` and `{dot.path}` references so any compliant tool (Style Dictionary v4+, Tokens Studio) can read it. No per-platform hand-edited files.
+   ```jsonc
+   // tokens/primitive/color.json
+   { "color": { "blue": { "500": { "$type": "color", "$value": "#2563EB" } } } }
+   // tokens/semantic/color.json  — alias, NOT a literal
+   { "color": { "intent": { "brand": { "$type": "color", "$value": "{color.blue.500}" } },
+                "bg":     { "surface": { "$type": "color", "$value": "{color.gray.50}" } } } }
+   ```
+   A semantic token whose `$value` is a literal hex is a bug — it must be a `{reference}`.
+3. **Theming = swap the semantic layer, never fork the palette.** Light, dark, and each brand are *alternate semantic files* pointing at the *same* primitives. One `primitive/` set; `semantic/light.json`, `semantic/dark.json`, `semantic/brand-acme.json`. Dark mode flips `bg.surface → gray.900` instead of `gray.50` — the primitives don't move. Never create `blue.500.dark`.
+4. **Author color in OKLCH so themes shift predictably.** Build scales in OKLCH (fall back to HSL only if tooling can't): equal lightness steps stay perceptually even and a brand hue rotation keeps contrast. Hardcoded hex per shade drifts. Emit hex/rgb as a *build output* for legacy targets, not as the source.
+5. **Cover every token category; color is the easy half.** Define tokens for all of the following, each with a valid DTCG `$type`: color (`color`), spacing/sizing (`dimension`), typography (`fontFamily`, `fontWeight`, `dimension` for font size and letter spacing, `number` for line height, or the composite `typography`), border radius (`dimension`), elevation (`shadow`), motion (`duration`/`cubicBezier`), and z-index (`number`). Derive primitives from a **base scale** (4px grid for spacing, a modular ratio for type); semantics name the use (`space.inline.sm`, `text.heading.lg`).
+6. **Export everything from one Style Dictionary config.** One source → many platforms, each with the right transform group and output format:
+   ```js
+   // style-dictionary.config.js  (v4 — ESM)
+   export default {
+     source: ['tokens/primitive/**/*.json', 'tokens/semantic/light.json'],
+     platforms: {
+       css:      { transformGroup: 'css', buildPath: 'build/css/',
+                   files: [{ destination: 'vars.css', format: 'css/variables',
+                             options: { outputReferences: true } }] }, // keeps var(--x) chains
+       tailwind: { transformGroup: 'js', buildPath: 'build/tw/',
+                   files: [{ destination: 'tokens.cjs', format: 'javascript/module-flat' }] },
+       ts:       { transformGroup: 'js', buildPath: 'build/ts/',
+                   files: [{ destination: 'tokens.ts', format: 'javascript/es6' }] },
+       ios:      { transformGroup: 'ios-swift', buildPath: 'build/ios/',
+                   files: [{ destination: 'Tokens.swift', format: 'ios-swift/class.swift' }] },
+       android:  { transformGroup: 'android', buildPath: 'build/android/',
+                   files: [{ destination: 'tokens.xml', format: 'android/resources' }] }
+     }
+   };
+   ```
+   Run `style-dictionary build`. For each extra theme, run the same config with `semantic/dark.json` swapped into `source` and scope output under `[data-theme="dark"]` (CSS `options.selector`).
+7. **Wire Tailwind to the generated tokens; do not retype them.** `tailwind.config` imports `build/tw/tokens.cjs` to populate `theme.colors/spacing/...`. For runtime theme switching, the theme entries must point at the CSS variables (for example `var(--color-bg-surface)`) rather than the flattened literals, so that Tailwind utilities resolve the variable and the `[data-theme]` attribute swaps which value it resolves to. One toggle, zero recompiled CSS.
+8. **Forbid raw values in app code with a linter.** Add `stylelint-declaration-strict-value` (web CSS) or an ESLint/lint rule that bans hex, `rgb(`, and bare `px` outside `tokens/` and `build/`. Raw values must fail CI, not slip through code review.
+9. **Govern it as a published API.** Fix a naming grammar `category.role.variant.state` (e.g. `color.bg.surface.hover`); semver the published token package (removed/renamed semantic token = **major**, added = minor, primitive value tweak = patch); keep a CHANGELOG; treat the `semantic` layer as the public API and primitives as private/internal.
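Steps 2 and 3 above can be illustrated with a small alias resolver. This is a minimal sketch, not how Style Dictionary implements resolution: the helpers `getPath`, `merge`, and `resolve` are hypothetical, and the gray hex values are placeholders chosen for the example. It assumes token names contain no dots and that every alias is a whole-string `{dot.path}` reference.

```javascript
// Sketch: resolve DTCG-style {dot.path} aliases over primitives plus ONE semantic theme.
// Helper names and the gray hex values are illustrative assumptions, not package content.

function getPath(tree, path) {
  return path.split(".").reduce((node, key) => (node ? node[key] : undefined), tree);
}

// Recursively merge plain objects (the shared primitives with one semantic theme file).
function merge(a, b) {
  const out = { ...a };
  for (const [key, value] of Object.entries(b)) {
    const bothObjects =
      value && typeof value === "object" && out[key] && typeof out[key] === "object";
    out[key] = bothObjects ? merge(out[key], value) : value;
  }
  return out;
}

// Follow "{a.b.c}" references until a literal is reached; throw on unresolved or cyclic aliases.
function resolve(tree, path, seen = new Set()) {
  if (seen.has(path)) throw new Error(`cyclic reference at ${path}`);
  seen.add(path);
  const token = getPath(tree, path);
  if (token === null || typeof token !== "object" || !("$value" in token)) {
    throw new Error(`unresolved reference: ${path}`);
  }
  const match = /^\{(.+)\}$/.exec(String(token.$value));
  return match ? resolve(tree, match[1], seen) : token.$value;
}

const primitives = {
  color: {
    gray: {
      "50": { $type: "color", $value: "#F9FAFB" },
      "900": { $type: "color", $value: "#111827" },
    },
    blue: { "500": { $type: "color", $value: "#2563EB" } },
  },
};

// Light and dark differ only in which primitive each alias targets.
const light = { color: { bg: { surface: { $type: "color", $value: "{color.gray.50}" } } } };
const dark = { color: { bg: { surface: { $type: "color", $value: "{color.gray.900}" } } } };

const lightTree = merge(primitives, light);
const darkTree = merge(primitives, dark);

console.log(resolve(lightTree, "color.bg.surface")); // #F9FAFB
console.log(resolve(darkTree, "color.bg.surface")); // #111827
```

Both trees share the same `primitives` object; only the alias target changes, which is the property step 3 asks for.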
+## Common Errors
+- **Components reading primitives** (`button { color: blue.500 }`). Dark mode and rebrand degrade to find-and-replace. Components must reference semantics only.
+- **Forking the palette per theme** (`blue.500.dark`). Palette count explodes and brands drift. Themes swap the *semantic* alias target; primitives are shared and immutable.
+- **Semantic tokens holding literal values** instead of `{references}`. The indirection is the entire point — a literal hex in a semantic token can't be retargeted by a theme.
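+- **`outputReferences: false` (the default) flattening CSS vars.** The build writes the resolved literal (`#2563EB`) into every variable, so the alias chain is lost in the output and overriding an upstream variable at runtime no longer cascades to the tokens that reference it. Set `options: { outputReferences: true }` so `var(--color-intent-brand)` chains survive.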
+- **Duplicating tokens into `tailwind.config` by hand.** They desync within the first week. Import the Style Dictionary build output; never maintain two sources.
+- **No grid/scale — arbitrary `13px`, `17px` primitives.** Defeats consistency. Primitives come from a 4px (or 8px) grid and a modular type ratio.
+- **Treating contrast as solved because colors are tokenized.** Tokens organize color; they don't guarantee `bg.surface`/`text.primary` meet 4.5:1. Run audit-accessibility-wcag on each theme.
+- **Component tokens for everything**, including parts that never override a semantic. Pure bloat. Add a component token only where it diverges from the semantic.
+- **Per-platform manual edits to `build/` outputs.** They're regenerated; your edit vanishes on the next build. Fix the source and rebuild.
+- **No versioning/changelog on the token package.** A renamed semantic token silently breaks every consumer. Semver it; a rename is a breaking (major) change.
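The "semantic tokens holding literal values" error above lends itself to an automated check. The following is a rough sketch under stated assumptions: `findLiteralSemantics` is a hypothetical helper, it expects a parsed semantic token tree, and it treats only whole-string `{reference}` values as aliases, so composite values (for example a shadow object with references inside) would need extra handling.

```javascript
// Sketch: list semantic tokens whose $value is a literal instead of a {reference}.
// findLiteralSemantics is a hypothetical helper name, not part of any token tool.

function findLiteralSemantics(node, prefix = []) {
  if (node === null || typeof node !== "object") return [];
  if ("$value" in node) {
    const isAlias = typeof node.$value === "string" && /^\{[^{}]+\}$/.test(node.$value);
    return isAlias ? [] : [prefix.join(".")];
  }
  // Skip $-prefixed metadata keys such as $type or $description on groups.
  return Object.entries(node).flatMap(([key, child]) =>
    key.startsWith("$") ? [] : findLiteralSemantics(child, [...prefix, key])
  );
}

const semantic = {
  color: {
    bg: { surface: { $type: "color", $value: "#FFFFFF" } }, // literal: a violation
    intent: { brand: { $type: "color", $value: "{color.blue.500}" } }, // alias: fine
  },
};

const violations = findLiteralSemantics(semantic);
console.log(violations); // [ 'color.bg.surface' ]
```

A check like this could run in CI next to the build, failing whenever the returned list is non-empty.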
+## Verify
+1. **Tier discipline:** `grep` app/component source — zero references to primitive names (`blue.500`, `space.4`) and zero raw hex/`rgb(`/bare `px`. Every match is a violation.
+2. **Aliases resolve:** every semantic `$value` is a `{reference}`, not a literal; `style-dictionary build` reports **0 unresolved references** and exits `0`.
+3. **One source, many outputs:** a single `style-dictionary build` produces CSS, Tailwind, TS, iOS, and Android artifacts from the same `tokens/` tree (no hand-edited platform file).
+4. **Theme swap is alias-only:** diff `semantic/light.json` vs `semantic/dark.json` — they differ only in reference *targets*; `primitive/` is byte-identical across themes. Adding a brand touches no primitive.
+5. **Runtime switch works:** toggling `[data-theme="dark"]` on the built CSS recolors the page with **no CSS recompile** (proves `outputReferences` chains survived).
+6. **Lint gate is live:** committing a raw `#fff` or `12px` in app code fails CI, not review.
+7. **Native parity:** the same semantic token (e.g. `color.intent.brand`) yields the same color in `build/css/vars.css`, `build/ios/Tokens.swift`, and `build/android/tokens.xml`.
+8. **Governance:** naming matches `category.role.variant.state`, the package carries a semver + CHANGELOG, and a token rename ships as a major bump.
+Done = one W3C-DTCG source builds all platforms with zero unresolved references, components reference semantics only (lint-enforced in CI), themes/brands swap via alias targets over shared immutable primitives, and runtime theme switching recolors with no recompile.
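The versioning rule from step 9 (removed or renamed semantic token = major, added = minor, primitive value tweak = patch) can be expressed as a small classifier. This is only a sketch: `classifyBump` is a hypothetical helper, it takes flat lists of semantic token names, and it assumes the caller already knows whether any primitive value changed. A rename shows up as one removal plus one addition and is therefore treated as major.

```javascript
// Sketch: map a token-set change to a semver bump, following the rule in step 9.
// classifyBump and its parameters are hypothetical names for illustration.

function classifyBump(oldNames, newNames, primitiveValuesChanged = false) {
  const oldSet = new Set(oldNames);
  const newSet = new Set(newNames);
  const removed = oldNames.filter((name) => !newSet.has(name));
  const added = newNames.filter((name) => !oldSet.has(name));

  if (removed.length > 0) return "major"; // removal or rename breaks consumers
  if (added.length > 0) return "minor"; // purely additive
  return primitiveValuesChanged ? "patch" : "none";
}

const before = ["color.bg.surface", "color.intent.brand"];

console.log(classifyBump(before, [...before, "color.bg.raised"])); // minor
console.log(classifyBump(before, ["color.bg.surface", "color.intent.primary"])); // major
console.log(classifyBump(before, before, true)); // patch
```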

package/skills/distributed-locks-leases/SKILL.md ADDED

@@ -0,0 +1,120 @@
+---
+name: distributed-locks-leases
+description: Implements distributed mutual exclusion and leader election correctly across processes/nodes — Redis `SET key token NX PX <ttl>` with a unique random token + Lua compare-and-delete unlock (never bare DEL), etcd/ZooKeeper/Consul leases (lease grant + TTL + keepAlive renewal, ephemeral znode + watch on predecessor for leader election), and Postgres advisory locks (`pg_advisory_lock`/`pg_try_advisory_xact_lock`) for single-DB serialization — while treating every lock as a LEASE that can expire mid-work, so safety rides on monotonic fencing tokens that the protected resource checks-and-rejects-stale (per Kleppmann's Redlock critique), never on the lock alone. Covers TTL sizing vs work duration, renewal/keepalive, the GC-pause/clock-skew expiry hazard, split-brain, and choosing idempotency or partitioning INSTEAD of a lock.
+when_to_use: You need only-one-runner-at-a-time across machines — a leader/singleton (cron that must not double-fire, one active scheduler/consumer), a critical section over a shared external resource (a row, a file, an API quota) spanning multiple nodes, leader election, or you're reaching for Redlock/`SETNX`/etcd leases/ZooKeeper. Distinct from async-concurrency-correctness (in-process mutexes/atomics/channels within ONE process — no network, no lease expiry) and idempotency-keys (the real safety net when the lock fails or expires — make the protected operation safe to repeat instead of/in addition to locking).
+---
+## When to Use
+Reach for this skill when you need **at most one actor running at a time across separate processes or machines**, coordinated through a shared store — and a second concurrent runner would corrupt state:
+- "Only one instance should run this cron / scheduler / migration / cleanup at a time"
+- "Elect a leader / single active consumer across N replicas" (active-passive failover)
+- "Two pods both processed the same job / both wrote the same file"
+- "Serialize edits to one row/aggregate/external resource across the cluster"
+- "I'm using Redis `SETNX` / Redlock / etcd lease / ZooKeeper ephemeral node for a lock"
+- "Hold a lock while I do work, renew it, and release it safely"
+- "The lock expired while my job was still running and another node started"
+NOT this skill:
+- A mutex/semaphore/atomic/channel **inside a single process** (Go `sync.Mutex`, Java `synchronized`/`ReentrantLock`, Python `Lock`, `asyncio` races) — no network, no TTL, no lease expiry → async-concurrency-correctness
+- Making the protected operation **safe to run twice** so a lock failure/expiry is harmless (dedup table, upsert, set-don't-increment) → idempotency-keys (this is the safety net BELOW the lock; prefer it over a lock when you can)
+- Throttling request *rate* (token bucket / sliding window), not exclusivity → rate-limiting
+- Worker pool, job dispatch, DLQ, poison-message handling, exactly-once consumer semantics → message-queue-jobs
+- Optimistic concurrency on a single DB row (`WHERE version = N` / `If-Match`/ETag, no separate lock service) → idempotency-keys (by-design) / db-migration-safety for schema
+- Timeouts, retries, backoff, circuit breakers around the locked call → resilience-timeouts-retries
+- Saga/state-machine coordination of a long multi-step workflow → design-state-machine / orchestrate-agent-workflow
+## Steps
+1. **First ask: do you actually need a distributed lock? Usually you don't.** A lock is a liveness/correctness liability (a held-but-dead lock stalls everyone; an expired one breaks mutual exclusion). Prefer, in order:
+   | Instead of a lock | Technique | Why it's better |
+   |---|---|---|
+   | **Idempotency** | make the op safe to repeat (upsert, set-don't-increment, dedup key) → idempotency-keys | concurrent runs are *harmless*, not *prevented* — no expiry hazard at all |
+   | **Partitioning** | shard work by key (Kafka partition, consistent-hash, `id % N`) so each key has exactly one owner | structural single-ownership, no shared lock at all |
+   | **Single-DB serialization** | `SELECT ... FOR UPDATE` / unique constraint / `INSERT ... ON CONFLICT` / advisory lock (step 6) | the DB transaction *is* the lock, with real ACID guarantees |
+   | **A queue / leader-elected scheduler** | one consumer per partition; framework-provided leader election (k8s `Lease`, Raft) | offloads the hard part to a tested system |
+   Use a distributed lock only for **efficiency** (avoid duplicate work, where a rare double-run is *tolerable*) — NOT as your sole correctness guarantee. For correctness you also need step 4 (fencing) or idempotency.
+2. **Treat every lock as a LEASE: it auto-expires after a TTL, and it can expire WHILE you still think you hold it.** This is the central hazard. A lock without a TTL deadlocks the whole system if the holder crashes; a lock with a TTL can expire mid-work (GC pause, CPU starvation, slow I/O, network partition, VM freeze) — then the store hands the lock to node B while node A, paused, *believes* it still holds it and resumes writing. Two writers, one lock. Conclusions that follow:
+   - Always set a TTL (no infinite locks).
+   - TTL alone is never sufficient for correctness — you must also fence (step 4) or be idempotent (step 1).
+   - Pick TTL ≥ p99 work duration + safety margin; renew (step 5) for long work rather than setting a huge TTL.
+3. **Redis single-node lock — acquire with a unique token, release with compare-and-delete (Lua), never bare `DEL`.** Use one atomic command and a per-acquisition random token so only the owner can unlock:
+   ```
+   # acquire — NX = only if absent, PX = TTL in ms, token = unique per acquisition (uuid/16 random bytes)
+   SET resource_lock <token> NX PX 30000
+   ```
+   ```lua
+   -- release — DELETE ONLY IF the value is still OUR token (compare-and-delete, atomic)
+   if redis.call("GET", KEYS[1]) == ARGV[1] then
+     return redis.call("DEL", KEYS[1])
+   else return 0 end
+   ```
+   - **Never** `SETNX` + separate `EXPIRE` (non-atomic: crash between them = a lock that never expires). Use `SET ... NX PX` in one call.
+   - **Never** a bare `DEL resource_lock` to release: if your lease already expired and B re-acquired, your `DEL` deletes *B's* lock. The token check prevents that.
+   - **Redlock (multi-node) is contested; default to single-node + fencing.** Kleppmann's critique ("How to do distributed locking", 2016): Redlock depends on timing assumptions (bounded clock drift, process pauses, and network delay) that it cannot guarantee, so it is needlessly heavyweight for efficiency-only locks and not safe enough for correctness-critical ones. Antirez disputes the framing, but the practical takeaway holds: **do not rely on any timing-based lock (Redlock included) for correctness; fence the resource (step 4).** Use single-node Redis for the cheap mutual-exclusion-for-efficiency case; reach for a consensus store (step 7) when you need real leader election.
+4. **Fencing tokens: the only thing that makes a lease-based lock SAFE. The protected resource must reject stale writers.** On every acquisition, get a **monotonically increasing** token (the "fence"). Pass it with every write to the protected resource. The resource stores the highest token it has accepted and **rejects any write carrying a lower token** (an equal token is the current holder writing again). Now a paused node A (token 33) that wakes after B acquired (token 34) gets its write rejected: mutual exclusion is enforced *at the resource*, independent of who "thinks" they hold the lock.
+   ```
+   client A acquires → fence=33 → write(x, fence=33)   accepted, resource now at 33
+   A pauses; lease expires; B acquires → fence=34 → write(y, fence=34)   accepted, resource at 34
+   A resumes, still "holds" lock → write(z, fence=33)   REJECTED (33 ≤ 34)
+   ```
+   - Source of monotonic tokens: ZooKeeper `zxid`/znode version, etcd key `mod_revision` / a `CreateRevision`-based counter, Redis `INCR fence_counter` (single-node only — multi-node Redis can't guarantee monotonicity), or a DB sequence.
+   - The resource MUST participate — if your storage/API can't check-and-reject a token (e.g. a dumb blob store), fencing is impossible and you fall back to idempotency (step 1). Many real systems can't fence; that's exactly why idempotency is the more robust default.
+5. **Long work: renew (keepalive) instead of guessing a huge TTL — and abort if renewal fails.** For work that may exceed the TTL, run a watchdog that re-extends the lease at ~TTL/3:
+   - Redis: a Lua `PEXPIRE` guarded by the same token check (extend only if still ours).
+   - etcd: `LeaseKeepAlive` stream; ZooKeeper: session heartbeats keep the ephemeral node alive; Consul: session renew before TTL.
+   - **Critical:** if a renewal FAILS or is late, you may have already lost the lease — **stop doing work immediately** (cancel the in-flight operation), don't blindly continue. The renewer and the worker must share a cancellation signal (context/CancellationToken). A renew thread that keeps extending after the worker is wedged is also a bug (it masks a stuck holder).
+6. **Postgres advisory locks — the right tool when one Postgres is your coordination point.** No extra infra; the lock lives in the DB you already trust:
+   | Function | Scope | Released by | Use for |
+   |---|---|---|---|
+   | `pg_advisory_lock(key)` | **session** | explicit `pg_advisory_unlock` or session end | held across transactions; must release manually (leaks if connection pooled + forgotten) |
+   | `pg_advisory_xact_lock(key)` | **transaction** | automatically at COMMIT/ROLLBACK | **preferred** — no manual release, no leak; held only for the txn |
+   | `pg_try_advisory_lock(key)` | session, **non-blocking** | as above | returns `true/false` instantly — "skip if someone else has it" (e.g. cron singleton) |
+   - Key is a `bigint` (or two `int4`s) — hash your logical name: `pg_try_advisory_xact_lock(hashtext('nightly-report'))`. Beware `hashtext` collisions; use a deliberate keyspace for unrelated locks.
+   - **Advisory locks are NOT enforced by the data** — they're cooperative; only code that *also* takes the lock is excluded. They don't lock rows. For row-level exclusion use `SELECT ... FOR UPDATE` instead.
+   - **Pooling gotcha:** with a transaction pooler (PgBouncer `transaction` mode), session-level advisory locks break (different backend per statement). Use `*_xact_lock` or a `session` pool.
+7. **etcd / ZooKeeper / Consul — when you need real leader election and consensus.** These are CP (consistent under partition) consensus stores; use them when a *rare* double-leader is unacceptable:
+   - **etcd:** `Lease` (grant TTL) + a key written with that lease; election via the `concurrency.Election` API (campaign → leader holds key until lease lapses or it resigns). `mod_revision` gives you a fencing token for free.
+   - **ZooKeeper:** create an **ephemeral sequential** znode; the lowest sequence number is leader; each node **watches only its immediate predecessor** (not all nodes — avoids the herd effect). On predecessor delete, re-check if you're now lowest. Ephemeral = auto-removed on session loss → automatic failover. The Curator `LeaderLatch`/`InterProcessMutex` recipes implement this correctly; prefer them over hand-rolling.
+   - **Consul:** session + KV `acquire` flag; session TTL + health check ties lock liveness to the holder's health.
+   - **Even here, fence.** Consensus guarantees agreement on *who holds the lease*, but a GC-paused leader still doesn't know its lease lapsed — the resource must still reject its stale-token writes (step 4). Consensus narrows the window; it doesn't remove the mid-work-expiry hazard.
+8. **Defend against split-brain and clock skew.** Two nodes both believing they're leader = split-brain. Mitigations: a single consensus source of truth (don't run two independent lock services); fencing tokens so even a split-brain second writer is rejected at the resource; **never trust wall-clock time for lease math across nodes** — use the lock service's own expiry, and within a node use a *monotonic* clock (`CLOCK_MONOTONIC`, `time.monotonic()`, `Instant`/`System.nanoTime`) for "have I exceeded my budget?" since NTP steps and VM time-warps corrupt wall-clock deltas. Assume your process can pause arbitrarily long between any two lines (GC, OS scheduler, live-migration).
+## Common Errors
+- **No TTL → permanent deadlock on crash.** A holder dies, the lock is held forever, the system stalls. Fix: always set a TTL; renew for long work (step 5).
+- **TTL but no fencing → silent double-write on mid-work expiry.** The lock expires during a GC pause, B acquires, A resumes and writes. Fix: monotonic fencing token rejected at the resource (step 4), or make the op idempotent (step 1).
+- **`SETNX` then separate `EXPIRE`.** Crash between the two leaves a lock with no expiry = deadlock. Fix: single atomic `SET key token NX PX <ttl>`.
+- **Releasing with bare `DEL` / no owner check.** If your lease already expired and someone re-acquired, you delete *their* lock. Fix: Lua compare-and-delete on your unique token.
+- **Reusing a constant lock value.** Without a per-acquisition random token you can't tell your lock from a successor's — unlock and renew both become unsafe. Fix: fresh uuid/random token each acquire.
+- **Trusting Redlock (or any timing lock) for correctness.** Bounded-clock/bounded-pause assumptions don't hold. Fix: single-node for efficiency-only; fencing/consensus for correctness (steps 3, 4, 7).
+- **Renewal failure ignored.** The watchdog can't renew but the worker keeps writing without the lease. Fix: failed/late renew → cancel the work immediately via a shared cancellation signal.
+- **Session-level `pg_advisory_lock` behind a transaction pooler.** Different backend per statement → lock acquired on one connection, never released / not visible. Fix: `pg_advisory_xact_lock`, or a session-mode pool.
+- **Forgetting to release a session advisory lock.** Leaks until the connection dies; with pooling that connection is reused holding the lock. Fix: prefer `*_xact_lock` (auto-release at txn end).
+- **Using a distributed lock where idempotency/partitioning was the right tool.** You inherit the whole expiry/split-brain failure surface for no reason. Fix: revisit step 1 — can the op be idempotent or key-partitioned instead?
+- **Wall-clock lease math across nodes.** NTP steps / VM time-warps make "is my lease still valid?" wrong. Fix: trust the lock service's expiry; use a monotonic clock for local budget checks.
+- **Watching all nodes in ZooKeeper leader election (herd effect).** Every change wakes every node. Fix: ephemeral-sequential + watch only your immediate predecessor (or use Curator recipes).
+## Verify
+1. **Mutual exclusion under contention:** spawn N nodes/goroutines racing for the same lock against the *real* shared store; assert exactly one holds it at any instant (e.g. each increments a shared counter inside the section and the section must never overlap — verified with a sentinel that fails if two enter).
+2. **Crash releases the lock:** kill the holder mid-section; another node acquires within ~TTL, once the lease expires. It must not be never (a permanent deadlock, meaning the TTL is missing) and must not be instant (meaning the lock was not actually excluding anyone).
+3. **Fencing rejects the stale writer:** simulate the Kleppmann scenario — A acquires (fence 33), pause A, let the lease expire, B acquires (fence 34) and writes, then resume A's write with fence 33 → the resource **rejects** it. Without fencing, this is the test that exposes the double-write.
+4. **Atomic acquire:** the acquire path is a single `SET NX PX` (or equivalent) — grep shows no `SETNX`+`EXPIRE` two-step and no infinite/missing TTL.
+5. **Safe release:** the unlock only deletes when the stored token matches (Lua/compare-and-delete); a test where the lease expired and was re-acquired confirms the old holder's release does NOT remove the new holder's lock.
+6. **Renewal + abort:** for long work, the lease is extended at ~TTL/3 while the token still matches; inject a renewal failure and assert the worker *cancels* rather than continuing without the lease.
+7. **Advisory-lock leak/pooling check:** advisory locks are `*_xact_lock` (or explicitly unlocked) and behave correctly under the actual connection-pool mode; `pg_locks` shows no orphaned advisory locks after the txn ends.
+8. **Leader election failover:** kill the leader; a new leader is elected within the session/lease TTL; assert there is never *zero* leader for long nor *two* leaders simultaneously (split-brain) — and that a deposed leader's writes are fenced out.
+9. **Default-choice justification:** confirm a distributed lock is genuinely needed — document why idempotency (idempotency-keys) or partitioning couldn't replace it; if the lock is correctness-critical, fencing or idempotency is present, not the lock alone.
+Done = at most one actor runs at a time under real contention, every lock has a TTL and crash-frees within it, mid-work expiry cannot cause a double effect because the resource rejects stale fencing tokens (or the op is idempotent), acquire/release/renew are atomic and owner-checked, advisory locks are pool-safe and leak-free, leader election survives failover without split-brain, and the choice of a lock over idempotency/partitioning is deliberate — all proven by the contention, crash, and fencing tests in checks 1–8.

package/skills/encrypt-sensitive-data/SKILL.md ADDED

@@ -0,0 +1,148 @@
+---
+name: encrypt-sensitive-data
+description: Encrypts sensitive data at rest, in transit, and per-field using AEAD-only ciphers (AES-256-GCM or ChaCha20-Poly1305 — never ECB, never unauthenticated CBC, never raw RSA) — envelope encryption where a KMS-held KEK wraps a per-record/per-tenant DEK, per-column field encryption for PII with deterministic-vs-randomized chosen per query need, strict unique-nonce/IV discipline (random 96-bit or counter, NEVER reused under one key), AAD binding ciphertext to its context (tenant/row id), versioned keys + rotation that re-wraps DEKs without re-encrypting data, TLS 1.2+/1.3 with mTLS and modern cipher suites, and — critically — passwords are HASHED with argon2id/bcrypt, NOT encrypted. Distinct from secrets-management (stores the app secrets/keys this skill consumes) and map-privacy-data-gdpr (the legal PII/erasure obligations encryption helps satisfy).
+when_to_use: You must protect sensitive data — encrypting PII/PHI/card data at rest, a per-column/field-level encryption scheme, envelope encryption with a KMS (AWS KMS/GCP KMS/Vault Transit), key rotation, choosing a cipher/mode/nonce strategy, enforcing TLS/mTLS, or hashing passwords. Distinct from secrets-management (storing and injecting the KEKs/API keys/credentials — that skill provisions the keys; this one uses them to encrypt data) and map-privacy-data-gdpr (the legal classification/erasure/residency duties that encryption and crypto-shredding help you meet).
+---
+## When to Use
+Reach for this skill when the task is making sensitive *data* cryptographically protected — at rest, in transit, or field-by-field:
+- "Encrypt SSNs / card numbers / health records / PII columns in the database"
+- "Set up envelope encryption with AWS KMS / GCP KMS / Vault Transit (DEK + KEK)"
+- "Rotate our encryption keys" / "we need versioned keys without re-encrypting everything"
+- "Which cipher/mode — is AES-CBC okay? do we need a separate MAC? what nonce?"
+- "Enforce TLS 1.3 / mutual TLS between services with modern cipher suites"
+- "Are we storing passwords correctly?" (hash, don't encrypt)
+- "Make a user's data unrecoverable on account deletion" (crypto-shredding)
+NOT this skill:
+- Storing/injecting the KEKs, API keys, DB creds, and `.env` material this skill *consumes* → secrets-management (it provisions and rotates the secrets; this skill encrypts data *with* them)
+- The legal side — what counts as PII/PHI, lawful basis, right-to-erasure, data residency → map-privacy-data-gdpr (this skill is the *technical control*, e.g. crypto-shredding, that satisfies those duties)
+- TLS *termination/cert issuance* at the edge proxy, ACME, SNI routing → configure-dns-tls and configure-reverse-proxy-lb (this skill covers the cipher-suite/mTLS *policy*, not cert plumbing)
+- Browser security response headers (HSTS, CSP) → configure-security-headers-csp (HSTS *enforces* HTTPS; this skill is the transport crypto itself)
+- Login sessions, JWT signing/verification, token rotation → auth-jwt-session (signatures/JWE are adjacent but that owns session lifecycle)
+- Identifying the threats/attacker model that justify these controls → threat-model-stride
+- A broad security pass over a diff → security-review (this skill is the deep crypto specialist it defers to)
+## Steps
+1. **Classify data first, then pick the protection tier; encryption is not the answer to everything.** Four distinct goals need four different tools:
+   | Goal | Use | NEVER |
+   |---|---|---|
+   | Verify a credential later (passwords) | **slow password hash** (argon2id) — one-way | encrypt; never decrypt a password |
+   | Protect data you must read back (PII, PHI, PAN, tokens) | **AEAD encryption** + KMS envelope | reversible "encoding", base64, ROT |
+   | Integrity/origin without secrecy | HMAC-SHA-256 / signature | "encrypt to authenticate" |
+   | Index/search without revealing value | HMAC-based blind index or deterministic enc | plaintext index column |
+   Encrypting a password is a **bug**, not a feature: anything reversible means an attacker (or insider) with the key gets every plaintext password.
+2. **Use AEAD ciphers only. Banned modes are non-negotiable.** Authenticated Encryption with Associated Data gives confidentiality *and* tamper-detection in one primitive:
+   | Use this | Why |
+   |---|---|
+   | **AES-256-GCM** | hardware-accelerated (AES-NI), NIST-approved, ubiquitous KMS support |
+   | **ChaCha20-Poly1305** | faster on CPUs without AES-NI (mobile/ARM), constant-time by design |
+   | **AES-256-GCM-SIV / XChaCha20-Poly1305** | nonce-misuse-resistant / 192-bit nonce — prefer when you can't guarantee unique 96-bit nonces |
+   | Banned | Why it's broken |
+   |---|---|
+   | **ECB** | identical plaintext blocks → identical ciphertext (the "ECB penguin"); leaks structure |
+   | **CBC/CTR without a MAC** | unauthenticated → padding-oracle (CBC) & bit-flipping attacks; ciphertext is malleable |
+   | **Raw RSA / RSA-PKCS#1v1.5 enc** | use RSA-OAEP, or better ECIES/hybrid; never "RSA the whole payload" |
+   | DES/3DES/RC4/MD5/SHA-1 | broken/deprecated |
+   Don't hand-roll "AES + separate HMAC" (encrypt-then-MAC) unless you must — get the construction order wrong and you reintroduce the oracle. Use a vetted library: **libsodium** (`crypto_aead_*` / `secretbox`), **Go** `crypto/cipher` GCM or `nacl/secretbox`, **Python** `cryptography` `AESGCM`/`ChaCha20Poly1305` (not the low-level `Cipher` API), **Java** `javax.crypto` GCM or Google **Tink**, **Rust** `aes-gcm`/`chacha20poly1305` RustCrypto crates, **Node** `crypto.createCipheriv('aes-256-gcm', …)` + `getAuthTag()`. **Tink/libsodium are the senior default** — they pick safe modes and manage nonces for you.
+3. **Nonce/IV discipline: unique per (key, message), forever. This is the #1 way AEAD fails.** GCM with a **repeated nonce under the same key is catastrophic** — it leaks the XOR of plaintexts *and* the authentication key (forgery). Rules:
+   - 96-bit (12-byte) nonce for GCM. Either **random** from a CSPRNG (`os.urandom`/`crypto.randomBytes`/`getrandom`) or a **monotonic counter** — never both, never `0`, never a timestamp, never reuse.
+   - Random 96-bit nonces are safe only up to **~2³² messages per key** (birthday bound). High-volume? Rotate the DEK sooner, or switch to **XChaCha20-Poly1305** (192-bit nonce, so random collisions become negligible) or **AES-GCM-SIV** (nonce-misuse-resistant, so it tolerates accidental reuse).
+   - **Store the nonce alongside the ciphertext** (it's not secret) — typical record = `version ‖ nonce ‖ ciphertext ‖ tag`.
+   - Don't derive the nonce from the plaintext or a non-unique field. Don't reuse one nonce across a re-encrypt.
+4. **Bind ciphertext to its context with AAD (Associated Data).** AEAD lets you authenticate (not encrypt) extra context — pass the **row id / tenant id / column name / key version** as AAD. This stops an attacker from copying a valid ciphertext from row A into row B (ciphertext substitution): decryption of B fails because the AAD no longer matches. AAD must be reconstructible at decrypt time from the record's own metadata.
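Steps 3 and 4 can be sketched with Node's built-in `crypto` module. This is an illustration under stated assumptions, not the package's code: the one-byte `VERSION` prefix, the `seal`/`open`/`tryOpen` names, and the `tenant:…|row:…` AAD string are all hypothetical. The record layout follows the `version ‖ nonce ‖ ciphertext ‖ tag` shape described above.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const VERSION = 1; // hypothetical key_version marker stored in the first byte

// Record layout: version (1 byte) ‖ nonce (12) ‖ ciphertext ‖ tag (16).
function seal(key: Buffer, plaintext: string, aad: string): Buffer {
  const nonce = randomBytes(12); // fresh CSPRNG nonce for every message
  const cipher = createCipheriv("aes-256-gcm", key, nonce);
  cipher.setAAD(Buffer.from(aad, "utf8")); // authenticated, not encrypted
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([Buffer.from([VERSION]), nonce, ciphertext, cipher.getAuthTag()]);
}

function open(key: Buffer, record: Buffer, aad: string): string {
  if (record.readUInt8(0) !== VERSION) throw new Error("unknown key version");
  const nonce = record.subarray(1, 13);
  const ciphertext = record.subarray(13, record.length - 16);
  const tag = record.subarray(record.length - 16);
  const decipher = createDecipheriv("aes-256-gcm", key, nonce);
  decipher.setAAD(Buffer.from(aad, "utf8"));
  decipher.setAuthTag(tag);
  // final() throws if the tag, the ciphertext, or the AAD does not match.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Convenience wrapper for the demo: null instead of an exception on failure.
function tryOpen(key: Buffer, record: Buffer, aad: string): string | null {
  try {
    return open(key, record, aad);
  } catch {
    return null;
  }
}

const key = randomBytes(32); // stand-in for an unwrapped DEK
const record = seal(key, "123-45-6789", "tenant:42|row:7");
const sameRow = tryOpen(key, record, "tenant:42|row:7");
const otherRow = tryOpen(key, record, "tenant:42|row:8"); // ciphertext moved to another row

const tampered = Buffer.from(record);
tampered.writeUInt8(tampered.readUInt8(13) ^ 0x01, 13); // flip one ciphertext bit
const tamperedResult = tryOpen(key, tampered, "tenant:42|row:7");
console.log(sameRow, otherRow, tamperedResult);
```

Decryption succeeds only with the original AAD. Both the substituted row and the flipped bit fail authentication rather than returning garbage plaintext.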
+5. **Envelope encryption: a KMS-held KEK wraps per-record/per-tenant DEKs. Never encrypt bulk data directly with the KMS key.** The pattern that scales and rotates cleanly:
+   ```
+   1. KMS.GenerateDataKey(KeyId=KEK, KeySpec=AES_256)
+        → returns { Plaintext DEK, Encrypted DEK (wrapped by KEK) }
+   2. Encrypt your data locally with the plaintext DEK (AES-256-GCM, fresh nonce)
+   3. Store: encrypted_dek ‖ key_version ‖ nonce ‖ ciphertext ‖ tag
+   4. ZERO the plaintext DEK from memory immediately after use
+   5. Decrypt: KMS.Decrypt(encrypted_dek) → plaintext DEK → local AEAD decrypt
+   ```
+   - **KEK** lives in **AWS KMS / GCP KMS / Azure Key Vault / Vault Transit / an HSM** and *never leaves it* — KMS does the wrap/unwrap, your app never sees KEK bytes. **DEK** is short-lived in app memory, zeroed after use.
+   - Granularity: **per-tenant or per-record DEK** for crypto-shredding (delete the DEK → that data is gone). Per-row is most flexible; cache the unwrapped DEK briefly (e.g. LRU with TTL) to avoid a KMS call per row.
+   - Tools: AWS **KMS** + the **AWS Encryption SDK** (handles the envelope + nonce for you), GCP **KMS**, HashiCorp **Vault Transit** (`vault write transit/encrypt/...` — Vault holds the key, returns ciphertext), or **Tink**'s `KmsEnvelopeAead`. Prefer these over rolling your own envelope.
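The envelope flow in step 5 can be sketched end to end. `FakeKms` below is a loudly hypothetical in-process stand-in for a real KMS, used only so the example runs on its own. In production the wrap and unwrap calls go to the provider, and the tools listed above are preferable to hand-rolled code. The helper names and the fixed `keyVersion: 1` are likewise assumptions.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Minimal AES-256-GCM helpers; sealed layout is nonce (12) ‖ ciphertext ‖ tag (16).
function gcmSeal(key: Buffer, plaintext: Buffer): Buffer {
  const nonce = randomBytes(12);
  const c = createCipheriv("aes-256-gcm", key, nonce);
  const body = Buffer.concat([c.update(plaintext), c.final()]);
  return Buffer.concat([nonce, body, c.getAuthTag()]);
}

function gcmOpen(key: Buffer, sealed: Buffer): Buffer {
  const d = createDecipheriv("aes-256-gcm", key, sealed.subarray(0, 12));
  d.setAuthTag(sealed.subarray(sealed.length - 16));
  return Buffer.concat([d.update(sealed.subarray(12, sealed.length - 16)), d.final()]);
}

// HYPOTHETICAL stand-in for a KMS. The KEK is private to this class, mirroring
// the rule that KEK bytes never reach application code.
class FakeKms {
  private readonly kek = randomBytes(32);

  generateDataKey(): { plaintextDek: Buffer; encryptedDek: Buffer } {
    const plaintextDek = randomBytes(32);
    return { plaintextDek, encryptedDek: gcmSeal(this.kek, plaintextDek) };
  }

  decrypt(encryptedDek: Buffer): Buffer {
    return gcmOpen(this.kek, encryptedDek);
  }
}

interface EnvelopeRecord {
  encryptedDek: Buffer;
  keyVersion: number;
  sealed: Buffer;
}

function encryptRecord(kms: FakeKms, plaintext: string): EnvelopeRecord {
  const { plaintextDek, encryptedDek } = kms.generateDataKey();
  try {
    const sealed = gcmSeal(plaintextDek, Buffer.from(plaintext, "utf8"));
    return { encryptedDek, keyVersion: 1, sealed };
  } finally {
    plaintextDek.fill(0); // best-effort zeroing of the plaintext DEK
  }
}

function decryptRecord(kms: FakeKms, record: EnvelopeRecord): string {
  const dek = kms.decrypt(record.encryptedDek);
  try {
    return gcmOpen(dek, record.sealed).toString("utf8");
  } finally {
    dek.fill(0);
  }
}

const kms = new FakeKms();
const stored = encryptRecord(kms, "4111 1111 1111 1111");
const roundTrip = decryptRecord(kms, stored);
console.log(roundTrip);
```

Only the wrapped DEK, the key version, and the sealed payload are persisted. Discarding `encryptedDek` for a record is what makes crypto-shredding possible.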
+6. **Per-field/column encryption for PII — choose deterministic vs randomized by query need.** Application-layer (encrypt before the DB sees it) beats trusting only DB-native TDE, because TDE protects the *disk file*, not a SQL-injection or a DBA reading rows.
+   | Mode | Same plaintext → | Lets you | Cost |
+   |---|---|---|---|
+   | **Randomized** (fresh nonce) | different ciphertext | only decrypt-then-use | leaks nothing; **default for PII** |
+   | **Deterministic** (synthetic IV / SIV) | same ciphertext | equality lookup, joins, unique constraint | leaks equality (which rows share a value) |
+   For *searchable* encryption use a **blind index**: store `HMAC-SHA256(key, normalize(value))` in a separate indexed column and query by that, keeping the value column randomized-encrypted. Don't reach for order-preserving/fully-homomorphic encryption (leaky / impractical) unless you truly understand the tradeoff. Postgres `pgcrypto` is fine for small cases, but the key travels inside the SQL statement and can end up in query logs, so prefer encrypting in the app. **Don't encrypt a column you need range-query or `LIKE` on** without redesigning the access pattern first.
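A blind index as described in step 6 might look like the sketch below. The normalisation rule (trim plus lowercase) is an assumption that suits email-like values and would need adjusting per field. `indexKey` is a random stand-in for a dedicated key that would normally come from the KMS or secret store.

```typescript
import { createHmac, randomBytes } from "node:crypto";

// Assumed normalisation: trim + lowercase. Other fields may need other rules.
function normalize(value: string): string {
  return value.trim().toLowerCase();
}

// Keyed hash stored in a separate indexed column; the value column itself
// stays randomized-encrypted.
function blindIndex(indexKey: Buffer, value: string): string {
  return createHmac("sha256", indexKey).update(normalize(value), "utf8").digest("hex");
}

const indexKey = randomBytes(32); // stand-in; use a dedicated key, not the DEK
const storedIndex = blindIndex(indexKey, "Alice@Example.com ");
const lookupIndex = blindIndex(indexKey, "alice@example.com");
const otherIndex = blindIndex(indexKey, "bob@example.com");
console.log(storedIndex === lookupIndex, storedIndex === otherIndex);
```

An equality lookup then becomes a comparison on the index column. Like deterministic encryption, the index still reveals which rows share a value.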
+7. **Passwords: hash with a memory-hard KDF, salted and parameterized — never encrypt, never plain SHA-256.** Use:
+   | Algorithm | Params (2025 baseline) |
+   |---|---|
+   | **argon2id** (first choice) | m=19–64 MiB, t=2–3, p=1; OWASP min m=19 MiB,t=2,p=1 |
+   | **scrypt** | N=2^17, r=8, p=1 (or N=2^15 for lighter) |
+   | **bcrypt** (legacy/compat) | cost ≥ 12; **pre-hash with SHA-256 + base64** if password may exceed 72 bytes (bcrypt silently truncates) |
+   - A **per-password random salt** is mandatory (the libraries generate and embed it in the encoded hash — `$argon2id$v=19$m=...`). No global "pepper-as-salt".
+   - **Pepper** (optional, defense-in-depth) = a secret key *not* in the DB; either HMAC the password before hashing or keep it in a KMS/HSM. Store the pepper in secrets-management, never beside the hash.
+   - **Never** use fast hashes (MD5, SHA-1, SHA-256, SHA-512) bare for passwords — GPUs do billions/sec. **Never** encrypt passwords (reversible = breach of all of them).
+   - Verify in **constant time** (the KDF's `verify`/`checkpw` does this); re-hash on login if cost params have since increased.
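The hashing rules in step 7 can be illustrated with scrypt, chosen here only because it ships with Node's `crypto` module. argon2id through a vetted library remains the first choice named in the table. The `scrypt$N$r$p$salt$hash` encoding is a hypothetical format for this sketch, and the parameters are the lighter `N=2^15` setting.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

const PARAMS = { N: 2 ** 15, r: 8, p: 1 }; // lighter setting from the table
// Node's default maxmem (32 MiB) sits right at the limit for these
// parameters, so the cap is raised explicitly.
const MAXMEM = 64 * 1024 * 1024;

// Hypothetical encoding: scrypt$N$r$p$salt(base64)$hash(base64).
function hashPassword(password: string): string {
  const salt = randomBytes(16); // per-password random salt
  const hash = scryptSync(password, salt, 32, { ...PARAMS, maxmem: MAXMEM });
  return ["scrypt", PARAMS.N, PARAMS.r, PARAMS.p,
    salt.toString("base64"), hash.toString("base64")].join("$");
}

function verifyPassword(password: string, encoded: string): boolean {
  const parts = encoded.split("$");
  if (parts.length !== 6 || parts[0] !== "scrypt") return false;
  const salt = Buffer.from(parts[4] ?? "", "base64");
  const expected = Buffer.from(parts[5] ?? "", "base64");
  if (expected.length === 0) return false;
  // Recompute with the parameters stored in the hash, not the current defaults.
  const actual = scryptSync(password, salt, expected.length, {
    N: Number(parts[1]), r: Number(parts[2]), p: Number(parts[3]), maxmem: MAXMEM,
  });
  return timingSafeEqual(actual, expected); // constant-time comparison
}

const encoded = hashPassword("correct horse battery staple");
const okLogin = verifyPassword("correct horse battery staple", encoded);
const badLogin = verifyPassword("Tr0ub4dor&3", encoded);
console.log(okLogin, badLogin);
```

Because the parameters travel with the hash, a login can detect an outdated cost setting and re-hash with the current one.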
+8. **TLS in transit: 1.2 minimum, 1.3 preferred; modern cipher suites; mTLS for service-to-service.**
+   - **Versions:** disable SSLv3/TLS 1.0/1.1 entirely. Allow **TLS 1.2 + 1.3**; prefer 1.3 (1-RTT, AEAD-only, forward-secret by construction).
+   - **TLS 1.2 cipher suites** (AEAD + ECDHE forward secrecy only): `ECDHE-ECDSA-AES128-GCM-SHA256`, `ECDHE-RSA-AES256-GCM-SHA384`, `ECDHE-*-CHACHA20-POLY1305`. **No** CBC suites, no static RSA key exchange, no `NULL`/`RC4`/`3DES`/`EXPORT`. TLS 1.3 only offers AEAD suites, so the choice is made for you.
+   - Mozilla SSL Config "**Intermediate**" is the safe default; "Modern" = TLS 1.3-only. Verify with **`testssl.sh`** or SSL Labs (target **A/A+**). Enable **HSTS** at the edge (handoff to configure-security-headers-csp).
+   - **mTLS** for internal/service-to-service: both sides present certs; pin to your CA, short-lived certs (SPIFFE/SVID, Istio, Linkerd, or a service-mesh issuer). Validate the **full chain + SAN**, not just "a cert was presented."
+   - **Never disable cert verification** (`verify=False`, `rejectUnauthorized:false`, `InsecureSkipVerify:true`) outside a throwaway test: the connection stays encrypted but unauthenticated, so any man-in-the-middle can read and alter the traffic.
+9. **Key rotation with versioned keys — rotate the KEK cheaply, re-wrap DEKs, lazy-re-encrypt data.** Store a **`key_version`** with every ciphertext so multiple key generations coexist:
+   - **KEK rotation** (cheapest, do on schedule, e.g. annually or per policy): KMS rotates the KEK; you **re-wrap each DEK** (decrypt-unwrap with old, wrap with new). Bulk data is *untouched* — that's the whole point of envelope encryption.
+   - **DEK rotation:** generate a new DEK, **re-encrypt the affected records lazily** (on next write, or a background backfill) and bump `key_version`. Keep old key versions readable until backfill completes, then retire.
+   - **On compromise:** rotate immediately and force re-encryption; **crypto-shred** by destroying a DEK to make its data permanently unrecoverable (the GDPR-erasure trick — handoff to map-privacy-data-gdpr).
+   - Decrypt path must **dispatch on the stored `key_version`**; never assume "current key." Keep a registry of retired versions for audit.
+10. **Operational hygiene — the parts that get forgotten.** Generate all keys/nonces/salts from a **CSPRNG** (`os.urandom`, `crypto.randomBytes`, `getrandom(2)`, `SecureRandom`) — never `Math.random`/`rand()`/`mt19937`. **Zero plaintext keys** from memory after use where the language allows (Go `defer` wipe, Rust `zeroize`, libsodium `sodium_memzero`). Don't log plaintext, keys, or full ciphertext. Encrypt **backups and replicas** too (same KMS). Use **constant-time comparison** for MACs/tags/tokens (`hmac.compare_digest`, `crypto.timingSafeEqual`, `subtle.ConstantTimeCompare`) — `==` leaks via timing. Run a **`security-review`** over the crypto diff before shipping.
+## Common Errors
+- **Encrypting passwords instead of hashing.** Reversible = one key compromise dumps every password. Fix: argon2id/bcrypt, one-way (step 7).
+- **Plain/fast hash for passwords** (`SHA256(password)`, unsalted MD5). GPUs crack billions/sec; rainbow tables for unsalted. Fix: memory-hard KDF with per-password salt.
+- **ECB mode / unauthenticated CBC.** ECB leaks structure; CBC-without-MAC → padding oracle, malleable ciphertext. Fix: AEAD (AES-GCM/ChaCha20-Poly1305) only.
+- **Nonce/IV reuse under one key (GCM).** Catastrophic — leaks plaintext XOR *and* the auth key (forgeries). Fix: unique nonce per message; XChaCha20/GCM-SIV if you can't guarantee it (step 3).
+- **Hardcoded / static IV** (`iv = new byte[12]` all zeros). Same as reuse. Fix: fresh CSPRNG nonce per encryption, stored with ciphertext.
+- **Encrypting bulk data directly with the KMS/KEK.** Throughput and cost explode; no clean rotation. Fix: envelope — KEK wraps per-record DEK (step 5).
+- **No AAD binding.** Valid ciphertext copy-pasted between rows/tenants decrypts fine. Fix: pass row/tenant/version as AAD (step 4).
+- **No key version on ciphertext.** Rotation becomes a flag-day re-encrypt-everything. Fix: store `key_version`, dispatch decryption on it (step 9).
+- **Plaintext DEK left in memory / logged.** Heap dump or log leak = game over. Fix: zero after use; never log keys/plaintext/tags.
+- **`Math.random()` / `rand()` for keys, nonces, or salts.** Predictable → forgeable. Fix: CSPRNG only.
+- **Disabling TLS verification** (`verify=False`, `InsecureSkipVerify`, `rejectUnauthorized:false`). Silent MITM. Fix: validate chain + SAN; only bypass in isolated tests.
+- **Weak TLS** (TLS 1.0/1.1, CBC suites, static RSA, RC4/3DES). Fix: TLS 1.2+/1.3, AEAD+ECDHE suites; verify with testssl.sh.
+- **`==` on MACs/tags/tokens.** Timing side-channel. Fix: constant-time comparison.
+- **Roll-your-own crypto / `Cipher` low-level API.** Easy to misorder encrypt-then-MAC, mishandle padding. Fix: libsodium / Tink / AWS Encryption SDK.
+- **Deterministic encryption where no query needs it.** Leaks equality patterns, and on low-cardinality fields frequency analysis can recover the values. Fix: randomized by default; deterministic/blind-index only where a query needs it (step 6).
+## Verify
+1. **No banned modes/algorithms:** grep the diff for `ECB`, `AES/CBC` without an accompanying MAC, `DES`, `RC4`, `MD5`/`SHA1` on secrets, raw RSA encrypt — zero hits. All symmetric encryption is AES-GCM / ChaCha20-Poly1305 (AEAD).
+2. **Passwords are hashed, not encrypted:** grep finds argon2id/bcrypt/scrypt on the password path and **no** encrypt/decrypt of passwords; salts are per-password (encoded in the hash); cost params meet the step-7 baseline.
+3. **Nonce uniqueness:** confirm every encryption draws a fresh CSPRNG nonce (or a guaranteed-unique counter); no static/zero IV; nonce stored with ciphertext. For high volume, DEK rotation or a nonce-misuse-resistant mode is in place.
+4. **Envelope encryption holds:** bulk data is encrypted with a DEK, the DEK is wrapped by a KMS-held KEK that never leaves KMS, plaintext DEK is zeroed after use, and a `key_version` is stored per record.
+5. **AAD binds context:** moving a valid ciphertext from one row/tenant to another **fails** decryption (AAD mismatch).
+6. **Rotation works without re-encrypting everything:** rotating the KEK re-wraps DEKs only; old `key_version` ciphertext still decrypts; a DEK-destroy crypto-shreds its records (they become permanently undecryptable).
+7. **TLS posture:** `testssl.sh <host>` / SSL Labs returns **A/A+** — TLS 1.2+ only, AEAD+forward-secret suites, no CBC/RC4/3DES; mTLS validates the full chain + SAN; no `verify=False`/`InsecureSkipVerify` in non-test code.
+8. **Randomness + timing:** all keys/nonces/salts come from a CSPRNG (no `Math.random`/`rand`); MAC/tag/token comparisons are constant-time.
+9. **Tamper detection:** flipping one ciphertext byte makes decryption **fail** (auth tag rejects it) rather than returning garbage plaintext.
+Done = sensitive data is encrypted with AEAD under unique nonces, bulk data uses KMS envelope encryption with versioned, rotatable keys and context-binding AAD, passwords are hashed with argon2id/bcrypt (never encrypted), PII fields are randomized-encrypted (deterministic/blind-index only where a query demands it), transport is TLS 1.2+/1.3 with modern suites and mTLS where needed, and all keys/nonces/salts come from a CSPRNG with constant-time tag checks — all proven by checks 1–9, with `security-review` run over the crypto diff.

package/skills/feature-flags-rollout/SKILL.md ADDED

@@ -0,0 +1,130 @@
+---
+name: feature-flags-rollout
+description: Implements feature flags and progressive delivery — kill switches, percentage/targeted rollouts, sticky hashed bucketing, fail-safe evaluation, 1→10→50→100 ramps with guardrail-metric rollback, and TTL-enforced stale-flag cleanup — so changes ship decoupled from deploys and reverse in seconds.
+when_to_use: Adding a flag, gating a feature, running a percentage/canary/ring rollout decoupled from deploy, building a kill switch, targeting by user/segment/plan, or paying down flag debt. Covers OpenFeature-compatible managed flag platforms, vendor SDKs, and homegrown flag tables. Distinct from deploy-release (ships the artifact; flags gate behavior inside it) and auth-jwt-session (establishes entitlement; flags must never compute it).
+---
+## When to Use
+Reach for this skill when the request is about **decoupling a behavior change from the deploy that carries it**:
+- "Put this behind a flag so we can turn it off without redeploying"
+- "Roll it out to 1% / 10% / a canary ring, then ramp"
+- "Add a kill switch for the new checkout / payments path"
+- "Only enable for plan=enterprise / this segment / our internal allowlist"
+- "Migrate from env-var booleans to a managed flag platform / OpenFeature"
+- "Clean up dead flags / we have 300 flags and nobody knows which are live"
+NOT this skill:
+- Shipping/promoting the build, blue-green, canary *infra*, rollback of the artifact → deploy-release
+- Deciding *who the user is* or whether they paid → auth-jwt-session (a flag gates rollout; it does not grant entitlement)
+- Computing experiment lift / significance / metric tables from exposure logs → write-analytical-sql
+- Hiding the provider SDK key / signing flag payloads → secrets-management
+- Gating prompt/model changes behind a regression score → llm-eval-harness
+## Steps
+1. **Classify the flag first — type dictates lifetime, owner, and removal policy.** Do not create a flag without picking one.
+   | Type | Purpose | Lifetime | Owner | Removal |
+   |---|---|---|---|---|
+   | **Release** | Gate in-progress code, ramp it | days–weeks | feature author | **delete at 100% or revert** — TTL-enforced |
+   | **Kill switch (ops)** | Instantly disable a risky path | permanent | on-call/SRE | keep; review yearly |
+   | **Ops/config** | Tunables (timeouts, batch size, region) | permanent | platform | keep; document |
+   | **Experiment** | A/B exposure split | length of test | data/PM | delete when test concludes |
+   | **Permission/entitlement** | Plan/role gating | permanent | product | keep — but source of truth is auth, not the flag |
+   Release flags account for most flag debt. Every one gets an owner + removal date at creation (step 7). Default any new flag to **release** unless it's clearly a permanent switch.
+2. **Pick the evaluation locus — server-side by default.** Evaluate where the decision is *trusted and cheap*.
+   | Locus | Use for | Hard rule |
+   |---|---|---|
+   | **Server** (default) | entitlement-adjacent gating, anything secret, backend behavior | rule logic + flag values never leave the server |
+   | **Client** | pure UX (show new button, layout) | only flags in the **public** set; client can lie |
+   | **Edge/CDN** | geo/ring routing at the boundary | static rules only |
+   **Never evaluate an entitlement or paywall in the browser** — the user controls the client and can flip any client-side flag with devtools. Gate the *capability* server-side; the client flag only hides the UI. Server SDKs evaluate locally against a streamed ruleset (no per-request network call); client SDKs fetch a bootstrapped, scoped flag set.
+3. **Define a deterministic key + a fail-safe default.** The default is what runs when the provider is unreachable — and it *will* be unreachable.
+   ```ts
+   // ONE typed helper, the only place flags are read (step 5).
+   export function flag<T>(key: FlagKey, ctx: EvalContext, fallback: T): T {
+     try {
+       return client.variation(key, ctx, fallback); // local eval, no network
+     } catch (e) {
+       metrics.increment("flag.eval_error", { key });
+       return fallback;                              // FAIL-SAFE: never throw
+     }
+   }
+   // Release flag → fallback = OFF (old code path). Kill switch → fallback = "killed/safe".
+   ```
+   Rules: a flag read **must not throw, block, or call out per request**. Fall to **last-known-good** (SDK cache) → then the **hardcoded fallback**. For release flags the fallback is the *old* behavior (fail-off). For kill switches the fallback is the *safe* state (path disabled). Never let SDK init failure crash startup — init async with a timeout and serve fallbacks until ready.
+4. **Targeting — percentage by stable hashed bucketing, not RNG.** Bucketing must be **sticky**: the same user sees the same variant across requests, servers, and deploys.
+   ```ts
+   // Deterministic bucket 0..9999 — identical on every server, no shared state.
+   function bucket(flagKey: string, unitId: string): number {
+     const h = sha1(`${flagKey}:${unitId}`); // salt with flagKey so flags are independent
+     return parseInt(h.slice(0, 8), 16) % 10000;
+   }
+   const inRollout = bucket("new-checkout", user.id) < rolloutPct * 100; // 10% → <1000
+   ```
+   - **Bucketing unit** = a stable id (userId / accountId / deviceId) — **never** session id, request time, or `Math.random()` (those reshuffle users every request → broken/flickering UX and uninterpretable experiments).
+   - Salt the hash with the flag key so two flags at 10% don't hit the *same* 10% of users (correlated rollouts).
+   - **Rule order:** allowlist (force-on for QA/internal) → segment/plan rules → percentage → default. First match wins; make precedence explicit.
+   - Ramping the percentage must only *add* users, never reshuffle: monotonic threshold on a fixed hash guarantees a user inside 10% stays inside 50%.
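The rule order and the add-only ramp from step 4 can be checked with a small runnable sketch. The `Rollout` shape and the `isEnabled` name are hypothetical; real flag platforms have richer targeting models. The `bucket` function is the one shown above, with SHA-1 supplied by Node's `crypto` module.

```typescript
import { createHash } from "node:crypto";

interface User { id: string; plan: string; }

// Hypothetical rule shape for illustration only.
interface Rollout { key: string; allowlist: string[]; plans: string[]; percent: number; }

// Deterministic bucket 0..9999, salted with the flag key.
function bucket(flagKey: string, unitId: string): number {
  const h = createHash("sha1").update(`${flagKey}:${unitId}`).digest("hex");
  return parseInt(h.slice(0, 8), 16) % 10000;
}

// First match wins: allowlist, then plan rule, then percentage, then default OFF.
function isEnabled(rule: Rollout, user: User): boolean {
  if (rule.allowlist.includes(user.id)) return true;
  if (rule.plans.includes(user.plan)) return true;
  if (bucket(rule.key, user.id) < rule.percent * 100) return true;
  return false;
}

const users: User[] = Array.from({ length: 2000 }, (_, i) => ({ id: `user-${i}`, plan: "free" }));
const at10: Rollout = { key: "new-checkout", allowlist: ["qa-1"], plans: ["enterprise"], percent: 10 };
const at50: Rollout = { ...at10, percent: 50 };

const in10 = users.filter((u) => isEnabled(at10, u));
const in50 = users.filter((u) => isEnabled(at50, u));
// Ramping 10% -> 50% must only add users: everyone in the 10% cohort stays in.
const keptOnRamp = in10.every((u) => isEnabled(at50, u));
console.log(in10.length, in50.length, keptOnRamp);
```

Because the threshold moves over a fixed hash, the 10% cohort is a subset of the 50% cohort by construction, and no shared state between servers is needed.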
+5. **Wrap every read behind the one typed helper from step 3.** No raw `client.variation(...)` or `process.env.FEATURE_X` scattered in code. Centralizing gives you: a single fallback policy, one audit point for cleanup, typed keys (no stringly-typed typos), and a place to log exposure for experiments. Key names are namespaced and stable: `team.feature.scope` (e.g. `checkout.new-flow.enabled`).
+6. **Ramp on a schedule with a guardrail metric and a one-flip rollback.** Decoupled from deploy means rollback = flip the flag, not redeploy.
+   | Stage | Audience | Hold | Watch (guardrail) |
+   |---|---|---|---|
+   | 0% + allowlist | internal/QA | until smoke passes | manual QA |
+   | 1% | canary cohort | ≥1 peak hour | error rate, p99 latency, the feature's own success metric |
+   | 10% | — | ≥1 business day | + downstream load, support tickets |
+   | 50% | — | ≥1 day | + cost / DB / queue depth |
+   | 100% | everyone | bake 1 week | then **delete the flag** (step 7) |
+   Pick the guardrail **before** ramping (e.g. "5xx rate must stay <0.5%, checkout-success must not drop >1pp"). Wire an automated trip if you can: guardrail breach → set flag to 0% (kill). A flag flip propagates in seconds; a redeploy does not — that gap is the entire point. Never jump 1%→100%.
+7. **Lifecycle = owner + removal date + CI enforcement.** A flag with no expiry is permanent debt.
+   - At creation, record `{ owner, type, created, removeBy }` (flag description, a registry table, or `// @flag-owner @removeBy=YYYY-MM-DD` next to the helper call).
+   - **CI fails the build when a `release` flag passes its `removeBy`** — grep flag metadata, exit nonzero on any overdue release flag. This is the single highest-leverage anti-debt control.
+   - Cleanup is a real PR: delete the flag key in the provider **and** the `flag()` call **and** the now-dead branch — keep the winning path, remove the loser. Archive the flag (don't hard-delete history) so old exposure logs stay interpretable.
+   - Kill switches and ops flags are exempt from TTL but get an annual review.
+8. **Test both branches; flag-off is the safe default.** Every gated change has two live code paths — both must be tested and shippable. Default the flag **off** in test config (proves old path still works), then run the suite again with it **on**. A PR that only works with the flag on is not done.
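The rule order from step 4 (allowlist → segment/plan rules → percentage → default, first match wins) can be sketched as a single evaluation function. This is a minimal sketch under stated assumptions, not any provider's API: the `FlagRule` and `Unit` shapes, the `evaluate` function, and the sample ids are hypothetical names introduced only for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical rule shape for illustration; real providers model this differently.
interface FlagRule {
  key: string;
  allowlist: string[];   // ids forced on (QA / internal)
  plans: string[];       // segment rule: plans that always get the feature
  rolloutPct: number;    // 0..100
  defaultValue: boolean; // returned when no rule matches
}

interface Unit {
  id: string;   // stable bucketing unit (userId / accountId / deviceId)
  plan: string;
}

// Same bucketing as step 4: flag-salted hash mapped to 0..9999.
function bucket(flagKey: string, unitId: string): number {
  const h = createHash("sha1").update(`${flagKey}:${unitId}`).digest("hex");
  return parseInt(h.slice(0, 8), 16) % 10000;
}

// First match wins: allowlist, then segment, then percentage, then default.
function evaluate(rule: FlagRule, unit: Unit): boolean {
  if (rule.allowlist.includes(unit.id)) return true;
  if (rule.plans.includes(unit.plan)) return true;
  if (bucket(rule.key, unit.id) < rule.rolloutPct * 100) return true;
  return rule.defaultValue;
}

const rule: FlagRule = {
  key: "checkout.new-flow.enabled",
  allowlist: ["qa-1"],
  plans: ["enterprise"],
  rolloutPct: 0, // "0% + allowlist" stage
  defaultValue: false,
};

console.log(evaluate(rule, { id: "qa-1", plan: "free" }));          // true (allowlist)
console.log(evaluate(rule, { id: "user-42", plan: "enterprise" })); // true (segment rule)
console.log(evaluate(rule, { id: "user-42", plan: "free" }));       // false (0% rollout, default off)
```

Because the percentage check is a threshold on one fixed hash, raising `rolloutPct` only ever adds users; nobody already inside the rollout falls out.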
+## Common Errors
+- **`Math.random()` / time / session id as the bucketing unit.** Users flicker between variants from one request or session to the next, which breaks UX and makes experiments uninterpretable. Hash a stable user/account id.
+- **Two flags at 10% hitting the *same* users.** Forgot to salt the hash with the flag key, so rollouts are correlated. Salt = `flagKey:unitId`.
+- **Reshuffling on ramp.** Changing the hash scheme/seed when going 10%→50% moves users *out* of the rollout, regressing them mid-flight. Use a monotonic threshold on one fixed hash.
+- **Flag read that throws or blocks.** A provider hiccup takes down the request path. Wrap the read in the helper; fall back to last-known-good, then to the hardcoded fallback; never make a network call per request (server SDKs evaluate locally).
+- **SDK init crashes startup.** Synchronous blocking init against an unreachable provider hangs boot. Init async with timeout, serve fallbacks until ready.
+- **Entitlement evaluated client-side.** A client-side flag "unlocks" a paid feature — trivially bypassed in devtools. Gate the capability server-side; the client flag only hides UI (auth-jwt-session owns the grant).
+- **Fail-open release flag.** Provider down → fallback is the *new, unfinished* path. Release fallback must be **off** (old path); only ops defaults bias to "on".
+- **`process.env.FEATURE_X` booleans.** Env flags need a redeploy to flip — that's a config change, not a runtime kill switch, and defeats decoupling. Use the provider/table behind the helper.
+- **Only testing the on-path.** Flag-off regressions ship silently because nobody ran the suite with the flag off. Test both states; off is the default.
+- **Flag never removed.** 100% rolled out months ago, both branches still in code, the loser rotting. CI must fail past `removeBy`; cleanup deletes the dead branch.
+- **Stale flag still referenced after provider deletion.** Deleting the key in the dashboard but leaving the `flag()` call → it silently serves the fallback forever (often the wrong one). Delete provider key and code in the same PR.
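Several of the errors above (reads that throw, init that blocks boot, fail-open fallbacks) come down to the shape of the one helper. The following is a minimal sketch, assuming a hypothetical `FlagProvider` interface rather than any real SDK; `createFlagReader`, `FALLBACKS`, and the `onError` callback are illustrative names, and only the `flag.eval_error` metric name comes from the text.

```typescript
// Hypothetical provider interface for illustration; a real SDK's API will differ.
interface FlagProvider {
  init(): Promise<void>;           // may reject or never settle if the provider is unreachable
  evalLocal(key: string): boolean; // in-memory evaluation, no network call; may throw
}

// Hardcoded fallbacks double as the typed key list.
// Release flags fail off (old path); kill switches fall back to the safe state.
const FALLBACKS = {
  "checkout.new-flow.enabled": false,
} as const;
type FlagKey = keyof typeof FALLBACKS;

function createFlagReader(
  provider: FlagProvider,
  initTimeoutMs: number,
  onError: (metric: string, key: FlagKey) => void = () => {},
) {
  const lastKnownGood = new Map<FlagKey, boolean>();
  let ready = false;

  // Non-blocking init: boot code starts this but does not wait on it.
  async function init(): Promise<void> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("flag init timeout")), initTimeoutMs);
    });
    try {
      await Promise.race([provider.init(), timeout]);
      ready = true;
    } catch {
      ready = false; // keep serving fallbacks; never crash startup
    } finally {
      clearTimeout(timer);
    }
  }

  // Never throws: provider value, else last-known-good, else hardcoded fallback.
  function flag(key: FlagKey): boolean {
    try {
      if (!ready) throw new Error("flag provider not ready");
      const value = provider.evalLocal(key);
      lastKnownGood.set(key, value);
      return value;
    } catch {
      onError("flag.eval_error", key); // a served fallback is reported, not silent
      return lastKnownGood.get(key) ?? FALLBACKS[key];
    }
  }

  return { init, flag };
}
```

Deriving `FlagKey` from the fallback table means a flag cannot be read without first declaring its fallback, which keeps the fail-off policy in one audited place.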
+## Verify
+1. **Determinism / stickiness:** Evaluate the same flag for the same user id 1000× across ≥2 processes → identical variant every time. Restart the service → still identical.
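+2. **Independent rollouts:** Two flags at the same percentage do **not** select the same user set (hash is flag-salted). Compare the bucketed cohorts: the overlap should be ≈ percentage² of all users (two flags at 10% share ≈1% of users, i.e. ≈10% of each cohort), not 100% of the cohort.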
+3. **Monotonic ramp:** Take the users inside the 10% rollout; raise to 50% → every one of them is still inside (no user regresses out). Lower back → only the added tail leaves.
+4. **Fail-safe:** Block/kill the provider (firewall the SDK endpoint), send traffic → every read returns the fallback, nothing throws, requests still succeed, and `flag.eval_error` is emitted (not silent).
+5. **Kill switch latency:** Flip a kill switch to off → the gated path stops within the SDK's stream/poll interval (seconds), with **no deploy**. Time it.
+6. **Both branches green:** Full test suite passes with the flag **off** (default) and again with it **on**. CI runs at least the off state.
+7. **No raw reads:** `grep -rlE "\.variation\(|process\.env\.FEATURE" src/` lists only the single helper file; every other read goes through `flag()`.
+8. **TTL enforcement:** A `release` flag with a past `removeBy` makes CI **exit nonzero**. Verify by backdating one in a throwaway branch.
+9. **Entitlement is server-trusted:** With the client-side flag forced on in devtools, the server still refuses the gated capability (403/empty), proving the browser can't unlock it.
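Checks 1 to 3 above can be exercised offline against the bucketing function alone. The sketch below assumes the SHA-1 `bucket` function from step 4; the `cohort` helper, the flag keys `flag-a` / `flag-b`, and the synthetic ids `user-0` to `user-19999` are hypothetical. It only repeats evaluation within one process, so the cross-process part of check 1 still means running the script twice and comparing output.

```typescript
import { createHash } from "node:crypto";

// Flag-salted hash mapped to 0..9999, as in step 4.
function bucket(flagKey: string, unitId: string): number {
  const h = createHash("sha1").update(`${flagKey}:${unitId}`).digest("hex");
  return parseInt(h.slice(0, 8), 16) % 10000;
}

// Ids that fall inside a rollout of `pct` percent for the given flag.
function cohort(flagKey: string, pct: number, ids: string[]): Set<string> {
  return new Set(ids.filter((id) => bucket(flagKey, id) < pct * 100));
}

// Synthetic population (hypothetical ids).
const ids = Array.from({ length: 20000 }, (_, i) => `user-${i}`);

// Check 1: the same (flag, id) pair always lands in the same bucket.
const deterministic = ids
  .slice(0, 1000)
  .every((id) => bucket("flag-a", id) === bucket("flag-a", id));

// Check 2: two flags at 10% should share roughly 1% of all users, not the whole cohort.
const a10 = cohort("flag-a", 10, ids);
const b10 = cohort("flag-b", 10, ids);
const shared = Array.from(a10).filter((id) => b10.has(id)).length;
const overlapFraction = shared / ids.length;

// Check 3: everyone inside the 10% rollout is still inside at 50%.
const a50 = cohort("flag-a", 50, ids);
const monotonic = Array.from(a10).every((id) => a50.has(id));

console.log({ deterministic, overlapFraction, monotonic });
```

With 20,000 ids the 10% cohorts hold about 2,000 users each, so an overlap near 200 users indicates independent rollouts, while an overlap near 2,000 would indicate a missing flag-key salt.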
+Done = bucketing is deterministic + flag-salted + monotonic under ramp, every read goes through the fail-safe helper (provider-down test serves fallbacks without throwing), the kill switch flips in seconds with no deploy, both flag branches pass tests with off as default, no entitlement is decided client-side, and CI fails on any release flag past its removal date.
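The CI gate named in the last clause can be sketched as a small check over flag metadata. This assumes the `{ owner, type, created, removeBy }` records from step 7 live in a registry array; the `FlagMeta` shape, the flag type names, and the sample entries are hypothetical. Treating a release flag with no `removeBy` as overdue is also an assumption of this sketch.

```typescript
// Hypothetical registry shape; the flag type names are assumptions for illustration.
type FlagType = "release" | "experiment" | "ops" | "kill-switch";

interface FlagMeta {
  key: string;
  owner: string;
  type: FlagType;
  created: string;   // YYYY-MM-DD
  removeBy?: string; // YYYY-MM-DD; kill switches and ops flags may omit it
}

// Release flags whose removal date has passed (or was never set).
function overdueReleaseFlags(registry: FlagMeta[], today: Date): FlagMeta[] {
  return registry.filter((f) => {
    if (f.type !== "release") return false;   // only release flags carry a TTL
    if (f.removeBy === undefined) return true; // assumption: no expiry counts as overdue
    return new Date(f.removeBy).getTime() < today.getTime();
  });
}

// Returns the exit code CI should use: 0 when clean, 1 when anything is overdue.
function ciCheck(registry: FlagMeta[], today: Date): number {
  const overdue = overdueReleaseFlags(registry, today);
  for (const f of overdue) {
    console.error(`overdue flag ${f.key} (owner ${f.owner}, removeBy ${f.removeBy ?? "unset"})`);
  }
  return overdue.length > 0 ? 1 : 0;
}

// Sample registry (hypothetical entries).
const registry: FlagMeta[] = [
  { key: "checkout.new-flow.enabled", owner: "team-checkout", type: "release", created: "2023-11-01", removeBy: "2024-01-31" },
  { key: "search.reindex.kill", owner: "team-search", type: "kill-switch", created: "2023-05-10" },
];

console.log(ciCheck(registry, new Date("2024-03-01"))); // 1: the release flag is past removeBy
console.log(ciCheck(registry, new Date("2024-01-15"))); // 0: nothing overdue yet
```

In a CI job the returned code would be assigned to `process.exitCode` with `new Date()` as the reference date, so an overdue release flag fails the build.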