npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.0 - Mend

sanook-cli 0.4.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (235)

package/.env.example +19 -0
package/CHANGELOG.md +144 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +394 -51
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +2 -2
package/dist/providers/keys.js +3 -2
package/dist/providers/registry.js +133 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +218 -27
package/dist/ui/banner.js +4 -9
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/setup.js +6 -5
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/compose-local-dev-stack/SKILL.md ADDED

@@ -0,0 +1,149 @@
+---
+name: compose-local-dev-stack
+description: Wires a local multi-service development stack with Docker Compose — app plus backing datastores (Postgres/Redis/Kafka), dependency-ordered healthchecks (depends_on condition: service_healthy), pinned images and named volumes, seed/init scripts, hot-reload bind mounts, profiles, and one-command up/down/reset via a Makefile.
+when_to_use: An app needs real local backing services (db, cache, queue) and "start everything" is fragile, slow, or undocumented. Not the dev container the editor runs in (setup-devcontainer-env), not the shippable app image (dockerfile-optimize), not cluster deployment (k8s-manifest-review).
+---
+## When to Use
+Reach for this when the request is about **standing up the app's runtime dependencies on a laptop**, reproducibly, with one command:
+- "Get Postgres + Redis + the API running locally so I can develop"
+- "The onboarding doc says `docker compose up` but it races / half the services aren't ready"
+- "Add Kafka (or a queue, or a second DB) to the local stack"
+- "I want hot reload — edit code, see it without rebuilding the image"
+- "Seed the dev database automatically" / "give me a clean-slate reset command"
+NOT this skill:
+- The container the editor/agent itself runs inside (devcontainer.json, features, VS Code attach) → setup-devcontainer-env
+- Shrinking/hardening the **production** image (multi-stage, distroless, non-root) → dockerfile-optimize
+- Deploying these services to a cluster (Deployments, probes, resource limits, Helm) → k8s-manifest-review
+- Pinning the *host* language/tool versions (node/python/go via asdf/mise/`.tool-versions`) → pin-toolchain-versions
+- A schema change's lock/data-loss safety → db-migration-safety (this skill only *runs* migrations on start)
+## Steps
+1. **One file, services as the unit. Pin every tag, name every volume.** Floating `:latest` makes the stack non-reproducible and breaks silently on pull; bare anonymous volumes orphan and lose data on `down`. Use `compose.yaml` (the modern name — drop the `version:` key, it's obsolete):
+   ```yaml
+   name: myapp
+   services:
+     db:
+       image: postgres:16.4-alpine            # pin minor; never :latest
+       environment:
+         POSTGRES_USER: app
+         POSTGRES_PASSWORD: app
+         POSTGRES_DB: app
+       volumes:
+         - pgdata:/var/lib/postgresql/data     # named -> survives `down`
+         - ./db/init:/docker-entrypoint-initdb.d:ro  # runs ONCE on empty volume
+       healthcheck:
+         test: ["CMD-SHELL", "pg_isready -U app -d app"]
+         interval: 3s
+         timeout: 3s
+         retries: 20
+         start_period: 5s
+       ports: ["5432:5432"]
+     redis:
+       image: redis:7.4-alpine
+       command: ["redis-server", "--save", "", "--appendonly", "no"]  # ephemeral cache
+       healthcheck:
+         test: ["CMD", "redis-cli", "ping"]
+         interval: 3s
+         timeout: 2s
+         retries: 20
+     app:
+       build: { context: ., target: dev }       # dev stage, not prod
+       command: ["npm", "run", "dev"]            # hot-reload command, overrides Dockerfile CMD
+       depends_on:
+         db:    { condition: service_healthy }   # waits for healthcheck, not just "started"
+         redis: { condition: service_healthy }
+       environment:
+         DATABASE_URL: postgres://app:app@db:5432/app   # use service name, not localhost
+         REDIS_URL: redis://redis:6379
+       volumes:
+         - ./src:/app/src                        # bind mount -> edits reflect live
+         - /app/node_modules                     # anon vol masks host node_modules
+       ports: ["3000:3000"]
+   volumes:
+     pgdata:
+   ```
+2. **Order startup with `depends_on: condition: service_healthy` — never bare `depends_on`.** Bare `depends_on` only waits for the container to *start*, not to be *ready*; the app then connects to a Postgres still replaying WAL and crash-loops. The gate is the **healthcheck on each backing service**. Pick the right probe per service:
+   | Service | Healthcheck test | Why not just TCP |
+   |---|---|---|
+   | Postgres | `pg_isready -U $USER -d $DB` | port opens before it accepts queries |
+   | MySQL | `mysqladmin ping -h localhost` | same early-port problem |
+   | Redis | `redis-cli ping` → `PONG` | trivial, do it |
+   | Kafka (KRaft) | `kafka-broker-api-versions --bootstrap-server localhost:9092` | broker advertises before it serves metadata |
+   | RabbitMQ | `rabbitmq-diagnostics -q ping` | mgmt port lies about readiness |
+   | Elasticsearch | `curl -fsS localhost:9200/_cluster/health?wait_for_status=yellow` | green never comes single-node |
+   | App migrations | a one-shot `migrate` service the app `depends_on` (condition: `service_completed_successfully`) | keeps schema setup off the app's hot path |
+   Tune `retries × interval ≥ real cold-start time` (Kafka/ES need `start_period: 20s`+) or healthy never arrives and the dependents abort.
+3. **Seed once via `docker-entrypoint-initdb.d`; run migrations every start via a one-shot service.** The init dir (`*.sql`/`*.sh`, alphabetical) runs **only when the data volume is empty** — perfect for extensions, roles, and static seed (`01-schema.sql`, `02-seed.sql`). It does **not** re-run after the volume exists, so never put evolving migrations there. Migrations belong in a dedicated short-lived service the app waits on:
+   ```yaml
+     migrate:
+       build: { context: ., target: dev }
+       command: ["npm", "run", "migrate:deploy"]   # or: alembic upgrade head / flyway migrate
+       depends_on: { db: { condition: service_healthy } }
+       restart: "no"
+   ```
+   Then set `app.depends_on.migrate.condition: service_completed_successfully`. Idempotent migration tools make this safe to run on every `up`.
+4. **Hot reload = bind mount source + a dev `command` + a watcher, not a rebuild.** Bind `./src:/app/src` and run the dev server (`npm run dev`/`uvicorn --reload`/`air`/`nodemon`). Mask installed deps with an **anonymous volume** (`- /app/node_modules`) so the host's empty/mismatched dir doesn't shadow the image's. Build the image from a **`dev` stage** (`target: dev`) that includes dev deps and the watcher — keep the lean prod stage for shipping (that's dockerfile-optimize's job). Changing `package.json`/`requirements.txt` still needs a rebuild; code does not.
+5. **Split config: committed `compose.yaml` and `.env.example`, plus a gitignored `.env` and `compose.override.yml`.** Compose **auto-merges** `compose.override.yml` on top of `compose.yaml` with no `-f` flag, so put local-only tweaks there (extra port bindings, mounted debug tools, `DEBUG=1`) and gitignore it so teammates' hacks don't collide. Variables interpolate from `.env`: commit `.env.example` with placeholders and keep the real `.env` gitignored. Never hardcode host-specific ports or paths in the base file.
+6. **Gate optional services behind `profiles`.** Tag heavy/rarely-needed services (Kafka, a second DB, mailhog, a metrics stack) with `profiles: ["kafka"]` so a plain `docker compose up` starts only the core stack. Opt in with `docker compose --profile kafka up`. Keeps the default path fast; a service with no `profiles` always runs.
+7. **Use the default network and talk service-to-service by name; publish only the host ports you need.** Compose gives you a default bridge network where services resolve each other by **service name** (`db`, `redis`) — the app must use `db:5432`, never `localhost:5432` (localhost inside the app container is the app). Publish stable host ports (`5432:5432`) only for tools you run on the host (psql, a GUI). Collisions with a host Postgres → remap the **host** side (`5433:5432`), never the container side.
+8. **Make one-command verbs in a `Makefile` (or `Taskfile.yml`) so nobody memorizes flags.** `up` must block until healthy; `reset` must wipe volumes:
+   ```makefile
+   up:      ## start core stack, wait until healthy
+   	docker compose up -d --wait
+   down:    ## stop, keep data
+   	docker compose down
+   reset:   ## stop AND wipe volumes -> clean slate
+   	docker compose down -v --remove-orphans
+   	docker compose up -d --wait
+   logs:    ## tail everything
+   	docker compose logs -f --tail=100
+   ps:
+   	docker compose ps
+   ```
+   `--wait` makes `up` exit non-zero if any service never goes healthy, which is your machine-checkable gate. `down -v` is the *only* target here that deletes data; keep it on `reset` alone so `down` is always safe.
+## Common Errors
+- **Bare `depends_on:` (list form).** Waits for container *start*, not readiness; the app races the DB and crash-loops on cold boot. Use the map form with `condition: service_healthy`.
+- **No `healthcheck` on a backing service.** Then `service_healthy` has nothing to gate on and Compose errors or treats it as instantly up. Every service you depend-on needs a real probe (table in step 2).
+- **App connects to `localhost` instead of the service name.** `localhost` inside the app container is the app itself — connection refused. Use `db`/`redis`/`kafka` (the service names) in `DATABASE_URL`/`REDIS_URL`.
+- **Anonymous/missing volume on a datastore.** `docker compose down` orphans the anonymous volume and the next `up` starts empty; data "randomly" vanishes. Always name datastore volumes and declare them under `volumes:`.
+- **Expecting `docker-entrypoint-initdb.d` to re-run.** It runs **only on an empty data volume**. Edited a seed file and "nothing happened"? The volume already exists — `docker compose down -v` (or `make reset`) to re-init. Don't put live migrations there.
+- **`start_period` too short for Kafka/Elasticsearch.** They take 20–60s to be ready; with the default `start_period: 0s` and few retries, healthy never arrives and dependents abort. Set `start_period: 30s` and enough `retries`.
+- **`:latest` / unpinned tags.** A teammate pulls a newer Postgres major, the data dir format changes, the volume won't mount. Pin to a minor tag (`postgres:16.4-alpine`).
+- **Host port already in use (`bind: address already in use`).** A host Postgres or a previous stack holds 5432. Remap the host side only (`5433:5432`); the container side must stay on the port the service actually listens on, because other services still connect to `db:5432` over the Compose network.
+- **Host `node_modules`/`venv` shadowing the image's via the source bind mount.** App can't find deps or loads wrong-arch binaries. Add the anonymous-volume mask (`- /app/node_modules`) *after* the source bind.
+- **Secrets committed in `compose.yaml`.** Real credentials in the base file leak to git. Keep them in the gitignored `.env`; commit only `.env.example` with placeholders.
+## Verify
+1. **Cold up from nothing:** `make reset` (wipes), then `make up`. The command must **exit 0** — `--wait` fails the command if any service is unhealthy. `docker compose ps` shows every core service `running (healthy)`.
+2. **Ordering held:** check `docker compose logs migrate` and `docker compose logs app`: the app issued its first DB query *after* `db` was healthy and migrations completed, with **no** connection-refused retries in the log.
+3. **Seeded:** `docker compose exec db psql -U app -d app -c "select count(*) from <seeded_table>;"` returns the expected non-zero count without any manual step.
+4. **Hot reload:** with the stack up, edit a source file under the bind mount → the app reloads and serves the change **without** `docker compose build` or restart.
+5. **Reachability:** a host tool hits the published port (`psql -h localhost -p 5432 -U app`), and the app reaches the DB **by service name** (no `localhost` in its config).
+6. **Reset is clean:** `make reset` recreates the stack and the seeded count from step 3 matches again (volume truly wiped and re-init'd, not stale).
+7. **Profiles:** plain `docker compose up -d --wait` starts only core services; `--profile kafka up` additionally starts the gated ones; `docker compose ps` confirms each case.
+8. **`down` is safe:** `make down` then `make up` preserves data (row count unchanged); only `make reset` resets it.
+Done = `make reset && make up` exits 0 with every service `healthy`, the DB is auto-seeded, a source edit hot-reloads without a rebuild, the app talks to backing services by name, and `make reset` reproducibly returns the stack to a clean seeded state.
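
The sizing rule in step 2 of the skill above (`retries × interval ≥ real cold-start time`) can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the package or of Compose: the `Healthcheck` shape and both function names are hypothetical, and the budget adds `start_period` on the assumption that probe failures during that window do not count toward `retries`. It ignores per-probe runtime (`timeout`), so treat the result as an estimate.

```typescript
// Hypothetical helper types and names, for illustration only.
interface Healthcheck {
  intervalSeconds: number;    // healthcheck.interval
  retries: number;            // healthcheck.retries
  startPeriodSeconds: number; // healthcheck.start_period
}

// Approximate time a service has to become healthy before it is marked unhealthy.
// Assumption: failures inside start_period are not counted against retries.
function healthBudgetSeconds(hc: Healthcheck): number {
  return hc.startPeriodSeconds + hc.retries * hc.intervalSeconds;
}

// True when the budget is at least the observed cold-start time.
function coversColdStart(hc: Healthcheck, coldStartSeconds: number): boolean {
  return healthBudgetSeconds(hc) >= coldStartSeconds;
}

// Values from the `db` service in step 1: interval 3s, retries 20, start_period 5s.
const db: Healthcheck = { intervalSeconds: 3, retries: 20, startPeriodSeconds: 5 };

// Made-up tight settings for a slow broker, to show a failing case.
const tight: Healthcheck = { intervalSeconds: 3, retries: 5, startPeriodSeconds: 0 };

console.log(healthBudgetSeconds(db));    // 5 + 20 * 3 = 65
console.log(coversColdStart(db, 45));    // true
console.log(coversColdStart(tight, 30)); // false: 15s budget for a 30s cold start
```

If the check fails for a service, raising `start_period` or `retries` widens the budget without making the steady-state probe any less frequent.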

package/skills/configure-bundler-build/SKILL.md ADDED

@@ -0,0 +1,166 @@
+---
+name: configure-bundler-build
+description: Configures and optimizes the JS/TS build toolchain — tsconfig plus a bundler (Vite/esbuild/Rollup/tsup/webpack) — for correct module output (ESM/CJS/dual + types), code splitting, tree-shaking, sourcemaps, env injection, and fast incremental builds.
+when_to_use: Setting up or fixing how an app or library compiles and bundles — wrong module format, broken tree-shaking, missing/incorrect types, slow builds, tsconfig errors. Distinct from dockerfile-optimize (container images) and optimize-core-web-vitals (browser runtime metrics).
+---
+## When to Use
+Reach for this skill when the problem is **how source compiles and emits**, not how it runs in a browser or container:
+- "Set up the build for this app/library" (pick bundler, tsconfig, output format)
+- "My library ships ESM but breaks in a CJS `require()`" (or vice versa) — dual-package output
+- "Consumers get `Could not find a declaration file`" — missing/mislocated `.d.ts`
+- "Tree-shaking isn't dropping unused exports" — dead code in the bundle
+- "`tsc`/`vite build` is slow" — switch transform to esbuild/swc, add a persistent cache
+- "`define`/`import.meta.env` isn't replacing my env var" or a secret leaked into the client bundle
+- tsconfig errors: `module`/`moduleResolution` mismatch, `"x.js" has no exported member`, paths not resolving
+NOT this skill:
+- Shrinking the runtime container image, multi-stage Docker layers → dockerfile-optimize
+- LCP/INP/CLS, lazy-loading images, render-blocking JS in the browser → optimize-core-web-vitals
+- Cross-package build orchestration, workspace topo build order, Turbo/Nx pipelines → setup-monorepo-tooling
+- `npm publish`, `files`/`publishConfig`, provenance, version bump → publish-package-registry
+- ESLint/Prettier/pre-commit wiring → setup-lint-format-precommit
+- Pinning the Node/pnpm/tsc *versions* themselves (engines, `.nvmrc`, Volta) → pin-toolchain-versions
+## Steps
+1. **Pick the bundler by build target — do not default to webpack.**
+   | Target | Use | Why |
+   |---|---|---|
+   | **App** (SPA/SSR, has an entry HTML or framework) | **Vite** | Rollup-based prod build, esbuild dev, HMR, code-splitting out of the box |
+   | **Library** (published to npm, consumers bundle it) | **tsup** (esbuild) or **Rollup** | dual ESM+CJS + `.d.ts` in one config; Rollup when you need fine-grained chunking |
+   | **Node tool / CLI / serverless fn** (single self-run entry) | **esbuild** | fastest, bundle deps in, `--platform=node`, no chunk graph needed |
+   | Legacy app needing module federation / exotic loaders | webpack | only when a Vite/Rollup plugin doesn't exist |
+   Default: **app → Vite, library → tsup, node-tool → esbuild.** One tool emits JS; **`tsc` emits types** (or `tsup --dts` / `vite-plugin-dts` wraps it). Never run `tsc` as the bundler for shipping code — it doesn't bundle, tree-shake, or split.
+2. **Set the tsconfig essentials — `moduleResolution` is the #1 footgun.** Pick the resolution mode by who resolves modules:
+   | Scenario | `module` | `moduleResolution` |
+   |---|---|---|
+   | Bundler handles resolution (Vite/tsup/esbuild) | `ESNext` (or `Preserve`) | `bundler` |
+   | Node runs the output directly (Node ESM/CJS) | `NodeNext` | `nodenext` |
+   ```jsonc
+   // tsconfig.json — app/library baseline
+   {
+     "compilerOptions": {
+       "target": "ES2022",          // match your lowest runtime; don't ship ES5 needlessly
+       "lib": ["ES2022", "DOM"],    // drop "DOM" for node-only code
+       "module": "ESNext",
+       "moduleResolution": "bundler",
+       "strict": true,
+       "skipLibCheck": true,
+       "esModuleInterop": true,
+       "isolatedModules": true,     // required: esbuild/swc compile file-by-file
+       "verbatimModuleSyntax": true,// makes `import type` explicit — kills accidental value imports
+       "declaration": true,         // emit .d.ts (libraries)
+       "declarationMap": true,      // go-to-definition into your source
+       "sourceMap": true,
+       "outDir": "dist",
+       "paths": { "@/*": ["./src/*"] }
+     }
+   }
+   ```
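+   `paths` are a **type-level alias only**, so the bundler must be told too: in Vite use `resolve.alias` or the `vite-tsconfig-paths` plugin; esbuild and tsup read `paths` from `tsconfig.json` when bundling (or set esbuild's `alias` option); for code Node runs directly, use `tsconfig-paths`. tsc does not rewrite them in emitted JS.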
+3. **For a library, emit dual ESM+CJS with a correct `exports` map — the `exports` map is the contract, file extensions are the proof.** tsup config:
+   ```ts
+   // tsup.config.ts
+   import { defineConfig } from "tsup";
+   export default defineConfig({
+     entry: ["src/index.ts"],
+     format: ["esm", "cjs"],   // → index.js (esm) + index.cjs
+     dts: true,                // → index.d.ts (+ .d.cts for cjs types)
+     sourcemap: true,
+     treeshake: true,
+     clean: true,
+     target: "node18",
+     external: [/^node:/],     // never bundle node builtins
+   });
+   ```
+   ```jsonc
+   // package.json — types condition MUST come first in each block
+   {
+     "type": "module",
+     "exports": {
+       ".": {
+         "import": { "types": "./dist/index.d.ts",  "default": "./dist/index.js"  },
+         "require": { "types": "./dist/index.d.cts", "default": "./dist/index.cjs" }
+       },
+       "./package.json": "./package.json"
+     },
+     "main": "./dist/index.cjs",      // legacy fallback for old resolvers
+     "module": "./dist/index.js",
+     "types": "./dist/index.d.ts",
+     "sideEffects": false,
+     "files": ["dist"]
+   }
+   ```
+   Keep `peerDependencies` (react, etc.) in `external` so you don't bundle two copies into the consumer.
+4. **App: split code with dynamic `import()`, then control chunks deliberately.** Route-level `const Page = lazy(() => import('./Page'))` and `import('heavy-lib')` create async chunks automatically. Pull stable vendor deps into their own long-cached chunk:
+   ```ts
+   // vite.config.ts
+   build: {
+     sourcemap: true,
+     rollupOptions: {
+       output: {
+         manualChunks: { vendor: ["react", "react-dom"] }, // or a function for finer control
+       },
+     },
+     chunkSizeWarningLimit: 500,
+   }
+   ```
+   Don't over-split (HTTP/2 helps, but hundreds of tiny chunks add request + parse overhead). Split on real route/feature boundaries, not per-file.
+5. **Enable tree-shaking — it only works on static ESM.** Author with `import`/`export` (no `require`, no `module.exports`); CJS interop defeats it. Mark the package `"sideEffects": false` (or list the few files with real side effects, e.g. `["**/*.css", "./src/polyfill.ts"]`) so the bundler may drop unused modules. Annotate top-level calls that look impure but aren't with `/*#__PURE__*/`:
+   ```ts
+   export const icon = /*#__PURE__*/ createIcon(path); // droppable if `icon` is unused
+   ```
+   A `"sideEffects": false` lie (a module that *does* mutate global state on import) causes silently-missing behavior — list those files.
+6. **Inject env at build time via `define`/`import.meta.env` — never bake a secret.** Static replacement only:
+   ```ts
+   // vite: only VITE_* are exposed to client; access via import.meta.env.VITE_API_URL
+   // esbuild/tsup: define: { "process.env.NODE_ENV": JSON.stringify("production") }
+   ```
+   **Anything bundled for the browser is public.** Put API keys/DB URLs behind a server route or read them at runtime on the server (`process.env`) — a `define`'d secret is grep-able in `dist/`. Gate dev-only code behind `if (import.meta.env.DEV)` / `process.env.NODE_ENV !== "production"` so it tree-shakes out of prod.
+7. **Always emit sourcemaps; choose by environment.** `sourcemap: true` (full, external `.map`) for libraries and CI artifacts. For a public web app, ship `hidden` sourcemaps (uploaded to your error tracker, not referenced in the bundle) so stack traces de-minify without exposing source to every visitor. Never `eval`/inline sourcemaps in production.
+8. **Make builds fast and incremental.** Use an esbuild/swc transform (Vite and tsup already do) instead of `ts-loader`/`babel` for the JS transform — 10–100× faster. Keep type-checking **out of the bundle path**: run `tsc --noEmit` (or `vite build` + a parallel `tsc -b --watch`) so a type error doesn't block fast iteration but still gates CI. Turn on the persistent cache (Vite caches in `node_modules/.vite`; for `tsc -b` use `incremental: true` + `tsBuildInfoFile`). Add `--metafile` (esbuild) / `rollup-plugin-visualizer` to find what's bloating the bundle.
+9. **Verify the output shape before declaring done** (see Verify) — `publint` + `@arethetypeswrong/cli` catch the dual-package and types-resolution bugs that don't surface until a consumer installs you.
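Step 4 mentions that `manualChunks` can also be a function for finer control. A minimal sketch of that form follows, assuming Rollup's documented behavior of calling the function with each module id and using a returned string as the chunk name. The grouping rules and the chunk names `react-vendor` and `vendor` are illustrative assumptions, not part of the original config.

```typescript
// Hypothetical chunking rule for rollupOptions.output.manualChunks.
// Returning undefined leaves the module to the bundler's default splitting.
function manualChunks(id: string): string | undefined {
  // Normalize Windows separators so the checks below work on every platform.
  const path = id.replace(/\\/g, "/");
  if (!path.includes("/node_modules/")) return undefined; // app code
  // Keep React and its renderer together; they are always loaded as a pair.
  if (/\/node_modules\/(react|react-dom|scheduler)\//.test(path)) return "react-vendor";
  return "vendor"; // every other dependency
}
```

Passing this function as `build.rollupOptions.output.manualChunks` would replace the object form; route-level splitting via dynamic `import()` is unaffected.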
+## Common Errors
+- **`moduleResolution: node` (the legacy `node10` algorithm) with modern packages.** It ignores the `exports` map, so packages that expose their entry points only through `exports` fail to resolve. Use `bundler` (a bundler resolves imports) or `nodenext` (Node resolves them), never the legacy `node`/`node10`.
+- **`types` condition placed last in the `exports` map.** TS reads conditions top-down and takes the first match; if `import`/`require` come before `types`, the consumer gets "no declaration file." `types` must be the **first** key in each condition block.
+- **`.cjs` file emitting `export {}` (or `.mjs` with `require`).** The `exports` map points at the wrong file per condition, or `"type": "module"` mismatches the extension. ESM → `.mjs` (or `.js` when `"type": "module"` is set), CJS → `.cjs` (or `.js` when it is not). Verify with `node -e "require('your-pkg')"` and a separate `import`.
+- **`"sideEffects": false` on a package that has side effects.** Tree-shaking drops a polyfill/CSS/registration import → feature silently missing in prod only. List the real side-effect files instead of a blanket `false`.
+- **Secret in a `VITE_`/`define`d var.** It's inlined into client JS and shipped to every browser. Only public values get `VITE_`/`NEXT_PUBLIC_`; secrets stay server-side at runtime.
+- **`paths` alias resolves in the editor but `Cannot find module '@/x'` at build.** tsc/`paths` is type-only; the bundler needs its own alias (`vite-tsconfig-paths`, `resolve.alias`, or `tsconfig-paths`). Configure both.
+- **`isolatedModules` errors on `export { Foo }` where `Foo` is a type.** esbuild/swc compile each file alone and can't tell types from values. Use `export type { Foo }` / `import type` (enforced by `verbatimModuleSyntax`).
+- **Bundling `peerDependencies` (react, etc.) into a library.** Consumer gets two React copies → "invalid hook call." Mark peers `external`.
+- **Running `tsc` as the production bundler.** It transpiles per-file but doesn't bundle, tree-shake, split, or rewrite `paths` — output has unresolved aliases and no chunking. Use a real bundler for JS; `tsc` only for `.d.ts`.
+- **No sourcemaps in prod (or inline/eval maps).** Minified stack traces are useless; inline maps bloat the bundle and leak source. Emit external (`hidden` for public web), upload to the error tracker.
+- **Targeting `ES5`/old `lib` by reflex.** Forces heavy down-leveling and polyfills for runtimes that support modern JS. Set `target`/`lib` to your *actual* lowest runtime.
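The `types`-ordering error above can be checked mechanically. The sketch below walks a parsed `exports` map and reports every condition block where `types` is present but not the first key. The function name `findMisorderedTypes` and the `ExportsNode` type are hypothetical helpers for illustration; `publint` and `@arethetypeswrong/cli` remain the authoritative checks.

```typescript
// Shape of a package.json "exports" value: a path, null, or nested conditions/subpaths.
type ExportsNode = string | null | { [key: string]: ExportsNode };

// Returns the location of each block where "types" exists but is not listed first.
function findMisorderedTypes(node: ExportsNode, path = "exports"): string[] {
  if (node === null || typeof node === "string") return [];
  const keys = Object.keys(node); // same order as written in package.json
  const problems: string[] = [];
  if (keys.includes("types") && keys[0] !== "types") problems.push(path);
  for (const key of keys) {
    problems.push(...findMisorderedTypes(node[key], `${path}[${JSON.stringify(key)}]`));
  }
  return problems;
}
```

Feeding it `JSON.parse` output of `package.json`'s `exports` field would list the blocks to reorder; an empty array means the ordering rule is satisfied.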
+## Verify
+1. **Clean build succeeds:** `rm -rf dist && <build>` exits 0 and `dist/` contains the expected entry files (`.js`, `.cjs`, `.d.ts`, `.map`).
+2. **Types resolve both ways (library):** `npx @arethetypeswrong/cli --pack` reports no ❌ — no "masquerading as CJS/ESM", no missing types per condition.
+3. **Package shape is publishable:** `npx publint` is clean — `exports`, `main`/`module`/`types`, and file extensions all consistent.
+4. **Dual import actually loads:** in a scratch dir, `node -e "import('your-pkg').then(m=>console.log(Object.keys(m)))"` **and** `node -e "console.log(Object.keys(require('your-pkg')))"` both print the API — no `ERR_REQUIRE_ESM` / `ERR_PACKAGE_PATH_NOT_EXPORTED`.
+5. **Type-check passes independently:** `tsc --noEmit` exits 0 (proves the build path didn't skip a type error).
+6. **Tree-shaking works:** bundle a fixture importing one named export; the visualizer/`--metafile` shows unused siblings absent from output. Bundle size drops when an unused heavy import is removed.
+7. **Code-splitting present (app):** prod build emits ≥1 async chunk per lazy route, and the vendor chunk is separate from app code (check `dist/assets/`).
+8. **No secret in the bundle:** `grep -r "<a known secret substring>" dist/` returns nothing; only intended public `VITE_*`/`NEXT_PUBLIC_*` values appear.
+9. **Sourcemaps map back:** open a built file's `.map` or trigger an error — stack trace points to original `src/` lines, not minified columns.
+10. **Incremental rebuild is fast:** a one-line edit triggers a sub-second rebuild (warm cache), not a full cold compile.
+Done = clean build emits the correct module formats + types, `attw` and `publint` are clean, both `import()` and `require()` load the API, `tsc --noEmit` passes, tree-shaking and code-splitting are confirmed in the output, no secret leaked into `dist/`, and warm rebuilds are fast.

package/skills/configure-dns-tls/SKILL.md ADDED

@@ -0,0 +1,142 @@
+---
+name: configure-dns-tls
+description: Configures DNS records and TLS for a service — A/AAAA/CNAME/ALIAS/MX/TXT/CAA, zero-downtime cutovers via pre-lowered TTL, automated ACME/Let's Encrypt/cert-manager issuance and auto-renewal, and TLS 1.2+/1.3-only settings with HSTS, OCSP stapling, and 80→443 redirect — eliminating expired-cert and bad-cutover outages.
+when_to_use: Pointing a domain at a service, enabling HTTPS, automating/rotating certificates (ACME/cert-manager), or migrating DNS. Distinct from configure-reverse-proxy-lb (the proxy/LB that terminates the TLS this issues) and setup-cdn-edge-waf (the CDN/WAF edge in front).
+---
+## When to Use
+Reach for this skill when the task is **names and certificates** — getting a domain to resolve to your service and serving valid HTTPS that renews itself:
+- "Point `app.example.com` at this load balancer / IP without downtime"
+- "Enable HTTPS / fix the expired-cert outage / stop the cert from ever expiring again"
+- "Automate certs with Let's Encrypt / cert-manager; issue a wildcard"
+- "Migrate DNS to a new provider / cut over to a new origin"
+- "Lock down SPF/DKIM/DMARC, or CAA so only my CA can issue"
+- "Why does SSL Labs give us a B? Harden the TLS config"
+NOT this skill:
+- Configuring the proxy/LB/Ingress that **terminates** TLS, virtual hosts, upstream pools, timeouts → configure-reverse-proxy-lb
+- The CDN/edge, WAF rules, edge caching, or DDoS layer in front of origin → setup-cdn-edge-waf
+- Application-layer auth/authz, token scopes, RBAC → design-authorization-model
+- Tamper-evident security event logs (incl. cert-rotation events) → build-audit-logging
+This skill owns the **record values, the cutover choreography, certificate lifecycle, and the TLS handshake policy**. It hands the terminated connection to the proxy.
+## Steps
+1. **Pick the record type by what you're pointing at — do not CNAME the apex.**
+   | Need | Record | Notes |
+   |---|---|---|
+   | Name → IPv4 | `A` | Bare IP only |
+   | Name → IPv6 | `AAAA` | Add alongside A; serve dual-stack |
+   | Subdomain → another hostname | `CNAME` | e.g. `www → app.example.com`; cannot coexist with other records on that name |
+   | **Apex** (`example.com`) → hostname | `ALIAS`/`ANAME`/flattened-CNAME | Apex can't be a real CNAME (breaks SOA/NS/MX). Use the provider's ALIAS (Route 53 alias, Cloudflare CNAME-flattening, etc.) |
+   | Mail | `MX` | Priority + target; target must be a hostname that resolves via A/AAAA, never a CNAME |
+   | SPF/DKIM/DMARC/verification | `TXT` | One SPF per domain; DMARC at `_dmarc`; DKIM at `<sel>._domainkey` |
+   | Who may issue certs | `CAA` | `0 issue "letsencrypt.org"` + `0 issuewild "letsencrypt.org"` |
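+   CAA is optional: with no CAA record set, any CA may issue. Once a CAA record set exists, it must name your CA **before** the first ACME issuance, or issuance fails with `CAA record prevents issuance`. Example: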
+   ```
+   example.com.  CAA  0 issue "letsencrypt.org"
+   example.com.  CAA  0 issuewild "letsencrypt.org"
+   example.com.  CAA  0 iodef "mailto:security@example.com"
+   ```
+2. **Zero-downtime cutover: lower the TTL BEFORE the change — this is the whole trick.** Resolvers cache the old answer for up to its TTL; if you cut over while TTL is 3600, clients hit the dead origin for an hour.
+   1. Drop the record's TTL to `60` (or `30`). **Wait out the *old* TTL** (e.g. wait the full prior 3600s) so every cache holds the short TTL.
+   2. Run both origins in parallel (old + new healthy) during the switch — never tear down old first.
+   3. Change the record value to the new target.
+   4. Verify the new answer is served (step in Verify) and the new origin takes real traffic.
+   5. Only after traffic has fully drained from the old origin (watch its access logs go quiet for > one TTL), decommission it and **raise TTL back** to 3600+ to cut query volume/cost.
+3. **Automate certificates — manual renewal is a guaranteed future outage.** Use ACME (Let's Encrypt / ZeroSSL). Never click-issue a 1-year cert you have to remember to renew; LE is 90-day by design to *force* automation.
+   - **VM / bare proxy:** `certbot` with a renewal timer, or the proxy's built-in ACME (Caddy auto-HTTPS, Traefik resolver, nginx + `acme.sh`).
+   - **Kubernetes:** **cert-manager** — a `ClusterIssuer` + `Certificate` (or Ingress annotation) reconciles renewal automatically; renews at ~⅔ of lifetime.
+   ```yaml
+   # cert-manager: DNS-01 wildcard via Cloudflare
+   apiVersion: cert-manager.io/v1
+   kind: ClusterIssuer
+   metadata: { name: letsencrypt-prod }
+   spec:
+     acme:
+       server: https://acme-v02.api.letsencrypt.org/directory
+       email: ops@example.com
+       privateKeySecretRef: { name: letsencrypt-prod-key }
+       solvers:
+       - dns01:
+           cloudflare:
+             apiTokenSecretRef: { name: cloudflare-token, key: api-token }
+   ---
+   apiVersion: cert-manager.io/v1
+   kind: Certificate
+   metadata: { name: example-tls, namespace: web }
+   spec:
+     secretName: example-tls          # Ingress references this
+     issuerRef: { name: letsencrypt-prod, kind: ClusterIssuer }
+     dnsNames: ["example.com", "*.example.com"]
+   ```
+   Iterate against the **staging** ACME server first — set the `ClusterIssuer` `spec.acme.server` to `https://acme-staging-v02.api.letsencrypt.org/directory` (or `certbot --test-cert`) to dodge LE prod rate limits (50 certs / registered-domain / week) while debugging, then flip the server back to prod and re-issue.
+4. **Choose the ACME challenge and cert shape deliberately.**
+   | Option | Pick when | Notes |
+   |---|---|---|
+   | **HTTP-01** | single host, port 80 reachable from internet | simplest; needs `/.well-known/acme-challenge/` served; **cannot** do wildcards |
+   | **DNS-01** | wildcards, internal hosts, no inbound 80, or many SANs | proves control via a `_acme-challenge` TXT; needs DNS-provider API creds; works behind a firewall |
+   | **Wildcard** `*.example.com` | many dynamic subdomains | DNS-01 only; one cert, but a single shared private key (bigger blast radius) |
+   | **SAN / multi-domain** | a known fixed set of names | explicit per-name; rotate one without touching others; preferred when the list is stable |
+   Default: **SAN cert via DNS-01** for anything non-trivial; wildcard only when subdomains are unbounded/dynamic.
+5. **Set a modern TLS policy at the terminator — TLS 1.2+ only, redirect, HSTS, stapling.** Configure on whatever terminates (see configure-reverse-proxy-lb), but the *policy* is owned here:
+   - Protocols: **TLS 1.3 + TLS 1.2 only**. Disable TLS 1.0/1.1 and SSLv3 entirely.
+   - Ciphers: TLS 1.3 defaults; for 1.2 use forward-secret AEAD suites (ECDHE + AES-GCM/CHACHA20), no CBC/RC4/3DES.
+   - **Redirect 80→443** with `301`, then serve everything over HTTPS.
+   - **HSTS** on HTTPS responses: `Strict-Transport-Security: max-age=63072000; includeSubDomains; preload` — but only add `preload`/`includeSubDomains` once *every* subdomain is HTTPS (it's hard to undo). Roll out short → long → preload.
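+   - **OCSP stapling** on (`ssl_stapling on;` in nginx) so clients don't round-trip the CA. Stapling only takes effect when the certificate carries an OCSP responder URL; Let's Encrypt stopped including one in 2025, and nginx logs a warning and skips stapling for such certificates.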
+   - Serve the **full chain** (leaf + intermediates), not just the leaf: a missing intermediate is a common cause of "works in my browser, fails in `curl`/old Android".
+   ```nginx
+   server {
+     listen 443 ssl http2;                 # nginx >= 1.25.1 deprecates this form: use `listen 443 ssl;` plus `http2 on;`
+     server_name example.com;
+     ssl_protocols TLSv1.2 TLSv1.3;        # TLS 1.0/1.1 and SSLv3 stay disabled
+     ssl_prefer_server_ciphers off;        # let clients choose among the modern suites
+     ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;  # full chain
+     ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
+     ssl_stapling on; ssl_stapling_verify on;
+     add_header Strict-Transport-Security "max-age=63072000; includeSubDomains" always;
+   }
+   server { listen 80; server_name example.com; return 301 https://$host$request_uri; }
+   ```
+6. **Prove auto-renew works before you trust it.** A cert that issues fine but never renews is a 90-day time bomb. Force a dry-run/staging renewal now (step in Verify) so you discover broken DNS creds or a missing port today, not at 2am on day 89.
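The timing rule in step 2 is simple arithmetic, sketched here as a small helper. It assumes resolvers honor TTLs (some clamp or ignore them, so treat the results as lower bounds). The names `planCutover` and `CutoverTimes` are hypothetical and exist only for this illustration.

```typescript
interface CutoverTimes {
  oldTtlSeconds: number;     // TTL before it was lowered, e.g. 3600
  loweredTtlSeconds: number; // short TTL used for the cutover, e.g. 60
  ttlLoweredAt: Date;        // when the short TTL was published
  recordChangedAt: Date;     // when the record value was (or will be) switched
}

function planCutover(t: CutoverTimes) {
  const plus = (d: Date, seconds: number) => new Date(d.getTime() + seconds * 1000);
  // Resolvers that cached just before the TTL change keep the old TTL until this moment.
  const earliestSafeChange = plus(t.ttlLoweredAt, t.oldTtlSeconds);
  // Resolvers that cached just before the switch keep the old value for one short TTL.
  const shortTtlExpiry = plus(t.recordChangedAt, t.loweredTtlSeconds);
  const oldAnswerGoneBy = new Date(
    Math.max(earliestSafeChange.getTime(), shortTtlExpiry.getTime()),
  );
  return {
    earliestSafeChange,
    changedTooEarly: t.recordChangedAt.getTime() < earliestSafeChange.getTime(),
    oldAnswerGoneBy,
  };
}
```

With a prior TTL of 3600 and a lowered TTL of 60, switching the record ten minutes after lowering the TTL still leaves stale answers in caches for up to the full hour, which is the failure the step warns about.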
+## Common Errors
+- **CNAME on the apex.** Breaks NS/SOA/MX co-existence; many resolvers reject it. Use ALIAS/ANAME/CNAME-flattening for `example.com`.
+- **Cutover without pre-lowering TTL.** You switch the record but caches serve the dead origin for the full old TTL (often an hour). Lower TTL and wait out the *old* TTL first.
+- **Raising TTL or killing the old origin too early.** Do it only after old-origin logs go quiet for > one TTL; otherwise stragglers with the old answer still cached hit a dead origin.
+- **Missing/forbidding CAA.** No CAA = any CA may issue (security gap); a CAA that omits your CA = ACME fails with `CAA record prevents issuance`. Add the issuing CA explicitly, including `issuewild` for wildcards.
+- **HTTP-01 for a wildcard.** Impossible — wildcards require DNS-01. Switch the solver.
+- **Manual cert renewal "we'll remember."** You won't. The outage is scheduled for expiry day. Automate or it will lapse.
+- **Serving only the leaf cert.** Browsers cache intermediates and "work"; `curl`, Java, old Android, and API clients fail chain validation. Always deploy `fullchain.pem`.
+- **Burning LE rate limits while debugging.** Iterate against `acme-staging-v02` (or `certbot --test-cert`); only hit prod once issuance succeeds in staging.
+- **`includeSubDomains`/`preload` HSTS before all subdomains are HTTPS.** Any plain-HTTP subdomain becomes unreachable, and `preload` is baked into browsers for months. Roll HSTS out short → long → preload.
+- **DNS-01 with under-scoped API creds.** The token can't write `_acme-challenge` TXT, so renewal silently fails. Scope the token to DNS-edit on that zone and test it.
+- **Mixed content after enabling HTTPS.** Page loads over HTTPS but pulls `http://` assets → browser blocks them. Rewrite asset URLs to `https://` or protocol-relative; verify console is clean.
+- **Clock skew on the TLS host.** A wrong system clock makes a valid cert read as not-yet-valid/expired. Run NTP.
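The CAA error above follows from how a CA evaluates the record set. The sketch below approximates the RFC 8659 decision for a single, already-located record set: non-wildcard requests consult `issue` only, wildcard requests use `issuewild` when present and fall back to `issue`, and an empty relevant set does not restrict issuance. It deliberately ignores the critical flag, tree climbing to parent names, and CNAME handling. `caaPermits` and `CaaRecord` are hypothetical names.

```typescript
interface CaaRecord {
  tag: string;   // "issue", "issuewild", "iodef", ...
  value: string; // e.g. "letsencrypt.org" or ";" (nobody may issue)
}

function caaPermits(records: CaaRecord[], caDomain: string, wildcard: boolean): boolean {
  const byTag = (tag: string) => records.filter((r) => r.tag.toLowerCase() === tag);
  const issuewild = byTag("issuewild");
  // Wildcard requests prefer issuewild; everything else looks at issue only.
  const relevant = wildcard && issuewild.length > 0 ? issuewild : byTag("issue");
  if (relevant.length === 0) return true; // nothing restricts this request
  // The issuer domain is the part before any ";"-separated parameters.
  const issuerOf = (value: string) => value.split(";")[0].trim().toLowerCase();
  return relevant.some((r) => issuerOf(r.value) === caDomain.toLowerCase());
}
```

Running it against the output of `dig CAA example.com` (parsed into tag/value pairs) before the first issuance gives an early warning for the `CAA record prevents issuance` failure.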
+## Verify
+1. **Records resolve correctly:** `dig +short A app.example.com` (and `AAAA`) returns the new target; `dig CAA example.com` shows your CA; `dig TXT _dmarc.example.com` shows the DMARC policy. Query an external resolver (`dig @1.1.1.1 …`) too, not just the local cache.
+2. **TTL was actually lowered before cutover:** `dig app.example.com | grep -E '^app'` shows the short TTL *before* you change the value; confirm the answer flips after, and that it propagated (`dig @8.8.8.8` and `@1.1.1.1` agree).
+3. **Full chain + protocol scan:** `echo | openssl s_client -connect example.com:443 -servername example.com -showcerts` shows leaf **and** intermediate(s), `Verify return code: 0 (ok)`. `testssl.sh example.com` (or SSL Labs) reports TLS 1.2/1.3 only, no TLS 1.0/1.1, HSTS present, OCSP stapled — target grade **A/A+**.
+4. **Redirect + HSTS:** `curl -sI http://example.com` → `301` to `https://`; `curl -sI https://example.com | grep -i strict-transport` shows the HSTS header.
+5. **No mixed content:** load the page, browser console shows zero "Mixed Content" / blocked-asset warnings; all subresources are `https://`.
+6. **Expiry & auto-renew proven:** `echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | openssl x509 -noout -enddate` shows a future date. Then force a **staging** renewal and confirm a fresh cert issues without manual steps: `certbot renew --dry-run` (VM) or, for k8s, point the issuer at `acme-staging-v02`, run `cmctl renew example-tls -n web`, and watch `cmctl status certificate example-tls -n web` go Ready.
+7. **Mail auth (if MX set):** SPF/DKIM/DMARC TXT records validate (e.g. an external mail-tester) — no `softfail`/missing-DKIM.
+Done = every name resolves to the new target on external resolvers, HTTPS serves the **full chain** over **TLS 1.2/1.3 only** with HSTS + stapling + 80→443 redirect and no mixed content (SSL Labs/testssl ≥ A), CAA locks issuance to your CA, and a staging force-renew has **proven** auto-renewal works before any cert nears expiry.
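For monitoring alongside the expiry check, the renewal point can be derived from the certificate's validity window. This sketch uses the two-thirds-of-lifetime point that the Steps attribute to cert-manager as its default; the fraction is a parameter because other ACME clients use different thresholds. `renewalStatus` is a hypothetical helper, and the dates would come from `openssl x509 -noout -dates` or similar.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

function renewalStatus(notBefore: Date, notAfter: Date, now: Date, renewFraction = 2 / 3) {
  const lifetimeMs = notAfter.getTime() - notBefore.getTime();
  // Point in the validity window at which renewal is expected to have started.
  const renewAt = new Date(notBefore.getTime() + Math.round(lifetimeMs * renewFraction));
  return {
    renewAt,
    daysLeft: Math.floor((notAfter.getTime() - now.getTime()) / DAY_MS),
    renewalDue: now.getTime() >= renewAt.getTime(), // past this, an unrenewed cert is a warning sign
    expired: now.getTime() >= notAfter.getTime(),
  };
}
```

An alert on `renewalDue` with the old certificate still being served catches a broken renewal weeks before expiry rather than on the day.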

package/skills/configure-reverse-proxy-lb/SKILL.md ADDED

@@ -0,0 +1,129 @@
+---
+name: configure-reverse-proxy-lb
+description: Configures a reverse proxy / load balancer (nginx, Envoy, Caddy, HAProxy) in front of services — upstream pools, active/passive health checks, per-hop connect/read/send timeouts, TLS termination vs passthrough, idempotent-only retries with circuit breaking, sticky sessions, and zero-drop graceful reloads.
+when_to_use: Putting a proxy/LB in front of services, fixing 502/504s, balancing across instances, or routing by host/path. Distinct from configure-dns-tls (DNS records + cert issuance), setup-cdn-edge-waf (the CDN/WAF edge), rate-limiting (app-level request caps), and k8s-manifest-review (in-cluster Service/Ingress objects).
+---
+## When to Use
+Reach for this skill when the request is about **the proxy/LB layer between clients and your services**:
+- "Put nginx/Envoy/Caddy/HAProxy in front of these app instances"
+- "We're getting random 502/504s — fix the timeouts"
+- "Balance traffic across N backends and drop a dead one automatically"
+- "Route by `Host:` / path prefix to different upstreams"
+- "Terminate TLS at the proxy" / "pass TLS straight through to the backend"
+- "Config reload kills in-flight requests — make it zero-drop"
+NOT this skill:
+- Creating DNS records or issuing/renewing the cert itself → configure-dns-tls
+- The CDN/edge tier, bot rules, or WAF rulesets → setup-cdn-edge-waf
+- Per-user/per-key request caps and 429s at the app → rate-limiting
+- Kubernetes `Service`/`Ingress`/`Gateway` objects in-cluster → k8s-manifest-review
+## Steps
+1. **Pick the proxy by requirement — default to nginx.**
+   | Proxy | Pick when | Watch out |
+   |---|---|---|
+   | **nginx** | General L7 in front of HTTP/HTTPS apps — the default | Active health checks need nginx **Plus**; OSS only does passive `max_fails` |
+   | **Envoy** | Dynamic config via xDS, gRPC/HTTP2, fine-grained circuit breaking, outlier detection | Steep config; run with a control plane (Istio/Contour/Gloo) for anything large |
+   | **Caddy** | You want automatic TLS (ACME) with near-zero config | Less knob-level control over upstreams/retries |
+   | **HAProxy** | Heavy L4 (TCP) LB, max throughput, advanced balancing/observability | L7 ergonomics weaker than nginx for content routing |
+   For a typical web service: **nginx terminating TLS, round-robin or least-conn upstream, passive health checks**. Reach for Envoy only when you genuinely need dynamic upstreams or per-endpoint outlier ejection.
+2. **Define the upstream pool + algorithm — least-conn is the safer default for mixed latency.**
+   ```nginx
+   upstream app {
+       least_conn;                         # round-robin is fine for uniform requests; least_conn for variable latency
+       server 10.0.1.11:8080 max_fails=3 fail_timeout=10s;
+       server 10.0.1.12:8080 max_fails=3 fail_timeout=10s;
+       server 10.0.1.13:8080 max_fails=3 fail_timeout=10s backup;  # only when primaries are down
+       keepalive 64;                       # REUSE upstream conns; without this every request opens a fresh TCP connection (plus a TLS handshake if the upstream is https)
+   }
+   ```
+   - **round-robin** (default): uniform, cheap requests.
+   - **least-conn**: requests with variable duration — avoids piling onto a slow node.
+   - **consistent-hash** (`hash $arg_key consistent;` / Envoy ring-hash): only when a key must stick to a backend (cache affinity, sharding). Plain `ip_hash` rebalances badly when a node leaves; use `consistent` so a single ejection doesn't reshuffle every key.
+3. **Set timeouts at EVERY hop — a proxy timeout shorter than the app is the #1 cause of 502/504.** A 502 = backend refused/reset the connection; a 504 = backend accepted but didn't answer before `proxy_read_timeout`. The proxy's read timeout must be **longer** than the slowest legitimate backend response, and the backend's own keepalive must be **longer** than the proxy's so the proxy never reuses a socket the backend just closed (classic race → sporadic 502).
+   ```nginx
+   location / {
+       proxy_pass http://app;
+       proxy_http_version 1.1;
+       proxy_set_header Connection "";      # required so keepalive to upstream actually works
+       proxy_connect_timeout 2s;            # TCP connect to backend — short; a backend that won't accept is dead
+       proxy_send_timeout   30s;            # writing the request body to backend
+       proxy_read_timeout   60s;            # waiting for the backend's response — MUST exceed slowest real response
+   }
+   # And: backend keepalive_timeout (e.g. 75s) > nginx upstream idle reuse window, to avoid the reuse-after-close 502.
+   ```
+   Envoy: set `connect_timeout` on the cluster and `route.timeout` per route; default route timeout is 15s and silently truncates long requests — set it deliberately.
+4. **Add health checks — passive at minimum, active if your proxy supports it.** Passive ejection (`max_fails`/`fail_timeout`, Envoy outlier detection) reacts only to *real* request failures, so a freshly-booted-but-broken node still gets traffic until it fails N live requests. Active checks (nginx Plus `health_check`, HAProxy `option httpchk`, Envoy `health_checks`) probe a `/healthz` endpoint and eject before user traffic hits it.
+   - Health endpoint must check **dependencies** (DB, cache reachable), not just "process is up" — otherwise you keep a node that 500s on every real request.
+   - Set an explicit `unhealthy`→`healthy` hysteresis (e.g. eject after 3 fails, re-add after 2 passes) so a flapping node doesn't oscillate in and out of rotation.
+5. **TLS: terminate at the proxy unless the backend itself must hold the TLS session.** Terminate (decrypt at the proxy, then plaintext or re-encrypt to the backend) for HTTP routing, header inspection, and central cert management; this is the common case. Use **passthrough** (L4 `stream`/SNI routing, proxy never decrypts) only for end-to-end encryption mandates or non-HTTP TLS. When terminating, forward the original scheme/IP so the app builds correct URLs and logs the real client:
+   ```nginx
+   proxy_set_header Host              $host;
+   proxy_set_header X-Real-IP         $remote_addr;
+   proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
+   proxy_set_header X-Forwarded-Proto $scheme;   # app uses this to know the request was HTTPS
+   ```
+   Pin `ssl_protocols TLSv1.2 TLSv1.3;` and a modern cipher suite; redirect `:80` → `:443`.
+6. **Retry idempotent requests ONLY, with circuit breaking.** Auto-retrying a `POST`/`PATCH` that timed out can double-charge a card or double-write. Restrict retries to safe methods + connect/early failures, cap attempts, and stop retrying once the backend is clearly down.
+   ```nginx
+   proxy_next_upstream error timeout http_502 http_503;   # NOT non_idempotent — never blindly retry POST
+   proxy_next_upstream_tries 2;
+   proxy_next_upstream_timeout 10s;
+   ```
+   Envoy: `retry_policy` with `retry_on: connect-failure,refused-stream,unavailable`, `num_retries: 2`, plus `retry_back_off`. Add **circuit breaking** (Envoy `circuit_breakers` max connections/pending/retries, or outlier detection ejecting a 5xx-storming host) so retries don't amplify load against a struggling backend into a full meltdown.
+7. **Sticky sessions only when state truly demands it.** Cookie/affinity routing (`sticky cookie`, Envoy hash policy) pins a client to one backend — necessary for in-memory session state, fatal for even load balancing and graceful drain (a drained node's clients all break). **First fix the state**: move sessions to Redis/JWT so any backend serves any user, then drop stickiness. Only keep it for unavoidable backend-local state, and pair it with consistent hashing so losing one node reshuffles minimally.
+8. **Make reloads zero-drop (graceful drain).** A naive restart cuts in-flight connections → user-visible 5xx during every deploy.
+   - **nginx:** `nginx -t && nginx -s reload` — the master spins up new workers on the new config and lets old workers finish in-flight requests before exiting. Never `kill -9` / hard restart for a config change.
+   - **HAProxy:** start the new process with `-sf $(cat /path/to/haproxy.pid)` (it signals the old process to finish in-flight work and then exit), or use the master-worker socket reload.
+   - **Envoy:** hot restart / xDS push drains the old listener.
+   - For removing a **backend**: first mark it `down`/drain in the pool and reload so the proxy stops sending *new* requests, wait for in-flight to finish, then stop the backend. Tie the backend's shutdown to its readiness probe (fail `/healthz` → proxy ejects → then SIGTERM) so the LB drains it before it dies.
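The hysteresis described in step 4 (eject after 3 fails, re-add after 2 passes) can be sketched as a small state machine. This is an illustration of the counting logic only, not how any particular proxy implements it; `HealthTracker` and its thresholds are hypothetical names that mirror the `fall`/`rise` style parameters found in HAProxy-like configs.

```typescript
class HealthTracker {
  private healthy = true;
  private streak = 0; // consecutive probe results that contradict the current state

  constructor(
    private readonly fallThreshold = 3, // failures needed to eject
    private readonly riseThreshold = 2, // passes needed to re-admit
  ) {}

  // Record one probe result and return whether the backend is in rotation.
  record(passed: boolean): boolean {
    const contradicts = passed !== this.healthy;
    this.streak = contradicts ? this.streak + 1 : 0;
    const needed = this.healthy ? this.fallThreshold : this.riseThreshold;
    if (this.streak >= needed) {
      this.healthy = !this.healthy;
      this.streak = 0;
    }
    return this.healthy;
  }
}
```

Because a single contradicting result never flips the state, a node that alternates pass/fail stays where it is instead of oscillating in and out of rotation.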
+## Common Errors
+- **`proxy_read_timeout` shorter than the slowest real response.** Long uploads/reports hit a **504** even though the backend is healthy. Set the read timeout above the legitimate p99, and only then chase a slow endpoint separately.
+- **Backend keepalive shorter than the proxy's upstream idle window.** Backend closes an idle socket the proxy then reuses → sporadic **502** under no real load. Make backend `keepalive_timeout` longer than the proxy's, and set `proxy_http_version 1.1` + `Connection ""`.
+- **No `keepalive` in the upstream block.** Every request does a fresh TCP (and TLS) handshake to the backend — latency and CPU explode under load. Add `keepalive N` and clear the `Connection` header.
+- **Retrying non-idempotent requests.** `proxy_next_upstream` including `non_idempotent` (or an Envoy `retry_on` that catches POSTs) silently double-executes writes on a timeout → duplicate charges/orders. Retry safe methods + connect failures only.
+- **Health check that only pings the port / returns 200 unconditionally.** A node with a dead DB stays in rotation and 500s every request. Probe real dependencies in `/healthz`.
+- **`ip_hash` / non-consistent hashing for affinity.** Removing or adding one node reshuffles *every* client to a new backend, blowing caches and sessions. Use `consistent` hashing.
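+- **Trusting client-supplied `X-Forwarded-For`/`X-Forwarded-Proto`.** The app sees spoofed client IPs or thinks plaintext is HTTPS. Overwrite these headers at the trust boundary (`proxy_set_header X-Forwarded-For $remote_addr;`, `proxy_set_header X-Forwarded-Proto $scheme;`). `$proxy_add_x_forwarded_for` appends to whatever the client sent, so use it only behind a proxy you already trust; never pass the raw inbound value through at the edge.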
+- **Hard restart on config change.** `systemctl restart nginx` / `kill -9` drops in-flight connections every deploy. Use `reload` / `-sf` graceful paths.
+- **Stopping a backend before draining it.** Killing an instance while the LB still routes to it = a burst of 5xx for its in-flight requests. Drain (fail readiness → eject) first, then SIGTERM.
+- **Default Envoy 15s route timeout left implicit.** Long-running requests get cut at 15s with no obvious cause. Set `route.timeout` explicitly per route.
+- **Single proxy = single point of failure.** One LB box and the whole service is down when it dies or reloads badly. Run ≥2 behind a VIP/anycast/keepalived or a managed LB.
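To make the consistent-hashing point concrete, here is a minimal hash ring with virtual nodes. It is a sketch of the general technique, not nginx's or Envoy's actual implementation; FNV-1a is chosen only because it is short, and `HashRing` with 100 replicas per node is an arbitrary illustrative setup.

```typescript
// FNV-1a 32-bit hash; any stable hash would do for this sketch.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class HashRing {
  private points: { hash: number; node: string }[] = [];
  private readonly replicas: number;

  constructor(nodes: string[], replicas = 100) {
    this.replicas = replicas;
    for (const node of nodes) this.add(node);
  }

  add(node: string): void {
    for (let i = 0; i < this.replicas; i++) {
      this.points.push({ hash: fnv1a(`${node}#${i}`), node });
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }

  remove(node: string): void {
    this.points = this.points.filter((p) => p.node !== node);
  }

  lookup(key: string): string {
    if (this.points.length === 0) throw new Error("ring is empty");
    const h = fnv1a(key);
    // First point clockwise from the key's hash; wrap around if there is none.
    const hit = this.points.find((p) => p.hash >= h) ?? this.points[0];
    return hit.node;
  }
}
```

Removing a node deletes only that node's points, so keys that mapped to other backends keep their assignment; only the removed node's keys move. A linear scan in `lookup` keeps the sketch short; a binary search would be the usual choice at scale.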
+## Verify
+1. **Config is valid before reload:** `nginx -t` (or `haproxy -c -f`, `envoy --mode validate`, `caddy validate`) returns OK. Never reload an unvalidated config.
+2. **Balancing works:** fire `N` requests (`hey`, `vegeta`, `for i in $(seq 100); do curl -s .../whoami; done`) and confirm responses spread across all backends per the chosen algorithm (e.g. roughly even for round-robin).
+3. **Dead-backend reroute, zero 5xx:** kill one backend mid-load. Traffic must reroute to healthy nodes and the client must see **no 5xx** (passive: a brief blip until `max_fails`; active: none). The killed node returns to rotation after it's healthy again.
+4. **Timeouts behave:** point at a backend that sleeps longer than `proxy_read_timeout` → you get **504** at the configured time, not earlier/later. A backend refusing connections → **502** (not a retry storm).
+5. **Retries are idempotent-only:** a timed-out `GET` retries to a second backend (one served response); a timed-out `POST` does **not** double-execute (assert the write happened exactly once at the backend).
+6. **Zero-drop reload:** run sustained load (`vegeta attack -rate=200 -duration=60s`), trigger a config `reload` mid-run, and confirm **0 connection errors / 0 non-2xx** attributable to the reload in the report.
+7. **TLS + forwarded headers:** `curl -v https://host` negotiates TLS1.2/1.3; the backend logs the real client IP (`X-Real-IP`) and sees `X-Forwarded-Proto: https`; `:80` 301-redirects to `:443`.
+8. **Drain before stop:** mark a backend down, confirm new requests stop hitting it while in-flight ones complete, *then* stop it — no 5xx in the transition.
+Done = killing a backend reroutes with **zero 5xx**, timeouts produce the right code at the right time, idempotent-only retries never double-write, and a config reload under sustained load drops **zero** in-flight connections — all with a validated config and ≥2 proxies (no single point of failure).
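The idempotent-only retry rule that Verify item 5 exercises can be written down as a pure decision function, which is handy as a reference when reviewing a proxy's retry config. This is a sketch under stated assumptions: the failure categories, the `shouldRetry` name, and the `circuitOpen` flag are hypothetical, and the idempotent set follows the HTTP semantics definition rather than any one proxy's list.

```typescript
type FailureKind = "connect-failure" | "timeout" | "http-502" | "http-503" | "http-500";

interface RetryContext {
  method: string;
  failure: FailureKind;
  attempt: number;      // 1 for the first try
  circuitOpen: boolean; // true once the backend pool is considered down
}

const IDEMPOTENT_METHODS = new Set(["GET", "HEAD", "OPTIONS", "PUT", "DELETE", "TRACE"]);

function shouldRetry(ctx: RetryContext, maxAttempts = 2): boolean {
  if (ctx.circuitOpen) return false;            // do not amplify load on a failing pool
  if (ctx.attempt >= maxAttempts) return false; // total tries are capped
  // A failed connect means no request bytes reached a backend, so any method is safe.
  if (ctx.failure === "connect-failure") return true;
  if (!IDEMPOTENT_METHODS.has(ctx.method.toUpperCase())) return false;
  return ctx.failure === "timeout" || ctx.failure === "http-502" || ctx.failure === "http-503";
}
```

A timed-out `POST` is never retried here because the backend may already have applied the write; an application-level idempotency key would be needed before relaxing that.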