npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.1 - Mend

sanook-cli 0.4.0 → 0.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (238)

package/.env.example +19 -0
package/CHANGELOG.md +173 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +405 -57
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +21 -7
package/dist/providers/keys.js +3 -2
package/dist/providers/models.js +22 -6
package/dist/providers/registry.js +155 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +228 -31
package/dist/ui/banner.js +4 -9
package/dist/ui/brain-wizard.js +2 -2
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/render.js +55 -15
package/dist/ui/setup.js +97 -12
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/build-realtime-channel/SKILL.md ADDED

@@ -0,0 +1,122 @@
+---
+name: build-realtime-channel
+description: Builds realtime push channels over WebSocket/SSE — auth-on-connect, heartbeat/zombie eviction, topic subscribe/publish with per-topic authz and presence, sequence-numbered resume for missed-message recovery, client reconnect with backoff+jitter, and a Redis/NATS pub/sub backplane with send-buffer limits for horizontal scale.
+when_to_use: Adding live updates (chat, notifications, live dashboards, collaborative cursors, feeds), choosing WebSocket vs SSE vs long-poll, or fixing a channel that drops messages, leaks connections, thunders on reconnect, or can't scale past one server. Distinct from message-queue-jobs (durable server-to-server work queues) and manage-client-server-state (client cache/refetch, not the transport).
+---
+## When to Use
+Reach for this skill when the request is about **pushing live data to clients over a long-lived connection**:
+- "Push notifications / chat messages / order updates to the browser in realtime"
+- "Build a live dashboard / activity feed / collaborative cursors that updates without polling"
+- "Should this be WebSocket, SSE, or long-poll?"
+- "Our socket drops messages after a reconnect" / "clients miss updates while disconnected"
+- "Connections leak — server FD count climbs and never drops" / "zombie sockets pile up"
+- "On deploy every client reconnects at once and melts the box" (thundering herd)
+- "Realtime works on one node but breaks behind a load balancer / can't scale out"
+NOT this skill:
+- Durable server-to-server work queues, retries, dead-letter, exactly-once job processing → message-queue-jobs (this skill is at-most/at-least-once *push to clients*, not a job system)
+- Native mobile push via APNs/FCM, device-token registration, woken-from-killed delivery → implement-push-notifications (OS push to a closed app; this skill is an open in-app socket/stream)
+- Client-side cache, refetch, optimistic UI, query invalidation → manage-client-server-state (that's what the client *does* with pushed data; this is the wire)
+- Issuing/validating the token itself, refresh, session rotation → auth-jwt-session (this skill *consumes* a token on connect, it doesn't mint it)
+- Capping how many connects/messages a client may send → rate-limiting
+- Races inside your fan-out/handler code (shared mutable state, missing await) → async-concurrency-correctness
+- Metrics/tracing/log wiring for the channel → observability-instrument
+## Steps
+1. **Pick the transport by directionality — do not default to WebSocket.** Most "realtime" needs are server→client only.
+   | Transport | Use when | Cost / caveat |
+   |---|---|---|
+   | **SSE** (`text/event-stream`) | Server→client only (feeds, notifications, dashboards, token streaming). **Default for one-way.** | One HTTP/1.1 conn per stream (use HTTP/2 to avoid 6-conn cap); auto-reconnect + `Last-Event-ID` built in; no binary |
+   | **WebSocket** | True bidirectional, low-latency, high message rate (chat, presence, games, collaborative editing) | Manual heartbeat/reconnect/resume; no auto-resume; proxies/LBs need explicit `Upgrade` support |
+   | **Long-poll** | Fallback only when SSE/WS are blocked (ancient proxy, locked-down corp net) | High overhead, ~1 msg per round trip; keep as graceful degradation, not primary |
+   Rule: one-way → **SSE**; bidirectional or >~10 msg/s/client → **WebSocket**; long-poll only as fallback. Don't hand-roll if a maintained lib fits — `Socket.IO` (built-in reconnect+rooms+fallback), `Phoenix Channels` (presence+backpressure+cluster PubSub out of the box), `Centrifugo` (standalone server, history/recovery built in), `SignalR` (.NET). Raw `ws`/SSE only when you need full control and will build lifecycle yourself.
+2. **Authenticate on connect — never in the query string.** A token in `?token=...` lands in access logs, proxy logs, and `Referer`. Validate *before* upgrading.
+   - **WebSocket:** pass the token via the `Sec-WebSocket-Protocol` subprotocol header, or require an authenticated **cookie** (sent automatically on the upgrade), or accept an unauthenticated socket and require an `auth` frame as the **first message** within a short deadline (≤5s) or close.
+   - **SSE:** EventSource can't set headers — use a same-site auth **cookie**, or a short-lived single-use ticket fetched over a normal authed request then passed once.
+   - On bad/expired token: WS close code **`4401`** (app range; reserve `4403` for authz failure), SSE respond **`401`** before the stream opens. Re-check token expiry on long-lived sockets; close when it lapses.
+   ```js
+   // WS upgrade with subprotocol-carried token (client)
+   new WebSocket("wss://api.example.com/ws", ["bearer", token]);
+   // server: read token from Sec-WebSocket-Protocol, verify, then accept (echo the protocol)
+   ```
+3. **Run the full connection lifecycle — this is where leaks live.**
+   - **Heartbeat:** WS — server sends `ping` every **30s**, expects `pong`; if 2 missed (60s), terminate the socket (a half-open TCP conn looks alive to the OS but is dead). SSE — emit a comment line `:keep-alive\n\n` every 15–30s so proxies don't idle-close. `ws` clients that never `pong` are zombies; an `isAlive` flag flipped false on each ping and reset on pong evicts them.
+   - **Zombie eviction:** sweep on an interval; `socket.terminate()` (not `.close()`) anything that failed the heartbeat. Track open connections in a registry so you can count and reap them.
+   - **Graceful drain on deploy:** on `SIGTERM`, stop accepting new connections, send a `going_away` app message (so clients reconnect *staggered*, not instantly), then close with code **`1001`** after a grace window. Never `kill -9` a live channel node — every client stampedes back at once.
+4. **Model subscriptions as topics with per-topic authz, and add presence.** A connection is not a subscription. Let a client subscribe to named topics/channels (`chat:room:42`, `user:7:notifications`) over one socket.
+   - **Authorize every subscribe** against the *current* user — a connection authed as user 7 must not subscribe to `user:9:*`. Check on subscribe, not just on connect; deny with an error frame, don't silently drop.
+   - **Namespace topics** so wildcards can't leak (`org:{id}:...`). Reject subscribe to topics the user can't read.
+   - **Presence:** maintain a per-topic set of members in the backplane (Redis `SET`/hash keyed by topic, member = `{userId, connId}`); broadcast `join`/`leave` on change. Tie membership to the connection so a dropped socket auto-removes the member (TTL-backed, refreshed by heartbeat — otherwise a crashed client lingers as "online" forever).
+5. **Make missed-message recovery explicit with sequence numbers + a resume cursor — decide the delivery guarantee up front.** Default to **at-least-once + client dedup**, not "best effort."
+   - Stamp every message per-topic with a monotonic **`seq`** (and an event `id`). Keep a bounded **history buffer** per topic (e.g. last N=1000 or last 5 min) in Redis (`XADD` to a stream, or a capped list).
+   - On (re)subscribe the client sends its **last seen `seq`** (`resume_from`); server replays buffered events `> resume_from` then switches to live. SSE gets this for free: the browser auto-sends **`Last-Event-ID`** on reconnect — honor it and replay.
+   - If the gap exceeds the buffer, send a **`reset`/snapshot-required** signal so the client refetches full state instead of silently missing data.
+   - Guarantee table — pick one and document it:
+   | Guarantee | Mechanism | Cost |
+   |---|---|---|
+   | Best-effort (at-most-once) | Fire-and-forget, no buffer | Drops on any disconnect — only for ephemeral (live cursor pos) |
+   | **At-least-once + dedup** | seq + history buffer + resume cursor; client drops `seq ≤ lastSeen` | **Default.** Bounded buffer mem; client must dedup |
+   | Exactly-once *delivery* | Don't. Use at-least-once + idempotent client apply | True E2E exactly-once is a distributed-systems tax you don't need |
+6. **Reconnect on the client with backoff + jitter, then resubscribe and dedup.** A fixed-delay or zero-delay reconnect loop is how one deploy becomes a self-DDoS.
+   ```js
+   // exponential backoff, full jitter, cap 30s — applies to WS and SSE-with-manual-reconnect
+   let attempt = 0;
+   function reconnect() {
+     const base = Math.min(30000, 1000 * 2 ** attempt++);
+     const delay = Math.random() * base;          // full jitter — spreads the herd
+     setTimeout(connect, delay);
+   }
+   // on open: attempt = 0; resubscribe all topics with resume_from=lastSeq[topic];
+   //          drop any replayed event whose seq <= lastSeq[topic] (dedup)
+   ```
+   Reset the attempt counter on a *successful* open, resubscribe every topic with its own `resume_from`, and dedup replayed events by `seq`. Stop reconnecting on a fatal close code (`4401`/`4403`) — don't hammer a server that rejected your auth.
+7. **Scale horizontally with a pub/sub backplane + per-connection backpressure.** A second node means a publish on node A must reach a subscriber on node B.
+   - **Stateless + backplane (preferred):** each node holds its own sockets; publishes go to **Redis Pub/Sub**, **Redis Streams**, or **NATS**; every node subscribes and fans out to its local sockets for that topic. No sticky sessions needed for WS (the socket stays pinned to one node by TCP anyway); for SSE/long-poll across nodes you still need either a backplane or sticky routing.
+   - **Sticky sessions:** only needed for handshake-split transports (Socket.IO long-poll→WS upgrade must hit the same node) — set LB affinity or force `transports: ['websocket']`. Prefer stateless+backplane over relying on stickiness.
+   - **Backpressure / slow-consumer:** a client that reads slower than you write balloons the per-socket send buffer and OOMs the node. Cap it: watch `ws.bufferedAmount` (or your lib's queue depth); if it exceeds a threshold (e.g. 1–4 MB), **drop the slow consumer** (close `1013`/`4408`) rather than buffer unboundedly. For SSE, the same applies to the response stream's write backpressure. One slow consumer must never degrade the rest.
+8. **Load + soak test, then observe.** Single-connection tests prove nothing about a channel (see Verify). Wire metrics (open conns, msgs/s, send-buffer high-water, reconnect rate, dropped-slow-consumers) via observability-instrument.
+## Common Errors
+- **Token in the query string.** `?token=...` leaks into access/proxy logs and `Referer`. Use subprotocol header, auth cookie, or a first-frame `auth` message.
+- **No heartbeat → silent half-open sockets.** TCP keepalive defaults to ~2h; a dead peer looks connected for hours. App-level ping/pong (30s) + terminate on miss is mandatory.
+- **`.close()` instead of `.terminate()` on a zombie.** `close()` waits for a close handshake the dead peer will never send, so the FD lingers. Terminate failed-heartbeat sockets.
+- **Unbounded per-socket send buffer.** One slow/paused client grows `bufferedAmount` until the node OOMs and takes down *every* connection. Cap buffer; drop the slow consumer.
+- **No jitter on reconnect.** All clients backoff on the same schedule and reconnect in lockstep after an outage/deploy — synchronized thundering herd. Add full jitter.
+- **Instant/zero-delay reconnect loop.** A server that closes on every connect gets hammered thousands of times/sec. Always backoff; stop on fatal auth close codes.
+- **Treating a connection as a subscription / authz only on connect.** A long-lived socket can request any topic later; authorize every `subscribe` against the current user, namespace topics, deny cross-tenant.
+- **No sequence numbers → "messages disappear" after reconnect.** Without `seq` + history + resume there's no way to recover the gap; clients silently miss data. Stamp seq, buffer, replay from cursor (or `Last-Event-ID`).
+- **In-memory subscriptions/presence with >1 node.** A publish on node A never reaches node B's subscribers; presence shows half the users. Use a Redis/NATS backplane; back presence with a TTL'd shared store.
+- **Presence that never clears.** Membership tied to a clean disconnect only — a crashed client stays "online" forever. TTL the entry, refresh on heartbeat.
+- **No graceful drain on deploy.** Killing a node drops every socket simultaneously and they all reconnect at once. SIGTERM → stop accepts → `going_away` (staggered reconnect) → close `1001`.
+- **Relying on sticky sessions to "fix" scale.** Stickiness papers over a missing backplane and breaks the moment a node dies (all its clients fail over to a node that doesn't know their topics). Make nodes stateless + backplane; use stickiness only for handshake-split transports.
+- **SSE without keep-alive comments.** Idle proxies/LBs close a quiet `text/event-stream` after their timeout. Emit `:keep-alive` every 15–30s.
+## Verify
+1. **Auth on connect:** connect with no/expired/forged token → rejected before the stream opens (WS close `4401`, SSE `401`); token never appears in `nginx`/access logs. Cross-tenant `subscribe` → denied with an error frame, not silently dropped.
+2. **Heartbeat + zombie eviction:** `tc`/`iptables`-drop a client's traffic (simulate half-open) → server detects via missed pong within ~60s and `terminate()`s it; server open-connection gauge returns to baseline (no FD leak). Re-run 100× in a loop — count must not climb.
+3. **Missed-message recovery:** subscribe, record `seq`, kill the client, publish 50 events, reconnect with `resume_from`/`Last-Event-ID` → client receives exactly those 50 in order, zero gaps, zero dupes after dedup. Exceed the buffer → client gets a `reset`/snapshot signal (not a silent gap).
+4. **Reconnect storm (thundering herd):** connect 5–10k clients, restart/kill the node → reconnects spread over the backoff window (full-jitter histogram, not a spike); server stays up; all topics resubscribed. With zero jitter this test must visibly fail (then pass after adding jitter).
+5. **Horizontal fan-out:** run ≥2 nodes behind an LB; a subscriber on node B receives a message published to node A → proves the backplane works, not just one box. Kill node A mid-stream → its clients fail over to node B and resume from cursor with no lost messages.
+6. **Slow-consumer isolation:** one client stops reading (pause the socket) while others stay live → the slow one is dropped (`bufferedAmount` cap hit, close `1013`/`4408`); all other clients keep flowing with no latency spike; node memory stays flat.
+7. **Graceful drain:** send the node `SIGTERM` under load → clients get `going_away` then close `1001`, reconnect staggered to another node, miss zero messages (resume covers the gap).
+8. **Load/soak:** drive target concurrent connections at peak msg/s with a WS/SSE load tool (`k6` `ws`/SSE, `artillery`, `vegeta` for SSE) for ≥30 min → p99 delivery latency within budget; open-conn count, memory, and send-buffer high-water are flat (no leak/creep).
+Done = auth-on-connect (no query-string token), heartbeat-driven zombie eviction with flat FD/conn count under churn, sequence-based resume recovers every missed message with no dupes, reconnect uses backoff+jitter, fan-out works across ≥2 nodes via the backplane, slow consumers are dropped without affecting others, and a ≥30-min soak shows no leak.
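
Note on step 3 of build-realtime-channel (illustration only, not part of the package contents): the `isAlive` heartbeat sweep described there could look roughly like the sketch below. It assumes socket-like objects that expose `ping()` and `terminate()`, as `ws` sockets do. The helper names `trackConnection`, `onPong`, and `sweep` are hypothetical. This version evicts a socket at the first sweep that finds an unanswered ping, so a silent peer is reaped within two sweep intervals.

```javascript
// Registry-based heartbeat sweep. `registry` is a Set of socket-like objects.
// Assumption: each socket has ping() and terminate(); isAlive is our own flag.
function trackConnection(registry, socket) {
  socket.isAlive = true;
  registry.add(socket);
}

// Call this from the socket's "pong" handler.
function onPong(socket) {
  socket.isAlive = true;
}

// Run on an interval (the skill suggests 30s). Returns how many sockets were evicted.
function sweep(registry) {
  let evicted = 0;
  for (const socket of registry) {
    if (!socket.isAlive) {
      // No pong since the previous sweep: treat as a zombie and hard-close it.
      socket.terminate();
      registry.delete(socket);
      evicted++;
      continue;
    }
    // Mark as unconfirmed, then ping; a pong flips the flag back before the next sweep.
    socket.isAlive = false;
    socket.ping();
  }
  return evicted;
}
```

With a real server this would be driven by something like `setInterval(() => sweep(registry), 30000)`, and `registry.size` doubles as the open-connection gauge that Verify item 2 expects to return to baseline.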

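Note on step 5 of build-realtime-channel (illustration only, not part of the package contents): a minimal sketch of sequence-numbered resume with a bounded history buffer and client-side dedup. The skill keeps the buffer in Redis; the in-memory `TopicHistory` class and the `makeDedup` helper here are hypothetical stand-ins, and the `reset` result corresponds to the snapshot-required signal.

```javascript
// Hypothetical in-memory stand-in for one topic's history buffer.
class TopicHistory {
  constructor(capacity = 1000) {
    this.capacity = capacity;
    this.events = [];  // oldest first
    this.nextSeq = 1;  // monotonic per topic
  }

  publish(data) {
    const event = { seq: this.nextSeq++, data };
    this.events.push(event);
    if (this.events.length > this.capacity) this.events.shift(); // keep the buffer bounded
    return event;
  }

  // Replay everything after the client's cursor, or signal a reset when the
  // gap has already fallen out of the buffer.
  resume(resumeFrom) {
    const lastSeq = this.nextSeq - 1;
    if (resumeFrom >= lastSeq) return { type: "replay", events: [] }; // already caught up
    const oldest = this.events.length ? this.events[0].seq : this.nextSeq;
    if (resumeFrom + 1 < oldest) return { type: "reset" }; // client must refetch full state
    return { type: "replay", events: this.events.filter((e) => e.seq > resumeFrom) };
  }
}

// Client side: accept an event only if it is newer than the last one applied.
function makeDedup() {
  let lastSeen = 0;
  return (event) => {
    if (event.seq <= lastSeen) return false; // duplicate delivered by a replay
    lastSeen = event.seq;
    return true;
  };
}
```

A client would keep one dedup function per topic and send its last applied `seq` as `resume_from` on resubscribe. On a `reset` result it would discard local state and refetch a snapshot.
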
package/skills/build-vector-search/SKILL.md ADDED

@@ -0,0 +1,131 @@
+---
+name: build-vector-search
+description: Builds semantic/vector search — pick an embedding model + dimensionality (and whether to truncate Matryoshka dims) and the matching distance metric (cosine/dot/L2, normalize to unit length so cosine == dot and IP is correct), an ANN index with the recall/latency/memory tradeoff understood (HNSW M/efConstruction/efSearch for low-latency RAM-resident; IVF-PQ nlist/nprobe/PQ for billion-scale compressed; flat/exact for <100k) in pgvector/Qdrant/Milvus/FAISS/Pinecone, chunking + overlap + per-chunk metadata for filtering, HYBRID retrieval fusing BM25 + dense by Reciprocal Rank Fusion (RRF, k≈60) not score addition, a cross-encoder/Cohere reranker over the top-50→k, correct pre-filter-vs-ANN interaction (filterable HNSW, not post-filter that starves k), and offline eval with recall@k / nDCG@10 / MRR against a labeled qrels set. Quantize (scalar/PQ) only after measuring recall loss; tune efSearch/nprobe to a recall target, not a guess.
+when_to_use: Building or tuning the embedding + vector-index + retrieval-quality core — choosing an embedding model/dim/metric, sizing/tuning an HNSW or IVF-PQ index for a recall@k target, adding hybrid (BM25+vector via RRF) or a reranker, fixing pre-filtering that tanks recall, or running a recall@k/nDCG eval. Distinct from rag-pipeline (the full retrieve-augment-generate app — prompt assembly, grounding, citations, hallucination control; this skill is the retrieval engine it embeds) and design-search-index-infra (the lexical/inverted-index + cluster topology + zero-downtime reindex infra; this skill owns the embedding model, distance metric, ANN params, and relevance eval rather than shard/analyzer/capacity design).
+---
+## When to Use
+Reach for this skill when the task is the **quality and mechanics of vector retrieval itself** — embeddings, the ANN index, hybrid/rerank, and measuring relevance:
+- "Pick an embedding model + dimensionality + distance metric for semantic search"
+- "Our ANN search misses obvious matches" / "tune HNSW/IVF for recall@10 without blowing latency"
+- "pgvector / Qdrant / Milvus / FAISS / Pinecone — which index and what parameters?"
+- "Add hybrid search (BM25 + vector) and a reranker" / "results are semantically close but wrong-ranked"
+- "Filtering by metadata returns too few results / wrong ones" (pre-filter vs ANN)
+- "How do I know retrieval got better?" → recall@k / nDCG / MRR eval
+- "Quantize to fit in RAM" / "embeddings cost/latency too high"
+NOT this skill:
+- The end-to-end **retrieve→augment→generate** app — prompt assembly, context packing, grounding, citations, hallucination control → rag-pipeline (this skill is the retrieval core it calls; tune retrieval here, wire the LLM there)
+- **Lexical/inverted-index** search infra — Elasticsearch/OpenSearch analyzers & mappings, shard/replica topology, capacity sizing, alias-based zero-downtime reindex → design-search-index-infra (it owns BM25 analyzer config + cluster ops; this skill owns the embedding model, metric, ANN params, and relevance eval)
+- Measuring **LLM answer** quality (faithfulness, answer correctness, LLM-as-judge) → llm-eval-harness (this skill evals *retrieval* — recall@k/nDCG — not generation)
+- Cutting **embedding/inference** cost & latency at the model/serving layer (batching, caching, model size) → optimize-llm-cost-latency
+- The BM25/keyword half as a standalone full-text feature with no vectors → design-search-index-infra
+- Picking a document/KV store schema unrelated to vectors → model-nosql-data; relational schema for the metadata table → design-relational-schema
+- Profiling the corpus before indexing (length distribution, dupes, language mix) → profile-dataset
+## Steps
+1. **Pick the embedding model, dimensionality, and distance metric together — they're coupled.** Don't default to `text-embedding-ada-002` (legacy). 2025-2026 strong choices:
+   | Model | Dim | Notes |
+   |---|---|---|
+   | OpenAI `text-embedding-3-large` | 3072 (truncatable to 256/1024) | Matryoshka — truncate then **re-normalize**; strong general |
+   | OpenAI `text-embedding-3-small` | 1536 (truncatable) | cheap, good baseline |
+   | Cohere `embed-v3` / `embed-v4` | 1024 | has `input_type` (query vs document) — use it |
+   | `BAAI/bge-large-en-v1.5`, `intfloat/e5-large-v2` | 1024 | open, self-host; **require a prefix** (`query:` / `passage:`) — omitting it craters recall |
+   | `BAAI/bge-m3` | 1024 | multilingual + multi-vector |
+   | Voyage `voyage-3` | 1024 | strong retrieval, code/domain variants |
+   Rules: **embed the query and the document with the SAME model** (and the right `input_type`/prefix). Higher dim ≈ better recall but more RAM/latency — Matryoshka models let you truncate (e.g. 3072→1024) and trade recall for cost; **re-normalize after truncating**. Metric choice:
+   | Metric | Use when | pgvector op | Note |
+   |---|---|---|---|
+   | **Cosine** | text embeddings (default) | `<=>` (`vector_cosine_ops`) | direction only |
+   | **Dot / inner product** | already unit-normalized vectors | `<#>` (negative IP) | == cosine when normalized; faster |
+   | **L2 / Euclidean** | rarely for text; some image models | `<->` (`vector_l2_ops`) | magnitude matters |
+   **Normalize embeddings to unit length once at write time**, then cosine == dot and you can use the faster IP path. Pick the index opclass to match the metric: cosine ignores magnitude, but an inner-product index (`vector_ip_ops`) on un-normalized vectors silently mis-ranks because vector length leaks into the score.
+2. **Chunk with structure, sized to the model, with overlap and metadata — bad chunks cap recall before any tuning.** Defaults: ~**256–512 tokens** per chunk, **10–15% overlap** (~50–80 tokens) so a fact split across a boundary survives. Split on **semantic boundaries** (headings, paragraphs, code blocks, `RecursiveCharacterTextSplitter` by separator hierarchy) — never a blind fixed char window mid-sentence. Stamp every chunk with metadata for filtering and citation: `{doc_id, chunk_id, source, title, section, page, created_at, tenant_id, lang}`. Consider **late chunking** (embed the long context, then pool per-chunk) or a parent-document retriever (embed small, return the larger parent) when chunks lose context. One vector per chunk; keep the raw text + metadata in a payload column/store.
+3. **Choose the ANN index by corpus size and the recall/latency/memory tradeoff — there is no free lunch.**
+   | Index | Recall | Latency | Memory | Build | Use when |
+   |---|---|---|---|---|---|
+   | **Flat / exact (brute force)** | 100% | O(N) | full | none | < ~50–100k vectors, or as the recall ground-truth |
+   | **HNSW** | high | very low | **high (graph in RAM)** | slow | low-latency, RAM-resident, ≤ tens of millions |
+   | **IVF / IVF-Flat** | tunable | low | medium | fast | large, want simple recall/latency knob (`nprobe`) |
+   | **IVF-PQ / PQ** | lower (lossy) | low | **very low (compressed)** | medium | 100M–1B+, must fit RAM/budget; accept recall hit |
+   | **DiskANN / Vamana** | high | low | on-disk | slow | billion-scale, can't fit graph in RAM |
+   **HNSW knobs:** `M` (neighbors/node, 16–64; higher = better recall + more RAM), `efConstruction` (build quality, 100–400), `efSearch`/`ef` (**query-time** recall↔latency dial — raise until recall target met). **IVF knobs:** `nlist` (clusters ≈ `√N` to `4√N`), `nprobe` (clusters scanned at query — the recall↔latency dial). **PQ knobs:** `m` sub-quantizers (dim must be divisible), `nbits` (usually 8). Default to **HNSW** unless memory or scale forces IVF-PQ.
+4. **Per-store specifics — same concepts, different syntax.**
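+   - **pgvector** (Postgres): `CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);` then per-session `SET hnsw.ef_search = 100;`. IVFFlat: `WITH (lists = N)` + `SET ivfflat.probes = 10;`. **Build IVFFlat AFTER loading data** (it clusters existing rows); HNSW can be built on empty. `pgvector` ≥0.7 supports `halfvec` (16-bit) to halve size. Filter with a plain `WHERE` + a btree index: the planner either uses the btree and searches the subset exactly, or scans the ANN index and applies the filter afterwards, so check recall on selective filters (pgvector ≥0.8 adds iterative index scans via `hnsw.iterative_scan` to keep scanning until enough rows pass).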
+   - **Qdrant:** HNSW by default; set `hnsw_config` (`m`, `ef_construct`) per collection, `ef` per search via `params.hnsw_ef`. **Payload indexes** on filtered fields enable *filterable HNSW* (filter applied during graph traversal, not after). Use scalar/product quantization via `quantization_config`.
+   - **Milvus:** explicit `index_type` (`HNSW`, `IVF_FLAT`, `IVF_PQ`, `DISKANN`, `SCANN`) + `metric_type` (`COSINE`/`IP`/`L2`); search `params` = `{ef}` or `{nprobe}`. Must `load()` collection into memory before search.
+   - **FAISS** (library, no server): `IndexHNSWFlat`, `IndexIVFFlat`, `IndexIVFPQ`; **train IVF/PQ on a representative sample** before `add`; `index.nprobe = N`. Wrap with `IndexIDMap` to keep external ids. You manage persistence + metadata yourself.
+   - **Pinecone:** managed; pick `metric` at index creation (immutable), use **namespaces** for tenant isolation, `filter` in query for metadata. Serverless handles the index internals — you tune `top_k` and filters, not `M`/`nprobe`.
+5. **Pre-filter correctly — naïve post-filtering starves your k and silently drops good hits.** Three interaction modes:
+   | Mode | What happens | Risk |
+   |---|---|---|
+   | **Post-filter** (ANN then drop non-matches) | fetch top-k, remove rows failing the filter | a selective filter can leave **0–few** results; raising `k` won't reliably fix it |
+   | **Pre-filter** (filter then exact search) | filter to a subset, brute-force within it | exact but slow on large subsets |
+   | **Filterable ANN** (filter *during* graph/list traversal) | engine prunes by metadata inside HNSW/IVF | best: Qdrant payload index, Milvus filtered search; pgvector filters after the index scan, so rely on iterative scans (≥0.8) or partial/partitioned indexes there |
+   For a highly selective filter (e.g. `tenant_id = X` with few rows), **pre-filter or partition** (separate collection/namespace/partition per tenant) instead of filtering a global index. **Always index the metadata fields you filter on**; an unindexed filter forces a slow scan or weak post-filter. Test recall *with the filter applied* — unfiltered recall lies.
+6. **Add hybrid (BM25 + dense) and fuse with RRF — not score addition.** Dense embeddings miss exact terms (IDs, codes, rare names, acronyms); BM25/keyword catches them. Run both retrievers, take each result's **rank**, and fuse with **Reciprocal Rank Fusion**:
+   ```
+   RRF_score(d) = Σ_retrievers 1 / (k + rank_r(d))      # k ≈ 60
+   ```
+   RRF is rank-based, so you **don't have to normalize** the wildly different BM25 vs cosine score scales (raw weighted score-sum is the classic bug — one scale dominates). Native support: Qdrant `Query` API with `Fusion.RRF`, Elasticsearch/OpenSearch `rrf` retriever, Milvus `RRFRanker`, Weaviate hybrid `fusionType`. In pgvector, run a BM25/`tsvector` (or ParadeDB `pg_search`) query and a vector query, then fuse in SQL. Hybrid typically beats either alone on heterogeneous corpora.
+7. **Rerank the shortlist with a cross-encoder — it fixes the ordering bi-encoders get wrong.** Retrieve a wide net (top **50–100** by RRF), then rerank to the final **k (5–10)** with a cross-encoder that scores (query, doc) jointly: **Cohere `rerank-v3.5`**, `BAAI/bge-reranker-v2-m3`, or a `cross-encoder/ms-marco-MiniLM` model. Cross-encoders are far more accurate but **O(candidates)** per query — only ever run them over the shortlist, never the whole index. Reranking usually buys more nDCG than squeezing the ANN, and it's where "semantically close but mis-ranked" gets fixed. Budget the extra ~50–300ms.
+8. **Eval with a labeled set — tune to a recall TARGET, never by eyeballing.** Build qrels: a set of queries each with known-relevant `doc_id`s (mine from clicks/logs, or hand-label 50–200). Metrics:
+   | Metric | Measures | When |
+   |---|---|---|
+   | **recall@k** | did the relevant doc make the top-k at all | the **ANN/retrieval** gate — most important for RAG (can't rerank what you didn't retrieve) |
+   | **MRR** | rank of the *first* relevant hit | single-answer / "find the doc" |
+   | **nDCG@10** | graded relevance + position | multi-relevant, ranking quality (post-rerank) |
+   | **precision@k** | fraction of top-k relevant | when noise in context hurts |
+   Compute **exact (flat) search as the recall=100% ground truth**, then measure your ANN's recall@k against it — that's how you set `efSearch`/`nprobe`: raise it until recall@k hits target (e.g. 0.95), then stop (latency grows past it). Re-run the suite on every model/chunk/param change. Tools: `ranx`, BEIR, `pytrec_eval`, or a small custom harness.
+9. **Quantize only after measuring the recall loss — it's a memory/latency win that costs accuracy.** Options: **scalar quantization** (float32→int8, ~4× smaller, small recall loss — good default), **binary quantization** (1-bit, ~32× smaller, big loss — only with a rescoring/oversampling pass), **PQ** (product quantization, tunable, needs training). Pattern: quantize for the fast first pass, then **rescore the top candidates with full-precision vectors** (Qdrant `rescore`, Milvus refine) to recover recall. Measure recall@k before/after on the eval set — never ship a silent quality drop. Also: store the original embedding model + dim in metadata so a model upgrade triggers a full re-embed (you can't mix embedding spaces).
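The RRF formula from step 6 can be sketched in a few lines of TypeScript. This is a minimal illustration, assuming each retriever returns a list of document ids ordered best-first; `rrfFuse` and the sample ids are hypothetical names, not part of any library mentioned above.

```typescript
// Reciprocal Rank Fusion over ranked id lists (best hit first in each list).
// `rrfFuse` is an illustrative helper, not a library API.
function rrfFuse(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // the formula uses 1-based ranks
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((x, y) => y[1] - x[1]);
}

const bm25 = ["doc-7", "doc-2", "doc-9"];  // keyword ranking (made-up ids)
const dense = ["doc-2", "doc-4", "doc-7"]; // vector ranking (made-up ids)
const fused = rrfFuse([bm25, dense]);
console.log(fused.map(([id]) => id)); // [ 'doc-2', 'doc-7', 'doc-4', 'doc-9' ]
```

Only ranks enter the sum, so the BM25 and cosine score scales never need to be reconciled; documents found by both retrievers rise above documents found by one.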
+## Common Errors
+- **Embedding query and documents with different models (or wrong `input_type`/prefix).** Vectors live in different spaces → garbage similarity. Fix: same model both sides; set Cohere `input_type`, E5 `query:`/`passage:` prefixes, and the BGE query instruction.
+- **Metric/opclass mismatch or un-normalized vectors with IP.** Cosine ignores magnitude, but IP on un-normalized vectors ≠ cosine, so an inner-product index mis-ranks them. Fix: normalize to unit length at write time, pick the matching opclass (`vector_cosine_ops` etc.).
+- **Tuning by feel instead of to a recall target.** Picking `efSearch`/`nprobe` "that seems fine" hides recall cliffs. Fix: exact search as ground truth, raise the knob until recall@k ≥ target, then stop.
+- **Post-filtering a selective metadata filter.** ANN returns k, the filter drops most → too few/empty results. Fix: filterable ANN (payload/`WHERE` index) or pre-filter/partition per tenant.
+- **Weighted score-sum hybrid instead of RRF.** BM25 and cosine scales differ wildly; one dominates. Fix: fuse by rank with RRF (k≈60) — no score normalization needed.
+- **Building an IVFFlat index before loading data.** It clusters on existing rows; empty → degenerate. Fix: load data, then build IVFFlat (HNSW is fine on empty).
+- **No overlap / mid-sentence chunking.** Facts split across boundaries become unretrievable. Fix: 10–15% overlap, split on semantic boundaries.
+- **Reranking the whole index.** Cross-encoders are O(N) per query → unusable latency. Fix: rerank only the top-50–100 shortlist.
+- **Quantizing without measuring.** Silent recall drop in prod. Fix: measure recall@k before/after; add a full-precision rescore pass.
+- **Mixing embedding spaces after a model upgrade.** New and old vectors are incomparable. Fix: store model+dim in metadata; re-embed the whole corpus on upgrade.
+- **HNSW out-of-memory at scale.** The graph is RAM-resident; tens of millions × high `M` × float32 blows the budget. Fix: lower `M`, scalar-quantize, or switch to IVF-PQ / DiskANN.
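The normalization errors above are easy to reproduce. The sketch below, with hypothetical helper names and toy vectors, shows that inner product and cosine disagree on raw vectors, agree after unit-normalizing, and that a Matryoshka-style truncation has to be re-normalized.

```typescript
// Scale a vector to unit length (L2 norm = 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Matryoshka-style truncation: keep the first `dim` components, then re-normalize.
function truncateAndNormalize(v: number[], dim: number): number[] {
  return l2Normalize(v.slice(0, dim));
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const rawA = [3, 4, 0, 12]; // toy vectors, not real embeddings
const rawB = [1, 2, 2, 0];
// On raw vectors the inner product (11) and the cosine (about 0.282) disagree.
console.log(dot(rawA, rawB), cosine(rawA, rawB));
// After normalizing once at write time they match, up to float rounding.
const unitA = l2Normalize(rawA);
const unitB = l2Normalize(rawB);
console.log(Math.abs(dot(unitA, unitB) - cosine(rawA, rawB)) < 1e-12); // true
// A truncated prefix is no longer unit length until it is re-normalized.
console.log(truncateAndNormalize(rawA, 2)); // close to [0.6, 0.8]
```

Normalizing at write time is what makes the faster inner-product path safe; skipping the re-normalize after truncation quietly reintroduces the magnitude problem.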
+## Verify
+1. **Metric/normalization correct:** vectors are unit-normalized; the index opclass matches the metric; a known query returns its known-relevant doc as a top hit.
+2. **Same-model invariant:** grep the pipeline — query and document embeddings use the identical model + correct `input_type`/prefix.
+3. **Recall measured against exact search:** flat/brute-force gives the ground truth; ANN recall@k is computed and meets target (e.g. ≥0.95) at the chosen `efSearch`/`nprobe`, with latency recorded.
+4. **Filter recall holds:** run the eval **with the production metadata filter applied**; recall doesn't collapse (no post-filter starvation), and selective filters use pre-filter/partition.
+5. **Hybrid fuses by RRF:** BM25 and dense both contribute; fusion is rank-based (RRF k≈60), and hybrid recall@k ≥ either retriever alone on the eval set.
+6. **Rerank improves nDCG, not latency-killing:** cross-encoder runs over the top-50–100 only; nDCG@10 improves vs pre-rerank; added latency is within budget.
+7. **Chunking sound:** chunks are 256–512 tokens with 10–15% overlap on semantic boundaries, each carrying filter/citation metadata; a boundary-straddling fact is retrievable.
+8. **Quantization is net-positive:** recall@k before/after quantization is measured; any drop is recovered by a full-precision rescore pass and is within tolerance.
+9. **Index choice fits scale/memory:** the index type (flat/HNSW/IVF-PQ/DiskANN) matches corpus size and the RAM budget; HNSW graph fits in memory or a compressed index was chosen.
+Done = query and documents share one normalized embedding model with a matching distance metric/opclass, the ANN index is chosen for the corpus's scale/latency/memory budget and tuned to a measured recall@k target against exact search, hybrid retrieval fuses BM25 + dense by RRF, a cross-encoder reranks the shortlist, metadata filtering uses filterable/pre-filter (not post-filter starvation), and every change is validated by the recall@k / nDCG / MRR eval in checks 3–8.
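Check 3 (recall measured against exact search) can be sketched as a small harness. This is an illustration under stated assumptions: vectors are already unit-normalized so the dot product ranks like cosine, the "ANN result" is a hand-written stand-in for whatever your index returns, and all names are hypothetical.

```typescript
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Exact (flat) search: score every vector and return the ids of the k best.
function exactTopK(query: Vec, corpus: Map<string, Vec>, k: number): string[] {
  return Array.from(corpus.entries())
    .map(([id, vec]) => ({ id, score: dot(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((hit) => hit.id);
}

// recall@k = |ANN top-k ∩ exact top-k| / |exact top-k|
function recallAtK(annIds: string[], exactIds: string[]): number {
  if (exactIds.length === 0) throw new Error("ground truth is empty");
  const truth = new Set(exactIds);
  return annIds.filter((id) => truth.has(id)).length / exactIds.length;
}

const corpus = new Map<string, Vec>([
  ["a", [1, 0]],
  ["b", [0.8, 0.6]],
  ["c", [0.6, 0.8]],
  ["d", [0, 1]],
]);
const groundTruth = exactTopK([1, 0], corpus, 2); // ["a", "b"]
const annResult = ["a", "c"]; // stand-in for an ANN query that missed "b"
console.log(groundTruth, recallAtK(annResult, groundTruth)); // [ 'a', 'b' ] 0.5
```

In a real eval you would average `recallAtK` over the whole query set and re-run it while raising `efSearch`/`nprobe` until the average reaches the target.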

package/skills/compose-local-dev-stack/SKILL.md ADDED

@@ -0,0 +1,149 @@
+---
+name: compose-local-dev-stack
+description: Wires a local multi-service development stack with Docker Compose — app plus backing datastores (Postgres/Redis/Kafka), dependency-ordered healthchecks (depends_on condition: service_healthy), pinned images and named volumes, seed/init scripts, hot-reload bind mounts, profiles, and one-command up/down/reset via a Makefile.
+when_to_use: An app needs real local backing services (db, cache, queue) and "start everything" is fragile, slow, or undocumented. Not the dev container the editor runs in (setup-devcontainer-env), not the shippable app image (dockerfile-optimize), not cluster deployment (k8s-manifest-review).
+---
+## When to Use
+Reach for this when the request is about **standing up the app's runtime dependencies on a laptop**, reproducibly, with one command:
+- "Get Postgres + Redis + the API running locally so I can develop"
+- "The onboarding doc says `docker compose up` but it races / half the services aren't ready"
+- "Add Kafka (or a queue, or a second DB) to the local stack"
+- "I want hot reload — edit code, see it without rebuilding the image"
+- "Seed the dev database automatically" / "give me a clean-slate reset command"
+NOT this skill:
+- The container the editor/agent itself runs inside (devcontainer.json, features, VS Code attach) → setup-devcontainer-env
+- Shrinking/hardening the **production** image (multi-stage, distroless, non-root) → dockerfile-optimize
+- Deploying these services to a cluster (Deployments, probes, resource limits, Helm) → k8s-manifest-review
+- Pinning the *host* language/tool versions (node/python/go via asdf/mise/`.tool-versions`) → pin-toolchain-versions
+- A schema change's lock/data-loss safety → db-migration-safety (this skill only *runs* migrations on start)
+## Steps
+1. **One file, services as the unit. Pin every tag, name every volume.** Floating `:latest` makes the stack non-reproducible and breaks silently on pull; bare anonymous volumes orphan and lose data on `down`. Use `compose.yaml` (the modern name — drop the `version:` key, it's obsolete):
+   ```yaml
+   name: myapp
+   services:
+     db:
+       image: postgres:16.4-alpine            # pin minor; never :latest
+       environment:
+         POSTGRES_USER: app
+         POSTGRES_PASSWORD: app
+         POSTGRES_DB: app
+       volumes:
+         - pgdata:/var/lib/postgresql/data     # named -> survives `down`
+         - ./db/init:/docker-entrypoint-initdb.d:ro  # runs ONCE on empty volume
+       healthcheck:
+         test: ["CMD-SHELL", "pg_isready -U app -d app"]
+         interval: 3s
+         timeout: 3s
+         retries: 20
+         start_period: 5s
+       ports: ["5432:5432"]
+     redis:
+       image: redis:7.4-alpine
+       command: ["redis-server", "--save", "", "--appendonly", "no"]  # ephemeral cache
+       healthcheck:
+         test: ["CMD", "redis-cli", "ping"]
+         interval: 3s
+         timeout: 2s
+         retries: 20
+     app:
+       build: { context: ., target: dev }       # dev stage, not prod
+       command: ["npm", "run", "dev"]            # hot-reload command, overrides Dockerfile CMD
+       depends_on:
+         db:    { condition: service_healthy }   # waits for healthcheck, not just "started"
+         redis: { condition: service_healthy }
+       environment:
+         DATABASE_URL: postgres://app:app@db:5432/app   # use service name, not localhost
+         REDIS_URL: redis://redis:6379
+       volumes:
+         - ./src:/app/src                        # bind mount -> edits reflect live
+         - /app/node_modules                     # anon vol masks host node_modules
+       ports: ["3000:3000"]
+   volumes:
+     pgdata:
+   ```
+2. **Order startup with `depends_on: condition: service_healthy` — never bare `depends_on`.** Bare `depends_on` only waits for the container to *start*, not to be *ready*; the app then connects to a Postgres still replaying WAL and crash-loops. The gate is the **healthcheck on each backing service**. Pick the right probe per service:
+   | Service | Healthcheck test | Why not just TCP |
+   |---|---|---|
+   | Postgres | `pg_isready -U $USER -d $DB` | port opens before it accepts queries |
+   | MySQL | `mysqladmin ping -h localhost` | same early-port problem |
+   | Redis | `redis-cli ping` → `PONG` | trivial, do it |
+   | Kafka (KRaft) | `kafka-broker-api-versions --bootstrap-server localhost:9092` | broker advertises before it serves metadata |
+   | RabbitMQ | `rabbitmq-diagnostics -q ping` | mgmt port lies about readiness |
+   | Elasticsearch | `curl -fsS localhost:9200/_cluster/health?wait_for_status=yellow` | green never comes single-node |
+   | App migrations | a one-shot `migrate` service the app `depends_on` (condition: `service_completed_successfully`) | keeps schema setup off the app's hot path |
+   Tune `retries × interval ≥ real cold-start time` (Kafka/ES need `start_period: 20s`+) or healthy never arrives and the dependents abort.
+3. **Seed once via `docker-entrypoint-initdb.d`; run migrations every start via a one-shot service.** The init dir (`*.sql`/`*.sh`, alphabetical) runs **only when the data volume is empty** — perfect for extensions, roles, and static seed (`01-schema.sql`, `02-seed.sql`). It does **not** re-run after the volume exists, so never put evolving migrations there. Migrations belong in a dedicated short-lived service the app waits on:
+   ```yaml
+     migrate:
+       build: { context: ., target: dev }
+       command: ["npm", "run", "migrate:deploy"]   # or: alembic upgrade head / flyway migrate
+       depends_on: { db: { condition: service_healthy } }
+       restart: "no"
+   ```
+   Then set `app.depends_on.migrate.condition: service_completed_successfully`. Idempotent migration tools make this safe to run on every `up`.
+4. **Hot reload = bind mount source + a dev `command` + a watcher, not a rebuild.** Bind `./src:/app/src` and run the dev server (`npm run dev`/`uvicorn --reload`/`air`/`nodemon`). If you bind the whole project root (`.:/app`) instead, mask installed deps with an **anonymous volume** (`- /app/node_modules`) so the host's empty/mismatched dir doesn't shadow the image's; with a `./src`-only bind the mask is harmless but not strictly needed. Build the image from a **`dev` stage** (`target: dev`) that includes dev deps and the watcher, and keep the lean prod stage for shipping (that's dockerfile-optimize's job). Changing `package.json`/`requirements.txt` still needs a rebuild; code does not.
+5. **Split config: committed `compose.yaml` + `.env` + an uncommitted `compose.override.yml`.** Compose **auto-merges** `compose.override.yml` on top of `compose.yaml` with no `-f` flag — put local-only tweaks there (extra port bindings, mounted debug tools, `DEBUG=1`) and gitignore it so teammates' hacks don't collide. Variables interpolate from `.env` (committed `.env.example`, real `.env` gitignored). Never hardcode host-specific ports or paths in the base file.
+6. **Gate optional services behind `profiles`.** Tag heavy/rarely-needed services (Kafka, a second DB, mailhog, a metrics stack) with `profiles: ["kafka"]` so a plain `docker compose up` starts only the core stack. Opt in with `docker compose --profile kafka up`. Keeps the default path fast; a service with no `profiles` always runs.
+7. **Use the default network and talk service-to-service by name; publish only the host ports you need.** Compose gives you a default bridge network where services resolve each other by **service name** (`db`, `redis`) — the app must use `db:5432`, never `localhost:5432` (localhost inside the app container is the app). Publish stable host ports (`5432:5432`) only for tools you run on the host (psql, a GUI). Collisions with a host Postgres → remap the **host** side (`5433:5432`), never the container side.
+8. **Make one-command verbs in a `Makefile` (or `Taskfile.yml`) so nobody memorizes flags.** `up` must block until healthy; `reset` must wipe volumes:
+   ```makefile
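+   .PHONY: up down reset logs ps   # verbs, not files; a local `logs/` dir would otherwise shadow the target
+   up:      ## start core stack, wait until healthy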
+   	docker compose up -d --wait
+   down:    ## stop, keep data
+   	docker compose down
+   reset:   ## stop AND wipe volumes -> clean slate
+   	docker compose down -v --remove-orphans
+   	docker compose up -d --wait
+   logs:    ## tail everything
+   	docker compose logs -f --tail=100
+   ps:
+   	docker compose ps
+   ```
+   `--wait` makes `up` exit non-zero if any service never goes healthy — that's your machine-checkable gate. `down -v` is the *only* thing that deletes data; keep it on `reset` alone so `down` is always safe.
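The tuning rule in step 2 (`retries × interval` must cover the real cold-start time) is simple arithmetic, sketched here in TypeScript. The formula is an approximation that adds `start_period` and ignores probe timeouts; the interface, function names, and the cold-start figures are assumptions for illustration only.

```typescript
interface Healthcheck {
  intervalSec: number;
  retries: number;
  startPeriodSec: number;
}

// Rough upper bound on how long a service may take before it is marked unhealthy:
// failures inside start_period are not counted, after which `retries` consecutive
// failed probes (one per interval) fail the check. Probe timeouts are ignored.
function readinessBudgetSec(hc: Healthcheck): number {
  return hc.startPeriodSec + hc.retries * hc.intervalSec;
}

function coversColdStart(hc: Healthcheck, coldStartSec: number): boolean {
  return readinessBudgetSec(hc) >= coldStartSec;
}

// Values from the Postgres service in step 1.
const postgresCheck: Healthcheck = { intervalSec: 3, retries: 20, startPeriodSec: 5 };
console.log(readinessBudgetSec(postgresCheck)); // 65

// Hypothetical too-tight check for a slow broker with a 40s cold start.
const tightCheck: Healthcheck = { intervalSec: 3, retries: 5, startPeriodSec: 0 };
console.log(coversColdStart(tightCheck, 40)); // false: 15s budget < 40s cold start
```

Measure the real cold start once (for example from container start to first healthy status) and size `start_period` and `retries` so the budget clears it with some margin.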
+## Common Errors
+- **Bare `depends_on:` (list form).** Waits for container *start*, not readiness; the app races the DB and crash-loops on cold boot. Use the map form with `condition: service_healthy`.
+- **No `healthcheck` on a backing service.** Then `service_healthy` has nothing to gate on and Compose errors or treats it as instantly up. Every service you depend-on needs a real probe (table in step 2).
+- **App connects to `localhost` instead of the service name.** `localhost` inside the app container is the app itself — connection refused. Use `db`/`redis`/`kafka` (the service names) in `DATABASE_URL`/`REDIS_URL`.
+- **Anonymous/missing volume on a datastore.** `docker compose down` orphans the anonymous volume and the next `up` starts empty; data "randomly" vanishes. Always name datastore volumes and declare them under `volumes:`.
+- **Expecting `docker-entrypoint-initdb.d` to re-run.** It runs **only on an empty data volume**. Edited a seed file and "nothing happened"? The volume already exists — `docker compose down -v` (or `make reset`) to re-init. Don't put live migrations there.
+- **`start_period` too short for Kafka/Elasticsearch.** They take 20–60s to be ready; with the default `start_period: 0s` and few retries, healthy never arrives and dependents abort. Set `start_period: 30s` and enough `retries`.
+- **`:latest` / unpinned tags.** A teammate pulls a newer Postgres major, the on-disk data format changes, and Postgres refuses to start on the existing volume. Pin to a minor tag (`postgres:16.4-alpine`).
+- **Host port already in use (`bind: address already in use`).** A host Postgres or a previous stack holds 5432. Remap the host side only (`5433:5432`); changing the container side breaks the `db:5432` address the other services connect to.
+- **Host `node_modules`/`venv` shadowing the image's via the source bind mount.** App can't find deps or loads wrong-arch binaries. Add the anonymous-volume mask (`- /app/node_modules`) *after* the source bind.
+- **Secrets committed in `compose.yaml`.** Real credentials in the base file leak to git. Keep them in the gitignored `.env`; commit only `.env.example` with placeholders.
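The `localhost` error above can be caught at app startup instead of at the first failed query. A minimal sketch, assuming the app reads connection URLs from environment variables as in step 1; `requireServiceHost` is a hypothetical helper, and it relies only on the standard `URL` class.

```typescript
// Hosts that resolve to the container itself when used inside a Compose service.
const LOOPBACK_HOSTS = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Returns the host of a connection URL, throwing if it points at loopback.
function requireServiceHost(envName: string, rawUrl: string): string {
  const host = new URL(rawUrl).hostname;
  if (LOOPBACK_HOSTS.has(host)) {
    throw new Error(`${envName} points at ${host}; use the Compose service name instead`);
  }
  return host;
}

console.log(requireServiceHost("DATABASE_URL", "postgres://app:app@db:5432/app")); // db
console.log(requireServiceHost("REDIS_URL", "redis://redis:6379")); // redis
```

A guard like this only makes sense for code that runs inside the stack; host-side tools such as `psql` legitimately use `localhost` with the published port.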
+## Verify
+1. **Cold up from nothing:** `make reset` (wipes), then `make up`. The command must **exit 0** — `--wait` fails the command if any service is unhealthy. `docker compose ps` shows every core service `running (healthy)`.
+2. **Ordering held:** check `docker compose logs migrate` / app — the app started its first DB query *after* `db` was healthy and migrations completed, with **no** connection-refused retries in the log.
+3. **Seeded:** `docker compose exec db psql -U app -d app -c "select count(*) from <seeded_table>;"` returns the expected non-zero count without any manual step.
+4. **Hot reload:** with the stack up, edit a source file under the bind mount → the app reloads and serves the change **without** `docker compose build` or restart.
+5. **Reachability:** a host tool hits the published port (`psql -h localhost -p 5432 -U app`), and the app reaches the DB **by service name** (no `localhost` in its config).
+6. **Reset is clean:** `make reset` recreates the stack and the seeded count from step 3 matches again (volume truly wiped and re-init'd, not stale).
+7. **Profiles:** plain `docker compose up -d --wait` starts only core services; `--profile kafka up` additionally starts the gated ones; `docker compose ps` confirms each case.
+8. **`down` is safe:** `make down` then `make up` preserves data (row count unchanged); only `make reset` resets it.
+Done = `make reset && make up` exits 0 with every service `healthy`, the DB is auto-seeded, a source edit hot-reloads without a rebuild, the app talks to backing services by name, and `make reset` reproducibly returns the stack to a clean seeded state.

package/skills/configure-bundler-build/SKILL.md ADDED

@@ -0,0 +1,166 @@
+---
+name: configure-bundler-build
+description: Configures and optimizes the JS/TS build toolchain — tsconfig plus a bundler (Vite/esbuild/Rollup/tsup/webpack) — for correct module output (ESM/CJS/dual + types), code splitting, tree-shaking, sourcemaps, env injection, and fast incremental builds.
+when_to_use: Setting up or fixing how an app or library compiles and bundles — wrong module format, broken tree-shaking, missing/incorrect types, slow builds, tsconfig errors. Distinct from dockerfile-optimize (container images) and optimize-core-web-vitals (browser runtime metrics).
+---
+## When to Use
+Reach for this skill when the problem is **how source compiles and emits**, not how it runs in a browser or container:
+- "Set up the build for this app/library" (pick bundler, tsconfig, output format)
+- "My library ships ESM but breaks in a CJS `require()`" (or vice versa) — dual-package output
+- "Consumers get `Could not find a declaration file`" — missing/mislocated `.d.ts`
+- "Tree-shaking isn't dropping unused exports" — dead code in the bundle
+- "`tsc`/`vite build` is slow" — switch transform to esbuild/swc, add a persistent cache
+- "`define`/`import.meta.env` isn't replacing my env var" or a secret leaked into the client bundle
+- tsconfig errors: `module`/`moduleResolution` mismatch, `"x.js" has no exported member`, paths not resolving
+NOT this skill:
+- Shrinking the runtime container image, multi-stage Docker layers → dockerfile-optimize
+- LCP/INP/CLS, lazy-loading images, render-blocking JS in the browser → optimize-core-web-vitals
+- Cross-package build orchestration, workspace topo build order, Turbo/Nx pipelines → setup-monorepo-tooling
+- `npm publish`, `files`/`publishConfig`, provenance, version bump → publish-package-registry
+- ESLint/Prettier/pre-commit wiring → setup-lint-format-precommit
+- Pinning the Node/pnpm/tsc *versions* themselves (engines, `.nvmrc`, Volta) → pin-toolchain-versions
+## Steps
+1. **Pick the bundler by build target — do not default to webpack.**
+   | Target | Use | Why |
+   |---|---|---|
+   | **App** (SPA/SSR, has an entry HTML or framework) | **Vite** | Rollup-based prod build, esbuild dev, HMR, code-splitting out of the box |
+   | **Library** (published to npm, consumers bundle it) | **tsup** (esbuild) or **Rollup** | dual ESM+CJS + `.d.ts` in one config; Rollup when you need fine-grained chunking |
+   | **Node tool / CLI / serverless fn** (single self-run entry) | **esbuild** | fastest, bundle deps in, `--platform=node`, no chunk graph needed |
+   | Legacy app needing module federation / exotic loaders | webpack | only when a Vite/Rollup plugin doesn't exist |
+   Default: **app → Vite, library → tsup, node-tool → esbuild.** One tool emits JS; **`tsc` emits types** (or `tsup --dts` / `vite-plugin-dts` wraps it). Never run `tsc` as the bundler for shipping code — it doesn't bundle, tree-shake, or split.
+2. **Set the tsconfig essentials — `moduleResolution` is the #1 footgun.** Pick the resolution mode by who resolves modules:
+   | Scenario | `module` | `moduleResolution` |
+   |---|---|---|
+   | Bundler handles resolution (Vite/tsup/esbuild) | `ESNext` (or `Preserve`) | `bundler` |
+   | Node runs the output directly (Node ESM/CJS) | `NodeNext` | `nodenext` |
+   ```jsonc
+   // tsconfig.json — app/library baseline
+   {
+     "compilerOptions": {
+       "target": "ES2022",          // match your lowest runtime; don't ship ES5 needlessly
+       "lib": ["ES2022", "DOM"],    // drop "DOM" for node-only code
+       "module": "ESNext",
+       "moduleResolution": "bundler",
+       "strict": true,
+       "skipLibCheck": true,
+       "esModuleInterop": true,
+       "isolatedModules": true,     // required: esbuild/swc compile file-by-file
+       "verbatimModuleSyntax": true,// makes `import type` explicit — kills accidental value imports
+       "declaration": true,         // emit .d.ts (libraries)
+       "declarationMap": true,      // go-to-definition into your source
+       "sourceMap": true,
+       "outDir": "dist",
+       "paths": { "@/*": ["./src/*"] }
+     }
+   }
+   ```
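+   `paths` are a **type-level alias only** as far as `tsc` is concerned: it does not rewrite them in emitted JS, so whatever resolves modules at build or run time must know them too. Vite needs `resolve.alias` or the `vite-tsconfig-paths` plugin; esbuild (and tsup on top of it) reads tsconfig `paths` when bundling; plain Node execution needs something like `tsconfig-paths`.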
+3. **For a library, emit dual ESM+CJS with a correct `exports` map — the `exports` map is the contract, file extensions are the proof.** tsup config:
+   ```ts
+   // tsup.config.ts
+   import { defineConfig } from "tsup";
+   export default defineConfig({
+     entry: ["src/index.ts"],
+     format: ["esm", "cjs"],   // → index.js (esm) + index.cjs
+     dts: true,                // → index.d.ts (+ .d.cts for cjs types)
+     sourcemap: true,
+     treeshake: true,
+     clean: true,
+     target: "node18",
+     external: [/^node:/],     // never bundle node builtins
+   });
+   ```
+   ```jsonc
+   // package.json — types condition MUST come first in each block
+   {
+     "type": "module",
+     "exports": {
+       ".": {
+         "import": { "types": "./dist/index.d.ts",  "default": "./dist/index.js"  },
+         "require": { "types": "./dist/index.d.cts", "default": "./dist/index.cjs" }
+       },
+       "./package.json": "./package.json"
+     },
+     "main": "./dist/index.cjs",      // legacy fallback for old resolvers
+     "module": "./dist/index.js",
+     "types": "./dist/index.d.ts",
+     "sideEffects": false,
+     "files": ["dist"]
+   }
+   ```
+   Keep `peerDependencies` (react, etc.) in `external` so you don't bundle two copies into the consumer.
+4. **App: split code with dynamic `import()`, then control chunks deliberately.** Route-level `const Page = lazy(() => import('./Page'))` and `import('heavy-lib')` create async chunks automatically. Pull stable vendor deps into their own long-cached chunk:
+   ```ts
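+   // vite.config.ts
+   import { defineConfig } from "vite";
+   export default defineConfig({
+     build: {
+       sourcemap: true,
+       rollupOptions: {
+         output: {
+           manualChunks: { vendor: ["react", "react-dom"] }, // or a function for finer control
+         },
+       },
+       chunkSizeWarningLimit: 500, // kB; warn when a chunk exceeds this
+     },
+   });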
+   ```
+   Don't over-split (HTTP/2 helps, but hundreds of tiny chunks add request + parse overhead). Split on real route/feature boundaries, not per-file.
+5. **Enable tree-shaking — it only works on static ESM.** Author with `import`/`export` (no `require`, no `module.exports`); CJS interop defeats it. Mark the package `"sideEffects": false` (or list the few files with real side effects, e.g. `["**/*.css", "./src/polyfill.ts"]`) so the bundler may drop unused modules. Annotate top-level calls that look impure but aren't with `/*#__PURE__*/`:
+   ```ts
+   export const icon = /*#__PURE__*/ createIcon(path); // droppable if `icon` is unused
+   ```
+   A `"sideEffects": false` lie (a module that *does* mutate global state on import) causes silently-missing behavior — list those files.
+6. **Inject env at build time via `define`/`import.meta.env` — never bake a secret.** Static replacement only:
+   ```ts
+   // vite: only VITE_* are exposed to client; access via import.meta.env.VITE_API_URL
+   // esbuild/tsup: define: { "process.env.NODE_ENV": JSON.stringify("production") }
+   ```
+   **Anything bundled for the browser is public.** Put API keys/DB URLs behind a server route or read them at runtime on the server (`process.env`) — a `define`'d secret is grep-able in `dist/`. Gate dev-only code behind `if (import.meta.env.DEV)` / `process.env.NODE_ENV !== "production"` so it tree-shakes out of prod.
+7. **Always emit sourcemaps; choose by environment.** `sourcemap: true` (full, external `.map`) for libraries and CI artifacts. For a public web app, ship `hidden` sourcemaps (in Vite, `build.sourcemap: "hidden"`): the `.map` files are still generated but the bundle carries no `sourceMappingURL` reference, so upload them to your error tracker and keep them off the public host. Stack traces then de-minify without exposing source to every visitor. Never ship `eval` or inline sourcemaps in production.
+8. **Make builds fast and incremental.** Use an esbuild/swc transform (Vite and tsup already do) instead of `ts-loader`/`babel` for the JS transform; it is often 10–100× faster. Keep type-checking **out of the bundle path**: run `tsc --noEmit` as a separate step (in dev, a parallel `tsc --noEmit --watch` next to the dev server) so a type error doesn't block fast iteration but still gates CI. Turn on the persistent cache (Vite caches in `node_modules/.vite`; for `tsc -b` use `incremental: true` + `tsBuildInfoFile`). Add `--metafile=meta.json` (esbuild) / `rollup-plugin-visualizer` to find what's bloating the bundle.
+9. **Verify the output shape before declaring done** (see Verify) — `publint` + `@arethetypeswrong/cli` catch the dual-package and types-resolution bugs that don't surface until a consumer installs you.
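Step 4 notes that `manualChunks` can also be a function. A minimal sketch of that form follows, assuming the bundler passes each module's resolved id as a string and that returning `undefined` leaves the module to default chunking. The names `vendorChunks` and `STABLE_VENDOR` are hypothetical, and the package list is only an example.

```typescript
// Hypothetical list of stable, rarely-changing dependencies to cache together.
const STABLE_VENDOR = new Set<string>(["react", "react-dom"]);

// Returns "vendor" for modules that belong to a listed package,
// and undefined for everything else (default chunking applies).
function vendorChunks(id: string): string | undefined {
  const path = id.replace(/\\/g, "/"); // normalize Windows separators
  const marker = "/node_modules/";
  const at = path.lastIndexOf(marker); // last match handles nested/pnpm layouts
  if (at === -1) return undefined; // application code
  const parts = path.slice(at + marker.length).split("/");
  const first = parts[0] ?? "";
  // Scoped packages ("@scope/name") span two path segments.
  const pkg = first.startsWith("@") ? `${first}/${parts[1] ?? ""}` : first;
  return STABLE_VENDOR.has(pkg) ? "vendor" : undefined;
}
```

It would be wired in as `manualChunks: vendorChunks` in place of the object form. Matching on the exact package name, rather than a substring, keeps packages such as `react-router` out of the vendor chunk unless they are listed.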
+## Common Errors
+- **`moduleResolution: node` (an alias of `node10`) with modern packages.** It ignores the `exports` map, so `exports`-map-only packages fail to resolve. Use `bundler` (a bundler resolves imports) or `nodenext` (Node resolves them); avoid the legacy `node`/`node10` and `classic` modes.
+- **`types` condition placed last in the `exports` map.** TS reads conditions top-down and takes the first match; if `import`/`require` come before `types`, the consumer gets "no declaration file." `types` must be the **first** key in each condition block.
+- **`.cjs` file emitting `export {}` (or `.mjs` with `require`).** The `exports` map points at the wrong file per condition, or `"type": "module"` mismatches the extension. `.mjs` is always ESM and `.cjs` is always CJS; a plain `.js` file is ESM only under `"type": "module"` and CJS otherwise. Verify with `node -e "require('your-pkg')"` and a separate `node -e "import('your-pkg')"`.
+- **`"sideEffects": false` on a package that has side effects.** Tree-shaking drops a polyfill/CSS/registration import → feature silently missing in prod only. List the real side-effect files instead of a blanket `false`.
+- **Secret in a `VITE_`/`define`d var.** It's inlined into client JS and shipped to every browser. Only public values get `VITE_`/`NEXT_PUBLIC_`; secrets stay server-side at runtime.
+- **`paths` alias resolves in the editor but `Cannot find module '@/x'` at build.** `paths` only guides type resolution; `tsc` never rewrites the emitted import specifiers. The bundler needs its own alias (`vite-tsconfig-paths` or `resolve.alias`), and plain Node/ts-node runs need `tsconfig-paths`. Configure both.
+- **`isolatedModules` errors on `export { Foo }` where `Foo` is a type.** esbuild/swc compile each file alone and can't tell types from values. Use `export type { Foo }` / `import type` (enforced by `verbatimModuleSyntax`).
+- **Bundling `peerDependencies` (react, etc.) into a library.** Consumer gets two React copies → "invalid hook call." Mark peers `external`.
+- **Running `tsc` as the production bundler.** It transpiles per-file but doesn't bundle, tree-shake, split, or rewrite `paths` — output has unresolved aliases and no chunking. Use a real bundler for JS; `tsc` only for `.d.ts`.
+- **No sourcemaps in prod (or inline/eval maps).** Minified stack traces are useless; inline maps bloat the bundle and leak source. Emit external (`hidden` for public web), upload to the error tracker.
+- **Targeting `ES5`/old `lib` by reflex.** Forces heavy down-leveling and polyfills for runtimes that support modern JS. Set `target`/`lib` to your *actual* lowest runtime.
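The `types`-ordering error above can be caught mechanically. The following is a rough sketch, not a replacement for `attw` or `publint`: a hypothetical `findLateTypes` helper walks a parsed `exports` value and reports every condition block that contains `types` anywhere but first. It assumes the map consists of nested objects and strings and does not interpret array fallbacks.

```typescript
// Simplified model of an `exports` value: a path string, null, or nested conditions.
type ExportsNode = string | null | { [key: string]: ExportsNode };

// Returns the location of each block where `types` exists but is not the first key.
// Relies on JSON.parse / object literals preserving key order for string keys.
function findLateTypes(node: ExportsNode, path = "exports"): string[] {
  if (node === null || typeof node === "string") return [];
  const keys = Object.keys(node);
  const problems: string[] = [];
  if (keys.includes("types") && keys[0] !== "types") {
    problems.push(path);
  }
  for (const key of keys) {
    // Recurse into subpaths and nested condition blocks.
    problems.push(...findLateTypes(node[key] ?? null, `${path}["${key}"]`));
  }
  return problems;
}
```

A typical call would pass the `exports` field of a parsed `package.json`; an empty result means every block that declares `types` declares it first.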
+## Verify
+1. **Clean build succeeds:** `rm -rf dist && <build>` exits 0 and `dist/` contains the expected entry files (`.js`, `.cjs`, `.d.ts`, `.map`).
+2. **Types resolve both ways (library):** `npx @arethetypeswrong/cli --pack .` reports no ❌: no "masquerading as CJS/ESM", no missing types per condition.
+3. **Package shape is publishable:** `npx publint` is clean — `exports`, `main`/`module`/`types`, and file extensions all consistent.
+4. **Dual import actually loads:** in a scratch dir, `node -e "import('your-pkg').then(m=>console.log(Object.keys(m)))"` **and** `node -e "console.log(Object.keys(require('your-pkg')))"` both print the API — no `ERR_REQUIRE_ESM` / `ERR_PACKAGE_PATH_NOT_EXPORTED`.
+5. **Type-check passes independently:** `tsc --noEmit` exits 0 (proves the build path didn't skip a type error).
+6. **Tree-shaking works:** bundle a fixture importing one named export; the visualizer/`--metafile` shows unused siblings absent from output. Bundle size drops when an unused heavy import is removed.
+7. **Code-splitting present (app):** prod build emits ≥1 async chunk per lazy route, and the vendor chunk is separate from app code (check `dist/assets/`).
+8. **No secret in the bundle:** `grep -r "<a known secret substring>" dist/` returns nothing; only intended public `VITE_*`/`NEXT_PUBLIC_*` values appear.
+9. **Sourcemaps map back:** open a built file's `.map` or trigger an error — stack trace points to original `src/` lines, not minified columns.
+10. **Incremental rebuild is fast:** a one-line edit triggers a sub-second rebuild (warm cache), not a full cold compile.
+Done = clean build emits the correct module formats + types, `attw` and `publint` are clean, both `import()` and `require()` load the API, `tsc --noEmit` passes, tree-shaking and code-splitting are confirmed in the output, no secret leaked into `dist/`, and warm rebuilds are fast.
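Verify step 6 can be scripted against the file that esbuild writes with `--metafile`. The sketch below assumes the metafile's `outputs[*].inputs[*].bytesInOutput` layout and declares only that subset; `MetafileSubset` and `survivingBytes` are hypothetical names. It treats an input as tree-shaken when it is absent from every output or contributes zero bytes.

```typescript
// Subset of the esbuild metafile shape; only the fields read below are declared.
interface MetafileSubset {
  outputs: Record<string, { inputs: Record<string, { bytesInOutput: number }> }>;
}

// Sums how many bytes of `inputFile` survived across all output files.
// A result of 0 means the module was dropped (or fully eliminated).
function survivingBytes(meta: MetafileSubset, inputFile: string): number {
  let total = 0;
  for (const output of Object.values(meta.outputs)) {
    total += output.inputs[inputFile]?.bytesInOutput ?? 0;
  }
  return total;
}
```

In a fixture that imports one named export, a check would load the metafile with `JSON.parse` and fail the build when `survivingBytes(meta, "src/unused-sibling.ts")` is greater than zero; the path here is a placeholder for whichever sibling module the fixture leaves unused.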