npm - sanook-cli - Versions diffs - 0.4.0 → 0.5.1 - Mend

sanook-cli 0.4.0 → 0.5.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (238)

package/.env.example +19 -0
package/CHANGELOG.md +173 -0
package/README.md +153 -20
package/README.th.md +136 -0
package/dist/agentContext.js +4 -0
package/dist/approval.js +6 -0
package/dist/bin.js +405 -57
package/dist/brain.js +92 -59
package/dist/brand.js +47 -0
package/dist/checkpoint.js +37 -0
package/dist/commands.js +86 -6
package/dist/compaction.js +76 -5
package/dist/config.js +100 -12
package/dist/cost.js +60 -3
package/dist/doctor.js +92 -0
package/dist/gateway/auth.js +2 -2
package/dist/gateway/ledger.js +2 -2
package/dist/gateway/scheduler.js +1 -0
package/dist/gateway/serve.js +6 -4
package/dist/gateway/server.js +10 -2
package/dist/git.js +11 -2
package/dist/hooks.js +43 -17
package/dist/knowledge.js +48 -49
package/dist/loop.js +182 -66
package/dist/lsp/client.js +173 -0
package/dist/lsp/framing.js +56 -0
package/dist/lsp/index.js +138 -0
package/dist/lsp/servers.js +82 -0
package/dist/mcp-server.js +244 -0
package/dist/mcp.js +184 -29
package/dist/memory-store.js +559 -0
package/dist/memory.js +143 -29
package/dist/orchestrate.js +150 -0
package/dist/providers/codex.js +21 -7
package/dist/providers/keys.js +3 -2
package/dist/providers/models.js +22 -6
package/dist/providers/registry.js +155 -1
package/dist/repomap.js +93 -0
package/dist/search/chunk.js +158 -0
package/dist/search/embed-store.js +187 -0
package/dist/search/engine.js +203 -0
package/dist/search/fuse.js +35 -0
package/dist/search/index-core.js +187 -0
package/dist/search/indexer.js +241 -0
package/dist/search/store.js +77 -0
package/dist/session.js +42 -8
package/dist/skill-install.js +10 -10
package/dist/skills.js +12 -9
package/dist/summarize.js +31 -0
package/dist/tools/bash.js +21 -2
package/dist/tools/diagnostics.js +41 -0
package/dist/tools/edit.js +29 -7
package/dist/tools/index.js +8 -1
package/dist/tools/list.js +7 -2
package/dist/tools/permission.js +90 -9
package/dist/tools/read.js +23 -4
package/dist/tools/remember.js +1 -1
package/dist/tools/sandbox.js +61 -0
package/dist/tools/search.js +105 -4
package/dist/tools/task.js +195 -29
package/dist/tools/timeout.js +35 -0
package/dist/tools/util.js +10 -0
package/dist/tools/write.js +6 -4
package/dist/trust.js +89 -0
package/dist/ui/app.js +228 -31
package/dist/ui/banner.js +4 -9
package/dist/ui/brain-wizard.js +2 -2
package/dist/ui/history.js +30 -0
package/dist/ui/mentions.js +44 -0
package/dist/ui/render.js +55 -15
package/dist/ui/setup.js +97 -12
package/dist/ui/useEditor.js +83 -0
package/dist/update.js +114 -0
package/dist/worktree.js +173 -0
package/package.json +11 -5
package/scripts/postinstall.mjs +33 -0
package/second-brain/.agents/_Index.md +30 -0
package/second-brain/.agents/skills/_Index.md +30 -0
package/second-brain/.agents/workflows/_Index.md +30 -0
package/second-brain/AGENTS.md +4 -4
package/second-brain/Acceptance/_Index.md +30 -0
package/second-brain/Acceptance/golden-case-template.md +39 -0
package/second-brain/Areas/_Index.md +30 -0
package/second-brain/Bugs/System-OS/_Index.md +30 -0
package/second-brain/Bugs/_Index.md +30 -0
package/second-brain/CLAUDE.md +4 -1
package/second-brain/Checklists/_Index.md +30 -0
package/second-brain/Checklists/preflight-postflight-template.md +29 -0
package/second-brain/Distillations/_Index.md +30 -0
package/second-brain/Entities/_Index.md +30 -0
package/second-brain/Entities/entity-template.md +33 -0
package/second-brain/Evals/_Index.md +30 -0
package/second-brain/Evals/correction-pairs.md +24 -0
package/second-brain/Evals/failure-taxonomy.md +24 -0
package/second-brain/Evals/golden-set.md +25 -0
package/second-brain/Evals/quality-ledger.md +23 -0
package/second-brain/Evals/self-eval-rubric.md +23 -0
package/second-brain/GEMINI.md +4 -4
package/second-brain/Goals/_Index.md +30 -0
package/second-brain/Handoffs/_Index.md +30 -0
package/second-brain/Home.md +7 -0
package/second-brain/Intake/Raw Sources/_Index.md +30 -0
package/second-brain/Intake/_Index.md +30 -0
package/second-brain/Intake/_Quarantine/_Index.md +30 -0
package/second-brain/Learning/_Index.md +30 -0
package/second-brain/Playbooks/_Index.md +30 -0
package/second-brain/Playbooks/playbook-template.md +23 -0
package/second-brain/Projects/_Index.md +30 -0
package/second-brain/Prompts/_Index.md +30 -0
package/second-brain/README.md +2 -1
package/second-brain/Research/_Index.md +30 -0
package/second-brain/Retrospectives/_Index.md +30 -0
package/second-brain/Reviews/_Index.md +30 -0
package/second-brain/Runbooks/_Index.md +30 -0
package/second-brain/Runbooks/eval-loop.md +24 -0
package/second-brain/Sessions/_Index.md +30 -0
package/second-brain/Shared/AI-Context-Index.md +20 -0
package/second-brain/Shared/AI-Threads/_Index.md +30 -0
package/second-brain/Shared/Archive/_Index.md +30 -0
package/second-brain/Shared/Assets/_Index.md +30 -0
package/second-brain/Shared/Context-Packs/_Index.md +30 -0
package/second-brain/Shared/Context7-Docs/_Index.md +30 -0
package/second-brain/Shared/Coordination/NOW.md +28 -0
package/second-brain/Shared/Coordination/_Index.md +30 -0
package/second-brain/Shared/Coordination/agent-registry.md +24 -0
package/second-brain/Shared/Coordination/task-board/_Index.md +30 -0
package/second-brain/Shared/Coordination/task-board/task-template.md +43 -0
package/second-brain/Shared/Coordination/task-board.md +32 -0
package/second-brain/Shared/Core-Facts/_Index.md +30 -0
package/second-brain/Shared/Decision-Memory/_Index.md +30 -0
package/second-brain/Shared/Glossary/_Index.md +30 -0
package/second-brain/Shared/Memory-Inbox/_Index.md +30 -0
package/second-brain/Shared/Operating-State/_Index.md +30 -0
package/second-brain/Shared/Prompting/_Index.md +30 -0
package/second-brain/Shared/Provenance/_Index.md +30 -0
package/second-brain/Shared/Rules/_Index.md +30 -0
package/second-brain/Shared/Rules/contextual-note-rule.md +30 -0
package/second-brain/Shared/Rules/frontmatter-standard.md +10 -0
package/second-brain/Shared/Rules/memory-write-protocol.md +28 -0
package/second-brain/Shared/Rules/procedural-runbook-header.md +40 -0
package/second-brain/Shared/Rules/review-and-staleness-policy.md +22 -0
package/second-brain/Shared/Rules/rules-formatting.md +34 -0
package/second-brain/Shared/Scripts/_Index.md +30 -0
package/second-brain/Shared/Scripts-Archive/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/_Index.md +30 -0
package/second-brain/Shared/Tech-Standards/verification-standard.md +40 -0
package/second-brain/Shared/User-Memory/_Index.md +30 -0
package/second-brain/Shared/User-Persona/_Index.md +30 -0
package/second-brain/Shared/User-Persona/owner-profile.md +25 -0
package/second-brain/Shared/Working-Memory/_Index.md +30 -0
package/second-brain/Shared/_Index.md +30 -0
package/second-brain/Shared/mcp-servers/_Index.md +30 -0
package/second-brain/Skills/_Index.md +30 -0
package/second-brain/Templates/_Index.md +30 -0
package/second-brain/Templates/bug.md +2 -0
package/second-brain/Templates/handoff.md +2 -0
package/second-brain/Templates/session.md +2 -0
package/second-brain/Tools/_Index.md +30 -0
package/second-brain/Traces/_Index.md +30 -0
package/second-brain/Vault Structure Map.md +33 -1
package/second-brain/copilot/_Index.md +30 -0
package/skills/audit-license-compliance/SKILL.md +117 -0
package/skills/author-codemod/SKILL.md +110 -0
package/skills/build-audit-logging/SKILL.md +112 -0
package/skills/build-cdc-streaming-pipeline/SKILL.md +123 -0
package/skills/build-cli-tool/SKILL.md +108 -0
package/skills/build-data-table/SKILL.md +141 -0
package/skills/build-native-mobile-ui/SKILL.md +154 -0
package/skills/build-offline-first-sync/SKILL.md +118 -0
package/skills/build-realtime-channel/SKILL.md +122 -0
package/skills/build-vector-search/SKILL.md +131 -0
package/skills/compose-local-dev-stack/SKILL.md +149 -0
package/skills/configure-bundler-build/SKILL.md +166 -0
package/skills/configure-dns-tls/SKILL.md +142 -0
package/skills/configure-reverse-proxy-lb/SKILL.md +129 -0
package/skills/configure-security-headers-csp/SKILL.md +122 -0
package/skills/contract-testing/SKILL.md +140 -0
package/skills/datetime-timezone-correctness/SKILL.md +125 -0
package/skills/debug-ci-pipeline-failure/SKILL.md +134 -0
package/skills/debug-flaky-tests/SKILL.md +128 -0
package/skills/defend-llm-prompt-injection/SKILL.md +110 -0
package/skills/deliver-webhooks/SKILL.md +116 -0
package/skills/design-api-pagination/SKILL.md +144 -0
package/skills/design-authorization-model/SKILL.md +119 -0
package/skills/design-backup-dr-recovery/SKILL.md +113 -0
package/skills/design-event-sourcing-cqrs/SKILL.md +143 -0
package/skills/design-multi-tenancy/SKILL.md +100 -0
package/skills/design-protobuf-grpc-service/SKILL.md +146 -0
package/skills/design-relational-schema/SKILL.md +129 -0
package/skills/design-search-index-infra/SKILL.md +151 -0
package/skills/design-state-machine/SKILL.md +108 -0
package/skills/design-token-system/SKILL.md +109 -0
package/skills/distributed-locks-leases/SKILL.md +120 -0
package/skills/encrypt-sensitive-data/SKILL.md +148 -0
package/skills/feature-flags-rollout/SKILL.md +130 -0
package/skills/file-upload-object-storage/SKILL.md +107 -0
package/skills/fuzz-dynamic-security-test/SKILL.md +111 -0
package/skills/harden-llm-app-reliability/SKILL.md +126 -0
package/skills/i18n-localization-setup/SKILL.md +113 -0
package/skills/idempotency-keys/SKILL.md +107 -0
package/skills/implement-push-notifications/SKILL.md +142 -0
package/skills/ingest-webhook-secure/SKILL.md +120 -0
package/skills/integrate-oauth-oidc/SKILL.md +126 -0
package/skills/load-stress-test/SKILL.md +129 -0
package/skills/map-privacy-data-gdpr/SKILL.md +146 -0
package/skills/model-nosql-data/SKILL.md +118 -0
package/skills/money-decimal-arithmetic/SKILL.md +123 -0
package/skills/monitor-ml-drift/SKILL.md +109 -0
package/skills/numeric-precision-units/SKILL.md +144 -0
package/skills/optimize-llm-cost-latency/SKILL.md +103 -0
package/skills/optimize-react-rerenders/SKILL.md +124 -0
package/skills/orchestrate-agent-workflow/SKILL.md +100 -0
package/skills/payments-billing-integration/SKILL.md +114 -0
package/skills/pin-toolchain-versions/SKILL.md +116 -0
package/skills/plan-strangler-migration/SKILL.md +95 -0
package/skills/property-based-testing/SKILL.md +108 -0
package/skills/publish-package-registry/SKILL.md +130 -0
package/skills/recover-git-state/SKILL.md +119 -0
package/skills/remediate-web-vulnerabilities/SKILL.md +125 -0
package/skills/resilience-timeouts-retries/SKILL.md +104 -0
package/skills/resolve-merge-rebase-conflict/SKILL.md +97 -0
package/skills/rewrite-git-history/SKILL.md +109 -0
package/skills/scaffold-cross-platform-app/SKILL.md +137 -0
package/skills/schema-evolution-compatibility/SKILL.md +121 -0
package/skills/send-transactional-email/SKILL.md +126 -0
package/skills/serve-deploy-ml-model/SKILL.md +107 -0
package/skills/setup-cdn-edge-waf/SKILL.md +107 -0
package/skills/setup-devcontainer-env/SKILL.md +131 -0
package/skills/setup-lint-format-precommit/SKILL.md +140 -0
package/skills/setup-monorepo-tooling/SKILL.md +125 -0
package/skills/ship-mobile-app-store-release/SKILL.md +137 -0
package/skills/structured-output-llm/SKILL.md +86 -0
package/skills/supply-chain-sbom-provenance/SKILL.md +120 -0
package/skills/test-data-factories/SKILL.md +158 -0
package/skills/threat-model-stride/SKILL.md +123 -0
package/skills/train-evaluate-ml-model/SKILL.md +109 -0
package/skills/unicode-text-correctness/SKILL.md +109 -0
package/skills/visual-regression-testing/SKILL.md +120 -0

package/skills/design-event-sourcing-cqrs/SKILL.md ADDED

@@ -0,0 +1,143 @@
+---
+name: design-event-sourcing-cqrs
+description: Designs event-sourced and CQRS systems — past-tense immutable event schemas, aggregate boundaries with command→validate→emit→apply and expected-version optimistic concurrency, append-only per-stream event store with outbox publishing, rebuildable idempotent projections, snapshotting, and versioned upcasting for event evolution.
+when_to_use: When you need an audit-complete, replayable, append-only domain model (ledgers, order/workflow state machines, compliance) or are splitting write commands from read queries, or fixing event-sourcing pain (projection lag, frozen event shapes, slow rebuilds, lost ordering). For plain CRUD use db-migration-safety; for the messaging transport use message-queue-jobs.
+---
+## When to Use
+Reach for this skill when the domain needs **the history of changes as first-class truth**, not just the current row:
+- "We need a full audit trail / who-changed-what-when that nobody can edit after the fact"
+- "Model an order / loan / subscription as a state machine with replayable transitions"
+- "Build a ledger or balance that must reconcile to zero from its entries"
+- "Separate the write side (commands) from a denormalized read side (queries)"
+- "Time-travel: rebuild what the state *was* at any past moment"
+- Fixing existing pain: projection lag, "we can't change the shape of a 2-year-old event", multi-hour rebuilds, lost per-aggregate ordering, eventual-consistency bugs in the UI
+NOT this skill:
+- Plain CRUD with mutable rows and no replay need → **db-migration-safety** (and stop here — event sourcing is the wrong tool for simple CRUD)
+- The broker/transport that *carries* events (Kafka/SQS/RabbitMQ delivery, retries, DLQ) → **message-queue-jobs**
+- A read-only cache layer to cut DB load → **caching-strategy** (a projection is a system of record for reads; a cache is disposable)
+- Syncing offline client state with conflict resolution → **build-offline-first-sync**
+- Recording *why you chose* event sourcing as a decision → **write-adr**
+- Tuning the projection's query/index once it exists → **optimize-sql-query**
+- Wiring client UI state to the read API → **manage-client-server-state**
+## Steps
+1. **First, decide if event sourcing is even warranted — most apps should not use it.** Adopt it only when ≥1 of these is a hard requirement, and accept the listed cost:
+   | Driver (need ≥1) | Why ES wins | Cost you take on |
+   |---|---|---|
+   | Audit/compliance: immutable, complete history | Events *are* the audit log, tamper-evident | More moving parts than a table |
+   | Temporal queries / "state as of T" | Replay to any point | Rebuild + snapshot machinery |
+   | Complex state machine w/ many transitions | Each transition = one fact | Up-front modelling effort |
+   | Multiple read shapes from one write model | CQRS projections, independent scaling | Eventual consistency everywhere |
+   | Debugging by replaying real history | Deterministic reproduction | Replay must stay deterministic forever |
+   If none apply → use a normal table with CRUD and an `updated_at`; **do not event-source CRUD.** CQRS (split read/write models) is independently useful and does **not** require event sourcing — you can do CQRS over a normal DB.
+2. **Model events as immutable, past-tense facts — name them as business outcomes, never CRUD verbs.** `OrderPlaced`, `PaymentCaptured`, `FundsWithdrawn`, `ShipmentDispatched` — not `OrderUpdated`/`OrderSaved`/`SetStatus`. An event records *what happened*, is append-only, and never carries read-model concerns (no denormalized display strings, no joined names, no computed totals the reader could derive). Event payload contract:
+   ```json
+   {
+     "event_id": "uuid-v4",                 // unique; the consumer dedup key (idempotency)
+     "event_type": "FundsWithdrawn",        // past tense, business fact
+     "event_version": 1,                    // schema version of THIS type
+     "aggregate_id": "acct-9c1f",           // the stream key
+     "aggregate_type": "Account",
+     "sequence": 42,                        // per-aggregate, gap-free, monotonic = the version
+     "occurred_at": "2026-06-15T09:30:00Z", // business time captured at emit, NEVER now() in apply
+     "data": { "amount_cents": 5000, "currency": "USD" },
+     "metadata": { "causation_id": "...", "correlation_id": "...", "actor": "user-7" }
+   }
+   ```
+   Keep `data` minimal and self-contained: only facts the writer *decided*, expressed in raw value types. Put tracing/identity in `metadata`, never in `data`.
+3. **Draw aggregate boundaries = the consistency boundary, and keep them small.** An aggregate is the unit that enforces an invariant in a single transaction (e.g. "balance never goes negative"). One command mutates exactly **one** aggregate atomically. Command flow is always **load → validate → emit → apply**:
+   ```
+   handle(cmd):
+     events = load_stream(cmd.aggregate_id)        # replay history
+     state  = events.reduce(apply, initial())      # rebuild current state in memory
+     if not invariant_holds(state, cmd):           # VALIDATE against rebuilt state
+        raise Rejected(reason)                      # rejection is NOT an event
+     new = decide(state, cmd)                       # EMIT new past-tense events
+     append(cmd.aggregate_id, new,
+            expected_version = state.version)        # optimistic concurrency
+     return new
+   ```
+   Rules: validation reads only the aggregate's own rebuilt state (no cross-aggregate reads, no querying a projection to decide). Cross-aggregate consistency is achieved *eventually* via a process manager/saga reacting to events, not in one transaction. A giant aggregate ("the whole tenant") serializes all writes — split it.
+4. **Make the store append-only, ordered per stream, with expected-version concurrency.** One stream per aggregate; `sequence` is gap-free and monotonic *within a stream* (do not assume a global total order across streams). Append is a conditional insert:
+   ```sql
+   CREATE TABLE events (
+     global_position BIGSERIAL PRIMARY KEY,        -- store-wide read order for projectors/relay
+     event_id        UUID NOT NULL UNIQUE,         -- carried to broker; consumer dedup key
+     aggregate_id    TEXT NOT NULL,
+     aggregate_type  TEXT NOT NULL,
+     sequence        INT  NOT NULL,                -- per-stream version: append uses expected_version+1
+     event_type      TEXT NOT NULL,
+     event_version   INT  NOT NULL,
+     data            JSONB NOT NULL,
+     metadata        JSONB NOT NULL,
+     occurred_at     TIMESTAMPTZ NOT NULL,         -- business time, set by writer (not now())
+     recorded_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
+     UNIQUE (aggregate_id, sequence)               -- THIS enforces optimistic concurrency
+   );
+   ```
+   An aggregate's `version` == the `sequence` of its last appended event. The append SQL inserts rows with `sequence = expected_version + 1, +2, …`. The `UNIQUE(aggregate_id, sequence)` violation = a concurrent writer won the race → catch it (`23505` in Postgres), reload, re-validate, retry (or return `409 Conflict` to the caller). `event_id` must be persisted, not regenerated — it's what every downstream consumer dedupes on. **Never** `UPDATE`/`DELETE` an event row; corrections are new compensating events (`ChargeRefunded`, not a delete).
+5. **Publish via the outbox/transactional pattern — never dual-write.** Writing to the event store *and* publishing to the broker as two separate operations loses or duplicates events on crash. Instead: the event row **is** the outbox. Commit the event in the same DB transaction as the aggregate write, then a separate relay polls `events` ordered by `global_position` (or uses CDC/`LISTEN`) and pushes to the broker, tracking a high-water mark. Consumers must be idempotent (dedupe on `event_id`) because the relay guarantees **at-least-once**.
+6. **Build read models as rebuildable, idempotent projections — and surface eventual consistency.** A projection subscribes to the event stream in `global_position` order and writes a denormalized read table. Two non-negotiables:
+   - **Idempotent**: store the last processed `global_position` per projection; on replay skip anything `<=` it, and make each apply an upsert keyed by the event's natural id so re-delivery is a no-op.
+   - **Rebuildable from zero**: a projection must be reconstructable by `TRUNCATE read_table; reset checkpoint to 0; replay all`. If it can't, it's a hidden write model — fix it.
+   Reads are stale by the projection lag (ms→s). Make that explicit: return a version/`as_of` position with reads, and for read-your-writes either route the writer to a freshly-projected read or have the client wait until the projection checkpoint ≥ the position its write returned. Do not pretend the read side is synchronous.
+7. **Bound replay with snapshots — but rebuild must still work from zero without them.** When a hot aggregate has thousands of events, replaying all of them per command gets slow. Snapshot = a serialized aggregate state at a known `sequence`, stored in a separate `snapshots` table. Load = newest snapshot ≤ head, then replay only events after it. Defaults: snapshot every **N=100–500** events per aggregate, keep the latest 1–2, and treat snapshots as a **disposable cache** — they're derived, deletable, and a full rebuild from event 0 must produce byte-identical state. Never let business logic read from a snapshot that the event log couldn't reproduce.
+8. **Evolve schemas by versioning + upcasting, with lenient deserialization — you can never edit old events.** Old events are immutable history; you migrate them *on read*. Bump `event_version` for any non-additive change and register an upcaster chain that transforms `v1 → v2 → … → current` before the event reaches `apply`:
+   | Change | Safe? | How |
+   |---|---|---|
+   | Add optional field w/ default | ✅ additive | Lenient deserializer fills default; no version bump needed |
+   | Rename field | ⚠️ | Bump version; upcaster maps old→new name |
+   | Split/merge fields, change units (dollars→cents) | ⚠️ | Bump version; upcaster computes new shape |
+   | Remove a field still read by a projector | ❌ | Keep reading it via upcaster default; never drop in place |
+   | Change the *meaning* of an event type | ❌ | Introduce a **new** event type; leave the old one |
+   Deserialize leniently (ignore unknown fields, default missing ones) so a forward-deployed reader survives a slightly newer/older payload during rollout.
+9. **Detect and repair projection drift.** Projections silently diverge (a bug skipped an event, a deploy reset a checkpoint wrong). Build a reconciliation job that recomputes a checksum/aggregate from the event log and compares to the read model; on mismatch, rebuild that projection from zero (it's safe because projections are idempotent + rebuildable). A blue/green projection swap (build the new table fully, then atomically repoint reads) lets you rebuild without downtime.
+## Common Errors
+- **Event-sourcing plain CRUD.** No audit/temporal/state-machine need → you bought replay/snapshot/upcasting machinery for nothing. Use a table.
+- **CRUD-named events** (`OrderUpdated`, `EntitySaved`, `SetField`). They carry no business meaning and force readers to diff state. Name the *fact*: `OrderShipped`, `PriceReduced`.
+- **Read concerns leaking into events** — denormalized display names, joined data, computed totals. The event is now coupled to a read shape and breaks when the read model changes. Store only the writer's decided facts.
+- **Giant aggregate.** "Account" containing every transaction of every user serializes all writes and replays forever. Scope the aggregate to the smallest invariant boundary.
+- **No expected-version on append.** Two concurrent commands both read version 41 and both write 42 → lost update / broken invariant. Enforce `UNIQUE(aggregate_id, sequence)` and retry on conflict.
+- **Dual-write to store and broker.** A crash between the two loses or duplicates events. Use the outbox (the event row) + a relay; make consumers idempotent.
+- **Non-deterministic replay** — `apply` calls `now()`, `random()`, or a remote service, so rebuild ≠ original. Capture all nondeterminism *into the event* at emit time; `apply` must be a pure fold.
+- **Non-idempotent projector.** Re-delivery (at-least-once) double-counts. Track per-projection `global_position` and make applies upserts keyed by a natural id.
+- **Validating against a projection instead of the rebuilt aggregate.** The projection is stale, so the invariant check races. Always rebuild the aggregate's own state from its stream to decide.
+- **Treating rejections as events.** A failed/declined command must not append `OrderRejected` unless the *rejection itself is a meaningful business fact*; otherwise return an error — don't pollute the log.
+- **Editing or deleting old events to "fix" them.** Destroys auditability and breaks every existing projection's replay. Append a compensating event instead.
+- **Snapshot used as source of truth.** If the log can't reproduce the snapshot, a snapshot bug becomes permanent corruption. Snapshots are a disposable cache.
+- **Assuming a global event order across aggregates.** Per-stream order is guaranteed; cross-stream is not. Don't build invariants that need two streams ordered together — use a saga.
+## Verify
+1. **Round-trip determinism:** replay an aggregate's full stream twice into fresh in-memory state → byte-identical result; replaying with vs without a snapshot → identical state.
+2. **Optimistic concurrency:** fire two commands against the same aggregate at the same `expected_version` **in parallel** → exactly one commits, the other gets the `UNIQUE(aggregate_id, sequence)` violation (`23505`) surfaced as `409 Conflict` and succeeds only after reload+retry. The stream has no gap and no duplicated `sequence`.
+3. **Projection rebuild:** `TRUNCATE read_table`, reset checkpoint to 0, replay all events → read model is bit-identical to its pre-truncate state. This proves it's rebuildable, not a hidden write model.
+4. **Idempotent projector:** replay the same event slice twice → read rows and the checkpoint are unchanged after the second pass (no double counts).
+5. **Outbox at-least-once:** kill the relay mid-publish, restart → every event reaches the broker at least once, consumers dedupe on `event_id`, no event lost.
+6. **Upcasting:** feed a stored `event_version: 1` payload through the upcaster chain → it deserializes to current shape and `apply` accepts it; a lenient-deserialize test with an unknown extra field still loads.
+7. **Drift detection:** intentionally skip one event in a projection → the reconciliation checksum job flags the mismatch, and a rebuild from zero repairs it.
+8. **Eventual consistency surfaced:** a write returns a position; a read issued before the projector catches up is detectably stale (returns an older `as_of`/version), and the read-your-writes path waits for checkpoint ≥ that position.
+Done = replay is deterministic (1), concurrent appends conflict-detect with gap-free sequences (2), every projection rebuilds from zero idempotently (3,4), publishing is at-least-once with idempotent consumers (5), old event versions upcast cleanly (6), and projection drift is both detectable and auto-repairable (7) — all under parallel load, with eventual consistency made explicit to readers (8).
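Reviewer note (not part of the package): steps 3 and 4 of the added skill describe a load → validate → emit → apply command flow guarded by `UNIQUE (aggregate_id, sequence)`. The sketch below is one possible reading of that flow, not the package's implementation. It swaps Postgres for the standard-library `sqlite3` module so it runs standalone, so the conflict surfaces as `sqlite3.IntegrityError` rather than SQLSTATE `23505`. The names `rebuild`, `withdraw`, `ConcurrencyConflict`, `Rejected`, and the `FundsDeposited` event are hypothetical, and the table is trimmed to the columns the example needs.

```python
import json
import sqlite3
import uuid

# Trimmed version of the events table from step 4 (SQLite types, assumption).
SCHEMA = """
CREATE TABLE events (
    global_position INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id        TEXT NOT NULL UNIQUE,
    aggregate_id    TEXT NOT NULL,
    sequence        INTEGER NOT NULL,
    event_type      TEXT NOT NULL,
    data            TEXT NOT NULL,
    UNIQUE (aggregate_id, sequence)
);
"""


class ConcurrencyConflict(Exception):
    """Another writer already appended at the expected sequence."""


class Rejected(Exception):
    """The command violated an invariant; nothing is appended."""


def load_stream(db, aggregate_id):
    """Return the aggregate's events in per-stream order."""
    rows = db.execute(
        "SELECT sequence, event_type, data FROM events "
        "WHERE aggregate_id = ? ORDER BY sequence",
        (aggregate_id,),
    ).fetchall()
    return [(seq, etype, json.loads(data)) for seq, etype, data in rows]


def apply(state, event):
    """Pure fold step: no clock, randomness, or I/O."""
    seq, etype, data = event
    balance = state["balance"]
    if etype == "FundsDeposited":
        balance += data["amount_cents"]
    elif etype == "FundsWithdrawn":
        balance -= data["amount_cents"]
    return {"balance": balance, "version": seq}


def rebuild(db, aggregate_id):
    """Replay the stream into in-memory state; version 0 means no events yet."""
    state = {"balance": 0, "version": 0}
    for event in load_stream(db, aggregate_id):
        state = apply(state, event)
    return state


def append(db, aggregate_id, new_events, expected_version):
    """Insert events at expected_version + 1, + 2, ... in one transaction."""
    try:
        with db:  # commits on success, rolls back on any exception
            for offset, (etype, data) in enumerate(new_events, start=1):
                db.execute(
                    "INSERT INTO events "
                    "(event_id, aggregate_id, sequence, event_type, data) "
                    "VALUES (?, ?, ?, ?, ?)",
                    (str(uuid.uuid4()), aggregate_id,
                     expected_version + offset, etype, json.dumps(data)),
                )
    except sqlite3.IntegrityError as exc:
        # The unique constraint fired: a concurrent writer won the race.
        raise ConcurrencyConflict(aggregate_id) from exc


def withdraw(db, aggregate_id, amount_cents):
    """Command handler: load, validate against rebuilt state, emit, append."""
    state = rebuild(db, aggregate_id)
    if state["balance"] < amount_cents:
        raise Rejected("insufficient funds")  # a rejection is not an event
    new_events = [("FundsWithdrawn", {"amount_cents": amount_cents})]
    append(db, aggregate_id, new_events, expected_version=state["version"])
    return new_events
```

A caller would typically catch `ConcurrencyConflict`, call the handler again so that it reloads and re-validates, and give up with a conflict response after a bounded number of retries.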

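Reviewer note (not part of the package): step 8 of the skill above describes an upcaster chain plus lenient deserialization. The following is a minimal sketch of that idea under invented assumptions: the v1 payload stored `amount` in whole dollars, v2 moved to `amount_cents`, and v3 renamed `currency_code` to `currency`. None of these version changes, nor the `UPCASTERS` registry or the `deserialize` helper, come from the package; they only illustrate the technique.

```python
from dataclasses import dataclass, fields


def upcast_v1_to_v2(data):
    """Hypothetical unit change: dollars in `amount` become `amount_cents`."""
    data = dict(data)  # copy; stored events are never mutated
    data["amount_cents"] = round(data.pop("amount") * 100)
    return data


def upcast_v2_to_v3(data):
    """Hypothetical rename: `currency_code` becomes `currency`."""
    data = dict(data)
    if "currency_code" in data:
        data["currency"] = data.pop("currency_code")
    return data


# Registry keyed by (event_type, version the upcaster reads).
UPCASTERS = {
    ("FundsWithdrawn", 1): upcast_v1_to_v2,
    ("FundsWithdrawn", 2): upcast_v2_to_v3,
}
CURRENT_VERSION = {"FundsWithdrawn": 3}


def upcast(event):
    """Walk the chain v1 -> v2 -> ... -> current before `apply` sees the event."""
    etype = event["event_type"]
    version = event["event_version"]
    data = event["data"]
    while version < CURRENT_VERSION.get(etype, version):
        data = UPCASTERS[(etype, version)](data)
        version += 1
    return {**event, "event_version": version, "data": data}


@dataclass(frozen=True)
class FundsWithdrawn:
    amount_cents: int
    currency: str = "USD"  # default for a missing field (assumed default)


def deserialize(event):
    """Lenient load: ignore unknown fields, default missing optional ones."""
    data = upcast(event)["data"]
    known = {f.name for f in fields(FundsWithdrawn)}
    return FundsWithdrawn(**{k: v for k, v in data.items() if k in known})
```

Because each upcaster copies the payload, the stored row stays untouched and the migration happens only on read, which matches the skill's rule that old events are never edited in place.
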
package/skills/design-multi-tenancy/SKILL.md ADDED

@@ -0,0 +1,100 @@
+---
+name: design-multi-tenancy
+description: Architects a SaaS so many customer orgs share infrastructure without leaking into each other — picking an isolation model (shared schema + Postgres RLS, schema-per-tenant, or database-per-tenant) against an explicit cost/blast-radius/ops tradeoff table, resolving and propagating tenant context from request to DB session, and enforcing isolation in depth (app-layer query scoping PLUS RLS as the safety net) so a single forgotten tenant filter can't cross-leak. Also covers per-tenant quotas/noisy-neighbor mitigation, fan-out migrations across thousands of tenants, tenant offboarding (export + hard delete), optional per-tenant keys, and safe cross-tenant admin features.
+when_to_use: Building or hardening a multi-tenant SaaS where many customer organizations share infra and must be isolated from one another — choosing an isolation model, stopping cross-tenant data leaks, scoping every query by tenant, or scaling migrations/quotas across many tenants. Distinct from design-relational-schema (general table/normalization modeling — this is the tenancy/isolation layer built on top of that), design-authorization-model (what a user may do WITHIN one tenant — RBAC/ABAC — vs separating tenants from each other), and map-privacy-data-gdpr (PII rights/consent — referenced for export/delete mechanics but not the focus).
+---
+## When to Use
+Reach for this skill when the question is **"how do I keep tenant A's data away from tenant B while they share the same stack?"** — the isolation architecture, not the per-user permissions inside one org:
+- "We're going multi-tenant — shared tables with a `tenant_id`, a schema per customer, or a DB per customer?"
+- "How do I make sure one missing `WHERE tenant_id` can't leak another org's data?"
+- "Resolve the tenant from the subdomain / `X-Tenant` header / JWT org claim and scope every query to it"
+- "We have 4,000 tenants — how do I run a schema migration across all of them safely?"
+- "Enterprise customer wants their data in a separate database / their own encryption key"
+- "One big tenant is hammering the DB and starving everyone else (noisy neighbor)"
+- "Build admin impersonation / global analytics without accidentally bypassing isolation"
+NOT this skill:
+- Designing the tables/keys/normalization themselves (PKs, 1:N, constraints) → **design-relational-schema** (this skill adds the `tenant_id` + RLS layer on top of that model)
+- Roles/permissions for users *within* a single tenant (admin vs viewer, per-resource sharing) → **design-authorization-model** (authZ within a tenant ≠ isolating tenants from each other)
+- DSAR export format, consent capture, lawful basis, erasure-across-backups policy → **map-privacy-data-gdpr** (referenced in step 7 for the offboarding mechanics)
+- Capping request *rate/volume* per caller mechanics (token bucket, 429, Redis counters) → **rate-limiting** (referenced in step 6 for per-tenant quotas)
+- Running one risky `ALTER` on one large live table safely (locks, backfill) → **db-migration-safety** (referenced in step 8 for the fan-out)
+- Cache patterns/TTLs/stampede in general → **caching-strategy** (referenced in step 9 for tenant-keyed caches)
+## Steps
+1. **Pick the isolation model from a tradeoff table — default shared+RLS, escalate per tenant only when a reason demands it.** The three models are not all-or-nothing; a mature SaaS often runs *hybrid pods*.
+   | Dimension | Shared schema + RLS | Schema-per-tenant | Database-per-tenant |
+   |---|---|---|---|
+   | **Isolation** | Logical (one bug from leak) | Stronger (namespace) | Strongest (physical) |
+   | **Cost / tenant** | Lowest (one DB, shared) | Low–medium | Highest (conn pool, idle DB, backups each) |
+   | **Ops / migration burden** | One migration, all tenants | Loop over N schemas | Loop over N databases (heaviest) |
+   | **Blast radius** | All tenants (shared) | Per-schema | Per-tenant only |
+   | **Noisy neighbor** | Worst — shared buffers/CPU/locks | Some sharing | Isolated resources |
+   | **Per-tenant restore / PITR** | Hard (row-level surgery) | Medium | Trivial (restore that DB) |
+   | **Tenant count it scales to** | 100k+ | hundreds–low thousands | tens–low hundreds |
+   Pick: **shared schema + RLS by default** (cheapest, scales to many small tenants); **schema-per-tenant** when you want per-tenant restore/customization without N databases' connection overhead; **database-per-tenant** for enterprise/compliance (HIPAA/SOC2 data-residency), per-tenant encryption/restore, or one tenant so large it deserves its own resources. **Hybrid pods:** small tenants share a pool, large/enterprise tenants get dedicated DBs — a `tenant → shard/connection` routing map (a "tenant catalog" table in a control-plane DB) decides at request time. Write the decision in an ADR (see write-adr); migrating models later is a large project.
+2. **Add `tenant_id` to every tenant-owned table: non-null, indexed, first column of composite indexes.** In the shared model, every table carrying tenant data gets `tenant_id uuid NOT NULL REFERENCES tenant(id)`. Make it the **leading column** of relevant indexes and most composite PKs/uniques (`UNIQUE (tenant_id, email)`, not `UNIQUE (email)`, because email is unique *per tenant*, not globally). Global/system tables (plans, feature flags, the `tenant` registry itself) have no `tenant_id`. Never let `tenant_id` be nullable or carry a default: a row with a null or defaulted tenant is an isolation hole.
+3. **Resolve tenant context at the edge, from a trusted source — never from a client-supplied body field.** Map the inbound request to exactly one tenant:
+   | Source | How | Note |
+   |---|---|---|
+   | **Subdomain** | `acme.app.com` → `acme` | Friendly; needs wildcard DNS/TLS; map slug→tenant_id in catalog |
+   | **`X-Tenant` header** | API/service-to-service | Trust only if the caller is authenticated; never accept it from an unauthenticated browser |
+   | **JWT `org`/`tenant` claim** | from the verified token | **Most trustworthy** — signed, can't be forged client-side |
+   Resolve once at the edge (middleware), validate the tenant is active, and store it in an **immutable request context** (not a mutable global). The cardinal rule: derive `tenant_id` from the **authenticated identity**, never from a request body/query param — a client-supplied `tenant_id` is a cross-tenant skeleton key (this is exactly the gap **design-authorization-model** warns about). If subdomain and token disagree, reject.
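The resolution rule in step 3 can be sketched in Python as follows. This is a minimal illustration, not a definitive middleware: the `tenant` claim name, the in-memory `CATALOG`, and `resolve_tenant` are hypothetical, and the claims are assumed to come from a token whose signature was already verified upstream.

```python
from dataclasses import dataclass
from types import MappingProxyType

# Hypothetical tenant catalog: slug -> (tenant_id, active flag).
CATALOG = MappingProxyType({
    "acme": ("11111111-1111-1111-1111-111111111111", True),
    "globex": ("22222222-2222-2222-2222-222222222222", False),
})


class TenantResolutionError(Exception):
    """Raised when a request cannot be mapped to exactly one active tenant."""


@dataclass(frozen=True)  # frozen: the request context cannot be mutated later
class TenantContext:
    tenant_id: str
    slug: str


def resolve_tenant(verified_claims: dict, host: str) -> TenantContext:
    """Derive the tenant from verified token claims, never from the request body."""
    slug = verified_claims.get("tenant")
    if not slug or slug not in CATALOG:
        raise TenantResolutionError("no known tenant in verified claims")
    # If the subdomain names a tenant, it must agree with the token.
    subdomain = host.split(".", 1)[0]
    if subdomain in CATALOG and subdomain != slug:
        raise TenantResolutionError("subdomain and token disagree")
    tenant_id, active = CATALOG[slug]
    if not active:
        raise TenantResolutionError("tenant is not active")
    return TenantContext(tenant_id=tenant_id, slug=slug)
```

A handler would call `resolve_tenant` once in middleware and pass the returned `TenantContext` down; nothing in the request body or query string is consulted.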
+4. **Defense in depth — app-layer scoping is the primary guard, Postgres RLS is the safety net.** The #1 production multi-tenancy bug is a single query that forgot its tenant filter → cross-tenant leak. You need **both** layers because each fails differently:
+   - **App layer (primary):** every query is scoped through a **tenant-aware repository / ORM global filter** so developers physically can't write an unscoped query. Don't rely on each engineer remembering `WHERE tenant_id = $1` — inject it centrally (e.g. an ORM global scope, a base repository that always appends the filter, a query builder that refuses to run without a tenant).
+   - **DB layer (backstop):** Postgres Row-Level Security catches the day someone bypasses the repository or writes raw SQL.
+   ```sql
+   ALTER TABLE document ENABLE ROW LEVEL SECURITY;
+   ALTER TABLE document FORCE ROW LEVEL SECURITY;        -- applies to the table owner too
+   CREATE POLICY tenant_isolation ON document
+     USING       (tenant_id = current_setting('app.tenant_id')::uuid)   -- read/update/delete visibility
+     WITH CHECK  (tenant_id = current_setting('app.tenant_id')::uuid);  -- blocks INSERT into another tenant
+   ```
+   `FORCE` is non-negotiable (without it the table owner — usually your app's role — bypasses RLS). `WITH CHECK` stops a write that *sets* a foreign `tenant_id`. The app role must **not** have `BYPASSRLS`.
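As an illustration of the app-layer half of step 4, here is a small Python sketch of a repository that always injects the tenant filter. `TenantScopedRepo` and `demo_connection` are hypothetical names, and in-memory SQLite stands in for the real database purely so the sketch is self-contained; the RLS backstop itself is Postgres-specific and is not modeled here.

```python
import sqlite3


class TenantScopedRepo:
    """Hypothetical base repository: every query carries the tenant filter."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            # Refuse to exist without a tenant, so no unscoped query is possible.
            raise ValueError("refusing to build a repository without a tenant")
        self._conn = conn
        self._tenant_id = tenant_id

    def list_documents(self) -> list:
        return self._conn.execute(
            "SELECT id, title FROM document WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        ).fetchall()

    def get_document(self, doc_id: int):
        # The tenant filter applies even when the caller knows a real id.
        return self._conn.execute(
            "SELECT id, title FROM document WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, doc_id),
        ).fetchone()


def demo_connection() -> sqlite3.Connection:
    """Seed two tenants in an in-memory database (a stand-in for Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE document ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, title TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO document (id, tenant_id, title) VALUES (?, ?, ?)",
        [(1, "tenant-a", "A plan"), (2, "tenant-b", "B plan")],
    )
    return conn
```

With the context set to tenant A, `get_document(2)` returns `None` even though id 2 exists, which is the behavior the leak test in step 10 asserts.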
+5. **Set the RLS variable with `SET LOCAL` inside the transaction — the connection-pool caveat that breaks naive RLS.** RLS reads `current_setting('app.tenant_id')`. You must set it per request — but **how** depends on the pooler:
+   - With **PgBouncer in transaction mode** (the common setup), a connection is handed to a *different* tenant's request the instant your transaction commits. A session-level `SET app.tenant_id = ...` therefore **leaks** the previous tenant's value into the next request — a catastrophic cross-tenant bug.
+   - Fix: set it **transaction-scoped** so it auto-resets at commit/rollback:
+     ```sql
+     BEGIN;
+     SET LOCAL app.tenant_id = '...';   -- reset automatically at COMMIT/ROLLBACK; never plain SET
+     -- ... all queries in this request ...
+     COMMIT;
+     ```
+   - Equivalent: `SELECT set_config('app.tenant_id', $1, true)` (the `true` = local). Every tenant request must run inside a transaction that begins with `SET LOCAL`. Assert in the repository that the var is set before any query runs, so a missing context fails closed (returns zero rows / errors) rather than leaking.
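The per-transaction pattern from step 5 might be wrapped like this in Python. It is a sketch under stated assumptions: `conn` is any DB-API connection to Postgres with autocommit off (so the first `execute` opens the transaction) and the `%s` parameter style, as psycopg uses; `tenant_transaction` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Run a block inside one transaction with app.tenant_id set locally.

    Assumes a DB-API connection with autocommit off and '%s' placeholders.
    """
    if not tenant_id:
        raise ValueError("missing tenant context")  # fail closed, run nothing
    cur = conn.cursor()
    try:
        # Third argument true = local to this transaction, like SET LOCAL,
        # so the value is discarded at COMMIT/ROLLBACK.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Usage would look like `with tenant_transaction(conn, ctx.tenant_id) as cur: cur.execute(...)`; every query of the request goes through that cursor, so nothing runs before the variable is set.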
+6. **Per-tenant quotas + noisy-neighbor mitigation.** In a shared model one tenant can starve the rest. Enforce **per-tenant rate limits and quotas keyed by `tenant_id`** (token bucket / sliding window — see **rate-limiting**), plus resource guards: statement timeouts, max connections per tenant, row/storage caps, background-job concurrency caps per tenant. For chronic offenders or very-large tenants, move them to a **dedicated pod/DB** (step 1's hybrid). Track per-tenant usage metrics (queries/sec, storage, CPU) so you can detect and isolate a noisy neighbor before it causes an incident.
+7. **Tenant offboarding — clean per-tenant export and verifiable hard delete.** Deletion and export are isolation-critical and a GDPR obligation (mechanics: **map-privacy-data-gdpr**):
+   - **Export:** dump all rows where `tenant_id = $1` across every table to a machine-readable archive. Database-per-tenant makes this a `pg_dump` of one DB; shared schema requires a tenant-scoped export of every table (drive it from a registry of tenant-owned tables so none is missed).
+   - **Hard delete:** in shared schema, `DELETE` cascades by `tenant_id` (rely on `ON DELETE CASCADE` from the `tenant` row, or a deterministic ordered delete) — and don't forget derived data: caches, search indexes, object storage, analytics warehouse, backups' retention policy. Database/schema-per-tenant: `DROP DATABASE`/`DROP SCHEMA` is the cleanest, most auditable erasure. Verify deletion (assert zero rows remain for the tenant) and log it for compliance.
+8. **Migrations across thousands of tenants: online, batched, versioned, idempotent.** Schema changes work under every model, but their cost and blast radius *scale* differently:
+   - **Shared schema:** one migration changes all tenants at once — fast, but a bad migration's blast radius is everyone. Use online/safe DDL (see **db-migration-safety**: avoid long table locks, backfill in batches, add indexes `CONCURRENTLY`).
+   - **Schema-per-tenant / DB-per-tenant:** loop the migration over every schema/database. This must be **batched, resumable, and idempotent** — track each tenant's schema version in the catalog, run N at a time, record success/failure per tenant, and be able to retry only the failures. A 4,000-tenant migration that aborts at tenant 2,500 must resume, not restart. Roll out behind a flag and canary on a few tenants first.
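The fan-out described in step 8 can be sketched as a resumable loop. This is only an illustration: the `catalog` dict (tenant to schema version) stands in for the tenant catalog table, and `apply_migration` is a hypothetical callback that is assumed to be idempotent for a single tenant.

```python
from typing import Callable


def migrate_tenants(
    catalog: dict,
    target_version: int,
    apply_migration: Callable[[str], None],
    batch_size: int = 50,
) -> dict:
    """Bring every tenant in `catalog` (tenant -> version) up to target_version.

    Returns tenant -> error message for the failures. Tenants already at the
    target are skipped, so rerunning retries only what is still behind.
    """
    failures = {}
    pending = sorted(t for t, v in catalog.items() if v < target_version)
    for start in range(0, len(pending), batch_size):
        # A real runner would throttle or checkpoint at each batch boundary.
        for tenant in pending[start:start + batch_size]:
            try:
                apply_migration(tenant)
            except Exception as exc:  # record the failure and keep going
                failures[tenant] = str(exc)
            else:
                catalog[tenant] = target_version  # per-tenant version record
    return failures
```

Because progress is recorded per tenant, a run that fails partway leaves the catalog describing exactly who is done, and the next invocation resumes from there instead of restarting.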
+9. **Cache and search keyed by tenant; cross-tenant admin features that don't bypass isolation.**
+   - **Caching:** every cache key includes `tenant_id` (`doc:{tenant_id}:{id}`) so tenant A can never read tenant B's cached value, and invalidation can be per-tenant (see **caching-strategy**). Same for search indexes (per-tenant index or a mandatory tenant filter on every query).
+   - **Admin / impersonation:** when support impersonates a tenant, **set the same `app.tenant_id` and go through the same scoped path** — don't add a "god query" that ignores RLS. Use a separate, audited DB role with `BYPASSRLS` *only* for narrow platform operations, and **log every impersonation** (build-audit-logging). **Global analytics** (cross-tenant metrics) is the one legitimate cross-tenant read: run it through a dedicated read-only role/replica with explicit tenant aggregation, isolated from the application path — never by relaxing the app's RLS.
+10. **Test tenant isolation as a first-class, automated guarantee — the leak test is the one that matters.** Ship these as CI tests, not manual checks:
+    - **Cross-tenant denial:** seed data for tenant A and tenant B; with context set to A, assert that *every* read/list/get of B's resources (by real id) returns **zero rows / 404 / deny** — including raw SQL paths, the cache, and search.
+    - **RLS backstop:** run a query that *omits* the app-layer filter against a session with `app.tenant_id = A` → still returns only A's rows. Proves the data tier holds when the app forgets.
+    - **Fuzz the `tenant_id`:** randomize/swap the tenant in the request context and assert no other tenant's data is ever returned and no write lands in the wrong tenant (`WITH CHECK` holds).
+    - **Pool leak test:** run interleaved requests for A then B over a transaction-mode pool and assert B never sees A's `SET LOCAL` value.
+    - **Optional per-tenant keys:** if tenants have their own encryption keys (envelope encryption, key per tenant in a KMS), test that the wrong key can't decrypt another tenant's data and that key deletion = crypto-shredded data.
+    Done = an isolation model chosen against the tradeoff table and recorded in an ADR; `tenant_id` non-null + indexed on every tenant table (and leading in uniques); tenant context derived from the verified identity at the edge and propagated as immutable request state; every query scoped at the app layer **and** RLS (`FORCE` + `WITH CHECK`, app role without `BYPASSRLS`) enforced, with the var set via `SET LOCAL` per transaction; per-tenant quotas in place; migrations fan out batched/resumable/versioned; export + verified hard delete defined; caches/search/admin/analytics tenant-keyed; and an automated cross-tenant leak test (plus fuzz + pool-leak) passing in CI.

package/skills/design-protobuf-grpc-service/SKILL.md ADDED

@@ -0,0 +1,146 @@
+---
+name: design-protobuf-grpc-service
+description: Designs and evolves gRPC/protobuf service contracts — message and service definitions, unary vs streaming RPC selection, wire-compatible schema evolution (reserved tags, safe vs breaking changes), canonical status codes, deadlines/cancellation, interceptors, and buf-driven codegen plus breaking-change detection.
+when_to_use: User is writing or changing a .proto/gRPC service, picking unary vs streaming, worried about breaking wire compat on a rolling deploy, wiring multi-language codegen, or adding deadlines/auth/error semantics. This is the binary RPC contract; HTTP/JSON REST or GraphQL surfaces are rest-graphql-contract.
+---
+## When to Use
+Reach for this skill when the contract is a **.proto / gRPC wire format**, not an HTTP/JSON shape:
+- "Design the messages and RPCs for this new service" / "add a method to this `.proto`"
+- "Is renaming/renumbering this field safe to deploy?" — wire-compat review
+- "Should this be unary, server-streaming, or bidi?" / "stream vs websocket?"
+- "Wire codegen for Go + TS + Python off one schema" / "set up `buf` + breaking-change CI"
+- "Set deadlines / map our errors to gRPC status codes / add an auth interceptor"
+- "Expose this to a browser" → gRPC-Web / Connect
+NOT this skill:
+- REST resources, JSON envelopes, OpenAPI/SDL, HTTP versioning/pagination → rest-graphql-contract
+- Reviewing an existing HTTP/RPC API *diff* for naming/compat as an audit pass → api-design-review
+- Issuing/verifying JWTs, OAuth/OIDC flows, RBAC logic (the interceptor *calls* this) → auth-jwt-session
+- Adding tracing/metrics/logs to the service internals → observability-instrument
+- Correctness of the streaming/concurrency code itself (races, missing await) → async-concurrency-correctness
+## Steps
+1. **Model messages — field numbers are the contract, names are not.** The tag number is what goes on the wire; renaming a field is free, renumbering is catastrophic.
+   - Field numbers `1–15` take a 1-byte tag; reserve them for the hot, always-present fields. Numbers `16–2047` take 2 bytes.
+   - **Removing a field:** delete it *and* `reserved` both its number and name, so nobody reuses them. This is non-negotiable.
+     ```proto
+     message User {
+       reserved 4, 7 to 9;            // retired tags — never reuse
+       reserved "email_verified";     // retired name — block re-add under old meaning
+       string id = 1;
+       string display_name = 2;
+       optional string email = 3;     // optional => field presence (knows set-vs-default)
+     }
+     ```
+   - Use `optional` (proto3) when you must distinguish "unset" from zero-value; bare scalars can't tell `0`/`""`/`false` from absent.
+   - **Every enum starts at `0 = *_UNSPECIFIED`.** 0 is the default on the wire; if 0 means a real state you can't detect "not set," and you can't safely add values before it.
+     ```proto
+     enum Status { STATUS_UNSPECIFIED = 0; STATUS_ACTIVE = 1; STATUS_BANNED = 2; }
+     ```
+   - Prefer `google.protobuf.Timestamp`/`Duration` over raw int64; `map<k,v>` over parallel lists; a `Money{currency_code, units, nanos}` message over a float. Never put currency in a `double`.
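The tag-size rule in step 1 follows from how a field's tag is encoded: it is the varint of `(field_number << 3) | wire_type`. A small Python sketch makes the thresholds visible; `encode_varint` and `tag_size` are illustrative helpers written for this note, not part of any protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Bytes used by a field's tag: varint of (field_number << 3) | wire_type."""
    return len(encode_varint((field_number << 3) | wire_type))
```

Three bits of the first byte go to the wire type, which leaves four bits for the field number, hence the 1-byte limit at field 15.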
+2. **Pick the RPC shape from the data flow — default to unary.** Streaming is for unbounded or incremental data, not for "it's faster."
+   | Shape | Signature | Use when | Don't use for |
+   |---|---|---|---|
+   | **Unary** | `rpc Get(Req) returns (Resp)` | request/response, bounded payload — **the default** | huge/unbounded results |
+   | Server-streaming | `returns (stream Resp)` | feed/tail, large result set, server-push progress | a single object that fits in memory |
+   | Client-streaming | `(stream Req) returns (Resp)` | chunked upload, batch ingest, client-side aggregation | small fixed-size input |
+   | Bidi | `(stream Req) returns (stream Resp)` | live chat, long-lived sync, interactive session | anything a sequence of unary calls covers |
+   - **Stream vs websocket:** if both ends are gRPC and you need typed messages + backpressure + deadlines, use a gRPC stream. Reach for a websocket only when a *browser* needs raw duplex and you're not on Connect/gRPC-Web.
+   - Page large reads with `page_size`/`page_token` (AIP-158) **before** reaching for server-streaming — pagination is resumable and cacheable; a broken stream restarts from zero.
+3. **Run the wire-compat checklist before any schema change** — clients and servers deploy at different times, in multiple languages, and old binaries must keep parsing new messages.
+   | Change | Wire-safe? | Why |
+   |---|---|---|
+   | Add a new field (new tag) | ✅ | old readers skip unknown fields |
+   | Add a new RPC / new message | ✅ | additive |
+   | Rename a field (same tag/type) | ✅ wire / ⚠️ JSON | wire keys on number; **gRPC-JSON/Connect keys on name** — breaks JSON clients |
+   | Add an enum value | ✅ | but old clients see it as the unknown/default — handle that branch |
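+   | `sint32`↔`int32`, `fixed32`↔`int32`, numeric `optional`↔`repeated` | ❌ | different wire encoding → silent corruption; `int32`↔`int64` share the varint encoding but still truncate on 32-bit readers and change the JSON form, so treat them as breaking too |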
+   | Reuse / renumber a tag | ❌ | old data deserializes into the wrong field |
+   | Remove a field without `reserved` | ❌ | tag can be reused later → corruption |
+   | Change a field's type/cardinality | ❌ | re-version the message or add a new field instead |
+   | Rename/move a service or package | ❌ | the path is `/pkg.Service/Method`; calls from old stubs fail with `UNIMPLEMENTED` |
+   To evolve incompatibly: **add a new field/method, deprecate the old (`[deprecated = true]`), migrate, then `reserved` it** — never mutate in place. Enforce this with `buf breaking` (step 6).
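To see why a type swap from the checklist corrupts silently rather than failing, compare what `sint64` and `int64` put on the wire for the same number. The Python sketch below reimplements the two mappings by hand (ZigZag for `sint64`, two's complement for `int64`); the function names are illustrative, not a library API.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Unsigned value that sint64 writes for a signed 64-bit integer n."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(u: int) -> int:
    """Signed value a sint64 reader recovers from unsigned wire value u."""
    return (u >> 1) ^ -(u & 1)


def int64_wire_value(n: int) -> int:
    """Unsigned value that int64 writes (64-bit two's complement)."""
    return n & MASK64


def int64_read(u: int) -> int:
    """Signed value an int64 reader recovers from unsigned wire value u."""
    return u - (1 << 64) if u >= (1 << 63) else u
```

A writer that declares `sint64` and sends `-1` puts the varint `1` on the wire; a reader that still declares `int64` parses it without error and sees `1`. Both sides agree on the tag and the varint wire type, so nothing flags the mismatch.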
+4. **Error & control plane — set a deadline on every call, return canonical codes.**
+   - **Deadlines are mandatory.** A call without one can hang forever and pin a server thread. Set an absolute deadline client-side (`context.WithTimeout`, ~the SLO); servers must check `ctx.Err()`/`isCancelled` and stop work when the client gives up. Propagate the deadline to downstream calls — don't reset it.
+   - Map failures to the [canonical status codes](https://grpc.io/docs/guides/status-codes/), not a generic `UNKNOWN`/`INTERNAL`:
+     | Code | Use for | Retry? |
+     |---|---|---|
+     | `INVALID_ARGUMENT` | malformed request, fails regardless of state | no |
+     | `FAILED_PRECONDITION` | valid request, wrong system state | no (fix state first) |
+     | `NOT_FOUND` / `ALREADY_EXISTS` | missing / duplicate resource | no |
+     | `PERMISSION_DENIED` / `UNAUTHENTICATED` | authz fail / missing-bad creds | no |
+     | `RESOURCE_EXHAUSTED` | quota / rate limit | yes, with backoff + honor `Retry-After`-style detail |
+     | `DEADLINE_EXCEEDED` | call ran past deadline | yes if idempotent |
+     | `UNAVAILABLE` | transient — server down/restarting | yes, backoff (the canonical retryable code) |
+     | `ABORTED` | concurrency conflict (CAS/txn) | yes, after re-read |
+   - Attach machine-readable detail with `google.rpc.Status` + typed details (`ErrorInfo` with a stable `reason` + `domain`, `BadRequest.field_violations`, `QuotaFailure`) — not a prose string clients must regex.
+   - **Retry only idempotent methods.** Configure a service-config retry policy (`maxAttempts`, `UNAVAILABLE`/`DEADLINE_EXCEEDED` only, exponential backoff). For non-idempotent creates, pass a client-generated idempotency key in metadata and dedupe server-side. Cancellation propagates automatically when the client closes the stream/context — release resources on it.
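The retry and deadline rules in step 4 can be sketched without any gRPC library. Everything here is a stand-in: `RpcError` models an error carrying a canonical code name, `rpc` is a hypothetical idempotent call that accepts its remaining time budget in seconds, and the policy shown has no jitter, which a production policy would normally add.

```python
import time
from typing import Callable

# Assumed policy for idempotent methods only, as the step describes.
RETRYABLE = frozenset({"UNAVAILABLE", "DEADLINE_EXCEEDED"})


class RpcError(Exception):
    """Stand-in for a gRPC error carrying a canonical status code name."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def call_with_retry(
    rpc: Callable[[float], object],
    deadline: float,
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
):
    """Call rpc(remaining_seconds), retrying retryable codes under one deadline."""
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - now()
        if remaining <= 0:
            raise RpcError("DEADLINE_EXCEEDED")
        try:
            # Pass the remaining budget down; never start a fresh timeout.
            return rpc(remaining)
        except RpcError as err:
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise
            # Exponential backoff, capped by what is left of the deadline.
            delay = min(base_delay * 2 ** (attempt - 1), max(deadline - now(), 0))
            sleep(delay)
```

The `deadline` argument is an absolute point on the `now()` clock, so retries and backoff all draw from the same budget; `sleep` and `now` are injectable only to make the behavior easy to test.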
+5. **Cross-cutting concerns belong in interceptors + metadata, not in every method.**
+   - **Interceptors** (chained, ordered) for auth, logging, tracing, panic-recovery, rate-limit. Auth interceptor reads the token from metadata and *delegates verification* (that logic lives in auth-jwt-session) — return `UNAUTHENTICATED` (missing/invalid creds) vs `PERMISSION_DENIED` (valid identity, not allowed).
+   - **Metadata** = gRPC's headers/trailers. Lowercase ASCII keys; a key carrying raw bytes must end in `-bin` (e.g. `trace-id-bin`) so the runtime base64-handles it. Carry auth (`authorization: Bearer …`), request id, idempotency key, locale. Never put a deadline in metadata — it's a first-class call property. Reserved `grpc-*` keys are runtime-owned; don't set them yourself.
+   - **TLS always; mTLS for service-to-service.** Never run a non-loopback gRPC server on an insecure channel — h2c in the clear leaks every byte.
+   - **Browser/edge:** native gRPC relies on HTTP/2 framing and response trailers that browser `fetch`/XHR APIs don't expose, so expose **Connect** (speaks gRPC, gRPC-Web, *and* JSON over the same handler; easiest) or **gRPC-Web** behind an Envoy/proxy translation layer. Don't try to call raw gRPC from `fetch`.
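The metadata rules in step 5 are easy to check mechanically, for example in a client-side interceptor or a unit test. The Python sketch below is a hypothetical validator written for this note; it encodes the three rules from the text (lowercase ASCII keys, `-bin` for raw bytes, no caller-set `grpc-*` keys) and is not an API of any gRPC library.

```python
import re

# Characters the gRPC wire spec allows in a metadata key.
_KEY_RE = re.compile(r"[0-9a-z_.\-]+")


def validate_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, value in metadata.items():
        if not _KEY_RE.fullmatch(key):
            problems.append(f"{key!r}: keys must be lowercase ASCII")
        if key.startswith("grpc-"):
            problems.append(f"{key!r}: grpc-* keys are owned by the runtime")
        # Raw bytes travel only under '-bin' keys, and '-bin' keys carry bytes.
        if isinstance(value, bytes) != key.endswith("-bin"):
            problems.append(f"{key!r}: bytes values and the '-bin' suffix go together")
    return problems
```

For instance, `validate_metadata({"authorization": "Bearer …", "trace-id-bin": b"\x01"})` returns an empty list, while a `grpc-timeout` key is flagged because the deadline belongs to the call, not to metadata.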
+6. **Codegen + lint with `buf`, not raw `protoc` — and wire breaking-change detection into CI.** `protoc` plugin/path juggling is the classic footgun; `buf` makes the schema the source of truth.
+   ```yaml
+   # buf.yaml
+   version: v2
+   lint:   { use: [STANDARD] }
+   breaking: { use: [WIRE_JSON] }   # catch tag/type/name breakage
+   ```
+   ```yaml
+   # buf.gen.yaml — one schema, many languages
+   version: v2
+   plugins:
+     - { remote: buf.build/protocolbuffers/go,    out: gen/go,  opt: paths=source_relative }
+     - { remote: buf.build/connectrpc/go,         out: gen/go,  opt: paths=source_relative }
+     - { remote: buf.build/bufbuild/es,           out: gen/ts }
+   ```
+   ```bash
+   buf lint                                   # naming/style/UNSPECIFIED rules
+   buf breaking --against '.git#branch=main'  # FAIL CI on any wire/JSON break
+   buf generate                               # regenerate all stubs from .proto
+   ```
+   Check generated stubs into VCS *or* regenerate in CI; pick one and enforce it, because a stale committed stub that disagrees with the `.proto` is silent contract drift. Back the contract with a contract test (Verify, item 9) so the running server and the `.proto` can't diverge.
+## Common Errors
+- **Reusing or renumbering a field tag.** Old bytes deserialize into the wrong field — silent data corruption, no error. Always `reserved` removed tags *and* names; `buf breaking` catches it if you let it.
+- **Enum without `0 = *_UNSPECIFIED`.** 0 is the wire default, so you can't distinguish "unset" from your first real value, and you can't prepend values later. Always reserve 0 for UNSPECIFIED.
+- **No deadline on the call.** One slow/hung downstream pins server resources indefinitely and cascades into outage. Set an absolute deadline on every client call; propagate, don't reset, downstream.
+- **Returning `INTERNAL`/`UNKNOWN` for everything.** Clients can't tell retryable from fatal and either hammer a down service or give up on a transient blip. Map to the specific canonical code; reserve `INTERNAL` for genuine server bugs.
+- **Retrying non-idempotent RPCs.** A retried `Create`/`Charge` after a timeout double-executes. Restrict the retry policy to idempotent methods; for the rest use a server-deduped idempotency key.
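+- **Changing a scalar type to a "compatible-looking" one** (`sint32`→`int32`, `optional`→`repeated` on a numeric field). Different wire encoding → garbled values on old readers. Even varint-compatible swaps such as `int32`→`int64` truncate on 32-bit readers and change the JSON form. Add a new field instead and migrate.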
+- **Renaming a field assumed free, but a Connect/gRPC-JSON client keys on the name.** Wire-safe, JSON-breaking. If any client speaks JSON, treat a rename as breaking.
+- **Calling raw gRPC from a browser.** Native gRPC relies on HTTP/2 framing and response trailers that browser APIs don't expose. Use Connect or gRPC-Web through a proxy.
+- **`protoc` plugin/import-path hell producing stale or wrong stubs.** Use `buf` with a remote plugin set so paths and versions are pinned and reproducible.
+- **Insecure (h2c, no TLS) channel in prod.** Everything including bearer tokens is in cleartext. TLS always; mTLS between services.
+- **Breaking-change check missing from CI.** A bad merge ships an incompatible schema and breaks every deployed client. `buf breaking --against '.git#branch=main'` must gate merges.
+## Verify
+1. **Lint clean:** `buf lint` passes — every enum has `*_UNSPECIFIED = 0`, fields snake_case, services/methods follow the standard naming rules.
+2. **Breaking-change gate:** `buf breaking --against '.git#branch=main'` is green; deliberately renumber a tag locally and confirm it goes **red** (proves the gate actually fires).
+3. **Codegen reproducible:** `buf generate` from a clean tree produces stubs byte-identical to what's committed (no uncommitted diff) for every target language.
+4. **Wire round-trip across versions:** serialize a message with the *new* schema, parse it with a binary built on the *old* schema (and vice-versa) — no error, no field loss for additive changes. This is the real proof of compatibility, not eyeballing the diff.
+5. **Deadline honored:** a call given a 100ms deadline against an artificially slow method returns `DEADLINE_EXCEEDED` near 100ms (not hanging), and the server logs show it cancelled work rather than running to completion.
+6. **Status mapping:** each error path returns its specific canonical code (asserted in tests), and retry policy retries `UNAVAILABLE`/`DEADLINE_EXCEEDED` only — a deliberate `INVALID_ARGUMENT` is not retried.
+7. **Streaming flow:** a server-stream consumer that cancels mid-stream causes the server's context to cancel and stop producing (no leaked goroutine/thread); a client-stream upload that drops mid-send leaves no half-written state.
+8. **Auth interceptor:** missing token → `UNAUTHENTICATED`; valid token without permission → `PERMISSION_DENIED`; both asserted, and the path runs over TLS (insecure channel rejected).
+9. **Contract test:** a cross-language client built from the generated stub calls the running server end-to-end and gets the expected typed response — proves `.proto`, server, and stubs agree.
+Done = `buf lint` and `buf breaking --against '.git#branch=main'` pass in CI (and the gate provably fails on a real break), `buf generate` leaves no diff, the old↔new wire round-trip and the cross-language contract test both pass, every call sets a deadline, and each error path returns its specific canonical status code over TLS.

package/skills/design-relational-schema/SKILL.md ADDED

@@ -0,0 +1,129 @@
+---
+name: design-relational-schema
+description: Designs a normalized relational schema from requirements — entities, relationships, PK strategy (surrogate bigint vs natural vs UUIDv7/ULID), 1:1/1:N/M:N and inheritance modeling, 3NF/BCNF normalization, invariants encoded as UNIQUE/CHECK/FK/exclusion constraints, and deliberate read-path denormalization with stated consistency tradeoffs.
+when_to_use: When starting a new database or a new table cluster and you need the logical+physical model — turning requirements/an ERD into tables, choosing keys, modeling cardinalities and inheritance, normalizing, then deciding where to denormalize. Distinct from db-migration-safety (altering a live table safely) and optimize-sql-query (speeding up a query against an existing schema).
+---
+## When to Use
+Reach for this skill when you're designing the **shape of the data**, before any table exists:
+- "Model these requirements / this ERD as tables"
+- "Should this PK be a UUID or a bigint? natural or surrogate? composite?"
+- "How do I model users↔roles (M:N) / orders→items (1:N) / a polymorphic comment?"
+- "Normalize this — I've got repeating columns / update anomalies / duplicated data"
+- "Where should I denormalize for a read-heavy dashboard, and what breaks?"
+- Choosing column types: enum vs lookup table, soft vs hard delete, audit columns, money/time precision
+NOT this skill:
+- Changing a table that already has rows/traffic (locks, backfills, rollback) → **db-migration-safety**
+- A query against an existing schema is slow → **optimize-sql-query**
+- You need an append-only, replayable, audit-complete domain model → **design-event-sourcing-cqrs**
+- Computing prices/tax/rounding/FX (the math, not the column type) → **money-decimal-arithmetic**
+- Storing/converting/comparing timestamps & DST correctly → **datetime-timezone-correctness**
+- Shaping items/documents for a non-relational store (DynamoDB/Mongo/Cassandra) around access patterns → **model-nosql-data**
+## Steps
+1. **Extract entities, attributes, relationships from requirements — nouns→tables, verbs→relationships.** List each entity, its attributes, and for every pair the cardinality (1:1 / 1:N / M:N) and optionality (mandatory vs nullable side). Mark each attribute's identity role: is it a candidate key (naturally unique, immutable), or descriptive? Write functional dependencies (`A → B`: A determines B) — they drive normalization in step 3. One table = one entity type; if an attribute is itself a list ("tags", "phone numbers"), it's a separate table, not a CSV column or `jsonb` dumping ground.
+2. **Pick a PK strategy per table — default to surrogate, choose the integer/UUID flavor deliberately.**
+   | Strategy | Use when | Avoid when |
+   |---|---|---|
+   | **`bigint GENERATED ALWAYS AS IDENTITY`** | Single-DB, internal IDs, smallest/fastest index, FK-heavy — **default** | IDs leak to clients/URLs and count/sequence is sensitive; multi-master inserts |
+   | **`uuid` v7 / ULID** (time-ordered) | IDs generated client-side or across shards, exposed in URLs, need merge without collision | You can use bigint and don't expose the ID — 16B vs 8B and bigger indexes |
+   | **`uuid` v4** (random) | Only if unguessability matters *and* you accept index-locality cost | Hot insert paths — random UUIDs fragment B-tree pages and bloat WAL |
+   | **Natural key** (email, ISO code, slug) | Truly immutable, single-attribute, externally governed (`country.iso2`, `currency.code`) | It can ever change or isn't guaranteed unique — a changing PK cascades through every FK |
+   | **Composite key** | Junction tables (`(a_id, b_id)`); rows identified only by the combination | A tempting single surrogate would be simpler and the combo isn't queried as a unit |
+   Rules: use a **surrogate `bigint` IDENTITY by default**; reach for **UUIDv7/ULID (not v4)** the moment IDs cross a process boundary or are client-generated; never expose a sequential surrogate where the sequence is sensitive (use UUIDv7 instead); a natural key still deserves a `UNIQUE` constraint even when you also keep a surrogate PK. Never use `serial`/`SERIAL` (legacy, ownership/permission footguns) — use `GENERATED ALWAYS AS IDENTITY`.
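To make the "time-ordered" property of UUIDv7 in step 2 concrete, here is a hand-rolled Python sketch of the RFC 9562 layout: a 48-bit Unix millisecond timestamp in the top bits, then the version and variant fields, with the rest random. `uuid7_sketch` is an illustrative name; in real code prefer a maintained generator (or the database's own), since this version makes no monotonicity guarantee within a single millisecond.

```python
import os
import time
import uuid


def uuid7_sketch(ms=None, rand=None) -> uuid.UUID:
    """Build a UUIDv7-shaped value: 48-bit ms timestamp, then random bits."""
    ms = time.time_ns() // 1_000_000 if ms is None else ms
    rand = os.urandom(10) if rand is None else rand  # 80 bits of randomness
    value = (ms & ((1 << 48) - 1)) << 80
    value |= int.from_bytes(rand, "big")
    # Version field (bits 76-79) = 7.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Variant field (bits 62-63) = 0b10 (RFC 4122/9562 layout).
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, ids created later sort after earlier ones, which is what keeps B-tree inserts clustered at the right edge of the index instead of scattered as with v4.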
+3. **Normalize 1NF → 2NF → 3NF/BCNF; stop at BCNF.** Eliminate the anomaly classes in order:
+   - **1NF** — atomic columns, no repeating groups, no arrays-as-CSV. Split `phone1, phone2, phone3` and `tags TEXT` into child rows.
+   - **2NF** — no non-key attribute depends on *part* of a composite key. In `order_item(order_id, product_id, product_name)`, `product_name` depends only on `product_id` → move it to `product`.
+   - **3NF** — no transitive dependency (non-key → non-key). `employee(id, dept_id, dept_name)`: `dept_name` depends on `dept_id`, not `id` → split out `department`.
+   - **BCNF** — every determinant is a candidate key. Fixes the rare overlapping-candidate-key case 3NF misses.
+   Target **3NF as the floor, BCNF where a determinant anomaly exists.** Each non-key fact lives in exactly one place; a fact changes via exactly one `UPDATE` to one row. Do **not** model attribute-value pairs generically (EAV: `entity/attribute/value` rows) — it destroys typing, FKs, and constraints; make real typed columns instead.
+4. **Model cardinalities explicitly — the FK lives on the "many" side.**
+   - **1:N** — FK column on the child (many) side pointing at the parent's PK. `order.customer_id → customer.id`. The direction is not a choice: the many side carries the FK.
+   - **M:N** — a junction (associative) table with a composite PK of both FKs: `enrollment(student_id, course_id, PRIMARY KEY(student_id, course_id))`. Relationship attributes (`enrolled_at`, `grade`) live on the junction.
+   - **1:1** — share a PK: the dependent table's PK *is* an FK to the parent (`user_profile.user_id PK REFERENCES user(id)`). Use only for optional/rarely-loaded columns; otherwise just add the columns to the parent.
+   - **Inheritance/polymorphism** — pick one, don't mix:
+     | Pattern | Shape | Use when |
+     |---|---|---|
+     | Single-table | one table, nullable subtype columns, `kind` discriminator | few subtypes, mostly shared columns — **default** |
+     | Class-table | base table + one child table per subtype, shared PK | subtypes have many distinct, NOT-NULL-able columns |
+     | Concrete-table | one full table per subtype, no base | subtypes never queried together |
+     For a polymorphic FK ("comment on a post *or* a photo"), **avoid the nullable-`(target_type, target_id)` pair** — it can't have a real FK. Prefer separate nullable FK columns each with its own real `REFERENCES` plus a `CHECK` that exactly one is set.
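The "separate nullable FKs plus a CHECK" recommendation can be tried out quickly. The Python sketch below uses in-memory SQLite as a stand-in so it runs anywhere; table and column names are illustrative. In Postgres the same rule could be written as `CHECK (num_nonnulls(post_id, photo_id) = 1)`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE post  (id INTEGER PRIMARY KEY);
CREATE TABLE photo (id INTEGER PRIMARY KEY);
CREATE TABLE comment (
    id       INTEGER PRIMARY KEY,
    post_id  INTEGER REFERENCES post(id),
    photo_id INTEGER REFERENCES photo(id),
    body     TEXT NOT NULL,
    -- exactly one target must be set
    CHECK ((post_id IS NOT NULL) + (photo_id IS NOT NULL) = 1)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO post (id) VALUES (1)")
    conn.execute("INSERT INTO photo (id) VALUES (1)")
    return conn


def try_insert(conn: sqlite3.Connection, post_id, photo_id) -> bool:
    """Return True if the comment row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO comment (post_id, photo_id, body) VALUES (?, ?, 'hi')",
            (post_id, photo_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

A comment pointing at one existing target is accepted; one with no target, with both targets, or with a target id that does not exist is rejected by the database itself, which the `(target_type, target_id)` pair cannot guarantee.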
+5. **Encode every invariant as a constraint in the DDL, not in app code.** If the database can enforce it, the database enforces it — app checks race and drift.
+   - `NOT NULL` on every column that is logically required (default to NOT NULL; justify each nullable column).
+   - `UNIQUE` on each natural/candidate key and on business-unique combos.
+   - `FOREIGN KEY ... ON DELETE` — choose the action deliberately: `CASCADE` (children are parts of the parent), `RESTRICT`/`NO ACTION` (default; refuse to orphan), `SET NULL` (only if the FK is legitimately optional).
+   - `CHECK` for value rules (`amount_minor >= 0`, `status IN (...)`, `start_at < end_at`).
+   - **Partial unique index** for conditional uniqueness: `CREATE UNIQUE INDEX ON users(email) WHERE deleted_at IS NULL;` (one active email, history allowed).
+   - **Exclusion constraint** for "no overlap" (e.g. no double-booking a room): `EXCLUDE USING gist (room_id WITH =, during WITH &&)`.
+   ```sql
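+   -- btree_gist supplies the GiST operator class that lets the exclusion
+   -- constraint compare the scalar room_id with "="
+   CREATE EXTENSION IF NOT EXISTS btree_gist;
+   CREATE TABLE booking (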
+     id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+     room_id    bigint NOT NULL REFERENCES room(id) ON DELETE RESTRICT,
+     guest_id   bigint NOT NULL REFERENCES guest(id) ON DELETE RESTRICT,
+     during     tstzrange NOT NULL,
+     status     text NOT NULL DEFAULT 'held' CHECK (status IN ('held','confirmed','cancelled')),
+     created_at timestamptz NOT NULL DEFAULT now(),
+     updated_at timestamptz NOT NULL DEFAULT now(),
+     CONSTRAINT no_double_booking
+       EXCLUDE USING gist (room_id WITH =, during WITH &&) WHERE (status <> 'cancelled')
+   );
+   ```
+6. **Decide the cross-cutting column conventions once, apply everywhere.**
+   - **Soft vs hard delete** — default **hard delete** with `ON DELETE` rules. Use soft delete (`deleted_at timestamptz NULL`) only when you must retain history or undo; then *every* uniqueness and FK must account for it (partial indexes `WHERE deleted_at IS NULL`, filtered FKs) or you reintroduce duplicates and dangling references.
+   - **Audit columns** — `created_at timestamptz NOT NULL DEFAULT now()`, `updated_at timestamptz NOT NULL DEFAULT now()` (kept current by a trigger), and `created_by/updated_by` FKs where attribution matters. Full row history → separate `*_history` table or → **design-event-sourcing-cqrs**, not bolted onto the live row.
+   - **Enum vs lookup table** — small, fixed, code-coupled set (`status`) → `CHECK (x IN (...))` or a native enum. Editable-by-users or carrying extra attributes (label, sort order, active flag) → a lookup table with an FK. Don't ship a `roles` lookup table of three forever-fixed values; don't ship a CHECK list for something product managers edit weekly.
+   - **Types** — money as `NUMERIC(precision, scale)` or integer minor units, **never `float`/`real`/`double`** (see **money-decimal-arithmetic**); timestamps as `timestamptz` storing UTC instants, never naive `timestamp` (see **datetime-timezone-correctness**); text as `text` (not `varchar(n)` unless a real domain limit exists); booleans as `boolean`, not `0/1` ints or `'Y'/'N'`.
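+
+   The "kept current by a trigger" part of the audit-column convention could look like the sketch below. The function and trigger names (`set_updated_at`, `booking_set_updated_at`) are hypothetical, the `booking` table is the one from step 5, and `EXECUTE FUNCTION` assumes PostgreSQL 11 or later.
+   ```sql
+   -- One shared trigger function, reusable by every table that has updated_at.
+   CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger
+   LANGUAGE plpgsql AS $$
+   BEGIN
+     -- now() is the transaction start time, the same clock the column DEFAULT uses.
+     NEW.updated_at := now();
+     RETURN NEW;
+   END;
+   $$;
+
+   -- Attach it per table; BEFORE UPDATE lets the function modify the row being written.
+   CREATE TRIGGER booking_set_updated_at
+     BEFORE UPDATE ON booking
+     FOR EACH ROW
+     EXECUTE FUNCTION set_updated_at();
+   ```
+   Putting this in the database rather than in application code means ad hoc `UPDATE` statements and migrations also keep `updated_at` honest.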
+7. **Denormalize only on a measured read hot path, and write down what you traded.** Start fully normalized; denormalize a specific column **only** when a real, frequent read can't be served cheaply by a join/index. Each denormalization is a stated consistency contract:
+   | Technique | Buys you | Costs you (must be stated) |
+   |---|---|---|
+   | Derived/rollup column (`post.comment_count`) | O(1) read, no aggregate join | Must update on every child write — trigger or app, or it drifts |
+   | Duplicated parent attribute (`order_item.product_name` at sale time) | Stable historical snapshot | Diverges from source by design — that's the point; document it |
+   | Materialized view | Precomputed report | Staleness window; needs an explicit `REFRESH MATERIALIZED VIEW` (`CONCURRENTLY` requires a unique index on the view) |
+   | Pre-joined wide read table | Single-table dashboard read | Whole second write path to keep in sync |
+   Default: keep it normalized and add an index first. A rollup counter maintained by a trigger is acceptable; copying mutable data you then have to keep in sync in two places is a liability — only when the read win is proven. Record for each: *what's duplicated, who keeps it consistent, and the acceptable staleness*.
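+
+   As one possible shape for the trigger-maintained rollup in the first table row, here is a minimal sketch. It assumes hypothetical tables `post(id)` and `comment(id, post_id)`; the function and trigger names are also assumptions.
+   ```sql
+   -- The denormalized column; the CHECK catches a counter that drifts negative.
+   ALTER TABLE post
+     ADD COLUMN comment_count integer NOT NULL DEFAULT 0 CHECK (comment_count >= 0);
+
+   CREATE OR REPLACE FUNCTION sync_comment_count() RETURNS trigger
+   LANGUAGE plpgsql AS $$
+   BEGIN
+     IF TG_OP = 'INSERT' THEN
+       UPDATE post SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
+     ELSIF TG_OP = 'DELETE' THEN
+       UPDATE post SET comment_count = comment_count - 1 WHERE id = OLD.post_id;
+     END IF;
+     RETURN NULL;  -- the return value of an AFTER row trigger is ignored
+   END;
+   $$;
+
+   CREATE TRIGGER comment_count_rollup
+     AFTER INSERT OR DELETE ON comment
+     FOR EACH ROW
+     EXECUTE FUNCTION sync_comment_count();
+
+   -- Repair / backfill query: recompute the counter from the source of truth.
+   UPDATE post p
+   SET comment_count = (SELECT count(*) FROM comment c WHERE c.post_id = p.id);
+   ```
+   The stated contract here would read: *duplicated* = the comment count, *owner* = the trigger, *staleness* = none within a committed transaction. The sketch deliberately ignores an `UPDATE` that moves a comment to another post, and every comment write now takes a row lock on its `post`, which is part of the cost to write down.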
+8. **Output the DDL plus an access-pattern → table/index map.** Deliver: (a) `CREATE TABLE` statements with all constraints from steps 5–6, (b) the FK graph, (c) a table mapping each top query/access pattern to the table(s) and the index that serves it (so every hot read has a supporting index and no table has unused indexes). This map is the proof the schema serves the real queries, not just an abstract model.
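+
+   A fragment of such a map might be written next to the DDL it justifies, as sketched below. The two access patterns, the index name, and the literal values are hypothetical; only the `booking` table and `no_double_booking` come from step 5.
+   ```sql
+   -- Q1: "Is room 1 free during a given range?"
+   --     Intended to be served by the GiST index that backs no_double_booking.
+   --     The query repeats the constraint's predicate so that partial index qualifies.
+   SELECT 1
+   FROM booking
+   WHERE room_id = 1
+     AND during && tstzrange('2025-03-01 14:00+00', '2025-03-03 10:00+00')
+     AND status <> 'cancelled';
+
+   -- Q2: "List a guest's bookings, newest first."
+   --     No existing index matches, so the map calls for one:
+   CREATE INDEX booking_guest_created_idx ON booking (guest_id, created_at DESC);
+
+   -- Confirm the plan on realistic data volumes rather than assuming it.
+   EXPLAIN
+   SELECT id, room_id, during, status
+   FROM booking
+   WHERE guest_id = 1
+   ORDER BY created_at DESC
+   LIMIT 20;
+   ```
+   On a near-empty table the planner may still choose a sequential scan, so check plans against production-like row counts.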
+## Common Errors
+- **EAV ("flexible schema") tables.** `entity/attribute/value` rows throw away typing, FKs, and constraints and turn every read into a self-join pivot. Use real typed columns; if attributes are genuinely open-ended, a single typed `jsonb` column beats EAV.
+- **Float money.** `price float` loses cents to binary rounding — `0.1 + 0.2 ≠ 0.3`. Use `NUMERIC` or integer minor units; defer the math rules to money-decimal-arithmetic.
+- **Nullable-FK soup / polymorphic `(type, id)`.** A `parent_type text, parent_id bigint` pair can't have a foreign key, so the DB can't stop dangling references. Use separate real FK columns + a `CHECK` that exactly one is non-null.
+- **Natural key as PK that later changes.** Making `email` or a username the PK means a single edit cascades through every referencing FK. Keep a surrogate PK; put `UNIQUE` on the natural key.
+- **Random UUID (v4) PK on a hot insert path.** Random keys scatter B-tree inserts, bloating the index and WAL. Use UUID**v7**/ULID (time-ordered) when you need a UUID, or a `bigint` when the ID isn't exposed.
+- **Soft delete without filtered constraints.** `deleted_at` plus a plain `UNIQUE(email)` blocks a user from re-registering a freed email, and plain FKs still "see" deleted parents. Make uniqueness and lookups partial: `WHERE deleted_at IS NULL`.
+- **Over-normalizing tiny fixed sets.** A 3-value lookup table joined on every query adds a join for no benefit. A `CHECK (x IN (...))` enum is fine for small, code-coupled, rarely-changing sets.
+- **Storing lists in a column.** `tags TEXT` as CSV (or an unindexed array) can't be FK'd, constrained, or joined cleanly. Model it as a child/junction table.
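+- **`varchar(255)` cargo-culting and naive `timestamp`.** Arbitrary length caps reject valid data (or, on explicit casts and in some engines, silently truncate it); `timestamp` without time zone ignores any offset in the input, so the stored value does not identify an instant. Use `text` and `timestamptz`.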
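+- **Missing `ON DELETE` action.** An FK with no `ON DELETE` clause silently gets `NO ACTION`, which may or may not fit the relationship; unexpectedly blocked deletes and hastily added cascades both trace back to never deciding. Choose `CASCADE`/`RESTRICT`/`SET NULL` per FK on purpose.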
+- **Denormalizing speculatively.** Duplicating data "for speed" before any query proves slow doubles your write paths and invites drift. Normalize first, index, measure, then denormalize the proven hot path.
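+
+The replacement for a polymorphic `(type, id)` pair described above can be sketched as follows. The `attachment`, `invoice`, and `ticket` tables are hypothetical; `num_nonnulls` is a built-in function in PostgreSQL 9.6 and later.
+```sql
+-- Each possible parent gets its own real FK column, so the database can
+-- reject dangling references; the CHECK enforces "exactly one parent".
+CREATE TABLE attachment (
+  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+  invoice_id bigint REFERENCES invoice(id) ON DELETE CASCADE,
+  ticket_id  bigint REFERENCES ticket(id)  ON DELETE CASCADE,
+  file_name  text NOT NULL,
+  created_at timestamptz NOT NULL DEFAULT now(),
+  CONSTRAINT exactly_one_parent
+    CHECK (num_nonnulls(invoice_id, ticket_id) = 1)
+);
+```
+Adding a third parent type means adding a column and extending the `CHECK`, which is more ceremony than a free-form `parent_type` string but keeps every reference verifiable.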
+## Verify
+1. **3NF check:** For each table, every non-key column depends on the key, the whole key, and nothing but the key. Name any transitive (`non-key → non-key`) or partial dependency you allowed and justify it as a deliberate denormalization — otherwise split it.
+2. **Anomaly probe:** Pick one update, one insert, and one delete per core entity. Confirm each touches exactly one row in one place with no way to leave the data inconsistent (no second copy to forget).
+3. **Constraint coverage:** Every invariant you stated in step 1 maps to an actual `NOT NULL`/`UNIQUE`/`CHECK`/`FK`/exclusion/partial-index in the DDL — not to an app-layer comment. List any invariant *not* enforced by the DB and why.
+4. **Referential integrity:** Every FK names an explicit `ON DELETE` action; no polymorphic `(type, id)` pair lacks a real FK; every junction table has a composite PK of its two FKs.
+5. **Key sanity:** Every table has a PK; no natural key that can change is used as a PK; sequential surrogates aren't exposed where the sequence is sensitive; UUID columns are v7/ULID unless v4 is justified.
+6. **Type sanity:** No money in `float`; timestamps are `timestamptz` (UTC); no CSV/array masquerading as a relationship; enums vs lookup chosen per the step-6 rule.
+7. **Access-pattern map:** Every listed top query is served by an existing index/PK; every index supports at least one stated query (no orphan indexes); each denormalized column has a named owner-of-consistency and a stated staleness bound.
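+
+Checks 4 and 5 can be partly automated against the system catalogs. The queries below are a sketch for PostgreSQL; adjust the schema filter to your setup.
+```sql
+-- Check 5: tables (including partitioned tables) that have no primary key.
+SELECT n.nspname AS schema_name, c.relname AS table_name
+FROM pg_class c
+JOIN pg_namespace n ON n.oid = c.relnamespace
+WHERE c.relkind IN ('r', 'p')
+  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
+  AND NOT EXISTS (
+    SELECT 1
+    FROM pg_constraint k
+    WHERE k.conrelid = c.oid AND k.contype = 'p'
+  );
+
+-- Check 4: every foreign key with its effective ON DELETE action.
+SELECT conrelid::regclass  AS child_table,
+       conname             AS fk_name,
+       confrelid::regclass AS parent_table,
+       CASE confdeltype
+         WHEN 'a' THEN 'NO ACTION'
+         WHEN 'r' THEN 'RESTRICT'
+         WHEN 'c' THEN 'CASCADE'
+         WHEN 'n' THEN 'SET NULL'
+         WHEN 'd' THEN 'SET DEFAULT'
+       END AS on_delete
+FROM pg_constraint
+WHERE contype = 'f'
+ORDER BY conrelid::regclass::text, conname;
+```
+The catalog records only the effective action, so a `NO ACTION` row may be either a deliberate choice or an omitted clause; review those rows against the DDL source.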
+Done = the schema is at 3NF (BCNF where a determinant anomaly existed) with every stated invariant enforced by a DB constraint, every PK/FK and `ON DELETE` chosen deliberately, no float money / naive timestamps / EAV / polymorphic-FK soup, and an access-pattern→table/index map in which every hot read has a supporting index and every denormalization carries a written consistency tradeoff.