npm - agentscamp - Versions diffs - 0.3.0 → 0.5.0 - Mend

agentscamp 0.3.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33) hide show

package/README.md +3 -3
package/content/commands/add-caching.md +79 -0
package/content/commands/audit-accessibility.md +101 -0
package/content/commands/clean-branches.md +113 -0
package/content/commands/review-tests.md +98 -0
package/content/commands/scaffold-github-action.md +94 -0
package/content/commands/setup-precommit-hooks.md +72 -0
package/content/commands/write-design-doc.md +78 -0
package/content/manifest.json +425 -3
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/connection-pool-tuner.md +46 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/dependency-upgrade-planner.md +42 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/memory-leak-hunter.md +35 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/pagination-designer.md +51 -0
package/content/skills/property-test-designer.md +63 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/security-headers-hardener.md +79 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/slo-definer.md +38 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/structured-logging-designer.md +42 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/package.json +1 -1

package/content/skills/alerting-rules-tuner.md ADDED Viewed

@@ -0,0 +1,49 @@
+---
+name: "alerting-rules-tuner"
+description: "Cut alert noise and make every page mean something — rewrite alerting rules to fire on user-felt symptoms (error rate, latency SLO burn, failed requests) instead of causes (high CPU, full disk), with duration windows and severity routing so only urgent, actionable conditions reach a human. Use when on-call is fatigued by low-value pages, when real incidents get missed in the noise, or when alerts fire on causes rather than impact."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+On-call exhaustion is rarely an "alert quantity" problem you fix by muting things — it's an *altitude* problem. Pages fire on causes (a node at 95% CPU, a disk at 80%, a saturated thread pool) that may or may not hurt anyone, instead of on symptoms the user actually feels. This skill audits every rule against one question — *does this fire only when a human must act now?* — then rewrites the survivors to alert on symptoms with duration windows and severity routing, and demotes the rest to dashboards or tickets.
+## When to use this skill
+- On-call is fatigued: frequent pages that resolve themselves or need no action, night pages for non-urgent conditions.
+- Real incidents get missed because they're buried under low-value noise, or everyone has muted the channel.
+- Alerts fire on causes (CPU, memory, disk, queue depth, pod restarts) rather than user impact.
+- One incident generates a storm of 50 correlated pages instead of one.
+- You have alerts with no owner and no runbook — nobody knows what to do when they fire.
+- Standing up alerting for a new service and want to start symptom-first instead of bolting on host metrics.
+## Instructions
+1. **Inventory the rules and classify each as symptom or cause.** Grep the alerting config (`*.yml`/`*.yaml` Prometheus rules, Datadog monitor exports, Grafana alert JSON, Alertmanager routes) for every rule that pages a human. For each, label it: **symptom** (something the user experiences — request errors, latency, failed checkouts, SLO burn) or **cause** (a resource or internal metric — CPU, memory, disk, GC pause, replica lag, restart count). Causes belong on dashboards, not pagers.
+2. **Audit every paging rule with the single question.** For each rule ask: *does this fire only when a human must act, right now, with a clear action?* If the honest answer is "no" — it self-heals, it's informational, there's nothing to do at 3am — it is not a page. Downgrade it to a ticket or a dashboard panel. Keep paging only what's both urgent and actionable.
+3. **Define the symptom alert set at the user boundary.** Replace cause-pages with the symptoms they were trying to predict: request error rate (5xx / total), latency at a percentile that matters (p99 over SLO), failed business transactions (checkout/login failures), and SLO error-budget burn rate. Measure these where the user is — at the load balancer / ingress / API edge — not deep inside one component.
+4. **Add a duration window to every threshold.** No paging alert fires on an instantaneous value. Require the condition to hold `for: 5m` (tune per alert) so a single scrape blip or a 10-second spike clears itself. For graceful detection of both sudden outages and slow leaks, prefer multi-window, multi-burn-rate alerts (e.g. fast: 14.4x burn over 5m + 1h; slow: 6x over 30m + 6h) over a single fixed threshold.
+5. **Alert on rate-of-change / burn, not raw levels, where the level is naturally noisy.** "Disk is 80% full" pages constantly and means nothing; "disk will fill within 4 hours at the current fill rate" is actionable and rarely false. Same for error budgets: page on burn rate, not on a single bad minute.
+6. **Assign exactly one severity per rule and route accordingly.** Use three tiers and wire each to a destination: **page** (human-impacting, urgent, actionable → PagerDuty/Opsgenie, wakes someone), **ticket** (needs attention this week, not now → issue tracker), **info** (awareness only → Slack/dashboard, never pages). The default for anything you're unsure about is *not* page.
+7. **Deduplicate and group correlated alerts into one notification.** One incident must produce one page, not fifty. Group by incident dimension (service, cluster, region) in Alertmanager `group_by` / Datadog grouping, set `group_wait`/`group_interval` so the storm coalesces, and add inhibition rules so a parent symptom (whole service down) suppresses the child causes (every dependent check failing).
+8. **Attach an owner and a runbook link to every surviving alert.** Each paging rule gets an owning team (label/tag) and a `runbook_url` annotation pointing at concrete steps — first checks, dashboards, mitigation, escalation. If you can't write a runbook because there's no clear response, that's the signal the alert shouldn't page.
+> [!WARNING]
+> Paging on causes — CPU, memory, disk, queue depth — instead of user-felt symptoms is the single largest source of alert fatigue. A box can run hot all day while users are perfectly happy; a box can look idle while requests fail. Page on the symptom; keep the cause on a dashboard for when you're already investigating.
+> [!WARNING]
+> An alert with no runbook and no action is noise by definition. If the response to a page is "ack it and watch," it should not have woken anyone. Thresholds without a duration window flap on every transient spike — never ship a paging rule without a `for:` window.
+## Output
+A revised alerting plan, ready to apply to the config:
+- **Symptom alert set** — a table of paging alerts: name, signal (the user-facing metric), threshold + duration window (or burn-rate windows), and severity. Every row is urgent and actionable.
+- **Demoted rules** — the cause-metrics removed from paging, each annotated with where it went (dashboard panel name, or ticket-severity monitor) and why it isn't a page.
+- **Routing + dedup map** — severity → destination table, the `group_by` keys, and inhibition rules (parent symptom suppresses child causes).
+- **Ownership/runbook mapping** — for each surviving alert: owning team + `runbook_url`, flagging any alert that lacks a runbook as a candidate for deletion.

package/content/skills/canary-release-planner.md ADDED Viewed

@@ -0,0 +1,35 @@
+---
+name: "canary-release-planner"
+description: "Design a canary / progressive rollout so a bad release reaches 1% of users instead of 100% — staged traffic with bake times, gating metrics compared against the concurrently-running stable baseline, and automated promote-or-rollback. Use when shipping a risky change, when you want automatic rollback on regression, or when moving off all-at-once deploys."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+An all-at-once deploy is a single bet: CI is green, so you flip 100% of users onto new code and hope. A canary changes the bet — it routes a small, growing slice of real traffic to the new version, watches it against the version still serving everyone else, and either promotes it or rolls it back automatically. This skill produces that plan: the stages and bake times, the metrics that gate each promotion, the rollback trigger, and the data/session prerequisites that decide whether a canary is even safe for this change.
+## When to use this skill
+- You're shipping a change risky enough that a bad version reaching every user at once is unacceptable (auth, payments, a hot path, a dependency bump).
+- You want regressions to trigger an automatic rollback instead of waiting for an on-call human to notice and react.
+- You're moving a service off all-at-once / blue-green flips onto progressive delivery and need a concrete stage-and-gate plan.
+- A previous "it passed CI" deploy caused a production incident, and you want the blast radius capped before the next one.
+## Instructions
+1. **Define the rollout stages and a bake time at each.** Lay out an increasing traffic schedule — e.g. `1% → 10% → 50% → 100%` — and assign each stage a **bake time** long enough for the relevant signals to surface (cover at least one full traffic cycle for the failure mode you fear: cache fills, cron jobs, retries, a login spike). The first stage should be small enough that its failure is a non-event; the bake time, not the percentage, is what lets a slow leak (memory, connection exhaustion, a rare code path) show itself before the next promotion. Don't jump straight to 50%.
+2. **Pick the metrics that gate promotion.** Choose a small set that reflects user pain: **error rate** (5xx / failed requests), **latency percentiles** (p95/p99, never the mean — the mean hides the tail that churns users), and one or two **business/health signals** that catch silent failures the error rate won't (checkout completions, sign-ups, queue depth, a 200-with-empty-body). A deploy can be 200-OK and still be broken; the business metric is what catches that.
+3. **Set thresholds as canary-vs-baseline, not absolute.** For each gating metric, define a pass/fail rule comparing the **canary** to the **concurrently-running stable version** — e.g. "canary error rate ≤ stable + 0.5pp" and "canary p99 ≤ 1.2× stable p99." Both versions take a slice of the *same live traffic at the same time*, so time-of-day, weekday, and load differences cancel out and the only variable left is the new code.
+4. **Automate the promote-or-rollback decision.** At the end of each bake time: if every gating metric is within threshold, promote to the next stage; if any breaches, **auto-rollback** — shift 100% of traffic back to stable immediately. Make rollback fast and safe: it must be a traffic-weight change (drain the canary, don't kill in-flight requests), require no new build, and not depend on the canary being healthy enough to cooperate. A rollback that needs a redeploy is too slow to matter during an incident.
+5. **Guarantee schema compatibility across both versions.** During the rollout the old and new code hit the **same database simultaneously**. Every schema change must be backward-compatible in both directions for the duration of the canary — use **expand-contract / parallel-change** migrations: add the new column/table (expand) and deploy code that writes both, run the canary, then remove the old shape (contract) only after the new version owns 100%. Pair with `strangler-fig-migrator` for larger cutovers.
+6. **Pin session affinity so a user doesn't flip versions mid-flow.** Route by a stable key (user ID, session cookie) so a given user stays on canary *or* stable for the whole session. Without it, a user can bounce between versions between requests — half-applied multi-step flows, cache/state mismatches, and metrics that can't be attributed to either version. Affinity also makes the canary-vs-stable comparison clean.
+7. **Choose the routing dimension deliberately.** Decide whether the canary is a **percentage of traffic** (simplest, representative) or a **user segment** (internal staff → beta cohort → region → everyone) when you want known, tolerant users to absorb the first hit. Segment routing trades statistical representativeness for a friendlier blast radius — state which you chose and why.
+> [!WARNING]
+> Comparing the canary to a *historical* baseline (yesterday, last week, a stored average) instead of the stable version running right now produces false verdicts. Traffic and latency swing with time of day and day of week, so a healthy canary at peak can look "regressed" against an off-peak baseline — and a genuinely bad canary can hide inside normal variance. Always gate against the concurrently-running stable version.
+> [!WARNING]
+> A canary is unsafe when the release contains a non-backward-compatible schema change. Both versions query the same database during the rollout, so a breaking migration breaks one version no matter the traffic split. Decouple it: ship the migration as a backward-compatible expand step first, canary the code, then contract afterward.
+## Output
+A canary rollout plan containing: (1) the **stage schedule** — traffic percentages and the bake time at each, with the reason each bake time is long enough; (2) the **gating metrics** — error rate, latency percentiles, and the business/health signal(s), each with an explicit **canary-vs-baseline** pass/fail threshold; (3) the **auto-rollback trigger** — which breach forces a rollback and the (fast, build-free) mechanism that executes it; and (4) the **prerequisites** — the expand-contract schema plan confirming both versions are DB-compatible, and the session-affinity key. Reproducible: the same plan re-runs for the next release by swapping in its metrics and thresholds.

package/content/skills/cold-start-optimizer.md ADDED Viewed

@@ -0,0 +1,83 @@
+---
+name: "cold-start-optimizer"
+description: "Cut cold-start latency for serverless functions and slow-booting apps by measuring the init breakdown, then attacking the dominant phase — artifact size, eager imports, eager connections, or under-provisioned memory — instead of reflexively buying provisioned concurrency. Use when serverless p99 spikes on the first request, when a function times out during init, or when scale-to-zero is hurting user-facing latency."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A cold start is not one number — it is runtime boot, dependency/module load, framework init, and first-connection setup stacked on top of each other, and you are usually optimizing a guess about which one dominates. This skill makes it measured: split the init into phases, find the phase that actually costs you, and attack *that* — shrink the artifact and lazy-load the heavy deps off the first-request path, hoist one-time work to module scope so warm invocations reuse it, right-size memory (more CPU often means a *faster and cheaper* cold start), and reuse connections across invocations instead of opening a fresh one every cold start. Provisioned concurrency / keep-warm is the last resort for genuinely latency-critical paths, not the first reflex — because it bills you to mask a slow init rather than fixing it.
+## When to use this skill
+- Serverless p99 (or p999) spikes on the first request after a quiet period, while warm requests are fast.
+- A function intermittently times out *during init* — before your handler code even runs.
+- Scale-to-zero or aggressive autoscaling is hurting user-facing latency on a path that can't tolerate a 2–5s tail.
+- You've been told to "just turn on provisioned concurrency" and want to know whether the init is fixable first (and cheaper).
+- A deploy bloated the artifact (new dependency, bundling change) and cold starts regressed.
+## Instructions
+1. **Measure the cold start and split it into phases — don't optimize a guess.** Force a cold start (deploy a new version, or wait out the platform's idle timeout) and capture the init timeline, not just the total. Most platforms expose it: AWS Lambda `INIT_START`/`REPORT` log lines (`Init Duration` is the pre-handler cost) plus X-Ray init subsegments; GCP/Cloud Run startup probe + request logs; Vercel function logs. Instrument the four phases yourself with timestamps at module load:
+   - **runtime boot** — the platform spinning up the sandbox/container and language runtime (you can't change this much, but you must know its share).
+   - **dependency/module load** — `require`/`import` of your code and its tree, top-to-bottom.
+   - **framework init** — ORM bootstrap, DI container, route table build, config parse, schema/codegen load.
+   - **first-connection setup** — DB handshakes, TLS, secret-manager fetches, warm-up calls.
+   Attribute a millisecond cost to each. You optimize the dominant phase; everything else is noise until that one shrinks.
+2. **Shrink the deployment artifact and lazy-load heavy deps off the first-request path.** A giant bundle inflates both runtime boot (more to unpack) and module load (more to parse). Tree-shake and bundle (esbuild/`@vercel/nft`/webpack) so you ship the function's actual closure, not the whole `node_modules`; exclude the AWS SDK / platform SDK that the runtime already provides; strip source maps and dev deps from the package. Then find the imports that aren't needed for the *first* request — a PDF renderer, an image library, an analytics client, a markdown engine — and move them behind a lazy `await import()` / deferred `require` inside the code path that needs them, so they never touch init. Grep the entry module for top-level imports of known-heavy packages and ask of each: does request #1 use this?
+3. **Hoist one-time work to module scope so warm invocations reuse it — but don't connect eagerly.** Config parsing, client *construction*, schema compilation, and validator building should run once at module load and be captured in module-scope variables, so the platform's instance reuse amortizes them across every warm invocation on that instance. The sharp distinction: **construct** clients at module scope, but **connect** lazily. Build the DB pool / HTTP client object at module load (cheap, no I/O); open the actual connection on first use inside the handler, and reuse it across subsequent invocations on the same warm instance. Eager top-level `await pool.connect()` adds connection latency to *every* cold start and turns a traffic burst into a connection storm.
+4. **Reuse connections across invocations via instance reuse — never open a fresh connection per cold start.** Store the connection/pool in a module-scope (or `globalThis`) variable so a warm instance hands it back instead of reconnecting. Size the per-instance pool to **1–2 connections**, not 20: each concurrent serverless instance gets its own pool, so a large per-instance pool times the instance count will blow past the database's `max_connections` under burst. For Postgres at high concurrency, point functions at a transaction-mode pooler (PgBouncer/RDS Proxy/Supabase pooler) rather than the database directly. Set a connection idle timeout shorter than the platform's instance-freeze window so dead connections don't accumulate.
+5. **Right-size memory — on many platforms it buys CPU, so more memory = faster AND cheaper cold start.** On Lambda (and similar) CPU and network scale linearly with the memory setting, and a cold start is CPU-bound (parsing, JIT, framework init). Bumping 128MB → 512MB–1GB can cut the cold start by enough that the *higher per-ms price × shorter duration* is lower total cost — the classic counter-intuitive win. Sweep a few memory settings against the same forced-cold-start workload and pick the point on the cost-vs-latency curve, don't assume the smallest tier is cheapest.
+6. **Use provisioned concurrency / keep-warm only for genuinely latency-critical paths — after init is already fast.** If a path truly can't tolerate any cold tail (checkout, auth, a synchronous user-facing API), provision N warm instances to cover baseline concurrency. But apply it last, sized to real concurrency (not a round number), and only once steps 1–5 have made the init itself fast — because provisioning a 4-second init just means you pay 24/7 to keep a slow thing warm, and any burst beyond your provisioned count still pays the full cold start.
+> [!WARNING]
+> Opening a fresh DB connection on every cold start — instead of reusing one across warm invocations — is the classic serverless outage. Under a traffic spike, every new instance opens its own connections simultaneously, the database hits `max_connections`, and *every* request (warm ones included) starts failing. Construct the client at module scope, connect lazily, reuse across invocations, and cap the per-instance pool low. Use a transaction-mode pooler when instance count can exceed the DB's connection limit.
+> [!CAUTION]
+> Keep-warm and provisioned concurrency **mask** a slow init; they don't fix it — and they bill you continuously for the masking. If you reach for them before measuring, you'll pay 24/7 to hide a 3s init that two hours of lazy-loading would have cut to 400ms, and you'll *still* eat the full cold start on every burst beyond your provisioned count. Fix the init first; provision only the residual.
+## Output
+1. **Cold-start breakdown by phase** — the measured init timeline showing where the milliseconds actually go, so the dominant cost is obvious before any change:
+```text
+Cold start breakdown — POST /api/checkout (Lambda, 256MB, node20)
+Total cold init: 2,840 ms   (warm: 38 ms)
+  runtime boot ................   210 ms   7%   (platform; fixed)
+  dependency/module load ......  1,520 ms  54%  <- DOMINANT
+      stripe sdk (eager) .........  340 ms
+      @prisma/client (eager) .....  610 ms
+      pdfkit (eager, unused @ req#1) 470 ms
+  framework init ..............    180 ms   6%   prisma engine bootstrap
+  first-connection setup ......    930 ms  33%  top-level await pool.connect()
+```
+2. **Targeted fixes** — ordered by the phase that dominates, each with the specific change and why it lands:
+```text
+1. Lazy-load pdfkit behind await import() in the receipt path .. -470 ms  [HIGH]
+   Not used by request #1; only the async receipt job needs it.
+2. Move pool.connect() out of top-level await; connect on first
+   handler use, reuse across invocations; pool max 2 ................ -930 ms cold,
+   + eliminates connection-storm risk under burst .................. [HIGH]
+3. Bump memory 256MB -> 1024MB (CPU scales) ................... -640 ms  [HIGH]
+   Faster parse + prisma init; est. total cost -18% (shorter ms).
+4. Bundle with esbuild, exclude aws-sdk (runtime-provided),
+   strip source maps ................................................ -210 ms  [MED]
+5. Provisioned concurrency = 3 on /checkout ONLY, after the above ... covers
+   baseline concurrency; residual bursts now cost ~600ms not 2,840.  [LAST]
+```
+3. **Measured before/after** — the re-measured cold start after applying the fixes, proving the dominant phase actually shrank (and noting cost impact, since memory and provisioning change the bill):
+```text
+Cold init: 2,840 ms -> 620 ms  (-78%)   p99 first-request: 3.1s -> 0.7s
+Monthly cost: roughly flat (higher memory offset by shorter duration;
+provisioned-concurrency on /checkout adds ~$X for 3 warm instances).
+Re-measure after a real burst, not a single forced cold start.
+```

package/content/skills/connection-pool-tuner.md ADDED Viewed

@@ -0,0 +1,46 @@
+---
+name: "connection-pool-tuner"
+description: "Size and tune a database connection pool from the real constraint — the database's shared max_connections and its core count — so total connections (per-instance pool × instance count) stay safely under the cap and a too-large pool stops adding latency. Use when the app throws 'too many connections' or pool-acquire timeouts, when the DB is saturated by connection count, or when deploying to serverless."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Connection pools fail in two opposite ways, and a "nice round number" like 100 walks into both. Too large and every app instance's pool sums past the database's shared `max_connections`, so the next deploy or traffic spike exhausts the server and *every* instance starts throwing. Naively large and the pool is bigger than the DB has cores to serve it, so the extra connections don't add parallelism — they queue inside the database and add latency. This skill sizes the **per-instance** pool from concurrency need and core count, does the `instances × pool ≤ max_connections` arithmetic with real headroom, sets the timeouts that recycle dead connections, and sends serverless through a pooler instead of multiplying pools.
+## When to use this skill
+- The app logs `FATAL: too many connections` / `remaining connection slots are reserved`, or pool-acquire timeouts ("timed out fetching a connection from the pool").
+- The database is saturated by connection *count* (high `pg_stat_activity` rows, memory pressure from per-connection backends) rather than by slow queries.
+- You scaled out app instances or autoscaling kicked in, and the DB started erroring even though per-instance load looks fine.
+- You're deploying to serverless / many short-lived instances (Lambda, Vercel functions, Cloud Run) and need a connection strategy.
+- Standing up a new service and picking a pool size before it hits production.
+## Instructions
+1. **Find the real ceiling first.** Read the database's `max_connections` (Postgres `SHOW max_connections`, MySQL `max_connections`) — this is shared across *everything*: every app instance, background workers, migrations, replicas, admin/`psql` sessions, and the monitoring agent. Postgres also reserves `superuser_reserved_connections`. Treat the usable budget as roughly `max_connections − reserved − headroom`, not the raw number.
+2. **Count every connection source, not just the web app.** Total connections = (per-instance pool × app instance count) + worker/cron pools + replicas + migration tooling + a margin for admin sessions and a deploy overlap (old and new instances live simultaneously during rolling deploys — pools effectively double for that window). Enumerate each source by grepping for pool config (`max`, `pool_size`, `maximumPoolSize`, `DATABASE_URL`, `?connection_limit=`).
+3. **Size the per-instance pool from concurrency, capped by cores — not by a big round number.** A connection only does work when the DB has a free core to run its query. The starting heuristic for a CPU-bound OLTP workload is near the DB's core count *for the whole fleet*, so per-instance pool ≈ `(useful_DB_concurrency) / instance_count`, often a small single-digit number. Going higher doesn't buy parallelism — it buys a queue. For I/O-bound queries (lots of waiting) you can go somewhat above core count, but measure rather than assume.
+4. **Do the exhaustion arithmetic explicitly and leave headroom.** Compute `instances × pool + other_sources` and confirm it stays under the usable budget *at max autoscale*, not at average instance count. Size against the ceiling the autoscaler can reach, then keep ~20–30% of `max_connections` free for migrations, admin, replication, and deploy overlap. If the math doesn't fit, shrink the pool before raising `max_connections` (each Postgres backend costs real memory).
+5. **Set the four timeouts deliberately — defaults leak or stall.**
+   - **Acquire / pool timeout** — how long a request waits for a free connection before failing fast (e.g. a few seconds). Without it, a saturated pool turns into unbounded queueing and looks like a hang.
+   - **Idle timeout** — return idle connections so the pool shrinks under low load and you're not holding slots the DB could give elsewhere.
+   - **Max lifetime** — recycle each connection after a bounded age (e.g. 30 min) so a load balancer / DNS failover / DB restart doesn't leave stale half-dead connections in the pool.
+   - **Min / idle floor** — keep a small warm minimum to avoid connect latency on the first request, but not so high that idle instances hoard the budget.
+6. **Handle serverless and many-instances specially — route through a pooler.** When instance count is large or unbounded (one pool per function invocation), per-instance pools multiply faster than any safe per-instance number can absorb. Don't fix it by shrinking the per-function pool to 1 alone — put a pooler between the app and the DB: PgBouncer in **transaction** mode, RDS Proxy, Supabase's pooler, or a provider serverless/HTTP driver. The pooler multiplexes hundreds of client connections onto a small set of real DB connections; keep the per-function pool at 1–2 behind it.
+> [!WARNING]
+> Scaling out app instances silently multiplies total connections. A pool of 20 that's fine on 3 instances (60) exhausts a 100-connection DB the moment the autoscaler reaches 5 instances — and it fails *everywhere at once*, not gracefully. Always size against **max autoscale × pool**, plus the deploy-overlap doubling, never average instance count.
+> [!WARNING]
+> A bigger pool is frequently *slower*, not faster. Past the DB's effective core count, added connections don't run in parallel — they queue inside the database and add context-switching overhead, raising p99 latency while throughput stays flat. If the pool is large and the DB is CPU-bound, the fix for latency is usually to *shrink* the pool.
+> [!NOTE]
+> Transaction-mode poolers (PgBouncer) break features that hold state across statements on one connection: session-level `SET`, advisory locks, `LISTEN/NOTIFY`, and some prepared-statement modes. Use session mode (or a dedicated direct connection) for those paths, and run migrations against the DB directly, not through the transaction pooler.
+## Output
+A pool-sizing recommendation, concretely:
+- **The math** — usable budget (`max_connections − reserved − headroom`), and `instances_at_max_autoscale × per_instance_pool + other_sources` shown to land under it with the headroom stated.
+- **Recommended per-instance pool size** with the rationale (concurrency need vs. DB core count, and which workload type it is), plus separate sizes for worker/cron pools.
+- **Timeout/lifetime settings** — acquire timeout, idle timeout, max lifetime, and min/idle floor, with the value and why each is set.
+- **Serverless recommendation if applicable** — the specific pooler (PgBouncer transaction mode / RDS Proxy / serverless driver), the per-function pool size behind it, and any session-mode caveats for stateful paths.

package/content/skills/contract-test-designer.md ADDED Viewed

@@ -0,0 +1,70 @@
+---
+name: "contract-test-designer"
+description: "Design consumer-driven contract tests between services so an API provider can't break its consumers unnoticed — without slow, flaky full end-to-end environments. Use when independent services or teams integrate over an API, when integration bugs only surface in staging or prod, or when E2E suites are too slow and brittle to catch breaking API changes."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Cross-service E2E suites are slow, flaky, and tell you a provider broke a consumer only after both are deployed to a shared environment. This skill designs consumer-driven contract tests instead: the *consumer* declares the exact requests it sends and the precise response fields and types it actually reads, and the *provider* replays those expectations against its real handler in its own CI. A provider change that violates any consumer's contract fails the provider's build — before merge, before deploy, with no other service running. The deliverable is the consumer-defined contract(s), the provider-side verification wired into CI, and a sharing-plus-versioning approach so the two sides can evolve.
+## When to use this skill
+- Two or more independently deployed services (often owned by different teams) integrate over HTTP/JSON, gRPC, or a message queue, and a provider can ship a change that silently breaks a consumer.
+- Integration regressions only appear in staging or prod because nothing in either repo's CI exercises the actual cross-service shape.
+- The cross-service E2E suite is too slow or flaky to gate merges, so breaking API changes slip through.
+- You're standing up a new client against an existing API and want to lock the dependency to *exactly* the fields you read, not the whole payload.
+## Instructions
+1. **Let the CONSUMER define the contract — and only the part it uses.** Write the contract from the consumer's test suite, not the provider's spec. For each interaction, state the *request* the consumer sends (method, path, query/body, headers that matter) and the *response shape it actually depends on*: the status code, the fields it reads, and their types. If the consumer parses `order.id` (string) and `order.total` (number) and ignores the other 20 fields, the contract asserts those two fields and nothing else. The contract is a description of *this consumer's* needs, never the provider's full API surface.
+2. **Match on type and structure, not frozen example values.** Use matchers, not literals: assert `total` is a number, `status` is one of a set, `items` is a non-empty array of objects with `sku`/`qty` — not `total == 4250`. Frozen example values turn the contract into a snapshot test that breaks on every data change. Reserve exact-value matching for fields whose literal value is part of the contract (an enum the consumer branches on, a fixed `Content-Type`).
+3. **Pick a tool/pattern and generate the artifact.** Match what the stack already uses before adding a dep. **Pact** (pact-js / pact-jvm / pact-python / pact-go) is the default for HTTP and async messages — the consumer test runs against a mock provider and emits a pact JSON file. **Spring Cloud Contract** suits a JVM-heavy shop. For simpler needs, a **shared JSON Schema / OpenAPI fragment** committed to both repos, validated on each side, is a legitimate lightweight contract. Whatever the tool, the output is a machine-checkable artifact of the consumer's expectations.
+4. **Verify the PROVIDER against the contract in the PROVIDER's own CI.** This is the half teams skip and the half that earns the value. The provider's pipeline fetches every consumer contract and replays each recorded request against the real running provider (no consumer process involved), asserting the live response satisfies the matchers. Wire it as a required check: a provider change that drops `order.total` or renames `status` fails the provider build, so the break is caught at the source before merge. Use `provider states` (Pact) to set up the data each interaction needs (`given "order 42 exists"` → seed that fixture) rather than depending on ambient DB state.
+5. **Share contracts via a broker or committed artifacts, and gate deploys on verification.** For more than a couple of services, run a **Pact Broker** (or PactFlow): consumers publish contracts tagged by branch/version, providers fetch and verify, and `can-i-deploy` blocks a release whose verified contracts don't cover the consumer versions currently in prod. For a small, co-located set, committing the contract artifact into a shared repo or the provider repo and verifying in CI is simpler and adequate — pick the lightest mechanism that still makes verification a required, automated gate, not a manual step.
+6. **Version contracts so provider and consumer can evolve independently.** Tag each contract with the consumer's version and the environment where that consumer version runs. Additive provider changes (new optional field) keep old contracts passing — that's the point of matching only what the consumer reads. For a breaking change, support both shapes until every consumer has published a contract for the new one (verified via the broker), then retire the old. Never edit a published contract in place to make a failing provider build go green — that defeats the gate.
+7. **Keep contracts to interface shape; push behavior into unit tests.** A contract verifies the *integration surface* — fields, types, status codes, error envelopes — not that the provider computes the right total or applies the right discount. That logic belongs in the provider's own unit/integration tests. A contract bloated with business assertions becomes a second, worse copy of the provider's logic suite and breaks on unrelated correct changes.
+> [!WARNING]
+> Contract tests verify the INTERFACE shape, not end-to-end behavior. They replace brittle cross-service E2E for catching *breaking API changes* — but they do not prove the provider's logic is correct or that the wired-up system works. Keep the provider's own logic tests, and a thin smoke E2E for the critical path; contracts shrink the E2E suite, they don't delete it.
+> [!WARNING]
+> A contract that asserts the provider's *entire* response — every field, exact values — instead of only the fields this consumer reads is an anti-pattern: it produces false breakages on unrelated, backward-compatible changes (a new field, a reordered key, a changed value the consumer never reads), and trains teams to ignore red builds. Assert the minimum the consumer actually depends on.
+## Output
+For the integration, the skill produces:
+- **The consumer-defined contract(s)** — for each interaction, the request (method, path, body, key headers) and the response expectations as matchers (status code + only the fields/types this consumer reads), in the chosen tool's format.
+- **The provider-side verification setup** — the CI step that fetches the contract(s) and replays them against the real provider, the provider-state fixtures each interaction needs, and the required-check wiring so a violation fails the provider build.
+- **The sharing + versioning approach** — broker vs. committed artifact, how contracts are tagged by consumer version/environment, and the deploy gate (e.g. `can-i-deploy`) plus the rule for evolving through a breaking change.
+Example — a consumer contract for an order-service client, in pact-js:
+```js
+const { PactV3, MatchersV3: M } = require("@pact-foundation/pact");
+const provider = new PactV3({ consumer: "checkout-web", provider: "order-service" });
+// The consumer reads only id (string), total (number), and status (one of two values).
+// It ignores every other field on the order — so the contract asserts only these.
+provider
+  .given("order 42 exists")                       // provider state: seeded in provider CI
+  .uponReceiving("a request for an order")
+  .withRequest({ method: "GET", path: "/orders/42" })
+  .willRespondWith({
+    status: 200,
+    headers: { "Content-Type": "application/json" },
+    body: {
+      id: M.string("ord_42"),                     // type match, not the literal "ord_42"
+      total: M.number(4250),
+      status: M.regex(/^(open|closed)$/, "open"),  // enum the consumer branches on
+    },
+  });
+await provider.executeTest(async (mock) => {
+  const order = await fetchOrder(`${mock.url}/orders/42`);
+  expect(order.status).toBe("open");
+});
+```
+This test emits a pact file; the **provider's** pipeline then replays `GET /orders/42` against the real `order-service` (with state `order 42 exists` seeded) and fails the provider build if `total` stops being a number or `status` leaves the enum. Hand the request/response shapes to `openapi-doc-writer` to keep the published spec in sync, and use `test-scaffolder` to flesh out the provider-state fixtures.

package/content/skills/dependency-upgrade-planner.md ADDED Viewed

@@ -0,0 +1,42 @@
+---
+name: "dependency-upgrade-planner"
+description: "Plan and de-risk a major dependency, framework, or runtime upgrade — map the full version path, read every intermediate migration guide, and pin the breaking changes to your actual call sites instead of bumping the number and hoping. Use when a key dependency is several majors behind, when a security advisory forces an upgrade, or before a framework migration."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Turn "bump the version and hope" into a sequenced, evidence-backed upgrade plan. The skill establishes the exact current → target version gap, reads the CHANGELOG and migration guide for **every** major in between, then greps the codebase for the dependency's imported symbols so the breaking-change list is narrowed to the call sites that actually exist here. It checks the target's peer-dependency and runtime requirements, orders the work (codemods first, one major at a time for big jumps, behind tests), and writes down a rollback before anything is touched.
+## When to use this skill
+- A key dependency, framework, or runtime is several majors behind and you need a path forward, not a single `npm install pkg@latest`.
+- A security advisory (CVE, `npm audit`, Dependabot) forces an upgrade and you need to know the blast radius before merging.
+- You are scoping a framework or runtime migration (React, Next.js, Django, Rails, Node, Python) and want to know what breaks before committing the sprint.
+> [!WARNING]
+> Jumping several majors in one `install` hides which version broke what. Breaking changes compound: v3's removal of an API plus v4's renamed option plus v5's changed default land as one undebuggable wall of errors. For a gap of two or more majors, upgrade **one major at a time**, landing each behind a green build/test run, so every failure maps to exactly one version's changes.
+## Instructions
+1. **Pin the exact current and target versions.** Read the lockfile (`package-lock.json`/`pnpm-lock.yaml`/`yarn.lock`, `poetry.lock`, `go.sum`, `Cargo.lock`) for the version actually installed — not the loose range in the manifest, which lies about what resolved. Confirm the target: `npm view <pkg> versions --json`, `pip index versions <pkg>`, `go list -m -versions <mod>`, or the registry page. Record the full hop list, e.g. `4.2.1 → 5.x → 6.x → 7.0.3`.
+2. **Read the migration guide for every major in between — don't skip the intermediate notes.** A jump from v4 to v7 means reading the v5, v6, **and** v7 breaking-change sections, not just v7's. Pull the CHANGELOG / UPGRADING / migration doc (`gh release view`, the repo's `CHANGELOG.md`, the docs site) and extract every entry under "Breaking", "Removed", "Renamed", "Default changed", and "Deprecated → removed".
+3. **Inventory your actual usage so you only care about breaks that hit you.** Grep the codebase for the dependency's imported symbols and entry points — `grep -rIn "from 'pkg'" `, `grep -rIn "require('pkg')"`, `import pkg`, the specific class/function/option names called out in the breaking-change list. A breaking change to an API you never call is noise; a one-line default change to a function on 40 call sites is the real work. Map each relevant breaking change to its call sites.
+4. **Check transitive/peer-dep and runtime requirements of the target.** The target may demand a newer peer (`react@>=19`, a `@types/*` bump) or a higher minimum runtime (Node, Python, Go, the language edition). Run `npm info <pkg>@<target> peerDependencies engines` (or read `requires-python` / `go.mod` `go` directive / `rust-version`). Cross-check against your other dependencies' peer ranges and your CI/Dockerfile/`.nvmrc`/`engines` runtime — a conflict here blocks the install before any code change.
+5. **Sequence the work: codemods → one major at a time → behind tests.** Run the official codemod first if one exists (`npx <pkg>-codemod`, `npx @next/codemod`, framework migration CLIs) — they do the mechanical renames so you review semantics, not churn. For multi-major gaps, do one major per commit/PR; for each step, apply the codemod, hand-fix the mapped call sites, then run the **real** build and test commands as a checkpoint before the next hop.
+6. **Write the rollback before touching anything.** Commit the current lockfile, branch the work, and record the revert: restore the pinned versions in the manifest **and** the lockfile (a manifest-only revert re-resolves to something new), then reinstall from the lockfile (`npm ci`, `pnpm install --frozen-lockfile`, `poetry install`). For a forced security upgrade with no safe target yet, note the interim mitigation (override/resolution pin, patch backport) as the fallback.
+> [!WARNING]
+> Peer-dependency conflicts and a bumped minimum runtime are the upgrades that silently break the build — not the API renames you can see in a diff. `npm install` may resolve a peer with a warning (or fail under strict/`pnpm`), and a target that requires Node 22 will install fine locally then explode in CI on Node 20. Verify both **before** writing code, in step 4.
+> [!NOTE]
+> Land the upgrade on its own branch with one commit per major hop and the codemod output as a separate commit from your hand-fixes. If a regression only shows up in CI or staging, granular history makes `git revert` of a single version trivial instead of unpicking a tangled bump.
+## Output
+A concrete upgrade plan, reproducible from the evidence gathered:
+- **Version path** — the exact hop list from the lockfile to the target (`4.2.1 → 5.18.0 → 6.4.2 → 7.0.3`), one line per major.
+- **Breaking changes that affect THIS codebase** — a table of `change → version → call sites`, with the file:line locations grep found; changes that touch no call site are explicitly listed as not-applicable so the reader trusts the filter.
+- **Peer-dep & runtime gate** — required peer ranges and minimum runtime of the target vs. what the repo and CI currently pin, with conflicts flagged as blockers.
+- **Steps in order** — codemod commands first, then per-major manual fixes, each with its test/build checkpoint command.
+- **Rollback plan** — the exact manifest + lockfile revert and reinstall command, plus any interim mitigation for a forced upgrade.

package/content/skills/devcontainer-designer.md ADDED Viewed

@@ -0,0 +1,40 @@
+---
+name: "devcontainer-designer"
+description: "Design a reproducible dev environment (Dev Container / Docker) so onboarding is one command and 'works on my machine' dies — by detecting the project's real stack and versions, authoring a devcontainer.json (+ Dockerfile/compose) that pins the runtime to what the repo targets, wires dependent services, caches dependencies, and injects secrets instead of baking them. Use when new contributors struggle to set up the project, when environment drift causes inconsistent behavior, or when standardizing tooling across a team."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+The phrase "works on my machine" is a confession that the project has no defined machine. Two contributors on Node 18.17 and 20.4, one with a system `libpq` and one without, a Postgres someone installed via Homebrew in 2023 — that spread is exactly the environment drift a dev container exists to kill. But a container only does that if it pins what the repo actually targets and brings the *whole* stack up together; an unpinned `node:latest` reintroduces the drift you containerized to remove, and a `:latest` Postgres can rev a major version under you on the next rebuild. This skill reads the repo to find the real stack, then writes a `devcontainer.json` (with a Dockerfile and/or compose when services are involved) where every version is pinned, services come up as one unit, dependencies are cached so rebuilds are cheap, and secrets are injected at runtime — never baked into the image.
+## When to use this skill
+- New contributors burn their first day on setup, or the onboarding README has more than a handful of "install X, then Y" steps that drift out of date.
+- The same code behaves differently across machines (passes locally, fails in CI, or vice versa) and you suspect runtime/version/system-lib differences rather than a real bug.
+- You're standardizing tooling across a team and want one definition of "the dev environment" that an editor can rebuild on demand.
+- The project needs a DB, cache, queue, or other service running alongside the app and people manage those by hand today.
+## When NOT to use this skill
+- The drift is a missing lockfile, not a missing container — if `package.json`/`pyproject.toml` has unpinned ranges and no committed lock, fix that first; a container around floating deps still drifts.
+- You need a production deployment image. A dev container optimizes for fast inner-loop edit/run with the source mounted; a production image optimizes for a small, immutable artifact with the source baked in. They are different files with different tradeoffs — don't ship this one.
+## Instructions
+1. **Detect the real stack before writing anything.** Glob and read the manifests that declare the runtime and pin it: `.nvmrc` / `.node-version` / `engines` in `package.json`, `.python-version` / `pyproject.toml` `requires-python`, `.ruby-version`, `go.mod` `go` directive, `.tool-versions` (asdf/mise), `rust-toolchain.toml`. Identify the package manager from the lockfile that exists (`package-lock.json` → npm, `pnpm-lock.yaml` → pnpm, `yarn.lock` → yarn, `poetry.lock` → poetry, `uv.lock` → uv) — the container must use the same one, or it builds a different tree. The repo's declared version is the source of truth; never round to "latest stable."
+2. **Find the services the app actually talks to.** Grep config and env templates (`.env.example`, `config/`, `docker-compose*.yml`, `application.yml`) for connection strings and ports — `DATABASE_URL`, `REDIS_URL`, `postgres://`, `amqp://`, ES/OpenSearch hosts. Read the dependency manifest for client libraries (`pg`, `redis`, `psycopg`, `pika`, `kafkajs`) as corroboration. Every external service the app expects at runtime must come up in the dev environment, or the container is half a setup and contributors are back to installing Postgres by hand.
+3. **Pin the base image to the repo's exact runtime version.** Use a digest-stable, version-specific tag — `mcr.microsoft.com/devcontainers/python:3.12` or `node:20.17-bookworm`, never `:latest`, `:lts`, or a bare major like `:20`. Match the minor the repo targets (a `.nvmrc` of `20.17.0` means `node:20.17`, not `node:20`). If you author a Dockerfile, install system libraries the build needs that the base lacks (`libpq-dev` for `psycopg`, `build-essential`, `libvips` for `sharp`, `default-libmysqlclient-dev`) — these are the silent "missing on my machine" failures. Set the pinned image in `devcontainer.json` `image`, or `build.dockerfile` if you need the extra libs.
+4. **Bring the whole stack up with compose when services exist.** When step 2 found a DB/cache/queue, write a `docker-compose.yml` with the app service plus each dependency pinned to a *specific* version (`postgres:16.4`, `redis:7.4`) — a major Postgres bump on rebuild can refuse to read the old data dir. Point `devcontainer.json` at it via `dockerComposeFile` + `service` + `workspaceFolder`, list `runServices` so the DB starts with the workspace, and use a named volume for the DB data dir so a container rebuild doesn't wipe local seed data. Set service `DATABASE_URL` to the compose service hostname (`postgres`, not `localhost`) so the app connects across the compose network.
+5. **Mount the workspace and cache dependencies so rebuilds stay cheap.** A 10-minute container build trains people to never rebuild — and a never-rebuilt container is the drift you were eliminating. Keep the source bind-mounted (default `workspaceFolder`) so edits are instant. Put the package manager's *store* (not `node_modules`/`.venv`) in a named volume mount so deps survive rebuilds: a volume on `~/.npm`, `~/.cache/pnpm`, `~/.cache/pip`, `~/.cargo`. For compiled-language or heavy-system-lib stacks, structure the Dockerfile so dependency-install layers come before the source copy, so a code change doesn't bust the dep cache.
+6. **Preinstall tooling and run a `postCreateCommand` that leaves the env ready.** Add the editor extensions and settings the project assumes under `customizations.vscode.extensions` (linter, formatter, language server, the DB client) — so everyone gets the same lint-on-save, not a personal config. Use a `postCreateCommand` to run the dependency install with the detected package manager (`pnpm install --frozen-lockfile`) plus any project setup (DB migrate + seed, generate types, copy `.env.example` to `.env` if absent). The goal: open the project, and after postCreate it runs — no manual step. Prefer `devcontainer features` (`ghcr.io/devcontainers/features/*`) for common add-ons (docker-in-docker, gh CLI) over hand-rolled `apt-get` lines.
+7. **Inject secrets at runtime — never bake them into the image.** Reference required secrets in `containerEnv`/`remoteEnv` sourced from the host (`${localEnv:OPENAI_API_KEY}`) or via a secret mount, and keep a committed `.env.example` documenting the keys with empty/placeholder values. Anything sensitive stays in the developer's local `.env` (gitignored) or their host env. Do not `ENV SECRET=...`, `COPY .env`, or `ARG` a credential in the Dockerfile, and don't commit a populated `.env` — an image layer is shipped verbatim to everyone who pulls it.
+> [!WARNING]
+> An unpinned base or runtime (`node:latest`, `python:3`, `postgres:16` without a minor) is the single change that reintroduces the exact drift the container is meant to eliminate. The image silently revs out from under the team on the next pull or rebuild, and now "works in the container" depends on *when* you built it. Pin every base image and every service to a specific version, and update those pins as a reviewed, deliberate commit.
+> [!CAUTION]
+> A secret baked into an image — via `ENV`, `ARG`, `COPY .env`, or a committed populated `.env` — leaks to everyone who pulls the image and persists in the layer history even if a later layer deletes it. Injecting credentials into a built image is publishing them. Keep all secrets in the developer's local env/secret store and reference them at runtime; commit only an empty `.env.example`.
+## Output
+A `devcontainer.json` plus the Dockerfile and/or `docker-compose.yml` the project needs, written via Write, with: every base image and service tag pinned to the version the repo targets (and the detected source of that version called out — e.g. `node:20.17 (from .nvmrc)`, `postgres:16.4`); the dependent services wired through compose with named data volumes and correct service-hostname connection strings; a dependency-store cache mount and a layer-ordered Dockerfile so rebuilds are fast; the preinstalled extensions and a `postCreateCommand` that installs and sets up so the env is ready on first open; and a clear note of which secrets are injected from the host env / secret mount versus the committed empty `.env.example` — none baked into the image. The skill reads the repo and writes config files only; it does not build images, start containers, or run install commands.

package/content/skills/distributed-tracing-instrumenter.md ADDED Viewed

@@ -0,0 +1,42 @@
+---
+name: "distributed-tracing-instrumenter"
+description: "Instrument a service (or a chain of services) with OpenTelemetry so a single request can be followed end-to-end — context propagated across every hop including async/queue boundaries, spans at the boundaries that matter, deliberate trace-wide sampling, and trace_id stamped on log lines. Use when latency or failures span multiple services, when you have logs but can't reconstruct a request's full path, or when adopting OpenTelemetry."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+You have logs in five services and a request that's slow, but no way to know it's slow *because* service C waited 800ms on a query that service A triggered three hops back — the lines aren't connected. Distributed tracing connects them: one trace ID threads through every service a request touches, each hop adds a timed span, and you read the whole waterfall in one view. The two things that make or break it are propagation (the context has to survive every hop, and it silently dies across async/queue boundaries) and span discipline (boundaries, not every function). This skill instruments against OpenTelemetry so you're not locked to a backend, fixes propagation at each hop, picks the spans worth having, samples whole traces consistently, and ties traces back to your logs.
+## When to use this skill
+- A request is slow or failing and the cause spans multiple services — you can see each service's logs but can't reconstruct which call, in which order, cost the time.
+- You have decent logs but reconstructing one request's full path means correlating timestamps by hand across services, and async work (queue jobs, background workers) is a black hole.
+- You're adopting OpenTelemetry and want spans at the right boundaries with a defensible attribute set, not a noisy span-per-function trace.
+- Traces already exist but show up broken — a request appears as two disconnected partial traces, or the downstream half is missing entirely (almost always a propagation or sampling bug).
+## Instructions
+1. **Adopt OpenTelemetry as the API/SDK; pick the exporter separately.** Instrument against the vendor-neutral OTel API and the W3C `traceparent`/`tracestate` propagation format so the wire protocol is standard across every service. Choose the backend (Jaeger, Tempo, Datadog, Honeycomb) only at the exporter/Collector layer — that way swapping or adding a backend never touches instrumentation. Prefer running the OTel Collector as a sidecar/agent so the app exports once and the Collector handles batching, sampling, and fan-out.
+2. **Turn on auto-instrumentation first, then map the request's hops.** Enable the language's auto-instrumentation for the HTTP/gRPC server, outbound HTTP/gRPC clients, and DB drivers — it gives you propagation and the obvious boundary spans for free. Then trace one real request end-to-end on paper: list every hop (inbound edge, each outbound call, each DB query, each queue publish/consume) so you know exactly where context must survive and which boundaries still need manual spans.
+3. **Fix context propagation at every hop — extract inbound, inject outbound.** At each service's entry point, *extract* trace context from the incoming `traceparent` header into the current context; on every outbound call, *inject* the current context into the outgoing headers. For HTTP and gRPC, auto-instrumentation usually does both — verify it actually fires (a manually-built client or a raw socket bypasses it). The hop that breaks is the one nobody instruments: confirm the child span's trace ID equals the parent's, not a fresh one.
+4. **Carry context across async and queue boundaries explicitly.** A message queue, background job, event bus, or thread/goroutine handoff drops the in-process context — the consumer starts a brand-new trace unless you bridge it. On publish, inject `traceparent` into the message *headers/attributes* (not the body); on consume, extract it and start the span as a *child* (or a span link, for batch/fan-in) of the producer's span. Without this the trace splits into two disconnected fragments and the async work looks like an orphan.
+5. **Create spans at meaningful boundaries, not per function.** A span is worth creating where work crosses a boundary or has independent cost: the inbound request, each outbound call (HTTP/RPC/DB/cache), and expensive in-process compute (a heavy serialization, a model inference, a batch loop *as one span*, not per iteration). Do not wrap every helper function — a span-per-function trace has hundreds of millisecond-thin spans that bury the one slow hop and multiply export cost. If a span never changes how you'd read the trace, don't create it.
+6. **Attach high-value attributes; never secrets or PII.** Put queryable context on spans as semantic attributes: `http.route` (the *template* `/users/:id`, not the literal path), `http.status_code`, `db.system`/`db.statement` (parameterized, no literal values), `messaging.destination`, and the key domain IDs you'd filter by (`order_id`, `tenant_id`). Set span status to error and record the exception on failure. Never put passwords, tokens, full auth headers, request/response bodies, raw SQL with inlined values, or PII on a span — spans are exported to third-party backends and widely readable.
+7. **Sample the whole trace consistently — decide head vs tail once, at the edge.** The cardinal rule: a trace must be sampled atomically, all-or-nothing, or you get broken partial traces. With head sampling, the *first* service makes the keep/drop decision and propagates it in `tracestate` (the sampled flag); every downstream service honors that bit instead of deciding independently — per-service sampling rates produce traces missing half their spans. For "keep all errors and slow requests" you need *tail* sampling, which must run in the Collector (it sees the full assembled trace before deciding), never per-service. Pick one strategy and apply it trace-wide.
+8. **Correlate traces with logs by stamping trace_id on every log line.** Pull the active `trace_id` (and `span_id`) from context and add them as fields on every log line in that request — so a log search jumps straight to the trace, and a trace span links straight to its logs. This is the payoff that makes traces and the structured logs you already have one navigable surface instead of two.
+> [!WARNING]
+> Context dropped across an async/queue boundary is the #1 tracing bug. The consumer starts a fresh root span, and one request becomes two disconnected traces — the producer side and the worker side — with no way to tell they're the same request. Always inject `traceparent` into message headers on publish and extract it (as a child span or link) on consume. Verify by checking the consumer span shares the producer's trace ID.
+> [!WARNING]
+> Inconsistent per-service sampling yields incomplete traces. If service A keeps 100% and service B keeps 10%, ~90% of A's traces are missing all of B's spans — a waterfall with holes that looks like B never ran. The sampling decision must be made once (head: at the edge, propagated; or tail: in the Collector) and honored by every service, never re-rolled per hop.
+> [!WARNING]
+> A span-per-function explosion makes traces unreadable and expensive. Hundreds of sub-millisecond spans hide the one 800ms hop that matters and multiply your backend's ingest cost and bill. Span boundaries and independently-costed work only; collapse tight loops into a single span with a count attribute rather than one span per iteration.
+## Output
+- **Instrumentation plan** — the request's hops mapped end-to-end, which boundaries get spans (inbound edge, outbound calls, DB queries, named expensive compute) and which are deliberately left out, and the per-span-type attribute set (with the secrets/PII deny-list).
+- **Propagation fix per hop** — for each hop, the extract-inbound / inject-outbound change, called out explicitly for HTTP, gRPC, and each async/queue boundary, with how to verify parent and child share one trace ID.
+- **Sampling strategy** — head vs tail decision, where it runs (edge vs Collector), the rule (e.g. base rate + keep-all-errors + keep-slow), and how the decision is propagated trace-wide.
+- **Trace↔log correlation** — how `trace_id`/`span_id` are pulled from context and stamped on log lines, so logs and traces cross-link in both directions.

package/content/skills/idempotency-designer.md ADDED Viewed

@@ -0,0 +1,47 @@
+---
+name: "idempotency-designer"
+description: "Make unsafe, retryable API operations idempotent so a client retry or a network hiccup can't double-charge, double-create, or double-send — design a client-supplied idempotency key, an atomic store-and-check (unique constraint or conditional write), in-flight conflict handling, and a retention policy. Use when a POST/mutation can be retried (payments, order creation, sends, webhooks), or when duplicate side effects have already shown up in production."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A network timeout doesn't mean the request failed — it means the client doesn't *know*. So the client retries, and now the charge runs twice. Idempotency fixes this by making "do this operation" return the *same result* no matter how many times it's submitted under the same key. The trap: almost everyone implements it as "check if we've seen this key, if not do the work" — two non-atomic steps — which is precisely a race that two concurrent retries win together. This skill designs the key, the *atomic* dedup, the in-flight case, and the cleanup.
+## When to use this skill
+- An endpoint has a side effect that must not happen twice — a payment/charge, order or account creation, an email/SMS/push send, a transfer, a webhook *delivery* you consume.
+- Clients (mobile, SDKs, queue consumers, other services) retry on timeout/5xx, so the same logical operation can arrive more than once.
+- Duplicate rows, double charges, or double-sent notifications have already appeared in production logs and you're retrofitting protection.
+- You're putting a queue or a webhook receiver in front of a mutation — at-least-once delivery guarantees duplicates by design.
+## Instructions
+1. **Have the client generate the key, one per logical operation.** The idempotency key is a client-minted unique id (a UUID v4, or a deterministic hash of the operation's natural identity) created *once* and reused on every retry of that same operation. It travels in a header — `Idempotency-Key: <uuid>` (the Stripe/IETF convention) — not in the body where a serializer might reorder it. A new key per *user click* / per *queued message*, the *same* key across that click's retries. Document who mints it and exactly where it rides.
+2. **Scope the key — never make it globally unique.** Store and match it as a composite: `(account_id, endpoint, idempotency_key)`. Without scoping, one tenant's key can collide with another's (information leak or wrong cached response returned), and the same UUID legitimately reused on two different endpoints would wrongly dedup. Reserve keys for POST-style *creates and actions*; `GET`/`PUT`/`DELETE` should be designed naturally idempotent (a `PUT` to a known id, a `DELETE` that no-ops on an absent row) and need no key.
+3. **Record the key BEFORE doing the work, in a single atomic operation.** This is the whole mechanism. Either:
+   - **Unique constraint** — `INSERT` a row keyed on `(account_id, endpoint, key)` with status `in_progress`; let the database's unique index reject the second insert. The *insert* is the lock; you do not read first.
+   - **Conditional write** — `SET key value NX` (Redis), or a conditional/compare-and-swap put (DynamoDB `attribute_not_exists`). The store decides the winner atomically.
+   The winner proceeds; everyone else hit the constraint/condition and branches to step 5. There is no "check then act" — the check and the claim are the same call.
+4. **Persist the response alongside the key, then replay it on repeat.** When the work finishes, store the *full* response (status code + body, or enough to reconstruct it) against the key and mark it `completed` — ideally in the **same transaction** that performs the side effect, so the key and the effect commit or roll back together. On a repeat of a *completed* key, return the stored response verbatim instead of re-executing. Optionally store a hash of the request payload and 422 if the same key arrives with a *different* body — that's a client bug, not a retry.
+5. **Handle the in-flight case explicitly — it's not "completed" yet.** A retry can arrive while the first request is still running (status `in_progress`). Do **not** run the work again and do **not** block indefinitely. Return **`409 Conflict`** (or `425 Too Early`) with a short `Retry-After`, telling the client "this is being processed, ask again." Give the `in_progress` record a lease/expiry so a crashed first attempt that never reached `completed` can be retried after the lease lapses rather than wedging the key forever.
+6. **Make the downstream effect idempotent too.** Your atomic key protects *your* handler; it does nothing for the third-party call inside it. If the handler calls a payment processor or another service, pass an idempotency key *to that call as well* (most payment APIs accept one) — derive it deterministically from your own key so a retry of your handler produces the same downstream key. Otherwise a crash *after* the external charge but *before* your commit leaves the charge live while your record says nothing happened.
+7. **Set a TTL and a cleanup job.** Keys are only needed for the retry window — minutes to ~24h, matched to how long clients realistically retry. Store an `expires_at` and either use the store's native TTL (Redis `EXPIRE`, DynamoDB TTL) or a periodic delete. Choose retention deliberately: long enough to cover every retry path (including a client that retries the next day), short enough that the table doesn't grow without bound.
+> [!WARNING]
+> Check-then-act is not idempotency. "Read whether the key exists, and if not, do the work" is two operations: two concurrent retries both read "not seen," both proceed, and both run the side effect. The dedup MUST be a single atomic operation — a unique-constraint `INSERT` or a conditional/`NX` write where the store picks the one winner. If your design has a `SELECT` (or `GET`) before the `INSERT`, it is racy under exactly the concurrent-retry load it exists to stop.
+> [!WARNING]
+> An idempotency store with no TTL grows forever. Every unique operation ever submitted leaves a permanent row, and the unique-index lookup that guards your hottest write path slowly degrades. Always attach an `expires_at` plus native-TTL or a sweep job; "we'll clean it up later" means an unbounded table on your write path.
+> [!NOTE]
+> Committing the side effect and the `completed` key in the *same transaction* is what makes replay trustworthy. If they're separate writes, a crash between them either replays a response for work that didn't happen, or re-runs work whose key looks unfinished. When the side effect is in another system (a payment API), you can't share a transaction — that's exactly why step 6's downstream key matters.
+## Output
+A design block specifying: (1) the **key scheme** — who generates it, its format, and the header it travels in; (2) the **scope** — the composite `(account, endpoint, key)` and which methods get keys vs. are naturally idempotent; (3) the **atomic store-and-check** — the exact unique constraint or conditional write, with the claim happening before the work; (4) the **in-flight handling** — the `in_progress` state, the `409`/`Retry-After` response, and the lease expiry; (5) the **downstream-keying** strategy for any third-party call; and (6) the **retention policy** — TTL value, mechanism, and the retry window it covers. Followed by a concrete handler/middleware sketch and the table/index DDL (or store schema) implementing it.

package/content/skills/memory-leak-hunter.md ADDED Viewed

@@ -0,0 +1,35 @@
+---
+name: "memory-leak-hunter"
+description: "Find and fix a memory leak in a running app: confirm it's a real leak under steady load, diff two heap snapshots to name the growing object and its retention path, cut the root reference that blocks collection, and re-run to confirm memory plateaus. Use when RSS climbs until OOM/restart, heap grows unbounded across a steady workload, or GC pauses worsen the longer the process runs."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A process whose memory only goes up will eventually OOM, get killed, or grind to a halt in GC — but "memory went up" is not the same as "there is a leak." A warming cache, a JIT, a connection pool filling, and a steadily growing legitimate working set all climb too. This skill refuses to guess: it first *confirms* the leak against a steady workload, then *locates* it with a heap diff rather than a single snapshot, traces the *retention path* to the one reference that blocks collection, fixes that root, and re-runs to prove the curve flattens.
+## When to use this skill
+- RSS climbs monotonically until the process OOMs, gets OOM-killed, or hits a scheduled restart that "fixes" it for a while.
+- Heap usage trends up across a steady, repeating workload and never returns to baseline after a GC.
+- GC pauses (or full-GC frequency) get worse the longer the process stays up — a classic sign the live set is growing.
+- A load test or soak test shows memory that doesn't plateau even after the request rate is constant.
+- After a deploy, memory behavior changed and you need to know whether it's a real leak or a bigger-but-bounded cache.
+## Instructions
+1. **Confirm it's a leak before hunting one.** Drive a *steady, repeating* workload (constant request rate or a fixed loop) and record memory over time — RSS and heap-used at, say, 30s intervals. Force a GC between samples where you can (`global.gc()` with `--expose-gc` in Node, `System.gc()`/`jcmd <pid> GC.run` on the JVM, `gc.collect()` in Python). A leak is memory that trends **up** under constant load and **does not recover** after GC. Memory that rises during warmup and then *plateaus*, or that drops back after GC, is not a leak — stop here and look at cache sizing or normal working set instead.
+2. **Capture two heap snapshots under load, spaced apart.** Take snapshot A once warmup has settled, keep the same workload running, then take snapshot B after memory has visibly grown (Node: `--inspect` + DevTools/`heapdump`/`v8.writeHeapSnapshot()`; JVM: `jmap -dump:live,format=b,file=… <pid>` or a JFR `OldObjectSample`; Python: `tracemalloc.take_snapshot()` ×2, or `objgraph`/`guppy`). One snapshot tells you what's big *now*, which is useless — you need both ends of the growth.
+3. **Diff the two snapshots — read what GREW, not what's biggest.** Use the comparison view (DevTools "Comparison" between A and B, `tracemalloc.compare_to`, MAT's dominator/histogram delta). Sort by *delta in retained size and object count*. The leak is the object type whose instance count and retained size climb monotonically across the diff and never get freed — not necessarily the single largest object, which is often a legitimately big-but-stable buffer.
+4. **Trace the retention path to the root that blocks collection.** For the growing object, follow the *retainers / paths-to-GC-root* (DevTools "Retainers", MAT "Path to GC Roots: exclude weak/soft"). The fix lives at the *root* end of that chain — the live reference that keeps the whole subtree alive. Match it to the usual suspects: an unbounded cache/`Map`/dict keyed by something ever-growing (request id, user id); an event listener / observable / pub-sub subscription added but never removed; a closure captured by a long-lived callback that drags a large scope with it; a `setInterval`/timer/scheduled task never cleared; a module-level array/list that's only ever appended to; or — in native or manual-memory code — an allocation with no matching free (check with `valgrind --leak-check=full` / ASan / a heap profiler).
+5. **Fix by bounding the lifetime at the root.** Don't trim symptoms — cut the retaining reference: put a size cap and eviction (LRU) or TTL on the cache; `removeEventListener` / `unsubscribe` / `dispose` in the matching teardown; `clearInterval`/`clearTimeout` and cancel scheduled work on shutdown/unmount; replace a cache keyed by short-lived objects with a `WeakMap`/`WeakRef` so entries are collectible; bound or drain the module-level collection; add the missing `free`/`delete`/`close`. Prefer the change that makes the lifetime *correct* over one that just makes the leak slower.
+6. **Re-run the same workload and confirm a plateau.** Repeat step 1's steady workload with the fix in place and capture the same memory-over-time trace. The fix is verified only when memory rises during warmup and then *flattens* (and recovers after GC) across a window long enough to have leaked before. If it still trends up, the diff pointed at one of several retainers — go back to step 3 and trace the next-largest grower.
+> [!WARNING]
+> A single heap snapshot proves nothing about a leak — every running process holds a lot of live memory legitimately. Only the **diff of two snapshots under sustained load** distinguishes "growing and never freed" from "big but stable." Never conclude a leak (or a fix) from one snapshot or one memory number.
+> [!NOTE]
+> "Memory went up" during warmup, JIT, or cache fill is expected, not a leak — a leak is unbounded growth that never plateaus under *constant* load. Before touching code, confirm the curve never flattens and never recovers after a forced GC; otherwise you'll "fix" a cache that was working as designed and make the app slower.
+## Output
+A short report with four parts: (1) the **confirmation evidence** — the memory-over-time trace under steady load showing growth that doesn't recover after GC; (2) the **leaking object and retention path** from the heap diff (type, delta count/retained size, and the path-to-GC-root naming the retaining root); (3) the **root-cause fix** as a concrete diff at that root (eviction/TTL, unsubscribe, cleared timer, weak reference, or missing free); and (4) the **post-fix plateau** — the same workload's memory trace now flattening — or a note that another retainer remains and which one to chase next.