npm - agentscamp - Versions diffs - 0.4.0 → 0.5.0 - Mend

agentscamp 0.4.0 → 0.5.0

Files changed (18)

package/README.md +2 -2
package/content/manifest.json +213 -2
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/package.json +1 -1

package/content/skills/contract-test-designer.md ADDED

@@ -0,0 +1,70 @@
+---
+name: "contract-test-designer"
+description: "Design consumer-driven contract tests between services so an API provider can't break its consumers unnoticed — without slow, flaky full end-to-end environments. Use when independent services or teams integrate over an API, when integration bugs only surface in staging or prod, or when E2E suites are too slow and brittle to catch breaking API changes."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Cross-service E2E suites are slow, flaky, and tell you a provider broke a consumer only after both are deployed to a shared environment. This skill designs consumer-driven contract tests instead: the *consumer* declares the exact requests it sends and the precise response fields and types it actually reads, and the *provider* replays those expectations against its real handler in its own CI. A provider change that violates any consumer's contract fails the provider's build — before merge, before deploy, with no other service running. The deliverable is the consumer-defined contract(s), the provider-side verification wired into CI, and a sharing-plus-versioning approach so the two sides can evolve.
+## When to use this skill
+- Two or more independently deployed services (often owned by different teams) integrate over HTTP/JSON, gRPC, or a message queue, and a provider can ship a change that silently breaks a consumer.
+- Integration regressions only appear in staging or prod because nothing in either repo's CI exercises the actual cross-service shape.
+- The cross-service E2E suite is too slow or flaky to gate merges, so breaking API changes slip through.
+- You're standing up a new client against an existing API and want to lock the dependency to *exactly* the fields you read, not the whole payload.
+## Instructions
+1. **Let the CONSUMER define the contract — and only the part it uses.** Write the contract from the consumer's test suite, not the provider's spec. For each interaction, state the *request* the consumer sends (method, path, query/body, headers that matter) and the *response shape it actually depends on*: the status code, the fields it reads, and their types. If the consumer parses `order.id` (string) and `order.total` (number) and ignores the other 20 fields, the contract asserts those two fields and nothing else. The contract is a description of *this consumer's* needs, never the provider's full API surface.
+2. **Match on type and structure, not frozen example values.** Use matchers, not literals: assert `total` is a number, `status` is one of a set, `items` is a non-empty array of objects with `sku`/`qty` — not `total == 4250`. Frozen example values turn the contract into a snapshot test that breaks on every data change. Reserve exact-value matching for fields whose literal value is part of the contract (an enum the consumer branches on, a fixed `Content-Type`).
+3. **Pick a tool/pattern and generate the artifact.** Match what the stack already uses before adding a dep. **Pact** (pact-js / pact-jvm / pact-python / pact-go) is the default for HTTP and async messages — the consumer test runs against a mock provider and emits a pact JSON file. **Spring Cloud Contract** suits a JVM-heavy shop. For simpler needs, a **shared JSON Schema / OpenAPI fragment** committed to both repos, validated on each side, is a legitimate lightweight contract. Whatever the tool, the output is a machine-checkable artifact of the consumer's expectations.
+4. **Verify the PROVIDER against the contract in the PROVIDER's own CI.** This is the half teams skip and the half that earns the value. The provider's pipeline fetches every consumer contract and replays each recorded request against the real running provider (no consumer process involved), asserting the live response satisfies the matchers. Wire it as a required check: a provider change that drops `order.total` or renames `status` fails the provider build, so the break is caught at the source before merge. Use `provider states` (Pact) to set up the data each interaction needs (`given "order 42 exists"` → seed that fixture) rather than depending on ambient DB state.
+5. **Share contracts via a broker or committed artifacts, and gate deploys on verification.** For more than a couple of services, run a **Pact Broker** (or PactFlow): consumers publish contracts tagged by branch/version, providers fetch and verify, and `can-i-deploy` blocks a release whose verified contracts don't cover the consumer versions currently in prod. For a small, co-located set, committing the contract artifact into a shared repo or the provider repo and verifying in CI is simpler and adequate — pick the lightest mechanism that still makes verification a required, automated gate, not a manual step.
+6. **Version contracts so provider and consumer can evolve independently.** Tag each contract with the consumer's version and the environment where that consumer version runs. Additive provider changes (new optional field) keep old contracts passing — that's the point of matching only what the consumer reads. For a breaking change, support both shapes until every consumer has published a contract for the new one (verified via the broker), then retire the old. Never edit a published contract in place to make a failing provider build go green — that defeats the gate.
+7. **Keep contracts to interface shape; push behavior into unit tests.** A contract verifies the *integration surface* — fields, types, status codes, error envelopes — not that the provider computes the right total or applies the right discount. That logic belongs in the provider's own unit/integration tests. A contract bloated with business assertions becomes a second, worse copy of the provider's logic suite and breaks on unrelated correct changes.
+> [!WARNING]
+> Contract tests verify the INTERFACE shape, not end-to-end behavior. They replace brittle cross-service E2E for catching *breaking API changes* — but they do not prove the provider's logic is correct or that the wired-up system works. Keep the provider's own logic tests, and a thin smoke E2E for the critical path; contracts shrink the E2E suite, they don't delete it.
+> [!WARNING]
+> A contract that asserts the provider's *entire* response — every field, exact values — instead of only the fields this consumer reads is an anti-pattern: it produces false breakages on unrelated, backward-compatible changes (a new field, a reordered key, a changed value the consumer never reads), and trains teams to ignore red builds. Assert the minimum the consumer actually depends on.
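To make "match on type, assert only what the consumer reads" concrete, here is a minimal sketch in plain JavaScript. It is not Pact: the `matchers`, `orderContract`, and `verifyResponse` names are hypothetical, hand-rolled for illustration, and a real setup would lean on Pact matchers or JSON Schema instead.

```javascript
// Hand-rolled matchers for illustration only; a real setup would use Pact or JSON Schema.
const matchers = {
  string: (v) => typeof v === "string",
  number: (v) => typeof v === "number" && Number.isFinite(v),
  oneOf: (...allowed) => (v) => allowed.includes(v),
};

// Hypothetical contract for a consumer that reads only id, total, and status.
const orderContract = {
  status: 200,
  body: {
    id: matchers.string,
    total: matchers.number,
    status: matchers.oneOf("open", "closed"),
  },
};

// Returns a list of violations; an empty list means the response satisfies the contract.
function verifyResponse(contract, response) {
  const violations = [];
  if (response.status !== contract.status) {
    violations.push(`status: expected ${contract.status}, got ${response.status}`);
  }
  for (const [field, matches] of Object.entries(contract.body)) {
    if (!(field in response.body)) {
      violations.push(`${field}: missing`);
    } else if (!matches(response.body[field])) {
      violations.push(`${field}: wrong type or value`);
    }
  }
  return violations; // fields the contract does not name are never inspected
}

// Additive change: a new field appears and total has a different value. Still passes.
const additive = { status: 200, body: { id: "ord_7", total: 99.5, status: "closed", currency: "EUR" } };
// Breaking change: total renamed to amount, and status leaves the enum.
const breaking = { status: 200, body: { id: "ord_7", amount: 99.5, status: "archived" } };

console.log(verifyResponse(orderContract, additive));
console.log(verifyResponse(orderContract, breaking));
```

The additive response yields no violations even though it carries an extra field and a different `total`, while the breaking one is flagged on exactly the two fields the consumer depends on.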
+## Output
+For the integration, the skill produces:
+- **The consumer-defined contract(s)** — for each interaction, the request (method, path, body, key headers) and the response expectations as matchers (status code + only the fields/types this consumer reads), in the chosen tool's format.
+- **The provider-side verification setup** — the CI step that fetches the contract(s) and replays them against the real provider, the provider-state fixtures each interaction needs, and the required-check wiring so a violation fails the provider build.
+- **The sharing + versioning approach** — broker vs. committed artifact, how contracts are tagged by consumer version/environment, and the deploy gate (e.g. `can-i-deploy`) plus the rule for evolving through a breaking change.
+Example — a consumer contract for an order-service client, in pact-js:
+```js
+const { PactV3, MatchersV3: M } = require("@pact-foundation/pact");
+const provider = new PactV3({ consumer: "checkout-web", provider: "order-service" });
+// The consumer reads only id (string), total (number), and status (one of two values).
+// It ignores every other field on the order — so the contract asserts only these.
+provider
+  .given("order 42 exists")                       // provider state: seeded in provider CI
+  .uponReceiving("a request for an order")
+  .withRequest({ method: "GET", path: "/orders/42" })
+  .willRespondWith({
+    status: 200,
+    headers: { "Content-Type": "application/json" },
+    body: {
+      id: M.string("ord_42"),                     // type match, not the literal "ord_42"
+      total: M.number(4250),
+      status: M.regex(/^(open|closed)$/, "open"),  // enum the consumer branches on
+    },
+  });
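+// Run inside the test runner (Jest-style `it`/`expect` assumed): in a CommonJS
+// module, `await` is only valid inside an async function.
+it("reads an open order", async () => {
+  await provider.executeTest(async (mock) => {
+    // fetchOrder is the consumer's own client function, pointed at the mock provider.
+    const order = await fetchOrder(`${mock.url}/orders/42`);
+    expect(order.status).toBe("open");
+  });
+});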
+```
+This test emits a pact file; the **provider's** pipeline then replays `GET /orders/42` against the real `order-service` (with state `order 42 exists` seeded) and fails the provider build if `total` stops being a number or `status` leaves the enum. Hand the request/response shapes to `openapi-doc-writer` to keep the published spec in sync, and use `test-scaffolder` to flesh out the provider-state fixtures.
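The deploy gate from the sharing step can be pictured as a lookup over a verification matrix. The sketch below is an assumption-laden illustration, not the Pact Broker API: the record shape, the version strings, and the `canDeploy` function are all hypothetical, and the real `can-i-deploy` command works from data the broker already holds.

```javascript
// Hypothetical record shape; a real broker tracks this matrix for you.
// Each entry says whether a provider version passed a given consumer version's contract.
const verifications = [
  { provider: "order-service@2.4.0", consumer: "checkout-web@1.8.0", passed: true },
  { provider: "order-service@2.4.0", consumer: "mobile-app@3.1.0", passed: true },
  { provider: "order-service@2.5.0", consumer: "checkout-web@1.8.0", passed: true },
  { provider: "order-service@2.5.0", consumer: "mobile-app@3.1.0", passed: false },
];

// Consumer versions currently running in prod (assumed to come from deployment records).
const consumersInProd = ["checkout-web@1.8.0", "mobile-app@3.1.0"];

// A provider version may deploy only if every prod consumer has a passing
// verification against it. A missing verification blocks just like a failed one.
function canDeploy(providerVersion, prodConsumers, results) {
  const blockers = prodConsumers.filter(
    (consumer) =>
      !results.some(
        (r) => r.provider === providerVersion && r.consumer === consumer && r.passed
      )
  );
  return { ok: blockers.length === 0, blockers };
}

console.log(canDeploy("order-service@2.4.0", consumersInProd, verifications));
console.log(canDeploy("order-service@2.5.0", consumersInProd, verifications));
```

Treating "never verified" the same as "failed" is the design choice that keeps the gate honest: a provider version nobody has checked against a prod consumer is not known to be safe.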

package/content/skills/devcontainer-designer.md ADDED

@@ -0,0 +1,40 @@
+---
+name: "devcontainer-designer"
+description: "Design a reproducible dev environment (Dev Container / Docker) so onboarding is one command and 'works on my machine' dies — by detecting the project's real stack and versions, authoring a devcontainer.json (+ Dockerfile/compose) that pins the runtime to what the repo targets, wires dependent services, caches dependencies, and injects secrets instead of baking them. Use when new contributors struggle to set up the project, when environment drift causes inconsistent behavior, or when standardizing tooling across a team."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+The phrase "works on my machine" is a confession that the project has no defined machine. Two contributors on Node 18.17 and 20.4, one with a system `libpq` and one without, a Postgres someone installed via Homebrew in 2023 — that spread is exactly the environment drift a dev container exists to kill. But a container only does that if it pins what the repo actually targets and brings the *whole* stack up together; an unpinned `node:latest` reintroduces the drift you containerized to remove, and a `:latest` Postgres can rev a major version under you on the next rebuild. This skill reads the repo to find the real stack, then writes a `devcontainer.json` (with a Dockerfile and/or compose when services are involved) where every version is pinned, services come up as one unit, dependencies are cached so rebuilds are cheap, and secrets are injected at runtime — never baked into the image.
+## When to use this skill
+- New contributors burn their first day on setup, or the onboarding README has more than a handful of "install X, then Y" steps that drift out of date.
+- The same code behaves differently across machines (passes locally, fails in CI, or vice versa) and you suspect runtime/version/system-lib differences rather than a real bug.
+- You're standardizing tooling across a team and want one definition of "the dev environment" that an editor can rebuild on demand.
+- The project needs a DB, cache, queue, or other service running alongside the app and people manage those by hand today.
+## When NOT to use this skill
+- The drift is a missing lockfile, not a missing container — if `package.json`/`pyproject.toml` has unpinned ranges and no committed lock, fix that first; a container around floating deps still drifts.
+- You need a production deployment image. A dev container optimizes for fast inner-loop edit/run with the source mounted; a production image optimizes for a small, immutable artifact with the source baked in. They are different files with different tradeoffs — don't ship this one.
+## Instructions
+1. **Detect the real stack before writing anything.** Glob and read the manifests that declare the runtime and pin it: `.nvmrc` / `.node-version` / `engines` in `package.json`, `.python-version` / `pyproject.toml` `requires-python`, `.ruby-version`, `go.mod` `go` directive, `.tool-versions` (asdf/mise), `rust-toolchain.toml`. Identify the package manager from the lockfile that exists (`package-lock.json` → npm, `pnpm-lock.yaml` → pnpm, `yarn.lock` → yarn, `poetry.lock` → poetry, `uv.lock` → uv) — the container must use the same one, or it builds a different tree. The repo's declared version is the source of truth; never round to "latest stable."
+2. **Find the services the app actually talks to.** Grep config and env templates (`.env.example`, `config/`, `docker-compose*.yml`, `application.yml`) for connection strings and ports — `DATABASE_URL`, `REDIS_URL`, `postgres://`, `amqp://`, ES/OpenSearch hosts. Read the dependency manifest for client libraries (`pg`, `redis`, `psycopg`, `pika`, `kafkajs`) as corroboration. Every external service the app expects at runtime must come up in the dev environment, or the container is half a setup and contributors are back to installing Postgres by hand.
+3. **Pin the base image to the repo's exact runtime version.** Use a version-specific tag such as `mcr.microsoft.com/devcontainers/python:3.12` or `node:20.17-bookworm`, never `:latest`, `:lts`, or a bare major like `:20`. A versioned tag can still be re-pushed with patch updates, so where full immutability matters, pin by digest (`@sha256:<digest>`) as well. Match the minor the repo targets (a `.nvmrc` of `20.17.0` means `node:20.17`, not `node:20`). If you author a Dockerfile, install system libraries the build needs that the base lacks (`libpq-dev` for `psycopg`, `build-essential`, `libvips` for `sharp`, `default-libmysqlclient-dev`); these are the silent "missing on my machine" failures. Set the pinned image in `devcontainer.json` `image`, or `build.dockerfile` if you need the extra libs.
+4. **Bring the whole stack up with compose when services exist.** When step 2 found a DB/cache/queue, write a `docker-compose.yml` with the app service plus each dependency pinned to a *specific* version (`postgres:16.4`, `redis:7.4`) — a major Postgres bump on rebuild can refuse to read the old data dir. Point `devcontainer.json` at it via `dockerComposeFile` + `service` + `workspaceFolder`, list `runServices` so the DB starts with the workspace, and use a named volume for the DB data dir so a container rebuild doesn't wipe local seed data. Set service `DATABASE_URL` to the compose service hostname (`postgres`, not `localhost`) so the app connects across the compose network.
+5. **Mount the workspace and cache dependencies so rebuilds stay cheap.** A 10-minute container build trains people to never rebuild, and a never-rebuilt container is the drift you were eliminating. Keep the source bind-mounted (default `workspaceFolder`) so edits are instant. Put the package manager's *store* (not `node_modules`/`.venv`) in a named volume mount so deps survive rebuilds: a volume on `~/.npm`, `~/.local/share/pnpm/store` (pnpm's default store location on Linux), `~/.cache/pip`, `~/.cargo`. For compiled-language or heavy-system-lib stacks, structure the Dockerfile so dependency-install layers come before the source copy, so a code change doesn't bust the dep cache.
+6. **Preinstall tooling and run a `postCreateCommand` that leaves the env ready.** Add the editor extensions and settings the project assumes under `customizations.vscode.extensions` (linter, formatter, language server, the DB client) — so everyone gets the same lint-on-save, not a personal config. Use a `postCreateCommand` to run the dependency install with the detected package manager (`pnpm install --frozen-lockfile`) plus any project setup (DB migrate + seed, generate types, copy `.env.example` to `.env` if absent). The goal: open the project, and after postCreate it runs — no manual step. Prefer `devcontainer features` (`ghcr.io/devcontainers/features/*`) for common add-ons (docker-in-docker, gh CLI) over hand-rolled `apt-get` lines.
+7. **Inject secrets at runtime — never bake them into the image.** Reference required secrets in `containerEnv`/`remoteEnv` sourced from the host (`${localEnv:OPENAI_API_KEY}`) or via a secret mount, and keep a committed `.env.example` documenting the keys with empty/placeholder values. Anything sensitive stays in the developer's local `.env` (gitignored) or their host env. Do not `ENV SECRET=...`, `COPY .env`, or `ARG` a credential in the Dockerfile, and don't commit a populated `.env` — an image layer is shipped verbatim to everyone who pulls it.
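The steps above converge on one `devcontainer.json`. As a rough sketch of its shape, the script below assembles the object from values a detection pass might return. The `buildDevcontainer` function, its inputs, the `app` service name, and the `/workspace` path are hypothetical; the property names (`dockerComposeFile`, `service`, `workspaceFolder`, `runServices`, `postCreateCommand`, `customizations`, `remoteEnv`) are the ones the instructions refer to.

```javascript
// Builds a devcontainer.json object from values detected in the repo.
// All inputs are hypothetical examples of what detection might return.
function buildDevcontainer({ name, packageManager, services, secrets, extensions }) {
  const install = {
    npm: "npm ci",
    pnpm: "pnpm install --frozen-lockfile",
    yarn: "yarn install --frozen-lockfile", // Yarn 1.x flag; Yarn 2+ uses --immutable
  }[packageManager];
  if (!install) throw new Error(`unsupported package manager: ${packageManager}`);

  return {
    name,
    dockerComposeFile: "docker-compose.yml",
    service: "app", // assumed name of the app service in the compose file
    workspaceFolder: "/workspace", // must match the mount path in the compose file
    runServices: services,
    postCreateCommand: install,
    customizations: { vscode: { extensions } },
    // Secrets are referenced from the host environment at runtime, never written as literals.
    remoteEnv: Object.fromEntries(secrets.map((key) => [key, `\${localEnv:${key}}`])),
  };
}

const config = buildDevcontainer({
  name: "example-app",
  packageManager: "pnpm",
  services: ["postgres", "redis"],
  secrets: ["OPENAI_API_KEY"],
  extensions: ["dbaeumer.vscode-eslint"],
});

console.log(JSON.stringify(config, null, 2));
```

Note that the generated file contains only the reference `${localEnv:OPENAI_API_KEY}`, so committing it exposes the name of the secret and nothing else.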
+> [!WARNING]
+> An unpinned base or runtime (`node:latest`, `python:3`, `postgres:16` without a minor) is the single change that reintroduces the exact drift the container is meant to eliminate. The image silently revs out from under the team on the next pull or rebuild, and now "works in the container" depends on *when* you built it. Pin every base image and every service to a specific version, and update those pins as a reviewed, deliberate commit.
+> [!CAUTION]
+> A secret baked into an image — via `ENV`, `ARG`, `COPY .env`, or a committed populated `.env` — leaks to everyone who pulls the image and persists in the layer history even if a later layer deletes it. Injecting credentials into a built image is publishing them. Keep all secrets in the developer's local env/secret store and reference them at runtime; commit only an empty `.env.example`.
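A pinning rule is easier to enforce when it is checkable. The following is a minimal sketch of such a check; `checkPinned` is a hypothetical helper, and its rule (a digest, or a tag starting with major.minor) is this sketch's assumption rather than an official policy.

```javascript
// Flags image references that can silently change between rebuilds.
// The rule (digest, or a tag starting with major.minor) is this sketch's assumption.
function checkPinned(imageRef) {
  if (imageRef.includes("@sha256:")) return { pinned: true, reason: "digest" };
  const lastSlash = imageRef.lastIndexOf("/");
  const colon = imageRef.indexOf(":", lastSlash + 1); // ignore a colon in a registry port
  if (colon === -1) return { pinned: false, reason: "no tag (implies latest)" };
  const tag = imageRef.slice(colon + 1);
  if (/^\d+\.\d+/.test(tag)) return { pinned: true, reason: "major.minor tag" };
  return { pinned: false, reason: `floating tag "${tag}"` };
}

for (const ref of ["node:latest", "python:3", "postgres:16", "postgres:16.4", "node:20.17-bookworm"]) {
  console.log(ref, checkPinned(ref));
}
```

A check like this could run in CI over the image references in the Dockerfile and compose file, so a floating tag is caught in review rather than on someone's next rebuild. Only a digest is fully immutable, because a major.minor tag still receives patch updates.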
+## Output
+A `devcontainer.json` plus the Dockerfile and/or `docker-compose.yml` the project needs, written via Write, with: every base image and service tag pinned to the version the repo targets (and the detected source of that version called out — e.g. `node:20.17 (from .nvmrc)`, `postgres:16.4`); the dependent services wired through compose with named data volumes and correct service-hostname connection strings; a dependency-store cache mount and a layer-ordered Dockerfile so rebuilds are fast; the preinstalled extensions and a `postCreateCommand` that installs and sets up so the env is ready on first open; and a clear note of which secrets are injected from the host env / secret mount versus the committed empty `.env.example` — none baked into the image. The skill reads the repo and writes config files only; it does not build images, start containers, or run install commands.

package/content/skills/distributed-tracing-instrumenter.md ADDED

@@ -0,0 +1,42 @@
+---
+name: "distributed-tracing-instrumenter"
+description: "Instrument a service (or a chain of services) with OpenTelemetry so a single request can be followed end-to-end — context propagated across every hop including async/queue boundaries, spans at the boundaries that matter, deliberate trace-wide sampling, and trace_id stamped on log lines. Use when latency or failures span multiple services, when you have logs but can't reconstruct a request's full path, or when adopting OpenTelemetry."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+You have logs in five services and a request that's slow, but no way to know it's slow *because* service C waited 800ms on a query that service A triggered three hops back — the lines aren't connected. Distributed tracing connects them: one trace ID threads through every service a request touches, each hop adds a timed span, and you read the whole waterfall in one view. The two things that make or break it are propagation (the context has to survive every hop, and it silently dies across async/queue boundaries) and span discipline (boundaries, not every function). This skill instruments against OpenTelemetry so you're not locked to a backend, fixes propagation at each hop, picks the spans worth having, samples whole traces consistently, and ties traces back to your logs.
+## When to use this skill
+- A request is slow or failing and the cause spans multiple services — you can see each service's logs but can't reconstruct which call, in which order, cost the time.
+- You have decent logs but reconstructing one request's full path means correlating timestamps by hand across services, and async work (queue jobs, background workers) is a black hole.
+- You're adopting OpenTelemetry and want spans at the right boundaries with a defensible attribute set, not a noisy span-per-function trace.
+- Traces already exist but show up broken — a request appears as two disconnected partial traces, or the downstream half is missing entirely (almost always a propagation or sampling bug).
+## Instructions
+1. **Adopt OpenTelemetry as the API/SDK; pick the exporter separately.** Instrument against the vendor-neutral OTel API and the W3C `traceparent`/`tracestate` propagation format so the wire protocol is standard across every service. Choose the backend (Jaeger, Tempo, Datadog, Honeycomb) only at the exporter/Collector layer — that way swapping or adding a backend never touches instrumentation. Prefer running the OTel Collector as a sidecar/agent so the app exports once and the Collector handles batching, sampling, and fan-out.
+2. **Turn on auto-instrumentation first, then map the request's hops.** Enable the language's auto-instrumentation for the HTTP/gRPC server, outbound HTTP/gRPC clients, and DB drivers — it gives you propagation and the obvious boundary spans for free. Then trace one real request end-to-end on paper: list every hop (inbound edge, each outbound call, each DB query, each queue publish/consume) so you know exactly where context must survive and which boundaries still need manual spans.
+3. **Fix context propagation at every hop — extract inbound, inject outbound.** At each service's entry point, *extract* trace context from the incoming `traceparent` header into the current context; on every outbound call, *inject* the current context into the outgoing headers. For HTTP and gRPC, auto-instrumentation usually does both — verify it actually fires (a manually-built client or a raw socket bypasses it). The hop that breaks is the one nobody instruments: confirm the child span's trace ID equals the parent's, not a fresh one.
+4. **Carry context across async and queue boundaries explicitly.** A message queue, background job, event bus, or thread/goroutine handoff drops the in-process context — the consumer starts a brand-new trace unless you bridge it. On publish, inject `traceparent` into the message *headers/attributes* (not the body); on consume, extract it and start the span as a *child* (or a span link, for batch/fan-in) of the producer's span. Without this the trace splits into two disconnected fragments and the async work looks like an orphan.
+5. **Create spans at meaningful boundaries, not per function.** A span is worth creating where work crosses a boundary or has independent cost: the inbound request, each outbound call (HTTP/RPC/DB/cache), and expensive in-process compute (a heavy serialization, a model inference, a batch loop *as one span*, not per iteration). Do not wrap every helper function — a span-per-function trace has hundreds of millisecond-thin spans that bury the one slow hop and multiply export cost. If a span never changes how you'd read the trace, don't create it.
+6. **Attach high-value attributes; never secrets or PII.** Put queryable context on spans as semantic attributes: `http.route` (the *template* `/users/:id`, not the literal path), `http.status_code`, `db.system`/`db.statement` (parameterized, no literal values), `messaging.destination`, and the key domain IDs you'd filter by (`order_id`, `tenant_id`). Set span status to error and record the exception on failure. Never put passwords, tokens, full auth headers, request/response bodies, raw SQL with inlined values, or PII on a span — spans are exported to third-party backends and widely readable.
+7. **Sample the whole trace consistently: decide head vs tail once, at the edge.** The cardinal rule: a trace must be sampled atomically, all-or-nothing, or you get broken partial traces. With head sampling, the *first* service makes the keep/drop decision and propagates it as the sampled flag in the trace-flags field of the `traceparent` header; every downstream service honors that bit instead of deciding independently, because per-service sampling rates produce traces missing half their spans. For "keep all errors and slow requests" you need *tail* sampling, which must run in the Collector (it sees the full assembled trace before deciding), never per-service. Pick one strategy and apply it trace-wide.
+8. **Correlate traces with logs by stamping trace_id on every log line.** Pull the active `trace_id` (and `span_id`) from context and add them as fields on every log line in that request — so a log search jumps straight to the trace, and a trace span links straight to its logs. This is the payoff that makes traces and the structured logs you already have one navigable surface instead of two.
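The inject-on-publish, extract-on-consume bridge can be sketched without any SDK. The code below hand-rolls the W3C `traceparent` format purely to show the mechanics. In a real service the OpenTelemetry propagator does this for you, and the function names here (`newRootContext`, `inject`, `extractAsChild`) are hypothetical.

```javascript
const { randomBytes } = require("node:crypto");

// W3C traceparent: version-traceid-parentid-flags, e.g. 00-<32 hex>-<16 hex>-01
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function newRootContext(sampled) {
  return {
    traceId: randomBytes(16).toString("hex"),
    spanId: randomBytes(8).toString("hex"),
    flags: sampled ? "01" : "00",
  };
}

// Publish side: write the current context into message headers, not the body.
function inject(ctx, headers) {
  headers.traceparent = `00-${ctx.traceId}-${ctx.spanId}-${ctx.flags}`;
  return headers;
}

// Consume side: continue the producer's trace as a child span. If no valid header
// arrived, start a new root (left unsampled here, which is an arbitrary choice).
function extractAsChild(headers) {
  const match = TRACEPARENT.exec(headers.traceparent || "");
  if (!match) return { ...newRootContext(false), parentSpanId: null };
  const [, traceId, parentSpanId, flags] = match;
  return { traceId, spanId: randomBytes(8).toString("hex"), parentSpanId, flags };
}

const producer = newRootContext(true);
const message = { headers: inject(producer, {}), body: { orderId: "42" } };
const consumer = extractAsChild(message.headers);

// The verification the instructions ask for: same trace ID on both sides.
console.log(consumer.traceId === producer.traceId);
```

The consumer gets a fresh span ID but keeps the producer's trace ID and sampled flag, which is what keeps the worker's spans in the same waterfall.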
+> [!WARNING]
+> Context dropped across an async/queue boundary is the #1 tracing bug. The consumer starts a fresh root span, and one request becomes two disconnected traces — the producer side and the worker side — with no way to tell they're the same request. Always inject `traceparent` into message headers on publish and extract it (as a child span or link) on consume. Verify by checking the consumer span shares the producer's trace ID.
+> [!WARNING]
+> Inconsistent per-service sampling yields incomplete traces. If service A keeps 100% and service B keeps 10%, ~90% of A's traces are missing all of B's spans — a waterfall with holes that looks like B never ran. The sampling decision must be made once (head: at the edge, propagated; or tail: in the Collector) and honored by every service, never re-rolled per hop.
+> [!WARNING]
+> A span-per-function explosion makes traces unreadable and expensive. Hundreds of sub-millisecond spans hide the one 800ms hop that matters and multiply your backend's ingest cost and bill. Span boundaries and independently-costed work only; collapse tight loops into a single span with a count attribute rather than one span per iteration.
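The "decide once, honor everywhere" rule for head sampling fits in a few lines. This sketch is similar in spirit to the parent-based samplers that OpenTelemetry SDKs provide, but `decideSampling` itself is a hypothetical function, and the injectable `random` parameter exists only so the behavior can be checked deterministically.

```javascript
// Head sampling: only a root span (no parent context) makes a random decision.
// Any span with a parent copies the sampled bit from the parent's trace-flags.
function decideSampling(parentFlags, rootRatio, random = Math.random) {
  if (parentFlags !== undefined) {
    return (parseInt(parentFlags, 16) & 0x01) === 1; // honor the upstream decision
  }
  return random() < rootRatio; // edge service: decide once for the whole trace
}

// A downstream service configured at 0% still keeps a trace the edge sampled.
console.log(decideSampling("01", 0));
// A downstream service configured at 100% still drops a trace the edge dropped.
console.log(decideSampling("00", 1));
```

The local ratio only matters at the root. That is the property that prevents the half-empty waterfalls described above.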
+## Output
+- **Instrumentation plan** — the request's hops mapped end-to-end, which boundaries get spans (inbound edge, outbound calls, DB queries, named expensive compute) and which are deliberately left out, and the per-span-type attribute set (with the secrets/PII deny-list).
+- **Propagation fix per hop** — for each hop, the extract-inbound / inject-outbound change, called out explicitly for HTTP, gRPC, and each async/queue boundary, with how to verify parent and child share one trace ID.
+- **Sampling strategy** — head vs tail decision, where it runs (edge vs Collector), the rule (e.g. base rate + keep-all-errors + keep-slow), and how the decision is propagated trace-wide.
+- **Trace↔log correlation** — how `trace_id`/`span_id` are pulled from context and stamped on log lines, so logs and traces cross-link in both directions.
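For the trace-to-log correlation, one possible approach in Node.js is to hold the active context in `AsyncLocalStorage` and have the logger read it. This is a sketch under assumptions: the `formatLog`, `log`, and `handleRequest` names are hypothetical, and the hard-coded IDs stand in for values a tracing SDK would supply.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");

const traceContext = new AsyncLocalStorage();

// Builds one structured log line, adding trace fields when a context is active.
function formatLog(level, message, fields = {}) {
  const ctx = traceContext.getStore();
  const traceFields = ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  return JSON.stringify({ level, message, ...fields, ...traceFields });
}

function log(level, message, fields) {
  console.log(formatLog(level, message, fields));
}

// Hypothetical request handler; the IDs would normally come from the tracing SDK.
function handleRequest(ctx) {
  return traceContext.run(ctx, () => {
    log("info", "order lookup started", { order_id: "42" });
    return formatLog("info", "order lookup finished");
  });
}

const line = handleRequest({
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
});
console.log(line);
```

Call sites never pass the IDs explicitly. Every log line emitted inside `handleRequest` carries them, and lines logged outside any request simply omit the fields.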

package/content/skills/idempotency-designer.md ADDED

@@ -0,0 +1,47 @@
+---
+name: "idempotency-designer"
+description: "Make unsafe, retryable API operations idempotent so a client retry or a network hiccup can't double-charge, double-create, or double-send — design a client-supplied idempotency key, an atomic store-and-check (unique constraint or conditional write), in-flight conflict handling, and a retention policy. Use when a POST/mutation can be retried (payments, order creation, sends, webhooks), or when duplicate side effects have already shown up in production."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A network timeout doesn't mean the request failed — it means the client doesn't *know*. So the client retries, and now the charge runs twice. Idempotency fixes this by making "do this operation" return the *same result* no matter how many times it's submitted under the same key. The trap: almost everyone implements it as "check if we've seen this key, if not do the work" — two non-atomic steps — which is precisely a race that two concurrent retries win together. This skill designs the key, the *atomic* dedup, the in-flight case, and the cleanup.
+## When to use this skill
+- An endpoint has a side effect that must not happen twice — a payment/charge, order or account creation, an email/SMS/push send, a transfer, a webhook *delivery* you consume.
+- Clients (mobile, SDKs, queue consumers, other services) retry on timeout/5xx, so the same logical operation can arrive more than once.
+- Duplicate rows, double charges, or double-sent notifications have already appeared in production logs and you're retrofitting protection.
+- You're putting a queue or a webhook receiver in front of a mutation — at-least-once delivery guarantees duplicates by design.
+## Instructions
+1. **Have the client generate the key, one per logical operation.** The idempotency key is a client-minted unique id (a UUID v4, or a deterministic hash of the operation's natural identity) created *once* and reused on every retry of that same operation. It travels in a header — `Idempotency-Key: <uuid>` (the Stripe/IETF convention) — not in the body where a serializer might reorder it. A new key per *user click* / per *queued message*, the *same* key across that click's retries. Document who mints it and exactly where it rides.
+2. **Scope the key — never make it globally unique.** Store and match it as a composite: `(account_id, endpoint, idempotency_key)`. Without scoping, one tenant's key can collide with another's (information leak or wrong cached response returned), and the same UUID legitimately reused on two different endpoints would wrongly dedup. Reserve keys for POST-style *creates and actions*; `GET`/`PUT`/`DELETE` should be designed naturally idempotent (a `PUT` to a known id, a `DELETE` that no-ops on an absent row) and need no key.
+3. **Record the key BEFORE doing the work, in a single atomic operation.** This is the whole mechanism. Either:
+   - **Unique constraint** — `INSERT` a row keyed on `(account_id, endpoint, key)` with status `in_progress`; let the database's unique index reject the second insert. The *insert* is the lock; you do not read first.
+   - **Conditional write** — `SET key value NX` (Redis), or a conditional/compare-and-swap put (DynamoDB `attribute_not_exists`). The store decides the winner atomically.
+   The winner proceeds; everyone else hit the constraint/condition and branches to step 5. There is no "check then act" — the check and the claim are the same call.
+4. **Persist the response alongside the key, then replay it on repeat.** When the work finishes, store the *full* response (status code + body, or enough to reconstruct it) against the key and mark it `completed` — ideally in the **same transaction** that performs the side effect, so the key and the effect commit or roll back together. On a repeat of a *completed* key, return the stored response verbatim instead of re-executing. Optionally store a hash of the request payload and 422 if the same key arrives with a *different* body — that's a client bug, not a retry.
+5. **Handle the in-flight case explicitly — it's not "completed" yet.** A retry can arrive while the first request is still running (status `in_progress`). Do **not** run the work again and do **not** block indefinitely. Return **`409 Conflict`** (or `425 Too Early`) with a short `Retry-After`, telling the client "this is being processed, ask again." Give the `in_progress` record a lease/expiry so a crashed first attempt that never reached `completed` can be retried after the lease lapses rather than wedging the key forever.
+6. **Make the downstream effect idempotent too.** Your atomic key protects *your* handler; it does nothing for the third-party call inside it. If the handler calls a payment processor or another service, pass an idempotency key *to that call as well* (most payment APIs accept one) — derive it deterministically from your own key so a retry of your handler produces the same downstream key. Otherwise a crash *after* the external charge but *before* your commit leaves the charge live while your record says nothing happened.
+7. **Set a TTL and a cleanup job.** Keys are only needed for the retry window — minutes to ~24h, matched to how long clients realistically retry. Store an `expires_at` and either use the store's native TTL (Redis `EXPIRE`, DynamoDB TTL) or a periodic delete. Choose retention deliberately: long enough to cover every retry path (including a client that retries the next day), short enough that the table doesn't grow without bound.
+> [!WARNING]
+> Check-then-act is not idempotency. "Read whether the key exists, and if not, do the work" is two operations: two concurrent retries both read "not seen," both proceed, and both run the side effect. The dedup MUST be a single atomic operation — a unique-constraint `INSERT` or a conditional/`NX` write where the store picks the one winner. If your design has a `SELECT` (or `GET`) before the `INSERT`, it is racy under exactly the concurrent-retry load it exists to stop.
+> [!WARNING]
+> An idempotency store with no TTL grows forever. Every unique operation ever submitted leaves a permanent row, and the unique-index lookup that guards your hottest write path slowly degrades. Always attach an `expires_at` plus native-TTL or a sweep job; "we'll clean it up later" means an unbounded table on your write path.
+> [!NOTE]
+> Committing the side effect and the `completed` key in the *same transaction* is what makes replay trustworthy. If they're separate writes, a crash between them either replays a response for work that didn't happen, or re-runs work whose key looks unfinished. When the side effect is in another system (a payment API), you can't share a transaction — that's exactly why step 6's downstream key matters.
+## Output
+A design block specifying: (1) the **key scheme** — who generates it, its format, and the header it travels in; (2) the **scope** — the composite `(account, endpoint, key)` and which methods get keys vs. are naturally idempotent; (3) the **atomic store-and-check** — the exact unique constraint or conditional write, with the claim happening before the work; (4) the **in-flight handling** — the `in_progress` state, the `409`/`Retry-After` response, and the lease expiry; (5) the **downstream-keying** strategy for any third-party call; and (6) the **retention policy** — TTL value, mechanism, and the retry window it covers. Followed by a concrete handler/middleware sketch and the table/index DDL (or store schema) implementing it.
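
The atomic claim in steps 3 to 5 can be sketched in Python with the standard-library `sqlite3` module standing in for the real store. This is a minimal illustration under stated assumptions, not the skill's prescribed implementation: the table name `idempotency_keys`, the `handle` function, the 30-second lease, and the returned status codes are all hypothetical, and reclaiming an expired lease (step 5) and payload hashing (step 4) are left out for brevity.

```python
import json
import sqlite3
import time

LEASE_SECONDS = 30  # assumed lease length for an in_progress claim


def make_store() -> sqlite3.Connection:
    """Create an in-memory store; the composite primary key is the scope."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE idempotency_keys (
            account_id  TEXT NOT NULL,
            endpoint    TEXT NOT NULL,
            idem_key    TEXT NOT NULL,
            status      TEXT NOT NULL,   -- 'in_progress' or 'completed'
            response    TEXT,            -- JSON of {status, body} once completed
            lease_until REAL NOT NULL,
            PRIMARY KEY (account_id, endpoint, idem_key)
        )
        """
    )
    return conn


def handle(conn, account_id, endpoint, idem_key, do_work):
    """Claim the key with one INSERT, then run the work or replay/reject."""
    scope = (account_id, endpoint, idem_key)
    try:
        # The INSERT is the lock: no SELECT happens before it.
        with conn:
            conn.execute(
                "INSERT INTO idempotency_keys VALUES (?, ?, ?, 'in_progress', NULL, ?)",
                (*scope, time.time() + LEASE_SECONDS),
            )
    except sqlite3.IntegrityError:
        # Lost the race or this is a retry: read only AFTER the claim failed.
        status, response = conn.execute(
            "SELECT status, response FROM idempotency_keys "
            "WHERE account_id = ? AND endpoint = ? AND idem_key = ?",
            scope,
        ).fetchone()
        if status == "completed":
            saved = json.loads(response)
            return saved["status"], saved["body"]  # replay the stored response
        return 409, {"error": "in progress", "retry_after": 1}

    # Winner: do the work and mark the key completed in one transaction.
    with conn:
        body = do_work()
        conn.execute(
            "UPDATE idempotency_keys SET status = 'completed', response = ? "
            "WHERE account_id = ? AND endpoint = ? AND idem_key = ?",
            (json.dumps({"status": 201, "body": body}), *scope),
        )
    return 201, body
```

Calling `handle` twice with the same `(account_id, endpoint, idem_key)` runs `do_work` once and replays the stored response the second time; the same key under a different account is a separate claim.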

package/content/skills/mutation-test-runner.md ADDED

@@ -0,0 +1,64 @@
+---
+name: "mutation-test-runner"
+description: "Measure whether a test suite actually catches bugs by running mutation testing — introduce small faults into the code and check which ones a test kills versus which slip through silently. Use when line coverage is high but bugs still ship, when you suspect tests assert weakly, or to find the exact assertions a suite is missing."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Line coverage tells you a line ran during a test. It does not tell you the test would fail if that line were wrong — a function can be 100% covered by an assertion-free test. Mutation testing closes that gap: it plants small faults in the code (flip `>` to `>=`, swap `+` for `-`, drop a statement, negate a condition) and re-runs the suite against each one. A mutant that makes a test fail is **killed** — the suite pins that behavior. A mutant that passes everything **survives** — no test noticed the code changed, so that behavior is unprotected. This skill runs a mutation tool, reads the survivors as a precise to-do list of missing assertions, and tells you exactly which tests to add to kill them.
+## When to use this skill
+- Coverage is high (80–100%) but bugs still slip into production — the classic symptom of covered-but-unasserted code.
+- You inherited or reviewed a suite and suspect the tests assert weakly (snapshot-only, no return-value checks, `toBeDefined` instead of `toEqual`).
+- A module is critical (auth, money, parsing, pricing) and you want proof the suite would catch a regression, not just that it touches the lines.
+- You're hardening a specific change and want the missing assertions for *that diff*, not a repo-wide audit.
+> [!WARNING]
+> 100% line coverage with surviving mutants is the false confidence this skill exists to expose: the code runs in a test, but no assertion would fail if the code were wrong. A green coverage badge is not a green mutation score.
+## Instructions
+1. **Pick the tool for the language — don't guess, check what's installed.** Inspect deps and config first:
+   - JS/TS: **Stryker** (`@stryker-mutator/core`, config `stryker.conf.json`/`.mjs`); it auto-detects Jest/Vitest/Mocha runners.
+   - Python: **mutmut** (`mutmut run`, config in `setup.cfg`/`pyproject.toml`) or **cosmic-ray** for larger suites.
+   - Java/Kotlin: **PIT** (`pitest`, Maven/Gradle plugin). Go: **go-mutesting** or **gremlins**. Ruby: **mutant**. C#: **Stryker.NET**.
+   - If no tool is installed, recommend the standard one for the stack and stop there — do not silently add a dev dependency.
+2. **Scope the run to changed code — this is mandatory, not an optimization.** Mutation testing re-runs the full suite once per mutant, so a whole-repo run can take hours. Target the diff or a single package: Stryker `--mutate "src/pricing/**/*.ts"` (or incremental mode, `--incremental`, on recent StrykerJS versions; `--since main` is the Stryker.NET equivalent), mutmut `--paths-to-mutate src/billing/`, PIT `targetClasses` set to the changed package. State the chosen paths up front so the run is reproducible.
+3. **Run and collect the surviving mutants, not the summary number.** Execute the tool and read its detailed report (Stryker's `mutation.html`/`--reporters json`, mutmut `mutmut results` + `mutmut show <id>`, PIT's `mutations.xml`). For each survivor capture: file, line, the original code, and the exact mutation that lived (e.g. `boundary: changed >= to >` or `removed call to logAudit()`).
+4. **Triage each survivor: real gap or equivalent mutant.** An **equivalent mutant** changes the code without changing observable behavior — e.g. `i <= n-1` vs `i < n`, reordering commutative operations, mutating a value that's overwritten before use. These *cannot* be killed by any test; mark them `equivalent — ignore` with a one-line reason and move on. Everything else is a genuine gap: a behavior your tests don't constrain.
+5. **For each real survivor, name the assertion that would kill it.** This is the payoff. A survived `changed > to >=` on a discount threshold means no test exercises the exact boundary — propose "`applyDiscount(qty=10)` where the rule is `qty > 10`: assert no discount at exactly 10." A survived `removed call to audit()` means nothing asserts the side effect — propose "assert `auditLog` received one entry after `transfer()`." Write the input and the expected behavior, not "add a test for line 42."
+6. **Group survivors by file and track the score where it's worth defending.** Report the mutation score (killed / total non-equivalent) per scoped path as a *baseline to hold or raise on critical modules*, never as a vanity 100% target — chasing the last few percent usually means fighting equivalent mutants. Record the baseline so the next run can detect regressions.
+> [!NOTE]
+> Two survivors that share a root cause often need one assertion. A function where every arithmetic and boundary mutant survives usually has a single test that calls it and asserts only that it didn't throw — adding one real return-value assertion can kill the whole cluster at once.
+> [!WARNING]
+> If a mutation run "passes" with zero survivors but also shows mutants marked **no coverage** or **timeout**, the suite isn't strong — those mutants were never actually tested. No-coverage mutants are a coverage gap (hand them to `coverage-gap-finder`); timeouts often mean a mutant created an infinite loop the suite can't detect. Don't read them as kills.
+## Output
+A survivor report grouped by file, plus the run scoping so it's reproducible:
+```
+Scope: src/billing/**  (mutated 47 mutants, 90s)
+Mutation score: 81%  (34 killed / 42 non-equivalent) — baseline, hold >=80 on billing
+src/billing/discount.ts
+  SURVIVED  L23  changed `qty > 10` -> `qty >= 10`   [BOUNDARY]
+    Gap: no test hits the exact threshold.
+    Add: applyDiscount({ qty: 10 }) -> assert price unchanged (no discount at boundary)
+  SURVIVED  L31  removed call to `roundCents(total)`  [STATEMENT]
+    Gap: nothing asserts the rounded result.
+    Add: applyDiscount({ qty: 12, price: 3.337 }) -> assert total === 33.37 (not 33.3696)
+src/billing/invoice.ts
+  SURVIVED  L58  changed `&&` -> `||` in isOverdue guard  [LOGICAL]
+    Gap: only the both-true case is tested.
+    Add: isOverdue({ pastDue: true, paid: true }) -> assert false
+  EQUIVALENT L72  `i <= len-1` -> `i < len`  — ignore (same iteration count)
+No-coverage: 5 mutants in src/billing/legacy.ts -> route to coverage-gap-finder (not killed).
+```
+Each surviving line is a missing assertion; the `Add:` lines are concrete enough to hand straight to a test scaffolder. Re-run the same scope after adding them to confirm the survivors flip to killed and the score holds.
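
The killed-versus-survived distinction can be shown by hand, without any mutation tool. The sketch below is an illustration only: `apply_discount`, its 10% rate, and the two test functions are hypothetical stand-ins for the `qty > 10` discount rule in the sample report, and the mutant is written out manually where a real tool would generate it.

```python
def apply_discount(qty: int, unit_cents: int) -> int:
    """Original rule: 10% off only when qty is strictly greater than 10."""
    total = qty * unit_cents
    return total - total // 10 if qty > 10 else total


def apply_discount_mutant(qty: int, unit_cents: int) -> int:
    """Boundary mutant: `>` changed to `>=`, as a mutation tool would do."""
    total = qty * unit_cents
    return total - total // 10 if qty >= 10 else total


def weak_test(fn) -> bool:
    """Checks values far from the threshold only; passes for both versions."""
    return fn(20, 100) == 1800 and fn(1, 100) == 100


def boundary_test(fn) -> bool:
    """Pins the exact threshold: no discount at qty == 10."""
    return fn(10, 100) == 1000


if __name__ == "__main__":
    for name, test in [("weak_test", weak_test), ("boundary_test", boundary_test)]:
        verdict = "survived" if test(apply_discount_mutant) else "killed"
        print(f"{name}: mutant {verdict}")
```

Both tests pass against the original, so both give full line coverage; only `boundary_test` fails against the mutant, which is the assertion the survivor report is asking for.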

package/content/skills/query-plan-analyzer.md ADDED

@@ -0,0 +1,49 @@
+---
+name: "query-plan-analyzer"
+description: "Read a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A slow query is almost never slow for the reason you'd guess from reading the SQL. The plan is the ground truth: it shows the database actually chose a Seq Scan over the 40-million-row table, actually fed 500,000 rows into a Nested Loop that estimated 5, actually sorted on disk because no index could supply the order. This skill pulls the **real** plan — `EXPLAIN ANALYZE` with `BUFFERS`, not bare `EXPLAIN` — reads it from the most expensive node outward, names the one node that's costing the time and *why*, and turns that into a specific fix: the index to add (with the right column order), the rewrite that makes the predicate sargable, or the `ANALYZE` that fixes the estimate. Then it re-runs the plan to prove the bad node is gone instead of declaring victory from theory.
+## When to use this skill
+- One specific query (an endpoint, a report, a dashboard panel) is slow and you need the cause, not a vague "add some indexes."
+- A query that was fast got slow after a data-volume change, a deploy, or a schema/index change.
+- The planner is doing something surprising — a Seq Scan despite an index existing, or ignoring the index you just added.
+- p99 latency on one query is high while the table and load look unremarkable, and you suspect the plan rather than the hardware.
+- Before shipping a new query or a `migration-writer` index change, to verify the plan is what you intended.
+## Instructions
+1. **Get the table shape and existing indexes before touching the plan.** Read the schema for the queried tables: column types, the existing indexes and their column order, row counts (`SELECT reltuples FROM pg_class`, or `\d+`), and whether stats are fresh (`pg_stat_user_tables.last_analyze` / `n_mod_since_analyze`). Grep the codebase for where the query is built so you tune the real SQL (including how parameters bind), not a hand-typed approximation.
+2. **Run the REAL plan with actual rows, timing, and I/O — never bare EXPLAIN.** Use `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)` in Postgres (`EXPLAIN ANALYZE`, which emits `FORMAT=TREE` output, in MySQL 8.0.18+). `ANALYZE` *executes* the query and reports actual rows + per-node `actual time`; `BUFFERS` shows shared/local hits vs. reads (heavy `read=` means I/O, not CPU, is the cost). Run it 2–3 times so a cold-cache first run doesn't masquerade as a planning problem. For a write query, wrap it in a transaction and `ROLLBACK` so `ANALYZE` doesn't mutate data.
+3. **Read from the most expensive node outward — find where the time actually is.** In the text plan, `actual time=start..end` is cumulative and inclusive of children, and it is a per-loop average: multiply it by `loops` to get the node's total. The time a node *adds* is its own total minus its children's totals. Find the deepest/innermost node whose `actual time` and `loops × rows` dominate the total. That node — not the top of the plan — is what you fix. Note its `actual rows`, `loops`, and the `Rows Removed by Filter` line.
+4. **Check the estimate-vs-actual gap FIRST — a wide gap means stale stats, and that's the real bug.** Compare each node's estimated rows (`rows=`) to `actual rows`. A gap of more than ~10x (e.g. plans for 5 rows, processes 50,000) means the planner is choosing strategy on bad information — usually stale statistics. **Fix this before adding any index:** run `ANALYZE <table>;` (or `ANALYZE` the whole DB) and re-pull the plan. Often the plan corrects itself once estimates are right, and an index you'd have added would have been the wrong one.
+5. **Match the symptom to the culprit, then to the fix:**
+   - **Seq Scan on a large table with a selective predicate** → the predicate filters to few rows but there's no usable index. Add a b-tree on the filtered column(s). (A Seq Scan returning most of the table is *correct* — don't index it.)
+   - **Nested Loop with high `loops` over many outer rows** → the join is iterating per-row when it should batch. The cause is usually a bad row estimate (see step 4) or a missing join-key index; a corrected estimate or an index on the inner join column lets the planner pick a Hash/Merge Join.
+   - **Sort (especially `Sort Method: external merge  Disk:`)** → the query sorts at runtime and spills to disk. A b-tree index in the `ORDER BY` order can supply rows pre-sorted, removing the Sort node entirely (and powering `LIMIT` early-exit).
+   - **High `Rows Removed by Filter`** → the database fetched far more rows than it kept; the filter ran *after* the scan instead of being pushed into an index. Move the discriminating column into the index so it's a condition, not a post-filter.
+   - **Heavy `Buffers: ... read=`** → the working set isn't cached; a smaller/covering index reduces pages touched, or the data genuinely doesn't fit memory.
+6. **Check index sargability — an index the predicate can't use is no fix at all.** A b-tree is defeated by a function or cast on the column (`lower(email) = ?`, `date(created_at) = ?`, `col::text = ?`), by a leading-wildcard `LIKE '%x'`, and by an `OR` across different columns. The fix is a matching **expression index** (`CREATE INDEX ... ON t (lower(email))`), a rewrite to a range (`created_at >= d AND created_at < d+1`), or `UNION`-ing the `OR` branches — not a plain index on the raw column.
+7. **Order multi-column index columns for the predicate, then the sort.** Put equality-predicate columns first (leftmost), then the range/inequality column, then `ORDER BY` columns — so one index serves both the filter and the ordering. A column used only for a range can't have an equality column usefully placed after it. State the exact `CREATE INDEX` DDL, including `INCLUDE`d columns if a covering index would turn an Index Scan into an Index-Only Scan.
+8. **Re-run `EXPLAIN ANALYZE` after the fix and confirm the bad node is gone.** Apply the fix (in Postgres, build the index `CONCURRENTLY` to avoid a write lock; `migration-writer` can wrap the DDL). Re-pull the plan and verify the offending node changed type (Seq Scan → Index Scan, Nested Loop → Hash Join, Sort → no Sort) and that total `actual time` dropped. If the planner *ignores* the new index, run `ANALYZE` and re-check sargability before concluding the index is wrong.
+> [!WARNING]
+> Bare `EXPLAIN` shows the planner's *guess*, not reality — it never runs the query, so it can't reveal a Nested Loop that estimated 5 rows and processed half a million, or which node actually burned the time. Diagnose with `EXPLAIN ANALYZE` every time; tuning from estimates is how you add the wrong index.
+> [!WARNING]
+> A wide estimated-vs-actual row gap (>10x) means stale statistics, and that is the root cause — fix it with `ANALYZE` *before* adding indexes. An index chosen to compensate for a bad estimate is often useless or harmful once the estimate is corrected, and you'll have shipped a write-amplifying index that the planner ignores.
+> [!NOTE]
+> `EXPLAIN ANALYZE` executes the statement. For `INSERT`/`UPDATE`/`DELETE`, run it inside `BEGIN; ... ROLLBACK;` so diagnosis doesn't change data — and be aware it still fires triggers and acquires locks during the run.
+## Output
+A short report with three parts:
+1. **Annotated plan** — the offending node quoted from the `EXPLAIN ANALYZE` output, with its `actual rows` vs. estimate, `loops`, `Rows Removed by Filter`, and `Buffers`, plus a one-line statement of *why* it's the bottleneck (Seq Scan / stale-stats row gap / Nested Loop blowup / disk Sort / non-sargable predicate).
+2. **The specific fix** — exact `CREATE INDEX ... CONCURRENTLY` DDL with the column order justified, or the SQL rewrite, or the `ANALYZE <table>` command. One concrete action, not a menu.
+3. **Before/after proof** — total `actual time` and the changed node type from the re-run plan (e.g. `Seq Scan 1240 ms → Index Scan 3 ms`), confirming the bad node is gone rather than asserting it should be.
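
The estimate-versus-actual check in step 4 and the per-loop timing arithmetic in step 3 can be sketched as a small parser over Postgres text-format plan lines. This is an illustrative helper under stated assumptions: `analyze_plan` and the sample plan are hypothetical, the regular expression covers only the common `(cost=... rows=... width=...) (actual time=... rows=... loops=...)` node line, and the 10x gap is the rule of thumb from the text rather than a planner constant.

```python
import re

NUM = r"\d+(?:\.\d+)?"
# Matches one node line of EXPLAIN (ANALYZE) text output.
NODE_RE = re.compile(
    r"\s*(?:->\s*)?(?P<node>.+?)\s+"
    r"\(cost=" + NUM + r"\.\." + NUM + r" rows=(?P<est>" + NUM + r") width=\d+\) "
    r"\(actual time=" + NUM + r"\.\.(?P<end>" + NUM + r") rows=(?P<act>" + NUM
    + r") loops=(?P<loops>\d+)\)"
)


def analyze_plan(plan_text: str, gap: float = 10.0) -> list:
    """Return one finding per plan node: row-estimate gap and total time."""
    findings = []
    for line in plan_text.splitlines():
        m = NODE_RE.match(line)
        if not m:
            continue  # skip detail lines such as "Filter:" or "Buffers:"
        est, act = float(m["est"]), float(m["act"])
        loops = int(m["loops"])
        ratio = max(est, act) / max(min(est, act), 1.0)
        findings.append({
            "node": m["node"],
            "est_rows": est,
            "actual_rows": act,
            # actual time is a per-loop average, so scale by loops.
            "total_ms": float(m["end"]) * loops,
            "misestimated": ratio > gap,
        })
    return findings


# Hypothetical plan, shaped like the Nested Loop blowup described above.
SAMPLE_PLAN = """\
Nested Loop  (cost=0.43..1250.10 rows=5 width=64) (actual time=0.050..1840.200 rows=500000 loops=1)
  ->  Seq Scan on orders o  (cost=0.00..431.00 rows=5 width=32) (actual time=0.020..95.000 rows=500000 loops=1)
  ->  Index Scan using customers_pkey on customers c  (cost=0.43..8.45 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=500000)
"""

if __name__ == "__main__":
    for f in analyze_plan(SAMPLE_PLAN):
        flag = "MISESTIMATED" if f["misestimated"] else "ok"
        print(f"{f['node']}: est={f['est_rows']:.0f} actual={f['actual_rows']:.0f} "
              f"total={f['total_ms']:.1f} ms [{flag}]")
```

In the sample, the inner Index Scan looks cheap at 0.003 ms per loop but accounts for roughly 1500 ms once multiplied by its 500,000 loops, and the two nodes estimated at 5 rows are flagged as candidates for `ANALYZE` before any index is added.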

package/content/skills/runbook-writer.md ADDED

@@ -0,0 +1,83 @@
+---
+name: "runbook-writer"
+description: "Write an operational runbook a half-asleep on-call engineer can execute at 3am — scoped to ONE alert, leading with how to confirm the problem, the copy-pasteable mitigation that stops user pain, then diagnosis, escalation, and verification. Use when an alert has no documented response, after an incident exposed a missing procedure, or when standing up on-call for a service."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Write the document the on-call engineer opens when a pager fires at 3am — and can actually follow. The skill takes one alert or symptom and produces a runbook in the order a responder needs it: **confirm → mitigate → diagnose → escalate → verify**. It mines the repo for the real commands, dashboards, and service names, writes each step as a literal instruction with its expected output ("run X; if you see Y, do Z"), and front-loads the mitigation that stops user pain *before* any investigation. The result stops bleeding first and explains second.
+## When to use this skill
+- An alert fires with no documented response — the responder is reverse-engineering the system at the worst possible time.
+- A postmortem found that recovery was slow because the procedure lived only in one person's head.
+- You're onboarding on-call for a service and need a runbook per page-worthy alert before the rotation starts.
+- An existing runbook is prose-heavy ("investigate the root cause") and unusable under stress.
+## Instructions
+1. **Scope to ONE symptom — refuse the generic doc.** A runbook answers exactly one page: `HighErrorRate on checkout-api`, `ReplicaLag > 30s`, `DiskUsage > 90% on db-primary`. If the user asks for an "operations runbook," push back and split it — one alert per file. Name it after the alert that links to it (`docs/runbooks/checkout-api-high-error-rate.md`), so the pager's "runbook" link lands here. Search existing alert rules (`grep -ri "alert\|expr:" prometheus*.yml *.rules.yml`) to use the alert's exact name.
+2. **Open with the fast path, not background.** The first thing on the page is a one-line summary of what's broken and the user impact ("Checkout returns 500s — customers can't pay"), then a **TL;DR mitigation** block: the single command that most often stops the pain. The responder should be able to act from the top of the file without scrolling. Save architecture and theory for the bottom (or omit it).
+3. **Step 1 is always CONFIRM — is this real?** Give the exact way to verify the alert isn't a flapping false positive: the literal dashboard URL, the PromQL/log query to paste, or the curl/CLI command, plus the expected output that means "yes, real." Mine the repo for these — read dashboard JSON, `*.rules.yml`, health-check endpoints, and `Makefile`/`justfile` targets — rather than inventing command names. Example: `kubectl -n prod get pods -l app=checkout-api` → "all should be `Running`; `CrashLoopBackOff` confirms the alert."
+4. **Step 2 is MITIGATE — stop the bleeding before diagnosing.** This is the most important section and it comes *before* root-cause work. Give the copy-pasteable command to roll back, fail over, restart, scale up, or feature-flag-off — with real paths, namespaces, and service names from the repo. State what each command does and how to know it worked. Order options by safety and speed (rollback to last-good deploy usually beats live debugging). Never make the reader derive the command.
+5. **Step 3 is DIAGNOSE — only now look for cause.** Numbered, branching steps in `run X → if you see Y → do Z` form. Every step is a literal command with expected output and the decision it drives. No step may say "investigate," "look into," "check if there's an issue," or any phrase that offloads a judgment call onto a stressed human — convert each into a concrete check with a concrete next action. Link the relevant logs query, trace view, and the service's SLO/error-budget dashboard.
+6. **Write ESCALATE with names and triggers.** State exactly *when* to page the next person and *who*: "If mitigation hasn't restored success rate within 15 min, page the #payments on-call via PagerDuty service `checkout-api`." Include the secondary/owning team, any vendor support path, and the threshold (duration, error count, blast radius) that makes escalation mandatory rather than optional.
+7. **End with VERIFY — confirm recovery, don't assume it.** Give the explicit check that service is restored: the same dashboard/query from step 1 showing healthy values, with the threshold to watch ("error rate back under 0.5% for 5 consecutive minutes"). Include any cleanup (re-enable the flag you turned off, scale back down) and a one-line prompt to capture timeline notes for the postmortem.
+8. **Keep every command current and report assumptions.** Verify each command against the repo (binary names, namespaces, flags, env). Flag any command you could not confirm against a real file so the user tests it before relying on it. A command you guessed is worse than no command — it sends the responder down a dead end at 3am.
+> [!WARNING]
+> A runbook full of "investigate the issue" or "check the logs and determine the cause" is useless at 3am — it just restates the panic. Every step must be a literal command with an expected output and an explicit next action. Equally, a runbook with a stale or never-executed command fails at the exact moment it's needed: treat unverified commands as bugs, and have someone dry-run the mitigation path in staging before trusting it.
+## Output
+A single Markdown file at `docs/runbooks/<alert-name>.md` for one symptom, ordered **confirm → mitigate → diagnose → escalate → verify**, with a TL;DR mitigation at the top, literal copy-pasteable commands, expected outputs, decision branches, and links to the dashboard / logs / trace view / SLO. The skill reports the file path and any command it could not verify against the repo.
+```markdown
+# Runbook: checkout-api — HighErrorRate
+**Impact:** Checkout returns 500s — customers cannot complete payment.
+**Alert:** `HighErrorRate{service="checkout-api"}` (fires at 5xx > 2% for 3m)
+**Dashboard:** https://grafana.internal/d/checkout-api/overview
+## TL;DR mitigation
+Roll back to the last-good deploy — fixes ~80% of these pages:
+    kubectl -n prod rollout undo deployment/checkout-api
+Success rate should recover within ~2 min on the dashboard above.
+## 1. Confirm it's real
+    kubectl -n prod get pods -l app=checkout-api
+Expect all `Running`. Any `CrashLoopBackOff`/`Error` confirms the alert.
+Cross-check the 5xx panel: https://grafana.internal/d/checkout-api/overview
+## 2. Mitigate (stop the bleeding)
+1. If a deploy went out in the last hour → `kubectl -n prod rollout undo deployment/checkout-api`.
+2. If pods are healthy but the DB is the source → fail over reads:
+   `kubectl -n prod set env deployment/checkout-api READ_REPLICA=db-replica-2`
+3. If a downstream dependency is down → disable checkout behind the flag:
+   `curl -XPOST https://flags.internal/api/checkout_enabled -d '{"value":false}'`
+Confirm recovery on the dashboard before moving on.
+## 3. Diagnose
+- Run `kubectl -n prod logs -l app=checkout-api --since=10m | grep -i error`.
+  If you see `connection refused: payments-svc` → page payments (step 4).
+  If you see `pq: too many connections` → scale the pool: `kubectl -n prod set env deployment/checkout-api DB_POOL_MAX=40`.
+- Traces: https://tempo.internal/explore?service=checkout-api
+- SLO / error budget: https://grafana.internal/d/checkout-api/slo
+## 4. Escalate
+If success rate is not restored within 15 min, page **#payments on-call**
+via PagerDuty service `checkout-api`. For DB failover that won't recover,
+page **#platform-db**. Vendor (Stripe) status: https://status.stripe.com
+## 5. Verify
+- 5xx rate back under 0.5% for 5 consecutive minutes on the dashboard.
+- Re-enable any flag you toggled: `curl -XPOST .../checkout_enabled -d '{"value":true}'`.
+- Note start/detect/mitigate/resolve timestamps for the postmortem.
+```

package/content/skills/semantic-cache-designer.md ADDED

@@ -0,0 +1,40 @@
+---
+name: "semantic-cache-designer"
+description: "Design a semantic cache for LLM responses — serve a cached answer when a new query is similar enough to a past one — to cut cost and latency on repetitive traffic, with the similarity threshold calibrated on real query pairs and a cache key that prevents cross-user/model leaks. Use when an LLM app sees many near-duplicate prompts (FAQs, support, search), when token spend on repetitive queries is high, or when latency on common questions matters."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A semantic cache turns "I've answered this before" into a skipped LLM call: embed the incoming query, find the nearest past query, and if it's close enough, return that cached answer. Done right it slashes cost and tail latency on FAQ/support/search traffic. Done wrong it confidently returns a *different* question's answer, or leaks one user's answer to another. This skill makes the two load-bearing decisions — the similarity threshold and the cache key — explicit and calibrated, instead of trusting a vibe-picked cosine cutoff.
+## When to use this skill
+- An LLM app gets many near-duplicate prompts — FAQs, support tickets, product search, "explain X" — and most calls re-derive the same answer.
+- Token spend is dominated by repetitive traffic and you want to stop paying for the same completion twice.
+- Latency on common questions matters (p50/p95) and a cache hit would return in milliseconds instead of seconds.
+- You're about to bolt a `GPTCache`-style layer onto a RAG or chat app and need the threshold/key/TTL decided before it ships.
+## Instructions
+1. **Pin down what a "correct hit" means before touching code.** A hit is only correct if the cached answer would still be the *right* answer for the new query. Write down the inputs that change the correct answer beyond the query text — user/tenant, locale/language, the retrieved-context version (for RAG), the model + version, system-prompt version, and any personalization. This list becomes the cache key in step 5; everything else flows from it.
+2. **Design the lookup.** Embed the incoming query with the *same* model and input-type used for the stored queries (a query/document asymmetry mismatch quietly wrecks similarity — see `embedding-set-inspector`). Look up the single nearest stored entry by vector similarity (cosine on normalized vectors), scoped to the exact key from step 5. Return the cached answer only if `similarity >= threshold`; otherwise it's a miss → call the LLM and write the new entry.
+3. **Calibrate the threshold on real query pairs — do not pick it from a blog post.** Pull ~100-300 query pairs from production logs and label each pair as "same intent / cached answer is correct" or "different intent / would be wrong." Sweep the threshold (e.g. 0.80→0.97) and at each value compute false-hit rate (returned a wrong answer) and false-miss rate (missed a valid reuse). Pick the threshold from this curve, not by feel.
+4. **Bias toward false-miss when a wrong answer is costly.** A false miss costs one extra LLM call; a false hit ships a confidently wrong answer to a user. For support/medical/financial/legal surfaces, choose the stricter threshold even if hit rate drops — a missed hit is cheap, a wrong hit is a trust incident.
+5. **Build the full cache key — never key on query text alone.** Namespace the cache (or the embedding lookup) by every input from step 1: `tenant + locale + model@version + prompt@version + context@version`. Personalized or per-user answers must include the user/tenant in the key. Omitting any of these is how you serve user A's answer to user B, or a `claude-opus-4` answer out of a `claude-haiku` cache after a model swap.
+6. **Set TTL and invalidation for answers that go stale.** Static facts can live long; RAG answers over changing data must expire (or be invalidated) when the underlying documents change — tag entries with the `context@version`/document IDs they depended on and evict on update. Time-sensitive answers ("current status", "today's price") get a short TTL or land in the no-cache list (step 7).
+7. **Decide explicitly what NOT to cache.** Exclude personalized/account-specific answers that lack a per-user key, time-sensitive or real-time responses, stateful/multi-turn replies that depend on conversation history, and anything with side effects (tool calls, writes). Caching these is worse than no cache. Write the no-cache predicate down as a rule, not a hope.
+8. **Measure hit *quality*, not just hit rate.** Track cache hit rate, token/cost saved, and latency delta — but also sample a slice of live hits (e.g. 1-2%) and judge whether the cached answer was actually right for the new query (LLM-as-judge or human review). Report false-hit rate as a first-class metric. A 60% hit rate that's 10% wrong is worse than a 35% hit rate that's clean.
+> [!WARNING]
+> A too-loose threshold is the signature failure of semantic caching: "How do I cancel my subscription?" and "How do I cancel my *order*?" are highly similar in embedding space, so the cache serves a fluent, confident answer to the *wrong* question. The user can't tell it's a stale match. Always validate the threshold against labeled different-intent pairs, not just same-intent ones.
+> [!WARNING]
+> Omitting context/user/model from the cache key leaks answers across boundaries — across users (privacy incident), across locales (wrong language), or across model/prompt versions (you keep serving the old model's answers after a deploy). The key must change whenever the correct answer would change.
+## Output
+- **Lookup design** — embedding model + input-type, similarity metric, nearest-neighbor scoping, and the hit/miss decision rule.
+- **Calibrated threshold** — the chosen value plus the false-hit / false-miss curve it came from and the labeled query-pair set used (and the false-miss bias rationale if applicable).
+- **Full cache key** — the exact composite key (`tenant + locale + model@version + prompt@version + context@version + user`), with a note on which fields apply to this app.
+- **TTL + invalidation + no-cache rules** — per-class TTLs, the document-version invalidation trigger for RAG entries, and the explicit no-cache predicate.
+- **Metrics** — hit rate, token/cost saved, latency delta, and the sampled hit-quality / false-hit measurement to track in production.
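
The composite key, the no-cache predicate, and the hit rule listed above can be sketched in a few lines. This is a minimal Python illustration and not part of the skill file: the `CacheScope` class, the `scope_key`, `is_cacheable`, and `is_hit` helpers, and every field name are hypothetical, and the threshold is assumed to come from the calibration step.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheScope:
    """Fields that partition the cache (all names are illustrative)."""
    tenant: str
    locale: str
    model_version: str
    prompt_version: str
    context_version: str
    user: Optional[str] = None  # set only for personalized answers


def scope_key(scope: CacheScope) -> str:
    """Hash the composite key so an entry can only match inside its own scope."""
    user_part = f"user:{scope.user}" if scope.user is not None else "shared"
    parts = [scope.tenant, scope.locale, scope.model_version,
             scope.prompt_version, scope.context_version, user_part]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def is_cacheable(*, personalized: bool, has_user_key: bool,
                 time_sensitive: bool, multi_turn: bool,
                 has_side_effects: bool) -> bool:
    """The explicit no-cache predicate: any excluded class skips the cache."""
    if personalized and not has_user_key:
        return False
    return not (time_sensitive or multi_turn or has_side_effects)


def is_hit(similarity: float, threshold: float,
           entry_key: str, query_key: str) -> bool:
    """A hit needs an exact scope match AND similarity at or above the threshold."""
    return entry_key == query_key and similarity >= threshold
```

Scoping the nearest-neighbor search by `scope_key` before comparing similarity means a model or prompt version bump changes the key, so entries written under the old version stop matching without a separate purge.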

package/content/skills/strangler-fig-migrator.md ADDED

@@ -0,0 +1,47 @@
+---
+name: "strangler-fig-migrator"
+description: "Plan the incremental replacement of a legacy module or service using the strangler-fig pattern — grow new code around the old behind an interception seam until the old is dead, instead of a big-bang rewrite. Use when a legacy system is too risky to rewrite at once, or when migrating off a deprecated framework/dependency gradually while staying shippable and rollback-able at every step."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Replace a legacy module or service the way a strangler fig kills its host tree — by growing new code around the old until the old carries no load and can be cut away. The skill's first and most important move is to find the **interception seam**: the single place where calls can be diverted to either the old or the new implementation. Everything else (slicing, parallel-running, decommissioning) hangs off that seam. Without it, "incremental migration" silently becomes a big-bang rewrite with extra ceremony.
+## When to use this skill
+- A legacy system is load-bearing and too risky to rewrite all at once — a flag-day cutover would mean a long branch, a scary deploy, and no clean rollback.
+- You're migrating off a deprecated framework, library, or service (an ORM, an auth provider, a payments SDK, a monolith you're peeling into services) and want to move capability by capability.
+- The legacy code has no tests or unclear behavior, so the only trustworthy spec is "what it currently does" — you need to run new alongside old and compare.
+- Stakeholders need the system shippable and reversible the entire time, not dark for months behind a feature branch.
+> [!WARNING]
+> If you cannot find or build a clean interception seam, stop and reconsider. A migration where callers reach deep into legacy internals — not through one front door — cannot be routed incrementally. You will end up rewriting everything before you can flip anything, which is a big-bang rewrite wearing a strangler-fig costume. Creating the seam (a facade callers go through) is the *first deliverable*, sometimes a whole milestone of its own.
+## Instructions
+1. **Locate or create the interception seam first.** Find the single chokepoint where calls into the legacy unit can be diverted: a facade/adapter the callers already go through, a network proxy/router (reverse proxy, API gateway, service mesh route), or a feature-flag branch in code. Use `Grep`/`Glob` to map every caller of the legacy unit — if they all funnel through one interface, that's your seam; if they reach in twenty different ways, your first job is to introduce a facade they all route through *before* writing any new implementation. The seam must be able to send a call to old OR new and be flipped at runtime (config/flag), not at deploy time.
+2. **Inventory and slice the surface.** List the capabilities behind the seam (endpoints, methods, message types) with, for each, its call volume, blast radius if it breaks, and how self-contained it is (shared state, shared DB tables, downstream side effects). This is your migration backlog. Do not migrate by file or by "module size" — migrate by capability slice, because a slice is what the seam can route independently.
+3. **Carve off the smallest valuable slice first.** Pick the slice that is most self-contained and lowest-blast-radius — a read-only endpoint, an idempotent operation, an internal report — not the gnarliest core path. Implement it new behind the seam. The goal of slice one is to prove the *seam and the verification mechanism work end to end*, not to deliver the hardest functionality. Save the high-risk, high-coupling slices for after the machinery is trusted.
+4. **Run old and new in parallel and verify equivalence before shifting load.** Before routing real traffic to the new path, run it in **shadow mode**: send the live request to both, return the old result to the caller, and compare the new result off to the side (log/metric the diffs). Define equivalence concretely per slice: exact response match, match modulo known-acceptable differences (ordering, timestamps, formatting), or statistical match on key business metrics when outputs are non-deterministic. Start serving the new path for real only after the diff rate has stayed at or below an agreed threshold over a representative window.
+5. **Shift traffic gradually and keep rollback one flip away.** Route a small fraction to the new implementation (a percentage, an allowlist of internal users, one tenant), watch error rate / latency / business metrics against the old baseline, and ramp only while they hold. The seam from step 1 makes the rollback trivial: if the new path misbehaves, flip the route back to legacy — no deploy, no revert. Treat every ramp as reversible; never remove the old path while it's still the fallback.
+6. **Migrate slice by slice, keeping the system shippable throughout.** Repeat steps 3–5 for the next slice. After each slice fully cuts over, the system is in a valid, releasable state with some capabilities on new and some on old — that is the point. Sequence so that you never half-migrate a slice that shares mutable state with an unmigrated one; if two slices write the same table, plan a shared-data strategy (dual-write with new as follower, or migrate the data owner first) before splitting their routing.
+7. **Decommission the legacy only once it is provably dead.** A slice's old code is a candidate for deletion only when: the seam routes 100% to new, the route has been pinned there long enough to cover the full usage cycle (including weekly/monthly/seasonal jobs and rare error paths), and instrumentation shows **zero** hits on the legacy path. Confirm deadness with evidence — access logs, a counter/log line on the old code path showing no calls, `Grep` proving no remaining static references — then remove the old implementation and the now-redundant routing in a final isolated step. Keep the seam until the very last slice is gone.
+> [!WARNING]
+> Deleting legacy code before confirming it's truly dead causes outages, not cleanup. "We migrated that months ago" is not evidence — a quarterly batch job, an admin tool, or a rare error branch can be the only remaining caller. Require positive proof of zero traffic (a metric/log over a full usage period) plus a static-reference search before any deletion. When in doubt, leave the dead branch behind the seam one more cycle; cold code is cheap, an outage is not.
+## Output
+1. **Interception seam design** — what the seam is (facade/adapter, proxy/router, or feature flag), where it sits relative to the callers, how it decides old-vs-new (config key / flag / percentage), and how it's flipped and rolled back at runtime. Includes the list of legacy callers found and whether they already route through one door or need a facade introduced first.
+2. **Slice-by-slice migration order** — the capability backlog as an ordered table, smallest/safest first, with the rationale for the sequence and any shared-data dependencies that force ordering:
+   | Order | Slice (capability) | Volume | Blast radius | Coupling / shared state | Why this position |
+   | --- | --- | --- | --- | --- | --- |
+   | 1 | `GET /report/summary` (read-only) | low | low | none | proves seam + verification end-to-end |
+   | 2 | `POST /events` (idempotent write) | high | medium | none | high volume, safe to shadow |
+   | 3 | `POST /orders` (core path) | high | high | shares `orders` table w/ #4 | after machinery trusted; pair with #4 |
+3. **Parallel-run verification method** — per slice: shadow-mode comparison plan, the concrete equivalence definition (exact / modulo-known-diffs / statistical), the diff threshold and observation window required before serving new, and the metrics watched during ramp (error rate, latency, business KPI vs. legacy baseline) with the ramp schedule (e.g. shadow → 1% → 10% → 50% → 100%).
+4. **Decommission criteria** — the exact gate for deleting each slice's legacy code: 100% routed to new, pinned for one full usage cycle, instrumented zero-traffic proof, and a clean static-reference search — plus the final-step plan to remove the old implementation and retire the seam once the last slice is migrated.
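
The seam, shadow mode, and percentage ramp described above can be sketched as one small router. This is a minimal Python illustration and not part of the skill file: the `Seam` class, its `mode` and `ramp_percent` fields, and the exact-equality comparison are all assumptions, and a real seam would read its mode from runtime config or a flag service.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Handler = Callable[[Any], Any]


@dataclass
class Seam:
    """Single front door that routes each call to the legacy or new handler."""
    legacy: Handler
    new: Handler
    mode: str = "legacy"      # "legacy" | "shadow" | "ramp", flipped at runtime
    ramp_percent: int = 0     # share of routing keys served by `new` in ramp mode
    diffs: List[Dict[str, Any]] = field(default_factory=list)
    legacy_hits: int = 0      # evidence for the zero-traffic decommission gate

    def _bucket(self, routing_key: str) -> int:
        # Stable hash so the same tenant/user stays on the same side of the ramp.
        digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % 100

    def call(self, routing_key: str, request: Any) -> Any:
        if self.mode == "ramp" and self._bucket(routing_key) < self.ramp_percent:
            return self.new(request)
        self.legacy_hits += 1
        old_result = self.legacy(request)
        if self.mode == "shadow":
            # Assumes the slice is side-effect free; stub writes before shadowing.
            try:
                new_result = self.new(request)
                if new_result != old_result:  # "exact match" equivalence only
                    self.diffs.append(
                        {"request": request, "old": old_result, "new": new_result})
            except Exception as exc:  # a failure in new must never reach the caller
                self.diffs.append(
                    {"request": request, "old": old_result, "error": repr(exc)})
        return old_result  # the caller always gets the legacy answer in shadow mode
```

Rollback is setting `mode` back to `"legacy"` (or `ramp_percent` to 0), with no deploy. The `legacy_hits` counter is one way to collect the zero-traffic proof the decommission criteria ask for, although a production version would export it as a metric instead of holding it in memory.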