agentscamp 0.4.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/README.md +2 -2
  2. package/content/manifest.json +423 -2
  3. package/content/skills/agent-trajectory-evaluator.md +59 -0
  4. package/content/skills/alerting-rules-tuner.md +49 -0
  5. package/content/skills/canary-release-planner.md +35 -0
  6. package/content/skills/circular-dependency-breaker.md +48 -0
  7. package/content/skills/cold-start-optimizer.md +83 -0
  8. package/content/skills/commit-splitter.md +54 -0
  9. package/content/skills/contract-test-designer.md +70 -0
  10. package/content/skills/dashboard-designer.md +38 -0
  11. package/content/skills/deadlock-diagnoser.md +45 -0
  12. package/content/skills/devcontainer-designer.md +40 -0
  13. package/content/skills/distributed-tracing-instrumenter.md +42 -0
  14. package/content/skills/feature-flag-retirer.md +44 -0
  15. package/content/skills/flamegraph-analyzer.md +35 -0
  16. package/content/skills/git-blame-investigator.md +34 -0
  17. package/content/skills/graphql-schema-designer.md +49 -0
  18. package/content/skills/hallucination-evaluator.md +40 -0
  19. package/content/skills/idempotency-designer.md +47 -0
  20. package/content/skills/integration-test-designer.md +81 -0
  21. package/content/skills/model-router-designer.md +39 -0
  22. package/content/skills/mutation-test-runner.md +64 -0
  23. package/content/skills/onboarding-guide-writer.md +84 -0
  24. package/content/skills/query-plan-analyzer.md +49 -0
  25. package/content/skills/rbac-designer.md +82 -0
  26. package/content/skills/release-notes-writer.md +78 -0
  27. package/content/skills/runbook-writer.md +83 -0
  28. package/content/skills/semantic-cache-designer.md +40 -0
  29. package/content/skills/strangler-fig-migrator.md +47 -0
  30. package/content/skills/threat-model-builder.md +46 -0
  31. package/content/skills/token-usage-profiler.md +39 -0
  32. package/content/skills/web-vitals-optimizer.md +34 -0
  33. package/package.json +1 -1
@@ -0,0 +1,42 @@
1
+ ---
2
+ name: "distributed-tracing-instrumenter"
3
+ description: "Instrument a service (or a chain of services) with OpenTelemetry so a single request can be followed end-to-end — context propagated across every hop including async/queue boundaries, spans at the boundaries that matter, deliberate trace-wide sampling, and trace_id stamped on log lines. Use when latency or failures span multiple services, when you have logs but can't reconstruct a request's full path, or when adopting OpenTelemetry."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ You have logs in five services and a request that's slow, but no way to know it's slow *because* service C waited 800ms on a query that service A triggered three hops back — the lines aren't connected. Distributed tracing connects them: one trace ID threads through every service a request touches, each hop adds a timed span, and you read the whole waterfall in one view. The two things that make or break it are propagation (the context has to survive every hop, and it silently dies across async/queue boundaries) and span discipline (boundaries, not every function). This skill instruments against OpenTelemetry so you're not locked to a backend, fixes propagation at each hop, picks the spans worth having, samples whole traces consistently, and ties traces back to your logs.
9
+
10
+ ## When to use this skill
11
+
12
+ - A request is slow or failing and the cause spans multiple services — you can see each service's logs but can't reconstruct which call, in which order, cost the time.
13
+ - You have decent logs but reconstructing one request's full path means correlating timestamps by hand across services, and async work (queue jobs, background workers) is a black hole.
14
+ - You're adopting OpenTelemetry and want spans at the right boundaries with a defensible attribute set, not a noisy span-per-function trace.
15
+ - Traces already exist but show up broken — a request appears as two disconnected partial traces, or the downstream half is missing entirely (almost always a propagation or sampling bug).
16
+
17
+ ## Instructions
18
+
19
+ 1. **Adopt OpenTelemetry as the API/SDK; pick the exporter separately.** Instrument against the vendor-neutral OTel API and the W3C `traceparent`/`tracestate` propagation format so the wire protocol is standard across every service. Choose the backend (Jaeger, Tempo, Datadog, Honeycomb) only at the exporter/Collector layer — that way swapping or adding a backend never touches instrumentation. Prefer running the OTel Collector as a sidecar/agent so the app exports once and the Collector handles batching, sampling, and fan-out.
20
+ 2. **Turn on auto-instrumentation first, then map the request's hops.** Enable the language's auto-instrumentation for the HTTP/gRPC server, outbound HTTP/gRPC clients, and DB drivers — it gives you propagation and the obvious boundary spans for free. Then trace one real request end-to-end on paper: list every hop (inbound edge, each outbound call, each DB query, each queue publish/consume) so you know exactly where context must survive and which boundaries still need manual spans.
21
+ 3. **Fix context propagation at every hop — extract inbound, inject outbound.** At each service's entry point, *extract* trace context from the incoming `traceparent` header into the current context; on every outbound call, *inject* the current context into the outgoing headers. For HTTP and gRPC, auto-instrumentation usually does both — verify it actually fires (a manually-built client or a raw socket bypasses it). The hop that breaks is the one nobody instruments: confirm the child span's trace ID equals the parent's, not a fresh one.
22
+ 4. **Carry context across async and queue boundaries explicitly.** A message queue, background job, event bus, or thread/goroutine handoff drops the in-process context — the consumer starts a brand-new trace unless you bridge it. On publish, inject `traceparent` into the message *headers/attributes* (not the body); on consume, extract it and start the span as a *child* (or a span link, for batch/fan-in) of the producer's span. Without this the trace splits into two disconnected fragments and the async work looks like an orphan.
23
+ 5. **Create spans at meaningful boundaries, not per function.** A span is worth creating where work crosses a boundary or has independent cost: the inbound request, each outbound call (HTTP/RPC/DB/cache), and expensive in-process compute (a heavy serialization, a model inference, a batch loop *as one span*, not per iteration). Do not wrap every helper function — a span-per-function trace has hundreds of millisecond-thin spans that bury the one slow hop and multiply export cost. If a span never changes how you'd read the trace, don't create it.
24
+ 6. **Attach high-value attributes; never secrets or PII.** Put queryable context on spans as semantic attributes: `http.route` (the *template* `/users/:id`, not the literal path), `http.status_code`, `db.system`/`db.statement` (parameterized, no literal values), `messaging.destination`, and the key domain IDs you'd filter by (`order_id`, `tenant_id`). Set span status to error and record the exception on failure. Never put passwords, tokens, full auth headers, request/response bodies, raw SQL with inlined values, or PII on a span — spans are exported to third-party backends and widely readable.
25
+ 7. **Sample the whole trace consistently: decide head vs tail once, at the edge.** The cardinal rule: a trace must be sampled atomically, all-or-nothing, or you get broken partial traces. With head sampling, the *first* service makes the keep/drop decision and propagates it in the `traceparent` header (the sampled bit of its trace-flags field); every downstream service honors that bit instead of deciding independently, because per-service sampling rates produce traces missing half their spans. For "keep all errors and slow requests" you need *tail* sampling, which must run in the Collector (it sees the full assembled trace before deciding), never per-service. Pick one strategy and apply it trace-wide.
26
+ 8. **Correlate traces with logs by stamping trace_id on every log line.** Pull the active `trace_id` (and `span_id`) from context and add them as fields on every log line in that request — so a log search jumps straight to the trace, and a trace span links straight to its logs. This is the payoff that makes traces and the structured logs you already have one navigable surface instead of two.
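Steps 3, 4, and 7 above can be sketched with the standard library alone. This is a hedged illustration, not the OpenTelemetry SDK: `SpanContext`, `inject`, `extract`, `publish`, and `consume` are hypothetical names, and the message is a plain dict standing in for a real queue client. The sketch formats and parses the W3C `traceparent` header (`version-traceid-parentid-flags`), carries it in message headers rather than the body, and has the consumer continue the producer's trace and honor its sampled bit.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# W3C Trace Context, version 00: 32 hex trace-id, 16 hex parent-id, 2 hex trace-flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
_SAMPLED = 0x01  # lowest bit of trace-flags


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool


def new_root(sampled: bool) -> SpanContext:
    """Start a brand-new trace; only the edge service should do this."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """Same trace ID and sampling decision, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the context into outgoing headers as a traceparent value."""
    flags = _SAMPLED if ctx.sampled else 0
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags:02x}"


def extract(headers: dict) -> Optional[SpanContext]:
    """Read a traceparent value from incoming headers, or None if absent/invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & _SAMPLED))


def publish(ctx: SpanContext, body: dict) -> dict:
    """Producer side: context goes in message headers, never in the body."""
    producer_span = child_of(ctx)
    headers: dict = {}
    inject(producer_span, headers)
    return {"headers": headers, "body": body}


def consume(message: dict) -> SpanContext:
    """Consumer side: continue the producer's trace instead of starting a new one."""
    parent = extract(message.get("headers", {}))
    if parent is None:
        return new_root(sampled=False)  # broken propagation shows up as an orphan trace
    return child_of(parent)
```

Comparing `consume(publish(ctx, body)).trace_id` with `ctx.trace_id` is the same check the steps ask for: the consumer span must share the producer's trace ID.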
27
+
28
+ > [!WARNING]
29
+ > Context dropped across an async/queue boundary is the #1 tracing bug. The consumer starts a fresh root span, and one request becomes two disconnected traces — the producer side and the worker side — with no way to tell they're the same request. Always inject `traceparent` into message headers on publish and extract it (as a child span or link) on consume. Verify by checking the consumer span shares the producer's trace ID.
30
+
31
+ > [!WARNING]
32
+ > Inconsistent per-service sampling yields incomplete traces. If service A keeps 100% and service B keeps 10%, ~90% of A's traces are missing all of B's spans — a waterfall with holes that looks like B never ran. The sampling decision must be made once (head: at the edge, propagated; or tail: in the Collector) and honored by every service, never re-rolled per hop.
33
+
34
+ > [!WARNING]
35
+ > A span-per-function explosion makes traces unreadable and expensive. Hundreds of sub-millisecond spans hide the one 800ms hop that matters and multiply your backend's ingest cost and bill. Span boundaries and independently-costed work only; collapse tight loops into a single span with a count attribute rather than one span per iteration.
36
+
37
+ ## Output
38
+
39
+ - **Instrumentation plan** — the request's hops mapped end-to-end, which boundaries get spans (inbound edge, outbound calls, DB queries, named expensive compute) and which are deliberately left out, and the per-span-type attribute set (with the secrets/PII deny-list).
40
+ - **Propagation fix per hop** — for each hop, the extract-inbound / inject-outbound change, called out explicitly for HTTP, gRPC, and each async/queue boundary, with how to verify parent and child share one trace ID.
41
+ - **Sampling strategy** — head vs tail decision, where it runs (edge vs Collector), the rule (e.g. base rate + keep-all-errors + keep-slow), and how the decision is propagated trace-wide.
42
+ - **Trace↔log correlation** — how `trace_id`/`span_id` are pulled from context and stamped on log lines, so logs and traces cross-link in both directions.
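The trace-to-log correlation can be sketched with Python's standard `logging` and `contextvars` modules. This is an illustration under stated assumptions: `current_trace` is a hypothetical holder for the active IDs (a real service would read them from its tracing library's current span), and `TraceContextFilter` and `build_logger` are made-up names.

```python
import contextvars
import logging
from typing import TextIO

# Hypothetical holder for the active (trace_id, span_id) pair of the current request.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp trace_id and span_id onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_trace.get() or ("-", "-")
        record.trace_id = trace_id
        record.span_id = span_id
        return True  # never drop the record, only enrich it


def build_logger(stream: TextIO) -> logging.Logger:
    logger = logging.getLogger("traced-app")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Attaching the filter to the handler rather than to individual call sites means existing `logger.info(...)` calls pick up the IDs without being edited.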
@@ -0,0 +1,44 @@
1
+ ---
2
+ name: "feature-flag-retirer"
3
+ description: "Retire stale feature flags by confirming each flag's decided final state, then collapsing every conditional to the winning branch and deleting the loser plus the now-dead code it reached. Use when temporary flags have outlived their rollout, when flag conditionals clutter the code, or during a flag-debt cleanup."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ Feature flags are born temporary and die permanent. Once a flag is fully rolled out or quietly abandoned, the `if (flag)` it guards is just branching debt — two code paths where one is now unreachable. This skill retires a flag for real: it pins down which branch actually won, finds *every* reference (not just the obvious helper call), collapses each conditional to the winner, and deletes the loser along with any code only the dead branch reached — one flag at a time, with tests green after each.
9
+
10
+ ## When to use this skill
11
+
12
+ - A flag meant to last a sprint has been at 100% (or 0%) for months and still litters the code with conditionals.
13
+ - Flag checks have multiplied — nested `if (flagA && !flagB)` paths nobody can reason about — and you want to pay down the debt.
14
+ - You're running a flag-debt cleanup and need each removal to be independently reviewable and revertible.
15
+
16
+ > [!WARNING]
17
+ > Verify the flag's *decided* final state before you collapse anything. "Currently 100%" is not "permanently on" — a flag mid-rollout, a kill-switch, or an experiment still gathering data must NOT be retired. Deleting the live branch ships or kills a feature: that's a production incident, not a cleanup. Confirm from the flag system/config AND a human owner that the decision is final, and which branch won.
18
+
19
+ ## Instructions
20
+
21
+ 1. **Pin down the decided final state — not the current value.** For the flag, answer one question: is it *permanently on* (fully rolled out, winner = enabled branch) or *abandoned* (will never ship, winner = disabled branch)? Read the flag config/dashboard, then confirm with the owner. Reject the flag from this pass if it's still rolling out, A/B testing, a kill-switch kept for emergencies, or used per-tenant/per-environment with different values — those are live, not stale.
22
+ 2. **Find every reference — grep the flag KEY, not just the helper.** A flag leaks far past its `if`. Search the whole repo for the literal flag key string and its identifier:
23
+ - the helper calls: `isEnabled("new_checkout")`, `flags.newCheckout`, `useFlag(...)`, `treatment(...)`;
24
+ - the flag *definition/registration* (the declarations file, defaults, env vars, IaC/config);
25
+ - tests, fixtures, and mocks that force the flag on or off;
26
+ - analytics/telemetry events fired only when on, and feature-gated schema/migrations/routes;
27
+ - string usages: config keys, JSON, YAML, query params, log lines, docs.
28
+ Grep both the key (`"new_checkout"`) and the symbol (`newCheckout`) — different layers spell it differently.
29
+ 3. **Collapse each conditional to the winning branch.** For every reference, rewrite the conditional to keep only the winner: fully-on → keep the `if` body, drop the `else`/fallback; abandoned → keep the `else`, delete the guarded body. Remove the now-constant condition entirely — no `if (true)`, no dead `else`. Flatten the indentation you just freed.
30
+ 4. **Delete the code only the dead branch reached.** A removed branch usually calls helpers, imports, components, or fires events that nothing else uses. Trace each symbol the loser referenced; if its only caller was the branch you just deleted, remove it too (and repeat transitively). This is where flag retirement leaves dangling dead code if you stop at the `if`.
31
+ 5. **Remove the flag's definition and its tests.** Delete the flag declaration/registration, its default value and env/config entries, and the tests/fixtures that existed solely to toggle it. Tests that asserted the *winning* behavior stay — but drop their flag-setup boilerplate so they test the now-unconditional path.
32
+ 6. **One flag at a time, tests green after each.** Never retire two flags in one pass. After each flag: run the build and test suite, confirm green, and keep it as a single commit. A revert then removes exactly one flag's worth of change with no collateral.
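The reference inventory from step 2 can be sketched as a small pure function. Everything here is an assumption for illustration: the `inventory` and `layer_for` names, the layer rules, and the choice to pass file contents as a dict instead of walking the repo. It searches for both the flag key and the symbol, then groups hits by layer so the collapse in step 3 starts from a complete list.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layer rules, checked in order; adjust them to the repo's layout.
_LAYER_RULES = [
    ("tests/fixtures", re.compile(r"(^|/)(tests?|fixtures|__mocks__)/|\.test\.|_test\.")),
    ("definition/config", re.compile(r"\.(json|ya?ml|toml|env|tf)$|(^|/)flags?\.")),
    ("docs/strings", re.compile(r"\.(md|rst|txt)$")),
]


def layer_for(path: str) -> str:
    """Classify a file path into one of the inventory layers."""
    for name, pattern in _LAYER_RULES:
        if pattern.search(path):
            return name
    return "code"


def inventory(
    files: Dict[str, str], key: str, symbol: str
) -> Dict[str, List[Tuple[str, int, str]]]:
    """Group every line mentioning the flag key or its symbol by layer.

    files maps a repo-relative path to that file's text.
    """
    # The key matches as a substring; the symbol must be a whole identifier.
    needle = re.compile(rf"{re.escape(key)}|\b{re.escape(symbol)}\b")
    found = defaultdict(list)
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle.search(line):
                found[layer_for(path)].append((path, lineno, line.strip()))
    return dict(found)
```

The whole-identifier match on the symbol is a design choice that trades recall for fewer false positives; loosen it if the codebase derives names from the symbol.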
33
+
34
+ > [!WARNING]
35
+ > A flag almost always guards MORE than the obvious if-block — feature-gated helper functions, config defaults, DB columns or migrations, route registrations, and analytics events reachable only when on. Grep exhaustively (step 2) before deleting: stop at the `if` and you leave dangling dead code; over-trust a single grep and you delete a path the *winning* branch still uses. When in doubt whether a symbol is shared, keep it and flag it for review.
36
+
37
+ ## Output
38
+
39
+ For each retired flag, a record an owner can review and sign off:
40
+
41
+ - **Confirmed final state** — `permanently-on` or `abandoned`, with the source (flag dashboard value + owner sign-off) and the resulting winning branch.
42
+ - **Reference inventory** — every match for the key and symbol, grouped by layer: conditionals, definition/config, tests/fixtures, analytics, schema/routes, docs/strings.
43
+ - **Collapse plan** — per conditional: which branch wins, the resulting diff, and the list of now-dead symbols deleted because only the loser reached them.
44
+ - **Verification** — confirmation the build and test suite pass after the removal, and that the change is a single self-contained commit. Anything ambiguous (shared symbol, public-API surface, flag still live elsewhere) is listed as a manual-review item rather than deleted.
@@ -0,0 +1,35 @@
1
+ ---
2
+ name: "flamegraph-analyzer"
3
+ description: "Turn a CPU profile or flamegraph into a concrete optimization instead of guessing where the time goes: capture under a realistic workload with a sampling profiler, read the graph correctly (width = time, depth ≠ time), find the widest self-time leaves, ask if that work is necessary/redundant/algorithmically wrong, fix the biggest contributor, then re-profile. Use when code is CPU-bound and slow, a function is hot but you don't know which part, or you have a profile you can't interpret."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ When code is slow and CPU-bound, the most expensive thing you can do is guess. Intuition about "the slow part" is wrong often enough that optimizing it usually buys nothing while the real hotspot sits untouched. A flamegraph answers the question directly — *which frames are actually burning CPU* — but only if you capture it under a realistic workload and read it correctly. This skill does both: it gets a representative sampling profile, reads width as time and the y-axis as depth (not a timeline), pins the hotspot to the widest self-time leaves, classifies the work as unnecessary / redundant / algorithmically wrong, fixes the biggest contributor, and re-profiles — because the bottleneck always moves after a fix, and your intuition about the new one is just as unreliable.
9
+
10
+ ## When to use this skill
11
+
12
+ - A request, job, or function is slow, CPU usage is high, and you don't know which part of the call tree is responsible.
13
+ - You have a profile or flamegraph SVG but can't tell where the time is going or whether you're reading it right.
14
+ - Something is "obviously" slow and you're about to optimize the part you suspect — stop and confirm it with a profile first.
15
+ - A hot path got optimized and got no faster, or only a little — the real bottleneck was elsewhere and you need to find it.
16
+ - You want to know whether the latency is *computation* (on-CPU) or *waiting* (I/O, locks) before you pick where to spend effort.
17
+
18
+ ## Instructions
19
+
20
+ 1. **Capture a profile under a realistic workload with a sampling profiler; don't reason from intuition.** Drive the code the way production does (representative input size, concurrency, warm caches/JIT), then sample it with the right tool: `perf record -F 99 -g` (Linux native), async-profiler (JVM), `py-spy record` (Python), `go tool pprof` (Go), Node's `--prof` / `--cpu-prof`, or the browser DevTools profiler. Prefer **sampling** over instrumenting, because instrumentation overhead distorts the very hot frames you care about. Profile a *steady* phase, not cold start, unless cold start is the thing you're optimizing.
21
+ 2. **Render it as a flamegraph and read the axes correctly.** Collapse stacks and render (e.g. `perf script | stackcollapse-perf.pl | flamegraph.pl`, async-profiler's HTML, `go tool pprof -http`, speedscope). **Width = total time spent in a frame and everything it called; wide is expensive. The y-axis is call-stack depth, NOT time — it is not a timeline.** A tall, narrow tower is a deep-but-cheap call chain; a short, wide plateau is your hotspot. Frame ordering left-to-right is alphabetical/merge order, not chronological — never read it as "this ran, then that."
22
+ 3. **Find the widest *leaf* frames — that's where the CPU actually is.** Look at the top edge of the graph: the plateaus at the *top* of the stacks are self-time leaves, the code actually executing when samples were taken. A wide frame deep in the middle is wide because of what it *calls*; the work itself lives in the wide things sitting on top of it. Use the profiler's "self/own time" sort to confirm. Rank hotspots by self-time, not by who's tallest.
23
+ 4. **For each top hotspot, classify the work: unnecessary, redundant, or algorithmically wrong.** Read the wide leaf and ask: (a) **Unnecessary** — is this work needed at all, or is it logging/serialization/validation/copying in a hot loop that could be hoisted, batched, or dropped? (b) **Redundant** — is the same frame wide because it's *called too many times* (recomputed per item, re-parsed, re-allocated)? Cache, memoize, or lift it out of the loop. (c) **Algorithmically wrong** — a wide frame that grows with input is often an O(n²) hiding in plain sight (linear scan inside a loop, repeated string concat, a `Set` that's actually a list). Match the frame's width-vs-input behavior to the algorithm.
24
+ 5. **Confirm the latency is on-CPU before optimizing CPU.** A CPU-sample flamegraph is *blind to time spent waiting*: it shows almost nothing for blocking I/O, lock contention, or sleeping threads, because those threads aren't on-CPU to be sampled. If the wall-clock latency is large but the on-CPU flamegraph is thin or idle, the time is being *waited*, not *computed*, so capture an **off-CPU / wall-clock** profile instead (off-CPU flamegraph via `perf`/eBPF, async-profiler `wall` mode, py-spy with `--idle` so blocked threads are sampled too, a blocking/lock profiler). Optimizing CPU frames will do nothing for a workload that's actually waiting on a database or a mutex.
25
+ 6. **Optimize the single biggest contributor, then RE-PROFILE.** Fix the widest hotspot first — it has the most time to give back. Then capture the *same* workload again from scratch. The bottleneck moves after every fix: the second-widest frame is now first, and the percentages you remember are stale. Do not chain optimizations from one profile; your intuition about the *new* top frame is exactly as unreliable as it was about the first. Stop when the remaining hotspots are narrow enough that the next fix isn't worth the complexity.
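The self-time ranking from step 3 can be reproduced from folded stacks, the `frame;frame;leaf count` lines that `stackcollapse-perf.pl` emits. The sketch below is illustrative only; `frame_times` and `rank_by_self_time` are hypothetical names, and real profiler UIs already provide the same sort.

```python
from collections import Counter
from typing import List, Tuple


def frame_times(folded: str) -> Tuple[Counter, Counter]:
    """Return (self_samples, total_samples) per frame from folded stack lines.

    Each line looks like "root;caller;leaf 42": a semicolon-joined stack,
    then the number of samples captured with exactly that stack.
    """
    self_samples: Counter = Counter()
    total_samples: Counter = Counter()
    for line in folded.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        frames = stack.split(";")
        self_samples[frames[-1]] += count  # only the leaf was executing
        for frame in set(frames):  # count a recursive frame once per stack
            total_samples[frame] += count
    return self_samples, total_samples


def rank_by_self_time(folded: str) -> List[Tuple[str, float, float]]:
    """List (frame, self_share, total_share), widest self-time first."""
    self_samples, total_samples = frame_times(folded)
    grand_total = sum(self_samples.values())
    if grand_total == 0:
        return []
    return [
        (frame, samples / grand_total, total_samples[frame] / grand_total)
        for frame, samples in self_samples.most_common()
    ]
```

A frame such as `main` can have a total share of 100% and a self share of zero, which is the "wide because of what it calls" case the step describes.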
26
+
27
+ > [!WARNING]
28
+ > The y-axis is call-stack **depth, not time** — a flamegraph is not a timeline. A tall, narrow tower is a cheap deep call chain; a short, wide plateau is your hotspot. Read it as left-to-right time and you'll "optimize" the wrong frame and wonder why nothing got faster.
29
+
30
+ > [!NOTE]
31
+ > A CPU flamegraph is blind to waiting. If a request takes 800ms but the on-CPU graph is mostly idle, the time is spent blocked on I/O or a lock, not computing — switch to an off-CPU / wall-clock profile. Speeding up thin CPU frames can't fix latency that's actually spent waiting.
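A quick way to see the on-CPU versus waiting split before reaching for a profiler is to compare wall-clock time with process CPU time around the suspect call. This is a minimal sketch using only the standard library; `measure` and `waiting_work` are hypothetical names, and `time.sleep` stands in for a blocked I/O call or lock wait.

```python
import time


def measure(fn):
    """Run fn once and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def waiting_work():
    time.sleep(0.2)  # stands in for a blocked database call or lock wait


if __name__ == "__main__":
    wall, cpu = measure(waiting_work)
    # A large wall time with a small CPU time means the work is waiting, not computing.
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

When wall time far exceeds CPU time, an on-CPU flamegraph of the same call will look nearly empty, which is the signal to switch to an off-CPU or wall-clock profile.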
32
+
33
+ ## Output
34
+
35
+ A short report with four parts: (1) the **capture conditions** — profiler used, workload/input that was profiled, and whether it's on-CPU or off-CPU/wall-clock; (2) the **identified hotspot(s)** read straight off the graph — each as `frame name + share of total samples + self-time vs. children` and *why* it's hot (unnecessary / redundant / algorithmically wrong); (3) the **targeted fix** for the biggest contributor as a concrete change (hoist out of loop, memoize, replace O(n²), or — if it's wait time — go profile off-CPU); and (4) the **re-profile plan** — rerun the identical workload, expected new top frame, and the stopping condition once hotspots are no longer worth chasing.
@@ -0,0 +1,34 @@
1
+ ---
2
+ name: "git-blame-investigator"
3
+ description: "Reconstruct why a line of code exists from Git history — find the originating commit, read its message and full diff for intent, and see through reformatting/rename commits with ignore-revs and the pickaxe — before you change or delete it. Use when a line looks wrong or pointless and you want to remove it, when tracing a regression to its commit, or when onboarding to unfamiliar code."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ `git blame` tells you *who* last touched a line, which is almost never the question you actually have. The real question — "why is this here, and what breaks if I remove it?" — lives in the commit *message*, the surrounding diff, and the PR that shipped it. This skill does code archaeology: it walks from a suspicious line back to the commit that introduced the *logic* (not the one that reindented it), reads the intent, and returns a verdict on whether the code is a dead artifact or a Chesterton's fence guarding a bug you can't see.
9
+
10
+ ## When to use this skill
11
+ - A line looks redundant, wrong, or pointless and you're about to delete or "simplify" it.
12
+ - You're tracing a regression and need the exact commit that changed the behavior.
13
+ - You're onboarding to unfamiliar code and need to reconstruct *why* it was written this way.
14
+ - A workaround, magic constant, or odd conditional has no comment explaining it.
15
+ - `git blame` keeps pointing at a formatting, rename, or merge commit that obviously isn't the real author.
16
+
17
+ ## Instructions
18
+ 1. **Locate the line precisely, then blame with context.** Run `git blame -L <start>,<end> <path>` on the suspicious range (not the whole file) and note the commit SHA, not the author name. Add `-w` to ignore whitespace-only changes and `-C -C -M` to follow lines that were moved or copied in from other files — without these, blame stops at the refactor that relocated the code and you lose its true origin.
19
+ 2. **Distrust the first SHA — it's usually noise.** If the blamed commit is a Prettier run, a lint autofix, a mass rename, or a "merge branch" commit, it did not author the logic. Re-blame ignoring it: `git blame --ignore-rev <sha> -L <start>,<end> <path>`. If the repo has recurring reformatting commits, list them in a `.git-blame-ignore-revs` file and set `git config blame.ignoreRevsFile .git-blame-ignore-revs` so every blame sees through them automatically.
20
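+ 3. **Read the intent, not just the patch.** Once you have the real commit, run `git show <sha>` to read the *full* commit message and the *entire* diff, not only the line you care about. Then find the PR: `git log --merges --ancestry-path --reverse <sha>..HEAD` lists the merges that brought the commit in, earliest first (leave off the pathspec here, since path limiting tends to simplify away the very merge you are looking for), or `gh pr list --search <sha> --state merged` finds it on GitHub (without `--state`, `gh pr list` shows only open PRs). Read the PR description and review discussion. The "why" is in prose far more often than in code.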
21
+ 4. **Track the exact line or string through time with line-history and the pickaxe.** For a moving target use `git log -L <start>,<end>:<path>` to see every commit that changed that line range, in order, with diffs. To find when a specific string, identifier, or value *entered or left* the codebase, use the pickaxe: `git log -S '<exact-string>' -- <path>` (changes in the count of that string) or `git log -G '<regex>' -- <path>` (any diff line matching the regex). `-S` answers "when did this magic number / flag / call site appear or disappear?" in seconds.
22
+ 5. **Follow the code across moves and renames.** A file rename or extraction silently truncates history. Use `git log --follow -- <path>` to span renames, and when logic was hoisted into a new file, use blame's `-C -C -C` (copy detection across the whole tree, even unmodified files) to find where it was lifted from. Confirm the trail is unbroken before drawing conclusions — a gap means the real origin is in a pre-rename path.
23
+ 6. **Trace a regression to its commit, by bisection if needed.** First try `git log --oneline -- <path>` plus `git log -L` to spot an obvious culprit. If the offending change isn't obvious, run `git bisect`: `git bisect start`, `git bisect bad` (current), `git bisect good <known-good-sha>`, then test each checkout (script it with `git bisect run <test-cmd>` for an exact, automated answer). Bisect finds the precise breaking commit even across hundreds of revisions.
24
+ 7. **Reconstruct the decision from the neighborhood.** Read the commits immediately before and after the originating one (`git log --oneline <sha>~3..<sha> -- <path>` for the lead-up, `git log --oneline --reverse <sha>..HEAD -- <path>` for what followed, plus the linked issue) to see what problem the change was solving. A line that looks pointless in isolation often makes sense as one half of a fix, the other half being the bug it prevents.
25
+ 8. **Render a verdict tied to evidence.** Conclude with one of: *safe to remove* (origin found, the problem it solved no longer exists — cite the commit/issue), *do not touch* (it guards a known bug or invariant — cite the commit), or *needs a test first* (intent is plausible but unverified — name the behavior to lock down before changing). Never conclude "safe to remove" without having found and read the originating intent.
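The blame, line-history, and pickaxe invocations from steps 1, 2, and 4 can be assembled programmatically when the investigation is scripted. The helpers below are a hedged sketch: the function names are hypothetical, and they only build argument lists from flags named in the steps, leaving execution to the caller.

```python
from typing import List, Sequence


def blame_cmd(
    path: str, start: int, end: int, ignore_revs: Sequence[str] = ()
) -> List[str]:
    """Blame a line range, seeing through whitespace, moves, copies, and noise commits."""
    cmd = ["git", "blame", "-w", "-C", "-C", "-M", "-L", f"{start},{end}"]
    for sha in ignore_revs:
        cmd += ["--ignore-rev", sha]  # skip formatting or mass-rename commits
    return cmd + ["--", path]


def line_history_cmd(path: str, start: int, end: int) -> List[str]:
    """Every commit that changed the given line range, with diffs."""
    return ["git", "log", f"-L{start},{end}:{path}"]


def pickaxe_cmd(needle: str, path: str, regex: bool = False) -> List[str]:
    """Commits where a string's occurrence count changed (-S) or a diff line matches (-G)."""
    flag = "-G" if regex else "-S"
    return ["git", "log", "--oneline", f"{flag}{needle}", "--", path]
```

Each list can be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)` inside a repository. Building lists instead of shell strings avoids quoting problems when the search string contains spaces or quotes.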
26
+
27
+ > [!WARNING]
28
+ > blame's first answer is almost always a formatting or rename commit that hides the real author. If you act on it without `--ignore-rev` and the pickaxe, you will attribute the code to the wrong change and reason about the wrong intent.
29
+
30
+ > [!WARNING]
31
+ > Deleting code whose original purpose you haven't found is the single most common way regressions get reintroduced. "I don't see why this is here" is a reason to investigate, never a license to remove.
32
+
33
+ ## Output
34
+ A short investigation report containing: (1) the **originating commit(s)** — SHA, message, and the intent reconstructed from the diff and PR; (2) the **line/string history** — the ordered list of commits that introduced, moved, or altered the code (from `log -L` / `-S`), with the rename or refactor boundaries it crossed; and (3) a **verdict** — *safe to change/remove*, *do not touch*, or *needs a test first* — each justified by the cited commit or issue. All claims trace to a SHA the reader can re-run.
@@ -0,0 +1,49 @@
1
+ ---
2
+ name: "graphql-schema-designer"
3
+ description: "Design a clean, evolvable GraphQL schema (SDL) that won't paint you into a corner — model the graph around domain types and their relationships rather than as RPC-over-GraphQL, set nullability deliberately, standardize lists with Relay connections, plan DataLoader batching for per-parent fields, and evolve by adding + @deprecated instead of versioning. Use when designing a new GraphQL API, reviewing an SDL, or migrating REST endpoints to a graph."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ A GraphQL schema is not an afterthought over your endpoints — it's the public contract clients build against, and unlike REST there's no `/v2` to escape a bad decision: the graph evolves in place, forever. Two design mistakes dominate the post-launch pain. First, modeling the schema as a thin RPC wrapper of your existing endpoints (`getUserById`, `listOrdersForUser`) instead of a connected graph of types and relationships, which throws away the one thing GraphQL gives you. Second, sprinkling non-null (`!`) everywhere "to be safe," which is a trap — a single resolver error on a non-null field nulls its *entire parent object*, so a flaky downstream blanks out the whole response. This skill designs the SDL deliberately: types and edges, considered nullability, Relay connections for lists, a consistent mutation payload shape, and an explicit DataLoader plan for the fields that would otherwise N+1.
9
+
10
+ ## When to use this skill
11
+
12
+ - You're designing a new GraphQL API from scratch and want an SDL that survives years of additive change without versioning.
13
+ - You're reviewing or refactoring an existing schema that reads like a list of RPC calls, has `!` on nearly every field, or returns bare arrays for lists.
14
+ - You're migrating REST endpoints to GraphQL and need to re-model resources as a connected graph rather than transcribing routes into queries one-for-one.
15
+ - Nested queries are slow and you suspect resolvers are firing one DB query per parent row (the N+1 storm).
16
+
17
+ ## Instructions
18
+
19
+ 1. **Model the graph around domain types and their relationships, not your endpoints.** Identify the nouns (`User`, `Order`, `Product`, `Review`) and the *edges* between them, then expose those edges as fields that return types — `User.orders`, `Order.lineItems`, `Review.author` — so a client can traverse `user { orders { lineItems { product { name } } } }` in one round trip. Do **not** transcribe REST routes into a flat field per endpoint (`getUserById`, `getOrdersForUser`, `getProductForLineItem`); that's RPC-over-GraphQL and forces clients back into client-side joins and N round trips. The query-graph shape, not your handler list, is the source of truth.
20
+
21
+ 2. **Set nullability deliberately, field by field — non-null is a contract, not a default.** Mark a field non-null (`name: String!`) only when it *genuinely always resolves* — a column with a NOT NULL constraint, a synthesized value, the object's own `id`. Make a field nullable when a downstream failure (a separate service, a join that can return nothing, a slow API) shouldn't take down the rest of the response. The error-propagation rule is the whole reason this matters: when a non-null field's resolver throws or returns null, GraphQL can't put null there, so it nulls the *nearest nullable ancestor* — often the entire parent object — propagating upward until it hits a nullable field. So `Order.recommendedProducts` (computed by a flaky ML service) must be nullable, or one bad recommendation call blanks the whole order.
22
+
23
+ 3. **Standardize every list as a Relay Connection, not a bare array.** Replace `orders: [Order!]!` with a connection: `orders(first: Int, after: String, last: Int, before: String): OrderConnection!`, where `OrderConnection { edges: [OrderEdge!]!, pageInfo: PageInfo! }`, `OrderEdge { node: Order!, cursor: String! }`, and `PageInfo { hasNextPage: Boolean!, hasPreviousPage: Boolean!, startCursor: String, endCursor: String }`. Cursor-based connections page correctly under inserts/deletes (each page is anchored to a real cursor, not an offset) and give you a uniform place to hang edge metadata later (e.g. `OrderEdge.addedAt`). Bare arrays can't paginate without a breaking change and force `first`/`offset` bolt-ons later. Use connections for any list that can grow unbounded; a small fixed enum-like list (a user's `roles`) can stay a plain array.
24
+
25
+ 4. **Plan for the N+1 problem before you ship — name every field that needs a DataLoader.** Any field that resolves *per parent* — `Order.customer`, `Review.author`, `Product.category` — fires its resolver once per parent row in a list, so `orders(first: 50) { customer { name } }` becomes 1 query for orders plus 50 queries for customers. For each such field, specify a **DataLoader** that batches the per-parent keys into one query (`SELECT * FROM users WHERE id = ANY($1)`) and caches within the request. Walk the schema and list, explicitly, which fields are 1:1/1:N relationship fetches that must go through a batched loader; a schema with per-parent resolvers and no DataLoader will N+1 itself to death under nested queries.
26
+
27
+ 5. **Evolve by adding fields and deprecating — never repurpose, never version the endpoint.** GraphQL evolves in place: add new fields, types, and optional arguments freely (additive changes are non-breaking because clients select only what they ask for). To retire a field, mark it `@deprecated(reason: "Use fullName instead")` and keep it resolving until usage drops to zero (check field-usage analytics), then remove. Never change an existing field's *meaning* or *type* (`price: Int` cents → `price: Float` dollars is a silent data corruption for every existing client), never tighten nullability from nullable to non-null on a live field, and never add a `/v2` schema — versioning the endpoint defeats the entire evolvability model.
28
+
29
+ 6. **Constrain values with custom scalars and enums; never model a fixed set as a free string.** Use `enum OrderStatus { PENDING PAID SHIPPED CANCELLED }` instead of `status: String` so invalid values are rejected at the query layer and clients get the allowed set from introspection. Define custom scalars for formatted values (`DateTime`, `EmailAddress`, `URL`, `Money`) to centralize parse/serialize/validation and document the format in one place. Reserve `ID` for opaque identifiers (it serializes as a string — don't do math on it).
30
+
31
+ 7. **Give mutations input types and a consistent payload/error shape.** Every mutation takes one `input` argument of a dedicated input type (`createOrder(input: CreateOrderInput!): CreateOrderPayload!`) — input types keep arguments cohesive and let you add optional fields without changing the signature. Return a **payload type**, not the bare entity: `CreateOrderPayload { order: Order, userErrors: [UserError!]! }`, where `userErrors` carries expected, recoverable validation failures (`{ field: ["input","email"], message: "already taken" }`) as *data* the client can render — distinct from unexpected exceptions, which belong in the top-level `errors` array. Keep this `{ entity, userErrors }` shape uniform across every mutation so clients handle errors one way.
32
+
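The cursor behaviour described in step 3 can be sketched in a few lines. This is a minimal illustration, not a Relay server implementation: the `Order` class, the `order:<id>` cursor encoding, and the `orders_connection` helper are all hypothetical, and only forward paging (`first`/`after`) over rows sorted by ascending id is covered.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class Order:
    id: int


def encode_cursor(order_id: int) -> str:
    # Opaque to clients: they hand it back unchanged and never parse it.
    return base64.b64encode(f"order:{order_id}".encode()).decode()


def decode_cursor(cursor: str) -> int:
    return int(base64.b64decode(cursor).decode().split(":", 1)[1])


def orders_connection(rows: list, first: int, after: Optional[str] = None) -> dict:
    """Forward pagination over rows already sorted by ascending id."""
    after_id = decode_cursor(after) if after else None
    remaining = [r for r in rows if after_id is None or r.id > after_id]
    page = remaining[:first]
    edges = [{"node": r, "cursor": encode_cursor(r.id)} for r in page]
    return {
        "edges": edges,
        "pageInfo": {
            "hasNextPage": len(remaining) > len(page),
            "hasPreviousPage": False,  # not computed in this forward-only sketch
            "startCursor": edges[0]["cursor"] if edges else None,
            "endCursor": edges[-1]["cursor"] if edges else None,
        },
    }


rows = [Order(i) for i in range(1, 6)]
page1 = orders_connection(rows, first=2)

# A row from page 1 is deleted before the client asks for page 2.
rows_after_delete = [r for r in rows if r.id != 1]
page2 = orders_connection(rows_after_delete, first=2,
                          after=page1["pageInfo"]["endCursor"])
```

Because page 2 is anchored to the last cursor the client saw, it still starts at order 3. An offset of 2 over the shrunken list would have started at order 4 and silently skipped one.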
33
+ > [!WARNING]
34
+ > Overusing non-null (`!`) is a trap, not a safety measure. When a non-null field's resolver errors, GraphQL nulls the nearest *nullable* ancestor — so one failing `User.subscription!` field can null the entire `User`, and if `User` is also non-null, it nulls *its* parent, cascading up to potentially blank the whole `data`. Model genuinely-fallible fields (anything backed by a separate service, an external API, or an optional relationship) as **nullable** so a partial failure degrades to one missing field instead of an empty response.
35
+
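To make the cascade concrete, here is a toy model of null propagation. It is a simplified sketch, not a GraphQL executor: `Field`, `execute`, and the failing `subscription` resolver are hypothetical names, and only resolver exceptions are recorded as errors.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Field:
    nullable: bool
    resolve: Callable[[], Any]  # returns a scalar, None, or a dict of child Fields


class _Bubble(Exception):
    """A non-null field ended up null, so the null must move to its parent."""


def execute_field(field: Field, errors: list, path: str) -> Any:
    try:
        value = field.resolve()
        if isinstance(value, dict):  # object type: resolve each child field
            value = {name: execute_field(child, errors, f"{path}.{name}")
                     for name, child in value.items()}
    except _Bubble:
        value = None  # a non-null child failed, so this whole object becomes null
    except Exception as exc:
        errors.append(f"{path}: {exc}")
        value = None
    if value is None and not field.nullable:
        raise _Bubble()
    return value


def execute(root: dict) -> dict:
    errors: list = []
    try:
        data = {name: execute_field(f, errors, name) for name, f in root.items()}
    except _Bubble:
        data = None  # the null reached the top: the whole response is blank
    return {"data": data, "errors": errors}


def failing_subscription():
    raise RuntimeError("billing service timeout")


def make_query(user_nullable: bool, subscription_nullable: bool) -> dict:
    def user():
        return {
            "name": Field(False, lambda: "Ada"),
            "subscription": Field(subscription_nullable, failing_subscription),
        }
    return {"user": Field(user_nullable, user)}


strict = execute(make_query(user_nullable=False, subscription_nullable=False))
partial = execute(make_query(user_nullable=True, subscription_nullable=False))
lenient = execute(make_query(user_nullable=True, subscription_nullable=True))
```

The same single failure produces three outcomes: `strict` loses all of `data`, `partial` loses the whole `user`, and `lenient` loses only `subscription` while `name` still comes back.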
36
+ > [!WARNING]
37
+ > A schema with per-parent resolver fields and no DataLoader will N+1 itself to death. A query like `posts(first: 100) { author { name } comments(first: 10) { edges { node { id } } } }` fans out into hundreds or thousands of individual DB queries — fast in dev with 3 rows, a query storm in production. Decide the batching plan at design time, not after the first incident: every relationship field gets a request-scoped DataLoader, no exceptions.
38
+
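The batching idea behind a DataLoader can be sketched with `asyncio` alone. This is an illustrative stand-in, not the API of any particular DataLoader library: `BatchLoader` and `fetch_users` are hypothetical, and `fetch_users` only simulates the single `WHERE id = ANY($1)` query.

```python
import asyncio


class BatchLoader:
    """Request-scoped loader: collects keys asked for in one tick, fetches them once."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn  # async: list of keys -> dict of key -> value
        self._cache = {}           # key -> Future; doubles as the per-request cache
        self._queue = []           # keys waiting for the next batch
        self._task = None          # keeps a reference to the pending dispatch

    def load(self, key):
        if key in self._cache:
            return self._cache[key]
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._cache[key] = future
        if not self._queue:
            # The task starts on the next loop iteration, after all the
            # synchronous load() calls of this tick have queued their keys.
            self._task = loop.create_task(self._dispatch())
        self._queue.append(key)
        return future

    async def _dispatch(self):
        keys, self._queue = self._queue, []
        try:
            rows = await self._batch_fn(keys)
        except Exception as exc:
            for key in keys:
                self._cache[key].set_exception(exc)
            return
        for key in keys:
            self._cache[key].set_result(rows.get(key))


calls = []


async def fetch_users(ids):
    calls.append(list(ids))  # stands in for one batched SQL query
    return {i: {"id": i, "name": f"user-{i}"} for i in ids}


async def main():
    loader = BatchLoader(fetch_users)  # create one per incoming request
    customer_ids = [1, 2, 1, 3]        # e.g. the customer of each order in a page
    return await asyncio.gather(*(loader.load(i) for i in customer_ids))


customers = asyncio.run(main())
```

Four per-parent lookups collapse into one call to `fetch_users` with three distinct keys. The loader must be created per request, because the cache has no invalidation.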
39
+ > [!NOTE]
40
+ > Connections are worth the boilerplate even for lists that "will never be large," because there is no non-breaking path from `[T!]!` to a paginated connection later — clients have already coded against the array. If a list is truly bounded and fixed (status flags, a handful of roles), a plain list is fine; everything user-generated or growth-prone starts as a connection.
41
+
42
+ ## Output
43
+
44
+ The deliverable is a designed SDL plus the decisions behind it:
45
+
46
+ - **The SDL** — object types and their relationship fields (edges), `enum`s and custom `scalar`s for constrained values, Relay **connection** types for every unbounded list (`*Connection` / `*Edge` / `PageInfo`), and mutations as `input`-arg + `*Payload` (with `userErrors`) pairs.
47
+ - **The nullability decisions** — a short table of the non-obvious fields marked nullable vs non-null, each with its rationale (this field can fail downstream → nullable; this field always resolves → non-null), so reviewers see the error-propagation reasoning.
48
+ - **The pagination decisions** — which lists became connections vs stayed plain arrays, and why.
49
+ - **The DataLoader / batching plan** — the explicit list of per-parent relationship fields (`Type.field`) that must resolve through a request-scoped batched loader, with the batch key and the batched query for each, so the schema doesn't N+1 under nested queries.
@@ -0,0 +1,40 @@
1
+ ---
2
+ name: "hallucination-evaluator"
3
+ description: "Detect and measure ungroundedness in LLM and RAG outputs — claims the source doesn't support — by decomposing answers into atomic claims and checking each for entailment, so you can quantify faithfulness and gate on it instead of eyeballing it. Use when a RAG/LLM feature makes confident wrong claims, before shipping anything that must be factual, or to add a groundedness gate to evals/CI."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ "It sounds confident" is not "it's correct." A RAG or grounded-generation feature can produce fluent, authoritative prose that the retrieved source never supports — and fluency is uncorrelated with faithfulness, so you cannot eyeball it. This skill makes hallucination measurable: it defines the standard precisely, decomposes each answer into atomic claims, checks each claim for entailment against the source, builds a labeled eval set that includes the should-abstain cases, splits retrieval failures from generation failures, and produces a groundedness score you can gate releases on.
9
+
10
+ ## When to use this skill
11
+ - A RAG/LLM feature is making confident claims that turn out to be wrong, and you can't tell how often.
12
+ - Before shipping anything that must be factual — support answers, summaries of provided docs, extraction over a source.
13
+ - You want a groundedness gate in evals/CI so a regression in faithfulness blocks the release instead of surfacing in production.
14
+ - A summary, citation, or "based on the document…" answer is adding facts the document doesn't contain.
15
+ - You need to know *why* it's wrong — bad retrieval vs. the model ignoring good retrieval — because the fix differs.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Define the standard precisely — faithfulness, not world-truth.** In RAG/grounded generation, a hallucination is a claim **not entailed by the retrieved context (the source you gave the model)**. This is *faithfulness*, and it is distinct from *factual accuracy against the world*. A claim can be true in reality but unfaithful (the source never said it), and faithful but false (the source itself was wrong). You grade **faithfulness to the source, because that is checkable**; open-world truth is not checkable here and conflating the two makes the eval incoherent. State which one you're measuring in writing before you score anything.
20
+
21
+ 2. **Decompose each answer into atomic claims.** A claim is a single, independently checkable assertion ("The policy refund window is 30 days"). Split compound sentences, drop hedges and meta-commentary, and keep pronoun referents resolved so each claim stands alone. Score faithfulness *per claim*, not per answer: an answer that decomposes into four claims, one of them unsupported, is 75% grounded (3 of 4), and that granularity is what lets you find the specific failure.
22
+
23
+ 3. **Check each claim for entailment against the source.** For each atomic claim, label it `supported` / `not_supported` / `contradicted` using one of two checkers: (a) an NLI/entailment model (premise = the retrieved chunks, hypothesis = the claim), or (b) an **LLM-judge with the source in its context** — for the judge, default to the latest, most capable Claude model (`claude-opus-4-8`, or `claude-fable-5` for the hardest cases). Pin the judge to faithfulness: *"Using ONLY the provided source, is this claim supported? Quote the supporting span or answer not_supported. Do not use outside knowledge."* The judge grades entailment, which is checkable — never open-world truth.
24
+
25
+ 4. **Build a labeled eval set that includes the should-abstain cases.** Collect (question, retrieved context, answer) triples and hand-label the grounded/ungrounded claims. Crucially, include questions whose answer **is not in the context** — there the correct behavior is to abstain ("I don't know" / "the source doesn't say"), and answering anyway is the exact hallucination you most want to catch. An eval set without should-abstain cases will pass a model that confidently invents answers whenever retrieval comes up empty.
26
+
27
+ 5. **Split retrieval failure from generation failure.** For every ungrounded answer, ask: *was the correct answer present in what was retrieved?* If **no** → retrieval failure (the answer wasn't in the context, so fix retrieval: chunking, embeddings, top-k, reranking). If **yes, but the model ignored or contradicted it** → generation failure (fix the prompt/model: cite-or-abstain instructions, a stronger model, less room to improvise). Report the two rates separately: they have different owners and different fixes, and a single "hallucination rate" hides which lever to pull.
28
+
29
+ 6. **Report a groundedness score and gate on it.** Compute groundedness = supported claims / total claims across the eval set, plus an abstention-accuracy number on the should-abstain subset. Attach concrete failing examples (claim + the source span it contradicts or the absence of any span). Set a threshold and wire it into CI so a drop blocks the release. Re-run the same fixed eval set on every prompt/retrieval/model change.
30
+
31
+ 7. **Reduce it, then re-measure.** Apply the levers the split points to: grounding prompts (**cite-or-abstain** — "answer only from the source; if it's not there, say so"), require an inline citation/verbatim quote per claim (a claim that can't be quoted is the one to suspect), and retrieval improvements for the retrieval-failure share. After each change, re-run the eval — don't trust that a prompt tweak helped; show the score moved.
32
+
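The scoring in steps 5 and 6 reduces to a few counts over the labeled set. The sketch below assumes the claim labels and the `answer_in_context` flag already exist (from the judge and from hand labeling). The `Claim`/`Example` shapes, the sample data, and the 0.90 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    label: str  # "supported" | "not_supported" | "contradicted"


@dataclass
class Example:
    claims: list
    answer_in_context: bool  # hand label: was the correct answer retrievable?
    abstained: bool          # did the model say "the source doesn't say"?


def groundedness(examples):
    claims = [c for ex in examples for c in ex.claims]
    if not claims:
        raise ValueError("no claims to score")
    return sum(c.label == "supported" for c in claims) / len(claims)


def abstention_accuracy(examples):
    should_abstain = [ex for ex in examples if not ex.answer_in_context]
    if not should_abstain:
        raise ValueError("eval set has no should-abstain cases")
    return sum(ex.abstained for ex in should_abstain) / len(should_abstain)


def failure_split(examples):
    """Count ungrounded answers by cause: missing context vs. ignored context."""
    split = {"retrieval": 0, "generation": 0}
    for ex in examples:
        if any(c.label != "supported" for c in ex.claims):
            split["generation" if ex.answer_in_context else "retrieval"] += 1
    return split


def claims(*labels):
    return [Claim(f"claim {i}", label) for i, label in enumerate(labels)]


eval_set = [
    Example(claims("supported", "supported", "supported", "not_supported"), True, False),
    Example([], False, True),  # should-abstain case, handled correctly
    Example(claims("not_supported", "not_supported"), False, False),  # invented an answer
    Example(claims("supported", "supported"), True, False),
]

THRESHOLD = 0.90  # hypothetical gate value
score = groundedness(eval_set)
abstain = abstention_accuracy(eval_set)
split = failure_split(eval_set)
gate_passes = score >= THRESHOLD
```

On this toy set the score is 5 supported claims out of 8, abstention accuracy is 1 of 2, and the two ungrounded answers split one each between retrieval and generation. In CI, a failing gate would exit non-zero so the release is blocked.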
33
+ > [!WARNING]
34
+ > Confidence and fluency are uncorrelated with faithfulness. The most dangerous hallucinations are the ones that read most authoritatively, so you must check claims against the source span by span — never grade on how convincing the answer sounds.
35
+
36
+ > [!WARNING]
37
+ > Do not let the faithfulness judge use outside knowledge. If it "knows" a claim is true and marks it supported even though the source never says it, you're now measuring world-truth (not measurable here) instead of groundedness (measurable) — and the eval becomes incoherent. The instruction "use ONLY the provided source" is load-bearing; verify the judge actually abstains when the source is silent.
38
+
39
+ ## Output
40
+ A faithfulness eval report containing: (1) the eval method — atomic-claim decomposition + the entailment/LLM-judge checker, with the exact judge prompt; (2) the labeled eval set, explicitly including should-abstain cases (answer-not-in-context); (3) per-answer results split into retrieval-failure vs. generation-failure, with separate rates; and (4) the groundedness score (supported claims / total) plus abstention accuracy, concrete failing examples with the offending source spans, and a CI gate threshold. Reproducible: same eval set, same judge model, re-runnable on every change.
@@ -0,0 +1,47 @@
1
+ ---
2
+ name: "idempotency-designer"
3
+ description: "Make unsafe, retryable API operations idempotent so a client retry or a network hiccup can't double-charge, double-create, or double-send — design a client-supplied idempotency key, an atomic store-and-check (unique constraint or conditional write), in-flight conflict handling, and a retention policy. Use when a POST/mutation can be retried (payments, order creation, sends, webhooks), or when duplicate side effects have already shown up in production."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ A network timeout doesn't mean the request failed; it means the client doesn't *know*. So the client retries, and now the charge runs twice. Idempotency fixes this by making "do this operation" return the *same result* no matter how many times it's submitted under the same key. The trap: almost everyone implements it as "check if we've seen this key, if not do the work", which is two non-atomic steps and therefore exactly the race that two concurrent retries both win. This skill designs the key, the *atomic* dedup, the in-flight case, and the cleanup.
9
+
10
+ ## When to use this skill
11
+
12
+ - An endpoint has a side effect that must not happen twice — a payment/charge, order or account creation, an email/SMS/push send, a transfer, a webhook *delivery* you consume.
13
+ - Clients (mobile, SDKs, queue consumers, other services) retry on timeout/5xx, so the same logical operation can arrive more than once.
14
+ - Duplicate rows, double charges, or double-sent notifications have already appeared in production logs and you're retrofitting protection.
15
+ - You're putting a queue or a webhook receiver in front of a mutation — at-least-once delivery guarantees duplicates by design.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Have the client generate the key, one per logical operation.** The idempotency key is a client-minted unique id (a UUID v4, or a deterministic hash of the operation's natural identity) created *once* and reused on every retry of that same operation. It travels in a header — `Idempotency-Key: <uuid>` (the Stripe/IETF convention) — not in the body where a serializer might reorder it. A new key per *user click* / per *queued message*, the *same* key across that click's retries. Document who mints it and exactly where it rides.
20
+
21
+ 2. **Scope the key; never match on the key alone.** Store and match it as a composite: `(account_id, endpoint, idempotency_key)`. Without scoping, one tenant's key can collide with another's (leaking information or returning the wrong cached response), and the same UUID legitimately reused on two different endpoints would wrongly dedup. Reserve keys for POST-style *creates and actions*; `GET`/`PUT`/`DELETE` should be designed to be naturally idempotent (a `PUT` to a known id, a `DELETE` that no-ops on an absent row) and need no key.
22
+
23
+ 3. **Record the key BEFORE doing the work, in a single atomic operation.** This is the whole mechanism. Either:
24
+ - **Unique constraint** — `INSERT` a row keyed on `(account_id, endpoint, key)` with status `in_progress`; let the database's unique index reject the second insert. The *insert* is the lock; you do not read first.
25
+ - **Conditional write** — `SET key value NX` (Redis), or a conditional/compare-and-swap put (DynamoDB `attribute_not_exists`). The store decides the winner atomically.
26
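+ The winner proceeds; everyone else hits the constraint/condition and branches to step 4 (replay the stored response, if the key is already `completed`) or step 5 (if it is still `in_progress`). There is no "check then act": the check and the claim are the same call.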
27
+
28
+ 4. **Persist the response alongside the key, then replay it on repeat.** When the work finishes, store the *full* response (status code + body, or enough to reconstruct it) against the key and mark it `completed` — ideally in the **same transaction** that performs the side effect, so the key and the effect commit or roll back together. On a repeat of a *completed* key, return the stored response verbatim instead of re-executing. Optionally store a hash of the request payload and 422 if the same key arrives with a *different* body — that's a client bug, not a retry.
29
+
30
+ 5. **Handle the in-flight case explicitly — it's not "completed" yet.** A retry can arrive while the first request is still running (status `in_progress`). Do **not** run the work again and do **not** block indefinitely. Return **`409 Conflict`** (or `425 Too Early`) with a short `Retry-After`, telling the client "this is being processed, ask again." Give the `in_progress` record a lease/expiry so a crashed first attempt that never reached `completed` can be retried after the lease lapses rather than wedging the key forever.
31
+
32
+ 6. **Make the downstream effect idempotent too.** Your atomic key protects *your* handler; it does nothing for the third-party call inside it. If the handler calls a payment processor or another service, pass an idempotency key *to that call as well* (most payment APIs accept one) — derive it deterministically from your own key so a retry of your handler produces the same downstream key. Otherwise a crash *after* the external charge but *before* your commit leaves the charge live while your record says nothing happened.
33
+
34
+ 7. **Set a TTL and a cleanup job.** Keys are only needed for the retry window — minutes to ~24h, matched to how long clients realistically retry. Store an `expires_at` and either use the store's native TTL (Redis `EXPIRE`, DynamoDB TTL) or a periodic delete. Choose retention deliberately: long enough to cover every retry path (including a client that retries the next day), short enough that the table doesn't grow without bound.
35
+
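Steps 3 to 5 and 7 can be sketched against SQLite from the standard library. This is a minimal illustration under stated assumptions, not a production handler: the table layout, the `handle` and `sweep` helpers, and the 24-hour retention are hypothetical, and both lease takeover for a crashed `in_progress` row and committing the side effect in the same transaction are omitted.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE idempotency_keys (
    account_id TEXT NOT NULL,
    endpoint   TEXT NOT NULL,
    idem_key   TEXT NOT NULL,
    status     TEXT NOT NULL,   -- 'in_progress' or 'completed'
    response   TEXT,            -- JSON of the stored response, set on completion
    expires_at REAL NOT NULL,   -- retention cutoff used by the sweep
    PRIMARY KEY (account_id, endpoint, idem_key)
)
"""
WHERE = "WHERE account_id = ? AND endpoint = ? AND idem_key = ?"
RETENTION_SECONDS = 24 * 3600  # hypothetical retry window


def handle(conn, scope, do_work, now=None):
    """scope = (account_id, endpoint, idem_key); do_work() returns (code, body)."""
    now = time.time() if now is None else now
    try:
        with conn:  # the INSERT is the claim; no SELECT runs before it
            conn.execute(
                "INSERT INTO idempotency_keys VALUES (?, ?, ?, 'in_progress', NULL, ?)",
                (*scope, now + RETENTION_SECONDS),
            )
    except sqlite3.IntegrityError:  # the primary key rejected a second claim
        status, response = conn.execute(
            f"SELECT status, response FROM idempotency_keys {WHERE}", scope
        ).fetchone()
        if status == "completed":
            stored = json.loads(response)
            return stored["code"], stored["body"]  # replay, do not re-execute
        return 409, {"error": "request in progress", "retry_after_seconds": 1}
    code, body = do_work()
    with conn:
        conn.execute(
            f"UPDATE idempotency_keys SET status = 'completed', response = ? {WHERE}",
            (json.dumps({"code": code, "body": body}), *scope),
        )
    return code, body


def sweep(conn, now=None):
    """Periodic cleanup for stores without a native TTL."""
    now = time.time() if now is None else now
    with conn:
        conn.execute("DELETE FROM idempotency_keys WHERE expires_at < ?", (now,))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
charges = []


def charge():
    charges.append(500)
    return 201, {"charge_id": len(charges)}


scope = ("acct_1", "POST /charges", "demo-key-1")
first = handle(conn, scope, charge)
retry = handle(conn, scope, charge)  # same key: replayed, charge() not called again
```

The only read happens after a failed claim, so it decides how to answer the loser and never decides who wins.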
36
+ > [!WARNING]
37
+ > Check-then-act is not idempotency. "Read whether the key exists, and if not, do the work" is two operations: two concurrent retries both read "not seen," both proceed, and both run the side effect. The dedup MUST be a single atomic operation — a unique-constraint `INSERT` or a conditional/`NX` write where the store picks the one winner. If your design has a `SELECT` (or `GET`) before the `INSERT`, it is racy under exactly the concurrent-retry load it exists to stop.
38
+
39
+ > [!WARNING]
40
+ > An idempotency store with no TTL grows forever. Every unique operation ever submitted leaves a permanent row, and the unique-index lookup that guards your hottest write path slowly degrades. Always attach an `expires_at` plus native-TTL or a sweep job; "we'll clean it up later" means an unbounded table on your write path.
41
+
42
+ > [!NOTE]
43
+ > Committing the side effect and the `completed` key in the *same transaction* is what makes replay trustworthy. If they're separate writes, a crash between them either replays a response for work that didn't happen, or re-runs work whose key looks unfinished. When the side effect is in another system (a payment API), you can't share a transaction — that's exactly why step 6's downstream key matters.
44
+
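One way to derive the downstream key from step 6 is a hash over your own scoped key plus the name of the outbound call. This is a sketch of the idea only: the `downstream_key` helper and its inputs are hypothetical, and the format the third party accepts for its key is an assumption to check against that provider's documentation.

```python
import hashlib


def downstream_key(account_id: str, own_key: str, call_name: str) -> str:
    """Same inputs always give the same key, so a retried handler reuses it."""
    # A separator that cannot appear in the parts keeps ("ab", "c") != ("a", "bc").
    material = "\x1f".join((account_id, own_key, call_name)).encode()
    return hashlib.sha256(material).hexdigest()


charge_key = downstream_key("acct_1", "demo-key-1", "processor.charge")
charge_key_on_retry = downstream_key("acct_1", "demo-key-1", "processor.charge")
email_key = downstream_key("acct_1", "demo-key-1", "mailer.receipt")
```

Including the call name means one handler can make several outbound calls, each with its own stable key, without them deduplicating against each other.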
45
+ ## Output
46
+
47
+ A design block specifying: (1) the **key scheme** — who generates it, its format, and the header it travels in; (2) the **scope** — the composite `(account, endpoint, key)` and which methods get keys vs. are naturally idempotent; (3) the **atomic store-and-check** — the exact unique constraint or conditional write, with the claim happening before the work; (4) the **in-flight handling** — the `in_progress` state, the `409`/`Retry-After` response, and the lease expiry; (5) the **downstream-keying** strategy for any third-party call; and (6) the **retention policy** — TTL value, mechanism, and the retry window it covers. Followed by a concrete handler/middleware sketch and the table/index DDL (or store schema) implementing it.
@@ -0,0 +1,81 @@
1
+ ---
2
+ name: "integration-test-designer"
3
+ description: "Design integration tests that exercise components against REAL collaborators — actual database, queue, HTTP boundary — at a deliberately chosen seam, instead of a unit suite that mocks everything or a slow flaky full E2E. Use when bugs slip past green unit tests, when wiring or contracts between layers break in production, or when a mocked DB test passes but the real query/migration/serialization fails."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ A unit suite that mocks the database, the queue, and the HTTP client proves your mocks are configured the way you configured them — it never runs your actual SQL, your migrations, your serialization, or the wiring between layers. That's exactly where bugs slip into production. A full E2E suite catches them but is too slow and flaky to gate merges. This skill designs the layer in between: an integration test that drives a deliberately chosen *slice* of the system through its real boundaries — a real database, a real broker, a real HTTP framework — while stubbing only the genuinely uncontrollable third parties. The deliverable is the chosen seam, an explicit real-vs-stubbed split, an ephemeral-infrastructure plus per-test data-isolation setup, and representative tests that assert on observable outcomes.
9
+
10
+ ## When to use this skill
11
+
12
+ - A bug shipped despite a green unit suite because the suite mocked the very collaborator that broke — a wrong column name, a missing migration, a JSON field that serializes differently than the mock returned.
13
+ - The wiring or contract *between* layers fails (handler doesn't pass the tenant id to the repo; a queue message round-trips with the wrong shape) and no test exercises the layers together.
14
+ - The E2E suite is too slow or flaky to run on every PR, so cross-layer regressions are caught late, in staging or prod.
15
+ - You're standing up a new service and want a fast, real-infrastructure test for the persistence/messaging path before there's anything to E2E.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Choose the seam deliberately — name what's inside the slice and what's outside.** Don't test "the whole app" and don't test one function; pick a coherent slice with real boundaries: handler→service→repository→**real DB**, or producer→**real broker**→consumer, or service→**real HTTP** of your own framework. State the entry point (the call that drives the test) and the exit boundary (the real collaborator whose effect you assert). Everything between them runs for real, unmocked; that is the integration you're proving works.
20
+ 2. **Use REAL infrastructure via ephemeral instances — not a mock of it.** Run the actual database, broker, or cache the slice talks to, spun up disposably: **Testcontainers** (a throwaway Postgres/MySQL/Kafka/Redis container per suite), a disposable Docker service, an in-process real engine (embedded Postgres, an in-memory SQLite *only if prod is SQLite*), or a local broker (an embedded Kafka/Redpanda, LocalStack for SQS). Run your real migrations against it on startup. A mocked DB test proves the mock returns what you told it to; only a real instance proves your query compiles, your migration applied, and your row maps back to your object.
21
+ 3. **Stub ONLY the truly external and uncontrollable.** Third parties you don't own and can't run locally — a payment processor, an email/SMS gateway, a partner API, a clock, a random source — get stubbed (or pointed at a fake server like WireMock / a captured-fixture HTTP mock). Drawing the line here, not at your own DB/queue, is the whole discipline: stub what you can't control or can't make deterministic; run everything you own for real.
22
+ 4. **Make every test hermetic and isolated — own your data, depend on no other test.** The top source of integration flake is shared mutable state across tests. Pick one isolation strategy and hold it: **transaction-per-test** (open a transaction in setup, run the test, roll back in teardown — fastest, but breaks if the code under test commits or needs its own connection); **unique data per test** (every row keyed by a per-test tenant/run id so concurrent tests never collide); or **truncate/reset between tests** (clean tables in teardown — simplest, slower). Each test seeds exactly the data it reads. No test may rely on data left by another or on running in a particular order.
23
+ 5. **Pay the slow cost once, not per test.** Starting a container or applying migrations is seconds; doing it per test makes the suite unrunnable. Spin infra up **once per suite/session** (a session-scoped fixture: `pytest` session fixture, JUnit `@Container static`, a global setup) and reuse it; reset only the *data* between tests (step 4), which is milliseconds. Keep the integration suite a separate, taggable target from the unit suite so it can run on its own cadence and developers still get a fast unit loop.
24
+ 6. **Assert observable outcomes, not internal calls.** Verify what actually happened at the real boundary: the row that now exists in the DB (query it back), the HTTP status and body the handler returned, the message that landed on the queue, the record that did *not* get written on a rollback path. Do not assert `repository.save was called once` — that's a mock-interaction check masquerading as integration coverage, and it passes even when the save silently failed. Cover the failure and edge paths too (constraint violation, conflicting concurrent write, retry on a dropped message), because those are precisely what unit mocks can't reproduce.
25
+
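For the "fake server" option in step 3, the standard library is enough to stand in for a partner API. The sketch below is illustrative: `FakePartnerAPI`, the `/rates/<currency>` route, the fixture payload, and `fetch_rate` are all hypothetical. The point is that the code under test still makes a real HTTP call and parses a real response, and only the remote end is fake.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakePartnerAPI(BaseHTTPRequestHandler):
    """Serves one captured fixture and records what it was asked for."""

    received = []  # request paths seen, so a test can assert on what was sent

    def do_GET(self):
        FakePartnerAPI.received.append(self.path)
        body = json.dumps({"currency": "EUR", "rate": 1.1}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


def start_fake_server() -> HTTPServer:
    server = HTTPServer(("127.0.0.1", 0), FakePartnerAPI)  # port 0: OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_rate(base_url: str, currency: str) -> float:
    # Code under test: real HTTP client, real JSON parsing.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass proxies
    with opener.open(f"{base_url}/rates/{currency}", timeout=5) as resp:
        return json.loads(resp.read())["rate"]


server = start_fake_server()
try:
    rate = fetch_rate(f"http://127.0.0.1:{server.server_port}", "EUR")
finally:
    server.shutdown()
    server.server_close()
```

In a real suite the server would live in a session-scoped fixture, and the fixture payload would be captured from the partner's actual response so the stub does not drift from reality.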
26
+ > [!WARNING]
27
+ > Mocking the database or queue inside an "integration" test defeats the entire purpose — you are testing the mock's configuration, not the integration. A `when(repo.find(...)).thenReturn(...)` test never runs your SQL, never catches a renamed column, a broken migration, or a NULL-handling bug. If the collaborator is yours to run, run a real ephemeral instance; if it isn't yours (a payment API), that's a stub *and a separate contract test* — see `contract-test-designer`.
28
+
29
+ > [!WARNING]
30
+ > Integration tests that share one database without per-test isolation become order-dependent and flaky: a test passes alone, fails in the suite, and fails differently in parallel, because it sees rows another test wrote (or expected rows another test deleted). Isolate data per test (transaction rollback or a per-test run id) before adding more tests, or the flake compounds until the suite gets disabled.
31
+
32
+ ## Output
33
+
34
+ For the chosen slice, the skill produces:
35
+
36
+ - **The seam** — the entry point that drives the test and the exit boundary whose effect is asserted, with everything in between named as in-slice (real).
37
+ - **Real vs. stubbed, with the reason** — a short table: each collaborator marked REAL (ephemeral instance, how it's provisioned) or STUBBED (why it's uncontrollable, what fake stands in).
38
+ - **The infra + isolation setup** — how the real instance is spun up once per suite (Testcontainers / disposable service / embedded engine), how migrations are applied, and the per-test data-isolation strategy (transaction rollback / unique run id / truncate).
39
+ - **Representative tests** — happy path plus the failure/edge paths mocks can't reach, each asserting an observable outcome at the real boundary.
40
+
41
+ Example — a service+repository slice against a real Postgres, in Python (pytest + Testcontainers), data isolated by transaction rollback:
42
+
43
+ ```python
44
+ import pytest
45
+ from testcontainers.postgres import PostgresContainer
46
+ from sqlalchemy import create_engine, text
47
+ from app.orders import OrderService # entry point of the slice
48
+
49
+ # Spin the REAL database ONCE per session, run real migrations against it.
50
+ @pytest.fixture(scope="session")
51
+ def engine():
52
+     with PostgresContainer("postgres:16") as pg:
53
+         eng = create_engine(pg.get_connection_url())
54
+         run_migrations(eng)  # your project's migration runner (not defined here): the actual migrations, not a hand-built schema
55
+         yield eng
56
+
57
+ # Isolate every test: open a transaction, hand it to the service, roll back after.
58
+ @pytest.fixture
59
+ def db(engine):
60
+     with engine.connect() as conn:  # connection is closed again in teardown
61
+         tx = conn.begin()
62
+         yield conn
63
+         tx.rollback()  # nothing persists; tests can't see each other's rows
64
+
65
+ def test_place_order_persists_row(db):
66
+     svc = OrderService(db)  # real service -> real repository -> real Postgres
67
+     order_id = svc.place_order(sku="widget", qty=3)
68
+     # Assert the OBSERVABLE outcome: the row exists with the right state.
69
+     row = db.execute(text("SELECT qty, status FROM orders WHERE id = :id"),
70
+                      {"id": order_id}).one()
71
+     assert (row.qty, row.status) == (3, "open")
72
+
73
+ def test_place_order_rejects_negative_qty_and_writes_nothing(db):
74
+     svc = OrderService(db)
75
+     with pytest.raises(ValueError):
76
+         svc.place_order(sku="widget", qty=-1)  # path a mocked repo would never exercise
77
+     count = db.execute(text("SELECT count(*) FROM orders")).scalar()
78
+     assert count == 0  # the failed write left no partial row
79
+ ```
80
+
81
+ The negative-qty test is the kind a mocked repository can't reach — it proves the real `CHECK`/validation prevents a partial write, against the real schema. Hand the seam to `test-scaffolder` to flesh out the remaining paths, use `mock-data-factory` to build the per-test seed data, and for the third parties you stubbed here, write a `contract-test-designer` test so their real shape stays pinned.
@@ -0,0 +1,39 @@
1
+ ---
2
+ name: "model-router-designer"
3
+ description: "Design a model router that sends each LLM request to the cheapest model that can handle it and escalates only the hard cases to the strongest — cutting cost and latency without tanking quality, gated by an eval set so the savings don't come from silently worse answers. Use when one expensive model serves all traffic (most of it easy), when LLM cost or latency is too high, or when balancing quality against spend across a range of request difficulty."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ Serving 100% of traffic with your most capable model means paying frontier prices for the large share of requests (70% in this illustration) that a smaller model would have answered just as well. A model router fixes that by routing each request to the cheapest model that can handle it and escalating only the genuinely hard cases. Routed blind, though, it trades cost for silent quality regressions on exactly the requests that needed the strong tier. This skill designs the router as a measured system: segment the traffic, pick the cheapest signal that separates it, build an escalation path for the misses, and gate the whole thing on an eval set so you can prove the savings are real.
9
+
10
+ ## When to use this skill
11
+ - One expensive model answers all requests and most of them are obviously easy (lookups, formatting, short classifications) — you're overpaying on the majority.
12
+ - LLM cost or p95 latency is too high and you want to shed both without a blanket model downgrade that would hurt the hard cases.
13
+ - Traffic spans a real difficulty range — trivial extraction up through multi-step reasoning — and you want to spend strong-model budget only where it changes the answer.
14
+ - You already tried "just use the cheaper model everywhere" and quality dropped on the hard tail.
15
+
16
+ > [!NOTE]
17
+ > Routing only pays off when a meaningful share of traffic is genuinely easy. If nearly every request needs the strong model, a router adds decision cost and complexity for almost no saving — segment first (step 1) and confirm the easy slice exists before building anything.
18
+
19
+ ## Instructions
20
+ 1. **Segment the traffic by difficulty before touching code.** Pull a representative sample of real requests (or read the logs/handlers with `Grep`/`Glob`) and bucket them into three tiers: (a) **mechanical** — classification, extraction, fixed-format transforms, short factual lookups; (b) **moderate** — straightforward Q&A, summarization, single-step reasoning; (c) **hard** — multi-step reasoning, code generation, ambiguous or long-context tasks. Estimate the volume share of each. If tier (a)+(b) isn't a sizable fraction, stop — the router won't earn its keep. This split is the spec for everything downstream.
21
+ 2. **Pick the cheapest routing signal that separates the tiers — in this order.** Reach for the lowest-cost signal that works and stop there: (1) **free heuristics** — the task type/endpoint the request came through, input token length, a required-capability flag (needs JSON mode, needs tools, needs vision, needs long context), presence of code; (2) **a lightweight classifier** — a small fast model or a trained text classifier that labels difficulty, when heuristics can't cleanly separate; (3) **an LLM-based router** — only when neither of the above can tell easy from hard. The router runs on every request, so its cost and latency are pure overhead — never let the router cost more than it saves.
22
+ 3. **Set explicit thresholds, not vibes.** Turn the signal into concrete rules: e.g. *length < 500 tokens AND task ∈ {classify, extract} → cheap tier*; *needs-tools OR length > 8k tokens → strong tier*. Write the thresholds down with the segmentation they came from so they're auditable and tunable, not buried in an `if`-ladder no one can reason about.
23
+ 4. **Design the escalation/fallback cascade so easy wins stay cheap and hard cases still get quality.** Default-route to the cheap tier, then run a **validation check** on its output — a confidence signal, a schema/format validation, a "did it actually answer / did it say it's unsure" check, or a cheap self-grade. On failure, **retry the same request on the strong tier** (a cascade). This way the easy majority is served at cheap-tier price in one hop, and only the cases the cheap model fumbles pay for the strong model — capturing most of the saving without eating the quality hit. Decide the validation check per task: structured outputs get schema validation for free; open-ended generation needs a confidence or self-grade signal.
24
+ 5. **Choose the tiers concretely.** Default the **strong tier** to the latest, most capable Claude model (`claude-opus-4-8`) and the **cheap tier** to a smaller, faster model (`claude-haiku-4-5`); a mid model (`claude-sonnet-4-6`) is a reasonable middle rung if a two-step cascade leaves a gap. Use exact model ID strings — never construct or date-suffix them. Add **always-route-strong guardrails** for high-stakes paths (anything irreversible, safety-relevant, or where a wrong answer is expensive) regardless of what the signal says.
25
+ 6. **Measure the trade with an eval set — per route, not just in aggregate.** Build (or reuse) a labeled eval set spanning all three difficulty tiers and score three things on every route: **cost**, **latency**, and a **quality metric** (task accuracy, schema-valid rate, judge score — whatever fits the task). Track cheap-route quality, strong-route quality, escalation rate, and the blended numbers separately. The router is only a win if blended cost and latency drop *and* cheap-route quality stays above your bar. If cheap-route quality sags, tighten the threshold or move that segment to the strong tier.
26
+
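Steps 2 through 5 can be sketched as a small rule table plus a cascade. This is a minimal illustration under stated assumptions, not the skill's prescribed implementation: the `Request` shape, the `Route` record, and the `call_model` / `is_valid` callables are hypothetical names, and the cutoffs simply reuse the example thresholds from step 3.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CHEAP, STRONG = "cheap", "strong"  # tier labels; map them to real model IDs in config


@dataclass(frozen=True)
class Request:  # hypothetical request shape, for illustration only
    task: str
    input_tokens: int
    needs_tools: bool = False
    high_stakes: bool = False


@dataclass(frozen=True)
class Route:
    tier: str
    rule: str  # which written-down rule fired, so every decision is auditable


def pick_route(req: Request) -> Route:
    """Free heuristics only, checked in a fixed order (no model call)."""
    if req.high_stakes:
        return Route(STRONG, "guardrail: high-stakes path")
    if req.needs_tools or req.input_tokens > 8_000:
        return Route(STRONG, "needs tools or input > 8k tokens")
    if req.input_tokens < 500 and req.task in {"classify", "extract"}:
        return Route(CHEAP, "short mechanical task")
    return Route(CHEAP, "default: cheap first, validated")


def answer(
    req: Request,
    call_model: Callable[[str, Request], str],  # assumed wrapper around your LLM client
    is_valid: Callable[[Request, str], bool],   # assumed per-task validation check
) -> Tuple[str, str]:
    """Serve on the routed tier; escalate a failed cheap answer to the strong tier."""
    route = pick_route(req)
    output = call_model(route.tier, req)
    if route.tier == CHEAP and not is_valid(req, output):
        return STRONG, call_model(STRONG, req)  # the cascade: retry once on strong
    return route.tier, output
```

Returning the rule name alongside the tier is one way to keep thresholds out of an opaque `if`-ladder: logging `route.rule` per request shows which cutoff is responsible when a segment's quality sags.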
27
+ > [!WARNING]
28
+ > Routing too much to the cheap model silently degrades quality on the cases that needed the strong one — and aggregate metrics hide it because the easy majority looks fine. Never route blind: gate every threshold change against the per-route eval set and keep the escalation check honest. A router with no quality measurement is just a quality regression you haven't noticed yet.
29
+
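To make "per route, not just in aggregate" concrete, here is a hedged sketch of the bookkeeping step 6 describes. The `EvalResult` fields, the three route labels, and the `passes_gate` rule are assumptions chosen for illustration; swap in whatever quality metric and bar fit the task.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalResult:  # one scored eval example (hypothetical shape)
    route: str     # "cheap", "strong", or "escalated" (cheap failed, retried on strong)
    correct: bool
    cost_usd: float
    latency_ms: float


def _stats(rows: List[EvalResult]) -> Dict[str, float]:
    n = len(rows)
    return {
        "n": n,
        "quality": sum(r.correct for r in rows) / n,
        "avg_cost_usd": sum(r.cost_usd for r in rows) / n,
        "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
    }


def summarize(results: List[EvalResult]) -> Dict[str, object]:
    """Per-route and blended numbers, plus the escalation rate."""
    if not results:
        raise ValueError("empty eval set")
    by_route = defaultdict(list)
    for r in results:
        by_route[r.route].append(r)
    report: Dict[str, object] = {route: _stats(rows) for route, rows in by_route.items()}
    report["blended"] = _stats(results)
    report["escalation_rate"] = len(by_route.get("escalated", [])) / len(results)
    return report


def passes_gate(report, cheap_quality_bar: float, baseline_cost_usd: float) -> bool:
    """Win only if blended cost drops AND cheap-route quality holds the bar."""
    cheap = report.get("cheap")
    return (
        cheap is not None
        and cheap["quality"] >= cheap_quality_bar
        and report["blended"]["avg_cost_usd"] < baseline_cost_usd
    )
```

Running `passes_gate` on every threshold change, with `baseline_cost_usd` taken from an all-strong run over the same eval set, is one way to keep a cheaper blended number from hiding a sagging cheap route.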
30
+ > [!WARNING]
31
+ > An LLM-as-router adds its own latency and token cost on EVERY request, including the easy ones a heuristic would have caught for free. If a task-type check, an input-length cutoff, or a small classifier separates the traffic, use that — reserve the LLM router for the cases where simpler signals genuinely can't, and confirm it still nets a saving end to end.
32
+
33
+ ## Output
34
+ A model-routing design, written down so it's tunable:
35
+ - **Difficulty segmentation** — the three tiers with their defining traits and estimated volume share, plus the go/no-go call on whether a router is worth building.
36
+ - **Routing signal + thresholds** — which signal (heuristic / small classifier / LLM router) and why it's the cheapest that works, with the concrete cutoff rules and the segmentation they derive from.
37
+ - **Escalation/fallback cascade** — the default cheap route, the validation check per task type, and the retry-on-strong path, including any always-route-strong guardrails for high-stakes requests.
38
+ - **Tier choice** — the strong and cheap model IDs (default `claude-opus-4-8` / `claude-haiku-4-5`, optional `claude-sonnet-4-6` middle rung) and the rationale.
39
+ - **Validation metrics** — the eval set composition and the per-route cost / latency / quality numbers (with escalation rate) that prove the router cut spend and latency without dropping quality below the bar.
@@ -0,0 +1,64 @@
1
+ ---
2
+ name: "mutation-test-runner"
3
+ description: "Measure whether a test suite actually catches bugs by running mutation testing — introduce small faults into the code and check which ones a test kills versus which slip through silently. Use when line coverage is high but bugs still ship, when you suspect tests assert weakly, or to find the exact assertions a suite is missing."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ Line coverage tells you a line ran during a test. It does not tell you the test would fail if that line were wrong: a function can be 100% covered by an assertion-free test. Mutation testing closes that gap. It plants small faults in the code (flip `>` to `>=`, swap `+` for `-`, drop a statement, negate a condition) and re-runs the suite against each one. A mutant that makes at least one test fail is **killed**: the suite pins that behavior. A mutant the whole suite still passes against **survives**: no test noticed the code changed, so that behavior is unprotected. This skill runs a mutation tool, reads the survivors as a precise to-do list of missing assertions, and tells you exactly which tests to add to kill them.
9
+
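The killed/survived distinction is easiest to see on a toy case. The sketch below hand-writes one boundary mutant rather than running a real tool, and every function name in it is hypothetical; it only illustrates why a test can cover a line without constraining it.

```python
from typing import Callable


def qualifies_for_discount(qty: int) -> bool:
    """Original rule: discount only for MORE than 10 items."""
    return qty > 10


def mutant_qualifies_for_discount(qty: int) -> bool:
    """Boundary mutant of the kind a tool generates: `>` changed to `>=`."""
    return qty >= 10


def weak_test(fn: Callable[[int], bool]) -> bool:
    """Executes the comparison (full line coverage) but never probes the boundary."""
    return fn(50) is True and fn(1) is False


def boundary_test(fn: Callable[[int], bool]) -> bool:
    """Pins behavior at exactly the threshold and one step past it."""
    return fn(10) is False and fn(11) is True


def verdict(test: Callable[[Callable[[int], bool]], bool],
            mutant: Callable[[int], bool]) -> str:
    """A mutant survives if the test still passes when run against it."""
    return "survived" if test(mutant) else "killed"


if __name__ == "__main__":
    print(verdict(weak_test, mutant_qualifies_for_discount))
    print(verdict(boundary_test, mutant_qualifies_for_discount))
```

Both tests pass against the original function and both give the comparison full line coverage; only the boundary test distinguishes the original from the mutant.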
10
+ ## When to use this skill
11
+
12
+ - Coverage is high (80–100%) but bugs still slip into production — the classic symptom of covered-but-unasserted code.
13
+ - You inherited or reviewed a suite and suspect the tests assert weakly (snapshot-only, no return-value checks, `toBeDefined` instead of `toEqual`).
14
+ - A module is critical (auth, money, parsing, pricing) and you want proof the suite would catch a regression, not just that it touches the lines.
15
+ - You're hardening a specific change and want the missing assertions for *that diff*, not a repo-wide audit.
16
+
17
+ > [!WARNING]
18
+ > 100% line coverage with surviving mutants is the false confidence this skill exists to expose: the code runs in a test, but no assertion would fail if the code were wrong. A green coverage badge is not a green mutation score.
19
+
20
+ ## Instructions
21
+
22
+ 1. **Pick the tool for the language — don't guess, check what's installed.** Inspect deps and config first:
23
+ - JS/TS: **Stryker** (`@stryker-mutator/core`, config `stryker.conf.json`/`.mjs`); it auto-detects Jest/Vitest/Mocha runners.
24
+ - Python: **mutmut** (`mutmut run`, config in `setup.cfg`/`pyproject.toml`) or **cosmic-ray** for larger suites.
25
+ - Java/Kotlin: **PIT** (`pitest`, Maven/Gradle plugin). Go: **go-mutesting** or **gremlins**. Ruby: **mutant**. C#: **Stryker.NET**.
26
+ - If no tool is installed, recommend the standard one for the stack and stop there — do not silently add a dev dependency.
27
+ 2. **Scope the run to changed code. This is mandatory, not an optimization.** Mutation testing re-runs the suite once per mutant, so an unscoped run over a whole repo can take hours. Target the diff or a single package: Stryker `--mutate "src/pricing/**/*.ts"` (StrykerJS 6.2+ can also reuse a previous run's results with `--incremental`; `--since` is a Stryker.NET option), mutmut `--paths-to-mutate src/billing/` or the `paths_to_mutate` config key, PIT `targetClasses` set to the changed package. State the chosen paths up front so the run is reproducible.
28
+ 3. **Run and collect the surviving mutants, not the summary number.** Execute the tool and read its detailed report (Stryker's `mutation.html` or `--reporters json`, mutmut `mutmut results` + `mutmut show <id>`, PIT's `mutations.xml`). For each survivor capture: file, line, the original code, and the exact mutation that lived (e.g. `boundary: changed >= to >` or `removed call to logAudit()`).
29
+ 4. **Triage each survivor: real gap or equivalent mutant.** An **equivalent mutant** changes the code without changing observable behavior — e.g. `i <= n-1` vs `i < n`, reordering commutative operations, mutating a value that's overwritten before use. These *cannot* be killed by any test; mark them `equivalent — ignore` with a one-line reason and move on. Everything else is a genuine gap: a behavior your tests don't constrain.
30
+ 5. **For each real survivor, name the assertion that would kill it.** This is the payoff. A survived `changed > to >=` on a discount threshold means no test exercises the exact boundary — propose "`applyDiscount(qty=10)` where the rule is `qty > 10`: assert no discount at exactly 10." A survived `removed call to audit()` means nothing asserts the side effect — propose "assert `auditLog` received one entry after `transfer()`." Write the input and the expected behavior, not "add a test for line 42."
31
+ 6. **Group survivors by file and track the score where it's worth defending.** Report the mutation score (killed / total non-equivalent) per scoped path as a *baseline to hold or raise on critical modules*, never as a vanity 100% target — chasing the last few percent usually means fighting equivalent mutants. Record the baseline so the next run can detect regressions.
32
+
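The score in step 6 is simple arithmetic, but what goes in the denominator matters. The sketch below assumes a flat list of per-mutant status strings (a hypothetical format, not any tool's actual report schema) and follows this skill's convention: equivalents are excluded, while no-coverage and timeout mutants stay in the denominator and never count as kills. Some tools score timeouts differently, so check the tool's own definition before comparing numbers.

```python
from collections import Counter
from typing import Iterable, Tuple


def mutation_score(statuses: Iterable[str]) -> Tuple[float, Counter]:
    """Return (killed / non-equivalent, per-status counts).

    Expected statuses (assumed labels): "killed", "survived",
    "no_coverage", "timeout", "equivalent".
    """
    counts = Counter(statuses)
    scored = sum(counts.values()) - counts["equivalent"]
    if scored == 0:
        raise ValueError("no non-equivalent mutants to score")
    return counts["killed"] / scored, counts


def holds_baseline(score: float, baseline: float) -> bool:
    """Regression check for the next run: the score must not fall below the baseline."""
    return score >= baseline


if __name__ == "__main__":
    # Illustrative counts only: 47 mutants, 5 triaged as equivalent.
    run = (["killed"] * 34 + ["survived"] * 3
           + ["no_coverage"] * 5 + ["equivalent"] * 5)
    score, counts = mutation_score(run)
    print(f"{score:.0%} ({counts['killed']} killed / {len(run) - counts['equivalent']} non-equivalent)")
```

Keeping no-coverage mutants in the denominator means a module with untested code cannot post a high score simply because nothing ran against it.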
33
+ > [!NOTE]
34
+ > Two survivors that share a root cause often need one assertion. A function where every arithmetic and boundary mutant survives usually has a single test that calls it and asserts only that it didn't throw — adding one real return-value assertion can kill the whole cluster at once.
35
+
36
+ > [!WARNING]
37
+ > If a mutation run "passes" with zero survivors but also shows mutants marked **no coverage** or **timeout**, don't conclude the suite is strong: no assertion ever judged those mutants. No-coverage mutants were never executed by any test, which is a coverage gap (hand them to `coverage-gap-finder`); timeouts often mean a mutant created an infinite loop that the runner aborted, not that a test failed on an assertion. Don't read either as kills.
38
+
39
+ ## Output
40
+
41
+ A survivor report grouped by file, plus the run scoping so it's reproducible:
42
+
43
+ ```
44
+ Scope: src/billing/** (mutated 47 mutants, 90s)
45
+ Mutation score: 81% (34 killed / 42 non-equivalent) — baseline, hold >=80 on billing
46
+
47
+ src/billing/discount.ts
48
+ SURVIVED L23 changed `qty > 10` -> `qty >= 10` [BOUNDARY]
49
+ Gap: no test hits the exact threshold.
50
+ Add: applyDiscount({ qty: 10 }) -> assert price unchanged (no discount at boundary)
51
+ SURVIVED L31 removed call to `roundCents(total)` [STATEMENT]
52
+ Gap: nothing asserts the rounded result.
53
+ Add: applyDiscount({ qty: 12, price: 3.337 }) -> assert total === 33.37 (not 33.3696)
54
+
55
+ src/billing/invoice.ts
56
+ SURVIVED L58 changed `&&` -> `||` in isOverdue guard [LOGICAL]
57
+ Gap: only the both-true case is tested.
58
+ Add: isOverdue({ pastDue: true, paid: true }) -> assert false
59
+ EQUIVALENT L72 `i <= len-1` -> `i < len` — ignore (same iteration count)
60
+
61
+ No-coverage: 5 mutants in src/billing/legacy.ts -> route to coverage-gap-finder (not killed).
62
+ ```
63
+
64
+ Each surviving line is a missing assertion; the `Add:` lines are concrete enough to hand straight to a test scaffolder. Re-run the same scope after adding them to confirm the survivors flip to killed and the score holds.