agentscamp 0.5.0 → 0.6.0

@@ -0,0 +1,35 @@
1
+ ---
2
+ name: "flamegraph-analyzer"
3
+ description: "Turn a CPU profile or flamegraph into a concrete optimization instead of guessing where the time goes: capture under a realistic workload with a sampling profiler, read the graph correctly (width = time, depth ≠ time), find the widest self-time leaves, ask if that work is necessary/redundant/algorithmically wrong, fix the biggest contributor, then re-profile. Use when code is CPU-bound and slow, a function is hot but you don't know which part, or you have a profile you can't interpret."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ When code is slow and CPU-bound, the most expensive thing you can do is guess. Intuition about "the slow part" is wrong often enough that optimizing it usually buys nothing while the real hotspot sits untouched. A flamegraph answers the question directly — *which frames are actually burning CPU* — but only if you capture it under a realistic workload and read it correctly. This skill does both: it gets a representative sampling profile, reads width as time and the y-axis as depth (not a timeline), pins the hotspot to the widest self-time leaves, classifies the work as unnecessary / redundant / algorithmically wrong, fixes the biggest contributor, and re-profiles — because the bottleneck always moves after a fix, and your intuition about the new one is just as unreliable.
9
+
10
+ ## When to use this skill
11
+
12
+ - A request, job, or function is slow, CPU usage is high, and you don't know which part of the call tree is responsible.
13
+ - You have a profile or flamegraph SVG but can't tell where the time is going or whether you're reading it right.
14
+ - Something is "obviously" slow and you're about to optimize the part you suspect — stop and confirm it with a profile first.
15
+ - A hot path got optimized and got no faster, or only a little — the real bottleneck was elsewhere and you need to find it.
16
+ - You want to know whether the latency is *computation* (on-CPU) or *waiting* (I/O, locks) before you pick where to spend effort.
17
+
18
+ ## Instructions
19
+
20
+ 1. **Capture a profile under a realistic workload with a sampling profiler; don't reason from intuition.** Drive the code the way production does (representative input size, concurrency, warm caches/JIT), then sample it with the right tool: `perf record -F 99 -g` (Linux native), async-profiler (JVM), `py-spy record` (Python), `go tool pprof` (Go), `node --prof` or `node --cpu-prof` (Node), or the DevTools profiler (browser). Prefer **sampling** over instrumenting, because instrumentation distorts the very hot frames you care about. Profile a *steady* phase, not cold start, unless cold start is the thing you're optimizing.
21
+ 2. **Render it as a flamegraph and read the axes correctly.** Collapse stacks and render (e.g. `perf script | stackcollapse-perf.pl | flamegraph.pl`, async-profiler's HTML, `go tool pprof -http`, speedscope). **Width = total time spent in a frame and everything it called; wide is expensive. The y-axis is call-stack depth, NOT time — it is not a timeline.** A tall, narrow tower is a deep-but-cheap call chain; a short, wide plateau is your hotspot. Frame ordering left-to-right is alphabetical/merge order, not chronological — never read it as "this ran, then that."
22
+ 3. **Find the widest *leaf* frames — that's where the CPU actually is.** Look at the top edge of the graph: the plateaus at the *top* of the stacks are self-time leaves, the code actually executing when samples were taken. A wide frame deep in the middle is wide because of what it *calls*; the work itself lives in the wide things sitting on top of it. Use the profiler's "self/own time" sort to confirm. Rank hotspots by self-time, not by who's tallest.
23
+ 4. **For each top hotspot, classify the work: unnecessary, redundant, or algorithmically wrong.** Read the wide leaf and ask: (a) **Unnecessary** — is this work needed at all, or is it logging/serialization/validation/copying in a hot loop that could be hoisted, batched, or dropped? (b) **Redundant** — is the same frame wide because it's *called too many times* (recomputed per item, re-parsed, re-allocated)? Cache, memoize, or lift it out of the loop. (c) **Algorithmically wrong** — a wide frame that grows with input is often an O(n²) hiding in plain sight (linear scan inside a loop, repeated string concat, a `Set` that's actually a list). Match the frame's width-vs-input behavior to the algorithm.
24
+ 5. **Confirm the latency is on-CPU before optimizing CPU.** A CPU-sample flamegraph is *blind to time spent waiting*: it shows almost nothing for blocking I/O, lock contention, or sleeping threads, because those threads aren't on-CPU to be sampled. If the wall-clock latency is large but the on-CPU flamegraph is thin or idle, the time is being *waited*, not *computed*, so capture an **off-CPU / wall-clock** profile instead (off-CPU flamegraph via `perf`/eBPF, async-profiler `wall` mode, `py-spy record --idle` so idle threads are included, or a blocking/lock profiler). Optimizing CPU frames will do nothing for a workload that's actually waiting on a database or a mutex.
25
+ 6. **Optimize the single biggest contributor, then RE-PROFILE.** Fix the widest hotspot first — it has the most time to give back. Then capture the *same* workload again from scratch. The bottleneck moves after every fix: the second-widest frame is now first, and the percentages you remember are stale. Do not chain optimizations from one profile; your intuition about the *new* top frame is exactly as unreliable as it was about the first. Stop when the remaining hotspots are narrow enough that the next fix isn't worth the complexity.
26
+
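As a rough illustration of step 3, the sketch below ranks frames by self time from collapsed ("folded") stacks, the `frame;frame;frame count` lines that `stackcollapse-perf.pl` emits. The function name `self_and_total` and the sample stacks are invented for the example; it is a reading aid under those assumptions, not a replacement for the profiler's own self-time view.

```python
from collections import Counter

def self_and_total(folded_lines):
    """Return (self_samples, total_samples) per frame from folded stacks."""
    self_samples, total_samples = Counter(), Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated token; frame names may contain spaces.
        stack, _, count = line.rpartition(" ")
        n = int(count)
        frames = stack.split(";")
        self_samples[frames[-1]] += n      # the leaf is what was actually on-CPU
        for frame in set(frames):          # set() so recursion is not double-counted
            total_samples[frame] += n
    return self_samples, total_samples

# Hypothetical profile: 100 samples in total.
SAMPLE = [
    "main;handle;parse 10",
    "main;handle;parse;json_decode 60",
    "main;handle;render 20",
    "main;log 10",
]

self_samples, total_samples = self_and_total(SAMPLE)
for frame, n in self_samples.most_common(3):
    print(f"{frame}: self={n} total={total_samples[frame]}")
```

In this made-up profile `handle` is wide (90 samples) but has no self time at all; the work sits in `json_decode`, which is the frame to classify in step 4.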
27
+ > [!WARNING]
28
+ > The y-axis is call-stack **depth, not time**, and the x-axis is merge order, not the order code ran in: a flamegraph is not a timeline. A tall, narrow tower is a cheap deep call chain; a short, wide plateau is your hotspot. Read either axis as elapsed time and you'll "optimize" the wrong frame and wonder why nothing got faster.
29
+
30
+ > [!NOTE]
31
+ > A CPU flamegraph is blind to waiting. If a request takes 800ms but the on-CPU graph is mostly idle, the time is spent blocked on I/O or a lock, not computing — switch to an off-CPU / wall-clock profile. Speeding up thin CPU frames can't fix latency that's actually spent waiting.
32
+
33
+ ## Output
34
+
35
+ A short report with four parts: (1) the **capture conditions** — profiler used, workload/input that was profiled, and whether it's on-CPU or off-CPU/wall-clock; (2) the **identified hotspot(s)** read straight off the graph — each as `frame name + share of total samples + self-time vs. children` and *why* it's hot (unnecessary / redundant / algorithmically wrong); (3) the **targeted fix** for the biggest contributor as a concrete change (hoist out of loop, memoize, replace O(n²), or — if it's wait time — go profile off-CPU); and (4) the **re-profile plan** — rerun the identical workload, expected new top frame, and the stopping condition once hotspots are no longer worth chasing.
@@ -0,0 +1,34 @@
1
+ ---
2
+ name: "git-blame-investigator"
3
+ description: "Reconstruct why a line of code exists from Git history — find the originating commit, read its message and full diff for intent, and see through reformatting/rename commits with ignore-revs and the pickaxe — before you change or delete it. Use when a line looks wrong or pointless and you want to remove it, when tracing a regression to its commit, or when onboarding to unfamiliar code."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ `git blame` tells you *who* last touched a line, which is almost never the question you actually have. The real question — "why is this here, and what breaks if I remove it?" — lives in the commit *message*, the surrounding diff, and the PR that shipped it. This skill does code archaeology: it walks from a suspicious line back to the commit that introduced the *logic* (not the one that reindented it), reads the intent, and returns a verdict on whether the code is a dead artifact or a Chesterton's fence guarding a bug you can't see.
9
+
10
+ ## When to use this skill
11
+ - A line looks redundant, wrong, or pointless and you're about to delete or "simplify" it.
12
+ - You're tracing a regression and need the exact commit that changed the behavior.
13
+ - You're onboarding to unfamiliar code and need to reconstruct *why* it was written this way.
14
+ - A workaround, magic constant, or odd conditional has no comment explaining it.
15
+ - blame keeps pointing at a formatting, rename, or merge commit that obviously isn't the real author.
16
+
17
+ ## Instructions
18
+ 1. **Locate the line precisely, then blame with context.** Run `git blame -L <start>,<end> <path>` on the suspicious range (not the whole file) and note the commit SHA, not the author name. Add `-w` to ignore whitespace-only changes and `-C -C -M` to follow lines that were moved or copied in from other files — without these, blame stops at the refactor that relocated the code and you lose its true origin.
19
+ 2. **Distrust the first SHA — it's usually noise.** If the blamed commit is a Prettier run, a lint autofix, a mass rename, or a "merge branch" commit, it did not author the logic. Re-blame ignoring it: `git blame --ignore-rev <sha> -L <start>,<end> <path>`. If the repo has recurring reformatting commits, list them in a `.git-blame-ignore-revs` file and set `git config blame.ignoreRevsFile .git-blame-ignore-revs` so every blame sees through them automatically.
20
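+ 3. **Read the intent, not just the patch.** Once you have the real commit, run `git show <sha>` to read the *full* commit message and the *entire* diff, not only the line you care about. Then find the PR: `git log --merges --ancestry-path <sha>..HEAD` lists the merges that carried the commit forward (the oldest one is usually the PR merge; adding a path filter tends to simplify those merges away), or `gh pr list --search <sha> --state merged` looks it up directly (without `--state`, `gh` lists only open PRs). Read the PR description and review discussion. The "why" is in prose far more often than in code.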
21
+ 4. **Track the exact line or string through time with line-history and the pickaxe.** For a moving target use `git log -L <start>,<end>:<path>` to see every commit that changed that line range, in order, with diffs. To find when a specific string, identifier, or value *entered or left* the codebase, use the pickaxe: `git log -S '<exact-string>' -- <path>` (changes in the count of that string) or `git log -G '<regex>' -- <path>` (any diff line matching the regex). `-S` answers "when did this magic number / flag / call site appear or disappear?" in seconds.
22
+ 5. **Follow the code across moves and renames.** A file rename or extraction silently truncates history. Use `git log --follow -- <path>` to span renames, and when logic was hoisted into a new file, use blame's `-C -C -C` (which looks for copies from other files in any commit, not only files changed in the same commit) to find where it was lifted from. Confirm the trail is unbroken before drawing conclusions: a gap means the real origin is in a pre-rename path.
23
+ 6. **Trace a regression to its commit, by bisection if needed.** First try `git log --oneline -- <path>` plus `git log -L` to spot an obvious culprit. If the offending change isn't obvious, run `git bisect`: `git bisect start`, `git bisect bad` (current), `git bisect good <known-good-sha>`, then test each checkout (script it with `git bisect run <test-cmd>` for an exact, automated answer). Bisect finds the precise breaking commit even across hundreds of revisions.
24
+ 7. **Reconstruct the decision from the neighborhood.** Read the commits immediately before and after the originating one (`git log --oneline <sha>~3..<sha> -- <path>` for what led up to it, `git log --oneline --reverse <sha>..HEAD -- <path>` for what followed, plus the linked issue) to see what problem the change was solving. A line that looks pointless in isolation often makes sense as one half of a fix, the other half being the bug it prevents.
25
+ 8. **Render a verdict tied to evidence.** Conclude with one of: *safe to remove* (origin found, the problem it solved no longer exists — cite the commit/issue), *do not touch* (it guards a known bug or invariant — cite the commit), or *needs a test first* (intent is plausible but unverified — name the behavior to lock down before changing). Never conclude "safe to remove" without having found and read the originating intent.
26
+
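For step 6, `git bisect run` decides good or bad from the test command's exit code: 0 means good, 1 to 127 (except 125) means bad, and 125 means "skip this commit, it can't be tested". A minimal sketch of such a check script follows; the `make` targets and the function names are hypothetical placeholders for the project's own build and regression test.

```python
import subprocess

# Exit codes that `git bisect run` interprets.
GOOD, BAD, SKIP = 0, 1, 125

def classify(build_ok, test_ok):
    """Map one checkout's outcome to a bisect exit code."""
    if not build_ok:
        return SKIP  # an unbuildable commit says nothing about the regression
    return GOOD if test_ok else BAD

def succeeded(cmd):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(cmd).returncode == 0

def main():
    # Hypothetical commands: substitute the real build and regression test.
    build_ok = succeeded(["make", "build"])
    test_ok = build_ok and succeeded(["make", "test-regression"])
    return classify(build_ok, test_ok)
```

Saved as a script that ends with `sys.exit(main())` (the filename `bisect_check.py` is an assumption), it would be driven with `git bisect run python bisect_check.py` after marking the good and bad endpoints.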
27
+ > [!WARNING]
28
+ > blame's first answer is often a formatting or rename commit that hides the real author. If you act on it without `--ignore-rev` and the pickaxe, you will attribute the code to the wrong change and reason about the wrong intent.
29
+
30
+ > [!WARNING]
31
+ > Deleting code whose original purpose you haven't found is the single most common way regressions get reintroduced. "I don't see why this is here" is a reason to investigate, never a license to remove.
32
+
33
+ ## Output
34
+ A short investigation report containing: (1) the **originating commit(s)** — SHA, message, and the intent reconstructed from the diff and PR; (2) the **line/string history** — the ordered list of commits that introduced, moved, or altered the code (from `log -L` / `-S`), with the rename or refactor boundaries it crossed; and (3) a **verdict** — *safe to change/remove*, *do not touch*, or *needs a test first* — each justified by the cited commit or issue. All claims trace to a SHA the reader can re-run.
@@ -0,0 +1,49 @@
1
+ ---
2
+ name: "graphql-schema-designer"
3
+ description: "Design a clean, evolvable GraphQL schema (SDL) that won't paint you into a corner — model the graph around domain types and their relationships rather than as RPC-over-GraphQL, set nullability deliberately, standardize lists with Relay connections, plan DataLoader batching for per-parent fields, and evolve by adding + @deprecated instead of versioning. Use when designing a new GraphQL API, reviewing an SDL, or migrating REST endpoints to a graph."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ A GraphQL schema is not an afterthought over your endpoints — it's the public contract clients build against, and unlike REST there's no `/v2` to escape a bad decision: the graph evolves in place, forever. Two design mistakes dominate the post-launch pain. First, modeling the schema as a thin RPC wrapper of your existing endpoints (`getUserById`, `listOrdersForUser`) instead of a connected graph of types and relationships, which throws away the one thing GraphQL gives you. Second, sprinkling non-null (`!`) everywhere "to be safe," which is a trap — a single resolver error on a non-null field nulls its *entire parent object*, so a flaky downstream blanks out the whole response. This skill designs the SDL deliberately: types and edges, considered nullability, Relay connections for lists, a consistent mutation payload shape, and an explicit DataLoader plan for the fields that would otherwise N+1.
9
+
10
+ ## When to use this skill
11
+
12
+ - You're designing a new GraphQL API from scratch and want an SDL that survives years of additive change without versioning.
13
+ - You're reviewing or refactoring an existing schema that reads like a list of RPC calls, has `!` on nearly every field, or returns bare arrays for lists.
14
+ - You're migrating REST endpoints to GraphQL and need to re-model resources as a connected graph rather than transcribing routes into queries one-for-one.
15
+ - Nested queries are slow and you suspect resolvers are firing one DB query per parent row (the N+1 storm).
16
+
17
+ ## Instructions
18
+
19
+ 1. **Model the graph around domain types and their relationships, not your endpoints.** Identify the nouns (`User`, `Order`, `Product`, `Review`) and the *edges* between them, then expose those edges as fields that return types — `User.orders`, `Order.lineItems`, `Review.author` — so a client can traverse `user { orders { lineItems { product { name } } } }` in one round trip. Do **not** transcribe REST routes into a flat field per endpoint (`getUserById`, `getOrdersForUser`, `getProductForLineItem`); that's RPC-over-GraphQL and forces clients back into client-side joins and N round trips. The query-graph shape, not your handler list, is the source of truth.
20
+
21
+ 2. **Set nullability deliberately, field by field — non-null is a contract, not a default.** Mark a field non-null (`name: String!`) only when it *genuinely always resolves* — a column with a NOT NULL constraint, a synthesized value, the object's own `id`. Make a field nullable when a downstream failure (a separate service, a join that can return nothing, a slow API) shouldn't take down the rest of the response. The error-propagation rule is the whole reason this matters: when a non-null field's resolver throws or returns null, GraphQL can't put null there, so it nulls the *nearest nullable ancestor* — often the entire parent object — propagating upward until it hits a nullable field. So `Order.recommendedProducts` (computed by a flaky ML service) must be nullable, or one bad recommendation call blanks the whole order.
22
+
23
+ 3. **Standardize every list as a Relay Connection, not a bare array.** Replace `orders: [Order!]!` with a connection: `orders(first: Int, after: String, last: Int, before: String): OrderConnection!`, where `OrderConnection { edges: [OrderEdge!]!, pageInfo: PageInfo! }`, `OrderEdge { node: Order!, cursor: String! }`, and `PageInfo { hasNextPage: Boolean!, hasPreviousPage: Boolean!, startCursor: String, endCursor: String }`. Cursor-based connections page correctly under inserts/deletes (each page is anchored to a real cursor, not an offset) and give you a uniform place to hang edge metadata later (e.g. `OrderEdge.addedAt`). Bare arrays can't paginate without a breaking change and force `first`/`offset` bolt-ons later. Use connections for any list that can grow unbounded; a small fixed enum-like list (a user's `roles`) can stay a plain array.
24
+
25
+ 4. **Plan for the N+1 problem before you ship — name every field that needs a DataLoader.** Any field that resolves *per parent* — `Order.customer`, `Review.author`, `Product.category` — fires its resolver once per parent row in a list, so `orders(first: 50) { customer { name } }` becomes 1 query for orders plus 50 queries for customers. For each such field, specify a **DataLoader** that batches the per-parent keys into one query (`SELECT * FROM users WHERE id = ANY($1)`) and caches within the request. Walk the schema and list, explicitly, which fields are 1:1/1:N relationship fetches that must go through a batched loader; a schema with per-parent resolvers and no DataLoader will N+1 itself to death under nested queries.
26
+
27
+ 5. **Evolve by adding fields and deprecating; never repurpose, never version the endpoint.** GraphQL evolves in place: add new fields, types, and optional arguments freely (additive changes are non-breaking because clients select only what they ask for). To retire a field, mark it `@deprecated(reason: "Use fullName instead")` and keep it resolving until usage drops to zero (check field-usage analytics), then remove. Never change an existing field's *meaning* or *type* (`price: Int` cents → `price: Float` dollars is a silent data corruption for every existing client). Never loosen a live output field from non-null to nullable or make an existing argument required, since both break clients written against the old contract; tightening an output field to non-null is type-safe for clients but widens the error-propagation blast radius, so do it only when the field genuinely always resolves. And never add a `/v2` schema: versioning the endpoint defeats the entire evolvability model.
28
+
29
+ 6. **Constrain values with custom scalars and enums; never model a fixed set as a free string.** Use `enum OrderStatus { PENDING PAID SHIPPED CANCELLED }` instead of `status: String` so invalid values are rejected at the query layer and clients get the allowed set from introspection. Define custom scalars for formatted values (`DateTime`, `EmailAddress`, `URL`, `Money`) to centralize parse/serialize/validation and document the format in one place. Reserve `ID` for opaque identifiers (it serializes as a string — don't do math on it).
30
+
31
+ 7. **Give mutations input types and a consistent payload/error shape.** Every mutation takes one `input` argument of a dedicated input type (`createOrder(input: CreateOrderInput!): CreateOrderPayload!`) — input types keep arguments cohesive and let you add optional fields without changing the signature. Return a **payload type**, not the bare entity: `CreateOrderPayload { order: Order, userErrors: [UserError!]! }`, where `userErrors` carries expected, recoverable validation failures (`{ field: ["input","email"], message: "already taken" }`) as *data* the client can render — distinct from unexpected exceptions, which belong in the top-level `errors` array. Keep this `{ entity, userErrors }` shape uniform across every mutation so clients handle errors one way.
32
+
33
+ > [!WARNING]
34
+ > Overusing non-null (`!`) is a trap, not a safety measure. When a non-null field's resolver errors, GraphQL nulls the nearest *nullable* ancestor — so one failing `User.subscription!` field can null the entire `User`, and if `User` is also non-null, it nulls *its* parent, cascading up to potentially blank the whole `data`. Model genuinely-fallible fields (anything backed by a separate service, an external API, or an optional relationship) as **nullable** so a partial failure degrades to one missing field instead of an empty response.
35
+
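The propagation rule can be made concrete with a small model. The sketch below is a simplification that ignores lists and treats a query path as `(field name, nullable)` pairs from the root down to the failing field; `null_boundary` is a name made up for the illustration.

```python
def null_boundary(path):
    """Return the path of the field that ends up null when the last field fails.

    `path` is a list of (field_name, nullable) pairs from the root to the
    failing field. An empty result means the error reached the root and the
    whole `data` entry is null.
    """
    # Walk upward from the failing field to the nearest nullable one.
    for i in range(len(path) - 1, -1, -1):
        _, nullable = path[i]
        if nullable:
            return [name for name, _ in path[: i + 1]]
    return []

# Failing field is nullable: only that field is lost.
print(null_boundary([("order", True), ("recommendedProducts", True)]))
# Failing field is non-null, parent is nullable: the parent is lost.
print(null_boundary([("user", True), ("subscription", False)]))
# Everything non-null: the whole response is lost.
print(null_boundary([("user", False), ("subscription", False)]))
```

Running candidate nullability choices through a model like this is one way to fill in the nullability table with the actual blast radius of each fallible field.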
36
+ > [!WARNING]
37
+ > A schema with per-parent resolver fields and no DataLoader will N+1 itself to death. A query like `posts(first: 100) { author { name } comments(first: 10) { edges { node { id } } } }` fans out into hundreds or thousands of individual DB queries — fast in dev with 3 rows, a query storm in production. Decide the batching plan at design time, not after the first incident: every relationship field gets a request-scoped DataLoader, no exceptions.
38
+
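To show what a request-scoped batched loader does, here is a minimal asyncio sketch. It is not the API of any particular DataLoader library; `BatchLoader`, `batch_get_users`, and the in-memory `users` dict (standing in for a users table) are assumptions for the example.

```python
import asyncio

class BatchLoader:
    """Coalesces load() calls made in the same event-loop tick into one
    batch_fn call, and caches results by key for the life of the loader."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn   # async: list of keys -> list of values, same order
        self._cache = {}            # key -> Future (create one loader per request)
        self._pending = []          # keys waiting for the next dispatch
        self._tasks = set()         # keep references so tasks are not collected

    def load(self, key):
        if key not in self._cache:
            loop = asyncio.get_running_loop()
            self._cache[key] = loop.create_future()
            if not self._pending:
                # First new key of this tick: schedule a single dispatch.
                task = loop.create_task(self._dispatch())
                self._tasks.add(task)
                task.add_done_callback(self._tasks.discard)
            self._pending.append(key)
        return self._cache[key]

    async def _dispatch(self):
        keys, self._pending = self._pending, []
        try:
            values = await self._batch_fn(keys)
        except Exception as exc:
            for key in keys:
                self._cache[key].set_exception(exc)
            return
        for key, value in zip(keys, values):
            self._cache[key].set_result(value)

async def demo():
    calls = []
    users = {i: {"id": i, "name": f"user-{i}"} for i in range(1, 6)}

    async def batch_get_users(ids):
        calls.append(list(ids))     # one entry per simulated DB round trip
        return [users.get(i) for i in ids]

    loader = BatchLoader(batch_get_users)
    orders = [{"id": n, "customer_id": (n % 5) + 1} for n in range(50)]

    async def resolve_customer(order):
        return await loader.load(order["customer_id"])

    customers = await asyncio.gather(*(resolve_customer(o) for o in orders))
    return calls, customers
```

In `demo()`, fifty `Order.customer` resolutions collapse into one batched call carrying five distinct keys, which is the behavior the batching plan should specify per field.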
39
+ > [!NOTE]
40
+ > Connections are worth the boilerplate even for lists that "will never be large," because there is no non-breaking path from `[T!]!` to a paginated connection later — clients have already coded against the array. If a list is truly bounded and fixed (status flags, a handful of roles), a plain list is fine; everything user-generated or growth-prone starts as a connection.
41
+
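The claim that cursors page correctly under inserts and deletes can be seen in a small forward-only sketch. It assumes an in-memory list sorted by `id` and an opaque base64 cursor format (`order:<id>`) chosen for the example; `paginate` and the cursor helpers are hypothetical names, and `hasPreviousPage` is simply reported as false.

```python
import base64

def encode_cursor(order_id):
    """Opaque cursor; the 'order:<id>' payload is an assumed format."""
    return base64.urlsafe_b64encode(f"order:{order_id}".encode()).decode()

def decode_cursor(cursor):
    payload = base64.urlsafe_b64decode(cursor.encode()).decode()
    return int(payload.split(":", 1)[1])

def paginate(orders, first, after=None):
    """Forward pagination over orders sorted by ascending id."""
    after_id = decode_cursor(after) if after is not None else None
    # Anchor on the cursor's id, not on a row offset.
    remaining = [o for o in orders if after_id is None or o["id"] > after_id]
    page = remaining[:first]
    edges = [{"node": o, "cursor": encode_cursor(o["id"])} for o in page]
    return {
        "edges": edges,
        "pageInfo": {
            "hasNextPage": len(remaining) > first,
            "hasPreviousPage": False,  # simplification: forward-only sketch
            "startCursor": edges[0]["cursor"] if edges else None,
            "endCursor": edges[-1]["cursor"] if edges else None,
        },
    }

orders = [{"id": i} for i in range(1, 6)]
page1 = paginate(orders, first=2)
# A row from the first page is deleted before the client asks for page two.
orders = [o for o in orders if o["id"] != 1]
page2 = paginate(orders, first=2, after=page1["pageInfo"]["endCursor"])
print([e["node"]["id"] for e in page2["edges"]])
```

The second page still starts at id 3. An offset-based page two (`offset: 2`) over the same mutated list would start at id 4 and silently skip a row.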
42
+ ## Output
43
+
44
+ The deliverable is a designed SDL plus the decisions behind it:
45
+
46
+ - **The SDL** — object types and their relationship fields (edges), `enum`s and custom `scalar`s for constrained values, Relay **connection** types for every unbounded list (`*Connection` / `*Edge` / `PageInfo`), and mutations as `input`-arg + `*Payload` (with `userErrors`) pairs.
47
+ - **The nullability decisions** — a short table of the non-obvious fields marked nullable vs non-null, each with its rationale (this field can fail downstream → nullable; this field always resolves → non-null), so reviewers see the error-propagation reasoning.
48
+ - **The pagination decisions** — which lists became connections vs stayed plain arrays, and why.
49
+ - **The DataLoader / batching plan** — the explicit list of per-parent relationship fields (`Type.field`) that must resolve through a request-scoped batched loader, with the batch key and the batched query for each, so the schema doesn't N+1 under nested queries.
@@ -0,0 +1,40 @@
1
+ ---
2
+ name: "hallucination-evaluator"
3
+ description: "Detect and measure ungroundedness in LLM and RAG outputs — claims the source doesn't support — by decomposing answers into atomic claims and checking each for entailment, so you can quantify faithfulness and gate on it instead of eyeballing it. Use when a RAG/LLM feature makes confident wrong claims, before shipping anything that must be factual, or to add a groundedness gate to evals/CI."
4
+ allowed-tools: "Read, Grep, Glob, Bash"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ "It sounds confident" is not "it's correct." A RAG or grounded-generation feature can produce fluent, authoritative prose that the retrieved source never supports — and fluency is uncorrelated with faithfulness, so you cannot eyeball it. This skill makes hallucination measurable: it defines the standard precisely, decomposes each answer into atomic claims, checks each claim for entailment against the source, builds a labeled eval set that includes the should-abstain cases, splits retrieval failures from generation failures, and produces a groundedness score you can gate releases on.
9
+
10
+ ## When to use this skill
11
+ - A RAG/LLM feature is making confident claims that turn out to be wrong, and you can't tell how often.
12
+ - Before shipping anything that must be factual — support answers, summaries of provided docs, extraction over a source.
13
+ - You want a groundedness gate in evals/CI so a regression in faithfulness blocks the release instead of surfacing in production.
14
+ - A summary, citation, or "based on the document…" answer is adding facts the document doesn't contain.
15
+ - You need to know *why* it's wrong — bad retrieval vs. the model ignoring good retrieval — because the fix differs.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Define the standard precisely — faithfulness, not world-truth.** In RAG/grounded generation, a hallucination is a claim **not entailed by the retrieved context (the source you gave the model)**. This is *faithfulness*, and it is distinct from *factual accuracy against the world*. A claim can be true in reality but unfaithful (the source never said it), and faithful but false (the source itself was wrong). You grade **faithfulness to the source, because that is checkable**; open-world truth is not checkable here and conflating the two makes the eval incoherent. State which one you're measuring in writing before you score anything.
20
+
21
+ 2. **Decompose each answer into atomic claims.** A claim is a single, independently checkable assertion ("The policy refund window is 30 days"). Split compound sentences, drop hedges and meta-commentary, and keep pronoun referents resolved so each claim stands alone. Score faithfulness *per claim*, not per answer: an answer that decomposes into four claims, one of them unsupported, is 75% grounded, and that granularity is what lets you find the specific failure.
22
+
23
+ 3. **Check each claim for entailment against the source.** For each atomic claim, label it `supported` / `not_supported` / `contradicted` using one of two checkers: (a) an NLI/entailment model (premise = the retrieved chunks, hypothesis = the claim), or (b) an **LLM-judge with the source in its context** — for the judge, default to the latest, most capable Claude model (`claude-opus-4-8`, or `claude-fable-5` for the hardest cases). Pin the judge to faithfulness: *"Using ONLY the provided source, is this claim supported? Quote the supporting span or answer not_supported. Do not use outside knowledge."* The judge grades entailment, which is checkable — never open-world truth.
24
+
25
+ 4. **Build a labeled eval set that includes the should-abstain cases.** Collect (question, retrieved context, answer) triples and hand-label the grounded/ungrounded claims. Crucially, include questions whose answer **is not in the context** — there the correct behavior is to abstain ("I don't know" / "the source doesn't say"), and answering anyway is the exact hallucination you most want to catch. An eval set without should-abstain cases will pass a model that confidently invents answers whenever retrieval comes up empty.
26
+
27
+ 5. **Split retrieval failure from generation failure.** For every ungrounded answer, ask: *was the correct answer present in what was retrieved?* If **no** → retrieval failure (the answer wasn't in the context → fix retrieval: chunking, embeddings, top-k, reranking). If **yes, but the model ignored or contradicted it** → generation failure (fix the prompt/model: cite-or-abstain instructions, a stronger model, lower the room to improvise). Report the two rates separately — they have different owners and different fixes, and a single "hallucination rate" hides which lever to pull.
28
+
29
+ 6. **Report a groundedness score and gate on it.** Compute groundedness = supported claims / total claims across the eval set, plus an abstention-accuracy number on the should-abstain subset. Attach concrete failing examples (claim + the source span it contradicts or the absence of any span). Set a threshold and wire it into CI so a drop blocks the release. Re-run the same fixed eval set on every prompt/retrieval/model change.
30
+
31
+ 7. **Reduce it, then re-measure.** Apply the levers the split points to: grounding prompts (**cite-or-abstain** — "answer only from the source; if it's not there, say so"), require an inline citation/verbatim quote per claim (a claim that can't be quoted is the one to suspect), and retrieval improvements for the retrieval-failure share. After each change, re-run the eval — don't trust that a prompt tweak helped; show the score moved.
32
+
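The scoring in steps 5 and 6 reduces to a few lines of arithmetic once claims are labeled. The sketch below assumes labels have already been produced by the checker; the function names and the 0.9 thresholds are placeholders, not recommended values.

```python
from collections import Counter

SUPPORTED, NOT_SUPPORTED, CONTRADICTED = "supported", "not_supported", "contradicted"

def groundedness(labels):
    """Supported claims / total claims, over per-claim labels."""
    if not labels:
        raise ValueError("no claims to score")
    return sum(1 for label in labels if label == SUPPORTED) / len(labels)

def abstention_accuracy(did_abstain):
    """Share of should-abstain questions where the model actually abstained.

    `did_abstain` holds one bool per question whose answer is not in the context.
    """
    if not did_abstain:
        raise ValueError("eval set has no should-abstain cases")
    return sum(did_abstain) / len(did_abstain)

def failure_kind(answer_in_context):
    """Classify one ungrounded answer by whether retrieval had the answer."""
    return "generation_failure" if answer_in_context else "retrieval_failure"

def gate(score, abstain_acc, min_score=0.9, min_abstain=0.9):
    """True if the release may proceed; thresholds are placeholder values."""
    return score >= min_score and abstain_acc >= min_abstain

# Four claims, one unsupported.
labels = [SUPPORTED, SUPPORTED, SUPPORTED, NOT_SUPPORTED]
score = groundedness(labels)
# Three ungrounded answers: the answer was in the context for only one of them.
split = Counter(failure_kind(in_ctx) for in_ctx in [False, False, True])
print(score, dict(split), gate(score, abstention_accuracy([True, True])))
```

Keeping the two failure counts separate in the report is what lets the retrieval and prompt owners each see their own share.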
33
+ > [!WARNING]
34
+ > Confidence and fluency are uncorrelated with faithfulness. The most dangerous hallucinations are the ones that read most authoritatively, so you must check claims against the source span by span — never grade on how convincing the answer sounds.
35
+
36
+ > [!WARNING]
37
+ > Do not let the faithfulness judge use outside knowledge. If it "knows" a claim is true and marks it supported even though the source never says it, you're now measuring world-truth (not measurable here) instead of groundedness (measurable) — and the eval becomes incoherent. The instruction "use ONLY the provided source" is load-bearing; verify the judge actually abstains when the source is silent.
38
+
39
+ ## Output
40
+ A faithfulness eval report containing: (1) the eval method — atomic-claim decomposition + the entailment/LLM-judge checker, with the exact judge prompt; (2) the labeled eval set, explicitly including should-abstain cases (answer-not-in-context); (3) per-answer results split into retrieval-failure vs. generation-failure, with separate rates; and (4) the groundedness score (supported claims / total) plus abstention accuracy, concrete failing examples with the offending source spans, and a CI gate threshold. Reproducible: same eval set, same judge model, re-runnable on every change.
@@ -0,0 +1,81 @@
1
+ ---
2
+ name: "integration-test-designer"
3
+ description: "Design integration tests that exercise components against REAL collaborators — actual database, queue, HTTP boundary — at a deliberately chosen seam, instead of a unit suite that mocks everything or a slow flaky full E2E. Use when bugs slip past green unit tests, when wiring or contracts between layers break in production, or when a mocked DB test passes but the real query/migration/serialization fails."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ A unit suite that mocks the database, the queue, and the HTTP client proves your mocks are configured the way you configured them — it never runs your actual SQL, your migrations, your serialization, or the wiring between layers. That's exactly where bugs slip into production. A full E2E suite catches them but is too slow and flaky to gate merges. This skill designs the layer in between: an integration test that drives a deliberately chosen *slice* of the system through its real boundaries — a real database, a real broker, a real HTTP framework — while stubbing only the genuinely uncontrollable third parties. The deliverable is the chosen seam, an explicit real-vs-stubbed split, an ephemeral-infrastructure plus per-test data-isolation setup, and representative tests that assert on observable outcomes.
9
+
10
+ ## When to use this skill
11
+
12
+ - A bug shipped despite a green unit suite because the suite mocked the very collaborator that broke — a wrong column name, a missing migration, a JSON field that serializes differently than the mock returned.
13
+ - The wiring or contract *between* layers fails (handler doesn't pass the tenant id to the repo; a queue message round-trips with the wrong shape) and no test exercises the layers together.
14
+ - The E2E suite is too slow or flaky to run on every PR, so cross-layer regressions are caught late, in staging or prod.
15
+ - You're standing up a new service and want a fast, real-infrastructure test for the persistence/messaging path before there's anything to E2E.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Choose the seam deliberately — name what's inside the slice and what's outside.** Don't test "the whole app" and don't test one function; pick a coherent slice with real boundaries: handler→service→repository→**real DB**, or producer→**real broker**→consumer, or service→**real HTTP** of your own framework. State the entry point (the call that drives the test) and the exit boundary (the real collaborator whose effect you assert). Everything between them runs for real, unmocked; that is the integration you're proving works.
20
+ 2. **Use REAL infrastructure via ephemeral instances — not a mock of it.** Run the actual database, broker, or cache the slice talks to, spun up disposably: **Testcontainers** (a throwaway Postgres/MySQL/Kafka/Redis container per suite), a disposable Docker service, an in-process real engine (embedded Postgres, an in-memory SQLite *only if prod is SQLite*), or a local broker (an embedded Kafka/Redpanda, LocalStack for SQS). Run your real migrations against it on startup. A mocked DB test proves the mock returns what you told it to; only a real instance proves your query compiles, your migration applied, and your row maps back to your object.
21
+ 3. **Stub ONLY the truly external and uncontrollable.** Third parties you don't own and can't run locally — a payment processor, an email/SMS gateway, a partner API, a clock, a random source — get stubbed (or pointed at a fake server like WireMock / a captured-fixture HTTP mock). Drawing the line here, not at your own DB/queue, is the whole discipline: stub what you can't control or can't make deterministic; run everything you own for real.
22
+ 4. **Make every test hermetic and isolated — own your data, depend on no other test.** The top source of integration flake is shared mutable state across tests. Pick one isolation strategy and hold it: **transaction-per-test** (open a transaction in setup, run the test, roll back in teardown — fastest, but breaks if the code under test commits or needs its own connection); **unique data per test** (every row keyed by a per-test tenant/run id so concurrent tests never collide); or **truncate/reset between tests** (clean tables in teardown — simplest, slower). Each test seeds exactly the data it reads. No test may rely on data left by another or on running in a particular order.
23
+ 5. **Pay the slow cost once, not per test.** Starting a container or applying migrations takes seconds; doing it per test makes the suite unrunnable. Spin infra up **once per suite/session** (a session-scoped fixture: `pytest` session fixture, JUnit `@Container static`, a global setup) and reuse it; reset only the *data* between tests (step 4), which takes milliseconds. Keep the integration suite a separate, taggable target from the unit suite so it can run on its own cadence and developers still get a fast unit loop.
24
+ 6. **Assert observable outcomes, not internal calls.** Verify what actually happened at the real boundary: the row that now exists in the DB (query it back), the HTTP status and body the handler returned, the message that landed on the queue, the record that did *not* get written on a rollback path. Do not assert `repository.save was called once` — that's a mock-interaction check masquerading as integration coverage, and it passes even when the save silently failed. Cover the failure and edge paths too (constraint violation, conflicting concurrent write, retry on a dropped message), because those are precisely what unit mocks can't reproduce.
25
+
26
+ > [!WARNING]
27
+ > Mocking the database or queue inside an "integration" test defeats the entire purpose — you are testing the mock's configuration, not the integration. A `when(repo.find(...)).thenReturn(...)` test never runs your SQL, never catches a renamed column, a broken migration, or a NULL-handling bug. If the collaborator is yours to run, run a real ephemeral instance; if it isn't yours (a payment API), that's a stub *and a separate contract test* — see `contract-test-designer`.
28
+
29
+ > [!WARNING]
30
+ > Integration tests that share one database without per-test isolation become order-dependent and flaky: a test passes alone, fails in the suite, and fails differently in parallel, because it sees rows another test wrote (or expected rows another test deleted). Isolate data per test (transaction rollback or a per-test run id) before adding more tests, or the flake compounds until the suite gets disabled.
31
+
32
+ ## Output
33
+
34
+ For the chosen slice, the skill produces:
35
+
36
+ - **The seam** — the entry point that drives the test and the exit boundary whose effect is asserted, with everything in between named as in-slice (real).
37
+ - **Real vs. stubbed, with the reason** — a short table: each collaborator marked REAL (ephemeral instance, how it's provisioned) or STUBBED (why it's uncontrollable, what fake stands in).
38
+ - **The infra + isolation setup** — how the real instance is spun up once per suite (Testcontainers / disposable service / embedded engine), how migrations are applied, and the per-test data-isolation strategy (transaction rollback / unique run id / truncate).
39
+ - **Representative tests** — happy path plus the failure/edge paths mocks can't reach, each asserting an observable outcome at the real boundary.
40
+
41
+ Example — a service+repository slice against a real Postgres, in Python (pytest + Testcontainers), data isolated by transaction rollback:
42
+
43
+ ```python
44
+ import pytest
45
+ from testcontainers.postgres import PostgresContainer
46
+ from sqlalchemy import create_engine, text
47
+ from app.orders import OrderService # entry point of the slice
48
+
49
+ # Spin the REAL database ONCE per session, run real migrations against it.
50
+ @pytest.fixture(scope="session")
51
+ def engine():
52
+     with PostgresContainer("postgres:16") as pg:
53
+         eng = create_engine(pg.get_connection_url())
54
+         run_migrations(eng)  # your project's actual migrations (runner not shown), not a hand-built schema
55
+         yield eng
56
+
57
+ # Isolate every test: open a transaction, hand it to the service, roll back after.
58
+ @pytest.fixture
59
+ def db(engine):
60
+     with engine.connect() as conn:  # connection is closed after each test
61
+         tx = conn.begin()
62
+         yield conn
63
+         tx.rollback()  # nothing persists; tests can't see each other's rows
64
+
65
+ def test_place_order_persists_row(db):
66
+     svc = OrderService(db)  # real service -> real repository -> real Postgres
67
+     order_id = svc.place_order(sku="widget", qty=3)
68
+     # Assert the OBSERVABLE outcome: the row exists with the right state.
69
+     row = db.execute(text("SELECT qty, status FROM orders WHERE id = :id"),
70
+                      {"id": order_id}).one()
71
+     assert (row.qty, row.status) == (3, "open")
72
+
73
+ def test_place_order_rejects_negative_qty_and_writes_nothing(db):
74
+     svc = OrderService(db)
75
+     with pytest.raises(ValueError):
76
+         svc.place_order(sku="widget", qty=-1)  # path a mocked repo would never exercise
77
+     count = db.execute(text("SELECT count(*) FROM orders")).scalar()
78
+     assert count == 0  # the failed write left no partial row
79
+ ```
80
+
81
+ The negative-qty test is the kind a mocked repository can't reach — it proves the real `CHECK`/validation prevents a partial write, against the real schema. Hand the seam to `test-scaffolder` to flesh out the remaining paths, use `mock-data-factory` to build the per-test seed data, and for the third parties you stubbed here, write a `contract-test-designer` test so their real shape stays pinned.
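
Step 4 also lists unique data per test as an alternative to transaction rollback. The following is a minimal sketch of that strategy, not part of the skill itself. It assumes a hypothetical `orders` table with a `run_id` column, and `make_db`, `new_run_id`, `place_order`, and `count_orders` are illustrative names. The standard library's `sqlite3` stands in for the ephemeral Postgres only so the snippet runs without a container.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Stand-in for the shared, suite-wide database (sqlite3 only to keep the sketch runnable)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, run_id TEXT NOT NULL, "
        "sku TEXT NOT NULL, qty INTEGER NOT NULL)"
    )
    return conn


def new_run_id() -> str:
    """A fresh key per test, so rows written by different tests never overlap."""
    return uuid.uuid4().hex


def place_order(conn: sqlite3.Connection, run_id: str, sku: str, qty: int) -> int:
    """Hypothetical write path: every row is tagged with the calling test's run id."""
    cur = conn.execute(
        "INSERT INTO orders (run_id, sku, qty) VALUES (?, ?, ?)", (run_id, sku, qty)
    )
    return cur.lastrowid


def count_orders(conn: sqlite3.Connection, run_id: str) -> int:
    """Assertions are scoped to the run id, never to the whole table."""
    row = conn.execute(
        "SELECT count(*) FROM orders WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0]
```

Because every query filters on the run id, two tests can share one database and even run concurrently without seeing each other's rows. The trade-off is that the code under test must carry the key (here a run id, in a real system often a tenant id) through every read and write.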
@@ -0,0 +1,39 @@
1
+ ---
2
+ name: "model-router-designer"
3
+ description: "Design a model router that sends each LLM request to the cheapest model that can handle it and escalates only the hard cases to the strongest — cutting cost and latency without tanking quality, gated by an eval set so the savings don't come from silently worse answers. Use when one expensive model serves all traffic (most of it easy), when LLM cost or latency is too high, or when balancing quality against spend across a range of request difficulty."
4
+ allowed-tools: "Read, Grep, Glob, Edit"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ Serving 100% of traffic with your most capable model means paying frontier prices for the 70% of requests a smaller model would have nailed. A model router fixes that by routing each request to the cheapest model that can handle it and escalating only the genuinely hard cases — but routed blind, it trades cost for silent quality regressions on exactly the requests that needed the strong tier. This skill designs the router as a measured system: segment the traffic, pick the cheapest signal that separates it, build an escalation path for the misses, and gate the whole thing on an eval set so you can prove the savings are real.
9
+
10
+ ## When to use this skill
11
+ - One expensive model answers all requests and most of them are obviously easy (lookups, formatting, short classifications) — you're overpaying on the majority.
12
+ - LLM cost or p95 latency is too high and you want to shed both without a blanket model downgrade that would hurt the hard cases.
13
+ - Traffic spans a real difficulty range — trivial extraction up through multi-step reasoning — and you want to spend strong-model budget only where it changes the answer.
14
+ - You already tried "just use the cheaper model everywhere" and quality dropped on the hard tail.
15
+
16
+ > [!NOTE]
17
+ > Routing only pays off when a meaningful share of traffic is genuinely easy. If nearly every request needs the strong model, a router adds decision cost and complexity for almost no saving — segment first (step 1) and confirm the easy slice exists before building anything.
18
+
19
+ ## Instructions
20
+ 1. **Segment the traffic by difficulty before touching code.** Pull a representative sample of real requests (or read the logs/handlers with `Grep`/`Glob`) and bucket them into three tiers: (a) **mechanical** — classification, extraction, fixed-format transforms, short factual lookups; (b) **moderate** — straightforward Q&A, summarization, single-step reasoning; (c) **hard** — multi-step reasoning, code generation, ambiguous or long-context tasks. Estimate the volume share of each. If tier (a)+(b) isn't a sizable fraction, stop — the router won't earn its keep. This split is the spec for everything downstream.
21
+ 2. **Pick the cheapest routing signal that separates the tiers — in this order.** Reach for the lowest-cost signal that works and stop there: (1) **free heuristics** — the task type/endpoint the request came through, input token length, a required-capability flag (needs JSON mode, needs tools, needs vision, needs long context), presence of code; (2) **a lightweight classifier** — a small fast model or a trained text classifier that labels difficulty, when heuristics can't cleanly separate; (3) **an LLM-based router** — only when neither of the above can tell easy from hard. The router runs on every request, so its cost and latency are pure overhead — never let the router cost more than it saves.
22
+ 3. **Set explicit thresholds, not vibes.** Turn the signal into concrete rules: e.g. *length < 500 tokens AND task ∈ {classify, extract} → cheap tier*; *needs-tools OR length > 8k tokens → strong tier*. Write the thresholds down with the segmentation they came from so they're auditable and tunable, not buried in an `if`-ladder no one can reason about.
23
+ 4. **Design the escalation/fallback cascade so easy wins stay cheap and hard cases still get quality.** Default-route to the cheap tier, then run a **validation check** on its output — a confidence signal, a schema/format validation, a "did it actually answer / did it say it's unsure" check, or a cheap self-grade. On failure, **retry the same request on the strong tier** (a cascade). This way the easy majority is served at cheap-tier price in one hop, and only the cases the cheap model fumbles pay for the strong model — capturing most of the saving without eating the quality hit. Decide the validation check per task: structured outputs get schema validation for free; open-ended generation needs a confidence or self-grade signal.
24
+ 5. **Choose the tiers concretely.** Default the **strong tier** to the latest, most capable Claude model (`claude-opus-4-8`) and the **cheap tier** to a smaller, faster model (`claude-haiku-4-5`); a mid model (`claude-sonnet-4-6`) is a reasonable middle rung if a two-step cascade leaves a gap. Use exact model ID strings — never construct or date-suffix them. Add **always-route-strong guardrails** for high-stakes paths (anything irreversible, safety-relevant, or where a wrong answer is expensive) regardless of what the signal says.
25
+ 6. **Measure the trade with an eval set — per route, not just in aggregate.** Build (or reuse) a labeled eval set spanning all three difficulty tiers and score three things on every route: **cost**, **latency**, and a **quality metric** (task accuracy, schema-valid rate, judge score — whatever fits the task). Track cheap-route quality, strong-route quality, escalation rate, and the blended numbers separately. The router is only a win if blended cost and latency drop *and* cheap-route quality stays above your bar. If cheap-route quality sags, tighten the threshold or move that segment to the strong tier.
26
+
27
+ > [!WARNING]
28
+ > Routing too much to the cheap model silently degrades quality on the cases that needed the strong one — and aggregate metrics hide it because the easy majority looks fine. Never route blind: gate every threshold change against the per-route eval set and keep the escalation check honest. A router with no quality measurement is just a quality regression you haven't noticed yet.
29
+
30
+ > [!WARNING]
31
+ > An LLM-as-router adds its own latency and token cost on EVERY request, including the easy ones a heuristic would have caught for free. If a task-type check, an input-length cutoff, or a small classifier separates the traffic, use that — reserve the LLM router for the cases where simpler signals genuinely can't, and confirm it still nets a saving end to end.
32
+
33
+ ## Output
34
+ A model-routing design, written down so it's tunable:
35
+ - **Difficulty segmentation** — the three tiers with their defining traits and estimated volume share, plus the go/no-go call on whether a router is worth building.
36
+ - **Routing signal + thresholds** — which signal (heuristic / small classifier / LLM router) and why it's the cheapest that works, with the concrete cutoff rules and the segmentation they derive from.
37
+ - **Escalation/fallback cascade** — the default cheap route, the validation check per task type, and the retry-on-strong path, including any always-route-strong guardrails for high-stakes requests.
38
+ - **Tier choice** — the strong and cheap model IDs (default `claude-opus-4-8` / `claude-haiku-4-5`, optional `claude-sonnet-4-6` middle rung) and the rationale.
39
+ - **Validation metrics** — the eval set composition and the per-route cost / latency / quality numbers (with escalation rate) that prove the router cut spend and latency without dropping quality below the bar.
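
The cascade from steps 3 to 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's implementation: `Request`, `first_hop`, and `route` are hypothetical names, `call_model` and `is_valid` are injected placeholders for a real model client and a per-task validation check, and the 8k-token cutoff is the example threshold from step 3 rather than a tuned value.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CHEAP = "claude-haiku-4-5"   # cheap tier, model ID as given in step 5
STRONG = "claude-opus-4-8"   # strong tier, model ID as given in step 5


@dataclass(frozen=True)
class Request:
    """Hypothetical request shape carrying only the free routing signals."""
    task: str
    input_tokens: int
    needs_tools: bool = False
    high_stakes: bool = False


def first_hop(req: Request) -> str:
    """Free heuristics: guardrails and capability flags go strong, the rest start cheap."""
    if req.high_stakes or req.needs_tools or req.input_tokens > 8000:
        return STRONG
    return CHEAP


def route(
    req: Request,
    call_model: Callable[[str, Request], str],
    is_valid: Callable[[str], bool],
) -> Tuple[str, str, bool]:
    """Return (model_used, answer, escalated) for one request."""
    model = first_hop(req)
    answer = call_model(model, req)
    # A strong-tier answer is final; a cheap-tier answer must pass validation.
    if model == STRONG or is_valid(answer):
        return model, answer, False
    # Cascade: retry the same request on the strong tier.
    return STRONG, call_model(STRONG, req), True
```

Returning the `escalated` flag alongside the answer makes the escalation rate from step 6 easy to log per route, which is what keeps the threshold tuning measurable.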
@@ -0,0 +1,84 @@
1
+ ---
2
+ name: "onboarding-guide-writer"
3
+ description: "Write a developer onboarding guide that gets a new contributor from clone to first merged change fast — a verified golden path, a quick architecture map, the real workflow conventions, and the gotchas that live only in senior engineers' heads. Use when a repo has no onboarding doc, when new hires keep asking the same setup questions, or when the README is a marketing page instead of a contributor guide."
4
+ allowed-tools: "Read, Grep, Glob, Write"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ Write the doc a new contributor opens on day one and uses to ship their first change by lunch. The center of gravity is the **golden path**: the exact, copy-pasteable sequence from `git clone` to a trivial verified change — every command grounded in the repo's real scripts and tooling, not invented `make` targets. Around it sit a quick architecture map (where to look, not a spec), the workflow conventions that gate a PR, and the troubleshooting that currently lives only in tribal knowledge. Deeper material is linked, never duplicated, so the guide stays true as the code moves.
9
+
10
+ ## When to use this skill
11
+
12
+ - A repo has no onboarding/CONTRIBUTING doc and new contributors reverse-engineer setup from CI configs and Slack threads.
13
+ - New hires repeatedly ask the same setup questions (which Node version, what env vars, why does the build fail the first time).
14
+ - The README is marketing prose — what the product does — rather than how a developer runs and contributes to it.
15
+ - Onboarding currently means a senior engineer pairing for two hours to get someone to a passing test suite.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Reconstruct the golden path from real tooling — verify every command exists.** Read the manifest that exists (`package.json` scripts/`engines`, `Makefile` targets, `pyproject.toml`, `go.mod`, `Justfile`, `Taskfile.yml`) and the lockfile to pick the package manager. Read CI config (`.github/workflows/*.yml`, `.gitlab-ci.yml`) — CI is the ground truth for the steps that actually pass. Build the path in execution order: clone → install deps → set up env/config → run locally → run tests → make a trivial change and verify it. Quote each command verbatim from a script that exists; if a step has no backing script, say so explicitly rather than inventing one.
20
+ 2. **Surface the prerequisites a fresh machine actually needs.** Pin the runtime version (from `engines`, `.nvmrc`, `.tool-versions`, `go.mod`, `python_requires`) and any system deps (a database, Docker, a specific package manager). List them before the install step — a clone that fails on a missing Postgres is the most common day-one wall.
21
+ 3. **Handle env and config concretely.** Find `.env.example` / `.env.sample` / `config.example.*`. Tell the contributor to copy it (`cp .env.example .env`) and call out which variables must be filled to run locally versus which have working defaults. Name the ones that need a secret or a teammate to provide — that is the question that otherwise hits Slack.
22
+ 4. **Prove the setup with a trivial verified change.** End the golden path with a concrete, reversible first change — flip a string, add a log line, fix a typo — then the exact command that confirms it (the dev server reloads, a test passes, the page shows the new text). This is what turns "I think it's set up" into "it works." Don't skip it: it's the difference between an install guide and an onboarding guide.
23
+ 5. **Write a brief architecture orientation — a map, not a spec.** Glob the top-level layout and name where the entry points are, how the main pieces fit (request → handler → data, or CLI → command → core), and where a newcomer should look first for a given task. Then list the **3–5 things that would surprise a newcomer**: the non-obvious build step, the directory that isn't what its name implies, the generated file you must never hand-edit. Keep it to a screen; point to deeper design docs for the rest.
24
+ 6. **Document the real workflow conventions.** Extract them from evidence, not assumption: branch naming (from existing branches / contributing notes), commit and PR style (from `.gitmessage`, PR template, recent history), how to run lint and typecheck (the real script names), and how CI gates a PR (which checks are required, from the workflow files). A contributor needs to know what will block their merge before they open the PR, not after.
25
+ 7. **Capture the tribal-knowledge gotchas and troubleshooting.** Write down the fixes that live in senior engineers' heads: the first build that fails until you run a generate step, the test that's flaky on certain OSes, the port that conflicts, the cache you clear when things go weird. Format as symptom → fix so a stuck contributor can scan to their error.
26
+ 8. **Link to deeper docs instead of duplicating them.** For anything with a canonical home — full architecture docs, API reference, ADRs, deployment runbooks — link to it in one line. Duplicated detail is detail that will silently go stale; a link stays correct or visibly 404s.
27
+ 9. **Order for action and skim.** Golden path first (it's what they need in the next five minutes), then architecture, conventions, troubleshooting, links. Lead each section with the action. Save it as `CONTRIBUTING.md` or `docs/onboarding.md` per the repo's convention, and report which commands you verified against real scripts and which you flagged as unverified.
28
+
29
+ > [!WARNING]
30
+ > An onboarding guide whose setup commands don't actually work is worse than no guide — it burns the new contributor's trust on day one and makes them distrust every other line in the doc. Verify each command against a script that exists in the repo. Never paste a `make dev` or `npm run setup` you haven't confirmed.
31
+
32
+ > [!WARNING]
33
+ > Do not re-explain the architecture in depth here. Detailed design that belongs in code comments, ADRs, or a design doc is guaranteed to drift once it's copied into onboarding. Give the orientation map and link to the canonical source.
34
+
35
+ ## Output
36
+
37
+ A drop-in `CONTRIBUTING.md` (or `docs/onboarding.md`), structured for action:
38
+
39
+ ````md
40
+ # Contributing
41
+
42
+ ## Golden path: clone → first change
43
+
44
+ **Prerequisites:** Node 20 (`.nvmrc`), pnpm 9, Docker (for the local DB).
45
+
46
+ ```bash
47
+ git clone git@github.com:acme/taskflow.git && cd taskflow
48
+ pnpm install # lockfile: pnpm-lock.yaml
49
+ cp .env.example .env # fill DATABASE_URL — ask #eng for the dev value
50
+ docker compose up -d db # local Postgres on :5432
51
+ pnpm db:migrate # apply schema
52
+ pnpm dev # http://localhost:3000
53
+ pnpm test # vitest — should be all green before you start
54
+ ```
55
+
56
+ **Your first change:** edit the heading in `src/app/page.tsx`, save —
57
+ the dev server hot-reloads and the new text shows at `localhost:3000`.
58
+ That confirms your setup end to end.
59
+
60
+ ## How the code fits
61
+ - Entry points: `src/app/` (routes), `src/server/` (API handlers), `prisma/` (schema).
62
+ - Flow: route → handler in `src/server/` → Prisma → Postgres.
63
+ - Surprises for newcomers:
64
+   - `pnpm db:generate` must run after editing `prisma/schema.prisma`; the client is generated, never hand-edited.
65
+   - `src/lib/legacy/` is frozen; new code goes in `src/lib/`.
66
+   - The first `pnpm build` after install fails unless `pnpm db:generate` has run.
67
+
68
+ ## Workflow
69
+ - Branch: `feat/<short-desc>` or `fix/<short-desc>` off `main`.
70
+ - Commits: Conventional Commits (`.gitmessage`); PRs use the template.
71
+ - Before pushing: `pnpm lint && pnpm typecheck`.
72
+ - CI gates merge on: lint, typecheck, `vitest`, and a preview deploy.
73
+
74
+ ## Troubleshooting
75
+ - `ECONNREFUSED 5432` → the local DB container isn't running; start it with `docker compose up -d db`.
76
+ - `Prisma client not generated` → `pnpm db:generate`.
77
+ - Port 3000 in use → `pnpm dev -- --port 3001`.
78
+
79
+ ## Deeper docs
80
+ - Architecture & design decisions → `docs/architecture.md`, `docs/adr/`
81
+ - Deploy & on-call → `docs/runbooks/`
82
+ ````
83
+
84
+ Every command above is quoted from a real script; the report lists exactly which were verified against the repo and which (if any) were flagged unverified for the maintainer to confirm.
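
One way to back the "verified against real scripts" claim is a small check that compares the scripts a guide invokes with the ones `package.json` defines. The sketch below is an assumption-laden illustration: `unverified_scripts` is a hypothetical helper, it only understands `pnpm <script>` and `pnpm run <script>` invocations, and the list of built-in pnpm commands is deliberately partial.

```python
import json
import re
from typing import List

# Partial list of pnpm's own commands, which are not package.json scripts.
PNPM_BUILTINS = {"install", "add", "remove", "update", "run", "exec", "dlx"}

# Captures the script name in `pnpm <script>` or `pnpm run <script>`.
# The name must start with a letter, so "pnpm 9" (a version) is not matched.
SCRIPT_CALL = re.compile(r"\bpnpm\s+(?:run\s+)?([A-Za-z][A-Za-z0-9:_-]*)")


def unverified_scripts(guide_text: str, package_json_text: str) -> List[str]:
    """Return script names the guide invokes that package.json does not define."""
    scripts = json.loads(package_json_text).get("scripts", {})
    missing: List[str] = []
    for name in SCRIPT_CALL.findall(guide_text):
        if name in PNPM_BUILTINS or name in scripts or name in missing:
            continue
        missing.append(name)
    return missing
```

Feeding it only the fenced command blocks of the guide, rather than the whole document, avoids false matches on prose that happens to mention pnpm. Anything it returns belongs in the "flagged unverified" part of the report.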
@@ -0,0 +1,82 @@
1
+ ---
2
+ name: "rbac-designer"
3
+ description: "Design the authorization model itself — fine-grained permissions on resources composed into roles, with the right amount of resource/tenant scoping — instead of scattering role-name checks through handlers. Use when building multi-user or multi-tenant authorization, when `if user.isAdmin` checks are sprawling across the codebase, or when 'who can do what' needs a real model rather than ad-hoc gates."
4
+ allowed-tools: "Read, Grep, Glob"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ Design the authorization model — the permission system itself — rather than reviewing one that exists. The job is to decide *what capabilities exist*, *how they compose into roles*, *how far each check is scoped*, and *where enforcement lives* — so that application code asks one question, **"can this actor perform this action on this resource?"**, instead of the brittle `if (user.isAdmin)` checks that breed across handlers and rot the moment requirements change. The skill reads the codebase to find the resources, actions, and existing role checks, then produces a concrete permission/role model, a single central enforcement design, and explicit decisions on hierarchy, default-deny, tenant isolation, and audit.
9
+
10
+ ## When to use this skill
11
+
12
+ - Building authorization for a multi-user or multi-tenant (SaaS) product, where access depends on both *who* the actor is and *which org/project/resource* they are touching.
13
+ - When ad-hoc role checks — `if (user.role === 'admin')`, `user.isManager`, `@RequireRole("OWNER")` — are sprawling through controllers and every new rule means a code hunt.
14
+ - When "who can do what" is tribal knowledge with no single model, or a customer/security review asks you to document the permission matrix.
15
+ - Before adding roles, a permissions UI, custom roles, or an admin-impersonation feature on top of a system that hardcodes role names.
16
+
17
+ > [!WARNING]
18
+ > Scattering role-name checks (`isAdmin`, `role === "manager"`) through the codebase instead of checking granular permissions makes every permission change a risky code hunt and guarantees missed spots — the endpoint you forget is the privilege-escalation bug. Model permissions, compose them into roles, and enforce in one place so a grant change is one edit and coverage is greppable.
19
+
20
+ ## Instructions
21
+
22
+ 1. **Inventory resources and actions before inventing roles.** Glob the routers, controllers, and data models (`**/routes/**`, `**/*controller*`, `**/models/**`, `**/entities/**`) and list every *resource* (invoice, project, user, billing-account) and every *action* on it (read, create, update, delete, approve, export, invite). Permissions are these `resource:action` pairs — `invoice:read`, `invoice:approve`, `member:invite`. Name them after the capability, not the role, so the same permission can be granted to many roles. This list is the vocabulary; everything else composes it.
23
+ 2. **Compose permissions into roles — never the reverse.** Define roles as *named sets of permissions* (`viewer = {invoice:read, project:read}`, `approver = viewer ∪ {invoice:approve}`). Code checks `can(actor, "invoice:approve", invoice)`, never `actor.role === "approver"`. This is the whole point: when product says "approvers can now export", you edit one role→permission map, not every handler. Grep the codebase for existing `role ===`, `isAdmin`, `hasRole`, `@Role`, `@PreAuthorize` sites and list each as a call site to migrate to a permission check.
24
+ 3. **Pick the granularity you actually need — and stop there.** Choose explicitly among three:
25
+ - **Pure RBAC** (roles → permissions, global) — fine for single-tenant internal tools where a role means the same thing everywhere.
26
+ - **Scoped RBAC** (role *within* an org/project/workspace) — the default for SaaS: a user is `admin` of org A and `viewer` of org B, and every check is scoped to the resource's tenant. Model the assignment as `(actor, role, scope)`.
27
+ - **ReBAC / ABAC** (permission depends on the specific object's relationship or attributes — "owner of THIS document", "assignee of THIS ticket") — reach for this *only* for the per-object rules; let scoped RBAC carry the rest. Do **not** stand up a full policy engine if scoped RBAC suffices.
28
+ State the choice and the reason; mixing scoped RBAC for the 90% with a handful of ReBAC ownership rules is usually correct.
29
+ 4. **Centralize enforcement in one authorization layer.** Design a single policy function/middleware — `authorize(actor, action, resource)` (or a guard/policy class) — that every entry point routes through: HTTP handlers, GraphQL resolvers, queue/cron jobs, and admin scripts. No handler should make its own role decision. Specify *where* it sits (e.g. middleware that resolves the resource, computes the actor's permissions in that scope, and allows/denies) so coverage is provable by reading one module, not auditing hundreds.
30
+ 5. **Default-deny, explicitly.** The policy layer returns deny unless a rule grants. A new route with no policy attached must fail closed (no access), never fall through to allowed. Specify how an un-annotated/un-checked endpoint is detected and rejected (e.g. a route-level assertion that a policy was declared) so "forgot to add a check" becomes a *deny*, not a hole.
31
+ 6. **Decide role hierarchy and inheritance deliberately.** If `admin` should imply everything `editor` can do, model it as *permission inheritance* (admin's permission set ⊇ editor's) computed when permissions are resolved — not as a chain of `if role >= X` comparisons, which reintroduce role-name logic. Keep the hierarchy shallow and flatten to an effective permission set at check time; document the partial order so "what can role X do" is answerable from the model alone.
32
+ 7. **Scope every check to the resource — at the API *and* data layer.** A valid role on tenant A must never act on tenant B's data. The permission check answers "may this actor approve invoices?"; the *data* layer must additionally bind the query to the resource's owner/tenant (`WHERE org_id = :actorOrg`, a tenant filter, or row-level security), so changing an id in the URL cannot reach another tenant's row. Specify both: the policy check *and* the scoped query. Skipping the data-layer scope is the classic IDOR — the permission passed, but the object belonged to someone else.
33
+ 8. **Make it auditable.** Design the model so authorization decisions are explainable and logged: who has which role in which scope (queryable), what permissions a role grants (the map), and a decision log for sensitive actions (actor, action, resource, allow/deny, why). If nobody can answer "who can approve invoices in org X?" from the model alone, the model is not finished.
34
+
35
+ > [!NOTE]
36
+ > RBAC without per-tenant/resource scoping is the most common real failure: a legitimate `admin` of org A passes the `invoice:approve` permission check and then approves org B's invoice because the query fetched by id alone. The permission says *what* the actor may do; the scope says *to which objects*. Both are required — design them together, not as an afterthought.
37
+
38
+ ## Output
39
+
40
+ A concrete authorization design with four parts:
41
+
42
+ 1. **The permission/role model** — the resource×action permission list, the role→permission map (with inheritance), and the assignment shape (`(actor, role)` for pure RBAC or `(actor, role, scope)` for scoped/multi-tenant).
43
+ 2. **The central enforcement design** — the single `authorize(actor, action, resource)` entry point, where it sits, what it resolves, and the list of existing scattered role checks to migrate into it.
44
+ 3. **Granularity decision** — pure RBAC vs scoped RBAC vs ReBAC/ABAC, stated with the reason, including which specific rules (if any) need per-object relationship checks.
45
+ 4. **The hardening decisions** — default-deny mechanism, role hierarchy/partial order, the API-and-data-layer scoping rule per resource, and the audit/decision-log plan.
46
+
47
+ ```text
48
+ Authorization model — scope: src/routes/**, src/models/** (multi-tenant SaaS)
49
+ Granularity: SCOPED RBAC (role within org) + ReBAC for document ownership
50
+
51
+ PERMISSIONS (resource:action)
52
+ invoice: read, create, update, delete, approve, export
53
+ member: read, invite, remove
54
+ doc: read, edit (edit also gated by ownership — see ReBAC)
55
+
56
+ ROLES → PERMISSIONS (within an org)
57
+ viewer = {invoice:read, member:read, doc:read}
58
+ editor = viewer ∪ {invoice:create, invoice:update, doc:edit}
59
+ approver = editor ∪ {invoice:approve, invoice:export}
60
+ admin = approver ∪ {member:invite, member:remove} # inherits all above
61
+
62
+ ASSIGNMENT: (user_id, role, org_id) # scoped — same user differs per org
63
+
64
+ ENFORCEMENT (one layer)
65
+ authorize(actor, action, resource):
66
+ 1. resolve actor's role in resource.org_id -> effective permission set
67
+ 2. deny if action ∉ permissions # DEFAULT-DENY
68
+ 3. ReBAC rule: doc:edit also requires resource.owner_id == actor.id
69
+ Every route/resolver/job calls authorize(); routes with no policy → fail closed.
70
+
71
+ MIGRATE these scattered checks into authorize():
72
+ - src/routes/invoices.ts:41 if (user.isAdmin) -> authorize(..,"invoice:approve",inv)
73
+ - src/routes/members.ts:88 user.role === "owner" -> authorize(..,"member:invite",org)
74
+
75
+ DATA-LAYER SCOPING (prevents IDOR — required alongside the permission check)
76
+ invoices: WHERE id = :id AND org_id = :actorOrg # not findById(id) alone
77
+ docs: WHERE id = :id AND org_id = :actorOrg # + ReBAC owner check above
78
+
79
+ AUDIT
80
+ - role assignments queryable: "who can invoice:approve in org X?"
81
+ - decision log on approve/export/remove: actor, action, resource, allow/deny
82
+ ```
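The model above could be sketched in code roughly as follows. This is a minimal illustration, not the skill's prescribed implementation: the `Resource` dataclass, the in-memory `ASSIGNMENTS` table, and the user ids are assumptions made for the example, while the role and permission names are copied from the sample output.

```python
from dataclasses import dataclass
from typing import Optional

# Role -> permission map with inheritance expressed as set union.
VIEWER = frozenset({"invoice:read", "member:read", "doc:read"})
EDITOR = VIEWER | {"invoice:create", "invoice:update", "doc:edit"}
APPROVER = EDITOR | {"invoice:approve", "invoice:export"}
ADMIN = APPROVER | {"member:invite", "member:remove"}

ROLE_PERMISSIONS = {
    "viewer": VIEWER,
    "editor": EDITOR,
    "approver": APPROVER,
    "admin": ADMIN,
}


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource shape: every object carries its tenant."""
    org_id: str
    owner_id: Optional[str] = None


# Scoped assignment (user_id, org_id) -> role; the same user may differ per org.
ASSIGNMENTS = {
    ("u1", "org_a"): "admin",
    ("u1", "org_b"): "viewer",
    ("u2", "org_a"): "editor",
}


def authorize(actor_id: str, action: str, resource: Resource) -> bool:
    # 1. Resolve the actor's role in the resource's org.
    role = ASSIGNMENTS.get((actor_id, resource.org_id))
    permissions = ROLE_PERMISSIONS.get(role, frozenset())
    # 2. Default-deny: no role, unknown role, or missing permission all fail closed.
    if action not in permissions:
        return False
    # 3. ReBAC rule: editing a doc additionally requires ownership.
    if action == "doc:edit" and resource.owner_id != actor_id:
        return False
    return True
```

Because the lookup key includes `resource.org_id`, an `admin` of `org_a` is only a `viewer` when the resource belongs to `org_b`, and an actor with no assignment is denied without any special case.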
@@ -0,0 +1,78 @@
1
+ ---
2
+ name: "release-notes-writer"
3
+ description: "Write user-facing release notes — the curated 'what's new and what it means for you' — by starting from the real changes (git log / merged PRs / the changelog since the last release) and translating developer-speak into user impact, grouped by what the user cares about with breaking changes and required actions surfaced first. Use when shipping a release to users or customers and the raw commit log isn't something a user should read, when you need a published GitHub-release / blog / in-app announcement, or when a breaking change must be made unmissable so upgrades don't break."
4
+ allowed-tools: "Read, Grep, Glob, Bash, Write"
5
+ version: 1.0.0
6
+ ---
7
+
8
+ A changelog records *what changed*; release notes explain *what it means for the person upgrading*. Pasting raw conventional-commit lines into a release fails users twice: it buries the two things they actually need under twenty refactors and dependency bumps, and it hides the one breaking change that will take down their integration on upgrade. This skill reads the real changes since the last release, throws away the churn users don't care about, translates the rest into impact-and-action language grouped the way a user thinks (New / Improved / Fixed), and puts breaking changes and required steps at the top where they cannot be missed.
9
+
10
+ ## When to use this skill
11
+
12
+ - You are shipping a release to end users or API consumers and the commit log / changelog is not something they should read.
13
+ - You need a GitHub release body, a "what's new" blog post, or an in-app changelog entry — not an internal diff.
14
+ - A release contains a breaking change or a required migration and you need it surfaced first, with the exact action spelled out.
15
+ - You have a draft changelog (e.g. from `changelog-from-prs`) and need to convert it into something audience-appropriate and benefit-led.
16
+
17
+ ## Instructions
18
+
19
+ 1. **Start from the real changes, not memory.** Establish the range from the last released tag and pull the actual shipped work — never invent items or summarize from what you "think" landed.
20
+
21
+ ```bash
22
+ LAST_TAG=$(git describe --tags --abbrev=0)
23
+ git log "$LAST_TAG"..HEAD --no-merges --pretty='%s'
24
+ gh pr list --state merged --search "merged:>$(git log -1 --format=%cI "$LAST_TAG")" \
25
+ --json number,title,labels,body --limit 200
26
+ ```
27
+ If a `CHANGELOG.md` already covers this range, read it as the source of record instead of re-deriving from commits.
28
+
29
+ 2. **Identify the audience and pin the voice.** End users, API consumers, and self-hosting operators need different notes. Look at where this publishes (`README`, app store text, GitHub release, developer docs) and at past release notes for tone. API/SDK consumers need exact symbol/endpoint names and code; end users need plain-language benefit and a screenshot-level description, not the function that changed.
30
+
31
+ 3. **Drop the churn.** Remove everything a user cannot observe: internal refactors, test-only changes, CI/build config, dependency bumps with no behavior change, lint/format, doc-internal edits. A 60-commit release is often 5 user-facing notes. Keep a dependency bump *only* if it fixes a user-visible bug or a known CVE the user is exposed to — and say which.
32
+
33
+ 4. **Extract breaking changes and required actions first — this is the part that breaks systems if you get it wrong.** Scan PR bodies/commits for `BREAKING`, `!` in conventional-commit type, removed/renamed exports, flags, endpoints, config keys, changed defaults, and tightened validation. For each, write: what changed, who it affects, and the **exact action** the user must take to upgrade safely (the command, the renamed field, the config edit), with a link to a migration guide if one exists. Cross-check against the SemVer bump — a major bump with zero listed breaking changes means you missed one.
34
+
35
+ 5. **Group the rest by what the user cares about, in benefit language.** Use **New** (capabilities they didn't have), **Improved** (things that got faster/better/clearer), **Fixed** (bugs that affected them). Rewrite each from implementation to impact: not "refactor `ExportService` to stream rows" but "Exports of large datasets no longer time out." For notable new features add a one-line *how to use it* (the flag, the menu, the endpoint). Order within each group by how many users it affects, not by PR number.
36
+
37
+ 6. **Append upgrade instructions and links.** Give the concrete upgrade step for this project (`npm i pkg@2.0.0`, the container tag, the migration command) and link the full changelog, the migration guide, and relevant docs for new features. Keep PR/issue references only where a user might want the detail — don't litter end-user notes with `(#1423)`.
38
+
39
+ 7. **Lead with a one-line summary and write the header.** Open with a single sentence a user can skim ("v2.0 adds scheduled exports and a JSON API; one breaking change to the auth header"). Then breaking/action-required, then New / Improved / Fixed, then upgrade steps. Emit it as Markdown ready to paste — publish nothing yourself.
40
+
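Steps 3 to 5 can be given a mechanical first pass before the human rewrite. The sketch below is one possible triage, assuming commit subjects follow the conventional-commit shape `type(scope)!: description`; the `classify` function, the bucket names, and the choice of which types count as churn are assumptions for illustration, and anything ambiguous is routed to `"Review"` instead of being dropped.

```python
import re

# type, optional (scope), optional "!", then ": description"
SUBJECT = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$"
)

# Types a user normally cannot observe (assumed list; adjust per project).
CHURN_TYPES = {"chore", "refactor", "test", "ci", "build", "style"}

# Conventional-commit type -> release-notes group.
GROUPS = {"feat": "New", "perf": "Improved", "fix": "Fixed"}


def classify(subject: str, body: str = "") -> str:
    """Return 'Breaking', 'New', 'Improved', 'Fixed', 'Drop', or 'Review'."""
    match = SUBJECT.match(subject)
    if match is None:
        return "Review"  # not a conventional commit: a human decides
    if match.group("bang") or "BREAKING" in body:
        return "Breaking"  # always surfaced first, regardless of type
    if match.group("scope") == "deps":
        return "Review"  # keep only if it fixes a user-visible bug or CVE
    if match.group("type") in CHURN_TYPES:
        return "Drop"
    return GROUPS.get(match.group("type"), "Review")
```

The output is only a sorting aid: each surviving item still needs to be rewritten from implementation to impact, and every `"Review"` item needs a human decision.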
41
+ > [!WARNING]
42
+ > Release notes are not a commit dump. Pasting raw conventional-commit lines (`feat:`, `chore(deps):`, `refactor:`) buries the few items users need under noise they cannot act on, and makes the notes look auto-generated and untrustworthy. Translate to impact and delete the rest.
43
+
44
+ > [!CAUTION]
45
+ > A breaking change hidden mid-list — or omitted because it "looked small" — is how you break your users' systems on upgrade. Every removed/renamed flag, changed default, tightened validation, or altered response shape goes in a **Breaking changes / action required** block at the very top, with the exact migration step. If the SemVer bump is major but you wrote no breaking items, stop and re-scan; you missed one.
46
+
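The SemVer cross-check in the caution above is easy to automate as a guard. The following is a rough sketch that handles only plain `MAJOR.MINOR.PATCH` tags with an optional leading `v` (pre-release and build suffixes are out of scope); the function names are hypothetical, and the second warning (breaking items listed without a major bump) is an addition based on SemVer's rules for versions at or above 1.0.0, not something the skill text states.

```python
from typing import List, Optional, Tuple


def parse_version(tag: str) -> Tuple[int, int, int]:
    # Accepts "v2.0.0" or "2.0.0"; anything else raises ValueError.
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def bump_kind(previous: str, current: str) -> str:
    prev, cur = parse_version(previous), parse_version(current)
    if cur[0] != prev[0]:
        return "major"
    if cur[1] != prev[1]:
        return "minor"
    return "patch"


def check_breaking_consistency(
    previous: str, current: str, breaking_items: List[str]
) -> Optional[str]:
    """Return a warning string when the bump and the breaking list disagree, else None."""
    kind = bump_kind(previous, current)
    if kind == "major" and not breaking_items:
        return "major bump but no breaking changes listed: re-scan the range"
    if kind != "major" and breaking_items:
        return "breaking changes listed but the bump is not major: check the version"
    return None
```

A `None` result does not prove the notes are complete; it only means the version number and the breaking-changes block do not contradict each other.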
47
+ ## Output
48
+
49
+ Publishable release notes — breaking-first, benefit-led — ready to paste into a GitHub release, blog post, or in-app changelog:
50
+
51
+ ```markdown
52
+ # v2.0.0 — 2026-06-17
53
+
54
+ Scheduled exports and a new JSON API. **Two breaking changes:** the API auth header was renamed and exports now default to `json`. Update integrations and config before upgrading.
55
+
56
+ ## ⚠️ Breaking changes — action required
57
+ - **Auth header renamed `X-Token` → `Authorization: Bearer <key>`.** Requests using `X-Token` now return `401`. Update your client before upgrading. See the [migration guide](https://docs.example.com/migrate/v2).
58
+ - **The `export` config key `format` now defaults to `json` instead of `csv`.** Add `format: csv` explicitly to keep the old behavior.
59
+
60
+ ## New
61
+ - **Scheduled exports.** Set a cron in Settings → Exports to deliver reports automatically — no more manual runs.
62
+ - **JSON API for reports.** Pull report data programmatically via `GET /api/v2/reports`. See the [API docs](https://docs.example.com/api).
63
+
64
+ ## Improved
65
+ - Exports of large datasets no longer time out — they now stream and complete in seconds.
66
+ - Faster dashboard load on accounts with many projects.
67
+
68
+ ## Fixed
69
+ - Fixed a crash when a saved filter referenced a deleted field.
70
+ - Times now display in the account's timezone instead of UTC.
71
+
72
+ ## Upgrade
73
+ 1. Update auth headers per the breaking change above.
74
+ 2. `npm i your-pkg@2.0.0` (or pull image tag `:2.0.0`).
75
+ 3. Run `your-cli migrate` to apply the config default change.
76
+
77
+ [Full changelog](https://github.com/org/repo/compare/v1.6.0...v2.0.0)
78
+ ```
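The breaking-first ordering of the sample can be enforced by the code that assembles the notes. This is a small sketch under stated assumptions: `render_release_notes`, the dictionary keys, and the heading wording are hypothetical, and the function only emits the title, summary, and grouped bullets (upgrade steps and links would be appended separately).

```python
from typing import Dict, List

# Fixed section order: breaking changes always come before everything else.
SECTION_ORDER = [
    ("breaking", "Breaking changes: action required"),
    ("new", "New"),
    ("improved", "Improved"),
    ("fixed", "Fixed"),
]


def render_release_notes(title: str, summary: str, items: Dict[str, List[str]]) -> str:
    """Render Markdown with sections in SECTION_ORDER, skipping empty groups."""
    lines = [f"# {title}", "", summary]
    for key, heading in SECTION_ORDER:
        entries = items.get(key, [])
        if not entries:
            continue  # no empty headings in the published notes
        lines += ["", f"## {heading}"]
        lines += [f"- {entry}" for entry in entries]
    return "\n".join(lines) + "\n"
```

Because the order lives in one constant, a breaking item cannot end up mid-list no matter how the caller builds the `items` dictionary.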