npm - agentscamp - Versions diffs - 0.5.0 → 0.7.0 - Mend

agentscamp 0.5.0 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19) hide show

package/README.md +46 -19
package/content/manifest.json +212 -2
package/content/skills/circular-dependency-breaker.md +48 -0
package/content/skills/commit-splitter.md +54 -0
package/content/skills/dashboard-designer.md +38 -0
package/content/skills/deadlock-diagnoser.md +45 -0
package/content/skills/feature-flag-retirer.md +44 -0
package/content/skills/flamegraph-analyzer.md +35 -0
package/content/skills/git-blame-investigator.md +34 -0
package/content/skills/graphql-schema-designer.md +49 -0
package/content/skills/hallucination-evaluator.md +40 -0
package/content/skills/integration-test-designer.md +81 -0
package/content/skills/model-router-designer.md +39 -0
package/content/skills/onboarding-guide-writer.md +84 -0
package/content/skills/rbac-designer.md +82 -0
package/content/skills/release-notes-writer.md +78 -0
package/content/skills/web-vitals-optimizer.md +34 -0
package/dist/index.js +400 -43
package/package.json +1 -1

package/content/skills/deadlock-diagnoser.md ADDED Viewed

@@ -0,0 +1,45 @@
+---
+name: "deadlock-diagnoser"
+description: "Diagnose a database deadlock from the engine's own deadlock report, reconstruct the lock cycle (A holds 1 wants 2, B holds 2 wants 1), name the root cause — almost always two code paths locking the same rows in different orders — and fix it with consistent lock ordering, shorter transactions, and a retry-the-victim safeguard. Use when the DB logs deadlock errors, when transactions intermittently fail under load, or when queries mysteriously block each other."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A deadlock looks random from the application — a transaction that worked a thousand times suddenly errors out under load — but the database already did the forensics for you. When the engine detects a cycle it picks a victim, rolls it back, and logs *exactly* who held what and waited on what. This skill reads that report instead of guessing: it pulls the Postgres deadlock log lines (or the SQL Server deadlock graph / `innodb status` in MySQL), reconstructs the cycle (A holds lock 1 and wants lock 2 while B holds 2 and wants 1), and names the real root cause — which is almost always two code paths acquiring the **same** rows or tables in **different** orders. Then it fixes the cause: enforce one consistent lock-acquisition order everywhere, shrink the lock window so the race rarely opens, and add a retry-the-victim safeguard for the deadlocks you can't design away — in that priority, because retries without ordering just trade a deadlock for a rollback storm.
+## When to use this skill
+- The database log shows `deadlock detected` (Postgres), a deadlock graph / error 1205 (SQL Server), or `Deadlock found when trying to get lock` (MySQL/InnoDB).
+- A transaction intermittently fails or auto-retries only under concurrency — fine in dev, flaky in production at peak.
+- Two queries or endpoints mysteriously block each other, or you see processes stuck in a lock wait that times out.
+- You're adding a write path that touches multiple rows/tables and want to confirm it locks in the same order as existing code before it ships.
+- Lock contention (not a true cycle) is serializing throughput, and you need to tell genuine deadlocks apart from long lock waits.
+## Instructions
+1. **Get the engine's deadlock report — don't reconstruct from app logs.** In Postgres, read the server log around the error: it prints both processes, their full SQL statements, and the `Process N waits for <lockmode> on <relation/tuple>; blocked by process M` lines for each side of the cycle (raise `log_lock_waits = on` and `deadlock_timeout` context if it's terse). In SQL Server, pull the deadlock graph from the `system_health` Extended Events session or a trace — it lists each `process` with its `inputbuf` (the statement) and the `resource-list` of locks owned vs. requested. In MySQL/InnoDB, run `SHOW ENGINE INNODB STATUS` and read the `LATEST DETECTED DEADLOCK` section. This report is ground truth; the app's stack trace only tells you which transaction lost.
+2. **Reconstruct the cycle explicitly: who HELD what, who WANTED what.** Write it out as a two-column picture — `Txn A: holds <lock on resource 1>, waits for <lock on resource 2>` / `Txn B: holds <lock on resource 2>, waits for <lock on resource 1>`. Identify the exact resources (which rows/index ranges/tables) and the lock modes (row `FOR UPDATE`/exclusive vs. shared, gap locks in InnoDB, intent locks in SQL Server). A real deadlock is a closed cycle of waits; if it's not a cycle, it's lock contention or a lock-wait timeout (step 8), which has a different fix.
+3. **Find the inconsistent acquisition ORDER — the usual root cause.** Grep the codebase for every transaction that touches the resources in the cycle and trace the order each one locks them. The classic bug: one path does `UPDATE accounts WHERE id=1` then `id=2`, another does `id=2` then `id=1` (or two services lock tables `orders` then `inventory` vs. `inventory` then `orders`). Watch for ordering that's *hidden* — a `SELECT ... FOR UPDATE` with an unordered `IN (...)` or a join whose row-locking order depends on the plan, an ORM that emits writes in object-graph order, or a foreign-key check that takes a lock on the parent row you didn't write explicitly.
+4. **Fix the cause first: enforce ONE consistent lock-acquisition order across all transactions.** Make every code path acquire the shared resources in the same deterministic order — sort the ids before locking (`SELECT ... FOR UPDATE ... ORDER BY id`), always lock parent before child, always lock tables in a fixed documented sequence. Consistent ordering makes a cycle impossible: contenders queue instead of deadlocking. This is the only fix that actually removes the deadlock rather than reducing its odds.
+5. **Shrink the lock window so the race rarely opens.** Keep transactions short and narrow: acquire locks as late as possible, commit as early as possible, and lock only the rows you'll write. Never hold a transaction open across a network/RPC/third-party-API call or across user think-time — an external call inside the transaction stretches the lock-hold from milliseconds to seconds and turns rare contention into constant deadlocks. Do the slow work *before* `BEGIN` or *after* `COMMIT`.
+6. **Pick a deliberate lock strategy for the access pattern, and right-size isolation.** Where the same rows are contended, use pessimistic locking with `SELECT ... FOR UPDATE` in the consistent order from step 4. Where conflicts are *rare*, prefer optimistic concurrency — a `version`/`updated_at` column checked in the `WHERE` of the `UPDATE` and a conflict-retry, which takes no long-held locks. If the engine is over-locking (e.g. Serializable or InnoDB gap locks causing deadlocks on inserts/range scans), drop to the lowest isolation level that's still correct (often Read Committed) to acquire fewer locks.
+7. **Add the retry-the-victim safeguard — last, not first.** A deadlock victim's transaction is rolled back cleanly and is a *transient, safe-to-retry* error; the app should catch it specifically (Postgres `SQLSTATE 40P01`, MySQL `1213`, SQL Server `1205`) and retry the whole transaction with capped exponential backoff and jitter (e.g. 3–5 attempts). Retry the *entire* transaction from `BEGIN` — replaying half a rolled-back transaction corrupts state. This handles the deadlocks you can't design away; it does NOT substitute for steps 4–5.
+8. **Distinguish a true deadlock from plain lock contention before "fixing" the wrong thing.** If the report shows a lock-*wait timeout* rather than a detected cycle, there's no ordering bug — one transaction is simply holding a lock too long (a long-running write, an idle-in-transaction connection, a missing index forcing a wide row/range lock). The fix there is shortening the holder (step 5), adding the index so the lock is narrow (`query-plan-analyzer`), or killing idle-in-transaction sessions — not reordering locks.
+> [!WARNING]
+> Adding retries WITHOUT fixing the inconsistent lock order just papers over the bug. Under load, every retry re-enters the same cycle, so you trade one deadlock for a storm of rollbacks and re-runs: throughput craters, latency spikes, and the database burns work undoing transactions. Fix the ordering first; the retry is a net for the residual, not the cure.
+> [!WARNING]
+> A transaction that holds a lock across an external/API call (or user think-time) is the single most common way rare contention becomes constant deadlocks — the lock-hold goes from milliseconds to seconds, widening the race window enormously. Move every network call and slow computation outside the `BEGIN ... COMMIT`.
+> [!NOTE]
+> Lowering isolation reduces locking but changes correctness guarantees (Read Committed allows non-repeatable reads; dropping below Serializable can reintroduce write skew). Only lower it where the access pattern is provably safe — don't trade a deadlock for a silent data anomaly.
+## Output
+A short report with four parts:
+1. **The reconstructed cycle** — quoted from the engine's deadlock report: `Txn A holds <lock on R1>, wants <lock on R2>` / `Txn B holds <lock on R2>, wants <lock on R1>`, with the exact resources, lock modes, and the two offending statements.
+2. **The root cause** — the specific inconsistent lock-acquisition order (or over-long lock scope / over-strict isolation) behind the cycle, naming the two code paths and the resources they lock in conflicting order.
+3. **The fix** — one concrete change: the consistent ordering to enforce (with the exact `ORDER BY` / lock sequence), or the shortened-transaction change (what to move outside `BEGIN`), or the isolation-level / locking-strategy change — not a menu.
+4. **The retry safeguard** — the specific deadlock SQLSTATE/error code to catch and the backoff retry of the whole transaction, framed explicitly as the net for residual deadlocks, not the primary fix.

package/content/skills/feature-flag-retirer.md ADDED Viewed

@@ -0,0 +1,44 @@
+---
+name: "feature-flag-retirer"
+description: "Retire stale feature flags by confirming each flag's decided final state, then collapsing every conditional to the winning branch and deleting the loser plus the now-dead code it reached. Use when temporary flags have outlived their rollout, when flag conditionals clutter the code, or during a flag-debt cleanup."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Feature flags are born temporary and die permanent. Once a flag is fully rolled out or quietly abandoned, the `if (flag)` it guards is just branching debt — two code paths where one is now unreachable. This skill retires a flag for real: it pins down which branch actually won, finds *every* reference (not just the obvious helper call), collapses each conditional to the winner, and deletes the loser along with any code only the dead branch reached — one flag at a time, with tests green after each.
+## When to use this skill
+- A flag meant to last a sprint has been at 100% (or 0%) for months and still litters the code with conditionals.
+- Flag checks have multiplied — nested `if (flagA && !flagB)` paths nobody can reason about — and you want to pay down the debt.
+- You're running a flag-debt cleanup and need each removal to be independently reviewable and revertible.
+> [!WARNING]
+> Verify the flag's *decided* final state before you collapse anything. "Currently 100%" is not "permanently on" — a flag mid-rollout, a kill-switch, or an experiment still gathering data must NOT be retired. Deleting the live branch ships or kills a feature: that's a production incident, not a cleanup. Confirm from the flag system/config AND a human owner that the decision is final, and which branch won.
+## Instructions
+1. **Pin down the decided final state — not the current value.** For the flag, answer one question: is it *permanently on* (fully rolled out, winner = enabled branch) or *abandoned* (will never ship, winner = disabled branch)? Read the flag config/dashboard, then confirm with the owner. Reject the flag from this pass if it's still rolling out, A/B testing, a kill-switch kept for emergencies, or used per-tenant/per-environment with different values — those are live, not stale.
+2. **Find every reference — grep the flag KEY, not just the helper.** A flag leaks far past its `if`. Search the whole repo for the literal flag key string and its identifier:
+   - the helper calls: `isEnabled("new_checkout")`, `flags.newCheckout`, `useFlag(...)`, `treatment(...)`;
+   - the flag *definition/registration* (the declarations file, defaults, env vars, IaC/config);
+   - tests, fixtures, and mocks that force the flag on or off;
+   - analytics/telemetry events fired only when on, and feature-gated schema/migrations/routes;
+   - string usages: config keys, JSON, YAML, query params, log lines, docs.
+   Grep both the key (`"new_checkout"`) and the symbol (`newCheckout`) — different layers spell it differently.
+3. **Collapse each conditional to the winning branch.** For every reference, rewrite the conditional to keep only the winner: fully-on → keep the `if` body, drop the `else`/fallback; abandoned → keep the `else`, delete the guarded body. Remove the now-constant condition entirely — no `if (true)`, no dead `else`. Flatten the indentation you just freed.
+4. **Delete the code only the dead branch reached.** A removed branch usually calls helpers, imports, components, or fires events that nothing else uses. Trace each symbol the loser referenced; if its only caller was the branch you just deleted, remove it too (and repeat transitively). This is where flag retirement leaves dangling dead code if you stop at the `if`.
+5. **Remove the flag's definition and its tests.** Delete the flag declaration/registration, its default value and env/config entries, and the tests/fixtures that existed solely to toggle it. Tests that asserted the *winning* behavior stay — but drop their flag-setup boilerplate so they test the now-unconditional path.
+6. **One flag at a time, tests green after each.** Never retire two flags in one pass. After each flag: run the build and test suite, confirm green, and keep it as a single commit. A revert then removes exactly one flag's worth of change with no collateral.
+> [!WARNING]
+> A flag almost always guards MORE than the obvious if-block — feature-gated helper functions, config defaults, DB columns or migrations, route registrations, and analytics events reachable only when on. Grep exhaustively (step 2) before deleting: stop at the `if` and you leave dangling dead code; over-trust a single grep and you delete a path the *winning* branch still uses. When in doubt whether a symbol is shared, keep it and flag it for review.
+## Output
+For each retired flag, a record an owner can rubber-stamp:
+- **Confirmed final state** — `permanently-on` or `abandoned`, with the source (flag dashboard value + owner sign-off) and the resulting winning branch.
+- **Reference inventory** — every match for the key and symbol, grouped by layer: conditionals, definition/config, tests/fixtures, analytics, schema/routes, docs/strings.
+- **Collapse plan** — per conditional: which branch wins, the resulting diff, and the list of now-dead symbols deleted because only the loser reached them.
+- **Verification** — confirmation the build and test suite pass after the removal, and that the change is a single self-contained commit. Anything ambiguous (shared symbol, public-API surface, flag still live elsewhere) is listed as a manual-review item rather than deleted.

package/content/skills/flamegraph-analyzer.md ADDED Viewed

@@ -0,0 +1,35 @@
+---
+name: "flamegraph-analyzer"
+description: "Turn a CPU profile or flamegraph into a concrete optimization instead of guessing where the time goes: capture under a realistic workload with a sampling profiler, read the graph correctly (width = time, depth ≠ time), find the widest self-time leaves, ask if that work is necessary/redundant/algorithmically wrong, fix the biggest contributor, then re-profile. Use when code is CPU-bound and slow, a function is hot but you don't know which part, or you have a profile you can't interpret."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+When code is slow and CPU-bound, the most expensive thing you can do is guess. Intuition about "the slow part" is wrong often enough that optimizing it usually buys nothing while the real hotspot sits untouched. A flamegraph answers the question directly — *which frames are actually burning CPU* — but only if you capture it under a realistic workload and read it correctly. This skill does both: it gets a representative sampling profile, reads width as time and the y-axis as depth (not a timeline), pins the hotspot to the widest self-time leaves, classifies the work as unnecessary / redundant / algorithmically wrong, fixes the biggest contributor, and re-profiles — because the bottleneck always moves after a fix, and your intuition about the new one is just as unreliable.
+## When to use this skill
+- A request, job, or function is slow, CPU usage is high, and you don't know which part of the call tree is responsible.
+- You have a profile or flamegraph SVG but can't tell where the time is going or whether you're reading it right.
+- Something is "obviously" slow and you're about to optimize the part you suspect — stop and confirm it with a profile first.
+- A hot path got optimized and got no faster, or only a little — the real bottleneck was elsewhere and you need to find it.
+- You want to know whether the latency is *computation* (on-CPU) or *waiting* (I/O, locks) before you pick where to spend effort.
+## Instructions
+1. **Capture a profile under a realistic workload with a sampling profiler — don't reason from intuition.** Drive the code the way production does (representative input size, concurrency, warm caches/JIT), then sample it with the right tool: `perf record -F 99 -g` (Linux native), async-profiler (JVM), `py-spy record` (Python), `go tool pprof` (Go), or the browser/Node `--prof` / `--cpu-prof` / DevTools profiler. Prefer **sampling** over instrumenting — instrumentation distorts the very hot frames you care about. Profile a *steady* phase, not cold start, unless cold start is the thing you're optimizing.
+2. **Render it as a flamegraph and read the axes correctly.** Collapse stacks and render (e.g. `perf script | stackcollapse-perf.pl | flamegraph.pl`, async-profiler's HTML, `go tool pprof -http`, speedscope). **Width = total time spent in a frame and everything it called; wide is expensive. The y-axis is call-stack depth, NOT time — it is not a timeline.** A tall, narrow tower is a deep-but-cheap call chain; a short, wide plateau is your hotspot. Frame ordering left-to-right is alphabetical/merge order, not chronological — never read it as "this ran, then that."
+3. **Find the widest *leaf* frames — that's where the CPU actually is.** Look at the top edge of the graph: the plateaus at the *top* of the stacks are self-time leaves, the code actually executing when samples were taken. A wide frame deep in the middle is wide because of what it *calls*; the work itself lives in the wide things sitting on top of it. Use the profiler's "self/own time" sort to confirm. Rank hotspots by self-time, not by who's tallest.
+4. **For each top hotspot, classify the work: unnecessary, redundant, or algorithmically wrong.** Read the wide leaf and ask: (a) **Unnecessary** — is this work needed at all, or is it logging/serialization/validation/copying in a hot loop that could be hoisted, batched, or dropped? (b) **Redundant** — is the same frame wide because it's *called too many times* (recomputed per item, re-parsed, re-allocated)? Cache, memoize, or lift it out of the loop. (c) **Algorithmically wrong** — a wide frame that grows with input is often an O(n²) hiding in plain sight (linear scan inside a loop, repeated string concat, a `Set` that's actually a list). Match the frame's width-vs-input behavior to the algorithm.
+5. **Confirm the latency is on-CPU before optimizing CPU.** A CPU-sample flamegraph is *blind to time spent waiting* — it shows almost nothing for blocking I/O, lock contention, or sleeping threads, because those threads aren't on-CPU to be sampled. If the wall-clock latency is large but the on-CPU flamegraph is thin or idle, the time is being *waited*, not *computed* — capture an **off-CPU / wall-clock** profile instead (off-CPU flamegraph via `perf`/eBPF, async-profiler `wall` mode, py-spy without `--idle` filtering, a blocking/lock profiler). Optimizing CPU frames will do nothing for a workload that's actually waiting on a database or a mutex.
+6. **Optimize the single biggest contributor, then RE-PROFILE.** Fix the widest hotspot first — it has the most time to give back. Then capture the *same* workload again from scratch. The bottleneck moves after every fix: the second-widest frame is now first, and the percentages you remember are stale. Do not chain optimizations from one profile; your intuition about the *new* top frame is exactly as unreliable as it was about the first. Stop when the remaining hotspots are narrow enough that the next fix isn't worth the complexity.
+> [!WARNING]
+> The y-axis is call-stack **depth, not time** — a flamegraph is not a timeline. A tall, narrow tower is a cheap deep call chain; a short, wide plateau is your hotspot. Read it as left-to-right time and you'll "optimize" the wrong frame and wonder why nothing got faster.
+> [!NOTE]
+> A CPU flamegraph is blind to waiting. If a request takes 800ms but the on-CPU graph is mostly idle, the time is spent blocked on I/O or a lock, not computing — switch to an off-CPU / wall-clock profile. Speeding up thin CPU frames can't fix latency that's actually spent waiting.
+## Output
+A short report with four parts: (1) the **capture conditions** — profiler used, workload/input that was profiled, and whether it's on-CPU or off-CPU/wall-clock; (2) the **identified hotspot(s)** read straight off the graph — each as `frame name + share of total samples + self-time vs. children` and *why* it's hot (unnecessary / redundant / algorithmically wrong); (3) the **targeted fix** for the biggest contributor as a concrete change (hoist out of loop, memoize, replace O(n²), or — if it's wait time — go profile off-CPU); and (4) the **re-profile plan** — rerun the identical workload, expected new top frame, and the stopping condition once hotspots are no longer worth chasing.

package/content/skills/git-blame-investigator.md ADDED Viewed

@@ -0,0 +1,34 @@
+---
+name: "git-blame-investigator"
+description: "Reconstruct why a line of code exists from Git history — find the originating commit, read its message and full diff for intent, and see through reformatting/rename commits with ignore-revs and the pickaxe — before you change or delete it. Use when a line looks wrong or pointless and you want to remove it, when tracing a regression to its commit, or when onboarding to unfamiliar code."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+`git blame` tells you *who* last touched a line, which is almost never the question you actually have. The real question — "why is this here, and what breaks if I remove it?" — lives in the commit *message*, the surrounding diff, and the PR that shipped it. This skill does code archaeology: it walks from a suspicious line back to the commit that introduced the *logic* (not the one that reindented it), reads the intent, and returns a verdict on whether the code is a dead artifact or a Chesterton's fence guarding a bug you can't see.
+## When to use this skill
+- A line looks redundant, wrong, or pointless and you're about to delete or "simplify" it.
+- You're tracing a regression and need the exact commit that changed the behavior.
+- You're onboarding to unfamiliar code and need to reconstruct *why* it was written this way.
+- A workaround, magic constant, or odd conditional has no comment explaining it.
+- blame keeps pointing at a formatting, rename, or merge commit that obviously isn't the real author.
+## Instructions
+1. **Locate the line precisely, then blame with context.** Run `git blame -L <start>,<end> <path>` on the suspicious range (not the whole file) and note the commit SHA, not the author name. Add `-w` to ignore whitespace-only changes and `-C -C -M` to follow lines that were moved or copied in from other files — without these, blame stops at the refactor that relocated the code and you lose its true origin.
+2. **Distrust the first SHA — it's usually noise.** If the blamed commit is a Prettier run, a lint autofix, a mass rename, or a "merge branch" commit, it did not author the logic. Re-blame ignoring it: `git blame --ignore-rev <sha> -L <start>,<end> <path>`. If the repo has recurring reformatting commits, list them in a `.git-blame-ignore-revs` file and set `git config blame.ignoreRevsFile .git-blame-ignore-revs` so every blame sees through them automatically.
+3. **Read the intent, not just the patch.** Once you have the real commit, run `git show <sha>` to read the *full* commit message and the *entire* diff — not only the line you care about. Then find the PR with `git log --merges --ancestry-path <sha>..HEAD -- <path>` or `gh pr list --search <sha>` and read the PR description and review discussion. The "why" is in prose far more often than in code.
+4. **Track the exact line or string through time with line-history and the pickaxe.** For a moving target use `git log -L <start>,<end>:<path>` to see every commit that changed that line range, in order, with diffs. To find when a specific string, identifier, or value *entered or left* the codebase, use the pickaxe: `git log -S '<exact-string>' -- <path>` (changes in the count of that string) or `git log -G '<regex>' -- <path>` (any diff line matching the regex). `-S` answers "when did this magic number / flag / call site appear or disappear?" in seconds.
+5. **Follow the code across moves and renames.** A file rename or extraction silently truncates history. Use `git log --follow -- <path>` to span renames, and when logic was hoisted into a new file, use blame's `-C -C -C` (copy detection across the whole tree, even unmodified files) to find where it was lifted from. Confirm the trail is unbroken before drawing conclusions — a gap means the real origin is in a pre-rename path.
+6. **Trace a regression to its commit, by bisection if needed.** First try `git log --oneline -- <path>` plus `git log -L` to spot an obvious culprit. If the offending change isn't obvious, run `git bisect`: `git bisect start`, `git bisect bad` (current), `git bisect good <known-good-sha>`, then test each checkout (script it with `git bisect run <test-cmd>` for an exact, automated answer). Bisect finds the precise breaking commit even across hundreds of revisions.
+7. **Reconstruct the decision from the neighborhood.** Read the commits immediately before and after the originating one (`git log --oneline <sha>~3..<sha> -- <path>` plus the linked issue) to see what problem the change was solving. A line that looks pointless in isolation often makes sense as one half of a fix — the other half being the bug it prevents.
+8. **Render a verdict tied to evidence.** Conclude with one of: *safe to remove* (origin found, the problem it solved no longer exists — cite the commit/issue), *do not touch* (it guards a known bug or invariant — cite the commit), or *needs a test first* (intent is plausible but unverified — name the behavior to lock down before changing). Never conclude "safe to remove" without having found and read the originating intent.
+> [!WARNING]
+> blame's first answer is almost always a formatting or rename commit that hides the real author. If you act on it without `--ignore-rev` and the pickaxe, you will attribute the code to the wrong change and reason about the wrong intent.
+> [!WARNING]
+> Deleting code whose original purpose you haven't found is the single most common way regressions get reintroduced. "I don't see why this is here" is a reason to investigate, never a license to remove.
+## Output
+A short investigation report containing: (1) the **originating commit(s)** — SHA, message, and the intent reconstructed from the diff and PR; (2) the **line/string history** — the ordered list of commits that introduced, moved, or altered the code (from `log -L` / `-S`), with the rename or refactor boundaries it crossed; and (3) a **verdict** — *safe to change/remove*, *do not touch*, or *needs a test first* — each justified by the cited commit or issue. All claims trace to a SHA the reader can re-run.

package/content/skills/graphql-schema-designer.md ADDED Viewed

@@ -0,0 +1,49 @@
+---
+name: "graphql-schema-designer"
+description: "Design a clean, evolvable GraphQL schema (SDL) that won't paint you into a corner — model the graph around domain types and their relationships rather than as RPC-over-GraphQL, set nullability deliberately, standardize lists with Relay connections, plan DataLoader batching for per-parent fields, and evolve by adding + @deprecated instead of versioning. Use when designing a new GraphQL API, reviewing an SDL, or migrating REST endpoints to a graph."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A GraphQL schema is not an afterthought over your endpoints — it's the public contract clients build against, and unlike REST there's no `/v2` to escape a bad decision: the graph evolves in place, forever. Two design mistakes dominate the post-launch pain. First, modeling the schema as a thin RPC wrapper of your existing endpoints (`getUserById`, `listOrdersForUser`) instead of a connected graph of types and relationships, which throws away the one thing GraphQL gives you. Second, sprinkling non-null (`!`) everywhere "to be safe," which is a trap — a single resolver error on a non-null field nulls its *entire parent object*, so a flaky downstream blanks out the whole response. This skill designs the SDL deliberately: types and edges, considered nullability, Relay connections for lists, a consistent mutation payload shape, and an explicit DataLoader plan for the fields that would otherwise N+1.
+## When to use this skill
+- You're designing a new GraphQL API from scratch and want an SDL that survives years of additive change without versioning.
+- You're reviewing or refactoring an existing schema that reads like a list of RPC calls, has `!` on nearly every field, or returns bare arrays for lists.
+- You're migrating REST endpoints to GraphQL and need to re-model resources as a connected graph rather than transcribing routes into queries one-for-one.
+- Nested queries are slow and you suspect resolvers are firing one DB query per parent row (the N+1 storm).
+## Instructions
+1. **Model the graph around domain types and their relationships, not your endpoints.** Identify the nouns (`User`, `Order`, `Product`, `Review`) and the *edges* between them, then expose those edges as fields that return types — `User.orders`, `Order.lineItems`, `Review.author` — so a client can traverse `user { orders { lineItems { product { name } } } }` in one round trip. Do **not** transcribe REST routes into a flat field per endpoint (`getUserById`, `getOrdersForUser`, `getProductForLineItem`); that's RPC-over-GraphQL and forces clients back into client-side joins and N round trips. The query-graph shape, not your handler list, is the source of truth.
+2. **Set nullability deliberately, field by field — non-null is a contract, not a default.** Mark a field non-null (`name: String!`) only when it *genuinely always resolves* — a column with a NOT NULL constraint, a synthesized value, the object's own `id`. Make a field nullable when a downstream failure (a separate service, a join that can return nothing, a slow API) shouldn't take down the rest of the response. The error-propagation rule is the whole reason this matters: when a non-null field's resolver throws or returns null, GraphQL can't put null there, so it nulls the *nearest nullable ancestor* — often the entire parent object — propagating upward until it hits a nullable field. So `Order.recommendedProducts` (computed by a flaky ML service) must be nullable, or one bad recommendation call blanks the whole order.
+3. **Standardize every list as a Relay Connection, not a bare array.** Replace `orders: [Order!]!` with a connection: `orders(first: Int, after: String, last: Int, before: String): OrderConnection!`, where `OrderConnection { edges: [OrderEdge!]!, pageInfo: PageInfo! }`, `OrderEdge { node: Order!, cursor: String! }`, and `PageInfo { hasNextPage: Boolean!, hasPreviousPage: Boolean!, startCursor: String, endCursor: String }`. Cursor-based connections page correctly under inserts/deletes (each page is anchored to a real cursor, not an offset) and give you a uniform place to hang edge metadata later (e.g. `OrderEdge.addedAt`). Bare arrays can't paginate without a breaking change and force `first`/`offset` bolt-ons later. Use connections for any list that can grow unbounded; a small fixed enum-like list (a user's `roles`) can stay a plain array.
+4. **Plan for the N+1 problem before you ship — name every field that needs a DataLoader.** Any field that resolves *per parent* — `Order.customer`, `Review.author`, `Product.category` — fires its resolver once per parent row in a list, so `orders(first: 50) { customer { name } }` becomes 1 query for orders plus 50 queries for customers. For each such field, specify a **DataLoader** that batches the per-parent keys into one query (`SELECT * FROM users WHERE id = ANY($1)`) and caches within the request. Walk the schema and list, explicitly, which fields are 1:1/1:N relationship fetches that must go through a batched loader; a schema with per-parent resolvers and no DataLoader will N+1 itself to death under nested queries.
+5. **Evolve by adding fields and deprecating — never repurpose, never version the endpoint.** GraphQL evolves in place: add new fields, types, and optional arguments freely (additive changes are non-breaking because clients select only what they ask for). To retire a field, mark it `@deprecated(reason: "Use fullName instead")` and keep it resolving until usage drops to zero (check field-usage analytics), then remove. Never change an existing field's *meaning* or *type* (`price: Int` cents → `price: Float` dollars is a silent data corruption for every existing client), never tighten nullability from nullable to non-null on a live field, and never add a `/v2` schema — versioning the endpoint defeats the entire evolvability model.
+6. **Constrain values with custom scalars and enums; never model a fixed set as a free string.** Use `enum OrderStatus { PENDING PAID SHIPPED CANCELLED }` instead of `status: String` so invalid values are rejected at the query layer and clients get the allowed set from introspection. Define custom scalars for formatted values (`DateTime`, `EmailAddress`, `URL`, `Money`) to centralize parse/serialize/validation and document the format in one place. Reserve `ID` for opaque identifiers (it serializes as a string — don't do math on it).
+7. **Give mutations input types and a consistent payload/error shape.** Every mutation takes one `input` argument of a dedicated input type (`createOrder(input: CreateOrderInput!): CreateOrderPayload!`) — input types keep arguments cohesive and let you add optional fields without changing the signature. Return a **payload type**, not the bare entity: `CreateOrderPayload { order: Order, userErrors: [UserError!]! }`, where `userErrors` carries expected, recoverable validation failures (`{ field: ["input","email"], message: "already taken" }`) as *data* the client can render — distinct from unexpected exceptions, which belong in the top-level `errors` array. Keep this `{ entity, userErrors }` shape uniform across every mutation so clients handle errors one way.
+> [!WARNING]
+> Overusing non-null (`!`) is a trap, not a safety measure. When a non-null field's resolver errors, GraphQL nulls the nearest *nullable* ancestor — so one failing `User.subscription!` field can null the entire `User`, and if `User` is also non-null, it nulls *its* parent, cascading up to potentially blank the whole `data`. Model genuinely-fallible fields (anything backed by a separate service, an external API, or an optional relationship) as **nullable** so a partial failure degrades to one missing field instead of an empty response.
+> [!WARNING]
+> A schema with per-parent resolver fields and no DataLoader will N+1 itself to death. A query like `posts(first: 100) { author { name } comments(first: 10) { edges { node { id } } } }` fans out into hundreds or thousands of individual DB queries — fast in dev with 3 rows, a query storm in production. Decide the batching plan at design time, not after the first incident: every relationship field gets a request-scoped DataLoader, no exceptions.
+> [!NOTE]
+> Connections are worth the boilerplate even for lists that "will never be large," because there is no non-breaking path from `[T!]!` to a paginated connection later — clients have already coded against the array. If a list is truly bounded and fixed (status flags, a handful of roles), a plain list is fine; everything user-generated or growth-prone starts as a connection.
+## Output
+The deliverable is a designed SDL plus the decisions behind it:
+- **The SDL** — object types and their relationship fields (edges), `enum`s and custom `scalar`s for constrained values, Relay **connection** types for every unbounded list (`*Connection` / `*Edge` / `PageInfo`), and mutations as `input`-arg + `*Payload` (with `userErrors`) pairs.
+- **The nullability decisions** — a short table of the non-obvious fields marked nullable vs non-null, each with its rationale (this field can fail downstream → nullable; this field always resolves → non-null), so reviewers see the error-propagation reasoning.
+- **The pagination decisions** — which lists became connections vs stayed plain arrays, and why.
+- **The DataLoader / batching plan** — the explicit list of per-parent relationship fields (`Type.field`) that must resolve through a request-scoped batched loader, with the batch key and the batched query for each, so the schema doesn't N+1 under nested queries.

package/content/skills/hallucination-evaluator.md ADDED Viewed

@@ -0,0 +1,40 @@
+---
+name: "hallucination-evaluator"
+description: "Detect and measure ungroundedness in LLM and RAG outputs — claims the source doesn't support — by decomposing answers into atomic claims and checking each for entailment, so you can quantify faithfulness and gate on it instead of eyeballing it. Use when a RAG/LLM feature makes confident wrong claims, before shipping anything that must be factual, or to add a groundedness gate to evals/CI."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+"It sounds confident" is not "it's correct." A RAG or grounded-generation feature can produce fluent, authoritative prose that the retrieved source never supports — and fluency is uncorrelated with faithfulness, so you cannot eyeball it. This skill makes hallucination measurable: it defines the standard precisely, decomposes each answer into atomic claims, checks each claim for entailment against the source, builds a labeled eval set that includes the should-abstain cases, splits retrieval failures from generation failures, and produces a groundedness score you can gate releases on.
+## When to use this skill
+- A RAG/LLM feature is making confident claims that turn out to be wrong, and you can't tell how often.
+- Before shipping anything that must be factual — support answers, summaries of provided docs, extraction over a source.
+- You want a groundedness gate in evals/CI so a regression in faithfulness blocks the release instead of surfacing in production.
+- A summary, citation, or "based on the document…" answer is adding facts the document doesn't contain.
+- You need to know *why* it's wrong — bad retrieval vs. the model ignoring good retrieval — because the fix differs.
+## Instructions
+1. **Define the standard precisely — faithfulness, not world-truth.** In RAG/grounded generation, a hallucination is a claim **not entailed by the retrieved context (the source you gave the model)**. This is *faithfulness*, and it is distinct from *factual accuracy against the world*. A claim can be true in reality but unfaithful (the source never said it), and faithful but false (the source itself was wrong). You grade **faithfulness to the source, because that is checkable**; open-world truth is not checkable here and conflating the two makes the eval incoherent. State which one you're measuring in writing before you score anything.
+2. **Decompose each answer into atomic claims.** A claim is a single, independently checkable assertion ("The policy refund window is 30 days"). Split compound sentences, drop hedges and meta-commentary, and keep pronoun referents resolved so each claim stands alone. Score faithfulness *per claim*, not per answer — a 4-sentence answer with one unsupported sentence is 75% grounded, and that granularity is what lets you find the specific failure.
+3. **Check each claim for entailment against the source.** For each atomic claim, label it `supported` / `not_supported` / `contradicted` using one of two checkers: (a) an NLI/entailment model (premise = the retrieved chunks, hypothesis = the claim), or (b) an **LLM-judge with the source in its context** — for the judge, default to the latest, most capable Claude model (`claude-opus-4-8`, or `claude-fable-5` for the hardest cases). Pin the judge to faithfulness: *"Using ONLY the provided source, is this claim supported? Quote the supporting span or answer not_supported. Do not use outside knowledge."* The judge grades entailment, which is checkable — never open-world truth.
+4. **Build a labeled eval set that includes the should-abstain cases.** Collect (question, retrieved context, answer) triples and hand-label the grounded/ungrounded claims. Crucially, include questions whose answer **is not in the context** — there the correct behavior is to abstain ("I don't know" / "the source doesn't say"), and answering anyway is the exact hallucination you most want to catch. An eval set without should-abstain cases will pass a model that confidently invents answers whenever retrieval comes up empty.
+5. **Split retrieval failure from generation failure.** For every ungrounded answer, ask: *was the correct answer present in what was retrieved?* If **no** → retrieval failure (the answer wasn't in the context → fix retrieval: chunking, embeddings, top-k, reranking). If **yes, but the model ignored or contradicted it** → generation failure (fix the prompt/model: cite-or-abstain instructions, a stronger model, lower the room to improvise). Report the two rates separately — they have different owners and different fixes, and a single "hallucination rate" hides which lever to pull.
+6. **Report a groundedness score and gate on it.** Compute groundedness = supported claims / total claims across the eval set, plus an abstention-accuracy number on the should-abstain subset. Attach concrete failing examples (claim + the source span it contradicts or the absence of any span). Set a threshold and wire it into CI so a drop blocks the release. Re-run the same fixed eval set on every prompt/retrieval/model change.
+7. **Reduce it, then re-measure.** Apply the levers the split points to: grounding prompts (**cite-or-abstain** — "answer only from the source; if it's not there, say so"), require an inline citation/verbatim quote per claim (a claim that can't be quoted is the one to suspect), and retrieval improvements for the retrieval-failure share. After each change, re-run the eval — don't trust that a prompt tweak helped; show the score moved.
+> [!WARNING]
+> Confidence and fluency are uncorrelated with faithfulness. The most dangerous hallucinations are the ones that read most authoritatively, so you must check claims against the source span by span — never grade on how convincing the answer sounds.
+> [!WARNING]
+> Do not let the faithfulness judge use outside knowledge. If it "knows" a claim is true and marks it supported even though the source never says it, you're now measuring world-truth (not measurable here) instead of groundedness (measurable) — and the eval becomes incoherent. The instruction "use ONLY the provided source" is load-bearing; verify the judge actually abstains when the source is silent.
+## Output
+A faithfulness eval report containing: (1) the eval method — atomic-claim decomposition + the entailment/LLM-judge checker, with the exact judge prompt; (2) the labeled eval set, explicitly including should-abstain cases (answer-not-in-context); (3) per-answer results split into retrieval-failure vs. generation-failure, with separate rates; and (4) the groundedness score (supported claims / total) plus abstention accuracy, concrete failing examples with the offending source spans, and a CI gate threshold. Reproducible: same eval set, same judge model, re-runnable on every change.

package/content/skills/integration-test-designer.md ADDED Viewed

@@ -0,0 +1,81 @@
+---
+name: "integration-test-designer"
+description: "Design integration tests that exercise components against REAL collaborators — actual database, queue, HTTP boundary — at a deliberately chosen seam, instead of a unit suite that mocks everything or a slow flaky full E2E. Use when bugs slip past green unit tests, when wiring or contracts between layers break in production, or when a mocked DB test passes but the real query/migration/serialization fails."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A unit suite that mocks the database, the queue, and the HTTP client proves your mocks are configured the way you configured them — it never runs your actual SQL, your migrations, your serialization, or the wiring between layers. That's exactly where bugs slip into production. A full E2E suite catches them but is too slow and flaky to gate merges. This skill designs the layer in between: an integration test that drives a deliberately chosen *slice* of the system through its real boundaries — a real database, a real broker, a real HTTP framework — while stubbing only the genuinely uncontrollable third parties. The deliverable is the chosen seam, an explicit real-vs-stubbed split, an ephemeral-infrastructure plus per-test data-isolation setup, and representative tests that assert on observable outcomes.
+## When to use this skill
+- A bug shipped despite a green unit suite because the suite mocked the very collaborator that broke — a wrong column name, a missing migration, a JSON field that serializes differently than the mock returned.
+- The wiring or contract *between* layers fails (handler doesn't pass the tenant id to the repo; a queue message round-trips with the wrong shape) and no test exercises the layers together.
+- The E2E suite is too slow or flaky to run on every PR, so cross-layer regressions are caught late, in staging or prod.
+- You're standing up a new service and want a fast, real-infrastructure test for the persistence/messaging path before there's anything to E2E.
+## Instructions
+1. **Choose the seam deliberately — name what's inside the slice and what's outside.** Don't test "the whole app" and don't test one function; pick a coherent slice with real boundaries: handler→service→repository→**real DB**, or producer→**real broker**→consumer, or service→**real HTTP** of your own framework. State the entry point (the call that drives the test) and the exit boundary (the real collaborator whose effect you assert). Everything between them runs for real, unmocked; that is the integration you're proving works.
+2. **Use REAL infrastructure via ephemeral instances — not a mock of it.** Run the actual database, broker, or cache the slice talks to, spun up disposably: **Testcontainers** (a throwaway Postgres/MySQL/Kafka/Redis container per suite), a disposable Docker service, an in-process real engine (embedded Postgres, an in-memory SQLite *only if prod is SQLite*), or a local broker (an embedded Kafka/Redpanda, LocalStack for SQS). Run your real migrations against it on startup. A mocked DB test proves the mock returns what you told it to; only a real instance proves your query compiles, your migration applied, and your row maps back to your object.
+3. **Stub ONLY the truly external and uncontrollable.** Third parties you don't own and can't run locally — a payment processor, an email/SMS gateway, a partner API, a clock, a random source — get stubbed (or pointed at a fake server like WireMock / a captured-fixture HTTP mock). Drawing the line here, not at your own DB/queue, is the whole discipline: stub what you can't control or can't make deterministic; run everything you own for real.
+4. **Make every test hermetic and isolated — own your data, depend on no other test.** The top source of integration flake is shared mutable state across tests. Pick one isolation strategy and hold it: **transaction-per-test** (open a transaction in setup, run the test, roll back in teardown — fastest, but breaks if the code under test commits or needs its own connection); **unique data per test** (every row keyed by a per-test tenant/run id so concurrent tests never collide); or **truncate/reset between tests** (clean tables in teardown — simplest, slower). Each test seeds exactly the data it reads. No test may rely on data left by another or on running in a particular order.
+5. **Pay the slow cost once, not per test.** Starting a container or applying migrations is seconds; doing it per test makes the suite unrunnable. Spin infra up **once per suite/session** (a session-scoped fixture: `pytest` session fixture, JUnit `@Container static`, a global setup) and reuse it; reset only the *data* between tests (step 4), which is milliseconds. Keep the integration suite a separate, taggable target from the unit suite so it can run on its own cadence and developers still get a fast unit loop.
+6. **Assert observable outcomes, not internal calls.** Verify what actually happened at the real boundary: the row that now exists in the DB (query it back), the HTTP status and body the handler returned, the message that landed on the queue, the record that did *not* get written on a rollback path. Do not assert `repository.save was called once` — that's a mock-interaction check masquerading as integration coverage, and it passes even when the save silently failed. Cover the failure and edge paths too (constraint violation, conflicting concurrent write, retry on a dropped message), because those are precisely what unit mocks can't reproduce.
+> [!WARNING]
+> Mocking the database or queue inside an "integration" test defeats the entire purpose — you are testing the mock's configuration, not the integration. A `when(repo.find(...)).thenReturn(...)` test never runs your SQL, never catches a renamed column, a broken migration, or a NULL-handling bug. If the collaborator is yours to run, run a real ephemeral instance; if it isn't yours (a payment API), that's a stub *and a separate contract test* — see `contract-test-designer`.
+> [!WARNING]
+> Integration tests that share one database without per-test isolation become order-dependent and flaky: a test passes alone, fails in the suite, and fails differently in parallel, because it sees rows another test wrote (or expected rows another test deleted). Isolate data per test (transaction rollback or a per-test run id) before adding more tests, or the flake compounds until the suite gets disabled.
+## Output
+For the chosen slice, the skill produces:
+- **The seam** — the entry point that drives the test and the exit boundary whose effect is asserted, with everything in between named as in-slice (real).
+- **Real vs. stubbed, with the reason** — a short table: each collaborator marked REAL (ephemeral instance, how it's provisioned) or STUBBED (why it's uncontrollable, what fake stands in).
+- **The infra + isolation setup** — how the real instance is spun up once per suite (Testcontainers / disposable service / embedded engine), how migrations are applied, and the per-test data-isolation strategy (transaction rollback / unique run id / truncate).
+- **Representative tests** — happy path plus the failure/edge paths mocks can't reach, each asserting an observable outcome at the real boundary.
+Example — a service+repository slice against a real Postgres, in Python (pytest + Testcontainers), data isolated by transaction rollback:
+```python
+import pytest
+from testcontainers.postgres import PostgresContainer
+from sqlalchemy import create_engine, text
+from app.orders import OrderService  # entry point of the slice
+# Spin the REAL database ONCE per session, run real migrations against it.
+@pytest.fixture(scope="session")
+def engine():
+    with PostgresContainer("postgres:16") as pg:
+        eng = create_engine(pg.get_connection_url())
+        run_migrations(eng)            # the actual migrations, not a hand-built schema
+        yield eng
+# Isolate every test: open a transaction, hand it to the service, roll back after.
+@pytest.fixture
+def db(engine):
+    conn = engine.connect()
+    tx = conn.begin()
+    yield conn
+    tx.rollback()                      # nothing persists; tests can't see each other's rows
+def test_place_order_persists_row(db):
+    svc = OrderService(db)             # real service -> real repository -> real Postgres
+    order_id = svc.place_order(sku="widget", qty=3)
+    # Assert the OBSERVABLE outcome: the row exists with the right state.
+    row = db.execute(text("SELECT qty, status FROM orders WHERE id = :id"),
+                     {"id": order_id}).one()
+    assert (row.qty, row.status) == (3, "open")
+def test_place_order_rejects_negative_qty_and_writes_nothing(db):
+    svc = OrderService(db)
+    with pytest.raises(ValueError):
+        svc.place_order(sku="widget", qty=-1)   # path a mocked repo would never exercise
+    count = db.execute(text("SELECT count(*) FROM orders")).scalar()
+    assert count == 0                            # the failed write left no partial row
+```
+The negative-qty test is the kind a mocked repository can't reach — it proves the real `CHECK`/validation prevents a partial write, against the real schema. Hand the seam to `test-scaffolder` to flesh out the remaining paths, use `mock-data-factory` to build the per-test seed data, and for the third parties you stubbed here, write a `contract-test-designer` test so their real shape stays pinned.

package/content/skills/model-router-designer.md ADDED Viewed

@@ -0,0 +1,39 @@
+---
+name: "model-router-designer"
+description: "Design a model router that sends each LLM request to the cheapest model that can handle it and escalates only the hard cases to the strongest — cutting cost and latency without tanking quality, gated by an eval set so the savings don't come from silently worse answers. Use when one expensive model serves all traffic (most of it easy), when LLM cost or latency is too high, or when balancing quality against spend across a range of request difficulty."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Serving 100% of traffic with your most capable model means paying frontier prices for the 70% of requests a smaller model would have nailed. A model router fixes that by routing each request to the cheapest model that can handle it and escalating only the genuinely hard cases — but routed blind, it trades cost for silent quality regressions on exactly the requests that needed the strong tier. This skill designs the router as a measured system: segment the traffic, pick the cheapest signal that separates it, build an escalation path for the misses, and gate the whole thing on an eval set so you can prove the savings are real.
+## When to use this skill
+- One expensive model answers all requests and most of them are obviously easy (lookups, formatting, short classifications) — you're overpaying on the majority.
+- LLM cost or p95 latency is too high and you want to shed both without a blanket model downgrade that would hurt the hard cases.
+- Traffic spans a real difficulty range — trivial extraction up through multi-step reasoning — and you want to spend strong-model budget only where it changes the answer.
+- You already tried "just use the cheaper model everywhere" and quality dropped on the hard tail.
+> [!NOTE]
+> Routing only pays off when a meaningful share of traffic is genuinely easy. If nearly every request needs the strong model, a router adds decision cost and complexity for almost no saving — segment first (step 1) and confirm the easy slice exists before building anything.
+## Instructions
+1. **Segment the traffic by difficulty before touching code.** Pull a representative sample of real requests (or read the logs/handlers with `Grep`/`Glob`) and bucket them into three tiers: (a) **mechanical** — classification, extraction, fixed-format transforms, short factual lookups; (b) **moderate** — straightforward Q&A, summarization, single-step reasoning; (c) **hard** — multi-step reasoning, code generation, ambiguous or long-context tasks. Estimate the volume share of each. If tier (a)+(b) isn't a sizable fraction, stop — the router won't earn its keep. This split is the spec for everything downstream.
+2. **Pick the cheapest routing signal that separates the tiers — in this order.** Reach for the lowest-cost signal that works and stop there: (1) **free heuristics** — the task type/endpoint the request came through, input token length, a required-capability flag (needs JSON mode, needs tools, needs vision, needs long context), presence of code; (2) **a lightweight classifier** — a small fast model or a trained text classifier that labels difficulty, when heuristics can't cleanly separate; (3) **an LLM-based router** — only when neither of the above can tell easy from hard. The router runs on every request, so its cost and latency are pure overhead — never let the router cost more than it saves.
+3. **Set explicit thresholds, not vibes.** Turn the signal into concrete rules: e.g. *length < 500 tokens AND task ∈ {classify, extract} → cheap tier*; *needs-tools OR length > 8k tokens → strong tier*. Write the thresholds down with the segmentation they came from so they're auditable and tunable, not buried in an `if`-ladder no one can reason about.
+4. **Design the escalation/fallback cascade so easy wins stay cheap and hard cases still get quality.** Default-route to the cheap tier, then run a **validation check** on its output — a confidence signal, a schema/format validation, a "did it actually answer / did it say it's unsure" check, or a cheap self-grade. On failure, **retry the same request on the strong tier** (a cascade). This way the easy majority is served at cheap-tier price in one hop, and only the cases the cheap model fumbles pay for the strong model — capturing most of the saving without eating the quality hit. Decide the validation check per task: structured outputs get schema validation for free; open-ended generation needs a confidence or self-grade signal.
+5. **Choose the tiers concretely.** Default the **strong tier** to the latest, most capable Claude model (`claude-opus-4-8`) and the **cheap tier** to a smaller, faster model (`claude-haiku-4-5`); a mid model (`claude-sonnet-4-6`) is a reasonable middle rung if a two-step cascade leaves a gap. Use exact model ID strings — never construct or date-suffix them. Add **always-route-strong guardrails** for high-stakes paths (anything irreversible, safety-relevant, or where a wrong answer is expensive) regardless of what the signal says.
+6. **Measure the trade with an eval set — per route, not just in aggregate.** Build (or reuse) a labeled eval set spanning all three difficulty tiers and score three things on every route: **cost**, **latency**, and a **quality metric** (task accuracy, schema-valid rate, judge score — whatever fits the task). Track cheap-route quality, strong-route quality, escalation rate, and the blended numbers separately. The router is only a win if blended cost and latency drop *and* cheap-route quality stays above your bar. If cheap-route quality sags, tighten the threshold or move that segment to the strong tier.
+> [!WARNING]
+> Routing too much to the cheap model silently degrades quality on the cases that needed the strong one — and aggregate metrics hide it because the easy majority looks fine. Never route blind: gate every threshold change against the per-route eval set and keep the escalation check honest. A router with no quality measurement is just a quality regression you haven't noticed yet.
+> [!WARNING]
+> An LLM-as-router adds its own latency and token cost on EVERY request, including the easy ones a heuristic would have caught for free. If a task-type check, an input-length cutoff, or a small classifier separates the traffic, use that — reserve the LLM router for the cases where simpler signals genuinely can't, and confirm it still nets a saving end to end.
+## Output
+A model-routing design, written down so it's tunable:
+- **Difficulty segmentation** — the three tiers with their defining traits and estimated volume share, plus the go/no-go call on whether a router is worth building.
+- **Routing signal + thresholds** — which signal (heuristic / small classifier / LLM router) and why it's the cheapest that works, with the concrete cutoff rules and the segmentation they derive from.
+- **Escalation/fallback cascade** — the default cheap route, the validation check per task type, and the retry-on-strong path, including any always-route-strong guardrails for high-stakes requests.
+- **Tier choice** — the strong and cheap model IDs (default `claude-opus-4-8` / `claude-haiku-4-5`, optional `claude-sonnet-4-6` middle rung) and the rationale.
+- **Validation metrics** — the eval set composition and the per-route cost / latency / quality numbers (with escalation rate) that prove the router cut spend and latency without dropping quality below the bar.

package/content/skills/onboarding-guide-writer.md ADDED Viewed

@@ -0,0 +1,84 @@
+---
+name: "onboarding-guide-writer"
+description: "Write a developer onboarding guide that gets a new contributor from clone to first merged change fast — a verified golden path, a quick architecture map, the real workflow conventions, and the gotchas that live only in senior engineers' heads. Use when a repo has no onboarding doc, when new hires keep asking the same setup questions, or when the README is a marketing page instead of a contributor guide."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Write the doc a new contributor opens on day one and uses to ship their first change by lunch. The center of gravity is the **golden path**: the exact, copy-pasteable sequence from `git clone` to a trivial verified change — every command grounded in the repo's real scripts and tooling, not invented `make` targets. Around it sit a quick architecture map (where to look, not a spec), the workflow conventions that gate a PR, and the troubleshooting that currently lives only in tribal knowledge. Deeper material is linked, never duplicated, so the guide stays true as the code moves.
+## When to use this skill
+- A repo has no onboarding/CONTRIBUTING doc and new contributors reverse-engineer setup from CI configs and Slack threads.
+- New hires repeatedly ask the same setup questions (which Node version, what env vars, why does the build fail the first time).
+- The README is marketing prose — what the product does — rather than how a developer runs and contributes to it.
+- Onboarding currently means a senior engineer pairing for two hours to get someone to a passing test suite.
+## Instructions
+1. **Reconstruct the golden path from real tooling — verify every command exists.** Read the manifest that exists (`package.json` scripts/`engines`, `Makefile` targets, `pyproject.toml`, `go.mod`, `Justfile`, `Taskfile.yml`) and the lockfile to pick the package manager. Read CI config (`.github/workflows/*.yml`, `.gitlab-ci.yml`) — CI is the ground truth for the steps that actually pass. Build the path in execution order: clone → install deps → set up env/config → run locally → run tests → make a trivial change and verify it. Quote each command verbatim from a script that exists; if a step has no backing script, say so explicitly rather than inventing one.
+2. **Surface the prerequisites a fresh machine actually needs.** Pin the runtime version (from `engines`, `.nvmrc`, `.tool-versions`, `go.mod`, `python_requires`) and any system deps (a database, Docker, a specific package manager). List them before the install step — a clone that fails on a missing Postgres is the most common day-one wall.
+3. **Handle env and config concretely.** Find `.env.example` / `.env.sample` / `config.example.*`. Tell the contributor to copy it (`cp .env.example .env`) and call out which variables must be filled to run locally versus which have working defaults. Name the ones that need a secret or a teammate to provide — that is the question that otherwise hits Slack.
+4. **Prove the setup with a trivial verified change.** End the golden path with a concrete, reversible first change — flip a string, add a log line, fix a typo — then the exact command that confirms it (the dev server reloads, a test passes, the page shows the new text). This is what turns "I think it's set up" into "it works." Don't skip it: it's the difference between an install guide and an onboarding guide.
+5. **Write a brief architecture orientation — a map, not a spec.** Glob the top-level layout and name where the entry points are, how the main pieces fit (request → handler → data, or CLI → command → core), and where a newcomer should look first for a given task. Then list the **3–5 things that would surprise a newcomer**: the non-obvious build step, the directory that isn't what its name implies, the generated file you must never hand-edit. Keep it to a screen; point to deeper design docs for the rest.
+6. **Document the real workflow conventions.** Extract them from evidence, not assumption: branch naming (from existing branches / contributing notes), commit and PR style (from `.gitmessage`, PR template, recent history), how to run lint and typecheck (the real script names), and how CI gates a PR (which checks are required, from the workflow files). A contributor needs to know what will block their merge before they open the PR, not after.
+7. **Capture the tribal-knowledge gotchas and troubleshooting.** Write down the fixes that live in senior engineers' heads: the first build that fails until you run a generate step, the test that's flaky on certain OSes, the port that conflicts, the cache you clear when things go weird. Format as symptom → fix so a stuck contributor can scan to their error.
+8. **Link to deeper docs instead of duplicating them.** For anything with a canonical home — full architecture docs, API reference, ADRs, deployment runbooks — link to it in one line. Duplicated detail is detail that will silently go stale; a link stays correct or visibly 404s.
+9. **Order for action and skim.** Golden path first (it's what they need in the next five minutes), then architecture, conventions, troubleshooting, links. Lead each section with the action. Save it as `CONTRIBUTING.md` or `docs/onboarding.md` per the repo's convention, and report which commands you verified against real scripts and which you flagged as unverified.
+> [!WARNING]
+> An onboarding guide whose setup commands don't actually work is worse than no guide — it burns the new contributor's trust on day one and makes them distrust every other line in the doc. Verify each command against a script that exists in the repo. Never paste a `make dev` or `npm run setup` you haven't confirmed.
+> [!WARNING]
+> Do not re-explain the architecture in depth here. Detailed design that belongs in code comments, ADRs, or a design doc is guaranteed to drift once it's copied into onboarding. Give the orientation map and link to the canonical source.
+## Output
+A drop-in `CONTRIBUTING.md` (or `docs/onboarding.md`), structured for action:
+````md
+# Contributing
+## Golden path: clone → first change
+**Prerequisites:** Node 20 (`.nvmrc`), pnpm 9, Docker (for the local DB).
+```bash
+git clone git@github.com:acme/taskflow.git && cd taskflow
+pnpm install                 # lockfile: pnpm-lock.yaml
+cp .env.example .env         # fill DATABASE_URL — ask #eng for the dev value
+docker compose up -d db      # local Postgres on :5432
+pnpm db:migrate              # apply schema
+pnpm dev                     # http://localhost:3000
+pnpm test                    # vitest — should be all green before you start
+```
+**Your first change:** edit the heading in `src/app/page.tsx`, save —
+the dev server hot-reloads and the new text shows at `localhost:3000`.
+That confirms your setup end to end.
+## How the code fits
+- Entry points: `src/app/` (routes), `src/server/` (API handlers), `prisma/` (schema).
+- Flow: route → handler in `src/server/` → Prisma → Postgres.
+- Surprises for newcomers:
+  - `pnpm db:generate` must run after editing `prisma/schema.prisma` — the client is generated, never hand-edited.
+  - `src/lib/legacy/` is frozen; new code goes in `src/lib/`.
+  - The first `pnpm build` after install fails unless `pnpm db:generate` has run.
+## Workflow
+- Branch: `feat/<short-desc>` or `fix/<short-desc>` off `main`.
+- Commits: Conventional Commits (`.gitmessage`); PRs use the template.
+- Before pushing: `pnpm lint && pnpm typecheck`.
+- CI gates merge on: lint, typecheck, `vitest`, and a preview deploy.
+## Troubleshooting
+- `ECONNREFUSED 5432` → `docker compose up -d db` isn't running.
+- `Prisma client not generated` → `pnpm db:generate`.
+- Port 3000 in use → `pnpm dev -- --port 3001`.
+## Deeper docs
+- Architecture & design decisions → `docs/architecture.md`, `docs/adr/`
+- Deploy & on-call → `docs/runbooks/`
+````
+Every command above is quoted from a real script; the report lists exactly which were verified against the repo and which (if any) were flagged unverified for the maintainer to confirm.