npm - agentscamp - Versions diffs - 0.4.0 → 0.6.0 - Mend

agentscamp 0.4.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33) hide show

package/README.md +2 -2
package/content/manifest.json +423 -2
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/circular-dependency-breaker.md +48 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/commit-splitter.md +54 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/dashboard-designer.md +38 -0
package/content/skills/deadlock-diagnoser.md +45 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/feature-flag-retirer.md +44 -0
package/content/skills/flamegraph-analyzer.md +35 -0
package/content/skills/git-blame-investigator.md +34 -0
package/content/skills/graphql-schema-designer.md +49 -0
package/content/skills/hallucination-evaluator.md +40 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/integration-test-designer.md +81 -0
package/content/skills/model-router-designer.md +39 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/onboarding-guide-writer.md +84 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/rbac-designer.md +82 -0
package/content/skills/release-notes-writer.md +78 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/content/skills/web-vitals-optimizer.md +34 -0
package/package.json +1 -1

package/content/skills/agent-trajectory-evaluator.md ADDED Viewed

@@ -0,0 +1,59 @@
+---
+name: "agent-trajectory-evaluator"
+description: "Evaluate a multi-step AI agent's whole run — tool calls, intermediate steps, and final result — not just final-answer correctness, so you can pinpoint WHERE it went wrong. Use when building or debugging a tool-using or multi-step agent, when final-answer-only evals can't explain failures, or when a prompt/model change quietly makes the agent less efficient or more error-prone even though the answer still looks right."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Final-answer evals tell you the agent failed; they don't tell you *where*. An agent that returns the right number might have called the wrong tool first, looped on a flaky API, or stumbled into the answer through a path that collapses on the next input. This skill makes the agent's **process** inspectable: capture the full trajectory — every decision, tool call, argument, and result — then score it on the axes that actually predict failure, asserting what's checkable and judging only what isn't.
+## When to use this skill
+- You're building or debugging a tool-using / multi-step agent and a final-answer eval says "wrong" without saying why.
+- A prompt or model change kept the answers correct but you suspect the agent got slower, looped more, or recovers worse — and you need to prove it.
+- You're adding a new tool and want to confirm the agent selects it correctly instead of brute-forcing with the old one.
+- Failures are intermittent and you can't tell whether the agent is fragile (lucky path) or robust (sound path).
+## Instructions
+1. **Capture the full trajectory as a structured, replayable log — one record per step.** Final-answer-only logging is the root cause of un-diagnosable failures. Each step records: the model's decision (the assistant turn, including thinking-block summaries if present), the tool called and its exact arguments, the raw tool result (success/error), and any externalized state (files written, working dir, retry count). Use a stable schema so two runs diff cleanly:
+   ```json
+   {"run_id": "...", "task_id": "...", "step": 3,
+    "decision": "call search_orders to find the open order",
+    "tool": "search_orders", "args": {"customer_id": "C-118", "status": "open"},
+    "result": {"ok": true, "rows": 2}, "is_error": false,
+    "latency_ms": 410, "state": {"retries": 0}}
+   ```
+   Pull this from your agent loop's tool-call records (or the Managed Agents event stream: `agent.tool_use` / `agent.tool_result` / `agent.custom_tool_use` events carry tool name, input, and result). Persist trajectories to disk so a baseline run is a diffable artifact, not a console scroll-by.
+2. **Build a fixed, version-controlled eval set of representative tasks — and deliberately include trap tasks.** A good set has three buckets: (a) routine tasks the agent should handle cleanly, (b) tasks that *require* tool use (the answer isn't in the prompt, so the agent must select and call the right tool), and (c) tasks engineered to trip a known failure mode — a tool that returns an error on the first call (does it recover?), an ambiguous request (does it loop?), a distractor tool that looks relevant but is wrong (does it mis-select?). Pin the set; an eval set that drifts can't catch regressions. Each task carries its expected trajectory assertions (next step).
+3. **Score every trajectory on five axes, not one.** Final-answer correctness is necessary but insufficient. For each task, evaluate:
+   - **Tool selection** — did it call the right tool for each sub-goal? (mis-selection often produces a right answer via a wrong, slow path)
+   - **Argument correctness** — were the tool arguments right? (a `status: "open"` typo'd to `status: "all"` can still return the target row by luck)
+   - **Step efficiency** — did it stay within a step budget, or did it repeat calls, loop, or take a needless detour? Measure against a per-task budget, not a global one.
+   - **Error recovery** — when a tool returned an error, did the agent recover sensibly (retry once, switch approach) or thrash / give up?
+   - **Goal completion** — did it actually finish the task, distinct from "the final text looks plausible"?
+4. **Split scoring into programmatic assertions and a narrow LLM-judge — assert everything you can.** An LLM-judge over a whole trajectory is noisy and expensive, and it will rationalize a broken path. So check the deterministic axes with code: exact tool-name assertions, argument equality (or schema match), and step-count budgets are all plain comparisons against the trajectory you captured.
+   ```python
+   tools = [s["tool"] for s in trajectory]
+   assert tools[0] == "search_orders", f"wrong first tool: {tools[0]}"
+   assert trajectory[0]["args"]["status"] == "open"
+   assert len(trajectory) <= task["step_budget"], f"{len(trajectory)} steps > budget"
+   assert not any(s["is_error"] for s in trajectory[-2:]), "ended on an error"
+   ```
+   Reserve the LLM-judge for the genuinely subjective steps only — "was this reasoning step sound given the prior result?", "was this summary faithful to the tool output?" — and judge **one step at a time** with the step's inputs in context, not the entire run. Default both the agent-under-test and the judge to the latest, most capable Claude model (`claude-opus-4-8`); use a *different* sample or framing for the judge so it isn't grading its own twin, and keep the judge's rubric to one criterion per call.
+5. **Diff every candidate trajectory against a stored baseline and report the regressions.** This is what catches the silent ones. After a prompt or model change, re-run the fixed eval set and compare trajectory-for-trajectory against the baseline: tools added/removed/reordered, argument changes, step-count delta, new error-recovery loops, latency delta. A change that keeps the final answer correct but adds two steps, introduces a retry loop, or swaps a precise tool for a brute-force one is a **regression** — surface it even though the answer still passes. Promote a candidate to the new baseline only when the diff is empty or every change is reviewed and intended.
+> [!WARNING]
+> Grading only the final answer hides process failures. An agent can reach the right answer through a path that is broken, expensive, or lucky — wrong tool, redundant loop, a crash it recovered from by chance — and that path will break on the very next input. The final answer being correct is *not* evidence the agent worked correctly.
+> [!WARNING]
+> An LLM-judge over a whole trajectory is noisy and tends to rationalize whatever path it sees. Assert the checkable steps — tool names, argument values, step counts — with code, and give the judge exactly one subjective step and one criterion at a time. A judge asked "was this whole run good?" will hand-wave; a judge asked "was *this* summary faithful to *this* tool output?" gives a usable signal.
+## Output
+- **Trajectory schema** — the per-step record (decision, tool, args, result, is_error, latency, state) and where each field comes from in your agent loop or event stream.
+- **Per-axis rubric** — the five axes (tool selection, argument correctness, step efficiency, error recovery, goal completion) with the concrete check for each task.
+- **Assertion-vs-judge split** — the deterministic assertions written as code, and the short list of subjective steps routed to a single-criterion LLM-judge (agent and judge both on `claude-opus-4-8`).
+- **Baseline-diff regression report** — a per-task diff of the candidate run against the stored baseline (tools reordered/added/removed, arg changes, step-count and latency deltas, new recovery loops), flagging every regression even where the final answer still passes, plus a verdict on whether to promote the candidate to baseline.

package/content/skills/alerting-rules-tuner.md ADDED Viewed

@@ -0,0 +1,49 @@
+---
+name: "alerting-rules-tuner"
+description: "Cut alert noise and make every page mean something — rewrite alerting rules to fire on user-felt symptoms (error rate, latency SLO burn, failed requests) instead of causes (high CPU, full disk), with duration windows and severity routing so only urgent, actionable conditions reach a human. Use when on-call is fatigued by low-value pages, when real incidents get missed in the noise, or when alerts fire on causes rather than impact."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+On-call exhaustion is rarely an "alert quantity" problem you fix by muting things — it's an *altitude* problem. Pages fire on causes (a node at 95% CPU, a disk at 80%, a saturated thread pool) that may or may not hurt anyone, instead of on symptoms the user actually feels. This skill audits every rule against one question — *does this fire only when a human must act now?* — then rewrites the survivors to alert on symptoms with duration windows and severity routing, and demotes the rest to dashboards or tickets.
+## When to use this skill
+- On-call is fatigued: frequent pages that resolve themselves or need no action, night pages for non-urgent conditions.
+- Real incidents get missed because they're buried under low-value noise, or everyone has muted the channel.
+- Alerts fire on causes (CPU, memory, disk, queue depth, pod restarts) rather than user impact.
+- One incident generates a storm of 50 correlated pages instead of one.
+- You have alerts with no owner and no runbook — nobody knows what to do when they fire.
+- Standing up alerting for a new service and want to start symptom-first instead of bolting on host metrics.
+## Instructions
+1. **Inventory the rules and classify each as symptom or cause.** Grep the alerting config (`*.yml`/`*.yaml` Prometheus rules, Datadog monitor exports, Grafana alert JSON, Alertmanager routes) for every rule that pages a human. For each, label it: **symptom** (something the user experiences — request errors, latency, failed checkouts, SLO burn) or **cause** (a resource or internal metric — CPU, memory, disk, GC pause, replica lag, restart count). Causes belong on dashboards, not pagers.
+2. **Audit every paging rule with the single question.** For each rule ask: *does this fire only when a human must act, right now, with a clear action?* If the honest answer is "no" — it self-heals, it's informational, there's nothing to do at 3am — it is not a page. Downgrade it to a ticket or a dashboard panel. Keep paging only what's both urgent and actionable.
+3. **Define the symptom alert set at the user boundary.** Replace cause-pages with the symptoms they were trying to predict: request error rate (5xx / total), latency at a percentile that matters (p99 over SLO), failed business transactions (checkout/login failures), and SLO error-budget burn rate. Measure these where the user is — at the load balancer / ingress / API edge — not deep inside one component.
+4. **Add a duration window to every threshold.** No paging alert fires on an instantaneous value. Require the condition to hold `for: 5m` (tune per alert) so a single scrape blip or a 10-second spike clears itself. For graceful detection of both sudden outages and slow leaks, prefer multi-window, multi-burn-rate alerts (e.g. fast: 14.4x burn over 5m + 1h; slow: 6x over 30m + 6h) over a single fixed threshold.
+5. **Alert on rate-of-change / burn, not raw levels, where the level is naturally noisy.** "Disk is 80% full" pages constantly and means nothing; "disk will fill within 4 hours at the current fill rate" is actionable and rarely false. Same for error budgets: page on burn rate, not on a single bad minute.
+6. **Assign exactly one severity per rule and route accordingly.** Use three tiers and wire each to a destination: **page** (human-impacting, urgent, actionable → PagerDuty/Opsgenie, wakes someone), **ticket** (needs attention this week, not now → issue tracker), **info** (awareness only → Slack/dashboard, never pages). The default for anything you're unsure about is *not* page.
+7. **Deduplicate and group correlated alerts into one notification.** One incident must produce one page, not fifty. Group by incident dimension (service, cluster, region) in Alertmanager `group_by` / Datadog grouping, set `group_wait`/`group_interval` so the storm coalesces, and add inhibition rules so a parent symptom (whole service down) suppresses the child causes (every dependent check failing).
+8. **Attach an owner and a runbook link to every surviving alert.** Each paging rule gets an owning team (label/tag) and a `runbook_url` annotation pointing at concrete steps — first checks, dashboards, mitigation, escalation. If you can't write a runbook because there's no clear response, that's the signal the alert shouldn't page.
+> [!WARNING]
+> Paging on causes — CPU, memory, disk, queue depth — instead of user-felt symptoms is the single largest source of alert fatigue. A box can run hot all day while users are perfectly happy; a box can look idle while requests fail. Page on the symptom; keep the cause on a dashboard for when you're already investigating.
+> [!WARNING]
+> An alert with no runbook and no action is noise by definition. If the response to a page is "ack it and watch," it should not have woken anyone. Thresholds without a duration window flap on every transient spike — never ship a paging rule without a `for:` window.
+## Output
+A revised alerting plan, ready to apply to the config:
+- **Symptom alert set** — a table of paging alerts: name, signal (the user-facing metric), threshold + duration window (or burn-rate windows), and severity. Every row is urgent and actionable.
+- **Demoted rules** — the cause-metrics removed from paging, each annotated with where it went (dashboard panel name, or ticket-severity monitor) and why it isn't a page.
+- **Routing + dedup map** — severity → destination table, the `group_by` keys, and inhibition rules (parent symptom suppresses child causes).
+- **Ownership/runbook mapping** — for each surviving alert: owning team + `runbook_url`, flagging any alert that lacks a runbook as a candidate for deletion.

package/content/skills/canary-release-planner.md ADDED Viewed

@@ -0,0 +1,35 @@
+---
+name: "canary-release-planner"
+description: "Design a canary / progressive rollout so a bad release reaches 1% of users instead of 100% — staged traffic with bake times, gating metrics compared against the concurrently-running stable baseline, and automated promote-or-rollback. Use when shipping a risky change, when you want automatic rollback on regression, or when moving off all-at-once deploys."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+An all-at-once deploy is a single bet: CI is green, so you flip 100% of users onto new code and hope. A canary changes the bet — it routes a small, growing slice of real traffic to the new version, watches it against the version still serving everyone else, and either promotes it or rolls it back automatically. This skill produces that plan: the stages and bake times, the metrics that gate each promotion, the rollback trigger, and the data/session prerequisites that decide whether a canary is even safe for this change.
+## When to use this skill
+- You're shipping a change risky enough that a bad version reaching every user at once is unacceptable (auth, payments, a hot path, a dependency bump).
+- You want regressions to trigger an automatic rollback instead of waiting for an on-call human to notice and react.
+- You're moving a service off all-at-once / blue-green flips onto progressive delivery and need a concrete stage-and-gate plan.
+- A previous "it passed CI" deploy caused a production incident, and you want the blast radius capped before the next one.
+## Instructions
+1. **Define the rollout stages and a bake time at each.** Lay out an increasing traffic schedule — e.g. `1% → 10% → 50% → 100%` — and assign each stage a **bake time** long enough for the relevant signals to surface (cover at least one full traffic cycle for the failure mode you fear: cache fills, cron jobs, retries, a login spike). The first stage should be small enough that its failure is a non-event; the bake time, not the percentage, is what lets a slow leak (memory, connection exhaustion, a rare code path) show itself before the next promotion. Don't jump straight to 50%.
+2. **Pick the metrics that gate promotion.** Choose a small set that reflects user pain: **error rate** (5xx / failed requests), **latency percentiles** (p95/p99, never the mean — the mean hides the tail that churns users), and one or two **business/health signals** that catch silent failures the error rate won't (checkout completions, sign-ups, queue depth, a 200-with-empty-body). A deploy can be 200-OK and still be broken; the business metric is what catches that.
+3. **Set thresholds as canary-vs-baseline, not absolute.** For each gating metric, define a pass/fail rule comparing the **canary** to the **concurrently-running stable version** — e.g. "canary error rate ≤ stable + 0.5pp" and "canary p99 ≤ 1.2× stable p99." Both versions take a slice of the *same live traffic at the same time*, so time-of-day, weekday, and load differences cancel out and the only variable left is the new code.
+4. **Automate the promote-or-rollback decision.** At the end of each bake time: if every gating metric is within threshold, promote to the next stage; if any breaches, **auto-rollback** — shift 100% of traffic back to stable immediately. Make rollback fast and safe: it must be a traffic-weight change (drain the canary, don't kill in-flight requests), require no new build, and not depend on the canary being healthy enough to cooperate. A rollback that needs a redeploy is too slow to matter during an incident.
+5. **Guarantee schema compatibility across both versions.** During the rollout the old and new code hit the **same database simultaneously**. Every schema change must be backward-compatible in both directions for the duration of the canary — use **expand-contract / parallel-change** migrations: add the new column/table (expand) and deploy code that writes both, run the canary, then remove the old shape (contract) only after the new version owns 100%. Pair with `strangler-fig-migrator` for larger cutovers.
+6. **Pin session affinity so a user doesn't flip versions mid-flow.** Route by a stable key (user ID, session cookie) so a given user stays on canary *or* stable for the whole session. Without it, a user can bounce between versions between requests — half-applied multi-step flows, cache/state mismatches, and metrics that can't be attributed to either version. Affinity also makes the canary-vs-stable comparison clean.
+7. **Choose the routing dimension deliberately.** Decide whether the canary is a **percentage of traffic** (simplest, representative) or a **user segment** (internal staff → beta cohort → region → everyone) when you want known, tolerant users to absorb the first hit. Segment routing trades statistical representativeness for a friendlier blast radius — state which you chose and why.
+> [!WARNING]
+> Comparing the canary to a *historical* baseline (yesterday, last week, a stored average) instead of the stable version running right now produces false verdicts. Traffic and latency swing with time of day and day of week, so a healthy canary at peak can look "regressed" against an off-peak baseline — and a genuinely bad canary can hide inside normal variance. Always gate against the concurrently-running stable version.
+> [!WARNING]
+> A canary is unsafe when the release contains a non-backward-compatible schema change. Both versions query the same database during the rollout, so a breaking migration breaks one version no matter the traffic split. Decouple it: ship the migration as a backward-compatible expand step first, canary the code, then contract afterward.
+## Output
+A canary rollout plan containing: (1) the **stage schedule** — traffic percentages and the bake time at each, with the reason each bake time is long enough; (2) the **gating metrics** — error rate, latency percentiles, and the business/health signal(s), each with an explicit **canary-vs-baseline** pass/fail threshold; (3) the **auto-rollback trigger** — which breach forces a rollback and the (fast, build-free) mechanism that executes it; and (4) the **prerequisites** — the expand-contract schema plan confirming both versions are DB-compatible, and the session-affinity key. Reproducible: the same plan re-runs for the next release by swapping in its metrics and thresholds.

package/content/skills/circular-dependency-breaker.md ADDED Viewed

@@ -0,0 +1,48 @@
+---
+name: "circular-dependency-breaker"
+description: "Detect and break a circular import — map the exact cycle with a real tool, then break the right edge by extracting the shared piece into a leaf module, inverting a layering dependency, merging two falsely-split modules, or (last resort) deferring an import. Use when you hit an import cycle error, an undefined-on-import or 'cannot access before initialization' bug, or a bundler/linter flags a cycle."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A circular import is two or more modules that need each other to finish loading before either can finish loading — so one of them gets a half-built version of the other, and you get an `undefined` export, a `cannot access X before initialization`, or a bundler warning that surfaces "randomly" depending on which file ran first. This skill refuses to guess: it maps the exact cycle with a real dependency tool, identifies *which edge* is the wrong one, breaks it with the technique that matches the cause, and re-runs the tool to prove the cycle is gone.
+## When to use this skill
+- An import throws `cannot access '<x>' before initialization`, `ReferenceError`, or an export reads as `undefined` even though it is clearly exported.
+- A bundler (webpack/Vite/Rollup/esbuild), a linter (`import/no-cycle`), `madge --circular`, `import-linter`, or `go vet` flags a circular dependency.
+- A value works in one entry order and breaks in another — tests pass alone but fail in a suite, or prod breaks while dev works, because module load order differs.
+- You are about to "fix" a crash by moving an import inside a function and want to know whether that hides the real problem (it does).
+## Instructions
+1. **Map the cycle with a tool before changing one line.** Do not infer the cycle from the stack trace — the trace shows where it *crashed*, not which edge to cut. Run the right tool for the stack: JS/TS `npx madge --circular --extensions ts,tsx src` or `npx dpdm --circular src/index.ts`; Python `import-linter` (with a `[importlinter]` contract) or `pydeps --show-cycles pkg`; Go `go list -deps` / `go mod graph`; or read the bundler's own circular-dependency warning. Capture the full ordered chain, e.g. `auth → user → session → auth`, so you are fixing a real edge.
+2. **Find the one edge that is wrong.** A cycle has N edges but usually one of them is the design mistake — a lower-level module reaching back up to a higher-level one, or two leaf-ish modules each grabbing one symbol from the other. With `Grep`, list *exactly which symbols* each module imports from the next in the chain. The edge to break is the one importing the fewest, most-extractable symbols — often a single shared type, constant, or helper.
+3. **Prefer extracting the shared thing into a leaf module — this is the cleanest fix and the most common cause.** If A and B both need a type, constant, or pure helper that currently lives in one of them, move that symbol into a new dependency-free module (`types.ts`, `constants.ts`, `shared/`) that both A and B import *from*, and which imports from neither. The cycle dissolves because the contested symbol no longer lives on the cycle. Update every importer with `Edit`.
+4. **Invert the dependency when there is a true layering violation.** If a lower-level module imports a higher-level one only to call back into it (e.g. a storage layer importing a service to notify it), apply dependency inversion: define the interface/type at the *lower* module (it owns the contract), and have the caller inject the concrete implementation as an argument or via a registration call. The lower module now depends on nothing above it; the arrow points one way.
+5. **Merge the two modules if they are genuinely one unit.** If A and B call deep into each other through many symbols and neither has a coherent identity without the other, they were split artificially. Combine them into one module and re-export from the old paths as a barrel so external callers stay green. A cycle between two files that are really one concept is a packaging bug, not a dependency to invert.
+6. **Defer the import only as a last resort — and say so out loud.** Moving `import` inside the function that uses it (lazy/local import, `require()` at call time, or a TYPE_CHECKING-only import in Python) makes the crash stop because the import now runs after both modules finished loading. It does not remove the cycle — `madge` will still report it. Use it only when the real fixes are blocked (e.g. a third-party constraint), and flag it explicitly as deferring a known design smell.
+7. **Re-run the same tool and check import-time side effects.** Re-run the step-1 command and confirm the cycle no longer appears in its output — that is your proof, not "the crash went away." Then verify nothing relied on import-time side effects whose order you just changed: a module that registered a handler, populated a singleton, or ran top-level code now runs in a new order. Search for top-level statements (not inside a function/class) in the moved code and confirm they still fire when expected.
+> [!WARNING]
+> A lazy/deferred import "fixes" the crash but leaves the architectural cycle fully in place — the next person hits the same partially-initialized-module bug from a different entry point. Treat it as a tourniquet, not a cure. Always reach for extracting the shared dependency (step 3) or inverting the layer (step 4) first; only defer when those are genuinely blocked, and label it as a deferral.
+> [!NOTE]
+> The bug is in the import graph, not the stack trace. `cannot access X before initialization` points at the line that *read* the half-built module, which is rarely where the cycle should be cut. Map the graph first (step 1) — the right edge to break is almost never the one the error names.
+## Output
+1. **The dependency cycle diagram** — the exact ordered chain from the tool, annotated with the symbols crossing each edge:
+   ```
+   auth.ts ──(needs SessionToken)──▶ session.ts
+      ▲                                   │
+      └──────(needs currentUser)──────────┘
+   Cycle: auth → session → auth   (madge --circular)
+   ```
+2. **The chosen break technique with rationale** — e.g. "Extract `SessionToken` (a type, the only symbol `session` takes from `auth`) into `auth/types.ts` leaf; both import from it. Chosen over deferral because the cycle is a misplaced shared type, not a real layering need."
+3. **The concrete import/module changes** — the new/edited files and every `import` line that moved, as applied edits (new leaf module created, contested symbol relocated, importers re-pointed).
+4. **Proof the cycle is gone** — the re-run of the step-1 command showing no cycle, e.g. `madge --circular src` → `✔ No circular dependency found!`, plus a one-line confirmation that any import-time side effects in the moved code still execute in the right order.

package/content/skills/cold-start-optimizer.md ADDED Viewed

@@ -0,0 +1,83 @@
+---
+name: "cold-start-optimizer"
+description: "Cut cold-start latency for serverless functions and slow-booting apps by measuring the init breakdown, then attacking the dominant phase — artifact size, eager imports, eager connections, or under-provisioned memory — instead of reflexively buying provisioned concurrency. Use when serverless p99 spikes on the first request, when a function times out during init, or when scale-to-zero is hurting user-facing latency."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A cold start is not one number — it is runtime boot, dependency/module load, framework init, and first-connection setup stacked on top of each other, and you are usually optimizing a guess about which one dominates. This skill makes it measured: split the init into phases, find the phase that actually costs you, and attack *that* — shrink the artifact and lazy-load the heavy deps off the first-request path, hoist one-time work to module scope so warm invocations reuse it, right-size memory (more CPU often means a *faster and cheaper* cold start), and reuse connections across invocations instead of opening a fresh one every cold start. Provisioned concurrency / keep-warm is the last resort for genuinely latency-critical paths, not the first reflex — because it bills you to mask a slow init rather than fixing it.
+## When to use this skill
+- Serverless p99 (or p999) spikes on the first request after a quiet period, while warm requests are fast.
+- A function intermittently times out *during init* — before your handler code even runs.
+- Scale-to-zero or aggressive autoscaling is hurting user-facing latency on a path that can't tolerate a 2–5s tail.
+- You've been told to "just turn on provisioned concurrency" and want to know whether the init is fixable first (and cheaper).
+- A deploy bloated the artifact (new dependency, bundling change) and cold starts regressed.
+## Instructions
+1. **Measure the cold start and split it into phases — don't optimize a guess.** Force a cold start (deploy a new version, or wait out the platform's idle timeout) and capture the init timeline, not just the total. Most platforms expose it: AWS Lambda `INIT_START`/`REPORT` log lines (`Init Duration` is the pre-handler cost) plus X-Ray init subsegments; GCP/Cloud Run startup probe + request logs; Vercel function logs. Instrument the four phases yourself with timestamps at module load:
+   - **runtime boot** — the platform spinning up the sandbox/container and language runtime (you can't change this much, but you must know its share).
+   - **dependency/module load** — `require`/`import` of your code and its tree, top-to-bottom.
+   - **framework init** — ORM bootstrap, DI container, route table build, config parse, schema/codegen load.
+   - **first-connection setup** — DB handshakes, TLS, secret-manager fetches, warm-up calls.
+   Attribute a millisecond cost to each. You optimize the dominant phase; everything else is noise until that one shrinks.
+2. **Shrink the deployment artifact and lazy-load heavy deps off the first-request path.** A giant bundle inflates both runtime boot (more to unpack) and module load (more to parse). Tree-shake and bundle (esbuild/`@vercel/nft`/webpack) so you ship the function's actual closure, not the whole `node_modules`; exclude the AWS SDK / platform SDK that the runtime already provides; strip source maps and dev deps from the package. Then find the imports that aren't needed for the *first* request — a PDF renderer, an image library, an analytics client, a markdown engine — and move them behind a lazy `await import()` / deferred `require` inside the code path that needs them, so they never touch init. Grep the entry module for top-level imports of known-heavy packages and ask of each: does request #1 use this?
+3. **Hoist one-time work to module scope so warm invocations reuse it — but don't connect eagerly.** Config parsing, client *construction*, schema compilation, and validator building should run once at module load and be captured in module-scope variables, so the platform's instance reuse amortizes them across every warm invocation on that instance. The sharp distinction: **construct** clients at module scope, but **connect** lazily. Build the DB pool / HTTP client object at module load (cheap, no I/O); open the actual connection on first use inside the handler, and reuse it across subsequent invocations on the same warm instance. Eager top-level `await pool.connect()` adds connection latency to *every* cold start and turns a traffic burst into a connection storm.
+4. **Reuse connections across invocations via instance reuse — never open a fresh connection per cold start.** Store the connection/pool in a module-scope (or `globalThis`) variable so a warm instance hands it back instead of reconnecting. Size the per-instance pool to **1–2 connections**, not 20: each concurrent serverless instance gets its own pool, so a large per-instance pool times the instance count will blow past the database's `max_connections` under burst. For Postgres at high concurrency, point functions at a transaction-mode pooler (PgBouncer/RDS Proxy/Supabase pooler) rather than the database directly. Set a connection idle timeout shorter than the platform's instance-freeze window so dead connections don't accumulate.
+5. **Right-size memory — on many platforms it buys CPU, so more memory = faster AND cheaper cold start.** On Lambda (and similar) CPU and network scale linearly with the memory setting, and a cold start is CPU-bound (parsing, JIT, framework init). Bumping 128MB → 512MB–1GB can cut the cold start by enough that the *higher per-ms price × shorter duration* is lower total cost — the classic counter-intuitive win. Sweep a few memory settings against the same forced-cold-start workload and pick the point on the cost-vs-latency curve, don't assume the smallest tier is cheapest.
+6. **Use provisioned concurrency / keep-warm only for genuinely latency-critical paths — after init is already fast.** If a path truly can't tolerate any cold tail (checkout, auth, a synchronous user-facing API), provision N warm instances to cover baseline concurrency. But apply it last, sized to real concurrency (not a round number), and only once steps 1–5 have made the init itself fast — because provisioning a 4-second init just means you pay 24/7 to keep a slow thing warm, and any burst beyond your provisioned count still pays the full cold start.
+> [!WARNING]
+> Opening a fresh DB connection on every cold start — instead of reusing one across warm invocations — is the classic serverless outage. Under a traffic spike, every new instance opens its own connections simultaneously, the database hits `max_connections`, and *every* request (warm ones included) starts failing. Construct the client at module scope, connect lazily, reuse across invocations, and cap the per-instance pool low. Use a transaction-mode pooler when instance count can exceed the DB's connection limit.
+> [!CAUTION]
+> Keep-warm and provisioned concurrency **mask** a slow init; they don't fix it — and they bill you continuously for the masking. If you reach for them before measuring, you'll pay 24/7 to hide a 3s init that two hours of lazy-loading would have cut to 400ms, and you'll *still* eat the full cold start on every burst beyond your provisioned count. Fix the init first; provision only the residual.
+## Output
+1. **Cold-start breakdown by phase** — the measured init timeline showing where the milliseconds actually go, so the dominant cost is obvious before any change:
+```text
+Cold start breakdown — POST /api/checkout (Lambda, 256MB, node20)
+Total cold init: 2,840 ms   (warm: 38 ms)
+  runtime boot ................   210 ms   7%   (platform; fixed)
+  dependency/module load ......  1,520 ms  54%  <- DOMINANT
+      stripe sdk (eager) .........  340 ms
+      @prisma/client (eager) .....  610 ms
+      pdfkit (eager, unused @ req#1) 470 ms
+  framework init ..............    180 ms   6%   prisma engine bootstrap
+  first-connection setup ......    930 ms  33%  top-level await pool.connect()
+```
+2. **Targeted fixes** — ordered by the phase that dominates, each with the specific change and why it lands:
+```text
+1. Lazy-load pdfkit behind await import() in the receipt path .. -470 ms  [HIGH]
+   Not used by request #1; only the async receipt job needs it.
+2. Move pool.connect() out of top-level await; connect on first
+   handler use, reuse across invocations; pool max 2 ................ -930 ms cold,
+   + eliminates connection-storm risk under burst .................. [HIGH]
+3. Bump memory 256MB -> 1024MB (CPU scales) ................... -640 ms  [HIGH]
+   Faster parse + prisma init; est. total cost -18% (shorter ms).
+4. Bundle with esbuild, exclude aws-sdk (runtime-provided),
+   strip source maps ................................................ -210 ms  [MED]
+5. Provisioned concurrency = 3 on /checkout ONLY, after the above ... covers
+   baseline concurrency; residual bursts now cost ~600ms not 2,840.  [LAST]
+```
+3. **Measured before/after** — the re-measured cold start after applying the fixes, proving the dominant phase actually shrank (and noting cost impact, since memory and provisioning change the bill):
+```text
+Cold init: 2,840 ms -> 620 ms  (-78%)   p99 first-request: 3.1s -> 0.7s
+Monthly cost: roughly flat (higher memory offset by shorter duration;
+provisioned-concurrency on /checkout adds ~$X for 3 warm instances).
+Re-measure after a real burst, not a single forced cold start.
+```

package/content/skills/commit-splitter.md ADDED Viewed

@@ -0,0 +1,54 @@
+---
+name: "commit-splitter"
+description: "Split one big, mixed-up change into a series of small, atomic commits — each a single logical change that builds and passes tests on its own — by grouping hunks by intent and staging them piecemeal. Use when a working tree or a fat commit mixes a feature, a refactor, a bug fix, and formatting, or before opening a PR you want reviewers to actually read."
+allowed-tools: "Read, Grep, Bash"
+version: 1.0.0
+---
+A 600-line diff that mixes a feature, a drive-by refactor, a bug fix, and a formatter run is unreviewable — reviewers skim it and approve on faith. This skill decomposes that change into a sequence of small commits, each one a single logical intent that compiles and passes tests on its own. It groups the diff by purpose, stages one group at a time with `git add -p`, orders them so prerequisites land first, and gives each commit a focused message — so reviewers read the story instead of guessing at it, and `git bisect`/`git revert` stay meaningful.
+## When to use this skill
+- An uncommitted working tree mixes concerns — a new feature, an unrelated refactor, a bug fix, and whitespace/formatting churn all tangled together.
+- A single fat commit (yours, not yet pushed) bundles several logical changes and you want to split it before review.
+- You're about to open a PR and want the commit series to read as a deliberate narrative, not a `wip` dump.
+> [!WARNING]
+> Splitting only pays off if **each** commit independently builds and passes tests. A series where intermediate commits are broken defeats `git bisect` and makes any single-commit `revert` land a non-working tree — worse than one honest fat commit. Verify every commit, not just the tip.
+## Instructions
+1. **Inventory what changed.** Run `git status --porcelain` and `git diff --stat` (add `--cached` for staged hunks; `git show --stat HEAD` if splitting an existing commit). Read the actual hunks with `git diff` so you reason about real code, not filenames. Note any new/deleted/renamed files — those move as whole units, not per-hunk.
+2. **Group hunks by logical intent.** Assign every hunk to exactly one group. Typical buckets, in dependency order:
+   - **Prerequisite refactor** — renames, extractions, signature changes the feature depends on (no behavior change).
+   - **Bug fix** — a self-contained correctness fix, ideally with its own test.
+   - **Feature** — the new behavior, built on the refactor above.
+   - **Formatting / lint** — pure whitespace, import sorting, autoformatter noise. Isolate this; mixed-in formatting is what makes diffs unreadable.
+   - **Unrelated cleanup** — dead code, typo, comment. Its own commit (or a separate PR).
+   Watch for **hidden coupling**: a feature that won't compile without the refactor must come *after* it, never before.
+3. **Stage one group at a time.** Use `git add -p <files>` and answer per hunk: `y` to stage, `n` to skip, `s` to split a hunk into smaller pieces. When a single hunk mixes two intents that `s` can't separate (e.g. a logic change and a reformat on adjacent lines), use `git add -e` (or `e` at the prompt) to hand-edit the staged patch — delete the `+`/`-` lines that belong to the other group, keep context lines intact. Stage exactly one group, then go to step 4.
+4. **Verify the staged group in isolation, then commit.** Before committing, prove the staged subset stands alone: `git stash push --keep-index` parks everything *not* staged, leaving only this group in the tree. Run the project's build + tests (detect them — `npm run build && npm test`, `pytest`, `go build ./... && go test ./...`). If it builds and passes, commit (step 6); then `git stash pop` to restore the rest and return to step 3 for the next group. If it fails, you mis-grouped — a prerequisite is in a later group; re-order and re-stage.
+5. **For an already-committed mess, rewrite local history.** Two routes:
+   - **Re-stage the whole commit:** `git reset HEAD~1` (soft-ish — keeps changes in the working tree, unstaged), then proceed from step 2 to rebuild it as several commits.
+   - **Surgical split inside a series:** `git rebase -i <base>`, mark the offending commit `edit`. When the rebase stops on it, `git reset HEAD~1` to unstage its contents, then split via steps 3–6, and `git rebase --continue`. Use `git rebase --abort` to bail back to the original state if anything looks wrong.
+6. **Write a focused conventional message per commit.** One intent per subject line: `refactor(parser): extract tokenizer`, `fix(auth): reject expired tokens`, `feat(auth): add SSO login`, `style: apply formatter`. The subject names the *single* thing this commit does; if you need "and" or a bullet list of unrelated items, the commit is still mixed — split further.
+7. **Confirm the series reads as a story and every commit is green.** Run `git log --oneline <base>..HEAD` to read the sequence top-to-bottom: prerequisites → fix → feature → cleanup. Then verify *each* commit independently — `git rebase --exec '<build && test>' <base>` replays the series running your command after every commit, failing on the first that breaks. This is the proof that the split is bisect-safe.
+> [!WARNING]
+> Rewriting history that's already pushed or shared (`reset`, `rebase -i`) forces every collaborator to recover their local copy and can orphan their work. Only reshape **local, unpushed** history. If the commits are already on a shared branch, coordinate first — or leave history alone and split going forward.
+## Output
+- **Commit breakdown** — an ordered table: each proposed commit's purpose (its single intent), the files/hunks it claims, and its dependency on earlier commits.
+- **Exact reproduction steps** — the concrete `git add -p` / `git add -e` sequence (or the `rebase -i` + `reset HEAD~1` plan) that produces that breakdown, including the per-group `stash push --keep-index` → build/test → commit → `stash pop` loop.
+- **Recommended commit messages** — one conventional-commit subject (and body where it earns it) per commit, in apply order.
+- **Verification result** — confirmation that `git rebase --exec` ran the build+tests after every commit and the whole series is green, with any commit that needed re-grouping called out.
+Example breakdown for a tangled working tree:
+| # | Commit | Hunks / files | Depends on |
+|---|--------|---------------|------------|
+| 1 | `refactor(parser): extract Tokenizer class` | `parser.ts` (lines 12–88), new `tokenizer.ts` | — |
+| 2 | `fix(parser): handle empty input` | `parser.ts` (lines 140–152), `parser.test.ts` (new case) | 1 |
+| 3 | `feat(parser): support inline comments` | `tokenizer.ts` (lines 40–72), `parser.ts` (lines 95–110) | 1 |
+| 4 | `style: apply prettier` | whitespace-only across 6 files | — |

package/content/skills/contract-test-designer.md ADDED Viewed

@@ -0,0 +1,70 @@
+---
+name: "contract-test-designer"
+description: "Design consumer-driven contract tests between services so an API provider can't break its consumers unnoticed — without slow, flaky full end-to-end environments. Use when independent services or teams integrate over an API, when integration bugs only surface in staging or prod, or when E2E suites are too slow and brittle to catch breaking API changes."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Cross-service E2E suites are slow, flaky, and tell you a provider broke a consumer only after both are deployed to a shared environment. This skill designs consumer-driven contract tests instead: the *consumer* declares the exact requests it sends and the precise response fields and types it actually reads, and the *provider* replays those expectations against its real handler in its own CI. A provider change that violates any consumer's contract fails the provider's build — before merge, before deploy, with no other service running. The deliverable is the consumer-defined contract(s), the provider-side verification wired into CI, and a sharing-plus-versioning approach so the two sides can evolve.
+## When to use this skill
+- Two or more independently deployed services (often owned by different teams) integrate over HTTP/JSON, gRPC, or a message queue, and a provider can ship a change that silently breaks a consumer.
+- Integration regressions only appear in staging or prod because nothing in either repo's CI exercises the actual cross-service shape.
+- The cross-service E2E suite is too slow or flaky to gate merges, so breaking API changes slip through.
+- You're standing up a new client against an existing API and want to lock the dependency to *exactly* the fields you read, not the whole payload.
+## Instructions
+1. **Let the CONSUMER define the contract — and only the part it uses.** Write the contract from the consumer's test suite, not the provider's spec. For each interaction, state the *request* the consumer sends (method, path, query/body, headers that matter) and the *response shape it actually depends on*: the status code, the fields it reads, and their types. If the consumer parses `order.id` (string) and `order.total` (number) and ignores the other 20 fields, the contract asserts those two fields and nothing else. The contract is a description of *this consumer's* needs, never the provider's full API surface.
+2. **Match on type and structure, not frozen example values.** Use matchers, not literals: assert `total` is a number, `status` is one of a set, `items` is a non-empty array of objects with `sku`/`qty` — not `total == 4250`. Frozen example values turn the contract into a snapshot test that breaks on every data change. Reserve exact-value matching for fields whose literal value is part of the contract (an enum the consumer branches on, a fixed `Content-Type`).
+3. **Pick a tool/pattern and generate the artifact.** Match what the stack already uses before adding a dep. **Pact** (pact-js / pact-jvm / pact-python / pact-go) is the default for HTTP and async messages — the consumer test runs against a mock provider and emits a pact JSON file. **Spring Cloud Contract** suits a JVM-heavy shop. For simpler needs, a **shared JSON Schema / OpenAPI fragment** committed to both repos, validated on each side, is a legitimate lightweight contract. Whatever the tool, the output is a machine-checkable artifact of the consumer's expectations.
+4. **Verify the PROVIDER against the contract in the PROVIDER's own CI.** This is the half teams skip and the half that earns the value. The provider's pipeline fetches every consumer contract and replays each recorded request against the real running provider (no consumer process involved), asserting the live response satisfies the matchers. Wire it as a required check: a provider change that drops `order.total` or renames `status` fails the provider build, so the break is caught at the source before merge. Use `provider states` (Pact) to set up the data each interaction needs (`given "order 42 exists"` → seed that fixture) rather than depending on ambient DB state.
+5. **Share contracts via a broker or committed artifacts, and gate deploys on verification.** For more than a couple of services, run a **Pact Broker** (or PactFlow): consumers publish contracts tagged by branch/version, providers fetch and verify, and `can-i-deploy` blocks a release whose verified contracts don't cover the consumer versions currently in prod. For a small, co-located set, committing the contract artifact into a shared repo or the provider repo and verifying in CI is simpler and adequate — pick the lightest mechanism that still makes verification a required, automated gate, not a manual step.
+6. **Version contracts so provider and consumer can evolve independently.** Tag each contract with the consumer's version and the environment where that consumer version runs. Additive provider changes (new optional field) keep old contracts passing — that's the point of matching only what the consumer reads. For a breaking change, support both shapes until every consumer has published a contract for the new one (verified via the broker), then retire the old. Never edit a published contract in place to make a failing provider build go green — that defeats the gate.
+7. **Keep contracts to interface shape; push behavior into unit tests.** A contract verifies the *integration surface* — fields, types, status codes, error envelopes — not that the provider computes the right total or applies the right discount. That logic belongs in the provider's own unit/integration tests. A contract bloated with business assertions becomes a second, worse copy of the provider's logic suite and breaks on unrelated correct changes.
+> [!WARNING]
+> Contract tests verify the INTERFACE shape, not end-to-end behavior. They replace brittle cross-service E2E for catching *breaking API changes* — but they do not prove the provider's logic is correct or that the wired-up system works. Keep the provider's own logic tests, and a thin smoke E2E for the critical path; contracts shrink the E2E suite, they don't delete it.
+> [!WARNING]
+> A contract that asserts the provider's *entire* response — every field, exact values — instead of only the fields this consumer reads is an anti-pattern: it produces false breakages on unrelated, backward-compatible changes (a new field, a reordered key, a changed value the consumer never reads), and trains teams to ignore red builds. Assert the minimum the consumer actually depends on.
+## Output
+For the integration, the skill produces:
+- **The consumer-defined contract(s)** — for each interaction, the request (method, path, body, key headers) and the response expectations as matchers (status code + only the fields/types this consumer reads), in the chosen tool's format.
+- **The provider-side verification setup** — the CI step that fetches the contract(s) and replays them against the real provider, the provider-state fixtures each interaction needs, and the required-check wiring so a violation fails the provider build.
+- **The sharing + versioning approach** — broker vs. committed artifact, how contracts are tagged by consumer version/environment, and the deploy gate (e.g. `can-i-deploy`) plus the rule for evolving through a breaking change.
+Example — a consumer contract for an order-service client, in pact-js:
+```js
+const { PactV3, MatchersV3: M } = require("@pact-foundation/pact");
+const provider = new PactV3({ consumer: "checkout-web", provider: "order-service" });
+// The consumer reads only id (string), total (number), and status (one of two values).
+// It ignores every other field on the order — so the contract asserts only these.
+provider
+  .given("order 42 exists")                       // provider state: seeded in provider CI
+  .uponReceiving("a request for an order")
+  .withRequest({ method: "GET", path: "/orders/42" })
+  .willRespondWith({
+    status: 200,
+    headers: { "Content-Type": "application/json" },
+    body: {
+      id: M.string("ord_42"),                     // type match, not the literal "ord_42"
+      total: M.number(4250),
+      status: M.regex(/^(open|closed)$/, "open"),  // enum the consumer branches on
+    },
+  });
+await provider.executeTest(async (mock) => {
+  const order = await fetchOrder(`${mock.url}/orders/42`);
+  expect(order.status).toBe("open");
+});
+```
+This test emits a pact file; the **provider's** pipeline then replays `GET /orders/42` against the real `order-service` (with state `order 42 exists` seeded) and fails the provider build if `total` stops being a number or `status` leaves the enum. Hand the request/response shapes to `openapi-doc-writer` to keep the published spec in sync, and use `test-scaffolder` to flesh out the provider-state fixtures.

package/content/skills/dashboard-designer.md ADDED Viewed

@@ -0,0 +1,38 @@
+---
+name: "dashboard-designer"
+description: "Design a service dashboard that answers one question at a glance — is the service healthy, and if not, where's the problem? — by structuring panels around RED/USE instead of dumping every metric. Use when a service has no dashboard, when the existing one is an unreadable metric wall, or during incident-readiness prep."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+A dashboard is read in two modes: a calm weekly glance, and a 3am incident with an angry pager. Most dashboards are built for neither — they're a wall of every metric the system can emit, ranked by nothing, where the panel that matters is the same size as the one that never moves. This skill designs the opposite: a dashboard structured by a proven method (RED for request services, USE for resources) so the top row answers "is the service healthy?" in one glance, and the rows below answer "then where's the problem?" only when you need them.
+## When to use this skill
+- A service is running in production with no dashboard, or only a default auto-generated one nobody trusts.
+- An existing dashboard is a 40-panel metric dump — technically complete, useless in an incident, because nothing is ranked.
+- Incident-readiness or on-call onboarding: you need a board a new engineer can read cold at 3am.
+- You're defining or visualizing SLOs and need error-budget burn to live next to the signals that drive it.
+- A postmortem found that the dashboard existed but the operator couldn't find the symptom on it fast enough.
+## Instructions
+1. **Classify the thing you're instrumenting, then pick the method.** Request-driven service (HTTP/gRPC/API) → **RED**: Rate (requests/sec), Errors (failed requests/sec and error %), Duration (latency distribution). Resource or queue (worker pool, broker, DB, cache, thread pool) → **USE**: Utilization (% busy), Saturation (queue depth / backlog / wait time), Errors. A typical service is RED on top with a USE block below for its hottest dependency.
+2. **Put user-facing, SLO-aligned signals in the top row — nothing else competes for that space.** Request rate, error rate (%), latency p95/p99, and **error-budget burn rate** if an SLO exists. These four answer "are users being served?" A reader who sees the top row green should be able to stop reading. Everything below is for when it's red.
+3. **Show latency as percentiles — p50, p95, p99 — never an average.** Average latency is a lie that hides the tail: a p99 of 4s with a 120ms mean reads as "fine" on an average and "users are rage-quitting" on a percentile. Plot p50/p95/p99 as separate series on one panel so the spread between them (the tail blowing out) is visible.
+4. **Place cause metrics BELOW the signals, as drill-down — not mixed in.** CPU, memory, GC pause, queue depth, DB connection pool usage/saturation, downstream dependency latency, restart/OOM counts. These don't tell you if users hurt; they tell you *why* once the top row says they do. Group them so the path is top-down: symptom (top) → suspected cause (below).
+5. **Put correlated panels adjacent so the eye does the joining.** Error rate next to the deploy marker. Latency next to the saturated dependency it's waiting on. Queue depth next to consumer error rate. An operator should be able to see "errors started exactly at the deploy" or "latency tracks the DB pool maxing out" without flipping between boards.
+6. **Annotate the timeline with deploys and incidents.** Wire deploy/release events and incident start/end onto every time-series panel as vertical markers. Half of all "where's the problem?" questions are answered by a deploy line landing on the exact second the graph turns — make that free to see.
+7. **Set thresholds and colors that mean something, plus units and a sane default range.** Color by SLO/alert boundary, not by gut feel: green within budget, amber approaching, red breached — and keep it consistent across panels. Label every axis with units (ms, req/s, %, MiB). Default the time range to something an incident needs (last 1–6h, not 30 days) with the ability to zoom out.
+8. **One dashboard per service or user journey — linked, not merged.** Resist the urge to build one giant board for the whole platform. Per-service boards stay readable; link them (this service → its dependencies' boards, the journey board → each service board) so drill-down is a click, not a scroll through 200 panels.
+9. **Cut every panel that doesn't earn its place.** For each candidate ask: "In an incident, would this change what I do next?" If no, it's decoration — leave it off or push it to a separate deep-dive board. Noise hides signal; a 12-panel board you trust beats a 40-panel board you scan past.
+> [!WARNING]
+> A dashboard that shows every metric with equal weight is unreadable in an incident — the operator has to reason about *which* panel matters at exactly the moment they have no spare attention. Rank by user impact (RED/USE on top, causes below) or the board is decoration, not a tool.
+> [!WARNING]
+> Average latency on a dashboard hides the tail where users actually hurt. A healthy-looking mean can sit on top of a p99 that's timing out for 1% of traffic. Always plot percentiles (p50/p95/p99); never let an average latency panel be the thing on-call looks at first.
+## Output
+- **A top-down layout spec** for one service/journey: the chosen method (RED and/or USE) and the ordered rows — top row of user-facing/SLO signals, then cause/drill-down rows below.
+- **A per-panel table**: panel title → metric/query intent → visualization (time series, single-stat, percentile lines, heatmap) → threshold/color rule → units. Latency panels specify p50/p95/p99.
+- **The annotations and links to wire in**: deploy/incident markers on time-series panels, default time range, and the cross-links to dependency or journey dashboards.
+- **A "cut list"**: panels deliberately left off (and where they live instead), so the omission is a decision, not an oversight.

package/content/skills/deadlock-diagnoser.md ADDED Viewed

@@ -0,0 +1,45 @@
+---
+name: "deadlock-diagnoser"
+description: "Diagnose a database deadlock from the engine's own deadlock report, reconstruct the lock cycle (A holds 1 wants 2, B holds 2 wants 1), name the root cause — almost always two code paths locking the same rows in different orders — and fix it with consistent lock ordering, shorter transactions, and a retry-the-victim safeguard. Use when the DB logs deadlock errors, when transactions intermittently fail under load, or when queries mysteriously block each other."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A deadlock looks random from the application — a transaction that worked a thousand times suddenly errors out under load — but the database already did the forensics for you. When the engine detects a cycle it picks a victim, rolls it back, and logs *exactly* who held what and waited on what. This skill reads that report instead of guessing: it pulls the Postgres deadlock log lines (or the SQL Server deadlock graph / `innodb status` in MySQL), reconstructs the cycle (A holds lock 1 and wants lock 2 while B holds 2 and wants 1), and names the real root cause — which is almost always two code paths acquiring the **same** rows or tables in **different** orders. Then it fixes the cause: enforce one consistent lock-acquisition order everywhere, shrink the lock window so the race rarely opens, and add a retry-the-victim safeguard for the deadlocks you can't design away — in that priority, because retries without ordering just trade a deadlock for a rollback storm.
+## When to use this skill
+- The database log shows `deadlock detected` (Postgres), a deadlock graph / error 1205 (SQL Server), or `Deadlock found when trying to get lock` (MySQL/InnoDB).
+- A transaction intermittently fails or auto-retries only under concurrency — fine in dev, flaky in production at peak.
+- Two queries or endpoints mysteriously block each other, or you see processes stuck in a lock wait that times out.
+- You're adding a write path that touches multiple rows/tables and want to confirm it locks in the same order as existing code before it ships.
+- Lock contention (not a true cycle) is serializing throughput, and you need to tell genuine deadlocks apart from long lock waits.
+## Instructions
+1. **Get the engine's deadlock report — don't reconstruct from app logs.** In Postgres, read the server log around the error: it prints both processes, their full SQL statements, and the `Process N waits for <lockmode> on <relation/tuple>; blocked by process M` lines for each side of the cycle (raise `log_lock_waits = on` and `deadlock_timeout` context if it's terse). In SQL Server, pull the deadlock graph from the `system_health` Extended Events session or a trace — it lists each `process` with its `inputbuf` (the statement) and the `resource-list` of locks owned vs. requested. In MySQL/InnoDB, run `SHOW ENGINE INNODB STATUS` and read the `LATEST DETECTED DEADLOCK` section. This report is ground truth; the app's stack trace only tells you which transaction lost.
+2. **Reconstruct the cycle explicitly: who HELD what, who WANTED what.** Write it out as a two-column picture — `Txn A: holds <lock on resource 1>, waits for <lock on resource 2>` / `Txn B: holds <lock on resource 2>, waits for <lock on resource 1>`. Identify the exact resources (which rows/index ranges/tables) and the lock modes (row `FOR UPDATE`/exclusive vs. shared, gap locks in InnoDB, intent locks in SQL Server). A real deadlock is a closed cycle of waits; if it's not a cycle, it's lock contention or a lock-wait timeout (step 8), which has a different fix.
+3. **Find the inconsistent acquisition ORDER — the usual root cause.** Grep the codebase for every transaction that touches the resources in the cycle and trace the order each one locks them. The classic bug: one path does `UPDATE accounts WHERE id=1` then `id=2`, another does `id=2` then `id=1` (or two services lock tables `orders` then `inventory` vs. `inventory` then `orders`). Watch for ordering that's *hidden* — a `SELECT ... FOR UPDATE` with an unordered `IN (...)` or a join whose row-locking order depends on the plan, an ORM that emits writes in object-graph order, or a foreign-key check that takes a lock on the parent row you didn't write explicitly.
+4. **Fix the cause first: enforce ONE consistent lock-acquisition order across all transactions.** Make every code path acquire the shared resources in the same deterministic order — sort the ids before locking (`SELECT ... FOR UPDATE ... ORDER BY id`), always lock parent before child, always lock tables in a fixed documented sequence. Consistent ordering makes a cycle impossible: contenders queue instead of deadlocking. This is the only fix that actually removes the deadlock rather than reducing its odds.
+5. **Shrink the lock window so the race rarely opens.** Keep transactions short and narrow: acquire locks as late as possible, commit as early as possible, and lock only the rows you'll write. Never hold a transaction open across a network/RPC/third-party-API call or across user think-time — an external call inside the transaction stretches the lock-hold from milliseconds to seconds and turns rare contention into constant deadlocks. Do the slow work *before* `BEGIN` or *after* `COMMIT`.
+6. **Pick a deliberate lock strategy for the access pattern, and right-size isolation.** Where the same rows are contended, use pessimistic locking with `SELECT ... FOR UPDATE` in the consistent order from step 4. Where conflicts are *rare*, prefer optimistic concurrency — a `version`/`updated_at` column checked in the `WHERE` of the `UPDATE` and a conflict-retry, which takes no long-held locks. If the engine is over-locking (e.g. Serializable or InnoDB gap locks causing deadlocks on inserts/range scans), drop to the lowest isolation level that's still correct (often Read Committed) to acquire fewer locks.
+7. **Add the retry-the-victim safeguard — last, not first.** A deadlock victim's transaction is rolled back cleanly and is a *transient, safe-to-retry* error; the app should catch it specifically (Postgres `SQLSTATE 40P01`, MySQL `1213`, SQL Server `1205`) and retry the whole transaction with capped exponential backoff and jitter (e.g. 3–5 attempts). Retry the *entire* transaction from `BEGIN` — replaying half a rolled-back transaction corrupts state. This handles the deadlocks you can't design away; it does NOT substitute for steps 4–5.
+8. **Distinguish a true deadlock from plain lock contention before "fixing" the wrong thing.** If the report shows a lock-*wait timeout* rather than a detected cycle, there's no ordering bug — one transaction is simply holding a lock too long (a long-running write, an idle-in-transaction connection, a missing index forcing a wide row/range lock). The fix there is shortening the holder (step 5), adding the index so the lock is narrow (`query-plan-analyzer`), or killing idle-in-transaction sessions — not reordering locks.
+> [!WARNING]
+> Adding retries WITHOUT fixing the inconsistent lock order just papers over the bug. Under load, every retry re-enters the same cycle, so you trade one deadlock for a storm of rollbacks and re-runs: throughput craters, latency spikes, and the database burns work undoing transactions. Fix the ordering first; the retry is a net for the residual, not the cure.
+> [!WARNING]
+> A transaction that holds a lock across an external/API call (or user think-time) is the single most common way rare contention becomes constant deadlocks — the lock-hold goes from milliseconds to seconds, widening the race window enormously. Move every network call and slow computation outside the `BEGIN ... COMMIT`.
+> [!NOTE]
+> Lowering isolation reduces locking but changes correctness guarantees (Read Committed allows non-repeatable reads; dropping below Serializable can reintroduce write skew). Only lower it where the access pattern is provably safe — don't trade a deadlock for a silent data anomaly.
+## Output
+A short report with four parts:
+1. **The reconstructed cycle** — quoted from the engine's deadlock report: `Txn A holds <lock on R1>, wants <lock on R2>` / `Txn B holds <lock on R2>, wants <lock on R1>`, with the exact resources, lock modes, and the two offending statements.
+2. **The root cause** — the specific inconsistent lock-acquisition order (or over-long lock scope / over-strict isolation) behind the cycle, naming the two code paths and the resources they lock in conflicting order.
+3. **The fix** — one concrete change: the consistent ordering to enforce (with the exact `ORDER BY` / lock sequence), or the shortened-transaction change (what to move outside `BEGIN`), or the isolation-level / locking-strategy change — not a menu.
+4. **The retry safeguard** — the specific deadlock SQLSTATE/error code to catch and the backoff retry of the whole transaction, framed explicitly as the net for residual deadlocks, not the primary fix.

package/content/skills/devcontainer-designer.md ADDED Viewed

@@ -0,0 +1,40 @@
+---
+name: "devcontainer-designer"
+description: "Design a reproducible dev environment (Dev Container / Docker) so onboarding is one command and 'works on my machine' dies — by detecting the project's real stack and versions, authoring a devcontainer.json (+ Dockerfile/compose) that pins the runtime to what the repo targets, wires dependent services, caches dependencies, and injects secrets instead of baking them. Use when new contributors struggle to set up the project, when environment drift causes inconsistent behavior, or when standardizing tooling across a team."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+The phrase "works on my machine" is a confession that the project has no defined machine. Two contributors on Node 18.17 and 20.4, one with a system `libpq` and one without, a Postgres someone installed via Homebrew in 2023 — that spread is exactly the environment drift a dev container exists to kill. But a container only does that if it pins what the repo actually targets and brings the *whole* stack up together; an unpinned `node:latest` reintroduces the drift you containerized to remove, and a `:latest` Postgres can rev a major version under you on the next rebuild. This skill reads the repo to find the real stack, then writes a `devcontainer.json` (with a Dockerfile and/or compose when services are involved) where every version is pinned, services come up as one unit, dependencies are cached so rebuilds are cheap, and secrets are injected at runtime — never baked into the image.
+## When to use this skill
+- New contributors burn their first day on setup, or the onboarding README has more than a handful of "install X, then Y" steps that drift out of date.
+- The same code behaves differently across machines (passes locally, fails in CI, or vice versa) and you suspect runtime/version/system-lib differences rather than a real bug.
+- You're standardizing tooling across a team and want one definition of "the dev environment" that an editor can rebuild on demand.
+- The project needs a DB, cache, queue, or other service running alongside the app and people manage those by hand today.
+## When NOT to use this skill
+- The drift is a missing lockfile, not a missing container — if `package.json`/`pyproject.toml` has unpinned ranges and no committed lock, fix that first; a container around floating deps still drifts.
+- You need a production deployment image. A dev container optimizes for fast inner-loop edit/run with the source mounted; a production image optimizes for a small, immutable artifact with the source baked in. They are different files with different tradeoffs — don't ship this one.
+## Instructions
+1. **Detect the real stack before writing anything.** Glob and read the manifests that declare the runtime and pin it: `.nvmrc` / `.node-version` / `engines` in `package.json`, `.python-version` / `pyproject.toml` `requires-python`, `.ruby-version`, `go.mod` `go` directive, `.tool-versions` (asdf/mise), `rust-toolchain.toml`. Identify the package manager from the lockfile that exists (`package-lock.json` → npm, `pnpm-lock.yaml` → pnpm, `yarn.lock` → yarn, `poetry.lock` → poetry, `uv.lock` → uv) — the container must use the same one, or it builds a different tree. The repo's declared version is the source of truth; never round to "latest stable."
+2. **Find the services the app actually talks to.** Grep config and env templates (`.env.example`, `config/`, `docker-compose*.yml`, `application.yml`) for connection strings and ports — `DATABASE_URL`, `REDIS_URL`, `postgres://`, `amqp://`, ES/OpenSearch hosts. Read the dependency manifest for client libraries (`pg`, `redis`, `psycopg`, `pika`, `kafkajs`) as corroboration. Every external service the app expects at runtime must come up in the dev environment, or the container is half a setup and contributors are back to installing Postgres by hand.
+3. **Pin the base image to the repo's exact runtime version.** Use a digest-stable, version-specific tag — `mcr.microsoft.com/devcontainers/python:3.12` or `node:20.17-bookworm`, never `:latest`, `:lts`, or a bare major like `:20`. Match the minor the repo targets (a `.nvmrc` of `20.17.0` means `node:20.17`, not `node:20`). If you author a Dockerfile, install system libraries the build needs that the base lacks (`libpq-dev` for `psycopg`, `build-essential`, `libvips` for `sharp`, `default-libmysqlclient-dev`) — these are the silent "missing on my machine" failures. Set the pinned image in `devcontainer.json` `image`, or `build.dockerfile` if you need the extra libs.
+4. **Bring the whole stack up with compose when services exist.** When step 2 found a DB/cache/queue, write a `docker-compose.yml` with the app service plus each dependency pinned to a *specific* version (`postgres:16.4`, `redis:7.4`) — a major Postgres bump on rebuild can refuse to read the old data dir. Point `devcontainer.json` at it via `dockerComposeFile` + `service` + `workspaceFolder`, list `runServices` so the DB starts with the workspace, and use a named volume for the DB data dir so a container rebuild doesn't wipe local seed data. Set service `DATABASE_URL` to the compose service hostname (`postgres`, not `localhost`) so the app connects across the compose network.
+5. **Mount the workspace and cache dependencies so rebuilds stay cheap.** A 10-minute container build trains people to never rebuild — and a never-rebuilt container is the drift you were eliminating. Keep the source bind-mounted (default `workspaceFolder`) so edits are instant. Put the package manager's *store* (not `node_modules`/`.venv`) in a named volume mount so deps survive rebuilds: a volume on `~/.npm`, `~/.cache/pnpm`, `~/.cache/pip`, `~/.cargo`. For compiled-language or heavy-system-lib stacks, structure the Dockerfile so dependency-install layers come before the source copy, so a code change doesn't bust the dep cache.
+6. **Preinstall tooling and run a `postCreateCommand` that leaves the env ready.** Add the editor extensions and settings the project assumes under `customizations.vscode.extensions` (linter, formatter, language server, the DB client) — so everyone gets the same lint-on-save, not a personal config. Use a `postCreateCommand` to run the dependency install with the detected package manager (`pnpm install --frozen-lockfile`) plus any project setup (DB migrate + seed, generate types, copy `.env.example` to `.env` if absent). The goal: open the project, and after postCreate it runs — no manual step. Prefer `devcontainer features` (`ghcr.io/devcontainers/features/*`) for common add-ons (docker-in-docker, gh CLI) over hand-rolled `apt-get` lines.
+7. **Inject secrets at runtime — never bake them into the image.** Reference required secrets in `containerEnv`/`remoteEnv` sourced from the host (`${localEnv:OPENAI_API_KEY}`) or via a secret mount, and keep a committed `.env.example` documenting the keys with empty/placeholder values. Anything sensitive stays in the developer's local `.env` (gitignored) or their host env. Do not `ENV SECRET=...`, `COPY .env`, or `ARG` a credential in the Dockerfile, and don't commit a populated `.env` — an image layer is shipped verbatim to everyone who pulls it.
+> [!WARNING]
+> An unpinned base or runtime (`node:latest`, `python:3`, `postgres:16` without a minor) is the single change that reintroduces the exact drift the container is meant to eliminate. The image silently revs out from under the team on the next pull or rebuild, and now "works in the container" depends on *when* you built it. Pin every base image and every service to a specific version, and update those pins as a reviewed, deliberate commit.
+> [!CAUTION]
+> A secret baked into an image — via `ENV`, `ARG`, `COPY .env`, or a committed populated `.env` — leaks to everyone who pulls the image and persists in the layer history even if a later layer deletes it. Injecting credentials into a built image is publishing them. Keep all secrets in the developer's local env/secret store and reference them at runtime; commit only an empty `.env.example`.
+## Output
+A `devcontainer.json` plus the Dockerfile and/or `docker-compose.yml` the project needs, written via Write, with: every base image and service tag pinned to the version the repo targets (and the detected source of that version called out — e.g. `node:20.17 (from .nvmrc)`, `postgres:16.4`); the dependent services wired through compose with named data volumes and correct service-hostname connection strings; a dependency-store cache mount and a layer-ordered Dockerfile so rebuilds are fast; the preinstalled extensions and a `postCreateCommand` that installs and sets up so the env is ready on first open; and a clear note of which secrets are injected from the host env / secret mount versus the committed empty `.env.example` — none baked into the image. The skill reads the repo and writes config files only; it does not build images, start containers, or run install commands.