npm - agentscamp - Versions diffs - 0.4.0 → 0.5.0 - Mend

agentscamp 0.4.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

package/README.md +2 -2
package/content/manifest.json +213 -2
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -1,6 +1,6 @@
 # agentscamp
-> 168 ready-to-use Claude Code agents, skills, and slash commands — installable in one command.
+> 183 ready-to-use Claude Code agents, skills, and slash commands — installable in one command.
 [AgentsCamp](https://agentscamp.com) is a curated, format-validated directory of AI coding artifacts. This CLI bundles the full catalog and installs items straight into your `.claude/` directory.
@@ -43,7 +43,7 @@ These are Claude Code's standard locations — agents get delegated to automatic
 ## What's inside
 - **58 agents** — specialized subagents for development, data/AI, infra, security, and more → [browse agents](https://agentscamp.com/agents)
-- **60 skills** — on-demand capabilities for testing, databases, refactoring, releases → [browse skills](https://agentscamp.com/skills)
+- **75 skills** — on-demand capabilities for testing, databases, refactoring, releases → [browse skills](https://agentscamp.com/skills)
 - **50 commands** — reusable slash commands for planning, review, git, scaffolding → [browse commands](https://agentscamp.com/commands)
 Every item has a full page with docs, examples, and related picks at [agentscamp.com](https://agentscamp.com).

package/content/manifest.json CHANGED

@@ -1,9 +1,9 @@
 {
   "schemaVersion": 1,
-  "generatedAt": "2026-06-18T02:21:30.142Z",
+  "generatedAt": "2026-06-18T02:36:19.351Z",
   "counts": {
     "agents": 58,
-    "skills": 60,
+    "skills": 75,
     "commands": 50
   },
   "items": [
@@ -1635,6 +1635,35 @@
       "installAs": "skills/agent-memory-designer/SKILL.md",
       "url": "https://agentscamp.com/skills/workflow/agent-memory-designer"
     },
+    {
+      "id": "skills/agent-trajectory-evaluator",
+      "type": "skill",
+      "slug": "agent-trajectory-evaluator",
+      "category": "data",
+      "title": "Agent Trajectory Evaluator",
+      "description": "Evaluate a multi-step AI agent's whole run — tool calls, intermediate steps, and final result — not just final-answer correctness, so you can pinpoint WHERE it went wrong. Use when building or debugging a tool-using or multi-step agent, when final-answer-only evals can't explain failures, or when a prompt/model change quietly makes the agent less efficient or more error-prone even though the answer still looks right.",
+      "topics": [
+        "llm-evals",
+        "ai-agents-systems"
+      ],
+      "file": "skills/agent-trajectory-evaluator.md",
+      "installAs": "skills/agent-trajectory-evaluator/SKILL.md",
+      "url": "https://agentscamp.com/skills/data/agent-trajectory-evaluator"
+    },
+    {
+      "id": "skills/alerting-rules-tuner",
+      "type": "skill",
+      "slug": "alerting-rules-tuner",
+      "category": "observability",
+      "title": "Alerting Rules Tuner",
+      "description": "Cut alert noise and make every page mean something — rewrite alerting rules to fire on user-felt symptoms (error rate, latency SLO burn, failed requests) instead of causes (high CPU, full disk), with duration windows and severity routing so only urgent, actionable conditions reach a human. Use when on-call is fatigued by low-value pages, when real incidents get missed in the noise, or when alerts fire on causes rather than impact.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/alerting-rules-tuner.md",
+      "installAs": "skills/alerting-rules-tuner/SKILL.md",
+      "url": "https://agentscamp.com/skills/observability/alerting-rules-tuner"
+    },
     {
       "id": "skills/architecture-diagram-generator",
       "type": "skill",
@@ -1691,6 +1720,20 @@
       "installAs": "skills/bundle-analyzer/SKILL.md",
       "url": "https://agentscamp.com/skills/performance/bundle-analyzer"
     },
+    {
+      "id": "skills/canary-release-planner",
+      "type": "skill",
+      "slug": "canary-release-planner",
+      "category": "release",
+      "title": "Canary Release Planner",
+      "description": "Design a canary / progressive rollout so a bad release reaches 1% of users instead of 100% — staged traffic with bake times, gating metrics compared against the concurrently-running stable baseline, and automated promote-or-rollback. Use when shipping a risky change, when you want automatic rollback on regression, or when moving off all-at-once deploys.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/canary-release-planner.md",
+      "installAs": "skills/canary-release-planner/SKILL.md",
+      "url": "https://agentscamp.com/skills/release/canary-release-planner"
+    },
     {
       "id": "skills/changelog-from-prs",
       "type": "skill",
@@ -1733,6 +1776,20 @@
       "installAs": "skills/claude-settings-auditor/SKILL.md",
       "url": "https://agentscamp.com/skills/workflow/claude-settings-auditor"
     },
+    {
+      "id": "skills/cold-start-optimizer",
+      "type": "skill",
+      "slug": "cold-start-optimizer",
+      "category": "performance",
+      "title": "Cold Start Optimizer",
+      "description": "Cut cold-start latency for serverless functions and slow-booting apps by measuring the init breakdown, then attacking the dominant phase — artifact size, eager imports, eager connections, or under-provisioned memory — instead of reflexively buying provisioned concurrency. Use when serverless p99 spikes on the first request, when a function times out during init, or when scale-to-zero is hurting user-facing latency.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/cold-start-optimizer.md",
+      "installAs": "skills/cold-start-optimizer/SKILL.md",
+      "url": "https://agentscamp.com/skills/performance/cold-start-optimizer"
+    },
     {
       "id": "skills/connection-pool-tuner",
       "type": "skill",
@@ -1747,6 +1804,20 @@
       "installAs": "skills/connection-pool-tuner/SKILL.md",
       "url": "https://agentscamp.com/skills/database/connection-pool-tuner"
     },
+    {
+      "id": "skills/contract-test-designer",
+      "type": "skill",
+      "slug": "contract-test-designer",
+      "category": "testing",
+      "title": "Contract Test Designer",
+      "description": "Design consumer-driven contract tests between services so an API provider can't break its consumers unnoticed — without slow, flaky full end-to-end environments. Use when independent services or teams integrate over an API, when integration bugs only surface in staging or prod, or when E2E suites are too slow and brittle to catch breaking API changes.",
+      "topics": [
+        "review-qa"
+      ],
+      "file": "skills/contract-test-designer.md",
+      "installAs": "skills/contract-test-designer/SKILL.md",
+      "url": "https://agentscamp.com/skills/testing/contract-test-designer"
+    },
     {
       "id": "skills/conventional-commits",
       "type": "skill",
@@ -1817,6 +1888,34 @@
       "installAs": "skills/dependency-upgrade-planner/SKILL.md",
       "url": "https://agentscamp.com/skills/refactor/dependency-upgrade-planner"
     },
+    {
+      "id": "skills/devcontainer-designer",
+      "type": "skill",
+      "slug": "devcontainer-designer",
+      "category": "workflow",
+      "title": "Dev Container Designer",
+      "description": "Design a reproducible dev environment (Dev Container / Docker) so onboarding is one command and 'works on my machine' dies — by detecting the project's real stack and versions, authoring a devcontainer.json (+ Dockerfile/compose) that pins the runtime to what the repo targets, wires dependent services, caches dependencies, and injects secrets instead of baking them. Use when new contributors struggle to set up the project, when environment drift causes inconsistent behavior, or when standardizing tooling across a team.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/devcontainer-designer.md",
+      "installAs": "skills/devcontainer-designer/SKILL.md",
+      "url": "https://agentscamp.com/skills/workflow/devcontainer-designer"
+    },
+    {
+      "id": "skills/distributed-tracing-instrumenter",
+      "type": "skill",
+      "slug": "distributed-tracing-instrumenter",
+      "category": "observability",
+      "title": "Distributed Tracing Instrumenter",
+      "description": "Instrument a service (or a chain of services) with OpenTelemetry so a single request can be followed end-to-end — context propagated across every hop including async/queue boundaries, spans at the boundaries that matter, deliberate trace-wide sampling, and trace_id stamped on log lines. Use when latency or failures span multiple services, when you have logs but can't reconstruct a request's full path, or when adopting OpenTelemetry.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/distributed-tracing-instrumenter.md",
+      "installAs": "skills/distributed-tracing-instrumenter/SKILL.md",
+      "url": "https://agentscamp.com/skills/observability/distributed-tracing-instrumenter"
+    },
     {
       "id": "skills/embedding-index-tuner",
       "type": "skill",
@@ -1931,6 +2030,20 @@
       "installAs": "skills/human-in-the-loop-gate/SKILL.md",
       "url": "https://agentscamp.com/skills/workflow/human-in-the-loop-gate"
     },
+    {
+      "id": "skills/idempotency-designer",
+      "type": "skill",
+      "slug": "idempotency-designer",
+      "category": "api",
+      "title": "Idempotency Designer",
+      "description": "Make unsafe, retryable API operations idempotent so a client retry or a network hiccup can't double-charge, double-create, or double-send — design a client-supplied idempotency key, an atomic store-and-check (unique constraint or conditional write), in-flight conflict handling, and a retention policy. Use when a POST/mutation can be retried (payments, order creation, sends, webhooks), or when duplicate side effects have already shown up in production.",
+      "topics": [
+        "architecture"
+      ],
+      "file": "skills/idempotency-designer.md",
+      "installAs": "skills/idempotency-designer/SKILL.md",
+      "url": "https://agentscamp.com/skills/api/idempotency-designer"
+    },
     {
       "id": "skills/llm-as-judge-scorer",
       "type": "skill",
@@ -2073,6 +2186,20 @@
       "installAs": "skills/multimodal-document-extractor/SKILL.md",
       "url": "https://agentscamp.com/skills/data/multimodal-document-extractor"
     },
+    {
+      "id": "skills/mutation-test-runner",
+      "type": "skill",
+      "slug": "mutation-test-runner",
+      "category": "testing",
+      "title": "Mutation Test Runner",
+      "description": "Measure whether a test suite actually catches bugs by running mutation testing — introduce small faults into the code and check which ones a test kills versus which slip through silently. Use when line coverage is high but bugs still ship, when you suspect tests assert weakly, or to find the exact assertions a suite is missing.",
+      "topics": [
+        "review-qa"
+      ],
+      "file": "skills/mutation-test-runner.md",
+      "installAs": "skills/mutation-test-runner/SKILL.md",
+      "url": "https://agentscamp.com/skills/testing/mutation-test-runner"
+    },
     {
       "id": "skills/openapi-doc-writer",
       "type": "skill",
@@ -2242,6 +2369,20 @@
       "installAs": "skills/qlora-finetune-runner/SKILL.md",
       "url": "https://agentscamp.com/skills/data/qlora-finetune-runner"
     },
+    {
+      "id": "skills/query-plan-analyzer",
+      "type": "skill",
+      "slug": "query-plan-analyzer",
+      "category": "database",
+      "title": "Query Plan Analyzer",
+      "description": "Read a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/query-plan-analyzer.md",
+      "installAs": "skills/query-plan-analyzer/SKILL.md",
+      "url": "https://agentscamp.com/skills/database/query-plan-analyzer"
+    },
     {
       "id": "skills/rate-limiter-designer",
       "type": "skill",
@@ -2286,6 +2427,20 @@
       "installAs": "skills/readme-generator/SKILL.md",
       "url": "https://agentscamp.com/skills/docs/readme-generator"
     },
+    {
+      "id": "skills/runbook-writer",
+      "type": "skill",
+      "slug": "runbook-writer",
+      "category": "docs",
+      "title": "Runbook Writer",
+      "description": "Write an operational runbook a half-asleep on-call engineer can execute at 3am — scoped to ONE alert, leading with how to confirm the problem, the copy-pasteable mitigation that stops user pain, then diagnosis, escalation, and verification. Use when an alert has no documented response, after an incident exposed a missing procedure, or when standing up on-call for a service.",
+      "topics": [
+        "devops-infra"
+      ],
+      "file": "skills/runbook-writer.md",
+      "installAs": "skills/runbook-writer/SKILL.md",
+      "url": "https://agentscamp.com/skills/docs/runbook-writer"
+    },
     {
       "id": "skills/secret-scanner",
       "type": "skill",
@@ -2314,6 +2469,20 @@
       "installAs": "skills/security-headers-hardener/SKILL.md",
       "url": "https://agentscamp.com/skills/security/security-headers-hardener"
     },
+    {
+      "id": "skills/semantic-cache-designer",
+      "type": "skill",
+      "slug": "semantic-cache-designer",
+      "category": "data",
+      "title": "Semantic Cache Designer",
+      "description": "Design a semantic cache for LLM responses — serve a cached answer when a new query is similar enough to a past one — to cut cost and latency on repetitive traffic, with the similarity threshold calibrated on real query pairs and a cache key that prevents cross-user/model leaks. Use when an LLM app sees many near-duplicate prompts (FAQs, support, search), when token spend on repetitive queries is high, or when latency on common questions matters.",
+      "topics": [
+        "llm-app-dev"
+      ],
+      "file": "skills/semantic-cache-designer.md",
+      "installAs": "skills/semantic-cache-designer/SKILL.md",
+      "url": "https://agentscamp.com/skills/data/semantic-cache-designer"
+    },
     {
       "id": "skills/semver-advisor",
       "type": "skill",
@@ -2356,6 +2525,20 @@
       "installAs": "skills/sql-optimizer/SKILL.md",
       "url": "https://agentscamp.com/skills/data/sql-optimizer"
     },
+    {
+      "id": "skills/strangler-fig-migrator",
+      "type": "skill",
+      "slug": "strangler-fig-migrator",
+      "category": "refactor",
+      "title": "Strangler Fig Migrator",
+      "description": "Plan the incremental replacement of a legacy module or service using the strangler-fig pattern — grow new code around the old behind an interception seam until the old is dead, instead of a big-bang rewrite. Use when a legacy system is too risky to rewrite at once, or when migrating off a deprecated framework/dependency gradually while staying shippable and rollback-able at every step.",
+      "topics": [
+        "architecture"
+      ],
+      "file": "skills/strangler-fig-migrator.md",
+      "installAs": "skills/strangler-fig-migrator/SKILL.md",
+      "url": "https://agentscamp.com/skills/refactor/strangler-fig-migrator"
+    },
     {
       "id": "skills/structured-logging-designer",
       "type": "skill",
@@ -2384,6 +2567,34 @@
       "installAs": "skills/test-scaffolder/SKILL.md",
       "url": "https://agentscamp.com/skills/testing/test-scaffolder"
     },
+    {
+      "id": "skills/threat-model-builder",
+      "type": "skill",
+      "slug": "threat-model-builder",
+      "category": "security",
+      "title": "Threat Model Builder",
+      "description": "Build a practical threat model for a feature or system using STRIDE — diagram the data flow, mark trust boundaries, enumerate concrete threats where data crosses them, and prioritize by likelihood × impact so security is reasoned about before shipping instead of bolted on after. Use when designing a feature that touches auth, money, or sensitive data, running a security design review, or hardening before a launch.",
+      "topics": [
+        "review-qa"
+      ],
+      "file": "skills/threat-model-builder.md",
+      "installAs": "skills/threat-model-builder/SKILL.md",
+      "url": "https://agentscamp.com/skills/security/threat-model-builder"
+    },
+    {
+      "id": "skills/token-usage-profiler",
+      "type": "skill",
+      "slug": "token-usage-profiler",
+      "category": "data",
+      "title": "Token Usage Profiler",
+      "description": "Measure and attribute LLM token usage and cost across an app — input vs output tokens by feature, route, model, and tenant — then rank the waste and the specific lever to cut it. Use when LLM spend is high or climbing with no clear cause, before scaling a feature that calls a model, or when you need per-feature or per-tenant cost attribution for billing or budgets.",
+      "topics": [
+        "llm-app-dev"
+      ],
+      "file": "skills/token-usage-profiler.md",
+      "installAs": "skills/token-usage-profiler/SKILL.md",
+      "url": "https://agentscamp.com/skills/data/token-usage-profiler"
+    },
     {
       "id": "skills/tool-definition-generator",
       "type": "skill",
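
The skill entries added in this hunk share a visible naming pattern: `id` is `skills/<slug>`, `file` is `skills/<slug>.md`, `installAs` is `skills/<slug>/SKILL.md`, and `url` is `https://agentscamp.com/skills/<category>/<slug>`. The following is a minimal sketch of a check for that pattern, together with the README total. The helper `check_skill_entry` is hypothetical and not part of the package, and the pattern is inferred only from the entries shown in this diff, so it may not hold for every item in the manifest.

```python
def check_skill_entry(item: dict) -> list[str]:
    """Return mismatches between a skill entry's fields and the naming
    pattern seen in this diff (hypothetical helper, not package code)."""
    slug, category = item["slug"], item["category"]
    expected = {
        "id": f"skills/{slug}",
        "file": f"skills/{slug}.md",
        "installAs": f"skills/{slug}/SKILL.md",
        "url": f"https://agentscamp.com/skills/{category}/{slug}",
    }
    return [
        f"{key}: expected {want!r}, got {item.get(key)!r}"
        for key, want in expected.items()
        if item.get(key) != want
    ]


# One entry copied from the 0.5.0 manifest diff (description and topics omitted).
entry = {
    "id": "skills/runbook-writer",
    "type": "skill",
    "slug": "runbook-writer",
    "category": "docs",
    "file": "skills/runbook-writer.md",
    "installAs": "skills/runbook-writer/SKILL.md",
    "url": "https://agentscamp.com/skills/docs/runbook-writer",
}
problems = check_skill_entry(entry)

# The README headline should equal the sum of the manifest counts.
counts = {"agents": 58, "skills": 75, "commands": 50}
total = sum(counts.values())
```

With the counts from this release the sum is 183, which matches the updated README line; the 0.4.0 values (58 + 60 + 50) give the previous 168.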

package/content/skills/agent-trajectory-evaluator.md ADDED

@@ -0,0 +1,59 @@
+---
+name: "agent-trajectory-evaluator"
+description: "Evaluate a multi-step AI agent's whole run — tool calls, intermediate steps, and final result — not just final-answer correctness, so you can pinpoint WHERE it went wrong. Use when building or debugging a tool-using or multi-step agent, when final-answer-only evals can't explain failures, or when a prompt/model change quietly makes the agent less efficient or more error-prone even though the answer still looks right."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Final-answer evals tell you the agent failed; they don't tell you *where*. An agent that returns the right number might have called the wrong tool first, looped on a flaky API, or stumbled into the answer through a path that collapses on the next input. This skill makes the agent's **process** inspectable: capture the full trajectory — every decision, tool call, argument, and result — then score it on the axes that actually predict failure, asserting what's checkable and judging only what isn't.
+## When to use this skill
+- You're building or debugging a tool-using / multi-step agent and a final-answer eval says "wrong" without saying why.
+- A prompt or model change kept the answers correct but you suspect the agent got slower, looped more, or recovers worse — and you need to prove it.
+- You're adding a new tool and want to confirm the agent selects it correctly instead of brute-forcing with the old one.
+- Failures are intermittent and you can't tell whether the agent is fragile (lucky path) or robust (sound path).
+## Instructions
+1. **Capture the full trajectory as a structured, replayable log — one record per step.** Final-answer-only logging is the root cause of un-diagnosable failures. Each step records: the model's decision (the assistant turn, including thinking-block summaries if present), the tool called and its exact arguments, the raw tool result (success/error), and any externalized state (files written, working dir, retry count). Use a stable schema so two runs diff cleanly:
+   ```json
+   {"run_id": "...", "task_id": "...", "step": 3,
+    "decision": "call search_orders to find the open order",
+    "tool": "search_orders", "args": {"customer_id": "C-118", "status": "open"},
+    "result": {"ok": true, "rows": 2}, "is_error": false,
+    "latency_ms": 410, "state": {"retries": 0}}
+   ```
+   Pull this from your agent loop's tool-call records (or the Managed Agents event stream: `agent.tool_use` / `agent.tool_result` / `agent.custom_tool_use` events carry tool name, input, and result). Persist trajectories to disk so a baseline run is a diffable artifact, not a console scroll-by.
+2. **Build a fixed, version-controlled eval set of representative tasks — and deliberately include trap tasks.** A good set has three buckets: (a) routine tasks the agent should handle cleanly, (b) tasks that *require* tool use (the answer isn't in the prompt, so the agent must select and call the right tool), and (c) tasks engineered to trip a known failure mode — a tool that returns an error on the first call (does it recover?), an ambiguous request (does it loop?), a distractor tool that looks relevant but is wrong (does it mis-select?). Pin the set; an eval set that drifts can't catch regressions. Each task carries its expected trajectory assertions (next step).
+3. **Score every trajectory on five axes, not one.** Final-answer correctness is necessary but insufficient. For each task, evaluate:
+   - **Tool selection** — did it call the right tool for each sub-goal? (mis-selection often produces a right answer via a wrong, slow path)
+   - **Argument correctness** — were the tool arguments right? (a `status: "open"` typo'd to `status: "all"` can still return the target row by luck)
+   - **Step efficiency** — did it stay within a step budget, or did it repeat calls, loop, or take a needless detour? Measure against a per-task budget, not a global one.
+   - **Error recovery** — when a tool returned an error, did the agent recover sensibly (retry once, switch approach) or thrash / give up?
+   - **Goal completion** — did it actually finish the task, distinct from "the final text looks plausible"?
+4. **Split scoring into programmatic assertions and a narrow LLM-judge — assert everything you can.** An LLM-judge over a whole trajectory is noisy and expensive, and it will rationalize a broken path. So check the deterministic axes with code: exact tool-name assertions, argument equality (or schema match), and step-count budgets are all plain comparisons against the trajectory you captured.
+   ```python
+   tools = [s["tool"] for s in trajectory]
+   assert tools[0] == "search_orders", f"wrong first tool: {tools[0]}"
+   assert trajectory[0]["args"]["status"] == "open"
+   assert len(trajectory) <= task["step_budget"], f"{len(trajectory)} steps > budget"
+   assert not any(s["is_error"] for s in trajectory[-2:]), "ended on an error"
+   ```
+   Reserve the LLM-judge for the genuinely subjective steps only — "was this reasoning step sound given the prior result?", "was this summary faithful to the tool output?" — and judge **one step at a time** with the step's inputs in context, not the entire run. Default both the agent-under-test and the judge to the latest, most capable Claude model (`claude-opus-4-8`); use a *different* sample or framing for the judge so it isn't grading its own twin, and keep the judge's rubric to one criterion per call.
+5. **Diff every candidate trajectory against a stored baseline and report the regressions.** This is what catches the silent ones. After a prompt or model change, re-run the fixed eval set and compare trajectory-for-trajectory against the baseline: tools added/removed/reordered, argument changes, step-count delta, new error-recovery loops, latency delta. A change that keeps the final answer correct but adds two steps, introduces a retry loop, or swaps a precise tool for a brute-force one is a **regression** — surface it even though the answer still passes. Promote a candidate to the new baseline only when the diff is empty or every change is reviewed and intended.
+> [!WARNING]
+> Grading only the final answer hides process failures. An agent can reach the right answer through a path that is broken, expensive, or lucky — wrong tool, redundant loop, a crash it recovered from by chance — and that path will break on the very next input. The final answer being correct is *not* evidence the agent worked correctly.
+> [!WARNING]
+> An LLM-judge over a whole trajectory is noisy and tends to rationalize whatever path it sees. Assert the checkable steps — tool names, argument values, step counts — with code, and give the judge exactly one subjective step and one criterion at a time. A judge asked "was this whole run good?" will hand-wave; a judge asked "was *this* summary faithful to *this* tool output?" gives a usable signal.
+## Output
+- **Trajectory schema** — the per-step record (decision, tool, args, result, is_error, latency, state) and where each field comes from in your agent loop or event stream.
+- **Per-axis rubric** — the five axes (tool selection, argument correctness, step efficiency, error recovery, goal completion) with the concrete check for each task.
+- **Assertion-vs-judge split** — the deterministic assertions written as code, and the short list of subjective steps routed to a single-criterion LLM-judge (agent and judge both on `claude-opus-4-8`).
+- **Baseline-diff regression report** — a per-task diff of the candidate run against the stored baseline (tools reordered/added/removed, arg changes, step-count and latency deltas, new recovery loops), flagging every regression even where the final answer still passes, plus a verdict on whether to promote the candidate to baseline.
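
Step 5 of the skill above describes diffing a candidate trajectory against a stored baseline. One way that comparison could look is sketched here, assuming each step is a dict in the per-step schema from step 1 (`tool`, `args`, `is_error`, `latency_ms`). The function `diff_trajectories` and its report keys are hypothetical illustrations, not something the skill file defines.

```python
def diff_trajectories(baseline: list[dict], candidate: list[dict]) -> dict:
    """Compare two runs of the same task (hypothetical helper).

    Both arguments are lists of per-step records using the schema
    from step 1 of the skill."""
    base_tools = [s["tool"] for s in baseline]
    cand_tools = [s["tool"] for s in candidate]
    # Step positions present in both runs where the tool matches but args differ.
    arg_changes = [
        i for i, (b, c) in enumerate(zip(baseline, candidate))
        if b["tool"] == c["tool"] and b["args"] != c["args"]
    ]
    report = {
        "tools_changed": base_tools != cand_tools,
        "step_delta": len(candidate) - len(baseline),
        "arg_changes": arg_changes,
        "new_errors": sum(s["is_error"] for s in candidate)
        - sum(s["is_error"] for s in baseline),
        "latency_delta_ms": sum(s["latency_ms"] for s in candidate)
        - sum(s["latency_ms"] for s in baseline),
    }
    # Any non-empty diff is flagged for review, even if the final answer passes.
    report["needs_review"] = bool(
        report["tools_changed"] or report["step_delta"] > 0
        or arg_changes or report["new_errors"] > 0
    )
    return report


args = {"customer_id": "C-118", "status": "open"}
baseline = [
    {"tool": "search_orders", "args": args, "is_error": False, "latency_ms": 410},
]
# Candidate run: first call errors, then a retry succeeds.
candidate = [
    {"tool": "search_orders", "args": args, "is_error": True, "latency_ms": 400},
    {"tool": "search_orders", "args": args, "is_error": False, "latency_ms": 420},
]
report = diff_trajectories(baseline, candidate)
```

In this made-up example the candidate still ends on a successful call, but the report shows one extra step and one new error, so it is flagged for review rather than promoted to baseline.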

package/content/skills/alerting-rules-tuner.md ADDED

@@ -0,0 +1,49 @@
+---
+name: "alerting-rules-tuner"
+description: "Cut alert noise and make every page mean something — rewrite alerting rules to fire on user-felt symptoms (error rate, latency SLO burn, failed requests) instead of causes (high CPU, full disk), with duration windows and severity routing so only urgent, actionable conditions reach a human. Use when on-call is fatigued by low-value pages, when real incidents get missed in the noise, or when alerts fire on causes rather than impact."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+On-call exhaustion is rarely an "alert quantity" problem you fix by muting things — it's an *altitude* problem. Pages fire on causes (a node at 95% CPU, a disk at 80%, a saturated thread pool) that may or may not hurt anyone, instead of on symptoms the user actually feels. This skill audits every rule against one question — *does this fire only when a human must act now?* — then rewrites the survivors to alert on symptoms with duration windows and severity routing, and demotes the rest to dashboards or tickets.
+## When to use this skill
+- On-call is fatigued: frequent pages that resolve themselves or need no action, night pages for non-urgent conditions.
+- Real incidents get missed because they're buried under low-value noise, or everyone has muted the channel.
+- Alerts fire on causes (CPU, memory, disk, queue depth, pod restarts) rather than user impact.
+- One incident generates a storm of 50 correlated pages instead of one.
+- You have alerts with no owner and no runbook — nobody knows what to do when they fire.
+- Standing up alerting for a new service and want to start symptom-first instead of bolting on host metrics.
+## Instructions
+1. **Inventory the rules and classify each as symptom or cause.** Grep the alerting config (`*.yml`/`*.yaml` Prometheus rules, Datadog monitor exports, Grafana alert JSON, Alertmanager routes) for every rule that pages a human. For each, label it: **symptom** (something the user experiences — request errors, latency, failed checkouts, SLO burn) or **cause** (a resource or internal metric — CPU, memory, disk, GC pause, replica lag, restart count). Causes belong on dashboards, not pagers.
+2. **Audit every paging rule with the single question.** For each rule ask: *does this fire only when a human must act, right now, with a clear action?* If the honest answer is "no" — it self-heals, it's informational, there's nothing to do at 3am — it is not a page. Downgrade it to a ticket or a dashboard panel. Keep paging only what's both urgent and actionable.
+3. **Define the symptom alert set at the user boundary.** Replace cause-pages with the symptoms they were trying to predict: request error rate (5xx / total), latency at a percentile that matters (p99 over SLO), failed business transactions (checkout/login failures), and SLO error-budget burn rate. Measure these where the user is — at the load balancer / ingress / API edge — not deep inside one component.
+4. **Add a duration window to every threshold.** No paging alert fires on an instantaneous value. Require the condition to hold `for: 5m` (tune per alert) so a single scrape blip or a 10-second spike clears itself. For graceful detection of both sudden outages and slow leaks, prefer multi-window, multi-burn-rate alerts (e.g. fast: 14.4x burn over 5m + 1h; slow: 6x over 30m + 6h) over a single fixed threshold.
+5. **Alert on rate-of-change / burn, not raw levels, where the level is naturally noisy.** "Disk is 80% full" pages constantly and means nothing; "disk will fill within 4 hours at the current fill rate" is actionable and rarely false. Same for error budgets: page on burn rate, not on a single bad minute.
+6. **Assign exactly one severity per rule and route accordingly.** Use three tiers and wire each to a destination: **page** (human-impacting, urgent, actionable → PagerDuty/Opsgenie, wakes someone), **ticket** (needs attention this week, not now → issue tracker), **info** (awareness only → Slack/dashboard, never pages). The default for anything you're unsure about is *not* page.
+7. **Deduplicate and group correlated alerts into one notification.** One incident must produce one page, not fifty. Group by incident dimension (service, cluster, region) in Alertmanager `group_by` / Datadog grouping, set `group_wait`/`group_interval` so the storm coalesces, and add inhibition rules so a parent symptom (whole service down) suppresses the child causes (every dependent check failing).
+8. **Attach an owner and a runbook link to every surviving alert.** Each paging rule gets an owning team (label/tag) and a `runbook_url` annotation pointing at concrete steps — first checks, dashboards, mitigation, escalation. If you can't write a runbook because there's no clear response, that's the signal the alert shouldn't page.
+> [!WARNING]
+> Paging on causes — CPU, memory, disk, queue depth — instead of user-felt symptoms is the single largest source of alert fatigue. A box can run hot all day while users are perfectly happy; a box can look idle while requests fail. Page on the symptom; keep the cause on a dashboard for when you're already investigating.
+> [!WARNING]
+> An alert with no runbook and no action is noise by definition. If the response to a page is "ack it and watch," it should not have woken anyone. Thresholds without a duration window flap on every transient spike — never ship a paging rule without a `for:` window.
+## Output
+A revised alerting plan, ready to apply to the config:
+- **Symptom alert set** — a table of paging alerts: name, signal (the user-facing metric), threshold + duration window (or burn-rate windows), and severity. Every row is urgent and actionable.
+- **Demoted rules** — the cause-metrics removed from paging, each annotated with where it went (dashboard panel name, or ticket-severity monitor) and why it isn't a page.
+- **Routing + dedup map** — severity → destination table, the `group_by` keys, and inhibition rules (parent symptom suppresses child causes).
+- **Ownership/runbook mapping** — for each surviving alert: owning team + `runbook_url`, flagging any alert that lacks a runbook as a candidate for deletion.

package/content/skills/canary-release-planner.md ADDED

@@ -0,0 +1,35 @@
+---
+name: "canary-release-planner"
+description: "Design a canary / progressive rollout so a bad release reaches 1% of users instead of 100% — staged traffic with bake times, gating metrics compared against the concurrently-running stable baseline, and automated promote-or-rollback. Use when shipping a risky change, when you want automatic rollback on regression, or when moving off all-at-once deploys."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+An all-at-once deploy is a single bet: CI is green, so you flip 100% of users onto new code and hope. A canary changes the bet — it routes a small, growing slice of real traffic to the new version, watches it against the version still serving everyone else, and either promotes it or rolls it back automatically. This skill produces that plan: the stages and bake times, the metrics that gate each promotion, the rollback trigger, and the data/session prerequisites that decide whether a canary is even safe for this change.
+## When to use this skill
+- You're shipping a change risky enough that a bad version reaching every user at once is unacceptable (auth, payments, a hot path, a dependency bump).
+- You want regressions to trigger an automatic rollback instead of waiting for an on-call human to notice and react.
+- You're moving a service off all-at-once / blue-green flips onto progressive delivery and need a concrete stage-and-gate plan.
+- A previous "it passed CI" deploy caused a production incident, and you want the blast radius capped before the next one.
+## Instructions
+1. **Define the rollout stages and a bake time at each.** Lay out an increasing traffic schedule — e.g. `1% → 10% → 50% → 100%` — and assign each stage a **bake time** long enough for the relevant signals to surface (cover at least one full traffic cycle for the failure mode you fear: cache fills, cron jobs, retries, a login spike). The first stage should be small enough that its failure is a non-event; the bake time, not the percentage, is what lets a slow leak (memory, connection exhaustion, a rare code path) show itself before the next promotion. Don't jump straight to 50%.
+2. **Pick the metrics that gate promotion.** Choose a small set that reflects user pain: **error rate** (5xx / failed requests), **latency percentiles** (p95/p99, never the mean — the mean hides the tail that churns users), and one or two **business/health signals** that catch silent failures the error rate won't (checkout completions, sign-ups, queue depth, a 200-with-empty-body). A deploy can be 200-OK and still be broken; the business metric is what catches that.
+3. **Set thresholds as canary-vs-baseline, not absolute.** For each gating metric, define a pass/fail rule comparing the **canary** to the **concurrently-running stable version**, e.g. "canary error rate ≤ stable + 0.5 percentage points" and "canary p99 ≤ 1.2× stable p99." Both versions take a slice of the *same live traffic at the same time*, so time-of-day, weekday, and load differences cancel out and the only variable left is the new code.
+4. **Automate the promote-or-rollback decision.** At the end of each bake time: if every gating metric is within threshold, promote to the next stage; if any breaches, **auto-rollback** — shift 100% of traffic back to stable immediately. Make rollback fast and safe: it must be a traffic-weight change (drain the canary, don't kill in-flight requests), require no new build, and not depend on the canary being healthy enough to cooperate. A rollback that needs a redeploy is too slow to matter during an incident.
+5. **Guarantee schema compatibility across both versions.** During the rollout the old and new code hit the **same database simultaneously**. Every schema change must work with both the old and the new code for the duration of the canary (and after a rollback), so use **expand-contract / parallel-change** migrations: add the new column/table (expand) and deploy code that writes both, run the canary, then remove the old shape (contract) only after the new version owns 100%. Pair with `strangler-fig-migrator` for larger cutovers.
+6. **Pin session affinity so a user doesn't flip versions mid-flow.** Route by a stable key (user ID, session cookie) so a given user stays on canary *or* stable for the whole session. Without it, a user can bounce between versions from one request to the next, which causes half-applied multi-step flows, cache/state mismatches, and metrics that can't be attributed to either version. Affinity also makes the canary-vs-stable comparison clean.
+7. **Choose the routing dimension deliberately.** Decide whether the canary is a **percentage of traffic** (simplest, representative) or a **user segment** (internal staff → beta cohort → region → everyone) when you want known, tolerant users to absorb the first hit. Segment routing trades statistical representativeness for a friendlier blast radius — state which you chose and why.
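Steps 3 and 4 above (canary-vs-baseline thresholds and the promote-or-rollback decision) can be sketched as a pure function over two metric snapshots taken during the same bake window. This is a minimal sketch, not the skill's implementation: the `Snapshot` fields, the `checkout_rate_pct` business signal, and the 0.9 business-ratio threshold are assumptions made for illustration; the 0.5-point and 1.2× limits are the examples from step 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    """Metrics for one version, measured over the same bake window."""
    error_rate_pct: float      # failed requests as a percentage of requests
    p99_ms: float              # 99th-percentile latency in milliseconds
    checkout_rate_pct: float   # hypothetical business signal: checkouts per 100 sessions


def breaches(canary: Snapshot, stable: Snapshot,
             max_error_delta_pp: float = 0.5,
             max_p99_ratio: float = 1.2,
             min_business_ratio: float = 0.9) -> List[str]:
    """Return the names of gating metrics where the canary is worse than allowed."""
    found = []
    if canary.error_rate_pct > stable.error_rate_pct + max_error_delta_pp:
        found.append("error_rate")
    if canary.p99_ms > stable.p99_ms * max_p99_ratio:
        found.append("p99")
    if canary.checkout_rate_pct < stable.checkout_rate_pct * min_business_ratio:
        found.append("checkout_rate")
    return found


def decide(canary: Snapshot, stable: Snapshot) -> str:
    """Any single breach forces a rollback; otherwise promote to the next stage."""
    return "rollback" if breaches(canary, stable) else "promote"
```

The business signal is expressed as a rate rather than a raw count, since a canary taking 1% of traffic will always have fewer absolute checkouts than stable.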
+> [!WARNING]
+> Comparing the canary to a *historical* baseline (yesterday, last week, a stored average) instead of the stable version running right now produces false verdicts. Traffic and latency swing with time of day and day of week, so a healthy canary at peak can look "regressed" against an off-peak baseline — and a genuinely bad canary can hide inside normal variance. Always gate against the concurrently-running stable version.
+> [!WARNING]
+> A canary is unsafe when the release contains a non-backward-compatible schema change. Both versions query the same database during the rollout, so a breaking migration breaks one version no matter the traffic split. Decouple it: ship the migration as a backward-compatible expand step first, canary the code, then contract afterward.
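The session-affinity requirement in step 6 is often met by hashing a stable key into a fixed bucket and comparing the bucket to the current canary percentage. The sketch below assumes a user ID as the key and a per-release salt; the salt value and function names are hypothetical. Real traffic routers (service meshes, load balancers) provide their own mechanisms for this.

```python
import hashlib


def bucket(stable_key: str, salt: str = "example-release") -> int:
    """Map a stable key to a bucket in 0..9999 (hundredths of a percent)."""
    digest = hashlib.sha256(f"{salt}:{stable_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def route(stable_key: str, canary_percent: float, salt: str = "example-release") -> str:
    """Send a key to the canary when its bucket falls below the canary share."""
    return "canary" if bucket(stable_key, salt) < canary_percent * 100 else "stable"
```

Because the bucket depends only on the key and the salt, a user keeps the same version for a whole stage, and a user already on the canary at 1% stays on it when the stage grows to 10%.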
+## Output
+A canary rollout plan containing: (1) the **stage schedule** — traffic percentages and the bake time at each, with the reason each bake time is long enough; (2) the **gating metrics** — error rate, latency percentiles, and the business/health signal(s), each with an explicit **canary-vs-baseline** pass/fail threshold; (3) the **auto-rollback trigger** — which breach forces a rollback and the (fast, build-free) mechanism that executes it; and (4) the **prerequisites** — the expand-contract schema plan confirming both versions are DB-compatible, and the session-affinity key. Reproducible: the same plan re-runs for the next release by swapping in its metrics and thresholds.

package/content/skills/cold-start-optimizer.md ADDED

@@ -0,0 +1,83 @@
+---
+name: "cold-start-optimizer"
+description: "Cut cold-start latency for serverless functions and slow-booting apps by measuring the init breakdown, then attacking the dominant phase — artifact size, eager imports, eager connections, or under-provisioned memory — instead of reflexively buying provisioned concurrency. Use when serverless p99 spikes on the first request, when a function times out during init, or when scale-to-zero is hurting user-facing latency."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A cold start is not one number — it is runtime boot, dependency/module load, framework init, and first-connection setup stacked on top of each other, and you are usually optimizing a guess about which one dominates. This skill makes it measured: split the init into phases, find the phase that actually costs you, and attack *that* — shrink the artifact and lazy-load the heavy deps off the first-request path, hoist one-time work to module scope so warm invocations reuse it, right-size memory (more CPU often means a *faster and cheaper* cold start), and reuse connections across invocations instead of opening a fresh one every cold start. Provisioned concurrency / keep-warm is the last resort for genuinely latency-critical paths, not the first reflex — because it bills you to mask a slow init rather than fixing it.
+## When to use this skill
+- Serverless p99 (or p99.9) spikes on the first request after a quiet period, while warm requests are fast.
+- A function intermittently times out *during init* — before your handler code even runs.
+- Scale-to-zero or aggressive autoscaling is hurting user-facing latency on a path that can't tolerate a 2–5s tail.
+- You've been told to "just turn on provisioned concurrency" and want to know whether the init is fixable first (and cheaper).
+- A deploy bloated the artifact (new dependency, bundling change) and cold starts regressed.
+## Instructions
+1. **Measure the cold start and split it into phases — don't optimize a guess.** Force a cold start (deploy a new version, or wait out the platform's idle timeout) and capture the init timeline, not just the total. Most platforms expose it: AWS Lambda `INIT_START`/`REPORT` log lines (`Init Duration` is the pre-handler cost) plus X-Ray init subsegments; GCP/Cloud Run startup probe + request logs; Vercel function logs. Instrument the four phases yourself with timestamps at module load:
+   - **runtime boot** — the platform spinning up the sandbox/container and language runtime (you can't change this much, but you must know its share).
+   - **dependency/module load** — `require`/`import` of your code and its tree, top-to-bottom.
+   - **framework init** — ORM bootstrap, DI container, route table build, config parse, schema/codegen load.
+   - **first-connection setup** — DB handshakes, TLS, secret-manager fetches, warm-up calls.
+   Attribute a millisecond cost to each. You optimize the dominant phase; everything else is noise until that one shrinks.
+2. **Shrink the deployment artifact and lazy-load heavy deps off the first-request path.** A giant bundle inflates both runtime boot (more to unpack) and module load (more to parse). Tree-shake and bundle (esbuild/`@vercel/nft`/webpack) so you ship the function's actual closure, not the whole `node_modules`; exclude the AWS SDK / platform SDK that the runtime already provides; strip source maps and dev deps from the package. Then find the imports that aren't needed for the *first* request — a PDF renderer, an image library, an analytics client, a markdown engine — and move them behind a lazy `await import()` / deferred `require` inside the code path that needs them, so they never touch init. Grep the entry module for top-level imports of known-heavy packages and ask of each: does request #1 use this?
+3. **Hoist one-time work to module scope so warm invocations reuse it — but don't connect eagerly.** Config parsing, client *construction*, schema compilation, and validator building should run once at module load and be captured in module-scope variables, so the platform's instance reuse amortizes them across every warm invocation on that instance. The sharp distinction: **construct** clients at module scope, but **connect** lazily. Build the DB pool / HTTP client object at module load (cheap, no I/O); open the actual connection on first use inside the handler, and reuse it across subsequent invocations on the same warm instance. Eager top-level `await pool.connect()` adds connection latency to *every* cold start and turns a traffic burst into a connection storm.
+4. **Reuse connections across invocations via instance reuse; never open a fresh connection per cold start.** Store the connection/pool in a module-scope (or `globalThis`) variable so a warm instance hands it back instead of reconnecting. Size the per-instance pool to **1–2 connections**, not 20: each concurrent serverless instance gets its own pool, so total connections equal the per-instance pool size multiplied by the instance count, and a large pool will blow past the database's `max_connections` under burst. For Postgres at high concurrency, point functions at a transaction-mode pooler (PgBouncer/RDS Proxy/Supabase pooler) rather than the database directly. Set a connection idle timeout shorter than the platform's instance-freeze window so dead connections don't accumulate.
+5. **Right-size memory: on many platforms it buys CPU, so more memory can mean a faster AND cheaper cold start.** On Lambda (and similar platforms) CPU and network scale with the memory setting, and cold-start work is largely CPU-bound (parsing, JIT, framework init). Bumping 128MB → 512MB–1GB can cut the cold start by enough that the *higher per-ms price × shorter duration* is lower total cost, which is the classic counter-intuitive win. Sweep a few memory settings against the same forced-cold-start workload and pick the point on the cost-vs-latency curve; don't assume the smallest tier is cheapest.
+6. **Use provisioned concurrency / keep-warm only for genuinely latency-critical paths — after init is already fast.** If a path truly can't tolerate any cold tail (checkout, auth, a synchronous user-facing API), provision N warm instances to cover baseline concurrency. But apply it last, sized to real concurrency (not a round number), and only once steps 1–5 have made the init itself fast — because provisioning a 4-second init just means you pay 24/7 to keep a slow thing warm, and any burst beyond your provisioned count still pays the full cold start.
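The memory sweep in step 5 comes down to simple arithmetic: cost scales with memory multiplied by billed duration, so a larger setting wins whenever it shortens the duration by more than it raises the per-millisecond price. The sketch below assumes GB-second billing; the price is passed in as a parameter and the sweep measurements are invented for illustration, so substitute your platform's current pricing and your own measured durations.

```python
from typing import Dict


def invocation_cost(memory_mb: int, billed_ms: float, price_per_gb_second: float) -> float:
    """Cost of one invocation under GB-second billing (an assumed pricing model)."""
    return (memory_mb / 1024) * (billed_ms / 1000) * price_per_gb_second


def cheapest_setting(sweep: Dict[int, float], price_per_gb_second: float) -> int:
    """Given {memory_mb: measured_ms}, return the memory setting with the lowest cost."""
    return min(sweep, key=lambda mb: invocation_cost(mb, sweep[mb], price_per_gb_second))


# Hypothetical sweep: memory setting (MB) -> measured cold-start duration (ms).
example_sweep = {128: 4000.0, 512: 900.0, 1024: 600.0}
```

With these made-up numbers, 512MB is both faster and cheaper than 128MB, while 1024MB is faster still but costs more, which is the cost-vs-latency trade-off the step asks you to pick a point on.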
+> [!WARNING]
+> Opening a fresh DB connection on every cold start — instead of reusing one across warm invocations — is the classic serverless outage. Under a traffic spike, every new instance opens its own connections simultaneously, the database hits `max_connections`, and *every* request (warm ones included) starts failing. Construct the client at module scope, connect lazily, reuse across invocations, and cap the per-instance pool low. Use a transaction-mode pooler when instance count can exceed the DB's connection limit.
+> [!CAUTION]
+> Keep-warm and provisioned concurrency **mask** a slow init; they don't fix it — and they bill you continuously for the masking. If you reach for them before measuring, you'll pay 24/7 to hide a 3s init that two hours of lazy-loading would have cut to 400ms, and you'll *still* eat the full cold start on every burst beyond your provisioned count. Fix the init first; provision only the residual.
+## Output
+1. **Cold-start breakdown by phase** — the measured init timeline showing where the milliseconds actually go, so the dominant cost is obvious before any change:
+```text
+Cold start breakdown — POST /api/checkout (Lambda, 256MB, node20)
+Total cold init: 2,840 ms   (warm: 38 ms)
+  runtime boot ................   210 ms   7%   (platform; fixed)
+  dependency/module load ......  1,520 ms  54%  <- DOMINANT
+      stripe sdk (eager) .........  340 ms
+      @prisma/client (eager) .....  610 ms
+      pdfkit (eager, unused @ req#1) 470 ms
+  framework init ..............    180 ms   6%   prisma engine bootstrap
+  first-connection setup ......    930 ms  33%  top-level await pool.connect()
+```
+2. **Targeted fixes** — ordered by the phase that dominates, each with the specific change and why it lands:
+```text
+1. Lazy-load pdfkit behind await import() in the receipt path .. -470 ms  [HIGH]
+   Not used by request #1; only the async receipt job needs it.
+2. Move pool.connect() out of top-level await; connect on first
+   handler use, reuse across invocations; pool max 2 ................ -930 ms cold,
+   + eliminates connection-storm risk under burst .................. [HIGH]
+3. Bump memory 256MB -> 1024MB (CPU scales) ................... -640 ms  [HIGH]
+   Faster parse + prisma init; est. total cost -18% (shorter ms).
+4. Bundle with esbuild, exclude aws-sdk (runtime-provided),
+   strip source maps ................................................ -210 ms  [MED]
+5. Provisioned concurrency = 3 on /checkout ONLY, after the above ... covers
+   baseline concurrency; residual bursts now cost ~600ms not 2,840.  [LAST]
+```
+3. **Measured before/after** — the re-measured cold start after applying the fixes, proving the dominant phase actually shrank (and noting cost impact, since memory and provisioning change the bill):
+```text
+Cold init: 2,840 ms -> 620 ms  (-78%)   p99 first-request: 3.1s -> 0.7s
+Monthly cost: roughly flat (higher memory offset by shorter duration;
+provisioned-concurrency on /checkout adds ~$X for 3 warm instances).
+Re-measure after a real burst, not a single forced cold start.
+```
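A per-phase breakdown like the one in Output item 1 can be collected with timestamps around each init phase. The sketch below is one possible way to do it, assuming the phases run inside your own process; `PhaseTimer` is a hypothetical helper, and the runtime-boot share is assumed to come from platform logs since it happens before your code starts.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseTimer:
    """Accumulate elapsed milliseconds per named init phase."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.phases: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.phases[name] = self.phases.get(name, 0.0) + elapsed_ms

    def shares(self) -> Dict[str, float]:
        """Each phase as a percentage of the measured total."""
        total = sum(self.phases.values())
        return {n: (ms / total * 100.0 if total else 0.0) for n, ms in self.phases.items()}

    def dominant(self) -> str:
        """The phase to attack first."""
        return max(self.phases, key=lambda n: self.phases[n])


timer = PhaseTimer()
with timer.phase("dependency/module load"):
    import json  # stand-in for the function's real import tree
with timer.phase("framework init"):
    config = json.loads('{"region": "example"}')  # stand-in for real framework setup
```

Logging `timer.phases` once per cold start gives the before/after numbers needed to confirm that the dominant phase actually shrank.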