npm - agentscamp - Versions diffs - 0.3.0 → 0.5.0 - Mend

agentscamp 0.3.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

package/README.md +3 -3
package/content/commands/add-caching.md +79 -0
package/content/commands/audit-accessibility.md +101 -0
package/content/commands/clean-branches.md +113 -0
package/content/commands/review-tests.md +98 -0
package/content/commands/scaffold-github-action.md +94 -0
package/content/commands/setup-precommit-hooks.md +72 -0
package/content/commands/write-design-doc.md +78 -0
package/content/manifest.json +425 -3
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/connection-pool-tuner.md +46 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/dependency-upgrade-planner.md +42 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/memory-leak-hunter.md +35 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/pagination-designer.md +51 -0
package/content/skills/property-test-designer.md +63 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/security-headers-hardener.md +79 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/slo-definer.md +38 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/structured-logging-designer.md +42 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/package.json +1 -1

package/content/skills/mutation-test-runner.md ADDED

@@ -0,0 +1,64 @@
+---
+name: "mutation-test-runner"
+description: "Measure whether a test suite actually catches bugs by running mutation testing — introduce small faults into the code and check which ones a test kills versus which slip through silently. Use when line coverage is high but bugs still ship, when you suspect tests assert weakly, or to find the exact assertions a suite is missing."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Line coverage tells you a line ran during a test. It does not tell you the test would fail if that line were wrong — a function can be 100% covered by an assertion-free test. Mutation testing closes that gap: it plants small faults in the code (flip `>` to `>=`, swap `+` for `-`, drop a statement, negate a condition) and re-runs the suite against each one. A mutant that makes a test fail is **killed** — the suite pins that behavior. A mutant that passes everything **survives** — no test noticed the code changed, so that behavior is unprotected. This skill runs a mutation tool, reads the survivors as a precise to-do list of missing assertions, and tells you exactly which tests to add to kill them.
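The killed-versus-survived distinction above can be illustrated with a hand-written mutant. This is only a sketch: `apply_discount`, its 10% rule, and both tests are hypothetical, and a real mutation tool generates and runs the mutants for you.

```python
def apply_discount(qty: int, price_cents: int) -> int:
    """Original rule (hypothetical): 10% off only when qty is strictly above 10."""
    total = qty * price_cents
    return total * 90 // 100 if qty > 10 else total


def apply_discount_mutant(qty: int, price_cents: int) -> int:
    """Boundary mutant: `>` flipped to `>=`."""
    total = qty * price_cents
    return total * 90 // 100 if qty >= 10 else total


def weak_test(fn) -> bool:
    """Executes both branches (full line coverage) but never touches qty == 10."""
    return fn(20, 100) == 1800 and fn(1, 100) == 100


def boundary_test(fn) -> bool:
    """Pins the exact threshold: no discount at qty == 10."""
    return fn(10, 100) == 1000


def mutant_status(test) -> str:
    """A mutant is killed when a test that passes on the original fails on the mutant."""
    assert test(apply_discount), "the test must pass on unmutated code"
    return "survived" if test(apply_discount_mutant) else "killed"


print(mutant_status(weak_test))      # survived
print(mutant_status(boundary_test))  # killed
```

Both tests give the function full line coverage; only the one that asserts at the boundary notices the mutation.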
+## When to use this skill
+- Coverage is high (80–100%) but bugs still slip into production — the classic symptom of covered-but-unasserted code.
+- You inherited or reviewed a suite and suspect the tests assert weakly (snapshot-only, no return-value checks, `toBeDefined` instead of `toEqual`).
+- A module is critical (auth, money, parsing, pricing) and you want proof the suite would catch a regression, not just that it touches the lines.
+- You're hardening a specific change and want the missing assertions for *that diff*, not a repo-wide audit.
+> [!WARNING]
+> 100% line coverage with surviving mutants is the false confidence this skill exists to expose: the code runs in a test, but no assertion would fail if the code were wrong. A green coverage badge is not a green mutation score.
+## Instructions
+1. **Pick the tool for the language — don't guess, check what's installed.** Inspect deps and config first:
+   - JS/TS: **Stryker** (`@stryker-mutator/core`, config `stryker.conf.json`/`.mjs`); it auto-detects Jest/Vitest/Mocha runners.
+   - Python: **mutmut** (`mutmut run`, config in `setup.cfg`/`pyproject.toml`) or **cosmic-ray** for larger suites.
+   - Java/Kotlin: **PIT** (`pitest`, Maven/Gradle plugin). Go: **go-mutesting** or **gremlins**. Ruby: **mutant**. C#: **Stryker.NET**.
+   - If no tool is installed, recommend the standard one for the stack and stop there — do not silently add a dev dependency.
+2. **Scope the run to changed code; this is mandatory, not an optimization.** Mutation testing re-runs the suite once per mutant, so a whole-repo run can take hours. Target the diff or a single package: Stryker `--mutate "src/pricing/**/*.ts"` (Stryker.NET can also diff against a branch with `--since:main`, and StrykerJS can reuse earlier results with `--incremental`), mutmut `--paths-to-mutate src/billing/`, PIT `targetClasses` set to the changed package. State the chosen paths up front so the run is reproducible.
+3. **Run and collect the surviving mutants, not the summary number.** Execute the tool and read its detailed report (Stryker's `mutation.html`/`--reporter json`, mutmut `mutmut results` + `mutmut show <id>`, PIT's `mutations.xml`). For each survivor capture: file, line, the original code, and the exact mutation that lived (e.g. `boundary: changed >= to >` or `removed call to logAudit()`).
+4. **Triage each survivor: real gap or equivalent mutant.** An **equivalent mutant** changes the code without changing observable behavior — e.g. `i <= n-1` vs `i < n`, reordering commutative operations, mutating a value that's overwritten before use. These *cannot* be killed by any test; mark them `equivalent — ignore` with a one-line reason and move on. Everything else is a genuine gap: a behavior your tests don't constrain.
+5. **For each real survivor, name the assertion that would kill it.** This is the payoff. A survived `changed > to >=` on a discount threshold means no test exercises the exact boundary — propose "`applyDiscount(qty=10)` where the rule is `qty > 10`: assert no discount at exactly 10." A survived `removed call to audit()` means nothing asserts the side effect — propose "assert `auditLog` received one entry after `transfer()`." Write the input and the expected behavior, not "add a test for line 42."
+6. **Group survivors by file and track the score where it's worth defending.** Report the mutation score (killed / total non-equivalent) per scoped path as a *baseline to hold or raise on critical modules*, never as a vanity 100% target — chasing the last few percent usually means fighting equivalent mutants. Record the baseline so the next run can detect regressions.
+> [!NOTE]
+> Two survivors that share a root cause often need one assertion. A function where every arithmetic and boundary mutant survives usually has a single test that calls it and asserts only that it didn't throw — adding one real return-value assertion can kill the whole cluster at once.
+> [!WARNING]
+> If a mutation run "passes" with zero survivors but also shows mutants marked **no coverage** or **timeout**, the suite isn't strong — those mutants were never actually tested. No-coverage mutants are a coverage gap (hand them to `coverage-gap-finder`); timeouts often mean a mutant created an infinite loop the suite can't detect. Don't read them as kills.
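The scoring rule in step 6 and the warning about unverified mutants can be sketched as below. The status labels and the breakdown of the 47 mutants are assumptions made for illustration; real tools use their own status names and report formats.

```python
from collections import Counter


def summarize(statuses):
    """Score = killed / all non-equivalent mutants.

    No-coverage and timeout mutants stay in the denominator and are reported
    separately, so they are never read as kills.
    """
    counts = Counter(statuses)
    non_equivalent = sum(n for status, n in counts.items() if status != "equivalent")
    score = counts["killed"] / non_equivalent if non_equivalent else 0.0
    return {
        "score": score,
        "killed": counts["killed"],
        "non_equivalent": non_equivalent,
        "unverified": counts["no_coverage"] + counts["timeout"],
    }


# Assumed breakdown: 34 killed, 3 survived, 5 no-coverage, 5 equivalent (47 total).
statuses = (
    ["killed"] * 34 + ["survived"] * 3 + ["no_coverage"] * 5 + ["equivalent"] * 5
)
report = summarize(statuses)
print(f"{report['score']:.0%} ({report['killed']} / {report['non_equivalent']})")
# 81% (34 / 42)
```

Keeping the unverified count next to the score makes it harder to mistake an untested mutant for a strong suite.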
+## Output
+A survivor report grouped by file, plus the run scoping so it's reproducible:
+```
+Scope: src/billing/**  (mutated 47 mutants, 90s)
+Mutation score: 81%  (34 killed / 42 non-equivalent) — baseline, hold >=80 on billing
+src/billing/discount.ts
+  SURVIVED  L23  changed `qty > 10` -> `qty >= 10`   [BOUNDARY]
+    Gap: no test hits the exact threshold.
+    Add: applyDiscount({ qty: 10 }) -> assert price unchanged (no discount at boundary)
+  SURVIVED  L31  removed call to `roundCents(total)`  [STATEMENT]
+    Gap: nothing asserts the rounded result.
+    Add: applyDiscount({ qty: 12, price: 3.337 }) -> assert total === 33.37 (not 33.3696)
+src/billing/invoice.ts
+  SURVIVED  L58  changed `&&` -> `||` in isOverdue guard  [LOGICAL]
+    Gap: only the both-true case is tested.
+    Add: isOverdue({ pastDue: true, paid: true }) -> assert false
+  EQUIVALENT L72  `i <= len-1` -> `i < len`  — ignore (same iteration count)
+No-coverage: 5 mutants in src/billing/legacy.ts -> route to coverage-gap-finder (not killed).
+```
+Each surviving line is a missing assertion; the `Add:` lines are concrete enough to hand straight to a test scaffolder. Re-run the same scope after adding them to confirm the survivors flip to killed and the score holds.

package/content/skills/pagination-designer.md ADDED

@@ -0,0 +1,51 @@
+---
+name: "pagination-designer"
+description: "Design correct, scalable pagination (plus the filtering and sorting that ride with it) for a list endpoint — pick cursor (keyset) vs offset and justify it, define an opaque cursor with a unique tiebreaker so no row is skipped or repeated, return a consistent envelope, bound page size, and name the indexes the sort actually needs. Use when adding a list endpoint, when OFFSET pagination crawls on a large table, or when clients see duplicate or missing rows while paging."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Pagination looks trivial until the table grows or the data moves under the reader. `OFFSET 100000` doesn't skip to row 100,000 — the database scans and throws away the first 100,000 matching rows on every request, so latency climbs linearly with depth. And sorting by a non-unique column (`created_at`, `name`, `score`) without a tiebreaker gives a *partial* order: rows that tie can reorder between requests, so paging skips some and shows others twice. This skill makes the pagination scheme an explicit decision — keyset vs offset, the cursor encoding, the tiebreaker, the page-size bounds, and the indexes — and defines how filtering and sorting compose with it.
+## When to use this skill
+- You're adding a list/collection endpoint and need to decide how clients page through it.
+- An existing `OFFSET`/`LIMIT` endpoint is fast on page 1 and slow on page 500, or it times out on deep pages.
+- Clients report seeing the same row twice or missing rows entirely while scrolling — the classic symptom of an unstable sort under concurrent inserts/deletes.
+- The list is large, append-heavy, or actively changing (feeds, logs, events, search results) and you need stable paging that doesn't drift as rows are added.
+## Instructions
+1. **Choose cursor (keyset) vs offset from the dataset, and justify it.**
+   - **Cursor / keyset** — the default for large or actively-changing data. Instead of `OFFSET`, the next page *seeks* on the sort key: `WHERE (created_at, id) < (:last_created_at, :last_id) ORDER BY created_at DESC, id DESC LIMIT :n`. It's stable under inserts/deletes (each page is anchored to a real row, not a positional count) and stays fast at any depth because it uses an index range scan instead of scanning prior rows. Cost: no random page jumps, no total page count.
+   - **Offset / limit** — acceptable only for **small, stable, human-paginated** lists where users click numbered pages (an admin table of a few thousand rows). It allows arbitrary jumps and easy "page 7 of 20" UIs. Never use it for infinite scroll, large tables, or feeds.
+   State which you chose and the property (depth performance + stability vs random access) that drove it.
+2. **Always include a unique tiebreaker so the sort order is total.** A cursor seeking on a non-unique column alone (`created_at`) can't disambiguate ties: two rows with the same timestamp have no defined relative order, so one can land on both sides of a page boundary. Encode the user-facing sort key **plus a unique, monotonic tiebreaker** (the primary key), so the cursor compares on the tuple `(created_at, id)`. This makes the order total: every row has exactly one position, so no row is skipped or repeated. A sort on `id` alone is already unique and needs nothing extra, but every user-chosen sort needs the explicit `, id` tiebreaker appended.
+3. **Make the cursor opaque.** Encode the tuple `(sort_key_value, tiebreaker_value)` (and, if filters/sort are part of the page identity, a version or the sort direction) into a single base64url token — `next_cursor: "eyJjcmVhdGVkX2F0IjoiMjAyNi0wNi0xN1QwOTozMDowMFoiLCJpZCI6IjQ4ODEyIn0"`. Opaque means clients treat it as a blob and pass it back verbatim; you keep freedom to change the internal encoding without breaking them. Do **not** expose raw `(timestamp, id)` as query params — clients will hand-craft them, couple to your schema, and break on the next change.
+4. **Return one consistent envelope.** Every list endpoint returns the same shape:
+   ```json
+   { "data": [ ... ], "next_cursor": "…", "has_more": true }
+   ```
+   `next_cursor` is `null` when there are no more rows. Derive `has_more` reliably by fetching `LIMIT n + 1`: if you get `n + 1` rows, there's another page — drop the extra row and set `next_cursor` from the last *kept* row. This avoids a separate `COUNT` and is correct even when the last page is exactly full. Do not return a total count for keyset pagination; computing it scans the whole filtered set and defeats the point.
+5. **Bound page size with a sane default and a hard max.** Read the page size from `limit` (or `page_size`), clamp it: default 20–50, hard max 100–200 — never unbounded. An unbounded `limit` lets one client request a million rows and OOM the server or exhaust the DB. Clamp silently (return `min(requested, max)`) and document the cap.
+6. **Name the indexes the sort actually needs — this is non-negotiable for keyset.** The `ORDER BY (sort_key, tiebreaker)` and the `WHERE (sort_key, tiebreaker) < (...)` seek are only fast if a **composite index on those exact columns in that exact order and direction** exists. Sorting `created_at DESC, id DESC` needs an index supporting that; a plain index on `created_at` alone forces a sort and undoes the win. If filters narrow the set, the index should lead with the equality-filter columns, then the sort columns: `(tenant_id, created_at, id)` for a query filtered by tenant and sorted by time. Verify the index exists or flag it as required.
+7. **Define how filtering and sorting compose with the cursor.** The cursor is only valid *for the filter and sort it was issued under* — a cursor minted for `?status=active&sort=created_at` is meaningless if the next request changes `status` or `sort`. Specify the contract: which fields are filterable, which are sortable (whitelist them — never interpolate a client-supplied column name into `ORDER BY`), and that **changing any filter or sort param invalidates the cursor and resets to the first page**. For multi-column sorts, the tiebreaker is appended after *all* user sort columns, and the seek predicate must compare the full tuple (row-value comparison `(a, b, c) < (:a, :b, :c)`, not `a < :a OR (a = :a AND b < :b) OR …` unless your engine lacks tuple comparison).
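Steps 1 through 6 can be sketched end to end with Python's built-in `sqlite3`, which supports row-value comparison. Everything named here is a hypothetical stand-in: the `events` table, the index name, the limits of 20 and 100, and the helper functions. A production endpoint would also bind the cursor to its filters and sort.

```python
import base64
import json
import sqlite3

DEFAULT_LIMIT = 20  # assumed default page size
MAX_LIMIT = 100     # assumed hard cap


def encode_cursor(created_at: str, row_id: int) -> str:
    """Pack the (sort key, tiebreaker) tuple into an opaque base64url token."""
    raw = json.dumps({"created_at": created_at, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def decode_cursor(token: str):
    padded = token + "=" * (-len(token) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return payload["created_at"], payload["id"]


def list_events(conn, limit=None, cursor=None):
    n = max(1, min(limit or DEFAULT_LIMIT, MAX_LIMIT))  # clamp silently
    sql, params = "SELECT id, created_at FROM events", []
    if cursor is not None:
        sql += " WHERE (created_at, id) < (?, ?)"  # seek past the last row seen
        params.extend(decode_cursor(cursor))
    sql += " ORDER BY created_at DESC, id DESC LIMIT ?"
    params.append(n + 1)  # fetch one extra row to learn has_more
    rows = conn.execute(sql, params).fetchall()
    has_more = len(rows) > n
    page = rows[:n]
    next_cursor = encode_cursor(page[-1][1], page[-1][0]) if has_more else None
    data = [{"id": r[0], "created_at": r[1]} for r in page]
    return {"data": data, "next_cursor": next_cursor, "has_more": has_more}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)")
conn.execute("CREATE INDEX events_created_id ON events (created_at DESC, id DESC)")
# Seven rows with deliberate timestamp ties.
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"2026-01-0{1 + i // 3}T00:00:00Z") for i in range(1, 8)],
)

seen, cursor = [], None
while True:
    page = list_events(conn, limit=3, cursor=cursor)
    seen += [row["id"] for row in page["data"]]
    if not page["has_more"]:
        break
    cursor = page["next_cursor"]
print(seen)  # [7, 6, 5, 4, 3, 2, 1]
```

Despite the tied timestamps, each id appears exactly once, because the seek compares the full `(created_at, id)` tuple.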
+> [!WARNING]
+> Deep `OFFSET` is O(n), not O(1). `OFFSET 100000 LIMIT 20` makes the database read and discard 100,000 matching rows before returning 20 — every request, getting worse as users page deeper, holding locks and burning IO the whole time. Page 1 being fast tells you nothing about page 5,000. If the table can grow large or users can reach deep pages, use keyset.
+> [!WARNING]
+> A non-unique sort key without a tiebreaker silently corrupts paging. With `ORDER BY created_at` and several rows sharing a timestamp, the engine may return those tied rows in a different order on the next request — so a row sitting on the page boundary gets skipped on one page and the previous boundary row reappears on the next. There is no error, just missing and duplicated data. Always append a unique tiebreaker (`, id`) to every sort.
+> [!NOTE]
+> Offset and keyset can coexist behind one envelope: serve numbered offset pages for a small admin UI and keyset for the public feed, both returning `{ data, next_cursor, has_more }` (offset endpoints additionally accept a `page` parameter and leave `next_cursor` null). Pick per endpoint from its access pattern rather than applying one rule to the whole API.
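One possible way to enforce the filter/sort contract from step 7 is to whitelist the fields and bind each cursor to a fingerprint of the query it was issued under. The field names, the SHA-256 fingerprint, and the helper functions below are all illustrative assumptions, not a prescribed design.

```python
import hashlib
import json

SORTABLE = {"created_at", "name"}     # hypothetical whitelist
FILTERABLE = {"status", "tenant_id"}  # hypothetical whitelist


def order_by_clause(sort: str) -> str:
    """Map a whitelisted sort field to SQL, always appending the id tiebreaker."""
    if sort not in SORTABLE:
        raise ValueError(f"unsupported sort field: {sort!r}")
    return f"ORDER BY {sort} DESC, id DESC"


def fingerprint(filters: dict, sort: str) -> str:
    """Stable digest of the filter and sort a cursor was issued under."""
    unknown = set(filters) - FILTERABLE
    if unknown:
        raise ValueError(f"unsupported filter fields: {sorted(unknown)}")
    if sort not in SORTABLE:
        raise ValueError(f"unsupported sort field: {sort!r}")
    canonical = json.dumps({"filters": filters, "sort": sort}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def cursor_is_valid(cursor_fingerprint: str, filters: dict, sort: str) -> bool:
    """A cursor only applies to the exact filter and sort it was minted for."""
    return cursor_fingerprint == fingerprint(filters, sort)


issued = fingerprint({"status": "active"}, "created_at")
print(cursor_is_valid(issued, {"status": "active"}, "created_at"))    # True
print(cursor_is_valid(issued, {"status": "archived"}, "created_at"))  # False
```

When the check fails, the endpoint can discard the cursor and serve the first page, which is the reset behavior the contract describes.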
+## Output
+A pagination spec stating: the chosen **scheme** (cursor vs offset) + rationale; the **response envelope** (`data` / `next_cursor` / `has_more`, with the `null`-when-done and `LIMIT n+1` rules); the **cursor encoding** — the exact tuple `(sort key, unique tiebreaker)` and that it's base64url-opaque; the **page-size** default and hard max; the **required indexes** (exact columns, order, and direction, leading with equality-filter columns); and the **filter/sort contract** — the filterable/sortable field whitelist, the tuple seek predicate, and that changing any filter or sort param invalidates the cursor.

package/content/skills/property-test-designer.md ADDED

@@ -0,0 +1,63 @@
+---
+name: "property-test-designer"
+description: "Design property-based tests — generate hundreds of random inputs and assert invariants that must hold for ALL of them — to surface the edge cases hand-picked examples never reach. Use when code has a large input space (parsers, serializers, encoders, math, data transforms), when a bug keeps slipping through despite green example tests, or when you can't enumerate every case worth checking."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Example-based tests only check the inputs you thought to write down. This skill designs property-based tests instead: it identifies the invariants that must hold for *every* valid input, defines generators that produce hundreds of them — including the corners you'd never type by hand — and lets the framework shrink any failure to its minimal reproducing input. The deliverable is the chosen properties, the generators, a runnable test in your language's framework, and a plan to pin every counterexample as a fixed regression case.
+## When to use this skill
+- The input space is large or recursive — parsers, serializers, encoders/decoders, numeric code, date/time logic, data transforms, state machines — and enumerating cases by hand is hopeless.
+- A bug keeps escaping a green example suite because it lives in a corner nobody wrote a test for (empty input, unicode, overflow, a specific interleaving).
+- You have a clear correctness relation — a round-trip, an inverse, a slower reference implementation — but no single "expected output" to assert against.
+- You're hardening a critical pure function and want adversarial coverage, not three happy-path examples.
+## Instructions
+1. **Pick properties that hold for ALL valid inputs — not examples.** Stop choosing inputs; choose relations. The classics, in rough order of power:
+   - **Round-trip / inverse:** `decode(encode(x)) == x`, `parse(render(x)) == x`, `decompress(compress(x)) == x`. The highest-value property for any serializer or codec.
+   - **Invariant:** a property of the output regardless of input — `sort(xs)` is ordered *and* a permutation of `xs`; a balanced-tree insert keeps the balance condition; a parser never returns a node spanning past EOF.
+   - **Idempotence:** `f(f(x)) == f(x)` — for normalizers, dedupers, sanitizers, `canonicalize`.
+   - **Oracle / model:** the function must agree with a simpler, slower, or trusted reference (a brute-force version, the previous release, the stdlib) on every input.
+   - **Metamorphic:** when there's no oracle, relate two runs — `sort(xs) == sort(shuffle(xs))`; `search(q)` ⊆ `search(broaden(q))`; `len(filter(p, xs)) <= len(xs)`.
+2. **Define generators that cover the real domain.** A property is only as good as its inputs. For each property, build a generator that reaches the nasty regions on purpose: empty/single-element collections, `0`/`-0`/negatives/`MAX_INT`/`MIN_INT`, NaN and infinities, empty strings, unicode and surrogate pairs, embedded delimiters and escape chars, huge inputs, deeply nested structures, and duplicates. Compose existing generators (`lists(integers())`, `dictionaries(...)`) rather than rolling raw randomness.
+3. **Constrain generators to valid inputs.** If the property only holds for, say, sorted lists or well-formed dates, *generate them in that shape* — `map`/`build` from raw primitives — instead of generating garbage and filtering it. Filtering (`assume`/`.filter`) discards rejected inputs and silently shrinks your effective sample size.
+4. **Pick the framework for the language.** Python → **Hypothesis** (`@given`, `st.*` strategies). JS/TS → **fast-check** (`fc.assert(fc.property(...))`). Haskell → **QuickCheck**; Scala → **ScalaCheck**; JVM/Java → **jqwik**; Rust → **proptest**/`quickcheck`; Go → built-in `testing/quick` or `rapid`. Match what's already in the project before adding a dep.
+5. **Lean on shrinking and pin the counterexample.** When a property fails, the framework shrinks the random input to a *minimal* failing case (e.g. `[0, 0]`, not a 400-element list). Read that minimal input — it usually names the bug. Then add it as an explicit example so it's checked every run regardless of the random seed: Hypothesis `@example(...)`, fast-check `fc.assert(prop, { examples: [[...]] })`, or just a plain unit test asserting the fixed input.
+6. **Budget run counts for CI.** Defaults (Hypothesis 100, fast-check 100) are fine locally; for cheap pure functions raise to 1000+ in a nightly job, but keep PR runs bounded so the suite stays fast. Fix an explicit seed in CI, or record the seed a failing run reports, so a flake is reproducible, and disable Hypothesis's `deadline` for inputs whose runtime legitimately scales with size.
+> [!WARNING]
+> A property that reimplements the function under test proves nothing. If your "oracle" shares the buggy logic (or you assert `encode(x) == encode(x)`), the test is green and worthless. The relation must be *independent* of the implementation — an inverse, a brute-force model, or a structural invariant the code never computes directly.
+> [!NOTE]
+> An unconstrained generator wastes the run budget rejecting invalid inputs and can starve the interesting region. If a heavy `assume()`/`.filter()` throws away most candidates, the framework will warn (Hypothesis raises `FailedHealthCheck`) — rebuild the generator to *construct* valid inputs instead of filtering for them.
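To make the property kinds from step 1 concrete without depending on any framework, here is a stdlib-only sketch that checks the invariant, idempotence, and metamorphic properties of a sort function. The generator, the seed, and the buggy `dedup_sort` are hypothetical; a real framework such as Hypothesis or fast-check adds composable strategies, shrinking, and replay on top of this idea.

```python
import random
from collections import Counter


def gen_int_list(rng):
    """Generator biased toward edge regions: empty lists, duplicates, extremes."""
    pool = [0, -1, 1, 2**31 - 1, -(2**31)]
    size = rng.choice([0, 1, 2, rng.randint(3, 30)])
    return [
        rng.choice(pool) if rng.random() < 0.3 else rng.randint(-1000, 1000)
        for _ in range(size)
    ]


def check_sort_properties(sort_fn, runs=200, seed=0):
    """Return the first input that violates a property, or None if all hold."""
    rng = random.Random(seed)  # explicit seed keeps the run reproducible
    for _ in range(runs):
        xs = gen_int_list(rng)
        out = sort_fn(xs)
        # Invariant: output is ordered and is a permutation of the input.
        if any(a > b for a, b in zip(out, out[1:])):
            return xs
        if Counter(out) != Counter(xs):
            return xs
        # Idempotence: sorting a sorted list changes nothing.
        if sort_fn(out) != out:
            return xs
        # Metamorphic: input order must not affect the result.
        shuffled = xs[:]
        rng.shuffle(shuffled)
        if sort_fn(shuffled) != out:
            return xs
    return None


def dedup_sort(xs):
    """Deliberately buggy: silently drops duplicates."""
    return sorted(set(xs))


print(check_sort_properties(sorted))                  # None
print(check_sort_properties(dedup_sort) is not None)  # True
```

The permutation check is what catches `dedup_sort`; an ordering check alone would pass it.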
+## Output
+For each property, the skill produces:
+- **The property and the relation it encodes** (round-trip / invariant / idempotence / oracle / metamorphic), stated as a one-line claim about all valid inputs.
+- **The generator(s)**, written in the project's framework, with the edge regions they deliberately reach.
+- **A runnable test** in that framework.
+- **The regression plan** — where each shrunk counterexample gets pinned as a fixed example so it's checked deterministically forever.
+Example — a round-trip property for a CSV codec, in Hypothesis:
+```python
+from hypothesis import given, strategies as st, example
+# Generate well-formed rows directly (no filtering): each cell is arbitrary
+# text incl. commas, quotes, newlines, unicode — exactly the chars that break parsers.
+rows = st.lists(st.lists(st.text(), min_size=1), min_size=1)
+@given(rows)
+@example([["a,b", '"q"', "line\nbreak"]])  # pinned: a past failure, checked every run
+def test_csv_roundtrip(data):
+    # Property: parsing what we wrote back yields the original (inverse).
+    # parse_csv is INDEPENDENT of write_csv — not a reimplementation of it.
+    assert parse_csv(write_csv(data)) == data
+```
+A failure here shrinks to the minimal breaking cell — typically `[["\n"]]` or `[['"']]` — which you read, fix, and then pin via a second `@example(...)`. Hand the proposed properties to `test-scaffolder` to flesh out, and use `coverage-gap-finder` to confirm the generated inputs now reach the previously-cold branches.
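The shrinking behavior described in step 5 can be approximated with a small greedy loop. This is a simplified illustration, not how Hypothesis or fast-check actually shrink; `total_isqrt` is a hypothetical function under test that crashes on negative numbers.

```python
import math


def total_isqrt(xs):
    """Hypothetical function under test: raises ValueError on negative input."""
    return sum(math.isqrt(x) for x in xs)


def fails(xs) -> bool:
    try:
        total_isqrt(xs)
    except ValueError:
        return True
    return False


def shrink(failing, still_fails):
    """Greedily simplify a failing list while the failure persists."""
    current = list(failing)
    improved = True
    while improved:
        improved = False
        # Pass 1: try dropping one element at a time.
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current, improved = candidate, True
                break
        if improved:
            continue
        # Pass 2: try moving one element toward zero.
        for i, value in enumerate(current):
            for simpler in (0, value // 2):
                if simpler != value and abs(simpler) < abs(value):
                    candidate = current[:i] + [simpler] + current[i + 1:]
                    if still_fails(candidate):
                        current, improved = candidate, True
                        break
            if improved:
                break
    return current


print(shrink([507, -300, 42, -8, 9], fails))  # [-1]
```

The shrunk input `[-1]` names the bug (negative values are unsupported) far more directly than the original five-element list.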

package/content/skills/query-plan-analyzer.md ADDED

@@ -0,0 +1,49 @@
+---
+name: "query-plan-analyzer"
+description: "Read a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A slow query is almost never slow for the reason you'd guess from reading the SQL. The plan is the ground truth: it shows the database actually chose a Seq Scan over the 40-million-row table, actually fed 500,000 rows into a Nested Loop that estimated 5, actually sorted on disk because no index could supply the order. This skill pulls the **real** plan — `EXPLAIN ANALYZE` with `BUFFERS`, not bare `EXPLAIN` — reads it from the most expensive node outward, names the one node that's costing the time and *why*, and turns that into a specific fix: the index to add (with the right column order), the rewrite that makes the predicate sargable, or the `ANALYZE` that fixes the estimate. Then it re-runs the plan to prove the bad node is gone instead of declaring victory from theory.
+## When to use this skill
+- One specific query (an endpoint, a report, a dashboard panel) is slow and you need the cause, not a vague "add some indexes."
+- A query that was fast got slow after a data-volume change, a deploy, or a schema/index change.
+- The planner is doing something surprising — a Seq Scan despite an index existing, or ignoring the index you just added.
+- p99 latency on one query is high while the table and load look unremarkable, and you suspect the plan rather than the hardware.
+- Before shipping a new query or a `migration-writer` index change, to verify the plan is what you intended.
+## Instructions
+1. **Get the table shape and existing indexes before touching the plan.** Read the schema for the queried tables: column types, the existing indexes and their column order, row counts (`SELECT reltuples FROM pg_class`, or `\d+`), and whether stats are fresh (`pg_stat_user_tables.last_analyze` / `n_mod_since_analyze`). Grep the codebase for where the query is built so you tune the real SQL (including how parameters bind), not a hand-typed approximation.
+2. **Run the REAL plan with actual rows, timing, and I/O; never bare EXPLAIN.** Use `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)` in Postgres (`EXPLAIN ANALYZE`, which reports in `FORMAT=TREE`, in MySQL 8.0.18+). `ANALYZE` *executes* the query and reports actual rows + per-node `actual time`; `BUFFERS` shows shared/local hits vs. reads (heavy `read=` means I/O, not CPU, is the cost). Run it 2–3 times so a cold-cache first run doesn't masquerade as a planning problem. For a write query, wrap it in a transaction and `ROLLBACK` so `ANALYZE` doesn't mutate data.
+3. **Read from the most expensive node outward to find where the time actually is.** In the text plan, `actual time=start..end` is inclusive of children and, like `actual rows`, is an average per loop, so multiply by `loops` to get a node's total; the time a node *adds* is that total minus its children's totals. Find the deepest/innermost node whose `actual time` and `loops × rows` dominate the total. That node, not the top of the plan, is what you fix. Note its `actual rows`, `loops`, and the `Rows Removed by Filter` line.
+4. **Check the estimate-vs-actual gap FIRST — a wide gap means stale stats, and that's the real bug.** Compare each node's estimated rows (`rows=`) to `actual rows`. A gap of more than ~10x (e.g. plans for 5 rows, processes 50,000) means the planner is choosing strategy on bad information — usually stale statistics. **Fix this before adding any index:** run `ANALYZE <table>;` (or `ANALYZE` the whole DB) and re-pull the plan. Often the plan corrects itself once estimates are right, and an index you'd have added would have been the wrong one.
+5. **Match the symptom to the culprit, then to the fix:**
+   - **Seq Scan on a large table with a selective predicate** → the predicate filters to few rows but there's no usable index. Add a b-tree on the filtered column(s). (A Seq Scan returning most of the table is *correct* — don't index it.)
+   - **Nested Loop with high `loops` over many outer rows** → the join is iterating per-row when it should batch. The cause is usually a bad row estimate (see step 4) or a missing join-key index; a corrected estimate or an index on the inner join column lets the planner pick a Hash/Merge Join.
+   - **Sort (especially `Sort Method: external merge  Disk:`)** → the query sorts at runtime and spills to disk. A b-tree index in the `ORDER BY` order can supply rows pre-sorted, removing the Sort node entirely (and powering `LIMIT` early-exit).
+   - **High `Rows Removed by Filter`** → the database fetched far more rows than it kept; the filter ran *after* the scan instead of being pushed into an index. Move the discriminating column into the index so it's a condition, not a post-filter.
+   - **Heavy `Buffers: ... read=`** → the working set isn't cached; a smaller/covering index reduces pages touched, or the data genuinely doesn't fit memory.
+6. **Check index sargability — an index the predicate can't use is no fix at all.** A b-tree is defeated by a function or cast on the column (`lower(email) = ?`, `date(created_at) = ?`, `col::text = ?`), by a leading-wildcard `LIKE '%x'`, and by an `OR` across different columns. The fix is a matching **expression index** (`CREATE INDEX ... ON t (lower(email))`), a rewrite to a range (`created_at >= d AND created_at < d+1`), or `UNION`-ing the `OR` branches — not a plain index on the raw column.
+7. **Order multi-column index columns for the predicate, then the sort.** Put equality-predicate columns first (leftmost), then the range/inequality column, then `ORDER BY` columns — so one index serves both the filter and the ordering. A column used only for a range can't have an equality column usefully placed after it. State the exact `CREATE INDEX` DDL, including `INCLUDE`d columns if a covering index would turn an Index Scan into an Index-Only Scan.
+8. **Re-run `EXPLAIN ANALYZE` after the fix and confirm the bad node is gone.** Apply the fix (in Postgres, build the index `CONCURRENTLY` to avoid a write lock; `migration-writer` can wrap the DDL). Re-pull the plan and verify the offending node changed type (Seq Scan → Index Scan, Nested Loop → Hash Join, Sort → no Sort) and that total `actual time` dropped. If the planner *ignores* the new index, run `ANALYZE` and re-check sargability before concluding the index is wrong.
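The estimate-versus-actual check from step 4 can be sketched as a small parser over Postgres text-format plan lines. The plan below is hand-written for illustration (its tables, index name, and numbers are hypothetical), and the regular expression only covers the common `(cost=...) (actual time=...)` node layout.

```python
import re

NODE = re.compile(
    r"^\s*(?:->\s*)?(?P<node>.+?)\s+\(cost=\S+ rows=(?P<est>\d+) width=\d+\)"
    r" \(actual time=\d+\.\d+\.\.\d+\.\d+"
    r" rows=(?P<actual>\d+(?:\.\d+)?) loops=(?P<loops>\d+)\)"
)

# Hypothetical plan text, written by hand in Postgres TEXT format.
PLAN = """\
Nested Loop  (cost=0.43..1250.10 rows=5 width=16) (actual time=0.050..842.300 rows=500000 loops=1)
  ->  Seq Scan on orders o  (cost=0.00..431.00 rows=5 width=8) (actual time=0.012..95.400 rows=50000 loops=1)
        Filter: (status = 'open'::text)
        Rows Removed by Filter: 1200
  ->  Index Scan using items_order_id_idx on items i  (cost=0.43..8.45 rows=10 width=8) (actual time=0.004..0.012 rows=10 loops=50000)
"""


def misestimated_nodes(plan_text, threshold=10.0):
    """Return (node, estimated rows, actual rows) where the gap exceeds threshold."""
    flagged = []
    for line in plan_text.splitlines():
        match = NODE.match(line)
        if not match:
            continue  # Filter / Buffers / other detail lines
        est = int(match["est"])
        actual = float(match["actual"])
        # Both figures are per loop, so they compare directly; floor at 1
        # to avoid dividing by zero when a node returns no rows.
        low, high = sorted([max(est, 1), max(actual, 1.0)])
        if high / low > threshold:
            flagged.append((match["node"], est, actual))
    return flagged


for node, est, actual in misestimated_nodes(PLAN):
    print(f"{node}: estimated {est}, actual {actual:.0f}")
# Nested Loop: estimated 5, actual 500000
# Seq Scan on orders o: estimated 5, actual 50000
```

Here the Seq Scan's bad estimate is the innermost offender; the Nested Loop's gap is a consequence of it, which is why the gap is worth fixing before choosing any index.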
+> [!WARNING]
+> Bare `EXPLAIN` shows the planner's *guess*, not reality — it never runs the query, so it can't reveal a Nested Loop that estimated 5 rows and processed half a million, or which node actually burned the time. Diagnose with `EXPLAIN ANALYZE` every time; tuning from estimates is how you add the wrong index.
+> [!WARNING]
+> A wide estimated-vs-actual row gap (>10x) means stale statistics, and that is the root cause — fix it with `ANALYZE` *before* adding indexes. An index chosen to compensate for a bad estimate is often useless or harmful once the estimate is corrected, and you'll have shipped a write-amplifying index that the planner ignores.
+> [!NOTE]
+> `EXPLAIN ANALYZE` executes the statement. For `INSERT`/`UPDATE`/`DELETE`, run it inside `BEGIN; ... ROLLBACK;` so diagnosis doesn't change data — and be aware it still fires triggers and acquires locks during the run.
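The sargability rule in step 6 and the before/after check in step 8 can be tried locally with SQLite, which also supports expression indexes. This is an analogy rather than the Postgres workflow: `EXPLAIN QUERY PLAN` in SQLite reports the chosen strategy without executing the query, and the `users` table and index names are hypothetical.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the detail text of each EXPLAIN QUERY PLAN row."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
# A plain index on the raw column: the function call in the predicate defeats it.
conn.execute("CREATE INDEX users_email ON users (email)")

query = "SELECT name FROM users WHERE lower(email) = ?"
before = plan(conn, query, ("a@example.com",))

# The fix: an expression index that matches the predicate exactly.
conn.execute("CREATE INDEX users_email_lower ON users (lower(email))")
after = plan(conn, query, ("a@example.com",))

print(before)  # a full-table SCAN step
print(after)   # a SEARCH step that uses users_email_lower
```

Comparing the two plans, rather than assuming the new index is used, is the same confirm-the-node-changed habit the instructions ask for.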
+## Output
+A short report with three parts:
+1. **Annotated plan** — the offending node quoted from the `EXPLAIN ANALYZE` output, with its `actual rows` vs. estimate, `loops`, `Rows Removed by Filter`, and `Buffers`, plus a one-line statement of *why* it's the bottleneck (Seq Scan / stale-stats row gap / Nested Loop blowup / disk Sort / non-sargable predicate).
+2. **The specific fix** — exact `CREATE INDEX ... CONCURRENTLY` DDL with the column order justified, or the SQL rewrite, or the `ANALYZE <table>` command. One concrete action, not a menu.
+3. **Before/after proof** — total `actual time` and the changed node type from the re-run plan (e.g. `Seq Scan 1240 ms → Index Scan 3 ms`), confirming the bad node is gone rather than asserting it should be.
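As an illustration of the column-ordering rule in step 7, here is a minimal sketch that assembles the DDL from the three column roles. The helper name, the index naming scheme, and the `orders` table are assumptions made for the example; the output uses standard Postgres `CREATE INDEX CONCURRENTLY ... INCLUDE` syntax.

```python
def build_index_ddl(table, equality, range_col=None, order_by=(), include=()):
    """Order columns as: equality predicates, then the range column, then ORDER BY columns."""
    columns = list(equality)
    if range_col and range_col not in columns:
        columns.append(range_col)
    for col in order_by:
        # An ORDER BY column that is already a key column is not repeated.
        if col not in columns:
            columns.append(col)
    name = "idx_" + table + "_" + "_".join(columns)
    ddl = f"CREATE INDEX CONCURRENTLY {name} ON {table} ({', '.join(columns)})"
    # INCLUDE carries payload columns only; key columns are already in the index.
    payload = [c for c in include if c not in columns]
    if payload:
        ddl += f" INCLUDE ({', '.join(payload)})"
    return ddl + ";"


# Hypothetical query: WHERE customer_id = $1 AND created_at > $2 ORDER BY created_at
print(build_index_ddl("orders", ["customer_id"], "created_at",
                      order_by=["created_at"], include=["total"]))
```

The column order still needs a written justification in the report; the helper only encodes the ordering convention.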

package/content/skills/runbook-writer.md ADDED

@@ -0,0 +1,83 @@
+---
+name: "runbook-writer"
+description: "Write an operational runbook a half-asleep on-call engineer can execute at 3am — scoped to ONE alert, leading with how to confirm the problem, the copy-pasteable mitigation that stops user pain, then diagnosis, escalation, and verification. Use when an alert has no documented response, after an incident exposed a missing procedure, or when standing up on-call for a service."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Write the document the on-call engineer opens when a pager fires at 3am — and can actually follow. The skill takes one alert or symptom and produces a runbook in the order a responder needs it: **confirm → mitigate → diagnose → escalate → verify**. It mines the repo for the real commands, dashboards, and service names, writes each step as a literal instruction with its expected output ("run X; if you see Y, do Z"), and front-loads the mitigation that stops user pain *before* any investigation. The result stops bleeding first and explains second.
+## When to use this skill
+- An alert fires with no documented response — the responder is reverse-engineering the system at the worst possible time.
+- A postmortem found that recovery was slow because the procedure lived only in one person's head.
+- You're onboarding on-call for a service and need a runbook per page-worthy alert before the rotation starts.
+- An existing runbook is prose-heavy ("investigate the root cause") and unusable under stress.
+## Instructions
+1. **Scope to ONE symptom — refuse the generic doc.** A runbook answers exactly one page: `HighErrorRate on checkout-api`, `ReplicaLag > 30s`, `DiskUsage > 90% on db-primary`. If the user asks for an "operations runbook," push back and split it — one alert per file. Name it after the alert that links to it (`docs/runbooks/checkout-api-high-error-rate.md`), so the pager's "runbook" link lands here. Search existing alert rules (`grep -ri "alert\|expr:" prometheus*.yml *.rules.yml`) to use the alert's exact name.
+2. **Open with the fast path, not background.** The first thing on the page is a one-line summary of what's broken and the user impact ("Checkout returns 500s — customers can't pay"), then a **TL;DR mitigation** block: the single command that most often stops the pain. The responder should be able to act from the top of the file without scrolling. Save architecture and theory for the bottom (or omit it).
+3. **Step 1 is always CONFIRM — is this real?** Give the exact way to verify the alert isn't a flapping false positive: the literal dashboard URL, the PromQL/log query to paste, or the curl/CLI command, plus the expected output that means "yes, real." Mine the repo for these — read dashboard JSON, `*.rules.yml`, health-check endpoints, and `Makefile`/`justfile` targets — rather than inventing command names. Example: `kubectl -n prod get pods -l app=checkout-api` → "all should be `Running`; `CrashLoopBackOff` confirms the alert."
+4. **Step 2 is MITIGATE — stop the bleeding before diagnosing.** This is the most important section and it comes *before* root-cause work. Give the copy-pasteable command to roll back, fail over, restart, scale up, or feature-flag-off — with real paths, namespaces, and service names from the repo. State what each command does and how to know it worked. Order options by safety and speed (rollback to last-good deploy usually beats live debugging). Never make the reader derive the command.
+5. **Step 3 is DIAGNOSE — only now look for cause.** Numbered, branching steps in `run X → if you see Y → do Z` form. Every step is a literal command with expected output and the decision it drives. No step may say "investigate," "look into," "check if there's an issue," or any phrase that offloads a judgment call onto a stressed human — convert each into a concrete check with a concrete next action. Link the relevant logs query, trace view, and the service's SLO/error-budget dashboard.
+6. **Write ESCALATE with names and triggers.** State exactly *when* to page the next person and *who*: "If mitigation hasn't restored success rate within 15 min, page the #payments on-call via PagerDuty service `checkout-api`." Include the secondary/owning team, any vendor support path, and the threshold (duration, error count, blast radius) that makes escalation mandatory rather than optional.
+7. **End with VERIFY — confirm recovery, don't assume it.** Give the explicit check that service is restored: the same dashboard/query from step 1 showing healthy values, with the threshold to watch ("error rate back under 0.5% for 5 consecutive minutes"). Include any cleanup (re-enable the flag you turned off, scale back down) and a one-line prompt to capture timeline notes for the postmortem.
+8. **Keep every command current and report assumptions.** Verify each command against the repo (binary names, namespaces, flags, env). Flag any command you could not confirm against a real file so the user tests it before relying on it. A command you guessed is worse than no command — it sends the responder down a dead end at 3am.
+> [!WARNING]
+> A runbook full of "investigate the issue" or "check the logs and determine the cause" is useless at 3am — it just restates the panic. Every step must be a literal command with an expected output and an explicit next action. Equally, a runbook with a stale or never-executed command fails at the exact moment it's needed: treat unverified commands as bugs, and have someone dry-run the mitigation path in staging before trusting it.
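The ban on judgment-call phrasing can be checked mechanically before a runbook is merged. This is a rough sketch, not part of the skill itself: the phrase list is taken from the wording above, and the function name and sample draft are hypothetical.

```python
import re

# Phrases that hand a judgment call to the responder instead of a concrete check.
VAGUE = re.compile(
    r"\b(investigate|look into|check if there'?s an issue|determine the cause)\b",
    re.IGNORECASE,
)


def vague_steps(runbook_text: str) -> list:
    """Return (line number, matched phrase) for every vague instruction found."""
    hits = []
    for lineno, line in enumerate(runbook_text.splitlines(), start=1):
        match = VAGUE.search(line)
        if match:
            hits.append((lineno, match.group(0).lower()))
    return hits


draft = """\
1. Run `kubectl -n prod get pods -l app=checkout-api`.
2. Investigate the root cause.
3. Check the logs and determine the cause.
"""

print(vague_steps(draft))  # [(2, 'investigate'), (3, 'determine the cause')]
```

A check like this catches wording only. It cannot tell whether a command was ever executed, so the staging dry-run is still needed.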
+## Output
+A single Markdown file at `docs/runbooks/<alert-name>.md` for one symptom, ordered **confirm → mitigate → diagnose → escalate → verify**, with a TL;DR mitigation at the top, literal copy-pasteable commands, expected outputs, decision branches, and links to the dashboard / logs / trace view / SLO. The skill reports the file path and any command it could not verify against the repo.
+```markdown
+# Runbook: checkout-api — HighErrorRate
+**Impact:** Checkout returns 500s — customers cannot complete payment.
+**Alert:** `HighErrorRate{service="checkout-api"}` (fires at 5xx > 2% for 3m)
+**Dashboard:** https://grafana.internal/d/checkout-api/overview
+## TL;DR mitigation
+Roll back to the last-good deploy — fixes ~80% of these pages:
+    kubectl -n prod rollout undo deployment/checkout-api
+Success rate should recover within ~2 min on the dashboard above.
+## 1. Confirm it's real
+    kubectl -n prod get pods -l app=checkout-api
+Expect all `Running`. Any `CrashLoopBackOff`/`Error` confirms the alert.
+Cross-check the 5xx panel: https://grafana.internal/d/checkout-api/overview
+## 2. Mitigate (stop the bleeding)
+1. If a deploy went out in the last hour → `kubectl -n prod rollout undo deployment/checkout-api`.
+2. If pods are healthy but the DB is the source → fail over reads:
+   `kubectl -n prod set env deployment/checkout-api READ_REPLICA=db-replica-2`
+3. If a downstream dependency is down → disable checkout behind the flag:
+   `curl -XPOST https://flags.internal/api/checkout_enabled -d '{"value":false}'`
+Confirm recovery on the dashboard before moving on.
+## 3. Diagnose
+- Run `kubectl -n prod logs -l app=checkout-api --since=10m --tail=-1 | grep -i error`.
+  If you see `connection refused: payments-svc` → page payments (step 4).
+  If you see `pq: too many connections` → scale the pool: `kubectl -n prod set env deployment/checkout-api DB_POOL_MAX=40`.
+- Traces: https://tempo.internal/explore?service=checkout-api
+- SLO / error budget: https://grafana.internal/d/checkout-api/slo
+## 4. Escalate
+If success rate is not restored within 15 min, page **#payments on-call**
+via PagerDuty service `checkout-api`. For DB failover that won't recover,
+page **#platform-db**. Vendor (Stripe) status: https://status.stripe.com
+## 5. Verify
+- 5xx rate back under 0.5% for 5 consecutive minutes on the dashboard.
+- Re-enable any flag you toggled: `curl -XPOST .../checkout_enabled -d '{"value":true}'`.
+- Note start/detect/mitigate/resolve timestamps for the postmortem.
+```
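The confirm → mitigate → diagnose → escalate → verify ordering can also be verified automatically. The sketch below assumes numbered `## N. <Name>` headings as in the template above; the function names and the abbreviated sample are hypothetical.

```python
import re

EXPECTED = ["confirm", "mitigate", "diagnose", "escalate", "verify"]


def section_order(markdown: str) -> list:
    """Return the recognised stage names in the order their numbered headings appear."""
    found = []
    for line in markdown.splitlines():
        match = re.match(r"##\s+\d+\.\s+(\w+)", line)
        if match and match.group(1).lower() in EXPECTED:
            found.append(match.group(1).lower())
    return found


def is_well_ordered(markdown: str) -> bool:
    """True only when all five stages are present, once each, in the required order."""
    return section_order(markdown) == EXPECTED


sample_runbook = """\
## TL;DR mitigation
## 1. Confirm it's real
## 2. Mitigate (stop the bleeding)
## 3. Diagnose
## 4. Escalate
## 5. Verify
"""

print(is_well_ordered(sample_runbook))  # True
```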

package/content/skills/security-headers-hardener.md ADDED

@@ -0,0 +1,79 @@
+---
+name: "security-headers-hardener"
+description: "Audit and harden a web app's or API's HTTP security headers — Content-Security-Policy, HSTS, X-Content-Type-Options, frame-ancestors, Referrer-Policy, Permissions-Policy, and CORS — and produce a staged rollout that won't break the site. Use before a launch, during a security pass, or when a scanner (Mozilla Observatory, securityheaders.com, a pentest) flags missing or weak headers. Audits and edits header config; rolls CSP out Report-Only first."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Audit the HTTP security headers a web app or API actually sends, then harden them without taking the site down. The single highest-value header is a real **Content-Security-Policy** — it is the strongest in-band mitigation for XSS — but it is also the one most likely to break your site if shipped carelessly, so this skill always stages CSP through **Report-Only** first. Around it: enforce HTTPS with HSTS (carefully, because `preload` is effectively one-way), stop MIME sniffing, block framing, tighten `Referrer-Policy` and `Permissions-Policy`, scope CORS so it can't be turned into a credential-leaking open door, and strip headers that advertise your stack and version. Output is a per-header `current → recommended` audit, the exact values to paste, and a rollout plan that goes Report-Only before enforce.
+## When to use this skill
+- Before a public launch or a major release that changes the frontend, third-party scripts, or the CDN/proxy in front of the app.
+- When a scanner (securityheaders.com, Mozilla Observatory, Lighthouse, a pentest report) flags missing or weak headers.
+- When standing up a new service, edge config, or reverse proxy and you want headers right from day one.
+- After adding a third-party embed, analytics, payment iframe, or auth widget — anything that changes what origins the page must trust.
+> [!WARNING]
+> Never ship an enforcing `Content-Security-Policy` you have not first run as `Content-Security-Policy-Report-Only` against real traffic. A directive like `script-src 'self'` will silently kill every inline `<script>`, injected analytics snippet, and third-party widget the moment it enforces — that's a white-screened production site, not a hardened one.
+## Instructions
+1. **Find where headers are actually set, then observe what ships.** Glob and grep the layers that can emit headers — app middleware (`helmet`, `setHeader`, `res.headers`, `add_header`), framework config (`next.config`, `vercel.json`, `netlify.toml`, `**/middleware*`), and edge config (`nginx.conf`, `*.htaccess`, Cloudflare/CDN rules, `**/*.conf`). Multiple layers may set the same header; the proxy can override the app, or duplicate it. Establish the *effective* response (e.g. `curl -sI https://host` against a deployed instance, or read the proxy config) before changing anything — you can't harden what you can't see, and a header set twice with different values is its own bug.
+2. **Set a real Content-Security-Policy — the core control.** Start from a default-deny base: `default-src 'self'`. Then open *only* what the app needs: `script-src` and `style-src` for trusted origins, `img-src`, `connect-src` for your APIs/websockets, `font-src`, `frame-src` for embeds. Avoid `'unsafe-inline'` and `'unsafe-eval'` in `script-src` — they neuter the whole policy against XSS. For unavoidable inline scripts, use a per-response **nonce** (`script-src 'nonce-<random>'`, regenerated each request) or a **SHA-256 hash** of the script body, not a blanket allow. Always add `object-src 'none'` (kills Flash/plugin vectors) and `base-uri 'self'` (stops `<base>`-tag injection that reroutes relative script URLs). Add a `report-uri`/`report-to` endpoint so violations are collected.
+3. **Roll CSP out Report-Only before enforcing.** Deploy the policy as `Content-Security-Policy-Report-Only` first — same directives, but violations are reported to your collector instead of blocked. Watch the violation stream across representative traffic (all major pages, logged-in and out, the third-party flows) until it goes quiet or shows only known-benign noise (browser extensions inject inline styles — scope by `document-uri`/`blocked-uri`, don't widen the policy for them). Only then flip the header name to `Content-Security-Policy`. Keep `report-to` on after enforcing to catch regressions.
+4. **Enforce HTTPS with HSTS — and be deliberate about preload.** Set `Strict-Transport-Security: max-age=31536000; includeSubDomains`. Add `; preload` **only** once every subdomain serves valid HTTPS, because preload submission bakes HTTPS-only into shipped browsers and is slow and painful to undo. When first introducing HSTS, consider starting with a shorter `max-age` (e.g. a day) to confirm nothing breaks, then raise it. HSTS only takes effect on a response served over HTTPS, so also ensure a plain-HTTP→HTTPS redirect exists.
+5. **Stop MIME sniffing and clickjacking.** Set `X-Content-Type-Options: nosniff` (stops the browser from re-interpreting a response's type, a classic way to execute an uploaded "image" as script). Block framing at the header level: prefer `Content-Security-Policy: frame-ancestors 'self'` (or an explicit allowlist of origins permitted to frame you), which supersedes the legacy `X-Frame-Options: DENY/SAMEORIGIN`. Set both for older-browser coverage, but make them agree (`frame-ancestors 'self'` pairs with `SAMEORIGIN`, `frame-ancestors 'none'` with `DENY`).
+6. **Tighten Referrer-Policy and Permissions-Policy.** Set `Referrer-Policy: strict-origin-when-cross-origin` (sends the full URL same-origin, only the origin cross-origin over HTTPS, nothing on downgrade) — this stops tokens or PII in query strings from leaking via the `Referer` header to third parties. Set `Permissions-Policy` to disable powerful features the app doesn't use, e.g. `camera=(), microphone=(), geolocation=(), payment=()` — an empty allowlist `()` means "no origin, not even self." Only grant features the app actually calls.
+7. **Scope CORS tightly — never the wildcard-plus-credentials trap.** If the API serves cross-origin requests, reflect or allowlist **specific** trusted origins for `Access-Control-Allow-Origin`; never reflect an arbitrary `Origin` header back unchecked (that's "allow everyone" with a disguise). The exploitable misconfiguration to hunt for: `Access-Control-Allow-Origin: *` together with `Access-Control-Allow-Credentials: true` — browsers forbid the literal combination, so a server that *needs* credentials will instead reflect the caller's Origin, and if that reflection is unchecked, any site can make authenticated cross-origin requests and read the response. Pin `Allow-Methods`/`Allow-Headers` to what's used, and set `Vary: Origin` when reflecting so caches don't serve one origin's CORS response to another.
+8. **Remove headers that leak the stack.** Strip or blank `Server` version detail, `X-Powered-By`, `X-AspNet-Version`, `X-Generator`, and framework banners — they hand attackers a version to match against known CVEs and cost nothing to remove. (`X-XSS-Protection` is deprecated and best set to `0` or omitted; do not rely on it — CSP replaces it.)
+9. **Apply the changes, keeping each layer's edit minimal and consistent.** Use Edit to set the recommended values in the right layer (prefer the single source of truth — usually the proxy/edge or one central middleware — over scattering headers across the app). Don't introduce a header in two places with conflicting values. Leave CSP as Report-Only in the committed config if the violation-watch window hasn't completed; note clearly in the rollout plan when to flip it.
+> [!NOTE]
+> Test against a real response, not the config file. A header in `helmet()` or `next.config` can be silently overridden, dropped, or duplicated by a CDN, load balancer, or framework default. Confirm the effective `curl -sI` output before and after — the wire is the source of truth.
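The CORS rule in step 7 (allowlist specific origins, never reflect unchecked, always send `Vary: Origin`) can be sketched framework-independently. The allowlist contents and the function name below are assumptions; a real app would wire this into its middleware or proxy config.

```python
from typing import Optional

# Hypothetical allowlist: only origins the API deliberately trusts.
ALLOWED_ORIGINS = frozenset({"https://app.example.com"})


def cors_headers(request_origin: Optional[str], allow_credentials: bool = True) -> dict:
    """Build CORS response headers for one request, echoing only allowlisted origins."""
    # Vary: Origin is sent on every response so caches never reuse one origin's answer.
    headers = {"Vary": "Origin"}
    if request_origin in ALLOWED_ORIGINS:
        headers["Access-Control-Allow-Origin"] = request_origin
        if allow_credentials:
            headers["Access-Control-Allow-Credentials"] = "true"
    return headers


print(cors_headers("https://app.example.com"))
print(cors_headers("https://evil.example"))  # {'Vary': 'Origin'}
```

An unlisted origin gets no `Access-Control-Allow-Origin` header at all, so the browser blocks the cross-origin read.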
+## Output
+A per-header audit table (`current → recommended` for every header in scope), the exact header/config values to apply in the identified layer, and a staged rollout plan that puts CSP through Report-Only before enforce. Edits are applied to the header config; CSP stays Report-Only until the violation window is clear.
+```text
+Security headers — scope: next.config.ts, middleware.ts, effective response for https://app.example.com
+Header                       Current                          Recommended
+---------------------------------------------------------------------------------------------------
+Content-Security-Policy      (none)                           default-src 'self'; script-src 'self'
+                                                              'nonce-{n}'; style-src 'self'; img-src
+                                                              'self' data:; connect-src 'self'
+                                                              https://api.example.com; object-src
+                                                              'none'; base-uri 'self'; frame-ancestors
+                                                              'self'; report-to csp
+                                                              → ship as -Report-Only first
+Strict-Transport-Security    (none)                           max-age=31536000; includeSubDomains
+                                                              (add ;preload only after subdomain audit)
+X-Content-Type-Options       (none)                           nosniff
+X-Frame-Options              (none)                           SAMEORIGIN  (CSP frame-ancestors is primary)
+Referrer-Policy              unsafe-url                       strict-origin-when-cross-origin
+Permissions-Policy           (none)                           camera=(), microphone=(), geolocation=(),
+                                                              payment=()
+Access-Control-Allow-Origin  * (reflected, with credentials)  https://app.example.com (allowlist) + Vary: Origin
+X-Powered-By                 Next.js                          (removed)
+Server                       nginx/1.25.3                     nginx (version suppressed)
+Rollout plan
+1. Deploy all headers above; CSP as Content-Security-Policy-Report-Only with report-to=csp.
+2. Watch violation reports across all pages + third-party flows for one full traffic cycle.
+3. Resolve real violations (add the specific origin/nonce); ignore extension noise.
+4. When the stream is quiet, rename the header to Content-Security-Policy (enforce). Keep report-to on.
+5. After every subdomain is verified HTTPS-only, add ;preload to HSTS and submit (one-way).
+Fixed now: CORS wildcard+credentials misconfiguration removed; X-Powered-By/Server stripped;
+nosniff, frame-ancestors, Referrer-Policy, Permissions-Policy, HSTS applied. CSP pending enforce.
+```
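A minimal sketch of how the recommended policy above might be assembled per response, with a fresh nonce and a single switch between Report-Only and enforce. The helper names are hypothetical, and the directive values mirror the sample audit rather than any particular app's needs.

```python
import base64
import secrets


def new_nonce() -> str:
    """Generate a fresh base64 nonce; call once per response, never reuse."""
    return base64.b64encode(secrets.token_bytes(16)).decode("ascii")


def csp_header(nonce: str, enforce: bool = False) -> tuple:
    """Return (header name, header value); Report-Only unless enforce is True."""
    directives = [
        "default-src 'self'",
        f"script-src 'self' 'nonce-{nonce}'",
        "style-src 'self'",
        "img-src 'self' data:",
        "connect-src 'self' https://api.example.com",
        "object-src 'none'",
        "base-uri 'self'",
        "frame-ancestors 'self'",
        "report-to csp",
    ]
    name = "Content-Security-Policy" if enforce else "Content-Security-Policy-Report-Only"
    return name, "; ".join(directives)


name, value = csp_header(new_nonce())
print(name)  # Content-Security-Policy-Report-Only
print(value)
```

Keeping the directives in one place means the flip from Report-Only to enforce changes only the header name, which is the intent of the rollout plan.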

package/content/skills/semantic-cache-designer.md ADDED

@@ -0,0 +1,40 @@
+---
+name: "semantic-cache-designer"
+description: "Design a semantic cache for LLM responses — serve a cached answer when a new query is similar enough to a past one — to cut cost and latency on repetitive traffic, with the similarity threshold calibrated on real query pairs and a cache key that prevents cross-user/model leaks. Use when an LLM app sees many near-duplicate prompts (FAQs, support, search), when token spend on repetitive queries is high, or when latency on common questions matters."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A semantic cache turns "I've answered this before" into a skipped LLM call: embed the incoming query, find the nearest past query, and if it's close enough, return that cached answer. Done right it slashes cost and tail latency on FAQ/support/search traffic. Done wrong it confidently returns a *different* question's answer, or leaks one user's answer to another. This skill makes the two load-bearing decisions — the similarity threshold and the cache key — explicit and calibrated, instead of trusting a vibe-picked cosine cutoff.
+## When to use this skill
+- An LLM app gets many near-duplicate prompts — FAQs, support tickets, product search, "explain X" — and most calls re-derive the same answer.
+- Token spend is dominated by repetitive traffic and you want to stop paying for the same completion twice.
+- Latency on common questions matters (p50/p95) and a cache hit would return in milliseconds instead of seconds.
+- You're about to bolt a `GPTCache`-style layer onto a RAG or chat app and need the threshold/key/TTL decided before it ships.
+## Instructions
+1. **Pin down what a "correct hit" means before touching code.** A hit is only correct if the cached answer would still be the *right* answer for the new query. Write down the inputs that change the correct answer beyond the query text — user/tenant, locale/language, the retrieved-context version (for RAG), the model + version, system-prompt version, and any personalization. This list becomes the cache key in step 5; everything else flows from it.
+2. **Design the lookup.** Embed the incoming query with the *same* model and input-type used for the stored queries (a query/document asymmetry mismatch quietly wrecks similarity — see `embedding-set-inspector`). Look up the single nearest stored entry by vector similarity (cosine on normalized vectors), scoped to the exact key from step 5. Return the cached answer only if `similarity >= threshold`; otherwise it's a miss → call the LLM and write the new entry.
+3. **Calibrate the threshold on real query pairs — do not pick it from a blog post.** Pull ~100-300 query pairs from production logs and label each pair as "same intent / cached answer is correct" or "different intent / would be wrong." Sweep the threshold (e.g. 0.80→0.97) and at each value compute false-hit rate (returned a wrong answer) and false-miss rate (missed a valid reuse). Pick the threshold from this curve, not by feel.
+4. **Bias toward false-miss when a wrong answer is costly.** A false miss costs one extra LLM call; a false hit ships a confidently wrong answer to a user. For support/medical/financial/legal surfaces, choose the stricter threshold even if hit rate drops — a missed hit is cheap, a wrong hit is a trust incident.
+5. **Build the full cache key — never key on query text alone.** Namespace the cache (or the embedding lookup) by every input from step 1: `tenant + locale + model@version + prompt@version + context@version`. Personalized or per-user answers must include the user/tenant in the key. Omitting any of these is how you serve user A's answer to user B, or a `claude-opus-4` answer out of a `claude-haiku` cache after a model swap.
+6. **Set TTL and invalidation for answers that go stale.** Static facts can live long; RAG answers over changing data must expire (or be invalidated) when the underlying documents change — tag entries with the `context@version`/document IDs they depended on and evict on update. Time-sensitive answers ("current status", "today's price") get a short TTL or land in the no-cache list (step 7).
+7. **Decide explicitly what NOT to cache.** Exclude personalized/account-specific answers that lack a per-user key, time-sensitive or real-time responses, stateful/multi-turn replies that depend on conversation history, and anything with side effects (tool calls, writes). Caching these is worse than no cache. Write the no-cache predicate down as a rule, not a hope.
+8. **Measure hit *quality*, not just hit rate.** Track cache hit rate, token/cost saved, and latency delta — but also sample a slice of live hits (e.g. 1-2%) and judge whether the cached answer was actually right for the new query (LLM-as-judge or human review). Report false-hit rate as a first-class metric. A 60% hit rate that's 10% wrong is worse than a 35% hit rate that's clean.
+> [!WARNING]
+> A too-loose threshold is the signature failure of semantic caching: "How do I cancel my subscription?" and "How do I cancel my *order*?" are highly similar in embedding space, so the cache serves a fluent, confident answer to the *wrong* question. The user can't tell it's a stale match. Always validate the threshold against labeled different-intent pairs, not just same-intent ones.
+> [!WARNING]
+> Omitting context/user/model from the cache key leaks answers across boundaries — across users (privacy incident), across locales (wrong language), or across model/prompt versions (you keep serving the old model's answers after a deploy). The key must change whenever the correct answer would change.
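To make the lookup and key scoping from steps 2 and 5 concrete, here is a toy in-memory sketch. It assumes embeddings are already computed (the two-dimensional vectors are stand-ins), and the class name, key layout, and values are all hypothetical. A production system would use a vector index rather than a linear scan.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SemanticCache:
    threshold: float
    # Composite key -> list of (query embedding, cached answer).
    _entries: dict = field(default_factory=dict)

    def put(self, key: tuple, embedding, answer: str) -> None:
        self._entries.setdefault(key, []).append((embedding, answer))

    def get(self, key: tuple, embedding):
        """Return the nearest cached answer within this key's namespace, or None on a miss."""
        best_score, best_answer = -1.0, None
        for stored, answer in self._entries.get(key, []):
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Key fields: tenant, locale, model@version, prompt@version, context@version.
key_a = ("tenant-a", "en", "model@1", "prompt@3", "ctx@7")
key_b = ("tenant-b", "en", "model@1", "prompt@3", "ctx@7")

cache = SemanticCache(threshold=0.90)
cache.put(key_a, [1.0, 0.0], "Cancel from Settings > Billing.")

print(cache.get(key_a, [0.99, 0.05]))  # hit: same namespace, similarity above threshold
print(cache.get(key_b, [0.99, 0.05]))  # None: different tenant, so never considered
```

Because the key selects the namespace before any similarity is computed, a near-identical query from another tenant or model version cannot reach the stored answer.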
+## Output
+- **Lookup design** — embedding model + input-type, similarity metric, nearest-neighbor scoping, and the hit/miss decision rule.
+- **Calibrated threshold** — the chosen value plus the false-hit / false-miss curve it came from and the labeled query-pair set used (and the false-miss bias rationale if applicable).
+- **Full cache key** — the exact composite key (`tenant + locale + model@version + prompt@version + context@version + user`), with a note on which fields apply to this app.
+- **TTL + invalidation + no-cache rules** — per-class TTLs, the document-version invalidation trigger for RAG entries, and the explicit no-cache predicate.
+- **Metrics** — hit rate, token/cost saved, latency delta, and the sampled hit-quality / false-hit measurement to track in production.
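The threshold sweep from step 3 reduces to a few lines once the labeled pairs exist. In this sketch the pairs are invented for illustration, and the rates are defined as follows (an assumption, since the text does not fix the denominators): false-hit rate is the share of returned hits that were wrong, and false-miss rate is the share of valid reuses that fell below the threshold.

```python
def sweep(pairs, thresholds):
    """pairs: (similarity, same_intent) tuples. Returns (threshold, false_hit_rate, false_miss_rate)."""
    valid_total = sum(1 for _, same in pairs if same)
    rows = []
    for t in thresholds:
        hits = [same for sim, same in pairs if sim >= t]
        false_hits = sum(1 for same in hits if not same)
        missed = sum(1 for sim, same in pairs if same and sim < t)
        false_hit_rate = false_hits / len(hits) if hits else 0.0
        false_miss_rate = missed / valid_total if valid_total else 0.0
        rows.append((t, false_hit_rate, false_miss_rate))
    return rows


# Invented labeled pairs: (cosine similarity, True if the cached answer would be correct).
labeled = [
    (0.98, True), (0.95, True), (0.93, False),
    (0.88, True), (0.85, False), (0.70, False),
]

for t, fh, fm in sweep(labeled, [0.80, 0.90, 0.97]):
    print(f"threshold={t:.2f} false_hit={fh:.2f} false_miss={fm:.2f}")
```

On this toy data the strictest threshold removes every false hit at the cost of missing two of three valid reuses, which is the trade-off step 4 asks you to decide deliberately.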

package/content/skills/slo-definer.md ADDED

@@ -0,0 +1,38 @@
+---
+name: "slo-definer"
+description: "Turn a vague reliability goal into concrete SLIs, SLOs, an error budget, and burn-rate alerts — service-level indicators measured at the user-facing boundary, targets over a rolling window, and a written policy for what happens when the budget runs out. Use when a service has no defined reliability target, when on-call is noisy and alert-fatigued, or before you commit to an SLA you can't measure."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+"Make it reliable" can't be measured, can't be alerted on, and can't tell you when to stop shipping. This skill converts a reliability intention into four artifacts that can: **SLIs** that measure what users actually experience, **SLOs** that set a target over a window, an **error budget** with a written policy for spending and exhausting it, and **burn-rate alerts** that page when the budget is genuinely at risk. The output is a spec, not a dashboard — a contract the team and on-call can both point at.
+## When to use this skill
+- A service is "important" but has no defined reliability target, so nobody can say whether last week was good or bad.
+- On-call is drowning in pages that don't correspond to user pain — alert fatigue from threshold blips on CPU, memory, or a single 5xx.
+- You're about to sign an SLA and need an internal SLO (tighter, measurable) to back it before you promise anything externally.
+- You have dashboards full of metrics but can't answer "are users having a good time right now, and how much room do we have left to break things?"
+## Instructions
+1. **Identify the user and the boundary first.** An SLI measures the experience of a consumer (end user, calling service) at a specific boundary — the load balancer, the API gateway, the client SDK. Measure as close to the user as you can: a 200 at the app server while the CDN returns 502s is a lie. Name the boundary explicitly before picking metrics.
+2. **Pick the few SLIs that reflect that experience.** Choose from the request/response SLI families: **availability** (good-event ratio: non-5xx, non-timeout responses ÷ total valid requests), **latency** (fraction of requests served under a threshold at a percentile), and for data systems **freshness** (fraction of reads no older than N seconds) or **correctness/coverage**. Two or three SLIs per service is plenty — more dilutes the signal.
+3. **Write each SLI as an explicit good-event criterion.** Spell out what counts as a good event, what's in the denominator, and what's excluded. Example: `latency SLI = (requests with TTFB < 300ms) / (all non-4xx requests at the gateway)`. Exclude client errors (4xx) and load-test traffic from the denominator, since they aren't the service failing, but say so in writing.
+4. **Set the SLO as a target over a rolling window grounded in user need.** Format: "X% of [good events] over [rolling window]" — e.g. `99.9% of requests succeed over 28 days`. Use a **rolling** window (28 days is common) rather than calendar months so the number can't be gamed by a quiet week. Pick the lowest target users genuinely won't notice; if you can't justify the extra nine from user impact, don't pay for it.
+5. **Derive the error budget and write its spend policy.** The budget is `1 − SLO` over the window: a 99.9% SLO allows 0.1% bad events — for 28 days that's ~40 minutes of total unavailability, or 0.1% of requests. State who may spend it (experiments, risky migrations, planned maintenance all draw down the same budget) and the **exhaustion rule in writing**: when the budget is gone, risky changes freeze and reliability work takes priority until the window recovers. A budget with no consequence is just a number.
+6. **Tie alerts to burn rate, not to thresholds.** Alert on how fast the budget is being consumed relative to the window. Run two: a **fast-burn** alert (e.g. 14.4× burn over 1 hour = ~2% of a 28-day budget gone in an hour → page now) and a **slow-burn** alert (e.g. ~3× burn over 6 hours → ticket, not a page). This makes a page mean "the budget is at risk," with high precision and low noise, instead of "5xx crossed 5 for 30 seconds."
+7. **Sanity-check against history before committing.** Read recent latency/error data (logs, metrics exports) and confirm the proposed SLO is currently *achievable* and *meaningful* — not already breached every week (unattainable, so it'll be ignored) and not trivially met with 100× headroom (no signal). Adjust the target to the real distribution.
+> [!WARNING]
+> A 100% SLO is a trap: it leaves zero error budget, so every deploy is a potential breach and the only "safe" move is to never change the system. The gap below 100% is precisely the room you have to ship, experiment, and do maintenance — design it in deliberately.
+> [!WARNING]
+> Averages hide the tail. A 200ms *average* latency is consistent with 2% of users waiting 4 seconds while everyone else sees about 120ms, and the tail is where users churn. Always state latency SLIs as a percentile (p95/p99 served under a threshold), never as a mean.
+> [!NOTE]
+> System metrics are not SLIs. CPU, memory, disk, and queue depth are *causes*, useful for debugging, but a user never files a ticket about your CPU. SLIs live at the user-facing boundary; keep host metrics on the diagnosis dashboard, out of the SLO spec.
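The error-budget and burn-rate arithmetic from steps 5 and 6 is easy to check by hand or in code. The sketch below uses the document's own numbers (99.9% over 28 days, 14.4× over 1 hour, 3× over 6 hours); the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed 'bad' minutes in the window: (1 - SLO) x window length."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / (1 - slo)


def budget_consumed(rate: float, lookback_hours: float, window_days: int) -> float:
    """Fraction of the whole window's budget spent during the look-back period."""
    return rate * lookback_hours / (window_days * 24)


print(f"{error_budget_minutes(0.999, 28):.2f} min")        # 40.32 min
print(f"{budget_consumed(14.4, 1, 28):.1%} in one hour")   # 2.1% in one hour
print(f"{budget_consumed(3, 6, 28):.1%} in six hours")     # 2.7% in six hours
print(f"{burn_rate(0.0144, 0.999):.1f}x")                  # 14.4x
```

A sustained 1.44% error ratio against a 99.9% SLO is a 14.4× burn, which is why the fast-burn alert pages rather than files a ticket.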
+## Output
+A reliability spec containing: (1) **SLI definitions** — for each, what's measured, the boundary it's measured at, and the exact good-event criterion (numerator/denominator + exclusions); (2) **SLO targets** — the percentage and rolling window per SLI, with the user-impact rationale; (3) the **error budget** — `1 − SLO` translated into concrete allowance (minutes and/or request count over the window) plus the written spend-and-exhaustion policy; and (4) the **burn-rate alert thresholds** — fast-burn (page) and slow-burn (ticket) multipliers and look-back windows. Reproducible: the same spec can be re-derived and re-checked against fresh data each quarter.
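A good-event criterion with explicit exclusions, as required for the SLI definitions in the spec, might look like the following sketch. The `Request` record and its fields are hypothetical stand-ins for whatever the gateway logs provide; the 300ms TTFB threshold and the exclusion of 4xx and load-test traffic follow the example in step 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int
    ttfb_ms: float
    load_test: bool = False


def latency_sli(requests, threshold_ms: float = 300.0):
    """Good events / valid events; 4xx and load-test traffic are excluded from both."""
    valid = [r for r in requests if not r.load_test and not 400 <= r.status < 500]
    if not valid:
        return None  # No valid traffic in the window: the SLI is undefined, not 100%.
    good = sum(1 for r in valid if r.ttfb_ms < threshold_ms)
    return good / len(valid)


window = [
    Request(200, 120.0),
    Request(200, 450.0),                  # valid but too slow: a bad event
    Request(404, 20.0),                   # client error: excluded
    Request(200, 90.0, load_test=True),   # synthetic traffic: excluded
    Request(503, 250.0),                  # valid and under the threshold
]

print(latency_sli(window))  # 2 good out of 3 valid requests
```

Note that the fast 503 counts as a good latency event here; failures are the availability SLI's job, which is one reason to define both.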

package/content/skills/strangler-fig-migrator.md ADDED

@@ -0,0 +1,47 @@
+---
+name: "strangler-fig-migrator"
+description: "Plan the incremental replacement of a legacy module or service using the strangler-fig pattern — grow new code around the old behind an interception seam until the old is dead, instead of a big-bang rewrite. Use when a legacy system is too risky to rewrite at once, or when migrating off a deprecated framework/dependency gradually while staying shippable and rollback-able at every step."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Replace a legacy module or service the way a strangler fig kills its host tree — by growing new code around the old until the old carries no load and can be cut away. The skill's first and most important move is to find the **interception seam**: the single place where calls can be diverted to either the old or the new implementation. Everything else (slicing, parallel-running, decommissioning) hangs off that seam. Without it, "incremental migration" silently becomes a big-bang rewrite with extra ceremony.
+## When to use this skill
+- A legacy system is load-bearing and too risky to rewrite all at once — a flag-day cutover would mean a long branch, a scary deploy, and no clean rollback.
+- You're migrating off a deprecated framework, library, or service (an ORM, an auth provider, a payments SDK, a monolith you're peeling into services) and want to move capability by capability.
+- The legacy code has no tests or unclear behavior, so the only trustworthy spec is "what it currently does" — you need to run new alongside old and compare.
+- Stakeholders need the system shippable and reversible the entire time, not dark for months behind a feature branch.
+> [!WARNING]
+> If you cannot find or build a clean interception seam, stop and reconsider. A migration where callers reach deep into legacy internals — not through one front door — cannot be routed incrementally. You will end up rewriting everything before you can flip anything, which is a big-bang rewrite wearing a strangler-fig costume. Creating the seam (a facade callers go through) is the *first deliverable*, sometimes a whole milestone of its own.
+## Instructions
+1. **Locate or create the interception seam first.** Find the single chokepoint where calls into the legacy unit can be diverted: a facade/adapter the callers already go through, a network proxy/router (reverse proxy, API gateway, service mesh route), or a feature-flag branch in code. Use `Grep`/`Glob` to map every caller of the legacy unit — if they all funnel through one interface, that's your seam; if they reach in twenty different ways, your first job is to introduce a facade they all route through *before* writing any new implementation. The seam must be able to send a call to old OR new and be flipped at runtime (config/flag), not at deploy time.
+2. **Inventory and slice the surface.** List the capabilities behind the seam (endpoints, methods, message types) with, for each, its call volume, blast radius if it breaks, and how self-contained it is (shared state, shared DB tables, downstream side effects). This is your migration backlog. Do not migrate by file or by "module size" — migrate by capability slice, because a slice is what the seam can route independently.
+3. **Carve off the smallest valuable slice first.** Pick the slice that is most self-contained and lowest-blast-radius — a read-only endpoint, an idempotent operation, an internal report — not the gnarliest core path. Implement it new behind the seam. The goal of slice one is to prove the *seam and the verification mechanism work end to end*, not to deliver the hardest functionality. Save the high-risk, high-coupling slices for after the machinery is trusted.
+4. **Run old and new in parallel and verify equivalence before shifting load.** Before routing real traffic to the new path, run it in **shadow mode**: send the live request to both, return the old result to the caller, and compare the new result off to the side (log/metric the diffs). Define equivalence concretely per slice — exact response match, match modulo known-acceptable differences (ordering, timestamps, formatting), or statistical match on key business metrics when outputs are non-deterministic. Only after the diff rate is at/under an agreed threshold over a representative window do you start serving the new path for real.
+5. **Shift traffic gradually and keep rollback one flip away.** Route a small fraction to the new implementation (a percentage, an allowlist of internal users, one tenant), watch error rate / latency / business metrics against the old baseline, and ramp only while they hold. The seam from step 1 makes the rollback trivial: if the new path misbehaves, flip the route back to legacy — no deploy, no revert. Treat every ramp as reversible; never remove the old path while it's still the fallback.
+6. **Migrate slice by slice, keeping the system shippable throughout.** Repeat steps 3–5 for the next slice. After each slice fully cuts over, the system is in a valid, releasable state with some capabilities on new and some on old — that is the point. Sequence so that you never half-migrate a slice that shares mutable state with an unmigrated one; if two slices write the same table, plan a shared-data strategy (dual-write with new as follower, or migrate the data owner first) before splitting their routing.
+7. **Decommission the legacy only once it is provably dead.** A slice's old code is a candidate for deletion only when: the seam routes 100% to new, the route has been pinned there long enough to cover the full usage cycle (including weekly/monthly/seasonal jobs and rare error paths), and instrumentation shows **zero** hits on the legacy path. Confirm deadness with evidence — access logs, a counter/log line on the old code path showing no calls, `Grep` proving no remaining static references — then remove the old implementation and the now-redundant routing in a final isolated step. Keep the seam until the very last slice is gone.
+> [!WARNING]
+> Deleting legacy code before confirming it's truly dead causes outages, not cleanup. "We migrated that months ago" is not evidence — a quarterly batch job, an admin tool, or a rare error branch can be the only remaining caller. Require positive proof of zero traffic (a metric/log over a full usage period) plus a static-reference search before any deletion. When in doubt, leave the dead branch behind the seam one more cycle; cold code is cheap, an outage is not.
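The seam, shadow mode, and gradual ramp from steps 1, 4, and 5 can be sketched as a single in-process router. This is a minimal illustration under stated assumptions: the `Seam` class, its field names, and the string-valued `mode` are hypothetical, and the new path is assumed to be safe to call in shadow (no duplicated side effects). A real seam might equally be a proxy route or a feature-flag service.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable


def exact_match(old: Any, new: Any) -> bool:
    """Default equivalence: exact response match."""
    return old == new


@dataclass
class Seam:
    """Hypothetical interception seam routing each call to old or new."""
    legacy: Callable[[Any], Any]
    new: Callable[[Any], Any]
    mode: str = "legacy"            # "legacy" | "shadow" | "ramp"
    new_fraction: float = 0.0       # share of traffic served by new in "ramp"
    equivalent: Callable[[Any, Any], bool] = exact_match
    shadow_calls: int = 0
    diffs: int = 0
    rng: random.Random = field(default_factory=random.Random)

    def handle(self, request: Any) -> Any:
        if self.mode == "shadow":
            # Serve the legacy result; compare the new result off to the side.
            old_result = self.legacy(request)
            self.shadow_calls += 1
            try:
                if not self.equivalent(old_result, self.new(request)):
                    self.diffs += 1
            except Exception:
                # A crash in the new path counts as a diff, never as an outage.
                self.diffs += 1
            return old_result
        if self.mode == "ramp" and self.rng.random() < self.new_fraction:
            return self.new(request)
        return self.legacy(request)

    def diff_rate(self) -> float:
        return self.diffs / self.shadow_calls if self.shadow_calls else 0.0


# Usage: shadow first, ramp only if the diff rate is acceptable.
seam = Seam(legacy=str.upper, new=lambda s: s.upper(), mode="shadow")
for word in ["alpha", "beta", "gamma"]:
    seam.handle(word)
if seam.diff_rate() <= 0.01:        # threshold is an illustrative assumption
    seam.mode, seam.new_fraction = "ramp", 0.01
```

Rollback is a field assignment (`seam.mode = "legacy"`), which mirrors the requirement that the route be flipped at runtime rather than at deploy time.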
+## Output
+1. **Interception seam design** — what the seam is (facade/adapter, proxy/router, or feature flag), where it sits relative to the callers, how it decides old-vs-new (config key / flag / percentage), and how it's flipped and rolled back at runtime. Includes the list of legacy callers found and whether they already route through one door or need a facade introduced first.
+2. **Slice-by-slice migration order** — the capability backlog as an ordered table, smallest/safest first, with the rationale for the sequence and any shared-data dependencies that force ordering:
+   | Order | Slice (capability) | Volume | Blast radius | Coupling / shared state | Why this position |
+   | --- | --- | --- | --- | --- | --- |
+   | 1 | `GET /report/summary` (read-only) | low | low | none | proves seam + verification end-to-end |
+   | 2 | `POST /events` (idempotent write) | high | medium | none | high volume, safe to shadow |
+   | 3 | `POST /orders` (core path) | high | high | shares `orders` table w/ #4 | after machinery trusted; pair with #4 |
+3. **Parallel-run verification method** — per slice: shadow-mode comparison plan, the concrete equivalence definition (exact / modulo-known-diffs / statistical), the diff threshold and observation window required before serving new, and the metrics watched during ramp (error rate, latency, business KPI vs. legacy baseline) with the ramp schedule (e.g. shadow → 1% → 10% → 50% → 100%).
+4. **Decommission criteria** — the exact gate for deleting each slice's legacy code: 100% routed to new, pinned for one full usage cycle, instrumented zero-traffic proof, and a clean static-reference search — plus the final-step plan to remove the old implementation and retire the seam once the last slice is migrated.
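The decommission gate above can be expressed as an explicit checklist that returns the reasons a slice is not yet safe to delete. This is a minimal sketch: `SliceEvidence`, its fields, and the example numbers are hypothetical, and in practice the values would come from the seam's routing config, traffic metrics, and a static-reference search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceEvidence:
    """Hypothetical evidence record for one migrated slice."""
    new_route_fraction: float    # share of traffic the seam sends to new (1.0 = 100%)
    days_pinned_at_full: int     # days the route has stayed at 100% new
    usage_cycle_days: int        # longest usage cycle (e.g. 90 for a quarterly job)
    legacy_hits_since_pin: int   # calls counted on the old code path since pinning
    static_references: int       # remaining references found by a code search


def decommission_blockers(e: SliceEvidence) -> list[str]:
    """Return every unmet criterion; an empty list means deletion may proceed."""
    blockers = []
    if e.new_route_fraction < 1.0:
        blockers.append("seam does not route 100% to new")
    if e.days_pinned_at_full < e.usage_cycle_days:
        blockers.append("route not pinned for a full usage cycle")
    if e.legacy_hits_since_pin > 0:
        blockers.append("legacy path still receives traffic")
    if e.static_references > 0:
        blockers.append("static references to legacy code remain")
    return blockers


# Example: migrated "months ago", but a quarterly job has not yet run.
evidence = SliceEvidence(
    new_route_fraction=1.0,
    days_pinned_at_full=60,
    usage_cycle_days=90,
    legacy_hits_since_pin=0,
    static_references=0,
)
print(decommission_blockers(evidence))
```

Returning the list of blockers rather than a bare boolean keeps the decision auditable: the same record shows which proof is still missing.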