npm - agentscamp - Versions diffs - 0.4.0 → 0.6.0 - Mend

agentscamp 0.4.0 → 0.6.0

Files changed (33)

package/README.md +2 -2
package/content/manifest.json +423 -2
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/circular-dependency-breaker.md +48 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/commit-splitter.md +54 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/dashboard-designer.md +38 -0
package/content/skills/deadlock-diagnoser.md +45 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/feature-flag-retirer.md +44 -0
package/content/skills/flamegraph-analyzer.md +35 -0
package/content/skills/git-blame-investigator.md +34 -0
package/content/skills/graphql-schema-designer.md +49 -0
package/content/skills/hallucination-evaluator.md +40 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/integration-test-designer.md +81 -0
package/content/skills/model-router-designer.md +39 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/onboarding-guide-writer.md +84 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/rbac-designer.md +82 -0
package/content/skills/release-notes-writer.md +78 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/content/skills/web-vitals-optimizer.md +34 -0
package/package.json +1 -1

package/content/skills/onboarding-guide-writer.md ADDED

@@ -0,0 +1,84 @@
+---
+name: "onboarding-guide-writer"
+description: "Write a developer onboarding guide that gets a new contributor from clone to first merged change fast — a verified golden path, a quick architecture map, the real workflow conventions, and the gotchas that live only in senior engineers' heads. Use when a repo has no onboarding doc, when new hires keep asking the same setup questions, or when the README is a marketing page instead of a contributor guide."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Write the doc a new contributor opens on day one and uses to ship their first change by lunch. The center of gravity is the **golden path**: the exact, copy-pasteable sequence from `git clone` to a trivial verified change — every command grounded in the repo's real scripts and tooling, not invented `make` targets. Around it sit a quick architecture map (where to look, not a spec), the workflow conventions that gate a PR, and the troubleshooting that currently lives only in tribal knowledge. Deeper material is linked, never duplicated, so the guide stays true as the code moves.
+## When to use this skill
+- A repo has no onboarding/CONTRIBUTING doc and new contributors reverse-engineer setup from CI configs and Slack threads.
+- New hires repeatedly ask the same setup questions (which Node version, what env vars, why does the build fail the first time).
+- The README is marketing prose — what the product does — rather than how a developer runs and contributes to it.
+- Onboarding currently means a senior engineer pairing for two hours to get someone to a passing test suite.
+## Instructions
+1. **Reconstruct the golden path from real tooling — verify every command exists.** Read the manifest that exists (`package.json` scripts/`engines`, `Makefile` targets, `pyproject.toml`, `go.mod`, `Justfile`, `Taskfile.yml`) and the lockfile to pick the package manager. Read CI config (`.github/workflows/*.yml`, `.gitlab-ci.yml`) — CI is the ground truth for the steps that actually pass. Build the path in execution order: clone → install deps → set up env/config → run locally → run tests → make a trivial change and verify it. Quote each command verbatim from a script that exists; if a step has no backing script, say so explicitly rather than inventing one.
+2. **Surface the prerequisites a fresh machine actually needs.** Pin the runtime version (from `engines`, `.nvmrc`, `.tool-versions`, `go.mod`, `python_requires`) and any system deps (a database, Docker, a specific package manager). List them before the install step — a clone that fails on a missing Postgres is the most common day-one wall.
+3. **Handle env and config concretely.** Find `.env.example` / `.env.sample` / `config.example.*`. Tell the contributor to copy it (`cp .env.example .env`) and call out which variables must be filled to run locally versus which have working defaults. Name the ones that need a secret or a teammate to provide — that is the question that otherwise hits Slack.
+4. **Prove the setup with a trivial verified change.** End the golden path with a concrete, reversible first change — flip a string, add a log line, fix a typo — then the exact command that confirms it (the dev server reloads, a test passes, the page shows the new text). This is what turns "I think it's set up" into "it works." Don't skip it: it's the difference between an install guide and an onboarding guide.
+5. **Write a brief architecture orientation — a map, not a spec.** Glob the top-level layout and name where the entry points are, how the main pieces fit (request → handler → data, or CLI → command → core), and where a newcomer should look first for a given task. Then list the **3–5 things that would surprise a newcomer**: the non-obvious build step, the directory that isn't what its name implies, the generated file you must never hand-edit. Keep it to a screen; point to deeper design docs for the rest.
+6. **Document the real workflow conventions.** Extract them from evidence, not assumption: branch naming (from existing branches / contributing notes), commit and PR style (from `.gitmessage`, PR template, recent history), how to run lint and typecheck (the real script names), and how CI gates a PR (which checks are required, from the workflow files). A contributor needs to know what will block their merge before they open the PR, not after.
+7. **Capture the tribal-knowledge gotchas and troubleshooting.** Write down the fixes that live in senior engineers' heads: the first build that fails until you run a generate step, the test that's flaky on certain OSes, the port that conflicts, the cache you clear when things go weird. Format as symptom → fix so a stuck contributor can scan to their error.
+8. **Link to deeper docs instead of duplicating them.** For anything with a canonical home — full architecture docs, API reference, ADRs, deployment runbooks — link to it in one line. Duplicated detail is detail that will silently go stale; a link stays correct or visibly 404s.
+9. **Order for action and skim.** Golden path first (it's what they need in the next five minutes), then architecture, conventions, troubleshooting, links. Lead each section with the action. Save it as `CONTRIBUTING.md` or `docs/onboarding.md` per the repo's convention, and report which commands you verified against real scripts and which you flagged as unverified.
+> [!WARNING]
+> An onboarding guide whose setup commands don't actually work is worse than no guide — it burns the new contributor's trust on day one and makes them distrust every other line in the doc. Verify each command against a script that exists in the repo. Never paste a `make dev` or `npm run setup` you haven't confirmed.
+> [!WARNING]
+> Do not re-explain the architecture in depth here. Detailed design that belongs in code comments, ADRs, or a design doc is guaranteed to drift once it's copied into onboarding. Give the orientation map and link to the canonical source.
+## Output
+A drop-in `CONTRIBUTING.md` (or `docs/onboarding.md`), structured for action:
+````md
+# Contributing
+## Golden path: clone → first change
+**Prerequisites:** Node 20 (`.nvmrc`), pnpm 9, Docker (for the local DB).
+```bash
+git clone git@github.com:acme/taskflow.git && cd taskflow
+pnpm install                 # lockfile: pnpm-lock.yaml
+cp .env.example .env         # fill DATABASE_URL — ask #eng for the dev value
+docker compose up -d db      # local Postgres on :5432
+pnpm db:migrate              # apply schema
+pnpm dev                     # http://localhost:3000
+pnpm test                    # vitest — should be all green before you start
+```
+**Your first change:** edit the heading in `src/app/page.tsx`, save —
+the dev server hot-reloads and the new text shows at `localhost:3000`.
+That confirms your setup end to end.
+## How the code fits
+- Entry points: `src/app/` (routes), `src/server/` (API handlers), `prisma/` (schema).
+- Flow: route → handler in `src/server/` → Prisma → Postgres.
+- Surprises for newcomers:
+  - `pnpm db:generate` must run after editing `prisma/schema.prisma` — the client is generated, never hand-edited.
+  - `src/lib/legacy/` is frozen; new code goes in `src/lib/`.
+  - The first `pnpm build` after install fails unless `pnpm db:generate` has run.
+## Workflow
+- Branch: `feat/<short-desc>` or `fix/<short-desc>` off `main`.
+- Commits: Conventional Commits (`.gitmessage`); PRs use the template.
+- Before pushing: `pnpm lint && pnpm typecheck`.
+- CI gates merge on: lint, typecheck, `vitest`, and a preview deploy.
+## Troubleshooting
+- `ECONNREFUSED 5432` → `docker compose up -d db` isn't running.
+- `Prisma client not generated` → `pnpm db:generate`.
+- Port 3000 in use → `pnpm dev -- --port 3001`.
+## Deeper docs
+- Architecture & design decisions → `docs/architecture.md`, `docs/adr/`
+- Deploy & on-call → `docs/runbooks/`
+````
+Every command above is quoted from a real script; the report lists exactly which were verified against the repo and which (if any) were flagged unverified for the maintainer to confirm.
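The rule that every quoted command must map to a script that exists can be checked mechanically. The following is a minimal sketch, assuming a pnpm project whose scripts live in `package.json`; the function name `unverified_script_commands`, the short list of built-in subcommands, and the sample inputs are hypothetical and not part of the skill.

```python
import json
import re

# pnpm subcommands that are built in and need no entry under "scripts".
# Illustrative subset only (an assumption for this sketch).
BUILTIN_SUBCOMMANDS = {"install", "add", "remove", "update", "exec", "dlx", "run"}

# Matches `pnpm dev`, `pnpm run dev`, `pnpm db:migrate`; the name must start
# with a letter so prose such as "pnpm 9" is not mistaken for a command.
COMMAND_RE = re.compile(r"\bpnpm\s+(?:run\s+)?([A-Za-z][\w:-]*)")


def unverified_script_commands(guide_text: str, package_json_text: str) -> list[str]:
    """Return script names used in the guide that package.json does not define."""
    scripts = json.loads(package_json_text).get("scripts", {})
    missing: list[str] = []
    for match in COMMAND_RE.finditer(guide_text):
        name = match.group(1)
        if name in BUILTIN_SUBCOMMANDS or name in scripts:
            continue
        if name not in missing:  # report each unknown script once
            missing.append(name)
    return missing


# Hypothetical inputs: the guide mentions `pnpm seed`, which has no script.
GUIDE = "pnpm install\npnpm db:migrate\npnpm dev\npnpm seed\n"
PACKAGE_JSON = '{"scripts": {"dev": "next dev", "db:migrate": "prisma migrate dev"}}'
print(unverified_script_commands(GUIDE, PACKAGE_JSON))
```

Anything the check returns belongs in the report as unverified instead of being left in the golden path as if it were confirmed.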

package/content/skills/query-plan-analyzer.md ADDED

@@ -0,0 +1,49 @@
+---
+name: "query-plan-analyzer"
+description: "Read a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A slow query is almost never slow for the reason you'd guess from reading the SQL. The plan is the ground truth: it shows the database actually chose a Seq Scan over the 40-million-row table, actually fed 500,000 rows into a Nested Loop that estimated 5, actually sorted on disk because no index could supply the order. This skill pulls the **real** plan — `EXPLAIN ANALYZE` with `BUFFERS`, not bare `EXPLAIN` — reads it from the most expensive node outward, names the one node that's costing the time and *why*, and turns that into a specific fix: the index to add (with the right column order), the rewrite that makes the predicate sargable, or the `ANALYZE` that fixes the estimate. Then it re-runs the plan to prove the bad node is gone instead of declaring victory from theory.
+## When to use this skill
+- One specific query (an endpoint, a report, a dashboard panel) is slow and you need the cause, not a vague "add some indexes."
+- A query that was fast got slow after a data-volume change, a deploy, or a schema/index change.
+- The planner is doing something surprising — a Seq Scan despite an index existing, or ignoring the index you just added.
+- p99 latency on one query is high while the table and load look unremarkable, and you suspect the plan rather than the hardware.
+- Before shipping a new query or a `migration-writer` index change, to verify the plan is what you intended.
+## Instructions
+1. **Get the table shape and existing indexes before touching the plan.** Read the schema for the queried tables: column types, the existing indexes and their column order, row counts (`SELECT reltuples FROM pg_class`, or `\d+`), and whether stats are fresh (`pg_stat_user_tables.last_analyze` / `n_mod_since_analyze`). Grep the codebase for where the query is built so you tune the real SQL (including how parameters bind), not a hand-typed approximation.
+2. **Run the REAL plan with actual rows, timing, and I/O; never bare EXPLAIN.** Use `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)` in Postgres (`EXPLAIN ANALYZE`, which prints a tree-format plan, in MySQL 8.0.18+). `ANALYZE` *executes* the query and reports actual rows + per-node `actual time`; `BUFFERS` shows shared/local hits vs. reads (heavy `read=` means I/O, not CPU, is the cost). Run it 2–3 times so a cold-cache first run doesn't masquerade as a planning problem. For a write query, wrap it in a transaction and `ROLLBACK` so `ANALYZE` doesn't mutate data.
+3. **Read from the most expensive node outward to find where the time actually is.** In the Postgres text plan, `actual time=startup..total` is reported in milliseconds per loop and includes the node's children, so a node's real contribution is roughly its total time × `loops` minus the same product for its children. Find the deepest/innermost node whose own time and `loops × rows` dominate the total. That node, not the top of the plan, is what you fix. Note its `actual rows`, `loops`, and the `Rows Removed by Filter` line.
+4. **Check the estimate-vs-actual gap FIRST — a wide gap means stale stats, and that's the real bug.** Compare each node's estimated rows (`rows=`) to `actual rows`. A gap of more than ~10x (e.g. plans for 5 rows, processes 50,000) means the planner is choosing strategy on bad information — usually stale statistics. **Fix this before adding any index:** run `ANALYZE <table>;` (or `ANALYZE` the whole DB) and re-pull the plan. Often the plan corrects itself once estimates are right, and an index you'd have added would have been the wrong one.
+5. **Match the symptom to the culprit, then to the fix:**
+   - **Seq Scan on a large table with a selective predicate** → the predicate filters to few rows but there's no usable index. Add a b-tree on the filtered column(s). (A Seq Scan returning most of the table is *correct* — don't index it.)
+   - **Nested Loop with high `loops` over many outer rows** → the join is iterating per-row when it should batch. The cause is usually a bad row estimate (see step 4) or a missing join-key index; a corrected estimate or an index on the inner join column lets the planner pick a Hash/Merge Join.
+   - **Sort (especially `Sort Method: external merge  Disk:`)** → the query sorts at runtime and spills to disk. A b-tree index in the `ORDER BY` order can supply rows pre-sorted, removing the Sort node entirely (and powering `LIMIT` early-exit).
+   - **High `Rows Removed by Filter`** → the database fetched far more rows than it kept; the filter ran *after* the scan instead of being pushed into an index. Move the discriminating column into the index so it's a condition, not a post-filter.
+   - **Heavy `Buffers: ... read=`** → the working set isn't cached; a smaller/covering index reduces pages touched, or the data genuinely doesn't fit memory.
+6. **Check index sargability — an index the predicate can't use is no fix at all.** A b-tree is defeated by a function or cast on the column (`lower(email) = ?`, `date(created_at) = ?`, `col::text = ?`), by a leading-wildcard `LIKE '%x'`, and by an `OR` across different columns. The fix is a matching **expression index** (`CREATE INDEX ... ON t (lower(email))`), a rewrite to a range (`created_at >= d AND created_at < d+1`), or `UNION`-ing the `OR` branches — not a plain index on the raw column.
+7. **Order multi-column index columns for the predicate, then the sort.** Put equality-predicate columns first (leftmost), then the range/inequality column, then `ORDER BY` columns — so one index serves both the filter and the ordering. A column used only for a range can't have an equality column usefully placed after it. State the exact `CREATE INDEX` DDL, including `INCLUDE`d columns if a covering index would turn an Index Scan into an Index-Only Scan.
+8. **Re-run `EXPLAIN ANALYZE` after the fix and confirm the bad node is gone.** Apply the fix (in Postgres, build the index `CONCURRENTLY` to avoid a write lock; `migration-writer` can wrap the DDL). Re-pull the plan and verify the offending node changed type (Seq Scan → Index Scan, Nested Loop → Hash Join, Sort → no Sort) and that total `actual time` dropped. If the planner *ignores* the new index, run `ANALYZE` and re-check sargability before concluding the index is wrong.
+> [!WARNING]
+> Bare `EXPLAIN` shows the planner's *guess*, not reality — it never runs the query, so it can't reveal a Nested Loop that estimated 5 rows and processed half a million, or which node actually burned the time. Diagnose with `EXPLAIN ANALYZE` every time; tuning from estimates is how you add the wrong index.
+> [!WARNING]
+> A wide estimated-vs-actual row gap (>10x) means stale statistics, and that is the root cause — fix it with `ANALYZE` *before* adding indexes. An index chosen to compensate for a bad estimate is often useless or harmful once the estimate is corrected, and you'll have shipped a write-amplifying index that the planner ignores.
+> [!NOTE]
+> `EXPLAIN ANALYZE` executes the statement. For `INSERT`/`UPDATE`/`DELETE`, run it inside `BEGIN; ... ROLLBACK;` so diagnosis doesn't change data — and be aware it still fires triggers and acquires locks during the run.
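The estimate-vs-actual check from step 4 can be automated over the text plan. This is a minimal sketch, assuming Postgres `FORMAT TEXT` output with the usual `(cost=... rows=... width=...) (actual time=... rows=... loops=...)` layout; the function name `misestimated_nodes`, the regex, and the sample plan are hypothetical illustrations, not real output from any database.

```python
import re

# One plan node per line: label, planner estimate, then actual figures.
NODE_RE = re.compile(
    r"^\s*(?:->\s*)?(?P<label>.+?)\s+"
    r"\(cost=[\d.]+\.\.[\d.]+ rows=(?P<est>\d+) width=\d+\) "
    r"\(actual time=[\d.]+\.\.[\d.]+ rows=(?P<act>[\d.]+) loops=(?P<loops>\d+)\)"
)


def misestimated_nodes(plan_text, threshold=10.0):
    """Return (label, estimated, actual, ratio) for nodes whose gap exceeds threshold."""
    flagged = []
    for line in plan_text.splitlines():
        m = NODE_RE.match(line)
        if not m:
            continue  # detail lines such as "Filter:" or "Buffers:" carry no estimate
        # Clamp both sides to 1 so a node that returned 0 rows cannot divide by zero.
        est = max(int(m["est"]), 1)
        act = max(float(m["act"]), 1.0)
        ratio = max(est, act) / min(est, act)
        if ratio > threshold:
            flagged.append((m["label"], est, act, ratio))
    return flagged


# Hypothetical plan, written by hand for this sketch.
SAMPLE_PLAN = """\
Nested Loop  (cost=0.43..1250.10 rows=5 width=16) (actual time=0.050..812.400 rows=500000 loops=1)
  ->  Seq Scan on orders o  (cost=0.00..431.00 rows=5 width=8) (actual time=0.012..45.321 rows=50000 loops=1)
        Filter: (status = 'open'::text)
  ->  Index Scan using items_order_id_idx on items i  (cost=0.43..8.45 rows=10 width=8) (actual time=0.004..0.012 rows=10 loops=50000)
"""

for label, est, act, ratio in misestimated_nodes(SAMPLE_PLAN):
    print(f"{label}: estimated {est}, actual {act:.0f} ({ratio:.0f}x off)")
```

Both row counts are per-loop figures in the text plan, which is why they are compared directly without multiplying by `loops`.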
+## Output
+A short report with three parts:
+1. **Annotated plan** — the offending node quoted from the `EXPLAIN ANALYZE` output, with its `actual rows` vs. estimate, `loops`, `Rows Removed by Filter`, and `Buffers`, plus a one-line statement of *why* it's the bottleneck (Seq Scan / stale-stats row gap / Nested Loop blowup / disk Sort / non-sargable predicate).
+2. **The specific fix** — exact `CREATE INDEX ... CONCURRENTLY` DDL with the column order justified, or the SQL rewrite, or the `ANALYZE <table>` command. One concrete action, not a menu.
+3. **Before/after proof** — total `actual time` and the changed node type from the re-run plan (e.g. `Seq Scan 1240 ms → Index Scan 3 ms`), confirming the bad node is gone rather than asserting it should be.
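The column-ordering rule from step 7 (equality columns, then the range column, then `ORDER BY` columns, with covering columns in `INCLUDE`) can be sketched as a small DDL builder. A minimal sketch, assuming Postgres 11+ syntax and trusted, unquoted identifiers; the function name `build_index_ddl`, the naming scheme, and the `orders` columns are hypothetical.

```python
def build_index_ddl(table, equality_cols, range_col=None,
                    order_by_cols=(), include_cols=(), name=None):
    """Build CREATE INDEX DDL with key columns ordered equality, range, ORDER BY."""
    candidates = list(equality_cols)
    if range_col:
        candidates.append(range_col)
    candidates.extend(order_by_cols)

    key_cols = []
    for col in candidates:
        if col not in key_cols:  # a column appears once, at its earliest position
            key_cols.append(col)
    if not key_cols:
        raise ValueError("an index needs at least one key column")

    index_name = name or f"{table}_{'_'.join(key_cols)}_idx"
    ddl = f"CREATE INDEX CONCURRENTLY {index_name} ON {table} ({', '.join(key_cols)})"

    # INCLUDE carries payload columns only; key columns are already in the index.
    payload = [c for c in include_cols if c not in key_cols]
    if payload:
        ddl += f" INCLUDE ({', '.join(payload)})"
    return ddl + ";"


print(build_index_ddl("orders", ["customer_id", "status"],
                      range_col="created_at", include_cols=["total"]))
```

The builder only encodes the ordering rule; whether the planner actually uses the resulting index still has to be confirmed by re-running the plan.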

package/content/skills/rbac-designer.md ADDED

@@ -0,0 +1,82 @@
+---
+name: "rbac-designer"
+description: "Design the authorization model itself — fine-grained permissions on resources composed into roles, with the right amount of resource/tenant scoping — instead of scattering role-name checks through handlers. Use when building multi-user or multi-tenant authorization, when `if user.isAdmin` checks are sprawling across the codebase, or when 'who can do what' needs a real model rather than ad-hoc gates."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Design the authorization model — the permission system itself — rather than reviewing one that exists. The job is to decide *what capabilities exist*, *how they compose into roles*, *how far each check is scoped*, and *where enforcement lives* — so that application code asks one question, **"can this actor perform this action on this resource?"**, instead of the brittle `if (user.isAdmin)` checks that breed across handlers and rot the moment requirements change. The skill reads the codebase to find the resources, actions, and existing role checks, then produces a concrete permission/role model, a single central enforcement design, and explicit decisions on hierarchy, default-deny, tenant isolation, and audit.
+## When to use this skill
+- Building authorization for a multi-user or multi-tenant (SaaS) product, where access depends on both *who* the actor is and *which org/project/resource* they are touching.
+- When ad-hoc role checks — `if (user.role === 'admin')`, `user.isManager`, `@RequireRole("OWNER")` — are sprawling through controllers and every new rule means a code hunt.
+- When "who can do what" is tribal knowledge with no single model, or a customer/security review asks you to document the permission matrix.
+- Before adding roles, a permissions UI, custom roles, or an admin-impersonation feature on top of a system that hardcodes role names.
+> [!WARNING]
+> Scattering role-name checks (`isAdmin`, `role === "manager"`) through the codebase instead of checking granular permissions makes every permission change a risky code hunt and guarantees missed spots — the endpoint you forget is the privilege-escalation bug. Model permissions, compose them into roles, and enforce in one place so a grant change is one edit and coverage is greppable.
+## Instructions
+1. **Inventory resources and actions before inventing roles.** Glob the routers, controllers, and data models (`**/routes/**`, `**/*controller*`, `**/models/**`, `**/entities/**`) and list every *resource* (invoice, project, user, billing-account) and every *action* on it (read, create, update, delete, approve, export, invite). Permissions are these `resource:action` pairs — `invoice:read`, `invoice:approve`, `member:invite`. Name them after the capability, not the role, so the same permission can be granted to many roles. This list is the vocabulary; everything else composes it.
+2. **Compose permissions into roles — never the reverse.** Define roles as *named sets of permissions* (`viewer = {invoice:read, project:read}`, `approver = viewer ∪ {invoice:approve}`). Code checks `can(actor, "invoice:approve", invoice)`, never `actor.role === "approver"`. This is the whole point: when product says "approvers can now export", you edit one role→permission map, not every handler. Grep the codebase for existing `role ===`, `isAdmin`, `hasRole`, `@Role`, `@PreAuthorize` sites and list each as a call site to migrate to a permission check.
+3. **Pick the granularity you actually need — and stop there.** Choose explicitly among three:
+   - **Pure RBAC** (roles → permissions, global) — fine for single-tenant internal tools where a role means the same thing everywhere.
+   - **Scoped RBAC** (role *within* an org/project/workspace) — the default for SaaS: a user is `admin` of org A and `viewer` of org B, and every check is scoped to the resource's tenant. Model the assignment as `(actor, role, scope)`.
+   - **ReBAC / ABAC** (permission depends on the specific object's relationship or attributes — "owner of THIS document", "assignee of THIS ticket") — reach for this *only* for the per-object rules; let scoped RBAC carry the rest. Do **not** stand up a full policy engine if scoped RBAC suffices.
+   State the choice and the reason; mixing scoped RBAC for the 90% with a handful of ReBAC ownership rules is usually correct.
+4. **Centralize enforcement in one authorization layer.** Design a single policy function/middleware — `authorize(actor, action, resource)` (or a guard/policy class) — that every entry point routes through: HTTP handlers, GraphQL resolvers, queue/cron jobs, and admin scripts. No handler should make its own role decision. Specify *where* it sits (e.g. middleware that resolves the resource, computes the actor's permissions in that scope, and allows/denies) so coverage is provable by reading one module, not auditing hundreds.
+5. **Default-deny, explicitly.** The policy layer returns deny unless a rule grants. A new route with no policy attached must fail closed (no access), never fall through to allowed. Specify how an un-annotated/un-checked endpoint is detected and rejected (e.g. a route-level assertion that a policy was declared) so "forgot to add a check" becomes a *deny*, not a hole.
+6. **Decide role hierarchy and inheritance deliberately.** If `admin` should imply everything `editor` can do, model it as *permission inheritance* (admin's permission set ⊇ editor's) computed when permissions are resolved — not as a chain of `if role >= X` comparisons, which reintroduce role-name logic. Keep the hierarchy shallow and flatten to an effective permission set at check time; document the partial order so "what can role X do" is answerable from the model alone.
+7. **Scope every check to the resource — at the API *and* data layer.** A valid role on tenant A must never act on tenant B's data. The permission check answers "may this actor approve invoices?"; the *data* layer must additionally bind the query to the resource's owner/tenant (`WHERE org_id = :actorOrg`, a tenant filter, or row-level security), so changing an id in the URL cannot reach another tenant's row. Specify both: the policy check *and* the scoped query. Skipping the data-layer scope is the classic IDOR — the permission passed, but the object belonged to someone else.
+8. **Make it auditable.** Design the model so authorization decisions are explainable and logged: who has which role in which scope (queryable), what permissions a role grants (the map), and a decision log for sensitive actions (actor, action, resource, allow/deny, why). A model nobody can answer "who can approve invoices in org X?" about is not finished.
+> [!NOTE]
+> RBAC without per-tenant/resource scoping is the most common real failure: a legitimate `admin` of org A passes the `invoice:approve` permission check and then approves org B's invoice because the query fetched by id alone. The permission says *what* the actor may do; the scope says *to which objects*. Both are required — design them together, not as an afterthought.
+## Output
+A concrete authorization design with four parts:
+1. **The permission/role model** — the resource×action permission list, the role→permission map (with inheritance), and the assignment shape (`(actor, role)` for pure RBAC or `(actor, role, scope)` for scoped/multi-tenant).
+2. **The central enforcement design** — the single `authorize(actor, action, resource)` entry point, where it sits, what it resolves, and the list of existing scattered role checks to migrate into it.
+3. **Granularity decision** — pure RBAC vs scoped RBAC vs ReBAC/ABAC, stated with the reason, including which specific rules (if any) need per-object relationship checks.
+4. **The hardening decisions** — default-deny mechanism, role hierarchy/partial order, the API-and-data-layer scoping rule per resource, and the audit/decision-log plan.
+```text
+Authorization model — scope: src/routes/**, src/models/**  (multi-tenant SaaS)
+Granularity: SCOPED RBAC (role within org) + ReBAC for document ownership
+PERMISSIONS (resource:action)
+  invoice: read, create, update, delete, approve, export
+  member:  read, invite, remove
+  doc:     read, edit  (edit also gated by ownership — see ReBAC)
+ROLES → PERMISSIONS  (within an org)
+  viewer   = {invoice:read, member:read, doc:read}
+  editor   = viewer ∪ {invoice:create, invoice:update, doc:edit}
+  approver = editor ∪ {invoice:approve, invoice:export}
+  admin    = approver ∪ {member:invite, member:remove}     # inherits all above
+ASSIGNMENT:  (user_id, role, org_id)        # scoped — same user differs per org
+ENFORCEMENT (one layer)
+  authorize(actor, action, resource):
+    1. resolve actor's role in resource.org_id   -> effective permission set
+    2. deny if action ∉ permissions              # DEFAULT-DENY
+    3. ReBAC rule: doc:edit also requires resource.owner_id == actor.id
+  Every route/resolver/job calls authorize(); routes with no policy → fail closed.
+MIGRATE these scattered checks into authorize():
+  - src/routes/invoices.ts:41   if (user.isAdmin)        -> can(..,"invoice:approve",inv)
+  - src/routes/members.ts:88    user.role === "owner"    -> can(..,"member:invite",org)
+DATA-LAYER SCOPING (prevents IDOR — required alongside the permission check)
+  invoices:  WHERE id = :id AND org_id = :actorOrg     # not findById(id) alone
+  docs:      WHERE id = :id AND org_id = :actorOrg      # + ReBAC owner check above
+AUDIT
+  - role assignments queryable: "who can invoice:approve in org X?"
+  - decision log on approve/export/remove: actor, action, resource, allow/deny
+```
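The model in the text block above can be sketched as one enforcement function. A minimal sketch, assuming in-memory role assignments keyed by `(user_id, org_id)`; the `Resource` dataclass, the `ASSIGNMENTS` table, and the user and org ids are hypothetical stand-ins for whatever the real system stores.

```python
from dataclasses import dataclass
from typing import Optional

# Roles are named permission sets; inheritance is plain set union.
VIEWER = frozenset({"invoice:read", "member:read", "doc:read"})
EDITOR = VIEWER | {"invoice:create", "invoice:update", "doc:edit"}
APPROVER = EDITOR | {"invoice:approve", "invoice:export"}
ADMIN = APPROVER | {"member:invite", "member:remove"}
ROLE_PERMISSIONS = {"viewer": VIEWER, "editor": EDITOR,
                    "approver": APPROVER, "admin": ADMIN}


@dataclass(frozen=True)
class Resource:
    kind: str
    org_id: str
    owner_id: Optional[str] = None


# Scoped assignment: the same user can hold different roles per org.
ASSIGNMENTS = {
    ("u1", "orgA"): "admin",
    ("u1", "orgB"): "viewer",
    ("u2", "orgA"): "editor",
}


def authorize(user_id: str, action: str, resource: Resource,
              assignments=ASSIGNMENTS) -> bool:
    """Single entry point: allow only when a rule explicitly grants the action."""
    role = assignments.get((user_id, resource.org_id))
    permissions = ROLE_PERMISSIONS.get(role, frozenset())
    if action not in permissions:
        return False  # default-deny: no role in this org, or no such grant
    if action == "doc:edit" and resource.owner_id != user_id:
        return False  # ReBAC rule: editing a doc also requires ownership
    return True


print(authorize("u1", "invoice:approve", Resource("invoice", "orgA")))  # admin of orgA
print(authorize("u1", "invoice:approve", Resource("invoice", "orgB")))  # viewer of orgB
```

This covers the permission check only. The data-layer scoping listed in the model still has to bind each query to the tenant, since `authorize()` trusts the `org_id` on the resource it is handed.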

package/content/skills/release-notes-writer.md ADDED

@@ -0,0 +1,78 @@
+---
+name: "release-notes-writer"
+description: "Write user-facing release notes — the curated 'what's new and what it means for you' — by starting from the real changes (git log / merged PRs / the changelog since the last release) and translating developer-speak into user impact, grouped by what the user cares about with breaking changes and required actions surfaced first. Use when shipping a release to users or customers and the raw commit log isn't something a user should read, when you need a published GitHub-release / blog / in-app announcement, or when a breaking change must be made unmissable so upgrades don't break."
+allowed-tools: "Read, Grep, Glob, Bash, Write"
+version: 1.0.0
+---
+A changelog records *what changed*; release notes explain *what it means for the person upgrading*. Pasting raw conventional-commit lines into a release fails users twice: it buries the two things they actually need under twenty refactors and dependency bumps, and it hides the one breaking change that will take down their integration on upgrade. This skill reads the real changes since the last release, throws away the churn users don't care about, translates the rest into impact-and-action language grouped the way a user thinks (New / Improved / Fixed), and puts breaking changes and required steps at the top where they cannot be missed.
+## When to use this skill
+- You are shipping a release to end users or API consumers and the commit log / changelog is not something they should read.
+- You need a GitHub release body, a "what's new" blog post, or an in-app changelog entry — not an internal diff.
+- A release contains a breaking change or a required migration and you need it surfaced first, with the exact action spelled out.
+- You have a draft changelog (e.g. from `changelog-from-prs`) and need to convert it into something audience-appropriate and benefit-led.
+## Instructions
+1. **Start from the real changes, not memory.** Establish the range from the last released tag and pull the actual shipped work — never invent items or summarize from what you "think" landed.
+   ```bash
+   LAST_TAG=$(git describe --tags --abbrev=0)
+   git log "$LAST_TAG"..HEAD --no-merges --pretty='%s'
+   gh pr list --state merged --search "merged:>$(git log -1 --format=%cI "$LAST_TAG")" \
+     --json number,title,labels,body --limit 200
+   ```
+   If a `CHANGELOG.md` already covers this range, read it as the source of record instead of re-deriving from commits.
+2. **Identify the audience and pin the voice.** End users, API consumers, and self-hosting operators need different notes. Look at where this publishes (`README`, app store text, GitHub release, developer docs) and at past release notes for tone. API/SDK consumers need exact symbol/endpoint names and code; end users need plain-language benefit and a screenshot-level description, not the function that changed.
+3. **Drop the churn.** Remove everything a user cannot observe: internal refactors, test-only changes, CI/build config, dependency bumps with no behavior change, lint/format, doc-internal edits. A 60-commit release is often 5 user-facing notes. Keep a dependency bump *only* if it fixes a user-visible bug or a known CVE the user is exposed to — and say which.
+4. **Extract breaking changes and required actions first — this is the part that breaks systems if you get it wrong.** Scan PR bodies/commits for `BREAKING`, `!` in conventional-commit type, removed/renamed exports, flags, endpoints, config keys, changed defaults, and tightened validation. For each, write: what changed, who it affects, and the **exact action** the user must take to upgrade safely (the command, the renamed field, the config edit), with a link to a migration guide if one exists. Cross-check against the SemVer bump — a major bump with zero listed breaking changes means you missed one.
+5. **Group the rest by what the user cares about, in benefit language.** Use **New** (capabilities they didn't have), **Improved** (things that got faster/better/clearer), **Fixed** (bugs that affected them). Rewrite each from implementation to impact: not "refactor `ExportService` to stream rows" but "Exports of large datasets no longer time out." For notable new features add a one-line *how to use it* (the flag, the menu, the endpoint). Order within each group by how many users it affects, not by PR number.
+6. **Append upgrade instructions and links.** Give the concrete upgrade step for this project (`npm i pkg@2.0.0`, the container tag, the migration command) and link the full changelog, the migration guide, and relevant docs for new features. Keep PR/issue references only where a user might want the detail — don't litter end-user notes with `(#1423)`.
+7. **Lead with a one-line summary and write the header.** Open with a single sentence a user can skim ("v2.0 adds scheduled exports and a JSON API; one breaking change to the auth header"). Then breaking/action-required, then New / Improved / Fixed, then upgrade steps. Emit it as Markdown ready to paste — publish nothing yourself.
+> [!WARNING]
+> Release notes are not a commit dump. Pasting raw conventional-commit lines (`feat:`, `chore(deps):`, `refactor:`) buries the few items users need under noise they cannot act on, and makes the notes look auto-generated and untrustworthy. Translate to impact and delete the rest.
+> [!CAUTION]
+> A breaking change hidden mid-list — or omitted because it "looked small" — is how you break your users' systems on upgrade. Every removed/renamed flag, changed default, tightened validation, or altered response shape goes in a **Breaking changes / action required** block at the very top, with the exact migration step. If the SemVer bump is major but you wrote no breaking items, stop and re-scan; you missed one.
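The SemVer cross-check from step 4 can be automated as a pre-publish guard. The sketch below is illustrative only: `is_breaking`, `check_bump`, and the tag format (`v1.6.0`) are assumptions, not part of this skill. The second check (breaking items without a major bump) is an extension that follows the SemVer rule for versions at or above 1.0.0.

```python
import re

# Conventional-commit subjects such as "feat!:" or "feat(api)!:" mark a breaking change.
BANG = re.compile(r"^[a-z]+(\([^)]*\))?!:")


def is_breaking(subject: str, body: str = "") -> bool:
    """Return True if a commit subject/body carries a breaking-change marker."""
    return bool(BANG.match(subject)) or "BREAKING" in subject or "BREAKING" in body


def parse_semver(tag: str) -> tuple[int, int, int]:
    """Parse 'v1.6.0' or '1.6.0' into (major, minor, patch); pre-release/build parts are ignored."""
    core = tag.lstrip("v").split("-")[0].split("+")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def check_bump(prev_tag: str, new_tag: str, breaking_items: list[str]) -> list[str]:
    """Compare the version bump with the drafted breaking-changes block."""
    problems = []
    prev_major = parse_semver(prev_tag)[0]
    is_major = parse_semver(new_tag)[0] > prev_major
    if is_major and not breaking_items:
        problems.append("major bump but no breaking changes listed: re-scan")
    if breaking_items and not is_major and prev_major >= 1:
        problems.append("breaking changes listed without a major bump")
    return problems
```

An empty list from `check_bump` means the bump and the notes agree; any message is a reason to stop and re-scan before publishing.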
+## Output
+Publishable release notes — breaking-first, benefit-led — ready to paste into a GitHub release, blog post, or in-app changelog:
+```markdown
+# v2.0.0 — 2026-06-17
+Scheduled exports and a new JSON API. **One breaking change:** the API auth header was renamed — update integrations before upgrading.
+## ⚠️ Breaking changes — action required
+- **Auth header renamed `X-Token` → `Authorization: Bearer <key>`.** Requests using `X-Token` now return `401`. Update your client before upgrading. See the [migration guide](https://docs.example.com/migrate/v2).
+- **`export` no longer defaults to `format: csv`.** The default is now `json`. Add `format: csv` explicitly to keep the old behavior.
+## New
+- **Scheduled exports.** Set a cron in Settings → Exports to deliver reports automatically — no more manual runs.
+- **JSON API for reports.** Pull report data programmatically via `GET /api/v2/reports`. See the [API docs](https://docs.example.com/api).
+## Improved
+- Exports of large datasets no longer time out — they now stream and complete in seconds.
+- Faster dashboard load on accounts with many projects.
+## Fixed
+- Fixed a crash when a saved filter referenced a deleted field.
+- Times now display in the account's timezone instead of UTC.
+## Upgrade
+1. Update auth headers per the breaking change above.
+2. `npm i your-pkg@2.0.0` (or pull image tag `:2.0.0`).
+3. Run `your-cli migrate` to apply the config default change.
+[Full changelog](https://github.com/org/repo/compare/v1.6.0...v2.0.0)
+```

package/content/skills/runbook-writer.md ADDED

@@ -0,0 +1,83 @@
+---
+name: "runbook-writer"
+description: "Write an operational runbook a half-asleep on-call engineer can execute at 3am — scoped to ONE alert, leading with how to confirm the problem, the copy-pasteable mitigation that stops user pain, then diagnosis, escalation, and verification. Use when an alert has no documented response, after an incident exposed a missing procedure, or when standing up on-call for a service."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Write the document the on-call engineer opens when a pager fires at 3am — and can actually follow. The skill takes one alert or symptom and produces a runbook in the order a responder needs it: **confirm → mitigate → diagnose → escalate → verify**. It mines the repo for the real commands, dashboards, and service names, writes each step as a literal instruction with its expected output ("run X; if you see Y, do Z"), and front-loads the mitigation that stops user pain *before* any investigation. The result stops bleeding first and explains second.
+## When to use this skill
+- An alert fires with no documented response — the responder is reverse-engineering the system at the worst possible time.
+- A postmortem found that recovery was slow because the procedure lived only in one person's head.
+- You're onboarding on-call for a service and need a runbook per page-worthy alert before the rotation starts.
+- An existing runbook is prose-heavy ("investigate the root cause") and unusable under stress.
+## Instructions
+1. **Scope to ONE symptom — refuse the generic doc.** A runbook answers exactly one page: `HighErrorRate on checkout-api`, `ReplicaLag > 30s`, `DiskUsage > 90% on db-primary`. If the user asks for an "operations runbook," push back and split it — one alert per file. Name it after the alert that links to it (`docs/runbooks/checkout-api-high-error-rate.md`), so the pager's "runbook" link lands here. Search existing alert rules (`grep -ri "alert\|expr:" prometheus*.yml *.rules.yml`) to use the alert's exact name.
+2. **Open with the fast path, not background.** The first thing on the page is a one-line summary of what's broken and the user impact ("Checkout returns 500s — customers can't pay"), then a **TL;DR mitigation** block: the single command that most often stops the pain. The responder should be able to act from the top of the file without scrolling. Save architecture and theory for the bottom (or omit it).
+3. **Step 1 is always CONFIRM — is this real?** Give the exact way to verify the alert isn't a flapping false positive: the literal dashboard URL, the PromQL/log query to paste, or the curl/CLI command, plus the expected output that means "yes, real." Mine the repo for these — read dashboard JSON, `*.rules.yml`, health-check endpoints, and `Makefile`/`justfile` targets — rather than inventing command names. Example: `kubectl -n prod get pods -l app=checkout-api` → "all should be `Running`; `CrashLoopBackOff` confirms the alert."
+4. **Step 2 is MITIGATE — stop the bleeding before diagnosing.** This is the most important section and it comes *before* root-cause work. Give the copy-pasteable command to roll back, fail over, restart, scale up, or feature-flag-off — with real paths, namespaces, and service names from the repo. State what each command does and how to know it worked. Order options by safety and speed (rollback to last-good deploy usually beats live debugging). Never make the reader derive the command.
+5. **Step 3 is DIAGNOSE — only now look for cause.** Numbered, branching steps in `run X → if you see Y → do Z` form. Every step is a literal command with expected output and the decision it drives. No step may say "investigate," "look into," "check if there's an issue," or any phrase that offloads a judgment call onto a stressed human — convert each into a concrete check with a concrete next action. Link the relevant logs query, trace view, and the service's SLO/error-budget dashboard.
+6. **Write ESCALATE with names and triggers.** State exactly *when* to page the next person and *who*: "If mitigation hasn't restored success rate within 15 min, page the #payments on-call via PagerDuty service `checkout-api`." Include the secondary/owning team, any vendor support path, and the threshold (duration, error count, blast radius) that makes escalation mandatory rather than optional.
+7. **End with VERIFY — confirm recovery, don't assume it.** Give the explicit check that service is restored: the same dashboard/query from step 1 showing healthy values, with the threshold to watch ("error rate back under 0.5% for 5 consecutive minutes"). Include any cleanup (re-enable the flag you turned off, scale back down) and a one-line prompt to capture timeline notes for the postmortem.
+8. **Keep every command current and report assumptions.** Verify each command against the repo (binary names, namespaces, flags, env). Flag any command you could not confirm against a real file so the user tests it before relying on it. A command you guessed is worse than no command — it sends the responder down a dead end at 3am.
+> [!WARNING]
+> A runbook full of "investigate the issue" or "check the logs and determine the cause" is useless at 3am — it just restates the panic. Every step must be a literal command with an expected output and an explicit next action. Equally, a runbook with a stale or never-executed command fails at the exact moment it's needed: treat unverified commands as bugs, and have someone dry-run the mitigation path in staging before trusting it.
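The "no vague steps" rule can be checked mechanically before a runbook is merged. This is a minimal sketch, assuming the phrase list quoted in step 5 and the warning above; `find_vague_steps` is a hypothetical helper, and a real check would likely need a longer, team-specific list.

```python
# Phrases the instructions ban because they hand a judgment call to the responder.
VAGUE_PHRASES = [
    "investigate",
    "look into",
    "check if there's an issue",
    "determine the cause",
]

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the sketch fence-safe


def find_vague_steps(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for every vague phrase outside fenced code blocks."""
    hits = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # commands inside fences are not prose steps
            continue
        if in_fence:
            continue
        lowered = line.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                hits.append((number, phrase))
    return hits
```

Each hit is a step to rewrite in `run X → if you see Y → do Z` form; an empty result does not prove the commands themselves are correct.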
+## Output
+A single Markdown file at `docs/runbooks/<alert-name>.md` for one symptom, ordered **confirm → mitigate → diagnose → escalate → verify**, with a TL;DR mitigation at the top, literal copy-pasteable commands, expected outputs, decision branches, and links to the dashboard / logs / trace view / SLO. The skill reports the file path and any command it could not verify against the repo.
+```markdown
+# Runbook: checkout-api — HighErrorRate
+**Impact:** Checkout returns 500s — customers cannot complete payment.
+**Alert:** `HighErrorRate{service="checkout-api"}` (fires at 5xx > 2% for 3m)
+**Dashboard:** https://grafana.internal/d/checkout-api/overview
+## TL;DR mitigation
+Roll back to the last-good deploy — fixes ~80% of these pages:
+    kubectl -n prod rollout undo deployment/checkout-api
+Success rate should recover within ~2 min on the dashboard above.
+## 1. Confirm it's real
+    kubectl -n prod get pods -l app=checkout-api
+Expect all `Running`. Any `CrashLoopBackOff`/`Error` confirms the alert.
+Cross-check the 5xx panel: https://grafana.internal/d/checkout-api/overview
+## 2. Mitigate (stop the bleeding)
+1. If a deploy went out in the last hour → `kubectl -n prod rollout undo deployment/checkout-api`.
+2. If pods are healthy but the DB is the source → fail over reads:
+   `kubectl -n prod set env deployment/checkout-api READ_REPLICA=db-replica-2`
+3. If a downstream dependency is down → disable checkout behind the flag:
+   `curl -XPOST https://flags.internal/api/checkout_enabled -d '{"value":false}'`
+Confirm recovery on the dashboard before moving on.
+## 3. Diagnose
+- Run `kubectl -n prod logs -l app=checkout-api --since=10m | grep -i error`.
+  If you see `connection refused: payments-svc` → page payments (step 4).
+  If you see `pq: too many connections` → scale the pool: `kubectl -n prod set env deployment/checkout-api DB_POOL_MAX=40`.
+- Traces: https://tempo.internal/explore?service=checkout-api
+- SLO / error budget: https://grafana.internal/d/checkout-api/slo
+## 4. Escalate
+If success rate is not restored within 15 min, page **#payments on-call**
+via PagerDuty service `checkout-api`. For DB failover that won't recover,
+page **#platform-db**. Vendor (Stripe) status: https://status.stripe.com
+## 5. Verify
+- 5xx rate back under 0.5% for 5 consecutive minutes on the dashboard.
+- Re-enable any flag you toggled: `curl -XPOST .../checkout_enabled -d '{"value":true}'`.
+- Note start/detect/mitigate/resolve timestamps for the postmortem.
+```

package/content/skills/semantic-cache-designer.md ADDED

@@ -0,0 +1,40 @@
+---
+name: "semantic-cache-designer"
+description: "Design a semantic cache for LLM responses — serve a cached answer when a new query is similar enough to a past one — to cut cost and latency on repetitive traffic, with the similarity threshold calibrated on real query pairs and a cache key that prevents cross-user/model leaks. Use when an LLM app sees many near-duplicate prompts (FAQs, support, search), when token spend on repetitive queries is high, or when latency on common questions matters."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A semantic cache turns "I've answered this before" into a skipped LLM call: embed the incoming query, find the nearest past query, and if it's close enough, return that cached answer. Done right it slashes cost and tail latency on FAQ/support/search traffic. Done wrong it confidently returns a *different* question's answer, or leaks one user's answer to another. This skill makes the two load-bearing decisions — the similarity threshold and the cache key — explicit and calibrated, instead of trusting a vibe-picked cosine cutoff.
+## When to use this skill
+- An LLM app gets many near-duplicate prompts — FAQs, support tickets, product search, "explain X" — and most calls re-derive the same answer.
+- Token spend is dominated by repetitive traffic and you want to stop paying for the same completion twice.
+- Latency on common questions matters (p50/p95) and a cache hit would return in milliseconds instead of seconds.
+- You're about to bolt a `GPTCache`-style layer onto a RAG or chat app and need the threshold/key/TTL decided before it ships.
+## Instructions
+1. **Pin down what a "correct hit" means before touching code.** A hit is only correct if the cached answer would still be the *right* answer for the new query. Write down the inputs that change the correct answer beyond the query text — user/tenant, locale/language, the retrieved-context version (for RAG), the model + version, system-prompt version, and any personalization. This list becomes the cache key in step 5; everything else flows from it.
+2. **Design the lookup.** Embed the incoming query with the *same* model and input-type used for the stored queries (a query/document asymmetry mismatch quietly wrecks similarity — see `embedding-set-inspector`). Look up the single nearest stored entry by vector similarity (cosine on normalized vectors), scoped to the exact key from step 5. Return the cached answer only if `similarity >= threshold`; otherwise it's a miss → call the LLM and write the new entry.
+3. **Calibrate the threshold on real query pairs — do not pick it from a blog post.** Pull ~100-300 query pairs from production logs and label each pair as "same intent / cached answer is correct" or "different intent / would be wrong." Sweep the threshold (e.g. 0.80→0.97) and at each value compute false-hit rate (returned a wrong answer) and false-miss rate (missed a valid reuse). Pick the threshold from this curve, not by feel.
+4. **Bias toward false-miss when a wrong answer is costly.** A false miss costs one extra LLM call; a false hit ships a confidently wrong answer to a user. For support/medical/financial/legal surfaces, choose the stricter threshold even if hit rate drops — a missed hit is cheap, a wrong hit is a trust incident.
+5. **Build the full cache key — never key on query text alone.** Namespace the cache (or the embedding lookup) by every input from step 1: `tenant + locale + model@version + prompt@version + context@version`. Personalized or per-user answers must include the user/tenant in the key. Omitting any of these is how you serve user A's answer to user B, or a `claude-opus-4` answer out of a `claude-haiku` cache after a model swap.
+6. **Set TTL and invalidation for answers that go stale.** Static facts can live long; RAG answers over changing data must expire (or be invalidated) when the underlying documents change — tag entries with the `context@version`/document IDs they depended on and evict on update. Time-sensitive answers ("current status", "today's price") get a short TTL or land in the no-cache list (step 7).
+7. **Decide explicitly what NOT to cache.** Exclude personalized/account-specific answers that lack a per-user key, time-sensitive or real-time responses, stateful/multi-turn replies that depend on conversation history, and anything with side effects (tool calls, writes). Caching these is worse than no cache. Write the no-cache predicate down as a rule, not a hope.
+8. **Measure hit *quality*, not just hit rate.** Track cache hit rate, token/cost saved, and latency delta — but also sample a slice of live hits (e.g. 1-2%) and judge whether the cached answer was actually right for the new query (LLM-as-judge or human review). Report false-hit rate as a first-class metric. A 60% hit rate that's 10% wrong is worse than a 35% hit rate that's clean.
+> [!WARNING]
+> A too-loose threshold is the signature failure of semantic caching: "How do I cancel my subscription?" and "How do I cancel my *order*?" are highly similar in embedding space, so the cache serves a fluent, confident answer to the *wrong* question. The user can't tell the answer was matched to a different question. Always validate the threshold against labeled different-intent pairs, not just same-intent ones.
+> [!WARNING]
+> Omitting context/user/model from the cache key leaks answers across boundaries — across users (privacy incident), across locales (wrong language), or across model/prompt versions (you keep serving the old model's answers after a deploy). The key must change whenever the correct answer would change.
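The threshold sweep from step 3 can be sketched as below. The rate definitions are assumptions made for this example: false-hit rate is measured over the different-intent pairs and false-miss rate over the same-intent pairs. `sweep` and `pick_threshold` are hypothetical names, and the similarities are expected to be precomputed from labeled production pairs.

```python
from typing import Optional

# Each labeled pair: (cosine similarity, True if the cached answer would be correct).
Pair = tuple[float, bool]


def sweep(pairs: list[Pair], thresholds: list[float]) -> list[tuple[float, float, float]]:
    """Return (threshold, false_hit_rate, false_miss_rate) for each candidate threshold."""
    same = [sim for sim, same_intent in pairs if same_intent]
    diff = [sim for sim, same_intent in pairs if not same_intent]
    rows = []
    for t in thresholds:
        false_hits = sum(sim >= t for sim in diff)    # would serve a wrong answer
        false_misses = sum(sim < t for sim in same)   # would skip a valid reuse
        rows.append((
            t,
            false_hits / len(diff) if diff else 0.0,
            false_misses / len(same) if same else 0.0,
        ))
    return rows


def pick_threshold(rows: list[tuple[float, float, float]],
                   max_false_hit_rate: float) -> Optional[float]:
    """Pick the loosest threshold whose false-hit rate stays within budget."""
    acceptable = [t for t, false_hit, _ in rows if false_hit <= max_false_hit_rate]
    return min(acceptable) if acceptable else None
```

Choosing the loosest acceptable threshold keeps the hit rate as high as the false-hit budget allows; for high-stakes surfaces, set `max_false_hit_rate` low and accept the extra misses, as step 4 recommends.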
+## Output
+- **Lookup design** — embedding model + input-type, similarity metric, nearest-neighbor scoping, and the hit/miss decision rule.
+- **Calibrated threshold** — the chosen value plus the false-hit / false-miss curve it came from and the labeled query-pair set used (and the false-miss bias rationale if applicable).
+- **Full cache key** — the exact composite key (`tenant + locale + model@version + prompt@version + context@version + user`), with a note on which fields apply to this app.
+- **TTL + invalidation + no-cache rules** — per-class TTLs, the document-version invalidation trigger for RAG entries, and the explicit no-cache predicate.
+- **Metrics** — hit rate, token/cost saved, latency delta, and the sampled hit-quality / false-hit measurement to track in production.
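A minimal in-memory sketch of the lookup design and composite key described above. It assumes embeddings arrive as plain float lists from whatever model the app already uses; `CacheKey` and `SemanticCache` are illustrative names, not an existing library, and a production version would use a vector index plus TTL and invalidation.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass(frozen=True)
class CacheKey:
    """Every input that changes the correct answer; add a user field for personalized answers."""
    tenant: str
    locale: str
    model: str
    prompt_version: str
    context_version: str


@dataclass
class SemanticCache:
    threshold: float
    entries: dict = field(default_factory=dict)  # CacheKey -> list of (vector, answer)

    def get(self, key: CacheKey, query_vec: list[float]) -> Optional[str]:
        """Return the nearest cached answer within this key's namespace, or None on a miss."""
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries.get(key, []):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, key: CacheKey, query_vec: list[float], answer: str) -> None:
        """Store a new entry after a miss."""
        self.entries.setdefault(key, []).append((query_vec, answer))
```

Because entries are stored per key, a lookup under a different tenant, locale, model, or version can never see another namespace's answers, regardless of how similar the query is.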

package/content/skills/strangler-fig-migrator.md ADDED

@@ -0,0 +1,47 @@
+---
+name: "strangler-fig-migrator"
+description: "Plan the incremental replacement of a legacy module or service using the strangler-fig pattern — grow new code around the old behind an interception seam until the old is dead, instead of a big-bang rewrite. Use when a legacy system is too risky to rewrite at once, or when migrating off a deprecated framework/dependency gradually while staying shippable and rollback-able at every step."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Replace a legacy module or service the way a strangler fig kills its host tree — by growing new code around the old until the old carries no load and can be cut away. The skill's first and most important move is to find the **interception seam**: the single place where calls can be diverted to either the old or the new implementation. Everything else (slicing, parallel-running, decommissioning) hangs off that seam. Without it, "incremental migration" silently becomes a big-bang rewrite with extra ceremony.
+## When to use this skill
+- A legacy system is load-bearing and too risky to rewrite all at once — a flag-day cutover would mean a long branch, a scary deploy, and no clean rollback.
+- You're migrating off a deprecated framework, library, or service (an ORM, an auth provider, a payments SDK, a monolith you're peeling into services) and want to move capability by capability.
+- The legacy code has no tests or unclear behavior, so the only trustworthy spec is "what it currently does" — you need to run new alongside old and compare.
+- Stakeholders need the system shippable and reversible the entire time, not dark for months behind a feature branch.
+> [!WARNING]
+> If you cannot find or build a clean interception seam, stop and reconsider. A migration where callers reach deep into legacy internals — not through one front door — cannot be routed incrementally. You will end up rewriting everything before you can flip anything, which is a big-bang rewrite wearing a strangler-fig costume. Creating the seam (a facade callers go through) is the *first deliverable*, sometimes a whole milestone of its own.
+## Instructions
+1. **Locate or create the interception seam first.** Find the single chokepoint where calls into the legacy unit can be diverted: a facade/adapter the callers already go through, a network proxy/router (reverse proxy, API gateway, service mesh route), or a feature-flag branch in code. Use `Grep`/`Glob` to map every caller of the legacy unit — if they all funnel through one interface, that's your seam; if they reach in twenty different ways, your first job is to introduce a facade they all route through *before* writing any new implementation. The seam must be able to send a call to old OR new and be flipped at runtime (config/flag), not at deploy time.
+2. **Inventory and slice the surface.** List the capabilities behind the seam (endpoints, methods, message types) with, for each, its call volume, blast radius if it breaks, and how self-contained it is (shared state, shared DB tables, downstream side effects). This is your migration backlog. Do not migrate by file or by "module size" — migrate by capability slice, because a slice is what the seam can route independently.
+3. **Carve off the smallest valuable slice first.** Pick the slice that is most self-contained and lowest-blast-radius — a read-only endpoint, an idempotent operation, an internal report — not the gnarliest core path. Implement it new behind the seam. The goal of slice one is to prove the *seam and the verification mechanism work end to end*, not to deliver the hardest functionality. Save the high-risk, high-coupling slices for after the machinery is trusted.
+4. **Run old and new in parallel and verify equivalence before shifting load.** Before routing real traffic to the new path, run it in **shadow mode**: send the live request to both, return the old result to the caller, and compare the new result off to the side (log/metric the diffs). Define equivalence concretely per slice — exact response match, match modulo known-acceptable differences (ordering, timestamps, formatting), or statistical match on key business metrics when outputs are non-deterministic. Only after the diff rate is at/under an agreed threshold over a representative window do you start serving the new path for real.
+5. **Shift traffic gradually and keep rollback one flip away.** Route a small fraction to the new implementation (a percentage, an allowlist of internal users, one tenant), watch error rate / latency / business metrics against the old baseline, and ramp only while they hold. The seam from step 1 makes the rollback trivial: if the new path misbehaves, flip the route back to legacy — no deploy, no revert. Treat every ramp as reversible; never remove the old path while it's still the fallback.
+6. **Migrate slice by slice, keeping the system shippable throughout.** Repeat steps 3–5 for the next slice. After each slice fully cuts over, the system is in a valid, releasable state with some capabilities on new and some on old — that is the point. Sequence so that you never half-migrate a slice that shares mutable state with an unmigrated one; if two slices write the same table, plan a shared-data strategy (dual-write with new as follower, or migrate the data owner first) before splitting their routing.
+7. **Decommission the legacy only once it is provably dead.** A slice's old code is a candidate for deletion only when: the seam routes 100% to new, the route has been pinned there long enough to cover the full usage cycle (including weekly/monthly/seasonal jobs and rare error paths), and instrumentation shows **zero** hits on the legacy path. Confirm deadness with evidence — access logs, a counter/log line on the old code path showing no calls, `Grep` proving no remaining static references — then remove the old implementation and the now-redundant routing in a final isolated step. Keep the seam until the very last slice is gone.
+> [!WARNING]
+> Deleting legacy code before confirming it's truly dead causes outages, not cleanup. "We migrated that months ago" is not evidence — a quarterly batch job, an admin tool, or a rare error branch can be the only remaining caller. Require positive proof of zero traffic (a metric/log over a full usage period) plus a static-reference search before any deletion. When in doubt, leave the dead branch behind the seam one more cycle; cold code is cheap, an outage is not.
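The seam, shadow mode, and gradual ramp from steps 1, 4, and 5 can be sketched as one small facade. This is an illustration under stated assumptions: `Seam`, its `mode` values, and `new_fraction` are hypothetical, equivalence is plain equality, and in a real system the mode would come from runtime config or a flag service rather than an attribute.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Seam:
    """Single front door that routes each call to the legacy or the new implementation."""
    legacy: Callable[[Any], Any]
    new: Callable[[Any], Any]
    mode: str = "legacy"        # "legacy" | "shadow" | "ramp", flipped at runtime
    new_fraction: float = 0.0   # share of traffic served by new while in "ramp" mode
    diffs: list = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)

    def call(self, request: Any) -> Any:
        if self.mode == "ramp" and self.rng.random() < self.new_fraction:
            return self.new(request)
        result = self.legacy(request)
        if self.mode == "shadow":
            # Run new off to the side; the caller always gets the legacy result.
            try:
                candidate = self.new(request)
            except Exception as exc:  # a crash in new must never reach the caller
                self.diffs.append((request, result, repr(exc)))
            else:
                if candidate != result:
                    self.diffs.append((request, result, candidate))
        return result
```

Setting `mode` back to `"legacy"` is the rollback. Shadow mode as written calls both implementations, so it is only safe for slices without side effects, which is one reason to start with a read-only or idempotent slice.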
+## Output
+1. **Interception seam design** — what the seam is (facade/adapter, proxy/router, or feature flag), where it sits relative to the callers, how it decides old-vs-new (config key / flag / percentage), and how it's flipped and rolled back at runtime. Includes the list of legacy callers found and whether they already route through one door or need a facade introduced first.
+2. **Slice-by-slice migration order** — the capability backlog as an ordered table, smallest/safest first, with the rationale for the sequence and any shared-data dependencies that force ordering:
+   | Order | Slice (capability) | Volume | Blast radius | Coupling / shared state | Why this position |
+   | --- | --- | --- | --- | --- | --- |
+   | 1 | `GET /report/summary` (read-only) | low | low | none | proves seam + verification end-to-end |
+   | 2 | `POST /events` (idempotent write) | high | medium | none | high volume, safe to shadow |
+   | 3 | `POST /orders` (core path) | high | high | shares `orders` table w/ #4 | after machinery trusted; pair with #4 |
+3. **Parallel-run verification method** — per slice: shadow-mode comparison plan, the concrete equivalence definition (exact / modulo-known-diffs / statistical), the diff threshold and observation window required before serving new, and the metrics watched during ramp (error rate, latency, business KPI vs. legacy baseline) with the ramp schedule (e.g. shadow → 1% → 10% → 50% → 100%).
+4. **Decommission criteria** — the exact gate for deleting each slice's legacy code: 100% routed to new, pinned for one full usage cycle, instrumented zero-traffic proof, and a clean static-reference search — plus the final-step plan to remove the old implementation and retire the seam once the last slice is migrated.

package/content/skills/threat-model-builder.md ADDED

@@ -0,0 +1,46 @@
+---
+name: "threat-model-builder"
+description: "Build a practical threat model for a feature or system using STRIDE — diagram the data flow, mark trust boundaries, enumerate concrete threats where data crosses them, and prioritize by likelihood × impact so security is reasoned about before shipping instead of bolted on after. Use when designing a feature that touches auth, money, or sensitive data, running a security design review, or hardening before a launch."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Most threat models fail in one of two ways: they list every conceivable attack until the team is paralyzed and ships nothing, or they skip straight to "we use HTTPS and JWTs" without ever asking where an attacker actually sits. This skill does neither. It forces a data-flow diagram with explicit **trust boundaries**, walks **STRIDE** only where data crosses those boundaries, and ranks every threat by **likelihood × impact** so you mitigate the handful that matter and consciously accept the rest. The output is a diagram, a threat table, and a signed-off list of residual risk — not a vibe.
+## When to use this skill
+- Designing a feature that handles authentication, authorization, money/payments, PII, or anything multi-tenant.
+- Running a security design review before a launch, or as a gate on a new external-facing endpoint or integration.
+- A pen-test or incident keeps surfacing the same class of bug and you need to find the whole class, not the one instance.
+- Adding a new external entity (third-party webhook, partner API, file upload from users) that data now flows to or from.
+## Instructions
+1. **Draw the system as a data-flow first — not a box diagram.** Identify four element types: **external entities** (users, partner services, the browser), **processes** (your API, workers, lambdas), **data stores** (DB, cache, queue, blob storage), and the **data flows** (arrows) between them, each labeled with what it carries (`login creds`, `session token`, `tenant_id`, `payment amount`). Express it as Mermaid `flowchart`. If you cannot name what flows on an arrow, you do not understand the system well enough to model it yet.
+2. **Mark trust boundaries — these are the whole point.** A trust boundary is any place data crosses from less-trusted to more-trusted, or between principals that should not see each other's data: internet → your API, your API → DB, unauthenticated → authenticated, tenant A → tenant B, your code → a third-party SDK, user input → a SQL/shell/template interpreter. Draw them as dashed `subgraph` borders. Number each crossing — those numbers are the only places you do STRIDE.
+3. **Walk STRIDE at each boundary crossing, and enumerate CONCRETE threats.** For each element or flow that crosses a boundary, ask all six and write down the specific attack, not the category:
+   - **S — Spoofing** (authentication): "Attacker replays a captured session cookie because tokens have no `exp`." Not "ensure authentication."
+   - **T — Tampering** (integrity): "Client sends `amount: -100` and the refund flow trusts it."
+   - **R — Repudiation** (audit): "A user disputes a transfer and there is no signed, append-only log tying the action to their identity."
+   - **I — Information disclosure** (confidentiality): "`GET /users/:id` returns any id without an ownership check — IDOR across tenants."
+   - **D — Denial of service** (availability): "The export endpoint runs an unbounded query; one request pins the DB."
+   - **E — Elevation of privilege** (authorization): "A `viewer` role can call the admin mutation because authz is checked in the UI, not the API."
+4. **Rate each threat by likelihood × impact and SORT — you cannot fix everything.** Score likelihood (how reachable/easy: High/Med/Low) and impact (blast radius if it lands: High/Med/Low) independently, then derive priority (e.g. High×High = P0, anything with a Low dimension = P3). Sort the table by priority. An unprioritized list of forty threats gets you forty half-done mitigations; a ranked list gets the top five done properly.
+5. **Propose a specific, falsifiable mitigation per high-priority threat.** "Validate input" is not a mitigation. "Reject `amount` server-side unless it matches the stored invoice total; add a contract test" is. Tie each mitigation to where it lives (which middleware, which check, which migration) so it becomes a work item, not an aspiration.
+6. **Write down the residual risk you are accepting — explicitly.** For every threat you are NOT mitigating now, record it as *accepted* (with a one-line rationale and who accepted it) or *deferred* (with the trigger that reopens it: "revisit when we add a second tenant"). Silent acceptance is how a known risk becomes a postmortem line item.
+> [!WARNING]
+> A threat model with no trust boundaries marked is just a feature list with extra steps. The boundaries are where the threats live — if you skipped step 2, the STRIDE walk in step 3 has nothing to anchor to and will produce generic mush.
+> [!NOTE]
+> The two highest-yield STRIDE letters for typical web/SaaS features are **E (authz)** and **I (info disclosure)** — IDOR and missing object-level authorization cause more real breaches than exotic crypto failures. If time is short, do those two at every boundary first.
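The likelihood × impact ranking from step 4 can be sketched as a small scoring function. Only two cells come from the text (High×High = P0, any Low dimension = P3); the P1 and P2 cells below are assumptions added to complete the matrix, and `priority` and `rank` are hypothetical names.

```python
LEVELS = {"Low": 1, "Med": 2, "High": 3}


def priority(likelihood: str, impact: str) -> str:
    """Map independent likelihood and impact ratings to a priority bucket."""
    l, i = LEVELS[likelihood], LEVELS[impact]
    if l == 1 or i == 1:
        return "P3"  # from the text: anything with a Low dimension
    if l == 3 and i == 3:
        return "P0"  # from the text: High x High
    if l == 3 or i == 3:
        return "P1"  # assumption: one High, one Med
    return "P2"      # assumption: Med x Med


def rank(threats: list[dict]) -> list[dict]:
    """Attach a priority to each threat and sort the table, most urgent first."""
    scored = [{**t, "priority": priority(t["likelihood"], t["impact"])} for t in threats]
    return sorted(scored, key=lambda t: t["priority"])
```

The sort is stable, so threats within the same bucket keep the order in which they were enumerated; adjust the matrix to the team's own risk appetite before relying on it.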
+## Output
+A self-contained threat model document with three parts:
+1. **Data-flow diagram** (Mermaid `flowchart`) with external entities, processes, stores, labeled flows, and dashed trust-boundary subgraphs with numbered crossings.
+2. **STRIDE threat table**, sorted by priority:
+   | # | Threat (concrete) | Element / flow | STRIDE | Likelihood | Impact | Priority | Mitigation (specific, where it lives) |
+   |---|-------------------|----------------|--------|-----------|--------|----------|----------------------------------------|
+   | 1 | `GET /users/:id` returns any tenant's record | API → DB read | I / E | High | High | P0 | Add ownership check in `requireOwner` middleware; contract test for cross-tenant 403 |
+3. **Accepted residual risks** — a short list of threats not being mitigated now, each tagged *accepted* (rationale + owner) or *deferred* (reopen trigger), so the decision is on the record rather than implied.

package/content/skills/token-usage-profiler.md ADDED

@@ -0,0 +1,39 @@
+---
+name: "token-usage-profiler"
+description: "Measure and attribute LLM token usage and cost across an app — input vs output tokens by feature, route, model, and tenant — then rank the waste and the specific lever to cut it. Use when LLM spend is high or climbing with no clear cause, before scaling a feature that calls a model, or when you need per-feature or per-tenant cost attribution for billing or budgets."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+An LLM bill arrives as one number, and that number tells you nothing about what to fix. The waste is almost never spread evenly — a couple of bloated prompts, one feature that streams paragraphs where a sentence would do, or a single noisy tenant usually drive most of the spend. This skill turns the total into an attributed, ranked profile: it instruments every model call to record input vs output tokens, the model, and a feature/route/tenant **tag**, breaks cost down by that tag, and hands you the dominant drivers each paired with the specific lever that cuts it.
+## When to use this skill
+- The LLM bill is high or rising and nobody can say which feature or tenant is responsible.
+- You're about to scale a model-backed feature and want to know its true per-call and aggregate cost first.
+- You need per-feature or per-tenant cost attribution for internal budgets, chargeback, or usage-based pricing.
+- A verbose feature or a stuffed context window is suspected, but you have no measurement to confirm it.
+- A cost regression slipped in — spend jumped after a deploy — and you need to localize it to a call site.
+## Instructions
+1. **Add the tag before measuring anything; attribution is impossible without it.** First find every model call site so no path is missed: `Grep` the codebase for the SDK call (e.g. `messages.create`, `chat.completions`, `generateText`). At each call site, capture the model id, input (prompt) tokens, output (completion) tokens, and a stable `tag` identifying the *feature/route* (e.g. `summarize-thread`, `support-reply`), plus `tenant`/`user` where billing matters. Pull token counts from the provider's `usage` object on the response, not a local tokenizer: the provider's counts account for system prompts, tool schemas, and cached tokens. A single untagged call site becomes an "unattributed" bucket that hides waste.
+2. **Compute cost, don't count tokens.** Map each `(model, input|output)` pair to its unit price and compute `cost = tokens × price_per_token`, keeping input and output as separate columns. Providers usually quote prices per million tokens, so convert to a per-token rate first. Sum over a representative window (e.g. 7 days, or one full traffic cycle). Raw token counts mislead because unit prices differ sharply between input and output, and between cheap and frontier models.
+3. **Break spend down by tag and sort by total cost.** Produce a table: tag × model × {input cost, output cost, calls, avg tokens/call}. Sort descending by total cost. Expect a Pareto shape — the top 2–4 tags usually own the majority of spend. Optimize those; ignore the long tail.
+4. **Separate per-call cost from volume — they need different fixes.** For each top tag, look at *both* cost-per-call and call count. An expensive call made rarely and a cheap call made a million times can carry the same total; the first is fixed by trimming the prompt/output, the second by caching, dedup, or not calling at all. Flag which axis dominates each driver.
+5. **For each driver, attack the levers in this order (cheapest win first):**
+   - **Trim bloated input.** Remove dead boilerplate from system prompts, stop stuffing whole documents/full chat history when a retrieved snippet or rolling summary suffices, and drop unused tool schemas. This is usually the largest, lowest-risk reduction.
+   - **Cap or shorten output.** Set `max_tokens` to the real need, ask for terse/structured output, and avoid "explain your reasoning" in production paths where the reasoning isn't consumed. Because output tokens cost more per token than input tokens, shaving them often beats prompt trimming on cost.
+   - **Downshift the model.** Route easy calls (classification, extraction, short rewrites) to a smaller/cheaper model and reserve the frontier model for genuinely hard ones. Gate the route on a measurable signal, not a guess, and confirm quality holds with an eval set before shipping.
+   - **Cache repeated stable prefixes.** Where a long system prompt or document prefix is reused across calls, enable prompt/KV caching so the stable part is billed at the discounted cached rate. Order the prompt so the stable prefix comes first; volatile content last.
+6. **Set per-feature budgets and alerts.** Record each top tag's current cost/call and cost/day as a baseline, then add an alert that fires when either rises above its baseline by more than a set threshold (e.g. +30%). Treat a token-usage spike like any other regression: caught at deploy, not at the invoice.
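Steps 2 to 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the record shape (`tag`, `model`, `input_tokens`, `output_tokens`), the model names, and the prices in `PRICES_PER_MTOK` are all hypothetical placeholders, not real provider rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens. Replace with your provider's rates.
PRICES_PER_MTOK = {
    "small-model": {"input": 0.50, "output": 2.00},
    "frontier-model": {"input": 5.00, "output": 20.00},
}


def call_cost(model, input_tokens, output_tokens):
    """Return (input_cost, output_cost) in USD for one call."""
    price = PRICES_PER_MTOK[model]
    return (
        input_tokens * price["input"] / 1_000_000,
        output_tokens * price["output"] / 1_000_000,
    )


def spend_breakdown(records):
    """Group usage records by (tag, model) and sort by total cost, highest first."""
    rows = defaultdict(
        lambda: {"input_cost": 0.0, "output_cost": 0.0, "calls": 0, "tokens": 0}
    )
    for r in records:
        # Untagged calls land in an explicit bucket instead of disappearing.
        tag = r.get("tag") or "unattributed"
        in_cost, out_cost = call_cost(r["model"], r["input_tokens"], r["output_tokens"])
        row = rows[(tag, r["model"])]
        row["input_cost"] += in_cost
        row["output_cost"] += out_cost
        row["calls"] += 1
        row["tokens"] += r["input_tokens"] + r["output_tokens"]

    table = []
    for (tag, model), row in rows.items():
        total = row["input_cost"] + row["output_cost"]
        table.append({
            "tag": tag,
            "model": model,
            "input_cost": row["input_cost"],
            "output_cost": row["output_cost"],
            "total_cost": total,
            "calls": row["calls"],
            "cost_per_call": total / row["calls"],
            "avg_tokens_per_call": row["tokens"] / row["calls"],
        })
    table.sort(key=lambda row: row["total_cost"], reverse=True)
    return table
```

Reading `cost_per_call` next to `calls` for the top rows shows which axis dominates each driver: a high per-call cost points at the prompt or output, a high call count at caching or deduplication.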
+> [!WARNING]
+> You cannot optimize what you can't attribute. Without per-feature/per-tenant tags, the "profile" is just a grand total — you'll guess which prompt to cut and likely guess wrong. Add the tag and re-collect before doing any optimization work.
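One way to make the tag hard to forget is to build every usage record through a single helper. The sketch below assumes the provider's `usage` object has already been converted to a plain dict. The key names follow the Anthropic Messages API (`input_tokens`, `output_tokens`) and the OpenAI Chat Completions API (`prompt_tokens`, `completion_tokens`); verify them against the SDK version you use. The `usage_record` helper and its record shape are hypothetical.

```python
def usage_record(usage, model, tag=None, tenant=None):
    """Normalize a provider usage dict into one tagged record.

    `usage` is assumed to be a plain dict taken from the API response.
    """
    if "input_tokens" in usage:
        # Anthropic-style key names
        input_tokens = usage["input_tokens"]
        output_tokens = usage["output_tokens"]
    else:
        # OpenAI Chat Completions-style key names
        input_tokens = usage["prompt_tokens"]
        output_tokens = usage["completion_tokens"]
    return {
        # A missing tag is recorded explicitly so the gap shows up in the profile.
        "tag": tag or "unattributed",
        "tenant": tenant,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
```

For example, `usage_record({"prompt_tokens": 1200, "completion_tokens": 80}, "small-model", tag="support-reply", tenant="acme")` yields a record ready to be logged or stored; the model name and tenant here are placeholders.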
+> [!NOTE]
+> Output tokens usually cost several times more per token than input tokens, so a verbose model response — not a long prompt — is frequently the real cost driver. Always inspect avg *output* tokens/call on your top tags before assuming the prompt is to blame.
+## Output
+- **Instrumentation/tagging plan** — the list of call sites found, and for each the tag (feature/route + tenant) and the input/output/model fields to record, sourced from the provider `usage` object.
+- **Spend breakdown** — a table of tag × model with separate input-cost and output-cost columns (`cost = tokens × price`), calls, and avg tokens/call, sorted by total cost, with an "unattributed" row if any call site is still untagged.
+- **Ranked waste** — the dominant drivers in order, each labeled by axis (per-call cost vs volume) and assigned its specific lever (trim context / cap output / downshift model / cache prefix) with the expected reduction.
+- **Budgets & alerts** — baseline cost/call and cost/day per top tag plus the threshold alert to add, so future regressions are caught automatically.
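The threshold alert can be sketched as a comparison of current figures against the recorded baseline. This is a minimal sketch: the `budget_alerts` function, the dict shapes, and the metric names `cost_per_call` and `cost_per_day` are assumptions for illustration, and the default threshold mirrors the +30% example above.

```python
def budget_alerts(baseline, current, threshold=0.30):
    """Return (tag, metric, relative_increase) for each metric over budget.

    `baseline` and `current` map tag -> {"cost_per_call": ..., "cost_per_day": ...}.
    """
    alerts = []
    for tag, base in baseline.items():
        now = current.get(tag)
        if now is None:
            continue  # no traffic for this tag in the current window
        for metric in ("cost_per_call", "cost_per_day"):
            if base[metric] <= 0:
                continue  # no meaningful baseline to compare against
            if now[metric] > base[metric] * (1 + threshold):
                increase = now[metric] / base[metric] - 1
                alerts.append((tag, metric, increase))
    return alerts
```

Running a check like this after each deploy, or on a daily schedule, is one way to surface a cost regression before the invoice does; how the alerts are delivered is left to whatever alerting system is already in place.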