npm - agentscamp - Versions diffs - 0.2.1 → 0.4.0 - Mend

agentscamp 0.2.1 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

package/README.md +4 -4
package/content/agents/ci-cd-engineer.md +95 -0
package/content/agents/cli-tooling-engineer.md +47 -0
package/content/agents/context-engineer.md +68 -0
package/content/agents/csharp-pro.md +73 -0
package/content/agents/database-architect.md +90 -0
package/content/agents/eval-driven-developer.md +47 -0
package/content/agents/incident-responder.md +77 -0
package/content/agents/java-pro.md +73 -0
package/content/agents/qa-automation-engineer.md +92 -0
package/content/commands/add-caching.md +79 -0
package/content/commands/audit-accessibility.md +101 -0
package/content/commands/clean-branches.md +113 -0
package/content/commands/generate-e2e-test.md +98 -0
package/content/commands/review-tests.md +98 -0
package/content/commands/scaffold-dockerfile.md +111 -0
package/content/commands/scaffold-github-action.md +94 -0
package/content/commands/seed-data.md +63 -0
package/content/commands/setup-precommit-hooks.md +72 -0
package/content/commands/write-design-doc.md +78 -0
package/content/manifest.json +436 -4
package/content/skills/architecture-diagram-generator.md +78 -0
package/content/skills/connection-pool-tuner.md +46 -0
package/content/skills/dependency-upgrade-planner.md +42 -0
package/content/skills/github-actions-optimizer.md +45 -0
package/content/skills/load-test-designer.md +87 -0
package/content/skills/memory-leak-hunter.md +35 -0
package/content/skills/pagination-designer.md +51 -0
package/content/skills/property-test-designer.md +63 -0
package/content/skills/security-headers-hardener.md +79 -0
package/content/skills/slo-definer.md +38 -0
package/content/skills/structured-logging-designer.md +42 -0
package/package.json +1 -1

package/content/skills/architecture-diagram-generator.md ADDED

@@ -0,0 +1,78 @@
+---
+name: "architecture-diagram-generator"
+description: "Generate accurate architecture diagrams as Mermaid — straight from the codebase, not from imagination — by first choosing which view answers the question (container/component, sequence, ER, or state) and then reading the real entry points, module boundaries, service calls, and schema. Use when onboarding to an unfamiliar repo, documenting a system, or visualizing one complex flow."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Most architecture diagrams lie. They were drawn once on a whiteboard, drifted as the code changed, and now mislead the next person who trusts them. This skill draws diagrams *from the repository* — by reading entry points, module boundaries, service calls, and the schema — so the picture reflects what exists today. It picks the single view that answers the question instead of one sprawling everything-diagram, and emits Mermaid, which renders natively in GitHub, GitLab, Obsidian, and most docs tooling with no image pipeline.
+## When to use this skill
+- You're onboarding to an unfamiliar repo and need a map of the services and how they call each other before you start changing anything.
+- You're documenting a system for a README, ADR, or design doc and want a diagram that won't go stale the moment someone reads the code.
+- One specific flow is hard to reason about — a checkout, an auth handshake, a webhook fan-out — and you want it laid out as a sequence over time.
+- You need the data model visible (tables, foreign keys, cardinality) or the lifecycle of a stateful entity (an order, a job, a subscription).
+## Instructions
+1. **Choose the view before drawing anything.** Pick the *one* diagram type that answers the actual question — they are not interchangeable:
+   - **Container / component (`graph` or `flowchart`)** — "what are the services/modules and who calls whom?" Use for onboarding and system overviews.
+   - **Sequence (`sequenceDiagram`)** — "how does *this one request* move through the system over time?" Use for a single flow with ordering, async, and error paths.
+   - **ER (`erDiagram`)** — "what is the data model and how are entities related?" Use when the schema is the question.
+   - **State (`stateDiagram-v2`)** — "what states can *this entity* be in and what transitions are legal?" Use for orders, jobs, payments, finite-state logic.
+   If the question spans concerns, emit two small diagrams, not one fused diagram.
+2. **Find the real boundaries — read, don't assume.** Locate evidence before drawing a single node:
+   - Entry points: `Glob` for `main.*`, `app.*`, `server.*`, `index.*`, route files, `Procfile`, `docker-compose.yml`, `*.tf`, k8s manifests, `package.json`/`pyproject.toml` workspaces.
+   - Service-to-service edges: `Grep` for HTTP clients (`fetch`, `axios`, `requests`, `httpx`), queue/topic names, gRPC stubs, and env vars like `*_URL`/`*_HOST` that name a dependency.
+   - Data stores: connection strings, ORM models, migration files, `*.sql`, `schema.prisma`.
+   A node or edge goes in the diagram only if you found it in the code or config — never because the architecture "should" have it.
+3. **Build the chosen diagram from that evidence.**
+   - *Container/component:* one node per deployable/service/module; directed edges labeled with the real protocol or call (`-->|REST|`, `-->|publishes order.created|`). Group with `subgraph` by boundary (per process, per network zone). Mark external systems (Stripe, S3, a third-party API) distinctly so the trust boundary is obvious.
+   - *Sequence:* one participant per real actor/service; arrows in call order (`->>` sync request, `-->>` response, `-)` async/fire-and-forget); use `alt`/`opt` for the error and conditional branches you found, not idealized happy-path only.
+   - *ER:* `erDiagram` with real table/entity names, key attributes (mark `PK`/`FK`), and correct crow's-foot cardinality (`||--o{`) read from the foreign keys, not guessed.
+   - *State:* `stateDiagram-v2` with `[*]` start/end, named transitions, and only the states the code actually models.
+4. **Cut everything that doesn't serve the diagram's purpose.** A container view does not need every helper class; a sequence diagram does not need every logging call. Aim for a diagram a reader can absorb in one screen. If a container view exceeds ~12 nodes, split it: one high-level map plus a zoom-in on the busy subgraph.
+5. **Validate the Mermaid.** Check that the first line declares the diagram type, every node referenced in an edge is defined, labels with special characters are quoted (`["Auth Service (OIDC)"]`), and the block is fenced as ` ```mermaid `. Broken Mermaid renders as a red error box in GitHub — worse than no diagram.
+6. **Write and caption.** Emit the diagram(s) into the requested doc (or return inline), and follow each with one line stating what it *does* and *does not* show (e.g. "Shows synchronous request flow for checkout; does not show the async receipt-email worker or retry behavior").
+> [!WARNING]
+> A stale or wrong diagram is worse than none — readers trust a picture more than prose and will design against a lie. Draw only edges and nodes you found in the code, and date or version-anchor the diagram so the next reader knows when it was true.
+> [!NOTE]
+> Resist the everything-diagram. A single chart that crams services, data model, and request flow into one canvas communicates nothing — no reader can hold it. Each diagram answers exactly one question; if you have two questions, draw two diagrams.
+## Output
+For each request, the skill returns:
+1. **The chosen view + rationale** — e.g. "Sequence diagram, because the question is about ordering across services in one flow, not the static topology."
+2. **Paste-ready Mermaid** in a fenced ` ```mermaid ` block, built from real entry points and calls.
+3. **A scope caption** — one line on what the diagram does and does not show.
+Example — a container view of a small web app, traced from `docker-compose.yml` (web, api, worker, redis, postgres) and the API's HTTP client to Stripe:
+```mermaid
+flowchart LR
+  user(["Browser"])
+  subgraph app["app network"]
+    web["Web (Next.js)"]
+    api["API (Node)"]
+    worker["Worker"]
+    redis[("Redis<br/>queue + cache")]
+    db[("Postgres")]
+  end
+  stripe["Stripe API"]:::ext
+  user -->|HTTPS| web
+  web -->|REST /api| api
+  api -->|SQL| db
+  api -->|"enqueue charge"| redis
+  worker -->|"dequeue"| redis
+  worker -->|"create charge"| stripe
+  worker -->|SQL| db
+  classDef ext fill:#fde68a,stroke:#b45309;
+```
+*Shows the deployed services and their call/data edges as wired in `docker-compose.yml` and the API client. Does not show request timing/order (use a sequence diagram) or the table schema (use an ER diagram).*
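
Not part of the package: a minimal Python sketch of two of the checks from step 5 of this skill (the first line declares a diagram type, and every node referenced in an edge is defined). The function name `check_flowchart` and both regular expressions are assumptions made for illustration. It only understands plain `-->` edges in a flowchart, so other arrow styles and other diagram types would need more work.

```python
import re

# Diagram type keywords named in the skill text.
DIAGRAM_TYPES = ("flowchart", "graph", "sequenceDiagram", "erDiagram", "stateDiagram-v2")

# A node (or subgraph) definition: an id followed by an opening shape bracket,
# e.g. api["API"], db[("Postgres")], subgraph app["app network"].
NODE_DEF = re.compile(r'^\s*(?:subgraph\s+)?([A-Za-z_]\w*)\s*[\[\(\{]')
# A plain edge, with or without a label: a --> b  or  a -->|label| b.
EDGE = re.compile(r'^\s*([A-Za-z_]\w*)\s*-->\s*(?:\|[^|]*\|\s*)?([A-Za-z_]\w*)')


def check_flowchart(source: str) -> list[str]:
    """Return a list of problems found in a Mermaid flowchart (empty if none)."""
    lines = [ln for ln in source.strip().splitlines() if ln.strip()]
    problems = []
    if not lines or not lines[0].strip().startswith(DIAGRAM_TYPES):
        problems.append("first line does not declare a diagram type")
    defined, referenced = set(), set()
    for ln in lines[1:]:
        edge = EDGE.match(ln)
        if edge:
            referenced.update(edge.groups())
            continue
        node = NODE_DEF.match(ln)
        if node:
            defined.add(node.group(1))
    for name in sorted(referenced - defined):
        problems.append(f"edge references undefined node: {name}")
    return problems


GOOD = """
flowchart LR
  web["Web"]
  api["API"]
  web -->|REST| api
"""

BAD = """
flowchart LR
  web["Web"]
  web -->|REST| api
"""

print(check_flowchart(GOOD))
print(check_flowchart(BAD))
```

Mermaid itself creates a node implicitly when an edge mentions an unknown id, so the "undefined node" check enforces the skill's stricter convention rather than a parser rule.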

package/content/skills/connection-pool-tuner.md ADDED

@@ -0,0 +1,46 @@
+---
+name: "connection-pool-tuner"
+description: "Size and tune a database connection pool from the real constraint — the database's shared max_connections and its core count — so total connections (per-instance pool × instance count) stay safely under the cap and a too-large pool stops adding latency. Use when the app throws 'too many connections' or pool-acquire timeouts, when the DB is saturated by connection count, or when deploying to serverless."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Connection pools fail in two opposite ways, and a "nice round number" like 100 walks into both. Too large and every app instance's pool sums past the database's shared `max_connections`, so the next deploy or traffic spike exhausts the server and *every* instance starts throwing. Naively large and the pool is bigger than the DB has cores to serve it, so the extra connections don't add parallelism — they queue inside the database and add latency. This skill sizes the **per-instance** pool from concurrency need and core count, does the `instances × pool ≤ max_connections` arithmetic with real headroom, sets the timeouts that recycle dead connections, and sends serverless through a pooler instead of multiplying pools.
+## When to use this skill
+- The app logs `FATAL: too many connections` / `remaining connection slots are reserved`, or pool-acquire timeouts ("timed out fetching a connection from the pool").
+- The database is saturated by connection *count* (high `pg_stat_activity` rows, memory pressure from per-connection backends) rather than by slow queries.
+- You scaled out app instances or autoscaling kicked in, and the DB started erroring even though per-instance load looks fine.
+- You're deploying to serverless / many short-lived instances (Lambda, Vercel functions, Cloud Run) and need a connection strategy.
+- Standing up a new service and picking a pool size before it hits production.
+## Instructions
+1. **Find the real ceiling first.** Read the database's `max_connections` (Postgres `SHOW max_connections`, MySQL `max_connections`) — this is shared across *everything*: every app instance, background workers, migrations, replicas, admin/`psql` sessions, and the monitoring agent. Postgres also reserves `superuser_reserved_connections`. Treat the usable budget as roughly `max_connections − reserved − headroom`, not the raw number.
+2. **Count every connection source, not just the web app.** Total connections = (per-instance pool × app instance count) + worker/cron pools + replicas + migration tooling + a margin for admin sessions and a deploy overlap (old and new instances live simultaneously during rolling deploys — pools effectively double for that window). Enumerate each source by grepping for pool config (`max`, `pool_size`, `maximumPoolSize`, `DATABASE_URL`, `?connection_limit=`).
+3. **Size the per-instance pool from concurrency, capped by cores — not by a big round number.** A connection only does work when the DB has a free core to run its query. The starting heuristic for a CPU-bound OLTP workload is near the DB's core count *for the whole fleet*, so per-instance pool ≈ `(useful_DB_concurrency) / instance_count`, often a small single-digit number. Going higher doesn't buy parallelism — it buys a queue. For I/O-bound queries (lots of waiting) you can go somewhat above core count, but measure rather than assume.
+4. **Do the exhaustion arithmetic explicitly and leave headroom.** Compute `instances × pool + other_sources` and confirm it stays under the usable budget *at max autoscale*, not at average instance count. Size against the ceiling the autoscaler can reach, then keep ~20–30% of `max_connections` free for migrations, admin, replication, and deploy overlap. If the math doesn't fit, shrink the pool before raising `max_connections` (each Postgres backend costs real memory).
+5. **Set the four timeouts deliberately — defaults leak or stall.**
+   - **Acquire / pool timeout** — how long a request waits for a free connection before failing fast (e.g. a few seconds). Without it, a saturated pool turns into unbounded queueing and looks like a hang.
+   - **Idle timeout** — return idle connections so the pool shrinks under low load and you're not holding slots the DB could give elsewhere.
+   - **Max lifetime** — recycle each connection after a bounded age (e.g. 30 min) so a load balancer / DNS failover / DB restart doesn't leave stale half-dead connections in the pool.
+   - **Min / idle floor** — keep a small warm minimum to avoid connect latency on the first request, but not so high that idle instances hoard the budget.
+6. **Handle serverless and many-instances specially — route through a pooler.** When instance count is large or unbounded (one pool per function invocation), per-instance pools multiply faster than any safe per-instance number can absorb. Don't fix it by shrinking the per-function pool to 1 alone — put a pooler between the app and the DB: PgBouncer in **transaction** mode, RDS Proxy, Supabase's pooler, or a provider serverless/HTTP driver. The pooler multiplexes hundreds of client connections onto a small set of real DB connections; keep the per-function pool at 1–2 behind it.
+> [!WARNING]
+> Scaling out app instances silently multiplies total connections. A pool of 20 that's fine on 3 instances (60) exhausts a 100-connection DB the moment the autoscaler reaches 5 instances — and it fails *everywhere at once*, not gracefully. Always size against **max autoscale × pool**, plus the deploy-overlap doubling, never average instance count.
+> [!WARNING]
+> A bigger pool is frequently *slower*, not faster. Past the DB's effective core count, added connections don't run in parallel — they queue inside the database and add context-switching overhead, raising p99 latency while throughput stays flat. If the pool is large and the DB is CPU-bound, the fix for latency is usually to *shrink* the pool.
+> [!NOTE]
+> Transaction-mode poolers (PgBouncer) break features that hold state across statements on one connection: session-level `SET`, advisory locks, `LISTEN/NOTIFY`, and some prepared-statement modes. Use session mode (or a dedicated direct connection) for those paths, and run migrations against the DB directly, not through the transaction pooler.
+## Output
+A pool-sizing recommendation, concretely:
+- **The math** — usable budget (`max_connections − reserved − headroom`), and `instances_at_max_autoscale × per_instance_pool + other_sources` shown to land under it with the headroom stated.
+- **Recommended per-instance pool size** with the rationale (concurrency need vs. DB core count, and which workload type it is), plus separate sizes for worker/cron pools.
+- **Timeout/lifetime settings** — acquire timeout, idle timeout, max lifetime, and min/idle floor, with the value and why each is set.
+- **Serverless recommendation if applicable** — the specific pooler (PgBouncer transaction mode / RDS Proxy / serverless driver), the per-function pool size behind it, and any session-mode caveats for stateful paths.
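
Not part of the package: a small Python sketch of the exhaustion arithmetic this skill describes (usable budget = `max_connections − reserved − headroom`, checked against `instances × pool + other_sources` at max autoscale). The names `PoolBudget`, `total_connections`, and `largest_safe_pool` are hypothetical, and the numbers are illustrative only. Doubling the app pools for deploy overlap follows the text's rolling-deploy remark and is treated here as an assumption you can switch off.

```python
from dataclasses import dataclass


@dataclass
class PoolBudget:
    max_connections: int   # the database's shared cap
    reserved: int          # e.g. superuser_reserved_connections
    headroom_percent: int  # share of max_connections kept free (the text suggests ~20-30)

    @property
    def usable(self) -> int:
        headroom = self.max_connections * self.headroom_percent // 100
        return self.max_connections - self.reserved - headroom


def total_connections(max_instances: int, per_instance_pool: int,
                      other_sources: int, deploy_overlap: bool = True) -> int:
    """Connections at max autoscale; app pools double during a rolling deploy."""
    app = max_instances * per_instance_pool
    if deploy_overlap:
        app *= 2
    return app + other_sources


def largest_safe_pool(budget: PoolBudget, max_instances: int,
                      other_sources: int, deploy_overlap: bool = True) -> int:
    """Largest per-instance pool that keeps the fleet inside the usable budget."""
    multiplier = max_instances * (2 if deploy_overlap else 1)
    return max(0, (budget.usable - other_sources) // multiplier)


budget = PoolBudget(max_connections=100, reserved=3, headroom_percent=20)
print(budget.usable)                                             # usable budget
print(total_connections(5, 20, 10, deploy_overlap=False))        # pool of 20 on 5 instances
print(largest_safe_pool(budget, max_instances=5, other_sources=10))
```

With these illustrative inputs, a pool of 20 on five instances overshoots the usable budget even before deploy overlap, which matches the text's point that the safe per-instance number is often a small single-digit value.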

package/content/skills/dependency-upgrade-planner.md ADDED

@@ -0,0 +1,42 @@
+---
+name: "dependency-upgrade-planner"
+description: "Plan and de-risk a major dependency, framework, or runtime upgrade — map the full version path, read every intermediate migration guide, and pin the breaking changes to your actual call sites instead of bumping the number and hoping. Use when a key dependency is several majors behind, when a security advisory forces an upgrade, or before a framework migration."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Turn "bump the version and hope" into a sequenced, evidence-backed upgrade plan. The skill establishes the exact current → target version gap, reads the CHANGELOG and migration guide for **every** major in between, then greps the codebase for the dependency's imported symbols so the breaking-change list is narrowed to the call sites that actually exist here. It checks the target's peer-dependency and runtime requirements, orders the work (codemods first, one major at a time for big jumps, behind tests), and writes down a rollback before anything is touched.
+## When to use this skill
+- A key dependency, framework, or runtime is several majors behind and you need a path forward, not a single `npm install pkg@latest`.
+- A security advisory (CVE, `npm audit`, Dependabot) forces an upgrade and you need to know the blast radius before merging.
+- You are scoping a framework or runtime migration (React, Next.js, Django, Rails, Node, Python) and want to know what breaks before committing the sprint.
+> [!WARNING]
+> Jumping several majors in one `install` hides which version broke what. Breaking changes compound: v3's removal of an API plus v4's renamed option plus v5's changed default land as one undebuggable wall of errors. For a gap of two or more majors, upgrade **one major at a time**, landing each behind a green build/test run, so every failure maps to exactly one version's changes.
+## Instructions
+1. **Pin the exact current and target versions.** Read the lockfile (`package-lock.json`/`pnpm-lock.yaml`/`yarn.lock`, `poetry.lock`, `go.sum`, `Cargo.lock`) for the version actually installed — not the loose range in the manifest, which lies about what resolved. Confirm the target: `npm view <pkg> versions --json`, `pip index versions <pkg>`, `go list -m -versions <mod>`, or the registry page. Record the full hop list, e.g. `4.2.1 → 5.x → 6.x → 7.0.3`.
+2. **Read the migration guide for every major in between — don't skip the intermediate notes.** A jump from v4 to v7 means reading the v5, v6, **and** v7 breaking-change sections, not just v7's. Pull the CHANGELOG / UPGRADING / migration doc (`gh release view`, the repo's `CHANGELOG.md`, the docs site) and extract every entry under "Breaking", "Removed", "Renamed", "Default changed", and "Deprecated → removed".
+3. **Inventory your actual usage so you only care about breaks that hit you.** Grep the codebase for the dependency's imported symbols and entry points — `grep -rIn "from 'pkg'" `, `grep -rIn "require('pkg')"`, `import pkg`, the specific class/function/option names called out in the breaking-change list. A breaking change to an API you never call is noise; a one-line default change to a function on 40 call sites is the real work. Map each relevant breaking change to its call sites.
+4. **Check transitive/peer-dep and runtime requirements of the target.** The target may demand a newer peer (`react@>=19`, a `@types/*` bump) or a higher minimum runtime (Node, Python, Go, the language edition). Run `npm info <pkg>@<target> peerDependencies engines` (or read `requires-python` / `go.mod` `go` directive / `rust-version`). Cross-check against your other dependencies' peer ranges and your CI/Dockerfile/`.nvmrc`/`engines` runtime — a conflict here blocks the install before any code change.
+5. **Sequence the work: codemods → one major at a time → behind tests.** Run the official codemod first if one exists (`npx <pkg>-codemod`, `npx @next/codemod`, framework migration CLIs) — they do the mechanical renames so you review semantics, not churn. For multi-major gaps, do one major per commit/PR; for each step, apply the codemod, hand-fix the mapped call sites, then run the **real** build and test commands as a checkpoint before the next hop.
+6. **Write the rollback before touching anything.** Commit the current lockfile, branch the work, and record the revert: restore the pinned versions in the manifest **and** the lockfile (a manifest-only revert re-resolves to something new), then reinstall from the lockfile (`npm ci`, `pnpm install --frozen-lockfile`, `poetry install`). For a forced security upgrade with no safe target yet, note the interim mitigation (override/resolution pin, patch backport) as the fallback.
+> [!WARNING]
+> Peer-dependency conflicts and a bumped minimum runtime are the upgrades that silently break the build — not the API renames you can see in a diff. `npm install` may resolve a peer with a warning (or fail under strict/`pnpm`), and a target that requires Node 22 will install fine locally then explode in CI on Node 20. Verify both **before** writing code, in step 4.
+> [!NOTE]
+> Land the upgrade on its own branch with one commit per major hop and the codemod output as a separate commit from your hand-fixes. If a regression only shows up in CI or staging, granular history makes `git revert` of a single version trivial instead of unpicking a tangled bump.
+## Output
+A concrete upgrade plan, reproducible from the evidence gathered:
+- **Version path** — the exact hop list from the lockfile to the target (`4.2.1 → 5.18.0 → 6.4.2 → 7.0.3`), one line per major.
+- **Breaking changes that affect THIS codebase** — a table of `change → version → call sites`, with the file:line locations grep found; changes that touch no call site are explicitly listed as not-applicable so the reader trusts the filter.
+- **Peer-dep & runtime gate** — required peer ranges and minimum runtime of the target vs. what the repo and CI currently pin, with conflicts flagged as blockers.
+- **Steps in order** — codemod commands first, then per-major manual fixes, each with its test/build checkpoint command.
+- **Rollback plan** — the exact manifest + lockfile revert and reinstall command, plus any interim mitigation for a forced upgrade.
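
Not part of the package: a rough Python sketch of how the "version path" output could be derived from a list of published versions, such as the one `npm view <pkg> versions --json` returns. The function name `hop_list` is hypothetical. Landing on the latest release of each intermediate major is an assumption inferred from the example path in the text. Only plain `X.Y.Z` versions are considered, and pre-releases are skipped.

```python
import re

PLAIN = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse(version: str):
    """Parse 'X.Y.Z' into a tuple of ints; return None for anything else."""
    m = PLAIN.match(version)
    return tuple(int(x) for x in m.groups()) if m else None


def hop_list(current: str, target: str, available: list[str]) -> list[str]:
    """One hop per major: current, latest of each intermediate major, then target."""
    cur, tgt = parse(current), parse(target)
    if cur is None or tgt is None or tgt <= cur:
        raise ValueError("expected plain X.Y.Z versions with target > current")
    latest = {}  # major -> highest version seen for that major
    for v in available:
        p = parse(v)
        if p is None:
            continue  # skip pre-releases and other non-plain versions
        if cur[0] < p[0] < tgt[0] and p > latest.get(p[0], (p[0], -1, -1)):
            latest[p[0]] = p
    hops = [cur] + [latest[major] for major in sorted(latest)] + [tgt]
    return [".".join(map(str, h)) for h in hops]


# Illustrative version list, not taken from any real package.
versions = ["4.2.1", "4.3.0", "5.0.0", "5.18.0", "6.0.0", "6.4.2",
            "7.0.0-rc.1", "7.0.0", "7.0.3", "7.1.0"]
print(" → ".join(hop_list("4.2.1", "7.0.3", versions)))
```

The output is only the skeleton of the plan; the breaking-change mapping and the peer-dependency gate still have to be read from each major's migration notes.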

package/content/skills/github-actions-optimizer.md ADDED

@@ -0,0 +1,45 @@
+---
+name: "github-actions-optimizer"
+description: "Make a GitHub Actions workflow faster, cheaper, and harder to attack — by profiling where wall-clock and billed minutes actually go, then adding content-keyed caching, matrix/job parallelism, run-cancellation, and path filters, and hardening the supply chain (SHA-pinned actions, least-privilege GITHUB_TOKEN, safe fork-PR handling). Use when CI is slow or queues, when a repo burns Actions minutes, or before trusting a workflow that runs on untrusted pull requests."
+allowed-tools: "Read, Grep, Glob, Edit, Bash"
+version: 1.0.0
+---
+A workflow that takes 22 minutes and costs you a fortune in minutes is rarely slow for one reason — it's usually re-downloading dependencies every run, running serially what could run in parallel, and building branches no one is waiting on. And the same file is often a supply-chain liability: a third-party action pinned to `@v3` can be repointed under you, and a `write-all` token plus `pull_request_target` is one malicious fork PR away from leaking secrets. This skill measures before it touches anything, then ships fixes ordered by payoff — biggest time or security win first — as concrete YAML diffs.
+## When to use this skill
+- CI wall-clock is the bottleneck on every PR, runs queue behind each other, or the monthly Actions bill is climbing.
+- A job re-installs the whole dependency tree or rebuilds from scratch on every run, with no cache or a cache that never hits.
+- The workflow runs on `pull_request` / `pull_request_target` from forks and you haven't audited what secrets and permissions are exposed.
+- You inherited a workflow that pins actions to floating tags (`@v4`, `@main`) and grants the default broad `GITHUB_TOKEN`.
+## When NOT to use this skill
+- The slowness is in your test suite itself (flaky retries, an N+1 in integration tests) rather than the CI plumbing — fix the tests; faster runners won't save a 9-minute test that should take 90 seconds.
+- You need a workflow authored from scratch for a new stack — that's scaffolding work; this skill optimizes and hardens an *existing* `.github/workflows/*.yml`.
+## Instructions
+1. **Inventory the workflows before changing one.** Glob `.github/workflows/*.{yml,yaml}` and read each. For every workflow note its triggers (`on:`), its jobs and their `needs:` graph, the runner labels (`ubuntu-latest` vs a larger/self-hosted runner — larger runners bill at a multiple), and the matrix dimensions. This is the map; you optimize against it, not against guesses.
+2. **Profile where time actually goes — don't optimize from intuition.** Pull recent run timings with the CLI: `gh run list --workflow <file> -L 20 --json databaseId,conclusion,createdAt,updatedAt` for wall-clock per run, then `gh run view <id> --json jobs` to get per-job durations. The serial critical path is `needs:`-chained job durations summed; a 4-minute lint that gates a 12-minute test set adds 4 minutes to *everyone*. Rank jobs and steps by total billed minutes (duration × runs/day × runner multiplier). Fix the top one first.
+3. **Add caching with a content-based key — or don't bother.** Cache the package manager's store, not `node_modules`/`.venv` (restoring a half-built tree is worse than a clean install). Key on a hash of the lockfile so the cache invalidates exactly when deps change: `key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json') }}` with a `restore-keys: ${{ runner.os }}-deps-` prefix fallback for partial hits. For language setup actions (`actions/setup-node`, `setup-python`, `setup-go`), prefer their built-in `cache:` input — it keys on the lockfile for you and handles the path. Confirm a hit after: the run log prints `Cache restored from key` (or `Cache not found`). A cache that never hits is pure overhead — it uploads on every run and restores nothing.
+4. **Parallelize the critical path.** Convert serial variants (Node 18/20/22, OS targets, test shards) into a `strategy.matrix` so they run concurrently instead of in sequence. Split a single monster test job into shards with `matrix` + a test-splitting flag (`--shard ${{ matrix.shard }}/${{ matrix.total }}`). Drop unnecessary `needs:` edges — only gate a job on what it truly consumes; lint and unit tests rarely need to wait on each other. Set `fail-fast: false` only when you want all matrix legs to report; leave it `true` (default) to abort the matrix the moment one leg fails and stop burning minutes.
+5. **Cancel superseded runs with `concurrency`.** Add a top-level `concurrency` group keyed on the ref so a new push cancels the in-flight run for that branch instead of running both: `concurrency: { group: ${{ github.workflow }}-${{ github.ref }}, cancel-in-progress: true }`. This alone can halve minutes on an active branch. Do NOT set `cancel-in-progress: true` on deploy/release workflows — cancelling a half-finished deploy mid-flight can leave the environment in a broken state.
+6. **Skip work that can't be affected.** Add `paths:` / `paths-ignore:` filters so a docs-only change doesn't trigger the full build matrix, and `branches:` filters so feature pushes don't run release jobs. For required status checks, use a path filter plus a tiny "always-green" companion job (or `paths-filter` action with a downstream `if:`) so the required check still reports success on skipped paths — a hard `paths:` skip leaves a required check pending forever and blocks merges.
+7. **Pin third-party actions to a full commit SHA.** Replace every `uses: owner/action@v4` (and especially `@main`) for *third-party* actions with the full 40-char commit SHA, keeping the version in a trailing comment: `uses: owner/action@a1b2c3...def # v4.1.2`. A floating tag is mutable — the owner (or an attacker who compromises them) can repoint it at code that exfiltrates your secrets, and your pinned-to-tag workflow will silently run it. First-party `actions/*` are lower risk but pinning them too is the consistent posture. Use `gh api repos/<owner>/<repo>/git/ref/tags/<tag>` to resolve a tag to its SHA.
+8. **Set least-privilege `permissions` on `GITHUB_TOKEN`.** Add a top-level `permissions: { contents: read }` to default everything to read, then grant exactly what each job needs at the job level (`packages: write` to publish, `pull-requests: write` to comment, `id-token: write` for OIDC). The repo default is often `read-write` on everything; a token that can push to `contents` is a token a compromised dependency can use to push to your branches.
+9. **Quarantine secrets from untrusted fork PRs.** Understand the split: `pull_request` from a fork runs with a read-only token and *no* repo secrets — safe but limited. `pull_request_target` runs in the context of the base repo *with* secrets and a writable token, while checking out the fork's code — this is the dangerous one. Never `checkout` and then build/run a fork's code under `pull_request_target`; that hands an attacker your secrets via a malicious build script or workflow. If you need a label-gated privileged step, split it into a separate `workflow_run`/manually-approved job that operates only on trusted artifacts, never on raw fork code.
+> [!WARNING]
+> An unkeyed or over-broad cache rots silently. If the key isn't tied to the lockfile, the cache never invalidates — CI keeps restoring stale dependencies, masking lockfile changes and producing "works in CI, broken locally" drift. If the key is too unique (includes `github.sha`), it never hits and you pay the upload cost every run for nothing. Verify "Cache restored from key" appears in real run logs before calling caching done.
+> [!CAUTION]
+> A third-party action pinned to a moving tag (`@v4`, `@main`) is remote code you don't control, running with your token and secrets. Tag mutation is the documented supply-chain attack (see the `tj-actions/changed-files` incident). Pin to a full commit SHA, and review the diff before bumping the SHA — never auto-update action SHAs without reading what changed.
+> [!CAUTION]
+> Secrets must never reach untrusted fork code. `pull_request_target` + checking out the PR head + running its scripts = secret exfiltration. Default to `pull_request` for fork CI, keep secrets out of those runs, and gate any privileged automation behind manual approval or a separate trusted workflow.
+## Output
+A prioritized remediation plan ordered by payoff — each item tagged TIME or SECURITY, with the measured cost it addresses (e.g. "SECURITY: 3 actions on floating tags"; "TIME: deps re-installed every run, ~90s × 40 runs/day") — followed by the concrete YAML diffs to apply, smallest-blast-radius wins first. Each diff is a minimal, reviewable change to a specific workflow file (added `concurrency` block, a cache step with its key, a matrix rewrite, SHA pins with version comments, a `permissions` block). The skill proposes edits via Edit and uses Bash only for read-only `gh`/`git` profiling and tag-to-SHA resolution; it does not push, re-run, or alter pipeline behavior beyond the diffs you approve.
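
Not part of the package: a minimal Python sketch of the audit behind step 7, listing every `uses:` reference that is not pinned to a full 40-character commit SHA. The function name `floating_refs`, the sample workflow, and the `owner/...` action names and SHA are all hypothetical placeholders. The regular expression is a line-based approximation that assumes unquoted `uses:` values. It is not a YAML parser.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES = re.compile(r"^\s*-?\s*uses:\s*([^\s#@]+)@([^\s#]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def floating_refs(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, action, ref) for every ref that is not a full SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        m = USES.match(line)
        if not m:
            continue
        action, ref = m.groups()
        if not FULL_SHA.match(ref):
            findings.append((lineno, action, ref))
    return findings


# Hypothetical workflow fragment; the SHA below is a made-up placeholder.
SAMPLE = """\
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: owner/action@0123456789abcdef0123456789abcdef01234567 # v4.1.2
      - name: Lint
        uses: owner/linter@main
"""

for lineno, action, ref in floating_refs(SAMPLE):
    print(f"line {lineno}: {action}@{ref} is not pinned to a commit SHA")
```

A finding here only says the ref is mutable; resolving each tag to its SHA and reviewing the diff before pinning remain manual steps, as the skill text describes.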

package/content/skills/load-test-designer.md ADDED

@@ -0,0 +1,87 @@
+---
+name: "load-test-designer"
+description: "Design a defensible load test — a realistic workload model, a deliberate test type, and SLO-tied pass/fail thresholds — instead of a meaningless tight-loop script that hammers one endpoint. Use when validating capacity or SLOs before a launch or scaling event, when sizing infrastructure, or when an existing load test reports averages that nobody trusts."
+allowed-tools: "Read, Grep, Glob, Write"
+version: 1.0.0
+---
+Most "load tests" hammer a single endpoint in a tight loop with no think-time, run from one laptop, and report an average response time that makes everyone feel good and predicts nothing. This skill designs a load test you can actually defend in a launch review. It builds a workload model from the real traffic mix, picks the test type that answers your actual question (Will we survive peak? Where do we break? Do we leak under sustained load? Can we absorb a surge?), writes thresholds tied to your SLOs *before* the run so the test has a pass/fail answer, and produces a runnable script plus a guide to reading the results by percentile and saturation point.
+## When to use this skill
+- You have a launch, marketing event, sale, or migration coming and need numbers to prove the system survives expected peak.
+- You need to size infrastructure (instance count, DB connection pool, autoscaling thresholds) and want evidence, not a guess.
+- You want to find the breaking point — the concurrency or RPS at which latency or error rate falls off a cliff — before users do.
+- An existing load test reports a single average latency and nobody believes it represents real traffic.
+- You suspect a slow leak (memory, connections, file handles) that only appears after the system runs hot for an hour.
+## Instructions
+1. **Build a workload model from real traffic, not a single URL.** A load test that loops on `GET /health` measures your load balancer, not your system. Derive the endpoint mix from production access logs, APM, or analytics: which routes, in what proportion, with which payloads. Capture the *journey* (e.g. browse 60%, search 25%, add-to-cart 10%, checkout 5%) because checkout hits the DB and payment provider while browse hits a cache — they are not interchangeable load. Write the mix down as weighted scenarios with a representative, **distinct** data set (rotating user IDs, search terms, cart contents) so you exercise cache misses and row contention instead of the one hot row that gets cached after the first request.
+2. **Add think-time between actions.** Real users pause to read, type, and decide. A closed-loop test with zero think-time generates a firehose no human population produces and tells you about your queueing behavior at an impossible arrival rate. Insert randomized think-time (e.g. 1–5s) between steps in a journey, and prefer an **open model** (specify arrival rate — new users per second) over a **closed model** (fixed VU count) when you are modeling a real-world population, because closed models artificially throttle load as the system slows.
+3. **Pick the test type deliberately — it determines the shape, not just the size.** Choose one question per test:
+   - **Load test** — sustain *expected peak* (e.g. Black Friday 1.5×) for 15–30 min. Answers "do we meet SLOs at peak?"
+   - **Stress test** — ramp past peak until something breaks. Answers "where is the cliff, and how does it fail — graceful 503s or a cascading meltdown?"
+   - **Soak test** — hold a moderate, realistic load for hours. Answers "do we leak memory/connections/handles, and does latency drift upward over time?"
+   - **Spike test** — jump from baseline to a large surge in seconds, then drop. Answers "can autoscaling and queues absorb a sudden surge, and do we recover cleanly?"
+4. **Choose the tool to match the model.** Use **k6** (JS scenarios, first-class thresholds, scriptable open/closed models) as the default; **Locust** (Python, good for complex stateful user flows); **Gatling** (Scala/JVM, strong reporting, high single-node throughput). Match the tool to the team's language and to whether you need a closed VU model or an open arrival-rate model — k6 `scenarios` with `ramping-arrival-rate` is the cleanest open model.
+5. **Set pass/fail thresholds tied to actual SLOs — before you run.** A test with no threshold is a demo, not a test. Translate each SLO into a machine-checkable pass condition and encode it so the tool exits non-zero on breach (k6 `thresholds`, Gatling `assertions`). Example bar: `http_req_duration: p(95)<300 AND p(99)<800`, `http_req_failed: rate<0.001` (0.1% errors), and per-scenario thresholds for the expensive journey (checkout p95 < 1s). Define these from the SLO doc, not from whatever the first run happened to produce.
+6. **Run against a prod-like, isolated environment from enough generators.** The environment must match production in the dimensions that saturate: instance size/count, DB tier and connection limits, cache size, and rate limits. Isolate it so you are not loading a shared staging DB that other teams use. Generate load from multiple machines (or a distributed runner / k6 Cloud / a fleet of generator nodes) and **monitor the generators' own CPU, network, and open sockets** — if a generator saturates, you measured the generator, not the target. Capture server-side metrics in parallel (CPU, memory, DB connections, queue depth, GC) so you can locate the bottleneck, not just observe that latency rose.
+7. **Interpret by percentiles and the saturation point, not the average.** Read p95/p99 (and the max), error rate, and throughput together. The headline result is the **knee**: the load level where latency percentiles start climbing super-linearly and/or error rate crosses the threshold — that is your real capacity, and anything below it with headroom is the number you size to. Correlate the knee with a server-side resource hitting its limit (CPU pegged, connection pool exhausted, GC thrashing) to name the actual bottleneck.
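The weighted journey mix and randomized think-time from steps 1 and 2 can be sketched independently of any load tool. This is an illustrative model only: the weights reuse the example mix from step 1, while the think-time ranges, the `SCENARIOS` layout, and the `next_action` helper are assumptions, and real values would come from access logs or APM.

```python
import random

# Weights mirror the example mix (browse 60%, search 25%, add-to-cart 10%,
# checkout 5%). Think-time ranges in seconds are assumed for illustration.
SCENARIOS = {
    "browse":      {"weight": 60, "think_s": (2.0, 5.0)},
    "search":      {"weight": 25, "think_s": (1.0, 3.0)},
    "add_to_cart": {"weight": 10, "think_s": (1.0, 4.0)},
    "checkout":    {"weight": 5,  "think_s": (3.0, 8.0)},
}


def next_action(rng: random.Random) -> tuple[str, float]:
    """Pick a scenario by weight and a randomized think-time for it."""
    names = list(SCENARIOS)
    weights = [SCENARIOS[name]["weight"] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    low, high = SCENARIOS[name]["think_s"]
    return name, rng.uniform(low, high)


rng = random.Random(42)  # fixed seed so the sample is repeatable
sample = [next_action(rng) for _ in range(10_000)]
shares = {
    name: sum(1 for picked, _ in sample if picked == name) / len(sample)
    for name in SCENARIOS
}
print(shares)
```

Checking that the sampled shares land close to the intended weights is a cheap sanity test before the same mix is encoded in the actual load script.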
+> [!WARNING]
+> The average latency hides the tail, and the tail is what pages you. A 50ms mean can sit on top of a 2s p99: 1 in 100 requests then takes 2s or longer, at least 40× the mean, which at scale is thousands of furious users. Never let an average be the pass/fail metric; gate on p95/p99 and error rate.
+> [!WARNING]
+> Load-testing a tiny staging environment tells you nothing transferable. A 1-instance, free-tier-DB staging box breaks at numbers that say nothing about your 12-instance production fleet, and the bottleneck (e.g. a 5-connection pool) may not even exist in prod. Test against prod-like capacity, or test prod itself in a maintenance window — not a toy.
+> [!CAUTION]
+> A single under-powered load generator caps your result: you will report the *client's* ceiling as the *server's*. If generator CPU or network is pegged, or you exhaust ephemeral ports, the numbers are invalid. Distribute generators and watch their own metrics; treat a saturated generator as a failed run, not a finding.
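The gap between the mean and the tail described in the first warning can be reproduced with a few lines of arithmetic. The sketch below uses a nearest-rank percentile and a made-up latency sample (98 requests at 10 ms, 2 at 2000 ms); both the data and the `percentile` helper are illustrative assumptions, not output from a real run.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

In this sample the mean is about 50 ms and p95 is still 10 ms, while p99 is 2000 ms, which is why the thresholds gate on more than one percentile.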
+## Output
+A complete, defensible load-test design, written as files plus an interpretation guide:
+1. **Workload model** — a table of weighted scenarios with endpoint mix, payloads, think-time ranges, and the data set strategy.
+```text
+Scenario        Weight  Steps (think-time)                         Data
+browse          60%     GET /  -> GET /p/{id}  (2-5s)              rotate 5k product IDs
+search          25%     GET /search?q={term}  (1-3s)               2k distinct terms
+add-to-cart     10%     POST /cart  (1-4s)                         rotate user + product
+checkout         5%     POST /cart -> POST /checkout  (3-8s)       unique cart per VU
+```
+2. **Test type + tool + load profile** — which of load/stress/soak/spike, the tool, the model (open arrival-rate vs closed VU), ramp shape, and duration, with the one question the test answers.
+3. **The threshold-bearing script** (e.g. k6) — runnable, with SLO-tied thresholds that fail the run on breach:
+```javascript
+export const options = {
+  scenarios: {
+    peak: {
+      executor: "ramping-arrival-rate",
+      startRate: 50, timeUnit: "1s",
+      preAllocatedVUs: 500, maxVUs: 2000,
+      stages: [
+        { target: 300, duration: "3m" },   // ramp to expected peak
+        { target: 300, duration: "20m" },  // hold at peak
+        { target: 0,   duration: "2m" },   // ramp down
+      ],
+    },
+  },
+  thresholds: {
+    http_req_failed:   ["rate<0.001"],                  // < 0.1% errors
+    http_req_duration: ["p(95)<300", "p(99)<800"],      // SLO latency
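+    // k6's built-in `scenario` tag is the key under `scenarios` (only "peak" here), so this
+    // assumes checkout requests carry a custom tag, e.g. { tags: { journey: "checkout" } }.
+    "http_req_duration{journey:checkout}": ["p(95)<1000"],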
+  },
+};
+```
+4. **How to read the results** — the percentile/error/throughput table to produce, where the saturation knee is, which server-side metric to correlate it with, and the explicit pass/fail call against the thresholds, plus the recommended capacity number with headroom.
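Locating the saturation knee from a stepped run can be sketched as follows. The step data is invented for illustration, the limits reuse the example thresholds (p95 < 300 ms, error rate < 0.1%), and the 30% headroom factor is an assumed sizing policy rather than a recommendation from the skill.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    rps: int            # offered load at this step
    p95_ms: float       # observed p95 latency
    error_rate: float   # fraction of failed requests


def capacity_before_breach(steps, p95_limit_ms, error_limit) -> Optional[int]:
    """Highest load level that met both thresholds before the first breach."""
    last_ok = None
    for step in sorted(steps, key=lambda s: s.rps):
        if step.p95_ms >= p95_limit_ms or step.error_rate >= error_limit:
            break
        last_ok = step.rps
    return last_ok


# Hypothetical stress-test results, one row per load step.
STEPS = [
    Step(100, 120, 0.0),
    Step(200, 135, 0.0),
    Step(300, 160, 0.0002),
    Step(400, 240, 0.0005),
    Step(500, 610, 0.004),    # latency and errors both breach here
    Step(600, 1900, 0.03),
]

knee_rps = capacity_before_breach(STEPS, p95_limit_ms=300, error_limit=0.001)
sizing_rps = knee_rps * 70 // 100  # assumed policy: keep 30% headroom below the knee
print(knee_rps, sizing_rps)
```

The number alone does not name the bottleneck; it still has to be correlated with the server-side metric that hit its limit at the breaching step.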

package/content/skills/memory-leak-hunter.md ADDED

@@ -0,0 +1,35 @@
+---
+name: "memory-leak-hunter"
+description: "Find and fix a memory leak in a running app: confirm it's a real leak under steady load, diff two heap snapshots to name the growing object and its retention path, cut the root reference that blocks collection, and re-run to confirm memory plateaus. Use when RSS climbs until OOM/restart, heap grows unbounded across a steady workload, or GC pauses worsen the longer the process runs."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A process whose memory only goes up will eventually OOM, get killed, or grind to a halt in GC — but "memory went up" is not the same as "there is a leak." A warming cache, a JIT, a connection pool filling, and a steadily growing legitimate working set all climb too. This skill refuses to guess: it first *confirms* the leak against a steady workload, then *locates* it with a heap diff rather than a single snapshot, traces the *retention path* to the one reference that blocks collection, fixes that root, and re-runs to prove the curve flattens.
+## When to use this skill
+- RSS climbs monotonically until the process OOMs, gets OOM-killed, or hits a scheduled restart that "fixes" it for a while.
+- Heap usage trends up across a steady, repeating workload and never returns to baseline after a GC.
+- GC pauses (or full-GC frequency) get worse the longer the process stays up — a classic sign the live set is growing.
+- A load test or soak test shows memory that doesn't plateau even after the request rate is constant.
+- After a deploy, memory behavior changed and you need to know whether it's a real leak or a bigger-but-bounded cache.
+## Instructions
+1. **Confirm it's a leak before hunting one.** Drive a *steady, repeating* workload (constant request rate or a fixed loop) and record memory over time — RSS and heap-used at, say, 30s intervals. Force a GC between samples where you can (`global.gc()` with `--expose-gc` in Node, `System.gc()`/`jcmd <pid> GC.run` on the JVM, `gc.collect()` in Python). A leak is memory that trends **up** under constant load and **does not recover** after GC. Memory that rises during warmup and then *plateaus*, or that drops back after GC, is not a leak — stop here and look at cache sizing or normal working set instead.
+2. **Capture two heap snapshots under load, spaced apart.** Take snapshot A once warmup has settled, keep the same workload running, then take snapshot B after memory has visibly grown (Node: `--inspect` + DevTools/`heapdump`/`v8.writeHeapSnapshot()`; JVM: `jmap -dump:live,format=b,file=… <pid>` or a JFR `OldObjectSample`; Python: `tracemalloc.take_snapshot()` ×2, or `objgraph`/`guppy`). One snapshot tells you what's big *now*, which is useless — you need both ends of the growth.
+3. **Diff the two snapshots: read what GREW, not what's biggest.** Use the comparison view (DevTools "Comparison" between A and B, `snapshot_b.compare_to(snapshot_a, "lineno")` on `tracemalloc` snapshots, MAT's dominator/histogram delta). Sort by *delta in retained size and object count* (`tracemalloc` reports the delta per allocation site rather than retained size). The leak is the object type whose instance count and retained size climb monotonically across the diff and never get freed. It is not necessarily the single largest object, which is often a legitimately big-but-stable buffer.
+4. **Trace the retention path to the root that blocks collection.** For the growing object, follow the *retainers / paths-to-GC-root* (DevTools "Retainers", MAT "Path to GC Roots: exclude weak/soft"). The fix lives at the *root* end of that chain — the live reference that keeps the whole subtree alive. Match it to the usual suspects: an unbounded cache/`Map`/dict keyed by something ever-growing (request id, user id); an event listener / observable / pub-sub subscription added but never removed; a closure captured by a long-lived callback that drags a large scope with it; a `setInterval`/timer/scheduled task never cleared; a module-level array/list that's only ever appended to; or — in native or manual-memory code — an allocation with no matching free (check with `valgrind --leak-check=full` / ASan / a heap profiler).
+5. **Fix by bounding the lifetime at the root.** Don't trim symptoms — cut the retaining reference: put a size cap and eviction (LRU) or TTL on the cache; `removeEventListener` / `unsubscribe` / `dispose` in the matching teardown; `clearInterval`/`clearTimeout` and cancel scheduled work on shutdown/unmount; replace a cache keyed by short-lived objects with a `WeakMap`/`WeakRef` so entries are collectible; bound or drain the module-level collection; add the missing `free`/`delete`/`close`. Prefer the change that makes the lifetime *correct* over one that just makes the leak slower.
+6. **Re-run the same workload and confirm a plateau.** Repeat step 1's steady workload with the fix in place and capture the same memory-over-time trace. The fix is verified only when memory rises during warmup and then *flattens* (and recovers after GC) across a window long enough to have leaked before. If it still trends up, the diff pointed at one of several retainers — go back to step 3 and trace the next-largest grower.
+> [!WARNING]
+> A single heap snapshot proves nothing about a leak — every running process holds a lot of live memory legitimately. Only the **diff of two snapshots under sustained load** distinguishes "growing and never freed" from "big but stable." Never conclude a leak (or a fix) from one snapshot or one memory number.
+> [!NOTE]
+> "Memory went up" during warmup, JIT, or cache fill is expected, not a leak — a leak is unbounded growth that never plateaus under *constant* load. Before touching code, confirm the curve never flattens and never recovers after a forced GC; otherwise you'll "fix" a cache that was working as designed and make the app slower.
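The leak-versus-plateau distinction can be made explicit on the recorded samples. This is a rough heuristic sketch, not a statistical test: it drops an assumed warmup prefix and compares the start and end of the remaining post-GC samples. The function name, the warmup length, the 5% tolerance, and both sample series are assumptions chosen for illustration.

```python
from statistics import mean


def is_growing(post_gc_samples_mb, warmup_samples=3, tolerance=0.05):
    """True if post-GC memory keeps rising after warmup by more than `tolerance`."""
    steady = post_gc_samples_mb[warmup_samples:]
    if len(steady) < 4:
        raise ValueError("need at least 4 post-warmup samples")
    window = max(1, len(steady) // 3)
    start = mean(steady[:window])    # average of the earliest steady samples
    end = mean(steady[-window:])     # average of the latest steady samples
    return (end - start) / start > tolerance


# Hypothetical heap-used readings (MB) taken after a forced GC at fixed intervals.
plateau = [100, 180, 240, 250, 251, 249, 250, 252, 250, 251]
leak = [100, 180, 240, 250, 262, 275, 288, 301, 313, 326]

print(is_growing(plateau), is_growing(leak))
```

Both series climb during warmup; only the second keeps climbing under constant load, which is the pattern worth hunting.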
+## Output
+A short report with four parts: (1) the **confirmation evidence** — the memory-over-time trace under steady load showing growth that doesn't recover after GC; (2) the **leaking object and retention path** from the heap diff (type, delta count/retained size, and the path-to-GC-root naming the retaining root); (3) the **root-cause fix** as a concrete diff at that root (eviction/TTL, unsubscribe, cleared timer, weak reference, or missing free); and (4) the **post-fix plateau** — the same workload's memory trace now flattening — or a note that another retainer remains and which one to chase next.
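As one possible shape for the eviction fix named in part (3), here is a minimal size-capped LRU cache built on the standard library's `OrderedDict`. The class name `BoundedCache` and the cap of 100 entries are illustrative assumptions; the point is only that the retaining collection gets a bounded lifetime.

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache with a hard entry cap, so it cannot grow without bound."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be >= 1")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:   # evict least recently used entries
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=100)
for request_id in range(10_000):             # ever-growing key, as in the leak pattern
    cache.put(request_id, f"response-{request_id}")
print(len(cache))
```

Keyed by request id, an unbounded dict would hold 10,000 entries here; the capped version holds the most recent 100.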

package/content/skills/pagination-designer.md ADDED

@@ -0,0 +1,51 @@
+---
+name: "pagination-designer"
+description: "Design correct, scalable pagination (plus the filtering and sorting that ride with it) for a list endpoint — pick cursor (keyset) vs offset and justify it, define an opaque cursor with a unique tiebreaker so no row is skipped or repeated, return a consistent envelope, bound page size, and name the indexes the sort actually needs. Use when adding a list endpoint, when OFFSET pagination crawls on a large table, or when clients see duplicate or missing rows while paging."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Pagination looks trivial until the table grows or the data moves under the reader. `OFFSET 100000` doesn't skip to row 100,000 — the database scans and throws away the first 100,000 matching rows on every request, so latency climbs linearly with depth. And sorting by a non-unique column (`created_at`, `name`, `score`) without a tiebreaker gives a *partial* order: rows that tie can reorder between requests, so paging skips some and shows others twice. This skill makes the pagination scheme an explicit decision — keyset vs offset, the cursor encoding, the tiebreaker, the page-size bounds, and the indexes — and defines how filtering and sorting compose with it.
+## When to use this skill
+- You're adding a list/collection endpoint and need to decide how clients page through it.
+- An existing `OFFSET`/`LIMIT` endpoint is fast on page 1 and slow on page 500, or it times out on deep pages.
+- Clients report seeing the same row twice or missing rows entirely while scrolling — the classic symptom of an unstable sort under concurrent inserts/deletes.
+- The list is large, append-heavy, or actively changing (feeds, logs, events, search results) and you need stable paging that doesn't drift as rows are added.
+## Instructions
+1. **Choose cursor (keyset) vs offset from the dataset, and justify it.**
+   - **Cursor / keyset** — the default for large or actively-changing data. Instead of `OFFSET`, the next page *seeks* on the sort key: `WHERE (created_at, id) < (:last_created_at, :last_id) ORDER BY created_at DESC, id DESC LIMIT :n`. It's stable under inserts/deletes (each page is anchored to a real row, not a positional count) and stays fast at any depth because it uses an index range scan instead of scanning prior rows. Cost: no random page jumps, no total page count.
+   - **Offset / limit** — acceptable only for **small, stable, human-paginated** lists where users click numbered pages (an admin table of a few thousand rows). It allows arbitrary jumps and easy "page 7 of 20" UIs. Never use it for infinite scroll, large tables, or feeds.
+   State which you chose and the property (depth performance + stability vs random access) that drove it.
+2. **Always include a unique tiebreaker so the sort order is total.** A cursor seeking on a non-unique column alone (`created_at`) can't disambiguate ties: two rows with the same timestamp have no defined relative order, so one can land on both sides of a page boundary. Encode the user-facing sort key **plus a unique tiebreaker** (normally the primary key), so the cursor compares on the tuple `(created_at, id)`. This makes the order total: every row has exactly one position, so no row is skipped or repeated. A sort "by id" alone needs nothing extra because `id` is already unique; every other user-chosen sort needs the explicit `, id` tiebreaker appended.
+3. **Make the cursor opaque.** Encode the tuple `(sort_key_value, tiebreaker_value)` (and, if filters/sort are part of the page identity, a version or the sort direction) into a single base64url token — `next_cursor: "eyJjcmVhdGVkX2F0IjoiMjAyNi0wNi0xN1QwOTozMDowMFoiLCJpZCI6IjQ4ODEyIn0"`. Opaque means clients treat it as a blob and pass it back verbatim; you keep freedom to change the internal encoding without breaking them. Do **not** expose raw `(timestamp, id)` as query params — clients will hand-craft them, couple to your schema, and break on the next change.
+4. **Return one consistent envelope.** Every list endpoint returns the same shape:
+   ```json
+   { "data": [ ... ], "next_cursor": "…", "has_more": true }
+   ```
+   `next_cursor` is `null` when there are no more rows. Derive `has_more` reliably by fetching `LIMIT n + 1`: if you get `n + 1` rows, there's another page — drop the extra row and set `next_cursor` from the last *kept* row. This avoids a separate `COUNT` and is correct even when the last page is exactly full. Do not return a total count for keyset pagination; computing it scans the whole filtered set and defeats the point.
+5. **Bound page size with a sane default and a hard max.** Read the page size from `limit` (or `page_size`), clamp it: default 20–50, hard max 100–200 — never unbounded. An unbounded `limit` lets one client request a million rows and OOM the server or exhaust the DB. Clamp silently (return `min(requested, max)`) and document the cap.
+6. **Name the indexes the sort actually needs — this is non-negotiable for keyset.** The `ORDER BY (sort_key, tiebreaker)` and the `WHERE (sort_key, tiebreaker) < (...)` seek are only fast if a **composite index on those exact columns in that exact order and direction** exists. Sorting `created_at DESC, id DESC` needs an index supporting that; a plain index on `created_at` alone forces a sort and undoes the win. If filters narrow the set, the index should lead with the equality-filter columns, then the sort columns: `(tenant_id, created_at, id)` for a query filtered by tenant and sorted by time. Verify the index exists or flag it as required.
+7. **Define how filtering and sorting compose with the cursor.** The cursor is only valid *for the filter and sort it was issued under*: a cursor minted for `?status=active&sort=created_at` is meaningless if the next request changes `status` or `sort`. Specify the contract: which fields are filterable, which are sortable (whitelist them, and never interpolate a client-supplied column name into `ORDER BY`), and that **changing any filter or sort param invalidates the cursor and resets to the first page**. For multi-column sorts, the tiebreaker is appended after *all* user sort columns, and the seek predicate must compare the full tuple. Prefer the row-value comparison `(a, b, c) < (:a, :b, :c)`; fall back to the expanded form `a < :a OR (a = :a AND b < :b) OR …` only when the engine lacks tuple comparison or the sort mixes ascending and descending columns, which a single row-value comparison cannot express.
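The seek predicate, the tiebreaker, the `LIMIT n + 1` trick, and the page-size clamp from the steps above can be exercised together against an in-memory SQLite database. This is a sketch under assumptions: the `posts` table, its columns, the index name, and the hard max of 100 are invented, the cursor is passed as a raw tuple (opaque encoding is left out here), and row-value comparison requires SQLite 3.15 or newer.

```python
import sqlite3


def fetch_page(conn, limit, after=None):
    """Newest-first keyset page. `after` is the (created_at, id) of the last row seen."""
    limit = max(1, min(limit, 100))            # clamp; 100 is an assumed hard max
    sql = "SELECT created_at, id, title FROM posts "
    params = []
    if after is not None:
        sql += "WHERE (created_at, id) < (?, ?) "   # tuple seek on sort key + tiebreaker
        params.extend(after)
    sql += "ORDER BY created_at DESC, id DESC LIMIT ?"
    params.append(limit + 1)                   # fetch one extra row to detect another page
    rows = conn.execute(sql, params).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]                        # drop the extra row
    next_after = (rows[-1][0], rows[-1][1]) if has_more else None
    return rows, next_after, has_more


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL, title TEXT)")
conn.execute("CREATE INDEX posts_created_id ON posts (created_at DESC, id DESC)")
# Ten rows sharing only three timestamps, so ties straddle page boundaries.
conn.executemany(
    "INSERT INTO posts (id, created_at, title) VALUES (?, ?, ?)",
    [(i, f"2026-01-0{1 + i % 3}T00:00:00Z", f"post {i}") for i in range(1, 11)],
)

seen, after = [], None
while True:
    rows, after, has_more = fetch_page(conn, limit=3, after=after)
    seen.extend(row[1] for row in rows)
    if not has_more:
        break
print(seen)
```

Every id appears exactly once even though several rows share a timestamp, because the `(created_at, id)` tuple gives each row a single position in the order.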
+> [!WARNING]
+> Deep `OFFSET` is O(n), not O(1). `OFFSET 100000 LIMIT 20` makes the database read and discard 100,000 matching rows before returning 20 — every request, getting worse as users page deeper, holding locks and burning IO the whole time. Page 1 being fast tells you nothing about page 5,000. If the table can grow large or users can reach deep pages, use keyset.
+> [!WARNING]
+> A non-unique sort key without a tiebreaker silently corrupts paging. With `ORDER BY created_at` and several rows sharing a timestamp, the engine may return those tied rows in a different order on the next request — so a row sitting on the page boundary gets skipped on one page and the previous boundary row reappears on the next. There is no error, just missing and duplicated data. Always append a unique tiebreaker (`, id`) to every sort.
+> [!NOTE]
+> Offset and keyset can coexist behind one envelope: serve numbered offset pages for a small admin UI and keyset for the public feed, both returning `{ data, next_cursor, has_more }`. The offset endpoint additionally accepts a `page` parameter and leaves `next_cursor` null. Pick per endpoint from its access pattern, not one rule for the whole API.
+## Output
+A pagination spec stating: the chosen **scheme** (cursor vs offset) + rationale; the **response envelope** (`data` / `next_cursor` / `has_more`, with the `null`-when-done and `LIMIT n+1` rules); the **cursor encoding** — the exact tuple `(sort key, unique tiebreaker)` and that it's base64url-opaque; the **page-size** default and hard max; the **required indexes** (exact columns, order, and direction, leading with equality-filter columns); and the **filter/sort contract** — the filterable/sortable field whitelist, the tuple seek predicate, and that changing any filter or sort param invalidates the cursor.
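One way the opaque cursor and the filter/sort contract might be encoded is sketched below. The payload keys (`k`, `id`, `q`), the `query_shape` string format, and both function names are assumptions for illustration. The token is unsigned, so decoded values should be treated as untrusted input.

```python
import base64
import json


def encode_cursor(sort_value, tiebreaker, query_shape: str) -> str:
    """Pack (sort key, tiebreaker) plus the filter/sort it was issued under."""
    payload = {"k": sort_value, "id": tiebreaker, "q": query_shape}
    raw = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_cursor(token: str, query_shape: str):
    """Return (sort key, tiebreaker); raise ValueError if malformed or stale."""
    padded = token + "=" * (-len(token) % 4)   # restore stripped base64 padding
    try:
        payload = json.loads(base64.urlsafe_b64decode(padded.encode("ascii")))
    except (ValueError, UnicodeError) as exc:
        raise ValueError("malformed cursor") from exc
    if not isinstance(payload, dict) or "k" not in payload or "id" not in payload:
        raise ValueError("malformed cursor")
    if payload.get("q") != query_shape:
        # Filter or sort changed: the cursor no longer applies, restart at page one.
        raise ValueError("cursor does not match current filter/sort")
    return payload["k"], payload["id"]


SHAPE = "status=active&sort=created_at:desc"   # hypothetical canonical form
token = encode_cursor("2026-06-17T09:30:00Z", 48812, SHAPE)
print(token)
print(decode_cursor(token, SHAPE))
```

Binding the token to the query shape lets the server reject a cursor replayed under a different filter instead of silently returning a wrong page.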

package/content/skills/property-test-designer.md ADDED

@@ -0,0 +1,63 @@
+---
+name: "property-test-designer"
+description: "Design property-based tests — generate hundreds of random inputs and assert invariants that must hold for ALL of them — to surface the edge cases hand-picked examples never reach. Use when code has a large input space (parsers, serializers, encoders, math, data transforms), when a bug keeps slipping through despite green example tests, or when you can't enumerate every case worth checking."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Example-based tests only check the inputs you thought to write down. This skill designs property-based tests instead: it identifies the invariants that must hold for *every* valid input, defines generators that produce hundreds of them — including the corners you'd never type by hand — and lets the framework shrink any failure to its minimal reproducing input. The deliverable is the chosen properties, the generators, a runnable test in your language's framework, and a plan to pin every counterexample as a fixed regression case.
+## When to use this skill
+- The input space is large or recursive — parsers, serializers, encoders/decoders, numeric code, date/time logic, data transforms, state machines — and enumerating cases by hand is hopeless.
+- A bug keeps escaping a green example suite because it lives in a corner nobody wrote a test for (empty input, unicode, overflow, a specific interleaving).
+- You have a clear correctness relation — a round-trip, an inverse, a slower reference implementation — but no single "expected output" to assert against.
+- You're hardening a critical pure function and want adversarial coverage, not three happy-path examples.
+## Instructions
+1. **Pick properties that hold for ALL valid inputs — not examples.** Stop choosing inputs; choose relations. The classics, in rough order of power:
+   - **Round-trip / inverse:** `decode(encode(x)) == x`, `parse(render(x)) == x`, `decompress(compress(x)) == x`. The highest-value property for any serializer or codec.
+   - **Invariant:** a property of the output regardless of input — `sort(xs)` is ordered *and* a permutation of `xs`; a balanced-tree insert keeps the balance condition; a parser never returns a node spanning past EOF.
+   - **Idempotence:** `f(f(x)) == f(x)` — for normalizers, dedupers, sanitizers, `canonicalize`.
+   - **Oracle / model:** the function must agree with a simpler, slower, or trusted reference (a brute-force version, the previous release, the stdlib) on every input.
+   - **Metamorphic:** when there's no oracle, relate two runs — `sort(xs) == sort(shuffle(xs))`; `search(q)` ⊆ `search(broaden(q))`; `len(filter(p, xs)) <= len(xs)`.
+2. **Define generators that cover the real domain.** A property is only as good as its inputs. For each property, build a generator that reaches the nasty regions on purpose: empty/single-element collections, `0`/`-0`/negatives/`MAX_INT`/`MIN_INT`, NaN and infinities, empty strings, unicode and surrogate pairs, embedded delimiters and escape chars, huge inputs, deeply nested structures, and duplicates. Compose existing generators (`lists(integers())`, `dictionaries(...)`) rather than rolling raw randomness.
+3. **Constrain generators to valid inputs.** If the property only holds for, say, sorted lists or well-formed dates, *generate them in that shape* — `map`/`build` from raw primitives — instead of generating garbage and filtering it. Filtering (`assume`/`.filter`) discards rejected inputs and silently shrinks your effective sample size.
+4. **Pick the framework for the language.** Python → **Hypothesis** (`@given`, `st.*` strategies). JS/TS → **fast-check** (`fc.assert(fc.property(...))`). Haskell → **QuickCheck**; Scala → **ScalaCheck**; JVM/Java → **jqwik**; Rust → **proptest**/`quickcheck`; Go → built-in `testing/quick` or `rapid`. Match what's already in the project before adding a dep.
+5. **Lean on shrinking and pin the counterexample.** When a property fails, the framework shrinks the random input to a *minimal* failing case (e.g. `[0, 0]`, not a 400-element list). Read that minimal input — it usually names the bug. Then add it as an explicit example so it's checked every run regardless of the random seed: Hypothesis `@example(...)`, fast-check `fc.assert(prop, { examples: [[...]] })`, or just a plain unit test asserting the fixed input.
+6. **Budget run counts for CI.** Defaults (Hypothesis 100, fast-check 100) are fine locally; for cheap pure functions raise to 1000+ in a nightly job, but keep PR runs bounded so the suite stays fast. Make CI failures reproducible: record the seed of each run, or fix one in the CI configuration, so a failing property can be replayed. Disable Hypothesis's `deadline` for inputs whose runtime legitimately scales with size.
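To make the property kinds from step 1 concrete without adding a dependency, the sketch below checks several of them against a small function using only the standard library. It is a hand-rolled loop standing in for a real framework: it has no shrinking and no example database, and `dedupe`, `random_int_list`, and the value pool are hypothetical names and choices made for illustration.

```python
import random


def dedupe(xs):
    """Function under test (hypothetical): drop repeats, keep first occurrences."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def random_int_list(rng: random.Random):
    """Generator that deliberately reaches empty, single-element, and duplicate-heavy lists."""
    size = rng.choice([0, 1, 2, rng.randint(3, 30)])
    pool = [0, -1, 1, 2**31 - 1, -2**31] + list(range(-3, 4))  # small pool forces duplicates
    return [rng.choice(pool) for _ in range(size)]


def check_properties(xs):
    out = dedupe(xs)
    assert dedupe(out) == out                     # idempotence
    assert len(out) == len(set(out))              # invariant: no duplicates remain
    assert set(out) == set(xs)                    # invariant: same distinct elements
    assert len(out) <= len(xs)                    # relation between input and output sizes
    assert out == sorted(set(xs), key=xs.index)   # oracle: slower, independent model


rng = random.Random(2024)   # fixed seed so a failure can be replayed
checked = 0
for _ in range(500):
    check_properties(random_int_list(rng))
    checked += 1
print(checked, "cases passed")
```

In a real suite the same assertions would sit inside a Hypothesis or fast-check property so that a failing input is shrunk automatically.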
+> [!WARNING]
+> A property that reimplements the function under test proves nothing. If your "oracle" shares the buggy logic (or you assert `encode(x) == encode(x)`), the test is green and worthless. The relation must be *independent* of the implementation — an inverse, a brute-force model, or a structural invariant the code never computes directly.
+> [!NOTE]
+> An unconstrained generator wastes the run budget rejecting invalid inputs and can starve the interesting region. If a heavy `assume()`/`.filter()` throws away most candidates, the framework will warn (Hypothesis raises `FailedHealthCheck`) — rebuild the generator to *construct* valid inputs instead of filtering for them.
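The cost of filtering versus constructing valid inputs can be measured directly. In this standard-library sketch the "valid" inputs are sorted five-element lists; the list size, value range, seed, and helper names are assumptions, and the exact count kept by filtering depends on the seed.

```python
import random


def random_list(rng: random.Random, size: int = 5):
    return [rng.randint(0, 99) for _ in range(size)]


def sorted_by_filtering(rng: random.Random, attempts: int):
    """Generate arbitrary lists and keep only those that happen to be sorted."""
    kept = []
    for _ in range(attempts):
        xs = random_list(rng)
        if xs == sorted(xs):        # most candidates are rejected here
            kept.append(xs)
    return kept


def sorted_by_construction(rng: random.Random, attempts: int):
    """Build every input in the valid shape, so nothing is discarded."""
    return [sorted(random_list(rng)) for _ in range(attempts)]


rng = random.Random(7)
filtered = sorted_by_filtering(rng, attempts=2_000)
constructed = sorted_by_construction(rng, attempts=2_000)
print(len(filtered), len(constructed))
```

Only about one random five-element list in a hundred is already sorted, so the filtering approach spends nearly the whole budget on rejected inputs while construction uses all of it.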
+## Output
+For each property, the skill produces:
+- **The property and the relation it encodes** (round-trip / invariant / idempotence / oracle / metamorphic), stated as a one-line claim about all valid inputs.
+- **The generator(s)**, written in the project's framework, with the edge regions they deliberately reach.
+- **A runnable test** in that framework.
+- **The regression plan** — where each shrunk counterexample gets pinned as a fixed example so it's checked deterministically forever.
+Example — a round-trip property for a CSV codec, in Hypothesis:
+```python
+from hypothesis import given, strategies as st, example
+# Generate well-formed rows directly (no filtering): each cell is arbitrary
+# text incl. commas, quotes, newlines, unicode — exactly the chars that break parsers.
+rows = st.lists(st.lists(st.text(), min_size=1), min_size=1)
+@given(rows)
+@example([["a,b", '"q"', "line\nbreak"]])  # pinned: a past failure, checked every run
+def test_csv_roundtrip(data):
+    # Property: parsing what we wrote back yields the original (inverse).
+    # parse_csv is INDEPENDENT of write_csv — not a reimplementation of it.
+    assert parse_csv(write_csv(data)) == data
+```
+A failure here shrinks to the minimal breaking cell — typically `[["\n"]]` or `[['"']]` — which you read, fix, and then pin via a second `@example(...)`. Hand the proposed properties to `test-scaffolder` to flesh out, and use `coverage-gap-finder` to confirm the generated inputs now reach the previously-cold branches.

package/content/skills/security-headers-hardener.md ADDED

@@ -0,0 +1,79 @@
+---
+name: "security-headers-hardener"
+description: "Audit and harden a web app's or API's HTTP security headers — Content-Security-Policy, HSTS, X-Content-Type-Options, frame-ancestors, Referrer-Policy, Permissions-Policy, and CORS — and produce a staged rollout that won't break the site. Use before a launch, during a security pass, or when a scanner (Mozilla Observatory, securityheaders.com, a pentest) flags missing or weak headers. Audits and edits header config; rolls CSP out Report-Only first."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+Audit the HTTP security headers a web app or API actually sends, then harden them without taking the site down. The single highest-value header is a real **Content-Security-Policy** — it is the strongest in-band mitigation for XSS — but it is also the one most likely to break your site if shipped carelessly, so this skill always stages CSP through **Report-Only** first. Around it: enforce HTTPS with HSTS (carefully, because `preload` is effectively one-way), stop MIME sniffing, block framing, tighten `Referrer-Policy` and `Permissions-Policy`, scope CORS so it can't be turned into a credential-leaking open door, and strip headers that advertise your stack and version. Output is a per-header `current → recommended` audit, the exact values to paste, and a rollout plan that goes Report-Only before enforce.
+## When to use this skill
+- Before a public launch or a major release that changes the frontend, third-party scripts, or the CDN/proxy in front of the app.
+- When a scanner (securityheaders.com, Mozilla Observatory, Lighthouse, a pentest report) flags missing or weak headers.
+- When standing up a new service, edge config, or reverse proxy and you want headers right from day one.
+- After adding a third-party embed, analytics, payment iframe, or auth widget — anything that changes what origins the page must trust.
+> [!WARNING]
+> Never ship an enforcing `Content-Security-Policy` you have not first run as `Content-Security-Policy-Report-Only` against real traffic. A directive like `script-src 'self'` will silently kill every inline `<script>`, injected analytics snippet, and third-party widget the moment it enforces — that's a white-screened production site, not a hardened one.
+## Instructions
+1. **Find where headers are actually set, then observe what ships.** Glob and grep the layers that can emit headers: app middleware (`helmet`, `setHeader`, `res.headers`, `add_header`), framework config (`next.config`, `vercel.json`, `netlify.toml`, `**/middleware*`), and edge config (`nginx.conf`, `.htaccess`, Cloudflare/CDN rules, `**/*.conf`). Multiple layers may set the same header; the proxy can override the app's value or duplicate it. Establish the *effective* response (e.g. `curl -sI https://host` against a deployed instance, or read the proxy config) before changing anything. You can't harden what you can't see, and a header set twice with different values is its own bug.
+2. **Set a real Content-Security-Policy — the core control.** Start from a default-deny base: `default-src 'self'`. Then open *only* what the app needs: `script-src` and `style-src` for trusted origins, `img-src`, `connect-src` for your APIs/websockets, `font-src`, `frame-src` for embeds. Avoid `'unsafe-inline'` and `'unsafe-eval'` in `script-src` — they neuter the whole policy against XSS. For unavoidable inline scripts, use a per-response **nonce** (`script-src 'nonce-<random>'`, regenerated each request) or a **SHA-256 hash** of the script body, not a blanket allow. Always add `object-src 'none'` (kills Flash/plugin vectors) and `base-uri 'self'` (stops `<base>`-tag injection that reroutes relative script URLs). Add a `report-uri`/`report-to` endpoint so violations are collected.
+3. **Roll CSP out Report-Only before enforcing.** Deploy the policy as `Content-Security-Policy-Report-Only` first — same directives, but violations are reported to your collector instead of blocked. Watch the violation stream across representative traffic (all major pages, logged-in and out, the third-party flows) until it goes quiet or shows only known-benign noise (browser extensions inject inline styles — scope by `document-uri`/`blocked-uri`, don't widen the policy for them). Only then flip the header name to `Content-Security-Policy`. Keep `report-to` on after enforcing to catch regressions.
+4. **Enforce HTTPS with HSTS — and be deliberate about preload.** Set `Strict-Transport-Security: max-age=31536000; includeSubDomains`. Add `; preload` **only** once every subdomain serves valid HTTPS, because preload submission bakes HTTPS-only into shipped browsers and is slow and painful to undo. When first introducing HSTS, consider starting with a shorter `max-age` (e.g. a day) to confirm nothing breaks, then raise it. HSTS only takes effect on a response served over HTTPS, so also ensure a plain-HTTP→HTTPS redirect exists.
+5. **Stop MIME sniffing and clickjacking.** Set `X-Content-Type-Options: nosniff` (stops the browser from re-interpreting a response's type, a classic way to execute an uploaded "image" as script). Block framing with `Content-Security-Policy: frame-ancestors 'self'` (or an explicit allowlist of origins permitted to frame you), which supersedes the legacy `X-Frame-Options: DENY/SAMEORIGIN`. Set both for older-browser coverage, but make them agree: `frame-ancestors 'none'` pairs with `DENY`, and `frame-ancestors 'self'` pairs with `SAMEORIGIN`.
+6. **Tighten Referrer-Policy and Permissions-Policy.** Set `Referrer-Policy: strict-origin-when-cross-origin` (sends the full URL same-origin, only the origin cross-origin over HTTPS, nothing on downgrade) — this stops tokens or PII in query strings from leaking via the `Referer` header to third parties. Set `Permissions-Policy` to disable powerful features the app doesn't use, e.g. `camera=(), microphone=(), geolocation=(), payment=()` — an empty allowlist `()` means "no origin, not even self." Only grant features the app actually calls.
+7. **Scope CORS tightly; never fall into the wildcard-plus-credentials trap.** If the API serves cross-origin requests, set `Access-Control-Allow-Origin` only for **specific** trusted origins, checking the request's `Origin` against an allowlist before echoing it; never reflect an arbitrary `Origin` header back unchecked (that's "allow everyone" in disguise). The misconfiguration to hunt for starts as an attempt to combine `Access-Control-Allow-Origin: *` with `Access-Control-Allow-Credentials: true`. Browsers reject that literal combination, so a server that *needs* credentials is often changed to reflect the caller's `Origin` instead, and if that reflection is unchecked, any site can make authenticated cross-origin requests and read the response. Pin `Access-Control-Allow-Methods`/`Access-Control-Allow-Headers` to what's used, and set `Vary: Origin` when reflecting so caches don't serve one origin's CORS response to another.
+8. **Remove headers that leak the stack.** Strip or blank `Server` version detail, `X-Powered-By`, `X-AspNet-Version`, `X-Generator`, and framework banners — they hand attackers a version to match against known CVEs and cost nothing to remove. (`X-XSS-Protection` is deprecated and best set to `0` or omitted; do not rely on it — CSP replaces it.)
+9. **Apply the changes, keeping each layer's edit minimal and consistent.** Use Edit to set the recommended values in the right layer (prefer the single source of truth — usually the proxy/edge or one central middleware — over scattering headers across the app). Don't introduce a header in two places with conflicting values. Leave CSP as Report-Only in the committed config if the violation-watch window hasn't completed; note clearly in the rollout plan when to flip it.
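Steps 2 and 3 above can be sketched as a small header builder: one directive set, a per-response nonce, and a flag that only changes the header name between Report-Only and enforce. This is a minimal illustration, not the skill's own implementation. The directive values, the `https://api.example.com` origin, the `csp` reporting group, and the function names are all assumptions made for the example.

```python
import secrets
from typing import Dict, List, Tuple

# Hypothetical directive set for illustration; a real app lists its own origins.
BASE_DIRECTIVES: Dict[str, List[str]] = {
    "default-src": ["'self'"],
    "script-src": ["'self'"],
    "style-src": ["'self'"],
    "img-src": ["'self'", "data:"],
    "connect-src": ["'self'", "https://api.example.com"],
    "object-src": ["'none'"],
    "base-uri": ["'self'"],
    "frame-ancestors": ["'self'"],
}


def new_nonce() -> str:
    """Return a fresh, unguessable nonce; call once per response."""
    return secrets.token_urlsafe(16)


def build_csp_header(
    nonce: str, enforce: bool = False, report_group: str = "csp"
) -> Tuple[str, str]:
    """Return (header_name, header_value) for one response.

    The same directives are used in both modes; only the header name changes.
    """
    # Copy the lists so the shared base policy is never mutated.
    directives = {name: list(sources) for name, sources in BASE_DIRECTIVES.items()}
    directives["script-src"].append(f"'nonce-{nonce}'")

    parts = [f"{name} {' '.join(sources)}" for name, sources in directives.items()]
    # Assumes a reporting group with this name is declared in a separate
    # reporting header on the same response.
    parts.append(f"report-to {report_group}")

    name = (
        "Content-Security-Policy"
        if enforce
        else "Content-Security-Policy-Report-Only"
    )
    return name, "; ".join(parts)
```

The same nonce passed to `build_csp_header` would also need to be written into the `nonce` attribute of each inline `<script>` in that response. Keeping `enforce` as a single switch means the flip in step 3 is a one-line config change and not a policy rewrite.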
+> [!NOTE]
+> Test against a real response, not the config file. A header in `helmet()` or `next.config` can be silently overridden, dropped, or duplicated by a CDN, load balancer, or framework default. Confirm the effective `curl -sI` output before and after — the wire is the source of truth.
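One way to act on the note above is to parse the raw header text captured from the wire and flag any header that arrives more than once with different values. The sketch below assumes the input is plain `curl -sI`-style output saved as a string; `parse_header_block`, `conflicting_headers`, and the sample response are hypothetical names and data for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Headers that legitimately repeat with different values.
IGNORED = frozenset({"set-cookie"})


def parse_header_block(raw: str) -> Dict[str, List[str]]:
    """Group header values by lowercased name from raw response-header text."""
    headers: Dict[str, List[str]] = defaultdict(list)
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skips the status line and blank lines
        name, _, value = line.partition(":")
        headers[name.strip().lower()].append(value.strip())
    return dict(headers)


def conflicting_headers(headers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return headers sent more than once with differing values."""
    return {
        name: values
        for name, values in headers.items()
        if name not in IGNORED and len(set(values)) > 1
    }


# Made-up sample: the app and the proxy each set X-Frame-Options.
SAMPLE = """HTTP/1.1 200 OK
Content-Type: text/html
X-Frame-Options: DENY
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
"""

if __name__ == "__main__":
    for name, values in conflicting_headers(parse_header_block(SAMPLE)).items():
        print(f"{name}: conflicting values {values}")
```

Running the same check before and after a change gives a quick signal that a second layer has started emitting a header the first layer already sets.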
+## Output
+A per-header audit table (`current → recommended` for every header in scope), the exact header/config values to apply in the identified layer, and a staged rollout plan that puts CSP through Report-Only before enforce. Edits are applied to the header config; CSP stays Report-Only until the violation window is clear.
+```text
+Security headers — scope: next.config.ts, middleware.ts, effective response for https://app.example.com
+Header                       Current                          Recommended
+---------------------------------------------------------------------------------------------------
+Content-Security-Policy      (none)                           default-src 'self'; script-src 'self'
+                                                              'nonce-{n}'; style-src 'self'; img-src
+                                                              'self' data:; connect-src 'self'
+                                                              https://api.example.com; object-src
+                                                              'none'; base-uri 'self'; frame-ancestors
+                                                              'self'; report-to csp
+                                                              → ship as -Report-Only first
+Strict-Transport-Security    (none)                           max-age=31536000; includeSubDomains
+                                                              (add ;preload only after subdomain audit)
+X-Content-Type-Options       (none)                           nosniff
+X-Frame-Options              (none)                           SAMEORIGIN  (CSP frame-ancestors is primary)
+Referrer-Policy              unsafe-url                       strict-origin-when-cross-origin
+Permissions-Policy           (none)                           camera=(), microphone=(), geolocation=(),
+                                                              payment=()
+Access-Control-Allow-Origin  * (reflected, with credentials)  https://app.example.com (allowlist) + Vary: Origin
+X-Powered-By                 Next.js                          (removed)
+Server                       nginx/1.25.3                     nginx (version suppressed)
+Rollout plan
+1. Deploy all headers above; CSP as Content-Security-Policy-Report-Only with report-to csp.
+2. Watch violation reports across all pages + third-party flows for one full traffic cycle.
+3. Resolve real violations (add the specific origin/nonce); ignore extension noise.
+4. When the stream is quiet, rename the header to Content-Security-Policy (enforce). Keep report-to on.
+5. After every subdomain is verified HTTPS-only, add ;preload to HSTS and submit (one-way).
+Fixed now: CORS wildcard+credentials misconfiguration removed; X-Powered-By/Server stripped;
+nosniff, frame-ancestors, Referrer-Policy, Permissions-Policy, HSTS applied. CSP pending enforce.
+```
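The CORS row in the sample audit replaces unchecked reflection with an allowlist. A minimal sketch of that check is below, assuming a single trusted origin taken from the sample output; `ALLOWED_ORIGINS` and `cors_headers` are hypothetical names, and a real service would wire the result into its own framework's response object.

```python
from typing import Dict, Optional

# Hypothetical allowlist; the origin is the one used in the sample audit.
ALLOWED_ORIGINS = frozenset({"https://app.example.com"})


def cors_headers(
    request_origin: Optional[str], allow_credentials: bool = True
) -> Dict[str, str]:
    """Return the CORS response headers for a request's Origin header.

    The origin is echoed back only when it is on the allowlist. Unknown or
    missing origins get no Access-Control-Allow-* headers at all.
    """
    # Always vary on Origin so a cache never serves one origin's
    # CORS response to another.
    headers = {"Vary": "Origin"}
    if request_origin in ALLOWED_ORIGINS:
        headers["Access-Control-Allow-Origin"] = request_origin
        if allow_credentials:
            headers["Access-Control-Allow-Credentials"] = "true"
    return headers
```

For example, `cors_headers("https://app.example.com")` includes the allow-origin and credentials headers, while `cors_headers("https://evil.example")` returns only `Vary: Origin`, so the browser blocks the cross-origin read.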