npm - agentscamp - Versions diffs - 0.3.0 → 0.5.0 - Mend

agentscamp 0.3.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

package/README.md +3 -3
package/content/commands/add-caching.md +79 -0
package/content/commands/audit-accessibility.md +101 -0
package/content/commands/clean-branches.md +113 -0
package/content/commands/review-tests.md +98 -0
package/content/commands/scaffold-github-action.md +94 -0
package/content/commands/setup-precommit-hooks.md +72 -0
package/content/commands/write-design-doc.md +78 -0
package/content/manifest.json +425 -3
package/content/skills/agent-trajectory-evaluator.md +59 -0
package/content/skills/alerting-rules-tuner.md +49 -0
package/content/skills/canary-release-planner.md +35 -0
package/content/skills/cold-start-optimizer.md +83 -0
package/content/skills/connection-pool-tuner.md +46 -0
package/content/skills/contract-test-designer.md +70 -0
package/content/skills/dependency-upgrade-planner.md +42 -0
package/content/skills/devcontainer-designer.md +40 -0
package/content/skills/distributed-tracing-instrumenter.md +42 -0
package/content/skills/idempotency-designer.md +47 -0
package/content/skills/memory-leak-hunter.md +35 -0
package/content/skills/mutation-test-runner.md +64 -0
package/content/skills/pagination-designer.md +51 -0
package/content/skills/property-test-designer.md +63 -0
package/content/skills/query-plan-analyzer.md +49 -0
package/content/skills/runbook-writer.md +83 -0
package/content/skills/security-headers-hardener.md +79 -0
package/content/skills/semantic-cache-designer.md +40 -0
package/content/skills/slo-definer.md +38 -0
package/content/skills/strangler-fig-migrator.md +47 -0
package/content/skills/structured-logging-designer.md +42 -0
package/content/skills/threat-model-builder.md +46 -0
package/content/skills/token-usage-profiler.md +39 -0
package/package.json +1 -1

package/content/skills/structured-logging-designer.md ADDED

@@ -0,0 +1,42 @@
+---
+name: "structured-logging-designer"
+description: "Design a structured (JSON) logging strategy with a stable field schema, correlation-ID propagation, and a disciplined level policy — then migrate ad-hoc string logs toward it. Use when logs are unsearchable plain text, when debugging a request across services means grepping multiple log streams by hand, or when standing up logging for a new service."
+allowed-tools: "Read, Grep, Glob, Edit"
+version: 1.0.0
+---
+A log line like `"user 42 failed to checkout"` answers nothing you can query: you can't filter by user, can't join it to the request that produced it, can't alert on it. Structured logging makes every line a queryable record — fields, not prose — so "show me every ERROR for tenant X in the last hour, with the request ID" is a query instead of a grep across five files. This skill designs that schema, threads a correlation ID through a request so a single flow is reconstructable across services, sets a level policy you can actually act on, and redacts secrets at the boundary — then rewrites representative statements so the team has a concrete pattern to copy.
+## When to use this skill
+- Logs are plain text and unsearchable — you grep for substrings instead of filtering on fields, and you can't build a dashboard or alert from them.
+- Debugging one request means manually correlating timestamps across multiple services or log streams because nothing ties the lines together.
+- Standing up logging for a new service and you want a defensible schema and level policy instead of scattered `print`/`console.log` calls.
+- Log levels are meaningless (everything is INFO, or ERROR is used for expected conditions) so on-call alerts are noise and real failures hide.
+## Instructions
+1. **Emit one structured record per line with a stable schema.** Every log line is a JSON object with the same required fields: `timestamp` (ISO-8601 / RFC-3339, UTC), `level`, `message` (a short, *constant* string — the variable parts go in fields, not interpolated into the message), `service`, and `correlation_id`. A constant message is what lets you group and count: `{"message": "checkout failed", "user_id": 42, "reason": "card_declined"}` is countable; `"user 42 failed: card declined"` is not.
+2. **Thread a correlation ID through every line of a request.** At the request entry point (HTTP middleware, queue consumer, RPC handler), read an incoming `X-Request-Id` / trace header or generate one, store it in a context-local (Go `context`, Node `AsyncLocalStorage`, Python `contextvars`, MDC in JVM), and have the logger attach it automatically to *every* line in that request — never pass it by hand. Propagate the same ID on outbound calls (set the header) so downstream services log it too. Reconstructing a flow then becomes `correlation_id = "abc123"` across all services.
+3. **Define a level policy and enforce what each level means.** ERROR = something failed and a human needs to act or be alerted (unhandled exception, failed write, breached invariant) — never use it for expected conditions like a 404 or a validation rejection. WARN = suspicious but handled (retry succeeded, fell back, approaching a limit). INFO = key business events worth keeping in production (request completed, order placed, job finished). DEBUG = developer detail (intermediate values, branch taken), off in production. Write the policy down with one concrete example per level so reviewers can reject a misused level.
+4. **Make the level runtime-configurable.** Read the threshold from an env var or config (`LOG_LEVEL=debug`) so you can raise verbosity for an incident without a redeploy, and run production at INFO. Where the logger supports it, allow per-module overrides (e.g. DEBUG for one noisy package) so you can zoom in without drowning in unrelated DEBUG output.
+5. **Attach context as fields, never by string concatenation.** User, tenant, resource, and operation IDs are structured fields (`user_id`, `tenant_id`, `order_id`, `operation`), not substrings of `message`. Bind request-scoped context once (a child/bound logger carrying `tenant_id` and `correlation_id`) so every line in that scope inherits it without repeating it. This is what makes `tenant_id = "acme" AND level = "ERROR"` a one-line query.
+6. **Redact secrets and PII at the logging boundary.** Maintain a deny-list of field names (`password`, `token`, `authorization`, `secret`, `api_key`, `ssn`, `card`, `cookie`, `set-cookie`) and a redaction hook in the logger that masks them *before serialization*, regardless of which call site logs them — do not rely on every developer remembering. Never log full request/response bodies or raw headers; log a content length, a hash, or an explicit allow-list of safe fields instead.
+7. **Rewrite representative statements as before/after.** Pick the highest-traffic and highest-value sites — a request handler, an error path, an external-call wrapper — and rewrite each from string log to structured log so the team copies a real pattern, not a doc.
+> [!WARNING]
+> Logging a secret, token, or PII field is a breach the moment it lands in your log store — logs are widely replicated, retained, and read by people who'd never get database access. Redact at the boundary (step 6); do not trust call sites to remember.
+> [!WARNING]
+> Unbounded high-cardinality fields (raw URLs with query strings, full user-agent strings, per-request UUIDs as *indexed* fields) explode log-store cost and index size. Keep correlation IDs as plain fields, bucket or template high-cardinality values (`route_template = "/users/:id"`, not the literal path), and never put unbounded free text in a field your backend indexes.
+> [!WARNING]
+> A log call in a hot loop or per-row path can dominate latency — serialization, redaction, and I/O are not free. Guard DEBUG with the level check so it's skipped (not just discarded) in production, log aggregates instead of per-iteration lines, and sample very-high-frequency events rather than logging every one.
+## Output
+- **Log schema** — the required fields (`timestamp`, `level`, `message`, `service`, `correlation_id`) and the standard contextual fields (`user_id`, `tenant_id`, request/resource IDs) with types and an example record.
+- **Correlation-ID propagation** — where the ID is created/read, how it's stored (context-local), how it's auto-attached to every line, and how it's propagated on outbound calls.
+- **Level policy** — the meaning of ERROR/WARN/INFO/DEBUG with one concrete example each, plus the runtime config knob (`LOG_LEVEL`) and any per-module override.
+- **Redaction rules** — the field deny-list, the boundary hook that applies it, and the body/header policy.
+- **Before/after diffs** — representative log statements rewritten from string to structured, ready to copy across the codebase.
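
Reviewer's sketch (not part of the package): the schema, correlation-ID, level-config, and redaction steps in the skill above could look roughly like this with Python's standard `logging` and `contextvars` modules. The names `correlation_id`, `DENY_LIST`, `JsonFormatter`, `build_logger`, and the `extra={"fields": ...}` convention are assumptions made for illustration; the skill itself does not prescribe a library or API.

```python
import contextvars
import io
import json
import logging
import os
from datetime import datetime, timezone

# All names below are illustrative assumptions, not taken from the package.
correlation_id = contextvars.ContextVar("correlation_id", default="-")
DENY_LIST = {"password", "token", "authorization", "secret", "api_key",
             "ssn", "card", "cookie", "set-cookie"}


class JsonFormatter(logging.Formatter):
    """Serialize each record as one JSON object carrying the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),  # keep this a constant string at call sites
            "service": self.service,
            "correlation_id": correlation_id.get(),  # attached automatically, never passed by hand
        }
        # Context arrives via extra={"fields": {...}}; deny-listed keys are
        # masked here, before serialization, regardless of the call site.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in DENY_LIST else value
        return json.dumps(payload)


def build_logger(service: str, stream, level: str = "") -> logging.Logger:
    """Return a logger writing JSON lines; the threshold falls back to LOG_LEVEL, then INFO."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel((level or os.environ.get("LOG_LEVEL", "INFO")).upper())
    logger.propagate = False
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger("checkout", out)
    correlation_id.set("abc123")  # normally set once in request middleware
    # Before: log.info(f"user {user_id} failed to checkout: {reason}")
    log.info("checkout failed", extra={"fields": {"user_id": 42, "reason": "card_declined"}})
    print(out.getvalue(), end="")
```

This sketch only masks top-level field names; nested payloads would need a recursive walk, and the declined card is logged at INFO because the level policy reserves ERROR for failures a human must act on.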

package/content/skills/threat-model-builder.md ADDED

@@ -0,0 +1,46 @@
+---
+name: "threat-model-builder"
+description: "Build a practical threat model for a feature or system using STRIDE — diagram the data flow, mark trust boundaries, enumerate concrete threats where data crosses them, and prioritize by likelihood × impact so security is reasoned about before shipping instead of bolted on after. Use when designing a feature that touches auth, money, or sensitive data, running a security design review, or hardening before a launch."
+allowed-tools: "Read, Grep, Glob"
+version: 1.0.0
+---
+Most threat models fail in one of two ways: they list every conceivable attack until the team is paralyzed and ships nothing, or they skip straight to "we use HTTPS and JWTs" without ever asking where an attacker actually sits. This skill does neither. It forces a data-flow diagram with explicit **trust boundaries**, walks **STRIDE** only where data crosses those boundaries, and ranks every threat by **likelihood × impact** so you mitigate the handful that matter and consciously accept the rest. The output is a diagram, a threat table, and a signed-off list of residual risk — not a vibe.
+## When to use this skill
+- Designing a feature that handles authentication, authorization, money/payments, PII, or anything multi-tenant.
+- Running a security design review before a launch, or as a gate on a new external-facing endpoint or integration.
+- A pen-test or incident keeps surfacing the same class of bug and you need to find the whole class, not the one instance.
+- Adding a new external entity (third-party webhook, partner API, file upload from users) that data now flows to or from.
+## Instructions
+1. **Draw the system as a data-flow first — not a box diagram.** Identify four element types: **external entities** (users, partner services, the browser), **processes** (your API, workers, lambdas), **data stores** (DB, cache, queue, blob storage), and the **data flows** (arrows) between them, each labeled with what it carries (`login creds`, `session token`, `tenant_id`, `payment amount`). Express it as Mermaid `flowchart`. If you cannot name what flows on an arrow, you do not understand the system well enough to model it yet.
+2. **Mark trust boundaries — these are the whole point.** A trust boundary is any place data crosses from less-trusted to more-trusted, or between principals that should not see each other's data: internet → your API, your API → DB, unauthenticated → authenticated, tenant A → tenant B, your code → a third-party SDK, user input → a SQL/shell/template interpreter. Draw them as dashed `subgraph` borders. Number each crossing — those numbers are the only places you do STRIDE.
+3. **Walk STRIDE at each boundary crossing, and enumerate CONCRETE threats.** For each element or flow that crosses a boundary, ask all six and write down the specific attack, not the category:
+   - **S — Spoofing** (authentication): "Attacker replays a captured session cookie because tokens have no `exp`." Not "ensure authentication."
+   - **T — Tampering** (integrity): "Client sends `amount: -100` and the refund flow trusts it."
+   - **R — Repudiation** (audit): "A user disputes a transfer and there is no signed, append-only log tying the action to their identity."
+   - **I — Information disclosure** (confidentiality): "`GET /users/:id` returns any id without an ownership check — IDOR across tenants."
+   - **D — Denial of service** (availability): "The export endpoint runs an unbounded query; one request pins the DB."
+   - **E — Elevation of privilege** (authorization): "A `viewer` role can call the admin mutation because authz is checked in the UI, not the API."
+4. **Rate each threat by likelihood × impact and SORT — you cannot fix everything.** Score likelihood (how reachable/easy: High/Med/Low) and impact (blast radius if it lands: High/Med/Low) independently, then derive priority (e.g. High×High = P0, anything with a Low dimension = P3). Sort the table by priority. An unprioritized list of forty threats gets you forty half-done mitigations; a ranked list gets the top five done properly.
+5. **Propose a specific, falsifiable mitigation per high-priority threat.** "Validate input" is not a mitigation. "Reject `amount` server-side unless it matches the stored invoice total; add a contract test" is. Tie each mitigation to where it lives (which middleware, which check, which migration) so it becomes a work item, not an aspiration.
+6. **Write down the residual risk you are accepting — explicitly.** For every threat you are NOT mitigating now, record it as *accepted* (with a one-line rationale and who accepted it) or *deferred* (with the trigger that reopens it: "revisit when we add a second tenant"). Silent acceptance is how a known risk becomes a postmortem line item.
+> [!WARNING]
+> A threat model with no trust boundaries marked is just a feature list with extra steps. The boundaries are where the threats live — if you skipped step 2, the STRIDE walk in step 3 has nothing to anchor to and will produce generic mush.
+> [!NOTE]
+> The two highest-yield STRIDE letters for typical web/SaaS features are **E (authz)** and **I (info disclosure)** — IDOR and missing object-level authorization cause more real breaches than exotic crypto failures. If time is short, do those two at every boundary first.
+## Output
+A self-contained threat model document with three parts:
+1. **Data-flow diagram** (Mermaid `flowchart`) with external entities, processes, stores, labeled flows, and dashed trust-boundary subgraphs with numbered crossings.
+2. **STRIDE threat table**, sorted by priority:
+   | # | Threat (concrete) | Element / flow | STRIDE | Likelihood | Impact | Priority | Mitigation (specific, where it lives) |
+   |---|-------------------|----------------|--------|-----------|--------|----------|----------------------------------------|
+   | 1 | `GET /users/:id` returns any tenant's record | API → DB read | I / E | High | High | P0 | Add ownership check in `requireOwner` middleware; contract test for cross-tenant 403 |
+3. **Accepted residual risks** — a short list of threats not being mitigated now, each tagged *accepted* (rationale + owner) or *deferred* (reopen trigger), so the decision is on the record rather than implied.
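
Reviewer's sketch (not part of the package): the likelihood × impact ranking in step 4 of the skill above can be made mechanical. The skill only fixes two cells of the matrix (High×High = P0, anything with a Low dimension = P3); the P1 and P2 cells below, along with the `Threat` and `ranked` names, are assumptions chosen for illustration.

```python
from dataclasses import dataclass

LEVELS = {"High", "Med", "Low"}


def priority(likelihood: str, impact: str) -> str:
    """Derive a priority bucket from two independently scored dimensions."""
    if likelihood not in LEVELS or impact not in LEVELS:
        raise ValueError("likelihood and impact must each be High, Med, or Low")
    pair = {likelihood, impact}
    if "Low" in pair:
        return "P3"  # from the skill: anything with a Low dimension
    if pair == {"High"}:
        return "P0"  # from the skill: High x High
    if pair == {"High", "Med"}:
        return "P1"  # assumption: one High, one Med
    return "P2"      # assumption: Med x Med


@dataclass(frozen=True)
class Threat:
    description: str
    stride: str
    likelihood: str
    impact: str

    @property
    def priority(self) -> str:
        return priority(self.likelihood, self.impact)


def ranked(threats: list[Threat]) -> list[Threat]:
    """Sort P0 first; sorted() is stable, so ties keep their input order."""
    return sorted(threats, key=lambda t: t.priority)


if __name__ == "__main__":
    table = ranked([
        Threat("Export endpoint runs an unbounded query", "D", "Med", "Med"),
        Threat("GET /users/:id returns any tenant's record", "I / E", "High", "High"),
        Threat("No signed audit log for transfers", "R", "Low", "High"),
    ])
    for threat in table:
        print(threat.priority, threat.stride, threat.description)
```

Sorting on the bucket label works because the labels `P0` to `P3` order lexicographically; a team with more buckets than ten would want a numeric key instead.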

package/content/skills/token-usage-profiler.md ADDED

@@ -0,0 +1,39 @@
+---
+name: "token-usage-profiler"
+description: "Measure and attribute LLM token usage and cost across an app — input vs output tokens by feature, route, model, and tenant — then rank the waste and the specific lever to cut it. Use when LLM spend is high or climbing with no clear cause, before scaling a feature that calls a model, or when you need per-feature or per-tenant cost attribution for billing or budgets."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+An LLM bill arrives as one number, and that number tells you nothing about what to fix. The waste is almost never spread evenly — a couple of bloated prompts, one feature that streams paragraphs where a sentence would do, or a single noisy tenant usually drive most of the spend. This skill turns the total into an attributed, ranked profile: it instruments every model call to record input vs output tokens, the model, and a feature/route/tenant **tag**, breaks cost down by that tag, and hands you the dominant drivers each paired with the specific lever that cuts it.
+## When to use this skill
+- The LLM bill is high or rising and nobody can say which feature or tenant is responsible.
+- You're about to scale a model-backed feature and want to know its true per-call and aggregate cost first.
+- You need per-feature or per-tenant cost attribution for internal budgets, chargeback, or usage-based pricing.
+- A verbose feature or a stuffed context window is suspected, but you have no measurement to confirm it.
+- A cost regression slipped in — spend jumped after a deploy — and you need to localize it to a call site.
+## Instructions
+1. **Add the tag before measuring anything — attribution is impossible without it.** At every model call site, capture: model id, input (prompt) tokens, output (completion) tokens, and a stable `tag` identifying the *feature/route* (e.g. `summarize-thread`, `support-reply`) plus `tenant`/`user` where billing matters. Pull token counts from the provider's `usage` object on the response, not a local tokenizer — the provider reflects system prompts, tool schemas, and cache discounts. `grep` the codebase for call sites first (`Grep` for the SDK call, e.g. `messages.create`, `chat.completions`, `generateText`) so no path is missed; a single untagged call site becomes an "unattributed" bucket that hides waste.
+2. **Compute cost, don't count tokens.** Map each `(model, input|output)` pair to its price and compute `cost = tokens × price_per_token`, keeping input and output as separate columns. Sum over a representative window (e.g. 7 days, or one full traffic cycle). Tokens alone mislead because input and output, and cheap vs frontier models, have wildly different unit prices.
+3. **Break spend down by tag and sort by total cost.** Produce a table: tag × model × {input cost, output cost, calls, avg tokens/call}. Sort descending by total cost. Expect a Pareto shape — the top 2–4 tags usually own the majority of spend. Optimize those; ignore the long tail.
+4. **Separate per-call cost from volume — they need different fixes.** For each top tag, look at *both* cost-per-call and call count. An expensive call made rarely and a cheap call made a million times can carry the same total; the first is fixed by trimming the prompt/output, the second by caching, dedup, or not calling at all. Flag which axis dominates each driver.
+5. **For each driver, attack the levers in this order (cheapest win first):**
+   - **Trim bloated input.** Remove dead boilerplate from system prompts, stop stuffing whole documents/full chat history when a retrieved snippet or rolling summary suffices, and drop unused tool schemas. This is usually the largest, lowest-risk reduction.
+   - **Cap or shorten output.** Set `max_tokens` to the real need, ask for terse/structured output, and avoid "explain your reasoning" in production paths where it isn't consumed. Because output is the pricier axis, shaving it often beats prompt trimming on cost.
+   - **Downshift the model.** Route easy calls (classification, extraction, short rewrites) to a smaller/cheaper model and reserve the frontier model for genuinely hard ones. Gate the route on a measurable signal, not a guess, and confirm quality holds with an eval set before shipping.
+   - **Cache repeated stable prefixes.** Where a long system prompt or document prefix is reused across calls, enable prompt/KV caching so the stable part is billed at the discounted cached rate. Order the prompt so the stable prefix comes first; volatile content last.
+6. **Set per-feature budgets and alerts.** Record each top tag's current cost/call and cost/day as a baseline, then add an alert that fires when either exceeds a threshold (e.g. +30%). Treat a token-usage spike like any other regression — caught at deploy, not at the invoice.
+> [!WARNING]
+> You cannot optimize what you can't attribute. Without per-feature/per-tenant tags, the "profile" is just a grand total — you'll guess which prompt to cut and likely guess wrong. Add the tag and re-collect before doing any optimization work.
+> [!NOTE]
+> Output tokens usually cost several times more per token than input tokens, so a verbose model response — not a long prompt — is frequently the real cost driver. Always inspect avg *output* tokens/call on your top tags before assuming the prompt is to blame.
+## Output
+- **Instrumentation/tagging plan** — the list of call sites found, and for each the tag (feature/route + tenant) and the input/output/model fields to record, sourced from the provider `usage` object.
+- **Spend breakdown** — a table of tag × model with separate input-cost and output-cost columns (`cost = tokens × price`), calls, and avg tokens/call, sorted by total cost, with an "unattributed" row if any call site is still untagged.
+- **Ranked waste** — the dominant drivers in order, each labeled by axis (per-call cost vs volume) and assigned its specific lever (trim context / cap output / downshift model / cache prefix) with the expected reduction.
+- **Budgets & alerts** — baseline cost/call and cost/day per top tag plus the threshold alert to add, so future regressions are caught automatically.
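
Reviewer's sketch (not part of the package): steps 2 and 3 of the skill above (`cost = tokens × price`, broken down by tag and sorted by total) might be computed along these lines. The model names and per-million-token prices in `PRICES` are made-up placeholders, and the `Call` record stands in for whatever the instrumentation captures from the provider's `usage` object.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute real provider rates.
PRICES = {
    "small-model": {"input": 1.0, "output": 4.0},
    "large-model": {"input": 10.0, "output": 40.0},
}


@dataclass(frozen=True)
class Call:
    tag: str            # feature/route tag; empty string means the call site is untagged
    model: str
    input_tokens: int   # as reported by the provider, not a local tokenizer
    output_tokens: int


def breakdown(calls: list[Call]) -> list[dict]:
    """Aggregate cost per (tag, model) and sort descending by total cost."""
    rows = defaultdict(lambda: {"input_cost": 0.0, "output_cost": 0.0,
                                "calls": 0, "tokens": 0})
    for call in calls:
        price = PRICES[call.model]
        row = rows[(call.tag or "unattributed", call.model)]
        row["input_cost"] += call.input_tokens * price["input"] / 1_000_000
        row["output_cost"] += call.output_tokens * price["output"] / 1_000_000
        row["calls"] += 1
        row["tokens"] += call.input_tokens + call.output_tokens
    table = [
        {
            "tag": tag,
            "model": model,
            "input_cost": row["input_cost"],
            "output_cost": row["output_cost"],
            "total_cost": row["input_cost"] + row["output_cost"],
            "calls": row["calls"],
            "avg_tokens_per_call": row["tokens"] / row["calls"],
        }
        for (tag, model), row in rows.items()
    ]
    return sorted(table, key=lambda r: r["total_cost"], reverse=True)


if __name__ == "__main__":
    sample = [
        Call("summarize-thread", "large-model", 1_000_000, 500_000),
        Call("support-reply", "small-model", 2_000_000, 1_000_000),
        Call("", "small-model", 1_000_000, 0),
    ]
    for row in breakdown(sample):
        print(row)
```

Keeping input and output cost as separate columns lets the per-call versus volume split in step 4 be read straight off the table; cached-token discounts are ignored here and would need their own price entry.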

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "agentscamp",
-  "version": "0.3.0",
+  "version": "0.5.0",
   "description": "Install AgentsCamp agents, skills, and slash commands into Claude Code from your terminal.",
   "license": "MIT",
   "type": "module",