agentscamp 0.1.0

Files changed (121)

package/LICENSE +21 -0
package/README.md +64 -0
package/content/agents/accessibility-auditor.md +66 -0
package/content/agents/agent-architect.md +65 -0
package/content/agents/agent-reliability-reviewer.md +40 -0
package/content/agents/agent-tool-integration-engineer.md +38 -0
package/content/agents/api-architect.md +84 -0
package/content/agents/backend-developer.md +92 -0
package/content/agents/browser-agent-engineer.md +37 -0
package/content/agents/cloud-architect.md +72 -0
package/content/agents/code-reviewer.md +69 -0
package/content/agents/data-engineer.md +67 -0
package/content/agents/data-scientist.md +79 -0
package/content/agents/debugger.md +89 -0
package/content/agents/dependency-manager.md +64 -0
package/content/agents/devops-engineer.md +94 -0
package/content/agents/documentation-engineer.md +52 -0
package/content/agents/finetuning-engineer.md +43 -0
package/content/agents/frontend-developer.md +78 -0
package/content/agents/git-github-expert.md +66 -0
package/content/agents/golang-pro.md +72 -0
package/content/agents/graphql-architect.md +85 -0
package/content/agents/kubernetes-specialist.md +87 -0
package/content/agents/llm-cost-optimizer.md +39 -0
package/content/agents/llm-evaluation-engineer.md +42 -0
package/content/agents/llm-inference-engineer.md +42 -0
package/content/agents/llm-integration-engineer.md +39 -0
package/content/agents/llm-observability-engineer.md +41 -0
package/content/agents/mcp-server-engineer.md +43 -0
package/content/agents/ml-engineer.md +67 -0
package/content/agents/mobile-developer.md +89 -0
package/content/agents/performance-engineer.md +79 -0
package/content/agents/postgres-migration-engineer.md +42 -0
package/content/agents/prompt-engineer.md +58 -0
package/content/agents/prompt-injection-auditor.md +42 -0
package/content/agents/python-pro.md +77 -0
package/content/agents/rag-pipeline-engineer.md +42 -0
package/content/agents/react-specialist.md +83 -0
package/content/agents/refactoring-specialist.md +78 -0
package/content/agents/retrieval-engineer.md +41 -0
package/content/agents/rust-pro.md +89 -0
package/content/agents/security-auditor.md +78 -0
package/content/agents/sql-pro.md +53 -0
package/content/agents/sre-engineer.md +66 -0
package/content/agents/system-architect.md +77 -0
package/content/agents/terraform-specialist.md +73 -0
package/content/agents/test-engineer.md +79 -0
package/content/agents/typescript-pro.md +82 -0
package/content/agents/vector-search-engineer.md +43 -0
package/content/agents/voice-agent-engineer.md +38 -0
package/content/agents/workflow-orchestrator.md +70 -0
package/content/commands/add-docstrings.md +92 -0
package/content/commands/add-human-approval.md +40 -0
package/content/commands/add-mcp-server.md +50 -0
package/content/commands/add-streaming-endpoint.md +34 -0
package/content/commands/benchmark-rerankers.md +44 -0
package/content/commands/breakdown-task.md +86 -0
package/content/commands/commit.md +117 -0
package/content/commands/create-pr.md +109 -0
package/content/commands/db-migrate.md +47 -0
package/content/commands/explain-code.md +71 -0
package/content/commands/explain-error.md +98 -0
package/content/commands/extract-function.md +107 -0
package/content/commands/find-bug.md +93 -0
package/content/commands/fix-failing-test.md +106 -0
package/content/commands/new-component.md +119 -0
package/content/commands/plan-feature.md +71 -0
package/content/commands/profile-postgres-queries.md +41 -0
package/content/commands/red-team-llm.md +45 -0
package/content/commands/refactor.md +82 -0
package/content/commands/review-pr.md +101 -0
package/content/commands/run-evals.md +34 -0
package/content/commands/scaffold-pgvector-schema.md +42 -0
package/content/commands/scaffold-vllm-config.md +44 -0
package/content/commands/security-scan.md +129 -0
package/content/commands/set-perf-budget.md +47 -0
package/content/commands/setup-claude-ci.md +60 -0
package/content/commands/sync-branch.md +138 -0
package/content/commands/update-readme.md +108 -0
package/content/commands/write-tests.md +81 -0
package/content/manifest.json +1709 -0
package/content/skills/adr-writer.md +90 -0
package/content/skills/branch-rebaser.md +86 -0
package/content/skills/bundle-analyzer.md +77 -0
package/content/skills/changelog-from-prs.md +81 -0
package/content/skills/chunking-strategy-optimizer.md +34 -0
package/content/skills/claude-settings-auditor.md +38 -0
package/content/skills/conventional-commits.md +80 -0
package/content/skills/coverage-gap-finder.md +72 -0
package/content/skills/dead-code-finder.md +65 -0
package/content/skills/dependency-audit.md +64 -0
package/content/skills/embedding-index-tuner.md +34 -0
package/content/skills/embedding-set-inspector.md +34 -0
package/content/skills/finetune-dataset-builder.md +33 -0
package/content/skills/graphrag-scaffolder.md +39 -0
package/content/skills/hook-writer.md +39 -0
package/content/skills/human-in-the-loop-gate.md +33 -0
package/content/skills/llm-as-judge-scorer.md +33 -0
package/content/skills/llm-eval-suite-scaffolder.md +30 -0
package/content/skills/llm-guardrails-designer.md +33 -0
package/content/skills/llm-output-schema-generator.md +32 -0
package/content/skills/mcp-server-scaffolder.md +33 -0
package/content/skills/mock-data-factory.md +75 -0
package/content/skills/multimodal-document-extractor.md +39 -0
package/content/skills/openapi-doc-writer.md +88 -0
package/content/skills/plugin-scaffolder.md +38 -0
package/content/skills/postgres-index-strategist.md +38 -0
package/content/skills/pr-description.md +87 -0
package/content/skills/prompt-cache-optimizer.md +34 -0
package/content/skills/prompt-optimizer.md +40 -0
package/content/skills/prompt-pii-redactor.md +33 -0
package/content/skills/provider-fallback-wrapper.md +33 -0
package/content/skills/qlora-finetune-runner.md +33 -0
package/content/skills/readme-generator.md +84 -0
package/content/skills/secret-scanner.md +65 -0
package/content/skills/sql-optimizer.md +77 -0
package/content/skills/test-scaffolder.md +74 -0
package/content/skills/tool-definition-generator.md +33 -0
package/content/skills/web-research-pipeline.md +39 -0
package/dist/index.js +384 -0
package/package.json +44 -0

package/content/skills/dependency-audit.md ADDED

@@ -0,0 +1,64 @@
+---
+name: "dependency-audit"
+description: "Audit project dependencies for known vulnerabilities and turn the raw scanner output into a triaged, prioritized upgrade plan. Use when an audit is noisy, a CVE was reported, or you need to know which advisories actually matter."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+Run the ecosystem's vulnerability audit, then do the part the scanner won't: separate exploitable, reachable advisories from transitive noise and propose the minimal upgrade that closes the real risk. The skill reads the actual lockfile, runs the native audit tool, traces each flagged package to how it's used in the codebase, and rewrites the severity in context — so a critical-rated advisory in a build-only dependency you never call doesn't outrank a moderate one on the request path.
+## When to use this skill
+- An audit (`npm audit`, `pip-audit`, `cargo audit`, …) prints a wall of advisories and you need to know which ones to act on first.
+- A specific CVE or GitHub advisory landed and you want to confirm whether your usage is actually reachable.
+- You want the smallest safe set of version bumps — not a blanket `npm audit fix --force` that breaks the build.
+- A security gate is failing CI and you need to justify a documented downgrade or suppression.
+> [!WARNING]
+> A vulnerability's CVSS score rates the flaw in the abstract, not your exposure to it. Never act on severity alone — an unreachable "critical" is lower priority than a reachable "moderate" on your request path. This skill exists to make that distinction explicit.
+## Instructions
+1. **Locate the manifest and lockfile.** Find the dependency files (`package.json` + `package-lock.json`/`pnpm-lock.yaml`/`yarn.lock`, `requirements.txt`/`poetry.lock`/`Pipfile.lock`, `Cargo.lock`, `go.mod`/`go.sum`, `Gemfile.lock`). The lockfile is the source of truth for resolved versions — audit that, not the loose ranges in the manifest.
+2. **Detect the audit tool — do not guess.** Match the ecosystem and run its native auditor: `npm audit --json` (or `pnpm audit --json` / `yarn npm audit`), `pip-audit -r requirements.txt -f json` (or `poetry`/`uv` equivalents), `cargo audit --json`, `govulncheck ./...`, `bundle audit`. Prefer the JSON output so you can parse advisories programmatically.
+3. **Classify each advisory by reachability.** For every flagged package, determine: is it a **direct** or **transitive** dependency? Is it a runtime, dev, build, or test-only dependency? Then `grep`/`Glob` the codebase for actual imports and calls of the vulnerable API. A package present in the tree but never imported — or imported only in tooling that never runs in production — is **not reachable** and should be downgraded in priority.
+4. **Rewrite severity in context.** State the original score, then the *contextual* priority with a one-line reason: the affected code path, whether attacker-controlled input can reach it, and the deployment surface (public endpoint vs. local CLI vs. CI-only). For Go modules, `govulncheck` does call-graph reachability natively, so trust its result over a flat advisory list such as `npm audit` output.
+5. **Compute the minimal safe upgrade.** For each issue worth fixing, find the lowest patched version that resolves it. Prefer in-range patch/minor bumps; flag major bumps and transitive-only fixes (which may need an `overrides`/`resolutions` pin or a dependency-tree update) separately as higher-effort. Never blanket-run `--force` fixes.
+6. **Verify the fix.** Apply the proposed bumps in a scratch step, re-run the audit, and run the build/test command to confirm nothing broke (`npm ci && npm test`, `pytest`, `cargo build`, …). Re-running the auditor must show the targeted advisories cleared.
+7. **Report and flag gaps.** Produce a triaged summary: **act now** (reachable, fixable), **monitor** (unreachable or no patch yet), and **suppressed** (false positive / accepted risk, with reason). Call out any advisory with no fix available and any transitive issue you couldn't resolve without a major upgrade.
+> [!TIP]
+> If an advisory is genuinely not applicable, record it in the tool's ignore file (`.npmrc` audit overrides, `pip-audit --ignore-vuln`, `cargo audit`'s `audit.toml`, `.trivyignore`) **with a dated justification comment** — don't silently suppress it, and don't leave it failing CI for the next person to re-triage.
+## Examples
+Input — raw `npm audit` reports two advisories at face value:
+```
+# npm audit report
+minimatch  <3.0.5   high     ReDoS via brace expansion   (transitive, via glob → eslint)
+axios      <1.6.0   moderate XSRF-TOKEN leak to cross-origin hosts  (direct, used in src/api/client.ts)
+```
+After tracing usage, the triaged summary downgrades the unreachable one and prioritizes the reachable one:
+```
+Dependency audit: 2 advisories, 1 actionable
+[ACT NOW]  axios  0.27.2 → 1.6.0   (moderate, contextually HIGH)
+  CSRF / XSRF-TOKEN leak to cross-origin hosts (CVE-2023-45857). axios is
+  on the live request path in src/api/client.ts and forwards a user-supplied
+  `targetUrl`, so the XSRF-TOKEN cookie can leak to attacker-controlled hosts.
+  Major bump (0.x → 1.x), so higher-effort; no breaking API changes used.
+[MONITOR]  minimatch  3.0.4 → 3.0.5   (high, contextually LOW)
+  ReDoS via brace expansion. Pulled in transitively by eslint (dev only);
+  never bundled or executed in production, and no untrusted pattern reaches
+  it. Patched by `npm dedupe` or an override — fix opportunistically, not
+  blocking. Original "high" score reflects the flaw, not our exposure.
+Verification: applied axios bump, `npm ci && npm test` green,
+re-ran `npm audit` → axios advisory cleared.
+Gap: minimatch fix requires an eslint transitive bump; left for the next
+dep-update PR.
+```
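
The triage rule in the dependency-audit skill above (reachable and fixable goes to "act now", everything else to "monitor") can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the `Advisory` record, its field names, and the `triage` function are hypothetical, and real scanner JSON has a different shape per tool.

```python
from dataclasses import dataclass


# Hypothetical record shape for illustration; real scanner JSON differs per tool.
@dataclass
class Advisory:
    package: str
    severity: str            # scanner rating: "low", "moderate", "high", "critical"
    scope: str               # "runtime", "dev", "build", or "test"
    used_in_prod_code: bool  # outcome of searching production code for the vulnerable API
    patch_available: bool


def triage(adv: Advisory) -> str:
    """Bucket an advisory by reachability and fixability, not by severity alone."""
    reachable = adv.scope == "runtime" and adv.used_in_prod_code
    if reachable and adv.patch_available:
        return "ACT NOW"
    return "MONITOR"  # unreachable, or reachable but no patch yet


advisories = [
    Advisory("minimatch", "high", scope="dev", used_in_prod_code=False, patch_available=True),
    Advisory("axios", "moderate", scope="runtime", used_in_prod_code=True, patch_available=True),
]

report = {adv.package: triage(adv) for adv in advisories}
print(report)
```

The scanner's severity is carried along for the report but deliberately plays no part in the bucket decision, which mirrors the skill's point that a score rates the flaw rather than the exposure.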

package/content/skills/embedding-index-tuner.md ADDED

@@ -0,0 +1,34 @@
+---
+name: "embedding-index-tuner"
+description: "Tune a vector index — HNSW graph parameters and quantization — to hit a recall target at the lowest latency and memory, by sweeping settings against a fixed query set instead of trusting defaults. Use when vector search is slow or memory-hungry, when recall dropped after enabling quantization, or when standing up an index and you need defensible parameters."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+A vector index has knobs, and the defaults are a guess about a workload that isn't yours. HNSW graph parameters and quantization each trade **recall** against **latency** and **memory** — and the recall loss is invisible unless you measure it. This skill replaces "enable quantization and hope" with a sweep: hold the data and queries fixed, vary the index settings, and pick the configuration that hits your recall target at the lowest cost.
+## When to use this skill
+- Vector search is too slow (p95 latency over budget) or uses too much memory, and you want to know which parameter to move.
+- Recall dropped after you enabled quantization or lowered an HNSW search parameter, and you need to find a safe setting.
+- Standing up a new index and you want defensible parameters instead of copy-pasted defaults.
+- Validating that an index still meets its recall target after a corpus or embedding-model change.
+## Instructions
+1. **Fix the ground truth.** Use a labeled query set (20–50+ queries with known-relevant document IDs). For an approximate index, the gold standard is the **exact** (brute-force / flat) nearest neighbours for each query — compute them once so you can measure recall *of the approximate index against exact search*, not just against human labels.
+2. **State the budget.** Write down the recall target (e.g. recall@10 ≥ 0.95), the p95 latency ceiling, and the memory/storage ceiling. The sweep optimizes cost subject to these.
+3. **Sweep HNSW build/search parameters.** Vary `m` and `ef_construction` (build-time: higher = better recall, more memory and slower builds) and `ef_search` / `hnsw.ef_search` (query-time: higher = better recall, slower queries). Query-time parameters are cheap to sweep because they don't require a rebuild — sweep those first.
+4. **Sweep quantization.** Test scalar, product, and binary quantization (and binary + rescoring where supported). Each shrinks memory and speeds search at some recall cost; measure the cost rather than assuming it's acceptable.
+5. **Measure each configuration the same way.** For every setting, record recall@k (vs. exact neighbours), p95 query latency, and index memory/size. Hold the embedding model, data, and query set constant so the index is the only variable.
+6. **Recommend the cheapest config that clears the bar.** Report the full sweep as a table and pick the lowest-latency/lowest-memory setting that still hits the recall target. Note the trade-off explicitly (e.g. "binary quantization with rescoring: recall@10 0.96, p95 −60%, memory −75%").
+> [!WARNING]
+> "Search still returns results" is not a recall measurement. Quantization and low `ef_search` can quietly drop the right document from the top-k while still returning *something* plausible. Always measure recall against exact neighbours before shipping a down-tuned index.
+> [!NOTE]
+> Query-time parameters (`ef_search`) tune without a rebuild — sweep them first and you may hit your latency budget without touching build parameters or quantization at all. Build-time parameters (`m`, `ef_construction`) and quantization mode require re-indexing, so change them deliberately.
+## Output
+A sweep table (configuration → recall@k, p95 latency, memory), the recommended configuration with its rationale, and the exact index-definition change (DDL or client call) to apply it — reproducible, so the next corpus change can re-run the same sweep.
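
The two calculations at the heart of the embedding-index-tuner skill above, recall@k against exact neighbours and "cheapest configuration that clears the bar", might look like the sketch below. The function names and the dictionary keys are assumptions made for illustration, and the sweep rows are placeholder numbers, not measurements from any real index.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbours that the approximate index also returned."""
    exact = set(exact_ids[:k])
    return len(exact & set(approx_ids[:k])) / len(exact)


def pick_config(sweep, recall_target, p95_budget_ms):
    """Lowest-latency (then lowest-memory) config meeting both limits, or None."""
    passing = [
        cfg for cfg in sweep
        if cfg["recall"] >= recall_target and cfg["p95_ms"] <= p95_budget_ms
    ]
    return min(passing, key=lambda cfg: (cfg["p95_ms"], cfg["memory_mb"]), default=None)


# Placeholder sweep results; a real run would fill these in from measurements.
sweep = [
    {"name": "ef_search=40", "recall": 0.91, "p95_ms": 4.0, "memory_mb": 800},
    {"name": "ef_search=100", "recall": 0.96, "p95_ms": 7.0, "memory_mb": 800},
    {"name": "ef_search=200", "recall": 0.98, "p95_ms": 12.0, "memory_mb": 800},
]

best = pick_config(sweep, recall_target=0.95, p95_budget_ms=10.0)
print(best["name"] if best else "no configuration meets the budget")
```

Returning `None` when nothing passes keeps the "no configuration meets the budget" outcome explicit instead of silently falling back to the closest miss.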

package/content/skills/embedding-set-inspector.md ADDED

@@ -0,0 +1,34 @@
+---
+name: "embedding-set-inspector"
+description: "Diagnose the health of an embedding set before blaming the retriever — checking normalization, dimensionality, near-duplicates, degenerate vectors, and corpus/query distribution mismatch. Use when retrieval quality is poor, after a re-embed, or before shipping a new index."
+allowed-tools: "Read, Grep, Glob, Bash"
+version: 1.0.0
+---
+When retrieval is poor, teams reach for a bigger model or a reranker before checking whether the embeddings themselves are sound. This skill inspects an embedding set for the failure modes that quietly wreck recall, so you fix the cause instead of layering patches on top.
+## When to use this skill
+- Retrieval recall is low and you want to rule out the embeddings before tuning the retriever.
+- After re-embedding a corpus (new model, new chunking) and before promoting the index.
+- A subset of documents is "invisible" to search no matter the query.
+- Validating a freshly built index in CI before it ships.
+## Instructions
+1. **Confirm the basics.** Verify every vector has the **expected dimensionality** and that vectors are **normalized** if your distance metric assumes it (cosine vs. dot product vs. L2 mismatch is a classic silent bug). Flag any zero, NaN, or near-zero-norm vectors — usually empty or failed-to-embed chunks.
+2. **Check for asymmetry handling.** If the model supports input types (document vs. query), confirm documents were embedded as documents and queries as queries. Mixing them degrades retrieval and is easy to get wrong.
+3. **Profile the distribution.** Summarize pairwise similarity: if almost everything is highly similar to everything else, the embeddings are not discriminating (often over-large chunks or a domain mismatch). If a few very tight clusters dominate, check for duplicated or boilerplate content crowding the space.
+4. **Find near-duplicates.** Detect chunks whose embeddings are near-identical — repeated headers/footers, navigation, or licence text — which crowd out real answers in the top-k. Recommend dedup or metadata filtering.
+5. **Test query/document alignment.** Embed a handful of the eval queries and confirm their nearest neighbours are plausible. A systematic mismatch (queries land far from all documents) points to a model or input-type problem, not a tuning problem.
+6. **Report and recommend.** Summarize findings as `severity | issue | affected count | fix`, ordered by impact on retrieval.
+> [!NOTE]
+> Embeddings from different models are not comparable. Never mix vectors from two models in one index, and re-embed the whole corpus when you switch — see [Choosing Embeddings in 2026](/guides/concepts/choosing-embeddings-2026).
+> [!WARNING]
+> A normalization or distance-metric mismatch can make retrieval look "sort of working" while quietly tanking recall. Check it first — it is the single most common embedding bug.
+## Output
+A health report: dimensionality/normalization status, count of degenerate vectors, near-duplicate clusters, distribution summary, query-alignment spot checks, and a prioritized list of fixes.
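
The "confirm the basics" step of the embedding-set-inspector skill above can be illustrated with a standard-library check for wrong dimensionality, degenerate vectors (NaN or near-zero norm), and missing normalization. This is a sketch under assumptions: the function name, the report keys, and both tolerances are invented for the example, and a real corpus would normally be checked with a vectorized library instead of Python lists.

```python
import math


def inspect_vectors(vectors, expected_dim, norm_tol=1e-3, zero_tol=1e-6):
    """Return the indices of vectors that fail each basic health check."""
    report = {"wrong_dim": [], "degenerate": [], "not_normalized": []}
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            report["wrong_dim"].append(index)
            continue
        if any(math.isnan(x) for x in vector):
            report["degenerate"].append(index)
            continue
        norm = math.sqrt(sum(x * x for x in vector))
        if norm < zero_tol:
            report["degenerate"].append(index)      # empty or failed-to-embed chunk
        elif abs(norm - 1.0) > norm_tol:
            report["not_normalized"].append(index)  # matters if the metric assumes unit length
    return report


sample = [
    [0.6, 0.8],           # unit length
    [0.0, 0.0],           # zero vector
    [3.0, 4.0],           # norm 5, not normalized
    [1.0, 0.0, 0.0],      # wrong dimensionality
    [float("nan"), 1.0],  # NaN component
]
print(inspect_vectors(sample, expected_dim=2))
```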

package/content/skills/finetune-dataset-builder.md ADDED

@@ -0,0 +1,33 @@
+---
+name: "finetune-dataset-builder"
+description: "Turn raw examples into a training-ready fine-tuning dataset — normalize to the trainer's chat/instruction format, deduplicate (including near-duplicates), strip PII, balance, validate the schema and token lengths, and carve a leak-free eval split. Use when you have raw examples and need a clean, formatted, split dataset before training."
+allowed-tools: "Read, Grep, Glob, Bash, Write, Edit"
+version: 1.0.0
+---
+The dataset is the model — so this skill treats building it as the real work, not a preprocessing afterthought. It takes raw examples and produces a clean, correctly-formatted, deduplicated dataset with a leak-free eval split, ready to hand to a trainer. Get this right and the training run is mechanical; get it wrong and no amount of tuning saves the result.
+## When to use this skill
+- You have raw examples (logs, labeled pairs, exported conversations) and need them formatted, cleaned, and split before fine-tuning.
+- An existing dataset gave a disappointing fine-tune and you suspect duplicates, leakage, PII, or off-distribution noise.
+- Standing up a repeatable dataset pipeline so each fine-tune is reproducible.
+## Instructions
+1. **Fix the target format first.** Determine the trainer's expected schema (commonly JSONL chat records: system/user/assistant, or instruction-response) and confirm that it matches how the model is called in production. Normalize every example to that exact shape, because the training format must mirror the inference format.
+2. **Deduplicate, including near-duplicates.** Remove exact duplicates and fuzzy/near-duplicates (normalized text, embedding similarity). Near-duplicates are the main cause of memorization and the silent leak that inflates eval scores, so be aggressive here.
+3. **Clean and correct.** Fix label/answer errors, drop malformed records, normalize whitespace/formatting, and **strip PII and secrets**. A wrong target teaches the wrong thing; sensitive strings risk being memorized and regurgitated.
+4. **Balance and check coverage.** Make sure no single pattern or class dominates, and that the set covers the real input distribution including edge cases. Flag thin slices that may need real or validated synthetic examples (see [Preparing a Fine-Tuning Dataset](/guides/mlops/finetune-dataset-prep)).
+5. **Validate the schema and token lengths.** Confirm every record parses against the schema and fits within the model's context length; quarantine the ones that don't rather than silently truncating.
+6. **Carve a leak-free split.** Split into train/validation (and test) **by a stable key** (source document, entity, or user) so paraphrases of the same item can't land on both sides, and deduplicate *across* the split boundary. Report the split sizes and the dedup/cleaning counts so the dataset is auditable.
+> [!WARNING]
+> Split by a stable key, not by random row. Random splitting lets near-duplicates and paraphrases of the same underlying item appear in both train and eval — leakage that produces beautiful offline numbers and a model that fails in production.
+> [!TIP]
+> Version the output dataset (and record the cleaning/dedup counts and split keys). Reproducibility is what lets you attribute a fine-tune's quality to a specific dataset and iterate deliberately instead of guessing.
+## Output
+A training-ready dataset: normalized to the trainer's format, deduplicated and cleaned (with PII stripped), balanced, schema- and length-validated, and split by a stable key into leak-free train/validation/test files — plus a short report of record counts, duplicates removed, and split sizes so the dataset is auditable and reproducible.
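
One common way to realize the "split by a stable key" rule from the finetune-dataset-builder skill above is to hash the key and use the hash to pick the split, so every record sharing a key lands on the same side. The sketch below assumes a hypothetical `source_doc` field as the stable key and a two-way train/eval split; the function name and sample records are illustrative only.

```python
import hashlib


def assign_split(stable_key: str, eval_fraction: float = 0.1) -> str:
    """Map a stable key to 'train' or 'eval' deterministically via a hash."""
    digest = hashlib.sha256(stable_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


# Hypothetical records: two paraphrases share the same source document.
records = [
    {"source_doc": "faq-017", "text": "How do I reset my password?"},
    {"source_doc": "faq-017", "text": "What is the way to reset a password?"},
    {"source_doc": "faq-042", "text": "How do I close my account?"},
]

splits = {"train": [], "eval": []}
for record in records:
    splits[assign_split(record["source_doc"])].append(record)

print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the key, re-running the pipeline on a grown dataset keeps existing items in their original split, which helps with the reproducibility the skill asks for.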

package/content/skills/graphrag-scaffolder.md ADDED

@@ -0,0 +1,39 @@
+---
+name: "graphrag-scaffolder"
+description: "Stand up a GraphRAG experiment the disciplined way: audit whether your failed queries are actually connection-shaped, scope a minimal entity/relationship ontology, build extraction → graph → community-summary indexing on a corpus slice, and measure against vector-RAG baselines before committing. Use when multi-hop or whole-corpus questions keep failing plain RAG."
+allowed-tools: "Read, Grep, Glob, Write, Edit, Bash"
+version: 1.0.0
+---
+GraphRAG is the most oversold upgrade in retrieval — and genuinely transformative for the right query shapes. This skill keeps you on the right side of that line: it builds the smallest GraphRAG that could prove value on *your* failures, measures it against your existing pipeline, and prices the ongoing bill before you commit.
+## When to use this skill
+- Multi-hop questions ("how is A exposed to C through B?") keep failing your vector RAG and you suspect structure is the answer.
+- You need "global" answers over a whole corpus (themes, patterns, summaries) that top-k chunks structurally can't provide.
+- Someone said "let's add a knowledge graph" and you want evidence before infrastructure.
+## When NOT to use this skill
+- Your RAG failures are ranking problems (right doc exists, wrong position) — fix retrieval first: hybrid search and reranking are cheaper and usually sufficient.
+- The corpus churns rapidly — GraphRAG's re-extraction cost on updates may dominate; consider it only with an incremental-update plan.
+- You need agent memory with temporal structure rather than corpus QA — that's a memory platform (Zep/Graphiti), not corpus GraphRAG.
+## Instructions
+1. **Build the failure set first.** Collect 15–30 real queries the current pipeline fails, and classify each: lookup (vector should handle — fix retrieval instead), multi-hop (graph traversal candidate), or global (community-summary candidate). If multi-hop+global don't dominate, stop and say so — that's a successful outcome of this skill.
+2. **Scope the minimal ontology.** From the failure set, derive only the entity and relationship types those queries traverse (e.g. Company—supplies→Company, Service—depends-on→Service). Resist "extract everything": every extra type inflates extraction cost and noise.
+3. **Scaffold the pipeline on a slice.** Pick a representative 5–10% corpus slice. Build: an LLM extraction pass emitting entities/relations per the ontology (with source-chunk provenance), graph assembly with entity resolution (merge duplicates deliberately), community detection, and LLM-written community summaries at 1–2 levels. Choose storage by scale: in-memory/Parquet or Postgres first; a graph database only when scale demands it.
+4. **Wire the two query paths.** Local: resolve query entities → traverse 1–3 hops → collect connected evidence + provenance chunks → synthesize. Global: route corpus-level questions to community summaries. Keep the existing vector path alive — the end state is a router, not a replacement.
+5. **Measure against baseline.** Run the failure set through both pipelines; score answer quality (human or LLM-judge with a rubric) and report per-class lift: GraphRAG should win multi-hop/global decisively and roughly tie lookups. Include extraction cost actually incurred, extrapolated to full corpus, plus the per-update re-indexing estimate.
+6. **Recommend with the bill attached.** Ship the verdict: adopt (with the router architecture and update strategy), adopt-partially (graph for one domain), or don't (retrieval fixes suffice) — each with the evidence and the standing costs stated plainly.
+> [!WARNING]
+> Extraction quality is the whole game: a missed relationship is an unanswerable question, a hallucinated one is a wrong answer with confidence. Spot-check extractions against source text on every run, and keep provenance so any graph fact traces to its chunk.
+> [!TIP]
+> The slice-first discipline is the budget saver — full-corpus extraction before validation is how GraphRAG projects die. Prove lift on 10%, then spend.
+## Output
+A working GraphRAG experiment: the classified failure set, the scoped ontology, the pipeline code (extraction → graph → summaries → both query paths) on the corpus slice, the baseline-vs-graph evaluation with per-class results, full-corpus cost projections, and the adopt/partial/don't recommendation with its evidence.
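
The "local" query path in the graphrag-scaffolder skill above (resolve an entity, traverse a bounded number of hops, collect evidence with provenance) reduces to a breadth-first walk over extracted edges. The sketch below assumes edges are stored as hypothetical `(source, relation, target, chunk_id)` tuples in memory; entity resolution and answer synthesis are out of scope here.

```python
from collections import deque


def traverse(edges, start, max_hops=2):
    """Breadth-first walk from `start`; returns every edge reached within max_hops."""
    adjacency = {}
    for source, relation, target, chunk_id in edges:
        adjacency.setdefault(source, []).append((relation, target, chunk_id))

    seen, facts = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget spent; do not expand this node
        for relation, target, chunk_id in adjacency.get(node, []):
            facts.append((node, relation, target, chunk_id))  # chunk_id keeps provenance
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return facts


# Illustrative edges only; chunk ids are made up.
edges = [
    ("A", "supplies", "B", "chunk-1"),
    ("B", "supplies", "C", "chunk-7"),
    ("C", "supplies", "D", "chunk-9"),
]
print(traverse(edges, "A", max_hops=2))
```

Carrying the chunk id on every returned fact is what lets a synthesized answer be traced back to source text, as the skill's warning about extraction quality requires.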

package/content/skills/hook-writer.md ADDED

@@ -0,0 +1,39 @@
+---
+name: "hook-writer"
+description: "Turn a plain-language automation request — 'format every file Claude edits', 'block writes to migrations', 'notify me when input is needed' — into a working Claude Code hook: the right event, a safe tested script, and the settings.json registration at the right scope. Use when you want a hook but don't want to hand-write the matcher, stdin JSON parsing, and exit-code plumbing."
+allowed-tools: "Read, Grep, Glob, Write, Edit, Bash"
+version: 1.0.0
+---
+Give this skill a sentence like "run prettier on every file Claude edits" or "never let Claude touch `.env` files, and tell it why" and it produces the complete hook: event choice, matcher, hardened script, settings registration, and a verification step. Hooks are the right tool for rules that must hold every time — this skill removes the plumbing tax of writing one.
+## When to use this skill
+- You want an automatic action around Claude Code's lifecycle: format after edits, run affected tests, log tool calls, send a desktop notification, enforce a freeze window.
+- You want to **block** something deterministically: edits to protected paths, dangerous Bash patterns, prompts containing secrets.
+- You have a hook that misbehaves and want it diagnosed and rewritten safely.
+## When NOT to use this skill
+- The rule is a judgment call ("prefer small functions") — that's a CLAUDE.md instruction, not a hook. See [Claude Code Hooks](/guides/configuration/claude-code-hooks) for the dividing line.
+- You're gating *which tools may run at all* with static patterns — plain [permission rules](/guides/configuration/claude-code-settings-permissions) do that with zero code; hooks are for logic patterns can't express.
+- You're bundling hooks for distribution to other repos — write them here, then package with [plugin-scaffolder](/skills/workflow/plugin-scaffolder).
+## Instructions
+1. **Restate the automation as event + condition + action.** Name the lifecycle moment (before a tool call? after? on prompt submit? when Claude finishes?), the condition (which tools, paths, or patterns), and the action (run, block, notify, log). If the user's request maps to two events, prefer the narrower one — gate with `PreToolUse`, react with `PostToolUse`.
+2. **Choose blocking semantics deliberately.** Only some events can block (`PreToolUse`, `UserPromptSubmit`). For a blocking hook, decide fail-open vs fail-closed on script errors and say which you chose: formatters fail open (exit 0 on error), guardrails fail closed (exit 2 on any doubt). Blocking output goes to stderr so Claude learns *why* and adjusts.
+3. **Write the script defensively.** Read the event JSON from stdin (`jq -r '.tool_input.file_path // empty'`), quote every expansion, handle missing fields, and keep it fast — hooks run inline with the session. Place it at `.claude/hooks/<name>.sh` and make it executable. Never interpolate model-controlled values into shell unquoted.
+4. **Register it at the right scope.** Project-wide rules go in `.claude/settings.json` (team-shared); personal automation in `~/.claude/settings.json`; machine-local experiments in `.claude/settings.local.json`. Add the matcher (`Edit|Write`, `Bash`, `mcp__server__tool`, or `*`) and a `timeout`. Show the exact JSON block being added and merge it without clobbering existing hooks.
+5. **Verify it fires.** Trigger the event once (e.g. have Claude edit a scratch file) and confirm the effect; run `/hooks` to confirm registration and source file. For a blocking hook, also verify the *allowed* path still works — overblocking is the most common hook bug.
+6. **Hand over the off-switch.** Note how to disable it (remove the JSON block, or `"disableAllHooks": true` temporarily) and what its failure mode looks like in practice.
+> [!WARNING]
+> A hook is arbitrary code running with the user's credentials on every matching event. Keep secrets out of hook scripts, treat tool input as untrusted data, and never write a blocking hook whose error path silently allows what it was built to stop.
+> [!TIP]
+> When the user's rule is expressible as a permission pattern (`deny: ["Read(./.env)"]`), say so and offer the rule instead — fewer moving parts beats a script doing the same job.
+## Output
+The executable hook script at `.claude/hooks/<name>.sh`, the exact settings JSON registered (with scope stated and why), a one-command verification you ran or the user can run, and the disable/rollback instructions — everything needed to trust the hook or remove it.
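The fail-closed choice from step 2 can be sketched as a pure decision function. This is an illustrative TypeScript sketch, not the shell script the skill produces: `decideHook`, `HookDecision`, and `PROTECTED_PATTERNS` are hypothetical names, the protected paths are made up, and the exit-code convention (0 allows, 2 blocks with the reason on stderr) is taken from the steps above.

```typescript
// Hypothetical guard logic for a blocking hook that protects certain paths.
// A real hook would read the event JSON from stdin and exit with `exitCode`.
interface HookDecision {
  exitCode: 0 | 2; // 0 = allow, 2 = block
  stderr: string; // reason reported when blocking
}

// Assumed protected paths; replace with the project's real rules.
const PROTECTED_PATTERNS: RegExp[] = [/(^|\/)\.env(\.|$)/, /(^|\/)secrets\//];

function decideHook(rawEvent: string): HookDecision {
  let event: unknown;
  try {
    event = JSON.parse(rawEvent);
  } catch {
    // Fail closed: a guardrail that cannot read its input must not allow.
    return { exitCode: 2, stderr: "Blocked: hook could not parse event JSON." };
  }
  const input = (event as { tool_input?: { file_path?: unknown } } | null)?.tool_input;
  const filePath = input?.file_path;
  if (typeof filePath !== "string" || filePath.length === 0) {
    return { exitCode: 2, stderr: "Blocked: event has no tool_input.file_path." };
  }
  if (PROTECTED_PATTERNS.some((pattern) => pattern.test(filePath))) {
    return { exitCode: 2, stderr: `Blocked: ${filePath} is a protected path.` };
  }
  return { exitCode: 0, stderr: "" };
}
```

Blocking on a missing `file_path` only makes sense when the hook is registered with a matcher such as `Edit|Write`, where that field is expected; with a broader matcher the same check would overblock.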

package/content/skills/human-in-the-loop-gate.md ADDED

@@ -0,0 +1,33 @@
+---
+name: "human-in-the-loop-gate"
+description: "Add a human approval checkpoint to an agent so it pauses before a risky or irreversible action (spending money, deleting data, sending messages, merging code) and resumes only after a human approves. Use when an agent acts autonomously on consequential operations."
+allowed-tools: "Read, Grep, Glob, Edit, Write, Bash"
+version: 1.0.0
+---
+An agent that can act autonomously will eventually try to do something you'd want to stop — spend money, delete a record, email a customer, force-push to main. A human-in-the-loop (HITL) gate makes consequential actions **require approval** without turning the whole agent into a manual tool. This skill adds that gate cleanly.
+## When to use this skill
+- An agent performs irreversible or costly actions (payments, deletions, deploys, outbound messages, merges).
+- You're moving an agent from a trusted sandbox toward production or real-user traffic.
+- A compliance or safety requirement mandates a human checkpoint before certain operations.
+## Instructions
+1. **Classify actions by consequence.** Separate reversible/cheap actions (read a file, search) the agent may do freely from consequential ones (write to prod, spend, send, delete) that require approval. Gate only the latter — gating everything destroys the point of an agent.
+2. **Interrupt before the action, not after.** At the gate, pause the agent and surface the **proposed action plus its context**: exactly what it will do, the arguments, and why. The human approves, edits, or rejects.
+3. **Make the pause durable.** Persist agent state at the interrupt (checkpoint) so approval can come seconds or hours later, and a process restart doesn't lose the run. Frameworks like [LangGraph](/tools/langgraph) provide interrupt/resume primitives; for others, persist state explicitly.
+4. **Handle all three outcomes.** Approve → resume from the checkpoint. Edit → resume with the modified action. Reject → abort safely (no partial side effects) and record the reason.
+5. **Fail safe and audit.** Default to *not acting* on timeout or ambiguity, and log every gated decision (action, context, who approved, outcome) for accountability.
+6. **Right-size the friction.** Too many prompts and humans rubber-stamp; too few and risky actions slip through. Gate by genuine blast radius, and consider thresholds (e.g. require approval only for refunds over $X).
+> [!WARNING]
+> A gate that fires on everything trains humans to approve blindly — which is worse than no gate, because it looks safe. Gate only genuinely consequential actions, and show enough context to make a real decision.
+> [!NOTE]
+> The gate must be enforced where the action executes (the tool layer), not just requested in the prompt. A prompt instruction to "ask first" is a suggestion; a code-level interrupt is a guarantee.
+## Output
+A working approval gate: the action-consequence classification, the interrupt/resume implementation with durable state, the approve/edit/reject handling, fail-safe defaults, and an audit log of decisions.
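The approve/edit/reject handling and the fail-safe default can be sketched in a few lines. This is a minimal in-memory sketch under stated assumptions: `runGated`, `CONSEQUENTIAL_TOOLS`, `auditLog`, and the tool names are hypothetical, and the `decide` callback stands in for the durable pause (a real gate would checkpoint state and receive the decision later).

```typescript
type Outcome = "approved" | "edited" | "rejected" | "timed_out";

interface ProposedAction {
  tool: string;
  args: Record<string, unknown>;
  reason: string;
}

interface Decision {
  outcome: Outcome;
  reviewer?: string;
  editedArgs?: Record<string, unknown>;
}

interface AuditEntry {
  action: ProposedAction;
  decision: Decision;
  executed: boolean;
}

// Assumed classification: only these hypothetical tools need approval.
const CONSEQUENTIAL_TOOLS = ["send_email", "delete_record", "issue_refund"];

const auditLog: AuditEntry[] = [];

// Runs `execute` only if the action is cheap or a human approved it.
function runGated<T>(
  action: ProposedAction,
  decide: (action: ProposedAction) => Decision,
  execute: (args: Record<string, unknown>) => T,
): T | undefined {
  if (CONSEQUENTIAL_TOOLS.indexOf(action.tool) === -1) {
    return execute(action.args); // reversible or cheap: no gate
  }
  const decision = decide(action);
  let result: T | undefined;
  let executed = false;
  if (decision.outcome === "approved") {
    result = execute(action.args);
    executed = true;
  } else if (decision.outcome === "edited" && decision.editedArgs) {
    result = execute(decision.editedArgs);
    executed = true;
  }
  // "rejected", "timed_out", or an edit without args: do nothing (fail safe).
  auditLog.push({ action, decision, executed });
  return result;
}
```

Because the check lives inside the function that executes the tool, the agent cannot skip it by ignoring a prompt instruction, which is the point of the note above.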

package/content/skills/llm-as-judge-scorer.md ADDED

@@ -0,0 +1,33 @@
+---
+name: "llm-as-judge-scorer"
+description: "Design a reliable LLM-as-judge metric — a calibrated rubric, a clear scoring scale, and bias controls — and validate it against human labels before trusting it. Use when grading open-ended LLM output (summaries, answers, tone) that exact-match can't score."
+allowed-tools: "Read, Grep, Glob, Edit, Write, Bash"
+version: 1.0.0
+---
+When output is open-ended — a summary, a support answer, tone, helpfulness — you can't score it with exact match, and human grading doesn't scale. An **LLM-as-judge** does, but only if it's built carefully: an uncalibrated judge produces confident, inconsistent scores that quietly corrupt every downstream decision. This skill designs a judge you can actually trust.
+## When to use this skill
+- Grading subjective or open-ended outputs where there's no single correct string.
+- Replacing slow, inconsistent manual review in an eval loop.
+- An existing LLM-as-judge gives scores that don't match your own judgment.
+## Instructions
+1. **Define the rubric explicitly.** State precisely what's being judged and the criteria. Vague instructions ("rate quality 1–10") produce noise; concrete criteria ("deduct if the answer omits the rotation step, hallucinates a flag, or exceeds 3 sentences") produce signal.
+2. **Use a discrete scale with anchors.** Prefer a small scale (e.g. pass/fail or 1–5) with a written description of what each level means. Discrete, anchored scales are far more consistent than a bare 1–10.
+3. **Provide reference examples.** Include a few scored examples in the judge prompt — especially boundary cases — so the model calibrates to your standard rather than its own.
+4. **Control known biases.** LLM judges favor longer answers, their own model family's style, and the first option in a pairwise test. Mitigate: randomize order in pairwise comparisons, instruct length-neutrality, and consider a different model as judge than the one under test.
+5. **Validate against human labels.** Hand-label 20–30 cases, run the judge, and measure agreement. If the judge disagrees with you often, fix the rubric — do not deploy a judge you haven't checked against ground truth.
+6. **Wire it in.** Implement as a custom metric in your framework (e.g. DeepEval's G-Eval or a custom scorer) and add it to the suite with a threshold.
+> [!WARNING]
+> An LLM judge you haven't validated against human labels is not a metric — it's an opinion with a number attached. Calibrate before you trust it, and re-check when you change the judge model.
+> [!NOTE]
+> Where possible, prefer a deterministic check (schema validity, exact match, a regex) over an LLM judge — it's cheaper, faster, and perfectly consistent. Reserve the judge for what genuinely needs judgment.
+## Output
+A validated judge: the rubric and scale, reference examples, the bias controls applied, the human-agreement score, and the metric wired into the eval suite.
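For the validation step, "measure agreement" can be made concrete with raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes categorical labels (for example `"pass"`/`"fail"`); `judgeAgreement` and `countLabels` are hypothetical helper names, and kappa is one reasonable choice of statistic rather than something the skill prescribes.

```typescript
// Counts how often each label appears.
function countLabels(labels: string[]): Record<string, number> {
  const counts: Record<string, number> = Object.create(null); // no inherited keys
  for (const label of labels) {
    counts[label] = (counts[label] ?? 0) + 1;
  }
  return counts;
}

// Raw agreement and Cohen's kappa between human labels and judge labels.
function judgeAgreement(human: string[], judge: string[]): { observed: number; kappa: number } {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("Label lists must be non-empty and the same length.");
  }
  const n = human.length;
  let matches = 0;
  for (let i = 0; i < n; i++) {
    if (human[i] === judge[i]) matches++;
  }
  const observed = matches / n;

  // Chance agreement: probability both raters pick the same label independently.
  const humanCounts = countLabels(human);
  const judgeCounts = countLabels(judge);
  let expected = 0;
  for (const label of Object.keys(humanCounts)) {
    expected += ((humanCounts[label] ?? 0) / n) * ((judgeCounts[label] ?? 0) / n);
  }
  // If both raters always use one identical label, kappa is undefined; report 1.
  const kappa = expected === 1 ? 1 : (observed - expected) / (1 - expected);
  return { observed, kappa };
}

const result = judgeAgreement(
  ["pass", "pass", "fail", "fail"],
  ["pass", "fail", "fail", "fail"],
);
console.log(result.observed, result.kappa); // 0.75 0.5
```

With only 20 to 30 hand-labeled cases the estimate is coarse, so it is worth reading the disagreements one by one as well as looking at the number.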

package/content/skills/llm-eval-suite-scaffolder.md ADDED

@@ -0,0 +1,30 @@
+---
+name: "llm-eval-suite-scaffolder"
+description: "Stand up an evaluation suite for an LLM feature from scratch — a representative dataset, the right metrics, a baseline score, and a CI gate — using DeepEval, promptfoo, or RAGAS. Use when a feature has no evals, before tuning a prompt, or when adding an LLM feature to CI."
+allowed-tools: "Read, Grep, Glob, Edit, Write, Bash"
+version: 1.0.0
+---
+The hardest part of LLM evaluation is starting. This skill scaffolds a complete, runnable eval suite for a feature — dataset, metrics, baseline, and CI wiring — using the framework that fits the stack (DeepEval for Python/pytest, promptfoo for config-driven CLI, RAGAS for RAG-specific metrics).
+## When to use this skill
+- An LLM feature ships with no evals and you need a gate before changing it further.
+- You're about to tune a prompt or swap a model and want to measure the change, not guess.
+- You're adding an LLM feature to CI and need a suite that fails on regressions.
+## Instructions
+1. **Pin the task and the unit of scoring.** State exactly what the feature must produce and how one output is judged: exact match, JSON-schema valid, a numeric tolerance, or an LLM-as-judge rubric. An ambiguous success criterion is the real bug — resolve it first.
+2. **Build a representative dataset.** Collect 20–50 real inputs with expected behavior, deliberately oversampling hard and adversarial cases (empty input, ambiguity, the format that broke last time, the prompt-injection attempt). Freeze it under version control. For RAG, capture the gold passages too.
+3. **Pick the few metrics that matter.** Two or three the feature is actually graded on — not every metric the framework offers. Faithfulness and answer relevancy for RAG; task accuracy and format validity for extraction; a calibrated rubric ([llm-as-judge-scorer](/skills/data/llm-as-judge-scorer)) for open-ended output.
+4. **Choose the framework and scaffold it.** Generate the suite: [DeepEval](/tools/deepeval) (pytest-style assertions), [promptfoo](/tools/promptfoo) (YAML matrix), or [RAGAS](/tools/ragas) (RAG metrics). Wire the dataset and metrics in, with thresholds.
+5. **Record a baseline.** Run the current/naive prompt over the full set and commit the score. Every later number is compared to this.
+6. **Wire the CI gate.** Add a `run-evals` step that fails the build when a metric drops below threshold, so regressions are caught in PRs — see the [Run Evals](/commands/testing/run-evals) command.
+> [!WARNING]
+> Don't generate hundreds of synthetic cases and call it an eval set. Twenty real, well-chosen cases — including the adversarial ones — beat a thousand bland synthetic ones. Quality and coverage of failure modes, not volume.
+## Output
+A runnable eval suite committed to the repo: the frozen dataset, the chosen metrics with thresholds, a recorded baseline score, and a CI step that gates merges on it.
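The CI gate in step 6 reduces to comparing each metric with its threshold and with the committed baseline. The sketch below is framework-independent and hypothetical: `evaluateGate`, `MetricResult`, and the `maxRegression` tolerance are illustrative names and values, not part of DeepEval, promptfoo, or RAGAS.

```typescript
interface MetricResult {
  name: string;
  score: number; // current run, 0..1
  threshold: number; // absolute floor for this metric
  baseline: number; // committed baseline score
}

interface GateReport {
  passed: boolean;
  failures: string[];
}

// Fails when a metric is below its threshold or has regressed from the
// baseline by more than `maxRegression` (an assumed tolerance).
function evaluateGate(results: MetricResult[], maxRegression = 0.02): GateReport {
  const failures: string[] = [];
  for (const r of results) {
    if (r.score < r.threshold) {
      failures.push(`${r.name}: ${r.score.toFixed(3)} is below threshold ${r.threshold.toFixed(3)}`);
    } else if (r.baseline - r.score > maxRegression) {
      failures.push(`${r.name}: ${r.score.toFixed(3)} regressed from baseline ${r.baseline.toFixed(3)}`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```

A `run-evals` step would call something like this after the suite finishes and exit non-zero when `passed` is false, printing `failures` so the PR shows which metric moved.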

package/content/skills/llm-guardrails-designer.md ADDED

@@ -0,0 +1,33 @@
+---
+name: "llm-guardrails-designer"
+description: "Design input and output guardrails for an LLM app — decide what to check (injection patterns, PII, secrets, policy, schema, leakage, toxicity), place them as input vs. output rails, implement with a library like NeMo Guardrails or LLM Guard, and fail closed. Use when adding a safety/validation layer around an LLM, not relying on the prompt alone."
+allowed-tools: "Read, Grep, Glob, Bash, Write, Edit"
+version: 1.0.0
+---
+A guardrail is the validation layer around an LLM that a system prompt can't be: programmatic checks on what goes *into* the model and what comes *out*, enforced in code rather than requested in text. This skill designs that layer — deciding which checks matter for your app, placing them as input or output rails, implementing them with a guardrails library, and making them fail closed — as defense in depth, not a wall.
+## When to use this skill
+- Adding a safety/validation layer to an LLM app instead of trusting the prompt to police itself.
+- Enforcing output structure, policy, or PII/secret-leakage checks before responses reach users or downstream systems.
+- Hardening a RAG or agent app against injection and unsafe actions as part of [defending against prompt injection](/guides/ai-safety/defending-prompt-injection).
+## Instructions
+1. **Threat-model the app first.** Identify the untrusted inputs (user, retrieved content, tool output), the sensitive data/actions to protect, and the unacceptable outputs (leaked secrets, policy violations, malformed structure). Guardrails follow the threats — don't add checks with no threat behind them.
+2. **Choose input rails.** On the way in, decide what to scan and reject/sanitize: prompt-injection patterns, PII/secret stripping (often via the [prompt-pii-redactor](/skills/security/prompt-pii-redactor)), banned topics, and input size/token limits. Input rails reduce what reaches the model.
+3. **Choose output rails.** On the way out, validate before the response is trusted: **schema/structure** conformance, **policy** and safety (toxicity, disallowed content), **leakage** (PII, secrets, system-prompt disclosure), and grounding/relevance for RAG. Output rails are your last line before a user or a tool acts on the response.
+4. **Implement with a library, not from scratch.** Use [NeMo Guardrails](/tools/nemo-guardrails) (programmable rails, Colang) or [LLM Guard](/tools/llm-guard) (ready-made input/output scanners) rather than hand-rolling detectors. Match the choice to the stack and the checks you need.
+5. **Fail closed and make it observable.** When a guardrail trips, default to the safe action (block, sanitize, or escalate to a human) rather than passing through. Log every trigger with enough context to tune it — guardrails you can't see are guardrails you can't trust.
+6. **Acknowledge the limits.** State plainly that guardrails are **defense in depth**, not prevention — they raise the cost of an attack and catch known patterns, but they don't replace least privilege and human approval for high-impact actions. Don't let a guardrail create false confidence.
+> [!WARNING]
+> Guardrails are probabilistic and bypassable — a detector for injection or toxicity will miss novel phrasings. Layer them with architectural controls (least privilege, approvals, output validation), and never let "we have guardrails" substitute for limiting what the model can actually do.
+> [!TIP]
+> Fail closed by default. A guardrail that, on error or uncertainty, lets the request through is worse than none — it gives you confidence without protection. The safe default when a check can't run or is unsure is to block or route to a human.
+## Output
+A guardrail design and implementation: the threat model it addresses, the input and output rails with what each checks and its fail-closed behavior, the library wiring (NeMo Guardrails or LLM Guard), logging for each trigger, and an explicit statement of what the guardrails do and do not cover — so they're treated as one layer of defense, not the whole defense.
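The fail-closed and logging behaviour from step 5 can be shown independently of any library. In this sketch the two rails are deliberately toy stand-ins for real scanners (the skill recommends NeMo Guardrails or LLM Guard for actual detection); `applyOutputRails`, `OutputRail`, `secretRail`, `jsonRail`, and the regex are hypothetical.

```typescript
type RailVerdict =
  | { allowed: true }
  | { allowed: false; rail: string; reason: string };

interface OutputRail {
  name: string;
  // Returns a reason to block, or null when the text passes.
  check: (text: string) => string | null;
}

// Toy detector: strings shaped like "sk-..." keys or AWS access key IDs.
const secretRail: OutputRail = {
  name: "secret-leakage",
  check: (text) =>
    /\bsk-[A-Za-z0-9]{20,}\b|\bAKIA[0-9A-Z]{16}\b/.test(text) ? "possible secret in output" : null,
};

// Structure check: JSON.parse throws on malformed output.
const jsonRail: OutputRail = {
  name: "json-structure",
  check: (text) => {
    JSON.parse(text);
    return null;
  },
};

function applyOutputRails(
  text: string,
  rails: OutputRail[],
  log: (line: string) => void,
): RailVerdict {
  for (const rail of rails) {
    let reason: string | null;
    try {
      reason = rail.check(text);
    } catch (err) {
      // Fail closed: a rail that errors counts as a trip, not a pass.
      reason = `rail errored: ${err instanceof Error ? err.message : String(err)}`;
    }
    if (reason !== null) {
      log(`[guardrail] ${rail.name} blocked output: ${reason}`);
      return { allowed: false, rail: rail.name, reason };
    }
  }
  return { allowed: true };
}
```

The important property is in the `catch` branch: an exception inside a check produces a block and a log line, never a silent pass.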

package/content/skills/llm-output-schema-generator.md ADDED

@@ -0,0 +1,32 @@
+---
+name: "llm-output-schema-generator"
+description: "Turn an example of the data you want from an LLM into a precise, validated output schema (Pydantic / Zod / JSON Schema) and wire it into structured-output calls. Use when adding typed LLM output, replacing brittle JSON parsing, or designing an extraction shape."
+allowed-tools: "Read, Grep, Glob, Edit, Write"
+version: 1.0.0
+---
+The reliable way to get data (not prose) from an LLM is to give it a schema and validate against it. This skill builds that schema from a concrete example of what you want back, then wires it into a structured-output call — so the model returns typed, validated objects and your code stops parsing free-form JSON by hand.
+This is distinct from generating **test fixtures** (that's a mock-data factory) and from documenting an **existing API** (that's an OpenAPI doc writer): here the output *is the schema the LLM must conform to*.
+## When to use this skill
+- Adding typed/structured output to an LLM feature (extraction, classification, form-filling).
+- Replacing fragile `JSON.parse` + try/catch around model output with a validated schema.
+- Designing the exact shape for an extraction or tool-output contract.
+## Instructions
+1. **Start from a real example.** Take a representative sample of the desired output (or a few). Infer fields and types from the data, not from a guess — and gather a couple of edge-case examples so optionality and unions are right.
+2. **Type precisely.** Choose specific types (int vs. float, date vs. string), mark genuinely optional fields optional and required fields required, and use **enums** for closed sets rather than free strings.
+3. **Add model-facing descriptions.** Field descriptions are prompt surface in structured-output libraries — say what each field means, with units and formats ("ISO 8601", "USD cents"). This improves the model's accuracy, not just documentation.
+4. **Constrain to make bad output impossible.** Add bounds, patterns, and enums so invalid values can't validate. Prefer a flatter shape where it doesn't lose meaning — deeply nested schemas are harder for models to fill correctly.
+5. **Emit in the target stack.** Generate the schema as Pydantic (Python), Zod (TypeScript), a `.baml` type, or JSON Schema — matching the structured-output tool in use ([Instructor](/tools/instructor), [BAML](/tools/baml), or the [Vercel AI SDK](/tools/vercel-ai-sdk)).
+6. **Wire and validate.** Hook it into the structured-output call with retry-on-validation-failure, and test it against the original examples plus the edge cases.
+> [!TIP]
+> Let the schema carry the instructions. A well-named field with a clear description and an enum often replaces a paragraph of prompt — see [Structured Output vs JSON Mode vs Function Calling](/guides/concepts/structured-output-2026).
+## Output
+A validated output schema in the target language, with typed/constrained fields and descriptions, wired into a structured-output call with retry — verified against the example outputs.
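The "wire and validate" step, with retry on validation failure, can be sketched without any dependency. Here a hand-written validator stands in for a Zod or Pydantic schema, and `callModel` is a synchronous placeholder for the real (normally asynchronous) structured-output call; `TicketLabel`, `validateTicketLabel`, and `extractWithRetry` are all hypothetical names.

```typescript
// Target shape inferred from example outputs (hypothetical ticket classifier).
interface TicketLabel {
  category: "billing" | "bug" | "other"; // closed set, so an enum
  priority: number; // integer from 1 (low) to 5 (urgent)
  summary?: string; // optional, at most 200 characters
}

const CATEGORIES = ["billing", "bug", "other"];

// Returns a list of problems; an empty list means the value conforms.
function validateTicketLabel(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["output must be a JSON object"];
  const record = value as Record<string, unknown>;
  const category = record["category"];
  const priority = record["priority"];
  const summary = record["summary"];
  const problems: string[] = [];
  if (typeof category !== "string" || CATEGORIES.indexOf(category) === -1) {
    problems.push(`category must be one of ${CATEGORIES.join(", ")}`);
  }
  if (typeof priority !== "number" || Math.floor(priority) !== priority || priority < 1 || priority > 5) {
    problems.push("priority must be an integer from 1 to 5");
  }
  if (summary !== undefined && (typeof summary !== "string" || summary.length > 200)) {
    problems.push("summary must be a string of at most 200 characters");
  }
  return problems;
}

// Calls the model, validates, and feeds the problems back on failure.
function extractWithRetry(
  callModel: (feedback: string | null) => string,
  maxAttempts = 3,
): TicketLabel {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = callModel(feedback);
    let parsed: unknown = undefined;
    let problems: string[] = ["output was not valid JSON"];
    try {
      parsed = JSON.parse(raw);
      problems = validateTicketLabel(parsed);
    } catch {
      // Keep the "not valid JSON" problem set above.
    }
    if (problems.length === 0) return parsed as TicketLabel;
    feedback = problems.join("; ");
  }
  throw new Error(`No valid output after ${maxAttempts} attempts: ${feedback}`);
}
```

Libraries such as Instructor implement this loop for you; the sketch only shows what the retry is doing, namely returning the validation errors to the model instead of repeating the same prompt.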

package/content/skills/mcp-server-scaffolder.md ADDED

@@ -0,0 +1,33 @@
+---
+name: "mcp-server-scaffolder"
+description: "Scaffold a new Model Context Protocol (MCP) server from a description — pick the SDK and transport, generate a typed first tool with a strict schema, and wire up MCP Inspector testing and the client-registration command. Use when starting a new MCP server and you want a correct, runnable skeleton instead of copying a README."
+allowed-tools: "Read, Grep, Glob, Bash, Write, Edit"
+version: 1.0.0
+---
+Starting an MCP server from a blog post means inheriting its mistakes — a vague tool description, a loose schema, the wrong transport for where it'll run. This skill scaffolds a **correct, runnable** server skeleton from a one-line description: it picks the SDK and transport for your case, generates a first tool that already follows the naming/schema/description discipline that makes a server usable, and leaves you with the exact commands to test and register it.
+## When to use this skill
+- Starting a brand-new MCP server and you want a runnable skeleton with one good tool, not a pile of boilerplate.
+- You know *what* the server should expose but want the transport choice, schema shape, and project layout decided correctly up front.
+- Standing up a server to test an integration idea quickly, with Inspector testing already wired in.
+## Instructions
+1. **Clarify the capability and where it runs.** From the description, identify the first tool (one job, clearly named) and whether the server runs **local** (stdio) or **remote/shared** (Streamable HTTP). If it's ambiguous, default to stdio for a local prototype and note how to promote it to HTTP later.
+2. **Pick the SDK and detect the ecosystem.** Match the project's language: the official Python or TypeScript MCP SDK, or a higher-level framework like [FastMCP](/tools/fastmcp) for Python. Check for an existing project to slot into (package manager, lockfile, conventions) rather than scaffolding in isolation.
+3. **Generate the server skeleton.** Create the entrypoint, the transport setup (stdio or Streamable HTTP), and one **tool** with: a verb-object name, a description written as a routing signal (what it does, what it returns, when to use it), and a strict input schema (required vs. optional, enums, per-field descriptions). Stub the handler with a clear TODO and a concise, model-ready return shape.
+4. **Wire in testing.** Add the command to launch the [MCP Inspector](/tools/mcp-inspector) against the server (`npx @modelcontextprotocol/inspector ...`) so the first thing the developer can do is connect, list the tool, and call it.
+5. **Emit the registration command.** Provide the exact `claude mcp add` invocation (correct transport, scope, and any `--env` for secrets) — or point to the [Add MCP Server](/commands/workflow/add-mcp-server) command — so the server can be connected to a client immediately.
+6. **Leave a clear extension path.** Document where to add the next tool, resource, or prompt, and flag that a remote server still needs auth and stateless scaling before production (link the deploy guide).
+> [!TIP]
+> Scaffold **one** good tool, not five mediocre ones. The first tool sets the pattern the rest of the server copies — get its name, schema, and description right and the server stays usable as it grows.
+> [!WARNING]
+> A scaffold is a starting point, not a production server. The generated handler must still validate its inputs, and a remote (HTTP) server is not safe to expose until it has authentication and input validation — see [Deploying a Remote MCP Server](/guides/mcp/deploy-remote-mcp-server).
+## Output
+A runnable MCP server skeleton: entrypoint and transport wired up, one well-shaped tool with a strict schema and routing-quality description, the Inspector test command, and the client-registration snippet — plus a short note on where to add the next capability and what hardening is still required before production.
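What "one well-shaped tool with a strict schema" looks like can be shown as plain data, without committing to an SDK. The `name`/`description`/`inputSchema` fields follow the MCP tool shape; everything else here is hypothetical, including the `search_issues` tool, the local `ToolDefinition` type (not an SDK type), and the `checkArguments` helper that a stubbed handler might call before doing any work.

```typescript
// Minimal local types for the subset of JSON Schema used in this sketch.
interface PropertySchema {
  type: "string" | "integer";
  description: string;
  enum?: string[];
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, PropertySchema>;
    required: string[];
    additionalProperties: false;
  };
}

const searchIssues: ToolDefinition = {
  name: "search_issues", // verb-object name
  description:
    "Search the issue tracker by keyword. Returns matching issues as {id, title, state}. " +
    "Use to find or list issues, not to read one issue's full text.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Keywords to match against issue titles." },
      state: { type: "string", description: "Filter by issue state.", enum: ["open", "closed"] },
      limit: { type: "integer", description: "Maximum results to return (default 10)." },
    },
    required: ["query"],
    additionalProperties: false,
  },
};

// Returns a list of problems; an empty list means the arguments conform.
function checkArguments(tool: ToolDefinition, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const { properties, required } = tool.inputSchema;
  for (const key of required) {
    if (args[key] === undefined) problems.push(`missing required field: ${key}`);
  }
  for (const key of Object.keys(args)) {
    const schema = properties[key];
    const value = args[key];
    if (schema === undefined) {
      problems.push(`unknown field: ${key}`);
    } else if (schema.type === "string" && typeof value !== "string") {
      problems.push(`${key} must be a string`);
    } else if (schema.type === "integer" && (typeof value !== "number" || Math.floor(value) !== value)) {
      problems.push(`${key} must be an integer`);
    } else if (schema.enum && schema.enum.indexOf(value as string) === -1) {
      problems.push(`${key} must be one of: ${schema.enum.join(", ")}`);
    }
  }
  return problems;
}
```

The description doubles as the routing signal: it says what the tool does, what it returns, and when not to use it, which is the part a README-copied scaffold usually leaves out.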

package/content/skills/mock-data-factory.md ADDED

@@ -0,0 +1,75 @@
+---
+name: "mock-data-factory"
+description: "Generate a typed mock/fixture factory for a given type, interface, or schema, inferring believable values from field names and types. Use when tests or local dev need realistic, type-safe sample data with per-field overrides."
+allowed-tools: "Read, Grep, Glob, Write, Bash"
+version: 1.0.0
+---
+Generate a type-safe factory that produces realistic mock data for a named type, interface, or schema. The skill reads the target definition, infers each field's semantics from its name and type (an `email` becomes a valid address, `createdAt` a recent ISO date, `id` a UUID, `count` a small non-negative integer), and emits a `build()` factory that returns a complete, valid object while accepting a partial override for any field. It matches the project's existing fixture conventions instead of inventing a new one.
+## When to use this skill
+- A test or story needs a valid instance of a type and you don't want to hand-write every field.
+- You keep copy-pasting and tweaking the same object literal across specs — centralize it in one factory.
+- Local dev or a seed script needs believable sample records (users, orders, events) rather than `"foo"` / `123` placeholders.
+> [!NOTE]
+> The factory produces *plausible*, schema-valid data — not data that satisfies your business invariants. If a test depends on a specific relationship (e.g. `endsAt` after `startsAt`, or a total matching its line items), pass explicit overrides rather than trusting the defaults.
+## Instructions
+1. **Locate the target.** Read the type the user named — a TypeScript `interface`/`type`, a Zod/Yup schema, a Prisma model, a Python dataclass/Pydantic model. Resolve every field, its type, optionality, and any nested or referenced types so the factory returns a fully-populated object.
+2. **Detect the project's conventions.** Inspect the repo before writing — do not guess:
+   - Is a faker library already a dependency (`@faker-js/faker`, `faker`, `factory.ts`, `fishery`, `factory_boy`)? Reuse it. If none exists, generate deterministic values with plain code rather than adding a dependency.
+   - Mirror existing factory/fixture file location and naming (`*.factory.ts`, `factories/`, `fixtures/`, `conftest.py`).
+   - Match the override signature already in use (e.g. `build(overrides?: Partial<T>)` vs. a `fishery` `params` object).
+3. **Infer field semantics from name + type.** Map fields to believable generators: `email` → valid address, `*Id`/`id`/`uuid` → UUID, `*At`/`*Date` → recent ISO timestamp, `name`/`firstName` → a real-looking name, `url`/`avatar` → a URL, `price`/`amount` → a positive decimal, `count`/`quantity` → a small int, `isActive`/`enabled` → boolean. For enums/unions, pick the first valid member unless another member is clearly the typical case (e.g. `"member"` for a `role`). Fall back to the type's primitive default only when the name carries no signal.
+4. **Write the factory.** Emit a `build()` that returns a complete object with sensible defaults, deep-merges a `Partial<T>` override, and is typed so the return value is the full `T`. Make defaults deterministic (or seedable) so snapshots stay stable. Populate nested objects via their own factories where they exist. Leave a `// TODO` only where a value needs genuine human judgment (a real foreign key, a domain-specific constraint).
+5. **Verify it type-checks and runs.** Type-check the file (`tsc --noEmit`, or import it in a scratch test) and instantiate `build()` plus `build({ ...override })` to confirm both produce valid instances and the override actually wins.
+6. **Report.** Summarize the fields and the generator chosen for each, and flag gaps — fields where the inferred value may violate a business rule, unresolved referenced types, or invariants the caller must enforce via overrides.
+> [!WARNING]
+> Keep generated values clearly synthetic (example.com emails, obviously fake names) and never commit real PII or production-shaped secrets into fixtures. A factory checked into the repo is shared sample data, not a place for live tokens or customer records.
+## Examples
+Given a `User` type:
+```ts
+export interface User {
+  id: string;
+  email: string;
+  displayName: string;
+  role: "admin" | "member" | "guest";
+  isActive: boolean;
+  createdAt: string; // ISO 8601
+}
+```
+The skill detects `@faker-js/faker` is already installed and writes `src/test/factories/user.factory.ts`:
+```ts
+import { faker } from "@faker-js/faker";
+import type { User } from "../../types/user";
+export function buildUser(overrides: Partial<User> = {}): User {
+  return {
+    id: faker.string.uuid(),
+    email: faker.internet.exampleEmail().toLowerCase(),
+    displayName: faker.person.fullName(),
+    role: "member",
+    isActive: true,
+    createdAt: faker.date.recent({ days: 30 }).toISOString(),
+    ...overrides,
+  };
+}
+```
+Use it in a test, overriding only what the case cares about:
+```ts
+const admin = buildUser({ role: "admin", email: "ada@example.com" });
+expect(canDeleteWorkspace(admin)).toBe(true);
+```
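+Seed the faker instance (`faker.seed(1)`) when you need byte-stable output for snapshots. Because `faker.date.recent` is computed relative to the current time, also pass a fixed `refDate` (or override `createdAt`) so timestamps do not drift between runs.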
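When the repo has no faker dependency, step 2 asks for deterministic values from plain code, and step 4 asks for nested overrides to merge rather than replace. A minimal sketch of both, with hypothetical `Customer`/`Address` types and a hand-rolled `DeepPartial` helper (none of these names come from the skill):

```typescript
interface Address {
  city: string;
  country: string;
}

interface Customer {
  id: string;
  email: string;
  address: Address;
  createdAt: string; // ISO 8601
}

// Makes every field optional, recursing into nested objects.
type DeepPartial<T> = { [K in keyof T]?: T[K] extends object ? DeepPartial<T[K]> : T[K] };

// A counter gives unique, repeatable values without a faker dependency.
let sequence = 0;

function buildAddress(overrides: Partial<Address> = {}): Address {
  return { city: "Springfield", country: "US", ...overrides };
}

function buildCustomer(overrides: DeepPartial<Customer> = {}): Customer {
  sequence += 1;
  const { address, ...rest } = overrides;
  return {
    id: `customer-${sequence}`,
    email: `customer${sequence}@example.com`,
    createdAt: "2024-01-01T00:00:00.000Z", // fixed so snapshots stay stable
    ...rest,
    // Nested override is merged through the nested factory, not replaced wholesale.
    address: buildAddress(address),
  };
}

const lisbon = buildCustomer({ address: { city: "Lisbon" } });
console.log(lisbon.address); // { city: 'Lisbon', country: 'US' }
```

Delegating to `buildAddress` keeps the merge one level deep per factory, which is usually enough and avoids a generic deep-merge utility.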

package/content/skills/multimodal-document-extractor.md ADDED

@@ -0,0 +1,39 @@
+---
+name: "multimodal-document-extractor"
+description: "Extract structured data from documents and images with a vision-language model — define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR."
+allowed-tools: "Read, Grep, Glob, Edit, Write, Bash"
+version: 1.0.0
+---
+Extract structured data from documents and images using a vision-language model, the right way: schema-first, with verification on the fields that matter. VLMs are powerful at reading messy, varied documents that template OCR can't handle — but they can also confidently mis-read an exact value, so this skill pairs extraction with the faithfulness checks that make the output trustworthy.
+## When to use this skill
+- Pulling structured fields from documents that vary in layout — invoices, receipts, forms, statements, contracts, IDs.
+- Scanned, photographed, or handwritten documents where template/positional OCR is brittle.
+- You need the result as structured data (a schema) for a database or downstream system, not as free text.
+## When NOT to use this skill
+- Clean, fixed-format printed text at scale where deterministic OCR is cheaper and sufficient — use traditional OCR.
+- General document Q&A or summarization with no structured-output requirement — a plain VLM call is enough.
+## Instructions
+1. **Define the target schema first.** Specify the exact fields, types, and enums you need, each with a clear description (e.g. `total: number`, `currency: enum`, `line_items: [{description, qty, unit_price}]`). The schema is the contract; design it before prompting. The [llm-output-schema-generator](/skills/api/llm-output-schema-generator) can draft it from a sample.
+2. **Pick the model.** Choose an open-weights VLM ([Qwen3-VL](/tools/qwen3-vl)) for self-hosting, privacy, or cost at volume, or a proprietary VLM for maximum capability — decide on measured accuracy for *your* document type, not a benchmark.
+3. **Extract with structured output.** Send the page image(s) and prompt the model to fill the schema using the provider's structured-output/JSON mode, so the result conforms instead of being free-form text you parse. See [Structured Output vs JSON Mode vs Function Calling](/guides/concepts/structured-output-2026).
+4. **Handle multi-page and large documents.** Split long documents into pages or logical sections, extract per section, and merge — keeping the page reference for each field so values can be traced back.
+5. **Verify the fields that matter.** This is the step that makes it production-grade: cross-check critical values (totals, dates, IDs, amounts) against the source — a second pass, a checksum/arithmetic validation (line items sum to the total), or a traditional OCR comparison. A VLM's confident output is not proof.
+6. **Confidence and human review.** Capture or estimate per-field confidence and route low-confidence or failed-validation pages to human review rather than silently committing a guess.
+7. **Measure accuracy on real documents.** Evaluate field-level accuracy on a representative, labeled sample (including the hard cases — bad scans, edge formats) before trusting the extractor in production, and hand the eval to the [llm-evaluation-engineer](/agents/data-ai/llm-evaluation-engineer).
+> [!WARNING]
+> VLMs can hallucinate or transpose an exact value while the surrounding text is perfect — a `$1,240.00` read as `$1,420.00`, a digit dropped from an ID. For anything financial, legal, or identity-related, treat extracted values as unverified until checked against the source. The schema guarantees the *shape*, not the *truth*.
+> [!NOTE]
+> Prefer arithmetic and cross-field validation where the document gives it to you for free — line items should sum to the subtotal, subtotal plus tax to the total, dates should be plausible. These catch mis-reads no confidence score will.
+## Output
+A working extractor: the target schema, the VLM extraction call with structured output, multi-page handling, the verification/validation step for critical fields, confidence-based routing to human review, and a field-level accuracy measurement on a representative sample — so the structured data is both well-formed and faithful to the source.
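The arithmetic cross-checks from step 5 and the note above are cheap to implement. The sketch below assumes amounts have already been converted to integer cents and quantities are whole numbers, which sidesteps floating-point rounding; `ExtractedInvoice`, `validateInvoice`, and the field names are hypothetical variants of the schema in step 1.

```typescript
interface LineItem {
  description: string;
  qty: number; // assumed to be a whole number
  unitPriceCents: number;
}

interface ExtractedInvoice {
  lineItems: LineItem[];
  subtotalCents: number;
  taxCents: number;
  totalCents: number;
}

// Returns a list of inconsistencies; an empty list means the sums agree.
function validateInvoice(invoice: ExtractedInvoice): string[] {
  const problems: string[] = [];
  const itemsSum = invoice.lineItems.reduce(
    (sum, item) => sum + item.qty * item.unitPriceCents,
    0,
  );
  if (itemsSum !== invoice.subtotalCents) {
    problems.push(`line items sum to ${itemsSum} but subtotal is ${invoice.subtotalCents}`);
  }
  if (invoice.subtotalCents + invoice.taxCents !== invoice.totalCents) {
    problems.push(
      `subtotal + tax is ${invoice.subtotalCents + invoice.taxCents} but total is ${invoice.totalCents}`,
    );
  }
  return problems;
}

// Any inconsistency routes the page to a human instead of committing a guess.
function needsHumanReview(invoice: ExtractedInvoice): boolean {
  return validateInvoice(invoice).length > 0;
}
```

A transposed subtotal such as `142000` where the items sum to `124000` fails both checks at once, which is exactly the kind of mis-read a per-field confidence score tends to miss.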