rsc-universal 0.1.1
This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +279 -0
- package/manifest.json +4761 -0
- package/package.json +59 -0
- package/schema/frontmatter.schema.json +12 -0
- package/scripts/build-manifest.js +72 -0
- package/scripts/consult.js +106 -0
- package/scripts/detect-repo.js +118 -0
- package/scripts/doctor.js +21 -0
- package/scripts/eval-lint.sh +179 -0
- package/scripts/install-apply.js +52 -0
- package/scripts/install-plan.js +13 -0
- package/scripts/lib/behavior-score.js +103 -0
- package/scripts/lib/frontmatter.js +47 -0
- package/scripts/lib/harden-policy.js +41 -0
- package/scripts/lib/manifest.js +18 -0
- package/scripts/lib/recommend.js +36 -0
- package/scripts/lib/registry.js +110 -0
- package/scripts/lib/result-envelope.js +35 -0
- package/scripts/lib/state.js +12 -0
- package/scripts/lib/ui.js +17 -0
- package/scripts/reviewer-guard.sh +67 -0
- package/scripts/rsc.js +108 -0
- package/scripts/skill-behavior-eval.js +33 -0
- package/scripts/skill-behavior-eval.workflow.js +136 -0
- package/scripts/skill-behavior-rubric.md +63 -0
- package/scripts/skill-harden-rubric.md +40 -0
- package/scripts/skill-harden.workflow.js +161 -0
- package/scripts/skill-rubric.md +39 -0
- package/scripts/skill-scoreboard.workflow.js +35 -0
- package/skills/ab-testing/SKILL.md +191 -0
- package/skills/ab-testing/evals/README.md +8 -0
- package/skills/ab-testing/evals/cases.yaml +49 -0
- package/skills/ab-testing/references/pitfalls.md +74 -0
- package/skills/ab-testing/references/sample-size-and-cuped.md +128 -0
- package/skills/ab-testing/scripts/verify.sh +89 -0
- package/skills/accessibility/SKILL.md +218 -0
- package/skills/accessibility/evals/README.md +3 -0
- package/skills/accessibility/evals/cases.yaml +47 -0
- package/skills/accessibility/references/aria-patterns.md +113 -0
- package/skills/accessibility/references/wcag22-checklist.md +83 -0
- package/skills/accessibility/scripts/verify.sh +103 -0
- package/skills/ads/SKILL.md +175 -0
- package/skills/ads/evals/README.md +15 -0
- package/skills/ads/evals/cases.yaml +58 -0
- package/skills/ads/references/platform-specs.md +73 -0
- package/skills/ads/references/roas-model.md +77 -0
- package/skills/ads/scripts/verify.sh +210 -0
- package/skills/agent-eval/SKILL.md +213 -0
- package/skills/agent-eval/evals/README.md +12 -0
- package/skills/agent-eval/evals/cases.yaml +45 -0
- package/skills/agent-eval/references/judge-design.md +118 -0
- package/skills/agent-eval/references/runner-and-gate.md +183 -0
- package/skills/agent-eval/scripts/verify.sh +161 -0
- package/skills/agent-safety/SKILL.md +176 -0
- package/skills/agent-safety/evals/README.md +12 -0
- package/skills/agent-safety/evals/cases.yaml +46 -0
- package/skills/agent-safety/references/threat-model.md +51 -0
- package/skills/ai-media/SKILL.md +196 -0
- package/skills/ai-media/evals/README.md +3 -0
- package/skills/ai-media/evals/cases.yaml +45 -0
- package/skills/ai-media/references/ffmpeg-assembly.md +117 -0
- package/skills/ai-media/references/models-and-params.md +78 -0
- package/skills/ai-media/scripts/verify.sh +103 -0
- package/skills/analytics/SKILL.md +219 -0
- package/skills/analytics/evals/README.md +9 -0
- package/skills/analytics/evals/cases.yaml +53 -0
- package/skills/analytics/references/event-taxonomy.md +75 -0
- package/skills/analytics/references/ga4-setup.md +122 -0
- package/skills/analytics/references/posthog-setup.md +100 -0
- package/skills/analytics/scripts/verify.sh +95 -0
- package/skills/analyze/SKILL.md +136 -0
- package/skills/analyze/evals/README.md +72 -0
- package/skills/analyze/evals/cases.yaml +74 -0
- package/skills/angular/SKILL.md +288 -0
- package/skills/angular/evals/README.md +3 -0
- package/skills/angular/evals/cases.yaml +38 -0
- package/skills/angular/references/migration.md +81 -0
- package/skills/angular/references/signals-rxjs.md +92 -0
- package/skills/angular/scripts/verify.sh +122 -0
- package/skills/api-connector-builder/SKILL.md +285 -0
- package/skills/api-connector-builder/evals/README.md +11 -0
- package/skills/api-connector-builder/evals/cases.yaml +47 -0
- package/skills/api-connector-builder/references/auth-flows.md +132 -0
- package/skills/api-connector-builder/references/pagination.md +144 -0
- package/skills/api-connector-builder/scripts/verify.sh +172 -0
- package/skills/api-design/SKILL.md +189 -0
- package/skills/api-design/evals/README.md +3 -0
- package/skills/api-design/evals/cases.yaml +45 -0
- package/skills/api-design/references/graphql-design.md +70 -0
- package/skills/api-design/references/openapi-contract.md +86 -0
- package/skills/api-design/references/rest-conventions.md +63 -0
- package/skills/api-design/references/versioning-and-evolution.md +49 -0
- package/skills/api-design/scripts/verify.sh +138 -0
- package/skills/article-writing/SKILL.md +175 -0
- package/skills/article-writing/evals/README.md +3 -0
- package/skills/article-writing/evals/cases.yaml +47 -0
- package/skills/article-writing/references/ai-tell-banlist.md +114 -0
- package/skills/article-writing/references/on-page-seo.md +133 -0
- package/skills/article-writing/scripts/verify.sh +165 -0
- package/skills/astro/SKILL.md +275 -0
- package/skills/astro/evals/README.md +3 -0
- package/skills/astro/evals/cases.yaml +41 -0
- package/skills/astro/references/content-layer.md +118 -0
- package/skills/astro/references/deploy-and-integrations.md +163 -0
- package/skills/astro/scripts/verify.sh +137 -0
- package/skills/author-skill/SKILL.md +206 -0
- package/skills/author-skill/evals/README.md +66 -0
- package/skills/author-skill/evals/cases.yaml +75 -0
- package/skills/author-skill/references/description-recipe.md +84 -0
- package/skills/author-skill/references/eval-authoring.md +74 -0
- package/skills/author-skill/references/rsc-conventions.md +91 -0
- package/skills/automation-flows/SKILL.md +132 -0
- package/skills/automation-flows/evals/README.md +5 -0
- package/skills/automation-flows/evals/cases.yaml +44 -0
- package/skills/automation-flows/references/error-handling.md +58 -0
- package/skills/automation-flows/references/n8n-workflow-json.md +63 -0
- package/skills/automation-flows/scripts/verify.sh +78 -0
- package/skills/aws-essentials/SKILL.md +223 -0
- package/skills/aws-essentials/evals/README.md +10 -0
- package/skills/aws-essentials/evals/cases.yaml +44 -0
- package/skills/aws-essentials/references/iam-least-privilege.md +134 -0
- package/skills/aws-essentials/references/rds-cloudfront-recipes.md +127 -0
- package/skills/aws-essentials/scripts/verify.sh +99 -0
- package/skills/backups/SKILL.md +137 -0
- package/skills/backups/evals/README.md +3 -0
- package/skills/backups/evals/cases.yaml +42 -0
- package/skills/backups/references/engine-recipes.md +121 -0
- package/skills/backups/references/restore-runbook.md +65 -0
- package/skills/backups/scripts/verify.sh +80 -0
- package/skills/bash-scripting/SKILL.md +231 -0
- package/skills/bash-scripting/evals/README.md +3 -0
- package/skills/bash-scripting/evals/cases.yaml +45 -0
- package/skills/bash-scripting/references/portability.md +97 -0
- package/skills/bash-scripting/scripts/verify.sh +140 -0
- package/skills/bookkeeping/SKILL.md +184 -0
- package/skills/bookkeeping/evals/README.md +5 -0
- package/skills/bookkeeping/evals/cases.yaml +52 -0
- package/skills/bookkeeping/references/chart-of-accounts.md +87 -0
- package/skills/bookkeeping/references/reconciliation-playbook.md +54 -0
- package/skills/bookkeeping/references/tricky-transactions.md +192 -0
- package/skills/brand-identity/SKILL.md +161 -0
- package/skills/brand-identity/evals/README.md +14 -0
- package/skills/brand-identity/evals/cases.yaml +43 -0
- package/skills/brand-identity/references/color-and-tokens.md +129 -0
- package/skills/brand-identity/references/logo-and-assets.md +117 -0
- package/skills/brand-identity/scripts/verify.sh +224 -0
- package/skills/brand-voice/SKILL.md +183 -0
- package/skills/brand-voice/evals/README.md +3 -0
- package/skills/brand-voice/evals/cases.yaml +57 -0
- package/skills/brand-voice/references/voice-guide-template.md +150 -0
- package/skills/brand-voice/references/word-bank.md +61 -0
- package/skills/brand-voice/scripts/verify.sh +190 -0
- package/skills/building-agents/SKILL.md +469 -0
- package/skills/building-agents/evals/README.md +68 -0
- package/skills/building-agents/evals/cases.yaml +60 -0
- package/skills/building-agents/references/agent-loops-and-harness.md +371 -0
- package/skills/building-agents/references/evals-and-observability.md +420 -0
- package/skills/building-agents/references/mcp-servers.md +294 -0
- package/skills/building-agents/references/provider-abstraction.md +489 -0
- package/skills/building-agents/references/tools-and-rag.md +417 -0
- package/skills/building-agents/scripts/verify.sh +121 -0
- package/skills/business-intelligence/SKILL.md +176 -0
- package/skills/business-intelligence/evals/README.md +3 -0
- package/skills/business-intelligence/evals/cases.yaml +43 -0
- package/skills/business-intelligence/references/authoring-semantic-models.md +120 -0
- package/skills/business-intelligence/references/wiring-agents-and-apis.md +79 -0
- package/skills/business-intelligence/scripts/verify.sh +143 -0
- package/skills/calendar-scheduling/SKILL.md +196 -0
- package/skills/calendar-scheduling/evals/README.md +14 -0
- package/skills/calendar-scheduling/evals/cases.yaml +45 -0
- package/skills/calendar-scheduling/references/google-calendar-sync.md +78 -0
- package/skills/calendar-scheduling/references/provider-matrix.md +71 -0
- package/skills/calendar-scheduling/scripts/verify.sh +117 -0
- package/skills/case-studies/SKILL.md +147 -0
- package/skills/case-studies/evals/README.md +3 -0
- package/skills/case-studies/evals/cases.yaml +63 -0
- package/skills/case-studies/references/case-study-skeleton.md +90 -0
- package/skills/case-studies/references/consent-and-substantiation.md +80 -0
- package/skills/case-studies/scripts/verify.sh +161 -0
- package/skills/chatbot/SKILL.md +168 -0
- package/skills/chatbot/evals/README.md +13 -0
- package/skills/chatbot/evals/cases.yaml +43 -0
- package/skills/chatbot/references/handoff-and-sales.md +71 -0
- package/skills/chatbot/references/system-prompt-and-guardrails.md +78 -0
- package/skills/chatbot/scripts/verify.sh +162 -0
- package/skills/chrome-extension/SKILL.md +169 -0
- package/skills/chrome-extension/evals/README.md +12 -0
- package/skills/chrome-extension/evals/cases.yaml +40 -0
- package/skills/chrome-extension/references/store-and-migration.md +84 -0
- package/skills/chrome-extension/scripts/verify.sh +62 -0
- package/skills/clarify/SKILL.md +159 -0
- package/skills/clarify/evals/README.md +70 -0
- package/skills/clarify/evals/cases.yaml +71 -0
- package/skills/clickhouse-analytics/SKILL.md +165 -0
- package/skills/clickhouse-analytics/evals/README.md +3 -0
- package/skills/clickhouse-analytics/evals/cases.yaml +45 -0
- package/skills/clickhouse-analytics/references/ingestion-and-mvs.md +109 -0
- package/skills/clickhouse-analytics/references/query-optimization.md +76 -0
- package/skills/clickhouse-analytics/references/schema-and-engines.md +63 -0
- package/skills/clickhouse-analytics/scripts/verify.sh +109 -0
- package/skills/client-onboarding/SKILL.md +254 -0
- package/skills/client-onboarding/evals/README.md +14 -0
- package/skills/client-onboarding/evals/cases.yaml +40 -0
- package/skills/client-onboarding/references/onboarding-playbook.md +126 -0
- package/skills/cloudflare/SKILL.md +191 -0
- package/skills/cloudflare/evals/README.md +15 -0
- package/skills/cloudflare/evals/cases.yaml +46 -0
- package/skills/cloudflare/references/storage-primitives.md +104 -0
- package/skills/cloudflare/references/wrangler-config.md +91 -0
- package/skills/cloudflare/scripts/verify.sh +133 -0
- package/skills/code-review/SKILL.md +143 -0
- package/skills/code-review/evals/README.md +3 -0
- package/skills/code-review/evals/cases.yaml +55 -0
- package/skills/code-review/references/pr-workflow.md +67 -0
- package/skills/codebase-onboarding/SKILL.md +133 -0
- package/skills/codebase-onboarding/evals/README.md +3 -0
- package/skills/codebase-onboarding/evals/cases.yaml +69 -0
- package/skills/codebase-onboarding/references/recon-playbook.md +57 -0
- package/skills/codebase-onboarding/scripts/verify.sh +54 -0
- package/skills/cold-outreach/SKILL.md +206 -0
- package/skills/cold-outreach/evals/README.md +3 -0
- package/skills/cold-outreach/evals/cases.yaml +60 -0
- package/skills/cold-outreach/references/compliance-footer.md +50 -0
- package/skills/cold-outreach/references/hook-derivation.md +73 -0
- package/skills/cold-outreach/references/templates.md +88 -0
- package/skills/cold-outreach/scripts/verify.sh +170 -0
- package/skills/community/SKILL.md +225 -0
- package/skills/community/evals/README.md +3 -0
- package/skills/community/evals/cases.yaml +40 -0
- package/skills/community/references/metrics-and-rituals.md +58 -0
- package/skills/community/references/platform-playbooks.md +64 -0
- package/skills/community/scripts/verify.sh +83 -0
- package/skills/competitor-watch/SKILL.md +193 -0
- package/skills/competitor-watch/evals/README.md +19 -0
- package/skills/competitor-watch/evals/cases.yaml +54 -0
- package/skills/competitor-watch/references/monitoring-config.md +124 -0
- package/skills/competitor-watch/references/tracker-schema.md +79 -0
- package/skills/competitor-watch/scripts/verify.sh +253 -0
- package/skills/compliance/SKILL.md +184 -0
- package/skills/compliance/evals/README.md +14 -0
- package/skills/compliance/evals/cases.yaml +46 -0
- package/skills/compliance/references/frameworks.md +108 -0
- package/skills/compliance/references/operating-rhythm.md +79 -0
- package/skills/compliance/scripts/verify.sh +168 -0
- package/skills/compose-multiplatform/SKILL.md +198 -0
- package/skills/compose-multiplatform/evals/README.md +3 -0
- package/skills/compose-multiplatform/evals/cases.yaml +40 -0
- package/skills/compose-multiplatform/references/ios-interop.md +91 -0
- package/skills/compose-multiplatform/references/project-setup.md +96 -0
- package/skills/compose-multiplatform/scripts/verify.sh +123 -0
- package/skills/constitution/SKILL.md +160 -0
- package/skills/constitution/evals/README.md +68 -0
- package/skills/constitution/evals/cases.yaml +72 -0
- package/skills/constitution/references/constitution-template.md +90 -0
- package/skills/content-engine/SKILL.md +164 -0
- package/skills/content-engine/evals/README.md +17 -0
- package/skills/content-engine/evals/cases.yaml +62 -0
- package/skills/content-engine/references/atomization.md +81 -0
- package/skills/content-engine/references/brief-and-pipeline.md +90 -0
- package/skills/content-engine/scripts/verify.sh +146 -0
- package/skills/context-budget/SKILL.md +132 -0
- package/skills/context-budget/evals/README.md +11 -0
- package/skills/context-budget/evals/cases.yaml +40 -0
- package/skills/context-budget/references/handoff-and-compaction.md +96 -0
- package/skills/continuous-learning/SKILL.md +136 -0
- package/skills/continuous-learning/evals/README.md +16 -0
- package/skills/continuous-learning/evals/cases.yaml +39 -0
- package/skills/continuous-learning/references/lesson-routing.md +106 -0
- package/skills/contracts/SKILL.md +124 -0
- package/skills/contracts/evals/README.md +3 -0
- package/skills/contracts/evals/cases.yaml +42 -0
- package/skills/contracts/references/clause-library.md +129 -0
- package/skills/contracts/references/review-playbook.md +49 -0
- package/skills/contracts/scripts/verify.sh +53 -0
- package/skills/coolify/SKILL.md +201 -0
- package/skills/coolify/evals/README.md +21 -0
- package/skills/coolify/evals/cases.yaml +46 -0
- package/skills/coolify/references/databases-and-backups.md +99 -0
- package/skills/coolify/references/deploy-recipes.md +105 -0
- package/skills/coolify/references/install-and-proxy.md +80 -0
- package/skills/coolify/scripts/verify.sh +123 -0
- package/skills/cost-tracking/SKILL.md +183 -0
- package/skills/cost-tracking/evals/README.md +3 -0
- package/skills/cost-tracking/evals/cases.yaml +45 -0
- package/skills/cost-tracking/references/cloud-caps.md +52 -0
- package/skills/cost-tracking/references/pricing-tables.md +51 -0
- package/skills/cost-tracking/scripts/verify.sh +135 -0
- package/skills/course-builder/SKILL.md +186 -0
- package/skills/course-builder/evals/README.md +16 -0
- package/skills/course-builder/evals/cases.yaml +49 -0
- package/skills/course-builder/references/assessment-design.md +74 -0
- package/skills/course-builder/references/grounding-and-scoping.md +69 -0
- package/skills/course-builder/references/outcomes-and-blooms.md +82 -0
- package/skills/course-builder/scripts/verify.sh +247 -0
- package/skills/course-storytelling/SKILL.md +205 -0
- package/skills/course-storytelling/evals/README.md +54 -0
- package/skills/course-storytelling/evals/cases.yaml +50 -0
- package/skills/course-storytelling/references/brunson-frameworks.md +190 -0
- package/skills/course-storytelling/references/concept-landing-recipe.md +136 -0
- package/skills/course-storytelling/references/course-analysis.md +124 -0
- package/skills/course-storytelling/references/learner-grounding.md +183 -0
- package/skills/course-storytelling/references/mental-models.md +115 -0
- package/skills/course-storytelling/scripts/verify.sh +223 -0
- package/skills/cpp/SKILL.md +349 -0
- package/skills/cpp/evals/README.md +14 -0
- package/skills/cpp/evals/cases.yaml +44 -0
- package/skills/cpp/references/cmake.md +167 -0
- package/skills/cpp/references/move-and-templates.md +130 -0
- package/skills/cpp/references/undefined-behavior.md +86 -0
- package/skills/cpp/scripts/verify.sh +165 -0
- package/skills/csharp-dotnet/SKILL.md +291 -0
- package/skills/csharp-dotnet/evals/README.md +3 -0
- package/skills/csharp-dotnet/evals/cases.yaml +48 -0
- package/skills/csharp-dotnet/references/aspnetcore.md +99 -0
- package/skills/csharp-dotnet/references/async.md +82 -0
- package/skills/csharp-dotnet/references/efcore.md +96 -0
- package/skills/csharp-dotnet/scripts/verify.sh +90 -0
- package/skills/customer-support/SKILL.md +193 -0
- package/skills/customer-support/evals/README.md +13 -0
- package/skills/customer-support/evals/cases.yaml +61 -0
- package/skills/customer-support/references/macros-and-sla.md +142 -0
- package/skills/dashboard/SKILL.md +205 -0
- package/skills/dashboard/evals/README.md +3 -0
- package/skills/dashboard/evals/cases.yaml +50 -0
- package/skills/dashboard/references/chart-selection.md +34 -0
- package/skills/dashboard/references/tile-schema.md +164 -0
- package/skills/dashboard/scripts/verify.sh +130 -0
- package/skills/data-cleaning/SKILL.md +285 -0
- package/skills/data-cleaning/evals/README.md +16 -0
- package/skills/data-cleaning/evals/cases.yaml +57 -0
- package/skills/data-cleaning/references/normalization-recipes.md +136 -0
- package/skills/data-cleaning/references/validation-patterns.md +134 -0
- package/skills/data-cleaning/scripts/verify.sh +115 -0
- package/skills/data-policy/SKILL.md +163 -0
- package/skills/data-policy/evals/README.md +15 -0
- package/skills/data-policy/evals/cases.yaml +44 -0
- package/skills/data-policy/references/consent-and-ropa.md +97 -0
- package/skills/data-policy/references/retention-schedule.md +83 -0
- package/skills/data-policy/scripts/verify.sh +143 -0
- package/skills/data-scraper/SKILL.md +134 -0
- package/skills/data-scraper/evals/README.md +3 -0
- package/skills/data-scraper/evals/cases.yaml +46 -0
- package/skills/data-scraper/references/anti-bot.md +85 -0
- package/skills/data-scraper/references/frameworks.md +116 -0
- package/skills/data-scraper/references/legal-compliance.md +59 -0
- package/skills/data-scraper/scripts/verify.sh +166 -0
- package/skills/db-migrations/SKILL.md +254 -0
- package/skills/db-migrations/evals/README.md +10 -0
- package/skills/db-migrations/evals/cases.yaml +46 -0
- package/skills/db-migrations/references/backfill-and-batching.md +105 -0
- package/skills/db-migrations/references/expand-contract-playbook.md +152 -0
- package/skills/db-migrations/references/tools-and-runners.md +88 -0
- package/skills/db-migrations/scripts/verify.sh +112 -0
- package/skills/debug/SKILL.md +227 -0
- package/skills/debug/evals/README.md +88 -0
- package/skills/debug/evals/cases.yaml +74 -0
- package/skills/decision-records/SKILL.md +189 -0
- package/skills/decision-records/evals/README.md +3 -0
- package/skills/decision-records/evals/cases.yaml +43 -0
- package/skills/decision-records/references/templates.md +232 -0
- package/skills/decision-records/scripts/verify.sh +105 -0
- package/skills/deployment/SKILL.md +439 -0
- package/skills/deployment/evals/README.md +50 -0
- package/skills/deployment/evals/cases.yaml +53 -0
- package/skills/deployment/references/coolify.md +216 -0
- package/skills/deployment/references/dockerfiles-by-stack.md +319 -0
- package/skills/deployment/references/github-actions.md +295 -0
- package/skills/deployment/references/hosting-targets.md +272 -0
- package/skills/deployment/scripts/verify.sh +134 -0
- package/skills/design/SKILL.md +399 -0
- package/skills/design/evals/README.md +53 -0
- package/skills/design/evals/cases.yaml +56 -0
- package/skills/design/references/brand-grounding.md +187 -0
- package/skills/design/references/copywriting-frameworks.md +138 -0
- package/skills/design/references/landing-anatomy-and-cro.md +202 -0
- package/skills/design/references/motion-and-interaction.md +182 -0
- package/skills/design/references/research-method.md +147 -0
- package/skills/design/references/signature-and-craft.md +148 -0
- package/skills/design/references/trends-2026.md +80 -0
- package/skills/design/references/visual-system.md +236 -0
- package/skills/design/scripts/verify.sh +248 -0
- package/skills/digitalocean/SKILL.md +251 -0
- package/skills/digitalocean/evals/README.md +10 -0
- package/skills/digitalocean/evals/cases.yaml +37 -0
- package/skills/digitalocean/references/app-spec.md +126 -0
- package/skills/digitalocean/references/droplet-ops.md +95 -0
- package/skills/digitalocean/scripts/verify.sh +102 -0
- package/skills/django/SKILL.md +268 -0
- package/skills/django/evals/README.md +11 -0
- package/skills/django/evals/cases.yaml +47 -0
- package/skills/django/references/drf.md +109 -0
- package/skills/django/references/orm-performance.md +91 -0
- package/skills/django/references/security.md +81 -0
- package/skills/django/references/testing.md +86 -0
- package/skills/django/scripts/verify.sh +115 -0
- package/skills/docker/SKILL.md +283 -0
- package/skills/docker/evals/README.md +10 -0
- package/skills/docker/evals/cases.yaml +44 -0
- package/skills/docker/references/base-images-and-stages.md +104 -0
- package/skills/docker/references/compose-recipes.md +109 -0
- package/skills/docker/scripts/verify.sh +149 -0
- package/skills/document-processing/SKILL.md +214 -0
- package/skills/document-processing/evals/README.md +3 -0
- package/skills/document-processing/evals/cases.yaml +65 -0
- package/skills/document-processing/references/engines.md +67 -0
- package/skills/document-processing/scripts/verify.sh +172 -0
- package/skills/domains-dns/SKILL.md +146 -0
- package/skills/domains-dns/evals/README.md +16 -0
- package/skills/domains-dns/evals/cases.yaml +47 -0
- package/skills/domains-dns/references/record-cookbook.md +94 -0
- package/skills/domains-dns/references/tls-and-acme.md +90 -0
- package/skills/domains-dns/references/verify-and-debug.md +64 -0
- package/skills/domains-dns/scripts/verify.sh +163 -0
- package/skills/drizzle-orm/SKILL.md +234 -0
- package/skills/drizzle-orm/evals/README.md +12 -0
- package/skills/drizzle-orm/evals/cases.yaml +47 -0
- package/skills/drizzle-orm/references/relations-and-drivers.md +118 -0
- package/skills/drizzle-orm/scripts/verify.sh +155 -0
- package/skills/duckdb/SKILL.md +207 -0
- package/skills/duckdb/evals/README.md +31 -0
- package/skills/duckdb/evals/cases.yaml +41 -0
- package/skills/duckdb/references/python-and-interop.md +105 -0
- package/skills/duckdb/references/remote-and-lakehouse.md +101 -0
- package/skills/duckdb/scripts/verify.sh +71 -0
- package/skills/dynamodb/SKILL.md +217 -0
- package/skills/dynamodb/evals/README.md +8 -0
- package/skills/dynamodb/evals/cases.yaml +46 -0
- package/skills/dynamodb/references/access-patterns.md +127 -0
- package/skills/dynamodb/references/capacity-and-limits.md +78 -0
- package/skills/dynamodb/scripts/verify.sh +108 -0
- package/skills/e-signature/SKILL.md +185 -0
- package/skills/e-signature/evals/README.md +3 -0
- package/skills/e-signature/evals/cases.yaml +44 -0
- package/skills/e-signature/references/docusign.md +83 -0
- package/skills/e-signature/references/dropbox-sign.md +73 -0
- package/skills/e-signature/references/legal-tiers.md +37 -0
- package/skills/e-signature/scripts/verify.sh +81 -0
- package/skills/e2e-testing/SKILL.md +243 -0
- package/skills/e2e-testing/evals/README.md +10 -0
- package/skills/e2e-testing/evals/cases.yaml +64 -0
- package/skills/e2e-testing/references/config-and-ci.md +156 -0
- package/skills/e2e-testing/references/flakiness-playbook.md +124 -0
- package/skills/e2e-testing/scripts/verify.sh +117 -0
- package/skills/electron/SKILL.md +221 -0
- package/skills/electron/evals/README.md +13 -0
- package/skills/electron/evals/cases.yaml +38 -0
- package/skills/electron/references/packaging-and-updates.md +122 -0
- package/skills/electron/references/security-and-ipc.md +158 -0
- package/skills/electron/scripts/verify.sh +143 -0
- package/skills/elixir/SKILL.md +217 -0
- package/skills/elixir/evals/README.md +3 -0
- package/skills/elixir/evals/cases.yaml +41 -0
- package/skills/elixir/references/mix-and-releases.md +91 -0
- package/skills/elixir/references/otp-patterns.md +96 -0
- package/skills/elixir/scripts/verify.sh +76 -0
- package/skills/email-connector/SKILL.md +294 -0
- package/skills/email-connector/evals/README.md +19 -0
- package/skills/email-connector/evals/cases.yaml +39 -0
- package/skills/email-connector/references/providers.md +107 -0
- package/skills/email-connector/scripts/verify.sh +72 -0
- package/skills/email-deliverability/SKILL.md +168 -0
- package/skills/email-deliverability/evals/README.md +21 -0
- package/skills/email-deliverability/evals/cases.yaml +45 -0
- package/skills/email-deliverability/scripts/verify.sh +98 -0
- package/skills/embeddings-search/SKILL.md +193 -0
- package/skills/embeddings-search/evals/README.md +10 -0
- package/skills/embeddings-search/evals/cases.yaml +44 -0
- package/skills/embeddings-search/references/evaluation.md +86 -0
- package/skills/embeddings-search/references/models.md +73 -0
- package/skills/embeddings-search/scripts/verify.sh +103 -0
- package/skills/error-handling/SKILL.md +307 -0
- package/skills/error-handling/evals/README.md +12 -0
- package/skills/error-handling/evals/cases.yaml +46 -0
- package/skills/error-handling/references/boundaries-and-messaging.md +120 -0
- package/skills/error-handling/references/retry-and-resilience.md +154 -0
- package/skills/error-handling/scripts/verify.sh +110 -0
- package/skills/expo/SKILL.md +253 -0
- package/skills/expo/evals/README.md +13 -0
- package/skills/expo/evals/cases.yaml +44 -0
- package/skills/expo/references/config-plugins.md +117 -0
- package/skills/expo/references/eas-update.md +118 -0
- package/skills/expo/scripts/verify.sh +132 -0
- package/skills/fal/SKILL.md +210 -0
- package/skills/fal/evals/README.md +3 -0
- package/skills/fal/evals/cases.yaml +42 -0
- package/skills/fal/references/models-and-cost.md +53 -0
- package/skills/fal/references/queue-and-webhooks.md +153 -0
- package/skills/fal/scripts/verify.sh +72 -0
- package/skills/fastapi/SKILL.md +499 -0
- package/skills/fastapi/evals/README.md +50 -0
- package/skills/fastapi/evals/cases.yaml +55 -0
- package/skills/fastapi/references/database.md +347 -0
- package/skills/fastapi/references/production.md +338 -0
- package/skills/fastapi/references/security.md +330 -0
- package/skills/fastapi/references/testing.md +349 -0
- package/skills/fastapi/scripts/verify.sh +116 -0
- package/skills/finance-ops/SKILL.md +149 -0
- package/skills/finance-ops/evals/README.md +3 -0
- package/skills/finance-ops/evals/cases.yaml +39 -0
- package/skills/finance-ops/references/cash-flow-forecast.md +57 -0
- package/skills/finance-ops/references/month-close.md +59 -0
- package/skills/finance-ops/references/reconciliation.md +65 -0
- package/skills/finance-ops/scripts/verify.sh +166 -0
- package/skills/financial-model/SKILL.md +170 -0
- package/skills/financial-model/evals/README.md +3 -0
- package/skills/financial-model/evals/cases.yaml +53 -0
- package/skills/financial-model/references/benchmarks-and-scenarios.md +55 -0
- package/skills/financial-model/references/model-structure.md +67 -0
- package/skills/financial-model/references/revenue-build.md +68 -0
- package/skills/financial-model/scripts/verify.sh +232 -0
- package/skills/firebase/SKILL.md +251 -0
- package/skills/firebase/evals/README.md +12 -0
- package/skills/firebase/evals/cases.yaml +45 -0
- package/skills/firebase/references/cloud-functions.md +102 -0
- package/skills/firebase/references/data-modeling.md +108 -0
- package/skills/firebase/references/security-rules.md +137 -0
- package/skills/firebase/scripts/verify.sh +98 -0
- package/skills/flutter/SKILL.md +448 -0
- package/skills/flutter/evals/README.md +54 -0
- package/skills/flutter/evals/cases.yaml +69 -0
- package/skills/flutter/references/architecture-and-state.md +499 -0
- package/skills/flutter/references/i18n-and-dependencies.md +197 -0
- package/skills/flutter/references/performance.md +299 -0
- package/skills/flutter/references/testing.md +385 -0
- package/skills/flutter/references/ui-and-navigation.md +378 -0
- package/skills/flutter/scripts/verify.sh +104 -0
- package/skills/fly-io/SKILL.md +206 -0
- package/skills/fly-io/evals/README.md +3 -0
- package/skills/fly-io/evals/cases.yaml +42 -0
- package/skills/fly-io/references/fly-toml.md +155 -0
- package/skills/fly-io/references/multi-region.md +66 -0
- package/skills/fly-io/scripts/verify.sh +90 -0
- package/skills/forecasting/SKILL.md +139 -0
- package/skills/forecasting/evals/README.md +13 -0
- package/skills/forecasting/evals/cases.yaml +47 -0
- package/skills/forecasting/references/accuracy-and-backtesting.md +104 -0
- package/skills/forecasting/references/methods-cheatsheet.md +94 -0
- package/skills/forecasting/scripts/verify.sh +99 -0
- package/skills/fundraising/SKILL.md +162 -0
- package/skills/fundraising/evals/README.md +18 -0
- package/skills/fundraising/evals/cases.yaml +76 -0
- package/skills/fundraising/references/funnel-math.md +90 -0
- package/skills/fundraising/references/process-playbook.md +97 -0
- package/skills/gcp-essentials/SKILL.md +327 -0
- package/skills/gcp-essentials/evals/README.md +12 -0
- package/skills/gcp-essentials/evals/cases.yaml +38 -0
- package/skills/gcp-essentials/references/deploy-recipes.md +81 -0
- package/skills/gcp-essentials/references/iam-and-auth.md +94 -0
- package/skills/gcp-essentials/references/networking-and-sql.md +74 -0
- package/skills/gcp-essentials/scripts/verify.sh +158 -0
- package/skills/gdpr-privacy/SKILL.md +167 -0
- package/skills/gdpr-privacy/evals/README.md +3 -0
- package/skills/gdpr-privacy/evals/cases.yaml +47 -0
- package/skills/gdpr-privacy/references/dpa-and-transfers.md +63 -0
- package/skills/gdpr-privacy/references/dsar-and-consent.md +83 -0
- package/skills/gdpr-privacy/references/privacy-policy-blueprint.md +99 -0
- package/skills/gdpr-privacy/scripts/verify.sh +84 -0
- package/skills/git-workflow/SKILL.md +190 -0
- package/skills/git-workflow/evals/README.md +10 -0
- package/skills/git-workflow/evals/cases.yaml +47 -0
- package/skills/git-workflow/references/interactive-rebase.md +89 -0
- package/skills/github-actions/SKILL.md +256 -0
- package/skills/github-actions/evals/README.md +3 -0
- package/skills/github-actions/evals/cases.yaml +45 -0
- package/skills/github-actions/references/caching-and-matrix.md +92 -0
- package/skills/github-actions/references/oidc-deploys.md +130 -0
- package/skills/github-actions/scripts/verify.sh +105 -0
- package/skills/go/SKILL.md +438 -0
- package/skills/go/evals/README.md +56 -0
- package/skills/go/evals/cases.yaml +55 -0
- package/skills/go/references/concurrency.md +557 -0
- package/skills/go/references/http-services.md +529 -0
- package/skills/go/references/testing.md +338 -0
- package/skills/go/scripts/verify.sh +109 -0
- package/skills/google-workspace/SKILL.md +287 -0
- package/skills/google-workspace/evals/README.md +16 -0
- package/skills/google-workspace/evals/cases.yaml +44 -0
- package/skills/google-workspace/references/api-recipes.md +148 -0
- package/skills/google-workspace/references/auth-setup.md +100 -0
- package/skills/google-workspace/scripts/verify.sh +128 -0
- package/skills/grants/SKILL.md +171 -0
- package/skills/grants/evals/README.md +3 -0
- package/skills/grants/evals/cases.yaml +69 -0
- package/skills/grants/references/budget-justification.md +71 -0
- package/skills/grants/references/jurisdictions.md +35 -0
- package/skills/grants/references/logic-model.md +66 -0
- package/skills/grants/scripts/verify.sh +193 -0
- package/skills/harness/SKILL.md +329 -0
- package/skills/harness/assets/_TEMPLATE/.env.example +8 -0
- package/skills/harness/assets/_TEMPLATE/CREDENTIALS.md +25 -0
- package/skills/harness/assets/_TEMPLATE/README.md +25 -0
- package/skills/harness/assets/_TEMPLATE/test_connection.sh +30 -0
- package/skills/harness/evals/README.md +54 -0
- package/skills/harness/evals/cases.yaml +72 -0
- package/skills/harness/examples/audit-example.md +120 -0
- package/skills/harness/references/agents-md-template.md +41 -0
- package/skills/harness/references/audit-report-template.html +140 -0
- package/skills/harness/references/audit-report-template.md +116 -0
- package/skills/harness/references/claude-md-template.md +98 -0
- package/skills/harness/references/inbox-readme-template.md +51 -0
- package/skills/harness/references/ingest-formats.md +185 -0
- package/skills/harness/references/providers.yaml +3410 -0
- package/skills/harness/references/tools-readme-template.md +88 -0
- package/skills/harness/references/wiki-archive-template.html +81 -0
- package/skills/harness/references/wiki-article-template.md +20 -0
- package/skills/harness/references/wiki-dashboard-template.html +136 -0
- package/skills/harness/references/wiki-deep-improve-report-template.html +126 -0
- package/skills/harness/references/wiki-gaps-template.md +18 -0
- package/skills/harness/references/wiki-index-template.md +23 -0
- package/skills/harness/references/wiki-protocol.md +699 -0
- package/skills/harness/references/wiki-raw-template.md +7 -0
- package/skills/hetzner/SKILL.md +221 -0
- package/skills/hetzner/evals/README.md +35 -0
- package/skills/hetzner/evals/cases.yaml +46 -0
- package/skills/hetzner/references/cloud-init.md +120 -0
- package/skills/hetzner/references/plans-and-locations.md +56 -0
- package/skills/hetzner/scripts/verify.sh +122 -0
- package/skills/hiring/SKILL.md +248 -0
- package/skills/hiring/evals/README.md +13 -0
- package/skills/hiring/evals/cases.yaml +41 -0
- package/skills/hiring/references/templates.md +118 -0
- package/skills/htmx/SKILL.md +261 -0
- package/skills/htmx/evals/README.md +3 -0
- package/skills/htmx/evals/cases.yaml +38 -0
- package/skills/htmx/references/patterns.md +113 -0
- package/skills/htmx/references/server-contract.md +91 -0
- package/skills/htmx/scripts/verify.sh +93 -0
- package/skills/huggingface/SKILL.md +190 -0
- package/skills/huggingface/evals/README.md +11 -0
- package/skills/huggingface/evals/cases.yaml +41 -0
- package/skills/huggingface/references/endpoints-and-spaces.md +99 -0
- package/skills/huggingface/references/hub-and-cli.md +85 -0
- package/skills/huggingface/references/inference-providers.md +115 -0
- package/skills/huggingface/scripts/verify.sh +123 -0
- package/skills/implement/SKILL.md +283 -0
- package/skills/implement/evals/README.md +56 -0
- package/skills/implement/evals/cases.yaml +43 -0
- package/skills/init/SKILL.md +184 -0
- package/skills/init/evals/README.md +49 -0
- package/skills/init/evals/cases.yaml +74 -0
- package/skills/init/references/accompaniment-and-profile.md +140 -0
- package/skills/init/references/discovery.md +90 -0
- package/skills/init/references/recommend-skills.md +115 -0
- package/skills/init/scripts/verify.sh +122 -0
- package/skills/instagram-api/SKILL.md +241 -0
- package/skills/instagram-api/evals/README.md +3 -0
- package/skills/instagram-api/evals/cases.yaml +43 -0
- package/skills/instagram-api/references/insights-metrics.md +88 -0
- package/skills/instagram-api/references/publish-reel.md +98 -0
- package/skills/instagram-api/scripts/verify.sh +137 -0
- package/skills/inventory/SKILL.md +131 -0
- package/skills/inventory/evals/README.md +3 -0
- package/skills/inventory/evals/cases.yaml +43 -0
- package/skills/inventory/references/abc-xyz.md +52 -0
- package/skills/inventory/references/ddmrp.md +32 -0
- package/skills/inventory/references/reorder-policies.md +85 -0
- package/skills/inventory/references/safety-stock.md +63 -0
- package/skills/inventory/scripts/verify.sh +155 -0
- package/skills/investor-materials/SKILL.md +175 -0
- package/skills/investor-materials/evals/README.md +15 -0
- package/skills/investor-materials/evals/cases.yaml +60 -0
- package/skills/investor-materials/references/dataroom-checklist.md +134 -0
- package/skills/investor-materials/references/update-and-onepager-templates.md +152 -0
- package/skills/investor-materials/scripts/verify.sh +148 -0
- package/skills/invoicing/SKILL.md +154 -0
- package/skills/invoicing/evals/README.md +5 -0
- package/skills/invoicing/evals/cases.yaml +49 -0
- package/skills/invoicing/references/dunning-ladder.md +53 -0
- package/skills/invoicing/references/e-invoicing-mandates.md +43 -0
- package/skills/invoicing/scripts/fixtures/broken-invoice.json +13 -0
- package/skills/invoicing/scripts/fixtures/valid-invoice.json +15 -0
- package/skills/invoicing/scripts/verify.sh +133 -0
- package/skills/ip-trademark/SKILL.md +186 -0
- package/skills/ip-trademark/evals/README.md +10 -0
- package/skills/ip-trademark/evals/cases.yaml +47 -0
- package/skills/ip-trademark/references/jurisdictions.md +63 -0
- package/skills/ip-trademark/references/ownership-and-licensing.md +90 -0
- package/skills/java/SKILL.md +341 -0
- package/skills/java/evals/README.md +23 -0
- package/skills/java/evals/cases.yaml +43 -0
- package/skills/java/references/builds.md +133 -0
- package/skills/java/references/concurrency.md +108 -0
- package/skills/java/references/streams.md +102 -0
- package/skills/java/scripts/verify.sh +107 -0
- package/skills/knowledge-ops/SKILL.md +125 -0
- package/skills/knowledge-ops/evals/README.md +16 -0
- package/skills/knowledge-ops/evals/cases.yaml +50 -0
- package/skills/knowledge-ops/references/gardening-playbook.md +116 -0
- package/skills/kotlin-android/SKILL.md +245 -0
- package/skills/kotlin-android/evals/README.md +13 -0
- package/skills/kotlin-android/evals/cases.yaml +56 -0
- package/skills/kotlin-android/references/architecture.md +200 -0
- package/skills/kotlin-android/references/gradle-setup.md +125 -0
- package/skills/kotlin-android/scripts/verify.sh +109 -0
- package/skills/kpi-framework/SKILL.md +199 -0
- package/skills/kpi-framework/evals/README.md +11 -0
- package/skills/kpi-framework/evals/cases.yaml +42 -0
- package/skills/kpi-framework/references/definition-and-targets.md +64 -0
- package/skills/kpi-framework/references/metric-catalog.md +84 -0
- package/skills/landing-copy/SKILL.md +153 -0
- package/skills/landing-copy/evals/README.md +18 -0
- package/skills/landing-copy/evals/cases.yaml +63 -0
- package/skills/landing-copy/references/frameworks.md +61 -0
- package/skills/landing-copy/references/page-skeleton.md +92 -0
- package/skills/landing-copy/scripts/verify.sh +164 -0
- package/skills/laravel/SKILL.md +301 -0
- package/skills/laravel/evals/README.md +10 -0
- package/skills/laravel/evals/cases.yaml +45 -0
- package/skills/laravel/references/eloquent-patterns.md +126 -0
- package/skills/laravel/references/queues-and-scheduling.md +153 -0
- package/skills/laravel/scripts/verify.sh +128 -0
- package/skills/lead-gen/SKILL.md +155 -0
- package/skills/lead-gen/evals/README.md +3 -0
- package/skills/lead-gen/evals/cases.yaml +43 -0
- package/skills/lead-gen/references/data-sources.md +87 -0
- package/skills/lead-gen/references/scoring-model.md +93 -0
- package/skills/lead-gen/scripts/verify.sh +179 -0
- package/skills/linkedin-api/SKILL.md +211 -0
- package/skills/linkedin-api/evals/README.md +3 -0
- package/skills/linkedin-api/evals/cases.yaml +41 -0
- package/skills/linkedin-api/references/api-reference.md +168 -0
- package/skills/linkedin-api/scripts/verify.sh +98 -0
- package/skills/linkedin-carousels/SKILL.md +239 -0
- package/skills/linkedin-carousels/evals/README.md +13 -0
- package/skills/linkedin-carousels/evals/cases.yaml +62 -0
- package/skills/linkedin-carousels/references/carousel-patterns.md +200 -0
- package/skills/linkedin-carousels/scripts/verify.sh +160 -0
- package/skills/linkedin-content/SKILL.md +162 -0
- package/skills/linkedin-content/evals/README.md +13 -0
- package/skills/linkedin-content/evals/cases.yaml +62 -0
- package/skills/linkedin-content/references/hooks-and-formats.md +114 -0
- package/skills/linkedin-content/scripts/verify.sh +154 -0
- package/skills/linkedin-outreach/SKILL.md +174 -0
- package/skills/linkedin-outreach/evals/README.md +3 -0
- package/skills/linkedin-outreach/evals/cases.yaml +43 -0
- package/skills/linkedin-outreach/references/ledger-schema.md +48 -0
- package/skills/linkedin-outreach/references/sales-navigator-playbook.md +61 -0
- package/skills/linkedin-outreach/scripts/verify.sh +120 -0
- package/skills/linkedin-strategy/SKILL.md +167 -0
- package/skills/linkedin-strategy/evals/README.md +3 -0
- package/skills/linkedin-strategy/evals/cases.yaml +49 -0
- package/skills/linkedin-strategy/references/ssi-and-pillars.md +59 -0
- package/skills/linkedin-strategy/references/wiki-records.md +62 -0
- package/skills/linkedin-strategy/scripts/verify.sh +120 -0
- package/skills/llm-pipeline/SKILL.md +155 -0
- package/skills/llm-pipeline/evals/README.md +3 -0
- package/skills/llm-pipeline/evals/cases.yaml +44 -0
- package/skills/llm-pipeline/references/caching-layers.md +60 -0
- package/skills/llm-pipeline/references/litellm-router.md +101 -0
- package/skills/llm-pipeline/scripts/verify.sh +169 -0
- package/skills/logistics-ops/SKILL.md +219 -0
- package/skills/logistics-ops/evals/README.md +20 -0
- package/skills/logistics-ops/evals/cases.yaml +48 -0
- package/skills/logistics-ops/references/carriers-and-claims.md +105 -0
- package/skills/market-research/SKILL.md +145 -0
- package/skills/market-research/evals/README.md +3 -0
- package/skills/market-research/evals/cases.yaml +48 -0
- package/skills/market-research/references/demand-signals.md +63 -0
- package/skills/market-research/references/sizing-playbook.md +121 -0
- package/skills/market-research/scripts/verify.sh +215 -0
- package/skills/marketing/SKILL.md +233 -0
- package/skills/marketing/evals/README.md +61 -0
- package/skills/marketing/evals/cases.yaml +84 -0
- package/skills/marketing/references/brand-grounding.md +197 -0
- package/skills/marketing/references/campaigns-and-channels.md +151 -0
- package/skills/marketing/references/copy-frameworks.md +166 -0
- package/skills/marketing/references/landing-copy.md +191 -0
- package/skills/marketing/references/seo-geo.md +391 -0
- package/skills/marketing/scripts/seo_audit.py +166 -0
- package/skills/marketing/scripts/verify.sh +233 -0
- package/skills/medium-publishing/SKILL.md +152 -0
- package/skills/medium-publishing/evals/README.md +3 -0
- package/skills/medium-publishing/evals/cases.yaml +42 -0
- package/skills/medium-publishing/references/cross-post-and-canonical.md +65 -0
- package/skills/medium-publishing/references/legacy-api.md +100 -0
- package/skills/medium-strategy/SKILL.md +161 -0
- package/skills/medium-strategy/evals/README.md +3 -0
- package/skills/medium-strategy/evals/cases.yaml +50 -0
- package/skills/medium-strategy/references/distribution-and-boost.md +65 -0
- package/skills/medium-strategy/references/wiki-records.md +60 -0
- package/skills/medium-strategy/scripts/verify.sh +118 -0
- package/skills/medium-writing/SKILL.md +140 -0
- package/skills/medium-writing/evals/README.md +5 -0
- package/skills/medium-writing/evals/cases.yaml +39 -0
- package/skills/medium-writing/references/title-patterns.md +79 -0
- package/skills/meeting-notes/SKILL.md +168 -0
- package/skills/meeting-notes/evals/README.md +14 -0
- package/skills/meeting-notes/evals/cases.yaml +46 -0
- package/skills/meeting-notes/references/templates.md +140 -0
- package/skills/modal/SKILL.md +307 -0
- package/skills/modal/evals/README.md +29 -0
- package/skills/modal/evals/cases.yaml +50 -0
- package/skills/modal/references/images-gpu-cookbook.md +160 -0
- package/skills/modal/references/web-and-scaling.md +138 -0
- package/skills/modal/scripts/verify.sh +127 -0
- package/skills/mongodb/SKILL.md +342 -0
- package/skills/mongodb/evals/README.md +29 -0
- package/skills/mongodb/evals/cases.yaml +41 -0
- package/skills/mongodb/references/aggregation.md +115 -0
- package/skills/mongodb/references/data-modeling.md +135 -0
- package/skills/mongodb/references/transactions-and-ops.md +128 -0
- package/skills/mongodb/scripts/verify.sh +151 -0
- package/skills/monitoring/SKILL.md +155 -0
- package/skills/monitoring/evals/README.md +3 -0
- package/skills/monitoring/evals/cases.yaml +47 -0
- package/skills/monitoring/references/burn-rate-and-oncall.md +128 -0
- package/skills/monitoring/references/tool-setup.md +154 -0
- package/skills/monitoring/scripts/verify.sh +145 -0
- package/skills/mysql/SKILL.md +249 -0
- package/skills/mysql/evals/README.md +12 -0
- package/skills/mysql/evals/cases.yaml +49 -0
- package/skills/mysql/references/indexing-and-explain.md +161 -0
- package/skills/mysql/references/mysql-vs-mariadb.md +78 -0
- package/skills/mysql/references/online-ddl-and-migrations.md +120 -0
- package/skills/mysql/references/replication-and-ha.md +115 -0
- package/skills/mysql/scripts/verify.sh +141 -0
- package/skills/neon/SKILL.md +218 -0
- package/skills/neon/evals/README.md +11 -0
- package/skills/neon/evals/cases.yaml +45 -0
- package/skills/neon/references/branching-ci.md +86 -0
- package/skills/neon/scripts/verify.sh +78 -0
- package/skills/nestjs/SKILL.md +225 -0
- package/skills/nestjs/evals/README.md +3 -0
- package/skills/nestjs/evals/cases.yaml +38 -0
- package/skills/nestjs/references/cross-cutting.md +135 -0
- package/skills/nestjs/references/testing-recipes.md +105 -0
- package/skills/nestjs/scripts/verify.sh +98 -0
- package/skills/netlify/SKILL.md +208 -0
- package/skills/netlify/evals/README.md +13 -0
- package/skills/netlify/evals/cases.yaml +43 -0
- package/skills/netlify/references/functions.md +97 -0
- package/skills/netlify/references/netlify-toml.md +115 -0
- package/skills/netlify/scripts/verify.sh +95 -0
- package/skills/newsletter/SKILL.md +162 -0
- package/skills/newsletter/evals/README.md +12 -0
- package/skills/newsletter/evals/cases.yaml +42 -0
- package/skills/newsletter/references/growth-loops.md +73 -0
- package/skills/newsletter/references/welcome-sequence.md +62 -0
- package/skills/newsletter/scripts/verify.sh +173 -0
- package/skills/nextjs/SKILL.md +472 -0
- package/skills/nextjs/evals/README.md +59 -0
- package/skills/nextjs/evals/cases.yaml +56 -0
- package/skills/nextjs/references/data-and-caching.md +309 -0
- package/skills/nextjs/references/metadata.md +208 -0
- package/skills/nextjs/references/performance.md +325 -0
- package/skills/nextjs/references/react.md +383 -0
- package/skills/nextjs/references/security.md +239 -0
- package/skills/nextjs/references/testing.md +290 -0
- package/skills/nextjs/scripts/verify.sh +141 -0
- package/skills/no-code-app/SKILL.md +153 -0
- package/skills/no-code-app/evals/README.md +3 -0
- package/skills/no-code-app/evals/cases.yaml +43 -0
- package/skills/no-code-app/references/platform-limits.md +100 -0
- package/skills/nodejs/SKILL.md +242 -0
- package/skills/nodejs/evals/README.md +3 -0
- package/skills/nodejs/evals/cases.yaml +39 -0
- package/skills/nodejs/references/express5-migration.md +53 -0
- package/skills/nodejs/references/graceful-shutdown.md +73 -0
- package/skills/nodejs/scripts/verify.sh +122 -0
- package/skills/notion-connector/SKILL.md +234 -0
- package/skills/notion-connector/evals/README.md +15 -0
- package/skills/notion-connector/evals/cases.yaml +45 -0
- package/skills/notion-connector/references/api-versions.md +63 -0
- package/skills/notion-connector/references/property-shapes.md +110 -0
- package/skills/notion-connector/references/sync-patterns.md +95 -0
- package/skills/notion-connector/scripts/verify.sh +162 -0
- package/skills/observability/SKILL.md +231 -0
- package/skills/observability/evals/README.md +3 -0
- package/skills/observability/evals/cases.yaml +49 -0
- package/skills/observability/references/collector-config.md +98 -0
- package/skills/observability/references/instrumentation-recipes.md +115 -0
- package/skills/observability/scripts/verify.sh +156 -0
- package/skills/ollama/SKILL.md +213 -0
- package/skills/ollama/evals/README.md +9 -0
- package/skills/ollama/evals/cases.yaml +43 -0
- package/skills/ollama/references/api.md +148 -0
- package/skills/ollama/references/hardware-sizing.md +87 -0
- package/skills/ollama/scripts/verify.sh +116 -0
- package/skills/orient/SKILL.md +54 -0
- package/skills/orient/evals/README.md +16 -0
- package/skills/orient/evals/cases.yaml +57 -0
- package/skills/orient/references/orientation-contract.md +34 -0
- package/skills/parallel/SKILL.md +198 -0
- package/skills/parallel/evals/README.md +62 -0
- package/skills/parallel/evals/cases.yaml +44 -0
- package/skills/people-ops/SKILL.md +122 -0
- package/skills/people-ops/evals/README.md +14 -0
- package/skills/people-ops/evals/cases.yaml +43 -0
- package/skills/people-ops/references/templates.md +129 -0
- package/skills/performance/SKILL.md +221 -0
- package/skills/performance/evals/README.md +3 -0
- package/skills/performance/evals/cases.yaml +47 -0
- package/skills/performance/references/profiling-playbook.md +54 -0
- package/skills/performance/scripts/verify.sh +94 -0
- package/skills/phoenix/SKILL.md +169 -0
- package/skills/phoenix/evals/README.md +3 -0
- package/skills/phoenix/evals/cases.yaml +40 -0
- package/skills/phoenix/references/auth-and-scopes.md +82 -0
- package/skills/phoenix/references/ecto-patterns.md +93 -0
- package/skills/phoenix/references/liveview.md +134 -0
- package/skills/phoenix/scripts/verify.sh +73 -0
- package/skills/php/SKILL.md +397 -0
- package/skills/php/evals/README.md +12 -0
- package/skills/php/evals/cases.yaml +45 -0
- package/skills/php/references/tooling.md +170 -0
- package/skills/php/references/type-system.md +220 -0
- package/skills/php/scripts/verify.sh +155 -0
- package/skills/pitch-deck/SKILL.md +209 -0
- package/skills/pitch-deck/evals/README.md +15 -0
- package/skills/pitch-deck/evals/cases.yaml +55 -0
- package/skills/pitch-deck/references/numbers-that-matter.md +78 -0
- package/skills/pitch-deck/references/slide-spine.md +149 -0
- package/skills/pitch-deck/scripts/verify.sh +186 -0
- package/skills/plan/SKILL.md +204 -0
- package/skills/plan/evals/README.md +62 -0
- package/skills/plan/evals/cases.yaml +49 -0
- package/skills/plan/references/plan-template.md +124 -0
- package/skills/planetscale/SKILL.md +223 -0
- package/skills/planetscale/evals/README.md +11 -0
- package/skills/planetscale/evals/cases.yaml +46 -0
- package/skills/planetscale/references/deploy-requests.md +75 -0
- package/skills/planetscale/references/no-foreign-keys.md +88 -0
- package/skills/planetscale/scripts/verify.sh +115 -0
- package/skills/podcast/SKILL.md +166 -0
- package/skills/podcast/evals/README.md +17 -0
- package/skills/podcast/evals/cases.yaml +61 -0
- package/skills/podcast/references/rss-and-namespace.md +136 -0
- package/skills/podcast/scripts/verify.sh +246 -0
- package/skills/postgresdb/SKILL.md +372 -0
- package/skills/postgresdb/evals/README.md +55 -0
- package/skills/postgresdb/evals/cases.yaml +57 -0
- package/skills/postgresdb/references/migrations.md +279 -0
- package/skills/postgresdb/references/operations-and-security.md +267 -0
- package/skills/postgresdb/references/query-optimization.md +374 -0
- package/skills/postgresdb/references/schema-and-indexing.md +379 -0
- package/skills/postgresdb/scripts/verify.sh +191 -0
- package/skills/presentations/SKILL.md +296 -0
- package/skills/presentations/evals/README.md +61 -0
- package/skills/presentations/evals/cases.yaml +56 -0
- package/skills/presentations/references/brand-grounding.md +160 -0
- package/skills/presentations/references/markdown-decks.md +290 -0
- package/skills/presentations/references/pptx-python.md +242 -0
- package/skills/presentations/references/slide-design.md +261 -0
- package/skills/presentations/references/storytelling-and-decks.md +150 -0
- package/skills/presentations/scripts/verify.sh +252 -0
- package/skills/press-kit/SKILL.md +243 -0
- package/skills/press-kit/evals/README.md +15 -0
- package/skills/press-kit/evals/cases.yaml +55 -0
- package/skills/press-kit/references/release-types.md +102 -0
- package/skills/press-kit/references/templates.md +132 -0
- package/skills/press-kit/scripts/verify.sh +161 -0
- package/skills/pricing/SKILL.md +160 -0
- package/skills/pricing/evals/README.md +5 -0
- package/skills/pricing/evals/cases.yaml +44 -0
- package/skills/pricing/references/localization.md +56 -0
- package/skills/pricing/references/pricing-models.md +55 -0
- package/skills/pricing/scripts/verify.sh +91 -0
- package/skills/prisma-orm/SKILL.md +320 -0
- package/skills/prisma-orm/evals/README.md +12 -0
- package/skills/prisma-orm/evals/cases.yaml +56 -0
- package/skills/prisma-orm/references/migrations-and-v7-upgrade.md +197 -0
- package/skills/prisma-orm/references/queries-and-performance.md +169 -0
- package/skills/prisma-orm/scripts/verify.sh +137 -0
- package/skills/procurement/SKILL.md +179 -0
- package/skills/procurement/evals/README.md +20 -0
- package/skills/procurement/evals/cases.yaml +49 -0
- package/skills/procurement/references/scorecard-and-tco.md +100 -0
- package/skills/procurement/references/sourcing-requests.md +116 -0
- package/skills/procurement/scripts/verify.sh +280 -0
- package/skills/project-ops/SKILL.md +130 -0
- package/skills/project-ops/evals/README.md +3 -0
- package/skills/project-ops/evals/cases.yaml +71 -0
- package/skills/project-ops/references/raid-and-rag.md +58 -0
- package/skills/project-ops/references/status-report-template.md +68 -0
- package/skills/project-ops/scripts/verify.sh +257 -0
- package/skills/prompt-engineering/SKILL.md +138 -0
- package/skills/prompt-engineering/evals/README.md +11 -0
- package/skills/prompt-engineering/evals/cases.yaml +46 -0
- package/skills/prompt-engineering/references/eval-templates.md +94 -0
- package/skills/prompt-engineering/references/output-contracts.md +120 -0
- package/skills/prompt-engineering/scripts/verify.sh +84 -0
- package/skills/proposals/SKILL.md +159 -0
- package/skills/proposals/evals/README.md +3 -0
- package/skills/proposals/evals/cases.yaml +53 -0
- package/skills/proposals/references/proposal-skeleton.md +110 -0
- package/skills/proposals/references/sow-skeleton.md +79 -0
- package/skills/proposals/scripts/verify.sh +201 -0
- package/skills/python/SKILL.md +369 -0
- package/skills/python/evals/README.md +19 -0
- package/skills/python/evals/cases.yaml +46 -0
- package/skills/python/references/async.md +136 -0
- package/skills/python/references/stdlib.md +162 -0
- package/skills/python/references/typing.md +160 -0
- package/skills/python/scripts/verify.sh +125 -0
- package/skills/rag/SKILL.md +226 -0
- package/skills/rag/evals/README.md +13 -0
- package/skills/rag/evals/cases.yaml +45 -0
- package/skills/rag/references/evaluation.md +99 -0
- package/skills/rag/references/pipeline.md +151 -0
- package/skills/rag/scripts/verify.sh +99 -0
- package/skills/rails/SKILL.md +264 -0
- package/skills/rails/evals/README.md +12 -0
- package/skills/rails/evals/cases.yaml +47 -0
- package/skills/rails/references/activerecord.md +148 -0
- package/skills/rails/references/hotwire.md +139 -0
- package/skills/rails/references/testing.md +110 -0
- package/skills/rails/scripts/verify.sh +128 -0
- package/skills/railway/SKILL.md +245 -0
- package/skills/railway/evals/README.md +14 -0
- package/skills/railway/evals/cases.yaml +44 -0
- package/skills/railway/references/cli-cookbook.md +137 -0
- package/skills/railway/references/config-as-code.md +120 -0
- package/skills/railway/scripts/verify.sh +162 -0
- package/skills/react/SKILL.md +222 -0
- package/skills/react/evals/README.md +3 -0
- package/skills/react/evals/cases.yaml +43 -0
- package/skills/react/references/data-and-state.md +152 -0
- package/skills/react/references/performance.md +75 -0
- package/skills/react/references/routing.md +99 -0
- package/skills/react/scripts/verify.sh +123 -0
- package/skills/react-native/SKILL.md +220 -0
- package/skills/react-native/evals/README.md +3 -0
- package/skills/react-native/evals/cases.yaml +42 -0
- package/skills/react-native/references/native-modules.md +123 -0
- package/skills/react-native/references/performance-debugging.md +46 -0
- package/skills/react-native/scripts/verify.sh +117 -0
- package/skills/redis/SKILL.md +298 -0
- package/skills/redis/evals/README.md +10 -0
- package/skills/redis/evals/cases.yaml +43 -0
- package/skills/redis/references/caching.md +116 -0
- package/skills/redis/references/locks-and-rate-limiting.md +140 -0
- package/skills/redis/references/queues.md +102 -0
- package/skills/redis/scripts/verify.sh +164 -0
- package/skills/remotion-video/SKILL.md +218 -0
- package/skills/remotion-video/evals/README.md +23 -0
- package/skills/remotion-video/evals/cases.yaml +64 -0
- package/skills/remotion-video/references/captions-pipeline.md +163 -0
- package/skills/remotion-video/references/render-and-pipeline.md +131 -0
- package/skills/remotion-video/scripts/verify.sh +169 -0
- package/skills/render/SKILL.md +256 -0
- package/skills/render/evals/README.md +12 -0
- package/skills/render/evals/cases.yaml +45 -0
- package/skills/render/references/blueprint-reference.md +203 -0
- package/skills/render/scripts/verify.sh +167 -0
- package/skills/replicate/SKILL.md +210 -0
- package/skills/replicate/evals/README.md +9 -0
- package/skills/replicate/evals/cases.yaml +45 -0
- package/skills/replicate/references/cog-packaging.md +89 -0
- package/skills/replicate/references/deployments-api.md +87 -0
- package/skills/replicate/references/webhooks-and-async.md +110 -0
- package/skills/replicate/scripts/verify.sh +162 -0
- package/skills/replicate-images/SKILL.md +241 -0
- package/skills/replicate-images/evals/README.md +13 -0
- package/skills/replicate-images/evals/cases.yaml +41 -0
- package/skills/replicate-images/references/editing-recipes.md +129 -0
- package/skills/replicate-images/references/models.md +131 -0
- package/skills/replicate-images/scripts/verify.sh +178 -0
- package/skills/reporting/SKILL.md +178 -0
- package/skills/reporting/evals/README.md +12 -0
- package/skills/reporting/evals/cases.yaml +46 -0
- package/skills/reporting/references/pipeline.md +213 -0
- package/skills/reporting/scripts/verify.sh +149 -0
- package/skills/research-ops/SKILL.md +200 -0
- package/skills/research-ops/evals/README.md +13 -0
- package/skills/research-ops/evals/cases.yaml +38 -0
- package/skills/research-ops/references/credibility-rubric.md +78 -0
- package/skills/research-ops/references/memo-template.md +63 -0
- package/skills/research-ops/scripts/verify.sh +181 -0
- package/skills/retention/SKILL.md +206 -0
- package/skills/retention/evals/README.md +13 -0
- package/skills/retention/evals/cases.yaml +42 -0
- package/skills/retention/references/health-score-and-metrics.md +97 -0
- package/skills/retention/references/save-and-winback-plays.md +65 -0
- package/skills/review/SKILL.md +222 -0
- package/skills/review/evals/README.md +84 -0
- package/skills/review/evals/cases.yaml +55 -0
- package/skills/review-management/SKILL.md +204 -0
- package/skills/review-management/evals/README.md +13 -0
- package/skills/review-management/evals/cases.yaml +60 -0
- package/skills/review-management/references/platform-apis.md +86 -0
- package/skills/review-management/scripts/verify.sh +128 -0
- package/skills/ruby/SKILL.md +316 -0
- package/skills/ruby/evals/README.md +12 -0
- package/skills/ruby/evals/cases.yaml +41 -0
- package/skills/ruby/references/gems-and-testing.md +208 -0
- package/skills/ruby/references/metaprogramming.md +161 -0
- package/skills/ruby/scripts/verify.sh +83 -0
- package/skills/runpod/SKILL.md +238 -0
- package/skills/runpod/evals/README.md +11 -0
- package/skills/runpod/evals/cases.yaml +47 -0
- package/skills/runpod/references/cost-and-scaling.md +85 -0
- package/skills/runpod/references/serverless-workers.md +101 -0
- package/skills/runpod/scripts/verify.sh +126 -0
- package/skills/rust/SKILL.md +395 -0
- package/skills/rust/evals/README.md +12 -0
- package/skills/rust/evals/cases.yaml +42 -0
- package/skills/rust/references/async-tokio.md +141 -0
- package/skills/rust/references/axum-service.md +132 -0
- package/skills/rust/references/ownership.md +86 -0
- package/skills/rust/references/testing.md +108 -0
- package/skills/rust/scripts/verify.sh +91 -0
- package/skills/sales-pipeline/SKILL.md +162 -0
- package/skills/sales-pipeline/evals/README.md +13 -0
- package/skills/sales-pipeline/evals/cases.yaml +60 -0
- package/skills/sales-pipeline/references/forecasting-math.md +82 -0
- package/skills/sales-pipeline/references/stage-playbook.md +84 -0
- package/skills/sales-pipeline/scripts/verify.sh +210 -0
- package/skills/scaling/SKILL.md +137 -0
- package/skills/scaling/evals/README.md +3 -0
- package/skills/scaling/evals/cases.yaml +42 -0
- package/skills/scaling/references/load-testing-k6.md +127 -0
- package/skills/scaling/scripts/example.load.js +24 -0
- package/skills/scaling/scripts/verify.sh +70 -0
- package/skills/sdd/SKILL.md +203 -0
- package/skills/sdd/evals/README.md +60 -0
- package/skills/sdd/evals/cases.yaml +78 -0
- package/skills/sdd-init/SKILL.md +148 -0
- package/skills/sdd-init/evals/README.md +3 -0
- package/skills/sdd-init/evals/cases.yaml +43 -0
- package/skills/secure-coding/SKILL.md +365 -0
- package/skills/secure-coding/evals/README.md +68 -0
- package/skills/secure-coding/evals/cases.yaml +55 -0
- package/skills/secure-coding/references/authn-authz.md +249 -0
- package/skills/secure-coding/references/owasp-by-stack.md +574 -0
- package/skills/secure-coding/references/secrets-and-supply-chain.md +205 -0
- package/skills/secure-coding/references/threat-modeling.md +213 -0
- package/skills/secure-coding/scripts/verify.sh +208 -0
- package/skills/security-scan/SKILL.md +239 -0
- package/skills/security-scan/evals/README.md +14 -0
- package/skills/security-scan/evals/cases.yaml +50 -0
- package/skills/security-scan/references/tools.md +98 -0
- package/skills/security-scan/references/triage.md +93 -0
- package/skills/security-scan/scripts/verify.sh +108 -0
- package/skills/seo-geo/SKILL.md +192 -0
- package/skills/seo-geo/evals/README.md +14 -0
- package/skills/seo-geo/evals/cases.yaml +45 -0
- package/skills/seo-geo/references/ai-crawler-control.md +104 -0
- package/skills/seo-geo/references/schema-recipes.md +130 -0
- package/skills/seo-geo/scripts/verify.sh +236 -0
- package/skills/ship/SKILL.md +258 -0
- package/skills/ship/evals/README.md +89 -0
- package/skills/ship/evals/cases.yaml +44 -0
- package/skills/shopify/SKILL.md +229 -0
- package/skills/shopify/evals/README.md +14 -0
- package/skills/shopify/evals/cases.yaml +41 -0
- package/skills/shopify/references/apps-graphql.md +103 -0
- package/skills/shopify/references/checkout-extensibility.md +71 -0
- package/skills/shopify/references/liquid-themes.md +89 -0
- package/skills/shopify/scripts/verify.sh +120 -0
- package/skills/shortform-editing/SKILL.md +161 -0
- package/skills/shortform-editing/evals/README.md +16 -0
- package/skills/shortform-editing/evals/cases.yaml +61 -0
- package/skills/shortform-editing/references/captions.md +85 -0
- package/skills/shortform-editing/references/ffmpeg-pipeline.md +126 -0
- package/skills/shortform-editing/scripts/verify.sh +148 -0
- package/skills/shortform-ideation/SKILL.md +153 -0
- package/skills/shortform-ideation/evals/README.md +20 -0
- package/skills/shortform-ideation/evals/cases.yaml +58 -0
- package/skills/shortform-ideation/references/experiment-ledger.md +85 -0
- package/skills/shortform-ideation/references/trend-sources.md +69 -0
- package/skills/shortform-ideation/scripts/verify.sh +172 -0
- package/skills/shortform-packaging/SKILL.md +247 -0
- package/skills/shortform-packaging/evals/README.md +10 -0
- package/skills/shortform-packaging/evals/cases.yaml +48 -0
- package/skills/shortform-packaging/references/package-templates.md +117 -0
- package/skills/shortform-packaging/scripts/verify.sh +210 -0
- package/skills/shortform-strategy/SKILL.md +149 -0
- package/skills/shortform-strategy/evals/README.md +3 -0
- package/skills/shortform-strategy/evals/cases.yaml +52 -0
- package/skills/shortform-strategy/references/learning-loop-template.md +49 -0
- package/skills/shortform-strategy/references/platform-signals-2026.md +46 -0
- package/skills/shortform-strategy/scripts/verify.sh +176 -0
- package/skills/skill-scout/SKILL.md +133 -0
- package/skills/skill-scout/evals/README.md +12 -0
- package/skills/skill-scout/evals/cases.yaml +56 -0
- package/skills/skill-scout/references/install-commands.md +76 -0
- package/skills/skill-scout/scripts/verify.sh +154 -0
- package/skills/social-publisher/SKILL.md +179 -0
- package/skills/social-publisher/evals/README.md +14 -0
- package/skills/social-publisher/evals/cases.yaml +55 -0
- package/skills/social-publisher/references/calendar-schema.md +97 -0
- package/skills/social-publisher/references/platform-limits.md +56 -0
- package/skills/social-publisher/scripts/verify.sh +232 -0
- package/skills/solid-js/SKILL.md +260 -0
- package/skills/solid-js/evals/README.md +3 -0
- package/skills/solid-js/evals/cases.yaml +38 -0
- package/skills/solid-js/references/reactivity-deep-dive.md +89 -0
- package/skills/solid-js/references/router-and-start.md +93 -0
- package/skills/solid-js/scripts/verify.sh +130 -0
- package/skills/sop-builder/SKILL.md +233 -0
- package/skills/sop-builder/evals/README.md +14 -0
- package/skills/sop-builder/evals/cases.yaml +48 -0
- package/skills/sop-builder/references/sop-skeleton.md +170 -0
- package/skills/specify/SKILL.md +214 -0
- package/skills/specify/evals/README.md +73 -0
- package/skills/specify/evals/cases.yaml +80 -0
- package/skills/specify/references/eliciting-requirements.md +77 -0
- package/skills/specify/references/spec-template.md +60 -0
- package/skills/spreadsheet-ops/SKILL.md +180 -0
- package/skills/spreadsheet-ops/evals/README.md +33 -0
- package/skills/spreadsheet-ops/evals/cases.yaml +42 -0
- package/skills/spreadsheet-ops/references/formula-cookbook.md +70 -0
- package/skills/spreadsheet-ops/references/python-excel.md +87 -0
- package/skills/spreadsheet-ops/references/sheets-api-appsscript.md +118 -0
- package/skills/spreadsheet-ops/scripts/verify.sh +152 -0
- package/skills/spring-boot/SKILL.md +375 -0
- package/skills/spring-boot/evals/README.md +11 -0
- package/skills/spring-boot/evals/cases.yaml +49 -0
- package/skills/spring-boot/references/jpa.md +94 -0
- package/skills/spring-boot/references/security.md +92 -0
- package/skills/spring-boot/references/testing.md +95 -0
- package/skills/spring-boot/scripts/verify.sh +115 -0
- package/skills/sql/SKILL.md +286 -0
- package/skills/sql/evals/README.md +9 -0
- package/skills/sql/evals/cases.yaml +49 -0
- package/skills/sql/references/ctes-and-recursion.md +63 -0
- package/skills/sql/references/joins-and-sets.md +71 -0
- package/skills/sql/references/portability.md +38 -0
- package/skills/sql/references/window-functions.md +72 -0
- package/skills/sql/scripts/verify.sh +139 -0
- package/skills/sqlite-turso/SKILL.md +214 -0
- package/skills/sqlite-turso/evals/README.md +24 -0
- package/skills/sqlite-turso/evals/cases.yaml +45 -0
- package/skills/sqlite-turso/references/embedded-replicas.md +96 -0
- package/skills/sqlite-turso/scripts/verify.sh +95 -0
- package/skills/stripe/SKILL.md +269 -0
- package/skills/stripe/evals/README.md +11 -0
- package/skills/stripe/evals/cases.yaml +45 -0
- package/skills/stripe/references/going-live.md +64 -0
- package/skills/stripe/references/webhook-events.md +79 -0
- package/skills/stripe/scripts/verify.sh +130 -0
- package/skills/structured-extraction/SKILL.md +230 -0
- package/skills/structured-extraction/evals/README.md +13 -0
- package/skills/structured-extraction/evals/cases.yaml +70 -0
- package/skills/structured-extraction/references/providers.md +152 -0
- package/skills/structured-extraction/scripts/verify.sh +160 -0
- package/skills/suggest/SKILL.md +30 -0
- package/skills/suggest/evals/README.md +14 -0
- package/skills/suggest/evals/cases.yaml +51 -0
- package/skills/supabase/SKILL.md +268 -0
- package/skills/supabase/evals/README.md +12 -0
- package/skills/supabase/evals/cases.yaml +42 -0
- package/skills/supabase/references/auth-ssr.md +173 -0
- package/skills/supabase/references/rls-cookbook.md +122 -0
- package/skills/supabase/scripts/verify.sh +149 -0
- package/skills/svelte/SKILL.md +238 -0
- package/skills/svelte/evals/README.md +3 -0
- package/skills/svelte/evals/cases.yaml +41 -0
- package/skills/svelte/references/runes.md +97 -0
- package/skills/svelte/references/sveltekit-data.md +156 -0
- package/skills/svelte/scripts/verify.sh +128 -0
- package/skills/swift-ios/SKILL.md +217 -0
- package/skills/swift-ios/evals/README.md +3 -0
- package/skills/swift-ios/evals/cases.yaml +46 -0
- package/skills/swift-ios/references/concurrency.md +132 -0
- package/skills/swift-ios/references/testing.md +112 -0
- package/skills/swift-ios/scripts/verify.sh +98 -0
- package/skills/tasks/SKILL.md +260 -0
- package/skills/tasks/evals/README.md +70 -0
- package/skills/tasks/evals/cases.yaml +75 -0
- package/skills/tauri/SKILL.md +224 -0
- package/skills/tauri/evals/README.md +12 -0
- package/skills/tauri/evals/cases.yaml +46 -0
- package/skills/tauri/references/bundling-distribution.md +129 -0
- package/skills/tauri/references/security.md +143 -0
- package/skills/tauri/scripts/verify.sh +178 -0
- package/skills/technical-writing/SKILL.md +230 -0
- package/skills/technical-writing/evals/README.md +12 -0
- package/skills/technical-writing/evals/cases.yaml +53 -0
- package/skills/technical-writing/references/diataxis-modes.md +131 -0
- package/skills/technical-writing/references/vale-starter.md +90 -0
- package/skills/technical-writing/scripts/verify.sh +83 -0
- package/skills/terms-conditions/SKILL.md +147 -0
- package/skills/terms-conditions/evals/README.md +14 -0
- package/skills/terms-conditions/evals/cases.yaml +48 -0
- package/skills/terms-conditions/references/clause-library.md +158 -0
- package/skills/terms-conditions/references/notices-and-aup.md +125 -0
- package/skills/terms-conditions/scripts/verify.sh +92 -0
- package/skills/testing-go/SKILL.md +246 -0
- package/skills/testing-go/evals/README.md +3 -0
- package/skills/testing-go/evals/cases.yaml +44 -0
- package/skills/testing-go/references/coverage-and-benchmarks.md +85 -0
- package/skills/testing-go/references/mocks-and-fakes.md +140 -0
- package/skills/testing-go/references/synctest-and-concurrency.md +82 -0
- package/skills/testing-go/scripts/verify.sh +72 -0
- package/skills/testing-py/SKILL.md +179 -0
- package/skills/testing-py/evals/README.md +5 -0
- package/skills/testing-py/evals/cases.yaml +44 -0
- package/skills/testing-py/references/mocking.md +141 -0
- package/skills/testing-py/references/property-testing.md +99 -0
- package/skills/testing-py/scripts/verify.sh +117 -0
- package/skills/testing-web/SKILL.md +224 -0
- package/skills/testing-web/evals/README.md +11 -0
- package/skills/testing-web/evals/cases.yaml +52 -0
- package/skills/testing-web/references/jest-setup.md +88 -0
- package/skills/testing-web/references/recipes.md +116 -0
- package/skills/testing-web/scripts/verify.sh +111 -0
- package/skills/tiktok-api/SKILL.md +315 -0
- package/skills/tiktok-api/evals/README.md +17 -0
- package/skills/tiktok-api/evals/cases.yaml +51 -0
- package/skills/tiktok-api/references/metrics-and-publish.md +127 -0
- package/skills/tiktok-api/references/oauth-setup.md +105 -0
- package/skills/tiktok-api/references/wiki-schema.md +85 -0
- package/skills/tiktok-api/scripts/verify.sh +96 -0
- package/skills/together-fireworks/SKILL.md +181 -0
- package/skills/together-fireworks/evals/README.md +3 -0
- package/skills/together-fireworks/evals/cases.yaml +50 -0
- package/skills/together-fireworks/references/batch-and-tuning.md +59 -0
- package/skills/together-fireworks/references/models-and-pricing.md +79 -0
- package/skills/together-fireworks/scripts/verify.sh +165 -0
- package/skills/translation-l10n/SKILL.md +229 -0
- package/skills/translation-l10n/evals/README.md +3 -0
- package/skills/translation-l10n/evals/cases.yaml +39 -0
- package/skills/translation-l10n/references/icu-cookbook.md +82 -0
- package/skills/translation-l10n/references/rtl-and-bidi.md +60 -0
- package/skills/typescript/SKILL.md +258 -0
- package/skills/typescript/evals/README.md +15 -0
- package/skills/typescript/evals/cases.yaml +46 -0
- package/skills/typescript/references/build-and-monorepo.md +141 -0
- package/skills/typescript/references/type-system.md +162 -0
- package/skills/typescript/scripts/verify.sh +52 -0
- package/skills/unit-economics/SKILL.md +180 -0
- package/skills/unit-economics/evals/README.md +5 -0
- package/skills/unit-economics/evals/cases.yaml +43 -0
- package/skills/unit-economics/references/formulas.md +144 -0
- package/skills/unit-economics/scripts/verify.sh +179 -0
- package/skills/vector-db/SKILL.md +189 -0
- package/skills/vector-db/evals/README.md +10 -0
- package/skills/vector-db/evals/cases.yaml +45 -0
- package/skills/vector-db/references/engines.md +175 -0
- package/skills/vector-db/references/tuning.md +62 -0
- package/skills/vector-db/scripts/verify.sh +110 -0
- package/skills/vercel/SKILL.md +242 -0
- package/skills/vercel/evals/README.md +23 -0
- package/skills/vercel/evals/cases.yaml +45 -0
- package/skills/vercel/references/cli-cookbook.md +98 -0
- package/skills/vercel/references/vercel-json.md +120 -0
- package/skills/vercel/scripts/verify.sh +168 -0
- package/skills/verify/SKILL.md +188 -0
- package/skills/verify/evals/README.md +78 -0
- package/skills/verify/evals/cases.yaml +74 -0
- package/skills/video-shorts/SKILL.md +163 -0
- package/skills/video-shorts/evals/README.md +15 -0
- package/skills/video-shorts/evals/cases.yaml +56 -0
- package/skills/video-shorts/references/hook-and-script-patterns.md +95 -0
- package/skills/video-shorts/references/specs-and-safe-zones.md +74 -0
- package/skills/video-shorts/scripts/verify.sh +172 -0
- package/skills/vue-nuxt/SKILL.md +384 -0
- package/skills/vue-nuxt/evals/README.md +11 -0
- package/skills/vue-nuxt/evals/cases.yaml +49 -0
- package/skills/vue-nuxt/references/data-and-state.md +127 -0
- package/skills/vue-nuxt/references/migration-nuxt4.md +79 -0
- package/skills/vue-nuxt/references/nitro-and-rendering.md +117 -0
- package/skills/vue-nuxt/references/reactivity.md +135 -0
- package/skills/vue-nuxt/scripts/verify.sh +148 -0
- package/skills/webhooks/SKILL.md +246 -0
- package/skills/webhooks/evals/README.md +15 -0
- package/skills/webhooks/evals/cases.yaml +46 -0
- package/skills/webhooks/references/framework-raw-body.md +97 -0
- package/skills/webhooks/references/signature-schemes.md +66 -0
- package/skills/webhooks/scripts/verify.sh +142 -0
- package/skills/webinar/SKILL.md +196 -0
- package/skills/webinar/evals/README.md +14 -0
- package/skills/webinar/evals/cases.yaml +44 -0
- package/skills/webinar/references/email-cadence.md +75 -0
- package/skills/webinar/references/run-of-show.md +83 -0
- package/skills/whatsapp-telegram/SKILL.md +235 -0
- package/skills/whatsapp-telegram/evals/README.md +11 -0
- package/skills/whatsapp-telegram/evals/cases.yaml +44 -0
- package/skills/whatsapp-telegram/references/telegram-bot-api.md +91 -0
- package/skills/whatsapp-telegram/references/whatsapp-cloud-api.md +103 -0
- package/skills/whatsapp-telegram/scripts/verify.sh +90 -0
- package/skills/wordpress/SKILL.md +224 -0
- package/skills/wordpress/evals/README.md +3 -0
- package/skills/wordpress/evals/cases.yaml +50 -0
- package/skills/wordpress/references/hardening.md +108 -0
- package/skills/wordpress/references/performance.md +80 -0
- package/skills/wordpress/references/woocommerce.md +65 -0
- package/skills/wordpress/scripts/verify.sh +96 -0
- package/skills/worktrees/SKILL.md +199 -0
- package/skills/worktrees/evals/README.md +78 -0
- package/skills/worktrees/evals/cases.yaml +47 -0
- package/skills/youtube-api/SKILL.md +286 -0
- package/skills/youtube-api/evals/README.md +3 -0
- package/skills/youtube-api/evals/cases.yaml +50 -0
- package/skills/youtube-api/references/analytics-queries.md +89 -0
- package/skills/youtube-api/references/oauth-setup.md +55 -0
- package/skills/youtube-api/references/wiki-schema.md +70 -0
- package/skills/youtube-api/scripts/verify.sh +84 -0
- package/skills/youtube-ideation/SKILL.md +234 -0
- package/skills/youtube-ideation/evals/README.md +14 -0
- package/skills/youtube-ideation/evals/cases.yaml +52 -0
- package/skills/youtube-ideation/references/idea-ledger-and-loop.md +89 -0
- package/skills/youtube-ideation/references/research-and-signals.md +92 -0
- package/skills/youtube-ideation/scripts/verify.sh +237 -0
- package/skills/youtube-packaging/SKILL.md +220 -0
- package/skills/youtube-packaging/evals/README.md +16 -0
- package/skills/youtube-packaging/evals/cases.yaml +48 -0
- package/skills/youtube-packaging/references/description-and-chapters.md +135 -0
- package/skills/youtube-packaging/scripts/verify.sh +250 -0
- package/skills/youtube-strategy/SKILL.md +157 -0
- package/skills/youtube-strategy/evals/README.md +5 -0
- package/skills/youtube-strategy/evals/cases.yaml +61 -0
- package/skills/youtube-strategy/references/channel-architecture.md +46 -0
- package/skills/youtube-strategy/references/wiki-records.md +86 -0
- package/skills/youtube-strategy/scripts/verify.sh +118 -0
- package/skills/youtube-thumbnails/SKILL.md +180 -0
- package/skills/youtube-thumbnails/evals/README.md +11 -0
- package/skills/youtube-thumbnails/evals/cases.yaml +48 -0
- package/skills/youtube-thumbnails/references/composition-and-specs.md +69 -0
- package/skills/youtube-thumbnails/references/experiment-log-format.md +65 -0
- package/skills/youtube-thumbnails/scripts/verify.sh +123 -0
- package/targets/claude.js +23 -0
- package/targets/codex.js +29 -0
- package/targets/cursor.js +20 -0
- package/targets/gemini.js +29 -0
- package/targets/index.js +55 -0
@@ -0,0 +1,213 @@
---
name: agent-eval
description: "Use when you need to measure whether an LLM or agent system actually got better and gate merges on it — before/after a prompt or model change, building a golden set, fixing a noisy or inflated LLM-as-judge, scoring a RAG pipeline (faithfulness, contextual recall) or an agent trajectory (tool-correctness, task-completion), or choosing between DeepEval / Inspect AI / promptfoo / Braintrust. Triggers: 'is the new prompt actually better', 'build a golden set / regression suite', 'our judge gives everything a 9/10', 'fail the PR when answer quality drops', 'score the tool path not just the final answer', 'medir si el agente mejoró', 'tenim un eval que dona puntuacions inflades', 'montar un golden set'. NOT building the agent loop/tools/RAG plumbing (that is building-agents)."
tags: [evals, llm, agents, llm-as-judge, regression-gate, ai]
recommends: [building-agents, prompt-engineering, observability]
origin: risco
---

# Measure agent quality you can defend and gate on

Turn "the agent feels better" into a number you can put in a PR check. You own the eval dataset, the scorer mix, the LLM-as-judge calibration, and the block-on-regression CI gate — framework-neutral, provider-neutral.

## The one rule

> A score you do not trust is worse than no score. Calibrate the scorer against human labels and report the agreement **before** you gate anything on it. An uncalibrated judge gives false confidence, which is more dangerous than admitted ignorance.

> Build your own golden set. A public leaderboard number is not your number — identical model weights swing SWE-Bench Verified by 10–20 points just by changing the harness. Measure *your* task on *your* data.

## When to use / When NOT

**Use when:**
- "Is the new prompt/model actually better?" — before/after on a fixed dataset.
- "Build a golden set / regression suite for this agent."
- "Our judge gives everything 9/10 / the scores are inflated" — judge calibration.
- "Fail the PR when answer quality / faithfulness / tool-call accuracy drops" — CI gate.
- Scoring a RAG pipeline (faithfulness, answer relevancy, contextual recall/precision).
- Scoring an agent's *trajectory* — tool calls, task completion — not just the final string.
- Picking between DeepEval / Inspect AI / promptfoo / Braintrust.

**Do NOT use — route instead:**

| The ask | Route to | Why it is not this skill |
| --- | --- | --- |
| Build the agent loop, tools, RAG plumbing | `building-agents` | It builds the system; you score it. They cross-link. |
| "Make the answers shorter / rewrite the prompt" | `prompt-engineering` | Evals say it is worse; that skill changes the words. You never edit the prompt. |
| pytest/jest on deterministic functions | `testing-py` / `testing-web` | Assert-equals on pure code, not stochastic outputs scored by a judge. |
| Dashboards / tracing of live production traffic | `observability` | Online monitoring; you are offline + pre-merge. |
| Red-team, jailbreak, prompt injection | `agent-safety` | Adversarial coverage, not quality measurement. |
| Per-token cost budgets and accounting | `cost-tracking` | You report cost-per-task as one metric; the discipline lives there. |
| A/B stats on product/funnel metrics | `ab-testing` | Web experiments, not offline model comparison on a fixed set. |

(`building-agents`, `ab-testing`, `chatbot` are linkable siblings here; the rest are KNOWN ids you name but cannot link yet.)

## The eval anatomy

Every framework instantiates the same five-stage pipeline. Learn it once; the tool is a detail.

```text
dataset  ──▶  runner      ──▶  scorers       ──▶  metrics        ──▶  gate
(JSONL        (calls the       (det / judge       (aggregate +        (pass/fail
 golden        system per       / human)           bootstrap CI)       exit code)
 set)          case)
```

DeepEval, Inspect AI, and promptfoo are all just opinionated wrappers around this. If you understand the stages you can switch tools without relearning the craft.

## Build the dataset first

The dataset is the asset. Everything else is replaceable. Rules:

- **50–200 hand-labeled cases per failure mode**, not in total. Coverage of how the system fails beats raw volume. 80 real failure cases > 1000 generic ones.
- **Never synthetic-only.** A set the model wrote will not surface the model's blind spots. Mine real traffic / tickets / transcripts and hand-label.
- **Version it in git as JSONL**, a first-class reviewed asset — same as code. Diffs are reviewable; relabels are auditable.
- **Decontaminate.** The eval set must not appear in training data or few-shot examples, or the score is a memorization artifact, not a capability.
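
The few-shot half of the decontamination rule can be checked mechanically. A minimal sketch, assuming the few-shot examples are available as a list of strings (`few_shot_examples` and `find_leaks` are hypothetical names) and that a substring match after case and whitespace normalization is an acceptable leak signal; it will not catch paraphrases or training-data contamination:

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a leak."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_leaks(cases: list[dict], few_shot_examples: list[str]) -> list[str]:
    """Return ids of golden-set cases whose input appears inside a few-shot example."""
    shots = [normalize(s) for s in few_shot_examples]
    leaked = []
    for case in cases:
        needle = normalize(case["input"])
        if any(needle in shot for shot in shots):
            leaked.append(case["id"])
    return leaked
```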

Case schema — one JSON object per line:

```jsonl
{"id":"refund-001","input":"Where is my refund for order 4821?","expected":"States refunds take 5-7 business days and asks for nothing already on file","context":["policy: refunds 5-7 business days"],"meta":{"failure_mode":"hallucinated_policy","source":"ticket#4821"}}
{"id":"refund-002","input":"Cancel my subscription and refund this month","expected":"Cancels, refunds prorated amount, confirms no future charge","context":["policy: prorated refund on cancel"],"meta":{"failure_mode":"missed_tool_call","source":"ticket#5190"}}
```

`failure_mode` in `meta` is what lets you slice metrics by mode and find *which* kind of bug regressed — not just that the aggregate dropped.
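
Slicing by mode could look roughly like this. A sketch only: it assumes per-case scores are already available as a mapping from case id to a 0.0–1.0 score, and `slice_by_failure_mode` and the `"untagged"` bucket are illustrative names, not part of any framework:

```python
from collections import defaultdict


def slice_by_failure_mode(cases, scores):
    """Mean score per failure mode; `scores` maps case id -> score in 0.0-1.0."""
    buckets = defaultdict(list)
    for case in cases:
        # Cases without a tag land in a catch-all bucket (an assumption).
        mode = case.get("meta", {}).get("failure_mode", "untagged")
        if case["id"] in scores:
            buckets[mode].append(scores[case["id"]])
    return {mode: sum(vals) / len(vals) for mode, vals in buckets.items()}
```

Comparing these per-mode means between the baseline run and the current run shows which failure mode moved, even when the aggregate barely changes.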

> Bad: "generate 1000 test questions with GPT and use those." Good: "80 real failure-mode cases pulled from support tickets, hand-labeled, tagged by failure mode."

## Choose the scorer — the 60/30/10 mix

Reach for the cheapest scorer that correlates with human judgment. Default mix:

| Share | Scorer kind | Use for | Why |
| --- | --- | --- | --- |
| ~60% | Deterministic — exact match, regex, JSON-schema validation, latency threshold | Anything with a checkable shape: format, required fields, a known string, a budget | Free, instant, zero drift. Never spend a judge call on something a regex settles. |
| ~30% | LLM-as-judge — G-Eval, DAG, custom Python scorer | Meaning: is this answer faithful, relevant, helpful | Only where correctness is semantic. Costs money and can drift — so calibrate it. |
| ~10% | Human-in-the-loop | Genuinely ambiguous cases the judge disagrees on | The ground truth you calibrate the judge against. |

One `Scorer` protocol, two implementations behind it — deterministic and judge are interchangeable to the runner:

```python
import json
from typing import Protocol

class Scorer(Protocol):
    name: str
    def score(self, case: dict, output: str) -> float: ...  # 0.0–1.0

class JsonSchemaScorer:
    name = "schema_valid"
    def score(self, case, output):  # deterministic, free, no drift
        try:
            json.loads(output)  # parse check only; validate against your schema here
            return 1.0
        except ValueError:
            return 0.0

class FaithfulnessJudge:
    name = "faithfulness"
    def __init__(self, judge_model): self.judge = judge_model
    def score(self, case, output):  # judge only where meaning matters
        return self.judge.rate(case["context"], output)  # see judge-design.md
```
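
To illustrate why the shared protocol matters, here is a sketch of a runner loop that treats every scorer the same way. `run_cases`, `ExactMatch`, and `NonEmpty` are hypothetical names for this example, and `system` is assumed to be any callable that maps a case input to an output string; this is not the runner from `references/runner-and-gate.md`:

```python
class ExactMatch:
    name = "exact_match"

    def score(self, case, output):
        return 1.0 if output.strip() == case["expected"].strip() else 0.0


class NonEmpty:
    name = "non_empty"

    def score(self, case, output):
        return 1.0 if output.strip() else 0.0


def run_cases(cases, system, scorers):
    """Run each case through `system` and average every scorer's result."""
    totals = {s.name: 0.0 for s in scorers}
    for case in cases:
        output = system(case["input"])
        for s in scorers:  # the runner never asks which kind of scorer this is
            totals[s.name] += s.score(case, output)
    n = len(cases) or 1  # avoid dividing by zero on an empty set
    return {name: total / n for name, total in totals.items()}
```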

## LLM-as-judge you can trust

A judge is a scorer, so the one rule applies hardest here. Each rule, with its why:

- **Judge model ≥ system under test.** A weaker judge cannot reliably rank a stronger system — it scores noise.
- **The rubric must force a written rationale before the score.** Rationale-first judging is what pushes judge–human agreement to ~85% — higher than two humans agree with each other. A bare number is a vibe with a decimal point.
- **Pairwise beats pointwise for stability.** "Is A or B better?" is more reproducible than "rate A from 1–10," which inflates and clusters at 8–9.
- **Swap positions and average.** Judges favor whichever answer came first; run A-then-B and B-then-A to cancel position bias.
- **Calibrate against human gold and report agreement** before you trust a single judge score.

> Bad judge prompt: "Rate this answer 1–10." → everything lands 8–9, useless.
> Good: "Compare answer A and answer B against the reference. First write one sentence on each per the rubric, then output the better label." → forces reasoning, gives a stable signal.
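
The swap-and-average rule can be sketched in a few lines. This assumes a hypothetical `judge(question, first, second)` callable that returns the string `"first"` or `"second"`; the real judge call, prompt, and rationale handling are out of scope here:

```python
def pairwise_with_swap(judge, question, answer_1, answer_2):
    """Ask the judge twice with positions swapped; return preference for answer_1.

    1.0 = answer_1 preferred in both orders, 0.0 = answer_2 preferred in both,
    0.5 = the verdict flipped with position (treat as a tie or send to a human).
    """
    forward = judge(question, answer_1, answer_2)   # answer_1 shown first
    backward = judge(question, answer_2, answer_1)  # answer_1 shown second
    wins = (forward == "first") + (backward == "second")
    return wins / 2
```

A judge that always picks whatever it saw first scores 0.5 on every pair, which makes position bias visible instead of letting it leak into the metric.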

Full rubric templates (pointwise + pairwise), the position-swap harness, the calibration script (agreement / Cohen's kappa vs human gold), G-Eval vs DAG, and the judge bias catalog (length, position, self-preference) with mitigations live in **[references/judge-design.md](references/judge-design.md)**.
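
For orientation, raw agreement and Cohen's kappa between judge labels and human gold can be computed with the standard library alone. This is a minimal sketch, not the calibration script in the reference file; `agreement_and_kappa` is a hypothetical name and the labels are assumed to be categorical (for example `"pass"` / `"fail"`):

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Raw agreement and Cohen's kappa for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # convention: both raters used one identical label
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Kappa corrects raw agreement for chance, so a judge that says "pass" to everything on a mostly-passing set gets high agreement but a kappa near zero.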

## Agent and RAG scorers

Score the path, not only the destination. Beyond exact/judge:

**RAG** (DeepEval / RAGAS names):
- **Faithfulness** — does the answer only claim what the retrieved context supports? Catches hallucination.
- **Answer relevancy** — does it actually address the question, or drift?
- **Contextual recall / precision** — did retrieval fetch the right chunks, and not bury them in noise? Separates a retrieval bug from a generation bug.
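
When gold chunk ids are labeled per case, the recall/precision idea has a cheap deterministic stand-in. The sketch below is a simplified, id-based variant and not necessarily how DeepEval or RAGAS define the metrics of the same name; `relevant_ids` is a hypothetical extra field that is not part of the case schema shown earlier:

```python
def retrieval_recall_precision(relevant_ids, retrieved_ids):
    """Set-based recall and precision of retrieved chunk ids against labeled ones."""
    relevant, retrieved = set(relevant_ids), set(retrieved_ids)
    hits = len(relevant & retrieved)
    # Recall: share of the needed chunks that retrieval actually fetched.
    recall = hits / len(relevant) if relevant else 1.0
    # Precision: share of fetched chunks that were needed (low = buried in noise).
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Low recall with a faithful answer points at retrieval; high recall with an unfaithful answer points at generation.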

**Agent:**
- **Tool correctness** — right tool, right arguments, right order.
- **Task completion / goal accuracy** — did it finish the job, not just produce plausible text.
- **Trajectory scoring** — grade the sequence of steps. A correct final answer from a wrong path will fail differently next time; only trajectory scoring catches it.

The system side of these (how the loop and tools are built) is `../building-agents/SKILL.md`.
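
A deterministic tool-correctness check might look like the sketch below. It assumes a hypothetical trace format in which each call is a `(tool_name, arguments_dict)` tuple and the case carries the expected sequence; real frameworks log trajectories in their own formats:

```python
def tool_correctness(expected_calls, actual_calls):
    """Score right tool, right order, right arguments as three separate checks."""
    exp_names = [name for name, _ in expected_calls]
    act_names = [name for name, _ in actual_calls]
    right_tools = set(exp_names) == set(act_names)
    right_order = exp_names == act_names
    # Arguments are only comparable call-by-call once the order matches.
    right_args = right_order and all(
        exp_args == act_args
        for (_, exp_args), (_, act_args) in zip(expected_calls, actual_calls)
    )
    return {
        "tools": float(right_tools),
        "order": float(right_order),
        "arguments": float(right_args),
    }
```

Keeping the three checks separate tells you whether the agent picked the wrong tool, sequenced it wrongly, or passed bad arguments.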

## The regression gate

> Gate policy: **block on regression vs a committed baseline, not on an absolute threshold.** An absolute threshold flaps CI on judge noise and gives no signal on drift; "did this PR make a tracked metric worse than `main`?" is the question that matters.

- Compute a **bootstrap confidence interval** on each metric so judge noise alone does not fail the build — only a drop beyond the CI counts.
- The runner writes `eval-report.json` (metrics, per-failure-mode slices, baseline, pass/fail) and **exits non-zero** on a real regression so the merge is blocked.
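
One way to read "only a drop beyond the CI counts" is a percentile bootstrap over per-case scores. The sketch below is an assumption about how that could be done, not the implementation in the reference runner; `bootstrap_ci`, `is_real_regression`, and the default resample count and seed are illustrative choices:

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed keeps CI runs reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_real_regression(current_scores, baseline_mean):
    """Treat a drop as real only if the whole interval sits below the baseline."""
    _, hi = bootstrap_ci(current_scores)
    return hi < baseline_mean
```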

```python
import json, sys

def gate(current: dict, baseline: dict, margin: float = 0.0) -> int:
    regressed = []
    for metric, score in current.items():
        if metric in baseline and score < baseline[metric] - margin:
            regressed.append((metric, baseline[metric], score))
    report = {"metrics": current, "baseline": baseline, "regressed": regressed,
              "passed": not regressed}
    with open("eval-report.json", "w") as f:
        json.dump(report, f, indent=2)
    if regressed:
        for m, b, c in regressed:
            print(f"REGRESSION {m}: {b:.3f} -> {c:.3f}", file=sys.stderr)
        return 1
    return 0

# run_eval() is your runner; it must return {metric_name: score}.
with open("eval-baseline.json") as f:
    baseline = json.load(f)
sys.exit(gate(run_eval(), baseline))
```

The complete provider-neutral runner (JSONL loader, scorer registry, bootstrap-CI metrics), the GitHub Actions workflow, and side-by-side DeepEval-pytest + Inspect-AI Task/Solver/Scorer versions of the same eval live in **[references/runner-and-gate.md](references/runner-and-gate.md)**.

## Framework cheat-sheet

Pick by where the eval runs and what it must do. Versions as of 2026-06 — re-verify, they rot.

| Tool | What it is | Reach for it when |
| --- | --- | --- |
| **DeepEval** v4.0.3 | pytest-native, 50+ metrics, Decision-Graph (DAG) logic | Your CI is Python/pytest and you want metrics that read like tests. |
| **Inspect AI** v0.3.225 (UK AISI) | dataset→Task→Solver→Scorer, bootstrap CIs, first-class tool-use & trajectory logging, 200+ pre-built evals | Multi-provider, safety-adjacent, or you need real trajectory scoring. |
| **promptfoo** (acquired by OpenAI 2026-03) | CLI + YAML, strong pre-deploy + red-team across 50+ vuln types | Config-driven pre-deploy checks; route the red-team half to `agent-safety`. |
| **Braintrust / LangSmith / Phoenix** v16.0.0 | platforms: annotation, regression tracking, dashboards | You need human annotation queues and historical regression tracking. |

> The two-tool pattern is normal, not over-engineering: a light CI gate (DeepEval / RAGAS / promptfoo) **plus** a platform (Braintrust / LangSmith / Arize) for annotation and history. They share data; different jobs.

## Anti-patterns

| Anti-pattern | Why it bites | Do instead |
| --- | --- | --- |
| Vibes-gating ("feels better, merge it") | No artifact to defend or reproduce | Gate on a number from a committed dataset |
| Synthetic-only dataset | Model-written cases miss the model's blind spots | Hand-label real traffic by failure mode |
| Uncalibrated judge | Confident wrong scores; worse than none | Report agreement vs human gold first |
| Judge weaker than system | Cannot rank a stronger system; scores noise | Judge model ≥ system under test |
| Absolute-threshold gate | Flaps CI on judge noise, blind to drift | Block on regression vs baseline + bootstrap CI |
| Shipping on a leaderboard number | Harness effect = 10–20pt swing | Build your own golden set |
| Scoring only the final answer | A right answer from a wrong path regresses later | Score the trajectory too |
| Never relabeling drifted gold | Stale "truth" silently rots the gate | Review and relabel the golden set on a schedule |

## Project grounding

If the workspace has a `02-DOCS/` harness, record the eval policy in `02-DOCS/wiki/stack/evals.md`: dataset location, scorer mix, gate baseline file, judge model, and the failure modes covered. Link it from root `CLAUDE.md` under `## Knowledge map`. This is **recorded, not gated** — skip silently if there is no harness.

## verify.sh

`scripts/verify.sh` is read-only and tool-detecting. It validates that every `*.jsonl` golden set in the project parses and that each line carries the required `id`, `input`, `expected` keys; checks the shape of any `eval-report.json`; and runs `ruff` / `mypy` on example Python and `markdownlint` on docs when those tools are installed. Every missing tool prints a yellow WARN and is skipped — never a failure. An empty or clean target exits 0.
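
The golden-set check described above is easy to mirror in Python for local use. This sketch only reflects the described behavior (every line parses and carries `id`, `input`, `expected`); it is not the contents of `scripts/verify.sh`, and `validate_jsonl_lines` and the blank-line tolerance are assumptions:

```python
import json

REQUIRED_KEYS = ("id", "input", "expected")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) for a golden set given as text lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            case = json.loads(line)
        except ValueError as err:
            problems.append((number, f"invalid JSON: {err}"))
            continue
        if not isinstance(case, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            problems.append((number, "missing keys: " + ", ".join(missing)))
    return problems
```

Reading a file is then `validate_jsonl_lines(open(path, encoding="utf-8"))`, and an empty result means the set is well-formed.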

## See also

- `../building-agents/SKILL.md` — builds the agent/RAG/tool system you score here.
- `../chatbot/SKILL.md` — a common system-under-test for these evals.
- KNOWN siblings to route to (not yet linkable): `prompt-engineering`, `observability`, `agent-safety`, `cost-tracking`, `testing-py`, `ab-testing`.
- External (no link): RAGAS, DeepEval, Inspect AI, promptfoo, Braintrust.
@@ -0,0 +1,12 @@
|
|
|
1
|
+
# Evals for agent-eval
|
|
2
|
+
|
|
3
|
+
`cases.yaml` is the trigger and capability spec for this skill. It is read by the catalog's
|
|
4
|
+
skill-eval harness, not by a standalone runner here. `should_trigger` lists prompts the skill
|
|
5
|
+
must claim (including non-obvious and Spanish phrasings); `should_not_trigger` lists prompts
|
|
6
|
+
that must route to a named sibling instead; `capability` is a rubric a graded run must satisfy.
|
|
7
|
+
|
|
8
|
+
To check by hand: read each `should_trigger` prompt and confirm the description in `SKILL.md`
|
|
9
|
+
would plausibly fire on it; read each `should_not_trigger` prompt and confirm the `route_to`
|
|
10
|
+
sibling is the better home and is a real catalog id. For the `capability` case, draft the
|
|
11
|
+
skill's answer and confirm every `must_include` bullet is covered. If your harness scores these
|
|
12
|
+
automatically, point it at this file with `skill: agent-eval` as the key.
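
The by-hand check above can also be sketched as a small structural lint. This is an illustration, not part of the catalog's harness: `check_cases` and `catalog_ids` are hypothetical names, and the sketch assumes the YAML has already been parsed into a plain dict.

```python
def check_cases(spec: dict, catalog_ids: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the spec looks sane."""
    problems = []
    if not spec.get("skill"):
        problems.append("missing top-level 'skill' key")
    for i, item in enumerate(spec.get("should_trigger", [])):
        if not item.get("prompt") or not item.get("why"):
            problems.append(f"should_trigger[{i}] needs 'prompt' and 'why'")
    for i, item in enumerate(spec.get("should_not_trigger", [])):
        route = item.get("route_to")
        if not item.get("prompt") or not route:
            problems.append(f"should_not_trigger[{i}] needs 'prompt' and 'route_to'")
        elif route not in catalog_ids:
            # The sibling must be a real catalog id, not just a plausible name.
            problems.append(f"should_not_trigger[{i}] routes to unknown id {route!r}")
    for i, item in enumerate(spec.get("capability", [])):
        if not item.get("must_include"):
            problems.append(f"capability[{i}] has no 'must_include' bullets")
    return problems
```

It only checks shape and routing targets; whether the description would plausibly fire on a prompt still needs a human or a graded run.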
|
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
skill: agent-eval
|
|
2
|
+
|
|
3
|
+
should_trigger:
|
|
4
|
+
- prompt: "Is the new system prompt actually better than the old one, or am I just fooling myself?"
|
|
5
|
+
why: "Before/after on a fixed dataset is the core job. Non-obvious: uses no eval vocabulary at all."
|
|
6
|
+
- prompt: "Build me a golden set and a CI check that fails the PR if our support bot's answer quality drops."
|
|
7
|
+
why: "Dataset construction plus a block-on-regression gate, stated explicitly."
|
|
8
|
+
- prompt: "Our LLM judge gives every answer a 9 out of 10 — the scores are useless."
|
|
9
|
+
why: "Judge calibration / inflation; the fix is rationale-forcing rubric and pairwise over pointwise. Non-obvious symptom phrasing, no mention of 'eval'."
|
|
10
|
+
- prompt: "Score our RAG pipeline for faithfulness and whether it actually uses the retrieved context."
|
|
11
|
+
why: "RAG metrics — faithfulness and contextual recall — are squarely this skill."
|
|
12
|
+
- prompt: "I need to know if the agent took the right tool path, not just whether the final answer looks ok."
|
|
13
|
+
why: "Trajectory and tool-correctness scoring. Non-obvious: framed as behavior, not as 'eval'."
|
|
14
|
+
- prompt: "Necesito medir si el agente mejoró con el modelo nuevo antes de hacer merge."
|
|
15
|
+
why: "Spanish — before/after measurement plus a pre-merge gate."
|
|
16
|
+
- prompt: "Should we use DeepEval or Inspect AI for our eval suite, and how do we wire the gate?"
|
|
17
|
+
why: "Framework choice plus gate wiring; the cheat-sheet and runner are the answer."
|
|
18
|
+
|
|
19
|
+
should_not_trigger:
|
|
20
|
+
- prompt: "Rewrite this system prompt so the answers come out shorter and punchier."
|
|
21
|
+
route_to: prompt-engineering
|
|
22
|
+
why: "Changing the words, not measuring quality. Evals tell you whether a prompt got worse; prompt-engineering is the skill that changes it."
|
|
23
|
+
- prompt: "Build the agent loop and tool registry for our new assistant."
|
|
24
|
+
route_to: building-agents
|
|
25
|
+
why: "Building the system under test, not scoring it."
|
|
26
|
+
- prompt: "Add a dashboard showing token cost and latency of our live LLM traffic."
|
|
27
|
+
route_to: observability
|
|
28
|
+
why: "Online monitoring of production, not offline pre-merge evaluation."
|
|
29
|
+
- prompt: "Write pytest unit tests for our data-cleaning functions."
|
|
30
|
+
route_to: testing-py
|
|
31
|
+
why: "Deterministic code with assert-equals, not stochastic outputs scored by a judge."
|
|
32
|
+
- prompt: "Red-team our chatbot for jailbreaks and prompt injection."
|
|
33
|
+
route_to: agent-safety
|
|
34
|
+
why: "Adversarial safety coverage, not quality measurement on a golden set."
|
|
35
|
+
|
|
36
|
+
capability:
|
|
37
|
+
- scenario: "Set up an eval plus a regression gate for a RAG support bot that is regressing silently between releases."
|
|
38
|
+
must_include:
|
|
39
|
+
- "Builds a versioned JSONL golden set of 50-200 hand-labeled cases per failure mode, decontaminated, not synthetic-only"
|
|
40
|
+
- "Picks a scorer mix: deterministic first (schema/regex/latency), faithfulness and answer-relevancy judge where meaning matters"
|
|
41
|
+
- "Calibrates the LLM-as-judge against human labels and reports agreement (raw and/or Cohen's kappa) before trusting it"
|
|
42
|
+
- "Gate policy is block-on-regression vs a committed baseline with a bootstrap CI, not an absolute threshold"
|
|
43
|
+
- "Runner emits eval-report.json and exits non-zero to block the merge"
|
|
44
|
+
- "Names at least one concrete framework (DeepEval, Inspect AI, or promptfoo) and fits it to the case"
|
|
45
|
+
- "Records the eval policy in 02-DOCS/wiki/stack/evals.md if a harness exists — recorded, not gated"
|
|
@@ -0,0 +1,118 @@
|
|
|
1
|
+
# LLM-as-judge design and calibration
|
|
2
|
+
|
|
3
|
+
Depth offloaded from `SKILL.md`. A judge is a scorer, so the one rule binds hardest here:
|
|
4
|
+
**calibrate against human gold and report agreement before you trust a single judge score.**
|
|
5
|
+
Judge-human agreement can reach roughly 85% as of 2026 (higher than two humans typically agree on the same task), but
|
|
6
|
+
only with a capable judge model and a rubric that forces a written rationale.
|
|
7
|
+
|
|
8
|
+
## Pointwise rubric template
|
|
9
|
+
|
|
10
|
+
Pointwise (rate one answer) is convenient but inflates and clusters at 8–9. Use it only when
|
|
11
|
+
you cannot run pairwise, and always force the rationale first.
|
|
12
|
+
|
|
13
|
+
```text
|
|
14
|
+
You are grading an answer against a reference. The dimension is FAITHFULNESS:
|
|
15
|
+
every claim in the answer must be supported by the provided context.
|
|
16
|
+
|
|
17
|
+
Context:
|
|
18
|
+
{context}
|
|
19
|
+
|
|
20
|
+
Answer:
|
|
21
|
+
{answer}
|
|
22
|
+
|
|
23
|
+
Steps (do them in order):
|
|
24
|
+
1. List each factual claim in the answer.
|
|
25
|
+
2. For each claim, mark SUPPORTED or UNSUPPORTED against the context.
|
|
26
|
+
3. Only then output a JSON object: {"rationale": "...", "score": <0.0-1.0>}
|
|
27
|
+
where score = supported_claims / total_claims.
|
|
28
|
+
```
|
|
29
|
+
|
|
30
|
+
The order matters: a rubric that asks for the number first gets a vibe; asking for the
|
|
31
|
+
claim-by-claim breakdown first forces the reasoning that earns the ~85% agreement.
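
One way to enforce the rationale-first rule on the consuming side is to reject any judge reply that lacks a rationale or carries an out-of-range score. The sketch below is a minimal illustration; `parse_pointwise` is a hypothetical helper, and it assumes the judge returns the JSON object from the template as a bare string.

```python
import json

def parse_pointwise(raw: str) -> dict:
    """Validate a judge reply shaped like {"rationale": "...", "score": 0.0-1.0}."""
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("judge reply is not a JSON object")
    rationale = reply.get("rationale", "")
    score = reply.get("score")
    if not isinstance(rationale, str) or not rationale.strip():
        # A score without the claim breakdown is exactly the "vibe" to avoid.
        raise ValueError("judge returned no rationale; discard the score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not a number: {score!r}")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return {"rationale": rationale.strip(), "score": float(score)}
```

Treating a malformed reply as an error, rather than defaulting it to 0 or 1, keeps judge failures visible instead of folding them into the metric.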
|
|
32
|
+
|
|
33
|
+
## Pairwise rubric template (preferred)
|
|
34
|
+
|
|
35
|
+
"Is A or B better?" is more reproducible than an absolute rating. Use it for before/after and
|
|
36
|
+
A/B comparisons.
|
|
37
|
+
|
|
38
|
+
```text
|
|
39
|
+
Compare two answers, A and B, against the reference for HELPFULNESS.
|
|
40
|
+
|
|
41
|
+
Reference: {reference}
|
|
42
|
+
Answer A: {a}
|
|
43
|
+
Answer B: {b}
|
|
44
|
+
|
|
45
|
+
1. One sentence on what A does well/badly vs the reference.
|
|
46
|
+
2. One sentence on what B does well/badly vs the reference.
|
|
47
|
+
3. Output JSON: {"rationale": "...", "winner": "A" | "B" | "tie"}
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
## Position-swap harness (kills position bias)
|
|
51
|
+
|
|
52
|
+
Judges favor whichever answer appears first. Run each comparison twice with positions swapped
|
|
53
|
+
and count a win only when both orderings pick the same answer.
|
|
54
|
+
|
|
55
|
+
```python
def pairwise(judge, reference, a, b):
    """Return 'a', 'b', or 'tie' after canceling position bias."""
    first = judge.compare(reference, a, b)   # a shown in slot A
    second = judge.compare(reference, b, a)  # a shown in slot B -> remap
    remap = {"A": "b", "B": "a", "tie": "tie"}
    r1 = {"A": "a", "B": "b", "tie": "tie"}[first]
    r2 = remap[second]
    if r1 == r2:
        return r1   # consistent across positions -> trust it
    return "tie"    # judge flipped with position -> not a real signal
```
|
|
67
|
+
|
|
68
|
+
A judge that flips when you swap positions has told you it cannot separate the two — record a
|
|
69
|
+
tie, do not pick the first-slot winner.
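
Once each pair has a debiased outcome, the outcomes still need to be rolled up into something a report can carry. A minimal sketch, assuming outcomes are the strings `'a'`, `'b'`, and `'tie'`; `win_rate` and its field names are illustrative, not part of the skill:

```python
from collections import Counter

def win_rate(outcomes: list[str]) -> dict:
    """Summarize position-debiased outcomes ('a', 'b', or 'tie')."""
    if not outcomes:
        raise ValueError("no comparisons to summarize")
    counts = Counter(outcomes)
    decided = counts["a"] + counts["b"]
    return {
        "a_wins": counts["a"],
        "b_wins": counts["b"],
        "ties": counts["tie"],
        "tie_rate": counts["tie"] / len(outcomes),
        # Share of decided comparisons won by 'b'; None if every pair tied.
        "b_win_rate": counts["b"] / decided if decided else None,
    }
```

A high `tie_rate` is itself a finding: it suggests the judge is mostly flipping with position and cannot separate the two systems.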
|
|
70
|
+
|
|
71
|
+
## Calibration script (agreement and Cohen's kappa vs human gold)
|
|
72
|
+
|
|
73
|
+
You need a small human-labeled gold slice (~10% of cases, the ambiguous ones). Run the judge
|
|
74
|
+
on it and measure how often it agrees with the humans. Report this number in `eval-report.json`
|
|
75
|
+
and refuse to gate if it is low.
|
|
76
|
+
|
|
77
|
+
```python
def cohen_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for chance. 1.0 perfect, 0 chance-level."""
    n = len(human)
    po = sum(h == j for h, j in zip(human, judge)) / n  # raw agreement
    labels = set(human) | set(judge)
    pe = sum((human.count(k) / n) * (judge.count(k) / n) for k in labels)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

def calibration_report(human, judge):
    raw = sum(h == j for h, j in zip(human, judge)) / len(human)
    return {"raw_agreement": round(raw, 3),
            "cohen_kappa": round(cohen_kappa(human, judge), 3)}
```
|
|
91
|
+
|
|
92
|
+
Rules of thumb for the kappa: < 0.4 the judge is unusable, fix the rubric or model; 0.4–0.6
|
|
93
|
+
marginal, widen the human slice; > 0.6 with raw agreement ~0.85 is the working zone. These are
|
|
94
|
+
thresholds for *trusting the judge*, not for gating the system.
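
The rules of thumb above can be encoded so the runner refuses to gate on a weak judge. This is a sketch only: `judge_trust` is a hypothetical name, and treating the boundary values 0.4 and 0.6 as "marginal" is an assumption the text does not settle.

```python
def judge_trust(kappa: float) -> str:
    """Map Cohen's kappa to the trust zones described above."""
    if kappa < 0.4:
        return "unusable"   # fix the rubric or the judge model
    if kappa <= 0.6:
        return "marginal"   # widen the human-labeled slice
    return "working"        # still report raw agreement alongside

def may_gate(kappa: float) -> bool:
    """Only a judge in the working zone should be allowed to block a merge."""
    return judge_trust(kappa) == "working"
```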
|
|
95
|
+
|
|
96
|
+
## G-Eval vs DAG
|
|
97
|
+
|
|
98
|
+
- **G-Eval** — you give the judge an evaluation-criteria sentence; it generates the
|
|
99
|
+
chain-of-thought steps and a weighted score. Fast to author, good for fuzzy semantic
|
|
100
|
+
dimensions (coherence, helpfulness). Less reproducible for hard rules.
|
|
101
|
+
- **DAG (Decision Graph)** — you author an explicit decision tree of yes/no checks; the score
|
|
102
|
+
is deterministic given the answers. Use for compliance-style "must / must-not" criteria where
|
|
103
|
+
you want auditability over flexibility. DeepEval v4 ships this as first-class.
|
|
104
|
+
|
|
105
|
+
Pick G-Eval for taste, DAG for rules.
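
To make the DAG idea concrete, here is a framework-free sketch of a decision graph of yes/no checks. It is not DeepEval's API: `Check`, `score_dag`, and the refund rule are hypothetical, and the lambdas stand in for what would be yes/no judge calls in practice.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Check:
    question: str                  # yes/no question, kept for the audit trail
    test: Callable[[str], bool]    # deterministic stand-in for a yes/no judge call
    if_yes: Union["Check", float]  # next check, or a terminal score
    if_no: Union["Check", float]

def score_dag(node: Union[Check, float], answer: str, trail: list[str]) -> float:
    """Walk the graph; the same yes/no answers always yield the same score."""
    while isinstance(node, Check):
        verdict = node.test(answer)
        trail.append(f"{node.question} -> {'yes' if verdict else 'no'}")
        node = node.if_yes if verdict else node.if_no
    return node

# Hypothetical rule: must mention the refund window, must not guarantee it.
no_promise = Check("Avoids a guarantee?",
                   lambda a: "guarantee" not in a.lower(), 1.0, 0.0)
root = Check("Mentions the refund window?",
             lambda a: "5-7 business days" in a, no_promise, 0.0)
```

The `trail` list is the point of the design: every score comes with the exact sequence of checks that produced it, which is what makes the result auditable.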
|
|
106
|
+
|
|
107
|
+
## Judge bias catalog and mitigations
|
|
108
|
+
|
|
109
|
+
| Bias | Symptom | Mitigation |
|
|
110
|
+
| --- | --- | --- |
|
|
111
|
+
| Position | Favors the first answer shown | Position-swap harness above; count only consistent wins |
|
|
112
|
+
| Length | Scores longer answers higher regardless of quality | Add "ignore length; reward density" to the rubric; spot-check long losers |
|
|
113
|
+
| Self-preference | A model prefers text in its own style | Use a different model family as judge than the system under test |
|
|
114
|
+
| Verbosity-of-rationale | Long rationale read as more correct | Score the claim breakdown, not the prose |
|
|
115
|
+
| Anchoring on reference wording | Penalizes correct paraphrases | Grade meaning vs reference, not surface overlap; test with known-good paraphrases |
|
|
116
|
+
|
|
117
|
+
Re-run the calibration slice whenever you change the judge model or rubric — a judge swap is a
|
|
118
|
+
scorer change, and an uncalibrated scorer is back to a number you cannot defend.
|
|
@@ -0,0 +1,183 @@
|
|
|
1
|
+
# The runner, the gate, and framework equivalents
|
|
2
|
+
|
|
3
|
+
Depth offloaded from `SKILL.md`. A provider-neutral runner you can drop into any repo, the CI
|
|
4
|
+
wiring, and the same eval expressed in DeepEval and Inspect AI so you can switch tools without
|
|
5
|
+
relearning the craft.
|
|
6
|
+
|
|
7
|
+
## The five stages, in code
|
|
8
|
+
|
|
9
|
+
`dataset → runner → scorers → metrics → gate`. Keep them separable so each can be swapped.
|
|
10
|
+
|
|
11
|
+
### Dataset loader (JSONL, decontaminated, versioned)
|
|
12
|
+
|
|
13
|
+
```python
import json
from pathlib import Path

REQUIRED = {"id", "input", "expected"}

def load_cases(path: str) -> list[dict]:
    cases = []
    for i, line in enumerate(Path(path).read_text().splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        missing = REQUIRED - case.keys()
        if missing:
            raise ValueError(f"{path}:{i} missing keys {missing}")
        cases.append(case)
    return cases
```
|
|
32
|
+
|
|
33
|
+
### Scorer registry
|
|
34
|
+
|
|
35
|
+
Every scorer satisfies one protocol — deterministic and judge are interchangeable. See
|
|
36
|
+
`judge-design.md` for the judge implementations.
|
|
37
|
+
|
|
38
|
+
```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scorer:
    name: str
    fn: Callable[[dict, str], float]  # (case, output) -> 0.0..1.0

def exact_match(case, output):
    return 1.0 if output.strip() == case["expected"].strip() else 0.0

def latency_under(threshold_s):
    def _fn(case, output):
        # "meta" is optional: the loader only requires id/input/expected.
        return 1.0 if case.get("meta", {}).get("latency_s", 0) <= threshold_s else 0.0
    return _fn

REGISTRY = [
    Scorer("exact_match", exact_match),
    Scorer("latency_2s", latency_under(2.0)),
    # Scorer("faithfulness", FaithfulnessJudge(judge_model).score),  # 30% judge
]
```
|
|
61
|
+
|
|
62
|
+
### Runner + bootstrap-CI metrics
|
|
63
|
+
|
|
64
|
+
The bootstrap CI is what stops judge noise from flapping the gate: only a drop beyond the
|
|
65
|
+
interval counts as a real regression.
|
|
66
|
+
|
|
67
|
+
```python
import random, statistics

def run(system, cases, scorers):
    rows = []
    for case in cases:
        output = system(case["input"])  # the system under test
        row = {"id": case["id"],
               # "meta" is optional, so fall back to an empty dict.
               "failure_mode": case.get("meta", {}).get("failure_mode", "none")}
        for s in scorers:
            row[s.name] = s.fn(case, output)
        rows.append(row)
    return rows

def metrics(rows, scorers, n_boot=1000):
    out = {}
    for s in scorers:
        vals = [r[s.name] for r in rows]
        mean = statistics.fmean(vals)
        # Resample with replacement to get a 95% percentile interval on the mean.
        boots = [statistics.fmean(random.choices(vals, k=len(vals)))
                 for _ in range(n_boot)]
        boots.sort()
        out[s.name] = {"mean": round(mean, 4),
                       "ci_low": round(boots[int(0.025 * n_boot)], 4),
                       "ci_high": round(boots[int(0.975 * n_boot)], 4)}
    return out
```
|
|
94
|
+
|
|
95
|
+
### Gate (block on regression vs committed baseline)
|
|
96
|
+
|
|
97
|
+
```python
import json, sys
from pathlib import Path

def gate(current: dict, baseline: dict, report_path="eval-report.json") -> int:
    regressed = []
    for metric, m in current.items():
        base = baseline.get(metric, {}).get("mean")
        # Real regression: the current CI is entirely below the baseline mean.
        if base is not None and m["ci_high"] < base:
            regressed.append({"metric": metric, "baseline": base,
                              "now": m["mean"], "ci_high": m["ci_high"]})
    report = {"metrics": current, "baseline": baseline,
              "regressed": regressed, "passed": not regressed}
    Path(report_path).write_text(json.dumps(report, indent=2))
    for r in regressed:
        print(f"REGRESSION {r['metric']}: {r['baseline']} -> {r['now']}",
              file=sys.stderr)
    return 1 if regressed else 0
```
|
|
116
|
+
|
|
117
|
+
Update the baseline deliberately — commit a new `eval-baseline.json` in the PR that *intends*
|
|
118
|
+
to move the metric, so the move is reviewed, not silent.
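
Writing the baseline can be its own small, explicit step so that it only happens when someone runs it on purpose. A minimal sketch, assuming the gate reads `{"<metric>": {"mean": ...}}` from the baseline file; `write_baseline` is a hypothetical helper, not part of the skill's scripts.

```python
import json
from pathlib import Path

def write_baseline(metrics: dict, path: str = "eval-baseline.json") -> dict:
    """Persist only the per-metric means that the gate compares against."""
    baseline = {name: {"mean": m["mean"]} for name, m in sorted(metrics.items())}
    # Sorted keys and a trailing newline keep the committed diff small and stable.
    Path(path).write_text(json.dumps(baseline, indent=2, sort_keys=True) + "\n")
    return baseline
```

Dropping the confidence bounds from the file is a design choice here: the gate compares the current interval against the baseline mean, so the stored bounds would never be read.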
|
|
119
|
+
|
|
120
|
+
## GitHub Actions wiring
|
|
121
|
+
|
|
122
|
+
```yaml
name: eval-gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r evals/requirements.txt
      - name: Run eval gate
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evals/run.py   # exits non-zero on regression -> blocks merge
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: eval-report, path: eval-report.json }
```
|
|
141
|
+
|
|
142
|
+
The non-zero exit is the gate. Upload the report `if: always()` so a failed run still shows
|
|
143
|
+
*which* metric and failure mode regressed.
|
|
144
|
+
|
|
145
|
+
## Same eval in DeepEval (pytest-native)
|
|
146
|
+
|
|
147
|
+
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# `system` is your system under test: a callable that takes the prompt and returns text.

def test_faithfulness():
    case = LLMTestCase(
        input="Where is my refund for order 4821?",
        actual_output=system("Where is my refund for order 4821?"),
        retrieval_context=["policy: refunds 5-7 business days"],
    )
    assert_test(case, [FaithfulnessMetric(threshold=0.8)])
```
|
|
160
|
+
|
|
161
|
+
DeepEval v4.0.3 reads like tests and runs under pytest; reach for it when CI is already Python.
|
|
162
|
+
|
|
163
|
+
## Same eval in Inspect AI (Task / Solver / Scorer)
|
|
164
|
+
|
|
165
|
+
```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.solver import generate
from inspect_ai.scorer import model_graded_qa

@task
def refund_quality():
    return Task(
        # Samples are read from `input` / `target` fields by default; if the
        # golden set uses `expected`, map it with a FieldSpec.
        dataset=json_dataset("evals/cases.jsonl"),
        solver=generate(),
        scorer=model_graded_qa(),  # judge with rationale; bootstrap CIs built in
    )
# inspect eval refund_quality.py --model openai/gpt-...  -> pass/fail + CIs
```
|
|
180
|
+
|
|
181
|
+
Inspect AI v0.3.225 (UK AISI) gives first-class tool-use and trajectory logging plus bootstrap
|
|
182
|
+
CIs out of the box; reach for it when you are multi-provider or need real trajectory scoring.
|
|
183
|
+
Both express the identical `dataset → scorer → gate` pipeline — the tool is a detail.
|