rsc-universal 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (1418) hide show
  1. package/LICENSE +21 -0
  2. package/README.md +279 -0
  3. package/manifest.json +4761 -0
  4. package/package.json +59 -0
  5. package/schema/frontmatter.schema.json +12 -0
  6. package/scripts/build-manifest.js +72 -0
  7. package/scripts/consult.js +106 -0
  8. package/scripts/detect-repo.js +118 -0
  9. package/scripts/doctor.js +21 -0
  10. package/scripts/eval-lint.sh +179 -0
  11. package/scripts/install-apply.js +52 -0
  12. package/scripts/install-plan.js +13 -0
  13. package/scripts/lib/behavior-score.js +103 -0
  14. package/scripts/lib/frontmatter.js +47 -0
  15. package/scripts/lib/harden-policy.js +41 -0
  16. package/scripts/lib/manifest.js +18 -0
  17. package/scripts/lib/recommend.js +36 -0
  18. package/scripts/lib/registry.js +110 -0
  19. package/scripts/lib/result-envelope.js +35 -0
  20. package/scripts/lib/state.js +12 -0
  21. package/scripts/lib/ui.js +17 -0
  22. package/scripts/reviewer-guard.sh +67 -0
  23. package/scripts/rsc.js +108 -0
  24. package/scripts/skill-behavior-eval.js +33 -0
  25. package/scripts/skill-behavior-eval.workflow.js +136 -0
  26. package/scripts/skill-behavior-rubric.md +63 -0
  27. package/scripts/skill-harden-rubric.md +40 -0
  28. package/scripts/skill-harden.workflow.js +161 -0
  29. package/scripts/skill-rubric.md +39 -0
  30. package/scripts/skill-scoreboard.workflow.js +35 -0
  31. package/skills/ab-testing/SKILL.md +191 -0
  32. package/skills/ab-testing/evals/README.md +8 -0
  33. package/skills/ab-testing/evals/cases.yaml +49 -0
  34. package/skills/ab-testing/references/pitfalls.md +74 -0
  35. package/skills/ab-testing/references/sample-size-and-cuped.md +128 -0
  36. package/skills/ab-testing/scripts/verify.sh +89 -0
  37. package/skills/accessibility/SKILL.md +218 -0
  38. package/skills/accessibility/evals/README.md +3 -0
  39. package/skills/accessibility/evals/cases.yaml +47 -0
  40. package/skills/accessibility/references/aria-patterns.md +113 -0
  41. package/skills/accessibility/references/wcag22-checklist.md +83 -0
  42. package/skills/accessibility/scripts/verify.sh +103 -0
  43. package/skills/ads/SKILL.md +175 -0
  44. package/skills/ads/evals/README.md +15 -0
  45. package/skills/ads/evals/cases.yaml +58 -0
  46. package/skills/ads/references/platform-specs.md +73 -0
  47. package/skills/ads/references/roas-model.md +77 -0
  48. package/skills/ads/scripts/verify.sh +210 -0
  49. package/skills/agent-eval/SKILL.md +213 -0
  50. package/skills/agent-eval/evals/README.md +12 -0
  51. package/skills/agent-eval/evals/cases.yaml +45 -0
  52. package/skills/agent-eval/references/judge-design.md +118 -0
  53. package/skills/agent-eval/references/runner-and-gate.md +183 -0
  54. package/skills/agent-eval/scripts/verify.sh +161 -0
  55. package/skills/agent-safety/SKILL.md +176 -0
  56. package/skills/agent-safety/evals/README.md +12 -0
  57. package/skills/agent-safety/evals/cases.yaml +46 -0
  58. package/skills/agent-safety/references/threat-model.md +51 -0
  59. package/skills/ai-media/SKILL.md +196 -0
  60. package/skills/ai-media/evals/README.md +3 -0
  61. package/skills/ai-media/evals/cases.yaml +45 -0
  62. package/skills/ai-media/references/ffmpeg-assembly.md +117 -0
  63. package/skills/ai-media/references/models-and-params.md +78 -0
  64. package/skills/ai-media/scripts/verify.sh +103 -0
  65. package/skills/analytics/SKILL.md +219 -0
  66. package/skills/analytics/evals/README.md +9 -0
  67. package/skills/analytics/evals/cases.yaml +53 -0
  68. package/skills/analytics/references/event-taxonomy.md +75 -0
  69. package/skills/analytics/references/ga4-setup.md +122 -0
  70. package/skills/analytics/references/posthog-setup.md +100 -0
  71. package/skills/analytics/scripts/verify.sh +95 -0
  72. package/skills/analyze/SKILL.md +136 -0
  73. package/skills/analyze/evals/README.md +72 -0
  74. package/skills/analyze/evals/cases.yaml +74 -0
  75. package/skills/angular/SKILL.md +288 -0
  76. package/skills/angular/evals/README.md +3 -0
  77. package/skills/angular/evals/cases.yaml +38 -0
  78. package/skills/angular/references/migration.md +81 -0
  79. package/skills/angular/references/signals-rxjs.md +92 -0
  80. package/skills/angular/scripts/verify.sh +122 -0
  81. package/skills/api-connector-builder/SKILL.md +285 -0
  82. package/skills/api-connector-builder/evals/README.md +11 -0
  83. package/skills/api-connector-builder/evals/cases.yaml +47 -0
  84. package/skills/api-connector-builder/references/auth-flows.md +132 -0
  85. package/skills/api-connector-builder/references/pagination.md +144 -0
  86. package/skills/api-connector-builder/scripts/verify.sh +172 -0
  87. package/skills/api-design/SKILL.md +189 -0
  88. package/skills/api-design/evals/README.md +3 -0
  89. package/skills/api-design/evals/cases.yaml +45 -0
  90. package/skills/api-design/references/graphql-design.md +70 -0
  91. package/skills/api-design/references/openapi-contract.md +86 -0
  92. package/skills/api-design/references/rest-conventions.md +63 -0
  93. package/skills/api-design/references/versioning-and-evolution.md +49 -0
  94. package/skills/api-design/scripts/verify.sh +138 -0
  95. package/skills/article-writing/SKILL.md +175 -0
  96. package/skills/article-writing/evals/README.md +3 -0
  97. package/skills/article-writing/evals/cases.yaml +47 -0
  98. package/skills/article-writing/references/ai-tell-banlist.md +114 -0
  99. package/skills/article-writing/references/on-page-seo.md +133 -0
  100. package/skills/article-writing/scripts/verify.sh +165 -0
  101. package/skills/astro/SKILL.md +275 -0
  102. package/skills/astro/evals/README.md +3 -0
  103. package/skills/astro/evals/cases.yaml +41 -0
  104. package/skills/astro/references/content-layer.md +118 -0
  105. package/skills/astro/references/deploy-and-integrations.md +163 -0
  106. package/skills/astro/scripts/verify.sh +137 -0
  107. package/skills/author-skill/SKILL.md +206 -0
  108. package/skills/author-skill/evals/README.md +66 -0
  109. package/skills/author-skill/evals/cases.yaml +75 -0
  110. package/skills/author-skill/references/description-recipe.md +84 -0
  111. package/skills/author-skill/references/eval-authoring.md +74 -0
  112. package/skills/author-skill/references/rsc-conventions.md +91 -0
  113. package/skills/automation-flows/SKILL.md +132 -0
  114. package/skills/automation-flows/evals/README.md +5 -0
  115. package/skills/automation-flows/evals/cases.yaml +44 -0
  116. package/skills/automation-flows/references/error-handling.md +58 -0
  117. package/skills/automation-flows/references/n8n-workflow-json.md +63 -0
  118. package/skills/automation-flows/scripts/verify.sh +78 -0
  119. package/skills/aws-essentials/SKILL.md +223 -0
  120. package/skills/aws-essentials/evals/README.md +10 -0
  121. package/skills/aws-essentials/evals/cases.yaml +44 -0
  122. package/skills/aws-essentials/references/iam-least-privilege.md +134 -0
  123. package/skills/aws-essentials/references/rds-cloudfront-recipes.md +127 -0
  124. package/skills/aws-essentials/scripts/verify.sh +99 -0
  125. package/skills/backups/SKILL.md +137 -0
  126. package/skills/backups/evals/README.md +3 -0
  127. package/skills/backups/evals/cases.yaml +42 -0
  128. package/skills/backups/references/engine-recipes.md +121 -0
  129. package/skills/backups/references/restore-runbook.md +65 -0
  130. package/skills/backups/scripts/verify.sh +80 -0
  131. package/skills/bash-scripting/SKILL.md +231 -0
  132. package/skills/bash-scripting/evals/README.md +3 -0
  133. package/skills/bash-scripting/evals/cases.yaml +45 -0
  134. package/skills/bash-scripting/references/portability.md +97 -0
  135. package/skills/bash-scripting/scripts/verify.sh +140 -0
  136. package/skills/bookkeeping/SKILL.md +184 -0
  137. package/skills/bookkeeping/evals/README.md +5 -0
  138. package/skills/bookkeeping/evals/cases.yaml +52 -0
  139. package/skills/bookkeeping/references/chart-of-accounts.md +87 -0
  140. package/skills/bookkeeping/references/reconciliation-playbook.md +54 -0
  141. package/skills/bookkeeping/references/tricky-transactions.md +192 -0
  142. package/skills/brand-identity/SKILL.md +161 -0
  143. package/skills/brand-identity/evals/README.md +14 -0
  144. package/skills/brand-identity/evals/cases.yaml +43 -0
  145. package/skills/brand-identity/references/color-and-tokens.md +129 -0
  146. package/skills/brand-identity/references/logo-and-assets.md +117 -0
  147. package/skills/brand-identity/scripts/verify.sh +224 -0
  148. package/skills/brand-voice/SKILL.md +183 -0
  149. package/skills/brand-voice/evals/README.md +3 -0
  150. package/skills/brand-voice/evals/cases.yaml +57 -0
  151. package/skills/brand-voice/references/voice-guide-template.md +150 -0
  152. package/skills/brand-voice/references/word-bank.md +61 -0
  153. package/skills/brand-voice/scripts/verify.sh +190 -0
  154. package/skills/building-agents/SKILL.md +469 -0
  155. package/skills/building-agents/evals/README.md +68 -0
  156. package/skills/building-agents/evals/cases.yaml +60 -0
  157. package/skills/building-agents/references/agent-loops-and-harness.md +371 -0
  158. package/skills/building-agents/references/evals-and-observability.md +420 -0
  159. package/skills/building-agents/references/mcp-servers.md +294 -0
  160. package/skills/building-agents/references/provider-abstraction.md +489 -0
  161. package/skills/building-agents/references/tools-and-rag.md +417 -0
  162. package/skills/building-agents/scripts/verify.sh +121 -0
  163. package/skills/business-intelligence/SKILL.md +176 -0
  164. package/skills/business-intelligence/evals/README.md +3 -0
  165. package/skills/business-intelligence/evals/cases.yaml +43 -0
  166. package/skills/business-intelligence/references/authoring-semantic-models.md +120 -0
  167. package/skills/business-intelligence/references/wiring-agents-and-apis.md +79 -0
  168. package/skills/business-intelligence/scripts/verify.sh +143 -0
  169. package/skills/calendar-scheduling/SKILL.md +196 -0
  170. package/skills/calendar-scheduling/evals/README.md +14 -0
  171. package/skills/calendar-scheduling/evals/cases.yaml +45 -0
  172. package/skills/calendar-scheduling/references/google-calendar-sync.md +78 -0
  173. package/skills/calendar-scheduling/references/provider-matrix.md +71 -0
  174. package/skills/calendar-scheduling/scripts/verify.sh +117 -0
  175. package/skills/case-studies/SKILL.md +147 -0
  176. package/skills/case-studies/evals/README.md +3 -0
  177. package/skills/case-studies/evals/cases.yaml +63 -0
  178. package/skills/case-studies/references/case-study-skeleton.md +90 -0
  179. package/skills/case-studies/references/consent-and-substantiation.md +80 -0
  180. package/skills/case-studies/scripts/verify.sh +161 -0
  181. package/skills/chatbot/SKILL.md +168 -0
  182. package/skills/chatbot/evals/README.md +13 -0
  183. package/skills/chatbot/evals/cases.yaml +43 -0
  184. package/skills/chatbot/references/handoff-and-sales.md +71 -0
  185. package/skills/chatbot/references/system-prompt-and-guardrails.md +78 -0
  186. package/skills/chatbot/scripts/verify.sh +162 -0
  187. package/skills/chrome-extension/SKILL.md +169 -0
  188. package/skills/chrome-extension/evals/README.md +12 -0
  189. package/skills/chrome-extension/evals/cases.yaml +40 -0
  190. package/skills/chrome-extension/references/store-and-migration.md +84 -0
  191. package/skills/chrome-extension/scripts/verify.sh +62 -0
  192. package/skills/clarify/SKILL.md +159 -0
  193. package/skills/clarify/evals/README.md +70 -0
  194. package/skills/clarify/evals/cases.yaml +71 -0
  195. package/skills/clickhouse-analytics/SKILL.md +165 -0
  196. package/skills/clickhouse-analytics/evals/README.md +3 -0
  197. package/skills/clickhouse-analytics/evals/cases.yaml +45 -0
  198. package/skills/clickhouse-analytics/references/ingestion-and-mvs.md +109 -0
  199. package/skills/clickhouse-analytics/references/query-optimization.md +76 -0
  200. package/skills/clickhouse-analytics/references/schema-and-engines.md +63 -0
  201. package/skills/clickhouse-analytics/scripts/verify.sh +109 -0
  202. package/skills/client-onboarding/SKILL.md +254 -0
  203. package/skills/client-onboarding/evals/README.md +14 -0
  204. package/skills/client-onboarding/evals/cases.yaml +40 -0
  205. package/skills/client-onboarding/references/onboarding-playbook.md +126 -0
  206. package/skills/cloudflare/SKILL.md +191 -0
  207. package/skills/cloudflare/evals/README.md +15 -0
  208. package/skills/cloudflare/evals/cases.yaml +46 -0
  209. package/skills/cloudflare/references/storage-primitives.md +104 -0
  210. package/skills/cloudflare/references/wrangler-config.md +91 -0
  211. package/skills/cloudflare/scripts/verify.sh +133 -0
  212. package/skills/code-review/SKILL.md +143 -0
  213. package/skills/code-review/evals/README.md +3 -0
  214. package/skills/code-review/evals/cases.yaml +55 -0
  215. package/skills/code-review/references/pr-workflow.md +67 -0
  216. package/skills/codebase-onboarding/SKILL.md +133 -0
  217. package/skills/codebase-onboarding/evals/README.md +3 -0
  218. package/skills/codebase-onboarding/evals/cases.yaml +69 -0
  219. package/skills/codebase-onboarding/references/recon-playbook.md +57 -0
  220. package/skills/codebase-onboarding/scripts/verify.sh +54 -0
  221. package/skills/cold-outreach/SKILL.md +206 -0
  222. package/skills/cold-outreach/evals/README.md +3 -0
  223. package/skills/cold-outreach/evals/cases.yaml +60 -0
  224. package/skills/cold-outreach/references/compliance-footer.md +50 -0
  225. package/skills/cold-outreach/references/hook-derivation.md +73 -0
  226. package/skills/cold-outreach/references/templates.md +88 -0
  227. package/skills/cold-outreach/scripts/verify.sh +170 -0
  228. package/skills/community/SKILL.md +225 -0
  229. package/skills/community/evals/README.md +3 -0
  230. package/skills/community/evals/cases.yaml +40 -0
  231. package/skills/community/references/metrics-and-rituals.md +58 -0
  232. package/skills/community/references/platform-playbooks.md +64 -0
  233. package/skills/community/scripts/verify.sh +83 -0
  234. package/skills/competitor-watch/SKILL.md +193 -0
  235. package/skills/competitor-watch/evals/README.md +19 -0
  236. package/skills/competitor-watch/evals/cases.yaml +54 -0
  237. package/skills/competitor-watch/references/monitoring-config.md +124 -0
  238. package/skills/competitor-watch/references/tracker-schema.md +79 -0
  239. package/skills/competitor-watch/scripts/verify.sh +253 -0
  240. package/skills/compliance/SKILL.md +184 -0
  241. package/skills/compliance/evals/README.md +14 -0
  242. package/skills/compliance/evals/cases.yaml +46 -0
  243. package/skills/compliance/references/frameworks.md +108 -0
  244. package/skills/compliance/references/operating-rhythm.md +79 -0
  245. package/skills/compliance/scripts/verify.sh +168 -0
  246. package/skills/compose-multiplatform/SKILL.md +198 -0
  247. package/skills/compose-multiplatform/evals/README.md +3 -0
  248. package/skills/compose-multiplatform/evals/cases.yaml +40 -0
  249. package/skills/compose-multiplatform/references/ios-interop.md +91 -0
  250. package/skills/compose-multiplatform/references/project-setup.md +96 -0
  251. package/skills/compose-multiplatform/scripts/verify.sh +123 -0
  252. package/skills/constitution/SKILL.md +160 -0
  253. package/skills/constitution/evals/README.md +68 -0
  254. package/skills/constitution/evals/cases.yaml +72 -0
  255. package/skills/constitution/references/constitution-template.md +90 -0
  256. package/skills/content-engine/SKILL.md +164 -0
  257. package/skills/content-engine/evals/README.md +17 -0
  258. package/skills/content-engine/evals/cases.yaml +62 -0
  259. package/skills/content-engine/references/atomization.md +81 -0
  260. package/skills/content-engine/references/brief-and-pipeline.md +90 -0
  261. package/skills/content-engine/scripts/verify.sh +146 -0
  262. package/skills/context-budget/SKILL.md +132 -0
  263. package/skills/context-budget/evals/README.md +11 -0
  264. package/skills/context-budget/evals/cases.yaml +40 -0
  265. package/skills/context-budget/references/handoff-and-compaction.md +96 -0
  266. package/skills/continuous-learning/SKILL.md +136 -0
  267. package/skills/continuous-learning/evals/README.md +16 -0
  268. package/skills/continuous-learning/evals/cases.yaml +39 -0
  269. package/skills/continuous-learning/references/lesson-routing.md +106 -0
  270. package/skills/contracts/SKILL.md +124 -0
  271. package/skills/contracts/evals/README.md +3 -0
  272. package/skills/contracts/evals/cases.yaml +42 -0
  273. package/skills/contracts/references/clause-library.md +129 -0
  274. package/skills/contracts/references/review-playbook.md +49 -0
  275. package/skills/contracts/scripts/verify.sh +53 -0
  276. package/skills/coolify/SKILL.md +201 -0
  277. package/skills/coolify/evals/README.md +21 -0
  278. package/skills/coolify/evals/cases.yaml +46 -0
  279. package/skills/coolify/references/databases-and-backups.md +99 -0
  280. package/skills/coolify/references/deploy-recipes.md +105 -0
  281. package/skills/coolify/references/install-and-proxy.md +80 -0
  282. package/skills/coolify/scripts/verify.sh +123 -0
  283. package/skills/cost-tracking/SKILL.md +183 -0
  284. package/skills/cost-tracking/evals/README.md +3 -0
  285. package/skills/cost-tracking/evals/cases.yaml +45 -0
  286. package/skills/cost-tracking/references/cloud-caps.md +52 -0
  287. package/skills/cost-tracking/references/pricing-tables.md +51 -0
  288. package/skills/cost-tracking/scripts/verify.sh +135 -0
  289. package/skills/course-builder/SKILL.md +186 -0
  290. package/skills/course-builder/evals/README.md +16 -0
  291. package/skills/course-builder/evals/cases.yaml +49 -0
  292. package/skills/course-builder/references/assessment-design.md +74 -0
  293. package/skills/course-builder/references/grounding-and-scoping.md +69 -0
  294. package/skills/course-builder/references/outcomes-and-blooms.md +82 -0
  295. package/skills/course-builder/scripts/verify.sh +247 -0
  296. package/skills/course-storytelling/SKILL.md +205 -0
  297. package/skills/course-storytelling/evals/README.md +54 -0
  298. package/skills/course-storytelling/evals/cases.yaml +50 -0
  299. package/skills/course-storytelling/references/brunson-frameworks.md +190 -0
  300. package/skills/course-storytelling/references/concept-landing-recipe.md +136 -0
  301. package/skills/course-storytelling/references/course-analysis.md +124 -0
  302. package/skills/course-storytelling/references/learner-grounding.md +183 -0
  303. package/skills/course-storytelling/references/mental-models.md +115 -0
  304. package/skills/course-storytelling/scripts/verify.sh +223 -0
  305. package/skills/cpp/SKILL.md +349 -0
  306. package/skills/cpp/evals/README.md +14 -0
  307. package/skills/cpp/evals/cases.yaml +44 -0
  308. package/skills/cpp/references/cmake.md +167 -0
  309. package/skills/cpp/references/move-and-templates.md +130 -0
  310. package/skills/cpp/references/undefined-behavior.md +86 -0
  311. package/skills/cpp/scripts/verify.sh +165 -0
  312. package/skills/csharp-dotnet/SKILL.md +291 -0
  313. package/skills/csharp-dotnet/evals/README.md +3 -0
  314. package/skills/csharp-dotnet/evals/cases.yaml +48 -0
  315. package/skills/csharp-dotnet/references/aspnetcore.md +99 -0
  316. package/skills/csharp-dotnet/references/async.md +82 -0
  317. package/skills/csharp-dotnet/references/efcore.md +96 -0
  318. package/skills/csharp-dotnet/scripts/verify.sh +90 -0
  319. package/skills/customer-support/SKILL.md +193 -0
  320. package/skills/customer-support/evals/README.md +13 -0
  321. package/skills/customer-support/evals/cases.yaml +61 -0
  322. package/skills/customer-support/references/macros-and-sla.md +142 -0
  323. package/skills/dashboard/SKILL.md +205 -0
  324. package/skills/dashboard/evals/README.md +3 -0
  325. package/skills/dashboard/evals/cases.yaml +50 -0
  326. package/skills/dashboard/references/chart-selection.md +34 -0
  327. package/skills/dashboard/references/tile-schema.md +164 -0
  328. package/skills/dashboard/scripts/verify.sh +130 -0
  329. package/skills/data-cleaning/SKILL.md +285 -0
  330. package/skills/data-cleaning/evals/README.md +16 -0
  331. package/skills/data-cleaning/evals/cases.yaml +57 -0
  332. package/skills/data-cleaning/references/normalization-recipes.md +136 -0
  333. package/skills/data-cleaning/references/validation-patterns.md +134 -0
  334. package/skills/data-cleaning/scripts/verify.sh +115 -0
  335. package/skills/data-policy/SKILL.md +163 -0
  336. package/skills/data-policy/evals/README.md +15 -0
  337. package/skills/data-policy/evals/cases.yaml +44 -0
  338. package/skills/data-policy/references/consent-and-ropa.md +97 -0
  339. package/skills/data-policy/references/retention-schedule.md +83 -0
  340. package/skills/data-policy/scripts/verify.sh +143 -0
  341. package/skills/data-scraper/SKILL.md +134 -0
  342. package/skills/data-scraper/evals/README.md +3 -0
  343. package/skills/data-scraper/evals/cases.yaml +46 -0
  344. package/skills/data-scraper/references/anti-bot.md +85 -0
  345. package/skills/data-scraper/references/frameworks.md +116 -0
  346. package/skills/data-scraper/references/legal-compliance.md +59 -0
  347. package/skills/data-scraper/scripts/verify.sh +166 -0
  348. package/skills/db-migrations/SKILL.md +254 -0
  349. package/skills/db-migrations/evals/README.md +10 -0
  350. package/skills/db-migrations/evals/cases.yaml +46 -0
  351. package/skills/db-migrations/references/backfill-and-batching.md +105 -0
  352. package/skills/db-migrations/references/expand-contract-playbook.md +152 -0
  353. package/skills/db-migrations/references/tools-and-runners.md +88 -0
  354. package/skills/db-migrations/scripts/verify.sh +112 -0
  355. package/skills/debug/SKILL.md +227 -0
  356. package/skills/debug/evals/README.md +88 -0
  357. package/skills/debug/evals/cases.yaml +74 -0
  358. package/skills/decision-records/SKILL.md +189 -0
  359. package/skills/decision-records/evals/README.md +3 -0
  360. package/skills/decision-records/evals/cases.yaml +43 -0
  361. package/skills/decision-records/references/templates.md +232 -0
  362. package/skills/decision-records/scripts/verify.sh +105 -0
  363. package/skills/deployment/SKILL.md +439 -0
  364. package/skills/deployment/evals/README.md +50 -0
  365. package/skills/deployment/evals/cases.yaml +53 -0
  366. package/skills/deployment/references/coolify.md +216 -0
  367. package/skills/deployment/references/dockerfiles-by-stack.md +319 -0
  368. package/skills/deployment/references/github-actions.md +295 -0
  369. package/skills/deployment/references/hosting-targets.md +272 -0
  370. package/skills/deployment/scripts/verify.sh +134 -0
  371. package/skills/design/SKILL.md +399 -0
  372. package/skills/design/evals/README.md +53 -0
  373. package/skills/design/evals/cases.yaml +56 -0
  374. package/skills/design/references/brand-grounding.md +187 -0
  375. package/skills/design/references/copywriting-frameworks.md +138 -0
  376. package/skills/design/references/landing-anatomy-and-cro.md +202 -0
  377. package/skills/design/references/motion-and-interaction.md +182 -0
  378. package/skills/design/references/research-method.md +147 -0
  379. package/skills/design/references/signature-and-craft.md +148 -0
  380. package/skills/design/references/trends-2026.md +80 -0
  381. package/skills/design/references/visual-system.md +236 -0
  382. package/skills/design/scripts/verify.sh +248 -0
  383. package/skills/digitalocean/SKILL.md +251 -0
  384. package/skills/digitalocean/evals/README.md +10 -0
  385. package/skills/digitalocean/evals/cases.yaml +37 -0
  386. package/skills/digitalocean/references/app-spec.md +126 -0
  387. package/skills/digitalocean/references/droplet-ops.md +95 -0
  388. package/skills/digitalocean/scripts/verify.sh +102 -0
  389. package/skills/django/SKILL.md +268 -0
  390. package/skills/django/evals/README.md +11 -0
  391. package/skills/django/evals/cases.yaml +47 -0
  392. package/skills/django/references/drf.md +109 -0
  393. package/skills/django/references/orm-performance.md +91 -0
  394. package/skills/django/references/security.md +81 -0
  395. package/skills/django/references/testing.md +86 -0
  396. package/skills/django/scripts/verify.sh +115 -0
  397. package/skills/docker/SKILL.md +283 -0
  398. package/skills/docker/evals/README.md +10 -0
  399. package/skills/docker/evals/cases.yaml +44 -0
  400. package/skills/docker/references/base-images-and-stages.md +104 -0
  401. package/skills/docker/references/compose-recipes.md +109 -0
  402. package/skills/docker/scripts/verify.sh +149 -0
  403. package/skills/document-processing/SKILL.md +214 -0
  404. package/skills/document-processing/evals/README.md +3 -0
  405. package/skills/document-processing/evals/cases.yaml +65 -0
  406. package/skills/document-processing/references/engines.md +67 -0
  407. package/skills/document-processing/scripts/verify.sh +172 -0
  408. package/skills/domains-dns/SKILL.md +146 -0
  409. package/skills/domains-dns/evals/README.md +16 -0
  410. package/skills/domains-dns/evals/cases.yaml +47 -0
  411. package/skills/domains-dns/references/record-cookbook.md +94 -0
  412. package/skills/domains-dns/references/tls-and-acme.md +90 -0
  413. package/skills/domains-dns/references/verify-and-debug.md +64 -0
  414. package/skills/domains-dns/scripts/verify.sh +163 -0
  415. package/skills/drizzle-orm/SKILL.md +234 -0
  416. package/skills/drizzle-orm/evals/README.md +12 -0
  417. package/skills/drizzle-orm/evals/cases.yaml +47 -0
  418. package/skills/drizzle-orm/references/relations-and-drivers.md +118 -0
  419. package/skills/drizzle-orm/scripts/verify.sh +155 -0
  420. package/skills/duckdb/SKILL.md +207 -0
  421. package/skills/duckdb/evals/README.md +31 -0
  422. package/skills/duckdb/evals/cases.yaml +41 -0
  423. package/skills/duckdb/references/python-and-interop.md +105 -0
  424. package/skills/duckdb/references/remote-and-lakehouse.md +101 -0
  425. package/skills/duckdb/scripts/verify.sh +71 -0
  426. package/skills/dynamodb/SKILL.md +217 -0
  427. package/skills/dynamodb/evals/README.md +8 -0
  428. package/skills/dynamodb/evals/cases.yaml +46 -0
  429. package/skills/dynamodb/references/access-patterns.md +127 -0
  430. package/skills/dynamodb/references/capacity-and-limits.md +78 -0
  431. package/skills/dynamodb/scripts/verify.sh +108 -0
  432. package/skills/e-signature/SKILL.md +185 -0
  433. package/skills/e-signature/evals/README.md +3 -0
  434. package/skills/e-signature/evals/cases.yaml +44 -0
  435. package/skills/e-signature/references/docusign.md +83 -0
  436. package/skills/e-signature/references/dropbox-sign.md +73 -0
  437. package/skills/e-signature/references/legal-tiers.md +37 -0
  438. package/skills/e-signature/scripts/verify.sh +81 -0
  439. package/skills/e2e-testing/SKILL.md +243 -0
  440. package/skills/e2e-testing/evals/README.md +10 -0
  441. package/skills/e2e-testing/evals/cases.yaml +64 -0
  442. package/skills/e2e-testing/references/config-and-ci.md +156 -0
  443. package/skills/e2e-testing/references/flakiness-playbook.md +124 -0
  444. package/skills/e2e-testing/scripts/verify.sh +117 -0
  445. package/skills/electron/SKILL.md +221 -0
  446. package/skills/electron/evals/README.md +13 -0
  447. package/skills/electron/evals/cases.yaml +38 -0
  448. package/skills/electron/references/packaging-and-updates.md +122 -0
  449. package/skills/electron/references/security-and-ipc.md +158 -0
  450. package/skills/electron/scripts/verify.sh +143 -0
  451. package/skills/elixir/SKILL.md +217 -0
  452. package/skills/elixir/evals/README.md +3 -0
  453. package/skills/elixir/evals/cases.yaml +41 -0
  454. package/skills/elixir/references/mix-and-releases.md +91 -0
  455. package/skills/elixir/references/otp-patterns.md +96 -0
  456. package/skills/elixir/scripts/verify.sh +76 -0
  457. package/skills/email-connector/SKILL.md +294 -0
  458. package/skills/email-connector/evals/README.md +19 -0
  459. package/skills/email-connector/evals/cases.yaml +39 -0
  460. package/skills/email-connector/references/providers.md +107 -0
  461. package/skills/email-connector/scripts/verify.sh +72 -0
  462. package/skills/email-deliverability/SKILL.md +168 -0
  463. package/skills/email-deliverability/evals/README.md +21 -0
  464. package/skills/email-deliverability/evals/cases.yaml +45 -0
  465. package/skills/email-deliverability/scripts/verify.sh +98 -0
  466. package/skills/embeddings-search/SKILL.md +193 -0
  467. package/skills/embeddings-search/evals/README.md +10 -0
  468. package/skills/embeddings-search/evals/cases.yaml +44 -0
  469. package/skills/embeddings-search/references/evaluation.md +86 -0
  470. package/skills/embeddings-search/references/models.md +73 -0
  471. package/skills/embeddings-search/scripts/verify.sh +103 -0
  472. package/skills/error-handling/SKILL.md +307 -0
  473. package/skills/error-handling/evals/README.md +12 -0
  474. package/skills/error-handling/evals/cases.yaml +46 -0
  475. package/skills/error-handling/references/boundaries-and-messaging.md +120 -0
  476. package/skills/error-handling/references/retry-and-resilience.md +154 -0
  477. package/skills/error-handling/scripts/verify.sh +110 -0
  478. package/skills/expo/SKILL.md +253 -0
  479. package/skills/expo/evals/README.md +13 -0
  480. package/skills/expo/evals/cases.yaml +44 -0
  481. package/skills/expo/references/config-plugins.md +117 -0
  482. package/skills/expo/references/eas-update.md +118 -0
  483. package/skills/expo/scripts/verify.sh +132 -0
  484. package/skills/fal/SKILL.md +210 -0
  485. package/skills/fal/evals/README.md +3 -0
  486. package/skills/fal/evals/cases.yaml +42 -0
  487. package/skills/fal/references/models-and-cost.md +53 -0
  488. package/skills/fal/references/queue-and-webhooks.md +153 -0
  489. package/skills/fal/scripts/verify.sh +72 -0
  490. package/skills/fastapi/SKILL.md +499 -0
  491. package/skills/fastapi/evals/README.md +50 -0
  492. package/skills/fastapi/evals/cases.yaml +55 -0
  493. package/skills/fastapi/references/database.md +347 -0
  494. package/skills/fastapi/references/production.md +338 -0
  495. package/skills/fastapi/references/security.md +330 -0
  496. package/skills/fastapi/references/testing.md +349 -0
  497. package/skills/fastapi/scripts/verify.sh +116 -0
  498. package/skills/finance-ops/SKILL.md +149 -0
  499. package/skills/finance-ops/evals/README.md +3 -0
  500. package/skills/finance-ops/evals/cases.yaml +39 -0
  501. package/skills/finance-ops/references/cash-flow-forecast.md +57 -0
  502. package/skills/finance-ops/references/month-close.md +59 -0
  503. package/skills/finance-ops/references/reconciliation.md +65 -0
  504. package/skills/finance-ops/scripts/verify.sh +166 -0
  505. package/skills/financial-model/SKILL.md +170 -0
  506. package/skills/financial-model/evals/README.md +3 -0
  507. package/skills/financial-model/evals/cases.yaml +53 -0
  508. package/skills/financial-model/references/benchmarks-and-scenarios.md +55 -0
  509. package/skills/financial-model/references/model-structure.md +67 -0
  510. package/skills/financial-model/references/revenue-build.md +68 -0
  511. package/skills/financial-model/scripts/verify.sh +232 -0
  512. package/skills/firebase/SKILL.md +251 -0
  513. package/skills/firebase/evals/README.md +12 -0
  514. package/skills/firebase/evals/cases.yaml +45 -0
  515. package/skills/firebase/references/cloud-functions.md +102 -0
  516. package/skills/firebase/references/data-modeling.md +108 -0
  517. package/skills/firebase/references/security-rules.md +137 -0
  518. package/skills/firebase/scripts/verify.sh +98 -0
  519. package/skills/flutter/SKILL.md +448 -0
  520. package/skills/flutter/evals/README.md +54 -0
  521. package/skills/flutter/evals/cases.yaml +69 -0
  522. package/skills/flutter/references/architecture-and-state.md +499 -0
  523. package/skills/flutter/references/i18n-and-dependencies.md +197 -0
  524. package/skills/flutter/references/performance.md +299 -0
  525. package/skills/flutter/references/testing.md +385 -0
  526. package/skills/flutter/references/ui-and-navigation.md +378 -0
  527. package/skills/flutter/scripts/verify.sh +104 -0
  528. package/skills/fly-io/SKILL.md +206 -0
  529. package/skills/fly-io/evals/README.md +3 -0
  530. package/skills/fly-io/evals/cases.yaml +42 -0
  531. package/skills/fly-io/references/fly-toml.md +155 -0
  532. package/skills/fly-io/references/multi-region.md +66 -0
  533. package/skills/fly-io/scripts/verify.sh +90 -0
  534. package/skills/forecasting/SKILL.md +139 -0
  535. package/skills/forecasting/evals/README.md +13 -0
  536. package/skills/forecasting/evals/cases.yaml +47 -0
  537. package/skills/forecasting/references/accuracy-and-backtesting.md +104 -0
  538. package/skills/forecasting/references/methods-cheatsheet.md +94 -0
  539. package/skills/forecasting/scripts/verify.sh +99 -0
  540. package/skills/fundraising/SKILL.md +162 -0
  541. package/skills/fundraising/evals/README.md +18 -0
  542. package/skills/fundraising/evals/cases.yaml +76 -0
  543. package/skills/fundraising/references/funnel-math.md +90 -0
  544. package/skills/fundraising/references/process-playbook.md +97 -0
  545. package/skills/gcp-essentials/SKILL.md +327 -0
  546. package/skills/gcp-essentials/evals/README.md +12 -0
  547. package/skills/gcp-essentials/evals/cases.yaml +38 -0
  548. package/skills/gcp-essentials/references/deploy-recipes.md +81 -0
  549. package/skills/gcp-essentials/references/iam-and-auth.md +94 -0
  550. package/skills/gcp-essentials/references/networking-and-sql.md +74 -0
  551. package/skills/gcp-essentials/scripts/verify.sh +158 -0
  552. package/skills/gdpr-privacy/SKILL.md +167 -0
  553. package/skills/gdpr-privacy/evals/README.md +3 -0
  554. package/skills/gdpr-privacy/evals/cases.yaml +47 -0
  555. package/skills/gdpr-privacy/references/dpa-and-transfers.md +63 -0
  556. package/skills/gdpr-privacy/references/dsar-and-consent.md +83 -0
  557. package/skills/gdpr-privacy/references/privacy-policy-blueprint.md +99 -0
  558. package/skills/gdpr-privacy/scripts/verify.sh +84 -0
  559. package/skills/git-workflow/SKILL.md +190 -0
  560. package/skills/git-workflow/evals/README.md +10 -0
  561. package/skills/git-workflow/evals/cases.yaml +47 -0
  562. package/skills/git-workflow/references/interactive-rebase.md +89 -0
  563. package/skills/github-actions/SKILL.md +256 -0
  564. package/skills/github-actions/evals/README.md +3 -0
  565. package/skills/github-actions/evals/cases.yaml +45 -0
  566. package/skills/github-actions/references/caching-and-matrix.md +92 -0
  567. package/skills/github-actions/references/oidc-deploys.md +130 -0
  568. package/skills/github-actions/scripts/verify.sh +105 -0
  569. package/skills/go/SKILL.md +438 -0
  570. package/skills/go/evals/README.md +56 -0
  571. package/skills/go/evals/cases.yaml +55 -0
  572. package/skills/go/references/concurrency.md +557 -0
  573. package/skills/go/references/http-services.md +529 -0
  574. package/skills/go/references/testing.md +338 -0
  575. package/skills/go/scripts/verify.sh +109 -0
  576. package/skills/google-workspace/SKILL.md +287 -0
  577. package/skills/google-workspace/evals/README.md +16 -0
  578. package/skills/google-workspace/evals/cases.yaml +44 -0
  579. package/skills/google-workspace/references/api-recipes.md +148 -0
  580. package/skills/google-workspace/references/auth-setup.md +100 -0
  581. package/skills/google-workspace/scripts/verify.sh +128 -0
  582. package/skills/grants/SKILL.md +171 -0
  583. package/skills/grants/evals/README.md +3 -0
  584. package/skills/grants/evals/cases.yaml +69 -0
  585. package/skills/grants/references/budget-justification.md +71 -0
  586. package/skills/grants/references/jurisdictions.md +35 -0
  587. package/skills/grants/references/logic-model.md +66 -0
  588. package/skills/grants/scripts/verify.sh +193 -0
  589. package/skills/harness/SKILL.md +329 -0
  590. package/skills/harness/assets/_TEMPLATE/.env.example +8 -0
  591. package/skills/harness/assets/_TEMPLATE/CREDENTIALS.md +25 -0
  592. package/skills/harness/assets/_TEMPLATE/README.md +25 -0
  593. package/skills/harness/assets/_TEMPLATE/test_connection.sh +30 -0
  594. package/skills/harness/evals/README.md +54 -0
  595. package/skills/harness/evals/cases.yaml +72 -0
  596. package/skills/harness/examples/audit-example.md +120 -0
  597. package/skills/harness/references/agents-md-template.md +41 -0
  598. package/skills/harness/references/audit-report-template.html +140 -0
  599. package/skills/harness/references/audit-report-template.md +116 -0
  600. package/skills/harness/references/claude-md-template.md +98 -0
  601. package/skills/harness/references/inbox-readme-template.md +51 -0
  602. package/skills/harness/references/ingest-formats.md +185 -0
  603. package/skills/harness/references/providers.yaml +3410 -0
  604. package/skills/harness/references/tools-readme-template.md +88 -0
  605. package/skills/harness/references/wiki-archive-template.html +81 -0
  606. package/skills/harness/references/wiki-article-template.md +20 -0
  607. package/skills/harness/references/wiki-dashboard-template.html +136 -0
  608. package/skills/harness/references/wiki-deep-improve-report-template.html +126 -0
  609. package/skills/harness/references/wiki-gaps-template.md +18 -0
  610. package/skills/harness/references/wiki-index-template.md +23 -0
  611. package/skills/harness/references/wiki-protocol.md +699 -0
  612. package/skills/harness/references/wiki-raw-template.md +7 -0
  613. package/skills/hetzner/SKILL.md +221 -0
  614. package/skills/hetzner/evals/README.md +35 -0
  615. package/skills/hetzner/evals/cases.yaml +46 -0
  616. package/skills/hetzner/references/cloud-init.md +120 -0
  617. package/skills/hetzner/references/plans-and-locations.md +56 -0
  618. package/skills/hetzner/scripts/verify.sh +122 -0
  619. package/skills/hiring/SKILL.md +248 -0
  620. package/skills/hiring/evals/README.md +13 -0
  621. package/skills/hiring/evals/cases.yaml +41 -0
  622. package/skills/hiring/references/templates.md +118 -0
  623. package/skills/htmx/SKILL.md +261 -0
  624. package/skills/htmx/evals/README.md +3 -0
  625. package/skills/htmx/evals/cases.yaml +38 -0
  626. package/skills/htmx/references/patterns.md +113 -0
  627. package/skills/htmx/references/server-contract.md +91 -0
  628. package/skills/htmx/scripts/verify.sh +93 -0
  629. package/skills/huggingface/SKILL.md +190 -0
  630. package/skills/huggingface/evals/README.md +11 -0
  631. package/skills/huggingface/evals/cases.yaml +41 -0
  632. package/skills/huggingface/references/endpoints-and-spaces.md +99 -0
  633. package/skills/huggingface/references/hub-and-cli.md +85 -0
  634. package/skills/huggingface/references/inference-providers.md +115 -0
  635. package/skills/huggingface/scripts/verify.sh +123 -0
  636. package/skills/implement/SKILL.md +283 -0
  637. package/skills/implement/evals/README.md +56 -0
  638. package/skills/implement/evals/cases.yaml +43 -0
  639. package/skills/init/SKILL.md +184 -0
  640. package/skills/init/evals/README.md +49 -0
  641. package/skills/init/evals/cases.yaml +74 -0
  642. package/skills/init/references/accompaniment-and-profile.md +140 -0
  643. package/skills/init/references/discovery.md +90 -0
  644. package/skills/init/references/recommend-skills.md +115 -0
  645. package/skills/init/scripts/verify.sh +122 -0
  646. package/skills/instagram-api/SKILL.md +241 -0
  647. package/skills/instagram-api/evals/README.md +3 -0
  648. package/skills/instagram-api/evals/cases.yaml +43 -0
  649. package/skills/instagram-api/references/insights-metrics.md +88 -0
  650. package/skills/instagram-api/references/publish-reel.md +98 -0
  651. package/skills/instagram-api/scripts/verify.sh +137 -0
  652. package/skills/inventory/SKILL.md +131 -0
  653. package/skills/inventory/evals/README.md +3 -0
  654. package/skills/inventory/evals/cases.yaml +43 -0
  655. package/skills/inventory/references/abc-xyz.md +52 -0
  656. package/skills/inventory/references/ddmrp.md +32 -0
  657. package/skills/inventory/references/reorder-policies.md +85 -0
  658. package/skills/inventory/references/safety-stock.md +63 -0
  659. package/skills/inventory/scripts/verify.sh +155 -0
  660. package/skills/investor-materials/SKILL.md +175 -0
  661. package/skills/investor-materials/evals/README.md +15 -0
  662. package/skills/investor-materials/evals/cases.yaml +60 -0
  663. package/skills/investor-materials/references/dataroom-checklist.md +134 -0
  664. package/skills/investor-materials/references/update-and-onepager-templates.md +152 -0
  665. package/skills/investor-materials/scripts/verify.sh +148 -0
  666. package/skills/invoicing/SKILL.md +154 -0
  667. package/skills/invoicing/evals/README.md +5 -0
  668. package/skills/invoicing/evals/cases.yaml +49 -0
  669. package/skills/invoicing/references/dunning-ladder.md +53 -0
  670. package/skills/invoicing/references/e-invoicing-mandates.md +43 -0
  671. package/skills/invoicing/scripts/fixtures/broken-invoice.json +13 -0
  672. package/skills/invoicing/scripts/fixtures/valid-invoice.json +15 -0
  673. package/skills/invoicing/scripts/verify.sh +133 -0
  674. package/skills/ip-trademark/SKILL.md +186 -0
  675. package/skills/ip-trademark/evals/README.md +10 -0
  676. package/skills/ip-trademark/evals/cases.yaml +47 -0
  677. package/skills/ip-trademark/references/jurisdictions.md +63 -0
  678. package/skills/ip-trademark/references/ownership-and-licensing.md +90 -0
  679. package/skills/java/SKILL.md +341 -0
  680. package/skills/java/evals/README.md +23 -0
  681. package/skills/java/evals/cases.yaml +43 -0
  682. package/skills/java/references/builds.md +133 -0
  683. package/skills/java/references/concurrency.md +108 -0
  684. package/skills/java/references/streams.md +102 -0
  685. package/skills/java/scripts/verify.sh +107 -0
  686. package/skills/knowledge-ops/SKILL.md +125 -0
  687. package/skills/knowledge-ops/evals/README.md +16 -0
  688. package/skills/knowledge-ops/evals/cases.yaml +50 -0
  689. package/skills/knowledge-ops/references/gardening-playbook.md +116 -0
  690. package/skills/kotlin-android/SKILL.md +245 -0
  691. package/skills/kotlin-android/evals/README.md +13 -0
  692. package/skills/kotlin-android/evals/cases.yaml +56 -0
  693. package/skills/kotlin-android/references/architecture.md +200 -0
  694. package/skills/kotlin-android/references/gradle-setup.md +125 -0
  695. package/skills/kotlin-android/scripts/verify.sh +109 -0
  696. package/skills/kpi-framework/SKILL.md +199 -0
  697. package/skills/kpi-framework/evals/README.md +11 -0
  698. package/skills/kpi-framework/evals/cases.yaml +42 -0
  699. package/skills/kpi-framework/references/definition-and-targets.md +64 -0
  700. package/skills/kpi-framework/references/metric-catalog.md +84 -0
  701. package/skills/landing-copy/SKILL.md +153 -0
  702. package/skills/landing-copy/evals/README.md +18 -0
  703. package/skills/landing-copy/evals/cases.yaml +63 -0
  704. package/skills/landing-copy/references/frameworks.md +61 -0
  705. package/skills/landing-copy/references/page-skeleton.md +92 -0
  706. package/skills/landing-copy/scripts/verify.sh +164 -0
  707. package/skills/laravel/SKILL.md +301 -0
  708. package/skills/laravel/evals/README.md +10 -0
  709. package/skills/laravel/evals/cases.yaml +45 -0
  710. package/skills/laravel/references/eloquent-patterns.md +126 -0
  711. package/skills/laravel/references/queues-and-scheduling.md +153 -0
  712. package/skills/laravel/scripts/verify.sh +128 -0
  713. package/skills/lead-gen/SKILL.md +155 -0
  714. package/skills/lead-gen/evals/README.md +3 -0
  715. package/skills/lead-gen/evals/cases.yaml +43 -0
  716. package/skills/lead-gen/references/data-sources.md +87 -0
  717. package/skills/lead-gen/references/scoring-model.md +93 -0
  718. package/skills/lead-gen/scripts/verify.sh +179 -0
  719. package/skills/linkedin-api/SKILL.md +211 -0
  720. package/skills/linkedin-api/evals/README.md +3 -0
  721. package/skills/linkedin-api/evals/cases.yaml +41 -0
  722. package/skills/linkedin-api/references/api-reference.md +168 -0
  723. package/skills/linkedin-api/scripts/verify.sh +98 -0
  724. package/skills/linkedin-carousels/SKILL.md +239 -0
  725. package/skills/linkedin-carousels/evals/README.md +13 -0
  726. package/skills/linkedin-carousels/evals/cases.yaml +62 -0
  727. package/skills/linkedin-carousels/references/carousel-patterns.md +200 -0
  728. package/skills/linkedin-carousels/scripts/verify.sh +160 -0
  729. package/skills/linkedin-content/SKILL.md +162 -0
  730. package/skills/linkedin-content/evals/README.md +13 -0
  731. package/skills/linkedin-content/evals/cases.yaml +62 -0
  732. package/skills/linkedin-content/references/hooks-and-formats.md +114 -0
  733. package/skills/linkedin-content/scripts/verify.sh +154 -0
  734. package/skills/linkedin-outreach/SKILL.md +174 -0
  735. package/skills/linkedin-outreach/evals/README.md +3 -0
  736. package/skills/linkedin-outreach/evals/cases.yaml +43 -0
  737. package/skills/linkedin-outreach/references/ledger-schema.md +48 -0
  738. package/skills/linkedin-outreach/references/sales-navigator-playbook.md +61 -0
  739. package/skills/linkedin-outreach/scripts/verify.sh +120 -0
  740. package/skills/linkedin-strategy/SKILL.md +167 -0
  741. package/skills/linkedin-strategy/evals/README.md +3 -0
  742. package/skills/linkedin-strategy/evals/cases.yaml +49 -0
  743. package/skills/linkedin-strategy/references/ssi-and-pillars.md +59 -0
  744. package/skills/linkedin-strategy/references/wiki-records.md +62 -0
  745. package/skills/linkedin-strategy/scripts/verify.sh +120 -0
  746. package/skills/llm-pipeline/SKILL.md +155 -0
  747. package/skills/llm-pipeline/evals/README.md +3 -0
  748. package/skills/llm-pipeline/evals/cases.yaml +44 -0
  749. package/skills/llm-pipeline/references/caching-layers.md +60 -0
  750. package/skills/llm-pipeline/references/litellm-router.md +101 -0
  751. package/skills/llm-pipeline/scripts/verify.sh +169 -0
  752. package/skills/logistics-ops/SKILL.md +219 -0
  753. package/skills/logistics-ops/evals/README.md +20 -0
  754. package/skills/logistics-ops/evals/cases.yaml +48 -0
  755. package/skills/logistics-ops/references/carriers-and-claims.md +105 -0
  756. package/skills/market-research/SKILL.md +145 -0
  757. package/skills/market-research/evals/README.md +3 -0
  758. package/skills/market-research/evals/cases.yaml +48 -0
  759. package/skills/market-research/references/demand-signals.md +63 -0
  760. package/skills/market-research/references/sizing-playbook.md +121 -0
  761. package/skills/market-research/scripts/verify.sh +215 -0
  762. package/skills/marketing/SKILL.md +233 -0
  763. package/skills/marketing/evals/README.md +61 -0
  764. package/skills/marketing/evals/cases.yaml +84 -0
  765. package/skills/marketing/references/brand-grounding.md +197 -0
  766. package/skills/marketing/references/campaigns-and-channels.md +151 -0
  767. package/skills/marketing/references/copy-frameworks.md +166 -0
  768. package/skills/marketing/references/landing-copy.md +191 -0
  769. package/skills/marketing/references/seo-geo.md +391 -0
  770. package/skills/marketing/scripts/seo_audit.py +166 -0
  771. package/skills/marketing/scripts/verify.sh +233 -0
  772. package/skills/medium-publishing/SKILL.md +152 -0
  773. package/skills/medium-publishing/evals/README.md +3 -0
  774. package/skills/medium-publishing/evals/cases.yaml +42 -0
  775. package/skills/medium-publishing/references/cross-post-and-canonical.md +65 -0
  776. package/skills/medium-publishing/references/legacy-api.md +100 -0
  777. package/skills/medium-strategy/SKILL.md +161 -0
  778. package/skills/medium-strategy/evals/README.md +3 -0
  779. package/skills/medium-strategy/evals/cases.yaml +50 -0
  780. package/skills/medium-strategy/references/distribution-and-boost.md +65 -0
  781. package/skills/medium-strategy/references/wiki-records.md +60 -0
  782. package/skills/medium-strategy/scripts/verify.sh +118 -0
  783. package/skills/medium-writing/SKILL.md +140 -0
  784. package/skills/medium-writing/evals/README.md +5 -0
  785. package/skills/medium-writing/evals/cases.yaml +39 -0
  786. package/skills/medium-writing/references/title-patterns.md +79 -0
  787. package/skills/meeting-notes/SKILL.md +168 -0
  788. package/skills/meeting-notes/evals/README.md +14 -0
  789. package/skills/meeting-notes/evals/cases.yaml +46 -0
  790. package/skills/meeting-notes/references/templates.md +140 -0
  791. package/skills/modal/SKILL.md +307 -0
  792. package/skills/modal/evals/README.md +29 -0
  793. package/skills/modal/evals/cases.yaml +50 -0
  794. package/skills/modal/references/images-gpu-cookbook.md +160 -0
  795. package/skills/modal/references/web-and-scaling.md +138 -0
  796. package/skills/modal/scripts/verify.sh +127 -0
  797. package/skills/mongodb/SKILL.md +342 -0
  798. package/skills/mongodb/evals/README.md +29 -0
  799. package/skills/mongodb/evals/cases.yaml +41 -0
  800. package/skills/mongodb/references/aggregation.md +115 -0
  801. package/skills/mongodb/references/data-modeling.md +135 -0
  802. package/skills/mongodb/references/transactions-and-ops.md +128 -0
  803. package/skills/mongodb/scripts/verify.sh +151 -0
  804. package/skills/monitoring/SKILL.md +155 -0
  805. package/skills/monitoring/evals/README.md +3 -0
  806. package/skills/monitoring/evals/cases.yaml +47 -0
  807. package/skills/monitoring/references/burn-rate-and-oncall.md +128 -0
  808. package/skills/monitoring/references/tool-setup.md +154 -0
  809. package/skills/monitoring/scripts/verify.sh +145 -0
  810. package/skills/mysql/SKILL.md +249 -0
  811. package/skills/mysql/evals/README.md +12 -0
  812. package/skills/mysql/evals/cases.yaml +49 -0
  813. package/skills/mysql/references/indexing-and-explain.md +161 -0
  814. package/skills/mysql/references/mysql-vs-mariadb.md +78 -0
  815. package/skills/mysql/references/online-ddl-and-migrations.md +120 -0
  816. package/skills/mysql/references/replication-and-ha.md +115 -0
  817. package/skills/mysql/scripts/verify.sh +141 -0
  818. package/skills/neon/SKILL.md +218 -0
  819. package/skills/neon/evals/README.md +11 -0
  820. package/skills/neon/evals/cases.yaml +45 -0
  821. package/skills/neon/references/branching-ci.md +86 -0
  822. package/skills/neon/scripts/verify.sh +78 -0
  823. package/skills/nestjs/SKILL.md +225 -0
  824. package/skills/nestjs/evals/README.md +3 -0
  825. package/skills/nestjs/evals/cases.yaml +38 -0
  826. package/skills/nestjs/references/cross-cutting.md +135 -0
  827. package/skills/nestjs/references/testing-recipes.md +105 -0
  828. package/skills/nestjs/scripts/verify.sh +98 -0
  829. package/skills/netlify/SKILL.md +208 -0
  830. package/skills/netlify/evals/README.md +13 -0
  831. package/skills/netlify/evals/cases.yaml +43 -0
  832. package/skills/netlify/references/functions.md +97 -0
  833. package/skills/netlify/references/netlify-toml.md +115 -0
  834. package/skills/netlify/scripts/verify.sh +95 -0
  835. package/skills/newsletter/SKILL.md +162 -0
  836. package/skills/newsletter/evals/README.md +12 -0
  837. package/skills/newsletter/evals/cases.yaml +42 -0
  838. package/skills/newsletter/references/growth-loops.md +73 -0
  839. package/skills/newsletter/references/welcome-sequence.md +62 -0
  840. package/skills/newsletter/scripts/verify.sh +173 -0
  841. package/skills/nextjs/SKILL.md +472 -0
  842. package/skills/nextjs/evals/README.md +59 -0
  843. package/skills/nextjs/evals/cases.yaml +56 -0
  844. package/skills/nextjs/references/data-and-caching.md +309 -0
  845. package/skills/nextjs/references/metadata.md +208 -0
  846. package/skills/nextjs/references/performance.md +325 -0
  847. package/skills/nextjs/references/react.md +383 -0
  848. package/skills/nextjs/references/security.md +239 -0
  849. package/skills/nextjs/references/testing.md +290 -0
  850. package/skills/nextjs/scripts/verify.sh +141 -0
  851. package/skills/no-code-app/SKILL.md +153 -0
  852. package/skills/no-code-app/evals/README.md +3 -0
  853. package/skills/no-code-app/evals/cases.yaml +43 -0
  854. package/skills/no-code-app/references/platform-limits.md +100 -0
  855. package/skills/nodejs/SKILL.md +242 -0
  856. package/skills/nodejs/evals/README.md +3 -0
  857. package/skills/nodejs/evals/cases.yaml +39 -0
  858. package/skills/nodejs/references/express5-migration.md +53 -0
  859. package/skills/nodejs/references/graceful-shutdown.md +73 -0
  860. package/skills/nodejs/scripts/verify.sh +122 -0
  861. package/skills/notion-connector/SKILL.md +234 -0
  862. package/skills/notion-connector/evals/README.md +15 -0
  863. package/skills/notion-connector/evals/cases.yaml +45 -0
  864. package/skills/notion-connector/references/api-versions.md +63 -0
  865. package/skills/notion-connector/references/property-shapes.md +110 -0
  866. package/skills/notion-connector/references/sync-patterns.md +95 -0
  867. package/skills/notion-connector/scripts/verify.sh +162 -0
  868. package/skills/observability/SKILL.md +231 -0
  869. package/skills/observability/evals/README.md +3 -0
  870. package/skills/observability/evals/cases.yaml +49 -0
  871. package/skills/observability/references/collector-config.md +98 -0
  872. package/skills/observability/references/instrumentation-recipes.md +115 -0
  873. package/skills/observability/scripts/verify.sh +156 -0
  874. package/skills/ollama/SKILL.md +213 -0
  875. package/skills/ollama/evals/README.md +9 -0
  876. package/skills/ollama/evals/cases.yaml +43 -0
  877. package/skills/ollama/references/api.md +148 -0
  878. package/skills/ollama/references/hardware-sizing.md +87 -0
  879. package/skills/ollama/scripts/verify.sh +116 -0
  880. package/skills/orient/SKILL.md +54 -0
  881. package/skills/orient/evals/README.md +16 -0
  882. package/skills/orient/evals/cases.yaml +57 -0
  883. package/skills/orient/references/orientation-contract.md +34 -0
  884. package/skills/parallel/SKILL.md +198 -0
  885. package/skills/parallel/evals/README.md +62 -0
  886. package/skills/parallel/evals/cases.yaml +44 -0
  887. package/skills/people-ops/SKILL.md +122 -0
  888. package/skills/people-ops/evals/README.md +14 -0
  889. package/skills/people-ops/evals/cases.yaml +43 -0
  890. package/skills/people-ops/references/templates.md +129 -0
  891. package/skills/performance/SKILL.md +221 -0
  892. package/skills/performance/evals/README.md +3 -0
  893. package/skills/performance/evals/cases.yaml +47 -0
  894. package/skills/performance/references/profiling-playbook.md +54 -0
  895. package/skills/performance/scripts/verify.sh +94 -0
  896. package/skills/phoenix/SKILL.md +169 -0
  897. package/skills/phoenix/evals/README.md +3 -0
  898. package/skills/phoenix/evals/cases.yaml +40 -0
  899. package/skills/phoenix/references/auth-and-scopes.md +82 -0
  900. package/skills/phoenix/references/ecto-patterns.md +93 -0
  901. package/skills/phoenix/references/liveview.md +134 -0
  902. package/skills/phoenix/scripts/verify.sh +73 -0
  903. package/skills/php/SKILL.md +397 -0
  904. package/skills/php/evals/README.md +12 -0
  905. package/skills/php/evals/cases.yaml +45 -0
  906. package/skills/php/references/tooling.md +170 -0
  907. package/skills/php/references/type-system.md +220 -0
  908. package/skills/php/scripts/verify.sh +155 -0
  909. package/skills/pitch-deck/SKILL.md +209 -0
  910. package/skills/pitch-deck/evals/README.md +15 -0
  911. package/skills/pitch-deck/evals/cases.yaml +55 -0
  912. package/skills/pitch-deck/references/numbers-that-matter.md +78 -0
  913. package/skills/pitch-deck/references/slide-spine.md +149 -0
  914. package/skills/pitch-deck/scripts/verify.sh +186 -0
  915. package/skills/plan/SKILL.md +204 -0
  916. package/skills/plan/evals/README.md +62 -0
  917. package/skills/plan/evals/cases.yaml +49 -0
  918. package/skills/plan/references/plan-template.md +124 -0
  919. package/skills/planetscale/SKILL.md +223 -0
  920. package/skills/planetscale/evals/README.md +11 -0
  921. package/skills/planetscale/evals/cases.yaml +46 -0
  922. package/skills/planetscale/references/deploy-requests.md +75 -0
  923. package/skills/planetscale/references/no-foreign-keys.md +88 -0
  924. package/skills/planetscale/scripts/verify.sh +115 -0
  925. package/skills/podcast/SKILL.md +166 -0
  926. package/skills/podcast/evals/README.md +17 -0
  927. package/skills/podcast/evals/cases.yaml +61 -0
  928. package/skills/podcast/references/rss-and-namespace.md +136 -0
  929. package/skills/podcast/scripts/verify.sh +246 -0
  930. package/skills/postgresdb/SKILL.md +372 -0
  931. package/skills/postgresdb/evals/README.md +55 -0
  932. package/skills/postgresdb/evals/cases.yaml +57 -0
  933. package/skills/postgresdb/references/migrations.md +279 -0
  934. package/skills/postgresdb/references/operations-and-security.md +267 -0
  935. package/skills/postgresdb/references/query-optimization.md +374 -0
  936. package/skills/postgresdb/references/schema-and-indexing.md +379 -0
  937. package/skills/postgresdb/scripts/verify.sh +191 -0
  938. package/skills/presentations/SKILL.md +296 -0
  939. package/skills/presentations/evals/README.md +61 -0
  940. package/skills/presentations/evals/cases.yaml +56 -0
  941. package/skills/presentations/references/brand-grounding.md +160 -0
  942. package/skills/presentations/references/markdown-decks.md +290 -0
  943. package/skills/presentations/references/pptx-python.md +242 -0
  944. package/skills/presentations/references/slide-design.md +261 -0
  945. package/skills/presentations/references/storytelling-and-decks.md +150 -0
  946. package/skills/presentations/scripts/verify.sh +252 -0
  947. package/skills/press-kit/SKILL.md +243 -0
  948. package/skills/press-kit/evals/README.md +15 -0
  949. package/skills/press-kit/evals/cases.yaml +55 -0
  950. package/skills/press-kit/references/release-types.md +102 -0
  951. package/skills/press-kit/references/templates.md +132 -0
  952. package/skills/press-kit/scripts/verify.sh +161 -0
  953. package/skills/pricing/SKILL.md +160 -0
  954. package/skills/pricing/evals/README.md +5 -0
  955. package/skills/pricing/evals/cases.yaml +44 -0
  956. package/skills/pricing/references/localization.md +56 -0
  957. package/skills/pricing/references/pricing-models.md +55 -0
  958. package/skills/pricing/scripts/verify.sh +91 -0
  959. package/skills/prisma-orm/SKILL.md +320 -0
  960. package/skills/prisma-orm/evals/README.md +12 -0
  961. package/skills/prisma-orm/evals/cases.yaml +56 -0
  962. package/skills/prisma-orm/references/migrations-and-v7-upgrade.md +197 -0
  963. package/skills/prisma-orm/references/queries-and-performance.md +169 -0
  964. package/skills/prisma-orm/scripts/verify.sh +137 -0
  965. package/skills/procurement/SKILL.md +179 -0
  966. package/skills/procurement/evals/README.md +20 -0
  967. package/skills/procurement/evals/cases.yaml +49 -0
  968. package/skills/procurement/references/scorecard-and-tco.md +100 -0
  969. package/skills/procurement/references/sourcing-requests.md +116 -0
  970. package/skills/procurement/scripts/verify.sh +280 -0
  971. package/skills/project-ops/SKILL.md +130 -0
  972. package/skills/project-ops/evals/README.md +3 -0
  973. package/skills/project-ops/evals/cases.yaml +71 -0
  974. package/skills/project-ops/references/raid-and-rag.md +58 -0
  975. package/skills/project-ops/references/status-report-template.md +68 -0
  976. package/skills/project-ops/scripts/verify.sh +257 -0
  977. package/skills/prompt-engineering/SKILL.md +138 -0
  978. package/skills/prompt-engineering/evals/README.md +11 -0
  979. package/skills/prompt-engineering/evals/cases.yaml +46 -0
  980. package/skills/prompt-engineering/references/eval-templates.md +94 -0
  981. package/skills/prompt-engineering/references/output-contracts.md +120 -0
  982. package/skills/prompt-engineering/scripts/verify.sh +84 -0
  983. package/skills/proposals/SKILL.md +159 -0
  984. package/skills/proposals/evals/README.md +3 -0
  985. package/skills/proposals/evals/cases.yaml +53 -0
  986. package/skills/proposals/references/proposal-skeleton.md +110 -0
  987. package/skills/proposals/references/sow-skeleton.md +79 -0
  988. package/skills/proposals/scripts/verify.sh +201 -0
  989. package/skills/python/SKILL.md +369 -0
  990. package/skills/python/evals/README.md +19 -0
  991. package/skills/python/evals/cases.yaml +46 -0
  992. package/skills/python/references/async.md +136 -0
  993. package/skills/python/references/stdlib.md +162 -0
  994. package/skills/python/references/typing.md +160 -0
  995. package/skills/python/scripts/verify.sh +125 -0
  996. package/skills/rag/SKILL.md +226 -0
  997. package/skills/rag/evals/README.md +13 -0
  998. package/skills/rag/evals/cases.yaml +45 -0
  999. package/skills/rag/references/evaluation.md +99 -0
  1000. package/skills/rag/references/pipeline.md +151 -0
  1001. package/skills/rag/scripts/verify.sh +99 -0
  1002. package/skills/rails/SKILL.md +264 -0
  1003. package/skills/rails/evals/README.md +12 -0
  1004. package/skills/rails/evals/cases.yaml +47 -0
  1005. package/skills/rails/references/activerecord.md +148 -0
  1006. package/skills/rails/references/hotwire.md +139 -0
  1007. package/skills/rails/references/testing.md +110 -0
  1008. package/skills/rails/scripts/verify.sh +128 -0
  1009. package/skills/railway/SKILL.md +245 -0
  1010. package/skills/railway/evals/README.md +14 -0
  1011. package/skills/railway/evals/cases.yaml +44 -0
  1012. package/skills/railway/references/cli-cookbook.md +137 -0
  1013. package/skills/railway/references/config-as-code.md +120 -0
  1014. package/skills/railway/scripts/verify.sh +162 -0
  1015. package/skills/react/SKILL.md +222 -0
  1016. package/skills/react/evals/README.md +3 -0
  1017. package/skills/react/evals/cases.yaml +43 -0
  1018. package/skills/react/references/data-and-state.md +152 -0
  1019. package/skills/react/references/performance.md +75 -0
  1020. package/skills/react/references/routing.md +99 -0
  1021. package/skills/react/scripts/verify.sh +123 -0
  1022. package/skills/react-native/SKILL.md +220 -0
  1023. package/skills/react-native/evals/README.md +3 -0
  1024. package/skills/react-native/evals/cases.yaml +42 -0
  1025. package/skills/react-native/references/native-modules.md +123 -0
  1026. package/skills/react-native/references/performance-debugging.md +46 -0
  1027. package/skills/react-native/scripts/verify.sh +117 -0
  1028. package/skills/redis/SKILL.md +298 -0
  1029. package/skills/redis/evals/README.md +10 -0
  1030. package/skills/redis/evals/cases.yaml +43 -0
  1031. package/skills/redis/references/caching.md +116 -0
  1032. package/skills/redis/references/locks-and-rate-limiting.md +140 -0
  1033. package/skills/redis/references/queues.md +102 -0
  1034. package/skills/redis/scripts/verify.sh +164 -0
  1035. package/skills/remotion-video/SKILL.md +218 -0
  1036. package/skills/remotion-video/evals/README.md +23 -0
  1037. package/skills/remotion-video/evals/cases.yaml +64 -0
  1038. package/skills/remotion-video/references/captions-pipeline.md +163 -0
  1039. package/skills/remotion-video/references/render-and-pipeline.md +131 -0
  1040. package/skills/remotion-video/scripts/verify.sh +169 -0
  1041. package/skills/render/SKILL.md +256 -0
  1042. package/skills/render/evals/README.md +12 -0
  1043. package/skills/render/evals/cases.yaml +45 -0
  1044. package/skills/render/references/blueprint-reference.md +203 -0
  1045. package/skills/render/scripts/verify.sh +167 -0
  1046. package/skills/replicate/SKILL.md +210 -0
  1047. package/skills/replicate/evals/README.md +9 -0
  1048. package/skills/replicate/evals/cases.yaml +45 -0
  1049. package/skills/replicate/references/cog-packaging.md +89 -0
  1050. package/skills/replicate/references/deployments-api.md +87 -0
  1051. package/skills/replicate/references/webhooks-and-async.md +110 -0
  1052. package/skills/replicate/scripts/verify.sh +162 -0
  1053. package/skills/replicate-images/SKILL.md +241 -0
  1054. package/skills/replicate-images/evals/README.md +13 -0
  1055. package/skills/replicate-images/evals/cases.yaml +41 -0
  1056. package/skills/replicate-images/references/editing-recipes.md +129 -0
  1057. package/skills/replicate-images/references/models.md +131 -0
  1058. package/skills/replicate-images/scripts/verify.sh +178 -0
  1059. package/skills/reporting/SKILL.md +178 -0
  1060. package/skills/reporting/evals/README.md +12 -0
  1061. package/skills/reporting/evals/cases.yaml +46 -0
  1062. package/skills/reporting/references/pipeline.md +213 -0
  1063. package/skills/reporting/scripts/verify.sh +149 -0
  1064. package/skills/research-ops/SKILL.md +200 -0
  1065. package/skills/research-ops/evals/README.md +13 -0
  1066. package/skills/research-ops/evals/cases.yaml +38 -0
  1067. package/skills/research-ops/references/credibility-rubric.md +78 -0
  1068. package/skills/research-ops/references/memo-template.md +63 -0
  1069. package/skills/research-ops/scripts/verify.sh +181 -0
  1070. package/skills/retention/SKILL.md +206 -0
  1071. package/skills/retention/evals/README.md +13 -0
  1072. package/skills/retention/evals/cases.yaml +42 -0
  1073. package/skills/retention/references/health-score-and-metrics.md +97 -0
  1074. package/skills/retention/references/save-and-winback-plays.md +65 -0
  1075. package/skills/review/SKILL.md +222 -0
  1076. package/skills/review/evals/README.md +84 -0
  1077. package/skills/review/evals/cases.yaml +55 -0
  1078. package/skills/review-management/SKILL.md +204 -0
  1079. package/skills/review-management/evals/README.md +13 -0
  1080. package/skills/review-management/evals/cases.yaml +60 -0
  1081. package/skills/review-management/references/platform-apis.md +86 -0
  1082. package/skills/review-management/scripts/verify.sh +128 -0
  1083. package/skills/ruby/SKILL.md +316 -0
  1084. package/skills/ruby/evals/README.md +12 -0
  1085. package/skills/ruby/evals/cases.yaml +41 -0
  1086. package/skills/ruby/references/gems-and-testing.md +208 -0
  1087. package/skills/ruby/references/metaprogramming.md +161 -0
  1088. package/skills/ruby/scripts/verify.sh +83 -0
  1089. package/skills/runpod/SKILL.md +238 -0
  1090. package/skills/runpod/evals/README.md +11 -0
  1091. package/skills/runpod/evals/cases.yaml +47 -0
  1092. package/skills/runpod/references/cost-and-scaling.md +85 -0
  1093. package/skills/runpod/references/serverless-workers.md +101 -0
  1094. package/skills/runpod/scripts/verify.sh +126 -0
  1095. package/skills/rust/SKILL.md +395 -0
  1096. package/skills/rust/evals/README.md +12 -0
  1097. package/skills/rust/evals/cases.yaml +42 -0
  1098. package/skills/rust/references/async-tokio.md +141 -0
  1099. package/skills/rust/references/axum-service.md +132 -0
  1100. package/skills/rust/references/ownership.md +86 -0
  1101. package/skills/rust/references/testing.md +108 -0
  1102. package/skills/rust/scripts/verify.sh +91 -0
  1103. package/skills/sales-pipeline/SKILL.md +162 -0
  1104. package/skills/sales-pipeline/evals/README.md +13 -0
  1105. package/skills/sales-pipeline/evals/cases.yaml +60 -0
  1106. package/skills/sales-pipeline/references/forecasting-math.md +82 -0
  1107. package/skills/sales-pipeline/references/stage-playbook.md +84 -0
  1108. package/skills/sales-pipeline/scripts/verify.sh +210 -0
  1109. package/skills/scaling/SKILL.md +137 -0
  1110. package/skills/scaling/evals/README.md +3 -0
  1111. package/skills/scaling/evals/cases.yaml +42 -0
  1112. package/skills/scaling/references/load-testing-k6.md +127 -0
  1113. package/skills/scaling/scripts/example.load.js +24 -0
  1114. package/skills/scaling/scripts/verify.sh +70 -0
  1115. package/skills/sdd/SKILL.md +203 -0
  1116. package/skills/sdd/evals/README.md +60 -0
  1117. package/skills/sdd/evals/cases.yaml +78 -0
  1118. package/skills/sdd-init/SKILL.md +148 -0
  1119. package/skills/sdd-init/evals/README.md +3 -0
  1120. package/skills/sdd-init/evals/cases.yaml +43 -0
  1121. package/skills/secure-coding/SKILL.md +365 -0
  1122. package/skills/secure-coding/evals/README.md +68 -0
  1123. package/skills/secure-coding/evals/cases.yaml +55 -0
  1124. package/skills/secure-coding/references/authn-authz.md +249 -0
  1125. package/skills/secure-coding/references/owasp-by-stack.md +574 -0
  1126. package/skills/secure-coding/references/secrets-and-supply-chain.md +205 -0
  1127. package/skills/secure-coding/references/threat-modeling.md +213 -0
  1128. package/skills/secure-coding/scripts/verify.sh +208 -0
  1129. package/skills/security-scan/SKILL.md +239 -0
  1130. package/skills/security-scan/evals/README.md +14 -0
  1131. package/skills/security-scan/evals/cases.yaml +50 -0
  1132. package/skills/security-scan/references/tools.md +98 -0
  1133. package/skills/security-scan/references/triage.md +93 -0
  1134. package/skills/security-scan/scripts/verify.sh +108 -0
  1135. package/skills/seo-geo/SKILL.md +192 -0
  1136. package/skills/seo-geo/evals/README.md +14 -0
  1137. package/skills/seo-geo/evals/cases.yaml +45 -0
  1138. package/skills/seo-geo/references/ai-crawler-control.md +104 -0
  1139. package/skills/seo-geo/references/schema-recipes.md +130 -0
  1140. package/skills/seo-geo/scripts/verify.sh +236 -0
  1141. package/skills/ship/SKILL.md +258 -0
  1142. package/skills/ship/evals/README.md +89 -0
  1143. package/skills/ship/evals/cases.yaml +44 -0
  1144. package/skills/shopify/SKILL.md +229 -0
  1145. package/skills/shopify/evals/README.md +14 -0
  1146. package/skills/shopify/evals/cases.yaml +41 -0
  1147. package/skills/shopify/references/apps-graphql.md +103 -0
  1148. package/skills/shopify/references/checkout-extensibility.md +71 -0
  1149. package/skills/shopify/references/liquid-themes.md +89 -0
  1150. package/skills/shopify/scripts/verify.sh +120 -0
  1151. package/skills/shortform-editing/SKILL.md +161 -0
  1152. package/skills/shortform-editing/evals/README.md +16 -0
  1153. package/skills/shortform-editing/evals/cases.yaml +61 -0
  1154. package/skills/shortform-editing/references/captions.md +85 -0
  1155. package/skills/shortform-editing/references/ffmpeg-pipeline.md +126 -0
  1156. package/skills/shortform-editing/scripts/verify.sh +148 -0
  1157. package/skills/shortform-ideation/SKILL.md +153 -0
  1158. package/skills/shortform-ideation/evals/README.md +20 -0
  1159. package/skills/shortform-ideation/evals/cases.yaml +58 -0
  1160. package/skills/shortform-ideation/references/experiment-ledger.md +85 -0
  1161. package/skills/shortform-ideation/references/trend-sources.md +69 -0
  1162. package/skills/shortform-ideation/scripts/verify.sh +172 -0
  1163. package/skills/shortform-packaging/SKILL.md +247 -0
  1164. package/skills/shortform-packaging/evals/README.md +10 -0
  1165. package/skills/shortform-packaging/evals/cases.yaml +48 -0
  1166. package/skills/shortform-packaging/references/package-templates.md +117 -0
  1167. package/skills/shortform-packaging/scripts/verify.sh +210 -0
  1168. package/skills/shortform-strategy/SKILL.md +149 -0
  1169. package/skills/shortform-strategy/evals/README.md +3 -0
  1170. package/skills/shortform-strategy/evals/cases.yaml +52 -0
  1171. package/skills/shortform-strategy/references/learning-loop-template.md +49 -0
  1172. package/skills/shortform-strategy/references/platform-signals-2026.md +46 -0
  1173. package/skills/shortform-strategy/scripts/verify.sh +176 -0
  1174. package/skills/skill-scout/SKILL.md +133 -0
  1175. package/skills/skill-scout/evals/README.md +12 -0
  1176. package/skills/skill-scout/evals/cases.yaml +56 -0
  1177. package/skills/skill-scout/references/install-commands.md +76 -0
  1178. package/skills/skill-scout/scripts/verify.sh +154 -0
  1179. package/skills/social-publisher/SKILL.md +179 -0
  1180. package/skills/social-publisher/evals/README.md +14 -0
  1181. package/skills/social-publisher/evals/cases.yaml +55 -0
  1182. package/skills/social-publisher/references/calendar-schema.md +97 -0
  1183. package/skills/social-publisher/references/platform-limits.md +56 -0
  1184. package/skills/social-publisher/scripts/verify.sh +232 -0
  1185. package/skills/solid-js/SKILL.md +260 -0
  1186. package/skills/solid-js/evals/README.md +3 -0
  1187. package/skills/solid-js/evals/cases.yaml +38 -0
  1188. package/skills/solid-js/references/reactivity-deep-dive.md +89 -0
  1189. package/skills/solid-js/references/router-and-start.md +93 -0
  1190. package/skills/solid-js/scripts/verify.sh +130 -0
  1191. package/skills/sop-builder/SKILL.md +233 -0
  1192. package/skills/sop-builder/evals/README.md +14 -0
  1193. package/skills/sop-builder/evals/cases.yaml +48 -0
  1194. package/skills/sop-builder/references/sop-skeleton.md +170 -0
  1195. package/skills/specify/SKILL.md +214 -0
  1196. package/skills/specify/evals/README.md +73 -0
  1197. package/skills/specify/evals/cases.yaml +80 -0
  1198. package/skills/specify/references/eliciting-requirements.md +77 -0
  1199. package/skills/specify/references/spec-template.md +60 -0
  1200. package/skills/spreadsheet-ops/SKILL.md +180 -0
  1201. package/skills/spreadsheet-ops/evals/README.md +33 -0
  1202. package/skills/spreadsheet-ops/evals/cases.yaml +42 -0
  1203. package/skills/spreadsheet-ops/references/formula-cookbook.md +70 -0
  1204. package/skills/spreadsheet-ops/references/python-excel.md +87 -0
  1205. package/skills/spreadsheet-ops/references/sheets-api-appsscript.md +118 -0
  1206. package/skills/spreadsheet-ops/scripts/verify.sh +152 -0
  1207. package/skills/spring-boot/SKILL.md +375 -0
  1208. package/skills/spring-boot/evals/README.md +11 -0
  1209. package/skills/spring-boot/evals/cases.yaml +49 -0
  1210. package/skills/spring-boot/references/jpa.md +94 -0
  1211. package/skills/spring-boot/references/security.md +92 -0
  1212. package/skills/spring-boot/references/testing.md +95 -0
  1213. package/skills/spring-boot/scripts/verify.sh +115 -0
  1214. package/skills/sql/SKILL.md +286 -0
  1215. package/skills/sql/evals/README.md +9 -0
  1216. package/skills/sql/evals/cases.yaml +49 -0
  1217. package/skills/sql/references/ctes-and-recursion.md +63 -0
  1218. package/skills/sql/references/joins-and-sets.md +71 -0
  1219. package/skills/sql/references/portability.md +38 -0
  1220. package/skills/sql/references/window-functions.md +72 -0
  1221. package/skills/sql/scripts/verify.sh +139 -0
  1222. package/skills/sqlite-turso/SKILL.md +214 -0
  1223. package/skills/sqlite-turso/evals/README.md +24 -0
  1224. package/skills/sqlite-turso/evals/cases.yaml +45 -0
  1225. package/skills/sqlite-turso/references/embedded-replicas.md +96 -0
  1226. package/skills/sqlite-turso/scripts/verify.sh +95 -0
  1227. package/skills/stripe/SKILL.md +269 -0
  1228. package/skills/stripe/evals/README.md +11 -0
  1229. package/skills/stripe/evals/cases.yaml +45 -0
  1230. package/skills/stripe/references/going-live.md +64 -0
  1231. package/skills/stripe/references/webhook-events.md +79 -0
  1232. package/skills/stripe/scripts/verify.sh +130 -0
  1233. package/skills/structured-extraction/SKILL.md +230 -0
  1234. package/skills/structured-extraction/evals/README.md +13 -0
  1235. package/skills/structured-extraction/evals/cases.yaml +70 -0
  1236. package/skills/structured-extraction/references/providers.md +152 -0
  1237. package/skills/structured-extraction/scripts/verify.sh +160 -0
  1238. package/skills/suggest/SKILL.md +30 -0
  1239. package/skills/suggest/evals/README.md +14 -0
  1240. package/skills/suggest/evals/cases.yaml +51 -0
  1241. package/skills/supabase/SKILL.md +268 -0
  1242. package/skills/supabase/evals/README.md +12 -0
  1243. package/skills/supabase/evals/cases.yaml +42 -0
  1244. package/skills/supabase/references/auth-ssr.md +173 -0
  1245. package/skills/supabase/references/rls-cookbook.md +122 -0
  1246. package/skills/supabase/scripts/verify.sh +149 -0
  1247. package/skills/svelte/SKILL.md +238 -0
  1248. package/skills/svelte/evals/README.md +3 -0
  1249. package/skills/svelte/evals/cases.yaml +41 -0
  1250. package/skills/svelte/references/runes.md +97 -0
  1251. package/skills/svelte/references/sveltekit-data.md +156 -0
  1252. package/skills/svelte/scripts/verify.sh +128 -0
  1253. package/skills/swift-ios/SKILL.md +217 -0
  1254. package/skills/swift-ios/evals/README.md +3 -0
  1255. package/skills/swift-ios/evals/cases.yaml +46 -0
  1256. package/skills/swift-ios/references/concurrency.md +132 -0
  1257. package/skills/swift-ios/references/testing.md +112 -0
  1258. package/skills/swift-ios/scripts/verify.sh +98 -0
  1259. package/skills/tasks/SKILL.md +260 -0
  1260. package/skills/tasks/evals/README.md +70 -0
  1261. package/skills/tasks/evals/cases.yaml +75 -0
  1262. package/skills/tauri/SKILL.md +224 -0
  1263. package/skills/tauri/evals/README.md +12 -0
  1264. package/skills/tauri/evals/cases.yaml +46 -0
  1265. package/skills/tauri/references/bundling-distribution.md +129 -0
  1266. package/skills/tauri/references/security.md +143 -0
  1267. package/skills/tauri/scripts/verify.sh +178 -0
  1268. package/skills/technical-writing/SKILL.md +230 -0
  1269. package/skills/technical-writing/evals/README.md +12 -0
  1270. package/skills/technical-writing/evals/cases.yaml +53 -0
  1271. package/skills/technical-writing/references/diataxis-modes.md +131 -0
  1272. package/skills/technical-writing/references/vale-starter.md +90 -0
  1273. package/skills/technical-writing/scripts/verify.sh +83 -0
  1274. package/skills/terms-conditions/SKILL.md +147 -0
  1275. package/skills/terms-conditions/evals/README.md +14 -0
  1276. package/skills/terms-conditions/evals/cases.yaml +48 -0
  1277. package/skills/terms-conditions/references/clause-library.md +158 -0
  1278. package/skills/terms-conditions/references/notices-and-aup.md +125 -0
  1279. package/skills/terms-conditions/scripts/verify.sh +92 -0
  1280. package/skills/testing-go/SKILL.md +246 -0
  1281. package/skills/testing-go/evals/README.md +3 -0
  1282. package/skills/testing-go/evals/cases.yaml +44 -0
  1283. package/skills/testing-go/references/coverage-and-benchmarks.md +85 -0
  1284. package/skills/testing-go/references/mocks-and-fakes.md +140 -0
  1285. package/skills/testing-go/references/synctest-and-concurrency.md +82 -0
  1286. package/skills/testing-go/scripts/verify.sh +72 -0
  1287. package/skills/testing-py/SKILL.md +179 -0
  1288. package/skills/testing-py/evals/README.md +5 -0
  1289. package/skills/testing-py/evals/cases.yaml +44 -0
  1290. package/skills/testing-py/references/mocking.md +141 -0
  1291. package/skills/testing-py/references/property-testing.md +99 -0
  1292. package/skills/testing-py/scripts/verify.sh +117 -0
  1293. package/skills/testing-web/SKILL.md +224 -0
  1294. package/skills/testing-web/evals/README.md +11 -0
  1295. package/skills/testing-web/evals/cases.yaml +52 -0
  1296. package/skills/testing-web/references/jest-setup.md +88 -0
  1297. package/skills/testing-web/references/recipes.md +116 -0
  1298. package/skills/testing-web/scripts/verify.sh +111 -0
  1299. package/skills/tiktok-api/SKILL.md +315 -0
  1300. package/skills/tiktok-api/evals/README.md +17 -0
  1301. package/skills/tiktok-api/evals/cases.yaml +51 -0
  1302. package/skills/tiktok-api/references/metrics-and-publish.md +127 -0
  1303. package/skills/tiktok-api/references/oauth-setup.md +105 -0
  1304. package/skills/tiktok-api/references/wiki-schema.md +85 -0
  1305. package/skills/tiktok-api/scripts/verify.sh +96 -0
  1306. package/skills/together-fireworks/SKILL.md +181 -0
  1307. package/skills/together-fireworks/evals/README.md +3 -0
  1308. package/skills/together-fireworks/evals/cases.yaml +50 -0
  1309. package/skills/together-fireworks/references/batch-and-tuning.md +59 -0
  1310. package/skills/together-fireworks/references/models-and-pricing.md +79 -0
  1311. package/skills/together-fireworks/scripts/verify.sh +165 -0
  1312. package/skills/translation-l10n/SKILL.md +229 -0
  1313. package/skills/translation-l10n/evals/README.md +3 -0
  1314. package/skills/translation-l10n/evals/cases.yaml +39 -0
  1315. package/skills/translation-l10n/references/icu-cookbook.md +82 -0
  1316. package/skills/translation-l10n/references/rtl-and-bidi.md +60 -0
  1317. package/skills/typescript/SKILL.md +258 -0
  1318. package/skills/typescript/evals/README.md +15 -0
  1319. package/skills/typescript/evals/cases.yaml +46 -0
  1320. package/skills/typescript/references/build-and-monorepo.md +141 -0
  1321. package/skills/typescript/references/type-system.md +162 -0
  1322. package/skills/typescript/scripts/verify.sh +52 -0
  1323. package/skills/unit-economics/SKILL.md +180 -0
  1324. package/skills/unit-economics/evals/README.md +5 -0
  1325. package/skills/unit-economics/evals/cases.yaml +43 -0
  1326. package/skills/unit-economics/references/formulas.md +144 -0
  1327. package/skills/unit-economics/scripts/verify.sh +179 -0
  1328. package/skills/vector-db/SKILL.md +189 -0
  1329. package/skills/vector-db/evals/README.md +10 -0
  1330. package/skills/vector-db/evals/cases.yaml +45 -0
  1331. package/skills/vector-db/references/engines.md +175 -0
  1332. package/skills/vector-db/references/tuning.md +62 -0
  1333. package/skills/vector-db/scripts/verify.sh +110 -0
  1334. package/skills/vercel/SKILL.md +242 -0
  1335. package/skills/vercel/evals/README.md +23 -0
  1336. package/skills/vercel/evals/cases.yaml +45 -0
  1337. package/skills/vercel/references/cli-cookbook.md +98 -0
  1338. package/skills/vercel/references/vercel-json.md +120 -0
  1339. package/skills/vercel/scripts/verify.sh +168 -0
  1340. package/skills/verify/SKILL.md +188 -0
  1341. package/skills/verify/evals/README.md +78 -0
  1342. package/skills/verify/evals/cases.yaml +74 -0
  1343. package/skills/video-shorts/SKILL.md +163 -0
  1344. package/skills/video-shorts/evals/README.md +15 -0
  1345. package/skills/video-shorts/evals/cases.yaml +56 -0
  1346. package/skills/video-shorts/references/hook-and-script-patterns.md +95 -0
  1347. package/skills/video-shorts/references/specs-and-safe-zones.md +74 -0
  1348. package/skills/video-shorts/scripts/verify.sh +172 -0
  1349. package/skills/vue-nuxt/SKILL.md +384 -0
  1350. package/skills/vue-nuxt/evals/README.md +11 -0
  1351. package/skills/vue-nuxt/evals/cases.yaml +49 -0
  1352. package/skills/vue-nuxt/references/data-and-state.md +127 -0
  1353. package/skills/vue-nuxt/references/migration-nuxt4.md +79 -0
  1354. package/skills/vue-nuxt/references/nitro-and-rendering.md +117 -0
  1355. package/skills/vue-nuxt/references/reactivity.md +135 -0
  1356. package/skills/vue-nuxt/scripts/verify.sh +148 -0
  1357. package/skills/webhooks/SKILL.md +246 -0
  1358. package/skills/webhooks/evals/README.md +15 -0
  1359. package/skills/webhooks/evals/cases.yaml +46 -0
  1360. package/skills/webhooks/references/framework-raw-body.md +97 -0
  1361. package/skills/webhooks/references/signature-schemes.md +66 -0
  1362. package/skills/webhooks/scripts/verify.sh +142 -0
  1363. package/skills/webinar/SKILL.md +196 -0
  1364. package/skills/webinar/evals/README.md +14 -0
  1365. package/skills/webinar/evals/cases.yaml +44 -0
  1366. package/skills/webinar/references/email-cadence.md +75 -0
  1367. package/skills/webinar/references/run-of-show.md +83 -0
  1368. package/skills/whatsapp-telegram/SKILL.md +235 -0
  1369. package/skills/whatsapp-telegram/evals/README.md +11 -0
  1370. package/skills/whatsapp-telegram/evals/cases.yaml +44 -0
  1371. package/skills/whatsapp-telegram/references/telegram-bot-api.md +91 -0
  1372. package/skills/whatsapp-telegram/references/whatsapp-cloud-api.md +103 -0
  1373. package/skills/whatsapp-telegram/scripts/verify.sh +90 -0
  1374. package/skills/wordpress/SKILL.md +224 -0
  1375. package/skills/wordpress/evals/README.md +3 -0
  1376. package/skills/wordpress/evals/cases.yaml +50 -0
  1377. package/skills/wordpress/references/hardening.md +108 -0
  1378. package/skills/wordpress/references/performance.md +80 -0
  1379. package/skills/wordpress/references/woocommerce.md +65 -0
  1380. package/skills/wordpress/scripts/verify.sh +96 -0
  1381. package/skills/worktrees/SKILL.md +199 -0
  1382. package/skills/worktrees/evals/README.md +78 -0
  1383. package/skills/worktrees/evals/cases.yaml +47 -0
  1384. package/skills/youtube-api/SKILL.md +286 -0
  1385. package/skills/youtube-api/evals/README.md +3 -0
  1386. package/skills/youtube-api/evals/cases.yaml +50 -0
  1387. package/skills/youtube-api/references/analytics-queries.md +89 -0
  1388. package/skills/youtube-api/references/oauth-setup.md +55 -0
  1389. package/skills/youtube-api/references/wiki-schema.md +70 -0
  1390. package/skills/youtube-api/scripts/verify.sh +84 -0
  1391. package/skills/youtube-ideation/SKILL.md +234 -0
  1392. package/skills/youtube-ideation/evals/README.md +14 -0
  1393. package/skills/youtube-ideation/evals/cases.yaml +52 -0
  1394. package/skills/youtube-ideation/references/idea-ledger-and-loop.md +89 -0
  1395. package/skills/youtube-ideation/references/research-and-signals.md +92 -0
  1396. package/skills/youtube-ideation/scripts/verify.sh +237 -0
  1397. package/skills/youtube-packaging/SKILL.md +220 -0
  1398. package/skills/youtube-packaging/evals/README.md +16 -0
  1399. package/skills/youtube-packaging/evals/cases.yaml +48 -0
  1400. package/skills/youtube-packaging/references/description-and-chapters.md +135 -0
  1401. package/skills/youtube-packaging/scripts/verify.sh +250 -0
  1402. package/skills/youtube-strategy/SKILL.md +157 -0
  1403. package/skills/youtube-strategy/evals/README.md +5 -0
  1404. package/skills/youtube-strategy/evals/cases.yaml +61 -0
  1405. package/skills/youtube-strategy/references/channel-architecture.md +46 -0
  1406. package/skills/youtube-strategy/references/wiki-records.md +86 -0
  1407. package/skills/youtube-strategy/scripts/verify.sh +118 -0
  1408. package/skills/youtube-thumbnails/SKILL.md +180 -0
  1409. package/skills/youtube-thumbnails/evals/README.md +11 -0
  1410. package/skills/youtube-thumbnails/evals/cases.yaml +48 -0
  1411. package/skills/youtube-thumbnails/references/composition-and-specs.md +69 -0
  1412. package/skills/youtube-thumbnails/references/experiment-log-format.md +65 -0
  1413. package/skills/youtube-thumbnails/scripts/verify.sh +123 -0
  1414. package/targets/claude.js +23 -0
  1415. package/targets/codex.js +29 -0
  1416. package/targets/cursor.js +20 -0
  1417. package/targets/gemini.js +29 -0
  1418. package/targets/index.js +55 -0
@@ -0,0 +1,213 @@
1
+ ---
2
+ name: agent-eval
3
+ description: "Use when you need to measure whether an LLM or agent system actually got better and gate merges on it — before/after a prompt or model change, building a golden set, fixing a noisy or inflated LLM-as-judge, scoring a RAG pipeline (faithfulness, contextual recall) or an agent trajectory (tool-correctness, task-completion), or choosing between DeepEval / Inspect AI / promptfoo / Braintrust. Triggers: 'is the new prompt actually better', 'build a golden set / regression suite', 'our judge gives everything a 9/10', 'fail the PR when answer quality drops', 'score the tool path not just the final answer', 'medir si el agente mejoró', 'tenim un eval que dona puntuacions inflades', 'montar un golden set'. NOT building the agent loop/tools/RAG plumbing (that is building-agents)."
4
+ tags: [evals, llm, agents, llm-as-judge, regression-gate, ai]
5
+ recommends: [building-agents, prompt-engineering, observability]
6
+ origin: risco
7
+ ---
8
+
9
+ # Measure agent quality you can defend and gate on
10
+
11
+ Turn "the agent feels better" into a number you can put in a PR check. You own the eval dataset, the scorer mix, the LLM-as-judge calibration, and the block-on-regression CI gate — framework-neutral, provider-neutral.
12
+
13
+ ## The one rule
14
+
15
+ > A score you do not trust is worse than no score. Calibrate the scorer against human labels and report the agreement **before** you gate anything on it. An uncalibrated judge gives false confidence, which is more dangerous than admitted ignorance.
16
+
17
+ > Build your own golden set. A public leaderboard number is not your number — identical model weights swing SWE-Bench Verified by 10–20 points just by changing the harness. Measure *your* task on *your* data.
18
+
19
+ ## When to use / When NOT
20
+
21
+ **Use when:**
22
+ - "Is the new prompt/model actually better?" — before/after on a fixed dataset.
23
+ - "Build a golden set / regression suite for this agent."
24
+ - "Our judge gives everything 9/10 / the scores are inflated" — judge calibration.
25
+ - "Fail the PR when answer quality / faithfulness / tool-call accuracy drops" — CI gate.
26
+ - Scoring a RAG pipeline (faithfulness, answer relevancy, contextual recall/precision).
27
+ - Scoring an agent's *trajectory* — tool calls, task completion — not just the final string.
28
+ - Picking between DeepEval / Inspect AI / promptfoo / Braintrust.
29
+
30
+ **Do NOT use — route instead:**
31
+
32
+ | The ask | Route to | Why it is not this skill |
33
+ | --- | --- | --- |
34
+ | Build the agent loop, tools, RAG plumbing | `building-agents` | It builds the system; you score it. They cross-link. |
35
+ | "Make the answers shorter / rewrite the prompt" | `prompt-engineering` | Evals say it is worse; that skill changes the words. You never edit the prompt. |
36
+ | pytest/jest on deterministic functions | `testing-py` / `testing-web` | Assert-equals on pure code, not stochastic outputs scored by a judge. |
37
+ | Dashboards / tracing of live production traffic | `observability` | Online monitoring; you are offline + pre-merge. |
38
+ | Red-team, jailbreak, prompt injection | `agent-safety` | Adversarial coverage, not quality measurement. |
39
+ | Per-token cost budgets and accounting | `cost-tracking` | You report cost-per-task as one metric; the discipline lives there. |
40
+ | A/B stats on product/funnel metrics | `ab-testing` | Web experiments, not offline model comparison on a fixed set. |
41
+
42
+ (`building-agents`, `ab-testing`, `chatbot` are linkable siblings here; the rest are KNOWN ids you name but cannot link yet.)
43
+
44
+ ## The eval anatomy
45
+
46
+ Every framework instantiates the same five-stage pipeline. Learn it once; the tool is a detail.
47
+
48
+ ```text
49
+ dataset ──▶ runner ──▶ scorers ──▶ metrics ──▶ gate
50
+ (JSONL (calls the (det / judge (aggregate + (pass/fail
51
+ golden system per / human) bootstrap CI) exit code)
52
+ set) case)
53
+ ```
54
+
55
+ DeepEval, Inspect AI, and promptfoo are all just opinionated wrappers around this. If you understand the stages you can switch tools without relearning the craft.
56
+
57
+ ## Build the dataset first
58
+
59
+ The dataset is the asset. Everything else is replaceable. Rules:
60
+
61
+ - **50–200 hand-labeled cases per failure mode**, not per total. Coverage of how the system fails beats raw volume. 80 real failure cases > 1000 generic ones.
62
+ - **Never synthetic-only.** A set the model wrote will not surface the model's blind spots. Mine real traffic / tickets / transcripts and hand-label.
63
+ - **Version it in git as JSONL**, a first-class reviewed asset — same as code. Diffs are reviewable; relabels are auditable.
64
+ - **Decontaminate.** The eval set must not appear in training data or few-shot examples, or the score is a memorization artifact, not a capability.
65
+
66
+ Case schema — one JSON object per line:
67
+
68
+ ```jsonl
69
+ {"id":"refund-001","input":"Where is my refund for order 4821?","expected":"States refunds take 5-7 business days and asks for nothing already on file","context":["policy: refunds 5-7 business days"],"meta":{"failure_mode":"hallucinated_policy","source":"ticket#4821"}}
70
+ {"id":"refund-002","input":"Cancel my subscription and refund this month","expected":"Cancels, refunds prorated amount, confirms no future charge","context":["policy: prorated refund on cancel"],"meta":{"failure_mode":"missed_tool_call","source":"ticket#5190"}}
71
+ ```
72
+
73
+ `failure_mode` in `meta` is what lets you slice metrics by mode and find *which* kind of bug regressed — not just that the aggregate dropped.
74
+
75
+ > Bad: "generate 1000 test questions with GPT and use those." Good: "80 real failure-mode cases pulled from support tickets, hand-labeled, tagged by failure mode."
76
+
77
+ ## Choose the scorer — the 60/30/10 mix
78
+
79
+ Reach for the cheapest scorer that correlates with human judgment. Default mix:
80
+
81
+ | Share | Scorer kind | Use for | Why |
82
+ | --- | --- | --- | --- |
83
+ | ~60% | Deterministic — exact match, regex, JSON-schema validation, latency threshold | Anything with a checkable shape: format, required fields, a known string, a budget | Free, instant, zero drift. Never spend a judge call on something a regex settles. |
84
+ | ~30% | LLM-as-judge — G-Eval, DAG, custom Python scorer | Meaning: is this answer faithful, relevant, helpful | Only where correctness is semantic. Costs money and can drift — so calibrate it. |
85
+ | ~10% | Human-in-the-loop | Genuinely ambiguous cases the judge disagrees on | The ground truth you calibrate the judge against. |
86
+
87
+ One `Scorer` protocol, two implementations behind it — deterministic and judge are interchangeable to the runner:
88
+
89
+ ```python
90
+ from typing import Protocol
91
+
92
+ class Scorer(Protocol):
93
+ name: str
94
+ def score(self, case: dict, output: str) -> float: ... # 0.0–1.0
95
+
96
+ class JsonSchemaScorer:
97
+ name = "schema_valid"
98
+ def score(self, case, output): # deterministic, free, no drift
99
+ import json
100
+ try:
101
+ json.loads(output)
102
+ return 1.0
103
+ except ValueError:
104
+ return 0.0
105
+
106
+ class FaithfulnessJudge:
107
+ name = "faithfulness"
108
+ def __init__(self, judge_model): self.judge = judge_model
109
+ def score(self, case, output): # judge only where meaning matters
110
+ return self.judge.rate(case["context"], output) # see judge-design.md
111
+ ```
112
+
113
+ ## LLM-as-judge you can trust
114
+
115
+ A judge is a scorer, so the one rule applies hardest here. Each rule, with its why:
116
+
117
+ - **Judge model ≥ system under test.** A weaker judge cannot reliably rank a stronger system — it scores noise.
118
+ - **The rubric must force a written rationale before the score.** Rationale-first judging is what pushes judge–human agreement to ~85% — higher than two humans agree with each other. A bare number is a vibe with a decimal point.
119
+ - **Pairwise beats pointwise for stability.** "Is A or B better?" is more reproducible than "rate A from 1–10," which inflates and clusters at 8–9.
120
+ - **Swap positions and average.** Judges favor whichever answer came first; run A-then-B and B-then-A to cancel position bias.
121
+ - **Calibrate against human gold and report agreement** before you trust a single judge score.
122
+
123
+ > Bad judge prompt: "Rate this answer 1–10." → everything lands 8–9, useless.
124
+ > Good: "Compare answer A and answer B against the reference. First write one sentence on each per the rubric, then output the better label." → forces reasoning, gives a stable signal.
125
+
126
+ Full rubric templates (pointwise + pairwise), the position-swap harness, the calibration script (agreement / Cohen's kappa vs human gold), G-Eval vs DAG, and the judge bias catalog (length, position, self-preference) with mitigations live in **[references/judge-design.md](references/judge-design.md)**.
127
+
128
+ ## Agent and RAG scorers
129
+
130
+ Score the path, not only the destination. Beyond exact/judge:
131
+
132
+ **RAG** (DeepEval / RAGAS names):
133
+ - **Faithfulness** — does the answer only claim what the retrieved context supports? Catches hallucination.
134
+ - **Answer relevancy** — does it actually address the question, or drift?
135
+ - **Contextual recall / precision** — did retrieval fetch the right chunks, and not bury them in noise? Separates a retrieval bug from a generation bug.
136
+
137
+ **Agent:**
138
+ - **Tool correctness** — right tool, right arguments, right order.
139
+ - **Task completion / goal accuracy** — did it finish the job, not just produce plausible text.
140
+ - **Trajectory scoring** — grade the sequence of steps. A correct final answer from a wrong path will fail differently next time; only trajectory scoring catches it.
141
+
142
+ The system side of these (how the loop and tools are built) is `../building-agents/SKILL.md`.
143
+
144
+ ## The regression gate
145
+
146
+ > Gate policy: **block on regression vs a committed baseline, not on an absolute threshold.** An absolute threshold flaps CI on judge noise and gives no signal on drift; "did this PR make a tracked metric worse than `main`?" is the question that matters.
147
+
148
+ - Compute a **bootstrap confidence interval** on each metric so judge noise alone does not fail the build — only a drop beyond the CI counts.
149
+ - The runner writes `eval-report.json` (metrics, per-failure-mode slices, baseline, pass/fail) and **exits non-zero** on a real regression so the merge is blocked.
150
+
151
+ ```python
152
+ import json, sys
153
+
154
+ def gate(current: dict, baseline: dict, margin: float = 0.0) -> int:
155
+ regressed = []
156
+ for metric, score in current.items():
157
+ if metric in baseline and score < baseline[metric] - margin:
158
+ regressed.append((metric, baseline[metric], score))
159
+ report = {"metrics": current, "baseline": baseline, "regressed": regressed,
160
+ "passed": not regressed}
161
+ with open("eval-report.json", "w") as f:
162
+ json.dump(report, f, indent=2)
163
+ if regressed:
164
+ for m, b, c in regressed:
165
+ print(f"REGRESSION {m}: {b:.3f} -> {c:.3f}", file=sys.stderr)
166
+ return 1
167
+ return 0
168
+
169
+ sys.exit(gate(run_eval(), json.load(open("eval-baseline.json"))))
170
+ ```
171
+
172
+ The complete provider-neutral runner (JSONL loader, scorer registry, bootstrap-CI metrics), the GitHub Actions workflow, and side-by-side DeepEval-pytest + Inspect-AI Task/Solver/Scorer versions of the same eval live in **[references/runner-and-gate.md](references/runner-and-gate.md)**.
173
+
174
+ ## Framework cheat-sheet
175
+
176
+ Pick by where the eval runs and what it must do. Versions as of 2026-06 — re-verify, they rot.
177
+
178
+ | Tool | What it is | Reach for it when |
179
+ | --- | --- | --- |
180
+ | **DeepEval** v4.0.3 | pytest-native, 50+ metrics, Decision-Graph (DAG) logic | Your CI is Python/pytest and you want metrics that read like tests. |
181
+ | **Inspect AI** v0.3.225 (UK AISI) | dataset→Task→Solver→Scorer, bootstrap CIs, first-class tool-use & trajectory logging, 200+ pre-built evals | Multi-provider, safety-adjacent, or you need real trajectory scoring. |
182
+ | **promptfoo** (acquired by OpenAI 2026-03) | CLI + YAML, strong pre-deploy + red-team across 50+ vuln types | Config-driven pre-deploy checks; route the red-team half to `agent-safety`. |
183
+ | **Braintrust / LangSmith / Phoenix** v16.0.0 | platforms: annotation, regression tracking, dashboards | You need human annotation queues and historical regression tracking. |
184
+
185
+ > The two-tool pattern is normal, not over-engineering: a light CI gate (DeepEval / RAGAS / promptfoo) **plus** a platform (Braintrust / LangSmith / Arize) for annotation and history. They share data; different jobs.
186
+
187
+ ## Anti-patterns
188
+
189
+ | Anti-pattern | Why it bites | Do instead |
190
+ | --- | --- | --- |
191
+ | Vibes-gating ("feels better, merge it") | No artifact to defend or reproduce | Gate on a number from a committed dataset |
192
+ | Synthetic-only dataset | Model-written cases miss the model's blind spots | Hand-label real traffic by failure mode |
193
+ | Uncalibrated judge | Confident wrong scores; worse than none | Report agreement vs human gold first |
194
+ | Judge weaker than system | Cannot rank a stronger system; scores noise | Judge model ≥ system under test |
195
+ | Absolute-threshold gate | Flaps CI on judge noise, blind to drift | Block on regression vs baseline + bootstrap CI |
196
+ | Shipping on a leaderboard number | Harness effect = 10–20pt swing | Build your own golden set |
197
+ | Scoring only the final answer | A right answer from a wrong path regresses later | Score the trajectory too |
198
+ | Never relabeling drifted gold | Stale "truth" silently rots the gate | Review and relabel the golden set on a schedule |
199
+
200
+ ## Project grounding
201
+
202
+ If the workspace has a `02-DOCS/` harness, record the eval policy in `02-DOCS/wiki/stack/evals.md`: dataset location, scorer mix, gate baseline file, judge model, and the failure modes covered. Link it from root `CLAUDE.md` under `## Knowledge map`. This is **recorded, not gated** — skip silently if there is no harness.
203
+
204
+ ## verify.sh
205
+
206
+ `scripts/verify.sh` is read-only and tool-detecting. It validates that every `*.jsonl` golden set in the project parses and that each line carries the required `id`, `input`, `expected` keys; checks the shape of any `eval-report.json`; and runs `ruff` / `mypy` on example Python and `markdownlint` on docs when those tools are installed. Every missing tool prints a yellow WARN and is skipped — never a failure. An empty or clean target exits 0.
207
+
208
+ ## See also
209
+
210
+ - `../building-agents/SKILL.md` — builds the agent/RAG/tool system you score here.
211
+ - `../chatbot/SKILL.md` — a common system-under-test for these evals.
212
+ - KNOWN siblings to route to (not yet linkable): `prompt-engineering`, `observability`, `agent-safety`, `cost-tracking`, `testing-py`, `ab-testing`.
213
+ - External (no link): RAGAS, DeepEval, Inspect AI, promptfoo, Braintrust.
@@ -0,0 +1,12 @@
1
+ # Evals for agent-eval
2
+
3
+ `cases.yaml` is the trigger and capability spec for this skill. It is read by the catalog's
4
+ skill-eval harness, not by a standalone runner here. `should_trigger` lists prompts the skill
5
+ must claim (including non-obvious and Spanish phrasings); `should_not_trigger` lists prompts
6
+ that must route to a named sibling instead; `capability` is a rubric a graded run must satisfy.
7
+
8
+ To check by hand: read each `should_trigger` prompt and confirm the description in `SKILL.md`
9
+ would plausibly fire on it; read each `should_not_trigger` prompt and confirm the `route_to`
10
+ sibling is the better home and is a real catalog id. For the `capability` case, draft the
11
+ skill's answer and confirm every `must_include` bullet is covered. If your harness scores these
12
+ automatically, point it at this file with `skill: agent-eval` as the key.
@@ -0,0 +1,45 @@
1
+ skill: agent-eval
2
+
3
+ should_trigger:
4
+ - prompt: "Is the new system prompt actually better than the old one, or am I just fooling myself?"
5
+ why: "Before/after on a fixed dataset is the core job. Non-obvious: uses no eval vocabulary at all."
6
+ - prompt: "Build me a golden set and a CI check that fails the PR if our support bot's answer quality drops."
7
+ why: "Dataset construction plus a block-on-regression gate, stated explicitly."
8
+ - prompt: "Our LLM judge gives every answer a 9 out of 10 — the scores are useless."
9
+ why: "Judge calibration / inflation; the fix is rationale-forcing rubric and pairwise over pointwise. Non-obvious symptom phrasing, no mention of 'eval'."
10
+ - prompt: "Score our RAG pipeline for faithfulness and whether it actually uses the retrieved context."
11
+ why: "RAG metrics — faithfulness and contextual recall — are squarely this skill."
12
+ - prompt: "I need to know if the agent took the right tool path, not just whether the final answer looks ok."
13
+ why: "Trajectory and tool-correctness scoring. Non-obvious: framed as behavior, not as 'eval'."
14
+ - prompt: "Necesito medir si el agente mejoró con el modelo nuevo antes de hacer merge."
15
+ why: "Spanish — before/after measurement plus a pre-merge gate."
16
+ - prompt: "Should we use DeepEval or Inspect AI for our eval suite, and how do we wire the gate?"
17
+ why: "Framework choice plus gate wiring; the cheat-sheet and runner are the answer."
18
+
19
+ should_not_trigger:
20
+ - prompt: "Rewrite this system prompt so the answers come out shorter and punchier."
21
+ route_to: prompt-engineering
22
+ why: "Changing the words, not measuring quality. Evals say it is worse; that skill changes it."
23
+ - prompt: "Build the agent loop and tool registry for our new assistant."
24
+ route_to: building-agents
25
+ why: "Building the system under test, not scoring it."
26
+ - prompt: "Add a dashboard showing token cost and latency of our live LLM traffic."
27
+ route_to: observability
28
+ why: "Online monitoring of production, not offline pre-merge evaluation."
29
+ - prompt: "Write pytest unit tests for our data-cleaning functions."
30
+ route_to: testing-py
31
+ why: "Deterministic code with assert-equals, not stochastic outputs scored by a judge."
32
+ - prompt: "Red-team our chatbot for jailbreaks and prompt injection."
33
+ route_to: agent-safety
34
+ why: "Adversarial safety coverage, not quality measurement on a golden set."
35
+
36
+ capability:
37
+ - scenario: "Set up an eval plus a regression gate for a RAG support bot that is regressing silently between releases."
38
+ must_include:
39
+ - "Builds a versioned JSONL golden set of 50-200 hand-labeled cases per failure mode, decontaminated, not synthetic-only"
40
+ - "Picks a scorer mix: deterministic first (schema/regex/latency), faithfulness and answer-relevancy judge where meaning matters"
41
+ - "Calibrates the LLM-as-judge against human labels and reports agreement (raw and/or Cohen's kappa) before trusting it"
42
+ - "Gate policy is block-on-regression vs a committed baseline with a bootstrap CI, not an absolute threshold"
43
+ - "Runner emits eval-report.json and exits non-zero to block the merge"
44
+ - "Names at least one concrete framework (DeepEval, Inspect AI, or promptfoo) and fits it to the case"
45
+ - "Records the eval policy in 02-DOCS/wiki/stack/evals.md if a harness exists — recorded, not gated"
@@ -0,0 +1,118 @@
1
+ # LLM-as-judge design and calibration
2
+
3
+ Depth offloaded from `SKILL.md`. A judge is a scorer, so the one rule binds hardest here:
4
+ **calibrate against human gold and report agreement before you trust a single judge score.**
5
+ Judge–human agreement reaches ~85% in 2026 (higher than two humans on the same task) — but
6
+ only with a capable judge model and a rubric that forces a written rationale.
7
+
8
+ ## Pointwise rubric template
9
+
10
+ Pointwise (rate one answer) is convenient but inflates and clusters at 8–9. Use it only when
11
+ you cannot run pairwise, and always force the rationale first.
12
+
13
+ ```text
14
+ You are grading an answer against a reference. The dimension is FAITHFULNESS:
15
+ every claim in the answer must be supported by the provided context.
16
+
17
+ Context:
18
+ {context}
19
+
20
+ Answer:
21
+ {answer}
22
+
23
+ Steps (do them in order):
24
+ 1. List each factual claim in the answer.
25
+ 2. For each claim, mark SUPPORTED or UNSUPPORTED against the context.
26
+ 3. Only then output a JSON object: {"rationale": "...", "score": <0.0-1.0>}
27
+ where score = supported_claims / total_claims.
28
+ ```
29
+
30
+ The order matters: a rubric that asks for the number first gets a vibe; asking for the
31
+ claim-by-claim breakdown first forces the reasoning that earns the ~85% agreement.
32
+
33
+ ## Pairwise rubric template (preferred)
34
+
35
+ "Is A or B better?" is more reproducible than an absolute rating. Use it for before/after and
36
+ A/B comparisons.
37
+
38
+ ```text
39
+ Compare two answers, A and B, against the reference for HELPFULNESS.
40
+
41
+ Reference: {reference}
42
+ Answer A: {a}
43
+ Answer B: {b}
44
+
45
+ 1. One sentence on what A does well/badly vs the reference.
46
+ 2. One sentence on what B does well/badly vs the reference.
47
+ 3. Output JSON: {"rationale": "...", "winner": "A" | "B" | "tie"}
48
+ ```
49
+
50
+ ## Position-swap harness (kills position bias)
51
+
52
+ Judges favor whichever answer appears first. Run each comparison twice with positions swapped
53
+ and only count a clear win as a win.
54
+
55
+ ```python
56
+ def pairwise(judge, reference, a, b):
57
+ """Return 'a', 'b', or 'tie' after canceling position bias."""
58
+ first = judge.compare(reference, a, b) # A in slot 1
59
+ second = judge.compare(reference, b, a) # A in slot 2 -> remap
60
+ remap = {"A": "b", "B": "a", "tie": "tie"}
61
+ r1 = {"A": "a", "B": "b", "tie": "tie"}[first]
62
+ r2 = remap[second]
63
+ if r1 == r2:
64
+ return r1 # consistent across positions -> trust it
65
+ return "tie" # judge flipped with position -> not a real signal
66
+ ```
67
+
68
+ A judge that flips when you swap positions has told you it cannot separate the two — record a
69
+ tie, do not pick the first-slot winner.
70
+
71
+ ## Calibration script (agreement and Cohen's kappa vs human gold)
72
+
73
+ You need a small human-labeled gold slice (~10% of cases, the ambiguous ones). Run the judge
74
+ on it and measure how often it agrees with the humans. Report this number in `eval-report.json`
75
+ and refuse to gate if it is low.
76
+
77
+ ```python
78
+ def cohen_kappa(human: list[int], judge: list[int]) -> float:
79
+ """Agreement corrected for chance. 1.0 perfect, 0 chance-level."""
80
+ n = len(human)
81
+ po = sum(h == j for h, j in zip(human, judge)) / n # raw agreement
82
+ labels = set(human) | set(judge)
83
+ pe = sum((human.count(k) / n) * (judge.count(k) / n) for k in labels)
84
+ return (po - pe) / (1 - pe) if pe != 1 else 1.0
85
+
86
+ def calibration_report(human, judge):
87
+ raw = sum(h == j for h, j in zip(human, judge)) / len(human)
88
+ return {"raw_agreement": round(raw, 3),
89
+ "cohen_kappa": round(cohen_kappa(human, judge), 3)}
90
+ ```
91
+
92
+ Rules of thumb for the kappa: < 0.4 the judge is unusable, fix the rubric or model; 0.4–0.6
93
+ marginal, widen the human slice; > 0.6 with raw agreement ~0.85 is the working zone. These are
94
+ thresholds for *trusting the judge*, not for gating the system.
95
+
96
+ ## G-Eval vs DAG
97
+
98
+ - **G-Eval** — you give the judge an evaluation-criteria sentence; it generates the
99
+ chain-of-thought steps and a weighted score. Fast to author, good for fuzzy semantic
100
+ dimensions (coherence, helpfulness). Less reproducible for hard rules.
101
+ - **DAG (Decision Graph)** — you author an explicit decision tree of yes/no checks; the score
102
+ is deterministic given the answers. Use for compliance-style "must / must-not" criteria where
103
+ you want auditability over flexibility. DeepEval v4 ships this as first-class.
104
+
105
+ Pick G-Eval for taste, DAG for rules.
106
+
107
+ ## Judge bias catalog and mitigations
108
+
109
+ | Bias | Symptom | Mitigation |
110
+ | --- | --- | --- |
111
+ | Position | Favors the first answer shown | Position-swap harness above; count only consistent wins |
112
+ | Length | Scores longer answers higher regardless of quality | Add "ignore length; reward density" to the rubric; spot-check long losers |
113
+ | Self-preference | A model prefers text in its own style | Use a different model family as judge than the system under test |
114
+ | Verbosity-of-rationale | Long rationale read as more correct | Score the claim breakdown, not the prose |
115
+ | Anchoring on reference wording | Penalizes correct paraphrases | Grade meaning vs reference, not surface overlap; test with known-good paraphrases |
116
+
117
+ Re-run the calibration slice whenever you change the judge model or rubric — a judge swap is a
118
+ scorer change, and an uncalibrated scorer is back to a number you cannot defend.
@@ -0,0 +1,183 @@
1
+ # The runner, the gate, and framework equivalents
2
+
3
+ Depth offloaded from `SKILL.md`. A provider-neutral runner you can drop into any repo, the CI
4
+ wiring, and the same eval expressed in DeepEval and Inspect AI so you can switch tools without
5
+ relearning the craft.
6
+
7
+ ## The five stages, in code
8
+
9
+ `dataset → runner → scorers → metrics → gate`. Keep them separable so each can be swapped.
10
+
11
+ ### Dataset loader (JSONL, decontaminated, versioned)
12
+
13
+ ```python
14
+ import json
15
+ from pathlib import Path
16
+
17
+ REQUIRED = {"id", "input", "expected"}
18
+
19
+ def load_cases(path: str) -> list[dict]:
20
+ cases = []
21
+ for i, line in enumerate(Path(path).read_text().splitlines(), 1):
22
+ line = line.strip()
23
+ if not line:
24
+ continue
25
+ case = json.loads(line)
26
+ missing = REQUIRED - case.keys()
27
+ if missing:
28
+ raise ValueError(f"{path}:{i} missing keys {missing}")
29
+ cases.append(case)
30
+ return cases
31
+ ```
32
+
33
+ ### Scorer registry
34
+
35
+ Every scorer satisfies one protocol — deterministic and judge are interchangeable. See
36
+ `judge-design.md` for the judge implementations.
37
+
38
+ ```python
39
+ from dataclasses import dataclass
40
+ from typing import Callable
41
+
42
+ @dataclass
43
+ class Scorer:
44
+ name: str
45
+ fn: Callable[[dict, str], float] # (case, output) -> 0.0..1.0
46
+
47
+ def exact_match(case, output):
48
+ return 1.0 if output.strip() == case["expected"].strip() else 0.0
49
+
50
+ def latency_under(threshold_s):
51
+ def _fn(case, output):
52
+ return 1.0 if case["meta"].get("latency_s", 0) <= threshold_s else 0.0
53
+ return _fn
54
+
55
+ REGISTRY = [
56
+ Scorer("exact_match", exact_match),
57
+ Scorer("latency_2s", latency_under(2.0)),
58
+ # Scorer("faithfulness", FaithfulnessJudge(judge_model).score), # 30% judge
59
+ ]
60
+ ```
61
+
62
+ ### Runner + bootstrap-CI metrics
63
+
64
+ The bootstrap CI is what stops judge noise from flapping the gate: only a drop beyond the
65
+ interval counts as a real regression.
66
+
67
+ ```python
68
+ import random, statistics
69
+
70
+ def run(system, cases, scorers):
71
+ rows = []
72
+ for case in cases:
73
+ output = system(case["input"]) # the system under test
74
+ row = {"id": case["id"],
75
+ "failure_mode": case["meta"].get("failure_mode", "none")}
76
+ for s in scorers:
77
+ row[s.name] = s.fn(case, output)
78
+ rows.append(row)
79
+ return rows
80
+
81
+ def metrics(rows, scorers, n_boot=1000):
82
+ out = {}
83
+ for s in scorers:
84
+ vals = [r[s.name] for r in rows]
85
+ mean = statistics.fmean(vals)
86
+ boots = [statistics.fmean(random.choices(vals, k=len(vals)))
87
+ for _ in range(n_boot)]
88
+ boots.sort()
89
+ out[s.name] = {"mean": round(mean, 4),
90
+ "ci_low": round(boots[int(0.025 * n_boot)], 4),
91
+ "ci_high": round(boots[int(0.975 * n_boot)], 4)}
92
+ return out
93
+ ```
94
+
95
+ ### Gate (block on regression vs committed baseline)
96
+
97
+ ```python
98
+ import json, sys
99
+
100
+ def gate(current: dict, baseline: dict, report_path="eval-report.json") -> int:
101
+ regressed = []
102
+ for metric, m in current.items():
103
+ base = baseline.get(metric, {}).get("mean")
104
+ # Real regression: the current CI is entirely below the baseline mean.
105
+ if base is not None and m["ci_high"] < base:
106
+ regressed.append({"metric": metric, "baseline": base,
107
+ "now": m["mean"], "ci_high": m["ci_high"]})
108
+ report = {"metrics": current, "baseline": baseline,
109
+ "regressed": regressed, "passed": not regressed}
110
+ Path(report_path).write_text(json.dumps(report, indent=2))
111
+ for r in regressed:
112
+ print(f"REGRESSION {r['metric']}: {r['baseline']} -> {r['now']}",
113
+ file=sys.stderr)
114
+ return 1 if regressed else 0
115
+ ```
116
+
117
+ Update the baseline deliberately — commit a new `eval-baseline.json` in the PR that *intends*
118
+ to move the metric, so the move is reviewed, not silent.
119
+
120
+ ## GitHub Actions wiring
121
+
122
+ ```yaml
123
+ name: eval-gate
124
+ on: [pull_request]
125
+ jobs:
126
+ eval:
127
+ runs-on: ubuntu-latest
128
+ steps:
129
+ - uses: actions/checkout@v4
130
+ - uses: actions/setup-python@v5
131
+ with: { python-version: "3.12" }
132
+ - run: pip install -r evals/requirements.txt
133
+ - name: Run eval gate
134
+ env:
135
+ OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
136
+ run: python evals/run.py # exits non-zero on regression -> blocks merge
137
+ - uses: actions/upload-artifact@v4
138
+ if: always()
139
+ with: { name: eval-report, path: eval-report.json }
140
+ ```
141
+
142
+ The non-zero exit is the gate. Upload the report `if: always()` so a failed run still shows
143
+ *which* metric and failure mode regressed.
144
+
145
+ ## Same eval in DeepEval (pytest-native)
146
+
147
+ ```python
148
+ from deepeval import assert_test
149
+ from deepeval.test_case import LLMTestCase
150
+ from deepeval.metrics import FaithfulnessMetric
151
+
152
+ def test_faithfulness():
153
+ case = LLMTestCase(
154
+ input="Where is my refund for order 4821?",
155
+ actual_output=system("Where is my refund for order 4821?"),
156
+ retrieval_context=["policy: refunds 5-7 business days"],
157
+ )
158
+ assert_test(case, [FaithfulnessMetric(threshold=0.8)])
159
+ ```
160
+
161
+ DeepEval v4.0.3 reads like tests and runs under pytest; reach for it when CI is already Python.
162
+
163
+ ## Same eval in Inspect AI (Task / Solver / Scorer)
164
+
165
+ ```python
166
+ from inspect_ai import Task, task
167
+ from inspect_ai.dataset import json_dataset
168
+ from inspect_ai.solver import generate
169
+ from inspect_ai.scorer import model_graded_qa
170
+
171
+ @task
172
+ def refund_quality():
173
+ return Task(
174
+ dataset=json_dataset("evals/cases.jsonl"),
175
+ solver=generate(),
176
+ scorer=model_graded_qa(), # judge with rationale; bootstrap CIs built in
177
+ )
178
+ # inspect eval refund_quality.py --model openai/gpt-... -> pass/fail + CIs
179
+ ```
180
+
181
+ Inspect AI v0.3.225 (UK AISI) gives first-class tool-use and trajectory logging plus bootstrap
182
+ CIs out of the box; reach for it when you are multi-provider or need real trajectory scoring.
183
+ Both express the identical `dataset → scorer → gate` pipeline — the tool is a detail.