@skill-graph/cli 0.5.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (330)
  1. package/CHANGELOG.md +247 -0
  2. package/LICENSE +200 -0
  3. package/NOTICE +62 -0
  4. package/README.md +398 -0
  5. package/SKILL_GRAPH.md +443 -0
  6. package/bin/skill-graph.js +374 -0
  7. package/docs/ADOPTION.md +117 -0
  8. package/docs/CONFORMANCE.md +66 -0
  9. package/docs/PRIMER.md +384 -0
  10. package/docs/QUICKSTART-30MIN.md +333 -0
  11. package/docs/ROUTING-METRICS.md +120 -0
  12. package/docs/SKILL-MD-FORMAT-COMPATIBILITY.md +127 -0
  13. package/docs/SKILL_AUDIT_CHECKLIST.md +199 -0
  14. package/docs/SKILL_AUDIT_LOOP.md +195 -0
  15. package/docs/SKILL_METADATA_PROTOCOL.md +609 -0
  16. package/docs/_archived/marketplace-publication-priority-2026-05-18.md +239 -0
  17. package/docs/adr/0001-predicate-set.md +69 -0
  18. package/docs/adr/0002-json-ld-context.md +82 -0
  19. package/docs/adr/0003-ontoclean-rigidity-tags.md +65 -0
  20. package/docs/adr/0004-persistent-identifiers.md +74 -0
  21. package/docs/adr/0005-freshness-consolidation.md +70 -0
  22. package/docs/adr/0006-revise-predicate-rename.md +105 -0
  23. package/docs/adr/0007-audit-loop-cadence.md +99 -0
  24. package/docs/adr/0008-skill-surface-split-and-curation-policy.md +93 -0
  25. package/docs/category-consumers.md +168 -0
  26. package/docs/concept-map.md +194 -0
  27. package/docs/diagrams/drift-states.mmd +21 -0
  28. package/docs/diagrams/manifest-pipeline.mmd +25 -0
  29. package/docs/diagrams/routing-harness.mmd +41 -0
  30. package/docs/diagrams/starter-graph.mmd +53 -0
  31. package/docs/field-decision-guide.md +315 -0
  32. package/docs/field-rationale.md +211 -0
  33. package/docs/field-reference.generated.md +624 -0
  34. package/docs/field-reference.md +1426 -0
  35. package/docs/glossary.md +190 -0
  36. package/docs/head-noun-glossary.md +63 -0
  37. package/docs/images/audit-phases.png +0 -0
  38. package/docs/images/drift-states.png +0 -0
  39. package/docs/images/graded-mode.png +0 -0
  40. package/docs/images/manifest-pipeline.png +0 -0
  41. package/docs/images/routing-harness.png +0 -0
  42. package/docs/images/skill-anatomy.png +0 -0
  43. package/docs/images/starter-graph.png +0 -0
  44. package/docs/images/system-model.png +0 -0
  45. package/docs/integrations/github-actions.md +155 -0
  46. package/docs/manifest-field-mapping.md +443 -0
  47. package/docs/marketplace-publication-queue.generated.md +240 -0
  48. package/docs/marketplace-release-agent-prompt.md +82 -0
  49. package/docs/marketplace-skill-candidate-list.md +272 -0
  50. package/docs/marketplace-syndication.md +222 -0
  51. package/docs/migration-sample-review.md +155 -0
  52. package/docs/migrations/v4-to-v5.md +168 -0
  53. package/docs/migrations/v5-to-v6.md +221 -0
  54. package/docs/name-exceptions.yaml +37 -0
  55. package/docs/plans/marketplace-p1-public-migration-plan.md +41 -0
  56. package/docs/plans/multi-root-workspace.md +148 -0
  57. package/docs/plans/scripts-roadmap.md +107 -0
  58. package/docs/plans/v4-schema-bump.md +160 -0
  59. package/docs/plans/wave-2-extraction.md +122 -0
  60. package/docs/positioning-vs-marketplaces.md +175 -0
  61. package/docs/proposals/skill-audit-loop-positioning.md +160 -0
  62. package/docs/quality-doctrine.md +138 -0
  63. package/docs/recommended-skills.md +150 -0
  64. package/docs/research/skill-comprehension-eval-research.md +1830 -0
  65. package/docs/research/skill-retrieval-evidence.md +66 -0
  66. package/docs/skill-metadata-protocol.md +471 -0
  67. package/docs/skills-sh-maintainer-cleanup-request.md +80 -0
  68. package/examples/audits/a11y/findings.md +52 -0
  69. package/examples/audits/a11y/scorecard.md +21 -0
  70. package/examples/audits/a11y/verdict.md +44 -0
  71. package/examples/audits/debugging/findings.md +59 -0
  72. package/examples/audits/debugging/scorecard.md +22 -0
  73. package/examples/audits/debugging/verdict.md +33 -0
  74. package/examples/audits/documentation/findings.md +59 -0
  75. package/examples/audits/documentation/scorecard.md +22 -0
  76. package/examples/audits/documentation/verdict.md +33 -0
  77. package/examples/evals/a11y.json +140 -0
  78. package/examples/evals/api-design.json +52 -0
  79. package/examples/evals/code-review.json +52 -0
  80. package/examples/evals/data-modeling.json +52 -0
  81. package/examples/evals/database-migration.json +52 -0
  82. package/examples/evals/debugging.json +118 -0
  83. package/examples/evals/dependency-architecture.json +52 -0
  84. package/examples/evals/design-system-architecture.json +52 -0
  85. package/examples/evals/error-tracking.json +52 -0
  86. package/examples/evals/event-contract-design.json +52 -0
  87. package/examples/evals/form-ux-architecture.json +52 -0
  88. package/examples/evals/framework-fit-analysis.json +52 -0
  89. package/examples/evals/graph-audit.json +139 -0
  90. package/examples/evals/information-architecture.json +52 -0
  91. package/examples/evals/interaction-feedback.json +52 -0
  92. package/examples/evals/interaction-patterns.json +52 -0
  93. package/examples/evals/layout-composition.json +52 -0
  94. package/examples/evals/lint-overlay.json +117 -0
  95. package/examples/evals/microcopy.json +52 -0
  96. package/examples/evals/observability-modeling.json +52 -0
  97. package/examples/evals/pattern-recognition.json +96 -0
  98. package/examples/evals/performance-engineering.json +52 -0
  99. package/examples/evals/refactor.json +128 -0
  100. package/examples/evals/semiotics.json +52 -0
  101. package/examples/evals/skill-infrastructure.json +96 -0
  102. package/examples/evals/skill-router.json +140 -0
  103. package/examples/evals/skill-router.routing.json +113 -0
  104. package/examples/evals/system-interface-contracts.json +52 -0
  105. package/examples/evals/task-analysis.json +52 -0
  106. package/examples/evals/testing-strategy.json +118 -0
  107. package/examples/evals/type-safety.json +249 -0
  108. package/examples/evals/visual-design-foundations.json +52 -0
  109. package/examples/evals/webhook-integration.json +52 -0
  110. package/examples/exports/a11y.skill-md.md +80 -0
  111. package/examples/exports/debugging.skill-md.md +80 -0
  112. package/examples/exports/refactor.skill-md.md +78 -0
  113. package/examples/exports/testing-strategy.skill-md.md +81 -0
  114. package/examples/projects/markdown-static-site/README.md +115 -0
  115. package/examples/projects/markdown-static-site/skills/content-source-router/SKILL.md +131 -0
  116. package/examples/projects/markdown-static-site/skills/image-optimization-pipeline-config/SKILL.md +132 -0
  117. package/examples/projects/markdown-static-site/skills/link-rot-detection/SKILL.md +103 -0
  118. package/examples/projects/markdown-static-site/skills/markdown-post-frontmatter-validation/SKILL.md +133 -0
  119. package/examples/projects/markdown-static-site/skills/migrate-posts-to-v2-frontmatter/SKILL.md +140 -0
  120. package/examples/projects/saas-stripe-postgres/README.md +208 -0
  121. package/examples/projects/saas-stripe-postgres/db/migrations/0004_canonicalize_orders.sql +37 -0
  122. package/examples/projects/saas-stripe-postgres/db/schema.sql +112 -0
  123. package/examples/projects/saas-stripe-postgres/skills/migrate-orders-to-canonical-schema/SKILL.md +149 -0
  124. package/examples/projects/saas-stripe-postgres/skills/nextjs-server-action-validation/SKILL.md +154 -0
  125. package/examples/projects/saas-stripe-postgres/skills/payment-provider-router/SKILL.md +153 -0
  126. package/examples/projects/saas-stripe-postgres/skills/postgres-rls-pattern/SKILL.md +163 -0
  127. package/examples/projects/saas-stripe-postgres/skills/stripe-webhook-signature-verification/SKILL.md +137 -0
  128. package/examples/protocol/skill-metadata-template.md +301 -0
  129. package/examples/protocol/skills.manifest.sample.json +13245 -0
  130. package/examples/skill-metadata-template.md +317 -0
  131. package/examples/skills.manifest.sample.json +13519 -0
  132. package/examples/tests/v3-1-skos-fixture/SKILL.md +93 -0
  133. package/marketplace/README.md +17 -0
  134. package/marketplace/skills/a11y/SKILL.md +66 -0
  135. package/marketplace/skills/acid-fundamentals/SKILL.md +106 -0
  136. package/marketplace/skills/agent-engineering/SKILL.md +386 -0
  137. package/marketplace/skills/agent-eval-design/SKILL.md +55 -0
  138. package/marketplace/skills/ai-native-development/SKILL.md +294 -0
  139. package/marketplace/skills/api-design/SKILL.md +60 -0
  140. package/marketplace/skills/architecture-decision-records/SKILL.md +55 -0
  141. package/marketplace/skills/background-jobs/SKILL.md +265 -0
  142. package/marketplace/skills/bounded-context-mapping/SKILL.md +55 -0
  143. package/marketplace/skills/cap-theorem-tradeoffs/SKILL.md +127 -0
  144. package/marketplace/skills/client-server-boundary/SKILL.md +187 -0
  145. package/marketplace/skills/code-review/SKILL.md +120 -0
  146. package/marketplace/skills/color-system-design/SKILL.md +43 -0
  147. package/marketplace/skills/component-architecture/SKILL.md +126 -0
  148. package/marketplace/skills/compression/SKILL.md +112 -0
  149. package/marketplace/skills/conceptual-modeling/SKILL.md +181 -0
  150. package/marketplace/skills/connection-pooling/SKILL.md +105 -0
  151. package/marketplace/skills/constraint-awareness/SKILL.md +287 -0
  152. package/marketplace/skills/content-monitor/SKILL.md +209 -0
  153. package/marketplace/skills/context-engineering/SKILL.md +320 -0
  154. package/marketplace/skills/context-graph/SKILL.md +174 -0
  155. package/marketplace/skills/context-management/SKILL.md +174 -0
  156. package/marketplace/skills/context-window/SKILL.md +239 -0
  157. package/marketplace/skills/contract-testing/SKILL.md +120 -0
  158. package/marketplace/skills/cron-scheduling/SKILL.md +223 -0
  159. package/marketplace/skills/dark-mode-implementation/SKILL.md +47 -0
  160. package/marketplace/skills/data-modeling/SKILL.md +59 -0
  161. package/marketplace/skills/data-modeling-fundamentals/SKILL.md +117 -0
  162. package/marketplace/skills/database-migration/SKILL.md +429 -0
  163. package/marketplace/skills/debugging/SKILL.md +67 -0
  164. package/marketplace/skills/dependency-architecture/SKILL.md +58 -0
  165. package/marketplace/skills/design-module-composition/SKILL.md +43 -0
  166. package/marketplace/skills/design-system-architecture/SKILL.md +61 -0
  167. package/marketplace/skills/design-thinking/SKILL.md +44 -0
  168. package/marketplace/skills/diagnosis/SKILL.md +296 -0
  169. package/marketplace/skills/diff-analysis/SKILL.md +188 -0
  170. package/marketplace/skills/e2e-test-design/SKILL.md +113 -0
  171. package/marketplace/skills/entity-relationship-modeling/SKILL.md +218 -0
  172. package/marketplace/skills/epistemic-grounding/SKILL.md +112 -0
  173. package/marketplace/skills/error-boundary/SKILL.md +235 -0
  174. package/marketplace/skills/error-tracking/SKILL.md +261 -0
  175. package/marketplace/skills/eval-driven-development/SKILL.md +147 -0
  176. package/marketplace/skills/evaluation/SKILL.md +113 -0
  177. package/marketplace/skills/event-contract-design/SKILL.md +60 -0
  178. package/marketplace/skills/event-storming/SKILL.md +56 -0
  179. package/marketplace/skills/form-ux-architecture/SKILL.md +60 -0
  180. package/marketplace/skills/framework-fit-analysis/SKILL.md +59 -0
  181. package/marketplace/skills/frontend-architecture/SKILL.md +43 -0
  182. package/marketplace/skills/generative-ui/SKILL.md +118 -0
  183. package/marketplace/skills/graph-audit/SKILL.md +81 -0
  184. package/marketplace/skills/guardrails/SKILL.md +118 -0
  185. package/marketplace/skills/hooks-patterns/SKILL.md +185 -0
  186. package/marketplace/skills/http-semantics/SKILL.md +136 -0
  187. package/marketplace/skills/ideation/SKILL.md +41 -0
  188. package/marketplace/skills/indexing-strategy/SKILL.md +108 -0
  189. package/marketplace/skills/information-architecture/SKILL.md +59 -0
  190. package/marketplace/skills/integration-test-design/SKILL.md +111 -0
  191. package/marketplace/skills/intent-recognition/SKILL.md +136 -0
  192. package/marketplace/skills/interaction-feedback/SKILL.md +59 -0
  193. package/marketplace/skills/interaction-patterns/SKILL.md +59 -0
  194. package/marketplace/skills/journey-mapping/SKILL.md +41 -0
  195. package/marketplace/skills/keywords/SKILL.md +213 -0
  196. package/marketplace/skills/knowledge-modeling/SKILL.md +232 -0
  197. package/marketplace/skills/layout-composition/SKILL.md +59 -0
  198. package/marketplace/skills/linguistics/SKILL.md +429 -0
  199. package/marketplace/skills/lint-overlay/SKILL.md +76 -0
  200. package/marketplace/skills/mental-models/SKILL.md +126 -0
  201. package/marketplace/skills/merge-queue/SKILL.md +94 -0
  202. package/marketplace/skills/methodology/SKILL.md +317 -0
  203. package/marketplace/skills/microcopy/SKILL.md +232 -0
  204. package/marketplace/skills/middleware-patterns/SKILL.md +363 -0
  205. package/marketplace/skills/mobile-responsive-ux/SKILL.md +287 -0
  206. package/marketplace/skills/mutation-testing/SKILL.md +112 -0
  207. package/marketplace/skills/naming-conventions/SKILL.md +112 -0
  208. package/marketplace/skills/observability-modeling/SKILL.md +59 -0
  209. package/marketplace/skills/ontology-modeling/SKILL.md +67 -0
  210. package/marketplace/skills/owasp-security/SKILL.md +153 -0
  211. package/marketplace/skills/pattern-recognition/SKILL.md +472 -0
  212. package/marketplace/skills/performance-budgets/SKILL.md +185 -0
  213. package/marketplace/skills/performance-engineering/SKILL.md +58 -0
  214. package/marketplace/skills/performance-testing/SKILL.md +125 -0
  215. package/marketplace/skills/printify/SKILL.md +42 -0
  216. package/marketplace/skills/prioritization/SKILL.md +118 -0
  217. package/marketplace/skills/problem-framing/SKILL.md +41 -0
  218. package/marketplace/skills/problem-locating-solving/SKILL.md +203 -0
  219. package/marketplace/skills/project-knowledge-extraction/SKILL.md +54 -0
  220. package/marketplace/skills/prompt-craft/SKILL.md +134 -0
  221. package/marketplace/skills/prompt-injection-defense/SKILL.md +132 -0
  222. package/marketplace/skills/property-based-testing/SKILL.md +100 -0
  223. package/marketplace/skills/prototyping/SKILL.md +43 -0
  224. package/marketplace/skills/query-optimization/SKILL.md +144 -0
  225. package/marketplace/skills/real-time-updates/SKILL.md +324 -0
  226. package/marketplace/skills/ref-patterns/SKILL.md +284 -0
  227. package/marketplace/skills/refactor/SKILL.md +65 -0
  228. package/marketplace/skills/rendering-models/SKILL.md +142 -0
  229. package/marketplace/skills/replication-patterns/SKILL.md +110 -0
  230. package/marketplace/skills/research-synthesis/SKILL.md +41 -0
  231. package/marketplace/skills/route-handler-design/SKILL.md +347 -0
  232. package/marketplace/skills/schema-evolution/SKILL.md +140 -0
  233. package/marketplace/skills/security-fundamentals/SKILL.md +139 -0
  234. package/marketplace/skills/semantic-center/SKILL.md +194 -0
  235. package/marketplace/skills/semantic-relations/SKILL.md +250 -0
  236. package/marketplace/skills/semantics/SKILL.md +366 -0
  237. package/marketplace/skills/semiotics/SKILL.md +230 -0
  238. package/marketplace/skills/seo-strategy/SKILL.md +260 -0
  239. package/marketplace/skills/server-actions-design/SKILL.md +243 -0
  240. package/marketplace/skills/server-components-design/SKILL.md +190 -0
  241. package/marketplace/skills/sharding-strategy/SKILL.md +123 -0
  242. package/marketplace/skills/shopify/SKILL.md +42 -0
  243. package/marketplace/skills/skill-infrastructure/SKILL.md +320 -0
  244. package/marketplace/skills/skill-router/SKILL.md +71 -0
  245. package/marketplace/skills/skill-scaffold/SKILL.md +105 -0
  246. package/marketplace/skills/snapshot-testing/SKILL.md +120 -0
  247. package/marketplace/skills/spec-driven-development/SKILL.md +148 -0
  248. package/marketplace/skills/state-machine-modeling/SKILL.md +56 -0
  249. package/marketplace/skills/state-management/SKILL.md +134 -0
  250. package/marketplace/skills/streaming-architecture/SKILL.md +194 -0
  251. package/marketplace/skills/summarization/SKILL.md +156 -0
  252. package/marketplace/skills/suspense-patterns/SKILL.md +265 -0
  253. package/marketplace/skills/system-interface-contracts/SKILL.md +59 -0
  254. package/marketplace/skills/task-analysis/SKILL.md +201 -0
  255. package/marketplace/skills/taxonomy-design/SKILL.md +66 -0
  256. package/marketplace/skills/test-coverage-strategy/SKILL.md +108 -0
  257. package/marketplace/skills/test-doubles-design/SKILL.md +98 -0
  258. package/marketplace/skills/test-driven-development/SKILL.md +96 -0
  259. package/marketplace/skills/testing-strategy/SKILL.md +67 -0
  260. package/marketplace/skills/theme-system-design/SKILL.md +43 -0
  261. package/marketplace/skills/tool-call-flow/SKILL.md +229 -0
  262. package/marketplace/skills/tool-call-strategy/SKILL.md +292 -0
  263. package/marketplace/skills/transaction-isolation/SKILL.md +98 -0
  264. package/marketplace/skills/type-safety/SKILL.md +177 -0
  265. package/marketplace/skills/typography-system/SKILL.md +43 -0
  266. package/marketplace/skills/usability-testing/SKILL.md +43 -0
  267. package/marketplace/skills/user-research/SKILL.md +43 -0
  268. package/marketplace/skills/vercel-composition-patterns/SKILL.md +157 -0
  269. package/marketplace/skills/version-control/SKILL.md +233 -0
  270. package/marketplace/skills/visual-design-foundations/SKILL.md +59 -0
  271. package/marketplace/skills/visual-hierarchy/SKILL.md +43 -0
  272. package/marketplace/skills/webhook-integration/SKILL.md +331 -0
  273. package/marketplace/skills/writing-humanizer/SKILL.md +380 -0
  274. package/package.json +67 -0
  275. package/schemas/manifest.schema.json +811 -0
  276. package/schemas/manifest.v2.schema.json +164 -0
  277. package/schemas/manifest.v3.schema.json +758 -0
  278. package/schemas/manifest.v4.schema.json +755 -0
  279. package/schemas/manifest.v5.schema.json +755 -0
  280. package/schemas/manifest.v6.schema.json +811 -0
  281. package/schemas/skill.context.jsonld +279 -0
  282. package/schemas/skill.schema.json +919 -0
  283. package/schemas/skill.v2.schema.json +201 -0
  284. package/schemas/skill.v3.schema.json +827 -0
  285. package/schemas/skill.v4.schema.json +822 -0
  286. package/schemas/skill.v5.schema.json +830 -0
  287. package/schemas/skill.v6.schema.json +946 -0
  288. package/schemas/vocabulary/keywords.json +180 -0
  289. package/schemas/vocabulary/workspace_tags.json +23 -0
  290. package/scripts/__tests__/migrate-skill-v2-to-v3.test.js +161 -0
  291. package/scripts/__tests__/migrate-skill-v3-to-v4.test.js +158 -0
  292. package/scripts/__tests__/test-export-parser-drift.js +149 -0
  293. package/scripts/__tests__/test-marketplace-export.js +114 -0
  294. package/scripts/__tests__/test-router-paths.js +82 -0
  295. package/scripts/__tests__/test-stability-promotion.js +244 -0
  296. package/scripts/__tests__/test-v3-1-alias-contract.js +109 -0
  297. package/scripts/__tests__/test-v3-1-skos-runtime.js +116 -0
  298. package/scripts/backfill-schema-version.js +198 -0
  299. package/scripts/build-field-reference.js +160 -0
  300. package/scripts/build-retrieval-baseline.js +511 -0
  301. package/scripts/check-markdown-links.js +211 -0
  302. package/scripts/check-protocol-consistency.js +979 -0
  303. package/scripts/export-marketplace-skills.js +610 -0
  304. package/scripts/export-skill.js +374 -0
  305. package/scripts/generate-manifest.js +787 -0
  306. package/scripts/lib/alias-contract.js +83 -0
  307. package/scripts/lib/audit-prompt-builder.js +771 -0
  308. package/scripts/lib/mock-grader.js +134 -0
  309. package/scripts/lib/parse-frontmatter.js +429 -0
  310. package/scripts/lib/roots.js +119 -0
  311. package/scripts/lint/check-archetype-sections.js +185 -0
  312. package/scripts/lint/check-category-enum.js +83 -0
  313. package/scripts/lint/check-routing-eval.js +146 -0
  314. package/scripts/lint/check-routing-quality.js +211 -0
  315. package/scripts/lint/check-stability-promotion.js +220 -0
  316. package/scripts/lint/format-code-frame.js +206 -0
  317. package/scripts/marketplace-install.js +125 -0
  318. package/scripts/migrate-category-to-enum.js +169 -0
  319. package/scripts/migrate-skill-v2-to-v3.js +424 -0
  320. package/scripts/migrate-skill-v3-to-v4.js +200 -0
  321. package/scripts/migrate-skill-v5-to-v6.js +304 -0
  322. package/scripts/restructure-by-category.js +85 -0
  323. package/scripts/seed-publication-classification.js +282 -0
  324. package/scripts/skill-audit.js +893 -0
  325. package/scripts/skill-graph-drift.js +483 -0
  326. package/scripts/skill-graph-route.js +766 -0
  327. package/scripts/skill-graph-routing-eval.js +393 -0
  328. package/scripts/skill-lint.js +1317 -0
  329. package/scripts/skill-overlap.js +213 -0
  330. package/scripts/verify-skill-md-export.js +201 -0
@@ -0,0 +1,147 @@
1
+ ---
2
+ name: eval-driven-development
3
+ description: "Use when reasoning about building language-model-integrated systems by writing evaluations before and alongside the system: the statistical (not binary) nature of LLM evals, the five primitives (dataset, evaluation function, aggregation, iteration loop, regression budget), the judgment-mechanism taxonomy (programmatic, model-graded, human-graded, preference comparison), the difference between system-specific evals and canonical benchmarks (MMLU, HumanEval, BIG-bench, GAIA), how evals drive prompt/model/scaffolding/tooling changes, why Goodhart's Law means higher eval scores are not always improvements, and the offline-eval-vs-production-telemetry distinction. Do NOT use for deterministic unit testing (use testing-strategy), production monitoring (use evaluation or error-tracking), general-software TDD (use testing-strategy), or the construction of individual eval rubrics and task sets (use agent-eval-design — it owns construction; this skill owns the iteration discipline)."
4
+ license: MIT
5
+ allowed-tools: Read Grep
6
+ metadata:
7
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"agent\",\"domain\":\"agent/evaluation\",\"scope\":\"reference\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-16\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-16\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"comprehension_state\":\"present\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"eval-driven development\\\\\\\",\\\\\\\"LLM evals\\\\\\\",\\\\\\\"evaluation harness\\\\\\\",\\\\\\\"benchmark\\\\\\\",\\\\\\\"HumanEval\\\\\\\",\\\\\\\"MMLU\\\\\\\",\\\\\\\"BIG-bench\\\\\\\",\\\\\\\"GAIA\\\\\\\",\\\\\\\"LLM-as-judge\\\\\\\",\\\\\\\"model-graded eval\\\\\\\",\\\\\\\"pass rate\\\\\\\",\\\\\\\"regression budget\\\\\\\",\\\\\\\"Goodhart's law\\\\\\\",\\\\\\\"golden dataset\\\\\\\",\\\\\\\"reference-free eval\\\\\\\"]\",\"triggers\":\"[\\\\\\\"how do we know this prompt change improved things\\\\\\\",\\\\\\\"should this be an eval or a unit test\\\\\\\",\\\\\\\"the model passes the benchmark but fails in production\\\\\\\",\\\\\\\"what should we measure\\\\\\\",\\\\\\\"the LLM-as-judge gives different scores each run\\\\\\\"]\",\"examples\":\"[\\\\\\\"design an offline eval suite for an LLM-integrated summarization feature before writing the prompt\\\\\\\",\\\\\\\"decide between programmatic grading, model-graded judgment, and human review for a freeform-output eval\\\\\\\",\\\\\\\"explain why MMLU score is a poor predictor of a domain-specific assistant's quality\\\\\\\",\\\\\\\"structure an iteration loop where each prompt change is gated by a regression budget\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"write unit tests for a deterministic data transformation (use testing-strategy)\\\\\\\",\\\\\\\"set up production alerting on API error rates (use observability)\\\\\\\",\\\\\\\"interpret a specific benchmark's leaderboard (use 
benchmarking-engine)\\\\\\\"]\",\"relations\":\"{\\\\\\\"related\\\\\\\":[\\\\\\\"tool-call-flow\\\\\\\",\\\\\\\"prompt-injection-defense\\\\\\\",\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"type-safety\\\\\\\",\\\\\\\"agent-eval-design\\\\\\\",\\\\\\\"evaluation\\\\\\\"],\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"testing-strategy owns deterministic-software testing where every run is binary pass/fail; this skill owns LLM evaluation where every run is a sample from a distribution and pass-rate is the unit of judgment. The disciplines share vocabulary (suite, gate, regression) but the math underneath differs.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"tool-call-flow\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"tool-call-flow owns the protocol cycle by which a model invokes tools; this skill owns the discipline of measuring whether that cycle produces correct behavior. Tool-call evals are a specialization of the general pattern.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"prompt-injection-defense\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"prompt-injection-defense owns the security property; this skill owns the measurement of whether the property holds. Red-team evals against an injection corpus are one application of eval-driven-development.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"agent-eval-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"agent-eval-design owns the construction of evals — task sets, rubrics, graders, hard negatives, traces; this skill owns the development discipline that uses constructed evals to gate every change to prompt, model, retrieval, scaffolding, or tooling. The two compose: agent-eval-design produces the suite; this skill applies it.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"type-safety\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"type-safety owns the compile-time property of programs; this skill owns the runtime-distributional property of LLM outputs. 
They are both validate-at-the-boundary disciplines with different threat models.\\\\\\\"}],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"agent-eval-design\\\\\\\"]}\",\"mental_model\":\"|\",\"purpose\":\"|\",\"boundary\":\"|\",\"analogy\":\"Eval-driven development is to LLM system engineering what crash-test ratings are to automotive safety — you do not ship a car based on how well it parked in your driveway; you ship it after a battery of standardized tests on representative crash scenarios, with the pass-rate against named criteria as the gating signal. A score of 4.3 stars across the suite is the only defensible claim of 'safer'; a developer's intuition that 'the new model feels smarter' is the unmeasured equivalent of 'I drove it home, it seemed fine.'\",\"misconception\":\"|\",\"concept\":\"{\\\\\\\"definition\\\\\\\":\\\\\\\"Eval-driven development is the practice of building language-model-integrated systems by writing evaluations before and alongside the system, where each evaluation defines a behavioral criterion the system must satisfy on a representative input set, and the suite's aggregated pass-rate signal gates every change to the prompt, model, retrieval, scaffolding, or tooling. Evals are the LLM analog of automated tests for deterministic software with one fundamental difference: LLM evals are statistical (pass-rate over a sampled population) rather than binary (pass/fail per run), because the system under test is itself stochastic. 
The discipline is the rigorous separation of generation (what the system produces) from judgment (how it is scored) with explicit accounting for the uncertainty in both.\\\\\\\",\\\\\\\"mental_model\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"purpose\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"boundary\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"taxonomy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"analogy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"misconception\\\\\\\":\\\\\\\"|\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/eval-driven-development/SKILL.md\"}"
8
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
9
+ skill_graph_protocol: Skill Metadata Protocol v4
10
+ skill_graph_project: Skill Graph
11
+ skill_graph_canonical_skill: skills/eval-driven-development/SKILL.md
12
+ ---
13
+
14
+ # Eval-Driven Development
15
+
16
+ ## Coverage
17
+
18
+ The practice of building language-model-integrated systems by writing evaluations before and alongside the system, and using the eval suite's aggregated pass-rate signal to gate every change. Covers the statistical (not binary) nature of LLM evaluation, the five primitives (dataset, evaluation function, aggregation, iteration loop, regression budget), the judgment-mechanism taxonomy (programmatic / model-graded / human-graded / hybrid), the distinction between system-specific evals and canonical public benchmarks (MMLU, HumanEval, BIG-bench, GAIA, MT-Bench), why higher scores are not always improvements (Goodhart's Law), the difference between offline evals and production telemetry, and the eval-lifecycle archetypes (acceptance, regression, calibration, red-team, cross-model).
19
+
20
+ ## Philosophy
21
+
22
+ Building LLM-integrated systems without evals is like shipping airplanes based on how good the model feels at the desk. The system's behavior is stochastic, the input space is open-ended, and the developer's pet examples are not a representative sample of what users will throw at it. An eval suite is the empirical measurement instrument that lets a team distinguish "the new prompt works better" from "the new prompt works better on the five examples I happened to try."
23
+
24
+ The discipline's hard part is not writing evals. It is choosing what to measure, encoding the choice into a grader the team agrees with, sampling a dataset that represents production, and resisting the gravitational pull of Goodhart's Law as the eval suite becomes the optimization target. Teams that get this right ship systems whose quality matches their own stated definition of "good." Teams that get this wrong ship systems that ace evals and disappoint users.
25
+
26
+ Eval-driven development is not test-driven development with extra noise. It is empirical engineering applied to systems whose behavior is a distribution rather than a value. The vocabulary overlaps; the math underneath does not.
27
+
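To make "a distribution rather than a value" concrete, here is a minimal sketch that is not part of the package: it computes a Wilson score interval for a pass rate. The function name `wilson_interval` and the 95% default (`z=1.96`) are illustrative assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 approximates 95% confidence."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


small = wilson_interval(5, 5)      # five hand-picked examples, all passing
large = wilson_interval(180, 200)  # 90% pass rate on a 200-example dataset
print(f"5/5     -> [{small[0]:.2f}, {small[1]:.2f}]")
print(f"180/200 -> [{large[0]:.2f}, {large[1]:.2f}]")
```

Under these assumptions, five passes out of five still leave a lower bound of roughly 0.57, which is why "works on the five examples I tried" is weak evidence compared with a pass rate measured on a few hundred representative inputs.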
28
+ ## The Five Primitives In Practice
29
+
30
+ | Primitive | What it is | Common encoding | Failure mode if neglected |
31
+ |---|---|---|---|
32
+ | Eval dataset | Curated input examples that represent production | JSONL of `{input, reference}` records; checked into version control | "It works for me" with no shared evidence |
33
+ | Evaluation function | Per-example grader producing a score | Python function, model-graded prompt template, or human-review UI | Implicit, undocumented definition of "good" |
34
+ | Aggregation | Statistical summary across the dataset | Pass-rate, weighted pass-rate, stratified pass-rate, distribution | Headline number obscures pattern of failure |
35
+ | Iteration loop | Eval → diagnose → change → eval | CI-integrated pipeline; eval results in PR comment | Changes ship without measurement |
36
+ | Regression budget | Defined acceptable change per metric | Per-eval policy: "must not regress" / "improvement gates merge" / "watchful" | Every change becomes a debate about the headline number |
37
+
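The first three primitives in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the package: the `stratum` field, the `fake_system` stand-in, and all function names are hypothetical, and the dataset is inlined where a real suite would read a versioned JSONL file.

```python
import json
from collections import defaultdict

# Eval dataset: JSONL of {input, reference} records (inlined for the sketch).
# The "stratum" field is an illustrative addition used for stratified reporting.
DATASET_JSONL = """\
{"input": "2+2", "reference": "4", "stratum": "arithmetic"}
{"input": "capital of France", "reference": "Paris", "stratum": "geography"}
{"input": "3*3", "reference": "9", "stratum": "arithmetic"}
"""


def load_dataset(text):
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def exact_match(output, reference):
    """Evaluation function: programmatic per-example grader returning 1.0 or 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_eval(system, dataset, grader=exact_match):
    """Aggregation: overall pass-rate plus a per-stratum breakdown."""
    by_stratum = defaultdict(list)
    for record in dataset:
        score = grader(system(record["input"]), record["reference"])
        by_stratum[record.get("stratum", "all")].append(score)
    scores = [s for group in by_stratum.values() for s in group]
    return {
        "pass_rate": sum(scores) / len(scores),
        "by_stratum": {k: sum(v) / len(v) for k, v in by_stratum.items()},
    }


def fake_system(prompt):
    """Hypothetical stand-in for the LLM-backed system under test."""
    return {"2+2": "4", "capital of France": "paris", "3*3": "6"}.get(prompt, "")


report = run_eval(fake_system, load_dataset(DATASET_JSONL))
print(report)
```

The headline pass-rate here is two out of three, but the stratified view shows that every failure sits in the arithmetic stratum, which is the "headline number obscures pattern of failure" risk named in the table.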
38
+ ## Judgment Mechanism Selection
39
+
40
+ | Mechanism | Best for | Cost per example | Reliability | Watch for |
41
+ |---|---|---|---|---|
42
+ | Programmatic | Outputs with mechanical correctness (code, JSON validity, exact match) | $0 | Deterministic | Narrow applicability — won't work for freeform output |
43
+ | Model-graded | Open-output tasks at scale (summarization, classification, freeform Q&A) | $0.001-$0.10 per grade depending on model | Correlated error with the system being graded | Verbosity bias, position bias in pairwise; calibrate against humans |
44
+ | Human-graded | Subjective criteria, calibration runs, ambiguous outputs | $1-$50 per grade depending on annotator | Highest validity; lowest scale | Inter-rater agreement must be measured; one rater is not "humans think" |
45
+ | Hybrid | Production systems | Mixed | Mixed | Standard setup: programmatic checks gate, model grading provides scale, human samples calibrate |
46
+
47
+ A practical default: programmatic checks for the parts you can mechanically verify, model-graded for the open parts, periodic human review to calibrate.
48
+
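Calibrating a model grader against human review needs an agreement statistic. The sketch below uses Cohen's kappa, which is one common choice and not something this document prescribes; `cohensKappa` and the label arrays are hypothetical names, and labels are assumed to be binary pass/fail.

```typescript
// Cohen's kappa for two raters assigning pass/fail labels to the same examples.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(graderLabels: boolean[], humanLabels: boolean[]): number {
  if (graderLabels.length !== humanLabels.length || graderLabels.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const n = graderLabels.length;
  let agree = 0;
  let graderPass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (graderLabels[i] === humanLabels[i]) agree++;
    if (graderLabels[i]) graderPass++;
    if (humanLabels[i]) humanPass++;
  }
  const observed = agree / n;
  const chance =
    (graderPass / n) * (humanPass / n) +
    (1 - graderPass / n) * (1 - humanPass / n);
  // Both raters gave the same constant label: kappa is formally undefined.
  // This sketch reports 1 in that case, which is an assumption, not a standard.
  if (chance === 1) return 1;
  return (observed - chance) / (1 - chance);
}

// Illustrative labels: the grader and the human agree on 8 of 10 examples.
const graderSample = [true, true, true, false, false, true, false, true, true, false];
const humanSample = [true, true, false, false, false, true, true, true, true, false];
const sampleKappa = cohensKappa(graderSample, humanSample);
```

Raw agreement here is 0.8, but kappa is lower because two raters who each pass 60% of examples would agree fairly often by chance alone. That gap is why agreement should be measured rather than assumed.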
49
+ ## Iteration Loop Discipline
50
+
51
+ The eval-driven iteration loop is the development cycle. Run the suite, diagnose failures, identify the change, re-run, gate the change on regression budget.
52
+
53
+ ```
54
+ ┌─────────┐    ┌──────────┐    ┌──────────────┐    ┌──────────┐
55
+ │ Eval    │ -> │ Diagnose │ -> │ Change       │ -> │ Re-run   │
56
+ │ baseline│    │ failures │    │ (prompt,     │    │ Eval     │
57
+ │         │    │          │    │  model,      │    │          │
58
+ │         │    │          │    │  retrieval,  │    │          │
59
+ │         │    │          │    │  tooling)    │    │          │
60
+ └─────────┘    └──────────┘    └──────────────┘    └──────────┘
61
+                                                         │
62
+                                                         v
63
+                                                ┌──────────────────┐
64
+                                                │ Compare against  │
65
+                                                │ regression budget│
66
+                                                │ Merge / iterate  │
67
+                                                └──────────────────┘
68
+ ```
69
+
70
+ The discipline is keeping the suite stable while the system changes. If both move in the same iteration, the comparison anchor is gone: a score change can no longer be attributed to the system rather than to the suite.
71
+
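The final "compare against regression budget" step of the loop can be sketched as a small gate. This is an illustration under assumptions: the three policy names follow the budget types this document names (gating, optimizing, watchful), while `MetricComparison`, `blocksMerge`, `blockingMetrics`, the `tolerance` noise band, and the metric names are hypothetical.

```typescript
// Policy names follow the three budget types named in this document.
type BudgetPolicy = "gating" | "optimizing" | "watchful";

type MetricComparison = {
  metric: string; // hypothetical metric identifier
  policy: BudgetPolicy;
  baseline: number; // pass-rate before the change
  candidate: number; // pass-rate after the change
  tolerance: number; // assumed noise band; differences inside it count as "no change"
};

// Decides whether a single metric blocks the merge under its policy.
function blocksMerge(c: MetricComparison): boolean {
  const delta = c.candidate - c.baseline;
  if (c.policy === "gating") return delta < -c.tolerance; // a real regression blocks
  if (c.policy === "optimizing") return delta <= c.tolerance; // needs a real improvement
  return false; // "watchful": tracked, never blocks
}

// Returns the metrics that block a merge; an empty list means the change may merge.
function blockingMetrics(comparisons: MetricComparison[]): string[] {
  return comparisons.filter(blocksMerge).map((c) => c.metric);
}

// Illustrative comparison of one baseline run against one candidate run.
const sampleComparisons: MetricComparison[] = [
  { metric: "json-validity", policy: "gating", baseline: 0.9, candidate: 0.85, tolerance: 0.02 },
  { metric: "exact-match", policy: "gating", baseline: 0.9, candidate: 0.89, tolerance: 0.02 },
  { metric: "summary-quality", policy: "optimizing", baseline: 0.7, candidate: 0.71, tolerance: 0.02 },
  { metric: "tone", policy: "watchful", baseline: 0.5, candidate: 0.1, tolerance: 0.02 },
];
const sampleBlockers = blockingMetrics(sampleComparisons);
```

Writing the policy down per metric is what removes the debate: `exact-match` moved inside its noise band and passes, `summary-quality` did not improve enough to satisfy an optimizing budget, and `tone` dropped sharply but is only tracked.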
72
+ ## Public Benchmarks — Cited For Grounding, Not For Gating
73
+
74
+ Benchmarks measure cross-system capability against a shared standard. They predict how a model will do on the *exact* tasks the benchmark contains. They do not predict how your specific system, on your specific user inputs, with your specific prompts and retrieval, will perform.
75
+
76
+ | Benchmark | Measures | Cited for |
77
+ |---|---|---|
78
+ | MMLU (Hendrycks et al., 2021) | 57 subjects of multiple-choice general knowledge | Breadth of general capability |
79
+ | HumanEval (Chen et al., 2021) | 164 programming problems graded by test execution | Code-generation correctness baseline |
80
+ | BIG-bench (Srivastava et al., 2022) | 200+ tasks across the long tail of NLP | Breadth of niche capabilities |
81
+ | GAIA (Mialon et al., 2023) | General-assistant multi-step tasks with tool use | Realistic agentic-task baseline |
82
+ | MT-Bench / Chatbot Arena (Zheng et al., 2023) | Pairwise preference comparison for chat | Human-aligned preference signal |
83
+
84
+ The right use: pick a model partly on benchmark performance, then build system-specific evals to gate the actual deployment.
85
+
86
+ ## Goodhart's Law In Eval Practice
87
+
88
+ When the eval becomes the optimization target, the eval ceases to be a good measure. Symptoms:
89
+
90
+ - Pass-rate climbs while human reviewers' confidence in the system flattens or declines.
91
+ - Prompt changes produce phrasings the grader rewards but users dislike (e.g., verbose hedging that scores well on the rubric but reads poorly on screen).
92
+ - The system memorizes patterns specific to the eval dataset (over-fitting to test cases).
93
+ - A held-out evaluation set, scored fresh, shows worse pass-rate than the development set.
94
+
95
+ Defenses:
96
+
97
+ - **Hold out a portion of the dataset from active iteration.** Score it periodically; if held-out and development pass-rates diverge, the iteration is over-fitting.
98
+ - **Periodically refresh the eval dataset.** Replace some examples with new production-sampled inputs to prevent the dataset from going stale.
99
+ - **Calibrate model-graders against humans.** A grader that has drifted from human judgment can produce high pass-rates on outputs humans dislike.
100
+ - **Track multiple criteria.** A single headline number is easier to over-fit than a panel of independent measures.
101
+
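The hold-out defense can be sketched in two pieces: a deterministic split and a divergence check. Everything named here is an assumption for illustration: `stableHash` is an FNV-1a-style hash picked only because it is short, and `isHeldOut`, `isOverfitting`, and the `maxGap` threshold are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; any stable hash would do.
function stableHash(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Deterministic assignment: the same example id always lands in the same split,
// so examples do not migrate between development and held-out sets across runs.
// `heldOutPercent` is an integer from 0 to 100.
function isHeldOut(exampleId: string, heldOutPercent: number): boolean {
  return stableHash(exampleId) % 100 < heldOutPercent;
}

// Flags over-fitting when the development pass-rate exceeds the held-out
// pass-rate by more than an agreed gap. The gap value is a team decision.
function isOverfitting(
  devPassRate: number,
  heldOutPassRate: number,
  maxGap: number,
): boolean {
  return devPassRate - heldOutPassRate > maxGap;
}
```

Hashing the example id rather than shuffling keeps the split stable as the dataset grows: adding new examples never moves an existing one out of the held-out set.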
102
+ ## What This Skill Is Not
103
+
104
+ This skill is the *concept* of eval-driven development. Specific topics with their own scope:
105
+ - The mechanics of running evals in CI/CD pipelines belong to a tooling skill.
106
+ - The construction of individual eval rubrics, task sets, graders, and hard negatives belongs to `agent-eval-design`.
107
+ - The deterministic testing of non-LLM code belongs to `testing-strategy`.
108
+ - The production monitoring of running systems belongs to observability and reliability skills.
109
+ - The obra/superpowers `test-driven-development` skill (on skills.sh) is a process-shape workflow skill for general software TDD; this one is the concept-shape complement for the LLM-specific evaluation discipline.
110
+
111
+ ## Verification
112
+
113
+ After applying this skill, verify:
114
+ - [ ] An eval dataset exists, is checked into version control, and is representative of the system's actual production input distribution.
115
+ - [ ] Each eval criterion has a defined judgment mechanism (programmatic, model-graded, or human-graded), and the mechanism's known biases are accounted for.
116
+ - [ ] Aggregation reports pass-rate with sample size and either a confidence interval or a defined minimum-detectable-change threshold.
117
+ - [ ] Each eval has an explicit regression budget: gating (no regression allowed), optimizing (improvement gates merge), or watchful (tracked, not gated).
118
+ - [ ] Model-graded evals have been calibrated against human review on a sample within the past quarter (or whatever cadence the team has agreed to).
119
+ - [ ] A held-out portion of the dataset is reserved from active iteration and scored periodically to detect over-fitting.
120
+ - [ ] The eval suite is integrated into the change workflow — prompt changes, model upgrades, retrieval changes, and tooling changes are all gated by the suite.
121
+ - [ ] Public benchmarks (MMLU, HumanEval, etc.) are cited for model-selection grounding but are not the gating decision for system-specific quality.
122
+ - [ ] Production telemetry exists separately from the offline eval suite; one is not used as a substitute for the other.
123
+ - [ ] The shipping threshold is a product decision documented somewhere, not an emergent average across team opinion.
124
+
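The aggregation item in the checklist asks for a pass-rate with sample size and a confidence interval. One way to compute such an interval is the Wilson score interval, sketched below; the choice of Wilson over other intervals is an assumption of this example, and `wilsonInterval` is a hypothetical name.

```typescript
// Wilson score interval for a pass-rate. z = 1.96 approximates a 95% interval.
// Returns [lower, upper], both clamped to the range 0..1.
function wilsonInterval(
  passes: number,
  total: number,
  z: number = 1.96,
): [number, number] {
  if (total <= 0) throw new Error("total must be positive");
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Illustrative use: 80 passes out of 100 examples.
const sampleInterval = wilsonInterval(80, 100);
```

With 100 examples the interval around an 80% pass-rate spans roughly 15 percentage points, so a two-point movement between runs is not evidence of a real change at that sample size.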
125
+ ## Do NOT Use When
126
+
127
+ | Instead of this skill | Use | Why |
128
+ |---|---|---|
129
+ | Writing deterministic unit tests for non-LLM code | `testing-strategy` | testing-strategy owns binary pass/fail testing; this skill owns distributional measurement |
130
+ | Designing individual eval rubrics, task sets, graders, hard negatives | `agent-eval-design` | agent-eval-design owns eval construction; this skill owns the iteration discipline that uses constructed evals |
131
+ | Setting up production monitoring, alerting, or telemetry | `evaluation` (general framing) or `error-tracking` | those own runtime measurement of deployed systems; this skill owns offline pre-deployment measurement |
132
+ | Reasoning about the protocol cycle of tool calls | `tool-call-flow` | tool-call-flow owns the cycle; eval-driven development can measure tool-call correctness as one criterion |
133
+ | Defending against prompt injection | `prompt-injection-defense` | prompt-injection-defense owns the security property; this skill can measure whether the defense holds |
134
+ | General software TDD process | the obra/superpowers `test-driven-development` skill or `testing-strategy` | TDD is process-shape for general software; this skill is concept-shape for the LLM-specific evaluation discipline |
135
+
136
+ ## Key Sources
137
+
138
+ - Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). ["Measuring Massive Multitask Language Understanding"](https://arxiv.org/abs/2009.03300). The MMLU benchmark paper; foundational reference for cross-system general-knowledge evaluation.
139
+ - Chen, M., Tworek, J., Jun, H., Yuan, Q., et al. (2021). ["Evaluating Large Language Models Trained on Code"](https://arxiv.org/abs/2107.03374). The HumanEval benchmark paper; foundational for code-generation evaluation.
140
+ - Srivastava, A., et al. (2022). ["Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models"](https://arxiv.org/abs/2206.04615). The BIG-bench paper; canonical reference for breadth-of-capability evaluation across 200+ tasks.
141
+ - Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., & Scialom, T. (2023). ["GAIA: A Benchmark for General AI Assistants"](https://arxiv.org/abs/2311.12983). The GAIA benchmark paper; canonical reference for evaluating multi-step assistant tasks with tool use.
142
+ - Zheng, L., Chiang, W.-L., et al. (2023). ["Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"](https://arxiv.org/abs/2306.05685). The MT-Bench paper; canonical reference for LLM-as-judge methodology, including known biases.
143
+ - OpenAI. [Evals framework on GitHub](https://github.com/openai/evals). Open-source framework for writing and running LLM evals; documents the practical mechanics of the discipline.
144
+ - Anthropic. [Building evals — Anthropic Cookbook](https://github.com/anthropics/anthropic-cookbook/tree/main/skills/classification) and [Evaluation guide](https://docs.anthropic.com/en/docs/test-and-evaluate/develop-tests). Practitioner-oriented guidance on building eval suites.
145
+ - UK AI Safety Institute. [Inspect: An open-source evaluation framework](https://inspect.ai-safety-institute.org.uk/). Open framework purpose-built for capability and safety evaluations of LLMs.
146
+ - Goodhart, C. (1975). "Problems of Monetary Management: The U.K. Experience." The origin of Goodhart's Law. The commonly cited wording, "when a measure becomes a target, it ceases to be a good measure," is Marilyn Strathern's later paraphrase rather than Goodhart's own phrasing.
147
+ - Liang, P., et al. (2022). ["Holistic Evaluation of Language Models"](https://arxiv.org/abs/2211.09110). The HELM framework paper; argues for multi-metric eval across many dimensions as a counter to single-metric Goodharting.
@@ -0,0 +1,113 @@
1
+ ---
2
+ name: evaluation
3
+ description: "This skill provides a structured framework for automated agentic evaluation and human feedback loops. It defines the 'Critic' persona, the 5-point quality scale, and the mandatory 'Evaluation-Revision' loop for all critical work."
4
+ license: MIT
5
+ compatibility: "Markdown, Git, agent-skill runtimes"
6
+ allowed-tools: Read Grep Bash
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"engineering\",\"domain\":\"engineering/skill-system\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-03-29\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-03-29\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"evaluation\\\\\\\",\\\\\\\"quality gate\\\\\\\",\\\\\\\"feedback loop\\\\\\\",\\\\\\\"critic persona\\\\\\\",\\\\\\\"evaluation revision\\\\\\\",\\\\\\\"completion contract\\\\\\\",\\\\\\\"score 1-5\\\\\\\",\\\\\\\"skeptical critic\\\\\\\"]\",\"triggers\":\"[\\\\\\\"evaluation-skill\\\\\\\",\\\\\\\"critic-loop-skill\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[\\\\\\\"code-review\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":90,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/evaluation/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/evaluation/SKILL.md
13
+ ---
14
+ # Evaluation Skill (Critic Persona)
15
+
16
+ ## Domain Context
17
+
18
+ **What is this skill?** This skill provides a structured framework for automated agentic evaluation and human feedback loops. It defines the 'Critic' persona, the 5-point quality scale, and the mandatory 'Evaluation-Revision' loop for all critical work.
19
+
20
+ ## Coverage
21
+
22
+ This skill covers the Skeptical Critic evaluation framework for AI agent work: the 5-point quality scale (Broken through State-of-the-Art), the mandatory Evaluation-Revision loop for all critical work, Goal Alignment auditing against implementation plans, Pattern Compliance checking (financial nullish coalescing, composition-theory focal points, semantic naming), the evaluation process flow (Persona, Identify, Score, Critique, Iterate), and the repo-specific completion contract (code + docs + Linear + verification evidence).
23
+
24
+ ## Philosophy
25
+
26
+ Without this skill, agents self-assess as "done" after code compiles and tests pass, missing the completion contract that includes documentation updates, Linear task reporting, verification evidence, and adherence to project-specific patterns. The Skeptical Critic persona exists because agents are systematically optimistic about their own output quality. By forcing a numeric score with explicit gap identification, this skill prevents the "ship it because it works" failure mode that produces technically functional but incomplete deliverables.
27
+
28
+ ## MANDATORY ACTIVATION TRIGGER (SYSTEM DIRECTIVE)
29
+ This skill MUST be activated after EVERY meaningful implementation or cleanup step. It acts as a final 'Skeptical Critic' to ensure no regressions or half-baked solutions are shipped.
30
+
31
+ ### EVALUATION RUBRIC (1-5 SCALE)
32
+ 1. **Broken/Unusable**: Fails basic requirements, contains bugs or syntax errors.
33
+ 2. **Partial/Risky**: Meets some goals but lacks edge-case coverage or violates security rules.
34
+ 3. **Functional/Standard**: Meets all requirements but lacks 'Craft' (polish, docs, naming).
35
+ 4. **Polished/Repo-Compliant**: High quality, full doc updates, passes all local tests.
36
+ 5. **State-of-the-Art**: World-class implementation with zero-defect proof and proactive optimizations.
37
+
38
+ ## Core Mandate
39
+
40
+ You MUST evaluate implementation against **Goal Alignment** and **Pattern Compliance**.
41
+
42
+ ### 1. Goal Alignment (Audit)
43
+ - **Check**: Does the output solve the user's specific request?
44
+ - **Verify**: Cross-reference the `implementation_plan.md`. If any numbered requirement is missing (e.g., "tax reconciliation"), the score MUST be below 3.
45
+
46
+ ### 2. Pattern Compliance (Grounded Evaluation)
47
+ You MUST proactively check for these project-specific requirements:
48
+ - **Financial Logic**: Does every calculation use nullish coalescing or zero-guards? (Pattern: `(val ?? 0)`). If missing, score is < 3.
49
+ - **UI & Layout**: Does the page have a clear L1 focal point according to `composition-theory`?
50
+ - **Naming**: Does it follow `semantics` rules (no generic suffixes like `Helper` or `Utils`)?
51
+
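The `(val ?? 0)` pattern from the Financial Logic check can be illustrated as follows. This is a minimal sketch: `LineItem`, `lineTotal`, and `orderTotal` are hypothetical names, and amounts are assumed to be integers in minor units (for example cents).

```typescript
// Hypothetical line item; any field may be absent or null in upstream data.
type LineItem = {
  unitPriceCents: number | null;
  quantity?: number;
  discountCents?: number | null;
};

// Guarded: every operand passes through `?? 0`, so a missing value contributes 0
// instead of turning the whole result into NaN.
function lineTotal(item: LineItem): number {
  return (
    (item.unitPriceCents ?? 0) * (item.quantity ?? 0) - (item.discountCents ?? 0)
  );
}

function orderTotal(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + lineTotal(item), 0);
}

// Illustrative order: the second item has no price, the third has a null discount.
const sampleOrderTotal = orderTotal([
  { unitPriceCents: 10, quantity: 2 },
  { unitPriceCents: null, quantity: 3 },
  { unitPriceCents: 5, quantity: 1, discountCents: null },
]);
```

Note that `??` only replaces `null` and `undefined`. A value that is already `NaN` passes straight through, so the guard protects against missing data, not against bad arithmetic upstream.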
52
+ ## Quality Scale (1-5)
53
+
54
+ You MUST assign a numeric score (1-5) and a one-sentence justification for each score.
55
+
56
+ 1. **Broken/Dangerous**: Logic errors, security flaws, or incomplete file writes.
57
+ 2. **Suboptimal**: Violates key patterns (missing `?? 0`, hardcoded strings, no focal point).
58
+ 3. **Acceptable**: Meets functional goals and matches most patterns.
59
+ 4. **Professional**: Follows all domain skills and design guidelines; well-documented.
60
+ 5. **Exceptional**: Proactively handles edge cases (Skeleton states, error boundaries).
61
+
62
+ ## Evaluation Process
63
+
64
+ 1. **Persona**: Adopt the **Skeptical Critic** mindset.
65
+ 2. **Identify**: List all changed files and their associated domain skills.
66
+ 3. **Score**: Assign a 1-5 score.
67
+ 4. **Critique**: List exactly what is missing or violates patterns.
68
+ 5. **Iterate**: Do not consider a task "Done" until scores are >= 4.
69
+
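The scoring rules stated in this skill (a missing plan requirement or a missing financial guard forces the score below 3, stale docs rule out a score of 4 or more, and "Done" requires at least 4) can be encoded as a small gate. This is one possible reading of those rules, not the skill's implementation; `CriticFindings`, `effectiveScore`, and `isDone` are hypothetical names.

```typescript
// Hypothetical summary of what the critic found for one task.
type CriticFindings = {
  proposedScore: number; // 1-5, as initially assigned by the critic
  missingPlanRequirements: string[];
  missingFinancialGuards: boolean;
  docsStale: boolean;
};

// Applies the caps described in the skill to the critic's proposed score.
function effectiveScore(f: CriticFindings): number {
  let score = Math.min(5, Math.max(1, Math.round(f.proposedScore)));
  if (f.missingPlanRequirements.length > 0 || f.missingFinancialGuards) {
    score = Math.min(score, 2); // "below 3"
  }
  if (f.docsStale) {
    score = Math.min(score, 3); // cannot be >= 4 with stale docs
  }
  return score;
}

// A task counts as "Done" only when the capped score is at least 4.
function isDone(f: CriticFindings): boolean {
  return effectiveScore(f) >= 4;
}
```

Encoding the caps separately from the proposed score keeps the critic's judgment visible while making the hard rules impossible to score around.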
70
+ ## Key Files
71
+
72
+ | File | Why it matters |
73
+ |------|---------------|
74
+ | `docs/quality-doctrine.md` | The definition of "better" for Skill Graph artifacts |
75
+ | `skills/agent-eval-design/SKILL.md` | Agent, prompt, router, and skill evaluation design |
76
+ | `skills/code-review/SKILL.md` | Diff-level correctness and risk review |
77
+ | `skills/testing-strategy/SKILL.md` | Test planning and verification strategy |
78
+ | `skills/usability-testing/SKILL.md` | UI-specific evaluation gates |
79
+
80
+ ## Reference Files
81
+ - [quality-doctrine](https://github.com/jacob-balslev/skill-graph/blob/main/docs/quality-doctrine.md): The definition of "better."
82
+ - [usability-testing](https://github.com/jacob-balslev/skill-graph/blob/main/skills/usability-testing/SKILL.md): UI-specific evaluation gates.
83
+
84
+ ## Anti-Patterns
85
+
86
+ | Anti-pattern | Why it fails | Do instead |
87
+ |---|---|---|
88
+ | Scoring >= 4 with missing docs | Violates the completion contract — code alone is never "done" | Check AGENTS.md routing table before scoring |
89
+ | Accepting "tests pass" as full verification | Narrow test success hides missing integration, core-loop, or edge-case coverage | Require evidence of the repo's canonical verification gates |
90
+ | Self-scoring without the Skeptical Critic persona | Agents are systematically optimistic about their own output | Adopt the Critic mindset explicitly before assigning any score |
91
+ | Treating evaluation as code review | Evaluation is holistic quality gating; code review is line-by-line technical analysis | Route technical review to `code-review`, keep evaluation for the completion contract |
92
+ | Skipping financial null-guard checks | Financial logic without `?? 0` guards produces silent NaN/undefined bugs | Always verify nullish coalescing on every financial calculation |
93
+
94
+ ## Verification
95
+
96
+ After applying this skill, verify:
97
+ - [ ] A numeric 1-5 score was assigned with a 1-sentence justification
98
+ - [ ] All changed files were identified and checked against their domain skills
99
+ - [ ] The implementation plan (if one exists) was cross-referenced for missing requirements
100
+ - [ ] Financial calculations include nullish coalescing guards (`val ?? 0`)
101
+ - [ ] Documentation routing table was checked and required doc updates are included
102
+ - [ ] Linear task has a summary comment if the task has an associated issue
103
+ - [ ] Score is not >= 4 if any requirement from the plan is missing or docs are stale
104
+
105
+ ## Do NOT Use When
106
+
107
+ | Instead of this skill | Use | Why |
108
+ |---|---|---|
109
+ | Reviewing code for logic bugs, type errors, or security issues | `code-review` | Code review owns the technical review; evaluation owns the holistic quality gate |
110
+ | Evaluating visual design quality or DESIGN_GUIDE compliance | `design-review` | Design review owns the pre/post design quality gates with specific visual checks |
111
+ | Assessing task completion in the multi-agent /manage pipeline | `task-evaluation` | Task-evaluation owns the A/B scoring and pass/fail verdict for automated pipelines |
112
+ | Running a generate-critique-revise self-review loop | `self-evaluation`, `reflection-pattern` | Self-evaluation and reflection own the iterative internal revision cycle |
113
+ | Defining what "better" means for a specific artifact type | `craft-doctrine` | Craft-doctrine defines quality standards; evaluation applies them |
@@ -0,0 +1,60 @@
1
+ ---
2
+ name: event-contract-design
3
+ description: "Use when designing or reviewing asynchronous event contracts: producer/consumer ownership, event envelope, schema, topic/channel naming, ordering, idempotency, versioning, compatibility, replay, dead-letter behavior, and AsyncAPI/CloudEvents-style documentation. Do NOT use for domain-event discovery (use `event-storming`), broad interface contracts (use `system-interface-contracts`), inbound provider webhook mechanics (use `webhook-integration`), or HTTP endpoint design (use `api-design`)."
4
+ license: MIT
5
+ compatibility: "Portable async-event contract guidance for queues, streams, pub/sub, internal events, outbound webhooks, and documented event-driven APIs."
6
+ allowed-tools: Read Grep
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"engineering\",\"domain\":\"architecture/events\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-11\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-11\\\\\\\"}\",\"eval_artifacts\":\"present\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"event-contract\\\\\\\",\\\\\\\"async-api\\\\\\\",\\\\\\\"cloudevents\\\\\\\",\\\\\\\"event envelope\\\\\\\",\\\\\\\"topic naming\\\\\\\",\\\\\\\"async event schema\\\\\\\",\\\\\\\"event compatibility\\\\\\\",\\\\\\\"replay contract\\\\\\\",\\\\\\\"dead-letter behavior\\\\\\\",\\\\\\\"consumer fixtures\\\\\\\"]\",\"examples\":\"[\\\\\\\"design the event contract for publishing OrderPaid to downstream consumers\\\\\\\",\\\\\\\"define topic names, payload schema, idempotency, and versioning for this event stream\\\\\\\",\\\\\\\"review this outbound webhook event schema before customers integrate with it\\\\\\\",\\\\\\\"write the compatibility rules for consumers of these async messages\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"discover the domain events, commands, and policies in this business process\\\\\\\",\\\\\\\"define every boundary contract between services, jobs, and APIs\\\\\\\",\\\\\\\"verify inbound provider webhook signatures and retry behavior\\\\\\\",\\\\\\\"design REST endpoints, status codes, and pagination\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"event-storming\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"event-storming discovers domain events and policies; event-contract-design turns selected events into publishable contracts\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"system-interface-contracts\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"system-interface-contracts owns broad boundary contracts; event-contract-design owns asynchronous message and event 
surfaces\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"webhook-integration\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"webhook-integration owns inbound third-party delivery mechanics; event-contract-design owns events this system publishes or documents for consumers\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"api-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"api-design owns HTTP resource and action surfaces; event-contract-design owns asynchronous event contracts\\\\\\\"}],\\\\\\\"related\\\\\\\":[\\\\\\\"event-storming\\\\\\\",\\\\\\\"system-interface-contracts\\\\\\\",\\\\\\\"observability-modeling\\\\\\\",\\\\\\\"state-machine-modeling\\\\\\\",\\\\\\\"data-modeling\\\\\\\"],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"system-interface-contracts\\\\\\\",\\\\\\\"observability-modeling\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":365,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/event-contract-design/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/event-contract-design/SKILL.md
13
+ ---
14
+
15
+ # Event Contract Design
16
+
17
+ ## Coverage
18
+
19
+ Design asynchronous event contracts for producers and consumers. Covers event envelope, schema, event type, topic/channel naming, producer ownership, consumer expectations, required and optional fields, idempotency keys, ordering, causation and correlation IDs, schema evolution, replay, dead-letter behavior, compatibility, observability, and machine-readable documentation such as AsyncAPI or CloudEvents-style metadata.
20
+
21
+ ## Philosophy
22
+
23
+ An event is a public promise once another consumer depends on it. If the payload, ordering, retry, or compatibility rules are implicit, every consumer invents its own interpretation and the event stream becomes shared folklore.
24
+
25
+ Do not confuse event discovery with event contracts. Discovery asks what happened in the domain. Contract design asks what exactly will be published, consumed, replayed, and evolved.
26
+
27
+ ## Method
28
+
29
+ 1. Name the producer, owner, intended consumers, and event purpose.
30
+ 2. Separate business event type from transport topic or queue name.
31
+ 3. Define envelope fields: id, type, source, time, subject, schema version, correlation, causation, tenant, and idempotency key.
32
+ 4. Define payload schema with required, optional, nullable, and deprecated fields.
33
+ 5. State ordering, delivery, retry, replay, and dead-letter expectations.
34
+ 6. Define compatibility rules: additive fields, breaking changes, versioning, deprecation, and consumer migration.
35
+ 7. Add observability fields needed to reconstruct publishing and consumption failures.
36
+ 8. Provide at least one positive and one negative contract fixture.
37
+
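Steps 3 and 8 of the method can be sketched as an envelope type, a structural check, and one positive and one negative fixture. The field names below are illustrative: they follow the envelope list in step 3 but are not the CloudEvents or AsyncAPI schema, and `EventEnvelope`, `envelopeErrors`, the `order.paid` event type, and both fixtures are hypothetical.

```typescript
// Envelope fields from step 3, with assumed camelCase names.
type EventEnvelope<TPayload> = {
  id: string;
  type: string; // business event type, kept separate from the transport topic
  source: string;
  time: string; // ISO 8601 timestamp
  subject: string;
  schemaVersion: string;
  correlationId: string;
  causationId: string | null; // null for the first event in a chain
  tenantId: string;
  idempotencyKey: string;
  payload: TPayload;
};

const REQUIRED_STRING_FIELDS = [
  "id", "type", "source", "time", "subject",
  "schemaVersion", "correlationId", "tenantId", "idempotencyKey",
];

// Structural check only: returns a list of problems, empty when the envelope is valid.
function envelopeErrors(candidate: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of REQUIRED_STRING_FIELDS) {
    const value = candidate[field];
    if (typeof value !== "string" || value.length === 0) {
      errors.push("missing or empty field: " + field);
    }
  }
  const time = candidate["time"];
  if (typeof time === "string" && time.length > 0 && Number.isNaN(Date.parse(time))) {
    errors.push("time is not a parseable timestamp");
  }
  if (!("payload" in candidate)) errors.push("missing payload");
  return errors;
}

// Positive fixture: a complete envelope that consumers may rely on.
const positiveFixture: EventEnvelope<{ orderId: string; amountCents: number }> = {
  id: "evt-001",
  type: "order.paid",
  source: "billing-service",
  time: "2026-05-11T10:00:00Z",
  subject: "order/ord-42",
  schemaVersion: "1",
  correlationId: "corr-7",
  causationId: null,
  tenantId: "tenant-a",
  idempotencyKey: "order-paid:ord-42",
  payload: { orderId: "ord-42", amountCents: 1999 },
};

// Negative fixture: no idempotency key and an unparseable time.
const negativeFixture: Record<string, unknown> = {
  id: "evt-002",
  type: "order.paid",
  source: "billing-service",
  time: "not-a-timestamp",
  subject: "order/ord-43",
  schemaVersion: "1",
  correlationId: "corr-8",
  causationId: null,
  tenantId: "tenant-a",
  payload: { orderId: "ord-43", amountCents: 500 },
};

const positiveErrors = envelopeErrors({ ...positiveFixture });
const negativeErrors = envelopeErrors(negativeFixture);
```

The negative fixture matters as much as the positive one: it pins down what a consumer is entitled to reject, which is part of the contract.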
38
+ ## Evals
39
+
40
+ This skill ships a comprehension-eval artifact at [`examples/evals/event-contract-design.json`](https://github.com/jacob-balslev/skill-graph/blob/main/examples/evals/event-contract-design.json). The checklist below is the authoring gate for async event contracts; the eval file is the grader surface.
41
+
42
+ ## Verification
43
+
44
+ - [ ] Producer, owner, and consumers are named
45
+ - [ ] Event type, topic/channel, envelope, and payload are distinct
46
+ - [ ] Required, optional, nullable, and deprecated fields are explicit
47
+ - [ ] Idempotency, ordering, retry, replay, and dead-letter behavior are stated
48
+ - [ ] Compatibility rules distinguish additive from breaking changes
49
+ - [ ] Correlation and causation IDs cross async boundaries
50
+ - [ ] Positive and negative fixtures exist for contract testing
51
+
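The additive-versus-breaking distinction in the checklist can be made mechanical. The sketch below takes the consumer's point of view and assumes consumers are tolerant readers that ignore unknown fields; under that assumption new fields are additive, while removed fields, type changes, and fields that stop being guaranteed are breaking. `FieldSpec`, `PayloadSchema`, `breakingChanges`, and the example schemas are hypothetical.

```typescript
// Hypothetical minimal schema description: field name -> type and presence guarantee.
type FieldSpec = { type: string; required: boolean };
type PayloadSchema = Record<string, FieldSpec>;

// Lists changes that can break an existing tolerant-reader consumer.
// An empty result means the change is additive under the stated assumption.
function breakingChanges(previous: PayloadSchema, next: PayloadSchema): string[] {
  const problems: string[] = [];
  for (const fieldName of Object.keys(previous)) {
    const before = previous[fieldName];
    if (before === undefined) continue;
    const after = next[fieldName];
    if (after === undefined) {
      problems.push("removed field: " + fieldName);
      continue;
    }
    if (after.type !== before.type) {
      problems.push("type changed: " + fieldName);
    }
    if (before.required && !after.required) {
      problems.push("no longer guaranteed: " + fieldName);
    }
  }
  return problems;
}

// Illustrative schemas for one event payload.
const schemaV1: PayloadSchema = {
  orderId: { type: "string", required: true },
  amountCents: { type: "number", required: true },
  couponCode: { type: "string", required: false },
};
const additiveChange: PayloadSchema = {
  ...schemaV1,
  currency: { type: "string", required: true }, // new field: old consumers ignore it
};
const breakingChange: PayloadSchema = {
  orderId: { type: "string", required: false }, // consumers can no longer rely on it
  amountCents: { type: "string", required: true }, // type changed
  // couponCode removed
};

const additiveProblems = breakingChanges(schemaV1, additiveChange);
const breakingProblems = breakingChanges(schemaV1, breakingChange);
```

The direction of the rule depends on who reads the data. For request payloads that a server validates, a new required field would be breaking; for published events read by tolerant consumers, it is not.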
52
+ ## Do NOT Use When
53
+
54
+ | Use instead | When |
55
+ |---|---|
56
+ | `event-storming` | You are still discovering domain events, commands, policies, aggregates, or timelines. |
57
+ | `system-interface-contracts` | The boundary is not specifically asynchronous events or messages. |
58
+ | `webhook-integration` | You are implementing inbound provider webhooks, signatures, retries, and raw payload handling. |
59
+ | `api-design` | You are designing HTTP endpoints, status codes, pagination, filtering, or error envelopes. |
60
+ | `observability-modeling` | The event contract is settled and the task is telemetry design. |
@@ -0,0 +1,56 @@
1
+ ---
2
+ name: event-storming
3
+ description: "Use when discovering a domain through events, commands, actors, policies, aggregates, read models, external systems, and temporal workflows before implementation. Do NOT use for event schema/topic contracts (use `event-contract-design`), webhook handler implementation (use `webhook-integration`), generic state transition modeling (use `state-machine-modeling`), or persistence schema design (use `data-modeling`)."
4
+ license: MIT
5
+ compatibility: "Portable event-storming discipline for product discovery, domain modeling, event-driven architecture, and workflow analysis."
6
+ allowed-tools: Read Grep
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"engineering\",\"domain\":\"architecture/domain-discovery\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-11\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-11\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"event storming\\\\\\\",\\\\\\\"domain events\\\\\\\",\\\\\\\"commands\\\\\\\",\\\\\\\"aggregates\\\\\\\",\\\\\\\"policies\\\\\\\",\\\\\\\"read models\\\\\\\",\\\\\\\"temporal workflow\\\\\\\",\\\\\\\"event-driven discovery\\\\\\\",\\\\\\\"process modeling\\\\\\\"]\",\"examples\":\"[\\\\\\\"map the order lifecycle as domain events before we design tables or APIs\\\\\\\",\\\\\\\"which commands, policies, and external systems are hidden in this workflow?\\\\\\\",\\\\\\\"use event storming to find aggregate boundaries for fulfillment\\\\\\\",\\\\\\\"turn this incident-prone business process into events and decisions\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"implement Shopify webhook signature verification and idempotent retries\\\\\\\",\\\\\\\"draw the state machine for this one status field\\\\\\\",\\\\\\\"create a normalized data model and indexes\\\\\\\",\\\\\\\"write event-bus infrastructure code\\\\\\\",\\\\\\\"define the schema, topic, compatibility, and fixtures for a selected event\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"event-contract-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"event-contract-design owns publishable event envelopes, schemas, topics, and compatibility; event-storming discovers the domain behavior before contract design\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"webhook-integration\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"webhook-integration implements inbound provider event handlers; event-storming discovers business-domain events before 
implementation\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"state-machine-modeling\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"state-machine-modeling formalizes states and transitions; event-storming discovers events, commands, policies, and aggregates\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"data-modeling\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"data-modeling designs stored data structures after domain events are understood\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"api-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"api-design shapes endpoint surfaces; event-storming discovers the domain behavior those surfaces expose\\\\\\\"}],\\\\\\\"related\\\\\\\":[\\\\\\\"bounded-context-mapping\\\\\\\",\\\\\\\"state-machine-modeling\\\\\\\",\\\\\\\"system-interface-contracts\\\\\\\",\\\\\\\"event-contract-design\\\\\\\",\\\\\\\"conceptual-modeling\\\\\\\"],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"conceptual-modeling\\\\\\\",\\\\\\\"bounded-context-mapping\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":365,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/event-storming/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/event-storming/SKILL.md
13
+ ---
14
+
15
+ # Event Storming
16
+
17
+ ## Coverage
18
+
19
+ Discover domain behavior through temporal events and decisions. Covers domain events, commands, actors, policies, aggregates, read models, external systems, hotspots, timelines, invariants, and handoff into bounded-context, state-machine, data, and API design.
20
+
21
+ ## Philosophy
22
+
23
+ Event storming starts from "what happened" because events expose real business flow faster than nouns do. Noun-first modeling often freezes premature assumptions. Event-first modeling reveals time, causality, policy, and exceptions.
24
+
25
+ Do not confuse domain events with technical notifications. "OrderPlaced" is business meaning. "WebhookReceived" is transport detail.
26
+
27
+ ## Method
28
+
29
+ 1. List domain events in past tense.
30
+ 2. Place them on a timeline.
31
+ 3. Add commands that cause events.
32
+ 4. Add actors and external systems that issue commands or receive outcomes.
33
+ 5. Add policies: "when event X happens, if condition Y, then command Z."
34
+ 6. Identify aggregates that enforce invariants.
35
+ 7. Mark hotspots, missing decisions, temporal ambiguity, and unclear ownership.
36
+ 8. Hand off to bounded-context, state-machine, data, or API design only after the flow is coherent.
37
+
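As a rough illustration of step 5, a policy can be written as a function from one domain event to zero or more commands. This is only a sketch: `RequestManualReview`, `highValueOrderPolicy`, and the threshold value are hypothetical names and numbers chosen for the example; only `OrderPlaced` comes from the text above.

```typescript
// Hypothetical names throughout; illustrative only, not part of the skill.

// A domain event: past tense, carries business meaning.
type OrderPlaced = { type: "OrderPlaced"; orderId: string; totalCents: number };

// A command: imperative, issued by an actor or a policy.
type RequestManualReview = { type: "RequestManualReview"; orderId: string };

// Assumed business rule, invented for this example.
const REVIEW_THRESHOLD_CENTS = 100000;

// A policy: "when event X happens, if condition Y, then command Z."
function highValueOrderPolicy(event: OrderPlaced): RequestManualReview[] {
  if (event.totalCents >= REVIEW_THRESHOLD_CENTS) {
    return [{ type: "RequestManualReview", orderId: event.orderId }];
  }
  return []; // condition not met: the policy issues no command
}
```

Writing the condition as code tends to surface hotspots early: if nobody in the room can say what the threshold is or who owns it, that is a missing decision to record in step 7.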
38
+ ## Verification
39
+
40
+ - [ ] Events are named in past tense and carry business meaning
41
+ - [ ] Commands are imperative and each is issued by a named actor or policy
42
+ - [ ] Policies are explicit condition-action rules
43
+ - [ ] Aggregates are tied to invariants, not guessed from nouns
44
+ - [ ] External systems and transport details are separated from domain events
45
+ - [ ] Hotspots and unanswered questions are recorded
46
+ - [ ] The timeline can replay a real scenario end to end
47
+
48
+ ## Do NOT Use When
49
+
50
+ | Use instead | When |
51
+ |---|---|
52
+ | `webhook-integration` | You are implementing provider webhooks, signatures, retries, and deduplication. |
53
+ | `event-contract-design` | You already selected the event and need schema, envelope, topic, compatibility, replay, or fixtures. |
54
+ | `state-machine-modeling` | You already know the lifecycle and need formal states, transitions, and guards. |
55
+ | `data-modeling` | You need tables, keys, indexes, constraints, or data lifecycle. |
56
+ | `api-design` | You need endpoint, request, response, and status-code design. |
@@ -0,0 +1,60 @@
1
+ ---
2
+ name: form-ux-architecture
3
+ description: "Use when designing or auditing form structure and validation UX: field grouping, required vs optional inputs, validation timing, client/server validation split, submission lifecycle, recovery, multi-step forms, and high-risk data entry. Do NOT use for labels and announcements alone (use `a11y`), validation-message wording (use `microcopy`), API schema design (use `api-design`), or stored data modeling (use `data-modeling`)."
4
+ license: MIT
5
+ compatibility: Portable form UX guidance for web and app forms. Client-side validation improves UX; server-side validation remains mandatory for trust and security.
6
+ allowed-tools: Read Grep
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"design\",\"domain\":\"design/ux\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-11\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-11\\\\\\\"}\",\"eval_artifacts\":\"present\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"form-ux\\\\\\\",\\\\\\\"form architecture\\\\\\\",\\\\\\\"validation timing\\\\\\\",\\\\\\\"client server validation\\\\\\\",\\\\\\\"field grouping\\\\\\\",\\\\\\\"submission lifecycle\\\\\\\",\\\\\\\"form recovery\\\\\\\",\\\\\\\"multi-step forms\\\\\\\",\\\\\\\"required optional fields\\\\\\\",\\\\\\\"empty state design\\\\\\\",\\\\\\\"no results state\\\\\\\"]\",\"examples\":\"[\\\\\\\"design the validation lifecycle for this signup form\\\\\\\",\\\\\\\"audit this checkout form for grouping, required fields, and recovery\\\\\\\",\\\\\\\"should this be one form, a wizard, or progressive disclosure?\\\\\\\",\\\\\\\"split client-side and server-side validation responsibilities for this form\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"add labels so assistive tech can read each field\\\\\\\",\\\\\\\"rewrite the inline validation messages\\\\\\\",\\\\\\\"define the request and response schema for the form submit endpoint\\\\\\\",\\\\\\\"model the database columns that store these inputs\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"a11y\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"a11y owns labels, focus, fieldsets, errors, and assistive-tech behavior; form-ux-architecture owns form structure and validation lifecycle\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"microcopy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"microcopy owns validation-message wording; form-ux-architecture owns when validation appears and how users 
recover\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"api-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"api-design owns submit endpoint schemas and error envelopes; form-ux-architecture owns the user-facing input and correction flow\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"data-modeling\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"data-modeling owns stored data shape; form-ux-architecture owns collection and correction before submission\\\\\\\"}],\\\\\\\"related\\\\\\\":[\\\\\\\"interaction-patterns\\\\\\\",\\\\\\\"interaction-feedback\\\\\\\",\\\\\\\"task-analysis\\\\\\\",\\\\\\\"a11y\\\\\\\",\\\\\\\"microcopy\\\\\\\"],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"a11y\\\\\\\",\\\\\\\"microcopy\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":365,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/form-ux-architecture/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/form-ux-architecture/SKILL.md
13
+ ---
14
+
15
+ # Form UX Architecture
16
+
17
+ ## Coverage
18
+
19
+ Design form structure and validation behavior. Covers field grouping, labels as structure handoff, required vs optional decisions, progressive disclosure, defaults, input formats, client-side validation, server-side validation, validation timing, submit lifecycle, error recovery, multi-step forms, review steps, autosave, and high-risk data entry.
20
+
21
+ ## Philosophy
22
+
23
+ Forms are not data dumps. A form is a guided conversation that asks only for information the system truly needs, at the moment the user can answer it, with correction paths that preserve trust.
24
+
25
+ Client-side validation is a user-experience aid, not a security boundary. The server must validate every submitted field even when the client appears correct.
26
+
27
+ ## Method
28
+
29
+ 1. Name the user goal and the minimum data needed to complete it.
30
+ 2. Remove fields that are not needed now or cannot be acted on.
31
+ 3. Group fields by user mental model, not database table.
32
+ 4. Decide required, optional, defaulted, derived, and deferred fields.
33
+ 5. Choose validation timing: on submit, on blur, on change, or after async check.
34
+ 6. Split client-side validation from server-side validation and map server errors back to fields.
35
+ 7. Define submit, pending, success, failure, retry, and partial-save behavior.
36
+ 8. Hand off labels and announcements to `a11y`, wording to `microcopy`, and endpoint shape to `api-design`.
37
+
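Step 6 asks for server errors to be mapped back to fields. One possible shape is sketched below, assuming a hypothetical error list in which each entry may name a field. `ServerError`, `FormErrors`, and `mapServerErrors` are invented for this example; the real error envelope belongs to `api-design`.

```typescript
// Hypothetical shapes; a real API's error envelope is out of scope here.
type ServerError = { field?: string; message: string };

type FormErrors = {
  fields: Record<string, string[]>; // errors shown next to a specific input
  form: string[];                   // errors shown at form level
};

// Routes each server error to the field it names, or to the form level
// when the field is missing or not rendered in this form.
function mapServerErrors(errors: ServerError[], knownFields: string[]): FormErrors {
  const known = new Set(knownFields);
  const result: FormErrors = { fields: {}, form: [] };
  for (const err of errors) {
    if (err.field !== undefined && known.has(err.field)) {
      const list = result.fields[err.field] ?? [];
      list.push(err.message);
      result.fields[err.field] = list;
    } else {
      result.form.push(err.message);
    }
  }
  return result;
}
```

Sending unrecognized errors to the form level keeps a recovery path visible even when the server rejects something the client never validated.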
38
+ ## Evals
39
+
40
+ This skill ships a comprehension-eval artifact at [`examples/evals/form-ux-architecture.json`](https://github.com/jacob-balslev/skill-graph/blob/main/examples/evals/form-ux-architecture.json). The checklist below is the authoring gate for form UX architecture; the eval file is the grader surface.
41
+
42
+ ## Verification
43
+
44
+ - [ ] Every field has a reason tied to the user's goal or system requirement
45
+ - [ ] Required fields are truly required at this step
46
+ - [ ] Field groups match how users think about the task
47
+ - [ ] Validation timing avoids hostile on-keystroke errors unless immediate feedback is necessary
48
+ - [ ] Client-side checks improve correction speed but do not replace server validation
49
+ - [ ] Server errors map back to fields or a clear form-level recovery path
50
+ - [ ] Submit, pending, success, failure, retry, and partial-save states are defined
51
+
52
+ ## Do NOT Use When
53
+
54
+ | Use instead | When |
55
+ |---|---|
56
+ | `a11y` | The task is labels, fieldsets, focus, keyboard flow, or screen-reader announcement. |
57
+ | `microcopy` | The task is validation-message wording, placeholder text, button labels, or error copy. |
58
+ | `api-design` | The task is endpoint shape, request/response schema, status codes, or error envelope. |
59
+ | `data-modeling` | The task is persistence schema, constraints, keys, or data lifecycle. |
60
+ | `interaction-feedback` | The task is feedback state staging after the form action starts. |
@@ -0,0 +1,59 @@
1
+ ---
2
+ name: framework-fit-analysis
3
+ description: "Use when choosing, replacing, or justifying a framework, library, SDK, runtime, database, UI kit, or platform by fit: constraints, team skill, ecosystem maturity, migration cost, operability, performance, security, and exit cost. Do NOT use for routine dependency hygiene (use `dependency-architecture`), documenting an accepted decision (use `architecture-decision-records`), or framework-specific implementation work."
4
+ license: MIT
5
+ compatibility: "Portable technology-selection discipline for application frameworks, libraries, SDKs, platforms, runtimes, data stores, and agent tooling."
6
+ allowed-tools: Read Grep
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"engineering\",\"domain\":\"architecture/technology-selection\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-11\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-11\\\\\\\"}\",\"eval_artifacts\":\"present\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"framework fit\\\\\\\",\\\\\\\"technology selection\\\\\\\",\\\\\\\"library choice\\\\\\\",\\\\\\\"SDK evaluation\\\\\\\",\\\\\\\"platform evaluation\\\\\\\",\\\\\\\"migration cost\\\\\\\",\\\\\\\"exit cost\\\\\\\",\\\\\\\"operability\\\\\\\",\\\\\\\"ecosystem maturity\\\\\\\",\\\\\\\"build vs buy\\\\\\\"]\",\"examples\":\"[\\\\\\\"should we use Next.js server actions, route handlers, or a separate API service for this workflow?\\\\\\\",\\\\\\\"evaluate whether adding this charting library is worth it\\\\\\\",\\\\\\\"compare Supabase, Firebase, and custom Postgres for this project under real constraints\\\\\\\",\\\\\\\"we want to replace this framework - what fit analysis should happen before an ADR?\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"audit installed packages for duplication and supply-chain risk\\\\\\\",\\\\\\\"write the ADR after we chose the framework\\\\\\\",\\\\\\\"implement this feature in the framework we already selected\\\\\\\",\\\\\\\"profile a slow page and optimize bottlenecks\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"dependency-architecture\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"dependency-architecture governs dependency graph shape and package boundaries; framework-fit-analysis evaluates a candidate technology decision\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"architecture-decision-records\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"architecture-decision-records records accepted decisions; framework-fit-analysis compares options before 
acceptance\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"performance-engineering\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"performance-engineering measures and optimizes actual behavior; framework-fit-analysis weighs expected performance among selection criteria\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"refactor\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"refactor changes existing code structure; framework-fit-analysis decides whether a larger technology shift is justified\\\\\\\"}],\\\\\\\"related\\\\\\\":[\\\\\\\"architecture-decision-records\\\\\\\",\\\\\\\"dependency-architecture\\\\\\\",\\\\\\\"performance-engineering\\\\\\\",\\\\\\\"owasp-security\\\\\\\"],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"architecture-decision-records\\\\\\\",\\\\\\\"dependency-architecture\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":180,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/framework-fit-analysis/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/framework-fit-analysis/SKILL.md
13
+ ---
14
+
15
+ # Framework Fit Analysis
16
+
17
+ ## Coverage
18
+
19
+ Evaluate technology fit before adoption, replacement, or standardization. Covers requirement fit, constraints, ecosystem maturity, maintenance health, team skill, integration cost, migration path, performance envelope, security posture, operational burden, lock-in, exit cost, and decision recording.
20
+
21
+ ## Philosophy
22
+
23
+ Technology choice is context-dependent. "Best" without constraints is marketing. A good fit analysis makes tradeoffs explicit enough that a team can accept the costs knowingly.
24
+
25
+ Do not confuse popularity with fit. Do not let a narrow implementation preference choose a durable platform. The right output is a recommendation plus consequences, not a ranking table with fake precision.
26
+
27
+ ## Method
28
+
29
+ 1. State the job the technology must do.
30
+ 2. List hard constraints: runtime, hosting, data, compliance, team, budget, timeline.
31
+ 3. Define evaluation criteria and weights qualitatively: must-have, important, nice-to-have.
32
+ 4. Compare credible options, including staying put.
33
+ 5. Assess migration and exit costs.
34
+ 6. Identify operational ownership and failure modes.
35
+ 7. Recommend one path with accepted tradeoffs.
36
+ 8. Hand off to `architecture-decision-records` if the decision is durable.
37
+
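Steps 3 and 4 can be sketched as a screening pass that uses the qualitative weights directly instead of numeric scores. All type and function names here (`Criterion`, `Option`, `screenOptions`) are hypothetical; the sketch only shows that must-haves act as a filter while the other weights stay visible as named tradeoffs.

```typescript
// Hypothetical types for illustration; not part of the skill or any library.
type Weight = "must-have" | "important" | "nice-to-have";
type Criterion = { name: string; weight: Weight };
type Option = { name: string; meets: Set<string> }; // criteria the option satisfies

type Screening = {
  option: string;
  viable: boolean;            // false if any must-have is unmet
  unmetMustHaves: string[];
  unmetImportant: string[];   // accepted tradeoffs to state in the recommendation
};

function screenOptions(criteria: Criterion[], options: Option[]): Screening[] {
  return options.map((opt) => {
    const unmet = (w: Weight): string[] =>
      criteria.filter((c) => c.weight === w && !opt.meets.has(c.name)).map((c) => c.name);
    const unmetMustHaves = unmet("must-have");
    return {
      option: opt.name,
      viable: unmetMustHaves.length === 0,
      unmetMustHaves,
      unmetImportant: unmet("important"),
    };
  });
}
```

The output deliberately has no total score: it lists which options survive and what each one costs, which matches the "recommendation plus consequences" goal stated in the Philosophy section.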
38
+ ## Evals
39
+
40
+ This skill ships a comprehension-eval artifact at [`examples/evals/framework-fit-analysis.json`](https://github.com/jacob-balslev/skill-graph/blob/main/examples/evals/framework-fit-analysis.json). The checklist below is the authoring gate for technology-fit decisions; the eval file is the grader surface.
41
+
42
+ ## Verification
43
+
44
+ - [ ] The recommendation is tied to explicit project constraints
45
+ - [ ] "Do nothing" or "keep current stack" was considered wherever it is a realistic option
46
+ - [ ] Migration and exit costs are named
47
+ - [ ] Operational ownership is named
48
+ - [ ] Performance and security claims are evidence-backed or marked uncertain
49
+ - [ ] The cost of reversing the decision is known and stated
50
+ - [ ] Follow-up ADR is proposed for durable choices
51
+
52
+ ## Do NOT Use When
53
+
54
+ | Use instead | When |
55
+ |---|---|
56
+ | `dependency-architecture` | You need dependency graph hygiene, package boundaries, duplication control, or supply-chain guardrails. |
57
+ | `architecture-decision-records` | The choice is already made and needs a record. |
58
+ | `performance-engineering` | You need to measure and optimize actual runtime behavior. |
59
+ | A framework-specific skill | The framework is already chosen and the task is implementation. |
@@ -0,0 +1,43 @@
1
+ ---
2
+ name: frontend-architecture
3
+ description: "Use when organizing a frontend codebase — module boundaries, component layering, state ownership, data-flow direction, and the separation between feature code and shared primitives. Do NOT use for visual design decisions, specific framework migration tactics, or backend API contract design."
4
+ license: CC-BY-4.0
5
+ metadata:
6
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"engineering\",\"domain\":\"engineering/frontend\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-12\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-12\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"frontend architecture\\\\\\\",\\\\\\\"component boundaries\\\\\\\",\\\\\\\"module organization\\\\\\\",\\\\\\\"state management shape\\\\\\\",\\\\\\\"feature-sliced design\\\\\\\",\\\\\\\"container presentational\\\\\\\",\\\\\\\"data flow direction\\\\\\\",\\\\\\\"shared primitives\\\\\\\",\\\\\\\"component layering\\\\\\\",\\\\\\\"frontend folder structure\\\\\\\",\\\\\\\"colocation\\\\\\\",\\\\\\\"Next.js App Router structure\\\\\\\"]\",\"triggers\":\"[\\\\\\\"frontend architecture\\\\\\\",\\\\\\\"component boundaries\\\\\\\",\\\\\\\"folder structure\\\\\\\",\\\\\\\"state shape\\\\\\\",\\\\\\\"where should this code live\\\\\\\"]\",\"examples\":\"[\\\\\\\"Decide whether a new modal lives in the shared component library or inside a feature folder\\\\\\\",\\\\\\\"Reorganize a frontend that has mixed presentational components and data-fetching components in the same files\\\\\\\",\\\\\\\"Choose a state shape that doesn't force every consumer to re-render on unrelated changes\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"Pick the brand color palette for a marketing site\\\\\\\",\\\\\\\"Design the REST endpoint shape for the orders resource\\\\\\\",\\\\\\\"Decide whether to use CSS-in-JS or 
Tailwind\\\\\\\"]\",\"relations\":\"{\\\\\\\"related\\\\\\\":[\\\\\\\"design-system-architecture\\\\\\\",\\\\\\\"design-module-composition\\\\\\\",\\\\\\\"refactor\\\\\\\",\\\\\\\"testing-strategy\\\\\\\"],\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"design-system-architecture\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"design-system-architecture covers token layering, primitive component contracts, and library publishing; this skill covers application-side organization that consumes those primitives.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"api-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"API contract shape lives in api-design; this skill takes the API as given and structures the frontend around it.\\\\\\\"}]}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/frontend-architecture/SKILL.md\"}"
7
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
8
+ skill_graph_protocol: Skill Metadata Protocol v4
9
+ skill_graph_project: Skill Graph
10
+ skill_graph_canonical_skill: skills/frontend-architecture/SKILL.md
11
+ ---
12
+
13
+ # Frontend Architecture
14
+
15
+ ## Coverage
16
+ Frontend architecture decides three things: where code lives (folder and module structure), what depends on what (allowed import direction), and who owns mutable state (component-local, feature-scoped, or global). This skill covers the common organizing models — feature-sliced (features/<feature>/{ui,model,api}), layered (components/, hooks/, services/, pages/), and domain-driven (domains/<domain>/{ui,logic,data}) — and the trade-offs each makes when the codebase grows from a few features to dozens.
17
+
18
+ Component layering separates primitives (no business knowledge, configurable purely through props — Button, Input, Stack), composed components (combine primitives with feature-specific layout, still no data fetching — OrderSummary, AddressForm), and connected components (own data fetching, mutation, and routing — OrderDetailPage). The boundary between composed and connected is the most common source of dependency tangles: a "shared" composed component that quietly reaches into a feature-specific store creates a back-edge that breaks the dependency graph.
19
+
20
+ State shape decisions span four axes: location (component, context, store), normalization (entity-keyed vs. nested), derivation (computed at read time vs. stored), and ownership (who can write). The shape choice determines what re-renders, what stays consistent across views, and what becomes a source of bugs when a mutation forgets to update one of several copies. Server state (data fetched from an API, cache-managed) and client state (UI-only, ephemeral) have different requirements and benefit from being managed by different tools.
21
+
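The normalization and derivation axes can be illustrated with a small entity-keyed state. This is a sketch under assumed names (`Customer`, `Order`, `renameCustomer`, `orderCustomerName` are all hypothetical): each customer is stored once, and the name shown on an order is computed at read time, so a rename cannot leave a stale copy behind.

```typescript
// Hypothetical entities for illustration only.
type Customer = { id: string; name: string };
type Order = { id: string; customerId: string };

// Entity-keyed (normalized) shape: orders reference customers by id.
type State = {
  customers: Record<string, Customer>;
  orders: Record<string, Order>;
};

// A single write path: the rename touches exactly one record.
function renameCustomer(state: State, id: string, name: string): State {
  const existing = state.customers[id];
  if (existing === undefined) return state;
  return { ...state, customers: { ...state.customers, [id]: { ...existing, name } } };
}

// Derived at read time rather than stored on each order.
function orderCustomerName(state: State, orderId: string): string | undefined {
  const order = state.orders[orderId];
  return order ? state.customers[order.customerId]?.name : undefined;
}
```

With a nested shape, the same rename would have to find and update every order that embeds the customer, which is the "forgets to update one of several copies" failure described above.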
22
+ Import direction enforces the architecture. A rule like "shared/ may not import from features/, and feature A may not import from feature B" is checkable with ESLint boundary plugins and tells the team at PR time when a change crosses an intended layer. Without enforcement, the structure tends to degrade over time, because a single shortcut import becomes the norm.
23
+
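The example rule quoted above can be expressed as a small predicate, shown here as a sketch rather than as any plugin's actual configuration. It assumes paths are relative to the source root and follow a `features/<feature>/...` and `shared/...` layout; `featureOf` and `isAllowedImport` are hypothetical helper names.

```typescript
// Returns the feature name for paths like "features/<feature>/...", else null.
// Assumes paths are relative to the source root (an assumption of this sketch).
function featureOf(path: string): string | null {
  const parts = path.split("/");
  return parts[0] === "features" && parts.length > 1 ? parts[1] : null;
}

// Encodes: shared/ may not import from features/,
// and feature A may not import from feature B.
function isAllowedImport(fromFile: string, toFile: string): boolean {
  const fromFeature = featureOf(fromFile);
  const toFeature = featureOf(toFile);
  if (fromFile.startsWith("shared/") && toFeature !== null) return false;
  if (fromFeature !== null && toFeature !== null && fromFeature !== toFeature) return false;
  return true;
}
```

In a real project the same rule would normally live in linter configuration so that it runs in CI; the function form is mainly useful for reasoning about which edges the rule forbids.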
24
+ ## Philosophy
25
+ The folder structure is not the architecture; the import graph is. A pretty folder tree with cyclic imports between features is architecturally worse than a flat folder with strict one-way dependencies. Optimize for "where would I look for this" (colocation by feature) and "what changes together stays together" (cohesion) over imposed taxonomy.
26
+
27
+ Server state and client state are different problems. Mixing them in a single store creates cache-invalidation bugs that look like rendering bugs. Use one dedicated tool for fetching, caching, and revalidating server data; reserve global client state for genuinely cross-cutting UI concerns (theme, current user, route).
28
+
29
+ ## Verification
30
+ - An import-boundary linter is configured and a CI step fails the build on violations.
31
+ - Every feature folder can be removed without touching code outside it, except a single registration point (route table, feature flag map).
32
+ - A new developer can name where any given piece of code lives within thirty seconds of being told the feature name and the kind of thing (UI vs. data vs. logic).
33
+ - Server state is fetched through one mechanism (a single query/cache library) — counting fetch call sites in client code returns approximately the number of distinct endpoints, not multiples per endpoint.
34
+ - Component props for shared primitives contain no feature-specific names; a grep for feature names in shared/ returns nothing.
35
+ - Re-render counts on a representative interaction can be measured and explained — no "this component re-renders five times and I'm not sure why."
36
+ - Tests follow the layering: primitives tested in isolation, connected components tested with mocked data layer, no test reaches across feature boundaries.
37
+
38
+ ## Do NOT Use When
39
+ - The decision is visual rather than structural — color, type, spacing, motion. Use visual-design-foundations, color-system-design, or typography-system.
40
+ - The work is publishing or versioning a shared component library across multiple applications. Use design-system-architecture.
41
+ - The task is choosing or migrating a specific framework (React→Solid, Webpack→Vite). This skill is framework-neutral.
42
+ - You are designing the API the frontend consumes. Use api-design or system-interface-contracts.
43
+ - The problem is a single component's internal behavior or accessibility. Use design-module-composition, interaction-patterns, or a11y.