@skill-graph/cli 0.5.6

Files changed (330)
  1. package/CHANGELOG.md +247 -0
  2. package/LICENSE +200 -0
  3. package/NOTICE +62 -0
  4. package/README.md +398 -0
  5. package/SKILL_GRAPH.md +443 -0
  6. package/bin/skill-graph.js +374 -0
  7. package/docs/ADOPTION.md +117 -0
  8. package/docs/CONFORMANCE.md +66 -0
  9. package/docs/PRIMER.md +384 -0
  10. package/docs/QUICKSTART-30MIN.md +333 -0
  11. package/docs/ROUTING-METRICS.md +120 -0
  12. package/docs/SKILL-MD-FORMAT-COMPATIBILITY.md +127 -0
  13. package/docs/SKILL_AUDIT_CHECKLIST.md +199 -0
  14. package/docs/SKILL_AUDIT_LOOP.md +195 -0
  15. package/docs/SKILL_METADATA_PROTOCOL.md +609 -0
  16. package/docs/_archived/marketplace-publication-priority-2026-05-18.md +239 -0
  17. package/docs/adr/0001-predicate-set.md +69 -0
  18. package/docs/adr/0002-json-ld-context.md +82 -0
  19. package/docs/adr/0003-ontoclean-rigidity-tags.md +65 -0
  20. package/docs/adr/0004-persistent-identifiers.md +74 -0
  21. package/docs/adr/0005-freshness-consolidation.md +70 -0
  22. package/docs/adr/0006-revise-predicate-rename.md +105 -0
  23. package/docs/adr/0007-audit-loop-cadence.md +99 -0
  24. package/docs/adr/0008-skill-surface-split-and-curation-policy.md +93 -0
  25. package/docs/category-consumers.md +168 -0
  26. package/docs/concept-map.md +194 -0
  27. package/docs/diagrams/drift-states.mmd +21 -0
  28. package/docs/diagrams/manifest-pipeline.mmd +25 -0
  29. package/docs/diagrams/routing-harness.mmd +41 -0
  30. package/docs/diagrams/starter-graph.mmd +53 -0
  31. package/docs/field-decision-guide.md +315 -0
  32. package/docs/field-rationale.md +211 -0
  33. package/docs/field-reference.generated.md +624 -0
  34. package/docs/field-reference.md +1426 -0
  35. package/docs/glossary.md +190 -0
  36. package/docs/head-noun-glossary.md +63 -0
  37. package/docs/images/audit-phases.png +0 -0
  38. package/docs/images/drift-states.png +0 -0
  39. package/docs/images/graded-mode.png +0 -0
  40. package/docs/images/manifest-pipeline.png +0 -0
  41. package/docs/images/routing-harness.png +0 -0
  42. package/docs/images/skill-anatomy.png +0 -0
  43. package/docs/images/starter-graph.png +0 -0
  44. package/docs/images/system-model.png +0 -0
  45. package/docs/integrations/github-actions.md +155 -0
  46. package/docs/manifest-field-mapping.md +443 -0
  47. package/docs/marketplace-publication-queue.generated.md +240 -0
  48. package/docs/marketplace-release-agent-prompt.md +82 -0
  49. package/docs/marketplace-skill-candidate-list.md +272 -0
  50. package/docs/marketplace-syndication.md +222 -0
  51. package/docs/migration-sample-review.md +155 -0
  52. package/docs/migrations/v4-to-v5.md +168 -0
  53. package/docs/migrations/v5-to-v6.md +221 -0
  54. package/docs/name-exceptions.yaml +37 -0
  55. package/docs/plans/marketplace-p1-public-migration-plan.md +41 -0
  56. package/docs/plans/multi-root-workspace.md +148 -0
  57. package/docs/plans/scripts-roadmap.md +107 -0
  58. package/docs/plans/v4-schema-bump.md +160 -0
  59. package/docs/plans/wave-2-extraction.md +122 -0
  60. package/docs/positioning-vs-marketplaces.md +175 -0
  61. package/docs/proposals/skill-audit-loop-positioning.md +160 -0
  62. package/docs/quality-doctrine.md +138 -0
  63. package/docs/recommended-skills.md +150 -0
  64. package/docs/research/skill-comprehension-eval-research.md +1830 -0
  65. package/docs/research/skill-retrieval-evidence.md +66 -0
  66. package/docs/skill-metadata-protocol.md +471 -0
  67. package/docs/skills-sh-maintainer-cleanup-request.md +80 -0
  68. package/examples/audits/a11y/findings.md +52 -0
  69. package/examples/audits/a11y/scorecard.md +21 -0
  70. package/examples/audits/a11y/verdict.md +44 -0
  71. package/examples/audits/debugging/findings.md +59 -0
  72. package/examples/audits/debugging/scorecard.md +22 -0
  73. package/examples/audits/debugging/verdict.md +33 -0
  74. package/examples/audits/documentation/findings.md +59 -0
  75. package/examples/audits/documentation/scorecard.md +22 -0
  76. package/examples/audits/documentation/verdict.md +33 -0
  77. package/examples/evals/a11y.json +140 -0
  78. package/examples/evals/api-design.json +52 -0
  79. package/examples/evals/code-review.json +52 -0
  80. package/examples/evals/data-modeling.json +52 -0
  81. package/examples/evals/database-migration.json +52 -0
  82. package/examples/evals/debugging.json +118 -0
  83. package/examples/evals/dependency-architecture.json +52 -0
  84. package/examples/evals/design-system-architecture.json +52 -0
  85. package/examples/evals/error-tracking.json +52 -0
  86. package/examples/evals/event-contract-design.json +52 -0
  87. package/examples/evals/form-ux-architecture.json +52 -0
  88. package/examples/evals/framework-fit-analysis.json +52 -0
  89. package/examples/evals/graph-audit.json +139 -0
  90. package/examples/evals/information-architecture.json +52 -0
  91. package/examples/evals/interaction-feedback.json +52 -0
  92. package/examples/evals/interaction-patterns.json +52 -0
  93. package/examples/evals/layout-composition.json +52 -0
  94. package/examples/evals/lint-overlay.json +117 -0
  95. package/examples/evals/microcopy.json +52 -0
  96. package/examples/evals/observability-modeling.json +52 -0
  97. package/examples/evals/pattern-recognition.json +96 -0
  98. package/examples/evals/performance-engineering.json +52 -0
  99. package/examples/evals/refactor.json +128 -0
  100. package/examples/evals/semiotics.json +52 -0
  101. package/examples/evals/skill-infrastructure.json +96 -0
  102. package/examples/evals/skill-router.json +140 -0
  103. package/examples/evals/skill-router.routing.json +113 -0
  104. package/examples/evals/system-interface-contracts.json +52 -0
  105. package/examples/evals/task-analysis.json +52 -0
  106. package/examples/evals/testing-strategy.json +118 -0
  107. package/examples/evals/type-safety.json +249 -0
  108. package/examples/evals/visual-design-foundations.json +52 -0
  109. package/examples/evals/webhook-integration.json +52 -0
  110. package/examples/exports/a11y.skill-md.md +80 -0
  111. package/examples/exports/debugging.skill-md.md +80 -0
  112. package/examples/exports/refactor.skill-md.md +78 -0
  113. package/examples/exports/testing-strategy.skill-md.md +81 -0
  114. package/examples/projects/markdown-static-site/README.md +115 -0
  115. package/examples/projects/markdown-static-site/skills/content-source-router/SKILL.md +131 -0
  116. package/examples/projects/markdown-static-site/skills/image-optimization-pipeline-config/SKILL.md +132 -0
  117. package/examples/projects/markdown-static-site/skills/link-rot-detection/SKILL.md +103 -0
  118. package/examples/projects/markdown-static-site/skills/markdown-post-frontmatter-validation/SKILL.md +133 -0
  119. package/examples/projects/markdown-static-site/skills/migrate-posts-to-v2-frontmatter/SKILL.md +140 -0
  120. package/examples/projects/saas-stripe-postgres/README.md +208 -0
  121. package/examples/projects/saas-stripe-postgres/db/migrations/0004_canonicalize_orders.sql +37 -0
  122. package/examples/projects/saas-stripe-postgres/db/schema.sql +112 -0
  123. package/examples/projects/saas-stripe-postgres/skills/migrate-orders-to-canonical-schema/SKILL.md +149 -0
  124. package/examples/projects/saas-stripe-postgres/skills/nextjs-server-action-validation/SKILL.md +154 -0
  125. package/examples/projects/saas-stripe-postgres/skills/payment-provider-router/SKILL.md +153 -0
  126. package/examples/projects/saas-stripe-postgres/skills/postgres-rls-pattern/SKILL.md +163 -0
  127. package/examples/projects/saas-stripe-postgres/skills/stripe-webhook-signature-verification/SKILL.md +137 -0
  128. package/examples/protocol/skill-metadata-template.md +301 -0
  129. package/examples/protocol/skills.manifest.sample.json +13245 -0
  130. package/examples/skill-metadata-template.md +317 -0
  131. package/examples/skills.manifest.sample.json +13519 -0
  132. package/examples/tests/v3-1-skos-fixture/SKILL.md +93 -0
  133. package/marketplace/README.md +17 -0
  134. package/marketplace/skills/a11y/SKILL.md +66 -0
  135. package/marketplace/skills/acid-fundamentals/SKILL.md +106 -0
  136. package/marketplace/skills/agent-engineering/SKILL.md +386 -0
  137. package/marketplace/skills/agent-eval-design/SKILL.md +55 -0
  138. package/marketplace/skills/ai-native-development/SKILL.md +294 -0
  139. package/marketplace/skills/api-design/SKILL.md +60 -0
  140. package/marketplace/skills/architecture-decision-records/SKILL.md +55 -0
  141. package/marketplace/skills/background-jobs/SKILL.md +265 -0
  142. package/marketplace/skills/bounded-context-mapping/SKILL.md +55 -0
  143. package/marketplace/skills/cap-theorem-tradeoffs/SKILL.md +127 -0
  144. package/marketplace/skills/client-server-boundary/SKILL.md +187 -0
  145. package/marketplace/skills/code-review/SKILL.md +120 -0
  146. package/marketplace/skills/color-system-design/SKILL.md +43 -0
  147. package/marketplace/skills/component-architecture/SKILL.md +126 -0
  148. package/marketplace/skills/compression/SKILL.md +112 -0
  149. package/marketplace/skills/conceptual-modeling/SKILL.md +181 -0
  150. package/marketplace/skills/connection-pooling/SKILL.md +105 -0
  151. package/marketplace/skills/constraint-awareness/SKILL.md +287 -0
  152. package/marketplace/skills/content-monitor/SKILL.md +209 -0
  153. package/marketplace/skills/context-engineering/SKILL.md +320 -0
  154. package/marketplace/skills/context-graph/SKILL.md +174 -0
  155. package/marketplace/skills/context-management/SKILL.md +174 -0
  156. package/marketplace/skills/context-window/SKILL.md +239 -0
  157. package/marketplace/skills/contract-testing/SKILL.md +120 -0
  158. package/marketplace/skills/cron-scheduling/SKILL.md +223 -0
  159. package/marketplace/skills/dark-mode-implementation/SKILL.md +47 -0
  160. package/marketplace/skills/data-modeling/SKILL.md +59 -0
  161. package/marketplace/skills/data-modeling-fundamentals/SKILL.md +117 -0
  162. package/marketplace/skills/database-migration/SKILL.md +429 -0
  163. package/marketplace/skills/debugging/SKILL.md +67 -0
  164. package/marketplace/skills/dependency-architecture/SKILL.md +58 -0
  165. package/marketplace/skills/design-module-composition/SKILL.md +43 -0
  166. package/marketplace/skills/design-system-architecture/SKILL.md +61 -0
  167. package/marketplace/skills/design-thinking/SKILL.md +44 -0
  168. package/marketplace/skills/diagnosis/SKILL.md +296 -0
  169. package/marketplace/skills/diff-analysis/SKILL.md +188 -0
  170. package/marketplace/skills/e2e-test-design/SKILL.md +113 -0
  171. package/marketplace/skills/entity-relationship-modeling/SKILL.md +218 -0
  172. package/marketplace/skills/epistemic-grounding/SKILL.md +112 -0
  173. package/marketplace/skills/error-boundary/SKILL.md +235 -0
  174. package/marketplace/skills/error-tracking/SKILL.md +261 -0
  175. package/marketplace/skills/eval-driven-development/SKILL.md +147 -0
  176. package/marketplace/skills/evaluation/SKILL.md +113 -0
  177. package/marketplace/skills/event-contract-design/SKILL.md +60 -0
  178. package/marketplace/skills/event-storming/SKILL.md +56 -0
  179. package/marketplace/skills/form-ux-architecture/SKILL.md +60 -0
  180. package/marketplace/skills/framework-fit-analysis/SKILL.md +59 -0
  181. package/marketplace/skills/frontend-architecture/SKILL.md +43 -0
  182. package/marketplace/skills/generative-ui/SKILL.md +118 -0
  183. package/marketplace/skills/graph-audit/SKILL.md +81 -0
  184. package/marketplace/skills/guardrails/SKILL.md +118 -0
  185. package/marketplace/skills/hooks-patterns/SKILL.md +185 -0
  186. package/marketplace/skills/http-semantics/SKILL.md +136 -0
  187. package/marketplace/skills/ideation/SKILL.md +41 -0
  188. package/marketplace/skills/indexing-strategy/SKILL.md +108 -0
  189. package/marketplace/skills/information-architecture/SKILL.md +59 -0
  190. package/marketplace/skills/integration-test-design/SKILL.md +111 -0
  191. package/marketplace/skills/intent-recognition/SKILL.md +136 -0
  192. package/marketplace/skills/interaction-feedback/SKILL.md +59 -0
  193. package/marketplace/skills/interaction-patterns/SKILL.md +59 -0
  194. package/marketplace/skills/journey-mapping/SKILL.md +41 -0
  195. package/marketplace/skills/keywords/SKILL.md +213 -0
  196. package/marketplace/skills/knowledge-modeling/SKILL.md +232 -0
  197. package/marketplace/skills/layout-composition/SKILL.md +59 -0
  198. package/marketplace/skills/linguistics/SKILL.md +429 -0
  199. package/marketplace/skills/lint-overlay/SKILL.md +76 -0
  200. package/marketplace/skills/mental-models/SKILL.md +126 -0
  201. package/marketplace/skills/merge-queue/SKILL.md +94 -0
  202. package/marketplace/skills/methodology/SKILL.md +317 -0
  203. package/marketplace/skills/microcopy/SKILL.md +232 -0
  204. package/marketplace/skills/middleware-patterns/SKILL.md +363 -0
  205. package/marketplace/skills/mobile-responsive-ux/SKILL.md +287 -0
  206. package/marketplace/skills/mutation-testing/SKILL.md +112 -0
  207. package/marketplace/skills/naming-conventions/SKILL.md +112 -0
  208. package/marketplace/skills/observability-modeling/SKILL.md +59 -0
  209. package/marketplace/skills/ontology-modeling/SKILL.md +67 -0
  210. package/marketplace/skills/owasp-security/SKILL.md +153 -0
  211. package/marketplace/skills/pattern-recognition/SKILL.md +472 -0
  212. package/marketplace/skills/performance-budgets/SKILL.md +185 -0
  213. package/marketplace/skills/performance-engineering/SKILL.md +58 -0
  214. package/marketplace/skills/performance-testing/SKILL.md +125 -0
  215. package/marketplace/skills/printify/SKILL.md +42 -0
  216. package/marketplace/skills/prioritization/SKILL.md +118 -0
  217. package/marketplace/skills/problem-framing/SKILL.md +41 -0
  218. package/marketplace/skills/problem-locating-solving/SKILL.md +203 -0
  219. package/marketplace/skills/project-knowledge-extraction/SKILL.md +54 -0
  220. package/marketplace/skills/prompt-craft/SKILL.md +134 -0
  221. package/marketplace/skills/prompt-injection-defense/SKILL.md +132 -0
  222. package/marketplace/skills/property-based-testing/SKILL.md +100 -0
  223. package/marketplace/skills/prototyping/SKILL.md +43 -0
  224. package/marketplace/skills/query-optimization/SKILL.md +144 -0
  225. package/marketplace/skills/real-time-updates/SKILL.md +324 -0
  226. package/marketplace/skills/ref-patterns/SKILL.md +284 -0
  227. package/marketplace/skills/refactor/SKILL.md +65 -0
  228. package/marketplace/skills/rendering-models/SKILL.md +142 -0
  229. package/marketplace/skills/replication-patterns/SKILL.md +110 -0
  230. package/marketplace/skills/research-synthesis/SKILL.md +41 -0
  231. package/marketplace/skills/route-handler-design/SKILL.md +347 -0
  232. package/marketplace/skills/schema-evolution/SKILL.md +140 -0
  233. package/marketplace/skills/security-fundamentals/SKILL.md +139 -0
  234. package/marketplace/skills/semantic-center/SKILL.md +194 -0
  235. package/marketplace/skills/semantic-relations/SKILL.md +250 -0
  236. package/marketplace/skills/semantics/SKILL.md +366 -0
  237. package/marketplace/skills/semiotics/SKILL.md +230 -0
  238. package/marketplace/skills/seo-strategy/SKILL.md +260 -0
  239. package/marketplace/skills/server-actions-design/SKILL.md +243 -0
  240. package/marketplace/skills/server-components-design/SKILL.md +190 -0
  241. package/marketplace/skills/sharding-strategy/SKILL.md +123 -0
  242. package/marketplace/skills/shopify/SKILL.md +42 -0
  243. package/marketplace/skills/skill-infrastructure/SKILL.md +320 -0
  244. package/marketplace/skills/skill-router/SKILL.md +71 -0
  245. package/marketplace/skills/skill-scaffold/SKILL.md +105 -0
  246. package/marketplace/skills/snapshot-testing/SKILL.md +120 -0
  247. package/marketplace/skills/spec-driven-development/SKILL.md +148 -0
  248. package/marketplace/skills/state-machine-modeling/SKILL.md +56 -0
  249. package/marketplace/skills/state-management/SKILL.md +134 -0
  250. package/marketplace/skills/streaming-architecture/SKILL.md +194 -0
  251. package/marketplace/skills/summarization/SKILL.md +156 -0
  252. package/marketplace/skills/suspense-patterns/SKILL.md +265 -0
  253. package/marketplace/skills/system-interface-contracts/SKILL.md +59 -0
  254. package/marketplace/skills/task-analysis/SKILL.md +201 -0
  255. package/marketplace/skills/taxonomy-design/SKILL.md +66 -0
  256. package/marketplace/skills/test-coverage-strategy/SKILL.md +108 -0
  257. package/marketplace/skills/test-doubles-design/SKILL.md +98 -0
  258. package/marketplace/skills/test-driven-development/SKILL.md +96 -0
  259. package/marketplace/skills/testing-strategy/SKILL.md +67 -0
  260. package/marketplace/skills/theme-system-design/SKILL.md +43 -0
  261. package/marketplace/skills/tool-call-flow/SKILL.md +229 -0
  262. package/marketplace/skills/tool-call-strategy/SKILL.md +292 -0
  263. package/marketplace/skills/transaction-isolation/SKILL.md +98 -0
  264. package/marketplace/skills/type-safety/SKILL.md +177 -0
  265. package/marketplace/skills/typography-system/SKILL.md +43 -0
  266. package/marketplace/skills/usability-testing/SKILL.md +43 -0
  267. package/marketplace/skills/user-research/SKILL.md +43 -0
  268. package/marketplace/skills/vercel-composition-patterns/SKILL.md +157 -0
  269. package/marketplace/skills/version-control/SKILL.md +233 -0
  270. package/marketplace/skills/visual-design-foundations/SKILL.md +59 -0
  271. package/marketplace/skills/visual-hierarchy/SKILL.md +43 -0
  272. package/marketplace/skills/webhook-integration/SKILL.md +331 -0
  273. package/marketplace/skills/writing-humanizer/SKILL.md +380 -0
  274. package/package.json +67 -0
  275. package/schemas/manifest.schema.json +811 -0
  276. package/schemas/manifest.v2.schema.json +164 -0
  277. package/schemas/manifest.v3.schema.json +758 -0
  278. package/schemas/manifest.v4.schema.json +755 -0
  279. package/schemas/manifest.v5.schema.json +755 -0
  280. package/schemas/manifest.v6.schema.json +811 -0
  281. package/schemas/skill.context.jsonld +279 -0
  282. package/schemas/skill.schema.json +919 -0
  283. package/schemas/skill.v2.schema.json +201 -0
  284. package/schemas/skill.v3.schema.json +827 -0
  285. package/schemas/skill.v4.schema.json +822 -0
  286. package/schemas/skill.v5.schema.json +830 -0
  287. package/schemas/skill.v6.schema.json +946 -0
  288. package/schemas/vocabulary/keywords.json +180 -0
  289. package/schemas/vocabulary/workspace_tags.json +23 -0
  290. package/scripts/__tests__/migrate-skill-v2-to-v3.test.js +161 -0
  291. package/scripts/__tests__/migrate-skill-v3-to-v4.test.js +158 -0
  292. package/scripts/__tests__/test-export-parser-drift.js +149 -0
  293. package/scripts/__tests__/test-marketplace-export.js +114 -0
  294. package/scripts/__tests__/test-router-paths.js +82 -0
  295. package/scripts/__tests__/test-stability-promotion.js +244 -0
  296. package/scripts/__tests__/test-v3-1-alias-contract.js +109 -0
  297. package/scripts/__tests__/test-v3-1-skos-runtime.js +116 -0
  298. package/scripts/backfill-schema-version.js +198 -0
  299. package/scripts/build-field-reference.js +160 -0
  300. package/scripts/build-retrieval-baseline.js +511 -0
  301. package/scripts/check-markdown-links.js +211 -0
  302. package/scripts/check-protocol-consistency.js +979 -0
  303. package/scripts/export-marketplace-skills.js +610 -0
  304. package/scripts/export-skill.js +374 -0
  305. package/scripts/generate-manifest.js +787 -0
  306. package/scripts/lib/alias-contract.js +83 -0
  307. package/scripts/lib/audit-prompt-builder.js +771 -0
  308. package/scripts/lib/mock-grader.js +134 -0
  309. package/scripts/lib/parse-frontmatter.js +429 -0
  310. package/scripts/lib/roots.js +119 -0
  311. package/scripts/lint/check-archetype-sections.js +185 -0
  312. package/scripts/lint/check-category-enum.js +83 -0
  313. package/scripts/lint/check-routing-eval.js +146 -0
  314. package/scripts/lint/check-routing-quality.js +211 -0
  315. package/scripts/lint/check-stability-promotion.js +220 -0
  316. package/scripts/lint/format-code-frame.js +206 -0
  317. package/scripts/marketplace-install.js +125 -0
  318. package/scripts/migrate-category-to-enum.js +169 -0
  319. package/scripts/migrate-skill-v2-to-v3.js +424 -0
  320. package/scripts/migrate-skill-v3-to-v4.js +200 -0
  321. package/scripts/migrate-skill-v5-to-v6.js +304 -0
  322. package/scripts/restructure-by-category.js +85 -0
  323. package/scripts/seed-publication-classification.js +282 -0
  324. package/scripts/skill-audit.js +893 -0
  325. package/scripts/skill-graph-drift.js +483 -0
  326. package/scripts/skill-graph-route.js +766 -0
  327. package/scripts/skill-graph-routing-eval.js +393 -0
  328. package/scripts/skill-lint.js +1317 -0
  329. package/scripts/skill-overlap.js +213 -0
  330. package/scripts/verify-skill-md-export.js +201 -0
@@ -0,0 +1,108 @@
1
+ ---
2
+ name: test-coverage-strategy
3
+ description: "Use when reasoning about code coverage as a strategic measurement rather than a target: the coverage-criterion hierarchy (function, statement, line, branch, decision, condition, MC/DC, path), Marick's distinction between coverage as a floor signal and coverage as a ceiling target, Goodhart's Law applied to coverage metrics, why 100% line coverage may catch nothing important while 80% well-chosen coverage catches almost everything, the safety-critical industry's MC/DC requirement (DO-178C aviation, ISO 26262 automotive) as the empirical evidence that coverage at the right granularity matters, the difference between covered (lines exercised) and tested (behaviors verified), and how coverage gaps are diagnostic signals about the test suite. Do NOT use for choosing test levels (use testing-strategy), the technique of mutation testing as a stronger signal (use mutation-testing), the construction of test doubles (use test-doubles-design), or specific coverage-tooling configuration (tool docs)."
4
+ license: MIT
5
+ allowed-tools: Read Grep
6
+ metadata:
7
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"quality\",\"domain\":\"quality/testing\",\"scope\":\"reference\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-16\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-16\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"comprehension_state\":\"present\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"code coverage\\\\\\\",\\\\\\\"line coverage\\\\\\\",\\\\\\\"branch coverage\\\\\\\",\\\\\\\"decision coverage\\\\\\\",\\\\\\\"condition coverage\\\\\\\",\\\\\\\"MC/DC\\\\\\\",\\\\\\\"modified condition decision coverage\\\\\\\",\\\\\\\"path coverage\\\\\\\",\\\\\\\"coverage as target\\\\\\\",\\\\\\\"Goodhart on coverage\\\\\\\",\\\\\\\"covered vs tested\\\\\\\",\\\\\\\"DO-178C\\\\\\\"]\",\"triggers\":\"[\\\\\\\"what coverage target should we set\\\\\\\",\\\\\\\"is 100% coverage the goal\\\\\\\",\\\\\\\"is this line covered or tested\\\\\\\",\\\\\\\"should we add tests for these uncovered lines\\\\\\\",\\\\\\\"branch vs line coverage\\\\\\\"]\",\"examples\":\"[\\\\\\\"set a coverage policy for a service that distinguishes 'covered' from 'tested'\\\\\\\",\\\\\\\"explain why a 100% line coverage gate may be worse than an 80% branch coverage gate\\\\\\\",\\\\\\\"diagnose a high-coverage test suite that still misses bugs — likely a granularity mismatch\\\\\\\",\\\\\\\"decide when to upgrade from branch coverage to MC/DC for safety-critical code\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"choose test levels (unit/integration/e2e) for a project (use testing-strategy)\\\\\\\",\\\\\\\"use mutation testing to evaluate test-suite quality (use mutation-testing)\\\\\\\",\\\\\\\"configure Istanbul, Jest coverage, or JaCoCo (tool 
documentation)\\\\\\\"]\",\"relations\":\"{\\\\\\\"related\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"mutation-testing\\\\\\\",\\\\\\\"test-driven-development\\\\\\\",\\\\\\\"eval-driven-development\\\\\\\"],\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"testing-strategy owns the question of what to test at which level; this skill owns the measurement of how much of the code those tests actually exercise and at what granularity.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"mutation-testing\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"mutation-testing owns the stronger signal of whether tests would *catch* a defect; this skill owns the structural signal of whether tests *reach* the code at all. Coverage is a necessary-not-sufficient precondition for mutation tests to even apply.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"test-driven-development\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"TDD produces high coverage as a side effect; this skill explains why that coverage is a side effect rather than a target, and why pursuing the metric directly produces worse tests.\\\\\\\"}],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"mutation-testing\\\\\\\"]}\",\"mental_model\":\"|\",\"purpose\":\"|\",\"boundary\":\"|\",\"analogy\":\"Test coverage is to a test suite what cell-phone coverage maps are to a network — a green map on the carrier's website is not the same as actually being able to make a call in your basement; the map measures *where there is theoretical signal*, not *whether the call gets through*. 
A 100% coverage map with a 30% call-completion rate is the same shape of problem as a 100%-line-coverage suite that misses bugs — the floor is met (no dead zones), but the ceiling is unverified.\",\"misconception\":\"|\",\"concept\":\"{\\\\\\\"definition\\\\\\\":\\\\\\\"Test coverage is the family of structural measurements that report which parts of the production code were exercised by the test suite. Coverage criteria differ in granularity — function, statement, line, branch, decision, condition, modified-condition/decision (MC/DC), path — and each higher criterion subsumes the lower ones. Coverage is a *structural* property: it answers 'did the test reach this code,' not 'did the test verify this behavior.' The strategic discipline of test coverage is using coverage as a diagnostic signal about gaps in the test suite while resisting the Goodhart-Law failure mode of treating the coverage number as the target. Coverage is a floor (below this level, the test suite definitely misses things) not a ceiling (above this level, the test suite is necessarily good).\\\\\\\",\\\\\\\"mental_model\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"purpose\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"boundary\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"taxonomy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"analogy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"misconception\\\\\\\":\\\\\\\"|\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/test-coverage-strategy/SKILL.md\"}"
8
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
9
+ skill_graph_protocol: Skill Metadata Protocol v4
10
+ skill_graph_project: Skill Graph
11
+ skill_graph_canonical_skill: skills/test-coverage-strategy/SKILL.md
12
+ ---
13
+
14
+ # Test-Coverage Strategy
15
+
16
+ ## Coverage
17
+
18
+ The strategic use of code-coverage measurement as a diagnostic signal about test-suite gaps, without falling into the Goodhart-Law trap of treating the coverage number as the target. Covers the coverage-criterion hierarchy (function → line → branch → decision → condition → MC/DC → path), Marick's distinction between coverage as floor (uncovered code is definitely untested) and coverage as ceiling (covered code is necessarily tested — false), the covered-vs-tested distinction, the safety-critical industry's MC/DC standard (DO-178C aviation, ISO 26262 automotive) as empirical evidence for granularity-matching, and the strategic uses of coverage (diagnostic, floor-gating, diff-based gating, audit artifact).
19
+
20
+ ## Philosophy
21
+
22
+ Coverage measures execution, not verification. A test that executes every line but asserts nothing has 100% line coverage and almost no testing value; a test with rich assertions that misses some branches has lower coverage and may have much higher testing value. The metric's strategic value is in the *floor direction*: uncovered code is definitely untested. The reverse direction, that covered code is necessarily tested, is not reliable.
23
+
24
+ The discipline of coverage strategy is using the metric where it is informative (gap detection, regression signaling, test-effort allocation, audit compliance) and refusing to use it where it is misleading (as a quality target, as a substitute for behavioral verification, as a ceiling claim about test thoroughness). The risk is large because coverage numbers are easy to produce, easy to gate on, and easy to optimize for in ways that degrade the test suite.
25
+
26
+ Most working test suites should aim for coverage that is *high enough to reveal gaps without becoming the goal*. Where the certification environment mandates a specific criterion (MC/DC for safety-critical), the discipline shifts: the criterion is given; the work is meeting it without coverage-padding.
27
+
28
+ ## The Criterion Hierarchy
29
+
30
+ | Criterion | What it requires | Test count for `A && B && C` | Typical use |
31
+ |---|---|---|---|
32
+ | Function | Each function called once | 1 | Detects entirely-untested functions |
33
+ | Statement / Line | Each line executed once | 1 | Default in most tools |
34
+ | Branch | Each branch (true/false) taken once | 2 | Catches missed-else-branch bugs |
35
+ | Decision | Each decision evaluated both ways | 2 | Often conflated with branch in tooling |
36
+ | Condition | Each Boolean sub-expression both true and false | 2 (when every operand is evaluated; more under short-circuit evaluation) | Catches short-circuit-evaluation bugs |
37
+ | MC/DC | Each condition independently affects the decision | 4 (N+1 for N conditions) | DO-178C Level A; ISO 26262 ASIL-D |
38
+ | Multiple condition | All combinations of conditions exercised | 8 (2^N) | Combinatorial; rarely required |
39
+ | Path | Every execution path through the function | Infeasible for loops | Theoretical |
40
+
41
+ The right criterion depends on where the code's failure modes hide. Straight-line procedural code: line coverage. Complex Boolean conditions (access control, rate limiters, financial logic): branch or condition coverage. Safety-critical: MC/DC by mandate.
42
+
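The test counts in the table for `A && B && C` can be checked by enumeration. The sketch below is illustrative only: `decision`, `independence_pairs`, and `mcdc_test_set` are hypothetical helper names, and it assumes unique-cause MC/DC, where each condition needs a pair of inputs that differ only in that condition and flip the outcome. It takes the first such pair per condition, which happens to be minimal for this decision but is not a general minimization algorithm.

```python
from itertools import product

def decision(a: bool, b: bool, c: bool) -> bool:
    """The decision under test: A && B && C."""
    return a and b and c

# Multiple-condition coverage needs every combination: 2^3 = 8 inputs.
ALL_INPUTS = list(product([True, False], repeat=3))

# Branch/decision coverage needs one true and one false outcome: 2 inputs.
BRANCH_SET = {(True, True, True), (False, False, False)}

def independence_pairs(index: int):
    """Pairs that differ only in condition `index` and flip the decision."""
    pairs = []
    for inputs in ALL_INPUTS:
        if not inputs[index]:
            continue  # visit each pair once, starting from the True side
        flipped = tuple(not v if i == index else v for i, v in enumerate(inputs))
        if decision(*inputs) != decision(*flipped):
            pairs.append((inputs, flipped))
    return pairs

def mcdc_test_set():
    """Union of one independence pair per condition (first pair found)."""
    tests = set()
    for index in range(3):
        tests.update(independence_pairs(index)[0])
    return tests

if __name__ == "__main__":
    print(len(BRANCH_SET), len(mcdc_test_set()), len(ALL_INPUTS))
```

The three pairs all share the all-true input, which is why three conditions need N+1 = 4 tests and not 2N = 6.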
43
+ ## Covered vs Tested
44
+
45
+ | Aspect | Covered | Tested |
46
+ |---|---|---|
47
+ | What it measures | Test reached the code | Test verified the code's behavior |
48
+ | Detected by | Coverage tool | Assertion presence + correctness |
49
+ | 100% achievable cheaply by | Function/line traversal with no assertions | Real behavioral assertions |
50
+ | Signal direction | Floor (uncovered = untested) reliable | Ceiling (covered = tested) unreliable |
51
+ | Strategic value | Gap detection | Quality assurance |
52
+
53
+ A test that calls `processOrder()` but asserts nothing covers `processOrder` and tests nothing in it. The coverage report cannot distinguish this from a test with full behavioral assertions. The discipline is to read coverage as one part of a picture that includes assertion review, mutation testing (see `mutation-testing`), and integration test coverage.
54
+
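A minimal sketch of the covered-vs-tested gap, assuming a hypothetical `process_order` with a deliberately planted bug. Both checks execute every line of the function, so a coverage tool would report them identically; only the second one can fail.

```python
def process_order(quantity: int, unit_price: float) -> float:
    """Hypothetical order total with a planted bug (adds instead of multiplies)."""
    return quantity + unit_price  # BUG: should be quantity * unit_price

def check_covers_only() -> None:
    """Reaches the code but verifies nothing."""
    process_order(3, 10.0)

def check_verifies_behavior() -> None:
    """Reaches the same code and asserts on the result."""
    assert process_order(3, 10.0) == 30.0

def passes(check) -> bool:
    """Run a check and report whether it passed."""
    try:
        check()
    except AssertionError:
        return False
    return True

if __name__ == "__main__":
    print("covers only:", passes(check_covers_only))
    print("verifies behavior:", passes(check_verifies_behavior))
```

The assertion-free check passes despite the bug, which is the case the coverage report cannot flag.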
55
+ ## When To Use Each Criterion
56
+
57
+ | Code character | Recommended criterion | Why |
58
+ |---|---|---|
59
+ | Straight-line procedural; few branches | Line | Adequate; cheap |
60
+ | Branchy business logic | Branch | Catches missed-else cases that line coverage misses |
61
+ | Compound Boolean conditions (access checks, rate limiting, financial gates) | Condition or MC/DC | Catches short-circuit-evaluation bugs |
62
+ | Safety-critical (aviation, automotive, medical) | MC/DC | Required by DO-178C for Level A; highly recommended by ISO 26262 for ASIL-D |
63
+ | Algorithmic / mathematical | Branch + mutation testing | Coverage + behavioral signal |
64
+ | Pure utility / value objects | Line | Line coverage catches most of their failure modes |
65
+ | Error / exception paths | Branch (or accept the gap) | Branch reveals untested error handling |
66
+
67
+ ## Goodhart-Resistant Strategy
68
+
69
+ Patterns that use coverage strategically without producing Goodhart-pressure:
70
+
71
+ 1. **Diagnostic, not gating.** Report coverage on PRs; don't fail builds on coverage. Reviewers read the report; the team adjusts allocation.
72
+ 2. **Diff-based gating at low threshold.** Gate only new or changed lines at a modest threshold (e.g., 70%) rather than gating the codebase as a whole. The aggregate target is lower, but each test written to meet it is more likely to be meaningful.
73
+ 3. **Coverage as part of a panel.** Pair coverage with mutation score, assertion density, and integration test breadth. No single metric is the target.
74
+ 4. **Per-module differentiation.** Higher coverage required for high-impact modules (financial calculation, security, data-loss-risk); lower for stable utility code.
75
+ 5. **Annual review of gaps.** Walk the uncovered code; decide consciously whether the gap is acceptable (defensive code, logging) or a real test debt.
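The diff-based gate in pattern 2 can be sketched as a set intersection between changed lines and executed lines. The data shapes below (a mapping from file path to line numbers) and the function name `diff_coverage` are assumptions for illustration; real coverage tools and diff parsers expose this information in their own formats.

```python
def diff_coverage(changed: dict[str, set[int]], covered: dict[str, set[int]]) -> float:
    """Fraction of changed lines that the test run executed."""
    total = sum(len(lines) for lines in changed.values())
    if total == 0:
        return 1.0  # nothing changed, so there is nothing to gate
    hit = sum(
        len(lines & covered.get(path, set()))
        for path, lines in changed.items()
    )
    return hit / total


# Hypothetical inputs: lines touched by the PR and lines executed by tests.
changed_lines = {"billing.py": {10, 11, 12, 13}, "utils.py": {5, 6}}
covered_lines = {"billing.py": {1, 2, 10, 11, 12}, "utils.py": {5, 6, 7}}

THRESHOLD = 0.70  # the example threshold from the list above
ratio = diff_coverage(changed_lines, covered_lines)
verdict = "ok" if ratio >= THRESHOLD else "below threshold"
print(f"{ratio:.2f} {verdict}")  # 0.83 ok
```

Because only the changed lines enter the denominator, untested legacy code neither blocks the PR nor invites padding tests written to lift an aggregate number.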
76
+
77
+ ## Verification
78
+
79
+ After applying this skill, verify:
80
+ - [ ] The team's coverage criterion is appropriate to the code's failure modes (line for procedural, branch for branchy, MC/DC for safety-critical). The default criterion is not assumed correct without checking what it measures.
81
+ - [ ] Coverage is read as a floor signal (gaps point to untested code), not a ceiling signal (high coverage proves tests are good). The team distinguishes "covered" from "tested" in conversation and in policy.
82
+ - [ ] Coverage gates, if used, are set at a level achievable by real behavioral testing — not a level that forces coverage-padding tests.
83
+ - [ ] Coverage of dead code is excluded or noted. A dropping coverage percentage is interpreted (could be dead-code removal, test deletion, or new untested code), not treated as automatically bad.
84
+ - [ ] Coverage is paired with at least one behavioral measurement (mutation testing, assertion review, integration-test breadth) in any quality conversation about the test suite.
85
+ - [ ] For safety-critical code, the certification standard's coverage criterion is the criterion in use, and the tool's reported number actually matches the standard's definition.
86
+ - [ ] Production observability is treated as a separate verification dimension from coverage. Coverage tells you what tests exercised; observability tells you what production exercises.
87
+ - [ ] The team can explain a coverage gap on any specific module as either "we don't need to test this because [reason]" or "we should test this and have not." Unexplained gaps are test debt.
88
+
89
+ ## Do NOT Use When
90
+
91
+ | Instead of this skill | Use | Why |
92
+ |---|---|---|
93
+ | Deciding which test levels (unit/integration/e2e) to invest in | `testing-strategy` | testing-strategy owns the level/scope question; this skill owns the measurement of structural reach |
94
+ | Measuring whether tests would catch deliberately-introduced defects | `mutation-testing` | mutation owns the behavioral signal; this skill owns the structural signal |
95
+ | Constructing test doubles (mocks, stubs, fakes) | `test-doubles-design` | test-doubles-design owns the stand-in design |
96
+ | Iterating on LLM behavior via an eval suite | `eval-driven-development` | eval-driven-development is the LLM-system analog; this skill is about deterministic-software coverage |
97
+ | Configuring Istanbul, Jest coverage, JaCoCo, Coverlet | tool documentation | tool API choice is tactical detail below this skill's scope |
98
+
99
+ ## Key Sources
100
+
101
+ - Marick, B. (1999). ["How to Misuse Code Coverage"](http://www.exampler.com/testing-com/writings/coverage.pdf). The canonical practitioner essay on coverage as a strategic signal vs target; defines the floor/ceiling distinction.
102
+ - Cornett, S. ["Code Coverage Analysis"](http://www.bullseye.com/coverage.html). Long-form reference defining the coverage criterion hierarchy and the practical differences between branch, decision, and MC/DC coverage.
103
+ - RTCA. *DO-178C: Software Considerations in Airborne Systems and Equipment Certification* (2011). The aviation certification standard mandating MC/DC for Level A software. Industry benchmark for what coverage criterion safety-critical software requires.
104
+ - ISO. *ISO 26262: Road vehicles — Functional safety* (2018). The automotive safety standard, which lists MC/DC as highly recommended for ASIL-D software. Parallel industry benchmark.
105
+ - Hayhurst, K. J., Veerhusen, D. S., Chilenski, J. J., & Rierson, L. K. (2001). ["A Practical Tutorial on Modified Condition/Decision Coverage"](https://ntrs.nasa.gov/citations/20010057789). NASA technical memorandum; the practical reference on MC/DC.
106
+ - Goodhart, C. (1975). "Problems of Monetary Management: The U.K. Experience." The origin of Goodhart's Law as commonly cited; applies sharply to coverage-as-target.
107
+ - Inozemtseva, L., & Holmes, R. (2014). ["Coverage Is Not Strongly Correlated with Test Suite Effectiveness"](https://dl.acm.org/doi/10.1145/2568225.2568271). *ICSE 2014*. Empirical study showing coverage and test-suite effectiveness are weakly correlated; supports the floor-not-ceiling reading.
108
+ - Namin, A. S., & Andrews, J. H. (2009). ["The influence of size and coverage on test suite effectiveness"](https://dl.acm.org/doi/10.1145/1572272.1572284). *ISSTA 2009*. Earlier empirical study with similar findings on coverage's limited correlation with bug-finding effectiveness.
@@ -0,0 +1,98 @@
1
+ ---
2
+ name: test-doubles-design
3
+ description: "Use when designing or reviewing test doubles — the stand-in objects that replace real collaborators in a test: the five-kind taxonomy (dummy, stub, spy, fake, mock) from Meszaros's xUnit Test Patterns, the difference between state verification (Detroit-school, classicist) and interaction verification (London-school, mockist) and how it determines which doubles fit, the cost of doubles (fragility, false confidence, divergence from real behavior), the role of fakes as the under-used middle ground, the verify_with relationship to test-driven-development, and the heuristics for when to use a real collaborator instead of any double. Do NOT use for choosing test levels or what to test (use testing-strategy), the design discipline of writing tests first (use test-driven-development), specific mocking-library API choice (library docs), or general production stubs and feature flags (use feature-gating or domain-specific skills)."
4
+ license: MIT
5
+ allowed-tools: Read Grep
6
+ metadata:
7
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"quality\",\"domain\":\"quality/testing\",\"scope\":\"reference\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-16\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-16\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"comprehension_state\":\"present\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"test double\\\\\\\",\\\\\\\"mock\\\\\\\",\\\\\\\"stub\\\\\\\",\\\\\\\"spy\\\\\\\",\\\\\\\"fake\\\\\\\",\\\\\\\"dummy\\\\\\\",\\\\\\\"test isolation\\\\\\\",\\\\\\\"interaction verification\\\\\\\",\\\\\\\"state verification\\\\\\\",\\\\\\\"mockist\\\\\\\",\\\\\\\"classicist\\\\\\\",\\\\\\\"in-memory fake\\\\\\\",\\\\\\\"sociable test\\\\\\\",\\\\\\\"solitary test\\\\\\\",\\\\\\\"mocking external API calls\\\\\\\",\\\\\\\"testing code that calls an API\\\\\\\"]\",\"triggers\":\"[\\\\\\\"should this be a mock or a stub\\\\\\\",\\\\\\\"are we using mocks correctly\\\\\\\",\\\\\\\"the test is brittle when I refactor\\\\\\\",\\\\\\\"do we need a fake here\\\\\\\",\\\\\\\"is this test really testing anything\\\\\\\"]\",\"examples\":\"[\\\\\\\"decide between a mock, a stub, and a fake for a database collaborator in a test\\\\\\\",\\\\\\\"explain why over-mocking produces fragile tests that change with every refactor\\\\\\\",\\\\\\\"diagnose a passing test that mirrors the implementation rather than specifying behavior\\\\\\\",\\\\\\\"design an in-memory fake for a repository interface that supports both classicist tests and integration tests\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"decide which test levels (unit/integration/e2e) the project should invest in (use testing-strategy)\\\\\\\",\\\\\\\"set up a production feature flag (use feature-gating)\\\\\\\",\\\\\\\"configure a specific mocking library — Jest, Sinon, Mockito (library 
docs)\\\\\\\"]\",\"relations\":\"{\\\\\\\"related\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"test-driven-development\\\\\\\",\\\\\\\"refactor\\\\\\\",\\\\\\\"api-design\\\\\\\",\\\\\\\"type-safety\\\\\\\"],\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"test-driven-development\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"test-driven-development owns the design discipline of writing the test before the production code; this skill owns the design of the stand-in objects those tests use. The two compose: TDD prescribes the rhythm; test-doubles-design prescribes the stand-ins.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"testing-strategy owns the strategic question of what to test at which level; this skill owns the tactical construction of the stand-ins that make a given test possible.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"refactor\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"refactor owns behavior-preserving structural change; this skill owns the doubles whose over-use produces refactor-fragile tests. The two skills are read together when a test suite resists refactoring.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"api-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"api-design owns the design of an interface as a production contract; this skill owns the doubles that exercise that interface in tests. When test doubles drive interface decisions (London-school TDD), this skill and api-design overlap heavily.\\\\\\\"}],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"test-driven-development\\\\\\\",\\\\\\\"refactor\\\\\\\"]}\",\"mental_model\":\"|\",\"purpose\":\"|\",\"boundary\":\"|\",\"analogy\":\"A test double is to a unit under test what a Hollywood stunt-double is to a leading actor — the stunt-double looks like the actor enough that the camera believes the scene, performs the dangerous part the actor cannot or should not perform (slow networks, paid APIs, real emails), but is not the actor. 
The director's job is choosing the right kind of stunt-double for the scene: a dummy stands in the back of the shot (placeholder); a stub recites canned lines from off-camera (canned answers); a spy records which lines were spoken (call recording); a mock has a script that *fails the take* if the actor deviates (rigid interaction verification); a fake is a body-double trained to actually perform the action in a controlled way (working in-memory substitute). Casting mocks where a fake would do produces takes that pass when the stunt is performed exactly as written and fail catastrophically when the actor improvises a better line.\",\"misconception\":\"|\",\"concept\":\"{\\\\\\\"definition\\\\\\\":\\\\\\\"A test double is a stand-in object that replaces a real collaborator in a test so the test can run without the collaborator's real behavior. The term, from Meszaros's *xUnit Test Patterns*, generalizes a family of stand-ins — dummies, stubs, spies, fakes, and mocks — each defined by what the test expects of it. Doubles exist because real collaborators are often unavailable (external services), nondeterministic (time, network, randomness), slow (databases, file systems), expensive (paid APIs), or have side effects unacceptable in a test (sending email, charging cards). 
The discipline of test-doubles design is choosing the right kind of double for each test's purpose, recognizing that the choice determines what the test actually verifies — state or interaction — and what the test will reveal under refactoring.\\\\\\\",\\\\\\\"mental_model\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"purpose\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"boundary\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"taxonomy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"analogy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"misconception\\\\\\\":\\\\\\\"|\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/test-doubles-design/SKILL.md\"}"
8
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
9
+ skill_graph_protocol: Skill Metadata Protocol v4
10
+ skill_graph_project: Skill Graph
11
+ skill_graph_canonical_skill: skills/test-doubles-design/SKILL.md
12
+ ---
13
+
14
+ # Test-Doubles Design
15
+
16
+ ## Coverage
17
+
18
+ The discipline of choosing and constructing the stand-in objects that replace real collaborators in tests. Covers the five-kind taxonomy (dummy, stub, spy, mock, fake) from Meszaros's *xUnit Test Patterns*, the state-vs-interaction verification distinction that determines which kinds fit which tests, the solitary-vs-sociable test-shape trade-off, the under-use of fakes as the robust middle ground, the cost ledger of every double (divergence, maintenance, false confidence, fragility), and the heuristics for when to use a real collaborator instead of any double. Includes the connection to London/Detroit-school TDD and to the api-design surface that doubles often pressure.
19
+
20
+ ## Philosophy
21
+
22
+ A test double is a small lie the test tells the unit under test. The lie is useful — it makes the test fast, isolated, deterministic, and free of side effects — but every lie is also a place where the test's belief about the collaborator may diverge from the collaborator's actual behavior. The discipline of test-doubles design is choosing lies whose costs are worth the tests they enable, and recognizing that the choice of *what kind* of lie shapes what the test actually verifies.
23
+
24
+ The biggest design decision in test-doubles work is whether the test is verifying state ("after this action, the system looks like this") or interaction ("during this action, these calls were made to collaborators"). The choice is not a matter of preference; it determines the test suite's character, its fragility under refactoring, and its connection to the production design. London-school TDD favors interaction; Detroit-school favors state. Most working test suites mix both, and most working test suites have not chosen the mix consciously.
25
+
26
+ The under-acknowledged construct is the fake. Discourse about test doubles is dominated by mocks (because they are the most distinctive kind and the most natural fit for interaction verification), but fakes, which are working implementations with production-unsuitable shortcuts, often produce more robust tests with less long-term maintenance cost. A test suite that uses fakes for collaborators that admit a working stand-in (databases, clocks, HTTP clients) and reserves mocks for genuine interaction verification usually ages better than the equivalent mock-heavy suite.
27
+
28
+ ## The Five Kinds — A Practical Reference
29
+
30
+ | Kind | What it is | What it verifies | Fragility under refactor | Best use |
31
+ |---|---|---|---|---|
32
+ | Dummy | Placeholder; never actually used | Nothing | None — it isn't exercised | Parameter slots the test path doesn't touch |
33
+ | Stub | Returns canned answers to calls | State after the action | Low — only the canned answer matters | Providing inputs the unit needs |
34
+ | Spy | Stub + records calls received | State and (post-hoc) which calls happened | Medium — call-shape changes break the spy assertion | Flexible interaction verification |
35
+ | Mock | Pre-programmed with expected calls; double itself fails on deviation | Interaction (built into the double) | High — call-shape changes break the test directly | Verifying a contract with a collaborator |
36
+ | Fake | Working implementation of the interface with production-unsuitable shortcuts | State end-to-end through the fake | Low — behavior is verified through a working substitute | Slow/nondeterministic collaborators with workable in-memory stand-ins |
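To make the table concrete, here is a hand-rolled sketch of all five kinds used in one hypothetical `checkout` function. Every class and name below is invented for illustration; a mocking library would usually generate the stub, spy, and mock, but writing them by hand shows what separates the kinds.

```python
class DummyLogger:
    """Dummy: fills a parameter slot; the test path never calls it."""
    def log(self, message):
        raise AssertionError("dummy should not be used")


class StubRates:
    """Stub: returns a canned answer to every call."""
    def rate(self, currency):
        return 2.0


class SpyMailer:
    """Spy: records calls so the test can assert on them afterwards."""
    def __init__(self):
        self.sent = []

    def send(self, to, body):
        self.sent.append((to, body))


class FakeRepo:
    """Fake: a working in-memory implementation of the interface."""
    def __init__(self):
        self._rows = {}

    def save(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]


class MockGateway:
    """Mock: pre-programmed with the expected call; fails on deviation."""
    def __init__(self, expected_amount):
        self.expected_amount = expected_amount
        self.called = False

    def charge(self, amount):
        assert amount == self.expected_amount, f"unexpected charge {amount}"
        self.called = True

    def verify(self):
        assert self.called, "charge was never called"


def checkout(order_id, amount, currency, rates, repo, mailer, gateway, logger):
    """Hypothetical unit under test; it never touches the logger."""
    total = amount * rates.rate(currency)
    gateway.charge(total)
    repo.save(order_id, total)
    mailer.send("customer@example.com", f"charged {total}")
    return total


repo, mailer, gateway = FakeRepo(), SpyMailer(), MockGateway(expected_amount=20.0)
total = checkout("o-1", 10.0, "EUR", StubRates(), repo, mailer, gateway, DummyLogger())

assert repo.get("o-1") == 20.0  # state, verified through the fake
assert mailer.sent == [("customer@example.com", "charged 20.0")]  # post-hoc, via the spy
gateway.verify()  # interaction, enforced by the mock itself
```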
37
+
38
+ ## State vs Interaction — The Defining Choice
39
+
40
+ | Property | State verification | Interaction verification |
41
+ |---|---|---|
42
+ | Question the test answers | "After this, what does the system look like?" | "During this, what did the system do?" |
43
+ | Typical doubles | Stubs, fakes | Spies, mocks |
44
+ | Schools | Detroit-school (Beck), classicist | London-school (Freeman & Pryce), mockist |
45
+ | Refactor fragility | Low — internal call shapes can change | High — call shapes are pinned by the test |
46
+ | Design coupling | Loose — test sees the unit's surface | Tight — test sees the unit's collaborations |
47
+ | Failure mode | Coarse assertions; missed interaction bugs | Test mirrors implementation; refactor breaks tests that still produce correct behavior |
48
+
49
+ A practical heuristic: prefer state verification when the unit's behavior is naturally state-shaped (calculators, parsers, domain logic); prefer interaction verification when the unit's behavior is naturally collaboration-shaped (orchestrators, controllers, services that exist to coordinate other services).
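The fragility difference in the table can be shown with one unit and two test styles. In this sketch, `register_v2` is a behavior-preserving refactor of `register_v1` that batches the write; the state check, run against a fake, passes for both versions, while the interaction check, which pins the call shape with `unittest.mock.Mock`, fails after the refactor. The repository interface and both functions are hypothetical.

```python
from unittest.mock import Mock


class FakeUserRepo:
    """In-memory fake supporting both the single and the batched write."""
    def __init__(self):
        self.users = []

    def save(self, user):
        self.users.append(user)

    def save_all(self, users):
        self.users.extend(users)


def register_v1(repo, name):
    repo.save({"name": name})


def register_v2(repo, name):
    # Refactored: same observable outcome, different call shape.
    repo.save_all([{"name": name}])


def state_check(register):
    repo = FakeUserRepo()
    register(repo, "ada")
    assert repo.users == [{"name": "ada"}]


def interaction_check(register):
    repo = Mock()
    register(repo, "ada")
    repo.save.assert_called_once_with({"name": "ada"})


def outcome(check, register) -> str:
    try:
        check(register)
        return "pass"
    except AssertionError:
        return "fail"


results = {
    (c.__name__, r.__name__): outcome(c, r)
    for c in (state_check, interaction_check)
    for r in (register_v1, register_v2)
}
for key, value in results.items():
    print(key, value)
```

Only the `interaction_check` against `register_v2` fails, even though the stored user is identical in both versions. Whether that failure is noise or signal depends on whether the exact `save` call is part of the contract.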
50
+
51
+ ## When To Use A Real Collaborator (Sociable Tests)
52
+
53
+ | Collaborator | Real-collaborator use? | If not, prefer |
54
+ |---|---|---|
55
+ | Pure functions, value objects, in-process domain logic | Yes — always | n/a |
56
+ | In-process services, repositories with in-memory implementations | Often yes | Fake if the real one has setup cost |
57
+ | Database access | Increasingly yes (containerized real DB in CI) | In-memory DB fake (SQLite/H2) over mock |
58
+ | File system | Usually fake (temp dir or in-memory FS) | Fake over mock |
59
+ | Time, clocks, schedulers | Always fake | Deterministic clock fake |
60
+ | Network (HTTP) | Recorded responses (fake) or real in integration scope | Fake (recorded responder) over mock |
61
+ | External paid APIs | Fake or contract test against recorded responses | Fake over mock |
62
+ | Other services across process boundary | Contract test against; double in unit scope | Fake or stub at unit scope; real in integration scope |
63
+
64
+ A test suite that mocks pure-function collaborators is over-mocking; a test suite that uses real third-party APIs in every unit test is under-using doubles.
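For the "Time, clocks, schedulers" row, a deterministic clock fake might look like the following sketch. `FakeClock` and `Session` are hypothetical names; the assumption is that production code receives a clock object by injection rather than calling `datetime.now()` directly.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Deterministic clock: time moves only when the test advances it."""
    def __init__(self, start: datetime):
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, delta: timedelta) -> None:
        self._now += delta


class Session:
    """Hypothetical unit under test: expires `ttl` after creation."""
    def __init__(self, clock, ttl: timedelta):
        self._clock = clock
        self._expires_at = clock.now() + ttl

    def is_expired(self) -> bool:
        return self._clock.now() >= self._expires_at


clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
session = Session(clock, ttl=timedelta(minutes=30))

before = session.is_expired()
clock.advance(timedelta(minutes=30))
after = session.is_expired()
print(before, after)  # False True
```

The test exercises the expiry boundary exactly, with no sleeping and no dependence on wall-clock time.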
65
+
66
+ ## Verification
67
+
68
+ After applying this skill, verify:
69
+ - [ ] Every double in the test suite is identifiable as one of the five kinds (dummy, stub, spy, mock, fake). Tests that say "mock" loosely (when the construct is really a stub or spy) are using the term in dialect, not in the precise sense.
70
+ - [ ] The verification style (state vs interaction) is consistent within a test. A test that mixes both styles is usually testing two things and should be split.
71
+ - [ ] Doubles sit at real seams (service boundaries, external dependencies, true injection points), not at artificial seams introduced just to enable the test.
72
+ - [ ] Where a collaborator admits a working in-memory implementation, a fake is preferred over a mock. Mocks are reserved for genuine interaction verification.
73
+ - [ ] No test mocks the database; database interaction uses an in-memory DB fake, a containerized real DB in CI, or a contract test layer.
74
+ - [ ] The school being practiced (London/Detroit/hybrid) is intentional, not accidental. The mock-to-fake ratio in the test suite is a measurement of which school is in use.
75
+ - [ ] Tests are not over-specified — strict mocks that pin exact call sequences are used only where the contract is genuinely about the call sequence (rare).
76
+ - [ ] Contract tests, integration tests, or production observability close the gap between mock-isolated unit tests and the real collaborator's actual behavior. Mocks alone are not the full verification story.
77
+
78
+ ## Do NOT Use When
79
+
80
+ | Instead of this skill | Use | Why |
81
+ |---|---|---|
82
+ | Choosing what to test, at which level | `testing-strategy` | testing-strategy owns the strategic question; this skill owns the doubles inside chosen tests |
83
+ | Designing the rhythm of test-first development | `test-driven-development` | TDD owns the discipline of writing tests first; this skill owns the doubles those tests use |
84
+ | Configuring a specific mocking library (Jest, Sinon, Mockito) | library documentation | library API choice is tactical detail below this skill's scope |
85
+ | Setting up a production feature flag or runtime alternative | `feature-gating` | feature-gating is about production behavior; this skill is about test-harness behavior |
86
+ | Designing the production interface a double would mimic | `api-design` | api-design owns the production contract; this skill consumes that contract for tests |
87
+ | Performing a behavior-preserving structural change | `refactor` | refactor owns the technique; this skill owns the doubles that may resist or enable the refactor |
88
+
89
+ ## Key Sources
90
+
91
+ - Meszaros, G. (2007). *xUnit Test Patterns: Refactoring Test Code*. Addison-Wesley. The canonical reference for the five-kind test-double taxonomy (dummy, stub, spy, mock, fake) and the broader patterns of test-code design.
92
+ - Fowler, M. (2007). ["Mocks Aren't Stubs"](https://martinfowler.com/articles/mocksArentStubs.html). The defining practitioner essay distinguishing classicist (Detroit) and mockist (London) TDD and the role of doubles in each.
93
+ - Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests* (GOOS). Addison-Wesley. The canonical book on London-school TDD; treats mocks as a design tool that drives collaboration interfaces.
94
+ - Beck, K. (2002). *Test-Driven Development: By Example*. Addison-Wesley. The canonical book on Detroit-school TDD; uses doubles sparingly and verifies state.
95
+ - Fowler, M. ["Test Double"](https://martinfowler.com/bliki/TestDouble.html). Short reference page formalizing Meszaros's taxonomy in practitioner vocabulary.
96
+ - de Oliveira Neto, F. G., et al. (2019). ["Evolution of statistical analysis in empirical software engineering research: Current state and steps forward"](https://www.sciencedirect.com/science/article/abs/pii/S0164121219300573). Review of statistical-analysis practice in empirical software engineering research; useful background for judging the strength of empirical claims about testing practices, including mock-heavy vs sociable test designs.
97
+ - Spadini, D., Aniche, M., Bruntink, M., & Bacchelli, A. (2019). ["Mock objects for testing Java systems"](https://link.springer.com/article/10.1007/s10664-018-9663-0). *Empirical Software Engineering*. Industrial study on mock usage patterns and their evolution.
98
+ - Mancinelli, F. (2018). ["A Survey on Test Doubles Frameworks"](https://arxiv.org/abs/1801.10306). Comparative review of mocking libraries across languages; useful as a third-party reference for library-specific encodings.
@@ -0,0 +1,96 @@
1
+ ---
2
+ name: test-driven-development
3
+ description: "Use when reasoning about Test-Driven Development as a design discipline rather than a workflow: the red-green-refactor cycle as a feedback loop, the difference between London-school (outside-in, interaction-heavy, mock-driven) and Detroit-school (inside-out, state-heavy, classicist) TDD, the role of TDD as a design tool (how tests pressure code into more decomposable shapes), the connection between TDD and emergent design, the boundary between TDD and prior-test-suites, why TDD's failure mode is not 'no tests' but 'tests that mirror implementation', and the empirical record of TDD's effects on defect density, design quality, and development velocity. Do NOT use for the strategy of what to test at which level (use testing-strategy), the construction of test doubles (use test-doubles-design), the discipline of LLM eval iteration (use eval-driven-development), or general-software process workflow (use the obra/superpowers test-driven-development workflow skill — this skill is the concept-shape complement)."
4
+ license: MIT
5
+ allowed-tools: Read Grep
6
+ metadata:
7
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"quality\",\"domain\":\"quality/testing\",\"scope\":\"reference\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-16\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-16\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"comprehension_state\":\"present\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"test-driven development\\\\\\\",\\\\\\\"TDD\\\\\\\",\\\\\\\"red green refactor\\\\\\\",\\\\\\\"London school\\\\\\\",\\\\\\\"Detroit school\\\\\\\",\\\\\\\"Chicago school\\\\\\\",\\\\\\\"outside-in TDD\\\\\\\",\\\\\\\"inside-out TDD\\\\\\\",\\\\\\\"mockist\\\\\\\",\\\\\\\"classicist\\\\\\\",\\\\\\\"emergent design\\\\\\\",\\\\\\\"test-first\\\\\\\",\\\\\\\"design pressure\\\\\\\",\\\\\\\"GOOS\\\\\\\"]\",\"triggers\":\"[\\\\\\\"should we write tests first\\\\\\\",\\\\\\\"are mocks ruining the design\\\\\\\",\\\\\\\"is TDD worth it\\\\\\\",\\\\\\\"London school vs Detroit school\\\\\\\",\\\\\\\"the tests changed every refactor\\\\\\\"]\",\"examples\":\"[\\\\\\\"explain why writing the test first changes the design of the code under test\\\\\\\",\\\\\\\"decide between London-school (mocks-as-design) and Detroit-school (state-verification) TDD for a new module\\\\\\\",\\\\\\\"diagnose why the test suite is fragile under refactor — likely over-mocked interaction tests\\\\\\\",\\\\\\\"explain why high test coverage with TDD is a side effect, not the goal\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"construct a mock, stub, or spy (use test-doubles-design)\\\\\\\",\\\\\\\"decide what test levels (unit/integration/e2e) to invest in (use testing-strategy)\\\\\\\",\\\\\\\"iterate on LLM behavior using an eval suite (use 
eval-driven-development)\\\\\\\"]\",\"relations\":\"{\\\\\\\"related\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"test-doubles-design\\\\\\\",\\\\\\\"eval-driven-development\\\\\\\",\\\\\\\"refactor\\\\\\\",\\\\\\\"type-safety\\\\\\\"],\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"testing-strategy owns the question 'what should we test, at which level, with what evidence' for a given change; this skill owns the design discipline of writing the test before the code that satisfies it. The two compose — testing-strategy decides the surface; TDD prescribes the rhythm — but they answer different questions.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"test-doubles-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"test-doubles-design owns mocks/stubs/fakes/spies as a construct; this skill owns the discipline that places them (London-school heavily, Detroit-school lightly). The schools differ on how much test-doubles design matters to the practice.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"eval-driven-development\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"eval-driven-development owns the LLM analog of TDD where the unit of judgment is pass-rate over a sample rather than binary per-test pass/fail. The disciplines share the iteration-first-then-implement spirit but the math underneath differs.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"refactor\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"refactor owns behavior-preserving structural change; this skill prescribes refactor as the third beat of red-green-refactor. 
The skills compose: TDD calls refactor on every green; refactor owns how to do it without breaking behavior.\\\\\\\"}],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"refactor\\\\\\\"]}\",\"mental_model\":\"|\",\"purpose\":\"|\",\"boundary\":\"|\",\"analogy\":\"TDD is to code design what a piano teacher's metronome is to a student's playing — the rhythm is not the music, but it surfaces every uneven phrase, every rushed measure, every hesitation, in time to correct it before it ossifies into habit.\",\"misconception\":\"|\",\"concept\":\"{\\\\\\\"definition\\\\\\\":\\\\\\\"Test-Driven Development is a software design discipline in which the test for a behavior is written before the production code that satisfies the test, then production code is written until the test passes, then the code is restructured to improve its shape while the test continues to pass. The cycle — red (failing test), green (passing test), refactor (clean code, still passing) — is the unit of work, and the test suite is the design pressure that shapes the production code. TDD is not 'writing tests first' as a procedural rule; it is using the act of writing a test as the moment to design the interface, decompose the responsibility, and discover what the code should be. The tests are a beneficial side effect of doing design through the test-writing lens; mistaking the side effect for the purpose is the most common failure mode.\\\\\\\",\\\\\\\"mental_model\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"purpose\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"boundary\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"taxonomy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"analogy\\\\\\\":\\\\\\\"|\\\\\\\",\\\\\\\"misconception\\\\\\\":\\\\\\\"|\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/test-driven-development/SKILL.md\"}"
8
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
9
+ skill_graph_protocol: Skill Metadata Protocol v4
10
+ skill_graph_project: Skill Graph
11
+ skill_graph_canonical_skill: skills/test-driven-development/SKILL.md
12
+ ---
13
+
14
+ # Test-Driven Development
15
+
16
+ ## Coverage
17
+
18
+ The design discipline of writing the test before the production code that satisfies it, using the test-writing pressure to shape the code's design, and applying the red-green-refactor cycle as the unit of work. Covers the cycle structure (red → green → refactor), the design-pressure mechanism that makes TDD a design discipline rather than just a test-first habit, the distinctions between the London school and the Detroit school (also called the Chicago school): mockist vs classicist, outside-in vs inside-out, interaction vs state. Also covers the relationship between TDD and emergent design, the empirical record (Nagappan 2008 at Microsoft/IBM and meta-analyses since), the boundary between TDD and BDD, and the failure modes (skipping refactor, ignoring design pressure, mock-heavy fragile tests, coverage-as-goal Goodharting).
19
+
20
+ ## Philosophy
21
+
22
+ TDD is design through the test-writing lens. Every test you sit down to write is a moment of design: what is this unit, what does it own, what does it delegate, what does its interface look like from the outside. The discomfort of writing a hard-to-test piece of code is the same discomfort that will eventually arrive as hard-to-use, hard-to-maintain, hard-to-compose code; TDD makes that discomfort visible at the moment when correcting it is cheap.
23
+
24
+ The tests are not the point. The tests are an artifact of doing design with a test-shaped tool. A team that "does TDD" by writing tests first without heeding the design pressure has the artifact without the practice; they will conclude TDD doesn't work because they will see only the cost (slower initial development) without the benefit (better-shaped code with lower defect density).
25
+
26
+ The schools matter. London and Detroit produce different code under the same red-green-refactor cycle, because they apply the design pressure to different surfaces. A practitioner who has not chosen a school has chosen by accident, and the test suite's character — interaction-heavy or state-heavy, mock-rich or mock-sparse, outside-in or inside-out — reveals which they chose without knowing.
27
+
28
+ ## The Cycle In Detail
29
+
30
+ | Beat | Activity | Stop condition | Common mistake |
31
+ |---|---|---|---|
32
+ | Red | Write one failing test for one increment of desired behavior | The test compiles and fails for the right reason (not a syntax error) | Writing a passing test (no design pressure); writing many tests at once (loses the increment) |
33
+ | Green | Write the smallest production change that makes the test pass | The test passes; no other tests broke | Writing more than needed; "fake it till you make it" abandoned too early |
34
+ | Refactor | Improve the shape of both code and test while keeping all tests passing | The structure is cleaner; tests still pass | Skipping this beat; refactoring into bigger changes that break tests |
35
+
36
+ Each cycle should fit in a few minutes — long cycles indicate either insufficient test granularity or production complexity that should be decomposed before continuing.
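The three beats in the table above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill file: `total_with_discount` and its discount rule are hypothetical names chosen to echo the "calculates total with discount applied" naming style, and the comments describe the order in which each piece would have been written.

```python
import unittest


# Green + refactor result. The first green version could have been a
# hard-coded "return 27.0"; the second test below forced the general form.
# Hypothetical function: the name and discount rule are illustrative only.
def total_with_discount(prices, discount_rate=0.0):
    """Sum the prices, then apply a fractional discount (0.1 means 10% off)."""
    subtotal = sum(prices)
    return round(subtotal * (1 - discount_rate), 2)


class TotalWithDiscountTest(unittest.TestCase):
    # Red (cycle 1): written before the function existed, so it failed
    # with a NameError first, then for the right reason once a stub existed.
    def test_calculates_total_with_discount_applied(self):
        self.assertEqual(total_with_discount([10.0, 20.0], discount_rate=0.1), 27.0)

    # Red (cycle 2): one more increment of behavior, which a hard-coded
    # return value cannot satisfy.
    def test_total_without_discount_is_plain_sum(self):
        self.assertEqual(total_with_discount([10.0, 20.0]), 30.0)
```

Saved as a test module, this would run with `python -m unittest`. Each test names a behavior at the function's boundary rather than an implementation detail, which is what keeps the refactor beat safe.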
37
+
38
+ ## London School vs Detroit School — A Practical Comparison
39
+
40
+ | Property | London (mockist, outside-in) | Detroit (classicist, inside-out) |
41
+ |---|---|---|
42
+ | Origin | Freeman & Pryce, *GOOS* (2009) | Beck, *TDD by Example* (2002) |
43
+ | Starting point | Acceptance test or service boundary | A single concrete unit |
44
+ | Test focus | Interactions (who called whom with what) | Observable state changes |
45
+ | Test doubles | Central — every new collaborator becomes a mock | Sparse — used only when needed (DB, external service) |
46
+ | Design pressure | On collaborator interfaces (the mocks shape them) | On units' responsibilities and state shape |
47
+ | Failure mode | Tests mirror implementation; refactor breaks them | State assertions get coarse; design coupling grows |
48
+ | Typical codebase | Many small classes with well-defined collaboration roles | Fewer larger classes with rich state |
49
+
50
+ A practical heuristic: London-school suits services with rich collaboration (microservices, hexagonal architectures); Detroit-school suits state-rich domains (calculators, domain logic, parsers). Hybrid is the working norm.
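To make the "test focus" and "test doubles" rows concrete, here is a small sketch of the same behavior tested both ways. The `Checkout` and `InMemoryLedger` classes are hypothetical and exist only for this illustration; the mock comes from Python's standard `unittest.mock`.

```python
from unittest.mock import Mock


class InMemoryLedger:
    """Hypothetical hand-written fake collaborator that records charges."""

    def __init__(self):
        self.charges = []

    def record(self, customer_id, amount):
        self.charges.append((customer_id, amount))


class Checkout:
    """Hypothetical unit under test; it delegates persistence to a ledger."""

    def __init__(self, ledger):
        self.ledger = ledger

    def charge(self, customer_id, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.ledger.record(customer_id, amount)


def test_detroit_style_asserts_on_state():
    # Classicist: use a real (in-memory) collaborator and check the outcome.
    ledger = InMemoryLedger()
    Checkout(ledger).charge("c-1", 25)
    assert ledger.charges == [("c-1", 25)]


def test_london_style_asserts_on_interaction():
    # Mockist: replace the collaborator and check who was called with what.
    ledger = Mock()
    Checkout(ledger).charge("c-1", 25)
    ledger.record.assert_called_once_with("c-1", 25)
```

The state-based test survives a change in how `Checkout` talks to the ledger as long as the recorded charge is the same; the interaction-based test pins the `record` call itself, which is the "tests mirror implementation" risk named in the failure-mode row.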
51
+
52
+ ## The Empirical Record
53
+
54
+ Multiple controlled studies and one large industrial study converge on a consistent finding: TDD codebases have lower defect density at the cost of slightly longer initial development time.
55
+
56
+ | Study | Finding |
57
+ |---|---|
58
+ | Nagappan et al. (2008) at Microsoft and IBM | TDD teams had 40-90% lower defect density; 15-35% longer initial development time |
59
+ | Erdogmus et al. (2005) controlled experiment | TDD subjects had higher external code quality; productivity comparable |
60
+ | Janzen & Saiedian (2008) meta-analysis | Code complexity reduced; cohesion improved; coupling reduced under TDD |
61
+ | Bissi et al. (2016) systematic review | Of the 27 studies reviewed, roughly 76% reported improved internal quality and roughly 88% reported improved external quality |
62
+
63
+ The trade is well-documented. Teams abandoning TDD often do so on the visible-cost side (the time spent writing tests) without measuring the invisible-saving side (defects not encountered, rework not required).
64
+
65
+ ## Verification
66
+
67
+ After applying this skill, verify:
68
+ - [ ] Every increment of production code is preceded by a failing test that describes the behavior, not the implementation. If code was written before the test, the test is regression coverage, not TDD-born specification.
69
+ - [ ] Each cycle includes a refactor beat. Cycles that go red → green → next-red are test-first development, not TDD.
70
+ - [ ] The school being practiced (London / Detroit / hybrid) is intentional, not accidental. The test suite's character (mock-rich vs state-rich) reveals which school is in use.
71
+ - [ ] When a test is hard to write, the design is examined. Hard-to-test code is the design-pressure signal; ignoring it (with test-only hooks, exposed internals, or stretched test boundaries) defeats the discipline.
72
+ - [ ] Test names describe behaviors at the unit's natural boundaries — "calculates total with discount applied," not "tests `calculateTotal()` line 17."
73
+ - [ ] Coverage is treated as a side effect, not a target. The discipline is the goal; coverage emerges from doing it.
74
+ - [ ] The cycle's granularity stays small. Cycles that run hours indicate either over-large tests or under-decomposed production code.
75
+ - [ ] For research/spike work where the target is unclear, exploration is allowed before TDD applies. The discipline is not universally appropriate.
76
+
77
+ ## Do NOT Use When
78
+
79
+ | Instead of this skill | Use | Why |
80
+ |---|---|---|
81
+ | Choosing what to test, at which level, with what evidence | `testing-strategy` | testing-strategy owns the strategic-level decision; this skill owns the tactical design discipline |
82
+ | Constructing a mock, stub, fake, or spy | `test-doubles-design` | test-doubles-design owns the constructs; this skill owns the discipline that places them |
83
+ | Iterating on LLM behavior using an eval suite | `eval-driven-development` | eval-driven-development is the LLM analog with statistical (not binary) judgment |
84
+ | Performing a behavior-preserving structural change | `refactor` | refactor owns the technique; this skill calls it as the third beat |
85
+ | Process workflow guidance for a TDD session | the obra/superpowers `test-driven-development` skill on skills.sh | that skill is the workflow-shape complement; this one is concept-shape |
86
+
87
+ ## Key Sources
88
+
89
+ - Beck, K. (2002). *Test-Driven Development: By Example*. Addison-Wesley. The canonical book on Detroit-school TDD; defines red-green-refactor and the discipline's core form.
90
+ - Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests* (GOOS). Addison-Wesley. The canonical book on London-school TDD; defines outside-in mock-driven development.
91
+ - Nagappan, N., Maximilien, E. M., Bhat, T., & Williams, L. (2008). ["Realizing quality improvement through test driven development: results and experiences of four industrial teams"](https://link.springer.com/article/10.1007/s10664-008-9062-z). *Empirical Software Engineering*, 13(3), 289-302. The Microsoft/IBM industrial study showing 40-90% defect-density reduction.
92
+ - Erdogmus, H., Morisio, M., & Torchiano, M. (2005). ["On the effectiveness of the test-first approach to programming"](https://ieeexplore.ieee.org/document/1423994). *IEEE Transactions on Software Engineering*, 31(3), 226-237. Controlled experiment on TDD's productivity and quality effects.
93
+ - Janzen, D., & Saiedian, H. (2008). ["Does Test-Driven Development Really Improve Software Design Quality?"](https://ieeexplore.ieee.org/document/4493089). *IEEE Software*, 25(2). Meta-analysis on TDD's effect on code complexity, cohesion, and coupling.
94
+ - Bissi, W., Serra Seca Neto, A. G., & Emer, M. C. F. P. (2016). ["The effects of test driven development on internal quality, external quality and productivity: A systematic review"](https://www.sciencedirect.com/science/article/abs/pii/S0950584916300903). *Information and Software Technology*, 74, 45-54. Systematic review of 27 controlled studies.
95
+ - Fowler, M. (2007). ["Mocks Aren't Stubs"](https://martinfowler.com/articles/mocksArentStubs.html). Practitioner-focused explanation of the London/Detroit school distinction and its consequences.
96
+ - North, D. (2006). ["Introducing BDD"](https://dannorth.net/introducing-bdd/). The origin essay for Behavior-Driven Development as a vocabulary and tooling layer above TDD.
@@ -0,0 +1,67 @@
1
+ ---
2
+ name: testing-strategy
3
+ description: "Use when planning tests for a bug fix, feature, or refactor — deciding what deserves a test, at which level, with what evidence. Covers test-scope decisions, test-level selection (unit / integration / contract / e2e), effort-to-risk matching, regression targeting, evidence quality, and failure-case coverage. Do NOT use for chasing a known failure (that is `debugging`), for pure doc writing (that is `documentation`), or for conceptual architecture discussion with no verification target (no dedicated skill — treat as strategy, not testing)."
4
+ license: MIT
5
+ compatibility: "Markdown, Git, any codebase"
6
+ allowed-tools: Read Grep Bash
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"quality\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-04-18\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-04-18\\\\\\\"}\",\"eval_artifacts\":\"present\",\"eval_state\":\"passing\",\"routing_eval\":\"present\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"testing strategy\\\\\\\",\\\\\\\"what to test\\\\\\\",\\\\\\\"what not to test\\\\\\\",\\\\\\\"which test level\\\\\\\",\\\\\\\"test scope\\\\\\\",\\\\\\\"effort vs risk\\\\\\\",\\\\\\\"regression target\\\\\\\",\\\\\\\"failure case coverage\\\\\\\",\\\\\\\"test plan\\\\\\\",\\\\\\\"do I need a test\\\\\\\",\\\\\\\"should I test this\\\\\\\",\\\\\\\"unit or integration\\\\\\\",\\\\\\\"test coverage\\\\\\\",\\\\\\\"pin this behavior\\\\\\\",\\\\\\\"plan test coverage\\\\\\\",\\\\\\\"plan coverage\\\\\\\",\\\\\\\"needs an automated test\\\\\\\",\\\\\\\"automated test\\\\\\\",\\\\\\\"manual QA coverage\\\\\\\",\\\\\\\"passes manual QA\\\\\\\",\\\\\\\"test level decision\\\\\\\"]\",\"triggers\":\"[\\\\\\\"testing-skill\\\\\\\"]\",\"routing_bundles\":\"[\\\\\\\"quality\\\\\\\"]\",\"examples\":\"[\\\\\\\"do I need a unit test for this pure formatter or is integration enough?\\\\\\\",\\\\\\\"what's the right test level for a webhook handler that talks to Stripe?\\\\\\\",\\\\\\\"the feature passes manual QA — does it need an automated test?\\\\\\\",\\\\\\\"pin this regression so the same bug can't slip through again\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"my existing test is failing — why?\\\\\\\",\\\\\\\"write a testing-patterns guide for the contributor docs\\\\\\\",\\\\\\\"clean up this duplicated test setup across three files\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"debugging\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"debugging chases a specific observed failure; testing-strategy decides what to test BEFORE a failure 
exists\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"refactor\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"refactor reshapes code (including test setup) while preserving behavior; testing-strategy decides what coverage to author in the first place\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"integration-test-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"integration-test-design owns the design of integration-level tests including their setup and data lifecycle; the 'duplicated test setup across three files' anti_example has token overlap with integration-test setup discipline. testing-strategy decides what level a test should be; integration-test-design owns how to design integration tests once chosen.\\\\\\\"}],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"debugging\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/testing-strategy/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/testing-strategy/SKILL.md
13
+ ---
14
+
15
+ # Testing Strategy
16
+
17
+ ## Coverage
18
+
19
+ - Test scope: deciding what behavior actually needs a test, and what does not earn the maintenance cost
20
+ - Test level selection: choosing between unit, integration, contract, and end-to-end tests based on risk and coupling
21
+ - Effort-to-risk matching: investing verification effort where regressions are most likely and most damaging
22
+ - Regression targeting: writing tests that pin the specific behavior a change risks breaking, not generic coverage
23
+ - Evidence quality: preferring concrete, reproducible verification over assumed or manual checks
24
+ - Failure-case coverage: ensuring boundary conditions and error paths are tested, not only the happy path
25
+
26
+ ## Philosophy
27
+
28
+ Most test suites fail the effort-to-risk test: they exercise code that will never break and skip code that breaks in production. The correct target is the behavior that ships to users, not the code you happen to have written last. Coverage percentage is a proxy, and every proxy eventually gets gamed — the real signal is regressions caught before release. A test that never fails is noise; a test that fails without isolating the cause is worse than no test at all because it wastes the next engineer's time.
29
+
30
+ ## Test-Level Selection
31
+
32
+ Pick the test level by the risk of the change and the coupling of the behavior, not by the file you happen to be editing. Unit tests are cheap to write and cheap to pass; integration and contract tests are where real production bugs are actually caught.
33
+
34
+ | Situation | Test level | Why |
35
+ |---|---|---|
36
+ | Pure function, single-owner, no I/O | **Unit** | Fast, deterministic, zero setup. If you cannot unit-test it, the function is doing too much |
37
+ | Logic that composes multiple units inside one service | **Integration** (in-process) | Unit tests of each piece will miss composition bugs; integration test catches real wiring |
38
+ | Behavior that crosses a service / process / network boundary | **Contract** | Both sides need a shared verifiable agreement; a unit test on either side misses the real failure mode |
39
+ | User-visible flow end-to-end | **E2E** (one or two per critical path) | Proves the full path works at least once; too expensive to run for every code path |
40
+ | Bug fix for a bug that reached production | **Regression** at the level where the bug slipped through | If it slipped past unit tests, a unit test won't catch it next time; write the test at the level the bug exposed |
41
+ | Behavior that is "obviously correct," unchanged for a year, no external pressure | **No new test** | The test would never fail; it would only add maintenance cost. Every test is a liability until it catches a bug |
42
+
43
+ ### Level-selection anti-patterns
44
+
45
+ - **Unit testing what should be an integration test** — mocking the only thing that could actually break. Fix: test the real integration, or admit the unit test proves nothing.
46
+ - **Integration testing what should be a unit test** — slow setup for a function that has no dependencies. Fix: extract the pure logic and unit-test it.
47
+ - **E2E-testing every code path** — fragile, slow, flaky. Fix: one E2E per critical user journey, unit/integration for the rest.
48
+ - **Adding a test because coverage dropped** — test has no regression target and never fails meaningfully. Fix: either find a real regression to pin, or delete the uncovered code if it has no value.
49
+
50
+ ## Evals
51
+
52
+ This skill ships a comprehension-eval artifact at [`examples/evals/testing-strategy.json`](https://github.com/jacob-balslev/skill-graph/blob/main/examples/evals/testing-strategy.json). The `Verification` checklist below is the authoring gate for a completed test plan; the eval file is how this skill is graded by `scripts/skill-audit.js --graded`. Do not conflate them — the checklist is for the test author, the eval is for the grader.
53
+
54
+ ## Verification
55
+
56
+ - [ ] The test type matches the change risk
57
+ - [ ] A behavior or regression target is explicit
58
+ - [ ] Verification evidence is concrete, not assumed
59
+ - [ ] Failure cases and boundaries are covered, not only the happy path
60
+
61
+ ## Do NOT Use When
62
+
63
+ | Use instead | When |
64
+ |---|---|
65
+ | `documentation` | The task structures explanation for a reader, not verification for a change |
66
+ | `debugging` | The task is chasing a known failure — strategy is planned before the failure, not after |
67
+ | `refactor` | The task is restructuring code; any test work is to preserve existing behavior, which belongs to the refactor skill's verification step |
@@ -0,0 +1,43 @@
1
+ ---
2
+ name: theme-system-design
3
+ description: "Use when designing a theme system — design tokens, semantic token layering, CSS custom property strategy, runtime theme switching, and theme contract guarantees. Do NOT use for one-off color choices, brand-only palette work, or framework-specific styling-library configuration."
4
+ license: CC-BY-4.0
5
+ metadata:
6
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"design\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-12\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-12\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"theme token contract\\\\\\\",\\\\\\\"theme semantic layer\\\\\\\",\\\\\\\"theme variables\\\\\\\",\\\\\\\"css custom properties\\\\\\\",\\\\\\\"runtime theme switching\\\\\\\",\\\\\\\"token tiers\\\\\\\",\\\\\\\"theme contract\\\\\\\",\\\\\\\"design tokens community group\\\\\\\",\\\\\\\"theme parity\\\\\\\",\\\\\\\"token naming\\\\\\\",\\\\\\\"reference tokens\\\\\\\",\\\\\\\"system tokens\\\\\\\"]\",\"triggers\":\"[\\\\\\\"theme system\\\\\\\",\\\\\\\"design tokens\\\\\\\",\\\\\\\"theme switching\\\\\\\",\\\\\\\"css variables for theming\\\\\\\",\\\\\\\"token architecture\\\\\\\"]\",\"examples\":\"[\\\\\\\"Design a three-tier token system (reference → system → component) for a multi-brand product\\\\\\\",\\\\\\\"Add a third theme to an existing two-theme system without breaking the component contracts\\\\\\\",\\\\\\\"Move from hard-coded colors to CSS custom properties with runtime switching\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"Choose the exact hex value for the brand's primary blue\\\\\\\",\\\\\\\"Configure Tailwind's content array and purge settings\\\\\\\",\\\\\\\"Implement the dark mode toggle interaction\\\\\\\"]\",\"relations\":\"{\\\\\\\"related\\\\\\\":[\\\\\\\"color-system-design\\\\\\\",\\\\\\\"design-system-architecture\\\\\\\",\\\\\\\"visual-design-foundations\\\\\\\",\\\\\\\"dark-mode-implementation\\\\\\\"],\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"color-system-design\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"color-system-design produces the palette and the semantic color decisions; this skill structures how those decisions 
become tokens and reach components.\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"dark-mode-implementation\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"dark-mode-implementation handles preference detection, asset swapping, and image handling; this skill defines the token layer that makes dark mode trivial to add.\\\\\\\"}]}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/theme-system-design/SKILL.md\"}"
7
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
8
+ skill_graph_protocol: Skill Metadata Protocol v4
9
+ skill_graph_project: Skill Graph
10
+ skill_graph_canonical_skill: skills/theme-system-design/SKILL.md
11
+ ---
12
+
13
+ # Theme System Design
14
+
15
+ ## Coverage
16
+ A theme system is a contract between design decisions and component code, expressed as named tokens that components consume and themes resolve. The mainstream model uses three tiers — reference tokens (raw values: blue.500 = #1E66F5), system or semantic tokens (intent-named: color.background.surface, color.text.primary), and component tokens (component-specific overrides: button.primary.background). Components consume system and component tokens only; themes provide reference-token-to-system-token mappings. This indirection is what allows a single theme swap to repaint the whole product without component edits.
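The three-tier indirection can be sketched as a small build step that turns token tables into CSS custom properties. This is a minimal illustration under stated assumptions: only `blue.500 = #1E66F5` and the token names come from the text above; the other reference values, the `color.action.primary` token, and the function names are hypothetical.

```python
# Tier 1: reference tokens (raw values). Only blue.500 is from the text;
# the remaining values are placeholders for illustration.
REFERENCE = {
    "blue.500": "#1E66F5",
    "blue.300": "#7AA7FA",
    "gray.100": "#F2F2F2",
    "gray.900": "#1A1A1A",
}

# Tier 2: each theme maps intent-named system tokens to reference tokens.
THEMES = {
    "light": {
        "color.background.surface": "gray.100",
        "color.text.primary": "gray.900",
        "color.action.primary": "blue.500",
    },
    "dark": {
        "color.background.surface": "gray.900",
        "color.text.primary": "gray.100",
        "color.action.primary": "blue.300",
    },
}

# Tier 3: component tokens point at system tokens, never at raw values.
COMPONENT = {"button.primary.background": "color.action.primary"}


def css_name(token):
    """Turn a dotted token name into a CSS custom property name."""
    return "--" + token.replace(".", "-")


def render_theme(theme_name, selector):
    """Emit one CSS rule that assigns every token for the given theme."""
    mapping = THEMES[theme_name]
    lines = [f"{selector} {{"]
    for system_token, reference_token in mapping.items():
        # A KeyError here means a theme points at a missing reference token.
        lines.append(f"  {css_name(system_token)}: {REFERENCE[reference_token]};")
    for component_token, system_token in COMPONENT.items():
        if system_token not in mapping:
            raise KeyError(f"{theme_name} does not define {system_token}")
        lines.append(f"  {css_name(component_token)}: var({css_name(system_token)});")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_theme("light", ":root"))
    print(render_theme("dark", '[data-theme="dark"]'))
```

Adding a theme in this sketch means adding one entry to `THEMES`; the component table and any component stylesheet that reads `--button-primary-background` stay untouched.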
17
+
18
+ CSS custom properties are the dominant runtime delivery mechanism for web themes. A theme is a set of custom-property assignments scoped to a selector, typically :root for the default and [data-theme="dark"] for alternates. Custom properties cascade and inherit, so a theme override on a subtree (a dark-on-light landing-page hero inside a light app) is a single attribute change. The W3C Design Tokens Community Group format (designtokens.org) is the emerging interchange standard, expressed as JSON with $value, $type, and $description; tooling (Style Dictionary, Tokens Studio, Terrazzo) transforms it into platform outputs (CSS variables, iOS catalogs, Android resources).
19
+
20
+ Runtime switching has three operational pieces: persistence (localStorage, or a cookie if SSR-sensitive), application (a class or data attribute set on the document root before first paint to avoid a flash of the wrong theme), and propagation (notifying components that read the theme imperatively, such as chart libraries and canvases). The pre-paint application typically requires a small inline script in the document head that runs before the page is first rendered.
21
+
22
+ Theme contracts make additive changes safe and breaking changes visible. Adding a new system token is safe; renaming or removing one is a breaking change requiring a deprecation cycle. Components that read tokens by exact name should not be expected to handle missing tokens gracefully: a var() reference to a missing token uses its declared fallback if one is given, and otherwise the declaration becomes invalid at computed-value time and the property reverts to its inherited or initial value, which is rarely what's wanted in production.
23
+
24
+ ## Philosophy
25
+ The indirection earns its complexity by separating three rates of change: brand decisions change rarely, theme assignments (which brand color means "danger") change occasionally, and components change continuously. A flat token system collapses these into one rate and forces every component to know about every brand value. The three-tier model gives each rate of change its own surface to evolve on.
26
+
27
+ Semantic names beat descriptive names. color.background.surface tells a component author what to use; color.gray.100 forces them to know that gray.100 happens to be the surface color today and changes meaning when the theme flips. The discipline is to resist convenience names that leak appearance into the contract.
28
+
29
+ ## Verification
30
+ - Components reference only system or component tokens; a grep for raw hex values or reference tokens in component code returns zero matches.
31
+ - Switching themes at runtime triggers no full-page repaint flicker; the theme attribute is set before the first paint.
32
+ - A new theme can be added by providing only a new set of system-token values; no component files are modified.
33
+ - Token definitions live in a single source-of-truth file (JSON, ideally DTCG-format) and are generated into the CSS variable layer by a build step.
34
+ - Removing or renaming a system token causes a build-time or lint-time error in every consuming component, not a silent runtime fallback.
35
+ - Server-rendered pages serialize the chosen theme into the initial HTML so first paint matches user preference.
36
+ - Documentation lists every semantic token with intent, not appearance.
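Two of the checks above (no raw hex values in component code, and a lint-time error for a removed or renamed token) can be approximated with a short script. This is a rough sketch, not the package's tooling: the function name, the regular expressions, and the sample token names are assumptions, and the hex pattern is deliberately naive, so it would also flag a hex-like ID selector such as `#add`.

```python
import re

# Naive pattern for raw hex colors such as #fff or #1E66F5.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{3,8}\b")
# Captures the custom property name inside var(--name) or var(--name, fallback).
VAR_REFERENCE = re.compile(r"var\(\s*(--[A-Za-z0-9_-]+)")


def lint_component_css(css_text, known_tokens):
    """Return a list of problems found in one component stylesheet."""
    problems = []
    for line_no, line in enumerate(css_text.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            problems.append(f"line {line_no}: raw color {match.group(0)}")
        for match in VAR_REFERENCE.finditer(line):
            if match.group(1) not in known_tokens:
                problems.append(f"line {line_no}: unknown token {match.group(1)}")
    return problems


if __name__ == "__main__":
    sample = (
        ".button {\n"
        "  background: var(--button-primary-background);\n"
        "  color: #fff;\n"
        "  border-color: var(--color-border-legacy);\n"
        "}"
    )
    known = {"--button-primary-background", "--color-text-primary"}
    for problem in lint_component_css(sample, known):
        print(problem)
```

Run in CI against every component stylesheet, with `known_tokens` generated from the token source file, a check like this would turn a renamed token into a failing build instead of a silent runtime fallback.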
37
+
38
+ ## Do NOT Use When
39
+ - The task is picking specific color values or designing a palette. Use color-system-design.
40
+ - The work is one-off styling of a single component with no system-wide implications. Use design-module-composition.
41
+ - The integration concerns are dark-mode detection, asset variants, and prefers-color-scheme handling. Use dark-mode-implementation.
42
+ - You are configuring a styling library's runtime (Tailwind config, Emotion theme provider setup) without changing the token contract. That is build configuration, not theme architecture.
43
+ - The decision is about typography or spacing scales rather than tokens generally. Use typography-system or visual-design-foundations.