@skill-graph/cli 0.5.6

Files changed (330)
  1. package/CHANGELOG.md +247 -0
  2. package/LICENSE +200 -0
  3. package/NOTICE +62 -0
  4. package/README.md +398 -0
  5. package/SKILL_GRAPH.md +443 -0
  6. package/bin/skill-graph.js +374 -0
  7. package/docs/ADOPTION.md +117 -0
  8. package/docs/CONFORMANCE.md +66 -0
  9. package/docs/PRIMER.md +384 -0
  10. package/docs/QUICKSTART-30MIN.md +333 -0
  11. package/docs/ROUTING-METRICS.md +120 -0
  12. package/docs/SKILL-MD-FORMAT-COMPATIBILITY.md +127 -0
  13. package/docs/SKILL_AUDIT_CHECKLIST.md +199 -0
  14. package/docs/SKILL_AUDIT_LOOP.md +195 -0
  15. package/docs/SKILL_METADATA_PROTOCOL.md +609 -0
  16. package/docs/_archived/marketplace-publication-priority-2026-05-18.md +239 -0
  17. package/docs/adr/0001-predicate-set.md +69 -0
  18. package/docs/adr/0002-json-ld-context.md +82 -0
  19. package/docs/adr/0003-ontoclean-rigidity-tags.md +65 -0
  20. package/docs/adr/0004-persistent-identifiers.md +74 -0
  21. package/docs/adr/0005-freshness-consolidation.md +70 -0
  22. package/docs/adr/0006-revise-predicate-rename.md +105 -0
  23. package/docs/adr/0007-audit-loop-cadence.md +99 -0
  24. package/docs/adr/0008-skill-surface-split-and-curation-policy.md +93 -0
  25. package/docs/category-consumers.md +168 -0
  26. package/docs/concept-map.md +194 -0
  27. package/docs/diagrams/drift-states.mmd +21 -0
  28. package/docs/diagrams/manifest-pipeline.mmd +25 -0
  29. package/docs/diagrams/routing-harness.mmd +41 -0
  30. package/docs/diagrams/starter-graph.mmd +53 -0
  31. package/docs/field-decision-guide.md +315 -0
  32. package/docs/field-rationale.md +211 -0
  33. package/docs/field-reference.generated.md +624 -0
  34. package/docs/field-reference.md +1426 -0
  35. package/docs/glossary.md +190 -0
  36. package/docs/head-noun-glossary.md +63 -0
  37. package/docs/images/audit-phases.png +0 -0
  38. package/docs/images/drift-states.png +0 -0
  39. package/docs/images/graded-mode.png +0 -0
  40. package/docs/images/manifest-pipeline.png +0 -0
  41. package/docs/images/routing-harness.png +0 -0
  42. package/docs/images/skill-anatomy.png +0 -0
  43. package/docs/images/starter-graph.png +0 -0
  44. package/docs/images/system-model.png +0 -0
  45. package/docs/integrations/github-actions.md +155 -0
  46. package/docs/manifest-field-mapping.md +443 -0
  47. package/docs/marketplace-publication-queue.generated.md +240 -0
  48. package/docs/marketplace-release-agent-prompt.md +82 -0
  49. package/docs/marketplace-skill-candidate-list.md +272 -0
  50. package/docs/marketplace-syndication.md +222 -0
  51. package/docs/migration-sample-review.md +155 -0
  52. package/docs/migrations/v4-to-v5.md +168 -0
  53. package/docs/migrations/v5-to-v6.md +221 -0
  54. package/docs/name-exceptions.yaml +37 -0
  55. package/docs/plans/marketplace-p1-public-migration-plan.md +41 -0
  56. package/docs/plans/multi-root-workspace.md +148 -0
  57. package/docs/plans/scripts-roadmap.md +107 -0
  58. package/docs/plans/v4-schema-bump.md +160 -0
  59. package/docs/plans/wave-2-extraction.md +122 -0
  60. package/docs/positioning-vs-marketplaces.md +175 -0
  61. package/docs/proposals/skill-audit-loop-positioning.md +160 -0
  62. package/docs/quality-doctrine.md +138 -0
  63. package/docs/recommended-skills.md +150 -0
  64. package/docs/research/skill-comprehension-eval-research.md +1830 -0
  65. package/docs/research/skill-retrieval-evidence.md +66 -0
  66. package/docs/skill-metadata-protocol.md +471 -0
  67. package/docs/skills-sh-maintainer-cleanup-request.md +80 -0
  68. package/examples/audits/a11y/findings.md +52 -0
  69. package/examples/audits/a11y/scorecard.md +21 -0
  70. package/examples/audits/a11y/verdict.md +44 -0
  71. package/examples/audits/debugging/findings.md +59 -0
  72. package/examples/audits/debugging/scorecard.md +22 -0
  73. package/examples/audits/debugging/verdict.md +33 -0
  74. package/examples/audits/documentation/findings.md +59 -0
  75. package/examples/audits/documentation/scorecard.md +22 -0
  76. package/examples/audits/documentation/verdict.md +33 -0
  77. package/examples/evals/a11y.json +140 -0
  78. package/examples/evals/api-design.json +52 -0
  79. package/examples/evals/code-review.json +52 -0
  80. package/examples/evals/data-modeling.json +52 -0
  81. package/examples/evals/database-migration.json +52 -0
  82. package/examples/evals/debugging.json +118 -0
  83. package/examples/evals/dependency-architecture.json +52 -0
  84. package/examples/evals/design-system-architecture.json +52 -0
  85. package/examples/evals/error-tracking.json +52 -0
  86. package/examples/evals/event-contract-design.json +52 -0
  87. package/examples/evals/form-ux-architecture.json +52 -0
  88. package/examples/evals/framework-fit-analysis.json +52 -0
  89. package/examples/evals/graph-audit.json +139 -0
  90. package/examples/evals/information-architecture.json +52 -0
  91. package/examples/evals/interaction-feedback.json +52 -0
  92. package/examples/evals/interaction-patterns.json +52 -0
  93. package/examples/evals/layout-composition.json +52 -0
  94. package/examples/evals/lint-overlay.json +117 -0
  95. package/examples/evals/microcopy.json +52 -0
  96. package/examples/evals/observability-modeling.json +52 -0
  97. package/examples/evals/pattern-recognition.json +96 -0
  98. package/examples/evals/performance-engineering.json +52 -0
  99. package/examples/evals/refactor.json +128 -0
  100. package/examples/evals/semiotics.json +52 -0
  101. package/examples/evals/skill-infrastructure.json +96 -0
  102. package/examples/evals/skill-router.json +140 -0
  103. package/examples/evals/skill-router.routing.json +113 -0
  104. package/examples/evals/system-interface-contracts.json +52 -0
  105. package/examples/evals/task-analysis.json +52 -0
  106. package/examples/evals/testing-strategy.json +118 -0
  107. package/examples/evals/type-safety.json +249 -0
  108. package/examples/evals/visual-design-foundations.json +52 -0
  109. package/examples/evals/webhook-integration.json +52 -0
  110. package/examples/exports/a11y.skill-md.md +80 -0
  111. package/examples/exports/debugging.skill-md.md +80 -0
  112. package/examples/exports/refactor.skill-md.md +78 -0
  113. package/examples/exports/testing-strategy.skill-md.md +81 -0
  114. package/examples/projects/markdown-static-site/README.md +115 -0
  115. package/examples/projects/markdown-static-site/skills/content-source-router/SKILL.md +131 -0
  116. package/examples/projects/markdown-static-site/skills/image-optimization-pipeline-config/SKILL.md +132 -0
  117. package/examples/projects/markdown-static-site/skills/link-rot-detection/SKILL.md +103 -0
  118. package/examples/projects/markdown-static-site/skills/markdown-post-frontmatter-validation/SKILL.md +133 -0
  119. package/examples/projects/markdown-static-site/skills/migrate-posts-to-v2-frontmatter/SKILL.md +140 -0
  120. package/examples/projects/saas-stripe-postgres/README.md +208 -0
  121. package/examples/projects/saas-stripe-postgres/db/migrations/0004_canonicalize_orders.sql +37 -0
  122. package/examples/projects/saas-stripe-postgres/db/schema.sql +112 -0
  123. package/examples/projects/saas-stripe-postgres/skills/migrate-orders-to-canonical-schema/SKILL.md +149 -0
  124. package/examples/projects/saas-stripe-postgres/skills/nextjs-server-action-validation/SKILL.md +154 -0
  125. package/examples/projects/saas-stripe-postgres/skills/payment-provider-router/SKILL.md +153 -0
  126. package/examples/projects/saas-stripe-postgres/skills/postgres-rls-pattern/SKILL.md +163 -0
  127. package/examples/projects/saas-stripe-postgres/skills/stripe-webhook-signature-verification/SKILL.md +137 -0
  128. package/examples/protocol/skill-metadata-template.md +301 -0
  129. package/examples/protocol/skills.manifest.sample.json +13245 -0
  130. package/examples/skill-metadata-template.md +317 -0
  131. package/examples/skills.manifest.sample.json +13519 -0
  132. package/examples/tests/v3-1-skos-fixture/SKILL.md +93 -0
  133. package/marketplace/README.md +17 -0
  134. package/marketplace/skills/a11y/SKILL.md +66 -0
  135. package/marketplace/skills/acid-fundamentals/SKILL.md +106 -0
  136. package/marketplace/skills/agent-engineering/SKILL.md +386 -0
  137. package/marketplace/skills/agent-eval-design/SKILL.md +55 -0
  138. package/marketplace/skills/ai-native-development/SKILL.md +294 -0
  139. package/marketplace/skills/api-design/SKILL.md +60 -0
  140. package/marketplace/skills/architecture-decision-records/SKILL.md +55 -0
  141. package/marketplace/skills/background-jobs/SKILL.md +265 -0
  142. package/marketplace/skills/bounded-context-mapping/SKILL.md +55 -0
  143. package/marketplace/skills/cap-theorem-tradeoffs/SKILL.md +127 -0
  144. package/marketplace/skills/client-server-boundary/SKILL.md +187 -0
  145. package/marketplace/skills/code-review/SKILL.md +120 -0
  146. package/marketplace/skills/color-system-design/SKILL.md +43 -0
  147. package/marketplace/skills/component-architecture/SKILL.md +126 -0
  148. package/marketplace/skills/compression/SKILL.md +112 -0
  149. package/marketplace/skills/conceptual-modeling/SKILL.md +181 -0
  150. package/marketplace/skills/connection-pooling/SKILL.md +105 -0
  151. package/marketplace/skills/constraint-awareness/SKILL.md +287 -0
  152. package/marketplace/skills/content-monitor/SKILL.md +209 -0
  153. package/marketplace/skills/context-engineering/SKILL.md +320 -0
  154. package/marketplace/skills/context-graph/SKILL.md +174 -0
  155. package/marketplace/skills/context-management/SKILL.md +174 -0
  156. package/marketplace/skills/context-window/SKILL.md +239 -0
  157. package/marketplace/skills/contract-testing/SKILL.md +120 -0
  158. package/marketplace/skills/cron-scheduling/SKILL.md +223 -0
  159. package/marketplace/skills/dark-mode-implementation/SKILL.md +47 -0
  160. package/marketplace/skills/data-modeling/SKILL.md +59 -0
  161. package/marketplace/skills/data-modeling-fundamentals/SKILL.md +117 -0
  162. package/marketplace/skills/database-migration/SKILL.md +429 -0
  163. package/marketplace/skills/debugging/SKILL.md +67 -0
  164. package/marketplace/skills/dependency-architecture/SKILL.md +58 -0
  165. package/marketplace/skills/design-module-composition/SKILL.md +43 -0
  166. package/marketplace/skills/design-system-architecture/SKILL.md +61 -0
  167. package/marketplace/skills/design-thinking/SKILL.md +44 -0
  168. package/marketplace/skills/diagnosis/SKILL.md +296 -0
  169. package/marketplace/skills/diff-analysis/SKILL.md +188 -0
  170. package/marketplace/skills/e2e-test-design/SKILL.md +113 -0
  171. package/marketplace/skills/entity-relationship-modeling/SKILL.md +218 -0
  172. package/marketplace/skills/epistemic-grounding/SKILL.md +112 -0
  173. package/marketplace/skills/error-boundary/SKILL.md +235 -0
  174. package/marketplace/skills/error-tracking/SKILL.md +261 -0
  175. package/marketplace/skills/eval-driven-development/SKILL.md +147 -0
  176. package/marketplace/skills/evaluation/SKILL.md +113 -0
  177. package/marketplace/skills/event-contract-design/SKILL.md +60 -0
  178. package/marketplace/skills/event-storming/SKILL.md +56 -0
  179. package/marketplace/skills/form-ux-architecture/SKILL.md +60 -0
  180. package/marketplace/skills/framework-fit-analysis/SKILL.md +59 -0
  181. package/marketplace/skills/frontend-architecture/SKILL.md +43 -0
  182. package/marketplace/skills/generative-ui/SKILL.md +118 -0
  183. package/marketplace/skills/graph-audit/SKILL.md +81 -0
  184. package/marketplace/skills/guardrails/SKILL.md +118 -0
  185. package/marketplace/skills/hooks-patterns/SKILL.md +185 -0
  186. package/marketplace/skills/http-semantics/SKILL.md +136 -0
  187. package/marketplace/skills/ideation/SKILL.md +41 -0
  188. package/marketplace/skills/indexing-strategy/SKILL.md +108 -0
  189. package/marketplace/skills/information-architecture/SKILL.md +59 -0
  190. package/marketplace/skills/integration-test-design/SKILL.md +111 -0
  191. package/marketplace/skills/intent-recognition/SKILL.md +136 -0
  192. package/marketplace/skills/interaction-feedback/SKILL.md +59 -0
  193. package/marketplace/skills/interaction-patterns/SKILL.md +59 -0
  194. package/marketplace/skills/journey-mapping/SKILL.md +41 -0
  195. package/marketplace/skills/keywords/SKILL.md +213 -0
  196. package/marketplace/skills/knowledge-modeling/SKILL.md +232 -0
  197. package/marketplace/skills/layout-composition/SKILL.md +59 -0
  198. package/marketplace/skills/linguistics/SKILL.md +429 -0
  199. package/marketplace/skills/lint-overlay/SKILL.md +76 -0
  200. package/marketplace/skills/mental-models/SKILL.md +126 -0
  201. package/marketplace/skills/merge-queue/SKILL.md +94 -0
  202. package/marketplace/skills/methodology/SKILL.md +317 -0
  203. package/marketplace/skills/microcopy/SKILL.md +232 -0
  204. package/marketplace/skills/middleware-patterns/SKILL.md +363 -0
  205. package/marketplace/skills/mobile-responsive-ux/SKILL.md +287 -0
  206. package/marketplace/skills/mutation-testing/SKILL.md +112 -0
  207. package/marketplace/skills/naming-conventions/SKILL.md +112 -0
  208. package/marketplace/skills/observability-modeling/SKILL.md +59 -0
  209. package/marketplace/skills/ontology-modeling/SKILL.md +67 -0
  210. package/marketplace/skills/owasp-security/SKILL.md +153 -0
  211. package/marketplace/skills/pattern-recognition/SKILL.md +472 -0
  212. package/marketplace/skills/performance-budgets/SKILL.md +185 -0
  213. package/marketplace/skills/performance-engineering/SKILL.md +58 -0
  214. package/marketplace/skills/performance-testing/SKILL.md +125 -0
  215. package/marketplace/skills/printify/SKILL.md +42 -0
  216. package/marketplace/skills/prioritization/SKILL.md +118 -0
  217. package/marketplace/skills/problem-framing/SKILL.md +41 -0
  218. package/marketplace/skills/problem-locating-solving/SKILL.md +203 -0
  219. package/marketplace/skills/project-knowledge-extraction/SKILL.md +54 -0
  220. package/marketplace/skills/prompt-craft/SKILL.md +134 -0
  221. package/marketplace/skills/prompt-injection-defense/SKILL.md +132 -0
  222. package/marketplace/skills/property-based-testing/SKILL.md +100 -0
  223. package/marketplace/skills/prototyping/SKILL.md +43 -0
  224. package/marketplace/skills/query-optimization/SKILL.md +144 -0
  225. package/marketplace/skills/real-time-updates/SKILL.md +324 -0
  226. package/marketplace/skills/ref-patterns/SKILL.md +284 -0
  227. package/marketplace/skills/refactor/SKILL.md +65 -0
  228. package/marketplace/skills/rendering-models/SKILL.md +142 -0
  229. package/marketplace/skills/replication-patterns/SKILL.md +110 -0
  230. package/marketplace/skills/research-synthesis/SKILL.md +41 -0
  231. package/marketplace/skills/route-handler-design/SKILL.md +347 -0
  232. package/marketplace/skills/schema-evolution/SKILL.md +140 -0
  233. package/marketplace/skills/security-fundamentals/SKILL.md +139 -0
  234. package/marketplace/skills/semantic-center/SKILL.md +194 -0
  235. package/marketplace/skills/semantic-relations/SKILL.md +250 -0
  236. package/marketplace/skills/semantics/SKILL.md +366 -0
  237. package/marketplace/skills/semiotics/SKILL.md +230 -0
  238. package/marketplace/skills/seo-strategy/SKILL.md +260 -0
  239. package/marketplace/skills/server-actions-design/SKILL.md +243 -0
  240. package/marketplace/skills/server-components-design/SKILL.md +190 -0
  241. package/marketplace/skills/sharding-strategy/SKILL.md +123 -0
  242. package/marketplace/skills/shopify/SKILL.md +42 -0
  243. package/marketplace/skills/skill-infrastructure/SKILL.md +320 -0
  244. package/marketplace/skills/skill-router/SKILL.md +71 -0
  245. package/marketplace/skills/skill-scaffold/SKILL.md +105 -0
  246. package/marketplace/skills/snapshot-testing/SKILL.md +120 -0
  247. package/marketplace/skills/spec-driven-development/SKILL.md +148 -0
  248. package/marketplace/skills/state-machine-modeling/SKILL.md +56 -0
  249. package/marketplace/skills/state-management/SKILL.md +134 -0
  250. package/marketplace/skills/streaming-architecture/SKILL.md +194 -0
  251. package/marketplace/skills/summarization/SKILL.md +156 -0
  252. package/marketplace/skills/suspense-patterns/SKILL.md +265 -0
  253. package/marketplace/skills/system-interface-contracts/SKILL.md +59 -0
  254. package/marketplace/skills/task-analysis/SKILL.md +201 -0
  255. package/marketplace/skills/taxonomy-design/SKILL.md +66 -0
  256. package/marketplace/skills/test-coverage-strategy/SKILL.md +108 -0
  257. package/marketplace/skills/test-doubles-design/SKILL.md +98 -0
  258. package/marketplace/skills/test-driven-development/SKILL.md +96 -0
  259. package/marketplace/skills/testing-strategy/SKILL.md +67 -0
  260. package/marketplace/skills/theme-system-design/SKILL.md +43 -0
  261. package/marketplace/skills/tool-call-flow/SKILL.md +229 -0
  262. package/marketplace/skills/tool-call-strategy/SKILL.md +292 -0
  263. package/marketplace/skills/transaction-isolation/SKILL.md +98 -0
  264. package/marketplace/skills/type-safety/SKILL.md +177 -0
  265. package/marketplace/skills/typography-system/SKILL.md +43 -0
  266. package/marketplace/skills/usability-testing/SKILL.md +43 -0
  267. package/marketplace/skills/user-research/SKILL.md +43 -0
  268. package/marketplace/skills/vercel-composition-patterns/SKILL.md +157 -0
  269. package/marketplace/skills/version-control/SKILL.md +233 -0
  270. package/marketplace/skills/visual-design-foundations/SKILL.md +59 -0
  271. package/marketplace/skills/visual-hierarchy/SKILL.md +43 -0
  272. package/marketplace/skills/webhook-integration/SKILL.md +331 -0
  273. package/marketplace/skills/writing-humanizer/SKILL.md +380 -0
  274. package/package.json +67 -0
  275. package/schemas/manifest.schema.json +811 -0
  276. package/schemas/manifest.v2.schema.json +164 -0
  277. package/schemas/manifest.v3.schema.json +758 -0
  278. package/schemas/manifest.v4.schema.json +755 -0
  279. package/schemas/manifest.v5.schema.json +755 -0
  280. package/schemas/manifest.v6.schema.json +811 -0
  281. package/schemas/skill.context.jsonld +279 -0
  282. package/schemas/skill.schema.json +919 -0
  283. package/schemas/skill.v2.schema.json +201 -0
  284. package/schemas/skill.v3.schema.json +827 -0
  285. package/schemas/skill.v4.schema.json +822 -0
  286. package/schemas/skill.v5.schema.json +830 -0
  287. package/schemas/skill.v6.schema.json +946 -0
  288. package/schemas/vocabulary/keywords.json +180 -0
  289. package/schemas/vocabulary/workspace_tags.json +23 -0
  290. package/scripts/__tests__/migrate-skill-v2-to-v3.test.js +161 -0
  291. package/scripts/__tests__/migrate-skill-v3-to-v4.test.js +158 -0
  292. package/scripts/__tests__/test-export-parser-drift.js +149 -0
  293. package/scripts/__tests__/test-marketplace-export.js +114 -0
  294. package/scripts/__tests__/test-router-paths.js +82 -0
  295. package/scripts/__tests__/test-stability-promotion.js +244 -0
  296. package/scripts/__tests__/test-v3-1-alias-contract.js +109 -0
  297. package/scripts/__tests__/test-v3-1-skos-runtime.js +116 -0
  298. package/scripts/backfill-schema-version.js +198 -0
  299. package/scripts/build-field-reference.js +160 -0
  300. package/scripts/build-retrieval-baseline.js +511 -0
  301. package/scripts/check-markdown-links.js +211 -0
  302. package/scripts/check-protocol-consistency.js +979 -0
  303. package/scripts/export-marketplace-skills.js +610 -0
  304. package/scripts/export-skill.js +374 -0
  305. package/scripts/generate-manifest.js +787 -0
  306. package/scripts/lib/alias-contract.js +83 -0
  307. package/scripts/lib/audit-prompt-builder.js +771 -0
  308. package/scripts/lib/mock-grader.js +134 -0
  309. package/scripts/lib/parse-frontmatter.js +429 -0
  310. package/scripts/lib/roots.js +119 -0
  311. package/scripts/lint/check-archetype-sections.js +185 -0
  312. package/scripts/lint/check-category-enum.js +83 -0
  313. package/scripts/lint/check-routing-eval.js +146 -0
  314. package/scripts/lint/check-routing-quality.js +211 -0
  315. package/scripts/lint/check-stability-promotion.js +220 -0
  316. package/scripts/lint/format-code-frame.js +206 -0
  317. package/scripts/marketplace-install.js +125 -0
  318. package/scripts/migrate-category-to-enum.js +169 -0
  319. package/scripts/migrate-skill-v2-to-v3.js +424 -0
  320. package/scripts/migrate-skill-v3-to-v4.js +200 -0
  321. package/scripts/migrate-skill-v5-to-v6.js +304 -0
  322. package/scripts/restructure-by-category.js +85 -0
  323. package/scripts/seed-publication-classification.js +282 -0
  324. package/scripts/skill-audit.js +893 -0
  325. package/scripts/skill-graph-drift.js +483 -0
  326. package/scripts/skill-graph-route.js +766 -0
  327. package/scripts/skill-graph-routing-eval.js +393 -0
  328. package/scripts/skill-lint.js +1317 -0
  329. package/scripts/skill-overlap.js +213 -0
  330. package/scripts/verify-skill-md-export.js +201 -0
@@ -0,0 +1,386 @@
+ ---
+ name: agent-engineering
+ description: "Use when designing or evaluating a production AI agent system, choosing a multi-agent coordination pattern (orchestrator/worker, fan-out, consensus, sequential chain, evaluator/optimizer), diagnosing coordination failures (claim races, silent stalls, context contamination, runaway loops), or auditing whether an agent loop is truly production-ready. Covers the four pillars (architecture and lifecycle, task decomposition, coordination patterns, production reliability), the six reliability requirements (observability, cost budgets, idempotency, failure recovery, safety caps, claim locks), the delegation decision framework with overhead crossover, and the most common anti-patterns. Do NOT use for prompt wording (use `prompt-craft`), per-call tool efficiency (use `tool-call-strategy`), context-stack design within a single agent (use `context-engineering`), or runtime debugging of a deployed system (use `debugging`)."
+ license: MIT
+ compatibility: "Provider- and harness-agnostic. Patterns apply across Claude Code, Cursor, Copilot, OpenCode, Aider, Continue, custom Anthropic/OpenAI/Google SDK loops, and self-hosted multi-agent systems. Specific filenames in this skill (continuation.json, claim.lock, session-logs.jsonl) are example artefacts -- substitute your harness's equivalents."
+ allowed-tools: Read Grep Bash Edit Write
+ metadata:
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"engineering\",\"domain\":\"ai-engineering/architecture\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-06\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-06\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"agent engineering\\\\\\\",\\\\\\\"agentic engineering\\\\\\\",\\\\\\\"multi-agent systems\\\\\\\",\\\\\\\"production agent system\\\\\\\",\\\\\\\"orchestration patterns\\\\\\\",\\\\\\\"orchestrator worker\\\\\\\",\\\\\\\"fan-out merge\\\\\\\",\\\\\\\"consensus pattern\\\\\\\",\\\\\\\"evaluator optimizer\\\\\\\",\\\\\\\"sequential chain\\\\\\\",\\\\\\\"two-pass pattern\\\\\\\",\\\\\\\"task decomposition\\\\\\\",\\\\\\\"delegation overhead\\\\\\\",\\\\\\\"claim lock\\\\\\\",\\\\\\\"claim race\\\\\\\",\\\\\\\"silent stall\\\\\\\",\\\\\\\"stall detection\\\\\\\",\\\\\\\"safety caps\\\\\\\",\\\\\\\"iteration limit\\\\\\\",\\\\\\\"cost budget\\\\\\\",\\\\\\\"agent idempotency guard\\\\\\\",\\\\\\\"failure recovery\\\\\\\",\\\\\\\"retry backoff\\\\\\\",\\\\\\\"escalation path\\\\\\\",\\\\\\\"context rot\\\\\\\",\\\\\\\"god agent\\\\\\\",\\\\\\\"runaway loop\\\\\\\",\\\\\\\"ghost claim\\\\\\\",\\\\\\\"telephone game\\\\\\\",\\\\\\\"file-based coordination\\\\\\\",\\\\\\\"agent continuation protocol\\\\\\\",\\\\\\\"exit code protocol\\\\\\\",\\\\\\\"heartbeat\\\\\\\",\\\\\\\"flow engineering\\\\\\\",\\\\\\\"software 3.0 agent systems\\\\\\\"]\",\"examples\":\"[\\\\\\\"we want to fan out 40 classification subtasks to subagents — what coordination pattern should we use and what are the failure modes?\\\\\\\",\\\\\\\"two of our agents claimed the same task and produced duplicate PRs — what atomicity guarantee prevents this?\\\\\\\",\\\\\\\"the orchestrator burns 6x the budget we planned every Tuesday — where do we add 
cost visibility and caps?\\\\\\\",\\\\\\\"an agent loop ran for four hours without progress before anyone noticed — how do we detect silent stalls?\\\\\\\",\\\\\\\"we keep getting context-contamination bugs where agent B uses stale output from agent A's failed run — fix the protocol\\\\\\\",\\\\\\\"audit this agent loop and tell me whether it's production-ready or still a demo\\\\\\\",\\\\\\\"is consensus-of-three worth the 3x cost for security-critical decisions, or is two-pass cheaper and good enough?\\\\\\\",\\\\\\\"design the lifecycle for a long-running autonomous agent that survives crashes mid-task\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"improve this prompt's wording to get better outputs\\\\\\\",\\\\\\\"the agent made 17 read calls when 3 greps would have done\\\\\\\",\\\\\\\"design what skills get loaded for which prompts\\\\\\\",\\\\\\\"scaffold a new SKILL.md for our orchestration runbook\\\\\\\",\\\\\\\"review this AI-generated PR for correctness\\\\\\\",\\\\\\\"the test suite is failing after my change — find the cause\\\\\\\",\\\\\\\"draft an architecture note explaining why we chose Postgres\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"prompt-craft\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"prompt-craft optimises a single LLM call's wording; agent-engineering optimises the entire system that composes many calls into a workflow\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"tool-call-strategy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"tool-call-strategy owns per-call efficiency decisions inside one agent; agent-engineering owns the multi-agent architecture those calls run within\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"context-engineering\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"context-engineering owns what reaches a single agent's context window; agent-engineering owns the system architecture that decides which agent runs at all\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"debugging\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"debugging 
chases a specific runtime failure; agent-engineering is about preventing classes of coordination failure through architecture\\\\\\\"}],\\\\\\\"related\\\\\\\":[\\\\\\\"context-engineering\\\\\\\",\\\\\\\"tool-call-strategy\\\\\\\",\\\\\\\"prompt-craft\\\\\\\"],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"code-review\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":90,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/agent-engineering/SKILL.md\"}"
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
+ skill_graph_protocol: Skill Metadata Protocol v4
+ skill_graph_project: Skill Graph
+ skill_graph_canonical_skill: skills/agent-engineering/SKILL.md
+ ---
+
+ # Agent Engineering
+
+ ## Coverage
+
+ - The discipline's relationship to and distinction from prompt engineering, harness engineering, and traditional distributed systems
+ - The four pillars: architecture and lifecycle management, task decomposition and context management, multi-agent coordination patterns, production reliability
+ - The lifecycle state machine: claim → execute → verify → commit → release, with the extended research/plan/review variant for complex workflows
+ - Context health states (ok / degraded / compact / exhausted) and their budget thresholds, plus the six observable signals of context rot
+ - Multi-agent coordination patterns: orchestrator/worker, fan-out/merge, evaluator/optimiser, consensus/fusion, sequential chain, hybrid — and the cost/reliability trade-offs of each
+ - The two-pass pattern (audit then fresh-context implement) for reliability-critical workflows
+ - The eight named coordination failure modes (task stealing, context contamination, merge conflicts, silent stall, brief rot, result injection, context bloat, double-commit) with detection and mitigation
+ - The six production reliability requirements: observability, cost budgets, idempotency, failure recovery, safety caps, claim locks — and what breaks when each is missing
+ - The delegation decision framework: six gates with overhead crossover analysis (≈1000-token minimum subagent overhead), batch crossover at four tasks for cheap-model fan-out
+ - The most common anti-patterns (God Agent, prompt-as-architecture, memory-persisted state, runaway loop, telephone-game briefs, ghost claim) and corrective actions
+ - The production readiness audit checklist and the staged-rollout verification workflow (10% → 50% → 100% budget)
+
+ ## Philosophy
+
+ A single LLM prompt produces an answer. A *system* of LLMs produces a workflow that survives session boundaries, crashes, model variance, budget exhaustion, and adversarial input. Agent engineering is the discipline of building the second from the first.
+
+ Three foundational truths inform everything that follows:
+
+ 1. **LLMs are unreliable components.** They hallucinate, forget, rationalise, and vary between calls. Treat each model call like any other unreliable component — with retries, circuit breakers, verification passes, and observability that catches failures early.
+ 2. **Context is finite and decays.** Every session has a context budget; reasoning quality degrades *before* the hard limit through a phenomenon called context rot. Architecture must compensate: decompose tasks, compress handoffs, spawn fresh agents, and detect rot early.
+ 3. **Coordination overhead is real.** The minimum cost of a single subagent round-trip is roughly 1000 tokens (brief composition + result verification + handoff summary). Over-delegation is as harmful as under-delegation; every fan-out must create value exceeding its overhead.
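
The overhead argument in point 3 can be sketched as a break-even check. This is a minimal illustration and not part of the package: `should_delegate` and `fan_out_net_savings` are hypothetical helpers, and the model itself is an assumption (delegation pays off when the parent-context tokens a subtask would consume inline exceed the round-trip overhead). Only the figure of roughly 1000 tokens comes from the text above.

```python
# Rough minimum round-trip cost per subagent, as quoted in the text above.
SUBAGENT_OVERHEAD_TOKENS = 1000


def should_delegate(inline_tokens, overhead_tokens=SUBAGENT_OVERHEAD_TOKENS):
    """Return True when handing the subtask to a subagent saves parent context.

    Assumption: `inline_tokens` estimates what the parent would spend doing the
    subtask itself; the subagent's own spend is ignored in this simple model.
    """
    return inline_tokens > overhead_tokens


def fan_out_net_savings(inline_estimates, overhead_tokens=SUBAGENT_OVERHEAD_TOKENS):
    """Sum per-subtask savings; negative contributions indicate over-delegation."""
    return sum(tokens - overhead_tokens for tokens in inline_estimates)
```

Under these assumptions a 400-token lookup stays inline, a 5000-token investigation is worth delegating, and a fan-out over both nets 3400 tokens of parent context.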
+
+ Four design axioms follow:
+
+ - **Non-determinism is a feature, not a bug.** Use the LLM for creative breadth; use architecture to converge that breadth onto deterministic results.
+ - **Context is the most precious resource.** Treat context like RAM: garbage-collect, page in and out, protect the working set.
+ - **Validation ≥ Generation.** The code that *verifies* output should be at least as robust as the logic that *generates* it.
+ - **Deterministic-first.** If a task can be solved by SQL, regex, or a script, the agent should *invoke* those tools — never *simulate* their logic.
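
The "Validation ≥ Generation" and "Deterministic-first" axioms can be combined in a small retry loop. The sketch below is illustrative only: `call_with_verification` and `is_iso_date` are hypothetical names, `generate` stands in for any model call, and the date format is an arbitrary example of a check that a regex can do deterministically.

```python
import re

# Deterministic check: a regex validates the format instead of asking a model.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(text):
    """Return True if `text` is exactly a YYYY-MM-DD shaped string."""
    return ISO_DATE.fullmatch(text) is not None


def call_with_verification(generate, validate, max_attempts=3):
    """Call an unreliable generator until its output passes validation.

    `generate(attempt)` returns a candidate; `validate(candidate)` returns bool.
    Returns (candidate, attempts_used) or raises after `max_attempts` failures.
    """
    rejected = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if validate(candidate):
            return candidate, attempt
        rejected.append(candidate)  # keep rejected outputs for observability
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {rejected!r}")
```

The validator is deliberately stricter and simpler than the generator: it cannot be talked into accepting a plausible-looking but malformed answer.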
+
+ > Agent engineering is to LLM systems what distributed-systems engineering is to unreliable network services — the model is one component, and the real work is managing its lifecycle, scaling, and failure recovery so the whole system stays alive under failure.
+
+ ## Agent Engineering vs Related Disciplines
+
+ | Discipline | Focus | Key question | Scope |
+ |---|---|---|---|
+ | **Prompt engineering** | Single LLM call optimisation | "What instruction produces the right output?" | One call |
+ | **Harness engineering** | Execution environment and safety | "What can the agent NOT do?" | Per-agent |
+ | **Agent engineering** | System architecture and reliability | "How do we build a trustworthy system from unreliable components?" | Multi-agent system |
57
+ | **Distributed systems** | Deterministic service reliability | "How do reliable services coordinate at scale?" | Assumes deterministic, repeatable behaviour |
58
+
59
+ The key distinction from traditional system design: distributed systems assume components that behave deterministically and fail detectably (a crash, a timeout, an error code). Agent systems assume non-deterministic components that can fail silently, so they require explicit alignment, cost management, and failure handling at every layer. Techniques transfer (retries, circuit breakers, idempotency keys, leader-election analogues) but the failure modes differ. A database node fails with an error code. An LLM node fails by confidently producing a plausible-but-wrong answer.
60
+
61
+ ## The Four Pillars
62
+
63
+ ### Pillar 1 — Architecture and Lifecycle Management
64
+
65
+ Agents are not fire-and-forget. They have an explicit lifecycle that must be managed across session boundaries.
66
+
67
+ **Standard lifecycle:** Claim → Execute → Verify → Commit → Release.
68
+
69
+ **Extended lifecycle for complex workflows:** Claim → Research → Plan → Execute → Verify → Review → Commit → Release.
70
+
71
+ Each stage is traceable and recoverable:
72
+
73
+ - **Claim** — agent locks the task via an atomic file write or equivalent primitive; prevents two agents from working on the same task simultaneously.
74
+ - **Execute** — agent operates on the task; all tool calls logged; state snapshots written every N iterations.
75
+ - **Verify** — before commit, agent checks acceptance criteria, runs tests, validates output against spec.
76
+ - **Commit** — changes staged and committed (typically with path-limited commits to avoid parallel-session interference).
77
+ - **Release** — durable status update (issue tracker, work queue) marks task complete; claim lock removed so other agents know the task is released.
78
+
79
+ #### Context health states
80
+
81
+ Define explicit budget thresholds and behaviour at each:
82
+
83
+ | State | Budget used | Behaviour |
84
+ |---|---|---|
85
+ | `ok` | < 60% | Normal operation |
86
+ | `degraded` | 60–80% | Trigger internal compaction or pruning |
87
+ | `compact` | 80–90% | Prepare a handoff brief; spawn fresh session |
88
+ | `exhausted` | > 90% | Stop work; do not attempt complex writes |
89
+
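The thresholds in the table above can be sketched as a small classifier. This is a minimal illustration, not part of the package: the function name `context_health` is hypothetical, and because the table leaves the exact boundary values open, the sketch assumes 80% counts as `compact` and 90% still counts as `compact`.

```python
def context_health(used_fraction: float) -> str:
    """Map the fraction of the context budget consumed (0.0 to 1.0) to a state.

    Boundary handling is an assumption: the table does not say which state
    owns exactly 60%, 80%, or 90%.
    """
    if used_fraction < 0.60:
        return "ok"          # normal operation
    if used_fraction < 0.80:
        return "degraded"    # trigger compaction or pruning
    if used_fraction <= 0.90:
        return "compact"     # prepare a handoff brief, spawn a fresh session
    return "exhausted"       # stop work, no complex writes
```

Calling this once per iteration keeps the check cheap enough that it never needs to be skipped.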
90
+ #### Critical implementation rule
91
+
92
+ **Lifecycle state must persist to disk, not to process memory.** If the agent crashes between Execute and Commit, the state file tells the next agent where to resume. A variable in memory does not survive a crash.
93
+
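One way to persist lifecycle state is the write-to-temp-then-rename pattern. The sketch below assumes a JSON state file; the helper names `save_state` and `load_state` and the field names in the usage note are hypothetical.

```python
import json
import os
import tempfile
from typing import Optional


def save_state(path: str, state: dict) -> None:
    """Write state so a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)     # atomic replacement of the previous state
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str) -> Optional[dict]:
    """Return the last persisted state, or None if no state was ever written."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

A resuming agent would call `load_state` first and continue from the recorded stage, for example `{"task_id": "T-1", "stage": "verify"}`.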
94
+ ### Pillar 2 — Task Decomposition and Context Management
95
+
96
+ Complex tasks fail when forced into one session. Decomposition is the primary tool for managing reasoning quality.
97
+
98
+ #### The "God Agent" anti-pattern
99
+
100
+ Placing a complex, multi-day task into a single session.
101
+
102
+ - **Symptoms:** repeating errors, hallucinating file paths, forgetting constraints, increasing latency, re-explaining own prior actions.
103
+ - **Fix:** decompose into sub-tasks that fit within a ~4000-token "working zone" per agent step, and chain or fan-out with explicit handoffs.
104
+
105
+ #### Decomposition rules
106
+
107
+ 1. **Fit-in-budget test.** Will the full task context plus expected output fit in one session's budget? If not, decompose.
108
+ 2. **Independence test.** Can subtasks execute without reading intermediate output from each other? If yes, fan-out is safe (no ordering constraint, no shared mutable file).
109
+ 3. **Specialisation test.** Does the subtask benefit from a different model, harness, or tool set? If yes, route to a specialist.
110
+ 4. **Context protection.** Would doing this subtask inline risk blowing the orchestrator's context? If yes, delegate to protect the working set.
111
+ 5. **Three-step rule.** If a task requires more than three distinct logical steps, it should be a planned sub-issue in the work tracker, not an ad-hoc inline task.
112
+
113
+ #### Context rot detection signals
114
+
115
+ Observable indicators that an agent's context has degraded *before* hitting the hard limit:
116
+
117
+ - Agent starts repeating the same tool calls without progressing
118
+ - Reasoning sections become shorter and less specific
119
+ - Agent begins re-explaining its own prior actions from the same session
120
+ - Model forgets a constraint it acknowledged 10 turns earlier
121
+ - Summary quality degrades: less detail, more vague hedging
122
+ - Agent's tool-selection accuracy drops noticeably
123
+
124
+ When two of these signals appear together, treat context as `degraded` and trigger compaction or spawn a fresh agent.
125
+
126
+ ### Pillar 3 — Multi-Agent Coordination Patterns
127
+
128
+ | Pattern | Structure | When to use | Cost / reliability trade-off |
129
+ |---|---|---|---|
130
+ | **Orchestrator/worker** | One orchestrator dispatches N workers sequentially or via queue | Default for backlogs of independent tasks (bulk fixes, repo-wide audits) | Low overhead, sequential throughput |
131
+ | **Fan-out/merge** | Orchestrator spawns N parallel agents, merges results | Independent subtasks with no ordering constraint (parallel feature work, parallel test generation) | Moderate overhead (merge cost), high parallelism, reduced wall-clock |
132
+ | **Evaluator/optimiser** | Agent A generates, agent B critiques, A revises | UI implementation, documentation, complex logic | Moderate cost, higher quality through iteration |
133
+ | **Consensus/fusion** | N agents solve the same task independently; judge picks best or fuses results | High-stakes decisions (security, financial logic) where single-agent reliability is insufficient | N× cost, higher reliability, detects hallucinations better |
134
+ | **Sequential chain** | A.output → B.input → ... → final result | Each phase depends on prior output (impossible to parallelise) | Low overhead, sequential throughput, cannot parallelise |
135
+ | **Hybrid** | Orchestrator + parallel workers + sequential verify + consensus on risky decisions | Most production workflows | Moderate overhead, high reliability and parallelism |
136
+
137
+ #### The two-pass pattern
138
+
139
+ A mandatory standard for reliability-critical workflows:
140
+
141
+ - **Pass 1 (audit).** Agent researches and produces a research report or plan; writes to durable storage.
142
+ - **Pass 2 (implement).** A *fresh* agent (or fresh context) consumes the plan and executes.
143
+
144
+ This resets context rot and prevents "sunk-cost" bias — the implementing agent has not invested reasoning in a flawed plan, so it can flag issues rather than rationalise them.
145
+
146
+ #### Coordination failure modes
147
+
148
+ | Failure | Symptom | Root cause | Detection | Mitigation |
149
+ |---|---|---|---|---|
150
+ | **Task stealing** | Two agents claim the same task; duplicate PRs | No atomic claim primitive | Two claim files with same task ID | Atomic lock file write before *any* tool call; loser of race backs off immediately |
151
+ | **Context contamination** | Agent B uses stale output from A's failed run | Completion marker is file presence, not explicit status | Output file exists but tracker shows no completion | Tracker status is authoritative, not file presence |
152
+ | **Merge conflicts** | A and B edit the same file concurrently | No ownership assignment | Conflict markers on next commit | Disjoint file ownership per agent; worktree isolation |
153
+ | **Silent stall** | Agent stops responding without marking failure | No staleness monitor | No state update for N minutes | Per-phase timeouts; heartbeat writes; timeout escalates |
154
+ | **Brief rot** | Subagent brief contains stale context from earlier in the orchestrator session | Brief composed late in a long session | Brief references outdated paths or constraints | Compose briefs early; verify against current state before dispatch |
155
+ | **Result injection** | Agent B receives a result from A that was partially written mid-crash | No atomic result write | Downstream consumer errors on malformed JSON | Write to temp file, then rename atomically |
156
+ | **Context bloat** | Orchestrator keeps expanding context to track all workers' output | No summary compression | Orchestrator context grows monotonically | Workers write to durable storage, not back into prompt; orchestrator reads compressed summaries |
157
+ | **Double-commit** | Agent posts completion comment twice (retried after partial failure) | Missing idempotency check | Duplicate comments in tracker | Check before posting; post only if not already present |
158
+
159
+ Inter-agent misalignment accounts for roughly **37%** of multi-agent system failures (2026 industry analysis). Choosing the wrong coordination pattern for a given task structure is the most common architectural error: orchestrator/worker on highly dependent tasks bottlenecks; fan-out on dependent tasks fails because subagents lack prior subtask context; consensus on cheap decisions wastes 3× cost for marginal reliability gain.
160
+
161
+ ### Pillar 4 — Production Reliability
162
+
163
+ A production agent system must address all six reliability requirements. Missing any one cascades into operational failure.
164
+
165
+ #### 4.1 Observability
166
+
167
+ Every model call, tool call, state transition, cost event, and claim/release must be logged and queryable. Implementation: structured event logs (JSONL is the de-facto standard) with a fixed schema:
168
+
169
+ ```
170
+ { timestamp, task_id, agent_id, event_type, model, tokens_in, tokens_out, tool, outcome, cost, session_id }
171
+ ```
172
+
173
+ Without observability, "agent completed" is indistinguishable from "agent silently failed" — you cannot detect degradation, budget creep, or model-specific failure modes. Observability is the difference between a system you can operate and a black box.
174
+
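A minimal sketch of a JSONL event writer using the field names from the schema above. The helper `log_event` is hypothetical, and using a Unix timestamp as the default is an assumption; the source does not specify a timestamp format.

```python
import json
import time
from typing import Any, TextIO

# Field names taken from the schema above; the fixed order keeps lines easy to diff.
EVENT_FIELDS = (
    "timestamp", "task_id", "agent_id", "event_type", "model",
    "tokens_in", "tokens_out", "tool", "outcome", "cost", "session_id",
)


def log_event(stream: TextIO, **fields: Any) -> dict:
    """Append one event as a single JSON line; reject fields outside the schema."""
    unknown = set(fields) - set(EVENT_FIELDS)
    if unknown:
        raise ValueError(f"unknown event fields: {sorted(unknown)}")
    # Missing fields are written as null so every line has the same keys.
    event = {name: fields.get(name) for name in EVENT_FIELDS}
    if event["timestamp"] is None:
        event["timestamp"] = time.time()  # assumption: Unix seconds
    stream.write(json.dumps(event) + "\n")
    return event
```

Rejecting unknown fields is a design choice in this sketch: it keeps the schema fixed, which is what makes the log queryable later.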
175
+ #### 4.2 Cost budgets
176
+
177
+ Per-model daily quotas with graduated response: warn → throttle → lock. Keep free-tier or cheaper-model lanes available to absorb bulk work. Recommended thresholds per model per day:
178
+
179
+ - 0–70% — green (normal)
180
+ - 70–90% — yellow (warn)
181
+ - 90–100% — orange (throttle to cheaper model)
182
+ - > 100% — red (lock until reset)
183
+
184
+ Without budgets, one misconfigured infinite loop burns a month's quota in two hours. With budgets, the loop hits the cap and escalates to human review.
185
+
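The colour bands above can be expressed as a lookup. This is a sketch under assumptions: `budget_status` is a hypothetical name, and the boundary values (exactly 70%, 90%, 100%) are assigned to the higher band except that 100% stays orange, since the list reserves red for usage above 100%.

```python
from typing import Tuple


def budget_status(spent: float, daily_quota: float) -> Tuple[str, str]:
    """Return (colour, action) for one model's spend against its daily quota."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    used = spent / daily_quota
    if used < 0.70:
        return ("green", "normal")
    if used < 0.90:
        return ("yellow", "warn")
    if used <= 1.00:
        return ("orange", "throttle")  # route to a cheaper model
    return ("red", "lock")             # lock until the quota resets
```

The caller would evaluate this after every cost event, so a loop that burns budget quickly reaches `lock` within a few iterations rather than at the end of the day.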
186
+ #### 4.3 Idempotency
187
+
188
+ Operations must be safe to retry. Claiming a task twice must not create two PR branches. Posting a completion comment twice must not duplicate it. Applying a migration twice must not corrupt data.
189
+
190
+ Patterns:
191
+
192
+ - **Idempotency keys** — store result of operation under a key derived from inputs; check before retrying
193
+ - **Atomic writes** — file writes are all-or-nothing (rename from temp, never overwrite in place)
194
+ - **Pre-post existence check** — before posting a comment or creating a branch, search for an existing one with the same identifier
195
+ - **Grounding** — always verify state via inspection (`ls`, `git status`, tracker query) before acting; never trust the prompt's memory of state
196
+
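The idempotency-key pattern from the list above might look like the following sketch. The names `idempotency_key` and `run_once` are hypothetical, and the in-memory `dict` store is for illustration only; a real system would keep the store on durable storage.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def idempotency_key(operation: str, inputs: Dict[str, Any]) -> str:
    """Derive a stable key from the operation name and its inputs."""
    # sort_keys makes the key independent of dictionary insertion order.
    payload = json.dumps({"op": operation, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_once(store: Dict[str, Any], operation: str, inputs: Dict[str, Any],
             action: Callable[[], Any]) -> Any:
    """Run action only if no result is stored under the derived key."""
    key = idempotency_key(operation, inputs)
    if key in store:
        return store[key]   # retry path: reuse the recorded result
    result = action()
    store[key] = result     # record before reporting success
    return result
```

With this in place, retrying a "post completion comment" step after a partial failure returns the stored result instead of posting a second comment.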
197
+ #### 4.4 Failure recovery
198
+
199
+ Three recovery layers:
200
+
201
+ - **Detect early** — stall detection (no state update for N minutes), error caps (stop after K consecutive errors), context-health checks (compact at `degraded`, spawn fresh at `compact`)
202
+ - **Retry with backoff** — exponential backoff (1 s, 2 s, 4 s) on transient failures; max 3 retries per operation; wait 30 s before respawning
203
+ - **Escalate** — after retries exhausted, post a comment in the tracker, send a chat alert, or page on-call
204
+
205
+ Never loop indefinitely. Every loop must have an escape valve.
206
+
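A minimal sketch of bounded retry with exponential backoff, as described in the recovery layers above. `TransientError` and `retry_with_backoff` are hypothetical names; the `sleep` parameter is injectable only so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation: Callable[[], T], max_retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try once, then retry up to max_retries times, doubling the delay each time."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: the caller escalates
            sleep(base_delay * (2 ** attempt))  # 1 s, 2 s, 4 s with the defaults
    raise AssertionError("unreachable")
```

The final `raise` is the escape valve: when the last retry fails, the exception reaches the caller, which is where the escalation step belongs.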
207
+ #### 4.5 Safety caps
208
+
209
+ Set iteration limits (10 per session is a sane default, higher only with explicit override), consecutive error limits (3), and context health checks (compact at `degraded`, spawn fresh at `compact`). Without a safety cap, a runaway agent that retries the same failing task 100 times is "working as designed".
210
+
211
+ ```json
212
+ { "max_iterations": 10, "max_consecutive_errors": 3 }
213
+ ```
214
+
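The config above could be enforced by a loop like the one sketched here. `run_loop` and the returned status strings are hypothetical; the sketch assumes each step either returns `True` when the task is finished or raises on error.

```python
from typing import Callable, Dict


def run_loop(step: Callable[[int], bool], config: Dict[str, int]) -> str:
    """Run step until it reports completion or a safety cap is hit."""
    consecutive_errors = 0
    for iteration in range(config["max_iterations"]):
        try:
            done = step(iteration)
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= config["max_consecutive_errors"]:
                return "error_cap"      # halt, do not merely log and continue
            continue
        consecutive_errors = 0          # any success resets the error streak
        if done:
            return "done"
    return "iteration_cap"
```

A caller would treat both `error_cap` and `iteration_cap` as failures: write a failure summary and exit with a non-zero status.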
215
+ #### 4.6 Claim locks
216
+
217
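+ Prevent two agents from claiming the same task. Implement the claim as an atomic exclusive create (POSIX `O_CREAT | O_EXCL`, or hard-linking a fully written temp file into place, which fails if the lock already exists). A plain `mv` from temp is atomic but silently replaces an existing lock, so it suits result files rather than claims. The file contains `{ task_id, agent_id, timestamp }`. If the file already exists, the second agent skips. The lock is released on completion (success or failure).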
218
+
219
+ Why atomic files instead of a database lock service? Simple to debug (readable with `cat`), crash-safe (rename is atomic at OS level), requires no running server, scales horizontally via instance isolation, and humans can edit it mid-run for steering. Trade-off: concurrent writes still need the rename pattern — plain `echo > file` is not atomic.
220
+
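A claim lock built on an exclusive create might look like this sketch. The helper names `try_claim` and `release` are hypothetical, and the lock location is whatever path the system agrees on per task.

```python
import json
import os
import time


def try_claim(lock_path: str, task_id: str, agent_id: str) -> bool:
    """Create the lock file exclusively; return False if another agent holds it."""
    payload = json.dumps(
        {"task_id": task_id, "agent_id": agent_id, "timestamp": time.time()}
    )
    try:
        # O_EXCL makes the create fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(payload)
    return True


def release(lock_path: str) -> None:
    """Remove the lock; tolerate a lock that is already gone."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

A worker would call `try_claim` before any tool call and wrap the work in `try: ... finally: release(lock_path)`. One caveat of this sketch: a reader can briefly observe an empty lock file between the create and the write, so readers should treat an empty file as "claimed".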
221
+ ## Delegation Decision Framework
222
+
223
+ Before spawning a subagent, walk this decision tree:
224
+
225
+ ```
226
+ 0. Single-shot decision or under three tool calls?
227
+ YES → Do it inline. Delegation overhead (~1000 tokens) exceeds task cost.
228
+ NO → Continue.
229
+
230
+ 1. Requires a specialist tool or model not currently loaded?
231
+ YES → Delegate to the specialist.
232
+ NO → Continue.
233
+
234
+ 2. Two or more genuinely independent subtasks (no ordering, no shared mutable file)?
235
+ YES → Fan-out to parallel workers. Verify independence first.
236
+ NO → Continue.
237
+
238
+ 3. Inline context plus output would exceed working budget?
239
+ YES → Delegate to a subagent with its own context window. Compose a minimal brief (< 2000 tokens).
240
+ NO → Continue.
241
+
242
+ 4. Cheaper model would handle this adequately (scanning, classification, format conversion)?
243
+ YES → Route to cheaper model.
244
+ NO → Continue.
245
+
246
+ 5. None of the above?
247
+ → Do it inline. Over-delegation is an anti-pattern.
248
+ ```
249
+
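The decision tree above can be encoded as a pure function, which makes delegation choices testable. The `Task` fields and the returned labels are hypothetical, and gate 0 is approximated here as "fewer than three tool calls".

```python
from dataclasses import dataclass


@dataclass
class Task:
    tool_calls: int                       # expected number of tool calls
    needs_specialist: bool = False        # gate 1
    independent_subtasks: int = 1         # gate 2
    exceeds_working_budget: bool = False  # gate 3
    cheaper_model_adequate: bool = False  # gate 4


def delegation_decision(task: Task) -> str:
    """Walk the gates in order and return the first action that applies."""
    if task.tool_calls < 3:
        return "inline"          # gate 0: overhead exceeds task cost
    if task.needs_specialist:
        return "specialist"
    if task.independent_subtasks >= 2:
        return "fan_out"         # verify independence before dispatch
    if task.exceeds_working_budget:
        return "subagent"        # fresh context window, minimal brief
    if task.cheaper_model_adequate:
        return "cheaper_model"
    return "inline"              # gate 5: none of the above
```

The order matters: a two-call task stays inline even when a cheaper model could handle it, because gate 0 is checked first.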
250
+ **Delegation overhead breakdown** (minimum cost of one subagent round-trip):
251
+
252
+ - Brief composition: ~500 tokens
253
+ - Result verification: ~300 tokens
254
+ - Context transfer (handoff summary): ~200 tokens
255
+ - **Total: ~1000 tokens minimum**
256
+
257
+ A task that takes 800 tokens inline costs 1800 tokens when delegated. Delegation must create value beyond its overhead: parallelism, specialisation, context protection, or model-cost reduction.
258
+
259
+ **Batch crossover rule.** Delegate a batch of N tasks if `per-task inline cost × N` exceeds `(N × per-task subagent cost) + fixed overhead`. For classification tasks routed to a model 75% cheaper than the orchestrator, the crossover is approximately **four tasks**. Below four, route inline. Above four, fan out.
260
+
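The crossover depends on the per-task inline cost, which the rule leaves unstated. The sketch below evaluates the inequality directly. It assumes all costs are expressed in the same unit (orchestrator-equivalent tokens), a per-task inline cost of 350 tokens chosen purely for illustration, a subagent cost of 25% of that, and the ~1000-token fixed overhead quoted above. The function names are hypothetical.

```python
import math
from typing import Optional


def should_delegate_batch(n: int, inline_cost: float, subagent_cost: float,
                          fixed_overhead: float = 1000.0) -> bool:
    """True when inline_cost * N exceeds (N * subagent_cost) + fixed overhead."""
    return inline_cost * n > n * subagent_cost + fixed_overhead


def crossover_batch_size(inline_cost: float, subagent_cost: float,
                         fixed_overhead: float = 1000.0) -> Optional[int]:
    """Smallest batch size for which delegation is cheaper, or None if never."""
    saving_per_task = inline_cost - subagent_cost
    if saving_per_task <= 0:
        return None
    # Smallest integer strictly greater than overhead / saving.
    return math.floor(fixed_overhead / saving_per_task) + 1
```

Under these assumptions `crossover_batch_size(350, 87.5)` gives a crossover at four tasks; a larger per-task inline cost moves the crossover lower, a smaller one moves it higher.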
261
+ | Mistake | Symptom | Cost |
262
+ |---|---|---|
263
+ | Delegating a < 3-tool task | Subagent spawns, briefs, runs, returns | ~1200 tokens (~200 of work plus ~1000 overhead) vs ~200 inline = 6× cost |
264
+ | Not delegating when context-bound | Orchestrator hits limit, quality degrades, must compact and respawn | 2× context usage and delay |
265
+ | Parallel fan-out on dependent tasks | Subagent B fails because it lacked output from A | 2× cost for B plus merge complexity |
266
+ | Delegating to wrong specialist | Cheap model misses subtle bugs in code review | Quality loss, rework cost later |
267
+ | Recursive over-delegation | Subagents spawn their own subagents, which spawn more | Exponential context budget, hard to debug, slow |
268
+
269
+ ## Production Readiness Checklist
270
+
271
+ Before declaring an agent system production-ready, verify all six. A system missing any one of them is a demo, not a production system.
272
+
273
+ | Requirement | Test | Consequence of missing |
274
+ |---|---|---|
275
+ | **Observability** | Every invocation logged with cost, model, task ID, outcome; queryable | Cannot diagnose failures or identify model-specific issues |
276
+ | **Cost budgets** | Per-model quotas enforced; graduated response (warn → throttle → lock) | One misconfigured loop burns the month's quota in hours |
277
+ | **Idempotency** | Operations safe to retry; no duplicate side effects from retries | Duplicate comments, duplicate work, data corruption on retry |
278
+ | **Failure recovery** | Stall detection configured, retry backoff, human escalation path | Agents stall indefinitely; consumers think loop is running when it has hung |
279
+ | **Safety caps** | Iteration limit, consecutive error limit, context-health check all set | Runaway agent retries the same failing task 100 times |
280
+ | **Claim locks** | Prevent double-claim of the same task | Two agents work on the same task; duplicate PRs; wasted effort |
281
+
282
+ The most commonly skipped requirements: **idempotency** (teams assume the orchestrator will not retry a successful operation) and **claim locks** (teams assume parallel agents will not race on the same task queue).
283
+
284
+ ### Reliability audit checklist
285
+
286
+ - [ ] **Locking** — prevents concurrent claims on the same task
287
+ - [ ] **Isolation** — work performed in isolated worktrees or branches
288
+ - [ ] **Observability** — all tool and model calls logged to the event stream
289
+ - [ ] **Budgeting** — token or cost cap for this specific loop
290
+ - [ ] **Idempotency** — safe to run this loop ten times in a row
291
+ - [ ] **Verification** — automated gate (test, lint, eval) before the "done" signal
292
+ - [ ] **Handoff** — state persists to a durable continuation file
293
+ - [ ] **Safety caps** — iteration and error caps enforced
294
+
295
+ ### Staged rollout verification
296
+
297
+ 1. Deploy with observability on
298
+ 2. Run against production workload with all budgets set to **10%** (catch failures early)
299
+ 3. Monitor for one hour: stalls, cost overruns, duplicate claims
300
+ 4. Increase budget to **50%**, run one day
301
+ 5. Verify all six requirements pass; document any gaps
302
+ 6. Only after every requirement is verified, increase budget to **100%** and declare production-ready
303
+
304
+ ## Anti-Patterns
305
+
306
+ | Anti-pattern | What happens | Corrective action |
307
+ |---|---|---|
308
+ | **God Agent / spawn-first design** | Monolithic prompt tries to solve everything; or agent spawns subagents reflexively without applying the decision tree | Decompose into 4K-token working zones; apply the six-question decision tree before every spawn |
309
+ | **Prompt-as-architecture** | System relies on prompt instructions for reliability; when the model deviates there is no fallback | Add lifecycle state, safety caps, and failure recovery as first-class components — not prompt instructions |
310
+ | **Memory-persisted state** | Continuation state lives in a variable; a crash loses all progress | Write continuation state to disk atomically before every yield |
311
+ | **Implicit completion marker** | Task is "done" when a file appears; two agents both find the file absent and both claim | Use an explicit, atomic status write (tracker comment, lock file) as the completion marker |
312
+ | **Uncapped loop / runaway loop** | Loop runs until success or user kills it; a stuck task loops forever | Set `max_iterations` and `max_consecutive_errors` in every loop config |
313
+ | **Telephone-game brief** | Orchestrator sends full conversation history as the subagent brief; subagent starts context-exhausted | Briefs must be minimal: goal, constraints, deliverables, verification. Cap at 2000 tokens |
314
+ | **Fan-out without independence verification** | Tasks that share mutable state are fanned out; agents conflict on writes | Before fan-out, prove independence: no shared file, no ordering dependency |
315
+ | **Stall without detection / Ghost Claim** | Agent stops progressing but does not exit; or zombie agents hold locks after work died | Per-phase timeouts with heartbeat writes; release the claim lock in a `finally` block |
316
+ | **Delegation to expensive model for mechanical work** | Premium model classifies 200 documents that a cheap model could handle | Cost-aware delegation: deterministic mechanical work → cheap model or script; judgment work → premium |
317
+ | **Cross-session local-file dependency** | Agent A writes a finding to a local file; agent B in the next session cannot find it | Durable findings go to the work tracker; ephemeral files do not survive session boundaries |
318
+ | **Observability as afterthought** | Cannot diagnose why a task failed; guessing at root cause | Add logging before deployment; observability is a feature, not optional |
319
+ | **Budget limits set too high on rollout** | One misconfigured loop burns 30% of monthly quota before anyone notices | Start with conservative limits (10% of quota); increase as confidence grows |
320
+ | **Non-atomic claim locks** | Two agents claim the same task; both start working; merge conflict on results | Use an atomic exclusive create (`O_EXCL`) so the second claimant fails; never check for the file and then write it |
321
+ | **Context bloat in orchestrator** | Orchestrator tracks full output from ten workers; context approaches limit | Workers write to durable storage; orchestrator reads compressed summaries only |
322
+
323
+ ## Debugging Common Coordination Failures
324
+
325
+ Start with the symptom, not the root-cause hypothesis.
326
+
327
+ **Symptom: two PRs opened for the same task**
328
+ - Was the claim lock written before execution started, or after?
329
+ - Is the lock file deleted on completion before the next dispatch cycle?
330
+ - Fix: move lock write to the first action inside the worker, before any tool call. Use an `O_EXCL` open so the create fails if the lock already exists.
331
+
332
+ **Symptom: worker reports "task already done" but the tracker still shows In Progress**
333
+ - Is completion detection based on file presence or on tracker status?
334
+ - Did a prior failed run leave a stale completion file?
335
+ - Fix: completion detection must query tracker status, not file presence. Stale files must be cleaned at session start.
336
+
337
+ **Symptom: orchestrator stalls at 70% context without producing a continuation signal**
338
+ - Was context health monitored during the run?
339
+ - Did brief composition happen late in the session, after significant context burn?
340
+ - Fix: check context health every ten iterations. Trigger compaction at `degraded`. At `compact`, write continuation and exit.
341
+
342
+ **Symptom: agent retries the same failing operation 20+ times**
343
+ - Is `max_consecutive_errors` set?
344
+ - Does the error-cap code path actually halt execution, or merely log and continue?
345
+ - Fix: set `max_consecutive_errors: 3` in every loop config. Verify the cap exits with non-zero status and writes a failure summary.
346
+
347
+ **Symptom: subagent produces different results when replayed with the same brief**
348
+ - This is expected: LLMs are non-deterministic. Agent engineering wraps that variance in verification; it does not eliminate it.
349
+ - Is the verification pass catching the variance?
350
+ - Fix: add a structured verification pass (schema check, test run) after every subagent result. Retry up to the cap, then escalate.
351
+
352
+ **Symptom: cost spike on a single task**
353
+ - Did the task loop without a safety cap?
354
+ - Were cost events emitted per-iteration, allowing early detection?
355
+ - Fix: emit a cost event per loop iteration. Set a per-task budget threshold and escalate if exceeded.
356
+
357
+ ## Verification
358
+
359
+ After applying agent-engineering decisions, verify:
360
+
361
+ - [ ] Architecture is explicit — the coordination pattern can be drawn on a whiteboard
362
+ - [ ] All six production-readiness requirements are addressed
363
+ - [ ] Delegation decisions applied the six-gate framework, not "spawn because it seems helpful"
364
+ - [ ] Fan-out tasks are genuinely independent (no ordering, no shared mutable file, no implicit cross-subtask dependency)
365
+ - [ ] Stall detection is configured for unattended loops with thresholds based on expected phase duration
366
+ - [ ] Idempotency is verified for every operation with external side effects
367
+ - [ ] Continuation signals persist to disk, not memory or environment variables
368
+ - [ ] Safety caps are explicitly set, not left at defaults
369
+ - [ ] Claim locks are written before the first tool call and released in a `finally` block
370
+ - [ ] Claim locks work under concurrent load (test with two agents on the same task)
371
+ - [ ] Completion markers are based on explicit status, not file presence
372
+ - [ ] Cost events are emitted per iteration, enabling outlier detection
373
+ - [ ] Every fan-out has a corresponding merge or verify phase
374
+ - [ ] Observability answers within 30 seconds: how many agents ran today, total cost, error rate, mean time to failure
375
+
376
+ ## Do NOT Use When
377
+
378
+ | Use instead | When |
379
+ |---|---|
380
+ | `prompt-craft` | The fix is in the wording of one instruction (single-agent prompting), not the system around it |
381
+ | `tool-call-strategy` | The question is per-call efficiency inside one agent, not how multiple agents coordinate |
382
+ | `context-engineering` | The question is what reaches a single agent's context window, not the architecture of which agents run at all |
383
+ | `debugging` | A specific runtime failure needs root-cause analysis; not a class of coordination failure to prevent through architecture |
384
+ | `code-review` | Reviewing AI-generated code for correctness; not designing the system that generated it |
385
+ | `documentation` | Writing prose for a human reader explaining an architecture; not designing the architecture itself |
386
+ | `testing-strategy` | Choosing test pyramid / trophy / honeycomb shape; not the verification pass inside an agent loop |
@@ -0,0 +1,55 @@
1
+ ---
2
+ name: agent-eval-design
3
+ description: "Use when designing evaluations for AI agents, skills, routers, prompts, tool-use policies, or multi-step workflows: task sets, rubrics, graders, hard negatives, regression cases, traces, and acceptance thresholds. Do NOT use for application test planning (use `testing-strategy`), skill-library health tooling (use `skill-infrastructure`), or live debugging of a failed run (use `debugging`)."
4
+ license: MIT
5
+ compatibility: "Portable eval-design discipline for agent workflows, skill routers, prompt systems, and tool-use policies."
6
+ allowed-tools: Read Grep
7
+ metadata:
8
+ metadata: "{\"schema_version\":6,\"version\":\"1.0.0\",\"type\":\"capability\",\"category\":\"quality\",\"domain\":\"ai-engineering/evaluation\",\"scope\":\"portable\",\"owner\":\"skill-graph-maintainer\",\"freshness\":\"2026-05-11\",\"drift_check\":\"{\\\\\\\"last_verified\\\\\\\":\\\\\\\"2026-05-11\\\\\\\"}\",\"eval_artifacts\":\"planned\",\"eval_state\":\"unverified\",\"routing_eval\":\"absent\",\"stability\":\"experimental\",\"keywords\":\"[\\\\\\\"agent eval\\\\\\\",\\\\\\\"AI eval design\\\\\\\",\\\\\\\"skill routing eval\\\\\\\",\\\\\\\"eval rubric\\\\\\\",\\\\\\\"hard negatives\\\\\\\",\\\\\\\"grader design\\\\\\\",\\\\\\\"regression eval\\\\\\\",\\\\\\\"trace evaluation\\\\\\\",\\\\\\\"acceptance threshold\\\\\\\",\\\\\\\"prompt eval\\\\\\\"]\",\"examples\":\"[\\\\\\\"design an eval set for whether this skill routes correctly against near-miss prompts\\\\\\\",\\\\\\\"create a rubric for judging agent outputs on grounded project knowledge extraction\\\\\\\",\\\\\\\"what hard negatives should test this router before we mark routing_eval present?\\\\\\\",\\\\\\\"turn these agent failure traces into regression eval cases\\\\\\\"]\",\"anti_examples\":\"[\\\\\\\"plan unit, integration, and e2e tests for this product feature\\\\\\\",\\\\\\\"run the skill graph lint and overlap tooling\\\\\\\",\\\\\\\"debug why yesterday's agent run failed\\\\\\\",\\\\\\\"write production code to fix this failing test\\\\\\\"]\",\"relations\":\"{\\\\\\\"boundary\\\\\\\":[{\\\\\\\"skill\\\\\\\":\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"testing-strategy plans software tests; agent-eval-design designs behavioral evals for AI agents and skills\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"skill-infrastructure\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"skill-infrastructure owns library health tooling; agent-eval-design owns eval content and grading design\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"debugging\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"debugging investigates a live 
failure; agent-eval-design turns patterns into future evals\\\\\\\"},{\\\\\\\"skill\\\\\\\":\\\\\\\"code-review\\\\\\\",\\\\\\\"reason\\\\\\\":\\\\\\\"code-review evaluates diffs; agent-eval-design evaluates agent behavior\\\\\\\"}],\\\\\\\"related\\\\\\\":[\\\\\\\"skill-router\\\\\\\",\\\\\\\"context-engineering\\\\\\\",\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"skill-infrastructure\\\\\\\"],\\\\\\\"verify_with\\\\\\\":[\\\\\\\"testing-strategy\\\\\\\",\\\\\\\"skill-infrastructure\\\\\\\"]}\",\"portability\":\"{\\\\\\\"readiness\\\\\\\":\\\\\\\"scripted\\\\\\\",\\\\\\\"targets\\\\\\\":[\\\\\\\"skill-md\\\\\\\"]}\",\"lifecycle\":\"{\\\\\\\"stale_after_days\\\\\\\":365,\\\\\\\"review_cadence\\\\\\\":\\\\\\\"quarterly\\\\\\\"}\",\"skill_graph_source_repo\":\"https://github.com/jacob-balslev/skill-graph\",\"skill_graph_protocol\":\"Skill Metadata Protocol v5\",\"skill_graph_project\":\"Skill Graph\",\"skill_graph_canonical_skill\":\"skills/agent-eval-design/SKILL.md\"}"
9
+ skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
10
+ skill_graph_protocol: Skill Metadata Protocol v4
11
+ skill_graph_project: Skill Graph
12
+ skill_graph_canonical_skill: skills/agent-eval-design/SKILL.md
13
+ ---
14
+
15
+ # Agent Eval Design
16
+
17
+ ## Coverage
18
+
19
+ Design evaluations for agent behavior, skill routing, prompt systems, tool-use policies, and multi-step workflows. Covers task selection, expected behavior, rubrics, graders, hard negatives, trace capture, regression cases, thresholds, coverage, and eval maintenance.
20
+
21
+ ## Philosophy
22
+
23
+ Agent evals are behavioral contracts. They should measure whether the agent does the right thing under realistic ambiguity, not whether it can parrot the happy path.
24
+
25
+ The highest-value cases are hard negatives and prior failures. A routing eval with only obvious positives gives false confidence.
26
+
27
+ ## Method
28
+
29
+ 1. Define the behavior being evaluated in one sentence.
30
+ 2. Collect realistic positive cases, near misses, and failure traces.
31
+ 3. Write expected outcomes that are observable.
32
+ 4. Add hard negatives that should route elsewhere or refuse an unsafe path.
33
+ 5. Choose grader type: exact, rubric, trace inspection, artifact check, or hybrid.
34
+ 6. Set pass thresholds and severity for failures.
35
+ 7. Add regression cases whenever a real agent failure is fixed.
36
+
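The method above can be sketched for the routing case with an exact-match grader. Everything named here is hypothetical: `EvalCase`, `run_routing_eval`, and the toy `keyword_router` stand in for a real router, and the prompts are borrowed from this skill's own examples and anti-examples.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected_skill: str   # observable expected outcome
    kind: str             # "positive", "hard_negative", or "regression"


def run_routing_eval(router: Callable[[str], str], cases: List[EvalCase],
                     threshold: float) -> dict:
    """Exact-match grader: compare the routed skill with the expected skill."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    failures = [c for c in cases if router(c.prompt) != c.expected_skill]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {"pass_rate": pass_rate, "passed": pass_rate >= threshold,
            "failures": failures}


def keyword_router(prompt: str) -> str:
    """Toy router used only to exercise the eval."""
    text = prompt.lower()
    if "debug" in text:
        return "debugging"
    if "unit" in text or "e2e" in text:
        return "testing-strategy"
    return "agent-eval-design"


CASES = [
    EvalCase("design an eval set for whether this skill routes correctly "
             "against near-miss prompts", "agent-eval-design", "positive"),
    EvalCase("plan unit, integration, and e2e tests for this product feature",
             "testing-strategy", "hard_negative"),
    EvalCase("debug why yesterday's agent run failed", "debugging", "hard_negative"),
    EvalCase("turn these agent failure traces into regression eval cases",
             "agent-eval-design", "positive"),
]
```

Keeping `kind` on every case makes it possible to report pass rates separately for hard negatives, which is where a router that only handles obvious positives would fail.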
37
+ ## Verification
38
+
39
+ - [ ] Eval cases include positives, hard negatives, and prior failures
40
+ - [ ] Expected outcomes are observable and not preference-only
41
+ - [ ] The grader can distinguish partially correct from wrong
42
+ - [ ] Thresholds match risk, not vanity metrics
43
+ - [ ] Cases cover routing, grounding, tool use, and final artifact where relevant
44
+ - [ ] New failures become regression cases
45
+ - [ ] Eval metadata honestly reflects run state
46
+
47
+ ## Do NOT Use When
48
+
49
+ | Use instead | When |
50
+ |---|---|
51
+ | `testing-strategy` | You are planning tests for application code or product behavior. |
52
+ | `skill-infrastructure` | You are building or running skill-library health tooling. |
53
+ | `debugging` | You need to root-cause a specific failed run. |
54
+ | `code-review` | You need to review a code diff. |
55
+