autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,548 @@
# Cost/Quality Tradeoff Modeling for Autonomous Coding Pipelines

**Date:** 2026-02-22
**Status:** Research complete
**Confidence:** High on pricing data (official sources), Medium on quality deltas (benchmark-dependent), Medium on break-even modeling (assumptions documented)

---

## Executive Summary

Running an autonomous coding pipeline costs roughly $2-14 per 6-batch feature, depending on execution mode and caching strategy (see Section 4.2). The single largest cost lever is **prompt caching** (roughly a 75% reduction at the pipeline level), not model selection. Sonnet 4.5/4.6 matches or exceeds Opus on SWE-bench coding benchmarks at 60% of the price, making Opus routing justifiable only for architectural/planning tasks where reasoning depth matters. Competitive (MAB) mode roughly doubles per-batch cost but stays under $2/batch with cache priming — it breaks even on any feature where the avoided rework exceeds its ~$1.60 premium. Compared to commercial alternatives (Devin at $8-9/hr, Cursor at ~$0.09/request, Copilot at $0.04/premium request), the toolkit's API-direct approach is cheaper for heavy autonomous workloads but lacks the UX guardrails of commercial products.

**Recommendation:** Default to Sonnet, with Haiku for verification-only batches. Reserve Opus for planning and judging. Always cache-prime before parallel dispatch. Implement cost tracking per batch (the data doesn't exist yet, and every recommendation here would be more precise with it).

---

## 1. Current API Pricing Landscape

### 1.1 Claude Model Pricing (Anthropic, Official)

Source: [Anthropic Pricing Page](https://platform.claude.com/docs/en/about-claude/pricing)

| Model | Input $/MTok | Output $/MTok | Cache Read $/MTok | Cache Write (5m) $/MTok | Batch Input $/MTok | Batch Output $/MTok |
|-------|-------------|--------------|-------------------|------------------------|--------------------|---------------------|
| **Opus 4.6/4.5** | $5.00 | $25.00 | $0.50 | $6.25 | $2.50 | $12.50 |
| **Sonnet 4.6/4.5/4** | $3.00 | $15.00 | $0.30 | $3.75 | $1.50 | $7.50 |
| **Haiku 4.5** | $1.00 | $5.00 | $0.10 | $1.25 | $0.50 | $2.50 |
| Opus 4.1/4 (legacy) | $15.00 | $75.00 | $1.50 | $18.75 | $7.50 | $37.50 |
| Haiku 3.5 | $0.80 | $4.00 | $0.08 | $1.00 | $0.40 | $2.00 |

**Long context surcharge:** Requests exceeding 200K input tokens double the input price and add 50% to output (e.g., Sonnet: $6/$22.50). This is relevant for batch agents with large codebases — staying under 200K tokens per call is a significant cost optimization.

**Key ratio:** Opus 4.6 costs 1.67x Sonnet on both input and output. This is dramatically cheaper than legacy Opus 4.1 (5x Sonnet). The Opus tax has shrunk from 5x to 1.67x in one generation.
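
As a sanity check on the table, the per-call math (including the long-context surcharge) fits in a few lines. The function name and defaults are illustrative, using the Sonnet 4.6 rates above:

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_per_mtok: float = 3.00,
                  output_per_mtok: float = 15.00) -> float:
    """Cost of one API call at the given $/MTok rates.

    Applies the long-context surcharge described above: past 200K
    input tokens, input price doubles and output price gains 50%.
    """
    in_rate, out_rate = input_per_mtok, output_per_mtok
    if input_tokens > 200_000:
        in_rate *= 2.0
        out_rate *= 1.5
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100K-in / 50K-out Sonnet call: 0.1 * $3 + 0.05 * $15
print(round(call_cost_usd(100_000, 50_000), 2))   # 1.05
# The same call at 250K input pays the surcharge: 0.25 * $6 + 0.05 * $22.50
print(round(call_cost_usd(250_000, 50_000), 3))   # 2.625
```

The surcharge example makes the optimization concrete: growing input from 100K to 250K tokens (2.5x) raises the call cost 2.5x even before the surcharge, and the surcharge adds another ~60% on top.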

### 1.2 Competitor Pricing

| Provider | Model | Input $/MTok | Output $/MTok | Notes |
|----------|-------|-------------|--------------|-------|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K context |
| OpenAI | GPT-4o Mini | $0.15 | $0.60 | Budget tier |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | Under 200K; doubles above |
| Google | Gemini 2.5 Flash | $0.075 | $0.30 | Cheapest viable option |
| Google | Gemini 3 Pro | $2.00 | $12.00 | Newest generation |

**Finding:** Claude Sonnet ($3/$15) is priced between GPT-4o ($2.50/$10) and Gemini 3 Pro ($2/$12) on input, but is significantly more expensive on output. For output-heavy coding tasks (where the model generates substantial code), Claude's output premium matters. A batch generating 50K output tokens costs $0.75 on Sonnet vs $0.50 on GPT-4o vs $0.60 on Gemini 3 Pro.

**Implication for the toolkit:** The toolkit is model-agnostic at the `claude -p` layer, but the skill chain and quality gates are Claude-specific. Multi-provider routing (send verification batches to Gemini Flash at $0.075/$0.30) would require significant architecture changes but could cut verification costs by 90%.

### 1.3 Discount Mechanisms

| Mechanism | Discount | Latency Impact | Stackable? |
|-----------|----------|---------------|------------|
| **Prompt caching (read)** | 90% off input | Faster (no reprocessing) | Yes, with batch |
| **Prompt caching (write)** | +25% on first call | Minimal | Yes, with batch |
| **Batch API** | 50% off everything | Up to 24h (usually <1h) | Yes, with caching |
| **Cache + Batch combined** | ~95% off cached input | Up to 24h | Yes |

**The stacking math for a typical batch:**
- Uncached Sonnet input (100K tokens): $0.30
- Cached Sonnet input (90K cached + 10K new): 90K × $0.30/MTok + 10K × $3.00/MTok = $0.027 + $0.030 = $0.057
- Cached + Batch: 90K × $0.15/MTok + 10K × $1.50/MTok = $0.0135 + $0.015 = $0.0285

Blended across this 90/10 split, that's an 81% reduction from uncached to cached, and a 90% reduction from uncached to cached+batch (the headline 90%/95% figures apply only to the fully cached portion of the input).
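
The stacking math above can be reproduced directly. A sketch assuming the 90/10 cached/new split and the Sonnet rates from the tables in this section:

```python
MTOK = 1_000_000
SONNET_IN, CACHE_READ = 3.00, 0.30   # $/MTok, from section 1.1
BATCH_DISCOUNT = 0.5                 # Batch API: 50% off

def input_cost(cached_tok: int, new_tok: int, batched: bool = False) -> float:
    """Input cost in USD for a mix of cache-read and fresh tokens."""
    cost = (cached_tok * CACHE_READ + new_tok * SONNET_IN) / MTOK
    return cost * (BATCH_DISCOUNT if batched else 1.0)

uncached = input_cost(0, 100_000)                        # $0.30
cached = input_cost(90_000, 10_000)                      # $0.057
cached_batch = input_cost(90_000, 10_000, batched=True)  # $0.0285

print(f"savings: {1 - cached / uncached:.1%} cached, "
      f"{1 - cached_batch / uncached:.1%} cached+batch")
```

Note that the blended savings (81%, 90.5%) are lower than the headline 90%/95% rates because the 10K fresh tokens still pay full (or batch) input price.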

---

## 2. Quality Delta Between Models for Coding

### 2.1 Benchmark Evidence

Source: [SWE-bench Verified Leaderboard](https://llm-stats.com/benchmarks/swe-bench-verified), [SWE-rebench](https://swe-rebench.com), [Vellum LLM Leaderboard](https://www.vellum.ai/llm-leaderboard)

| Model | SWE-bench Verified | SWE-bench Pro | Cost/Task (SWE-rebench) |
|-------|-------------------|---------------|------------------------|
| Claude Sonnet 4.5 | 77.2% (82% w/ parallel) | 43.6% | $0.94 |
| Claude Opus 4.5 | 80.9% | 45.9% | — |
| Claude Opus 4.6 | ~80-82% | — | $0.93 |
| GPT-4o | ~49% | — | ~$0.50-1.00 |
| Gemini 2.5 Pro | ~65% | — | ~$0.80 |

**Finding: Sonnet is ~95% of Opus quality on coding benchmarks at 60% of the price.**

On SWE-bench Verified, Sonnet 4.5 scores 77.2% vs Opus 4.5's 80.9% — a 3.7-point gap. On SWE-bench Pro (harder), the gap is 2.3 percentage points (43.6% vs 45.9%). Crucially, Sonnet 4.5 with parallel compute (82%) actually exceeds single-shot Opus (80.9%).

**Where Opus still wins:**
- Planning and architecture decisions (qualitative, not well captured by SWE-bench)
- Complex multi-file refactoring requiring deep reasoning
- Judge/evaluation tasks where nuanced comparison matters
- The SWE-bench Pro gap suggests Opus pulls ahead on harder problems

**Where Opus doesn't justify the cost:**
- Standard implementation tasks (file creation, test writing)
- Verification/run-only batches
- Well-specified tasks with clear acceptance criteria

### 2.2 Cost Per Success Analysis

The metric that matters is **cost per successful batch**, not cost per token.

| Model | Cost/batch | Success rate (est.) | Cost/success |
|-------|-----------|--------------------:|-------------|
| Haiku 4.5 | ~$0.30 | ~60% | ~$0.50 |
| Sonnet 4.6 | ~$0.94 | ~85% | ~$1.11 |
| Opus 4.6 | ~$1.50 | ~90% | ~$1.67 |

Success rates are estimated from SWE-bench data scaled to the toolkit's quality gate pass rates. The key insight: **Haiku's apparent cheapness disappears when factoring in retry cost.** A 60% success rate means 40% of batches need a retry (costing another $0.30+ each), plus the quality gate execution time.

**Implication:** Sonnet is the cost-per-success sweet spot. Haiku is appropriate only for tasks with near-deterministic success (verification-only, run commands, check output). Opus is appropriate when a single failure is very expensive (complex integration, architectural changes).
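
The cost/success column is cost per attempt divided by success rate — equivalently, the expected total cost of retrying until one attempt passes, assuming independent retries. A sketch using the table's estimates:

```python
# E[cost] = cost_per_attempt * E[attempts] = cost_per_attempt / p_success,
# since the number of attempts until first success is geometric with mean 1/p.

def cost_per_success(cost_per_attempt: float, p_success: float) -> float:
    return cost_per_attempt / p_success

models = {
    "Haiku 4.5":  (0.30, 0.60),
    "Sonnet 4.6": (0.94, 0.85),
    "Opus 4.6":   (1.50, 0.90),
}
for name, (cost, p) in models.items():
    print(f"{name}: ${cost_per_success(cost, p):.2f}")
# Haiku $0.50, Sonnet $1.11, Opus $1.67 — Haiku's sticker-price
# advantage shrinks sharply once retries are priced in.
```

This model understates Haiku's true penalty: it ignores the wall-clock cost of each extra quality-gate run, which the text notes separately.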

---

## 3. Cost Per Batch by Execution Mode

### 3.1 Token Consumption Model

Based on SWE-rebench data and Claude Code usage statistics:

| Component | Input Tokens | Output Tokens | Notes |
|-----------|-------------|--------------|-------|
| System prompt + CLAUDE.md chain | ~8,000 | — | Cacheable |
| Plan text (single batch) | ~2,000 | — | Varies by plan |
| Context injection (failure patterns, progress) | ~1,500 | — | From run-plan-context.sh |
| Tool definitions (Bash, Read, Write, Edit, Grep, Glob) | ~2,000 | — | Cacheable |
| File reads during execution | ~20,000 | — | Varies heavily |
| Code generation + tool calls | — | ~15,000 | Primary output cost |
| **Total per batch** | **~33,500** | **~15,000** | Conservative estimate |

### 3.2 Cost Per Batch by Mode

Using Sonnet 4.6 pricing ($3/$15 per MTok) with ~33.5K input, ~15K output:

| Mode | Agents | Calls/Batch | Input Tokens | Output Tokens | Cost/Batch (uncached) | Cost/Batch (cached) |
|------|--------|------------|-------------|--------------|----------------------|---------------------|
| **Headless** | 1 | 1 | 33.5K | 15K | $0.33 | $0.13 |
| **Team** | 2-3 | 2-3 | 67-100K | 30-45K | $0.65-1.00 | $0.26-0.40 |
| **Competitive (MAB)** | 2 + judge | 3 | 80K+ | 35K+ | $0.77+ | $0.31+ |
| **Ralph loop** | 1 (iterating) | 2-5 | 67-167K | 30-75K | $0.65-1.63 | $0.26-0.65 |

**Notes:**
- Team mode spawns implementer + reviewer agents. Each gets its own context window.
- Competitive mode runs 2 parallel implementers + 1 judge evaluation. The judge call is smaller (diff comparison, not full implementation).
- Ralph loop cost depends on iterations. The stop-hook re-injects the prompt each cycle, but context accumulates within a session. Worst case: 5 iterations before convergence.
- Cached prices assume 80% of input tokens hit cache (system prompt + tools + CLAUDE.md chain + plan prefix) combined with Batch API rates. Caching alone cannot reach $0.13 for headless — the 15K output tokens at $15/MTok already cost $0.22; the 50% batch discount on output is what closes the gap.
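
The headless row can be re-derived from the 3.1 token model. A sketch; the 80% cache-hit rate and the Batch API discount are the assumptions stated in the notes above:

```python
MTOK = 1_000_000
# Sonnet 4.6 rates ($/MTok), from section 1.1
IN_RATE, OUT_RATE, CACHE_READ = 3.00, 15.00, 0.30
BATCH_DISCOUNT = 0.5

def batch_cost(input_tok: int = 33_500, output_tok: int = 15_000,
               cache_hit: float = 0.0, batched: bool = False) -> float:
    """USD cost of one headless batch under the 3.1 token model."""
    cached = input_tok * cache_hit
    fresh = input_tok - cached
    cost = (cached * CACHE_READ + fresh * IN_RATE + output_tok * OUT_RATE) / MTOK
    return cost * (BATCH_DISCOUNT if batched else 1.0)

print(f"uncached: ${batch_cost():.4f}")        # ≈ $0.33
print(f"cached+batch: "
      f"${batch_cost(cache_hit=0.8, batched=True):.4f}")  # ≈ $0.13
```

Running the same function with Haiku and Opus rates reproduces the ~$0.04 and ~$0.22 figures in the routing table below, which is how the cached column was cross-checked.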

### 3.3 Model Routing Impact on Batch Cost

The toolkit's `classify_batch_model()` function in `run-plan-routing.sh` routes:
- **Haiku** for verification-only batches (all steps are `Run:` commands)
- **Sonnet** for implementation batches (Create/Modify files) — default
- **Opus** for CRITICAL-tagged batches
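
A Python paraphrase of those routing rules (illustrative only — the real implementation is the bash `classify_batch_model()` in `run-plan-routing.sh`, and the step format shown here is assumed):

```python
def classify_batch_model(steps: list[str], critical: bool = False) -> str:
    """Pick a model tier for a batch from its step list."""
    if critical:                                     # CRITICAL-tagged batch
        return "opus"
    if steps and all(s.startswith("Run:") for s in steps):
        return "haiku"                               # verification-only
    return "sonnet"                                  # default: implementation

print(classify_batch_model(["Run: ./scripts/quality-gate.sh"]))         # haiku
print(classify_batch_model(["Create: src/api.py", "Run: pytest"]))      # sonnet
print(classify_batch_model(["Modify: core/router.py"], critical=True))  # opus
```

The CRITICAL check comes first so a critical verification batch still gets Opus; a batch mixing `Run:` with any `Create:`/`Modify:` step falls through to Sonnet.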

| Batch Type | Model | Cost (cached) | Frequency |
|-----------|-------|--------------|-----------|
| Implementation (Create) | Sonnet | $0.13 | ~50% |
| Implementation (Modify) | Sonnet | $0.13 | ~30% |
| Verification-only | Haiku | $0.04 | ~10% |
| Critical | Opus | $0.22 | ~10% |

**Weighted average per batch:** ~$0.13 (cached, with routing)
**Without routing (all Sonnet):** ~$0.13 (cached)
**Routing savings:** roughly zero net — the Haiku savings (~$0.09 on 10% of batches) are offset by the Opus premium (~$0.09 on 10% of batches). Routing's value is matching model strength to batch risk, not cutting cost.

**Implication:** Model routing saves far less than prompt caching. Caching first, routing second.

---

## 4. Total Pipeline Cost for a Typical Feature

### 4.1 Pipeline Stage Costs

| Stage | Model | Calls | Input Tokens | Output Tokens | Cost (cached) |
|-------|-------|-------|-------------|--------------|---------------|
| Brainstorm | Sonnet | 1 interactive session | ~50K | ~10K | $0.20 |
| PRD generation | Sonnet | 1 | ~20K | ~5K | $0.10 |
| Plan writing | Sonnet | 1 | ~30K | ~20K | $0.40 |
| Execution (6 batches, headless) | Mixed | 6 | ~200K | ~90K | $0.78 |
| Quality gates (6x) | — | 0 (bash scripts) | — | — | $0.00 |
| Verification | Sonnet | 1 | ~30K | ~5K | $0.12 |
| **Total (headless, cached)** | | **~10 calls** | **~330K** | **~130K** | **~$1.60** |

### 4.2 Total Cost by Execution Mode (6-batch feature)

| Mode | Base Cost | + Retries (20%) | + Judge (MAB) | Total |
|------|----------|----------------|--------------|-------|
| **Headless** | $1.60 | $0.16 | — | **$1.76** |
| **Team** | $2.38 | $0.24 | — | **$2.62** |
| **Competitive (MAB)** | $2.50 | $0.25 | $0.60 | **$3.35** |
| **Ralph loop** | $2.20 | $0.22 | — | **$2.42** |

**Without caching:**

| Mode | Total (uncached) |
|------|-----------------|
| **Headless** | ~$6.50 |
| **Team** | ~$10.00 |
| **Competitive (MAB)** | ~$13.50 |
| **Ralph loop** | ~$9.00 |
199
+ ### 4.3 Scaling: What Does a Multi-Feature Sprint Cost?
200
+
201
+ Assuming 5 features per week, 6 batches each:
202
+
203
+ | Scenario | Weekly Cost | Monthly Cost |
204
+ |----------|-----------|-------------|
205
+ | Headless + cached | $8.80 | $35.20 |
206
+ | MAB on everything + cached | $16.75 | $67.00 |
207
+ | Headless + uncached | $32.50 | $130.00 |
208
+ | MAB + uncached | $67.50 | $270.00 |
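
The scaling table follows directly from the per-feature totals in 4.2 (5 features/week, 4-week month):

```python
# Weekly and monthly cost per execution mode, from per-feature totals (4.2).
per_feature = {
    "headless_cached": 1.76,
    "mab_cached": 3.35,
    "headless_uncached": 6.50,
    "mab_uncached": 13.50,
}
features_per_week = 5

for name, cost in per_feature.items():
    weekly = cost * features_per_week
    monthly = weekly * 4  # 4-week month, matching the table
    print(f"{name:18s} weekly ${weekly:6.2f}  monthly ${monthly:7.2f}")
```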

**Context:** Claude Code's average daily cost per developer is $6, with 90th percentile at $12 (source: [Claude Code cost docs](https://code.claude.com/docs/en/costs)). The toolkit's headless mode with caching would add ~$1.76 per feature on top of any interactive session costs.

---

## 5. When Does Competitive Mode Pay for Itself?

### 5.1 The Rework Cost Model

Competitive mode costs ~$3.35 vs headless at ~$1.76 — a **$1.59 premium** per feature. This premium pays for itself when it avoids rework.

**What does rework cost?**
- A failed batch that passes quality gates but introduces subtle bugs: 1-3 batches of debugging ($0.40-1.20 cached)
- A failed batch caught by quality gates requiring retry: $0.13-0.22 per retry
- A feature that ships broken and requires a hotfix cycle: $3-10 (new brainstorm + plan + execute)
- Developer time debugging AI-generated code: $50-150/hr (opportunity cost)

### 5.2 Break-Even Analysis

| Rework Scenario | Rework Cost | MAB Premium | Avoided Events Needed to Break Even |
|----------------|------------|-------------|-------------------------------------|
| 1 retry saved | $0.13 | $1.59 | ~12 per feature (rarely pays alone) |
| 1 debugging batch saved | $0.94 | $1.59 | ~1.7 per feature |
| 1 hotfix cycle saved | $5.00 | $1.59 | 1 per ~3 features |
| 1 hour dev time saved | $75.00 | $1.59 | 1 per ~47 features |

**Finding:** Token-level savings alone (avoided retries or debugging batches) rarely cover the premium; the economics hinge on the expensive events. If competitive mode prevents one hotfix cycle per ~3 features, or saves one hour of developer time per ~47 features, it pays for itself. The question is empirical: **does the judge actually catch issues that quality gates miss?**
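
The break-even column is a transparent recomputation from the stated costs: premium divided by the value of one avoided event gives the event rate needed per feature.

```python
# Break-even: avoided rework events per feature needed to cover the
# premium of competitive (MAB) mode over headless mode.
premium = 3.35 - 1.76  # MAB total minus headless total, per feature ($1.59)

rework = {"retry": 0.13, "debugging batch": 0.94,
          "hotfix cycle": 5.00, "dev hour": 75.00}

for event, value in rework.items():
    events_per_feature = premium / value
    print(f"{event:15s}: {events_per_feature:.2f} avoided per feature to break even")
```

Values well below 1 (hotfix, dev hour) mean a single avoided event covers the premium of several features.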

### 5.3 When to Use Competitive Mode

**Use competitive mode when:**
- The batch involves cross-module integration (highest bug density)
- Historical retry rate for this batch type exceeds 30%
- The cost of a subtle bug is high (production-facing, data-handling)
- You have no strategy performance data yet (exploration phase of MAB)

**Use headless when:**
- The task is well-specified with clear acceptance criteria
- Strategy performance data shows a clear winner (>70% win rate)
- The batch is isolated (single file, no cross-module touches)
- Cost sensitivity is high and quality gates are comprehensive

---

## 6. Model Routing Strategies with Empirical Support

### 6.1 Academic Approaches

Three main paradigms from the literature:

**Routing (single model selection):** A classifier predicts which model will succeed and routes the entire request to that model. Cost = 1 model call + router overhead.
- Hybrid-LLM (ICLR 2024): Routes based on estimated quality gap between models. Works well when the small model handles >60% of queries adequately.
- Source: [ICLR 2024 paper](https://proceedings.iclr.cc/paper_files/paper/2024/file/b47d93c99fa22ac0b377578af0a1f63a-Paper-Conference.pdf)

**Cascading (escalation):** Start with the cheapest model. If confidence is below threshold, escalate to the next tier. Cost = 1-3 model calls, but most stop at tier 1.
- C3PO (2025): Achieves <20% cost of the most capable model with <2% accuracy loss across 16 benchmarks.
- Source: [C3PO paper](https://arxiv.org/pdf/2511.07396)

**Unified routing + cascading (ICLR 2025):** Proves that combining routing and cascading is strictly better than either alone. 4% improvement on RouterBench with 80% relative improvement over naive baselines.
- Source: [Unified approach](https://arxiv.org/abs/2410.10347)

### 6.2 Current Toolkit Strategy

The toolkit uses static routing via `classify_batch_model()`:

```
Create files → Sonnet
Modify files → Sonnet
Run-only (verification) → Haiku
CRITICAL tag → Opus
Default → Sonnet
```

This is pure routing (no cascading). It's simple and low-overhead but leaves money on the table.
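
The decision table above amounts to a small pure function. A hypothetical Python rendering (the real implementation is a bash function in `run-plan-routing.sh`; the step strings and the `critical` flag here are illustrative):

```python
# Hypothetical Python port of classify_batch_model() -- the actual logic
# lives in scripts/lib/run-plan-routing.sh as a bash function.
def classify_batch_model(steps: list, critical: bool = False) -> str:
    """Map a batch's steps to a model tier using the static rules above."""
    if critical:  # CRITICAL-tagged batch -> Opus
        return "opus"
    if steps and all(s.startswith("Run:") for s in steps):
        return "haiku"  # verification-only batch
    return "sonnet"     # Create/Modify batches and the default case

print(classify_batch_model(["Run: pytest", "Run: shellcheck *.sh"]))   # haiku
print(classify_batch_model(["Create: src/api.py", "Run: pytest"]))     # sonnet
print(classify_batch_model(["Modify: db/schema.sql"], critical=True))  # opus
```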

### 6.3 Recommended Improvements

**Short-term (no architecture changes):**
1. **Retry escalation already exists** — the toolkit escalates context on retry (includes previous failure log). Adding model escalation (Haiku → Sonnet → Opus on retry) would implement cascading with zero new infrastructure.
2. **Tag more batches as Haiku-eligible.** Currently only "all-Run" batches get Haiku. Config/documentation-only batches, test-only batches, and simple rename/move batches could also use Haiku.

**Medium-term (requires tracking):**
3. **Cost-per-success tracking.** Record model, cost, and pass/fail per batch in `.run-plan-state.json`. After 50+ data points, the toolkit can make data-driven routing decisions.
4. **Complexity-based routing.** Use batch metadata (file count, line count of changes, number of cross-file references) as routing features. More complex batches → higher-tier model.

**Long-term (architecture change):**
5. **Cascade on failure.** Instead of retrying with the same model + more context, retry with a more capable model. Haiku fails → Sonnet retry → Opus retry. Cost increases only when needed.

---

## 7. Prompt Caching Economics

### 7.1 How Caching Works for the Toolkit

The toolkit's `claude -p` calls have a highly cacheable prefix:

| Component | Tokens | Cacheable? | Cache Hit Rate |
|-----------|--------|-----------|---------------|
| System prompt | ~2,000 | Yes | ~100% across batches |
| CLAUDE.md chain (3 files) | ~4,000 | Yes | ~100% across batches |
| Tool definitions | ~2,000 | Yes | ~100% across batches |
| AGENTS.md (per-worktree) | ~1,000 | Yes | ~100% across batches |
| Plan text (current batch) | ~2,000 | No | 0% (changes per batch) |
| Context injection | ~1,500 | No | 0% (changes per batch) |
| File contents read during execution | ~20,000 | Partial | ~50% (some files repeated) |
| **Cacheable total** | **~9,000** | | |
| **Non-cacheable total** | **~24,500** | | |

**Effective cache rate:** ~27% of input tokens are cacheable across batches (the static prefix). Within a batch with multiple tool calls, the entire conversation so far is cacheable for each subsequent turn, pushing effective rates to 60-80%.

### 7.2 Cache Priming for Parallel Agents

The MAB round 2 research identified a critical pattern: when two agents launch simultaneously with uncached content, both pay write costs independently. The fix is a "prime the cache" call:

1. Send a single API call with the shared prefix (system prompt + CLAUDE.md + tools + design doc + PRD)
2. This call creates the cache entry (costs 1.25x input)
3. Both parallel agents then get cache-read pricing (0.1x input) on the shared prefix

**Savings per MAB batch:**
- Without priming: 2 × cache write = 2 × 1.25x × $3.00/MTok × 9K tokens = $0.0675
- With priming: 1 × cache write + 2 × cache read = 1.25x × $3.00/MTok × 9K + 2 × 0.1x × $3.00/MTok × 9K = $0.034 + $0.0054 = $0.039
- Savings: $0.028 per batch, or ~42% of the cache-related costs

This is small in absolute terms but compounds: over a 26-task MAB plan, it saves ~$0.73.
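
The priming arithmetic above, spelled out (two parallel agents, 9K-token shared prefix, Sonnet input pricing with the 1.25x write and 0.1x read multipliers):

```python
# Cache-priming savings for two parallel agents sharing a cacheable prefix.
rate = 3.00 / 1_000_000  # Sonnet input, $ per token
prefix = 9_000           # shared cacheable prefix, tokens

without = 2 * 1.25 * rate * prefix  # both agents pay the write premium
with_prime = 1.25 * rate * prefix + 2 * 0.1 * rate * prefix  # 1 write, 2 reads
saving = without - with_prime

print(f"without priming: ${without:.4f}")  # $0.0675
print(f"with priming:    ${with_prime:.4f}")
print(f"saving/batch:    ${saving:.4f} ({saving / without:.0%})")  # ~42%
```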

### 7.3 Batch API for Non-Interactive Work

The Batch API offers 50% off everything with up to 24-hour latency (usually under 1 hour). This is directly applicable to the toolkit's headless mode — `claude -p` calls are already non-interactive.

**Current barrier:** The toolkit uses `claude -p` (CLI), not the Batch API directly. Converting to Batch API would require:
1. Constructing API requests as JSON
2. Submitting batches via `curl` or a thin wrapper
3. Polling for completion
4. Parsing results

**Potential savings:** 50% across the board. A 6-batch headless feature drops from $1.76 to $0.88 (cached + batched).
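
Step 1 might look like the following sketch. The `{custom_id, params}` request shape follows Anthropic's Message Batches API, but treat the endpoint, model id, and field names as assumptions to verify against the batch-processing docs linked in Sources:

```python
# Sketch of step 1: build a Message Batches payload for two plan batches.
# Field names follow Anthropic's Message Batches docs; verify before use.
import json

def batch_request(batch_id, prompt):
    return {
        "custom_id": batch_id,  # used to match results when polling
        "params": {
            "model": "claude-sonnet-4-5",  # hypothetical model id
            "max_tokens": 8192,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

payload = {"requests": [
    batch_request("batch-01", "Execute batch 1 of the plan ..."),
    batch_request("batch-02", "Execute batch 2 of the plan ..."),
]}
# Step 2 would POST this JSON (e.g. via curl) to the batches endpoint.
print(json.dumps(payload, indent=2))
```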

---

## 8. Economics of Retry

### 8.1 Retry Cost Model

Each retry is a full API call — no discount for "trying again." The retry includes:
- All original context (system prompt, tools, plan)
- Additional context: previous failure log (~2,000 tokens)
- The model's new attempt (full output token cost)

**Cost per retry = base batch cost + ~10% overhead for failure context.**

### 8.2 Expected Retry Costs

| Scenario | P(success) | E[retries] | E[cost] per batch | vs. Single-shot |
|----------|-----------|-----------|-------------------|----------------|
| Sonnet, well-specified | 90% | 0.11 | $0.14 | +8% |
| Sonnet, complex integration | 70% | 0.43 | $0.19 | +46% |
| Haiku, simple task | 80% | 0.25 | $0.05 | +25% |
| Haiku, moderate task | 50% | 1.00 | $0.08 | +100% |

Expected retries formula: E[retries] = (1 - p) / p for the geometric distribution, capped at max_retries (typically 3).
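
The table's E[cost] column follows from the geometric model: E[cost] = base × (1 + E[retries]) = base / p (uncapped, and ignoring the ~10% failure-log overhead per retry for simplicity):

```python
# Expected cost per successful batch with uncapped geometric retries:
# E[retries] = (1 - p) / p, so E[cost] = base * (1 + E[retries]) = base / p.
# The ~10% failure-log overhead per retry is omitted here.
def expected_cost(base, p_success):
    return base * (1 + (1 - p_success) / p_success)

print(round(expected_cost(0.13, 0.90), 2))  # 0.14  Sonnet, well-specified
print(round(expected_cost(0.13, 0.70), 2))  # 0.19  Sonnet, complex integration
print(round(expected_cost(0.04, 0.80), 2))  # 0.05  Haiku, simple
print(round(expected_cost(0.04, 0.50), 2))  # 0.08  Haiku, moderate
```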

### 8.3 When to Retry vs. Escalate

**Current behavior:** Retry same model with more context (failure log appended).
**Better behavior:** Escalate model tier after first failure.

| Strategy | Avg cost/batch (complex task) | Success rate |
|----------|------------------------------|-------------|
| Retry same model (3x Sonnet) | $0.39 (3 × $0.13) | ~97% |
| Escalate (Sonnet → Opus) | $0.35 ($0.13 + $0.22) | ~98.5% |
| Escalate (Haiku → Sonnet → Opus) | $0.39 ($0.04 + $0.13 + $0.22) | ~99% |

**Finding:** Escalation is slightly cheaper than retry-at-same-tier for complex tasks because the higher-tier model is more likely to succeed on attempt 1, avoiding the cost of a third attempt. The quality improvement is marginal (97% vs 98.5%) but the cost structure is better.
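
The cost column above is the worst case, with every attempt in the ladder consumed; in practice most batches stop earlier, which is what tilts the comparison toward escalation:

```python
# Worst-case cost of each retry ladder (all attempts consumed), using the
# cached per-batch costs from the reference table: Haiku $0.04,
# Sonnet $0.13, Opus $0.22.
cost = {"haiku": 0.04, "sonnet": 0.13, "opus": 0.22}

ladders = {
    "3x Sonnet":               ["sonnet", "sonnet", "sonnet"],
    "Sonnet -> Opus":          ["sonnet", "opus"],
    "Haiku -> Sonnet -> Opus": ["haiku", "sonnet", "opus"],
}

for name, tiers in ladders.items():
    print(f"{name:24s} ${sum(cost[t] for t in tiers):.2f}")
```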

---

## 9. Commercial AI Coding Tool Pricing

### 9.1 Pricing Comparison

| Tool | Pricing Model | Monthly Cost | $/Hour of Work | Notes |
|------|--------------|-------------|---------------|-------|
| **Devin** (Core) | $20/mo + $2.25/ACU | $20+ | ~$9.00/hr | 1 ACU = ~15 min work |
| **Devin** (Team) | $500/mo + $2.00/ACU | $500+ | ~$8.00/hr | 250 ACUs included |
| **Cursor** (Pro) | $20/mo | $20 | ~$0.09/request | ~225 requests/mo with Claude |
| **Cursor** (Ultra) | $200/mo | $200 | ~$0.05/request | 20x capacity |
| **GitHub Copilot** (Pro) | $10/mo | $10 | $0.04/overage | 300 premium requests |
| **GitHub Copilot** (Pro+) | $39/mo | $39 | $0.04/overage | 1,500 premium requests |
| **Toolkit** (API direct) | Pay-per-token | $0-270/mo | ~$0.29/batch | Depends entirely on usage |
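
The effective rates in the table derive from the published pricing units, e.g.:

```python
# Deriving the table's effective rates from published pricing units.
acu_price_core = 2.25  # $ per ACU on Devin Core
acus_per_hour = 4      # 1 ACU ~ 15 minutes of agent work
devin_core_per_hour = acu_price_core * acus_per_hour

cursor_pro_monthly = 20.00
cursor_pro_requests = 225  # approximate included requests with Claude
cursor_per_request = cursor_pro_monthly / cursor_pro_requests

print(f"Devin Core: ${devin_core_per_hour:.2f}/hr")      # $9.00/hr
print(f"Cursor Pro: ${cursor_per_request:.3f}/request")  # ~$0.089/request
```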

### 9.2 Cost-Effectiveness Comparison

For a developer running 5 features/week (6 batches each = 30 batches/week):

| Tool | Monthly Cost | Autonomous? | Quality Gates? |
|------|-------------|------------|---------------|
| Toolkit (headless, cached) | ~$35 | Yes | Yes (built-in) |
| Toolkit (MAB, cached) | ~$67 | Yes | Yes + competitive evaluation |
| Devin (equivalent work) | ~$360-720 | Yes | Limited (proprietary) |
| Cursor Pro | $20 (capped) | No (interactive) | No (manual) |
| Copilot Pro | $10 (capped) | Partial (agent mode) | No (manual) |

**Finding:** The toolkit is the cheapest option for autonomous batch execution. Commercial tools are cheaper for interactive use (fixed monthly fee) but don't support headless autonomous operation with quality gates.

### 9.3 What You're Paying For

| Capability | Toolkit | Devin | Cursor | Copilot |
|-----------|---------|-------|--------|---------|
| Autonomous execution | Yes | Yes | No | Partial |
| Quality gates | Yes | No | No | No |
| Fresh context per batch | Yes | Unknown | No | No |
| Model routing | Yes | No | Yes (credit-weighted) | Yes (model selection) |
| Cost transparency | Yes (API direct) | ACU-abstracted | Credit-abstracted | Request-abstracted |
| UX/IDE integration | No (CLI) | Web UI | VS Code | VS Code/GitHub |

---

## 10. Cost Model for the Autonomous Coding Toolkit

### 10.1 Per-Batch Cost Calculator

```
batch_cost = (input_tokens × input_rate × cache_factor) + (output_tokens × output_rate)

Where:
  input_rate:
    haiku:  $1.00/MTok
    sonnet: $3.00/MTok
    opus:   $5.00/MTok

  output_rate:
    haiku:  $5.00/MTok
    sonnet: $15.00/MTok
    opus:   $25.00/MTok

  cache_factor:
    uncached:                  1.0
    first call (write):        1.25
    subsequent (read):         0.1
    effective (80% cache hit): 0.28

  Typical batch:
    input_tokens:  33,500
    output_tokens: 15,000
```
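
The pseudocode above translates directly to Python, with the rates and typical batch sizes listed in the block. Note that the cache factor applies only to the input term; output tokens are never discounted by caching:

```python
# Direct implementation of the batch_cost formula above.
RATES = {  # $ per MTok: (input, output)
    "haiku": (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}

def batch_cost(model, input_tokens, output_tokens, cache_factor=1.0):
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate * cache_factor
            + output_tokens * out_rate) / 1_000_000

# Typical batch, Sonnet, uncached -> ~$0.33 (matches the reference table)
print(round(batch_cost("sonnet", 33_500, 15_000), 2))        # 0.33
# Same batch with the 80%-hit effective cache factor on the input term
print(round(batch_cost("sonnet", 33_500, 15_000, 0.28), 2))
```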

### 10.2 Reference Cost Table

All costs in USD per batch, assuming typical token consumption:

| Configuration | Sonnet (uncached) | Sonnet (cached) | Haiku (cached) | Opus (cached) |
|--------------|------------------|-----------------|----------------|--------------|
| Headless (1 call) | $0.33 | $0.13 | $0.04 | $0.22 |
| Team (2 calls) | $0.65 | $0.26 | $0.09 | $0.43 |
| Competitive (2+judge) | $0.77 | $0.31 | $0.12 | $0.52 |
| With 1 retry | $0.46 | $0.18 | $0.06 | $0.30 |
| With 2 retries | $0.59 | $0.23 | $0.07 | $0.39 |

### 10.3 Full Pipeline Cost Table

| Pipeline Configuration | 6-Batch Feature | 12-Batch Feature | 26-Batch Sprint |
|----------------------|----------------|-----------------|----------------|
| Headless, all Sonnet, cached | $1.60 | $2.40 | $4.20 |
| Headless, routed, cached | $1.52 | $2.24 | $3.90 |
| MAB on all batches, cached | $3.35 | $5.50 | $10.40 |
| MAB selective (30% MAB), cached | $2.12 | $3.40 | $6.10 |
| Headless, all Sonnet, uncached | $6.50 | $10.00 | $18.00 |

### 10.4 Monthly Budget Estimates

For a solo developer using the toolkit full-time (20 features/month, 6 batches avg):

| Strategy | Monthly API Cost | Annual |
|----------|-----------------|--------|
| Conservative (headless, cached, routed) | $30 | $365 |
| Balanced (headless + selective MAB, cached) | $42 | $510 |
| Aggressive (MAB everything, cached) | $67 | $804 |
| Uncached baseline | $130 | $1,560 |
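
The budget rows are the per-feature figures from 10.3 scaled to 20 features/month (the table shows these values rounded to whole dollars):

```python
# Monthly and annual budget from per-feature costs (10.3), 20 features/month.
per_feature = {
    "conservative (headless, routed, cached)": 1.52,
    "balanced (headless + selective MAB)": 2.12,
    "aggressive (MAB everything)": 3.35,
    "uncached baseline": 6.50,
}
features_per_month = 20

for name, cost in per_feature.items():
    monthly = cost * features_per_month
    print(f"{name:42s} ${monthly:6.2f}/mo  ${monthly * 12:8.2f}/yr")
```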

---

## 11. Recommendations

### Priority-ordered by impact:

1. **Implement prompt caching immediately.** Roughly 75% cost reduction on the pipeline totals in 4.2, with zero quality tradeoff. This is the single highest-ROI optimization. Ensure the CLAUDE.md chain, system prompt, and tool definitions are in the cacheable prefix of every `claude -p` call.

2. **Add cost tracking per batch.** Record `{model, input_tokens, output_tokens, cache_hits, cost, passed}` to `.run-plan-state.json`. Without this data, all cost optimization is guesswork. This is a prerequisite to every other recommendation.

3. **Keep Sonnet as default.** The SWE-bench data shows Sonnet 4.5/4.6 is 95% of Opus quality at 60% of the price. The 4.6-generation Opus price drop (from 5x to 1.67x Sonnet) makes Opus more tempting, but Sonnet remains the cost-per-success sweet spot for implementation tasks.

4. **Implement model escalation on retry.** Instead of retrying the same model with more context, escalate: Haiku → Sonnet → Opus. This is cheaper than 3x same-model retry and has a higher cumulative success rate.

5. **Use selective MAB, not universal MAB.** Run competitive mode on integration batches, first-time batch types, and historically flaky batch types. Route known-easy batches to headless. Target a ~30% MAB rate for an optimal cost/learning balance.

6. **Cache-prime before parallel dispatch.** When running MAB or team mode, fire a single "warm the cache" call with the shared prefix before launching parallel agents. Saves ~42% of cache-related costs.

7. **Evaluate the Batch API for overnight runs.** For non-urgent features (entropy audits, batch-audit.sh, auto-compound.sh overnight), the Batch API's 50% discount is free money. Requires a thin wrapper around `curl` to submit and poll.

8. **Expand Haiku eligibility.** Currently only verification-only batches get Haiku. Add: test-only batches, config/documentation updates, simple file renames. Each Haiku-eligible batch saves $0.09 vs Sonnet (cached).

### What NOT to optimize:

- **Don't chase multi-provider routing.** Sending verification batches to Gemini Flash would save ~$0.03/batch but requires significant architecture changes. Not worth it at current scale.
- **Don't use Opus for everything.** The 1.67x cost premium over Sonnet is not justified by the 5% quality improvement for standard implementation tasks.
- **Don't skip quality gates to save money.** Quality gates are bash scripts with zero API cost. They prevent the most expensive failure mode: subtle bugs that ship and require full rework cycles.

---

## Sources

### Pricing (Official)
- [Anthropic Claude API Pricing](https://platform.claude.com/docs/en/about-claude/pricing)
- [OpenAI API Pricing](https://platform.openai.com/docs/pricing)
- [Google Gemini API Pricing](https://ai.google.dev/gemini-api/docs/pricing)
- [Devin AI Pricing](https://devin.ai/pricing)
- [GitHub Copilot Plans](https://github.com/features/copilot/plans)
- [Cursor Pricing](https://cursor.com/pricing)

### Benchmarks & Performance
- [SWE-bench Verified Leaderboard](https://llm-stats.com/benchmarks/swe-bench-verified)
- [SWE-rebench Leaderboard](https://swe-rebench.com) (cost-per-task data)
- [Vellum LLM Leaderboard](https://www.vellum.ai/llm-leaderboard)
- [Claude Sonnet 4.5 Benchmarks](https://www.leanware.co/insights/claude-sonnet-4-5-overview)

### Caching & Optimization
- [Anthropic Prompt Caching Docs](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Anthropic Batch Processing Docs](https://platform.claude.com/docs/en/build-with-claude/batch-processing)
- [Claude Code Cost Management](https://code.claude.com/docs/en/costs)

### Research Papers
- [Unified Routing and Cascading for LLMs — ICLR 2025](https://arxiv.org/abs/2410.10347)
- [Hybrid LLM: Cost-Efficient Quality-Aware — ICLR 2024](https://proceedings.iclr.cc/paper_files/paper/2024/file/b47d93c99fa22ac0b377578af0a1f63a-Paper-Conference.pdf)
- [C3PO: Optimized LLM Cascades — 2025](https://arxiv.org/pdf/2511.07396)
- [Why Multi-Agent LLM Systems Fail — 2025](https://arxiv.org/pdf/2503.13657)

### Internal References
- [MAB Research Round 2](/home/justin/Documents/projects/autonomous-coding-toolkit/docs/plans/2026-02-22-mab-research-round2.md) — cost economics, cache priming pattern
- [Architecture](/home/justin/Documents/projects/autonomous-coding-toolkit/docs/ARCHITECTURE.md) — execution modes, quality gates
- [Run-Plan Routing](/home/justin/Documents/projects/autonomous-coding-toolkit/scripts/lib/run-plan-routing.sh) — model classification logic