autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,556 @@
# Multi-Armed Bandit System: Research Report — Round 2

**Date:** 2026-02-22
**Status:** Research complete
**Scope:** Cost modeling, testing strategies, cross-domain analogies, coder toolkit workflow analysis, latent bugs
**Builds on:** `docs/plans/2026-02-21-mab-research-report.md` (Round 1)

---

## Executive Summary

Round 2 research expands beyond the ML/AI literature into seven cross-domain analogies (chess tournaments, evolutionary biology, competitive programming, manufacturing dual-sourcing, adversarial collaboration, forecasting tournaments, ensemble methods), plus deep analysis of cost economics, testing methodology, and the full coder toolkit workflow. Key findings:

1. **Cost is manageable:** Two parallel agents cost ~$1.88-2.38 per task with prompt caching (an 83% reduction vs. uncached). Cache priming before parallel dispatch is the single biggest cost lever.
2. **Testing the MAB requires synthetic bandits, not just integration tests:** simulation with known ground truth, seeded randomness, and distribution-level assertions — not output equality.
3. **Three cross-domain patterns emerged independently across all seven analogies:** locked criteria before evaluation, diversity as signal, and discriminating starting conditions.
4. **The coder toolkit workflow has 8 latent issues** that should be fixed before or alongside the MAB implementation, including a state schema mismatch that silently returns wrong test counts.
5. **The stop-hook/ralph-loop mechanism adapts naturally for MAB Agent B:** set up ralph-loop state in Agent B's worktree before the `claude -p` launch.

**Action items for the revised implementation plan:** Fix Gap 6 (state schema bug), fix Gap 7 (JSON extraction fragility), wire the planner into auto-compound.sh, and add a cache-prime step before parallel agent dispatch.

---

## 1. Cost Economics

### 1.1 Concrete Pricing

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|------------------------|
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| Any model, >200K context | $6.00 | $22.50 |

**Real-world per-task costs (SWE-bench, from swe-rebench.com):**

| Agent/Model | Cost per Task | Tokens per Task | Resolved Rate |
|-------------|--------------|-----------------|---------------|
| Claude Sonnet 4.5 | $0.94 | ~1.9M | 47.1% |
| Claude Opus 4.6 | $0.93 | ~1.0M | 51.7% |
| Claude Code (product) | $3.50 | ~2.1M | 52.9% |

**Agent teams multiplier:** Anthropic's docs state that agent teams use ~7x more tokens than single-agent sessions. Two parallel agents mean ~2x the per-agent cost, with no automatic context sharing between them.

### 1.2 The Cache Priming Pattern

**Critical finding:** Claude Sonnet dropped from $5.29 to $0.91 per task with prompt caching, an 83% reduction. Cache reads cost 0.1x the input price; cache writes cost 1.25x the input price (one-time).

**Parallel agent gotcha:** When two agents fire simultaneously on uncached content, both create independent caches, doubling write costs while getting zero read savings.

**Fix:** Fire a single "prime the cache" call first with the shared context (system prompt + design doc + PRD + codebase summary), then launch both agents. Both agents get cache-read pricing on the shared prefix.

**Concrete cost model for MAB per batch:**

| Scenario | Cost per batch (2 agents) | 6-batch plan total |
|----------|--------------------------|-------------------|
| No caching | ~$5.29 × 2 = $10.58 | ~$63.48 |
| With cache priming | ~$0.94 × 2 = $1.88 | ~$11.28 |
| Single agent (no MAB) | ~$0.94 × 1 = $0.94 | ~$5.64 |

**Bottom line:** MAB doubles cost vs. a single agent, but cache priming keeps it under $2/batch. The real cost concern is not per-batch execution — it's the judge call (~$0.50-1.00 additional per batch for evaluation).
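The cache-priming arithmetic can be sketched as a back-of-envelope script. In this sketch the per-token prices come from the Sonnet row of the pricing table, but every token count is an illustrative assumption, not a measurement:

```python
# Back-of-envelope batch cost for two parallel agents with a primed cache.
# Prices per token derive from the Sonnet 4.6 pricing row; all token
# counts below are illustrative assumptions.
INPUT_PRICE = 3.00 / 1e6          # $ per input token
OUTPUT_PRICE = 15.00 / 1e6        # $ per output token
CACHE_WRITE = 1.25 * INPUT_PRICE  # one-time cache-write premium
CACHE_READ = 0.10 * INPUT_PRICE   # 90% discount on the cached prefix

shared_prefix = 150_000    # system prompt + design doc + PRD (assumed)
per_agent_input = 50_000   # task-specific, uncached (assumed)
per_agent_output = 20_000  # diff + commentary (assumed)

prime_cost = shared_prefix * CACHE_WRITE
agent_cost = (shared_prefix * CACHE_READ
              + per_agent_input * INPUT_PRICE
              + per_agent_output * OUTPUT_PRICE)
batch = prime_cost + 2 * agent_cost
uncached = 2 * ((shared_prefix + per_agent_input) * INPUT_PRICE
                + per_agent_output * OUTPUT_PRICE)
print(f"primed batch: ${batch:.2f}, uncached: ${uncached:.2f}")
```

Plugging in real token counts from telemetry would reproduce the per-batch numbers in the table; the structure of the calculation is the point, not these placeholder values.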

### 1.3 Cost-Aware Thompson Sampling

Academic research formalizes "budgeted MAB" as a distinct problem class (UCB-B, Budget-UCB). Key techniques:

- **Cost-weighted priors:** Track `reward / cost` per arm, not just `reward`. This naturally deprioritizes expensive arms (Opus + extended thinking) unless they demonstrably outperform by more than the cost ratio.
- **Decaying violation budget:** Permit limited overspend early in learning, then enforce strict compliance later. This maps directly onto MAB runs: early runs explore freely, later runs exploit proven winners.
- **Pivot trigger:** A budget threshold at which all remaining pulls go to the current best arm regardless of uncertainty. This prevents runaway exploration.

**Recommendation for Phase 1:** Track cost per arm alongside win/loss. Don't optimize for it yet, but capture the data.
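The cost-weighted prior idea can be sketched as a small variant of Beta-Bernoulli Thompson sampling. The arm names, counts, and costs below are illustrative assumptions, and dividing the sampled win rate by the arm's dollar cost is one simple reward-per-dollar formulation, not the only one in the budgeted-MAB literature:

```python
import random

random.seed(7)

# arm -> observed wins/losses and average cost; all values are illustrative
arms = {
    "sonnet":        {"wins": 12, "losses": 8, "cost": 0.94},
    "opus-thinking": {"wins": 14, "losses": 6, "cost": 3.50},
}

def pick_arm(arms):
    """Thompson sampling on reward-per-dollar: sample a plausible win
    rate from each arm's Beta posterior, then divide by the arm's cost."""
    best, best_score = None, -1.0
    for name, a in arms.items():
        rate = random.betavariate(a["wins"] + 1, a["losses"] + 1)
        score = rate / a["cost"]
        if score > best_score:
            best, best_score = name, score
    return best

picks = [pick_arm(arms) for _ in range(1000)]
print(picks.count("sonnet") / 1000)
```

Even though the expensive arm has the higher raw win rate here, the cheap arm dominates selection, which is exactly the deprioritization behavior described above.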

### 1.4 Agentic Plan Caching

A newer technique (arXiv 2506.14852) caches structured plan templates across semantically similar tasks, reporting a 46.62% average cost reduction while maintaining 96.67% of optimal performance. Relevant if the MAB runs similar task types repeatedly.

---

## 2. Testing Strategy for the MAB System

### 2.1 Testing the Bandit Algorithm

**Technique 1: Synthetic Bandits**

Build a synthetic environment with known ground truth. Define a matrix of true arm reward probabilities, generate simulated outcomes, run the algorithm, and verify convergence.

```bash
# Test: Thompson Sampling converges to the better arm
# Ground truth: arm_a wins 70%, arm_b wins 40%
test_thompson_convergence() {
  # Run 1000 simulated rounds with a fixed seed
  result=$(python3 -c "
import random
random.seed(42)
wins_a, losses_a, wins_b, losses_b = 0, 0, 0, 0
choices = []
for i in range(1000):
    sample_a = random.betavariate(wins_a+1, losses_a+1)
    sample_b = random.betavariate(wins_b+1, losses_b+1)
    if sample_a >= sample_b:
        choices.append('a')
        if random.random() < 0.7: wins_a += 1
        else: losses_a += 1
    else:
        choices.append('b')
        if random.random() < 0.4: wins_b += 1
        else: losses_b += 1
# Report: fraction of the last 200 rounds that chose arm_a
print(choices[-200:].count('a') / 200)
")
  # Should be >0.70 with high probability; bc prints 1 when the
  # comparison holds, so compare against 1 explicitly
  converged=$(echo "$result > 0.70" | bc -l)
  assertEquals 'Thompson Sampling should converge to better arm' 1 "$converged"
}
```

**Technique 2: Offline Replay Evaluation**

Log all MAB decisions and outcomes to `logs/mab-run-*.json`. Replay the logged events against a candidate policy to validate that new routing logic would have performed at least as well as the historical policy.
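A minimal replay harness might look like the following. The record fields (`arm`, `reward`) are assumptions about what `logs/mab-run-*.json` would contain, and the estimator shown, which only scores rounds where the candidate policy agrees with the logged choice, is the standard rejection-sampling replay method from the bandit literature:

```python
import random

random.seed(0)

# Synthetic historical log; in practice these records would be loaded
# from logs/mab-run-*.json (field names are assumptions)
TRUE_RATE = {"a": 0.7, "b": 0.4}
log = []
for _ in range(2000):
    arm = random.choice(["a", "b"])  # historical policy: uniform random
    log.append({"arm": arm, "reward": int(random.random() < TRUE_RATE[arm])})

def replay(policy, log):
    """Rejection-sampling replay: only rounds where the candidate policy
    picks the same arm as the log count toward its estimated reward."""
    matched, total_reward = 0, 0
    for rec in log:
        if policy(rec) == rec["arm"]:
            matched += 1
            total_reward += rec["reward"]
    return total_reward / matched if matched else 0.0

always_a = lambda rec: "a"
uniform = lambda rec: random.choice(["a", "b"])
print(replay(always_a, log), replay(uniform, log))
```

The better candidate policy scores higher on the same historical log, which is the property the replay gate would assert before swapping in new routing logic.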

**Key testing principles for stochastic systems (from CMU SEI):**

- Fix the random seed for reproducibility
- Assert on distribution properties, not specific outputs ("arm A selected >70% of the last N rounds", not "arm A selected at round 47")
- Run 10-20 replicates as a baseline for estimating distribution properties
- Use a KS test or chi-squared to compare the output distribution to the expected one
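The replicate principle can be applied directly to the convergence test above: run the same simulation under several seeds and assert on the distribution of outcomes rather than any single run. A sketch, with the same illustrative 0.7/0.4 ground truth:

```python
import random

def final_selection_rate(seed, rounds=1000, tail=200):
    """One replicate of the Beta-Bernoulli simulation: returns the
    fraction of the last `tail` rounds that chose the better arm."""
    rng = random.Random(seed)
    stats = {"a": [0, 0], "b": [0, 0]}  # [wins, losses] per arm
    true_rate = {"a": 0.7, "b": 0.4}
    choices = []
    for _ in range(rounds):
        arm = max(stats, key=lambda k: rng.betavariate(stats[k][0] + 1,
                                                       stats[k][1] + 1))
        choices.append(arm)
        stats[arm][0 if rng.random() < true_rate[arm] else 1] += 1
    return choices[-tail:].count("a") / tail

# 20 replicates, one seed each; assert on the distribution, not one run
rates = [final_selection_rate(seed) for seed in range(20)]
print(f"mean={sum(rates)/len(rates):.2f} min={min(rates):.2f}")
```

An assertion like "mean selection rate across replicates exceeds 0.8" survives unlucky individual seeds, where a single-run assertion would flake.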

### 2.2 Testing the LLM Judge

**Agreement rates from the literature:**

| Context | Cohen's Kappa | Notes |
|---------|--------------|-------|
| Patch evaluation (clear cases) | 0.75 | High recall (0.94) and precision (0.80) |
| Patch evaluation (full dataset) | 0.57 | Drops on ambiguous cases |
| Search query parsing | 0.807 → 0.639 | Position bias degrades kappa by 0.17 |
| RAG evaluation (filtered) | 0.781-0.816 | "Substantial to almost perfect" |
| Human inter-rater (developers on patches) | Fleiss 0.31 | Humans themselves are inconsistent |

**Validation protocol (before trusting automated routing):**

1. Build the rubric collaboratively (LLM drafts, expert refines)
2. Run the judge on a clear benchmark where humans unanimously agree
3. Require kappa >= 0.70 on the clear subset before deploying
4. Track NPV (negative predictive value) separately — LLM judges are more reliable on INVALID verdicts (0.94-0.95) than on VALID
5. Measure self-consistency: same input, different seeds → same output?
6. If >30% of cases have human disagreement, switch from categorical metrics to distributional ones (Jensen-Shannon divergence)
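Cohen's kappa itself is small enough to keep in the validation harness rather than pull in a stats dependency. A stdlib sketch over paired human/judge verdicts; the verdict lists here are illustrative:

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(human) == len(judge)
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label
    expected = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Illustrative winner verdicts on 10 diff pairs
human = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
judge = ["A", "A", "B", "A", "B", "A", "A", "A", "B", "B"]
print(round(cohens_kappa(human, judge), 3))  # → 0.583
```

With this toy data the judge agrees on 8 of 10 pairs yet kappa is only 0.583, below the 0.70 gate in step 3, which illustrates why raw agreement percentages overstate judge reliability.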

**Judge test plan for Phase 1:**

- Prepare 10 synthetic evaluation pairs (known-better vs. known-worse diffs)
- Run the judge on each pair twice (once A-first, once B-first) = 20 evaluations
- Assert: >80% correct winner identification
- Assert: position bias < 15% (win-rate difference between first/second position)
- Assert: self-consistency > 85% (same winner when re-run with the same order)

### 2.3 Testing Nondeterministic Integration

The full MAB pipeline (agent dispatch → quality gate → judge → merge → learn) is inherently nondeterministic. Testing strategy:

- **Deterministic units:** Test each component in isolation with fixed inputs (e.g., test `run_judge()` with a fixed diff pair, or `thompson_sample()` with a fixed seed)
- **Stochastic integration:** Run the full pipeline N times on a trivial task (e.g., "add a docstring to this function") and assert statistical properties: a winner is declared in >95% of runs, the quality gate runs in 100%, the state file is updated in 100%
- **Fault injection:** Test what happens when Agent A fails (non-zero exit), Agent B produces no diff, the judge returns malformed JSON, or merge conflicts occur
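The malformed-JSON case deserves a dedicated unit test, since the judge's verdict feeds routing state. A sketch, where `parse_judge_verdict` is a hypothetical helper (not an existing function in the toolkit) that degrades to an explicit tie rather than crashing or recording a bogus win:

```python
import json

def parse_judge_verdict(raw: str) -> dict:
    """Hypothetical helper: parse the judge's JSON verdict, falling back
    to a tie (no learning signal) on malformed or off-schema output."""
    try:
        verdict = json.loads(raw)
        if verdict.get("winner") not in ("a", "b", "tie"):
            raise ValueError("unknown winner")
        return verdict
    except (json.JSONDecodeError, ValueError):
        return {"winner": "tie", "error": "malformed judge output"}

# Fault injection: well-formed, truncated, and off-schema outputs
assert parse_judge_verdict('{"winner": "a"}')["winner"] == "a"
assert parse_judge_verdict('{"winner": "a"')["winner"] == "tie"   # truncated
assert parse_judge_verdict('{"victor": "a"}')["winner"] == "tie"  # wrong key
```

Mapping every failure mode to a tie keeps the bandit's posteriors uncorrupted; the error field preserves the evidence for later debugging.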

---

## 3. Cross-Domain Analogies

### 3.1 Computer Chess Tournaments (TCEC)

The closest structural analog: two agents, identical hardware, an identical problem, and a judge that picks the winner.

| TCEC Practice | MAB Application |
|---------------|-----------------|
| **Curated opening book** (biased toward decisive positions) | Pre-screen tasks for discriminating power. Trivially easy tasks (both ace them) or impossible tasks (both fail) produce no signal. |
| **Adjudication rules** (auto-draw if engines agree within ±0.08 for 10 plies) | Early termination: if both agents produce near-identical solutions (by diff similarity), declare a draw — don't burn judge tokens. If one passes all tests and the other passes none, skip the detailed rubric and call it early. |
| **Same hardware, same time control** | Same model, same context budget, same token limit. Otherwise you're comparing resource allocation, not capability. |
| **Draw rate is a design problem** | If the MAB produces too many ties, the task design is wrong. Fix the tasks, not the judge. Monitor tie rate as a health metric. |
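The adjudication-rule analog can be implemented with stdlib diff similarity before any judge call is made. A sketch; the 0.95 threshold is an illustrative assumption to be tuned against real agent output:

```python
from difflib import SequenceMatcher

def adjudicate_early(diff_a: str, diff_b: str, threshold: float = 0.95):
    """Declare a draw without spending judge tokens when the two agents'
    diffs are near-identical; otherwise return None and fall through to
    the full judge evaluation."""
    ratio = SequenceMatcher(None, diff_a, diff_b).ratio()
    return "draw" if ratio >= threshold else None

same = "+def add(a, b):\n+    return a + b\n"
other = "+def add(x, y):\n+    total = x + y\n+    return total\n"
assert adjudicate_early(same, same) == "draw"
assert adjudicate_early(same, other) is None
```

Like TCEC's agreement window, this trades a little sensitivity (genuinely different solutions that happen to share most lines) for a large saving on the common no-signal case.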

### 3.2 Evolutionary Biology / Genetic Algorithms

| Biological Pattern | MAB Application |
|-------------------|-----------------|
| **Tournament selection pressure is a dial** (small tournaments preserve diversity, large ones force convergence) | The number of tasks per MAB round controls signal-to-noise. More matches per round = a more reliable signal, but slower adaptation. |
| **Artificial selection drives toward local optima** (domesticated crops lose wild resilience) | If the judge consistently favors one style, both agents converge to it and diversity collapses. Monitor inter-agent diff similarity as a canary. |
| **Recombination > pure selection** | The real value isn't picking a winner — it's identifying *which parts* of each solution were stronger. The Phase 2 judge should extract specific winning behaviors. |
183
+
184
### 3.3 Adversarial Collaboration (Kahneman)

| Scientific Practice | MAB Application |
|--------------------|-----------------|
| **Pre-registration of criteria** (both parties agree what evidence would change their mind before the experiment) | Judge rubric must be locked before agents see the task. If rubric is written after reviewing outputs, it unconsciously favors the impressive-looking answer. |
| **The joint design of the test is where value lies** | Defining what "better" means for each task class is harder and more valuable than the competition itself. |
| **Ask "on what dimension do these differ most?"** | Don't ask the judge "which is better overall?" — ask "on what dimension do these most differ, and which is better on that dimension?" Produces more actionable lessons. |

### 3.4 Manufacturing Dual Sourcing

| Procurement Pattern | MAB Application |
|--------------------|-----------------|
| **Credible threat of replacement drives improvement** | The mere existence of competition improves both agents. Keep both pipelines alive even when one is winning. |
| **Quality inconsistency between suppliers breaks integration** | If agents produce stylistically incompatible solutions (different abstractions, naming), the "winner" creates downstream debt. Judge needs a consistency criterion. |
| **Technology licensing outperforms pure competition** | Feed winning approach back to both agents before next round. Sharing knowledge produces better cumulative results than withholding it. Maps to injecting MAB lessons into both agents' context. |

### 3.5 Competitive Programming Judges (Codeforces/ICPC)

| Competition Practice | MAB Application |
|---------------------|-----------------|
| **Pre-test vs. system-test split** | Run agents against a visible "sanity check" suite first, then against a harder hidden suite for final judging. Prevents overfitting to visible rubric. |
| **Hacking** (competitors find inputs that break opponents' solutions) | After both agents submit, have each attempt to write a test case that breaks the other's solution. Valid breaking test = signal about code quality reasoning. (Phase 3 feature) |
| **Distinct verdict categories** (WA vs TLE vs RE) | Judge outputting only "Agent A wins" discards signal. "Agent A correct but 3x slower; Agent B had edge case bug at N=0" generates compounding knowledge. |

### 3.6 Forecasting Tournaments / Proper Scoring Rules

| Forecasting Pattern | MAB Application |
|--------------------|-----------------|
| **Proper scoring rules eliminate gaming** | Can an agent score well by optimizing for the judge rather than for correctness? If yes, the rubric isn't proper. Test by submitting impressive-looking-but-wrong solutions. |
| **Time-weighting for sequential competitions** | An agent that produces correct architecture early and refines is better than one that patches a wrong architecture — even if final outputs look identical. |
| **Panel of 2-3 judges beats single judge by 13-22%** | A single LLM judge is a single point of failure. Phase 2: use two judge calls with different temperatures and take majority vote. |

### 3.7 Ensemble Methods / Mixture of Experts

| ML Pattern | MAB Application |
|------------|-----------------|
| **Disagreement between agents IS the signal** | Two agents producing identical solutions = one agent. Track disagreement rate as a health metric. If it drops, tasks are too easy or agents have converged. |
| **Diversity must be actively promoted** | Same model + same context = correlated outputs. Structural diversity requires different prompting, tool access, context priming, or temperature. |
| **Gating network learns task-type trust** | A sophisticated judge learns "Agent A better on algorithmic; Agent B better on integration." Static rubrics lose this signal. |

### 3.8 Cross-Domain Synthesis

Three patterns appeared independently across all seven domains:

1. **Locked criteria before outputs are seen.** TCEC opening books, Kahneman's pre-registration, Codeforces hidden test suites, Brier score properness. The judge rubric must be defined and frozen before agents run.

2. **Homogeneous competition is waste.** Ensemble diversity, dual-sourcing, tournament selection pressure. If both agents converge to identical strategies, the competition produces zero information. Diversity is the asset; it must be actively maintained.

3. **Shared starting conditions must be pre-screened for discriminating power.** TCEC curated openings, speedrun set seeds, competitive programming difficulty calibration. Don't MAB trivially easy or impossibly hard tasks — they produce no signal.

---

## 4. Coder Toolkit Workflow Analysis

### 4.1 Full Skill Chain

```
USER INPUT
    ↓
Phase 1: DESIGN ─────────── superpowers:brainstorming
    │ Output: docs/plans/YYYY-MM-DD-<topic>-design.md
    │ Gate: user approval
    ↓
Phase 2: PRD ────────────── /create-prd
    │ Output: tasks/prd.json + tasks/prd-<feature>.md
    │ Gate: user approval
    ↓
Phase 3: PLAN ───────────── superpowers:writing-plans
    │ Output: docs/plans/YYYY-MM-DD-<feature>.md
    │ Gate: user chooses execution mode
    ↓
Phase 3.5: ISOLATE ──────── superpowers:using-git-worktrees
    │ Output: .worktrees/<branch>/, baseline test count
    │ Gate: tests pass in clean worktree
    ↓
Phase 4: EXECUTE ────────── [4 modes, see below]
    │ Gate: quality gate after every batch
    ↓
Phase 5: VERIFY ─────────── superpowers:verification-before-completion
    │ Gate: ALL PRD criteria pass (shell commands)
    ↓
Phase 6: FINISH ─────────── superpowers:finishing-a-development-branch
      Output: merge / PR / keep / discard
```

### 4.2 Four Execution Modes

| Mode | Entry Point | Context Model | Human Checkpoints | Best For |
|------|-------------|---------------|-------------------|----------|
| **4a: Subagent-Driven** | `superpowers:subagent-driven-development` | Fresh subagent per task | None after start | 1-10 tasks, interactive |
| **4b: Executing-Plans** | `superpowers:executing-plans` | Shared session (degrades) | Between batches | Medium plans, oversight needed |
| **4c: Headless** | `scripts/run-plan.sh` | Fresh `claude -p` per batch | None (autonomous) | 5+ batches, overnight |
| **4d: Ralph Loop** | `/ralph-loop` | Same session, iterates | None (until promise) | PRD-driven, open-ended |

Headless mode has 3 sub-modes: `headless` (serial), `team` (parallel groups), `competitive` (stub → becomes MAB).

### 4.3 Where MAB Fits

MAB replaces the competitive stub in headless mode. It sits at the Phase 4 execution layer:

```
Phase 3.5: ISOLATE
    │
    ├── MODE: headless ──── run_mode_headless() ──── serial batches
    ├── MODE: team ──────── run_mode_team() ──────── parallel groups
    ├── MODE: mab ───────── run_mode_mab() ───────── [NEW] two agents, judge picks winner
    │    │
    │    ├── Create worktree A (superpowers-led)
    │    ├── Create worktree B (ralph-led)
    │    ├── Cache-prime shared context
    │    ├── Launch both agents in parallel
    │    ├── Quality gate both
    │    ├── Judge evaluates diffs (randomized order)
    │    ├── Merge winner to main worktree
    │    └── Update strategy-perf.json + mab-lessons.json
    │
    └── MODE: ralph ─────── /ralph-loop ──────────── stop-hook iterations
```

### 4.4 State Files Across the Workflow

| File | Writer | Reader | Lifecycle |
|------|--------|--------|-----------|
| `docs/plans/*-design.md` | brainstorming | writing-plans, code-factory | Permanent |
| `tasks/prd.json` | /create-prd | verification, ralph-loop, run-plan.sh | Updated during execution |
| `docs/plans/*-<feature>.md` | writing-plans | all execution modes | Permanent |
| `.run-plan-state.json` | run-plan-state.sh | --resume, context injection | Per-execution |
| `progress.txt` | run-plan-prompt.sh | cross-batch context injection | Per-execution, append-only |
| `logs/failure-patterns.json` | run-plan-context.sh | batch context injection | Cross-run |
| `logs/sampling-outcomes.json` | run-plan-headless.sh | get_prompt_variants() | Cross-run |
| `logs/strategy-perf.json` | [NEW] run-plan-mab.sh | Thompson Sampling routing | Cross-run |
| `logs/mab-lessons.json` | [NEW] judge agent | batch context injection | Cross-run |
| `AGENTS.md` | run-plan-prompt.sh | agent teams | Per-execution |
| `.claude/ralph-loop.local.md` | setup-ralph-loop.sh | stop-hook.sh | Per-ralph-session |

### 4.5 Quality Gate Enforcement Points

1. **Worktree baseline** (Phase 3.5): Tests must pass before implementation begins
2. **Per-step** (Modes 4a/4b): Plan includes explicit "run test, verify it passes" steps
3. **Inter-batch** (Mode 4c): `run_quality_gate()` after every batch — lesson-check + tests + memory + regression + git clean
4. **Final verification** (Phase 5): ALL PRD criteria as shell commands, lesson-scanner agent
5. **Pre-merge** (Phase 6): Tests must pass before options are presented; re-tested after merge

### 4.6 Stop-Hook / Ralph Loop: MAB Adaptation

The stop-hook mechanism intercepts session exits and re-feeds the prompt. It's inherently single-session, while MAB needs two parallel sessions. However:

**Agent B (ralph-led) naturally fits ralph-loop.** In `run_mode_mab()`, before launching Agent B's `claude -p` call:
1. `cd "$worktree_b"`
2. Run `setup-ralph-loop.sh --completion-promise "ALL PRD CRITERIA PASS" --max-iterations 15`
3. Launch `claude -p` — the stop-hook will iterate Agent B until PRD criteria pass
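The three steps above could be wrapped roughly like this (a sketch only: the `setup-ralph-loop.sh` flags come from the plan, and the prompt path is hypothetical):

```shell
# Sketch: launch Agent B with ralph-loop state set up inside its own worktree.
launch_agent_b() {  # usage: launch_agent_b <worktree_b>
  (
    cd "$1" &&
    ./scripts/setup-ralph-loop.sh \
      --completion-promise "ALL PRD CRITERIA PASS" \
      --max-iterations 15 &&
    claude -p "$(cat scripts/prompts/agent-b-ralph.md)"   # prompt path is illustrative
  )
}
```

The subshell keeps the `cd` from leaking into the rest of `run_mode_mab()`, which also serves the isolation requirement flagged in Issue 8 below.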

Agent A (superpowers-led) terminates naturally after its last batch — no ralph-loop needed.

**Guard needed:** Both `.claude/ralph-loop.local.md` and the stop-hook are relative to `$PWD`. Since each MAB worktree has its own directory, state files are naturally isolated — but only if `cd "$worktree"` runs before `claude -p`. The current design doesn't explicitly `cd` — this must be added.

---

## 5. Latent Issues Found During Workflow Analysis

### Issue 1: State Schema Mismatch (Bug — affects all headless runs)

**File:** `scripts/lib/run-plan-context.sh:25`
**Problem:** `generate_batch_context()` reads `jq '[.batches[].test_count // 0] | max'` but `run-plan-state.sh` stores test counts at `.test_counts` (a flat key-value object), not `.batches[].test_count`.
**Impact:** The test count high-water-mark injected into batch context is always 0. All batches think they're starting from zero tests.
**Fix:** Change to `jq '[.test_counts // {} | to_entries[].value] | max // 0'`
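To see the corrected query against the flat schema (sample data below is made up):

```shell
# Hypothetical state snippet matching run-plan-state.sh's flat .test_counts object
state='{"plan":"demo","test_counts":{"batch-1":12,"batch-2":17}}'

# Corrected high-water-mark query: prints 17 here, and 0 when .test_counts is absent
echo "$state" | jq '[.test_counts // {} | to_entries[].value] | max // 0'
echo '{}'     | jq '[.test_counts // {} | to_entries[].value] | max // 0'
```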

### Issue 2: Judge JSON Extraction Is Fragile

**File:** `mab-run.sh` (planned) `run_judge()` function
**Problem:** `grep -o '{.*}' | head -1` fails on multi-line JSON, which LLM output frequently produces.
**Fix:** Use `python3 -c "import sys,json,re; m=re.search(r'\\{.*\\}', sys.stdin.read(), re.DOTALL); print(m.group(0) if m else '{}')"` or instruct judge prompt to output ONLY JSON and validate with `jq empty`.
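A multi-line-safe extraction helper might look like this (the function name is mine; it also round-trips through `json.loads` so malformed output fails fast):

```shell
# Extract the first {...} block from judge output, tolerating newlines inside it.
# Note: the regex is greedy (first "{" to last "}"), so the judge prompt should
# still be instructed to emit a single JSON object.
extract_judge_json() {
  python3 -c '
import sys, json, re
m = re.search(r"\{.*\}", sys.stdin.read(), re.DOTALL)
print(json.dumps(json.loads(m.group(0))) if m else "{}")
'
}

# Multi-line judge output that would defeat grep -o "{.*}" | head -1
printf 'Verdict:\n{\n  "winner": "agent_a",\n  "confidence": "high"\n}\n' | extract_judge_json
```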

### Issue 3: `--mab` Flag vs `--mode ab` Naming Inconsistency

**File:** MAB plan Batch 3, Tasks 9-10
**Problem:** The plan adds both a `--mab` boolean flag and a `--mode ab` enum value. These are parallel pathways that need reconciliation.
**Fix:** Use one canonical path: `run-plan.sh --mode mab`.

### Issue 4: Planner Agent Has No Caller

**File:** No file — gap in the plan
**Problem:** `scripts/prompts/planner-agent.md` is created in Batch 1 but never called by `auto-compound.sh` or any other script. The routing decision is purely manual.
**Fix:** Wire the planner into `auto-compound.sh` between PRD generation and execution.

### Issue 5: `auto-compound.sh` Bypasses `writing-plans`

**File:** `scripts/auto-compound.sh`
**Problem:** Goes directly from PRD → Ralph loop, skipping plan writing entirely. This means MAB (which supports a superpowers-led strategy that needs a plan) can't be exercised via `auto-compound.sh`.
**Fix:** Document this as intentional for the ralph-only pipeline. Add a `--plan-first` flag for when MAB or superpowers mode is desired.

### Issue 6: `sampling-outcomes.json` vs `strategy-perf.json` Confusion

**Problem:** Both files track win rates — one for prompt variants within a strategy (micro-MAB), one for strategies (macro-MAB). No documentation distinguishes them.
**Fix:** Add comment blocks to the creation code and a section in ARCHITECTURE.md.

### Issue 7: MAB and Ralph Loop Compete for Session State

**Problem:** If a user activates `/ralph-loop` in a worktree that's also running inside `mab-run.sh`, both mechanisms are active simultaneously.
**Fix:** `run_mode_mab()` should write a `.mab-active` sentinel file in its worktrees. The ralph-loop setup should check for this and refuse to activate, or the MAB script should set up ralph-loop state itself (preferred — see Section 4.6).
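A minimal version of that guard, using the `.mab-active` sentinel name from the fix above (the function and its placement inside `setup-ralph-loop.sh` are illustrative):

```shell
# run_mode_mab() side would claim each worktree with:  touch "$worktree/.mab-active"
# setup-ralph-loop.sh side: refuse to activate when the sentinel is present.
guard_ralph_loop() {
  if [ -f .mab-active ]; then
    echo "ralph-loop: refusing to activate, an MAB run owns this worktree" >&2
    return 1
  fi
}
```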

### Issue 8: No Explicit `cd` Before Agent `claude -p` in MAB Worktrees

**Problem:** Each MAB agent's `claude -p` must run in its own worktree directory for proper isolation. The current design doesn't explicitly change directory.
**Fix:** Add a `cd` into the matching worktree (e.g. `cd "$worktree_a" &&` for Agent A) before each `claude -p` invocation in `run_mode_mab()`.

---

## 6. Concrete Recommendations for Revised Plan

### Pre-MAB Fixes (do first)

| # | Fix | Effort | Impact |
|---|-----|--------|--------|
| 1 | Fix state schema mismatch (Issue 1) | 10 min | Fixes all headless runs |
| 2 | Canonical `--mode mab` naming (Issue 3) | 5 min | Prevents naming confusion |

### Phase 1 Architecture (replaces original Batches 1-3)

```
scripts/
├── lib/
│   └── run-plan-mab.sh            # ~250 lines, peer to headless/team
├── prompts/
│   ├── judge-agent.md             # Binary judge: winner + reasoning + SHAs
│   ├── agent-a-superpowers.md     # Superpowers-led batch execution prompt
│   └── agent-b-ralph.md           # Ralph-led iteration prompt
└── run-plan.sh                    # Add --mode mab dispatch
```

**`run-plan-mab.sh` responsibilities:**
1. Create two worktrees from current HEAD
2. Cache-prime shared context (design doc + PRD + codebase summary)
3. Launch both agents in parallel (`claude -p` with `cd "$worktree"`)
4. Wait for both to complete
5. Run quality gate on both
6. Call judge agent with randomized presentation order
7. Merge winner to main worktree
8. Update `logs/strategy-perf.json` and `logs/mab-lessons.json`
9. Clean up loser worktree
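Steps 3-4 might be wired roughly like this (a sketch: prompt paths and log locations are placeholders):

```shell
# Run one agent inside its own worktree; the subshell scopes the cd (see Issue 8).
run_agent() {  # usage: run_agent <worktree> <prompt_file> <log_file>
  ( cd "$1" && claude -p "$(cat "$2")" ) > "$3" 2>&1
}

# Launch both in parallel, then collect exit codes for the quality gate:
#   run_agent "$worktree_a" scripts/prompts/agent-a-superpowers.md logs/agent-a.log & pid_a=$!
#   run_agent "$worktree_b" scripts/prompts/agent-b-ralph.md       logs/agent-b.log & pid_b=$!
#   wait "$pid_a"; status_a=$?
#   wait "$pid_b"; status_b=$?
```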

**Judge agent (Phase 1 — binary):**
```json
{
  "winner": "agent_a|agent_b|draw",
  "confidence": "low|medium|high",
  "reasoning": "2-3 sentences explaining the decision",
  "key_difference": "The specific dimension where agents most differed",
  "sha_a": "abc1234",
  "sha_b": "def5678",
  "presentation_order": "a_first|b_first"
}
```

**Routing (Phase 1 — Thompson Sampling, ~15 lines bash):**
```bash
# Beta-posterior samples per strategy; wins/losses are read from logs/strategy-perf.json
sample_a=$(python3 -c "import random; print(random.betavariate($wins_a+1, $losses_a+1))")
sample_b=$(python3 -c "import random; print(random.betavariate($wins_b+1, $losses_b+1))")
delta=$(python3 -c "print(abs($sample_a - $sample_b))")
if (( $(echo "$delta < 0.10" | bc -l) )); then
  echo "mab"   # Uncertain — run both agents
else
  # Exploit — route to the higher sample
  if (( $(echo "$sample_a > $sample_b" | bc -l) )); then
    echo "superpowers"
  else
    echo "ralph"
  fi
fi
```
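The `$wins_a`/`$losses_a` inputs would be loaded from `logs/strategy-perf.json` before sampling; the exact schema isn't fixed yet, so the layout below is an assumption:

```shell
# Assumed strategy-perf.json shape: one {wins, losses} object per strategy.
perf='{"superpowers":{"wins":7,"losses":3},"ralph":{"wins":5,"losses":5}}'

wins_a=$(echo "$perf"   | jq '.superpowers.wins   // 0')
losses_a=$(echo "$perf" | jq '.superpowers.losses // 0')
wins_b=$(echo "$perf"   | jq '.ralph.wins   // 0')
losses_b=$(echo "$perf" | jq '.ralph.losses // 0')
echo "A: $wins_a/$losses_a  B: $wins_b/$losses_b"
```

The `// 0` defaults mean a first-ever run falls back to a uniform Beta(1,1) prior in the sampler.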

**Early termination rules (from TCEC + Codeforces patterns):**
- If both agents produce identical diffs (>95% similarity): declare draw, skip judge
- If one agent passes all tests and other passes none: auto-declare winner, skip judge
- If both agents fail quality gate: declare no winner, retry batch in headless mode

### Phase 2 Additions (after 10+ runs)

- Judge enrichment: add `failure_mode`, `strategy_update`, `winning_behaviors` fields
- Prompt evolution from judge reasoning (SEW pattern) → `logs/evolved-prompts.json`
- Model variation: `--sample-models "sonnet,opus,haiku"` flag
- Panel judging: two judge calls, different temperatures, majority vote
- Wire planner agent into `auto-compound.sh` for automated routing

### Phase 3 Additions (after 50+ runs, maybe never)

- Strategy archive (ADAS pattern): judge proposes new strategy descriptions
- Hacking mechanism: each agent writes a test case to break the other (Codeforces pattern)
- Community strategy data aggregation
- Semantic lesson dedup via Pinecone

---

## 7. Updated Risk Matrix

| Risk | Likelihood | Impact | Mitigation | Source |
|------|-----------|--------|------------|--------|
| Judge inconsistency (first 10 runs) | High | Medium | Validate first 10 decisions manually; require kappa >= 0.70 | LLM-as-Judge literature |
| Low agent diversity (same outputs) | Medium | High | Monitor diff similarity; add model variation in Phase 2 | Ensemble methods, evolutionary biology |
| 2x compute cost | Certain | Low | Cache priming drops from $10.58 to $1.88/batch; Thompson Sampling reduces MAB frequency | SWE-bench cost data |
| Position bias in judge | High | Medium | Randomize order; log in output; monitor win rates by position | LLM-as-Judge research, Codeforces |
| Rubric gaming (agent optimizes for judge, not correctness) | Low (Phase 1) | High | Proper scoring rule design; hidden test suite for judge | Forecasting tournaments |
| State schema bug produces wrong test counts | Certain (existing) | Medium | Fix before MAB — affects all headless runs today | Workflow analysis |
| JSON extraction breaks on multiline judge output | High | Medium | Use multiline-aware extraction; validate with jq | Workflow analysis |
| Both mechanisms active (ralph-loop + MAB) | Low | Medium | MAB sets up ralph-loop state itself; sentinel file guard | Workflow analysis |
| Draw rate too high (no signal) | Medium | Medium | Pre-screen tasks for discriminating power; early termination rules | TCEC, comp programming |

---


## 8. Updated Success Metrics

| Metric | Phase 1 | Phase 2 | Measurement |
|--------|---------|---------|-------------|
| MAB runs completed | 10 | 50 | Count of `logs/mab-run-*.json` |
| Judge agreement with human | >80% | >90% | Manual review, Cohen's kappa |
| Judge self-consistency | >85% | >90% | Same input, different seed → same winner |
| Position bias | <15% | <10% | Win rate delta by presentation order |
| Agent diversity (diff similarity) | <80% overlap | <70% | Diff intersection / union |
| Cost per MAB batch | <$3.00 | <$2.50 | API billing, logged per run |
| Draw rate | <40% | <25% | Draws / total evaluations |
| Quality gate pass rate (winner) | >80% | >90% | strategy-perf.json aggregate |
| Thompson Sampling convergence | — | Within 15 runs | Cumulative regret vs oracle |
| Prompt evolution yield | — | 1 variant / 5 runs | evolved-prompts.json entries |

---


## 9. Sources

### Round 2 — New Sources

#### Cost & Economics
- [Manage costs effectively — Claude Code Docs](https://code.claude.com/docs/en/costs)
- [Claude API Pricing](https://platform.claude.com/docs/en/about-claude/pricing)
- [SWE-rebench Leaderboard](https://swe-rebench.com) (cost-per-task data)
- [Prompt Caching — Claude API Docs](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Agentic Plan Caching — arxiv 2506.14852](https://arxiv.org/abs/2506.14852)
- [Budget-Constrained MAB — UCL/AAAI 2013](http://www0.cs.ucl.ac.uk/staff/w.zhang/rtb-papers/mab-adx.pdf)
- [Adaptive Budgeted UCB — arxiv 2505.02640](https://arxiv.org/pdf/2505.02640)

#### Testing & Validation
- [Validating LLM-as-a-Judge Under Rating Indeterminacy — CMU ML Blog](https://blog.ml.cmu.edu/2025/12/09/validating-llm-as-a-judge-systems-under-rating-indeterminacy/)
- [Judge's Verdict — arxiv 2510.09738](https://arxiv.org/pdf/2510.09738)
- [Seven Recommendations for Testing in a Non-Deterministic World — CMU SEI](https://www.sei.cmu.edu/blog/seven-recommendations-for-testing-in-a-non-deterministic-world/)
- [Statistical Testing of Stochastic Systems — U. Washington](https://homes.cs.washington.edu/~borning/papers/sevcikova-issta-2006.pdf)
- [Offline Bandit Evaluation — James LeDoux / Udemy](https://jamesrledoux.com/algorithms/offline-bandit-evaluation/)
- [Contextual R Package — Synthetic Bandit Simulation](https://nth-iteration-labs.github.io/contextual/)

#### Cross-Domain Analogies
- [TCEC Rules — Chessdom Wiki](https://wiki.chessdom.org/Rules)
- [Tournament Selection — Wikipedia](https://en.wikipedia.org/wiki/Tournament_selection)
- [Adversarial Collaboration — Kahneman / Edge.org](https://www.edge.org/adversarial-collaboration-daniel-kahneman)
- [Nature: Time for Adversarial Collaboration (2025)](https://www.nature.com/articles/d41586-025-01379-3)
- [Dual Sourcing — Management Science](https://pubsonline.informs.org/doi/10.1287/mnsc.41.8.1317)
- [Brier Score — Wikipedia](https://en.wikipedia.org/wiki/Brier_score)
- [Competitive Programming Judge Systems](https://en.wikipedia.org/wiki/Competitive_programming)
- [Codeforces Contest Rules](https://codeforces.com/blog/entry/4088)
- [Mixture of Experts — Wikipedia](https://en.wikipedia.org/wiki/Mixture_of_experts)
- [Ensemble Diversity — JMLR](https://jmlr.org/papers/volume24/23-0041/23-0041.pdf)
- [Speedrunning Verification](https://en.wikipedia.org/wiki/Speedrun)

### Round 1 Sources (from `2026-02-21-mab-research-report.md`)

See the original report for the full Round 1 source list covering academic MAB+LLM literature, LLM-as-Judge practitioner guides, SEW/ADAS research, SWE-bench analysis, and Notion workspace references.

### Codebase Files Analyzed

- Full skill chain: `skills/{brainstorming,writing-plans,using-git-worktrees,executing-plans,subagent-driven-development,verification-before-completion,finishing-a-development-branch}/SKILL.md`
- Commands: `commands/{code-factory,run-plan,ralph-loop}.md`
- Scripts: `scripts/run-plan.sh`, `scripts/auto-compound.sh`, all 8 `scripts/lib/run-plan-*.sh` modules
- Hooks: `hooks/stop-hook.sh`, `hooks/hooks.json`
- Architecture: `docs/ARCHITECTURE.md`
- MAB design + plan: `docs/plans/2026-02-22-mab-run-{design,plan}.md`