autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,462 @@
1
+ # Multi-Armed Bandit System Design
2
+
3
+ **Date:** 2026-02-22 (updated 2026-02-23)
4
+ **Status:** Approved — updated with research findings
5
+ **Goal:** Competing autonomous agents (superpowers vs ralph-wiggum) execute the same brief using different methodologies, judged by an LLM that extracts lessons and updates strategy performance data. The toolkit gets smarter with every run, and community contributions compound learning for everyone.
6
+
7
+ > ## Research-Driven Updates (2026-02-23)
8
+ >
9
+ > Based on cross-cutting synthesis of 25 research papers, the following changes were made:
10
+ >
11
+ > 1. **Thompson Sampling replaces LLM planner.** The planner agent (Section "Planner Agent") is now a bash function using Beta distribution sampling, not a separate `claude -p` call. Cheaper, faster, better calibrated. (Source: MAB Research R1, cross-cutting synthesis §F)
12
+ >
13
+ > 2. **Human calibration for first 10 decisions.** The judge's verdict is presented to the user for approval/override for the first 10 MAB runs. Only after 10 human-validated decisions does automated routing take over. (Source: cross-cutting synthesis §F — "validate against human judgment")
14
+ >
15
+ > 3. **Selective MAB (~30% of batches).** MAB is not the default mode. It triggers on: integration batches, first-time batch types (insufficient data), and historically flaky batches (>50% retry rate). Single-strategy routing is the default when win rates are clear. (Source: Cost/Quality paper — break-even only if prevents 1 debugging batch per 2 features)
16
+ >
17
+ > 4. **Prerequisites added.** Phase 1 (bug fixes, especially #10 state schema) and Phase 3 (cost tracking, prompt caching) must complete before MAB implementation. Without cost data, MAB economics can't be validated. (Source: cross-cutting synthesis §8)
18
+ >
19
+ > 5. **Plan slimmed from 6 to 4 batches.** Prompts are just files (no code), planner is now a function (not an agent), and community sync is a simple script. The original plan over-scoped. (Source: 80% infrastructure reuse finding from MAB R1)
20
+ >
21
+ > 6. **`{AB_LESSONS}` placeholder bug fixed.** Original plan used `{AB_LESSONS}` in `assemble_prompt()` but data file is `mab-lessons.json`. Changed to `{MAB_LESSONS}`.
22
+ >
23
+ > See updated plan: `docs/plans/2026-02-23-roadmap-to-completion.md` Phase 4.
24
+
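The Thompson Sampling change in item 1 can be illustrated with a minimal sketch. This is not the toolkit's `scripts/lib/thompson-sampling.sh`; the function name `thompson_pick`, its argument order, and the uniform Beta(1, 1) prior are assumptions made for illustration. It draws one Beta sample per strategy from its win/loss counts and routes to the strategy with the larger draw.

```shell
#!/usr/bin/env bash
# Sketch only: thompson_pick is a hypothetical name, not the toolkit's API.
# Usage: thompson_pick <superpowers_wins> <superpowers_losses> <ralph_wins> <ralph_losses> [seed]
thompson_pick() {
    local a_wins="$1" a_losses="$2" b_wins="$3" b_losses="$4" seed="${5:-$RANDOM}"
    awk -v aw="$a_wins" -v al="$a_losses" -v bw="$b_wins" -v bl="$b_losses" -v seed="$seed" '
        # Gamma(k, 1) for integer k >= 1 is the sum of k Exponential(1) draws.
        function gamma_int(k,    i, s) {
            s = 0
            for (i = 0; i < k; i++) s -= log(1 - rand())
            return s
        }
        # Beta(a, b) = X / (X + Y) with X ~ Gamma(a, 1) and Y ~ Gamma(b, 1).
        function beta_int(a, b,    x, y) {
            x = gamma_int(a)
            y = gamma_int(b)
            return x / (x + y)
        }
        BEGIN {
            srand(seed)
            # Assumed Beta(1, 1) prior: add 1 to both wins and losses.
            sa = beta_int(aw + 1, al + 1)
            sb = beta_int(bw + 1, bl + 1)
            pick = (sa >= sb) ? "superpowers" : "ralph"
            print pick
        }'
}
```

Both draws happen inside one `awk` process so that a single seed yields two independent samples. With close counts (for example 12/8 versus 8/12) either strategy can be picked, which is what keeps exploration alive; with lopsided counts the leader wins almost every draw.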
25
+ ## Problem
26
+
27
+ The toolkit has two execution strategies — structured (superpowers skill chain) and autonomous (ralph-wiggum iteration loop) — but no empirical data on which works better for which types of work. Users pick one and hope. The toolkit learns nothing from execution outcomes.
28
+
29
+ ## Design Principles
30
+
31
+ 1. **Thin infrastructure, rich data, LLM intelligence.** Bash scripts create worktrees, run quality gates, merge branches. LLM agents make all decisions (routing, judging, lesson extraction). Data files are the interface between runs.
32
+
33
+ 2. **Both agents are full toolkit citizens.** They inherit all skills, lessons, hooks, quality gates, and CLAUDE.md conventions. The competition is about orchestration strategy, not available tools.
34
+
35
+ 3. **Human input ends at PRD approval.** Brainstorm → design → PRD is human-in-the-loop. Everything after is machine-driven.
36
+
37
+ 4. **Every run produces learning.** MAB lessons, strategy performance data, and failure mode classifications feed back into future runs. Community contributions propagate via git.
38
+
39
+ ## Architecture
40
+
41
+ ```
42
+ PHASE 1 — HUMAN + SINGLE AGENT (shared)
43
+ 1. Brainstorm → approved design doc
44
+ 2. PRD → machine-verifiable acceptance criteria
45
+ 3. Architecture map generated
46
+
47
+ PHASE 2 — PLANNER AGENT (LLM)
48
+ Reads: design doc, PRD, architecture map, strategy-perf.json
49
+ Decides per work unit: MAB or single? Which strategy? Unit size?
50
+
51
+ PHASE 3 — MAB EXECUTION (parallel worktrees)
52
+ Agent A (superpowers): writes own plan, TDD, batch-by-batch
53
+ Agent B (ralph): iterates until PRD criteria pass
54
+
55
+ PHASE 4 — JUDGE AGENT (LLM)
56
+ Reads: both diffs, design doc, PRD, architecture map, lesson history
57
+ Outputs: winner, bidirectional lessons, strategy update, failure mode
58
+
59
+ PHASE 5 — MERGE + LEARN
60
+ Merge winner, log lessons, update strategy data, promote patterns
61
+ ```
62
+
63
+ ## Shared Inputs (Human-Approved)
64
+
65
+ All created before MAB execution begins. Human approves design and PRD.
66
+
67
+ | Artifact | Source | Purpose |
68
+ |----------|--------|---------|
69
+ | Design doc | Brainstorming skill | What to build and why |
70
+ | `tasks/prd.json` | `/create-prd` | Machine-verifiable acceptance criteria |
71
+ | `docs/ARCHITECTURE-MAP.json` | `architecture-map.sh` (auto-generated) | Module graph, imports/exports, dependencies |
72
+
73
+ ## Agent Configuration
74
+
75
+ Both agents receive identical toolkit context. The only difference is the lead instruction.
76
+
77
+ ### What both agents always get
78
+
79
+ ```
80
+ 1. CLAUDE.md chain
81
+ ~/.claude/CLAUDE.md (global)
82
+ ~/Documents/CLAUDE.md (workspace)
83
+ <project>/CLAUDE.md (project)
84
+
85
+ 2. Toolkit enforcement
86
+ hooks/hooks.json (hookify rules)
87
+ quality-gate.sh (lesson-check + tests + memory)
88
+ docs/lessons/*.md (all lessons)
89
+
90
+ 3. Shared brief
91
+ Design doc
92
+ tasks/prd.json
93
+ docs/ARCHITECTURE-MAP.json
94
+ logs/mab-lessons.json (previous MAB learnings)
95
+
96
+ 4. All skills available via Skill tool
97
+ ```
98
+
99
+ ### Agent A (superpowers-led)
100
+
101
+ ```
102
+ Lead instruction:
103
+ "Execute this using the superpowers skill chain.
104
+ Write your own implementation plan first.
105
+ Follow TDD: write test → verify fail → implement → verify pass → commit.
106
+ Run quality gates between logical batches."
107
+
108
+ Natural behavior:
109
+ → writing-plans (creates its own plan from the design doc)
110
+ → executing-plans (batch-by-batch)
111
+ → test-driven-development (per task)
112
+ → verification-before-completion (at end)
113
+ ```
114
+
115
+ ### Agent B (ralph-led)
116
+
117
+ ```
118
+ Lead instruction:
119
+ "Execute this using the ralph-loop approach.
120
+ All PRD acceptance criteria in tasks/prd.json must pass (exit 0).
121
+ Iterate until done. Use any toolkit skills as needed."
122
+
123
+ Natural behavior:
124
+ → Reads PRD criteria
125
+ → Starts coding toward acceptance criteria
126
+ → Uses TDD, debugging, etc. as needed (not mandated order)
127
+ → Stop-hook checks criteria each cycle
128
+ → Done when all criteria pass
129
+ ```
130
+
131
+ ## Worktree Isolation
132
+
133
+ Each MAB run creates two git worktrees branched from HEAD.
134
+
135
+ ```bash
136
+ # Create worktrees
137
+ git worktree add .claude/worktrees/mab-a-batch-N -b mab-a-batch-N HEAD
138
+ git worktree add .claude/worktrees/mab-b-batch-N -b mab-b-batch-N HEAD
139
+
140
+ # After judge picks winner (say A):
141
+ git merge mab-a-batch-N
142
+
143
+ # Cleanup
144
+ git worktree remove .claude/worktrees/mab-a-batch-N
145
+ git worktree remove .claude/worktrees/mab-b-batch-N
146
+ git branch -d mab-a-batch-N && git branch -D mab-b-batch-N  # loser is unmerged, so -d would refuse it
147
+ ```
148
+
149
+ Both agents run in parallel. Neither can see the other's work.
150
+
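One way to run the two agents concurrently and still collect each exit status is sketched below. `run_pair` is a hypothetical helper, and the two command strings stand in for whatever launches each agent inside its own worktree; the real launcher is not shown in this document.

```shell
#!/usr/bin/env bash
# Sketch only: run_pair is a hypothetical helper, not part of the toolkit.
# Usage: run_pair "<command for agent A>" "<command for agent B>"
run_pair() {
    local cmd_a="$1" cmd_b="$2"
    local pid_a pid_b status_a=0 status_b=0

    # Start both commands in the background so they run at the same time.
    bash -c "$cmd_a" &
    pid_a=$!
    bash -c "$cmd_b" &
    pid_b=$!

    # Wait for each PID separately so one failure does not hide the other.
    wait "$pid_a" || status_a=$?
    wait "$pid_b" || status_b=$?

    echo "a=$status_a b=$status_b"
}
```

Waiting on each PID individually, rather than a bare `wait`, preserves both exit codes, which a later step could pass on alongside the quality gate results.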
151
+ ## Planner Agent
152
+
153
+ An LLM agent that decides routing before execution begins. Not a bash script — reads data files and produces a JSON routing plan.
154
+
155
+ ### Inputs
156
+
157
+ - Design doc (scope and complexity)
158
+ - PRD task graph (dependencies, count)
159
+ - `docs/ARCHITECTURE-MAP.json` (cross-module touches)
160
+ - `logs/strategy-perf.json` (historical win rates per strategy x batch type)
161
+
162
+ ### Decision Logic
163
+
164
+ ```
165
+ For each work unit:
166
+ 1. Classify type: new-file, refactoring, integration, test-only
167
+ 2. Check strategy-perf.json for this type
168
+ 3. If clear winner (>70% win rate, 10+ data points): route to winner
169
+ 4. If uncertain or insufficient data: MAB run
170
+ 5. If error-prone type (historically high retry rate): MAB run
171
+ ```
172
+
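The decision steps above can be sketched as a small shell function. `route_unit` and its arguments are hypothetical. The sketch assumes that "10+ data points" means at least 10 decided runs for the batch type, that "error-prone" means a retry rate above 50% (the threshold quoted in the research updates), and that the retry check takes precedence over a clear winner; the source does not state that ordering.

```shell
#!/usr/bin/env bash
# Sketch only: route_unit is a hypothetical helper, not part of the toolkit.
# Usage: route_unit <superpowers_wins> <ralph_wins> [retry_pct]
route_unit() {
    local sp_wins="$1" ralph_wins="$2" retry_pct="${3:-0}"
    local total=$(( sp_wins + ralph_wins ))

    # Step 5 (checked first here, by assumption): flaky batch types get a MAB run.
    if [ "$retry_pct" -gt 50 ]; then
        echo "mab_run"
        return
    fi

    # Step 4: fewer than 10 data points counts as insufficient data.
    if [ "$total" -lt 10 ]; then
        echo "mab_run"
        return
    fi

    # Step 3: a clear winner has strictly more than 70% of the wins.
    # wins * 100 > 70 * total is the integer form of wins / total > 0.70.
    if [ $(( sp_wins * 100 )) -gt $(( 70 * total )) ]; then
        echo "single:superpowers"
    elif [ $(( ralph_wins * 100 )) -gt $(( 70 * total )) ]; then
        echo "single:ralph"
    else
        echo "mab_run"
    fi
}
```

For example, `route_unit 3 11` (11 of 14 wins for ralph) prints `single:ralph`, while `route_unit 12 8` (60% for superpowers) prints `mab_run` because neither strategy clears the 70% bar.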
173
+ ### Output
174
+
175
+ ```json
176
+ {
177
+ "routing": [
178
+ {
179
+ "unit": 1,
180
+ "description": "Create test helpers and validators",
181
+ "type": "new-file",
182
+ "decision": "single",
183
+ "strategy": "ralph",
184
+ "reasoning": "new-file: ralph wins 70%, 15 data points"
185
+ },
186
+ {
187
+ "unit": 2,
188
+ "description": "Integration wiring and CI",
189
+ "type": "integration",
190
+ "decision": "mab_run",
191
+ "reasoning": "integration: superpowers 55%, only 8 data points — need more data"
192
+ }
193
+ ]
194
+ }
195
+ ```
196
+
197
+ ### Work Unit Sizing
198
+
199
+ | Project size | Strategy |
200
+ |-------------|----------|
201
+ | Small (< 5 PRD tasks) | MAB the whole project |
202
+ | Medium (5-15 PRD tasks) | Chunk by PRD dependency groups, route per chunk |
203
+ | Large (> 15 PRD tasks) | Phase 1: MAB (explore), Phase 2+: route to winners (exploit) |
204
+
205
+ ## Judge Agent
206
+
207
+ An LLM agent that evaluates both candidates after execution.
208
+
209
+ ### Inputs
210
+
211
+ ```
212
+ 1. Full plan context: design doc, PRD, architecture map
213
+ 2. Both diffs: git diff main...mab-a-batch-N, git diff main...mab-b-batch-N
214
+ 3. Quality gate results for both
215
+ 4. All previous MAB lessons: logs/mab-lessons.json
216
+ 5. Score from automated scoring (test count, diff size, gate pass)
217
+ ```
218
+
219
+ ### Evaluation Criteria
220
+
221
+ ```
222
+ 1. WINNER SELECTION
223
+ Which implementation better serves the overall architecture?
224
+
225
+ 2. BIDIRECTIONAL LESSONS
226
+ What did the winner do well that the loser should learn from?
227
+ What did the loser do well that the winner should learn from?
228
+
229
+ 3. FAILURE MODE CLASSIFICATION
230
+ How did the weaker submission fall short?
231
+ Categories: over-engineering, under-testing, code-duplication,
232
+ integration-gap, convention-violation, wrong-abstraction-level
233
+
234
+ 4. TOOLKIT COMPLIANCE
235
+ Did both agents follow CLAUDE.md conventions?
236
+ Did both use TDD (regardless of strategy)?
237
+ Did either trigger hookify blocks?
238
+ Did either skip verification?
239
+
240
+ 5. STRATEGY RECOMMENDATION
241
+ For this work unit type, which strategy should be preferred?
242
+ Confidence level (low/medium/high)?
243
+
244
+ 6. LESSON EXTRACTION
245
+ {
246
+ "pattern": "description of what was learned",
247
+ "context": "when this applies (batch type, project type)",
248
+ "recommendation": "what to do differently",
249
+ "source_strategy": "which agent's behavior this came from",
250
+ "lesson_type": "syntactic|semantic"
251
+ }
252
+ ```
253
+
254
+ ### Output
255
+
256
+ ```json
257
+ {
258
+ "winner": "agent_a",
259
+ "confidence": "high",
260
+ "reasoning": "Agent A's implementation separated validation logic into composable functions. Agent B duplicated validation across 3 files.",
261
+ "failure_mode": "code-duplication-under-time-pressure",
262
+ "toolkit_compliance": {
263
+ "agent_a": {"tdd": true, "conventions": true, "hookify_blocks": 0},
264
+ "agent_b": {"tdd": false, "conventions": true, "hookify_blocks": 0}
265
+ },
266
+ "lessons": [
267
+ {
268
+ "pattern": "Extract shared validation patterns before writing per-type validators",
269
+ "context": "new-file batches with 3+ similar validators",
270
+ "recommendation": "Create a shared contract function first, then implement per-type",
271
+ "source_strategy": "agent_a",
272
+ "lesson_type": "semantic"
273
+ }
274
+ ],
275
+ "strategy_update": {
276
+ "batch_type": "new-file",
277
+ "winner": "superpowers",
278
+ "confidence": "medium"
279
+ }
280
+ }
281
+ ```
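A judge verdict of this shape can be checked before it is written to the data files. The following is a minimal sketch, not the toolkit's implementation: the function name `validate_verdict` and the choice of required keys are assumptions drawn from the example output above, and the allowed values come from the judge prompt (`low/medium/high`, `syntactic|semantic`).

```python
# Hypothetical validator for the judge's JSON output (names are illustrative).
VALID_CONFIDENCE = {"low", "medium", "high"}
VALID_LESSON_TYPES = {"syntactic", "semantic"}
REQUIRED_KEYS = {
    "winner", "confidence", "reasoning", "failure_mode",
    "toolkit_compliance", "lessons", "strategy_update",
}


def validate_verdict(verdict):
    """Return a list of human-readable problems; an empty list means the verdict looks valid."""
    errors = []
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if verdict.get("confidence") not in VALID_CONFIDENCE:
        errors.append(f"invalid confidence: {verdict.get('confidence')!r}")
    for index, lesson in enumerate(verdict.get("lessons", [])):
        if lesson.get("lesson_type") not in VALID_LESSON_TYPES:
            errors.append(f"lesson {index}: invalid lesson_type {lesson.get('lesson_type')!r}")
    return errors


good_verdict = {
    "winner": "agent_a",
    "confidence": "high",
    "reasoning": "Agent A separated validation logic into composable functions.",
    "failure_mode": "code-duplication-under-time-pressure",
    "toolkit_compliance": {},
    "lessons": [{"pattern": "p", "lesson_type": "semantic"}],
    "strategy_update": {"batch_type": "new-file", "winner": "superpowers", "confidence": "medium"},
}
```

Rejecting a malformed verdict at this point would keep bad entries out of `logs/mab-lessons.json` and `logs/strategy-perf.json`.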
282
+
283
+ ## Data Files
284
+
285
+ ### `logs/mab-lessons.json` — Accumulated MAB Lessons
286
+
287
+ ```json
288
+ [
289
+ {
290
+ "timestamp": "2026-02-22T15:30:00Z",
291
+ "project": "autonomous-coding-toolkit",
292
+ "work_unit": "validator-suite",
293
+ "batch_type": "new-file",
294
+ "winner": "agent_a",
295
+ "pattern": "Extract shared validation patterns before per-type validators",
296
+ "context": "new-file batches with 3+ similar validators",
297
+ "recommendation": "Create shared contract function first",
298
+ "failure_mode": "code-duplication-under-time-pressure",
299
+ "occurrences": 1
300
+ }
301
+ ]
302
+ ```
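The `occurrences` field suggests that a recurring lesson updates its existing entry rather than adding a new one. A minimal sketch of that merge follows. It assumes lessons are matched on an exact `pattern` string, which the document does not specify, and `record_lesson` is a hypothetical name.

```python
# Hypothetical merge step for logs/mab-lessons.json (matching rule is an assumption).
def record_lesson(lessons, new_lesson):
    """Increment occurrences when the pattern already exists, otherwise append with occurrences=1."""
    for existing in lessons:
        if existing["pattern"] == new_lesson["pattern"]:
            existing["occurrences"] += 1
            return lessons
    lessons.append({**new_lesson, "occurrences": 1})
    return lessons


log = []
entry = {
    "timestamp": "2026-02-22T15:30:00Z",
    "batch_type": "new-file",
    "pattern": "Extract shared validation patterns before per-type validators",
}
record_lesson(log, entry)
record_lesson(log, entry)
```

After the two calls, `log` holds a single entry whose `occurrences` is 2.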
303
+
304
+ ### `logs/strategy-perf.json` — Strategy Win Rates
305
+
306
+ ```json
307
+ {
308
+ "new-file": {
309
+ "superpowers": {"wins": 12, "losses": 8, "total": 20},
310
+ "ralph": {"wins": 8, "losses": 12, "total": 20}
311
+ },
312
+ "refactoring": {
313
+ "superpowers": {"wins": 3, "losses": 11, "total": 14},
314
+ "ralph": {"wins": 11, "losses": 3, "total": 14}
315
+ },
316
+ "integration": {
317
+ "superpowers": {"wins": 9, "losses": 2, "total": 11},
318
+ "ralph": {"wins": 2, "losses": 9, "total": 11}
319
+ },
320
+ "test-only": {
321
+ "superpowers": {"wins": 5, "losses": 7, "total": 12},
322
+ "ralph": {"wins": 7, "losses": 5, "total": 12}
323
+ }
324
+ }
325
+ ```
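Win rates follow directly from these counters as `wins / total`. The sketch below shows one way to derive them and to pick the historically stronger strategy per batch type. The function names are illustrative and are not part of the toolkit, and the sample is a subset of the data above.

```python
import json


def win_rates(perf):
    """Return {batch_type: {strategy: wins / total}}, skipping strategies with no recorded runs."""
    rates = {}
    for batch_type, strategies in perf.items():
        rates[batch_type] = {
            name: stats["wins"] / stats["total"]
            for name, stats in strategies.items()
            if stats["total"] > 0
        }
    return rates


def preferred_strategy(perf, batch_type):
    """Return the strategy with the highest observed win rate, or None when there is no data."""
    rates = win_rates(perf).get(batch_type, {})
    return max(rates, key=rates.get) if rates else None


sample = json.loads("""
{
  "new-file": {
    "superpowers": {"wins": 12, "losses": 8, "total": 20},
    "ralph": {"wins": 8, "losses": 12, "total": 20}
  },
  "refactoring": {
    "superpowers": {"wins": 3, "losses": 11, "total": 14},
    "ralph": {"wins": 11, "losses": 3, "total": 14}
  }
}
""")
```

With this sample, `superpowers` wins 60% of `new-file` batches and `ralph` is preferred for `refactoring`. A real planner would probably also weigh sample size, since small totals give weak evidence.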
326
+
327
+ ### `docs/ARCHITECTURE-MAP.json` — Auto-Generated Module Graph
328
+
329
+ ```json
330
+ {
331
+ "generated_at": "2026-02-22T15:00:00Z",
332
+ "modules": [
333
+ {
334
+ "name": "run-plan",
335
+ "files": ["scripts/run-plan.sh", "scripts/lib/run-plan-*.sh"],
336
+ "exports": ["run_mode_headless", "run_mode_team"],
337
+ "depends_on": ["quality-gate", "lesson-check", "telegram"]
338
+ }
339
+ ]
340
+ }
341
+ ```
342
+
343
+ ## Lesson Lifecycle
344
+
345
+ ```
346
+ MAB judge extracts lesson
347
+ → logs/mab-lessons.json (immediate, local)
348
+
349
+ Pattern recurs 3+ times (same pattern across runs)
350
+ → Auto-promoted to docs/lessons/NNNN-*.md
351
+ → lesson-check.sh enforces syntactic lessons
352
+ → lesson-scanner agent enforces semantic lessons
353
+
354
+ Promoted lesson causes quality gate failure
355
+ → Tagged "disputed" in mab-lessons.json
356
+ → Excluded from injection until human review
357
+
358
+ User runs /submit-lesson
359
+ → PR to upstream autonomous-coding-toolkit repo
360
+ → Maintainer reviews and merges
361
+ → Community users pull via scripts/pull-community-lessons.sh
362
+ ```
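The promotion and dispute rules in this lifecycle can be expressed as two filters over the lesson log. This is a sketch under stated assumptions: the threshold of 3 comes from "3+ times" above, while the boolean `disputed` field and both function names are hypothetical, since the document only says a lesson is tagged "disputed".

```python
# Threshold taken from the lifecycle ("Pattern recurs 3+ times").
PROMOTION_THRESHOLD = 3


def promotion_candidates(lessons):
    """Lessons seen often enough to promote to docs/lessons/, excluding disputed ones."""
    return [
        lesson for lesson in lessons
        if lesson.get("occurrences", 0) >= PROMOTION_THRESHOLD
        and not lesson.get("disputed", False)
    ]


def injectable_lessons(lessons):
    """Lessons eligible for injection into batch context (disputed ones wait for human review)."""
    return [lesson for lesson in lessons if not lesson.get("disputed", False)]


lessons = [
    {"pattern": "shared contract first", "occurrences": 3},
    {"pattern": "seen once", "occurrences": 1},
    {"pattern": "caused gate failure", "occurrences": 5, "disputed": True},
]
```

Here only the first lesson is a promotion candidate, and the disputed lesson is excluded from both lists.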
363
+
364
+ ## Community Propagation
365
+
366
+ ### Contributing Lessons
367
+
368
+ ```bash
369
+ # Existing command — already in the toolkit
370
+ /submit-lesson
371
+
372
+ # Creates PR with:
373
+ # docs/lessons/NNNN-<slug>.md (the lesson)
374
+ # Commit message references the MAB run that produced it
375
+ ```
376
+
377
+ ### Consuming Community Lessons
378
+
379
+ ```bash
380
+ # New script
381
+ scripts/pull-community-lessons.sh
382
+
383
+ # Behavior:
384
+ # git fetch upstream
385
+ # Copy new docs/lessons/*.md files
386
+ # Copy updated logs/strategy-perf.json (community aggregate)
387
+ # lesson-check.sh picks up new lessons automatically
388
+ ```
389
+
390
+ ### Community Strategy Data
391
+
392
+ The upstream `strategy-perf.json` aggregates win/loss data from all contributors. Once contributions are merged upstream, it holds anonymous win/loss counts across all users' projects, so new users start with a community baseline instead of zero data.
393
+
394
+ ### Semantic Search (Pinecone)
395
+
396
+ For a large lesson corpus (100+ lessons):
397
+
398
+ ```
399
+ Before judge extracts a lesson:
400
+ Query Pinecone: "has this pattern been learned before?"
401
+ If match: refine existing lesson instead of creating duplicate
402
+ If no match: create new lesson
403
+ ```
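The match-or-create decision above can be illustrated without a vector database. The sketch below swaps in `difflib.SequenceMatcher` string similarity as a stand-in for the Pinecone query. It is not the Pinecone API, and the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def find_similar(pattern, lessons, threshold=0.8):
    """Return the most similar existing lesson at or above the threshold, else None."""
    best, best_score = None, 0.0
    for lesson in lessons:
        score = SequenceMatcher(None, pattern.lower(), lesson["pattern"].lower()).ratio()
        if score > best_score:
            best, best_score = lesson, score
    return best if best_score >= threshold else None


def refine_or_create(pattern, lessons):
    """Bump the matching lesson if one exists, otherwise add a new lesson."""
    match = find_similar(pattern, lessons)
    if match is not None:
        match["occurrences"] += 1
        return "refined"
    lessons.append({"pattern": pattern, "occurrences": 1})
    return "created"


corpus = [{"pattern": "Extract shared validation patterns before per-type validators", "occurrences": 1}]
```

Character-level similarity only catches near-identical wording. Semantic search would also match paraphrases, which is the reason the document proposes it for a large corpus.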
404
+
405
+ Uses the existing Pinecone MCP integration.
406
+
407
+ ## Infrastructure Scripts
408
+
409
+ ### `scripts/mab-run.sh` — Orchestrator
410
+
411
+ Thin bash script that:
412
+ 1. Creates worktrees
413
+ 2. Launches both agents in parallel (`claude -p` per worktree)
414
+ 3. Runs quality gate on both
415
+ 4. Launches judge agent
416
+ 5. Merges winner
417
+ 6. Cleans up worktrees
418
+ 7. Updates data files
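The worktree and agent-launch steps above can be sketched as a dry-run plan that only builds the commands. The actual orchestrator is a bash script, so this Python version illustrates the sequence rather than its implementation. The branch naming, the `/tmp/mab` base directory, and `plan_commands` are hypothetical. The quality gate, judge, merge, and data-file steps are left out because the document does not give their commands.

```python
from pathlib import Path


def plan_commands(work_unit, prompts, base_dir="/tmp/mab"):
    """Return (cwd, argv) pairs: create one worktree per agent, launch agents, then clean up."""
    steps = []
    worktrees = {}
    for agent in prompts:
        path = str(Path(base_dir) / f"{work_unit}-{agent}")
        worktrees[agent] = path
        # Step 1: one isolated worktree and branch per agent (naming is an assumption).
        steps.append((".", ["git", "worktree", "add", "-b", f"mab/{work_unit}/{agent}", path]))
    for agent, prompt in prompts.items():
        # Step 2: `claude -p` runs inside each worktree; a real run would start these in parallel.
        steps.append((worktrees[agent], ["claude", "-p", prompt]))
    for path in worktrees.values():
        # Step 6: remove the worktrees once the winner has been merged.
        steps.append((".", ["git", "worktree", "remove", path]))
    return steps


plan = plan_commands("validator-suite", {"agent_a": "prompt a", "agent_b": "prompt b"})
```

Building the plan as data first makes it easy to log or test the sequence before anything touches the repository.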
419
+
420
+ ### `scripts/architecture-map.sh` — Module Graph Generator
421
+
422
+ Scans project source files:
423
+ - Python: `import` / `from X import` statements
424
+ - JavaScript/TypeScript: `import` / `require` statements
425
+ - Shell: `source` statements
426
+ - Produces `docs/ARCHITECTURE-MAP.json`
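The scanning described above amounts to extracting dependency statements per language. The Python sketch below covers two of the cases: Python imports, using the standard `ast` module, and shell `source` lines, using a regular expression. The real generator is a bash script whose parsing approach is not given here, so the function names and the regex are assumptions.

```python
import ast
import re


def python_imports(source):
    """Collect module names from `import X` and `from X import ...` statements."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as `from . import x` have no module name and are skipped.
            modules.add(node.module)
    return sorted(modules)


# Matches `source path` and the POSIX `. path` form at the start of a line.
SOURCE_RE = re.compile(r'^\s*(?:source|\.)\s+["\']?([^"\'\s;]+)', re.MULTILINE)


def shell_sources(script):
    """Collect paths referenced by `source` (or `.`) statements in a shell script."""
    return sorted(set(SOURCE_RE.findall(script)))
```

For example, `python_imports("import os\nfrom json import loads\n")` yields `["json", "os"]`. Dynamically built `source` paths, such as those containing variables, are returned verbatim and would need separate resolution.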
427
+
428
+ ### `scripts/pull-community-lessons.sh` — Community Sync
429
+
430
+ Fetches the latest lessons and strategy data from the upstream repo.
431
+
432
+ ### Agent Prompts
433
+
434
+ - `scripts/prompts/planner-agent.md` — routing decision prompt
435
+ - `scripts/prompts/judge-agent.md` — evaluation prompt
436
+ - `scripts/prompts/agent-a-superpowers.md` — superpowers lead instruction
437
+ - `scripts/prompts/agent-b-ralph.md` — ralph lead instruction
438
+
439
+ ## File Summary
440
+
441
+ New files:
442
+ - `scripts/mab-run.sh` — MAB execution orchestrator
443
+ - `scripts/architecture-map.sh` — module graph generator
444
+ - `scripts/pull-community-lessons.sh` — community lesson sync
445
+ - `scripts/prompts/planner-agent.md` — planner prompt
446
+ - `scripts/prompts/judge-agent.md` — judge prompt
447
+ - `scripts/prompts/agent-a-superpowers.md` — Agent A instructions
448
+ - `scripts/prompts/agent-b-ralph.md` — Agent B instructions
449
+ - `scripts/tests/test-mab-run.sh` — MAB orchestrator tests
450
+ - `scripts/tests/test-architecture-map.sh` — map generator tests
451
+ - `docs/plans/2026-02-22-mab-run-design.md` — this document
452
+
453
+ Modified files:
454
+ - `scripts/run-plan.sh` — add `--mab` flag that routes through `mab-run.sh`
455
+ - `scripts/lib/run-plan-context.sh` — inject MAB lessons into batch context
456
+ - `docs/ARCHITECTURE.md` — document MAB system
457
+
458
+ Data files (created at runtime):
459
+ - `logs/mab-lessons.json`
460
+ - `logs/strategy-perf.json`
461
+ - `logs/mab-run-<timestamp>.json`
462
+ - `docs/ARCHITECTURE-MAP.json`