autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,530 @@
1
+ # Autonomous Coding Toolkit — Roadmap to Completion
2
+
3
+ **Date:** 2026-02-23
4
+ **Status:** Draft — awaiting user approval
5
+ **Scope:** Complete roadmap from current state to v1.0 release, informed by 25 research papers, 20 open bugs, and 3 unexecuted designs
6
+
7
+ ---
8
+
9
+ ## Current State Assessment
10
+
11
+ ### What's Shipped (Production-Quality)
12
+
13
+ | Category | Count | Notes |
14
+ |----------|-------|-------|
15
+ | Bash scripts | 34+ | All under 300 lines |
16
+ | Test files | 34 | 369+ assertions, all passing |
17
+ | Quality gate checks | 7 | lesson-check, lint, tests, ast-grep, memory, test count, git clean |
18
+ | Validators | 7 | lessons, skills, commands, plans, prd, plugin, hooks |
19
+ | Lessons | 66 | 6 clusters, YAML frontmatter, syntactic + semantic |
20
+ | Execution modes | 5 | headless, team, competitive (stub), ralph loop, subagent-driven |
21
+ | Skills | 14 | Full pipeline chain + supporting skills |
22
+ | Agents | 1 (in-repo) | lesson-scanner; 6 new designed but in ~/.claude/agents/ |
23
+ | CI pipeline | `make ci` | lint → validate → test |
24
+
25
+ ### What's Designed But Not Implemented
26
+
27
+ | Feature | Design Doc | Plan Doc | Batches | Status |
28
+ |---------|-----------|---------|---------|--------|
29
+ | MAB system | `2026-02-22-mab-run-design.md` | `2026-02-22-mab-run-plan.md` | 6 (26 tasks) | **Needs update** — research found bugs, new prerequisites |
30
+ | Agent suite | `2026-02-23-agent-suite-design.md` | `2026-02-23-agent-suite-plan.md` | 7 (23 tasks) | Batch 1 (lint) done; Batches 2-7 pending |
31
+ | Research phase | `2026-02-22-research-phase-integration.md` | — | ~2 | Design complete, no plan |
32
+ | Roadmap stage | `2026-02-22-research-phase-integration.md` § 3.3 | — | ~1 | Design complete, no plan |
33
+
34
+ ### What's Recommended by Research (No Design Yet)
35
+
36
+ From the cross-cutting synthesis (25 papers, confidence ratings included):
37
+
38
+ | # | Item | Evidence | Effort | Confidence |
39
+ |---|------|----------|--------|------------|
40
+ | 1 | Prompt caching | 83% cost reduction (pricing analysis) | 1-2 days | **High** |
41
+ | 2 | Plan quality scorecard | Plan quality worth 3x execution (SWE-bench Pro, N=1865) | 2-3 days | **High** |
42
+ | 3 | Spec echo-back gate | Spec misunderstanding is 60%+ of failures (SWE-EVO) | 1-2 days | **Medium-High** |
43
+ | 4 | Context restructuring | Lost in the Middle: 20pp accuracy degradation (Liu et al.) | 1 day | **High** |
44
+ | 5 | Lesson scope metadata | 67% false positive rate predicted at 100+ lessons | 2-3 days | **High** |
45
+ | 6 | Fast lane onboarding | 34.7% abandon on difficult setup (N=202 OSS devs) | 1-2 days | **High** |
46
+ | 7 | Per-batch cost tracking | No measured cost data exists — all optimization is guesswork | 1-2 days | **High** |
47
+ | 8 | Structured progress.txt | Freeform text reduces cross-context value | 1 day | **Medium-High** |
48
+ | 9 | Positive policy system | Positive instructions outperform negative for LLMs (NeQA) | 3-5 days | **Medium-High** |
49
+ | 10 | Property-based testing guidance | 50x more mutations found (OOPSLA 2025, 40 projects) | 2-3 days | **High** |
50
+
51
+ ### Open Bugs (20)
52
+
53
+ | Severity | Count | Issues |
54
+ |----------|-------|--------|
55
+ | Medium | 8 | #9, #10, #11, #12, #13, #14, #15, #16 |
56
+ | Low | 12 | #17-#28 |
57
+
58
+ Key clusters:
59
+ - **Sampling** (#16, #27, #28): stash/state issues in parallel patch sampling
60
+ - **Portability** (#17, #18, #23): shebang, grep -P, bash 4.4 compat
61
+ - **Edge cases** (#9, #10, #13, #20, #21, #24): empty/missing state, truncation
62
+ - **Safety** (#11, #12, #19, #22): path escaping, directory restore, glob fragility
63
+
64
+ ---
65
+
66
+ ## Strategic Priorities
67
+
68
+ Ordered by impact per effort, accounting for dependencies:
69
+
70
+ 1. **Fix before building** — The 20 open bugs include a state schema mismatch (#10) that affects all headless runs. Fix bugs first.
71
+ 2. **Pre-execution quality** — Plan quality scorecard, spec echo-back, and context restructuring are the highest-leverage investments per the 3:1 plan-vs-execution ratio.
72
+ 3. **Cost infrastructure** — Prompt caching (83% savings) and per-batch cost tracking are prerequisites for MAB economics to make sense.
73
+ 4. **MAB system** — Updated design, slimmed from 6 to 4 batches based on research findings.
74
+ 5. **Adoption infrastructure** — Fast lane onboarding, lesson scope metadata, README rewrite.
75
+ 6. **Pipeline extensions** — Research phase, roadmap stage, positive policies.
76
+ 7. **Agent suite** — New agents are useful but not blocking; they serve Justin's ecosystem, not the public toolkit.
77
+
78
+ ---
79
+
80
+ ## Phased Roadmap
81
+
82
+ ### Phase 1: Stabilize (Fix What's Broken)
83
+
84
+ **Goal:** Zero known bugs in core pipeline. All existing tests pass. CI green.
85
+ **Effort:** 1-2 sessions
86
+ **Prerequisite for:** Everything else
87
+
88
+ #### Batch 1A: Critical Bugs (Medium Severity)
89
+
90
+ | Issue | Title | Fix |
91
+ |-------|-------|-----|
92
+ | #9 | `complete_batch` called with batch_num='final' crashes jq | Validate batch_num is numeric before `--argjson` |
93
+ | #10 | `get_previous_test_count` returns empty on missing state | Return -1 (unknown), match `extract_test_count` convention |
94
+ | #11 | `batch-test.sh` cd without restore | Use subshell `(cd "$dir" && ...)` or pushd/popd |
95
+ | #12 | `generate-ast-rules.sh` writes to root when --output-dir omitted | Default to `$PWD/scripts/patterns/` |
96
+ | #13 | `entropy-audit.sh` iterates once on empty find | Use `while read` with null check instead of heredoc |
97
+ | #16 | SAMPLE_COUNT persists across batches | Reset SAMPLE_COUNT=0 at top of batch loop |
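Two of the fixes in the table above (#9 and #11) can be sketched as small helpers. This is a minimal illustration, not the toolkit's actual code: the function names `is_numeric_batch` and `run_in_dir` are hypothetical, and the real `complete_batch` and `batch-test.sh` may be structured differently.

```shell
#!/usr/bin/env bash

# Sketch for #9: reject non-numeric batch numbers (such as 'final')
# before they are handed to jq --argjson.
# is_numeric_batch is a hypothetical helper name.
is_numeric_batch() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

# Sketch for #11: run a command in another directory inside a subshell,
# so the caller's working directory is never changed.
# run_in_dir is a hypothetical helper name.
run_in_dir() {
  local dir="$1"
  shift
  ( cd "$dir" && "$@" )
}

is_numeric_batch 3 && echo "3: accepted"
is_numeric_batch final || echo "final: rejected"
```

The subshell form needs no cleanup path: even if the command fails, the parent shell's directory is untouched, which is why it is usually simpler than pairing `pushd` with `popd`.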
98
+
99
+ #### Batch 1B: Low Severity Bugs
100
+
101
+ | Issue | Title | Fix |
102
+ |-------|-------|-----|
103
+ | #14 | `auto-compound.sh` head -c 40 UTF-8 | Use `cut -c1-40` or `${var:0:40}` |
104
+ | #15 | No timeout on routing jq loop | Add `timeout 30` wrapper |
105
+ | #17 | Inconsistent shebangs | `#!/usr/bin/env bash` everywhere |
106
+ | #18 | `grep -P` non-portable | Replace with `grep -E` or `[[ =~ ]]` |
107
+ | #19 | ls -t fragile with spaces | Use `find -printf` or `stat --format` |
108
+ | #20 | `free -g` truncates | Use `free -m` and compare against 4096 |
109
+ | #21 | check_memory fallback '999' | Return -1 (unknown), skip check |
110
+ | #22 | setup-ralph-loop special chars | Quote with `jq --arg` instead of bash substitution |
111
+ | #23 | bash < 4.4 empty array set -u | `"${PASS_ARGS[@]+"${PASS_ARGS[@]}"}"` |
112
+ | #24 | detect_project_type nullglob | Use `compgen -G` or explicit test |
113
+ | #25 | ollama_query no timeout | Add `--connect-timeout 10 --max-time 60` to curl |
114
+ | #26 | validate-plans sed range bug | Fix sed address to stop at next `## Batch` header |
115
+ | #27 | Sampling stash no-op on clean | Check `git stash list` count before/after |
116
+ | #28 | SAMPLE_COUNT reset between batches | Same fix as #16 |
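The portability fixes for #23 and #24 rely on two bash idioms that are easy to get wrong. The sketch below assumes the commonly used unquoted-outer form of the empty-array guard and a hypothetical `has_match` wrapper around `compgen -G`; neither is copied from the toolkit.

```shell
#!/usr/bin/env bash
set -u

# Helper for the demonstration: prints how many arguments it received.
count_args() { echo "$#"; }

# Sketch for #23: expand a possibly empty array under `set -u`.
# On bash < 4.4 a bare "${PASS_ARGS[@]}" fails when the array is empty;
# the ${arr[@]+...} guard expands to nothing in that case.
PASS_ARGS=()
count_args ${PASS_ARGS[@]+"${PASS_ARGS[@]}"}

PASS_ARGS=("--flag" "value with spaces")
count_args ${PASS_ARGS[@]+"${PASS_ARGS[@]}"}

# Sketch for #24: test whether a glob matches anything without depending
# on the nullglob setting. has_match is a hypothetical name.
has_match() { compgen -G "$1" > /dev/null; }
```

A caller could then write `has_match "$project_dir/*.py" && echo python`, which behaves the same whether or not `nullglob` is enabled.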
117
+
118
+ #### Quality Gate
119
+ - `make ci` passes
120
+ - All 20 issues closed
121
+ - No new test regressions
122
+
123
+ ---
124
+
125
+ ### Phase 2: Pre-Execution Quality (Highest Leverage)
126
+
127
+ **Goal:** Implement the three research-backed improvements that address the 3:1 plan-vs-execution quality ratio.
128
+ **Effort:** 1-2 sessions
129
+ **Prerequisite for:** Phase 4 (MAB needs better plans to judge)
130
+
131
+ #### Batch 2A: Context Restructuring
132
+
133
+ **What:** Restructure `build_batch_prompt()` in `run-plan-prompt.sh`:
134
+ 1. Raise `TOKEN_BUDGET_CHARS` from 6000 to 10000
135
+ 2. Place batch task text at the top, requirements/constraints at the bottom
136
+ 3. Wrap sections in XML tags (`<batch_tasks>`, `<prior_progress>`, `<failure_patterns>`, `<referenced_files>`, `<requirements>`)
137
+ 4. Add `<research_warnings>` section from research JSON (when present)
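The ordering and tagging steps above could look roughly like the following. This is a sketch under assumptions: `wrap_section` and `assemble_prompt` are hypothetical names, only three of the listed tags are shown, and the real `build_batch_prompt()` also handles the character budget.

```shell
#!/usr/bin/env bash

# Wrap a section body in an XML-style tag; empty sections are skipped
# so the prompt carries no empty tags.
wrap_section() {
  local tag="$1" body="$2"
  [ -n "$body" ] || return 0
  printf '<%s>\n%s\n</%s>\n' "$tag" "$body" "$tag"
}

# Batch tasks go first and requirements last, keeping both out of the
# middle of the prompt.
assemble_prompt() {
  local tasks="$1" progress="$2" requirements="$3"
  wrap_section batch_tasks "$tasks"
  wrap_section prior_progress "$progress"
  wrap_section requirements "$requirements"
}

assemble_prompt "Task 1: add parser" "" "All tests must pass"
```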
138
+
139
+ **Evidence:** The Lost in the Middle effect (Liu et al.) degrades accuracy by 20pp for information placed mid-context. Anthropic's testing shows up to 30% quality improvement with structured context.

140
+
141
+ **Tests:** Update `test-run-plan-prompt.sh` to verify XML tag presence and section ordering.
142
+
143
+ #### Batch 2B: Plan Quality Scorecard
144
+
145
+ **What:** Create `scripts/validate-plan-quality.sh` scoring 8 dimensions:
146
+
147
+ | Dimension | Check | Weight |
148
+ |-----------|-------|--------|
149
+ | Task granularity | Each task modifies < 100 lines (estimated) | 15% |
150
+ | Spec completeness | Each task has verification command | 20% |
151
+ | Single outcome | No mixed task types per batch | 10% |
152
+ | Dependency ordering | No forward references | 10% |
153
+ | File path specificity | All tasks name exact files | 15% |
154
+ | Acceptance criteria | Each batch has at least one assert | 15% |
155
+ | Batch size | 1-5 tasks per batch | 10% |
156
+ | TDD structure | Test-before-implement pattern | 5% |
157
+
158
+ Returns score 0-100. Gate execution on configurable minimum (default: 60).
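The weights in the table sum to 100, so the score can be computed as a plain weighted sum. The sketch below assumes each dimension is scored pass/fail (1 or 0), which the plan does not specify; `plan_score` and `MIN_SCORE` are hypothetical names.

```shell
#!/usr/bin/env bash

# Weights from the table above, in table order; they sum to 100.
WEIGHTS=(15 20 10 10 15 15 10 5)

# plan_score takes eight 0/1 flags (1 = dimension passed), in table order,
# and prints the weighted total. Pass/fail scoring is an assumption.
plan_score() {
  local total=0 i=0 flag
  for flag in "$@"; do
    total=$(( total + flag * WEIGHTS[i] ))
    i=$(( i + 1 ))
  done
  echo "$total"
}

# Gate on a configurable minimum (default 60, as stated above).
score="$(plan_score 1 1 0 0 1 0 1 0)"
if [ "$score" -ge "${MIN_SCORE:-60}" ]; then
  echo "plan quality OK ($score)"
else
  echo "plan quality too low ($score)"
fi
```

A real scorecard would more likely award partial credit per dimension (for example, the fraction of tasks that name exact files), but the gating logic stays the same.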
159
+
160
+ **Integration:** Wire into `run-plan.sh` before batch loop. Add `--skip-plan-quality` override.
161
+
162
+ **Tests:** Create `test-validate-plan-quality.sh` with sample plans at various quality levels.
163
+
164
+ #### Batch 2C: Specification Echo-Back Gate
165
+
166
+ **What:** Before coding each batch, the agent restates what the batch accomplishes. A lightweight LLM call then compares the restatement against the plan's task description.
167
+
168
+ **Implementation:** Add `echo_back_check()` to `run-plan-headless.sh`:
169
+ 1. First 2 lines of `claude -p` prompt: "Before implementing, restate in one paragraph what this batch must accomplish."
170
+ 2. Extract first paragraph from agent output
171
+ 3. Lightweight `claude -p` call (haiku): "Does this restatement match the original spec? YES/NO + reason"
172
+ 4. If NO → retry with clarified prompt (max 1 retry)
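Steps 2 and 4 are plain text processing and can be sketched without calling the model. The helper names `first_paragraph` and `verdict_is_yes` are hypothetical, and the sketch assumes the judge's reply begins with the literal word YES or NO as the prompt in step 3 requests.

```shell
#!/usr/bin/env bash

# Step 2: print the first paragraph of stdin, skipping leading blank lines
# and stopping at the first blank line after text has been seen.
first_paragraph() {
  awk 'NF { seen = 1; print; next } seen { exit }'
}

# Step 4: treat the reply as a match only when it starts with YES
# (any letter case). Anything else triggers the single retry.
verdict_is_yes() {
  case "$1" in
    [Yy][Ee][Ss]*) return 0 ;;
    *) return 1 ;;
  esac
}

printf '\nThis batch adds the parser.\nIt also wires the CLI.\n\nNow implementing.\n' | first_paragraph
verdict_is_yes "NO - restatement omits the CLI wiring" || echo "retry with clarified prompt"
```

Treating every reply that is not an explicit YES as a mismatch keeps the gate conservative: a garbled or empty judge response leads to a retry instead of a silent pass.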
173
+
174
+ **Evidence:** Specification misunderstanding accounts for 60%+ of failures (SWE-EVO); the echo-back gate targets this class of failure.
175
+
176
+ **Tests:** Test with intentionally mismatched spec/restatement pairs.
177
+
178
+ #### Quality Gate
179
+ - `make ci` passes
180
+ - New validators pass on existing plans
181
+ - Context restructuring doesn't break existing test-run-plan-prompt tests
182
+
183
+ ---
184
+
185
+ ### Phase 3: Cost Infrastructure
186
+
187
+ **Goal:** Enable measured cost data (prerequisite for MAB economics) and implement prompt caching (83% cost reduction).
188
+ **Effort:** 1 session
189
+ **Prerequisite for:** Phase 4 (MAB)
190
+
191
+ #### Batch 3A: Per-Batch Cost Tracking
192
+
193
+ **What:** Track input tokens, output tokens, cache hits, and estimated cost per batch in `.run-plan-state.json`.
194
+
195
+ **Implementation:**
196
+ 1. Parse `claude -p` stderr for token usage (Claude CLI outputs this)
197
+ 2. Add `costs` object to state: `{"batch_N": {"input_tokens": N, "output_tokens": N, "cache_hits": N, "estimated_cost_usd": N}}`
198
+ 3. Add `--show-costs` flag to `pipeline-status.sh`
199
+ 4. Update `run-plan-notify.sh` to include cost in Telegram notifications
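The `costs` entry from step 2 could be produced along these lines. This is a sketch under stated assumptions: `cost_entry` is a hypothetical name, per-million-token prices are supplied by the caller (the values in the example call are placeholders, not real pricing), and the estimate ignores any cache discount.

```shell
#!/usr/bin/env bash

# cost_entry BATCH INPUT_TOKENS OUTPUT_TOKENS CACHE_HITS IN_PRICE OUT_PRICE
# Prices are USD per million tokens, passed in by the caller.
# Prints one JSON object shaped like the `costs` entry described above.
cost_entry() {
  local batch="$1" in_tok="$2" out_tok="$3" cache_hits="$4"
  local in_price="$5" out_price="$6" cost
  # awk handles the floating-point arithmetic that bash lacks.
  cost="$(awk -v it="$in_tok" -v ot="$out_tok" -v ip="$in_price" -v op="$out_price" \
    'BEGIN { printf "%.4f", (it * ip + ot * op) / 1000000 }')"
  printf '{"batch_%s": {"input_tokens": %s, "output_tokens": %s, "cache_hits": %s, "estimated_cost_usd": %s}}\n' \
    "$batch" "$in_tok" "$out_tok" "$cache_hits" "$cost"
}

# Placeholder prices (2 and 10 USD per million tokens) for illustration only.
cost_entry 3 500000 100000 2 2 10
```

In the real state file the object would presumably be merged into `.run-plan-state.json` with `jq`; the sketch prints it so the shape is easy to inspect.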
200
+
201
+ **Tests:** Mock claude -p output with token counts, verify state updates.
202
+
203
+ #### Batch 3B: Prompt Caching Structure
204
+
205
+ **What:** Structure prompts with stable prefix (CLAUDE.md chain, skills, lessons — rarely changes) and variable suffix (batch tasks, context — changes each batch). This enables Anthropic's prompt caching to reuse the prefix across batches.
206
+
207
+ **Implementation:**
208
+ 1. In `build_batch_prompt()`, separate `STABLE_PREFIX` (CLAUDE.md, lessons, conventions) from `VARIABLE_SUFFIX` (batch tasks, context, progress)
209
+ 2. Write stable prefix to a file that `claude -p` reads via `--system-prompt-file` (if supported) or prepend it with a clear separator
210
+ 3. Track cache hit rate in state file
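The "prepend with a clear separator" fallback from step 2 might look like the sketch below. It assumes the stable prefix has already been written to a file; the separator text and the two-argument signature are illustrative, not the real `build_batch_prompt()` interface.

```bash
#!/usr/bin/env bash
# build_batch_prompt <stable_prefix_file> <variable_suffix_text>
# Emits the stable prefix first so consecutive batches share identical leading
# bytes, then a separator, then the per-batch content.
build_batch_prompt() {
  local prefix_file=$1 suffix=$2
  cat "$prefix_file"
  printf '\n===== BATCH CONTEXT (changes every batch) =====\n'
  printf '%s\n' "$suffix"
}
```

Because prompt caching matches on a shared prefix, anything that varies per batch (timestamps, progress notes, task text) should stay below the separator.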
211
+
212
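+ **Evidence:** 83% cost reduction modeled (pricing analysis + cache priming). In the worked example, a 6-batch feature drops from $6.50 to $1.76. That is a 73% drop in total cost, so the 83% figure does not describe this example's total.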
213
+
214
+ **Tests:** Verify prompt structure separates stable/variable. Verify state tracks cache metrics.
215
+
216
+ #### Batch 3C: Structured progress.txt
217
+
218
+ **What:** Replace freeform `progress.txt` with defined sections:
219
+
220
+ ```
221
+ ## Batch N: <title>
222
+ ### Files Modified
223
+ - path/to/file (created|modified|deleted)
224
+
225
+ ### Decisions
226
+ - <decision>: <rationale>
227
+
228
+ ### Issues Encountered
229
+ - <issue> → <resolution>
230
+
231
+ ### State
232
+ - Tests: N passing
233
+ - Duration: Ns
234
+ - Cost: $N.NN
235
+ ```
236
+
237
+ **Tests:** Update `test-run-plan-context.sh` to verify structured parsing.
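One way the structured format could be parsed is sketched below. `progress_section` is a hypothetical helper that prints the bullet items under a given section of a given batch, assuming the headings appear exactly as in the template above.

```bash
#!/usr/bin/env bash
# progress_section <file> <batch_number> <section_name>
# Prints the "- " bullet items (without the marker) of one section in one batch.
progress_section() {
  awk -v batch="$2" -v section="$3" '
    /^## Batch / {
      # Header looks like "## Batch 3: <title>"; keep only the number.
      num = $3; sub(/:.*/, "", num)
      in_batch = (num == batch); in_section = 0; next
    }
    /^### / {
      in_section = (in_batch && substr($0, 5) == section); next
    }
    in_section && /^- / { print substr($0, 3) }
  ' "$1"
}
```

For example, `progress_section progress.txt 2 "Files Modified"` would list the files batch 2 touched, one per line.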
238
+
239
+ #### Quality Gate
240
+ - `make ci` passes
241
+ - Cost tracking produces data on a real 2+ batch run
242
+ - Structured progress.txt parses correctly
243
+
244
+ ---
245
+
246
+ ### Phase 4: Multi-Armed Bandit System (Updated)
247
+
248
+ **Goal:** Implement competing agents with LLM judge, informed by research findings.
249
+ **Effort:** 2-3 sessions
250
+ **Prerequisites:** Phase 1 (bug fixes), Phase 3 (cost tracking, caching)
251
+
252
+ #### Changes from Original Plan
253
+
254
+ The original 6-batch plan needs revision based on research findings:
255
+
256
+ | Original | Change | Reason |
257
+ |----------|--------|--------|
258
+ | LLM planner agent | Replace with Thompson Sampling | Research: Thompson Sampling is cheaper and better calibrated than LLM routing (MAB R1) |
259
+ | 6 batches, 26 tasks | Slim to 4 batches, ~18 tasks | Research: 80% infrastructure exists; prompts are just files; planner is now a function |
260
+ | Judge trusts automated routing | Add human calibration for first 10 decisions | Research: LLM-as-Judge reliability unvalidated (cross-cutting synthesis §F) |
261
+ | Default competitive mode | Selective MAB (~30% of batches) | Research: Cost break-even only if prevents 1 debugging batch per 2 features |
262
+ | `{AB_LESSONS}` placeholder | Fix to `{MAB_LESSONS}` | Bug in original plan: placeholder name doesn't match data file name |
263
+
264
+ #### Batch 4A: Foundation (Prompts + Architecture Map + Data Init)
265
+
266
+ Matches original Batch 1 but simplified:
267
+
268
+ 1. Create 4 prompt files in `scripts/prompts/` (agent-a, agent-b, judge-agent, planner-agent)
269
+ 2. Create `scripts/architecture-map.sh` (scans source for import/source dependencies)
270
+ 3. Tests for architecture-map.sh
271
+ 4. Create `scripts/lib/thompson-sampling.sh` — Beta distribution sampling for strategy routing:
272
+ - `thompson_sample(wins, losses)` → returns sampled value (0-1)
273
+ - `thompson_route(batch_type, strategy_perf_file)` → returns "superpowers" or "ralph" or "mab"
274
+ - Pure bash using `bc` for floating point
275
+ 5. Tests for thompson-sampling.sh
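A rough sketch of the sampling step follows. It swaps `awk` in for the `bc` that the plan calls for, purely to keep the example portable, and it assumes a uniform Beta(1, 1) prior, so it draws from Beta(wins + 1, losses + 1). The `thompson_pick` helper and its `name:wins:losses` argument format are hypothetical stand-ins for `thompson_route`, whose file format is not specified here.

```bash
#!/usr/bin/env bash
# thompson_sample <wins> <losses> [seed]
# Draws from Beta(wins + 1, losses + 1) and prints a value between 0 and 1.
thompson_sample() {
  local wins=$1 losses=$2
  local seed=${3:-$((RANDOM * 32768 + RANDOM))}
  awk -v a="$((wins + 1))" -v b="$((losses + 1))" -v seed="$seed" '
    # Gamma(k, 1) for integer k is the sum of k Exponential(1) draws.
    function gamma_int(k,    i, u, total) {
      total = 0
      for (i = 0; i < k; i++) {
        do { u = rand() } while (u == 0)   # avoid log(0)
        total -= log(u)
      }
      return total
    }
    BEGIN {
      srand(seed)
      x = gamma_int(a); y = gamma_int(b)
      printf "%.6f\n", x / (x + y)         # X / (X + Y) is Beta(a, b) distributed
    }'
}

# thompson_pick <name:wins:losses>...
# Samples every arm once and prints the name with the highest draw.
thompson_pick() {
  local arm name wins losses seed sample best_name="" best_sample="-1"
  for arm in "$@"; do
    IFS=: read -r name wins losses <<< "$arm"
    seed=$((RANDOM * 32768 + RANDOM))      # seeded in the parent shell so draws differ
    sample=$(thompson_sample "$wins" "$losses" "$seed")
    if awk -v s="$sample" -v b="$best_sample" 'BEGIN { exit !(s > b) }'; then
      best_name=$name best_sample=$sample
    fi
  done
  printf '%s\n' "$best_name"
}
```

Arms with little data produce widely spread samples, so they still get picked occasionally. That occasional pick is the exploration behavior an LLM planner would otherwise have to emulate.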
276
+
277
+ #### Batch 4B: MAB Orchestrator (mab-run.sh)
278
+
279
+ Core orchestrator, simplified from original Batch 2:
280
+
281
+ 1. `scripts/mab-run.sh` — argument parsing, data init, worktree management, prompt assembly
282
+ 2. Agent execution (parallel `claude -p` in separate worktrees)
283
+ 3. Quality gate on both agents
284
+ 4. Judge invocation (separate `claude -p` with read-only tools)
285
+ 5. Winner selection (gate override: if only one passes, that one wins regardless of judge)
286
+ 6. Data updates (strategy-perf.json, mab-lessons.json, mab-run-<ts>.json)
287
+ 7. Human calibration mode: for first 10 decisions, present judge verdict to user for approval before merge
288
+ 8. Cleanup (worktree removal)
289
+ 9. Tests for mab-run.sh (dry-run, data init, argument validation)
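The gate override in step 5 can be expressed as a small decision function. This is a sketch with a hypothetical name (`mab_select_winner`); the "none" outcome when both agents fail the gate is an assumption, since the plan does not say what happens in that case.

```bash
#!/usr/bin/env bash
# mab_select_winner <gate_a_exit_code> <gate_b_exit_code> <judge_pick>
# Prints "a", "b", or "none". The judge's pick only matters when both pass.
mab_select_winner() {
  local gate_a=$1 gate_b=$2 judge_pick=$3
  if (( gate_a == 0 && gate_b != 0 )); then
    echo "a"                 # only A passed: A wins regardless of the judge
  elif (( gate_a != 0 && gate_b == 0 )); then
    echo "b"                 # only B passed: B wins regardless of the judge
  elif (( gate_a != 0 && gate_b != 0 )); then
    echo "none"              # assumption: no winner if neither passes
  else
    echo "$judge_pick"       # both passed: defer to the judge
  fi
}
```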
290
+
291
+ #### Batch 4C: Integration (run-plan --mab + context injection)
292
+
293
+ Wire into existing pipeline:
294
+
295
+ 1. Add `--mab` flag to `run-plan.sh`
296
+ 2. Inject MAB lessons into per-batch context (`run-plan-context.sh`)
297
+ 3. Add Thompson Sampling routing call before batch execution (when `--mab` is set)
298
+ 4. Update `pipeline-status.sh` with MAB section
299
+ 5. Tests for CLI flags and context injection
300
+
301
+ #### Batch 4D: Community Sync + Lesson Promotion + Docs
302
+
303
+ 1. `scripts/pull-community-lessons.sh` — fetch lessons from upstream
304
+ 2. `scripts/promote-mab-lessons.sh` — auto-promote patterns with 3+ occurrences
305
+ 3. Update `docs/ARCHITECTURE.md` with MAB section
306
+ 4. Update `CLAUDE.md` with MAB capabilities
307
+ 5. Tests for both scripts
308
+ 6. Run full `make ci`
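The "3+ occurrences" rule in item 2 reduces to a frequency count. The sketch below assumes the patterns have already been extracted to one line each on stdin; how they are pulled out of `mab-lessons.json` is not shown, and `patterns_to_promote` is a hypothetical name.

```bash
#!/usr/bin/env bash
# patterns_to_promote [min_occurrences]   (reads one pattern per line on stdin)
# Prints each pattern seen at least min_occurrences times (default 3).
patterns_to_promote() {
  local min=${1:-3}
  sort | uniq -c | awk -v min="$min" '$1 >= min { sub(/^ *[0-9]+ /, ""); print }'
}
```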
309
+
310
+ #### Quality Gate
311
+ - `make ci` passes
312
+ - `mab-run.sh --dry-run` works end-to-end
313
+ - `architecture-map.sh` produces valid JSON on the toolkit itself
314
+ - Thompson sampling unit tests pass
315
+ - All 20+ previously fixed bugs remain fixed (no regressions)
316
+
317
+ ---
318
+
319
+ ### Phase 5: Adoption & Polish
320
+
321
+ **Goal:** Make the toolkit usable by someone who isn't Justin.
322
+ **Effort:** 1-2 sessions
323
+ **Prerequisites:** Phase 2 (plan quality), Phase 4 (MAB)
324
+
325
+ #### Batch 5A: Lesson Scope Metadata
326
+
327
+ **What:** Add `scope` field to lesson YAML frontmatter:
328
+
329
+ ```yaml
330
+ scope: universal | language:python | language:bash | framework:pytest | domain:ha-aria | project-specific
331
+ ```
332
+
333
+ Update `lesson-check.sh` to:
334
+ 1. Detect project languages from file extensions
335
+ 2. Skip lessons whose scope doesn't match the project
336
+ 3. Add `--all-scopes` flag to override filtering
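A minimal sketch of steps 1 and 2, under stated assumptions: only `.py` and `.sh` extensions are detected, and scopes other than `universal` and `language:*` are never filtered out, because the plan does not define matching rules for them. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# detect_languages <project_root>
# Prints one detected language per line, based on file extensions.
detect_languages() {
  find "$1" -type f \( -name '*.py' -o -name '*.sh' \) 2>/dev/null |
    sed -e 's/.*\.py$/python/' -e 's/.*\.sh$/bash/' | sort -u
}

# scope_applies <scope> <languages, newline-separated>
# Returns 0 if a lesson with this scope should run against the project.
scope_applies() {
  local scope=$1 languages=$2
  case "$scope" in
    universal)  return 0 ;;
    language:*) grep -qx -- "${scope#language:}" <<< "$languages" ;;
    *)          return 0 ;;  # assumption: other scope kinds are not filtered here
  esac
}
```

An `--all-scopes` override would simply skip the `scope_applies` call.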
337
+
338
+ Update all 66 existing lessons with appropriate scope tags.
339
+
340
+ **Evidence:** Without scope, false positive rate hits 67% at ~100 lessons (Zimmermann, 622 predictions).
341
+
342
+ #### Batch 5B: Fast Lane Onboarding
343
+
344
+ **What:**
345
+ 1. Create `examples/quickstart-plan.md` — a 2-batch plan that reaches first quality-gated execution in 3 commands
346
+ 2. Rewrite `README.md` to under 100 lines with progressive disclosure
347
+ 3. Add `Getting Started in 5 Minutes` section with:
348
+ ```bash
349
+ git clone ... && cd autonomous-coding-toolkit
350
+ ./scripts/run-plan.sh examples/quickstart-plan.md --project-root /tmp/quickstart-demo
351
+ # Watch: batch execution → quality gate → test count → DONE
352
+ ```
353
+ 4. Move detailed docs to `docs/` (ARCHITECTURE.md already there)
354
+
355
+ **Evidence:** 34.7% abandon on difficult setup.
356
+
357
+ #### Batch 5C: Expand Lessons to 6 Clusters
358
+
359
+ Add 12 starter lessons for the three new clusters:
360
+
361
+ - **Cluster D (Specification Drift):** 4 lessons — agent misinterprets requirements, builds wrong thing correctly
362
+ - **Cluster E (Context & Retrieval):** 4 lessons — wrong files read, stale context, lost information
363
+ - **Cluster F (Planning & Control Flow):** 4 lessons — wrong decomposition, dependency errors, scope creep
364
+
365
+ Update `docs/lessons/SUMMARY.md` with new clusters.
366
+
367
+ #### Quality Gate
368
+ - `make ci` passes
369
+ - Quickstart demo runs end-to-end in < 5 minutes
370
+ - Lesson scope filtering reduces false matches on non-Python projects
371
+
372
+ ---
373
+
374
+ ### Phase 6: Pipeline Extensions
375
+
376
+ **Goal:** Add research phase and roadmap stage to the pipeline.
377
+ **Effort:** 2-3 sessions
378
+ **Prerequisites:** Phase 2 (context restructuring), Phase 5 (lesson scope)
379
+
380
+ #### Batch 6A: Research Phase (Stage 1.5)
381
+
382
+ Per the design in `2026-02-22-research-phase-integration.md`:
383
+
384
+ 1. Create `skills/research/SKILL.md` — 10-step research protocol
385
+ 2. Create `scripts/research-gate.sh` — blocks PRD if blocking issues unresolved
386
+ 3. Update `scripts/lib/run-plan-context.sh` — inject research warnings
387
+ 4. Update `scripts/auto-compound.sh` — replace Step 2.5 with research phase
388
+ 5. Update `skills/autocode/SKILL.md` — add Stage 1.5
389
+ 6. Tests for research-gate.sh
390
+
391
+ Artifacts produced:
392
+ - `tasks/research-<slug>.md` — human-readable report
393
+ - `tasks/research-<slug>.json` — machine-readable for PRD scoping
394
+
395
+ #### Batch 6B: Roadmap Stage (Stage 0.5)
396
+
397
+ 1. Create `skills/roadmap/SKILL.md` — multi-feature sequencing
398
+ 2. Update `skills/autocode/SKILL.md` — add Stage 0.5
399
+ 3. Create `examples/example-roadmap.md` — sample roadmap
400
+
401
+ #### Batch 6C: Positive Policy System
402
+
403
+ 1. Create `policies/` directory with `universal.md`, `python.md`, `bash.md`, `testing.md`
404
+ 2. Add `positive_alternative` field to lesson YAML template
405
+ 3. Create `scripts/policy-check.sh` — audit mode (advisory, not blocking)
406
+ 4. Update `lesson-check.sh` to read positive alternatives and include in violation messages
407
+ 5. Tests for policy-check.sh
408
+
409
+ **Evidence:** Positive instructions outperform negative for LLMs (NeQA benchmark, Pink Elephant Problem).
410
+
411
+ #### Quality Gate
412
+ - `make ci` passes
413
+ - Research gate blocks on a test file with blocking issues
414
+ - Roadmap skill produces valid artifact
415
+ - Policy check runs without errors on toolkit itself
416
+
417
+ ---
418
+
419
+ ### Phase 7: Agent Suite
420
+
421
+ **Goal:** Ship the 6 new agents and 8 existing agent improvements.
422
+ **Effort:** 1-2 sessions
423
+ **Prerequisites:** Phase 1 (bugs), Phase 2 (lesson-scanner scan groups reference updated lessons)
424
+
425
+ Per the design in `2026-02-23-agent-suite-design.md`:
426
+
427
+ #### Batch 7A: New Agents (6)
428
+
429
+ All placed in `~/.claude/agents/` (global) AND `agents/` (toolkit repo):
430
+
431
+ 1. `bash-expert.md` — review/write/debug bash scripts
432
+ 2. `shell-expert.md` — diagnose systemd/PATH/permissions issues
433
+ 3. `python-expert.md` — async discipline, resource lifecycle, type safety
434
+ 4. `integration-tester.md` — verify cross-service data flows
435
+ 5. `dependency-auditor.md` — CVE/outdated/license scanning (read-only)
436
+ 6. `service-monitor.md` — service/timer health auditing
437
+
438
+ #### Batch 7B: Existing Agent Improvements
439
+
440
+ - P0 (correctness): security-reviewer tools/categories, infra-auditor freshness, lesson-scanner count
440
+ - P1 (quality): model/maxTurns on all agents, doc-updater git diff
441
+ - P2 (capability): lesson-scanner scan groups, notion fallbacks
442
+ - P3 (polish): doc-updater output, counter-daily scope rule
444
+
445
+ #### Quality Gate
446
+ - All 14 agents have valid frontmatter (name, model, tools, maxTurns)
447
+ - `make ci` passes
448
+ - No agent references nonexistent tools
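The frontmatter check in the first gate item could be approximated as below. This is a sketch, not an existing validator: `missing_frontmatter_keys` is a hypothetical name, and it only checks that each required key appears at the start of a line inside the leading `---` block, not that the values are valid.

```bash
#!/usr/bin/env bash
# missing_frontmatter_keys <agent.md>
# Prints each required key that is absent from the YAML frontmatter.
missing_frontmatter_keys() {
  local file=$1 key frontmatter
  # Frontmatter is the block between the first two '---' lines; the file
  # must start with '---' for any frontmatter to be recognized.
  frontmatter=$(awk '
    NR == 1 && !/^---/ { exit }
    /^---[ \t]*$/      { fences++; next }
    fences == 1        { print }
    fences >= 2        { exit }
  ' "$file")
  for key in name model tools maxTurns; do
    grep -q "^${key}:" <<< "$frontmatter" || echo "$key"
  done
}
```

Running it over every file in `agents/` and failing when any output appears would turn the check into a gate.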
449
+
450
+ ---
451
+
452
+ ## Dependency Graph
453
+
454
+ ```
455
+ Phase 1: Stabilize (bug fixes)
456
+
457
+ ├──► Phase 2: Pre-Execution Quality
458
+ │ │
459
+ │ ├──► Phase 4: MAB System ◄── Phase 3: Cost Infrastructure
460
+ │ │ │
461
+ │ │ ├──► Phase 5: Adoption & Polish
462
+ │ │ │
463
+ │ │ └──► Phase 6: Pipeline Extensions
464
+ │ │
465
+ │ └──► Phase 6: Pipeline Extensions
466
+
467
+ └──► Phase 7: Agent Suite (independent, can run in parallel with 2-6)
468
+ ```
469
+
470
+ **Critical path:** 1 → 2 → 3 → 4 → 5
471
+ **Parallel track:** 7 can run anytime after Phase 1
472
+
473
+ ---
474
+
475
+ ## Effort Summary
476
+
477
+ | Phase | Batches | Estimated Sessions | Key Deliverable |
478
+ |-------|---------|-------------------|-----------------|
479
+ | 1: Stabilize | 2 | 1-2 | 20 bugs fixed, CI green |
480
+ | 2: Pre-Execution Quality | 3 | 1-2 | Plan scorecard, echo-back gate, context restructuring |
481
+ | 3: Cost Infrastructure | 3 | 1 | Cost tracking, prompt caching, structured progress |
482
+ | 4: MAB System | 4 | 2-3 | Competing agents, judge, Thompson Sampling, lesson promotion |
483
+ | 5: Adoption & Polish | 3 | 1-2 | Scope metadata, fast lane, 6 clusters |
484
+ | 6: Pipeline Extensions | 3 | 2-3 | Research phase, roadmap stage, positive policies |
485
+ | 7: Agent Suite | 2 | 1-2 | 6 new agents, 8 improvements |
486
+ | **Total** | **20** | **9-15** | **v1.0** |
487
+
488
+ ---
489
+
490
+ ## What "v1.0" Means
491
+
492
+ The toolkit reaches v1.0 when:
493
+
494
+ 1. **Core pipeline works end-to-end** for headless, ralph loop, and MAB modes ✓ (mostly done)
495
+ 2. **Quality gates catch real bugs** with < 20% false positive rate (needs scope metadata)
496
+ 3. **Cost is tracked and optimized** (prompt caching, per-batch cost data)
497
+ 4. **A new user can start in < 5 minutes** (fast lane onboarding)
498
+ 5. **MAB produces measurable learning** (strategy-perf.json with 10+ data points, human-calibrated judge)
499
+ 6. **Research phase produces durable artifacts** (not ephemeral conversation)
500
+ 7. **Zero known bugs in core pipeline** (all 20 issues closed)
501
+ 8. **Documentation is complete** — ARCHITECTURE.md, README, CONTRIBUTING, examples
502
+
503
+ ### What's NOT in v1.0
504
+
505
+ - Multi-language support beyond Python/bash (deferred — no evidence of demand)
506
+ - CI/CD integration (a GitHub Actions workflow exists but has not been tested across repos)
507
+ - Web dashboard (pipeline-status.sh is CLI-only)
508
+ - Pinecone-backed lesson dedup (only needed at 100+ lessons)
509
+ - Agent chains (post-commit audit, service triage, pre-release)
510
+ - Property-based testing integration (guidance only, no automation)
511
+
512
+ ---
513
+
514
+ ## Lean Gate
515
+
516
+ **Hypothesis:** A structured autonomous coding pipeline with quality gates and competing agents produces higher-quality code with fewer debugging cycles than manual Claude Code usage.
517
+
518
+ **MVP:** Phases 1-4 (stabilize + pre-execution quality + cost + MAB). Everything after is optimization.
519
+
520
+ **First 5 users:** Justin (primary), then 4 Claude Code power users from GitHub/Discord who have expressed interest in autonomous execution.
521
+
522
+ **Success metric:** Measured reduction in debugging batches per feature (target: < 1 retry per 5-batch feature, vs current ~2-3).
523
+
524
+ **Pivot trigger:** If MAB shows no win-rate differentiation after 20 features (10 per strategy), downgrade to single-strategy with the lessons system only.
525
+
526
+ ---
527
+
528
+ ## Next Action
529
+
530
+ Start with **Phase 1, Batch 1A** — fix the 7 medium-severity bugs. These affect core functionality (state management, batch execution, sampling) and must be fixed before any new features are built on top.
@@ -0,0 +1,98 @@
1
+ # Design: Headless Module Split
2
+
3
+ **Date:** 2026-02-24
4
+ **Status:** Approved
5
+ **Problem:** `scripts/lib/run-plan-headless.sh` is 681 lines (project limit: 300). Three concerns mixed in one file: echo-back gate, sampling candidates, and batch orchestration.
6
+ **Approach:** Extract two new lib modules. Fix issue #73 (MAB path resolution).
7
+
8
+ ## Extraction 1: Echo-Back Gate
9
+
10
+ ### New file: `scripts/lib/run-plan-echo-back.sh`
11
+
12
+ **Functions moved (verbatim):**
13
+ - `_echo_back_check()` — lightweight keyword-match gate on agent output (lines 19-63)
14
+ - `echo_back_check()` — full spec verification: agent restatement → haiku verdict → retry once (lines 65-163)
15
+
16
+ **Globals (read-only):** `SKIP_ECHO_BACK`, `STRICT_ECHO_BACK`
17
+
18
+ **Interface:** No signature changes. Functions called by name from `run_mode_headless()`.
19
+
20
+ **Source order in `run-plan.sh`:** Add before headless source line:
21
+ ```bash
22
+ source "$SCRIPT_DIR/lib/run-plan-echo-back.sh"
23
+ ```
24
+
25
+ **Test changes:**
26
+ - `test-echo-back.sh`: Change source from `run-plan-headless.sh` to `run-plan-echo-back.sh`
27
+ - `test-run-plan-headless.sh`: 5 tests for `_echo_back_check()` move to `test-echo-back.sh` (or source both modules)
28
+
29
+ **Reuse opportunity:** `run-plan-team.sh` can source this module to add spec verification before team batch groups — implements lesson #61 across execution modes.
30
+
31
+ ## Extraction 2: Sampling Candidates
32
+
33
+ ### New file: `scripts/lib/run-plan-sampling.sh`
34
+
35
+ **New function wrapping extracted code:**
36
+ ```bash
37
+ # run_sampling_candidates <worktree> <plan_file> <batch> <prompt> <quality_gate_cmd>
38
+ # Returns: 0 if winner found (worktree has winner's changes), 1 if no candidate passed
39
+ # Side-effects: writes logs/sampling-outcomes.json, uses patch files in /tmp/
40
+ run_sampling_candidates() { ... }
41
+ ```
42
+
43
+ **Code moved:** Lines 373-494 of current `run_mode_headless()` (the sampling block inside the retry while-loop).
44
+
45
+ **Also extracted:**
46
+ - `check_memory_for_sampling()` — memory guard logic (current lines 354-369), reusable by any mode
47
+
48
+ **Globals (read-only):** `SAMPLE_COUNT`, `SAMPLE_ON_RETRY`, `SAMPLE_ON_CRITICAL`, `SAMPLE_DEFAULT_COUNT`, `SAMPLE_MIN_MEMORY_PER_GB`
49
+
50
+ **Call site in headless:** Replace inline sampling block with:
51
+ ```bash
52
+ if [[ "${SAMPLE_COUNT:-0}" -gt 0 && $attempt -ge 2 ]]; then
53
+ check_memory_for_sampling || SAMPLE_COUNT=0
54
+ if [[ "${SAMPLE_COUNT:-0}" -gt 0 ]]; then
55
+ if run_sampling_candidates "$WORKTREE" "$PLAN_FILE" "$batch" "$prompt" "$QUALITY_GATE_CMD"; then
56
+ batch_passed=true
57
+ break
58
+ fi
59
+ continue
60
+ fi
61
+ fi
62
+ ```
63
+
64
+ **Source order in `run-plan.sh`:** Add before headless:
65
+ ```bash
66
+ source "$SCRIPT_DIR/lib/run-plan-sampling.sh"
67
+ ```
68
+
69
+ **Dependencies:** Requires `run-plan-scoring.sh` (for `score_candidate`, `select_winner`, `classify_batch_type`, `get_prompt_variants`).
70
+
71
+ ## Bug Fix: Issue #73
72
+
73
+ **File:** `scripts/lib/run-plan-headless.sh` line 251
74
+ **Before:** `"$SCRIPT_DIR/../mab-run.sh"`
75
+ **After:** `"$SCRIPT_DIR/mab-run.sh"`
76
+ **Root cause:** `SCRIPT_DIR` resolves to `scripts/` (set in `run-plan.sh` line 14), not to `scripts/lib/`. `$SCRIPT_DIR/../mab-run.sh` therefore points at the repo root, but `mab-run.sh` lives in `scripts/`.
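To make this class of path bug fail loudly, the resolved path can be checked before use. The helper below is a hypothetical sketch, not part of the planned fix, which is only the one-line path change.

```bash
#!/usr/bin/env bash
# resolve_mab_runner <script_dir>
# Prints the path to mab-run.sh, or fails with a message if it is missing.
resolve_mab_runner() {
  local script_dir=$1 candidate
  candidate="$script_dir/mab-run.sh"
  if [[ -f "$candidate" ]]; then
    printf '%s\n' "$candidate"
  else
    echo "mab-run.sh not found at $candidate" >&2
    return 1
  fi
}
```

With `SCRIPT_DIR` pointing at `scripts/`, passing `"$SCRIPT_DIR"` succeeds, while passing `"$SCRIPT_DIR/.."` reproduces the original failure because it looks in the repo root.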
77
+
78
+ ## Resulting Line Counts
79
+
80
+ | Module | Before | After |
81
+ |--------|--------|-------|
82
+ | `run-plan-headless.sh` | 681 | ~416 |
83
+ | `run-plan-echo-back.sh` | (new) | ~145 |
84
+ | `run-plan-sampling.sh` | (new) | ~135 |
85
+
86
+ **Remaining debt:** Headless at ~416 is over the 300-line limit. The remaining bulk is the sequential batch orchestration loop (init → prompt → claude → gate → notify → failure handling). This is inherently sequential — further splitting would create artificial boundaries. Future candidate: retry/escalation logic (~60 lines) if the module grows again.
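The 300-line limit can be checked mechanically. The sketch below is a hypothetical helper, not an existing toolkit script; it lists files whose line count exceeds a given limit.

```bash
#!/usr/bin/env bash
# files_over_limit <limit> <file>...
# Prints "<file> <line_count>" for each file longer than the limit.
files_over_limit() {
  local limit=$1 file lines
  shift
  for file in "$@"; do
    lines=$(( $(wc -l < "$file") ))   # arithmetic expansion strips wc padding
    if (( lines > limit )); then
      printf '%s %d\n' "$file" "$lines"
    fi
  done
}
```

For example, `files_over_limit 300 scripts/lib/*.sh` would surface the remaining headless debt described above.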
87
+
88
+ ## Implementation Order
89
+
90
+ 1. Create `run-plan-echo-back.sh` (move functions, update sources, fix tests)
91
+ 2. Create `run-plan-sampling.sh` (extract + wrap in function, update call site)
92
+ 3. Fix #73 (one-line path change)
93
+ 4. Run full test suite to confirm no regressions
94
+ 5. Commit and close #73
95
+
96
+ ## Risk
97
+
98
+ **Low.** Echo-back extraction is a pure function move with no interface change. Sampling extraction wraps existing code in a function; the only new interface is the 5-parameter signature. Both are covered by existing test files.