autonomous-coding-toolkit 1.0.0

Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,841 @@
1
+ # Design: npm Packaging as a Learning System
2
+
3
+ > **Date:** 2026-02-24
4
+ > **Status:** Approved
5
+ > **Goal:** Package the autonomous-coding-toolkit as a publicly installable npm package that improves with every run, every user, and every failure — not just a tool, but a compounding learning system.
6
+
7
+ ## The Thesis
8
+
9
+ The toolkit's differentiator isn't any single feature — it's that **the system gets better with every run**. Lessons compound, strategy routing learns, quality gates adapt, trust earns autonomy. The packaging must expose the learning loop as a first-class concept:
10
+
11
+ ```
12
+ Every run → telemetry captured
13
+ Every failure → lesson candidate
14
+ Every lesson → community contribution candidate
15
+ Every community contribution → all users improve
16
+ Every improvement → measured by benchmarks
17
+ ```
18
+
19
+ That's how you code better than a human on large projects: not by being smarter on any single batch, but by compounding what is learned over thousands of batches and hundreds of users.
20
+
21
+ ---
22
+
23
+ ## Research Foundation
24
+
25
+ This design is governed by findings from the 25-paper cross-cutting synthesis (`research/2026-02-22-cross-cutting-synthesis.md`). Key findings that drive decisions:
26
+
27
+ | # | Finding | Confidence | Design Impact |
28
+ |---|---------|------------|---------------|
29
+ | 1 | Plan quality worth ~3x execution capability | High | Plan scoring learns which dimensions predict success |
30
+ | 2 | Fresh context per batch is superior to accumulated | High | Core architecture preserved — this is the #1 differentiator |
31
+ | 3 | Prompt caching yields 83% cost reduction | High | Stable prefix structure in prompts |
32
+ | 4 | Lost in the Middle: 20pp accuracy degradation | High | Task top, requirements bottom in context assembly |
33
+ | 5 | Spec misunderstanding is 60%+ of failures for strong models | Medium | Two-tier echo-back gate |
34
+ | 6 | Lesson system covers 30-40% of failure surface | Medium-High | Expand to 6 clusters, add spec drift coverage |
35
+ | 7 | 34.7% abandon on difficult setup | Medium | Fast lane onboarding under 3 minutes |
36
+ | 8 | Positive instructions outperform negative for LLMs | Medium-High | Policy system promoted alongside lessons |
37
+ | 9 | Transferability depends on abstraction level | High | Scope metadata prevents false positive death spiral |
38
+ | 10 | Coordination is #1 multi-agent failure mode (37%) | High | Structured artifacts over chat for agent communication |
39
+ | 11 | Property-based testing finds 50x more mutations | High | Testing guidance in plan skill |
40
+ | 12 | Optimal multi-agent team size is 3-4 | High | Subagent-driven-dev stays within this bound |
41
+ | 13 | No benchmark suite = can't prove improvement | — | Benchmark suite ships with package |
42
+ | 14 | Single-user testing is not testing | — | Federated telemetry across users |
43
+
44
+ ---
45
+
46
+ ## Part 1: Package Structure
47
+
48
+ ### Approach: npm + Claude Code Plugin (dual surface)
49
+
50
+ **npm:** `npm install -g autonomous-coding-toolkit` → `act` CLI on PATH
51
+ **Plugin:** `/install autonomous-coding-toolkit` → skills, commands, agents in Claude Code
52
+
53
+ Both install from the same repo. Nothing moves — we add `package.json` + `bin/act.js` on top of the existing structure.
54
+
55
+ ### Directory Layout (additions in bold)
56
+
57
+ ```
58
+ autonomous-coding-toolkit/
59
+ ├── **package.json** # npm: name, version, bin, files, engines
60
+ ├── **bin/**
61
+ │ └── **act.js** # Node.js CLI router (~150 lines)
62
+ ├── scripts/ # 32 bash scripts (UNCHANGED)
63
+ │ ├── lib/ # 18 modules (UNCHANGED)
64
+ │ ├── prompts/ # 4 agent prompts (UNCHANGED)
65
+ │ ├── patterns/ # 5 ast-grep rules (UNCHANGED)
66
+ │ ├── tests/ # Script tests (UNCHANGED)
67
+ │ └── **init.sh** # Project bootstrapper (~100 lines)
68
+ ├── skills/ # 20 skills (UNCHANGED)
69
+ ├── commands/ # 7 commands (UNCHANGED)
70
+ ├── agents/ # 7 agents (UNCHANGED)
71
+ ├── hooks/ # hooks.json + stop-hook.sh (UNCHANGED)
72
+ ├── policies/ # 4 positive pattern defs (UNCHANGED)
73
+ ├── examples/ # 4 samples (UNCHANGED)
74
+ ├── **benchmarks/** # 5 reproducible benchmark tasks
75
+ │ ├── **tasks/** # Task definitions + reference implementations
76
+ │ ├── **rubrics/** # Machine-scored evaluation rubrics
77
+ │ └── **runner.sh** # Benchmark orchestrator
78
+ ├── docs/
79
+ │ ├── ARCHITECTURE.md # System design
80
+ │ ├── CONTRIBUTING.md # Lesson submission guide
81
+ │ └── lessons/ # 79 lessons + framework (BUNDLED)
82
+ ├── .claude-plugin/ # Plugin metadata (UNCHANGED)
83
+ ├── .github/ # CI (UNCHANGED)
84
+ ├── Makefile # lint, test, validate, ci
85
+ ├── SECURITY.md
86
+ ├── README.md
87
+ └── .gitignore
88
+ ```
89
+
90
+ ### package.json
91
+
92
+ ```json
93
+ {
94
+ "name": "autonomous-coding-toolkit",
95
+ "version": "1.0.0",
96
+ "description": "Autonomous AI coding pipeline: quality gates, fresh-context execution, community lessons, and compounding learning",
97
+ "license": "MIT",
98
+ "author": "Justin McFarland <parthalon025@gmail.com>",
99
+ "homepage": "https://github.com/parthalon025/autonomous-coding-toolkit",
100
+ "repository": "https://github.com/parthalon025/autonomous-coding-toolkit",
101
+ "bin": {
102
+ "act": "./bin/act.js"
103
+ },
104
+ "files": [
105
+ "bin/",
106
+ "scripts/",
107
+ "skills/",
108
+ "commands/",
109
+ "agents/",
110
+ "hooks/",
111
+ "policies/",
112
+ "examples/",
113
+ "benchmarks/",
114
+ "docs/",
115
+ ".claude-plugin/",
116
+ "Makefile",
117
+ "SECURITY.md"
118
+ ],
119
+ "engines": {
120
+ "node": ">=18.0.0"
121
+ },
122
+ "os": ["linux", "darwin", "win32"],
123
+ "keywords": [
124
+ "autonomous-coding", "ai-agents", "quality-gates",
125
+ "claude-code", "tdd", "lessons-learned", "headless",
126
+ "multi-armed-bandit", "code-review", "pipeline"
127
+ ]
128
+ }
129
+ ```
130
+
131
+ **Note:** The `files` field is an allowlist, so runtime state (`logs/`, `.run-plan-state.json`, `progress.txt`, `.worktrees/`) is left out of the published package simply by not being listed. These paths are project-local, not distributable.
132
+
133
+ ### Windows Support
134
+
135
+ Scripts are bash. Windows users require WSL (Windows Subsystem for Linux). `bin/act.js` checks for bash availability at startup and prints a WSL installation hint if missing. Claude Code users on Windows already have WSL as a practical requirement.
136
+
137
+ ---
138
+
139
+ ## Part 2: CLI Surface
140
+
141
+ ### bin/act.js — Node.js Router (~150 lines)
142
+
143
+ Responsibilities:
144
+ 1. **Platform check** — verify `bash` available, WSL hint on Windows
145
+ 2. **Subcommand routing** — dispatch to correct bash script
146
+ 3. **Toolkit root resolution** — `path.resolve(__dirname, '..')` (works for npm global, npx, and local clone)
147
+ 4. **Pass-through** — all args forwarded, exit codes preserved
148
+ 5. **Version/help** — built-in, no bash needed
149
+
150
+ ### Full Command Map
151
+
152
+ #### Execution
153
+
154
+ | Command | Script | Purpose |
155
+ |---------|--------|---------|
156
+ | `act plan <file> [flags]` | `run-plan.sh` | Headless/team/MAB batch execution |
157
+ | `act plan --resume` | `run-plan.sh --resume` | Resume interrupted execution |
158
+ | `act compound [dir] [flags]` | `auto-compound.sh` | Full pipeline: report→PRD→execute→PR |
159
+ | `act mab <flags>` | `mab-run.sh` | Multi-Armed Bandit competing agents |
160
+
161
+ #### Quality
162
+
163
+ | Command | Script | Purpose |
164
+ |---------|--------|---------|
165
+ | `act gate [flags]` | `quality-gate.sh` | Composite quality gate |
166
+ | `act check [files...]` | `lesson-check.sh` | Syntactic anti-pattern scan |
167
+ | `act policy [flags]` | `policy-check.sh` | Advisory positive-pattern check |
168
+ | `act research-gate <json>` | `research-gate.sh` | Validate research completeness |
169
+ | `act validate` | `validate-all.sh` | Toolkit self-validation |
170
+ | `act validate-plan <file>` | `validate-plan-quality.sh` | Score plan quality (8 dimensions) |
171
+ | `act validate-prd [file]` | `validate-prd.sh` | Validate PRD JSON structure |
172
+
173
+ #### Lessons
174
+
175
+ | Command | Script | Purpose |
176
+ |---------|--------|---------|
177
+ | `act lessons pull [--remote]` | `pull-community-lessons.sh` | Sync community lessons + strategy data |
178
+ | `act lessons check` | `lesson-check.sh --list` | List active lesson checks |
179
+ | `act lessons promote` | `promote-mab-lessons.sh` | Auto-promote MAB patterns |
180
+ | `act lessons infer [--apply]` | `scope-infer.sh` | Infer scope tags for lessons |
181
+
182
+ #### Analysis
183
+
184
+ | Command | Script | Purpose |
185
+ |---------|--------|---------|
186
+ | `act audit [flags]` | `entropy-audit.sh` | Doc drift & naming violations |
187
+ | `act batch-audit <dir>` | `batch-audit.sh` | Cross-project audit |
188
+ | `act batch-test <dir>` | `batch-test.sh` | Memory-aware cross-project tests |
189
+ | `act analyze <report>` | `analyze-report.sh` | Extract priority from report |
190
+ | `act digest <log>` | `failure-digest.sh` | Summarize failure patterns |
191
+ | `act status [dir]` | `pipeline-status.sh` | Pipeline health check |
192
+ | `act architecture [dir]` | `architecture-map.sh` | Generate architecture diagram |
193
+
194
+ #### Telemetry (NEW)
195
+
196
+ | Command | Script | Purpose |
197
+ |---------|--------|---------|
198
+ | `act telemetry show` | `telemetry.sh show` | Dashboard: success rate, cost, lesson hits |
199
+ | `act telemetry export` | `telemetry.sh export` | Export anonymized run data |
200
+ | `act telemetry import <file>` | `telemetry.sh import` | Import community aggregate data |
201
+ | `act telemetry reset` | `telemetry.sh reset` | Clear local telemetry |
202
+
203
+ #### Benchmarks (NEW)
204
+
205
+ | Command | Script | Purpose |
206
+ |---------|--------|---------|
207
+ | `act benchmark run` | `benchmarks/runner.sh` | Execute all 5 benchmark tasks |
208
+ | `act benchmark run <name>` | `benchmarks/runner.sh <name>` | Execute single benchmark |
209
+ | `act benchmark compare <a> <b>` | `benchmarks/runner.sh compare` | Compare two benchmark results |
210
+
211
+ #### Setup
212
+
213
+ | Command | Script | Purpose |
214
+ |---------|--------|---------|
215
+ | `act init` | `init.sh` | Bootstrap project for toolkit use |
216
+ | `act init --quickstart` | `init.sh --quickstart` | Fast lane: working example in <3 min |
217
+ | `act license-check` | `license-check.sh` | GPL/AGPL dependency audit |
218
+ | `act module-size` | `module-size-check.sh` | Detect oversized modules |
219
+
220
+ #### Meta
221
+
222
+ | Command | Purpose |
223
+ |---------|---------|
224
+ | `act version` | Print version (from package.json) |
225
+ | `act help [command]` | Show help for any command |
226
+
227
+ ---
228
+
229
+ ## Part 3: Two Install Paths
230
+
231
+ ### Path A: npm (CLI scripts)
232
+
233
+ ```bash
234
+ npm install -g autonomous-coding-toolkit
235
+ # Now: act plan, act gate, act check, act telemetry, etc. on PATH
236
+ ```
237
+
238
+ Or zero-install:
239
+ ```bash
240
+ npx autonomous-coding-toolkit gate --project-root .
241
+ ```
242
+
243
+ ### Path B: Claude Code Plugin (skills/commands/agents)
244
+
245
+ ```bash
246
+ # From Claude Code:
247
+ /install autonomous-coding-toolkit
248
+ # Now: /autocode, /create-prd, /run-plan, /ralph-loop, etc. available
249
+ ```
250
+
251
+ **Both paths install from the same repo/package.** Users who install both get the full experience:
252
+ - npm → CLI scripts for headless, CI, and standalone use
253
+ - Plugin → skills, commands, agents for interactive Claude Code sessions
254
+
255
+ ### Entry Points
256
+
257
+ | User wants to... | Entry point |
258
+ |-------------------|-------------|
259
+ | Start a new feature from scratch | `/autocode <feature>` (Claude Code) |
260
+ | Start from an existing plan | `act plan <file>` (CLI) or `/run-plan` (Claude Code) |
261
+ | Jump into a roadmap mid-stream | `act plan <file> --start-batch N` or `act plan --resume` |
262
+ | Quick quality check | `act gate --project-root .` (CLI) |
263
+ | See how the system is performing | `act telemetry show` (CLI) |
264
+ | Validate before shipping | `act benchmark run` (CLI) |
265
+ | Bootstrap a new project | `act init --quickstart` (CLI) |
266
+
267
+ ---
268
+
269
+ ## Part 4: Seven Strategic Improvements
270
+
271
+ These improvements transform the toolkit from a tool into a learning system.
272
+
273
+ ### Improvement 1: Telemetry — Measure Before Optimizing
274
+
275
+ **Principle:** You can't improve what you don't measure. The research says "the first measurement infrastructure should precede the first optimization."
276
+
277
+ **Data captured per batch (local, opt-in for sharing):**
278
+
279
+ ```json
280
+ {
281
+ "timestamp": "2026-02-24T14:30:00Z",
282
+ "project_type": "python",
283
+ "batch_type": "integration",
284
+ "batch_number": 3,
285
+ "attempt": 1,
286
+ "passed_gate": true,
287
+ "gate_failures": [],
288
+ "lessons_triggered": ["0007", "0033"],
289
+ "lessons_true_positive": ["0007"],
290
+ "test_count_delta": 12,
291
+ "duration_seconds": 180,
292
+ "cost_usd": 0.42,
293
+ "strategy": "superpowers",
294
+ "plan_quality_score": 78,
295
+ "echo_back_passed": true,
296
+ "trust_score": 73
297
+ }
298
+ ```
299
+
300
+ **Storage:** `logs/telemetry.jsonl` (append-only, one line per batch). Project-local, never committed.
301
+
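A minimal sketch of the append-only write, assuming a hypothetical `append_telemetry` helper and a `TELEMETRY_FILE` override (neither name comes from the toolkit). Only a few of the fields shown above are written, to keep the example short:

```shell
set -eu

# Hypothetical helper: append one batch record as a single JSON line.
# Args: batch_number passed_gate(true|false) duration_seconds
append_telemetry() {
  local file="${TELEMETRY_FILE:-logs/telemetry.jsonl}"
  mkdir -p "$(dirname "$file")"
  # ">>" appends and never truncates, so earlier records are preserved.
  printf '{"timestamp":"%s","batch_number":%d,"passed_gate":%s,"duration_seconds":%d}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" >> "$file"
}
```

Because each record occupies exactly one line, `wc -l logs/telemetry.jsonl` gives the batch count directly.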
302
+ **Dashboard (`act telemetry show`):**
303
+ ```
304
+ Autonomous Coding Toolkit — Telemetry Dashboard
305
+ ════════════════════════════════════════════════
306
+
307
+ Runs: 47 batches across 8 plans
308
+ Success rate: 89% (42/47 passed gate on first attempt)
309
+ Total cost: $19.83 ($0.42/batch average)
310
+ Total time: 2.4 hours
311
+
312
+ Strategy Performance:
313
+ superpowers: 78% win rate (28 runs)
314
+ ralph: 65% win rate (19 runs)
315
+
316
+ Top Lesson Hits:
317
+ #0007 bare-except: 12 hits, 11 true positives (92%)
318
+ #0033 sqlite-closing: 3 hits, 3 true positives (100%)
319
+ #0045 hub-cache: 8 hits, 0 true positives (0%) ← retirement candidate
320
+
321
+ Batch Type Success:
322
+ new-file: 95% (19/20)
323
+ test-only: 100% (8/8)
324
+ refactoring: 83% (10/12)
325
+ integration: 71% (5/7) ← lowest, consider MAB for this type
326
+ ```
327
+
328
+ **Export/import for community learning:**
329
+ - `act telemetry export` → anonymized JSON (no file paths, no project names, no code)
330
+ - `act telemetry import community-aggregate.json` → merges into local strategy routing
331
+ - Community aggregate published periodically to toolkit repo (opt-in contributions)
332
+
333
+ ### Improvement 2: Federated Learning for Strategy Routing
334
+
335
+ **Principle:** 100 users learning independently is 100x slower than learning together. Strategy performance should compound across the community.
336
+
337
+ **Current state:** `strategy-perf.json` is per-install. `pull-community-lessons.sh` already merges it with `max(local, remote)` per counter.
338
+
339
+ **Improvement:** Extend the pull mechanism to also merge:
340
+ - Anonymized strategy-perf data from community aggregate
341
+ - Lesson hit rate statistics (which lessons actually catch bugs)
342
+ - Batch-type success rates per strategy
343
+
344
+ **Merge strategy (already implemented, extend):**
345
+ - `max(local, remote)` per counter for win/loss data
346
+ - Weighted average for rates (weight = sample size)
347
+ - Never overwrite local data — additive merge only
348
+
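The two merge rules above can be sketched as follows. The function names are hypothetical, and rates are assumed to be percentages with the sample size passed alongside each one:

```shell
set -eu

# max(local, remote) for a single win/loss counter.
# Args: local_count remote_count
merge_counter() {
  if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Sample-size-weighted average of two rates.
# Args: local_rate local_n remote_rate remote_n
merge_rate() {
  awk -v l_rate="$1" -v l_n="$2" -v r_rate="$3" -v r_n="$4" 'BEGIN {
    total = l_n + r_n
    if (total == 0) { print "0.0"; exit }
    printf "%.1f\n", (l_rate * l_n + r_rate * r_n) / total
  }'
}
```

For example, a local rate of 80% over 10 runs merged with a community rate of 60% over 30 runs gives `merge_rate 80 10 60 30`, which prints `65.0`: the larger sample pulls the result toward its own rate.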
349
+ **Effect on routing:** Thompson Sampling in `lib/thompson-sampling.sh` starts with community priors instead of uniform priors. A new user benefits from the collective experience of all previous users from their first run.
350
+
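To illustrate the difference a community prior makes, the sketch below assumes a Beta-Bernoulli model, the usual formulation of Thompson Sampling. Whether `lib/thompson-sampling.sh` uses exactly this parameterization is an assumption, and `prior_mean` is a hypothetical helper. It computes only the prior's expected win rate, not a random sample from it:

```shell
set -eu

# Expected win rate under a Beta(1 + wins, 1 + losses) prior.
# With no community data (0 wins, 0 losses) this is the uniform prior, mean 0.5.
# Args: community_wins community_losses
prior_mean() {
  awk -v wins="$1" -v losses="$2" 'BEGIN {
    alpha = 1 + wins
    beta = 1 + losses
    printf "%.3f\n", alpha / (alpha + beta)
  }'
}
```

`prior_mean 0 0` prints `0.500`, while `prior_mean 28 8` prints `0.763`, so a new install seeded with community counts would already favor the better-performing strategy on its first run.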
351
+ ### Improvement 3: Adaptive Quality Gates
352
+
353
+ **Principle:** The immune system amplifies what works and retires what doesn't (biological analogy from research #B2-3). Quality gates should do the same.
354
+
355
+ **Current state:** Gate pipeline is static: lesson-check → ast-grep → tests → memory → test count → git clean.
356
+
357
+ **Improvement:** Track lesson effectiveness from telemetry:
358
+
359
+ | Metric | Threshold | Action |
360
+ |--------|-----------|--------|
361
+ | True positive rate > 80% | After 20+ triggers | Promote to "high-value" (always first in pipeline) |
362
+ | True positive rate 20-80% | After 20+ triggers | Normal (current behavior) |
363
+ | True positive rate < 20% | After 50+ triggers | Downgrade to advisory (warn, don't block) |
364
+ | Zero triggers | After 100+ scans | Flag as retirement candidate |
365
+
366
+ **Implementation:** `lesson-check.sh` reads `logs/telemetry.jsonl` to compute lesson effectiveness. Lessons flagged as retirement candidates appear in `act telemetry show` for manual review. No lesson is auto-deleted — only downgraded to advisory.
367
+
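The threshold table can be read as a small classifier. This is a sketch only: `classify_lesson` and its return labels are hypothetical, and it takes pre-computed counts instead of parsing `logs/telemetry.jsonl`:

```shell
set -eu

# Classify a lesson from its counts.
# Args: triggers true_positives scans
classify_lesson() {
  local triggers="$1" tp="$2" scans="$3"
  if [ "$triggers" -eq 0 ]; then
    if [ "$scans" -ge 100 ]; then echo "retirement-candidate"; else echo "normal"; fi
    return 0
  fi
  # Integer form of the rate checks: tp/triggers > 80%  <=>  tp*100 > triggers*80
  if [ "$triggers" -ge 20 ] && [ $((tp * 100)) -gt $((triggers * 80)) ]; then
    echo "high-value"
  elif [ "$triggers" -ge 50 ] && [ $((tp * 100)) -lt $((triggers * 20)) ]; then
    echo "advisory"
  else
    echo "normal"
  fi
}
```

A lesson that has not yet reached the minimum trigger count for a row stays `normal`, which matches the table's "After N+ triggers" qualifiers.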
368
+ **Why not auto-delete:** A lesson with zero hits might be preventing bugs by its mere presence in the system (developers read lessons and avoid the pattern). Retirement requires human judgment.
369
+
370
+ ### Improvement 4: Semantic Echo-Back
371
+
372
+ **Principle:** Spec misunderstanding accounts for 60%+ of failures for strong models (#B1-5). Keyword matching catches omissions but not misinterpretation. A human reviewer asks "do you understand what I'm asking?" before "did you do it right?"
373
+
374
+ **Current state:** `run-plan-echo-back.sh` does keyword matching — checks whether key terms from batch text appear in agent output.
375
+
376
+ **Improvement:** Two-tier echo-back:
377
+
378
+ **Tier 1 (current, every batch):** Keyword match — fast (<1s), catches obvious omissions.
379
+
380
+ **Tier 2 (new, selective):** LLM verification — agent summarizes what it will build, separate `claude -p` call compares summary vs. spec, flags misalignment.
381
+
382
+ **When Tier 2 activates:**
383
+ - Always on Batch 1 of any plan (disproportionate risk — research #B2-3, #P9)
384
+ - Always on integration batches (highest failure rate from telemetry)
385
+ - When `--strict-echo-back` flag is set
386
+ - MAB can learn whether Tier 2 prevents enough rework to justify cost (~$0.10/batch)
387
+
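The three deterministic activation rules can be sketched as a predicate. `should_run_tier2` is a hypothetical name, and the MAB-learned rule is left out because it depends on routing data not described here:

```shell
set -eu

# Exit 0 if Tier 2 should run for this batch, 1 otherwise.
# Args: batch_number batch_type [strict_echo_back(true|false)]
should_run_tier2() {
  local batch_number="$1" batch_type="$2" strict="${3:-false}"
  if [ "$batch_number" -eq 1 ]; then return 0; fi          # always on Batch 1
  if [ "$batch_type" = "integration" ]; then return 0; fi  # always on integration batches
  if [ "$strict" = "true" ]; then return 0; fi             # --strict-echo-back
  return 1
}
```

A caller could then write `if should_run_tier2 "$n" "$type" "$strict"; then ...; fi` and fall back to Tier 1 alone otherwise.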
388
+ **Tier 2 prompt structure:**
389
+ ```
390
+ You are a specification compliance reviewer. Compare:
391
+
392
+ SPECIFICATION:
393
+ <batch task text from plan>
394
+
395
+ AGENT'S UNDERSTANDING:
396
+ <agent's summary of what it will build>
397
+
398
+ Does the agent's understanding match the specification? Flag any:
399
+ - Missing requirements
400
+ - Added requirements not in spec
401
+ - Misinterpreted requirements
402
+ - Ambiguous interpretations
403
+
404
+ Output: PASS or FAIL with specific misalignments.
405
+ ```
406
+
407
+ ### Improvement 5: Fast Lane Onboarding
408
+
409
+ **Principle:** 34.7% of users abandon a tool when setup is difficult (#B2-1). A dead user gets zero benefit from perfect process. Time to first value must be under 3 minutes.
410
+
411
+ **`act init` (standard):**
412
+ 1. Detect project type (Python/Node/bash/Make/unknown)
413
+ 2. Create `tasks/` directory
414
+ 3. Create empty `progress.txt`
415
+ 4. Append Code Factory section to CLAUDE.md (or create minimal CLAUDE.md)
416
+ 5. Set quality gate command based on project type
417
+ 6. Detect language → set `## Scope Tags`
418
+ 7. Print next steps
419
+
420
+ **`act init --quickstart` (fast lane):**
421
+ All of the above, plus:
422
+ 1. Copy `examples/quickstart-plan.md` → `docs/plans/quickstart.md`
423
+ 2. Customize the plan for detected project type:
424
+ - Python: "Add a conftest.py with common fixtures + test helper"
425
+ - Node: "Add a build validation script + test helper"
426
+ - Bash: "Add shellcheck CI + test runner"
427
+ 3. Run `act gate --project-root .` to verify quality gate works
428
+ 4. Print: "Ready. Run `act plan docs/plans/quickstart.md` for your first quality-gated execution."
429
+
430
+ **Time budget:** `act init` < 10 seconds, `act init --quickstart` < 30 seconds (gate run is the bottleneck).
431
+
432
+ ### Improvement 6: Graduated Autonomy
433
+
434
+ **Principle:** Start supervised, earn trust, reduce friction. Humans don't give full autonomy to new team members on day one.
435
+
436
+ **Trust score per project, derived from telemetry:**
437
+
438
+ ```
439
+ Trust Score = weighted average of:
440
+ - Gate first-attempt pass rate (40%)
441
+ - Echo-back pass rate (20%)
442
+ - Test regression rate, inverted (20%)
443
+ - Post-merge revert rate, inverted (20%)
444
+ ```
445
+
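A minimal sketch of the weighted average, assuming all four inputs are percentages on a 0-100 scale and that "inverted" means `100 - rate`. Both are assumptions, since the normalization is not spelled out, and `trust_score` is a hypothetical helper name:

```shell
set -eu

# Weighted trust score on a 0-100 scale.
# Args: gate_pass_rate echo_back_rate test_regression_rate post_merge_revert_rate
trust_score() {
  awk -v gate="$1" -v echo_back="$2" -v regress="$3" -v revert="$4" 'BEGIN {
    score = 0.40 * gate \
          + 0.20 * echo_back \
          + 0.20 * (100 - regress) \
          + 0.20 * (100 - revert)
    printf "%.1f\n", score
  }'
}
```

With hypothetical inputs, `trust_score 80 70 10 5` prints `83.0` (32 + 14 + 18 + 19).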
446
+ **Trust levels and default behavior:**
447
+
448
+ | Trust | Score | Default Mode | Rationale |
449
+ |-------|-------|-------------|-----------|
450
+ | New | < 30 (or < 10 runs) | Mode B: human checkpoint every batch | Unknown project, build confidence |
451
+ | Growing | 30-70 | Headless with checkpoint every 3rd batch | Earning trust, spot-check |
452
+ | Trusted | 70-90 | Headless with notification on failures only | Proven track record |
453
+ | Autonomous | > 90 | Full headless, post-run summary only | Consistently excellent |
454
+
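The table maps onto a simple lookup. In this sketch `trust_level` is a hypothetical helper that takes an integer score, and a score of exactly 70 is assigned to Trusted, which is an assumption because the table lists 70 in two rows:

```shell
set -eu

# Map a trust score and run count to a trust level.
# Args: score(integer 0-100) runs
trust_level() {
  local score="$1" runs="$2"
  if [ "$runs" -lt 10 ] || [ "$score" -lt 30 ]; then
    echo "new"
  elif [ "$score" -lt 70 ]; then
    echo "growing"
  elif [ "$score" -le 90 ]; then
    echo "trusted"
  else
    echo "autonomous"
  fi
}
```

The run-count check comes first, so a project with a high score but fewer than 10 runs still starts at the New level.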
455
+ **Override:** Users can always set `--mode` explicitly. Trust score is advisory default, not a hard gate.
456
+
457
+ **Trust score in `act status`:**
458
+ ```
459
+ Project: my-app (python)
460
+ Trust Score: 73/100 (28 runs)
461
+ Gate pass rate: 89% ████████▉ (HIGH)
462
+ Echo-back rate: 92% █████████▏ (HIGH)
463
+ Test regression: 4% ▍ (GOOD)
464
+ Post-merge revert: 0% ▏ (EXCELLENT)
465
+ Default mode: headless with notification on failures only
466
+ ```
467
+
468
+ ### Improvement 7: Benchmark Suite
469
+
470
+ **Principle:** "Single-user testing is not testing." Without benchmarks, you can't prove the toolkit works, you can't measure improvement between versions, and users can't validate their setup.
471
+
472
+ **5 benchmark tasks (varying complexity):**
473
+
474
+ | # | Task | Complexity | Measures |
475
+ |---|------|-----------|----------|
476
+ | 1 | Add a REST endpoint with tests | Simple (1 batch) | Basic execution, TDD compliance |
477
+ | 2 | Refactor a module into two | Medium (2 batches) | Refactoring quality, test preservation |
478
+ | 3 | Fix an integration bug | Medium (2 batches) | Debugging, root cause analysis |
479
+ | 4 | Add test coverage to untested module | Medium (2 batches) | Test quality, edge case discovery |
480
+ | 5 | Multi-file feature with API + DB + tests | Complex (4 batches) | Full pipeline, cross-file coordination |
481
+
482
+ **Each benchmark includes:**
483
+ - `task.md` — Problem description (what the agent receives)
484
+ - `scaffold/` — Starting codebase (reproducible initial state)
485
+ - `reference/` — Reference implementation (what "correct" looks like)
486
+ - `rubric.sh` — Machine-scored evaluation (exit 0 = pass per criterion)
487
+ - `rubric.json` — Criteria and weights for scoring
488
+
489
+ **`act benchmark run` behavior:**
490
+ 1. Create temp directory, copy scaffold
491
+ 2. Run `act plan` on the task
492
+ 3. Execute `rubric.sh` to score the result
493
+ 4. Compare against reference implementation
494
+ 5. Output scorecard with per-criterion pass/fail
495
+
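Step 3 relies on each rubric criterion being a command that exits 0 on pass. A sketch of that scoring idea, using a hypothetical `score_criteria` helper and ignoring the weights in `rubric.json` (every criterion counts equally here, which is a simplification):

```shell
set -eu

# Print the percentage of criteria that pass.
# Each argument is the name of a command or function that exits 0 on pass.
score_criteria() {
  local passed=0 total=0 criterion
  for criterion in "$@"; do
    total=$((total + 1))
    if "$criterion"; then passed=$((passed + 1)); fi
  done
  if [ "$total" -eq 0 ]; then echo 0; return 0; fi
  echo $((passed * 100 / total))
}
```

Three passing criteria out of four would score 75, the kind of per-task percentage that the comparison output reports.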
496
+ **`act benchmark compare <before.json> <after.json>`:**
497
+ ```
498
+ Benchmark Comparison: v1.0.0 vs v1.1.0
499
+ ═══════════════════════════════════════
500
+ v1.0.0 v1.1.0 Delta
501
+ Task 1 (endpoint): 85% 92% +7%
502
+ Task 2 (refactor): 72% 78% +6%
503
+ Task 3 (debug): 68% 81% +13% ← biggest improvement
504
+ Task 4 (coverage): 90% 91% +1%
505
+ Task 5 (multi-file): 55% 67% +12%
506
+ ─────────────────────────────────────────
507
+ Overall: 74% 82% +8%
508
+ ```
509
+
510
+ ---
511
+
512
+ ## Part 5: Complete Concept Inventory
513
+
514
+ Everything from the existing toolkit is preserved. Nothing is removed or moved.
515
+
516
+ ### Skills (20 — all preserved)
517
+
518
+ | Skill | Purpose | Pipeline Stage |
519
+ |-------|---------|---------------|
520
+ | autocode | Full 9-stage pipeline orchestrator | Entry point |
521
+ | brainstorming | Design exploration & approval | Stage 1 |
522
+ | research | Structured technical investigation | Stage 1.5 |
523
+ | roadmap | Multi-feature epic decomposition | Stage 0.5 |
524
+ | writing-plans | TDD-structured implementation plans | Stage 3 |
525
+ | using-git-worktrees | Isolated workspace creation | Stage 2 |
526
+ | subagent-driven-development | Fresh agent per task + 2-stage review | Stage 4a |
527
+ | executing-plans | Batch execution with human checkpoints | Stage 4b |
528
+ | verification-before-completion | Evidence-based gate | Stage 5 |
529
+ | finishing-a-development-branch | Merge/PR/keep/discard | Stage 6 |
530
+ | test-driven-development | Red-Green-Refactor cycle | Supporting |
531
+ | systematic-debugging | 4-phase root cause investigation | Supporting |
532
+ | dispatching-parallel-agents | 2+ independent task coordination | Supporting |
533
+ | requesting-code-review | Dispatch reviewer subagent | Supporting |
534
+ | receiving-code-review | Technical evaluation of feedback | Supporting |
535
+ | using-superpowers | Meta-skill: invoke skills before action | Meta |
536
+ | verify | Self-verification checklist | Supporting |
537
+ | writing-skills | TDD applied to skill documentation | Meta |
538
+ | capture-lesson | Incident → lesson workflow | Lesson system |
539
+ | check-lessons | Surface relevant lessons for current work | Lesson system |
540
+
541
+ ### Commands (7 — all preserved)
542
+
543
+ | Command | Purpose |
544
+ |---------|---------|
545
+ | `/autocode <feature>` | Full pipeline entry point |
546
+ | `/code-factory <feature>` | Alias for autocode |
547
+ | `/create-prd <feature>` | Machine-verifiable acceptance criteria |
548
+ | `/run-plan <file>` | In-session batch execution |
549
+ | `/ralph-loop <prompt>` | Autonomous iteration with stop-hook |
550
+ | `/cancel-ralph` | Cancel active Ralph loop |
551
+ | `/submit-lesson` | Community lesson submission via PR |
552
+
553
+ ### Agents (7 — all preserved)
554
+
555
+ | Agent | Model | Purpose |
556
+ |-------|-------|---------|
557
+ | lesson-scanner | sonnet | Dynamic anti-pattern scan from lesson files |
558
+ | bash-expert | sonnet | Shell script review & debugging |
559
+ | shell-expert | sonnet | systemd/service diagnosis |
560
+ | python-expert | sonnet | Async, lifecycle, type safety review |
561
+ | integration-tester | opus | Cross-service data flow verification |
562
+ | dependency-auditor | haiku | CVE scan, license compliance |
563
+ | service-monitor | sonnet | systemd service/timer health |
564
+
565
+ ### Scripts (32 existing + 3 new = 35)
566
+
567
+ **Existing (all preserved, paths unchanged):**
568
+
569
+ Execution: run-plan.sh, auto-compound.sh, mab-run.sh, setup-ralph-loop.sh
570
+ Quality: quality-gate.sh, lesson-check.sh, policy-check.sh, research-gate.sh
571
+ Validation: validate-all.sh, validate-lessons.sh, validate-skills.sh, validate-commands.sh, validate-plugin.sh, validate-hooks.sh, validate-policies.sh, validate-prd.sh, validate-plan-quality.sh
572
+ Analysis: entropy-audit.sh, batch-audit.sh, batch-test.sh, analyze-report.sh, failure-digest.sh, pipeline-status.sh, architecture-map.sh
573
+ Lessons: pull-community-lessons.sh, promote-mab-lessons.sh, scope-infer.sh
574
+ Utilities: license-check.sh, module-size-check.sh, generate-ast-rules.sh, prior-art-search.sh
575
+
576
+ **New:**
577
+
578
+ | Script | Purpose | Lines (est.) |
579
+ |--------|---------|-------------|
580
+ | `scripts/init.sh` | Project bootstrapper (`act init`) | ~100 |
581
+ | `scripts/telemetry.sh` | Telemetry capture, dashboard, export/import | ~200 |
582
+ | `benchmarks/runner.sh` | Benchmark orchestrator | ~150 |
583
+
584
+ ### Lib Modules (18 — all preserved)
585
+
586
+ common.sh, ollama.sh, telegram.sh, progress-writer.sh, cost-tracking.sh, thompson-sampling.sh, run-plan-parser.sh, run-plan-state.sh, run-plan-headless.sh, run-plan-team.sh, run-plan-routing.sh, run-plan-quality-gate.sh, run-plan-prompt.sh, run-plan-context.sh, run-plan-sampling.sh, run-plan-scoring.sh, run-plan-echo-back.sh, run-plan-notify.sh
587
+
588
+ ### Execution Modes (5 — all preserved)
589
+
590
+ | Mode | Entry (Claude Code) | Entry (CLI) | Isolation |
591
+ |------|-------------------|------------|-----------|
592
+ | A: Subagent-dev | /autocode → Stage 4a | N/A (Claude-only) | Same session |
593
+ | B: Executing-plans | /autocode → Stage 4b | N/A (Claude-only) | Separate session |
594
+ | C: Headless | /run-plan | `act plan <file>` | Fresh context/batch |
595
+ | D: Ralph Loop | /ralph-loop | N/A (needs stop-hook) | Same session |
596
+ | E: MAB | /run-plan --mab | `act plan <file> --mab` | Parallel worktrees |
597
+
598
+ ### State & Persistence (5 existing + 1 new = 6)
599
+
600
+ | State File | Location | Purpose |
601
+ |-----------|----------|---------|
602
+ | `.run-plan-state.json` | Project root | Execution checkpoint (batches, test counts, costs) |
603
+ | `progress.txt` | Project root | Append-only discovery log |
604
+ | `tasks/prd.json` | Project root | Machine-verifiable acceptance criteria |
605
+ | `logs/failure-patterns.json` | Project root | Cross-run failure learning |
606
+ | `.claude/ralph-loop.local.md` | Project root | Ralph loop state |
607
+ | **`logs/telemetry.jsonl`** | Project root | **Per-batch telemetry (NEW)** |
608
+
609
+ Additional learning state (existing, in `logs/`): routing-decisions.log, sampling-outcomes.json, strategy-perf.json, mab-lessons.json.
610
+
611
+ All state is project-local. The npm package is stateless. No state collision between projects.
612
+
613
+ ### Lessons (79 + framework — all bundled)
614
+
615
+ **Three-tier architecture:**
616
+
617
+ ```
618
+ Tier 1: Bundled (ships with npm, updated on npm update)
619
+ Location: <npm-root>/docs/lessons/
620
+ Count: 79 (grows with releases)
621
+
622
+ Tier 2: Community (git-synced between releases)
623
+ Mechanism: act lessons pull --remote upstream
624
+ Source: main branch of toolkit repo
625
+ Merge: additive only, never overwrites local
626
+
627
+ Tier 3: Project-local (user's own lessons)
628
+ Location: <project>/docs/lessons/
629
+ Scope: project-specific anti-patterns
630
+ Never overwritten by Tier 1 or 2
631
+ ```
632
+
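One way the bundled and project-local tiers could be combined at scan time is sketched below. `lesson_dirs` is a hypothetical helper, and Tier 2 is not modeled because it is synced separately by the pull command:

```shell
set -eu

# Print the lesson directories to scan, one per line: bundled first, then project-local.
# Args: bundled_dir project_root
lesson_dirs() {
  local bundled_dir="$1" project_dir="$2/docs/lessons"
  echo "$bundled_dir"
  # Add the project tier only if it exists and is not the bundled directory itself.
  if [ -d "$project_dir" ] && [ "$project_dir" != "$bundled_dir" ]; then
    echo "$project_dir"
  fi
}
```

Listing directories instead of copying files keeps the tiers separate on disk, so a package update cannot overwrite project-local lessons.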
633
+ **Six root cause clusters:**
634
+ 1. Silent Failures — operation appears to succeed but silently fails
635
+ 2. Integration Boundaries — each component passes its test; bug hides at seam
636
+ 3. Cold-Start Assumptions — works steady-state, fails on restart
637
+ 4. Specification Drift — agent builds wrong thing correctly
638
+ 5. Context & Retrieval — info available but buried/misscoped
639
+ 6. Planning & Control Flow — wrong decomposition contaminates downstream
640
+
641
+ **Lesson schema:** YAML frontmatter with id, title, severity, languages, scope, category, pattern (type + regex/description), fix, positive_alternative, example (bad/good).
642
+
643
+ **Scope filtering:** `lesson-check.sh` reads `## Scope Tags` from CLAUDE.md, computes intersection with lesson scope tags. Prevents false positive death spiral at scale (research #B2-2).
644
+
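The intersection test can be sketched as follows, assuming scope tags are comma-separated strings without spaces. The tag format and the `scope_matches` name are assumptions, not the toolkit's actual parsing:

```shell
set -eu

# Exit 0 if any lesson tag also appears in the project tags, 1 otherwise.
# Args: project_tags lesson_tags (both comma-separated, no spaces)
scope_matches() {
  local project_tags="$1" lesson_tags="$2" tag
  for tag in $(printf '%s' "$lesson_tags" | tr ',' ' '); do
    # Wrapping both sides in commas forces whole-tag matches ("py" does not match "python").
    case ",$project_tags," in
      *",$tag,"*) return 0 ;;
    esac
  done
  return 1
}
```

A lesson scoped to `node,bash` would run in a project tagged `python,bash`, while a lesson scoped only to `node` would be skipped.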
645
+ ### Policies (4 — all preserved)
646
+
647
+ | File | Scope | Patterns |
648
+ |------|-------|----------|
649
+ | universal.md | All projects | Error visibility, test before ship, fresh context, durable artifacts |
650
+ | python.md | Python | Async discipline, closing(), create_task callbacks |
651
+ | bash.md | Shell | Strict mode, quoting, subshell cd, atomic writes |
652
+ | testing.md | All tests | No hardcoded counts, boundary testing, live > static |
653
+
654
+ ### Hooks (2 — all preserved)
655
+
656
+ | Hook | Trigger | Purpose |
657
+ |------|---------|---------|
658
+ | SessionStart | Session init | Symlink setup for skill discovery |
659
+ | Stop | Session exit | Ralph loop continuation gate |
660
+
661
+ ### Quality Gate Pipeline (preserved + enhanced)
662
+
663
+ ```
664
+ lesson-check.sh (syntactic, <2s)
665
+ ↓ if clean
666
+ ast-grep patterns (5 structural checks)
667
+ ↓ if clean
668
+ Test suite (auto-detected: pytest/npm/make)
669
+ ↓ if pass
670
+ Memory check (warn if <4GB, never fail)
671
+ ↓ always (warning only)
672
+ Test count regression (new_count >= old_count)
673
+ ↓ if no regression
674
+ Git clean (all changes committed)
675
+ ↓ if clean
676
+ Telemetry capture (NEW: write batch results to logs/telemetry.jsonl)
677
+
678
+ ✅ PASS → next batch
679
+ ```
680
+
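The pipeline is a short-circuit chain: each step runs only if the previous one succeeded. A sketch of that control flow, with hypothetical function names and the test-count check shown as one concrete step:

```shell
set -eu

# Test count regression check: passes when new_count >= old_count.
# Args: old_count new_count
check_test_count() {
  if [ "$2" -lt "$1" ]; then
    echo "test count dropped from $1 to $2" >&2
    return 1
  fi
}

# Run gate steps in order and stop at the first failure.
# Each argument is the name of a command or function that exits 0 on success.
run_gate() {
  local step
  for step in "$@"; do
    if ! "$step"; then
      echo "FAIL at: $step" >&2
      return 1
    fi
  done
  echo "PASS"
}
```

A warn-only step such as the memory check fits the same loop by always returning 0 after printing its warning.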
681
+ ### Examples (4 — all preserved)
682
+
683
+ example-plan.md, example-prd.json, example-roadmap.md, quickstart-plan.md
684
+
685
+ ### Documentation (all preserved)
686
+
687
+ ARCHITECTURE.md, CONTRIBUTING.md, SECURITY.md, docs/lessons/FRAMEWORK.md, docs/lessons/TEMPLATE.md, docs/lessons/SUMMARY.md, docs/lessons/DIAGNOSTICS.md
688
+
689
+ ### CI (preserved)
690
+
691
+ .github/workflows/ci.yml — ShellCheck + shfmt + shellharden + semgrep + tests
692
+
693
+ ### Prompts & AST Patterns (all preserved)
694
+
695
+ Prompts: planner-agent.md, judge-agent.md, agent-a-superpowers.md, agent-b-ralph.md
696
+ Patterns: bare-except.yml, empty-catch.yml, async-no-await.yml, retry-loop-no-backoff.yml, hardcoded-localhost.yml
697
+
698
+ ---
699
+
700
+ ## Part 6: External Dependencies
701
+
702
+ ### Required
703
+
704
+ | Dependency | Used By | Check |
705
+ |-----------|---------|-------|
706
+ | bash 4+ | All scripts | `act` checks at startup |
707
+ | git | Worktrees, state, PRs | `act` checks at startup |
708
+ | jq | State files, PRD, MAB, telemetry | `act` checks at startup |
709
+ | curl | Ollama, Telegram (optional features) | Checked at call site |
710
+ | claude CLI | Execution modes (plan, compound, mab) | Checked by run-plan.sh |
711
+ | Node.js 18+ | `bin/act.js` router only | npm enforces via engines |
712
+
713
+ ### Optional (graceful degradation)
714
+
715
+ | Dependency | Used By | Behavior if Missing |
716
+ |-----------|---------|-------------------|
717
+ | ruff | quality-gate (Python lint) | Skipped with warning |
718
+ | eslint | quality-gate (JS lint) | Skipped with warning |
719
+ | ast-grep | quality-gate (structural) | Skipped (advisory anyway) |
720
+ | ollama | analyze-report, auto-compound | Fails with clear message |
721
+ | bc | Thompson Sampling | Falls back to random routing |
722
+ | gh | PRs, submit-lesson, benchmarks | Fails with install hint |
723
+ | pytest/npm/make | quality-gate (tests) | Auto-detected, skips if none |
724
+
725
+ ### Hardcoded Paths to Fix (2 only)
726
+
727
+ | Current | Fix | Script |
728
+ |---------|-----|--------|
729
+ | `~/.env` for Telegram/Ollama creds | Add `ACT_ENV_FILE` env var | telegram.sh, ollama.sh |
730
+ | `$HOME/Documents/projects` default | Already has `--projects-dir` flag | entropy-audit.sh |
731
+
732
+ Everything else uses `SCRIPT_DIR` relative resolution via `readlink -f`.
733
+
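The `ACT_ENV_FILE` fix in the table above amounts to a parameter-expansion default. A sketch, with hypothetical function names:

```shell
set -eu

# Print the env file to load: ACT_ENV_FILE if set and non-empty, else the legacy ~/.env.
resolve_env_file() {
  printf '%s\n' "${ACT_ENV_FILE:-$HOME/.env}"
}

# Source the file only if it exists, so missing credentials do not abort the script.
load_env_file() {
  local env_file
  env_file="$(resolve_env_file)"
  if [ -f "$env_file" ]; then
    # shellcheck disable=SC1090
    . "$env_file"
  fi
}
```

Existing setups keep working because the default is still `~/.env`; CI jobs can point `ACT_ENV_FILE` at a different file without touching the home directory.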
734
+ ---
735
+
736
+ ## Part 7: Design Principles
737
+
738
+ These principles govern the toolkit's behavior and every future contribution. They are non-negotiable.
739
+
740
+ ### From the Original Architecture
741
+
742
+ 1. **Fresh context per unit of work** — Context degradation is the #1 quality killer. Every execution mode solves this differently.
743
+ 2. **Machine-verifiable gates** — No human judgment for "did this work?" Every gate is a command that exits 0 or non-zero.
744
+ 3. **Test count monotonicity** — The test count never decreases between batches. A decreased count means something broke.
745
+ 4. **State survives interruption** — Every transition persisted to disk. Kill, reboot, come back later — `--resume` works.
746
+ 5. **Orthogonal verification** — Bottom-up (syntactic) + top-down (integration) catch non-overlapping bug classes.
747
+ 6. **Lessons compound** — Every bug becomes an automated check. The system gets harder to break over time.
748
+
749
+ ### From the Research Foundation
750
+
751
+ 7. **Plan quality over execution quality** — 3:1 ratio. Invest in plan scoring, spec echo-back, and research gates before execution optimization.
752
+ 8. **Measure before optimizing** — Telemetry first. Every improvement must be measurable.
753
+ 9. **Positive instructions alongside negative** — Policies ("do Y") complement lessons ("don't do X"). LLMs respond better to positive guidance.
754
+ 10. **Scope to prevent noise** — Every lesson has scope metadata. Without it, false positives compound and users disable the system.
755
+ 11. **Community learning compounds** — Federated telemetry and lesson sync mean every user makes every other user's system better.
756
+ 12. **Graduated autonomy** — Start supervised, earn trust through measured success, reduce friction over time.
757
+ 13. **Fast time to first value** — Under 3 minutes to first quality-gated execution. A dead user gets zero benefit from perfect process.
758
+
759
+ ### From Operations Research (18 frameworks converged)
760
+
761
+ 14. **Formal gate between understanding and building** — The brainstorm→research→PRD chain is not optional overhead; it's the highest-leverage investment.
762
+ 15. **Adversarial review at every stage** — Spec reviewer, code quality reviewer, lesson scanner, quality gate — each catches a different failure class.
763
+ 16. **Intent over method** — Plans specify what and why, not how. Agents choose implementation strategy.
764
+
765
+ ---
766
+
767
+ ## Part 8: What's New (Summary)
768
+
769
+ | Item | Type | Est. Lines | Priority |
770
+ |------|------|-----------|----------|
771
+ | `package.json` | New file | ~30 | P0 (required for npm) |
772
+ | `bin/act.js` | New file | ~150 | P0 (CLI router) |
773
+ | `scripts/init.sh` | New file | ~100 | P0 (project bootstrap) |
774
+ | `scripts/telemetry.sh` | New file | ~200 | P1 (measurement before optimization) |
775
+ | `benchmarks/` directory | New directory | ~300 | P1 (prove the system works) |
776
+ | Fix `~/.env` → `ACT_ENV_FILE` | Edit 2 files | ~10 | P0 (portability) |
777
+ | `LESSONS_DIR` project-local fallback | Edit lesson-check.sh | ~10 | P0 (lesson tiers) |
778
+ | Update README.md | Edit | ~200 | P0 (installation docs) |
779
+ | Telemetry capture in quality gate | Edit quality-gate.sh | ~20 | P1 (data collection) |
780
+ | Trust score in pipeline-status.sh | Edit | ~50 | P2 (graduated autonomy) |
781
+ | Tier 2 echo-back | Edit run-plan-echo-back.sh | ~80 | P2 (spec drift prevention) |
782
+ | **Total new code** | | **~1,150** | |
783
+
784
+ **P0:** Required for npm publish. Ship first.
785
+ **P1:** Required for the learning system thesis. Ship second.
786
+ **P2:** Enhances the learning system. Ship third.
787
+
788
+ ---
789
+
790
+ ## Part 9: What Does NOT Change
791
+
792
+ - All 20 skills — unchanged, same paths
793
+ - All 7 commands — unchanged
794
+ - All 7 agents — unchanged
795
+ - All 32 existing scripts — unchanged (except 3 small edits noted above)
796
+ - All 18 lib modules — unchanged (except the small edits to telegram.sh, ollama.sh, and run-plan-echo-back.sh noted above)
797
+ - All 79 lessons — bundled as-is
798
+ - All 4 policies — unchanged
799
+ - All 5 execution modes — unchanged
800
+ - All hooks — unchanged
801
+ - All state file formats — unchanged
802
+ - All prompts and AST patterns — unchanged
803
+ - CI workflow — unchanged
804
+ - Directory layout — preserved (additions only)
805
+ - Design principles 1-6 — preserved (7-16 are additions)
806
+
807
+ ---
808
+
809
+ ## Appendix A: Risk Assessment
810
+
811
+ | Risk | Likelihood | Impact | Mitigation |
812
+ |------|-----------|--------|------------|
813
+ | `act` name collision (other npm packages) | Medium | Low | Check npm registry; fallback: `actk` |
814
+ | Windows without WSL | Medium | Medium | Clear error message + WSL install guide |
815
+ | Telemetry privacy concerns | Low | High | Local-only default, explicit opt-in for sharing, no PII ever |
816
+ | Claude Code API changes break hooks/skills | Medium | High | Abstract plugin interface; version pin in package.json |
817
+ | Lesson false positive spiral at scale | Medium | High | Adaptive gates (Improvement 3) + scope filtering |
818
+ | Community doesn't form | High | Medium | Toolkit works solo; community features are additive |
819
+
820
+ ## Appendix B: Success Metrics
821
+
822
+ | Metric | Target (6 months) | How Measured |
823
+ |--------|-------------------|-------------|
824
+ | npm weekly downloads | 50+ | npm stats |
825
+ | Community lessons submitted | 10+ | GitHub PRs |
826
+ | Benchmark score improvement | +10% over v1.0 baseline | `act benchmark compare` |
827
+ | Gate first-attempt pass rate | >85% across community | Aggregated telemetry |
828
+ | Time to first value | <3 minutes | Manual testing + user reports |
829
+ | User retention (>5 runs) | >50% of installers | Telemetry (if opted in) |
830
+
831
+ ## Appendix C: Research Document Index
832
+
833
+ Full research corpus governing this design: `research/2026-02-22-cross-cutting-synthesis.md` (25 papers, 409 lines). Key references by section:
834
+
835
+ - Telemetry: Cost/Quality (#B1-7), MAB R2 (#P7)
836
+ - Federated learning: Lesson Transferability (#B2-2), MAB R1 (#P6)
837
+ - Adaptive gates: Lesson Transferability (#B2-2), Unconventional Perspectives (#B2-3)
838
+ - Echo-back: Failure Taxonomy (#B1-5), Multi-Agent Coordination (#B1-8)
839
+ - Fast lane: User Adoption (#B2-1), Competitive Landscape (#B1-4)
840
+ - Graduated autonomy: User Adoption (#B2-1), Operations Design (#P9)
841
+ - Benchmarks: Verification Effectiveness (#B1-6), Comprehensive Testing (#B2-7)