autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
# Research: Codebase Auditing and Refactoring with Autonomous AI Agents

> **Date:** 2026-02-22
> **Status:** Research complete
> **Method:** Web research + academic literature + tool analysis + existing toolkit review
> **Confidence:** High on audit pipeline design, medium on migration patterns, high on metrics

## Executive Summary

The autonomous-coding-toolkit is optimized for greenfield development: brainstorm, plan, execute new features. But the overwhelming majority of professional software work is improving existing code — auditing for issues, refactoring legacy systems, improving test coverage, paying down tech debt, and migrating frameworks. This research makes the case that **"improve existing code" is as important a use case as "build new features"** and proposes a concrete pipeline for it.

**Key findings:**

1. **AI agents are already effective at localized refactoring** — extract method, dead code removal, naming improvements, magic number elimination — but struggle with cross-module architectural changes (confidence: high, ICSE 2025 IDE workshop evidence).
2. **Hotspot analysis (CodeScene's approach) is the highest-leverage prioritization strategy** — intersecting code complexity with change frequency identifies the 5-10% of files responsible for most defects and delivery slowdowns (confidence: high, behavioral code analysis research).
3. **Characterization testing is the safety net** — before any AI refactoring, capture current behavior as tests. Michael Feathers' principle: "legacy code is code without tests" (confidence: high, established practice).
4. **The strangler fig pattern maps naturally to AI agent work** — incremental replacement with a routing layer is safer than wholesale rewrite, and AI agents can execute the incremental steps autonomously (confidence: high).
5. **An audit-first pipeline is the missing stage** — the toolkit needs a `discover → assess → prioritize → plan → execute → measure` pipeline that precedes the existing brainstorm → plan → execute chain (confidence: high).
6. **Continuous improvement via scheduled agents is production-ready** — Continuous Claude and similar approaches demonstrate overnight autonomous PR generation for incremental code improvement (confidence: medium, early but proven pattern).

**Proposed new stages for the toolkit:**

```
[NEW] /audit → discover → assess → prioritize
[EXISTING] brainstorm → plan → execute → verify → finish
[NEW] /measure → track improvement over time
```

---

## 1. Audit-First Workflow: How AI Agents Should Approach Existing Code

### Findings

AI agents exploring unfamiliar codebases follow a consistent high-to-low resolution pattern, supported by both SWE-bench agent analysis and the ArchAgent framework (arXiv 2601.13007):

**Optimal exploration sequence:**

1. **Structural survey** (seconds) — file tree, directory layout, language detection, build system identification
2. **Documentation scan** (seconds) — README, CLAUDE.md, ARCHITECTURE.md, CONTRIBUTING.md, inline doc comments
3. **Dependency graph** (seconds-minutes) — `package.json`, `requirements.txt`, `pyproject.toml`, import analysis
4. **Test suite assessment** (seconds) — test framework detection, test count, coverage report if available, run baseline
5. **Git history analysis** (minutes) — recent commits, change frequency per file, contributor patterns, hotspot identification
6. **Architecture recovery** (minutes) — call graph extraction, module boundaries, entry points, data flow paths
7. **Tech debt inventory** (minutes) — code smells, complexity metrics, dead code, naming violations, pattern inconsistencies
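Steps 1-4 of the sequence above are cheap enough to run as one deterministic script before any LLM is involved. A minimal sketch, where `survey_repo` is a hypothetical helper and not an existing toolkit script:

```shell
#!/usr/bin/env bash
# Sketch of audit steps 1-4: structural survey, build-system detection,
# and a test-file count. survey_repo is a hypothetical helper.
survey_repo() {
  local root="$1"

  echo "== language mix (by extension) =="
  find "$root" -type f -name '*.*' -not -path '*/.git/*' |
    sed 's/.*\.//' | sort | uniq -c | sort -rn

  echo "== build system =="
  local f
  for f in package.json requirements.txt pyproject.toml Makefile; do
    if [ -f "$root/$f" ]; then echo "found: $f"; fi
  done

  echo "== test files =="
  find "$root" -type f \( -name 'test_*' -o -name '*.test.*' \) \
    -not -path '*/.git/*' | wc -l
}
```

Steps 5-7 (git history, architecture recovery, debt inventory) build on this baseline and are where LLM assistance starts to pay off.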
SWE-Agent's approach deliberately shows only small amounts of code at a time during search, which works for targeted bug fixing but is insufficient for holistic codebase understanding. The OpenHands project discussion (issue #2363) proposes an "OmniscientAgent" — a head agent with a broader codebase view — suggesting the community recognizes this gap.

**Key insight from the SWE-EVO benchmark:** Long-horizon software evolution tasks (averaging 21 files modified per task, 874 tests per instance) show dramatic performance drops — from 75% on SWE-bench Verified to 23% on SWE-bench Pro. This means current AI agents can fix isolated bugs but struggle with systemic improvements. A structured audit pipeline that breaks systemic improvement into targeted, isolated tasks is essential.

### Evidence

- SWE-bench Pro: Best models (GPT-5, Claude Opus 4.1) score only ~23% on multi-file evolution tasks vs. 75% on isolated fixes ([SWE-bench Pro, Scale AI](https://scale.com/leaderboard/swe_bench_pro_public))
- ArchAgent combines static analysis + adaptive code segmentation + LLM synthesis for architecture recovery ([arXiv 2601.13007](https://arxiv.org/html/2601.13007))
- OpenHands exploration relies on shell commands, file reading, and web browsing — no structured audit methodology ([OpenHands](https://github.com/OpenHands/OpenHands))

### Implications for the Toolkit

The toolkit needs a `/audit` command that executes the 7-step exploration sequence above, producing a structured audit report. This report then feeds into the existing brainstorm → plan → execute pipeline. The audit replaces brainstorming's "explore project context" step with a rigorous, repeatable methodology.

---

## 2. Codebase Comprehension: Building Mental Models

### Findings

AI agents build codebase understanding through several complementary techniques:

**Static analysis techniques:**
- **AST parsing** — extract function signatures, class hierarchies, import relationships. Fast, deterministic, language-specific.
- **Call graph extraction** — map which functions call which. Critical for understanding the impact radius of changes.
- **Dependency mapping** — external dependencies (packages) and internal dependencies (module imports). Reveals coupling.
- **Architecture recovery** — ArchAgent's approach: File Summarizer → Repo Manager (chunking) → Readme Generator → Architect (Mermaid diagrams). Combines static analysis with LLM-powered synthesis.

**LLM-specific techniques:**
- **Hierarchical summarization** — summarize files → summarize modules → summarize system. Fits large codebases into context windows.
- **Intent-aware interaction** — the CodeMap system provides "dynamic information extraction and representation aligned with human cognitive flow" ([arXiv 2504.04553](https://arxiv.org/html/2504.04553))
- **Cross-file relationship indexing** — ArchAgent's File Summarizer performs code search and reference indexing to establish cross-file relationships before summarization.
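Hierarchical summarization can be prototyped without an LLM by stubbing the per-file summarizer. In the sketch below, "first comment line" stands in for an LLM call, and `summarize_file`/`summarize_tree` are illustrative names, not toolkit functions:

```shell
#!/usr/bin/env bash
# Hierarchical summarization skeleton: per-file summaries roll up into
# per-directory summaries. The file summarizer is stubbed; in practice
# it would be an LLM call.
summarize_file() {
  # First "# " comment line stands in for a generated file summary.
  grep -m1 '^# ' "$1" | sed 's/^# *//'
}

summarize_tree() {
  local root="$1" dir f
  for dir in $(find "$root" -type d -not -path '*/.git*' | sort); do
    echo "## $(basename "$dir")"
    for f in "$dir"/*.sh; do
      [ -f "$f" ] || continue
      echo "- $(basename "$f"): $(summarize_file "$f")"
    done
  done
}
```

Swapping the stub for a real summarizer keeps the roll-up logic unchanged, which is the point of the hierarchy: each level only sees the summaries of the level below.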
**Practical approach for the toolkit:**

```
Phase 1: Deterministic (fast, no LLM cost)
- File tree + language detection
- Dependency parsing (package.json, requirements.txt, etc.)
- Import graph (grep/AST-based)
- Test framework detection + baseline run
- Git log analysis (hotspots, churn, contributors)

Phase 2: LLM-assisted (slower, higher quality)
- Hierarchical file summarization
- Architecture diagram generation (Mermaid)
- Business logic identification
- Pattern and convention extraction
- Anti-pattern detection
```

### Evidence

- ArchAgent ablation study confirms dependency context improves architecture accuracy ([arXiv 2601.13007](https://arxiv.org/abs/2601.13007))
- LoCoBench-Agent benchmark evaluates LLM agents on interactive code comprehension tasks ([arXiv 2511.13998](https://arxiv.org/pdf/2511.13998))
- Hybrid reverse engineering combining static/behavioral views with LLM-guided interaction ([arXiv 2511.05165](https://arxiv.org/html/2511.05165v1))

### Implications for the Toolkit

Create a `codebase-profile.sh` script that runs Phase 1 deterministically and produces a JSON profile. This profile becomes the context input for Phase 2 LLM analysis and for all subsequent audit stages. The profile is cached and invalidated on significant git changes.
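A minimal sketch of what that Phase 1 profile could look like. The `profile_codebase` function and the JSON field names are illustrative, not a committed schema:

```shell
#!/usr/bin/env bash
# Phase 1 sketch for the proposed codebase-profile.sh: deterministic
# facts only, emitted as JSON. Field names are illustrative.
profile_codebase() {
  local root="$1"
  local files tests has_pkg
  files=$(find "$root" -type f -not -path '*/.git/*' | wc -l | tr -d ' ')
  tests=$(find "$root" -type f -name 'test*' -not -path '*/.git/*' | wc -l | tr -d ' ')
  if [ -f "$root/package.json" ]; then has_pkg=true; else has_pkg=false; fi
  printf '{"files": %s, "test_files": %s, "has_package_json": %s}\n' \
    "$files" "$tests" "$has_pkg"
}
```

Because every field is derived from cheap filesystem queries, the profile is inexpensive to regenerate, which makes cache invalidation on git changes straightforward.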
---

## 3. Refactoring Strategies: What Works with AI Agents

### Findings

Research from the ICSE 2025 IDE workshop, combined with practical experience, establishes a clear taxonomy of AI refactoring effectiveness:

**AI excels at (safe, high confidence):**
- Extract method / extract function
- Magic number elimination (replace with named constants)
- Long statement splitting
- Dead code removal
- Naming improvements (variable, function, class)
- Import cleanup and organization
- Automated idiomatization (e.g., Python list comprehensions)
- Simplify conditional logic (flatten nested if/else)
- Remove code duplication (within a single file)
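To make the "safe" tier concrete in the toolkit's own language, here is a bash-flavored before/after combining magic-number elimination with an extracted function (the names are illustrative):

```shell
#!/usr/bin/env bash
# Before: inline magic number, logic duplicated at each call site:
#   if [ "$(wc -l < "$f")" -gt 400 ]; then echo "too big"; fi
# After: a named constant plus an extracted, reusable predicate.

readonly MAX_MODULE_LINES=400   # was a magic number inline

# Extracted function: one place to change the size policy.
module_too_big() {
  local file="$1"
  [ "$(wc -l < "$file")" -gt "$MAX_MODULE_LINES" ]
}
```

Both transformations are mechanical, locally verifiable, and behavior-preserving, which is exactly why they sit in the safe tier.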
**AI is mediocre at (requires guardrails):**
- Move to module (cross-file refactoring)
- Replace inheritance with composition
- Interface extraction
- Dependency injection introduction
- Cross-file duplication removal

**AI struggles with (high risk, needs human review):**
- Architectural refactoring (e.g., monolith to modules)
- Multi-module refactoring requiring domain knowledge
- Performance optimization requiring profiling data
- Concurrency pattern changes
- Database schema migrations

**McKinsey estimates** that generative AI can reduce refactoring time by 20-30% and code writing time by up to 45%. Static checks filter LLM hallucinations, and iterative re-prompting on compile/test errors raises functional correctness by 40-65 percentage points over naive LLM output.

### Evidence

- ICSE 2025 IDE workshop: "LLMs consistently outperform or match developers on systematic, localized refactorings... they underperform on context-dependent, architectural, or multi-module refactorings" ([ICSE 2025](https://conf.researchr.org/details/icse-2025/ide-2025-papers/12/LLM-Driven-Code-Refactoring-Opportunities-and-Limitations))
- Augment Code practical guide: "Begin on a low-risk module, prompt an LLM to map dependencies and suggest refactors, then run new code through test-suite and code-review gates" ([Augment Code](https://www.augmentcode.com/learn/ai-powered-legacy-code-refactoring))
- IBM: AI refactoring uses "intelligent risk assessment to predict failure cascades before they happen" ([IBM](https://www.ibm.com/think/topics/ai-code-refactoring))

### Implications for the Toolkit

The toolkit should classify refactoring tasks by risk level and route them accordingly:

| Risk Level | Examples | Execution Mode | Human Review |
|------------|----------|----------------|--------------|
| Low | Naming, dead code, imports | Headless (Mode C) | Post-merge |
| Medium | Extract method, simplify conditionals | Ralph loop with quality gates | PR review |
| High | Architecture, cross-module | Competitive dual-track (Mode B) | Before merge |

The batch-type classification system (`classify_batch_type()`) already exists in `run-plan.sh` — extend it with a refactoring risk classifier.
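Such a classifier could start as a keyword heuristic over the task description. The function name and patterns below are illustrative sketches, not the existing `classify_batch_type()` implementation:

```shell
#!/usr/bin/env bash
# Illustrative refactoring risk classifier: map a task description to
# the low/medium/high routing tiers from the table above.
classify_refactor_risk() {
  local desc="$1"
  case "$desc" in
    *architect*|*cross-module*|*schema*|*concurren*) echo high ;;
    *"extract method"*|*conditional*|*inheritance*|*injection*) echo medium ;;
    *) echo low ;;   # naming, dead code, imports, etc.
  esac
}
```

Unmatched descriptions default to low, which errs toward headless execution; a conservative variant would default to medium and require a quality gate.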
---

## 4. Tech Debt Prioritization: What to Fix First

### Findings

The most effective prioritization strategy is **hotspot analysis** — intersecting code complexity with change frequency — pioneered by CodeScene and validated by behavioral code analysis research.

**CodeScene's approach:**
- **Code Health metric:** An aggregated score from 25+ factors, scaled from 1 (severe issues) to 10 (healthy). Research shows unhealthy code has **15x more defects**, **2x slower development**, and **10x more delivery uncertainty**.
- **Hotspot = complexity + churn:** Files that are both complex AND frequently changed are the highest-priority targets. A complex file that nobody touches is low priority; a simple file that changes often is already fine.
- **Behavioral analysis:** Combines code quality with team patterns — knowledge silos, developer fragmentation, coordination problems.
174
+ **Prioritization framework for AI agents:**
175
+
176
+ ```
177
+ Priority = Impact × Frequency × Feasibility
178
+
179
+ Impact: How much does this issue slow down development?
180
+ (defect rate, review time, build failures)
181
+
182
+ Frequency: How often does this code change?
183
+ (git log --oneline --since="6 months" -- <file> | wc -l)
184
+
185
+ Feasibility: Can an AI agent safely fix this?
186
+ (Low risk refactoring? Tests exist? Clear scope?)
187
+ ```
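The framework above can be sketched as a small scoring function. The file names and factor values are made up for illustration; in practice churn would come from `git log` and impact from defect or review data:

```python
# Illustrative Priority = Impact × Frequency × Feasibility scoring.
# Inputs are hypothetical; real values would come from git log, issue
# trackers, and a complexity tool.
def priority(impact, churn, feasible, max_churn):
    """Each factor lands in [0, 1]; a zero in any factor zeroes the score."""
    frequency = churn / max_churn if max_churn else 0.0
    return impact * frequency * (1.0 if feasible else 0.0)

files = {
    "src/parser.py": priority(impact=0.9, churn=47, feasible=True,  max_churn=47),
    "src/legacy.py": priority(impact=0.9, churn=2,  feasible=True,  max_churn=47),
    "src/models.py": priority(impact=0.5, churn=30, feasible=False, max_churn=47),
}
ranked = sorted(files, key=files.get, reverse=True)
```

The multiplicative form encodes the key property of hotspot analysis: a file that is complex but never changes, or changes but cannot be safely touched, drops out of the ranking entirely.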
188

**Concrete prioritization order:**
1. **Hotspots with tests** — complex, frequently-changed files that already have test coverage. Safest to refactor.
2. **Hotspots without tests** — same files but need characterization tests first. Higher effort, same priority.
3. **Dead code** — unused imports, unreachable functions, commented-out blocks. Zero-risk removal.
4. **Naming violations** — convention drift that impairs readability. Low risk, high readability impact.
5. **Code duplication** — within-file first, then within-module, then cross-module.
6. **Dependency cleanup** — unused dependencies, outdated versions with security patches.
7. **Documentation drift** — stale references, outdated examples, missing docs for public APIs.

### Evidence

- CodeScene research: "unhealthy code has 15 times more defects, 2x slower development, and 10 times more delivery uncertainty" ([CodeScene](https://codescene.com/product/behavioral-code-analysis))
- CodeScene's Code Health metric is built on 25+ research-backed factors ([CodeScene Docs](https://docs.enterprise.codescene.io/versions/7.2.0/guides/technical/hotspots.html))
- NASA Software Assurance: "most effective evaluation is a combination of size and cyclomatic complexity" ([Wikipedia - Cyclomatic Complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity))

### Implications for the Toolkit

Create a `hotspot-analysis.sh` script that:
1. Runs `git log --format='%H' --since='6 months' -- <file> | wc -l` for change frequency
2. Runs complexity analysis (radon for Python, eslint-plugin-complexity for JS)
3. Cross-references with test coverage data
4. Produces a ranked list of files to improve

This replaces the current entropy-audit's flat check approach with a prioritized, evidence-based ranking.

---

## 5. Migration Patterns: Framework and API Upgrades

### Findings

The **strangler fig pattern** is the dominant strategy for incremental migration, and it maps naturally to AI agent capabilities:

**Strangler fig applied to AI agents:**
1. **Identify boundary** — find the interface between old and new (API layer, routing layer, module boundary)
2. **Build routing facade** — create a proxy that dispatches to old or new implementation
3. **Migrate one endpoint/module at a time** — each migration is an isolated, testable unit of work
4. **Verify parity** — run both old and new in parallel, compare outputs
5. **Retire old code** — once all traffic routes to new, remove legacy
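Step 4 (parity verification) can be sketched as a simple comparison harness. `legacy_parse` and `new_parse` are stand-ins for whatever boundary is being strangled:

```python
# Sketch of strangler-fig parity verification: run the old and new
# implementations on the same inputs and report any divergence.
# The two parse functions are hypothetical placeholders.
def legacy_parse(s):
    return s.strip().lower()

def new_parse(s):
    return s.strip().lower()  # migrated implementation under test

def parity_check(inputs, old, new):
    """Return inputs where old and new disagree (empty list == parity)."""
    return [x for x in inputs if old(x) != new(x)]

mismatches = parity_check(["  A ", "b", "C\n"], legacy_parse, new_parse)
# Route traffic to the new implementation only once mismatches is empty.
```

An empty mismatch list is the signal to flip the routing facade; a non-empty one is an isolated, reproducible bug report for the agent's next iteration.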
228

**Why this works for AI agents:**
- Each migration step is small enough for a single context window
- Each step is independently verifiable (tests, output comparison)
- Rollback is trivial (change routing)
- No "big bang" rewrite risk

**Codemod tools:**
- **jscodeshift** (JavaScript) — AST-based transform scripts
- **ast-grep** (multi-language) — structural search and replace
- **libcst** (Python) — concrete syntax tree transforms
- **Rector** (PHP) — automated refactoring and upgrades

**AI + codemods:** AI agents can generate codemods from examples. Given "before" and "after" code for a few cases, an LLM can generate the transformation rule. This is more reliable than having the AI transform each file individually.
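The codemod idea in miniature, using only the Python standard library `ast` module rather than libcst (which, unlike `ast.unparse`, would also preserve formatting). The deprecated name being rewritten is a made-up example:

```python
import ast

# Minimal codemod sketch with the stdlib `ast` module: rewrite references
# to a deprecated name. A production codemod would use libcst to preserve
# formatting; ast.unparse regenerates source from the tree.
class RenameCall(ast.NodeTransformer):
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

src = "result = fetch_data(url)"
tree = RenameCall("fetch_data", "fetch_data_v2").visit(ast.parse(src))
migrated = ast.unparse(tree)
```

This is exactly the kind of rule an LLM can synthesize from a few before/after pairs: the transform is deterministic once written, so it applies uniformly across hundreds of files without per-file LLM variance.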
242

### Evidence

- Microsoft Azure Architecture Center: strangler fig is the recommended pattern for incremental modernization ([Azure Docs](https://learn.microsoft.com/en-us/azure/architecture/patterns/strangler-fig))
- AWS Prescriptive Guidance endorses the same pattern ([AWS Docs](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/strangler-fig.html))
- vFunction's automated AI platform integrates with strangler fig for Java monolith decomposition ([vFunction](https://vfunction.com/blog/strangler-architecture-pattern-for-modernization/))

### Implications for the Toolkit

Add a `migration` batch type to the batch-type classifier. Migration batches get:
- Parity verification gates (old output == new output)
- Rollback verification (can we switch back?)
- Higher retry budget (migrations are flaky)
- Parallel comparison mode (run old and new, diff outputs)

---

## 6. Test Coverage Improvement: Strategy for Untested Code

### Findings

The debate between "outside-in" (integration first) and "inside-out" (unit first) has a nuanced answer for AI agents working on legacy code:

**Michael Feathers' recommendation (Working Effectively with Legacy Code):**
> Start with characterization tests that capture current behavior, then use those as a safety net for refactoring. Don't try to achieve the standard test pyramid first — that requires too many unsafe refactorings.

**Recommended sequence for AI agents:**

1. **Characterization tests first** (golden master) — capture current behavior of critical paths. These are NOT correctness assertions — they're behavior preservation assertions. "Given input X, the system currently produces output Y. If that changes, we want to know."

2. **Integration tests for boundaries** — test the interfaces between modules. These catch the most bugs per line of test code because defects cluster at integration points.

3. **Unit tests for hotspots** — once characterization tests provide a safety net, add unit tests to the files identified by hotspot analysis. Focus on the 5-10% of code that changes most.

4. **Coverage-guided expansion** — use coverage reports to identify untested branches in high-churn files. AI can generate tests for specific uncovered branches efficiently.
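A golden-master characterization test in miniature: the first run records current behavior, every later run asserts it has not changed. `summarize` is a stand-in for any legacy function under test, and the golden-file path is an illustrative choice:

```python
import json
import pathlib
import tempfile

# Golden-master sketch: record current behavior once, then assert it never
# changes. `summarize` is a hypothetical legacy function under test.
def summarize(records):
    return {"count": len(records), "total": sum(r["amount"] for r in records)}

GOLDEN = pathlib.Path(tempfile.gettempdir()) / "golden_summarize.json"
SAMPLE = [{"amount": 3}, {"amount": 4}]

def test_characterize_summarize():
    actual = summarize(SAMPLE)
    if not GOLDEN.exists():  # first run records the baseline
        GOLDEN.write_text(json.dumps(actual, sort_keys=True))
    expected = json.loads(GOLDEN.read_text())
    assert actual == expected, "behavior changed: investigate before merging"
```

Note that the test says nothing about whether `summarize` is *correct*, only that its behavior is preserved, which is exactly the contract a refactoring agent needs.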
277

**AI-specific considerations:**
- AI-generated tests need human review for relevance — AI may generate tests that pass but don't test meaningful behavior
- AI excels at generating tests from function signatures and docstrings
- AI is poor at generating tests that require complex setup or domain knowledge
- Coverage tools (coverage.py, Istanbul.js, Jest) should feed back into the prioritization loop

### Evidence

- Michael Feathers defines legacy code as "code without tests" — the solution is characterization tests before refactoring ([Working Effectively with Legacy Code](https://bssw.io/items/working-effectively-with-legacy-code))
- Golden master testing is "the fastest way to cover Legacy Code with meaningful, useful tests" ([The Code Whisperer](https://blog.thecodewhisperer.com/permalink/surviving-legacy-code-with-golden-master-and-sampling))
- Qodo: "In legacy codebases with poorly structured or untestable code, attempting to retrofit unit tests may be impractical" — focus on making code testable first ([Qodo](https://www.qodo.ai/blog/unit-testing-vs-integration-testing-ais-role-in-redefining-software-quality/))

### Implications for the Toolkit

Add a `/improve-coverage` command that:
1. Runs coverage analysis to identify gaps
2. Cross-references with hotspot data (cover hot files first)
3. Generates characterization tests for uncovered critical paths
4. Generates unit tests for uncovered branches in hotspot files
5. Uses quality gates to verify tests actually test meaningful behavior (not just asserting `True`)

---

## 7. Safe Refactoring Guardrails: Preventing Regressions

### Findings

The safety stack for AI refactoring has four layers:

**Layer 1: Characterization tests (pre-refactoring)**
- Capture current behavior before changes
- Golden master / approval tests for complex outputs
- Snapshot testing for serializable state

**Layer 2: Static analysis gates (during refactoring)**
- Type checking (mypy, TypeScript strict mode)
- AST-based pattern enforcement (ast-grep, Semgrep rules)
- Import validation (no new circular dependencies)
- The toolkit's existing lesson-check.sh fits here

**Layer 3: Test suite execution (after each change)**
- Full test suite must pass after every refactoring step
- Test count monotonicity (toolkit's existing enforcement)
- Coverage must not decrease
- Performance benchmarks must not regress

**Layer 4: Behavioral verification (before merge)**
- End-to-end smoke tests
- Output comparison (old vs. new for same inputs)
- Review by a separate AI agent (toolkit's existing code review skill)
- Horizontal + vertical pipeline testing (toolkit's existing /verify)

**Critical insight:** Iterative re-prompting on compile/test errors raises functional correctness by 40-65 percentage points. The toolkit's retry-with-escalating-context mechanism already implements this pattern. The existing quality gate pipeline (lesson-check → test suite → memory → test count regression → git clean) provides Layers 2-3. Layer 1 (characterization tests) is the gap.
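The retry-with-escalating-context pattern reduces to a short loop. This is a sketch under stated assumptions: `agent` and `gate` are placeholder callables, not the toolkit's actual interfaces:

```python
# Sketch of iterative re-prompting: re-run the agent with accumulated error
# output until the quality gate passes or the retry budget is exhausted.
# `agent` and `gate` are hypothetical callables, not toolkit API.
def retry_with_context(agent, gate, task, max_attempts=3):
    context = []
    for attempt in range(1, max_attempts + 1):
        patch = agent(task, context)
        ok, errors = gate(patch)
        if ok:
            return patch, attempt
        context.append(errors)  # escalate: next prompt sees past failures
    raise RuntimeError(f"gate still failing after {max_attempts} attempts")

# Toy agent that only succeeds once it has seen the previous error output.
fake_agent = lambda task, ctx: "fixed" if ctx else "broken"
fake_gate = lambda patch: (patch == "fixed", "tests failed")
patch, attempts = retry_with_context(fake_agent, fake_gate, "refactor parser")
```

The escalating `context` list is what makes the second attempt better than the first: the agent sees exactly which gate failed and why, rather than retrying blind.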
331

### Evidence

- ICSE 2025: "iterative re-prompting on compile/test errors raises functional correctness over naive LLM output by 40–65 percentage points" ([ICSE 2025 IDE](https://seal-queensu.github.io/publications/pdf/IDE-Jonathan-2025.pdf))
- Characterization tests are the "fastest way to cover Legacy Code" ([Golden Master Testing](https://www.fabrizioduroni.it/blog/post/2018/03/20/golden-master-test-characterization-test-legacy-code))
- The toolkit's quality gate pipeline already implements Layers 2-3

### Implications for the Toolkit

Add a `characterize` pre-step before any refactoring batch. The characterize step:
1. Identifies functions/classes being modified
2. Generates characterization tests capturing current behavior
3. Runs them to establish baseline
4. Adds them to the test suite
5. Only then proceeds with refactoring

This can be implemented as a new `--pre-step characterize` flag on `run-plan.sh`.

---

## 8. Existing Tools and Approaches

### Findings

The landscape of AI-assisted codebase improvement tools as of early 2026:

| Tool | Focus | Approach | Strengths | Limitations |
|------|-------|----------|-----------|------------|
| **SWE-agent** | Bug fixing | Agent framework for SWE-bench tasks | Structured tool use, file navigation | Isolated fixes, not systemic improvement |
| **OpenHands** | General development | Open-source autonomous agent | Shell + browser + file manipulation | No structured audit methodology |
| **CodeScene** | Tech debt analysis | Behavioral code analysis | Hotspot analysis, Code Health metric, 25+ factors | Commercial, not agent-integrated |
| **SonarQube** | Code quality | Static analysis, 30+ languages | Quality gates, dashboards, CI/CD integration | Rule-based, not AI-powered analysis |
| **Semgrep** | Security + patterns | AST-based pattern matching, 20K-100K loc/sec | Custom rules look like source code, blazing fast | Pattern-only, no behavioral analysis |
| **Sourcery** | Python refactoring | Automated transforms | Pythonic idioms, real-time suggestions | Python only, single-file scope |
| **DCE-LLM** | Dead code elimination | CodeBERT + attribution-based line selection | 94% F1 on dead code detection, beats GPT-4o by 30% | Research prototype, not production tool |
| **Continuous Claude** | Automated PRs | Continuous loop + GitHub Actions | Overnight autonomous improvement, self-learning | Requires CI/CD setup, nascent tooling |
| **Knip** | Dead code (TS/JS) | Mark-and-sweep algorithm | Finds unused deps, exports, types | TypeScript/JavaScript only |
| **Meta's SCARF** | Dead code at scale | Dependency graph + auto-delete PRs | Production-proven at Meta scale | Internal tool, not publicly available |

**Key gap:** No tool combines hotspot-based prioritization with AI agent execution for autonomous codebase improvement. CodeScene identifies what to fix; AI agents can execute the fixes; but no pipeline connects the two. This is the toolkit's opportunity.

### Evidence

- CodeScene's MCP server creates "a continuous feedback loop for AI agents, with deterministic, real-time quality checks" ([CodeScene](https://codescene.com/))
- DCE-LLM achieves 94% F1 scores, surpassing GPT-4o by 30% on dead code detection ([ACL 2025](https://aclanthology.org/2025.naacl-long.501/))
- Semgrep scans at 20K-100K loc/sec per rule vs. SonarQube's 0.4K loc/sec ([Semgrep Docs](https://semgrep.dev/docs/faq/comparisons/sonarqube))
- Continuous Claude enables overnight autonomous PR generation ([Anand Chowdhary](https://anandchowdhary.com/open-source/2025/continuous-claude))

### Implications for the Toolkit

Integrate with existing tools rather than rebuilding:
- Use Semgrep rules for pattern enforcement (extend lesson-check.sh or add as optional gate)
- Use radon/eslint for complexity metrics (input to hotspot analysis)
- Use coverage.py/Istanbul.js for coverage data (input to test prioritization)
- Generate CodeScene-compatible output format for teams already using it

---
388

## 9. Incremental Improvement vs. Rewrite

### Findings

Joel Spolsky's 2000 assertion that "you should never rewrite" remains largely valid, but with important nuances:

**When incremental refactoring wins (most cases):**
- Preserves embedded domain knowledge ("old code is not ugly because it's old — it has bug fixes encoded in it")
- Maintains production capability during improvement
- Each step is individually verifiable
- Risk is bounded per change

**When rewrite might be justified:**
- Architecture or schema are severely misaligned with requirements AND no clear migration path exists
- Tech stack is limiting contributors (e.g., nobody writes that language anymore)
- Security architecture is fundamentally broken (can't be patched)
- The codebase is small enough that rewrite risk is bounded

**How AI changes the calculus:**
- AI dramatically reduces the cost of incremental refactoring (the main argument against it was "too slow")
- AI also reduces the cost of rewrites (but doesn't reduce the RISK)
- The strongest argument against rewrites — losing embedded knowledge — is NOT addressed by AI
- AI agents working on incremental refactoring benefit from existing tests; rewrites start from zero

**Recommendation:** AI makes the "never rewrite" heuristic STRONGER, not weaker. Incremental refactoring was always the safer choice; AI makes it faster, removing the main practical objection. The toolkit should optimize for incremental improvement by default.

### Evidence

- Joel Spolsky: "the single worst strategic mistake that any software company can make" — rewriting from scratch ([Joel on Software](https://bssw.io/items/things-you-should-never-do-part-i))
- Counter-argument: rewrites justified when "architecture is severely out of alignment and incrementally updating would be exceedingly difficult" ([Remesh Blog](https://remesh.blog/refactor-vs-rewrite-7b260e80277a))
- Ben Morris: "Refactoring code is almost always better than rewriting it" — preserves institutional knowledge ([Ben Morris](https://www.ben-morris.com/why-refactoring-code-is-almost-always-better-than-rewriting-it/))

### Implications for the Toolkit

Default to incremental improvement. The audit report should never recommend "rewrite from scratch" — instead, it should identify the highest-impact incremental improvements. If the assessment reveals a codebase so broken that incremental improvement is infeasible, flag it explicitly with the evidence and let a human decide.

---
426

## 10. Audit Report Format

### Findings

An effective AI-generated audit report must serve two audiences: the human reviewer (who decides priorities) and the AI agent (who executes fixes). This demands both readable prose AND machine-parseable structure.

**Recommended format:**

```json
{
  "audit_metadata": {
    "project": "project-name",
    "date": "2026-02-22",
    "commit": "abc123",
    "agent_version": "autonomous-coding-toolkit v1.x"
  },
  "summary": {
    "health_score": 7.2,
    "total_findings": 42,
    "critical": 3,
    "high": 8,
    "medium": 15,
    "low": 16,
    "top_3_actions": ["..."]
  },
  "findings": [
    {
      "id": "F001",
      "category": "dead-code|complexity|naming|duplication|coverage|dependency|security|documentation",
      "severity": "critical|high|medium|low",
      "file": "src/parser.py",
      "line_range": [45, 89],
      "title": "Cyclomatic complexity 23 in parse_config()",
      "description": "...",
      "evidence": "radon cc output: ...",
      "remediation": "Extract 3 helper functions for condition branches",
      "estimated_effort": "15 minutes",
      "risk_level": "low|medium|high",
      "auto_fixable": true,
      "acceptance_criteria": ["pytest tests/test_parser.py -x"]
    }
  ],
  "metrics": {
    "cyclomatic_complexity": {"mean": 5.2, "max": 23, "p90": 12},
    "test_coverage": {"line": 0.67, "branch": 0.45},
    "dependency_count": {"direct": 12, "transitive": 89},
    "file_count": {"total": 156, "over_300_lines": 4},
    "dead_code_estimate": {"files": 3, "functions": 15, "imports": 42}
  },
  "hotspots": [
    {"file": "src/parser.py", "churn": 47, "complexity": 23, "coverage": 0.34, "priority_score": 0.92}
  ]
}
```

**Key design decisions:**
- Every finding has `acceptance_criteria` — shell commands that exit 0 when fixed. This feeds directly into the PRD system.
- Every finding has `auto_fixable` — determines whether it can be assigned to an AI agent without human review.
- `hotspots` section provides the prioritized hit list.
- `metrics` section provides the baseline for measuring improvement.
487

### Evidence

- Solo Sentinel guide: "Define checklist with prioritized categories: Primary (business logic), Non-negotiable (security), Secondary (code health)" ([Mad Devs](https://maddevs.io/writeups/practical-guide-to-lightweight-audits-in-the-age-of-ai/))
- DocsBot: "AI presents findings in well-structured format, categorizing by severity with specific remediation steps" ([DocsBot](https://docsbot.ai/prompts/technical/code-audit-analysis))
- CodeAnt: structured audit output with severity scoring and automated remediation ([CodeAnt](https://www.codeant.ai/blogs/10-best-code-audit-tools-to-improve-code-quality-security-in-2025))

### Implications for the Toolkit

Create `audit-report.json` as a first-class artifact alongside `prd.json`. The audit command produces this report; the plan generator reads it; the quality gate validates against it; the measure step compares pre/post metrics.

---

## 11. The Audit-Plan-Execute Pipeline

### Findings

The existing pipeline needs a new front-end for improvement work:

**Current pipeline (greenfield):**
```
/autocode "Add feature X"
  → brainstorm → PRD → plan → execute → verify → finish
```

**Proposed pipeline (improvement):**
```
/audit [project-dir]
  → discover → assess → prioritize → report (audit-report.json)

/improve [audit-report.json]
  → select top-N findings → generate PRD → plan → execute → verify → measure → finish
```

**How the PRD system works for refactoring:**

Acceptance criteria for "code is better" are measurable — each is a shell command that exits 0 when fixed:
- `[ -z "$(radon cc src/parser.py -nc)" ]` exits 0 (no functions at grade C or worse)
- `pytest --cov=src --cov-fail-under=80` exits 0
- `! grep -rq 'import unused_module' src/` exits 0 (the import is gone)
- `[ "$(wc -l < src/big_file.py)" -lt 300 ]` exits 0 (file is under 300 lines)

Each finding in the audit report already has `acceptance_criteria` — the PRD generator just needs to collect them.
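The audit-to-PRD translation is mostly bookkeeping. A sketch under stated assumptions: the finding fields mirror the report schema from section 10, while the PRD task shape shown here is illustrative rather than the toolkit's actual `prd.json` format:

```python
# Sketch of the audit-to-PRD translator: each selected finding becomes one
# PRD task whose acceptance criteria are the finding's own shell checks.
# The PRD task shape is illustrative, not the toolkit's real schema.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def findings_to_prd(report, top_n=5):
    fixable = [f for f in report["findings"] if f.get("auto_fixable")]
    fixable.sort(key=lambda f: SEVERITY_RANK[f["severity"]])
    return {
        "tasks": [
            {"id": f["id"], "goal": f["remediation"],
             "acceptance_criteria": f["acceptance_criteria"]}
            for f in fixable[:top_n]
        ]
    }

report = {"findings": [
    {"id": "F2", "severity": "low", "auto_fixable": True,
     "remediation": "remove dead imports", "acceptance_criteria": ["check2"]},
    {"id": "F1", "severity": "critical", "auto_fixable": True,
     "remediation": "split parse_config", "acceptance_criteria": ["check1"]},
    {"id": "F3", "severity": "high", "auto_fixable": False,
     "remediation": "redesign module", "acceptance_criteria": ["check3"]},
]}
prd = findings_to_prd(report, top_n=2)
```

Non-auto-fixable findings (F3 above) never enter the PRD; they stay in the report for human triage.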
530

**Batch organization for improvement work:**

Unlike feature development (which follows a dependency order), improvement work can be organized by risk and independence:
- **Batch 1:** Dead code removal (zero risk, builds confidence)
- **Batch 2:** Naming and import cleanup (near-zero risk)
- **Batch 3:** Characterization tests for hotspot files (preparation)
- **Batch 4:** Refactor hotspot #1 (medium risk, has tests now)
- **Batch 5:** Refactor hotspot #2
- **Final batch:** Measure improvement, update documentation

### Implications for the Toolkit

Two new commands:
1. `/audit` — runs discover → assess → prioritize → produces `audit-report.json`
2. `/improve` — reads audit report, generates improvement PRD and plan, executes

The `/improve` command reuses the existing execution pipeline entirely. The only new code is the audit pipeline and the audit-to-PRD translator.

---

## 12. Measuring Improvement: The Scorecard

### Findings

Measuring codebase improvement requires a composite approach — no single metric captures "better."

**Metrics that matter (ranked by evidence strength):**

| Metric | What It Measures | Evidence Strength | Tool |
|--------|-----------------|-------------------|------|
| **Defect rate** | Bugs per time period post-change | High — direct outcome | Git + issue tracker |
| **Code Health** | Composite quality (CodeScene's 25+ factors) | High — research-backed | CodeScene / radon + custom |
| **Test coverage (branch)** | % of branches exercised | Medium-High — necessary but not sufficient | coverage.py, Istanbul |
| **Cyclomatic complexity** | Decision point count | Medium — correlated with defects, not causal | radon, eslint |
| **Change frequency** | Churn rate post-refactoring | Medium — should decrease if refactoring worked | Git log |
| **Coupling (afferent/efferent)** | Module interdependence | Medium — high coupling → hard changes | Custom import analysis |
| **Dead code count** | Unreachable / unused code | Medium — direct measure of waste | DCE-LLM, Knip, custom |
| **File size distribution** | Lines per file | Low-Medium — proxy for decomposition | wc -l |
| **Build time** | CI/CD pipeline duration | Low — secondary indicator | CI system |
| **Dependency count** | Direct + transitive deps | Low — more ≠ worse, but worth tracking | pip-audit, npm ls |

**Important caveat from DX research:** "Traditional structural analysis misses the real sources of complexity that impact delivery speed and developer satisfaction." Cyclomatic complexity alone is misleading — it must be combined with behavioral data (change frequency, defect rate) to be meaningful.

**Composite health score formula (proposed):**

```
health_score = (
    0.25 * normalize(test_coverage_branch) +
    0.20 * normalize(1 / mean_cyclomatic_complexity) +
    0.20 * normalize(1 / hotspot_count) +
    0.15 * normalize(1 / dead_code_ratio) +
    0.10 * normalize(1 / max_file_size_ratio) +
    0.10 * normalize(1 / coupling_score)
) * 10  # Scale to 0-10
```
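A direct transcription of the proposed formula. The open design question is `normalize`; here it is a simple clamp to [0, 1] against a chosen ceiling, and the metric names and sample values are illustrative:

```python
# Composite health score sketch. Weights follow the proposed formula above;
# `normalize` (a clamp against a per-metric ceiling) and the shortened metric
# names are illustrative assumptions.
def normalize(value, ceiling):
    return max(0.0, min(value / ceiling, 1.0))

def health_score(m):
    return (
        0.25 * normalize(m["coverage_branch"], 1.0)
      + 0.20 * normalize(1 / m["mean_complexity"], 1.0)
      + 0.20 * normalize(1 / max(m["hotspots"], 1), 1.0)
      + 0.15 * normalize(1 / max(m["dead_code_ratio"], 0.01), 100.0)
      + 0.10 * normalize(1 / max(m["max_file_ratio"], 0.01), 100.0)
      + 0.10 * normalize(1 / max(m["coupling"], 0.01), 100.0)
    ) * 10

baseline = {"coverage_branch": 0.45, "mean_complexity": 5.2, "hotspots": 4,
            "dead_code_ratio": 0.05, "max_file_ratio": 0.02, "coupling": 0.5}
score = health_score(baseline)
```

Whatever normalization is chosen, the property worth testing is monotonicity: improving any single input (say, branch coverage) must never lower the score, otherwise the measure step would punish real improvements.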
586

### Evidence

- CodeScene: "unhealthy code has 15x more defects, 2x slower development, 10x more delivery uncertainty" ([CodeScene](https://codescene.com/product/behavioral-code-analysis))
- DX: "Traditional structural analysis misses the real sources of complexity" ([GetDX](https://getdx.com/blog/cyclomatic-complexity/))
- NASA: "most effective evaluation is a combination of size and cyclomatic complexity" ([Wikipedia](https://en.wikipedia.org/wiki/Cyclomatic_complexity))
- LinearB: "Cyclomatic complexity alone is misleading" — needs behavioral context ([LinearB](https://linearb.io/blog/cyclomatic-complexity))

### Implications for the Toolkit

Create `measure-improvement.sh` that:
1. Reads baseline metrics from `audit-report.json`
2. Runs the same measurements on current code
3. Produces a delta report showing improvement/regression per metric
4. Calculates composite health score

This runs as the final step of `/improve` and feeds into the continuous improvement loop.

---

## 13. Continuous Improvement: Autonomous Ongoing Quality

### Findings

The most promising pattern for continuous codebase improvement is **scheduled autonomous agents** that generate small, focused PRs:

**Continuous Claude pattern:**
1. GitHub Actions workflow triggers on schedule (daily/weekly)
2. Claude Code runs in headless mode with specific improvement goals
3. Each run produces one focused PR (e.g., "Remove 3 dead functions in parser module")
4. CI validates the PR
5. Human reviews and merges (or auto-merges for low-risk changes)

**Key innovations from Continuous Claude:**
- Context persists between iterations via progress files
- Self-improving: "increase coverage" becomes "run coverage, find files with low coverage, do one at a time"
- Can tackle large refactoring as a series of 20 PRs over a weekend

**Tech debt budget approach:**
- Allocate 15-20% of each sprint to tech debt (industry standard)
- Use the audit report to fill this budget with highest-priority items
- AI agent handles the "boring" items (dead code, naming, imports) automatically
- Humans review the "interesting" items (architecture, design patterns)

**Integration with existing toolkit:**
- `auto-compound.sh` already implements the report → analyze → PRD → execute → PR pipeline
- `entropy-audit.sh` already runs on a weekly timer
- The missing piece is connecting audit findings to automated improvement execution

### Evidence

- Continuous Claude: "multi-step projects complete while you sleep" via automated PR loops ([Anand Chowdhary](https://anandchowdhary.com/open-source/2025/continuous-claude))
- Organizations report "60-80% reduction in technical debt accumulation" with AI-driven refactoring ([GetDX](https://getdx.com/blog/enterprise-ai-refactoring-best-practices/))
- Coder Tasks: "From GitHub Issue to Pull Request" with Claude Code coding agent ([Coder](https://coder.com/blog/launch-dec-2025-coder-tasks))

### Implications for the Toolkit

Create `auto-improve.sh` that chains:
1. `audit.sh` → produces `audit-report.json`
2. Filter to auto-fixable findings below risk threshold
3. For each finding: create branch → fix → test → PR
4. Notify via Telegram with summary
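Step 2 is the safety valve of the whole loop: an unattended run may only touch findings that are both marked `auto_fixable` and at or below a configured risk ceiling. A minimal sketch using the report schema's fields:

```python
# Sketch of the auto-improve filter: keep only findings an unattended run
# may touch. Field names follow the audit-report schema; the default
# ceiling of "low" is a policy choice, not toolkit behavior.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

def select_for_auto_fix(findings, max_risk="low"):
    ceiling = RISK_ORDER[max_risk]
    return [
        f for f in findings
        if f.get("auto_fixable") and RISK_ORDER[f["risk_level"]] <= ceiling
    ]

findings = [
    {"id": "F001", "auto_fixable": True,  "risk_level": "low"},
    {"id": "F002", "auto_fixable": True,  "risk_level": "medium"},
    {"id": "F003", "auto_fixable": False, "risk_level": "low"},
]
selected = select_for_auto_fix(findings)
```

With the default ceiling only F001 survives; raising `max_risk` to `"medium"` admits F002 for the PR-review path, while F003 always waits for a human.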
648

Run as a systemd timer (weekly) alongside the existing entropy-audit timer. Low-risk fixes get auto-merged; medium-risk get PRs for review.

---

## 14. Working with Unfamiliar Codebases: The Onboarding Phase

### Findings

When an AI agent encounters a project it has never seen, it needs a structured onboarding phase before it can safely modify code. This is distinct from the audit — the onboarding produces a reusable **codebase profile** that speeds up all subsequent interactions.

**Onboarding sequence:**

1. **Environment setup** (seconds)
   - Detect language, build system, package manager
   - Install dependencies
   - Verify build succeeds

2. **Convention detection** (seconds-minutes)
   - Naming conventions (snake_case vs. camelCase)
   - File organization patterns
   - Import style (absolute vs. relative)
   - Test file naming and location
   - Commit message format (from git log)

3. **Architecture map** (minutes)
   - Entry points (main files, CLI commands, API routes)
   - Module boundaries and dependencies
   - Data models and database schema
   - Configuration system
   - External service integrations

4. **Safety assessment** (minutes)
   - Test suite health (does it run? does it pass? how long?)
   - CI/CD configuration
   - Code review requirements
   - Protected branches
   - Pre-commit hooks

5. **Profile generation** (seconds)
   - Write `codebase-profile.json` with all discovered information
   - Generate abbreviated `CONTEXT.md` for injection into agent prompts
   - Cache for reuse across sessions
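Step 2 can be made concrete with a toy heuristic: tally snake_case versus camelCase identifiers in a source sample and report the dominant style. The regexes and thresholds are illustrative assumptions, not a production detector:

```python
import re

# Toy convention detector for the onboarding phase: count snake_case vs.
# camelCase identifiers and report the dominant naming style.
SNAKE = re.compile(r"\b[a-z]+(_[a-z0-9]+)+\b")
CAMEL = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")

def detect_naming(source):
    snake = len(SNAKE.findall(source))
    camel = len(CAMEL.findall(source))
    if snake == camel:
        return "mixed"
    return "snake_case" if snake > camel else "camelCase"

sample = "def load_config():\n    user_name = get_user()\n"
```

The same tally-and-vote shape generalizes to the other checks in step 2 (test file naming, import style, commit message format): sample, count, record the majority convention in `codebase-profile.json`.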
691
+
692
**Confucius Code Agent approach (arXiv 2512.10398):** Uses scalable scaffolding for real-world codebases — a composable framework that adapts to different project structures. The key insight is that the scaffolding (the project-understanding framework) is reusable across sessions while the specific task changes.

### Evidence

- Confucius Code Agent: scalable scaffolding for production codebases ([arXiv 2512.10398](https://arxiv.org/html/2512.10398v4))
- OpenHands SDK: composable and extensible foundation with reusable workspace packages ([arXiv 2511.03690](https://arxiv.org/html/2511.03690v1))
- AGENTLESS: achieves competitive results by decomposing the problem into localization and repair, without complex agent frameworks ([arXiv 2407.01489](https://arxiv.org/pdf/2407.01489))

### Implications for the Toolkit

Create an `/onboard` command that:
1. Runs environment setup + convention detection + architecture map + safety assessment
2. Produces `codebase-profile.json` (cached, invalidated on major changes)
3. Generates `CONTEXT.md` for prompt injection
4. Ensures all subsequent commands (`/audit`, `/improve`, `/autocode`) read the profile first

The onboard step is idempotent — running it again updates the profile rather than starting from scratch.
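
The idempotent update can be sketched as a merge into the cached profile rather than a rewrite; the keys shown are hypothetical:

```python
import json
from pathlib import Path

def update_profile(path: Path, new_facts: dict) -> dict:
    """Re-running onboarding merges newly discovered facts into the
    cached profile instead of starting from scratch (keys hypothetical)."""
    profile = json.loads(path.read_text()) if path.exists() else {}
    profile.update(new_facts)
    path.write_text(json.dumps(profile, indent=2))
    return profile

path = Path("profile-demo.json")
update_profile(path, {"language": "python", "test_command": "pytest"})
merged = update_profile(path, {"test_command": "pytest -x", "ci": "github-actions"})
print(sorted(merged))  # ['ci', 'language', 'test_command']
```

Running the same update twice yields the same profile, which is what makes scheduling it safe.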

---

## Proposed Audit Pipeline

```
AUDIT PIPELINE

/onboard (first time only — produces codebase-profile.json)
    │
    ▼
/audit
    │
    ├── Phase 1: Discover (deterministic, fast)
    │   ├── File tree + language detection
    │   ├── Dependency graph (imports, packages)
    │   ├── Test suite baseline (detect + run)
    │   ├── Git history analysis (hotspots, churn)
    │   └── Existing metrics (complexity, coverage)
    │
    ├── Phase 2: Assess (LLM-assisted)
    │   ├── Architecture recovery (module map)
    │   ├── Code smell detection (per-file analysis)
    │   ├── Pattern consistency check
    │   ├── Dead code identification
    │   └── Documentation completeness
    │
    ├── Phase 3: Prioritize
    │   ├── Hotspot ranking (complexity × churn × coverage)
    │   ├── Risk classification (low/medium/high)
    │   ├── Auto-fixable vs. human-required
    │   └── Effort estimation
    │
    └── Output: audit-report.json

/improve (reads audit-report.json)
    │
    ├── Select top-N findings by priority
    ├── Generate improvement PRD (tasks/prd.json)
    ├── Generate improvement plan
    │   ├── Batch 1: Zero-risk fixes (dead code, imports)
    │   ├── Batch 2: Characterization tests for hotspots
    │   ├── Batch 3-N: Refactor hotspots (one per batch)
    │   └── Final: Measure improvement
    │
    └── Execute via existing pipeline
        (run-plan.sh / ralph-loop / subagent-dev)

/measure (runs after /improve)
    ├── Compare pre/post metrics
    ├── Calculate health score delta
    └── Output: improvement-report.json

auto-improve.sh (scheduled, continuous)
    ├── Run /audit
    ├── Filter to auto-fixable low-risk findings
    ├── For each: branch → fix → test → PR
    └── Notify via Telegram
```
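
Phase 3 of the pipeline can be sketched as a pure ranking-and-split step. The finding fields and the scoring heuristic below are assumptions, not the final `audit-report.json` schema:

```python
findings = [
    {"file": "core/engine.py", "kind": "high-complexity", "complexity": 24,
     "churn": 31, "coverage": 0.35, "auto_fixable": False},
    {"file": "utils/legacy.py", "kind": "dead-code", "complexity": 3,
     "churn": 1, "coverage": 0.0, "auto_fixable": True},
]

def priority(f: dict) -> float:
    # Hotspot-style score: complex, frequently changed, poorly covered
    # files float to the top of the report.
    return f["complexity"] * f["churn"] * (1.0 - f["coverage"])

ranked = sorted(findings, key=priority, reverse=True)
auto = [f["file"] for f in ranked if f["auto_fixable"]]
human = [f["file"] for f in ranked if not f["auto_fixable"]]
print(human[0])  # core/engine.py
```

Splitting `auto` from `human` at this stage is what lets `auto-improve.sh` act on the safe subset without supervision.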

---

## Toolkit Integration

### New Scripts

| Script | Purpose | Inputs | Outputs |
|--------|---------|--------|---------|
| `audit.sh` | Full audit pipeline (discover + assess + prioritize) | project dir, codebase-profile.json | audit-report.json |
| `onboard.sh` | Generate codebase profile for unfamiliar projects | project dir | codebase-profile.json, CONTEXT.md |
| `hotspot-analysis.sh` | Git + complexity + coverage cross-reference | project dir | hotspots.json |
| `measure-improvement.sh` | Pre/post metric comparison | audit-report.json (baseline), project dir | improvement-report.json |
| `auto-improve.sh` | Scheduled autonomous improvement (audit → fix → PR) | project dir | PRs on GitHub |
| `characterize.sh` | Generate characterization tests for specified files | file list | test files |

### New Commands

| Command | Purpose |
|---------|---------|
| `/audit` | Run full audit, produce report |
| `/improve` | Read audit report, execute improvement plan |
| `/onboard` | Generate codebase profile for new project |
| `/measure` | Compare pre/post metrics |

### New Skills

| Skill | Purpose |
|-------|---------|
| `codebase-audit/SKILL.md` | How to explore, assess, and prioritize an existing codebase |
| `improvement-planning/SKILL.md` | How to plan improvement work (batch ordering, risk management) |
| `characterization-testing/SKILL.md` | How to write golden master / characterization tests |

### Extensions to Existing Components

| Component | Extension |
|-----------|-----------|
| `run-plan.sh` | Add `--pre-step characterize` flag for auto-characterization before refactoring batches |
| `quality-gate.sh` | Add a coverage-no-decrease gate (test coverage must not drop) |
| `entropy-audit.sh` | Replace with, or augment with, `audit.sh` for full hotspot-based analysis |
| `batch-audit.sh` | Use `audit.sh` per project instead of raw `claude -p` |
| `auto-compound.sh` | Add `--mode improve` that reads the audit report instead of analyzing a report file |
| `classify_batch_type()` | Add `refactoring-risk` classification (low/medium/high) |

### Integration with Existing Pipeline

```
EXISTING:   /autocode "Add feature X" → brainstorm → PRD → plan → execute → verify → finish
NEW:        /audit → discover → assess → prioritize → audit-report.json
BRIDGE:     /improve audit-report.json → PRD → plan → execute → verify → measure → finish
CONTINUOUS: auto-improve.sh (timer) → audit → filter → fix → PR → notify
```

The key insight is that `/improve` reuses roughly 80% of the existing pipeline. The audit is the new work; the improvement execution is the existing work with a different input source.
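
The bridge step can be sketched as a pure transformation from audit findings to PRD tasks, batched by risk. The task shape here is hypothetical; the actual `tasks/prd.json` schema may differ:

```python
def findings_to_prd(findings: list) -> dict:
    """Group audit findings into ordered batches: zero-risk fixes first,
    then characterization tests, then one refactor per hotspot."""
    batches = []
    zero_risk = [f for f in findings if f["risk"] == "low" and f["auto_fixable"]]
    if zero_risk:
        batches.append({"name": "zero-risk fixes", "items": zero_risk})
    hotspots = [f for f in findings if f["kind"] == "hotspot"]
    if hotspots:
        batches.append({"name": "characterization tests", "items": hotspots})
        for h in hotspots:
            batches.append({"name": f"refactor {h['file']}", "items": [h]})
    return {"tasks": batches}

prd = findings_to_prd([
    {"file": "a.py", "kind": "dead-code", "risk": "low", "auto_fixable": True},
    {"file": "b.py", "kind": "hotspot", "risk": "high", "auto_fixable": False},
])
print([b["name"] for b in prd["tasks"]])
# ['zero-risk fixes', 'characterization tests', 'refactor b.py']
```

Everything downstream of the PRD is unchanged: the existing plan/execute/verify machinery consumes it as if a human had written it.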

---

## Improvement Metrics Scorecard

### Pre-Audit Baseline (captured by `/audit`)

| Metric | Tool | Command |
|--------|------|---------|
| Test coverage (branch) | coverage.py / Istanbul | `pytest --cov --cov-branch --cov-report=json` |
| Cyclomatic complexity (mean, max, p90) | radon / eslint | `radon cc src/ -j` |
| Dead code count | custom grep / knip | `grep -rc 'import' src/ \| ...` |
| Files over 300 lines | wc -l | `find src/ -name '*.py' -exec wc -l {} + \| awk '$1>300'` |
| Hotspot count (complexity > 10 AND churn > 10) | custom | `hotspot-analysis.sh` |
| Dependency count | pip-audit / npm ls | `pip list --format=json \| jq length` |
| Naming violations | custom grep | `grep -rnE '^def [a-z]+[A-Z]' src/` |
| Documentation coverage | custom | % of public functions with docstrings |

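
As one example, the files-over-300-lines metric can be computed portably in Python, mirroring the `find`/`wc`/`awk` pipeline in the table:

```python
from pathlib import Path
import tempfile

def files_over_limit(root: Path, limit: int = 300, pattern: str = "*.py") -> list:
    """List source files whose line count exceeds the limit."""
    return sorted(
        p.name for p in root.rglob(pattern)
        if len(p.read_text(errors="ignore").splitlines()) > limit
    )

# Demo against a throwaway tree: one long file, one short file.
root = Path(tempfile.mkdtemp())
(root / "big.py").write_text("pass\n" * 400)
(root / "small.py").write_text("pass\n" * 10)
print(files_over_limit(root))  # ['big.py']
```

A pure-Python version avoids shell-quoting pitfalls when the same metric must run on macOS, Linux, and CI images.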
### Post-Improvement Delta (captured by `/measure`)

| Metric | Target | Red Flag |
|--------|--------|----------|
| Test coverage (branch) | +10% or more | Any decrease |
| Cyclomatic complexity (mean) | -20% or more | Any increase |
| Dead code count | -50% or more | No change |
| Files over 300 lines | Zero | Any increase |
| Hotspot count | -30% or more | No change |
| Naming violations | Zero | Any increase |
| Health score | +1.0 or more | Any decrease |

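
A sketch of the `/measure` comparison, encoding the red-flag column above. The metric names and the "no change is a flag" rule are assumptions drawn from the table:

```python
def measure_delta(baseline: dict, current: dict) -> dict:
    """Compare pre/post metrics and flag regressions. Lower is better for
    complexity-style metrics; for dead code and hotspots, no change is
    itself a red flag (the improvement run did nothing)."""
    lower_is_better = {"complexity_mean", "dead_code_count",
                       "files_over_300", "hotspot_count", "naming_violations"}
    stagnation_flagged = {"dead_code_count", "hotspot_count"}
    report = {}
    for name, before in baseline.items():
        delta = current[name] - before
        if name in lower_is_better:
            red_flag = delta >= 0 if name in stagnation_flagged else delta > 0
        else:  # higher is better (coverage, health score)
            red_flag = delta < 0
        report[name] = {"delta": round(delta, 4), "red_flag": red_flag}
    return report

report = measure_delta(
    {"coverage_branch": 0.62, "complexity_mean": 8.4, "hotspot_count": 12},
    {"coverage_branch": 0.71, "complexity_mean": 6.9, "hotspot_count": 12},
)
print(report["hotspot_count"])  # {'delta': 0, 'red_flag': True}
```

Any `red_flag: true` entry would cause `/measure` to fail the improvement run rather than silently report it.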
### Composite Health Score

```
health_score = (
    0.25 * normalize(test_coverage_branch) +
    0.20 * normalize(1 / mean_cyclomatic_complexity) +
    0.20 * normalize(1 / hotspot_count) +
    0.15 * normalize(1 / dead_code_ratio) +
    0.10 * normalize(1 / max_file_size_ratio) +
    0.10 * normalize(1 / coupling_score)
) * 10
```

Scale: 1 (critical issues) to 10 (excellent health). Mirrors CodeScene's Code Health score for familiarity.

---

## Sources

### Academic Papers and Benchmarks

- [ArchAgent: Scalable Legacy Software Architecture Recovery with LLMs](https://arxiv.org/html/2601.13007) — arXiv 2025
- [SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/pdf/2509.16941) — Scale AI 2025
- [SWE-EVO: Benchmarking Coding Agents in Software Evolution](https://www.arxiv.org/pdf/2512.18470v1) — arXiv 2025
- [LLM-Driven Code Refactoring: Opportunities and Limitations](https://seal-queensu.github.io/publications/pdf/IDE-Jonathan-2025.pdf) — ICSE 2025 IDE Workshop
- [Understanding Codebase like a Professional: Human-AI Collaboration](https://arxiv.org/html/2504.04553) — arXiv 2025
- [DCE-LLM: Dead Code Elimination with Large Language Models](https://aclanthology.org/2025.naacl-long.501/) — NAACL 2025
- [Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases](https://arxiv.org/html/2512.10398v4) — arXiv 2025
- [AGENTLESS: Demystifying LLM-based Software Engineering Agents](https://arxiv.org/pdf/2407.01489) — arXiv 2024
- [OpenHands SDK: Composable and Extensible Foundation for Production Agents](https://arxiv.org/html/2511.03690v1) — arXiv 2025
- [LoCoBench-Agent: Interactive Benchmark for LLM Agents](https://arxiv.org/pdf/2511.13998) — arXiv 2025

### Industry Tools and Documentation

- [CodeScene: Behavioral Code Analysis](https://codescene.com/product/behavioral-code-analysis)
- [CodeScene: Technical Debt Prioritization](https://codescene.com/blog/prioritize-technical-debt-by-impact/)
- [Semgrep vs. SonarQube Comparison](https://semgrep.dev/docs/faq/comparisons/sonarqube)
- [SWE-bench Overview](https://www.swebench.com/SWE-bench/)
- [OpenHands: AI-Driven Development](https://github.com/OpenHands/OpenHands)
- [Continuous Claude](https://anandchowdhary.com/open-source/2025/continuous-claude)
- [Meta Engineering: Automating Dead Code Cleanup](https://engineering.fb.com/2023/10/24/data-infrastructure/automating-dead-code-cleanup/)

### Books and Foundational References

- Michael Feathers, *Working Effectively with Legacy Code* (2004) — characterization tests, legacy code definition
- Joel Spolsky, ["Things You Should Never Do, Part I"](https://bssw.io/items/things-you-should-never-do-part-i) — the rewrite anti-pattern
- Thomas J. McCabe, "A Complexity Measure" (1976) — cyclomatic complexity

### Practical Guides

- [Augment Code: AI-Powered Legacy Code Refactoring](https://www.augmentcode.com/learn/ai-powered-legacy-code-refactoring)
- [IBM: What Is AI Code Refactoring?](https://www.ibm.com/think/topics/ai-code-refactoring)
- [Solo Sentinel: AI-Powered Lightweight Code Audits](https://maddevs.io/writeups/practical-guide-to-lightweight-audits-in-the-age-of-ai/)
- [Golden Master Testing for Legacy Code](https://blog.thecodewhisperer.com/permalink/surviving-legacy-code-with-golden-master-and-sampling)
- [Strangler Fig Pattern — Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/patterns/strangler-fig)
- [Strangler Fig Pattern — AWS Prescriptive Guidance](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/strangler-fig.html)
- [GetDX: Why Cyclomatic Complexity Misleads](https://getdx.com/blog/cyclomatic-complexity/)
- [Enterprise AI Refactoring Best Practices](https://getdx.com/blog/enterprise-ai-refactoring-best-practices/)
- [Qodo: Unit Testing vs Integration Testing with AI](https://www.qodo.ai/blog/unit-testing-vs-integration-testing-ais-role-in-redefining-software-quality/)