autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
# Research: Verification Effectiveness — What Actually Catches Bugs in AI-Generated Code?

**Date:** 2026-02-22
**Researcher:** Claude Opus 4.6 (research agent)
**Context:** autonomous-coding-toolkit quality gate pipeline evaluation
**Status:** Complete

---

## Executive Summary

The toolkit's current quality gate pipeline (lesson-check, lint, test suite, memory check, test count regression, git clean) is well-designed but has measurable gaps. The evidence says:

1. **Static analysis (linting) catches 40-52% of defects** in isolation, with false positive rates of 20-76% depending on configuration. The toolkit's narrow rule selection (`--select E,W,F`) is the right call — it trades breadth for signal.
2. **Test suites are the single highest-ROI verification stage** but miss 33-67% of AI-specific bug types (hallucinated objects, prompt-biased code, missing corner cases) that don't trigger existing test paths.
3. **Pattern-based checks (lesson-check) are high-signal, low-noise** when scoped to syntactic patterns. The toolkit's design rule — syntactic to grep, semantic to AI — is empirically sound. False positive rates for well-scoped regex patterns are near-zero.
4. **Two high-value techniques are missing:** property-based testing (50x more mutations caught per test than unit tests) and mutation testing (reveals test suite weakness that coverage metrics hide).
5. **Test count monotonicity is a useful but incomplete invariant.** It catches test deletion and discovery breakage but not test weakening (a passing test that no longer exercises the code path it claims to).
6. **Diminishing returns set in around stage 4-5** in a sequential pipeline, but the toolkit's stages are largely orthogonal — they catch different bug classes with minimal overlap.

**Bottom line recommendation:** Add property-based testing guidance to the plan-writing skill and investigate LLM-powered mutation testing as a verification-time check. The existing pipeline is sound; the biggest gap is not in the gates but in the test quality they enforce.

---

## 1. What Types of Verification Actually Catch Bugs in AI-Generated Code?

### Findings

AI-generated code has a **distinct bug distribution** compared to human-written code. An empirical study of 333 bugs across CodeGen, PanGu-Coder, and Codex identified 10 distinctive patterns (Tambon et al., 2024):

| Bug Pattern | Prevalence | Detectable By |
|-------------|-----------|---------------|
| Misinterpretations | High | Code review, spec compliance check |
| Syntax Error | Medium | Linter, compiler |
| Silly Mistake | Medium | Test suite, linter |
| Prompt-biased code | High | Spec compliance review |
| Missing Corner Case | High | Property-based testing, mutation testing |
| Wrong Input Type | Medium | Type checker, test suite |
| Hallucinated Object | Medium | Linter (undefined name), test suite |
| Wrong Attribute | Medium | Linter, type checker, test suite |
| Incomplete Generation | Medium | Spec compliance check, PRD verification |
| Non-Prompted Consideration | Low | Code review, integration testing |

Several patterns — Hallucinated Object, Wrong Attribute, Silly Mistake — are **less common in human-written code**, meaning verification pipelines designed for human developers have blind spots (Tambon et al., 2024).

A separate large-scale study found the most common semantic error type is "Garbage Code" (27-38% of errors), and the most common syntactic error is "Code Block Error" (43-60%) (Wang et al., 2024).

### Evidence

- **Qodo 2025 report:** 17% of 1M pull requests contained high-severity issues (score 9-10) that would have passed manual review under time pressure.
- **AI code has 1.7x more issues and bugs** than human-written code, with up to 75% more logic and correctness issues in areas contributing to downstream incidents (Greptile State of AI Coding 2025).
- **12-65% of LLM-generated code snippets** are non-compliant with basic secure coding standards or trigger CWE-classified vulnerabilities (multiple studies, summarized in Georgetown CSET 2024).

### Implications for the Toolkit

The quality gate pipeline catches **syntax errors** (linter), **runtime failures** (test suite), and **known anti-patterns** (lesson-check). It does NOT systematically catch:
- Missing corner cases (highest-prevalence LLM bug)
- Prompt-biased code (code that satisfies the prompt but misunderstands the requirement)
- Hallucinated objects that happen to not be exercised by existing tests

**Confidence: HIGH** — Multiple independent studies converge on the same bug taxonomy.
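
The third blind spot can be made concrete with a minimal sketch (all names are hypothetical): a hallucinated attribute sits on an error path that no existing test exercises, so the lint gate and the test gate both stay green.

```python
class Config:
    """Illustrative config object; `retries` is its only real attribute."""
    def __init__(self):
        self.retries = 3

def fetch(config, fail=False):
    if not fail:
        return "ok"
    # Hallucinated: Config has no `retry_delay` attribute. Lint passes because
    # this is a well-formed attribute access; only a test (or type checker)
    # that reaches this branch would catch it.
    return f"retrying after {config.retry_delay}s"

def test_fetch_happy_path():
    assert fetch(Config()) == "ok"

test_fetch_happy_path()  # suite is green; the failure path is never exercised
```

Calling `fetch(Config(), fail=True)` raises `AttributeError`, which is exactly the class of bug the table assigns to test suites and type checkers rather than to pattern checks.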

---

## 2. Empirical Evidence for Static Analysis Catching LLM-Generated Bugs

### Findings

Static analysis tools have a complicated relationship with AI-generated code:

- **Semgrep baseline:** True positive rate of 80.49%, false positive rate of 39.09% on vulnerability detection. When combined with LLMs for triage, false positive rates dropped while true positive rates increased (UC-authored study, 2024).
- **Combined tools improve coverage by 26%:** A single static analysis tool warns on 52% of vulnerable code changes. Combining multiple tools increases detection to ~66% (ICSE empirical study, 2024).
- **Top-performing analyzers still miss 47-80% of vulnerabilities** depending on the evaluation scenario (TU Munich study, 2023).
- **Ruff** implements 900+ rules and runs 10-150x faster than Flake8/Pylint. The toolkit's `--select E,W,F` limits to errors, warnings, and pyflakes — approximately 150 rules focused on the highest-signal categories.

### Evidence on False Positives

- **76% of warnings in vulnerable changes are irrelevant** to the actual vulnerability (ICSE 2024).
- **10-20 minutes of manual inspection per false alarm** — this is why industrial teams report "alert fatigue" (Huawei empirical study, 2025).
- **Developers tolerate ~20% false positive rate** as a traditional bound, though recent work shows higher tolerance in practice.

### Implications for the Toolkit

The toolkit's approach is sound: **narrow rule selection reduces false positives** while catching the most impactful error classes (undefined names, syntax errors, unused imports). The ast-grep addition (5 structural patterns) adds AST-level precision that regex grep cannot achieve.

**Gap:** The lint stage runs only `E,W,F` categories. Adding `B` (bugbear) rules would catch additional logic errors (e.g., mutable default arguments, unreliable `__all__` definitions) at low false-positive cost. The `S` (bandit/security) rules are worth evaluating for security-sensitive projects.

**Confidence: HIGH** — Data from multiple industrial and academic studies.
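
A minimal sketch of the suggested widening, as a `pyproject.toml` fragment (the selector letters are real Ruff rule groups; treating `B` as blocking is this report's suggestion, not current toolkit behavior):

```toml
[tool.ruff.lint]
# The current gate keeps E (pycodestyle errors), W (warnings), F (pyflakes).
# Adding B (flake8-bugbear) catches logic errors such as mutable default
# arguments; add "S" (bandit) only for security-sensitive projects.
select = ["E", "W", "F", "B"]
```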

---

## 3. Test Suite Effectiveness: AI Errors vs. Human Errors

### Findings

Test suites designed for human code have systematic blind spots for AI-generated bugs:

- **SWE-bench evaluation model:** Uses FAIL_TO_PASS tests (does the patch fix the issue?) and PASS_TO_PASS tests (does the patch break anything else?). Both must pass. This is the gold standard for verifying AI coding agent output.
- **SWE-bench Verified:** Human annotators found that many original SWE-bench test cases were unreliable — leading to a curated 500-sample subset. This validates that test quality matters as much as test existence.
- **Top SWE-bench agents solve ~33-50% of issues** (as of late 2025), suggesting even well-tested codebases leave significant room for AI agents to produce unverifiable patches.
- **AI-generated tests have quality issues:** When AI generates both code and tests, the tests may be biased toward the implementation's actual behavior rather than the specification's intended behavior. This creates a circular validation problem.

### Bug Distribution Differences

| Dimension | Human Bugs | AI Bugs |
|-----------|-----------|---------|
| Root cause | Logic errors, off-by-one, race conditions | Hallucinations, prompt misinterpretation, missing context |
| Locality | Usually in the changed function | Can span hallucinated imports, wrong modules |
| Detectability by tests | High (developers write tests for known risk areas) | Medium (tests don't cover "impossible" states) |
| Edge cases | Sometimes missed | Systematically missed |
| Security | Varies | 12-65% non-compliant with basic standards |

### Implications for the Toolkit

The test suite gate is the highest-value single check, but its effectiveness depends entirely on test quality. The toolkit's TDD discipline (write failing test first, confirm fail, implement, confirm pass) is a strong mitigation for circular validation.

**Gap:** The toolkit enforces test *existence* (test count monotonicity) and test *passage* (exit 0) but not test *quality*. A test that asserts `True` passes both gates. Mutation testing would close this gap.

**Confidence: HIGH** for bug distribution differences. **MEDIUM** for the specific percentages, which vary by model and task.
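
The gap fits in a few lines (function and test names are hypothetical): both gates see two tests and a zero exit code, but the second test verifies nothing.

```python
def parse_port(value):
    # Bug: no range check, so parse_port("99999") "succeeds".
    return int(value)

def test_parse_port_basic():
    assert parse_port("8080") == 8080

def test_parse_port_rejects_out_of_range():
    assert True  # vacuous: satisfies the count gate and the pass gate

test_parse_port_basic()
test_parse_port_rejects_out_of_range()
```

No possible mutant of `parse_port` can make the second test fail, which is the signal mutation testing surfaces and coverage metrics hide.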

---

## 4. False Positive Rate of Pattern-Based Checks (lesson-check)

### Findings

The lesson-check system uses **syntactic regex patterns** loaded from YAML frontmatter in lesson files. This is a fundamentally different approach from traditional static analysis:

| Check Type | Typical False Positive Rate | lesson-check Design |
|------------|---------------------------|-------------------|
| General static analysis (Semgrep, etc.) | 39-76% | N/A |
| Narrow regex on known anti-patterns | 1-5% | This is what lesson-check does |
| AST-based structural patterns | 5-15% | ast-grep stage |
| AI-assisted semantic analysis | 10-25% | lesson-scanner agent |

The toolkit's explicit design rule — "syntactic patterns (near-zero false positives) go to lesson-check; semantic patterns (needs context) go to lesson-scanner agent" — is empirically sound. The current 6 checks target extremely specific patterns:
1. `except:` without logging — unambiguous anti-pattern
2. `async def` without `await` — unambiguous (with rare legitimate exceptions)
3. `create_task` without `done_callback` — project-specific, high confidence
4. `hub.cache` direct access — project-specific, high confidence
5. HA automation singular keys — domain-specific, high confidence
6. `.venv/bin/pip` wrong path — exact string match

These are **precision-optimized checks**: they sacrifice recall (they won't catch all instances of the underlying problem) for near-zero false positives. This is the right trade-off for a gate that blocks batch progression.

### Evidence

- **Semgrep AST-level matching** reduces false positives by 25% and increases true positives by 250% compared to regex-only approaches (Semgrep documentation).
- The toolkit already uses ast-grep for 5 structural patterns as an advisory (non-blocking) check, which is the right escalation: regex for blocking, AST for advisory, AI for verification-time.

### Implications for the Toolkit

The false positive rate of lesson-check is likely **<2%** given the narrow, project-specific patterns. The main risk is **false negatives** — anti-patterns that exist but don't match the regex. This is acceptable because the checks compound over time as new lessons are added.

**Recommendation:** Track false positive and false negative rates explicitly. Add a `--stats` flag to lesson-check that reports matches per pattern over time. This creates an empirical feedback loop.

**Confidence: HIGH** on the design approach. **MEDIUM** on the specific false positive percentage (estimated from similar tools, not measured on the toolkit itself).
158
+
159
+ ---
160
+
161
+ ## 5. High-Value Verification Techniques Missing from the Toolkit
162
+
163
+ ### 5a. Property-Based Testing
164
+
165
+ **Evidence:** An empirical evaluation of 40 Python projects found that **each property-based test finds ~50x as many mutations as the average unit test** (OOPSLA 2025, UC San Diego). Among PBT categories, exception-finding and collection-inclusion tests are 19x more effective than other types. **76% of mutations discovered by PBT are found within the first 20 inputs** — making it fast enough for a quality gate.
166
+
167
+ Combining property-based and example-based testing improved bug detection from 68.75% (each alone) to **81.25%** (combined).
168
+
169
+ **Agentic PBT:** A 2025 paper describes using AI agents to automatically write Hypothesis tests across the Python ecosystem — suggesting LLM agents could generate property-based tests as part of the plan-writing stage.
170
+
171
+ **Recommendation:** Add property-based testing guidance to the `writing-plans` skill. For functions with clear invariants (parsers, serializers, validators, transformers), the plan should specify Hypothesis-based property tests alongside example-based unit tests. **HIGH confidence this adds value.**
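To make the invariant idea concrete, here is a dependency-free sketch of a round-trip property in plain Python (the `rle_encode`/`rle_decode` codec is hypothetical; a real plan would specify this as a Hypothesis `@given` test, which adds input-generation strategies, shrinking to minimal counterexamples, and a failure database):

```python
import random

# Hypothetical run-length codec, used only to illustrate the property.
def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip_property(trials=200, seed=0):
    """Property: decode(encode(s)) == s for many random strings,
    including the empty string and long runs of repeated characters."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice("ab ") for _ in range(rng.randrange(0, 30)))
        assert rle_decode(rle_encode(s)) == s
    return True

print(check_roundtrip_property())  # → True
```

The point is that one property exercises hundreds of inputs, including corner cases (empty input, maximal runs) that example-based tests routinely omit.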
### 5b. Mutation Testing

**Evidence:** Meta deployed LLM-powered mutation testing (ACH tool) in production: **73% of generated tests accepted by engineers, 36% judged as privacy-relevant** (Meta Engineering, 2025). LLM-generated mutants have a **93.4% fault detection rate** vs. 51.3% (PIT) and 74.4% (Major) for traditional mutation tools (MutGen study).

High code coverage **does not imply strong fault detection** when measured by mutation score — validating that test count and pass rate are insufficient quality metrics.

**Recommendation:** Investigate `mutmut` (Python mutation testing) or LLM-based mutation as a verification-time check. Too slow for between-batch quality gates, but viable as a `/verify` stage addition. **MEDIUM confidence on practical integration** — mutation testing is slow and requires careful configuration.
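A minimal sketch of why mutation score measures test quality where count and coverage do not (the function and tests are hypothetical; tools like `mutmut` generate operator mutants of this kind automatically):

```python
# Original function and a typical operator mutant (">=" flipped to ">").
def is_adult(age):
    return age >= 18

def is_adult_mutant(age):
    return age > 18  # the mutant

def weak_test(fn):
    # Exercises only interior points — full line coverage, yet the mutant survives.
    return fn(30) is True and fn(5) is False

def strong_test(fn):
    # Adds the boundary case, which kills the mutant.
    return weak_test(fn) and fn(18) is True

print(weak_test(is_adult), weak_test(is_adult_mutant))      # → True True  (mutant survives)
print(strong_test(is_adult), strong_test(is_adult_mutant))  # → True False (mutant killed)
```

A surviving mutant is exactly the signal the test-count gate cannot produce: both tests "pass", but only the strong one constrains the implementation.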
### 5c. Formal Verification and Symbolic Execution

**Evidence:** Martin Kleppmann (2025) predicts AI will bring formal verification mainstream via "vericoding" — LLMs generating formally verified code. A proof-carrying pipeline using static analysis + symbolic execution + bounded model checking was demonstrated in regulated industries (Formal Verification for AI-Assisted Code Changes, 2024). An LLM-powered symbolic execution tool verified correct code in **83% of cases** on a 21-task benchmark.

**Recommendation:** Not practical for the toolkit today. Formal verification requires specification languages and theorem provers that add significant complexity. **LOW confidence it's worth the integration cost** for a general-purpose coding toolkit. Revisit when vericoding tools mature (likely 12-18 months).

### 5d. AI-Powered Code Review

**Evidence:** 2025 benchmarks show most AI code review tools catching 42-48% of bugs, with Greptile leading at 82% and CodeRabbit at 46%. These tools operate on PR diffs and reason about downstream impact.

**Relevance:** The toolkit already has `requesting-code-review` and `receiving-code-review` skills, plus the spec-compliance and code-quality reviewer subagents. This is a strength. The gap is that the review is done by the same model that wrote the code — cross-model review (e.g., using a different LLM for review) could catch model-specific blind spots.

**Confidence: MEDIUM** — the concept is sound, but there is no empirical data on cross-model review effectiveness.

---

## 6. Academic Literature on Verifying AI-Generated Code

### Key Papers

1. **"Bugs in Large Language Models Generated Code: An Empirical Study"** (Tambon et al., 2024, Empirical Software Engineering) — 333 bugs, 10 bug patterns, validated by 34 practitioners. Established that LLM bugs have a distinct taxonomy from human bugs.

2. **"A Survey of Bugs in AI-Generated Code"** (Dec 2025, arXiv 2512.05239) — Comprehensive survey covering logical bugs, code duplication, inconsistent styles, performance issues, and security vulnerabilities. Root causes: flawed training data and inherent model limitations (hallucinations, lack of semantic reasoning).

3. **"What's Wrong with Your Code Generated by Large Language Models?"** (Wang et al., 2024) — Developed a 3-category, 12-sub-category taxonomy. Found that benchmark bug distributions differ from real-world bug distributions.

4. **"A Dual Perspective Review on LLMs and Code Verification"** (Frontiers in Computer Science, 2025) — Reviews both using LLMs to verify code and verifying LLM-generated code. Identifies the circular problem: LLMs used to verify their own output.

5. **"AI-Powered Code Review with LLMs: Early Results"** (arXiv 2404.18496) — Found that LLM-assisted code review improved detection rates but introduced new failure modes (overconfidence in incorrect suggestions).

6. **"Reducing False Positives in Static Bug Detection with LLMs"** (Huawei, 2025) — Industrial study showing LLMs can triage static analysis alerts, reducing manual inspection burden by filtering false positives.

### Emerging Themes

- **Verification is harder than generation.** The research community has more work on generating code than on verifying it.
- **Circular validation is the central risk.** When the same model (or similar models) both generate and verify, they share blind spots.
- **Hybrid approaches work best.** Static analysis + test suite + AI review > any single technique.
- **Bug distributions shift with model capability.** As models improve, syntax errors decrease but semantic/logic errors persist.

**Confidence: HIGH** — Well-established academic literature with converging findings.

---

## 7. SWE-bench vs. Toolkit Quality Gates

### SWE-bench Evaluation Model

SWE-bench evaluates patches by:

1. **FAIL_TO_PASS tests:** Tests that should pass after the patch (does it fix the issue?)
2. **PASS_TO_PASS tests:** Tests that should still pass after the patch (does it break anything?)

Both sets must pass for the patch to be considered resolved.
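In code, the resolution rule is simply the conjunction of the two sets (a sketch with hypothetical test IDs and a plain result dict):

```python
def resolved(results, fail_to_pass, pass_to_pass):
    """results maps test id -> True (passed after the patch) / False (failed).
    A patch is resolved only if every test in BOTH sets passes."""
    return all(results.get(t, False) for t in fail_to_pass) and \
           all(results.get(t, False) for t in pass_to_pass)

results = {"test_fix": True, "test_existing_a": True, "test_existing_b": False}
print(resolved(results, ["test_fix"], ["test_existing_a", "test_existing_b"]))  # → False
```

Note the `results.get(t, False)` default: a test that disappears from the run counts as a failure, which is stricter than a bare count comparison.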
**SWE-bench Verified** adds human annotation to filter out:

- Ambiguous issue descriptions
- Unreliable unit tests
- Under-specified test criteria

### Comparison with Toolkit Quality Gates

| Criterion | SWE-bench | Toolkit Quality Gates |
|-----------|-----------|----------------------|
| Test passage | FAIL_TO_PASS + PASS_TO_PASS | pytest/npm test (all pass) |
| Anti-pattern detection | None | lesson-check (syntactic), ast-grep (structural) |
| Lint | None | ruff --select E,W,F |
| Test quality | Human-validated tests | Test count monotonicity only |
| Spec compliance | Issue description match | PRD acceptance criteria (shell commands) |
| Regression prevention | PASS_TO_PASS tests | Test count + git clean |
| Memory safety | N/A | Advisory memory check |

### Key Differences

1. **SWE-bench has no anti-pattern detection.** The toolkit's lesson-check is a strictly additive verification that SWE-bench doesn't attempt. This is a strength.
2. **SWE-bench uses curated tests.** The toolkit relies on project tests, which may or may not cover the relevant code paths. The PRD system (shell-command acceptance criteria) partially addresses this.
3. **SWE-bench has no incremental verification.** It evaluates the final patch. The toolkit runs gates between every batch, catching drift early. This is a significant architectural advantage.
4. **SWE-bench doesn't check for silent degradation.** The toolkit's test count monotonicity catches test deletion that SWE-bench would miss.

**Confidence: HIGH** — Direct comparison against publicly documented evaluation criteria.
---

## 8. ROI Curve of Adding More Verification Stages

### Findings

The research consistently shows **diminishing returns** from additional verification stages, but with important nuances:

**General pattern:**

```
Bug Detection Rate (%)
100 |                                __________
    |                          ____/
 80 |                    ____/
    |              ____/
 60 |         ____/
    |    ____/
 40 |___/
    |
 20 |
    |___________________________________________
      0    1    2    3    4    5    6    7    8
             Number of Verification Stages
```

**Key evidence:**

- **Combining static analysis tools increases detection by 26%** over a single tool (from 52% to 66%) — a meaningful but diminishing gain.
- **AI code review adds 42-48% detection** over no review, but tools overlap: CodeRabbit + Copilot together don't find 90% — they find maybe 55-60%.
- **Quality gates should be incremental:** "Start small, add gates incrementally" is the consistent best practice (InfoQ, Sonar, Perforce).
- **Pipeline speed matters:** A gate that takes >5 minutes per batch is a gate that developers (and agents) route around. The toolkit's lesson-check (<2s) has essentially zero friction cost.

### The Toolkit's ROI Breakdown (estimated)

| Stage | Estimated Marginal Bug Detection | Speed | ROI |
|-------|--------------------------------|-------|-----|
| 1. lesson-check | 5-10% (known anti-patterns) | <2s | Very High (near-zero cost) |
| 2. Lint (ruff) | 15-25% (syntax, style, imports) | <5s | High |
| 3. Test suite | 40-60% (runtime behavior) | 10-120s | Highest absolute |
| 4. ast-grep | 3-8% (structural patterns) | <3s | High (low cost) |
| 5. Test count monotonicity | 2-5% (test deletion/discovery) | <1s | High (near-zero cost) |
| 6. Git clean check | 1-3% (uncommitted drift) | <1s | High (near-zero cost) |
| 7. Memory check | 0% (prevents OOM, not bugs) | <1s | Moderate (operational) |

**Total estimated detection: 60-80%** of defects that would otherwise reach the next batch. The remaining 20-40% are primarily:

- Logic errors that pass all existing tests
- Missing corner cases with no test coverage
- Semantic misunderstandings of requirements

### Implications

The toolkit is **past the steep part of the ROI curve** for its current verification approach. Adding more of the same type of check (more linting rules, more regex patterns) yields diminishing returns. The highest-ROI additions are **orthogonal techniques** that catch fundamentally different bug classes:

- Property-based testing (corner cases)
- Mutation testing (test quality)
- Cross-model review (model-specific blind spots)

**Confidence: MEDIUM** — The marginal detection percentages are estimates extrapolated from literature, not measured on the toolkit.
---

## 9. Is Test Count Monotonicity a Useful Invariant?

### Analysis

**What it catches:**

- Accidental test deletion (agent removes test file, renames incorrectly)
- Test discovery breakage (conftest changes, import errors that silently skip tests)
- Wholesale test replacement with fewer tests
- Agent "simplifying" a test suite by removing tests it considers redundant

**What it misses:**

- **Test weakening:** A test that previously asserted specific behavior now asserts `True` — count unchanged, quality degraded.
- **Tautological tests:** New tests that always pass regardless of implementation — count increases, quality unchanged.
- **Coverage regression:** Tests move to cover new code but abandon coverage of old code — count may increase, protection decreases.
- **Flaky test masking:** A flaky test that intermittently fails is replaced with one that always passes — same count, less signal.

### Evidence

- SWE-bench's PASS_TO_PASS test set is a more rigorous version of monotonicity — it verifies that specific pre-existing tests still pass, not just that the count is maintained.
- Mutation testing research shows that **high test count and high coverage do not imply high fault detection** (MutGen study, multiple others). This directly challenges count as a quality proxy.
- However, test count monotonicity has near-zero cost (<1s, simple integer comparison) and catches a real failure mode specific to AI agents: the tendency to "clean up" by removing tests.

### Recommendation

**Keep test count monotonicity** — it's a cheap, useful invariant that catches a real AI-agent failure mode. But **don't treat it as a test quality metric.** Add:

1. **Test coverage monotonicity** (optional, slower): `coverage run` + compare percentages. More expensive but more meaningful.
2. **Mutation score sampling** (at verification time): Run mutmut on changed files only. Detects test weakening.
3. **Test assertion density** (cheap heuristic): Count `assert` statements per test function. Declining density suggests test weakening.
**Confidence: HIGH** that monotonicity is useful. **HIGH** that it's insufficient alone.

---

## Recommendations

### Immediate (Low Effort, High Impact)

1. **Add `B` (bugbear) rules to ruff** — `--select E,W,F,B` catches mutable default arguments, unreliable `__all__`, and other logic bugs at near-zero false positive cost. ~5 minute change.

2. **Track lesson-check statistics** — Add `--stats` mode that logs pattern match counts over time. Creates the empirical feedback loop needed to validate false positive/negative rates. ~2 hours.

3. **Add property-based testing guidance to writing-plans skill** — For functions with clear invariants, plans should specify Hypothesis property tests. Does not require tooling changes. ~30 minutes.

### Medium-Term (Moderate Effort, High Impact)

4. **Test assertion density check** — Add a quality gate stage that counts `assert` statements per test function. Flag functions with zero asserts (tautological tests). ~4 hours.

5. **Coverage monotonicity (optional gate)** — Run `coverage run` and compare to baseline. More meaningful than test count alone but slower. Gate on decrease >5% to avoid noise. ~1 day.

6. **Cross-batch test diff** — Instead of just counting tests, diff the test function names between batches. Catches renames and replacements that maintain count but change coverage. ~4 hours.
### Long-Term (High Effort, High Impact)

7. **Mutation testing at verification time** — Run `mutmut` on changed files during `/verify`. Too slow for between-batch gates but viable as a pre-merge check. ~2-3 days to integrate.

8. **LLM-generated property tests** — At plan-writing time, use the LLM to generate Hypothesis property tests for new functions. These become part of the test suite and run in the normal quality gate. ~1 week.

9. **Cross-model review option** — For critical batches, route the code-quality review subagent through a different model (e.g., if implementation used Sonnet, review with Opus). Requires model routing infrastructure. ~1 week.

---
## Sources

### Academic Papers

- Tambon et al. (2024). ["Bugs in Large Language Models Generated Code: An Empirical Study"](https://arxiv.org/abs/2403.08937). Empirical Software Engineering, Springer.
- Wang et al. (2024). ["What's Wrong with Your Code Generated by Large Language Models? An Extensive Study"](https://arxiv.org/html/2407.06153v1). arXiv.
- Survey (2025). ["A Survey of Bugs in AI-Generated Code"](https://arxiv.org/abs/2512.05239). arXiv.
- Frontiers (2025). ["A Dual Perspective Review on LLMs and Code Verification"](https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2025.1655469/full). Frontiers in Computer Science.
- Li & Hao (2023). ["Assisting Static Analysis with Large Language Models: A ChatGPT Experiment"](https://www.semanticscholar.org/paper/Assisting-Static-Analysis-with-Large-Language-A-Li-Hao/80d9aa1cf1caa0f2115cca527a27f197c884b430). Semantic Scholar.
- Huawei (2025). ["Reducing False Positives in Static Bug Detection with LLMs: An Empirical Study in Industry"](https://arxiv.org/abs/2601.18844). arXiv.
- UC study (2024). ["Enhancing Static Analysis with LLMs to Detect Software Vulnerabilities"](https://escholarship.org/content/qt0kj3k9h9/qt0kj3k9h9.pdf). eScholarship.
- ICSE (2024). ["An Empirical Study of Static Analysis Tools for Secure Code Review"](https://arxiv.org/abs/2407.12241). arXiv.
- TU Munich (2023). ["An Empirical Study on the Effectiveness of Static C Code Analyzers"](https://mediatum.ub.tum.de/doc/1659728/1659728.pdf). MediaTUM.
- ICSE (2024). ["An Empirical Study on the Use of Static Analysis Tools"](https://machiry.github.io/files/emsast.pdf). ICSE Proceedings.

### Testing and Mutation Research

- OOPSLA 2025. ["An Empirical Evaluation of Property-Based Testing in Python"](https://cseweb.ucsd.edu/~mcoblenz/assets/pdf/OOPSLA_2025_PBT.pdf). UC San Diego.
- (2025). ["Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem"](https://arxiv.org/html/2510.09907v1). arXiv.
- Meta Engineering (2025). ["LLMs Are the Key to Mutation Testing and Better Compliance"](https://engineering.fb.com/2025/09/30/security/llms-are-the-key-to-mutation-testing-and-better-compliance/). Meta Engineering Blog.
- (2024). ["Effective Test Generation Using Pre-trained LLMs and Mutation Testing"](https://www.sciencedirect.com/science/article/abs/pii/S0950584924000739). Information and Software Technology.
- (2025). ["On Mutation-Guided Unit Test Generation"](https://arxiv.org/html/2506.02954v2). arXiv.

### SWE-bench

- OpenAI (2024). ["Introducing SWE-bench Verified"](https://openai.com/index/introducing-swe-bench-verified/). OpenAI Blog.
- Scale AI. ["SWE-Bench Pro"](https://scale.com/leaderboard/swe_bench_pro_public). Scale AI Leaderboard.
- Epoch AI. ["What Skills Does SWE-bench Verified Evaluate?"](https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate). Epoch AI Blog.
- (2025). ["SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?"](https://arxiv.org/pdf/2509.16941). arXiv.

### Industry Reports and Benchmarks

- Qodo (2025). ["State of AI Code Quality in 2025"](https://www.qodo.ai/reports/state-of-ai-code-quality/). Qodo.
- Greptile (2025). ["The State of AI Coding 2025"](https://www.greptile.com/state-of-ai-coding-2025). Greptile.
- Greptile (2025). ["AI Code Review Benchmarks 2025"](https://www.greptile.com/benchmarks). Greptile.
- CodeRabbit (2025). ["2025 Was the Year of AI Speed. 2026 Will Be the Year of AI Quality."](https://www.coderabbit.ai/blog/2025-was-the-year-of-ai-speed-2026-will-be-the-year-of-ai-quality). CodeRabbit Blog.
- Georgetown CSET (2024). ["Cybersecurity Risks of AI-Generated Code"](https://cset.georgetown.edu/wp-content/uploads/CSET-Cybersecurity-Risks-of-AI-Generated-Code.pdf). Georgetown University.
- METR (2025). ["Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/). METR.

### Formal Verification and Symbolic Execution

- Kleppmann (2025). ["Prediction: AI Will Make Formal Verification Go Mainstream"](https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html). Martin Kleppmann's Blog.
- (2025). ["Towards Formal Verification of LLM-Generated Code from Natural Language Prompts"](https://arxiv.org/pdf/2507.13290). arXiv.
- (2024). ["Formal Verification for AI-Assisted Code Changes in Regulated Environments"](https://computerfraudsecurity.com/index.php/journal/article/view/793). Computer Fraud & Security.
- (2024). ["Automating the Correctness Assessment of AI-Generated Code for Security Contexts"](https://www.sciencedirect.com/science/article/pii/S0164121224001584). Journal of Systems and Software.
- (2025). ["Large Language Model Powered Symbolic Execution"](https://mengrj.github.io/pdfs/autobug-oopsla25.pdf). OOPSLA 2025.

### Tools and Comparisons

- ast-grep. ["Comparison With Other Frameworks"](https://ast-grep.github.io/advanced/tool-comparison.html). ast-grep Documentation.
- Semgrep. ["Detect Complex Code Patterns Using Semantic Grep"](https://github.com/semgrep/semgrep). GitHub.
- Ruff. ["FAQ"](https://docs.astral.sh/ruff/faq/). Astral Documentation.
- InfoQ (2023). ["The Importance of Pipeline Quality Gates"](https://www.infoq.com/articles/pipeline-quality-gates/). InfoQ.