npm - autonomous-coding-toolkit - Versions diffs - 1.0.0 - Mend

autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (324) hide show

package/.claude-plugin/marketplace.json +22 -0
package/.claude-plugin/plugin.json +13 -0
package/LICENSE +21 -0
package/Makefile +21 -0
package/README.md +140 -0
package/SECURITY.md +28 -0
package/agents/bash-expert.md +113 -0
package/agents/dependency-auditor.md +138 -0
package/agents/integration-tester.md +120 -0
package/agents/lesson-scanner.md +149 -0
package/agents/python-expert.md +179 -0
package/agents/service-monitor.md +141 -0
package/agents/shell-expert.md +147 -0
package/benchmarks/runner.sh +147 -0
package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
package/benchmarks/tasks/02-refactor-module/task.md +8 -0
package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
package/bin/act.js +238 -0
package/commands/autocode.md +6 -0
package/commands/cancel-ralph.md +18 -0
package/commands/code-factory.md +53 -0
package/commands/create-prd.md +55 -0
package/commands/ralph-loop.md +18 -0
package/commands/run-plan.md +117 -0
package/commands/submit-lesson.md +122 -0
package/docs/ARCHITECTURE.md +630 -0
package/docs/CONTRIBUTING.md +125 -0
package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
package/docs/lessons/0002-async-def-without-await.md +28 -0
package/docs/lessons/0003-create-task-without-callback.md +28 -0
package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
package/docs/lessons/0005-sqlite-without-closing.md +33 -0
package/docs/lessons/0006-venv-pip-path.md +27 -0
package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
package/docs/lessons/0010-local-outside-function-bash.md +33 -0
package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
package/docs/lessons/0020-persist-state-incrementally.md +44 -0
package/docs/lessons/0021-dual-axis-testing.md +48 -0
package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
package/docs/lessons/0023-static-analysis-spiral.md +51 -0
package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
package/docs/lessons/0045-iterative-design-improvement.md +33 -0
package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
package/docs/lessons/0048-integration-wiring-batch.md +40 -0
package/docs/lessons/0049-ab-verification.md +41 -0
package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
package/docs/lessons/0078-static-review-without-live-test.md +30 -0
package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
package/docs/lessons/FRAMEWORK.md +161 -0
package/docs/lessons/SUMMARY.md +201 -0
package/docs/lessons/TEMPLATE.md +85 -0
package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
package/docs/plans/2026-02-21-mab-research-report.md +406 -0
package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
package/docs/plans/2026-02-22-mab-run-design.md +462 -0
package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
package/docs/plans/2026-02-24-headless-module-split.md +443 -0
package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
package/docs/plans/audit-findings.md +186 -0
package/docs/telegram-notification-format.md +98 -0
package/examples/example-plan.md +51 -0
package/examples/example-prd.json +72 -0
package/examples/example-roadmap.md +33 -0
package/examples/quickstart-plan.md +63 -0
package/hooks/hooks.json +26 -0
package/hooks/setup-symlinks.sh +48 -0
package/hooks/stop-hook.sh +135 -0
package/package.json +47 -0
package/policies/bash.md +71 -0
package/policies/python.md +71 -0
package/policies/testing.md +61 -0
package/policies/universal.md +60 -0
package/scripts/analyze-report.sh +97 -0
package/scripts/architecture-map.sh +145 -0
package/scripts/auto-compound.sh +273 -0
package/scripts/batch-audit.sh +42 -0
package/scripts/batch-test.sh +101 -0
package/scripts/entropy-audit.sh +221 -0
package/scripts/failure-digest.sh +51 -0
package/scripts/generate-ast-rules.sh +96 -0
package/scripts/init.sh +112 -0
package/scripts/lesson-check.sh +428 -0
package/scripts/lib/common.sh +61 -0
package/scripts/lib/cost-tracking.sh +153 -0
package/scripts/lib/ollama.sh +60 -0
package/scripts/lib/progress-writer.sh +128 -0
package/scripts/lib/run-plan-context.sh +215 -0
package/scripts/lib/run-plan-echo-back.sh +231 -0
package/scripts/lib/run-plan-headless.sh +396 -0
package/scripts/lib/run-plan-notify.sh +57 -0
package/scripts/lib/run-plan-parser.sh +81 -0
package/scripts/lib/run-plan-prompt.sh +215 -0
package/scripts/lib/run-plan-quality-gate.sh +132 -0
package/scripts/lib/run-plan-routing.sh +315 -0
package/scripts/lib/run-plan-sampling.sh +170 -0
package/scripts/lib/run-plan-scoring.sh +146 -0
package/scripts/lib/run-plan-state.sh +142 -0
package/scripts/lib/run-plan-team.sh +199 -0
package/scripts/lib/telegram.sh +54 -0
package/scripts/lib/thompson-sampling.sh +176 -0
package/scripts/license-check.sh +74 -0
package/scripts/mab-run.sh +575 -0
package/scripts/module-size-check.sh +146 -0
package/scripts/patterns/async-no-await.yml +5 -0
package/scripts/patterns/bare-except.yml +6 -0
package/scripts/patterns/empty-catch.yml +6 -0
package/scripts/patterns/hardcoded-localhost.yml +9 -0
package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
package/scripts/pipeline-status.sh +197 -0
package/scripts/policy-check.sh +226 -0
package/scripts/prior-art-search.sh +133 -0
package/scripts/promote-mab-lessons.sh +126 -0
package/scripts/prompts/agent-a-superpowers.md +29 -0
package/scripts/prompts/agent-b-ralph.md +29 -0
package/scripts/prompts/judge-agent.md +61 -0
package/scripts/prompts/planner-agent.md +44 -0
package/scripts/pull-community-lessons.sh +90 -0
package/scripts/quality-gate.sh +266 -0
package/scripts/research-gate.sh +90 -0
package/scripts/run-plan.sh +329 -0
package/scripts/scope-infer.sh +159 -0
package/scripts/setup-ralph-loop.sh +155 -0
package/scripts/telemetry.sh +230 -0
package/scripts/tests/run-all-tests.sh +52 -0
package/scripts/tests/test-act-cli.sh +46 -0
package/scripts/tests/test-agents-md.sh +87 -0
package/scripts/tests/test-analyze-report.sh +114 -0
package/scripts/tests/test-architecture-map.sh +89 -0
package/scripts/tests/test-auto-compound.sh +169 -0
package/scripts/tests/test-batch-test.sh +65 -0
package/scripts/tests/test-benchmark-runner.sh +25 -0
package/scripts/tests/test-common.sh +168 -0
package/scripts/tests/test-cost-tracking.sh +158 -0
package/scripts/tests/test-echo-back.sh +180 -0
package/scripts/tests/test-entropy-audit.sh +146 -0
package/scripts/tests/test-failure-digest.sh +66 -0
package/scripts/tests/test-generate-ast-rules.sh +145 -0
package/scripts/tests/test-helpers.sh +82 -0
package/scripts/tests/test-init.sh +47 -0
package/scripts/tests/test-lesson-check.sh +278 -0
package/scripts/tests/test-lesson-local.sh +55 -0
package/scripts/tests/test-license-check.sh +109 -0
package/scripts/tests/test-mab-run.sh +182 -0
package/scripts/tests/test-ollama-lib.sh +49 -0
package/scripts/tests/test-ollama.sh +60 -0
package/scripts/tests/test-pipeline-status.sh +198 -0
package/scripts/tests/test-policy-check.sh +124 -0
package/scripts/tests/test-prior-art-search.sh +96 -0
package/scripts/tests/test-progress-writer.sh +140 -0
package/scripts/tests/test-promote-mab-lessons.sh +110 -0
package/scripts/tests/test-pull-community-lessons.sh +149 -0
package/scripts/tests/test-quality-gate.sh +241 -0
package/scripts/tests/test-research-gate.sh +132 -0
package/scripts/tests/test-run-plan-cli.sh +86 -0
package/scripts/tests/test-run-plan-context.sh +305 -0
package/scripts/tests/test-run-plan-e2e.sh +153 -0
package/scripts/tests/test-run-plan-headless.sh +424 -0
package/scripts/tests/test-run-plan-notify.sh +124 -0
package/scripts/tests/test-run-plan-parser.sh +217 -0
package/scripts/tests/test-run-plan-prompt.sh +254 -0
package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
package/scripts/tests/test-run-plan-routing.sh +178 -0
package/scripts/tests/test-run-plan-scoring.sh +148 -0
package/scripts/tests/test-run-plan-state.sh +261 -0
package/scripts/tests/test-run-plan-team.sh +157 -0
package/scripts/tests/test-scope-infer.sh +150 -0
package/scripts/tests/test-setup-ralph-loop.sh +63 -0
package/scripts/tests/test-telegram-env.sh +38 -0
package/scripts/tests/test-telegram.sh +121 -0
package/scripts/tests/test-telemetry.sh +46 -0
package/scripts/tests/test-thompson-sampling.sh +139 -0
package/scripts/tests/test-validate-all.sh +60 -0
package/scripts/tests/test-validate-commands.sh +89 -0
package/scripts/tests/test-validate-hooks.sh +98 -0
package/scripts/tests/test-validate-lessons.sh +150 -0
package/scripts/tests/test-validate-plan-quality.sh +235 -0
package/scripts/tests/test-validate-plans.sh +187 -0
package/scripts/tests/test-validate-plugin.sh +106 -0
package/scripts/tests/test-validate-prd.sh +184 -0
package/scripts/tests/test-validate-skills.sh +134 -0
package/scripts/validate-all.sh +57 -0
package/scripts/validate-commands.sh +67 -0
package/scripts/validate-hooks.sh +89 -0
package/scripts/validate-lessons.sh +98 -0
package/scripts/validate-plan-quality.sh +369 -0
package/scripts/validate-plans.sh +120 -0
package/scripts/validate-plugin.sh +86 -0
package/scripts/validate-policies.sh +42 -0
package/scripts/validate-prd.sh +118 -0
package/scripts/validate-skills.sh +96 -0
package/skills/autocode/SKILL.md +285 -0
package/skills/autocode/ab-verification.md +51 -0
package/skills/autocode/code-quality-standards.md +37 -0
package/skills/autocode/competitive-mode.md +364 -0
package/skills/brainstorming/SKILL.md +97 -0
package/skills/capture-lesson/SKILL.md +187 -0
package/skills/check-lessons/SKILL.md +116 -0
package/skills/dispatching-parallel-agents/SKILL.md +110 -0
package/skills/executing-plans/SKILL.md +85 -0
package/skills/finishing-a-development-branch/SKILL.md +201 -0
package/skills/receiving-code-review/SKILL.md +72 -0
package/skills/requesting-code-review/SKILL.md +59 -0
package/skills/requesting-code-review/code-reviewer.md +82 -0
package/skills/research/SKILL.md +145 -0
package/skills/roadmap/SKILL.md +115 -0
package/skills/subagent-driven-development/SKILL.md +98 -0
package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
package/skills/subagent-driven-development/implementer-prompt.md +73 -0
package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
package/skills/systematic-debugging/SKILL.md +134 -0
package/skills/systematic-debugging/condition-based-waiting.md +64 -0
package/skills/systematic-debugging/defense-in-depth.md +32 -0
package/skills/systematic-debugging/root-cause-tracing.md +55 -0
package/skills/test-driven-development/SKILL.md +167 -0
package/skills/using-git-worktrees/SKILL.md +219 -0
package/skills/using-superpowers/SKILL.md +54 -0
package/skills/verification-before-completion/SKILL.md +140 -0
package/skills/verify/SKILL.md +82 -0
package/skills/writing-plans/SKILL.md +128 -0
package/skills/writing-skills/SKILL.md +93 -0

package/docs/plans/2026-02-22-research-coding-standards-documentation.md ADDED Viewed

@@ -0,0 +1,541 @@
+# Research: Coding Standards, Documentation, and AI Agent Performance
+> **Date:** 2026-02-22
+> **Status:** Research complete
+> **Method:** Web research + academic literature + empirical studies
+> **Confidence key:** HIGH = multiple empirical studies converge; MEDIUM = limited studies or practitioner consensus without rigorous measurement; LOW = anecdotal, inferred, or single-study
+## Executive Summary
+Documentation quality has a measurable, significant impact on AI coding agent performance. The evidence converges on several non-obvious findings:
+1. **Code examples are the single most valuable documentation form for AI agents** — removing examples from API docs drops LLM accuracy by 40-60 percentage points, while removing parameter lists has zero or slightly positive effect (confidence: HIGH).
+2. **Type annotations reduce compilation errors by >50%** but have limited impact on semantic correctness — they constrain the output space rather than improve understanding (confidence: HIGH).
+3. **Stale comments are actively harmful** — inconsistent code-comment pairs are 1.5x more likely to introduce bugs than code with no comments at all (confidence: HIGH).
+4. **Descriptive variable names improve AI code completion by 8.9% semantically** and achieve 2x the exact-match rate vs. obfuscated names — what works for humans works for LLMs (confidence: HIGH).
+5. **The CLAUDE.md/AGENTS.md pattern is a genuinely new documentation category** — project-level AI instructions that are neither traditional docs nor code, but a form of "context engineering" (confidence: MEDIUM).
+6. **Self-documenting code + targeted "why" comments outperforms either approach alone** for both humans and AI agents (confidence: MEDIUM).
+7. **Well-tested codebases are significantly easier for agents to modify** because tests provide executable specifications and immediate feedback loops (confidence: MEDIUM).
+The practical implication for the autonomous-coding-toolkit: a codebase's "AI-readiness" is measurable and improvable. The toolkit should score documentation quality as a pre-flight check and recommend specific improvements before running autonomous agents against a repository.
+---
+## 1. Documentation as AI Context
+### Findings
+Not all documentation is equally useful to AI agents. Research on retrieval-augmented code generation reveals a clear hierarchy of documentation effectiveness.
+**Documentation effectiveness ranking (for AI agents):**
+| Rank | Documentation Form | Impact on Agent Performance | Evidence Strength |
+|------|-------------------|---------------------------|-------------------|
+| 1 | **Code examples** | 40-60 percentage point improvement | HIGH |
+| 2 | **Test suites** | Executable specification + feedback | MEDIUM |
+| 3 | **Type annotations** | >50% reduction in compilation errors | HIGH |
+| 4 | **Docstrings (concise)** | Moderate improvement when compressed | HIGH |
+| 5 | **Architecture docs (CLAUDE.md, AGENTS.md)** | Critical for repo-level navigation | MEDIUM |
+| 6 | **README / high-level docs** | Useful for orientation, not implementation | MEDIUM |
+| 7 | **Inline comments (accurate)** | Helpful for "why" context | MEDIUM |
+| 8 | **API parameter lists** | Zero to slightly negative impact | HIGH |
+| 9 | **Inline comments (stale)** | Actively harmful | HIGH |
+### Evidence
+**Code examples dominate.** A 2025 study on API documentation and RAG found that LLM code generation performance improved by 83-220% when documentation with code examples was retrieved. When examples were removed from documentation, Qwen32B's pass rate dropped from 66-82% to 22-39%. Critically, removing parameter lists from documentation had zero negative impact — and sometimes slightly improved performance. This suggests that **concrete usage examples matter far more than abstract API descriptions** ([When LLMs Meet API Documentation](https://arxiv.org/html/2503.15231)).
+**Docstring compression works.** The "Less is More" study (2024) demonstrated that docstrings can be compressed by 25-40% (via the ShortenDoc method) while preserving code generation quality across six benchmarks. Redundant tokens in docstrings add cost without adding signal. The method dynamically adjusts compression based on token importance, suggesting that **concise docstrings are better than verbose ones** ([Less is More: DocString Compression](https://arxiv.org/abs/2410.22793)).
+**Repository-level context matters most for real tasks.** SWE-bench Pro (2025) demonstrated that agents working on real repositories with comprehensive documentation achieved significantly higher resolve rates than those on poorly-documented codebases. The benchmark explicitly selects "actively maintained professional projects with substantial user bases, comprehensive documentation, and established development practices" — and still, agents only achieve 17-44% solve rates, suggesting documentation is necessary but not sufficient ([SWE-Bench Pro](https://arxiv.org/html/2509.16941v1)).
+### Implications for the Toolkit
+- The pre-flight check should verify that referenced APIs have code examples, not just parameter docs.
+- Docstring quality matters more than quantity — concise > verbose.
+- The AGENTS.md pattern (project-level AI instructions) should be auto-generated per worktree, which the toolkit already does.
+---
+## 2. Type Systems as Documentation
+### Findings
+Static typing produces measurably better AI-generated code, primarily by **constraining the output space** rather than improving semantic understanding.
+**Key finding:** Type-constrained decoding reduces compilation errors by more than 50% and significantly increases functional correctness across LLMs of various sizes and families ([Type-Constrained Code Generation](https://arxiv.org/abs/2504.09246)).
+However, the impact is asymmetric:
+- **Strong effect on syntactic correctness:** Type annotations dramatically reduce type errors, invalid member access, and wrong argument types.
+- **Weak effect on semantic correctness:** The presence of type annotations does not significantly improve whether the generated code *does the right thing*. A function can be perfectly typed and completely wrong.
+### Evidence
+**Type-constrained decoding.** Research formalized type-constrained code generation on a simply-typed language and extended it to TypeScript (2025). Results showed >50% reduction in compilation errors across model families including models with 30B+ parameters. The approach works by filtering the token probability distribution at each generation step to only allow type-valid continuations ([Type-Constrained Code Generation with Language Models](https://openreview.net/pdf?id=LYVyioTwvF)).
+**Python typing adoption.** Meta's 2025 Python Typing Survey (400+ respondents) found that developers overwhelmingly use type annotations for code quality and flexibility. Nearly 300 respondents reported using in-editor LLM suggestions when working with Python typing, and researchers demonstrated a generate-check-repair pipeline coupling LLMs with Mypy to achieve better type annotation quality ([Meta Python Typing Survey 2025](https://engineering.fb.com/2025/12/22/developer-tools/python-typing-survey-2025-code-quality-flexibility-typing-adoption/)).
+**TypeScript advantage for AI workflows.** Practitioner reports from 2025 indicate TypeScript is increasingly preferred for LLM-based coding specifically because its type system constrains agent output. The type system acts as a "guardrail" that catches structural errors before runtime ([TypeScript Rising](https://visiononedge.com/typescript-replacing-python-in-multiagent-systems/)).
+**Language-dependent impact.** Type annotations affect well-typedness of generated code, but this is language-dependent. For languages with strong type inference (Go, Rust), explicit annotations add less value than for Python or JavaScript where the type system is optional.
+### Implications for the Toolkit
+- **Recommendation:** Projects running autonomous agents should enable type checking (mypy for Python, tsc for TypeScript) as a quality gate.
+- **For Python projects:** Add `mypy --strict` or `pyright` to the quality gate pipeline. Type-checked code is not semantically better, but it catches a class of structural errors that compound across batches.
+- **For TypeScript projects:** Already well-positioned for AI agent workflows. The type system is an "always-on" documentation layer.
+- **Practical guidance:** Type annotations are a force multiplier for AI agents — they constrain the search space. A pre-flight check for type coverage (e.g., `mypy --txt-report` coverage percentage) is worth adding.
+---
+## 3. Comments: Help or Noise?
+### Findings
+Comments have a **bimodal impact** on AI agent performance: accurate "why" comments help; stale or redundant comments actively harm. The evidence is stronger for the harmful case than the helpful case.
+**The key statistic:** Inconsistent code-comment pairs are **1.52x more likely** to introduce bugs within 7 days, compared to code with synchronized or absent comments. This effect diminishes over time (1.14x at 14 days) ([Investigating the Impact of Code Comment Inconsistency on Bug Introducing](https://arxiv.org/html/2409.10781v1)).
+### Evidence
+**Stale comments introduce bugs.** A 2024 study using GPT-3.5 for detection (F1 > 88%) categorized code-comment inconsistencies and found that "outdated records" (code changed without comment update) created the highest bug risk. The temporal pattern is critical: the bug-introducing risk peaks immediately after the inconsistency is created and decays over subsequent weeks.
+**Large-scale inconsistency data.** Wen et al. (2019) mined 1.3 billion AST-level changes across 1,500 systems and found that in most cases, code and comments do not co-evolve. Comments drift out of sync from code, creating a pervasive inconsistency problem ([A Large-Scale Empirical Study on Code-Comment Inconsistencies](https://www.inf.usi.ch/lanza/Downloads/Wen2019a.pdf)).
+**Comments can fool AI security reviewers.** A 2025 study demonstrated that adversarial code comments can mislead LLM-based code analysis, suggesting that AI agents may over-weight comment content relative to actual code semantics ([Can Adversarial Code Comments Fool AI Security Reviewers](https://arxiv.org/abs/2602.16741)).
+**Comments help automated bug-fixing (when accurate).** A 2025 empirical study on 116,372 bug-fix pairs found that code comments provide useful context for LLM-based automated repair, but only when the comments accurately describe the intended behavior. The study's dataset excluded methods exceeding 512 tokens, aligning with typical context constraints ([On the Impact of Code Comments for Automated Bug-Fixing](https://www.arxiv.org/pdf/2601.23059)).
+**The "why" vs. "what" distinction is empirically validated.** Both human and AI studies converge: comments explaining *why* a design decision was made help comprehension; comments restating *what* the code does add no value for competent readers (human or AI) and create maintenance burden.
+### Implications for the Toolkit
+- **Add a quality gate:** Detect comments that merely restate code (e.g., `# increment counter` above `counter += 1`). These are noise.
+- **Entropy audit enhancement:** Flag files where code has been modified but adjacent comments were not (proxy for stale comments).
+- **Recommendation:** Prefer "why" comments over "what" comments. Delete "what" comments proactively.
+- **Risk model:** Stale comments are worse than no comments. The toolkit should recommend removing obviously stale comments as part of codebase preparation for AI agents.
+---
+## 4. Architecture Documentation
+### Findings
+Architecture documentation helps AI agents navigate repositories, but the **format matters greatly**. Agents consume structured, hierarchical documentation better than prose narratives. ADRs and C4 models serve complementary purposes.
+### Evidence
+**AgenticAKM (2025).** Research demonstrated that multi-agent systems (using Gemini-2.5-pro and GPT-5) produce better Architecture Decision Records than simple LLM calls, using summarizer, generator, and validator agents in an iterative workflow. This validates ADRs as a documentation form that AI can both consume and produce ([AgenticAKM](https://arxiv.org/html/2602.04445v1)).
+**C4 + ADR integration.** The C4 model documents the "what" (system structure at four abstraction levels: Context, Containers, Components, Code) while ADRs document the "why" (decision context, alternatives considered, consequences). AI agents benefit from both: C4 for navigation ("where does this code live?") and ADRs for constraint understanding ("why was this designed this way?") ([C4 Model Architecture and ADR Integration](https://visual-c4.com/blog/c4-model-architecture-adr-integration)).
+**Collaborative C4 Design Automation (2025).** A paper demonstrated that LLM agents can collaborate to automatically produce C4 architecture diagrams, suggesting that agents can both consume and generate this documentation format ([Collaborative LLM Agents for C4 Software Architecture Design Automation](https://arxiv.org/pdf/2510.22787)).
+**Repository-level RAG.** Retrieval-augmented code generation frameworks use architecture documentation to construct context-aware prompts, enabling models to navigate beyond their context window limits. Methods include identifier matching, sparse retrieval (BM25), dense retrieval (CodeBERT), and graph-based retrieval using ASTs and call graphs ([Retrieval-Augmented Code Generation Survey](https://arxiv.org/html/2510.04905v1)).
+### Architecture Documentation Format Comparison
+| Format | AI Consumability | Human Maintenance Cost | Best For |
+|--------|-----------------|----------------------|----------|
+| **ARCHITECTURE.md** (structured) | HIGH — easy to parse, fits in context | LOW — update on major changes | System overview, module map |
+| **ADRs** (individual files) | HIGH — self-contained decisions | LOW — write once, rarely update | Design rationale, constraints |
+| **C4 diagrams** (text-based) | MEDIUM — Mermaid/PlantUML parseable | MEDIUM — update on structural changes | Visual system decomposition |
+| **Dependency graphs** (generated) | HIGH — machine-readable | ZERO — auto-generated | Import/dependency analysis |
+| **Module maps** (directory layout) | HIGH — immediate orientation | LOW — update when structure changes | "Where does X live?" |
+| **Prose architecture narratives** | LOW — too long for context windows | HIGH — hard to keep current | Human onboarding (not AI) |
+### Implications for the Toolkit
+- **The toolkit's ARCHITECTURE.md format is well-optimized for AI consumption** — structured, hierarchical, with ASCII diagrams. This is a good pattern to recommend.
+- **ADRs should be recommended** as a documentation practice for projects using autonomous agents. Each ADR is a self-contained context unit that answers "why" questions.
+- **Auto-generated module maps** (directory tree + brief purpose annotations) should be part of the pre-flight context injection.
+- **Prose narratives are low-value** for AI agents. Recommend structured formats over free-text architecture docs.
+---
+## 5. Naming Conventions
+### Findings
+Naming quality has a **large, consistent, and measurable** effect on AI code completion quality. This is one of the most well-evidenced findings in this research. **Confidence: HIGH.**
+### Evidence
+**AI code completion study (2025).** Yakubov et al. tested 8 AI models (0.5B-8B parameters) on 500 Python code samples transformed into 7 naming schemes. Results:
+| Naming Scheme | Exact Match Rate | Levenshtein Similarity | Semantic Similarity |
+|--------------|-----------------|----------------------|-------------------|
+| Descriptive | **34.2%** | **0.786** | **0.874** |
+| SCREAM_SNAKE | 28.1% | 0.748 | 0.856 |
+| snake_case | 26.7% | 0.739 | 0.849 |
+| PascalCase | 25.3% | 0.731 | 0.843 |
+| Minimal | 21.8% | 0.707 | 0.825 |
+| Obfuscated | **16.6%** | **0.666** | **0.802** |
+Descriptive names achieved **2.06x the exact match rate** of obfuscated names. The 41% increase in token usage from descriptive names is far outweighed by the 8.9% improvement in semantic performance ([Variable Naming Impact on AI Code Completion](https://www.researchgate.net/publication/393939595)).
+**Human comprehension alignment.** The ranking mirrors human comprehension research: full-word identifiers lead to 19% faster comprehension than abbreviations (Hofmeister et al.). Descriptive identifier names enabled developers to find semantic defects 14% faster than short, non-descriptive names (Binkley et al.).
+**Naming flaws correlate with code quality.** Butler et al. found that flawed identifier names in Java classes were associated with lower static analysis quality scores, and this held at both class and method granularity levels ([Relating Identifier Naming Flaws and Code Quality](https://oro.open.ac.uk/17007/1/butler09wcreshort_latest.pdf)).
+### Implications for the Toolkit
+- **This is the highest-ROI documentation improvement.** Renaming variables from cryptic to descriptive improves AI agent performance more than adding comments.
+- **Entropy audit enhancement:** Add a naming quality check — detect single-letter variables outside of loop counters, abbreviations without context, inconsistent naming conventions within a file.
+- **Pre-flight recommendation:** If a codebase has poor naming, recommend a renaming pass before running autonomous agents.
+- **Practical threshold:** If >10% of non-loop variables are single characters or common abbreviations (tmp, val, x, res), flag for naming improvement.
+---
+## 6. Coding Standards That Matter
+### Findings
+Most coding standards debate focuses on style preferences (tabs vs. spaces, brace placement). The standards with **empirical evidence for improving code quality** are substantive, not stylistic.
+### Evidence-Based Standards
+| Standard | Evidence | Effect Size | Confidence |
+|----------|----------|-------------|------------|
+| **Function length <30 lines** | SATC: size + complexity = lowest reliability | Moderate | MEDIUM |
+| **Cyclomatic complexity ≤10** | McCabe/NIST threshold; Spotify: 30% fewer bugs | Moderate | MEDIUM |
+| **Error handling with logging** | Lesson #7 in this toolkit; silent failures are top bug cluster | Large | HIGH (from experience) |
+| **No bare exceptions** | Python anti-pattern consensus; empirical from 59 lessons | Large | HIGH |
+| **Async discipline** | Lessons #25, #30; truthy coroutine bugs | Large | HIGH (from experience) |
+| **Type annotations** | >50% compilation error reduction | Large (structural) | HIGH |
+| **Descriptive naming** | 2x exact match rate for AI completion | Large | HIGH |
+| **Tests for every feature** | SWE-bench: tested repos have higher solve rates | Large | MEDIUM |
+**Cyclomatic complexity.** The evidence is nuanced. When controlling for program size, the correlation between cyclomatic complexity and bugs weakens considerably. However, Spotify's internal report found that reducing average complexity from 15 to 8 led to 30% fewer bugs and 20% faster development. The NIST recommendation of ≤10 is widely adopted but not rigorously validated in isolation from code size ([Cyclomatic Complexity - Wikipedia](https://en.wikipedia.org/wiki/Cyclomatic_complexity)).
+**Function length.** The NASA Software Assurance Technology Center (SATC) found that the most effective evaluation combines size and complexity. Functions that are both large and complex have the lowest reliability. A practical limit of 30 lines (visible without scrolling) aligns with cognitive load theory (Miller's 7±2 chunks).
+**Error handling.** This toolkit's own lesson database validates this empirically: Lesson #7 (bare exception swallowing), Lesson #43 (create_task without callback), and Lesson #33 (sqlite without closing) are all error handling failures. The top root cause cluster (Cluster A: Silent Failures) in the toolkit's 59 lessons is entirely about inadequate error handling.
+### Standards That Don't Matter (for AI)
+| Standard | Why It Doesn't Matter |
+|----------|----------------------|
+| Tabs vs. spaces | Tokenized identically by most LLMs |
+| Brace style (K&R vs. Allman) | No measurable impact on generation quality |
+| Line length limits | Context window handles any reasonable line length |
+| Import ordering | Cosmetic; no comprehension impact |
+| Trailing commas | Style preference with no quality signal |
+### Implications for the Toolkit
+- **Focus quality gates on substantive standards:** error handling, complexity, naming, typing.
+- **Don't waste gate capacity on style:** linting for formatting is fine for humans but adds no value for AI agent quality.
+- **The toolkit's lesson-check.sh already targets the right things:** bare exceptions, async discipline, task callbacks. This is validated by the research.
+---
+## 7. Style Guides for AI-Assisted Development
+### Findings
+Traditional style guides (Google, Airbnb, PEP 8) were designed for human-to-human code communication. AI-assisted workflows need adapted guidance that emphasizes what helps agents most and deprioritizes what helps only humans.
+### Evidence
+No rigorous studies exist comparing style guide variants for AI agent performance. **Confidence: LOW.** The recommendations below are inferred from the empirical evidence in sections 1-6.
+### Recommended Adaptations
+| Traditional Guideline | AI-Optimized Adaptation | Rationale |
+|----------------------|------------------------|-----------|
+| "Write comments for every public method" | "Write 'why' comments; skip 'what' comments" | Stale comments hurt; redundant comments add no value (Section 3) |
+| "Follow PEP 8 formatting" | "Follow PEP 8 + add type annotations everywhere" | Type annotations constrain AI output (Section 2) |
+| "Use meaningful names" | "Use descriptive names even when verbose" | 2x exact match rate (Section 5) |
+| "Document parameters in docstrings" | "Include code examples in docstrings" | Examples > parameter docs for AI (Section 1) |
+| "Keep functions small" | "Keep functions <30 lines with low complexity" | Same recommendation, but now evidence-backed for AI (Section 6) |
+| "Write a README" | "Write AGENTS.md + structured ARCHITECTURE.md" | AI needs navigational context, not human onboarding (Section 4) |
+### Implications for the Toolkit
+- **Publish an "AI-Ready Style Guide"** that distills the empirical findings into actionable rules.
+- **The guide should be prescriptive, not permissive** — the toolkit's philosophy is already "rigid skills, follow exactly."
+- **Include the guide as a recommended codebase preparation step** before running `/autocode`.
+---
+## 8. Tests as Documentation
+### Findings
+Test suites serve as **executable specifications** that AI agents can both read (to understand expected behavior) and run (to verify their changes). Well-tested codebases are measurably easier for agents to modify correctly.
+### Evidence
+**SWE-bench structure.** Every SWE-bench task includes a test patch — the ground truth for whether the agent's fix is correct. The entire benchmark evaluation paradigm assumes tests are the definitive documentation of expected behavior. Agents that can run and interpret test output consistently outperform those that cannot.
+**Test count monotonicity.** The toolkit's own principle (#3: "test count must never decrease between batches") is a form of using tests as documentation — each test documents an expected behavior that must be preserved. This is empirically validated by the toolkit's 59 lessons: Lesson #32 and #44 address hardcoded test counts, showing that tests-as-documentation requires that the tests themselves remain accurate.
+**BDD as specification.** Behavior-Driven Development (BDD) frameworks (pytest-bdd, Cucumber) produce test files that read as specifications. While no direct study measures AI agent performance on BDD vs. unit test codebases, the principle is clear: descriptive test names + assertion messages provide the same "why" context that good comments provide, but with the advantage of being verified on every run.
+**AI test generation agents.** The rise of AI test generation tools (2025) creates a feedback loop: AI agents generate tests that serve as documentation for future AI agents. The quality of these generated tests varies, but the specification function is consistent ([9 Best Unit Test Agents 2025](https://gitauto.ai/blog/best-unit-test-agents-2025)).
+**Non-determinism challenge.** Testing AI agent systems themselves is complicated by non-determinism in LLM outputs. Traditional deterministic testing doesn't directly apply, requiring new approaches for agent testing ([An Empirical Study of Testing Practices in Open Source AI Agent Frameworks](https://arxiv.org/html/2509.19185v1)).
+### Implications for the Toolkit
+- **Test suite quality is a pre-flight metric.** Before running autonomous agents, check: Does the project have tests? Do they pass? What's the coverage?
+- **The quality gate's test count regression detection is validated** by the research — tests are documentation that must be preserved.
+- **Recommendation:** Projects with <50% test coverage should add tests before running autonomous agents. The agent's test runs are its primary feedback mechanism.
+- **BDD-style test names** (e.g., `test_user_login_with_expired_token_returns_401`) provide more context than abbreviated names (`test_login_fail`).
+---
+## 9. Documentation Debt
+### Findings
+Documentation debt compounds **differently** when AI agents are the consumers vs. humans. For humans, stale docs cause confusion that can be resolved by asking a colleague. For AI agents, stale docs cause **silent incorrect behavior** because agents treat documentation as ground truth.
+### Evidence
+**Code-comment inconsistency.** Wen et al. (2019) mined 1.3 billion AST changes across 1,500 systems and found pervasive code-comment drift — in most cases, code and comments do not co-evolve. The 2024 follow-up demonstrated that these inconsistencies are 1.52x more likely to introduce bugs within 7 days ([Wen et al.](https://www.inf.usi.ch/lanza/Downloads/Wen2019a.pdf); [Impact study](https://arxiv.org/html/2409.10781v1)).
+**Self-admitted technical debt (SATD).** Studies on SATD in code comments show that developers often leave TODO/FIXME/HACK markers that document known debt. AI agents can detect these markers but may not understand the severity or context — a FIXME comment might indicate a critical bug or a cosmetic preference ([Identifying Technical Debt through Code Comment Analysis](https://www.scitepress.org/papers/2016/59438/59438.pdf)).
+**Adversarial comments.** The 2025 study on adversarial code comments demonstrated that AI security reviewers can be misled by intentionally deceptive comments, highlighting that agents over-trust comment content relative to code structure ([Adversarial Code Comments](https://arxiv.org/abs/2602.16741)).
+**Documentation debt accumulation model:**
+| Debt Type | Human Impact | AI Agent Impact | Relative Severity for AI |
+|-----------|-------------|----------------|------------------------|
+| Missing docs | Slows onboarding | Agent guesses (hallucination risk) | HIGH |
+| Stale comments | Confuses, but detectable | Agent follows wrong guidance silently | CRITICAL |
+| Wrong README | Misleading setup instructions | Agent fails to build/test | HIGH |
+| Outdated ADRs | Wrong context for decisions | Agent violates constraints | MEDIUM |
+| Missing type annotations | More mental load | More type errors in generated code | HIGH |
+| Dead code | Clutters understanding | Agent may call dead functions | MEDIUM |
+### Implications for the Toolkit
+- **Stale documentation is the highest-priority documentation debt for AI agents.** The toolkit should detect and warn about it.
+- **The entropy audit already checks for dead references in CLAUDE.md.** Extend this to detect likely stale comments (code changed, adjacent comments unchanged).
+- **Pre-flight recommendation:** Run a comment freshness check before autonomous execution. Flag files where code has changed recently but comments have not.
+- **Key insight:** For AI agents, **missing documentation is better than wrong documentation.** An agent with no docs will hallucinate sometimes; an agent with wrong docs will confidently do the wrong thing.
+---
+## 10. Self-Documenting Code
+### Findings
+"Self-documenting code" (clear naming, small functions, obvious structure) is **more effective for AI agents than heavily-commented code**, with one crucial caveat: non-obvious design decisions ("why" comments) remain essential because no amount of clean code can explain *why* a particular approach was chosen.
+### Evidence
+**Student vs. professional divergence.** A study comparing students and IT professionals found that students preferred more comments for readability, while professionals leaned toward "less is more." This mirrors the AI case: agents, like experienced developers, can read code directly — they benefit from structural clarity more than explanatory commentary ([Code readability: Code comments OR self-documenting code](https://www.diva-portal.org/smash/get/diva2:943979/FULLTEXT02)).
+**Naming > comments.** Section 5 established that descriptive naming achieves a 2x exact match rate for AI completion. This is a stronger effect than any measured benefit from comments. Self-documenting code *is* better documentation for AI agents in the structural/navigational dimension.
+**The "why" gap.** Self-documenting code cannot express: design constraints, performance trade-offs, regulatory requirements, integration assumptions, or workaround reasons. These require explicit "why" comments or ADRs. Clean Code's rule — "comments should explain WHY, not WHAT" — is the optimal strategy for AI agents.
+**Small functions debate.** The "extract until everything is tiny" approach from Clean Code has pushback: extracting a 3-line operation into a named function can increase code from 3 lines to 5, requiring the reader (human or AI) to jump to a separate location. The optimal function size for AI consumption aligns with the SATC finding: small enough to have low complexity, large enough to be self-contained. Practical target: <30 lines.
+### Implications for the Toolkit
+- **Recommendation:** Prioritize naming and structure over comment density.
+- **The ideal documentation strategy for AI agents is:**
+  1. Descriptive names (highest impact)
+  2. Type annotations (constrains output)
+  3. Small, focused functions (reduces complexity)
+  4. "Why" comments only (explains non-obvious decisions)
+  5. Delete "what" comments (they drift and become harmful)
+- **This aligns with the toolkit's existing philosophy:** the lesson-check system catches structural anti-patterns (bare exceptions, async discipline) rather than comment quality — code structure is the primary documentation.
+---
+## 11. Documentation Generation
+### Findings
+AI-generated documentation is useful for **structural/mechanical documentation** (parameter docs, API references, type stubs) but poor for **contextual/intentional documentation** (architecture decisions, design rationale, business logic explanation).
+### Evidence
+**Adoption.** 64% of software development professionals use AI for writing documentation (Google DORA 2025). IBM's internal test found 59% reduction in documentation time ([IBM AI Code Documentation](https://www.ibm.com/think/insights/ai-code-documentation-benefits-top-tips)).
+**DocAgent (2025).** A multi-agent system for automated code documentation demonstrated that specialized agent architectures (planner + analyzer + writer + reviewer) produce higher-quality documentation than single-pass generation, but still require human review for accuracy of business logic descriptions ([DocAgent](https://arxiv.org/html/2504.08725v2)).
+**The "correct but useless" problem.** The biggest risk with AI documentation generation is producing documentation that accurately describes *what* code does without explaining *why* — exactly the type of documentation that Section 3 identifies as low-value noise.
+**Quality trade-off.** Google's 2024 DORA report found that increased AI use improves documentation speed but causes a **7.2% drop in delivery stability**. AI-generated code has 1.7x more issues than human-written code ([AI vs Human Code Gen Report](https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report)).
+### What Should Be AI-Generated vs. Human-Written
+| Documentation Type | AI-Generated? | Human-Written? | Rationale |
+|-------------------|---------------|----------------|-----------|
+| Docstring scaffolding | YES | Review | Mechanical, high-volume |
+| Type annotations | YES | Review | AI + mypy pipeline validated |
+| API reference docs | YES | Review | Extractable from code |
+| README (initial) | YES | Edit heavily | Structure is generic; context is specific |
+| Architecture decisions | NO | YES | "Why" requires human judgment |
+| CLAUDE.md / AGENTS.md | PARTIAL | YES | Conventions are human decisions |
+| Code examples | NO | YES | Good examples require domain expertise |
+| "Why" comments | NO | YES | Only humans know the "why" |
+| Test descriptions | YES | Review | BDD specs are formulaic |
+### Implications for the Toolkit
+- **Auto-generate structural docs** as part of the pipeline: type stubs, docstring scaffolding, API references.
+- **Never auto-generate "why" documentation** — architecture decisions, design rationale, and constraint explanations must be human-authored.
+- **The toolkit should generate AGENTS.md per worktree** (which it already does) but treat CLAUDE.md as human-authored.
+---
+## 12. The CLAUDE.md Pattern
+### Findings
+CLAUDE.md (Anthropic), AGENTS.md (cross-tool standard), .cursorrules (Cursor), and similar files represent a **genuinely new documentation category**: project-level instructions specifically for AI agents. This is distinct from all traditional documentation forms.
+### Evidence
+**AGENTS.md specification (2025).** The AGENTS.md standard emerged in July 2025 to solve a real problem: developers maintaining separate configuration files for each AI coding tool. 25+ tools now support the format, making it a cross-platform standard ([AGENTS.md](https://agents.md/)).
+**CLAUDE.md as context engineering.** Claude Code reads CLAUDE.md at session start and uses it for the duration. It functions as: coding standards + architecture overview + tool constraints + workflow instructions. This is not documentation *about* the code — it's documentation *for the AI* about how to work with the code.
+**Recommended sections (from practitioner consensus):**
+1. Project overview and directory layout
+2. Build and test commands
+3. Code style conventions
+4. Architecture decisions and constraints
+5. Common gotchas and anti-patterns
+6. Tool permissions and restrictions
+7. Naming conventions
+8. Security boundaries
+**The "one source of truth" problem.** Practitioners report that AGENTS.md and traditional docs (README, ARCHITECTURE.md) overlap significantly. The recommended approach is: AGENTS.md contains **operational instructions** (how to build, test, deploy); traditional docs contain **conceptual understanding** (why the system works this way) ([Keep your AGENTS.md in sync](https://kau.sh/blog/agents-md/)).
+**A counterpoint:** Upsun's analysis argues that a well-maintained README matters more than AI configuration files. If the README is excellent, the AI configuration file becomes redundant for most tasks ([Why your README matters more than AI configuration files](https://devcenter.upsun.com/posts/why-your-readme-matters-more-than-ai-configuration-files/)).
+### Taxonomy of the New Documentation Category
+| File | Scope | Primary Consumer | Key Content |
+|------|-------|-----------------|-------------|
+| CLAUDE.md | Project | Claude Code | Conventions, commands, constraints, gotchas |
+| AGENTS.md | Project | Cross-tool agents | Same as above, tool-agnostic |
+| .cursorrules | Project | Cursor AI | IDE-specific coding rules |
+| progress.txt | Session | Same agent (across resets) | What happened in previous batches |
+| tasks/prd.json | Feature | Quality gates | Machine-verifiable acceptance criteria |
+### Implications for the Toolkit
+- **CLAUDE.md is the toolkit's most impactful documentation artifact.** Its content directly shapes every agent interaction.
+- **The toolkit should provide a CLAUDE.md template** optimized for AI agent consumption, based on the empirical findings in this research.
+- **AGENTS.md support** should be added as an alias/symlink for cross-tool compatibility.
+- **The auto-generated AGENTS.md per worktree is the right pattern** — it provides batch-specific context that the project-level file cannot.
+- **Key insight:** CLAUDE.md is "context engineering" — it's the human's mechanism for shaping AI behavior without changing the AI itself. This is a first-class documentation concern.
+---
+## Documentation Quality Scorecard
+A concrete rubric for evaluating whether a codebase is well-documented for AI agent consumption. Each dimension is scored 0-3.
+| Dimension | 0 (Poor) | 1 (Minimal) | 2 (Good) | 3 (Excellent) |
+|-----------|----------|-------------|----------|----------------|
+| **Type Annotations** | No types | <25% coverage | 25-75% coverage | >75% + type checker in CI |
+| **Naming Quality** | Single-letter/cryptic names common | Mix of descriptive and cryptic | Mostly descriptive | Consistently descriptive, domain-specific |
+| **Test Coverage** | No tests | <30% coverage | 30-70% coverage | >70% + tests pass + BDD-style names |
+| **Code Examples** | No examples in docs | Few examples | Examples for main APIs | Comprehensive examples with edge cases |
+| **Architecture Docs** | No arch docs | README only | ARCHITECTURE.md exists | ARCHITECTURE.md + ADRs + module map |
+| **AI Instructions** | No CLAUDE.md/AGENTS.md | Basic file with commands | Commands + conventions + constraints | Full context: commands, conventions, gotchas, architecture |
+| **Comment Quality** | Many stale/wrong comments | Mix of stale and current | Mostly current, some "what" noise | Only "why" comments, all current |
+| **Function Complexity** | Average CC >15 | Average CC 10-15 | Average CC 5-10 | Average CC <5, functions <30 lines |
+**Scoring:**
+- **0-8:** AI-hostile codebase. Agents will hallucinate frequently. Major documentation investment needed before autonomous execution.
+- **9-14:** Minimum viable. Agents will work but with elevated error rates. Targeted improvements recommended.
+- **15-20:** AI-ready. Agents can operate effectively with standard quality gates.
+- **21-24:** Optimized. Best possible environment for autonomous AI coding.
+---
+## Recommended Documentation Standards
+What the toolkit should recommend/enforce for projects using it.
+### Tier 1: Must-Have (Pre-Flight Gate)
+These should block autonomous execution if absent:
+1. **Test suite exists and passes.** No autonomous execution against untested code. Tests are the agent's primary feedback mechanism.
+2. **CLAUDE.md or AGENTS.md exists** with at minimum: build commands, test commands, directory layout.
+3. **No obviously stale comments** in files that will be modified (code changed in last 30 days, comments unchanged in >90 days in same block).
+### Tier 2: Should-Have (Pre-Flight Warning)
+These should generate warnings but not block:
+4. **Type annotations present** (Python: mypy/pyright configured; TypeScript: strict mode; Go: interfaces defined).
+5. **Architecture documentation exists** (ARCHITECTURE.md or equivalent structured overview).
+6. **Naming quality check passes** (<10% of non-loop variables are single characters or common abbreviations).
+7. **Test coverage >30%** (enough for agents to have meaningful feedback).
+### Tier 3: Nice-to-Have (Documentation Score Report)
+These improve agent performance but aren't blockers:
+8. **ADRs for major design decisions.**
+9. **Code examples in API documentation.**
+10. **BDD-style test names** (descriptive test function names).
+11. **"Why" comments on non-obvious code blocks.**
+12. **Function complexity <10 cyclomatic complexity on average.**
+### Anti-Patterns to Detect and Warn About
+| Anti-Pattern | Detection Method | Impact |
+|-------------|-----------------|--------|
+| Stale comments | Code change date vs. comment change date | 1.52x bug introduction rate |
+| "What" comments | Heuristic: comment restates next line | Noise that drifts into misinformation |
+| Cryptic names | Single-char non-loop variables, abbreviation density | 2x worse AI completion rate |
+| Missing types | mypy/pyright coverage report | >50% more compilation errors |
+| Bare exceptions | Existing lesson-check.sh | Silent failure cascade |
+| Long functions | Line count + CC analysis | Lower reliability |
+| Missing tests | Test framework detection + coverage | No feedback mechanism for agents |
+| No AI instructions | CLAUDE.md/AGENTS.md file check | Agent has no project context |
+---
+## Sources
+### Primary Research Papers
+- [Type-Constrained Code Generation with Language Models](https://arxiv.org/abs/2504.09246) — Formalizes type-constrained decoding, >50% error reduction
+- [Less is More: DocString Compression in Code Generation](https://arxiv.org/abs/2410.22793) — ShortenDoc achieves 25-40% compression preserving quality
+- [Investigating the Impact of Code Comment Inconsistency on Bug Introducing](https://arxiv.org/html/2409.10781v1) — 1.52x bug risk from stale comments
+- [A Large-Scale Empirical Study on Code-Comment Inconsistencies](https://www.inf.usi.ch/lanza/Downloads/Wen2019a.pdf) — 1.3B AST changes, 1,500 systems
+- [Variable Naming Impact on AI Code Completion: An Empirical Study](https://www.researchgate.net/publication/393939595) — 2x exact match with descriptive names
+- [When LLMs Meet API Documentation](https://arxiv.org/html/2503.15231) — RAG improves code gen 83-220%, examples most valuable
+- [On the Impact of Code Comments for Automated Bug-Fixing](https://www.arxiv.org/pdf/2601.23059) — Comments help LLM repair when accurate
+- [Can Adversarial Code Comments Fool AI Security Reviewers](https://arxiv.org/abs/2602.16741) — AI over-trusts comments
+### Benchmarks and Surveys
+- [SWE-Bench Pro](https://arxiv.org/html/2509.16941v1) — Real-world repo benchmark, documentation quality as factor
+- [Meta Python Typing Survey 2025](https://engineering.fb.com/2025/12/22/developer-tools/python-typing-survey-2025-code-quality-flexibility-typing-adoption/) — 400+ respondents on typing adoption
+- [DocAgent: Multi-Agent System for Code Documentation](https://arxiv.org/html/2504.08725v2) — Agent-based doc generation
+- [Retrieval-Augmented Code Generation Survey](https://arxiv.org/html/2510.04905v1) — RAG methods for code
+- [AgenticAKM](https://arxiv.org/html/2602.04445v1) — AI agents for ADR generation
+- [An Empirical Study of Testing Practices in AI Agent Frameworks](https://arxiv.org/html/2509.19185v1) — Testing non-deterministic agents
+### Naming and Code Quality
+- [Exploring the Influence of Identifier Names on Code Quality](https://ieeexplore.ieee.org/abstract/document/5714430) — Naming flaws correlate with low quality
+- [Relating Identifier Naming Flaws and Code Quality](https://oro.open.ac.uk/17007/1/butler09wcreshort_latest.pdf) — Method-level naming impact
+- [Effects of Variable Names on Comprehension](https://www.researchgate.net/publication/318036120) — Full words 19% faster comprehension
+### Industry Reports
+- [IBM AI Code Documentation](https://www.ibm.com/think/insights/ai-code-documentation-benefits-top-tips) — 59% documentation time reduction
+- [AI vs Human Code Generation Report](https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report) — AI code creates 1.7x more issues
+- [Collaborative LLM Agents for C4 Architecture Design](https://arxiv.org/pdf/2510.22787) — AI-generated architecture diagrams
+- [C4 Model Architecture and ADR Integration](https://visual-c4.com/blog/c4-model-architecture-adr-integration) — C4 + ADR complementarity
+### Standards and Specifications
+- [AGENTS.md](https://agents.md/) — Cross-tool AI instruction format (25+ tools)
+- [Keep your AGENTS.md in sync](https://kau.sh/blog/agents-md/) — One source of truth for AI instructions
+- [AGENTS.md: Why your README matters more](https://devcenter.upsun.com/posts/why-your-readme-matters-more-than-ai-configuration-files/) — Counterpoint on README vs. AI config
+- [Airbnb JavaScript Style Guide](https://github.com/airbnb/javascript) — Reference style guide
+- [Code readability: Comments OR self-documenting code](https://www.diva-portal.org/smash/get/diva2:943979/FULLTEXT02) — Student vs. professional divergence