autonomous-coding-toolkit 1.0.0

Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,459 @@
# Research: Context Window Utilization and Degradation in AI Coding Agents

**Date:** 2026-02-22
**Domain:** AI Agent Architecture
**Relevance:** autonomous-coding-toolkit Design Principle #1 ("fresh context per batch")
**Status:** Complete

---

## Executive Summary

The toolkit's "fresh context per batch" architecture is **strongly validated** by current research. Context degradation is real, measurable, and non-linear — but the mechanism is more nuanced than simple "context fills up, quality drops." The primary threats are (1) the Lost-in-the-Middle effect causing positional retrieval failures, (2) attention budget exhaustion from irrelevant context, and (3) noise-to-signal ratio degradation as accumulated context grows. The current 6000-char (~1500 token) context injection budget is conservative but defensible — it sits well within the "high-recall zone" where models maintain near-baseline performance. Fresh context per batch is empirically superior to accumulated context for autonomous coding, but the toolkit should adopt **structured context injection** (placing critical information at document edges, using XML tags) and consider **observation masking** as a lightweight alternative to full context reset for retry scenarios.

**Confidence: HIGH** — Multiple peer-reviewed papers, Anthropic's own engineering documentation, and empirical benchmarks from competing agent frameworks all converge on the same conclusions.

---

## 1. The Degradation Curve: When Does Quality Drop?

### Findings

**The degradation is real but task-dependent.** Research converges on several key patterns:

- **U-shaped positional recall:** Information at document edges (0-20% and 80-100% of context depth) achieves high recall. Middle-positioned information suffers dramatic drops. This is the "Lost in the Middle" effect (Liu et al., 2023).

- **Non-linear latency degradation:** The "Context Discipline" paper (Abubakar et al., 2026) measured Llama-3.1-70B at 150% latency degradation at 4K words scaling to 720% at 15K words, following a linear-quadratic trajectory driven by KV cache growth and memory bandwidth constraints.

- **Accuracy remains surprisingly stable under clean conditions:** Llama-70B dropped only 0.5% accuracy (98.5% to 98%) at 15K words. Qwen-14B dropped 1.5% (99% to 97.5%). Mixtral-8x7B dropped 1% (99.5% to 98.5%). These are clean-room conditions — single-needle retrieval tasks with minimal distraction.

- **Real-world degradation is much worse:** The Chroma "Context Rot" study (Hong et al., 2025) found that with distractors present, degradation accelerates dramatically. At 32K tokens, 11 of 12 tested models dropped below 50% of their short-context performance. A model claiming 200K tokens typically becomes unreliable around 130K (~65% utilization).

- **The cliff is not gradual:** Performance drops are often sudden rather than progressive. Models maintain near-baseline performance until hitting a threshold, then quality collapses.

### Evidence Quality

| Source | Type | Confidence |
|--------|------|------------|
| Liu et al. (2023) "Lost in the Middle" | Peer-reviewed (TACL 2024) | HIGH |
| Abubakar et al. (2026) "Context Discipline" | arXiv preprint | MEDIUM-HIGH |
| Chroma "Context Rot" (2025) | Industry research | MEDIUM-HIGH |
| Epoch AI context window analysis (2025) | Data analysis | MEDIUM |

### Implications for the Toolkit

The current architecture's fresh-context approach avoids the degradation curve entirely. Each `claude -p` call starts at the leftmost point of the curve — maximum performance. The 6000-char context injection means each batch operates at roughly 1500 tokens of injected context on top of the batch task text, well below any measured degradation threshold.

**Recommendation:** No change needed to the core architecture. The fresh-context approach is the most robust strategy available. Document the specific degradation thresholds (50% performance at ~32K tokens with distractors, unreliable at ~65% of claimed window) in ARCHITECTURE.md as empirical backing for Design Principle #1.
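As a back-of-envelope check on the figures above (a sketch only — the ~4 chars/token ratio is a common heuristic, and `estimate_tokens` is a hypothetical helper, not part of the toolkit):

```shell
#!/usr/bin/env bash
# Rough arithmetic behind the numbers above, using the ~4 chars/token
# heuristic. estimate_tokens is a hypothetical helper, not toolkit code.
estimate_tokens() {
  echo $(( $1 / 4 ))
}

echo "6000-char injection ~= $(estimate_tokens 6000) tokens"    # ~1500
# Practical reliability ceiling of a 200K-token window (~65% utilization):
echo "200K window ceiling ~= $(( 200000 * 65 / 100 )) tokens"   # 130000
```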

---

## 2. Is the 6000-Character Context Budget Optimal?

### Findings

The current budget (6000 chars / ~1500 tokens) is **conservative and safe, but could be expanded without risk.**

Key data points:

- **Anthropic's own guidance:** Place long documents (20K+ tokens) near the top of prompts. Queries at the end improve response quality by up to 30%. This suggests Claude handles substantial context volumes well when properly structured.

- **ACON framework thresholds:** Research on optimal compression triggers suggests 4096 tokens for history compression and 1024 tokens for observation compression as effective thresholds. The toolkit's 1500-token budget sits between these.

- **Factory.ai's approach:** Treats context as a "finite, budgeted resource" with a layered stack: repository overviews, semantic search results, targeted file operations, and hierarchical memory. Their per-layer budgets are not published but the architecture implies 2K-5K tokens per layer.

- **Sub-agent patterns from Anthropic:** Sub-agents return condensed summaries of 1000-2000 tokens to coordinating agents. This suggests Anthropic considers this range effective for conveying substantial task context.

- **The diminishing-returns zone:** Below ~500 tokens, agents lack sufficient context for non-trivial tasks. Above ~8K tokens of injected context (on top of the task itself), noise-to-signal ratio starts climbing. The sweet spot for injected auxiliary context appears to be **1000-4000 tokens** (~4000-16000 chars).

### Evidence Quality

| Source | Type | Confidence |
|--------|------|------------|
| Anthropic long-context tips | Official documentation | HIGH |
| ACON framework (Kang et al., 2025) | Peer-reviewed | HIGH |
| Factory.ai architecture | Industry practice | MEDIUM |
| Anthropic sub-agent patterns | Engineering blog | MEDIUM-HIGH |

### Implications for the Toolkit

The 6000-char budget is defensible but could be raised to **8000-12000 chars (~2000-3000 tokens)** to allow richer context injection without approaching any degradation threshold. The priority ordering in `run-plan-context.sh` (directives > failure patterns > referenced files > git log > progress notes) is correct — highest-signal information first.

**Recommendation:** Raise `TOKEN_BUDGET_CHARS` to 10000 (from 6000). This gives ~2500 tokens of auxiliary context — still well within safe bounds, but allows referenced files and progress notes to be included more reliably. The priority ordering should remain as-is.
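The priority-ordered, budget-capped assembly described above can be sketched roughly as follows. This is a minimal illustration only — the function body and section names are hypothetical, not the actual internals of `run-plan-context.sh`:

```shell
#!/usr/bin/env bash
# Minimal sketch: append context sections in priority order, skipping any
# section that would push the total past the char budget.
TOKEN_BUDGET_CHARS=10000   # recommended value; the current default is 6000

build_context() {
  local out="" section
  for section in "$@"; do   # caller passes sections highest-signal first
    if (( ${#out} + ${#section} + 1 <= TOKEN_BUDGET_CHARS )); then
      out+="$section"$'\n'
    fi
  done
  printf '%s' "$out"
}

# Priority order from the findings: directives > failure patterns >
# referenced files > git log > progress notes. Placeholder content only.
ctx=$(build_context "DIRECTIVES..." "FAILURE PATTERNS..." "FILES..." "GIT LOG..." "PROGRESS...")
echo "${#ctx} chars injected (budget: $TOKEN_BUDGET_CHARS)"
```

Because sections are passed highest-signal first, anything dropped when the budget runs out is always the lowest-priority material.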

---

## 3. The "Lost in the Middle" Effect

### Findings

The landmark paper by Liu et al. (2023) from Stanford, UC Berkeley, and Samaya AI demonstrated that:

- **Performance is highest when relevant information is at the beginning or end of context.** This holds across all tested models (GPT-3.5-Turbo, Claude-1.3, MPT-30B, LongChat-13B).

- **Middle-positioned information suffers dramatic retrieval failures.** On multi-document QA, accuracy dropped from ~75% (information at position 1) to ~45% (information at position 10 of 20) for several models — a 30+ percentage point degradation from position alone.

- **The effect persists even in models explicitly designed for long contexts.** LongChat-13B, trained specifically for 16K contexts, still exhibited the U-shaped performance curve.

- **More documents amplify the effect.** Going from 10 to 20 documents increased the performance gap between edge-positioned and middle-positioned information.

- **2025 follow-up research confirms persistence:** The "Lost in the Haystack" paper (2025) found that smaller gold contexts (shorter needles) further degrade performance and amplify positional sensitivity. The effect is not an artifact of early models — it persists in current architectures.

### Evidence Quality

| Source | Type | Confidence |
|--------|------|------------|
| Liu et al. (2023) arXiv 2307.03172 | Peer-reviewed (TACL 2024) | HIGH |
| "Lost in the Haystack" (2025) | Peer-reviewed (NAACL 2025) | HIGH |

### Implications for the Toolkit

The toolkit's `run-plan-prompt.sh` places batch task text in the middle of the prompt, with metadata above and requirements below. This is suboptimal per Lost-in-the-Middle findings.

Current prompt structure:
```
1. Header (batch number, working directory, branch) <- TOP
2. Tasks in this batch <- MIDDLE
3. Recent commits <- MIDDLE
4. Previous progress <- MIDDLE
5. Referenced files <- MIDDLE
6. Requirements (TDD, quality gate, test count) <- BOTTOM
```

Optimal structure per research:
```
1. Tasks in this batch (THE CRITICAL CONTENT) <- TOP (primacy)
2. Referenced files <- NEAR TOP
3. Header metadata <- MIDDLE (low importance, OK here)
4. Recent commits <- MIDDLE
5. Previous progress <- MIDDLE
6. Requirements and directives <- BOTTOM (recency)
```

**Recommendation:** Restructure `build_batch_prompt()` to place the batch task text at the very top and the requirements/directives at the very bottom. Metadata and auxiliary context go in the middle where recall is lowest but impact of missing it is also lowest. This is a zero-cost change that could improve batch execution quality by up to 30% (per Anthropic's own testing of query placement).
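The reordering above could be sketched like this — a hypothetical illustration in which the variable names and function body are assumptions, not the actual `run-plan-prompt.sh` implementation:

```shell
#!/usr/bin/env bash
# Hypothetical reordering sketch: tasks first (primacy), requirements
# last (recency), low-importance metadata in the middle.
build_batch_prompt() {
  cat <<EOF
$BATCH_TASKS

$REFERENCED_FILES

$HEADER_METADATA
$RECENT_COMMITS
$PROGRESS_NOTES

$REQUIREMENTS
EOF
}

# Placeholder content only — the point is the ordering, not the text.
BATCH_TASKS="## Tasks in this batch (primacy slot)"
REFERENCED_FILES="## Referenced files"
HEADER_METADATA="Batch 3 | branch: feature/x"
RECENT_COMMITS="## Recent commits"
PROGRESS_NOTES="## Previous progress"
REQUIREMENTS="## Requirements and directives (recency slot)"
build_batch_prompt
```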

---

## 4. Model-Specific Degradation

### Findings

Degradation varies significantly by model family and tier:

**Claude models:**
- Most conservative behavior — tend to abstain when uncertain rather than hallucinate (Chroma Context Rot study)
- Opus 4.6 "actually uses full context effectively" unlike previous generations (Anthropic marketing, take with grain of salt)
- Haiku "loses track fast in longer sessions, forgetting variable names and changing class names randomly" — suited for short tasks only
- Sonnet handles multi-file logic and state management well, "remembered context better" than Haiku in real projects

**GPT models:**
- "Highest rates of hallucination, often generating confident but incorrect responses" under context pressure (Chroma)
- GPT-4 fails to retrieve needles toward the start of documents as context length increases

**General patterns:**
- Larger models degrade more gracefully than smaller ones
- MoE architectures (Mixtral) show anomalous behavior — routing overhead at intermediate context lengths can paradoxically slow performance before the expected degradation point
- All models show the positional U-shaped curve, but severity varies

**By task type:**
- Retrieval tasks (find specific information): Most sensitive to context length and position
- Reasoning tasks (analyze and synthesize): More robust to context length, but quality degrades with irrelevant noise
- Code generation: Highly sensitive to having the right context, relatively robust to context volume if signal-to-noise ratio is maintained

### Evidence Quality

| Source | Type | Confidence |
|--------|------|------------|
| Chroma Context Rot (2025) | Industry research, 18 models | MEDIUM-HIGH |
| Real-world model comparisons | Practitioner reports | MEDIUM |
| Abubakar et al. (2026) | arXiv, 3 architectures | MEDIUM-HIGH |

### Implications for the Toolkit

The toolkit's model-agnostic approach (same context budget regardless of model) is reasonable given that all models share the same fundamental degradation patterns. However, the `run-plan.sh` script's `--model` flag could benefit from model-aware context budgets:

- **Haiku:** Reduce context injection budget (shorter attention span). Best for simple, well-specified tasks only.
- **Sonnet:** Current budget is well-suited. Good balance of context utilization and cost.
- **Opus:** Could tolerate larger context budgets, but the marginal benefit is small given the fresh-context architecture already keeps context minimal.

**Recommendation:** Add a model-tier multiplier to `TOKEN_BUDGET_CHARS`: Haiku 0.7x (4200 chars), Sonnet 1.0x (current), Opus 1.3x (7800 chars). LOW priority — the fresh-context architecture already mitigates most model-specific degradation.
180
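A sketch of the proposed multiplier, assuming a hypothetical helper keyed off the `--model` value (the multipliers are the suggested 0.7x/1.0x/1.3x values, not shipped defaults):

```shell
# Hypothetical model-tier budget scaling for TOKEN_BUDGET_CHARS.
TOKEN_BUDGET_CHARS=6000

budget_for_model() {
  case "$1" in
    haiku*) echo $(( TOKEN_BUDGET_CHARS * 7 / 10 )) ;;   # 0.7x -> 4200 chars
    opus*)  echo $(( TOKEN_BUDGET_CHARS * 13 / 10 )) ;;  # 1.3x -> 7800 chars
    *)      echo "$TOKEN_BUDGET_CHARS" ;;                # sonnet and default: 1.0x
  esac
}
```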
+
181
+ ---
182
+
183
+ ## 5. The Sweet Spot: Too Little vs. Too Much Context
184
+
185
+ ### Findings
186
+
187
+ Research and practice converge on a clear framework:
188
+
189
+ **Too little context (under ~500 tokens injected):**
190
+ - Agent doesn't know what happened in previous batches
191
+ - Repeats work already done
192
+ - Makes decisions inconsistent with prior implementation choices
193
+ - Fails to maintain architectural coherence across batches
194
+
195
+ **Sweet spot (~1000-4000 tokens injected on top of task text):**
196
+ - Agent has sufficient memory of prior work (progress notes, recent commits)
197
+ - Knows relevant failure patterns to avoid
198
+ - Can reference key files without drowning in irrelevant content
199
+ - Factory.ai's research shows this range preserves "structural relationships between components"
200
+
201
+ **Too much context (over ~8K tokens of auxiliary context):**
202
+ - Noise drowns signal — irrelevant context actively harms reasoning (Factory.ai)
203
+ - "Indiscriminate context stuffing becomes financially unsustainable" at scale
204
+ - Lost-in-the-Middle effect places critical information in the low-recall zone
205
+ - Latency increases non-linearly (a 720% increase at 15K words, per Abubakar et al.)
206
+
207
+ **The critical insight from Factory.ai:** "Compression ratio optimization is counterproductive." OpenAI's aggressive 99.3% compression sacrificed quality. For coding tasks, **total tokens consumed per completed task** matters more than tokens saved per request, because missing details force expensive re-fetching and error cycles.
208
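Illustrative arithmetic only: the request sizes and retry counts below are invented to show why tokens-per-completed-task is the better metric than tokens-per-request.

```shell
# Total token cost of a task = tokens per request * requests needed to finish.
tokens_per_task() {
  echo $(( $1 * $2 ))
}

# Over-compressed context: cheap requests, but missing details force
# re-fetching and error cycles, so more requests are needed per task.
lean=$(tokens_per_task 2000 9)   # 18000 tokens total
# Richer context: pricier requests, fewer cycles.
rich=$(tokens_per_task 5000 3)   # 15000 tokens total
```

With these made-up numbers, the "wasteful" per-request budget is the cheaper strategy overall.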
+
209
+ ### Evidence Quality
210
+
211
+ | Source | Type | Confidence |
212
+ |--------|------|------------|
213
+ | Factory.ai compression evaluation | Industry benchmark | MEDIUM-HIGH |
214
+ | Factory.ai context window problem analysis | Industry research | MEDIUM |
215
+ | ACON framework benchmarks | Peer-reviewed | HIGH |
216
+
217
+ ### Implications for the Toolkit
218
+
219
+ The toolkit's context assembler (`run-plan-context.sh`) already implements the correct priority ordering. The 6000-char budget lands in the sweet spot. The key improvement opportunity is not the budget size but the **information density** of what's injected.
220
+
221
+ **Recommendation:** Focus on improving context quality over quantity. Specifically:
222
+ 1. Replace raw `git log --oneline -5` with structured commit summaries (what changed, not just commit messages)
223
+ 2. Increase `progress.txt` tail from 10 to 15-20 lines (this is the highest-value context for cross-batch continuity)
224
+ 3. Add XML tags around each context section per Anthropic's guidance: `<prior_batches>`, `<failure_patterns>`, `<referenced_files>`, `<recent_changes>`
225
+
226
+ ---
227
+
228
+ ## 6. Context Compression and Selection Strategies
229
+
230
+ ### Findings
231
+
232
+ Six major strategies exist, ordered by complexity:
233
+
234
+ **1. Fresh Context (Context Reset)**
235
+ - What the toolkit does: start clean each batch
236
+ - Strongest guarantee against degradation
237
+ - Trade-off: loses all accumulated knowledge; requires explicit context injection
238
+ - Used by: autonomous-coding-toolkit (run-plan.sh Mode C)
239
+
240
+ **2. Observation Masking (Sliding Window)**
241
+ - Replace older environmental outputs with placeholders while preserving agent reasoning
242
+ - JetBrains research (2025): matched or exceeded LLM summarization in 4 of 5 configurations
243
+ - With Qwen3-Coder 480B: 2.6% solve rate improvement + 52% cost reduction vs. unmanaged context
244
+ - Used by: SWE-agent
245
+
246
+ **3. LLM Summarization (Compaction)**
247
+ - Use a separate model call to compress conversation history
248
+ - Anthropic's Claude Code uses this ("compaction") when approaching context limits
249
+ - Drawback: 13-15% trajectory elongation — agents run longer, increasing cost
250
+ - Drawback: "masks failure signals" — summaries may obscure indicators that the agent should stop
251
+ - Used by: OpenHands, Claude Code
252
+
253
+ **4. Structured Summarization (Anchored Iterative)**
254
+ - Maintain structured sections (files modified, decisions made, goals remaining) rather than free-form summaries
255
+ - Factory.ai's approach: outscored OpenAI and Anthropic on quality (3.70 vs 3.35 and 3.44)
256
+ - Preserves technical details like file paths and error codes (accuracy 4.04 vs 3.43)
257
+ - All methods struggle with "artifact trail preservation" (2.19-2.45 out of 5.0)
258
+
259
+ **5. Code-Specific Compression (LongCodeZip)**
260
+ - Dual-stage: coarse-grained (AST-based structural compression) + fine-grained (token-level)
261
+ - Achieves 5.6x compression without degrading task performance
262
+ - Purpose-built for code LLMs
263
+ - Published: October 2025
264
+
265
+ **6. RAG/Semantic Retrieval**
266
+ - Use embeddings to retrieve only the most relevant context chunks
267
+ - LLMLingua: 20x compression with minimal performance loss when integrated with LangChain/LlamaIndex
268
+ - Risk for code: destroys structural relationships between components when chunking naively
269
+ - Best used as a complement to, not replacement for, structured context injection
270
+
271
+ ### Evidence Quality
272
+
273
+ | Source | Type | Confidence |
274
+ |--------|------|------------|
275
+ | JetBrains context management (2025) | Industry research | HIGH |
276
+ | Factory.ai compression evaluation | Industry benchmark | MEDIUM-HIGH |
277
+ | ACON framework (ICLR 2025) | Peer-reviewed | HIGH |
278
+ | LongCodeZip (2025) | arXiv | MEDIUM-HIGH |
279
+ | LLMLingua | Peer-reviewed + deployed | HIGH |
280
+
281
+ ### Implications for the Toolkit
282
+
283
+ The toolkit's fresh-context approach (Strategy 1) is the most robust but also the most expensive in terms of information loss. The `progress.txt` mechanism and context assembler partially compensate, but there's room to adopt elements of Strategy 4 (structured summarization).
284
+
285
+ **Recommendation:** Enhance `progress.txt` with structured sections rather than free-form append-only text:
286
+ ```
287
+ ## Batch N Summary
288
+ ### Files Modified
289
+ - path/to/file.py (added function X, modified class Y)
290
+ ### Decisions Made
291
+ - Chose approach A over B because...
292
+ ### Issues Encountered
293
+ - Test Z failed due to...
294
+ ### State
295
+ - 45 tests passing, 2 pending
296
+ ```
297
+
298
+ This makes the last-N-lines tail read by subsequent batches far more information-dense. The structured format also makes it possible to selectively inject specific sections (e.g., only "Decisions Made" for architecture-sensitive batches).
299
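A sketch of a hypothetical writer for the structured format above (the helper name and `PROGRESS_FILE` override are illustrative, not existing toolkit API):

```shell
# Append one structured batch summary so the last-N-lines tail read by the
# next batch stays information-dense. Writes to progress.txt by default.
append_batch_summary() {
  local n="$1" files="$2" decisions="$3" issues="$4" state="$5"
  {
    printf '## Batch %s Summary\n' "$n"
    printf '### Files Modified\n%s\n' "$files"
    printf '### Decisions Made\n%s\n' "$decisions"
    printf '### Issues Encountered\n%s\n' "$issues"
    printf '### State\n%s\n\n' "$state"
  } >> "${PROGRESS_FILE:-progress.txt}"
}

# Usage:
#   append_batch_summary 3 "- src/app.py (added parser)" \
#     "- chose approach A over B" "- none" "- 45 tests passing, 2 pending"
```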
+
300
+ ---
301
+
302
+ ## 7. Anthropic's Official Guidance
303
+
304
+ ### Findings
305
+
306
+ Anthropic's engineering blog posts and documentation provide clear guidance:
307
+
308
+ **Context as a finite resource:**
309
+ > "LLMs have an attention budget that they draw on when parsing large volumes of context." Context rot emerges across all models — as token count increases, recall accuracy decreases. Even larger context windows remain subject to attention constraints due to transformer architecture limitations (n-squared pairwise token relationships).
310
+
311
+ **Four key techniques (from Anthropic's "Effective Context Engineering" blog):**
312
+
313
+ 1. **Compaction:** Summarize conversation history when approaching limits. Preserve architectural decisions and unresolved bugs. Discard redundant tool outputs.
314
+
315
+ 2. **Just-in-Time Context:** Maintain lightweight identifiers, dynamically load data at runtime using tools. "Mirrors human cognition — we retrieve information on demand."
316
+
317
+ 3. **Sub-Agent Architecture:** Specialized sub-agents with clean context windows return condensed summaries (1000-2000 tokens) to a coordinating agent. "Fresh start — the main agent context is not carried to subagents."
318
+
319
+ 4. **Progress Documentation:** Maintain `claude-progress.txt` alongside git history. This is explicitly preferred over compaction alone because "compaction doesn't always pass perfectly clear instructions to the next agent."
320
+
321
+ **Document placement (from Anthropic's long-context tips):**
322
+ - Place long documents at the TOP of prompts, above queries and instructions
323
+ - Queries at the end improve response quality by up to 30%
324
+ - Use XML tags (`<document>`, `<document_content>`, `<source>`) for structure
325
+ - Ask Claude to quote relevant parts before answering — cuts through noise
326
+
327
+ **Multi-window architecture (from Anthropic's "Effective Harnesses" blog):**
328
+ - Use an initializer agent for first-window setup, then a coding agent for incremental work
329
+ - Each subsequent session: read progress logs and git history, review requirements, run tests, work incrementally
330
+ - Prevent agents from "one-shotting" projects — enforce incremental progress
331
+
332
+ ### Evidence Quality
333
+
334
+ | Source | Type | Confidence |
335
+ |--------|------|------------|
336
+ | Anthropic "Effective Context Engineering" | Official engineering blog | HIGH |
337
+ | Anthropic "Effective Harnesses" | Official engineering blog | HIGH |
338
+ | Anthropic long-context tips | Official documentation | HIGH |
339
+ | Anthropic context windows docs | Official documentation | HIGH |
340
+
341
+ ### Implications for the Toolkit
342
+
343
+ The toolkit already implements Anthropic's recommended patterns:
344
+ - Fresh sub-processes (≈ sub-agent architecture)
345
+ - `progress.txt` (≈ progress documentation)
346
+ - Context assembler with budget (≈ just-in-time context)
347
+
348
+ The gap is in **document placement** — the toolkit doesn't follow the "long content at top, queries at bottom" guidance, and doesn't use XML structuring for context sections.
349
+
350
+ **Recommendation:** Adopt Anthropic's XML tag structure in the prompt template. Wrap injected context in semantic tags. Place the batch task specification at the top and requirements/directives at the bottom.
351
+
352
+ ---
353
+
354
+ ## 8. Fresh Context vs. Accumulated Context with Good Management
355
+
356
+ ### Findings
357
+
358
+ This is the core architectural question. The evidence strongly favors fresh context for autonomous coding:
359
+
360
+ **Arguments for fresh context (the toolkit's approach):**
361
+ - Eliminates degradation curve entirely — every batch starts at peak performance
362
+ - No risk of Lost-in-the-Middle effects on critical task instructions
363
+ - No compaction artifacts or information loss from summarization
364
+ - Deterministic context composition — same task always gets the same context structure
365
+ - JetBrains research: even the best-managed accumulated context (observation masking) only matches fresh context performance while adding complexity
366
+ - Anthropic's own recommendation: "When the context window is cleared, consider restarting rather than compressing"
367
+
368
+ **Arguments for accumulated context:**
369
+ - Agents discover things during execution that aren't in the plan (edge cases, API quirks, naming conventions)
370
+ - Cross-task dependencies are naturally preserved in conversation history
371
+ - No need for explicit context serialization (progress.txt, state files)
372
+ - Compaction + context editing can extend effective session length significantly
373
+
374
+ **Arguments for hybrid (fresh context + rich injection):**
375
+ - Gets the reliability of fresh context with the continuity of accumulated knowledge
376
+ - `progress.txt` + structured context injection bridges the gap
377
+ - Factory.ai's structured summarization shows this approach preserves 95%+ of relevant context across resets
378
+ - ACON: 26-54% peak token reduction while maintaining task performance — the savings come from discarding irrelevant accumulated context, not useful context
379
+
380
+ **The decisive evidence:** JetBrains' 2025 study found that LLM summarization (the best way to manage accumulated context) caused agents to run 13-15% longer and masked failure signals. Observation masking (a partial-reset strategy) matched fresh context performance. This suggests that accumulated context management adds cost and complexity without improving outcomes for well-structured tasks.
381
+
382
+ **The exception:** For exploratory/debugging tasks (Ralph Loop Mode D), accumulated context has value. The stop-hook approach that re-injects prompts while preserving file-system state is a reasonable hybrid — the agent sees prior work through git history and progress.txt rather than conversation history.
383
+
384
+ ### Evidence Quality
385
+
386
+ | Source | Type | Confidence |
387
+ |--------|------|------------|
388
+ | JetBrains context management study (2025) | Industry research, controlled | HIGH |
389
+ | Anthropic engineering blogs (2025) | Official guidance | HIGH |
390
+ | ACON framework (ICLR 2025) | Peer-reviewed | HIGH |
391
+ | Factory.ai structured summarization | Industry benchmark | MEDIUM-HIGH |
392
+
393
+ ### Implications for the Toolkit
394
+
395
+ Fresh context per batch is validated as the superior strategy for structured plan execution. The toolkit should continue this approach and invest in improving the quality of context injection rather than switching to accumulated context.
396
+
397
+ **Recommendation:** Maintain fresh context as the default. For Ralph Loop (Mode D), consider implementing observation masking as a lightweight alternative to full context reset — mask old tool outputs while preserving the agent's reasoning chain. This would give Ralph loops better continuity without the full cost of accumulated context.
398
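A sketch of what observation masking could look like for Ralph Loop, assuming a plain-text transcript where tool outputs sit between `--- tool output ---` and `--- end ---` markers (the markers, function name, and format are all assumptions for illustration):

```shell
# Keep the last K tool outputs verbatim; replace older ones with a placeholder
# while preserving the agent's own reasoning lines.
mask_old_observations() {
  local file="$1" keep="${2:-2}" total
  total=$(grep -c '^--- tool output ---$' "$file")
  awk -v total="$total" -v keep="$keep" '
    /^--- tool output ---$/ {
      n++
      if (n <= total - keep) { skip = 1; print "[output elided]"; next }
    }
    /^--- end ---$/ && skip { skip = 0; next }
    skip { next }
    { print }
  ' "$file"
}
```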
+
399
+ ---
400
+
401
+ ## Consolidated Recommendations
402
+
403
+ ### Priority 1 (High Impact, Low Effort)
404
+
405
+ 1. **Restructure prompt placement in `build_batch_prompt()`:** Move batch task text to the top, requirements to the bottom. Zero-cost change, up to 30% quality improvement per Anthropic's own testing. **Confidence: HIGH.**
406
+
407
+ 2. **Add XML tags to context sections:** Wrap each injected context section in semantic tags (`<batch_tasks>`, `<prior_progress>`, `<failure_patterns>`, `<referenced_files>`, `<requirements>`). Aligns with Anthropic's explicit guidance. **Confidence: HIGH.**
408
+
409
+ 3. **Document empirical basis in ARCHITECTURE.md:** Replace the unsourced claim "context degradation is the #1 quality killer" with specific citations: Lost-in-the-Middle (Liu et al., 2023), Context Rot (Chroma, 2025), and Anthropic's own "attention budget" framing. **Confidence: HIGH.**
410
+
411
+ ### Priority 2 (Medium Impact, Medium Effort)
412
+
413
+ 4. **Raise `TOKEN_BUDGET_CHARS` to 10000:** Current 6000 is safe but conservative. 10000 (~2500 tokens) remains well within the sweet spot while allowing richer context injection. **Confidence: MEDIUM-HIGH.**
414
+
415
+ 5. **Structure `progress.txt` format:** Define sections (Files Modified, Decisions Made, Issues Encountered, State) instead of free-form text. Makes tail reads by subsequent batches far more information-dense. **Confidence: MEDIUM-HIGH.**
416
+
417
+ 6. **Increase `progress.txt` tail read from 10 to 20 lines:** This is the highest-value context for cross-batch continuity, and the current 10-line limit may truncate critical information. **Confidence: MEDIUM.**
418
+
419
+ ### Priority 3 (Lower Priority, Higher Effort)
420
+
421
+ 7. **Model-aware context budgets:** Haiku 0.7x, Sonnet 1.0x, Opus 1.3x multiplier on `TOKEN_BUDGET_CHARS`. Useful but low-urgency given fresh-context architecture already mitigates model-specific degradation. **Confidence: MEDIUM.**
422
+
423
+ 8. **Observation masking for Ralph Loop retries:** When a Ralph iteration fails, mask old tool outputs in the re-injected context rather than providing raw conversation history. JetBrains research shows this matches summarization quality at lower cost. **Confidence: MEDIUM.**
424
+
425
+ 9. **Structured commit summaries:** Replace raw `git log --oneline` with `git log --format="- %s (%h): [files changed]"` or a custom summary that includes which files were modified, not just commit messages. **Confidence: MEDIUM.**
426
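One way to realize this with stock git flags (`--name-only` supplies the touched files; the helper name is hypothetical):

```shell
# Emit recent commits as "- subject (hash)" followed by the files each
# commit touched, indented beneath it. Blank separator lines are dropped.
recent_changes_summary() {
  git log -5 --format='- %s (%h)' --name-only |
    awk 'NF { if ($0 ~ /^- /) print $0; else print "    " $0 }'
}
```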
+
427
+ ---
428
+
429
+ ## Sources
430
+
431
+ ### Peer-Reviewed Papers
432
+
433
+ - Liu, N.F. et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts." [arXiv 2307.03172](https://arxiv.org/abs/2307.03172). Published in TACL 2024.
434
+ - Kang, M. et al. (2025). "ACON: Optimizing Context Compression for Long-Horizon LLM Agents." [arXiv 2510.00615](https://arxiv.org/abs/2510.00615). ICLR 2025.
435
+ - Abubakar et al. (2026). "Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths." [arXiv 2601.11564](https://arxiv.org/abs/2601.11564).
436
+ - Wang et al. (2025). "Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find." [arXiv 2505.18148](https://arxiv.org/abs/2505.18148). NAACL 2025.
437
+ - Li et al. (2025). "LongCodeZip: Compress Long Context for Code Language Models." [arXiv 2510.00446](https://arxiv.org/abs/2510.00446).
438
+
439
+ ### Industry Research
440
+
441
+ - Hong et al. (2025). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." [Chroma Research](https://research.trychroma.com/context-rot).
442
+ - JetBrains Research (2025). "Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents." [JetBrains Research Blog](https://blog.jetbrains.com/research/2025/12/efficient-context-management/).
443
+ - Factory.ai (2025). "The Context Window Problem: Scaling Agents Beyond Token Limits." [Factory.ai](https://factory.ai/news/context-window-problem).
444
+ - Factory.ai (2025). "Evaluating Context Compression for AI Agents." [Factory.ai](https://factory.ai/news/evaluating-compression).
445
+ - Epoch AI (2025). "LLMs now accept longer inputs, and the best models can use them more effectively." [Epoch AI](https://epoch.ai/data-insights/context-windows).
446
+
447
+ ### Anthropic Documentation
448
+
449
+ - Anthropic (2025). "Effective Context Engineering for AI Agents." [Engineering Blog](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents).
450
+ - Anthropic (2025). "Effective Harnesses for Long-Running Agents." [Engineering Blog](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents).
451
+ - Anthropic (2025). "Long Context Prompting Tips." [Claude API Docs](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/long-context-tips).
452
+ - Anthropic (2025). "Context Windows." [Claude API Docs](https://platform.claude.com/docs/en/build-with-claude/context-windows).
453
+ - Anthropic (2025). "Prompt Engineering for Claude's Long Context Window." [Anthropic News](https://www.anthropic.com/news/prompting-long-context).
454
+
455
+ ### Agent Framework References
456
+
457
+ - OpenHands (2025). "The OpenHands Software Agent SDK." [arXiv 2511.03690](https://arxiv.org/abs/2511.03690).
458
+ - OpenHands (2025). "CodeAct 2.1: An Open, State-of-the-Art Software Development Agent." [OpenHands Blog](https://openhands.dev/blog/openhands-codeact-21-an-open-state-of-the-art-software-development-agent).
459
+ - LLMLingua. "Effectively Deliver Information to LLMs via Prompt Compression." [LLMLingua](https://llmlingua.com/).