autonomous-coding-toolkit 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +22 -0
- package/.claude-plugin/plugin.json +13 -0
- package/LICENSE +21 -0
- package/Makefile +21 -0
- package/README.md +140 -0
- package/SECURITY.md +28 -0
- package/agents/bash-expert.md +113 -0
- package/agents/dependency-auditor.md +138 -0
- package/agents/integration-tester.md +120 -0
- package/agents/lesson-scanner.md +149 -0
- package/agents/python-expert.md +179 -0
- package/agents/service-monitor.md +141 -0
- package/agents/shell-expert.md +147 -0
- package/benchmarks/runner.sh +147 -0
- package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
- package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
- package/benchmarks/tasks/02-refactor-module/task.md +8 -0
- package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
- package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
- package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
- package/bin/act.js +238 -0
- package/commands/autocode.md +6 -0
- package/commands/cancel-ralph.md +18 -0
- package/commands/code-factory.md +53 -0
- package/commands/create-prd.md +55 -0
- package/commands/ralph-loop.md +18 -0
- package/commands/run-plan.md +117 -0
- package/commands/submit-lesson.md +122 -0
- package/docs/ARCHITECTURE.md +630 -0
- package/docs/CONTRIBUTING.md +125 -0
- package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
- package/docs/lessons/0002-async-def-without-await.md +28 -0
- package/docs/lessons/0003-create-task-without-callback.md +28 -0
- package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
- package/docs/lessons/0005-sqlite-without-closing.md +33 -0
- package/docs/lessons/0006-venv-pip-path.md +27 -0
- package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
- package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
- package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
- package/docs/lessons/0010-local-outside-function-bash.md +33 -0
- package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
- package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
- package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
- package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
- package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
- package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
- package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
- package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
- package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
- package/docs/lessons/0020-persist-state-incrementally.md +44 -0
- package/docs/lessons/0021-dual-axis-testing.md +48 -0
- package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
- package/docs/lessons/0023-static-analysis-spiral.md +51 -0
- package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
- package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
- package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
- package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
- package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
- package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
- package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
- package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
- package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
- package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
- package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
- package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
- package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
- package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
- package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
- package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
- package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
- package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
- package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
- package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
- package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
- package/docs/lessons/0045-iterative-design-improvement.md +33 -0
- package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
- package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
- package/docs/lessons/0048-integration-wiring-batch.md +40 -0
- package/docs/lessons/0049-ab-verification.md +41 -0
- package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
- package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
- package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
- package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
- package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
- package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
- package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
- package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
- package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
- package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
- package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
- package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
- package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
- package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
- package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
- package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
- package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
- package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
- package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
- package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
- package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
- package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
- package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
- package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
- package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
- package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
- package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
- package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
- package/docs/lessons/0078-static-review-without-live-test.md +30 -0
- package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
- package/docs/lessons/FRAMEWORK.md +161 -0
- package/docs/lessons/SUMMARY.md +201 -0
- package/docs/lessons/TEMPLATE.md +85 -0
- package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
- package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
- package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
- package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
- package/docs/plans/2026-02-21-mab-research-report.md +406 -0
- package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
- package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
- package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
- package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
- package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
- package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
- package/docs/plans/2026-02-22-mab-run-design.md +462 -0
- package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
- package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
- package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
- package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
- package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
- package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
- package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
- package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
- package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
- package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
- package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
- package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
- package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
- package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
- package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
- package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
- package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
- package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
- package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
- package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
- package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
- package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
- package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
- package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
- package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
- package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
- package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
- package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
- package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
- package/docs/plans/2026-02-24-headless-module-split.md +443 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
- package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
- package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
- package/docs/plans/audit-findings.md +186 -0
- package/docs/telegram-notification-format.md +98 -0
- package/examples/example-plan.md +51 -0
- package/examples/example-prd.json +72 -0
- package/examples/example-roadmap.md +33 -0
- package/examples/quickstart-plan.md +63 -0
- package/hooks/hooks.json +26 -0
- package/hooks/setup-symlinks.sh +48 -0
- package/hooks/stop-hook.sh +135 -0
- package/package.json +47 -0
- package/policies/bash.md +71 -0
- package/policies/python.md +71 -0
- package/policies/testing.md +61 -0
- package/policies/universal.md +60 -0
- package/scripts/analyze-report.sh +97 -0
- package/scripts/architecture-map.sh +145 -0
- package/scripts/auto-compound.sh +273 -0
- package/scripts/batch-audit.sh +42 -0
- package/scripts/batch-test.sh +101 -0
- package/scripts/entropy-audit.sh +221 -0
- package/scripts/failure-digest.sh +51 -0
- package/scripts/generate-ast-rules.sh +96 -0
- package/scripts/init.sh +112 -0
- package/scripts/lesson-check.sh +428 -0
- package/scripts/lib/common.sh +61 -0
- package/scripts/lib/cost-tracking.sh +153 -0
- package/scripts/lib/ollama.sh +60 -0
- package/scripts/lib/progress-writer.sh +128 -0
- package/scripts/lib/run-plan-context.sh +215 -0
- package/scripts/lib/run-plan-echo-back.sh +231 -0
- package/scripts/lib/run-plan-headless.sh +396 -0
- package/scripts/lib/run-plan-notify.sh +57 -0
- package/scripts/lib/run-plan-parser.sh +81 -0
- package/scripts/lib/run-plan-prompt.sh +215 -0
- package/scripts/lib/run-plan-quality-gate.sh +132 -0
- package/scripts/lib/run-plan-routing.sh +315 -0
- package/scripts/lib/run-plan-sampling.sh +170 -0
- package/scripts/lib/run-plan-scoring.sh +146 -0
- package/scripts/lib/run-plan-state.sh +142 -0
- package/scripts/lib/run-plan-team.sh +199 -0
- package/scripts/lib/telegram.sh +54 -0
- package/scripts/lib/thompson-sampling.sh +176 -0
- package/scripts/license-check.sh +74 -0
- package/scripts/mab-run.sh +575 -0
- package/scripts/module-size-check.sh +146 -0
- package/scripts/patterns/async-no-await.yml +5 -0
- package/scripts/patterns/bare-except.yml +6 -0
- package/scripts/patterns/empty-catch.yml +6 -0
- package/scripts/patterns/hardcoded-localhost.yml +9 -0
- package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
- package/scripts/pipeline-status.sh +197 -0
- package/scripts/policy-check.sh +226 -0
- package/scripts/prior-art-search.sh +133 -0
- package/scripts/promote-mab-lessons.sh +126 -0
- package/scripts/prompts/agent-a-superpowers.md +29 -0
- package/scripts/prompts/agent-b-ralph.md +29 -0
- package/scripts/prompts/judge-agent.md +61 -0
- package/scripts/prompts/planner-agent.md +44 -0
- package/scripts/pull-community-lessons.sh +90 -0
- package/scripts/quality-gate.sh +266 -0
- package/scripts/research-gate.sh +90 -0
- package/scripts/run-plan.sh +329 -0
- package/scripts/scope-infer.sh +159 -0
- package/scripts/setup-ralph-loop.sh +155 -0
- package/scripts/telemetry.sh +230 -0
- package/scripts/tests/run-all-tests.sh +52 -0
- package/scripts/tests/test-act-cli.sh +46 -0
- package/scripts/tests/test-agents-md.sh +87 -0
- package/scripts/tests/test-analyze-report.sh +114 -0
- package/scripts/tests/test-architecture-map.sh +89 -0
- package/scripts/tests/test-auto-compound.sh +169 -0
- package/scripts/tests/test-batch-test.sh +65 -0
- package/scripts/tests/test-benchmark-runner.sh +25 -0
- package/scripts/tests/test-common.sh +168 -0
- package/scripts/tests/test-cost-tracking.sh +158 -0
- package/scripts/tests/test-echo-back.sh +180 -0
- package/scripts/tests/test-entropy-audit.sh +146 -0
- package/scripts/tests/test-failure-digest.sh +66 -0
- package/scripts/tests/test-generate-ast-rules.sh +145 -0
- package/scripts/tests/test-helpers.sh +82 -0
- package/scripts/tests/test-init.sh +47 -0
- package/scripts/tests/test-lesson-check.sh +278 -0
- package/scripts/tests/test-lesson-local.sh +55 -0
- package/scripts/tests/test-license-check.sh +109 -0
- package/scripts/tests/test-mab-run.sh +182 -0
- package/scripts/tests/test-ollama-lib.sh +49 -0
- package/scripts/tests/test-ollama.sh +60 -0
- package/scripts/tests/test-pipeline-status.sh +198 -0
- package/scripts/tests/test-policy-check.sh +124 -0
- package/scripts/tests/test-prior-art-search.sh +96 -0
- package/scripts/tests/test-progress-writer.sh +140 -0
- package/scripts/tests/test-promote-mab-lessons.sh +110 -0
- package/scripts/tests/test-pull-community-lessons.sh +149 -0
- package/scripts/tests/test-quality-gate.sh +241 -0
- package/scripts/tests/test-research-gate.sh +132 -0
- package/scripts/tests/test-run-plan-cli.sh +86 -0
- package/scripts/tests/test-run-plan-context.sh +305 -0
- package/scripts/tests/test-run-plan-e2e.sh +153 -0
- package/scripts/tests/test-run-plan-headless.sh +424 -0
- package/scripts/tests/test-run-plan-notify.sh +124 -0
- package/scripts/tests/test-run-plan-parser.sh +217 -0
- package/scripts/tests/test-run-plan-prompt.sh +254 -0
- package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
- package/scripts/tests/test-run-plan-routing.sh +178 -0
- package/scripts/tests/test-run-plan-scoring.sh +148 -0
- package/scripts/tests/test-run-plan-state.sh +261 -0
- package/scripts/tests/test-run-plan-team.sh +157 -0
- package/scripts/tests/test-scope-infer.sh +150 -0
- package/scripts/tests/test-setup-ralph-loop.sh +63 -0
- package/scripts/tests/test-telegram-env.sh +38 -0
- package/scripts/tests/test-telegram.sh +121 -0
- package/scripts/tests/test-telemetry.sh +46 -0
- package/scripts/tests/test-thompson-sampling.sh +139 -0
- package/scripts/tests/test-validate-all.sh +60 -0
- package/scripts/tests/test-validate-commands.sh +89 -0
- package/scripts/tests/test-validate-hooks.sh +98 -0
- package/scripts/tests/test-validate-lessons.sh +150 -0
- package/scripts/tests/test-validate-plan-quality.sh +235 -0
- package/scripts/tests/test-validate-plans.sh +187 -0
- package/scripts/tests/test-validate-plugin.sh +106 -0
- package/scripts/tests/test-validate-prd.sh +184 -0
- package/scripts/tests/test-validate-skills.sh +134 -0
- package/scripts/validate-all.sh +57 -0
- package/scripts/validate-commands.sh +67 -0
- package/scripts/validate-hooks.sh +89 -0
- package/scripts/validate-lessons.sh +98 -0
- package/scripts/validate-plan-quality.sh +369 -0
- package/scripts/validate-plans.sh +120 -0
- package/scripts/validate-plugin.sh +86 -0
- package/scripts/validate-policies.sh +42 -0
- package/scripts/validate-prd.sh +118 -0
- package/scripts/validate-skills.sh +96 -0
- package/skills/autocode/SKILL.md +285 -0
- package/skills/autocode/ab-verification.md +51 -0
- package/skills/autocode/code-quality-standards.md +37 -0
- package/skills/autocode/competitive-mode.md +364 -0
- package/skills/brainstorming/SKILL.md +97 -0
- package/skills/capture-lesson/SKILL.md +187 -0
- package/skills/check-lessons/SKILL.md +116 -0
- package/skills/dispatching-parallel-agents/SKILL.md +110 -0
- package/skills/executing-plans/SKILL.md +85 -0
- package/skills/finishing-a-development-branch/SKILL.md +201 -0
- package/skills/receiving-code-review/SKILL.md +72 -0
- package/skills/requesting-code-review/SKILL.md +59 -0
- package/skills/requesting-code-review/code-reviewer.md +82 -0
- package/skills/research/SKILL.md +145 -0
- package/skills/roadmap/SKILL.md +115 -0
- package/skills/subagent-driven-development/SKILL.md +98 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
- package/skills/subagent-driven-development/implementer-prompt.md +73 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
- package/skills/systematic-debugging/SKILL.md +134 -0
- package/skills/systematic-debugging/condition-based-waiting.md +64 -0
- package/skills/systematic-debugging/defense-in-depth.md +32 -0
- package/skills/systematic-debugging/root-cause-tracing.md +55 -0
- package/skills/test-driven-development/SKILL.md +167 -0
- package/skills/using-git-worktrees/SKILL.md +219 -0
- package/skills/using-superpowers/SKILL.md +54 -0
- package/skills/verification-before-completion/SKILL.md +140 -0
- package/skills/verify/SKILL.md +82 -0
- package/skills/writing-plans/SKILL.md +128 -0
- package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,532 @@

# Research: Agent Failure Taxonomy — Why AI Coding Agents Fail

**Date:** 2026-02-22
**Researcher:** Claude Opus 4.6 (research agent)
**Confidence:** High (12 primary sources, 6 empirical studies with >35,000 data points)

---

## Executive Summary

Academic literature identifies **5-7 top-level failure categories** for AI coding agents, compared to the toolkit's 3 clusters. The toolkit's taxonomy (Silent Failures, Integration Boundaries, Cold-Start Assumptions) maps well to implementation-level failures but misses three major failure classes that dominate in empirical studies: **Specification Misunderstanding** (agents solve the wrong problem), **Planning Failures** (agents decompose tasks incorrectly), and **Context Degradation** (quality declines as context grows). These three categories account for an estimated 40-55% of all agent failures in the literature, yet the toolkit has zero lessons addressing them.

The toolkit's strengths are real — its coverage of integration boundaries and silent failures is more granular than any academic taxonomy. But its blind spots are systematic: it captures failures that happen *during* correct implementation but not failures that happen *before* implementation begins (wrong task, wrong plan, wrong context).

**Recommendation:** Add 3 new root-cause clusters to the taxonomy and keep the existing 3. The result is a 6-cluster model that covers the full agent failure surface.

---
## 1. What Does Academic Literature Say About Why Coding Agents Fail?

### 1.1 The Landscape of Empirical Studies

Six major empirical studies (2024-2026) provide quantitative failure analysis:

| Study | Scope | Key Finding |
|-------|-------|-------------|
| Jimenez et al. (2024) — SWE-Bench | 2,294 GitHub issues | Agents solve 3-65% depending on model/benchmark variant |
| SWE-EVO (2025) | Long-horizon evolution tasks | Best model (GPT-5) solves only 21% vs 65% on SWE-Bench Verified |
| Failed Agentic PRs (2026) | 33,596 agent-authored PRs | 71.5% merge rate overall; 38% of rejections are abandoned without review |
| Unmerged Fix PRs (2026) | 326 closed-unmerged PRs | 12 failure reasons; test failures (18%) and redundancy (22%) dominate |
| Autonomous Agent Failures (2025) | 204 runs, 3 frameworks | ~50% completion rate; planning errors dominate |
| MAST (2025) | 1,600+ traces, 7 frameworks | 14 failure modes in 3 categories; coordination failures = 37% |

### 1.2 Convergent Findings Across Studies

Despite different methodologies, all studies converge on a consistent set of root causes:

1. **Specification/instruction misunderstanding** — agents solve the wrong problem (strongest models fail here most)
2. **Implementation errors** — correct understanding but wrong code (weaker models fail here most)
3. **Tool/environment misuse** — incorrect invocation of editing tools, test runners, file paths
4. **Context/retrieval failures** — agents lose critical information or retrieve irrelevant context
5. **Planning failures** — incorrect task decomposition, unrealistic plans, failed self-refinement
6. **Verification gaps** — superficial or absent validation of generated code

**Confidence: High.** These categories appear independently in 4+ studies with different datasets and methodologies.

### 1.3 Distribution of Failure Causes

Synthesizing across studies, the approximate distribution:

| Failure Class | Estimated Share | Primary Sources |
|---------------|----------------|-----------------|
| Specification misunderstanding | 15-25% | SWE-EVO (60%+ for GPT-5), Failed PRs (4% unwanted features) |
| Incorrect implementation | 20-30% | SWE-EVO (70% for open-source models), code generation study (functional bugs 13-69%) |
| Planning failures | 10-20% | Autonomous agent study (dominant failure mode), MAST |
| Context/retrieval failures | 10-15% | Context rot studies (13.9-85% degradation), consolidation gap research |
| Tool/environment misuse | 5-15% | SWE-EVO, autonomous agent study (tool exploitation failures) |
| Verification gaps | 10-20% | MAST (21.3%), unmerged PRs (test failures 18.1%) |
| Process/coordination issues | 5-15% | MAST (36.9% coordination), abandoned PRs (38%) |

**Note:** These ranges overlap because studies use different taxonomies and granularity. A single failure often has multiple contributing causes.

**Confidence: Medium.** Individual study numbers are solid; cross-study synthesis requires interpretation due to taxonomy differences.

---
## 2. Failure Modes Unique to Autonomous Agents vs Human-in-the-Loop

### 2.1 Autonomous-Only Failure Modes

These failure modes are absent or rare with a human in the loop but frequent in autonomous execution:

| Failure Mode | Why Autonomous-Only | Toolkit Coverage |
|-------------|---------------------|-----------------|
| **Stuck in loop** — agent repeats same actions without progress | Human would notice and redirect after 1-2 iterations | None |
| **Premature termination** — agent gives up with viable paths remaining | Human would suggest next steps | None |
| **Context window overflow** — quality degrades as context fills | Human sessions are shorter and reset naturally | Implicit (ARCHITECTURE.md design principle, but no lesson) |
| **Cascading retry failure** — each retry compounds errors from previous attempts | Human would reset approach rather than building on failures | None |
| **Overthinking/safety conflicts** — larger models refuse viable actions due to safety training | Human can override or rephrase | None |
| **Failed self-refinement** — agent identifies error but applies wrong fix in loop | Human would catch the meta-error | None |
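
The "stuck in loop" row is a control-flow pathology rather than a code defect, which is why no grep-style lesson covers it. As an illustration of what mechanical detection could look like, here is a minimal, hypothetical detector (class name, window size, and threshold are assumptions, not toolkit code): it hashes each agent action and flags the run once the same signature repeats within a sliding window.

```python
from collections import deque


class LoopDetector:
    """Illustrative sketch: flag an agent run as 'stuck in a loop' when the
    same action signature recurs within a sliding window. Hypothetical
    helper for discussion, not part of the toolkit."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.recent = deque(maxlen=window)  # keeps only the last `window` actions
        self.max_repeats = max_repeats

    def record(self, tool: str, args: str) -> bool:
        """Record one action; return True once the repeat threshold is hit."""
        signature = hash((tool, args))
        self.recent.append(signature)
        return self.recent.count(signature) >= self.max_repeats


detector = LoopDetector()
for _ in range(3):
    stuck = detector.record("edit_file", "src/app.py: retry same patch")
print(stuck)  # True: the third identical action lands inside the window
```

A supervisor script could call `record()` after every tool invocation and abort or reset the agent instead of letting retries compound.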

### 2.2 Amplified Failure Modes

These failure modes exist with a human in the loop but are much worse in autonomous mode:

| Failure Mode | Human-in-Loop Severity | Autonomous Severity | Toolkit Coverage |
|-------------|------------------------|---------------------|-----------------|
| **Specification drift** — gradual deviation from intent | Low (human catches early) | High (compounds over batches) | None |
| **Integration blindness** — unit tests pass, integration fails | Medium | High (no human to spot it) | Strong (Cluster B) |
| **Silent failures** — errors produce no visible signal | Medium | Critical (no human watching) | Strong (Cluster A) |
| **Test adequacy illusion** — tests pass but don't cover the bug | Medium | High (agent trusts green tests) | Partial (lesson 0008) |

### 2.3 Key Insight

**The toolkit's taxonomy is biased toward implementation-phase failures because it was derived from a human-in-the-loop workflow** where Justin catches specification, planning, and context errors himself. In fully autonomous mode, these pre-implementation failures become the dominant failure class.

**Confidence: High.** This is a structural observation supported by the data — the toolkit's 61 lessons contain zero entries about the agent misunderstanding the task, decomposing it incorrectly, or losing context.

---
## 3. Failure Taxonomies Used by SWE-Bench, OpenHands, and SWE-Agent

### 3.1 SWE-EVO Taxonomy (7 categories)

The most granular benchmark-derived taxonomy, from SWE-EVO (2025):

| Category | Definition | Distribution (GPT-5) | Distribution (Open-Source) |
|----------|-----------|----------------------|---------------------------|
| Syntax Error | Patch breaks parsing/formatting | <5% | 5-10% |
| Incorrect Implementation | Right area, wrong behavior | 15-20% | ~70% |
| Instruction Following | Misreads or ignores requirements | 60%+ | 10-15% |
| Tool-Use | Failed invocation of agent tools | <5% | 10-15% |
| Stuck in Loop | Repeats actions without progress | <5% | 15-20% |
| Gave Up Prematurely | Terminates with viable paths remaining | <5% | 10-15% |
| Other | Rare/ambiguous failures | <5% | <5% |

**Key finding:** As models get stronger, **instruction following** (not implementation) becomes the dominant failure mode. This is counter-intuitive but robust across SWE-EVO and SWE-Bench Pro data.

### 3.2 OpenHands Agent Analysis

OpenHands provides a trajectory-level analysis framework:

- Agents correctly identify problematic files in **72-81% of cases, even in failures**
- Success depends on **approximate rather than exact** code modifications
- Failed trajectories are **consistently longer and more variable** than successful ones
- **Consolidation gap:** agents "see" 100% of relevant code but only retain 50-70% in final context

### 3.3 Three-Tier Autonomous Agent Taxonomy

From the comprehensive autonomous agent failure study (2025):

**Phase 1 — Planning:** Improper task decomposition, failed self-refinement, unrealistic planning
**Phase 2 — Execution:** Tool exploitation failures, code generation defects, environmental setup issues
**Phase 3 — Response:** Context window constraints, formatting issues, interaction limits exceeded

### 3.4 MAST Framework (Multi-Agent Systems)

14 failure modes in 3 categories, from 1,600+ annotated traces:

**Category 1 — System Design:** Poor prompt design, missing role constraints, lack of termination criteria
**Category 2 — Inter-Agent Misalignment:** Communication breakdowns, state synchronization issues, conflicting objectives (36.9% of all failures)
**Category 3 — Task Verification:** Superficial checks, compilation-only validation, inconsistent comment verification (21.3%)

### 3.5 Code Generation Error Taxonomy

From the comprehensive LLM code generation study (2024):

**Type A — Syntax Bugs** (<10%): Incomplete syntax, indentation, import errors
**Type B — Runtime Bugs** (5-45%): API misuse, undefined references, boundary conditions, argument errors
**Type C — Functional Bugs** (13-69%): Logic errors, hallucinations, I/O format errors

**Critical distribution insight:** Functional bugs (logic errors, wrong algorithm) increase with problem complexity. Syntax bugs are nearly eliminated by modern LLMs. The remaining challenge is *semantic correctness*.

---
## 4. Failure Classes the Toolkit's Lesson System Cannot Catch

### 4.1 Structural Blind Spots

The toolkit's lesson system catches pattern-level anti-patterns in code. These failure classes operate at a different level of abstraction:

| Failure Class | Why Uncatchable | Estimated Prevalence | Example |
|---------------|----------------|---------------------|---------|
| **Requirement misunderstanding** | No code pattern to grep for; the code is correct for the wrong spec | 15-25% | Agent implements caching when spec asked for rate limiting |
| **Architectural mismatch** | Decision is sound locally but wrong for the system; needs global context | 5-10% | Agent adds polling when system uses event-driven architecture |
| **Plausible-but-wrong patches** | Tests pass but behavior is incorrect; needs semantic verification | 10-15% | Fix handles reported case but breaks unreported edge case |
| **Context consolidation failures** | Agent saw the relevant code but lost it by patch time | 10-15% | Agent reads file with constraint, edits another file without applying constraint |
| **Planning over-decomposition** | Too many steps create compounding error probability | 5-10% | 20-step plan where a step-3 error cascades through steps 4-20 |
| **Hallucinated APIs/libraries** | Agent invents functions, parameters, or entire libraries that don't exist | 5-10% | `from sklearn.ensemble import AdaptiveGBM` (doesn't exist) |
|
|
170
|
+
|
|
171
|
+
### 4.2 What the Lesson System CAN Catch

By contrast, the toolkit excels at catching:

- Implementation-level anti-patterns (bare except, missing await, wrong pip path)
- Integration boundary violations (schema drift, unit mismatch, path confusion)
- Resource lifecycle errors (missing unsubscribe, connection leaks)
- Test anti-patterns (hardcoded counts, lint spirals, format mismatches)

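
What these have in common is that they are grep-detectable. As an illustration (a hypothetical helper, not the toolkit's actual lesson-check.sh), the bare-except pattern from lesson 0001 can be flagged in a few lines of Python:

```python
import re

# Flags the lesson-0001 anti-pattern: a bare `except:` that swallows errors.
# Illustrative sketch only; the toolkit's real checks live in its own scripts.
BARE_EXCEPT = re.compile(r"^\s*except\s*:", re.MULTILINE)

def find_bare_excepts(source: str) -> list[int]:
    """Return 1-based line numbers containing a bare `except:` clause."""
    return [source[:match.start()].count("\n") + 1
            for match in BARE_EXCEPT.finditer(source)]
```
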
### 4.3 The Gap in Numbers

**The toolkit's 61 lessons cover approximately 30-40% of the total failure surface** identified in the academic literature. The missing 60-70% is concentrated in pre-implementation failures (specification, planning, context) and post-implementation verification gaps (plausible-but-wrong patches).

**Confidence: Medium-High.** The 30-40% estimate is derived from mapping toolkit lessons against academic taxonomies. The exact number depends on task mix — for bug-fix-only tasks the toolkit covers more; for greenfield features it covers less.

---

## 5. Distribution of Failure Causes

### 5.1 Four-Way Split

Synthesizing across all studies, failures cluster into four macro-categories:

```
Specification Failures (20-25%)
├── Requirement misunderstanding
├── Instruction following errors
├── Unwanted/misaligned features
└── Wrong task description

Reasoning Failures (25-35%)
├── Incorrect implementation logic
├── Missing edge cases
├── Hallucinated APIs/behavior
└── Plausible-but-wrong patches

Tool/Environment Failures (10-20%)
├── Tool invocation errors
├── File path mistakes
├── Environmental setup issues
└── Context window overflow

Verification Failures (15-25%)
├── Insufficient test coverage
├── Superficial validation
├── Test adequacy illusion
└── Missing integration tests
```

### 5.2 The Counter-Intuitive Finding

**Better models fail differently, not less.** On SWE-EVO, more than 60% of GPT-5's failures are instruction-following errors, not implementation bugs. As implementation capability improves, specification understanding becomes the bottleneck.

This has a direct implication for the toolkit: **quality gates that check code correctness (lesson-check, test suites) will catch a shrinking share of failures over time.** The growing failure mode — specification misunderstanding — requires a different kind of gate.

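
One candidate for such a gate is a specification echo-back check: have the agent restate the requirement, and compare the restatement against the original spec before any code is written. A deliberately crude sketch — the token-overlap metric and the 0.4 threshold are invented for illustration, not toolkit values:

```python
def echo_back_gate(spec: str, restatement: str, threshold: float = 0.4) -> bool:
    """Pass only if the agent's restatement shares enough vocabulary with the spec.

    Crude Jaccard overlap on whitespace tokens; a real gate would use stopword
    removal, embeddings, or human review. The threshold is an assumption.
    """
    spec_terms = set(spec.lower().split())
    echo_terms = set(restatement.lower().split())
    overlap = len(spec_terms & echo_terms) / max(len(spec_terms | echo_terms), 1)
    return overlap >= threshold
```
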
### 5.3 Context Degradation as a Force Multiplier

Context degradation is not a failure mode in itself but a **force multiplier for all other failure modes**:

- Models experience **13.9-85% performance degradation** as input length increases, even well within claimed context windows
- Degradation is **worse on complex tasks** than on simple ones
- **Coherent context is harder to process** than shuffled text (counter-intuitive but empirically validated)
- The "consolidation gap" means agents lose 30-50% of relevant information between retrieval and patch generation

The toolkit's architecture (fresh context per batch) directly addresses this. But the lesson system doesn't capture *why* this matters or what happens when it fails.

**Confidence: High.** Context degradation is one of the most robustly measured phenomena reviewed here, with 5+ independent studies and consistent results across models.

---

## 6. How Failure Modes Differ by Task Type

### 6.1 Merge Success Rates by Task Type

From the 33,596-PR analysis:

| Task Type | Merge Rate | Primary Failure Mode |
|-----------|------------|----------------------|
| Documentation | 84% | Rarely fails |
| CI/Build | 74-79% | Configuration errors |
| Refactoring | ~75% | Behavioral regression |
| Feature addition | ~70% | Specification misunderstanding |
| Bug fix | 64% | Plausible-but-wrong patches |
| Performance | 55% | Incorrect optimization strategy |

### 6.2 Failure Mode Distribution by Task Type

| Failure Mode | Bug Fix | New Feature | Refactoring |
|--------------|---------|-------------|-------------|
| Specification error | Low | **High** | Medium |
| Implementation error | **High** | Medium | Low |
| Test adequacy gap | **High** | Medium | Low |
| Integration boundary | Medium | Medium | **High** |
| Architectural mismatch | Low | **High** | Medium |
| Context overflow | Low | **High** | Low |

### 6.3 Implications for the Toolkit

The toolkit's lesson system is best suited to **bug-fix and refactoring tasks**, where implementation-level patterns dominate. It is weakest for **new feature development**, where specification understanding and architectural decisions are the primary failure modes.

This matches the toolkit's origin story — lessons derived from implementation experience, not from feature design sessions.

**Confidence: Medium.** Task-type distributions come from a single large study. The directional findings are consistent across studies, but exact percentages may vary.

---

## 7. Failure Prevention Strategies with Empirical Support

### 7.1 Strategies with Strong Evidence

| Strategy | Evidence Source | Effect | Toolkit Implementation |
|----------|-----------------|--------|------------------------|
| **Fresh context per unit of work** | Context rot studies (5+) | Prevents 13.9-85% degradation | Yes — core architecture |
| **Test-driven development** | SWE-Bench, code gen studies | Catches implementation errors at write time | Yes — quality gates |
| **Retry with escalation** | Autonomous agent study | Success improves through iteration 10, plateaus after | Yes — run-plan.sh retry logic |
| **Dual-track verification** | MAST, UTBoost | Catches plausible-but-wrong patches | Partial — A/B verification exists |
| **Monotonic test counts** | Toolkit's own data | Prevents test deletion/breakage | Yes — quality gates |

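
The monotonic test-count gate in the last row is simple enough to sketch in full. This is a hypothetical stand-in for the toolkit's quality-gate script; the JSON state file is an assumption:

```python
import json
from pathlib import Path

def monotonic_test_gate(state_file: Path, current_count: int) -> bool:
    """Block the batch if the test count drops below the recorded high-water mark.

    Sketch of a monotonic test-count gate; file format and location are
    assumptions, not the toolkit's actual implementation.
    """
    baseline = 0
    if state_file.exists():
        baseline = json.loads(state_file.read_text())["tests"]
    if current_count < baseline:
        return False  # tests were deleted or broken; fail the gate
    state_file.write_text(json.dumps({"tests": current_count}))
    return True
```
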
### 7.2 Strategies with Moderate Evidence

| Strategy | Evidence Source | Effect | Toolkit Implementation |
|----------|-----------------|--------|------------------------|
| **RAG-based context injection** | Hallucination study (2024) | Reduces hallucinated APIs/libraries | Partial — per-batch context injection |
| **Specification validation before coding** | Failed PR study, MAST | Catches wrong-task errors early | Partial — brainstorming stage |
| **Multi-agent review** | MAST, multi-agent coding studies | Catches errors a single agent misses | Yes — subagent-driven-development |
| **Fault localization first** | OpenHands analysis | 72-81% correct even in failures; build on this | No explicit strategy |
| **Meta-controller for error routing** | Autonomous agent study | Routes planning vs execution errors to different fix strategies | No |

### 7.3 Strategies with Emerging Evidence

| Strategy | Evidence Source | Effect | Toolkit Implementation |
|----------|-----------------|--------|------------------------|
| **Specification diffing** — compare agent's understanding against human intent before coding | Addy Osmani (2025), spec writing guides | Catches requirement misunderstanding pre-implementation | No |
| **Behavioral regression testing** — test observable behavior, not just return values | UTBoost (ACL 2025) | Catches plausible-but-wrong patches that pass unit tests | No |
| **Trajectory length monitoring** — flag when agent trajectory exceeds 2x median | OpenHands trajectory analysis | Early warning for stuck/looping agents | No |
| **Confidence-gated commits** — agent declares confidence; low-confidence changes get extra review | Emerging practice | Routes uncertain code to human review | No |

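
Of these, trajectory length monitoring is the cheapest to adopt. A minimal sketch — the 2x factor follows the OpenHands observation that failed runs are consistently longer, but the helper itself is hypothetical:

```python
from statistics import median

def trajectory_alarm(past_lengths: list[int], current_steps: int,
                     factor: float = 2.0) -> bool:
    """Warn when the current run exceeds `factor` x the median trajectory
    length for this task type. An over-long trajectory is an early warning
    of a stuck or looping agent, not proof of failure.
    """
    if not past_lengths:
        return False  # no baseline yet, nothing to compare against
    return current_steps > factor * median(past_lengths)
```
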
**Confidence: High for 7.1, Medium for 7.2, Low-Medium for 7.3.** Strong-evidence strategies have multiple independent validations. Emerging strategies have theoretical support and early results but limited replication.

---

## 8. Gap Analysis: Toolkit's 3-Cluster Taxonomy vs Academic Taxonomies

### 8.1 Coverage Matrix

| Academic Failure Category | Toolkit Cluster | Coverage Level | Notes |
|---------------------------|-----------------|----------------|-------|
| Syntax errors | Cluster A (Silent) | Partial | Lessons cover some (0022 JSX, 0010 bash), but not code gen syntax errors |
| Runtime errors / API misuse | Cluster A + B | Strong | Lessons 0002, 0005, 0006, 0033, etc. |
| Functional/logic errors | Cluster B (Integration) | Strong | Lessons 0015, 0018, 0031, etc. |
| Silent failures | Cluster A (Silent) | **Excellent** | 21 lessons — most granular coverage of any taxonomy |
| Integration boundary errors | Cluster B (Integration) | **Excellent** | 27 lessons — unmatched depth |
| Cold-start failures | Cluster C (Cold-Start) | Good | 4 lessons — small but well-defined |
| **Specification misunderstanding** | **None** | **Missing** | Zero lessons. Major gap. |
| **Planning/decomposition errors** | **None** | **Missing** | Zero lessons. Addressed architecturally but not in the lesson system. |
| **Context degradation** | **None** | **Missing** | Addressed by architecture (fresh context), but no lessons capture what to do when it fails. |
| **Stuck in loop / premature termination** | **None** | **Missing** | No lessons. The Ralph loop has stop conditions, but no diagnostic lessons. |
| **Hallucination (API/library)** | **None** | **Missing** | No lessons about fabricated APIs, wrong library versions, or invented parameters. |
| **Verification gaps** | Partial | **Weak** | Lesson 0008 (quality gate blind spot) is the only entry. Academic literature puts this at 15-25% of failures. |
| **Tool/environment misuse** | Cluster B | Partial | Lessons 0006 (pip path) and 0044 (worktree deps), but missing broader tool invocation failures. |
| **Coordination failures (multi-agent)** | Cluster B | Partial | Lesson 0037 (parallel agents), but missing communication and state sync failures. |

### 8.2 Completeness Score

**The toolkit covers 4 of 9 major failure categories well, 2 partially, and 3 not at all.**

Mapping by estimated failure prevalence:

- **Well covered** (Clusters A, B, C): ~35-45% of failures
- **Partially covered**: ~10-15% of failures
- **Not covered** (specification, planning, context, hallucination, loops): ~40-55% of failures

### 8.3 What the Toolkit Does Better Than Academia

The academic taxonomies have their own gaps, which the toolkit fills:

1. **Granularity of implementation-level patterns.** No academic taxonomy distinguishes "bare except swallowing" from "async def without await" from "cache replace vs merge." The toolkit's 61 lessons provide grep-detectable specificity that academic categories lack.

2. **Actionability.** Academic taxonomies describe *what* fails. The toolkit's lessons describe *what to do about it* — with corrective actions, 5-whys analysis, and sustain plans.

3. **Compounding enforcement.** Academic taxonomies are descriptive. The toolkit turns lessons into automated checks (lesson-check.sh, hookify rules, quality gates). No academic framework has this feedback loop.

4. **Integration boundary depth.** The toolkit's 27 integration boundary lessons constitute the most detailed treatment of this failure class in any source reviewed.

---

## 9. Recommendations: New Lesson Categories

### 9.1 Proposed 6-Cluster Taxonomy

Retain the existing 3 clusters. Add 3 new ones:

| Cluster | Name | Description | Priority |
|---------|------|-------------|----------|
| A | Silent Failures | (existing) Something fails with no error signal | — |
| B | Integration Boundaries | (existing) Bug hides at seam between components | — |
| C | Cold-Start Assumptions | (existing) Works steady-state, fails on restart | — |
| **D** | **Specification Drift** | **Agent solves the wrong problem or deviates from intent** | **High** |
| **E** | **Context & Retrieval** | **Agent loses, ignores, or hallucinates critical information** | **High** |
| **F** | **Planning & Control Flow** | **Agent decomposes incorrectly, loops, or terminates prematurely** | **Medium** |

### 9.2 Proposed Starter Lessons per New Cluster

#### Cluster D: Specification Drift

| ID | Title | Type | Source |
|----|-------|------|--------|
| D-1 | Agent implements feature the spec didn't ask for | semantic | SWE-EVO instruction following |
| D-2 | Specification ambiguity resolved incorrectly by agent | semantic | Failed PR study |
| D-3 | Agent addresses symptom instead of root cause in bug fix | semantic | APR plausible-but-wrong patches |
| D-4 | Refactoring changes observable behavior (semantic regression) | semantic | Task-type failure analysis |

#### Cluster E: Context & Retrieval

| ID | Title | Type | Source |
|----|-------|------|--------|
| E-1 | Agent reads constraint in file A but ignores it when editing file B | semantic | Consolidation gap research |
| E-2 | Agent hallucinates API that doesn't exist in the library version used | semantic | Library hallucination study |
| E-3 | Long context causes quality degradation mid-task | semantic | Context rot studies |
| E-4 | RAG retrieval includes irrelevant code that distracts agent | semantic | Long-context LLM + RAG study |

#### Cluster F: Planning & Control Flow

| ID | Title | Type | Source |
|----|-------|------|--------|
| F-1 | Agent loops on same failed approach without changing strategy | semantic | SWE-EVO (stuck in loop) |
| F-2 | Agent gives up with viable approaches remaining | semantic | SWE-EVO (premature termination) |
| F-3 | Over-decomposed plan creates compounding error across steps | semantic | Autonomous agent study |
| F-4 | Agent self-refinement applies wrong fix to correctly-identified error | semantic | Failed self-refinement research |

### 9.3 New Diagnostic Shortcuts

| Symptom | Check First |
|---------|-------------|
| Agent's output is correct code that solves the wrong problem | D-1, D-2 |
| Fix works for reported case but breaks other cases | D-3 |
| Agent ignores information it read 10 minutes ago | E-1, E-3 |
| Code references API/function that doesn't exist | E-2 |
| Agent retries the same approach 3+ times | F-1 |
| Agent declares "done" but obvious work remains | F-2 |
| Step 7 of 15 fails because step 3 was wrong | F-3 |

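
The F-1 symptom (same approach retried 3+ times) is mechanically detectable if agent actions are logged in a normalized form. A hypothetical sketch, not part of the toolkit:

```python
from collections import Counter

def stuck_in_loop(actions: list[str], repeat_limit: int = 3) -> bool:
    """Flag F-1: the agent repeating one approach instead of changing strategy.

    `actions` are normalized descriptions of attempts (tool + arguments);
    the limit of 3 mirrors the diagnostic shortcut but is tunable.
    """
    return any(count >= repeat_limit for count in Counter(actions).values())
```
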
### 9.4 New Prevention Strategies to Implement

**High priority (strong evidence):**

1. **Specification echo-back gate.** Before coding, the agent must restate the requirement in its own words; a human or automated check compares the restatement against the original spec. Catches Cluster D failures. (Evidence: SWE-EVO, Addy Osmani spec guidance.)

2. **Trajectory length alarm.** If an agent's trajectory exceeds 2x the median for that task type, trigger a warning and force re-evaluation. Catches Cluster F failures. (Evidence: OpenHands trajectory analysis — failed runs are consistently longer.)

3. **Library/API existence check.** Before using any import or API call, verify that it exists in the installed version. Catches Cluster E hallucination failures. (Evidence: library hallucination study.)

**Medium priority (moderate evidence):**

4. **Constraint propagation check.** After reading a file with constraints, verify that those constraints appear in subsequent edits. Catches the consolidation gap (Cluster E).

5. **Behavioral regression tests.** Add tests for observable behavior (not just return values) to catch plausible-but-wrong patches. (Evidence: UTBoost.)

6. **Context budget monitoring.** Track context window utilization; trigger context pruning or a fresh start when approaching limits. (Evidence: context rot studies.)

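
Context budget monitoring needs only an approximate token count. A sketch using the common ~4-characters-per-token heuristic (both the heuristic and the budget value are assumptions):

```python
def over_context_budget(chunks: list[str], budget_tokens: int,
                        chars_per_token: int = 4) -> bool:
    """Estimate token usage for the accumulated context and report whether it
    exceeds the budget, at which point the caller should prune context or
    start a fresh session. The chars-per-token ratio is a rough heuristic.
    """
    estimated_tokens = sum(len(chunk) for chunk in chunks) // chars_per_token
    return estimated_tokens > budget_tokens
```
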
---

## 10. Sources

### Empirical Studies (Primary Sources)

1. Jimenez, C. E., et al. (2024). "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. https://arxiv.org/pdf/2310.06770
2. SWE-EVO (2025). "Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios." https://arxiv.org/html/2512.18470
3. "Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub." (2026). https://arxiv.org/html/2601.15195v1
4. "Why Are AI Agent-Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study." (2026). https://arxiv.org/html/2602.00164
5. "Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks." (2025). https://arxiv.org/html/2508.13143v1
6. Cemri, M., Pan, M. Z., Yang, S., et al. (2025). "Why Do Multi-Agent LLM Systems Fail?" (MAST framework). https://arxiv.org/abs/2503.13657

### Code Generation Error Analysis

7. "What's Wrong with Your Code Generated by Large Language Models? An Extensive Study." (2024). https://arxiv.org/html/2407.06153v1
8. "Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories." (2025). https://arxiv.org/abs/2511.00197
9. "LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation." ISSTA 2025. https://arxiv.org/abs/2409.20550
10. "Library Hallucinations in LLMs." (2025). https://arxiv.org/pdf/2509.22202

### Context Degradation

11. "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval." EMNLP 2025. https://arxiv.org/abs/2510.05381
12. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research (2025). https://research.trychroma.com/context-rot
13. "Context Discipline and Performance Correlation." (2026). https://arxiv.org/html/2601.11564v1

### Evaluation & Verification

14. "UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench." ACL 2025. https://aclanthology.org/2025.acl-long.189/

### Agent Frameworks & Tools

15. OpenHands Agent Analysis. https://github.com/OpenHands/agent-analysis
16. "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair." (2024). https://arxiv.org/abs/2403.17134

### Industry Reports

17. Answer.AI independent evaluation of Devin, reported via IT Pro, The Register, and Futurism (2025): a 15% task completion rate across 20 tasks.
18. IEEE Spectrum. "AI Coding Degrades: Silent Failures Emerge." (2025). https://spectrum.ieee.org/ai-coding-degrades

---

## Appendix A: Full Mapping of Toolkit Lessons to Academic Categories

| Toolkit Lesson | Academic Category | SWE-EVO Equivalent | MAST Equivalent |
|----------------|-------------------|--------------------|-----------------|
| 0001 (bare except) | Silent failure | — | — |
| 0002 (async without await) | Runtime bug (API misuse) | Syntax Error | — |
| 0004 (hardcoded counts) | Test anti-pattern | — | Task Verification |
| 0015 (schema drift) | Integration boundary | Incorrect Implementation | Inter-Agent Misalignment |
| 0018 (unit pass, integration fail) | Verification gap | — | Task Verification |
| 0037 (parallel agent staging) | Coordination failure | — | Inter-Agent Misalignment |
| 0055 (garbled batch prompts) | Context/retrieval | Instruction Following | System Design |

## Appendix B: Academic Taxonomy Comparison Table

| Dimension | SWE-EVO | MAST | Autonomous Agent Study | Failed PR Study | Toolkit |
|-----------|---------|------|------------------------|-----------------|---------|
| # of top-level categories | 7 | 3 | 3 (phases) | 3 | 3 |
| # of leaf categories | 7 | 14 | 9 | 12 | 6* |
| Covers specification errors | Yes | Yes | Yes | Yes | **No** |
| Covers planning errors | No | Yes | Yes | No | **No** |
| Covers context degradation | No | No | Yes | No | **No** |
| Covers implementation errors | Yes | Yes | Yes | Yes | Yes |
| Covers integration errors | Implicit | Yes | No | Implicit | **Yes** |
| Covers silent failures | No | No | No | No | **Yes** |
| Actionable (corrective actions) | No | No | No | No | **Yes** |
| Automated enforcement | No | No | No | No | **Yes** |

\*6 categories in the existing taxonomy, with 61 individual lessons providing the leaf-level granularity.

---

## Appendix C: Counter-Arguments and Limitations

### Why the gap might be smaller than estimated

1. **The toolkit's architecture already prevents some of the missing failure classes.** Fresh context per batch prevents context degradation. Brainstorming prevents some specification errors. The quality gate prevents some verification gaps. These architectural mitigations are not lessons, but they reduce exposure.

2. **The toolkit targets a specific workflow.** It's designed for plan-driven, batch-executed development — not open-ended "fix this issue" tasks like SWE-Bench. Some academic failure modes (stuck in loop, premature termination) may be less relevant in this constrained context.

3. **Some "missing" categories may be inherently non-lesson-able.** Specification misunderstanding may require better prompting, not a lesson file. The lesson system's strength is grep-detectable patterns; some failures resist that format.

### Why the gap might be larger than estimated

1. **The toolkit has been tested by one developer.** The lesson system reflects one person's failure distribution, which may under-represent failure classes they personally catch early.

2. **Hallucination frequency is increasing with library/API churn.** As ecosystems evolve faster, the gap between training data and current APIs grows, making hallucination a larger failure class over time.

3. **Multi-agent coordination failures are underrepresented.** The toolkit has one lesson (0037) on multi-agent issues, while the MAST framework identifies 14 failure modes, with coordination accounting for 37% of all MAS failures.
|