autonomous-coding-toolkit 1.0.0
- package/.claude-plugin/marketplace.json +22 -0
- package/.claude-plugin/plugin.json +13 -0
- package/LICENSE +21 -0
- package/Makefile +21 -0
- package/README.md +140 -0
- package/SECURITY.md +28 -0
- package/agents/bash-expert.md +113 -0
- package/agents/dependency-auditor.md +138 -0
- package/agents/integration-tester.md +120 -0
- package/agents/lesson-scanner.md +149 -0
- package/agents/python-expert.md +179 -0
- package/agents/service-monitor.md +141 -0
- package/agents/shell-expert.md +147 -0
- package/benchmarks/runner.sh +147 -0
- package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
- package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
- package/benchmarks/tasks/02-refactor-module/task.md +8 -0
- package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
- package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
- package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
- package/bin/act.js +238 -0
- package/commands/autocode.md +6 -0
- package/commands/cancel-ralph.md +18 -0
- package/commands/code-factory.md +53 -0
- package/commands/create-prd.md +55 -0
- package/commands/ralph-loop.md +18 -0
- package/commands/run-plan.md +117 -0
- package/commands/submit-lesson.md +122 -0
- package/docs/ARCHITECTURE.md +630 -0
- package/docs/CONTRIBUTING.md +125 -0
- package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
- package/docs/lessons/0002-async-def-without-await.md +28 -0
- package/docs/lessons/0003-create-task-without-callback.md +28 -0
- package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
- package/docs/lessons/0005-sqlite-without-closing.md +33 -0
- package/docs/lessons/0006-venv-pip-path.md +27 -0
- package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
- package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
- package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
- package/docs/lessons/0010-local-outside-function-bash.md +33 -0
- package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
- package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
- package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
- package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
- package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
- package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
- package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
- package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
- package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
- package/docs/lessons/0020-persist-state-incrementally.md +44 -0
- package/docs/lessons/0021-dual-axis-testing.md +48 -0
- package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
- package/docs/lessons/0023-static-analysis-spiral.md +51 -0
- package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
- package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
- package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
- package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
- package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
- package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
- package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
- package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
- package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
- package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
- package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
- package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
- package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
- package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
- package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
- package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
- package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
- package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
- package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
- package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
- package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
- package/docs/lessons/0045-iterative-design-improvement.md +33 -0
- package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
- package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
- package/docs/lessons/0048-integration-wiring-batch.md +40 -0
- package/docs/lessons/0049-ab-verification.md +41 -0
- package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
- package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
- package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
- package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
- package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
- package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
- package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
- package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
- package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
- package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
- package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
- package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
- package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
- package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
- package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
- package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
- package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
- package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
- package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
- package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
- package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
- package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
- package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
- package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
- package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
- package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
- package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
- package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
- package/docs/lessons/0078-static-review-without-live-test.md +30 -0
- package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
- package/docs/lessons/FRAMEWORK.md +161 -0
- package/docs/lessons/SUMMARY.md +201 -0
- package/docs/lessons/TEMPLATE.md +85 -0
- package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
- package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
- package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
- package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
- package/docs/plans/2026-02-21-mab-research-report.md +406 -0
- package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
- package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
- package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
- package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
- package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
- package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
- package/docs/plans/2026-02-22-mab-run-design.md +462 -0
- package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
- package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
- package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
- package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
- package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
- package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
- package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
- package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
- package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
- package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
- package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
- package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
- package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
- package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
- package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
- package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
- package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
- package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
- package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
- package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
- package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
- package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
- package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
- package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
- package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
- package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
- package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
- package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
- package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
- package/docs/plans/2026-02-24-headless-module-split.md +443 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
- package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
- package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
- package/docs/plans/audit-findings.md +186 -0
- package/docs/telegram-notification-format.md +98 -0
- package/examples/example-plan.md +51 -0
- package/examples/example-prd.json +72 -0
- package/examples/example-roadmap.md +33 -0
- package/examples/quickstart-plan.md +63 -0
- package/hooks/hooks.json +26 -0
- package/hooks/setup-symlinks.sh +48 -0
- package/hooks/stop-hook.sh +135 -0
- package/package.json +47 -0
- package/policies/bash.md +71 -0
- package/policies/python.md +71 -0
- package/policies/testing.md +61 -0
- package/policies/universal.md +60 -0
- package/scripts/analyze-report.sh +97 -0
- package/scripts/architecture-map.sh +145 -0
- package/scripts/auto-compound.sh +273 -0
- package/scripts/batch-audit.sh +42 -0
- package/scripts/batch-test.sh +101 -0
- package/scripts/entropy-audit.sh +221 -0
- package/scripts/failure-digest.sh +51 -0
- package/scripts/generate-ast-rules.sh +96 -0
- package/scripts/init.sh +112 -0
- package/scripts/lesson-check.sh +428 -0
- package/scripts/lib/common.sh +61 -0
- package/scripts/lib/cost-tracking.sh +153 -0
- package/scripts/lib/ollama.sh +60 -0
- package/scripts/lib/progress-writer.sh +128 -0
- package/scripts/lib/run-plan-context.sh +215 -0
- package/scripts/lib/run-plan-echo-back.sh +231 -0
- package/scripts/lib/run-plan-headless.sh +396 -0
- package/scripts/lib/run-plan-notify.sh +57 -0
- package/scripts/lib/run-plan-parser.sh +81 -0
- package/scripts/lib/run-plan-prompt.sh +215 -0
- package/scripts/lib/run-plan-quality-gate.sh +132 -0
- package/scripts/lib/run-plan-routing.sh +315 -0
- package/scripts/lib/run-plan-sampling.sh +170 -0
- package/scripts/lib/run-plan-scoring.sh +146 -0
- package/scripts/lib/run-plan-state.sh +142 -0
- package/scripts/lib/run-plan-team.sh +199 -0
- package/scripts/lib/telegram.sh +54 -0
- package/scripts/lib/thompson-sampling.sh +176 -0
- package/scripts/license-check.sh +74 -0
- package/scripts/mab-run.sh +575 -0
- package/scripts/module-size-check.sh +146 -0
- package/scripts/patterns/async-no-await.yml +5 -0
- package/scripts/patterns/bare-except.yml +6 -0
- package/scripts/patterns/empty-catch.yml +6 -0
- package/scripts/patterns/hardcoded-localhost.yml +9 -0
- package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
- package/scripts/pipeline-status.sh +197 -0
- package/scripts/policy-check.sh +226 -0
- package/scripts/prior-art-search.sh +133 -0
- package/scripts/promote-mab-lessons.sh +126 -0
- package/scripts/prompts/agent-a-superpowers.md +29 -0
- package/scripts/prompts/agent-b-ralph.md +29 -0
- package/scripts/prompts/judge-agent.md +61 -0
- package/scripts/prompts/planner-agent.md +44 -0
- package/scripts/pull-community-lessons.sh +90 -0
- package/scripts/quality-gate.sh +266 -0
- package/scripts/research-gate.sh +90 -0
- package/scripts/run-plan.sh +329 -0
- package/scripts/scope-infer.sh +159 -0
- package/scripts/setup-ralph-loop.sh +155 -0
- package/scripts/telemetry.sh +230 -0
- package/scripts/tests/run-all-tests.sh +52 -0
- package/scripts/tests/test-act-cli.sh +46 -0
- package/scripts/tests/test-agents-md.sh +87 -0
- package/scripts/tests/test-analyze-report.sh +114 -0
- package/scripts/tests/test-architecture-map.sh +89 -0
- package/scripts/tests/test-auto-compound.sh +169 -0
- package/scripts/tests/test-batch-test.sh +65 -0
- package/scripts/tests/test-benchmark-runner.sh +25 -0
- package/scripts/tests/test-common.sh +168 -0
- package/scripts/tests/test-cost-tracking.sh +158 -0
- package/scripts/tests/test-echo-back.sh +180 -0
- package/scripts/tests/test-entropy-audit.sh +146 -0
- package/scripts/tests/test-failure-digest.sh +66 -0
- package/scripts/tests/test-generate-ast-rules.sh +145 -0
- package/scripts/tests/test-helpers.sh +82 -0
- package/scripts/tests/test-init.sh +47 -0
- package/scripts/tests/test-lesson-check.sh +278 -0
- package/scripts/tests/test-lesson-local.sh +55 -0
- package/scripts/tests/test-license-check.sh +109 -0
- package/scripts/tests/test-mab-run.sh +182 -0
- package/scripts/tests/test-ollama-lib.sh +49 -0
- package/scripts/tests/test-ollama.sh +60 -0
- package/scripts/tests/test-pipeline-status.sh +198 -0
- package/scripts/tests/test-policy-check.sh +124 -0
- package/scripts/tests/test-prior-art-search.sh +96 -0
- package/scripts/tests/test-progress-writer.sh +140 -0
- package/scripts/tests/test-promote-mab-lessons.sh +110 -0
- package/scripts/tests/test-pull-community-lessons.sh +149 -0
- package/scripts/tests/test-quality-gate.sh +241 -0
- package/scripts/tests/test-research-gate.sh +132 -0
- package/scripts/tests/test-run-plan-cli.sh +86 -0
- package/scripts/tests/test-run-plan-context.sh +305 -0
- package/scripts/tests/test-run-plan-e2e.sh +153 -0
- package/scripts/tests/test-run-plan-headless.sh +424 -0
- package/scripts/tests/test-run-plan-notify.sh +124 -0
- package/scripts/tests/test-run-plan-parser.sh +217 -0
- package/scripts/tests/test-run-plan-prompt.sh +254 -0
- package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
- package/scripts/tests/test-run-plan-routing.sh +178 -0
- package/scripts/tests/test-run-plan-scoring.sh +148 -0
- package/scripts/tests/test-run-plan-state.sh +261 -0
- package/scripts/tests/test-run-plan-team.sh +157 -0
- package/scripts/tests/test-scope-infer.sh +150 -0
- package/scripts/tests/test-setup-ralph-loop.sh +63 -0
- package/scripts/tests/test-telegram-env.sh +38 -0
- package/scripts/tests/test-telegram.sh +121 -0
- package/scripts/tests/test-telemetry.sh +46 -0
- package/scripts/tests/test-thompson-sampling.sh +139 -0
- package/scripts/tests/test-validate-all.sh +60 -0
- package/scripts/tests/test-validate-commands.sh +89 -0
- package/scripts/tests/test-validate-hooks.sh +98 -0
- package/scripts/tests/test-validate-lessons.sh +150 -0
- package/scripts/tests/test-validate-plan-quality.sh +235 -0
- package/scripts/tests/test-validate-plans.sh +187 -0
- package/scripts/tests/test-validate-plugin.sh +106 -0
- package/scripts/tests/test-validate-prd.sh +184 -0
- package/scripts/tests/test-validate-skills.sh +134 -0
- package/scripts/validate-all.sh +57 -0
- package/scripts/validate-commands.sh +67 -0
- package/scripts/validate-hooks.sh +89 -0
- package/scripts/validate-lessons.sh +98 -0
- package/scripts/validate-plan-quality.sh +369 -0
- package/scripts/validate-plans.sh +120 -0
- package/scripts/validate-plugin.sh +86 -0
- package/scripts/validate-policies.sh +42 -0
- package/scripts/validate-prd.sh +118 -0
- package/scripts/validate-skills.sh +96 -0
- package/skills/autocode/SKILL.md +285 -0
- package/skills/autocode/ab-verification.md +51 -0
- package/skills/autocode/code-quality-standards.md +37 -0
- package/skills/autocode/competitive-mode.md +364 -0
- package/skills/brainstorming/SKILL.md +97 -0
- package/skills/capture-lesson/SKILL.md +187 -0
- package/skills/check-lessons/SKILL.md +116 -0
- package/skills/dispatching-parallel-agents/SKILL.md +110 -0
- package/skills/executing-plans/SKILL.md +85 -0
- package/skills/finishing-a-development-branch/SKILL.md +201 -0
- package/skills/receiving-code-review/SKILL.md +72 -0
- package/skills/requesting-code-review/SKILL.md +59 -0
- package/skills/requesting-code-review/code-reviewer.md +82 -0
- package/skills/research/SKILL.md +145 -0
- package/skills/roadmap/SKILL.md +115 -0
- package/skills/subagent-driven-development/SKILL.md +98 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
- package/skills/subagent-driven-development/implementer-prompt.md +73 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
- package/skills/systematic-debugging/SKILL.md +134 -0
- package/skills/systematic-debugging/condition-based-waiting.md +64 -0
- package/skills/systematic-debugging/defense-in-depth.md +32 -0
- package/skills/systematic-debugging/root-cause-tracing.md +55 -0
- package/skills/test-driven-development/SKILL.md +167 -0
- package/skills/using-git-worktrees/SKILL.md +219 -0
- package/skills/using-superpowers/SKILL.md +54 -0
- package/skills/verification-before-completion/SKILL.md +140 -0
- package/skills/verify/SKILL.md +82 -0
- package/skills/writing-plans/SKILL.md +128 -0
- package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,428 @@
# Research: Plan Quality for AI Coding Agents

> **Date:** 2026-02-22
> **Status:** Research complete
> **Method:** Web research + codebase analysis + SWE-bench literature review

## Executive Summary

1. **The 2-5 minute heuristic is directionally correct but too rigid.** Task granularity should vary by task type: single-file changes can be coarser (5-15 min), multi-file coordination tasks need finer decomposition (2-5 min), and verification-only batches need no decomposition at all. The strongest predictor of failure is lines-of-code-per-task, not wall-clock time. (Confidence: **high**)

2. **Over-specification hurts more than under-specification, but both lose to "structured intent."** The optimal spec provides exact file paths, the goal and constraints for each task, and one code example showing style, but does NOT dictate implementation line-by-line. Research on the "curse of instructions" shows LLM adherence to individual instructions drops as instruction count grows. (Confidence: **high**)

3. **Batch boundaries should follow dependency graphs, not arbitrary size limits.** Batches should group tasks that share test infrastructure or modify the same module, bounded by a quality gate. The toolkit's current 3-task batch default is reasonable but should be tunable per plan. (Confidence: **medium**)

4. **Plan quality is the single strongest lever on execution success.** SWE-bench Pro found that removing requirements and interface specifications from task descriptions degraded GPT-5 performance from 25.9% to 8.4%, a roughly 3x drop. The plan IS the product; execution is mechanical. (Confidence: **high**)

5. **Adaptive decomposition outperforms fixed decomposition.** ADaPT (Allen AI, NAACL 2024) reported success rates up to 28-33% higher when tasks are decomposed only after the executor fails, rather than pre-decomposing everything. This maps directly to the toolkit's retry escalation pattern. (Confidence: **medium**)

---

## 1. Task Granularity: What Size Produces the Best Outcomes?

### Findings

The "2-5 minute task" heuristic in the current `writing-plans` skill conflates two distinct dimensions: **scope** (how much code changes) and **complexity** (how many files, how much coordination). The benchmark data below indicates that **lines of code changed**, not time, is the strongest predictor of AI agent success.

**SWE-bench Verified difficulty analysis** (Ganhotra, 2025):

| Difficulty | Avg Files | Avg Hunks | Avg Lines Changed | Top-Agent Success Rate |
|-----------|-----------|-----------|-------------------|------------------------|
| Easy (≤15 min) | 1.03 | 1.37 | 5.04 | ~80% |
| Medium (15-60 min) | 1.28 | 2.48 | 14.1 | 56-62% |
| Hard (≥1 hr) | 2.0 | 6.82 | 55.78 | 20-25% |

Lines changed grow about 11x from easy to hard (5.04 to 55.78), while file count only doubles (1.03 to 2.0). This suggests that **a single-file task that changes 60 lines is harder than a two-file task that changes 10 lines total.**
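
The growth factors quoted above can be re-derived from the table averages. The sketch below is only a check of that arithmetic; the `ROWS` dictionary and the `growth` helper are illustrative names, not part of the toolkit.

```python
# Averages copied from the SWE-bench Verified difficulty table above.
ROWS = {
    "easy": {"files": 1.03, "hunks": 1.37, "lines": 5.04},
    "medium": {"files": 1.28, "hunks": 2.48, "lines": 14.1},
    "hard": {"files": 2.0, "hunks": 6.82, "lines": 55.78},
}


def growth(metric: str, low: str = "easy", high: str = "hard") -> float:
    """Return how many times larger `metric` is for `high` than for `low`."""
    return ROWS[high][metric] / ROWS[low][metric]


if __name__ == "__main__":
    for metric in ("files", "hunks", "lines"):
        print(f"{metric}: {growth(metric):.1f}x")
```

Lines changed is the metric that grows fastest across difficulty tiers, which is the basis for treating diff size as the primary sizing signal.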

**SWE-bench Pro** (Scale AI, 2025) found that frontier models (Claude Opus 4.1, GPT-5) maintained reasonable success on single-file tasks but showed "sharp declines as file count increases," approaching near-zero on 10+ file tasks.

**Devin's recommendation** (Cognition, agents101): Target 1-6 hours of work per task for maximum ROI, with explicit checkpoint pauses between phases. This is much coarser than the toolkit's 2-5 minutes, but Devin operates in a persistent session, not a fresh `claude -p` per batch.

**Anthropic's own guidance** (Effective Harnesses for Long-Running Agents): Work on one feature at a time. Agents that "try to do too much at once" exhaust context windows mid-implementation. The solution: a comprehensive feature list where each feature is independently testable.

### Evidence from the Toolkit's Own Execution Data

The `progress.txt` log reveals clear patterns across 14 batches of real headless execution:

| Batch Type | Tasks/Batch | Test Delta | Notes |
|-----------|-------------|------------|-------|
| Foundation (new files) | 3-5 | +22 | Clean execution, high test yield |
| Refactoring (modify existing) | 5 | +35 | Highest test yield per batch |
| Accuracy fixes (surgical) | 3 | +13 | Small scope, high precision |
| New capabilities | 2 | +7 | Lower test yield (integration complexity) |
| Verification-only | 5 | 0 | No-ops and confirmation, no code |
| Bugfix | 2 | 0 | Rework, no new tests |

Key observations:
- **Batches with 3-5 tasks of new-file creation had the highest success rate:** clean scope, no coordination.
- **Batch 4 had a no-op task** (Task 12: lint already implemented in Batch 2): the plan over-specified work that was already done. This wastes a `claude -p` invocation (~$0.10-0.50 per batch).
- **Batch 5 had lower test yield despite only 2 tasks:** both tasks involved cross-cutting integration (failure digest wiring, context_refs injection across parser + prompt modules). Multi-file coordination, not task count, drove complexity.
- **Batch 7 (verification-only) was efficient:** 5 tasks, zero code changes, pure confirmation. Could have been a single task.

### Implications for the Toolkit

**Replace "2-5 minutes" with a task-type-aware guideline:**

| Task Type | Recommended Granularity | Rationale |
|-----------|------------------------|-----------|
| New file creation | 1 file + its tests per task | Self-contained, high parallelism potential |
| Refactoring existing code | 1 module per task, ≤30 lines changed | Keep diff small to stay in "easy" zone |
| Cross-module integration | 1 integration point per task | Multi-file = high failure risk; isolate |
| Bug fixes | 1 bug per task, always | Never batch bugs together |
| Verification / wiring | Group freely (3-5 per batch) | Low risk, low complexity |

**Add a lines-changed heuristic:** If a task's expected diff exceeds ~30 lines, decompose further. In the SWE-bench data, top-agent success falls from ~80% on "easy" tasks (5 lines avg) to 56-62% on "medium" (14 lines avg) and 20-25% on "hard" (56 lines avg), so a ~30-line cap keeps tasks well short of the hard range.
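
One way to apply the lines-changed heuristic is to measure a diff after the fact, for example on a first attempt or on a similar past change. The following is a minimal sketch under that assumption; `changed_lines`, `needs_decomposition`, and the sample diff are hypothetical and do not exist in the toolkit.

```python
MAX_LINES = 30  # the ~30-line threshold suggested above


def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers.

    Limitation: a removed line whose content starts with "--" (or an added
    line starting with "++") looks like a file header and is not counted.
    """
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def needs_decomposition(unified_diff: str, limit: int = MAX_LINES) -> bool:
    """Return True when the diff is larger than the configured line limit."""
    return changed_lines(unified_diff) > limit


# Hypothetical diff: one line removed, three lines added.
SAMPLE_DIFF = """\
--- a/src/config.py
+++ b/src/config.py
@@ -45,3 +45,5 @@
 def parse_config(raw):
-    return json.loads(raw)
+    if not raw:
+        raise ValueError("empty config")
+    return json.loads(raw)
"""

if __name__ == "__main__":
    print(changed_lines(SAMPLE_DIFF), needs_decomposition(SAMPLE_DIFF))
```

Counting both added and removed lines is a design choice; counting only additions would give smaller numbers and would need a lower threshold.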

---

## 2. Specification Level: Over-Specification vs. Under-Specification

### Findings

The research reveals a clear "Goldilocks zone" for specification detail, with distinct failure modes on each side.

**Over-specification failure mode ("Curse of Instructions"):**
Addy Osmani (referencing GitHub's analysis of 2,500+ agent configuration files) found that as instructions accumulate, LLM adherence to each individual instruction drops. Even GPT-4 struggles to satisfy many simultaneous requirements. The most effective specs cover six areas (commands, testing, structure, style, git workflow, boundaries) without prescribing implementation details.

**Under-specification failure mode ("Vague Specs = Vague Code"):**
SWE-bench Pro demonstrated this quantitatively: removing requirements and interface specifications from task descriptions degraded GPT-5's resolve rate from 25.9% to 8.4%. The task description IS the primary input; without it, even frontier models flounder.

**The Goldilocks zone ("Structured Intent"):**
Multiple sources converge on the same pattern:
- **Osmani:** "One real code snippet showing style beats three paragraphs of description."
- **Devin agents101:** "Clearly outline your preferred approach from the outset. Providing the agent with the overall architecture and logic upfront reduces review time."
- **Technical Design Spec pattern** (Harper Reed): Include full file paths, function signatures, and API contracts, but NOT line-by-line implementation. "Prompting the agent to implement only one step at a time prevents it biting off more than it can chew."

**What the toolkit currently does well:**
The `writing-plans` skill already mandates exact file paths, complete test code, and exact commands with expected output. This is well-aligned with the research.

**What the toolkit currently over-specifies:**
The skill says "Complete code in plan (not 'add validation')." While the intent is correct (be specific, not vague), providing **complete implementation code** for every task removes the LLM's ability to adapt to discovered context. When the plan says "write this exact code" but the codebase has evolved since plan creation, the agent either follows the stale plan (wrong) or deviates (violating the plan contract).

### Evidence

| Specification Level | Example | Observed Outcome |
|--------------------|---------|-----------------|
| Under-specified | "Add validation" | Agent guesses scope, often wrong |
| Structured intent | "Add input validation to `parse_config()` in `src/config.py:45-60`. Reject empty strings and non-dict inputs. Write test first." | Agent knows scope, chooses implementation |
| Over-specified | "Replace line 47 with `if not isinstance(config, dict): raise ValueError('...')`" | Brittle: breaks if line numbers shift |

### Implications for the Toolkit

**Shift from "complete code in plan" to "complete contract in plan":**
- Keep: exact file paths, test assertions, expected behavior, command to verify
- Change: provide function signatures and contracts instead of full implementation code
- Add: explicit "constraints" section per task (what NOT to do)

**Proposed task template revision:**

````markdown
### Task N: [Name]

**Files:** Create: `path/to/file.py` | Test: `tests/path/test_file.py`

**Contract:**
- Function: `parse_config(raw: str) -> dict`
- Must reject: empty strings, non-dict JSON, missing required keys
- Must return: validated config dict with defaults applied

**Test (write first):**
```python
import pytest

def test_parse_config_rejects_empty():
    with pytest.raises(ValueError):
        parse_config("")
```

**Verify:** `pytest tests/path/test_file.py::test_parse_config_rejects_empty -v`

**Constraints:**
- Do not modify `src/loader.py` (that's Task N+1)
- Use stdlib only, no new dependencies
````

This preserves specificity (file paths, test code, verification command) while leaving implementation to the agent's judgment.
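
A plan validator could check that each task block carries all five sections of the proposed template. The sketch below is one possible check, not an existing toolkit script; `missing_fields`, `LABEL_RE`, and the sample task are hypothetical, and it assumes section labels are written as bold text at the start of a line.

```python
import re

# Section labels taken from the proposed template above.
REQUIRED_FIELDS = ("Files", "Contract", "Test", "Verify", "Constraints")

# Matches bold labels at line start, e.g. **Files:** or **Test (write first):**
LABEL_RE = re.compile(r"^\*\*([A-Za-z]+)[^*\n]*:\*\*", re.MULTILINE)


def missing_fields(task_markdown: str) -> list[str]:
    """Return the required labels that do not appear in a task block."""
    found = {match.group(1) for match in LABEL_RE.finditer(task_markdown)}
    return [field for field in REQUIRED_FIELDS if field not in found]


# Hypothetical task block with the Constraints section left out.
SAMPLE_TASK = """\
### Task 3: Validate config input

**Files:** Create: `src/config.py` | Test: `tests/test_config.py`

**Contract:**
- Function: `parse_config(raw: str) -> dict`

**Test (write first):** see `tests/test_config.py`

**Verify:** `pytest tests/test_config.py -v`
"""

if __name__ == "__main__":
    print(missing_fields(SAMPLE_TASK))
```

The check only confirms that a label is present. It does not judge whether the contract or constraints are meaningful, which still needs human or reviewer-agent attention.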

---

## 3. Batch Boundaries: How Should Batches Be Drawn?

### Findings

The toolkit currently uses implicit batching (plan authors create `## Batch N` headers manually). Research suggests three viable strategies:

**Strategy A: Module-boundary batches.**
Group tasks that modify the same module or file cluster. Anthropic's long-running agent guidance recommends "one feature at a time" where each feature is independently testable. This maps to module-boundary batching.

**Strategy B: Dependency-graph batches.**
The toolkit's own `run-plan-routing.sh` already builds dependency graphs and computes parallelism scores. Tasks with shared dependencies belong in the same batch; independent tasks can be parallelized across batches.
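
One possible reading of dependency-graph batching is to group tasks into topological levels, so that every task's prerequisites land in an earlier batch. The sketch below uses Python's standard `graphlib` module and is not how `run-plan-routing.sh` works; the function name and the task names are made up for illustration.

```python
from graphlib import TopologicalSorter


def dependency_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into ordered batches.

    `deps` maps each task to the set of tasks it depends on. Tasks in the
    same batch do not depend on each other, so they could run in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: two independent libraries, one module wiring them
# together, and a final end-to-end test.
EXAMPLE_DEPS = {
    "lib-common": set(),
    "lib-telegram": set(),
    "notify-wiring": {"lib-common", "lib-telegram"},
    "e2e-test": {"notify-wiring"},
}

if __name__ == "__main__":
    for number, batch in enumerate(dependency_batches(EXAMPLE_DEPS), start=1):
        print(f"Batch {number}: {', '.join(batch)}")
```

With this grouping, file creation and wiring naturally fall into separate batches, because the wiring task cannot become ready until both libraries are done.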

**Strategy C: Test-group batches.**
Group tasks by the test suite that validates them. Each batch ends with a meaningful test gate. This is implicitly what the toolkit does (quality gate runs after each batch), but making it explicit forces plan authors to think about testability boundaries.

**What the research says:**
- **ADaPT** (Allen AI): Don't pre-decompose everything. Decompose only when the executor fails. This suggests batches should be coarser initially, with finer decomposition on retry.
- **SWE-EVO** (2025): Uses "Fix Rate" as a partial-progress metric: what fraction of failing tests does the agent fix? This supports test-group batching where progress is measurable per batch.
- **Anthropic's harness guidance:** The initializer + coding agent pattern treats each session as one feature increment. The feature list, not a batch structure, drives execution order.
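
The "decompose only when the executor fails" idea can be sketched as a small recursive loop. This is a simplified illustration in the spirit of ADaPT, not the paper's algorithm and not the toolkit's retry logic; `run_as_needed`, the `execute` and `decompose` callbacks, and the depth limit are all assumptions.

```python
from typing import Callable


def run_as_needed(
    task: str,
    execute: Callable[[str], bool],
    decompose: Callable[[str], list[str]],
    max_depth: int = 2,
) -> bool:
    """Try the task whole; split it into subtasks only if execution fails."""
    if execute(task):
        return True
    if max_depth == 0:
        return False  # give up rather than decompose forever
    subtasks = decompose(task)
    if not subtasks:
        return False
    # all() short-circuits, so later subtasks are skipped after a failure.
    return all(
        run_as_needed(sub, execute, decompose, max_depth - 1) for sub in subtasks
    )


# Stub executor and planner for demonstration: compound tasks fail until
# they are split on " and ".
attempts: list[str] = []


def stub_execute(task: str) -> bool:
    attempts.append(task)
    return " and " not in task


def stub_decompose(task: str) -> list[str]:
    return task.split(" and ")


if __name__ == "__main__":
    ok = run_as_needed("create parser and wire notifier", stub_execute, stub_decompose)
    print(ok, attempts)
```

In a plan runner, `execute` would correspond to running a batch plus its quality gate, and `decompose` to asking a planner for a finer-grained version of the failed batch.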

### Evidence from Toolkit Execution Data

The `progress.txt` file reveals natural batch boundary patterns:

| Batch | Boundary Type | Outcome |
|-------|--------------|---------|
| 1 (Foundation) | Module: shared libraries | Clean (independent files) |
| 2 (Refactoring) | Dependency: all depend on Batch 1 libs | Clean, but 5 scripts modified |
| 3 (Accuracy) | Feature cluster: test parsing + context + duration | Clean (tight scope) |
| 4 (Quality gates) | Mixed: new scripts + wiring | 1 no-op task (over-planned) |
| 5 (New capabilities) | Feature: failure digest + context refs | Low yield (cross-cutting) |
| 6 (License + flags) | Feature: license check | 1 no-op task |
| 7 (Verification) | Test group: verify everything | 5 tasks, 0 code; batch was too large for its type |

**Pattern:** The cleanest batches (1, 2, 3) grouped by module or tight feature cluster. The messiest (4, 5, 6) mixed unrelated features or included tasks that were already done.

### Implications for the Toolkit

**Batch boundary guidelines for the `writing-plans` skill:**

1. **Primary rule: each batch has one testable outcome.** If you can't describe the batch's quality gate in one sentence, it's too broad.
2. **Group by dependency, not by count.** A batch of 2 cross-cutting tasks is harder than a batch of 5 independent file creations.
3. **Never mix new-file and integration tasks.** Create files in one batch, wire them together in the next. This prevents the "implement and integrate in one shot" failure mode.
4. **Verification batches should be a single task.** There's no benefit to splitting "run all tests and confirm" across 5 tasks within one `claude -p` invocation.
5. **Plan for no-ops.** If an earlier batch might complete a later task's work (common with refactoring), add a conditional: "Skip if already implemented in Batch N."
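
Guideline 3 lends itself to a mechanical check. The sketch below reports batches that contain both file-creation and file-modification entries. The `## Batch N` headers are the plan convention described earlier; the `Create:` and `Modify:` markers and the `find_mixed_batches` name are assumptions about how a plan might list its files, not the toolkit's confirmed format.

```bash
#!/usr/bin/env bash
# Hypothetical check for guideline 3: print the number of every batch that
# contains both "Create:" and "Modify:" entries.
find_mixed_batches() {   # usage: find_mixed_batches <plan-file>
  awk '
    function report() { if (batch != "" && creates && modifies) print batch }
    /^## Batch / {
      report()
      batch = $3; gsub(/[^0-9]/, "", batch)   # "4:" -> "4"
      creates = 0; modifies = 0
      next
    }
    /Create:/ { creates = 1 }
    /Modify:/ { modifies = 1 }
    END { report() }
  ' "$1"
}

# Example: find_mixed_batches docs/plans/my-feature.md
```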

---

## 4. Plan Quality and Downstream Execution Success

### Findings

The evidence is unambiguous: **plan quality is the dominant variable in execution success.**

**SWE-bench Pro (Scale AI, 2025):**
"Human augmentation significantly improves resolvability." Resolve rates with requirements and interface specifications provided alongside the issue description, compared with the issue description alone:
- GPT-5: 25.9% with specs → 8.4% without (roughly a 3x drop)
- Claude Opus 4.1: Similar pattern

This means the spec/plan is worth roughly **3x the execution capability** of the model itself. A mediocre model with a great plan can outperform a great model with a bad plan.

**GitHub's analysis of 2,500+ agent config files:**
"Most agent files fail due to being too vague." The most effective configurations shared six properties: specific commands, testing instructions, project structure paths, style guidance, git workflow, and explicit boundaries.

**Anthropic's harness research:**
The most important design decision was having each agent session start by reading progress logs and selecting the next highest-priority incomplete feature. The plan structure (feature list with pass/fail tracking) determined success more than the agent's capability.

**Devin's Coding Agents 101:**
"80% time savings, not complete automation." The 20% manual intervention is almost entirely plan-level: clarifying intent, reordering steps, fixing spec ambiguities. The execution itself is largely mechanical when the plan is clear.

### Mapping to the Toolkit

The toolkit's architecture already reflects this insight: `progress.txt`, `prd.json`, and `.run-plan-state.json` give each fresh `claude -p` invocation the plan context it needs. But the plan file itself (the markdown document) is the primary input, and its quality determines everything downstream.

**Current plan quality strengths:**
- Exact file paths (strongly supported by research)
- TDD structure (test-first forces specificity)
- Batch structure with quality gates (machine-verifiable progress)
- Cross-batch context injection (prevents blind starts)

**Current plan quality gaps:**
- No plan validation beyond `validate-plans.sh` structural checks (sequential batch numbers, task presence)
- No measurement of plan quality before execution
- No detection of stale plans (codebase changed since plan creation)
- No conditional tasks ("skip if already done")
- Complete code provision instead of contracts (over-specification)
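
The stale-plan gap could be closed with a fairly small check. A minimal sketch, assuming staleness means "HEAD was committed after the plan file was last modified"; the function names are hypothetical, `stat -c %Y` is the GNU form (BSD/macOS uses `stat -f %m`), and file mtimes are reset by a fresh clone or checkout, so this is a coarse heuristic at best.

```bash
#!/usr/bin/env bash
# Hypothetical freshness check: a plan is treated as stale when the
# repository HEAD commit is newer than the plan file.
is_stale() {   # usage: is_stale <plan_mtime_epoch> <head_commit_epoch>
  [ "$2" -gt "$1" ]
}

plan_is_stale() {   # usage: plan_is_stale <plan-file>  (run inside a git repo)
  local plan_mtime head_time
  plan_mtime="$(stat -c %Y "$1")"          # GNU stat; BSD/macOS: stat -f %m
  head_time="$(git log -1 --format=%ct)"   # committer date of HEAD, epoch seconds
  is_stale "$plan_mtime" "$head_time"
}

# Example:
#   plan_is_stale docs/plans/my-feature.md && echo "plan predates HEAD; review it"
```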

---

## 5. What SWE-bench, Devin, OpenHands, and Academic Literature Say

### SWE-bench Ecosystem

**SWE-bench Verified** (OpenAI, 2024): Established the standard task format: issue description + repository snapshot. No plan structure at all; agents must navigate repositories and write patches from a natural-language issue description. Top agents reach ~55% on Verified.

**SWE-bench Pro** (Scale AI, 2025): 1,865 enterprise-grade problems averaging 107 lines across 4.1 files. Found that "wrong solutions account for 35.9% of failures": agents understand the task but implement it incorrectly. This is a plan quality problem: better specs reduce the solution space.

**SWE-EVO** (2025): 48 tasks averaging 21 files modified and 874 tests per instance. Introduced "Fix Rate" as a partial-progress metric. Relevant for batch execution: measure how many tests a batch fixes, not just pass/fail.

**SWE-rebench** (NeurIPS 2025): Automated task collection pipeline. Emphasizes decontamination: agents shouldn't have seen the solutions in training data. This is irrelevant to the toolkit's use case (novel codebases), but the task format research is applicable.
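
Applied to batch execution, a Fix Rate style metric can be computed from two test counts. The sketch below follows the one-line description above (fraction of initially failing tests that now pass); SWE-EVO's exact definition may differ, regressions in previously passing tests are ignored, and the `fix_rate` name is hypothetical.

```bash
#!/usr/bin/env bash
# Fix Rate as described above: fixed tests / initially failing tests.
fix_rate() {   # usage: fix_rate <initially_failing> <still_failing>
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) { print "n/a"; exit }
    printf "%.2f\n", (before - after) / before
  }'
}

fix_rate 8 2   # prints 0.75: six of eight failing tests were fixed
```

A batch that moves from 8 failing tests to 2 has made measurable progress even though its quality gate would still report a failure.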

### Devin (Cognition)

Devin uses a two-agent architecture: **Planner** (high-level analysis, task breakdown) and **Executor** (implementation, tests, iteration). Key design decisions:
- Interactive planning phase before execution: the user can edit, reorder, and approve steps
- Checkpoint approach: Plan -> Implement chunk -> Test -> Fix -> Review -> Next chunk
- "Defensive prompting": anticipate confusion points an intern would face

### OpenHands / CodeAct

OpenHands takes a minimal-structure approach: point the agent at a repo and an issue, let it plan and execute autonomously using bash and Python. CodeAct 2.1 is a single agent that interleaves planning and execution, with no separate plan document.

This works for issue-resolution (SWE-bench) but not for multi-batch feature implementation (the toolkit's use case). The key difference: OpenHands agents have persistent context within a session; the toolkit uses fresh `claude -p` per batch.

### Aider

Aider's contribution is primarily about **edit format**, not plan structure:
- "Whole file" format: simple but expensive (return entire file for any edit)
- "Diff" format: efficient but error-prone with less capable models
- "Architect mode": separate planning model (generates instructions) + editing model (applies changes)

Aider's architect mode is conceptually similar to the toolkit's plan -> execute separation. The planning model operates with more context (can see the full codebase); the editing model operates with focused context (one file at a time).

### ADaPT (Allen AI, NAACL 2024)

**As-Needed Decomposition and Planning.** The core insight: don't pre-decompose tasks into subtasks. Instead, attempt the task at the current granularity and decompose only on failure. Reported results over baselines:
- Up to 28.3% higher success in ALFWorld
- Up to 27% higher in WebShop
- Up to 33% higher in TextCraft

This directly maps to the toolkit's retry escalation: Attempt 1 gets the task as-is. Attempt 2 gets the task + failure context. The implication: the initial plan could be coarser, with finer decomposition reserved for retries.

### Self-Organized Agents (SoA, 2024)

Multi-agent framework where a "Mother agent" generates a code skeleton and delegates subtasks to "Child agents." The number of subtasks is automatically determined by the LLM based on problem complexity. This supports adaptive granularity over fixed granularity.

---

## 6. Frameworks for Measuring Plan Quality Before Execution

### Findings

No established framework exists for measuring AI-consumable plan quality pre-execution. This is a gap in the literature. However, combining software requirements quality research with AI agent evaluation metrics yields a viable framework.

### Proposed Plan Quality Scorecard

Drawing from IEEE 830 (SRS quality attributes), SWE-bench task analysis, and the toolkit's own execution data:

| Dimension | Metric | How to Measure | Weight |
|-----------|--------|---------------|--------|
| **Specificity** | File paths present per task | Automated: count tasks with `Files:` section | 0.20 |
| **Testability** | Verification command per task | Automated: count tasks with runnable test command | 0.20 |
| **Scope** | Estimated lines changed per task | Heuristic: count code blocks in plan, estimate diff size | 0.15 |
| **Independence** | Cross-task dependencies per batch | Parse: count references to other tasks within same batch | 0.15 |
| **Freshness** | Plan age vs. last codebase commit | Automated: compare plan file mtime to HEAD commit time | 0.10 |
| **Completeness** | Tasks cover all PRD acceptance criteria | Cross-reference: plan task IDs vs. prd.json task IDs | 0.10 |
| **Conditionality** | Skip conditions for potentially redundant tasks | Count: tasks with "Skip if..." clauses | 0.05 |
| **Batch coherence** | Tasks within batch share module/test scope | Heuristic: analyze file path overlap within batch | 0.05 |

**Scoring:**
- 0.8+ = Ready for headless execution
- 0.6-0.8 = Review recommended before headless; safe for interactive execution
- <0.6 = Rewrite recommended
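
The overall score is a weighted sum of the per-dimension scores, with the weights in the table adding up to 1.00. A minimal sketch of that arithmetic, assuming each dimension has already been scored on a 0 to 1 scale; the `overall_score` name, the input format, and the sample scores are illustrative, and a score of exactly 0.8 is treated as ready.

```bash
#!/usr/bin/env bash
# Weighted sum behind the scorecard. Input: one "weight score" pair per
# line. Output: the overall score and the matching verdict band.
overall_score() {
  awk '
    { total += $1 * $2 }
    END {
      if (total >= 0.8)      verdict = "Ready for headless execution"
      else if (total >= 0.6) verdict = "Review recommended"
      else                   verdict = "Rewrite recommended"
      printf "%.3f %s\n", total, verdict
    }'
}

# Weights in table order; scores are made up for the example.
overall_score <<'EOF'
0.20 0.90
0.20 0.80
0.15 0.70
0.15 0.80
0.10 1.00
0.10 0.70
0.05 0.50
0.05 0.60
EOF
# prints: 0.790 Review recommended
```

Because Specificity and Testability carry 40% of the weight between them, a plan with weak file paths and verification commands cannot reach the 0.8 band on the other dimensions alone.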

### Implementation Path

This scorecard could be implemented as `scripts/validate-plan-quality.sh`, producing output along these lines (values are illustrative):

```bash
# Run before execution
validate-plan-quality.sh docs/plans/my-feature.md

# Output:
# Specificity:     0.95 (19/20 tasks have file paths)
# Testability:     0.90 (18/20 tasks have verify commands)
# Scope:           0.75 (3 tasks estimated >30 lines)
# Independence:    0.85 (2 cross-task deps in same batch)
# Freshness:       1.00 (plan created today)
# Completeness:    0.80 (8/10 PRD criteria mapped)
# Conditionality:  0.40 (0 skip conditions, 2 potential no-ops detected)
# Batch coherence: 0.70 (Batch 4 mixes unrelated modules)
#
# OVERALL: 0.85 (ready for headless execution)
# WARNINGS:
# - Task 7 estimates ~45 lines changed; consider decomposing
# - Batch 4 tasks touch 3 unrelated modules; consider splitting
```
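
As a starting point for such a script, the Specificity line could be computed with a single pass over the plan. This is a sketch under assumptions: the `### Task` header pattern and the `specificity` name are hypothetical, and only the `Files:` marker comes from the scorecard table.

```bash
#!/usr/bin/env bash
# Sketch of the Specificity metric: share of tasks that contain a "Files:"
# line. Matches both "Files:" and "**Files:**".
specificity() {   # usage: specificity <plan-file>
  awk '
    /^### Task / { tasks++; seen = 0; next }
    /Files:/ && tasks && !seen { with_files++; seen = 1 }
    END {
      if (!tasks) { print "0.00 (0/0 tasks have file paths)"; exit }
      printf "%.2f (%d/%d tasks have file paths)\n", with_files / tasks, with_files, tasks
    }
  ' "$1"
}

# Example: specificity docs/plans/my-feature.md
```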

---

## Recommendations

### R1: Replace fixed "2-5 minute" heuristic with task-type-aware granularity

**Change to `writing-plans/SKILL.md`:**

Replace the "Bite-Sized Task Granularity" section with a task-type matrix:

| Task Type | Target Scope | Max Lines Changed |
|-----------|-------------|-------------------|
| New file | 1 file + tests | ~50 |
| Refactor | 1 module | ~30 |
| Integration | 1 connection point | ~20 |
| Bug fix | 1 bug | ~30 |
| Verification | Group freely | 0 (no code changes) |

Confidence: **high**. Directly supported by SWE-bench difficulty data showing lines changed as the strongest predictor.
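
The matrix translates directly into a lookup that a plan validator could apply per task. A minimal sketch: the limits mirror the table above, while the type keywords and function names are hypothetical.

```bash
#!/usr/bin/env bash
# Task-type matrix as a lookup table. Limits come from the R1 table.
max_lines_for() {   # usage: max_lines_for <task-type>
  case "$1" in
    new-file)     echo 50 ;;
    refactor)     echo 30 ;;
    integration)  echo 20 ;;
    bug-fix)      echo 30 ;;
    verification) echo 0 ;;
    *)            return 1 ;;   # unknown task type
  esac
}

check_task_scope() {   # usage: check_task_scope <task-type> <estimated_lines>
  local limit
  limit="$(max_lines_for "$1")" || { echo "unknown task type: $1" >&2; return 2; }
  if [ "$2" -gt "$limit" ]; then
    echo "over budget: $1 task estimates $2 lines (limit ~$limit); consider decomposing"
    return 1
  fi
  echo "ok"
}

check_task_scope refactor 25      # prints: ok
check_task_scope integration 45   # prints an "over budget" warning, returns 1
```

Since the limits in the table are approximate, a real validator would probably warn rather than fail when a task exceeds them.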

### R2: Shift from "complete code" to "contract + one example"

**Change to `writing-plans/SKILL.md`:**

Replace "Complete code in plan (not 'add validation')" with: "Complete contract in plan: function signature, behavior specification, one test showing expected usage. Implementation code is optional; provide it only for non-obvious algorithms or domain-specific logic."

Confidence: **high**. Supported by both the "curse of instructions" research and SWE-bench Pro's finding that requirements + interface specs are the critical inputs.

### R3: Add batch boundary guidelines

**Add to `writing-plans/SKILL.md`:**

"Each batch has exactly one testable outcome. Group by dependency, never by arbitrary count. Never mix file-creation and integration tasks in the same batch. Add skip conditions for tasks that earlier batches might complete."

Confidence: **medium**. Supported by toolkit execution data and Anthropic's harness guidance, but no controlled experiments on batch boundary strategies exist.

### R4: Implement plan quality scorecard

**New script:** `scripts/validate-plan-quality.sh`

Pre-execution quality check that scores plans on 8 dimensions (see Section 6). Wire into `run-plan.sh` as an optional pre-flight check.

Confidence: **medium**. The individual dimensions are well-supported, but the specific weights are heuristic and would benefit from calibration against execution outcomes.

### R5: Support adaptive decomposition on retry

**Change to `run-plan-headless.sh`:**

On retry, if the failure digest indicates scope-related issues (context overflow, multi-file coordination failure), automatically request finer decomposition in the retry prompt: "The previous attempt failed. Break this batch into smaller steps, implementing one file at a time."

Confidence: **medium**. Supported by ADaPT research (up to 27-33% higher success rates from as-needed decomposition) but untested in the toolkit's specific architecture.
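
A sketch of what that escalation could look like. The decomposition instruction is the wording proposed above; the digest keywords, the prompt layout, and the `build_retry_prompt` name are assumptions and do not reflect how `run-plan-headless.sh` is actually structured.

```bash
#!/usr/bin/env bash
# Hypothetical retry escalation: always attach the failure digest, and add
# the decomposition instruction only for scope-related failures.
build_retry_prompt() {   # usage: build_retry_prompt <batch_text> <failure_digest>
  local batch="$1" digest="$2"
  printf '%s\n\nPrevious failure:\n%s\n' "$batch" "$digest"
  case "$digest" in
    *"context overflow"*|*"multi-file"*)
      printf '\n%s\n' "The previous attempt failed. Break this batch into smaller steps, implementing one file at a time."
      ;;
  esac
}

build_retry_prompt "## Batch 3: wire parser into CLI" "context overflow after editing 4 files"
```

Failures that are not scope-related (for example a syntax error in a test) still get the failure context, but the batch keeps its original granularity.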

### R6: Add conditional task support to plan format

**Change to plan parser:**

Support a `skip_if:` field per task that specifies a shell command. If the command exits 0, the task is skipped. Example: `skip_if: test -f src/lib/telegram.sh` (skip if file already exists from a prior batch).
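
The evaluation side of `skip_if:` is small. A minimal sketch, assuming the parser has already extracted the field into a string; the function names are hypothetical. The command runs with the same privileges as the runner, so it deserves the same trust as the plan file itself.

```bash
#!/usr/bin/env bash
# Hypothetical skip_if evaluation: run the command in a child shell and
# skip the task when it exits 0.
should_skip() {   # usage: should_skip "<skip_if command>"
  [ -n "$1" ] || return 1          # no skip_if field: never skip
  bash -c "$1" >/dev/null 2>&1
}

run_task() {   # usage: run_task <task-name> "<skip_if command>"
  if should_skip "$2"; then
    echo "skip: $1"
  else
    echo "run: $1"
  fi
}

# Example: run_task "Create telegram helper" "test -f src/lib/telegram.sh"
```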

Confidence: **high**. Directly addresses the no-op task problem observed in Batches 4 and 6 of the toolkit's own execution data.

---

## Sources

### SWE-bench Ecosystem
- [SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/abs/2509.16941), Scale AI, 2025
- [Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really?](https://jatinganhotra.dev/blog/swe-agents/2025/04/15/swe-bench-verified-easy-medium-hard.html), Ganhotra, 2025
- [SWE-bench Verified](https://openai.com/index/introducing-swe-bench-verified/), OpenAI, 2024
- [SWE-EVO: Benchmarking Coding Agents](https://www.arxiv.org/pdf/2512.18470v1), 2025
- [SWE-rebench](https://arxiv.org/abs/2505.20411), NeurIPS 2025
- [SWE-bench Pro Leaderboard](https://scale.com/leaderboard/swe_bench_pro_public), Scale AI

### Agent Architecture & Planning
- [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents), Anthropic Engineering
- [Coding Agents 101](https://devin.ai/agents101), Cognition (Devin)
- [ADaPT: As-Needed Decomposition and Planning](https://arxiv.org/abs/2311.05772), Allen AI, NAACL 2024
- [Self-Organized Agents: A LLM Multi-Agent Framework](https://arxiv.org/abs/2404.02183), 2024
- [A Survey on Code Generation with LLM-based Agents](https://arxiv.org/abs/2508.00083), 2025
- [A Survey of Task Planning with Large Language Models](https://spj.science.org/doi/10.34133/icomputing.0124), Intelligent Computing

### Specification & Plan Structure
- [How to Write a Good Spec for AI Agents](https://addyosmani.com/blog/good-spec/), Addy Osmani (cites GitHub analysis of 2,500+ agent configs)
- [How to Keep Your AI Coding Agent from Going Rogue](https://www.arguingwithalgorithms.com/posts/technical-design-spec-pattern.html), Technical Design Spec Pattern
- [How to Use a Spec-Driven Approach for Coding with AI](https://blog.jetbrains.com/junie/2025/10/how-to-use-a-spec-driven-approach-for-coding-with-ai/), JetBrains Junie
- [Claude Code Best Practices](https://code.claude.com/docs/en/best-practices), Anthropic

### Edit Formats & Code Generation
- [Aider Edit Formats](https://aider.chat/docs/more/edit-formats.html), Aider
- [Unified Diffs Make GPT-4 Turbo 3X Less Lazy](https://aider.chat/docs/unified-diffs.html), Aider
- [OpenHands CodeAct 2.1](https://openhands.dev/blog/openhands-codeact-21-an-open-state-of-the-art-software-development-agent), OpenHands

### Evaluation Frameworks
- [Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems](https://arxiv.org/html/2512.12791v1), 2025
- [TaskBench: Benchmarking Large Language Models for Task Automation](https://proceedings.neurips.cc/paper_files/paper/2024/file/085185ea97db31ae6dcac7497616fd3e-Paper-Datasets_and_Benchmarks_Track.pdf), NeurIPS 2024