autonomous-coding-toolkit 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +22 -0
- package/.claude-plugin/plugin.json +13 -0
- package/LICENSE +21 -0
- package/Makefile +21 -0
- package/README.md +140 -0
- package/SECURITY.md +28 -0
- package/agents/bash-expert.md +113 -0
- package/agents/dependency-auditor.md +138 -0
- package/agents/integration-tester.md +120 -0
- package/agents/lesson-scanner.md +149 -0
- package/agents/python-expert.md +179 -0
- package/agents/service-monitor.md +141 -0
- package/agents/shell-expert.md +147 -0
- package/benchmarks/runner.sh +147 -0
- package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
- package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
- package/benchmarks/tasks/02-refactor-module/task.md +8 -0
- package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
- package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
- package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
- package/bin/act.js +238 -0
- package/commands/autocode.md +6 -0
- package/commands/cancel-ralph.md +18 -0
- package/commands/code-factory.md +53 -0
- package/commands/create-prd.md +55 -0
- package/commands/ralph-loop.md +18 -0
- package/commands/run-plan.md +117 -0
- package/commands/submit-lesson.md +122 -0
- package/docs/ARCHITECTURE.md +630 -0
- package/docs/CONTRIBUTING.md +125 -0
- package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
- package/docs/lessons/0002-async-def-without-await.md +28 -0
- package/docs/lessons/0003-create-task-without-callback.md +28 -0
- package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
- package/docs/lessons/0005-sqlite-without-closing.md +33 -0
- package/docs/lessons/0006-venv-pip-path.md +27 -0
- package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
- package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
- package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
- package/docs/lessons/0010-local-outside-function-bash.md +33 -0
- package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
- package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
- package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
- package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
- package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
- package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
- package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
- package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
- package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
- package/docs/lessons/0020-persist-state-incrementally.md +44 -0
- package/docs/lessons/0021-dual-axis-testing.md +48 -0
- package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
- package/docs/lessons/0023-static-analysis-spiral.md +51 -0
- package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
- package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
- package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
- package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
- package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
- package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
- package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
- package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
- package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
- package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
- package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
- package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
- package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
- package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
- package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
- package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
- package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
- package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
- package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
- package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
- package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
- package/docs/lessons/0045-iterative-design-improvement.md +33 -0
- package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
- package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
- package/docs/lessons/0048-integration-wiring-batch.md +40 -0
- package/docs/lessons/0049-ab-verification.md +41 -0
- package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
- package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
- package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
- package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
- package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
- package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
- package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
- package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
- package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
- package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
- package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
- package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
- package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
- package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
- package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
- package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
- package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
- package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
- package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
- package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
- package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
- package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
- package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
- package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
- package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
- package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
- package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
- package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
- package/docs/lessons/0078-static-review-without-live-test.md +30 -0
- package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
- package/docs/lessons/FRAMEWORK.md +161 -0
- package/docs/lessons/SUMMARY.md +201 -0
- package/docs/lessons/TEMPLATE.md +85 -0
- package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
- package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
- package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
- package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
- package/docs/plans/2026-02-21-mab-research-report.md +406 -0
- package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
- package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
- package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
- package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
- package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
- package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
- package/docs/plans/2026-02-22-mab-run-design.md +462 -0
- package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
- package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
- package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
- package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
- package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
- package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
- package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
- package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
- package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
- package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
- package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
- package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
- package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
- package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
- package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
- package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
- package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
- package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
- package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
- package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
- package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
- package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
- package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
- package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
- package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
- package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
- package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
- package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
- package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
- package/docs/plans/2026-02-24-headless-module-split.md +443 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
- package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
- package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
- package/docs/plans/audit-findings.md +186 -0
- package/docs/telegram-notification-format.md +98 -0
- package/examples/example-plan.md +51 -0
- package/examples/example-prd.json +72 -0
- package/examples/example-roadmap.md +33 -0
- package/examples/quickstart-plan.md +63 -0
- package/hooks/hooks.json +26 -0
- package/hooks/setup-symlinks.sh +48 -0
- package/hooks/stop-hook.sh +135 -0
- package/package.json +47 -0
- package/policies/bash.md +71 -0
- package/policies/python.md +71 -0
- package/policies/testing.md +61 -0
- package/policies/universal.md +60 -0
- package/scripts/analyze-report.sh +97 -0
- package/scripts/architecture-map.sh +145 -0
- package/scripts/auto-compound.sh +273 -0
- package/scripts/batch-audit.sh +42 -0
- package/scripts/batch-test.sh +101 -0
- package/scripts/entropy-audit.sh +221 -0
- package/scripts/failure-digest.sh +51 -0
- package/scripts/generate-ast-rules.sh +96 -0
- package/scripts/init.sh +112 -0
- package/scripts/lesson-check.sh +428 -0
- package/scripts/lib/common.sh +61 -0
- package/scripts/lib/cost-tracking.sh +153 -0
- package/scripts/lib/ollama.sh +60 -0
- package/scripts/lib/progress-writer.sh +128 -0
- package/scripts/lib/run-plan-context.sh +215 -0
- package/scripts/lib/run-plan-echo-back.sh +231 -0
- package/scripts/lib/run-plan-headless.sh +396 -0
- package/scripts/lib/run-plan-notify.sh +57 -0
- package/scripts/lib/run-plan-parser.sh +81 -0
- package/scripts/lib/run-plan-prompt.sh +215 -0
- package/scripts/lib/run-plan-quality-gate.sh +132 -0
- package/scripts/lib/run-plan-routing.sh +315 -0
- package/scripts/lib/run-plan-sampling.sh +170 -0
- package/scripts/lib/run-plan-scoring.sh +146 -0
- package/scripts/lib/run-plan-state.sh +142 -0
- package/scripts/lib/run-plan-team.sh +199 -0
- package/scripts/lib/telegram.sh +54 -0
- package/scripts/lib/thompson-sampling.sh +176 -0
- package/scripts/license-check.sh +74 -0
- package/scripts/mab-run.sh +575 -0
- package/scripts/module-size-check.sh +146 -0
- package/scripts/patterns/async-no-await.yml +5 -0
- package/scripts/patterns/bare-except.yml +6 -0
- package/scripts/patterns/empty-catch.yml +6 -0
- package/scripts/patterns/hardcoded-localhost.yml +9 -0
- package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
- package/scripts/pipeline-status.sh +197 -0
- package/scripts/policy-check.sh +226 -0
- package/scripts/prior-art-search.sh +133 -0
- package/scripts/promote-mab-lessons.sh +126 -0
- package/scripts/prompts/agent-a-superpowers.md +29 -0
- package/scripts/prompts/agent-b-ralph.md +29 -0
- package/scripts/prompts/judge-agent.md +61 -0
- package/scripts/prompts/planner-agent.md +44 -0
- package/scripts/pull-community-lessons.sh +90 -0
- package/scripts/quality-gate.sh +266 -0
- package/scripts/research-gate.sh +90 -0
- package/scripts/run-plan.sh +329 -0
- package/scripts/scope-infer.sh +159 -0
- package/scripts/setup-ralph-loop.sh +155 -0
- package/scripts/telemetry.sh +230 -0
- package/scripts/tests/run-all-tests.sh +52 -0
- package/scripts/tests/test-act-cli.sh +46 -0
- package/scripts/tests/test-agents-md.sh +87 -0
- package/scripts/tests/test-analyze-report.sh +114 -0
- package/scripts/tests/test-architecture-map.sh +89 -0
- package/scripts/tests/test-auto-compound.sh +169 -0
- package/scripts/tests/test-batch-test.sh +65 -0
- package/scripts/tests/test-benchmark-runner.sh +25 -0
- package/scripts/tests/test-common.sh +168 -0
- package/scripts/tests/test-cost-tracking.sh +158 -0
- package/scripts/tests/test-echo-back.sh +180 -0
- package/scripts/tests/test-entropy-audit.sh +146 -0
- package/scripts/tests/test-failure-digest.sh +66 -0
- package/scripts/tests/test-generate-ast-rules.sh +145 -0
- package/scripts/tests/test-helpers.sh +82 -0
- package/scripts/tests/test-init.sh +47 -0
- package/scripts/tests/test-lesson-check.sh +278 -0
- package/scripts/tests/test-lesson-local.sh +55 -0
- package/scripts/tests/test-license-check.sh +109 -0
- package/scripts/tests/test-mab-run.sh +182 -0
- package/scripts/tests/test-ollama-lib.sh +49 -0
- package/scripts/tests/test-ollama.sh +60 -0
- package/scripts/tests/test-pipeline-status.sh +198 -0
- package/scripts/tests/test-policy-check.sh +124 -0
- package/scripts/tests/test-prior-art-search.sh +96 -0
- package/scripts/tests/test-progress-writer.sh +140 -0
- package/scripts/tests/test-promote-mab-lessons.sh +110 -0
- package/scripts/tests/test-pull-community-lessons.sh +149 -0
- package/scripts/tests/test-quality-gate.sh +241 -0
- package/scripts/tests/test-research-gate.sh +132 -0
- package/scripts/tests/test-run-plan-cli.sh +86 -0
- package/scripts/tests/test-run-plan-context.sh +305 -0
- package/scripts/tests/test-run-plan-e2e.sh +153 -0
- package/scripts/tests/test-run-plan-headless.sh +424 -0
- package/scripts/tests/test-run-plan-notify.sh +124 -0
- package/scripts/tests/test-run-plan-parser.sh +217 -0
- package/scripts/tests/test-run-plan-prompt.sh +254 -0
- package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
- package/scripts/tests/test-run-plan-routing.sh +178 -0
- package/scripts/tests/test-run-plan-scoring.sh +148 -0
- package/scripts/tests/test-run-plan-state.sh +261 -0
- package/scripts/tests/test-run-plan-team.sh +157 -0
- package/scripts/tests/test-scope-infer.sh +150 -0
- package/scripts/tests/test-setup-ralph-loop.sh +63 -0
- package/scripts/tests/test-telegram-env.sh +38 -0
- package/scripts/tests/test-telegram.sh +121 -0
- package/scripts/tests/test-telemetry.sh +46 -0
- package/scripts/tests/test-thompson-sampling.sh +139 -0
- package/scripts/tests/test-validate-all.sh +60 -0
- package/scripts/tests/test-validate-commands.sh +89 -0
- package/scripts/tests/test-validate-hooks.sh +98 -0
- package/scripts/tests/test-validate-lessons.sh +150 -0
- package/scripts/tests/test-validate-plan-quality.sh +235 -0
- package/scripts/tests/test-validate-plans.sh +187 -0
- package/scripts/tests/test-validate-plugin.sh +106 -0
- package/scripts/tests/test-validate-prd.sh +184 -0
- package/scripts/tests/test-validate-skills.sh +134 -0
- package/scripts/validate-all.sh +57 -0
- package/scripts/validate-commands.sh +67 -0
- package/scripts/validate-hooks.sh +89 -0
- package/scripts/validate-lessons.sh +98 -0
- package/scripts/validate-plan-quality.sh +369 -0
- package/scripts/validate-plans.sh +120 -0
- package/scripts/validate-plugin.sh +86 -0
- package/scripts/validate-policies.sh +42 -0
- package/scripts/validate-prd.sh +118 -0
- package/scripts/validate-skills.sh +96 -0
- package/skills/autocode/SKILL.md +285 -0
- package/skills/autocode/ab-verification.md +51 -0
- package/skills/autocode/code-quality-standards.md +37 -0
- package/skills/autocode/competitive-mode.md +364 -0
- package/skills/brainstorming/SKILL.md +97 -0
- package/skills/capture-lesson/SKILL.md +187 -0
- package/skills/check-lessons/SKILL.md +116 -0
- package/skills/dispatching-parallel-agents/SKILL.md +110 -0
- package/skills/executing-plans/SKILL.md +85 -0
- package/skills/finishing-a-development-branch/SKILL.md +201 -0
- package/skills/receiving-code-review/SKILL.md +72 -0
- package/skills/requesting-code-review/SKILL.md +59 -0
- package/skills/requesting-code-review/code-reviewer.md +82 -0
- package/skills/research/SKILL.md +145 -0
- package/skills/roadmap/SKILL.md +115 -0
- package/skills/subagent-driven-development/SKILL.md +98 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
- package/skills/subagent-driven-development/implementer-prompt.md +73 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
- package/skills/systematic-debugging/SKILL.md +134 -0
- package/skills/systematic-debugging/condition-based-waiting.md +64 -0
- package/skills/systematic-debugging/defense-in-depth.md +32 -0
- package/skills/systematic-debugging/root-cause-tracing.md +55 -0
- package/skills/test-driven-development/SKILL.md +167 -0
- package/skills/using-git-worktrees/SKILL.md +219 -0
- package/skills/using-superpowers/SKILL.md +54 -0
- package/skills/verification-before-completion/SKILL.md +140 -0
- package/skills/verify/SKILL.md +82 -0
- package/skills/writing-plans/SKILL.md +128 -0
- package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,558 @@
# Research: Prompt Engineering for Code Generation Agents

> **Date:** 2026-02-22
> **Status:** Research complete
> **Method:** Web research + academic literature + open-source agent analysis
> **Confidence convention:** High = multiple corroborating sources + empirical data. Medium = consistent expert guidance but limited controlled studies. Low = anecdotal or single-source.

## Executive Summary

This research synthesizes evidence from academic papers, open-source agent codebases (SWE-agent, OpenHands, Aider), vendor documentation (Anthropic, OpenAI), and SWE-bench competition analysis to answer eight questions about prompt engineering for code generation agents. The findings directly inform improvements to the toolkit's `run-plan-prompt.sh` and `run-plan-context.sh`.

**Top-line findings:**

1. **Structured planning in prompts yields a measurable 4% SWE-bench improvement** (OpenAI, confirmed). Direct instruction with structured planning outperforms both raw chain-of-thought and unstructured prompts for code generation.
2. **File context ordering matters significantly.** The "Lost in the Middle" effect (Stanford, 2023) is real and confirmed across models: information at the start and end of context is recalled best. Place task instructions and critical files at boundaries; relegate supporting context to the middle.
3. **Simple role prompting ("You are an expert programmer") has no measurable effect.** Detailed, behavior-defining system prompts do help, but generic personas do not.
4. **Few-shot examples help for smaller models but have diminishing returns on frontier models.** Self-planning with examples shows up to 25.4% Pass@1 improvement, but the benefit comes from the planning structure, not the examples per se.
5. **Error context in retries should be a failure digest, not a raw log dump.** The current escalation strategy (attempt 2: signal, attempt 3: digest) aligns with best practice.
6. **The current prompt variants ("vanilla", "different-approach", "minimal-change") were chosen without evidence.** Research supports batch-type-aware variants but the specific suffixes need revision based on what top agents actually do.

---

## 1. Prompt Structure: What Produces the Best Code from LLMs?

### Findings

Three prompting paradigms have been benchmarked for code generation:

| Approach | Performance vs. Baseline | Source |
|----------|------------------------|--------|
| Direct instruction | Baseline | Multiple |
| Standard Chain-of-Thought (CoT) | +0.82 pts Pass@1 (marginal) | Li et al., SCoT (ACM TOSEM 2024) |
| Structured CoT (SCoT) | +13.79% HumanEval, +12.31% MBPP | Li et al., SCoT (ACM TOSEM 2024) |
| Self-Planning | +25.4% Pass@1 vs. direct, +11.9% vs. CoT | ACM TOSEM 2024 |
| Chain of Grounded Objectives (CGO) | Outperforms SCoT and self-planning | Yeo et al., ECOOP 2025 |
| Explicit planning in system prompt | +4% SWE-bench Verified | OpenAI GPT-4.1 Prompting Guide |

**Key insight:** Standard CoT is wasteful for code generation. It adds 35-600% latency for marginal gains. Structured approaches that decompose the task into functional objectives or implementation steps perform far better. The ECOOP 2025 CGO paper found that "machine-oriented reasoning" (functional objectives) outperforms "human-oriented reasoning" (step-by-step procedures) for code.

**What top agents actually do:**

- **SWE-agent:** Mandates a structured five-phase workflow in its system prompt: reproduce issue, localize cause, plan fix, implement, verify. The system prompt is ~800 words.
- **OpenHands:** Five-phase workflow: Exploration, Analysis, Testing, Implementation, Verification. Explicitly states "thoroughly examine relevant files first."
- **Aider:** Minimal system prompt focused on output format (SEARCH/REPLACE blocks), not reasoning strategy. Relies on the model's native capabilities.

**Confidence:** High. Multiple independent benchmarks converge on structured planning > raw CoT > direct instruction.

### Evidence

- Li et al., "Structured Chain-of-Thought Prompting for Code Generation," ACM TOSEM 2024 ([arXiv 2305.06599](https://arxiv.org/abs/2305.06599))
- Yeo et al., "Chain of Grounded Objectives: Concise Goal-Oriented Prompting for Code Generation," ECOOP 2025 ([arXiv 2501.13978](https://arxiv.org/abs/2501.13978))
- [OpenAI GPT-4.1 Prompting Guide](https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide): "Inducing explicit planning increased the pass rate by 4%"
- [Anthropic Claude 4 Best Practices](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices)

### Implications for the Toolkit

The current `build_batch_prompt()` gives a task list and requirements but no explicit planning instruction. Adding a structured planning directive would likely improve first-attempt pass rates.

**Specific change:** Add a planning section to the prompt template:

```
Before implementing, plan your approach:
1. Read relevant files to understand current state
2. For each task, identify: what files to create/modify, what tests to write, what the expected behavior is
3. Implement one task at a time: write failing test, implement, verify pass, commit
4. After all tasks, run the quality gate
```

This aligns with the OpenAI finding (+4% SWE-bench) and OpenHands/SWE-agent's structured workflow approach.
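
As a minimal sketch of how such a directive could be appended, assuming a prompt that is already assembled as a string: the helper name `append_planning_section` and its calling convention are hypothetical and are not taken from `run-plan-prompt.sh`. The quoted heredoc keeps the directive text literal, so nothing in it is expanded by the shell.

```shell
# Hypothetical helper (not part of the toolkit): prints an existing prompt
# followed by the structured planning directive proposed above.
# Usage: append_planning_section "PROMPT TEXT"
append_planning_section() {
    # Emit the caller's prompt unchanged, then one blank separator line.
    printf '%s\n\n' "$1"
    # Quoted delimiter ('EOF') prevents any expansion inside the directive.
    cat <<'EOF'
Before implementing, plan your approach:
1. Read relevant files to understand current state
2. For each task, identify: what files to create/modify, what tests to write, what the expected behavior is
3. Implement one task at a time: write failing test, implement, verify pass, commit
4. After all tasks, run the quality gate
EOF
}

# Example call with a placeholder prompt.
append_planning_section "Tasks in this batch: add a /health endpoint."
```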

---

## 2. File Context Ordering in Prompts

### Findings

The "Lost in the Middle" effect (Liu et al., 2023) establishes a U-shaped attention curve: LLMs attend most strongly to information at the **beginning** and **end** of the context, with significant degradation for information positioned in the middle. This has been confirmed across GPT-3.5, GPT-4, Claude, LLaMA, and Qwen model families.

**Practical ordering rules derived from research:**

1. **Task instructions and critical constraints go first** (primacy effect)
2. **Error context and test output go last** (recency effect)
3. **Supporting file contents go in the middle** (least critical position)
4. **If a file is the most important context, either lead with it or echo key parts at the end**
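
The first three rules can be sketched as a small assembly function. This is an illustration under stated assumptions: the function name `assemble_prompt` and the section titles (`## Task`, `## Supporting context`, `## Most recent failure`) are hypothetical and do not come from the toolkit.

```shell
# Hypothetical sketch: orders prompt sections by position.
# Usage: assemble_prompt INSTRUCTIONS SUPPORTING ERROR_CONTEXT
assemble_prompt() {
    # Rule 1: task instructions and constraints first (primacy).
    printf '## Task\n%s\n' "$1"
    # Rule 3: supporting material in the middle, only if provided.
    if [ -n "$2" ]; then
        printf '\n## Supporting context\n%s\n' "$2"
    fi
    # Rule 2: error context and test output last (recency), only if provided.
    if [ -n "$3" ]; then
        printf '\n## Most recent failure\n%s\n' "$3"
    fi
}

# Example call with placeholder content.
assemble_prompt "Fix the config parser" "Recent commits: (none)" "FAIL: test_parse_config"
```

On a first attempt there is no failure to report, so the third argument can be empty and the prompt simply ends with the supporting context.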

**What top agents do:**

- **Augment Code** explicitly states: "Models pay attention to beginning and end. Prioritize importance: user message content > prompt beginning > middle sections."
- **SWE-agent** places the issue statement last (after system prompt and demonstrations), leveraging recency bias.
- **OpenHands** places the task description at the end of system messages.
- **Anthropic** recommends: "If you control the prompt, bias it so key evidence is early or echoed late, and pre-summarize the most relevant spans."

**Truncation strategy also matters:** When truncating long outputs (command output, logs), Augment Code recommends **removing the middle, keeping the beginning and end**. Error messages typically appear at the end of output; file headers and structure appear at the beginning.

**Confidence:** High. The Lost in the Middle finding is one of the most replicated results in LLM research (Stanford 2023, published in TACL 2024, 1000+ citations).

### Evidence

- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024 ([arXiv 2307.03172](https://arxiv.org/abs/2307.03172))
- Raimondi, "Exploiting Primacy Effect to Improve Large Language Models," RANLP 2025 ([arXiv 2507.13949](https://arxiv.org/abs/2507.13949))
- [Augment Code: 11 Prompting Techniques](https://www.augmentcode.com/blog/how-to-build-your-agent-11-prompting-techniques-for-better-ai-agents)

### Implications for the Toolkit

The current `build_batch_prompt()` ordering is:

```
1. Role + working directory (top)     ← OK
2. Tasks in this batch                ← GOOD (high priority, near top)
3. Recent commits                     ← OK (middle)
4. Previous progress                  ← OK (middle)
5. Previous quality gate              ← OK (middle)
6. Referenced files                   ← OK (middle)
7. Requirements (TDD, quality gate)   ← GOOD (at bottom, recency)
```

This ordering is already reasonable. The main improvements:

1. **Keep the Requirements block at the bottom; optionally echo key constraints near the top.** The TDD and quality gate requirements are at the very bottom, which benefits from recency, and the task list is near the top, which benefits from primacy. This is a good structure, so no move is needed.

2. **When injecting error context in retries, place the failure digest at the end** (already done in attempt 3, which is correct).

3. **For `run-plan-context.sh`: the TOKEN_BUDGET_CHARS=6000 (~1500 tokens) is conservative.** The context budget should be high enough to include critical file content, while staying selective about what gets included. Current priority order (directives > failure patterns > refs > git log > progress) is correct: highest value first.

4. **When including referenced files (`context_refs`), truncate by removing the middle** of long files rather than using `head -100` (which loses the end of the file where key functions may live).
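
A minimal sketch of middle-removing truncation, written as portable shell. The function name `truncate_middle`, the default line counts (50 head, 20 tail), and the omission marker text are assumptions made for illustration; none of them come from `run-plan-context.sh`.

```shell
# Hypothetical sketch: keep the first HEAD_LINES and last TAIL_LINES of a
# file and replace everything in between with a one-line marker.
# Usage: truncate_middle FILE [HEAD_LINES] [TAIL_LINES]
truncate_middle() {
    file=$1
    head_n=${2:-50}   # assumed default
    tail_n=${3:-20}   # assumed default
    # wc -l counts newline characters; tr strips padding some wc builds add.
    total=$(wc -l < "$file" | tr -d ' ')
    # Short files pass through untouched.
    if [ "$total" -le $((head_n + tail_n)) ]; then
        cat "$file"
        return 0
    fi
    omitted=$((total - head_n - tail_n))
    head -n "$head_n" "$file"
    printf '[... %s lines omitted ...]\n' "$omitted"
    tail -n "$tail_n" "$file"
}

# Example (path is a placeholder): truncate_middle path/to/file.py 50 20
```

The explicit marker tells the agent that content was dropped, so it can open the full file if the omitted region turns out to matter.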

---

## 3. "Lost in the Middle" for Code Context Injection

### Findings

The Stanford paper (arXiv 2307.03172) tested multi-document QA and key-value retrieval. When the relevant document was placed in the middle of 20 documents, accuracy dropped by up to 20 percentage points compared to placing it first or last. The effect was present even in models explicitly trained for long contexts.

**For code context specifically:**

- Code files have a natural structure: imports at top, main logic in middle, exports/entry points at bottom. When injecting multiple files, the agent needs to find the relevant function or class.
- **The mitigation is not to avoid middle placement, but to make middle content discoverable.** Techniques:
  - Add section headers/markers before each file: `--- path/to/file.py (relevant function: parse_config) ---`
  - Pre-summarize each file's purpose in a header line
  - If injecting more than 5 files, summarize all files first, then include full content of the 2-3 most relevant ones

**The "Found in the Middle" follow-up paper** (2024) proposes a plug-and-play positional-encoding change to mitigate the effect, but it operates on the model's internals, so it is not available through prompt engineering.

**Confidence:** High for the phenomenon existing. Medium for the specific code-context mitigations (these are derived from general principles applied to code, not directly benchmarked).

### Evidence

- Liu et al., "Lost in the Middle," TACL 2024 ([arXiv 2307.03172](https://arxiv.org/abs/2307.03172))
- Zhang et al., "Found in the Middle," NeurIPS 2024 ([arXiv 2403.04797](https://arxiv.org/abs/2403.04797))

### Implications for the Toolkit

The current context injection in `run-plan-context.sh` and `build_batch_prompt()` includes referenced files with a simple header (`--- $ref ---`) and `head -100`. Improvements:

1. **Add purpose annotations to context_refs headers.** Instead of `--- path/to/file.py ---`, use `--- path/to/file.py (defines: ConfigParser class, parse_config function) ---`. This could be automated by extracting the first docstring or function/class names.

2. **Limit injected files to 3-5 maximum.** Beyond that, include only summaries. The current TOKEN_BUDGET_CHARS=6000 naturally limits this, which is good.

3. **For files truncated by `head -50` or `head -100`, also include `tail -20`** to capture the end of the file (exports, main logic, error handling).
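
One way the header annotation in item 1 might be automated, shown as a rough sketch for Python sources only. The function name `annotated_header` is hypothetical, and the pattern deliberately matches only top-level `def` and `class` statements (methods, `async def`, and other languages are ignored).

```shell
# Hypothetical sketch: build an annotated context header for a Python file
# by listing its top-level function and class names.
# Usage: annotated_header FILE
annotated_header() {
    file=$1
    # Print the identifier after a top-level "def" or "class", one per line,
    # then join the lines with ", ".
    names=$(sed -n -E 's/^(def|class)[[:space:]]+([A-Za-z_][A-Za-z0-9_]*).*/\2/p' "$file" \
        | paste -sd ',' - | sed 's/,/, /g')
    if [ -n "$names" ]; then
        printf '%s\n' "--- $file (defines: $names) ---"
    else
        # Fall back to the plain header when nothing was extracted.
        printf '%s\n' "--- $file ---"
    fi
}

# Example (path is a placeholder): annotated_header path/to/file.py
```

Name extraction is cheap and needs no model call; extracting the first docstring, the other option mentioned in item 1, would need a parser that understands multi-line strings.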
|
|
161
|
+
|
|
162
|
+
---
|
|
163
|
+
|
|
164
|
+
## 4. Role Prompting for Code Generation
|
|
165
|
+
|
|
166
|
+
### Findings
|
|
167
|
+
|
|
168
|
+
Research on role prompting for code generation shows surprisingly mixed results:
|
|
169
|
+
|
|
170
|
+
| Study | Finding | Source |
|
|
171
|
+
|-------|---------|--------|
|
|
172
|
+
| "When 'A Helpful Assistant' Is Not Really Helpful" | Personas have "no or small negative effects" on performance across 4 LLM families | arXiv 2311.10054 |
|
|
173
|
+
| PromptHub analysis | Basic persona prompts don't improve results; Expert Prompting significantly outperformed other methods | PromptHub blog |
|
|
174
|
+
| Anaconda persona study | Different programming personas (Torvalds, Knuth) can influence code style but not correctness | Anaconda blog |
|
|
175
|
+
|
|
176
|
+
**The critical distinction:** Simple role assignment ("You are an expert Python developer") does not improve code quality, but **detailed behavioral specification** appears to. The difference:
|
|
177
|
+
|
|
178
|
+
- **Ineffective:** "You are a senior software engineer."
|
|
179
|
+
- **Effective:** "You are implementing Batch 3 of a plan. Follow TDD: write failing test, implement, verify pass, commit. Run the quality gate after all tasks. All 42+ tests must pass."
|
|
180
|
+
|
|
181
|
+
The effective version is not a "role" — it's a behavioral contract with specific constraints and success criteria.
|
|
182
|
+
|
|
183
|
+
**What top agents do:**
|
|
184
|
+
|
|
185
|
+
- **Aider:** No role prompt at all. Defines behavior through output format constraints.
|
|
186
|
+
- **SWE-agent:** "You are a helpful assistant" + detailed ACI tool documentation.
|
|
187
|
+
- **OpenHands:** "You are a helpful AI assistant" + 5-phase workflow specification.
|
|
188
|
+
- **Claude Code:** No generic role. Behavior defined by CLAUDE.md project instructions.
|
|
189
|
+
|
|
190
|
+
**Confidence:** High that generic roles don't help. Medium that detailed behavioral specs do (consistent across top agents but no controlled study isolating this variable).
|
|
191
|
+
|
|
192
|
+
### Evidence
|
|
193
|
+
|
|
194
|
+
- Zheng et al., "When 'A Helpful Assistant' Is Not Really Helpful" ([arXiv 2311.10054](https://arxiv.org/html/2311.10054v3))
|
|
195
|
+
- [PromptHub: Role-Prompting Analysis](https://www.prompthub.us/blog/role-prompting-does-adding-personas-to-your-prompts-really-make-a-difference)
|
|
196
|
+
- [Anaconda: Persona Programming](https://www.anaconda.com/blog/persona-programming-ai)
|
|
197
|
+
|
|
198
|
+
### Implications for the Toolkit
|
|
199
|
+
|
|
200
|
+
The current prompt opens with: `"You are implementing Batch ${batch_num}: ${title} from ${plan_file}."` This is already close to optimal — it's a behavioral specification, not a persona. No change needed here.
|
|
201
|
+
|
|
202
|
+
**Do NOT add:** "You are an expert software engineer" or similar generic role prompts. The research consistently shows this has zero or negative effect on frontier models.
|
|
203
|
+
|
|
204
|
+
---
|
|
205
|
+
|
|
206
|
+
## 5. SWE-bench Top Performer Prompt Strategies
|
|
207
|
+
|
|
208
|
+
### Findings
|
|
209
|
+
|
|
210
|
+
As of early 2026, top SWE-bench Verified performers score 75-79%:
|
|
211
|
+
|
|
212
|
+
| Agent | Score | Key Prompt Strategy |
|
|
213
|
+
|-------|-------|-------------------|
|
|
214
|
+
| Claude Opus 4.6 (Thinking) | 79.2% | Adaptive thinking, no explicit planning prompt needed |
|
|
215
|
+
| Live-SWE-agent + Claude | 79.2% | ACI design: custom file viewer, linting before edit, structured observation |
|
|
216
|
+
| Gemini 3 Flash | 76.2% | Extended thinking |
|
|
217
|
+
| GPT 5.2 | 75.4% | Explicit planning in system prompt |
|
|
218
|
+
| CodeStory Midwit Agent | 62% | Multi-agent with brute force search |
|
|
219
|
+
|
|
220
|
+
**Common strategies among top performers:**
|
|
221
|
+
|
|
222
|
+
1. **Explicit planning prompt** — OpenAI measured +4% from adding planning instructions. This is the single largest prompt-engineering intervention with controlled evidence.
|
|
223
|
+
|
|
224
|
+
2. **Structured tool usage.** API-parsed tool descriptions outperform manually injected schemas by 2% (OpenAI). SWE-agent's ACI is the canonical example: custom commands with documentation and demonstrations.
|
|
225
|
+
|
|
226
|
+
3. **Persistence instructions** — "Keep going until the task is fully resolved" prevents early termination. OpenAI, Augment, and Anthropic all recommend this.
|
|
227
|
+
|
|
228
|
+
4. **Minimal context, maximum relevance.** Top agents aggressively filter context. SWE-agent uses file localization before editing. OpenHands mandates an "Exploration" phase before implementation.
|
|
229
|
+
|
|
230
|
+
5. **Multi-agent approaches** — CodeStory's Midwit Agent uses multiple agents (brute force). The competitive mode in this toolkit already implements this.
|
|
231
|
+
|
|
232
|
+
**The sample SWE-bench prompt from OpenAI's guide specifies an 8-step methodology:**
|
|
233
|
+
1. Understand the problem deeply
|
|
234
|
+
2. Investigate the codebase systematically
|
|
235
|
+
3. Develop clear, step-by-step plans
|
|
236
|
+
4. Implement incrementally with small, testable changes
|
|
237
|
+
5. Debug to identify root causes
|
|
238
|
+
6. Test frequently after each change
|
|
239
|
+
7. Verify comprehensively
|
|
240
|
+
8. Reflect on edge cases and hidden test scenarios
|
|
241
|
+
|
|
242
|
+
**Confidence:** High. These strategies are from the agents that actually top the benchmark, with controlled ablations from OpenAI.
|
|
243
|
+
|
|
244
|
+
### Evidence
|
|
245
|
+
|
|
246
|
+
- [OpenAI GPT-4.1 Prompting Guide](https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide)
|
|
247
|
+
- [SWE-bench Leaderboard](https://www.vals.ai/benchmarks/swebench)
|
|
248
|
+
- [SWE-rebench Leaderboard](https://swe-rebench.com)
|
|
249
|
+
- [SWE-agent paper](https://arxiv.org/abs/2405.15793)
|
|
250
|
+
|
|
251
|
+
### Implications for the Toolkit
|
|
252
|
+
|
|
253
|
+
The toolkit already implements several of these (fresh context per batch, quality gates, TDD workflow). Missing elements:
|
|
254
|
+
|
|
255
|
+
1. **Add persistence instruction:** "Complete all tasks in this batch before stopping. Do not end your turn early or ask for clarification — use your tools to investigate and resolve uncertainties."
|
|
256
|
+
|
|
257
|
+
2. **Add investigation-first instruction:** "Before implementing any task, read the relevant files to understand the current state. Do not assume file contents from the plan description."
|
|
258
|
+
|
|
259
|
+
3. **The 8-step methodology from OpenAI maps well to TDD.** The current prompt says "TDD: write test -> verify fail -> implement -> verify pass -> commit each task." This could be expanded to include investigation and verification steps.
|
|
260
|
+
|
|
261
|
+
---
|
|
262
|
+
|
|
263
|
+
## 6. How Devin, SWE-agent, OpenHands, and Aider Construct Their Prompts
|
|
264
|
+
|
|
265
|
+
### Findings
|
|
266
|
+
|
|
267
|
+
| Agent | System Prompt Length | Key Structure | Context Strategy |
|
|
268
|
+
|-------|---------------------|---------------|-----------------|
|
|
269
|
+
| **SWE-agent** | ~800 words | System prompt + demonstration + issue | Custom ACI commands with inline docs. Linter gates edits. File viewer replaces raw `cat`. |
|
|
270
|
+
| **OpenHands** | ~600 words (Jinja2 template) | Role + capabilities + workflow phases + error recovery | Structured 5-phase workflow. "Reflect on 5-7 possible causes" on repeated failure. |
|
|
271
|
+
| **Aider** | ~200 words (main) + format spec | Behavioral constraints + SEARCH/REPLACE format | Minimal context. Repo map (file tree + signatures) injected automatically. Referenced files in full. |
|
|
272
|
+
| **Devin** | Proprietary | Unknown (closed source) | Multi-agent with planning agent, execution agent, and verification agent. |
|
|
273
|
+
|
|
274
|
+
**Structural patterns that recur across these agents:**
|
|
275
|
+
|
|
276
|
+
1. **Role + capabilities** — One sentence establishing what the agent can do.
|
|
277
|
+
2. **Behavioral constraints** — Explicit rules about when to ask vs. act, how to handle ambiguity.
|
|
278
|
+
3. **Output format** — Strict format for code changes (SEARCH/REPLACE in Aider, structured actions in SWE-agent).
|
|
279
|
+
4. **Error recovery protocol** — What to do when things fail (OpenHands: "reflect on 5-7 possible causes").
|
|
280
|
+
5. **Minimal prompting** — None of these agents use long, elaborate prompts. The system prompts are 200-800 words. The power comes from tool design and workflow structure, not prompt verbosity.
|
|
281
|
+
|
|
282
|
+
**Aider's approach is distinctive:** It barely prompts the model for reasoning. Instead, it constrains the output format (SEARCH/REPLACE blocks) and relies on the model's native intelligence. The "repo map" (file tree with function/class signatures) provides structural context without full file contents.
|
|
283
|
+
|
|
284
|
+
**OpenHands' error recovery is the most sophisticated:** On repeated failure, the prompt instructs the agent to "step back" and "reflect on 5-7 different possible sources of the problem" before continuing. This prevents the common failure mode of repeatedly trying the same approach.
|
|
285
|
+
|
|
286
|
+
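**Confidence:** High for SWE-agent, OpenHands, and Aider, whose prompts were read directly from the source code of production agents rather than from documentation or blog posts. Low for Devin, which is closed source.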
|
|
287
|
+
|
|
288
|
+
### Evidence
|
|
289
|
+
|
|
290
|
+
- [Aider prompts.py](https://github.com/Aider-AI/aider/blob/main/aider/prompts.py)
|
|
291
|
+
- [Aider editblock_prompts.py](https://github.com/Aider-AI/aider/blob/main/aider/coders/editblock_prompts.py)
|
|
292
|
+
- [OpenHands system_prompt.j2](https://github.com/All-Hands-AI/OpenHands/blob/main/openhands/agenthub/codeact_agent/prompts/system_prompt.j2)
|
|
293
|
+
- [SWE-agent ACI documentation](https://github.com/SWE-agent/SWE-agent/blob/main/docs/background/aci.md)
|
|
294
|
+
- [SWE-agent paper](https://arxiv.org/abs/2405.15793)
|
|
295
|
+
|
|
296
|
+
### Implications for the Toolkit
|
|
297
|
+
|
|
298
|
+
The current `build_batch_prompt()` is a template of about 30 lines, which is comparable in size to the 200-800 word system prompts used by the top agents. The prompt should not get longer, but it should get more structured.
|
|
299
|
+
|
|
300
|
+
**Specific improvements:**
|
|
301
|
+
|
|
302
|
+
1. **Add an error recovery instruction** (from OpenHands): When the prompt includes failure context (retries), add: "Before attempting a fix, identify 3-5 possible root causes and assess the likelihood of each. Address the most likely cause first."
|
|
303
|
+
|
|
304
|
+
2. **Consider a repo map equivalent.** The current `context_refs` system injects full file content. A lighter-weight option: inject file tree + function signatures for the entire worktree, similar to Aider's repo map. This gives the agent structural awareness without consuming tokens on file content.
|
|
305
|
+
|
|
306
|
+
3. **The prompt variant system in `get_prompt_variants()` appends short suffixes like "check all imports before running tests."** Top agents don't use this pattern. Instead, they vary the workflow structure or the error recovery strategy. The variant system should be revised (see Recommendations).
|
|
307
|
+
|
|
308
|
+
---
|
|
309
|
+
|
|
310
|
+
## 7. Few-Shot Examples in Code Generation Prompts
|
|
311
|
+
|
|
312
|
+
### Findings
|
|
313
|
+
|
|
314
|
+
The evidence on few-shot examples for code generation is nuanced:
|
|
315
|
+
|
|
316
|
+
| Model Class | Few-Shot Impact | Source |
|
|
317
|
+
|-------------|----------------|--------|
|
|
318
|
+
| Small models (T5, CodeLlama-7B) | Significant improvement | CODEEXEMPLAR, arXiv 2412.02906 |
|
|
319
|
+
| Frontier models (GPT-4, Claude Opus) | Diminishing returns | General consensus, no single paper |
|
|
320
|
+
| All models with self-planning | +25.4% Pass@1 improvement | ACM TOSEM 2024 |
|
|
321
|
+
|
|
322
|
+
**Key findings:**
|
|
323
|
+
|
|
324
|
+
1. **The benefit of few-shot comes from the planning structure, not the examples themselves.** Self-planning prompting (show examples of planning, then coding) yields +25.4% improvement. The planning template transfers; the specific example code does not.
|
|
325
|
+
|
|
326
|
+
2. **More complex examples are more informative than simple ones.** The CODEEXEMPLAR-FREE method selects examples the LLM struggles to generate on its own. This is counterintuitive, since easy examples might seem like cleaner demonstrations. A plausible explanation is that hard examples force the model to attend more carefully.
|
|
327
|
+
|
|
328
|
+
3. **For frontier models on code generation tasks, few-shot examples consume tokens without proportional benefit.** The models already know how to code. The value-add is in workflow and constraint specification, not in showing code examples.
|
|
329
|
+
|
|
330
|
+
4. **Anthropic's guidance for Claude 4.6:** "Be vigilant with examples & details. Claude pays close attention to details and examples. Ensure that your examples align with the behaviors you want to encourage." But also: "Avoid overfitting to specific examples" — test that examples don't degrade performance on novel cases.
|
|
331
|
+
|
|
332
|
+
**What top agents do:**
|
|
333
|
+
|
|
334
|
+
- **SWE-agent:** Optional demonstration (a worked example of solving a GitHub issue). This is the closest to few-shot and it is marked as optional.
|
|
335
|
+
- **Aider:** No few-shot examples. Pure instruction + format specification.
|
|
336
|
+
- **OpenHands:** No few-shot examples. Workflow specification only.
|
|
337
|
+
|
|
338
|
+
**Confidence:** Medium. The self-planning result is well-evidenced, but the claim about diminishing returns on frontier models is inferred from the absence of few-shot in top agents rather than a controlled ablation.
|
|
339
|
+
|
|
340
|
+
### Evidence
|
|
341
|
+
|
|
342
|
+
- Xu et al., "Does Few-Shot Learning Help LLM Performance in Code Synthesis?" ([arXiv 2412.02906](https://arxiv.org/abs/2412.02906))
|
|
343
|
+
- Jiang et al., "Self-Planning Code Generation with Large Language Models," ACM TOSEM 2024
|
|
344
|
+
- [Anthropic Claude 4 Best Practices](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices)
|
|
345
|
+
|
|
346
|
+
### Implications for the Toolkit
|
|
347
|
+
|
|
348
|
+
The current prompt does not include few-shot examples. **This is correct for a system using frontier models.** Do not add few-shot code examples to the prompt.
|
|
349
|
+
|
|
350
|
+
However, the self-planning finding suggests value in including a brief **planning template** (not a code example):
|
|
351
|
+
|
|
352
|
+
```
|
|
353
|
+
Planning template:
|
|
354
|
+
Task: [task name]
|
|
355
|
+
Files to read: [list]
|
|
356
|
+
Files to create/modify: [list]
|
|
357
|
+
Test to write: [test file and test name]
|
|
358
|
+
Expected behavior: [what the test checks]
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
This is a structural template, not a few-shot example. It guides the planning process without showing code.
|
|
362
|
+
|
|
363
|
+
---
|
|
364
|
+
|
|
365
|
+
## 8. Error Context Framing in Retry Prompts
|
|
366
|
+
|
|
367
|
+
### Findings
|
|
368
|
+
|
|
369
|
+
The toolkit's current retry escalation is:
|
|
370
|
+
|
|
371
|
+
- **Attempt 1:** Raw task prompt
|
|
372
|
+
- **Attempt 2:** Task + "Previous attempt failed. Review the quality gate output and fix the issues."
|
|
373
|
+
- **Attempt 3+:** Task + failure digest (last 50 lines or `failure-digest.sh` output) + "Focus on fixing the root cause."
|
|
374
|
+
|
|
375
|
+
**Research-backed best practices for retry prompts:**
|
|
376
|
+
|
|
377
|
+
1. **Signal-then-detail escalation is correct.** Attempt 2 signals failure without overwhelming context. Attempt 3 provides detail. This matches the recommended pattern.
|
|
378
|
+
|
|
379
|
+
2. **Failure digests should be structured, not raw logs.** Raw log tails include noise (progress bars, timestamps, irrelevant warnings). A digest that extracts: (a) the specific failure message, (b) the file and line number, and (c) the expected vs. actual output is more effective.
|
|
380
|
+
|
|
381
|
+
3. **"Reflect before retrying" prevents loops.** OpenHands' pattern: "reflect on 5-7 different possible sources of the problem. Assess the likelihood of each possible cause. Methodically address the most likely causes." This is the most important addition for retries.
|
|
382
|
+
|
|
383
|
+
4. **Prompt framing matters for self-correction.** One practitioner article reports that "ask yourself what went wrong" prompts lead to better self-correction than "be aware that you failed" prompts. The former triggers diagnostic reasoning; the latter triggers defensive behavior.
|
|
384
|
+
|
|
385
|
+
5. **Maximum retry limits prevent infinite loops.** The toolkit's `MAX_RETRIES` already implements this.
|
|
386
|
+
|
|
387
|
+
6. **Dynamic prompt adaptation improves over iterations.** Error context should not just be appended — it should reshape the instruction. Example: if the failure is a test failure, the retry prompt should say "Run the failing test first to reproduce the issue before attempting any fix."
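The dynamic adaptation in item 6 can be sketched as a small classifier over the failure text. The function name `retry_instruction_for` and the matched error strings are assumptions chosen for illustration (they follow common Python and pytest output), not something the toolkit defines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose a retry instruction based on the kind of failure
# seen in the digest. The patterns assume Python/pytest-style error text.
retry_instruction_for() {
    local digest="$1"
    case "$digest" in
        *ModuleNotFoundError*|*ImportError*)
            # Import problems are checked first because they often mask test results.
            echo "Check imports and installed dependencies before changing any logic." ;;
        *SyntaxError*)
            echo "Fix the syntax error reported in the digest, then re-run the affected test file." ;;
        *AssertionError*|*FAILED*)
            echo "Run the failing test first to reproduce the issue before attempting any fix." ;;
        *)
            echo "Read the quality gate output to identify what failed before making changes." ;;
    esac
}
```

The returned line would replace, rather than follow, the generic retry instruction, so the error context reshapes the prompt instead of only being appended to it.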
|
|
388
|
+
|
|
389
|
+
**Confidence:** Medium. The OpenHands reflection pattern is well-tested in production. The "ask yourself" vs. "be aware" framing difference is from a single Medium article but is consistent with known LLM behavior.
|
|
390
|
+
|
|
391
|
+
### Evidence
|
|
392
|
+
|
|
393
|
+
- [OpenHands system prompt](https://github.com/All-Hands-AI/OpenHands/blob/main/openhands/agenthub/codeact_agent/prompts/system_prompt.j2) — reflection on failure pattern
|
|
394
|
+
- [Augment Code prompting guide](https://www.augmentcode.com/blog/how-to-build-your-agent-11-prompting-techniques-for-better-ai-agents) — truncation strategy
|
|
395
|
+
- [Anthropic context engineering guide](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) — compaction and context management
|
|
396
|
+
|
|
397
|
+
### Implications for the Toolkit
|
|
398
|
+
|
|
399
|
+
The current retry escalation is structurally sound. Specific improvements:
|
|
400
|
+
|
|
401
|
+
1. **Attempt 2 prompt revision:**
|
|
402
|
+
```
|
|
403
|
+
Previous attempt failed quality gate. Before implementing again:
|
|
404
|
+
1. Read the quality gate output to understand what specifically failed
|
|
405
|
+
2. Identify 3 possible root causes
|
|
406
|
+
3. Address the most likely cause first
|
|
407
|
+
4. Run the failing test to verify your fix before proceeding to other tasks
|
|
408
|
+
```
|
|
409
|
+
|
|
410
|
+
2. **Attempt 3+ prompt revision:**
|
|
411
|
+
```
|
|
412
|
+
Previous attempts failed (${attempt_count} so far). Failure digest:
|
|
413
|
+
\`\`\`
|
|
414
|
+
${failure_digest}
|
|
415
|
+
\`\`\`
|
|
416
|
+
|
|
417
|
+
IMPORTANT: Do not repeat the same approach. Step back and consider:
|
|
418
|
+
- Is the test expectation wrong, or is the implementation wrong?
|
|
419
|
+
- Are there import errors, path errors, or dependency issues?
|
|
420
|
+
- Is there an assumption from the plan that doesn't match the actual codebase?
|
|
421
|
+
|
|
422
|
+
Fix the root cause. Run the specific failing test before running the full suite.
|
|
423
|
+
```
|
|
424
|
+
|
|
425
|
+
3. **The `failure-digest.sh` script should produce structured output** with sections: `FAILING_TEST`, `ERROR_MESSAGE`, `STACK_TRACE`, `EXPECTED_VS_ACTUAL`. Raw tail output is a fallback.
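A structured digest of this kind might look like the sketch below. It is an illustration under stated assumptions, not the actual `failure-digest.sh`: the function name `structured_digest` is hypothetical, the log is assumed to be pytest output (short-summary lines starting with `FAILED` and detail lines starting with `E`), and only the `FAILING_TEST` and `ERROR_MESSAGE` sections are produced.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a structured digest for pytest-style logs.
structured_digest() {
    local log="$1" failing errors
    # pytest's short summary prints lines such as "FAILED path::test - message".
    failing=$(grep -E '^FAILED ' "$log" | sed -E 's/^FAILED ([^ ]+).*/\1/' | sort -u)
    # pytest prefixes assertion and error detail lines with "E".
    errors=$(grep -E '^E +' "$log" | sed -E 's/^E +//' | head -n 10)
    if [[ -z "$failing" && -z "$errors" ]]; then
        # Nothing recognizable was found: fall back to the raw log tail.
        echo "RAW_TAIL:"
        tail -n 50 "$log"
        return 0
    fi
    echo "FAILING_TEST:"
    echo "${failing:-unknown}"
    echo "ERROR_MESSAGE:"
    echo "${errors:-unknown}"
}
```

Other test runners would need their own patterns. The raw tail remains the fallback so that an unrecognized log format never yields an empty digest.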
|
|
426
|
+
|
|
427
|
+
---
|
|
428
|
+
|
|
429
|
+
## Recommendations
|
|
430
|
+
|
|
431
|
+
Concrete changes to prompt assembly in `run-plan-prompt.sh`, ordered by expected impact:
|
|
432
|
+
|
|
433
|
+
### R1. Add Explicit Planning Instruction (High Impact, High Confidence)
|
|
434
|
+
|
|
435
|
+
**Evidence:** +4% SWE-bench (OpenAI), consistent with all top-performing agents.
|
|
436
|
+
|
|
437
|
+
Add after the task list in `build_batch_prompt()`:
|
|
438
|
+
|
|
439
|
+
```bash
|
|
440
|
+
cat <<'PLANNING'
|
|
441
|
+
|
|
442
|
+
Approach:
|
|
443
|
+
1. Read relevant files before modifying them — do not assume contents from the plan
|
|
444
|
+
2. For each task: write a failing test, confirm it fails, implement the fix, confirm it passes, commit
|
|
445
|
+
3. After all tasks: run the quality gate command
|
|
446
|
+
4. If the quality gate fails, fix issues before proceeding
|
|
447
|
+
PLANNING
|
|
448
|
+
```
|
|
449
|
+
|
|
450
|
+
### R2. Add Persistence Instruction (Medium Impact, High Confidence)
|
|
451
|
+
|
|
452
|
+
**Evidence:** OpenAI, Anthropic, and Augment all recommend this. Prevents early termination.
|
|
453
|
+
|
|
454
|
+
Add to the Requirements section:
|
|
455
|
+
|
|
456
|
+
```
|
|
457
|
+
- Complete ALL tasks in this batch. Do not stop early or report partial completion.
|
|
458
|
+
- If uncertain about implementation details, read the relevant files rather than guessing.
|
|
459
|
+
```
|
|
460
|
+
|
|
461
|
+
### R3. Add Reflection-on-Failure for Retries (Medium Impact, Medium Confidence)
|
|
462
|
+
|
|
463
|
+
**Evidence:** OpenHands production usage, consistent with self-correction research.
|
|
464
|
+
|
|
465
|
+
Modify the retry prompt escalation in `run_mode_headless()`:
|
|
466
|
+
|
|
467
|
+
```bash
|
|
468
|
+
if [[ $attempt -eq 2 ]]; then
|
|
469
|
+
full_prompt="$prompt
|
|
470
|
+
|
|
471
|
+
Previous attempt failed quality gate. Before re-implementing:
|
|
472
|
+
1. Read the quality gate output to understand what failed
|
|
473
|
+
2. Identify 3 possible root causes and address the most likely first
|
|
474
|
+
3. Run the failing test to verify your fix before proceeding"
fi
|
|
475
|
+
```
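For reference, the full three-stage escalation could be gathered into one function. This is a sketch under assumptions: `build_retry_prompt` is a hypothetical name, and the real `run_mode_headless()` may pass the attempt number and digest differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assemble the prompt for a given attempt number.
#   $1 = base task prompt, $2 = attempt number (1-based), $3 = failure digest (optional)
build_retry_prompt() {
    local prompt="$1" attempt="$2" digest="${3:-}"
    if (( attempt <= 1 )); then
        # Attempt 1: raw task prompt only.
        printf '%s\n' "$prompt"
    elif (( attempt == 2 )); then
        # Attempt 2: signal the failure and ask for reflection, without detail.
        printf '%s\n\n%s\n' "$prompt" "Previous attempt failed quality gate. Before re-implementing:
1. Read the quality gate output to understand what failed
2. Identify 3 possible root causes and address the most likely first
3. Run the failing test to verify your fix before proceeding"
    else
        # Attempt 3+: include the digest and the count of failed attempts so far.
        printf '%s\n\nPrevious attempts failed (%d so far). Failure digest:\n%s\n\n%s\n' \
            "$prompt" "$(( attempt - 1 ))" "$digest" \
            "Do not repeat the same approach. Fix the root cause. Run the specific failing test before running the full suite."
    fi
}
```

A caller would use it as `full_prompt=$(build_retry_prompt "$prompt" "$attempt" "$failure_digest")`, where `failure_digest` is whatever the digest step produced.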
|
|
476
|
+
|
|
477
|
+
### R4. Improve Context Ordering (Low-Medium Impact, High Confidence)
|
|
478
|
+
|
|
479
|
+
**Evidence:** "Lost in the Middle" (Liu et al., TACL 2024), confirmed by Augment Code's production experience.
|
|
480
|
+
|
|
481
|
+
Current ordering is already reasonable. Marginal improvements:
|
|
482
|
+
|
|
483
|
+
- In `run-plan-context.sh`, when truncating referenced files, use `head -50` AND `tail -20` instead of just `head -50`, joining them with `\n...(truncated)...\n`.
|
|
484
|
+
- Add one-line purpose annotations to context_refs headers (extract from first docstring or comment).
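The head-plus-tail truncation in the first bullet might be sketched as follows. `truncate_middle` is a hypothetical helper name, and the defaults of 50 and 20 lines simply mirror the numbers suggested above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep the first and last lines of a file and mark the gap.
#   $1 = file, $2 = lines to keep from the top (default 50),
#   $3 = lines to keep from the bottom (default 20)
truncate_middle() {
    local file="$1" head_n="${2:-50}" tail_n="${3:-20}" total
    total=$(wc -l < "$file")
    if (( total <= head_n + tail_n )); then
        # Short files are emitted whole so no line is duplicated.
        cat "$file"
    else
        head -n "$head_n" "$file"
        printf '...(truncated)...\n'
        tail -n "$tail_n" "$file"
    fi
}
```

Checking the line count first avoids repeating lines when a file is shorter than the combined head and tail windows.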
|
|
485
|
+
|
|
486
|
+
### R5. Revise Prompt Variant System (Low Impact, Medium Confidence)
|
|
487
|
+
|
|
488
|
+
**Evidence:** Top agents do not use short instruction suffixes. They vary workflow structure.
|
|
489
|
+
|
|
490
|
+
The current `get_prompt_variants()` appends suffixes like "check all imports before running tests." This is a weak signal. Replace with workflow-level variants:
|
|
491
|
+
|
|
492
|
+
```bash
|
|
493
|
+
type_variants[new-file]="Write all test files first, then implement all production files|Implement each task fully (test+code) before moving to the next"
|
|
494
|
+
type_variants[refactoring]="Read every file you plan to modify before making any changes|Run the full test suite after each individual modification"
|
|
495
|
+
type_variants[integration]="Trace one complete data path end-to-end before declaring done|Write an integration test first that exercises the full flow"
|
|
496
|
+
```
|
|
497
|
+
|
|
498
|
+
These are behavioral instructions, not reminders. They change the agent's workflow, not just its attention.
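One way such pipe-delimited variants might be consumed is sketched below. The selection logic is an assumption for illustration: `pick_variant` is a hypothetical function, the real `get_prompt_variants()` may choose variants differently, and associative arrays require Bash 4 or newer.

```bash
#!/usr/bin/env bash
# Requires Bash 4+ for associative arrays.
declare -A type_variants
type_variants[refactoring]="Read every file you plan to modify before making any changes|Run the full test suite after each individual modification"

# Hypothetical selector: print variant number $2 (0-based, wrapping around)
# for batch type $1, or return 1 if the type has no variants.
pick_variant() {
    local batch_type="$1" index="$2" raw
    local -a options
    raw="${type_variants[$batch_type]:-}"
    if [[ -z "$raw" ]]; then
        return 1
    fi
    # Split the stored string on "|" into an array of variants.
    IFS='|' read -r -a options <<< "$raw"
    printf '%s\n' "${options[index % ${#options[@]}]}"
}
```

Wrapping the index lets a caller pass a candidate number directly, for example one per competing agent, without knowing how many variants a batch type has.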
|
|
499
|
+
|
|
500
|
+
### R6. Remove Any Generic Role Prompt (No Impact, High Confidence)
|
|
501
|
+
|
|
502
|
+
**Evidence:** Multiple studies show generic personas have no or negative effect on frontier models.
|
|
503
|
+
|
|
504
|
+
The current prompt does not have a generic role prompt — it uses a behavioral specification ("You are implementing Batch N"). **Do not add one.** This is a non-action recommendation to prevent future regression.
|
|
505
|
+
|
|
506
|
+
### R7. Consider Repo Map for Structural Context (Speculative, Low Confidence)
|
|
507
|
+
|
|
508
|
+
**Evidence:** Aider's repo map approach. No controlled benchmark comparison.
|
|
509
|
+
|
|
510
|
+
For batches that modify many files, a lightweight file tree with function signatures (like Aider's repo map) could provide better structural awareness than injecting full file contents. This is a larger implementation effort and should be validated experimentally.
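A rough approximation of such a map can be built with standard tools. The sketch below is not Aider's algorithm (Aider's repo map is considerably more sophisticated); `repo_map` is a hypothetical function that assumes a Python worktree and only captures `def` and `class` lines that fit on a single line.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: list each Python file under a root directory, followed by
# its indented `def` and `class` lines with the trailing colon removed.
repo_map() {
    local root="${1:-.}" file
    find "$root" -type f -name '*.py' ! -path '*/.git/*' | LC_ALL=C sort |
    while IFS= read -r file; do
        # Print the path relative to the root.
        printf '%s\n' "${file#"$root"/}"
        # Keep the source indentation so methods appear nested under their class.
        grep -E '^[[:space:]]*(def|class) ' "$file" \
            | sed -E 's/:[[:space:]]*$//; s/^/    /' || true
    done
}
```

Calling `repo_map "$worktree"` (with `worktree` standing for the batch's working directory) would produce a compact outline that could be injected in place of full file contents. Whether it improves results would still need the experimental validation noted above.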
|
|
511
|
+
|
|
512
|
+
### R8. Align with Anthropic's Claude 4.6 Guidance (Medium Impact, High Confidence)
|
|
513
|
+
|
|
514
|
+
**Evidence:** Official Anthropic documentation for the model the toolkit actually uses.
|
|
515
|
+
|
|
516
|
+
Key Claude 4.6-specific adjustments:
|
|
517
|
+
|
|
518
|
+
1. **Remove any anti-laziness language.** Claude 4.6 is already proactive; "be thorough" or "do not be lazy" causes over-exploration. The current prompt does not have this, but the prompt variants should not add it.
|
|
519
|
+
|
|
520
|
+
2. **Do not add "think step by step."** Anthropic specifically says "Remove explicit think tool instructions" for Claude 4.6 — the model thinks effectively without being told to.
|
|
521
|
+
|
|
522
|
+
3. **Add context window awareness:** If context compaction is available, add: "Your context window will be compacted as it approaches its limit. Save progress to progress.txt before reaching the limit."
|
|
523
|
+
|
|
524
|
+
4. **Use the effort parameter** rather than prompt-based reasoning control when invoking `claude -p`.
|
|
525
|
+
|
|
526
|
+
---
|
|
527
|
+
|
|
528
|
+
## Sources
|
|
529
|
+
|
|
530
|
+
### Academic Papers
|
|
531
|
+
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024 — [arXiv 2307.03172](https://arxiv.org/abs/2307.03172)
|
|
532
|
+
- Li et al., "Structured Chain-of-Thought Prompting for Code Generation," ACM TOSEM 2024 — [arXiv 2305.06599](https://arxiv.org/abs/2305.06599)
|
|
533
|
+
- Yeo et al., "Chain of Grounded Objectives: Concise Goal-Oriented Prompting for Code Generation," ECOOP 2025 — [arXiv 2501.13978](https://arxiv.org/abs/2501.13978)
|
|
534
|
+
- Xu et al., "Does Few-Shot Learning Help LLM Performance in Code Synthesis?" 2024 — [arXiv 2412.02906](https://arxiv.org/abs/2412.02906)
|
|
535
|
+
- Yang et al., "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering," 2024 — [arXiv 2405.15793](https://arxiv.org/abs/2405.15793)
|
|
536
|
+
- Wang et al., "OpenHands: An Open Platform for AI Software Developers as Generalist Agents," 2024 — [arXiv 2407.16741](https://arxiv.org/abs/2407.16741)
|
|
537
|
+
- Zheng et al., "When 'A Helpful Assistant' Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models" — [arXiv 2311.10054](https://arxiv.org/html/2311.10054v3)
|
|
538
|
+
- Zhang et al., "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding," NeurIPS 2024 — [arXiv 2403.04797](https://arxiv.org/abs/2403.04797)
|
|
539
|
+
- Raimondi, "Exploiting Primacy Effect to Improve Large Language Models," RANLP 2025 — [arXiv 2507.13949](https://arxiv.org/abs/2507.13949)
|
|
540
|
+
|
|
541
|
+
### Vendor Documentation
|
|
542
|
+
- [Anthropic Claude 4 Best Practices](https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices)
|
|
543
|
+
- [Anthropic: Effective Context Engineering for AI Agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
|
|
544
|
+
- [OpenAI GPT-4.1 Prompting Guide](https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide)
|
|
545
|
+
|
|
546
|
+
### Open-Source Agent Codebases
|
|
547
|
+
- [Aider prompts.py](https://github.com/Aider-AI/aider/blob/main/aider/prompts.py)
|
|
548
|
+
- [Aider editblock_prompts.py](https://github.com/Aider-AI/aider/blob/main/aider/coders/editblock_prompts.py)
|
|
549
|
+
- [OpenHands system_prompt.j2](https://github.com/All-Hands-AI/OpenHands/blob/main/openhands/agenthub/codeact_agent/prompts/system_prompt.j2)
|
|
550
|
+
- [SWE-agent ACI documentation](https://github.com/SWE-agent/SWE-agent/blob/main/docs/background/aci.md)
|
|
551
|
+
- [SWE-agent templates documentation](https://swe-agent.com/latest/config/templates/)
|
|
552
|
+
|
|
553
|
+
### Industry Analysis
|
|
554
|
+
- [Augment Code: 11 Prompting Techniques for Better AI Agents](https://www.augmentcode.com/blog/how-to-build-your-agent-11-prompting-techniques-for-better-ai-agents)
|
|
555
|
+
- [SWE-bench Verified Leaderboard](https://www.vals.ai/benchmarks/swebench)
|
|
556
|
+
- [SWE-rebench Leaderboard](https://swe-rebench.com)
|
|
557
|
+
- [Verdent SWE-bench Technical Report](https://www.verdent.ai/blog/swe-bench-verified-technical-report)
|
|
558
|
+
- [Modal: Open-source AI Agents](https://modal.com/blog/open-ai-agents)
|