npm - autonomous-coding-toolkit - Versions diffs - 1.0.0 - Mend

autonomous-coding-toolkit 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (324) hide show

package/.claude-plugin/marketplace.json +22 -0
package/.claude-plugin/plugin.json +13 -0
package/LICENSE +21 -0
package/Makefile +21 -0
package/README.md +140 -0
package/SECURITY.md +28 -0
package/agents/bash-expert.md +113 -0
package/agents/dependency-auditor.md +138 -0
package/agents/integration-tester.md +120 -0
package/agents/lesson-scanner.md +149 -0
package/agents/python-expert.md +179 -0
package/agents/service-monitor.md +141 -0
package/agents/shell-expert.md +147 -0
package/benchmarks/runner.sh +147 -0
package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
package/benchmarks/tasks/02-refactor-module/task.md +8 -0
package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
package/bin/act.js +238 -0
package/commands/autocode.md +6 -0
package/commands/cancel-ralph.md +18 -0
package/commands/code-factory.md +53 -0
package/commands/create-prd.md +55 -0
package/commands/ralph-loop.md +18 -0
package/commands/run-plan.md +117 -0
package/commands/submit-lesson.md +122 -0
package/docs/ARCHITECTURE.md +630 -0
package/docs/CONTRIBUTING.md +125 -0
package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
package/docs/lessons/0002-async-def-without-await.md +28 -0
package/docs/lessons/0003-create-task-without-callback.md +28 -0
package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
package/docs/lessons/0005-sqlite-without-closing.md +33 -0
package/docs/lessons/0006-venv-pip-path.md +27 -0
package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
package/docs/lessons/0010-local-outside-function-bash.md +33 -0
package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
package/docs/lessons/0020-persist-state-incrementally.md +44 -0
package/docs/lessons/0021-dual-axis-testing.md +48 -0
package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
package/docs/lessons/0023-static-analysis-spiral.md +51 -0
package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
package/docs/lessons/0045-iterative-design-improvement.md +33 -0
package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
package/docs/lessons/0048-integration-wiring-batch.md +40 -0
package/docs/lessons/0049-ab-verification.md +41 -0
package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
package/docs/lessons/0078-static-review-without-live-test.md +30 -0
package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
package/docs/lessons/FRAMEWORK.md +161 -0
package/docs/lessons/SUMMARY.md +201 -0
package/docs/lessons/TEMPLATE.md +85 -0
package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
package/docs/plans/2026-02-21-mab-research-report.md +406 -0
package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
package/docs/plans/2026-02-22-mab-run-design.md +462 -0
package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
package/docs/plans/2026-02-24-headless-module-split.md +443 -0
package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
package/docs/plans/audit-findings.md +186 -0
package/docs/telegram-notification-format.md +98 -0
package/examples/example-plan.md +51 -0
package/examples/example-prd.json +72 -0
package/examples/example-roadmap.md +33 -0
package/examples/quickstart-plan.md +63 -0
package/hooks/hooks.json +26 -0
package/hooks/setup-symlinks.sh +48 -0
package/hooks/stop-hook.sh +135 -0
package/package.json +47 -0
package/policies/bash.md +71 -0
package/policies/python.md +71 -0
package/policies/testing.md +61 -0
package/policies/universal.md +60 -0
package/scripts/analyze-report.sh +97 -0
package/scripts/architecture-map.sh +145 -0
package/scripts/auto-compound.sh +273 -0
package/scripts/batch-audit.sh +42 -0
package/scripts/batch-test.sh +101 -0
package/scripts/entropy-audit.sh +221 -0
package/scripts/failure-digest.sh +51 -0
package/scripts/generate-ast-rules.sh +96 -0
package/scripts/init.sh +112 -0
package/scripts/lesson-check.sh +428 -0
package/scripts/lib/common.sh +61 -0
package/scripts/lib/cost-tracking.sh +153 -0
package/scripts/lib/ollama.sh +60 -0
package/scripts/lib/progress-writer.sh +128 -0
package/scripts/lib/run-plan-context.sh +215 -0
package/scripts/lib/run-plan-echo-back.sh +231 -0
package/scripts/lib/run-plan-headless.sh +396 -0
package/scripts/lib/run-plan-notify.sh +57 -0
package/scripts/lib/run-plan-parser.sh +81 -0
package/scripts/lib/run-plan-prompt.sh +215 -0
package/scripts/lib/run-plan-quality-gate.sh +132 -0
package/scripts/lib/run-plan-routing.sh +315 -0
package/scripts/lib/run-plan-sampling.sh +170 -0
package/scripts/lib/run-plan-scoring.sh +146 -0
package/scripts/lib/run-plan-state.sh +142 -0
package/scripts/lib/run-plan-team.sh +199 -0
package/scripts/lib/telegram.sh +54 -0
package/scripts/lib/thompson-sampling.sh +176 -0
package/scripts/license-check.sh +74 -0
package/scripts/mab-run.sh +575 -0
package/scripts/module-size-check.sh +146 -0
package/scripts/patterns/async-no-await.yml +5 -0
package/scripts/patterns/bare-except.yml +6 -0
package/scripts/patterns/empty-catch.yml +6 -0
package/scripts/patterns/hardcoded-localhost.yml +9 -0
package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
package/scripts/pipeline-status.sh +197 -0
package/scripts/policy-check.sh +226 -0
package/scripts/prior-art-search.sh +133 -0
package/scripts/promote-mab-lessons.sh +126 -0
package/scripts/prompts/agent-a-superpowers.md +29 -0
package/scripts/prompts/agent-b-ralph.md +29 -0
package/scripts/prompts/judge-agent.md +61 -0
package/scripts/prompts/planner-agent.md +44 -0
package/scripts/pull-community-lessons.sh +90 -0
package/scripts/quality-gate.sh +266 -0
package/scripts/research-gate.sh +90 -0
package/scripts/run-plan.sh +329 -0
package/scripts/scope-infer.sh +159 -0
package/scripts/setup-ralph-loop.sh +155 -0
package/scripts/telemetry.sh +230 -0
package/scripts/tests/run-all-tests.sh +52 -0
package/scripts/tests/test-act-cli.sh +46 -0
package/scripts/tests/test-agents-md.sh +87 -0
package/scripts/tests/test-analyze-report.sh +114 -0
package/scripts/tests/test-architecture-map.sh +89 -0
package/scripts/tests/test-auto-compound.sh +169 -0
package/scripts/tests/test-batch-test.sh +65 -0
package/scripts/tests/test-benchmark-runner.sh +25 -0
package/scripts/tests/test-common.sh +168 -0
package/scripts/tests/test-cost-tracking.sh +158 -0
package/scripts/tests/test-echo-back.sh +180 -0
package/scripts/tests/test-entropy-audit.sh +146 -0
package/scripts/tests/test-failure-digest.sh +66 -0
package/scripts/tests/test-generate-ast-rules.sh +145 -0
package/scripts/tests/test-helpers.sh +82 -0
package/scripts/tests/test-init.sh +47 -0
package/scripts/tests/test-lesson-check.sh +278 -0
package/scripts/tests/test-lesson-local.sh +55 -0
package/scripts/tests/test-license-check.sh +109 -0
package/scripts/tests/test-mab-run.sh +182 -0
package/scripts/tests/test-ollama-lib.sh +49 -0
package/scripts/tests/test-ollama.sh +60 -0
package/scripts/tests/test-pipeline-status.sh +198 -0
package/scripts/tests/test-policy-check.sh +124 -0
package/scripts/tests/test-prior-art-search.sh +96 -0
package/scripts/tests/test-progress-writer.sh +140 -0
package/scripts/tests/test-promote-mab-lessons.sh +110 -0
package/scripts/tests/test-pull-community-lessons.sh +149 -0
package/scripts/tests/test-quality-gate.sh +241 -0
package/scripts/tests/test-research-gate.sh +132 -0
package/scripts/tests/test-run-plan-cli.sh +86 -0
package/scripts/tests/test-run-plan-context.sh +305 -0
package/scripts/tests/test-run-plan-e2e.sh +153 -0
package/scripts/tests/test-run-plan-headless.sh +424 -0
package/scripts/tests/test-run-plan-notify.sh +124 -0
package/scripts/tests/test-run-plan-parser.sh +217 -0
package/scripts/tests/test-run-plan-prompt.sh +254 -0
package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
package/scripts/tests/test-run-plan-routing.sh +178 -0
package/scripts/tests/test-run-plan-scoring.sh +148 -0
package/scripts/tests/test-run-plan-state.sh +261 -0
package/scripts/tests/test-run-plan-team.sh +157 -0
package/scripts/tests/test-scope-infer.sh +150 -0
package/scripts/tests/test-setup-ralph-loop.sh +63 -0
package/scripts/tests/test-telegram-env.sh +38 -0
package/scripts/tests/test-telegram.sh +121 -0
package/scripts/tests/test-telemetry.sh +46 -0
package/scripts/tests/test-thompson-sampling.sh +139 -0
package/scripts/tests/test-validate-all.sh +60 -0
package/scripts/tests/test-validate-commands.sh +89 -0
package/scripts/tests/test-validate-hooks.sh +98 -0
package/scripts/tests/test-validate-lessons.sh +150 -0
package/scripts/tests/test-validate-plan-quality.sh +235 -0
package/scripts/tests/test-validate-plans.sh +187 -0
package/scripts/tests/test-validate-plugin.sh +106 -0
package/scripts/tests/test-validate-prd.sh +184 -0
package/scripts/tests/test-validate-skills.sh +134 -0
package/scripts/validate-all.sh +57 -0
package/scripts/validate-commands.sh +67 -0
package/scripts/validate-hooks.sh +89 -0
package/scripts/validate-lessons.sh +98 -0
package/scripts/validate-plan-quality.sh +369 -0
package/scripts/validate-plans.sh +120 -0
package/scripts/validate-plugin.sh +86 -0
package/scripts/validate-policies.sh +42 -0
package/scripts/validate-prd.sh +118 -0
package/scripts/validate-skills.sh +96 -0
package/skills/autocode/SKILL.md +285 -0
package/skills/autocode/ab-verification.md +51 -0
package/skills/autocode/code-quality-standards.md +37 -0
package/skills/autocode/competitive-mode.md +364 -0
package/skills/brainstorming/SKILL.md +97 -0
package/skills/capture-lesson/SKILL.md +187 -0
package/skills/check-lessons/SKILL.md +116 -0
package/skills/dispatching-parallel-agents/SKILL.md +110 -0
package/skills/executing-plans/SKILL.md +85 -0
package/skills/finishing-a-development-branch/SKILL.md +201 -0
package/skills/receiving-code-review/SKILL.md +72 -0
package/skills/requesting-code-review/SKILL.md +59 -0
package/skills/requesting-code-review/code-reviewer.md +82 -0
package/skills/research/SKILL.md +145 -0
package/skills/roadmap/SKILL.md +115 -0
package/skills/subagent-driven-development/SKILL.md +98 -0
package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
package/skills/subagent-driven-development/implementer-prompt.md +73 -0
package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
package/skills/systematic-debugging/SKILL.md +134 -0
package/skills/systematic-debugging/condition-based-waiting.md +64 -0
package/skills/systematic-debugging/defense-in-depth.md +32 -0
package/skills/systematic-debugging/root-cause-tracing.md +55 -0
package/skills/test-driven-development/SKILL.md +167 -0
package/skills/using-git-worktrees/SKILL.md +219 -0
package/skills/using-superpowers/SKILL.md +54 -0
package/skills/verification-before-completion/SKILL.md +140 -0
package/skills/verify/SKILL.md +82 -0
package/skills/writing-plans/SKILL.md +128 -0
package/skills/writing-skills/SKILL.md +93 -0

package/docs/plans/2026-02-22-research-competitive-landscape.md ADDED Viewed

@@ -0,0 +1,687 @@
+# Research: Competitive Landscape -- Autonomous Coding Tools
+> **Date:** 2026-02-22
+> **Confidence:** High on architecture comparisons, Medium on exact pricing (changes frequently), Medium on SWE-bench scores (leaderboard updates weekly)
+> **Method:** Web research across official docs, blog posts, GitHub repos, benchmark leaderboards, and developer review aggregators
+## Executive Summary
+The autonomous coding tool market in early 2026 has converged on several validated patterns: fresh-context execution, sandbox isolation, multi-file editing, and test-driven iteration loops. Every major tool now implements some form of agentic coding where the AI plans, executes, tests, and iterates autonomously.
+**This toolkit's unique position:** It is the only tool that operates as a *meta-layer on top of Claude Code* rather than a standalone product. It does not compete with Claude Code, Cursor, or Devin -- it makes Claude Code better at sustained autonomous work through fresh-context batch execution, quality gates between every unit of work, a compounding lesson system, and machine-verifiable acceptance criteria. No competitor offers this combination.
+**Key findings:**
+- **Convergent:** All tools now do plan-then-execute, sandbox isolation, and auto-test-on-failure. These are validated patterns.
+- **Divergent:** No competitor implements test-count monotonicity enforcement, cross-context lesson compounding, batch-type-aware prompt selection, or competitive dual-track execution.
+- **Gap:** The toolkit lacks a visual IDE, cloud-hosted execution, built-in web browsing, and multi-language legacy migration. These are deliberate scope exclusions, not oversights.
+- **Pricing advantage:** The toolkit is free and open-source. Users pay only for Claude Code API usage. All competitors with comparable autonomy charge $20-500/month on top of model costs.
+---
+## 1. Competitor Profiles
+### 1.1 Devin (Cognition)
+**What it is:** A fully autonomous AI software engineer -- the most "agent-like" product on the market. Runs in a cloud-hosted VM with its own browser, terminal, and editor.
+**Architecture:**
+- Compound AI system: swarm of specialized models with a Planner (high-reasoning model) orchestrating strategy
+- Custom inference stack with Cerebras hardware for fast iteration
+- Cloud-hosted sandbox: each task gets a full VM (browser, terminal, code editor, shell)
+- Persistent Devin sessions -- can be assigned tasks via Slack, GitHub issues, or web UI
+**Execution model:**
+- Fully asynchronous: assign a task, Devin works on it, opens a PR when done
+- Self-healing: reads error logs, iterates on code, fixes autonomously
+- Can browse the web, read documentation, install packages
+- Collaborative PRs with human code review response
+**Context management:**
+- Cloud VM means effectively unlimited "working memory" (files on disk)
+- Model context managed by the Planner which breaks work into sub-tasks
+- No public documentation on context window management strategy
+**Quality gates:**
+- Self-healing on test/compilation failure
+- No documented equivalent to test-count monotonicity or lesson-based anti-pattern scanning
+- PR-based review workflow (human reviews Devin's PRs)
+**Pricing:**
+- Core: $20/month minimum ($2.25 per Agent Compute Unit)
+- Previously $500/month (Devin 1.0)
+- Enterprise: custom pricing
+- Goldman Sachs piloting with 12,000 developers
+**SWE-bench:** 67% on SWE-bench Verified (Devin 2.0, up from mid-30s in 2024)
+**Strengths vs. this toolkit:** Full VM with browser (can research docs), cloud-hosted (no local compute), Slack/GitHub integration for task assignment, legacy code migration capabilities.
+**Weaknesses vs. this toolkit:** Opaque quality gates (no user control over what checks run between tasks), no lesson compounding system, no fresh-context-per-batch architecture (single long session), no test-count regression detection, proprietary (can't inspect or modify the pipeline).
+---
+### 1.2 SWE-agent (Princeton/Stanford)
+**What it is:** An academic agent framework for solving GitHub issues. The research project that defined the SWE-bench evaluation methodology. Open-source.
+**Architecture:**
+- Python framework wrapping any LLM with custom shell commands
+- LM-centric interface design: simplified commands (open, edit, scroll, search) instead of raw bash
+- Agent-Computer Interface (ACI) concept: the interface design matters as much as the model
+- Minimal scaffolding -- the "mini-SWE-agent" achieves 74% on SWE-bench Verified in 100 lines of Python
+**Execution model:**
+- Single-issue, single-shot: takes a GitHub issue, produces a patch
+- No batch execution, no multi-step pipeline
+- No persistent state between runs
+- Turn-based: agent observes environment, takes action, observes result, repeats
+**Context management:**
+- Entire conversation history in context window
+- Custom commands reduce token usage vs. raw bash output
+- No explicit context window management or fresh-context reset
+**Quality gates:**
+- Patch is applied and tests run
+- No intermediate quality gates (single-shot execution)
+- Success = tests pass after patch application
+**Pricing:** Free and open-source. Users pay for LLM API calls.
+**SWE-bench:** SWE-agent 1.0 + Claude 3.7 Sonnet was state-of-the-art on both SWE-bench Full and SWE-bench Verified. Mini-SWE-agent achieves >74% on SWE-bench Verified.
+**Strengths vs. this toolkit:** Academic rigor, benchmark-optimized, minimal codebase (100 lines for mini version), model-agnostic, well-studied ACI design principles.
+**Weaknesses vs. this toolkit:** Single-issue scope (no multi-batch feature implementation), no quality gates between steps, no lesson system, no progress persistence, no resumability, no plan format, not designed for sustained development work.
+---
+### 1.3 OpenHands (formerly OpenDevin)
+**What it is:** An open-source, model-agnostic platform for cloud coding agents. The most full-featured open-source competitor.
+**Architecture:**
+- Event-sourced state model with deterministic replay
+- Modular SDK (V1): agent, tool, and workspace packages
+- Multiple agent types: CodeActAgent (generalist), BrowserAgent (web navigation), Micro-agents
+- Hierarchical agent delegation (agents can spawn sub-agents)
+- Built-in REST/WebSocket server for remote execution
+- Interactive workspaces: browser-based VSCode, VNC desktop, persistent Chromium
+**Execution model:**
+- Perception-action loop: agents observe and act in a Docker sandbox
+- Multi-agent: supports agent delegation and hierarchical task decomposition
+- Containerized execution: Docker-based sandboxing
+- Can run locally or remotely (cloud deployment supported)
+**Context management:**
+- Event-stream abstraction captures all actions and observations
+- Deterministic replay enables debugging and analysis
+- MCP integration for tool extensibility
+- No documented fresh-context-per-batch strategy
+**Quality gates:**
+- Tests run in sandbox
+- No documented equivalent to lesson-check, test-count monotonicity, or anti-pattern scanning
+**Pricing:** Free and open-source. OpenHands Cloud (hosted version) pricing not publicly documented.
+**SWE-bench:** 60.6% on SWE-bench Verified (November 2025). First on Multi-SWE-Bench (8 programming languages). 43.20% on SWE-bench Verified with Claude 3.7 Sonnet.
+**Strengths vs. this toolkit:** Model-agnostic (works with any LLM), visual web IDE, Docker sandbox isolation, multi-agent hierarchy, browser agent for web interaction, large community (188+ contributors).
+**Weaknesses vs. this toolkit:** No batch-based fresh-context execution, no quality gate pipeline, no lesson compounding, no test-count monotonicity, no plan format with batch structure, no competitive dual-track mode, heavier infrastructure requirements (Docker).
+---
+### 1.4 Aider
+**What it is:** AI pair programming in the terminal. The most mature terminal-based coding assistant. Focuses on interactive pair programming rather than full autonomy.
+**Architecture:**
+- Terminal-based chat interface with git integration
+- Repository map: extracts function signatures and file structures for codebase awareness
+- Architect/Editor split: Architect model describes the approach, Editor model makes file edits
+- Supports 100+ programming languages
+- Works with almost any LLM (Claude, GPT, DeepSeek, local models via Ollama)
+**Execution model:**
+- Interactive chat: user describes changes, Aider makes them
+- Automatic git commits for every change (clean history)
+- Auto-runs linters and tests, can self-fix detected problems
+- Voice input supported
+- Chat modes: code, architect, ask, help
+**Context management:**
+- Repository map gives codebase-wide awareness without loading everything
+- User controls which files are "in chat" (explicit context management)
+- No automatic context compaction or fresh-context-per-task
+**Quality gates:**
+- Auto-lint and auto-test after changes
+- Can self-fix detected problems
+- No formal quality gate pipeline between units of work
+- No test-count monotonicity or anti-pattern scanning
+**Pricing:** Free and open-source. Users pay for LLM API calls only. Typical cost: $3-5/hour depending on model and context size.
+**SWE-bench:** Aider's own Polyglot benchmark: Claude Opus 4.5 scores 89.4%, GPT-5 scores 88%.
+**Strengths vs. this toolkit:** Interactive pair-programming UX, voice input, repository map for efficient context, model-agnostic, mature git integration with clean commit history, lower barrier to entry (simpler mental model).
+**Weaknesses vs. this toolkit:** Not designed for unattended autonomous execution, no batch pipeline, no quality gates between tasks, no lesson system, no plan-based execution, no resumability, no competitive dual-track, no headless mode. Fundamentally a pair programmer, not an autonomous agent.
+---
+### 1.5 Cursor Agent Mode
+**What it is:** The dominant AI-native IDE. Agent mode enables autonomous multi-file editing, terminal command execution, and iterative problem-solving within the IDE.
+**Architecture:**
+- VS Code fork with deep AI integration
+- Agent mode: autonomously runs terminal commands, analyzes errors, proposes fixes
+- Background Agents (v0.50+): run tasks independently while developer works on other things
+- Composer: Cursor's own coding model for fast completions
+- BugBot: automated PR code review
+**Execution model:**
+- IDE-integrated: agent works within the editor context
+- Multi-file coherent edits understanding dependencies
+- Background agents for parallel task execution
+- Can install dependencies, run tests, fix compilation errors autonomously
+**Context management:**
+- Memories: persistent project knowledge across sessions
+- Codebase indexing for context-aware suggestions
+- No documented fresh-context-per-task or batch execution strategy
+**Quality gates:**
+- Auto-fix on compilation/test failure
+- BugBot for PR review
+- No formal inter-batch quality gate pipeline
+**Pricing:**
+- Free: 2,000 completions/month
+- Pro: $20/month (unlimited completions, unlimited slow requests)
+- Ultra: $200/month (20x usage)
+- Business: $40/user/month
+**Strengths vs. this toolkit:** Visual IDE with full editor experience, Background Agents for parallel work, BugBot for automated PR review, Memories for persistent knowledge, massive user base and ecosystem, Visual Editor for drag-and-drop UI work.
+**Weaknesses vs. this toolkit:** No formal batch pipeline, no quality gates between execution units, no lesson compounding, no test-count monotonicity, no headless/unattended mode, no plan-based execution, no resumability from saved state, proprietary and closed-source.
+---
+### 1.6 GitHub Copilot Coding Agent
+**What it is:** GitHub's asynchronous coding agent, embedded in GitHub and VS Code. Assigns tasks via issues, works autonomously in GitHub Actions, opens PRs.
+**Architecture:**
+- Runs in GitHub Actions-powered environment
+- Multi-model support (Claude, GPT, Gemini)
+- Asynchronous: works on tasks while developer does other things
+- Deeply integrated with GitHub (issues, PRs, code review)
+**Execution model:**
+- Assign via GitHub issue or Copilot Chat
+- Agent works in a GitHub Actions runner (cloud compute)
+- Creates PRs with results
+- Iterates on CI failures
+- Best for low-to-medium complexity tasks in well-tested codebases
+**Context management:**
+- Repository-level context from GitHub
+- No documented fresh-context-per-batch strategy
+- Relies on GitHub Actions environment for each task
+**Quality gates:**
+- CI pipeline integration (runs existing project CI)
+- Self-healing on CI failures
+- No custom quality gate injection beyond existing CI
+**Pricing:** Included with paid Copilot subscriptions ($10-39/month depending on plan). Copilot Workspace technical preview ended May 2025; features merged into Copilot coding agent.
+**Strengths vs. this toolkit:** Deep GitHub integration (issues -> agent -> PR), cloud-hosted execution (no local compute), CI pipeline integration, multi-model support, massive distribution (every GitHub user).
+**Weaknesses vs. this toolkit:** Limited to GitHub Actions environment, no custom quality gate pipeline, no lesson system, no batch-based fresh-context execution, no test-count monotonicity, limited to low-medium complexity tasks, no competitive dual-track mode.
+---
+### 1.7 Amazon Q Developer Agent
+**What it is:** AWS's autonomous coding agent, integrated into the AWS ecosystem. Strong focus on Java/.NET legacy migration.
+**Architecture:**
+- Cloud-hosted agent in AWS ecosystem
+- Deep AWS service integration (CodeWhisperer, CodeGuru, etc.)
+- Specialized transformation agents for language migration
+**Execution model:**
+- Natural language feature description -> multi-file implementation plan -> code changes + tests
+- Java 8->21, .NET upgrades automated
+- Agent analyzes existing codebase, maps implementation plan, executes changes
+**Context management:**
+- Codebase analysis for context
+- No documented fresh-context strategy
+**Quality gates:**
+- Runs tests as part of implementation
+- Human approval before applying changes
+**Pricing:**
+- Free tier: limited features with monthly caps
+- Pro: $19/user/month (1,000 agentic requests/month, 4,000 LOC/month for transformations)
+**SWE-bench:** 66% on SWE-bench Verified (April 2025 agent update), 49% on SWT-Bench.
+**Strengths vs. this toolkit:** Java/.NET legacy migration (specialized capability), AWS ecosystem integration, enterprise compliance, cloud-hosted.
+**Weaknesses vs. this toolkit:** AWS-centric (limited outside AWS), no batch pipeline, no lesson system, no fresh-context-per-batch, no quality gate customization, no competitive dual-track, usage caps on agentic requests.
+---
+### 1.8 Windsurf (Codeium)
+**What it is:** An AI-native IDE (VS Code fork) with the Cascade agent for multi-step autonomous coding.
+**Architecture:**
+- VS Code fork, re-engineered for AI-first workflows
+- Cascade: agentic AI that understands codebase, suggests multi-file edits, runs terminal commands
+- Indexing Engine: codebase-wide awareness (not just open files)
+- Workflows: custom automation pipelines
+**Execution model:**
+- Cascade handles multi-step coding tasks
+- Tab/Supercomplete for fast inline completions
+- Agent can run terminal commands, analyze errors, iterate
+- Supports 70+ languages
+**Context management:**
+- Indexing Engine provides codebase-wide context retrieval
+- Not limited to recently interacted files
+**Quality gates:**
+- Auto-fix on errors
+- No documented formal quality gate pipeline
+**Pricing:**
+- Free: 25 prompt credits/month
+- Pro: $15/month (500 prompt credits)
+- Teams: $30/user/month
+- Enterprise: $60/user/month
+**Strengths vs. this toolkit:** Visual IDE, codebase indexing engine, SOC 2 Type II compliance, self-hosted deployment option, enterprise admin controls.
+**Weaknesses vs. this toolkit:** No batch pipeline, no lesson system, no fresh-context-per-batch, no test-count monotonicity, no headless mode, no plan-based execution, credit-based pricing limits autonomous work.
+---
+### 1.9 Codex CLI (OpenAI)
+**What it is:** OpenAI's open-source terminal-based coding agent. The closest architectural peer to Claude Code.
+**Architecture:**
+- Open-source, written in Rust
+- Local terminal execution with OS-enforced sandboxing
+- macOS: Apple Seatbelt (sandbox-exec); Linux: Docker containers or bubblewrap (experimental)
+- Three approval modes: Suggest, Auto Edit, Full Auto
+- Network access configurable per sandbox
+**Execution model:**
+- Local terminal execution (like Claude Code)
+- Sandbox controls what agent can read/write/execute
+- Approval policy controls when human confirmation is needed
+- `--full-auto` mode for autonomous execution
+- Also has cloud Codex app (macOS, launched Feb 2026) for managing multiple agents
+**Context management:**
+- Conversation-based context in terminal mode
+- Cloud app manages parallel agent workflows
+- No documented fresh-context-per-batch strategy
+**Quality gates:**
+- Sandbox-enforced safety boundaries
+- No formal quality gate pipeline between tasks
+**Pricing:** Free and open-source CLI. Uses OpenAI API (GPT-5, o3, o4-mini). Cloud Codex app included with ChatGPT Plus/Pro subscriptions.
+**SWE-bench:** GPT-5.2 scores 75.40% on SWE-bench Verified.
+**Strengths vs. this toolkit:** OS-enforced sandboxing (Seatbelt/Docker/bubblewrap), Rust-based performance, multi-agent orchestration in cloud app, approval mode granularity.
+**Weaknesses vs. this toolkit:** No batch pipeline, no quality gate system, no lesson compounding, no test-count monotonicity, no plan-based execution, no resumability, tied to OpenAI models (no Claude support).
+---
+### 1.10 Claude Code (Vanilla, No Toolkit)
+**What it is:** Anthropic's agentic coding CLI. The foundation this toolkit builds on.
+**Architecture:**
+- Terminal-based agent: reads files, writes changes, runs shell commands, manages git
+- "Dumb loop" runtime with all intelligence in the model
+- Subagent spawning for parallel/delegated work
+- Checkpoint system: automatic state saves before each change, rewindable
+- Hooks: trigger actions at specific points (test after change, lint before commit)
+- Background tasks for long-running processes
+- 6 layers of memory loaded at session start
+**Execution model:**
+- Interactive session with tool calls
+- `claude -p` for headless/piped mode
+- Subagents via Task tool for parallel work
+- Extended thinking with controllable depth
+- Permission sandboxing (84% reduction in prompts)
+**Context management:**
+- Auto-compaction when context window fills
+- CLAUDE.md files for project memory
+- Semantic search for relevant context
+- Prompt caching (90% savings on repeated context)
+**Quality gates:**
+- Hooks system (pre/post tool execution)
+- No built-in quality gate pipeline between tasks
+- No test-count monotonicity
+- No anti-pattern scanning
+**Pricing:**
+- Pro: $20/month (5x free usage)
+- Max: $100/month (5x Pro) or $200/month (20x Pro)
+- API: Sonnet 4.6 at $3/$15 per million tokens input/output
+- Average ~$100-200/developer/month on API for team usage
+**SWE-bench:** Claude Opus 4.6 (Thinking) leads at 79.20% on SWE-bench Verified (February 2026).
+**What this toolkit adds on top:**
+- Fresh context per batch (solves the #1 quality problem)
+- Quality gate pipeline between every batch (lesson-check + tests + memory + test-count + git-clean)
+- Test-count monotonicity enforcement
+- Lesson compounding system (bugs become automated checks)
+- Plan-based batch execution with 4 execution modes
+- Resumability from saved state
+- Competitive dual-track execution
+- Batch-type-aware prompt selection (multi-armed bandit)
+- Machine-verifiable PRD system
+- Community lesson contribution pipeline
+---
+## 2. Convergent Patterns
+These patterns appear in 3+ tools, validating the approach:
+### 2.1 Plan-Then-Execute (Universal)
+Every tool breaks work into planning and execution phases. Devin plans strategy before coding. Cursor's agent mode reasons about approach before editing. GitHub Copilot agent creates implementation plans. This toolkit's mandatory brainstorming -> plan -> execute pipeline is aligned with market consensus.
+**Validation level:** Strong -- this is table stakes.
+### 2.2 Sandbox Isolation (Universal)
+Every tool isolates execution: Devin uses cloud VMs, OpenHands uses Docker, Codex CLI uses OS-level sandboxing, this toolkit uses git worktrees. The specific mechanism varies but the principle is universal.
+**Validation level:** Strong -- isolation prevents catastrophic failures.
+### 2.3 Auto-Test-and-Fix Loops (8/10 tools)
+Most tools run tests after changes and iterate on failures: Devin self-heals, Cursor fixes compilation errors, Aider auto-runs linters and tests, GitHub Copilot iterates on CI failures. This toolkit's quality gates between batches and ralph-loop iteration are aligned with this pattern.
+**Validation level:** Strong -- human verification does not scale.
+### 2.4 Terminal/CLI-Based Agents (4/10 tools)
+Claude Code, Codex CLI, Aider, and SWE-agent all operate from the terminal. The toolkit's terminal-first design matches a validated developer workflow preference for CLI-based tools.
+**Validation level:** Moderate -- IDE-based tools (Cursor, Windsurf) have larger market share, but CLI tools serve a distinct power-user niche.
+### 2.5 Model-Agnostic Design (5/10 tools)
+SWE-agent, OpenHands, Aider, Cursor, and GitHub Copilot support multiple LLM providers. This toolkit is Claude-specific by design (built on Claude Code's skill/hook/agent system), which limits model flexibility but enables deeper integration.
+**Validation level:** Moderate -- model-agnostic is popular but Claude-specific depth has trade-offs worth making.
+### 2.6 Persistent Memory Across Sessions (6/10 tools)
+Cursor Memories, Claude Code CLAUDE.md, this toolkit's progress.txt + state files, Aider's git history, Devin's persistent sessions, Codex app's session management. Cross-context memory is becoming standard.
+**Validation level:** Strong -- context resets are a known problem and everyone is solving it.
+### 2.7 Asynchronous/Background Execution (5/10 tools)
+Devin, GitHub Copilot agent, Cursor Background Agents, Codex app, and this toolkit's headless mode all support "assign and walk away" workflows.
+**Validation level:** Strong -- the market is moving toward autonomous agents that work without developer attendance.
+---
+## 3. Divergent Patterns (Unique Strengths)
+### 3.1 Fresh Context Per Batch (Unique to This Toolkit)
+No other tool explicitly solves context degradation by spawning a fresh process per unit of work. Context rot research (Adobe, February 2025; Chroma Research) confirms that LLM performance degrades predictably as context fills -- accuracy drops dramatically past ~65% of context window capacity.
+This toolkit's `claude -p` per batch is the only production implementation of fresh-context execution. Other tools use compaction (summarize and restart) or hope the context window is large enough. The research evidence strongly supports this approach.
+**Confidence:** High. This is the toolkit's strongest differentiator, backed by published research on context rot.
+### 3.2 Test-Count Monotonicity (Unique to This Toolkit)
+No competitor enforces that test count must never decrease between execution units. This catches a specific failure mode: the agent "fixing" a test by deleting it, or breaking test discovery. Simple invariant, high value.
+**Confidence:** High. No competitor documents this check.
+### 3.3 Lesson Compounding System (Unique to This Toolkit)
+No competitor has a system where production bugs automatically become quality gate checks that run on every future batch. The closest is Cursor Memories (persistent knowledge) and CLAUDE.md files (project instructions), but neither converts lessons into automated enforcement.
+The two-tier enforcement (syntactic via grep in <2s, semantic via AI agent) with community contribution pipeline (`/submit-lesson`) is architecturally novel.
+**Confidence:** High. This is a genuine architectural innovation.
+### 3.4 Competitive Dual-Track Execution (Unique to This Toolkit)
+No competitor implements "two agents solve the same problem in separate worktrees, a judge picks the winner." This is expensive (2x compute) but produces higher quality for critical batches through genuine competition + mandatory best-of-both synthesis.
+**Confidence:** High. No competitor documents this pattern.
+### 3.5 Batch-Type-Aware Prompt Selection (Unique to This Toolkit)
+The multi-armed bandit system that classifies batch types (new-file, refactoring, integration, test-only) and selects prompt variants based on past outcomes is not present in any competitor. Closest analog: Cursor's Composer model optimization, but that's model-level not prompt-level.
+**Confidence:** High. No competitor documents learned prompt selection.
+### 3.6 Machine-Verifiable PRD (Rare)
+The `tasks/prd.json` format where every acceptance criterion is a shell command (exit 0 = pass) is not standard in any competitor. Devin and GitHub Copilot work from issues/descriptions, but the criteria are natural language, not machine-executable.
+**Confidence:** High. Most tools use natural language acceptance criteria.
+### 3.7 Quality Gate Pipeline (Unique Composition)
+While individual checks exist in other tools (Cursor BugBot, GitHub CI, Aider lint), no competitor chains them into a mandatory pipeline between every execution unit: lesson-check -> tests -> memory -> test-count -> git-clean. The composition is unique.
+**Confidence:** High.
+---
+## 4. Gap Analysis
+What competitors offer that this toolkit does not:
+### 4.1 Visual IDE Experience
+**Who has it:** Cursor, Windsurf, OpenHands (web-based VSCode)
+**Impact:** High for developers who prefer visual editing. The toolkit is terminal-only.
+**Assessment:** Deliberate scope exclusion. The toolkit is a pipeline layer, not an editor. Recommendation: document this explicitly as a non-goal.
+### 4.2 Cloud-Hosted Execution
+**Who has it:** Devin, GitHub Copilot agent, OpenHands Cloud, Codex app
+**Impact:** Medium. Eliminates local compute requirements, enables mobile/lightweight access.
+**Assessment:** Could be added by running `run-plan.sh` on a remote server via SSH/tmux. Not a fundamental architecture gap.
+### 4.3 Web Browsing / Documentation Research
+**Who has it:** Devin (full browser), OpenHands (BrowserAgent), Codex app
+**Impact:** Medium. Useful for researching APIs, reading docs, understanding context.
+**Assessment:** Claude Code has web search and web fetch tools. The toolkit could integrate these into pre-flight phases. Gap is narrow.
+### 4.4 Model-Agnostic Support
+**Who has it:** SWE-agent, OpenHands, Aider, Cursor, GitHub Copilot
+**Impact:** Medium. Allows switching models based on cost/performance.
+**Assessment:** Deliberate trade-off. Claude-specific integration enables skill/hook/agent system depth. Supporting other models would require reimplementing the skill chain.
+### 4.5 Legacy Code Migration
+**Who has it:** Devin (COBOL/Fortran to modern), Amazon Q (Java/NET upgrades)
+**Impact:** Low for this toolkit's target audience (individual developers using Claude Code). High for enterprises.
+**Assessment:** Out of scope. Enterprise migration is a different market.
+### 4.6 Multi-Language Benchmark Coverage
+**Who has it:** OpenHands (first on Multi-SWE-Bench across 8 languages)
+**Impact:** Low. The toolkit is language-agnostic at the pipeline level (auto-detects pytest/npm/make).
+**Assessment:** Not a gap -- the toolkit delegates language handling to Claude Code.
+### 4.7 Graphical Diff / Visual Change Review
+**Who has it:** Cursor, Windsurf, GitHub Copilot (PR diffs), Devin (web UI)
+**Impact:** Medium. Visual diffs are easier to review than terminal output.
+**Assessment:** Users can review changes via `git diff` or any external diff tool. The toolkit produces standard git commits. Not a fundamental gap.
+### 4.8 Team Collaboration Features
+**Who has it:** Cursor (team plans), Windsurf (enterprise), GitHub Copilot (org-wide), Devin (Slack integration)
+**Impact:** Medium for teams, Low for solo developers.
+**Assessment:** The toolkit is primarily for individual power users augmenting their Claude Code workflow. Team features are out of current scope.
+### 4.9 Built-in Cost Tracking / Budget Controls
+**Who has it:** Cursor (usage limits per plan), Windsurf (credit system), Claude Code (checkpoint-based cost visibility)
+**Impact:** Medium. API costs can spike during long autonomous runs.
+**Assessment:** The toolkit could add cost tracking to `run-plan.sh` (count tokens per batch via Claude API response). Worth considering as a feature.
+---
+## 5. Comparative Feature Matrix
+| Feature | This Toolkit | Devin | SWE-agent | OpenHands | Aider | Cursor | Copilot Agent | Amazon Q | Windsurf | Codex CLI | Claude Code |
+|---------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| Fresh context per batch | Y | - | - | - | - | - | - | - | - | - | - |
+| Quality gate pipeline | Y | - | - | - | ~ | ~ | ~ | - | - | - | ~ |
+| Test-count monotonicity | Y | - | - | - | - | - | - | - | - | - | - |
+| Lesson compounding | Y | - | - | - | - | - | - | - | - | - | - |
+| Competitive dual-track | Y | - | - | - | - | - | - | - | - | - | - |
+| Machine-verifiable PRD | Y | - | - | - | - | - | - | - | - | - | - |
+| Headless unattended mode | Y | Y | Y | Y | - | ~ | Y | - | - | Y | Y |
+| Resumable from state | Y | Y | - | - | - | - | - | - | - | - | - |
+| Plan-based batch execution | Y | - | - | - | - | - | - | - | - | - | - |
+| Prompt variant learning | Y | - | - | - | - | - | - | - | - | - | - |
+| Visual IDE | - | Y | - | Y | - | Y | Y | Y | Y | - | - |
+| Cloud-hosted execution | - | Y | - | Y | - | ~ | Y | Y | - | Y | - |
+| Web browsing | - | Y | - | Y | - | - | - | - | - | - | ~ |
+| Model-agnostic | - | - | Y | Y | Y | Y | Y | - | Y | - | - |
+| OS-level sandboxing | - | Y | - | Y | - | - | Y | Y | - | Y | Y |
+| Community lessons | Y | - | - | - | - | - | - | - | - | - | - |
+| Open source | Y | - | Y | Y | Y | - | - | - | - | Y | Y |
+| Free (no subscription) | Y | - | Y | Y | Y | - | - | - | - | Y | - |
+Legend: Y = Yes, - = No, ~ = Partial
+---
+## 6. SWE-bench Performance Context
+| Agent/Model | SWE-bench Verified Score | Notes |
+|---|---|---|
+| Claude Opus 4.6 (Thinking) | 79.20% | Current leader (Feb 2026) |
+| Sonar Foundation Agent | 79.2% | Unfiltered leaderboard top |
+| Claude Opus 4.5 | ~79% | Official leaderboard leader |
+| Gemini 3 Flash | 76.20% | Google's latest |
+| GPT-5.2 | 75.40% | OpenAI Codex backbone |
+| Mini-SWE-agent | >74% | 100 lines of Python |
+| GitHub Copilot | 68.0% | Acceptance rate in study |
+| Devin 2.0 | 67% | PR merge rate |
+| Amazon Q Developer | 66% | April 2025 agent update |
+| OpenHands | 60.6% | November 2025 |
+**Key insight:** SWE-bench measures single-issue resolution, not sustained multi-batch development. This toolkit's value proposition -- preventing quality degradation over 10+ batches -- is orthogonal to SWE-bench scores. A tool that scores 79% on individual issues can still degrade to 40% effectiveness on batch 8 of a long feature. The toolkit addresses the gap between per-issue performance and sustained development quality.
+---
+## 7. Pricing Comparison
+| Tool | Base Cost | Model Costs | Total Monthly (Solo Dev) |
+|---|---|---|---|
+| **This Toolkit** | Free (OSS) | Claude API (~$100-200) | $100-200 |
+| Devin | $20/month + ACUs | Included | $50-300+ |
+| SWE-agent | Free (OSS) | Any LLM API | $50-200 |
+| OpenHands | Free (OSS) | Any LLM API | $50-200 |
+| Aider | Free (OSS) | Any LLM API ($3-5/hr) | $50-200 |
+| Cursor | $20/month | Included | $20-200 |
+| GitHub Copilot | $10-39/month | Included | $10-39 |
+| Amazon Q | $0-19/month | Included | $0-19 |
+| Windsurf | $0-15/month | Included | $0-15 |
+| Codex CLI | Free (OSS) | OpenAI API | $50-200 |
+| Claude Code | $20-200/month | Included in sub | $20-200 |
+**Key insight:** The toolkit adds zero marginal cost on top of Claude Code. Every competitor with comparable autonomy either charges a subscription or requires API costs. The toolkit's value-add is entirely in pipeline quality, not model capability.
+---
+## 8. Positioning Recommendation
+### Who This Toolkit Is For
+**Primary audience:** Claude Code power users who run multi-batch autonomous coding tasks and have been burned by context degradation, test regressions, or cascading errors in long sessions.
+**Secondary audience:** Developers who want a structured pipeline (brainstorm -> plan -> execute -> verify -> finish) with machine-verifiable gates, not just "chat with AI and hope for the best."
+### How to Position Against Each Competitor
+| Competitor | Positioning |
+|---|---|
+| **Devin** | "Devin is a cloud employee. This toolkit makes your local Claude Code session as reliable as an employee -- with quality gates, lesson learning, and resumable state -- at zero subscription cost." |
+| **SWE-agent** | "SWE-agent solves single issues. This toolkit orchestrates multi-batch feature development with quality guarantees between each step." |
+| **OpenHands** | "OpenHands is a platform. This toolkit is a pipeline. Use OpenHands if you need a model-agnostic cloud IDE. Use this toolkit if you already use Claude Code and want it to execute 10-batch plans without degradation." |
+| **Aider** | "Aider is pair programming. This toolkit is autonomous engineering. Aider stays with you. This toolkit works while you sleep." |
+| **Cursor** | "Cursor is an IDE. This toolkit is a quality pipeline. They solve different problems. If you edit in Cursor but execute plans with Claude Code, the toolkit ensures the execution is reliable." |
+| **Copilot Agent** | "Copilot agent works from GitHub issues. This toolkit works from structured plans with machine-verifiable criteria. Copilot handles tasks. This toolkit handles projects." |
+| **Amazon Q** | "Amazon Q specializes in AWS/Java/.NET. This toolkit is language- and cloud-agnostic." |
+| **Windsurf** | "Windsurf is an IDE with agent features. This toolkit is an agent pipeline with no IDE. Different tools for different workflows." |
+| **Codex CLI** | "Codex CLI is Claude Code's OpenAI equivalent. This toolkit adds quality gates, lesson learning, batch execution, and resumability -- features Codex CLI also lacks." |
+| **Claude Code** | "This toolkit IS Claude Code -- plus fresh-context execution, quality gates, lesson compounding, and a structured pipeline. It's Claude Code with discipline baked in." |
+### Unique Value Proposition (One Sentence)
+The autonomous-coding-toolkit is the only tool that prevents quality degradation in long autonomous coding sessions through fresh-context batch execution, mandatory quality gates, test-count monotonicity, and a compounding lesson system -- all as a free, open-source layer on top of Claude Code.
+---
+## Sources
+- [Devin AI Guide 2026 - AI Tools DevPro](https://aitoolsdevpro.com/ai-tools/devin-guide/)
+- [Devin 2.0 Pricing - VentureBeat](https://venturebeat.com/programming-development/devin-2-0-is-here-cognition-slashes-price-of-ai-software-engineer-to-20-per-month-from-500)
+- [Cognition Devin 2.0 Blog](https://cognition.ai/blog/devin-2)
+- [Devin Pricing Page](https://devin.ai/pricing)
+- [SWE-agent GitHub](https://github.com/SWE-agent/SWE-agent)
+- [Mini-SWE-agent GitHub](https://github.com/SWE-agent/mini-swe-agent)
+- [SWE-agent Documentation](https://swe-agent.com/latest/background/)
+- [OpenHands Official Site](https://openhands.dev/)
+- [OpenHands GitHub](https://github.com/OpenHands/OpenHands)
+- [OpenHands Agent SDK Paper](https://arxiv.org/html/2511.03690v1)
+- [OpenHands SOTA on SWE-Bench](https://openhands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model)
+- [Aider Official Site](https://aider.chat/)
+- [Aider GitHub](https://github.com/Aider-AI/aider)
+- [Aider LLM Leaderboards](https://aider.chat/docs/leaderboards/)
+- [Cursor AI Review 2025](https://skywork.ai/blog/cursor-ai-review-2025-agent-refactors-privacy/)
+- [Cursor AI Review 2026](https://prismic.io/blog/cursor-ai)
+- [Cursor 2.0 - The New Stack](https://thenewstack.io/cursor-2-0-ide-is-now-supercharged-with-ai-and-im-impressed/)
+- [GitHub Copilot Coding Agent Docs](https://docs.github.com/en/copilot/concepts/agents/coding-agent/about-coding-agent)
+- [GitHub Copilot Agent Mode in VS Code](https://code.visualstudio.com/blogs/2025/02/24/introducing-copilot-agent-mode)
+- [Copilot Coding Agent GA Discussion](https://github.com/orgs/community/discussions/159068)
+- [Amazon Q Developer Features](https://aws.amazon.com/q/developer/features/)
+- [Amazon Q Developer Pricing](https://aws.amazon.com/q/developer/pricing/)
+- [Windsurf Official Site](https://windsurf.com/)
+- [Windsurf Review - Taskade](https://www.taskade.com/blog/windsurf-review)
+- [OpenAI Codex Introduction](https://openai.com/index/introducing-codex/)
+- [Codex CLI GitHub](https://github.com/openai/codex)
+- [Codex CLI Features](https://developers.openai.com/codex/cli/features/)
+- [Codex CLI Security](https://developers.openai.com/codex/security/)
+- [Claude Code Overview](https://code.claude.com/docs/en/overview)
+- [Claude Code Autonomous Work](https://www.anthropic.com/news/enabling-claude-code-to-work-more-autonomously)
+- [Claude Code Sandboxing](https://www.anthropic.com/engineering/claude-code-sandboxing)
+- [Claude Code GitHub](https://github.com/anthropics/claude-code)
+- [Claude API Pricing](https://platform.claude.com/docs/en/about-claude/pricing)
+- [SWE-bench Verified Leaderboard](https://llm-stats.com/benchmarks/swe-bench-verified)
+- [SWE-bench Official](https://www.swebench.com/)
+- [SWE-bench February 2026 Update - Simon Willison](https://simonwillison.net/2026/Feb/19/swe-bench/)
+- [SWE-Bench Pro - Scale AI](https://scale.com/leaderboard/swe_bench_pro_public)
+- [Context Rot Research - Understanding AI](https://www.understandingai.org/p/context-rot-the-emerging-challenge)
+- [Context Rot - Chroma Research](https://research.trychroma.com/context-rot)
+- [Context Engineering - Anthropic](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
+- [AI Coding Agents 2026 - Faros AI](https://www.faros.ai/blog/best-ai-coding-agents-2026)
+- [AI Dev Tool Power Rankings - LogRocket](https://blog.logrocket.com/ai-dev-tool-power-rankings/)
+- [Coding CLI Tools Comparison - Tembo](https://www.tembo.io/blog/coding-cli-tools-comparison)
+- [Cursor Pricing 2026](https://checkthat.ai/brands/cursor/pricing)
+- [Windsurf vs Cursor Pricing](https://windsurf.com/compare/windsurf-vs-cursor)