autonomous-coding-toolkit 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +22 -0
- package/.claude-plugin/plugin.json +13 -0
- package/LICENSE +21 -0
- package/Makefile +21 -0
- package/README.md +140 -0
- package/SECURITY.md +28 -0
- package/agents/bash-expert.md +113 -0
- package/agents/dependency-auditor.md +138 -0
- package/agents/integration-tester.md +120 -0
- package/agents/lesson-scanner.md +149 -0
- package/agents/python-expert.md +179 -0
- package/agents/service-monitor.md +141 -0
- package/agents/shell-expert.md +147 -0
- package/benchmarks/runner.sh +147 -0
- package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
- package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
- package/benchmarks/tasks/02-refactor-module/task.md +8 -0
- package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
- package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
- package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
- package/bin/act.js +238 -0
- package/commands/autocode.md +6 -0
- package/commands/cancel-ralph.md +18 -0
- package/commands/code-factory.md +53 -0
- package/commands/create-prd.md +55 -0
- package/commands/ralph-loop.md +18 -0
- package/commands/run-plan.md +117 -0
- package/commands/submit-lesson.md +122 -0
- package/docs/ARCHITECTURE.md +630 -0
- package/docs/CONTRIBUTING.md +125 -0
- package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
- package/docs/lessons/0002-async-def-without-await.md +28 -0
- package/docs/lessons/0003-create-task-without-callback.md +28 -0
- package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
- package/docs/lessons/0005-sqlite-without-closing.md +33 -0
- package/docs/lessons/0006-venv-pip-path.md +27 -0
- package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
- package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
- package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
- package/docs/lessons/0010-local-outside-function-bash.md +33 -0
- package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
- package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
- package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
- package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
- package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
- package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
- package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
- package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
- package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
- package/docs/lessons/0020-persist-state-incrementally.md +44 -0
- package/docs/lessons/0021-dual-axis-testing.md +48 -0
- package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
- package/docs/lessons/0023-static-analysis-spiral.md +51 -0
- package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
- package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
- package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
- package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
- package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
- package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
- package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
- package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
- package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
- package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
- package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
- package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
- package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
- package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
- package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
- package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
- package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
- package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
- package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
- package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
- package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
- package/docs/lessons/0045-iterative-design-improvement.md +33 -0
- package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
- package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
- package/docs/lessons/0048-integration-wiring-batch.md +40 -0
- package/docs/lessons/0049-ab-verification.md +41 -0
- package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
- package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
- package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
- package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
- package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
- package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
- package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
- package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
- package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
- package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
- package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
- package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
- package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
- package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
- package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
- package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
- package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
- package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
- package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
- package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
- package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
- package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
- package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
- package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
- package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
- package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
- package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
- package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
- package/docs/lessons/0078-static-review-without-live-test.md +30 -0
- package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
- package/docs/lessons/FRAMEWORK.md +161 -0
- package/docs/lessons/SUMMARY.md +201 -0
- package/docs/lessons/TEMPLATE.md +85 -0
- package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
- package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
- package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
- package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
- package/docs/plans/2026-02-21-mab-research-report.md +406 -0
- package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
- package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
- package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
- package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
- package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
- package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
- package/docs/plans/2026-02-22-mab-run-design.md +462 -0
- package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
- package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
- package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
- package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
- package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
- package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
- package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
- package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
- package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
- package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
- package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
- package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
- package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
- package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
- package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
- package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
- package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
- package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
- package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
- package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
- package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
- package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
- package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
- package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
- package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
- package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
- package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
- package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
- package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
- package/docs/plans/2026-02-24-headless-module-split.md +443 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
- package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
- package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
- package/docs/plans/audit-findings.md +186 -0
- package/docs/telegram-notification-format.md +98 -0
- package/examples/example-plan.md +51 -0
- package/examples/example-prd.json +72 -0
- package/examples/example-roadmap.md +33 -0
- package/examples/quickstart-plan.md +63 -0
- package/hooks/hooks.json +26 -0
- package/hooks/setup-symlinks.sh +48 -0
- package/hooks/stop-hook.sh +135 -0
- package/package.json +47 -0
- package/policies/bash.md +71 -0
- package/policies/python.md +71 -0
- package/policies/testing.md +61 -0
- package/policies/universal.md +60 -0
- package/scripts/analyze-report.sh +97 -0
- package/scripts/architecture-map.sh +145 -0
- package/scripts/auto-compound.sh +273 -0
- package/scripts/batch-audit.sh +42 -0
- package/scripts/batch-test.sh +101 -0
- package/scripts/entropy-audit.sh +221 -0
- package/scripts/failure-digest.sh +51 -0
- package/scripts/generate-ast-rules.sh +96 -0
- package/scripts/init.sh +112 -0
- package/scripts/lesson-check.sh +428 -0
- package/scripts/lib/common.sh +61 -0
- package/scripts/lib/cost-tracking.sh +153 -0
- package/scripts/lib/ollama.sh +60 -0
- package/scripts/lib/progress-writer.sh +128 -0
- package/scripts/lib/run-plan-context.sh +215 -0
- package/scripts/lib/run-plan-echo-back.sh +231 -0
- package/scripts/lib/run-plan-headless.sh +396 -0
- package/scripts/lib/run-plan-notify.sh +57 -0
- package/scripts/lib/run-plan-parser.sh +81 -0
- package/scripts/lib/run-plan-prompt.sh +215 -0
- package/scripts/lib/run-plan-quality-gate.sh +132 -0
- package/scripts/lib/run-plan-routing.sh +315 -0
- package/scripts/lib/run-plan-sampling.sh +170 -0
- package/scripts/lib/run-plan-scoring.sh +146 -0
- package/scripts/lib/run-plan-state.sh +142 -0
- package/scripts/lib/run-plan-team.sh +199 -0
- package/scripts/lib/telegram.sh +54 -0
- package/scripts/lib/thompson-sampling.sh +176 -0
- package/scripts/license-check.sh +74 -0
- package/scripts/mab-run.sh +575 -0
- package/scripts/module-size-check.sh +146 -0
- package/scripts/patterns/async-no-await.yml +5 -0
- package/scripts/patterns/bare-except.yml +6 -0
- package/scripts/patterns/empty-catch.yml +6 -0
- package/scripts/patterns/hardcoded-localhost.yml +9 -0
- package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
- package/scripts/pipeline-status.sh +197 -0
- package/scripts/policy-check.sh +226 -0
- package/scripts/prior-art-search.sh +133 -0
- package/scripts/promote-mab-lessons.sh +126 -0
- package/scripts/prompts/agent-a-superpowers.md +29 -0
- package/scripts/prompts/agent-b-ralph.md +29 -0
- package/scripts/prompts/judge-agent.md +61 -0
- package/scripts/prompts/planner-agent.md +44 -0
- package/scripts/pull-community-lessons.sh +90 -0
- package/scripts/quality-gate.sh +266 -0
- package/scripts/research-gate.sh +90 -0
- package/scripts/run-plan.sh +329 -0
- package/scripts/scope-infer.sh +159 -0
- package/scripts/setup-ralph-loop.sh +155 -0
- package/scripts/telemetry.sh +230 -0
- package/scripts/tests/run-all-tests.sh +52 -0
- package/scripts/tests/test-act-cli.sh +46 -0
- package/scripts/tests/test-agents-md.sh +87 -0
- package/scripts/tests/test-analyze-report.sh +114 -0
- package/scripts/tests/test-architecture-map.sh +89 -0
- package/scripts/tests/test-auto-compound.sh +169 -0
- package/scripts/tests/test-batch-test.sh +65 -0
- package/scripts/tests/test-benchmark-runner.sh +25 -0
- package/scripts/tests/test-common.sh +168 -0
- package/scripts/tests/test-cost-tracking.sh +158 -0
- package/scripts/tests/test-echo-back.sh +180 -0
- package/scripts/tests/test-entropy-audit.sh +146 -0
- package/scripts/tests/test-failure-digest.sh +66 -0
- package/scripts/tests/test-generate-ast-rules.sh +145 -0
- package/scripts/tests/test-helpers.sh +82 -0
- package/scripts/tests/test-init.sh +47 -0
- package/scripts/tests/test-lesson-check.sh +278 -0
- package/scripts/tests/test-lesson-local.sh +55 -0
- package/scripts/tests/test-license-check.sh +109 -0
- package/scripts/tests/test-mab-run.sh +182 -0
- package/scripts/tests/test-ollama-lib.sh +49 -0
- package/scripts/tests/test-ollama.sh +60 -0
- package/scripts/tests/test-pipeline-status.sh +198 -0
- package/scripts/tests/test-policy-check.sh +124 -0
- package/scripts/tests/test-prior-art-search.sh +96 -0
- package/scripts/tests/test-progress-writer.sh +140 -0
- package/scripts/tests/test-promote-mab-lessons.sh +110 -0
- package/scripts/tests/test-pull-community-lessons.sh +149 -0
- package/scripts/tests/test-quality-gate.sh +241 -0
- package/scripts/tests/test-research-gate.sh +132 -0
- package/scripts/tests/test-run-plan-cli.sh +86 -0
- package/scripts/tests/test-run-plan-context.sh +305 -0
- package/scripts/tests/test-run-plan-e2e.sh +153 -0
- package/scripts/tests/test-run-plan-headless.sh +424 -0
- package/scripts/tests/test-run-plan-notify.sh +124 -0
- package/scripts/tests/test-run-plan-parser.sh +217 -0
- package/scripts/tests/test-run-plan-prompt.sh +254 -0
- package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
- package/scripts/tests/test-run-plan-routing.sh +178 -0
- package/scripts/tests/test-run-plan-scoring.sh +148 -0
- package/scripts/tests/test-run-plan-state.sh +261 -0
- package/scripts/tests/test-run-plan-team.sh +157 -0
- package/scripts/tests/test-scope-infer.sh +150 -0
- package/scripts/tests/test-setup-ralph-loop.sh +63 -0
- package/scripts/tests/test-telegram-env.sh +38 -0
- package/scripts/tests/test-telegram.sh +121 -0
- package/scripts/tests/test-telemetry.sh +46 -0
- package/scripts/tests/test-thompson-sampling.sh +139 -0
- package/scripts/tests/test-validate-all.sh +60 -0
- package/scripts/tests/test-validate-commands.sh +89 -0
- package/scripts/tests/test-validate-hooks.sh +98 -0
- package/scripts/tests/test-validate-lessons.sh +150 -0
- package/scripts/tests/test-validate-plan-quality.sh +235 -0
- package/scripts/tests/test-validate-plans.sh +187 -0
- package/scripts/tests/test-validate-plugin.sh +106 -0
- package/scripts/tests/test-validate-prd.sh +184 -0
- package/scripts/tests/test-validate-skills.sh +134 -0
- package/scripts/validate-all.sh +57 -0
- package/scripts/validate-commands.sh +67 -0
- package/scripts/validate-hooks.sh +89 -0
- package/scripts/validate-lessons.sh +98 -0
- package/scripts/validate-plan-quality.sh +369 -0
- package/scripts/validate-plans.sh +120 -0
- package/scripts/validate-plugin.sh +86 -0
- package/scripts/validate-policies.sh +42 -0
- package/scripts/validate-prd.sh +118 -0
- package/scripts/validate-skills.sh +96 -0
- package/skills/autocode/SKILL.md +285 -0
- package/skills/autocode/ab-verification.md +51 -0
- package/skills/autocode/code-quality-standards.md +37 -0
- package/skills/autocode/competitive-mode.md +364 -0
- package/skills/brainstorming/SKILL.md +97 -0
- package/skills/capture-lesson/SKILL.md +187 -0
- package/skills/check-lessons/SKILL.md +116 -0
- package/skills/dispatching-parallel-agents/SKILL.md +110 -0
- package/skills/executing-plans/SKILL.md +85 -0
- package/skills/finishing-a-development-branch/SKILL.md +201 -0
- package/skills/receiving-code-review/SKILL.md +72 -0
- package/skills/requesting-code-review/SKILL.md +59 -0
- package/skills/requesting-code-review/code-reviewer.md +82 -0
- package/skills/research/SKILL.md +145 -0
- package/skills/roadmap/SKILL.md +115 -0
- package/skills/subagent-driven-development/SKILL.md +98 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
- package/skills/subagent-driven-development/implementer-prompt.md +73 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
- package/skills/systematic-debugging/SKILL.md +134 -0
- package/skills/systematic-debugging/condition-based-waiting.md +64 -0
- package/skills/systematic-debugging/defense-in-depth.md +32 -0
- package/skills/systematic-debugging/root-cause-tracing.md +55 -0
- package/skills/test-driven-development/SKILL.md +167 -0
- package/skills/using-git-worktrees/SKILL.md +219 -0
- package/skills/using-superpowers/SKILL.md +54 -0
- package/skills/verification-before-completion/SKILL.md +140 -0
- package/skills/verify/SKILL.md +82 -0
- package/skills/writing-plans/SKILL.md +128 -0
- package/skills/writing-skills/SKILL.md +93 -0
|
@@ -0,0 +1,548 @@
|
|
|
1
|
+
# Cost/Quality Tradeoff Modeling for Autonomous Coding Pipelines
|
|
2
|
+
|
|
3
|
+
**Date:** 2026-02-22
|
|
4
|
+
**Status:** Research complete
|
|
5
|
+
**Confidence:** High on pricing data (official sources), Medium on quality deltas (benchmark-dependent), Medium on break-even modeling (assumptions documented)
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Executive Summary
|
|
10
|
+
|
|
11
|
+
Running an autonomous coding pipeline costs $5-65 per 6-batch feature depending on execution mode and caching strategy. The single largest cost lever is **prompt caching** (83% reduction), not model selection. Sonnet 4.5/4.6 matches or exceeds Opus on SWE-bench coding benchmarks at 60% of the price, making Opus routing justifiable only for architectural/planning tasks where reasoning depth matters. Competitive (MAB) mode doubles per-batch cost but stays under $2/batch with cache priming — the break-even is any feature where a single rework cycle costs more than $6. Compared to commercial alternatives (Devin at $8-9/hr, Cursor at ~$0.09/request, Copilot at $0.04/premium request), the toolkit's API-direct approach is cheaper for heavy autonomous workloads but lacks the UX guardrails of commercial products.
|
|
12
|
+
|
|
13
|
+
**Recommendation:** Default to Sonnet with Haiku for verification-only batches. Reserve Opus for planning and judging. Always cache-prime before parallel dispatch. Implement cost tracking per batch (the data doesn't exist yet and every recommendation here would be more precise with it).
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## 1. Current API Pricing Landscape
|
|
18
|
+
|
|
19
|
+
### 1.1 Claude Model Pricing (Anthropic, Official)
|
|
20
|
+
|
|
21
|
+
Source: [Anthropic Pricing Page](https://platform.claude.com/docs/en/about-claude/pricing)
|
|
22
|
+
|
|
23
|
+
| Model | Input $/MTok | Output $/MTok | Cache Read $/MTok | Cache Write (5m) $/MTok | Batch Input $/MTok | Batch Output $/MTok |
|
|
24
|
+
|-------|-------------|--------------|-------------------|------------------------|--------------------|---------------------|
|
|
25
|
+
| **Opus 4.6/4.5** | $5.00 | $25.00 | $0.50 | $6.25 | $2.50 | $12.50 |
|
|
26
|
+
| **Sonnet 4.6/4.5/4** | $3.00 | $15.00 | $0.30 | $3.75 | $1.50 | $7.50 |
|
|
27
|
+
| **Haiku 4.5** | $1.00 | $5.00 | $0.10 | $1.25 | $0.50 | $2.50 |
|
|
28
|
+
| Opus 4.1/4 (legacy) | $15.00 | $75.00 | $1.50 | $18.75 | $7.50 | $37.50 |
|
|
29
|
+
| Haiku 3.5 | $0.80 | $4.00 | $0.08 | $1.00 | $0.40 | $2.00 |
|
|
30
|
+
|
|
31
|
+
**Long context surcharge:** Requests exceeding 200K input tokens double the input price and add 50% to output (e.g., Sonnet: $6/$22.50). This is relevant for batch agents with large codebases — staying under 200K tokens per call is a significant cost optimization.
|
|
32
|
+
|
|
33
|
+
**Key ratio:** Opus 4.6 costs 1.67x Sonnet input and 1.67x Sonnet output. This is dramatically cheaper than legacy Opus 4.1 (5x Sonnet). The Opus tax has shrunk from 5x to 1.67x in one generation.
|
|
34
|
+
|
|
35
|
+
### 1.2 Competitor Pricing
|
|
36
|
+
|
|
37
|
+
| Provider | Model | Input $/MTok | Output $/MTok | Notes |
|
|
38
|
+
|----------|-------|-------------|--------------|-------|
|
|
39
|
+
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K context |
|
|
40
|
+
| OpenAI | GPT-4o Mini | $0.15 | $0.60 | Budget tier |
|
|
41
|
+
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | Under 200K; doubles above |
|
|
42
|
+
| Google | Gemini 2.5 Flash | $0.075 | $0.30 | Cheapest viable option |
|
|
43
|
+
| Google | Gemini 3 Pro | $2.00 | $12.00 | Newest generation |
|
|
44
|
+
|
|
45
|
+
**Finding:** Claude Sonnet ($3/$15) is priced between GPT-4o ($2.50/$10) and Gemini 3 Pro ($2/$12) on input, but is significantly more expensive on output. For output-heavy coding tasks (where the model generates substantial code), Claude's output premium matters. A batch generating 50K output tokens costs $0.75 on Sonnet vs $0.50 on GPT-4o vs $0.60 on Gemini 3 Pro.
|
|
46
|
+
|
|
47
|
+
**Implication for the toolkit:** The toolkit is model-agnostic at the `claude -p` layer, but the skill chain and quality gates are Claude-specific. Multi-provider routing (send verification batches to Gemini Flash at $0.075/$0.30) would require significant architecture changes but could cut verification costs by 90%.
|
|
48
|
+
|
|
49
|
+
### 1.3 Discount Mechanisms
|
|
50
|
+
|
|
51
|
+
| Mechanism | Discount | Latency Impact | Stackable? |
|
|
52
|
+
|-----------|----------|---------------|------------|
|
|
53
|
+
| **Prompt caching (read)** | 90% off input | Faster (no reprocessing) | Yes, with batch |
|
|
54
|
+
| **Prompt caching (write)** | +25% on first call | Minimal | Yes, with batch |
|
|
55
|
+
| **Batch API** | 50% off everything | Up to 24h (usually <1h) | Yes, with caching |
|
|
56
|
+
| **Cache + Batch combined** | ~95% off cached input | Up to 24h | Yes |
|
|
57
|
+
|
|
58
|
+
**The stacking math for a typical batch:**
|
|
59
|
+
- Uncached Sonnet input (100K tokens): $0.30
|
|
60
|
+
- Cached Sonnet input (90K cached + 10K new): 90K × $0.30/MTok + 10K × $3.00/MTok = $0.027 + $0.030 = $0.057
|
|
61
|
+
- Cached + Batch: 90K × $0.15/MTok + 10K × $1.50/MTok = $0.0135 + $0.015 = $0.029
|
|
62
|
+
|
|
63
|
+
That's a 90% reduction from uncached to cached, and 95% from uncached to cached+batch.
|
|
64
|
+
|
|
65
|
+
---
|
|
66
|
+
|
|
67
|
+
## 2. Quality Delta Between Models for Coding
|
|
68
|
+
|
|
69
|
+
### 2.1 Benchmark Evidence
|
|
70
|
+
|
|
71
|
+
Source: [SWE-bench Verified Leaderboard](https://llm-stats.com/benchmarks/swe-bench-verified), [SWE-rebench](https://swe-rebench.com), [Vellum LLM Leaderboard](https://www.vellum.ai/llm-leaderboard)
|
|
72
|
+
|
|
73
|
+
| Model | SWE-bench Verified | SWE-bench Pro | Cost/Task (SWE-rebench) |
|
|
74
|
+
|-------|-------------------|---------------|------------------------|
|
|
75
|
+
| Claude Sonnet 4.5 | 77.2% (82% w/ parallel) | 43.6% | $0.94 |
|
|
76
|
+
| Claude Opus 4.5 | 80.9% | 45.9% | — |
|
|
77
|
+
| Claude Opus 4.6 | ~80-82% | — | $0.93 |
|
|
78
|
+
| GPT-4o | ~49% | — | ~$0.50-1.00 |
|
|
79
|
+
| Gemini 2.5 Pro | ~65% | — | ~$0.80 |
|
|
80
|
+
|
|
81
|
+
**Finding: Sonnet is ~95% of Opus quality on coding benchmarks at 60% of the price.**
|
|
82
|
+
|
|
83
|
+
On SWE-bench Verified, Sonnet 4.5 scores 77.2% vs Opus 4.5's 80.9% — a 4.6% gap. On SWE-bench Pro (harder), the gap is 2.3 percentage points (43.6% vs 45.9%). Crucially, Sonnet 4.5 with parallel compute (82%) actually exceeds single-shot Opus (80.9%).
|
|
84
|
+
|
|
85
|
+
**Where Opus still wins:**
|
|
86
|
+
- Planning and architecture decisions (qualitative, not well-captured by SWE-bench)
|
|
87
|
+
- Complex multi-file refactoring requiring deep reasoning
|
|
88
|
+
- Judge/evaluation tasks where nuanced comparison matters
|
|
89
|
+
- The SWE-bench Pro gap suggests Opus pulls ahead on harder problems
|
|
90
|
+
|
|
91
|
+
**Where Opus doesn't justify the cost:**
|
|
92
|
+
- Standard implementation tasks (file creation, test writing)
|
|
93
|
+
- Verification/run-only batches
|
|
94
|
+
- Well-specified tasks with clear acceptance criteria
|
|
95
|
+
|
|
96
|
+
### 2.2 Cost Per Success Analysis
|
|
97
|
+
|
|
98
|
+
The metric that matters is **cost per successful batch**, not cost per token.
|
|
99
|
+
|
|
100
|
+
| Model | Cost/batch | Success rate (est.) | Cost/success |
|
|
101
|
+
|-------|-----------|--------------------:|-------------|
|
|
102
|
+
| Haiku 4.5 | ~$0.30 | ~60% | ~$0.50 |
|
|
103
|
+
| Sonnet 4.6 | ~$0.94 | ~85% | ~$1.11 |
|
|
104
|
+
| Opus 4.6 | ~$1.50 | ~90% | ~$1.67 |
|
|
105
|
+
|
|
106
|
+
Success rates are estimated from SWE-bench data scaled to the toolkit's quality gate pass rates. The key insight: **Haiku's apparent cheapness disappears when factoring in retry cost.** A 60% success rate means 40% of batches need a retry (costing another $0.30+ each), plus the quality gate execution time.
|
|
107
|
+
|
|
108
|
+
**Implication:** Sonnet is the cost-per-success sweet spot. Haiku is appropriate only for tasks with near-deterministic success (verification-only, run commands, check output). Opus is appropriate when a single failure is very expensive (complex integration, architectural changes).
|
|
109
|
+
|
|
110
|
+
---
|
|
111
|
+
|
|
112
|
+
## 3. Cost Per Batch by Execution Mode
|
|
113
|
+
|
|
114
|
+
### 3.1 Token Consumption Model
|
|
115
|
+
|
|
116
|
+
Based on SWE-rebench data and Claude Code usage statistics:
|
|
117
|
+
|
|
118
|
+
| Component | Input Tokens | Output Tokens | Notes |
|
|
119
|
+
|-----------|-------------|--------------|-------|
|
|
120
|
+
| System prompt + CLAUDE.md chain | ~8,000 | — | Cacheable |
|
|
121
|
+
| Plan text (single batch) | ~2,000 | — | Varies by plan |
|
|
122
|
+
| Context injection (failure patterns, progress) | ~1,500 | — | From run-plan-context.sh |
|
|
123
|
+
| Tool definitions (Bash, Read, Write, Edit, Grep, Glob) | ~2,000 | — | Cacheable |
|
|
124
|
+
| File reads during execution | ~20,000 | — | Varies heavily |
|
|
125
|
+
| Code generation + tool calls | — | ~15,000 | Primary output cost |
|
|
126
|
+
| **Total per batch** | **~33,500** | **~15,000** | Conservative estimate |
|
|
127
|
+
|
|
128
|
+
### 3.2 Cost Per Batch by Mode
|
|
129
|
+
|
|
130
|
+
Using Sonnet 4.6 pricing ($3/$15 per MTok) with ~33.5K input, ~15K output:
|
|
131
|
+
|
|
132
|
+
| Mode | Agents | Calls/Batch | Input Tokens | Output Tokens | Cost/Batch (uncached) | Cost/Batch (cached) |
|
|
133
|
+
|------|--------|------------|-------------|--------------|----------------------|---------------------|
|
|
134
|
+
| **Headless** | 1 | 1 | 33.5K | 15K | $0.33 | $0.13 |
|
|
135
|
+
| **Team** | 2-3 | 2-3 | 67-100K | 30-45K | $0.65-1.00 | $0.26-0.40 |
|
|
136
|
+
| **Competitive (MAB)** | 2 + judge | 3 | 80K+ | 35K+ | $0.77+ | $0.31+ |
|
|
137
|
+
| **Ralph loop** | 1 (iterating) | 2-5 | 67-167K | 30-75K | $0.65-1.63 | $0.26-0.65 |
|
|
138
|
+
|
|
139
|
+
**Notes:**
|
|
140
|
+
- Team mode spawns implementer + reviewer agents. Each gets its own context window.
|
|
141
|
+
- Competitive mode runs 2 parallel implementers + 1 judge evaluation. The judge call is smaller (diff comparison, not full implementation).
|
|
142
|
+
- Ralph loop cost depends on iterations. The stop-hook re-injects the prompt each cycle, but context accumulates within a session. Worst case: 5 iterations before convergence.
|
|
143
|
+
- Cached prices assume 80% of input tokens hit cache (system prompt + tools + CLAUDE.md chain + plan prefix).
|
|
144
|
+
|
|
145
|
+
### 3.3 Model Routing Impact on Batch Cost
|
|
146
|
+
|
|
147
|
+
The toolkit's `classify_batch_model()` function in `run-plan-routing.sh` routes:
|
|
148
|
+
- **Haiku** for verification-only batches (all steps are `Run:` commands)
|
|
149
|
+
- **Sonnet** for implementation batches (Create/Modify files) — default
|
|
150
|
+
- **Opus** for CRITICAL-tagged batches
|
|
151
|
+
|
|
152
|
+
| Batch Type | Model | Cost (cached) | Frequency |
|
|
153
|
+
|-----------|-------|--------------|-----------|
|
|
154
|
+
| Implementation (Create) | Sonnet | $0.13 | ~50% |
|
|
155
|
+
| Implementation (Modify) | Sonnet | $0.13 | ~30% |
|
|
156
|
+
| Verification-only | Haiku | $0.04 | ~10% |
|
|
157
|
+
| Critical | Opus | $0.22 | ~10% |
|
|
158
|
+
|
|
159
|
+
**Weighted average per batch:** ~$0.12 (cached, with routing)
|
|
160
|
+
**Without routing (all Sonnet):** ~$0.13 (cached)
|
|
161
|
+
**Routing savings:** ~8% — modest, because Sonnet dominates the mix.
|
|
162
|
+
|
|
163
|
+
**Implication:** Model routing saves less than prompt caching by a large margin. Caching first, routing second.
|
|
164
|
+
|
|
165
|
+
---
|
|
166
|
+
|
|
167
|
+
## 4. Total Pipeline Cost for a Typical Feature
|
|
168
|
+
|
|
169
|
+
### 4.1 Pipeline Stage Costs
|
|
170
|
+
|
|
171
|
+
| Stage | Model | Calls | Input Tokens | Output Tokens | Cost (cached) |
|
|
172
|
+
|-------|-------|-------|-------------|--------------|---------------|
|
|
173
|
+
| Brainstorm | Sonnet | 1 interactive session | ~50K | ~10K | $0.20 |
|
|
174
|
+
| PRD generation | Sonnet | 1 | ~20K | ~5K | $0.10 |
|
|
175
|
+
| Plan writing | Sonnet | 1 | ~30K | ~20K | $0.40 |
|
|
176
|
+
| Execution (6 batches, headless) | Mixed | 6 | ~200K | ~90K | $0.78 |
|
|
177
|
+
| Quality gates (6x) | — | 0 (bash scripts) | — | — | $0.00 |
|
|
178
|
+
| Verification | Sonnet | 1 | ~30K | ~5K | $0.12 |
|
|
179
|
+
| **Total (headless, cached)** | | **~10 calls** | **~330K** | **~130K** | **~$1.60** |
|
|
180
|
+
|
|
181
|
+
### 4.2 Total Cost by Execution Mode (6-batch feature)
|
|
182
|
+
|
|
183
|
+
| Mode | Base Cost | + Retries (20%) | + Judge (MAB) | Total |
|
|
184
|
+
|------|----------|----------------|--------------|-------|
|
|
185
|
+
| **Headless** | $1.60 | $0.16 | — | **$1.76** |
|
|
186
|
+
| **Team** | $2.38 | $0.24 | — | **$2.62** |
|
|
187
|
+
| **Competitive (MAB)** | $2.50 | $0.25 | $0.60 | **$3.35** |
|
|
188
|
+
| **Ralph loop** | $2.20 | $0.22 | — | **$2.42** |
|
|
189
|
+
|
|
190
|
+
**Without caching:**
|
|
191
|
+
|
|
192
|
+
| Mode | Total (uncached) |
|
|
193
|
+
|------|-----------------|
|
|
194
|
+
| **Headless** | ~$6.50 |
|
|
195
|
+
| **Team** | ~$10.00 |
|
|
196
|
+
| **Competitive (MAB)** | ~$13.50 |
|
|
197
|
+
| **Ralph loop** | ~$9.00 |
|
|
198
|
+
|
|
199
|
+
### 4.3 Scaling: What Does a Multi-Feature Sprint Cost?
|
|
200
|
+
|
|
201
|
+
Assuming 5 features per week, 6 batches each:
|
|
202
|
+
|
|
203
|
+
| Scenario | Weekly Cost | Monthly Cost |
|
|
204
|
+
|----------|-----------|-------------|
|
|
205
|
+
| Headless + cached | $8.80 | $35.20 |
|
|
206
|
+
| MAB on everything + cached | $16.75 | $67.00 |
|
|
207
|
+
| Headless + uncached | $32.50 | $130.00 |
|
|
208
|
+
| MAB + uncached | $67.50 | $270.00 |
|
|
209
|
+
|
|
210
|
+
**Context:** Claude Code's average daily cost per developer is $6, with 90th percentile at $12 (source: [Claude Code cost docs](https://code.claude.com/docs/en/costs)). The toolkit's headless mode with caching would add ~$1.76 per feature on top of any interactive session costs.
|
|
211
|
+
|
|
212
|
+
---
|
|
213
|
+
|
|
214
|
+
## 5. When Does Competitive Mode Pay for Itself?
|
|
215
|
+
|
|
216
|
+
### 5.1 The Rework Cost Model
|
|
217
|
+
|
|
218
|
+
Competitive mode costs ~$3.35 vs headless at ~$1.76 — a **$1.59 premium** per feature. This premium pays for itself when it avoids rework.
|
|
219
|
+
|
|
220
|
+
**What does rework cost?**
|
|
221
|
+
- A failed batch that passes quality gates but introduces subtle bugs: 1-3 batches of debugging ($0.40-1.20 cached)
|
|
222
|
+
- A failed batch caught by quality gates requiring retry: $0.13-0.22 per retry
|
|
223
|
+
- A feature that ships broken and requires a hotfix cycle: $3-10 (new brainstorm + plan + execute)
|
|
224
|
+
- Developer time debugging AI-generated code: $50-150/hr (opportunity cost)
|
|
225
|
+
|
|
226
|
+
### 5.2 Break-Even Analysis
|
|
227
|
+
|
|
228
|
+
| Rework Scenario | Rework Cost | MAB Premium | Break-Even Frequency |
|
|
229
|
+
|----------------|------------|-------------|---------------------|
|
|
230
|
+
| 1 retry saved | $0.13 | $1.59 | Every 12th feature |
|
|
231
|
+
| 1 debugging batch saved | $0.94 | $1.59 | Every 2nd feature |
|
|
232
|
+
| 1 hotfix cycle saved | $5.00 | $1.59 | Every 3rd hotfix |
|
|
233
|
+
| 1 hour dev time saved | $75.00 | $1.59 | Every 47th feature |
|
|
234
|
+
|
|
235
|
+
**Finding:** If competitive mode catches architectural issues that would require even one debugging batch per 2 features, it pays for itself. The question is empirical: **does the judge actually catch issues that quality gates miss?**
|
|
236
|
+
|
|
237
|
+
### 5.3 When to Use Competitive Mode
|
|
238
|
+
|
|
239
|
+
**Use competitive mode when:**
|
|
240
|
+
- The batch involves cross-module integration (highest bug density)
|
|
241
|
+
- Historical retry rate for this batch type exceeds 30%
|
|
242
|
+
- The cost of a subtle bug is high (production-facing, data-handling)
|
|
243
|
+
- You have no strategy performance data yet (exploration phase of MAB)
|
|
244
|
+
|
|
245
|
+
**Use headless when:**
|
|
246
|
+
- The task is well-specified with clear acceptance criteria
|
|
247
|
+
- Strategy performance data shows a clear winner (>70% win rate)
|
|
248
|
+
- The batch is isolated (single file, no cross-module touches)
|
|
249
|
+
- Cost sensitivity is high and quality gates are comprehensive
|
|
250
|
+
|
|
251
|
+
---
|
|
252
|
+
|
|
253
|
+
## 6. Model Routing Strategies with Empirical Support
|
|
254
|
+
|
|
255
|
+
### 6.1 Academic Approaches
|
|
256
|
+
|
|
257
|
+
Three main paradigms from the literature:
|
|
258
|
+
|
|
259
|
+
**Routing (single model selection):** A classifier predicts which model will succeed and routes the entire request to that model. Cost = 1 model call + router overhead.
|
|
260
|
+
- Hybrid-LLM (ICLR 2024): Routes based on estimated quality gap between models. Works well when the small model handles >60% of queries adequately.
|
|
261
|
+
- Source: [ICLR 2024 paper](https://proceedings.iclr.cc/paper_files/paper/2024/file/b47d93c99fa22ac0b377578af0a1f63a-Paper-Conference.pdf)
|
|
262
|
+
|
|
263
|
+
**Cascading (escalation):** Start with the cheapest model. If confidence is below threshold, escalate to the next tier. Cost = 1-3 model calls, but most stop at tier 1.
|
|
264
|
+
- C3PO (2025): Achieves <20% cost of the most capable model with <2% accuracy loss across 16 benchmarks.
|
|
265
|
+
- Source: [C3PO paper](https://arxiv.org/pdf/2511.07396)
|
|
266
|
+
|
|
267
|
+
**Unified routing + cascading (ICLR 2025):** Proves that combining routing and cascading is strictly better than either alone. 4% improvement on RouterBench with 80% relative improvement over naive baselines.
|
|
268
|
+
- Source: [Unified approach](https://arxiv.org/abs/2410.10347)
|
|
269
|
+
|
|
270
|
+
### 6.2 Current Toolkit Strategy
|
|
271
|
+
|
|
272
|
+
The toolkit uses static routing via `classify_batch_model()`:
|
|
273
|
+
|
|
274
|
+
```
|
|
275
|
+
Create files → Sonnet
|
|
276
|
+
Modify files → Sonnet
|
|
277
|
+
Run-only (verification) → Haiku
|
|
278
|
+
CRITICAL tag → Opus
|
|
279
|
+
Default → Sonnet
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
This is pure routing (no cascading). It's simple and low-overhead but leaves money on the table.
|
|
283
|
+
|
|
284
|
+
### 6.3 Recommended Improvements
|
|
285
|
+
|
|
286
|
+
**Short-term (no architecture changes):**
|
|
287
|
+
1. **Retry escalation already exists** — the toolkit escalates context on retry (includes previous failure log). Adding model escalation (Haiku → Sonnet → Opus on retry) would implement cascading with zero new infrastructure.
|
|
288
|
+
2. **Tag more batches as Haiku-eligible.** Currently only "all-Run" batches get Haiku. Config/documentation-only batches, test-only batches, and simple rename/move batches could also use Haiku.
|
|
289
|
+
|
|
290
|
+
**Medium-term (requires tracking):**
|
|
291
|
+
3. **Cost-per-success tracking.** Record model, cost, and pass/fail per batch in `.run-plan-state.json`. After 50+ data points, the toolkit can make data-driven routing decisions.
|
|
292
|
+
4. **Complexity-based routing.** Use batch metadata (file count, line count of changes, number of cross-file references) as routing features. More complex batches → higher-tier model.
|
|
293
|
+
|
|
294
|
+
**Long-term (architecture change):**
|
|
295
|
+
5. **Cascade on failure.** Instead of retrying with the same model + more context, retry with a more capable model. Haiku fails → Sonnet retry → Opus retry. Cost increases only when needed.
|
|
296
|
+
|
|
297
|
+
---
|
|
298
|
+
|
|
299
|
+
## 7. Prompt Caching Economics
|
|
300
|
+
|
|
301
|
+
### 7.1 How Caching Works for the Toolkit
|
|
302
|
+
|
|
303
|
+
The toolkit's `claude -p` calls have a highly cacheable prefix:
|
|
304
|
+
|
|
305
|
+
| Component | Tokens | Cacheable? | Cache Hit Rate |
|
|
306
|
+
|-----------|--------|-----------|---------------|
|
|
307
|
+
| System prompt | ~2,000 | Yes | ~100% across batches |
|
|
308
|
+
| CLAUDE.md chain (3 files) | ~4,000 | Yes | ~100% across batches |
|
|
309
|
+
| Tool definitions | ~2,000 | Yes | ~100% across batches |
|
|
310
|
+
| AGENTS.md (per-worktree) | ~1,000 | Yes | ~100% across batches |
|
|
311
|
+
| Plan text (current batch) | ~2,000 | No | 0% (changes per batch) |
|
|
312
|
+
| Context injection | ~1,500 | No | 0% (changes per batch) |
|
|
313
|
+
| File contents read during execution | ~20,000 | Partial | ~50% (some files repeated) |
|
|
314
|
+
| **Cacheable total** | **~9,000** | | |
|
|
315
|
+
| **Non-cacheable total** | **~24,500** | | |
|
|
316
|
+
|
|
317
|
+
**Effective cache rate:** ~27% of input tokens are cacheable across batches (the static prefix). Within a batch with multiple tool calls, the entire conversation so far is cacheable for each subsequent turn, pushing effective rates to 60-80%.
|
|
318
|
+
|
|
319
|
+
### 7.2 Cache Priming for Parallel Agents
|
|
320
|
+
|
|
321
|
+
The MAB round 2 research identified a critical pattern: when two agents launch simultaneously with uncached content, both pay write costs independently. The fix is a "prime the cache" call:
|
|
322
|
+
|
|
323
|
+
1. Send a single API call with the shared prefix (system prompt + CLAUDE.md + tools + design doc + PRD)
|
|
324
|
+
2. This call creates the cache entry (costs 1.25x input)
|
|
325
|
+
3. Both parallel agents then get cache-read pricing (0.1x input) on the shared prefix
|
|
326
|
+
|
|
327
|
+
**Savings per MAB batch:**
|
|
328
|
+
- Without priming: 2 × cache write = 2 × 1.25x × $3.00/MTok × 9K tokens = $0.0675
|
|
329
|
+
- With priming: 1 × cache write + 2 × cache read = 1.25x × $3.00/MTok × 9K + 2 × 0.1x × $3.00/MTok × 9K = $0.034 + $0.0054 = $0.039
|
|
330
|
+
- Savings: $0.028 per batch, or ~42% of the cache-related costs
|
|
331
|
+
|
|
332
|
+
This is small in absolute terms but compounds: over a 26-task MAB plan, it saves ~$0.73.
|
|
333
|
+
|
|
334
|
+
### 7.3 Batch API for Non-Interactive Work
|
|
335
|
+
|
|
336
|
+
The Batch API offers 50% off everything with up to 24-hour latency (usually under 1 hour). This is directly applicable to the toolkit's headless mode — `claude -p` calls are already non-interactive.
|
|
337
|
+
|
|
338
|
+
**Current barrier:** The toolkit uses `claude -p` (CLI), not the Batch API directly. Converting to Batch API would require:
|
|
339
|
+
1. Constructing API requests as JSON
|
|
340
|
+
2. Submitting batches via `curl` or a thin wrapper
|
|
341
|
+
3. Polling for completion
|
|
342
|
+
4. Parsing results
|
|
343
|
+
|
|
344
|
+
**Potential savings:** 50% across the board. A 6-batch headless feature drops from $1.76 to $0.88 (cached + batched).
|
|
345
|
+
|
|
346
|
+
---
|
|
347
|
+
|
|
348
|
+
## 8. Economics of Retry
|
|
349
|
+
|
|
350
|
+
### 8.1 Retry Cost Model
|
|
351
|
+
|
|
352
|
+
Each retry is a full API call — no discount for "trying again." The retry includes:
|
|
353
|
+
- All original context (system prompt, tools, plan)
|
|
354
|
+
- Additional context: previous failure log (~2,000 tokens)
|
|
355
|
+
- The model's new attempt (full output token cost)
|
|
356
|
+
|
|
357
|
+
**Cost per retry = base batch cost + ~10% overhead for failure context.**
|
|
358
|
+
|
|
359
|
+
### 8.2 Expected Retry Costs
|
|
360
|
+
|
|
361
|
+
| Scenario | P(success) | E[retries] | E[cost] per batch | vs. Single-shot |
|
|
362
|
+
|----------|-----------|-----------|-------------------|----------------|
|
|
363
|
+
| Sonnet, well-specified | 90% | 0.11 | $0.14 | +8% |
|
|
364
|
+
| Sonnet, complex integration | 70% | 0.43 | $0.19 | +46% |
|
|
365
|
+
| Haiku, simple task | 80% | 0.25 | $0.05 | +25% |
|
|
366
|
+
| Haiku, moderate task | 50% | 1.00 | $0.08 | +100% |
|
|
367
|
+
|
|
368
|
+
Expected retries formula: E[retries] = (1 - p) / p for geometric distribution, capped at max_retries (typically 3).
|
|
369
|
+
|
|
370
|
+
### 8.3 When to Retry vs. Escalate
|
|
371
|
+
|
|
372
|
+
**Current behavior:** Retry same model with more context (failure log appended).
|
|
373
|
+
**Better behavior:** Escalate model tier after first failure.
|
|
374
|
+
|
|
375
|
+
| Strategy | Avg cost/batch (complex task) | Success rate |
|
|
376
|
+
|----------|------------------------------|-------------|
|
|
377
|
+
| Retry same model (3x Sonnet) | $0.39 (3 × $0.13) | ~97% |
|
|
378
|
+
| Escalate (Sonnet → Opus) | $0.35 ($0.13 + $0.22) | ~98.5% |
|
|
379
|
+
| Escalate (Haiku → Sonnet → Opus) | $0.39 ($0.04 + $0.13 + $0.22) | ~99% |
|
|
380
|
+
|
|
381
|
+
**Finding:** Escalation is slightly cheaper than retry-at-same-tier for complex tasks because the higher-tier model is more likely to succeed on attempt 1, avoiding the cost of a third attempt. The quality improvement is marginal (97% vs 98.5%) but the cost structure is better.
|
|
382
|
+
|
|
383
|
+
---
|
|
384
|
+
|
|
385
|
+
## 9. Commercial AI Coding Tool Pricing
|
|
386
|
+
|
|
387
|
+
### 9.1 Pricing Comparison
|
|
388
|
+
|
|
389
|
+
| Tool | Pricing Model | Monthly Cost | $/Hour of Work | Notes |
|
|
390
|
+
|------|--------------|-------------|---------------|-------|
|
|
391
|
+
| **Devin** (Core) | $20/mo + $2.25/ACU | $20+ | ~$9.00/hr | 1 ACU = ~15 min work |
|
|
392
|
+
| **Devin** (Team) | $500/mo + $2.00/ACU | $500+ | ~$8.00/hr | 250 ACUs included |
|
|
393
|
+
| **Cursor** (Pro) | $20/mo | $20 | ~$0.09/request | ~225 requests/mo with Claude |
|
|
394
|
+
| **Cursor** (Ultra) | $200/mo | $200 | ~$0.05/request | 20x capacity |
|
|
395
|
+
| **GitHub Copilot** (Pro) | $10/mo | $10 | $0.04/overage | 300 premium requests |
|
|
396
|
+
| **GitHub Copilot** (Pro+) | $39/mo | $39 | $0.04/overage | 1,500 premium requests |
|
|
397
|
+
| **Toolkit** (API direct) | Pay-per-token | $0-270/mo | ~$0.29/batch | Depends entirely on usage |
|
|
398
|
+
|
|
399
|
+
### 9.2 Cost-Effectiveness Comparison
|
|
400
|
+
|
|
401
|
+
For a developer running 5 features/week (6 batches each = 30 batches/week):
|
|
402
|
+
|
|
403
|
+
| Tool | Monthly Cost | Autonomous? | Quality Gates? |
|
|
404
|
+
|------|-------------|------------|---------------|
|
|
405
|
+
| Toolkit (headless, cached) | ~$35 | Yes | Yes (built-in) |
|
|
406
|
+
| Toolkit (MAB, cached) | ~$67 | Yes | Yes + competitive evaluation |
|
|
407
|
+
| Devin (equivalent work) | ~$360-720 | Yes | Limited (proprietary) |
|
|
408
|
+
| Cursor Pro | $20 (capped) | No (interactive) | No (manual) |
|
|
409
|
+
| Copilot Pro | $10 (capped) | Partial (agent mode) | No (manual) |
|
|
410
|
+
|
|
411
|
+
**Finding:** The toolkit is the cheapest option for autonomous batch execution. Commercial tools are cheaper for interactive use (fixed monthly fee) but don't support headless autonomous operation with quality gates.
|
|
412
|
+
|
|
413
|
+
### 9.3 What You're Paying For
|
|
414
|
+
|
|
415
|
+
| Capability | Toolkit | Devin | Cursor | Copilot |
|
|
416
|
+
|-----------|---------|-------|--------|---------|
|
|
417
|
+
| Autonomous execution | Yes | Yes | No | Partial |
|
|
418
|
+
| Quality gates | Yes | No | No | No |
|
|
419
|
+
| Fresh context per batch | Yes | Unknown | No | No |
|
|
420
|
+
| Model routing | Yes | No | Yes (credit-weighted) | Yes (model selection) |
|
|
421
|
+
| Cost transparency | Yes (API direct) | ACU-abstracted | Credit-abstracted | Request-abstracted |
|
|
422
|
+
| UX/IDE integration | No (CLI) | Web UI | VS Code | VS Code/GitHub |
|
|
423
|
+
|
|
424
|
+
---
|
|
425
|
+
|
|
426
|
+
## 10. Cost Model for the Autonomous Coding Toolkit
|
|
427
|
+
|
|
428
|
+
### 10.1 Per-Batch Cost Calculator
|
|
429
|
+
|
|
430
|
+
```
|
|
431
|
+
batch_cost = (input_tokens × input_rate × cache_factor) + (output_tokens × output_rate)
|
|
432
|
+
|
|
433
|
+
Where:
|
|
434
|
+
input_rate:
|
|
435
|
+
haiku: $1.00/MTok
|
|
436
|
+
sonnet: $3.00/MTok
|
|
437
|
+
opus: $5.00/MTok
|
|
438
|
+
|
|
439
|
+
output_rate:
|
|
440
|
+
haiku: $5.00/MTok
|
|
441
|
+
sonnet: $15.00/MTok
|
|
442
|
+
opus: $25.00/MTok
|
|
443
|
+
|
|
444
|
+
cache_factor:
|
|
445
|
+
uncached: 1.0
|
|
446
|
+
first call (write): 1.25
|
|
447
|
+
subsequent (read): 0.1
|
|
448
|
+
effective (80% cache hit): 0.28
|
|
449
|
+
|
|
450
|
+
Typical batch:
|
|
451
|
+
input_tokens: 33,500
|
|
452
|
+
output_tokens: 15,000
|
|
453
|
+
```
|
|
454
|
+
|
|
455
|
+
### 10.2 Reference Cost Table
|
|
456
|
+
|
|
457
|
+
All costs in USD per batch, assuming typical token consumption:
|
|
458
|
+
|
|
459
|
+
| Configuration | Sonnet (uncached) | Sonnet (cached) | Haiku (cached) | Opus (cached) |
|
|
460
|
+
|--------------|------------------|-----------------|----------------|--------------|
|
|
461
|
+
| Headless (1 call) | $0.33 | $0.13 | $0.04 | $0.22 |
|
|
462
|
+
| Team (2 calls) | $0.65 | $0.26 | $0.09 | $0.43 |
|
|
463
|
+
| Competitive (2+judge) | $0.77 | $0.31 | $0.12 | $0.52 |
|
|
464
|
+
| With 1 retry | $0.46 | $0.18 | $0.06 | $0.30 |
|
|
465
|
+
| With 2 retries | $0.59 | $0.23 | $0.07 | $0.39 |
|
|
466
|
+
|
|
467
|
+
### 10.3 Full Pipeline Cost Table
|
|
468
|
+
|
|
469
|
+
| Pipeline Configuration | 6-Batch Feature | 12-Batch Feature | 26-Batch Sprint |
|
|
470
|
+
|----------------------|----------------|-----------------|----------------|
|
|
471
|
+
| Headless, all Sonnet, cached | $1.60 | $2.40 | $4.20 |
|
|
472
|
+
| Headless, routed, cached | $1.52 | $2.24 | $3.90 |
|
|
473
|
+
| MAB on all batches, cached | $3.35 | $5.50 | $10.40 |
|
|
474
|
+
| MAB selective (30% MAB), cached | $2.12 | $3.40 | $6.10 |
|
|
475
|
+
| Headless, all Sonnet, uncached | $6.50 | $10.00 | $18.00 |
|
|
476
|
+
|
|
477
|
+
### 10.4 Monthly Budget Estimates
|
|
478
|
+
|
|
479
|
+
For a solo developer using the toolkit full-time (20 features/month, 6 batches avg):
|
|
480
|
+
|
|
481
|
+
| Strategy | Monthly API Cost | Annual |
|
|
482
|
+
|----------|-----------------|--------|
|
|
483
|
+
| Conservative (headless, cached, routed) | $30 | $365 |
|
|
484
|
+
| Balanced (headless + selective MAB, cached) | $42 | $510 |
|
|
485
|
+
| Aggressive (MAB everything, cached) | $67 | $804 |
|
|
486
|
+
| Uncached baseline | $130 | $1,560 |
|
|
487
|
+
|
|
488
|
+
---
|
|
489
|
+
|
|
490
|
+
## 11. Recommendations
|
|
491
|
+
|
|
492
|
+
### Priority-ordered by impact:
|
|
493
|
+
|
|
494
|
+
1. **Implement prompt caching immediately.** 83% cost reduction, zero quality tradeoff. This is the single highest-ROI optimization. Ensure the CLAUDE.md chain, system prompt, and tool definitions are in the cacheable prefix of every `claude -p` call.
|
|
495
|
+
|
|
496
|
+
2. **Add cost tracking per batch.** Record `{model, input_tokens, output_tokens, cache_hits, cost, passed}` to `.run-plan-state.json`. Without this data, all cost optimization is guesswork. This is prerequisite to every other recommendation.
|
|
497
|
+
|
|
498
|
+
3. **Keep Sonnet as default.** The SWE-bench data shows Sonnet 4.5/4.6 is 95% of Opus quality at 60% of the price. The 4.6-generation Opus price drop (from 5x to 1.67x Sonnet) makes Opus more tempting, but Sonnet remains the cost-per-success sweet spot for implementation tasks.
|
|
499
|
+
|
|
500
|
+
4. **Implement model escalation on retry.** Instead of retrying the same model with more context, escalate: Haiku → Sonnet → Opus. This is cheaper than 3x same-model retry and has a higher cumulative success rate.
|
|
501
|
+
|
|
502
|
+
5. **Use selective MAB, not universal MAB.** Run competitive mode on integration batches, first-time batch types, and historically-flaky batch types. Route known-easy batches to headless. Target 30% MAB rate for optimal cost/learning balance.
|
|
503
|
+
|
|
504
|
+
6. **Cache-prime before parallel dispatch.** When running MAB or team mode, fire a single "warm the cache" call with the shared prefix before launching parallel agents. Saves ~42% of cache-related costs.
|
|
505
|
+
|
|
506
|
+
7. **Evaluate Batch API for overnight runs.** For non-urgent features (entropy audits, batch-audit.sh, auto-compound.sh overnight), the Batch API's 50% discount is free money. Requires thin wrapper around `curl` to submit and poll.
|
|
507
|
+
|
|
508
|
+
8. **Expand Haiku eligibility.** Currently only verification-only batches get Haiku. Add: test-only batches, config/documentation updates, simple file renames. Each Haiku-eligible batch saves $0.09 vs Sonnet (cached).
|
|
509
|
+
|
|
510
|
+
### What NOT to optimize:
|
|
511
|
+
|
|
512
|
+
- **Don't chase multi-provider routing.** Sending verification batches to Gemini Flash would save ~$0.03/batch but requires significant architecture changes. Not worth it at current scale.
|
|
513
|
+
- **Don't use Opus for everything.** The 1.67x cost premium over Sonnet is not justified by the 5% quality improvement for standard implementation tasks.
|
|
514
|
+
- **Don't skip quality gates to save money.** Quality gates are bash scripts with zero API cost. They prevent the most expensive failure mode: subtle bugs that ship and require full rework cycles.
|
|
515
|
+
|
|
516
|
+
---
|
|
517
|
+
|
|
518
|
+
## Sources
|
|
519
|
+
|
|
520
|
+
### Pricing (Official)
|
|
521
|
+
- [Anthropic Claude API Pricing](https://platform.claude.com/docs/en/about-claude/pricing)
|
|
522
|
+
- [OpenAI API Pricing](https://platform.openai.com/docs/pricing)
|
|
523
|
+
- [Google Gemini API Pricing](https://ai.google.dev/gemini-api/docs/pricing)
|
|
524
|
+
- [Devin AI Pricing](https://devin.ai/pricing)
|
|
525
|
+
- [GitHub Copilot Plans](https://github.com/features/copilot/plans)
|
|
526
|
+
- [Cursor Pricing](https://cursor.com/pricing)
|
|
527
|
+
|
|
528
|
+
### Benchmarks & Performance
|
|
529
|
+
- [SWE-bench Verified Leaderboard](https://llm-stats.com/benchmarks/swe-bench-verified)
|
|
530
|
+
- [SWE-rebench Leaderboard](https://swe-rebench.com) (cost-per-task data)
|
|
531
|
+
- [Vellum LLM Leaderboard](https://www.vellum.ai/llm-leaderboard)
|
|
532
|
+
- [Claude Sonnet 4.5 Benchmarks](https://www.leanware.co/insights/claude-sonnet-4-5-overview)
|
|
533
|
+
|
|
534
|
+
### Caching & Optimization
|
|
535
|
+
- [Anthropic Prompt Caching Docs](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
|
|
536
|
+
- [Anthropic Batch Processing Docs](https://platform.claude.com/docs/en/build-with-claude/batch-processing)
|
|
537
|
+
- [Claude Code Cost Management](https://code.claude.com/docs/en/costs)
|
|
538
|
+
|
|
539
|
+
### Research Papers
|
|
540
|
+
- [Unified Routing and Cascading for LLMs — ICLR 2025](https://arxiv.org/abs/2410.10347)
|
|
541
|
+
- [Hybrid LLM: Cost-Efficient Quality-Aware — ICLR 2024](https://proceedings.iclr.cc/paper_files/paper/2024/file/b47d93c99fa22ac0b377578af0a1f63a-Paper-Conference.pdf)
|
|
542
|
+
- [C3PO: Optimized LLM Cascades — 2025](https://arxiv.org/pdf/2511.07396)
|
|
543
|
+
- [Why Multi-Agent LLM Systems Fail — 2025](https://arxiv.org/pdf/2503.13657)
|
|
544
|
+
|
|
545
|
+
### Internal References
|
|
546
|
+
- [MAB Research Round 2](/home/justin/Documents/projects/autonomous-coding-toolkit/docs/plans/2026-02-22-mab-research-round2.md) — cost economics, cache priming pattern
|
|
547
|
+
- [Architecture](/home/justin/Documents/projects/autonomous-coding-toolkit/docs/ARCHITECTURE.md) — execution modes, quality gates
|
|
548
|
+
- [Run-Plan Routing](/home/justin/Documents/projects/autonomous-coding-toolkit/scripts/lib/run-plan-routing.sh) — model classification logic
|