autonomous-coding-toolkit 1.0.0
- package/.claude-plugin/marketplace.json +22 -0
- package/.claude-plugin/plugin.json +13 -0
- package/LICENSE +21 -0
- package/Makefile +21 -0
- package/README.md +140 -0
- package/SECURITY.md +28 -0
- package/agents/bash-expert.md +113 -0
- package/agents/dependency-auditor.md +138 -0
- package/agents/integration-tester.md +120 -0
- package/agents/lesson-scanner.md +149 -0
- package/agents/python-expert.md +179 -0
- package/agents/service-monitor.md +141 -0
- package/agents/shell-expert.md +147 -0
- package/benchmarks/runner.sh +147 -0
- package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
- package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
- package/benchmarks/tasks/02-refactor-module/task.md +8 -0
- package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
- package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
- package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
- package/bin/act.js +238 -0
- package/commands/autocode.md +6 -0
- package/commands/cancel-ralph.md +18 -0
- package/commands/code-factory.md +53 -0
- package/commands/create-prd.md +55 -0
- package/commands/ralph-loop.md +18 -0
- package/commands/run-plan.md +117 -0
- package/commands/submit-lesson.md +122 -0
- package/docs/ARCHITECTURE.md +630 -0
- package/docs/CONTRIBUTING.md +125 -0
- package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
- package/docs/lessons/0002-async-def-without-await.md +28 -0
- package/docs/lessons/0003-create-task-without-callback.md +28 -0
- package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
- package/docs/lessons/0005-sqlite-without-closing.md +33 -0
- package/docs/lessons/0006-venv-pip-path.md +27 -0
- package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
- package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
- package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
- package/docs/lessons/0010-local-outside-function-bash.md +33 -0
- package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
- package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
- package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
- package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
- package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
- package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
- package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
- package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
- package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
- package/docs/lessons/0020-persist-state-incrementally.md +44 -0
- package/docs/lessons/0021-dual-axis-testing.md +48 -0
- package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
- package/docs/lessons/0023-static-analysis-spiral.md +51 -0
- package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
- package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
- package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
- package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
- package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
- package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
- package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
- package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
- package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
- package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
- package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
- package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
- package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
- package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
- package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
- package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
- package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
- package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
- package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
- package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
- package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
- package/docs/lessons/0045-iterative-design-improvement.md +33 -0
- package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
- package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
- package/docs/lessons/0048-integration-wiring-batch.md +40 -0
- package/docs/lessons/0049-ab-verification.md +41 -0
- package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
- package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
- package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
- package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
- package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
- package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
- package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
- package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
- package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
- package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
- package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
- package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
- package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
- package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
- package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
- package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
- package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
- package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
- package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
- package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
- package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
- package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
- package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
- package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
- package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
- package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
- package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
- package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
- package/docs/lessons/0078-static-review-without-live-test.md +30 -0
- package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
- package/docs/lessons/FRAMEWORK.md +161 -0
- package/docs/lessons/SUMMARY.md +201 -0
- package/docs/lessons/TEMPLATE.md +85 -0
- package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
- package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
- package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
- package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
- package/docs/plans/2026-02-21-mab-research-report.md +406 -0
- package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
- package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
- package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
- package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
- package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
- package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
- package/docs/plans/2026-02-22-mab-run-design.md +462 -0
- package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
- package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
- package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
- package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
- package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
- package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
- package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
- package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
- package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
- package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
- package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
- package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
- package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
- package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
- package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
- package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
- package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
- package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
- package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
- package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
- package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
- package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
- package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
- package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
- package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
- package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
- package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
- package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
- package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
- package/docs/plans/2026-02-24-headless-module-split.md +443 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
- package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
- package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
- package/docs/plans/audit-findings.md +186 -0
- package/docs/telegram-notification-format.md +98 -0
- package/examples/example-plan.md +51 -0
- package/examples/example-prd.json +72 -0
- package/examples/example-roadmap.md +33 -0
- package/examples/quickstart-plan.md +63 -0
- package/hooks/hooks.json +26 -0
- package/hooks/setup-symlinks.sh +48 -0
- package/hooks/stop-hook.sh +135 -0
- package/package.json +47 -0
- package/policies/bash.md +71 -0
- package/policies/python.md +71 -0
- package/policies/testing.md +61 -0
- package/policies/universal.md +60 -0
- package/scripts/analyze-report.sh +97 -0
- package/scripts/architecture-map.sh +145 -0
- package/scripts/auto-compound.sh +273 -0
- package/scripts/batch-audit.sh +42 -0
- package/scripts/batch-test.sh +101 -0
- package/scripts/entropy-audit.sh +221 -0
- package/scripts/failure-digest.sh +51 -0
- package/scripts/generate-ast-rules.sh +96 -0
- package/scripts/init.sh +112 -0
- package/scripts/lesson-check.sh +428 -0
- package/scripts/lib/common.sh +61 -0
- package/scripts/lib/cost-tracking.sh +153 -0
- package/scripts/lib/ollama.sh +60 -0
- package/scripts/lib/progress-writer.sh +128 -0
- package/scripts/lib/run-plan-context.sh +215 -0
- package/scripts/lib/run-plan-echo-back.sh +231 -0
- package/scripts/lib/run-plan-headless.sh +396 -0
- package/scripts/lib/run-plan-notify.sh +57 -0
- package/scripts/lib/run-plan-parser.sh +81 -0
- package/scripts/lib/run-plan-prompt.sh +215 -0
- package/scripts/lib/run-plan-quality-gate.sh +132 -0
- package/scripts/lib/run-plan-routing.sh +315 -0
- package/scripts/lib/run-plan-sampling.sh +170 -0
- package/scripts/lib/run-plan-scoring.sh +146 -0
- package/scripts/lib/run-plan-state.sh +142 -0
- package/scripts/lib/run-plan-team.sh +199 -0
- package/scripts/lib/telegram.sh +54 -0
- package/scripts/lib/thompson-sampling.sh +176 -0
- package/scripts/license-check.sh +74 -0
- package/scripts/mab-run.sh +575 -0
- package/scripts/module-size-check.sh +146 -0
- package/scripts/patterns/async-no-await.yml +5 -0
- package/scripts/patterns/bare-except.yml +6 -0
- package/scripts/patterns/empty-catch.yml +6 -0
- package/scripts/patterns/hardcoded-localhost.yml +9 -0
- package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
- package/scripts/pipeline-status.sh +197 -0
- package/scripts/policy-check.sh +226 -0
- package/scripts/prior-art-search.sh +133 -0
- package/scripts/promote-mab-lessons.sh +126 -0
- package/scripts/prompts/agent-a-superpowers.md +29 -0
- package/scripts/prompts/agent-b-ralph.md +29 -0
- package/scripts/prompts/judge-agent.md +61 -0
- package/scripts/prompts/planner-agent.md +44 -0
- package/scripts/pull-community-lessons.sh +90 -0
- package/scripts/quality-gate.sh +266 -0
- package/scripts/research-gate.sh +90 -0
- package/scripts/run-plan.sh +329 -0
- package/scripts/scope-infer.sh +159 -0
- package/scripts/setup-ralph-loop.sh +155 -0
- package/scripts/telemetry.sh +230 -0
- package/scripts/tests/run-all-tests.sh +52 -0
- package/scripts/tests/test-act-cli.sh +46 -0
- package/scripts/tests/test-agents-md.sh +87 -0
- package/scripts/tests/test-analyze-report.sh +114 -0
- package/scripts/tests/test-architecture-map.sh +89 -0
- package/scripts/tests/test-auto-compound.sh +169 -0
- package/scripts/tests/test-batch-test.sh +65 -0
- package/scripts/tests/test-benchmark-runner.sh +25 -0
- package/scripts/tests/test-common.sh +168 -0
- package/scripts/tests/test-cost-tracking.sh +158 -0
- package/scripts/tests/test-echo-back.sh +180 -0
- package/scripts/tests/test-entropy-audit.sh +146 -0
- package/scripts/tests/test-failure-digest.sh +66 -0
- package/scripts/tests/test-generate-ast-rules.sh +145 -0
- package/scripts/tests/test-helpers.sh +82 -0
- package/scripts/tests/test-init.sh +47 -0
- package/scripts/tests/test-lesson-check.sh +278 -0
- package/scripts/tests/test-lesson-local.sh +55 -0
- package/scripts/tests/test-license-check.sh +109 -0
- package/scripts/tests/test-mab-run.sh +182 -0
- package/scripts/tests/test-ollama-lib.sh +49 -0
- package/scripts/tests/test-ollama.sh +60 -0
- package/scripts/tests/test-pipeline-status.sh +198 -0
- package/scripts/tests/test-policy-check.sh +124 -0
- package/scripts/tests/test-prior-art-search.sh +96 -0
- package/scripts/tests/test-progress-writer.sh +140 -0
- package/scripts/tests/test-promote-mab-lessons.sh +110 -0
- package/scripts/tests/test-pull-community-lessons.sh +149 -0
- package/scripts/tests/test-quality-gate.sh +241 -0
- package/scripts/tests/test-research-gate.sh +132 -0
- package/scripts/tests/test-run-plan-cli.sh +86 -0
- package/scripts/tests/test-run-plan-context.sh +305 -0
- package/scripts/tests/test-run-plan-e2e.sh +153 -0
- package/scripts/tests/test-run-plan-headless.sh +424 -0
- package/scripts/tests/test-run-plan-notify.sh +124 -0
- package/scripts/tests/test-run-plan-parser.sh +217 -0
- package/scripts/tests/test-run-plan-prompt.sh +254 -0
- package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
- package/scripts/tests/test-run-plan-routing.sh +178 -0
- package/scripts/tests/test-run-plan-scoring.sh +148 -0
- package/scripts/tests/test-run-plan-state.sh +261 -0
- package/scripts/tests/test-run-plan-team.sh +157 -0
- package/scripts/tests/test-scope-infer.sh +150 -0
- package/scripts/tests/test-setup-ralph-loop.sh +63 -0
- package/scripts/tests/test-telegram-env.sh +38 -0
- package/scripts/tests/test-telegram.sh +121 -0
- package/scripts/tests/test-telemetry.sh +46 -0
- package/scripts/tests/test-thompson-sampling.sh +139 -0
- package/scripts/tests/test-validate-all.sh +60 -0
- package/scripts/tests/test-validate-commands.sh +89 -0
- package/scripts/tests/test-validate-hooks.sh +98 -0
- package/scripts/tests/test-validate-lessons.sh +150 -0
- package/scripts/tests/test-validate-plan-quality.sh +235 -0
- package/scripts/tests/test-validate-plans.sh +187 -0
- package/scripts/tests/test-validate-plugin.sh +106 -0
- package/scripts/tests/test-validate-prd.sh +184 -0
- package/scripts/tests/test-validate-skills.sh +134 -0
- package/scripts/validate-all.sh +57 -0
- package/scripts/validate-commands.sh +67 -0
- package/scripts/validate-hooks.sh +89 -0
- package/scripts/validate-lessons.sh +98 -0
- package/scripts/validate-plan-quality.sh +369 -0
- package/scripts/validate-plans.sh +120 -0
- package/scripts/validate-plugin.sh +86 -0
- package/scripts/validate-policies.sh +42 -0
- package/scripts/validate-prd.sh +118 -0
- package/scripts/validate-skills.sh +96 -0
- package/skills/autocode/SKILL.md +285 -0
- package/skills/autocode/ab-verification.md +51 -0
- package/skills/autocode/code-quality-standards.md +37 -0
- package/skills/autocode/competitive-mode.md +364 -0
- package/skills/brainstorming/SKILL.md +97 -0
- package/skills/capture-lesson/SKILL.md +187 -0
- package/skills/check-lessons/SKILL.md +116 -0
- package/skills/dispatching-parallel-agents/SKILL.md +110 -0
- package/skills/executing-plans/SKILL.md +85 -0
- package/skills/finishing-a-development-branch/SKILL.md +201 -0
- package/skills/receiving-code-review/SKILL.md +72 -0
- package/skills/requesting-code-review/SKILL.md +59 -0
- package/skills/requesting-code-review/code-reviewer.md +82 -0
- package/skills/research/SKILL.md +145 -0
- package/skills/roadmap/SKILL.md +115 -0
- package/skills/subagent-driven-development/SKILL.md +98 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
- package/skills/subagent-driven-development/implementer-prompt.md +73 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
- package/skills/systematic-debugging/SKILL.md +134 -0
- package/skills/systematic-debugging/condition-based-waiting.md +64 -0
- package/skills/systematic-debugging/defense-in-depth.md +32 -0
- package/skills/systematic-debugging/root-cause-tracing.md +55 -0
- package/skills/test-driven-development/SKILL.md +167 -0
- package/skills/using-git-worktrees/SKILL.md +219 -0
- package/skills/using-superpowers/SKILL.md +54 -0
- package/skills/verification-before-completion/SKILL.md +140 -0
- package/skills/verify/SKILL.md +82 -0
- package/skills/writing-plans/SKILL.md +128 -0
- package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,556 @@

# Multi-Armed Bandit System: Research Report — Round 2

**Date:** 2026-02-22
**Status:** Research complete
**Scope:** Cost modeling, testing strategies, cross-domain analogies, coder toolkit workflow analysis, latent bugs
**Builds on:** `docs/plans/2026-02-21-mab-research-report.md` (Round 1)

---

## Executive Summary

Round 2 research expands beyond ML/AI literature into seven cross-domain analogies (chess tournaments, evolutionary biology, competitive programming, manufacturing dual-sourcing, adversarial collaboration, forecasting tournaments, ensemble methods), plus deep analysis of cost economics, testing methodology, and the full coder toolkit workflow. Key findings:

1. **Cost is manageable:** Two parallel agents cost ~$1.88-2.38 per task with prompt caching (83% reduction vs. uncached). Cache priming before parallel dispatch is the single biggest cost lever.
2. **Testing MAB requires synthetic bandits, not just integration tests.** Simulation with known ground truth, seeded randomness, and distribution-level assertions — not output equality.
3. **Three cross-domain patterns emerged independently across all seven analogies:** locked criteria before evaluation, diversity as signal, and discriminating starting conditions.
4. **The coder toolkit workflow has 8 latent issues** that should be fixed before or alongside MAB implementation, including a state schema mismatch that silently returns wrong test counts.
5. **The stop-hook/ralph-loop mechanism adapts naturally for MAB Agent B** — set up ralph-loop state in Agent B's worktree before `claude -p` launch.

**Action items for the revised implementation plan:** Fix Gap 6 (state schema bug), fix Gap 7 (JSON extraction fragility), wire planner into auto-compound.sh, and add cache-prime step before parallel agent dispatch.

---

## 1. Cost Economics

### 1.1 Concrete Pricing

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|------------------------|
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| Any model, >200K context | $6.00 input | $22.50 output |

**Real-world per-task costs (SWE-bench, from swe-rebench.com):**

| Agent/Model | Cost per Task | Tokens per Task | Resolved Rate |
|-------------|--------------|-----------------|---------------|
| Claude Sonnet 4.5 | $0.94 | ~1.9M | 47.1% |
| Claude Opus 4.6 | $0.93 | ~1.0M | 51.7% |
| Claude Code (product) | $3.50 | ~2.1M | 52.9% |

**Agent teams multiplier:** Anthropic's docs state teams use ~7x more tokens than single-agent sessions. Two parallel agents = ~2x per-agent cost with no automatic context sharing.

### 1.2 The Cache Priming Pattern

**Critical finding:** Claude Sonnet dropped from $5.29 to $0.91 per task with prompt caching — an 83% reduction. Cache reads cost 0.1x input price; cache writes cost 1.25x input price (one-time).

**Parallel agent gotcha:** When two agents fire simultaneously on uncached content, both create independent caches, doubling write costs and getting zero read savings.

**Fix:** Fire a single "prime the cache" call first with the shared context (system prompt + design doc + PRD + codebase summary), then launch both agents. Both agents get cache-read pricing on the shared prefix.

**Concrete cost model for MAB per batch:**

| Scenario | Cost per batch (2 agents) | 6-batch plan total |
|----------|--------------------------|-------------------|
| No caching | ~$5.29 × 2 = $10.58 | ~$63.48 |
| With cache priming | ~$0.94 × 2 = $1.88 | ~$11.28 |
| Single agent (no MAB) | ~$0.94 × 1 = $0.94 | ~$5.64 |

**Bottom line:** MAB doubles cost vs. single agent, but cache priming keeps it under $2/batch. The real cost concern is not per-batch — it's the judge call (~$0.50-1.00 additional per batch for evaluation).
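
The priming arithmetic can be sketched as a small cost model. This is an illustrative calculation, not measured data: the prices follow the Sonnet row in 1.1 ($3/M input, $15/M output, cache reads at 0.1x and writes at 1.25x input price), and the token counts per turn are assumptions.

```python
# Illustrative cost model for the cache-priming pattern. Prices come from
# the Sonnet row above; all token counts are assumptions for illustration.
INPUT, OUTPUT = 3.00 / 1e6, 15.00 / 1e6          # $ per token
CACHE_READ, CACHE_WRITE = 0.1 * INPUT, 1.25 * INPUT

def batch_cost(shared, per_turn_in, per_turn_out, agents=2, turns=20, primed=True):
    # With priming: one cache write for the shared prefix, then every agent
    # turn re-reads it at the cheap cache-read rate. Without it, every turn
    # pays full input price on the whole prefix.
    write = shared * CACHE_WRITE if primed else 0.0
    rate = CACHE_READ if primed else INPUT
    prefix = agents * turns * shared * rate
    work = agents * turns * (per_turn_in * INPUT + per_turn_out * OUTPUT)
    return write + prefix + work

# Assumed: 150K-token shared prefix, 20 turns per agent, 2K in / 1K out per turn.
primed = batch_cost(150_000, 2_000, 1_000, primed=True)
unprimed = batch_cost(150_000, 2_000, 1_000, primed=False)
print(f"primed: ${primed:.2f}  unprimed: ${unprimed:.2f}")
```

Under these assumptions the saving lands near the 83% figure quoted above; the dominant term is the repeated re-read of the shared prefix, which is exactly what priming converts to cache-read pricing.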

### 1.3 Cost-Aware Thompson Sampling

Academic research formalizes "budgeted MAB" as a distinct problem class (UCB-B, Budget-UCB). Key techniques:

- **Cost-weighted priors:** Track `reward / cost` per arm, not just `reward`. Naturally deprioritizes expensive arms (Opus + extended thinking) unless they demonstrably outperform by more than the cost ratio.
- **Decaying violation budget:** Permit limited overspend early in learning, enforce strict compliance later. Maps directly to: early MAB runs explore freely, later runs exploit proven winners.
- **Pivot trigger:** A budget threshold at which all remaining pulls go to the current best arm regardless of uncertainty. Prevents runaway exploration.

**Recommendation for Phase 1:** Track cost per arm alongside win/loss. Don't optimize for it yet, but capture the data.
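
A minimal sketch of the cost-weighted variant, assuming win/loss counts and cumulative cost are tracked per arm. The arm names, counts, and dollar figures here are fabricated for illustration, not an existing toolkit schema:

```python
import random

# Per-arm state: Beta posterior over win rate plus observed spend.
# All values are fabricated for illustration.
arms = {
    "sonnet":        {"wins": 8,  "losses": 4, "cost": 12.0},  # $ spent so far
    "opus-thinking": {"wins": 10, "losses": 2, "cost": 48.0},
}

def pick_arm(arms, rng=random):
    best, best_score = None, float("-inf")
    for name, a in arms.items():
        # Sample a plausible win rate from the Beta posterior...
        sample = rng.betavariate(a["wins"] + 1, a["losses"] + 1)
        # ...then score it per dollar, so an expensive arm is chosen only
        # when it outperforms by more than the cost ratio.
        avg_cost = a["cost"] / max(a["wins"] + a["losses"], 1)
        score = sample / avg_cost
        if score > best_score:
            best, best_score = name, score
    return best

random.seed(7)
print(pick_arm(arms))  # usually "sonnet": Opus wins more but costs 4x per pull
```

Exploration survives because the score is still a posterior sample, not a point estimate; the cost divisor just raises the bar for the expensive arm.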

### 1.4 Agentic Plan Caching

A newer technique (arXiv 2506.14852) caches structured plan templates across semantically similar tasks. Result: 46.62% average cost reduction while maintaining 96.67% of optimal performance. Relevant if MAB runs similar task types repeatedly.

---

## 2. Testing Strategy for the MAB System

### 2.1 Testing the Bandit Algorithm

**Technique 1: Synthetic Bandits**

Build a synthetic environment with known ground truth. Define a matrix of true arm reward probabilities, generate simulated outcomes, run the algorithm, and verify convergence.

```bash
# Test: Thompson Sampling converges to the better arm
# Ground truth: arm_a wins 70%, arm_b wins 40%
test_thompson_convergence() {
  # Run 1000 simulated rounds with fixed seed
  result=$(python3 -c "
import random
random.seed(42)
wins_a, losses_a, wins_b, losses_b = 0, 0, 0, 0
choices = []
for i in range(1000):
    sample_a = random.betavariate(wins_a+1, losses_a+1)
    sample_b = random.betavariate(wins_b+1, losses_b+1)
    if sample_a >= sample_b:
        choices.append('a')
        if random.random() < 0.7: wins_a += 1
        else: losses_a += 1
    else:
        choices.append('b')
        if random.random() < 0.4: wins_b += 1
        else: losses_b += 1
# Assert: arm_a selected >70% of last 200 rounds
print(choices[-200:].count('a') / 200)
")
  # Should be >0.70 with high probability
  assertTrue "$(echo "$result > 0.70" | bc -l)" "Thompson Sampling should converge to better arm"
}
```

**Technique 2: Offline Replay Evaluation**

Log all MAB decisions and outcomes to `logs/mab-run-*.json`. Replay logged events against a candidate policy to validate that new routing logic would have performed at least as well as the historical policy.

**Key testing principles for stochastic systems (from CMU SEI):**

- Fix random seed for reproducibility
- Assert on distribution properties, not specific outputs ("arm A selected >70% of last N rounds" not "arm A selected at round 47")
- Run 10-20 replicates as baseline for estimating distribution properties
- Use KS test or chi-squared to compare output distribution to expected
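
The replay idea can be sketched with the standard rejection-sampling estimator: only rounds where the candidate policy would have made the same choice as the logged one count toward the estimate. The event format below is hypothetical; the real `logs/mab-run-*.json` schema may differ.

```python
import random

# Hypothetical logged rounds; the real log schema may differ.
LOG = [
    {"chosen": "a", "won": True},
    {"chosen": "b", "won": False},
    {"chosen": "a", "won": True},
    {"chosen": "a", "won": False},
    {"chosen": "b", "won": True},
]

def thompson(state, rng):
    samples = {arm: rng.betavariate(w + 1, l + 1) for arm, (w, l) in state.items()}
    return max(samples, key=samples.get)

def replay(log, policy, rng):
    # Rejection-sampling replay: score a logged round only when the
    # candidate policy agrees with the historical choice.
    state = {"a": [0, 0], "b": [0, 0]}  # [wins, losses] per arm
    matched = wins = 0
    for event in log:
        if policy(state, rng) == event["chosen"]:
            matched += 1
            wins += event["won"]
            state[event["chosen"]][0 if event["won"] else 1] += 1
    return wins / matched if matched else None

print(replay(LOG, thompson, random.Random(42)))
```

Because matching is probabilistic, a real test would average the estimate over many seeded replicates, per the replicate principle above.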

### 2.2 Testing the LLM Judge

**Agreement rates from literature:**

| Context | Cohen's Kappa | Notes |
|---------|--------------|-------|
| Patch evaluation (clear cases) | 0.75 | High recall (0.94), precision (0.80) |
| Patch evaluation (full dataset) | 0.57 | Drops on ambiguous cases |
| Search query parsing | 0.807 → 0.639 | Position bias degrades kappa by 0.17 |
| RAG evaluation (filtered) | 0.781-0.816 | "Substantial to almost perfect" |
| Human inter-rater (developers on patches) | Fleiss 0.31 | Humans themselves are inconsistent |

**Validation protocol (before trusting automated routing):**

1. Build rubric collaboratively (LLM drafts, expert refines)
2. Run judge on a clear benchmark where humans unanimously agree
3. Require kappa >= 0.70 on the clear subset before deploying
4. Track NPV separately — LLM judges are more reliable on INVALID (0.94-0.95) than VALID
5. Measure self-consistency: same input, different seeds → same output?
6. If >30% of cases have human disagreement, switch from categorical metrics to distributional (Jensen-Shannon Divergence)

**Judge test plan for Phase 1:**

- Prepare 10 synthetic evaluation pairs (known-better vs known-worse diffs)
- Run judge on each pair twice (once A-first, once B-first) = 20 evaluations
- Assert: >80% correct winner identification
- Assert: position bias < 15% (win rate difference between first/second position)
- Assert: self-consistency > 85% (same winner when re-run with same order)
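
Given a table of judge verdicts on the 10 pairs run in both orders, the assertions can be computed mechanically. A sketch, with fabricated verdicts (a judge that always picks the truly better diff, "A"); a real harness would fill the table from actual judge runs:

```python
# Fabricated verdicts: each pair judged twice, once per presentation order.
verdicts = [
    {"pair": i, "first": first, "winner": "A", "truth": "A"}
    for i in range(10) for first in ("A", "B")
]

correct = sum(v["winner"] == v["truth"] for v in verdicts) / len(verdicts)

# Position bias: gap between the win rate of whichever output was
# presented first vs. second (0 = unbiased, 1 = always picks a position).
first_wins = sum(v["winner"] == v["first"] for v in verdicts) / len(verdicts)
position_bias = abs(2 * first_wins - 1)

# Order consistency: same winner across the A-first and B-first runs of a
# pair. (Self-consistency proper — re-running with the same order — needs
# repeated runs and is computed the same way.)
by_pair = {}
for v in verdicts:
    by_pair.setdefault(v["pair"], []).append(v["winner"])
order_consistent = sum(w[0] == w[1] for w in by_pair.values()) / len(by_pair)

assert correct > 0.80 and position_bias < 0.15 and order_consistent > 0.85
```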

### 2.3 Testing Nondeterministic Integration

The full MAB pipeline (agent dispatch → quality gate → judge → merge → learn) is inherently nondeterministic. Testing strategy:

- **Deterministic units:** Test each component in isolation with fixed inputs (e.g., test `run_judge()` with a fixed diff pair, test `thompson_sample()` with a fixed seed)
- **Stochastic integration:** Run the full pipeline N times on a trivial task (e.g., "add a docstring to this function") and assert statistical properties: a winner is declared in >95% of runs, the quality gate runs in 100%, the state file is updated in 100%
- **Fault injection:** Test what happens when Agent A fails (exits non-zero), Agent B produces no diff, the judge returns malformed JSON, or merge conflicts occur
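
The fault-injection cases lend themselves to a table-driven test. A sketch, where `handle_round` is a hypothetical stand-in a real harness would replace with the actual pipeline entry point; only the shape of the test is the point:

```python
import json

# Hypothetical stand-in for the pipeline's per-round dispatcher.
def handle_round(agent_a_exit, agent_b_diff, judge_json):
    if agent_a_exit != 0:
        return "forfeit:b"        # Agent A never ran; B wins by default
    if not agent_b_diff:
        return "forfeit:a"        # Agent B produced no diff
    try:
        json.loads(judge_json)
    except json.JSONDecodeError:
        return "retry-judge"      # malformed verdict: re-ask, don't crash
    return "judged"

faults = [
    ((1, "diff", '{"winner":"a"}'), "forfeit:b"),    # A exits non-zero
    ((0, "",     '{"winner":"a"}'), "forfeit:a"),    # B has no diff
    ((0, "diff", "not json"),       "retry-judge"),  # judge returns garbage
    ((0, "diff", '{"winner":"a"}'), "judged"),       # happy path
]
for args, expected in faults:
    assert handle_round(*args) == expected, (args, expected)
```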

---

## 3. Cross-Domain Analogies

### 3.1 Computer Chess Tournaments (TCEC)

The closest structural analog: two agents, identical hardware, identical problem, a judge picks the winner.

| TCEC Practice | MAB Application |
|---------------|-----------------|
| **Curated opening book** (bias toward decisive positions) | Pre-screen tasks for discriminating power. Trivially easy tasks (both ace) or impossible tasks (both fail) produce no signal. |
| **Adjudication rules** (auto-draw if engines agree within ±0.08 for 10 plies) | Early termination: if both agents produce identical solutions (by diff similarity), declare a draw — don't burn judge tokens. If one passes all tests and the other passes none, skip the detailed rubric — call it early. |
| **Same hardware, same time control** | Same model, same context budget, same token limit. Otherwise you're comparing resource allocation, not capability. |
| **Draw rate is a design problem** | If MAB produces too many ties, the task design is wrong. Fix the tasks, not the judge. Monitor tie rate as a health metric. |
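
The adjudication row implies a cheap pre-check before spending judge tokens. A sketch using `difflib`, where the 0.9 threshold is an assumption to tune, not a measured value:

```python
import difflib

def adjudicate(diff_a: str, diff_b: str, threshold: float = 0.9) -> str:
    """Declare a draw when the two agents' diffs are near-identical, so no
    judge call is spent on a round that carries no signal."""
    similarity = difflib.SequenceMatcher(None, diff_a, diff_b).ratio()
    return "draw" if similarity >= threshold else "judge"

print(adjudicate("+x = 1\n", "+x = 1\n"))          # identical work: a draw
print(adjudicate("+x = 1\n", "+y = compute()\n"))  # real divergence: judge it
```

The same pre-check covers the pass-all vs. pass-none shortcut: compare quality-gate results first, and only fall through to the rubric when both similarity and test outcomes are inconclusive.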

### 3.2 Evolutionary Biology / Genetic Algorithms

| Biological Pattern | MAB Application |
|-------------------|-----------------|
| **Tournament selection pressure is a dial** (small tournament = diversity, large = convergence) | Number of tasks per MAB round controls signal-to-noise. More matches per round = more reliable signal but slower adaptation. |
| **Artificial selection drives local optima** (domesticated crops lose wild resilience) | If judge consistently favors one style, both agents converge to it. Diversity collapses. Monitor inter-agent diff similarity as a canary. |
| **Recombination > pure selection** | The real value isn't picking a winner — it's identifying *which parts* of each solution were stronger. Phase 2 judge should extract specific winning behaviors. |

### 3.3 Adversarial Collaboration (Kahneman)

| Scientific Practice | MAB Application |
|--------------------|-----------------|
| **Pre-registration of criteria** (both parties agree what evidence would change their mind before the experiment) | Judge rubric must be locked before agents see the task. If rubric is written after reviewing outputs, it unconsciously favors the impressive-looking answer. |
| **The joint design of the test is where value lies** | Defining what "better" means for each task class is harder and more valuable than the competition itself. |
| **Ask "on what dimension do these differ most?"** | Don't ask the judge "which is better overall?" — ask "on what dimension do these most differ, and which is better on that dimension?" Produces more actionable lessons. |
|
|
191
|
+
|
|
192
|
+
### 3.4 Manufacturing Dual Sourcing

| Procurement Pattern | MAB Application |
|--------------------|-----------------|
| **Credible threat of replacement drives improvement** | The mere existence of competition improves both agents. Keep both pipelines alive even when one is winning. |
| **Quality inconsistency between suppliers breaks integration** | If agents produce stylistically incompatible solutions (different abstractions, naming), the "winner" creates downstream debt. Judge needs a consistency criterion. |
| **Technology licensing outperforms pure competition** | Feed winning approach back to both agents before next round. Sharing knowledge produces better cumulative results than withholding it. Maps to injecting MAB lessons into both agents' context. |

### 3.5 Competitive Programming Judges (Codeforces/ICPC)

| Competition Practice | MAB Application |
|---------------------|-----------------|
| **Pre-test vs. system-test split** | Run agents against a visible "sanity check" suite first, then against a harder hidden suite for final judging. Prevents overfitting to visible rubric. |
| **Hacking** (competitors find inputs that break opponents' solutions) | After both agents submit, have each attempt to write a test case that breaks the other's solution. Valid breaking test = signal about code quality reasoning. (Phase 3 feature) |
| **Distinct verdict categories** (WA vs TLE vs RE) | Judge outputting only "Agent A wins" discards signal. "Agent A correct but 3x slower; Agent B had edge case bug at N=0" generates compounding knowledge. |

### 3.6 Forecasting Tournaments / Proper Scoring Rules

| Forecasting Pattern | MAB Application |
|--------------------|-----------------|
| **Proper scoring rules eliminate gaming** | Can an agent score well by optimizing for the judge rather than for correctness? If yes, the rubric isn't proper. Test by submitting impressive-looking-but-wrong solutions. |
| **Time-weighting for sequential competitions** | An agent that produces correct architecture early and refines is better than one that patches a wrong architecture — even if final outputs look identical. |
| **Panel of 2-3 judges beats single judge by 13-22%** | A single LLM judge is a single point of failure. Phase 2: use two judge calls with different temperatures and take majority vote. |

### 3.7 Ensemble Methods / Mixture of Experts

| ML Pattern | MAB Application |
|------------|-----------------|
| **Disagreement between agents IS the signal** | Two agents producing identical solutions = one agent. Track disagreement rate as a health metric. If it drops, tasks are too easy or agents have converged. |
| **Diversity must be actively promoted** | Same model + same context = correlated outputs. Structural diversity requires different prompting, tool access, context priming, or temperature. |
| **Gating network learns task-type trust** | A sophisticated judge learns "Agent A better on algorithmic; Agent B better on integration." Static rubrics lose this signal. |

### 3.8 Cross-Domain Synthesis

Three patterns appeared independently across all seven domains:

1. **Locked criteria before outputs are seen.** TCEC opening books, Kahneman's pre-registration, Codeforces hidden test suites, Brier score properness. The judge rubric must be defined and frozen before agents run.

2. **Homogeneous competition is waste.** Ensemble diversity, dual-sourcing, tournament selection pressure. If both agents converge to identical strategies, the competition produces zero information. Diversity is the asset; it must be actively maintained.

3. **Shared starting conditions must be pre-screened for discriminating power.** TCEC curated openings, speedrun set seeds, competitive programming difficulty calibration. Don't MAB trivially easy or impossibly hard tasks — they produce no signal.

---

## 4. Coder Toolkit Workflow Analysis

### 4.1 Full Skill Chain

```
USER INPUT
    │
    ▼
Phase 1: DESIGN ─────────── superpowers:brainstorming
    │   Output: docs/plans/YYYY-MM-DD-<topic>-design.md
    │   Gate: user approval
    ▼
Phase 2: PRD ────────────── /create-prd
    │   Output: tasks/prd.json + tasks/prd-<feature>.md
    │   Gate: user approval
    ▼
Phase 3: PLAN ───────────── superpowers:writing-plans
    │   Output: docs/plans/YYYY-MM-DD-<feature>.md
    │   Gate: user chooses execution mode
    ▼
Phase 3.5: ISOLATE ──────── superpowers:using-git-worktrees
    │   Output: .worktrees/<branch>/, baseline test count
    │   Gate: tests pass in clean worktree
    ▼
Phase 4: EXECUTE ────────── [4 modes, see below]
    │   Gate: quality gate after every batch
    ▼
Phase 5: VERIFY ─────────── superpowers:verification-before-completion
    │   Gate: ALL PRD criteria pass (shell commands)
    ▼
Phase 6: FINISH ─────────── superpowers:finishing-a-development-branch
        Output: merge / PR / keep / discard
```

### 4.2 Four Execution Modes

| Mode | Entry Point | Context Model | Human Checkpoints | Best For |
|------|-------------|---------------|-------------------|----------|
| **4a: Subagent-Driven** | `superpowers:subagent-driven-development` | Fresh subagent per task | None after start | 1-10 tasks, interactive |
| **4b: Executing-Plans** | `superpowers:executing-plans` | Shared session (degrades) | Between batches | Medium plans, oversight needed |
| **4c: Headless** | `scripts/run-plan.sh` | Fresh `claude -p` per batch | None (autonomous) | 5+ batches, overnight |
| **4d: Ralph Loop** | `/ralph-loop` | Same session, iterates | None (until promise) | PRD-driven, open-ended |

Headless mode has three sub-modes: `headless` (serial), `team` (parallel groups), `competitive` (stub → becomes MAB).

### 4.3 Where MAB Fits

MAB replaces the competitive stub in headless mode. It sits at the Phase 4 execution layer:

```
Phase 3.5: ISOLATE
    │
    ├── MODE: headless ──── run_mode_headless() ──── serial batches
    ├── MODE: team ──────── run_mode_team() ──────── parallel groups
    ├── MODE: mab ───────── run_mode_mab() ───────── [NEW] two agents, judge picks winner
    │       │
    │       ├── Create worktree A (superpowers-led)
    │       ├── Create worktree B (ralph-led)
    │       ├── Cache-prime shared context
    │       ├── Launch both agents in parallel
    │       ├── Quality gate both
    │       ├── Judge evaluates diffs (randomized order)
    │       ├── Merge winner to main worktree
    │       └── Update strategy-perf.json + mab-lessons.json
    │
    └── MODE: ralph ─────── /ralph-loop ──────────── stop-hook iterations
```

### 4.4 State Files Across the Workflow

| File | Writer | Reader | Lifecycle |
|------|--------|--------|-----------|
| `docs/plans/*-design.md` | brainstorming | writing-plans, code-factory | Permanent |
| `tasks/prd.json` | /create-prd | verification, ralph-loop, run-plan.sh | Updated during execution |
| `docs/plans/*-<feature>.md` | writing-plans | all execution modes | Permanent |
| `.run-plan-state.json` | run-plan-state.sh | --resume, context injection | Per-execution |
| `progress.txt` | run-plan-prompt.sh | cross-batch context injection | Per-execution, append-only |
| `logs/failure-patterns.json` | run-plan-context.sh | batch context injection | Cross-run |
| `logs/sampling-outcomes.json` | run-plan-headless.sh | get_prompt_variants() | Cross-run |
| `logs/strategy-perf.json` | [NEW] run-plan-mab.sh | Thompson Sampling routing | Cross-run |
| `logs/mab-lessons.json` | [NEW] judge agent | batch context injection | Cross-run |
| `AGENTS.md` | run-plan-prompt.sh | agent teams | Per-execution |
| `.claude/ralph-loop.local.md` | setup-ralph-loop.sh | stop-hook.sh | Per-ralph-session |

### 4.5 Quality Gate Enforcement Points

1. **Worktree baseline** (Phase 3.5): Tests must pass before implementation begins
2. **Per-step** (Modes 4a/4b): Plan includes explicit "run test, verify it passes" steps
3. **Inter-batch** (Mode 4c): `run_quality_gate()` after every batch — lesson-check + tests + memory + regression + git clean
4. **Final verification** (Phase 5): ALL PRD criteria as shell commands, lesson-scanner agent
5. **Pre-merge** (Phase 6): Tests must pass before options are presented; re-tested after merge

### 4.6 Stop-Hook / Ralph Loop: MAB Adaptation

The stop-hook mechanism intercepts session exits and re-feeds the prompt. It's inherently single-session, while MAB needs two parallel sessions. However:

**Agent B (ralph-led) naturally fits ralph-loop.** In `run_mode_mab()`, before launching Agent B's `claude -p` call:

1. `cd "$worktree_b"`
2. Run `setup-ralph-loop.sh --completion-promise "ALL PRD CRITERIA PASS" --max-iterations 15`
3. Launch `claude -p` — the stop-hook will iterate Agent B until PRD criteria pass

Agent A (superpowers-led) terminates naturally after its last batch — no ralph-loop needed.

**Guard needed:** Both `.claude/ralph-loop.local.md` and the stop-hook are relative to `$PWD`. Since each MAB worktree has its own directory, state files are naturally isolated — but only if `cd "$worktree"` runs before `claude -p`. The current design doesn't explicitly `cd` — this must be added.

---

## 5. Latent Issues Found During Workflow Analysis

### Issue 1: State Schema Mismatch (Bug — affects all headless runs)

**File:** `scripts/lib/run-plan-context.sh:25`
**Problem:** `generate_batch_context()` reads `jq '[.batches[].test_count // 0] | max'` but `run-plan-state.sh` stores test counts at `.test_counts` (a flat key-value object), not `.batches[].test_count`.
**Impact:** The test count high-water-mark injected into batch context is always 0. All batches think they're starting from zero tests.
**Fix:** Change to `jq '[.test_counts // {} | to_entries[].value] | max // 0'`
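To make the mismatch concrete, here is the same high-water-mark computation in Python over a hypothetical state file containing both shapes (the sample values are ours, not from the repository); the old lookup path always yields 0:

```python
import json

# Hypothetical state shape per the description above: test counts live under a
# flat .test_counts object, while .batches entries carry no test_count field.
state = json.loads(
    '{"test_counts": {"batch-1": 12, "batch-2": 17}, "batches": [{"name": "batch-1"}]}'
)

# Old filter's path: .batches[].test_count does not exist, so every value is 0.
old_high_water = max((b.get("test_count", 0) for b in state.get("batches", [])), default=0)

# Corrected path, equivalent to the jq fix: max over the .test_counts values.
new_high_water = max(state.get("test_counts", {}).values(), default=0)

print(old_high_water, new_high_water)  # 0 17
```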

### Issue 2: Judge JSON Extraction Is Fragile

**File:** `mab-run.sh` (planned) `run_judge()` function
**Problem:** `grep -o '{.*}' | head -1` fails on multi-line JSON, which LLM output frequently produces.
**Fix:** Use `python3 -c "import sys,json,re; m=re.search(r'\\{.*\\}', sys.stdin.read(), re.DOTALL); print(m.group(0) if m else '{}')"` or instruct judge prompt to output ONLY JSON and validate with `jq empty`.
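A standalone version of that multiline-aware extraction, as a sketch (the function name is ours, not from the plan):

```python
import json
import re

def extract_judge_json(raw):
    """Pull the first {...} span out of LLM output, tolerating surrounding
    prose and newlines inside the JSON (re.DOTALL)."""
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if not m:
        return {}
    try:
        return json.loads(m.group(0))
    except json.JSONDecodeError:
        # Same failure mode that `jq empty` would catch: not valid JSON.
        return {}

reply = 'Here is my verdict:\n{\n  "winner": "agent_a",\n  "confidence": "high"\n}'
print(extract_judge_json(reply)["winner"])  # agent_a
```

Note the greedy `\{.*\}` can over-capture if prose after the JSON also contains a `}`; instructing the judge to emit only JSON (as the fix suggests) keeps this robust.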

### Issue 3: `--mab` Flag vs `--mode ab` Naming Inconsistency

**File:** MAB plan Batch 3, Tasks 9-10
**Problem:** The plan adds both a `--mab` boolean flag and a `--mode ab` enum value. These are parallel pathways that need reconciliation.
**Fix:** Use one canonical path: `run-plan.sh --mode mab`.

### Issue 4: Planner Agent Has No Caller

**File:** No file — gap in the plan
**Problem:** `scripts/prompts/planner-agent.md` is created in Batch 1 but never called by `auto-compound.sh` or any other script. The routing decision is purely manual.
**Fix:** Wire planner into `auto-compound.sh` between PRD generation and execution.

### Issue 5: `auto-compound.sh` Bypasses `writing-plans`

**File:** `scripts/auto-compound.sh`
**Problem:** Goes directly from PRD → Ralph loop, skipping plan writing entirely. This means MAB (which supports a superpowers-led strategy that needs a plan) can't be exercised via `auto-compound.sh`.
**Fix:** Document this as intentional for the ralph-only pipeline. Add a `--plan-first` flag for when MAB or superpowers mode is desired.

### Issue 6: `sampling-outcomes.json` vs `strategy-perf.json` Confusion

**Problem:** Both files track win rates — one for prompt variants within a strategy (micro-MAB), one for strategies (macro-MAB). No documentation distinguishes them.
**Fix:** Add comment blocks to creation code and a section in ARCHITECTURE.md.

### Issue 7: MAB and Ralph Loop Compete for Session State

**Problem:** If a user activates `/ralph-loop` in a worktree that's also running inside `mab-run.sh`, both mechanisms are active simultaneously.
**Fix:** `run_mode_mab()` should write a `.mab-active` sentinel file in its worktrees. The ralph-loop setup should check for this and refuse to activate, or the MAB script should set up ralph-loop state itself (preferred — see Section 4.6).

### Issue 8: No Explicit `cd` Before Agent `claude -p` in MAB Worktrees

**Problem:** Each MAB agent's `claude -p` must run in its own worktree directory for proper isolation. The current design doesn't explicitly change directory.
**Fix:** Add `cd "$worktree_a" &&` (and likewise `cd "$worktree_b" &&`) before the corresponding `claude -p` invocation in `run_mode_mab()`.

---

## 6. Concrete Recommendations for Revised Plan

### Pre-MAB Fixes (do first)

| # | Fix | Effort | Impact |
|---|-----|--------|--------|
| 1 | Fix state schema mismatch (Issue 1) | 10 min | Fixes all headless runs |
| 2 | Canonical `--mode mab` naming (Issue 3) | 5 min | Prevents naming confusion |

### Phase 1 Architecture (replaces original Batches 1-3)

```
scripts/
├── lib/
│   └── run-plan-mab.sh            # ~250 lines, peer to headless/team
├── prompts/
│   ├── judge-agent.md             # Binary judge: winner + reasoning + SHAs
│   ├── agent-a-superpowers.md     # Superpowers-led batch execution prompt
│   └── agent-b-ralph.md           # Ralph-led iteration prompt
└── run-plan.sh                    # Add --mode mab dispatch
```

**`run-plan-mab.sh` responsibilities:**
1. Create two worktrees from current HEAD
2. Cache-prime shared context (design doc + PRD + codebase summary)
3. Launch both agents in parallel (`claude -p` with `cd "$worktree"`)
4. Wait for both to complete
5. Run quality gate on both
6. Call judge agent with randomized presentation order
7. Merge winner to main worktree
8. Update `logs/strategy-perf.json` and `logs/mab-lessons.json`
9. Clean up loser worktree

**Judge agent (Phase 1 — binary):**
```json
{
  "winner": "agent_a|agent_b|draw",
  "confidence": "low|medium|high",
  "reasoning": "2-3 sentences explaining the decision",
  "key_difference": "The specific dimension where agents most differed",
  "sha_a": "abc1234",
  "sha_b": "def5678",
  "presentation_order": "a_first|b_first"
}
```
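Before a verdict updates any state file, the caller can reject malformed judge output with a cheap enum check. A sketch (the function name is hypothetical; the field values come from the schema above):

```python
def valid_verdict(v):
    """Reject malformed judge output before it touches strategy-perf.json."""
    return (
        v.get("winner") in {"agent_a", "agent_b", "draw"}
        and v.get("confidence") in {"low", "medium", "high"}
        and v.get("presentation_order") in {"a_first", "b_first"}
        and bool(v.get("reasoning"))  # reasoning must be present and non-empty
    )

print(valid_verdict({"winner": "agent_a", "confidence": "high",
                     "presentation_order": "b_first", "reasoning": "A had tests."}))  # True
print(valid_verdict({"winner": "both"}))  # False
```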

**Routing (Phase 1 — Thompson Sampling, ~15 lines bash):**
```bash
sample_a=$(python3 -c "import random; random.seed(); print(random.betavariate($wins_a+1,$losses_a+1))")
sample_b=$(python3 -c "import random; random.seed(); print(random.betavariate($wins_b+1,$losses_b+1))")
delta=$(python3 -c "print(abs($sample_a - $sample_b))")
if (( $(echo "$delta < 0.10" | bc -l) )); then
  echo "mab"  # Uncertain — run both agents
else
  # Exploit — route to higher sample
  if (( $(echo "$sample_a > $sample_b" | bc -l) )); then
    echo "superpowers"
  else
    echo "ralph"
  fi
fi
```
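The same routing logic consolidated into one Python function, which is handy for replaying logged win/loss counts offline. A sketch; the function name and the 0.10 threshold mirror the bash above:

```python
import random

def route(wins_a, losses_a, wins_b, losses_b, threshold=0.10, rng=None):
    """Thompson Sampling over two Beta posteriors, mirroring the bash sketch."""
    rng = rng or random.Random()
    sample_a = rng.betavariate(wins_a + 1, losses_a + 1)
    sample_b = rng.betavariate(wins_b + 1, losses_b + 1)
    if abs(sample_a - sample_b) < threshold:
        return "mab"  # Uncertain — run both agents
    # Exploit: route to the strategy with the higher sampled win rate
    return "superpowers" if sample_a > sample_b else "ralph"

# With no history both posteriors are Beta(1, 1), so routing usually lands on "mab".
print(route(0, 0, 0, 0, rng=random.Random(42)))
```

Passing a seeded `random.Random` makes a routing decision reproducible, which is useful when testing against the counts recorded in `logs/strategy-perf.json`.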

**Early termination rules (from TCEC + Codeforces patterns):**
- If both agents produce identical diffs (>95% similarity): declare draw, skip judge
- If one agent passes all tests and the other passes none: auto-declare winner, skip judge
- If both agents fail quality gate: declare no winner, retry batch in headless mode
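One possible implementation of the >95% similarity check: compare only the changed lines of each unified diff with `difflib`. A sketch; the helper name is ours, not from the plan:

```python
import difflib

def diff_similarity(diff_a, diff_b):
    """Similarity of two unified diffs, ignoring context lines.
    Comparing only +/- lines keeps shared context from inflating the score."""
    def changed(d):
        return [l for l in d.splitlines()
                if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    return difflib.SequenceMatcher(None, changed(diff_a), changed(diff_b)).ratio()

a = "+def add(x, y):\n+    return x + y\n"
b = "+def add(x, y):\n+    return x + y\n"
print(diff_similarity(a, b) > 0.95)  # True: identical diffs, declare draw, skip judge
```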

### Phase 2 Additions (after 10+ runs)

- Judge enrichment: add `failure_mode`, `strategy_update`, `winning_behaviors` fields
- Prompt evolution from judge reasoning (SEW pattern) → `logs/evolved-prompts.json`
- Model variation: `--sample-models "sonnet,opus,haiku"` flag
- Panel judging: two judge calls, different temperatures, majority vote
- Wire planner agent into `auto-compound.sh` for automated routing

### Phase 3 Additions (after 50+ runs, maybe never)

- Strategy archive (ADAS pattern): judge proposes new strategy descriptions
- Hacking mechanism: each agent writes a test case to break the other (Codeforces pattern)
- Community strategy data aggregation
- Semantic lesson dedup via Pinecone

---

## 7. Updated Risk Matrix

| Risk | Likelihood | Impact | Mitigation | Source |
|------|-----------|--------|------------|--------|
| Judge inconsistency (first 10 runs) | High | Medium | Validate first 10 decisions manually; require kappa >= 0.70 | LLM-as-Judge literature |
| Low agent diversity (same outputs) | Medium | High | Monitor diff similarity; add model variation in Phase 2 | Ensemble methods, evolutionary biology |
| 2x compute cost | Certain | Low | Cache priming drops from $10.58 to $1.88/batch; Thompson Sampling reduces MAB frequency | SWE-bench cost data |
| Position bias in judge | High | Medium | Randomize order; log in output; monitor win rates by position | LLM-as-Judge research, Codeforces |
| Rubric gaming (agent optimizes for judge, not correctness) | Low (Phase 1) | High | Proper scoring rule design; hidden test suite for judge | Forecasting tournaments |
| State schema bug produces wrong test counts | Certain (existing) | Medium | Fix before MAB — affects all headless runs today | Workflow analysis |
| JSON extraction breaks on multiline judge output | High | Medium | Use multiline-aware extraction; validate with jq | Workflow analysis |
| Both mechanisms active (ralph-loop + MAB) | Low | Medium | MAB sets up ralph-loop state itself; sentinel file guard | Workflow analysis |
| Draw rate too high (no signal) | Medium | Medium | Pre-screen tasks for discriminating power; early termination rules | TCEC, comp programming |
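The position-bias mitigation in the matrix can be monitored with a few lines over the judge verdict log. A sketch assuming verdict dicts shaped like the Phase 1 judge schema (`winner`, `presentation_order`); the function name is ours:

```python
def position_bias(verdicts):
    """Normalized win-rate skew toward whichever agent was presented first.
    0.0 = no positional preference, 1.0 = first position always wins."""
    decided = [v for v in verdicts if v["winner"] != "draw"]
    if not decided:
        return 0.0
    # A verdict counts as a first-position win when the winner matches the
    # agent shown first (agent_a iff order was a_first).
    first_wins = sum(
        1 for v in decided
        if (v["winner"] == "agent_a") == (v["presentation_order"] == "a_first")
    )
    return abs(first_wins / len(decided) - 0.5) * 2

log = [
    {"winner": "agent_a", "presentation_order": "a_first"},
    {"winner": "agent_b", "presentation_order": "b_first"},
    {"winner": "agent_a", "presentation_order": "b_first"},
    {"winner": "draw",    "presentation_order": "a_first"},
]
print(position_bias(log))
```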

---

## 8. Updated Success Metrics

| Metric | Phase 1 | Phase 2 | Measurement |
|--------|---------|---------|-------------|
| MAB runs completed | 10 | 50 | Count of `logs/mab-run-*.json` |
| Judge agreement with human | >80% | >90% | Manual review, Cohen's kappa |
| Judge self-consistency | >85% | >90% | Same input, different seed → same winner |
| Position bias | <15% | <10% | Win rate delta by presentation order |
| Agent diversity (diff similarity) | <80% overlap | <70% | Diff intersection / union |
| Cost per MAB batch | <$3.00 | <$2.50 | API billing, logged per run |
| Draw rate | <40% | <25% | Draws / total evaluations |
| Quality gate pass rate (winner) | >80% | >90% | strategy-perf.json aggregate |
| Thompson Sampling convergence | — | Within 15 runs | Cumulative regret vs oracle |
| Prompt evolution yield | — | 1 variant / 5 runs | evolved-prompts.json entries |
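For the judge-agreement row, Cohen's kappa over paired verdict labels needs no dependencies. A sketch (the label strings follow the judge schema; the example verdicts are illustrative):

```python
from collections import Counter

def cohens_kappa(judge, human):
    """Chance-corrected agreement between judge and human verdict lists."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    pj, ph = Counter(judge), Counter(human)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum((pj[l] / n) * (ph[l] / n) for l in set(judge) | set(human))
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always emit the same single label
    return (observed - expected) / (1 - expected)

judge = ["agent_a", "agent_b", "agent_a", "draw", "agent_a"]
human = ["agent_a", "agent_b", "agent_b", "draw", "agent_a"]
print(round(cohens_kappa(judge, human), 3))  # 0.688
```

Note that the κ ≥ 0.70 gate in the risk matrix is stricter than raw percent agreement: four matches out of five here is 80% agreement but only κ ≈ 0.69.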

---

## 9. Sources

### Round 2 — New Sources

#### Cost & Economics

- [Manage costs effectively — Claude Code Docs](https://code.claude.com/docs/en/costs)
- [Claude API Pricing](https://platform.claude.com/docs/en/about-claude/pricing)
- [SWE-rebench Leaderboard](https://swe-rebench.com) (cost-per-task data)
- [Prompt Caching — Claude API Docs](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)
- [Agentic Plan Caching — arxiv 2506.14852](https://arxiv.org/abs/2506.14852)
- [Budget-Constrained MAB — UCL/AAAI 2013](http://www0.cs.ucl.ac.uk/staff/w.zhang/rtb-papers/mab-adx.pdf)
- [Adaptive Budgeted UCB — arxiv 2505.02640](https://arxiv.org/pdf/2505.02640)

#### Testing & Validation

- [Validating LLM-as-a-Judge Under Rating Indeterminacy — CMU ML Blog](https://blog.ml.cmu.edu/2025/12/09/validating-llm-as-a-judge-systems-under-rating-indeterminacy/)
- [Judge's Verdict — arxiv 2510.09738](https://arxiv.org/pdf/2510.09738)
- [Seven Recommendations for Testing in a Non-Deterministic World — CMU SEI](https://www.sei.cmu.edu/blog/seven-recommendations-for-testing-in-a-non-deterministic-world/)
- [Statistical Testing of Stochastic Systems — U. Washington](https://homes.cs.washington.edu/~borning/papers/sevcikova-issta-2006.pdf)
- [Offline Bandit Evaluation — James LeDoux / Udemy](https://jamesrledoux.com/algorithms/offline-bandit-evaluation/)
- [Contextual R Package — Synthetic Bandit Simulation](https://nth-iteration-labs.github.io/contextual/)

#### Cross-Domain Analogies

- [TCEC Rules — Chessdom Wiki](https://wiki.chessdom.org/Rules)
- [Tournament Selection — Wikipedia](https://en.wikipedia.org/wiki/Tournament_selection)
- [Adversarial Collaboration — Kahneman / Edge.org](https://www.edge.org/adversarial-collaboration-daniel-kahneman)
- [Nature: Time for Adversarial Collaboration (2025)](https://www.nature.com/articles/d41586-025-01379-3)
- [Dual Sourcing — Management Science](https://pubsonline.informs.org/doi/10.1287/mnsc.41.8.1317)
- [Brier Score — Wikipedia](https://en.wikipedia.org/wiki/Brier_score)
- [Competitive Programming Judge Systems](https://en.wikipedia.org/wiki/Competitive_programming)
- [Codeforces Contest Rules](https://codeforces.com/blog/entry/4088)
- [Mixture of Experts — Wikipedia](https://en.wikipedia.org/wiki/Mixture_of_experts)
- [Ensemble Diversity — JMLR](https://jmlr.org/papers/volume24/23-0041/23-0041.pdf)
- [Speedrunning Verification](https://en.wikipedia.org/wiki/Speedrun)

### Round 1 Sources (from `2026-02-21-mab-research-report.md`)

See the original report for the full Round 1 source list covering academic MAB+LLM literature, LLM-as-Judge practitioner guides, SEW/ADAS research, SWE-bench analysis, and Notion workspace references.

### Codebase Files Analyzed

- Full skill chain: `skills/{brainstorming,writing-plans,using-git-worktrees,executing-plans,subagent-driven-development,verification-before-completion,finishing-a-development-branch}/SKILL.md`
- Commands: `commands/{code-factory,run-plan,ralph-loop}.md`
- Scripts: `scripts/run-plan.sh`, `scripts/auto-compound.sh`, all 8 `scripts/lib/run-plan-*.sh` modules
- Hooks: `hooks/stop-hook.sh`, `hooks/hooks.json`
- Architecture: `docs/ARCHITECTURE.md`
- MAB design + plan: `docs/plans/2026-02-22-mab-run-{design,plan}.md`
|