autonomous-coding-toolkit 1.0.0
- package/.claude-plugin/marketplace.json +22 -0
- package/.claude-plugin/plugin.json +13 -0
- package/LICENSE +21 -0
- package/Makefile +21 -0
- package/README.md +140 -0
- package/SECURITY.md +28 -0
- package/agents/bash-expert.md +113 -0
- package/agents/dependency-auditor.md +138 -0
- package/agents/integration-tester.md +120 -0
- package/agents/lesson-scanner.md +149 -0
- package/agents/python-expert.md +179 -0
- package/agents/service-monitor.md +141 -0
- package/agents/shell-expert.md +147 -0
- package/benchmarks/runner.sh +147 -0
- package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
- package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
- package/benchmarks/tasks/02-refactor-module/task.md +8 -0
- package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
- package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
- package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
- package/bin/act.js +238 -0
- package/commands/autocode.md +6 -0
- package/commands/cancel-ralph.md +18 -0
- package/commands/code-factory.md +53 -0
- package/commands/create-prd.md +55 -0
- package/commands/ralph-loop.md +18 -0
- package/commands/run-plan.md +117 -0
- package/commands/submit-lesson.md +122 -0
- package/docs/ARCHITECTURE.md +630 -0
- package/docs/CONTRIBUTING.md +125 -0
- package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
- package/docs/lessons/0002-async-def-without-await.md +28 -0
- package/docs/lessons/0003-create-task-without-callback.md +28 -0
- package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
- package/docs/lessons/0005-sqlite-without-closing.md +33 -0
- package/docs/lessons/0006-venv-pip-path.md +27 -0
- package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
- package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
- package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
- package/docs/lessons/0010-local-outside-function-bash.md +33 -0
- package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
- package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
- package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
- package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
- package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
- package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
- package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
- package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
- package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
- package/docs/lessons/0020-persist-state-incrementally.md +44 -0
- package/docs/lessons/0021-dual-axis-testing.md +48 -0
- package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
- package/docs/lessons/0023-static-analysis-spiral.md +51 -0
- package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
- package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
- package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
- package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
- package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
- package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
- package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
- package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
- package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
- package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
- package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
- package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
- package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
- package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
- package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
- package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
- package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
- package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
- package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
- package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
- package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
- package/docs/lessons/0045-iterative-design-improvement.md +33 -0
- package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
- package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
- package/docs/lessons/0048-integration-wiring-batch.md +40 -0
- package/docs/lessons/0049-ab-verification.md +41 -0
- package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
- package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
- package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
- package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
- package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
- package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
- package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
- package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
- package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
- package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
- package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
- package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
- package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
- package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
- package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
- package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
- package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
- package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
- package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
- package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
- package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
- package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
- package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
- package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
- package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
- package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
- package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
- package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
- package/docs/lessons/0078-static-review-without-live-test.md +30 -0
- package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
- package/docs/lessons/FRAMEWORK.md +161 -0
- package/docs/lessons/SUMMARY.md +201 -0
- package/docs/lessons/TEMPLATE.md +85 -0
- package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
- package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
- package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
- package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
- package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
- package/docs/plans/2026-02-21-mab-research-report.md +406 -0
- package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
- package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
- package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
- package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
- package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
- package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
- package/docs/plans/2026-02-22-mab-run-design.md +462 -0
- package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
- package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
- package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
- package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
- package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
- package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
- package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
- package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
- package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
- package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
- package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
- package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
- package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
- package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
- package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
- package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
- package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
- package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
- package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
- package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
- package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
- package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
- package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
- package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
- package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
- package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
- package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
- package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
- package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
- package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
- package/docs/plans/2026-02-24-headless-module-split.md +443 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
- package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
- package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
- package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
- package/docs/plans/audit-findings.md +186 -0
- package/docs/telegram-notification-format.md +98 -0
- package/examples/example-plan.md +51 -0
- package/examples/example-prd.json +72 -0
- package/examples/example-roadmap.md +33 -0
- package/examples/quickstart-plan.md +63 -0
- package/hooks/hooks.json +26 -0
- package/hooks/setup-symlinks.sh +48 -0
- package/hooks/stop-hook.sh +135 -0
- package/package.json +47 -0
- package/policies/bash.md +71 -0
- package/policies/python.md +71 -0
- package/policies/testing.md +61 -0
- package/policies/universal.md +60 -0
- package/scripts/analyze-report.sh +97 -0
- package/scripts/architecture-map.sh +145 -0
- package/scripts/auto-compound.sh +273 -0
- package/scripts/batch-audit.sh +42 -0
- package/scripts/batch-test.sh +101 -0
- package/scripts/entropy-audit.sh +221 -0
- package/scripts/failure-digest.sh +51 -0
- package/scripts/generate-ast-rules.sh +96 -0
- package/scripts/init.sh +112 -0
- package/scripts/lesson-check.sh +428 -0
- package/scripts/lib/common.sh +61 -0
- package/scripts/lib/cost-tracking.sh +153 -0
- package/scripts/lib/ollama.sh +60 -0
- package/scripts/lib/progress-writer.sh +128 -0
- package/scripts/lib/run-plan-context.sh +215 -0
- package/scripts/lib/run-plan-echo-back.sh +231 -0
- package/scripts/lib/run-plan-headless.sh +396 -0
- package/scripts/lib/run-plan-notify.sh +57 -0
- package/scripts/lib/run-plan-parser.sh +81 -0
- package/scripts/lib/run-plan-prompt.sh +215 -0
- package/scripts/lib/run-plan-quality-gate.sh +132 -0
- package/scripts/lib/run-plan-routing.sh +315 -0
- package/scripts/lib/run-plan-sampling.sh +170 -0
- package/scripts/lib/run-plan-scoring.sh +146 -0
- package/scripts/lib/run-plan-state.sh +142 -0
- package/scripts/lib/run-plan-team.sh +199 -0
- package/scripts/lib/telegram.sh +54 -0
- package/scripts/lib/thompson-sampling.sh +176 -0
- package/scripts/license-check.sh +74 -0
- package/scripts/mab-run.sh +575 -0
- package/scripts/module-size-check.sh +146 -0
- package/scripts/patterns/async-no-await.yml +5 -0
- package/scripts/patterns/bare-except.yml +6 -0
- package/scripts/patterns/empty-catch.yml +6 -0
- package/scripts/patterns/hardcoded-localhost.yml +9 -0
- package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
- package/scripts/pipeline-status.sh +197 -0
- package/scripts/policy-check.sh +226 -0
- package/scripts/prior-art-search.sh +133 -0
- package/scripts/promote-mab-lessons.sh +126 -0
- package/scripts/prompts/agent-a-superpowers.md +29 -0
- package/scripts/prompts/agent-b-ralph.md +29 -0
- package/scripts/prompts/judge-agent.md +61 -0
- package/scripts/prompts/planner-agent.md +44 -0
- package/scripts/pull-community-lessons.sh +90 -0
- package/scripts/quality-gate.sh +266 -0
- package/scripts/research-gate.sh +90 -0
- package/scripts/run-plan.sh +329 -0
- package/scripts/scope-infer.sh +159 -0
- package/scripts/setup-ralph-loop.sh +155 -0
- package/scripts/telemetry.sh +230 -0
- package/scripts/tests/run-all-tests.sh +52 -0
- package/scripts/tests/test-act-cli.sh +46 -0
- package/scripts/tests/test-agents-md.sh +87 -0
- package/scripts/tests/test-analyze-report.sh +114 -0
- package/scripts/tests/test-architecture-map.sh +89 -0
- package/scripts/tests/test-auto-compound.sh +169 -0
- package/scripts/tests/test-batch-test.sh +65 -0
- package/scripts/tests/test-benchmark-runner.sh +25 -0
- package/scripts/tests/test-common.sh +168 -0
- package/scripts/tests/test-cost-tracking.sh +158 -0
- package/scripts/tests/test-echo-back.sh +180 -0
- package/scripts/tests/test-entropy-audit.sh +146 -0
- package/scripts/tests/test-failure-digest.sh +66 -0
- package/scripts/tests/test-generate-ast-rules.sh +145 -0
- package/scripts/tests/test-helpers.sh +82 -0
- package/scripts/tests/test-init.sh +47 -0
- package/scripts/tests/test-lesson-check.sh +278 -0
- package/scripts/tests/test-lesson-local.sh +55 -0
- package/scripts/tests/test-license-check.sh +109 -0
- package/scripts/tests/test-mab-run.sh +182 -0
- package/scripts/tests/test-ollama-lib.sh +49 -0
- package/scripts/tests/test-ollama.sh +60 -0
- package/scripts/tests/test-pipeline-status.sh +198 -0
- package/scripts/tests/test-policy-check.sh +124 -0
- package/scripts/tests/test-prior-art-search.sh +96 -0
- package/scripts/tests/test-progress-writer.sh +140 -0
- package/scripts/tests/test-promote-mab-lessons.sh +110 -0
- package/scripts/tests/test-pull-community-lessons.sh +149 -0
- package/scripts/tests/test-quality-gate.sh +241 -0
- package/scripts/tests/test-research-gate.sh +132 -0
- package/scripts/tests/test-run-plan-cli.sh +86 -0
- package/scripts/tests/test-run-plan-context.sh +305 -0
- package/scripts/tests/test-run-plan-e2e.sh +153 -0
- package/scripts/tests/test-run-plan-headless.sh +424 -0
- package/scripts/tests/test-run-plan-notify.sh +124 -0
- package/scripts/tests/test-run-plan-parser.sh +217 -0
- package/scripts/tests/test-run-plan-prompt.sh +254 -0
- package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
- package/scripts/tests/test-run-plan-routing.sh +178 -0
- package/scripts/tests/test-run-plan-scoring.sh +148 -0
- package/scripts/tests/test-run-plan-state.sh +261 -0
- package/scripts/tests/test-run-plan-team.sh +157 -0
- package/scripts/tests/test-scope-infer.sh +150 -0
- package/scripts/tests/test-setup-ralph-loop.sh +63 -0
- package/scripts/tests/test-telegram-env.sh +38 -0
- package/scripts/tests/test-telegram.sh +121 -0
- package/scripts/tests/test-telemetry.sh +46 -0
- package/scripts/tests/test-thompson-sampling.sh +139 -0
- package/scripts/tests/test-validate-all.sh +60 -0
- package/scripts/tests/test-validate-commands.sh +89 -0
- package/scripts/tests/test-validate-hooks.sh +98 -0
- package/scripts/tests/test-validate-lessons.sh +150 -0
- package/scripts/tests/test-validate-plan-quality.sh +235 -0
- package/scripts/tests/test-validate-plans.sh +187 -0
- package/scripts/tests/test-validate-plugin.sh +106 -0
- package/scripts/tests/test-validate-prd.sh +184 -0
- package/scripts/tests/test-validate-skills.sh +134 -0
- package/scripts/validate-all.sh +57 -0
- package/scripts/validate-commands.sh +67 -0
- package/scripts/validate-hooks.sh +89 -0
- package/scripts/validate-lessons.sh +98 -0
- package/scripts/validate-plan-quality.sh +369 -0
- package/scripts/validate-plans.sh +120 -0
- package/scripts/validate-plugin.sh +86 -0
- package/scripts/validate-policies.sh +42 -0
- package/scripts/validate-prd.sh +118 -0
- package/scripts/validate-skills.sh +96 -0
- package/skills/autocode/SKILL.md +285 -0
- package/skills/autocode/ab-verification.md +51 -0
- package/skills/autocode/code-quality-standards.md +37 -0
- package/skills/autocode/competitive-mode.md +364 -0
- package/skills/brainstorming/SKILL.md +97 -0
- package/skills/capture-lesson/SKILL.md +187 -0
- package/skills/check-lessons/SKILL.md +116 -0
- package/skills/dispatching-parallel-agents/SKILL.md +110 -0
- package/skills/executing-plans/SKILL.md +85 -0
- package/skills/finishing-a-development-branch/SKILL.md +201 -0
- package/skills/receiving-code-review/SKILL.md +72 -0
- package/skills/requesting-code-review/SKILL.md +59 -0
- package/skills/requesting-code-review/code-reviewer.md +82 -0
- package/skills/research/SKILL.md +145 -0
- package/skills/roadmap/SKILL.md +115 -0
- package/skills/subagent-driven-development/SKILL.md +98 -0
- package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
- package/skills/subagent-driven-development/implementer-prompt.md +73 -0
- package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
- package/skills/systematic-debugging/SKILL.md +134 -0
- package/skills/systematic-debugging/condition-based-waiting.md +64 -0
- package/skills/systematic-debugging/defense-in-depth.md +32 -0
- package/skills/systematic-debugging/root-cause-tracing.md +55 -0
- package/skills/test-driven-development/SKILL.md +167 -0
- package/skills/using-git-worktrees/SKILL.md +219 -0
- package/skills/using-superpowers/SKILL.md +54 -0
- package/skills/verification-before-completion/SKILL.md +140 -0
- package/skills/verify/SKILL.md +82 -0
- package/skills/writing-plans/SKILL.md +128 -0
- package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,530 @@

# Autonomous Coding Toolkit — Roadmap to Completion

**Date:** 2026-02-23
**Status:** Draft — awaiting user approval
**Scope:** Complete roadmap from current state to v1.0 release, informed by 25 research papers, 20 open bugs, and 3 unexecuted designs

---

## Current State Assessment

### What's Shipped (Production-Quality)

| Category | Count | Notes |
|----------|-------|-------|
| Bash scripts | 34+ | All under 300 lines |
| Test files | 34 | 369+ assertions, all passing |
| Quality gate checks | 7 | lesson-check, lint, tests, ast-grep, memory, test count, git clean |
| Validators | 7 | lessons, skills, commands, plans, prd, plugin, hooks |
| Lessons | 66 | 6 clusters, YAML frontmatter, syntactic + semantic |
| Execution modes | 5 | headless, team, competitive (stub), ralph loop, subagent-driven |
| Skills | 14 | Full pipeline chain + supporting skills |
| Agents | 1 (in-repo) | lesson-scanner; 6 new designed but in ~/.claude/agents/ |
| CI pipeline | `make ci` | lint → validate → test |

### What's Designed But Not Implemented

| Feature | Design Doc | Plan Doc | Batches | Status |
|---------|-----------|---------|---------|--------|
| MAB system | `2026-02-22-mab-run-design.md` | `2026-02-22-mab-run-plan.md` | 6 (26 tasks) | **Needs update** — research found bugs, new prerequisites |
| Agent suite | `2026-02-23-agent-suite-design.md` | `2026-02-23-agent-suite-plan.md` | 7 (23 tasks) | Batch 1 (lint) done; Batches 2-7 pending |
| Research phase | `2026-02-22-research-phase-integration.md` | — | ~2 | Design complete, no plan |
| Roadmap stage | `2026-02-22-research-phase-integration.md` § 3.3 | — | ~1 | Design complete, no plan |

### What's Recommended by Research (No Design Yet)

From the cross-cutting synthesis (25 papers, confidence ratings included):

| # | Item | Evidence | Effort | Confidence |
|---|------|----------|--------|------------|
| 1 | Prompt caching | 83% cost reduction (pricing analysis) | 1-2 days | **High** |
| 2 | Plan quality scorecard | Plan quality worth 3x execution (SWE-bench Pro, N=1865) | 2-3 days | **High** |
| 3 | Spec echo-back gate | Spec misunderstanding is 60%+ of failures (SWE-EVO) | 1-2 days | **Medium-High** |
| 4 | Context restructuring | Lost in the Middle: 20pp accuracy degradation (Liu et al.) | 1 day | **High** |
| 5 | Lesson scope metadata | 67% false positive rate predicted at 100+ lessons | 2-3 days | **High** |
| 6 | Fast lane onboarding | 34.7% abandon on difficult setup (N=202 OSS devs) | 1-2 days | **High** |
| 7 | Per-batch cost tracking | No measured cost data exists — all optimization is guesswork | 1-2 days | **High** |
| 8 | Structured progress.txt | Freeform text reduces cross-context value | 1 day | **Medium-High** |
| 9 | Positive policy system | Positive instructions outperform negative for LLMs (NeQA) | 3-5 days | **Medium-High** |
| 10 | Property-based testing guidance | 50x more mutations found (OOPSLA 2025, 40 projects) | 2-3 days | **High** |

### Open Bugs (20)

| Severity | Count | Issues |
|----------|-------|--------|
| Medium | 8 | #9, #10, #11, #12, #13, #14, #15, #16 |
| Low | 12 | #17-#28 |

Key clusters:
- **Sampling** (#16, #27, #28): stash/state issues in parallel patch sampling
- **Portability** (#17, #18, #23): shebang, grep -P, bash 4.4 compat
- **Edge cases** (#9, #10, #13, #20, #21, #24): empty/missing state, truncation
- **Safety** (#11, #12, #19, #22): path escaping, directory restore, glob fragility

---

## Strategic Priorities

Ordered by impact per effort, accounting for dependencies:

1. **Fix before building** — The 20 open bugs include a state schema mismatch (#10) that affects all headless runs. Fix bugs first.
2. **Pre-execution quality** — Plan quality scorecard, spec echo-back, and context restructuring are the highest-leverage investments per the 3:1 plan-vs-execution ratio.
3. **Cost infrastructure** — Prompt caching (83% savings) and per-batch cost tracking are prerequisites for MAB economics to make sense.
4. **MAB system** — Updated design, slimmed from 6 to 4 batches based on research findings.
5. **Adoption infrastructure** — Fast lane onboarding, lesson scope metadata, README rewrite.
6. **Pipeline extensions** — Research phase, roadmap stage, positive policies.
7. **Agent suite** — New agents are useful but not blocking; they serve Justin's ecosystem, not the public toolkit.

---

## Phased Roadmap

### Phase 1: Stabilize (Fix What's Broken)

**Goal:** Zero known bugs in core pipeline. All existing tests pass. CI green.
**Effort:** 1-2 sessions
**Prerequisite for:** Everything else

#### Batch 1A: Critical Bugs (Medium Severity)

| Issue | Title | Fix |
|-------|-------|-----|
| #9 | `complete_batch` called with batch_num='final' crashes jq | Validate batch_num is numeric before `--argjson` |
| #10 | `get_previous_test_count` returns empty on missing state | Return -1 (unknown), match `extract_test_count` convention |
| #11 | `batch-test.sh` cd without restore | Use subshell `(cd "$dir" && ...)` or pushd/popd |
| #12 | `generate-ast-rules.sh` writes to root when --output-dir omitted | Default to `$PWD/scripts/patterns/` |
| #13 | `entropy-audit.sh` iterates once on empty find | Use `while read` with null check instead of heredoc |
| #16 | SAMPLE_COUNT persists across batches | Reset SAMPLE_COUNT=0 at top of batch loop |
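
Two of the fixes in the table above can be sketched in a few lines of shell. `is_numeric_batch` and `run_in_dir` are hypothetical names, not functions from the toolkit: the first shows the kind of check #9 asks for before a value reaches `jq --argjson`, and the second shows the subshell form suggested for #11.

```shell
# Hypothetical helper for #9: accept only non-negative integers, so a
# value such as 'final' is rejected before it is handed to jq.
is_numeric_batch() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper for #11: run a command inside a directory. The cd
# happens in a subshell, so the caller's working directory is untouched.
run_in_dir() {
  ( cd "$1" && shift && "$@" )
}

for n in 3 final; do
  if is_numeric_batch "$n"; then
    echo "batch $n: accepted"
  else
    echo "batch $n: rejected"
  fi
done
```

Because the `cd` is confined to the subshell, no restore step is needed even when the command fails partway through.
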

#### Batch 1B: Low Severity Bugs

| Issue | Title | Fix |
|-------|-------|-----|
| #14 | `auto-compound.sh` head -c 40 UTF-8 | Use `cut -c1-40` or `${var:0:40}` |
| #15 | No timeout on routing jq loop | Add `timeout 30` wrapper |
| #17 | Inconsistent shebangs | `#!/usr/bin/env bash` everywhere |
| #18 | `grep -P` non-portable | Replace with `grep -E` or `[[ =~ ]]` |
| #19 | ls -t fragile with spaces | Use `find -printf` or `stat --format` |
| #20 | `free -g` truncates | Use `free -m` and compare against 4096 |
| #21 | check_memory fallback '999' | Return -1 (unknown), skip check |
| #22 | setup-ralph-loop special chars | Quote with `jq --arg` instead of bash substitution |
| #23 | bash < 4.4 empty array set -u | `"${PASS_ARGS[@]+"${PASS_ARGS[@]}"}"` |
| #24 | detect_project_type nullglob | Use `compgen -G` or explicit test |
| #25 | ollama_query no timeout | Add `--connect-timeout 10 --max-time 60` to curl |
| #26 | validate-plans sed range bug | Fix sed address to stop at next `## Batch` header |
| #27 | Sampling stash no-op on clean | Check `git stash list` count before/after |
| #28 | SAMPLE_COUNT reset between batches | Same fix as #16 |
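
The memory fixes (#20 and #21) combine naturally: compare a megabyte reading against 4096 and treat a missing reading as unknown instead of a large fake value. The sketch below assumes the caller has already parsed the available-memory figure (for example from `free -m`); `memory_status` is a hypothetical name, not the toolkit's `check_memory`.

```shell
# Hypothetical sketch for #20/#21. The argument is available memory in MB.
# Prints "ok", "low", or "unknown". An empty or non-numeric reading maps
# to -1 and is reported as unknown so the caller can skip the check.
memory_status() {
  avail_mb=${1:--1}
  case "$avail_mb" in
    ''|*[!0-9]*) echo "unknown"; return 0 ;;
  esac
  if [ "$avail_mb" -ge 4096 ]; then
    echo "ok"
  else
    echo "low"
  fi
}

memory_status 8192
memory_status 4095
memory_status ""
```

Working in megabytes avoids the whole-gigabyte truncation that `free -g` introduces near the threshold.
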

#### Quality Gate
- `make ci` passes
- All 20 issues closed
- No new test regressions

---

### Phase 2: Pre-Execution Quality (Highest Leverage)

**Goal:** Implement the three research-backed improvements that address the 3:1 plan-vs-execution quality ratio.
**Effort:** 1-2 sessions
**Prerequisite for:** Phase 4 (MAB needs better plans to judge)

#### Batch 2A: Context Restructuring

**What:** Restructure `build_batch_prompt()` in `run-plan-prompt.sh`:
1. Raise `TOKEN_BUDGET_CHARS` from 6000 to 10000
2. Place batch task text at the top, requirements/constraints at the bottom
3. Wrap sections in XML tags (`<batch_tasks>`, `<prior_progress>`, `<failure_patterns>`, `<referenced_files>`, `<requirements>`)
4. Add `<research_warnings>` section from research JSON (when present)
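
The ordering described in the steps above could look roughly like the following. This is a sketch under assumptions, not the real `build_batch_prompt()`: the function name and its positional arguments are hypothetical, budget enforcement is left out, and placing `<research_warnings>` just before `<requirements>` is a guess, since the plan does not fix its position.

```shell
# Hypothetical sketch: tasks first, requirements last, each section in
# its own XML tag. Arguments: tasks, progress, failures, files,
# requirements, and an optional sixth argument with research warnings.
build_batch_prompt_sketch() {
  printf '<batch_tasks>\n%s\n</batch_tasks>\n' "$1"
  printf '<prior_progress>\n%s\n</prior_progress>\n' "$2"
  printf '<failure_patterns>\n%s\n</failure_patterns>\n' "$3"
  printf '<referenced_files>\n%s\n</referenced_files>\n' "$4"
  # Assumed placement: emitted only when warnings were supplied.
  if [ -n "${6:-}" ]; then
    printf '<research_warnings>\n%s\n</research_warnings>\n' "$6"
  fi
  printf '<requirements>\n%s\n</requirements>\n' "$5"
}

build_batch_prompt_sketch "Add retry logic" "Batch 1 done" "none" \
  "scripts/lib/common.sh" "All tests pass"
```

Keeping the task text first and the requirements last puts both at the edges of the prompt, which is the point of the restructuring.
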

**Evidence:** Lost in the Middle effect degrades accuracy 20pp for middle-positioned info. Anthropic's testing shows up to 30% quality improvement with structured context.

**Tests:** Update `test-run-plan-prompt.sh` to verify XML tag presence and section ordering.

#### Batch 2B: Plan Quality Scorecard

**What:** Create `scripts/validate-plan-quality.sh` scoring 8 dimensions:

| Dimension | Check | Weight |
|-----------|-------|--------|
| Task granularity | Each task modifies < 100 lines (estimated) | 15% |
| Spec completeness | Each task has verification command | 20% |
| Single outcome | No mixed task types per batch | 10% |
| Dependency ordering | No forward references | 10% |
| File path specificity | All tasks name exact files | 15% |
| Acceptance criteria | Each batch has at least one assert | 15% |
| Batch size | 1-5 tasks per batch | 10% |
| TDD structure | Test-before-implement pattern | 5% |

Returns score 0-100. Gate execution on configurable minimum (default: 60).
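
The weights in the table sum to 100, so the overall score is a plain weighted average. The sketch below assumes each dimension has already been scored from 0 to 100; how each one is measured is left to the real validator. The function names and the `PLAN_QUALITY_MIN` variable are hypothetical.

```shell
# Weights follow the table above, in the same order. Each of the eight
# arguments is a per-dimension score from 0 to 100.
plan_quality_score() {
  if [ "$#" -ne 8 ]; then
    echo "usage: plan_quality_score <eight scores, 0-100 each>" >&2
    return 2
  fi
  total=0
  for weight in 15 20 10 10 15 15 10 5; do
    total=$((total + weight * $1))
    shift
  done
  echo $((total / 100))
}

# Succeeds when the score meets the minimum (default 60).
plan_quality_gate() {
  min=${PLAN_QUALITY_MIN:-60}
  score=$(plan_quality_score "$@") || return 2
  [ "$score" -ge "$min" ]
}

# Full marks on granularity, completeness, file paths, and acceptance
# criteria, zero elsewhere: 15 + 20 + 15 + 15 = 65.
plan_quality_score 100 100 0 0 100 100 0 0
```

Integer division rounds the result down, which errs on the side of failing a borderline plan.
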

**Integration:** Wire into `run-plan.sh` before batch loop. Add `--skip-plan-quality` override.

**Tests:** Create `test-validate-plan-quality.sh` with sample plans at various quality levels.

#### Batch 2C: Specification Echo-Back Gate

**What:** Before coding each batch, the agent restates what the batch accomplishes. Lightweight LLM comparison between restatement and plan's task description.

**Implementation:** Add `echo_back_check()` to `run-plan-headless.sh`:
1. First 2 lines of `claude -p` prompt: "Before implementing, restate in one paragraph what this batch must accomplish."
2. Extract first paragraph from agent output
3. Lightweight `claude -p` call (haiku): "Does this restatement match the original spec? YES/NO + reason"
4. If NO → retry with clarified prompt (max 1 retry)
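
Steps 2 and 3 above involve two small pieces of text handling that can be shown without calling the model. `first_paragraph` and `verdict_is_yes` are hypothetical helper names; the sketch assumes the restatement is the first block of text before a blank line and that the judge reply begins with YES or NO.

```shell
# Print the first paragraph of stdin: skip leading blank lines, then
# emit lines until the next blank line.
first_paragraph() {
  awk 'NF { seen = 1; print; next } seen { exit }'
}

# Succeed when the judge reply starts with YES, in any letter case.
verdict_is_yes() {
  case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
    YES*) return 0 ;;
    *) return 1 ;;
  esac
}

printf 'This batch adds a retry wrapper.\n\nStarting work now.\n' | first_paragraph

if verdict_is_yes "NO: the restatement omits the tests"; then
  echo "proceed"
else
  echo "retry with clarified prompt"
fi
```

Anything that is not a clear YES is treated as a mismatch, so an unparseable reply leads to the single retry, not to silent acceptance.
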

**Evidence:** Catches 60%+ of specification misunderstanding failures (SWE-EVO).

**Tests:** Test with intentionally mismatched spec/restatement pairs.

#### Quality Gate
- `make ci` passes
- New validators pass on existing plans
- Context restructuring doesn't break existing test-run-plan-prompt tests

---

### Phase 3: Cost Infrastructure

**Goal:** Enable measured cost data (prerequisite for MAB economics) and implement prompt caching (83% cost reduction).
**Effort:** 1 session
**Prerequisite for:** Phase 4 (MAB)

#### Batch 3A: Per-Batch Cost Tracking

**What:** Track input tokens, output tokens, cache hits, and estimated cost per batch in `.run-plan-state.json`.

**Implementation:**
1. Parse `claude -p` stderr for token usage (Claude CLI outputs this)
2. Add `costs` object to state: `{"batch_N": {"input_tokens": N, "output_tokens": N, "cache_hits": N, "estimated_cost_usd": N}}`
3. Add `--show-costs` flag to `pipeline-status.sh`
4. Update `run-plan-notify.sh` to include cost in Telegram notifications
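
As a rough illustration of step 2, the entry for one batch can be produced from token counts and per-million-token prices. Everything here is an assumption: `batch_cost_json` is a hypothetical name, the prices are caller-supplied placeholders (not real rates), cached tokens are not discounted, and merging the entry into `.run-plan-state.json` is left out.

```shell
# Usage: batch_cost_json <batch> <input_tokens> <output_tokens> \
#          <cache_hits> <usd_per_million_input> <usd_per_million_output>
# Prints one JSON object in the shape given in step 2 above.
batch_cost_json() {
  awk -v b="$1" -v in_tok="$2" -v out_tok="$3" -v hits="$4" \
      -v pin="$5" -v pout="$6" 'BEGIN {
    cost = (in_tok * pin + out_tok * pout) / 1000000
    printf "{\"batch_%s\": {\"input_tokens\": %d, \"output_tokens\": %d, \"cache_hits\": %d, \"estimated_cost_usd\": %.4f}}\n", b, in_tok, out_tok, hits, cost
  }'
}

# Placeholder prices of 3 and 15 USD per million tokens.
batch_cost_json 1 1000000 500000 0 3 15
```

Passing prices as arguments keeps the estimate honest about being an estimate: the rates live with the caller, not buried in the tracking code.
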

**Tests:** Mock claude -p output with token counts, verify state updates.

#### Batch 3B: Prompt Caching Structure

**What:** Structure prompts with stable prefix (CLAUDE.md chain, skills, lessons — rarely changes) and variable suffix (batch tasks, context — changes each batch). This enables Anthropic's prompt caching to reuse the prefix across batches.

**Implementation:**
1. In `build_batch_prompt()`, separate `STABLE_PREFIX` (CLAUDE.md, lessons, conventions) from `VARIABLE_SUFFIX` (batch tasks, context, progress)
2. Write stable prefix to a file that `claude -p` reads via `--system-prompt-file` (if supported) or prepend it with a clear separator
3. Track cache hit rate in state file
|
|
211
|
+
|
|
212
|
+
**Evidence:** 83% cost reduction modeled (pricing analysis + cache priming). A 6-batch feature drops from $6.50 to $1.76.
|
|
213
|
+
|
|
214
|
+
**Tests:** Verify prompt structure separates stable/variable. Verify state tracks cache metrics.
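
A minimal sketch of the stable-prefix/variable-suffix split, assuming the prepend-with-separator fallback from step 2. The separator text and both function names are illustrative, not the toolkit's actual `build_batch_prompt()` internals. The point it shows is that prefix caching depends on the leading bytes being identical from batch to batch.

```bash
#!/usr/bin/env bash
# Sketch only: separator text and function names are assumptions for illustration.

PROMPT_SEPARATOR='--- BATCH CONTEXT (changes every batch) ---'

# assemble_batch_prompt <stable_prefix> <variable_suffix>
# Emits the stable prefix first, so consecutive batches share an identical
# leading byte sequence; only the text after the separator differs.
assemble_batch_prompt() {
    printf '%s\n%s\n%s\n' "$1" "$PROMPT_SEPARATOR" "$2"
}

# prefix_fingerprint <stable_prefix>
# Prints a checksum of the prefix. An unchanged value across batches means the
# prefix is still byte-identical and therefore still eligible for cache reuse.
prefix_fingerprint() {
    printf '%s' "$1" | cksum | cut -d' ' -f1
}
```

Recording the fingerprint next to the cache metrics in the state file would make it easy to tell a genuine cache miss from a prefix that changed between batches.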
|
|
215
|
+
|
|
216
|
+
#### Batch 3C: Structured progress.txt
|
|
217
|
+
|
|
218
|
+
**What:** Replace freeform `progress.txt` with defined sections:
|
|
219
|
+
|
|
220
|
+
```
|
|
221
|
+
## Batch N: <title>
|
|
222
|
+
### Files Modified
|
|
223
|
+
- path/to/file (created|modified|deleted)
|
|
224
|
+
|
|
225
|
+
### Decisions
|
|
226
|
+
- <decision>: <rationale>
|
|
227
|
+
|
|
228
|
+
### Issues Encountered
|
|
229
|
+
- <issue> → <resolution>
|
|
230
|
+
|
|
231
|
+
### State
|
|
232
|
+
- Tests: N passing
|
|
233
|
+
- Duration: Ns
|
|
234
|
+
- Cost: $N.NN
|
|
235
|
+
```
|
|
236
|
+
|
|
237
|
+
**Tests:** Update `test-run-plan-context.sh` to verify structured parsing.
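
To show what "structured parsing" could look like for the template above, here is a small sketch that pulls one field out of a batch's `### State` section. The function name `batch_state_field` is hypothetical, and the parser assumes the headings appear exactly as in the template.

```bash
#!/usr/bin/env bash
# Sketch only: assumes "## Batch N: <title>" and "### State" headings as in the template.

# batch_state_field <progress_file> <batch_number> <field>
# Prints the value of "- <field>: <value>" from the State section of one batch.
batch_state_field() {
    awk -v batch="$2" -v field="$3" '
        # A new batch heading: remember whether it is the batch we want.
        /^## Batch / { split($3, n, ":"); in_batch = (n[1] == batch); in_state = 0; next }
        # Any level-3 heading: we are in State only inside the wanted batch.
        /^### /      { in_state = (in_batch && $2 == "State"); next }
        # Inside the wanted State section, match the "- Field: " prefix literally.
        in_state && index($0, "- " field ": ") == 1 {
            print substr($0, length(field) + 5)
        }
    ' "$1"
}
```

Calling `batch_state_field progress.txt 2 Tests` would print something like `14 passing` for a file that follows the template.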
|
|
238
|
+
|
|
239
|
+
#### Quality Gate
|
|
240
|
+
- `make ci` passes
|
|
241
|
+
- Cost tracking produces data on a real 2+ batch run
|
|
242
|
+
- Structured progress.txt parses correctly
|
|
243
|
+
|
|
244
|
+
---
|
|
245
|
+
|
|
246
|
+
### Phase 4: Multi-Armed Bandit System (Updated)
|
|
247
|
+
|
|
248
|
+
**Goal:** Implement competing agents with LLM judge, informed by research findings.
|
|
249
|
+
**Effort:** 2-3 sessions
|
|
250
|
+
**Prerequisites:** Phase 1 (bug fixes), Phase 3 (cost tracking, caching)
|
|
251
|
+
|
|
252
|
+
#### Changes from Original Plan
|
|
253
|
+
|
|
254
|
+
The original 6-batch plan needs revision based on research findings:
|
|
255
|
+
|
|
256
|
+
| Original | Change | Reason |
|
|
257
|
+
|----------|--------|--------|
|
|
258
|
+
| LLM planner agent | Replace with Thompson Sampling | Research: Thompson Sampling is cheaper and better calibrated than LLM routing (MAB R1) |
|
|
259
|
+
| 6 batches, 26 tasks | Slim to 4 batches, ~18 tasks | Research: 80% infrastructure exists; prompts are just files; planner is now a function |
|
|
260
|
+
| Judge trusts automated routing | Add human calibration for first 10 decisions | Research: LLM-as-Judge reliability unvalidated (cross-cutting synthesis §F) |
|
|
261
|
+
| Default competitive mode | Selective MAB (~30% of batches) | Research: Cost breaks even only if MAB prevents 1 debugging batch per 2 features |
|
|
262
|
+
| `{AB_LESSONS}` placeholder | Fix to `{MAB_LESSONS}` | Bug in original plan: placeholder name doesn't match data file name |
|
|
263
|
+
|
|
264
|
+
#### Batch 4A: Foundation (Prompts + Architecture Map + Data Init)
|
|
265
|
+
|
|
266
|
+
Matches original Batch 1 but simplified:
|
|
267
|
+
|
|
268
|
+
1. Create 4 prompt files in `scripts/prompts/` (agent-a, agent-b, judge-agent, planner-agent)
|
|
269
|
+
2. Create `scripts/architecture-map.sh` (scans source for import/source dependencies)
|
|
270
|
+
3. Tests for architecture-map.sh
|
|
271
|
+
4. Create `scripts/lib/thompson-sampling.sh` — Beta distribution sampling for strategy routing:
|
|
272
|
+
   - `thompson_sample(wins, losses)` → returns sampled value (0-1)
|
|
273
|
+
   - `thompson_route(batch_type, strategy_perf_file)` → returns "superpowers" or "ralph" or "mab"
|
|
274
|
+
   - Pure bash using `bc` for floating point
|
|
275
|
+
5. Tests for thompson-sampling.sh
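
The sampling step in item 4 can be sketched as follows. This is not the planned `bc` implementation: it swaps in `awk` for the floating-point work, assumes a uniform Beta(1, 1) prior (so the draw comes from Beta(wins + 1, losses + 1)), and adds an optional seed argument purely for reproducible tests. `thompson_pick` is a hypothetical stand-in for routing that takes arms as arguments rather than reading `strategy_perf_file`, whose schema is not given here.

```bash
#!/usr/bin/env bash
# Sketch only: awk replaces bc; the seed argument and thompson_pick are assumptions.

# Shared awk helpers. For integer k, Gamma(k, 1) is the sum of k Exponential(1)
# draws, and X / (X + Y) is Beta(a, b) when X ~ Gamma(a) and Y ~ Gamma(b).
BETA_AWK='
function gamma_int(k,    i, sum) { sum = 0; for (i = 0; i < k; i++) sum -= log(1 - rand()); return sum }
function beta_int(a, b,    x, y) { x = gamma_int(a); y = gamma_int(b); return (x + y == 0) ? 0.5 : x / (x + y) }
'

# thompson_sample <wins> <losses> [seed]
# Prints one draw from Beta(wins + 1, losses + 1), a value between 0 and 1.
thompson_sample() {
    awk -v w="$1" -v l="$2" -v seed="${3:-}" "$BETA_AWK"'
        BEGIN {
            if (seed != "") srand(seed); else srand()
            printf "%.6f\n", beta_int(w + 1, l + 1)
        }'
}

# thompson_pick <name:wins:losses>...
# Draws once per arm inside a single awk process (one seed, independent draws)
# and prints the name of the arm with the highest draw.
thompson_pick() {
    awk "$BETA_AWK"'
        BEGIN {
            srand(); best = -1
            for (i = 1; i < ARGC; i++) {
                split(ARGV[i], f, ":")
                draw = beta_int(f[2] + 1, f[3] + 1)
                if (draw > best) { best = draw; name = f[1] }
            }
            print name
        }' "$@"
}
```

A call such as `thompson_pick superpowers:7:3 ralph:4:6` usually favors the arm with the better record but still explores the other one occasionally, which is the property that makes Thompson Sampling suitable for routing. Note that unseeded `srand()` typically seeds from the clock, so two unseeded `thompson_sample` calls in the same second may return the same draw.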
|
|
276
|
+
|
|
277
|
+
#### Batch 4B: MAB Orchestrator (mab-run.sh)
|
|
278
|
+
|
|
279
|
+
Core orchestrator, simplified from original Batch 2:
|
|
280
|
+
|
|
281
|
+
1. `scripts/mab-run.sh` — argument parsing, data init, worktree management, prompt assembly
|
|
282
|
+
2. Agent execution (parallel `claude -p` in separate worktrees)
|
|
283
|
+
3. Quality gate on both agents
|
|
284
|
+
4. Judge invocation (separate `claude -p` with read-only tools)
|
|
285
|
+
5. Winner selection (gate override: if only one passes, that one wins regardless of judge)
|
|
286
|
+
6. Data updates (strategy-perf.json, mab-lessons.json, mab-run-<ts>.json)
|
|
287
|
+
7. Human calibration mode: for first 10 decisions, present judge verdict to user for approval before merge
|
|
288
|
+
8. Cleanup (worktree removal)
|
|
289
|
+
9. Tests for mab-run.sh (dry-run, data init, argument validation)
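
The gate-override rule in step 5 is small enough to state as code. The sketch below uses a hypothetical `resolve_winner` helper; returning `none` when neither agent passes is an assumption, since the plan does not say what happens in that case.

```bash
#!/usr/bin/env bash
# Sketch only: function name and the "none" outcome are assumptions.

# resolve_winner <gate_a: pass|fail> <gate_b: pass|fail> <judge_pick: a|b>
# Prints "a", "b", or "none".
resolve_winner() {
    case "$1:$2" in
        pass:pass) printf '%s\n' "$3" ;;  # both passed: defer to the judge
        pass:fail) printf 'a\n' ;;        # gate override: only A passed
        fail:pass) printf 'b\n' ;;        # gate override: only B passed
        *)         printf 'none\n' ;;     # neither passed: nothing to merge
    esac
}
```

Keeping this decision in a pure function makes it straightforward to cover in the dry-run tests without invoking `claude -p`.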
|
|
290
|
+
|
|
291
|
+
#### Batch 4C: Integration (run-plan --mab + context injection)
|
|
292
|
+
|
|
293
|
+
Wire into existing pipeline:
|
|
294
|
+
|
|
295
|
+
1. Add `--mab` flag to `run-plan.sh`
|
|
296
|
+
2. Inject MAB lessons into per-batch context (`run-plan-context.sh`)
|
|
297
|
+
3. Add Thompson Sampling routing call before batch execution (when `--mab` is set)
|
|
298
|
+
4. Update `pipeline-status.sh` with MAB section
|
|
299
|
+
5. Tests for CLI flags and context injection
|
|
300
|
+
|
|
301
|
+
#### Batch 4D: Community Sync + Lesson Promotion + Docs
|
|
302
|
+
|
|
303
|
+
1. `scripts/pull-community-lessons.sh` — fetch lessons from upstream
|
|
304
|
+
2. `scripts/promote-mab-lessons.sh` — auto-promote patterns with 3+ occurrences
|
|
305
|
+
3. Update `docs/ARCHITECTURE.md` with MAB section
|
|
306
|
+
4. Update `CLAUDE.md` with MAB capabilities
|
|
307
|
+
5. Tests for both scripts
|
|
308
|
+
6. Run full `make ci`
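
The "3+ occurrences" promotion rule from step 2 could be sketched as a simple frequency filter. This assumes the pattern names have already been extracted from `mab-lessons.json` into one name per line; that extraction step and the function name `promotable_patterns` are not part of the plan.

```bash
#!/usr/bin/env bash
# Sketch only: input format (one pattern name per line on stdin) is an assumption.

# promotable_patterns [min_occurrences]
# Prints each pattern that appears at least min_occurrences times (default 3).
promotable_patterns() {
    sort | uniq -c | awk -v min="${1:-3}" '
        # uniq -c prefixes each line with its count; keep lines at or above the
        # threshold and strip the count before printing.
        $1 >= min { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }'
}
```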
|
|
309
|
+
|
|
310
|
+
#### Quality Gate
|
|
311
|
+
- `make ci` passes
|
|
312
|
+
- `mab-run.sh --dry-run` works end-to-end
|
|
313
|
+
- `architecture-map.sh` produces valid JSON on the toolkit itself
|
|
314
|
+
- Thompson sampling unit tests pass
|
|
315
|
+
- All 20+ previous bugs still fixed
|
|
316
|
+
|
|
317
|
+
---
|
|
318
|
+
|
|
319
|
+
### Phase 5: Adoption & Polish
|
|
320
|
+
|
|
321
|
+
**Goal:** Make the toolkit usable by someone who isn't Justin.
|
|
322
|
+
**Effort:** 1-2 sessions
|
|
323
|
+
**Prerequisites:** Phase 2 (plan quality), Phase 4 (MAB)
|
|
324
|
+
|
|
325
|
+
#### Batch 5A: Lesson Scope Metadata
|
|
326
|
+
|
|
327
|
+
**What:** Add `scope` field to lesson YAML frontmatter:
|
|
328
|
+
|
|
329
|
+
```yaml
|
|
330
|
+
scope: universal | language:python | language:bash | framework:pytest | domain:ha-aria | project-specific
|
|
331
|
+
```
|
|
332
|
+
|
|
333
|
+
Update `lesson-check.sh` to:
|
|
334
|
+
1. Detect project languages from file extensions
|
|
335
|
+
2. Skip lessons whose scope doesn't match the project
|
|
336
|
+
3. Add `--all-scopes` flag to override filtering
|
|
337
|
+
|
|
338
|
+
Update all 66 existing lessons with appropriate scope tags.
|
|
339
|
+
|
|
340
|
+
**Evidence:** Without scope, false positive rate hits 67% at ~100 lessons (Zimmermann, 622 predictions).
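
A minimal sketch of the filtering described in steps 1 and 2, under stated assumptions: only `.py` and `.sh` extensions are detected, and `framework:`, `domain:`, and `project-specific` scopes are skipped because matching them needs signals this sketch does not model. Both function names are hypothetical and do not reflect how `lesson-check.sh` is actually structured.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the skip-by-default rule for non-language scopes are assumptions.

# detect_languages <project_root>
# Prints a space-separated list of languages inferred from file extensions.
detect_languages() {
    local root="$1" langs=""
    if [ -n "$(find "$root" -type f -name '*.py' | head -n 1)" ]; then langs="$langs python"; fi
    if [ -n "$(find "$root" -type f -name '*.sh' | head -n 1)" ]; then langs="$langs bash"; fi
    printf '%s\n' "${langs# }"
}

# scope_applies <scope> <languages>
# Returns 0 when a lesson with this scope should run for a project using <languages>.
scope_applies() {
    local scope="$1" langs=" $2 "
    case "$scope" in
        universal)  return 0 ;;
        language:*) case "$langs" in *" ${scope#language:} "*) return 0 ;; esac
                    return 1 ;;
        *)          return 1 ;;  # other scopes need signals not modeled here
    esac
}
```

An `--all-scopes` flag would simply bypass the `scope_applies` check.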
|
|
341
|
+
|
|
342
|
+
#### Batch 5B: Fast Lane Onboarding
|
|
343
|
+
|
|
344
|
+
**What:**
|
|
345
|
+
1. Create `examples/quickstart-plan.md` — a 2-batch plan that reaches first quality-gated execution in 3 commands
|
|
346
|
+
2. Rewrite `README.md` to under 100 lines with progressive disclosure
|
|
347
|
+
3. Add `Getting Started in 5 Minutes` section with:
|
|
348
|
+
```bash
|
|
349
|
+
git clone ... && cd autonomous-coding-toolkit
|
|
350
|
+
./scripts/run-plan.sh examples/quickstart-plan.md --project-root /tmp/quickstart-demo
|
|
351
|
+
# Watch: batch execution → quality gate → test count → DONE
|
|
352
|
+
```
|
|
353
|
+
4. Move detailed docs to `docs/` (ARCHITECTURE.md already there)
|
|
354
|
+
|
|
355
|
+
**Evidence:** 34.7% abandonment rate when setup is difficult.
|
|
356
|
+
|
|
357
|
+
#### Batch 5C: Expand Lessons to 6 Clusters
|
|
358
|
+
|
|
359
|
+
Add 12 starter lessons for the three new clusters:
|
|
360
|
+
|
|
361
|
+
- **Cluster D (Specification Drift):** 4 lessons — agent misinterprets requirements, builds wrong thing correctly
|
|
362
|
+
- **Cluster E (Context & Retrieval):** 4 lessons — wrong files read, stale context, lost information
|
|
363
|
+
- **Cluster F (Planning & Control Flow):** 4 lessons — wrong decomposition, dependency errors, scope creep
|
|
364
|
+
|
|
365
|
+
Update `docs/lessons/SUMMARY.md` with new clusters.
|
|
366
|
+
|
|
367
|
+
#### Quality Gate
|
|
368
|
+
- `make ci` passes
|
|
369
|
+
- Quickstart demo runs end-to-end in < 5 minutes
|
|
370
|
+
- Lesson scope filtering reduces false matches on non-Python projects
|
|
371
|
+
|
|
372
|
+
---
|
|
373
|
+
|
|
374
|
+
### Phase 6: Pipeline Extensions
|
|
375
|
+
|
|
376
|
+
**Goal:** Add research phase and roadmap stage to the pipeline.
|
|
377
|
+
**Effort:** 2-3 sessions
|
|
378
|
+
**Prerequisites:** Phase 2 (context restructuring), Phase 5 (lesson scope)
|
|
379
|
+
|
|
380
|
+
#### Batch 6A: Research Phase (Stage 1.5)
|
|
381
|
+
|
|
382
|
+
Per the design in `2026-02-22-research-phase-integration.md`:
|
|
383
|
+
|
|
384
|
+
1. Create `skills/research/SKILL.md` — 10-step research protocol
|
|
385
|
+
2. Create `scripts/research-gate.sh` — blocks PRD if blocking issues unresolved
|
|
386
|
+
3. Update `scripts/lib/run-plan-context.sh` — inject research warnings
|
|
387
|
+
4. Update `scripts/auto-compound.sh` — replace Step 2.5 with research phase
|
|
388
|
+
5. Update `skills/autocode/SKILL.md` — add Stage 1.5
|
|
389
|
+
6. Tests for research-gate.sh
|
|
390
|
+
|
|
391
|
+
Artifacts produced:
|
|
392
|
+
- `tasks/research-<slug>.md` — human-readable report
|
|
393
|
+
- `tasks/research-<slug>.json` — machine-readable for PRD scoping
|
|
394
|
+
|
|
395
|
+
#### Batch 6B: Roadmap Stage (Stage 0.5)
|
|
396
|
+
|
|
397
|
+
1. Create `skills/roadmap/SKILL.md` — multi-feature sequencing
|
|
398
|
+
2. Update `skills/autocode/SKILL.md` — add Stage 0.5
|
|
399
|
+
3. Create `examples/example-roadmap.md` — sample roadmap
|
|
400
|
+
|
|
401
|
+
#### Batch 6C: Positive Policy System
|
|
402
|
+
|
|
403
|
+
1. Create `policies/` directory with `universal.md`, `python.md`, `bash.md`, `testing.md`
|
|
404
|
+
2. Add `positive_alternative` field to lesson YAML template
|
|
405
|
+
3. Create `scripts/policy-check.sh` — audit mode (advisory, not blocking)
|
|
406
|
+
4. Update `lesson-check.sh` to read positive alternatives and include in violation messages
|
|
407
|
+
5. Tests for policy-check.sh
|
|
408
|
+
|
|
409
|
+
**Evidence:** Positive instructions outperform negative for LLMs (NeQA benchmark, Pink Elephant Problem).
|
|
410
|
+
|
|
411
|
+
#### Quality Gate
|
|
412
|
+
- `make ci` passes
|
|
413
|
+
- Research gate blocks on a test file with blocking issues
|
|
414
|
+
- Roadmap skill produces valid artifact
|
|
415
|
+
- Policy check runs without errors on toolkit itself
|
|
416
|
+
|
|
417
|
+
---
|
|
418
|
+
|
|
419
|
+
### Phase 7: Agent Suite
|
|
420
|
+
|
|
421
|
+
**Goal:** Ship the 6 new agents and 8 existing agent improvements.
|
|
422
|
+
**Effort:** 1-2 sessions
|
|
423
|
+
**Prerequisites:** Phase 1 (bugs), Phase 2 (lesson-scanner scan groups reference updated lessons)
|
|
424
|
+
|
|
425
|
+
Per the design in `2026-02-23-agent-suite-design.md`:
|
|
426
|
+
|
|
427
|
+
#### Batch 7A: New Agents (6)
|
|
428
|
+
|
|
429
|
+
All placed in `~/.claude/agents/` (global) AND `agents/` (toolkit repo):
|
|
430
|
+
|
|
431
|
+
1. `bash-expert.md` — review/write/debug bash scripts
|
|
432
|
+
2. `shell-expert.md` — diagnose systemd/PATH/permissions issues
|
|
433
|
+
3. `python-expert.md` — async discipline, resource lifecycle, type safety
|
|
434
|
+
4. `integration-tester.md` — verify cross-service data flows
|
|
435
|
+
5. `dependency-auditor.md` — CVE/outdated/license scanning (read-only)
|
|
436
|
+
6. `service-monitor.md` — service/timer health auditing
|
|
437
|
+
|
|
438
|
+
#### Batch 7B: Existing Agent Improvements
|
|
439
|
+
|
|
440
|
+
P0 (correctness): security-reviewer tools/categories, infra-auditor freshness, lesson-scanner count
|
|
441
|
+
P1 (quality): model/maxTurns on all agents, doc-updater git diff
|
|
442
|
+
P2 (capability): lesson-scanner scan groups, notion fallbacks
|
|
443
|
+
P3 (polish): doc-updater output, counter-daily scope rule
|
|
444
|
+
|
|
445
|
+
#### Quality Gate
|
|
446
|
+
- All 14 agents have valid frontmatter (name, model, tools, maxTurns)
|
|
447
|
+
- `make ci` passes
|
|
448
|
+
- No agent references nonexistent tools
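
The frontmatter check in the first gate item could be automated along these lines. The sketch assumes each agent file opens with a `---` delimited block containing top-level `key:` lines; the function name is hypothetical and it does not validate the values themselves.

```bash
#!/usr/bin/env bash
# Sketch only: checks key presence in the leading frontmatter block, nothing more.

# frontmatter_missing_keys <agent_file>
# Prints each required key (name, model, tools, maxTurns) that is absent.
frontmatter_missing_keys() {
    local file="$1" key
    for key in name model tools maxTurns; do
        awk -v key="$key" '
            NR == 1 { if ($0 != "---") exit; next }    # no frontmatter block at all
            $0 == "---" { exit }                       # end of block, key not seen
            index($0, key ":") == 1 { found = 1; exit }
            END { exit !found }
        ' "$file" || printf '%s\n' "$key"
    done
}
```

Running it over `agents/*.md` and failing when any output is produced would turn the gate item into a repeatable check.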
|
|
449
|
+
|
|
450
|
+
---
|
|
451
|
+
|
|
452
|
+
## Dependency Graph
|
|
453
|
+
|
|
454
|
+
```
|
|
455
|
+
Phase 1: Stabilize (bug fixes)
|
|
456
|
+
    │
|
|
457
|
+
    ├──► Phase 2: Pre-Execution Quality
|
|
458
|
+
    │       │
|
|
459
|
+
    │       ├──► Phase 4: MAB System ◄── Phase 3: Cost Infrastructure
|
|
460
|
+
    │       │       │
|
|
461
|
+
    │       │       ├──► Phase 5: Adoption & Polish
|
|
462
|
+
    │       │       │
|
|
463
|
+
    │       │       └──► Phase 6: Pipeline Extensions
|
|
464
|
+
    │       │
|
|
465
|
+
    │       └──► Phase 6: Pipeline Extensions
|
|
466
|
+
    │
|
|
467
|
+
    └──► Phase 7: Agent Suite (independent, can run in parallel with 2-6)
|
|
468
|
+
```
|
|
469
|
+
|
|
470
|
+
**Critical path:** 1 → 2 → 3 → 4 → 5
|
|
471
|
+
**Parallel track:** 7 can run anytime after Phase 1
|
|
472
|
+
|
|
473
|
+
---
|
|
474
|
+
|
|
475
|
+
## Effort Summary
|
|
476
|
+
|
|
477
|
+
| Phase | Batches | Estimated Sessions | Key Deliverable |
|
|
478
|
+
|-------|---------|-------------------|-----------------|
|
|
479
|
+
| 1: Stabilize | 2 | 1-2 | 20 bugs fixed, CI green |
|
|
480
|
+
| 2: Pre-Execution Quality | 3 | 1-2 | Plan scorecard, echo-back gate, context restructuring |
|
|
481
|
+
| 3: Cost Infrastructure | 3 | 1 | Cost tracking, prompt caching, structured progress |
|
|
482
|
+
| 4: MAB System | 4 | 2-3 | Competing agents, judge, Thompson Sampling, lesson promotion |
|
|
483
|
+
| 5: Adoption & Polish | 3 | 1-2 | Scope metadata, fast lane, 6 clusters |
|
|
484
|
+
| 6: Pipeline Extensions | 3 | 2-3 | Research phase, roadmap stage, positive policies |
|
|
485
|
+
| 7: Agent Suite | 2 | 1-2 | 6 new agents, 8 improvements |
|
|
486
|
+
| **Total** | **20** | **9-15** | **v1.0** |
|
|
487
|
+
|
|
488
|
+
---
|
|
489
|
+
|
|
490
|
+
## What "v1.0" Means
|
|
491
|
+
|
|
492
|
+
The toolkit reaches v1.0 when:
|
|
493
|
+
|
|
494
|
+
1. **Core pipeline works end-to-end** for headless, ralph loop, and MAB modes ✓ (mostly done)
|
|
495
|
+
2. **Quality gates catch real bugs** with < 20% false positive rate (needs scope metadata)
|
|
496
|
+
3. **Cost is tracked and optimized** (prompt caching, per-batch cost data)
|
|
497
|
+
4. **A new user can start in < 5 minutes** (fast lane onboarding)
|
|
498
|
+
5. **MAB produces measurable learning** (strategy-perf.json with 10+ data points, human-calibrated judge)
|
|
499
|
+
6. **Research phase produces durable artifacts** (not ephemeral conversation)
|
|
500
|
+
7. **Zero known bugs in core pipeline** (all 20 issues closed)
|
|
501
|
+
8. **Documentation is complete** — ARCHITECTURE.md, README, CONTRIBUTING, examples
|
|
502
|
+
|
|
503
|
+
### What's NOT in v1.0
|
|
504
|
+
|
|
505
|
+
- Multi-language support beyond Python/bash (deferred — no evidence of demand)
|
|
506
|
+
- CI/CD integration (GitHub Actions workflow exists but not tested across repos)
|
|
507
|
+
- Web dashboard (pipeline-status.sh is CLI-only)
|
|
508
|
+
- Pinecone-backed lesson dedup (only needed at 100+ lessons)
|
|
509
|
+
- Agent chains (post-commit audit, service triage, pre-release)
|
|
510
|
+
- Property-based testing integration (guidance only, no automation)
|
|
511
|
+
|
|
512
|
+
---
|
|
513
|
+
|
|
514
|
+
## Lean Gate
|
|
515
|
+
|
|
516
|
+
**Hypothesis:** A structured autonomous coding pipeline with quality gates and competing agents produces higher-quality code with fewer debugging cycles than manual Claude Code usage.
|
|
517
|
+
|
|
518
|
+
**MVP:** Phases 1-4 (stabilize + pre-execution quality + cost + MAB). Everything after is optimization.
|
|
519
|
+
|
|
520
|
+
**First 5 users:** Justin (primary), then 4 Claude Code power users from GitHub/Discord who have expressed interest in autonomous execution.
|
|
521
|
+
|
|
522
|
+
**Success metric:** Measured reduction in debugging batches per feature (target: < 1 retry per 5-batch feature, vs current ~2-3).
|
|
523
|
+
|
|
524
|
+
**Pivot trigger:** If MAB shows no win-rate differentiation after 20 features (10 per strategy), downgrade to single-strategy with the lessons system only.
|
|
525
|
+
|
|
526
|
+
---
|
|
527
|
+
|
|
528
|
+
## Next Action
|
|
529
|
+
|
|
530
|
+
Start with **Phase 1, Batch 1A** — fix the 7 medium-severity bugs. These affect core functionality (state management, batch execution, sampling) and must be fixed before any new features are built on top.
|
|
@@ -0,0 +1,98 @@
|
|
|
1
|
+
# Design: Headless Module Split
|
|
2
|
+
|
|
3
|
+
**Date:** 2026-02-24
|
|
4
|
+
**Status:** Approved
|
|
5
|
+
**Problem:** `scripts/lib/run-plan-headless.sh` is 681 lines (project limit: 300). Three concerns are mixed in one file: echo-back gate, sampling candidates, and batch orchestration.
|
|
6
|
+
**Approach:** Extract two new lib modules. Fix issue #73 (MAB path resolution).
|
|
7
|
+
|
|
8
|
+
## Extraction 1: Echo-Back Gate
|
|
9
|
+
|
|
10
|
+
### New file: `scripts/lib/run-plan-echo-back.sh`
|
|
11
|
+
|
|
12
|
+
**Functions moved (verbatim):**
|
|
13
|
+
- `_echo_back_check()` — lightweight keyword-match gate on agent output (lines 19-63)
|
|
14
|
+
- `echo_back_check()` — full spec verification: agent restatement → haiku verdict → retry once (lines 65-163)
|
|
15
|
+
|
|
16
|
+
**Globals (read-only):** `SKIP_ECHO_BACK`, `STRICT_ECHO_BACK`
|
|
17
|
+
|
|
18
|
+
**Interface:** No signature changes. Functions called by name from `run_mode_headless()`.
|
|
19
|
+
|
|
20
|
+
**Source order in `run-plan.sh`:** Add before headless source line:
|
|
21
|
+
```bash
|
|
22
|
+
source "$SCRIPT_DIR/lib/run-plan-echo-back.sh"
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
**Test changes:**
|
|
26
|
+
- `test-echo-back.sh`: Change source from `run-plan-headless.sh` to `run-plan-echo-back.sh`
|
|
27
|
+
- `test-run-plan-headless.sh`: 5 tests for `_echo_back_check()` move to `test-echo-back.sh` (or source both modules)
|
|
28
|
+
|
|
29
|
+
**Reuse opportunity:** `run-plan-team.sh` can source this module to add spec verification before team batch groups — implements lesson #61 across execution modes.
|
|
30
|
+
|
|
31
|
+
## Extraction 2: Sampling Candidates
|
|
32
|
+
|
|
33
|
+
### New file: `scripts/lib/run-plan-sampling.sh`
|
|
34
|
+
|
|
35
|
+
**New function wrapping extracted code:**
|
|
36
|
+
```bash
|
|
37
|
+
# run_sampling_candidates <worktree> <plan_file> <batch> <prompt> <quality_gate_cmd>
|
|
38
|
+
# Returns: 0 if winner found (worktree has winner's changes), 1 if no candidate passed
|
|
39
|
+
# Side-effects: writes logs/sampling-outcomes.json, uses patch files in /tmp/
|
|
40
|
+
run_sampling_candidates() { ... }
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
**Code moved:** Lines 373-494 of current `run_mode_headless()` (the sampling block inside the retry while-loop).
|
|
44
|
+
|
|
45
|
+
**Also extracted:**
|
|
46
|
+
- `check_memory_for_sampling()` — memory guard logic (current lines 354-369), reusable by any mode
|
|
47
|
+
|
|
48
|
+
**Globals (read-only):** `SAMPLE_COUNT`, `SAMPLE_ON_RETRY`, `SAMPLE_ON_CRITICAL`, `SAMPLE_DEFAULT_COUNT`, `SAMPLE_MIN_MEMORY_PER_GB`
|
|
49
|
+
|
|
50
|
+
**Call site in headless:** Replace inline sampling block with:
|
|
51
|
+
```bash
|
|
52
|
+
if [[ "${SAMPLE_COUNT:-0}" -gt 0 && $attempt -ge 2 ]]; then
|
|
53
|
+
    check_memory_for_sampling || SAMPLE_COUNT=0
|
|
54
|
+
    if [[ "${SAMPLE_COUNT:-0}" -gt 0 ]]; then
|
|
55
|
+
        if run_sampling_candidates "$WORKTREE" "$PLAN_FILE" "$batch" "$prompt" "$QUALITY_GATE_CMD"; then
|
|
56
|
+
            batch_passed=true
|
|
57
|
+
            break
|
|
58
|
+
        fi
|
|
59
|
+
        continue
|
|
60
|
+
    fi
|
|
61
|
+
fi
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
**Source order in `run-plan.sh`:** Add before headless:
|
|
65
|
+
```bash
|
|
66
|
+
source "$SCRIPT_DIR/lib/run-plan-sampling.sh"
|
|
67
|
+
```
|
|
68
|
+
|
|
69
|
+
**Dependencies:** Requires `run-plan-scoring.sh` (for `score_candidate`, `select_winner`, `classify_batch_type`, `get_prompt_variants`).
|
|
70
|
+
|
|
71
|
+
## Bug Fix: Issue #73
|
|
72
|
+
|
|
73
|
+
**File:** `scripts/lib/run-plan-headless.sh` line 251
|
|
74
|
+
**Before:** `"$SCRIPT_DIR/../mab-run.sh"`
|
|
75
|
+
**After:** `"$SCRIPT_DIR/mab-run.sh"`
|
|
76
|
+
**Root cause:** `SCRIPT_DIR` resolves to `scripts/` (set in `run-plan.sh` line 14). `../mab-run.sh` looks at repo root; `mab-run.sh` lives in `scripts/`.
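
The root cause can be reproduced in a throwaway directory. The sketch below mirrors the described layout (`scripts/mab-run.sh`, with `SCRIPT_DIR` pointing at `scripts/`) and checks both paths. The function name and the temporary tree are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch only: builds a scratch tree that mirrors the described layout, then removes it.

demo_issue_73() {
    local demo_root
    demo_root=$(mktemp -d)
    mkdir -p "$demo_root/scripts/lib"
    : > "$demo_root/scripts/mab-run.sh"
    local SCRIPT_DIR="$demo_root/scripts"  # the value run-plan.sh is described as setting

    # Old path: one level above scripts/, i.e. the repo root, where no mab-run.sh exists.
    if [ -e "$SCRIPT_DIR/../mab-run.sh" ]; then echo "before: found"; else echo "before: missing"; fi
    # Fixed path: inside scripts/, next to run-plan.sh.
    if [ -e "$SCRIPT_DIR/mab-run.sh" ]; then echo "after: found"; else echo "after: missing"; fi

    rm -rf "$demo_root"
}
```

Running `demo_issue_73` prints `before: missing` followed by `after: found`, which matches the one-line fix above.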
|
|
77
|
+
|
|
78
|
+
## Resulting Line Counts
|
|
79
|
+
|
|
80
|
+
| Module | Before | After |
|
|
81
|
+
|--------|--------|-------|
|
|
82
|
+
| `run-plan-headless.sh` | 681 | ~416 |
|
|
83
|
+
| `run-plan-echo-back.sh` | (new) | ~145 |
|
|
84
|
+
| `run-plan-sampling.sh` | (new) | ~135 |
|
|
85
|
+
|
|
86
|
+
**Remaining debt:** Headless at ~416 lines is still over the 300-line limit. The remaining bulk is the sequential batch orchestration loop (init → prompt → claude → gate → notify → failure handling). This loop is inherently sequential, so further splitting would create artificial boundaries. Future candidate: retry/escalation logic (~60 lines) if the module grows again.
|
|
87
|
+
|
|
88
|
+
## Implementation Order
|
|
89
|
+
|
|
90
|
+
1. Create `run-plan-echo-back.sh` (move functions, update sources, fix tests)
|
|
91
|
+
2. Create `run-plan-sampling.sh` (extract + wrap in function, update call site)
|
|
92
|
+
3. Fix #73 (one-line path change)
|
|
93
|
+
4. Run full test suite to confirm no regressions
|
|
94
|
+
5. Commit and close #73
|
|
95
|
+
|
|
96
|
+
## Risk
|
|
97
|
+
|
|
98
|
+
**Low.** Echo-back extraction is a pure function move with no interface change. Sampling extraction wraps existing code in a function; the only new interface is the 5-parameter signature. Both are covered by existing test files.
|