autonomous-coding-toolkit 1.0.0
# Design: npm Packaging as a Learning System

> **Date:** 2026-02-24
> **Status:** Approved
> **Goal:** Package the autonomous-coding-toolkit as a publicly installable npm package that improves with every run, every user, and every failure. It is not just a tool, but a compounding learning system.

## The Thesis

The toolkit's differentiator isn't any single feature; it's that **the system gets better with every run**. Lessons compound, strategy routing learns, quality gates adapt, and trust earns autonomy. The packaging must expose this learning loop as a first-class concept:

```
Every run → telemetry captured
Every failure → lesson candidate
Every lesson → community contribution candidate
Every community contribution → all users improve
Every improvement → measured by benchmarks
```

That's how you code better than a human on large projects: not by being smarter on any single batch, but by compounding learning across thousands of batches and hundreds of users.

---

## Research Foundation

This design is governed by findings from the 25-paper cross-cutting synthesis (`research/2026-02-22-cross-cutting-synthesis.md`). The key findings that drive design decisions:

| # | Finding | Confidence | Design Impact |
|---|---------|------------|---------------|
| 1 | Plan quality worth ~3x execution capability | High | Plan scoring learns which dimensions predict success |
| 2 | Fresh context per batch is superior to accumulated | High | Core architecture preserved; this is the #1 differentiator |
| 3 | Prompt caching yields 83% cost reduction | High | Stable prefix structure in prompts |
| 4 | Lost in the Middle: 20pp accuracy degradation | High | Task top, requirements bottom in context assembly |
| 5 | Spec misunderstanding is 60%+ of failures for strong models | Medium | Two-tier echo-back gate |
| 6 | Lesson system covers 30-40% of failure surface | Medium-High | Expand to 6 clusters, add spec drift coverage |
| 7 | 34.7% abandon on difficult setup | Medium | Fast lane onboarding under 3 minutes |
| 8 | Positive instructions outperform negative for LLMs | Medium-High | Policy system promoted alongside lessons |
| 9 | Transferability depends on abstraction level | High | Scope metadata prevents false positive death spiral |
| 10 | Coordination is #1 multi-agent failure mode (37%) | High | Structured artifacts over chat for agent communication |
| 11 | Property-based testing finds 50x more mutations | High | Testing guidance in plan skill |
| 12 | Optimal multi-agent team size is 3-4 | High | Subagent-driven-dev stays within this bound |
| 13 | No benchmark suite = can't prove improvement | n/a | Benchmark suite ships with package |
| 14 | Single-user testing is not testing | n/a | Federated telemetry across users |

---

## Part 1: Package Structure

### Approach: npm + Claude Code Plugin (dual surface)

- **npm:** `npm install -g autonomous-coding-toolkit` → `act` CLI on PATH
- **Plugin:** `/install autonomous-coding-toolkit` → skills, commands, agents in Claude Code

Both install from the same repo. Nothing moves: we add `package.json` + `bin/act.js` on top of the existing structure.

### Directory Layout (additions marked with `**`)

```
autonomous-coding-toolkit/
├── **package.json**         # npm: name, version, bin, files, engines
├── **bin/**
│   └── **act.js**           # Node.js CLI router (~150 lines)
├── scripts/                 # 32 bash scripts (UNCHANGED)
│   ├── lib/                 # 18 modules (UNCHANGED)
│   ├── prompts/             # 4 agent prompts (UNCHANGED)
│   ├── patterns/            # 5 ast-grep rules (UNCHANGED)
│   ├── tests/               # Script tests (UNCHANGED)
│   └── **init.sh**          # Project bootstrapper (~100 lines)
├── skills/                  # 20 skills (UNCHANGED)
├── commands/                # 7 commands (UNCHANGED)
├── agents/                  # 7 agents (UNCHANGED)
├── hooks/                   # hooks.json + stop-hook.sh (UNCHANGED)
├── policies/                # 4 positive pattern defs (UNCHANGED)
├── examples/                # 4 samples (UNCHANGED)
├── **benchmarks/**          # 5 reproducible benchmark tasks
│   ├── **tasks/**           # Task definitions + reference implementations
│   ├── **rubrics/**         # Machine-scored evaluation rubrics
│   └── **runner.sh**        # Benchmark orchestrator
├── docs/
│   ├── ARCHITECTURE.md      # System design
│   ├── CONTRIBUTING.md      # Lesson submission guide
│   └── lessons/             # 79 lessons + framework (BUNDLED)
├── .claude-plugin/          # Plugin metadata (UNCHANGED)
├── .github/                 # CI (UNCHANGED)
├── Makefile                 # lint, test, validate, ci
├── SECURITY.md
├── README.md
└── .gitignore
```

### package.json

```json
{
  "name": "autonomous-coding-toolkit",
  "version": "1.0.0",
  "description": "Autonomous AI coding pipeline: quality gates, fresh-context execution, community lessons, and compounding learning",
  "license": "MIT",
  "author": "Justin McFarland <parthalon025@gmail.com>",
  "homepage": "https://github.com/parthalon025/autonomous-coding-toolkit",
  "repository": "https://github.com/parthalon025/autonomous-coding-toolkit",
  "bin": {
    "act": "./bin/act.js"
  },
  "files": [
    "bin/",
    "scripts/",
    "skills/",
    "commands/",
    "agents/",
    "hooks/",
    "policies/",
    "examples/",
    "benchmarks/",
    "docs/",
    ".claude-plugin/",
    "Makefile",
    "SECURITY.md"
  ],
  "engines": {
    "node": ">=18.0.0"
  },
  "os": ["linux", "darwin", "win32"],
  "keywords": [
    "autonomous-coding", "ai-agents", "quality-gates",
    "claude-code", "tdd", "lessons-learned", "headless",
    "multi-armed-bandit", "code-review", "pipeline"
  ]
}
```

**Note:** The `files` field is an allowlist, so runtime state (`logs/`, `.run-plan-state.json`, `progress.txt`, `.worktrees/`) is left out of the published package. These files are project-local, not distributable.

### Windows Support

Scripts are bash, so Windows users need WSL (Windows Subsystem for Linux). At startup, `bin/act.js` checks that bash is available and prints a WSL installation hint if it is missing. Claude Code users on Windows already have WSL as a practical requirement.

---

## Part 2: CLI Surface

### bin/act.js: Node.js Router (~150 lines)

Responsibilities:

1. **Platform check**: verify `bash` is available; print a WSL hint on Windows
2. **Subcommand routing**: dispatch to the correct bash script
3. **Toolkit root resolution**: `path.resolve(__dirname, '..')` (works for npm global, npx, and local clone)
4. **Pass-through**: forward all args and preserve exit codes
5. **Version/help**: built in, no bash needed

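The router itself is Node.js, but the routing duties in the list above are easy to see in a bash analogue. The sketch below is hypothetical and covers only a subset of the command map: `act_route`, `act_dispatch`, and the `ACT_ARGV` array are illustrative names, not part of the toolkit. It shows a subcommand-to-script table (including two-level commands such as `act lessons check`), toolkit-root resolution relative to the router's own file, and `exec` so that arguments and exit codes pass through unchanged.

```bash
#!/usr/bin/env bash
# Hypothetical bash analogue of the dispatch done by bin/act.js (Node.js).
# Only a subset of the command map is included; names are illustrative.

# Route a command line: on success, ACT_ARGV holds the script path
# (relative to the toolkit root) followed by the arguments to forward.
# Returns 1 for an unknown command.
act_route() {
  local cmd="${1:-}" script
  [ $# -gt 0 ] && shift
  case "$cmd" in
    plan)      script="scripts/run-plan.sh" ;;
    compound)  script="scripts/auto-compound.sh" ;;
    mab)       script="scripts/mab-run.sh" ;;
    gate)      script="scripts/quality-gate.sh" ;;
    check)     script="scripts/lesson-check.sh" ;;
    validate)  script="scripts/validate-all.sh" ;;
    telemetry) script="scripts/telemetry.sh" ;;
    init)      script="scripts/init.sh" ;;
    lessons)
      # Two-level command: the second word selects the script.
      local sub="${1:-}"
      [ $# -gt 0 ] && shift
      case "$sub" in
        pull)    script="scripts/pull-community-lessons.sh" ;;
        check)   script="scripts/lesson-check.sh"; set -- --list "$@" ;;
        promote) script="scripts/promote-mab-lessons.sh" ;;
        infer)   script="scripts/scope-infer.sh" ;;
        *)       return 1 ;;
      esac
      ;;
    benchmark)
      # "benchmark run [name]" -> "runner.sh [name]";
      # "benchmark compare a b" -> "runner.sh compare a b".
      [ "${1:-}" = "run" ] && shift
      script="benchmarks/runner.sh"
      ;;
    *) return 1 ;;
  esac
  ACT_ARGV=("$script" "$@")
}

# Resolve the toolkit root as the parent of this file's directory
# (mirroring path.resolve(__dirname, '..'); symlinks are not followed
# here), then exec the script so its exit code becomes act's own.
act_dispatch() {
  local root
  root="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
  if ! act_route "$@"; then
    echo "act: unknown command: ${1:-(none)}" >&2
    return 2
  fi
  exec bash "$root/${ACT_ARGV[0]}" "${ACT_ARGV[@]:1}"
}
```

For example, `act_route plan docs/plan.md --resume` leaves `scripts/run-plan.sh docs/plan.md --resume` in `ACT_ARGV`. Because `exec` replaces the shell process, the bash version needs no exit-code handling at all; a Node router cannot `exec`, so it has to exit with the child's status explicitly to meet the pass-through requirement.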
### Full Command Map

#### Execution

| Command | Script | Purpose |
|---------|--------|---------|
| `act plan <file> [flags]` | `run-plan.sh` | Headless/team/MAB batch execution |
| `act plan --resume` | `run-plan.sh --resume` | Resume interrupted execution |
| `act compound [dir] [flags]` | `auto-compound.sh` | Full pipeline: report→PRD→execute→PR |
| `act mab <flags>` | `mab-run.sh` | Multi-Armed Bandit competing agents |

#### Quality

| Command | Script | Purpose |
|---------|--------|---------|
| `act gate [flags]` | `quality-gate.sh` | Composite quality gate |
| `act check [files...]` | `lesson-check.sh` | Syntactic anti-pattern scan |
| `act policy [flags]` | `policy-check.sh` | Advisory positive-pattern check |
| `act research-gate <json>` | `research-gate.sh` | Validate research completeness |
| `act validate` | `validate-all.sh` | Toolkit self-validation |
| `act validate-plan <file>` | `validate-plan-quality.sh` | Score plan quality (8 dimensions) |
| `act validate-prd [file]` | `validate-prd.sh` | Validate PRD JSON structure |

#### Lessons

| Command | Script | Purpose |
|---------|--------|---------|
| `act lessons pull [--remote]` | `pull-community-lessons.sh` | Sync community lessons + strategy data |
| `act lessons check` | `lesson-check.sh --list` | List active lesson checks |
| `act lessons promote` | `promote-mab-lessons.sh` | Auto-promote MAB patterns |
| `act lessons infer [--apply]` | `scope-infer.sh` | Infer scope tags for lessons |

#### Analysis

| Command | Script | Purpose |
|---------|--------|---------|
| `act audit [flags]` | `entropy-audit.sh` | Doc drift & naming violations |
| `act batch-audit <dir>` | `batch-audit.sh` | Cross-project audit |
| `act batch-test <dir>` | `batch-test.sh` | Memory-aware cross-project tests |
| `act analyze <report>` | `analyze-report.sh` | Extract priority from report |
| `act digest <log>` | `failure-digest.sh` | Summarize failure patterns |
| `act status [dir]` | `pipeline-status.sh` | Pipeline health check |
| `act architecture [dir]` | `architecture-map.sh` | Generate architecture diagram |

#### Telemetry (NEW)

| Command | Script | Purpose |
|---------|--------|---------|
| `act telemetry show` | `telemetry.sh show` | Dashboard: success rate, cost, lesson hits |
| `act telemetry export` | `telemetry.sh export` | Export anonymized run data |
| `act telemetry import <file>` | `telemetry.sh import` | Import community aggregate data |
| `act telemetry reset` | `telemetry.sh reset` | Clear local telemetry |

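`act telemetry show` boils the per-batch records (the format defined under Improvement 1 in Part 4) down to a few headline numbers. A minimal sketch of that reduction, assuming one flat JSON object per line with `passed_gate` and `cost_usd` fields; the `telemetry_summary` name and its output format are hypothetical. It matches fields with `awk` regexes to avoid a dependency, which only works for single-line records like the ones this design writes; `jq` would be the robust choice.

```bash
#!/usr/bin/env bash
# Hypothetical reduction behind a telemetry dashboard.
# Assumes logs/telemetry.jsonl holds one flat JSON object per line.
# Fields are matched with awk regexes to avoid a jq dependency; this
# breaks on pretty-printed or nested JSON, where jq is the right tool.

# Usage: telemetry_summary [path-to-telemetry.jsonl]
telemetry_summary() {
  local file="${1:-logs/telemetry.jsonl}"
  if [ ! -f "$file" ]; then
    echo "telemetry_summary: no telemetry at $file" >&2
    return 1
  fi
  awk '
    NF == 0 { next }                       # skip blank lines
    {
      batches++
      if ($0 ~ /"passed_gate": *true/) passed++
      if (match($0, /"cost_usd": *[0-9.]+/)) {
        v = substr($0, RSTART, RLENGTH)
        sub(/.*: */, "", v)                # keep just the number
        cost += v
      }
    }
    END {
      if (batches == 0) { print "batches=0"; exit }
      printf "batches=%d pass_rate=%.1f%% cost_usd=%.2f\n", batches, 100 * passed / batches, cost
    }
  ' "$file"
}
```

Called on a project's log, it prints one line of the form `batches=N pass_rate=P% cost_usd=C`; lesson-hit counts would need the `lessons_triggered` arrays and are a natural place to switch to `jq`.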
#### Benchmarks (NEW)

| Command | Script | Purpose |
|---------|--------|---------|
| `act benchmark run` | `benchmarks/runner.sh` | Execute all 5 benchmark tasks |
| `act benchmark run <name>` | `benchmarks/runner.sh <name>` | Execute single benchmark |
| `act benchmark compare <a> <b>` | `benchmarks/runner.sh compare` | Compare two benchmark results |

#### Setup

| Command | Script | Purpose |
|---------|--------|---------|
| `act init` | `init.sh` | Bootstrap project for toolkit use |
| `act init --quickstart` | `init.sh --quickstart` | Fast lane: working example in <3 min |
| `act license-check` | `license-check.sh` | GPL/AGPL dependency audit |
| `act module-size` | `module-size-check.sh` | Detect oversized modules |

#### Meta

| Command | Purpose |
|---------|---------|
| `act version` | Print version (from package.json) |
| `act help [command]` | Show help for any command |

---

## Part 3: Two Install Paths

### Path A: npm (CLI scripts)

```bash
npm install -g autonomous-coding-toolkit
# Now: act plan, act gate, act check, act telemetry, etc. on PATH
```

Or zero-install:

```bash
npx autonomous-coding-toolkit gate --project-root .
```

### Path B: Claude Code Plugin (skills/commands/agents)

```text
# From Claude Code:
/install autonomous-coding-toolkit
# Now: /autocode, /create-prd, /run-plan, /ralph-loop, etc. available
```

**Both paths install from the same repo/package.** Users who install both get the full experience:

- npm → CLI scripts for headless, CI, and standalone use
- Plugin → skills, commands, agents for interactive Claude Code sessions

### Entry Points

| User wants to... | Entry point |
|------------------|-------------|
| Start a new feature from scratch | `/autocode <feature>` (Claude Code) |
| Start from an existing plan | `act plan <file>` (CLI) or `/run-plan` (Claude Code) |
| Jump into a roadmap mid-stream | `act plan <file> --start-batch N` or `act plan --resume` |
| Quick quality check | `act gate --project-root .` (CLI) |
| See how the system is performing | `act telemetry show` (CLI) |
| Validate before shipping | `act benchmark run` (CLI) |
| Bootstrap a new project | `act init --quickstart` (CLI) |

---

268
|
+
|
|
269
|
+
## Part 4: Seven Strategic Improvements
|
|
270
|
+
|
|
271
|
+
These improvements transform the toolkit from a tool into a learning system.
|
|
272
|
+
|
|
273
|
+
### Improvement 1: Telemetry — Measure Before Optimizing
|
|
274
|
+
|
|
275
|
+
**Principle:** You can't improve what you don't measure. The research says "the first measurement infrastructure should precede the first optimization."
|
|
276
|
+
|
|
277
|
+
**Data captured per batch (local, opt-in for sharing):**
|
|
278
|
+
|
|
279
|
+
```json
|
|
280
|
+
{
|
|
281
|
+
"timestamp": "2026-02-24T14:30:00Z",
|
|
282
|
+
"project_type": "python",
|
|
283
|
+
"batch_type": "integration",
|
|
284
|
+
"batch_number": 3,
|
|
285
|
+
"attempt": 1,
|
|
286
|
+
"passed_gate": true,
|
|
287
|
+
"gate_failures": [],
|
|
288
|
+
"lessons_triggered": ["0007", "0033"],
|
|
289
|
+
"lessons_true_positive": ["0007"],
|
|
290
|
+
"test_count_delta": 12,
|
|
291
|
+
"duration_seconds": 180,
|
|
292
|
+
"cost_usd": 0.42,
|
|
293
|
+
"strategy": "superpowers",
|
|
294
|
+
"plan_quality_score": 78,
|
|
295
|
+
"echo_back_passed": true,
|
|
296
|
+
"trust_score": 73
|
|
297
|
+
}
|
|
298
|
+
```
|
|
299
|
+
|
|
300
|
+
**Storage:** `logs/telemetry.jsonl` (append-only, one line per batch). Project-local, never committed.
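
Here is a minimal sketch of the append-only log: a bash function that writes one compact JSON object per line. The function name `append_telemetry` and its argument order are hypothetical, and it records only a subset of the fields shown above. The real `scripts/telemetry.sh` may work differently.

```bash
# Hypothetical helper: append one telemetry record (a subset of the fields above).
# Usage: append_telemetry <file> <batch_number> <attempt> <passed_gate> <cost_usd>
append_telemetry() {
  local file="$1" batch="$2" attempt="$3" passed="$4" cost="$5"

  # passed_gate must be a JSON boolean literal.
  case "$passed" in
    true | false) ;;
    *)
      echo "append_telemetry: passed_gate must be true or false" >&2
      return 1
      ;;
  esac

  mkdir -p "$(dirname "$file")"
  # One compact object per line; '>>' only appends, so earlier records are never rewritten.
  printf '{"timestamp":"%s","batch_number":%d,"attempt":%d,"passed_gate":%s,"cost_usd":%s}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$batch" "$attempt" "$passed" "$cost" >> "$file"
}
```

For example, `append_telemetry logs/telemetry.jsonl 3 1 true 0.42` records a first-attempt pass for batch 3. Listing `logs/` in `.gitignore` is one way to enforce the never-committed rule.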

**Dashboard (`act telemetry show`):**
```
Autonomous Coding Toolkit — Telemetry Dashboard
════════════════════════════════════════════════

Runs: 47 batches across 8 plans
Success rate: 89% (42/47 passed gate on first attempt)
Total cost: $19.83 ($0.42/batch average)
Total time: 2.4 hours

Strategy Performance:
  superpowers: 78% win rate (28 runs)
  ralph: 65% win rate (19 runs)

Top Lesson Hits:
  #0007 bare-except: 12 hits, 11 true positives (92%)
  #0033 sqlite-closing: 3 hits, 3 true positives (100%)
  #0045 hub-cache: 8 hits, 0 true positives (0%) ← retirement candidate

Batch Type Success:
  new-file: 95% (19/20)
  test-only: 100% (8/8)
  refactoring: 83% (10/12)
  integration: 71% (5/7) ← lowest, consider MAB for this type
```
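
The dashboard's headline numbers can be derived from the JSONL file with standard tools. The sketch below is one assumption about how that could work, not the toolkit's `act telemetry show`:

- Records with `attempt` 1 count as batches, and their `passed_gate` is the first-attempt result.
- `cost_usd` is summed over every attempt.
- Fields are matched with awk regexes rather than a JSON parser, so the file must hold one flat object per line.

```bash
# Hypothetical summary over logs/telemetry.jsonl (one flat JSON object per line).
telemetry_summary() {
  local file="${1:-logs/telemetry.jsonl}"
  awk '
    # Extract a numeric field such as "cost_usd": 0.42 from the current line.
    function num(key,   s) {
      if (match($0, "\"" key "\": *[0-9.]+")) {
        s = substr($0, RSTART, RLENGTH)
        sub(/^[^:]*: */, "", s)
        return s + 0
      }
      return 0
    }
    {
      cost += num("cost_usd")                  # retries cost money too
      if ($0 ~ /"attempt": *1 *[,}]/) {        # first attempt = one batch
        batches++
        if ($0 ~ /"passed_gate": *true/) first_pass++
      }
    }
    END {
      if (batches == 0) { print "No telemetry yet"; exit }
      printf "Batches: %d\n", batches
      printf "Success rate: %d%% (%d/%d passed gate on first attempt)\n", int(first_pass * 100 / batches + 0.5), first_pass, batches
      printf "Total cost: $%.2f ($%.2f/batch average)\n", cost, cost / batches
    }
  ' "$file"
}
```

The average divides total cost by batches, not attempts, so retries raise the per-batch figure. That matches the dashboard's `$19.83 / 47 ≈ $0.42`.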

**Export/import for community learning:**
- `act telemetry export` → anonymized JSON (no file paths, no project names, no code)
- `act telemetry import community-aggregate.json` → merges into local strategy routing
- Community aggregate published periodically to toolkit repo (opt-in contributions)

### Improvement 2: Federated Learning for Strategy Routing

**Principle:** 100 users learning independently is 100x slower than learning together. Strategy performance should compound across the community.

**Current state:** `strategy-perf.json` is per-install. `pull-community-lessons.sh` already merges it with `max(local, remote)` per counter.

**Improvement:** Extend the pull mechanism to also merge:
- Anonymized strategy-perf data from community aggregate
- Lesson hit rate statistics (which lessons actually catch bugs)
- Batch-type success rates per strategy

**Merge strategy (already implemented, extend):**
- `max(local, remote)` per counter for win/loss data
- Weighted average for rates (weight = sample size)
- Never overwrite local data — additive merge only
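
The two merge rules above can be written as small bash helpers. This is a sketch under assumptions: the function names are hypothetical, and `pull-community-lessons.sh` may implement the max rule differently. Taking the maximum never lowers a local counter, which is why the merge cannot overwrite local data.

```bash
# Hypothetical merge helpers for community telemetry.

# Win/loss counters: keep the larger of the local and remote values.
# The result is never below the local value, so local data is never lost.
merge_counter() {
  local local_n="$1" remote_n="$2"
  if (( remote_n > local_n )); then
    echo "$remote_n"
  else
    echo "$local_n"
  fi
}

# Rates: average weighted by each side's sample size.
# Usage: merge_rate <local_rate> <local_n> <remote_rate> <remote_n>
merge_rate() {
  awk -v r1="$1" -v n1="$2" -v r2="$3" -v n2="$4" 'BEGIN {
    if (n1 + n2 == 0) { print "0.000"; exit }
    printf "%.3f\n", (r1 * n1 + r2 * n2) / (n1 + n2)
  }'
}
```

For example, a local 90% rate over 10 batches merged with a remote 50% rate over 30 gives `merge_rate 0.9 10 0.5 30` → `0.600`. The larger sample pulls the result toward its own rate.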

**Effect on routing:** Thompson Sampling in `lib/thompson-sampling.sh` starts with community priors instead of uniform priors. A new user benefits from the collective experience of all previous users from their first run.
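
To make "community priors" concrete: Beta-Bernoulli Thompson Sampling models each strategy's success probability as Beta(wins + 1, losses + 1). A fresh install with no data starts from the uniform Beta(1, 1), and merged community counters shift that starting point.

The awk sketch below is not `lib/thompson-sampling.sh`, which relies on `bc`. It draws each Beta sample as a ratio of two Gamma variates, using the fact that a Gamma(k, 1) variate with integer k is a sum of k exponential draws. The function name and the input format (`<strategy> <wins> <losses>` per line) are assumptions.

```bash
# Hypothetical Thompson Sampling pick over merged (local + community) counters.
# Input on stdin, one line per strategy: "<strategy> <wins> <losses>".
# Optional $1: random seed (defaults to $RANDOM).
thompson_pick() {
  awk -v seed="${1:-$RANDOM}" '
    # Gamma(k, 1) for integer k >= 1: sum of k Exp(1) draws.
    # 1 - rand() lies in (0, 1], so log() never sees zero.
    function gamma_int(k,   i, s) {
      s = 0
      for (i = 0; i < k; i++) s -= log(1 - rand())
      return s
    }
    BEGIN { srand(seed); best = -1 }
    NF >= 3 {
      a = $2 + 1                                  # Beta(1, 1) prior plus wins
      b = $3 + 1                                  # Beta(1, 1) prior plus losses
      ga = gamma_int(a); gb = gamma_int(b)
      x = (ga + gb > 0) ? ga / (ga + gb) : 0.5    # one draw from Beta(a, b)
      if (x > best) { best = x; pick = $1 }       # highest sampled rate wins
    }
    END { print pick }
  '
}
```

Each draw costs one random number per win and loss. That is fine at the scale of the dashboard above, but larger counts would call for a proper Gamma sampler. Because the choice is sampled rather than argmax-on-mean, a strategy with few runs still gets picked occasionally, which is how the bandit keeps exploring.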

### Improvement 3: Adaptive Quality Gates

**Principle:** The immune system amplifies what works and retires what doesn't (biological analogy from research #B2-3). Quality gates should do the same.

**Current state:** Gate pipeline is static: lesson-check → ast-grep → tests → memory → test count → git clean.

**Improvement:** Track lesson effectiveness from telemetry:

| Metric | Threshold | Action |
|--------|-----------|--------|
| True positive rate > 80% | After 20+ triggers | Promote to "high-value" (always first in pipeline) |
| True positive rate 20-80% | After 20+ triggers | Normal (current behavior) |
| True positive rate < 20% | After 50+ triggers | Downgrade to advisory (warn, don't block) |
| Zero triggers | After 100+ scans | Flag as retirement candidate |

**Implementation:** `lesson-check.sh` reads `logs/telemetry.jsonl` to compute lesson effectiveness. Lessons flagged as retirement candidates appear in `act telemetry show` for manual review. No lesson is auto-deleted — only downgraded to advisory.
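
The threshold table translates into a small decision function. This sketch makes three assumptions:

- Per-lesson counts of triggers, true positives, and scans have already been aggregated from `logs/telemetry.jsonl`.
- Lessons with too little data stay "normal". This is a hypothetical choice, not documented behavior.
- The function name is hypothetical.

Rates are compared with integer arithmetic (true positives × 100 against threshold × triggers), so bash needs no floating point.

```bash
# Hypothetical lesson classifier implementing the threshold table.
# Usage: classify_lesson <triggers> <true_positives> <scans>
classify_lesson() {
  local triggers="$1" tp="$2" scans="$3"

  # Zero triggers: a retirement candidate only after 100+ scans.
  if (( triggers == 0 )); then
    if (( scans >= 100 )); then echo "retirement-candidate"; else echo "normal"; fi
    return 0
  fi

  if (( triggers >= 20 && tp * 100 > 80 * triggers )); then
    echo "high-value"   # > 80% true positives after 20+ triggers
  elif (( triggers >= 50 && tp * 100 < 20 * triggers )); then
    echo "advisory"     # < 20% true positives after 50+ triggers: warn, don't block
  else
    echo "normal"       # 20-80%, or not enough triggers yet
  fi
}
```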

**Why not auto-delete:** A lesson with zero hits might be preventing bugs by its mere presence in the system (developers read lessons and avoid the pattern). Retirement requires human judgment.

### Improvement 4: Semantic Echo-Back

**Principle:** Spec misunderstanding is 60%+ of failures for strong models (#B1-5). Keyword matching catches omissions but not misinterpretation. A human reviewer asks "do you understand what I'm asking?" before "did you do it right?"

**Current state:** `run-plan-echo-back.sh` does keyword matching — checks whether key terms from batch text appear in agent output.

**Improvement:** Two-tier echo-back:

**Tier 1 (current, every batch):** Keyword match — fast (<1s), catches obvious omissions.
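
Tier 1 can be approximated in a few lines of bash. In this sketch, "key terms" are assumed to be the distinct words of six or more characters in the batch text. The batch passes when at least a given percentage of those terms appear, case-insensitively, in the agent output. `run-plan-echo-back.sh` may extract terms differently. The function name and the 50% default are hypothetical.

```bash
# Hypothetical Tier 1 echo-back: keyword coverage of the batch text in the agent output.
# Usage: echo_back_keywords <batch_text_file> <agent_output_file> [min_percent]
# Prints "found/total"; returns 0 when coverage >= min_percent (default 50).
echo_back_keywords() {
  local spec="$1" output="$2" min="${3:-50}"
  local total=0 found=0 term

  while IFS= read -r term; do
    total=$((total + 1))
    if grep -qiF -- "$term" "$output"; then
      found=$((found + 1))
    fi
  done < <(tr -cs '[:alnum:]_' '\n' < "$spec" \
             | tr '[:upper:]' '[:lower:]' \
             | awk 'length($0) >= 6' \
             | sort -u)

  echo "$found/$total"
  (( total == 0 || found * 100 >= min * total ))
}
```

This also shows the tier's blind spot. An output can mention every key term and still describe the wrong behavior, which is the gap Tier 2 targets.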

**Tier 2 (new, selective):** LLM verification — agent summarizes what it will build, separate `claude -p` call compares summary vs. spec, flags misalignment.

**When Tier 2 activates:**
- Always on Batch 1 of any plan (disproportionate risk — research #B2-3, #P9)
- Always on integration batches (highest failure rate from telemetry)
- When `--strict-echo-back` flag is set
- MAB can learn whether Tier 2 prevents enough rework to justify cost (~$0.10/batch)
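
The fixed activation rules reduce to a predicate the runner could check before each batch. This is a sketch: the function name and argument order are hypothetical, and it leaves out the MAB case, which would add a learned decision on top.

```bash
# Hypothetical predicate: should this batch get the Tier 2 (LLM) echo-back?
# Usage: needs_tier2 <batch_number> <batch_type> [strict: true|false]
needs_tier2() {
  local batch="$1" type="$2" strict="${3:-false}"
  [[ "$strict" == true ]] && return 0        # --strict-echo-back
  (( batch == 1 )) && return 0               # Batch 1 carries disproportionate risk
  [[ "$type" == integration ]] && return 0   # highest failure rate in telemetry
  return 1
}
```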

**Tier 2 prompt structure:**
```
You are a specification compliance reviewer. Compare:

SPECIFICATION:
<batch task text from plan>

AGENT'S UNDERSTANDING:
<agent's summary of what it will build>

Does the agent's understanding match the specification? Flag any:
- Missing requirements
- Added requirements not in spec
- Misinterpreted requirements
- Ambiguous interpretations

Output: PASS or FAIL with specific misalignments.
```
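
One way to wire this template into a `claude -p` call is to fill it with a heredoc, then read the verdict from the reply. The helper names are hypothetical. Treating the first non-empty line as the verdict is an assumption about the model's output format, so a production version would want stricter parsing.

```bash
# Hypothetical Tier 2 helpers.

# Fill the reviewer template. Variable values are inserted verbatim
# (no second round of expansion), so specs containing $ or backticks are safe.
build_tier2_prompt() {
  local spec="$1" understanding="$2"
  cat <<EOF
You are a specification compliance reviewer. Compare:

SPECIFICATION:
$spec

AGENT'S UNDERSTANDING:
$understanding

Does the agent's understanding match the specification? Flag any:
- Missing requirements
- Added requirements not in spec
- Misinterpreted requirements
- Ambiguous interpretations

Output: PASS or FAIL with specific misalignments.
EOF
}

# Succeed only if the first non-empty line of the reviewer's reply starts with PASS.
# Empty or unparseable replies count as failures.
tier2_passed() {
  awk 'NF && !seen { seen = 1; verdict = $1 } END { exit (verdict ~ /^PASS/ ? 0 : 1) }'
}
```

A call could then look like `claude -p "$(build_tier2_prompt "$spec" "$summary")" | tier2_passed`, where `$spec` and `$summary` hold the batch text and the agent's summary.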

### Improvement 5: Fast Lane Onboarding

**Principle:** 34.7% abandon on difficult setup (#B2-1). A dead user gets zero benefit from perfect process. Time to first value must be under 3 minutes.

**`act init` (standard):**
1. Detect project type (Python/Node/bash/Make/unknown)
2. Create `tasks/` directory
3. Create empty `progress.txt`
4. Append Code Factory section to CLAUDE.md (or create minimal CLAUDE.md)
5. Set quality gate command based on project type
6. Detect language → set `## Scope Tags`
7. Print next steps
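
Step 1 could be a simple marker-file check. The markers below and their precedence (which follows the order listed in step 1) are assumptions, and `init.sh` may use different signals:

- Python: `pyproject.toml`, `setup.py`, or `requirements.txt`
- Node: `package.json`
- bash: top-level `*.sh` files
- Make: a `Makefile`

```bash
# Hypothetical project-type detection for `act init`, step 1.
# Precedence follows the order Python, Node, bash, Make, unknown.
detect_project_type() {
  local dir="${1:-.}"
  if [[ -f "$dir/pyproject.toml" || -f "$dir/setup.py" || -f "$dir/requirements.txt" ]]; then
    echo python
  elif [[ -f "$dir/package.json" ]]; then
    echo node
  elif compgen -G "$dir/*.sh" > /dev/null; then   # any top-level shell script
    echo bash
  elif [[ -f "$dir/Makefile" ]]; then
    echo make
  else
    echo unknown
  fi
}
```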

**`act init --quickstart` (fast lane):**
All of the above, plus:
1. Copy `examples/quickstart-plan.md` → `docs/plans/quickstart.md`
2. Customize the plan for detected project type:
   - Python: "Add a conftest.py with common fixtures + test helper"
   - Node: "Add a build validation script + test helper"
   - Bash: "Add shellcheck CI + test runner"
3. Run `act gate --project-root .` to verify quality gate works
4. Print: "Ready. Run `act plan docs/plans/quickstart.md` for your first quality-gated execution."

**Time budget:** `act init` < 10 seconds, `act init --quickstart` < 30 seconds (gate run is the bottleneck).

### Improvement 6: Graduated Autonomy

**Principle:** Start supervised, earn trust, reduce friction. Humans don't give full autonomy to new team members on day one.

**Trust score per project, derived from telemetry:**

```
Trust Score = weighted average of:
  - Gate first-attempt pass rate (40%)
  - Echo-back pass rate (20%)
  - Test regression rate, inverted (20%)
  - Post-merge revert rate, inverted (20%)
```
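
With every rate in percent, the weighted average is:

score = 0.4·gate + 0.2·echo + 0.2·(100 − regression) + 0.2·(100 − revert)

"Inverted" means a 0% regression or revert rate contributes its full 20 points. A sketch of that arithmetic (the function name is hypothetical):

```bash
# Hypothetical trust-score calculation; all inputs are percentages (0-100).
# Usage: trust_score <gate_pass_pct> <echo_back_pass_pct> <test_regression_pct> <revert_pct>
trust_score() {
  awk -v gate="$1" -v echo_rate="$2" -v regress="$3" -v revert="$4" 'BEGIN {
    # Regression and revert rates are inverted: lower is better.
    score = 0.4 * gate + 0.2 * echo_rate + 0.2 * (100 - regress) + 0.2 * (100 - revert)
    printf "%.0f\n", score
  }'
}
```

Worked example: gate 80%, echo-back 70%, regression 10%, revert 5% gives 32 + 14 + 18 + 19, so `trust_score 80 70 10 5` prints `83`.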

**Trust levels and default behavior:**

| Trust | Score | Default Mode | Rationale |
|-------|-------|-------------|-----------|
| New | < 30 (or < 10 runs) | Mode B: human checkpoint every batch | Unknown project, build confidence |
| Growing | 30-70 | Headless with checkpoint every 3rd batch | Earning trust, spot-check |
| Trusted | 70-90 | Headless with notification on failures only | Proven track record |
| Autonomous | > 90 | Full headless, post-run summary only | Consistently excellent |
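
Mapping a score to a level is a threshold lookup. The table's ranges share their endpoints, so this sketch has to pick a side: it assumes 70 and 90 both count as Trusted, and only scores above 90 are Autonomous. The function name is hypothetical.

```bash
# Hypothetical trust-level lookup.
# Usage: trust_level <score> <runs>
trust_level() {
  local score="$1" runs="$2"
  if (( runs < 10 || score < 30 )); then
    echo "new"          # Mode B: human checkpoint every batch
  elif (( score < 70 )); then
    echo "growing"      # headless, checkpoint every 3rd batch
  elif (( score <= 90 )); then
    echo "trusted"      # headless, notify on failures only
  else
    echo "autonomous"   # full headless, post-run summary only
  fi
}
```

A project scoring 73/100 over 28 runs maps to `trusted` under this lookup.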

**Override:** Users can always set `--mode` explicitly. Trust score is advisory default, not a hard gate.

**Trust score in `act status`:**
```
Project: my-app (python)
Trust Score: 73/100 (28 runs)
  Gate pass rate: 89% ████████▉ (HIGH)
  Echo-back rate: 92% █████████▏ (HIGH)
  Test regression: 4% ▍ (GOOD)
  Post-merge revert: 0% ▏ (EXCELLENT)
Default mode: headless with notification on failures only
```

### Improvement 7: Benchmark Suite

**Principle:** "Single-user testing is not testing." Without benchmarks, you can't prove the toolkit works, you can't measure improvement between versions, and users can't validate their setup.

**5 benchmark tasks (varying complexity):**

| # | Task | Complexity | Measures |
|---|------|-----------|----------|
| 1 | Add a REST endpoint with tests | Simple (1 batch) | Basic execution, TDD compliance |
| 2 | Refactor a module into two | Medium (2 batches) | Refactoring quality, test preservation |
| 3 | Fix an integration bug | Medium (2 batches) | Debugging, root cause analysis |
| 4 | Add test coverage to untested module | Medium (2 batches) | Test quality, edge case discovery |
| 5 | Multi-file feature with API + DB + tests | Complex (4 batches) | Full pipeline, cross-file coordination |

**Each benchmark includes:**
- `task.md` — Problem description (what the agent receives)
- `scaffold/` — Starting codebase (reproducible initial state)
- `reference/` — Reference implementation (what "correct" looks like)
- `rubric.sh` — Machine-scored evaluation (exit 0 = pass per criterion)
- `rubric.json` — Criteria and weights for scoring
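
To show how per-criterion exit codes and weights could combine into one score, this sketch reads criteria as `<weight> <shell command>` lines. That plain-text format stands in for `rubric.json`, whose schema is not described here. The function name is hypothetical.

```bash
# Hypothetical weighted rubric scorer.
# Input on stdin, one criterion per line: "<weight> <shell command>".
# A criterion passes when its command exits 0. Blank lines and # comments are skipped.
score_rubric() {
  local line weight cmd earned=0 total=0
  while IFS= read -r line; do
    [[ -z "$line" || "$line" == \#* ]] && continue
    weight="${line%% *}"
    cmd="${line#* }"
    total=$((total + weight))
    # </dev/null keeps the command from consuming the remaining criteria on stdin.
    if bash -c "$cmd" < /dev/null > /dev/null 2>&1; then
      earned=$((earned + weight))
      echo "PASS ($weight) $cmd"
    else
      echo "FAIL ($weight) $cmd"
    fi
  done
  if (( total > 0 )); then
    echo "SCORE $((earned * 100 / total))%"
  fi
}
```

Weighting lets one critical criterion (say, "the tests pass") outweigh several cosmetic ones. The final line is the percentage of weight earned, truncated to an integer.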

**`act benchmark run` behavior:**
1. Create temp directory, copy scaffold
2. Run `act plan` on the task
3. Execute `rubric.sh` to score the result
4. Compare against reference implementation
5. Output scorecard with per-criterion pass/fail

**`act benchmark compare <before.json> <after.json>`:**
```
Benchmark Comparison: v1.0.0 vs v1.1.0
═══════════════════════════════════════
                      v1.0.0   v1.1.0   Delta
Task 1 (endpoint):       85%      92%     +7%
Task 2 (refactor):       72%      78%     +6%
Task 3 (debug):          68%      81%    +13%  ← biggest improvement
Task 4 (coverage):       90%      91%     +1%
Task 5 (multi-file):     55%      67%    +12%
─────────────────────────────────────────
Overall:                 74%      82%     +8%
```
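
The comparison is per-task subtraction plus a mean. In the example above, the task scores average 74% and 81.8%, which is why the Overall row shows 82% and +8% after rounding.

This sketch assumes each result file has already been flattened to `<task> <score>` lines (for example with `jq`). That flattened format and the function name are assumptions.

```bash
# Hypothetical benchmark comparison over flattened results.
# Usage: benchmark_compare <before.txt> <after.txt>   (lines: "<task> <score%>")
benchmark_compare() {
  awk '
    NR == FNR { before[$1] = $2 + 0; order[++n] = $1; next }   # first file: "85%" -> 85
    { after[$1] = $2 + 0 }                                       # second file
    END {
      for (i = 1; i <= n; i++) {
        t = order[i]
        if (!(t in after)) continue          # task missing from the newer run
        printf "%-12s %4d%% %4d%% %+4d%%\n", t, before[t], after[t], after[t] - before[t]
        sum_b += before[t]; sum_a += after[t]; m++
      }
      if (m > 0)
        printf "%-12s %4.0f%% %4.0f%% %+4.0f%%\n", "Overall", sum_b / m, sum_a / m, (sum_a - sum_b) / m
    }
  ' "$1" "$2"
}
```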

---

## Part 5: Complete Concept Inventory

Everything from the existing toolkit is preserved. Nothing is removed or moved.

### Skills (20 — all preserved)

| Skill | Purpose | Pipeline Stage |
|-------|---------|---------------|
| autocode | Full 9-stage pipeline orchestrator | Entry point |
| brainstorming | Design exploration & approval | Stage 1 |
| research | Structured technical investigation | Stage 1.5 |
| roadmap | Multi-feature epic decomposition | Stage 0.5 |
| writing-plans | TDD-structured implementation plans | Stage 3 |
| using-git-worktrees | Isolated workspace creation | Stage 2 |
| subagent-driven-development | Fresh agent per task + 2-stage review | Stage 4a |
| executing-plans | Batch execution with human checkpoints | Stage 4b |
| verification-before-completion | Evidence-based gate | Stage 5 |
| finishing-a-development-branch | Merge/PR/keep/discard | Stage 6 |
| test-driven-development | Red-Green-Refactor cycle | Supporting |
| systematic-debugging | 4-phase root cause investigation | Supporting |
| dispatching-parallel-agents | 2+ independent task coordination | Supporting |
| requesting-code-review | Dispatch reviewer subagent | Supporting |
| receiving-code-review | Technical evaluation of feedback | Supporting |
| using-superpowers | Meta-skill: invoke skills before action | Meta |
| verify | Self-verification checklist | Supporting |
| writing-skills | TDD applied to skill documentation | Meta |
| capture-lesson | Incident → lesson workflow | Lesson system |
| check-lessons | Surface relevant lessons for current work | Lesson system |

### Commands (7 — all preserved)

| Command | Purpose |
|---------|---------|
| `/autocode <feature>` | Full pipeline entry point |
| `/code-factory <feature>` | Alias for autocode |
| `/create-prd <feature>` | Machine-verifiable acceptance criteria |
| `/run-plan <file>` | In-session batch execution |
| `/ralph-loop <prompt>` | Autonomous iteration with stop-hook |
| `/cancel-ralph` | Cancel active Ralph loop |
| `/submit-lesson` | Community lesson submission via PR |

### Agents (7 — all preserved)

| Agent | Model | Purpose |
|-------|-------|---------|
| lesson-scanner | sonnet | Dynamic anti-pattern scan from lesson files |
| bash-expert | sonnet | Shell script review & debugging |
| shell-expert | sonnet | systemd/service diagnosis |
| python-expert | sonnet | Async, lifecycle, type safety review |
| integration-tester | opus | Cross-service data flow verification |
| dependency-auditor | haiku | CVE scan, license compliance |
| service-monitor | sonnet | systemd service/timer health |

### Scripts (32 existing + 3 new = 35)

**Existing (all preserved, paths unchanged):**

- Execution: run-plan.sh, auto-compound.sh, mab-run.sh, setup-ralph-loop.sh
- Quality: quality-gate.sh, lesson-check.sh, policy-check.sh, research-gate.sh
- Validation: validate-all.sh, validate-lessons.sh, validate-skills.sh, validate-commands.sh, validate-plugin.sh, validate-hooks.sh, validate-policies.sh, validate-prd.sh, validate-plan-quality.sh
- Analysis: entropy-audit.sh, batch-audit.sh, batch-test.sh, analyze-report.sh, failure-digest.sh, pipeline-status.sh, architecture-map.sh
- Lessons: pull-community-lessons.sh, promote-mab-lessons.sh, scope-infer.sh
- Utilities: license-check.sh, module-size-check.sh, generate-ast-rules.sh, prior-art-search.sh

**New:**

| Script | Purpose | Lines (est.) |
|--------|---------|-------------|
| `scripts/init.sh` | Project bootstrapper (`act init`) | ~100 |
| `scripts/telemetry.sh` | Telemetry capture, dashboard, export/import | ~200 |
| `benchmarks/runner.sh` | Benchmark orchestrator | ~150 |

### Lib Modules (18 — all preserved)

common.sh, ollama.sh, telegram.sh, progress-writer.sh, cost-tracking.sh, thompson-sampling.sh, run-plan-parser.sh, run-plan-state.sh, run-plan-headless.sh, run-plan-team.sh, run-plan-routing.sh, run-plan-quality-gate.sh, run-plan-prompt.sh, run-plan-context.sh, run-plan-sampling.sh, run-plan-scoring.sh, run-plan-echo-back.sh, run-plan-notify.sh

### Execution Modes (5 — all preserved)

| Mode | Entry (Claude Code) | Entry (CLI) | Isolation |
|------|-------------------|------------|-----------|
| A: Subagent-dev | /autocode → Stage 4a | N/A (Claude-only) | Same session |
| B: Executing-plans | /autocode → Stage 4b | N/A (Claude-only) | Separate session |
| C: Headless | /run-plan | `act plan <file>` | Fresh context/batch |
| D: Ralph Loop | /ralph-loop | N/A (needs stop-hook) | Same session |
| E: MAB | /run-plan --mab | `act plan <file> --mab` | Parallel worktrees |

### State & Persistence (5 existing + 1 new = 6)

| State File | Location | Purpose |
|-----------|----------|---------|
| `.run-plan-state.json` | Project root | Execution checkpoint (batches, test counts, costs) |
| `progress.txt` | Project root | Append-only discovery log |
| `tasks/prd.json` | Project root | Machine-verifiable acceptance criteria |
| `logs/failure-patterns.json` | Project root | Cross-run failure learning |
| `.claude/ralph-loop.local.md` | Project root | Ralph loop state |
| **`logs/telemetry.jsonl`** | Project root | **Per-batch telemetry (NEW)** |

Additional learning state (existing, in `logs/`): routing-decisions.log, sampling-outcomes.json, strategy-perf.json, mab-lessons.json.

All state is project-local. The npm package is stateless. No state collision between projects.

### Lessons (79 + framework — all bundled)

**Three-tier architecture:**

```
Tier 1: Bundled (ships with npm, updated on npm update)
  Location: <npm-root>/docs/lessons/
  Count: 79 (grows with releases)

Tier 2: Community (git-synced between releases)
  Mechanism: act lessons pull --remote upstream
  Source: main branch of toolkit repo
  Merge: additive only, never overwrites local

Tier 3: Project-local (user's own lessons)
  Location: <project>/docs/lessons/
  Scope: project-specific anti-patterns
  Never overwritten by Tier 1 or 2
```
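
The "never overwritten" rule implies a lookup order in which the project-local copy of a lesson wins. This sketch assumes:

- Lesson files keep the `NNNN-name.md` naming used in `docs/lessons/`.
- Two tiers refer to the same lesson when their file names match.
- Directories are passed in, because where Tier 2 lessons live on disk is not specified above.

The function name is hypothetical.

```bash
# Hypothetical tiered lesson lookup: pass directories highest-priority first,
# e.g. project-local, then community, then bundled.
# Prints one path per lesson; the first tier that has a given file name wins.
resolve_lessons() {
  local dir f name
  local -A seen=()                         # bash 4+ associative array
  for dir in "$@"; do
    [[ -d "$dir" ]] || continue
    for f in "$dir"/[0-9]*.md; do          # numbered lessons only, not FRAMEWORK.md etc.
      [[ -e "$f" ]] || continue            # glob matched nothing
      name="$(basename "$f")"
      [[ -n "${seen[$name]:-}" ]] && continue
      seen[$name]=1
      printf '%s\n' "$f"
    done
  done
}
```

Because precedence is decided at read time, no tier ever has to modify another tier's files. A `npm update` can replace the bundled directory wholesale without touching project lessons.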

**Six root cause clusters:**
1. Silent Failures — operation appears to succeed but silently fails
2. Integration Boundaries — each component passes its test; bug hides at seam
3. Cold-Start Assumptions — works steady-state, fails on restart
4. Specification Drift — agent builds wrong thing correctly
5. Context & Retrieval — info available but buried/misscoped
6. Planning & Control Flow — wrong decomposition contaminates downstream

**Lesson schema:** YAML frontmatter with id, title, severity, languages, scope, category, pattern (type + regex/description), fix, positive_alternative, example (bad/good).

**Scope filtering:** `lesson-check.sh` reads `## Scope Tags` from CLAUDE.md, computes intersection with lesson scope tags. Prevents false positive death spiral at scale (research #B2-2).
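
A sketch of the intersection test follows. It rests on two assumptions about formats this section does not spell out:

- The `## Scope Tags` section lists tags separated by commas, spaces, or one per bullet.
- A lesson scoped `universal` applies to every project.

The function names are hypothetical.

```bash
# Hypothetical: print the tags listed under "## Scope Tags" in CLAUDE.md, one per line.
read_scope_tags() {
  awk '
    /^## / { in_tags = ($0 == "## Scope Tags"); next }   # any level-2 heading toggles the section
    in_tags {
      sub(/^[ \t]*-[ \t]+/, "")                          # drop a leading "- " bullet marker
      gsub(/,/, " ")
      for (i = 1; i <= NF; i++) print $i
    }
  ' "${1:-CLAUDE.md}"
}

# Hypothetical: succeed if any lesson scope tag is among the project tags.
# Usage: scope_matches "<lesson tags>" "<project tags>"   (space-separated)
scope_matches() {
  local tag
  for tag in $1; do                         # word splitting is intended here
    [[ "$tag" == universal ]] && return 0
    [[ " $2 " == *" $tag "* ]] && return 0
  done
  return 1
}
```

For example, `scope_matches "language:python" "$(read_scope_tags | tr '\n' ' ')"` succeeds only in projects that declare `language:python`. That is how a Python-only lesson stays silent in a Node project.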

### Policies (4 — all preserved)

| File | Scope | Patterns |
|------|-------|----------|
| universal.md | All projects | Error visibility, test before ship, fresh context, durable artifacts |
| python.md | Python | Async discipline, closing(), create_task callbacks |
| bash.md | Shell | Strict mode, quoting, subshell cd, atomic writes |
| testing.md | All tests | No hardcoded counts, boundary testing, live > static |

### Hooks (2 — all preserved)

| Hook | Trigger | Purpose |
|------|---------|---------|
| SessionStart | Session init | Symlink setup for skill discovery |
| Stop | Session exit | Ralph loop continuation gate |

### Quality Gate Pipeline (preserved + enhanced)

```
lesson-check.sh (syntactic, <2s)
    ↓ if clean
ast-grep patterns (5 structural checks)
    ↓ if clean
Test suite (auto-detected: pytest/npm/make)
    ↓ if pass
Memory check (warn if <4GB, never fail)
    ↓
Test count regression (new_count >= old_count)
    ↓ if no regression
Git clean (all changes committed)
    ↓ if clean
Telemetry capture (NEW — write batch results to logs/telemetry.jsonl)
    ↓
✅ PASS → next batch
```
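
The pipeline's control flow can be sketched as below: it stops at the first blocking failure, the memory check only warns, and the test count must be monotonic. The stage functions are hypothetical stand-ins for the real checks in `quality-gate.sh`. The memory check reads `/proc/meminfo`, so it only does anything on Linux.

```bash
# Hypothetical gate runner: run stages in order, stop at the first failure.
# Each argument is the name of a function or command that exits 0 on success.
run_gate_pipeline() {
  local stage
  for stage in "$@"; do
    if ! "$stage"; then
      echo "FAIL at stage: $stage" >&2
      return 1
    fi
  done
  echo "PASS"
}

# Warn-only stage: reports low memory (< 4GB available) but always succeeds.
memory_check() {
  local avail_kb
  avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo 2> /dev/null || true)
  if [[ -n "$avail_kb" ]] && (( avail_kb < 4 * 1024 * 1024 )); then
    echo "WARN: less than 4GB of memory available" >&2
  fi
  return 0
}

# Test count monotonicity: fail if the suite shrank.
# Usage: test_count_ok <previous_count> <new_count>
test_count_ok() {
  (( $2 >= $1 ))
}
```

Putting cheap checks first means a lesson-check failure (under 2s) short-circuits before the much slower test suite runs.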

### Examples (4 — all preserved)

example-plan.md, example-prd.json, example-roadmap.md, quickstart-plan.md

### Documentation (all preserved)

ARCHITECTURE.md, CONTRIBUTING.md, SECURITY.md, docs/lessons/FRAMEWORK.md, docs/lessons/TEMPLATE.md, docs/lessons/SUMMARY.md, docs/lessons/DIAGNOSTICS.md

### CI (preserved)

.github/workflows/ci.yml — ShellCheck + shfmt + shellharden + semgrep + tests

### Prompts & AST Patterns (all preserved)

- Prompts: planner-agent.md, judge-agent.md, agent-a-superpowers.md, agent-b-ralph.md
- Patterns: bare-except.yml, empty-catch.yml, async-no-await.yml, retry-loop-no-backoff.yml, hardcoded-localhost.yml

---

## Part 6: External Dependencies

### Required

| Dependency | Used By | Check |
|-----------|---------|-------|
| bash 4+ | All scripts | `act` checks at startup |
| git | Worktrees, state, PRs | `act` checks at startup |
| jq | State files, PRD, MAB, telemetry | `act` checks at startup |
| curl | Ollama, Telegram (optional features) | Checked at call site |
| claude CLI | Execution modes (plan, compound, mab) | Checked by run-plan.sh |
| Node.js 18+ | `bin/act.js` router only | npm enforces via engines |
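
The startup checks for bash, git, and jq come down to a version test plus a `command -v` lookup per tool. The table places these checks in `act` itself. The bash function below only illustrates the same checks, and its name is hypothetical.

```bash
# Hypothetical startup check: bash 4+ plus every named command on PATH.
# Reports every problem before failing, so users can fix them in one pass.
# Usage: require_cmds git jq
require_cmds() {
  local cmd missing=0
  if (( BASH_VERSINFO[0] < 4 )); then
    echo "act: bash 4+ required (found $BASH_VERSION)" >&2
    missing=1
  fi
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "act: missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```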

### Optional (graceful degradation)

| Dependency | Used By | Behavior if Missing |
|-----------|---------|-------------------|
| ruff | quality-gate (Python lint) | Skipped with warning |
| eslint | quality-gate (JS lint) | Skipped with warning |
| ast-grep | quality-gate (structural) | Skipped (advisory anyway) |
| ollama | analyze-report, auto-compound | Fails with clear message |
| bc | Thompson Sampling | Falls back to random routing |
| gh | PRs, submit-lesson, benchmarks | Fails with install hint |
| pytest/npm/make | quality-gate (tests) | Auto-detected, skips if none |

### Hardcoded Paths to Fix (2 only)

| Current | Fix | Script |
|---------|-----|--------|
| `~/.env` for Telegram/Ollama creds | Add `ACT_ENV_FILE` env var | telegram.sh, ollama.sh |
| `$HOME/Documents/projects` default | Already has `--projects-dir` flag | entropy-audit.sh |

Everything else uses `SCRIPT_DIR` relative resolution via `readlink -f`.
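
Here is a sketch of the proposed `ACT_ENV_FILE` fix. It falls back to today's `~/.env`, so existing setups keep working, and it sits alongside the `SCRIPT_DIR` idiom mentioned above. The function name is hypothetical. `readlink -f` resolves symlinks to the real file but is missing from older macOS releases.

```bash
# Hypothetical loader: credentials come from $ACT_ENV_FILE, defaulting to ~/.env.
load_act_env() {
  local env_file="${ACT_ENV_FILE:-$HOME/.env}"
  # Optional file: without it, Telegram/Ollama features simply stay unconfigured.
  [[ -f "$env_file" ]] || return 0
  set -a                     # export every variable the file defines
  # shellcheck source=/dev/null
  . "$env_file"
  set +a
}

# Resolve this script's real directory, following symlinks.
SCRIPT_DIR="$(cd "$(dirname "$(readlink -f "${BASH_SOURCE[0]:-$0}")")" && pwd)"
```

Sourcing executes the env file as shell code, so `ACT_ENV_FILE` should only ever point at a file the user controls.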

---

## Part 7: Design Principles

These principles govern the toolkit's behavior and every future contribution. They are non-negotiable.

### From the Original Architecture

1. **Fresh context per unit of work** — Context degradation is the #1 quality killer. Every execution mode solves this differently.
2. **Machine-verifiable gates** — No human judgment for "did this work?" Every gate is a command that exits 0 or non-zero.
3. **Test count monotonicity** — Tests only go up. Decreased count = something broke.
4. **State survives interruption** — Every transition persisted to disk. Kill, reboot, come back later — `--resume` works.
5. **Orthogonal verification** — Bottom-up (syntactic) + top-down (integration) catch non-overlapping bug classes.
6. **Lessons compound** — Every bug becomes an automated check. The system gets harder to break over time.

### From the Research Foundation

7. **Plan quality over execution quality** — 3:1 ratio. Invest in plan scoring, spec echo-back, and research gates before execution optimization.
8. **Measure before optimizing** — Telemetry first. Every improvement must be measurable.
9. **Positive instructions alongside negative** — Policies ("do Y") complement lessons ("don't do X"). LLMs respond better to positive guidance.
10. **Scope to prevent noise** — Every lesson has scope metadata. Without it, false positives compound and users disable the system.
11. **Community learning compounds** — Federated telemetry and lesson sync mean every user makes every other user's system better.
12. **Graduated autonomy** — Start supervised, earn trust through measured success, reduce friction over time.
13. **Fast time to first value** — Under 3 minutes to first quality-gated execution. A dead user gets zero benefit from perfect process.

### From Operations Research (18 frameworks converged)

14. **Formal gate between understanding and building** — The brainstorm→research→PRD chain is not optional overhead; it's the highest-leverage investment.
15. **Adversarial review at every stage** — Spec reviewer, code quality reviewer, lesson scanner, quality gate — each catches a different failure class.
16. **Intent over method** — Plans specify what and why, not how. Agents choose implementation strategy.

---

## Part 8: What's New (Summary)

| Item | Type | Est. Lines | Priority |
|------|------|-----------|----------|
| `package.json` | New file | ~30 | P0 (required for npm) |
| `bin/act.js` | New file | ~150 | P0 (CLI router) |
| `scripts/init.sh` | New file | ~100 | P0 (project bootstrap) |
| `scripts/telemetry.sh` | New file | ~200 | P1 (measurement before optimization) |
| `benchmarks/` directory | New directory | ~300 | P1 (prove the system works) |
| Fix `~/.env` → `ACT_ENV_FILE` | Edit 2 files | ~10 | P0 (portability) |
| `LESSONS_DIR` project-local fallback | Edit lesson-check.sh | ~10 | P0 (lesson tiers) |
| Update README.md | Edit | ~200 | P0 (installation docs) |
| Telemetry capture in quality gate | Edit quality-gate.sh | ~20 | P1 (data collection) |
| Trust score in pipeline-status.sh | Edit | ~50 | P2 (graduated autonomy) |
| Tier 2 echo-back | Edit run-plan-echo-back.sh | ~80 | P2 (spec drift prevention) |
| **Total new code** | | **~1,150** | |

- **P0:** Required for npm publish. Ship first.
- **P1:** Required for the learning system thesis. Ship second.
- **P2:** Enhances the learning system. Ship third.

---

## Part 9: What Does NOT Change

- All 20 skills — unchanged, same paths
- All 7 commands — unchanged
- All 7 agents — unchanged
- All 32 existing scripts — unchanged (except 3 small edits noted above)
- All 18 lib modules — unchanged (except telegram.sh, ollama.sh, and run-plan-echo-back.sh, which receive the edits noted above)
- All 79 lessons — bundled as-is
- All 4 policies — unchanged
- All 5 execution modes — unchanged
- All hooks — unchanged
- All state file formats — unchanged
- All prompts and AST patterns — unchanged
- CI workflow — unchanged
- Directory layout — preserved (additions only)
- Design principles 1-6 — preserved (7-16 are additions)

---

## Appendix A: Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| `act` name collision (other npm packages) | Medium | Low | Check npm registry; fallback: `actk` |
| Windows without WSL | Medium | Medium | Clear error message + WSL install guide |
| Telemetry privacy concerns | Low | High | Local-only default, explicit opt-in for sharing, no PII ever |
| Claude Code API changes break hooks/skills | Medium | High | Abstract plugin interface; version pin in package.json |
| Lesson false positive spiral at scale | Medium | High | Adaptive gates (Improvement 3) + scope filtering |
| Community doesn't form | High | Medium | Toolkit works solo; community features are additive |

## Appendix B: Success Metrics

| Metric | Target (6 months) | How Measured |
|--------|-------------------|-------------|
| npm weekly downloads | 50+ | npm stats |
| Community lessons submitted | 10+ | GitHub PRs |
| Benchmark score improvement | +10% over v1.0 baseline | `act benchmark compare` |
| Gate first-attempt pass rate | >85% across community | Aggregated telemetry |
| Time to first value | <3 minutes | Manual testing + user reports |
| User retention (>5 runs) | >50% of installers | Telemetry (if opted in) |

## Appendix C: Research Document Index

Full research corpus governing this design: `research/2026-02-22-cross-cutting-synthesis.md` (25 papers, 409 lines). Key references by section:

- Telemetry: Cost/Quality (#B1-7), MAB R2 (#P7)
- Federated learning: Lesson Transferability (#B2-2), MAB R1 (#P6)
- Adaptive gates: Lesson Transferability (#B2-2), Unconventional Perspectives (#B2-3)
- Echo-back: Failure Taxonomy (#B1-5), Multi-Agent Coordination (#B1-8)
- Fast lane: User Adoption (#B2-1), Competitive Landscape (#B1-4)
- Graduated autonomy: User Adoption (#B2-1), Operations Design (#P9)
- Benchmarks: Verification Effectiveness (#B1-6), Comprehensive Testing (#B2-7)