autonomous-coding-toolkit 1.0.0 → 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +48 -19
- package/docs/RESEARCH.md +55 -0
- package/package.json +1 -1
package/README.md
CHANGED

@@ -1,6 +1,6 @@
 [](https://github.com/parthalon025/autonomous-coding-toolkit/actions)
+[](https://www.npmjs.com/package/autonomous-coding-toolkit)
 [](https://opensource.org/licenses/MIT)
-[](https://github.com/parthalon025/autonomous-coding-toolkit/releases/tag/v1.0.0)
 
 # Autonomous Coding Toolkit
 
@@ -10,15 +10,6 @@
 
 Built for [Claude Code](https://docs.anthropic.com/en/docs/claude-code) (v1.0.33+). Works as a Claude Code plugin (interactive) and npm CLI (headless/CI).
 
-## What It Does
-
-```
-You write a plan → the toolkit executes it batch-by-batch with:
-- Fresh 200k context window per batch (no accumulated degradation)
-- Quality gates between every batch (tests + anti-pattern scan + memory check)
-- Machine-verifiable completion (every criterion is a shell command)
-```
-
 ## Install
 
 ### npm (recommended)
@@ -27,7 +18,7 @@ You write a plan → the toolkit executes it batch-by-batch with:
 npm install -g autonomous-coding-toolkit
 ```
 
-This puts `act` on your PATH.
+This puts `act` on your PATH.
 
 ### Claude Code Plugin
 
@@ -47,7 +38,30 @@ cd autonomous-coding-toolkit
 npm link  # puts 'act' on PATH
 ```
 
-
+### Platform Notes
+
+| Platform | Status | Notes |
+|----------|--------|-------|
+| **Linux** | Works out of the box | bash 4+, jq, git required |
+| **macOS** | Works with Homebrew bash | macOS ships bash 3.2 — install bash 4+ via `brew install bash`. Also install coreutils for GNU readlink: `brew install coreutils` |
+| **Windows** | WSL only | Run `wsl --install`, then use the toolkit inside WSL. Native Windows is not supported |
+
+<details>
+<summary>macOS setup</summary>
+
+macOS ships bash 3.2 (2007) due to licensing. The toolkit requires bash 4+ for associative arrays and other features.
+
+```bash
+# Install modern bash and GNU coreutils
+brew install bash coreutils jq
+
+# Verify
+bash --version  # Should show 5.x
+```
+
+Homebrew bash installs to `/opt/homebrew/bin/bash` (Apple Silicon) or `/usr/local/bin/bash` (Intel). The `act` CLI invokes scripts via `bash` — as long as Homebrew's bin is on your PATH (which `brew` sets up automatically), scripts will use the correct version.
+
+</details>
 
 ## Quick Start
 
@@ -80,13 +94,13 @@ Each stage exists because a specific failure mode demanded it:
 
 | Stage | Problem It Solves | Evidence |
 |-------|------------------|----------|
-| **Brainstorm** | Agents build the wrong thing correctly |
-| **Research** | Building on assumptions wastes hours |
+| **Brainstorm** | Agents build the wrong thing correctly | SWE-bench Pro: removing specs = 3x degradation |
+| **Research** | Building on assumptions wastes hours | Stage-Gate: stable definitions = 3x success rate |
 | **Plan** | Plan quality dominates execution quality ~3:1 | SWE-bench Pro: spec removal = 3x degradation |
-| **Execute** | Context degradation is the #1 quality killer |
-| **Verify** | Static review misses behavioral bugs |
+| **Execute** | Context degradation is the #1 quality killer | 11/12 models < 50% at 32K tokens |
+| **Verify** | Static review misses behavioral bugs | Property-based testing finds ~50x more mutations |
 
-Full evidence
+Full evidence with 25+ papers across 16 research reports: [`docs/RESEARCH.md`](docs/RESEARCH.md)
 
 ## How It Compares
 
@@ -119,14 +133,16 @@ Submit new lessons via `/submit-lesson` or [open an issue](https://github.com/pa
 ## Requirements
 
 - **Claude Code** v1.0.33+ (`claude` CLI)
+- **Node.js** 18+ (for the `act` CLI router)
 - **bash** 4+, **jq**, **git**
-- Optional: **gh** (PR creation), **curl** (Telegram notifications)
+- Optional: **gh** (PR creation), **curl** (Telegram notifications), **ast-grep** (structural checks)
 
 ## Learn More
 
 | Topic | Doc |
 |-------|-----|
-| Architecture
+| Architecture and internals | [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) |
+| Research (25+ papers, 16 reports) | [`docs/RESEARCH.md`](docs/RESEARCH.md) |
 | Contributing lessons | [`docs/CONTRIBUTING.md`](docs/CONTRIBUTING.md) |
 | Plan file format | [`examples/example-plan.md`](examples/example-plan.md) |
 | Execution modes (5 options) | [`docs/ARCHITECTURE.md#system-overview`](docs/ARCHITECTURE.md#system-overview) |
@@ -135,6 +151,19 @@ Submit new lessons via `/submit-lesson` or [open an issue](https://github.com/pa
 
 Core skill chain forked from [superpowers](https://github.com/obra/superpowers) by Jesse Vincent / Anthropic. Extended with quality gate pipeline, headless execution, lesson system, MAB routing, and research/roadmap stages.
 
+## Research Sources
+
+The toolkit's design is grounded in peer-reviewed research. Key papers:
+
+- [**SWE-bench Pro**](https://arxiv.org/pdf/2509.16941) (Xia et al., 2025) — 1,865 programming problems; removing specifications degraded agent success from 25.9% to 8.4%
+- [**Context Rot**](https://research.trychroma.com/context-rot) (Hong et al., Chroma 2025) — 11 of 12 models scored below 50% of short-context performance at 32K tokens
+- [**Lost in the Middle**](https://arxiv.org/abs/2307.03172) (Liu et al., Stanford TACL 2024) — Information placed mid-context suffers up to 20 percentage point accuracy loss
+- [**Agentic Property-Based Testing**](https://arxiv.org/html/2510.09907v1) (OOPSLA 2025) — Property-based testing finds ~50x more mutations per test than traditional unit tests
+- [**Bugs in LLM-Generated Code**](https://arxiv.org/abs/2403.08937) (Tambon et al., 2024) — Empirical taxonomy of AI code generation failures
+- **Cooper Stage-Gate** — Projects with stable, upfront definitions are 3x more likely to succeed
+
+16 research reports synthesizing 25+ papers: [`docs/RESEARCH.md`](docs/RESEARCH.md)
+
 ## License
 
 MIT
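The macOS caveat in the Platform Notes above turns on associative arrays, a bash 4 feature that stock macOS bash 3.2 lacks. A quick probe of whichever `bash` is on PATH (a sketch for illustration; `check_bash` is a hypothetical helper name, not part of the toolkit):

```shell
# Probe whether `bash` on PATH supports associative arrays (bash 4+).
# check_bash is a hypothetical helper, not part of the toolkit's CLI.
check_bash() {
  if bash -c 'declare -A m && m[k]=v && [ "${m[k]}" = v ]' 2>/dev/null; then
    echo "bash 4+: associative arrays available"
  else
    echo "bash too old (likely macOS 3.2): brew install bash"
  fi
}
check_bash
```

On stock macOS the `declare -A` fails and the probe prints the upgrade hint; with Homebrew's bash 5.x first on PATH it reports success.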
package/docs/RESEARCH.md
ADDED

@@ -0,0 +1,55 @@
+# Research
+
+Evidence base for the Autonomous Coding Toolkit's design decisions. Each report synthesizes peer-reviewed papers, benchmarks, and field observations into actionable findings.
+
+## Core Design Research
+
+These directly shaped the toolkit's architecture:
+
+| Topic | Key Finding | Report |
+|-------|-------------|--------|
+| Plan quality | Plan quality dominates execution quality ~3:1 (SWE-bench Pro) | [Plan Quality](plans/2026-02-22-research-plan-quality.md) |
+| Context degradation | 11/12 models < 50% accuracy at 32K tokens; mid-context loss up to 20pp | [Context Utilization](plans/2026-02-22-research-context-utilization.md) |
+| Agent failures | Spec misunderstanding is the dominant failure mode (~60%), not code quality | [Agent Failure Taxonomy](plans/2026-02-22-research-agent-failure-taxonomy.md) |
+| Verification | Property-based testing finds ~50x more mutations per test than unit tests | [Verification Effectiveness](plans/2026-02-22-research-verification-effectiveness.md) |
+| Prompt engineering | Positive instructions outperform negative; context placement matters | [Prompt Engineering](plans/2026-02-22-research-prompt-engineering.md) |
+| Lesson transferability | Anti-pattern lessons generalize across projects with scope metadata | [Lesson Transferability](plans/2026-02-22-research-lesson-transferability.md) |
+
+## Competitive & Adoption Research
+
+| Topic | Report |
+|-------|--------|
+| Competitive landscape (Aider, Cursor, SWE-agent, etc.) | [Competitive Landscape](plans/2026-02-22-research-competitive-landscape.md) |
+| User adoption friction and onboarding | [User Adoption](plans/2026-02-22-research-user-adoption.md) |
+| Cost/quality tradeoff modeling | [Cost-Quality Tradeoff](plans/2026-02-22-research-cost-quality-tradeoff.md) |
+
+## Implementation Research
+
+| Topic | Report |
+|-------|--------|
+| Testing strategies for large full-stack projects | [Comprehensive Testing](plans/2026-02-22-research-comprehensive-testing.md) |
+| Multi-agent coordination patterns | [Multi-Agent Coordination](plans/2026-02-22-research-multi-agent-coordination.md) |
+| Codebase auditing and refactoring with AI | [Codebase Audit](plans/2026-02-22-research-codebase-audit-refactoring.md) |
+| Code guideline policies for AI agents | [Code Guidelines](plans/2026-02-22-research-code-guideline-policies.md) |
+| Coding standards and AI agent performance | [Coding Standards](plans/2026-02-22-research-coding-standards-documentation.md) |
+| Research phase integration into pipelines | [Phase Integration](plans/2026-02-22-research-phase-integration.md) |
+
+## Advanced Topics
+
+| Topic | Report |
+|-------|--------|
+| Multi-Armed Bandit strategy selection | [MAB Report](plans/2026-02-21-mab-research-report.md), [Round 2](plans/2026-02-22-mab-research-round2.md) |
+| Operations design methodology (18 cross-domain frameworks) | [Operations Design](plans/2026-02-22-operations-design-methodology-research.md) |
+| Unconventional perspectives on autonomous coding | [Unconventional Perspectives](plans/2026-02-22-research-unconventional-perspectives.md) |
+
+## Key Papers Referenced
+
+The most-cited papers across the research corpus:
+
+1. **SWE-bench Pro** (Xia et al., 2025) — 1,865 programming problems; spec removal = 3x degradation
+2. **Context Rot** (Hong et al., Chroma 2025) — Long-context coding benchmark; 11/12 models < 50% at 32K
+3. **Lost in the Middle** (Liu et al., Stanford TACL 2024) — Up to 20pp accuracy loss for mid-context information
+4. **Agentic Property-Based Testing** (OOPSLA 2025) — Property-based testing mutation analysis
+5. **Cooper Stage-Gate** — Projects with stable definitions are 3x more likely to succeed
+
+Full citation details are in each individual report.
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "autonomous-coding-toolkit",
-  "version": "1.0.0",
+  "version": "1.0.2",
   "description": "Autonomous AI coding pipeline: quality gates, fresh-context execution, community lessons, and compounding learning",
   "license": "MIT",
   "author": "Justin McFarland <parthalon025@gmail.com>",