autonomous-coding-toolkit 1.0.0 → 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +48 -19
- package/docs/RESEARCH.md +55 -0
- package/package.json +1 -1
package/README.md
CHANGED

@@ -1,6 +1,6 @@
 [](https://github.com/parthalon025/autonomous-coding-toolkit/actions)
+[](https://www.npmjs.com/package/autonomous-coding-toolkit)
 [](https://opensource.org/licenses/MIT)
-[](https://github.com/parthalon025/autonomous-coding-toolkit/releases/tag/v1.0.0)
 
 # Autonomous Coding Toolkit
 
@@ -10,15 +10,6 @@
 
 Built for [Claude Code](https://docs.anthropic.com/en/docs/claude-code) (v1.0.33+). Works as a Claude Code plugin (interactive) and npm CLI (headless/CI).
 
-## What It Does
-
-```
-You write a plan → the toolkit executes it batch-by-batch with:
-- Fresh 200k context window per batch (no accumulated degradation)
-- Quality gates between every batch (tests + anti-pattern scan + memory check)
-- Machine-verifiable completion (every criterion is a shell command)
-```
-
 ## Install
 
 ### npm (recommended)
@@ -27,7 +18,7 @@ You write a plan → the toolkit executes it batch-by-batch with:
 npm install -g autonomous-coding-toolkit
 ```
 
-This puts `act` on your PATH.
+This puts `act` on your PATH.
 
 ### Claude Code Plugin
 
@@ -47,7 +38,30 @@ cd autonomous-coding-toolkit
 npm link  # puts 'act' on PATH
 ```
 
-
+### Platform Notes
+
+| Platform | Status | Notes |
+|----------|--------|-------|
+| **Linux** | Works out of the box | bash 4+, jq, git required |
+| **macOS** | Works with Homebrew bash | macOS ships bash 3.2 — install bash 4+ via `brew install bash`. Also install coreutils for GNU readlink: `brew install coreutils` |
+| **Windows** | WSL only | Run `wsl --install`, then use the toolkit inside WSL. Native Windows is not supported |
+
+<details>
+<summary>macOS setup</summary>
+
+macOS ships bash 3.2 (2007) due to licensing. The toolkit requires bash 4+ for associative arrays and other features.
+
+```bash
+# Install modern bash and GNU coreutils
+brew install bash coreutils jq
+
+# Verify
+bash --version  # Should show 5.x
+```
+
+Homebrew bash installs to `/opt/homebrew/bin/bash` (Apple Silicon) or `/usr/local/bin/bash` (Intel). The `act` CLI invokes scripts via `bash` — as long as Homebrew's bin is on your PATH (which `brew` sets up automatically), scripts will use the correct version.
+
+</details>
 
 ## Quick Start
 
@@ -80,13 +94,13 @@ Each stage exists because a specific failure mode demanded it:
 
 | Stage | Problem It Solves | Evidence |
 |-------|------------------|----------|
-| **Brainstorm** | Agents build the wrong thing correctly |
-| **Research** | Building on assumptions wastes hours |
+| **Brainstorm** | Agents build the wrong thing correctly | SWE-bench Pro: removing specs = 3x degradation |
+| **Research** | Building on assumptions wastes hours | Stage-Gate: stable definitions = 3x success rate |
 | **Plan** | Plan quality dominates execution quality ~3:1 | SWE-bench Pro: spec removal = 3x degradation |
-| **Execute** | Context degradation is the #1 quality killer |
-| **Verify** | Static review misses behavioral bugs |
+| **Execute** | Context degradation is the #1 quality killer | 11/12 models < 50% at 32K tokens |
+| **Verify** | Static review misses behavioral bugs | Property-based testing finds ~50x more mutations |
 
-Full evidence
+Full evidence with 25+ papers across 16 research reports: [`docs/RESEARCH.md`](docs/RESEARCH.md)
 
 ## How It Compares
 
@@ -119,14 +133,16 @@ Submit new lessons via `/submit-lesson` or [open an issue](https://github.com/pa
 ## Requirements
 
 - **Claude Code** v1.0.33+ (`claude` CLI)
+- **Node.js** 18+ (for the `act` CLI router)
 - **bash** 4+, **jq**, **git**
-- Optional: **gh** (PR creation), **curl** (Telegram notifications)
+- Optional: **gh** (PR creation), **curl** (Telegram notifications), **ast-grep** (structural checks)
 
 ## Learn More
 
 | Topic | Doc |
 |-------|-----|
-| Architecture
+| Architecture and internals | [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) |
+| Research (25+ papers, 16 reports) | [`docs/RESEARCH.md`](docs/RESEARCH.md) |
 | Contributing lessons | [`docs/CONTRIBUTING.md`](docs/CONTRIBUTING.md) |
 | Plan file format | [`examples/example-plan.md`](examples/example-plan.md) |
 | Execution modes (5 options) | [`docs/ARCHITECTURE.md#system-overview`](docs/ARCHITECTURE.md#system-overview) |
@@ -135,6 +151,19 @@ Submit new lessons via `/submit-lesson` or [open an issue](https://github.com/pa
 
 Core skill chain forked from [superpowers](https://github.com/obra/superpowers) by Jesse Vincent / Anthropic. Extended with quality gate pipeline, headless execution, lesson system, MAB routing, and research/roadmap stages.
 
+## Research Sources
+
+The toolkit's design is grounded in peer-reviewed research. Key papers:
+
+- [**SWE-bench Pro**](https://arxiv.org/pdf/2509.16941) (Xia et al., 2025) — 1,865 programming problems; removing specifications degraded agent success from 25.9% to 8.4%
+- [**Context Rot**](https://research.trychroma.com/context-rot) (Hong et al., Chroma 2025) — 11 of 12 models scored below 50% of short-context performance at 32K tokens
+- [**Lost in the Middle**](https://arxiv.org/abs/2307.03172) (Liu et al., Stanford TACL 2024) — Information placed mid-context suffers up to 20 percentage point accuracy loss
+- [**Agentic Property-Based Testing**](https://arxiv.org/html/2510.09907v1) (OOPSLA 2025) — Property-based testing finds ~50x more mutations per test than traditional unit tests
+- [**Bugs in LLM-Generated Code**](https://arxiv.org/abs/2403.08937) (Tambon et al., 2024) — Empirical taxonomy of AI code generation failures
+- **Cooper Stage-Gate** — Projects with stable, upfront definitions are 3x more likely to succeed
+
+16 research reports synthesizing 25+ papers: [`docs/RESEARCH.md`](docs/RESEARCH.md)
+
 ## License
 
 MIT
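The macOS caveat in the Platform Notes above turns on associative arrays, a bash 4 feature that stock macOS bash 3.2 lacks. A quick probe of whichever `bash` is on PATH (a sketch for illustration; `check_bash` is a hypothetical helper name, not part of the toolkit):

```shell
# Probe whether `bash` on PATH supports associative arrays (bash 4+).
# check_bash is a hypothetical helper, not part of the toolkit's CLI.
check_bash() {
  if bash -c 'declare -A m && m[k]=v && [ "${m[k]}" = v ]' 2>/dev/null; then
    echo "bash 4+: associative arrays available"
  else
    echo "bash too old (likely macOS 3.2): brew install bash"
  fi
}
check_bash
```

On stock macOS the `declare -A` fails and the probe prints the upgrade hint; with Homebrew's bash 5.x first on PATH it reports success.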
package/docs/RESEARCH.md
ADDED

@@ -0,0 +1,55 @@
+# Research
+
+Evidence base for the Autonomous Coding Toolkit's design decisions. Each report synthesizes peer-reviewed papers, benchmarks, and field observations into actionable findings.
+
+## Core Design Research
+
+These directly shaped the toolkit's architecture:
+
+| Topic | Key Finding | Report |
+|-------|-------------|--------|
+| Plan quality | Plan quality dominates execution quality ~3:1 (SWE-bench Pro) | [Plan Quality](plans/2026-02-22-research-plan-quality.md) |
+| Context degradation | 11/12 models < 50% accuracy at 32K tokens; mid-context loss up to 20pp | [Context Utilization](plans/2026-02-22-research-context-utilization.md) |
+| Agent failures | Spec misunderstanding is the dominant failure mode (~60%), not code quality | [Agent Failure Taxonomy](plans/2026-02-22-research-agent-failure-taxonomy.md) |
+| Verification | Property-based testing finds ~50x more mutations per test than unit tests | [Verification Effectiveness](plans/2026-02-22-research-verification-effectiveness.md) |
+| Prompt engineering | Positive instructions outperform negative; context placement matters | [Prompt Engineering](plans/2026-02-22-research-prompt-engineering.md) |
+| Lesson transferability | Anti-pattern lessons generalize across projects with scope metadata | [Lesson Transferability](plans/2026-02-22-research-lesson-transferability.md) |
+
+## Competitive & Adoption Research
+
+| Topic | Report |
+|-------|--------|
+| Competitive landscape (Aider, Cursor, SWE-agent, etc.) | [Competitive Landscape](plans/2026-02-22-research-competitive-landscape.md) |
+| User adoption friction and onboarding | [User Adoption](plans/2026-02-22-research-user-adoption.md) |
+| Cost/quality tradeoff modeling | [Cost-Quality Tradeoff](plans/2026-02-22-research-cost-quality-tradeoff.md) |
+
+## Implementation Research
+
+| Topic | Report |
+|-------|--------|
+| Testing strategies for large full-stack projects | [Comprehensive Testing](plans/2026-02-22-research-comprehensive-testing.md) |
+| Multi-agent coordination patterns | [Multi-Agent Coordination](plans/2026-02-22-research-multi-agent-coordination.md) |
+| Codebase auditing and refactoring with AI | [Codebase Audit](plans/2026-02-22-research-codebase-audit-refactoring.md) |
+| Code guideline policies for AI agents | [Code Guidelines](plans/2026-02-22-research-code-guideline-policies.md) |
+| Coding standards and AI agent performance | [Coding Standards](plans/2026-02-22-research-coding-standards-documentation.md) |
+| Research phase integration into pipelines | [Phase Integration](plans/2026-02-22-research-phase-integration.md) |
+
+## Advanced Topics
+
+| Topic | Report |
+|-------|--------|
+| Multi-Armed Bandit strategy selection | [MAB Report](plans/2026-02-21-mab-research-report.md), [Round 2](plans/2026-02-22-mab-research-round2.md) |
+| Operations design methodology (18 cross-domain frameworks) | [Operations Design](plans/2026-02-22-operations-design-methodology-research.md) |
+| Unconventional perspectives on autonomous coding | [Unconventional Perspectives](plans/2026-02-22-research-unconventional-perspectives.md) |
+
+## Key Papers Referenced
+
+The most-cited papers across the research corpus:
+
+1. **SWE-bench Pro** (Xia et al., 2025) — 1,865 programming problems; spec removal = 3x degradation
+2. **Context Rot** (Hong et al., Chroma 2025) — Long-context coding benchmark; 11/12 models < 50% at 32K
+3. **Lost in the Middle** (Liu et al., Stanford TACL 2024) — Up to 20pp accuracy loss for mid-context information
+4. **Agentic Property-Based Testing** (OOPSLA 2025) — Property-based testing mutation analysis
+5. **Cooper Stage-Gate** — Projects with stable definitions are 3x more likely to succeed
+
+Full citation details are in each individual report.
package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "autonomous-coding-toolkit",
-  "version": "1.0.0",
+  "version": "1.0.2",
   "description": "Autonomous AI coding pipeline: quality gates, fresh-context execution, community lessons, and compounding learning",
   "license": "MIT",
   "author": "Justin McFarland <parthalon025@gmail.com>",