npm - devlyn-cli - Versions diffs - 1.14.0 → 2.0.0 - Mend

devlyn-cli 1.14.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (148)

package/README.md CHANGED

@@ -27,124 +27,71 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
 npx devlyn-cli
 ```
-That's it. The interactive installer handles everything. Run it again anytime to update.
+That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md`. Run it again anytime to update.
 ---
-## How It Works — Three Steps, Full Cycle
+## How It Works — Two Skills, Full Cycle
-devlyn-cli turns Claude Code into an autonomous development pipeline. The core loop is simple:
+devlyn-cli turns Claude Code into a hands-free development pipeline. The product surface is two skills:
 ```
-ideate  →  auto-resolve  →  preflight  →  fix gaps  →  ship
+ideate (optional)  →  resolve  →  ship
 ```
-### Step 1 — Plan with `/devlyn:ideate`
+### Step 1 (optional) — Plan with `/devlyn:ideate`
-Turn a raw idea into structured, implementation-ready specs.
+Turn a raw idea into a verifiable spec — single-feature, multi-feature, or "normalize this external doc".
 ```
 /devlyn:ideate "I want to build a habit tracking app with AI nudges"
 ```
-This produces three documents through interactive brainstorming:
+Default mode produces a `docs/specs/<id>-<slug>/spec.md` plus `spec.expected.json` (mechanical verification block) that `/devlyn:resolve --spec` consumes directly. Modes:
-| Document | What It Contains |
+| Mode | When to use |
 |---|---|
-| `docs/VISION.md` | North star, principles, anti-goals |
-| `docs/ROADMAP.md` | Phased roadmap with links to each spec |
-| `docs/roadmap/phase-N/*.md` | Self-contained spec per feature — ready for auto-resolve |
+| `default` | One feature, AI drives focused Q&A |
+| `--quick` | One-line goal → assume-and-confirm spec, single-turn (autonomous-pipeline-safe) |
+| `--from-spec <path>` | You already wrote a spec; ideate normalizes + lints it |
+| `--project` | Multi-feature project: emits `plan.md` index + N child specs |
-Need to add features later? Run ideate again — it expands the existing roadmap.
+Skip ideate entirely if you have a spec or just want to describe the work — `/devlyn:resolve` accepts free-form goals too.
-### Step 2 — Build with `/devlyn:auto-resolve`
+### Step 2 — Resolve with `/devlyn:resolve`
-Point it at a spec (or just describe what you want) and walk away.
+Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Pass a spec, a free-form goal, or a diff to verify.
 ```
-/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-1/1.1-user-auth.md"
+/devlyn:resolve "fix the login bug"                                # free-form
+/devlyn:resolve --spec docs/specs/2026-05-04-auth/spec.md          # spec mode
+/devlyn:resolve --verify-only <diff-or-PR-ref> --spec <path>       # verify-only
 ```
-It runs a **10-phase pipeline** autonomously:
+Internal phases run sequentially with file-based handoff via `.devlyn/pipeline.state.json`:
 ```
-Build → Build Gate → Browser Test → Evaluate → Fix Loop → Simplify → Review → Security → Clean → Docs
+PLAN  →  IMPLEMENT  →  BUILD_GATE  →  CLEANUP  →  VERIFY (fresh subagent, findings-only)
 ```
-- Each phase runs as a separate agent with fresh context
-- Git checkpoints at every phase for safe rollback
-- **Build Gate** runs your project's real compilers, typecheckers, and linters — catches type errors, cross-package drift, and Docker build failures that tests alone miss. Auto-detects project type (Next.js, Rust, Go, Solidity, Expo, Swift, and more) and Dockerfiles.
-- Browser validation tests your feature end-to-end (clicks, forms, verification)
-- Evaluation grades against done-criteria — if it fails, auto-fix and re-evaluate
+- **PLAN** is the heaviest phase by design — formalizes invariants from the spec/goal and the file list to touch.
+- **BUILD_GATE** runs your project's real compilers, typecheckers, linters, and `python3 .claude/skills/_shared/spec-verify-check.py` (verification commands literal-match). Auto-detects Next.js, Rust, Go, Solidity, Expo, Swift, and Dockerfiles. Browser flows route through Chrome MCP → Playwright → curl tier.
+- **VERIFY** runs in a fresh subagent context with no code-mutation tools — findings only, structurally independent.
+- Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
-Skip phases you don't need: `--skip-browser`, `--skip-review`, `--skip-clean`, `--skip-docs`, `--skip-build-gate`, `--max-rounds 6`
-Customize the build gate: `--build-gate strict` (warnings = errors), `--build-gate no-docker` (skip Docker builds for speed)
-Use dual-model routing: `--engine auto` (Codex builds, Claude evaluates — see below)
+Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--perf` (per-phase timing).
-### Step 3 — Verify with `/devlyn:preflight`
+### Engine selection — Claude solo by default
-After implementing all roadmap items, run a final alignment check:
+`--engine claude` (default) is the canonical surface. Every phase routes to Claude.
-```
-/devlyn:preflight
-```
-Reads every commitment from your vision, roadmap, and item specs, then audits the codebase evidence-based. Catches what you missed:
-| Category | What It Finds |
-|---|---|
-| `MISSING` | In roadmap but not implemented |
-| `INCOMPLETE` | Started but unfinished |
-| `DIVERGENT` | Implemented differently than spec |
-| `BROKEN` | Has a bug preventing it from working |
-| `STALE_DOC` | Docs don't match current code |
-Confirmed gaps become new roadmap items — feed them back into auto-resolve. Use `--autofix` to do this automatically, or `--phase 2` to check only one phase.
-### Bonus — Intelligent Model Routing with `--engine`
-Install the Codex MCP server during setup, then:
+`--engine codex` routes IMPLEMENT to Codex; `--engine auto` opts into the experimental dual-engine routing where applicable. Both are research-only at HEAD: iter-0020 closed Codex BUILD/IMPLEMENT below the quality floor on the 9-fixture suite (L2 vs L1 = −3.6, 3/8 gated fixtures cleared the +5 margin floor — release-readiness FAIL); iter-0033g + iter-0034 closed PLAN-pair as research-only with explicit unblock conditions (container/sandbox infra OR production telemetry capturing positive evidence of subagent introspection). Install the Codex CLI (https://platform.openai.com/docs/codex) and pass the flag explicitly to opt in:
 ```
-/devlyn:auto-resolve "fix the auth bug" --engine auto
+/devlyn:resolve "fix the auth bug" --engine auto   # experimental, research-only
 ```
-**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
-> `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
-Works across the full pipeline:
-```
-/devlyn:auto-resolve "implement feature" --engine auto
-/devlyn:ideate "plan new project" --engine auto
-/devlyn:preflight --engine auto
-```
-<details>
-<summary><strong>How routing works</strong> — A/B tested on 6 roles, 11 integration tests</summary>
-**Pipeline phases** — builder and critic are always different models (GAN dynamic):
-| Phase | Model | Why |
-|---|---|---|
-| Build (implementation) | **Codex GPT-5.4** | SWE-bench Pro +11.7pp for hard coding tasks |
-| Evaluate | **Claude** | Long-context (MRCR +28pp) for full-diff grading |
-| Fix Loop | **Codex GPT-5.4** | Same advantage as Build |
-| Challenge | **Claude** | Fresh skeptical review needs different model family |
-| Browser Validate | **Claude** | Chrome MCP session-bound |
-**Team roles** — each of 21 roles routes to the best model:
-| Engine | Roles | Examples |
-|---|---|---|
-| Claude (11) | Analysis, design, architecture | root-cause-analyst, architecture-reviewer, ux-designer, product-analyst |
-| Codex (4) | Code generation, performance | implementation-planner, test-engineer, performance-engineer |
-| Dual (6) | Both models find unique issues | security-auditor, quality-reviewer, api-designer |
-**Key finding**: Benchmark predictions were only 33% accurate. 4 of 6 A/B-tested roles needed routing changes after real testing — proving that benchmarks alone are insufficient for optimal routing.
-</details>
+If Codex is absent when `--engine auto` or `--engine codex` is requested, the harness silently downgrades to `--engine claude` and emits a banner in the final report.
 <details>
 <summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
@@ -169,7 +116,7 @@ Works across the full pipeline:
 Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
 - **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
-- **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
+- **`--with-codex` consolidated into `--engine auto`** — auto covers BUILD/FIX + team roles + ideate CHALLENGE critic. Legacy flag still accepted with a graceful handoff. *(Note: post iter-0020 close-out, `--engine auto` is experimental research-only; default is `--engine claude`.)*
 - **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
 - **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
@@ -177,47 +124,16 @@ Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against A
 ---
-## Manual Commands
-When you want step-by-step control instead of the full pipeline.
-### Debugging & Resolution
-| Command | Use When |
-|---|---|
-| `/devlyn:resolve` | Simple bugs (1-2 files) |
-| `/devlyn:team-resolve` | Complex issues — spawns root-cause analyst, test engineer, security auditor |
-| `/devlyn:browser-validate` | Test a web feature in a real browser (Chrome MCP → Playwright → curl fallback) |
+## Optional Power-User Skills
-### Code Review & Quality
+Two creative skills have moved to `optional-skills/` — install them via the interactive installer when you need them.
 | Command | Use When |
 |---|---|
-| `/devlyn:review` | Solo review — security, quality, best practices checklist |
-| `/devlyn:team-review` | Multi-reviewer team — security, testing, performance, product perspectives |
-| `/devlyn:evaluate` | Grade work against done-criteria with calibrated skepticism |
-| `/devlyn:clean` | Remove dead code, unused deps, complexity hotspots |
-### UI Design Pipeline
-| Step | Command | What It Does |
-|---|---|---|
-| 1 | `/devlyn:design-ui` | Generate 5 distinct style explorations |
-| 2 | `/devlyn:design-system` | Extract design tokens from chosen style |
-| 3 | `/devlyn:implement-ui` | Team builds it — component architect, UX, accessibility, responsive, visual QA |
+| `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
+| `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
-> Use `/devlyn:team-design-ui` for step 1 with a full creative team.
-### Planning & Docs
-| Command | What It Does |
-|---|---|
-| `/devlyn:preflight` | Verify codebase matches vision/roadmap — gap analysis with evidence |
-| `/devlyn:product-spec` | Generate or update product specs |
-| `/devlyn:feature-spec` | Turn product spec → implementable feature spec |
-| `/devlyn:discover-product` | Scan codebase → auto-generate product docs |
-| `/devlyn:recommend-features` | Prioritize top 5 features to build next |
-| `/devlyn:update-docs` | Sync all docs with current codebase |
+> Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). These were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
 ---
@@ -231,7 +147,6 @@ These activate automatically — no commands needed. They shape how Claude think
 | `code-review-standards` | Reviews — severity framework, approval criteria |
 | `ui-implementation-standards` | UI work — design fidelity, accessibility, responsiveness |
 | `code-health-standards` | Maintenance — dead code prevention, complexity thresholds |
-| `workflow-routing` | Any task — guides you to the right command |
 ---
@@ -253,6 +168,9 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 | `dokkit` | Document template filling for DOCX/HWPX |
 | `devlyn:pencil-pull` | Pull Pencil designs into code |
 | `devlyn:pencil-push` | Push codebase UI to Pencil canvas |
+| `devlyn:reap` | Safely reap orphaned MCP / codex / Superset child processes |
+| `devlyn:design-system` | Extract design tokens from a chosen UI style for exact reproduction |
+| `devlyn:team-design-ui` | 5 distinct UI style explorations from a full design team |
 </details>
@@ -274,8 +192,9 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 | Server | Description |
 |---|---|
-| `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing |
-| `playwright` | Playwright MCP — powers browser-validate Tier 2 |
+| `playwright` | Playwright MCP — powers `/devlyn:resolve` BUILD_GATE browser tier (Chrome MCP → Playwright → curl fallback) |
+> `--engine auto/codex` uses the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex; the harness silently downgrades to `--engine claude` if the CLI is missing.
 </details>
@@ -290,9 +209,8 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 ## Contributing
-- **Add a command** — `.md` file in `config/commands/`
 - **Add a skill** — directory in `config/skills/` with `SKILL.md`
-- **Add optional skill** — add to `optional-skills/` and `OPTIONAL_ADDONS`
+- **Add optional skill** — add to `optional-skills/` and `OPTIONAL_ADDONS` in [`bin/devlyn.js`](bin/devlyn.js)
 - **Suggest a pack** — PR to the pack list
 ## Star History
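
Reviewer note: the engine-selection text added in this README diff says that when Codex is absent and `--engine auto` or `--engine codex` is requested, the harness downgrades to `--engine claude` and emits a banner in the final report. The rule can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `resolve_engine`, the banner wording, and the idea that "absent" means "not on PATH" are hypothetical and are not taken from the package.

```python
import shutil
from typing import Optional, Tuple

VALID_ENGINES = ("claude", "codex", "auto")


def resolve_engine(requested: str = "claude",
                   codex_on_path: Optional[bool] = None) -> Tuple[str, Optional[str]]:
    """Return (effective_engine, banner); banner is None when nothing was downgraded.

    Hypothetical helper: the real harness may detect Codex differently.
    """
    if requested not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {requested!r}")
    if codex_on_path is None:
        # Assumption: "Codex is absent" means the `codex` binary is not on PATH.
        codex_on_path = shutil.which("codex") is not None
    if requested in ("codex", "auto") and not codex_on_path:
        # Assumed banner text; the README only says a banner appears in the final report.
        banner = (f"--engine {requested} requested but the codex CLI was not found; "
                  "ran with --engine claude")
        return "claude", banner
    return requested, None
```

Passing `codex_on_path` explicitly keeps the rule testable without touching the real PATH; leaving it as `None` performs the lookup.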

package/benchmark/auto-resolve/BENCHMARK-DESIGN.md ADDED

@@ -0,0 +1,272 @@
+# Benchmark Suite Design — v1
+**Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
+**Purpose.** Replace ad-hoc A/B benchmarking with a permanent, comprehensive,
+one-command suite that gates every future harness change with a ship/rollback
+decision. Any prompt edit, phase reorder, new native skill, or model upgrade
+can be validated by running the suite and reading the numbers.
+**Arm structure (current vs planned).** Today the suite runs `variant` (L2: Claude + Codex pair) vs `bare` (L0). The L1 (solo harness on a single LLM) arm is queued for iter-0020 — until then the benchmark cannot directly verify the L1 contract, only the L0 ↔ L2 delta. Single-LLM users (Opus alone, GPT-5.5 alone) are first-class per the North Star, so this gap is a release-blocker for them, not a future enhancement.
+**Non-goals.** Publishable-research statistical rigor. Not a regression test
+library for the product code — those live elsewhere. Not a substitute for
+production telemetry — just enough signal for ship decisions.
+---
+## Principles
+1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
+   verdict. No manual fixture setup.
+2. **Novice-proof.** The suite exercises the same paths a first-time user
+   hits — including an end-to-end `ideate → auto-resolve → preflight` fixture.
+3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
+   stable; scores and margins float up as models improve. Nothing is
+   hardcoded to a specific model version.
+4. **Karpathy.** No fixture earns its place unless it tests a distinct
+   failure mode. Tooling stays boring. History plumbing is simple.
+5. **Ship gate is numbers, not vibes.** Concrete thresholds in RUBRIC.md.
+---
+## Directory Layout
+```
+benchmark/auto-resolve/
+├── BENCHMARK-DESIGN.md       # this file
+├── README.md                 # how to run, interpret, extend
+├── RUBRIC.md                 # stable judge rubric + ship gates
+│
+├── fixtures/
+│   ├── SCHEMA.md             # fixture file format
+│   ├── test-repo/            # bootstrap Node project (shared base)
+│   │   ├── bin/cli.js
+│   │   ├── server/index.js
+│   │   ├── web/page.html
+│   │   ├── tests/
+│   │   ├── playwright.config.js
+│   │   └── package.json
+│   │
+│   ├── F1-cli-trivial-flag/
+│   ├── F2-cli-medium-subcommand/
+│   ├── F3-backend-contract-risk/
+│   ├── F4-web-browser-design/
+│   ├── F5-fix-loop-red-green/
+│   ├── F6-dep-audit-native-module/
+│   ├── F7-out-of-scope-trap/
+│   ├── F8-known-limit-ambiguous/
+│   └── F9-e2e-ideate-to-resolve/
+│
+├── scripts/
+│   ├── run-suite.sh          # single entry — runs all fixtures × 2 arms + judge + report
+│   ├── run-fixture.sh        # one fixture, one arm
+│   ├── judge.sh              # Codex blind judge (model-agnostic)
+│   ├── compile-report.py     # aggregate into report.md + summary.json
+│   └── ship-gate.py          # apply thresholds, return ship/rollback verdict
+│
+├── results/                  # per-run artifacts (overwritten)
+│   └── <run-id>/
+│       ├── <fixture>/
+│       │   ├── variant/{input.md, transcript.txt, diff.patch, verify.json, timing.json}
+│       │   └── bare/{same}
+│       ├── <fixture>/judge.json
+│       ├── report.md
+│       └── summary.json
+│
+└── history/
+    ├── runs/                 # append-only immutable records
+    │   └── 2026-04-23T120000Z-v3.6.json
+    ├── latest.json           # pointer to most recent run
+    └── baselines/
+        └── shipped.json      # last blessed version, used for regression check
+```
+---
+## Fixture Schema
+Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
+| File | Purpose |
+|------|---------|
+| `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
+| `spec.md` | pipeline-arm input (auto-resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
+| `task.txt` | bare-arm input (same intent, natural-language framing) |
+| `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
+| `NOTES.md` | why this fixture exists, the specific failure mode it tests |
+| `setup.sh` | deterministic starting state — applies to a fresh copy of `test-repo/` |
+**Drift prevention**: `spec.md` and `task.txt` both derive from the same
+`intent` block in `metadata.json`. A lint step in CI verifies they stay
+consistent.
+---
+## The 9 Fixtures
+Category coverage matrix (rows = concerns, columns = fixtures):
+| Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
+|---------|---------|--------|-----------|--------|------|-----|
+| F1-cli-trivial-flag | ✓ | | | | | |
+| F2-cli-medium-subcommand | | ✓ | | | | |
+| F3-backend-contract-risk | | | ✓ | | | |
+| F4-web-browser-design | | | | ✓ (browser-validate) | | |
+| F5-fix-loop-red-green | | | | ✓ (FIX LOOP) | | |
+| F6-dep-audit-native-module | | | | ✓ (CRITIC security dep audit) | | |
+| F7-out-of-scope-trap | | | | ✓ (scope discipline) | | |
+| F8-known-limit-ambiguous | | | | | ✓ (documents where pipeline may lose) | |
+| F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
+**F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
+Input is a vague idea; pipeline arm runs ideate → auto-resolve on every
+generated spec → preflight; bare arm runs a direct prompt. Judge compares
+the final usable artifact set (code + docs + roadmap state).
+---
+## Single-Command Invocation
+### User experience
+```bash
+npx devlyn-cli benchmark            # n=1 smoke, all fixtures
+npx devlyn-cli benchmark --n 3      # higher confidence for ship decisions
+npx devlyn-cli benchmark F2 F5      # specific fixtures only
+npx devlyn-cli benchmark --judge-only --run-id <id>   # re-judge without re-running
+```
+Output on completion:
+```
+Benchmark Suite Run — 2026-04-23T12:00Z (v3.6)
+Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
+Fixture                         Variant   Bare   Margin   Verdict
+F1-cli-trivial-flag                 95     88     +7      PASS
+F2-cli-medium-subcommand            92     81    +11      PASS
+F3-backend-contract-risk            89     72    +17      PASS
+F4-web-browser-design               87     79     +8      PASS
+F5-fix-loop-red-green               91     65    +26      PASS
+F6-dep-audit-native-module          88     70    +18      PASS
+F7-out-of-scope-trap                94     73    +21      PASS
+F8-known-limit-ambiguous            78     79     -1      EXPECTED (known-limit)
+F9-e2e-ideate-to-resolve            90     68    +22      PASS
+---------------------------------------------------------
+Suite average variant score: 89.3
+Suite average bare score:    75.0
+Suite average margin:       +14.3  (ship floor: +5)
+Hard-floor violations:        0
+Regression vs shipped:       n/a (first run of v3.6)
+SHIP-GATE VERDICT: ✅ PASS
+```
+### Runner orchestration
+`run-suite.sh`:
+1. Generate run-id `<ISO>-<sha>-<branch>`
+2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
+   - `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
+3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
+4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
+5. `ship-gate.py <run-id>` → exit 0 (PASS) / 1 (FAIL). Prints verdict to stdout.
+6. If PASS and `--bless` flag: copy `summary.json` → `history/baselines/shipped.json`
+7. Always: append `history/runs/<run-id>.json` + update `latest.json`
+### `run-fixture.sh` contract
+- Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
+- Applies `setup.sh` if present
+- Copies `spec.md` (variant) or `task.txt` (bare) as the prompt
+- Invokes Claude/auto-resolve (variant) or bare Claude (bare) via isolated Agent
+- Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
+- Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
+- Writes `result.json` with aggregate: exit code, duration, files changed, verification score
+### `judge.sh` contract
+- Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
+- Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
+- Invokes `codex exec` (current flagship — no model hardcode) with RUBRIC.md
+- Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
+- Idempotent: re-running overwrites the same `judge.json`
+---
+## LLM-Upgrade Resilience
+Three mechanisms:
+1. **No hardcoded models.** Judge invocation is `codex exec` without `-m`; it
+   inherits whichever flagship the CLI currently ships. Same for agents —
+   they run against whatever Claude Code session-model the caller has.
+   Model provenance is captured in `result.json` per run.
+2. **Margin as primary signal, absolute score as secondary.** When models
+   improve, both arms get better. Margin (variant − bare) is model-invariant
+   — it measures **what the harness adds beyond bare**. Ship gates are
+   defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
+   score.
+3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
+   100 quickly as models improve — that's fine, it still catches catastrophic
+   regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
+   model won't 100-zero bare. If any fixture saturates (both arms > 95 for
+   two consecutive versions), we replace it with a harder one and document
+   the swap in `history/runs/<ts>-fixture-rotation.json`.
+---
+## Ship Gates (from RUBRIC.md)
+Hard floors (any single failure blocks ship):
+- **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
+- **Variant may not lose any fixture by more than −5** versus previous shipped version (per-fixture regression floor).
+- **At least 7 of 9 fixtures** must have margin ≥ +5 (suite coverage).
+- **F9 (E2E) must PASS** — novice-flow contract.
+Soft gates (trigger rollback discussion):
+- Suite average margin drop > 3 vs last shipped.
+- Any fixture with margin ≤ 0 that previously had margin > +5.
+- Critical-finding catch-rate decrease vs last shipped variant (not vs bare — bare is the opponent, not the regression baseline).
+Known-limit exception:
+- F8 is explicitly allowed to tie or lose (margin in [-3, +3]). Its job is to
+  document honesty, not to beat bare.
+---
+## Karpathy Check
+Where over-engineering lurks:
+- ❌ **Automatic history mutation during development.** Add append-only
+  history AFTER the suite format stabilizes (one version after initial ship).
+- ❌ **Statistical tooling beyond mean/median/margin.** n=1-3 doesn't need
+  t-tests.
+- ❌ **Auto-generated fixture cards / dashboards.** Plain `report.md` is enough.
+- ✅ **Keep scripts under 100 lines each** unless they're doing concrete,
+  repeated work the user would do by hand.
+If the suite tooling grows past ~800 total lines, prune aggressively before
+adding anything.
+---
+## Open Questions (to be answered before first full ship-gate run)
+1. Where does `benchmark` subcommand live? Inside `bin/devlyn.js` or as
+   standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
+   run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
+   entry, which shells out to the script.
+2. Parallel run safety — can we run 9 fixtures × 2 arms concurrently without
+   rate-limit / lockfile conflicts? **Proposal**: default sequential with
+   `--parallel N` flag. Default `N=1` for safety; the user can opt in.
+3. Token accounting — Claude Code doesn't expose subagent totals reliably.
+   **Proposal**: capture wall time as primary efficiency metric; token
+   estimate as best-effort secondary. Do not gate ship on token math alone.
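
Reviewer note: the hard floors listed under "Ship Gates" in the design document above can be sketched as a small Python check, using the sample scores from its example output. This is not the shipped `ship-gate.py`. It assumes that a fixture "PASS" means margin ≥ +5 and that the per-fixture regression floor compares variant scores against the shipped baseline; neither is spelled out in the diff. All names here (`FixtureResult`, `ship_gate`, `SAMPLE`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

MARGIN_FLOOR = 5            # per-fixture margin that counts toward suite coverage
MIN_CLEARING = 7            # "at least 7 of 9 fixtures"
REGRESSION_FLOOR = -5       # worst allowed per-fixture drop vs shipped baseline
E2E_FIXTURE = "F9"
KNOWN_LIMIT_FIXTURE = "F8"  # allowed to tie or lose within [-3, +3]


@dataclass
class FixtureResult:
    variant: int
    bare: int
    disqualified: bool = False  # judge-flagged disqualifier in the variant arm

    @property
    def margin(self) -> int:
        return self.variant - self.bare


def fixture_verdict(name: str, r: FixtureResult) -> str:
    if name == KNOWN_LIMIT_FIXTURE and -3 <= r.margin <= 3:
        return "EXPECTED (known-limit)"
    # Assumption: per-fixture PASS means clearing the +5 margin floor.
    return "PASS" if r.margin >= MARGIN_FLOOR else "FAIL"


def ship_gate(results: Dict[str, FixtureResult],
              shipped_variant: Optional[Dict[str, int]] = None) -> Tuple[bool, List[str]]:
    """Apply the hard floors; return (passed, list of failure reasons)."""
    failures: List[str] = []
    for name, r in results.items():
        if r.disqualified:
            failures.append(f"{name}: variant disqualifier")
        # Assumption: regression is measured on variant score, run over run.
        if shipped_variant and name in shipped_variant:
            if r.variant - shipped_variant[name] < REGRESSION_FLOOR:
                failures.append(f"{name}: regression worse than {REGRESSION_FLOOR}")
    clearing = sum(1 for r in results.values() if r.margin >= MARGIN_FLOOR)
    if clearing < MIN_CLEARING:
        failures.append(f"only {clearing} fixtures clear margin +{MARGIN_FLOOR}")
    e2e = results.get(E2E_FIXTURE)
    if e2e is None or e2e.margin < MARGIN_FLOOR:
        failures.append(f"{E2E_FIXTURE}: E2E fixture did not pass")
    return (not failures, failures)


def mean1(values: List[int]) -> float:
    return round(sum(values) / len(values), 1)


# Scores copied from the example output table in the design document.
SAMPLE = {
    "F1": FixtureResult(95, 88), "F2": FixtureResult(92, 81),
    "F3": FixtureResult(89, 72), "F4": FixtureResult(87, 79),
    "F5": FixtureResult(91, 65), "F6": FixtureResult(88, 70),
    "F7": FixtureResult(94, 73), "F8": FixtureResult(78, 79),
    "F9": FixtureResult(90, 68),
}
```

On the sample table this reproduces the suite averages printed in the example (89.3 variant, 75.0 bare, +14.3 margin) and a passing verdict. The soft gates are left out because they trigger discussion rather than block a ship.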

package/benchmark/auto-resolve/README.md ADDED

@@ -0,0 +1,114 @@
+# devlyn-cli auto-resolve Benchmark Suite
+One-command A/B benchmark that gates every harness change with a ship/rollback decision.
+## Quick start
+```bash
+npx devlyn-cli benchmark                 # n=1 smoke, all fixtures × 2 arms, judge, report, ship-gate
+npx devlyn-cli benchmark --n 3           # higher confidence for ship decisions
+npx devlyn-cli benchmark F2              # specific fixture only
+npx devlyn-cli benchmark --dry-run       # validate suite wiring without model invocation
+npx devlyn-cli benchmark --bless         # if ship-gate PASSes, promote this run as the shipped baseline
+npx devlyn-cli benchmark --judge-only --run-id <ID>   # re-judge an existing run's artifacts
+```
+Exit code 0 = PASS, 1 = FAIL.
+## What it does
+1. For every fixture × arm (`variant` / `bare`):
+   - Prepare a fresh temp copy of `fixtures/test-repo/`.
+   - Commit baseline + apply `setup.sh` + commit bench scaffolding.
+   - Invoke the arm via an isolated `claude -p` subprocess.
+   - Capture `diff.patch`, `transcript.txt`, `timing.json`, run `expected.json::verification_commands`.
+2. For every fixture, invoke `codex exec` as a blind judge (`A`/`B` randomized per fixture) using the 4-axis rubric in `RUBRIC.md`.
+3. Aggregate into `results/<run-id>/report.md` + `summary.json`.
+4. Apply ship-gate thresholds (`scripts/ship-gate.py`). Print verdict.
+5. Append immutable record to `history/runs/<run-id>.json`.
+## Directory layout
+```
+benchmark/auto-resolve/
+├── BENCHMARK-DESIGN.md       # full design rationale
+├── README.md                 # this file
+├── RUBRIC.md                 # 4-axis scoring + ship gates
+│
+├── fixtures/
+│   ├── SCHEMA.md             # fixture file format
+│   ├── test-repo/            # bootstrap Node project — base for all arms
+│   ├── F2-cli-medium-subcommand/
+│   └── F1,F3-F9/             # add per Stage 2-3
+│
+├── scripts/
+│   ├── run-suite.sh          # single entry — called by `npx devlyn-cli benchmark`
+│   ├── run-fixture.sh        # one fixture × one arm, self-contained
+│   ├── judge.sh              # Codex blind judge for one fixture
+│   ├── compile-report.py     # aggregates into report.md + summary.json
+│   └── ship-gate.py          # applies thresholds + writes history record
+│
+├── results/<run-id>/         # per-run artifacts (overwritten)
+└── history/
+    ├── runs/                 # append-only, one JSON per run
+    ├── latest.json           # pointer to most recent run
+    └── baselines/shipped.json   # last blessed version, used for regression floor
+```
+## Prerequisites
+- `claude` CLI on PATH (Claude Code, used to invoke each arm).
+- `codex` CLI on PATH (used by the blind judge). Install from https://platform.openai.com/docs/codex.
+- `python3`, `node`, `git`, `timeout`.
+## Adding a fixture
+Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`, `task.txt`, `expected.json`, `NOTES.md`, `setup.sh`. Common workflow:
+1. Copy an existing fixture directory as a template.
+2. Rewrite `metadata.json::intent` with the new task's plain-language intent.
+3. Write `spec.md` (auto-resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
+4. Fill `expected.json` with concrete verification commands and forbidden patterns.
+5. Document purpose + failure mode in `NOTES.md`.
+6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
+## LLM-upgrade resilience
+- **No model hardcoding.** Judge runs `codex exec` without `-m`, inheriting whichever flagship the CLI currently ships. Each run captures `_judge_model` for historical provenance.
+- **Margin-based gates.** Ship thresholds use margin (variant − bare), not absolute score. Both arms improve together as models improve; the harness-added value measured by margin stays meaningful.
+- **Saturation rotation.** When both arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
+## Ship gates (summary — see `RUBRIC.md` for full spec)
+Hard floors (any one fails → block):
+- Zero variant disqualifier (silent catch, fabricated verification, extra deps beyond `max_deps_added`, etc.).
+- `F9-e2e-ideate-to-resolve` must PASS (novice-flow contract).
+- ≥ 7 of 9 gated fixtures have margin ≥ +5.
+- No per-fixture regression worse than −5 vs last shipped baseline.
+Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margin, critical-finding catch-rate regression vs last shipped variant.
+## Running the full suite (real)
+A full real benchmark run takes roughly 2-3 minutes per arm for simple fixtures and up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures × 2 arms can take 30 minutes to 2 hours, depending on the routes taken.
+```bash
+# Smoke run before ship decisions
+npx devlyn-cli benchmark
+# Ship-decision run
+npx devlyn-cli benchmark --n 3 --label v3.7 --bless
+```
+## Dry-run
+`--dry-run` skips model invocation. It still:
+- Prepares each fresh work dir.
+- Writes arm-specific prompts.
+- Commits the baseline.
+- Applies `setup.sh`.
+- Runs verification commands (which will mostly fail since no implementation was added).
+Use it to sanity-check new fixtures or runner changes before burning model tokens.
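
Reviewer note: the blind-judge step in this README ("`A`/`B` randomized per fixture", with the seed recorded per the design document) can be sketched in Python as follows. The actual `judge.sh` is a shell script and its randomization method is not shown in the diff, so treat this purely as an illustration. The names `assign_blind_labels` and `unblind`, and the way the seed is combined with the fixture id, are assumptions.

```python
import random
from typing import Dict

ARMS = ("variant", "bare")


def assign_blind_labels(fixture_id: str, seed: int) -> Dict[str, str]:
    """Map judge-facing labels 'A'/'B' to arms, reproducibly for (fixture_id, seed).

    Assumption: seeding a private RNG with "<fixture>:<seed>" is one simple way
    to make the assignment replayable from a recorded seed.
    """
    rng = random.Random(f"{fixture_id}:{seed}")
    arms = list(ARMS)
    rng.shuffle(arms)
    return {"A": arms[0], "B": arms[1]}


def unblind(winner_label: str, labels: Dict[str, str]) -> str:
    """Translate the judge's winning label back to the arm name."""
    return labels[winner_label]
```

Recording the seed alongside the judge output means a `--judge-only` re-run could rebuild the same A/B mapping instead of reshuffling it.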