devlyn-cli 2.2.2 → 2.3.1 (npm package version diff, Mend)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (220)

package/AGENTS.md CHANGED

@@ -28,7 +28,7 @@ ideate (optional)  ->  resolve  ->  ship
 - `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project` (multi-feature).
 - `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <ref> --spec <path>`. Phases run inline: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh-subagent, findings-only).
-- Three creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
+- `/devlyn:design-ui` — required creative UI exploration surface. Optional companion skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
 Each skill's `SKILL.md` is the source of truth for flags and workflow. Do not duplicate.
@@ -73,7 +73,7 @@ No silent fallbacks.
 - Fallbacks allowed only when widely accepted and harmless (CSS fallback fonts, CDN failover, image placeholders).
 - Silent `catch` blocks are bugs.
 - Logging is not user-visible error handling.
-- The Codex CLI availability downgrade is the one documented exception: emit `engine downgraded: codex-unavailable` and behave exactly like explicit Claude routing.
+- No engine-availability fallback is permitted for pair/risk-probe routes: if required Codex or Claude is unavailable, emit `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` with setup guidance. `--no-pair` and `--no-risk-probes` are explicit user opt-outs, not fallbacks.
 ## Evidence Over Claim

package/CLAUDE.md CHANGED

@@ -24,7 +24,7 @@ The runtime sub-agent contract below (Subtractive-first / Goal-locked / No-worka
 ## Quick Start
-Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has a gated pair-JUDGE product path when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path; the harness silently downgrades to `claude` and emits a banner if the Codex CLI is missing.
+Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED; `/devlyn:design-ui` is also REQUIRED as the creative UI exploration surface. **Both pipeline skills default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has conditional-default pair-JUDGE when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path. If a selected or conditionally required engine is unavailable, the run stops with `BLOCKED:<engine>-unavailable` and setup guidance.
 1. `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project`.
 2. `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <diff> --spec <path>`. Phases: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only).
@@ -123,7 +123,7 @@ No `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scr
 **Permitted exceptions** (explicitly carved out):
 - CSS fallback fonts, CDN failover, image placeholders — widely-accepted best practices.
-- Codex CLI availability downgrade — the one documented silent fallback in this repo. Fires when the resolved engine is `auto` or `codex` (either via skill default or explicit `--engine` flag) and the Codex CLI is absent. Banner `engine downgraded: codex-unavailable` always prints; verdict identical to `--engine claude`. Any other silent fallback in skills code is a bug — file it against the skill that introduced it.
+- No engine-availability fallback is permitted for `/devlyn:resolve` pair/risk-probe routes. If Codex or Claude is required and unavailable, the run stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` plus setup guidance. `--no-pair` / `--no-risk-probes` are explicit user opt-outs, not fallbacks.
 <!-- runtime-principles:section=no-workaround:end -->
 ### Evidence over claim
@@ -141,7 +141,7 @@ A finding without one of these forms is excluded. Vague findings produce vague f
 ## Codex invocation
-When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex` or `--engine auto`), the wrapper-form contract lives in `config/skills/_shared/codex-config.md` (or `.claude/skills/_shared/codex-config.md` once installed). Omit `-m <model>` — the CLI's current flagship is used automatically. MCP is not in the loop. If the Codex CLI is absent the harness silently downgrades to Claude and prints `engine downgraded: codex-unavailable` in the final report.
+When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex`, `--engine auto`, or conditional VERIFY pair/risk-probe routing), the wrapper-form contract lives in `config/skills/_shared/codex-config.md` (or `.claude/skills/_shared/codex-config.md` once installed). Omit `-m <model>` — the CLI's current flagship is used automatically. MCP is not in the loop. If Codex is required and unavailable, stop with `BLOCKED:codex-unavailable` and setup guidance.
 ## Working Mode
@@ -152,7 +152,7 @@ When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine cod
 ## Skill Boundary Policy
-Post iter-0034 Phase 4 cutover (2026-05-04) the runtime product surface is two skills — `/devlyn:resolve` and `/devlyn:ideate`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Three creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
+Post iter-0034 Phase 4 cutover (2026-05-04) the runtime pipeline surface is two skills — `/devlyn:resolve` and `/devlyn:ideate` — plus the required creative UI exploration surface `/devlyn:design-ui`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Optional creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
 Browser validation routes through `_shared/browser-runner.sh` (Chrome MCP → Playwright → curl tier) directly from BUILD_GATE — there is no separate `/devlyn:browser-validate` skill at HEAD.
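
Reviewer sketch (not part of the package): the CLAUDE.md hunks above replace the silent Codex downgrade with a hard stop. The behavior can be illustrated roughly as below. The function name `check_engines`, the injected `is_available` callback, and the idea of probing `PATH` with `shutil.which` are assumptions made for illustration; only the `BLOCKED:<engine>-unavailable` status strings come from the diff.

```python
import shutil
from typing import Callable, Iterable, Optional


def on_path(engine: str) -> bool:
    # Assumption: an engine counts as available when a binary of the same
    # name is on PATH. The real harness may probe differently.
    return shutil.which(engine) is not None


def check_engines(
    required: Iterable[str],
    is_available: Callable[[str], bool] = on_path,
) -> Optional[str]:
    """Return None when every required engine is available.

    Otherwise return the BLOCKED status for the first missing engine.
    There is deliberately no downgrade path: a missing engine stops the run.
    """
    for engine in required:
        if not is_available(engine):
            return f"BLOCKED:{engine}-unavailable"
    return None
```

Under this sketch a caller that passes `--no-pair` would simply leave `codex` out of `required`, which matches the diff's framing of that flag as an explicit opt-out and not a fallback.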

package/README.md CHANGED

@@ -27,13 +27,13 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
 npx devlyn-cli
 ```
-That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
+That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` + `/devlyn:design-ui` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
 ---
 ## How It Works — Two Skills, Full Cycle
-devlyn-cli turns Claude Code into a hands-free development pipeline. The product surface is two skills:
+devlyn-cli turns Claude Code into a hands-free development pipeline. The pipeline surface is two skills, with `/devlyn:design-ui` installed as the required creative UI surface:
 ```
 ideate (optional)  →  resolve  →  ship
@@ -79,11 +79,25 @@ PLAN  →  IMPLEMENT  →  BUILD_GATE  →  CLEANUP  →  VERIFY (fresh subagent
 - **VERIFY** runs in a fresh subagent context with no code-mutation tools — findings only, structurally independent.
 - Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
-Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--perf` (per-phase timing).
+Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--no-pair` (intentional solo VERIFY), `--risk-probes` / `--no-risk-probes`, `--perf` (per-phase timing).
+`--pair-verify` and `--no-pair` are mutually exclusive; using both stops with `BLOCKED:invalid-flags`.
-### Engine selection — Claude solo by default
+Free-form goals that ask for benchmark evidence, pair-evidence, risk-probe
+measurement, `solo<pair` proof, or solo-headroom work must include an
+actionable `solo-headroom hypothesis` naming the visible behavior `solo_claude`
+is expected to miss plus a backticked observable command; the backticked line
+itself must contain `miss` and be framed as the command/observable that exposes it. Without that,
+`/devlyn:resolve` stops with `BLOCKED:solo-headroom-hypothesis-required` and
+points you to `/devlyn:ideate` instead of inventing a weak hypothesis.
+Free-form goals that add or run a new unmeasured benchmark, shadow fixture,
+golden fixture, risk-probe, or pair-evidence candidate must also include
+`solo ceiling avoidance`, mention `solo_claude`, and name the concrete
+difference from rejected or solo-saturated controls such as `S2`-`S6`; without
+that, `/devlyn:resolve` stops with `BLOCKED:solo-ceiling-avoidance-required`.
-`--engine claude` (default) is the canonical surface. Every phase routes to Claude.
+### Engine selection — Claude implementation, conditional pair VERIFY
+`--engine claude` (default) is the canonical implementation surface for PLAN, IMPLEMENT, BUILD_GATE, and CLEANUP. VERIFY/JUDGE conditionally runs pair mode for verify-only runs, high-risk specs, risk probes, mechanical warnings, coverage gaps, or explicit `--pair-verify`.
 `--engine codex` routes IMPLEMENT to Codex; `--engine auto` opts into the experimental dual-engine routing where applicable. Both are research-only at HEAD: iter-0020 closed Codex BUILD/IMPLEMENT below the quality floor on the 9-fixture suite (L2 vs L1 = −3.6, 3/8 gated fixtures cleared the +5 margin floor — release-readiness FAIL); iter-0033g + iter-0034 closed PLAN-pair as research-only with explicit unblock conditions (container/sandbox infra OR production telemetry capturing positive evidence of subagent introspection). Install the Codex CLI (https://platform.openai.com/docs/codex) and pass the flag explicitly to opt in:
@@ -91,49 +105,86 @@ Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-g
 /devlyn:resolve "fix the auth bug" --engine auto   # experimental, research-only
 ```
-If Codex is absent when `--engine auto` or `--engine codex` is requested, the harness silently downgrades to `--engine claude` and emits a banner in the final report.
-<details>
-<summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
-`/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
+If Codex or Claude is absent when explicitly selected or conditionally required, the harness stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` and prints setup guidance. Use `--no-pair` only when intentionally accepting solo VERIFY; use `--no-risk-probes` only when intentionally disabling automatic high-risk probes.
-- **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
-- **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
-- **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
-- **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
-- **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
-- **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
-- **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
-- **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
-- **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
+### Benchmark score runs
-</details>
-<details>
-<summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
+Use the benchmark CLI when a change claims `solo_claude < pair`. The score-focused runners print the run id, startup gate lines, blind-judge score tables, fixture pair margins, average pair margin, wall-time ratio, and failure reasons:
-Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
+```bash
+npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+npx devlyn-cli benchmark recent
+npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
+npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
+npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
+npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
+npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+```
-- **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
-- **`--with-codex` consolidated into `--engine auto`** — auto covers BUILD/FIX + team roles + ideate CHALLENGE critic. Legacy flag still accepted with a graceful handoff. *(Note: post iter-0020 close-out, `--engine auto` is experimental research-only; default is `--engine claude`.)*
-- **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
-- **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
+`benchmark recent` prints a compact, wrap-safe snapshot of the current local
+pair evidence: status counts, pair-lift aggregates, and one card per passing
+pair-evidence fixture. It intentionally avoids wide Markdown tables, so the
+same output stays readable in narrow terminals, PR comments, and release notes.
+`benchmark frontier` also prints a stdout score summary for existing complete pair
+evidence rows, including pair arm, trigger reasons, average/minimum pair margin,
+and wall ratio, plus row-level verdicts even when `--out-json` or `--out-md`
+writes an artifact. Markdown frontier artifacts include a `Triggers` column.
+Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
+and include a Markdown `Hypothesis trigger` column, so strict regenerated
+evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
+`benchmark audit` is the provider-free release/handoff guard: it writes
+`audit.json` with the frontier summary, artifact map, and compact trigger-backed verdict-bearing `pair_evidence_rows`
+(each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`), runs the frontier with
+`--fail-on-unmeasured`, requires at least four fixtures with passing pair evidence,
+revalidates frontier `verdict: PASS`, zero unmeasured candidates, and revalidates `pair_mode: true`,
+the default 5-point pair margin, and 3x pair/solo wall ratio, then
+audits failed headroom results. The audit stdout also prints
+`headroom_rejections=...`, `pair_evidence_quality=...`,
+`pair_trigger_reasons=...`, `pair_evidence_hypotheses=...`, and
+`pair_evidence_hypothesis_triggers=...` handoff rows, plus
+`pair_trigger_historical_aliases=...` when archived evidence includes legacy
+trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
+hypotheses have not yet propagated into trigger reasons, with the rejected-fixture
+coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
+and canonical trigger reason coverage plus row-match status.
+The compact evidence row count must match the frontier evidence count,
+`checks.frontier_stdout` records summary, aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts,
+`checks.headroom_rejections` records child verdict plus unrecorded/unsupported counts,
+`checks.pair_evidence_quality` records the same quality thresholds from the compact rows,
+`checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status,
+`checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts,
+and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details
+so incomplete or low-quality local score artifacts cannot inflate the claim.
+Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
+archived-evidence WARN rows into release-blocking FAIL rows for newly
+regenerated pair evidence.
-</details>
+```bash
+npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
+```
----
+Historical trigger aliases are only reported for archived artifact review; new
+current pair-evidence gates fail historical-only or unknown trigger reasons and
+require at least one canonical `pair_trigger.reasons` entry.
+`benchmark audit-headroom` fails if an active failed headroom fixture is missing
+from both rejected registry and passing pair evidence.
+Headroom runs use the current claim gate: `bare <= 60`, `solo_claude <= 80`,
+and the default 5-point `bare`/`solo_claude` headroom margins before spending a pair arm.
+Add `--dry-run` to either score runner to validate args, fixture ids, minimum
+fixture count, and the replay command without running arms or judges. Dry-runs
+and lint prove wiring only; real score claims must cite the run id and fixture
+ids.
 ## Optional Power-User Skills
-Two creative skills have moved to `optional-skills/` — install them via the interactive installer when you need them.
+Two creative companion skills live in `optional-skills/` — install them via the interactive installer when you need them.
 | Command | Use When |
 |---|---|
 | `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
 | `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
-> Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). These were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
+> Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). Most were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:design-ui` is now installed as a required creative UI surface. Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
 ---
@@ -194,7 +245,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 |---|---|
 | `playwright` | Playwright MCP — powers `/devlyn:resolve` BUILD_GATE browser tier (Chrome MCP → Playwright → curl fallback) |
-> `--engine auto/codex` uses the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex; the harness silently downgrades to `--engine claude` if the CLI is missing.
+> `--engine auto/codex` and conditional VERIFY pair mode use the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex, run the current Codex auth/login flow, verify `codex --version`, then rerun.
 </details>
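
Reviewer sketch (not part of the package): the README hunks above describe a headroom gate (`bare <= 60`, `solo_claude <= 80`) followed by pair-evidence thresholds (a default 5-point pair margin and a 3x pair/solo wall ratio). A minimal way to read those numbers together is shown below. The `FixtureScores` type, both function names, and the use of inclusive comparisons are assumptions. The separate 5-point `bare`/`solo_claude` headroom margins mentioned in the README are left out because the diff does not define them precisely.

```python
from dataclasses import dataclass


@dataclass
class FixtureScores:
    bare: float          # L0 blind-judge score
    solo_claude: float   # L1 blind-judge score
    pair: float          # L2 pair-arm blind-judge score
    pair_wall_s: float   # pair-arm wall time in seconds
    solo_wall_s: float   # solo-arm wall time in seconds


def has_headroom(s: FixtureScores, bare_max: float = 60, solo_max: float = 80) -> bool:
    # The fixture must leave room above the bare and solo arms before a
    # pair arm is worth spending.
    return s.bare <= bare_max and s.solo_claude <= solo_max


def pair_evidence_passes(
    s: FixtureScores,
    min_margin: float = 5,
    max_wall_ratio: float = 3,
) -> bool:
    # Pair must beat solo by the margin without exceeding the wall-time budget.
    # Multiplying instead of dividing avoids a zero-division on solo_wall_s.
    margin_ok = (s.pair - s.solo_claude) >= min_margin
    wall_ok = s.pair_wall_s <= max_wall_ratio * s.solo_wall_s
    return has_headroom(s) and margin_ok and wall_ok
```

For instance, scores of 55 / 75 / 82 with a 2.5x wall ratio would pass under these assumptions, while the same scores with a pair score of 78 would fail on margin alone.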

package/benchmark/auto-resolve/BENCHMARK-DESIGN.md CHANGED

@@ -2,12 +2,18 @@
 **Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
-**Purpose.** Replace ad-hoc A/B benchmarking with a permanent, comprehensive,
+**Purpose.** Replace ad-hoc harness benchmarking with a permanent, comprehensive,
 one-command suite that gates every future harness change with a ship/rollback
 decision. Any prompt edit, phase reorder, new native skill, or model upgrade
 can be validated by running the suite and reading the numbers.
-**Arm structure (current vs planned).** Today the suite runs `variant` (L2: Claude + Codex pair) vs `bare` (L0). The L1 (solo harness on a single LLM) arm is queued for iter-0020 — until then the benchmark cannot directly verify the L1 contract, only the L0 ↔ L2 delta. Single-LLM users (Opus alone, GPT-5.5 alone) are first-class per the North Star, so this gap is a release-blocker for them, not a future enhancement.
+**Arm structure.** Current full-pipeline evidence uses three arms: `bare` (L0),
+`solo_claude` (L1 solo harness), and an L2 pair arm (`variant` in the smoke
+suite, or a focused pair arm such as `l2_risk_probes` in pair-candidate runs).
+Pair claims are headroom-gated: counted fixtures must leave room above solo
+(`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins),
+the pair arm must actually run, and blind judging must show pair above solo by
+the configured margin.
 **Non-goals.** Publishable-research statistical rigor. Not a regression test
 library for the product code — those live elsewhere. Not a substitute for
@@ -20,7 +26,7 @@ production telemetry — just enough signal for ship decisions.
 1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
    verdict. No manual fixture setup.
 2. **Novice-proof.** The suite exercises the same paths a first-time user
-   hits — including an end-to-end `ideate → auto-resolve → preflight` fixture.
+   hits — including an end-to-end `ideate → resolve` fixture.
 3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
    stable; scores and margins float up as models improve. Nothing is
    hardcoded to a specific model version.
@@ -56,10 +62,11 @@ benchmark/auto-resolve/
 │   ├── F6-dep-audit-native-module/
 │   ├── F7-out-of-scope-trap/
 │   ├── F8-known-limit-ambiguous/
-│   └── F9-e2e-ideate-to-resolve/
+│   ├── F9-e2e-ideate-to-resolve/
+│   └── F10+ extensions for headroom, full-pipeline pair, and frozen VERIFY
 │
 ├── scripts/
-│   ├── run-suite.sh          # single entry — runs all fixtures × 2 arms + judge + report
+│   ├── run-suite.sh          # smoke entry — runs fixture arms + judge + report
 │   ├── run-fixture.sh        # one fixture, one arm
 │   ├── judge.sh              # Codex blind judge (model-agnostic)
 │   ├── compile-report.py     # aggregate into report.md + summary.json
@@ -68,8 +75,9 @@ benchmark/auto-resolve/
 ├── results/                  # per-run artifacts (overwritten)
 │   └── <run-id>/
 │       ├── <fixture>/
-│       │   ├── variant/{input.md, transcript.txt, diff.patch, verify.json, timing.json}
-│       │   └── bare/{same}
+│       │   ├── bare/{input.md, transcript.txt, diff.patch, result.json}
+│       │   ├── solo_claude/{same}
+│       │   └── variant or l2_risk_probes/{same}
 │       ├── <fixture>/judge.json
 │       ├── report.md
 │       └── summary.json
@@ -91,7 +99,7 @@ Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
 | File | Purpose |
 |------|---------|
 | `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
-| `spec.md` | pipeline-arm input (auto-resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
+| `spec.md` | pipeline-arm input (resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
 | `task.txt` | bare-arm input (same intent, natural-language framing) |
 | `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
 | `NOTES.md` | why this fixture exists, the specific failure mode it tests |
@@ -103,9 +111,13 @@ consistent.
 ---
-## The 9 Fixtures
+## Core Fixtures And Extensions
-Category coverage matrix (rows = concerns, columns = fixtures):
+The original v3.6 matrix covered F1-F9. Later fixtures extend the same schema
+for headroom, full-pipeline pair, and frozen VERIFY evidence.
+Category coverage matrix for the original core set (rows = concerns, columns =
+fixtures):
 | Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
 |---------|---------|--------|-----------|--------|------|-----|
@@ -120,9 +132,9 @@ Category coverage matrix (rows = concerns, columns = fixtures):
 | F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
 **F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
-Input is a vague idea; pipeline arm runs ideate → auto-resolve on every
-generated spec → preflight; bare arm runs a direct prompt. Judge compares
-the final usable artifact set (code + docs + roadmap state).
+Input is a vague idea; the pipeline path turns it into a spec with ideate and
+then resolves that spec. Bare arm runs a direct prompt. Judge compares the final
+usable artifact set.
 ---
@@ -132,7 +144,6 @@ the final usable artifact set (code + docs + roadmap state).
 ```bash
 npx devlyn-cli benchmark            # n=1 smoke, all fixtures
-npx devlyn-cli benchmark --n 3      # higher confidence for ship decisions
 npx devlyn-cli benchmark F2 F5      # specific fixtures only
 npx devlyn-cli benchmark --judge-only --run-id <id>   # re-judge without re-running
 ```
@@ -143,20 +154,21 @@ Output on completion:
 Benchmark Suite Run — 2026-04-23T12:00Z (v3.6)
 Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
-Fixture                         Variant   Bare   Margin   Verdict
-F1-cli-trivial-flag                 95     88     +7      PASS
-F2-cli-medium-subcommand            92     81    +11      PASS
-F3-backend-contract-risk            89     72    +17      PASS
-F4-web-browser-design               87     79     +8      PASS
-F5-fix-loop-red-green               91     65    +26      PASS
-F6-dep-audit-native-module          88     70    +18      PASS
-F7-out-of-scope-trap                94     73    +21      PASS
-F8-known-limit-ambiguous            78     79     -1      EXPECTED (known-limit)
-F9-e2e-ideate-to-resolve          90     68    +22      PASS
+Fixture                         variant (L2)  solo_claude (L1)  bare (L0)  variant-solo_claude  Verdict
+F1-cli-trivial-flag             95            92                88         +3                   PASS
+F2-cli-medium-subcommand        92            86                81         +6                   PASS
+F3-backend-contract-risk        89            80                72         +9                   PASS
+F4-web-browser-design           87            83                79         +4                   PASS
+F5-fix-loop-red-green           91            78                65         +13                  PASS
+F6-dep-audit-native-module      88            82                70         +6                   PASS
+F7-out-of-scope-trap            94            85                73         +9                   PASS
+F8-known-limit-ambiguous        78            79                79         -1                   EXPECTED (known-limit)
+F9-e2e-ideate-to-resolve        90            84                68         +6                   PASS
 ---------------------------------------------------------
-Suite average variant score: 89.3
-Suite average bare score:    75.0
-Suite average margin:       +14.3  (ship floor: +5)
+Suite average variant (L2) score:       89.3
+Suite average solo_claude (L1) score:   83.2
+Suite average bare (L0) score:          75.0
+Suite average variant-solo_claude margin: +6.1  (pair-evidence floor: +5 on eligible fixtures)
 Hard-floor violations:        0
 Regression vs shipped:       n/a (first run of v3.6)
 SHIP-GATE VERDICT: ✅ PASS
@@ -167,7 +179,7 @@ SHIP-GATE VERDICT: ✅ PASS
 `run-suite.sh`:
 1. Generate run-id `<ISO>-<sha>-<branch>`
-2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
+2. For each fixture × each arm (`variant`/L2, `solo_claude`/L1, `bare`/L0): parallelizable via `xargs -P`
    - `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
 3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
 4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
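As a rough illustration of the aggregation in step 4, the suite averages in the sample report can be reproduced from the per-fixture scores. This is a minimal sketch, not the actual `compile-report.py`: the in-memory score table and the `summarize` helper are hypothetical, and the real script presumably reads each fixture's `judge.json`, whose schema is not shown here.

```python
from statistics import mean

# Scores copied from the sample report: (variant L2, solo_claude L1, bare L0).
# The dict layout is an assumption made for this sketch only.
SAMPLE_SCORES = {
    "F1": (95, 92, 88),
    "F2": (92, 86, 81),
    "F3": (89, 80, 72),
    "F4": (87, 83, 79),
    "F5": (91, 78, 65),
    "F6": (88, 82, 70),
    "F7": (94, 85, 73),
    "F8": (78, 79, 79),
    "F9": (90, 84, 68),
}


def summarize(scores):
    """Return suite averages and the variant-solo_claude margin, rounded to 0.1."""
    rows = list(scores.values())
    return {
        "variant": round(mean(v for v, _, _ in rows), 1),
        "solo_claude": round(mean(s for _, s, _ in rows), 1),
        "bare": round(mean(b for _, _, b in rows), 1),
        # Margin is averaged per fixture (L2 - L1), matching the table column.
        "variant_minus_solo_claude": round(mean(v - s for v, s, _ in rows), 1),
    }


if __name__ == "__main__":
    for name, value in summarize(SAMPLE_SCORES).items():
        print(f"{name}: {value}")
```

The sketch averages over all nine fixtures; whether the real report restricts the margin to "eligible" fixtures is not stated in this section.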
@@ -179,17 +191,17 @@ SHIP-GATE VERDICT: ✅ PASS
 - Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
 - Applies `setup.sh` if present
-- Copies `spec.md` (variant) or `task.txt` (bare) as the prompt
-- Invokes Claude/auto-resolve (variant) or bare Claude (bare) via isolated Agent
+- Copies `spec.md` for `variant`/`solo_claude` or `task.txt` for `bare` as the prompt
+- Invokes `/devlyn:resolve --spec` for `variant`, `/devlyn:resolve --spec --engine claude --no-pair --no-risk-probes` for `solo_claude`, or bare Claude for `bare` via isolated Agent
 - Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
 - Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
 - Writes `result.json` with aggregate: exit code, duration, files changed, verification score
 ### `judge.sh` contract
-- Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
+- Reads `results/<run-id>/<fixture>/{variant,solo_claude,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
 - Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
-- Invokes `codex exec` (current flagship — no model hardcode) with RUBRIC.md
+- Invokes isolated Codex (current flagship — no model hardcode) with RUBRIC.md
 - Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
 - Idempotent: re-running overwrites the same `judge.json`
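The blind-labelling step in the contract above can be sketched as a seeded shuffle, so that the recorded seed is enough to recover which arm sat behind which label. This is an assumption about how `judge.sh` might do it, not its actual code; `blind_labels` is a hypothetical helper, and extending the labels past A and B for a third arm is likewise an assumption.

```python
import random


def blind_labels(arms, seed):
    """Map neutral labels (A, B, ...) to arms in a seed-determined random order."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    shuffled = list(arms)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))


if __name__ == "__main__":
    seed = 12345  # hypothetical; the real seed would be recorded alongside the judgment
    mapping = blind_labels(["variant", "solo_claude", "bare"], seed)
    # Re-running with the recorded seed reproduces the same assignment.
    assert mapping == blind_labels(["variant", "solo_claude", "bare"], seed)
    print(mapping)
```

Storing the seed rather than the mapping keeps the prompt free of arm names while still letting a later `--judge-only` run be audited.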
@@ -199,23 +211,27 @@ SHIP-GATE VERDICT: ✅ PASS
 Three mechanisms:
-1. **No hardcoded models.** Judge invocation is `codex exec` without `-m`; it
-   inherits whichever flagship the CLI currently ships. Same for agents —
-   they run against whatever Claude Code session-model the caller has.
-   Model provenance is captured in `result.json` per run.
+1. **No hardcoded models.** Judge invocation omits `-m`, so it inherits
+   whichever flagship the CLI currently ships. The blind judge is isolated from
+   user config/rules/hooks so local agent instructions cannot contaminate the
+   judgment. Same for agents — they run against whatever Claude Code
+   session-model the caller has. Model provenance is captured in `result.json`
+   per run.
 2. **Margin as primary signal, absolute score as secondary.** When models
-   improve, both arms get better. Margin (variant − bare) is model-invariant
-   — it measures **what the harness adds beyond bare**. Ship gates are
+   improve, all arms tend to get better. Pairwise margins remain the stable
+   signal: `solo_claude`-`bare` (L1-L0) measures solo harness value,
+   `variant`-`solo_claude` (L2-L1) measures pair value on eligible fixtures, and
+   `variant`-`bare` (L2-L0) remains the legacy suite signal. Ship gates are
    defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
    score.
 3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
    100 quickly as models improve — that's fine, it still catches catastrophic
    regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
-   model won't 100-zero bare. If any fixture saturates (both arms > 95 for
-   two consecutive versions), we replace it with a harder one and document
-   the swap in `history/runs/<ts>-fixture-rotation.json`.
+   model won't 100-zero bare. If any fixture saturates (all compared gated arms
+   > 95 for two consecutive versions), we replace it with a harder one and
+   document the swap in `history/runs/<ts>-fixture-rotation.json`.
 ---
@@ -225,14 +241,15 @@ Hard floors (any single failure blocks ship):
 - **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
 - **Variant may not lose any fixture by more than −5** versus previous shipped version (per-fixture regression floor).
-- **At least 7 of 9 fixtures** must have margin ≥ +5 (suite coverage).
+- **At least 7 gated, headroom-available fixtures** must have margin ≥ +5
+  (suite coverage).
 - **F9 (E2E) must PASS** — novice-flow contract.
 Soft gates (trigger rollback discussion):
 - Suite average margin drop > 3 vs last shipped.
 - Any fixture with margin ≤ 0 that previously had margin > +5.
-- Critical-finding catch-rate decrease vs last shipped variant (not vs bare — bare is the opponent, not the regression baseline).
+- Critical-finding catch-rate decrease vs the last shipped comparable arm.
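The hard floors listed above could be checked mechanically along these lines. This is a hedged sketch only: `hard_floor_failures`, its parameters, and the input shapes are hypothetical, the caller is assumed to pass whichever margin the gate is defined on, and the "gated, headroom-available" filtering is assumed to have happened before the call.

```python
def hard_floor_failures(
    margins,
    variant_delta_vs_shipped,
    disqualified,
    e2e_passed,
    *,
    min_passing=7,
    margin_floor=5,
    regression_floor=-5,
):
    """Return human-readable hard-floor violations; an empty list means none."""
    failures = []
    # Any judge disqualifier (silent catch, fabricated verification, ...) blocks ship.
    for fixture in sorted(disqualified):
        failures.append(f"{fixture}: judge disqualifier")
    # Per-fixture regression floor: variant may not drop by more than 5 points.
    for fixture, delta in sorted(variant_delta_vs_shipped.items()):
        if delta < regression_floor:
            failures.append(f"{fixture}: variant regressed {delta} vs shipped")
    # Suite coverage: enough fixtures must clear the margin floor.
    passing = [f for f, m in margins.items() if m >= margin_floor]
    if len(passing) < min_passing:
        failures.append(
            f"only {len(passing)} fixtures at margin >= +{margin_floor} "
            f"(need {min_passing})"
        )
    # The E2E fixture (F9) must pass on its own.
    if not e2e_passed:
        failures.append("F9: E2E fixture did not pass")
    return failures
```

A caller might treat any non-empty result as a blocked ship and print the list verbatim, which keeps the failure user-visible rather than silently logged.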
 Known-limit exception:
@@ -264,7 +281,7 @@ adding anything.
    standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
    run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
    entry, which shells out to the script.
-2. Parallel run safety — can we run 9 fixtures × 2 arms concurrently without
+2. Parallel run safety — can we run the selected fixture set × 3 arms concurrently without
    rate-limit / lockfile conflicts? **Proposal**: default sequential with
    `--parallel N` flag. Default `N=1` for safety; the user can opt in.
 3. Token accounting — Claude Code doesn't expose subagent totals reliably.