npm - devlyn-cli - Versions diffs - 1.15.0 → 2.1.0 - Mend

devlyn-cli 1.15.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (158)

package/benchmark/auto-resolve/BENCHMARK-DESIGN.md ADDED

@@ -0,0 +1,272 @@
+# Benchmark Suite Design — v1
+**Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
+**Purpose.** Replace ad-hoc A/B benchmarking with a permanent, comprehensive,
+one-command suite that gates every future harness change with a ship/rollback
+decision. Any prompt edit, phase reorder, new native skill, or model upgrade
+can be validated by running the suite and reading the numbers.
+**Arm structure (current vs planned).** Today the suite runs `variant` (L2: Claude + Codex pair) vs `bare` (L0). The L1 (solo harness on a single LLM) arm is queued for iter-0020 — until then the benchmark cannot directly verify the L1 contract, only the L0 ↔ L2 delta. Single-LLM users (Opus alone, GPT-5.5 alone) are first-class per the North Star, so this gap is a release-blocker for them, not a future enhancement.
+**Non-goals.** Publishable-research statistical rigor. Not a regression test
+library for the product code — those live elsewhere. Not a substitute for
+production telemetry — just enough signal for ship decisions.
+---
+## Principles
+1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
+   verdict. No manual fixture setup.
+2. **Novice-proof.** The suite exercises the same paths a first-time user
+   hits — including an end-to-end `ideate → auto-resolve → preflight` fixture.
+3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
+   stable; scores and margins float up as models improve. Nothing is
+   hardcoded to a specific model version.
+4. **Karpathy.** No fixture earns its place unless it tests a distinct
+   failure mode. Tooling stays boring. History plumbing is simple.
+5. **Ship gate is numbers, not vibes.** Concrete thresholds in RUBRIC.md.
+---
+## Directory Layout
+```
+benchmark/auto-resolve/
+├── BENCHMARK-DESIGN.md       # this file
+├── README.md                 # how to run, interpret, extend
+├── RUBRIC.md                 # stable judge rubric + ship gates
+│
+├── fixtures/
+│   ├── SCHEMA.md             # fixture file format
+│   ├── test-repo/            # bootstrap Node project (shared base)
+│   │   ├── bin/cli.js
+│   │   ├── server/index.js
+│   │   ├── web/page.html
+│   │   ├── tests/
+│   │   ├── playwright.config.js
+│   │   └── package.json
+│   │
+│   ├── F1-cli-trivial-flag/
+│   ├── F2-cli-medium-subcommand/
+│   ├── F3-backend-contract-risk/
+│   ├── F4-web-browser-design/
+│   ├── F5-fix-loop-red-green/
+│   ├── F6-dep-audit-native-module/
+│   ├── F7-out-of-scope-trap/
+│   ├── F8-known-limit-ambiguous/
+│   └── F9-e2e-ideate-to-resolve/
+│
+├── scripts/
+│   ├── run-suite.sh          # single entry — runs all fixtures × 2 arms + judge + report
+│   ├── run-fixture.sh        # one fixture, one arm
+│   ├── judge.sh              # Codex blind judge (model-agnostic)
+│   ├── compile-report.py     # aggregate into report.md + summary.json
+│   └── ship-gate.py          # apply thresholds, return ship/rollback verdict
+│
+├── results/                  # per-run artifacts (overwritten)
+│   └── <run-id>/
+│       ├── <fixture>/
+│       │   ├── variant/{input.md, transcript.txt, diff.patch, verify.json, timing.json}
+│       │   └── bare/{same}
+│       ├── <fixture>/judge.json
+│       ├── report.md
+│       └── summary.json
+│
+└── history/
+    ├── runs/                 # append-only immutable records
+    │   └── 2026-04-23T120000Z-v3.6.json
+    ├── latest.json           # pointer to most recent run
+    └── baselines/
+        └── shipped.json      # last blessed version, used for regression check
+```
+---
+## Fixture Schema
+Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
+| File | Purpose |
+|------|---------|
+| `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
+| `spec.md` | pipeline-arm input (auto-resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
+| `task.txt` | bare-arm input (same intent, natural-language framing) |
+| `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
+| `NOTES.md` | why this fixture exists, the specific failure mode it tests |
+| `setup.sh` | deterministic starting state — applies to a fresh copy of `test-repo/` |
+**Drift prevention**: `spec.md` and `task.txt` both derive from the same
+`intent` block in `metadata.json`. A lint step in CI verifies they stay
+consistent.
+---
+## The 9 Fixtures
+Category coverage matrix (rows = concerns, columns = fixtures):
+| Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
+|---------|---------|--------|-----------|--------|------|-----|
+| F1-cli-trivial-flag | ✓ | | | | | |
+| F2-cli-medium-subcommand | | ✓ | | | | |
+| F3-backend-contract-risk | | | ✓ | | | |
+| F4-web-browser-design | | | | ✓ (browser-validate) | | |
+| F5-fix-loop-red-green | | | | ✓ (FIX LOOP) | | |
+| F6-dep-audit-native-module | | | | ✓ (CRITIC security dep audit) | | |
+| F7-out-of-scope-trap | | | | ✓ (scope discipline) | | |
+| F8-known-limit-ambiguous | | | | | ✓ (documents where pipeline may lose) | |
+| F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
+**F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
+Input is a vague idea; pipeline arm runs ideate → auto-resolve on every
+generated spec → preflight; bare arm runs a direct prompt. Judge compares
+the final usable artifact set (code + docs + roadmap state).
+---
+## Single-Command Invocation
+### User experience
+```bash
+npx devlyn-cli benchmark            # n=1 smoke, all fixtures
+npx devlyn-cli benchmark --n 3      # higher confidence for ship decisions
+npx devlyn-cli benchmark F2 F5      # specific fixtures only
+npx devlyn-cli benchmark --judge-only --run-id <id>   # re-judge without re-running
+```
+Output on completion:
+```
+Benchmark Suite Run — 2026-04-23T12:00Z (v3.6)
+Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
+Fixture                         Variant   Bare   Margin   Verdict
+F1-cli-trivial-flag                 95     88     +7      PASS
+F2-cli-medium-subcommand            92     81    +11      PASS
+F3-backend-contract-risk            89     72    +17      PASS
+F4-web-browser-design               87     79     +8      PASS
+F5-fix-loop-red-green               91     65    +26      PASS
+F6-dep-audit-native-module          88     70    +18      PASS
+F7-out-of-scope-trap                94     73    +21      PASS
+F8-known-limit-ambiguous            78     79     -1      EXPECTED (known-limit)
+F9-e2e-ideate-to-resolve            90     68    +22      PASS
+---------------------------------------------------------
+Suite average variant score: 89.3
+Suite average bare score:    75.0
+Suite average margin:       +14.3  (ship floor: +5)
+Hard-floor violations:        0
+Regression vs shipped:       n/a (first run of v3.6)
+SHIP-GATE VERDICT: ✅ PASS
+```
+### Runner orchestration
+`run-suite.sh`:
+1. Generate run-id `<ISO>-<sha>-<branch>`
+2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
+   - `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
+3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
+4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
+5. `ship-gate.py <run-id>` → exit 0 (PASS) / 1 (FAIL). Prints verdict to stdout.
+6. If PASS and `--bless` flag: copy `summary.json` → `history/baselines/shipped.json`
+7. Always: append `history/runs/<run-id>.json` + update `latest.json`
+### `run-fixture.sh` contract
+- Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
+- Applies `setup.sh` if present
+- Copies `spec.md` (variant) or `task.txt` (bare) as the prompt
+- Invokes Claude/auto-resolve (variant) or bare Claude (bare) via isolated Agent
+- Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
+- Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
+- Writes `result.json` with aggregate: exit code, duration, files changed, verification score
+### `judge.sh` contract
+- Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
+- Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
+- Invokes `codex exec` (current flagship — no model hardcode) with RUBRIC.md
+- Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
+- Idempotent: re-running overwrites the same `judge.json`
+---
+## LLM-Upgrade Resilience
+Three mechanisms:
+1. **No hardcoded models.** Judge invocation is `codex exec` without `-m`; it
+   inherits whichever flagship the CLI currently ships. Same for agents —
+   they run against whatever Claude Code session-model the caller has.
+   Model provenance is captured in `result.json` per run.
+2. **Margin as primary signal, absolute score as secondary.** When models
+   improve, both arms get better. Margin (variant − bare) is model-invariant
+   — it measures **what the harness adds beyond bare**. Ship gates are
+   defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
+   score.
+3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
+   100 quickly as models improve — that's fine, it still catches catastrophic
+   regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
+   model won't 100-zero bare. If any fixture saturates (both arms > 95 for
+   two consecutive versions), we replace it with a harder one and document
+   the swap in `history/runs/<ts>-fixture-rotation.json`.
+---
+## Ship Gates (from RUBRIC.md)
+Hard floors (any single failure blocks ship):
+- **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
+- **Variant must not regress on any fixture by more than 5 points** versus the previously shipped version (per-fixture regression floor).
+- **At least 7 of 9 fixtures** must have margin ≥ +5 (suite coverage).
+- **F9 (E2E) must PASS** — novice-flow contract.
+Soft gates (trigger rollback discussion):
+- Suite average margin drop > 3 vs last shipped.
+- Any fixture with margin ≤ 0 that previously had margin > +5.
+- Critical-finding catch-rate decrease vs last shipped variant (not vs bare — bare is the opponent, not the regression baseline).
+Known-limit exception:
+- F8 is explicitly allowed to tie or lose (margin in [-3, +3]). Its job is to
+  document honesty, not to beat bare.
+---
+## Karpathy Check
+Where over-engineering lurks:
+- ❌ **Automatic history mutation during development.** Add append-only
+  history AFTER the suite format stabilizes (one version after initial ship).
+- ❌ **Statistical tooling beyond mean/median/margin.** n=1-3 doesn't need
+  t-tests.
+- ❌ **Auto-generated fixture cards / dashboards.** Plain `report.md` is enough.
+- ✅ **Keep scripts under 100 lines each** unless they're doing concrete,
+  repeated work the user would do by hand.
+If the suite tooling grows past ~800 total lines, prune aggressively before
+adding anything.
+---
+## Open Questions (to be answered before first full ship-gate run)
+1. Where does `benchmark` subcommand live? Inside `bin/devlyn.js` or as
+   standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
+   run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
+   entry, which shells out to the script.
+2. Parallel run safety — can we run 9 fixtures × 2 arms concurrently without
+   rate-limit / lockfile conflicts? **Proposal**: default sequential with
+   `--parallel N` flag. Default `N=1` for safety; the user can opt in.
+3. Token accounting — Claude Code doesn't expose subagent totals reliably.
+   **Proposal**: capture wall time as primary efficiency metric; token
+   estimate as best-effort secondary. Do not gate ship on token math alone.
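
Note on the `judge.sh` contract in the file above: the blind-labelling step (arms labelled A and B at random per fixture, seed recorded) and the later un-blinding into a margin could look roughly like the following. This is a minimal sketch in Python, not the package's implementation; the function names `assign_blind_labels` and `unblind`, the seeding scheme, and the returned dictionary shapes are all assumptions made for illustration.

```python
import random


def assign_blind_labels(fixture_id: str, seed: int) -> dict:
    """Map blind labels 'A'/'B' to arms, reproducibly from a recorded seed.

    Hypothetical helper: the real judge.sh may seed and record differently.
    """
    # Seeding with a string keeps the shuffle reproducible for a given
    # (fixture, seed) pair, so a re-judge can rebuild the same labelling.
    rng = random.Random(f"{fixture_id}:{seed}")
    arms = ["variant", "bare"]
    rng.shuffle(arms)
    return {"seed": seed, "A": arms[0], "B": arms[1]}


def unblind(judge_scores: dict, labels: dict) -> dict:
    """Translate {'A': score, 'B': score} into per-arm scores plus margin."""
    by_arm = {labels[label]: score for label, score in judge_scores.items()}
    # Margin is defined in the design doc as variant minus bare.
    return {**by_arm, "margin": by_arm["variant"] - by_arm["bare"]}
```

Recording the seed next to the scores is what lets a `--judge-only` re-run be audited: the same seed yields the same A/B assignment.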

package/benchmark/auto-resolve/README.md ADDED

@@ -0,0 +1,114 @@
+# devlyn-cli auto-resolve Benchmark Suite
+One-command A/B benchmark that gates every harness change with a ship/rollback decision.
+## Quick start
+```bash
+npx devlyn-cli benchmark                 # n=1 smoke, all fixtures × 2 arms, judge, report, ship-gate
+npx devlyn-cli benchmark --n 3           # higher confidence for ship decisions
+npx devlyn-cli benchmark F2              # specific fixture only
+npx devlyn-cli benchmark --dry-run       # validate suite wiring without model invocation
+npx devlyn-cli benchmark --bless         # if ship-gate PASSes, promote this run as the shipped baseline
+npx devlyn-cli benchmark --judge-only --run-id <ID>   # re-judge an existing run's artifacts
+```
+Exit code 0 = PASS, 1 = FAIL.
+## What it does
+1. For every fixture × arm (`variant` / `bare`):
+   - Prepare a fresh temp copy of `fixtures/test-repo/`.
+   - Commit baseline + apply `setup.sh` + commit bench scaffolding.
+   - Invoke the arm via an isolated `claude -p` subprocess.
+   - Capture `diff.patch`, `transcript.txt`, `timing.json`, run `expected.json::verification_commands`.
+2. For every fixture, invoke `codex exec` as a blind judge (`A`/`B` randomized per fixture) using the 4-axis rubric in `RUBRIC.md`.
+3. Aggregate into `results/<run-id>/report.md` + `summary.json`.
+4. Apply ship-gate thresholds (`scripts/ship-gate.py`). Print verdict.
+5. Append immutable record to `history/runs/<run-id>.json`.
+## Directory layout
+```
+benchmark/auto-resolve/
+├── BENCHMARK-DESIGN.md       # full design rationale
+├── README.md                 # this file
+├── RUBRIC.md                 # 4-axis scoring + ship gates
+│
+├── fixtures/
+│   ├── SCHEMA.md             # fixture file format
+│   ├── test-repo/            # bootstrap Node project — base for all arms
+│   ├── F2-cli-medium-subcommand/
+│   └── F1,F3-F9/             # add per Stage 2-3
+│
+├── scripts/
+│   ├── run-suite.sh          # single entry — called by `npx devlyn-cli benchmark`
+│   ├── run-fixture.sh        # one fixture × one arm, self-contained
+│   ├── judge.sh              # Codex blind judge for one fixture
+│   ├── compile-report.py     # aggregates into report.md + summary.json
+│   └── ship-gate.py          # applies thresholds + writes history record
+│
+├── results/<run-id>/         # per-run artifacts (overwritten)
+└── history/
+    ├── runs/                 # append-only, one JSON per run
+    ├── latest.json           # pointer to most recent run
+    └── baselines/shipped.json   # last blessed version, used for regression floor
+```
+## Prerequisites
+- `claude` CLI on PATH (Claude Code, used to invoke each arm).
+- `codex` CLI on PATH (used by the blind judge). Install from https://platform.openai.com/docs/codex.
+- `python3`, `node`, `git`, `timeout`.
+## Adding a fixture
+Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`, `task.txt`, `expected.json`, `NOTES.md`, `setup.sh`. Common workflow:
+1. Copy an existing fixture directory as a template.
+2. Rewrite `metadata.json::intent` with the new task's plain-language intent.
+3. Write `spec.md` (auto-resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
+4. Fill `expected.json` with concrete verification commands and forbidden patterns.
+5. Document purpose + failure mode in `NOTES.md`.
+6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
+## LLM-upgrade resilience
+- **No model hardcoding.** Judge runs `codex exec` without `-m`, inheriting whichever flagship the CLI currently ships. Each run captures `_judge_model` for historical provenance.
+- **Margin-based gates.** Ship thresholds use margin (variant − bare), not absolute score. Both arms improve together as models improve; the harness-added value measured by margin stays meaningful.
+- **Saturation rotation.** When both arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
+## Ship gates (summary — see `RUBRIC.md` for full spec)
+Hard floors (any one fails → block):
+- Zero variant disqualifier (silent catch, fabricated verification, extra deps beyond `max_deps_added`, etc.).
+- `F9-e2e-ideate-to-resolve` must PASS (novice-flow contract).
+- ≥ 7 of 9 gated fixtures have margin ≥ +5.
+- No per-fixture regression worse than −5 vs last shipped baseline.
+Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margin, critical-finding catch-rate regression vs last shipped variant.
+## Running the full suite (real)
+A full real benchmark takes roughly 2-3 minutes per arm for simple fixtures and up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures × 2 arms can take 30 min to 2 hrs depending on the routes taken.
+```bash
+# Smoke run before ship decisions
+npx devlyn-cli benchmark
+# Ship-decision run
+npx devlyn-cli benchmark --n 3 --label v3.7 --bless
+```
+## Dry-run
+`--dry-run` skips model invocation. It still:
+- Prepares each fresh work dir.
+- Writes arm-specific prompts.
+- Commits the baseline.
+- Applies `setup.sh`.
+- Runs verification commands (which will mostly fail since no implementation was added).
+Use it to sanity-check new fixtures or runner changes before burning model tokens.
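
Note on step 3 of "What it does" in the README above: the aggregation into suite-level numbers can be sketched as below. The scores are the illustrative ones from the sample output in BENCHMARK-DESIGN.md; the function name `summarize` and the shape of its return value are assumptions, and `compile-report.py` may well be organised differently.

```python
# (variant_score, bare_score) per fixture, copied from the sample output
# table in BENCHMARK-DESIGN.md. These are illustrative, not measured here.
SAMPLE_SCORES = {
    "F1-cli-trivial-flag": (95, 88),
    "F2-cli-medium-subcommand": (92, 81),
    "F3-backend-contract-risk": (89, 72),
    "F4-web-browser-design": (87, 79),
    "F5-fix-loop-red-green": (91, 65),
    "F6-dep-audit-native-module": (88, 70),
    "F7-out-of-scope-trap": (94, 73),
    "F8-known-limit-ambiguous": (78, 79),
    "F9-e2e-ideate-to-resolve": (90, 68),
}


def _avg(values) -> float:
    """Arithmetic mean rounded to one decimal, as shown in the report."""
    values = list(values)
    return round(sum(values) / len(values), 1)


def summarize(scores: dict) -> dict:
    """Compute per-fixture margins and suite averages (hypothetical helper)."""
    margins = {fid: variant - bare for fid, (variant, bare) in scores.items()}
    return {
        "fixtures_run": len(scores),
        "variant_avg": _avg(v for v, _ in scores.values()),
        "bare_avg": _avg(b for _, b in scores.values()),
        "margin_avg": _avg(margins.values()),
        "margins": margins,
    }
```

Run on `SAMPLE_SCORES`, this reproduces the averages printed in the sample report (89.3, 75.0, +14.3).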

package/benchmark/auto-resolve/RUBRIC.md ADDED

@@ -0,0 +1,162 @@
+# Benchmark Judge Rubric
+Stable across model upgrades. This file is the single source of truth for how
+arms are scored and how ship gates evaluate a run. Do not change the rubric
+during a benchmarking window — changing it invalidates comparability with
+prior `history/runs/`.
+**Outer goal lives in [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md).** The release-decision layer (L0 / L1 / L2 contracts, wall-time efficiency, pair-cost justification) sits on top of the per-arm scoring rules below. When NORTH-STAR.md adds a release-gate number that this file did not have, the new number applies — open a doc-fix iter to mirror it here.
+## Scoring — 4 axes, 25 points each, 100 total
+The blind judge scores both arms on identical axes without knowing which is
+variant vs. bare.
+### Axis 1 — Spec Compliance (0-25)
+Does this implementation satisfy every Requirements bullet in `spec.md`?
+Does every Verification command behave as the spec states?
+- **25** — All Requirements satisfied. All Verification commands would pass.
+- **19-24** — 90%+ coverage, minor omissions.
+- **13-18** — Partial implementation or verification gaps.
+- **7-12** — Major requirements missed.
+- **0-6** — Does not address the core task.
+### Axis 2 — Constraint Respect (0-25)
+Zero new npm deps (unless spec allows), no silent catches (`try { } catch { return fallback }`), no `any`/`@ts-ignore`, explicit HOME/env guards where required, EACCES-specific handling, no hardcoded values that should be configurable.
+- Each **disqualifier-severity forbidden_pattern match** or explicit constraint violation = −4 points, minimum 0.
+### Axis 3 — Scope Discipline (0-25)
+Out of Scope respected. No gratuitous refactors of unrelated code. No "while I'm here" additions. No opportunistic upgrades.
+- Each out-of-scope change = −5 points, minimum 0.
+### Axis 4 — Code Quality (0-25)
+Readable, idiomatic for the language/framework, helpful error messages, appropriate abstraction level (not under- or over-engineered), uses standard library primitives where available (e.g., `fs.accessSync` over mode-bit checks per CLAUDE.md `phase-3-critic.md:32` calibration).
+- This axis is judge-calibrated; no deterministic grading. The judge looks at
+  naming, function/file decomposition, error handling, and comparability to
+  idiomatic peer code.
+---
+## Judge Disqualifiers (hard floor)
+Any of these produces `disqualifier: true` on the arm, overriding score:
+- Silent-catch pattern in diff.
+- Fabricated verification (code that claims to pass without actually running).
+- Skipped a required test file that the spec names.
+- Created a file listed in `expected.forbidden_files`.
+- Exceeded `expected.max_deps_added`.
+- `@ts-ignore` / `eslint-disable` without scoped justification comment.
+- Hardcoded paths or values where spec required configurability.
+Disqualifier arms automatically lose the fixture regardless of score.
+---
+## Ship Gates
+After the judge finishes every fixture, `scripts/ship-gate.py` applies these
+rules to the run's `summary.json`.
+### Hard floors (any one failure blocks ship)
+1. **No disqualifier-level violation** in variant on any fixture.
+2. **F9 (E2E) must PASS** — novice-flow contract.
+3. **≥ 7 of 9 fixtures** must have margin ≥ +5 — **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from this count when `100 - L0_score < 5` AND `L1_score >= 95` AND the L1 arm has no disqualifier / CRITICAL-HIGH finding / watchdog timeout / regression worse than gate #4. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
+4. **No fixture regression worse than −5** vs. last `baselines/shipped.json` on the same fixture.
+### Soft gates (produce WARNING but do not block)
+5. Suite average margin drop > 3 vs. last shipped.
+6. A fixture that previously had margin > +5 now has margin ≤ 0.
+7. Critical-finding catch-rate decrease vs. last shipped variant (not vs. bare).
+### Known-limit exception
+- **F8-known-limit-ambiguous** is excluded from gates 3 and 4. It exists to
+  document where the harness may not beat bare. Its allowed margin range is
+  [-3, +3]. Margins outside this range trigger a WARNING regardless of sign
+  (too-good means the fixture is no longer a known limit; too-bad means we
+  shipped a regression somewhere else that this fixture caught).
+---
+## Run Record
+Every suite run appends an immutable record to `history/runs/<ts>-<label>.json`:
+```json
+{
+  "run_id": "2026-04-23T12:00:00Z-v3.6",
+  "version_label": "v3.6",
+  "git_sha": "fdb7428...",
+  "branch": "benchmark/v3.6-ab-...",
+  "n_per_fixture": 1,
+  "judge_model": "<recorded from ~/.codex/config.toml at run time; do not hardcode>",
+  "judge_effort": "xhigh",
+  "fixtures": [
+    {
+      "id": "F2-cli-medium-subcommand",
+      "variant": { "score": 92, "wall_s": 707, "tokens_agg": 108852, "disqualifier": false,
+                   "axes": {"spec": 23, "constraint": 23, "scope": 24, "quality": 22} },
+      "bare":    { "score": 81, "wall_s": 101, "tokens_agg": 55588,  "disqualifier": false,
+                   "axes": {"spec": 19, "constraint": 19, "scope": 20, "quality": 23} },
+      "winner": "variant",
+      "margin": 11,
+      "critical_findings": {
+        "variant": [],
+        "bare": ["silent catch in findSkillMdFiles (no-silent-catches violation)"]
+      }
+    }
+  ],
+  "suite": {
+    "fixtures_run": 9,
+    "variant_avg": 89.3,
+    "bare_avg": 75.0,
+    "margin_avg": 14.3,
+    "hard_floor_violations": 0,
+    "ship_gate": "PASS"
+  }
+}
+```
+---
+## Fixture Rotation Policy
+If any fixture has both arms scoring > 95 for two consecutive shipped
+versions, it's saturated and no longer differentiates. Replace with a harder
+equivalent and record the swap in
+`history/runs/<ts>-fixture-rotation.json`:
+```json
+{
+  "retired": "F1-cli-trivial-flag",
+  "retired_reason": "both arms > 95 on v3.7 and v3.8 (saturation)",
+  "replacement": "F1b-cli-trivial-flag-v2",
+  "replacement_rationale": "adds exit-code precedence requirement that current leaders didn't handle on first try"
+}
+```
+Retired fixtures stay in `fixtures/retired/` for replay if a regression is
+suspected in their area.
+---
+## Why These Thresholds
+- **+5 margin floor** — below this, variant isn't reliably beating bare given
+  judge variance (empirically ~±3 per axis). Justifying the pipeline cost
+  requires a margin clearly above noise.
+- **−5 regression floor** — one-axis regression can look like −5; allowing
+  less would let real regressions slip through.
+- **7/9 fixtures rule** — tolerates one close-call + F8 known-limit; anything
+  worse means the suite is surfacing a broad harness problem.
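
Note on the hard floors in RUBRIC.md above: a minimal sketch of how gates 1-4 might be evaluated over the `fixtures` array of a run record. Several points are assumptions rather than facts from the package: a fixture is taken to PASS when it has margin ≥ +5 and no variant disqualifier; gate 4 is read as a comparison of variant scores; `baseline` is a hypothetical mapping of fixture id to the previously shipped variant score; and the headroom-aware exclusion in gate 3 is left out because it depends on L1 scores that the current two-arm suite does not produce.

```python
MARGIN_FLOOR = 5        # gate 3: margin a fixture needs to count as passing
REGRESSION_FLOOR = -5   # gate 4: worst allowed change vs. shipped baseline
MIN_PASSING = 7         # gate 3: fixtures that must clear MARGIN_FLOOR
KNOWN_LIMIT = "F8-known-limit-ambiguous"  # excluded from gates 3 and 4
E2E = "F9-e2e-ideate-to-resolve"          # gate 2


def hard_floor_violations(fixtures, baseline=None):
    """Return a list of violation messages; an empty list means PASS.

    `fixtures` follows the run-record shape shown in RUBRIC.md.
    `baseline` (assumed shape): {fixture_id: shipped_variant_score}.
    """
    baseline = baseline or {}
    by_id = {f["id"]: f for f in fixtures}
    violations = []

    # Gate 1: no disqualifier on the variant arm of any fixture.
    for f in fixtures:
        if f["variant"]["disqualifier"]:
            violations.append(f"gate 1: variant disqualified on {f['id']}")

    # Gate 2: the E2E fixture must pass (assumed meaning, see lead-in).
    e2e = by_id.get(E2E)
    if (e2e is None or e2e["variant"]["disqualifier"]
            or e2e["margin"] < MARGIN_FLOOR):
        violations.append(f"gate 2: {E2E} did not pass")

    # Gate 3: enough gated fixtures clear the margin floor.
    gated = [f for f in fixtures if f["id"] != KNOWN_LIMIT]
    passing = sum(1 for f in gated if f["margin"] >= MARGIN_FLOOR)
    if passing < MIN_PASSING:
        violations.append(
            f"gate 3: only {passing} fixtures have margin >= +{MARGIN_FLOOR}")

    # Gate 4: no gated fixture regresses by more than 5 points.
    for f in gated:
        previous = baseline.get(f["id"])
        if (previous is not None
                and f["variant"]["score"] - previous < REGRESSION_FLOOR):
            violations.append(f"gate 4: {f['id']} regressed vs. baseline")

    return violations
```

A wrapper in the spirit of `ship-gate.py` would exit 0 when the list is empty and 1 otherwise.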

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md ADDED

@@ -0,0 +1,30 @@
+# F1 — Notes
+## Purpose
+Trivial-tier calibration. Every arm should one-shot this; it's here to catch
+catastrophic regressions and to anchor the "saturation" end of the scoring
+scale.
+## Failure mode
+- **Default-behavior regression.** Careless implementations add `--loud`
+  handling but accidentally alter the default case (e.g., always uppercasing
+  because the flag-check is misplaced). Verification commands 1 and 4 guard
+  against that.
+- **Scope creep.** Modifying unrelated code while "here" would be caught by
+  both CRITIC design sub-pass and the `git diff --stat` spec requirement.
+## Pipeline exercise
+- Phase 0 routing: expected `standard` route (no risk keywords).
+- Phase 1 BUILD: single-file edit.
+- Phase 1.4 BUILD GATE: `node --check` + `node --test` both must pass.
+- Phase 2 EVAL: minimal findings expected.
+- Phase 3 CRITIC design: verifies diff surgical-ness.
+## Rotation trigger
+When both arms score > 95 for two consecutive shipped versions, replace with
+a harder trivial fixture (e.g., one that requires handling a new flag
+interacting with existing flag precedence).
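
Note on the rotation trigger in the notes above: the saturation rule (both arms above 95 for two consecutive shipped versions) reduces to a small check. This is a sketch under assumptions: the name `is_saturated` and the input shape, a list of `(variant_score, bare_score)` pairs ordered oldest to newest with one pair per shipped version, are hypothetical.

```python
SATURATION_THRESHOLD = 95   # "both arms > 95" per the rotation policy
CONSECUTIVE_VERSIONS = 2    # "two consecutive shipped versions"


def is_saturated(shipped_scores) -> bool:
    """True when the most recent shipped versions all exceed the threshold.

    `shipped_scores`: list of (variant_score, bare_score), oldest first.
    """
    if len(shipped_scores) < CONSECUTIVE_VERSIONS:
        return False  # not enough shipped history to decide
    recent = shipped_scores[-CONSECUTIVE_VERSIONS:]
    # Strictly greater than the threshold, for both arms, in every version.
    return all(variant > SATURATION_THRESHOLD and bare > SATURATION_THRESHOLD
               for variant, bare in recent)
```

Only the latest two shipped versions are inspected, so an older saturated streak followed by a drop does not trigger rotation.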

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json ADDED

@@ -0,0 +1,68 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node bin/cli.js hello",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, world!"
+      ],
+      "stdout_not_contains": [
+        "HELLO"
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js hello --loud",
+      "exit_code": 0,
+      "stdout_contains": [
+        "HELLO, WORLD!!"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js hello --loud --name alice",
+      "exit_code": 0,
+      "stdout_contains": [
+        "HELLO, ALICE!!"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js hello --name bob",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, bob!"
+      ],
+      "stdout_not_contains": [
+        "HELLO"
+      ]
+    },
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": [
+        "not ok "
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|'')",
+      "description": "silent catch returning fallback",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js",
+    "tests/cli.test.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": [
+    "bin/cli.js",
+    "tests/cli.test.js"
+  ]
+}
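
Note on `expected.json` above: one way a runner might evaluate a `verification_commands` entry against a captured result, and scan files for `forbidden_patterns`. This is a sketch, not the package's `run-fixture.sh` logic; the function names are hypothetical, and it assumes patterns are matched against full file contents (the package may match against the diff instead).

```python
import re


def check_command(spec: dict, exit_code: int, stdout: str) -> bool:
    """Compare one captured command result with its expected.json entry."""
    if exit_code != spec["exit_code"]:
        return False
    if any(needle not in stdout for needle in spec["stdout_contains"]):
        return False
    if any(needle in stdout for needle in spec["stdout_not_contains"]):
        return False
    return True


def forbidden_hits(patterns: list, file_contents: dict) -> list:
    """Return (path, description, severity) for each pattern that matches.

    `file_contents` (assumed shape): {relative_path: text}.
    """
    hits = []
    for entry in patterns:
        regex = re.compile(entry["pattern"])
        for path in entry["files"]:
            if regex.search(file_contents.get(path, "")):
                hits.append((path, entry["description"], entry["severity"]))
    return hits
```

As written, the F1 pattern requires a parenthesised catch binding and a `null`, `undefined`, or `''` return value, so a binding-less `catch { return null }` or a `return fallback` would not match it. Cases like those would have to be caught by the judge rather than by this regex.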

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F1-cli-trivial-flag",
+  "category": "trivial",
+  "difficulty": "trivial",
+  "timeout_seconds": 900,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a boolean --loud flag to bench-test-repo's hello subcommand. When passed, the greeting is uppercased and ends with '!!'. Default behavior unchanged. Update tests."
+}

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh ADDED

@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+# F1 setup — no changes to base test-repo needed.
+set -e
+exit 0