npm - devlyn-cli - Versions diffs - 2.3.0 → 2.3.1 - Mend

devlyn-cli 2.3.0 → 2.3.1


Files changed (219)

package/benchmark/auto-resolve/README.md CHANGED

@@ -1,13 +1,14 @@
-# devlyn-cli auto-resolve Benchmark Suite
+# devlyn-cli resolve Benchmark Suite
-One-command A/B benchmark that gates every harness change with a ship/rollback decision.
+One-command resolve benchmark that gates every harness change with a ship/rollback decision.
 ## Quick start
 ```bash
-npx devlyn-cli benchmark                 # n=1 smoke, all fixtures × 2 arms, judge, report, ship-gate
-npx devlyn-cli benchmark --n 3           # higher confidence for ship decisions
+npx devlyn-cli benchmark                 # n=1 smoke, all fixtures × 3 arms, judge, report, ship-gate
 npx devlyn-cli benchmark F2              # specific fixture only
+npx devlyn-cli benchmark headroom F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
 npx devlyn-cli benchmark --dry-run       # validate suite wiring without model invocation
 npx devlyn-cli benchmark --bless         # if ship-gate PASSes, promote this run as the shipped baseline
 npx devlyn-cli benchmark --judge-only --run-id <ID>   # re-judge an existing run's artifacts
@@ -17,12 +18,12 @@ Exit code 0 = PASS, 1 = FAIL.
 ## What it does
-1. For every fixture × arm (`variant` / `bare`):
+1. For every fixture × arm (`variant` / `solo_claude` / `bare`):
    - Prepare a fresh temp copy of `fixtures/test-repo/`.
    - Commit baseline + apply `setup.sh` + commit bench scaffolding.
    - Invoke the arm via an isolated `claude -p` subprocess.
    - Capture `diff.patch`, `transcript.txt`, `timing.json`, run `expected.json::verification_commands`.
-2. For every fixture, invoke `codex exec` as a blind judge (`A`/`B` randomized per fixture) using the 4-axis rubric in `RUBRIC.md`.
+2. For every fixture, invoke isolated Codex as a blind judge with randomized slots using the 4-axis rubric in `RUBRIC.md`.
 3. Aggregate into `results/<run-id>/report.md` + `summary.json`.
 4. Apply ship-gate thresholds (`scripts/ship-gate.py`). Print verdict.
 5. Append immutable record to `history/runs/<run-id>.json`.
@@ -47,16 +48,30 @@ benchmark/auto-resolve/
 │   ├── judge.sh              # Codex blind judge for one fixture
 │   ├── compile-report.py     # aggregates into report.md + summary.json
 │   ├── ship-gate.py          # applies thresholds + writes history record
+│   ├── test-benchmark-arg-parsing.sh
+│   ├── test-ship-gate.sh
 │   ├── run-headroom-candidate.sh
 │   ├── headroom-gate.py      # blocks pair measurement without headroom set
 │   ├── test-headroom-gate.sh
+│   ├── test-run-headroom-candidate.sh
 │   ├── run-full-pipeline-pair-candidate.sh
+│   ├── test-run-full-pipeline-pair-candidate.sh
 │   ├── full-pipeline-pair-gate.py
 │   ├── test-full-pipeline-pair-gate.sh
+│   ├── pair-candidate-frontier.py
+│   ├── test-pair-candidate-frontier.sh
+│   ├── audit-pair-evidence.py
+│   ├── test-audit-pair-evidence.sh
+│   ├── audit-headroom-rejections.py
+│   ├── test-audit-headroom-rejections.sh
+│   ├── test-check-f9-artifacts.sh
+│   ├── iter-0033c-l1-summary.py
+│   ├── test-iter-0033c-l1-summary.sh
 │   ├── run-frozen-verify-pair.sh
 │   ├── fetch-swebench-instances.py
 │   ├── collect-swebench-predictions.py
 │   ├── run-swebench-solver-batch.sh
+│   ├── test-run-swebench-solver-batch.sh
 │   ├── prepare-swebench-frozen-case.py
 │   ├── prepare-swebench-frozen-corpus.py
 │   ├── run-swebench-frozen-corpus.sh
@@ -85,58 +100,231 @@ Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`,
 1. Copy an existing fixture directory as a template.
 2. Rewrite `metadata.json::intent` with the new task's plain-language intent.
-3. Write `spec.md` (auto-resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
+3. Write `spec.md` (resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
 4. Fill `expected.json` with concrete verification commands and forbidden patterns.
 5. Document purpose + failure mode in `NOTES.md`.
 6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
 7. Run `bash scripts/lint-fixtures.sh`.
+For draft pair candidates, start in `shadow-fixtures/S*` and run
+`bash scripts/lint-shadow-fixtures.sh`. The headroom and pair candidate runners
+accept explicitly named `S*` ids for dry-run checks and candidate measurement,
+but shadow results are read-only signals. Promote a validated task to an active
+`F*` fixture before counting it as golden pair evidence.
+Use `run-suite.sh --suite shadow` only with `--dry-run`; the suite path refuses
+provider and judge runs for shadow fixtures so rejected/smoke controls do not
+spend benchmark budget accidentally.
+Before spending provider calls, write a solo-headroom hypothesis into the
+candidate's `spec.md`: name the visible behavior a capable `solo_claude`
+baseline is expected to miss, and the observable command from `expected.json`
+that would expose that miss. A hypothesis of only "the task is hard" is not
+enough; rework the candidate before measurement. `lint-shadow-fixtures.sh` and
+the candidate runners enforce this as an actionable hypothesis: the fixture
+`spec.md` must contain `solo-headroom hypothesis`, `solo_claude`, `miss`, and a
+backticked observable command matching `expected.json`, with the backticked line
+itself containing `miss` and framed as the command/observable that exposes it.
+For unmeasured high-risk shadow candidates, `NOTES.md` must also include
+`## Solo ceiling avoidance` naming how the candidate differs from the
+solo-saturated `S2`-`S6` controls and why that difference should preserve
+`solo_claude` headroom. If that distinction is not concrete, rework the
+candidate before measurement.
+If a real shadow headroom run fails because the fixture is solo-saturated, record
+the run and score in the fixture's `NOTES.md` and add the fixture to
+`scripts/pair-rejected-fixtures.sh`; `lint-shadow-fixtures.sh` enforces that
+calibrated shadow `FAIL` entries are registered before future provider spend.
 For L2/pair candidate fixtures, also run:
 ```bash
-bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh F16-cli-quote-tax-rules
+bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh \
+  --bare-max 60 \
+  --solo-max 80 \
+  --min-fixtures 3 \
+  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
 ```
+The same runner is available through `npx devlyn-cli benchmark headroom ...`.
 This runs only the arms needed for calibration (`bare` and `solo_claude`),
 blind-judges them, and applies `headroom-gate.py`. A candidate set is not
 usable for pair measurement unless at least two fixtures pass and each fixture
-has clean `bare <= 60` and `solo_claude <= 80` scores. A one-fixture calibration
-run can show useful scores but does not satisfy the set gate.
-When changing the gate itself, run:
+has evidence-complete `bare <= 60` and `solo_claude <= 80` scores with the
+default minimum 5-point `bare`/`solo_claude` headroom margin.
+The runner prints the headroom gate markdown report to stdout, including the
+startup `Gate:` line and the fixture score table with bare score, bare
+headroom, solo_claude score, solo_claude headroom, status, and reason columns. When launched
+through `npx devlyn-cli benchmark headroom`, the replay `Command:` uses the
+same package CLI path.
+For passing sets, the report also prints average and minimum `bare`/`solo_claude`
+headroom plus the fixture pass count, so ceiling-near, threshold-fragile, or
+under-count candidate sets are visible before spending pair arms.
+It explicitly reports whether the candidate set was accepted or rejected.
+Evidence-clean means the measured arm has complete artifacts, no deterministic
+or judge disqualifier, all expected verification commands pass, and any
+skill-pipeline verdict is non-blocking (`PASS` or `PASS_WITH_ISSUES`). A
+one-fixture calibration run can show useful scores but does not satisfy the set
+gate. Add `--dry-run` to validate args, fixture ids, minimum fixture count, and
+the replay command without running arms or judges.
+Known rejected or ceiling-saturated fixtures are refused by default in the
+headroom runner; use `--allow-rejected-fixtures` only for diagnostics of
+rejected fixtures or calibrated shadow controls, not for new pair-evidence
+candidate selection. Retired fixtures are preserved for historical artifact replay
+and are not rerun by the pair-candidate runners.
+Before spending new provider calls, inspect the active candidate frontier:
+```bash
+python3 benchmark/auto-resolve/scripts/pair-candidate-frontier.py \
+  --out-md /tmp/devlyn-pair-frontier.md
+npx devlyn-cli benchmark recent
+npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
+npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
+```
+`benchmark recent` is the reader-facing version of the current evidence set: it
+prints a compact, wrap-safe status block, pair-lift aggregates, and one card per
+passing pair-evidence fixture. Use it for PR comments and release notes when a
+wide frontier table would wrap poorly.
+The frontier report lists active fixtures as `rejected`,
+`pair_evidence_passed`, or `candidate_unmeasured`, using the same rejected
+fixture registry and local full-pipeline gate artifacts. It also prints stdout
+summary rows with `bare`, `solo_claude`, `pair`, pair arm, margin, wall ratio, run id, verdict, and trigger reasons for
+fixtures that already have complete pair evidence rows, plus average/minimum pair margin and wall ratio,
+even when writing `--out-md` or `--out-json`. The markdown artifact also carries
+the overall verdict plus row-level verdict, pair-arm, and trigger-reason columns.
+Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
+and the report includes a Markdown `Hypothesis trigger` column, so strict regenerated
+evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
+After a headroom run fails, audit that any active failed fixture without passing
+pair evidence is either rejected or reworked before more provider spend. The
+same audit also rejects active registry entries whose reason cites a run id or
+score that is not backed by a matching local headroom artifact:
+```bash
+python3 benchmark/auto-resolve/scripts/audit-headroom-rejections.py
+npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
+```
+For release or handoff checks where open candidates are not acceptable, add
+`--fail-on-unmeasured` to the frontier command so any active
+`candidate_unmeasured` fixture becomes a nonzero exit.
+The package CLI exposes that release/handoff guard as one command:
+```bash
+npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
+npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
+```
+It writes `audit.json` with the frontier summary and an artifact map (`artifacts`), plus
+`frontier.json`, `frontier.stdout`, `frontier.stderr`, `headroom-audit.json`, and child stdout/stderr logs, prints the same frontier score rows for existing complete pair
+evidence rows, and embeds those compact trigger-backed verdict-bearing score rows in
+`audit.json` as `pair_evidence_rows` (each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`). It fails if either active unmeasured pair candidates or unrecorded
+headroom failures remain. By default it also revalidates frontier `verdict: PASS`
+and zero unmeasured candidates, requires at least four active fixtures with passing pair evidence,
+and requires each counted evidence row to satisfy `pair_mode: true`, the default 5-point pair margin, and 3x pair/solo wall ratio.
+The audit stdout also prints `headroom_rejections=...`,
+`pair_evidence_quality=...`, `pair_trigger_reasons=...`, and
+`pair_evidence_hypotheses=...` and
+`pair_evidence_hypothesis_triggers=...` handoff rows, plus
+`pair_trigger_historical_aliases=...` when archived evidence includes legacy
+trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
+hypotheses have not yet propagated into trigger reasons, with rejected-fixture
+coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
+and canonical trigger reason coverage plus row-match status. The compact evidence row count must match the frontier evidence count, so incomplete local score
+artifacts cannot inflate the claim. `checks.frontier_stdout` records summary,
+aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts, `checks.pair_evidence_quality`
+records the same quality thresholds from the compact rows,
+`checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status
+for handoff review, `checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts, and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details.
+Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
+archived-evidence WARN rows into release-blocking FAIL rows for newly
+regenerated pair evidence.
+Historical trigger aliases are only reported for archived artifact review; new
+current pair-evidence gates fail historical-only or unknown trigger reasons and
+require at least one canonical `pair_trigger.reasons` entry.
+`checks.headroom_rejections` records the child verdict plus unrecorded and
+unsupported registry-rejection counts, so handoff review can see rejected-fixture
+coverage without opening the child artifact first.
+Override `--min-pair-evidence`, `--min-pair-margin`, or
+`--max-pair-solo-wall-ratio` only for narrower diagnostics.
+When changing the calibration/pair evidence gates, run:
 ```bash
+bash scripts/lint-fixtures.sh
+bash benchmark/auto-resolve/scripts/test-ship-gate.sh
+bash benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh
+bash benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh
+bash benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh
+bash benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh
+bash benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh
 bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
+bash benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh
+bash benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh
+bash benchmark/auto-resolve/scripts/test-lint-fixtures.sh
+bash benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh
+bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
+bash benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh
+bash benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh
+bash benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh
+bash benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh
+bash benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh
 ```
-After a full-pipeline pair run has the calibrated arms (`bare`,
-`solo_claude`, `l2_gated` or `l2_risk_probes`) plus a blind `judge.json`, gate
-it separately:
+`build-pair-eligible-manifest.py` writes `selection_rule.rejected_excluded`
+with the rejected fixture ids removed from Gate 3, and
+`selection_rule.rejected_excluded_reasons` with the exact registry reason for
+each removed id. This keeps the manifest self-explaining when F31/F32-style
+solo-ceiling controls are excluded from pair-lift evidence.
+After a full-pipeline pair run has the calibrated arms (`bare`, `solo_claude`,
+and the selected pair arm, default `l2_risk_probes`) plus a blind `judge.json`,
+gate it separately:
 ```bash
 bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
+  --min-fixtures 3 \
   --max-pair-solo-wall-ratio 3 \
-  F21-cli-scheduler-priority F23-cli-fulfillment-wave
+  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
 ```
+The same runner is available through `npx devlyn-cli benchmark pair ...`.
 The runner executes `bare` + `solo_claude`, applies `headroom-gate.py`, and
-only then spends a `l2_gated` arm. To gate already-existing artifacts:
-When a prompt-only pair change needs a fresh `l2_gated` measurement but the
-calibrated `bare` + `solo_claude` arms are already clean, reuse them into a new
-run id:
+only then spends the selected pair arm. Pair arms are limited to current
+proof (`l2_risk_probes`) or diagnostic replay (`l2_gated`); `l2_forced` is
+retired and rejected. It prints the exact replay command plus each gate's
+markdown report to stdout, including startup `Headroom:` / `Pair:` lines,
+fixture pass count, average pair margin, and the fixture score table with bare,
+solo_claude, pair, margin, pair-mode, trigger-reason, and wall-ratio columns; if headroom or pair
+evidence fails, the report is printed before the runner exits non-zero. If
+headroom fails, the runner explicitly says the pair arm was not executed; if
+the final pair gate fails, it explicitly says pair evidence was rejected. When
+both gates pass, it explicitly says the selected pair arm is being executed and
+then that pair evidence was accepted. When launched through
+`npx devlyn-cli benchmark pair`, the replay `Command:` uses
+the same package CLI path. Add `--dry-run` to
+validate args, fixture ids, minimum fixture count, and the replay command
+without running arms or judges. Known rejected or ceiling-saturated fixtures
+are refused by default here too; use `--allow-rejected-fixtures` only for
+diagnostics of rejected fixtures or calibrated shadow controls. Retired fixtures
+remain historical replay artifacts and are not rerun by this candidate runner.
+To gate already-existing artifacts:
+When a prompt-only pair change needs a fresh `l2_risk_probes` measurement but
+the calibrated `bare` + `solo_claude` arms are already evidence-complete, reuse
+them into a new run id:
 ```bash
 bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
   --run-id <new-run-id> \
   --reuse-calibrated-from <prior-headroom-run-id> \
+  --min-fixtures 3 \
   --max-pair-solo-wall-ratio 3 \
-  F21-cli-scheduler-priority F23-cli-fulfillment-wave
+  F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
 ```
 ```bash
 python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
   --run-id <full-pipeline-run-id> \
-  --min-fixtures 2 \
+  --min-fixtures 3 \
   --min-pair-margin 5 \
   --max-pair-solo-wall-ratio 3 \
   --out-json benchmark/auto-resolve/results/<full-pipeline-run-id>/full-pipeline-pair-gate.json \
@@ -144,16 +332,67 @@ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
 ```
 This is the full-pipeline claim gate: each counted fixture must satisfy the
-headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
-must be clean, `pair_mode` must be true in the captured resolve state, and the
-blind judge must score the pair arm at least `--min-pair-margin` above
-`solo_claude`. `l2_risk_probes` is the current measured pair arm for the
-F16/F25 gate: `20260509-f16-f25-combined-cartprobe-v2` passed with margins +21
-and +24, average pair/solo wall ratio 1.46x. When changing this gate, run:
-```bash
-bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
-```
+headroom precondition (`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins), the selected pair arm must be evidence-clean,
+`pair_mode` must be true in the captured resolve state, the pair trigger must be
+eligible with non-empty reasons and at least one canonical reason, fixtures with an actionable solo-headroom hypothesis must include `spec.solo_headroom_hypothesis` in the trigger reasons, the pair/solo wall-time
+ratio must stay within the default 3x limit, and the blind judge must score the
+pair arm at least `--min-pair-margin` above `solo_claude`. The report separates
+the allowed pair/solo wall ratio from the maximum observed pair/solo wall ratio,
+records `require_hypothesis_trigger` in JSON, and includes a Markdown
+`Hypothesis trigger` column for each fixture row.
+The judge
+file must also map `bare`, `solo_claude`, and the selected pair arm in
+`_blind_mapping`; `scores_by_arm` alone is not evidence.
+`l2_risk_probes` is the current measured pair arm for the
+F16/F23/F25 gate: `20260510-f16-f23-f25-combined-proof` passed with margins
++21, +31, and +24, average pair margin +25.3, and average pair/solo wall ratio
+1.73x. Earlier F16/F25 evidence also passes the current gate in
+`20260509-f16-f25-combined-cartprobe-v2`.
+Additional focused F21 evidence: `20260511-f21-current-riskprobes-v1` passed
+with `bare` 33, `solo_claude` 66, `l2_risk_probes` 99, margin +33, pair mode true, and
+pair/solo wall ratio 1.47x, and is counted by `benchmark audit` as the fourth passing pair-evidence row. Do not count ceiling/control fixtures as pair
+evidence: F22 and F26 are
+currently rejected because existing headroom runs put `solo_claude` at 98. F27
+subscription proration is also rejected in its first headroom smoke:
+`20260511-f27-headroom-smoke-061401` measured bare 33 / solo_claude 94, with bare
+verification passing only 1 of 3 commands. Rework or rotate F27 before spending
+a pair arm on it. F28 return authorization is rejected as pair-lift evidence:
+earlier unstable runs `20260511-f28-headroom-smoke-085307` and
+`20260511-f28-pair-smoke-091021` were superseded after a hidden-oracle bug was
+found. The oracle had expected a defective item to bypass expiration, which the
+visible spec does not require. After re-verifying the same provider diffs
+against the corrected oracle, `20260511-f28-policy-oraclefix-reverified-pair`
+scored bare 50 / solo_claude 98 / `l2_risk_probes` 96, margin -2, and failed headroom.
+Rework or rotate F28 before spending more pair arms.
+F30 credit hold settlement is also rejected: `20260511-f30-headroom-v1` scored
+bare 33 / solo_claude 98, so it failed the `solo_claude` headroom precondition before any pair
+arm should be spent. F15 frozen-diff race review is now a ceiling/control
+fixture too: `20260511-f15-concurrency-headroom` scored bare 99 / solo_claude 94, so
+it failed both headroom preconditions. F3 backend contract risk is also
+rejected after tightening its HTTP error-body verifier:
+`20260511-f3-http-error-headroom` scored bare 97 / solo_claude 99. F2 medium CLI is
+rejected by `20260512-f2-medium-headroom`: bare 83 / solo_claude 95, so both baseline
+scores exceed headroom ceilings. F4 web browser design is rejected by
+`20260512-f4-web-headroom`: bare 70 / solo_claude 92 with bare disqualifiers, so it
+needs rework before pair arms. F5 fix-loop is rejected by
+`20260512-f5-fixloop-headroom`: bare 99 / solo_claude 99, with `bare` and `solo_claude` each
+passing 5/5 verification commands. F6 dep-audit checksum is rejected by
+`20260512-f6-checksum-headroom`: bare 97 / solo_claude 96, with `bare` and `solo_claude` each
+passing 6/6 verification commands. F7 scope discipline is rejected by
+`20260512-f7-scope-headroom`: bare 99 / solo_claude 100, with `bare` and `solo_claude` each
+passing 6/6 verification commands. F9 ideate-to-resolve remains the novice-flow
+anchor but is rejected as pair evidence by `20260512-f9-e2e-headroom`: bare 60 /
+solo_claude 90 with bare headroom 0 and a bare judge disqualifier, despite passing F9
+artifact checks. Rework it before spending pair arms. F1 and F8 are rejected by
+design as calibration/known-limit controls, not pair-lift evidence candidates.
+F10/F11 are also rejected by `20260507-f10-f11-tier1-full-pipeline`: F10 scored
+bare 75 / solo_claude 94, and F11 scored bare 98 / solo_claude 97. F12 webhook signature/replay is rejected by
+`20260511-f12-webhook-headroom`: bare 85 / solo_claude 99.
+F31 seat rebalance is rejected by `20260512-f31-seat-rebalance-headroom`: bare
+33 / solo_claude 98, with bare judge/result/verify disqualifiers. Rework it before
+spending pair arms. F32 subscription renewal is rejected by
+`20260512-f32-subscription-renewal-headroom`: bare 33 / solo_claude 98, so it is a
+solo-ceiling billing rollback/shape control rather than pair-lift evidence.
 Commands that reference `BENCH_FIXTURE_DIR` are hidden post-run oracles: they
 are not staged into BUILD_GATE's `.devlyn/spec-verify.json`.
@@ -175,8 +414,10 @@ diagnostics. Use non-empty diffs only; empty diffs fail fast because they are
 not valid pair evidence.
 Hidden verifier context is available during VERIFY, so this runner prevents
 IMPLEMENT contamination but is not an oracle-blind judge setup.
-The runner writes `compare.json`; `pair_verdict_lift: true` means pair VERIFY
-actually ran and found a verdict-binding issue that solo VERIFY did not.
+The runner writes `compare.json` and `compare.md`; `pair_verdict_lift: true`
+means pair VERIFY actually ran and found a verdict-binding issue that solo
+VERIFY did not. It also prints a replay `Command:` block before invoking
+providers and a final solo/pair summary table.
 If an imported case has no deterministic `verification_commands`, the runner
 does not create `.devlyn/spec-verify.json`; an empty carrier is malformed by the
 normal real-user contract and must not block qualitative frozen review.
@@ -187,6 +428,7 @@ To gate a set of frozen VERIFY results mechanically:
 python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
   --run-id 20260505T173913Z-9986cd3-frozen-verify \
   --run-id 20260505T230215Z-9986cd3-frozen-verify \
+  --require-hypothesis-trigger \
   --max-pair-solo-wall-ratio 3 \
   --out-json benchmark/auto-resolve/results/frozen-verify-gate-20260505.json \
   --out-md benchmark/auto-resolve/results/frozen-verify-gate-20260505.md
@@ -203,12 +445,19 @@ full-pipeline pair superiority. It proves only that, after the implementation
 diff is frozen, gated pair VERIFY fires and returns a stricter verdict-binding
 result than solo VERIFY on the same diff. Each supplied run must cover a
 distinct fixture; repeated runs of the same fixture do not count as independent
+evidence. For new measurements, pass `--require-hypothesis-trigger` so any
+fixture spec with an actionable solo-headroom hypothesis must also expose
+`spec.solo_headroom_hypothesis` in `pair_trigger.reasons`; omit it only when
+re-gating historical artifacts that predate that trigger reason.
 corpus growth. `--max-pair-solo-wall-ratio` is optional, but use it for
 ship-style evidence so quality lift is not accepted without a reasonable
 wall-time bound. The gate infers the fixture id from the runner input metadata;
 artifacts without that metadata, or with a fixture id absent from
 the selected `--fixtures-root`, fail instead of being counted as anonymous or
-fake evidence.
+fake evidence. JSON rows expose `pair_trigger_reasons` and
+`pair_trigger_has_canonical_reason`; Markdown output includes a `Triggers`
+column so reviewers can see which canonical pair trigger made the evidence
+eligible.
 ### SWE-bench fixed-diff review pilot
@@ -292,6 +541,10 @@ bash benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh \
   --out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
 ```
+The corpus runner prints a replay `Command:` block before invoking providers or
+gating existing run ids, so frozen VERIFY score runs can be reproduced from the
+captured stdout.
 To re-gate existing run ids without re-invoking providers, write one run id per
 line and pass `--gate-only-run-ids <file>` with the same manifest. For large
 tranches, keep `--run-ids-out` and use `--resume-completed-arms` on retries:
@@ -345,6 +598,7 @@ python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
   --run-id <swebench-frozen-run-2> \
   --run-id <swebench-frozen-run-3> \
   --min-runs 3 \
+  --require-hypothesis-trigger \
   --max-pair-solo-wall-ratio 3 \
   --out-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
   --out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
@@ -359,8 +613,14 @@ inspect `avg_pair_solo_wall_ratio` plus each row's `pair_solo_wall_ratio`.
 For selection-bias control, render every run in the attempted pilot, not just
 gate rows. The matrix reports verdict-lift rows separately from recall-only
 rows where pair found additional findings but did not change the binding
-verdict. It also reports classification counts, gate rate, and trailing
-non-gate rows. Use the optional yield thresholds when the matrix is meant to
+verdict. It also reports pair-trigger eligibility/contract failures,
+trigger reasons, canonical-trigger coverage, classification counts, gate rate,
+and trailing non-gate rows. Its Markdown table includes a `Triggers` column.
+For new measurements, pass `--fixtures-root` with
+`--require-hypothesis-trigger` so matrix rows classify any missing
+`spec.solo_headroom_hypothesis` trigger reason as a pair-trigger contract
+failure instead of leaving it to the gate artifact alone.
+Use the optional yield thresholds when the matrix is meant to
 fail closed instead of only documenting that additional rows are adding
 controls without strengthening the proof gate:
@@ -368,9 +628,11 @@ controls without strengthening the proof gate:
 python3 benchmark/auto-resolve/scripts/swebench-frozen-matrix.py \
   --title "SWE-bench Lite Frozen VERIFY Matrix" \
   --verdict MIXED_WITH_GATE_PASS \
+  --fixtures-root benchmark/auto-resolve/external/swebench/cases \
   --gate-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
   --run-id <swebench-frozen-run-1> \
   --run-id <swebench-frozen-run-2> \
+  --require-hypothesis-trigger \
   --min-gate-rate 0.25 \
   --max-trailing-non-gate 10 \
   --out-json benchmark/auto-resolve/results/swebench-frozen-matrix.json \
@@ -391,9 +653,9 @@ the diff is frozen.
 ## LLM-upgrade resilience
-- **No model hardcoding.** Judge runs `codex exec` without `-m`, inheriting whichever flagship the CLI currently ships. Each run captures `_judge_model` for historical provenance.
-- **Margin-based gates.** Ship thresholds use margin (variant − bare), not absolute score. Both arms improve together as models improve; the harness-added value measured by margin stays meaningful.
-- **Saturation rotation.** When both arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
+- **No model hardcoding.** Judge runs Codex without `-m`, inheriting whichever flagship the CLI currently ships. The call is isolated from user config/rules/hooks so local agent instructions cannot contaminate the blind judgment. Each run captures `_judge_model` for historical provenance.
+- **Margin-based gates.** Ship thresholds use pairwise margins, not absolute score. `solo_claude`-`bare` measures solo harness value; pair-`solo_claude` measures pair value on pair-eligible fixtures. As models improve, margin remains the meaningful harness-added signal.
+- **Saturation rotation.** When all compared gated arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
 ## Ship gates (summary — see `RUBRIC.md` for full spec)
@@ -401,7 +663,7 @@ Hard floors (any one fails → block):
 - Zero variant disqualifier (silent catch, fabricated verification, extra deps beyond `max_deps_added`, etc.).
 - `F9-e2e-ideate-to-resolve` must PASS (novice-flow contract).
-- ≥ 7 of 9 gated fixtures have margin ≥ +5.
+- ≥ 7 gated, headroom-available fixtures have margin ≥ +5.
 - No per-fixture regression worse than −5 vs last shipped baseline.
 Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margin, critical-finding catch-rate regression vs last shipped variant.
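
The baseline comparisons listed above (the −5 regression hard floor and the suite-margin and lost-margin soft gates) can be sketched roughly as follows. This is an illustration only, not the contents of `scripts/ship-gate.py`: the function name, the dict-of-margins input shape, and the choice to measure regression on per-fixture margins are all assumptions.

```python
from statistics import mean

REGRESSION_FLOOR = -5   # hard floor: per-fixture change worse than this blocks
SUITE_DROP_WARN = 3     # soft gate: suite-average margin drop above this warns


def compare_to_baseline(current, shipped):
    """Compare per-fixture margins against the last shipped baseline.

    Hypothetical helper: `current` and `shipped` map fixture id -> margin.
    Returns a dict with hard-floor blockers and soft-gate warnings.
    """
    shared = sorted(set(current) & set(shipped))

    # Hard floor: any fixture whose margin fell by more than 5 points.
    blockers = [fx for fx in shared
                if current[fx] - shipped[fx] < REGRESSION_FLOOR]

    warnings = []
    if shared:
        drop = mean(shipped[fx] for fx in shared) - mean(current[fx] for fx in shared)
        if drop > SUITE_DROP_WARN:
            warnings.append(f"suite-margin drop {drop:.1f} > {SUITE_DROP_WARN}")

    # Soft gate: a fixture that had margin > +5 and now has margin <= 0.
    for fx in shared:
        if shipped[fx] > 5 and current[fx] <= 0:
            warnings.append(f"{fx} lost its margin")

    return {"blockers": blockers, "warnings": warnings}
```

A run would be blocked when `blockers` is non-empty, while `warnings` would only be reported.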
@@ -409,15 +671,16 @@ Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margi
 ## Running the full suite (real)
 Full real benchmarks usually take 2-3 minutes per arm for simple fixtures and
-up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures
-× 2 arms can take 30 min - 2 hrs depending on routes taken.
+up to 15 minutes per arm for strict-route fixtures. A full n=1 run time depends
+on the selected fixture count; the historical 9-fixture core suite was roughly
+45 min - 3 hrs for 3 arms, while the current extended suite can take longer.
 ```bash
 # Smoke run before ship decisions
 npx devlyn-cli benchmark
 # Ship-decision run
-npx devlyn-cli benchmark --n 3 --label v3.7 --bless
+npx devlyn-cli benchmark --label v3.7 --bless
 ```
 ## Dry-run

package/benchmark/auto-resolve/RUBRIC.md CHANGED

@@ -9,8 +9,8 @@ prior `history/runs/`.
 ## Scoring — 4 axes, 25 points each, 100 total
-The blind judge scores both arms on identical axes without knowing which is
-variant vs. bare.
+The blind judge scores all submitted arms on identical axes without knowing
+which label maps to which arm.
 ### Axis 1 — Spec Compliance (0-25)
@@ -72,18 +72,23 @@ Disqualifier arms automatically lose the fixture regardless of score.
 After the judge finishes every fixture, `scripts/ship-gate.py` applies these
 rules to the run's `summary.json`.
+This section describes the broad run-suite ship gate. Current solo<pair
+evidence uses the full-pipeline pair gate with an explicit selected pair arm
+(`l2_risk_probes` for proof runs, `l2_gated` for diagnostics), and that gate
+compares the selected pair arm against `solo_claude`.
 ### Hard floors (any one failure blocks ship)
-1. **No disqualifier-level violation** in variant on any fixture.
+1. **No disqualifier-level violation** in any gated harness arm (legacy suite `variant`/L2 and `solo_claude`/L1 when present).
 2. **F9 (E2E) must PASS** — novice-flow contract.
-3. **≥ 7 of 9 fixtures** must have margin ≥ +5 — **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from this count when `100 - L0_score < 5` AND `L1_score >= 95` AND the L1 arm has no disqualifier / CRITICAL-HIGH finding / watchdog timeout / regression worse than gate #4. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
+3. **At least 7 gated, headroom-available fixtures** must have the required margin ≥ +5 for each gated contract — legacy `variant`-`bare` (L2-L0) for the suite gate, and `solo_claude`-`bare` (L1-L0) when `solo_claude` is present. This is **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from a contract count when the lower arm is ceiling-near and the higher arm is clean at ceiling. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
 4. **No fixture regression worse than −5** vs. last `baselines/shipped.json` on the same fixture.
 ### Soft gates (produce WARNING but do not block)
 5. Suite average margin drop > 3 vs. last shipped.
 6. A fixture that previously had margin > +5 now has margin ≤ 0.
-7. Critical-finding catch-rate decrease vs. last shipped variant (not vs. bare).
+7. Critical-finding catch-rate decrease vs. the last shipped gated harness arm.
 ### Known-limit exception
@@ -138,15 +143,15 @@ Every suite run appends an immutable record to `history/runs/<ts>-<label>.json`:
 ## Fixture Rotation Policy
-If any fixture has both arms scoring > 95 for two consecutive shipped
-versions, it's saturated and no longer differentiates. Replace with a harder
-equivalent and record the swap in
+If any fixture has all compared gated arms scoring > 95 for two consecutive
+shipped versions, it's saturated and no longer differentiates. Replace with a
+harder equivalent and record the swap in
 `history/runs/<ts>-fixture-rotation.json`:
 ```json
 {
   "retired": "F1-cli-trivial-flag",
-  "retired_reason": "both arms > 95 on v3.7 and v3.8 (saturation)",
+  "retired_reason": "all compared gated arms > 95 on v3.7 and v3.8 (saturation)",
   "replacement": "F1b-cli-trivial-flag-v2",
   "replacement_rationale": "adds exit-code precedence requirement that current leaders didn't handle on first try"
 }
@@ -159,10 +164,14 @@ suspected in their area.
 ## Why These Thresholds
-- **+5 margin floor** — below this, variant isn't reliably beating bare given
-  judge variance (empirically ~±3 per axis). Worth paying pipeline cost
-  requires margin clearly above noise.
+- **+5 margin floor** — below this, the gated harness arm is not reliably
+  beating its lower baseline given judge variance (empirically ~±3 per axis).
+  For the legacy suite that is `variant` over `bare`; for pair evidence it is
+  the selected pair arm over `solo_claude`. Worth paying pipeline cost requires
+  margin clearly above noise.
 - **−5 regression floor** — one-axis regression can look like −5; allowing
   less would let real regressions slip through.
-- **7/9 fixtures rule** — tolerates one close-call + F8 known-limit; anything
-  worse means the suite is surfacing a broad harness problem.
+- **7-fixture coverage floor** — requires a broad enough set of
+  headroom-available, non-known-limit fixtures to clear the margin floor. This
+  preserves the original core-suite coverage bar without pretending the current
+  extended fixture inventory is still exactly nine fixtures.
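
A minimal sketch of how the headroom-aware coverage floor in this rubric could be evaluated for one contract (for example `solo_claude` over `bare`). The class and function names are hypothetical, and the numeric headroom thresholds are an assumption taken from the removed wording (`100 - L0_score < 5` and `L1_score >= 95`); the new text only says "ceiling-near" and "clean at ceiling".

```python
from dataclasses import dataclass

MARGIN_FLOOR = 5   # "+5 margin floor"
MIN_PASSING = 7    # "at least 7 gated, headroom-available fixtures"


@dataclass
class FixtureScore:
    name: str
    lower: float                # lower arm score, e.g. bare (L0)
    higher: float               # higher arm score, e.g. solo_claude (L1)
    higher_clean: bool = True   # assumption: no disqualifier, timeout, etc.


def has_headroom(f: FixtureScore) -> bool:
    # Assumed thresholds, borrowed from the pre-2.3.1 rubric wording.
    ceiling_near = (100 - f.lower) < 5
    clean_at_ceiling = f.higher >= 95 and f.higher_clean
    return not (ceiling_near and clean_at_ceiling)


def coverage_floor_passes(fixtures: list[FixtureScore]) -> bool:
    # Saturated fixtures are excluded from the count rather than counted as failures.
    eligible = [f for f in fixtures if has_headroom(f)]
    passing = [f for f in eligible if f.higher - f.lower >= MARGIN_FLOOR]
    return len(passing) >= MIN_PASSING
```

Under these assumptions a fixture scoring 97 (lower) and 99 (higher) is excluded from the count, so it neither helps nor hurts the seven-fixture requirement.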

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md CHANGED

@@ -6,6 +6,10 @@ Trivial-tier calibration. Every arm should one-shot this; it's here to catch
 catastrophic regressions and to anchor the "saturation" end of the scoring
 scale.
+Pair-candidate status: rejected by design. Because every arm is expected to
+one-shot F1, it is a trivial calibration/control fixture and must not be used
+as pair-lift evidence.
 ## Failure mode
 - **Default-behavior regression.** Careless implementations add `--loud`
@@ -25,6 +29,6 @@ scale.
 ## Rotation trigger
-When both arms score > 95 for two consecutive shipped versions, replace with
-a harder trivial fixture (e.g., one that requires handling a new flag
-interacting with existing flag precedence).
+When both `bare` and `solo_claude` score > 95 for two consecutive shipped
+versions, replace with a harder trivial fixture (e.g., one that requires
+handling a new flag interacting with existing flag precedence).
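
The rotation trigger above could be checked with something like the following. It is a sketch under assumptions: the function name and the shape of `history` (one score dict per shipped version, oldest first) are hypothetical and not taken from the package.

```python
def is_saturated(history, arms=("bare", "solo_claude"), threshold=95, streak=2):
    """Return True when every arm scored above `threshold` in each of the
    last `streak` shipped versions.

    `history` is a list of dicts mapping arm name -> score, oldest first.
    A missing arm is treated as not saturated.
    """
    if len(history) < streak:
        return False
    recent = history[-streak:]
    return all(
        scores.get(arm, float("-inf")) > threshold
        for scores in recent
        for arm in arms
    )
```

The comparison is strict (`>`), matching the "score > 95" wording, so a score of exactly 95 does not count toward saturation.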

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md CHANGED

@@ -57,7 +57,12 @@ prose forces invariant derivation, which is where pair has the edge.
 ## Rotation trigger
-Retire when both arms consistently land > 90 across two shipped versions,
-OR when "close-together-write" becomes a recognized pattern such that
-solo arm reliably reaches for a serializing mechanism on first read.
+Headroom run `20260507-f10-f11-tier1-full-pipeline` rejected this fixture as
+pair-lift evidence: bare scored 75 and solo_claude scored 94. Keep it as a
+concurrent persistence control unless the visible contract is reworked to
+expose lower bare/solo ceilings.
+Retire when both `bare` and `solo_claude` consistently land > 90 across two
+shipped versions, OR when "close-together-write" becomes a recognized pattern
+such that solo arm reliably reaches for a serializing mechanism on first read.
 Whichever comes first.