devlyn-cli 2.3.0 → 2.3.1
- package/AGENTS.md +1 -1
- package/CLAUDE.md +2 -2
- package/README.md +80 -29
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
- package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
- package/benchmark/auto-resolve/README.md +307 -44
- package/benchmark/auto-resolve/RUBRIC.md +23 -14
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
- package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
- package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
- package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
- package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
- package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
- package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
- package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
- package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
- package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
- package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
- package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
- package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
- package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
- package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
- package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
- package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
- package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
- package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
- package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
- package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
- package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
- package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
- package/benchmark/auto-resolve/scripts/judge.sh +153 -26
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
- package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
- package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
- package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
- package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
- package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
- package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
- package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
- package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
- package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
- package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
- package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
- package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
- package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
- package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
- package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
- package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
- package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
- package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
- package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
- package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
- package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
- package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
- package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
- package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
- package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
- package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
- package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
- package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
- package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
- package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
- package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
- package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
- package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
- package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
- package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
- package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
- package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
- package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
- package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
- package/bin/devlyn.js +210 -17
- package/config/skills/_shared/adapters/README.md +3 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
- package/config/skills/_shared/adapters/opus-4-7.md +9 -1
- package/config/skills/_shared/archive_run.py +78 -6
- package/config/skills/_shared/codex-config.md +3 -2
- package/config/skills/_shared/codex-monitored.sh +46 -1
- package/config/skills/_shared/collect-codex-findings.py +20 -5
- package/config/skills/_shared/engine-preflight.md +1 -1
- package/config/skills/_shared/runtime-principles.md +5 -8
- package/config/skills/_shared/spec-verify-check.py +2664 -107
- package/config/skills/_shared/verify-merge-findings.py +1369 -19
- package/config/skills/devlyn:ideate/SKILL.md +7 -4
- package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
- package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
- package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
- package/config/skills/devlyn:resolve/SKILL.md +49 -18
- package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
- package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
- package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
- package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
- package/package.json +47 -2
- package/scripts/lint-fixtures.sh +349 -0
- package/scripts/lint-shadow-fixtures.sh +58 -0
- package/scripts/lint-skills.sh +3642 -92
- /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -1,13 +1,14 @@
-# devlyn-cli
+# devlyn-cli resolve Benchmark Suite
 
-One-command
+One-command resolve benchmark that gates every harness change with a ship/rollback decision.
 
 ## Quick start
 
 ```bash
-npx devlyn-cli benchmark # n=1 smoke, all fixtures ×
-npx devlyn-cli benchmark --n 3 # higher confidence for ship decisions
+npx devlyn-cli benchmark # n=1 smoke, all fixtures × 3 arms, judge, report, ship-gate
 npx devlyn-cli benchmark F2 # specific fixture only
+npx devlyn-cli benchmark headroom F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
 npx devlyn-cli benchmark --dry-run # validate suite wiring without model invocation
 npx devlyn-cli benchmark --bless # if ship-gate PASSes, promote this run as the shipped baseline
 npx devlyn-cli benchmark --judge-only --run-id <ID> # re-judge an existing run's artifacts
@@ -17,12 +18,12 @@ Exit code 0 = PASS, 1 = FAIL.
 
 ## What it does
 
-1. For every fixture × arm (`variant` / `bare`):
+1. For every fixture × arm (`variant` / `solo_claude` / `bare`):
    - Prepare a fresh temp copy of `fixtures/test-repo/`.
    - Commit baseline + apply `setup.sh` + commit bench scaffolding.
    - Invoke the arm via an isolated `claude -p` subprocess.
    - Capture `diff.patch`, `transcript.txt`, `timing.json`, run `expected.json::verification_commands`.
-2. For every fixture, invoke
+2. For every fixture, invoke isolated Codex as a blind judge with randomized slots using the 4-axis rubric in `RUBRIC.md`.
 3. Aggregate into `results/<run-id>/report.md` + `summary.json`.
 4. Apply ship-gate thresholds (`scripts/ship-gate.py`). Print verdict.
 5. Append immutable record to `history/runs/<run-id>.json`.
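Step 2 of the list above mentions a blind judge with randomized slots. As a rough illustration only, arm outputs could be shuffled into anonymous slots so the judge never sees arm names. The arm names come from the README; the slot labels, the seeding scheme, and the helper names `assign_blind_slots` and `unblind` are assumptions made for this sketch and are not taken from `judge.sh`.

```python
import random

ARMS = ["variant", "solo_claude", "bare"]  # arm names listed in the README


def assign_blind_slots(arms, seed):
    """Map anonymous slot labels (A, B, ...) to arm names.

    Hypothetical helper: the real judge may label and seed slots differently.
    """
    rng = random.Random(seed)  # seeded so the same run can be replayed
    shuffled = list(arms)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))


def unblind(slot_scores, slot_map):
    """Translate per-slot judge scores back into per-arm scores."""
    return {slot_map[slot]: score for slot, score in slot_scores.items()}


if __name__ == "__main__":
    slots = assign_blind_slots(ARMS, seed="F2-run-1")
    print(slots)  # slot -> arm mapping, hidden from the judge
```

Keeping the slot map outside the judge prompt and applying `unblind` only after scoring is one way to keep the comparison blind while still producing per-arm results.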
@@ -47,16 +48,30 @@ benchmark/auto-resolve/
 │ ├── judge.sh # Codex blind judge for one fixture
 │ ├── compile-report.py # aggregates into report.md + summary.json
 │ ├── ship-gate.py # applies thresholds + writes history record
+│ ├── test-benchmark-arg-parsing.sh
+│ ├── test-ship-gate.sh
 │ ├── run-headroom-candidate.sh
 │ ├── headroom-gate.py # blocks pair measurement without headroom set
 │ ├── test-headroom-gate.sh
+│ ├── test-run-headroom-candidate.sh
 │ ├── run-full-pipeline-pair-candidate.sh
+│ ├── test-run-full-pipeline-pair-candidate.sh
 │ ├── full-pipeline-pair-gate.py
 │ ├── test-full-pipeline-pair-gate.sh
+│ ├── pair-candidate-frontier.py
+│ ├── test-pair-candidate-frontier.sh
+│ ├── audit-pair-evidence.py
+│ ├── test-audit-pair-evidence.sh
+│ ├── audit-headroom-rejections.py
+│ ├── test-audit-headroom-rejections.sh
+│ ├── test-check-f9-artifacts.sh
+│ ├── iter-0033c-l1-summary.py
+│ ├── test-iter-0033c-l1-summary.sh
 │ ├── run-frozen-verify-pair.sh
 │ ├── fetch-swebench-instances.py
 │ ├── collect-swebench-predictions.py
 │ ├── run-swebench-solver-batch.sh
+│ ├── test-run-swebench-solver-batch.sh
 │ ├── prepare-swebench-frozen-case.py
 │ ├── prepare-swebench-frozen-corpus.py
 │ ├── run-swebench-frozen-corpus.sh
@@ -85,58 +100,231 @@ Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`,
 
 1. Copy an existing fixture directory as a template.
 2. Rewrite `metadata.json::intent` with the new task's plain-language intent.
-3. Write `spec.md` (
+3. Write `spec.md` (resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
 4. Fill `expected.json` with concrete verification commands and forbidden patterns.
 5. Document purpose + failure mode in `NOTES.md`.
 6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
 7. Run `bash scripts/lint-fixtures.sh`.
 
+For draft pair candidates, start in `shadow-fixtures/S*` and run
+`bash scripts/lint-shadow-fixtures.sh`. The headroom and pair candidate runners
+accept explicitly named `S*` ids for dry-run checks and candidate measurement,
+but shadow results are read-only signals. Promote a validated task to an active
+`F*` fixture before counting it as golden pair evidence.
+Use `run-suite.sh --suite shadow` only with `--dry-run`; the suite path refuses
+provider and judge runs for shadow fixtures so rejected/smoke controls do not
+spend benchmark budget accidentally.
+Before spending provider calls, write a solo-headroom hypothesis into the
+candidate's `spec.md`: name the visible behavior a capable `solo_claude`
+baseline is expected to miss, and the observable command from `expected.json`
+that would expose that miss. A hypothesis of only "the task is hard" is not
+enough; rework the candidate before measurement. `lint-shadow-fixtures.sh` and
+the candidate runners enforce this as an actionable hypothesis: the fixture
+`spec.md` must contain `solo-headroom hypothesis`, `solo_claude`, `miss`, and a
+backticked observable command matching `expected.json`, with the backticked line
+itself containing `miss` and framed as the command/observable that exposes it.
+For unmeasured high-risk shadow candidates, `NOTES.md` must also include
+`## Solo ceiling avoidance` naming how the candidate differs from the
+solo-saturated `S2`-`S6` controls and why that difference should preserve
+`solo_claude` headroom. If that distinction is not concrete, rework the
+candidate before measurement.
+If a real shadow headroom run fails because the fixture is solo-saturated, record
+the run and score in the fixture's `NOTES.md` and add the fixture to
+`scripts/pair-rejected-fixtures.sh`; `lint-shadow-fixtures.sh` enforces that
+calibrated shadow `FAIL` entries are registered before future provider spend.
+
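The actionable-hypothesis rule described in the added README text can be approximated in a few lines. This is a simplified sketch, not the logic of `lint-shadow-fixtures.sh`: it assumes case-insensitive token matching and treats a backticked command as matching `expected.json` only when it is string-equal to one of the verification commands. The function name `has_actionable_hypothesis` and the sample command `node verifiers/check.js` are hypothetical.

```python
import re

# Tokens the README says a candidate's spec.md must contain.
REQUIRED_TOKENS = ("solo-headroom hypothesis", "solo_claude", "miss")


def has_actionable_hypothesis(spec_text, verification_commands):
    """Approximate the documented spec.md rule (assumptions noted above)."""
    lowered = spec_text.lower()
    if not all(token in lowered for token in REQUIRED_TOKENS):
        return False
    commands = {cmd.strip() for cmd in verification_commands}
    for line in spec_text.splitlines():
        if "miss" not in line.lower():
            continue  # the backticked line itself must mention the miss
        for snippet in re.findall(r"`([^`]+)`", line):
            if snippet.strip() in commands:
                return True
    return False


if __name__ == "__main__":
    spec = (
        "## Solo-headroom hypothesis\n"
        "`solo_claude` is expected to miss rollback ordering; "
        "`node verifiers/check.js` exposes the miss.\n"
    )
    print(has_actionable_hypothesis(spec, ["node verifiers/check.js"]))
    print(has_actionable_hypothesis("The task is hard.", ["node verifiers/check.js"]))
```

A check of this shape rejects the "the task is hard" style of hypothesis because that text names neither the expected miss nor a command that would expose it.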
94
136
|
For L2/pair candidate fixtures, also run:
|
|
95
137
|
|
|
96
138
|
```bash
|
|
97
|
-
bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh
|
|
139
|
+
bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh \
|
|
140
|
+
--bare-max 60 \
|
|
141
|
+
--solo-max 80 \
|
|
142
|
+
--min-fixtures 3 \
|
|
143
|
+
F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
|
|
98
144
|
```
|
|
99
145
|
|
|
146
|
+
The same runner is available through `npx devlyn-cli benchmark headroom ...`.
|
|
100
147
|
This runs only the arms needed for calibration (`bare` and `solo_claude`),
|
|
101
148
|
blind-judges them, and applies `headroom-gate.py`. A candidate set is not
|
|
102
149
|
usable for pair measurement unless at least two fixtures pass and each fixture
|
|
103
|
-
has
|
|
104
|
-
|
|
105
|
-
|
|
150
|
+
has evidence-complete `bare <= 60` and `solo_claude <= 80` scores with the
|
|
151
|
+
default minimum 5-point `bare`/`solo_claude` headroom margin.
|
|
152
|
+
The runner prints the headroom gate markdown report to stdout, including the
|
|
153
|
+
startup `Gate:` line and the fixture score table with bare score, bare
|
|
154
|
+
headroom, solo_claude score, solo_claude headroom, status, and reason columns. When launched
|
|
155
|
+
through `npx devlyn-cli benchmark headroom`, the replay `Command:` uses the
|
|
156
|
+
same package CLI path.
|
|
157
|
+
For passing sets, the report also prints average and minimum `bare`/`solo_claude`
|
|
158
|
+
headroom plus the fixture pass count, so ceiling-near, threshold-fragile, or
|
|
159
|
+
under-count candidate sets are visible before spending pair arms.
|
|
160
|
+
It explicitly reports whether the candidate set was accepted or rejected.
|
|
161
|
+
Evidence-clean means the measured arm has complete artifacts, no deterministic
|
|
162
|
+
or judge disqualifier, all expected verification commands pass, and any
|
|
163
|
+
skill-pipeline verdict is non-blocking (`PASS` or `PASS_WITH_ISSUES`). A
|
|
164
|
+
one-fixture calibration run can show useful scores but does not satisfy the set
|
|
165
|
+
gate. Add `--dry-run` to validate args, fixture ids, minimum fixture count, and
|
|
166
|
+
the replay command without running arms or judges.
|
|
167
|
+
Known rejected or ceiling-saturated fixtures are refused by default in the
|
|
168
|
+
headroom runner; use `--allow-rejected-fixtures` only for diagnostics of
|
|
169
|
+
rejected fixtures or calibrated shadow controls, not for new pair-evidence
|
|
170
|
+
candidate selection. Retired fixtures are preserved for historical artifact replay
|
|
171
|
+
and are not rerun by the pair-candidate runners.
|
|
172
|
+
Before spending new provider calls, inspect the active candidate frontier:
|
|
173
|
+
|
|
174
|
+
```bash
|
|
175
|
+
python3 benchmark/auto-resolve/scripts/pair-candidate-frontier.py \
|
|
176
|
+
--out-md /tmp/devlyn-pair-frontier.md
|
|
177
|
+
npx devlyn-cli benchmark recent
|
|
178
|
+
npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
|
|
179
|
+
npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
`benchmark recent` is the reader-facing version of the current evidence set: it
|
|
183
|
+
prints a compact, wrap-safe status block, pair-lift aggregates, and one card per
|
|
184
|
+
passing pair-evidence fixture. Use it for PR comments and release notes when a
|
|
185
|
+
wide frontier table would wrap poorly.
|
|
186
|
+
The frontier report lists active fixtures as `rejected`,
|
|
187
|
+
`pair_evidence_passed`, or `candidate_unmeasured`, using the same rejected
|
|
188
|
+
fixture registry and local full-pipeline gate artifacts. It also prints stdout
|
|
189
|
+
summary rows with `bare`, `solo_claude`, `pair`, pair arm, margin, wall ratio, run id, verdict, and trigger reasons for
|
|
190
|
+
fixtures that already have complete pair evidence rows, plus average/minimum pair margin and wall ratio,
|
|
191
|
+
even when writing `--out-md` or `--out-json`. The markdown artifact also carries
|
|
192
|
+
the overall verdict plus row-level verdict, pair-arm, and trigger-reason columns.
|
|
193
|
+
Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
|
|
194
|
+
and the report includes a Markdown `Hypothesis trigger` column, so strict regenerated
|
|
195
|
+
evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
|
|
196
|
+
After a headroom run fails, audit that any active failed fixture without passing
|
|
197
|
+
pair evidence is either rejected or reworked before more provider spend. The
|
|
198
|
+
same audit also rejects active registry entries whose reason cites a run id or
|
|
199
|
+
score that is not backed by a matching local headroom artifact:
|
|
200
|
+
|
|
201
|
+
```bash
|
|
202
|
+
python3 benchmark/auto-resolve/scripts/audit-headroom-rejections.py
|
|
203
|
+
npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
|
|
204
|
+
```
|
|
205
|
+
|
|
206
|
+
For release or handoff checks where open candidates are not acceptable, add
|
|
207
|
+
`--fail-on-unmeasured` to the frontier command so any active
|
|
208
|
+
`candidate_unmeasured` fixture becomes a nonzero exit.
|
|
209
|
+
The package CLI exposes that release/handoff guard as one command:
|
|
210
|
+
|
|
211
|
+
```bash
|
|
212
|
+
npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
|
|
213
|
+
npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
It writes `audit.json` with the frontier summary and an artifact map (`artifacts`), plus
|
|
217
|
+
`frontier.json`, `frontier.stdout`, `frontier.stderr`, `headroom-audit.json`, and child stdout/stderr logs, prints the same frontier score rows for existing complete pair
|
|
218
|
+
evidence rows, and embeds those compact trigger-backed verdict-bearing score rows in
|
|
219
|
+
`audit.json` as `pair_evidence_rows` (each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`). It fails if either active unmeasured pair candidates or unrecorded
|
|
220
|
+
headroom failures remain. By default it also revalidates frontier `verdict: PASS`
|
|
221
|
+
and zero unmeasured candidates, requires at least four active fixtures with passing pair evidence,
|
|
222
|
+
and requires each counted evidence row to satisfy `pair_mode: true`, the default 5-point pair margin, and the default 3x pair/solo wall-ratio limit.
|
|
223
|
+
The audit stdout also prints `headroom_rejections=...`,
|
|
224
|
+
`pair_evidence_quality=...`, `pair_trigger_reasons=...`,
|
|
225
|
+
`pair_evidence_hypotheses=...` and
|
|
226
|
+
`pair_evidence_hypothesis_triggers=...` handoff rows, plus
|
|
227
|
+
`pair_trigger_historical_aliases=...` when archived evidence includes legacy
|
|
228
|
+
trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
|
|
229
|
+
hypotheses have not yet propagated into trigger reasons, with rejected-fixture
|
|
230
|
+
coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
|
|
231
|
+
and canonical trigger reason coverage plus row-match status. The compact evidence row count must match the frontier evidence count, so incomplete local score
|
|
232
|
+
artifacts cannot inflate the claim. `checks.frontier_stdout` records summary,
|
|
233
|
+
aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts, `checks.pair_evidence_quality`
|
|
234
|
+
records the same quality thresholds from the compact rows,
|
|
235
|
+
`checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status
|
|
236
|
+
for handoff review, `checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts, and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details.
|
|
237
|
+
Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
|
|
238
|
+
archived-evidence WARN rows into release-blocking FAIL rows for newly
|
|
239
|
+
regenerated pair evidence.
|
|
240
|
+
Historical trigger aliases are only reported for archived artifact review;
|
|
241
|
+
current pair-evidence gates fail historical-only or unknown trigger reasons and
|
|
242
|
+
require at least one canonical `pair_trigger.reasons` entry.
|
|
243
|
+
`checks.headroom_rejections` records the child verdict plus unrecorded and
|
|
244
|
+
unsupported registry-rejection counts, so handoff review can see rejected-fixture
|
|
245
|
+
coverage without opening the child artifact first.
|
|
246
|
+
Override `--min-pair-evidence`, `--min-pair-margin`, or
|
|
247
|
+
`--max-pair-solo-wall-ratio` only for narrower diagnostics.
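
As a rough illustration of the trigger checks the audit applies to `pair_evidence_rows`, the Python sketch below validates row-shaped dictionaries. The three `pair_trigger_*` field names come from the text above. The helper name `check_pair_evidence_rows`, the `fixture` key, and the example rows are hypothetical. The real audit performs more checks (margins, wall ratios, hypothesis matching against `expected.json`) than are shown here.

```python
from typing import Any


def check_pair_evidence_rows(rows: list[dict[str, Any]], min_rows: int = 4) -> list[str]:
    """Return human-readable failures; an empty list means the rows pass this sketch."""
    failures: list[str] = []
    # Mirrors the "at least four active fixtures with passing pair evidence" default.
    if len(rows) < min_rows:
        failures.append(f"only {len(rows)} evidence rows, need {min_rows}")
    for row in rows:
        name = row.get("fixture", "<unknown>")  # hypothetical key, used only in messages
        if row.get("pair_trigger_eligible") is not True:
            failures.append(f"{name}: pair trigger not eligible")
        if not row.get("pair_trigger_reasons"):
            failures.append(f"{name}: empty pair_trigger_reasons")
        if row.get("pair_trigger_has_canonical_reason") is not True:
            failures.append(f"{name}: no canonical trigger reason")
    return failures


# Hypothetical rows shaped like the compact audit rows described above.
example_rows = [
    {
        "fixture": fixture_id,
        "pair_trigger_eligible": True,
        "pair_trigger_reasons": ["spec.solo_headroom_hypothesis"],
        "pair_trigger_has_canonical_reason": True,
    }
    for fixture_id in ("F16", "F21", "F23", "F25")
]
print(check_pair_evidence_rows(example_rows))  # prints []
```

Returning a list of failures instead of a boolean keeps every rejected row visible in one pass, which fits the handoff-review use described above.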
|
|
248
|
+
|
|
249
|
+
When changing the calibration/pair evidence gates, run:
|
|
106
250
|
|
|
107
251
|
```bash
|
|
252
|
+
bash scripts/lint-fixtures.sh
|
|
253
|
+
bash benchmark/auto-resolve/scripts/test-ship-gate.sh
|
|
254
|
+
bash benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh
|
|
255
|
+
bash benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh
|
|
256
|
+
bash benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh
|
|
257
|
+
bash benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh
|
|
258
|
+
bash benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh
|
|
108
259
|
bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
|
|
260
|
+
bash benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh
|
|
261
|
+
bash benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh
|
|
262
|
+
bash benchmark/auto-resolve/scripts/test-lint-fixtures.sh
|
|
263
|
+
bash benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh
|
|
264
|
+
bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
|
|
265
|
+
bash benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh
|
|
266
|
+
bash benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh
|
|
267
|
+
bash benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh
|
|
268
|
+
bash benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh
|
|
269
|
+
bash benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh
|
|
109
270
|
```
|
|
110
271
|
|
|
111
|
-
|
|
112
|
-
|
|
113
|
-
|
|
272
|
+
`build-pair-eligible-manifest.py` writes `selection_rule.rejected_excluded`
|
|
273
|
+
with the rejected fixture ids removed from Gate 3, and
|
|
274
|
+
`selection_rule.rejected_excluded_reasons` with the exact registry reason for
|
|
275
|
+
each removed id. This keeps the manifest self-explaining when F31/F32-style
|
|
276
|
+
solo-ceiling controls are excluded from pair-lift evidence.
|
|
277
|
+
|
|
278
|
+
After a full-pipeline pair run has the calibrated arms (`bare`, `solo_claude`,
|
|
279
|
+
and the selected pair arm, default `l2_risk_probes`) plus a blind `judge.json`,
|
|
280
|
+
gate it separately:
|
|
114
281
|
|
|
115
282
|
```bash
|
|
116
283
|
bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
|
|
284
|
+
--min-fixtures 3 \
|
|
117
285
|
--max-pair-solo-wall-ratio 3 \
|
|
118
|
-
|
|
286
|
+
F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
|
|
119
287
|
```
|
|
120
288
|
|
|
289
|
+
The same runner is available through `npx devlyn-cli benchmark pair ...`.
|
|
121
290
|
The runner executes `bare` + `solo_claude`, applies `headroom-gate.py`, and
|
|
122
|
-
only then spends
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
291
|
+
only then spends the selected pair arm. Pair arms are limited to current
|
|
292
|
+
proof (`l2_risk_probes`) or diagnostic replay (`l2_gated`); `l2_forced` is
|
|
293
|
+
retired and rejected. It prints the exact replay command plus each gate's
|
|
294
|
+
markdown report to stdout, including startup `Headroom:` / `Pair:` lines,
|
|
295
|
+
fixture pass count, average pair margin, and the fixture score table with bare,
|
|
296
|
+
solo_claude, pair, margin, pair-mode, trigger-reason, and wall-ratio columns; if headroom or pair
|
|
297
|
+
evidence fails, the report is printed before the runner exits non-zero. If
|
|
298
|
+
headroom fails, the runner explicitly says the pair arm was not executed; if
|
|
299
|
+
the final pair gate fails, it explicitly says pair evidence was rejected. When
|
|
300
|
+
both gates pass, it explicitly says the selected pair arm is being executed and
|
|
301
|
+
then that pair evidence was accepted. When launched through
|
|
302
|
+
`npx devlyn-cli benchmark pair`, the replay `Command:` uses
|
|
303
|
+
the same package CLI path. Add `--dry-run` to
|
|
304
|
+
validate args, fixture ids, minimum fixture count, and the replay command
|
|
305
|
+
without running arms or judges. Known rejected or ceiling-saturated fixtures
|
|
306
|
+
are refused by default here too; use `--allow-rejected-fixtures` only for
|
|
307
|
+
diagnostics of rejected fixtures or calibrated shadow controls. Retired fixtures
|
|
308
|
+
remain historical replay artifacts and are not rerun by this candidate runner.
|
|
309
|
+
To gate already-existing artifacts:
|
|
310
|
+
|
|
311
|
+
When a prompt-only pair change needs a fresh `l2_risk_probes` measurement but
|
|
312
|
+
the calibrated `bare` + `solo_claude` arms are already evidence-complete, reuse
|
|
313
|
+
them into a new run id:
|
|
127
314
|
|
|
128
315
|
```bash
|
|
129
316
|
bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
|
|
130
317
|
--run-id <new-run-id> \
|
|
131
318
|
--reuse-calibrated-from <prior-headroom-run-id> \
|
|
319
|
+
--min-fixtures 3 \
|
|
132
320
|
--max-pair-solo-wall-ratio 3 \
|
|
133
|
-
|
|
321
|
+
F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
|
|
134
322
|
```
|
|
135
323
|
|
|
136
324
|
```bash
|
|
137
325
|
python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
|
|
138
326
|
--run-id <full-pipeline-run-id> \
|
|
139
|
-
--min-fixtures
|
|
327
|
+
--min-fixtures 3 \
|
|
140
328
|
--min-pair-margin 5 \
|
|
141
329
|
--max-pair-solo-wall-ratio 3 \
|
|
142
330
|
--out-json benchmark/auto-resolve/results/<full-pipeline-run-id>/full-pipeline-pair-gate.json \
|
|
@@ -144,16 +332,67 @@ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
|
|
|
144
332
|
```
|
|
145
333
|
|
|
146
334
|
This is the full-pipeline claim gate: each counted fixture must satisfy the
|
|
147
|
-
headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
335
|
+
headroom precondition (`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins), the selected pair arm must be evidence-clean,
|
|
336
|
+
`pair_mode` must be true in the captured resolve state, the pair trigger must be
|
|
337
|
+
eligible with non-empty reasons and at least one canonical reason, fixtures with an actionable solo-headroom hypothesis must include `spec.solo_headroom_hypothesis` in the trigger reasons, the pair/solo wall-time
|
|
338
|
+
ratio must stay within the default 3x limit, and the blind judge must score the
|
|
339
|
+
pair arm at least `--min-pair-margin` above `solo_claude`. The report separates
|
|
340
|
+
the allowed pair/solo wall ratio from the maximum observed pair/solo wall ratio,
|
|
341
|
+
records `require_hypothesis_trigger` in JSON, and includes a Markdown
|
|
342
|
+
`Hypothesis trigger` column for each fixture row.
|
|
343
|
+
The judge
|
|
344
|
+
file must also map `bare`, `solo_claude`, and the selected pair arm in
|
|
345
|
+
`_blind_mapping`; `scores_by_arm` alone is not evidence.
|
|
346
|
+
`l2_risk_probes` is the current measured pair arm for the
|
|
347
|
+
F16/F23/F25 gate: `20260510-f16-f23-f25-combined-proof` passed with margins
|
|
348
|
+
+21, +31, and +24, average pair margin +25.3, and average pair/solo wall ratio
|
|
349
|
+
1.73x. Earlier F16/F25 evidence also passes the current gate in
|
|
350
|
+
`20260509-f16-f25-combined-cartprobe-v2`.
|
|
351
|
+
Additional focused F21 evidence: `20260511-f21-current-riskprobes-v1` passed
|
|
352
|
+
with `bare` 33, `solo_claude` 66, `l2_risk_probes` 99, margin +33, pair mode true, and
|
|
353
|
+
pair/solo wall ratio 1.47x, and is counted by `benchmark audit` as the fourth passing pair-evidence row. Do not count ceiling/control fixtures as pair
|
|
354
|
+
evidence: F22 and F26 are
|
|
355
|
+
currently rejected because existing headroom runs put `solo_claude` at 98. F27
|
|
356
|
+
subscription proration is also rejected in its first headroom smoke:
|
|
357
|
+
`20260511-f27-headroom-smoke-061401` measured bare 33 / solo_claude 94, with bare
|
|
358
|
+
verification passing only 1 of 3 commands. Rework or rotate F27 before spending
|
|
359
|
+
a pair arm on it. F28 return authorization is rejected as pair-lift evidence:
|
|
360
|
+
earlier unstable runs `20260511-f28-headroom-smoke-085307` and
|
|
361
|
+
`20260511-f28-pair-smoke-091021` were superseded after a hidden-oracle bug was
|
|
362
|
+
found. The oracle had expected a defective item to bypass expiration, which the
|
|
363
|
+
visible spec does not require. After re-verifying the same provider diffs
|
|
364
|
+
against the corrected oracle, `20260511-f28-policy-oraclefix-reverified-pair`
|
|
365
|
+
scored bare 50 / solo_claude 98 / `l2_risk_probes` 96, margin -2, and failed headroom.
|
|
366
|
+
Rework or rotate F28 before spending more pair arms.
|
|
367
|
+
F30 credit hold settlement is also rejected: `20260511-f30-headroom-v1` scored
|
|
368
|
+
bare 33 / solo_claude 98, so it failed the `solo_claude` headroom precondition before any pair
|
|
369
|
+
arm should be spent. F15 frozen-diff race review is now a ceiling/control
|
|
370
|
+
fixture too: `20260511-f15-concurrency-headroom` scored bare 99 / solo_claude 94, so
|
|
371
|
+
it failed both headroom preconditions. F3 backend contract risk is also
|
|
372
|
+
rejected after tightening its HTTP error-body verifier:
|
|
373
|
+
`20260511-f3-http-error-headroom` scored bare 97 / solo_claude 99. F2 medium CLI is
|
|
374
|
+
rejected by `20260512-f2-medium-headroom`: bare 83 / solo_claude 95, so both baseline
|
|
375
|
+
scores exceed headroom ceilings. F4 web browser design is rejected by
|
|
376
|
+
`20260512-f4-web-headroom`: bare 70 / solo_claude 92 with bare disqualifiers, so it
|
|
377
|
+
needs rework before pair arms. F5 fix-loop is rejected by
|
|
378
|
+
`20260512-f5-fixloop-headroom`: bare 99 / solo_claude 99, with `bare` and `solo_claude` each
|
|
379
|
+
passing 5/5 verification commands. F6 dep-audit checksum is rejected by
|
|
380
|
+
`20260512-f6-checksum-headroom`: bare 97 / solo_claude 96, with `bare` and `solo_claude` each
|
|
381
|
+
passing 6/6 verification commands. F7 scope discipline is rejected by
|
|
382
|
+
`20260512-f7-scope-headroom`: bare 99 / solo_claude 100, with `bare` and `solo_claude` each
|
|
383
|
+
passing 6/6 verification commands. F9 ideate-to-resolve remains the novice-flow
|
|
384
|
+
anchor but is rejected as pair evidence by `20260512-f9-e2e-headroom`: bare 60 /
|
|
385
|
+
solo_claude 90 with bare headroom 0 and a bare judge disqualifier, despite passing F9
|
|
386
|
+
artifact checks. Rework it before spending pair arms. F1 and F8 are rejected by
|
|
387
|
+
design as calibration/known-limit controls, not pair-lift evidence candidates.
|
|
388
|
+
F10/F11 are also rejected by `20260507-f10-f11-tier1-full-pipeline`: F10 scored
|
|
389
|
+
bare 75 / solo_claude 94, and F11 scored bare 98 / solo_claude 97. F12 webhook signature/replay is rejected by
|
|
390
|
+
`20260511-f12-webhook-headroom`: bare 85 / solo_claude 99.
|
|
391
|
+
F31 seat rebalance is rejected by `20260512-f31-seat-rebalance-headroom`: bare
|
|
392
|
+
33 / solo_claude 98, with bare judge/result/verify disqualifiers. Rework it before
|
|
393
|
+
spending pair arms. F32 subscription renewal is rejected by
|
|
394
|
+
`20260512-f32-subscription-renewal-headroom`: bare 33 / solo_claude 98, so it is a
|
|
395
|
+
solo-ceiling billing rollback/shape control rather than pair-lift evidence.
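
The per-fixture conditions above can be sketched in Python as follows. This is a simplified illustration, not the logic of `full-pipeline-pair-gate.py`: it covers only the score ceilings, `pair_mode`, the wall-ratio limit, and the pair margin, and it omits the trigger-reason, evidence-cleanliness, and headroom-margin checks. The names `FixtureScores` and `gate_fixture` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FixtureScores:
    bare: int
    solo_claude: int
    pair: int
    pair_mode: bool
    wall_ratio: float  # pair wall time divided by solo wall time


def gate_fixture(
    scores: FixtureScores,
    min_pair_margin: int = 5,
    max_wall_ratio: float = 3.0,
    bare_ceiling: int = 60,
    solo_ceiling: int = 80,
) -> list[str]:
    """Return the reasons a fixture fails this simplified gate (empty list = pass)."""
    reasons: list[str] = []
    if scores.bare > bare_ceiling:
        reasons.append(f"bare {scores.bare} > {bare_ceiling}")
    if scores.solo_claude > solo_ceiling:
        reasons.append(f"solo_claude {scores.solo_claude} > {solo_ceiling}")
    if not scores.pair_mode:
        reasons.append("pair_mode is not true")
    if scores.wall_ratio > max_wall_ratio:
        reasons.append(f"wall ratio {scores.wall_ratio} > {max_wall_ratio}")
    margin = scores.pair - scores.solo_claude
    if margin < min_pair_margin:
        reasons.append(f"pair margin {margin} < {min_pair_margin}")
    return reasons


# F21 numbers as reported above: bare 33, solo_claude 66, pair 99, ratio 1.47x.
f21 = FixtureScores(bare=33, solo_claude=66, pair=99, pair_mode=True, wall_ratio=1.47)
print(gate_fixture(f21))  # prints []

# F28 scores as reported above; pair_mode and wall_ratio are placeholders,
# not values taken from the run.
f28 = FixtureScores(bare=50, solo_claude=98, pair=96, pair_mode=True, wall_ratio=1.0)
print(gate_fixture(f28))
```

With these inputs F21 passes every modeled check, while F28 fails both the `solo_claude` ceiling and the pair margin.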
|
|
157
396
|
|
|
158
397
|
Commands that reference `BENCH_FIXTURE_DIR` are hidden post-run oracles: they
|
|
159
398
|
are not staged into BUILD_GATE's `.devlyn/spec-verify.json`.
|
|
@@ -175,8 +414,10 @@ diagnostics. Use non-empty diffs only; empty diffs fail fast because they are
|
|
|
175
414
|
not valid pair evidence.
|
|
176
415
|
Hidden verifier context is available during VERIFY, so this runner prevents
|
|
177
416
|
IMPLEMENT contamination but is not an oracle-blind judge setup.
|
|
178
|
-
The runner writes `compare.json`; `pair_verdict_lift: true`
|
|
179
|
-
actually ran and found a verdict-binding issue that solo
|
|
417
|
+
The runner writes `compare.json` and `compare.md`; `pair_verdict_lift: true`
|
|
418
|
+
means pair VERIFY actually ran and found a verdict-binding issue that solo
|
|
419
|
+
VERIFY did not. It also prints a replay `Command:` block before invoking
|
|
420
|
+
providers and a final solo/pair summary table.
|
|
180
421
|
If an imported case has no deterministic `verification_commands`, the runner
|
|
181
422
|
does not create `.devlyn/spec-verify.json`; an empty carrier is malformed by the
|
|
182
423
|
normal real-user contract and must not block qualitative frozen review.
|
|
@@ -187,6 +428,7 @@ To gate a set of frozen VERIFY results mechanically:
|
|
|
187
428
|
python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
|
|
188
429
|
--run-id 20260505T173913Z-9986cd3-frozen-verify \
|
|
189
430
|
--run-id 20260505T230215Z-9986cd3-frozen-verify \
|
|
431
|
+
--require-hypothesis-trigger \
|
|
190
432
|
--max-pair-solo-wall-ratio 3 \
|
|
191
433
|
--out-json benchmark/auto-resolve/results/frozen-verify-gate-20260505.json \
|
|
192
434
|
--out-md benchmark/auto-resolve/results/frozen-verify-gate-20260505.md
|
|
@@ -203,12 +445,19 @@ full-pipeline pair superiority. It proves only that, after the implementation
|
|
|
203
445
|
diff is frozen, gated pair VERIFY fires and returns a stricter verdict-binding
|
|
204
446
|
result than solo VERIFY on the same diff. Each supplied run must cover a
|
|
205
447
|
distinct fixture; repeated runs of the same fixture do not count as independent
|
|
448
|
+
evidence. For new measurements, pass `--require-hypothesis-trigger` so any
|
|
449
|
+
fixture spec with an actionable solo-headroom hypothesis must also expose
|
|
450
|
+
`spec.solo_headroom_hypothesis` in `pair_trigger.reasons`; omit it only when
|
|
451
|
+
re-gating historical artifacts that predate that trigger reason.
|
|
206
452
|
corpus growth. `--max-pair-solo-wall-ratio` is optional, but use it for
|
|
207
453
|
ship-style evidence so quality lift is not accepted without a reasonable
|
|
208
454
|
wall-time bound. The gate infers the fixture id from the runner input metadata;
|
|
209
455
|
artifacts without that metadata, or with a fixture id absent from
|
|
210
456
|
the selected `--fixtures-root`, fail instead of being counted as anonymous or
|
|
211
|
-
fake evidence.
|
|
457
|
+
fake evidence. JSON rows expose `pair_trigger_reasons` and
|
|
458
|
+
`pair_trigger_has_canonical_reason`; Markdown output includes a `Triggers`
|
|
459
|
+
column so reviewers can see which canonical pair trigger made the evidence
|
|
460
|
+
eligible.
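
The distinct-fixture rule can be illustrated with a short Python sketch. It assumes the run-to-fixture mapping has already been read from the runner input metadata. The function name `duplicate_fixtures` and the run and fixture ids are hypothetical, and the real gate's data layout may differ.

```python
from collections import defaultdict


def duplicate_fixtures(run_to_fixture: dict[str, str]) -> dict[str, list[str]]:
    """Map each fixture id covered by more than one run to the run ids that cover it."""
    runs_by_fixture: dict[str, list[str]] = defaultdict(list)
    for run_id, fixture_id in run_to_fixture.items():
        runs_by_fixture[fixture_id].append(run_id)
    return {fid: runs for fid, runs in runs_by_fixture.items() if len(runs) > 1}


# Hypothetical run ids and fixture ids, for illustration only.
runs = {
    "run-a": "fixture-1",
    "run-b": "fixture-2",
    "run-c": "fixture-1",
}
print(duplicate_fixtures(runs))  # prints {'fixture-1': ['run-a', 'run-c']}
```

An empty result means every supplied run covers its own fixture. Any entry in the result marks runs that should not be counted as independent evidence.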
|
|
212
461
|
|
|
213
462
|
### SWE-bench fixed-diff review pilot
|
|
214
463
|
|
|
@@ -292,6 +541,10 @@ bash benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh \
|
|
|
292
541
|
--out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
|
|
293
542
|
```
|
|
294
543
|
|
|
544
|
+
The corpus runner prints a replay `Command:` block before invoking providers or
|
|
545
|
+
gating existing run ids, so frozen VERIFY score runs can be reproduced from the
|
|
546
|
+
captured stdout.
|
|
547
|
+
|
|
295
548
|
To re-gate existing run ids without re-invoking providers, write one run id per
|
|
296
549
|
line and pass `--gate-only-run-ids <file>` with the same manifest. For large
|
|
297
550
|
tranches, keep `--run-ids-out` and use `--resume-completed-arms` on retries:
|
|
@@ -345,6 +598,7 @@ python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
|
|
|
345
598
|
--run-id <swebench-frozen-run-2> \
|
|
346
599
|
--run-id <swebench-frozen-run-3> \
|
|
347
600
|
--min-runs 3 \
|
|
601
|
+
--require-hypothesis-trigger \
|
|
348
602
|
--max-pair-solo-wall-ratio 3 \
|
|
349
603
|
--out-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
|
|
350
604
|
--out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
|
|
@@ -359,8 +613,14 @@ inspect `avg_pair_solo_wall_ratio` plus each row's `pair_solo_wall_ratio`.
|
|
|
359
613
|
For selection-bias control, render every run in the attempted pilot, not just
|
|
360
614
|
gate rows. The matrix reports verdict-lift rows separately from recall-only
|
|
361
615
|
rows where pair found additional findings but did not change the binding
|
|
362
|
-
verdict. It also reports
|
|
363
|
-
|
|
616
|
+
verdict. It also reports pair-trigger eligibility/contract failures,
|
|
617
|
+
trigger reasons, canonical-trigger coverage, classification counts, gate rate,
|
|
618
|
+
and trailing non-gate rows. Its Markdown table includes a `Triggers` column.
|
|
619
|
+
For new measurements, pass `--fixtures-root` with
|
|
620
|
+
`--require-hypothesis-trigger` so matrix rows classify any missing
|
|
621
|
+
`spec.solo_headroom_hypothesis` trigger reason as a pair-trigger contract
|
|
622
|
+
failure instead of leaving it to the gate artifact alone.
|
|
623
|
+
Use the optional yield thresholds when the matrix is meant to
|
|
364
624
|
fail closed instead of only documenting that additional rows are adding
|
|
365
625
|
controls without strengthening the proof gate:
|
|
366
626
|
|
|
@@ -368,9 +628,11 @@ controls without strengthening the proof gate:
|
|
|
368
628
|
python3 benchmark/auto-resolve/scripts/swebench-frozen-matrix.py \
|
|
369
629
|
--title "SWE-bench Lite Frozen VERIFY Matrix" \
|
|
370
630
|
--verdict MIXED_WITH_GATE_PASS \
|
|
631
|
+
--fixtures-root benchmark/auto-resolve/external/swebench/cases \
|
|
371
632
|
--gate-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
|
|
372
633
|
--run-id <swebench-frozen-run-1> \
|
|
373
634
|
--run-id <swebench-frozen-run-2> \
|
|
635
|
+
--require-hypothesis-trigger \
|
|
374
636
|
--min-gate-rate 0.25 \
|
|
375
637
|
--max-trailing-non-gate 10 \
|
|
376
638
|
--out-json benchmark/auto-resolve/results/swebench-frozen-matrix.json \
|
|
@@ -391,9 +653,9 @@ the diff is frozen.
|
|
|
391
653
|
|
|
392
654
|
## LLM-upgrade resilience
|
|
393
655
|
|
|
394
|
-
- **No model hardcoding.** Judge runs
|
|
395
|
-
- **Margin-based gates.** Ship thresholds use
|
|
396
|
-
- **Saturation rotation.** When
|
|
656
|
+
- **No model hardcoding.** Judge runs Codex without `-m`, inheriting whichever flagship the CLI currently ships. The call is isolated from user config/rules/hooks so local agent instructions cannot contaminate the blind judgment. Each run captures `_judge_model` for historical provenance.
|
|
657
|
+
- **Margin-based gates.** Ship thresholds use pairwise margins, not absolute score. `solo_claude`-`bare` measures solo harness value; pair-`solo_claude` measures pair value on pair-eligible fixtures. As models improve, margin remains the meaningful harness-added signal.
|
|
658
|
+
- **Saturation rotation.** When all compared gated arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
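
The saturation rule in the last bullet could be checked with a small Python helper like the one below. This is a sketch under stated assumptions: `is_saturated` is a hypothetical name, the input is a list of per-version score dictionaries ordered oldest to newest, and the example scores are invented for illustration.

```python
def is_saturated(scores_by_version: list[dict[str, int]], threshold: int = 95) -> bool:
    """True when every compared arm exceeds `threshold` in each of the last two versions."""
    if len(scores_by_version) < 2:
        return False  # the rule needs two shipped versions of history
    for version_scores in scores_by_version[-2:]:
        if not version_scores:
            return False  # no arms recorded, so nothing can be called saturated
        if any(score <= threshold for score in version_scores.values()):
            return False
    return True


# Hypothetical per-version scores for one fixture, oldest first.
history = [
    {"bare": 90, "solo_claude": 97},
    {"bare": 99, "solo_claude": 99},
    {"bare": 97, "solo_claude": 96},
]
print(is_saturated(history))  # prints True
```

Only the two most recent versions are inspected, so an older unsaturated version (the first entry here) does not prevent rotation.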
|
|
397
659
|
|
|
398
660
|
## Ship gates (summary — see `RUBRIC.md` for full spec)
|
|
399
661
|
|
|
@@ -401,7 +663,7 @@ Hard floors (any one fails → block):
|
|
|
401
663
|
|
|
402
664
|
- Zero variant disqualifier (silent catch, fabricated verification, extra deps beyond `max_deps_added`, etc.).
|
|
403
665
|
- `F9-e2e-ideate-to-resolve` must PASS (novice-flow contract).
|
|
404
|
-
- ≥ 7
|
|
666
|
+
- ≥ 7 gated, headroom-available fixtures have margin ≥ +5.
|
|
405
667
|
- No per-fixture regression worse than −5 vs last shipped baseline.
|
|
406
668
|
|
|
407
669
|
Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margin, critical-finding catch-rate regression vs last shipped variant.
|
|
@@ -409,15 +671,16 @@ Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margi
|
|
|
409
671
|
## Running the full suite (real)
|
|
410
672
|
|
|
411
673
|
Full real benchmarks usually take 2-3 minutes per arm for simple fixtures and
|
|
412
|
-
up to 15 minutes per arm for strict-route fixtures. A full n=1 run
|
|
413
|
-
|
|
674
|
+
up to 15 minutes per arm for strict-route fixtures. The duration of a full n=1 run depends
|
|
675
|
+
on the selected fixture count; the historical 9-fixture core suite was roughly
|
|
676
|
+
45 min - 3 hrs for 3 arms, while the current extended suite can take longer.
|
|
414
677
|
|
|
415
678
|
```bash
|
|
416
679
|
# Smoke run before ship decisions
|
|
417
680
|
npx devlyn-cli benchmark
|
|
418
681
|
|
|
419
682
|
# Ship-decision run
|
|
420
|
-
npx devlyn-cli benchmark --
|
|
683
|
+
npx devlyn-cli benchmark --label v3.7 --bless
|
|
421
684
|
```
|
|
422
685
|
|
|
423
686
|
## Dry-run
|
|
@@ -9,8 +9,8 @@ prior `history/runs/`.
|
|
|
9
9
|
|
|
10
10
|
## Scoring — 4 axes, 25 points each, 100 total
|
|
11
11
|
|
|
12
|
-
The blind judge scores
|
|
13
|
-
|
|
12
|
+
The blind judge scores all submitted arms on identical axes without knowing
|
|
13
|
+
which label maps to which arm.
|
|
14
14
|
|
|
15
15
|
### Axis 1 — Spec Compliance (0-25)
|
|
16
16
|
|
|
@@ -72,18 +72,23 @@ Disqualifier arms automatically lose the fixture regardless of score.
|
|
|
72
72
|
After the judge finishes every fixture, `scripts/ship-gate.py` applies these
|
|
73
73
|
rules to the run's `summary.json`.
|
|
74
74
|
|
|
75
|
+
This section describes the broad run-suite ship gate. Current solo<pair
|
|
76
|
+
evidence uses the full-pipeline pair gate with an explicit selected pair arm
|
|
77
|
+
(`l2_risk_probes` for proof runs, `l2_gated` for diagnostics), and that gate
|
|
78
|
+
compares the selected pair arm against `solo_claude`.
|
|
79
|
+
|
|
75
80
|
### Hard floors (any one failure blocks ship)
|
|
76
81
|
|
|
77
|
-
1. **No disqualifier-level violation** in variant
|
|
82
|
+
1. **No disqualifier-level violation** in any gated harness arm (legacy suite `variant`/L2 and `solo_claude`/L1 when present).
|
|
78
83
|
2. **F9 (E2E) must PASS** — novice-flow contract.
|
|
79
|
-
3.
|
|
84
|
+
3. **At least 7 gated, headroom-available fixtures** must have the required margin ≥ +5 for each gated contract — legacy `variant`-`bare` (L2-L0) for the suite gate, and `solo_claude`-`bare` (L1-L0) when `solo_claude` is present. This is **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from a contract count when the lower arm is ceiling-near and the higher arm is clean at ceiling. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
|
|
80
85
|
4. **No fixture regression worse than −5** vs. last `baselines/shipped.json` on the same fixture.
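
Hard floors 3 and 4 can be sketched as a single Python check. This is an illustration, not the logic of `scripts/ship-gate.py`: the function name and the dictionary keys `margin`, `headroom_available`, and `delta_vs_shipped` are hypothetical, and the headroom exclusion is assumed to have been computed already.

```python
from typing import Any


def hard_floor_failures(
    fixtures: dict[str, dict[str, Any]],
    min_fixtures: int = 7,
    min_margin: int = 5,
    max_regression: int = -5,
) -> list[str]:
    """Return failures for the coverage floor (3) and the regression floor (4)."""
    failures: list[str] = []
    # Floor 3: count only headroom-available fixtures that clear the margin.
    passing = [
        fid for fid, row in fixtures.items()
        if row["headroom_available"] and row["margin"] >= min_margin
    ]
    if len(passing) < min_fixtures:
        failures.append(f"only {len(passing)} fixtures clear +{min_margin}, need {min_fixtures}")
    # Floor 4: no fixture may regress worse than the allowed delta.
    for fid, row in fixtures.items():
        delta = row.get("delta_vs_shipped")  # None when no shipped baseline exists
        if delta is not None and delta < max_regression:
            failures.append(f"{fid}: regression {delta} worse than {max_regression}")
    return failures


# Hypothetical suite: seven fixtures clear the floor, one is excluded for headroom.
suite = {
    f"fixture-{i}": {"margin": 8, "headroom_available": True, "delta_vs_shipped": 0}
    for i in range(7)
}
suite["fixture-ceiling"] = {"margin": 1, "headroom_available": False, "delta_vs_shipped": -2}
print(hard_floor_failures(suite))  # prints []
```

A regression of exactly -5 still passes in this sketch, matching the "worse than -5" wording of floor 4.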
|
|
81
86
|
|
|
82
87
|
### Soft gates (produce WARNING but do not block)
|
|
83
88
|
|
|
84
89
|
5. Suite average margin drop > 3 vs. last shipped.
|
|
85
90
|
6. A fixture that previously had margin > +5 now has margin ≤ 0.
|
|
86
|
-
7. Critical-finding catch-rate decrease vs. last shipped
|
|
91
|
+
7. Critical-finding catch-rate decrease vs. the last shipped gated harness arm.
|
|
87
92
|
|
|
88
93
|
### Known-limit exception
|
|
89
94
|
|
|
@@ -138,15 +143,15 @@ Every suite run appends an immutable record to `history/runs/<ts>-<label>.json`:
|
|
|
138
143
|
|
|
139
144
|
## Fixture Rotation Policy
|
|
140
145
|
|
|
141
|
-
If any fixture has
|
|
142
|
-
versions, it's saturated and no longer differentiates. Replace with a
|
|
143
|
-
equivalent and record the swap in
|
|
146
|
+
If any fixture has all compared gated arms scoring > 95 for two consecutive
|
|
147
|
+
shipped versions, it's saturated and no longer differentiates. Replace with a
|
|
148
|
+
harder equivalent and record the swap in
|
|
144
149
|
`history/runs/<ts>-fixture-rotation.json`:
|
|
145
150
|
|
|
146
151
|
```json
|
|
147
152
|
{
|
|
148
153
|
"retired": "F1-cli-trivial-flag",
|
|
149
|
-
"retired_reason": "
|
|
154
|
+
"retired_reason": "all compared gated arms > 95 on v3.7 and v3.8 (saturation)",
|
|
150
155
|
"replacement": "F1b-cli-trivial-flag-v2",
|
|
151
156
|
"replacement_rationale": "adds exit-code precedence requirement that current leaders didn't handle on first try"
|
|
152
157
|
}
|
|
@@ -159,10 +164,14 @@ suspected in their area.
|
|
|
159
164
|
|
|
160
165
|
## Why These Thresholds
|
|
161
166
|
|
|
162
|
-
- **+5 margin floor** — below this,
|
|
163
|
-
judge variance (empirically ~±3 per axis).
|
|
164
|
-
|
|
167
|
+
- **+5 margin floor** — below this, the gated harness arm is not reliably
|
|
168
|
+
beating its lower baseline given judge variance (empirically ~±3 per axis).
|
|
169
|
+
For the legacy suite that is `variant` over `bare`; for pair evidence it is
|
|
170
|
+
the selected pair arm over `solo_claude`. Justifying the pipeline cost requires a
|
|
171
|
+
margin clearly above noise.
|
|
165
172
|
- **−5 regression floor** — one-axis regression can look like −5; allowing
|
|
166
173
|
less would let real regressions slip through.
|
|
167
|
-
- **7
|
|
168
|
-
|
|
174
|
+
- **7-fixture coverage floor** — requires a broad enough set of
|
|
175
|
+
headroom-available, non-known-limit fixtures to clear the margin floor. This
|
|
176
|
+
preserves the original core-suite coverage bar without pretending the current
|
|
177
|
+
extended fixture inventory is still exactly nine fixtures.
|
|
@@ -6,6 +6,10 @@ Trivial-tier calibration. Every arm should one-shot this; it's here to catch
|
|
|
6
6
|
catastrophic regressions and to anchor the "saturation" end of the scoring
|
|
7
7
|
scale.
|
|
8
8
|
|
|
9
|
+
Pair-candidate status: rejected by design. Because every arm is expected to
|
|
10
|
+
one-shot F1, it is a trivial calibration/control fixture and must not be used
|
|
11
|
+
as pair-lift evidence.
|
|
12
|
+
|
|
9
13
|
## Failure mode
|
|
10
14
|
|
|
11
15
|
- **Default-behavior regression.** Careless implementations add `--loud`
|
|
@@ -25,6 +29,6 @@ scale.
|
|
|
25
29
|
|
|
26
30
|
## Rotation trigger
|
|
27
31
|
|
|
28
|
-
When both
|
|
29
|
-
a harder trivial fixture (e.g., one that requires
|
|
30
|
-
interacting with existing flag precedence).
|
|
32
|
+
When both `bare` and `solo_claude` score > 95 for two consecutive shipped
|
|
33
|
+
versions, replace with a harder trivial fixture (e.g., one that requires
|
|
34
|
+
handling a new flag interacting with existing flag precedence).
|
|
@@ -57,7 +57,12 @@ prose forces invariant derivation, which is where pair has the edge.
|
|
|
57
57
|
|
|
58
58
|
## Rotation trigger
|
|
59
59
|
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
60
|
+
Headroom run `20260507-f10-f11-tier1-full-pipeline` rejected this fixture as
|
|
61
|
+
pair-lift evidence: bare scored 75 and solo_claude scored 94. Keep it as a
|
|
62
|
+
concurrent persistence control unless the visible contract is reworked to
|
|
63
|
+
expose lower bare/solo ceilings.
|
|
64
|
+
|
|
65
|
+
Retire when both `bare` and `solo_claude` consistently land > 90 across two
|
|
66
|
+
shipped versions, OR when "close-together-write" becomes a recognized pattern
|
|
67
|
+
such that the solo arm reliably reaches for a serializing mechanism on first read.
|
|
63
68
|
Whichever comes first.
|