devlyn-cli 2.3.0 → 2.3.1

Files changed (219)
  1. package/AGENTS.md +1 -1
  2. package/CLAUDE.md +2 -2
  3. package/README.md +80 -29
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +210 -17
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +3 -2
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +1 -1
  201. package/config/skills/_shared/runtime-principles.md +5 -8
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  205. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  206. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  207. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  208. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  209. package/config/skills/devlyn:resolve/SKILL.md +49 -18
  210. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  211. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  212. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  213. package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
  214. package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
  215. package/package.json +47 -2
  216. package/scripts/lint-fixtures.sh +349 -0
  217. package/scripts/lint-shadow-fixtures.sh +58 -0
  218. package/scripts/lint-skills.sh +3642 -92
  219. /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -1,13 +1,14 @@
- # devlyn-cli auto-resolve Benchmark Suite
+ # devlyn-cli resolve Benchmark Suite
 
- One-command A/B benchmark that gates every harness change with a ship/rollback decision.
+ One-command resolve benchmark that gates every harness change with a ship/rollback decision.
 
  ## Quick start
 
  ```bash
- npx devlyn-cli benchmark # n=1 smoke, all fixtures × 2 arms, judge, report, ship-gate
- npx devlyn-cli benchmark --n 3 # higher confidence for ship decisions
+ npx devlyn-cli benchmark # n=1 smoke, all fixtures × 3 arms, judge, report, ship-gate
  npx devlyn-cli benchmark F2 # specific fixture only
+ npx devlyn-cli benchmark headroom F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
+ npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
  npx devlyn-cli benchmark --dry-run # validate suite wiring without model invocation
  npx devlyn-cli benchmark --bless # if ship-gate PASSes, promote this run as the shipped baseline
  npx devlyn-cli benchmark --judge-only --run-id <ID> # re-judge an existing run's artifacts
@@ -17,12 +18,12 @@ Exit code 0 = PASS, 1 = FAIL.
 
  ## What it does
 
- 1. For every fixture × arm (`variant` / `bare`):
+ 1. For every fixture × arm (`variant` / `solo_claude` / `bare`):
  - Prepare a fresh temp copy of `fixtures/test-repo/`.
  - Commit baseline + apply `setup.sh` + commit bench scaffolding.
  - Invoke the arm via an isolated `claude -p` subprocess.
  - Capture `diff.patch`, `transcript.txt`, `timing.json`, run `expected.json::verification_commands`.
- 2. For every fixture, invoke `codex exec` as a blind judge (`A`/`B` randomized per fixture) using the 4-axis rubric in `RUBRIC.md`.
+ 2. For every fixture, invoke isolated Codex as a blind judge with randomized slots using the 4-axis rubric in `RUBRIC.md`.
  3. Aggregate into `results/<run-id>/report.md` + `summary.json`.
  4. Apply ship-gate thresholds (`scripts/ship-gate.py`). Print verdict.
  5. Append immutable record to `history/runs/<run-id>.json`.
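The "randomized slots" in step 2 can be pictured with a short sketch. This is not code from the package: `assign_blind_slots`, the slot labels `A`/`B`/`C`, and the seeding scheme are hypothetical, chosen only to illustrate how arm names could be hidden from a judge while the assignment stays reproducible per fixture.

```python
import random


def assign_blind_slots(fixture_id, arms, seed=0):
    """Map anonymous slot labels (A, B, C, ...) to arm names.

    ASSUMPTION: seeding from the fixture id is an illustrative choice;
    the diff does not say how the real judge randomizes slots.
    """
    rng = random.Random(f"{seed}:{fixture_id}")  # deterministic per fixture
    shuffled = list(arms)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))


if __name__ == "__main__":
    arms = ["variant", "solo_claude", "bare"]
    # The judge would see only the slot labels; the mapping is kept aside
    # so scores can be attributed back to arms afterwards.
    print(assign_blind_slots("F2", arms))
```

Keeping the slot-to-arm mapping outside the judge prompt is what makes the judging blind; seeding by fixture id only makes a re-judge of the same run repeatable.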
@@ -47,16 +48,30 @@ benchmark/auto-resolve/
  │ ├── judge.sh # Codex blind judge for one fixture
  │ ├── compile-report.py # aggregates into report.md + summary.json
  │ ├── ship-gate.py # applies thresholds + writes history record
+ │ ├── test-benchmark-arg-parsing.sh
+ │ ├── test-ship-gate.sh
  │ ├── run-headroom-candidate.sh
  │ ├── headroom-gate.py # blocks pair measurement without headroom set
  │ ├── test-headroom-gate.sh
+ │ ├── test-run-headroom-candidate.sh
  │ ├── run-full-pipeline-pair-candidate.sh
+ │ ├── test-run-full-pipeline-pair-candidate.sh
  │ ├── full-pipeline-pair-gate.py
  │ ├── test-full-pipeline-pair-gate.sh
+ │ ├── pair-candidate-frontier.py
+ │ ├── test-pair-candidate-frontier.sh
+ │ ├── audit-pair-evidence.py
+ │ ├── test-audit-pair-evidence.sh
+ │ ├── audit-headroom-rejections.py
+ │ ├── test-audit-headroom-rejections.sh
+ │ ├── test-check-f9-artifacts.sh
+ │ ├── iter-0033c-l1-summary.py
+ │ ├── test-iter-0033c-l1-summary.sh
  │ ├── run-frozen-verify-pair.sh
  │ ├── fetch-swebench-instances.py
  │ ├── collect-swebench-predictions.py
  │ ├── run-swebench-solver-batch.sh
+ │ ├── test-run-swebench-solver-batch.sh
  │ ├── prepare-swebench-frozen-case.py
  │ ├── prepare-swebench-frozen-corpus.py
  │ ├── run-swebench-frozen-corpus.sh
@@ -85,58 +100,231 @@ Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`,
 
  1. Copy an existing fixture directory as a template.
  2. Rewrite `metadata.json::intent` with the new task's plain-language intent.
- 3. Write `spec.md` (auto-resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
+ 3. Write `spec.md` (resolve-ready) and `task.txt` (plain prompt) both derived from the intent.
  4. Fill `expected.json` with concrete verification commands and forbidden patterns.
  5. Document purpose + failure mode in `NOTES.md`.
  6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
  7. Run `bash scripts/lint-fixtures.sh`.
 
+ For draft pair candidates, start in `shadow-fixtures/S*` and run
+ `bash scripts/lint-shadow-fixtures.sh`. The headroom and pair candidate runners
+ accept explicitly named `S*` ids for dry-run checks and candidate measurement,
+ but shadow results are read-only signals. Promote a validated task to an active
+ `F*` fixture before counting it as golden pair evidence.
+ Use `run-suite.sh --suite shadow` only with `--dry-run`; the suite path refuses
+ provider and judge runs for shadow fixtures so rejected/smoke controls do not
+ spend benchmark budget accidentally.
+ Before spending provider calls, write a solo-headroom hypothesis into the
+ candidate's `spec.md`: name the visible behavior a capable `solo_claude`
+ baseline is expected to miss, and the observable command from `expected.json`
+ that would expose that miss. A hypothesis of only "the task is hard" is not
+ enough; rework the candidate before measurement. `lint-shadow-fixtures.sh` and
+ the candidate runners enforce this as an actionable hypothesis: the fixture
+ `spec.md` must contain `solo-headroom hypothesis`, `solo_claude`, `miss`, and a
+ backticked observable command matching `expected.json`, with the backticked line
+ itself containing `miss` and framed as the command/observable that exposes it.
+ For unmeasured high-risk shadow candidates, `NOTES.md` must also include
+ `## Solo ceiling avoidance` naming how the candidate differs from the
+ solo-saturated `S2`-`S6` controls and why that difference should preserve
+ `solo_claude` headroom. If that distinction is not concrete, rework the
+ candidate before measurement.
+ If a real shadow headroom run fails because the fixture is solo-saturated, record
+ the run and score in the fixture's `NOTES.md` and add the fixture to
+ `scripts/pair-rejected-fixtures.sh`; `lint-shadow-fixtures.sh` enforces that
+ calibrated shadow `FAIL` entries are registered before future provider spend.
+
  For L2/pair candidate fixtures, also run:
 
  ```bash
- bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh F16-cli-quote-tax-rules
+ bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh \
+ --bare-max 60 \
+ --solo-max 80 \
+ --min-fixtures 3 \
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
  ```
 
+ The same runner is available through `npx devlyn-cli benchmark headroom ...`.
  This runs only the arms needed for calibration (`bare` and `solo_claude`),
  blind-judges them, and applies `headroom-gate.py`. A candidate set is not
  usable for pair measurement unless at least two fixtures pass and each fixture
- has clean `bare <= 60` and `solo_claude <= 80` scores. A one-fixture calibration
- run can show useful scores but does not satisfy the set gate.
- When changing the gate itself, run:
+ has evidence-complete `bare <= 60` and `solo_claude <= 80` scores with the
+ default minimum 5-point `bare`/`solo_claude` headroom margin.
+ The runner prints the headroom gate markdown report to stdout, including the
+ startup `Gate:` line and the fixture score table with bare score, bare
+ headroom, solo_claude score, solo_claude headroom, status, and reason columns. When launched
+ through `npx devlyn-cli benchmark headroom`, the replay `Command:` uses the
+ same package CLI path.
+ For passing sets, the report also prints average and minimum `bare`/`solo_claude`
+ headroom plus the fixture pass count, so ceiling-near, threshold-fragile, or
+ under-count candidate sets are visible before spending pair arms.
+ It explicitly reports whether the candidate set was accepted or rejected.
+ Evidence-clean means the measured arm has complete artifacts, no deterministic
+ or judge disqualifier, all expected verification commands pass, and any
+ skill-pipeline verdict is non-blocking (`PASS` or `PASS_WITH_ISSUES`). A
+ one-fixture calibration run can show useful scores but does not satisfy the set
+ gate. Add `--dry-run` to validate args, fixture ids, minimum fixture count, and
+ the replay command without running arms or judges.
+ Known rejected or ceiling-saturated fixtures are refused by default in the
+ headroom runner; use `--allow-rejected-fixtures` only for diagnostics of
+ rejected fixtures or calibrated shadow controls, not for new pair-evidence
+ candidate selection. Retired fixtures are preserved for historical artifact replay
+ and are not rerun by the pair-candidate runners.
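The ceiling and fixture-count part of the headroom rule above can be sketched as follows. This is a minimal illustration, not `headroom-gate.py` itself: the `ArmScores` type, the function names, and the sample scores are hypothetical, and the 5-point headroom margin is deliberately left out because the text does not define how it is computed.

```python
from dataclasses import dataclass


@dataclass
class ArmScores:
    """Hypothetical per-fixture calibration result (not the real artifact schema)."""
    fixture: str
    bare: int
    solo_claude: int
    evidence_clean: bool = True  # complete artifacts, no disqualifier, commands pass


def fixture_passes(s, bare_max=60, solo_max=80):
    # Ceilings taken from the text: bare <= 60 and solo_claude <= 80.
    return s.evidence_clean and s.bare <= bare_max and s.solo_claude <= solo_max


def candidate_set_accepted(scores, min_fixtures=2, bare_max=60, solo_max=80):
    """Return (accepted, passing_fixture_ids) for a candidate set."""
    passing = [s.fixture for s in scores if fixture_passes(s, bare_max, solo_max)]
    return len(passing) >= min_fixtures, passing


if __name__ == "__main__":
    # Made-up scores for illustration only.
    sample = [
        ArmScores("fixture-a", bare=50, solo_claude=75),
        ArmScores("fixture-b", bare=33, solo_claude=66),
        ArmScores("fixture-c", bare=61, solo_claude=70),  # bare above ceiling
    ]
    print(candidate_set_accepted(sample))                  # two fixtures pass
    print(candidate_set_accepted(sample, min_fixtures=3))  # stricter set gate
```

With `--min-fixtures 3`, as in the command shown earlier, the same made-up scores would be rejected because only two fixtures stay under both ceilings.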
172
+ Before spending new provider calls, inspect the active candidate frontier:
173
+
174
+ ```bash
175
+ python3 benchmark/auto-resolve/scripts/pair-candidate-frontier.py \
176
+ --out-md /tmp/devlyn-pair-frontier.md
177
+ npx devlyn-cli benchmark recent
178
+ npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
179
+ npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
180
+ ```
181
+
182
+ `benchmark recent` is the reader-facing version of the current evidence set: it
183
+ prints a compact, wrap-safe status block, pair-lift aggregates, and one card per
184
+ passing pair-evidence fixture. Use it for PR comments and release notes when a
185
+ wide frontier table would wrap poorly.
186
+ The frontier report lists active fixtures as `rejected`,
187
+ `pair_evidence_passed`, or `candidate_unmeasured`, using the same rejected
188
+ fixture registry and local full-pipeline gate artifacts. It also prints stdout
189
+ summary rows with `bare`, `solo_claude`, `pair`, pair arm, margin, wall ratio, run id, verdict, and trigger reasons for
190
+ fixtures that already have complete pair evidence rows, plus average/minimum pair margin and wall ratio,
191
+ even when writing `--out-md` or `--out-json`. The markdown artifact also carries
192
+ the overall verdict plus row-level verdict, pair-arm, and trigger-reason columns.
193
+ Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
194
+ and the report includes a Markdown `Hypothesis trigger` column, so strict regenerated
195
+ evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
196
+ After a headroom run fails, audit that any active failed fixture without passing
197
+ pair evidence is either rejected or reworked before more provider spend. The
198
+ same audit also rejects active registry entries whose reason cites a run id or
199
+ score that is not backed by a matching local headroom artifact:
200
+
201
+ ```bash
202
+ python3 benchmark/auto-resolve/scripts/audit-headroom-rejections.py
203
+ npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
204
+ ```
205
+
206
+ For release or handoff checks where open candidates are not acceptable, add
207
+ `--fail-on-unmeasured` to the frontier command so any active
208
+ `candidate_unmeasured` fixture becomes a nonzero exit.
209
+ The package CLI exposes that release/handoff guard as one command:
210
+
211
+ ```bash
212
+ npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
213
+ npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
214
+ ```
215
+
216
+ It writes `audit.json` with the frontier summary and an artifact map (`artifacts`), plus
217
+ `frontier.json`, `frontier.stdout`, `frontier.stderr`, `headroom-audit.json`, and child stdout/stderr logs, prints the same frontier score rows for existing complete pair
218
+ evidence rows, and embeds those compact trigger-backed verdict-bearing score rows in
219
+ `audit.json` as `pair_evidence_rows` (each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`). It fails if either active unmeasured pair candidates or unrecorded
220
+ headroom failures remain. By default it also revalidates frontier `verdict: PASS`
221
+ and zero unmeasured candidates, requires at least four active fixtures with passing pair evidence,
222
+ and requires each counted evidence row to satisfy `pair_mode: true`, the default 5-point pair margin, and 3x pair/solo wall ratio.
223
+ The audit stdout also prints `headroom_rejections=...`,
224
+ `pair_evidence_quality=...`, `pair_trigger_reasons=...`,
225
+ `pair_evidence_hypotheses=...`, and
226
+ `pair_evidence_hypothesis_triggers=...` handoff rows, plus
227
+ `pair_trigger_historical_aliases=...` when archived evidence includes legacy
228
+ trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
229
+ hypotheses have not yet propagated into trigger reasons, with rejected-fixture
230
+ coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
231
+ and canonical trigger reason coverage plus row-match status. The compact evidence row count must match the frontier evidence count, so incomplete local score
232
+ artifacts cannot inflate the claim. `checks.frontier_stdout` records summary,
233
+ aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts, `checks.pair_evidence_quality`
234
+ records the same quality thresholds from the compact rows,
235
+ `checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status
236
+ for handoff review, `checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts, and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details.
237
+ Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
238
+ archived-evidence WARN rows into release-blocking FAIL rows for newly
239
+ regenerated pair evidence.
240
+ Historical trigger aliases are only reported for archived artifact review;
241
+ current pair-evidence gates fail historical-only or unknown trigger reasons and
242
+ require at least one canonical `pair_trigger.reasons` entry.
243
+ `checks.headroom_rejections` records the child verdict plus unrecorded and
244
+ unsupported registry-rejection counts, so handoff review can see rejected-fixture
245
+ coverage without opening the child artifact first.
246
+ Override `--min-pair-evidence`, `--min-pair-margin`, or
247
+ `--max-pair-solo-wall-ratio` only for narrower diagnostics.
248
+
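The effect of `--require-hypothesis-trigger` on a single evidence row can be sketched as below. This is an illustrative sketch, not the audit's actual implementation: the function name `hypothesis_trigger_status`, the `has_documented_hypothesis` field, and the sample rows are hypothetical; only the `spec.solo_headroom_hypothesis` reason, the `pair_trigger_reasons` field, and the WARN-versus-FAIL behavior come from the text above.

```python
from typing import Dict

# Trigger reason named in the text above.
HYPOTHESIS_REASON = "spec.solo_headroom_hypothesis"


def hypothesis_trigger_status(row: Dict, require_hypothesis_trigger: bool = False) -> str:
    """Classify one pair-evidence row as PASS, WARN, or FAIL.

    `has_documented_hypothesis` is a hypothetical field standing in for
    "the fixture spec documents an actionable solo-headroom hypothesis".
    """
    documented = bool(row.get("has_documented_hypothesis"))
    reasons = row.get("pair_trigger_reasons", [])
    if not documented or HYPOTHESIS_REASON in reasons:
        return "PASS"
    # A documented hypothesis that never reached the trigger reasons is a gap:
    # a warning for archived evidence, a failure when the strict flag is set.
    return "FAIL" if require_hypothesis_trigger else "WARN"


# Hypothetical rows, for illustration only.
gap_row = {
    "fixture": "FX-example",
    "has_documented_hypothesis": True,
    "pair_trigger_reasons": ["example.other_reason"],
}
ok_row = {
    "fixture": "FX-example",
    "has_documented_hypothesis": True,
    "pair_trigger_reasons": [HYPOTHESIS_REASON],
}

print(hypothesis_trigger_status(gap_row))
print(hypothesis_trigger_status(gap_row, require_hypothesis_trigger=True))
print(hypothesis_trigger_status(ok_row, require_hypothesis_trigger=True))
```

Under these assumptions the same gap row is tolerated when reviewing archived artifacts and blocks a release when the strict flag is passed.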
249
+ When changing the calibration/pair evidence gates, run:
106
250
 
107
251
  ```bash
252
+ bash scripts/lint-fixtures.sh
253
+ bash benchmark/auto-resolve/scripts/test-ship-gate.sh
254
+ bash benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh
255
+ bash benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh
256
+ bash benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh
257
+ bash benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh
258
+ bash benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh
108
259
  bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
260
+ bash benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh
261
+ bash benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh
262
+ bash benchmark/auto-resolve/scripts/test-lint-fixtures.sh
263
+ bash benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh
264
+ bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
265
+ bash benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh
266
+ bash benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh
267
+ bash benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh
268
+ bash benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh
269
+ bash benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh
109
270
  ```
110
271
 
111
- After a full-pipeline pair run has the calibrated arms (`bare`,
112
- `solo_claude`, `l2_gated` or `l2_risk_probes`) plus a blind `judge.json`, gate
113
- it separately:
272
+ `build-pair-eligible-manifest.py` writes `selection_rule.rejected_excluded`
273
+ with the rejected fixture ids removed from Gate 3, and
274
+ `selection_rule.rejected_excluded_reasons` with the exact registry reason for
275
+ each removed id. This keeps the manifest self-explaining when F31/F32-style
276
+ solo-ceiling controls are excluded from pair-lift evidence.
277
+
278
+ After a full-pipeline pair run has the calibrated arms (`bare`, `solo_claude`,
279
+ and the selected pair arm, default `l2_risk_probes`) plus a blind `judge.json`,
280
+ gate it separately:
114
281
 
115
282
  ```bash
116
283
  bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
284
+ --min-fixtures 3 \
117
285
  --max-pair-solo-wall-ratio 3 \
118
- F21-cli-scheduler-priority F23-cli-fulfillment-wave
286
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
119
287
  ```
120
288
 
289
+ The same runner is available through `npx devlyn-cli benchmark pair ...`.
121
290
  The runner executes `bare` + `solo_claude`, applies `headroom-gate.py`, and
122
- only then spends a `l2_gated` arm. To gate already-existing artifacts:
123
-
124
- When a prompt-only pair change needs a fresh `l2_gated` measurement but the
125
- calibrated `bare` + `solo_claude` arms are already clean, reuse them into a new
126
- run id:
291
+ only then spends the selected pair arm. Pair arms are limited to current
292
+ proof (`l2_risk_probes`) or diagnostic replay (`l2_gated`); `l2_forced` is
293
+ retired and rejected. It prints the exact replay command plus each gate's
294
+ markdown report to stdout, including startup `Headroom:` / `Pair:` lines,
295
+ fixture pass count, average pair margin, and the fixture score table with bare,
296
+ solo_claude, pair, margin, pair-mode, trigger-reason, and wall-ratio columns; if headroom or pair
297
+ evidence fails, the report is printed before the runner exits non-zero. If
298
+ headroom fails, the runner explicitly says the pair arm was not executed; if
299
+ the final pair gate fails, it explicitly says pair evidence was rejected. When
300
+ both gates pass, it explicitly says the selected pair arm is being executed and
301
+ then that pair evidence was accepted. When launched through
302
+ `npx devlyn-cli benchmark pair`, the replay `Command:` uses
303
+ the same package CLI path. Add `--dry-run` to
304
+ validate args, fixture ids, minimum fixture count, and the replay command
305
+ without running arms or judges. Known rejected or ceiling-saturated fixtures
306
+ are refused by default here too; use `--allow-rejected-fixtures` only for
307
+ diagnostics of rejected fixtures or calibrated shadow controls. Retired fixtures
308
+ remain historical replay artifacts and are not rerun by this candidate runner.
309
+ To gate already-existing artifacts, run `full-pipeline-pair-gate.py` directly (the second command below).
310
+
311
+ When a prompt-only pair change needs a fresh `l2_risk_probes` measurement but
312
+ the calibrated `bare` + `solo_claude` arms are already evidence-complete, reuse
313
+ them into a new run id:
127
314
 
128
315
  ```bash
129
316
  bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
130
317
  --run-id <new-run-id> \
131
318
  --reuse-calibrated-from <prior-headroom-run-id> \
319
+ --min-fixtures 3 \
132
320
  --max-pair-solo-wall-ratio 3 \
133
- F21-cli-scheduler-priority F23-cli-fulfillment-wave
321
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
134
322
  ```
135
323
 
136
324
  ```bash
137
325
  python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
138
326
  --run-id <full-pipeline-run-id> \
139
- --min-fixtures 2 \
327
+ --min-fixtures 3 \
140
328
  --min-pair-margin 5 \
141
329
  --max-pair-solo-wall-ratio 3 \
142
330
  --out-json benchmark/auto-resolve/results/<full-pipeline-run-id>/full-pipeline-pair-gate.json \
@@ -144,16 +332,67 @@ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
144
332
  ```
145
333
 
146
334
  This is the full-pipeline claim gate: each counted fixture must satisfy the
147
- headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
148
- must be clean, `pair_mode` must be true in the captured resolve state, and the
149
- blind judge must score the pair arm at least `--min-pair-margin` above
150
- `solo_claude`. `l2_risk_probes` is the current measured pair arm for the
151
- F16/F25 gate: `20260509-f16-f25-combined-cartprobe-v2` passed with margins +21
152
- and +24, average pair/solo wall ratio 1.46x. When changing this gate, run:
153
-
154
- ```bash
155
- bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
156
- ```
335
+ headroom precondition (`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins), the selected pair arm must be evidence-clean,
336
+ `pair_mode` must be true in the captured resolve state, the pair trigger must be
337
+ eligible with non-empty reasons and at least one canonical reason, fixtures with an actionable solo-headroom hypothesis must include `spec.solo_headroom_hypothesis` in the trigger reasons, the pair/solo wall-time
338
+ ratio must stay within the default 3x limit, and the blind judge must score the
339
+ pair arm at least `--min-pair-margin` above `solo_claude`. The report separates
340
+ the allowed pair/solo wall ratio from the maximum observed pair/solo wall ratio,
341
+ records `require_hypothesis_trigger` in JSON, and includes a Markdown
342
+ `Hypothesis trigger` column for each fixture row.
343
+ The judge
344
+ file must also map `bare`, `solo_claude`, and the selected pair arm in
345
+ `_blind_mapping`; `scores_by_arm` alone is not evidence.
346
+ `l2_risk_probes` is the current measured pair arm for the
347
+ F16/F23/F25 gate: `20260510-f16-f23-f25-combined-proof` passed with margins
348
+ +21, +31, and +24, average pair margin +25.3, and average pair/solo wall ratio
349
+ 1.73x. Earlier F16/F25 evidence also passes the current gate in
350
+ `20260509-f16-f25-combined-cartprobe-v2`.
351
+ Additional focused F21 evidence: `20260511-f21-current-riskprobes-v1` passed
352
+ with `bare` 33, `solo_claude` 66, `l2_risk_probes` 99, margin +33, pair mode true, and
353
+ pair/solo wall ratio 1.47x, and is counted by `benchmark audit` as the fourth passing pair-evidence row. Do not count ceiling/control fixtures as pair
354
+ evidence: F22 and F26 are
355
+ currently rejected because existing headroom runs put `solo_claude` at 98. F27
356
+ subscription proration is also rejected in its first headroom smoke:
357
+ `20260511-f27-headroom-smoke-061401` measured bare 33 / solo_claude 94, with bare
358
+ verification passing only 1 of 3 commands. Rework or rotate F27 before spending
359
+ a pair arm on it. F28 return authorization is rejected as pair-lift evidence:
360
+ earlier unstable runs `20260511-f28-headroom-smoke-085307` and
361
+ `20260511-f28-pair-smoke-091021` were superseded after a hidden-oracle bug was
362
+ found. The oracle had expected a defective item to bypass expiration, which the
363
+ visible spec does not require. After re-verifying the same provider diffs
364
+ against the corrected oracle, `20260511-f28-policy-oraclefix-reverified-pair`
365
+ scored bare 50 / solo_claude 98 / `l2_risk_probes` 96, margin -2, and failed headroom.
366
+ Rework or rotate F28 before spending more pair arms.
367
+ F30 credit hold settlement is also rejected: `20260511-f30-headroom-v1` scored
368
+ bare 33 / solo_claude 98, so it failed the `solo_claude` headroom precondition before any pair
369
+ arm should be spent. F15 frozen-diff race review is now a ceiling/control
370
+ fixture too: `20260511-f15-concurrency-headroom` scored bare 99 / solo_claude 94, so
371
+ it failed both headroom preconditions. F3 backend contract risk is also
372
+ rejected after tightening its HTTP error-body verifier:
373
+ `20260511-f3-http-error-headroom` scored bare 97 / solo_claude 99. F2 medium CLI is
374
+ rejected by `20260512-f2-medium-headroom`: bare 83 / solo_claude 95, so both baseline
375
+ scores exceed headroom ceilings. F4 web browser design is rejected by
376
+ `20260512-f4-web-headroom`: bare 70 / solo_claude 92 with bare disqualifiers, so it
377
+ needs rework before pair arms. F5 fix-loop is rejected by
378
+ `20260512-f5-fixloop-headroom`: bare 99 / solo_claude 99, with `bare` and `solo_claude` each
379
+ passing 5/5 verification commands. F6 dep-audit checksum is rejected by
380
+ `20260512-f6-checksum-headroom`: bare 97 / solo_claude 96, with `bare` and `solo_claude` each
381
+ passing 6/6 verification commands. F7 scope discipline is rejected by
382
+ `20260512-f7-scope-headroom`: bare 99 / solo_claude 100, with `bare` and `solo_claude` each
383
+ passing 6/6 verification commands. F9 ideate-to-resolve remains the novice-flow
384
+ anchor but is rejected as pair evidence by `20260512-f9-e2e-headroom`: bare 60 /
385
+ solo_claude 90 with bare headroom 0 and a bare judge disqualifier, despite passing F9
386
+ artifact checks. Rework it before spending pair arms. F1 and F8 are rejected by
387
+ design as calibration/known-limit controls, not pair-lift evidence candidates.
388
+ F10/F11 are also rejected by `20260507-f10-f11-tier1-full-pipeline`: F10 scored
389
+ bare 75 / solo_claude 94, and F11 scored bare 98 / solo_claude 97. F12 webhook signature/replay is rejected by
390
+ `20260511-f12-webhook-headroom`: bare 85 / solo_claude 99.
391
+ F31 seat rebalance is rejected by `20260512-f31-seat-rebalance-headroom`: bare
392
+ 33 / solo_claude 98, with bare judge/result/verify disqualifiers. Rework it before
393
+ spending pair arms. F32 subscription renewal is rejected by
394
+ `20260512-f32-subscription-renewal-headroom`: bare 33 / solo_claude 98, so it is a
395
+ solo-ceiling billing rollback/shape control rather than pair-lift evidence.
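The core arithmetic of the claim gate can be checked by hand against the scores quoted above. The sketch below is a simplified reading, not the code in `full-pipeline-pair-gate.py`: the class and function names are hypothetical, and it covers only the `bare <= 60` / `solo_claude <= 80` preconditions, `pair_mode`, the wall-ratio limit, and the pair margin. The 5-point headroom margins, trigger-reason checks, and `_blind_mapping` validation are left out because the text does not define them precisely enough to model.

```python
from dataclasses import dataclass


@dataclass
class FixtureScores:
    bare: int
    solo_claude: int
    pair: int  # score of the selected pair arm, e.g. l2_risk_probes


def headroom_ok(s: FixtureScores, bare_max: int = 60, solo_max: int = 80) -> bool:
    """Headroom precondition: both baseline arms must leave room for lift."""
    return s.bare <= bare_max and s.solo_claude <= solo_max


def pair_margin(s: FixtureScores) -> int:
    """Pair margin is measured against solo_claude, not against bare."""
    return s.pair - s.solo_claude


def counts_as_pair_evidence(
    s: FixtureScores,
    pair_mode: bool,
    wall_ratio: float,
    min_pair_margin: int = 5,
    max_wall_ratio: float = 3.0,
) -> bool:
    return (
        headroom_ok(s)
        and pair_mode is True
        and wall_ratio <= max_wall_ratio
        and pair_margin(s) >= min_pair_margin
    )


# Scores quoted in the text above.
f21 = FixtureScores(bare=33, solo_claude=66, pair=99)
f28 = FixtureScores(bare=50, solo_claude=98, pair=96)

print(counts_as_pair_evidence(f21, pair_mode=True, wall_ratio=1.47))  # True
print(headroom_ok(f28), pair_margin(f28))  # False -2
print(round((21 + 31 + 24) / 3, 1))  # 25.3, the F16/F23/F25 average margin
```

F28 is rejected on the `solo_claude` ceiling alone, before its negative margin is even considered, which matches the order in which the runner spends arms.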
157
396
 
158
397
  Commands that reference `BENCH_FIXTURE_DIR` are hidden post-run oracles: they
159
398
  are not staged into BUILD_GATE's `.devlyn/spec-verify.json`.
@@ -175,8 +414,10 @@ diagnostics. Use non-empty diffs only; empty diffs fail fast because they are
175
414
  not valid pair evidence.
176
415
  Hidden verifier context is available during VERIFY, so this runner prevents
177
416
  IMPLEMENT contamination but is not an oracle-blind judge setup.
178
- The runner writes `compare.json`; `pair_verdict_lift: true` means pair VERIFY
179
- actually ran and found a verdict-binding issue that solo VERIFY did not.
417
+ The runner writes `compare.json` and `compare.md`; `pair_verdict_lift: true`
418
+ means pair VERIFY actually ran and found a verdict-binding issue that solo
419
+ VERIFY did not. It also prints a replay `Command:` block before invoking
420
+ providers and a final solo/pair summary table.
180
421
  If an imported case has no deterministic `verification_commands`, the runner
181
422
  does not create `.devlyn/spec-verify.json`; an empty carrier is malformed by the
182
423
  normal real-user contract and must not block qualitative frozen review.
@@ -187,6 +428,7 @@ To gate a set of frozen VERIFY results mechanically:
187
428
  python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
188
429
  --run-id 20260505T173913Z-9986cd3-frozen-verify \
189
430
  --run-id 20260505T230215Z-9986cd3-frozen-verify \
431
+ --require-hypothesis-trigger \
190
432
  --max-pair-solo-wall-ratio 3 \
191
433
  --out-json benchmark/auto-resolve/results/frozen-verify-gate-20260505.json \
192
434
  --out-md benchmark/auto-resolve/results/frozen-verify-gate-20260505.md
@@ -203,12 +445,19 @@ full-pipeline pair superiority. It proves only that, after the implementation
203
445
  diff is frozen, gated pair VERIFY fires and returns a stricter verdict-binding
204
446
  result than solo VERIFY on the same diff. Each supplied run must cover a
205
447
  distinct fixture; repeated runs of the same fixture do not count as independent
448
+ corpus growth. For new measurements, pass `--require-hypothesis-trigger` so any
449
+ fixture spec with an actionable solo-headroom hypothesis must also expose
450
+ `spec.solo_headroom_hypothesis` in `pair_trigger.reasons`; omit it only when
451
+ re-gating historical artifacts that predate that trigger reason.
206
452
  `--max-pair-solo-wall-ratio` is optional, but use it for
207
453
  ship-style evidence so quality lift is not accepted without a reasonable
208
454
  wall-time bound. The gate infers the fixture id from the runner input metadata;
209
455
  artifacts without that metadata, or with a fixture id absent from
210
456
  the selected `--fixtures-root`, fail instead of being counted as anonymous or
211
- fake evidence.
457
+ fake evidence. JSON rows expose `pair_trigger_reasons` and
458
+ `pair_trigger_has_canonical_reason`; Markdown output includes a `Triggers`
459
+ column so reviewers can see which canonical pair trigger made the evidence
460
+ eligible.
212
461
 
213
462
  ### SWE-bench fixed-diff review pilot
214
463
 
@@ -292,6 +541,10 @@ bash benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh \
292
541
  --out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
293
542
  ```
294
543
 
544
+ The corpus runner prints a replay `Command:` block before invoking providers or
545
+ gating existing run ids, so frozen VERIFY score runs can be reproduced from the
546
+ captured stdout.
547
+
295
548
  To re-gate existing run ids without re-invoking providers, write one run id per
296
549
  line and pass `--gate-only-run-ids <file>` with the same manifest. For large
297
550
  tranches, keep `--run-ids-out` and use `--resume-completed-arms` on retries:
@@ -345,6 +598,7 @@ python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
345
598
  --run-id <swebench-frozen-run-2> \
346
599
  --run-id <swebench-frozen-run-3> \
347
600
  --min-runs 3 \
601
+ --require-hypothesis-trigger \
348
602
  --max-pair-solo-wall-ratio 3 \
349
603
  --out-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
350
604
  --out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
@@ -359,8 +613,14 @@ inspect `avg_pair_solo_wall_ratio` plus each row's `pair_solo_wall_ratio`.
359
613
  For selection-bias control, render every run in the attempted pilot, not just
360
614
  gate rows. The matrix reports verdict-lift rows separately from recall-only
361
615
  rows where pair found additional findings but did not change the binding
362
- verdict. It also reports classification counts, gate rate, and trailing
363
- non-gate rows. Use the optional yield thresholds when the matrix is meant to
616
+ verdict. It also reports pair-trigger eligibility/contract failures,
617
+ trigger reasons, canonical-trigger coverage, classification counts, gate rate,
618
+ and trailing non-gate rows. Its Markdown table includes a `Triggers` column.
619
+ For new measurements, pass `--fixtures-root` with
620
+ `--require-hypothesis-trigger` so matrix rows classify any missing
621
+ `spec.solo_headroom_hypothesis` trigger reason as a pair-trigger contract
622
+ failure instead of leaving it to the gate artifact alone.
623
+ Use the optional yield thresholds when the matrix is meant to
364
624
  fail closed instead of only documenting that additional rows are adding
365
625
  controls without strengthening the proof gate:
366
626
 
@@ -368,9 +628,11 @@ controls without strengthening the proof gate:
368
628
  python3 benchmark/auto-resolve/scripts/swebench-frozen-matrix.py \
369
629
  --title "SWE-bench Lite Frozen VERIFY Matrix" \
370
630
  --verdict MIXED_WITH_GATE_PASS \
631
+ --fixtures-root benchmark/auto-resolve/external/swebench/cases \
371
632
  --gate-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
372
633
  --run-id <swebench-frozen-run-1> \
373
634
  --run-id <swebench-frozen-run-2> \
635
+ --require-hypothesis-trigger \
374
636
  --min-gate-rate 0.25 \
375
637
  --max-trailing-non-gate 10 \
376
638
  --out-json benchmark/auto-resolve/results/swebench-frozen-matrix.json \
@@ -391,9 +653,9 @@ the diff is frozen.
391
653
 
392
654
  ## LLM-upgrade resilience
393
655
 
394
- - **No model hardcoding.** Judge runs `codex exec` without `-m`, inheriting whichever flagship the CLI currently ships. Each run captures `_judge_model` for historical provenance.
395
- - **Margin-based gates.** Ship thresholds use margin (variant − bare), not absolute score. Both arms improve together as models improve; the harness-added value measured by margin stays meaningful.
396
- - **Saturation rotation.** When both arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
656
+ - **No model hardcoding.** Judge runs Codex without `-m`, inheriting whichever flagship the CLI currently ships. The call is isolated from user config/rules/hooks so local agent instructions cannot contaminate the blind judgment. Each run captures `_judge_model` for historical provenance.
657
+ - **Margin-based gates.** Ship thresholds use pairwise margins, not absolute score. `solo_claude`-`bare` measures solo harness value; pair-`solo_claude` measures pair value on pair-eligible fixtures. As models improve, margin remains the meaningful harness-added signal.
658
+ - **Saturation rotation.** When all compared gated arms exceed 95 on a fixture for two shipped versions, rotate it (see `RUBRIC.md::Fixture Rotation Policy`).
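The saturation rule in the last bullet can be expressed as a small predicate. This is a minimal sketch under stated assumptions: `is_saturated`, the shape of `history` (one dict of arm scores per shipped version, oldest first), and the sample scores are all hypothetical; only the "all compared gated arms exceed 95 for two shipped versions" rule comes from the text.

```python
from typing import Dict, List


def is_saturated(
    history: List[Dict[str, int]],
    threshold: int = 95,
    versions: int = 2,
) -> bool:
    """True when every compared arm exceeds `threshold` in each of the
    last `versions` shipped versions for one fixture."""
    if len(history) < versions:
        return False
    recent = history[-versions:]
    return all(
        len(scores) > 0 and all(score > threshold for score in scores.values())
        for scores in recent
    )


# Hypothetical per-version scores for one fixture, oldest first.
saturated = [{"bare": 96, "solo_claude": 97}, {"bare": 99, "solo_claude": 96}]
borderline = [{"bare": 96, "solo_claude": 97}, {"bare": 95, "solo_claude": 99}]

print(is_saturated(saturated))
print(is_saturated(borderline))
```

The comparison is strict, so a score of exactly 95 does not count as exceeding the threshold in this sketch.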
397
659
 
398
660
  ## Ship gates (summary — see `RUBRIC.md` for full spec)
399
661
 
@@ -401,7 +663,7 @@ Hard floors (any one fails → block):
401
663
 
402
664
  - Zero variant disqualifier (silent catch, fabricated verification, extra deps beyond `max_deps_added`, etc.).
403
665
  - `F9-e2e-ideate-to-resolve` must PASS (novice-flow contract).
404
- - ≥ 7 of 9 gated fixtures have margin ≥ +5.
666
+ - ≥ 7 gated, headroom-available fixtures have margin ≥ +5.
405
667
  - No per-fixture regression worse than −5 vs last shipped baseline.
406
668
 
407
669
  Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margin, critical-finding catch-rate regression vs last shipped variant.
@@ -409,15 +671,16 @@ Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margi
409
671
  ## Running the full suite (real)
410
672
 
411
673
  Full real benchmarks usually take 2-3 minutes per arm for simple fixtures and
412
- up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures
413
- × 2 arms can take 30 min - 2 hrs depending on routes taken.
674
+ up to 15 minutes per arm for strict-route fixtures. The duration of a full n=1 run depends
675
+ on the selected fixture count; the historical 9-fixture core suite was roughly
676
+ 45 min - 3 hrs for 3 arms, while the current extended suite can take longer.
414
677
 
415
678
  ```bash
416
679
  # Smoke run before ship decisions
417
680
  npx devlyn-cli benchmark
418
681
 
419
682
  # Ship-decision run
420
- npx devlyn-cli benchmark --n 3 --label v3.7 --bless
683
+ npx devlyn-cli benchmark --label v3.7 --bless
421
684
  ```
422
685
 
423
686
  ## Dry-run
@@ -9,8 +9,8 @@ prior `history/runs/`.
9
9
 
10
10
  ## Scoring — 4 axes, 25 points each, 100 total
11
11
 
12
- The blind judge scores both arms on identical axes without knowing which is
13
- variant vs. bare.
12
+ The blind judge scores all submitted arms on identical axes without knowing
13
+ which label maps to which arm.
14
14
 
15
15
  ### Axis 1 — Spec Compliance (0-25)
16
16
 
@@ -72,18 +72,23 @@ Disqualifier arms automatically lose the fixture regardless of score.
72
72
  After the judge finishes every fixture, `scripts/ship-gate.py` applies these
73
73
  rules to the run's `summary.json`.
74
74
 
75
+ This section describes the broad run-suite ship gate. Current solo<pair
76
+ evidence uses the full-pipeline pair gate with an explicit selected pair arm
77
+ (`l2_risk_probes` for proof runs, `l2_gated` for diagnostics), and that gate
78
+ compares the selected pair arm against `solo_claude`.
79
+
75
80
  ### Hard floors (any one failure blocks ship)
76
81
 
77
- 1. **No disqualifier-level violation** in variant on any fixture.
82
+ 1. **No disqualifier-level violation** in any gated harness arm (legacy suite `variant`/L2 and `solo_claude`/L1 when present).
78
83
  2. **F9 (E2E) must PASS** — novice-flow contract.
79
- 3. **≥ 7 of 9 fixtures** must have margin ≥ +5 — **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from this count when `100 - L0_score < 5` AND `L1_score >= 95` AND the L1 arm has no disqualifier / CRITICAL-HIGH finding / watchdog timeout / regression worse than gate #4. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
84
+ 3. **At least 7 gated, headroom-available fixtures** must have the required margin ≥ +5 for each gated contract: legacy `variant`-`bare` (L2-L0) for the suite gate, and `solo_claude`-`bare` (L1-L0) when `solo_claude` is present. This is **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from a contract count when the lower arm is ceiling-near and the higher arm is clean at ceiling. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
80
85
  4. **No fixture regression worse than −5** vs. last `baselines/shipped.json` on the same fixture.
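Hard floor 3 can be read as a count over headroom-available fixtures for one contract. The sketch below is an assumption-laden illustration, not the logic in `scripts/ship-gate.py`: the names are hypothetical, and the numeric reading of "ceiling-near" (`100 - lower < 5`) and "at ceiling" (`higher >= 95`) is borrowed from the earlier wording of this rule, since the current wording gives no numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FixtureResult:
    fixture: str
    lower: int          # lower arm of the contract, e.g. bare (L0)
    higher: int         # higher arm, e.g. solo_claude (L1) or variant (L2)
    higher_clean: bool  # assumed: no disqualifier, critical finding, or timeout


def headroom_available(r: FixtureResult, ceiling_gap: int = 5, ceiling_score: int = 95) -> bool:
    """A fixture is excluded when the lower arm is ceiling-near and the
    higher arm is clean at ceiling (thresholds are assumptions)."""
    excluded = (100 - r.lower) < ceiling_gap and r.higher >= ceiling_score and r.higher_clean
    return not excluded


def margin_floor_passes(
    results: List[FixtureResult],
    min_fixtures: int = 7,
    min_margin: int = 5,
) -> bool:
    """Count headroom-available fixtures whose margin clears the floor."""
    counted = [r for r in results if headroom_available(r)]
    passing = [r for r in counted if r.higher - r.lower >= min_margin]
    return len(passing) >= min_fixtures


# Hypothetical results: seven fixtures with a +10 margin each.
seven_ok = [FixtureResult(f"FX{i}", lower=50, higher=60, higher_clean=True) for i in range(7)]
# Six good fixtures plus one saturated fixture that is excluded from the count.
six_plus_saturated = seven_ok[:6] + [FixtureResult("FX-sat", lower=97, higher=98, higher_clean=True)]

print(margin_floor_passes(seven_ok))
print(margin_floor_passes(six_plus_saturated))
```

In this reading an excluded fixture neither helps nor hurts the margin count; it simply drops out, so the floor of seven must be met by the remaining fixtures.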
81
86
 
82
87
  ### Soft gates (produce WARNING but do not block)
83
88
 
84
89
  5. Suite average margin drop > 3 vs. last shipped.
85
90
  6. A fixture that previously had margin > +5 now has margin ≤ 0.
86
- 7. Critical-finding catch-rate decrease vs. last shipped variant (not vs. bare).
91
+ 7. Critical-finding catch-rate decrease vs. the last shipped gated harness arm.
87
92
 
88
93
  ### Known-limit exception
89
94
 
@@ -138,15 +143,15 @@ Every suite run appends an immutable record to `history/runs/<ts>-<label>.json`:
138
143
 
139
144
  ## Fixture Rotation Policy
140
145
 
141
- If any fixture has both arms scoring > 95 for two consecutive shipped
142
- versions, it's saturated and no longer differentiates. Replace with a harder
143
- equivalent and record the swap in
146
+ If any fixture has all compared gated arms scoring > 95 for two consecutive
147
+ shipped versions, it's saturated and no longer differentiates. Replace with a
148
+ harder equivalent and record the swap in
144
149
  `history/runs/<ts>-fixture-rotation.json`:
145
150
 
146
151
  ```json
147
152
  {
148
153
  "retired": "F1-cli-trivial-flag",
149
- "retired_reason": "both arms > 95 on v3.7 and v3.8 (saturation)",
154
+ "retired_reason": "all compared gated arms > 95 on v3.7 and v3.8 (saturation)",
150
155
  "replacement": "F1b-cli-trivial-flag-v2",
151
156
  "replacement_rationale": "adds exit-code precedence requirement that current leaders didn't handle on first try"
152
157
  }
@@ -159,10 +164,14 @@ suspected in their area.
159
164
 
160
165
  ## Why These Thresholds
161
166
 
162
- - **+5 margin floor** — below this, variant isn't reliably beating bare given
163
- judge variance (empirically ~±3 per axis). Worth paying pipeline cost
164
- requires margin clearly above noise.
167
+ - **+5 margin floor** — below this, the gated harness arm is not reliably
168
+ beating its lower baseline given judge variance (empirically ~±3 per axis).
169
+ For the legacy suite that is `variant` over `bare`; for pair evidence it is
170
+ the selected pair arm over `solo_claude`. Worth paying pipeline cost requires
171
+ margin clearly above noise.
165
172
  - **−5 regression floor** — one-axis regression can look like −5; allowing
166
173
  less would let real regressions slip through.
167
- - **7/9 fixtures rule** — tolerates one close-call + F8 known-limit; anything
168
- worse means the suite is surfacing a broad harness problem.
174
+ - **7-fixture coverage floor** — requires a broad enough set of
175
+ headroom-available, non-known-limit fixtures to clear the margin floor. This
176
+ preserves the original core-suite coverage bar without pretending the current
177
+ extended fixture inventory is still exactly nine fixtures.
@@ -6,6 +6,10 @@ Trivial-tier calibration. Every arm should one-shot this; it's here to catch
6
6
  catastrophic regressions and to anchor the "saturation" end of the scoring
7
7
  scale.
8
8
 
9
+ Pair-candidate status: rejected by design. Because every arm is expected to
10
+ one-shot F1, it is a trivial calibration/control fixture and must not be used
11
+ as pair-lift evidence.
12
+
9
13
  ## Failure mode
10
14
 
11
15
  - **Default-behavior regression.** Careless implementations add `--loud`
@@ -25,6 +29,6 @@ scale.
25
29
 
26
30
  ## Rotation trigger
27
31
 
28
- When both arms score > 95 for two consecutive shipped versions, replace with
29
- a harder trivial fixture (e.g., one that requires handling a new flag
30
- interacting with existing flag precedence).
32
+ When both `bare` and `solo_claude` score > 95 for two consecutive shipped
33
+ versions, replace with a harder trivial fixture (e.g., one that requires
34
+ handling a new flag interacting with existing flag precedence).
@@ -57,7 +57,12 @@ prose forces invariant derivation, which is where pair has the edge.
57
57
 
58
58
  ## Rotation trigger
59
59
 
60
- Retire when both arms consistently land > 90 across two shipped versions,
61
- OR when "close-together-write" becomes a recognized pattern such that
62
- solo arm reliably reaches for a serializing mechanism on first read.
60
+ Headroom run `20260507-f10-f11-tier1-full-pipeline` rejected this fixture as
61
+ pair-lift evidence: bare scored 75 and solo_claude scored 94. Keep it as a
62
+ concurrent persistence control unless the visible contract is reworked to
63
+ expose lower bare/solo ceilings.
64
+
65
+ Retire when both `bare` and `solo_claude` consistently land > 90 across two
66
+ shipped versions, OR when "close-together-write" becomes a recognized pattern
67
+ such that solo arm reliably reaches for a serializing mechanism on first read.
63
68
  Whichever comes first.