devlyn-cli 2.3.0 → 2.3.2

Files changed (219)
  1. package/AGENTS.md +1 -1
  2. package/CLAUDE.md +2 -2
  3. package/README.md +82 -29
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +211 -18
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +3 -2
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +1 -1
  201. package/config/skills/_shared/runtime-principles.md +5 -8
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  205. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  206. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  207. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  208. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  209. package/config/skills/devlyn:resolve/SKILL.md +49 -18
  210. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  211. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  212. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  213. package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
  214. package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
  215. package/package.json +47 -2
  216. package/scripts/lint-fixtures.sh +349 -0
  217. package/scripts/lint-shadow-fixtures.sh +58 -0
  218. package/scripts/lint-skills.sh +3642 -92
  219. /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -0,0 +1,82 @@
+ {
+   "run_id": "20260510-f16-f23-f25-combined-proof",
+   "rule": "headroom candidates only; bare headroom >= 5; solo_claude headroom >= 5; l2_risk_probes must be evidence-clean, pair_mode true, pair_trigger eligible with a canonical reason, and beat solo_claude by the configured margin",
+   "verdict": "PASS",
+   "fixtures_total": 3,
+   "fixtures_passed": 3,
+   "min_fixtures": 3,
+   "fixture_count_ok": true,
+   "bare_max": 60,
+   "solo_max": 80,
+   "min_bare_headroom_required": 5,
+   "min_solo_headroom_required": 5,
+   "min_pair_margin": 5,
+   "pair_arm": "l2_risk_probes",
+   "max_pair_solo_wall_ratio": 3.0,
+   "require_hypothesis_trigger": false,
+   "max_observed_pair_solo_wall_ratio": 2.2506234413965087,
+   "avg_pair_margin": 25.333333333333332,
+   "avg_pair_solo_wall_ratio": 1.725768446785212,
+   "rows": [
+     {
+       "fixture": "F16-cli-quote-tax-rules",
+       "status": "PASS",
+       "bare_score": 50,
+       "bare_headroom": 10,
+       "solo_score": 75,
+       "solo_headroom": 5,
+       "pair_score": 96,
+       "pair_margin": 21,
+       "pair_mode": true,
+       "pair_trigger_eligible": true,
+       "pair_trigger_reasons": [
+         "complexity.high",
+         "spec.solo_headroom_hypothesis"
+       ],
+       "pair_trigger_has_canonical_reason": true,
+       "pair_trigger_has_hypothesis_reason": true,
+       "pair_solo_wall_ratio": 1.2805280528052805,
+       "reason": ""
+     },
+     {
+       "fixture": "F23-cli-fulfillment-wave",
+       "status": "PASS",
+       "bare_score": 33,
+       "bare_headroom": 27,
+       "solo_score": 66,
+       "solo_headroom": 14,
+       "pair_score": 97,
+       "pair_margin": 31,
+       "pair_mode": true,
+       "pair_trigger_eligible": true,
+       "pair_trigger_reasons": [
+         "complexity.high",
+         "spec.solo_headroom_hypothesis"
+       ],
+       "pair_trigger_has_canonical_reason": true,
+       "pair_trigger_has_hypothesis_reason": true,
+       "pair_solo_wall_ratio": 2.2506234413965087,
+       "reason": ""
+     },
+     {
+       "fixture": "F25-cli-cart-promotion-rules",
+       "status": "PASS",
+       "bare_score": 25,
+       "bare_headroom": 35,
+       "solo_score": 75,
+       "solo_headroom": 5,
+       "pair_score": 99,
+       "pair_margin": 24,
+       "pair_mode": true,
+       "pair_trigger_eligible": true,
+       "pair_trigger_reasons": [
+         "complexity.high",
+         "spec.solo_headroom_hypothesis"
+       ],
+       "pair_trigger_has_canonical_reason": true,
+       "pair_trigger_has_hypothesis_reason": true,
+       "pair_solo_wall_ratio": 1.646153846153846,
+       "reason": ""
+     }
+   ]
+ }
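The derived fields in the pair-gate JSON above can be rechecked by hand. The following is a minimal sketch, not the package's `full-pipeline-pair-gate.py`: the formulas (`headroom = max - score`, `pair_margin = pair_score - solo_score`) are inferred from the numbers in the artifact, the `<=` comparison on the wall ratio is an assumption, and the names `check_row` and `ROWS` are hypothetical. It covers only the numeric part of the rule; the evidence-clean, `pair_mode`, and trigger checks are not modeled.

```python
from statistics import mean

# Thresholds copied from the gate JSON above.
BARE_MAX, SOLO_MAX = 60, 80
MIN_HEADROOM = 5
MIN_PAIR_MARGIN = 5
MAX_WALL_RATIO = 3.0

# (fixture, bare_score, solo_score, pair_score, pair_solo_wall_ratio) from "rows".
ROWS = [
    ("F16-cli-quote-tax-rules", 50, 75, 96, 1.2805280528052805),
    ("F23-cli-fulfillment-wave", 33, 66, 97, 2.2506234413965087),
    ("F25-cli-cart-promotion-rules", 25, 75, 99, 1.646153846153846),
]


def check_row(bare, solo, pair, wall_ratio):
    """Recompute derived fields; formulas are inferred from the data (assumption)."""
    bare_headroom = BARE_MAX - bare
    solo_headroom = SOLO_MAX - solo
    pair_margin = pair - solo
    numeric_pass = (
        bare_headroom >= MIN_HEADROOM
        and solo_headroom >= MIN_HEADROOM
        and pair_margin >= MIN_PAIR_MARGIN
        and wall_ratio <= MAX_WALL_RATIO  # assumed inclusive bound
    )
    return {
        "bare_headroom": bare_headroom,
        "solo_headroom": solo_headroom,
        "pair_margin": pair_margin,
        "numeric_pass": numeric_pass,
    }


results = {name: check_row(b, s, p, w) for name, b, s, p, w in ROWS}
avg_pair_margin = mean(r["pair_margin"] for r in results.values())
avg_wall_ratio = mean(w for *_, w in ROWS)

for name, r in results.items():
    print(name, r)
print(f"avg_pair_margin={avg_pair_margin:.1f} avg_wall_ratio={avg_wall_ratio:.2f}x")
```

Under these assumptions the recomputed headroom, margin, and average values agree with the fields recorded in the artifact.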
@@ -0,0 +1,18 @@
+ # Full-Pipeline Pair Gate - 20260510-f16-f23-f25-combined-proof
+
+ Verdict: **PASS**
+
+ Fixtures passed: 3/3 (minimum required: 3)
+
+ Rule: at least 3 fixtures; bare <= 60; bare headroom >= 5; solo_claude <= 80; solo_claude headroom >= 5; l2_risk_probes evidence-clean; pair_mode true; pair_trigger eligible with canonical reason; l2_risk_probes - solo_claude >= 5.
+ Average pair margin: +25.3
+ Allowed pair/solo wall ratio: 3.00x
+ Maximum observed pair/solo wall ratio: 2.25x
+ Average pair/solo wall ratio: 1.73x
+ Hypothesis trigger required: false
+
+ | Fixture | Bare | Bare headroom | Solo_claude | Solo_claude headroom | Pair | Margin | Pair mode | Hypothesis trigger | Triggers | Wall ratio | Status | Reason |
+ |---|---:|---:|---:|---:|---:|---:|---|---|---|---:|---|---|
+ | F16-cli-quote-tax-rules | 50 | 10 | 75 | 5 | 96 | +21 | true | true | complexity.high,spec.solo_headroom_hypothesis | 1.28x | PASS | |
+ | F23-cli-fulfillment-wave | 33 | 27 | 66 | 14 | 97 | +31 | true | true | complexity.high,spec.solo_headroom_hypothesis | 2.25x | PASS | |
+ | F25-cli-cart-promotion-rules | 25 | 35 | 75 | 5 | 99 | +24 | true | true | complexity.high,spec.solo_headroom_hypothesis | 1.65x | PASS | |
@@ -0,0 +1,46 @@
+ {
+   "run_id": "20260510-f16-f23-f25-combined-proof",
+   "rule": "at least 3 candidate fixtures; each must satisfy bare <= 60 with headroom >= 5, solo_claude <= 80 with headroom >= 5, with both baseline arms evidence-complete",
+   "verdict": "PASS",
+   "fixtures_total": 3,
+   "fixtures_passed": 3,
+   "min_fixtures": 3,
+   "bare_max": 60,
+   "solo_max": 80,
+   "min_bare_headroom_required": 5,
+   "min_solo_headroom_required": 5,
+   "fixture_count_ok": true,
+   "avg_bare_headroom": 24.0,
+   "min_bare_headroom": 10,
+   "avg_solo_headroom": 8.0,
+   "min_solo_headroom": 5,
+   "rows": [
+     {
+       "fixture": "F16-cli-quote-tax-rules",
+       "status": "PASS",
+       "bare_score": 50,
+       "solo_score": 75,
+       "bare_headroom": 10,
+       "solo_headroom": 5,
+       "reason": ""
+     },
+     {
+       "fixture": "F23-cli-fulfillment-wave",
+       "status": "PASS",
+       "bare_score": 33,
+       "solo_score": 66,
+       "bare_headroom": 27,
+       "solo_headroom": 14,
+       "reason": ""
+     },
+     {
+       "fixture": "F25-cli-cart-promotion-rules",
+       "status": "PASS",
+       "bare_score": 25,
+       "solo_score": 75,
+       "bare_headroom": 35,
+       "solo_headroom": 5,
+       "reason": ""
+     }
+   ]
+ }
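The aggregate fields in the headroom-gate JSON above follow from its `rows`. As a hedged sketch (not the package's `headroom-gate.py`), the snippet below parses an abbreviated copy of the artifact and rebuilds the summary. The helper name `summarize` is hypothetical, and two points are assumptions: headroom is taken as `max - score`, and `fixture_count_ok` is taken to mean "passed rows >= `min_fixtures`".

```python
import json

# Abbreviated copy of the artifact above: only the fields the summary needs.
ARTIFACT = json.loads("""
{
  "min_fixtures": 3,
  "bare_max": 60,
  "solo_max": 80,
  "rows": [
    {"fixture": "F16-cli-quote-tax-rules", "status": "PASS", "bare_score": 50, "solo_score": 75},
    {"fixture": "F23-cli-fulfillment-wave", "status": "PASS", "bare_score": 33, "solo_score": 66},
    {"fixture": "F25-cli-cart-promotion-rules", "status": "PASS", "bare_score": 25, "solo_score": 75}
  ]
}
""")


def summarize(artifact):
    """Rebuild the aggregate fields from the rows (formulas are assumptions)."""
    rows = artifact["rows"]
    bare = [artifact["bare_max"] - r["bare_score"] for r in rows]
    solo = [artifact["solo_max"] - r["solo_score"] for r in rows]
    passed = sum(1 for r in rows if r["status"] == "PASS")
    return {
        "fixtures_total": len(rows),
        "fixtures_passed": passed,
        "fixture_count_ok": passed >= artifact["min_fixtures"],  # assumed meaning
        "avg_bare_headroom": sum(bare) / len(bare),
        "min_bare_headroom": min(bare),
        "avg_solo_headroom": sum(solo) / len(solo),
        "min_solo_headroom": min(solo),
    }


summary = summarize(ARTIFACT)
print(json.dumps(summary, indent=2))
```

A check like this is useful mainly as a sanity pass over a regenerated artifact: if a recomputed aggregate disagrees with the recorded one, the rows and the summary were probably produced from different inputs.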
@@ -0,0 +1,17 @@
+ # Headroom Gate - 20260510-f16-f23-f25-combined-proof
+
+ Verdict: **PASS**
+
+ Fixtures passed: 3/3 (minimum required: 3)
+
+ Rule: at least 3 fixtures; bare <= 60 with headroom >= 5, solo_claude <= 80 with headroom >= 5, both baseline arms evidence-complete.
+ Average bare headroom: 24.0
+ Minimum bare headroom: 10
+ Average solo_claude headroom: 8.0
+ Minimum solo_claude headroom: 5
+
+ | Fixture | Bare | Bare headroom | Solo_claude | Solo_claude headroom | Status | Reason |
+ |---|---:|---:|---:|---:|---|---|
+ | F16-cli-quote-tax-rules | 50 | 10 | 75 | 5 | PASS | |
+ | F23-cli-fulfillment-wave | 33 | 27 | 66 | 14 | PASS | |
+ | F25-cli-cart-promotion-rules | 25 | 35 | 75 | 5 | PASS | |
@@ -0,0 +1,303 @@
+ # Running Real Pair/Solo Benchmarks
+
+ This document is for benchmark runs that spend real model calls and produce
+ judge scores. Use it when a change claims `solo_claude < pair`.
+
+ For wiring checks that must not invoke providers, use `npx devlyn-cli benchmark
+ --dry-run` or the shell tests listed in `README.md`.
+
+ ## Current Score Harness
+
+ The current full-pipeline comparison has three evidence arms:
+
+ | Arm | Meaning |
+ |---|---|
+ | `bare` | control without the devlyn skills |
+ | `solo_claude` | Claude-only `/devlyn:resolve` path |
+ | `l2_risk_probes` | current measured pair path: Claude implement plus Codex-derived risk probes / pair VERIFY |
+
+ `l2_gated` is diagnostic replay only. `l2_forced` is retired and rejected by the
+ runner because it leaks pair-awareness before IMPLEMENT.
+
+ The score artifacts that matter are:
+
+ - `benchmark/auto-resolve/results/<run-id>/<fixture>/judge.json`
+ - `benchmark/auto-resolve/results/<run-id>/<fixture>/<arm>/result.json`
+ - `benchmark/auto-resolve/results/<run-id>/<fixture>/<arm>/verify.json`
+ - `benchmark/auto-resolve/results/<run-id>/full-pipeline-pair-gate.md`
+ - `benchmark/auto-resolve/results/<run-id>/full-pipeline-pair-gate.json`
+
+ Do not treat a score as evidence if the matching arm has a deterministic
+ failure, judge disqualifier, missing `diff.patch`, blocked resolve verdict,
+ failed verify score, provider invocation failure, or an invalid judge axis cell.
+ The matching arms must also appear in `judge.json` `_blind_mapping`; a
+ `scores_by_arm` value without the blind slot mapping is not score evidence.
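The blind-mapping rule above can be illustrated with a small filter. This is only a sketch under stated assumptions: the source names `scores_by_arm` and `_blind_mapping` but does not show their shapes, so the code assumes `_blind_mapping` maps a blind slot label to an arm name and that `scores_by_arm` is keyed by arm name. The function `arms_with_score_evidence`, the slot labels `A`/`B`, and the sample values are hypothetical. The other disqualifiers (deterministic failure, missing `diff.patch`, and so on) are not modeled.

```python
def arms_with_score_evidence(judge):
    """Return arms whose score is backed by a blind slot mapping.

    ASSUMPTION: judge["_blind_mapping"] is {slot_label: arm_name} and
    judge["scores_by_arm"] is {arm_name: score}. Neither shape is
    confirmed by the source document.
    """
    scores = judge.get("scores_by_arm", {})
    mapped_arms = set(judge.get("_blind_mapping", {}).values())
    return {arm for arm in scores if arm in mapped_arms}


# Hypothetical judge.json content: l2_risk_probes has a score but no blind slot.
sample_judge = {
    "scores_by_arm": {"bare": 50, "solo_claude": 75, "l2_risk_probes": 96},
    "_blind_mapping": {"A": "solo_claude", "B": "bare"},
}

usable = arms_with_score_evidence(sample_judge)
print(sorted(usable))
```

In this hypothetical sample, the `l2_risk_probes` score would be ignored because nothing in `_blind_mapping` points at that arm.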
+
+ ## Headroom First
+
+ Before spending new provider calls, check the active frontier:
+
+ ```bash
+ python3 benchmark/auto-resolve/scripts/pair-candidate-frontier.py \
+ --out-md /tmp/devlyn-pair-frontier.md
+ npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
+ ```
+
+ Only `candidate_unmeasured` fixtures need fresh headroom. Fixtures marked
+ `pair_evidence_passed` already have local passing full-pipeline complete pair evidence rows,
+ and fixtures marked `rejected` need rework before pair arms. The frontier command
+ prints existing complete `bare`, `solo_claude`, `pair`, margin, wall ratio, and run id rows to
+ stdout, plus average/minimum pair margin and wall ratio, even when `--out-md`
+ or `--out-json` writes an artifact.
+ Gate-3 pair-eligible manifests carry both `rejected_excluded` and
+ `rejected_excluded_reasons`, so excluded solo-ceiling controls keep their
+ registry reason inside the manifest artifact.
+ After a headroom failure, run
+ `npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json`
+ which invokes `audit-headroom-rejections.py` to ensure no active failed fixture
+ remains outside both the rejected registry and passing pair evidence, and that
+ each active rejected-registry reason is backed by a matching local headroom
+ artifact unless it is an explicit calibration/known-limit fixture.
+ For release/handoff checks, add `--fail-on-unmeasured` to the frontier command
+ to fail when active pair candidates still need headroom measurement.
+ Or run the composite provider-free guard:
+
+ ```bash
+ npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
+ npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
+ ```
69
+
70
+ It invokes `pair-candidate-frontier.py --fail-on-unmeasured` and
71
+ `audit-headroom-rejections.py`, writes `audit.json` with the frontier summary, artifact map,
72
+ `frontier.json`, `frontier.stdout`, `frontier.stderr`,
73
+ and compact trigger-backed verdict-bearing `pair_evidence_rows` (each row carries
74
+ `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`), plus both child JSON reports and child stdout/stderr logs, and prints the existing complete pair score rows
75
+ with pair arm, verdict, and trigger reasons from the frontier step. By default it revalidates frontier `verdict: PASS`, zero unmeasured candidates,
76
+ requires at least four active fixtures with passing pair evidence, and revalidates `pair_mode: true`,
77
+ the default 5-point pair margin, and 3x pair/solo wall ratio. The audit stdout
78
+ also prints `headroom_rejections=...`, `pair_evidence_quality=...`,
79
+ `pair_trigger_reasons=...`, `pair_evidence_hypotheses=...`, and `pair_evidence_hypothesis_triggers=...` handoff rows, plus
80
+ `pair_trigger_historical_aliases=...` when archived evidence includes legacy
81
+ trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
82
+ hypotheses have not yet propagated into trigger reasons, with rejected-fixture
83
+ coverage counts, actual minimum pair margin, maximum pair/solo wall ratio, and
84
+ canonical trigger reason coverage plus row-match status. The compact evidence row count must match the frontier evidence count, so incomplete local score artifacts cannot inflate
85
+ the claim. `checks.frontier_stdout` records summary, aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts, `checks.headroom_rejections` records child verdict plus unrecorded/unsupported counts, `checks.pair_evidence_quality` records the same quality thresholds from the compact rows, `checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status for handoff review, `checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts, and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details. The markdown frontier
86
+ artifact includes the overall verdict plus row-level verdict, pair-arm, and trigger-reason columns.
87
+ Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
88
+ and include a Markdown `Hypothesis trigger` column, so strict regenerated
89
+ evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
90
+ Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
91
+ archived-evidence WARN rows into release-blocking FAIL rows for newly
92
+ regenerated pair evidence.
93
+ Historical trigger aliases are reported only for archived artifact review;
94
+ current pair-evidence gates fail rows whose trigger reasons are historical-only or unknown and
95
+ require at least one canonical `pair_trigger.reasons` entry.
96
+
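The WARN-versus-FAIL behavior of `--require-hypothesis-trigger` described above can be sketched as follows. This is a minimal illustration, not the gate's actual code: the function name, its arguments, and the `example.other_reason` value are hypothetical, and only `spec.solo_headroom_hypothesis` comes from the text.

```python
HYPOTHESIS_TRIGGER = "spec.solo_headroom_hypothesis"  # reason named in the README


def hypothesis_trigger_status(has_documented_hypothesis, trigger_reasons,
                              require_hypothesis_trigger=False):
    """Classify one evidence row (hypothetical helper, assumed semantics)."""
    if not has_documented_hypothesis:
        # Nothing needs to propagate into trigger reasons.
        return "PASS"
    if HYPOTHESIS_TRIGGER in trigger_reasons:
        return "PASS"
    # A documented hypothesis that is missing from the trigger reasons is a gap:
    # WARN for archived evidence, FAIL when the strict flag is set.
    return "FAIL" if require_hypothesis_trigger else "WARN"


# "example.other_reason" is a made-up placeholder, not a real trigger reason.
print(hypothesis_trigger_status(True, ["example.other_reason"]))
print(hypothesis_trigger_status(True, ["example.other_reason"],
                                require_hypothesis_trigger=True))
```

Under these assumptions the same gap row prints `WARN` without the flag and `FAIL` with it.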
97
+ Pair lift is not measurable when `bare` or `solo_claude` is already near the
98
+ ceiling. Calibrate candidate fixtures first:
99
+
100
+ ```bash
101
+ bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh \
102
+ --bare-max 60 \
103
+ --solo-max 80 \
104
+ --min-fixtures 3 \
105
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
106
+ ```
107
+
108
+ Equivalent CLI entrypoint:
109
+
110
+ ```bash
111
+ npx devlyn-cli benchmark headroom \
112
+ --bare-max 60 \
113
+ --solo-max 80 \
114
+ --min-fixtures 3 \
115
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
116
+ ```
117
+
118
+ The runner prints a startup `Gate:` line, the replay `Command:`, and the
119
+ headroom markdown report with `bare`/`solo_claude` scores and remaining headroom against
120
+ the configured thresholds, including average and minimum headroom for the
121
+ candidate set plus fixture pass count. When launched through
122
+ `npx devlyn-cli benchmark headroom`, the replay command uses that same package
123
+ CLI path. Count a fixture only when `headroom-gate.py` reports
124
+ evidence-complete `bare <= 60` and `solo_claude <= 80` with the default minimum 5-point `bare`/`solo_claude` headroom margin. Add `--dry-run` only to validate args,
125
+ fixture ids, minimum fixture count, and the replay command; it does not produce
126
+ scores. When showing scores, include `bare` headroom and `solo_claude` headroom. A real
127
+ headroom run explicitly reports whether the candidate set was accepted or rejected.
128
+ Known rejected or ceiling-saturated fixtures are refused by default; use
129
+ `--allow-rejected-fixtures` only for diagnostics of still active rejected
130
+ fixtures, not for new pair-evidence candidate selection. Retired fixtures are
131
+ preserved for historical artifact replay and are not rerun by the pair-candidate
132
+ runners.
133
+
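As a rough illustration of the ceiling part of the headroom rule, the sketch below computes remaining headroom against `--bare-max 60` and `--solo-max 80`. It is not `headroom-gate.py`: the names `HeadroomRow`, `headroom_status`, and `candidate_set_ok` are hypothetical, and the 5-point minimum margin and the evidence-completeness check are left out because the text does not spell out how they are applied.

```python
from dataclasses import dataclass


@dataclass
class HeadroomRow:
    fixture: str
    bare: int
    solo_claude: int


def headroom_status(row, bare_max=60, solo_max=80):
    """Return remaining headroom per arm and whether both ceilings hold."""
    bare_headroom = bare_max - row.bare
    solo_headroom = solo_max - row.solo_claude
    return {
        "fixture": row.fixture,
        "bare_headroom": bare_headroom,
        "solo_headroom": solo_headroom,
        # Ceiling check only; the minimum-margin rule is intentionally omitted.
        "within_ceilings": bare_headroom >= 0 and solo_headroom >= 0,
    }


def candidate_set_ok(rows, min_fixtures=3, **limits):
    """Count rows within both ceilings and compare against --min-fixtures."""
    passed = [r for r in rows if headroom_status(r, **limits)["within_ceilings"]]
    return len(passed) >= min_fixtures, len(passed)


# Scores taken from runs cited in this README (F21, F27, F9).
rows = [
    HeadroomRow("F21", 33, 66),
    HeadroomRow("F27", 33, 94),
    HeadroomRow("F9", 60, 90),
]
for row in rows:
    print(headroom_status(row))
print(candidate_set_ok(rows))
```

With these three rows only F21 stays under both ceilings, so a `--min-fixtures 3` candidate set would not be accepted in this sketch.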
134
+ ## Full Pair Measurement
135
+
136
+ Run the selected pair arm only after headroom passes:
137
+
138
+ ```bash
139
+ bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
140
+ --min-fixtures 3 \
141
+ --max-pair-solo-wall-ratio 3 \
142
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
143
+ ```
144
+
145
+ Equivalent CLI entrypoint:
146
+
147
+ ```bash
148
+ npx devlyn-cli benchmark pair \
149
+ --min-fixtures 3 \
150
+ --max-pair-solo-wall-ratio 3 \
151
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
152
+ ```
153
+
154
+ For prompt-only pair changes, reuse an evidence-complete calibration run to avoid
155
+ re-spending `bare` and `solo_claude`:
156
+
157
+ ```bash
158
+ bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
159
+ --run-id <new-run-id> \
160
+ --reuse-calibrated-from <prior-headroom-run-id> \
161
+ --min-fixtures 3 \
162
+ --max-pair-solo-wall-ratio 3 \
163
+ F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
164
+ ```
165
+
166
+ The runner prints startup `Headroom:` and `Pair:` lines, the replay `Command:`,
167
+ and the final pair gate report with fixture pass count and average pair margin.
168
+ If headroom fails, it reports that the pair arm was not executed. If the final
169
+ pair gate fails, it reports that pair evidence was rejected. On success, it
170
+ reports that the selected pair arm is executing and then that pair evidence was
171
+ accepted. When launched through `npx devlyn-cli benchmark pair`, the replay
172
+ command uses that same package CLI path. The pair runner and full-pipeline gate
173
+ use the default 3x pair/solo wall ratio unless `--max-pair-solo-wall-ratio` is
174
+ overridden for diagnostics. The full-pipeline gate report separates the allowed pair/solo wall ratio from the maximum observed pair/solo wall ratio, records `require_hypothesis_trigger` in JSON, and includes a Markdown `Hypothesis trigger` column. Add
175
+ `--dry-run` only to validate args, fixture ids, minimum fixture count, and the
176
+ replay command; it does not produce scores. Known rejected or ceiling-saturated
177
+ fixtures are refused by default here too; use
178
+ `--allow-rejected-fixtures` only for diagnostics of still active
179
+ rejected fixtures. Retired fixtures remain historical replay artifacts and are
180
+ not rerun by this candidate runner.
181
+ When showing a real run, report at minimum:
182
+
183
+ - run id
184
+ - fixture id
185
+ - fixtures passed / total and `--min-fixtures`
186
+ - startup `Headroom:` / `Pair:` gate lines
187
+ - `bare`, `solo_claude`, and `l2_risk_probes` scores
188
+ - pair minus `solo_claude` margin
189
+ - average pair margin for the counted set
190
+ - `pair_mode`
191
+ - pair trigger eligibility, trigger reasons, canonical-trigger coverage, and `spec.solo_headroom_hypothesis` coverage when the fixture spec has an actionable solo-headroom hypothesis
192
+ - pair/solo wall-time ratio
193
+ - gate verdict and failure reasons, if any
194
+
195
+ Example reporting shape:
196
+
197
+ ```text
198
+ Run: <run-id>
199
+ Fixture      Bare  Solo_claude  Pair  Pair-Solo_claude  Pair mode  Wall pair/solo  Verdict
200
+ <fixture-a>  42    65           86    +21               true       1.44x           PASS
201
+ <fixture-b>  31    58           82    +24               true       1.48x           PASS
202
+ ```
203
+
204
+ Do not summarize a real run as "pair improved" unless the gate passed or the
205
+ failure reason is explicitly shown next to the scores.
206
+
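The per-row checks behind a pair verdict (pair mode, the default 5-point pair margin, and the 3x pair/solo wall ratio) can be sketched as below. This is an assumption-laden illustration rather than the gate implementation: `pair_row_verdict` is a hypothetical name, and treating the margin and ratio limits as inclusive is a guess.

```python
def pair_row_verdict(solo_claude, pair, pair_mode, wall_ratio,
                     min_margin=5, max_wall_ratio=3.0):
    """Return (margin, verdict, reasons) for one fixture row."""
    margin = pair - solo_claude
    reasons = []
    if not pair_mode:
        reasons.append("pair_mode is not true")
    if margin < min_margin:  # inclusive threshold is an assumption
        reasons.append(f"margin {margin:+d} below {min_margin}")
    if wall_ratio > max_wall_ratio:  # inclusive limit is an assumption
        reasons.append(f"wall ratio {wall_ratio:.2f}x above {max_wall_ratio:g}x")
    return margin, ("PASS" if not reasons else "FAIL"), reasons


# Rows mirror the example reporting shape above: (name, solo, pair, mode, ratio).
rows = [
    ("<fixture-a>", 65, 86, True, 1.44),
    ("<fixture-b>", 58, 82, True, 1.48),
]
for name, solo, pair, mode, ratio in rows:
    margin, verdict, reasons = pair_row_verdict(solo, pair, mode, ratio)
    # Failure reasons are printed next to the scores, per the reporting rule.
    print(name, f"{margin:+d}", verdict, "; ".join(reasons))
```

Returning the failure reasons alongside the margin keeps a failed row from being summarized as "pair improved" without its cause.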
207
+ ## Existing Evidence
208
+
209
+ The current measured pair arm is `l2_risk_probes`.
210
+
211
+ - `20260510-f16-f23-f25-combined-proof` passed the F16/F23/F25 gate with pair
212
+ margins `+21`, `+31`, and `+24`; average pair margin was `+25.3`; average
213
+ pair/solo wall ratio was `1.73x`.
214
+ - `20260509-f16-f25-combined-cartprobe-v2` also passes the current gate for
215
+ the F16/F25 subset with pair margins `+21` and `+24`; average pair margin was
216
+ `+22.5`; average pair/solo wall ratio was `1.46x`.
217
+ - `20260511-f21-current-riskprobes-v1` passed focused F21 evidence with
218
+ `bare 33`, `solo_claude 66`, `l2_risk_probes 99`, margin `+33`, pair mode
219
+ true, and pair/solo wall ratio `1.47x`; it is counted by `benchmark audit` as the fourth passing pair-evidence row.
220
+
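The reported averages can be reproduced from the listed margins. The short check below uses only the numbers quoted above; `average_margin` is a hypothetical helper name.

```python
from statistics import mean


def average_margin(margins):
    """Format the mean pair margin with a sign and one decimal place."""
    return f"{mean(margins):+.1f}"


# Pair margins as listed for each run id in this section.
runs = {
    "20260510-f16-f23-f25-combined-proof": [21, 31, 24],
    "20260509-f16-f25-combined-cartprobe-v2": [21, 24],
}
for run_id, margins in runs.items():
    print(f"{run_id}: average pair margin {average_margin(margins)}")
```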
221
+ F22 and F26 are not pair-lift evidence right now because existing headroom runs
222
+ put `solo_claude` near the ceiling. F27 is also rejected in its first headroom smoke:
223
+ `20260511-f27-headroom-smoke-061401` measured bare 33 / solo_claude 94, with bare
224
+ verification passing only 1 of 3 commands. Rework or rotate F27 before spending
225
+ a pair arm on it. F28 is rejected as pair-lift evidence: earlier unstable runs
226
+ `20260511-f28-headroom-smoke-085307` and `20260511-f28-pair-smoke-091021` were
227
+ superseded after a hidden-oracle bug was found. The oracle had expected a
228
+ defective item to bypass expiration, which the visible spec does not require.
229
+ After re-verifying the same provider diffs against the corrected oracle,
230
+ `20260511-f28-policy-oraclefix-reverified-pair` scored bare 50 / solo_claude 98 /
231
+ `l2_risk_probes` 96, margin -2, and failed headroom. Rework or rotate F28 before
232
+ spending more pair arms.
233
+ F30 is also rejected: `20260511-f30-headroom-v1` scored bare 33 / solo_claude 98, so
234
+ it failed the `solo_claude` headroom precondition and no pair arm should be spent on it.
235
+ F15 is also rejected: `20260511-f15-concurrency-headroom` scored bare 99 /
236
+ solo_claude 94, so it failed both headroom preconditions and should stay a frozen-diff
237
+ review control unless reworked. F3 is also rejected after tightening its HTTP
238
+ error-body verifier: `20260511-f3-http-error-headroom` scored bare 97 / solo_claude 99,
239
+ so it failed both headroom preconditions. F2 medium CLI is rejected by
240
+ `20260512-f2-medium-headroom`: bare 83 / solo_claude 95, so both baseline scores
241
+ exceed headroom ceilings. F4 web browser design is rejected by
242
+ `20260512-f4-web-headroom`: bare 70 / solo_claude 92 with bare disqualifiers, so it
243
+ needs rework before pair arms. F5 fix-loop is rejected by
244
+ `20260512-f5-fixloop-headroom`: bare 99 / solo_claude 99, with `bare` and `solo_claude` each
245
+ passing 5/5 verification commands. F6 dep-audit checksum is rejected by
246
+ `20260512-f6-checksum-headroom`: bare 97 / solo_claude 96, with `bare` and `solo_claude` each
247
+ passing 6/6 verification commands. F7 scope discipline is rejected by
248
+ `20260512-f7-scope-headroom`: bare 99 / solo_claude 100, with `bare` and `solo_claude` each
249
+ passing 6/6 verification commands. F9 ideate-to-resolve remains the novice-flow
250
+ anchor but is rejected as pair evidence by `20260512-f9-e2e-headroom`: bare 60 /
251
+ solo_claude 90 with bare headroom 0 and a bare judge disqualifier, despite passing F9
252
+ artifact checks. Rework it before spending pair arms. F1 and F8 are rejected by
253
+ design as calibration/known-limit controls, not pair-lift evidence candidates.
254
+ F10/F11 are also rejected by `20260507-f10-f11-tier1-full-pipeline`: F10 scored
255
+ bare 75 / solo_claude 94, and F11 scored bare 98 / solo_claude 97. F12 webhook signature/replay is rejected by
256
+ `20260511-f12-webhook-headroom`: bare 85 / solo_claude 99.
257
+ F31 seat rebalance is rejected by `20260512-f31-seat-rebalance-headroom`: bare
258
+ 33 / solo_claude 98, with bare judge/result/verify disqualifiers and `solo_claude` passing 3/3
259
+ verification commands. F32 subscription renewal is rejected by
260
+ `20260512-f32-subscription-renewal-headroom`: bare 33 / solo_claude 98, so it should
261
+ not receive a pair arm unless reworked.
262
+
263
+ ## Smoke Suite
264
+
265
+ The top-level benchmark command still exists for broad suite health:
266
+
267
+ ```bash
268
+ npx devlyn-cli benchmark
269
+ npx devlyn-cli benchmark --judge-only --run-id <ID>
270
+ ```
271
+
272
+ This path runs `variant`, `solo_claude`, and `bare` across fixtures, judges
273
+ them, compiles `summary.json`, and applies `ship-gate.py`. It is useful for
274
+ regression floors and fixture hygiene. For new `solo_claude < pair` claims,
275
+ prefer the headroom plus full-pipeline pair gate above because it names the
276
+ selected pair arm and enforces `pair_mode`.
277
+
278
+ ## Runtime Perf Artifacts
279
+
280
+ Every `/devlyn:resolve` run can also archive state into
281
+ `.devlyn/runs/<run_id>/pipeline.state.json`. Use those artifacts for wall-time
282
+ and phase diagnostics, not as score evidence by themselves.
283
+
284
+ ```bash
285
+ for f in .devlyn/runs/*/pipeline.state.json; do
286
+ jq '{run_id, engine: .engine, phases: .phases, risk_profile: .risk_profile}' "$f"
287
+ done
288
+ ```
289
+
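Where `jq` is not available, the same projection could be done in Python. This is a sketch that assumes each `pipeline.state.json` is a JSON object with the top-level keys used in the `jq` filter above; `load_run_states` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_run_states(root=".devlyn/runs"):
    """Collect run_id, engine, phases, and risk_profile from archived runs."""
    keys = ("run_id", "engine", "phases", "risk_profile")
    states = []
    for path in sorted(Path(root).glob("*/pipeline.state.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Missing keys become None, similar to jq printing null for absent fields.
        states.append({key: data.get(key) for key in keys})
    return states


if __name__ == "__main__":
    for state in load_run_states():
        print(json.dumps(state, indent=2))
```

As with the `jq` loop, the output is for wall-time and phase diagnostics only, not score evidence.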
290
+ When `--perf` data is present, include it as secondary cost evidence. If token
291
+ counts are absent in the environment, say so; do not infer token savings from
292
+ wall-time alone.
293
+
294
+ ## Honest Reporting Rules
295
+
296
+ - Real score claims must cite the run id and fixture ids.
297
+ - A fixture counts only when all measured arms have complete artifacts.
298
+ - Headroom failures are not pair failures; they mean the fixture cannot measure
299
+ lift.
300
+ - Provider-limit or invocation failures make the affected fixture non-evidence.
301
+ - Wall-time ratios are cost signals, not quality scores.
302
+ - Dry-runs, lint, and shell tests prove wiring only. They are not benchmark
303
+ scores.
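The rules above can be read as a checklist applied before a fixture is cited. The sketch below encodes that reading; the record shape (`run_id`, `fixture_id`, `arms_complete`, `invocation_failed`, `dry_run`) is entirely hypothetical and does not correspond to any artifact schema in the benchmark.

```python
def counts_as_evidence(fixture):
    """Return (ok, reasons) for a hypothetical fixture record."""
    reasons = []
    if not fixture.get("run_id") or not fixture.get("fixture_id"):
        reasons.append("missing run id or fixture id")
    if fixture.get("dry_run"):
        reasons.append("dry-run proves wiring only")
    if fixture.get("invocation_failed"):
        reasons.append("provider-limit or invocation failure")
    arms = fixture.get("arms_complete", {})
    if not arms:
        reasons.append("no measured arms")
    incomplete = sorted(arm for arm, ok in arms.items() if not ok)
    if incomplete:
        reasons.append("incomplete artifacts: " + ", ".join(incomplete))
    return (not reasons), reasons


record = {
    "run_id": "<run-id>",
    "fixture_id": "<fixture-a>",
    "arms_complete": {"bare": True, "solo_claude": True, "l2_risk_probes": False},
}
print(counts_as_evidence(record))
```

Headroom failures are deliberately not modeled here, since the rules treat them as "cannot measure lift" rather than as a reason the artifacts are invalid.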