devlyn-cli 2.3.0 → 2.3.1

Files changed (219)
  1. package/AGENTS.md +1 -1
  2. package/CLAUDE.md +2 -2
  3. package/README.md +80 -29
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +210 -17
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +3 -2
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +1 -1
  201. package/config/skills/_shared/runtime-principles.md +5 -8
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  205. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  206. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  207. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  208. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  209. package/config/skills/devlyn:resolve/SKILL.md +49 -18
  210. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  211. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  212. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  213. package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
  214. package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
  215. package/package.json +47 -2
  216. package/scripts/lint-fixtures.sh +349 -0
  217. package/scripts/lint-shadow-fixtures.sh +58 -0
  218. package/scripts/lint-skills.sh +3642 -92
  219. /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
@@ -0,0 +1,341 @@
+ # Auto-Resolve Benchmark Results (v2.1)
+
+ Date: 2026-04-22
+ Baseline ref: `4eb7b47` (v1.14.0 — CPO lens in ideate + handoff enforcement)
+ Head ref: v2.1 STEP 5 complete (commit `5859959`)
+
+ This report separates **measured** properties from **hypothesized** properties. Any number in this file without an explicit measurement script behind it is not reported.
+
+ ## 1. Static properties (measured via `measure-static.py`)
+
+ | Metric | Baseline | Head | Delta | Interpretation |
+ |--------|---------|------|-------|----------------|
+ | `SKILL.md` lines | 602 | 645 | +43 | Slight growth from routing logic + structured output contracts. Offset by phase-prompt concision. |
+ | `SKILL.md` tokens estimate | 10,817 | 14,438 | +3,621 | ~4 chars/token estimate. Head has denser lines (more content per line) due to structured contracts. |
+ | Legacy monolithic artifact refs | 28 | 0 | **−28** | `BUILD-GATE.md` / `EVAL-FINDINGS.md` / `done-criteria.md` / `SPEC-CONTEXT.md` / etc. completely removed. |
+ | Structured artifact refs | 0 | 54 | **+54** | `pipeline.state.json` + `findings.jsonl` references (structured, machine-parseable). |
+ | Goal-driven XML blocks | 3 | 13 | **+10** | `goal` / `output_contract` / `quality_bar` / `principle` / `harness_principles` adoption on BUILD/EVALUATE/CHALLENGE. |
+ | Reference files | 2 (build-gate, engine-routing) | 5 (+findings-schema, pipeline-state, pipeline-routing) | +3 | Forward-declared schemas + routing matrix. On-demand loaded, not always in context. |
+
+ **Measured reading**: the orchestrator's *potential* context surface grew from 922 → 1,590 lines (+668). In practice, reference files are loaded on demand per phase, so the orchestrator rarely carries all 1,590 lines at once. By contrast, the 28→0 removal of legacy artifact references and the 0→54 adoption of structured references always affect orchestrator context, because both live in `SKILL.md`.
+
+ ## 2. Route trace simulations (measured via `trace-route.py`)
+
+ All 3 test cases produce the expected routing outcome. **3/3 match.**
+
+ | Test case | Expected route | Measured route | Phase count | Match |
+ |-----------|---------------|----------------|-------------|-------|
+ | T1-trivial (CLI typo, complexity=low) | `fast` | `fast` | 4 | ✓ |
+ | T2-standard (order cancel, complexity=medium, web files) | `standard` | `standard` | 8 (includes browser) | ✓ |
+ | T3-high-risk (session token rotation, auth keywords) | `strict` | `strict` | 10 | ✓ |
+
+ ### Stage A decision traces
+
+ - T1: `spec.complexity=low, 0 risk keywords → fast`
+ - T2: `spec.complexity=medium, 0 risk keywords → standard`
+ - T3: `risk keyword hit: ['auth', 'session', 'token']... → strict` (correctly force-escalates regardless of complexity)
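The Stage A traces above can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the function name `stage_a_route`, the `RISK_KEYWORDS` set, and the whole-word matching rule are hypothetical. Only the three routes, the `low`/`medium` complexity values, and the `auth`/`session`/`token` keywords come from the traces.

```python
import re

# Hypothetical keyword set: only 'auth', 'session', and 'token' appear in the
# T3 trace; the real list is elided there ("...") and is not reproduced here.
RISK_KEYWORDS = {"auth", "session", "token"}

# Only these two complexity values are covered by the traces.
COMPLEXITY_ROUTES = {"low": "fast", "medium": "standard"}


def stage_a_route(complexity: str, task_text: str) -> str:
    """Return 'fast', 'standard', or 'strict' for a task (illustrative sketch)."""
    words = set(re.findall(r"[a-z]+", task_text.lower()))
    if words & RISK_KEYWORDS:
        # Any risk keyword force-escalates, regardless of complexity.
        return "strict"
    if complexity not in COMPLEXITY_ROUTES:
        # The traces do not say how other complexity values are routed.
        raise ValueError(f"complexity not covered by the traces: {complexity!r}")
    return COMPLEXITY_ROUTES[complexity]
```

Checking the risk keywords before the complexity lookup is what makes the escalation independent of complexity, as the T3 trace describes.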
+
+ ### Phase inclusion matrix validation
+
+ Monotonicity holds: `fast ⊆ standard ⊆ strict`. Each route adds phases on top of the previous one and never removes any.
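The monotonicity property can be checked mechanically with set inclusion. The sketch below is an assumption-laden illustration: the phase names are placeholders (`phase_01` and so on), and only the phase counts 4, 8, and 10 come from the route table. The real phase lists are not reproduced here.

```python
def placeholder_phases(count: int) -> set[str]:
    """Build a set of placeholder phase names (hypothetical, not real phases)."""
    return {f"phase_{i:02d}" for i in range(1, count + 1)}


# Phase counts are taken from the route table; the names are placeholders.
ROUTE_PHASES = {
    "fast": placeholder_phases(4),
    "standard": placeholder_phases(8),
    "strict": placeholder_phases(10),
}


def is_monotonic(route_phases, order=("fast", "standard", "strict")) -> bool:
    """True when each route's phase set is a subset of the next route's set."""
    return all(route_phases[a] <= route_phases[b] for a, b in zip(order, order[1:]))
```

A route that dropped a phase present in the previous route would make `is_monotonic` return `False`.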
+
+ ## 3. What is NOT measured here (future work)
+
+ The following require REAL `/devlyn:resolve` pipeline executions on real tasks. Hypotheses from prior design docs remain labeled as hypotheses until run:
+
+ - **Wall-clock time per route** — requires running the full pipeline with timing.
+ - **Actual token consumption** (Codex + Claude) — requires running with token accounting enabled.
+ - **Fix-round convergence** — how often does `max_rounds` exhaust vs settle?
+ - **Criterion verification correctness** in production-like scenarios — requires real BUILD + EVALUATE on real code.
+ - **False-positive escalation rate** (Stage B escalates when it shouldn't) — needs a population of realistic tasks.
+ - **Fix-batch packet efficiency** (tokens saved vs re-parse) — needs instrumented runs.
+
+ `run-real-benchmark.md` documents the procedure. 30 paired runs across the 3 tiers would take roughly 7.5–15 hours of execution time and are out of scope for this commit.
+
+ ## 3.0 Current benchmark snapshot (provider-free, 2026-05-14)
+
+ Generated from local gate artifacts with:
+
+ ```bash
+ npx devlyn-cli benchmark recent
+ npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
+ ```
+
+ Status:
+
+ - Verdict: **PASS**
+ - Active fixtures: 21
+ - Rejected controls: 17
+ - Pair evidence rows: 4
+ - Unmeasured candidates: 0
+
+ Pair lift:
+
+ - Average margin: **+27.25**
+ - Minimum margin: **+21**
+ - Average wall ratio: 1.66x
+ - Maximum wall ratio: 2.25x
+ - Gate: margin >= +5; wall <= 3.00x
+
+ Evidence cards:
+
+ ### F16 cli quote tax rules
+
+ - Scores: bare 50, solo_claude 75, pair 96.
+ - Lift: +21; wall 1.28x; arm `l2_risk_probes`.
+ - Run: `20260510-f16-f23-f25-combined-proof`.
+ - Triggers: `complexity.high`, `spec.solo_headroom_hypothesis`.
+
+ ### F21 cli scheduler priority
+
+ - Scores: bare 33, solo_claude 66, pair 99.
+ - Lift: +33; wall 1.47x; arm `l2_risk_probes`.
+ - Run: `20260511-f21-current-riskprobes-v1`.
+ - Triggers: `complexity.high`, `risk.high`, `risk_probes.enabled`,
+ `spec.solo_headroom_hypothesis`.
+
+ ### F23 cli fulfillment wave
+
+ - Scores: bare 33, solo_claude 66, pair 97.
+ - Lift: +31; wall 2.25x; arm `l2_risk_probes`.
+ - Run: `20260510-f16-f23-f25-combined-proof`.
+ - Triggers: `complexity.high`, `spec.solo_headroom_hypothesis`.
+
+ ### F25 cli cart promotion rules
+
+ - Scores: bare 25, solo_claude 75, pair 99.
+ - Lift: +24; wall 1.65x; arm `l2_risk_probes`.
+ - Run: `20260510-f16-f23-f25-combined-proof`.
+ - Triggers: `complexity.high`, `spec.solo_headroom_hypothesis`.
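The pair-lift summary can be recomputed from the four evidence cards above. The following is a minimal sketch, not the code behind `benchmark recent`: the names `CARDS` and `summarize` are hypothetical, and applying the gate to the minimum margin and maximum wall ratio (rather than to averages) is an assumption about how the gate is evaluated.

```python
from statistics import mean

# Scores and wall ratios copied from the four evidence cards.
CARDS = {
    "F16": {"solo": 75, "pair": 96, "wall": 1.28},
    "F21": {"solo": 66, "pair": 99, "wall": 1.47},
    "F23": {"solo": 66, "pair": 97, "wall": 2.25},
    "F25": {"solo": 75, "pair": 99, "wall": 1.65},
}


def summarize(cards, min_margin=5, max_wall=3.0):
    """Recompute the pair-lift summary; gate thresholds follow the stated gate."""
    margins = [c["pair"] - c["solo"] for c in cards.values()]
    walls = [c["wall"] for c in cards.values()]
    return {
        "avg_margin": mean(margins),
        "min_margin": min(margins),
        "avg_wall": round(mean(walls), 2),
        "max_wall": max(walls),
        # Assumption: every fixture must clear both thresholds.
        "passes": min(margins) >= min_margin and max(walls) <= max_wall,
    }
```

With the card values as given, the recomputed averages agree with the reported +27.25 margin and 1.66x wall ratio.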
+
+ ## 3.1 Full-pipeline pair evidence (measured 2026-05-09, expanded 2026-05-11)
+
+ Run set: `20260510-f16-f23-f25-combined-proof`
+
+ Gate:
+
+ ```bash
+ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
+   --run-id 20260510-f16-f23-f25-combined-proof \
+   --pair-arm l2_risk_probes \
+   --min-fixtures 3 \
+   --min-pair-margin 5 \
+   --max-pair-solo-wall-ratio 3
+ ```
+
+ Result: **PASS**. All three fixtures satisfy the headroom precondition,
+ including the default 5-point `bare`/`solo_claude` headroom margins. Pair mode actually
+ fired, the pair arm was clean, and the blind judge scored pair above `solo_claude` by
+ more than the +5 margin floor.
+
+ | Fixture | Bare | Solo_claude | Pair (`l2_risk_probes`) | Margin | Pair mode | Wall ratio |
+ |---------|-----:|-----:|------------------------:|-------:|-----------|-----------:|
+ | F16-cli-quote-tax-rules | 50 | 75 | 96 | +21 | true | 1.28x |
+ | F23-cli-fulfillment-wave | 33 | 66 | 97 | +31 | true | 2.25x |
+ | F25-cli-cart-promotion-rules | 25 | 75 | 99 | +24 | true | 1.65x |
+
+ Average pair margin: **+25.3**.
+ Average pair/solo wall ratio: **1.73x**.
+ Headroom summary before pair measurement: average bare headroom **24.0**,
+ minimum bare headroom **10**, average solo_claude headroom **8.0**, minimum solo_claude
+ headroom **5**.
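The headroom summary above can be reproduced as "ceiling minus score" per fixture. This is a sketch under assumptions: the `solo_claude` ceiling of 80 matches the `solo_claude <= 80` precondition quoted elsewhere in this report, while the bare ceiling of 60 is not stated here and is only inferred because it reproduces the reported figures. The names `BARE_CEILING`, `SOLO_CEILING`, and `headroom_summary` are hypothetical.

```python
from statistics import mean

BARE_CEILING = 60  # assumption: inferred from the reported headroom figures
SOLO_CEILING = 80  # matches the `solo_claude <= 80` precondition in this report

# (bare, solo_claude) scores from the three-fixture table.
SCORES = {
    "F16-cli-quote-tax-rules": (50, 75),
    "F23-cli-fulfillment-wave": (33, 66),
    "F25-cli-cart-promotion-rules": (25, 75),
}


def headroom_summary(scores):
    """Average and minimum headroom (ceiling minus score) for both baselines."""
    bare = [BARE_CEILING - b for b, _ in scores.values()]
    solo = [SOLO_CEILING - s for _, s in scores.values()]
    return {
        "avg_bare": mean(bare),
        "min_bare": min(bare),
        "avg_solo": mean(solo),
        "min_solo": min(solo),
    }
```

Under these assumed ceilings, a fixture qualifies only when both headroom values are at least the default 5-point margin, which F16 and F25 meet exactly on the `solo_claude` side.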
+
+ Earlier two-fixture run `20260509-f16-f25-combined-cartprobe-v2` also passed
+ the current gate for F16/F25 with margins +21 and +24, average pair margin
+ +22.5, and average pair/solo wall ratio 1.46x.
+
+ Supporting focused run: `20260509-f25-cartprobe-v2` closed the previous F25
+ gap. The `l2_risk_probes` arm passed 4/4 fixture verification commands, produced
+ a two-file diff, and scored 99 vs `solo_claude` 75.
+
+ Additional focused run: `20260511-f21-current-riskprobes-v1` re-measured F21
+ with the current risk-probe path and passed the same full-pipeline gate with
+ `--min-fixtures 1`. Scores: `bare` 33, `solo_claude` 66, `l2_risk_probes` 99, pair margin
+ +33, pair mode true, pair/solo wall ratio 1.47x. This is supporting fixture
+ evidence for the same pair mechanism and is counted by `benchmark audit` as the
+ fourth passing pair-evidence row alongside the F16/F23/F25 proof run.
+
+ Rejected candidate: `20260508-f26-headroom` measured F26 payout ledger rules at
+ bare 25 / solo_claude 98, so it fails the headroom precondition (`solo_claude <= 80`)
+ despite being a useful ledger arithmetic control fixture.
+
+ Rejected candidate: F22 ledger close reached ceiling in both available headroom
+ runs (`20260507-f21-f22-full-pipeline-pair`: bare 91 / solo_claude 98;
+ `20260508-f22-exact-error-headroom`: bare 94 / solo_claude 98). It is a control
+ fixture, not counted pair-lift evidence.
+
+ Rejected candidate: `20260511-f27-headroom-smoke-061401` measured F27
+ subscription proration at bare 33 / solo_claude 94. It fails the headroom precondition
+ (`solo_claude <= 80`) and bare passed only 1 of 3 verification commands, so it
+ must not be counted as pair evidence until it is reworked or rotated and clears
+ a fresh headroom gate.
+
+ Rejected candidate: F28 return authorization is not pair-lift evidence. Earlier
+ unstable runs `20260511-f28-headroom-smoke-085307` and
+ `20260511-f28-pair-smoke-091021` were superseded after a hidden-oracle bug was
+ found. The oracle had expected a defective item to bypass expiration, which the
+ visible spec does not require. After re-verifying the same provider diffs
+ against the corrected oracle, `20260511-f28-policy-oraclefix-reverified-pair`
+ scored bare 50 / solo_claude 98 / `l2_risk_probes` 96, margin -2, and failed
+ headroom. Rework or rotate F28 before spending more pair arms.
+
+ Rejected candidate: `20260511-f30-headroom-v1` measured F30 credit hold
+ settlement at bare 33 / solo_claude 98. It fails the headroom precondition
+ (`solo_claude <= 80`) and must not be counted as pair evidence until it is
+ reworked or rotated.
+
+ Rejected candidate: `20260511-f15-concurrency-headroom` measured F15 frozen-diff
+ race review at bare 99 / solo_claude 94. It fails both headroom preconditions
+ and should remain a frozen-diff review control unless reworked to expose a lower
+ solo ceiling.
+
+ Rejected candidate: `20260511-f3-http-error-headroom` measured F3 backend
+ contract risk at bare 97 / solo_claude 99 after tightening the invalid-query
+ HTTP error body verifier. It fails both headroom preconditions and should remain
+ a backend contract control unless reworked.
+
+ Rejected candidate: `20260512-f2-medium-headroom` measured F2 medium CLI at
+ bare 83 / solo_claude 95. It has a positive solo-over-bare margin, but both
+ baseline scores exceed current headroom ceilings, so it remains a medium CLI
+ control fixture rather than pair-lift evidence.
+
202
+ Rejected candidate: `20260512-f4-web-headroom` measured F4 web browser design at
203
+ bare 70 / solo_claude 92, with a +22 solo-over-bare margin. It fails headroom
204
+ because both baseline scores exceed the ceilings and bare also carries
205
+ judge/result/verify disqualifiers. Rework F4 before spending a pair arm.
206
+
207
+ Rejected candidate: `20260512-f5-fixloop-headroom` measured F5 fix-loop at bare
208
+ 99 / solo_claude 99, with bare and solo each passing 5/5 verification commands.
209
+ It fails both headroom preconditions and should remain a fix-loop control unless
210
+ reworked.
211
+
212
+ Rejected candidate: `20260512-f6-checksum-headroom` measured F6 dep-audit
213
+ checksum at bare 97 / solo_claude 96, with bare and solo each passing 6/6 verification
214
+ commands. It fails both headroom preconditions and should remain a dep-audit
215
+ control unless reworked.
216
+
217
+ Rejected candidate: `20260512-f7-scope-headroom` measured F7 scope discipline
218
+ at bare 99 / solo_claude 100, with bare and solo each passing 6/6 verification
219
+ commands. It fails both headroom preconditions and should remain a scope-control
220
+ fixture unless reworked.
+
+ Rejected candidate: `20260512-f9-e2e-headroom` measured F9 ideate-to-resolve at
+ bare 60 / solo_claude 90, with a +30 solo-over-bare margin and passing F9
+ artifact checks. It fails headroom because bare headroom is 0 < 5, solo exceeds
+ 80, and bare carries a judge disqualifier. Keep F9 as the novice-flow anchor,
+ but rework it before spending pair arms as pair evidence.
+
+ Rejected by design: F1 is a trivial calibration fixture where every arm is
+ expected to one-shot; F8 is a known-limit ambiguity barometer with expected
+ margin in [-3, +3]. Neither should be used as pair-lift evidence.
+
+ Rejected candidates: `20260507-f10-f11-tier1-full-pipeline` measured F10
+ persistent write collision at bare 75 / solo_claude 94 and F11 batch import at
+ bare 98 / solo_claude 97. Both fail headroom and should remain control fixtures
+ unless reworked.
+
+ Rejected candidate: `20260511-f12-webhook-headroom` measured F12 webhook
+ signature/replay at bare 85 / solo_claude 99. Bare passed 6/7 verification
+ commands and solo passed 7/7, but the blind judge scores still exceed both
+ headroom ceilings, so F12 should remain a webhook/security control unless
+ reworked.
+
+ Rejected candidate: `20260512-f31-seat-rebalance-headroom` measured F31 seat
+ rebalance at bare 33 / solo_claude 98. Bare had 1/3 verification commands
+ passing and carried judge/result/verify disqualifiers; solo passed 3/3. F31
+ therefore fails the solo_claude headroom precondition and must not receive a pair arm
+ unless reworked.
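
The rejections above all apply the same headroom preconditions, which can be sketched roughly as follows. The thresholds are inferred from the F9 note ("bare headroom is 0 < 5, solo exceeds 80") and are assumptions, not values read from the benchmark scripts. The names `check_headroom` and `HeadroomResult` are hypothetical, and treating a bare disqualifier as blocking is a reading of the F4 and F9 notes rather than a confirmed rule.

```python
from dataclasses import dataclass

# Assumed thresholds, inferred from the F9 note: bare 60 gives headroom 0,
# which implies a bare ceiling of 60; solo must not exceed 80.
BARE_CEILING = 60
SOLO_CEILING = 80
MIN_BARE_HEADROOM = 5


@dataclass
class HeadroomResult:
    bare_ok: bool            # bare leaves at least MIN_BARE_HEADROOM points
    solo_ok: bool            # solo_claude is at or below the solo ceiling
    disqualifier_free: bool  # bare carries no judge/result/verify disqualifier

    @property
    def eligible(self) -> bool:
        """A fixture may receive a pair arm only if every check passes."""
        return self.bare_ok and self.solo_ok and self.disqualifier_free


def check_headroom(bare: int, solo: int, bare_disqualified: bool = False) -> HeadroomResult:
    """Evaluate the (assumed) headroom preconditions for one fixture."""
    bare_headroom = BARE_CEILING - bare
    return HeadroomResult(
        bare_ok=bare_headroom >= MIN_BARE_HEADROOM,
        solo_ok=solo <= SOLO_CEILING,
        disqualifier_free=not bare_disqualified,
    )


if __name__ == "__main__":
    # Scores copied from the rejected-candidate notes above.
    candidates = [("F3", 97, 99, False), ("F9", 60, 90, True), ("F31", 33, 98, True)]
    for name, bare, solo, disqualified in candidates:
        result = check_headroom(bare, solo, disqualified)
        print(name, result.bare_ok, result.solo_ok, result.eligible)
```

Under these assumed thresholds, F3 fails both checks, F9 fails both plus the disqualifier check, and F31 passes the bare check but fails on solo, which matches the wording of the notes.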
+
+ ## 4. Conclusions (evidence-based only)
+
+ **Confirmed**:
+ 1. Zero-copy migration is complete: 28 → 0 legacy monolithic artifact references in SKILL.md.
+ 2. Structured-artifact adoption is complete: 54 structured references (`pipeline.state.json` + `findings.jsonl`).
+ 3. Goal-driven prompt adoption: 10 additional `<goal>/<output_contract>/<quality_bar>/<principle>` blocks across BUILD/EVALUATE/CHALLENGE.
+ 4. Routing logic produces the designed outcomes for 3 representative test cases covering all 3 routes.
+ 5. Monotonicity invariant holds: `fast ⊆ standard ⊆ strict`.
+ 6. Full-pipeline pair evidence now clears the three-fixture gate for
+ F16 + F23 + F25: `l2_risk_probes` beats `solo_claude` by +21, +31, and +24
+ points, with average pair margin +25.3, pair mode true, and average
+ pair/solo wall ratio 1.73x.
+ 7. F21 also clears a focused full-pipeline gate after current-risk-probe
+ remeasurement: bare 33 / solo_claude 66 / pair 99, with pair margin +33 and
+ wall ratio 1.47x, and is counted by `benchmark audit` as the fourth passing
+ pair-evidence row.
+ 8. F26 is rejected as pair-lift evidence because `solo_claude` reaches ceiling: bare 25 /
+ solo_claude 98 in `20260508-f26-headroom`.
+ 9. F22 is rejected as pair-lift evidence because both `bare` and `solo_claude` reach ceiling
+ in available headroom runs.
+ 10. F27 is rejected as pair-lift evidence in its first headroom smoke: bare 33 /
+ solo_claude 94, with bare verification 1/3.
+ 11. F28 is rejected as pair-lift evidence. A hidden-oracle bug was corrected,
+ then `20260511-f28-policy-oraclefix-reverified-pair` reverified the same
+ provider diffs at bare 50 / solo_claude 98 / pair 96, margin -2, so the fixture is
+ ceiling-saturated for `solo_claude` and should be reworked or rotated.
+ 12. F30 is rejected as pair-lift evidence in its first headroom run:
+ `20260511-f30-headroom-v1` scored bare 33 / solo_claude 98.
+ 13. F15 is rejected as pair-lift evidence in `20260511-f15-concurrency-headroom`:
+ bare 99 / solo_claude 94, so the fixture is ceiling-saturated.
+ 14. F3 is rejected as pair-lift evidence in `20260511-f3-http-error-headroom`:
+ bare 97 / solo_claude 99, so the fixture is ceiling-saturated.
+ 15. F2 is rejected as pair-lift evidence in `20260512-f2-medium-headroom`:
+ bare 83 / solo_claude 95, so both baseline scores exceed headroom ceilings.
+ 16. F4 is rejected as pair-lift evidence in `20260512-f4-web-headroom`:
+ bare 70 / solo_claude 92 with bare disqualifiers, so it needs rework first.
+ 17. F5 is rejected as pair-lift evidence in `20260512-f5-fixloop-headroom`:
+ bare 99 / solo_claude 99, so the fixture is ceiling-saturated.
+ 18. F6 is rejected as pair-lift evidence in `20260512-f6-checksum-headroom`:
+ bare 97 / solo_claude 96, so the fixture is ceiling-saturated.
+ 19. F7 is rejected as pair-lift evidence in `20260512-f7-scope-headroom`:
+ bare 99 / solo_claude 100, so the fixture is ceiling-saturated.
+ 20. F9 is rejected as pair-lift evidence in `20260512-f9-e2e-headroom`:
+ bare 60 / solo_claude 90 with bare headroom 0 and a bare judge disqualifier.
+ 21. F1 and F8 are rejected by design as calibration/known-limit controls, not
+ pair-lift evidence candidates.
+ 22. F10 and F11 are rejected as pair-lift evidence in
+ `20260507-f10-f11-tier1-full-pipeline`: F10 scored bare 75 / solo_claude 94, and
+ F11 scored bare 98 / solo_claude 97.
+ 23. F12 is rejected as pair-lift evidence in `20260511-f12-webhook-headroom`:
+ bare 85 / solo_claude 99, so the fixture is ceiling-saturated.
+ 24. F31 is rejected as pair-lift evidence in
+ `20260512-f31-seat-rebalance-headroom`: bare 33 / solo_claude 98, with bare
+ disqualifiers and `solo_claude` at ceiling.
+ 25. F32 is rejected as pair-lift evidence in
+ `20260512-f32-subscription-renewal-headroom`: bare 33 / solo_claude 98, so the
+ subscription renewal fixture is solo-ceiling despite useful rollback/shape
+ coverage.
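
The pair-margin figures quoted in the list above can be re-derived with a few lines of arithmetic. This is only a worked check, assuming that "pair margin" means the pair score minus the `solo_claude` score and that the three margins are listed in the same order as the fixtures; `pair_margin` is a hypothetical helper, not a function from the benchmark scripts.

```python
from statistics import mean


def pair_margin(solo: int, pair: int) -> int:
    """Pair-over-solo margin in points (assumed definition: pair minus solo)."""
    return pair - solo


# Margins as reported for the three-fixture gate (fixture order is assumed).
reported_margins = {"F16": 21, "F23": 31, "F25": 24}
average_margin = round(mean(list(reported_margins.values())), 1)

# Score triples quoted in the list: solo_claude 66 / pair 99 and solo_claude 98 / pair 96.
f21_margin = pair_margin(solo=66, pair=99)
f28_margin = pair_margin(solo=98, pair=96)

if __name__ == "__main__":
    print("average margin:", average_margin)
    print("F21 margin:", f21_margin)
    print("F28 margin:", f28_margin)
```

The negative F28 margin is what makes that fixture ceiling-saturated for `solo_claude`: the pair arm has no room left to improve on a solo score of 98.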
+
+ **Still hypothetical** (pending real-run validation):
+ - Wall-time reduction for `fast` route on trivial tasks.
+ - Token consumption reduction from fix-batch packet.
+ - Overall pipeline throughput change beyond the measured F16/F23/F25 pair gate
+ and focused F21 pair evidence.
+
+ ## 5. Reproducing
+
+ ```bash
+ # Static
+ python3 benchmark/auto-resolve/measure-static.py \
+   --baseline 4eb7b47 --head HEAD
+
+ # Route trace (all test cases)
+ python3 benchmark/auto-resolve/trace-route.py --all
+
+ # Single test case
+ python3 benchmark/auto-resolve/trace-route.py --test-case T3-high-risk
+
+ # Full-pipeline pair gate evidence
+ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
+   --run-id 20260510-f16-f23-f25-combined-proof \
+   --pair-arm l2_risk_probes \
+   --min-fixtures 3 \
+   --min-pair-margin 5 \
+   --max-pair-solo-wall-ratio 3
+
+ # Additional focused F21 evidence
+ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
+   --run-id 20260511-f21-current-riskprobes-v1 \
+   --pair-arm l2_risk_probes \
+   --min-fixtures 1 \
+   --min-pair-margin 5 \
+   --max-pair-solo-wall-ratio 3
+ ```
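
To make the three gate flags concrete, here is a minimal sketch of how such thresholds might combine. It is not the logic of `full-pipeline-pair-gate.py`: the `PairRow` type, the `gate_passes` function, and the choice to average wall ratios over the passing rows are all assumptions made for illustration. Only the threshold values and the F21 figures (margin +33, wall ratio 1.47x) come from this document.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class PairRow:
    fixture: str
    pair_margin: float  # pair score minus solo_claude score, in points
    wall_ratio: float   # pair wall time divided by solo wall time


def gate_passes(
    rows: List[PairRow],
    min_fixtures: int,
    min_pair_margin: float,
    max_pair_solo_wall_ratio: float,
) -> bool:
    """Assumed gate: enough fixtures clear the margin, at acceptable wall cost."""
    passing = [row for row in rows if row.pair_margin >= min_pair_margin]
    if len(passing) < min_fixtures:
        return False
    average_ratio = mean([row.wall_ratio for row in passing])
    return average_ratio <= max_pair_solo_wall_ratio


if __name__ == "__main__":
    # F21 figures as reported in the conclusions section.
    f21 = [PairRow("F21", pair_margin=33, wall_ratio=1.47)]
    print(gate_passes(f21, min_fixtures=1, min_pair_margin=5, max_pair_solo_wall_ratio=3))
    print(gate_passes(f21, min_fixtures=3, min_pair_margin=5, max_pair_solo_wall_ratio=3))
```

With `min_fixtures=1` the single F21 row clears this sketch of the gate, while the same row alone would not satisfy a three-fixture requirement, which is why F21 is reported as focused evidence separate from the F16/F23/F25 gate.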