devlyn-cli 2.2.2 → 2.3.1

Files changed (220)
  1. package/AGENTS.md +2 -2
  2. package/CLAUDE.md +4 -4
  3. package/README.md +85 -34
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +221 -17
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +5 -4
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +17 -13
  201. package/config/skills/_shared/runtime-principles.md +6 -9
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:design-ui/SKILL.md +364 -0
  205. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  206. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  207. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  208. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  209. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  210. package/config/skills/devlyn:resolve/SKILL.md +78 -26
  211. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  212. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  213. package/config/skills/devlyn:resolve/references/phases/implement.md +1 -1
  214. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  215. package/config/skills/devlyn:resolve/references/phases/verify.md +80 -29
  216. package/config/skills/devlyn:resolve/references/state-schema.md +9 -4
  217. package/package.json +47 -2
  218. package/scripts/lint-fixtures.sh +349 -0
  219. package/scripts/lint-shadow-fixtures.sh +58 -0
  220. package/scripts/lint-skills.sh +3645 -95
package/AGENTS.md CHANGED
@@ -28,7 +28,7 @@ ideate (optional) -> resolve -> ship
28
28
 
29
29
  - `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project` (multi-feature).
30
30
  - `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <ref> --spec <path>`. Phases run inline: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh-subagent, findings-only).
31
- - Three creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
31
+ - `/devlyn:design-ui` — required creative UI exploration surface. Optional companion skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
32
32
 
33
33
  Each skill's `SKILL.md` is the source of truth for flags and workflow. Do not duplicate.
34
34
 
@@ -73,7 +73,7 @@ No silent fallbacks.
73
73
  - Fallbacks allowed only when widely accepted and harmless (CSS fallback fonts, CDN failover, image placeholders).
74
74
  - Silent `catch` blocks are bugs.
75
75
  - Logging is not user-visible error handling.
76
- - The Codex CLI availability downgrade is the one documented exception: emit `engine downgraded: codex-unavailable` and behave exactly like explicit Claude routing.
76
+ - No engine-availability fallback is permitted for pair/risk-probe routes: if required Codex or Claude is unavailable, emit `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` with setup guidance. `--no-pair` and `--no-risk-probes` are explicit user opt-outs, not fallbacks.
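The no-fallback rule in the added line above can be illustrated with a minimal sketch. This is not the package's implementation: the function names, the `ENGINE_BINARIES` mapping, and the use of a PATH lookup to detect an engine are all assumptions made for illustration. Only the `BLOCKED:codex-unavailable` and `BLOCKED:claude-unavailable` strings come from the diff.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Hypothetical mapping from engine name to the executable looked up on PATH.
ENGINE_BINARIES = {"codex": "codex", "claude": "claude"}


def required_engines(pair: bool, risk_probes: bool) -> List[str]:
    """Engines a run needs. Assumption: pair/risk-probe routes need both engines."""
    if pair or risk_probes:
        return ["claude", "codex"]
    return ["claude"]


def check_engines(
    required: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a BLOCKED verdict for the first missing engine, or None if all exist.

    There is deliberately no downgrade branch: a missing engine stops the run.
    """
    for engine in required:
        if which(ENGINE_BINARIES[engine]) is None:
            return f"BLOCKED:{engine}-unavailable"
    return None
```

In this sketch, passing `pair=False, risk_probes=False` models the explicit `--no-pair` / `--no-risk-probes` opt-outs: the set of required engines shrinks because the user asked for it, not because an engine happened to be missing.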
77
77
 
78
78
  ## Evidence Over Claim
79
79
 
package/CLAUDE.md CHANGED
@@ -24,7 +24,7 @@ The runtime sub-agent contract below (Subtractive-first / Goal-locked / No-worka
24
24
 
25
25
  ## Quick Start
26
26
 
27
- Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has a gated pair-JUDGE product path when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path; the harness silently downgrades to `claude` and emits a banner if the Codex CLI is missing.
27
+ Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED; `/devlyn:design-ui` is also REQUIRED as the creative UI exploration surface. **Both pipeline skills default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has conditional-default pair-JUDGE when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path. If a selected or conditionally required engine is unavailable, the run stops with `BLOCKED:<engine>-unavailable` and setup guidance.
28
28
 
29
29
  1. `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project`.
30
30
  2. `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <diff> --spec <path>`. Phases: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only).
@@ -123,7 +123,7 @@ No `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scr
123
123
 
124
124
  **Permitted exceptions** (explicitly carved out):
125
125
  - CSS fallback fonts, CDN failover, image placeholders — widely-accepted best practices.
126
- - Codex CLI availability downgrade — the one documented silent fallback in this repo. Fires when the resolved engine is `auto` or `codex` (either via skill default or explicit `--engine` flag) and the Codex CLI is absent. Banner `engine downgraded: codex-unavailable` always prints; verdict identical to `--engine claude`. Any other silent fallback in skills code is a bug — file it against the skill that introduced it.
126
+ - No engine-availability fallback is permitted for `/devlyn:resolve` pair/risk-probe routes. If Codex or Claude is required and unavailable, the run stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` plus setup guidance. `--no-pair` / `--no-risk-probes` are explicit user opt-outs, not fallbacks.
127
127
  <!-- runtime-principles:section=no-workaround:end -->
128
128
 
129
129
  ### Evidence over claim
@@ -141,7 +141,7 @@ A finding without one of these forms is excluded. Vague findings produce vague f
141
141
 
142
142
  ## Codex invocation
143
143
 
144
- When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex` or `--engine auto`), the wrapper-form contract lives in `config/skills/_shared/codex-config.md` (or `.claude/skills/_shared/codex-config.md` once installed). Omit `-m <model>` — the CLI's current flagship is used automatically. MCP is not in the loop. If the Codex CLI is absent the harness silently downgrades to Claude and prints `engine downgraded: codex-unavailable` in the final report.
144
+ When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex`, `--engine auto`, or conditional VERIFY pair/risk-probe routing), the wrapper-form contract lives in `config/skills/_shared/codex-config.md` (or `.claude/skills/_shared/codex-config.md` once installed). Omit `-m <model>` — the CLI's current flagship is used automatically. MCP is not in the loop. If Codex is required and unavailable, stop with `BLOCKED:codex-unavailable` and setup guidance.
145
145
 
146
146
  ## Working Mode
147
147
 
@@ -152,7 +152,7 @@ When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine cod
152
152
 
153
153
  ## Skill Boundary Policy
154
154
 
155
- Post iter-0034 Phase 4 cutover (2026-05-04) the runtime product surface is two skills — `/devlyn:resolve` and `/devlyn:ideate`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Three creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
155
+ Post iter-0034 Phase 4 cutover (2026-05-04) the runtime pipeline surface is two skills — `/devlyn:resolve` and `/devlyn:ideate` — plus the required creative UI exploration surface `/devlyn:design-ui`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Optional creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
156
156
 
157
157
  Browser validation routes through `_shared/browser-runner.sh` (Chrome MCP → Playwright → curl tier) directly from BUILD_GATE — there is no separate `/devlyn:browser-validate` skill at HEAD.
158
158
 
package/README.md CHANGED
@@ -27,13 +27,13 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
27
27
  npx devlyn-cli
28
28
  ```
29
29
 
30
- That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
30
+ That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` + `/devlyn:design-ui` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
31
31
 
32
32
  ---
33
33
 
34
34
  ## How It Works — Two Skills, Full Cycle
35
35
 
36
- devlyn-cli turns Claude Code into a hands-free development pipeline. The product surface is two skills:
36
+ devlyn-cli turns Claude Code into a hands-free development pipeline. The pipeline surface is two skills, with `/devlyn:design-ui` installed as the required creative UI surface:
37
37
 
38
38
  ```
39
39
  ideate (optional) → resolve → ship
@@ -79,11 +79,25 @@ PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent
79
79
  - **VERIFY** runs in a fresh subagent context with no code-mutation tools — findings only, structurally independent.
80
80
  - Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
81
81
 
82
- Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--perf` (per-phase timing).
82
+ Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--no-pair` (intentional solo VERIFY), `--risk-probes` / `--no-risk-probes`, `--perf` (per-phase timing).
83
+ `--pair-verify` and `--no-pair` are mutually exclusive; using both stops with `BLOCKED:invalid-flags`.
83
84
 
84
- ### Engine selection Claude solo by default
85
+ Free-form goals that ask for benchmark evidence, pair-evidence, risk-probe
86
+ measurement, `solo<pair` proof, or solo-headroom work must include an
87
+ actionable `solo-headroom hypothesis` naming the visible behavior `solo_claude`
88
+ is expected to miss plus a backticked observable command; the backticked line
89
+ itself must contain `miss` and be framed as the command/observable that exposes it. Without that,
90
+ `/devlyn:resolve` stops with `BLOCKED:solo-headroom-hypothesis-required` and
91
+ points you to `/devlyn:ideate` instead of inventing a weak hypothesis.
92
+ Free-form goals that add or run a new unmeasured benchmark, shadow fixture,
93
+ golden fixture, risk-probe, or pair-evidence candidate must also include
94
+ `solo ceiling avoidance`, mention `solo_claude`, and name the concrete
95
+ difference from rejected or solo-saturated controls such as `S2`-`S6`; without
96
+ that, `/devlyn:resolve` stops with `BLOCKED:solo-ceiling-avoidance-required`.
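The flag rule and the solo-headroom requirement added above could be approximated as follows. This is a loose sketch under stated assumptions, not the logic shipped in `/devlyn:resolve`: the function names are hypothetical, the text matching is a simplified reading of the README wording, and the caller is assumed to invoke `gate_goal` only for goals that ask for benchmark or pair evidence. The `BLOCKED:` strings are taken from the diff.

```python
import re
from typing import Optional, Set

# Matches any backticked span, e.g. `node verify.js`.
BACKTICKED = re.compile(r"`[^`]+`")


def validate_flags(flags: Set[str]) -> Optional[str]:
    """--pair-verify and --no-pair are mutually exclusive."""
    if {"--pair-verify", "--no-pair"} <= flags:
        return "BLOCKED:invalid-flags"
    return None


def has_headroom_hypothesis(goal: str) -> bool:
    """Simplified check (assumption): the goal names the hypothesis and solo_claude,
    and at least one line holds both a backticked command and the word 'miss'."""
    text = goal.lower()
    if "solo-headroom hypothesis" not in text or "solo_claude" not in text:
        return False
    return any(
        BACKTICKED.search(line) and "miss" in line.lower()
        for line in goal.splitlines()
    )


def gate_goal(goal: str, flags: Set[str]) -> Optional[str]:
    """Return the first BLOCKED verdict, or None when the goal may proceed."""
    verdict = validate_flags(flags)
    if verdict is not None:
        return verdict
    if not has_headroom_hypothesis(goal):
        return "BLOCKED:solo-headroom-hypothesis-required"
    return None
```

A check for `solo ceiling avoidance` would follow the same shape and return `BLOCKED:solo-ceiling-avoidance-required`; it is omitted here because the README does not spell out how the "concrete difference" from the `S2`-`S6` controls is recognized.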
85
97
 
86
- `--engine claude` (default) is the canonical surface. Every phase routes to Claude.
98
+ ### Engine selection Claude implementation, conditional pair VERIFY
99
+
100
+ `--engine claude` (default) is the canonical implementation surface for PLAN, IMPLEMENT, BUILD_GATE, and CLEANUP. VERIFY/JUDGE conditionally runs pair mode for verify-only runs, high-risk specs, risk probes, mechanical warnings, coverage gaps, or explicit `--pair-verify`.
87
101
 
88
102
  `--engine codex` routes IMPLEMENT to Codex; `--engine auto` opts into the experimental dual-engine routing where applicable. Both are research-only at HEAD: iter-0020 closed Codex BUILD/IMPLEMENT below the quality floor on the 9-fixture suite (L2 vs L1 = −3.6, 3/8 gated fixtures cleared the +5 margin floor — release-readiness FAIL); iter-0033g + iter-0034 closed PLAN-pair as research-only with explicit unblock conditions (container/sandbox infra OR production telemetry capturing positive evidence of subagent introspection). Install the Codex CLI (https://platform.openai.com/docs/codex) and pass the flag explicitly to opt in:
89
103
 
@@ -91,49 +105,86 @@ Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-g
91
105
  /devlyn:resolve "fix the auth bug" --engine auto # experimental, research-only
92
106
  ```
93
107
 
94
- If Codex is absent when `--engine auto` or `--engine codex` is requested, the harness silently downgrades to `--engine claude` and emits a banner in the final report.
95
-
96
- <details>
97
- <summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
98
-
99
- `/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
108
+ If Codex or Claude is absent when explicitly selected or conditionally required, the harness stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` and prints setup guidance. Use `--no-pair` only when intentionally accepting solo VERIFY; use `--no-risk-probes` only when intentionally disabling automatic high-risk probes.
100
109
 
101
- - **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
102
- - **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
103
- - **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
104
- - **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
105
- - **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
106
- - **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
107
- - **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
108
- - **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
109
- - **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
110
+ ### Benchmark score runs
110
111
 
111
- </details>
112
-
113
- <details>
114
- <summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
112
+ Use the benchmark CLI when a change claims `solo_claude < pair`. The score-focused runners print the run id, startup gate lines, blind-judge score tables, fixture pair margins, average pair margin, wall-time ratio, and failure reasons:
115
113
 
116
- Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
114
+ ```bash
115
+ npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
116
+ npx devlyn-cli benchmark recent
117
+ npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
118
+ npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
119
+ npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
120
+ npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
121
+ npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
122
+ ```
117
123
 
118
- - **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
119
- - **`--with-codex` consolidated into `--engine auto`** — auto covers BUILD/FIX + team roles + ideate CHALLENGE critic. Legacy flag still accepted with a graceful handoff. *(Note: post iter-0020 close-out, `--engine auto` is experimental research-only; default is `--engine claude`.)*
120
- - **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
121
- - **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
124
+ `benchmark recent` prints a compact, wrap-safe snapshot of the current local
125
+ pair evidence: status counts, pair-lift aggregates, and one card per passing
126
+ pair-evidence fixture. It intentionally avoids wide Markdown tables, so the
127
+ same output stays readable in narrow terminals, PR comments, and release notes.
128
+ `benchmark frontier` also prints a stdout score summary for existing complete pair
129
+ evidence rows, including pair arm, trigger reasons, average/minimum pair margin,
130
+ and wall ratio, plus row-level verdicts even when `--out-json` or `--out-md`
131
+ writes an artifact. Markdown frontier artifacts include a `Triggers` column.
132
+ Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
133
+ and include a Markdown `Hypothesis trigger` column, so strict regenerated
134
+ evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
135
+ `benchmark audit` is the provider-free release/handoff guard: it writes
136
+ `audit.json` with the frontier summary, artifact map, and compact trigger-backed verdict-bearing `pair_evidence_rows`
137
+ (each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`), runs the frontier with
138
+ `--fail-on-unmeasured`, requires at least four fixtures with passing pair evidence,
139
+ revalidates frontier `verdict: PASS`, zero unmeasured candidates, and revalidates `pair_mode: true`,
140
+ the default 5-point pair margin, and 3x pair/solo wall ratio, then
141
+ audits failed headroom results. The audit stdout also prints
142
+ `headroom_rejections=...`, `pair_evidence_quality=...`,
143
+ `pair_trigger_reasons=...`, `pair_evidence_hypotheses=...`, and
144
+ `pair_evidence_hypothesis_triggers=...` handoff rows, plus
145
+ `pair_trigger_historical_aliases=...` when archived evidence includes legacy
146
+ trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
147
+ hypotheses have not yet propagated into trigger reasons, with the rejected-fixture
148
+ coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
149
+ and canonical trigger reason coverage plus row-match status.
150
+ The compact evidence row count must match the frontier evidence count,
151
+ `checks.frontier_stdout` records summary, aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts,
152
+ `checks.headroom_rejections` records child verdict plus unrecorded/unsupported counts,
153
+ `checks.pair_evidence_quality` records the same quality thresholds from the compact rows,
154
+ `checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status,
155
+ `checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts,
156
+ and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details
157
+ so incomplete or low-quality local score artifacts cannot inflate the claim.
158
+ Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
159
+ archived-evidence WARN rows into release-blocking FAIL rows for newly
160
+ regenerated pair evidence.
122
161
 
123
- </details>
162
+ ```bash
163
+ npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
164
+ ```
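
To make the per-row audit conditions easier to follow, here is a hedged sketch of how one `pair_evidence_rows` entry could be checked. The field names come from the description above; the function name `row_problems`, the flat dict shape of a row, and the exact failure messages are assumptions for illustration, not the audit's real code.

```python
# Fields the text above says every compact evidence row must carry as true.
REQUIRED_TRUE = ("pair_trigger_eligible", "pair_trigger_has_canonical_reason")


def row_problems(row: dict, require_hypothesis_trigger: bool = False) -> list:
    """Return a list of human-readable problems for one evidence row.

    Assumption: a row is a flat dict; an empty list means the row is acceptable.
    """
    problems = []
    for key in REQUIRED_TRUE:
        if row.get(key) is not True:
            problems.append(f"{key} is not true")
    if not row.get("pair_trigger_reasons"):
        problems.append("pair_trigger_reasons is empty")
    # Mirrors --require-hypothesis-trigger: the gap becomes a failure only in strict mode.
    if require_hypothesis_trigger and row.get("pair_trigger_has_hypothesis_reason") is not True:
        problems.append("hypothesis trigger reason is missing")
    return problems


if __name__ == "__main__":
    row = {
        "pair_trigger_eligible": True,
        "pair_trigger_reasons": ["spec.solo_headroom_hypothesis"],
        "pair_trigger_has_canonical_reason": True,
        "pair_trigger_has_hypothesis_reason": False,
    }
    print(row_problems(row))
    print(row_problems(row, require_hypothesis_trigger=True))
```

In this sketch the same row passes the default audit and fails the strict one, which matches the WARN-to-FAIL behavior described for `--require-hypothesis-trigger`.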
124
165
 
125
- ---
166
+ Historical trigger aliases are only reported for archived artifact review; new
167
+ current pair-evidence gates fail historical-only or unknown trigger reasons and
168
+ require at least one canonical `pair_trigger.reasons` entry.
169
+ `benchmark audit-headroom` fails if an active failed headroom fixture is missing
170
+ from both rejected registry and passing pair evidence.
171
+ Headroom runs use the current claim gate: `bare <= 60`, `solo_claude <= 80`,
172
+ and the default 5-point `bare`/`solo_claude` headroom margins before spending a pair arm.
173
+ Add `--dry-run` to either score runner to validate args, fixture ids, minimum
174
+ fixture count, and the replay command without running arms or judges. Dry-runs
175
+ and lint prove wiring only; real score claims must cite the run id and fixture
176
+ ids.
126
177
 
127
178
  ## Optional Power-User Skills
128
179
 
129
- Two creative skills have moved to `optional-skills/` — install them via the interactive installer when you need them.
180
+ Two creative companion skills live in `optional-skills/` — install them via the interactive installer when you need them.
130
181
 
131
182
  | Command | Use When |
132
183
  |---|---|
133
184
  | `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
134
185
  | `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
135
186
 
136
- > Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). These were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
187
+ > Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). Most were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:design-ui` is now installed as a required creative UI surface. Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
137
188
 
138
189
  ---
139
190
 
@@ -194,7 +245,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
194
245
  |---|---|
195
246
  | `playwright` | Playwright MCP — powers `/devlyn:resolve` BUILD_GATE browser tier (Chrome MCP → Playwright → curl fallback) |
196
247
 
197
- > `--engine auto/codex` uses the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex; the harness silently downgrades to `--engine claude` if the CLI is missing.
248
+ > `--engine auto/codex` and conditional VERIFY pair mode use the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex, run the current Codex auth/login flow, verify `codex --version`, then rerun.
198
249
 
199
250
  </details>
200
251
 
@@ -2,12 +2,18 @@
2
2
 
3
3
  **Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
4
4
 
5
- **Purpose.** Replace ad-hoc A/B benchmarking with a permanent, comprehensive,
5
+ **Purpose.** Replace ad-hoc harness benchmarking with a permanent, comprehensive,
6
6
  one-command suite that gates every future harness change with a ship/rollback
7
7
  decision. Any prompt edit, phase reorder, new native skill, or model upgrade
8
8
  can be validated by running the suite and reading the numbers.
9
9
 
10
- **Arm structure (current vs planned).** Today the suite runs `variant` (L2: Claude + Codex pair) vs `bare` (L0). The L1 (solo harness on a single LLM) arm is queued for iter-0020 — until then the benchmark cannot directly verify the L1 contract, only the L0 ↔ L2 delta. Single-LLM users (Opus alone, GPT-5.5 alone) are first-class per the North Star, so this gap is a release-blocker for them, not a future enhancement.
10
+ **Arm structure.** Current full-pipeline evidence uses three arms: `bare` (L0),
11
+ `solo_claude` (L1 solo harness), and an L2 pair arm (`variant` in the smoke
12
+ suite, or a focused pair arm such as `l2_risk_probes` in pair-candidate runs).
13
+ Pair claims are headroom-gated: counted fixtures must leave room above solo
14
+ (`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins),
15
+ the pair arm must actually run, and blind judging must show pair above solo by
16
+ the configured margin.
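
As a rough illustration of the headroom-gated pair claim, the sketch below encodes the three stated conditions: the fixture leaves headroom (`bare <= 60`, `solo_claude <= 80`), the pair arm actually ran, and the pair score beats solo by the configured margin. The function name `pair_claim_counts` is hypothetical, and the separate `bare`/`solo_claude` headroom margins are left out because their exact definition is not given in this section.

```python
from typing import Optional


def pair_claim_counts(
    bare: float,
    solo: float,
    pair: Optional[float],
    *,
    bare_max: float = 60,
    solo_max: float = 80,
    margin: float = 5,
) -> bool:
    """Hypothetical gate: should this fixture count as pair evidence?

    pair is None when the pair arm did not run.
    """
    if pair is None:
        return False  # the pair arm must actually run
    has_headroom = bare <= bare_max and solo <= solo_max
    return has_headroom and (pair - solo) >= margin


if __name__ == "__main__":
    print(pair_claim_counts(55, 75, 82))    # headroom available, pair is 7 above solo
    print(pair_claim_counts(55, 85, 95))    # solo already above the 80 ceiling
    print(pair_claim_counts(55, 75, None))  # pair arm never ran
```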
11
17
 
12
18
  **Non-goals.** Publishable-research statistical rigor. Not a regression test
13
19
  library for the product code — those live elsewhere. Not a substitute for
@@ -20,7 +26,7 @@ production telemetry — just enough signal for ship decisions.
20
26
  1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
21
27
  verdict. No manual fixture setup.
22
28
  2. **Novice-proof.** The suite exercises the same paths a first-time user
23
- hits — including an end-to-end `ideate → auto-resolve → preflight` fixture.
29
+ hits — including an end-to-end `ideate → resolve` fixture.
24
30
  3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
25
31
  stable; scores and margins float up as models improve. Nothing is
26
32
  hardcoded to a specific model version.
@@ -56,10 +62,11 @@ benchmark/auto-resolve/
56
62
  │ ├── F6-dep-audit-native-module/
57
63
  │ ├── F7-out-of-scope-trap/
58
64
  │ ├── F8-known-limit-ambiguous/
59
- └── F9-e2e-ideate-to-resolve/
65
+ ├── F9-e2e-ideate-to-resolve/
66
+ │ └── F10+ extensions for headroom, full-pipeline pair, and frozen VERIFY
60
67
 
61
68
  ├── scripts/
62
- │ ├── run-suite.sh # single entry — runs all fixtures × 2 arms + judge + report
69
+ │ ├── run-suite.sh # smoke entry — runs fixture arms + judge + report
63
70
  │ ├── run-fixture.sh # one fixture, one arm
64
71
  │ ├── judge.sh # Codex blind judge (model-agnostic)
65
72
  │ ├── compile-report.py # aggregate into report.md + summary.json
@@ -68,8 +75,9 @@ benchmark/auto-resolve/
68
75
  ├── results/ # per-run artifacts (overwritten)
69
76
  │ └── <run-id>/
70
77
  │ ├── <fixture>/
71
- │ │ ├── variant/{input.md, transcript.txt, diff.patch, verify.json, timing.json}
72
- │ │ └── bare/{same}
78
+ │ │ ├── bare/{input.md, transcript.txt, diff.patch, result.json}
79
+ │ │ ├── solo_claude/{same}
80
+ │ │ └── variant or l2_risk_probes/{same}
73
81
  │ ├── <fixture>/judge.json
74
82
  │ ├── report.md
75
83
  │ └── summary.json
@@ -91,7 +99,7 @@ Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
91
99
  | File | Purpose |
92
100
  |------|---------|
93
101
  | `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
94
- | `spec.md` | pipeline-arm input (auto-resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
102
+ | `spec.md` | pipeline-arm input (resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
95
103
  | `task.txt` | bare-arm input (same intent, natural-language framing) |
96
104
  | `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
97
105
  | `NOTES.md` | why this fixture exists, the specific failure mode it tests |
@@ -103,9 +111,13 @@ consistent.
103
111
 
104
112
  ---
105
113
 
106
- ## The 9 Fixtures
114
+ ## Core Fixtures And Extensions
107
115
 
108
- Category coverage matrix (rows = concerns, columns = fixtures):
116
+ The original v3.6 matrix covered F1-F9. Later fixtures extend the same schema
117
+ for headroom, full-pipeline pair, and frozen VERIFY evidence.
118
+
119
+ Category coverage matrix for the original core set (rows = concerns, columns =
120
+ fixtures):
109
121
 
110
122
  | Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
111
123
  |---------|---------|--------|-----------|--------|------|-----|
@@ -120,9 +132,9 @@ Category coverage matrix (rows = concerns, columns = fixtures):
120
132
  | F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
121
133
 
122
134
  **F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
123
- Input is a vague idea; pipeline arm runs ideate → auto-resolve on every
124
- generated spec → preflight; bare arm runs a direct prompt. Judge compares
125
- the final usable artifact set (code + docs + roadmap state).
135
+ Input is a vague idea; the pipeline path turns it into a spec with ideate and
136
+ then resolves that spec. Bare arm runs a direct prompt. Judge compares the final
137
+ usable artifact set.
126
138
 
127
139
  ---
128
140
 
@@ -132,7 +144,6 @@ the final usable artifact set (code + docs + roadmap state).
132
144
 
133
145
  ```bash
134
146
  npx devlyn-cli benchmark # n=1 smoke, all fixtures
135
- npx devlyn-cli benchmark --n 3 # higher confidence for ship decisions
136
147
  npx devlyn-cli benchmark F2 F5 # specific fixtures only
137
148
  npx devlyn-cli benchmark --judge-only --run-id <id> # re-judge without re-running
138
149
  ```
@@ -143,20 +154,21 @@ Output on completion:
143
154
  Benchmark Suite Run — 2026-04-23T12:00Z (v3.6)
144
155
  Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
145
156
 
146
- Fixture Variant Bare Margin Verdict
147
- F1-cli-trivial-flag 95 88 +7 PASS
148
- F2-cli-medium-subcommand 92 81 +11 PASS
149
- F3-backend-contract-risk 89 72 +17 PASS
150
- F4-web-browser-design 87 79 +8 PASS
151
- F5-fix-loop-red-green 91 65 +26 PASS
152
- F6-dep-audit-native-module 88 70 +18 PASS
153
- F7-out-of-scope-trap 94 73 +21 PASS
154
- F8-known-limit-ambiguous 78 79 -1 EXPECTED (known-limit)
155
- F9-e2e-ideate-to-resolve 90 68 +22 PASS
157
+ Fixture variant (L2) solo_claude (L1) bare (L0) variant-solo_claude Verdict
158
+ F1-cli-trivial-flag 95 92 88 +3 PASS
159
+ F2-cli-medium-subcommand 92 86 81 +6 PASS
160
+ F3-backend-contract-risk 89 80 72 +9 PASS
161
+ F4-web-browser-design 87 83 79 +4 PASS
162
+ F5-fix-loop-red-green 91 78 65 +13 PASS
163
+ F6-dep-audit-native-module 88 82 70 +6 PASS
164
+ F7-out-of-scope-trap 94 85 73 +9 PASS
165
+ F8-known-limit-ambiguous 78 79 79 -1 EXPECTED (known-limit)
166
+ F9-e2e-ideate-to-resolve 90 84 68 +6 PASS
156
167
  ---------------------------------------------------------
157
- Suite average variant score: 89.3
158
- Suite average bare score: 75.0
159
- Suite average margin: +14.3 (ship floor: +5)
168
+ Suite average variant (L2) score: 89.3
169
+ Suite average solo_claude (L1) score: 83.2
170
+ Suite average bare (L0) score: 75.0
171
+ Suite average variant-solo_claude margin: +6.1 (pair-evidence floor: +5 on eligible fixtures)
160
172
  Hard-floor violations: 0
161
173
  Regression vs shipped: n/a (first run of v3.6)
162
174
  SHIP-GATE VERDICT: ✅ PASS
@@ -167,7 +179,7 @@ SHIP-GATE VERDICT: ✅ PASS
167
179
  `run-suite.sh`:
168
180
 
169
181
  1. Generate run-id `<ISO>-<sha>-<branch>`
170
- 2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
182
+ 2. For each fixture × each arm (`variant`/L2, `solo_claude`/L1, `bare`/L0): parallelizable via `xargs -P`
171
183
  - `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
172
184
  3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
173
185
  4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
@@ -179,17 +191,17 @@ SHIP-GATE VERDICT: ✅ PASS
179
191
 
180
192
  - Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
181
193
  - Applies `setup.sh` if present
182
- - Copies `spec.md` (variant) or `task.txt` (bare) as the prompt
183
- - Invokes Claude/auto-resolve (variant) or bare Claude (bare) via isolated Agent
194
+ - Copies `spec.md` for `variant`/`solo_claude` or `task.txt` for `bare` as the prompt
195
+ - Invokes `/devlyn:resolve --spec` for `variant`, `/devlyn:resolve --spec --engine claude --no-pair --no-risk-probes` for `solo_claude`, or bare Claude for `bare` via isolated Agent
184
196
  - Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
185
197
  - Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
186
198
  - Writes `result.json` with aggregate: exit code, duration, files changed, verification score
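
The verification step of this contract can be sketched as follows. This is an assumption-laden illustration rather than the contents of `run-fixture.sh`: it assumes `verification_commands` is a list of shell command strings, and the names `run_verification`, `exit_code`, `pass`, and `score` are hypothetical.

```python
import subprocess


def run_verification(commands, cwd="."):
    """Run each command in a shell and record pass/fail per command.

    Returns a dict that could be serialized as a verify.json-style artifact
    (the shape here is an assumption, not the documented schema).
    """
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, shell=True, cwd=cwd, capture_output=True, text=True)
        results.append(
            {"cmd": cmd, "exit_code": proc.returncode, "pass": proc.returncode == 0}
        )
    passed = sum(1 for r in results if r["pass"])
    score = passed / len(results) if results else 0.0
    return {"commands": results, "score": score}


if __name__ == "__main__":
    # 'exit 0' and 'exit 1' stand in for real fixture verification commands.
    report = run_verification(["exit 0", "exit 1"])
    print(report["score"])
```

Recording the exit code alongside the boolean keeps the artifact useful for debugging a failed command without rerunning the arm.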
187
199
 
188
200
  ### `judge.sh` contract
189
201
 
190
- - Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
202
+ - Reads `results/<run-id>/<fixture>/{variant,solo_claude,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
191
203
  - Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
192
- - Invokes `codex exec` (current flagship — no model hardcode) with RUBRIC.md
204
+ - Invokes isolated Codex (current flagship — no model hardcode) with RUBRIC.md
193
205
  - Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
194
206
  - Idempotent: re-running overwrites the same `judge.json`
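
Blind labeling with a recorded seed can be sketched like this. It is a minimal illustration, not the logic in `judge.sh`: the function name `blind_labels` is hypothetical, and it generalizes the "A and B" wording to one letter per arm so it also covers three arms.

```python
import random


def blind_labels(arms, seed):
    """Assign neutral letters (A, B, C, ...) to arms in a seeded random order.

    Recording the seed lets a later run reproduce the same mapping when the
    labels in a judge result have to be translated back to arm names.
    """
    rng = random.Random(seed)  # private RNG so global random state is untouched
    shuffled = list(arms)
    rng.shuffle(shuffled)
    letters = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(letters, shuffled))


if __name__ == "__main__":
    mapping = blind_labels(["variant", "solo_claude", "bare"], seed=1234)
    print(mapping)
```

Because the mapping depends only on the arm list and the seed, re-judging with the same seed stays consistent with the idempotent `judge.json` behavior described above.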
195
207
 
@@ -199,23 +211,27 @@ SHIP-GATE VERDICT: ✅ PASS
199
211
 
200
212
  Three mechanisms:
201
213
 
202
- 1. **No hardcoded models.** Judge invocation is `codex exec` without `-m`; it
203
- inherits whichever flagship the CLI currently ships. Same for agents:
204
- they run against whatever Claude Code session-model the caller has.
205
- Model provenance is captured in `result.json` per run.
214
+ 1. **No hardcoded models.** Judge invocation omits `-m`, so it inherits
215
+ whichever flagship the CLI currently ships. The blind judge is isolated from
216
+ user config/rules/hooks so local agent instructions cannot contaminate the
217
+ judgment. Same for agents: they run against whatever Claude Code
218
+ session-model the caller has. Model provenance is captured in `result.json`
219
+ per run.
206
220
 
207
221
  2. **Margin as primary signal, absolute score as secondary.** When models
208
- improve, both arms get better. Margin (variant − bare) is model-invariant:
209
- it measures **what the harness adds beyond bare**. Ship gates are
222
+ improve, all arms tend to get better. Pairwise margins remain the stable
223
+ signal: `solo_claude`-`bare` (L1-L0) measures solo harness value,
224
+ pair-`solo_claude` (L2-L1) measures pair value on eligible fixtures, and
225
+ `variant`-`bare` (L2-L0) remains the legacy suite signal. Ship gates are
210
226
  defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
211
227
  score.
212
228
 
213
229
  3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
214
230
  100 quickly as models improve — that's fine, it still catches catastrophic
215
231
  regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
216
- model won't 100-zero bare. If any fixture saturates (both arms > 95 for
217
- two consecutive versions), we replace it with a harder one and document
218
- the swap in `history/runs/<ts>-fixture-rotation.json`.
232
+ model won't 100-zero bare. If any fixture saturates (all compared gated arms
233
+ > 95 for two consecutive versions), we replace it with a harder one and
234
+ document the swap in `history/runs/<ts>-fixture-rotation.json`.
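
The three pairwise margins named in mechanism 2 are plain differences between arm scores. A small sketch, assuming per-fixture scores arrive as a dict keyed by arm name (the function name `pairwise_margins` is hypothetical):

```python
def pairwise_margins(scores):
    """Compute the three margins from one fixture's arm scores."""
    return {
        "L1-L0": scores["solo_claude"] - scores["bare"],     # solo harness value
        "L2-L1": scores["variant"] - scores["solo_claude"],  # pair value
        "L2-L0": scores["variant"] - scores["bare"],         # legacy suite signal
    }


if __name__ == "__main__":
    # Scores taken from the F5-fix-loop-red-green row of the sample output.
    f5 = {"variant": 91, "solo_claude": 78, "bare": 65}
    print(pairwise_margins(f5))  # {'L1-L0': 13, 'L2-L1': 13, 'L2-L0': 26}
```

The `L2-L1` value of 13 matches the `variant-solo_claude` column shown for F5 in the sample report.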
219
235
 
220
236
  ---
221
237
 
@@ -225,14 +241,15 @@ Hard floors (any single failure blocks ship):
225
241
 
226
242
  - **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
227
243
  - **Variant may not lose any fixture by more than −5** versus previous shipped version (per-fixture regression floor).
228
- - **At least 7 of 9 fixtures** must have margin ≥ +5 (suite coverage).
244
+ - **At least 7 gated, headroom-available fixtures** must have margin ≥ +5
245
+ (suite coverage).
229
246
  - **F9 (E2E) must PASS** — novice-flow contract.
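
The hard floors can be read as a single pass over per-fixture results. The sketch below is one possible encoding under stated assumptions: each row is a dict with hypothetical keys `id`, `margin`, `delta_vs_shipped`, `disqualified`, and `verdict`, and the caller has already filtered rows down to gated, headroom-available fixtures. It is not the suite's actual gate code.

```python
def hard_floor_violations(rows, min_cleared=7, margin_floor=5, regression_floor=-5):
    """Return a list of hard-floor violations; an empty list means ship is not blocked."""
    violations = []
    for r in rows:
        if r.get("disqualified"):
            violations.append(f"{r['id']}: disqualifier flagged by judge")
        delta = r.get("delta_vs_shipped")  # None when there is no shipped baseline
        if delta is not None and delta < regression_floor:
            violations.append(f"{r['id']}: regression {delta} vs shipped")
    cleared = sum(1 for r in rows if r["margin"] >= margin_floor)
    if cleared < min_cleared:
        violations.append(f"only {cleared} fixtures cleared the +{margin_floor} margin")
    f9_rows = [r for r in rows if r["id"].startswith("F9-")]
    if not any(r.get("verdict") == "PASS" for r in f9_rows):
        violations.append("F9 (E2E) did not PASS")
    return violations


if __name__ == "__main__":
    rows = [
        {"id": f"F{i}-example", "margin": 6, "delta_vs_shipped": None,
         "disqualified": False, "verdict": "PASS"}
        for i in range(1, 10)
    ]
    print(hard_floor_violations(rows))
```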
230
247
 
231
248
  Soft gates (trigger rollback discussion):
232
249
 
233
250
  - Suite average margin drop > 3 vs last shipped.
234
251
  - Any fixture with margin ≤ 0 that previously had margin > +5.
235
- - Critical-finding catch-rate decrease vs last shipped variant (not vs bare — bare is the opponent, not the regression baseline).
252
+ - Critical-finding catch-rate decrease vs the last shipped comparable arm.
236
253
 
237
254
  Known-limit exception:
238
255
 
@@ -264,7 +281,7 @@ adding anything.
264
281
  standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
265
282
  run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
266
283
  entry, which shells out to the script.
267
- 2. Parallel run safety — can we run 9 fixtures × 2 arms concurrently without
284
+ 2. Parallel run safety — can we run the selected fixture set × 3 arms concurrently without
268
285
  rate-limit / lockfile conflicts? **Proposal**: default sequential with
269
286
  `--parallel N` flag. Default `N=1` for safety; the user can opt in.
270
287
  3. Token accounting — Claude Code doesn't expose subagent totals reliably.