devlyn-cli 2.3.0 → 2.3.2

Files changed (219)
  1. package/AGENTS.md +1 -1
  2. package/CLAUDE.md +2 -2
  3. package/README.md +82 -29
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +211 -18
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +3 -2
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +1 -1
  201. package/config/skills/_shared/runtime-principles.md +5 -8
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  205. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  206. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  207. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  208. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  209. package/config/skills/devlyn:resolve/SKILL.md +49 -18
  210. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  211. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  212. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  213. package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
  214. package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
  215. package/package.json +47 -2
  216. package/scripts/lint-fixtures.sh +349 -0
  217. package/scripts/lint-shadow-fixtures.sh +58 -0
  218. package/scripts/lint-skills.sh +3642 -92
  219. /package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
package/AGENTS.md CHANGED
@@ -28,7 +28,7 @@ ideate (optional) -> resolve -> ship
28
28
 
29
29
  - `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project` (multi-feature).
30
30
  - `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <ref> --spec <path>`. Phases run inline: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh-subagent, findings-only).
31
- - Four creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:design-ui`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
31
+ - `/devlyn:design-ui` — required creative UI exploration surface. Optional companion skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
32
32
 
33
33
  Each skill's `SKILL.md` is the source of truth for flags and workflow. Do not duplicate.
34
34
 
package/CLAUDE.md CHANGED
@@ -24,7 +24,7 @@ The runtime sub-agent contract below (Subtractive-first / Goal-locked / No-worka
24
24
 
25
25
  ## Quick Start
26
26
 
27
- Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has conditional-default pair-JUDGE when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path. If a selected or conditionally required engine is unavailable, the run stops with `BLOCKED:<engine>-unavailable` and setup guidance.
27
+ Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED; `/devlyn:design-ui` is also REQUIRED as the creative UI exploration surface. **Both pipeline skills default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has conditional-default pair-JUDGE when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path. If a selected or conditionally required engine is unavailable, the run stops with `BLOCKED:<engine>-unavailable` and setup guidance.
28
28
 
29
29
  1. `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project`.
30
30
  2. `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <diff> --spec <path>`. Phases: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only).
@@ -152,7 +152,7 @@ When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine cod
152
152
 
153
153
  ## Skill Boundary Policy
154
154
 
155
- Post iter-0034 Phase 4 cutover (2026-05-04) the runtime product surface is two skills — `/devlyn:resolve` and `/devlyn:ideate`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Four creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:design-ui`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
155
+ Post iter-0034 Phase 4 cutover (2026-05-04) the runtime pipeline surface is two skills — `/devlyn:resolve` and `/devlyn:ideate` — plus the required creative UI exploration surface `/devlyn:design-ui`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Optional creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
156
156
 
157
157
  Browser validation routes through `_shared/browser-runner.sh` (Chrome MCP → Playwright → curl tier) directly from BUILD_GATE — there is no separate `/devlyn:browser-validate` skill at HEAD.
158
158
 
package/README.md CHANGED
@@ -27,18 +27,20 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
27
27
  npx devlyn-cli
28
28
  ```
29
29
 
30
- That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
30
+ That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` and the `devlyn:resolve`, `devlyn:ideate`, and `devlyn:design-ui` skills into `~/.codex/skills/`. In Codex, invoke them as skills with `$devlyn:resolve`, `$devlyn:ideate`, or `$devlyn:design-ui` rather than Claude Code slash commands. Run it again anytime to update.
31
31
 
32
32
  ---
33
33
 
34
34
  ## How It Works — Two Skills, Full Cycle
35
35
 
36
- devlyn-cli turns Claude Code into a hands-free development pipeline. The product surface is two skills:
36
+ devlyn-cli turns Claude Code into a hands-free development pipeline. The pipeline surface is two skills, with `/devlyn:design-ui` installed as the required creative UI surface:
37
37
 
38
38
  ```
39
39
  ideate (optional) → resolve → ship
40
40
  ```
41
41
 
42
+ Codex note: when the optional Codex install is selected, these workflows are installed as Codex skills. Use `$devlyn:ideate`, `$devlyn:resolve`, or `$devlyn:design-ui` in Codex; the `/devlyn:*` slash-command form is for Claude Code.
43
+
42
44
  ### Step 1 (optional) — Plan with `/devlyn:ideate`
43
45
 
44
46
  Turn a raw idea into a verifiable spec — single-feature, multi-feature, or "normalize this external doc".
@@ -80,6 +82,20 @@ PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent
80
82
  - Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
81
83
 
82
84
  Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--no-pair` (intentional solo VERIFY), `--risk-probes` / `--no-risk-probes`, `--perf` (per-phase timing).
85
+ `--pair-verify` and `--no-pair` are mutually exclusive; using both stops with `BLOCKED:invalid-flags`.
86
+
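The mutual-exclusion rule for `--pair-verify` and `--no-pair` can be sketched as a small argument check. This is an illustrative Python sketch, not the code shipped in `bin/devlyn.js` or the skill itself. The function name `check_verify_flags` is hypothetical; only the two flag names and the `BLOCKED:invalid-flags` verdict come from the README text.

```python
from typing import Optional, Sequence


def check_verify_flags(argv: Sequence[str]) -> Optional[str]:
    """Return a BLOCKED verdict if the VERIFY flags conflict, else None.

    Hypothetical helper: only the flag names and the verdict string are
    taken from the README; the real harness may detect this differently.
    """
    flags = set(argv)
    # --pair-verify forces pair-mode JUDGE, --no-pair forces solo VERIFY,
    # so asking for both at once is contradictory.
    if "--pair-verify" in flags and "--no-pair" in flags:
        return "BLOCKED:invalid-flags"
    return None
```

Under these assumptions, a caller would stop the run as soon as the function returns a non-`None` verdict, before any phase starts.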
87
+ Free-form goals that ask for benchmark evidence, pair-evidence, risk-probe
88
+ measurement, `solo<pair` proof, or solo-headroom work must include an
89
+ actionable `solo-headroom hypothesis` naming the visible behavior `solo_claude`
90
+ is expected to miss plus a backticked observable command; the backticked line
91
+ itself must contain `miss` and be framed as the command/observable that exposes it. Without that,
92
+ `/devlyn:resolve` stops with `BLOCKED:solo-headroom-hypothesis-required` and
93
+ points you to `/devlyn:ideate` instead of inventing a weak hypothesis.
94
+ Free-form goals that add or run a new unmeasured benchmark, shadow fixture,
95
+ golden fixture, risk-probe, or pair-evidence candidate must also include
96
+ `solo ceiling avoidance`, mention `solo_claude`, and name the concrete
97
+ difference from rejected or solo-saturated controls such as `S2`-`S6`; without
98
+ that, `/devlyn:resolve` stops with `BLOCKED:solo-ceiling-avoidance-required`.
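The two free-form-goal rules above can be approximated by a text pre-check. The Python sketch below is an assumption-laden illustration, not the logic in `/devlyn:resolve`: the function names, the `wants_evidence` and `adds_candidate` parameters, and the exact matching rules are hypothetical. It covers only the mechanical parts of the rules (required phrases, a backticked line containing `miss`, a mention of `solo_claude`). It does not judge whether the hypothesis is actionable or whether the stated difference from controls such as `S2`-`S6` is concrete.

```python
import re
from typing import Optional

# Matches a non-empty inline code span such as `node cli.js apply`.
BACKTICKED = re.compile(r"`[^`]+`")


def has_headroom_hypothesis(goal: str) -> bool:
    """Check for the phrase plus a backticked line that contains 'miss'."""
    if "solo-headroom hypothesis" not in goal.lower():
        return False
    return any(
        BACKTICKED.search(line) is not None and "miss" in line.lower()
        for line in goal.splitlines()
    )


def has_ceiling_avoidance(goal: str) -> bool:
    """Check for the 'solo ceiling avoidance' phrase and a solo_claude mention."""
    lowered = goal.lower()
    return "solo ceiling avoidance" in lowered and "solo_claude" in lowered


def precheck(goal: str, *, wants_evidence: bool, adds_candidate: bool) -> Optional[str]:
    """Return the BLOCKED verdict named in the README, or None to proceed.

    The two booleans are assumed to be decided by the caller; how the real
    skill classifies a goal is not described in this diff.
    """
    if wants_evidence and not has_headroom_hypothesis(goal):
        return "BLOCKED:solo-headroom-hypothesis-required"
    if adds_candidate and not has_ceiling_avoidance(goal):
        return "BLOCKED:solo-ceiling-avoidance-required"
    return None
```

A goal that passes both checks in this sketch would carry the hypothesis phrase, one line pairing a backticked command with the word `miss`, and a `solo ceiling avoidance` sentence that mentions `solo_claude`.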
83
99
 
84
100
  ### Engine selection — Claude implementation, conditional pair VERIFY
85
101
 
@@ -93,47 +109,84 @@ Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-g
93
109
 
94
110
  If Codex or Claude is absent when explicitly selected or conditionally required, the harness stops with `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` and prints setup guidance. Use `--no-pair` only when intentionally accepting solo VERIFY; use `--no-risk-probes` only when intentionally disabling automatic high-risk probes.
95
111
 
96
- <details>
97
- <summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
98
-
99
- `/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
100
-
101
- - **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
102
- - **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
103
- - **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
104
- - **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
105
- - **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
106
- - **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
107
- - **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
108
- - **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
109
- - **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
112
+ ### Benchmark score runs
110
113
 
111
- </details>
112
-
113
- <details>
114
- <summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
114
+ Use the benchmark CLI when a change claims `solo_claude < pair`. The score-focused runners print the run id, startup gate lines, blind-judge score tables, fixture pair margins, average pair margin, wall-time ratio, and failure reasons:
115
115
 
116
- Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
116
+ ```bash
117
+ npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
118
+ npx devlyn-cli benchmark recent
119
+ npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md
120
+ npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md
121
+ npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit
122
+ npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json
123
+ npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules
124
+ ```
117
125
 
118
- - **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
119
- - **`--with-codex` consolidated into `--engine auto`** — auto covers BUILD/FIX + team roles + ideate CHALLENGE critic. Legacy flag still accepted with a graceful handoff. *(Note: post iter-0020 close-out, `--engine auto` is experimental research-only; default is `--engine claude`.)*
120
- - **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
121
- - **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
126
+ `benchmark recent` prints a compact, wrap-safe snapshot of the current local
127
+ pair evidence: status counts, pair-lift aggregates, and one card per passing
128
+ pair-evidence fixture. It intentionally avoids wide Markdown tables, so the
129
+ same output stays readable in narrow terminals, PR comments, and release notes.
130
+ `benchmark frontier` also prints a stdout score summary for existing complete pair
131
+ evidence rows, including pair arm, trigger reasons, average/minimum pair margin,
132
+ and wall ratio, plus row-level verdicts even when `--out-json` or `--out-md`
133
+ writes an artifact. Markdown frontier artifacts include a `Triggers` column.
134
+ Full-pipeline pair gate artifacts record `require_hypothesis_trigger` in JSON
135
+ and include a Markdown `Hypothesis trigger` column, so strict regenerated
136
+ evidence shows whether each row carried `spec.solo_headroom_hypothesis`.
137
+ `benchmark audit` is the provider-free release/handoff guard: it writes
138
+ `audit.json` with the frontier summary, artifact map, and compact trigger-backed verdict-bearing `pair_evidence_rows`
139
+ (each row carries `pair_trigger_eligible: true`, non-empty `pair_trigger_reasons`, `pair_trigger_has_canonical_reason: true`, and `pair_trigger_has_hypothesis_reason`; the audit fails rows missing trigger reasons or missing actionable solo-headroom hypotheses in fixture `spec.md` whose observable command matches `expected.json`), runs the frontier with
140
+ `--fail-on-unmeasured`, requires at least four fixtures with passing pair evidence,
141
+ revalidates frontier `verdict: PASS`, zero unmeasured candidates, and revalidates `pair_mode: true`,
142
+ the default 5-point pair margin, and 3x pair/solo wall ratio, then
143
+ audits failed headroom results. The audit stdout also prints
144
+ `headroom_rejections=...`, `pair_evidence_quality=...`,
145
+ `pair_trigger_reasons=...`, `pair_evidence_hypotheses=...`, and
146
+ `pair_evidence_hypothesis_triggers=...` handoff rows, plus
147
+ `pair_trigger_historical_aliases=...` when archived evidence includes legacy
148
+ trigger aliases and `pair_evidence_hypothesis_trigger_gaps=...` when documented
149
+ hypotheses have not yet propagated into trigger reasons, with the rejected-fixture
150
+ coverage counts plus actual minimum pair margin, maximum pair/solo wall ratio,
151
+ and canonical trigger reason coverage plus row-match status.
152
+ The compact evidence row count must match the frontier evidence count,
153
+ `checks.frontier_stdout` records summary, aggregate, final-verdict, expected, printed score-row, trigger-visible row, and hypothesis-trigger-visible row counts,
154
+ `checks.headroom_rejections` records child verdict plus unrecorded/unsupported counts,
155
+ `checks.pair_evidence_quality` records the same quality thresholds from the compact rows,
156
+ `checks.pair_trigger_reasons` records canonical/historical-alias/exposed/total trigger-reason row counts, fixture-level historical alias details, summary count, and row-match status,
157
+ `checks.pair_evidence_hypotheses` records documented/total pair-evidence hypothesis row counts,
158
+ and `checks.pair_evidence_hypothesis_triggers` records whether documented hypotheses also appear as `spec.solo_headroom_hypothesis` trigger reasons plus fixture-level gap details
159
+ so incomplete or low-quality local score artifacts cannot inflate the claim.
160
+ Add `--require-hypothesis-trigger` to turn those hypothesis-trigger gaps from
161
+ archived-evidence WARN rows into release-blocking FAIL rows for newly
162
+ regenerated pair evidence.
122
163
 
123
- </details>
164
+ ```bash
165
+ npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict
166
+ ```
124
167
 
125
- ---
168
+ Historical trigger aliases are only reported for archived artifact review; new
169
+ current pair-evidence gates fail historical-only or unknown trigger reasons and
170
+ require at least one canonical `pair_trigger.reasons` entry.
171
+ `benchmark audit-headroom` fails if an active failed headroom fixture is missing
172
+ from both rejected registry and passing pair evidence.
173
+ Headroom runs use the current claim gate: `bare <= 60`, `solo_claude <= 80`,
174
+ and the default 5-point `bare`/`solo_claude` headroom margins before spending a pair arm.
175
+ Add `--dry-run` to either score runner to validate args, fixture ids, minimum
176
+ fixture count, and the replay command without running arms or judges. Dry-runs
177
+ and lint prove wiring only; real score claims must cite the run id and fixture
178
+ ids.
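A minimal sketch of the gate arithmetic described in the added README lines above, assuming the comparisons are inclusive. It models only the quoted ceilings (`bare <= 60`, `solo_claude <= 80`), the 5-point pair margin, and the 3x pair/solo wall ratio. The `bare`/`solo_claude` headroom margins are left out because the text does not define them precisely. All names (`FixtureScores`, `has_headroom`, `pair_evidence_passes`) are hypothetical and are not taken from the devlyn-cli source.

```python
from dataclasses import dataclass

# Thresholds quoted in the README text above; the constant names are hypothetical.
BARE_MAX = 60            # headroom gate: bare <= 60
SOLO_MAX = 80            # headroom gate: solo_claude <= 80
PAIR_MARGIN_MIN = 5      # default pair margin over solo_claude
WALL_RATIO_MAX = 3.0     # mirrors --max-pair-solo-wall-ratio 3


@dataclass
class FixtureScores:
    """Judge scores and wall times for one fixture (hypothetical shape)."""
    bare: int
    solo_claude: int
    pair: int
    solo_wall_s: float
    pair_wall_s: float


def has_headroom(s: FixtureScores) -> bool:
    """True when both cheaper arms score low enough to leave room above solo."""
    return s.bare <= BARE_MAX and s.solo_claude <= SOLO_MAX


def pair_evidence_passes(s: FixtureScores) -> bool:
    """Check headroom first, then pair margin and wall ratio (assumed inclusive)."""
    if not has_headroom(s):
        return False
    margin = s.pair - s.solo_claude
    wall_ratio = s.pair_wall_s / s.solo_wall_s
    return margin >= PAIR_MARGIN_MIN and wall_ratio <= WALL_RATIO_MAX
```

For example, `pair_evidence_passes(FixtureScores(55, 75, 82, 100.0, 250.0))` passes this sketch with a margin of 7 and a wall ratio of 2.5, while the same fixture with `bare=70` is rejected before the pair margin is even considered.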
126
179
 
127
180
  ## Optional Power-User Skills
128
181
 
129
- Two creative skills have moved to `optional-skills/` — install them via the interactive installer when you need them.
182
+ Two creative companion skills live in `optional-skills/` — install them via the interactive installer when you need them.
130
183
 
131
184
  | Command | Use When |
132
185
  |---|---|
133
186
  | `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
134
187
  | `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
135
188
 
136
- > Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). These were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
189
+ > Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). Most were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:design-ui` is now installed as a required creative UI surface. Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
137
190
 
138
191
  ---
139
192
 
@@ -2,12 +2,18 @@
2
2
 
3
3
  **Outer goal**: see [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md) — the harness composes frontier LLMs into a hands-free pipeline that delivers engineer-quality software for users who do not know context engineering, with each composition layer (L0 bare → L1 solo harness → L2 pair harness) justifying its added cost on quality AND wall-time efficiency. This benchmark is the measurement instrument for that contract.
4
4
 
5
- **Purpose.** Replace ad-hoc A/B benchmarking with a permanent, comprehensive,
5
+ **Purpose.** Replace ad-hoc harness benchmarking with a permanent, comprehensive,
6
6
  one-command suite that gates every future harness change with a ship/rollback
7
7
  decision. Any prompt edit, phase reorder, new native skill, or model upgrade
8
8
  can be validated by running the suite and reading the numbers.
9
9
 
10
- **Arm structure (current vs planned).** Today the suite runs `variant` (L2: Claude + Codex pair) vs `bare` (L0). The L1 (solo harness on a single LLM) arm is queued for iter-0020 — until then the benchmark cannot directly verify the L1 contract, only the L0 ↔ L2 delta. Single-LLM users (Opus alone, GPT-5.5 alone) are first-class per the North Star, so this gap is a release-blocker for them, not a future enhancement.
10
+ **Arm structure.** Current full-pipeline evidence uses three arms: `bare` (L0),
11
+ `solo_claude` (L1 solo harness), and an L2 pair arm (`variant` in the smoke
12
+ suite, or a focused pair arm such as `l2_risk_probes` in pair-candidate runs).
13
+ Pair claims are headroom-gated: counted fixtures must leave room above solo
14
+ (`bare <= 60`, `solo_claude <= 80`, default 5-point `bare`/`solo_claude` headroom margins),
15
+ the pair arm must actually run, and blind judging must show pair above solo by
16
+ the configured margin.
11
17
 
12
18
  **Non-goals.** Publishable-research statistical rigor. Not a regression test
13
19
  library for the product code — those live elsewhere. Not a substitute for
@@ -20,7 +26,7 @@ production telemetry — just enough signal for ship decisions.
20
26
  1. **One command.** `npx devlyn-cli benchmark` runs everything and prints a
21
27
  verdict. No manual fixture setup.
22
28
  2. **Novice-proof.** The suite exercises the same paths a first-time user
23
- hits — including an end-to-end `ideate → auto-resolve → preflight` fixture.
29
+ hits — including an end-to-end `ideate → resolve` fixture.
24
30
  3. **LLM-upgrade friendly.** Rubric, fixture semantics, and thresholds stay
25
31
  stable; scores and margins float up as models improve. Nothing is
26
32
  hardcoded to a specific model version.
@@ -56,10 +62,11 @@ benchmark/auto-resolve/
56
62
  │ ├── F6-dep-audit-native-module/
57
63
  │ ├── F7-out-of-scope-trap/
58
64
  │ ├── F8-known-limit-ambiguous/
59
- └── F9-e2e-ideate-to-resolve/
65
+ ├── F9-e2e-ideate-to-resolve/
66
+ │ └── F10+ extensions for headroom, full-pipeline pair, and frozen VERIFY
60
67
 
61
68
  ├── scripts/
62
- │ ├── run-suite.sh # single entry — runs all fixtures × 2 arms + judge + report
69
+ │ ├── run-suite.sh # smoke entry — runs fixture arms + judge + report
63
70
  │ ├── run-fixture.sh # one fixture, one arm
64
71
  │ ├── judge.sh # Codex blind judge (model-agnostic)
65
72
  │ ├── compile-report.py # aggregate into report.md + summary.json
@@ -68,8 +75,9 @@ benchmark/auto-resolve/
68
75
  ├── results/ # per-run artifacts (overwritten)
69
76
  │ └── <run-id>/
70
77
  │ ├── <fixture>/
71
- │ │ ├── variant/{input.md, transcript.txt, diff.patch, verify.json, timing.json}
72
- │ │ └── bare/{same}
78
+ │ │ ├── bare/{input.md, transcript.txt, diff.patch, result.json}
79
+ │ │ ├── solo_claude/{same}
80
+ │ │ └── variant or l2_risk_probes/{same}
73
81
  │ ├── <fixture>/judge.json
74
82
  │ ├── report.md
75
83
  │ └── summary.json
@@ -91,7 +99,7 @@ Every fixture is a directory with these files (see `fixtures/SCHEMA.md`):
91
99
  | File | Purpose |
92
100
  |------|---------|
93
101
  | `metadata.json` | id, category, difficulty, timeout, required tools, intent block |
94
- | `spec.md` | pipeline-arm input (auto-resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
102
+ | `spec.md` | pipeline-arm input (resolve-ready spec with Requirements/Constraints/Out-of-Scope/Verification) |
95
103
  | `task.txt` | bare-arm input (same intent, natural-language framing) |
96
104
  | `expected.json` | machine-readable acceptance criteria + forbidden patterns + verification commands |
97
105
  | `NOTES.md` | why this fixture exists, the specific failure mode it tests |
@@ -103,9 +111,13 @@ consistent.
103
111
 
104
112
  ---
105
113
 
106
- ## The 9 Fixtures
114
+ ## Core Fixtures And Extensions
107
115
 
108
- Category coverage matrix (rows = concerns, columns = fixtures):
116
+ The original v3.6 matrix covered F1-F9. Later fixtures extend the same schema
117
+ for headroom, full-pipeline pair, and frozen VERIFY evidence.
118
+
119
+ Category coverage matrix for the original core set (rows = concerns, columns =
120
+ fixtures):
109
121
 
110
122
  | Fixture | Trivial | Medium | High-risk | Stress | Edge | E2E |
111
123
  |---------|---------|--------|-----------|--------|------|-----|
@@ -120,9 +132,9 @@ Category coverage matrix (rows = concerns, columns = fixtures):
120
132
  | F9-e2e-ideate-to-resolve | | | | | | ✓ (novice full-flow) |
121
133
 
122
134
  **F9 is load-bearing** for the "novice user types `/devlyn:ideate`" promise.
123
- Input is a vague idea; pipeline arm runs ideate → auto-resolve on every
124
- generated spec → preflight; bare arm runs a direct prompt. Judge compares
125
- the final usable artifact set (code + docs + roadmap state).
135
+ Input is a vague idea; the pipeline path turns it into a spec with ideate and
136
+ then resolves that spec. Bare arm runs a direct prompt. Judge compares the final
137
+ usable artifact set.
126
138
 
127
139
  ---
128
140
 
@@ -132,7 +144,6 @@ the final usable artifact set (code + docs + roadmap state).
132
144
 
133
145
  ```bash
134
146
  npx devlyn-cli benchmark # n=1 smoke, all fixtures
135
- npx devlyn-cli benchmark --n 3 # higher confidence for ship decisions
136
147
  npx devlyn-cli benchmark F2 F5 # specific fixtures only
137
148
  npx devlyn-cli benchmark --judge-only --run-id <id> # re-judge without re-running
138
149
  ```
@@ -143,20 +154,21 @@ Output on completion:
143
154
  Benchmark Suite Run — 2026-04-23T12:00Z (v3.6)
144
155
  Judge: codex CLI flagship, xhigh, blind (model recorded in run history)
145
156
 
146
- Fixture Variant Bare Margin Verdict
147
- F1-cli-trivial-flag 95 88 +7 PASS
148
- F2-cli-medium-subcommand 92 81 +11 PASS
149
- F3-backend-contract-risk 89 72 +17 PASS
150
- F4-web-browser-design 87 79 +8 PASS
151
- F5-fix-loop-red-green 91 65 +26 PASS
152
- F6-dep-audit-native-module 88 70 +18 PASS
153
- F7-out-of-scope-trap 94 73 +21 PASS
154
- F8-known-limit-ambiguous 78 79 -1 EXPECTED (known-limit)
155
- F9-e2e-ideate-to-resolve 90 68 +22 PASS
157
+ Fixture variant (L2) solo_claude (L1) bare (L0) variant-solo_claude Verdict
158
+ F1-cli-trivial-flag 95 92 88 +3 PASS
159
+ F2-cli-medium-subcommand 92 86 81 +6 PASS
160
+ F3-backend-contract-risk 89 80 72 +9 PASS
161
+ F4-web-browser-design 87 83 79 +4 PASS
162
+ F5-fix-loop-red-green 91 78 65 +13 PASS
163
+ F6-dep-audit-native-module 88 82 70 +6 PASS
164
+ F7-out-of-scope-trap 94 85 73 +9 PASS
165
+ F8-known-limit-ambiguous 78 79 79 -1 EXPECTED (known-limit)
166
+ F9-e2e-ideate-to-resolve 90 84 68 +6 PASS
156
167
  ---------------------------------------------------------
157
- Suite average variant score: 89.3
158
- Suite average bare score: 75.0
159
- Suite average margin: +14.3 (ship floor: +5)
168
+ Suite average variant (L2) score: 89.3
169
+ Suite average solo_claude (L1) score: 83.2
170
+ Suite average bare (L0) score: 75.0
171
+ Suite average variant-solo_claude margin: +6.1 (pair-evidence floor: +5 on eligible fixtures)
160
172
  Hard-floor violations: 0
161
173
  Regression vs shipped: n/a (first run of v3.6)
162
174
  SHIP-GATE VERDICT: ✅ PASS
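The suite averages in the sample output above can be recomputed from the per-fixture rows. The sketch below copies the three score columns from the added lines and derives the averages and the `variant-solo_claude` margin. The variable names are illustrative only, and this is not how `compile-report.py` is known to work.

```python
# (variant L2, solo_claude L1, bare L0) scores copied from the sample output above.
SAMPLE_ROWS = {
    "F1-cli-trivial-flag": (95, 92, 88),
    "F2-cli-medium-subcommand": (92, 86, 81),
    "F3-backend-contract-risk": (89, 80, 72),
    "F4-web-browser-design": (87, 83, 79),
    "F5-fix-loop-red-green": (91, 78, 65),
    "F6-dep-audit-native-module": (88, 82, 70),
    "F7-out-of-scope-trap": (94, 85, 73),
    "F8-known-limit-ambiguous": (78, 79, 79),
    "F9-e2e-ideate-to-resolve": (90, 84, 68),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


variant_avg = mean(v for v, _, _ in SAMPLE_ROWS.values())
solo_avg = mean(s for _, s, _ in SAMPLE_ROWS.values())
bare_avg = mean(b for _, _, b in SAMPLE_ROWS.values())

# Per-fixture pair margin: variant (L2) minus solo_claude (L1).
pair_margins = {name: v - s for name, (v, s, _) in SAMPLE_ROWS.items()}
margin_avg = mean(pair_margins.values())

print(f"variant {variant_avg:.1f}, solo_claude {solo_avg:.1f}, bare {bare_avg:.1f}")
print(f"variant-solo_claude margin {margin_avg:+.1f}")
```

This prints `variant 89.3, solo_claude 83.2, bare 75.0` and `variant-solo_claude margin +6.1`, matching the summary lines in the sample.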
@@ -167,7 +179,7 @@ SHIP-GATE VERDICT: ✅ PASS
167
179
  `run-suite.sh`:
168
180
 
169
181
  1. Generate run-id `<ISO>-<sha>-<branch>`
170
- 2. For each fixture × each arm (variant, bare): parallelizable via `xargs -P`
182
+ 2. For each fixture × each arm (`variant`/L2, `solo_claude`/L1, `bare`/L0): parallelizable via `xargs -P`
171
183
  - `run-fixture.sh --fixture FX --arm variant` → writes `results/<run-id>/FX/variant/*`
172
184
  3. For each fixture: `judge.sh FX <run-id>` → writes `results/<run-id>/FX/judge.json`
173
185
  4. `compile-report.py <run-id>` → writes `report.md` + `summary.json`
@@ -179,17 +191,17 @@ SHIP-GATE VERDICT: ✅ PASS
179
191
 
180
192
  - Creates fresh temp copy of `test-repo/` at `/tmp/bench-<run-id>-<fixture>-<arm>/`
181
193
  - Applies `setup.sh` if present
182
- - Copies `spec.md` (variant) or `task.txt` (bare) as the prompt
183
- - Invokes Claude/auto-resolve (variant) or bare Claude (bare) via isolated Agent
194
+ - Copies `spec.md` for `variant`/`solo_claude` or `task.txt` for `bare` as the prompt
195
+ - Invokes `/devlyn:resolve --spec` for `variant`, `/devlyn:resolve --spec --engine claude --no-pair --no-risk-probes` for `solo_claude`, or bare Claude for `bare` via isolated Agent
184
196
  - Captures: `diff.patch`, `changed-files.txt`, `transcript.txt`, `timing.json`
185
197
  - Runs `expected.json::verification_commands`, writes pass/fail per command to `verify.json`
186
198
  - Writes `result.json` with aggregate: exit code, duration, files changed, verification score
187
199
 
188
200
  ### `judge.sh` contract
189
201
 
190
- - Reads `results/<run-id>/<fixture>/{variant,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
202
+ - Reads `results/<run-id>/<fixture>/{variant,solo_claude,bare}/{diff.patch,verify.json}` + fixture's `spec.md` + `expected.json`
191
203
  - Builds a blind prompt: labels arms A and B randomly per fixture (seed recorded)
192
- - Invokes `codex exec` (current flagship — no model hardcode) with RUBRIC.md
204
+ - Invokes isolated Codex (current flagship — no model hardcode) with RUBRIC.md
193
205
  - Writes `judge.json`: per-axis scores, winner, margin, critical findings, disqualifiers
194
206
  - Idempotent: re-running overwrites the same `judge.json`
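The blind-labeling step in the `judge.sh` contract above (random arm labels per fixture, with the seed recorded) could look roughly like the following. This is a sketch under assumptions: the function name `blind_labels` and the letter-per-arm scheme are hypothetical, and the real script may label arms differently.

```python
import random
import string


def blind_labels(arms, seed):
    """Map anonymous labels (A, B, ...) to arm names using a recorded seed.

    The same seed always reproduces the same mapping, so a judge verdict
    can be un-blinded later from the stored seed alone.
    """
    rng = random.Random(seed)      # private RNG; global random state is untouched
    shuffled = list(arms)
    rng.shuffle(shuffled)
    return {string.ascii_uppercase[i]: arm for i, arm in enumerate(shuffled)}


arms = ["variant", "solo_claude", "bare"]
seed = 20260423                    # hypothetical value; stored next to the judge output
mapping = blind_labels(arms, seed)
print(mapping)
```

Recording the seed rather than the mapping keeps the judge prompt free of arm names while leaving the labeling reproducible.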
195
207
 
@@ -199,23 +211,27 @@ SHIP-GATE VERDICT: ✅ PASS
199
211
 
200
212
  Three mechanisms:
201
213
 
202
- 1. **No hardcoded models.** Judge invocation is `codex exec` without `-m`; it
203
- inherits whichever flagship the CLI currently ships. Same for agents:
204
- they run against whatever Claude Code session-model the caller has.
205
- Model provenance is captured in `result.json` per run.
214
+ 1. **No hardcoded models.** Judge invocation omits `-m`, so it inherits
215
+ whichever flagship the CLI currently ships. The blind judge is isolated from
216
+ user config/rules/hooks so local agent instructions cannot contaminate the
217
+ judgment. Same for agents: they run against whatever Claude Code
218
+ session-model the caller has. Model provenance is captured in `result.json`
219
+ per run.
206
220
 
207
221
  2. **Margin as primary signal, absolute score as secondary.** When models
208
- improve, both arms get better. Margin (variant - bare) is model-invariant:
209
- it measures **what the harness adds beyond bare**. Ship gates are
222
+ improve, all arms tend to get better. Pairwise margins remain the stable
223
+ signal: `solo_claude`-`bare` (L1-L0) measures solo harness value,
224
+ pair-`solo_claude` (L2-L1) measures pair value on eligible fixtures, and
225
+ `variant`-`bare` (L2-L0) remains the legacy suite signal. Ship gates are
210
226
  defined on margin (`>= +5`) and regression (`-3 or worse`), not absolute
211
227
  score.
212
228
 
213
229
  3. **Fixture difficulty gradient.** F1 (trivial) is expected to saturate near
214
230
  100 quickly as models improve — that's fine, it still catches catastrophic
215
231
  regressions. F5/F9 (stress/E2E) have enough depth that even a near-perfect
216
- model won't 100-zero bare. If any fixture saturates (both arms > 95 for
217
- two consecutive versions), we replace it with a harder one and document
218
- the swap in `history/runs/<ts>-fixture-rotation.json`.
232
+ model won't 100-zero bare. If any fixture saturates (all compared gated arms
233
+ > 95 for two consecutive versions), we replace it with a harder one and
234
+ document the swap in `history/runs/<ts>-fixture-rotation.json`.
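The saturation rule in item 3 above (all compared gated arms above 95 for two consecutive versions) can be expressed as a small check. This is an illustrative sketch: the history shape (a list of per-version `arm -> score` dicts, oldest first) and the name `is_saturated` are assumptions, not part of the documented scripts.

```python
SATURATION_SCORE = 95        # "> 95" in the text above
CONSECUTIVE_VERSIONS = 2     # "two consecutive versions"


def is_saturated(history):
    """Return True when every arm scored above 95 in each of the last two versions.

    history: list of dicts mapping arm name to score, oldest version first.
    """
    recent = history[-CONSECUTIVE_VERSIONS:]
    if len(recent) < CONSECUTIVE_VERSIONS:
        return False         # not enough versions to decide
    return all(
        version and all(score > SATURATION_SCORE for score in version.values())
        for version in recent
    )


history = [
    {"variant": 94, "solo_claude": 93, "bare": 90},
    {"variant": 98, "solo_claude": 97, "bare": 96},
    {"variant": 99, "solo_claude": 97, "bare": 96},
]
print(is_saturated(history))
```

Here the last two versions have every arm above 95, so the fixture would be a candidate for rotation under this reading of the rule.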
219
235
 
220
236
  ---
221
237
 
@@ -225,14 +241,15 @@ Hard floors (any single failure blocks ship):
225
241
 
226
242
  - **No silent-catch / fabricated verification / skipped required test in variant.** Judge flags this as disqualifier.
227
243
  - **Variant may not lose any fixture by more than −5** versus previous shipped version (per-fixture regression floor).
228
- - **At least 7 of 9 fixtures** must have margin ≥ +5 (suite coverage).
244
+ - **At least 7 gated, headroom-available fixtures** must have margin ≥ +5
245
+ (suite coverage).
229
246
  - **F9 (E2E) must PASS** — novice-flow contract.
230
247
 
231
248
  Soft gates (trigger rollback discussion):
232
249
 
233
250
  - Suite average margin drop > 3 vs last shipped.
234
251
  - Any fixture with margin ≤ 0 that previously had margin > +5.
235
- - Critical-finding catch-rate decrease vs last shipped variant (not vs bare — bare is the opponent, not the regression baseline).
252
+ - Critical-finding catch-rate decrease vs the last shipped comparable arm.
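The hard floors listed above can be sketched as a single pass over per-fixture results. The data shape (`FixtureResult`) and the function `hard_floor_violations` are hypothetical. The sketch also leaves out the soft gates and the known-limit exception, so it is a reading aid for the four hard floors rather than the suite's actual gate logic.

```python
from dataclasses import dataclass
from typing import List, Optional

MARGIN_FLOOR = 5           # margin >= +5
REGRESSION_FLOOR = -5      # may not lose by more than 5 vs previous shipped version
MIN_PASSING_FIXTURES = 7   # at least 7 gated, headroom-available fixtures


@dataclass
class FixtureResult:
    """One fixture's judged outcome (hypothetical shape)."""
    fixture_id: str
    margin: float
    variant_score: float
    prev_shipped_score: Optional[float]   # None on a first run
    disqualified: bool = False
    passed: bool = True
    gated: bool = True                    # gated and headroom-available


def hard_floor_violations(results: List[FixtureResult]) -> List[str]:
    """Return a list of human-readable hard-floor violations (empty means ship)."""
    violations = []
    for r in results:
        if r.disqualified:
            violations.append(f"{r.fixture_id}: judge disqualifier")
        if r.prev_shipped_score is not None:
            if r.variant_score - r.prev_shipped_score < REGRESSION_FLOOR:
                violations.append(f"{r.fixture_id}: regression beyond -5")
    covered = sum(1 for r in results if r.gated and r.margin >= MARGIN_FLOOR)
    if covered < MIN_PASSING_FIXTURES:
        violations.append(f"suite coverage: {covered} < {MIN_PASSING_FIXTURES}")
    f9 = [r for r in results if r.fixture_id.startswith("F9-")]
    if not f9 or not all(r.passed for r in f9):
        violations.append("F9 (E2E) did not PASS")
    return violations
```

Any non-empty result blocks the ship decision. Returning the reasons rather than a boolean keeps the failure visible in a report.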
236
253
 
237
254
  Known-limit exception:
238
255
 
@@ -264,7 +281,7 @@ adding anything.
264
281
  standalone `benchmark/auto-resolve/scripts/run-suite.sh` invoked via `npm
265
282
  run`? **Proposal**: both — `bin/devlyn.js benchmark` is the advertised
266
283
  entry, which shells out to the script.
267
- 2. Parallel run safety — can we run 9 fixtures × 2 arms concurrently without
284
+ 2. Parallel run safety — can we run the selected fixture set × 3 arms concurrently without
268
285
  rate-limit / lockfile conflicts? **Proposal**: default sequential with
269
286
  `--parallel N` flag. Default `N=1` for safety; the user can opt in.
270
287
  3. Token accounting — Claude Code doesn't expose subagent totals reliably.