devlyn-cli 2.3.0 → 2.3.2

Files changed (219)
  1. package/AGENTS.md +1 -1
  2. package/CLAUDE.md +2 -2
  3. package/README.md +82 -29
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +211 -18
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +3 -2
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +1 -1
  201. package/config/skills/_shared/runtime-principles.md +5 -8
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  205. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  206. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  207. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  208. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  209. package/config/skills/devlyn:resolve/SKILL.md +49 -18
  210. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  211. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  212. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  213. package/config/skills/devlyn:resolve/references/phases/verify.md +62 -28
  214. package/config/skills/devlyn:resolve/references/state-schema.md +7 -4
  215. package/package.json +47 -2
  216. package/scripts/lint-fixtures.sh +349 -0
  217. package/scripts/lint-shadow-fixtures.sh +58 -0
  218. package/scripts/lint-skills.sh +3642 -92
  219. package/{optional-skills → config/skills}/devlyn:design-ui/SKILL.md +0 -0
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json ADDED
@@ -0,0 +1,56 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": ["not ok "]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Process refund requests globally by `priority` descending, then `requested_on` ascending, then original input order ascending.",
+        "A refund rejects with reason `window_expired` when `requested_on` is more than `refund_window_days` after `purchased_on`.",
+        "A refund accepts only when the order's remaining refundable cents is at least the requested `cents`.",
+        "A rejected refund with reason `over_refund` must not change that order's remaining refundable cents.",
+        "For each accepted refund, decrement that order's remaining refundable cents by the requested `cents`.",
+        "For each accepted refund, compute `fee_cents` as the category policy's `restocking_fee_cents` capped at the requested `cents`, and compute `net_cents = cents - fee_cents`.",
+        "`approved` is ordered in processing order. Each row has keys `id`, `order`, `refund_cents`, `fee_cents`, and `net_cents`.",
+        "`rejected` is ordered in the original input refund order. Each row has keys `id`, `reason`.",
+        "`orders` is ordered by order id ascending. Each row has keys `id` and `remaining_refundable_cents`.",
+        "On success, write exactly one JSON object to stdout and no stderr. Keys: `approved`, `rejected`, `orders`."
+      ]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/duplicate-refund-error.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Before settling any refund, duplicate refund ids are invalid input: exit `2`, write exactly one JSON error object `{ \"error\": \"duplicate_refund_id\", \"id\": string }` to stderr, and write no stdout."
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\}|\\[\\])",
+      "description": "silent catch returning fallback in settle-refunds path",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["bin/cli.js", "tests/cli.test.js"],
+  "forbidden_files": [],
+  "tier_a_waivers": [],
+  "spec_output_files": ["bin/cli.js", "tests/cli.test.js"],
+  "max_deps_added": 0
+}
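Reviewer note: to see what the two `forbidden_patterns` entries in this fixture would flag, the sketch below applies the same expressions (JSON-unescaped into JavaScript regex literals) to three short snippets. The snippets are hypothetical and not taken from the package, and using JavaScript's regex engine is an assumption, since the diff does not show which engine the benchmark harness uses.

```javascript
'use strict';

// Patterns copied from the fixture's forbidden_patterns, JSON-unescaped.
const silentCatch = /catch\s*\([^)]*\)\s*\{[^}]*return\s+(null|undefined|''|\{\}|\[\])/;
const emptyCatch = /catch\s*\([^)]*\)\s*\{\s*\}/;

// Hypothetical source snippets for illustration only.
const samples = {
  empty: 'try { run(); } catch (err) { }',
  fallback: 'try { return JSON.parse(s); } catch (err) { return null; }',
  surfaced: 'try { run(); } catch (err) { console.error(err.message); process.exit(2); }'
};

// For each snippet, record which pattern (if any) would match it.
const report = {};
for (const [name, src] of Object.entries(samples)) {
  report[name] = {
    silentCatch: silentCatch.test(src),
    emptyCatch: emptyCatch.test(src)
  };
}

console.log(JSON.stringify(report, null, 2));
```

Under these assumptions, only the `empty` snippet trips the empty-catch pattern and only the `fallback` snippet trips the silent-fallback pattern. The `surfaced` snippet, which reports the error and exits nonzero, matches neither.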
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json ADDED
@@ -0,0 +1,10 @@
+{
+  "id": "S6-cli-refund-window-ledger",
+  "category": "high-risk",
+  "difficulty": "high",
+  "timeout_seconds": 900,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a refund ledger CLI command that applies category refund windows, priority-ordered refund requests, cumulative per-order refundable balances, duplicate refund rejection, and exact JSON output shape."
+}
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh ADDED
@@ -0,0 +1,3 @@
+#!/usr/bin/env bash
+set -euo pipefail
+# S6 reuses the baseline test-repo state.
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md ADDED
@@ -0,0 +1,59 @@
+---
+id: "S6-cli-refund-window-ledger"
+title: "Add refund window ledger command"
+status: planned
+complexity: high
+depends-on: []
+---
+
+# S6 Add Refund Window Ledger Command
+
+## Context
+
+Finance operations needs a deterministic CLI command that settles refund
+requests against original orders. The command must combine category refund
+windows, priority ordering, cumulative per-order refundable balances, duplicate
+id rejection, and exact machine-readable output.
+
+## Requirements
+
+- [ ] Add `settle-refunds` to `bin/cli.js`.
+- [ ] Accept `--policies <json>` as a JSON object whose keys are category names and whose values have keys `refund_window_days` and `restocking_fee_cents`.
+- [ ] Accept `--orders <json>` as a JSON array of order objects. Each order has keys `id`, `category`, `paid_cents`, `purchased_on`, and `fulfilled`.
+- [ ] Accept `--refunds <json>` as a JSON array of refund request objects. Each refund has keys `id`, `order`, `cents`, `priority`, and `requested_on`.
+- [ ] Before settling any refund, duplicate refund ids are invalid input: exit `2`, write exactly one JSON error object `{ "error": "duplicate_refund_id", "id": string }` to stderr, and write no stdout.
+- [ ] Process refund requests globally by `priority` descending, then `requested_on` ascending, then original input order ascending.
+- [ ] A refund rejects with reason `unknown_order` when the order does not exist.
+- [ ] A refund rejects with reason `unfulfilled_order` when the order exists but `fulfilled` is not `true`.
+- [ ] A refund rejects with reason `unknown_policy` when the order category has no policy.
+- [ ] A refund rejects with reason `window_expired` when `requested_on` is more than `refund_window_days` after `purchased_on`.
+- [ ] A refund accepts only when the order's remaining refundable cents is at least the requested `cents`.
+- [ ] A rejected refund with reason `over_refund` must not change that order's remaining refundable cents.
+- [ ] For each accepted refund, decrement that order's remaining refundable cents by the requested `cents`.
+- [ ] For each accepted refund, compute `fee_cents` as the category policy's `restocking_fee_cents` capped at the requested `cents`, and compute `net_cents = cents - fee_cents`.
+- [ ] `approved` is ordered in processing order. Each row has keys `id`, `order`, `refund_cents`, `fee_cents`, and `net_cents`.
+- [ ] `rejected` is ordered in the original input refund order. Each row has keys `id`, `reason`.
+- [ ] `orders` is ordered by order id ascending. Each row has keys `id` and `remaining_refundable_cents`.
+- [ ] On success, write exactly one JSON object to stdout and no stderr. Keys: `approved`, `rejected`, `orders`.
+
+## Constraints
+
+- Use only Node.js built-ins; add no npm dependencies.
+- Touch only `bin/cli.js` and `tests/cli.test.js`.
+- Do not silently catch JSON parse or validation errors. Surface invalid input as a user-visible error with nonzero exit.
+- Do not persist refund balances between command invocations.
+- All public money amounts are integer cents.
+
+## Out of Scope
+
+- Reading input from files.
+- Taxes, payment gateway calls, currency conversion, or store-credit issuance.
+- Partial approval of a single refund request.
+- Changing `hello`, `version`, server routes, or package metadata.
+
+## Verification
+
+- `node --test tests/cli.test.js` passes.
+- `node "$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js"` prints `{"ok":true}`.
+- `node "$BENCH_FIXTURE_DIR/verifiers/duplicate-refund-error.js"` prints `{"ok":true}`.
+- Solo-headroom hypothesis: solo_claude is expected to miss cumulative remaining refundable cents or original-order rejected rows under priority-ordered refund settlement; observable command `node "$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js"` exposes the miss.
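Reviewer note: the settlement rules in this spec can be sketched as a single pass over a sorted queue. The code below is a minimal illustration, not the package's `bin/cli.js`. The names `settleRefunds` and `rejectionReason` are hypothetical, duplicate-id validation is assumed to have run already, and the precedence of the rejection checks is an assumption (the spec lists the reasons but not their order). It also assumes dates are `YYYY-MM-DD` strings.

```javascript
'use strict';

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: first matching rejection reason, or null when acceptable.
// The order of these checks is an assumption, not something the spec states.
function rejectionReason(refund, order, policy, remainingCents) {
  if (!order) return 'unknown_order';
  if (order.fulfilled !== true) return 'unfulfilled_order';
  if (!policy) return 'unknown_policy';
  // Date-only ISO strings parse as UTC, so this is a whole number of days.
  const days = (Date.parse(refund.requested_on) - Date.parse(order.purchased_on)) / DAY_MS;
  if (days > policy.refund_window_days) return 'window_expired';
  if (remainingCents < refund.cents) return 'over_refund';
  return null;
}

// Hypothetical settlement routine; assumes refund ids are already unique.
function settleRefunds(policies, orders, refunds) {
  const ordersById = new Map(orders.map((o) => [o.id, o]));
  const remaining = new Map(orders.map((o) => [o.id, o.paid_cents]));

  // Processing order: priority descending, requested_on ascending, input index ascending.
  const queue = refunds
    .map((refund, index) => ({ refund, index }))
    .sort((a, b) =>
      b.refund.priority - a.refund.priority ||
      (a.refund.requested_on < b.refund.requested_on ? -1 :
        a.refund.requested_on > b.refund.requested_on ? 1 : 0) ||
      a.index - b.index);

  const approved = [];
  const reasonByIndex = new Map();
  for (const { refund, index } of queue) {
    const order = ordersById.get(refund.order);
    const hasPolicy = order && Object.prototype.hasOwnProperty.call(policies, order.category);
    const policy = hasPolicy ? policies[order.category] : undefined;
    const reason = rejectionReason(refund, order, policy, remaining.get(refund.order));
    if (reason) {
      reasonByIndex.set(index, reason); // rejected refunds leave balances untouched
      continue;
    }
    remaining.set(order.id, remaining.get(order.id) - refund.cents);
    const fee = Math.min(policy.restocking_fee_cents, refund.cents);
    approved.push({
      id: refund.id, order: order.id, refund_cents: refund.cents,
      fee_cents: fee, net_cents: refund.cents - fee
    });
  }

  // Rejected rows follow the original input order, not the processing order.
  const rejected = [];
  refunds.forEach((refund, index) => {
    if (reasonByIndex.has(index)) rejected.push({ id: refund.id, reason: reasonByIndex.get(index) });
  });
  const orderRows = [...remaining.keys()].sort().map((id) => ({
    id, remaining_refundable_cents: remaining.get(id)
  }));
  return { approved, rejected, orders: orderRows };
}

const result = settleRefunds(
  { electronics: { refund_window_days: 30, restocking_fee_cents: 150 } },
  [{ id: 'ord-a', category: 'electronics', paid_cents: 1000, purchased_on: '2026-01-01', fulfilled: true }],
  [
    { id: 'low-a', order: 'ord-a', cents: 500, priority: 1, requested_on: '2026-01-08' },
    { id: 'high-a', order: 'ord-a', cents: 800, priority: 10, requested_on: '2026-01-09' }
  ]
);
console.log(JSON.stringify(result));
```

In this reduced input, `high-a` is processed first despite appearing second, which leaves 200 cents on `ord-a` and causes `low-a` to reject with `over_refund`. This is the cumulative-balance interaction the solo-headroom hypothesis refers to.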
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt ADDED
@@ -0,0 +1 @@
+Add a `settle-refunds` command to bench-cli. It must read policies, orders, and refund requests from JSON CLI arguments, process refund requests by priority, maintain per-order remaining refundable cents, reject duplicates before processing, and emit exact JSON output.
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js ADDED
@@ -0,0 +1,41 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+
+const policies = JSON.stringify({
+  apparel: { refund_window_days: 45, restocking_fee_cents: 25 }
+});
+const orders = JSON.stringify([
+  { id: 'ord-a', category: 'apparel', paid_cents: 600, purchased_on: '2026-01-10', fulfilled: true }
+]);
+const refunds = JSON.stringify([
+  { id: 'dup', order: 'ord-a', cents: 100, priority: 2, requested_on: '2026-01-11' },
+  { id: 'dup', order: 'ord-a', cents: 100, priority: 1, requested_on: '2026-01-12' }
+]);
+
+const result = spawnSync('node', [
+  cli,
+  'settle-refunds',
+  '--policies',
+  policies,
+  '--orders',
+  orders,
+  '--refunds',
+  refunds
+], {
+  cwd: work,
+  encoding: 'utf8'
+});
+
+assert.strictEqual(result.status, 2, result.stdout || result.stderr);
+assert.strictEqual(result.stdout, '');
+assert.deepStrictEqual(JSON.parse(result.stderr), {
+  error: 'duplicate_refund_id',
+  id: 'dup'
+});
+
+console.log(JSON.stringify({ ok: true }));
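Reviewer note: this verifier expects exit status `2`, empty stdout, and a single JSON error object on stderr. One way a CLI could satisfy that contract is sketched below. The function `duplicateIdOutcome` is hypothetical and returns the intended process outcome as plain data instead of exiting, so the shape is easy to inspect. It is not taken from the package.

```javascript
'use strict';

// Hypothetical helper: describes the process outcome for duplicate refund ids.
// Returns null when every id is unique, so settlement can proceed.
function duplicateIdOutcome(refunds) {
  const seen = new Set();
  for (const { id } of refunds) {
    if (seen.has(id)) {
      return {
        exitCode: 2,
        stdout: '',
        stderr: JSON.stringify({ error: 'duplicate_refund_id', id })
      };
    }
    seen.add(id);
  }
  return null;
}

const outcome = duplicateIdOutcome([
  { id: 'dup', order: 'ord-a', cents: 100, priority: 2, requested_on: '2026-01-11' },
  { id: 'dup', order: 'ord-a', cents: 100, priority: 1, requested_on: '2026-01-12' }
]);
const clean = duplicateIdOutcome([{ id: 'only', order: 'ord-a', cents: 100, priority: 1, requested_on: '2026-01-11' }]);

console.log(outcome, clean);
```

Because the check scans the raw input before any sorting or settlement, it runs "before settling any refund" as the contract requires, regardless of the priorities on the duplicated rows.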
package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js ADDED
@@ -0,0 +1,65 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+
+const policies = JSON.stringify({
+  electronics: { refund_window_days: 30, restocking_fee_cents: 150 },
+  apparel: { refund_window_days: 45, restocking_fee_cents: 25 }
+});
+const orders = JSON.stringify([
+  { id: 'ord-a', category: 'electronics', paid_cents: 1000, purchased_on: '2026-01-01', fulfilled: true },
+  { id: 'ord-b', category: 'apparel', paid_cents: 600, purchased_on: '2026-01-10', fulfilled: true },
+  { id: 'ord-c', category: 'electronics', paid_cents: 400, purchased_on: '2025-12-01', fulfilled: true },
+  { id: 'ord-d', category: 'apparel', paid_cents: 500, purchased_on: '2026-01-15', fulfilled: false }
+]);
+const refunds = JSON.stringify([
+  { id: 'low-a', order: 'ord-a', cents: 500, priority: 1, requested_on: '2026-01-08' },
+  { id: 'expired-c', order: 'ord-c', cents: 100, priority: 9, requested_on: '2026-02-01' },
+  { id: 'high-a', order: 'ord-a', cents: 800, priority: 10, requested_on: '2026-01-09' },
+  { id: 'unknown', order: 'missing', cents: 50, priority: 8, requested_on: '2026-01-09' },
+  { id: 'unfulfilled', order: 'ord-d', cents: 50, priority: 7, requested_on: '2026-01-20' },
+  { id: 'apparel-ok', order: 'ord-b', cents: 300, priority: 6, requested_on: '2026-01-20' }
+]);
+
+const result = spawnSync('node', [
+  cli,
+  'settle-refunds',
+  '--policies',
+  policies,
+  '--orders',
+  orders,
+  '--refunds',
+  refunds
+], {
+  cwd: work,
+  encoding: 'utf8'
+});
+
+assert.strictEqual(result.status, 0, result.stderr || result.stdout);
+assert.strictEqual(result.stderr, '');
+const parsed = JSON.parse(result.stdout);
+
+assert.deepStrictEqual(parsed, {
+  approved: [
+    { id: 'high-a', order: 'ord-a', refund_cents: 800, fee_cents: 150, net_cents: 650 },
+    { id: 'apparel-ok', order: 'ord-b', refund_cents: 300, fee_cents: 25, net_cents: 275 }
+  ],
+  rejected: [
+    { id: 'low-a', reason: 'over_refund' },
+    { id: 'expired-c', reason: 'window_expired' },
+    { id: 'unknown', reason: 'unknown_order' },
+    { id: 'unfulfilled', reason: 'unfulfilled_order' }
+  ],
+  orders: [
+    { id: 'ord-a', remaining_refundable_cents: 200 },
+    { id: 'ord-b', remaining_refundable_cents: 300 },
+    { id: 'ord-c', remaining_refundable_cents: 400 },
+    { id: 'ord-d', remaining_refundable_cents: 500 }
+  ]
+});
+
+console.log(JSON.stringify({ ok: true }));
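Reviewer note: the `window_expired` expectation for `expired-c` in this verifier follows from plain day arithmetic. The sketch below checks it under the assumption that "more than `refund_window_days` after" means strictly greater than, counted in whole UTC days. The helpers `daysBetween` and `isWindowExpired` are hypothetical, and the `boundary` case is an added illustration that the fixture itself does not exercise.

```javascript
'use strict';

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: whole days between two date-only ISO strings (parsed as UTC).
function daysBetween(from, to) {
  return (Date.parse(to) - Date.parse(from)) / DAY_MS;
}

// Hypothetical helper: "more than N days after" read as strictly greater than N.
function isWindowExpired(purchasedOn, requestedOn, windowDays) {
  return daysBetween(purchasedOn, requestedOn) > windowDays;
}

const checks = {
  // ord-c: purchased 2025-12-01, requested 2026-02-01, 30-day electronics window
  expiredC: isWindowExpired('2025-12-01', '2026-02-01', 30),
  // ord-a: purchased 2026-01-01, requested 2026-01-09, same window
  highA: isWindowExpired('2026-01-01', '2026-01-09', 30),
  // Illustrative boundary: a request exactly 30 days after purchase
  boundary: isWindowExpired('2026-01-01', '2026-01-31', 30)
};

console.log(JSON.stringify(checks));
```

Under this reading, `expired-c` is 62 days past purchase and rejects, `high-a` is 8 days past purchase and stays eligible, and a request made exactly on day 30 is still inside the window.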
package/bin/devlyn.js CHANGED
@@ -22,7 +22,7 @@ const CLI_TARGETS = {
     // Codex auto-loads skills from ~/.codex/skills/ (user-global). Same
     // SKILL.md format as Claude Code; descriptions must stay ≤1024 chars.
     skillsDir: path.join(os.homedir(), '.codex', 'skills'),
-    skillsToInstall: ['devlyn:resolve', 'devlyn:ideate', '_shared'],
+    skillsToInstall: ['devlyn:resolve', 'devlyn:ideate', 'devlyn:design-ui', '_shared'],
     detect: () => fs.existsSync(path.join(process.cwd(), 'AGENTS.md')) || fs.existsSync(path.join(process.cwd(), '.codex')),
   },
   gemini: {
@@ -183,7 +183,6 @@ const OPTIONAL_ADDONS = [
   { name: 'devlyn:pencil-push', desc: 'Push codebase UI to Pencil canvas for design sync', type: 'local' },
   { name: 'devlyn:reap', desc: 'Safely reap orphaned MCP / codex / Superset child processes left behind by long Claude sessions', type: 'local' },
   { name: 'devlyn:design-system', desc: 'Extract design tokens from a chosen UI style for exact reproduction (creative power-user)', type: 'local' },
-  { name: 'devlyn:design-ui', desc: 'N (default 5) distinct UI style explorations from a single Lead Designer (creative power-user)', type: 'local' },
   { name: 'devlyn:team-design-ui', desc: '5 distinct UI style explorations from a full design team (creative power-user)', type: 'local' },
   // External skill packs (installed via npx skills add)
   { name: 'vercel-labs/agent-skills', desc: 'React, Next.js, React Native best practices', type: 'external' },
@@ -194,7 +193,7 @@ const OPTIONAL_ADDONS = [
   // MCP servers (installed via claude mcp add)
   // Note: the Codex integration uses the local `codex` CLI binary (not MCP).
   // Install the CLI separately per https://platform.openai.com/docs/codex — the
-  // harness auto-detects availability and downgrades to Claude-only on failure.
+  // pair/risk-probe routes fail closed when Codex is required but unavailable.
   { name: 'playwright', desc: 'Playwright MCP for browser testing — powers /devlyn:resolve BUILD_GATE browser tier', type: 'mcp', command: 'npx -y @anthropic-ai/mcp-playwright' },
 ];

@@ -524,7 +523,7 @@ function detectOtherCLIs() {
   return detected;
 }

-// Install /devlyn:resolve + /devlyn:ideate + _shared skills into a CLI's
+// Install devlyn:resolve + devlyn:ideate + devlyn:design-ui + _shared skills into a CLI's
 // global skills directory (e.g. ~/.codex/skills/). Returns count of skills
 // copied. Skipped silently for CLIs without a skillsDir (e.g. cursor, copilot
 // at the time of writing — they don't have an analogous skill-loader).
@@ -608,11 +607,11 @@ function installAgentsForCLI(cliKey) {
   }

   // If this CLI also supports a global skill-loader (currently Codex), install
-  // /devlyn:resolve + /devlyn:ideate + _shared so the same slash commands work
-  // there. Skipped for CLIs without a skillsDir entry.
+  // devlyn:resolve + devlyn:ideate + devlyn:design-ui + _shared. Codex invokes
+  // these as skills (for example `$devlyn:resolve`), not Claude slash commands.
   const skillsCopied = installSkillsForCLI(cliKey);
   if (skillsCopied > 0) {
-    log(` → ${skillsCopied} skill${skillsCopied > 1 ? 's' : ''} installed (devlyn:resolve / devlyn:ideate / _shared)`, 'dim');
+    log(` → ${skillsCopied} skill${skillsCopied > 1 ? 's' : ''} installed (devlyn:resolve / devlyn:ideate / devlyn:design-ui / _shared)`, 'dim');
   }

   return true;
@@ -689,7 +688,7 @@ async function init(skipPrompts = false) {
     }
   }
   if (!settings.env) settings.env = {};
-  // Auto-allow pipeline state directory and common git commands so auto-resolve doesn't prompt
+  // Auto-allow pipeline state directory and common git commands so resolve doesn't prompt
   if (!settings.permissions) settings.permissions = {};
   if (!settings.permissions.allow) settings.permissions.allow = [];
   const pipelinePermissions = [
@@ -762,7 +761,7 @@ async function init(skipPrompts = false) {
     if (cli.configDir) {
       desc = `Install agents into ${cli.configDir}/`;
     } else if (cli.skillsDir) {
-      desc = `Install ${cli.instructionsFile} + /devlyn:resolve + /devlyn:ideate skills (~/.codex/skills/)`;
+      desc = `Install ${cli.instructionsFile} + devlyn:resolve/devlyn:ideate/devlyn:design-ui skills (~/.codex/skills/; use $devlyn:* in Codex)`;
     } else {
       desc = `Install ${cli.instructionsFile}`;
     }
@@ -777,7 +776,7 @@ async function init(skipPrompts = false) {
     log(` ✅ Agent instructions installed for ${agentsInstalled} CLI${agentsInstalled !== 1 ? 's' : ''}`, 'green');
   } else {
     log('💡 No additional CLI instructions selected', 'dim');
-    log(' Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md + /devlyn skills', 'dim');
+    log(' Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md + devlyn skills', 'dim');
   }

   // Ask about optional addons (local skills + external packs)
@@ -808,8 +807,14 @@ function showHelp() {
   log(' npx devlyn-cli -y Install without prompts');
   log(' npx devlyn-cli agents Install agents for detected CLIs');
   log(' npx devlyn-cli agents all Install agents for all supported CLIs');
-  log(' npx devlyn-cli benchmark Run the full A/B benchmark suite vs bare');
-  log(' npx devlyn-cli benchmark --n 3 --bless Ship-decision run + promote baseline if pass');
+  log(' npx devlyn-cli benchmark Run the resolve benchmark suite');
+  log(' npx devlyn-cli benchmark recent Show compact recent benchmark results');
+  log(' npx devlyn-cli benchmark frontier Show pair candidate frontier scores/triggers without providers');
+  log(' npx devlyn-cli benchmark audit Audit pair evidence readiness');
+  log(' npx devlyn-cli benchmark audit-headroom Audit failed headroom results');
+  log(' npx devlyn-cli benchmark headroom <fixtures...> Score bare vs solo_claude headroom');
+  log(' npx devlyn-cli benchmark pair <fixtures...> Score solo_claude vs pair path');
+  log(' npx devlyn-cli benchmark --bless If ship-gate passes, promote baseline');
   log(' npx devlyn-cli benchmark --dry-run Validate suite setup without model invocation');
   log(' npx devlyn-cli --help Show this help\n');
   log('Optional skills (select during install):', 'green');
@@ -831,6 +836,170 @@ function showHelp() {
   log('');
 }

+function showBenchmarkHelp() {
+  log('Usage:', 'green');
+  log(' npx devlyn-cli benchmark [suite] [options] [fixtures...]');
+  log(' npx devlyn-cli benchmark recent [options]');
+  log(' npx devlyn-cli benchmark frontier [options]');
+  log(' npx devlyn-cli benchmark audit [options]');
+  log(' npx devlyn-cli benchmark audit-headroom [options]');
+  log(' npx devlyn-cli benchmark headroom [options] <fixtures...>');
+  log(' npx devlyn-cli benchmark pair [options] <fixtures...>');
+  log('');
+  log('Score-focused runs:', 'green');
+  log(' recent Show compact, wrap-safe recent benchmark results');
+  log(' frontier Show active rejected/evidence/unmeasured pair candidates, scores, and triggers without providers');
+  log(' audit Fail on unmeasured pair candidates and invalid headroom rejections');
+  log(' Prints frontier score rows plus headroom and pair quality handoff rows');
+  log(' audit-headroom Fail on active failed or unsupported headroom rejections');
+  log(' headroom Score bare vs solo_claude before spending the pair arm');
+  log(' pair Score solo_claude vs the selected pair path and print gate tables');
+  log('');
+  log('Shadow suite:', 'green');
+  log(' npx devlyn-cli benchmark suite --suite shadow --dry-run');
+  log(' Lists shadow tasks only; use headroom/pair with explicit S* ids for real measurement');
+  log('');
+  log('Examples:', 'green');
+  log(' npx devlyn-cli benchmark --dry-run F1-cli-trivial-flag');
+  log(' npx devlyn-cli benchmark recent');
+  log(' npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md');
+  log(' npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md');
+  log(' npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit');
+  log(' npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json');
+  log(' npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+  log(' npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+  log('');
+}
+
+function showBenchmarkModeHelp(mode) {
+  if (mode === 'recent') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark recent [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-json PATH');
+    log(' --out-md PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log(' --max-width N default: 92');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log('');
+    log('Output:', 'green');
+    log(' Prints compact, wrap-safe benchmark status and pair-evidence cards without wide tables');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark recent');
+    log(' npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md');
+    log('');
+    return;
+  }
+  if (mode === 'frontier') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark frontier [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-json PATH');
+    log(' --out-md PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log(' --fail-on-unmeasured');
+    log('');
+    log('Output:', 'green');
+    log(' Prints pair evidence score rows with trigger reasons; --out-md includes a Triggers column');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md');
+    log('');
+    return;
+  }
+  if (mode === 'audit-headroom') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark audit-headroom [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-json PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json');
+    log('');
+    return;
+  }
+  if (mode === 'audit') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark audit [options]');
+    log('');
+    log('Options:', 'green');
+    log(' --out-dir PATH');
+    log(' --fixtures-root PATH');
+    log(' --registry PATH');
+    log(' --results-root PATH');
+    log(' --min-pair-evidence N default: 4');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log(' --require-hypothesis-trigger');
+    log('');
+    log('Output:', 'green');
+    log(' Prints frontier score rows plus headroom_rejections=PASS/FAIL, pair_evidence_quality=PASS/FAIL, pair_trigger_reasons=PASS/FAIL, pair_evidence_hypotheses=PASS/FAIL, pair_evidence_hypothesis_triggers=PASS/WARN/FAIL, historical-alias, and hypothesis-trigger gap handoff rows');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit');
+    log(' npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict');
+    log('');
+    return;
+  }
+  if (mode === 'headroom') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark headroom [options] <fixtures...>');
+    log('');
+    log('Options:', 'green');
+    log(' --run-id ID');
+    log(' --bare-max N default: 60');
+    log(' --solo-max N default: 80');
+    log(' --min-bare-headroom N default: 5');
+    log(' --min-solo-headroom N default: 5');
+    log(' --min-fixtures N default: 2; use 3 for F16/F23/F25 proof reruns; audit requires 4 passing evidence rows');
+    log(' --allow-rejected-fixtures active-fixture diagnostics only');
+    log(' --dry-run validate args/fixtures and print replay command only');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+    log('');
+    return;
+  }
+  if (mode === 'pair') {
+    log('Usage:', 'green');
+    log(' npx devlyn-cli benchmark pair [options] <fixtures...>');
+    log('');
+    log('Options:', 'green');
+    log(' --run-id ID');
+    log(' --bare-max N');
+    log(' --solo-max N');
+    log(' --min-bare-headroom N default: 5');
+    log(' --min-solo-headroom N default: 5');
+    log(' --min-fixtures N default: 2; use 3 for F16/F23/F25 proof reruns; audit requires 4 passing evidence rows');
+    log(' --min-pair-margin N default: 5');
+    log(' --max-pair-solo-wall-ratio N default: 3');
+    log(' --pair-arm ARM default: l2_risk_probes; l2_gated is diagnostic');
+    log(' --reuse-calibrated-from RUN_ID');
+    log(' --allow-rejected-fixtures active-fixture diagnostics only');
+    log(' --dry-run validate args/fixtures and print replay command only');
+    log('');
+    log('Example:', 'green');
+    log(' npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+    log('');
+    return;
+  }
+  showBenchmarkHelp();
+}
+
 // Main
 const args = process.argv.slice(2);
 const command = args[0];
@@ -850,16 +1019,40 @@ switch (command) {
     break;
   case 'benchmark':
   case 'bench': {
-    // Delegate to benchmark/auto-resolve/scripts/run-suite.sh with all remaining args.
-    const runSuite = path.join(__dirname, '..', 'benchmark', 'auto-resolve', 'scripts', 'run-suite.sh');
-    if (!fs.existsSync(runSuite)) {
+    const benchmarkScripts = {
+      suite: 'run-suite.sh',
+      recent: 'recent-benchmark-summary.py',
+      frontier: 'pair-candidate-frontier.py',
+      audit: 'audit-pair-evidence.py',
+      'audit-headroom': 'audit-headroom-rejections.py',
+      headroom: 'run-headroom-candidate.sh',
+      pair: 'run-full-pipeline-pair-candidate.sh',
+    };
+    let forwardedArgs = args.slice(1);
+    if (forwardedArgs[0] === '--help' || forwardedArgs[0] === '-h') {
+      showBenchmarkHelp();
+      break;
+    }
+    let benchmarkMode = 'suite';
+    if (forwardedArgs[0] === 'suite' || forwardedArgs[0] === 'recent' || forwardedArgs[0] === 'frontier' || forwardedArgs[0] === 'audit' || forwardedArgs[0] === 'audit-headroom' || forwardedArgs[0] === 'headroom' || forwardedArgs[0] === 'pair') {
+      benchmarkMode = forwardedArgs[0];
+      forwardedArgs = forwardedArgs.slice(1);
+    }
+    if (forwardedArgs[0] === '--help' || forwardedArgs[0] === '-h') {
+      showBenchmarkModeHelp(benchmarkMode);
+      break;
+    }
+    const runnerName = benchmarkScripts[benchmarkMode];
+    const runner = path.join(__dirname, '..', 'benchmark', 'auto-resolve', 'scripts', runnerName);
+    if (!fs.existsSync(runner)) {
       log('❌ Benchmark suite runner missing — is this a clean devlyn-cli checkout?', 'yellow');
-      log(` Expected: ${runSuite}`, 'dim');
+      log(` Expected: ${runner}`, 'dim');
       process.exit(1);
     }
     const { spawnSync } = require('child_process');
-    const forwardedArgs = args.slice(1);
-    const res = spawnSync('bash', [runSuite, ...forwardedArgs], { stdio: 'inherit' });
+    const env = { ...process.env, DEVLYN_BENCHMARK_CLI_SUBCOMMAND: benchmarkMode };
+    const executable = (benchmarkMode === 'recent' || benchmarkMode === 'frontier' || benchmarkMode === 'audit' || benchmarkMode === 'audit-headroom') ? 'python3' : 'bash';
+    const res = spawnSync(executable, [runner, ...forwardedArgs], { stdio: 'inherit', env });
     process.exit(res.status ?? 1);
     break;
   }
@@ -30,6 +30,9 @@ Verbosity, formatting, length conventions specific to this model.
 ## Tool-use posture
 When to use tools, when to reason, parallel/sequential preferences.

+## Effort and autonomy
+Optional. Model-specific guidance for effort levels or autonomous-vs-interactive runs when the vendor guide calls this out.
+
 ## Validation pattern
 How this model verifies its work — mechanical-first vs self-check, etc.

@@ -8,7 +8,7 @@ You are GPT-5.5 by OpenAI. OpenAI's prompt-guidance for this model governs your

 ## Output discipline

-Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use headers, bullets, and bold sparingly favor short paragraphs and natural transitions unless the canonical body or user requests structure. When `text.verbosity` is `low`, prefer even shorter responses.
+Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use Markdown only where it carries structure (`inline code`, code fences, short lists/tables); otherwise favor short paragraphs and natural transitions. When `text.verbosity` is `low`, prefer even shorter responses.

 ## Tool-use posture

@@ -26,4 +26,8 @@ The official guide warns explicitly about carrying over instructions from older
 2. **Don't over-specify process when the destination is clear.** If the canonical body names the outcome, choose the path; do not narrate every step.
 3. **Stop rules are explicit.** When the canonical body or the harness asks you to stop / abstain / ask, follow the stop rule rather than retrying loops indefinitely. Loop-minimization does not outrank correctness or required citation.

+## Prompt-maintenance cue
+
+When asked to improve a failed prompt, act as GPT-5.5 metaprompter for itself: name the observed failure, then propose the smallest instruction to add, remove, or relocate. Prefer subtractive changes before adding new rules; keep the canonical body model-neutral and put only GPT-specific tactics in this adapter.
+
 Do not narrate internal deliberation. State results and decisions directly.
@@ -10,10 +10,18 @@ You are Claude Opus 4.7 by Anthropic. Anthropic's prompt-engineering guide for t

 You calibrate response length to task complexity automatically — keep simple lookups short, scale up only when the task warrants it. Do NOT pad with context the user didn't ask for. When the canonical body sets a structural format (XML, JSON, sections), follow it literally; do not silently restructure.

+## Examples and structure
+
+When prompt maintenance adds examples for Claude, prefer concise positive examples over lists of negative prohibitions. Wrap examples in `<example>` tags (or `<examples>` for several) so examples stay distinct from instructions and variable inputs.
+
 ## Tool-use posture

 You default to fewer tool calls than prior Claude generations. When the canonical body lists tools, use them when their result would change your answer. Make independent tool calls in parallel; chain only when one depends on another's output. Do not narrate "I'll now call X" preambles unless the canonical body requests progress updates.

+## Effort and autonomy
+
+For long-horizon coding, review, and agentic runs, assume the harness selected `high` or `xhigh` effort unless told otherwise. Spend that depth on upfront task/constraint understanding and end-state verification, not on verbose narration. If the user or orchestrator gives a complete task in one turn, proceed autonomously instead of requiring progressive clarification.
+
 ## Validation pattern

 When the canonical body asks you to verify your output before declaring done ("self-check" instructions), execute that step literally — re-read the spec's acceptance criteria, run the listed verification commands if available, list any gap. This is not optional. Mechanical gates owned by the harness (spec-verify-check.py, build-gate.py) are the primary correctness guard; your self-check is the secondary layer that catches what regex cannot.
@@ -22,7 +30,7 @@ When the canonical body asks you to verify your output before declaring done ("s

 You interpret instructions more literally than prior Claude versions. The official guide is explicit about three failure modes:

-1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones. Do NOT pre-filter for importance; the harness has a separate filter step.
+1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones; do not filter for importance or confidence. The harness has a separate filter step.
 2. **Subagent over-spawning**: do NOT spawn a subagent for work you can complete in a single response. Spawn only when the canonical body explicitly requests it OR when fanning out across independent items.
 3. **Overengineering**: do NOT add files, abstractions, error handling, validation, or "future flexibility" beyond what the spec asks. A bug fix doesn't need surrounding cleanup. The right complexity is the minimum needed for the current task.