devlyn-cli 2.2.2 → 2.3.1

Files changed (220)
  1. package/AGENTS.md +2 -2
  2. package/CLAUDE.md +4 -4
  3. package/README.md +85 -34
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +61 -44
  5. package/benchmark/auto-resolve/BENCHMARK-RESULTS.md +341 -0
  6. package/benchmark/auto-resolve/README.md +307 -44
  7. package/benchmark/auto-resolve/RUBRIC.md +23 -14
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +7 -3
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +8 -3
  10. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +8 -3
  11. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +10 -4
  12. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +10 -4
  13. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +12 -0
  14. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +6 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +7 -4
  16. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +12 -0
  17. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +6 -0
  18. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +8 -0
  19. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +12 -0
  20. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +6 -0
  21. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +16 -4
  22. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +7 -0
  23. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +11 -5
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +8 -1
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +4 -2
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +1 -1
  27. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md +34 -0
  28. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json +57 -0
  29. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json +10 -0
  30. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh +2 -0
  31. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md +67 -0
  32. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt +7 -0
  33. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js +35 -0
  34. package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js +53 -0
  35. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/expected.json +57 -0
  37. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/setup.sh +2 -0
  39. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/spec.md +70 -0
  40. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/task.txt +3 -0
  41. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/duplicate-renewal-error.js +42 -0
  42. package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/verifiers/priority-credit-rollback.js +70 -0
  43. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +10 -3
  44. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +7 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +5 -0
  46. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +7 -0
  47. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +3 -0
  48. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +1 -1
  49. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +15 -3
  50. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +1 -1
  51. package/benchmark/auto-resolve/fixtures/SCHEMA.md +53 -7
  52. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/NOTES.md +37 -0
  53. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/RETIRED.md +13 -0
  54. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/expected.json +56 -0
  55. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/setup.sh +18 -0
  57. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/spec.md +69 -0
  58. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/task.txt +7 -0
  59. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/exact-proration.js +48 -0
  60. package/benchmark/auto-resolve/fixtures/retired/F27-cli-subscription-proration/verifiers/rules-source-and-conflict.js +79 -0
  61. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/NOTES.md +54 -0
  62. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/RETIRED.md +7 -0
  63. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/expected.json +67 -0
  64. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/metadata.json +10 -0
  65. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/setup.sh +2 -0
  66. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/spec.md +67 -0
  67. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/task.txt +5 -0
  68. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/policy-precedence.js +72 -0
  69. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-and-immutability.js +43 -0
  70. package/benchmark/auto-resolve/fixtures/retired/F28-cli-return-authorization/verifiers/validation-boundary.js +116 -0
  71. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/NOTES.md +35 -0
  72. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/RETIRED.md +12 -0
  73. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/expected.json +58 -0
  74. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/metadata.json +10 -0
  75. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/setup.sh +2 -0
  76. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/spec.md +73 -0
  77. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/task.txt +17 -0
  78. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/mixed-idempotent-settlement.js +53 -0
  79. package/benchmark/auto-resolve/fixtures/retired/F30-cli-credit-hold-settlement/verifiers/rejection-boundaries.js +74 -0
  80. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/NOTES.md +60 -0
  81. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/RETIRED.md +29 -0
  82. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/expected.json +73 -0
  83. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/metadata.json +10 -0
  84. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/setup.sh +28 -0
  85. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/spec.md +58 -0
  86. package/benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/task.txt +5 -0
  87. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.json +82 -0
  88. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/full-pipeline-pair-gate.md +18 -0
  89. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.json +46 -0
  90. package/benchmark/auto-resolve/results/20260510-f16-f23-f25-combined-proof/headroom-gate.md +17 -0
  91. package/benchmark/auto-resolve/run-real-benchmark.md +303 -0
  92. package/benchmark/auto-resolve/scripts/audit-headroom-rejections.py +441 -0
  93. package/benchmark/auto-resolve/scripts/audit-pair-evidence.py +1256 -0
  94. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +147 -15
  95. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +28 -16
  96. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +11 -1
  97. package/benchmark/auto-resolve/scripts/compile-report.py +208 -46
  98. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +22 -4
  99. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +175 -30
  100. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +408 -46
  101. package/benchmark/auto-resolve/scripts/headroom-gate.py +270 -39
  102. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +164 -33
  103. package/benchmark/auto-resolve/scripts/iter-0033c-l1-summary.py +97 -0
  104. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +150 -38
  105. package/benchmark/auto-resolve/scripts/judge.sh +153 -26
  106. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +12 -5
  107. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +25 -2
  108. package/benchmark/auto-resolve/scripts/pair-candidate-frontier.py +469 -0
  109. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +5 -5
  110. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +9 -2
  111. package/benchmark/auto-resolve/scripts/pair-rejected-fixtures.sh +91 -0
  112. package/benchmark/auto-resolve/scripts/pair_evidence_contract.py +269 -0
  113. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +39 -10
  114. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +34 -4
  115. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +23 -5
  116. package/benchmark/auto-resolve/scripts/recent-benchmark-summary.py +232 -0
  117. package/benchmark/auto-resolve/scripts/run-fixture.sh +118 -51
  118. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +211 -39
  119. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +335 -39
  120. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +249 -6
  121. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +22 -48
  122. package/benchmark/auto-resolve/scripts/run-suite.sh +44 -7
  123. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +120 -19
  124. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +32 -14
  125. package/benchmark/auto-resolve/scripts/ship-gate.py +219 -50
  126. package/benchmark/auto-resolve/scripts/solo-ceiling-avoidance.py +53 -0
  127. package/benchmark/auto-resolve/scripts/solo-headroom-hypothesis.py +77 -0
  128. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +239 -26
  129. package/benchmark/auto-resolve/scripts/test-audit-headroom-rejections.sh +288 -0
  130. package/benchmark/auto-resolve/scripts/test-audit-pair-evidence.sh +1672 -0
  131. package/benchmark/auto-resolve/scripts/test-benchmark-arg-parsing.sh +933 -0
  132. package/benchmark/auto-resolve/scripts/test-build-pair-eligible-manifest.sh +491 -0
  133. package/benchmark/auto-resolve/scripts/test-check-f9-artifacts.sh +91 -0
  134. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +328 -3
  135. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +497 -18
  136. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +331 -14
  137. package/benchmark/auto-resolve/scripts/test-iter-0033c-compare.sh +525 -0
  138. package/benchmark/auto-resolve/scripts/test-iter-0033c-l1-summary.sh +254 -0
  139. package/benchmark/auto-resolve/scripts/test-lint-fixtures.sh +580 -0
  140. package/benchmark/auto-resolve/scripts/test-pair-candidate-frontier.sh +591 -0
  141. package/benchmark/auto-resolve/scripts/test-run-full-pipeline-pair-candidate.sh +497 -0
  142. package/benchmark/auto-resolve/scripts/test-run-headroom-candidate.sh +401 -0
  143. package/benchmark/auto-resolve/scripts/test-run-swebench-solver-batch.sh +111 -0
  144. package/benchmark/auto-resolve/scripts/test-ship-gate.sh +1189 -0
  145. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +924 -5
  146. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/NOTES.md +28 -0
  147. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/expected.json +63 -0
  148. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/metadata.json +10 -0
  149. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/setup.sh +3 -0
  150. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/spec.md +47 -0
  151. package/benchmark/auto-resolve/shadow-fixtures/S1-cli-lang-flag/task.txt +1 -0
  152. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/NOTES.md +34 -0
  153. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/expected.json +53 -0
  154. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/metadata.json +10 -0
  155. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/setup.sh +3 -0
  156. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/spec.md +50 -0
  157. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/task.txt +1 -0
  158. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/duplicate-order-error.js +27 -0
  159. package/benchmark/auto-resolve/shadow-fixtures/S2-cli-inventory-reservation/verifiers/priority-stock-reservation.js +44 -0
  160. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/NOTES.md +34 -0
  161. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/expected.json +55 -0
  162. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/metadata.json +10 -0
  163. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/setup.sh +3 -0
  164. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/spec.md +52 -0
  165. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/task.txt +1 -0
  166. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/duplicate-ticket-error.js +29 -0
  167. package/benchmark/auto-resolve/shadow-fixtures/S3-cli-ticket-assignment/verifiers/priority-agent-assignment.js +48 -0
  168. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/NOTES.md +34 -0
  169. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/expected.json +55 -0
  170. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/metadata.json +10 -0
  171. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/setup.sh +3 -0
  172. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/spec.md +55 -0
  173. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/task.txt +1 -0
  174. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/duplicate-return-error.js +43 -0
  175. package/benchmark/auto-resolve/shadow-fixtures/S4-cli-return-routing/verifiers/priority-return-routing.js +70 -0
  176. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/NOTES.md +37 -0
  177. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/expected.json +54 -0
  178. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/metadata.json +10 -0
  179. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/setup.sh +3 -0
  180. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/spec.md +59 -0
  181. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/task.txt +1 -0
  182. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/credit-ledger-priority.js +98 -0
  183. package/benchmark/auto-resolve/shadow-fixtures/S5-cli-credit-grant-ledger/verifiers/duplicate-charge-error.js +38 -0
  184. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/NOTES.md +36 -0
  185. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json +56 -0
  186. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json +10 -0
  187. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh +3 -0
  188. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md +59 -0
  189. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt +1 -0
  190. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js +41 -0
  191. package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js +65 -0
  192. package/bin/devlyn.js +221 -17
  193. package/config/skills/_shared/adapters/README.md +3 -0
  194. package/config/skills/_shared/adapters/gpt-5-5.md +5 -1
  195. package/config/skills/_shared/adapters/opus-4-7.md +9 -1
  196. package/config/skills/_shared/archive_run.py +78 -6
  197. package/config/skills/_shared/codex-config.md +5 -4
  198. package/config/skills/_shared/codex-monitored.sh +46 -1
  199. package/config/skills/_shared/collect-codex-findings.py +20 -5
  200. package/config/skills/_shared/engine-preflight.md +17 -13
  201. package/config/skills/_shared/runtime-principles.md +6 -9
  202. package/config/skills/_shared/spec-verify-check.py +2664 -107
  203. package/config/skills/_shared/verify-merge-findings.py +1369 -19
  204. package/config/skills/devlyn:design-ui/SKILL.md +364 -0
  205. package/config/skills/devlyn:ideate/SKILL.md +7 -4
  206. package/config/skills/devlyn:ideate/references/elicitation.md +50 -4
  207. package/config/skills/devlyn:ideate/references/from-spec-mode.md +26 -4
  208. package/config/skills/devlyn:ideate/references/project-mode.md +20 -1
  209. package/config/skills/devlyn:ideate/references/spec-template.md +10 -1
  210. package/config/skills/devlyn:resolve/SKILL.md +78 -26
  211. package/config/skills/devlyn:resolve/references/free-form-mode.md +15 -0
  212. package/config/skills/devlyn:resolve/references/phases/build-gate.md +2 -2
  213. package/config/skills/devlyn:resolve/references/phases/implement.md +1 -1
  214. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +74 -2
  215. package/config/skills/devlyn:resolve/references/phases/verify.md +80 -29
  216. package/config/skills/devlyn:resolve/references/state-schema.md +9 -4
  217. package/package.json +47 -2
  218. package/scripts/lint-fixtures.sh +349 -0
  219. package/scripts/lint-shadow-fixtures.sh +58 -0
  220. package/scripts/lint-skills.sh +3645 -95
@@ -1,6 +1,6 @@
  ---
  name: devlyn:resolve
- description: Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Free-form goal or formal spec input. Plan → Implement → Build-gate → Cleanup → Verify (fresh subagent, findings-only). Mechanical-first verification; pair-mode is gated in Verify. Use when the user says "resolve this", "fix this", "implement this", "refactor this", "debug this", "review this PR", or wants hands-off completion.
+ description: Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Free-form goal or formal spec input. Plan → Implement → Build-gate → Cleanup → Verify (fresh subagent, findings-only). Mechanical-first verification; pair-mode is conditional-default in Verify. Use when the user says "resolve this", "fix this", "implement this", "refactor this", "debug this", "review this PR", or wants hands-off completion.
  ---
 
  Orchestrator for the 2-skill harness pipeline. One subagent per phase; file-based handoff via `.devlyn/pipeline.state.json`. VERIFY spawns a fresh-context subagent so independence is structural — not advisory.
@@ -17,7 +17,7 @@ Long-horizon agentic work; context auto-compacts. State lives in `.devlyn/pipeli
  Hands-free. Measured by how far we get without human intervention.
 
  1. Do not prompt the user mid-pipeline. When tempted to ask, pick the safe default, proceed, and log it in the final report.
- 2. Codex availability: on `--engine auto`/`codex`, follow `_shared/engine-preflight.md`. On failure, silently fall back to Claude and log `engine downgraded: codex-unavailable` in the final report.
+ 2. Engine availability: follow `_shared/engine-preflight.md`. When a selected or conditionally-required engine is unavailable, fail closed with `BLOCKED:<engine>-unavailable` and setup guidance; do not convert a pair-required or explicitly requested engine into a solo run.
  3. Phases run in declared order. No extra phases.
  4. Orchestrator does not write code. It parses input, spawns phases, reads state, branches on verdicts, emits the report.
  5. Continue by default. Halt only on (a) unrecoverable subagent failure, (b) IMPLEMENT producing zero code changes, (c) BUILD_GATE or VERIFY fix-loop exhausting `max_rounds`.
@@ -32,7 +32,7 @@ Each phase routes to an engine and prepends the per-engine adapter header from `
 
  - Claude phases: spawn `Agent` (`mode: "bypassPermissions"`); prompt = adapter-header + canonical-body + task-context.
  - Codex phases: shell out via `bash _shared/codex-monitored.sh` with the same compounded prompt. The wrapper closes stdin and emits a heartbeat. No MCP.
- - Default engine: Claude. `--engine codex` routes IMPLEMENT to Codex; orchestration stays Claude. Pair-mode (only in VERIFY/JUDGE) selects a different engine for the fresh subagent than IMPLEMENT used.
+ - Default engine: Claude for PLAN / IMPLEMENT / BUILD_GATE / CLEANUP. `--engine codex` routes IMPLEMENT to Codex; orchestration stays Claude. Pair-mode is conditional-default in VERIFY/JUDGE and selects the OTHER engine for the fresh subagent when the trigger policy fires.
  - Multi-LLM evolution: when a new model adapter ships in `_shared/adapters/`, that engine becomes selectable via `--engine <model>` without further skill changes (NORTH-STAR.md "Multi-LLM evolution direction").
  </engine_routing>
 
@@ -40,7 +40,7 @@ Each phase routes to an engine and prepends the per-engine adapter header from `
  Three input shapes:
 
  1. **Free-form**: `/devlyn:resolve "fix the login bug"`. PHASE 0 runs the complexity classifier and either proceeds with an internal mini-spec (trivial), drafts focused questions for in-prompt resolution (medium), or escalates to `/devlyn:ideate` (large/ambiguous). No mid-pipeline prompts in any branch.
- 2. **Spec**: `/devlyn:resolve --spec docs/roadmap/phase-N/X.md`. Spec is read-only. Verification commands pre-staged from spec's `## Verification` block.
+ 2. **Spec**: `/devlyn:resolve --spec docs/roadmap/phase-N/X.md`. Spec is read-only. Stage verification commands from sibling `spec.expected.json`; if absent, use the legacy `## Verification` JSON block.
  3. **Verify-only**: `/devlyn:resolve --verify-only <diff-or-PR-ref> --spec <path>`. Skips PHASE 1-4. Runs PHASE 5 (VERIFY) on the supplied diff against the spec.
  </modes>
 
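Reviewer sketch (not part of the package): the added Spec-mode line above describes a two-step source selection for verification commands. A minimal Python illustration follows, assuming "sibling" means a file literally named `spec.expected.json` in the same directory as the spec; the function name `verification_source` and the returned labels are hypothetical, and the actual staging logic in `spec-verify-check.py` is not shown in this hunk.

```python
from pathlib import Path


def verification_source(spec_path):
    """Pick where verification commands would be staged from.

    Assumption: the sibling contract is a file named spec.expected.json
    located next to the spec. Returns a (kind, path) tuple.
    """
    spec = Path(spec_path)
    expected = spec.with_name("spec.expected.json")
    if expected.is_file():
        # Preferred carrier: the sibling expected-contract file.
        return ("expected", expected)
    # Fallback: the legacy inline `## Verification` JSON block in the spec.
    return ("legacy", spec)
```

The sketch only decides which carrier to read; parsing either carrier is left out because the diff does not show its format.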
@@ -59,20 +59,26 @@ Once `state.implement_passed_sha` is non-null (PHASE 2 returned and produced a d
  - `--spec <path>` — switches to spec mode.
  - `--verify-only <ref>` — switches to verify-only mode. Requires `--spec`.
  - `--pair-verify` — force pair-mode JUDGE in PHASE 5 even when not auto-triggered.
+ - `--no-pair` — disable conditional VERIFY pair-JUDGE for this run. Record `pair_trigger.skipped_reason: "user_no_pair"` whenever a trigger would otherwise fire.
  - `--risk-probes` — insert PHASE 1.5 cross-engine probe derivation. The OTHER engine converts visible `## Verification` bullets into bounded executable probes before IMPLEMENT; BUILD_GATE and VERIFY replay them mechanically.
+ - `--no-risk-probes` — disable automatic high-risk risk probes. Explicit `--risk-probes` wins over `--no-risk-probes`.
  - `--bypass <phase>[,...]` — skip specific phases. Valid: `build-gate`, `cleanup`. PLAN, IMPLEMENT, VERIFY are non-bypassable.
  - `--perf` — opt in to per-phase timing.
 
- 2. Engine pre-flight: follow `_shared/engine-preflight.md`. The downgrade banner surfaces in the final report.
+ 2. Engine pre-flight: follow `_shared/engine-preflight.md`. If a required engine is unavailable, halt with a BLOCKED verdict and setup instructions instead of downgrading.
 
- 3. Initialize `.devlyn/pipeline.state.json` per `references/state-schema.md`. Set `state.run_id`, `started_at`, `engine`, `base_ref.{branch, sha}`, `rounds.{max_rounds, global: 0}`, `bypasses`, empty `phases`, empty `criteria`.
+ `--pair-verify` and `--no-pair` are mutually exclusive; if both are present, stop with `BLOCKED:invalid-flags`.
+
+ 3. Initialize `.devlyn/pipeline.state.json` per `references/state-schema.md`. Set `state.run_id`, `started_at`, `engine`, `pair_verify: true` only when `--pair-verify` was passed and `false` otherwise, `base_ref.{branch, sha}`, `rounds.{max_rounds, global: 0}`, `bypasses`, empty `phases`, empty `criteria`, and `risk_profile: { high_risk: false, reasons: [], risk_probes_enabled: false, pair_default_enabled: true }`. `risk_profile` is strict typed state: keep it an object, keep the three flags as JSON booleans, and keep `reasons` as a string array; never serialize booleans as strings.
 
  4. **Mode-specific init**:
- - **Free-form**: read `references/free-form-mode.md`. Run the complexity classifier deterministically (rules over keyword density / file count / spec-shape signals). Set `state.complexity ∈ {trivial, medium, large}`. Trivial: write internal mini-spec to `.devlyn/criteria.generated.md` and proceed. Medium: synthesize a minimal spec from the goal + add 1-2 context anchors from the codebase, write to `.devlyn/criteria.generated.md`, proceed. Large: log `recommend: /devlyn:ideate first` in the final report and either halt (default) or proceed with assumed defaults if `--continue-on-large` flag set.
- - **Spec**: validate spec exists + `## Verification` block parses (run `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` to validate carrier shape). Compute `state.source.spec_sha256`. Stage `.devlyn/spec-verify.json` from the spec's verification block.
+ - **Free-form**: read `references/free-form-mode.md`. Run the complexity classifier deterministically (rules over keyword density / file count / spec-shape signals, plus pair-evidence intent). Set `state.complexity ∈ {trivial, medium, large}`. Trivial: write internal mini-spec to `.devlyn/criteria.generated.md` and proceed. Medium: synthesize a minimal spec from the goal + add 1-2 context anchors from the codebase, write to `.devlyn/criteria.generated.md`, proceed. Every free-form branch that writes criteria must set `state.source.type = "generated"`, `state.source.criteria_path = ".devlyn/criteria.generated.md"`, and `state.source.criteria_sha256` from the raw file bytes. Large: log `recommend: /devlyn:ideate first` in the final report and either halt (default) or proceed with assumed defaults if `--continue-on-large` flag set, except pair-evidence intent without an actionable solo-headroom hypothesis must halt with `BLOCKED:solo-headroom-hypothesis-required`, and unmeasured pair-candidate intent without solo ceiling avoidance must halt with `BLOCKED:solo-ceiling-avoidance-required`.
+ - **Spec**: validate spec exists. If sibling `spec.expected.json` exists, run `--check-expected <expected-path>` to validate both the expected contract, sibling spec `complexity` frontmatter, and any present actionable solo-headroom hypothesis; if the spec has a solo-headroom hypothesis, its observable command must match `spec.expected.json.verification_commands[].cmd`. Then stage `.devlyn/spec-verify.json` from `verification_commands`. Otherwise run `--check <spec-path>` to validate the legacy inline carrier plus supported `complexity` frontmatter and any present actionable solo-headroom hypothesis; if the spec uses an inline `## Verification` JSON carrier, any solo-headroom hypothesis command must match that carrier's `verification_commands[].cmd`. Then stage from the legacy inline carrier. Compute `state.source.spec_sha256`.
  - **Verify-only**: skip to PHASE 5 with `state.source.spec_path` set, the supplied diff captured at `.devlyn/external-diff.patch`.
 
- 5. Announce one line: `resolve starting run <run_id> engine <engine> mode <mode> complexity <complexity-or-na>`.
+ 5. Compute `state.risk_profile` from the user goal plus spec/criteria text. Mark `high_risk: true` when the work touches any of: auth/authz, permissions, security, token/session, payment/money/billing/invoice/pricing/tax/ledger, persistence/data mutation/deletion/migration, idempotency/replay/duplicate, API/webhook/raw-body/signature, allocation/scheduling/inventory/rollback/transaction, or explicit error-priority/output-shape contracts. If high-risk and `--no-risk-probes` is absent, set `risk_probes_enabled: true`; explicit `--risk-probes` also sets it true. If `--no-pair` is present, set `pair_default_enabled: false`. Add concise string reasons for the classification, but do not use reasons as substitutes for the boolean fields.
+
+ 6. Announce one line: `resolve starting — run <run_id> — engine <engine> — mode <mode> — complexity <complexity-or-na> — pair <conditional|disabled> — risk_probes <on|off>`.
 
  ## PHASE 1: PLAN
 
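Reviewer sketch (not part of the package): the flag rules and the `risk_profile` computation added in this hunk can be illustrated in Python. Everything below is an assumption made for illustration: the names `compute_risk_profile`, `is_strict_risk_profile`, `BlockedError`, and `HIGH_RISK_KEYWORDS` are hypothetical, and plain keyword matching stands in for whatever classification the skill really performs. The "explicit error-priority/output-shape contracts" category is left out because it does not reduce to keywords.

```python
import re

# Keyword groups copied from the high-risk list in step 5.
# Matching whole lowercase words is an assumption of this sketch.
HIGH_RISK_KEYWORDS = {
    "auth": {"auth", "authz", "permission", "permissions", "security", "token", "session"},
    "money": {"payment", "money", "billing", "invoice", "pricing", "tax", "ledger"},
    "persistence": {"persistence", "mutation", "deletion", "migration"},
    "replay": {"idempotency", "replay", "duplicate"},
    "api": {"api", "webhook", "raw-body", "signature"},
    "allocation": {"allocation", "scheduling", "inventory", "rollback", "transaction"},
}


class BlockedError(Exception):
    """Raised when the run must stop with a BLOCKED verdict."""


def compute_risk_profile(text, flags):
    flags = set(flags)
    # --pair-verify and --no-pair are mutually exclusive.
    if {"--pair-verify", "--no-pair"} <= flags:
        raise BlockedError("BLOCKED:invalid-flags")
    words = set(re.findall(r"[a-z][a-z-]*", text.lower()))
    reasons = sorted(name for name, kws in HIGH_RISK_KEYWORDS.items() if words & kws)
    high_risk = bool(reasons)
    # Explicit --risk-probes wins over --no-risk-probes.
    probes = "--risk-probes" in flags or (high_risk and "--no-risk-probes" not in flags)
    return {
        "high_risk": high_risk,
        "reasons": reasons,
        "risk_probes_enabled": probes,
        "pair_default_enabled": "--no-pair" not in flags,
    }


def is_strict_risk_profile(profile):
    """Check the strict typing rule: three real booleans plus a string array."""
    bool_keys = ("high_risk", "risk_probes_enabled", "pair_default_enabled")
    return (
        isinstance(profile, dict)
        and all(type(profile.get(k)) is bool for k in bool_keys)
        and isinstance(profile.get("reasons"), list)
        and all(isinstance(r, str) for r in profile["reasons"])
    )
```

Keeping `reasons` separate from the three booleans mirrors the rule that reasons must not substitute for the boolean fields.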
@@ -90,8 +96,11 @@ After return:
 
  ## PHASE 1.5: RISK_PROBES
 
- Skip unless `--risk-probes` is set. This phase is findings-as-executable-checks,
- not a second plan and not debate.
+ Skip unless `--risk-probes` is set OR `state.risk_profile.risk_probes_enabled`
+ is true. This phase is findings-as-executable-checks, not a second plan and not
+ debate. If this phase is required and the OTHER engine is unavailable, halt with
+ `BLOCKED:codex-unavailable` or `BLOCKED:claude-unavailable` plus setup guidance;
+ do not silently continue without probes.
 
  Engine: OTHER engine from PHASE 2's selected IMPLEMENT engine. Prompt body:
  `references/phases/probe-derive.md`.
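Reviewer sketch (not part of the package): the new gate for PHASE 1.5 reads as a three-way decision, which the Python below illustrates. The function names and the returned strings `skip` and `run:<engine>` are hypothetical; only the `BLOCKED:<engine>-unavailable` verdicts come from the text. The sketch also assumes exactly two engines, `claude` and `codex`.

```python
def other_engine(implement_engine):
    # Assumption: only two engines exist, so OTHER is simply the opposite one.
    return "codex" if implement_engine == "claude" else "claude"


def risk_probes_decision(state, flags, implement_engine, available_engines):
    """Decide whether PHASE 1.5 is skipped, run, or blocked."""
    required = (
        "--risk-probes" in flags
        or state["risk_profile"]["risk_probes_enabled"] is True
    )
    if not required:
        return "skip"
    engine = other_engine(implement_engine)
    if engine not in available_engines:
        # Fail closed: never continue silently without probes.
        return f"BLOCKED:{engine}-unavailable"
    return f"run:{engine}"
```

The blocked branch is checked only after the phase is known to be required, so an unavailable engine does not block runs that never needed probes.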
@@ -108,14 +117,41 @@ a JSON object keyed by tag, with marker arrays as values; a top-level array or
108
117
  tag-only probe is malformed. `ordering_inversion` must include
109
118
  `input_order_would_choose_wrong_winner` and `asserts_processing_order_result`;
110
119
  `prior_consumption` must include `same_resource_consumed_first` and
111
- `later_entity_fails_or_reroutes`; `stdout_stderr_contract` and `shape_contract`
112
- do not require marker strings. Cart/pricing success probes should use
120
+ `later_entity_fails_or_reroutes`; `stdout_stderr_contract` must include
+ `asserts_named_stream_output`; `error_contract` must include
+ `asserts_error_payload_or_stderr` and `asserts_nonzero_or_exit_2`.
+ `http_error_contract` must include `asserts_http_error_status` and
+ `asserts_error_payload_body`.
+ `auth_signature_contract` must include `asserts_signature_over_exact_bytes` and
+ `asserts_tampered_or_missing_signature_rejected`; `idempotency_replay` must
+ include `first_delivery_then_duplicate` and
+ `duplicate_id_rejected_regardless_of_body`; `concurrent_state_consistency` must
+ include `overlapping_mutations_exercised`,
+ `all_successful_responses_reflected`, and `distinct_identifiers_asserted`;
+ `atomic_batch_state` must include `mixed_valid_invalid_batch`,
+ `asserts_store_unchanged_after_failure`, and
+ `asserts_success_order_and_distinct_ids`.
+ When visible text names exact keys, fields, row shapes, JSON objects, response
+ bodies, stdout/stderr objects, or exact error bodies, `shape_contract` must
+ include `uses_visible_input_key_names`, `asserts_visible_output_key_names`, and
+ `asserts_no_unexpected_output_keys`; exact JSON error objects/bodies must also
+ include `asserts_exact_error_object`. Cart/pricing success probes should use
  `shape_contract` unless they satisfy the `ordering_inversion` markers. The probe
  command must not reference external network URLs; use only worktree-local or
  localhost resources.
  For high-complexity specs with multiple behavior bullets, at least one probe
  must be compound: it must exercise two or more visible verification bullets in a
  single command. Empty output is invalid when `--risk-probes` is set.
+ When the visible spec includes a solo-headroom hypothesis, the first probe must
+ exercise that hypothesis with the visible command/input shape and full
+ observable assertion; its `cmd` must contain the hypothesis's backticked
+ observable command, and its `derived_from` must reference the hypothesis bullet,
+ so deterministic validation can prove the probe targets the stated expected
+ `solo_claude` miss. Otherwise the probe set is too weak for pair-evidence work.
+ The same actionable solo-headroom hypothesis is a VERIFY pair-trigger reason,
+ so a candidate spec that explicitly predicts a `solo_claude` miss cannot finish
+ on solo VERIFY alone unless `--no-pair` was explicitly set or an earlier
+ verdict-binding blocker already decides the run.
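The per-tag marker requirements above lend themselves to a table-driven check. The sketch below is an assumption-laden illustration, not the validator the package ships: the function name `missing_markers` and the argument name `markers_by_tag` are hypothetical, and the conditional `shape_contract` rules are left out because they depend on what the visible spec text names.

```python
# Required markers per tag, copied from the rules stated above.
REQUIRED_MARKERS = {
    "ordering_inversion": [
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    ],
    "prior_consumption": [
        "same_resource_consumed_first",
        "later_entity_fails_or_reroutes",
    ],
    "stdout_stderr_contract": ["asserts_named_stream_output"],
    "error_contract": [
        "asserts_error_payload_or_stderr",
        "asserts_nonzero_or_exit_2",
    ],
    "http_error_contract": [
        "asserts_http_error_status",
        "asserts_error_payload_body",
    ],
    "auth_signature_contract": [
        "asserts_signature_over_exact_bytes",
        "asserts_tampered_or_missing_signature_rejected",
    ],
    "idempotency_replay": [
        "first_delivery_then_duplicate",
        "duplicate_id_rejected_regardless_of_body",
    ],
    "concurrent_state_consistency": [
        "overlapping_mutations_exercised",
        "all_successful_responses_reflected",
        "distinct_identifiers_asserted",
    ],
    "atomic_batch_state": [
        "mixed_valid_invalid_batch",
        "asserts_store_unchanged_after_failure",
        "asserts_success_order_and_distinct_ids",
    ],
}


def missing_markers(markers_by_tag) -> dict:
    """Map each tag to the required markers it lacks; raise if malformed."""
    if not isinstance(markers_by_tag, dict):
        # A top-level array is malformed per the rules above.
        raise ValueError("expected a JSON object keyed by tag")
    problems = {}
    for tag, markers in markers_by_tag.items():
        if not isinstance(markers, list):
            # A tag without a marker array is treated as malformed.
            raise ValueError(f"tag {tag!r} must map to a marker array")
        missing = [m for m in REQUIRED_MARKERS.get(tag, []) if m not in markers]
        if missing:
            problems[tag] = missing
    return problems
```

An empty result means every tagged probe carries its required markers; any non-empty result would presumably be reported as a malformed probe.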
 
  State write: `phases.probe_derive.{started_at, verdict, completed_at, duration_ms, artifacts}`.
 
@@ -123,7 +159,9 @@ Invocation contract when OTHER engine is Codex:
 
  - Invoke Codex only through the monitored wrapper path in `CODEX_MONITORED_PATH`,
  or `.claude/skills/_shared/codex-monitored.sh` when the env var is absent:
- `bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
+ `CODEX_MONITORED_ISOLATED=1 bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
+ Isolation keeps user config, AGENTS.md, pyx-memory, hooks, and project rules
+ from adding hidden context, tool calls, or transcript side effects.
  - Do not run `codex`, `codex exec`, `/Users/.../codex`, or a plugin-provided
  Codex binary directly. A raw Codex child can outlive the phase and makes the
  benchmark run invalid even if `.devlyn/risk-probes.jsonl` is written.
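For an orchestrator written in Python, the isolated invocation above might be assembled as in this sketch. The helper name `build_probe_invocation` is hypothetical; the wrapper path, environment variable names, and flags are the ones quoted in the contract, and nothing here is taken from the wrapper's own source.

```python
DEFAULT_WRAPPER = ".claude/skills/_shared/codex-monitored.sh"


def build_probe_invocation(prompt: str, environ: dict, cwd: str):
    """Return (argv, env) for the monitored, isolated Codex probe call."""
    # Fall back to the shared wrapper when CODEX_MONITORED_PATH is absent.
    wrapper = environ.get("CODEX_MONITORED_PATH") or DEFAULT_WRAPPER
    env = dict(environ)  # copy so the caller's environment is not mutated
    env["CODEX_MONITORED_ISOLATED"] = "1"
    argv = [
        "bash", wrapper,
        "-C", cwd,
        "--full-auto",
        "-c", "model_reasoning_effort=high",
        prompt,
    ]
    return argv, env
```

A caller could then pass the pair to `subprocess.run(argv, env=env)` with `os.environ` as the starting environment. Passing an argument list rather than a shell string keeps the probe prompt from being re-parsed by a shell.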
@@ -159,8 +197,8 @@ Skip in verify-only mode OR when `build-gate` in `state.bypasses`. Deterministic
  Spawn Claude `Agent` (`mode: "bypassPermissions"`) with prompt body `references/phases/build-gate.md`. The agent:
  1. Detects language/framework via project files (`package.json`, `pyproject.toml`, etc.).
  2. Runs language-specific gates (tsc / lint / test).
- 3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` (verification_commands literal-match plus `.devlyn/risk-probes.jsonl` when present).
- 4. If `spec.expected.json.browser_flows` declared OR diff touches web-surface files: invokes the browser runner (Chrome MCP → Playwright → curl tier as available).
+ 3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` (verification_commands literal-match plus `.devlyn/risk-probes.jsonl` when present). If `state.risk_profile.risk_probes_enabled == true`, the script requires `.devlyn/risk-probes.jsonl`; a missing file is a CRITICAL mechanical blocker, not a silent solo run.
+ 4. If diff touches web-surface files: run the browser tier with the repo's available toolchain (for example Playwright or curl).
  5. Emits `.devlyn/build_gate.findings.jsonl` + `.devlyn/build_gate.log.md`.
 
  State write: `phases.build_gate.{started_at, verdict, completed_at, duration_ms, artifacts}`.
@@ -192,25 +230,39 @@ Independent quality layer. **Spawned with empty conversation context** — no ca
 
  Two sub-phases:
 
- 1. **MECHANICAL** (deterministic): re-run `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code (independent of BUILD_GATE's earlier run). Re-scan `spec.expected.json.forbidden_patterns` against the diff. Re-check `required_files` and `forbidden_files`. Emit `.devlyn/verify-mechanical.findings.jsonl`.
+ 1. **MECHANICAL** (deterministic): re-run `SPEC_VERIFY_PHASE=verify_mechanical SPEC_VERIFY_FINDINGS_FILE=verify-mechanical.findings.jsonl SPEC_VERIFY_FINDING_PREFIX=VERIFY-MECH python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code (independent of BUILD_GATE's earlier run). If `state.risk_profile.risk_probes_enabled == true`, missing `.devlyn/risk-probes.jsonl` is a CRITICAL mechanical blocker. This emits `.devlyn/verify-mechanical.findings.jsonl` for `verify-merge-findings.py`.
 
- 2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Split each Requirement into binding clauses and trace code-order counterexamples; a passing verifier proves only the case it exercises, not neighboring `once` / `regardless` / `duplicate` / auth-order / rollback invariants. Respect scope qualifiers such as `inside a warehouse`, `per resource`, `for this line`, and `after validation`; do not widen a scoped clause into a global invariant, and compose multiple ordering rules in the stated order. For stateful flows, explicitly trace failed-operation rollback and the next entity's state before hunting broader edge cases. For high-complexity specs, construct at least one interaction counterexample that combines ordering/priority with failure handling and state mutation, then execute at least one such scenario through the repo's existing CLI/API/test runner without leaving tracked files behind; one-axis examples and pure mental tracing are insufficient. Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) is eligible only when MECHANICAL has no HIGH/CRITICAL findings; deterministic blockers already decide the verdict and route to the fix loop. Pair-mode fires when eligible and:
+ 2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Split each Requirement into binding clauses and trace code-order counterexamples; a passing verifier proves only the case it exercises, not neighboring `once` / `regardless` / `duplicate` / auth-order / rollback invariants. Respect scope qualifiers such as `inside a warehouse`, `per resource`, `for this line`, and `after validation`; do not widen a scoped clause into a global invariant, and compose multiple ordering rules in the stated order. For stateful flows, explicitly trace failed-operation rollback and the next entity's state before hunting broader edge cases. For high-complexity specs, construct at least one interaction counterexample that combines ordering/priority with failure handling and state mutation, then execute at least one such scenario through the repo's existing CLI/API/test runner without leaving tracked files behind; one-axis examples and pure mental tracing are insufficient. Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) is eligible only after MECHANICAL and the primary JUDGE have no verdict-binding findings; deterministic blockers and primary JUDGE blockers already decide the verdict and route to the fix loop. Pair-mode fires when eligible and:
  - `--pair-verify` flag set, OR
- - spec frontmatter has `complexity: high`, OR `state.complexity` is `"high"` or `"large"`, OR
- - MECHANICAL emits findings flagged `severity: warning` (not disqualifier — those route to fix loop directly), OR
+ - `state.mode == "verify-only"`, OR
+ - `state.risk_profile.high_risk == true`, OR
+ - `.devlyn/risk-probes.jsonl` exists or `state.risk_profile.risk_probes_enabled == true`, OR
+ - spec frontmatter has `complexity: high` (legacy/external spec `complexity: large` is accepted for compatibility; new specs use `high`), OR current free-form `state.complexity` is `"large"` (legacy `"high"` state is accepted only for archived runs), OR
+ - MECHANICAL or the primary JUDGE emits findings flagged `severity: warning` (not verdict-binding — those route to fix loop directly), OR
  - `state.verify.coverage_failed == true` (judge could not exercise a required spec axis from available evidence).
 
- Before spawning JUDGE, compute `pair_trigger = { eligible, reasons[] }` and write it into `state.phases.verify`. If `eligible == true` and `reasons` is non-empty, you MUST spawn the second OTHER-engine judge. Skipping that second judge is a VERIFY contract violation, not a discretion call.
+ After MECHANICAL and the primary JUDGE finish, compute `pair_trigger = { eligible, reasons[], skipped_reason }`, write it into `state.phases.verify`, and then spawn the second OTHER-engine judge when eligible. If `eligible == true`, `reasons` must be non-empty, include every applicable canonical reason, and every reason must be one of these canonical values: `mode.verify-only`, `mode.pair-verify`, `complexity.high`, `complexity.large`, `spec.complexity.high`, `spec.complexity.large`, `spec.solo_headroom_hypothesis`, `risk.high`, `risk_probes.enabled`, `risk_probes.present`, `coverage.failed`, `mechanical.warning`, or `judge.warning`; `skipped_reason` must be null; and you MUST spawn the second OTHER-engine judge. If `eligible == false`, `reasons` must be empty and `skipped_reason` must be a string or null. Contradictory, incomplete, or unknown trigger state is a VERIFY contract violation, not advisory metadata; `verify-merge-findings.py` blocks malformed trigger state. Pair reasons derive `risk.high` and `risk_probes.enabled` from `state.risk_profile`; malformed `risk_profile` is also a VERIFY contract violation because it can hide a required pair decision.
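The structural part of the `pair_trigger` contract can be checked mechanically. The following is a minimal sketch, assuming the trigger is available as a plain dict; the function name `pair_trigger_errors` is hypothetical, and the "include every applicable canonical reason" requirement is not covered because it needs the surrounding run state.

```python
# Canonical reason values as listed in the paragraph above.
CANONICAL_REASONS = frozenset({
    "mode.verify-only", "mode.pair-verify",
    "complexity.high", "complexity.large",
    "spec.complexity.high", "spec.complexity.large",
    "spec.solo_headroom_hypothesis",
    "risk.high", "risk_probes.enabled", "risk_probes.present",
    "coverage.failed", "mechanical.warning", "judge.warning",
})


def pair_trigger_errors(trigger: dict) -> list:
    """Return contract violations for a pair_trigger dict (empty if valid)."""
    eligible = trigger.get("eligible")
    reasons = trigger.get("reasons")
    skipped = trigger.get("skipped_reason")
    if not isinstance(eligible, bool):
        return ["eligible must be a boolean"]
    if not isinstance(reasons, list):
        return ["reasons must be a list"]
    errors = []
    if eligible:
        if not reasons:
            errors.append("eligible trigger needs at least one reason")
        unknown = [
            r for r in reasons
            if not isinstance(r, str) or r not in CANONICAL_REASONS
        ]
        if unknown:
            errors.append(f"unknown reasons: {unknown}")
        if skipped is not None:
            errors.append("skipped_reason must be null when eligible")
    else:
        if reasons:
            errors.append("ineligible trigger must have empty reasons")
        if skipped is not None and not isinstance(skipped, str):
            errors.append("skipped_reason must be a string or null")
    return errors
```

Returning a list of violations rather than a boolean makes it easier to surface every contradiction in one pass, which fits the rule that malformed trigger state blocks VERIFY instead of being advisory.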
 
  The `--engine` flag never suppresses this rule. Explicit `--engine claude`
  means "Claude is the primary judge"; it does not mean "do not run Codex as the
  second pair judge." The only valid skip reasons after a non-empty eligible
- trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or Codex
- unavailability proven by the invocation layer.
+ trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or an explicit
+ `--no-pair`. Engine unavailability is a `BLOCKED:<engine>-unavailable` verdict,
+ not a skip reason.
+
+ Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. If the spec includes a solo-headroom hypothesis, one of those targeted probes must exercise that hypothesis with the visible command/input shape and full externally visible result, using the hypothesis's backticked observable command as its command anchor before adding bounded input variations. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. When the spec names exact keys, row shapes, JSON object shape, or an exact error body, pair-JUDGE must compare parsed key sets/deep equality so aliased keys, missing keys, and extra keys are verdict-binding failures, and it must construct inputs with the spec's visible key names. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. 
For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL or the primary JUDGE has a verdict-binding finding, skip the second judge and record `pair_judge: null`; the fix loop needs the blocker, not duplicate review.
+
+ If pair-mode is triggered and the OTHER engine is unavailable, do not downgrade
+ or skip the required judge. Set VERIFY to `BLOCKED:<engine>-unavailable`, preserve the
+ failed availability check evidence, and print setup guidance:
 
- Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. 
When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
+ - Codex: install/configure the Codex CLI, run `codex auth` or the current login
+ flow, verify `codex --version`, then rerun. Use `--no-pair` only when the user
+ intentionally accepts solo VERIFY for this run.
+ - Claude: install/configure Claude Code, run `claude --version` when available,
+ confirm the host can spawn Claude agents, then rerun.
 
- Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` directly with `-c model_reasoning_effort=medium` for this bounded two-probe review, without piping to `tail`/`head`/`grep`, capture stdout/stderr by direct tool capture or file redirection, require JSONL findings on stdout, and have the orchestrator write `.devlyn/verify.pair.findings.jsonl`. If stdout is first captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; that script is the deterministic boundary writer for `.devlyn/verify.pair.findings.jsonl`. Raw stdout remains diagnostic only: if stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
+ Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` with `CODEX_MONITORED_ISOLATED=1` and `-c model_reasoning_effort=medium`, no `tail`/`head`/`grep` pipes, direct stdout/stderr capture, JSONL findings on stdout, and orchestrator-written `.devlyn/verify.pair.findings.jsonl`. Isolation blocks user config, AGENTS.md, pyx-memory, hooks, and project rules from hidden context/tool/transcript side effects. If stdout is captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; raw stdout is diagnostic only. If stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
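The merge rule quoted above can be illustrated with a small sketch. It is not the logic of `verify-merge-findings.py`, which remains the only writer of the merged artifacts; the finding fields `severity`, `confidence`, and `behavioral_regression` are hypothetical names chosen for the example, and only the `PASS` / `NEEDS_WORK` outcomes are modelled.

```python
def is_verdict_binding(finding: dict) -> bool:
    """Apply the stated rule: HIGH/CRITICAL always bind; MEDIUM binds only
    when high-confidence and tied to a concrete behavioral regression."""
    severity = str(finding.get("severity", "")).upper()
    if severity in {"HIGH", "CRITICAL"}:
        return True
    return (
        severity == "MEDIUM"
        and finding.get("confidence") == "high"           # hypothetical field
        and bool(finding.get("behavioral_regression"))    # hypothetical field
    )


def merged_verdict(primary_findings, pair_findings=None) -> str:
    """Merge both judges' findings; pair_findings is None when skipped."""
    findings = list(primary_findings) + list(pair_findings or [])
    if any(is_verdict_binding(f) for f in findings):
        return "NEEDS_WORK"
    return "PASS"
```

Because the rule is a union over both judges, disagreement on lower-severity advisory findings never changes the result, which matches the statement that such disagreement is logged only.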
 
  Branch:
  - `PASS` → PHASE 6.
@@ -223,7 +275,7 @@ State write: `phases.final_report.started_at` at the top of this phase.
 
  1. **Terminal verdict** — derive from `state.phases.{plan, implement, build_gate, cleanup, verify}.verdict` per the precedence rules in `references/state-schema.md#terminal-verdict`. Verify-only mode short-circuits to `state.phases.verify.verdict`.
 
- 2. **Render report** — sections: header (run_id, engine, mode, verdict, wall-time), per-phase summary, findings table (verify findings only — post-IMPLEMENT phases are findings-only), follow-up notes (any `--continue-on-large` assumptions, any silent fallbacks).
+ 2. **Render report** — sections: header (run_id, engine, mode, verdict, wall-time), per-phase summary, pair/risk-probe status, findings table (verify findings only — post-IMPLEMENT phases are findings-only), follow-up notes (any `--continue-on-large` assumptions, any `--no-pair` / `--no-risk-probes` opt-out, any engine setup guidance after BLOCKED, `/devlyn:ideate` guidance after `BLOCKED:solo-headroom-hypothesis-required` that asks for the visible behavior `solo_claude` is expected to miss, and `/devlyn:ideate` guidance after `BLOCKED:solo-ceiling-avoidance-required` that asks for the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`).
 
  3. State write: `phases.final_report.{verdict, completed_at, duration_ms}` BEFORE archive runs (archive prune logic skips runs whose `final_report.verdict` is null).
 
@@ -13,6 +13,14 @@ Compute these signals from the goal text + project state:
  3. **verb_class** — primary verb of the goal: `fix | add | refactor | debug | review | rewrite | migrate | ...`.
  4. **codebase_size** — `git ls-files | wc -l`. Coarse buckets: `<50` / `<500` / `≥500`.
  5. **has_failing_test** — does the goal mention a specific failing test or include a stack trace?
+ 6. **pair_evidence_intent** — does the goal ask for benchmark evidence, pair-evidence, risk-probe measurement, solo<pair proof, or solo-headroom work?
+ 7. **has_actionable_solo_headroom** — does the goal itself include the actionable contract: literal `solo-headroom hypothesis`, `solo_claude`, `miss`, and a backticked observable command line that itself contains `miss` and is framed as the command/observable that exposes it?
+ 8. **unmeasured_pair_candidate_intent** — does the goal ask to add, create,
+ promote, or run a new unmeasured benchmark, shadow fixture, golden fixture,
+ risk-probe, or pair-evidence candidate?
+ 9. **has_solo_ceiling_avoidance** — does the goal itself include the literal
+ phrase `solo ceiling avoidance`, mention `solo_claude`, and name a concrete
+ difference from rejected or solo-saturated controls such as `S2`-`S6`?
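Signal 7 is described in terms of literal strings, so a deterministic check is plausible. The sketch below is one possible reading, with loud assumptions: a backticked span counts as a "command line" only if it contains whitespace, and the requirement that the command be "framed as" the exposing observable is not checked at all.

```python
import re

BACKTICKED = re.compile(r"`([^`\n]+)`")
REQUIRED_LITERALS = ("solo-headroom hypothesis", "solo_claude", "miss")


def has_actionable_solo_headroom(goal: str) -> bool:
    """Heuristic check for the literal contract described in signal 7."""
    if not all(lit in goal for lit in REQUIRED_LITERALS):
        return False
    # Assumption: a command line has at least one space, which excludes
    # bare identifiers such as `solo_claude` or `miss` in backticks.
    commands = [
        span for span in BACKTICKED.findall(goal)
        if "miss" in span and " " in span.strip()
    ]
    return bool(commands)
```

Keeping the check literal rather than semantic is in line with the document's stated aim of avoiding drift to LLM judgment in this classifier.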
 
  ### Trivial branch
 
@@ -46,10 +54,14 @@ Conditions (any one):
  - `file_scope_signals > 10` OR zero signals (vague enough that the classifier cannot pick scope).
  - `verb_class ∈ {rewrite, migrate}` and scope is multi-subsystem.
  - The goal mentions a new feature whose surface area requires design decisions the harness cannot make from a one-shot prompt.
+ - `pair_evidence_intent == true` and `has_actionable_solo_headroom == false`.
+ - `unmeasured_pair_candidate_intent == true` and `has_solo_ceiling_avoidance == false`.
 
  Action: log `recommend: /devlyn:ideate first` in `.devlyn/criteria.generated.md` plus the final report. Two policies:
  - Default: halt with terminal verdict `BLOCKED:large-needs-ideation`.
  - `--continue-on-large` flag: synthesize a best-effort spec from the goal with explicit "assumptions made" block; proceed to PHASE 1; the final report flags every assumption for user review.
+ - Exception: if the large classification came from pair-evidence intent without an actionable solo-headroom hypothesis, halt with `BLOCKED:solo-headroom-hypothesis-required` even when `--continue-on-large` is set. Do not invent a hypothesis; recommend `/devlyn:ideate` so the user can supply the visible behavior `solo_claude` is expected to miss.
+ - Exception: if the large classification came from unmeasured pair-candidate intent without solo ceiling avoidance, halt with `BLOCKED:solo-ceiling-avoidance-required` even when `--continue-on-large` is set. Do not invent the note; recommend `/devlyn:ideate` so the user can supply the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`.
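The two policies and their exceptions could be expressed as a small decision function. This is a sketch under assumptions: `large_goal_verdict` and the `is_large` key (standing in for the other large-branch conditions) are hypothetical, and the relative order of the two exceptions is a guess because the text does not rank them.

```python
from typing import Optional


def large_goal_verdict(signals: dict, continue_on_large: bool) -> Optional[str]:
    """Return a terminal BLOCKED verdict, or None to proceed to PHASE 1."""
    # The two exceptions halt even when --continue-on-large is set.
    if signals.get("pair_evidence_intent") and not signals.get(
        "has_actionable_solo_headroom"
    ):
        return "BLOCKED:solo-headroom-hypothesis-required"
    if signals.get("unmeasured_pair_candidate_intent") and not signals.get(
        "has_solo_ceiling_avoidance"
    ):
        return "BLOCKED:solo-ceiling-avoidance-required"
    # "is_large" is a hypothetical stand-in for the remaining conditions.
    if signals.get("is_large") and not continue_on_large:
        return "BLOCKED:large-needs-ideation"
    return None
```

Checking the exceptions before the flag makes the key property explicit: `--continue-on-large` can waive the generic ideation halt but never the two evidence-specific ones.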
 
  ## Anti-pattern: drift to LLM judgment
 
@@ -63,6 +75,9 @@ The internal mini-spec written for trivial / medium / `--continue-on-large` path
 
  - `## Requirements` non-empty, each bullet testable (CLI command, test command, observable file change).
  - `## Verification` non-empty if the goal implies any runnable acceptance check. Empty Verification is allowed only when all Requirements are pure-design (e.g. "follow existing pattern X").
+ - If a free-form goal includes pair-evidence intent and already includes an actionable solo-headroom hypothesis, preserve that literal hypothesis in `.devlyn/criteria.generated.md` unchanged enough for VERIFY to detect `solo-headroom hypothesis`, `solo_claude`, `miss`, and the backticked observable command line that itself contains `miss`, emit the canonical `spec.solo_headroom_hypothesis` pair trigger reason, and satisfy regenerated-evidence checks such as `benchmark audit --require-hypothesis-trigger`.
+ - If a free-form goal includes unmeasured pair-candidate intent and already includes solo ceiling avoidance, preserve that literal note in `.devlyn/criteria.generated.md` unchanged enough for reviewers to see `solo ceiling avoidance`, `solo_claude`, and the concrete difference from rejected or solo-saturated controls such as `S2`-`S6`.
  - Free-form mode mini-specs are written to `.devlyn/criteria.generated.md` (not to a roadmap path) — this is run-scoped artifact, not a documented spec.
+ - After writing `.devlyn/criteria.generated.md`, set `state.source.type = "generated"`, `state.source.spec_path = null`, `state.source.spec_sha256 = null`, `state.source.criteria_path = ".devlyn/criteria.generated.md"`, and `state.source.criteria_sha256` to the raw-byte SHA-256 of the generated criteria file. Downstream PLAN/IMPLEMENT/VERIFY phases and `spec-verify-check.py --include-risk-probes` depend on this pointer; do not rely on the file existing by convention.
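The source pointer in the last bullet can be produced with the standard library. This is a minimal sketch: the helper name `generated_source_pointer` is hypothetical, and the lowercase hex encoding of the digest is an assumption, since the text only says "raw-byte SHA-256".

```python
import hashlib
from pathlib import Path


def generated_source_pointer(
    criteria_path: str = ".devlyn/criteria.generated.md",
) -> dict:
    """Build the state.source fields for a generated criteria file."""
    # Hash the file's raw bytes, with no newline or encoding normalisation.
    digest = hashlib.sha256(Path(criteria_path).read_bytes()).hexdigest()
    return {
        "type": "generated",
        "spec_path": None,
        "spec_sha256": None,
        "criteria_path": criteria_path,
        "criteria_sha256": digest,  # assumption: lowercase hex string
    }
```

Reading bytes rather than text matters here: decoding and re-encoding could change line endings and silently alter the hash that downstream phases compare against.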
 
  PLAN reads the mini-spec the same way it reads a real spec. The downstream pipeline cannot tell the difference.
@@ -22,8 +22,8 @@ Run in this order; each emits findings into `.devlyn/build_gate.findings.jsonl`:
  1. **Type check** (TypeScript / mypy / etc.). Each error → one finding, severity `HIGH`, rule `correctness.type-check`.
  2. **Lint** (eslint / ruff / clippy / etc.). Each error → finding, severity `MEDIUM`, rule `quality.lint`. Warnings stay LOW unless the spec elevates them.
  3. **Test suite** (npm test / pytest / go test / cargo test). Each failing test → finding, severity `HIGH`, rule `correctness.test-failure`. Include the failing test's file:line and the assertion.
- 4. **Spec literal verification + risk probes**: `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes`. The script reads `.devlyn/spec-verify.json` (pre-staged from spec or self-staged from `state.source.spec_path`) and appends `.devlyn/risk-probes.jsonl` when present. Each verification command mismatch → finding `correctness.spec-literal-mismatch`, severity `CRITICAL`. Each risk-probe mismatch → finding `correctness.risk-probe-failed`, severity `CRITICAL`. Missing/malformed carrier on a generated source → finding `correctness.spec-verify-malformed`, severity `CRITICAL`.
- 5. **Browser** (only when `spec.expected.json.browser_flows` declared OR diff touches `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `page.*`, `layout.*`, `route.*`, `*.css`, `*.html`): start dev server, run declared flows via Chrome MCP if available, falling back to Playwright, falling back to curl. Each failed flow → finding, severity `HIGH`, rule `correctness.browser-flow-failed`.
+ 4. **Spec literal verification + risk probes**: `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes`. The script self-stages from sibling `spec.expected.json` next to `state.source.spec_path`, or the legacy inline carrier when the sibling is absent; benchmark-prestaged `.devlyn/spec-verify.json` still wins. It appends `.devlyn/risk-probes.jsonl` when present, and requires that file when `state.risk_profile.risk_probes_enabled == true`. Malformed `state.risk_profile` is also CRITICAL because it can hide enabled risk probes. Command or risk-probe mismatch → CRITICAL finding. Missing required risk probes, missing/malformed generated carrier, or malformed sibling expected file → `correctness.spec-verify-malformed` CRITICAL.
+ 5. **Browser** (only when diff touches `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `page.*`, `layout.*`, `route.*`, `*.css`, `*.html`): start the dev server and run the repo's existing browser checks, or a minimal curl/HTML check when no browser test harness exists. Each failed check → finding, severity `HIGH`, rule `correctness.browser-flow-failed`.
 
  Append all findings; do not stop on the first failure.
  </gates>
@@ -33,7 +33,7 @@ Read `_shared/runtime-principles.md`. Codex-routed phases receive the inlined ex
 
  - Subtractive-first: every accretion-shaped change is visible in the commit message or a flagged finding. Net-deletion is the default; pure-addition needs a citation.
  - Goal-locked: implement only the listed Requirements. Adjacent code that "looks fixable" is drift unless the spec or plan listed it.
- - No-workaround: no `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scripts that bypass root cause. The only documented exception is the Codex CLI availability downgrade.
+ - No-workaround: no `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scripts that bypass root cause. Required unavailable engines stop with `BLOCKED:<engine>-unavailable`; they do not downgrade.
  - Evidence: every claim cites file:line you opened. Hallucinated APIs are excluded.
  </runtime_principles>
 
@@ -25,7 +25,23 @@ Read the visible `## Verification` section. Emit 1 to 3 executable probes
  that cover the highest-risk bullets whose failure would change observable
  behavior. Prefer bullets that combine ordering/priority, rollback/state
  mutation, idempotency, auth/error priority, stdout/stderr, or exact output
- shape.
+ shape. Treat CLI/process errors and HTTP error responses as different contracts:
+ CLI errors must prove exit/stderr behavior, while HTTP errors must prove the
+ status code and response body. When the visible verification text names concurrent or near-concurrent
+ mutations, the probe must overlap the operations and assert the complete
+ externally-visible state, not just that every request returned a success code.
+ When a batch/import operation must be all-or-nothing, the probe must exercise a
+ mixed valid/invalid batch and prove the externally-visible state is unchanged
+ after the failure.
+
+ If the visible spec includes a solo-headroom hypothesis, the first probe must
+ target that hypothesis: use the visible command/input shape it names, exercise
+ the behavior the spec says `solo_claude` is expected to miss, and assert the
+ full observable result. The emitted probe `cmd` must contain the hypothesis's
+ backticked observable command so `.devlyn/risk-probes.jsonl` can be validated
+ mechanically, and `derived_from` must be an exact substring of that hypothesis
+ bullet. Do not replace the hypothesis with a neighboring easier edge case, and
+ do not cite hidden or benchmark-only verifier files.
 
  For high-complexity specs with two or more behavior bullets, at least one probe
  must be compound: one command must exercise two or more visible verification
@@ -101,7 +117,9 @@ Rules:
  - `tags` is required. Use only these shape tags:
  `ordering_inversion`, `boundary_overlap`, `prior_consumption`,
  `rollback_state`, `positive_remaining`, `stdout_stderr_contract`,
- `error_contract`, `shape_contract`.
+ `error_contract`, `http_error_contract`, `auth_signature_contract`,
+ `idempotency_replay`, `concurrent_state_consistency`,
+ `atomic_batch_state`, `shape_contract`.
  - `tag_evidence` is required and must be a JSON object keyed by tag, never a
  top-level array. For these tags, include every listed evidence marker in the
  tag's array and make the command actually exercise it:
@@ -119,6 +137,26 @@ Rules:
  `later_entity_uses_released_state`.
  - `positive_remaining`: `asserts_full_remaining_state`,
  `zero_quantity_rows_absent`.
+ - `stdout_stderr_contract`: `asserts_named_stream_output`.
+ - `error_contract`: `asserts_error_payload_or_stderr`,
+ `asserts_nonzero_or_exit_2`.
+ - `http_error_contract`: `asserts_http_error_status`,
+ `asserts_error_payload_body`.
+ - `auth_signature_contract`: `asserts_signature_over_exact_bytes`,
+ `asserts_tampered_or_missing_signature_rejected`.
+ - `idempotency_replay`: `first_delivery_then_duplicate`,
+ `duplicate_id_rejected_regardless_of_body`.
+ - `concurrent_state_consistency`: `overlapping_mutations_exercised`,
+ `all_successful_responses_reflected`, `distinct_identifiers_asserted`.
+ - `atomic_batch_state`: `mixed_valid_invalid_batch`,
+ `asserts_store_unchanged_after_failure`,
+ `asserts_success_order_and_distinct_ids`.
+ - `shape_contract` when the visible text names exact keys, fields, row
+ shapes, JSON objects, response bodies, stdout/stderr objects, or exact error
+ bodies: `uses_visible_input_key_names`,
+ `asserts_visible_output_key_names`, `asserts_no_unexpected_output_keys`.
+ If it names an exact JSON error object/body, also include
+ `asserts_exact_error_object`.
  Tags not listed here may use an empty evidence list or be omitted from
  `tag_evidence`.
  - `cmd` must not reference `BENCH_FIXTURE_DIR`, `verifiers/`, benchmark fixture
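
The `tag_evidence` rules added in this hunk lend themselves to a mechanical check. A minimal sketch, assuming each probe is a dict parsed from one line of `.devlyn/risk-probes.jsonl`; `missing_evidence` is a hypothetical helper, and only the unconditional tags whose markers are visible in this hunk are listed (the conditional `shape_contract` markers are left out):

```python
# Evidence markers copied from the added lines in the hunk above.
REQUIRED_EVIDENCE = {
    "positive_remaining": [
        "asserts_full_remaining_state", "zero_quantity_rows_absent"],
    "stdout_stderr_contract": ["asserts_named_stream_output"],
    "error_contract": [
        "asserts_error_payload_or_stderr", "asserts_nonzero_or_exit_2"],
    "http_error_contract": [
        "asserts_http_error_status", "asserts_error_payload_body"],
    "auth_signature_contract": [
        "asserts_signature_over_exact_bytes",
        "asserts_tampered_or_missing_signature_rejected"],
    "idempotency_replay": [
        "first_delivery_then_duplicate",
        "duplicate_id_rejected_regardless_of_body"],
    "concurrent_state_consistency": [
        "overlapping_mutations_exercised",
        "all_successful_responses_reflected",
        "distinct_identifiers_asserted"],
    "atomic_batch_state": [
        "mixed_valid_invalid_batch",
        "asserts_store_unchanged_after_failure",
        "asserts_success_order_and_distinct_ids"],
}


def missing_evidence(probe):
    """Return {tag: [missing markers]} for one probe record."""
    evidence = probe.get("tag_evidence")
    if not isinstance(evidence, dict):
        # The rules require a JSON object keyed by tag, never a top-level array.
        raise ValueError("tag_evidence must be a JSON object keyed by tag")
    gaps = {}
    for tag in probe.get("tags", []):
        required = REQUIRED_EVIDENCE.get(tag, [])
        present = evidence.get(tag, [])
        missing = [marker for marker in required if marker not in present]
        if missing:
            gaps[tag] = missing
    return gaps
```

Tags without an entry in the table pass with an empty or absent evidence list, matching the rule that unlisted tags may omit `tag_evidence`.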
@@ -128,6 +166,10 @@ Rules:
  - Match the spec's visible input and output key names literally; do not invent
  aliases such as `stock` for `lots`, `order_id` for `id`, or `warehouse_id`
  for `warehouse`.
+ - When a verification bullet names exact keys, fields, row shapes, JSON object
+ shape, or an exact error body, the probe must use `shape_contract` and assert
+ exact key sets with parsed JSON/deep equality. A substring check is too weak:
+ the command must fail on aliased keys, missing keys, and extra keys.
  - For cart/pricing specs whose visible verification covers duplicate combining,
  multiple line-promotion types, tax, coupon, and shipping, the compound success
  probe must include interleaved duplicate SKUs plus taxable and non-taxable
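
The added `shape_contract` rule in this hunk contrasts substring checks with parsed deep equality. A minimal sketch of the difference, using the key names `id` and `warehouse` from the alias examples in the rule text; the values and the helper name `exact_match` are illustrative assumptions:

```python
import json


def exact_match(stdout_text, expected):
    """True only when stdout parses as JSON and equals `expected` exactly."""
    try:
        actual = json.loads(stdout_text)
    except json.JSONDecodeError:
        return False
    # Dict equality fails on aliased, missing, and extra keys alike.
    return actual == expected


expected = {"id": 7, "warehouse": "W1"}
aliased = '{"order_id": 7, "warehouse": "W1"}'
extra = '{"id": 7, "warehouse": "W1", "debug": true}'
missing = '{"id": 7}'
```

A substring test such as `'"id": 7' in extra` passes even though the output carries an unexpected key, which is the weakness the rule calls out. Note that Python's `==` treats `1`, `1.0`, and `True` as equal, so a probe that also cares about value types needs an additional type check.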
@@ -148,6 +190,10 @@ Rules:
  full test suite.
  - Coverage over cleverness: mirror the verification bullet literally before
  inventing an edge case.
+ - If the spec includes a solo-headroom hypothesis and the emitted probes do not
+ exercise the stated `solo_claude` miss with a `cmd` containing the hypothesis's
+ backticked observable command and `derived_from` pointing at the hypothesis
+ bullet, the artifact is too weak for pair-evidence work.
  - If a probe passes while an implementation processes entities in input order
  instead of the required priority/order, or emits extra zero-value state rows,
  the probe is too weak.
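
The solo-headroom rule added in this hunk has two mechanical parts: the probe `cmd` must contain the hypothesis's backticked command, and `derived_from` must be an exact substring of the hypothesis bullet. A minimal sketch of that test, assuming any backticked span in the bullet counts as a candidate command (the real validator may be stricter); `probe_targets_hypothesis` is a hypothetical name:

```python
import re


def backticked_spans(hypothesis_bullet):
    """Return every `...` span found in the hypothesis bullet."""
    return re.findall(r"`([^`]+)`", hypothesis_bullet)


def probe_targets_hypothesis(probe, hypothesis_bullet):
    """True when the probe is anchored to the hypothesis as the rule requires."""
    cmd = probe.get("cmd", "")
    derived_from = probe.get("derived_from", "")
    commands = backticked_spans(hypothesis_bullet)
    cmd_anchored = any(command in cmd for command in commands)
    # Exact-substring test: a paraphrase of the bullet does not qualify.
    cites_bullet = bool(derived_from) and derived_from in hypothesis_bullet
    return cmd_anchored and cites_bullet
```

A probe that swaps in a neighboring command, or that paraphrases the bullet in `derived_from`, fails this test even if its assertions are otherwise sound.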
@@ -175,6 +221,32 @@ Rules:
  - If `remaining` state appears in the visible contract, at least one probe must
  carry `positive_remaining` and assert that zero-quantity/zero-value rows are
  absent unless the visible spec explicitly requires them.
+ - If webhook signatures, raw-body signatures, HMAC, or `X-Signature` appear in
+ the visible contract, at least one probe must carry
+ `auth_signature_contract` and prove exact-byte signature verification plus a
+ tampered or missing signature rejection.
+ - If replay, duplicate delivery, same id, already-seen ids, or idempotency
+ appear in the visible contract, at least one probe must carry
+ `idempotency_replay` and cover first delivery followed by duplicate rejection,
+ including the case where the duplicate body would otherwise fail validation
+ when the spec says duplicate wins.
+ - If an HTTP status error such as `400`, `401`, `409`, or `422` appears with a
+ JSON error body, error object, or named error field, at least one probe must
+ carry `http_error_contract` and assert both the exact status code and parsed
+ response body. Do not use CLI `error_contract` for these HTTP-only checks
+ unless the visible text also names process exit or stderr behavior.
+ - If concurrent, close-together, simultaneous, parallel, race, lost update, or
+ many-at-once mutation semantics appear in the visible contract, at least one
+ probe must carry `concurrent_state_consistency`. It must trigger overlapping
+ mutations, then compare every successful response against the final
+ externally-visible state and assert the identifiers are distinct. Do not use
+ this tag for ordinary batch success cases that only require distinct ids.
+ - If a batch/import contract says one valid plus one invalid item fails while
+ the later state remains the same as before, or otherwise says all-or-nothing /
+ no partial updates / 0 inserts on failure, at least one probe must carry
+ `atomic_batch_state`. It must execute a mixed valid/invalid batch, assert the
+ store/list is unchanged after failure, and include an all-valid success case
+ proving order and distinct ids.
  </quality_bar>
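
The `atomic_batch_state` requirement in the quality bar above (mixed valid/invalid batch, store unchanged after failure, plus an all-valid success case proving order and distinct ids) can be illustrated against a toy in-memory store. This is a sketch only: `import_batch`, the `name` field, and the `ValueError` failure signal are all hypothetical stand-ins for whatever a real spec names.

```python
import copy


def import_batch(store, items):
    """Toy all-or-nothing import: validate every item, then insert in order."""
    staged = []
    for item in items:
        name = item.get("name")
        if not isinstance(name, str) or not name:
            raise ValueError("invalid item")  # nothing has been inserted yet
        staged.append(name)
    next_id = len(store) + 1
    for offset, name in enumerate(staged):
        store.append({"id": next_id + offset, "name": name})


def probe_atomic_batch(import_fn):
    """Run the success case, then the mixed batch, asserting full store state."""
    store = []
    import_fn(store, [{"name": "a"}, {"name": "b"}])
    assert [row["name"] for row in store] == ["a", "b"], "order not preserved"
    assert len({row["id"] for row in store}) == 2, "ids not distinct"

    before = copy.deepcopy(store)
    try:
        import_fn(store, [{"name": "c"}, {"name": ""}])
    except ValueError:
        pass
    else:
        raise AssertionError("mixed valid/invalid batch must fail")
    assert store == before, "store changed after failed batch"
    return True
```

The important design point is the `before` snapshot: an implementation that inserts the valid item before rejecting the invalid one still raises the expected error, so only the full-state comparison exposes it.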
 
  <runtime_principles>
@@ -10,7 +10,7 @@ Independent quality layer. You answer one question: did the diff deliver what th
  - `spec.md` (or `.devlyn/criteria.generated.md` for free-form mode) — the contract.
  - `spec.expected.json` — the mechanical acceptance contract per `_shared/expected.schema.json`.
  - The cumulative diff against `state.base_ref.sha`.
- - The spec hash (`state.source.spec_sha256`) — re-read the spec from disk and confirm the hash matches; if it does not, write `state.phases.verify.verdict: "BLOCKED"` with reason `spec_sha256_mismatch` and stop.
+ - The source hash (`state.source.spec_sha256` for spec mode, `state.source.criteria_sha256` for generated free-form mode) — re-read the source contract from disk and confirm the hash matches; if it does not, write `state.phases.verify.verdict: "BLOCKED"` with reason `source_sha256_mismatch` and stop.
 
  You do NOT receive: PLAN, IMPLEMENT's reasoning, BUILD_GATE's findings, CLEANUP's allowlist negotiations. Reading those would compromise independence.
  </input>
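
The source-hash check introduced in this hunk can be sketched as follows. Assumptions, none of which the diff confirms: the stored value is a lowercase hex SHA-256 over the file's raw bytes, the mismatch reason is stored under a `reason` key next to `verdict`, and `check_source_hash` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def check_source_hash(state, source_path):
    """Re-read the source contract and compare it with the recorded hash.

    Returns True on a match. On mismatch, records a BLOCKED verdict in
    `state` and returns False so the caller can stop.
    """
    source = state["source"]
    # Spec mode records spec_sha256; generated free-form mode records criteria_sha256.
    expected = source.get("spec_sha256") or source.get("criteria_sha256")
    actual = hashlib.sha256(Path(source_path).read_bytes()).hexdigest()
    if actual != expected:
        verify = state.setdefault("phases", {}).setdefault("verify", {})
        verify["verdict"] = "BLOCKED"
        verify["reason"] = "source_sha256_mismatch"  # assumed field name
        return False
    return True
```

A missing hash is treated as a mismatch here, since `expected` is then `None` and can never equal a digest.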
@@ -21,10 +21,7 @@ You do NOT receive: PLAN, IMPLEMENT's reasoning, BUILD_GATE's findings, CLEANUP'
 
  Re-run the mechanical checks fresh, independent of BUILD_GATE's earlier run:
 
- 1. `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code.
- 2. Re-scan `spec.expected.json.forbidden_patterns` against the diff (Python re.search; honor each pattern's `files` allowlist).
- 3. Confirm `required_files` exist post-diff; confirm `forbidden_files` do not appear in the diff.
- 4. Confirm `max_deps_added` is not exceeded (`git diff -- package.json` for Node; equivalent for other ecosystems).
+ 1. `SPEC_VERIFY_PHASE=verify_mechanical SPEC_VERIFY_FINDINGS_FILE=verify-mechanical.findings.jsonl SPEC_VERIFY_FINDING_PREFIX=VERIFY-MECH python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code. In spec mode, sibling `spec.expected.json` wins; a malformed sibling is CRITICAL, not a fallback. When `state.risk_profile.risk_probes_enabled == true`, missing `.devlyn/risk-probes.jsonl` is also CRITICAL. The script also checks `forbidden_patterns`, `required_files`, `forbidden_files`, and `max_deps_added`.
 
  Emit findings to `.devlyn/verify-mechanical.findings.jsonl`. Each match = one finding. Severity from the pattern's `severity` field (disqualifier → CRITICAL, warning → MEDIUM).
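
The severity mapping stated above (disqualifier → CRITICAL, warning → MEDIUM) and the one-finding-per-match rule can be sketched as a small JSONL serializer. The record fields other than `severity` (`pattern`, `file`) are illustrative assumptions; the actual finding schema is not shown in this diff.

```python
import json

# Mapping taken from the sentence above.
SEVERITY_BY_PATTERN_LEVEL = {"disqualifier": "CRITICAL", "warning": "MEDIUM"}


def to_finding_line(pattern, file_path):
    """Serialize one forbidden-pattern match as a single JSONL line."""
    finding = {
        "severity": SEVERITY_BY_PATTERN_LEVEL[pattern["severity"]],
        "pattern": pattern.get("pattern", ""),  # hypothetical field
        "file": file_path,                      # hypothetical field
    }
    # One compact JSON object per line keeps the file valid JSONL.
    return json.dumps(finding, sort_keys=True)
```

An unknown pattern severity raises `KeyError` here rather than being silently mapped, on the assumption that a malformed contract should surface instead of being downgraded.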
 
@@ -87,20 +84,40 @@ design/style concerns remain non-binding MEDIUM and produce `PASS_WITH_ISSUES`.
 
  ### Pair-mode (when triggered by orchestrator)
 
- Pair-mode is eligible only after MECHANICAL has no HIGH/CRITICAL findings.
- Deterministic blockers already decide the verdict and route to the fix loop; a
- second judge there duplicates evidence and wastes wall-time. If MECHANICAL has
- a HIGH/CRITICAL finding, record `pair_judge: null` and do not spawn the second
- VERIFY agent.
+ Pair-mode is eligible only after MECHANICAL and the primary JUDGE have no
+ verdict-binding findings. Deterministic blockers and primary JUDGE blockers
+ already decide the verdict and route to the fix loop; a second judge there
+ duplicates evidence and wastes wall-time. If MECHANICAL or the primary JUDGE
+ has a verdict-binding finding, record `pair_judge: null` and do not spawn the
+ second VERIFY agent.
 
  When eligible, trigger pair-mode if any of these are true:
- - `--pair-verify` was set.
- - The spec frontmatter has `complexity: high`.
- - `state.complexity` is `"high"` or `"large"`.
- - MECHANICAL emitted warning-level findings but no HIGH/CRITICAL blockers.
+ - `state.pair_verify == true` (`--pair-verify` was set).
+ - `state.mode == "verify-only"`.
+ - The spec frontmatter has `complexity: high`; legacy/external spec
+ `complexity: large` is accepted for compatibility, but new specs use `high`.
+ - Current free-form `state.complexity` is `"large"`; legacy `"high"` state remains accepted by the merge validator only for archived run compatibility.
+ - `state.risk_profile.high_risk == true`.
+ - `.devlyn/risk-probes.jsonl` exists or `state.risk_profile.risk_probes_enabled == true`.
+ - The spec includes an actionable solo-headroom hypothesis.
+ - MECHANICAL or the primary JUDGE emitted warning-level findings but no
+ verdict-binding blockers.
  - `state.verify.coverage_failed == true`.
 
- Before JUDGE spawn, compute and persist:
+ Malformed `state.risk_profile` is a VERIFY contract violation: it must be an
+ object, `high_risk` / `risk_probes_enabled` / `pair_default_enabled` must be
+ JSON booleans when present, and `reasons` must be a string array. Do not treat
+ missing or malformed risk state as low-risk; `verify-merge-findings.py` blocks
+ it because it can hide `risk.high` or `risk_probes.enabled` pair triggers.
+
+ If `--no-pair` was set, do not spawn the OTHER-engine judge. Record
+ `pair_trigger: { eligible: false, reasons: [], skipped_reason: "user_no_pair" }`
+ and continue with solo VERIFY. This is an explicit user opt-out, not an engine
+ availability fallback. `--pair-verify` and `--no-pair` are mutually exclusive;
+ if both are present, stop with `BLOCKED:invalid-flags`.
+
+ After MECHANICAL and the primary JUDGE finish, compute and persist this before
+ spawning the OTHER-engine pair judge:
 
  ```json
  "pair_trigger": {
@@ -110,16 +127,39 @@ Before JUDGE spawn, compute and persist:
  }
  ```
 
- If `eligible == true` and `reasons` is non-empty, the OTHER-engine judge is
- mandatory. Skipping it is a VERIFY contract violation. If ineligible, record the
- reason, e.g. `"mechanical_blocker"`.
+ If `eligible == true`, `reasons` must be non-empty and include every applicable canonical reason; for example, a spec with an actionable solo-headroom
+ hypothesis must include `spec.solo_headroom_hypothesis` even when another reason
+ such as `risk.high` also applies. The OTHER-engine judge is mandatory. Skipping
+ it is a VERIFY contract violation. If ineligible, record the
+ reason, e.g. `"mechanical_blocker"` or `"primary_judge_blocker"`.
+
+ `pair_trigger` is a strict contract, not advisory metadata. `eligible: true`
+ requires a non-empty `reasons` list and `skipped_reason: null`; `eligible: false`
+ requires an empty `reasons` list and a string/null `skipped_reason`. Do not emit
+ contradictory states such as `eligible: true` with `skipped_reason`, or
+ `eligible: false` with trigger reasons. `verify-merge-findings.py` blocks VERIFY
+ on malformed trigger state. Eligible triggers must contain only canonical
+ reasons and at least one reason: `mode.verify-only`, `complexity.high`, `complexity.large`,
+ `mode.pair-verify`, `spec.complexity.high`, `spec.complexity.large`,
+ `spec.solo_headroom_hypothesis`, `risk.high`, `risk_probes.enabled`,
+ `risk_probes.present`, `coverage.failed`, `mechanical.warning`, or
+ `judge.warning`.
 
  The `--engine` flag never disables this rule. Explicit `--engine claude` means
  Claude is the primary judge; if pair-mode triggers, Codex is still the mandatory
  OTHER-engine judge. Do not record "explicit --engine claude" as a skip reason.
  The only valid skip reasons after a non-empty eligible trigger are deterministic
- MECHANICAL HIGH/CRITICAL blockers or Codex unavailability proven by the
- invocation layer.
+ MECHANICAL HIGH/CRITICAL blockers or an explicit `--no-pair`. Engine
+ unavailability is not a skip reason; it is `BLOCKED:<engine>-unavailable`.
+
+ Before invoking the OTHER-engine judge, run the shared availability pre-flight
+ for that engine. If Codex is required and unavailable, set VERIFY to
+ `BLOCKED:codex-unavailable` and tell the user to install/configure the Codex CLI,
+ run the current Codex auth/login flow, verify `codex --version`, and rerun. If
+ Claude is required and the host cannot spawn a Claude agent, set VERIFY to
+ `BLOCKED:claude-unavailable` and tell the user to install/configure Claude Code,
+ verify `claude --version` where available, and rerun. Do not convert this to a
+ solo pass, and do not synthesize pair findings.
 
  When eligible and the orchestrator spawns a second VERIFY agent with the OTHER engine's adapter, both judgments are merged:
  - Any HIGH/CRITICAL finding either model surfaces is verdict-binding.
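
The strict `pair_trigger` contract added in this hunk can be expressed as a small validator. A minimal sketch of the stated rules; the diff says `verify-merge-findings.py` enforces them, but that script is not shown here, so `pair_trigger_errors` and its messages are hypothetical:

```python
# Canonical reasons copied from the added contract text in the hunk above.
CANONICAL_REASONS = {
    "mode.verify-only", "complexity.high", "complexity.large",
    "mode.pair-verify", "spec.complexity.high", "spec.complexity.large",
    "spec.solo_headroom_hypothesis", "risk.high", "risk_probes.enabled",
    "risk_probes.present", "coverage.failed", "mechanical.warning",
    "judge.warning",
}


def pair_trigger_errors(trigger):
    """Return a list of contract violations; an empty list means well-formed."""
    if not isinstance(trigger, dict):
        return ["pair_trigger must be an object"]
    errors = []
    eligible = trigger.get("eligible")
    reasons = trigger.get("reasons")
    skipped = trigger.get("skipped_reason")
    if not isinstance(eligible, bool):
        errors.append("eligible must be a boolean")
    if not isinstance(reasons, list) or not all(isinstance(r, str) for r in reasons):
        errors.append("reasons must be a list of strings")
        return errors
    if eligible is True:
        if not reasons:
            errors.append("eligible trigger needs at least one reason")
        unknown = sorted(set(reasons) - CANONICAL_REASONS)
        if unknown:
            errors.append("non-canonical reasons: " + ", ".join(unknown))
        if skipped is not None:
            errors.append("eligible trigger must have skipped_reason null")
    elif eligible is False:
        if reasons:
            errors.append("ineligible trigger must have empty reasons")
        if skipped is not None and not isinstance(skipped, str):
            errors.append("skipped_reason must be a string or null")
    return errors
```

For example, the `--no-pair` record from the text, `{"eligible": false, "reasons": [], "skipped_reason": "user_no_pair"}`, validates cleanly, while an eligible trigger that lists `"explicit --engine claude"` as a reason is rejected as non-canonical.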
@@ -143,12 +183,22 @@ When eligible and the orchestrator spawns a second VERIFY agent with the OTHER e
  after the first verdict-binding finding and emit JSONL. If both probes pass
  and static scope/dependency checks show no blocker, emit PASS; do not continue
  exhaustive exploration.
+ If the spec includes a solo-headroom hypothesis, one of the two targeted
+ probes must exercise that hypothesis with the visible command/input shape and
+ compare the full externally visible result. The probe must use the
+ hypothesis's backticked observable command as its command anchor before adding
+ bounded input variations. Do not substitute a neighboring easier edge case;
+ the pair judge exists to test the stated expected solo miss.
  A targeted probe must compare the full externally visible result
  (stdout/stderr/exit and full parsed output object, including accepted/scheduled
  rows, rejected rows, and remaining state when present), not just a single
- property. For priority/stateful specs, at least one probe must include an
- earlier input entity that would succeed under input-order processing, a later
- higher-priority entity that consumes or blocks the critical resource, and a
+ property. When the spec names exact keys, row shapes, JSON object shape, or an
+ exact error body, compare parsed key sets/deep equality so aliased keys,
+ missing keys, and extra keys are verdict-binding failures. Use the spec's
+ visible input key names literally when constructing the probe input. For
+ priority/stateful specs, at least one probe must include an earlier input
+ entity that would succeed under input-order processing, a later higher-priority
+ entity that consumes or blocks the critical resource, and a
  failure/blocked/rollback edge that determines a later entity's state. This is
  the minimum compound shape for priority + failure/state-mutation bugs.
  Scope qualifiers are binding for the pair judge too: do not reinterpret
@@ -164,12 +214,13 @@ When eligible and the orchestrator spawns a second VERIFY agent with the OTHER e
  (or scheduled) and rejected rows.
 
  Codex pair-JUDGE is read-only. Invoke `codex-monitored.sh` directly with
- `-c model_reasoning_effort=medium`; this phase is a bounded two-probe review,
- not an unbounded implementation task. Do not pipe it to `tail`, `head`, `grep`,
- `sed`, or `awk`. Capture stdout/stderr by direct tool capture or file
- redirection. The Codex judge must return JSONL
- findings on stdout; the orchestrator writes `.devlyn/verify.pair.findings.jsonl`
- and merges verdicts. Do not ask Codex to `apply_patch` or edit `.devlyn`.
+ `CODEX_MONITORED_ISOLATED=1` and `-c model_reasoning_effort=medium`; this is a
+ bounded two-probe review, not implementation. Isolation blocks user config,
+ AGENTS.md, pyx-memory, hooks, and project rules from hidden context/tool
+ side effects. Do not pipe it to `tail`, `head`, `grep`, `sed`, or `awk`.
+ Capture stdout/stderr directly. The Codex judge must return JSONL findings on
+ stdout; the orchestrator writes `.devlyn/verify.pair.findings.jsonl` and merges
+ verdicts. Do not ask Codex to `apply_patch` or edit `.devlyn`.
  The Codex prompt must include a bounded-output contract: no harness-doc reads,
  maximum two targeted probes before first output, stop on the first
  verdict-binding finding, and emit PASS immediately after the bounded checks pass.
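
Because the pair judge returns JSONL findings on stdout and any HIGH/CRITICAL finding from either judge is verdict-binding, the orchestrator-side handling can be sketched as a strict line parser plus a merge test. This assumes each finding is a JSON object with a `severity` field; the full finding schema and the way a PASS is encoded are not shown in this diff, and both function names are hypothetical:

```python
import json

BINDING_SEVERITIES = {"HIGH", "CRITICAL"}


def parse_findings(stdout_text):
    """Parse captured stdout as JSONL, rejecting any malformed line."""
    findings = []
    for number, line in enumerate(stdout_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number} is not valid JSON") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {number} is not a JSON object")
        findings.append(record)
    return findings


def has_binding_finding(*finding_lists):
    """True when any judge surfaced a HIGH or CRITICAL finding."""
    return any(
        finding.get("severity") in BINDING_SEVERITIES
        for findings in finding_lists
        for finding in findings
    )
```

Parsing the complete captured stdout, rather than a filtered slice of it, is consistent with the instruction above not to pipe the judge through `tail`, `head`, `grep`, `sed`, or `awk`.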