devlyn-cli 2.1.0 → 2.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (127)
  1. package/CLAUDE.md +1 -1
  2. package/benchmark/auto-resolve/README.md +321 -2
  3. package/benchmark/auto-resolve/RUBRIC.md +6 -0
  4. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +0 -1
  5. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md +63 -0
  6. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/expected.json +60 -0
  7. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/metadata.json +10 -0
  8. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/setup.sh +17 -0
  9. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/spec.md +51 -0
  10. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/task.txt +9 -0
  11. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/verifiers/invalid.js +29 -0
  12. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/verifiers/parallel.js +50 -0
  13. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md +70 -0
  14. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/expected.json +52 -0
  15. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/metadata.json +10 -0
  16. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/setup.sh +171 -0
  17. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/spec.md +50 -0
  18. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/task.txt +9 -0
  19. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md +83 -0
  20. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/expected.json +74 -0
  21. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/metadata.json +10 -0
  22. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/setup.sh +251 -0
  23. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/spec.md +57 -0
  24. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/task.txt +13 -0
  25. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/verifiers/replay-malformed-body.js +64 -0
  26. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md +98 -0
  27. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/expected.json +46 -0
  28. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/metadata.json +10 -0
  29. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/setup.sh +336 -0
  30. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/spec.md +51 -0
  31. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/task.txt +9 -0
  32. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md +26 -0
  33. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/expected.json +64 -0
  34. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/metadata.json +10 -0
  35. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/setup.sh +32 -0
  36. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +57 -0
  37. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/task.txt +7 -0
  38. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/verifiers/exact-success.js +54 -0
  39. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/verifiers/no-hardcoded-pricing.js +47 -0
  40. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/verifiers/stock-error.js +45 -0
  41. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +0 -1
  42. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md +27 -0
  43. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/expected.json +62 -0
  44. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/metadata.json +10 -0
  45. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/setup.sh +2 -0
  46. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +61 -0
  47. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/task.txt +7 -0
  48. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/verifiers/error-order.js +55 -0
  49. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/verifiers/priority-blocked.js +48 -0
  50. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md +27 -0
  51. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/expected.json +56 -0
  52. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/metadata.json +10 -0
  53. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/setup.sh +2 -0
  54. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/spec.md +64 -0
  55. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/task.txt +7 -0
  56. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/verifiers/conflicting-duplicate.js +34 -0
  57. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/verifiers/idempotent-close.js +41 -0
  58. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md +27 -0
  59. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/expected.json +56 -0
  60. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/metadata.json +10 -0
  61. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/setup.sh +2 -0
  62. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +70 -0
  63. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/task.txt +7 -0
  64. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/verifiers/priority-rollback.js +64 -0
  65. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/verifiers/single-warehouse-fefo.js +66 -0
  66. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md +28 -0
  67. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/expected.json +66 -0
  68. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/metadata.json +10 -0
  69. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/setup.sh +36 -0
  70. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +64 -0
  71. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/task.txt +7 -0
  72. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/verifiers/catalog-source.js +57 -0
  73. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/verifiers/exact-success.js +63 -0
  74. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/verifiers/stock-error.js +34 -0
  75. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md +25 -0
  76. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/expected.json +68 -0
  77. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/metadata.json +10 -0
  78. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/setup.sh +17 -0
  79. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/spec.md +68 -0
  80. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/task.txt +7 -0
  81. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/verifiers/conflicting-duplicate.js +29 -0
  82. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/verifiers/exact-payout.js +58 -0
  83. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/verifiers/rules-source.js +56 -0
  84. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +0 -1
  85. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +0 -1
  86. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +0 -1
  87. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +0 -1
  88. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +0 -1
  89. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +0 -3
  90. package/benchmark/auto-resolve/fixtures/SCHEMA.md +13 -1
  91. package/benchmark/auto-resolve/scripts/collect-swebench-predictions.py +98 -0
  92. package/benchmark/auto-resolve/scripts/fetch-swebench-instances.py +111 -0
  93. package/benchmark/auto-resolve/scripts/frozen-verify-gate.py +289 -0
  94. package/benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py +250 -0
  95. package/benchmark/auto-resolve/scripts/headroom-gate.py +147 -0
  96. package/benchmark/auto-resolve/scripts/judge.sh +82 -3
  97. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +0 -11
  98. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +0 -10
  99. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py +244 -0
  100. package/benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py +118 -0
  101. package/benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py +192 -0
  102. package/benchmark/auto-resolve/scripts/run-fixture.sh +257 -43
  103. package/benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh +511 -0
  104. package/benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh +162 -0
  105. package/benchmark/auto-resolve/scripts/run-headroom-candidate.sh +93 -0
  106. package/benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh +209 -0
  107. package/benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh +239 -0
  108. package/benchmark/auto-resolve/scripts/swebench-frozen-matrix.py +265 -0
  109. package/benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh +192 -0
  110. package/benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh +131 -0
  111. package/benchmark/auto-resolve/scripts/test-headroom-gate.sh +84 -0
  112. package/benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh +302 -0
  113. package/config/skills/_shared/archive_run.py +3 -0
  114. package/config/skills/_shared/codex-config.md +2 -2
  115. package/config/skills/_shared/codex-monitored.sh +72 -7
  116. package/config/skills/_shared/collect-codex-findings.py +125 -0
  117. package/config/skills/_shared/engine-preflight.md +1 -1
  118. package/config/skills/_shared/expected.schema.json +18 -0
  119. package/config/skills/_shared/spec-verify-check.py +363 -10
  120. package/config/skills/_shared/verify-merge-findings.py +327 -0
  121. package/config/skills/devlyn:resolve/SKILL.md +69 -8
  122. package/config/skills/devlyn:resolve/references/phases/build-gate.md +1 -1
  123. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +183 -0
  124. package/config/skills/devlyn:resolve/references/phases/verify.md +156 -4
  125. package/config/skills/devlyn:resolve/references/state-schema.md +10 -4
  126. package/package.json +1 -1
  127. package/scripts/lint-skills.sh +69 -20
@@ -0,0 +1,327 @@
1
+ #!/usr/bin/env python3
2
+ """Merge VERIFY findings and derive a deterministic verdict.
3
+
4
+ VERIFY judges are model-written, but routing on finding severity must be
5
+ mechanical. This script reads the known VERIFY JSONL finding files, writes a
6
+ merged JSONL artifact, computes source-level and overall verdicts, and can
7
+ write the merged verdict back to `.devlyn/pipeline.state.json`.
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import argparse
13
+ import json
14
+ import pathlib
15
+ import sys
16
+ import tempfile
17
+ from typing import Any
18
+
19
+
20
+ SOURCE_FILES = (
21
+ ("mechanical", "verify-mechanical.findings.jsonl"),
22
+ ("judge", "verify.findings.jsonl"),
23
+ ("pair_judge", "verify.pair.findings.jsonl"),
24
+ ("pair_judge", "verify.pair-judge.findings.jsonl"),
25
+ )
26
+
27
+ VERDICT_RANK = {
28
+ "PASS": 0,
29
+ "PASS_WITH_ISSUES": 1,
30
+ "FAIL": 2,
31
+ "NEEDS_WORK": 2,
32
+ "BLOCKED": 3,
33
+ }
34
+ RANK_VERDICT = {0: "PASS", 1: "PASS_WITH_ISSUES", 2: "NEEDS_WORK", 3: "BLOCKED"}
35
+
36
+
37
+ def rank(verdict: str | None) -> int:
38
+ return VERDICT_RANK.get(verdict or "PASS", 0)
39
+
40
+
41
+ def worse(a: str | None, b: str | None) -> str:
42
+ return RANK_VERDICT[max(rank(a), rank(b))]
43
+
44
+
45
+ def finding_rank(finding: dict[str, Any]) -> int:
46
+ severity = str(finding.get("severity") or "").upper()
47
+ if severity in {"CRITICAL", "HIGH"}:
48
+ return 2
49
+ if severity == "MEDIUM" and finding.get("verdict_binding") is True:
50
+ return 2
51
+ if severity in {"LOW", "MEDIUM"}:
52
+ return 1
53
+ return 0
54
+
55
+
56
+ def read_findings(devlyn: pathlib.Path) -> tuple[list[dict[str, Any]], dict[str, str]]:
57
+ findings: list[dict[str, Any]] = []
58
+ source_verdicts = {source: "PASS" for source, _ in SOURCE_FILES}
59
+ for source, name in SOURCE_FILES:
60
+ path = devlyn / name
61
+ if not path.is_file():
62
+ continue
63
+ with path.open(encoding="utf-8") as handle:
64
+ for line_no, line in enumerate(handle, 1):
65
+ raw = line.strip()
66
+ if not raw:
67
+ continue
68
+ try:
69
+ item = json.loads(raw)
70
+ except json.JSONDecodeError as exc:
71
+ blocked = {
72
+ "id": f"verify-merge-invalid-json-{name}-{line_no}",
73
+ "rule_id": "verify.findings.invalid-json",
74
+ "severity": "CRITICAL",
75
+ "confidence": "high",
76
+ "file": name,
77
+ "line": line_no,
78
+ "message": f"Invalid JSONL finding: {exc}",
79
+ "criterion_ref": "verify-merge",
80
+ "source": source,
81
+ }
82
+ findings.append(blocked)
83
+ source_verdicts[source] = "BLOCKED"
84
+ continue
85
+ if not isinstance(item, dict):
86
+ continue
87
+ item = dict(item)
88
+ item.setdefault("source", source)
89
+ findings.append(item)
90
+ source_verdicts[source] = worse(
91
+ source_verdicts[source], RANK_VERDICT[finding_rank(item)]
92
+ )
93
+ findings.extend(detect_pair_stdout_contract_violations(devlyn, source_verdicts))
94
+ return findings, source_verdicts
95
+
96
+
97
+ def has_pair_findings(devlyn: pathlib.Path) -> bool:
98
+ for name in ("verify.pair.findings.jsonl", "verify.pair-judge.findings.jsonl"):
99
+ path = devlyn / name
100
+ if path.is_file() and path.read_text(encoding="utf-8").strip():
101
+ return True
102
+ return False
103
+
104
+
105
+ def pair_trigger_required(devlyn: pathlib.Path) -> bool:
106
+ state_path = devlyn / "pipeline.state.json"
107
+ if not state_path.is_file():
108
+ return False
109
+ try:
110
+ state = json.loads(state_path.read_text(encoding="utf-8"))
111
+ except json.JSONDecodeError:
112
+ return False
113
+ phases = state.get("phases") if isinstance(state, dict) else {}
114
+ verify_phase = phases.get("verify") if isinstance(phases, dict) else None
115
+ trigger = None
116
+ if isinstance(verify_phase, dict):
117
+ trigger = verify_phase.get("pair_trigger")
118
+ if trigger is None and isinstance(state, dict):
119
+ verify_state = state.get("verify")
120
+ if isinstance(verify_state, dict):
121
+ trigger = verify_state.get("pair_trigger")
122
+ return bool(
123
+ isinstance(trigger, dict)
124
+ and trigger.get("eligible") is True
125
+ and trigger.get("reasons")
126
+ )
127
+
128
+
129
+ def pair_blocker(id_: str, message: str, file_: str | None = None) -> dict[str, Any]:
130
+ return {
131
+ "id": id_,
132
+ "rule_id": "verify.pair.emission-contract",
133
+ "severity": "CRITICAL",
134
+ "confidence": "high",
135
+ "file": file_,
136
+ "line": 1 if file_ else None,
137
+ "message": message,
138
+ "criterion_ref": "verify.pair.findings",
139
+ "source": "pair_judge",
140
+ }
141
+
142
+
143
+ def detect_pair_stdout_contract_violations(
144
+ devlyn: pathlib.Path,
145
+ source_verdicts: dict[str, str],
146
+ ) -> list[dict[str, Any]]:
147
+ stdout_path = devlyn / "codex-judge.stdout"
148
+ if has_pair_findings(devlyn):
149
+ return []
150
+ if not stdout_path.is_file():
151
+ if pair_trigger_required(devlyn):
152
+ source_verdicts["pair_judge"] = "BLOCKED"
153
+ return [
154
+ pair_blocker(
155
+ "verify-pair-required-output-missing",
156
+ "Pair-mode was required, but Codex pair-JUDGE produced no stdout or canonical findings file.",
157
+ "codex-judge.stdout",
158
+ )
159
+ ]
160
+ return []
161
+ raw_text = stdout_path.read_text(encoding="utf-8")
162
+ if not raw_text.strip():
163
+ source_verdicts["pair_judge"] = "BLOCKED"
164
+ return [
165
+ pair_blocker(
166
+ "verify-pair-empty-output",
167
+ "Codex pair-JUDGE stdout was empty; the bounded contract requires a JSONL finding or PASS line.",
168
+ "codex-judge.stdout",
169
+ )
170
+ ]
171
+ has_jsonl_finding = False
172
+ has_nonpass_summary = False
173
+ for line in raw_text.splitlines():
174
+ raw = line.strip()
175
+ if not raw:
176
+ continue
177
+ if raw.startswith("# SUMMARY "):
178
+ try:
179
+ summary = json.loads(raw.removeprefix("# SUMMARY ").strip())
180
+ except json.JSONDecodeError:
181
+ continue
182
+ if summary.get("verdict") in {"NEEDS_WORK", "FAIL", "BLOCKED"}:
183
+ has_nonpass_summary = True
184
+ continue
185
+ if raw.startswith("#"):
186
+ continue
187
+ try:
188
+ item = json.loads(raw)
189
+ except json.JSONDecodeError:
190
+ continue
191
+ if isinstance(item, dict) and str(item.get("severity") or "").upper() in {
192
+ "CRITICAL",
193
+ "HIGH",
194
+ "MEDIUM",
195
+ "LOW",
196
+ }:
197
+ has_jsonl_finding = True
198
+ if not has_jsonl_finding and not has_nonpass_summary:
199
+ return []
200
+ source_verdicts["pair_judge"] = "BLOCKED"
201
+ return [
202
+ pair_blocker(
203
+ "verify-pair-emission-contract-violated",
204
+ (
205
+ "Codex pair-JUDGE stdout contained findings or a non-PASS summary, "
206
+ "but the canonical pair findings JSONL file was empty."
207
+ ),
208
+ "codex-judge.stdout",
209
+ )
210
+ ]
211
+
212
+
213
+ def write_outputs(
214
+ devlyn: pathlib.Path,
215
+ findings: list[dict[str, Any]],
216
+ source_verdicts: dict[str, str],
217
+ ) -> dict[str, Any]:
218
+ merged_path = devlyn / "verify-merged.findings.jsonl"
219
+ summary_path = devlyn / "verify-merge.summary.json"
220
+ with merged_path.open("w", encoding="utf-8") as handle:
221
+ for finding in findings:
222
+ handle.write(json.dumps(finding, sort_keys=True, separators=(",", ":")) + "\n")
223
+ verdict = "PASS"
224
+ for source_verdict in source_verdicts.values():
225
+ verdict = worse(verdict, source_verdict)
226
+ summary = {
227
+ "verdict": verdict,
228
+ "source_verdicts": source_verdicts,
229
+ "findings_count": len(findings),
230
+ "findings_file": str(merged_path),
231
+ }
232
+ summary_path.write_text(json.dumps(summary, indent=2, sort_keys=True) + "\n", encoding="utf-8")
233
+ return summary
234
+
235
+
236
+ def write_state(devlyn: pathlib.Path, summary: dict[str, Any]) -> None:
237
+ state_path = devlyn / "pipeline.state.json"
238
+ if not state_path.is_file():
239
+ raise SystemExit(f"error: {state_path} not found")
240
+ state = json.loads(state_path.read_text(encoding="utf-8"))
241
+ phases = state.setdefault("phases", {})
242
+ verify = phases.get("verify")
243
+ if not isinstance(verify, dict):
244
+ verify = {}
245
+ phases["verify"] = verify
246
+ verify["verdict"] = summary["verdict"]
247
+ sub = verify.setdefault("sub_verdicts", {})
248
+ for source, source_verdict in summary["source_verdicts"].items():
249
+ if source in {"mechanical", "judge", "pair_judge"}:
250
+ sub[source] = source_verdict
251
+ verify["merged"] = {
252
+ "verdict": summary["verdict"],
253
+ "findings_file": ".devlyn/verify-merged.findings.jsonl",
254
+ "summary_file": ".devlyn/verify-merge.summary.json",
255
+ }
256
+ state_path.write_text(json.dumps(state, indent=2, sort_keys=True) + "\n", encoding="utf-8")
257
+
258
+
259
+ def self_test() -> int:
260
+ with tempfile.TemporaryDirectory() as tmp:
261
+ devlyn = pathlib.Path(tmp)
262
+ (devlyn / "pipeline.state.json").write_text(
263
+ json.dumps({"phases": {"verify": {"verdict": "PASS", "sub_verdicts": {}}}}),
264
+ encoding="utf-8",
265
+ )
266
+ (devlyn / "verify.findings.jsonl").write_text(
267
+ json.dumps({"id": "j1", "severity": "LOW"}) + "\n",
268
+ encoding="utf-8",
269
+ )
270
+ (devlyn / "verify.pair.findings.jsonl").write_text(
271
+ json.dumps({"id": "p1", "severity": "HIGH"}) + "\n",
272
+ encoding="utf-8",
273
+ )
274
+ findings, source_verdicts = read_findings(devlyn)
275
+ summary = write_outputs(devlyn, findings, source_verdicts)
276
+ write_state(devlyn, summary)
277
+ state = json.loads((devlyn / "pipeline.state.json").read_text(encoding="utf-8"))
278
+ assert summary["verdict"] == "NEEDS_WORK", summary
279
+ assert state["phases"]["verify"]["verdict"] == "NEEDS_WORK", state
280
+ assert state["phases"]["verify"]["sub_verdicts"]["pair_judge"] == "NEEDS_WORK", state
281
+ assert (devlyn / "verify-merged.findings.jsonl").read_text(encoding="utf-8")
282
+ (devlyn / "verify.findings.jsonl").write_text("", encoding="utf-8")
283
+ (devlyn / "verify.pair.findings.jsonl").write_text("", encoding="utf-8")
284
+ findings, source_verdicts = read_findings(devlyn)
285
+ summary = write_outputs(devlyn, findings, source_verdicts)
286
+ write_state(devlyn, summary)
287
+ state = json.loads((devlyn / "pipeline.state.json").read_text(encoding="utf-8"))
288
+ assert summary["verdict"] == "PASS", summary
289
+ assert state["phases"]["verify"]["verdict"] == "PASS", state
290
+ assert state["phases"]["verify"]["sub_verdicts"]["pair_judge"] == "PASS", state
291
+ (devlyn / "codex-judge.stdout").write_text(
292
+ json.dumps({"id": "cj1", "severity": "HIGH"}) + "\n"
293
+ + '# SUMMARY {"verdict":"NEEDS_WORK"}\n',
294
+ encoding="utf-8",
295
+ )
296
+ findings, source_verdicts = read_findings(devlyn)
297
+ summary = write_outputs(devlyn, findings, source_verdicts)
298
+ write_state(devlyn, summary)
299
+ state = json.loads((devlyn / "pipeline.state.json").read_text(encoding="utf-8"))
300
+ assert summary["verdict"] == "BLOCKED", summary
301
+ assert state["phases"]["verify"]["sub_verdicts"]["pair_judge"] == "BLOCKED", state
302
+ return 0
303
+
304
+
305
+ def main() -> int:
306
+ parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
307
+ parser.add_argument("--devlyn-dir", default=".devlyn")
308
+ parser.add_argument("--write-state", action="store_true")
309
+ parser.add_argument("--self-test", action="store_true")
310
+ args = parser.parse_args()
311
+ if args.self_test:
312
+ return self_test()
313
+
314
+ devlyn = pathlib.Path(args.devlyn_dir)
315
+ if not devlyn.is_dir():
316
+ sys.stderr.write(f"error: {devlyn} is not a directory\n")
317
+ return 1
318
+ findings, source_verdicts = read_findings(devlyn)
319
+ summary = write_outputs(devlyn, findings, source_verdicts)
320
+ if args.write_state:
321
+ write_state(devlyn, summary)
322
+ print(json.dumps(summary, sort_keys=True))
323
+ return 0
324
+
325
+
326
+ if __name__ == "__main__":
327
+ raise SystemExit(main())
@@ -1,6 +1,6 @@
1
1
  ---
2
2
  name: devlyn:resolve
3
- description: Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Free-form goal or formal spec input. Plan → Implement → Build-gate → Cleanup → Verify (fresh subagent, findings-only). Mechanical-first verification; pair-mode optional in Verify. Use when the user says "resolve this", "fix this", "implement this", "refactor this", "debug this", "review this PR", or wants hands-off completion.
3
+ description: Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Free-form goal or formal spec input. Plan → Implement → Build-gate → Cleanup → Verify (fresh subagent, findings-only). Mechanical-first verification; pair-mode is gated in Verify. Use when the user says "resolve this", "fix this", "implement this", "refactor this", "debug this", "review this PR", or wants hands-off completion.
4
4
  ---
5
5
 
6
6
  Orchestrator for the 2-skill harness pipeline. One subagent per phase; file-based handoff via `.devlyn/pipeline.state.json`. VERIFY spawns a fresh-context subagent so independence is structural — not advisory.
@@ -55,10 +55,11 @@ Once `state.implement_passed_sha` is non-null (PHASE 2 returned and produced a d
55
55
 
56
56
  1. Parse flags from `<pipeline_config>`:
57
57
  - `--max-rounds N` (default 4) — fix-loop budget shared across BUILD_GATE and VERIFY.
58
- - `--engine MODE` (default `claude`) — picks the adapter for IMPLEMENT and CLEANUP.
58
+ - `--engine MODE` (default `claude`) — picks the adapter for IMPLEMENT, CLEANUP, and the primary VERIFY judge. It does not disable VERIFY pair-mode; when a VERIFY pair trigger fires, the second judge uses the OTHER engine.
59
59
  - `--spec <path>` — switches to spec mode.
60
60
  - `--verify-only <ref>` — switches to verify-only mode. Requires `--spec`.
61
61
  - `--pair-verify` — force pair-mode JUDGE in PHASE 5 even when not auto-triggered.
62
+ - `--risk-probes` — insert PHASE 1.5 cross-engine probe derivation. The OTHER engine converts visible `## Verification` bullets into bounded executable probes before IMPLEMENT; BUILD_GATE and VERIFY replay them mechanically.
62
63
  - `--bypass <phase>[,...]` — skip specific phases. Valid: `build-gate`, `cleanup`. PLAN, IMPLEMENT, VERIFY are non-bypassable.
63
64
  - `--perf` — opt in to per-phase timing.
64
65
 
@@ -87,6 +88,57 @@ After return:
87
88
  1. If `.devlyn/plan.md` lists zero files → halt with verdict `BLOCKED:plan-empty`.
88
89
  2. If risk list flags an out-of-scope expansion the user did not authorize → re-spawn once with the reminder; second fail → halt.
89
90
 
91
+ ## PHASE 1.5: RISK_PROBES
92
+
93
+ Skip unless `--risk-probes` is set. This phase is findings-as-executable-checks,
94
+ not a second plan and not debate.
95
+
96
+ Engine: OTHER engine from PHASE 2's selected IMPLEMENT engine. Prompt body:
97
+ `references/phases/probe-derive.md`.
98
+
99
+ Inputs: source spec/criteria, `.devlyn/plan.md`, and repo read/search. Forbidden:
100
+ `spec.expected.json`, `.devlyn/spec-verify.json`, `BENCH_FIXTURE_DIR`, hidden
101
+ fixture/verifier paths, previous findings, and harness docs unless excerpted.
102
+
103
+ Output: `.devlyn/risk-probes.jsonl`, 1 to 3 JSONL entries. Each entry must be
104
+ one verification command shape plus `id`, `derived_from`, `tags`, and
105
+ `tag_evidence`, where `derived_from` is an exact substring of the visible
106
+ `## Verification` bullet the command directly exercises. `tag_evidence` must be
107
+ a JSON object keyed by tag, with marker arrays as values; a top-level array or
108
+ tag-only probe is malformed. `ordering_inversion` must include
109
+ `input_order_would_choose_wrong_winner` and `asserts_processing_order_result`;
110
+ `prior_consumption` must include `same_resource_consumed_first` and
111
+ `later_entity_fails_or_reroutes`; `stdout_stderr_contract` and `shape_contract`
112
+ do not require marker strings. Cart/pricing success probes should use
113
+ `shape_contract` unless they satisfy the `ordering_inversion` markers. The probe
114
+ command must not reference external network URLs; use only worktree-local or
115
+ localhost resources.
116
+ For high-complexity specs with multiple behavior bullets, at least one probe
117
+ must be compound: it must exercise two or more visible verification bullets in a
118
+ single command. Empty output is invalid when `--risk-probes` is set.
119
+
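The entry rules in PHASE 1.5 can be sketched as a small checker. This is a minimal illustration under stated assumptions, not the logic of `spec-verify-check.py --validate-risk-probes`: `probe_errors` and `validate_probes` are hypothetical names, the error strings are invented for the example, and the command field is not checked because the text above does not name its key. A tag with no `tag_evidence` entry is treated here as having an empty marker array.

```python
from __future__ import annotations

import json
from typing import Any

# Marker strings quoted from the phase text; tags not listed need no markers.
REQUIRED_MARKERS = {
    "ordering_inversion": {
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    },
    "prior_consumption": {
        "same_resource_consumed_first",
        "later_entity_fails_or_reroutes",
    },
}


def probe_errors(entry: Any, verification_bullets: list[str]) -> list[str]:
    """Return the problems found in one parsed probe entry (empty means OK)."""
    if not isinstance(entry, dict):
        return ["entry is not a JSON object"]
    errors = [
        f"missing field: {key}"
        for key in ("id", "derived_from", "tags", "tag_evidence")
        if key not in entry
    ]
    derived = entry.get("derived_from")
    if "derived_from" in entry and not (
        isinstance(derived, str)
        and derived
        and any(derived in bullet for bullet in verification_bullets)
    ):
        errors.append("derived_from is not an exact substring of a Verification bullet")
    evidence = entry.get("tag_evidence")
    if "tag_evidence" in entry and not isinstance(evidence, dict):
        # A top-level array is explicitly called malformed in the phase text.
        errors.append("tag_evidence must be an object keyed by tag")
        return errors
    tags = entry.get("tags")
    for tag in tags if isinstance(tags, list) else []:
        if not isinstance(tag, str):
            errors.append("tags must contain only strings")
            continue
        markers = (evidence or {}).get(tag, [])
        if not isinstance(markers, list):
            errors.append(f"tag_evidence[{tag!r}] must be an array")
            continue
        present = {m for m in markers if isinstance(m, str)}
        missing = REQUIRED_MARKERS.get(tag, set()) - present
        if missing:
            errors.append(f"{tag} missing markers: {sorted(missing)}")
    return errors


def validate_probes(jsonl_text: str, verification_bullets: list[str]) -> list[str]:
    """Check the 1-to-3 entry bound first, then every entry."""
    lines = [ln for ln in jsonl_text.splitlines() if ln.strip()]
    if not 1 <= len(lines) <= 3:
        return [f"expected 1 to 3 entries, found {len(lines)}"]
    errors: list[str] = []
    for number, line in enumerate(lines, 1):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {number}: invalid JSON: {exc}")
            continue
        errors.extend(f"line {number}: {e}" for e in probe_errors(entry, verification_bullets))
    return errors
```

An empty error list means the artifact passes these shape checks only. The compound-probe requirement for high-complexity specs and the ban on external network URLs are not modeled here.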
120
+ State write: `phases.probe_derive.{started_at, verdict, completed_at, duration_ms, artifacts}`.
121
+
122
+ Invocation contract when OTHER engine is Codex:
123
+
124
+ - Invoke Codex only through the monitored wrapper path in `CODEX_MONITORED_PATH`,
125
+ or `.claude/skills/_shared/codex-monitored.sh` when the env var is absent:
126
+ `bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
127
+ - Do not run `codex`, `codex exec`, `/Users/.../codex`, or a plugin-provided
128
+ Codex binary directly. A raw Codex child can outlive the phase and makes the
129
+ benchmark run invalid even if `.devlyn/risk-probes.jsonl` is written.
130
+ - Capture wrapper stdout/stderr to `.devlyn/probe-derive.stdout` and
131
+ `.devlyn/probe-derive.stderr`; branch on the wrapper exit code before
132
+ validating `.devlyn/risk-probes.jsonl`.
133
+
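The invocation contract above could be driven from Python roughly as follows. This is a sketch only: `build_probe_command` and `run_probe_derivation` are hypothetical helpers, not part of the package. The argument list and capture paths are copied from the bullets above, and nothing beyond them is assumed about the wrapper's behavior.

```python
from __future__ import annotations

import os
import pathlib
import subprocess
from typing import Mapping

# Fallback wrapper path quoted from the invocation contract.
DEFAULT_WRAPPER = ".claude/skills/_shared/codex-monitored.sh"


def build_probe_command(
    prompt: str, cwd: str, env: Mapping[str, str] | None = None
) -> list[str]:
    """Build argv for the monitored wrapper; the raw `codex` binary is never used."""
    env = os.environ if env is None else env
    wrapper = env.get("CODEX_MONITORED_PATH") or DEFAULT_WRAPPER
    return [
        "bash", wrapper,
        "-C", cwd,
        "--full-auto",
        "-c", "model_reasoning_effort=high",
        prompt,
    ]


def run_probe_derivation(prompt: str, devlyn_dir: str = ".devlyn") -> int:
    """Run the wrapper, capture both streams to files, and return its exit code."""
    out_dir = pathlib.Path(devlyn_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    argv = build_probe_command(prompt, os.getcwd())
    with (out_dir / "probe-derive.stdout").open("wb") as out, (
        out_dir / "probe-derive.stderr"
    ).open("wb") as err:
        # check=False: the caller branches on the exit code before validating
        # .devlyn/risk-probes.jsonl, as the contract requires.
        return subprocess.run(argv, stdout=out, stderr=err, check=False).returncode
```

Returning the exit code instead of raising keeps the "branch on the wrapper exit code first" step explicit at the call site.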
134
+ After return:
135
+ 1. Run `python3 .claude/skills/_shared/spec-verify-check.py --validate-risk-probes`
136
+ for the artifact boundary before IMPLEMENT; malformed probes halt with
137
+ `BLOCKED:probe-derive-malformed`.
138
+ 2. IMPLEMENT receives `.devlyn/plan.md` plus `.devlyn/risk-probes.jsonl` as
139
+ concrete acceptance obligations. It must not receive the producer engine's
140
+ commentary or any mention of pair/critic/debate.
141
+
90
142
  ## PHASE 2: IMPLEMENT
91
143
 
92
144
  Skip in verify-only mode. Constrained design judgment within PLAN's invariants. Writes code, tests, and inline doc-comments. No standalone DOCS phase — what the spec licenses is updated here, what it does not is out of scope.
@@ -107,7 +159,7 @@ Skip in verify-only mode OR when `build-gate` in `state.bypasses`. Deterministic
107
159
  Spawn Claude `Agent` (`mode: "bypassPermissions"`) with prompt body `references/phases/build-gate.md`. The agent:
108
160
  1. Detects language/framework via project files (`package.json`, `pyproject.toml`, etc.).
109
161
  2. Runs language-specific gates (tsc / lint / test).
110
- 3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py` (verification_commands literal-match).
162
+ 3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` (verification_commands literal-match plus `.devlyn/risk-probes.jsonl` when present).
111
163
  4. If `spec.expected.json.browser_flows` declared OR diff touches web-surface files: invokes the browser runner (Chrome MCP → Playwright → curl tier as available).
112
164
  5. Emits `.devlyn/build_gate.findings.jsonl` + `.devlyn/build_gate.log.md`.
113
165
 
@@ -140,16 +192,25 @@ Independent quality layer. **Spawned with empty conversation context** — no ca
140
192
 
141
193
  Two sub-phases:
142
194
 
143
- 1. **MECHANICAL** (deterministic): re-run `python3 .claude/skills/_shared/spec-verify-check.py` against the post-CLEANUP code (independent of BUILD_GATE's earlier run). Re-scan `spec.expected.json.forbidden_patterns` against the diff. Re-check `required_files` and `forbidden_files`. Emit `.devlyn/verify-mechanical.findings.jsonl`.
195
+ 1. **MECHANICAL** (deterministic): re-run `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code (independent of BUILD_GATE's earlier run). Re-scan `spec.expected.json.forbidden_patterns` against the diff. Re-check `required_files` and `forbidden_files`. Emit `.devlyn/verify-mechanical.findings.jsonl`.
144
196
 
145
- 2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) fires when:
197
+ 2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Split each Requirement into binding clauses and trace code-order counterexamples; a passing verifier proves only the case it exercises, not neighboring `once` / `regardless` / `duplicate` / auth-order / rollback invariants. Respect scope qualifiers such as `inside a warehouse`, `per resource`, `for this line`, and `after validation`; do not widen a scoped clause into a global invariant, and compose multiple ordering rules in the stated order. For stateful flows, explicitly trace failed-operation rollback and the next entity's state before hunting broader edge cases. For high-complexity specs, construct at least one interaction counterexample that combines ordering/priority with failure handling and state mutation, then execute at least one such scenario through the repo's existing CLI/API/test runner without leaving tracked files behind; one-axis examples and pure mental tracing are insufficient. Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) is eligible only when MECHANICAL has no HIGH/CRITICAL findings; deterministic blockers already decide the verdict and route to the fix loop. Pair-mode fires when eligible and:
146
198
  - `--pair-verify` flag set, OR
199
+ - spec frontmatter has `complexity: high`, OR `state.complexity` is `"high"` or `"large"`, OR
147
200
  - MECHANICAL emits findings flagged `severity: warning` (not disqualifier — those route to fix loop directly), OR
148
201
  - `state.verify.coverage_failed == true` (judge could not exercise a required spec axis from available evidence).
149
202
 
150
- Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is the verdict-binding finding." Cross-model disagreement on lower-severity findings is logged but does not change the verdict.
203
+ Before spawning JUDGE, compute `pair_trigger = { eligible, reasons[] }` and write it into `state.phases.verify`. If `eligible == true` and `reasons` is non-empty, you MUST spawn the second OTHER-engine judge. Skipping that second judge is a VERIFY contract violation, not a discretion call.
204
+
205
+ The `--engine` flag never suppresses this rule. Explicit `--engine claude`
206
+ means "Claude is the primary judge"; it does not mean "do not run Codex as the
207
+ second pair judge." The only valid skip reasons after a non-empty eligible
208
+ trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or Codex
209
+ unavailability proven by the invocation layer.
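The trigger rule above can be sketched as a small pure function. This is a minimal illustration, not the orchestrator's actual code: the function names, argument names, and the reason strings (`pair_verify_flag`, `complexity_high`, `mechanical_warning`, `coverage_failed`) are hypothetical, and it assumes each MECHANICAL finding carries a `severity` field.

```python
from typing import Any, Dict, List, Optional


def compute_pair_trigger(
    mechanical_findings: List[Dict[str, Any]],
    pair_verify_flag: bool,
    spec_complexity: Optional[str],
    state_complexity: Optional[str],
    coverage_failed: bool,
) -> Dict[str, Any]:
    """Build the {eligible, reasons[]} object described for state.phases.verify."""
    severities = {str(f.get("severity", "")).upper() for f in mechanical_findings}
    # Deterministic HIGH/CRITICAL blockers make pair-mode ineligible.
    eligible = not (severities & {"HIGH", "CRITICAL"})

    reasons: List[str] = []  # reason labels are illustrative, not from the source
    if pair_verify_flag:
        reasons.append("pair_verify_flag")
    if spec_complexity == "high" or state_complexity in ("high", "large"):
        reasons.append("complexity_high")
    if "WARNING" in severities:
        reasons.append("mechanical_warning")
    if coverage_failed:
        reasons.append("coverage_failed")
    return {"eligible": eligible, "reasons": reasons}


def must_spawn_pair_judge(trigger: Dict[str, Any]) -> bool:
    """The second judge is mandatory when eligible and at least one reason fired."""
    return bool(trigger["eligible"] and trigger["reasons"])
```

Note that `--engine` does not appear as an input: under the rule above it selects the primary judge only and never suppresses the trigger.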
210
+
211
+ Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. 
When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
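As a rough illustration of the merge rule quoted above, the following sketch classifies findings from both judges. It is not a substitute for `verify-merge-findings.py`, which the text names as the only writer of the merged artifacts. The `confidence` and `behavioral_regression` field names are assumptions made for this example; the source does not specify how a high-confidence MEDIUM regression is encoded.

```python
from typing import Any, Dict, List, Optional, Tuple

Finding = Dict[str, Any]


def merge_verdict(
    primary: List[Finding], pair: Optional[List[Finding]]
) -> Tuple[str, List[Finding]]:
    """Return (verdict, verdict_binding_findings) for the two judges."""
    binding: List[Finding] = []
    # pair is None when the second judge was skipped (pair_judge: null).
    for finding in list(primary) + list(pair or []):
        severity = str(finding.get("severity", "")).upper()
        if severity in ("HIGH", "CRITICAL"):
            binding.append(finding)
        elif (
            severity == "MEDIUM"
            and finding.get("confidence") == "high"  # hypothetical field
            and finding.get("behavioral_regression")  # hypothetical field
        ):
            binding.append(finding)
        # Other lower-severity findings are advisory: logged, not binding.
    return ("NEEDS_WORK" if binding else "PASS"), binding
```

A single HIGH finding from either judge is enough to produce `NEEDS_WORK`, regardless of what the other judge reported.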
151
212
 
152
- Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
213
+ Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` directly with `-c model_reasoning_effort=medium` for this bounded two-probe review, without piping to `tail`/`head`/`grep`, capture stdout/stderr by direct tool capture or file redirection, require JSONL findings on stdout, and have the orchestrator write `.devlyn/verify.pair.findings.jsonl`. If stdout is first captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; that script is the deterministic boundary writer for `.devlyn/verify.pair.findings.jsonl`. Raw stdout remains diagnostic only: if stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
153
214
 
154
215
  Branch:
155
216
  - `PASS` → PHASE 6.
@@ -166,7 +227,7 @@ State write: `phases.final_report.started_at` at the top of this phase.
166
227
 
167
228
  3. State write: `phases.final_report.{verdict, completed_at, duration_ms}` BEFORE archive runs (archive prune logic skips runs whose `final_report.verdict` is null).
168
229
 
169
- 4. **Archive** — invoke the deterministic script: `python3 .claude/skills/_shared/archive_run.py`. The script reads `run_id` from `.devlyn/pipeline.state.json`, moves per-run artifacts (state.json + `*.findings.jsonl` + `*.log.md` + `fix-batch.round-*.json` + `criteria.generated.md` + `spec-verify*.json` + `spec-verify-findings.jsonl`) into `.devlyn/runs/<run_id>/`, then best-effort prunes to last 10 completed runs. Archive must run; running this step as deterministic-script-not-prose ensures the move actually happens (iter-0033a Smoke 3 caught a case where the agent claimed archive ran without moving the files).
230
+ 4. **Archive** — invoke the deterministic script: `python3 .claude/skills/_shared/archive_run.py`. The script reads `run_id` from `.devlyn/pipeline.state.json`, moves per-run artifacts (state.json + `*.findings.jsonl` + `*.log.md` + `fix-batch.round-*.json` + `criteria.generated.md` + `risk-probes.jsonl` + `spec-verify*.json` + `spec-verify-findings.jsonl`) into `.devlyn/runs/<run_id>/`, then best-effort prunes to last 10 completed runs. Archive must run; running this step as deterministic-script-not-prose ensures the move actually happens (iter-0033a Smoke 3 caught a case where the agent claimed archive ran without moving the files).
170
231
 
171
232
  5. Kill any dev server PHASE 3 left running.
172
233
 
@@ -22,7 +22,7 @@ Run in this order; each emits findings into `.devlyn/build_gate.findings.jsonl`:
22
22
  1. **Type check** (TypeScript / mypy / etc.). Each error → one finding, severity `HIGH`, rule `correctness.type-check`.
23
23
  2. **Lint** (eslint / ruff / clippy / etc.). Each error → finding, severity `MEDIUM`, rule `quality.lint`. Warnings stay LOW unless the spec elevates them.
24
24
  3. **Test suite** (npm test / pytest / go test / cargo test). Each failing test → finding, severity `HIGH`, rule `correctness.test-failure`. Include the failing test's file:line and the assertion.
25
- 4. **Spec literal verification**: `python3 .claude/skills/_shared/spec-verify-check.py`. The script reads `.devlyn/spec-verify.json` (pre-staged from spec or self-staged from `state.source.spec_path`). Each command mismatch → finding `correctness.spec-literal-mismatch`, severity `CRITICAL`. Missing/malformed carrier on a generated source → finding `correctness.spec-verify-malformed`, severity `CRITICAL`.
25
+ 4. **Spec literal verification + risk probes**: `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes`. The script reads `.devlyn/spec-verify.json` (pre-staged from spec or self-staged from `state.source.spec_path`) and appends `.devlyn/risk-probes.jsonl` when present. Each verification command mismatch → finding `correctness.spec-literal-mismatch`, severity `CRITICAL`. Each risk-probe mismatch → finding `correctness.risk-probe-failed`, severity `CRITICAL`. Missing/malformed carrier on a generated source → finding `correctness.spec-verify-malformed`, severity `CRITICAL`.
26
26
  5. **Browser** (only when `spec.expected.json.browser_flows` declared OR diff touches `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `page.*`, `layout.*`, `route.*`, `*.css`, `*.html`): start dev server, run declared flows via Chrome MCP if available, falling back to Playwright, falling back to curl. Each failed flow → finding, severity `HIGH`, rule `correctness.browser-flow-failed`.
27
27
 
28
28
  Append all findings; do not stop on the first failure.
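To make the risk-probe step in item 4 concrete, here is a minimal sketch of how one probe record might be checked against a captured result. The internals of `spec-verify-check.py` are not shown in this diff, so this is an assumption about the shape of the check, not its implementation; the finding keys `probe_id` and `message` are hypothetical.

```python
import subprocess
from typing import Any, Dict, Optional


def check_probe(
    probe: Dict[str, Any], exit_code: int, stdout: str
) -> Optional[Dict[str, Any]]:
    """Compare a captured result with the probe's expectations.

    Returns None on a match, or a finding dict on any mismatch.
    """
    problems = []
    expected_exit = probe.get("exit_code", 0)
    if exit_code != expected_exit:
        problems.append(f"exit code {exit_code}, expected {expected_exit}")
    for needle in probe.get("stdout_contains", []):
        if needle not in stdout:
            problems.append(f"stdout is missing {needle!r}")
    for needle in probe.get("stdout_not_contains", []):
        if needle in stdout:
            problems.append(f"stdout unexpectedly contains {needle!r}")
    if not problems:
        return None
    return {
        "rule": "correctness.risk-probe-failed",
        "severity": "CRITICAL",
        "probe_id": probe.get("id"),  # hypothetical key
        "message": "; ".join(problems),  # hypothetical key
    }


def run_probe(probe: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Run the probe's shell command from the worktree and check the result."""
    done = subprocess.run(probe["cmd"], shell=True, capture_output=True, text=True)
    return check_probe(probe, done.returncode, done.stdout)
```

Keeping `check_probe` free of I/O means the comparison logic can be tested without executing any command.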
@@ -0,0 +1,183 @@
1
+ # PHASE 1.5 — RISK_PROBES (canonical body)
2
+
3
+ Per-engine adapter header is prepended at runtime. This file is engine-agnostic.
4
+
5
+ <role>
6
+ Convert visible verification obligations into executable probes. You are not a
7
+ second planner, essay-writing critic, or debate participant. Your output is JSONL only.
8
+ </role>
9
+
10
+ <input>
11
+ - Source spec or generated criteria.
12
+ - `.devlyn/plan.md`.
13
+ - Codebase read/search at `state.base_ref.sha`.
14
+ </input>
15
+
16
+ <forbidden_input>
17
+ Do not read `spec.expected.json`, `.devlyn/spec-verify.json`,
18
+ `BENCH_FIXTURE_DIR`, benchmark fixture/verifier paths, `.devlyn/*.findings.jsonl`,
19
+ `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness
20
+ docs unless the orchestrator pasted a specific excerpt into the prompt.
21
+ </forbidden_input>
22
+
23
+ <task>
24
+ Read the visible `## Verification` section. Emit 1 to 3 executable probes
25
+ that cover the highest-risk bullets whose failure would change observable
26
+ behavior. Prefer bullets that combine ordering/priority, rollback/state
27
+ mutation, idempotency, auth/error priority, stdout/stderr, or exact output
28
+ shape.
29
+
30
+ For high-complexity specs with two or more behavior bullets, at least one probe
31
+ must be compound: one command must exercise two or more visible verification
32
+ bullets together. Do not split every risk into isolated one-axis probes.
33
+
34
+ Compound means interaction, not a checklist in one script. If the visible
35
+ verification text includes priority/ordering plus rollback, blocked intervals,
36
+ or failed-operation state, the first probe must be a dominance-loss scenario:
37
+ an earlier lower-priority/input-order entity would succeed alone, a later
38
+ higher-priority entity consumes or blocks the critical resource first, a failed
39
+ or blocked middle entity must not corrupt state, and the assertion must compare
40
+ the complete externally-visible result (accepted/scheduled rows, rejected rows,
41
+ remaining/state rows when present, exit/stdout/stderr).
42
+
43
+ When a verification bullet contains an alternative such as "rejected or moved
44
+ later", the probes must cover both sides when bounded: one case with a later
45
+ valid placement and one case with no later valid placement, where rejection is
46
+ the only correct outcome. Do not test only the easier side of an "or" clause.
47
+ For blocked-interval bullets, include boundary probes where a candidate starts
48
+ exactly at `blocked.start`, ends exactly at `blocked.end`, and has only a
49
+ one-minute overlap. Half-open assumptions must be tested by the command rather
50
+ than left implicit in prose.
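The boundary cases named above can be pinned down with a small overlap predicate. This sketch assumes half-open `[start, end)` intervals measured in minutes since midnight; that assumption is exactly what a real probe would need to confirm against the visible spec, and the function name is illustrative.

```python
def overlaps(cand_start: int, cand_end: int, blocked_start: int, blocked_end: int) -> bool:
    """True when [cand_start, cand_end) intersects [blocked_start, blocked_end)."""
    return cand_start < blocked_end and blocked_start < cand_end


# Blocked interval 10:00-11:00, expressed in minutes since midnight.
BLOCKED = (600, 660)

boundary_cases = {
    "starts_at_blocked_start": (600, 630),  # begins exactly at blocked.start
    "ends_at_blocked_end": (630, 660),      # finishes exactly at blocked.end
    "one_minute_overlap": (570, 601),       # only the last minute collides
    "touches_before": (540, 600),           # ends where the block begins
    "touches_after": (660, 690),            # starts where the block ends
}

results = {name: overlaps(s, e, *BLOCKED) for name, (s, e) in boundary_cases.items()}
```

Under the half-open assumption the first three cases collide and the two touching cases do not. An implementation that uses closed intervals, or that compares with `<=`, would disagree on the touching cases, which is why the probe should execute these inputs instead of stating the semantics in prose.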
51
+
52
+ When a placement algorithm may advance a candidate start past a blocked
53
+ interval or already-accepted entity, include a no-later-valid case where the
54
+ advanced start would exceed the active availability/window bound. The expected
55
+ result must reject that entity. A probe is too weak if every advanced candidate
56
+ still has enough room after the advance; that misses window-bound recheck bugs.
57
+
58
+ When a verification bullet names `remaining`, inventory, stock, balances, or
59
+ state after failures, assert the full externally-visible state. Rows with zero
60
+ quantity do not represent remaining availability; a probe that checks remaining
61
+ state should fail if zero-quantity rows are emitted unless the visible spec
62
+ explicitly requires zero rows. For all-or-nothing rollback, include a later
63
+ entity that can succeed only if the failed entity returned every tentative
64
+ allocation.
65
+
66
+ When visible bullets combine priority ordering, all-or-nothing rollback,
67
+ single-resource or single-warehouse constraints, choice ordering such as FEFO,
68
+ and `remaining` output, prefer one compound probe over isolated checks. The
69
+ probe must include: a lower-priority input-first entity that loses because a
70
+ higher-priority entity consumes stock first; a middle entity that tentatively
71
+ allocates at least one line/lot and then fails another line; a later entity that
72
+ can succeed only if that failed entity rolled back; a single-resource constraint
73
+ case where total cross-resource stock would be enough but no single allowed
74
+ resource is enough; and full expected `remaining` output sorted exactly as the
75
+ visible spec says, with zero-quantity rows absent unless explicitly required.
76
+ For all-or-nothing allocation probes, the failed middle entity must not be
77
+ pre-rejected by a whole-order availability shortcut. It must allocate a scarce
78
+ first line from mutable state, then fail a later line because that SKU/resource
79
+ is absent or otherwise impossible under the visible contract. The later entity
80
+ must request the same scarce first-line SKU so the probe proves rollback by
81
+ observable success, not by internal reasoning.
82
+
83
+ Each probe must run entirely from the worktree with standard shell/Node/Python
84
+ tools already present in the repo. Use inline temp-file scripts when needed.
85
+ Leave no tracked files behind. Probe commands must not call external network
86
+ APIs or write to external memory/telemetry services.
87
+ </task>
88
+
89
+ <output>
90
+ Write `.devlyn/risk-probes.jsonl`. Each line is one JSON object:
91
+
92
+ ```json
93
+ {"id":"P1","derived_from":"verbatim substring from ## Verification","cmd":"shell command","exit_code":0,"stdout_contains":[],"stdout_not_contains":[],"tags":["ordering_inversion"],"tag_evidence":{"ordering_inversion":["input_order_would_choose_wrong_winner","asserts_processing_order_result"]}}
94
+ ```
95
+
96
+ Rules:
97
+ - `derived_from` must be an exact substring of the visible `## Verification`
98
+ bullet that the command directly exercises. For `error_contract`, use the
99
+ invalid-input/stderr/JSON-error/exit-2 bullet, not a generic test-runner
100
+ bullet.
101
+ - `tags` is required. Use only these shape tags:
102
+ `ordering_inversion`, `boundary_overlap`, `prior_consumption`,
103
+ `rollback_state`, `positive_remaining`, `stdout_stderr_contract`,
104
+ `error_contract`, `shape_contract`.
105
+ - Do not emit a shape tag unless the visible `## Verification` text names that
106
+ kind of risk and the command exercises it. In particular, `boundary_overlap`
107
+ is only for visible blocked-interval/window/overlap boundary semantics; do not
108
+ use it for inventory, warehouse, or generic resource constraints.
109
+ - `tag_evidence` is required and must be a JSON object keyed by tag, never a
110
+ top-level array. For these tags, include every listed evidence marker in the
111
+ tag's array and make the command actually exercise it:
112
+ - `ordering_inversion`: `input_order_would_choose_wrong_winner`,
113
+ `asserts_processing_order_result`.
114
+ - `boundary_overlap`: `starts_at_blocked_start`, `ends_at_blocked_end`,
115
+ `one_minute_overlap`.
116
+ - `prior_consumption`: `same_resource_consumed_first`,
117
+ `later_entity_fails_or_reroutes`.
118
+ - `rollback_state`: `failed_entity_tentative_state_absent`,
119
+ `later_entity_uses_released_state`.
120
+ - `positive_remaining`: `asserts_full_remaining_state`,
121
+ `zero_quantity_rows_absent`.
122
+ Tags not listed here may use an empty evidence list or be omitted from
123
+ `tag_evidence`.
124
+ - `cmd` must not reference `BENCH_FIXTURE_DIR`, `verifiers/`, benchmark fixture
125
+ paths, hidden oracle files, external URLs, or files outside the worktree.
126
+ Localhost URLs are allowed only when the visible verification command needs a
127
+ local server.
128
+ - Match the spec's visible input and output key names literally; do not invent
129
+ aliases such as `stock` for `lots`, `order_id` for `id`, or `warehouse_id`
130
+ for `warehouse`.
131
+ - For cart/pricing specs whose visible verification covers duplicate combining,
132
+ multiple line-promotion types, tax, coupon, and shipping, the compound success
133
+ probe must include interleaved duplicate SKUs plus taxable and non-taxable
134
+ items, then assert the full output object and item rows. Use `shape_contract`
135
+ for this probe unless the command also proves the required
136
+ `ordering_inversion` evidence markers.
137
+ - Empty output is invalid when this phase is enabled. If no bounded executable
138
+ probe can be derived, write one JSONL object whose command exits nonzero and
139
+ whose `derived_from` names the blocking verification bullet; BUILD_GATE will
140
+ surface the inability as a concrete failure instead of silently proceeding.
141
+ - No prose, no Markdown, no summaries, no alternate plan.
142
+ </output>
143
+
144
+ <quality_bar>
145
+ - Executable beats rhetorical. A risk that cannot become a bounded command does
146
+ not belong in this artifact.
147
+ - Keep probes small. They are BUILD_GATE obligations, not a replacement for the
148
+ full test suite.
149
+ - Coverage over cleverness: mirror the verification bullet literally before
150
+ inventing an edge case.
151
+ - If a probe passes while an implementation processes entities in input order
152
+ instead of the required priority/order, or emits extra zero-value state rows,
153
+ the probe is too weak.
154
+ - If priority/order appears in the visible contract, at least one probe must
155
+ carry `ordering_inversion`.
156
+ - If blocked intervals, forbidden windows, or overlap appear in the visible
157
+ contract, at least one probe must carry `boundary_overlap`.
158
+ `boundary_overlap` is not satisfied by a generic overlap case. The same
159
+ probe must assert a candidate starting exactly at the blocked interval start,
160
+ a candidate ending exactly at the blocked interval end, and a one-minute
161
+ overlap case with no later valid placement.
162
+ - If the domain has both windows/availability and conflicts that can push a
163
+ candidate later, at least one probe must assert the pushed candidate is
164
+ rejected when the pushed start plus duration no longer fits inside the same
165
+ window. The full expected output must exclude that row from scheduled/accepted
166
+ output and include the required rejection reason.
167
+ - If accepted operations reduce stock/state/availability for later operations,
168
+ at least one probe must carry `prior_consumption`: a later lower-priority or
169
+ later-submitted entity must fail or reroute only because an earlier accepted
170
+ entity consumed the exact resource/lot/slot.
171
+ - If a visible contract has all-or-nothing rollback plus `remaining`, at least
172
+ one probe must carry both `rollback_state` and `positive_remaining`; it must
173
+ prove the rollback by a later successful entity and by the final remaining
174
+ rows, not just by the rejected order reason.
175
+ - If `remaining` state appears in the visible contract, at least one probe must
176
+ carry `positive_remaining` and assert that zero-quantity/zero-value rows are
177
+ absent unless the visible spec explicitly requires them.
178
+ </quality_bar>
179
+
180
+ <runtime_principles>
181
+ Read `_shared/runtime-principles.md`. The discipline here is: visible contract
182
+ in, executable obligation out. Hidden oracle leakage is a blocker.
183
+ </runtime_principles>
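The output rules in this file lend themselves to a mechanical pre-check of each JSONL line. The sketch below encodes the tag list, the required evidence markers, and the `derived_from` substring rule as stated above. The function name and the error strings are hypothetical, and the source does not say whether the pipeline runs a validator of this kind.

```python
import json
from typing import List

ALLOWED_TAGS = {
    "ordering_inversion", "boundary_overlap", "prior_consumption",
    "rollback_state", "positive_remaining", "stdout_stderr_contract",
    "error_contract", "shape_contract",
}

# Evidence markers that must all be present when the tag is used.
REQUIRED_EVIDENCE = {
    "ordering_inversion": ["input_order_would_choose_wrong_winner",
                           "asserts_processing_order_result"],
    "boundary_overlap": ["starts_at_blocked_start", "ends_at_blocked_end",
                         "one_minute_overlap"],
    "prior_consumption": ["same_resource_consumed_first",
                          "later_entity_fails_or_reroutes"],
    "rollback_state": ["failed_entity_tentative_state_absent",
                       "later_entity_uses_released_state"],
    "positive_remaining": ["asserts_full_remaining_state",
                           "zero_quantity_rows_absent"],
}


def validate_probe_line(line: str, verification_text: str) -> List[str]:
    """Return a list of rule violations for one risk-probes.jsonl line."""
    probe = json.loads(line)
    errors: List[str] = []

    if probe.get("derived_from", "") not in verification_text or not probe.get("derived_from"):
        errors.append("derived_from is not an exact substring of ## Verification")

    tags = probe.get("tags")
    if not isinstance(tags, list):
        errors.append("tags is required")
        tags = []
    errors += [f"unknown tag: {t}" for t in tags if t not in ALLOWED_TAGS]

    evidence = probe.get("tag_evidence")
    if not isinstance(evidence, dict):
        errors.append("tag_evidence must be a JSON object keyed by tag")
        evidence = {}
    for tag in tags:
        markers = evidence.get(tag)
        if not isinstance(markers, list):
            markers = []
        for required in REQUIRED_EVIDENCE.get(tag, []):
            if required not in markers:
                errors.append(f"{tag}: missing evidence marker {required}")
    return errors
```

Tags without an entry in `REQUIRED_EVIDENCE`, such as `shape_contract`, pass with an empty or absent evidence list, matching the rule that unlisted tags may be omitted from `tag_evidence`.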