devlyn-cli 2.2.0 → 2.2.2

Files changed (45)
  1. package/benchmark/auto-resolve/README.md +7 -4
  2. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +0 -1
  3. package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/spec.md +0 -1
  4. package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/spec.md +0 -1
  5. package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/spec.md +0 -1
  6. package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/spec.md +0 -1
  7. package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md +0 -1
  8. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +0 -1
  9. package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md +0 -1
  10. package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/spec.md +0 -1
  11. package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md +0 -1
  12. package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md +0 -1
  13. package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/spec.md +0 -1
  14. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +0 -1
  15. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +0 -1
  16. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +0 -1
  17. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +0 -1
  18. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +0 -1
  19. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +0 -3
  20. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +0 -11
  21. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +0 -10
  22. package/benchmark/auto-resolve/scripts/run-fixture.sh +23 -3
  23. package/config/skills/_shared/spec-verify-check.py +51 -0
  24. package/config/skills/devlyn:resolve/SKILL.md +12 -5
  25. package/config/skills/devlyn:resolve/references/phases/probe-derive.md +24 -5
  26. package/package.json +1 -1
  27. package/scripts/lint-skills.sh +39 -22
  28. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/NOTES.md +0 -24
  29. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/expected.json +0 -66
  30. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/metadata.json +0 -10
  31. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/setup.sh +0 -22
  32. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/spec.md +0 -62
  33. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/task.txt +0 -9
  34. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/verifiers/exact-success.js +0 -48
  35. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/verifiers/insufficient-balance.js +0 -36
  36. package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/verifiers/rules-source.js +0 -55
  37. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/NOTES.md +0 -20
  38. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/expected.json +0 -66
  39. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/metadata.json +0 -10
  40. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/setup.sh +0 -23
  41. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/spec.md +0 -66
  42. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/task.txt +0 -11
  43. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/verifiers/exact-success.js +0 -44
  44. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/verifiers/rules-source.js +0 -58
  45. package/benchmark/auto-resolve/fixtures/F28-cli-rental-quote-rules/verifiers/unavailable-inventory.js +0 -35
@@ -109,7 +109,8 @@ bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
  ```

  After a full-pipeline pair run has the calibrated arms (`bare`,
- `solo_claude`, `l2_gated`) plus a blind `judge.json`, gate it separately:
+ `solo_claude`, `l2_gated` or `l2_risk_probes`) plus a blind `judge.json`, gate
+ it separately:

  ```bash
  bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
@@ -143,10 +144,12 @@ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
  ```

  This is the full-pipeline claim gate: each counted fixture must satisfy the
- headroom precondition (`bare <= 60`, `solo_claude <= 80`), the `l2_gated` arm
+ headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
  must be clean, `pair_mode` must be true in the captured resolve state, and the
- blind judge must score `l2_gated` at least `--min-pair-margin` above
- `solo_claude`. When changing this gate, run:
+ blind judge must score the pair arm at least `--min-pair-margin` above
+ `solo_claude`. `l2_risk_probes` is the current measured pair arm for the
+ F16/F25 gate: `20260509-f16-f25-combined-cartprobe-v2` passed with margins +21
+ and +24, average pair/solo wall ratio 1.46x. When changing this gate, run:

  ```bash
  bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
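Read as a predicate, the claim gate described in the README hunk above reduces to a few comparisons per fixture. The sketch below is a hypothetical illustration, not the logic of `full-pipeline-pair-gate.py`: the `FixtureResult` fields, the function name, and the default margin of 5 are assumptions. Only the thresholds (`bare <= 60`, `solo_claude <= 80`) and the four conditions come from the README text.

```python
from dataclasses import dataclass


@dataclass
class FixtureResult:
    # Hypothetical record; field names are illustrative, not the gate's real schema.
    bare: int          # blind-judge score for the `bare` arm
    solo_claude: int   # blind-judge score for the `solo_claude` arm
    pair: int          # blind-judge score for the selected pair arm
    pair_clean: bool   # pair arm finished without disqualifiers
    pair_mode: bool    # `pair_mode` as captured in the resolve state


def passes_gate(r: FixtureResult, min_pair_margin: int = 5) -> bool:
    """Return True when one fixture satisfies every condition in the README text.

    The default margin is an assumption; the real gate reads --min-pair-margin.
    """
    headroom = r.bare <= 60 and r.solo_claude <= 80
    margin_ok = (r.pair - r.solo_claude) >= min_pair_margin
    return headroom and r.pair_clean and r.pair_mode and margin_ok
```

Under these assumptions, a fixture with `bare=50`, `solo_claude=70`, `pair=91` passes, while one with `bare=65` fails the headroom precondition no matter how large the pair margin is.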
@@ -28,7 +28,6 @@ This is a low-risk edit used to calibrate trivial-tier fixture difficulty.
  - **No silent catches.** If an unknown flag is passed, exit 1 with an informative message (same pattern as the existing `--name` handler).
  - **Surgical diff.** Only touch `bin/cli.js` and `tests/cli.test.js`. Do not reformat unrelated code.

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -35,7 +35,6 @@ multiple POSTs arrive close together — no duplicate ids, no lost writes.
  - **No silent catches.** Any `try/catch` in the write path must surface failure as `500` with a clear body, not return a fake success.
  - **No hardcoded ids.** Existing baseline ids (1, 2) remain valid; new ids must not collide with any past or present id.
  - **No breaking change** to `GET /items` shape or `GET /items/:id` semantics.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -34,7 +34,6 @@ and the stored list is left exactly as it was before the request.
  - **No silent catches.**
  - **No partial updates.** A batch with N items must produce either N inserts or 0 inserts.
  - **No breaking change** to existing `GET /items` and `GET /items/:id`.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -37,7 +37,6 @@ no trailing newline).
  - **No silent catches.** Errors in the verification path surface as `500` with a clear body.
  - **Use `crypto.timingSafeEqual` for the signature comparison.** A non-constant-time `===` between hex strings leaks information about the true MAC byte-by-byte.
  - **No breaking change** to existing `/items`, `/items/:id`, `/health`.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -35,7 +35,6 @@ The implementation persists state to `data/items.json` and exposes:
  - **No new npm dependencies.** Fix using Express + Node built-ins only.
  - **No silent catches.** Errors surface with explicit status + body, not by returning a fake-success.
  - **Touch only `server/index.js` and `tests/server.test.js`.** Do not modify `data/items.json` shape, `tests/cli.test.js`, or anything outside the server.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -40,7 +40,6 @@ the output must be machine-readable.
  - **No floating-money output.** All public amounts are integer cents.
  - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
  - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -37,7 +37,6 @@ inside the CLI itself.
  - **HOME guard.** If `process.env.HOME` is undefined or empty, emit a clear FAIL line ("HOME environment variable is not set") and exit 1.
  - **EACCES handling.** If `readdirSync` fails with EACCES, emit a permission-specific message quoting the offending path. Do not silently return an empty list.

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -42,7 +42,6 @@ failure reasons must be deterministic.
  - **No mutation of the input file.**
  - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
  - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -44,7 +44,6 @@ once, and duplicate ids must not silently corrupt balances.
  - **No mutation of the input file.**
  - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
  - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -50,7 +50,6 @@ orders must be deterministic.
  - **No mutation of the input file.**
  - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
  - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -43,7 +43,6 @@ and stdout must stay machine-readable.
  - **No floating-money output.** All public amounts are integer cents.
  - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
  - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -45,7 +45,6 @@ public amount must be integer cents.
  - **No floating-money output.** All public amounts are integer cents.
  - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
  - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -37,7 +37,6 @@ so existing assertions continue to pass alongside new paging assertions.
  - **No breaking change to `/items/:id`.** The per-item route must keep its current contract (the fixture explicitly does NOT paginate single-item lookups).
  - **Backward-compat note**: clients that previously read `response.items` MUST still get the array at the same key inside the new envelope.

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -31,7 +31,6 @@ and italicized — using only the page's own CSS/JS.
  - **No inline JS frameworks.** Stick to the vanilla pattern already in `index.html`.
  - **Accessibility.** Both buttons must have accessible names equal to their visible labels; `#whisper` adds `aria-label="whisper"` only if its visible text differs (it doesn't, so leave it off).

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -31,7 +31,6 @@ Implement it so every test passes.
  - **Do not modify `tests/count.test.js`.** If a test looks wrong, that's a signal to revisit the implementation, not the test.
  - **No silent catches.** Errors reading stdin must surface with a clear message (not suppressed).

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -30,7 +30,6 @@ already provides everything needed; no external dependency is warranted.
  - **Stream-friendly.** Large files should not be read fully into memory. Use a hash stream (`crypto.createHash('sha256')` + pipe from `fs.createReadStream`).
  - **No silent catches.** File I/O errors must surface with an informative message and the appropriate exit code.

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -27,7 +27,6 @@ version without string manipulation. Add a `--format json` flag that makes
  - **Touch only `bin/cli.js` (`version` handler + argument parsing) and `tests/cli.test.js` (new test).** Do NOT modify the `hello` subcommand or any other file.
  - **No silent catches.** Unknown `--format` values must surface an error.

- - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -42,9 +42,6 @@ inside `/devlyn:resolve` (no separate preflight skill in the 2-skill design).
  - **No silent catches.**
  - **Non-git-repo handling.** Do not assume the user is always in a repo.

- - **Lifecycle note.** The harness's CLEANUP/VERIFY phases may flip this
- spec's frontmatter `status` after implementation completes — that is
- benchmark lifecycle bookkeeping, not a scope violation.

  ## Out of Scope

@@ -154,20 +154,9 @@ def analyze(work_dir, scaffold_sha, waivers, fixture_id=None):
      findings = []
      entries = git_diff_status(scaffold_sha, work_dir)

-     # Structural exemption: every benchmark fixture has its own spec at
-     # docs/roadmap/phase-*/<fixture_id>.md, and auto-resolve's DOCS phase
-     # Job 1 legitimately flips its frontmatter status. That flip is a
-     # skill feature, not a scope violation — always exempt regardless of
-     # per-fixture waivers.
-     own_spec_globs = []
-     if fixture_id:
-         own_spec_globs.append(f"docs/roadmap/phase-*/{fixture_id}.md")
-
      for status, path in entries:
          if is_waived(path, waivers):
              continue
-         if is_waived(path, own_spec_globs):
-             continue

          # Lockfile deletion — only when file existed at scaffold.
          if status == "D" and os.path.basename(path) in LOCKFILE_NAMES:
@@ -173,22 +173,12 @@ def analyze(work_dir_str: str, scaffold_sha: str, tier_c_globs, waivers,

      reachable = bfs_trace(seeds, work_dir)

-     # Structural exemption: the fixture's own spec file at
-     # docs/roadmap/phase-*/<fixture_id>.md is always authorized — DOCS
-     # phase Job 1 flips its frontmatter status by design. Kept in sync
-     # with oracle-scope-tier-a.py.
-     own_spec_globs = []
-     if fixture_id:
-         own_spec_globs.append(f"docs/roadmap/phase-*/{fixture_id}.md")
-
      findings = []
      for path in sorted(touched):
          if match_any(path, tier_c_globs):
              continue
          if match_any(path, waivers):
              continue
-         if match_any(path, own_spec_globs):
-             continue
          if path in reachable:
              depth, via = reachable[path]
              findings.append({
@@ -595,8 +595,7 @@ fi
  (cd "$WORK_DIR" \
  && git diff "$SCAFFOLD_SHA" --name-only) > "$RESULT_DIR/changed-files.txt" 2>&1 || true

- # Deterministic oracles (step 1+ of the benchmark-extension plan).
- # Findings-only at this stage; scoring integration is step 5.
+ # Deterministic oracles. Hard/flag findings are merged into verify.json below.
  python3 "$BENCH_ROOT/scripts/oracle-test-fidelity.py" \
  --work "$WORK_DIR" --scaffold "$SCAFFOLD_SHA" \
  > "$RESULT_DIR/oracle-test-fidelity.json" 2>/dev/null || \
@@ -670,7 +669,8 @@ verify_env["BENCH_FIXTURE_DIR"] = os.path.dirname(os.path.abspath(sys.argv[1]))

  verify = {"commands": [], "forbidden_pattern_hits": [], "deps_added": 0,
            "max_deps_added": expected.get("max_deps_added", 0),
-           "missing_required_files": [], "forbidden_files_present": []}
+           "missing_required_files": [], "forbidden_files_present": [],
+           "oracle_findings": [], "oracle_disqualifier": False}

  for vc in expected.get("verification_commands", []):
      try:
@@ -766,11 +766,29 @@ verify["commands_passed"] = passed
  verify["commands_total"] = total
  verify["verify_score"] = (passed / total) if total else 1.0

+ for oracle_file in (
+     "oracle-scope-tier-a.json",
+     "oracle-scope-tier-b.json",
+     "oracle-test-fidelity.json",
+ ):
+     try:
+         data = json.load(open(os.path.join(result_dir, oracle_file)))
+     except Exception:
+         continue
+     oracle_name = data.get("oracle") or oracle_file.removesuffix(".json")
+     for finding in data.get("findings", []) or []:
+         item = dict(finding)
+         item["oracle"] = oracle_name
+         verify["oracle_findings"].append(item)
+         if item.get("severity") in ("disqualifier", "hard", "flag"):
+             verify["oracle_disqualifier"] = True
+
  verify["disqualifier"] = (
      any(h["severity"] == "disqualifier" for h in verify["forbidden_pattern_hits"])
      or verify["deps_added"] > verify["max_deps_added"]
      or bool(verify["missing_required_files"])
      or bool(verify["forbidden_files_present"])
+     or verify["oracle_disqualifier"]
  )

  json.dump(verify, open(os.path.join(result_dir, "verify.json"), "w"), indent=2)
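The merge loop added in the hunk above can be exercised in isolation. The following is a minimal sketch, assuming the three oracle files hold a JSON object with an optional `oracle` name and a `findings` list; the function name, the `demo` helper, and the sample finding (including its `path` field) are hypothetical and exist only for illustration. It differs from the diff in closing files explicitly and catching narrower exceptions, and `str.removesuffix` needs Python 3.9 or later.

```python
import json
import os
import tempfile

ORACLE_FILES = (
    "oracle-scope-tier-a.json",
    "oracle-scope-tier-b.json",
    "oracle-test-fidelity.json",
)
# Severities that set the disqualifier flag, as listed in the diff.
BLOCKING = ("disqualifier", "hard", "flag")


def merge_oracle_findings(result_dir: str) -> dict:
    """Collect findings from whichever oracle files exist and parse as JSON."""
    merged = {"oracle_findings": [], "oracle_disqualifier": False}
    for oracle_file in ORACLE_FILES:
        try:
            with open(os.path.join(result_dir, oracle_file)) as fh:
                data = json.load(fh)
        except (OSError, ValueError):
            continue  # a missing or unparsable oracle output is skipped
        name = data.get("oracle") or oracle_file.removesuffix(".json")
        for finding in data.get("findings") or []:
            item = dict(finding, oracle=name)  # tag each finding with its source
            merged["oracle_findings"].append(item)
            if item.get("severity") in BLOCKING:
                merged["oracle_disqualifier"] = True
    return merged


def demo() -> dict:
    # Hypothetical oracle output; the finding's fields are made up for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "oracle-scope-tier-a.json"), "w") as fh:
            json.dump({"findings": [{"severity": "hard", "path": "README.md"}]}, fh)
        return merge_oracle_findings(tmp)
```

Because `flag` is among the blocking severities, a single flag-level finding from any oracle is enough to set `oracle_disqualifier`, which the hunk then folds into the overall `disqualifier`.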
@@ -861,6 +879,8 @@ result = {
      "arm": arm,
      "run_id": run_id,
      "disqualifier": verify.get("disqualifier", False),
+     "oracle_disqualifier": verify.get("oracle_disqualifier", False),
+     "oracle_findings_count": len(verify.get("oracle_findings", [])),
      "verify_score": verify.get("verify_score", 0.0),
      "commands_passed": verify.get("commands_passed", 0),
      "commands_total": verify.get("commands_total", 0),
@@ -77,6 +77,14 @@ JSON_FENCE_RE = re.compile(r'(?ms)^```json[ \t]*\n(.*?)\n```[ \t]*$')
  FORBIDDEN_RISK_PROBE_CMD_RE = re.compile(
      r'BENCH_FIXTURE_DIR|benchmark/auto-resolve/fixtures|/verifiers/|verifiers/'
  )
+ EXTERNAL_URL_RE = re.compile(r"https?://([^/\s\"']+)", re.IGNORECASE)
+ LOCAL_URL_HOSTS = {
+     'localhost',
+     '127.0.0.1',
+     '0.0.0.0',
+     '[::1]',
+     '::1',
+ }
  RISK_PROBE_TAGS = {
      "ordering_inversion",
      "boundary_overlap",
@@ -131,6 +139,15 @@ def extract_verification_text(text: str) -> str:
      return section.group(1) if section else ""


+ def external_url_hosts(text: str) -> list[str]:
+     hosts: list[str] = []
+     for match in EXTERNAL_URL_RE.finditer(text or ''):
+         host = match.group(1).split('@')[-1].split(':')[0].lower()
+         if host not in LOCAL_URL_HOSTS and host not in hosts:
+             hosts.append(host)
+     return hosts
+
+
  def validate_shape(data) -> str | None:
      """Return None if shape matches the canonical verification_commands
      schema; else a human-readable error string.
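To see what `external_url_hosts` reports for typical probe commands, the helper and its two constants can be copied out of the hunks above and run on their own. The `curl` commands below are made-up inputs. The last case is worth noting: because the host is cut at the first `:`, a bracketed IPv6 literal appears to collapse to `[`, so the `'[::1]'` and `'::1'` entries in `LOCAL_URL_HOSTS` would not match it.

```python
import re

# Constants and helper reproduced from the diff; only the sample inputs are invented.
EXTERNAL_URL_RE = re.compile(r"https?://([^/\s\"']+)", re.IGNORECASE)
LOCAL_URL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "[::1]", "::1"}


def external_url_hosts(text: str) -> list[str]:
    hosts: list[str] = []
    for match in EXTERNAL_URL_RE.finditer(text or ""):
        # Drop any userinfo before '@', then cut at the first ':' to drop the port.
        host = match.group(1).split("@")[-1].split(":")[0].lower()
        if host not in LOCAL_URL_HOSTS and host not in hosts:
            hosts.append(host)
    return hosts


print(external_url_hosts("curl -s http://localhost:3000/items"))           # []
print(external_url_hosts("curl https://user:pw@API.example.com:8443/v1"))  # ['api.example.com']
print(external_url_hosts("curl http://[::1]:8080/health"))                 # ['[']
```

If IPv6 loopback URLs need to be accepted, parsing with `urllib.parse.urlsplit(...).hostname` would be one way to handle the brackets, though that is a suggestion and not something the diff does.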
@@ -189,6 +206,12 @@ def validate_risk_probe(probe: object, index: int, verification_text: str) -> st
              f"risk-probes[{index}].cmd references hidden fixture/verifier paths; "
              "risk probes must derive from visible spec text only"
          )
+     external_hosts = external_url_hosts(cmd)
+     if external_hosts:
+         return (
+             f"risk-probes[{index}].cmd references external URL(s): "
+             f"{', '.join(external_hosts)}; use only worktree-local or localhost resources"
+         )
      if len(cmd) > 4000:
          return f"risk-probes[{index}].cmd exceeds 4000 characters"
      tags = probe.get("tags")
@@ -197,6 +220,15 @@ def validate_risk_probe(probe: object, index: int, verification_text: str) -> st
      unknown_tags = sorted(set(tags) - RISK_PROBE_TAGS)
      if unknown_tags:
          return f"risk-probes[{index}].tags contains unknown tag(s): {', '.join(unknown_tags)}"
+     if "error_contract" in tags and not re.search(
+         r'invalid|stderr|json[ -]?error|error object|exit[ `]*2',
+         derived_from,
+         re.IGNORECASE,
+     ):
+         return (
+             f"risk-probes[{index}].derived_from for error_contract must name "
+             "an invalid-input, stderr, JSON-error, or exit-2 verification bullet"
+         )
      evidence = probe.get("tag_evidence")
      if not isinstance(evidence, dict):
          return f"risk-probes[{index}].tag_evidence must be an object"
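The new `error_contract` check hinges on one regular expression applied to `derived_from`. The sketch below copies that pattern and tries it against a few bullet texts; the helper name `names_error_bullet` is hypothetical, and only the first sample string is taken from the fixture specs quoted earlier in this diff.

```python
import re

# Pattern copied from the hunk above; the helper name is illustrative only.
ERROR_CONTRACT_RE = re.compile(
    r"invalid|stderr|json[ -]?error|error object|exit[ `]*2",
    re.IGNORECASE,
)


def names_error_bullet(derived_from: str) -> bool:
    """True when the text would satisfy the error_contract derived_from check."""
    return ERROR_CONTRACT_RE.search(derived_from) is not None


samples = [
    "emit a visible JSON error to stderr and exit `2`",  # wording used by the fixture specs
    "probe must pass visible marker.",                    # the self-test's rejected text
    "exit code 2 on bad input",                           # made-up near miss
]
for text in samples:
    print(names_error_bullet(text), "-", text)
```

The third sample is rejected because `exit[ `]*2` only allows spaces or backticks between `exit` and `2`; a bullet phrased as "exit code 2" would need one of the other keywords to pass.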
@@ -449,6 +481,25 @@ def run_self_test() -> int:
      (devlyn / "risk-probes.jsonl").write_text(json.dumps({
          "id": "P3",
          "derived_from": "probe must pass visible marker.",
+         "cmd": "printf bad-error-derived-from",
+         "exit_code": 0,
+         "tags": ["error_contract"],
+         "tag_evidence": {"error_contract": []},
+     }) + "\n")
+     bad_error_ref = subprocess.run(
+         [sys.executable, script_path, "--validate-risk-probes"],
+         cwd=work,
+         env=env,
+         capture_output=True,
+         text=True,
+     )
+     if bad_error_ref.returncode == 0:
+         print("error_contract with unrelated derived_from was accepted", file=sys.stderr)
+         return 1
+
+     (devlyn / "risk-probes.jsonl").write_text(json.dumps({
+         "id": "P4",
+         "derived_from": "probe must pass visible marker.",
          "cmd": "printf weak-boundary",
          "exit_code": 0,
          "tags": ["boundary_overlap"],
@@ -103,9 +103,16 @@ fixture/verifier paths, previous findings, and harness docs unless excerpted.
  Output: `.devlyn/risk-probes.jsonl`, 1 to 3 JSONL entries. Each entry must be
  one verification command shape plus `id`, `derived_from`, `tags`, and
  `tag_evidence`, where `derived_from` is an exact substring of the visible
- `## Verification` section. `tag_evidence` must prove high-risk tags with the
- evidence markers enforced by `spec-verify-check.py`; a tag-only probe is
- malformed.
+ `## Verification` bullet the command directly exercises. `tag_evidence` must be
+ a JSON object keyed by tag, with marker arrays as values; a top-level array or
+ tag-only probe is malformed. `ordering_inversion` must include
+ `input_order_would_choose_wrong_winner` and `asserts_processing_order_result`;
+ `prior_consumption` must include `same_resource_consumed_first` and
+ `later_entity_fails_or_reroutes`; `stdout_stderr_contract` and `shape_contract`
+ do not require marker strings. Cart/pricing success probes should use
+ `shape_contract` unless they satisfy the `ordering_inversion` markers. The probe
+ command must not reference external network URLs; use only worktree-local or
+ localhost resources.
  For high-complexity specs with multiple behavior bullets, at least one probe
  must be compound: it must exercise two or more visible verification bullets in a
  single command. Empty output is invalid when `--risk-probes` is set.
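As a concrete reading of the entry shape described above, the sketch below builds one `risk-probes.jsonl` line and reports which required markers are missing. It is an illustration under stated assumptions, not the validation in `spec-verify-check.py`: the `missing_markers` helper, the `cmd` value, and the placeholder `derived_from` text are hypothetical, and only the marker names and field names come from the SKILL.md text and the self-test hunk.

```python
import json

# Marker requirements quoted from the SKILL.md text; tags not named there are omitted.
REQUIRED_MARKERS = {
    "ordering_inversion": {
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    },
    "prior_consumption": {
        "same_resource_consumed_first",
        "later_entity_fails_or_reroutes",
    },
    "stdout_stderr_contract": set(),
    "shape_contract": set(),
}


def missing_markers(entry: dict) -> dict[str, list[str]]:
    """Map each tag to the required markers its tag_evidence lacks."""
    evidence = entry.get("tag_evidence")
    if not isinstance(evidence, dict):
        raise ValueError("tag_evidence must be an object keyed by tag")
    gaps: dict[str, list[str]] = {}
    for tag in entry.get("tags", []):
        have = set(evidence.get(tag, []))
        lacking = sorted(REQUIRED_MARKERS.get(tag, set()) - have)
        if lacking:
            gaps[tag] = lacking
    return gaps


entry = {
    "id": "P1",
    "derived_from": "<exact substring of a ## Verification bullet>",  # placeholder
    "cmd": "node bin/cli.js schedule jobs.json",  # hypothetical command
    "exit_code": 0,
    "tags": ["ordering_inversion"],
    "tag_evidence": {
        "ordering_inversion": ["input_order_would_choose_wrong_winner"],
    },
}
print(json.dumps(entry))       # one JSONL line
print(missing_markers(entry))  # {'ordering_inversion': ['asserts_processing_order_result']}
```

Keying `tag_evidence` by tag keeps each marker list attached to the claim it supports, which is presumably why a top-level array is treated as malformed.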
@@ -116,7 +123,7 @@ Invocation contract when OTHER engine is Codex:

  - Invoke Codex only through the monitored wrapper path in `CODEX_MONITORED_PATH`,
  or `.claude/skills/_shared/codex-monitored.sh` when the env var is absent:
- `bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=xhigh "<probe prompt>"`.
+ `bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
  - Do not run `codex`, `codex exec`, `/Users/.../codex`, or a plugin-provided
  Codex binary directly. A raw Codex child can outlive the phase and makes the
  benchmark run invalid even if `.devlyn/risk-probes.jsonl` is written.
@@ -201,7 +208,7 @@ second pair judge." The only valid skip reasons after a non-empty eligible
  trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or Codex
  unavailability proven by the invocation layer.
 
- Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. 
It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
+ Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. 
When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
 
  Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` directly with `-c model_reasoning_effort=medium` for this bounded two-probe review, without piping to `tail`/`head`/`grep`, capture stdout/stderr by direct tool capture or file redirection, require JSONL findings on stdout, and have the orchestrator write `.devlyn/verify.pair.findings.jsonl`. If stdout is first captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; that script is the deterministic boundary writer for `.devlyn/verify.pair.findings.jsonl`. Raw stdout remains diagnostic only: if stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
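The merge rule quoted in this hunk can be sketched as follows. This is a minimal illustration, not the contents of `verify-merge-findings.py`: the `Finding` class, its `confidence` and `behavioral_regression` fields, and the `PASS` value for the no-binding-finding case are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical record shape; the real JSONL schema is not shown in this diff.
    severity: str                        # "CRITICAL", "HIGH", "MEDIUM", or "LOW"
    confidence: str = "low"              # assumed field: "high" or "low"
    behavioral_regression: bool = False  # assumed field


def is_verdict_binding(finding: Finding) -> bool:
    """HIGH/CRITICAL always binds; MEDIUM binds only when it is a
    high-confidence, concrete behavioral regression."""
    if finding.severity in {"HIGH", "CRITICAL"}:
        return True
    return (
        finding.severity == "MEDIUM"
        and finding.confidence == "high"
        and finding.behavioral_regression
    )


def merge_verdict(primary: Iterable[Finding],
                  pair: Optional[Iterable[Finding]]) -> str:
    """Merge both judges' findings; `pair` is None when the second judge was skipped."""
    findings = list(primary) + list(pair or [])
    # "PASS" as the non-binding outcome is an assumption for this sketch.
    return "NEEDS_WORK" if any(is_verdict_binding(f) for f in findings) else "PASS"
```

Advisory lower-severity findings still pass through `findings` here, so a caller could log cross-model disagreement without it affecting the returned verdict.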
 
@@ -82,7 +82,8 @@ observable success, not by internal reasoning.
 
  Each probe must run entirely from the worktree with standard shell/Node/Python
  tools already present in the repo. Use inline temp-file scripts when needed.
- Leave no tracked files behind.
+ Leave no tracked files behind. Probe commands must not call external network
+ APIs or write to external memory/telemetry services.
  </task>
 
  <output>
@@ -94,13 +95,20 @@ Write `.devlyn/risk-probes.jsonl`. Each line is one JSON object:
 
  Rules:
  - `derived_from` must be an exact substring of the visible `## Verification`
- section.
+ bullet that the command directly exercises. For `error_contract`, use the
+ invalid-input/stderr/JSON-error/exit-2 bullet, not a generic test-runner
+ bullet.
  - `tags` is required. Use only these shape tags:
  `ordering_inversion`, `boundary_overlap`, `prior_consumption`,
  `rollback_state`, `positive_remaining`, `stdout_stderr_contract`,
  `error_contract`, `shape_contract`.
- - `tag_evidence` is required. For these tags, include every listed evidence
- marker and make the command actually exercise it:
+ - `tag_evidence` is required and must be a JSON object keyed by tag, never a
+ top-level array. For these tags, include every listed evidence marker in the
+ tag's array and make the command actually exercise it:
+ - Do not emit a shape tag unless the visible `## Verification` text names that
+ kind of risk and the command exercises it. In particular, `boundary_overlap`
+ is only for visible blocked-interval/window/overlap boundary semantics; do not
+ use it for inventory, warehouse, or generic resource constraints.
  - `ordering_inversion`: `input_order_would_choose_wrong_winner`,
  `asserts_processing_order_result`.
  - `boundary_overlap`: `starts_at_blocked_start`, `ends_at_blocked_end`,
@@ -114,7 +122,18 @@ Rules:
  Tags not listed here may use an empty evidence list or be omitted from
  `tag_evidence`.
  - `cmd` must not reference `BENCH_FIXTURE_DIR`, `verifiers/`, benchmark fixture
- paths, hidden oracle files, or files outside the worktree.
+ paths, hidden oracle files, external URLs, or files outside the worktree.
+ Localhost URLs are allowed only when the visible verification command needs a
+ local server.
+ - Match the spec's visible input and output key names literally; do not invent
+ aliases such as `stock` for `lots`, `order_id` for `id`, or `warehouse_id`
+ for `warehouse`.
+ - For cart/pricing specs whose visible verification covers duplicate combining,
+ multiple line-promotion types, tax, coupon, and shipping, the compound success
+ probe must include interleaved duplicate SKUs plus taxable and non-taxable
+ items, then assert the full output object and item rows. Use `shape_contract`
+ for this probe unless the command also proves the required
+ `ordering_inversion` evidence markers.
  - Empty output is invalid when this phase is enabled. If no bounded executable
  probe can be derived, write one JSONL object whose command exits nonzero and
  whose `derived_from` names the blocking verification bullet; BUILD_GATE will
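The `derived_from`, `tags`, and `tag_evidence` rules in the two hunks above can be illustrated with a small checker. This is a sketch under stated assumptions, not the project's `spec-verify-check.py`: the function name `check_probe` and its error strings are hypothetical, and only the `ordering_inversion` evidence markers are encoded because the other tags' marker lists are cut off by the hunk boundary.

```python
import json

# Shape tags named in the rules above.
ALLOWED_TAGS = {
    "ordering_inversion", "boundary_overlap", "prior_consumption",
    "rollback_state", "positive_remaining", "stdout_stderr_contract",
    "error_contract", "shape_contract",
}

# Only the markers fully visible in the diff; other tags are left out on purpose.
REQUIRED_EVIDENCE = {
    "ordering_inversion": {
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    },
}


def check_probe(line: str, verification_text: str) -> list[str]:
    """Return rule violations for one JSONL line (an empty list means OK)."""
    probe = json.loads(line)
    errors = []

    derived = probe.get("derived_from")
    if not isinstance(derived, str) or not derived or derived not in verification_text:
        errors.append("derived_from is not an exact substring of ## Verification")

    tags = probe.get("tags")
    if not isinstance(tags, list):
        errors.append("tags is required")
        tags = []
    for tag in tags:
        if tag not in ALLOWED_TAGS:
            errors.append(f"unknown tag: {tag}")

    evidence = probe.get("tag_evidence")
    if not isinstance(evidence, dict):
        # The 2.2.2 wording rejects a top-level array here.
        errors.append("tag_evidence must be a JSON object keyed by tag")
        evidence = {}
    for tag in tags:
        missing = REQUIRED_EVIDENCE.get(tag, set()) - set(evidence.get(tag, []))
        if missing:
            errors.append(f"{tag} missing evidence: {sorted(missing)}")
    return errors
```

A record whose `tag_evidence` is a bare array fails twice in this sketch: once for the wrong container type, and once because the per-tag markers cannot be found.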
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "devlyn-cli",
- "version": "2.2.0",
+ "version": "2.2.2",
  "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
  "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
  "bin": {
@@ -24,20 +24,21 @@ bad() { printf ' %s✗%s %s\n' "$red" "$reset" "$1"; fail=1; }
  section "Check 1: No mcp__codex-cli__ outside _shared / archive"
  # Legal places: config/skills/_shared/codex-config.md (explicitly says "MCP is not used"),
  # archival snapshots, and tests.
- offenders=$(grep -RIln 'mcp__codex-cli__' \
+ offenders=$(git grep -Il -- 'mcp__codex-cli__' -- \
  config/skills \
  benchmark \
  README.md \
  CLAUDE.md \
- bin/ 2>/dev/null \
- | grep -v 'config/skills/_shared/codex-config.md' \
- | grep -v 'config/skills/roadmap-archival-workspace/' \
- | grep -v 'config/skills/devlyn:auto-resolve-workspace/' \
- | grep -v 'config/skills/devlyn:ideate-workspace/' \
- | grep -v 'config/skills/preflight-workspace/' \
- | grep -v 'benchmark/auto-resolve/external/' \
- | grep -v 'benchmark/auto-resolve/PILOT-RESULTS' \
- || true)
+ bin/ \
+ ':!config/skills/_shared/codex-config.md' \
+ ':!config/skills/roadmap-archival-workspace/**' \
+ ':!config/skills/devlyn:auto-resolve-workspace/**' \
+ ':!config/skills/devlyn:ideate-workspace/**' \
+ ':!config/skills/preflight-workspace/**' \
+ ':!benchmark/auto-resolve/external/**' \
+ ':!benchmark/auto-resolve/results/**' \
+ ':!benchmark/auto-resolve/PILOT-RESULTS*' \
+ 2>/dev/null || true)
  if [ -z "$offenders" ]; then
  ok "no MCP references in managed files"
  else
@@ -48,15 +49,20 @@ fi
  # 2. No "Requires Codex MCP" prose.
  # ---------------------------------------------------------------------------
  section "Check 2: No 'Requires Codex MCP' prose"
- offenders=$(grep -RIln 'Requires Codex MCP\|Codex MCP server\|Codex MCP available\|Codex MCP disconnected' \
- config/skills benchmark README.md CLAUDE.md bin/ 2>/dev/null \
- | grep -v 'config/skills/roadmap-archival-workspace/' \
- | grep -v 'config/skills/devlyn:auto-resolve-workspace/' \
- | grep -v 'config/skills/devlyn:ideate-workspace/' \
- | grep -v 'config/skills/preflight-workspace/' \
- | grep -v 'benchmark/auto-resolve/external/' \
- | grep -v 'benchmark/auto-resolve/PILOT-RESULTS' \
- || true)
+ offenders=$(git grep -Il -- 'Requires Codex MCP\|Codex MCP server\|Codex MCP available\|Codex MCP disconnected' -- \
+ config/skills \
+ benchmark \
+ README.md \
+ CLAUDE.md \
+ bin/ \
+ ':!config/skills/roadmap-archival-workspace/**' \
+ ':!config/skills/devlyn:auto-resolve-workspace/**' \
+ ':!config/skills/devlyn:ideate-workspace/**' \
+ ':!config/skills/preflight-workspace/**' \
+ ':!benchmark/auto-resolve/external/**' \
+ ':!benchmark/auto-resolve/results/**' \
+ ':!benchmark/auto-resolve/PILOT-RESULTS*' \
+ 2>/dev/null || true)
  if [ -z "$offenders" ]; then
  ok "no Codex MCP prose"
  else
@@ -203,6 +209,16 @@ else
  bad "spec-verify-check.py risk-probe self-test failed"
  fi
 
+ section "Check 6e: All-or-nothing probes prove mutable rollback"
+ probe_doc="config/skills/devlyn:resolve/references/phases/probe-derive.md"
+ if grep -Fq "pre-rejected by a whole-order availability shortcut" "$probe_doc" \
+ && grep -Fq "must allocate a scarce" "$probe_doc" \
+ && grep -Fq "must request the same scarce first-line SKU" "$probe_doc"; then
+ ok "all-or-nothing probe contract preserves mutable rollback evidence"
+ else
+ bad "$probe_doc — missing mutable rollback probe contract"
+ fi
+
  # ---------------------------------------------------------------------------
  # 8. CRITIC security sub-pass must be native, not Dual.
  # Catches the specific drift where a section updates but a cross-reference doesn't.
@@ -431,14 +447,15 @@ fi
  # version lives) pass while genuine stale references fail. Excluded scopes:
  # benchmark/auto-resolve/results/ (historical run artifacts, frozen) and
  # scripts/lint-skills.sh itself (carries the pattern in this check).
- stale=$(grep -RIn 'F9-e2e-ideate-to-preflight' \
+ stale=$(git grep -In -- 'F9-e2e-ideate-to-preflight' -- \
  config/skills \
  benchmark \
  scripts \
  CLAUDE.md \
- README.md 2>/dev/null \
+ README.md \
+ ':!benchmark/auto-resolve/results/**' \
+ 2>/dev/null \
  | grep -v '^benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/' \
- | grep -v '^benchmark/auto-resolve/results/' \
  | grep -v '^scripts/lint-skills\.sh:' \
  | grep -v 'fixtures/retired/F9-e2e-ideate-to-preflight' \
  || true)
@@ -1,24 +0,0 @@
- # F27 CLI gift card redemption
-
- ## Why this fixture exists
-
- F16 showed a valid full-pipeline pair lift when the solo arm implemented the
- happy path but missed the exact validation-error contract. F25 was rejected
- after an oracle correction made solo pass. F26 was rejected because solo reached
- the ceiling.
-
- F27 keeps the useful F16 shape but removes checkout tax complexity: success is
- straight integer aggregation, while the risk is the exact failure object after
- combining duplicate card redemption rows before balance validation.
-
- ## Pair expectation
-
- PLAN must preserve the order of aggregation before validation. IMPLEMENT must
- read `data/gift-cards.json` and keep all public amounts in integer cents.
- VERIFY should construct an adversarial request where two individually valid
- redemptions for the same card become invalid only after combination.
-
- ## Isolation
-
- F16 covers quote tax rules. F27 covers non-persistent balance redemption and
- exact validation shape after duplicate aggregation.
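The aggregate-then-validate ordering that the removed F27 notes describe can be sketched like this. It is an illustration only: the `redeem` function, the `card` and `amount_cents` row keys, and both result objects are hypothetical, since the fixture's spec and the schema of `data/gift-cards.json` are not part of this diff.

```python
from collections import defaultdict


def redeem(balances: dict, rows: list) -> dict:
    """Combine duplicate card rows first, then validate against balances.

    `balances` maps card id -> balance in integer cents (assumed shape).
    `rows` is a list of {"card": ..., "amount_cents": ...} dicts (assumed shape).
    """
    totals = defaultdict(int)
    for row in rows:
        totals[row["card"]] += row["amount_cents"]  # integer cents only

    # Validation runs on the combined totals, not on individual rows.
    overdrawn = sorted(
        card for card, total in totals.items() if total > balances.get(card, 0)
    )
    if overdrawn:
        return {"ok": False, "overdrawn": overdrawn}  # hypothetical failure shape
    return {"ok": True, "redeemed_cents": sum(totals.values())}
```

With a balance of 1000 cents, two rows of 600 cents for the same card each pass a per-row check but fail once combined, which is the adversarial request the fixture asks VERIFY to construct.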