npm - devlyn-cli - Versions diffs - 2.2.0 → 2.2.2 - Mend

devlyn-cli 2.2.0 → 2.2.2

Files changed (45)

package/benchmark/auto-resolve/README.md CHANGED

@@ -109,7 +109,8 @@ bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
 ```
 After a full-pipeline pair run has the calibrated arms (`bare`,
-`solo_claude`, `l2_gated`) plus a blind `judge.json`, gate it separately:
+`solo_claude`, `l2_gated` or `l2_risk_probes`) plus a blind `judge.json`, gate
+it separately:
 ```bash
 bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
@@ -143,10 +144,12 @@ python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
 ```
 This is the full-pipeline claim gate: each counted fixture must satisfy the
-headroom precondition (`bare <= 60`, `solo_claude <= 80`), the `l2_gated` arm
+headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
 must be clean, `pair_mode` must be true in the captured resolve state, and the
-blind judge must score `l2_gated` at least `--min-pair-margin` above
-`solo_claude`. When changing this gate, run:
+blind judge must score the pair arm at least `--min-pair-margin` above
+`solo_claude`. `l2_risk_probes` is the current measured pair arm for the
+F16/F25 gate: `20260509-f16-f25-combined-cartprobe-v2` passed with margins +21
+and +24, average pair/solo wall ratio 1.46x. When changing this gate, run:
 ```bash
 bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
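
The gate conditions described in the README hunk above can be sketched roughly as follows. This is not the code in `full-pipeline-pair-gate.py`; the `FixtureScores` class, its field names, and `pair_gate_failures` are hypothetical, and only the thresholds (`bare <= 60`, `solo_claude <= 80`) and the `--min-pair-margin` comparison come from the README text.

```python
from dataclasses import dataclass


@dataclass
class FixtureScores:
    """Blind-judge scores for one fixture (hypothetical field names)."""
    bare: int
    solo_claude: int
    pair: int             # score of the selected pair arm, e.g. l2_risk_probes
    pair_arm_clean: bool  # the selected pair arm finished clean
    pair_mode: bool       # value captured in the resolve state


def pair_gate_failures(s: FixtureScores, min_pair_margin: int) -> list[str]:
    """Return the reasons a fixture fails the gate; an empty list means it passes."""
    failures = []
    if s.bare > 60:
        failures.append("headroom: bare > 60")
    if s.solo_claude > 80:
        failures.append("headroom: solo_claude > 80")
    if not s.pair_arm_clean:
        failures.append("pair arm is not clean")
    if not s.pair_mode:
        failures.append("pair_mode is not true")
    if s.pair - s.solo_claude < min_pair_margin:
        failures.append("pair margin below --min-pair-margin")
    return failures


# Made-up scores for illustration only; the margin here is 91 - 70 = +21.
print(pair_gate_failures(FixtureScores(50, 70, 91, True, True), min_pair_margin=5))
```

The margin threshold is passed in explicitly because the diff does not show a default for `--min-pair-margin`.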

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md CHANGED

@@ -28,7 +28,6 @@ This is a low-risk edit used to calibrate trivial-tier fixture difficulty.
 - **No silent catches.** If an unknown flag is passed, exit 1 with an informative message (same pattern as the existing `--name` handler).
 - **Surgical diff.** Only touch `bin/cli.js` and `tests/cli.test.js`. Do not reformat unrelated code.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/spec.md CHANGED

@@ -35,7 +35,6 @@ multiple POSTs arrive close together — no duplicate ids, no lost writes.
 - **No silent catches.** Any `try/catch` in the write path must surface failure as `500` with a clear body, not return a fake success.
 - **No hardcoded ids.** Existing baseline ids (1, 2) remain valid; new ids must not collide with any past or present id.
 - **No breaking change** to `GET /items` shape or `GET /items/:id` semantics.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/spec.md CHANGED

@@ -34,7 +34,6 @@ and the stored list is left exactly as it was before the request.
 - **No silent catches.**
 - **No partial updates.** A batch with N items must produce either N inserts or 0 inserts.
 - **No breaking change** to existing `GET /items` and `GET /items/:id`.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/spec.md CHANGED

@@ -37,7 +37,6 @@ no trailing newline).
 - **No silent catches.** Errors in the verification path surface as `500` with a clear body.
 - **Use `crypto.timingSafeEqual` for the signature comparison.** A non-constant-time `===` between hex strings leaks information about the true MAC byte-by-byte.
 - **No breaking change** to existing `/items`, `/items/:id`, `/health`.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/spec.md CHANGED

@@ -35,7 +35,6 @@ The implementation persists state to `data/items.json` and exposes:
 - **No new npm dependencies.** Fix using Express + Node built-ins only.
 - **No silent catches.** Errors surface with explicit status + body, not by returning a fake-success.
 - **Touch only `server/index.js` and `tests/server.test.js`.** Do not modify `data/items.json` shape, `tests/cli.test.js`, or anything outside the server.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md CHANGED

@@ -40,7 +40,6 @@ the output must be machine-readable.
 - **No floating-money output.** All public amounts are integer cents.
 - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md CHANGED

@@ -37,7 +37,6 @@ inside the CLI itself.
 - **HOME guard.** If `process.env.HOME` is undefined or empty, emit a clear FAIL line ("HOME environment variable is not set") and exit 1.
 - **EACCES handling.** If `readdirSync` fails with EACCES, emit a permission-specific message quoting the offending path. Do not silently return an empty list.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md CHANGED

@@ -42,7 +42,6 @@ failure reasons must be deterministic.
 - **No mutation of the input file.**
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
 - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/spec.md CHANGED

@@ -44,7 +44,6 @@ once, and duplicate ids must not silently corrupt balances.
 - **No mutation of the input file.**
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
 - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md CHANGED

@@ -50,7 +50,6 @@ orders must be deterministic.
 - **No mutation of the input file.**
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
 - **Touch only `bin/cli.js` and `tests/cli.test.js`.**
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md CHANGED

@@ -43,7 +43,6 @@ and stdout must stay machine-readable.
 - **No floating-money output.** All public amounts are integer cents.
 - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/spec.md CHANGED

@@ -45,7 +45,6 @@ public amount must be integer cents.
 - **No floating-money output.** All public amounts are integer cents.
 - **No silent catches.** If parsing or file reading fails, emit a visible JSON error to stderr and exit `2`.
 - **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md CHANGED

@@ -37,7 +37,6 @@ so existing assertions continue to pass alongside new paging assertions.
 - **No breaking change to `/items/:id`.** The per-item route must keep its current contract (the fixture explicitly does NOT paginate single-item lookups).
 - **Backward-compat note**: clients that previously read `response.items` MUST still get the array at the same key inside the new envelope.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md CHANGED

@@ -31,7 +31,6 @@ and italicized — using only the page's own CSS/JS.
 - **No inline JS frameworks.** Stick to the vanilla pattern already in `index.html`.
 - **Accessibility.** Both buttons must have accessible names equal to their visible labels; `#whisper` adds `aria-label="whisper"` only if its visible text differs (it doesn't, so leave it off).
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md CHANGED

@@ -31,7 +31,6 @@ Implement it so every test passes.
 - **Do not modify `tests/count.test.js`.** If a test looks wrong, that's a signal to revisit the implementation, not the test.
 - **No silent catches.** Errors reading stdin must surface with a clear message (not suppressed).
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md CHANGED

@@ -30,7 +30,6 @@ already provides everything needed; no external dependency is warranted.
 - **Stream-friendly.** Large files should not be read fully into memory. Use a hash stream (`crypto.createHash('sha256')` + pipe from `fs.createReadStream`).
 - **No silent catches.** File I/O errors must surface with an informative message and the appropriate exit code.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md CHANGED

@@ -27,7 +27,6 @@ version without string manipulation. Add a `--format json` flag that makes
 - **Touch only `bin/cli.js` (`version` handler + argument parsing) and `tests/cli.test.js` (new test).** Do NOT modify the `hello` subcommand or any other file.
 - **No silent catches.** Unknown `--format` values must surface an error.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md CHANGED

@@ -42,9 +42,6 @@ inside `/devlyn:resolve` (no separate preflight skill in the 2-skill design).
 - **No silent catches.**
 - **Non-git-repo handling.** Do not assume the user is always in a repo.
-- **Lifecycle note.** The harness's CLEANUP/VERIFY phases may flip this
-  spec's frontmatter `status` after implementation completes — that is
-  benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py CHANGED

@@ -154,20 +154,9 @@ def analyze(work_dir, scaffold_sha, waivers, fixture_id=None):
     findings = []
     entries = git_diff_status(scaffold_sha, work_dir)
-    # Structural exemption: every benchmark fixture has its own spec at
-    # docs/roadmap/phase-*/<fixture_id>.md, and auto-resolve's DOCS phase
-    # Job 1 legitimately flips its frontmatter status. That flip is a
-    # skill feature, not a scope violation — always exempt regardless of
-    # per-fixture waivers.
-    own_spec_globs = []
-    if fixture_id:
-        own_spec_globs.append(f"docs/roadmap/phase-*/{fixture_id}.md")
     for status, path in entries:
         if is_waived(path, waivers):
             continue
-        if is_waived(path, own_spec_globs):
-            continue
         # Lockfile deletion — only when file existed at scaffold.
         if status == "D" and os.path.basename(path) in LOCKFILE_NAMES:

package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py CHANGED

@@ -173,22 +173,12 @@ def analyze(work_dir_str: str, scaffold_sha: str, tier_c_globs, waivers,
     reachable = bfs_trace(seeds, work_dir)
-    # Structural exemption: the fixture's own spec file at
-    # docs/roadmap/phase-*/<fixture_id>.md is always authorized — DOCS
-    # phase Job 1 flips its frontmatter status by design. Kept in sync
-    # with oracle-scope-tier-a.py.
-    own_spec_globs = []
-    if fixture_id:
-        own_spec_globs.append(f"docs/roadmap/phase-*/{fixture_id}.md")
     findings = []
     for path in sorted(touched):
         if match_any(path, tier_c_globs):
             continue
         if match_any(path, waivers):
             continue
-        if match_any(path, own_spec_globs):
-            continue
         if path in reachable:
             depth, via = reachable[path]
             findings.append({

package/benchmark/auto-resolve/scripts/run-fixture.sh CHANGED

@@ -595,8 +595,7 @@ fi
 (cd "$WORK_DIR" \
    && git diff "$SCAFFOLD_SHA" --name-only) > "$RESULT_DIR/changed-files.txt" 2>&1 || true
-# Deterministic oracles (step 1+ of the benchmark-extension plan).
-# Findings-only at this stage; scoring integration is step 5.
+# Deterministic oracles. Hard/flag findings are merged into verify.json below.
 python3 "$BENCH_ROOT/scripts/oracle-test-fidelity.py" \
   --work "$WORK_DIR" --scaffold "$SCAFFOLD_SHA" \
   > "$RESULT_DIR/oracle-test-fidelity.json" 2>/dev/null || \
@@ -670,7 +669,8 @@ verify_env["BENCH_FIXTURE_DIR"] = os.path.dirname(os.path.abspath(sys.argv[1]))
 verify = {"commands": [], "forbidden_pattern_hits": [], "deps_added": 0,
           "max_deps_added": expected.get("max_deps_added", 0),
-          "missing_required_files": [], "forbidden_files_present": []}
+          "missing_required_files": [], "forbidden_files_present": [],
+          "oracle_findings": [], "oracle_disqualifier": False}
 for vc in expected.get("verification_commands", []):
     try:
@@ -766,11 +766,29 @@ verify["commands_passed"] = passed
 verify["commands_total"] = total
 verify["verify_score"] = (passed / total) if total else 1.0
+for oracle_file in (
+    "oracle-scope-tier-a.json",
+    "oracle-scope-tier-b.json",
+    "oracle-test-fidelity.json",
+):
+    try:
+        data = json.load(open(os.path.join(result_dir, oracle_file)))
+    except Exception:
+        continue
+    oracle_name = data.get("oracle") or oracle_file.removesuffix(".json")
+    for finding in data.get("findings", []) or []:
+        item = dict(finding)
+        item["oracle"] = oracle_name
+        verify["oracle_findings"].append(item)
+        if item.get("severity") in ("disqualifier", "hard", "flag"):
+            verify["oracle_disqualifier"] = True
 verify["disqualifier"] = (
     any(h["severity"] == "disqualifier" for h in verify["forbidden_pattern_hits"])
     or verify["deps_added"] > verify["max_deps_added"]
     or bool(verify["missing_required_files"])
     or bool(verify["forbidden_files_present"])
+    or verify["oracle_disqualifier"]
 )
 json.dump(verify, open(os.path.join(result_dir, "verify.json"), "w"), indent=2)
@@ -861,6 +879,8 @@ result = {
     "arm": arm,
     "run_id": run_id,
     "disqualifier": verify.get("disqualifier", False),
+    "oracle_disqualifier": verify.get("oracle_disqualifier", False),
+    "oracle_findings_count": len(verify.get("oracle_findings", [])),
     "verify_score": verify.get("verify_score", 0.0),
     "commands_passed": verify.get("commands_passed", 0),
     "commands_total": verify.get("commands_total", 0),
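
The oracle merge added to `run-fixture.sh` can be read in isolation as the sketch below. It mirrors the loop in the hunk above but takes already-parsed JSON objects instead of reading files from `result_dir`; the function name `merge_oracle_findings`, the constant name, and the sample payloads are assumptions made for illustration. Note that `str.removesuffix` requires Python 3.9 or newer.

```python
BLOCKING_SEVERITIES = ("disqualifier", "hard", "flag")


def merge_oracle_findings(payloads: dict[str, dict]) -> tuple[list[dict], bool]:
    """Merge findings from parsed oracle outputs, keyed by oracle file name.

    Returns the tagged findings and whether any of them is blocking.
    """
    merged: list[dict] = []
    disqualified = False
    for file_name, data in payloads.items():
        # Fall back to the file name when the payload does not name its oracle.
        oracle_name = data.get("oracle") or file_name.removesuffix(".json")
        for finding in data.get("findings", []) or []:
            item = dict(finding)  # copy so the input payload is not mutated
            item["oracle"] = oracle_name
            merged.append(item)
            if item.get("severity") in BLOCKING_SEVERITIES:
                disqualified = True
    return merged, disqualified


# Hypothetical payloads; real oracle output may carry more fields.
example = {
    "oracle-scope-tier-a.json": {
        "findings": [{"path": "README.md", "severity": "flag"}],
    },
    "oracle-test-fidelity.json": {"oracle": "test-fidelity", "findings": None},
}
findings, disqualified = merge_oracle_findings(example)
print(findings)      # [{'path': 'README.md', 'severity': 'flag', 'oracle': 'oracle-scope-tier-a'}]
print(disqualified)  # True
```

As in the hunk, a single finding with severity `flag` is enough to set the disqualifier, so `flag` is treated the same as `hard` here.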

package/config/skills/_shared/spec-verify-check.py CHANGED

@@ -77,6 +77,14 @@ JSON_FENCE_RE = re.compile(r'(?ms)^```json[ \t]*\n(.*?)\n```[ \t]*$')
 FORBIDDEN_RISK_PROBE_CMD_RE = re.compile(
     r'BENCH_FIXTURE_DIR|benchmark/auto-resolve/fixtures|/verifiers/|verifiers/'
 )
+EXTERNAL_URL_RE = re.compile(r"https?://([^/\s\"']+)", re.IGNORECASE)
+LOCAL_URL_HOSTS = {
+    'localhost',
+    '127.0.0.1',
+    '0.0.0.0',
+    '[::1]',
+    '::1',
+}
 RISK_PROBE_TAGS = {
     "ordering_inversion",
     "boundary_overlap",
@@ -131,6 +139,15 @@ def extract_verification_text(text: str) -> str:
     return section.group(1) if section else ""
+def external_url_hosts(text: str) -> list[str]:
+    hosts: list[str] = []
+    for match in EXTERNAL_URL_RE.finditer(text or ''):
+        host = match.group(1).split('@')[-1].split(':')[0].lower()
+        if host not in LOCAL_URL_HOSTS and host not in hosts:
+            hosts.append(host)
+    return hosts
 def validate_shape(data) -> str | None:
     """Return None if shape matches the canonical verification_commands
     schema; else a human-readable error string.
@@ -189,6 +206,12 @@ def validate_risk_probe(probe: object, index: int, verification_text: str) -> st
             f"risk-probes[{index}].cmd references hidden fixture/verifier paths; "
             "risk probes must derive from visible spec text only"
         )
+    external_hosts = external_url_hosts(cmd)
+    if external_hosts:
+        return (
+            f"risk-probes[{index}].cmd references external URL(s): "
+            f"{', '.join(external_hosts)}; use only worktree-local or localhost resources"
+        )
     if len(cmd) > 4000:
         return f"risk-probes[{index}].cmd exceeds 4000 characters"
     tags = probe.get("tags")
@@ -197,6 +220,15 @@ def validate_risk_probe(probe: object, index: int, verification_text: str) -> st
     unknown_tags = sorted(set(tags) - RISK_PROBE_TAGS)
     if unknown_tags:
         return f"risk-probes[{index}].tags contains unknown tag(s): {', '.join(unknown_tags)}"
+    if "error_contract" in tags and not re.search(
+        r'invalid|stderr|json[ -]?error|error object|exit[ `]*2',
+        derived_from,
+        re.IGNORECASE,
+    ):
+        return (
+            f"risk-probes[{index}].derived_from for error_contract must name "
+            "an invalid-input, stderr, JSON-error, or exit-2 verification bullet"
+        )
     evidence = probe.get("tag_evidence")
     if not isinstance(evidence, dict):
         return f"risk-probes[{index}].tag_evidence must be an object"
@@ -449,6 +481,25 @@ def run_self_test() -> int:
         (devlyn / "risk-probes.jsonl").write_text(json.dumps({
             "id": "P3",
             "derived_from": "probe must pass visible marker.",
+            "cmd": "printf bad-error-derived-from",
+            "exit_code": 0,
+            "tags": ["error_contract"],
+            "tag_evidence": {"error_contract": []},
+        }) + "\n")
+        bad_error_ref = subprocess.run(
+            [sys.executable, script_path, "--validate-risk-probes"],
+            cwd=work,
+            env=env,
+            capture_output=True,
+            text=True,
+        )
+        if bad_error_ref.returncode == 0:
+            print("error_contract with unrelated derived_from was accepted", file=sys.stderr)
+            return 1
+        (devlyn / "risk-probes.jsonl").write_text(json.dumps({
+            "id": "P4",
+            "derived_from": "probe must pass visible marker.",
             "cmd": "printf weak-boundary",
             "exit_code": 0,
             "tags": ["boundary_overlap"],
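
The two checks added to `spec-verify-check.py` can be exercised on their own with the sketch below. `external_url_hosts`, `EXTERNAL_URL_RE`, `LOCAL_URL_HOSTS`, and the `error_contract` pattern are copied from the hunks above; the names `ERROR_CONTRACT_REF_RE` and `error_contract_ref_ok` are hypothetical wrappers, and the sample commands are made up.

```python
import re

EXTERNAL_URL_RE = re.compile(r"https?://([^/\s\"']+)", re.IGNORECASE)
LOCAL_URL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "[::1]", "::1"}
# Same pattern as the inline re.search in validate_risk_probe.
ERROR_CONTRACT_REF_RE = re.compile(
    r"invalid|stderr|json[ -]?error|error object|exit[ `]*2", re.IGNORECASE
)


def external_url_hosts(text: str) -> list[str]:
    """Return the distinct non-local hosts of http(s) URLs found in text."""
    hosts: list[str] = []
    for match in EXTERNAL_URL_RE.finditer(text or ""):
        # Drop any userinfo before '@', then cut at the first ':' (port).
        host = match.group(1).split("@")[-1].split(":")[0].lower()
        if host not in LOCAL_URL_HOSTS and host not in hosts:
            hosts.append(host)
    return hosts


def error_contract_ref_ok(derived_from: str) -> bool:
    """Hypothetical helper: does derived_from name an error-path bullet?"""
    return bool(ERROR_CONTRACT_REF_RE.search(derived_from))


print(external_url_hosts("curl http://localhost:3000/items"))      # []
print(external_url_hosts("curl https://API.example.com/v1"))       # ['api.example.com']
print(external_url_hosts("curl http://user:pw@evil.test:8443/x"))  # ['evil.test']
print(external_url_hosts("curl http://[::1]:8080/health"))         # ['[']
print(error_contract_ref_ok("probe must pass visible marker."))    # False
print(error_contract_ref_ok("emit a visible JSON error to stderr and exit `2`"))  # True
```

The fourth call suggests one behavior worth checking against the real script: because the host is cut at the first `:`, a bracketed IPv6 loopback URL is reduced to `[`, so the `[::1]` entry in `LOCAL_URL_HOSTS` does not match it and the probe would be reported as external.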

package/config/skills/devlyn:resolve/SKILL.md CHANGED

@@ -103,9 +103,16 @@ fixture/verifier paths, previous findings, and harness docs unless excerpted.
 Output: `.devlyn/risk-probes.jsonl`, 1 to 3 JSONL entries. Each entry must be
 one verification command shape plus `id`, `derived_from`, `tags`, and
 `tag_evidence`, where `derived_from` is an exact substring of the visible
-`## Verification` section. `tag_evidence` must prove high-risk tags with the
-evidence markers enforced by `spec-verify-check.py`; a tag-only probe is
-malformed.
+`## Verification` bullet the command directly exercises. `tag_evidence` must be
+a JSON object keyed by tag, with marker arrays as values; a top-level array or
+tag-only probe is malformed. `ordering_inversion` must include
+`input_order_would_choose_wrong_winner` and `asserts_processing_order_result`;
+`prior_consumption` must include `same_resource_consumed_first` and
+`later_entity_fails_or_reroutes`; `stdout_stderr_contract` and `shape_contract`
+do not require marker strings. Cart/pricing success probes should use
+`shape_contract` unless they satisfy the `ordering_inversion` markers. The probe
+command must not reference external network URLs; use only worktree-local or
+localhost resources.
 For high-complexity specs with multiple behavior bullets, at least one probe
 must be compound: it must exercise two or more visible verification bullets in a
 single command. Empty output is invalid when `--risk-probes` is set.
@@ -116,7 +123,7 @@ Invocation contract when OTHER engine is Codex:
 - Invoke Codex only through the monitored wrapper path in `CODEX_MONITORED_PATH`,
   or `.claude/skills/_shared/codex-monitored.sh` when the env var is absent:
-  `bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=xhigh "<probe prompt>"`.
+  `bash "$CODEX_MONITORED_PATH" -C "$PWD" --full-auto -c model_reasoning_effort=high "<probe prompt>"`.
 - Do not run `codex`, `codex exec`, `/Users/.../codex`, or a plugin-provided
   Codex binary directly. A raw Codex child can outlive the phase and makes the
   benchmark run invalid even if `.devlyn/risk-probes.jsonl` is written.
@@ -201,7 +208,7 @@ second pair judge." The only valid skip reasons after a non-empty eligible
 trigger are deterministic MECHANICAL HIGH/CRITICAL blockers or Codex
 unavailability proven by the invocation layer.
-Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
+Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; the second judge is a bounded adversarial complement, not a duplicate broad audit. The primary judge owns broad coverage; pair-JUDGE targets the two highest-risk explicit `## Verification` bullets that cross state mutation, all-or-nothing rollback, ordering, idempotency, auth, or error-priority clauses. It must not read `.claude/skills`, `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the orchestrator pasted a specific excerpt into the prompt. It may use only the spec, diff, implementation files, tests, and the repo's existing CLI/API/test runner. It may execute at most two targeted probes before first output, and each probe must compare the full externally visible result (exit/stdout/stderr plus full parsed output object, including accepted/scheduled rows, rejected rows, and remaining state when present), not just a single property. For priority/stateful specs, at least one probe must include an earlier input entity that would succeed under input-order processing, a later higher-priority entity that consumes or blocks the critical resource, and a failure/blocked/rollback edge that determines a later entity's state. For cart/pricing specs where visible verification combines duplicate items, line promotions, tax, coupon, and shipping, the success-path probe must include interleaved duplicates plus taxable and non-taxable items and assert full output rows. Scope qualifiers are binding: pair-JUDGE must not reinterpret `inside a warehouse`, `per resource`, or line-scoped rules as global rules. When both priority ordering and rollback/blocked-interval behavior appear in the spec, this dominance-loss probe is mandatory and comes before any other probe: an earlier lower-priority entity that would succeed alone or under input-order processing must lose because a later higher-priority entity is processed first; a failed/blocked middle entity must not corrupt later state; and the assertion must cover complete accepted/scheduled and rejected output ordering. It must stop and emit JSONL immediately on the first verdict-binding finding, and must emit PASS immediately if both probes plus static scope/dependency checks pass. Both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is verdict-binding; high-confidence MEDIUM findings are also verdict-binding when they identify a concrete behavioral regression against the spec, public contract, or existing test contract." Cross-model disagreement on advisory lower-severity findings is logged but does not change the verdict. If MECHANICAL has a HIGH/CRITICAL finding, skip the second judge and record `pair_judge: null`; the fix loop needs the deterministic finding, not duplicate review.
 Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** Codex pair-JUDGE is read-only: invoke `codex-monitored.sh` directly with `-c model_reasoning_effort=medium` for this bounded two-probe review, without piping to `tail`/`head`/`grep`, capture stdout/stderr by direct tool capture or file redirection, require JSONL findings on stdout, and have the orchestrator write `.devlyn/verify.pair.findings.jsonl`. If stdout is first captured as `.devlyn/codex-judge.stdout`, run `python3 .claude/skills/_shared/collect-codex-findings.py` before merge; that script is the deterministic boundary writer for `.devlyn/verify.pair.findings.jsonl`. Raw stdout remains diagnostic only: if stdout contains findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is empty, `verify-merge-findings.py` blocks VERIFY for `verify.pair.emission-contract`. Do not ask Codex to `apply_patch` or edit `.devlyn`. After primary and pair findings are written, run `python3 .claude/skills/_shared/verify-merge-findings.py --write-state`. Branch only on the merged `state.phases.verify.verdict`; a HIGH/CRITICAL finding from either judge must mechanically become `NEEDS_WORK`. Never write `.devlyn/verify-merged.findings.jsonl` or `.devlyn/verify-merge.summary.json` by hand; `verify-merge-findings.py` is their only writer. State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
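
The merge rule quoted in the Pair-mode JUDGE paragraph can be sketched as below. This is not `verify-merge-findings.py`, whose contents are not part of this diff; the function name and the finding fields `severity`, `confidence`, and `behavioral_regression` are assumptions, as is the use of `PASS` for the non-blocking verdict.

```python
def merged_verdict(primary: list[dict], pair: list[dict]) -> str:
    """Combine findings from both judges into one verdict (illustrative only)."""
    for finding in [*primary, *pair]:
        severity = str(finding.get("severity", "")).upper()
        # Any HIGH/CRITICAL finding from either judge is verdict-binding.
        if severity in ("HIGH", "CRITICAL"):
            return "NEEDS_WORK"
        # High-confidence MEDIUM findings bind only when they identify a
        # concrete behavioral regression.
        if (
            severity == "MEDIUM"
            and finding.get("confidence") == "high"
            and finding.get("behavioral_regression")
        ):
            return "NEEDS_WORK"
    # Lower-severity disagreement is advisory and does not change the verdict.
    return "PASS"


print(merged_verdict([{"severity": "LOW"}], [{"severity": "HIGH"}]))  # NEEDS_WORK
print(merged_verdict([{"severity": "MEDIUM", "confidence": "low"}], []))  # PASS
```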

package/config/skills/devlyn:resolve/references/phases/probe-derive.md CHANGED

@@ -82,7 +82,8 @@ observable success, not by internal reasoning.
 Each probe must run entirely from the worktree with standard shell/Node/Python
 tools already present in the repo. Use inline temp-file scripts when needed.
-Leave no tracked files behind.
+Leave no tracked files behind. Probe commands must not call external network
+APIs or write to external memory/telemetry services.
 </task>
 <output>
@@ -94,13 +95,20 @@ Write `.devlyn/risk-probes.jsonl`. Each line is one JSON object:
 Rules:
 - `derived_from` must be an exact substring of the visible `## Verification`
-  section.
+  bullet that the command directly exercises. For `error_contract`, use the
+  invalid-input/stderr/JSON-error/exit-2 bullet, not a generic test-runner
+  bullet.
 - `tags` is required. Use only these shape tags:
   `ordering_inversion`, `boundary_overlap`, `prior_consumption`,
   `rollback_state`, `positive_remaining`, `stdout_stderr_contract`,
   `error_contract`, `shape_contract`.
-- `tag_evidence` is required. For these tags, include every listed evidence
-  marker and make the command actually exercise it:
+- `tag_evidence` is required and must be a JSON object keyed by tag, never a
+  top-level array. For these tags, include every listed evidence marker in the
+  tag's array and make the command actually exercise it:
+- Do not emit a shape tag unless the visible `## Verification` text names that
+  kind of risk and the command exercises it. In particular, `boundary_overlap`
+  is only for visible blocked-interval/window/overlap boundary semantics; do not
+  use it for inventory, warehouse, or generic resource constraints.
   - `ordering_inversion`: `input_order_would_choose_wrong_winner`,
     `asserts_processing_order_result`.
   - `boundary_overlap`: `starts_at_blocked_start`, `ends_at_blocked_end`,
@@ -114,7 +122,18 @@ Rules:
   Tags not listed here may use an empty evidence list or be omitted from
   `tag_evidence`.
 - `cmd` must not reference `BENCH_FIXTURE_DIR`, `verifiers/`, benchmark fixture
-  paths, hidden oracle files, or files outside the worktree.
+  paths, hidden oracle files, external URLs, or files outside the worktree.
+  Localhost URLs are allowed only when the visible verification command needs a
+  local server.
+- Match the spec's visible input and output key names literally; do not invent
+  aliases such as `stock` for `lots`, `order_id` for `id`, or `warehouse_id`
+  for `warehouse`.
+- For cart/pricing specs whose visible verification covers duplicate combining,
+  multiple line-promotion types, tax, coupon, and shipping, the compound success
+  probe must include interleaved duplicate SKUs plus taxable and non-taxable
+  items, then assert the full output object and item rows. Use `shape_contract`
+  for this probe unless the command also proves the required
+  `ordering_inversion` evidence markers.
 - Empty output is invalid when this phase is enabled. If no bounded executable
   probe can be derived, write one JSONL object whose command exits nonzero and
   whose `derived_from` names the blocking verification bullet; BUILD_GATE will
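
Several of the rules in this hunk are mechanical enough to check in code. The sketch below is a hypothetical validator, not the package's `spec-verify-check.py`. It covers only the rules fully visible in the hunk: the allowed tag list, `tag_evidence` as an object keyed by tag, the two `ordering_inversion` evidence markers, `derived_from` as an exact substring, and two of the forbidden `cmd` references. The `check_probe` name and the sample verification text are assumptions.

```python
import json

ALLOWED_TAGS = {
    "ordering_inversion", "boundary_overlap", "prior_consumption",
    "rollback_state", "positive_remaining", "stdout_stderr_contract",
    "error_contract", "shape_contract",
}
# Only the tag whose marker list is fully visible in the diff is modelled.
REQUIRED_EVIDENCE = {
    "ordering_inversion": [
        "input_order_would_choose_wrong_winner",
        "asserts_processing_order_result",
    ],
}
# A subset of the forbidden cmd references named in the diff.
FORBIDDEN_CMD_PARTS = ("BENCH_FIXTURE_DIR", "verifiers/")


def check_probe(line, verification_text):
    """Return a list of rule violations for one risk-probe JSONL line."""
    probe = json.loads(line)
    errors = []

    derived = probe.get("derived_from")
    if not isinstance(derived, str) or not derived or derived not in verification_text:
        errors.append("derived_from is not an exact substring of ## Verification")

    tags = probe.get("tags")
    if not isinstance(tags, list):
        errors.append("tags is required")
        tags = []
    for tag in tags:
        if not isinstance(tag, str) or tag not in ALLOWED_TAGS:
            errors.append(f"unknown shape tag: {tag!r}")

    evidence = probe.get("tag_evidence")
    if not isinstance(evidence, dict):
        errors.append("tag_evidence must be a JSON object keyed by tag")
        evidence = {}
    for tag, required in REQUIRED_EVIDENCE.items():
        if tag in tags:
            present = evidence.get(tag)
            if not isinstance(present, list):
                present = []
            missing = [marker for marker in required if marker not in present]
            if missing:
                errors.append(f"{tag} is missing evidence markers: {missing}")

    cmd = probe.get("cmd")
    if not isinstance(cmd, str):
        cmd = ""
    for part in FORBIDDEN_CMD_PARTS:
        if part in cmd:
            errors.append(f"cmd references forbidden path or variable: {part}")

    return errors


# Hypothetical verification text and probes, for illustration only.
verification = "- Orders are processed by priority, not input order."
good = json.dumps({
    "derived_from": "processed by priority, not input order",
    "tags": ["ordering_inversion"],
    "tag_evidence": {"ordering_inversion": REQUIRED_EVIDENCE["ordering_inversion"]},
    "cmd": "node check-order.js",
})
bad = json.dumps({
    "derived_from": "not in the spec",
    "tags": ["stock_check"],
    "tag_evidence": ["x"],
    "cmd": "cat verifiers/oracle.js",
})
print(check_probe(good, verification))
for problem in check_probe(bad, verification):
    print(problem)
```

The external-URL, localhost, and key-name rules are left out because they depend on the visible spec and cannot be reduced to a simple string check.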

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "2.2.0",
+  "version": "2.2.2",
   "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
   "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
   "bin": {

package/scripts/lint-skills.sh CHANGED

@@ -24,20 +24,21 @@ bad()     { printf '  %s✗%s %s\n' "$red"   "$reset" "$1"; fail=1; }
 section "Check 1: No mcp__codex-cli__ outside _shared / archive"
 # Legal places: config/skills/_shared/codex-config.md (explicitly says "MCP is not used"),
 # archival snapshots, and tests.
-offenders=$(grep -RIln 'mcp__codex-cli__' \
+offenders=$(git grep -Il -- 'mcp__codex-cli__' -- \
   config/skills \
   benchmark \
   README.md \
   CLAUDE.md \
-  bin/ 2>/dev/null \
-  | grep -v 'config/skills/_shared/codex-config.md' \
-  | grep -v 'config/skills/roadmap-archival-workspace/' \
-  | grep -v 'config/skills/devlyn:auto-resolve-workspace/' \
-  | grep -v 'config/skills/devlyn:ideate-workspace/' \
-  | grep -v 'config/skills/preflight-workspace/' \
-  | grep -v 'benchmark/auto-resolve/external/' \
-  | grep -v 'benchmark/auto-resolve/PILOT-RESULTS' \
-  || true)
+  bin/ \
+  ':!config/skills/_shared/codex-config.md' \
+  ':!config/skills/roadmap-archival-workspace/**' \
+  ':!config/skills/devlyn:auto-resolve-workspace/**' \
+  ':!config/skills/devlyn:ideate-workspace/**' \
+  ':!config/skills/preflight-workspace/**' \
+  ':!benchmark/auto-resolve/external/**' \
+  ':!benchmark/auto-resolve/results/**' \
+  ':!benchmark/auto-resolve/PILOT-RESULTS*' \
+  2>/dev/null || true)
 if [ -z "$offenders" ]; then
   ok "no MCP references in managed files"
 else
@@ -48,15 +49,20 @@ fi
 # 2. No "Requires Codex MCP" prose.
 # ---------------------------------------------------------------------------
 section "Check 2: No 'Requires Codex MCP' prose"
-offenders=$(grep -RIln 'Requires Codex MCP\|Codex MCP server\|Codex MCP available\|Codex MCP disconnected' \
-  config/skills benchmark README.md CLAUDE.md bin/ 2>/dev/null \
-  | grep -v 'config/skills/roadmap-archival-workspace/' \
-  | grep -v 'config/skills/devlyn:auto-resolve-workspace/' \
-  | grep -v 'config/skills/devlyn:ideate-workspace/' \
-  | grep -v 'config/skills/preflight-workspace/' \
-  | grep -v 'benchmark/auto-resolve/external/' \
-  | grep -v 'benchmark/auto-resolve/PILOT-RESULTS' \
-  || true)
+offenders=$(git grep -Il -- 'Requires Codex MCP\|Codex MCP server\|Codex MCP available\|Codex MCP disconnected' -- \
+  config/skills \
+  benchmark \
+  README.md \
+  CLAUDE.md \
+  bin/ \
+  ':!config/skills/roadmap-archival-workspace/**' \
+  ':!config/skills/devlyn:auto-resolve-workspace/**' \
+  ':!config/skills/devlyn:ideate-workspace/**' \
+  ':!config/skills/preflight-workspace/**' \
+  ':!benchmark/auto-resolve/external/**' \
+  ':!benchmark/auto-resolve/results/**' \
+  ':!benchmark/auto-resolve/PILOT-RESULTS*' \
+  2>/dev/null || true)
 if [ -z "$offenders" ]; then
   ok "no Codex MCP prose"
 else
@@ -203,6 +209,16 @@ else
   bad "spec-verify-check.py risk-probe self-test failed"
 fi
+section "Check 6e: All-or-nothing probes prove mutable rollback"
+probe_doc="config/skills/devlyn:resolve/references/phases/probe-derive.md"
+if grep -Fq "pre-rejected by a whole-order availability shortcut" "$probe_doc" \
+   && grep -Fq "must allocate a scarce" "$probe_doc" \
+   && grep -Fq "must request the same scarce first-line SKU" "$probe_doc"; then
+  ok "all-or-nothing probe contract preserves mutable rollback evidence"
+else
+  bad "$probe_doc — missing mutable rollback probe contract"
+fi
 # ---------------------------------------------------------------------------
 # 8. CRITIC security sub-pass must be native, not Dual.
 # Catches the specific drift where a section updates but a cross-reference doesn't.
@@ -431,14 +447,15 @@ fi
 # version lives) pass while genuine stale references fail. Excluded scopes:
 # benchmark/auto-resolve/results/ (historical run artifacts, frozen) and
 # scripts/lint-skills.sh itself (carries the pattern in this check).
-stale=$(grep -RIn 'F9-e2e-ideate-to-preflight' \
+stale=$(git grep -In -- 'F9-e2e-ideate-to-preflight' -- \
   config/skills \
   benchmark \
   scripts \
   CLAUDE.md \
-  README.md 2>/dev/null \
+  README.md \
+  ':!benchmark/auto-resolve/results/**' \
+  2>/dev/null \
   | grep -v '^benchmark/auto-resolve/fixtures/retired/F9-e2e-ideate-to-preflight/' \
-  | grep -v '^benchmark/auto-resolve/results/' \
   | grep -v '^scripts/lint-skills\.sh:' \
   | grep -v 'fixtures/retired/F9-e2e-ideate-to-preflight' \
   || true)
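
The recurring change in this file replaces a `grep -R ... | grep -v ...` chain with one `git grep` call whose exclusions are `:!` pathspecs. As a rough sketch of how such a command line is assembled, the hypothetical `git_grep_argv` helper below builds the argument vector without running it; the helper name and its parameters are assumptions, and only the flags and pathspec forms that appear in the diff are used.

```python
import shlex


def git_grep_argv(pattern, paths, excludes, line_numbers=False):
    """Build a git grep argv with exclude pathspecs (hypothetical helper).

    -I skips binary files; -l lists file names and -n prints line numbers,
    matching the -Il and -In forms used in the diff.
    """
    flags = "-In" if line_numbers else "-Il"
    return [
        "git", "grep", flags, "--", pattern, "--",
        *paths,
        *(f":!{exclude}" for exclude in excludes),
    ]


argv = git_grep_argv(
    "mcp__codex-cli__",
    ["config/skills", "benchmark"],
    ["benchmark/auto-resolve/results/**"],
)
# shlex.join quotes the pathspec so it can be pasted into a shell.
print(shlex.join(argv))
```

Two behavioral differences are worth keeping in mind when reading the diff: `git grep` searches tracked files by default, whereas `grep -R` walks everything on disk, and `git grep` exits with status 1 when nothing matches, which is why the script keeps `|| true`.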

package/benchmark/auto-resolve/fixtures/F27-cli-gift-card-redemption/NOTES.md DELETED

@@ -1,24 +0,0 @@
-# F27 CLI gift card redemption
-## Why this fixture exists
-F16 showed a valid full-pipeline pair lift when the solo arm implemented the
-happy path but missed the exact validation-error contract. F25 was rejected
-after an oracle correction made solo pass. F26 was rejected because solo reached
-the ceiling.
-F27 keeps the useful F16 shape but removes checkout tax complexity: success is
-straight integer aggregation, while the risk is the exact failure object after
-combining duplicate card redemption rows before balance validation.
-## Pair expectation
-PLAN must preserve the order of aggregation before validation. IMPLEMENT must
-read `data/gift-cards.json` and keep all public amounts in integer cents.
-VERIFY should construct an adversarial request where two individually valid
-redemptions for the same card become invalid only after combination.
-## Isolation
-F16 covers quote tax rules. F27 covers non-persistent balance redemption and
-exact validation shape after duplicate aggregation.
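
The adversarial case described in the deleted notes (two individually valid redemptions for the same card that become invalid only after combination) can be made concrete with a small sketch. The field names `card` and `amount_cents`, the balance table, and the `overdrawn_cards` helper are assumptions for illustration; the fixture's real input and failure object are not shown in this diff.

```python
from collections import defaultdict


def overdrawn_cards(balances, redemptions):
    """Aggregate duplicate rows per card first, then validate balances.

    All amounts are integer cents, as the notes require.
    """
    totals = defaultdict(int)
    for row in redemptions:
        totals[row["card"]] += row["amount_cents"]
    return [card for card, total in totals.items() if total > balances.get(card, 0)]


balances = {"GC-1": 1000}
request = [
    {"card": "GC-1", "amount_cents": 600},
    {"card": "GC-1", "amount_cents": 600},
]

# Validating row by row misses the problem: each row fits the balance.
per_row_ok = all(row["amount_cents"] <= balances[row["card"]] for row in request)
print(per_row_ok)                          # True
# Validating after aggregation catches it: 1200 > 1000.
print(overdrawn_cards(balances, request))  # ['GC-1']
```

An implementation that validates before combining would accept this request, which is the ordering mistake the notes expected the pair arm to catch.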