devlyn-cli 2.0.0 → 2.2.0 (npm package version diff, Mend)

This diff shows the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between versions as they appear in the public registry.

Files changed (138)

package/config/skills/devlyn:resolve/references/phases/verify.md CHANGED

@@ -21,7 +21,7 @@ You do NOT receive: PLAN, IMPLEMENT's reasoning, BUILD_GATE's findings, CLEANUP'
 Re-run the mechanical checks fresh, independent of BUILD_GATE's earlier run:
-1. `python3 .claude/skills/_shared/spec-verify-check.py` against the post-CLEANUP code.
+1. `python3 .claude/skills/_shared/spec-verify-check.py --include-risk-probes` against the post-CLEANUP code.
 2. Re-scan `spec.expected.json.forbidden_patterns` against the diff (Python re.search; honor each pattern's `files` allowlist).
 3. Confirm `required_files` exist post-diff; confirm `forbidden_files` do not appear in the diff.
 4. Confirm `max_deps_added` is not exceeded (`git diff -- package.json` for Node; equivalent for other ecosystems).
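
Step 2 of the mechanical checks above (re-scanning `forbidden_patterns` with `re.search` while honoring each pattern's `files` allowlist) can be sketched roughly as follows. This is an illustration, not the package's actual implementation: the entry keys `pattern` and `files`, the glob-style matching of the allowlist via `fnmatch`, the rule that a pattern with no allowlist applies to every file, and the `added_lines_by_file` input shape are all assumptions.

```python
import fnmatch
import re


def scan_forbidden(added_lines_by_file, forbidden_patterns):
    """Return (path, pattern, line) for every added line that matches a forbidden pattern.

    added_lines_by_file: hypothetical mapping of file path -> lines added by the diff.
    forbidden_patterns: hypothetical list of {"pattern": regex, "files": [globs]} entries.
    """
    hits = []
    for entry in forbidden_patterns:
        regex = re.compile(entry["pattern"])
        # Assumption: a missing or empty allowlist means the pattern applies everywhere.
        globs = entry.get("files") or ["*"]
        for path, lines in added_lines_by_file.items():
            # Honor the allowlist: skip files the pattern is not scoped to.
            if not any(fnmatch.fnmatch(path, g) for g in globs):
                continue
            for line in lines:
                if regex.search(line):
                    hits.append((path, entry["pattern"], line))
    return hits


added = {
    "src/app.js": ["console.log('x')", "return 1"],
    "README.md": ["console.log in docs"],
}
patterns = [{"pattern": r"console\.log", "files": ["*.js"]}]
hits = scan_forbidden(added, patterns)
```

With these inputs, only the `src/app.js` line is reported; the `README.md` line is skipped because it falls outside the pattern's allowlist.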
@@ -39,23 +39,175 @@ Grade the diff against the spec on rubric axes:
 For each finding, write file:line evidence. Do not paraphrase code; quote it.
+**Clause-level check**: split each Requirement into its binding clauses before
+you pass it. Words like `before`, `after`, `once`, `always`, `never`,
+`regardless`, `irrelevant`, `permanent`, `idempotent`, `duplicate`, `raw`, and
+`signature` usually encode a separate invariant. A passing verification command
+proves only the case it actually exercises; it does not prove neighboring
+clauses. For stateful, auth, parsing, idempotency, rollback, and error-priority
+flows, construct at least one counterexample in your head and trace the code
+order, including failed-operation rollback and the next entity's state. If the
+code order can return the wrong status/body/output for a binding clause, emit a
+HIGH spec-compliance finding even when the provided verifier passes.
+Respect each clause's scope qualifiers. Do not widen an invariant beyond the
+words in the spec: `inside a warehouse`, `per resource`, `for this line`,
+`after validation`, and similar qualifiers are binding. When two ordering rules
+coexist, compose them in the stated order instead of inventing a stronger global
+ordering. A finding based on a widened invariant is a false positive and must
+not drive the fix loop.
+**Interaction check**: for high-complexity specs, one-axis examples are not
+enough. Construct at least one adversarial scenario that combines two or more
+explicit verification bullets. Prioritize combinations such as
+ordering/priority + blocked interval/failure, ordering/priority +
+all-or-nothing rollback + later entity state, validation/error-priority +
+stdout/stderr contract, or auth/idempotency + duplicate/replay ordering. If the
+implementation only passes isolated examples but fails the combined scenario,
+emit a HIGH finding tied to all relevant spec clauses.
+For high-complexity specs, execute at least one combined adversarial check with
+the repo's existing CLI/API/test runner before declaring PASS. Use a temporary
+script or inline command that leaves no tracked files behind. The check must
+cross two or more explicit verification bullets, not merely repeat the visible
+acceptance command. If the command exposes a mismatch, emit a HIGH finding with
+the command, expected output/state, and actual output/state.
 **Coverage check**: before declaring done, confirm you have evidence for every spec axis. If you could not exercise an axis (the spec asks for behavior X but the diff does not touch the code that produces X), set `state.verify.coverage_failed: true` and surface the missing-evidence finding rather than passing on assumption.
+**Verdict-binding severity check**: HIGH/CRITICAL findings are always
+verdict-binding. A MEDIUM finding is also verdict-binding when it identifies a
+concrete behavioral regression against the visible spec, an existing public
+contract, or an existing test contract. Examples: a previously valid input now
+errors, duplicate/idempotent handling regresses, warning/error semantics change
+for a real API path, or a focused existing regression test would fail. Advisory
+design/style concerns remain non-binding MEDIUM and produce `PASS_WITH_ISSUES`.
 **Anti-self-filter rule**: report every finding you observe, including ones you consider low-severity or low-confidence. Tag each with `confidence: high|medium|low` and let the harness's downstream filter rank them. Filtering at this stage suppresses recall.
 ### Pair-mode (when triggered by orchestrator)
-When the orchestrator spawns a second VERIFY agent with the OTHER engine's adapter, both judgments are merged:
+Pair-mode is eligible only after MECHANICAL has no HIGH/CRITICAL findings.
+Deterministic blockers already decide the verdict and route to the fix loop; a
+second judge there duplicates evidence and wastes wall-time. If MECHANICAL has
+a HIGH/CRITICAL finding, record `pair_judge: null` and do not spawn the second
+VERIFY agent.
+When eligible, trigger pair-mode if any of these are true:
+- `--pair-verify` was set.
+- The spec frontmatter has `complexity: high`.
+- `state.complexity` is `"high"` or `"large"`.
+- MECHANICAL emitted warning-level findings but no HIGH/CRITICAL blockers.
+- `state.verify.coverage_failed == true`.
+Before JUDGE spawn, compute and persist:
+```json
+"pair_trigger": {
+  "eligible": true,
+  "reasons": ["complexity.high"],
+  "skipped_reason": null
+}
+```
+If `eligible == true` and `reasons` is non-empty, the OTHER-engine judge is
+mandatory. Skipping it is a VERIFY contract violation. If ineligible, record the
+reason, e.g. `"mechanical_blocker"`.
+The `--engine` flag never disables this rule. Explicit `--engine claude` means
+Claude is the primary judge; if pair-mode triggers, Codex is still the mandatory
+OTHER-engine judge. Do not record "explicit --engine claude" as a skip reason.
+The only valid skip reasons after a non-empty eligible trigger are deterministic
+MECHANICAL HIGH/CRITICAL blockers or Codex unavailability proven by the
+invocation layer.
+When eligible and the orchestrator spawns a second VERIFY agent with the OTHER engine's adapter, both judgments are merged:
 - Any HIGH/CRITICAL finding either model surfaces is verdict-binding.
-- Lower-severity disagreements are logged but do not change the verdict.
+- Any high-confidence MEDIUM finding either model surfaces is also
+  verdict-binding when it identifies a concrete behavioral regression against
+  the spec, public contract, or existing test contract. This includes
+  duplicate/idempotent/order-preservation regressions and real warning/error
+  behavior changes. Do not downgrade these to advisory simply because they are
+  not HIGH.
+- Other lower-severity disagreements are logged but do not change the verdict.
 - The orchestrator handles merge; you only emit your own findings.
+- The second judge's job is adversarial complement, not a duplicate summary:
+  prioritize the two highest-risk explicit `## Verification` bullets that cross
+  state mutation, all-or-nothing rollback, ordering, idempotency, auth, or
+  error-priority clauses. The primary judge owns broad coverage; the pair judge
+  is a bounded adversarial complement. Do not read `.claude/skills`,
+  `.codex/skills`, `CLAUDE.md`, `AGENTS.md`, or other harness docs unless the
+  orchestrator pasted a specific excerpt into the prompt. Use only the spec,
+  diff, implementation files, tests, and the repo's CLI/API/test runner.
+  Execute at most two targeted probes before first output. Stop immediately
+  after the first verdict-binding finding and emit JSONL. If both probes pass
+  and static scope/dependency checks show no blocker, emit PASS; do not continue
+  exhaustive exploration.
+  A targeted probe must compare the full externally visible result
+  (stdout/stderr/exit and full parsed output object, including accepted/scheduled
+  rows, rejected rows, and remaining state when present), not just a single
+  property. For priority/stateful specs, at least one probe must include an
+  earlier input entity that would succeed under input-order processing, a later
+  higher-priority entity that consumes or blocks the critical resource, and a
+  failure/blocked/rollback edge that determines a later entity's state. This is
+  the minimum compound shape for priority + failure/state-mutation bugs.
+  Scope qualifiers are binding for the pair judge too: do not reinterpret
+  `inside a warehouse`, `per resource`, or line-scoped rules as global rules.
+  If a candidate finding depends on that widening, emit PASS for that probe and
+  use the second bounded probe for a different explicit clause.
+  When both priority ordering and rollback/blocked-interval behavior appear in
+  the spec, this dominance-loss probe is mandatory and comes before any other
+  probe: an earlier lower-priority entity that would succeed alone or under
+  input-order processing must lose because a later higher-priority entity is
+  processed first; a failed/blocked middle entity must not corrupt later state;
+  and the assertion must cover the complete output ordering for both accepted
+  (or scheduled) and rejected rows.
+Codex pair-JUDGE is read-only. Invoke `codex-monitored.sh` directly with
+`-c model_reasoning_effort=medium`; this phase is a bounded two-probe review,
+not an unbounded implementation task. Do not pipe it to `tail`, `head`, `grep`,
+`sed`, or `awk`. Capture stdout/stderr by direct tool capture or file
+redirection. The Codex judge must return JSONL
+findings on stdout; the orchestrator writes `.devlyn/verify.pair.findings.jsonl`
+and merges verdicts. Do not ask Codex to `apply_patch` or edit `.devlyn`.
+The Codex prompt must include a bounded-output contract: no harness-doc reads,
+maximum two targeted probes before first output, stop on the first
+verdict-binding finding, and emit PASS immediately after the bounded checks pass.
+If stdout is first captured to `.devlyn/codex-judge.stdout`, run
+`python3 .claude/skills/_shared/collect-codex-findings.py` before merge. That
+script is the deterministic boundary writer for
+`.devlyn/verify.pair.findings.jsonl`.
+If raw Codex stdout is captured as `.devlyn/codex-judge.stdout`,
+`verify-merge-findings.py` treats it as a diagnostic only. If stdout contains
+findings or a non-PASS summary while `.devlyn/verify.pair.findings.jsonl` is
+empty, VERIFY is `BLOCKED` for `verify.pair.emission-contract`; do not pass or
+silently recover from a broken capture contract.
+After all VERIFY findings files are written, run:
+```bash
+python3 .claude/skills/_shared/verify-merge-findings.py --write-state
+```
+This deterministic merge is the routing source of truth for VERIFY. It writes
+`.devlyn/verify-merged.findings.jsonl`, `.devlyn/verify-merge.summary.json`, and
+updates `state.phases.verify.{verdict,sub_verdicts,merged}`. Branch on the
+merged state verdict, not on either model's prose verdict. Any HIGH/CRITICAL
+finding from either judge is `NEEDS_WORK`; a high-confidence MEDIUM must set
+`verdict_binding: true` to become `NEEDS_WORK`.
+Do not create, edit, truncate, or summarize `.devlyn/verify-merged.findings.jsonl`
+or `.devlyn/verify-merge.summary.json` by hand. Those files have exactly one
+writer: `verify-merge-findings.py`. If that command fails, preserve stderr and set
+VERIFY to `BLOCKED`; do not synthesize merge artifacts in prose.
 </sub_phases>
 <output>
 - `.devlyn/verify-mechanical.findings.jsonl` — MECHANICAL findings.
 - `.devlyn/verify.findings.jsonl` — JUDGE findings.
-- `state.phases.verify.{verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge}, artifacts}`. Verdict: WORSE of the two sub-verdicts. `PASS` requires zero CRITICAL/HIGH findings AND coverage met.
+- `.devlyn/verify-merged.findings.jsonl` and `.devlyn/verify-merge.summary.json` — deterministic merge artifacts.
+- `state.phases.verify.{verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, merged, artifacts}`. Verdict: result of `verify-merge-findings.py`. `PASS` requires zero CRITICAL/HIGH findings, zero verdict-binding MEDIUM regressions, and coverage met.
 </output>
 <quality_bar>

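The pair-mode trigger rules in the `verify.md` diff above can be read as a small decision function. The sketch below is one possible reading, not the package's code: the function name, its parameters, and the shape of a finding (a dict with a `severity` key) are hypothetical. Only the `complexity.high` reason label and the `mechanical_blocker` skip reason appear in the source; the other reason labels are invented for illustration. It also assumes that any MECHANICAL finding below HIGH counts as "warning-level".

```python
BLOCKING = {"HIGH", "CRITICAL"}


def compute_pair_trigger(
    mechanical_findings,
    *,
    pair_verify_flag=False,
    spec_complexity=None,
    state_complexity=None,
    coverage_failed=False,
):
    """Build the `pair_trigger` object that is persisted before JUDGE spawn."""
    severities = [str(f.get("severity", "")).upper() for f in mechanical_findings]

    # Deterministic blockers already decide the verdict, so pair-mode is ineligible.
    if any(s in BLOCKING for s in severities):
        return {"eligible": False, "reasons": [], "skipped_reason": "mechanical_blocker"}

    reasons = []
    if pair_verify_flag:
        reasons.append("flag.pair-verify")  # hypothetical label
    if spec_complexity == "high":
        reasons.append("complexity.high")  # label taken from the source example
    if state_complexity in ("high", "large"):
        reasons.append("state.complexity." + state_complexity)  # hypothetical label
    if severities:
        # Assumption: any non-blocking MECHANICAL finding is "warning-level".
        reasons.append("mechanical.warning")  # hypothetical label
    if coverage_failed:
        reasons.append("coverage.failed")  # hypothetical label

    return {"eligible": True, "reasons": reasons, "skipped_reason": None}


trigger = compute_pair_trigger([], spec_complexity="high")
```

Under the rules above, the OTHER-engine judge is mandatory whenever the returned object has `eligible` set to true and a non-empty `reasons` list. Note that the `--engine` value is deliberately not an input here, since the source says it never disables the rule.
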
package/config/skills/devlyn:resolve/references/state-schema.md CHANGED

@@ -28,13 +28,14 @@ Single authoritative verdict source for `/devlyn:resolve`. The orchestrator bran
   ],
   "phases": {
     "plan": null,
+    "probe_derive": null,
     "implement": null,
     "build_gate": null,
     "cleanup": null,
     "verify": null,
     "final_report": null
   },
-  "verify": { "coverage_failed": false }
+  "verify": { "coverage_failed": false, "pair_trigger": null }
 }
 ```
@@ -45,14 +46,16 @@ Single authoritative verdict source for `/devlyn:resolve`. The orchestrator bran
 - **complexity** — `null | "trivial" | "medium" | "large"`. Free-form mode populates this; spec/verify-only mode leaves it null.
 - **engine** — `"claude" | "codex" | "auto"` initially; rewritten by engine-preflight if a downgrade fired.
 - **rounds.global** — incremented every fix-loop pass (BUILD_GATE → fix-loop OR VERIFY → fix-loop).
+- **phases.probe_derive** — optional PHASE 1.5 entry when `--risk-probes` is enabled. Artifacts include `.devlyn/risk-probes.jsonl`. Probe failures later surface through BUILD_GATE/VERIFY as `correctness.risk-probe-failed`.
 - **bypasses** — array of phase names from `--bypass`. Valid: `"build-gate" | "cleanup"`. PLAN, IMPLEMENT, VERIFY are non-bypassable (orchestrator rejects at parse time).
 - **implement_passed_sha** — captured at end of PHASE 2; null until then. Activates the post-implement invariant for CLEANUP and VERIFY.
 - **criteria** — generated from spec's `## Requirements` checklist (one per `- [ ]`). `status: pending → implemented` is the legal transition. `failed_by_finding_ids` populates when VERIFY surfaces a finding tied to a criterion.
-- **verify.coverage_failed** — set by VERIFY's JUDGE sub-phase when a spec axis could not be exercised against the diff. Triggers pair-mode escalation when set.
+- **verify.coverage_failed** — set by VERIFY's JUDGE sub-phase when a spec axis could not be exercised against the diff. Triggers pair-mode escalation when set. Pair-mode also triggers for `complexity: high` specs or `state.complexity` of `"high"`/`"large"` when MECHANICAL has no HIGH/CRITICAL blockers.
+- **verify.pair_trigger** — VERIFY's trigger decision: `{ "eligible": boolean, "reasons": string[], "skipped_reason": string|null }`. If eligible with any reason, `pair_judge` must be non-null.
 ## Per-phase shape
-Each entry under `phases.<name>` (for `plan`, `implement`, `build_gate`, `cleanup`, `verify`, `final_report`):
+Each entry under `phases.<name>` (for `plan`, `probe_derive`, `implement`, `build_gate`, `cleanup`, `verify`, `final_report`):
 ```json
 {
@@ -73,7 +76,10 @@ Each entry under `phases.<name>` (for `plan`, `implement`, `build_gate`, `cleanu
 - `verdict` — `"PASS" | "PASS_WITH_ISSUES" | "FAIL" | "NEEDS_WORK" | "BLOCKED"`. PHASE 6 (FINAL_REPORT) writes its own verdict per the terminal-verdict precedence.
 - `triggered_by` — null on first run; one of `"build_gate" | "verify"` when the phase is a fix-loop respawn.
 - `pre_sha` — captured by orchestrator before CLEANUP and (if needed) other allowlist-enforced phases. Used to validate the post-spawn diff.
-- `sub_verdicts` — only populated for VERIFY: `{ "mechanical": "PASS|FAIL", "judge": "PASS|...", "pair_judge": "PASS|..." | null }`.
+- `sub_verdicts` — only populated for VERIFY: `{ "mechanical": "PASS|FAIL", "judge": "PASS|...", "pair_judge": "PASS|..." | null }`. Values are normalized by `verify-merge-findings.py`; model prose verdicts cannot upgrade or downgrade the deterministic findings-derived verdict.
+- `merged` — only populated for VERIFY after `verify-merge-findings.py --write-state`: `{ "verdict": "...", "findings_file": ".devlyn/verify-merged.findings.jsonl", "summary_file": ".devlyn/verify-merge.summary.json" }`.
+- `pair_trigger` — only populated for VERIFY; same shape as top-level `verify.pair_trigger` when the phase stores it locally.
+- `correctness.risk-probe-failed` — emitted by `spec-verify-check.py --include-risk-probes` when an executable probe derived from the visible `## Verification` section fails.
 ## Write protocol
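
The findings-derived verdict described in this diff (HIGH/CRITICAL always binding, MEDIUM binding only when flagged, advisory findings yielding `PASS_WITH_ISSUES`) might be sketched as below. This is not the logic of `verify-merge-findings.py`, whose source is not shown here: the function names and the finding shape are hypothetical, the sketch keys only on `verdict_binding: true` without checking `confidence`, it assumes any remaining non-binding finding produces `PASS_WITH_ISSUES`, and it leaves out the coverage requirement because the source does not say which verdict a coverage failure maps to.

```python
import json

BLOCKING = {"HIGH", "CRITICAL"}


def parse_jsonl(text):
    """Parse JSONL findings, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_verdict_binding(finding):
    """HIGH/CRITICAL always bind; MEDIUM binds only with verdict_binding set to true."""
    severity = str(finding.get("severity", "")).upper()
    if severity in BLOCKING:
        return True
    return severity == "MEDIUM" and finding.get("verdict_binding") is True


def merged_verdict(findings):
    """Derive a verdict from the combined findings of all judges."""
    if any(is_verdict_binding(f) for f in findings):
        return "NEEDS_WORK"
    if findings:
        # Assumption: any remaining (advisory) finding downgrades PASS.
        return "PASS_WITH_ISSUES"
    return "PASS"


primary = parse_jsonl('{"severity": "MEDIUM", "confidence": "low"}\n')
pair = parse_jsonl('{"severity": "MEDIUM", "verdict_binding": true}\n')
verdict = merged_verdict(primary + pair)
```

Because the verdict is computed from the findings alone, a judge's prose summary cannot upgrade or downgrade it, which is the property the `sub_verdicts` and `merged` entries above describe.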

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "2.0.0",
+  "version": "2.2.0",
   "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
   "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
   "bin": {

package/scripts/lint-skills.sh CHANGED

@@ -35,6 +35,7 @@ offenders=$(grep -RIln 'mcp__codex-cli__' \
   | grep -v 'config/skills/devlyn:auto-resolve-workspace/' \
   | grep -v 'config/skills/devlyn:ideate-workspace/' \
   | grep -v 'config/skills/preflight-workspace/' \
+  | grep -v 'benchmark/auto-resolve/external/' \
   | grep -v 'benchmark/auto-resolve/PILOT-RESULTS' \
   || true)
 if [ -z "$offenders" ]; then
@@ -53,6 +54,7 @@ offenders=$(grep -RIln 'Requires Codex MCP\|Codex MCP server\|Codex MCP availabl
   | grep -v 'config/skills/devlyn:auto-resolve-workspace/' \
   | grep -v 'config/skills/devlyn:ideate-workspace/' \
   | grep -v 'config/skills/preflight-workspace/' \
+  | grep -v 'benchmark/auto-resolve/external/' \
   | grep -v 'benchmark/auto-resolve/PILOT-RESULTS' \
   || true)
 if [ -z "$offenders" ]; then
@@ -127,6 +129,8 @@ else
   # plus the `_shared/` kernel.
   for rel in \
       _shared/spec-verify-check.py \
+      _shared/collect-codex-findings.py \
+      _shared/verify-merge-findings.py \
       devlyn:ideate/SKILL.md \
       devlyn:ideate/references/spec-template.md \
       devlyn:ideate/references/elicitation.md \
@@ -136,6 +140,7 @@ else
       devlyn:resolve/references/state-schema.md \
       devlyn:resolve/references/free-form-mode.md \
       devlyn:resolve/references/phases/plan.md \
+      devlyn:resolve/references/phases/probe-derive.md \
       devlyn:resolve/references/phases/implement.md \
       devlyn:resolve/references/phases/build-gate.md \
       devlyn:resolve/references/phases/cleanup.md \
@@ -171,6 +176,33 @@ else
   fi
 fi
+# ---------------------------------------------------------------------------
+# 6b. VERIFY merge verdict binding self-test.
+#     F23 full-pipeline prompt-fix rerun exposed a real failure where Codex
+#     pair-JUDGE emitted HIGH findings but state kept pair_judge as
+#     PASS_WITH_ISSUES. Routing severity must be deterministic, not prose.
+# ---------------------------------------------------------------------------
+section "Check 6b: VERIFY merge makes pair HIGH verdict-binding"
+if python3 config/skills/_shared/verify-merge-findings.py --self-test >/dev/null 2>&1; then
+  ok "verify-merge-findings.py self-test passed"
+else
+  bad "verify-merge-findings.py self-test failed"
+fi
+section "Check 6c: Codex stdout collection writes canonical pair findings"
+if python3 config/skills/_shared/collect-codex-findings.py --self-test >/dev/null 2>&1; then
+  ok "collect-codex-findings.py self-test passed"
+else
+  bad "collect-codex-findings.py self-test failed"
+fi
+section "Check 6d: Spec verification executes hidden-blind risk probes"
+if python3 config/skills/_shared/spec-verify-check.py --self-test >/dev/null 2>&1; then
+  ok "spec-verify-check.py risk-probe self-test passed"
+else
+  bad "spec-verify-check.py risk-probe self-test failed"
+fi
 # ---------------------------------------------------------------------------
 # 8. CRITIC security sub-pass must be native, not Dual.
 # Catches the specific drift where a section updates but a cross-reference doesn't.