npm - devlyn-cli - Versions diffs - 1.14.0 → 2.0.0 - Mend

devlyn-cli 1.14.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (148)

package/benchmark/auto-resolve/RUBRIC.md ADDED

@@ -0,0 +1,162 @@
+# Benchmark Judge Rubric
+Stable across model upgrades. This file is the single source of truth for how
+arms are scored and how ship gates evaluate a run. Do not change the rubric
+during a benchmarking window — changing it invalidates comparability with
+prior `history/runs/`.
+**Outer goal lives in [`autoresearch/NORTH-STAR.md`](../../autoresearch/NORTH-STAR.md).** The release-decision layer (L0 / L1 / L2 contracts, wall-time efficiency, pair-cost justification) sits on top of the per-arm scoring rules below. When NORTH-STAR.md adds a release-gate number that this file did not have, the new number applies — open a doc-fix iter to mirror it here.
+## Scoring — 4 axes, 25 points each, 100 total
+The blind judge scores both arms on identical axes without knowing which is
+variant vs. bare.
+### Axis 1 — Spec Compliance (0-25)
+Does this implementation satisfy every Requirements bullet in `spec.md`?
+Does every Verification command behave as the spec states?
+- **25** — All Requirements satisfied. All Verification commands would pass.
+- **19-24** — 90%+ coverage, minor omissions.
+- **13-18** — Partial implementation or verification gaps.
+- **7-12** — Major requirements missed.
+- **0-6** — Does not address the core task.
+### Axis 2 — Constraint Respect (0-25)
+Zero new npm deps (unless spec allows), no silent catches (`try { } catch { return fallback }`), no `any`/`@ts-ignore`, explicit HOME/env guards where required, EACCES-specific handling, no hardcoded values that should be configurable.
+- Each **disqualifier-severity forbidden_pattern match** or explicit constraint violation = −4 points, minimum 0.
+### Axis 3 — Scope Discipline (0-25)
+Out of Scope respected. No gratuitous refactors of unrelated code. No "while I'm here" additions. No opportunistic upgrades.
+- Each out-of-scope change = −5 points, minimum 0.
+### Axis 4 — Code Quality (0-25)
+Readable, idiomatic for the language/framework, helpful error messages, appropriate abstraction level (not under- or over-engineered), uses standard library primitives where available (e.g., `fs.accessSync` over mode-bit checks per CLAUDE.md `phase-3-critic.md:32` calibration).
+- This axis is judge-calibrated; no deterministic grading. The judge looks at
+  naming, function/file decomposition, error handling, and comparability to
+  idiomatic peer code.
+---
+## Judge Disqualifiers (hard floor)
+Any of these produces `disqualifier: true` on the arm, overriding score:
+- Silent-catch pattern in diff.
+- Fabricated verification (code that claims to pass without actually running).
+- Skipped a required test file that the spec names.
+- Created a file listed in `expected.forbidden_files`.
+- Exceeded `expected.max_deps_added`.
+- `@ts-ignore` / `eslint-disable` without scoped justification comment.
+- Hardcoded paths or values where spec required configurability.
+Disqualifier arms automatically lose the fixture regardless of score.
+---
+## Ship Gates
+After the judge finishes every fixture, `scripts/ship-gate.py` applies these
+rules to the run's `summary.json`.
+### Hard floors (any one failure blocks ship)
+1. **No disqualifier-level violation** in variant on any fixture.
+2. **F9 (E2E) must PASS** — novice-flow contract.
+3. **≥ 7 of 9 fixtures** must have margin ≥ +5 — **headroom-aware** (added 2026-05-02 per iter-0033 R4 + NORTH-STAR amendment): a fixture is excluded from this count when `100 - L0_score < 5` AND `L1_score >= 95` AND the L1 arm has no disqualifier / CRITICAL-HIGH finding / watchdog timeout / regression worse than gate #4. Excluded fixtures become fixture-rotation candidates per the policy below if the two-shipped-version rule is met.
+4. **No fixture regression worse than −5** vs. last `baselines/shipped.json` on the same fixture.
+### Soft gates (produce WARNING but do not block)
+5. Suite average margin drop > 3 vs. last shipped.
+6. A fixture that previously had margin > +5 now has margin ≤ 0.
+7. Critical-finding catch-rate decrease vs. last shipped variant (not vs. bare).
+### Known-limit exception
+- **F8-known-limit-ambiguous** is excluded from gates 3 and 4. It exists to
+  document where the harness may not beat bare. Its allowed margin range is
+  [-3, +3]. Margins outside this range trigger a WARNING regardless of sign
+  (too-good means the fixture is no longer a known limit; too-bad means we
+  shipped a regression somewhere else that this fixture caught).
+---
+## Run Record
+Every suite run appends an immutable record to `history/runs/<ts>-<label>.json`:
+```json
+{
+  "run_id": "2026-04-23T12:00:00Z-v3.6",
+  "version_label": "v3.6",
+  "git_sha": "fdb7428...",
+  "branch": "benchmark/v3.6-ab-...",
+  "n_per_fixture": 1,
+  "judge_model": "<recorded from ~/.codex/config.toml at run time; do not hardcode>",
+  "judge_effort": "xhigh",
+  "fixtures": [
+    {
+      "id": "F2-cli-medium-subcommand",
+      "variant": { "score": 92, "wall_s": 707, "tokens_agg": 108852, "disqualifier": false,
+                   "axes": {"spec": 23, "constraint": 23, "scope": 24, "quality": 22} },
+      "bare":    { "score": 81, "wall_s": 101, "tokens_agg": 55588,  "disqualifier": false,
+                   "axes": {"spec": 19, "constraint": 19, "scope": 20, "quality": 23} },
+      "winner": "variant",
+      "margin": 11,
+      "critical_findings": {
+        "variant": [],
+        "bare": ["silent catch in findSkillMdFiles (no-silent-catches violation)"]
+      }
+    }
+  ],
+  "suite": {
+    "fixtures_run": 9,
+    "variant_avg": 89.3,
+    "bare_avg": 75.0,
+    "margin_avg": 14.3,
+    "hard_floor_violations": 0,
+    "ship_gate": "PASS"
+  }
+}
+```
+---
+## Fixture Rotation Policy
+If any fixture has both arms scoring > 95 for two consecutive shipped
+versions, it's saturated and no longer differentiates. Replace with a harder
+equivalent and record the swap in
+`history/runs/<ts>-fixture-rotation.json`:
+```json
+{
+  "retired": "F1-cli-trivial-flag",
+  "retired_reason": "both arms > 95 on v3.7 and v3.8 (saturation)",
+  "replacement": "F1b-cli-trivial-flag-v2",
+  "replacement_rationale": "adds exit-code precedence requirement that current leaders didn't handle on first try"
+}
+```
+Retired fixtures stay in `fixtures/retired/` for replay if a regression is
+suspected in their area.
+---
+## Why These Thresholds
+- **+5 margin floor** — below this, variant isn't reliably beating bare given
+  judge variance (empirically ~±3 per axis). Paying the pipeline cost is only
+  justified when the margin is clearly above noise.
+- **−5 regression floor** — one-axis regression can look like −5; allowing
+  less would let real regressions slip through.
+- **7/9 fixtures rule** — tolerates one close-call + F8 known-limit; anything
+  worse means the suite is surfacing a broad harness problem.
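
Reviewer note: hard floor 3 and its headroom-aware exclusion can be read as a small predicate over per-fixture results. The sketch below is a minimal illustration and not the contents of `scripts/ship-gate.py`, which this diff does not show. The field names (`l0_score`, `l1_score`, `l1_clean`) and the function names are hypothetical, and treating L0 and L1 as per-arm scores is an assumption. The rubric does not say how an excluded fixture changes the "7 of 9" threshold, so the sketch only counts and leaves that decision to the caller.

```python
from dataclasses import dataclass

# Fixture id taken from the "Known-limit exception" section of the rubric.
KNOWN_LIMIT_ID = "F8-known-limit-ambiguous"


@dataclass
class FixtureResult:
    fixture_id: str
    margin: float    # variant score minus bare score
    l0_score: float  # hypothetical field: score of the L0 arm
    l1_score: float  # hypothetical field: score of the L1 arm
    l1_clean: bool   # hypothetical: no disqualifier, CRITICAL/HIGH finding,
                     # watchdog timeout, or regression worse than gate 4


def headroom_excluded(r: FixtureResult) -> bool:
    """Gate 3 exclusion: under 5 points of headroom above L0 and a clean L1 >= 95."""
    return (100 - r.l0_score) < 5 and r.l1_score >= 95 and r.l1_clean


def count_margin_gate(results):
    """Return (number of fixtures with margin >= +5, ids left out of the count)."""
    passing, excluded = 0, []
    for r in results:
        if r.fixture_id == KNOWN_LIMIT_ID or headroom_excluded(r):
            excluded.append(r.fixture_id)
        elif r.margin >= 5:
            passing += 1
    return passing, excluded
```

With one saturated fixture, one clear win, and the F8 known-limit fixture, this returns a passing count of 1 and lists the other two fixtures as excluded.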

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md ADDED

@@ -0,0 +1,30 @@
+# F1 — Notes
+## Purpose
+Trivial-tier calibration. Every arm should one-shot this; it's here to catch
+catastrophic regressions and to anchor the "saturation" end of the scoring
+scale.
+## Failure mode
+- **Default-behavior regression.** Careless implementations add `--loud`
+  handling but accidentally alter the default case (e.g., always uppercasing
+  because the flag-check is misplaced). Verification commands 1 and 4 guard
+  against that.
+- **Scope creep.** Modifying unrelated code while "here" would be caught by
+  both the CRITIC design sub-pass and the `git diff --stat` spec requirement.
+## Pipeline exercise
+- Phase 0 routing: expected `standard` route (no risk keywords).
+- Phase 1 BUILD: single-file edit.
+- Phase 1.4 BUILD GATE: `node --check` + `node --test` both must pass.
+- Phase 2 EVAL: minimal findings expected.
+- Phase 3 CRITIC design: verifies diff surgical-ness.
+## Rotation trigger
+When both arms score > 95 for two consecutive shipped versions, replace with
+a harder trivial fixture (e.g., one that requires handling a new flag
+interacting with existing flag precedence).
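
Reviewer note: the rotation trigger (both arms above 95 for two consecutive shipped versions) amounts to a check over the most recent shipped scores. A minimal sketch, assuming a hypothetical `history` list of `(variant_score, bare_score)` pairs ordered oldest first; how the harness actually stores shipped scores is not shown in this diff.

```python
def is_saturated(history, threshold=95, consecutive=2):
    """True when both arms scored strictly above `threshold` in each of the
    last `consecutive` shipped versions.

    history: list of (variant_score, bare_score) tuples, oldest first.
    """
    if len(history) < consecutive:
        return False
    recent = history[-consecutive:]
    return all(v > threshold and b > threshold for v, b in recent)
```

The comparison is strict, matching the "> 95" wording, so a score of exactly 95 does not count toward saturation.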

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json ADDED

@@ -0,0 +1,68 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node bin/cli.js hello",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, world!"
+      ],
+      "stdout_not_contains": [
+        "HELLO"
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js hello --loud",
+      "exit_code": 0,
+      "stdout_contains": [
+        "HELLO, WORLD!!"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js hello --loud --name alice",
+      "exit_code": 0,
+      "stdout_contains": [
+        "HELLO, ALICE!!"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js hello --name bob",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, bob!"
+      ],
+      "stdout_not_contains": [
+        "HELLO"
+      ]
+    },
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": [
+        "not ok "
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|'')",
+      "description": "silent catch returning fallback",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js",
+    "tests/cli.test.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": [
+    "bin/cli.js",
+    "tests/cli.test.js"
+  ]
+}
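
Reviewer note: each `verification_commands` entry pairs a command with an expected exit code and two stdout lists. The sketch below shows one plausible way to evaluate an entry against output that has already been captured. It assumes plain case-sensitive substring matching, which the diff does not confirm, and `check_command_result` is a hypothetical helper name rather than a function from the package.

```python
def check_command_result(expected, exit_code, stdout):
    """Compare captured results with one verification_commands entry.

    Returns a list of failure messages; an empty list means the entry passed.
    """
    failures = []
    if exit_code != expected["exit_code"]:
        failures.append(f"exit code {exit_code}, expected {expected['exit_code']}")
    for needle in expected["stdout_contains"]:
        if needle not in stdout:
            failures.append(f"stdout missing {needle!r}")
    for needle in expected["stdout_not_contains"]:
        if needle in stdout:
            failures.append(f"stdout contains forbidden {needle!r}")
    return failures


# First entry from the file above: the default greeting must stay unchanged.
default_case = {
    "cmd": "node bin/cli.js hello",
    "exit_code": 0,
    "stdout_contains": ["Hello, world!"],
    "stdout_not_contains": ["HELLO"],
}
```

Under these assumptions, an implementation that always uppercases fails this entry twice: the expected greeting is missing and the forbidden `HELLO` is present. That is the default-behavior regression described in the fixture's NOTES.md.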

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F1-cli-trivial-flag",
+  "category": "trivial",
+  "difficulty": "trivial",
+  "timeout_seconds": 900,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a boolean --loud flag to bench-test-repo's hello subcommand. When passed, the greeting is uppercased and ends with '!!'. Default behavior unchanged. Update tests."
+}

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh ADDED

@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+# F1 setup — no changes to base test-repo needed.
+set -e
+exit 0

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md ADDED

@@ -0,0 +1,45 @@
+---
+id: "F1-cli-trivial-flag"
+title: "Add --loud flag to hello subcommand"
+status: planned
+complexity: trivial
+depends-on: []
+---
+# F1 Add `--loud` flag to `hello`
+## Context
+The `hello` subcommand in `bin/cli.js` currently prints `Hello, <name>!`. A
+`--loud` flag gives users an emphatic variant without breaking the default.
+This is a low-risk edit used to calibrate trivial-tier fixture difficulty.
+## Requirements
+- [ ] `node bin/cli.js hello --loud` prints `HELLO, WORLD!!` (everything uppercased, two trailing exclamation marks).
+- [ ] `node bin/cli.js hello --loud --name alice` prints `HELLO, ALICE!!`.
+- [ ] `node bin/cli.js hello` (no flag) still prints `Hello, world!` (unchanged).
+- [ ] `node bin/cli.js hello --name bob` still prints `Hello, bob!` (unchanged).
+- [ ] Existing tests continue to pass. Add at least one test covering the `--loud` path.
+## Constraints
+- **No new npm dependencies.** Built-ins only.
+- **No silent catches.** If an unknown flag is passed, exit 1 with an informative message (same pattern as the existing `--name` handler).
+- **Surgical diff.** Only touch `bin/cli.js` and `tests/cli.test.js`. Do not reformat unrelated code.
+- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
+## Out of Scope
+- Adding unrelated flags (`--quiet`, `--locale`, etc.).
+- Refactoring the existing argument parser.
+- Touching `server/`, `web/`, or `tests/server.test.js`.
+## Verification
+- `node bin/cli.js hello` prints `Hello, world!` (exit 0).
+- `node bin/cli.js hello --loud` prints `HELLO, WORLD!!` (exit 0).
+- `node bin/cli.js hello --loud --name alice` prints `HELLO, ALICE!!` (exit 0).
+- `node --test tests/` passes all tests including the new `--loud` case.
+- `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched.

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt ADDED

@@ -0,0 +1,8 @@
+Add a --loud flag to the `hello` subcommand in bench-test-repo's CLI (bin/cli.js). When --loud is passed, the greeting is uppercased and ends with two exclamation marks.
+For example:
+- `node bin/cli.js hello --loud` → `HELLO, WORLD!!`
+- `node bin/cli.js hello --loud --name alice` → `HELLO, ALICE!!`
+- `node bin/cli.js hello` → `Hello, world!` (unchanged default)
+Make sure existing tests still pass and add at least one test for the --loud path. Don't touch unrelated files — only `bin/cli.js` and `tests/cli.test.js`. No new npm dependencies.

package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md ADDED

@@ -0,0 +1,54 @@
+# F2 — Notes
+## Purpose
+Canonical **medium-complexity single-file CLI task** in the suite. Tests the
+middle-ground: a task big enough that first-draft implementations often miss
+an edge case (EACCES vs missing-dir distinction, TTY gating, HOME guard),
+small enough that every arm can plausibly finish in < 10 minutes.
+## What failure mode does it detect?
+- **Silent catches.** The pattern `try { readdirSync(...) } catch { return [] }`
+  is a natural shortcut here. Bare prompt arms tend to take it. The pipeline's
+  EVAL phase catches it as a `correctness.silent-error` or
+  `hygiene.silent-catch` finding.
+- **Edge-case distinction.** ENOENT vs EACCES must be reported differently.
+  Arms that collapse both into a generic FAIL miss a spec Requirement.
+- **Over-engineering.** Since v3.6's CRITIC calibration, hand-rolled
+  mode-bit writable checks are blocked in favor of `fs.accessSync(...,
+  fs.constants.W_OK)`.
+## Which pipeline phases does it exercise?
+- Phase 0: routing — `permission`, `env` risk keywords in the task body
+  escalate to `strict`.
+- Phase 1 BUILD: main implementation pass.
+- Phase 1.4 BUILD GATE: `node --check` syntax gate.
+- Phase 2 EVAL: catches silent-catch trap if present.
+- Phase 3 CRITIC design: applies stdlib-vs-hand-rolled calibration.
+- Phase 3 CRITIC security (native): minimal — no deps changed.
+- Phase 4 DOCS: spec frontmatter `status: done`.
+## Why can't another fixture cover this?
+- F1 is trivial (single-line edit, no edge cases).
+- F3 is backend (different idioms, tests run differently).
+- F5 is designed to force fix-loop (not applicable here).
+- F7 is scope-creep (orthogonal concern).
+## When should this fixture be retired or replaced?
+When both arms score > 95 for two consecutive shipped versions — i.e., the
+fixture saturates and no longer differentiates. Candidate replacement: a
+similar-size CLI task with multiple interacting flags or a subcommand that
+spawns a child process.
+## Calibration history
+- v3.4   skill 57 / bare 45 / margin +12 (gpt-5.3-codex judge)
+- v3.4.1 skill 59 / bare 43 / margin +16 (gpt-5.3-codex judge)
+- v3.5   skill 92 / bare 81 / margin +11 (gpt-5.4 xhigh judge) — huge absolute jump; bare silent-catch caught
+Absolute scores jumped with the stronger judge. Margin stays solid (+11); it is
+expected to open up a few more points after the stdlib calibration.

package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json ADDED

@@ -0,0 +1,170 @@
+{
+  "fixture_id": "F2-cli-medium-subcommand",
+  "generated_at": "2026-04-29T09:57:53Z",
+  "generated_from": {
+    "expected_path": "benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json",
+    "expected_sha256": "ddef8feba49f20b6957e37840bc6a03e78e554776e380d81ad6390944c72fcab",
+    "metadata_path": "benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json",
+    "metadata_sha256": "1b8066a7c649eb6baad7a3e056edbdb16cc3b796e154cedee0cf2258c5543b18",
+    "oracle_script_shas": {
+      "scope-tier-a": "baaf21ed4a67f35d2a8af825e72869ef9737b5dfe08d65dd1a11c26fafe297ae",
+      "scope-tier-b": "9349d00a5c7456a4df9142923334e7004407d53f2443f2e210945bb771971e25",
+      "test-fidelity": "401184da51ae500cecfc75a6c5819b0d28acb63a397f788fb628c2913562f903"
+    }
+  },
+  "required_invariants": [
+    {
+      "authority": "expected.json/forbidden_patterns",
+      "id": "forbidden_pattern__silent_catch_returning_a_fallback_value_violates_no_silent_c__bin_cli_js",
+      "operational_check": "variant arm output MUST NOT contain regex pattern \"catch\\\\s*\\\\([^)]*\\\\)\\\\s*\\\\{[^}]*return\\\\s+(\\\\[\\\\]|null|undefined|\\\\{|false|'')\" in files ['bin/cli.js']; rationale: silent catch returning a fallback value — violates no-silent-catches policy",
+      "severity": "disqualifier",
+      "source_field": "expected.json/forbidden_patterns/0",
+      "source_ref": "expected.json:forbidden_patterns[0]"
+    },
+    {
+      "authority": "expected.json/forbidden_patterns",
+      "id": "forbidden_pattern__ts_ignore_escape_hatch__bin_cli_js",
+      "operational_check": "variant arm output MUST NOT contain regex pattern '@ts-ignore' in files ['bin/cli.js']; rationale: @ts-ignore escape hatch",
+      "severity": "disqualifier",
+      "source_field": "expected.json/forbidden_patterns/1",
+      "source_ref": "expected.json:forbidden_patterns[1]"
+    },
+    {
+      "authority": "expected.json/max_deps_added",
+      "id": "max_deps_added__0",
+      "operational_check": "variant arm MUST NOT add more than 0 new npm dependencies (count delta of package.json:dependencies + devDependencies)",
+      "severity": "hard",
+      "source_field": "expected.json/max_deps_added",
+      "source_ref": "expected.json:max_deps_added"
+    },
+    {
+      "authority": "expected.json/required_files",
+      "id": "required_file__bin_cli_js",
+      "operational_check": "variant arm output MUST contain file 'bin/cli.js' (created or preserved)",
+      "severity": "hard",
+      "source_field": "expected.json/required_files",
+      "source_ref": "expected.json:required_files[bin/cli.js]"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "scope-tier-a:lockfile-deletion",
+      "operational_check": "variant arm MUST NOT delete a scaffold-present lockfile",
+      "severity": "hard",
+      "source_field": "oracle/scope-tier-a/scope-tier-a:lockfile-deletion",
+      "source_ref": "oracle-scope-tier-a.py"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "scope-tier-a:tier-a-violation",
+      "operational_check": "variant arm MUST NOT add or modify paths matching: docs/roadmap/** | docs/VISION.md | docs/ROADMAP.md | .github/** | node_modules/** | **/node_modules/** | test-results/** | coverage/** | .nyc_output/** | basename suffix .log | basename prefix .env or secrets.",
+      "severity": "hard",
+      "source_field": "oracle/scope-tier-a/scope-tier-a:tier-a-violation",
+      "source_ref": "oracle-scope-tier-a.py"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "scope-tier-b:scope-unmatched",
+      "operational_check": "every variant-touched file MUST be either inside spec_output_files (Tier C) OR reachable from a Tier C seed via static JS/TS imports OR matched by expected.json:tier_a_waivers",
+      "severity": "warn",
+      "source_field": "oracle/scope-tier-b/scope-tier-b:scope-unmatched",
+      "source_ref": "oracle-scope-tier-b.py"
+    },
+    {
+      "authority": "expected.json/spec_output_files",
+      "id": "spec_output_file__bin_cli_js",
+      "operational_check": "variant-touched files MUST be inside (or reachable via static imports from) the spec_output_files set; 'bin/cli.js' is one Tier C seed",
+      "severity": "warn",
+      "source_field": "expected.json/spec_output_files",
+      "source_ref": "expected.json:spec_output_files[bin/cli.js]"
+    },
+    {
+      "authority": "expected.json/spec_output_files",
+      "id": "spec_output_file__tests_cli_test_js",
+      "operational_check": "variant-touched files MUST be inside (or reachable via static imports from) the spec_output_files set; 'tests/cli.test.js' is one Tier C seed",
+      "severity": "warn",
+      "source_field": "expected.json/spec_output_files",
+      "source_ref": "expected.json:spec_output_files[tests/cli.test.js]"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "test-fidelity:assertion-regression",
+      "operational_check": "effective assertion count MUST NOT drop and skipped-test count MUST NOT rise; vacuous expect.assertions(0) is treated as a real regression",
+      "severity": "warn",
+      "source_field": "oracle/test-fidelity/test-fidelity:assertion-regression",
+      "source_ref": "oracle-test-fidelity.py"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "test-fidelity:mock-swap",
+      "operational_check": "post-arm test file MUST NOT swap REAL_PATTERNS hits for MOCK_PATTERNS hits (jest/vi/sinon, nock/msw, app.handle/inject/callback, hand-rolled IncomingMessage/ServerResponse, etc.); a drop in real_calls combined with a rise in mock_calls is a mock-swap flag",
+      "severity": "flag",
+      "source_field": "oracle/test-fidelity/test-fidelity:mock-swap",
+      "source_ref": "oracle-test-fidelity.py"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "test-fidelity:test-file-deleted",
+      "operational_check": "no scaffold-present test file may be deleted by the variant arm; deletion of an existing tests/*.test.* / *.spec.* / *.e2e.* file is a flag-severity finding",
+      "severity": "flag",
+      "source_field": "oracle/test-fidelity/test-fidelity:test-file-deleted",
+      "source_ref": "oracle-test-fidelity.py"
+    },
+    {
+      "authority": "metadata/oracle-allowlist",
+      "id": "test-fidelity:test-file-renamed",
+      "operational_check": "rename of a scaffold-present test file is warn-severity (content fidelity not verified across renames in step 1)",
+      "severity": "warn",
+      "source_field": "oracle/test-fidelity/test-fidelity:test-file-renamed",
+      "source_ref": "oracle-test-fidelity.py"
+    },
+    {
+      "authority": "expected.json/verification_commands",
+      "id": "verification__3f35982a",
+      "operational_check": "running `node bin/cli.js doctor` in the post-arm work dir MUST exit with code 0; stdout MUST contain all of ['doctor:']; stdout MUST NOT contain any of ['undefined', 'Error:']",
+      "severity": "hard",
+      "source_field": "expected.json/verification_commands/0",
+      "source_ref": "expected.json:verification_commands[0]"
+    },
+    {
+      "authority": "expected.json/verification_commands",
+      "id": "verification__460fce04",
+      "operational_check": "running `HOME=/nonexistent node bin/cli.js doctor` in the post-arm work dir MUST exit with code 1; stdout MUST contain all of ['/nonexistent']; stdout MUST NOT contain any of []",
+      "severity": "hard",
+      "source_field": "expected.json/verification_commands/1",
+      "source_ref": "expected.json:verification_commands[1]"
+    },
+    {
+      "authority": "expected.json/verification_commands",
+      "id": "verification__973e287e",
+      "operational_check": "running `python3 -c \"import subprocess; r = subprocess.run(['node', 'bin/cli.js', 'doctor'], capture_output=True); n = r.stdout.count(b'\\x1b['); print(n); exit(0 if n == 0 else 1)\"` in the post-arm work dir MUST exit with code 0; stdout MUST contain all of ['0']; stdout MUST NOT contain any of []",
+      "severity": "hard",
+      "source_field": "expected.json/verification_commands/2",
+      "source_ref": "expected.json:verification_commands[2]"
+    },
+    {
+      "authority": "expected.json/verification_commands",
+      "id": "verification__d6253a97",
+      "operational_check": "running `node bin/cli.js doctor --help` in the post-arm work dir MUST exit with code 0; stdout MUST contain all of ['doctor']; stdout MUST NOT contain any of []",
+      "severity": "hard",
+      "source_field": "expected.json/verification_commands/3",
+      "source_ref": "expected.json:verification_commands[3]"
+    },
+    {
+      "authority": "expected.json/verification_commands",
+      "id": "verification__e0f149e4",
+      "operational_check": "running `node bin/cli.js --help` in the post-arm work dir MUST exit with code 0; stdout MUST contain all of ['doctor']; stdout MUST NOT contain any of []",
+      "severity": "hard",
+      "source_field": "expected.json/verification_commands/4",
+      "source_ref": "expected.json:verification_commands[4]"
+    },
+    {
+      "authority": "expected.json/verification_commands",
+      "id": "verification__fdbcd321",
+      "operational_check": "running `node bin/cli.js doctor --verbose` in the post-arm work dir MUST exit with code 0; stdout MUST contain all of ['doctor:']; stdout MUST NOT contain any of ['Error:']",
+      "severity": "hard",
+      "source_field": "expected.json/verification_commands/5",
+      "source_ref": "expected.json:verification_commands[5]"
+    }
+  ],
+  "schema_version": "1"
+}
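
Reviewer note: the `max_deps_added__0` invariant describes its check as a count delta over `dependencies` plus `devDependencies` in `package.json`. A minimal sketch of that arithmetic follows; `deps_added` is a hypothetical helper name, and the harness may compute the delta differently.

```python
import json


def deps_added(before_json: str, after_json: str) -> int:
    """Count delta of dependencies + devDependencies between two package.json texts."""
    def count(text: str) -> int:
        pkg = json.loads(text)
        return len(pkg.get("dependencies", {})) + len(pkg.get("devDependencies", {}))

    return count(after_json) - count(before_json)


before = '{"name": "bench-test-repo"}'
# "left-pad" is only a placeholder dependency name for the example.
after = '{"name": "bench-test-repo", "dependencies": {"left-pad": "1.3.0"}}'
violates = deps_added(before, after) > 0  # the fixture sets max_deps_added to 0
```

Because this is a count delta, replacing one dependency with another nets to zero in the sketch. Whether the real check behaves the same way is not stated in the diff.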

package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json ADDED

@@ -0,0 +1,84 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node bin/cli.js doctor",
+      "exit_code": 0,
+      "stdout_contains": [
+        "doctor:"
+      ],
+      "stdout_not_contains": [
+        "undefined",
+        "Error:"
+      ]
+    },
+    {
+      "cmd": "HOME=/nonexistent node bin/cli.js doctor",
+      "exit_code": 1,
+      "stdout_contains": [
+        "/nonexistent"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "python3 -c \"import subprocess; r = subprocess.run(['node', 'bin/cli.js', 'doctor'], capture_output=True); n = r.stdout.count(b'\\x1b['); print(n); exit(0 if n == 0 else 1)\"",
+      "exit_code": 0,
+      "stdout_contains": [
+        "0"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js doctor --help",
+      "exit_code": 0,
+      "stdout_contains": [
+        "doctor"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js --help",
+      "exit_code": 0,
+      "stdout_contains": [
+        "doctor"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js doctor --verbose",
+      "exit_code": 0,
+      "stdout_contains": [
+        "doctor:"
+      ],
+      "stdout_not_contains": [
+        "Error:"
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(?:\\[\\]|null|undefined|false|''|\\{\\s*\\})",
+      "description": "silent catch returning a fallback value (null / undefined / [] / false / '' / empty {}) \u2014 violates no-silent-catches policy. Structured error returns like `return { level: 'fail', message }` are NOT silent (they surface a user-visible failure object) and must not match.",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "@ts-ignore",
+      "description": "@ts-ignore escape hatch",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": [
+    "bin/cli.js",
+    "tests/cli.test.js"
+  ]
+}
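
Reviewer note: the first `forbidden_patterns` entry is meant to flag silent fallbacks while letting structured error returns through. The sketch below applies the JSON-decoded pattern with Python's `re` module to three hand-written JavaScript snippets. Which regex engine the harness uses is not shown in this diff, so evaluating with `re` is an assumption, and the snippets are illustrative rather than taken from the package.

```python
import re

# The pattern from expected.json after JSON unescaping.
SILENT_CATCH = re.compile(
    r"catch\s*\([^)]*\)\s*\{[^}]*return\s+(?:\[\]|null|undefined|false|''|\{\s*\})"
)

# Silent fallback: the error is swallowed and an empty list is returned.
silent = "try { return readdirSync(dir); } catch (err) { return []; }"

# Structured error return: surfaces a user-visible failure object.
structured = (
    "try { check(); } catch (err) "
    "{ return { level: 'fail', message: err.message }; }"
)

# Optional catch binding (no parentheses after `catch`).
bindingless = "try { check(); } catch { return null; }"

results = {
    "silent": bool(SILENT_CATCH.search(silent)),
    "structured": bool(SILENT_CATCH.search(structured)),
    "bindingless": bool(SILENT_CATCH.search(bindingless)),
}
```

In this sketch the silent fallback matches and the structured return does not, which agrees with the entry's description. The third snippet does not match either, because the pattern requires a parenthesized catch parameter. A `catch { ... }` block written without a binding would therefore need to be caught by another mechanism, such as the judge's silent-catch disqualifier.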

package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json ADDED

@@ -0,0 +1,21 @@
+{
+  "id": "F2-cli-medium-subcommand",
+  "category": "medium",
+  "difficulty": "medium",
+  "timeout_seconds": 1500,
+  "required_tools": [
+    "node"
+  ],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a `doctor` subcommand to bin/cli.js that diagnoses the local environment: node version check, $HOME/.claude directory check, installed plugins count, installed skills count, TTY-gated ANSI color, summary line, exit code, --verbose flag, help integration. Zero new npm dependencies. No silent error catches.",
+  "pair_plan_oracle_categories": [
+    "scope-tier-a:lockfile-deletion",
+    "scope-tier-a:tier-a-violation",
+    "scope-tier-b:scope-unmatched",
+    "test-fidelity:assertion-regression",
+    "test-fidelity:mock-swap",
+    "test-fidelity:test-file-deleted",
+    "test-fidelity:test-file-renamed"
+  ]
+}