npm - devlyn-cli - Versions diffs - 1.15.0 → 2.1.0 - Mend

devlyn-cli 1.15.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (158)

package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt ADDED

@@ -0,0 +1,9 @@
+The `GET /items` endpoint in `server/index.js` currently returns `{ items: [...] }`. Paginate it: the response should be `{ items, total, page, per_page }`. Accept `?page` and `?per_page` query params. When no params are given, return everything on page 1 with `per_page` equal to the full count.
+Keep `GET /items/:id` unchanged (no pagination on single-item lookup). `GET /health` stays as-is.
+Invalid `page` or `per_page` (non-numeric, zero, negative) → respond 400 with `{ error: 'invalid_query', field: '<name>' }`. Out-of-range page (beyond the last item) returns an empty `items` array, NOT a 404.
+Update `tests/server.test.js` so existing behavior is still covered AND you add at least two new tests for the paging behavior.
+No new npm dependencies. Only touch `server/index.js` and `tests/server.test.js`.
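
The paging rules in this task can be sketched as a pure function, independent of whatever HTTP layer `server/index.js` uses. This is a minimal sketch, not the package's implementation: `parsePositiveInt` and `paginate` are hypothetical names, and two behaviors are assumptions the task text does not settle: fractional or zero-padded values are treated as invalid, and `per_page` defaults to the full count whenever it is omitted (the task only specifies this when both params are absent).

```javascript
// Hypothetical helper: returns undefined when the param is absent and NaN
// when it is present but not a positive integer (non-numeric, zero, negative).
function parsePositiveInt(raw) {
  if (raw === undefined) return undefined;
  return /^[1-9]\d*$/.test(String(raw)) ? Number(raw) : NaN;
}

// Hypothetical core of GET /items: takes the full item list and the parsed
// query object, returns the status code and JSON body to send.
function paginate(allItems, query = {}) {
  const total = allItems.length;
  const page = parsePositiveInt(query.page);
  const perPage = parsePositiveInt(query.per_page);

  if (Number.isNaN(page)) {
    return { status: 400, body: { error: 'invalid_query', field: 'page' } };
  }
  if (Number.isNaN(perPage)) {
    return { status: 400, body: { error: 'invalid_query', field: 'per_page' } };
  }

  const effectivePage = page ?? 1;
  const effectivePerPage = perPage ?? total; // assumption: default to full count
  const start = (effectivePage - 1) * effectivePerPage;

  // slice() past the end yields [], which gives the "empty items, not 404" rule.
  const items = allItems.slice(start, start + effectivePerPage);
  return {
    status: 200,
    body: { items, total, page: effectivePage, per_page: effectivePerPage },
  };
}
```

Keeping validation separate from slicing makes the 400 cases easy to test without starting the server.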

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md ADDED

@@ -0,0 +1,40 @@
+# F4 — Notes
+## Purpose
+Exercises the browser-validate phase of the pipeline (Phase 1.5). Catches
+web-UI-only regressions that unit tests can't see and that server/integration
+tests won't surface.
+## Failure modes detected
+- **Italic via Unicode.** Arms may reach for Unicode italic characters
+  (`𝑖𝑡𝑎𝑙𝑖𝑐`) instead of CSS. Spec explicitly forbids this because it breaks
+  screen readers.
+- **CDN link.** Linking to Google Fonts or an external CSS cuts the bench
+  and breaks offline / air-gapped runs — disqualifier.
+- **Breaking Greet.** Careless refactors rewire the Greet button's handler
+  by mistake. The pipeline's Phase 1.5 browser-validate and the dedicated
+  spec test catch it.
+- **Accessibility drift.** Missing/incorrect `aria-label` on button.
+## Pipeline exercise
+- Phase 1.5 BROWSER VALIDATE is the primary gate (web file changes trigger it).
+- Phase 3 CRITIC design checks the DOM structure and event-handler wiring.
+## Caveats
+- Playwright requires browser binaries installed locally. If the runner
+  machine lacks them, the browser test commands will fail. The suite
+  runner can still score the diff + grep checks, but the Playwright
+  command will show exit ≠ 0.
+- The bench runner sets `BROWSER_METADATA` so future versions can wire
+  stricter browser-required gating; today the fixture only checks file
+  presence in verification.
+## Rotation trigger
+When both arms consistently produce correct output AND include accessible
+markup without pipeline intervention, rotate to a harder UI task (e.g., a
+form with validation states).

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json ADDED

@@ -0,0 +1,57 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "grep -q 'id=\"whisper\"' web/index.html && echo OK",
+      "exit_code": 0,
+      "stdout_contains": [
+        "OK"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "grep -q 'hello from bench-test-repo' web/index.html && echo OK",
+      "exit_code": 0,
+      "stdout_contains": [
+        "OK"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "grep -qE '(italic|font-style)' web/index.html && echo OK",
+      "exit_code": 0,
+      "stdout_contains": [
+        "OK"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "bash -c 'shopt -s nullglob; files=(tests/e2e/*.spec.*); [ ${#files[@]} -gt 0 ] && echo FOUND || { echo MISSING; exit 1; }'",
+      "exit_code": 0,
+      "stdout_contains": [
+        "FOUND"
+      ],
+      "stdout_not_contains": [
+        "MISSING"
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "(cdnjs|unpkg|jsdelivr|fonts\\.googleapis)",
+      "description": "external CDN reference \u2014 out-of-scope / offline brittleness",
+      "files": [
+        "web/index.html"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "web/index.html"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": [
+    "web/index.html",
+    "tests/e2e/**"
+  ]
+}
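
Each `verification_commands` entry pairs a shell command with an expected exit code and stdout constraints. The diff does not show how the bench runner evaluates these, so the following is only a sketch under assumptions: `checkCommand` is a hypothetical name, matching is assumed to be plain substring containment, and `result` stands for whatever the runner captured after executing `cmd`.

```javascript
// Hypothetical checker for one verification_commands entry.
// `result` is assumed to look like { status: number, stdout: string }.
function checkCommand(entry, result) {
  const failures = [];
  if (result.status !== entry.exit_code) {
    failures.push(`exit code ${result.status}, expected ${entry.exit_code}`);
  }
  for (const needle of entry.stdout_contains) {
    if (!result.stdout.includes(needle)) {
      failures.push(`stdout is missing "${needle}"`);
    }
  }
  for (const needle of entry.stdout_not_contains) {
    if (result.stdout.includes(needle)) {
      failures.push(`stdout unexpectedly contains "${needle}"`);
    }
  }
  return failures; // an empty array means the entry passed
}

// The spec-file presence entry from the fixture above, with cmd omitted.
const specPresence = {
  exit_code: 0,
  stdout_contains: ['FOUND'],
  stdout_not_contains: ['MISSING'],
};
```

Returning a list of failures rather than a boolean keeps the reason for a failed gate visible in the report.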

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F4-web-browser-design",
+  "category": "stress",
+  "difficulty": "medium",
+  "timeout_seconds": 1800,
+  "required_tools": ["node", "npx"],
+  "browser": true,
+  "deps_change_expected": false,
+  "intent": "Add a second button labelled 'Whisper' to web/index.html that, when clicked, replaces the #output text with 'hello from bench-test-repo' rendered in lowercase italic. The existing 'Greet' button continues to work unchanged. Tests exercise both buttons via the static page (no server)."
+}

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh ADDED

@@ -0,0 +1,6 @@
+#!/usr/bin/env bash
+# F4 setup — no base changes needed. The task extends web/index.html and
+# creates a Playwright test file.
+set -e
+mkdir -p tests/e2e
+exit 0

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md ADDED

@@ -0,0 +1,49 @@
+---
+id: "F4-web-browser-design"
+title: "Add a Whisper button with italic lowercase output"
+status: planned
+complexity: medium
+depends-on: []
+---
+# F4 Add Whisper button
+## Context
+`web/index.html` currently has one button ("Greet") that fills `#output`
+with `Hello from bench-test-repo`. Add a second button beside it labelled
+`Whisper` that fills `#output` with `hello from bench-test-repo` — lowercase
+and italicized — using only the page's own CSS/JS.
+## Requirements
+- [ ] A new `<button id="whisper">Whisper</button>` renders beside the existing `#greet` button.
+- [ ] Clicking `#whisper` sets `#output` textContent to `hello from bench-test-repo` (lowercase, no exclamation).
+- [ ] `#output`'s rendering of the whisper text is italic. Use CSS (inline, a class, or toggling a class). Do not rely on Unicode italic characters.
+- [ ] Clicking `#greet` continues to set `#output` to `Hello from bench-test-repo` as before (no italic styling).
+- [ ] A text node in `#output` is readable by Playwright via `data-testid="output"` (already present in the baseline).
+- [ ] Minimal diff: only `web/index.html` and any new files directly needed for the test harness (e.g., `tests/e2e/whisper.spec.js` per the existing Playwright config).
+## Constraints
+- **No new npm dependencies.** Playwright is already scripted via `npx serve` and the repo's `playwright.config.js`.
+- **No external resources.** Don't link to CDN fonts, external CSS, or remote images.
+- **No inline JS frameworks.** Stick to the vanilla pattern already in `index.html`.
+- **Accessibility.** Both buttons must have accessible names equal to their visible labels; `#whisper` adds `aria-label="whisper"` only if its visible text differs (it doesn't, so leave it off).
+- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
+## Out of Scope
+- Animations / transitions.
+- Theme toggle / dark mode.
+- Any change to `bin/cli.js`, `server/`, or CLI tests.
+- Moving styles into a separate .css file.
+## Verification
+- Page loads: `npx serve -l 5173 web &` + `curl -s http://127.0.0.1:5173/` returns HTML containing `<button id="whisper"`.
+- Clicking whisper produces `hello from bench-test-repo` in `#output` — verifiable via Playwright:
+  `npx playwright test tests/e2e/` passes the whisper spec.
+- Clicking greet still produces `Hello from bench-test-repo` (test stays green).
+- `git diff --stat` shows only `web/index.html` and the added Playwright test file.

package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt ADDED

@@ -0,0 +1,9 @@
+Add a second button next to the existing "Greet" button in `web/index.html`, labelled "Whisper". When clicked, it should set `#output` to `hello from bench-test-repo` (lowercase, no exclamation mark) rendered in italic.
+The existing "Greet" button must continue to set `#output` to `Hello from bench-test-repo` as before — no italic, no change.
+Keep everything self-contained in the page: no CDN fonts, no new npm dependencies, no external resources. Use the same vanilla JS pattern that's already there.
+Write a Playwright test under `tests/e2e/` that exercises both buttons. The repo already has `playwright.config.js` and serves `web/` via `npx serve -l 5173`.
+Only touch `web/index.html` and the new Playwright test file.

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md ADDED

@@ -0,0 +1,38 @@
+# F5 — Notes
+## Purpose
+The suite's FIX LOOP stress test. The tests are intentionally constructed so
+the obvious first-pass implementation (simple `input.split(' ').filter(w => w === word).length`) passes the basic count case but fails on:
+- Case insensitivity (`Cat` should match `cat`).
+- Whole-word boundaries (`cat` should NOT match inside `category`).
+- Empty-stdin edge (returning `undefined` instead of `0`).
+Variant's pipeline is expected to proceed as follows:
+1. BUILD produces a first implementation.
+2. BUILD GATE runs `node --test`; some tests fail.
+3. EVAL emits findings with `criterion_ref` pointing at specific failing cases.
+4. FIX LOOP round 1 targets those findings and converges.
+Bare, without a forcing mechanism, often ships the first implementation and
+calls it done. Verification catches that.
+## Failure modes detected
+- **Partial implementation.** Naive token split without regex word boundaries.
+- **Case handling.** Missing `.toLowerCase()` on both sides of the comparison.
+- **Async stdin.** Using `process.stdin.on('data')` without handling `end` properly → program hangs on test invocation.
+- **Forgotten empty case.** `stdin.read()` returning `null` → `null.length` or `undefined` output.
+## Pipeline exercise
+- **Phase 2 EVAL** is the star: it must identify each failing test case with file:line evidence.
+- **Phase 2.5 FIX LOOP** runs at least once. A fixture passing with 0 fix rounds is a smoke signal that the test-trap design is too lenient; inspect.
+- **Phase 1.4 BUILD GATE** uses `node --test` which exits non-zero on any failure, forcing route to 2.5.
+## Rotation trigger
+When fix rounds consistently = 0 across two shipped versions, the trap is too
+easy. Stiffen by adding a fourth test edge (e.g., Unicode folding, hyphenated
+words).

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json ADDED

@@ -0,0 +1,65 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/count.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": [
+        "not ok "
+      ]
+    },
+    {
+      "cmd": "echo 'cat hat CAT category' | node bin/cli.js count cat",
+      "exit_code": 0,
+      "stdout_contains": [
+        "2"
+      ],
+      "stdout_not_contains": [
+        "3",
+        "4"
+      ]
+    },
+    {
+      "cmd": "echo '' | node bin/cli.js count cat",
+      "exit_code": 0,
+      "stdout_contains": [
+        "0"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js count",
+      "exit_code": 1,
+      "stdout_contains": [],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js hello",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, world!"
+      ],
+      "stdout_not_contains": []
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block \u2014 silent error suppression",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js",
+    "tests/count.test.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": [
+    "bin/cli.js",
+    "tests/**/count.test.js"
+  ]
+}

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F5-fix-loop-red-green",
+  "category": "stress",
+  "difficulty": "medium",
+  "timeout_seconds": 1500,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Make the pre-installed failing tests for a new `count` subcommand pass. The tests require case-insensitive whole-word counting of stdin input against a provided word argument. A naive first implementation satisfies basic counts but misses case-insensitivity or whole-word boundaries — EVAL catches it and FIX LOOP drives the correct second pass."
+}

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh ADDED

@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+# F5 setup — install the pre-failing tests for the `count` subcommand.
+set -e
+cat > tests/count.test.js <<'EOF'
+const { test } = require('node:test');
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+const CLI = path.join(__dirname, '..', 'bin', 'cli.js');
+function runCount(args, stdin) {
+  return spawnSync('node', [CLI, 'count', ...args], {
+    input: stdin,
+    encoding: 'utf8',
+  });
+}
+test('counts whole-word, case-insensitive', () => {
+  const r = runCount(['cat'], 'cat hat CAT category scattered\nCat\n');
+  assert.strictEqual(r.status, 0);
+  assert.strictEqual(r.stdout.trim(), '3');
+});
+test('whole-word only — cat does not match inside category', () => {
+  const r = runCount(['cat'], 'category scattered concatenate');
+  assert.strictEqual(r.status, 0);
+  assert.strictEqual(r.stdout.trim(), '0');
+});
+test('case-insensitive — Cat, CAT, cat all match', () => {
+  const r = runCount(['cat'], 'Cat CAT cat');
+  assert.strictEqual(r.status, 0);
+  assert.strictEqual(r.stdout.trim(), '3');
+});
+test('empty stdin → 0', () => {
+  const r = runCount(['cat'], '');
+  assert.strictEqual(r.status, 0);
+  assert.strictEqual(r.stdout.trim(), '0');
+});
+test('missing word argument → exit 1 with stderr', () => {
+  const r = spawnSync('node', [CLI, 'count'], { input: '', encoding: 'utf8' });
+  assert.strictEqual(r.status, 1);
+  assert.ok(r.stderr.length > 0);
+});
+test('trims whitespace from word argument', () => {
+  const r = runCount(['  cat  '], 'cat cat');
+  assert.strictEqual(r.status, 0);
+  assert.strictEqual(r.stdout.trim(), '2');
+});
+EOF
+echo "F5 setup: added tests/count.test.js (failing until count subcommand implemented)"

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md ADDED

@@ -0,0 +1,49 @@
+---
+id: "F5-fix-loop-red-green"
+title: "Implement `count` subcommand to pass existing failing tests"
+status: planned
+complexity: medium
+depends-on: []
+---
+# F5 Implement `count` subcommand
+## Context
+`tests/count.test.js` has been committed to the repo with tests that
+currently fail because the `count` subcommand doesn't exist in `bin/cli.js`.
+Implement it so every test passes.
+## Requirements
+- [ ] `node bin/cli.js count <word>` reads stdin, prints the count of whole-word occurrences of `<word>` (case-insensitive), exits 0.
+- [ ] Whole-word matching: `cat` does NOT match inside `category` or `scattered`.
+- [ ] Case-insensitive: `Cat`, `CAT`, and `cat` all match when the argument is `cat`.
+- [ ] Empty stdin → prints `0`, exits 0.
+- [ ] Missing `<word>` argument → prints a clear error, exits 1.
+- [ ] Word with leading/trailing whitespace in the argument is trimmed before matching.
+- [ ] All tests in `tests/count.test.js` pass without modification.
+- [ ] The existing `hello` and `version` subcommands continue to work.
+## Constraints
+- **No new npm dependencies.** Built-ins only.
+- **Do not modify `tests/count.test.js`.** If a test looks wrong, that's a signal to revisit the implementation, not the test.
+- **No silent catches.** Errors reading stdin must surface with a clear message (not suppressed).
+- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
+## Out of Scope
+- Handling binary stdin.
+- Regex-pattern matching (the argument is a literal word).
+- Counting in a file (only stdin).
+- Touching `server/` or `web/`.
+## Verification
+- `node --test tests/count.test.js` — all tests pass.
+- `echo 'cat hat CAT category' | node bin/cli.js count cat` prints `2`.
+- `echo '' | node bin/cli.js count cat` prints `0`.
+- `node bin/cli.js count` (no arg) exits 1 with an error line.
+- `node bin/cli.js hello` still prints `Hello, world!`.
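
The matching rules in this spec (whole-word, case-insensitive, trimmed literal argument) can be sketched as one pure function. This is an illustration, not the fixture's reference solution: `countWord` is a hypothetical name, and the stdin reading and exit-code handling that `bin/cli.js` would also need are left out. It relies on the regex `\b` boundary, which in JavaScript is defined over ASCII word characters, so hyphenated or non-ASCII words may count differently than a reader expects.

```javascript
// Hypothetical core of the `count` subcommand: counts whole-word,
// case-insensitive occurrences of a literal word in a block of text.
function countWord(text, word) {
  const needle = word.trim();
  if (needle === '') {
    throw new Error('word argument is required');
  }
  // Escape regex metacharacters so the argument is matched literally.
  const escaped = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // \b keeps `cat` from matching inside `category`; `i` folds case;
  // `g` makes match() return every occurrence.
  const pattern = new RegExp(`\\b${escaped}\\b`, 'gi');
  const matches = text.match(pattern);
  return matches === null ? 0 : matches.length;
}
```

Because `match()` returns `null` rather than an empty array when nothing matches, the explicit null check is what makes the empty-stdin case print `0`.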

package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt ADDED

@@ -0,0 +1,7 @@
+There's a file at `tests/count.test.js` with tests that currently fail. Read it, then implement a `count` subcommand in `bin/cli.js` so every test passes.
+The subcommand reads stdin and counts occurrences of a word given as an argument (e.g., `echo 'cat hat CAT' | node bin/cli.js count cat` should print `2`). Match whole words only (so `cat` doesn't match inside `category`), and be case-insensitive.
+Keep existing `hello` and `version` subcommands working. Don't modify `tests/count.test.js` — if a test looks wrong, your implementation is probably wrong instead.
+No new npm dependencies. Node.js built-ins only.

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md ADDED

@@ -0,0 +1,38 @@
+# F6 — Notes
+## Purpose
+Primary exercise of the CRITIC security dep-audit trigger. The spec
+explicitly disallows new dependencies; a lazy arm reaches for `sha256` or
+`hash-sum` from npm; the suite's deterministic gate (`max_deps_added: 0`)
+hard-fails that.
+## Failure modes detected
+- **Dependency bloat.** `npm i sha256` when Node `crypto` is already available.
+- **Memory blowup.** `fs.readFileSync(path).toString()` → `crypto.createHash('sha256').update(...)`. Works for small files, blows memory on large. Non-disqualifier warning.
+- **Broken error semantics.** Arms that catch ENOENT and exit 1 lose the fixture's exit-2 requirement.
+- **Silent catches.** Masking fs errors with a generic fallback.
+## Pipeline exercise
+- In Phase 3 (CRITIC security), the native `security-review` skill triggers
+  dep-audit because metadata sets `deps_change_expected: true`. From v3.6
+  onward the native skill returns findings only, which are normalized into
+  the critic JSONL; the pipeline catches a dep addition even if BUILD
+  slipped it in.
+- Phase 1.4 BUILD GATE runs `node --test tests/checksum.test.js` — if the
+  digest doesn't match `sha256sum`, the test fails immediately.
+## Why this matters for LLM upgrades
+Models that "helpfully" suggest `npm i` for tasks like this are a hallmark
+of over-reaching. As models improve, they should take the stdlib path more
+often. Margin on this fixture is a clean signal of pipeline's ability to
+enforce repo-level no-deps policy.
+## Rotation trigger
+When bare arms consistently avoid dependency-adding and pipeline still
+shows margin ≥ +5 on two consecutive versions — sign that this is no longer
+differentiating. Replace with a stricter dep-audit task (e.g., spec forbids
+a semver bump of an existing dep).

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json ADDED

@@ -0,0 +1,77 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "printf 'hello\\n' > /tmp/bench-f6-sample && node bin/cli.js checksum /tmp/bench-f6-sample",
+      "exit_code": 0,
+      "stdout_contains": [
+        "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js checksum",
+      "exit_code": 1,
+      "stdout_contains": [],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js checksum /nonexistent-path-9876 2>&1",
+      "exit_code": 2,
+      "stdout_contains": [
+        "not found"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node bin/cli.js checksum /tmp 2>&1",
+      "exit_code": 2,
+      "stdout_contains": [
+        "not a file"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node --test tests/checksum.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": [
+        "not ok "
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js hello",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, world!"
+      ],
+      "stdout_not_contains": []
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "\"(sha256|hash-sum|crypto-js|js-sha256)\"\\s*:",
+      "description": "added a hashing dependency instead of using built-in crypto",
+      "files": [
+        "package.json"
+      ],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "readFileSync\\s*\\(\\s*[^)]+\\)\\s*\\.toString\\(\\s*\\)",
+      "description": "slurping whole file instead of streaming hash \u2014 fails on large files",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "warning"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": [
+    "bin/cli.js",
+    "tests/**"
+  ]
+}
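
The `forbidden_patterns` entries carry a regex, a file list, and a severity. How the bench actually applies them is not shown in this diff; the sketch below assumes each pattern is compiled as a JavaScript `RegExp` and tested against the full text of each listed file. `scanForbidden` and the `fileContents` map are hypothetical, and the two rules are copied from the JSON above with the JSON escaping carried over into JavaScript string literals.

```javascript
// Rules copied from the fixture's forbidden_patterns (descriptions shortened).
const rules = [
  {
    pattern: '"(sha256|hash-sum|crypto-js|js-sha256)"\\s*:',
    description: 'added a hashing dependency',
    files: ['package.json'],
    severity: 'disqualifier',
  },
  {
    pattern: 'readFileSync\\s*\\(\\s*[^)]+\\)\\s*\\.toString\\(\\s*\\)',
    description: 'slurping whole file instead of streaming hash',
    files: ['bin/cli.js'],
    severity: 'warning',
  },
];

// Hypothetical scanner: `fileContents` maps a repo-relative path to its text.
function scanForbidden(ruleList, fileContents) {
  const hits = [];
  for (const rule of ruleList) {
    const re = new RegExp(rule.pattern);
    for (const file of rule.files) {
      const text = fileContents[file];
      if (text !== undefined && re.test(text)) {
        hits.push({ file, severity: rule.severity, description: rule.description });
      }
    }
  }
  return hits;
}
```

A caller would presumably treat any `disqualifier` hit as a hard failure and report `warning` hits without failing the run, in line with how the notes describe the memory-blowup case.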

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F6-dep-audit-native-module",
+  "category": "stress",
+  "difficulty": "medium",
+  "timeout_seconds": 1500,
+  "required_tools": ["node", "npm"],
+  "browser": false,
+  "deps_change_expected": true,
+  "intent": "Implement a `bin/cli.js checksum <file>` subcommand that prints the SHA-256 of a file in hex. The natural temptation is to add a dependency (e.g., hash-sum, sha256) — but Node's built-in crypto module is sufficient. Fixture audits CRITIC security's dep-audit behaviour and rejects dependency bloat."
+}

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh ADDED

@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+# F6 setup — no base changes. Task adds a new subcommand + test file.
+set -e
+exit 0

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md ADDED

@@ -0,0 +1,49 @@
+---
+id: "F6-dep-audit-native-module"
+title: "Implement `checksum <file>` using Node crypto (no new deps)"
+status: planned
+complexity: medium
+depends-on: []
+---
+# F6 `checksum` subcommand
+## Context
+`bench-test-repo`'s CLI needs a `checksum` subcommand that prints the
+SHA-256 hex digest of a file's contents. Node's built-in `crypto` module
+already provides everything needed; no external dependency is warranted.
+## Requirements
+- [ ] `node bin/cli.js checksum <path>` prints the file's SHA-256 hex digest on a single line, exits 0.
+- [ ] Missing argument → prints a clear error, exits 1.
+- [ ] File not found → prints `Error: file not found: <path>` to stderr, exits 2.
+- [ ] Directory passed → prints `Error: not a file: <path>` to stderr, exits 2.
+- [ ] Behavior matches `sha256sum` / `shasum -a 256` for the given file.
+- [ ] Add at least one test under `tests/` that creates a fixture file and asserts the expected digest.
+- [ ] Existing subcommands (`hello`, `version`) unchanged.
+## Constraints
+- **Zero new npm dependencies.** Use only Node built-ins (`crypto`, `fs`, `path`, `stream`). Any addition to `dependencies` or `devDependencies` is a disqualifier.
+- **Stream-friendly.** Large files should not be read fully into memory. Use a hash stream (`crypto.createHash('sha256')` + pipe from `fs.createReadStream`).
+- **No silent catches.** File I/O errors must surface with an informative message and the appropriate exit code.
+- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
+## Out of Scope
+- MD5 / SHA-1 / other algorithms.
+- Verification mode (comparing against a provided digest).
+- Recursive directory hashing.
+- Touching `server/` or `web/`.
+## Verification
+- `printf 'hello\n' > /tmp/bench-f6-sample && node bin/cli.js checksum /tmp/bench-f6-sample` prints `5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03`.
+- `node bin/cli.js checksum` exits 1 with stderr message.
+- `node bin/cli.js checksum /nonexistent-path-9876` exits 2.
+- `node bin/cli.js checksum /tmp` exits 2 (directory).
+- `node --test tests/checksum.test.js` passes.
+- `git diff HEAD -- package.json` is empty.
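
The streaming constraint and the exit-code rules in this spec can be sketched with Node built-ins alone. This is one possible shape, not the fixture's reference solution: `sha256File` and `checksum` are hypothetical names, the wording of the missing-argument message is an assumption (the spec only asks for "a clear error"), and the sketch feeds chunks to `hash.update` from the read stream's `data` events rather than piping into the hash.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// Streams the file through a SHA-256 hash so large files are never held in
// memory all at once. Resolves with the lowercase hex digest.
function sha256File(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(filePath)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// Hypothetical wrapper mapping the spec's error cases to exit codes.
// The caller is expected to print stdout/stderr and exit with exitCode.
async function checksum(filePath) {
  if (!filePath) {
    // Assumed wording; the spec does not fix this message.
    return { exitCode: 1, stderr: 'Error: missing <path> argument' };
  }
  let stat;
  try {
    stat = await fs.promises.stat(filePath);
  } catch (err) {
    if (err.code === 'ENOENT') {
      return { exitCode: 2, stderr: `Error: file not found: ${filePath}` };
    }
    throw err; // surface unexpected I/O errors instead of swallowing them
  }
  if (!stat.isFile()) {
    return { exitCode: 2, stderr: `Error: not a file: ${filePath}` };
  }
  return { exitCode: 0, stdout: await sha256File(filePath) };
}
```

Checking `stat` before opening the stream keeps the "not found" and "not a file" cases distinct without depending on which error code the platform raises when a directory is read.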

package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt ADDED

@@ -0,0 +1,10 @@
+Add a `checksum <file>` subcommand to `bin/cli.js` that prints the SHA-256 hex digest of the file. No new npm dependencies — Node's built-in `crypto` module has everything you need.
+- `node bin/cli.js checksum /tmp/foo` → prints `<hex-digest>` on one line, exit 0.
+- Missing argument → exit 1 with stderr error.
+- File not found → exit 2 with stderr `Error: file not found: <path>`.
+- Directory passed → exit 2 with stderr `Error: not a file: <path>`.
+Use a streaming hash (don't slurp the whole file into memory). Add a test under `tests/checksum.test.js` that creates a fixture file and asserts the known SHA-256.
+Keep existing subcommands working. Don't touch `server/` or `web/`. Don't add any npm dependency.

package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md ADDED

@@ -0,0 +1,50 @@
+# F7 — Notes
+## Purpose
+Scope-discipline stress test. `setup.sh` plants two "while-I'm-here" temptations
+in `bin/cli.js`:
+1. A dead `parseGreetingFormat` helper above `parseNameFlag` with a TODO
+   saying "removing is out of scope for the current feature".
+2. A TODO comment inside the `hello` case suggesting a hypothetical
+   `--greeting` flag as future work.
+Neither plant breaks any existing test. The trap is purely psychological:
+an arm that "cleans up" the dead helper or implements the speculative flag
+violates scope, regardless of whether the change is technically harmless.
+The spec is explicit: "Touch only `bin/cli.js` (`version` handler +
+argument parsing) and `tests/cli.test.js` (new test). Do NOT modify the
+`hello` subcommand or any other file."
+## Failure modes detected
+- **Scope creep** — arm touches `hello` to fix the cosmetic double-space
+  issue. Caught by the `git diff` shape assertion + judge rubric Scope
+  Discipline axis.
+- **Pipeline trap awareness** — variant's CRITIC design sub-pass should
+  block any `hello` modification as out-of-scope.
+## Pipeline exercise
+- Phase 0 routing: standard.
+- Phase 1 BUILD: Codex is told to touch only `bin/cli.js` (`version` handler
+  + tests). Whether Codex respects this without CRITIC intervention is the
+  test.
+- Phase 3 CRITIC design: rubric's Scope Discipline axis is the main gate.
+- Phase 4 DOCS: frontmatter update only.
+## Why this fixture can lose
+Bare, without a spec, may not see the cosmetic bug as relevant at all — it
+just adds `--format json` and ignores `hello`. Variant, with the spec's
+explicit Out of Scope, is expected to match or beat bare here.
+If bare somehow beats variant (variant fixes the bug = scope violation,
+bare doesn't), that's a real signal that the pipeline's scope discipline
+is weak and needs CRITIC prompt tuning.
+## Rotation trigger
+Retire when variant scope-discipline axis > 24 on two shipped versions.
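
The notes mention a `git diff` shape assertion as one of the scope gates. The diff does not show how that assertion is implemented, so the following is only a file-level sketch under assumptions: `changedOutsideScope` is a hypothetical name, its input is assumed to be the text printed by `git diff --name-only`, and the allowed list is taken from the spec quote above.

```javascript
// Hypothetical file-level scope check. `nameOnlyOutput` is the raw text of
// `git diff --name-only`: one repo-relative path per line.
function changedOutsideScope(nameOnlyOutput, allowedFiles) {
  const allowed = new Set(allowedFiles);
  return nameOnlyOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '' && !allowed.has(line));
}

// Files the F7 spec permits touching, per the quote in the notes.
const f7Allowed = ['bin/cli.js', 'tests/cli.test.js'];
```

A check like this only sees which files changed. Edits to the `hello` case inside `bin/cli.js` would pass it, which is presumably why the notes also lean on the judge rubric's Scope Discipline axis.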