npm - devlyn-cli - Versions diffs - 2.1.0 → 2.2.1 - Mend

devlyn-cli 2.1.0 → 2.2.1


Files changed (127)

package/CLAUDE.md CHANGED

@@ -24,7 +24,7 @@ The runtime sub-agent contract below (Subtractive-first / Goal-locked / No-worka
 ## Quick Start
-Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** — pair / multi-engine routing is research-only at HEAD per the iter-0020 + iter-0033g + iter-0034 close-outs (see [`autoresearch/iterations/0020-pair-policy-narrow.md`](autoresearch/iterations/0020-pair-policy-narrow.md) + [`autoresearch/iterations/0034-phase-4-cutover.md`](autoresearch/iterations/0034-phase-4-cutover.md)). Pass `--engine auto` or `--engine codex` explicitly to opt into the research path; the harness silently downgrades to `claude` and emits a banner if the Codex CLI is missing.
+Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** for PLAN/IMPLEMENT. Codex BUILD/IMPLEMENT and PLAN-pair remain research-only, but `/devlyn:resolve` VERIFY has a gated pair-JUDGE product path when its `SKILL.md` trigger policy fires. Pass `--engine auto` or `--engine codex` explicitly to opt into the broader research path; the harness silently downgrades to `claude` and emits a banner if the Codex CLI is missing.
 1. `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project`.
 2. `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <diff> --spec <path>`. Phases: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only).

package/benchmark/auto-resolve/README.md CHANGED

@@ -46,8 +46,26 @@ benchmark/auto-resolve/
 │   ├── run-fixture.sh        # one fixture × one arm, self-contained
 │   ├── judge.sh              # Codex blind judge for one fixture
 │   ├── compile-report.py     # aggregates into report.md + summary.json
-│   └── ship-gate.py          # applies thresholds + writes history record
+│   ├── ship-gate.py          # applies thresholds + writes history record
+│   ├── run-headroom-candidate.sh
+│   ├── headroom-gate.py      # blocks pair measurement without headroom set
+│   ├── test-headroom-gate.sh
+│   ├── run-full-pipeline-pair-candidate.sh
+│   ├── full-pipeline-pair-gate.py
+│   ├── test-full-pipeline-pair-gate.sh
+│   ├── run-frozen-verify-pair.sh
+│   ├── fetch-swebench-instances.py
+│   ├── collect-swebench-predictions.py
+│   ├── run-swebench-solver-batch.sh
+│   ├── prepare-swebench-frozen-case.py
+│   ├── prepare-swebench-frozen-corpus.py
+│   ├── run-swebench-frozen-corpus.sh
+│   ├── swebench-frozen-matrix.py
+│   ├── test-swebench-frozen-case.sh
+│   ├── frozen-verify-gate.py # gates frozen VERIFY pair-lift evidence
+│   └── test-frozen-verify-gate.sh
 │
+├── external/swebench/        # ignored local imports of SWE-bench cases/repos
 ├── results/<run-id>/         # per-run artifacts (overwritten)
 └── history/
     ├── runs/                 # append-only, one JSON per run
@@ -71,6 +89,305 @@ Follow `fixtures/SCHEMA.md`. Six files per fixture: `metadata.json`, `spec.md`,
 4. Fill `expected.json` with concrete verification commands and forbidden patterns.
 5. Document purpose + failure mode in `NOTES.md`.
 6. Add `setup.sh` if the task needs the base `test-repo` modified before either arm starts.
+7. Run `bash scripts/lint-fixtures.sh`.
+For L2/pair candidate fixtures, also run:
+```bash
+bash benchmark/auto-resolve/scripts/run-headroom-candidate.sh F16-cli-quote-tax-rules
+```
+This runs only the arms needed for calibration (`bare` and `solo_claude`),
+blind-judges them, and applies `headroom-gate.py`. A candidate set is not
+usable for pair measurement unless at least two fixtures pass and each fixture
+has clean `bare <= 60` and `solo_claude <= 80` scores. A one-fixture calibration
+run can show useful scores but does not satisfy the set gate.
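The set rule above can be sketched as a small predicate. This is a minimal illustration, not the logic of `headroom-gate.py` itself: the function names, the input shape, and the `clean` flag are assumptions made for the example, and it reads the rule as "count the fixtures that meet both thresholds and require at least two". Only the thresholds and the two-fixture minimum come from the text.

```javascript
// Hypothetical sketch of the headroom set rule; not the code in headroom-gate.py.
const BARE_MAX = 60;     // threshold from the README text
const SOLO_MAX = 80;     // threshold from the README text
const MIN_FIXTURES = 2;  // "at least two fixtures pass"

// A fixture has headroom when both calibration arms are clean and score low
// enough that a pair arm still has room to do better.
function fixtureHasHeadroom(f) {
  return f.bare.clean && f.solo_claude.clean &&
    f.bare.score <= BARE_MAX && f.solo_claude.score <= SOLO_MAX;
}

function headroomSetPasses(fixtures) {
  const passing = fixtures.filter(fixtureHasHeadroom);
  return { passing: passing.map((f) => f.id), ok: passing.length >= MIN_FIXTURES };
}

// A one-fixture calibration run: useful scores, but the set gate stays closed.
const oneFixture = [
  { id: 'F16', bare: { score: 50, clean: true }, solo_claude: { score: 75, clean: true } },
];
console.log(headroomSetPasses(oneFixture));
```

Under this reading a single passing fixture is reported in `passing` but `ok` stays false, which matches the note that a one-fixture run does not satisfy the set gate.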
+When changing the gate itself, run:
+```bash
+bash benchmark/auto-resolve/scripts/test-headroom-gate.sh
+```
+After a full-pipeline pair run has the calibrated arms (`bare`,
+`solo_claude`, `l2_gated` or `l2_risk_probes`) plus a blind `judge.json`, gate
+it separately:
+```bash
+bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
+  --max-pair-solo-wall-ratio 3 \
+  F21-cli-scheduler-priority F23-cli-fulfillment-wave
+```
+The runner executes `bare` + `solo_claude`, applies `headroom-gate.py`, and
+only then spends an `l2_gated` arm.
+When a prompt-only pair change needs a fresh `l2_gated` measurement but the
+calibrated `bare` + `solo_claude` arms are already clean, reuse them into a new
+run id:
+```bash
+bash benchmark/auto-resolve/scripts/run-full-pipeline-pair-candidate.sh \
+  --run-id <new-run-id> \
+  --reuse-calibrated-from <prior-headroom-run-id> \
+  --max-pair-solo-wall-ratio 3 \
+  F21-cli-scheduler-priority F23-cli-fulfillment-wave
+```
+To gate already-existing artifacts:
+```bash
+python3 benchmark/auto-resolve/scripts/full-pipeline-pair-gate.py \
+  --run-id <full-pipeline-run-id> \
+  --min-fixtures 2 \
+  --min-pair-margin 5 \
+  --max-pair-solo-wall-ratio 3 \
+  --out-json benchmark/auto-resolve/results/<full-pipeline-run-id>/full-pipeline-pair-gate.json \
+  --out-md benchmark/auto-resolve/results/<full-pipeline-run-id>/full-pipeline-pair-gate.md
+```
+This is the full-pipeline claim gate: each counted fixture must satisfy the
+headroom precondition (`bare <= 60`, `solo_claude <= 80`), the selected pair arm
+must be clean, `pair_mode` must be true in the captured resolve state, and the
+blind judge must score the pair arm at least `--min-pair-margin` above
+`solo_claude`. `l2_risk_probes` is the current measured pair arm for the
+F16/F25 gate: `20260509-f16-f25-combined-cartprobe-v2` passed with margins +21
+and +24, average pair/solo wall ratio 1.46x. When changing this gate, run:
+```bash
+bash benchmark/auto-resolve/scripts/test-full-pipeline-pair-gate.sh
+```
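The per-fixture conditions of the full-pipeline claim gate can be read as one boolean check. The sketch below is an illustration under assumptions: the row shape and field names (`pair`, `pairClean`, `pairMode`) are invented for the example, and the real gate reads run artifacts rather than a plain object. The thresholds and the margin rule are the ones stated in the text.

```javascript
// Hypothetical per-fixture check mirroring the four conditions described above.
function pairRowPasses(row, minPairMargin) {
  const headroom = row.bare <= 60 && row.solo_claude <= 80; // headroom precondition
  const margin = row.pair - row.solo_claude;                // blind-judge margin
  return headroom && row.pairClean && row.pairMode === true && margin >= minPairMargin;
}

// Invented scores for illustration: the pair arm is 21 points above solo_claude.
const row = { bare: 55, solo_claude: 70, pair: 91, pairClean: true, pairMode: true };
console.log(pairRowPasses(row, 5));
```

A row that scores well but was captured with `pair_mode` false fails the check, so a high score alone is not counted as pair evidence.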
+Commands that reference `BENCH_FIXTURE_DIR` are hidden post-run oracles: they
+are not staged into BUILD_GATE's `.devlyn/spec-verify.json`.
+To compare pair VERIFY against solo VERIFY on a frozen implementation diff,
+run:
+```bash
+bash benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh \
+  --fixture F16-cli-quote-tax-rules \
+  --diff benchmark/auto-resolve/results/<run-id>/F16-cli-quote-tax-rules/solo_claude/diff.patch \
+  --pair-mode gated
+```
+This applies the diff before `/devlyn:resolve` starts, then runs verify-only
+solo and pair arms against the same committed work tree. `--pair-mode gated`
+tests the product trigger policy; `--pair-mode forced` adds `--pair-verify` for
+diagnostics. Use non-empty diffs only; empty diffs fail fast because they are
+not valid pair evidence.
+Hidden verifier context is available during VERIFY, so this runner prevents
+IMPLEMENT contamination but is not an oracle-blind judge setup.
+The runner writes `compare.json`; `pair_verdict_lift: true` means pair VERIFY
+actually ran and found a verdict-binding issue that solo VERIFY did not.
+If an imported case has no deterministic `verification_commands`, the runner
+does not create `.devlyn/spec-verify.json`; an empty carrier is malformed by the
+normal real-user contract and must not block qualitative frozen review.
+To gate a set of frozen VERIFY results mechanically:
+```bash
+python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
+  --run-id 20260505T173913Z-9986cd3-frozen-verify \
+  --run-id 20260505T230215Z-9986cd3-frozen-verify \
+  --max-pair-solo-wall-ratio 3 \
+  --out-json benchmark/auto-resolve/results/frozen-verify-gate-20260505.json \
+  --out-md benchmark/auto-resolve/results/frozen-verify-gate-20260505.md
+```
+When changing the gate itself, run its regression test:
+```bash
+bash benchmark/auto-resolve/scripts/test-frozen-verify-gate.sh
+```
+This is deliberately narrower than `headroom-gate.py`: it does not claim
+full-pipeline pair superiority. It proves only that, after the implementation
+diff is frozen, gated pair VERIFY fires and returns a stricter verdict-binding
+result than solo VERIFY on the same diff. Each supplied run must cover a
+distinct fixture; repeated runs of the same fixture do not count as independent
+corpus growth. `--max-pair-solo-wall-ratio` is optional, but use it for
+ship-style evidence so quality lift is not accepted without a reasonable
+wall-time bound. The gate infers the fixture id from the runner input metadata;
+artifacts without that metadata, or with a fixture id absent from
+the selected `--fixtures-root`, fail instead of being counted as anonymous or
+fake evidence.
+### SWE-bench fixed-diff review pilot
+SWE-bench is useful here as an external, widely known corpus, but the first
+measurement surface should remain frozen VERIFY rather than full-pipeline
+generation. The official dataset fields include `instance_id`, `repo`,
+`base_commit`, `problem_statement`, `patch`, and `test_patch`; SWE-bench Lite is
+the smaller subset and SWE-bench Verified is the human-validated subset.
+See:
+- https://www.swebench.com/SWE-bench/guides/datasets/
+- https://www.swebench.com/lite.html
+- https://www.swebench.com/verified.html
+Fetch a small official Lite/Verified instance file without installing the
+Hugging Face Python stack:
+```bash
+python3 benchmark/auto-resolve/scripts/fetch-swebench-instances.py \
+  --dataset lite \
+  --limit 5 \
+  --out benchmark/auto-resolve/external/swebench/instances-lite.jsonl
+```
+Prepare one case from an instance JSON and a candidate patch produced by a solo
+run or another external solver:
+```bash
+python3 benchmark/auto-resolve/scripts/prepare-swebench-frozen-case.py \
+  --instance-json /path/to/swebench-instance.json \
+  --model-patch /path/to/solo-candidate.patch
+```
+Or prepare a small corpus from the official SWE-bench prediction JSONL shape
+(`instance_id`, `model_name_or_path`, `model_patch`):
+```bash
+python3 benchmark/auto-resolve/scripts/collect-swebench-predictions.py \
+  --patch-root /path/to/logs \
+  --instances-jsonl benchmark/auto-resolve/external/swebench/instances-lite.jsonl \
+  --model-name external-solo \
+  --out benchmark/auto-resolve/external/swebench/solo-predictions.jsonl
+```
+The collector expects `/path/to/logs/<instance_id>/patch.diff`; it is useful
+when another solver or a downloaded SWE-bench log bundle provides per-instance
+patch files rather than prediction JSONL.
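Each line of a prediction JSONL file carries the three fields named above. A minimal sketch of that shape, assuming nothing about how `collect-swebench-predictions.py` is written: `toPredictionLine` is a hypothetical helper, the patch text is a placeholder, and an in-memory map stands in for the `patch.diff` files on disk.

```javascript
// Hypothetical helper: builds one prediction JSONL line with the three fields
// named in the text (instance_id, model_name_or_path, model_patch).
function toPredictionLine(instanceId, modelName, patchText) {
  return JSON.stringify({
    instance_id: instanceId,
    model_name_or_path: modelName,
    model_patch: patchText,
  });
}

// Placeholder patch content keyed by instance id.
const patches = new Map([
  ['django__django-11039', 'diff --git a/x.py b/x.py\n'],
]);

// One JSON object per line, newline-terminated.
const jsonl = [...patches]
  .map(([id, patch]) => toPredictionLine(id, 'external-solo', patch))
  .join('\n') + '\n';
console.log(jsonl);
```

`JSON.stringify` escapes the newlines inside the patch, so every prediction stays on a single line.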
+```bash
+python3 benchmark/auto-resolve/scripts/prepare-swebench-frozen-corpus.py \
+  --instances-jsonl benchmark/auto-resolve/external/swebench/instances-lite.jsonl \
+  --predictions-jsonl /path/to/solo-predictions.jsonl \
+  --limit 5 \
+  --out-manifest benchmark/auto-resolve/external/swebench/manifest.json
+```
+Then run the command written to
+`benchmark/auto-resolve/external/swebench/cases/<instance_id>/run-command.txt`.
+For a one-off case, the command uses:
+```bash
+bash benchmark/auto-resolve/scripts/run-frozen-verify-pair.sh \
+  --fixture <instance_id> \
+  --fixtures-root benchmark/auto-resolve/external/swebench/cases \
+  --base-repo benchmark/auto-resolve/external/swebench/repos/<repo-cache> \
+  --diff benchmark/auto-resolve/external/swebench/cases/<instance_id>/model.patch \
+  --pair-mode gated
+```
+For a prepared corpus manifest, run the whole set and gate it:
+```bash
+bash benchmark/auto-resolve/scripts/run-swebench-frozen-corpus.sh \
+  --manifest benchmark/auto-resolve/external/swebench/manifest.json \
+  --min-runs 2 \
+  --max-pair-solo-wall-ratio 3 \
+  --timeout-seconds 900 \
+  --resume-completed-arms \
+  --run-ids-out benchmark/auto-resolve/results/swebench-frozen-run-ids.txt \
+  --out-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
+  --out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
+```
+To re-gate existing run ids without re-invoking providers, write one run id per
+line and pass `--gate-only-run-ids <file>` with the same manifest. For large
+tranches, keep `--run-ids-out` and use `--resume-completed-arms` on retries:
+successful solo/pair arms are reused, while failed or provider-limited arms run
+again. The run ids file is the durable handle for gate-only reruns and matrix
+rendering after a bounded run finishes.
+To produce local candidate patches for a bounded pilot, prepare a solver
+worktree from the same instance JSONL. The generated spec contains only the
+visible SWE-bench problem statement; do not read the instance's gold `patch` or
+`test_patch` while solving.
+```bash
+python3 benchmark/auto-resolve/scripts/prepare-swebench-solver-worktree.py \
+  --instances-jsonl benchmark/auto-resolve/external/swebench/instances-lite.jsonl \
+  --instance-id django__django-11019 \
+  --copy-devlyn-context
+```
+Run the prompt in `<worktree>/solve-prompt.txt`, save the resulting diff as
+`<patch-root>/<instance_id>/patch.diff`, then use
+`collect-swebench-predictions.py` to create prediction JSONL.
+For a bounded local pilot, the batch runner performs those steps
+sequentially and collects prediction JSONL. It redirects provider stdin away
+from the manifest stream so later rows cannot be consumed by a child process.
+The generated solver worktrees and repo caches can become large; once
+`predictions-out` is written and cases are prepared, remove ignored local cache
+directories such as `external/swebench/worktrees/` and
+`external/swebench/repos-solver/` if disk pressure would otherwise interrupt
+the frozen corpus run. Use `--timeout-seconds` and `--resume` for large
+tranches; long-tail solver rows should be recorded as throughput failures
+instead of letting one row hold the whole suite open.
+```bash
+bash benchmark/auto-resolve/scripts/run-swebench-solver-batch.sh \
+  --instances-jsonl benchmark/auto-resolve/external/swebench/instances-lite.jsonl \
+  --instance-id django__django-11039 \
+  --instance-id django__django-11049 \
+  --predictions-out benchmark/auto-resolve/external/swebench/predictions-lite.jsonl \
+  --copy-devlyn-context
+```
+Gate a SWE-bench review pilot by pointing the existing frozen gate at the
+external case root:
+```bash
+python3 benchmark/auto-resolve/scripts/frozen-verify-gate.py \
+  --fixtures-root benchmark/auto-resolve/external/swebench/cases \
+  --run-id <swebench-frozen-run-1> \
+  --run-id <swebench-frozen-run-2> \
+  --run-id <swebench-frozen-run-3> \
+  --min-runs 3 \
+  --max-pair-solo-wall-ratio 3 \
+  --out-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
+  --out-md benchmark/auto-resolve/results/swebench-frozen-gate.md
+```
+This gives evidence for "pair review catches solo-missed verdict-binding issues
+on real SWE-bench patches." The gate accepts either external solo-vs-pair
+verdict lift or internal pair lift (`pair_judge` stricter than the pair run's
+primary judge), because separate solo and pair primary judges are stochastic.
+For evidence intended to support shipping policy, also set a wall-ratio cap and
+inspect `avg_pair_solo_wall_ratio` plus each row's `pair_solo_wall_ratio`.
+For selection-bias control, render every run in the attempted pilot, not just
+gate rows. The matrix reports verdict-lift rows separately from recall-only
+rows where pair found additional findings but did not change the binding
+verdict. It also reports classification counts, gate rate, and trailing
+non-gate rows. Use the optional yield thresholds when the matrix is meant to
+fail closed instead of only documenting that additional rows are adding
+controls without strengthening the proof gate:
+```bash
+python3 benchmark/auto-resolve/scripts/swebench-frozen-matrix.py \
+  --title "SWE-bench Lite Frozen VERIFY Matrix" \
+  --verdict MIXED_WITH_GATE_PASS \
+  --gate-json benchmark/auto-resolve/results/swebench-frozen-gate.json \
+  --run-id <swebench-frozen-run-1> \
+  --run-id <swebench-frozen-run-2> \
+  --min-gate-rate 0.25 \
+  --max-trailing-non-gate 10 \
+  --out-json benchmark/auto-resolve/results/swebench-frozen-matrix.json \
+  --out-md benchmark/auto-resolve/results/swebench-frozen-matrix.md
+```
+It does not measure official SWE-bench solve rate; run the official SWE-bench
+evaluator separately for that metric. When changing the importer or
+external-base runner path, run:
+```bash
+bash benchmark/auto-resolve/scripts/test-swebench-frozen-case.sh
+```
+Do not use the retired full-pipeline `l2_forced` arm as pair evidence. It puts
+`--pair-verify` in the initial prompt, so IMPLEMENT can become pair-aware before
+the diff is frozen.
 ## LLM-upgrade resilience
@@ -91,7 +408,9 @@ Soft gates (warning, not block): suite-margin drop > 3, fixture losing its margi
 ## Running the full suite (real)
-Full real benchmark costs roughly 2-3 minutes per arm for simple fixtures and up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures × 2 arms can take 30 min – 2 hrs depending on routes taken.
+Full real benchmarks usually take 2-3 minutes per arm for simple fixtures and
+up to 15 minutes per arm for strict-route fixtures. A full n=1 run of 9 fixtures
+× 2 arms can take 30 min - 2 hrs depending on routes taken.
 ```bash
 # Smoke run before ship decisions

package/benchmark/auto-resolve/RUBRIC.md CHANGED

@@ -23,6 +23,12 @@ Does every Verification command behave as the spec states?
 - **7-12** — Major requirements missed.
 - **0-6** — Does not address the core task.
+Mechanical cap: after the blind judge returns, `judge.sh` caps total score at
+`floor(100 * verify_score)` and caps the Spec Compliance axis at
+`floor(25 * verify_score)`. This makes the machine-readable acceptance
+contract binding when a judge grades prose generously despite failed required
+verification commands.
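The cap arithmetic can be worked through with a short sketch. It assumes `verify_score` is a fraction between 0 and 1 and that the judge result exposes a total and a Spec Compliance value; the function name and field names are hypothetical, and whether `judge.sh` recomputes the total from the axes afterwards is not stated in the text.

```javascript
// Sketch of the mechanical cap described above; not the code in judge.sh.
function applyMechanicalCap(judge, verifyScore) {
  const totalCap = Math.floor(100 * verifyScore); // cap on the total score
  const specCap = Math.floor(25 * verifyScore);   // cap on the Spec Compliance axis
  return {
    total: Math.min(judge.total, totalCap),
    specCompliance: Math.min(judge.specCompliance, specCap),
  };
}

// A generous judge grade with verify_score = 0.5:
console.log(applyMechanicalCap({ total: 88, specCompliance: 22 }, 0.5));
// { total: 50, specCompliance: 12 }
```

With `verify_score = 1` both caps equal the axis maxima, so the judge's scores pass through unchanged.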
 ### Axis 2 — Constraint Respect (0-25)
 Zero new npm deps (unless spec allows), no silent catches (`try { } catch { return fallback }`), no `any`/`@ts-ignore`, explicit HOME/env guards where required, EACCES-specific handling, no hardcoded values that should be configurable.

package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md CHANGED

@@ -28,7 +28,6 @@ This is a low-risk edit used to calibrate trivial-tier fixture difficulty.
 - **No silent catches.** If an unknown flag is passed, exit 1 with an informative message (same pattern as the existing `--name` handler).
 - **Surgical diff.** Only touch `bin/cli.js` and `tests/cli.test.js`. Do not reformat unrelated code.
-- **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
 ## Out of Scope

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/NOTES.md ADDED

@@ -0,0 +1,63 @@
+# F10 — Notes
+## Purpose
+Pair-discriminating high-risk fixture. Adds a write endpoint with persistent
+storage. The pair-edge mechanism: when an implementation reads-modifies-
+writes a JSON file in response to async requests, multiple in-flight POSTs
+can interleave during the await gap between read and write — producing
+duplicate ids or lost writes. A single-pass implementer typically writes
+the natural shape:
+```js
+const data = JSON.parse(await fs.readFile(...));
+data.items.push({ id: data.items.length + 1, ...body });
+await fs.writeFile(...);
+```
+This passes single-POST tests trivially. The failure only surfaces when
+multiple POSTs are in flight together. A reviewer with fresh eyes asking
+"what happens if two of these run at once?" catches it; the same model
+reviewing its own work tends to anchor on the "happy path" view.
+## Failure modes detected
+- **Duplicate ids** from `length + 1` after concurrent reads.
+- **Lost writes** from last-writer-wins overwrite of in-flight POSTs.
+- **No persistence** — implementer keeps in-memory only, ignoring the
+  restart-survival requirement. Caught by `data/items.json` byte check.
+- **Silent catch** wrapping the write path. Caught by forbidden_pattern.
+## Pipeline exercise
+- Phase 1 BUILD: implementer must derive that an awaited file read-modify-
+  write under parallel requests requires either serialization (mutex /
+  queue) or a unique-id source independent of array length.
+- Phase 2 EVAL: scrutinizes whether the new tests exercise the close-
+  together POST case rather than just single-POST happy path.
+- Phase 3 CRITIC: scope discipline + production-ready invariant on the
+  consistency claim.
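The interleaving and one serialization option can be reproduced without a server. The sketch below is an assumption-laden illustration, not the fixture's reference solution: an in-memory store with a 1 ms delay stands in for `data/items.json`, and `makeSerializedAppend` is a hypothetical promise-chain queue. A unique-id source independent of array length would be the other route the notes mention.

```javascript
// In-memory stand-in for data/items.json so the sketch runs without disk I/O.
function makeStore() {
  let text = JSON.stringify({ items: [{ id: 1 }, { id: 2 }] });
  const tick = () => new Promise((resolve) => setTimeout(resolve, 1));
  return {
    read: async () => { await tick(); return text; },
    write: async (next) => { await tick(); text = next; },
  };
}

// The natural single-pass shape: an await gap separates the read from the write.
async function appendNaive(store, body) {
  const data = JSON.parse(await store.read());
  const item = { id: data.items.length + 1, ...body };
  data.items.push(item);
  await store.write(JSON.stringify(data));
  return item;
}

// Hypothetical fix: chain each append onto the previous one so that no two
// read-modify-write cycles overlap. Errors still reach the caller through `run`.
function makeSerializedAppend(store) {
  let tail = Promise.resolve();
  return (body) => {
    const run = tail.then(() => appendNaive(store, body));
    const settle = () => undefined;
    tail = run.then(settle, settle); // the queue continues after a failed append
    return run;
  };
}

async function demo(useQueue) {
  const store = makeStore();
  const append = useQueue ? makeSerializedAppend(store) : (b) => appendNaive(store, b);
  const created = await Promise.all(
    [1, 2, 3].map((n) => append({ name: `item-${n}`, qty: n })),
  );
  const stored = JSON.parse(await store.read()).items;
  return { stored: stored.length, distinctNewIds: new Set(created.map((i) => i.id)).size };
}

demo(false).then((r) => console.log('naive', r));
demo(true).then((r) => console.log('queued', r));
```

With the queue, the store ends with the two seeded items plus three new ones, each with its own id. Without it, the three appends all read the same snapshot, so writes are lost and the returned ids collide. Once appends are serialized, `length + 1` is safe in this sketch only because nothing is ever deleted.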
+## Discrimination expectation
+Calibration target (set in pyx-memory project memory 2026-05-05):
+- bare arm: 40-65 (passes spec wording verbatim, fails parallel verifier).
+- solo arm: 60-78 (review pass may catch one mode, may miss another).
+- pair arm: 75-90 (cross-perspective derivation of the consistency
+  invariant from the English "close together" wording).
+## Public-spec wording — load-bearing
+The spec deliberately uses "close together" and "remain consistent"
+instead of trigger keywords. If the spec were to say "race condition",
+"concurrent", "atomic", or "lock", a single-pass solo arm would
+keyword-match the answer pattern and ace the fixture. The English
+prose forces invariant derivation, which is where pair has the edge.
+## Rotation trigger
+Retire when both arms consistently land > 90 across two shipped versions,
+OR when "close-together-write" becomes a recognized pattern such that
+solo arm reliably reaches for a serializing mechanism on first read.
+Whichever comes first.

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/expected.json ADDED

@@ -0,0 +1,60 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/server.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": ["not ok "]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/parallel.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "After multiple POSTs (single or many at once), `GET /items` reflects every successful POST exactly once",
+        "Several POSTs sent close together produce a list with every item, each with a distinct id"
+      ]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/invalid.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Invalid body (missing `name`, missing `qty`, `qty` not a positive number) → respond `400` with `{ error: 'invalid_body', field: '<name>' }`",
+        "An invalid POST returns `400` and the persisted file is byte-identical to its pre-request contents."
+      ]
+    },
+    {
+      "cmd": "node -e 'const fs=require(\"fs\"); const d=JSON.parse(fs.readFileSync(\"data/items.json\",\"utf8\")); console.log(Array.isArray(d.items));'",
+      "exit_code": 0,
+      "stdout_contains": ["true"],
+      "stdout_not_contains": ["false", "undefined"]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\})",
+      "description": "silent catch returning fallback in write path",
+      "files": ["server/index.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block",
+      "files": ["server/index.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "/\\*\\s*eslint-disable",
+      "description": "eslint-disable without scoped justification",
+      "files": ["server/index.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["server/index.js", "tests/server.test.js", "data/items.json"],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": ["server/index.js", "tests/server.test.js"]
+}
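To see what the three `forbidden_patterns` accept and reject, they can be run against a few snippets. This sketch assumes JavaScript regex semantics, which may differ from the engine the harness actually uses; the short pattern names and the sample strings are invented for the example.

```javascript
// Patterns copied from expected.json above; evaluated with JavaScript regex
// semantics, which is an assumption about the harness.
const forbidden = [
  { name: 'silent-catch-fallback',
    re: new RegExp("catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\})") },
  { name: 'empty-catch',
    re: new RegExp("catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}") },
  { name: 'eslint-disable',
    re: new RegExp("/\\*\\s*eslint-disable") },
];

// Returns the names of every pattern that matches the given source text.
function findViolations(source) {
  return forbidden.filter((p) => p.re.test(source)).map((p) => p.name);
}

const samples = [
  "try { save(); } catch (err) { return null; }",
  "try { save(); } catch (err) {}",
  "try { save(); } catch (err) { res.status(500).json({ error: 'write_failed' }); }",
  "/* eslint-disable no-console */",
];
for (const s of samples) console.log(findViolations(s), s);
```

Under these semantics the third sample, which surfaces the failure as a `500`, matches nothing. Both catch patterns require a parenthesized binding, so a `catch { ... }` without one would not be flagged.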

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F10-persist-write-collision",
+  "category": "high-risk",
+  "difficulty": "high",
+  "timeout_seconds": 1500,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add POST /items that appends a new entry and persists the full list to data/items.json. The list and its ids must remain consistent when multiple POSTs arrive close together — every saved item ends up with a unique numeric id and none are lost."
+}

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/setup.sh ADDED

@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+# F10 setup — seed data/items.json.
+set -e
+mkdir -p data
+# Seed with the same baseline the in-memory items started with.
+cat > data/items.json <<'JSON'
+{
+  "items": [
+    { "id": 1, "name": "alpha", "qty": 3 },
+    { "id": 2, "name": "beta", "qty": 5 }
+  ]
+}
+JSON
+exit 0

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/spec.md ADDED

@@ -0,0 +1,51 @@
+---
+id: "F10-persist-write-collision"
+title: "Add POST /items with persistent store"
+status: planned
+complexity: high
+depends-on: []
+---
+# F10 Add POST /items with persistent store
+## Context
+`server/index.js` currently keeps `items` in an in-memory array that
+resets on restart. The task: add `POST /items` that appends a new entry
+and persists the full list to `data/items.json` so state survives
+restarts. Read paths (`GET /items`, `GET /items/:id`) load from the same
+persisted file.
+The list and every saved item's `id` must remain consistent when
+multiple POSTs arrive close together — no duplicate ids, no lost writes.
+## Requirements
+- [ ] `POST /items` accepts JSON body `{ name: string, qty: number }` and returns `201` with `{ item: { id, name, qty } }` on success.
+- [ ] `id` is a unique numeric value. Every successful POST returns an id no other item has, past or present.
+- [ ] After multiple POSTs (single or many at once), `GET /items` reflects every successful POST exactly once — `items.length` equals starting count plus successful POST count.
+- [ ] `data/items.json` on disk reflects the same state. Restarting the server preserves the items.
+- [ ] `GET /items` and `GET /items/:id` read from the persisted file (or from state initialized from it at startup, with all writes persisted before the response is returned).
+- [ ] Invalid body (missing `name`, missing `qty`, `qty` not a positive number) → respond `400` with `{ error: 'invalid_body', field: '<name>' }`. The persisted store must be unchanged after a 400 response.
+- [ ] `tests/server.test.js` is updated. Existing assertions still hold AND at least two new tests cover POST behavior, including a test that fires several POST requests close together and verifies the resulting list contains every item with distinct ids.
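The invalid-body requirement can be sketched as a pure function that runs before any write. This is one possible reading, not the expected implementation: `validateItemBody` is a hypothetical name, checking `name` before `qty` is an assumption, and so is treating a non-string or empty `name` as missing.

```javascript
// Hypothetical validator for the POST /items body. Returns null when the body
// is acceptable, otherwise the 400 response payload described in the spec.
function validateItemBody(body) {
  if (!body || typeof body.name !== 'string' || body.name.length === 0) {
    return { error: 'invalid_body', field: 'name' };
  }
  if (typeof body.qty !== 'number' || !Number.isFinite(body.qty) || body.qty <= 0) {
    return { error: 'invalid_body', field: 'qty' };
  }
  return null;
}

console.log(validateItemBody({ name: 'noqty' }));          // qty is missing
console.log(validateItemBody({ name: 'gamma', qty: 0 }));  // qty is not positive
console.log(validateItemBody({ name: 'gamma', qty: 2 }));  // acceptable body
```

Running the check before the store is read or written is one way to keep the persisted file byte-identical after a `400`, since the write path is never entered.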
+## Constraints
+- **No new npm dependencies.** Use only Express + Node built-ins (`fs/promises`).
+- **No silent catches.** Any `try/catch` in the write path must surface failure as `500` with a clear body, not return a fake success.
+- **No hardcoded ids.** Existing baseline ids (1, 2) remain valid; new ids must not collide with any past or present id.
+- **No breaking change** to `GET /items` shape or `GET /items/:id` semantics.
+## Out of Scope
+- Authentication, rate limiting, validation libraries.
+- Database integration. The store is a single JSON file by design.
+- Touching `bin/cli.js`, `web/`, or `tests/cli.test.js`.
+## Verification
+- `node --test tests/server.test.js` exits 0 and the new tests cover POST.
+- Several POSTs sent close together produce a list with every item, each with a distinct id — assertable from a small Node script that POSTs in parallel and reads back `GET /items`.
+- After server restart, items posted before the restart are still present in `GET /items`.
+- An invalid POST returns `400` and the persisted file is byte-identical to its pre-request contents.
+- `git diff --stat` shows only `server/index.js` and `tests/server.test.js` touched (the `data/items.json` seed comes from setup, not the arm).

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/task.txt ADDED

@@ -0,0 +1,9 @@
+Add `POST /items` to `server/index.js` so users can append a new item with `{ name, qty }`. The full list should be persisted to `data/items.json` so state survives restarts — the existing read endpoints (`GET /items`, `GET /items/:id`) should also use the same persisted file going forward.
+Every saved item must have a unique numeric id. If two or more POSTs arrive close together, the resulting list must contain every successful one with distinct ids — no duplicates, nothing lost.
+Invalid body (missing `name`, missing `qty`, `qty` not a positive number) → respond `400` with `{ error: 'invalid_body', field: '<name>' }`, and the persisted store must be unchanged after the 400.
+Update `tests/server.test.js` so existing tests still pass AND add at least two new tests covering POST behavior. One of them must fire several POSTs close together and verify the final list contains every item with distinct ids.
+No new npm dependencies. Only touch `server/index.js`, `tests/server.test.js`, and `data/items.json` (which is seeded for you).

package/benchmark/auto-resolve/fixtures/F10-persist-write-collision/verifiers/invalid.js ADDED

@@ -0,0 +1,29 @@
+'use strict';
+const fs = require('fs');
+const http = require('http');
+const path = require('path');
+const { app } = require(path.join(process.env.BENCH_WORKDIR, 'server'));
+const before = fs.readFileSync('data/items.json');
+const s = http.createServer(app).listen(0, () => {
+  const { port } = s.address();
+  const req = http.request(
+    { host: '127.0.0.1', port, method: 'POST', path: '/items',
+      headers: { 'Content-Type': 'application/json' } },
+    (r) => {
+      let b = '';
+      r.on('data', (c) => (b += c));
+      r.on('end', () => {
+        const after = fs.readFileSync('data/items.json');
+        const same = before.equals(after);
+        const ok = r.statusCode === 400 && same;
+        console.log(JSON.stringify({ status: r.statusCode, store_unchanged: same, ok }));
+        s.close();
+        process.exit(ok ? 0 : 1);
+      });
+    }
+  );
+  req.write(JSON.stringify({ name: 'noqty' }));
+  req.end();
+});