npm - devlyn-cli - Versions diffs - 2.3.0 → 2.3.1 - Mend

devlyn-cli 2.3.0 → 2.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (219)

package/benchmark/auto-resolve/fixtures/F11-batch-import-all-or-nothing/NOTES.md CHANGED

@@ -64,7 +64,12 @@ forces invariant derivation — the discriminating axis.
 ## Rotation trigger
-Retire when both arms consistently land > 90 across two shipped versions,
-OR when "all-or-nothing batch" becomes a recognized pattern such that
-solo arm reliably validates-first on the initial implementation pass.
+Headroom run `20260507-f10-f11-tier1-full-pipeline` rejected this fixture as
+pair-lift evidence: bare scored 98 and solo_claude scored 97. Keep it as an
+atomic batch control unless the visible contract is reworked to expose lower
+bare/solo ceilings.
+Retire when both `bare` and `solo_claude` consistently land > 90 across two
+shipped versions, OR when "all-or-nothing batch" becomes a recognized pattern
+such that solo arm reliably validates-first on the initial implementation pass.
 Whichever comes first.

package/benchmark/auto-resolve/fixtures/F12-webhook-raw-body-signature/NOTES.md CHANGED

@@ -77,7 +77,13 @@ keywords. Raw-body trap is intentionally left without explicit
 ## Rotation trigger
-Retire when both arms consistently land > 90 across two shipped versions
-on this fixture. If the raw-body verifier (#5) becomes saturated faster
-than the others, replace it with a different platform blindspot rather
-than retiring the whole fixture.
+Headroom run `20260511-f12-webhook-headroom` rejected this fixture as pair-lift
+evidence: bare scored 85 and solo_claude scored 99. Bare still missed one of
+seven verifiers, but the `bare` and `solo_claude` judge scores exceed the
+headroom ceilings. Keep it as a webhook/security control unless the visible
+contract is reworked to expose lower bare/solo ceilings.
+Retire when both `bare` and `solo_claude` consistently land > 90 across two
+shipped versions on this fixture. If the raw-body verifier (#5) becomes
+saturated faster than the others, replace it with a different platform blindspot
+rather than retiring the whole fixture.

package/benchmark/auto-resolve/fixtures/F15-frozen-diff-race-review/NOTES.md CHANGED

@@ -92,7 +92,13 @@ that doesn't read the await sequence carefully will gloss over these.
 ## Rotation trigger
-Retire when both bare and solo arms consistently land > 85 across two
-shipped versions. If 2026 baseline reliably catches the awaited RMW race
-on cold read of someone else's code, the frozen-diff review thesis also
-needs updating — not just the seeded bug.
+Headroom run `20260511-f15-concurrency-headroom` rejected this fixture as
+pair-lift evidence: bare scored 99 and solo_claude scored 94, so `bare` and
+`solo_claude` are above the headroom ceilings (`bare <= 60`, `solo_claude <=
+80`). Keep the fixture as a frozen-diff review control unless the visible
+contract is reworked to expose a lower solo ceiling.
+Retire when both `bare` and `solo_claude` consistently land > 85 across two
+shipped versions. If 2026 baseline reliably catches the awaited RMW race on
+cold read of someone else's code, the frozen-diff review thesis also needs
+updating — not just the seeded bug.
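The headroom gate referenced in these notes can be read as a pair of ceilings on the baseline arms. The sketch below is a minimal illustration, assuming the ceilings quoted in the F15 note (`bare <= 60`, `solo_claude <= 80`) and assuming that a fixture clears the gate only when both conditions hold. `HEADROOM_CEILINGS` and `passesHeadroom` are hypothetical names, not the benchmark's own code.

```javascript
'use strict';

// Ceilings quoted in the F15 note. The both-must-hold rule is an assumption.
const HEADROOM_CEILINGS = { bare: 60, solo_claude: 80 };

// Hypothetical helper: true when both baseline scores leave room for pair lift.
function passesHeadroom(scores, ceilings = HEADROOM_CEILINGS) {
  return scores.bare <= ceilings.bare && scores.solo_claude <= ceilings.solo_claude;
}

// F15 scores reported by run 20260511-f15-concurrency-headroom.
console.log(passesHeadroom({ bare: 99, solo_claude: 94 })); // false
```

Under this reading, a fixture with a low `bare` score is still rejected when `solo_claude` alone sits above its ceiling, which matches the rejections reported for F26, F31, and F32 in this diff.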

package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/NOTES.md CHANGED

@@ -19,6 +19,18 @@ F1/F2 test CLI shape, but not business-rule arithmetic. F10/F11/F12 test
 server behavior and persistence. F15 tests review behavior. None combine
 hidden product math, exact machine output, and source-of-truth pricing.
+## Measurement status
+Pair evidence passed in `20260510-f16-f23-f25-combined-proof`:
+bare `50`, solo_claude `75`, pair `96`, margin `+21`, wall `1.28x`,
+arm `l2_risk_probes`, verdict `pair_evidence_passed`.
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss duplicate-SKU aggregation
+before stock validation and exact integer tax/shipping totals; observable
+command `node "$BENCH_FIXTURE_DIR/verifiers/exact-success.js"` exposes the miss.
 ## Retirement
 Retire or replace this fixture if both bare and solo consistently score

package/benchmark/auto-resolve/fixtures/F16-cli-quote-tax-rules/spec.md CHANGED

@@ -55,3 +55,9 @@ the output must be machine-readable.
 - A quote over combined stock exits `2`, prints one JSON error to stderr, and prints no stdout.
 - The stock error object includes `sku`, `available`, and `requested`.
 - `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched (the pricing seed comes from setup, not the arm).
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss duplicate-SKU aggregation
+before stock validation or tax/discount calculation; observable command
+`node "$BENCH_FIXTURE_DIR/verifiers/exact-success.js"` exposes the miss.

package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md CHANGED

@@ -39,16 +39,19 @@ small enough that every arm can plausibly finish in < 10 minutes.
 ## When should this fixture be retired or replaced?
-When both arms score > 95 for two consecutive shipped versions — i.e., the
-fixture saturates and no longer differentiates. Candidate replacement: a
-similar-size CLI task with multiple interacting flags or a subcommand that
-spawns a child process.
+When both `bare` and `solo_claude` score > 95 for two consecutive shipped
+versions — i.e., the fixture saturates and no longer differentiates. Candidate
+replacement: a similar-size CLI task with multiple interacting flags or a
+subcommand that spawns a child process.
 ## Calibration history
 - v3.4   skill 57 / bare 45 / margin +12 (gpt-5.3-codex judge)
 - v3.4.1 skill 59 / bare 43 / margin +16 (gpt-5.3-codex judge)
 - v3.5   skill 92 / bare 81 / margin +11 (gpt-5.4 xhigh judge) — huge absolute jump; bare silent-catch caught
+- 20260512-f2-medium-headroom bare 83 / solo_claude 95 — rejected as
+  pair-lift evidence because both baseline scores exceed current headroom
+  ceilings.
 Absolute scores jumped with the stronger judge. Margin stays solid (+11
 after stdlib calibration is expected to open a few points more).

package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/NOTES.md CHANGED

@@ -20,6 +20,18 @@ F16 covers checkout arithmetic. F10/F11/F12/F15 cover server behavior. None
 exercise a CLI algorithm where the correct result depends on sorting,
 interval arithmetic, and output ordering at once.
+## Measurement status
+Pair evidence passed in `20260511-f21-current-riskprobes-v1`: bare `33`,
+solo_claude `66`, pair `99`, margin `+33`, wall `1.47x`,
+arm `l2_risk_probes`, verdict `pair_evidence_passed`.
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss global priority ordering
+combined with blocked-interval earliest-fit placement; observable command
+`node "$BENCH_FIXTURE_DIR/verifiers/priority-blocked.js"` exposes the miss.
 ## Retirement
 Retire or replace when both bare and solo consistently exceed the headroom

package/benchmark/auto-resolve/fixtures/F21-cli-scheduler-priority/spec.md CHANGED

@@ -59,3 +59,9 @@ failure reasons must be deterministic.
 - Unknown resources are reported in `rejected` without aborting the whole run.
 - Duplicate request ids are invalid input: exit `2`, one JSON error to stderr, no stdout.
 - `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched.
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss global priority ordering
+across resources while preserving blocked-interval earliest-fit placement;
+observable command `node "$BENCH_FIXTURE_DIR/verifiers/priority-blocked.js"` exposes the miss.

package/benchmark/auto-resolve/fixtures/F22-cli-ledger-close/NOTES.md CHANGED

@@ -20,6 +20,14 @@ F16 covers order quote arithmetic, but not ledger idempotency or full-input
 validation before mutation. F21 covers interval scheduling. Server fixtures
 cover API behavior rather than CLI reconciliation.
+## Measurement status
+Headroom runs reject F22 as full-pipeline pair-lift evidence. In
+`20260507-f21-f22-full-pipeline-pair`, F22 scored bare 91 / solo_claude 98 and
+failed the headroom gate. In `20260508-f22-exact-error-headroom`, F22 scored
+bare 94 / solo_claude 98 after the exact-error fixture revision. Keep it as a ledger
+reconciliation control, not as counted `solo < pair` evidence.
 ## Retirement
 Retire or replace if both bare and solo consistently score above the headroom

package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/NOTES.md CHANGED

@@ -20,6 +20,18 @@ F21 covers interval scheduling. F16 covers quote arithmetic. F22 was too easy
 for bare in the first calibration run. This fixture targets allocation rollback
 and inventory consumption across multiple dimensions.
+## Measurement status
+Pair evidence passed in `20260510-f16-f23-f25-combined-proof`: bare `33`,
+solo_claude `66`, pair `97`, margin `+31`, wall `2.25x`,
+arm `l2_risk_probes`, verdict `pair_evidence_passed`.
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss all-or-nothing rollback after
+a higher-priority order consumes stock first; observable command
+`node "$BENCH_FIXTURE_DIR/verifiers/priority-rollback.js"` exposes the miss.
 ## Retirement
 Retire or replace if both bare and solo consistently exceed the headroom

package/benchmark/auto-resolve/fixtures/F23-cli-fulfillment-wave/spec.md CHANGED

@@ -68,3 +68,9 @@ orders must be deterministic.
 - Lot choice is FEFO by expiry date, then lot id.
 - `remaining` is sorted by warehouse id, then sku, then expires, then lot.
 - `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched.
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss all-or-nothing rollback after
+a higher-priority order tentatively allocates FEFO lots; observable command
+`node "$BENCH_FIXTURE_DIR/verifiers/priority-rollback.js"` exposes the miss.

package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/NOTES.md CHANGED

@@ -17,12 +17,24 @@ examples rather than only checking a happy path.
 ## Why existing fixtures do not cover it
 F16 covers quote tax rules, but not multiple line-promotion types plus an order
-coupon. F21/F23 cover scheduling/allocation but became oracle-control fixtures.
+coupon. F21/F23 cover scheduling/allocation, not checkout interaction ordering.
 This fixture keeps the F16-style fair visible-contract shape while testing a
 different checkout interaction.
+## Measurement status
+Pair evidence passed in `20260510-f16-f23-f25-combined-proof`: bare `25`,
+solo_claude `75`, pair `99`, margin `+24`, wall `1.65x`,
+arm `l2_risk_probes`, verdict `pair_evidence_passed`.
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss the interaction between
+duplicate aggregation, line promotions, tax base, coupon order, and shipping;
+observable command `node "$BENCH_FIXTURE_DIR/verifiers/exact-success.js"` exposes the miss.
 ## Retirement
-Retire or replace this fixture if bare or solo consistently reaches ceiling, or
-if a later fixture covers the same promotion-order and catalog-source failure
-mode with cleaner full-pipeline lift.
+Retire or replace this fixture if either `bare` or `solo_claude` consistently
+reaches ceiling, or if a later fixture covers the same promotion-order and
+catalog-source failure mode with cleaner full-pipeline lift.
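The four Measurement status entries in this diff (F16, F21, F23, F25) each report a `margin` next to the arm scores. The notes do not define the term. The sketch below checks one reading, assuming `margin = pair - solo_claude`, against the reported numbers. `pairMargin` is a hypothetical helper name.

```javascript
'use strict';

// Scores copied from the Measurement status entries in this diff.
const runs = [
  { fixture: 'F16', bare: 50, solo_claude: 75, pair: 96, margin: 21 },
  { fixture: 'F21', bare: 33, solo_claude: 66, pair: 99, margin: 33 },
  { fixture: 'F23', bare: 33, solo_claude: 66, pair: 97, margin: 31 },
  { fixture: 'F25', bare: 25, solo_claude: 75, pair: 99, margin: 24 },
];

// Assumption: margin is the pair score minus the solo_claude score.
function pairMargin(run) {
  return run.pair - run.solo_claude;
}

for (const run of runs) {
  // Prints the fixture, the computed margin, and whether it matches the note.
  console.log(run.fixture, pairMargin(run), pairMargin(run) === run.margin);
}
```

All four reported margins agree with this reading, so the margin appears to measure lift over the solo arm and not over `bare`.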

package/benchmark/auto-resolve/fixtures/F25-cli-cart-promotion-rules/spec.md CHANGED

@@ -62,3 +62,10 @@ and stdout must stay machine-readable.
 - The stock error object includes `sku`, `available`, and `requested`.
 - Changing `data/catalog.json` prices or rates changes command output without code changes.
 - `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched (the catalog seed comes from setup, not the arm).
+## Solo-headroom hypothesis
+A capable solo_claude baseline is expected to miss the interaction between
+duplicate-SKU aggregation, line-promotion ordering, coupon ordering, and
+tax/shipping thresholds; observable command
+`node "$BENCH_FIXTURE_DIR/verifiers/exact-success.js"` exposes the miss.

package/benchmark/auto-resolve/fixtures/F26-cli-payout-ledger-rules/NOTES.md CHANGED

@@ -15,11 +15,17 @@ adversarial ledger examples with repeated IDs, refunds, disputes, and reserves.
 ## Why existing fixtures do not cover it
 F16 covers quote math and F25 covers cart promotions, but neither has ledger
-idempotency or conflicting duplicate events. F21/F23 became oracle-control
-fixtures, so this adds a fresh visible-contract stateful arithmetic candidate.
+idempotency or conflicting duplicate events. F21/F23 cover scheduling and
+allocation ordering, not payout ledger arithmetic.
+## Measurement status
+Headroom run `20260508-f26-headroom` rejected F26 as full-pipeline pair-lift
+evidence: bare scored 25, but `solo_claude` scored 98 and passed all 4
+verification commands, so the fixture is at solo ceiling. Keep it as a ledger
+math control unless the spec is revised to expose a lower solo ceiling.
 ## Retirement
-Retire or replace this fixture if solo consistently reaches ceiling or if
-another fixture provides the same idempotent-ledger signal with cleaner
-full-pipeline pair lift.
+Retire or replace this fixture if another fixture provides the same
+idempotent-ledger signal with cleaner full-pipeline pair headroom.

package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md CHANGED

@@ -25,4 +25,11 @@ update tests but forget backward-compat requirements (single-item route,
 ## Rotation trigger
-Retire when both arms consistently score > 95 AND produce 2+ new tests covering paging edge cases without pipeline intervention.
+Headroom run `20260511-f3-http-error-headroom` rejected this fixture as
+pair-lift evidence after the invalid-query HTTP error body verifier was
+tightened: bare scored 97 and solo_claude scored 99. Keep it as a backend
+contract control unless the visible contract is reworked to expose lower
+bare/solo ceilings.
+Retire when both `bare` and `solo_claude` consistently score > 95 AND produce
+2+ new tests covering paging edge cases without pipeline intervention.

package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json CHANGED

@@ -26,10 +26,12 @@
       "stdout_not_contains": []
     },
     {
-      "cmd": "node -e 'const { app } = require(\"./server\"); const http = require(\"http\"); const s = http.createServer(app).listen(0, () => { const { port } = s.address(); http.get(`http://127.0.0.1:${port}/items?per_page=abc`, r => { console.log(r.statusCode); s.close(); }); });'",
+      "cmd": "node -e 'const { app } = require(\"./server\"); const http = require(\"http\"); const s = http.createServer(app).listen(0, () => { const { port } = s.address(); http.get(`http://127.0.0.1:${port}/items?per_page=abc`, r => { let b = \"\"; r.on(\"data\", c => b += c); r.on(\"end\", () => { const d = JSON.parse(b); console.log(JSON.stringify({ status: r.statusCode, error: d.error, field: d.field })); s.close(); }); }); });'",
       "exit_code": 0,
       "stdout_contains": [
-        "400"
+        "\"status\":400",
+        "\"error\":\"invalid_query\"",
+        "\"field\":\"per_page\""
       ],
       "stdout_not_contains": []
     }
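The tightened F3 entry moves from a bare status code to three compact JSON substrings. The sketch below shows how such an entry could be evaluated, assuming `stdout_contains` and `stdout_not_contains` are plain substring checks. `checkCommand` is a hypothetical helper, not the benchmark harness.

```javascript
'use strict';

// Hypothetical checker for one verification_commands entry.
// `result` is { status, stdout }, for example from child_process.spawnSync.
function checkCommand(entry, result) {
  const failures = [];
  if (result.status !== entry.exit_code) {
    failures.push(`exit_code ${result.status} !== ${entry.exit_code}`);
  }
  for (const needle of entry.stdout_contains) {
    if (!result.stdout.includes(needle)) failures.push(`missing: ${needle}`);
  }
  for (const needle of entry.stdout_not_contains) {
    if (result.stdout.includes(needle)) failures.push(`forbidden: ${needle}`);
  }
  return failures;
}

// Expectations copied from the 2.3.1 entry.
const entry = {
  exit_code: 0,
  stdout_contains: ['"status":400', '"error":"invalid_query"', '"field":"per_page"'],
  stdout_not_contains: [],
};

// JSON.stringify emits no spaces, so the compact substrings match.
const fullBody = JSON.stringify({ status: 400, error: 'invalid_query', field: 'per_page' }) + '\n';
// The 2.3.0 command printed only the status code.
const statusOnly = '400\n';

console.log(checkCommand(entry, { status: 0, stdout: fullBody })); // []
console.log(checkCommand(entry, { status: 0, stdout: statusOnly }).length); // 3
```

Because the new command prints through `JSON.stringify`, the expected substrings contain no whitespace after the colons. A server that returns `400` with a different error body now fails this verifier.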

package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md CHANGED

@@ -50,6 +50,6 @@ so existing assertions continue to pass alongside new paging assertions.
 - Server start: `node server/index.js` listens on port 3000 (exit via SIGINT).
 - `curl -s http://127.0.0.1:3000/items | jq '.total'` returns `2`.
 - `curl -s 'http://127.0.0.1:3000/items?per_page=1&page=2' | jq '.items[0].name'` returns `"beta"`.
-- `curl -s 'http://127.0.0.1:3000/items?per_page=abc' -o /dev/null -w '%{http_code}'` returns `400`.
+- `curl -s 'http://127.0.0.1:3000/items?per_page=abc'` returns HTTP status `400` with JSON error body `{ "error": "invalid_query", "field": "per_page" }`.
 - `node --test tests/server.test.js` passes; must include ≥ 2 new paging tests.
 - `git diff --stat` shows only `server/index.js` and `tests/server.test.js` touched.

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/NOTES.md ADDED

@@ -0,0 +1,34 @@
+# F31 CLI seat rebalance
+## Failure mode
+This fixture detects implementations that pass simple entitlement updates while
+missing the interaction between priority ordering, transfer rollback, rejected
+row ordering, and exact machine-readable error handling.
+## Pipeline phase target
+PLAN must preserve the ordering distinction between processing order and
+rejected-output order. IMPLEMENT must keep transfer mutations all-or-nothing.
+VERIFY should build adversarial cases where a later high-priority transfer
+changes the outcome of an earlier low-priority reserve, and where a failed
+transfer would corrupt state if mutations are applied too early.
+## Why existing fixtures do not cover it
+F21 covers scheduling priority and blocked intervals. F23 covers inventory
+allocation rollback. F25 covers checkout calculation order. This fixture covers
+account entitlement reconciliation with a different state shape and a duplicate
+event-id hard error.
+## Retirement
+Headroom run `20260512-f31-seat-rebalance-headroom` rejected this fixture as
+pair-lift evidence: bare scored 33 but carried judge/result/verify
+disqualifiers, and solo_claude scored 98 with all 3 verification commands
+passing. It should remain a control fixture unless reworked to lower the solo
+ceiling.
+Retire or replace this fixture if either `bare` or `solo_claude` consistently
+reaches ceiling, or if a later fixture covers priority event processing plus
+all-or-nothing transfer rollback with cleaner full-pipeline lift.

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/expected.json ADDED

@@ -0,0 +1,57 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": ["not ok "]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/priority-transfer-rollback.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Process events globally by `priority` descending, then `effective_at` ascending, then `id` ascending.",
+        "`transfer` is all-or-nothing. It accepts only when both accounts exist, the source has `used >= qty`, the destination has at least `qty` free seats, and both accounts have the same `region` unless `allow_cross_region` is `true`.",
+        "A rejected `transfer` must not change either account. Use reason `unknown_account`, `region_mismatch`, `insufficient_used`, or `no_capacity` for the first failing transfer rule in the order listed above.",
+        "`rejected` is ordered in the original input event order. Each row has keys `id`, `reason`.",
+        "`accounts` is sorted by account id ascending. Each row has keys `id`, `region`, `seats`, `used`, `free`, where `free = seats - used`.",
+        "On success, write exactly one JSON object to stdout and no stderr. Keys: `applied`, `rejected`, `accounts`."
+      ]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/duplicate-event-error.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Before processing any event, duplicate event ids are invalid input: exit `2`, write exactly one JSON error object `{ \"error\": \"duplicate_event_id\", \"id\": string }` to stderr, and write no stdout."
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\})",
+      "description": "silent catch returning fallback in seat rebalance path",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "/\\*\\s*eslint-disable",
+      "description": "eslint-disable without scoped justification",
+      "files": ["bin/cli.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["bin/cli.js", "tests/cli.test.js"],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "spec_output_files": ["bin/cli.js", "tests/cli.test.js"]
+}
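The three `forbidden_patterns` entries can be exercised directly. The sketch below assumes the harness compiles each pattern string as a JavaScript-compatible regular expression, which the diff does not state. The sample snippets are illustrative only.

```javascript
'use strict';

// Pattern strings as they decode from expected.json.
const patterns = {
  silentCatch: new RegExp("catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\})"),
  emptyCatch: new RegExp('catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}'),
  eslintDisable: new RegExp('/\\*\\s*eslint-disable'),
};

// Returns the names of every pattern that matches the given source text.
function violations(source) {
  return Object.keys(patterns).filter((name) => patterns[name].test(source));
}

console.log(violations('try { load(); } catch (e) { return null; }'));
console.log(violations('try { load(); } catch (e) {}'));
console.log(violations('/* eslint-disable no-console */'));
console.log(violations('try { load(); } catch (e) { console.error(e); process.exit(2); }'));
// A catch without a binding has no parentheses, so neither catch pattern matches it.
console.log(violations('try { load(); } catch { }'));
```

The last sample shows a gap in the patterns as written: both catch patterns require a parenthesized binding, so an optional-catch-binding block such as `catch { }` is not flagged.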

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F31-cli-seat-rebalance",
+  "category": "high-risk",
+  "difficulty": "high",
+  "timeout_seconds": 1500,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a bench-cli rebalance-seats command that reads account capacity and seat events from a JSON file, processes events by priority with all-or-nothing transfers, rejects invalid per-event operations without corrupting state, and prints exact applied, rejected, and final account rows."
+}

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/setup.sh ADDED

@@ -0,0 +1,2 @@
+#!/usr/bin/env bash
+set -e

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/spec.md ADDED

@@ -0,0 +1,67 @@
+---
+id: "F31-cli-seat-rebalance"
+title: "Seat rebalance command"
+status: planned
+complexity: high
+depends-on: []
+---
+# F31 Seat rebalance command
+## Context
+`bench-cli` currently has greeting and version commands only. The task:
+add a `rebalance-seats` command that reads account capacity and seat events from
+a JSON file, processes events by priority with all-or-nothing transfers, rejects
+invalid per-event operations without corrupting state, and prints exact applied,
+rejected, and final account rows.
+This is account entitlement reconciliation. Downstream billing tools parse the
+output, so success and error output must be exact machine-readable JSON.
+## Requirements
+- [ ] `bench-cli rebalance-seats --input <path>` reads JSON shaped as `{ "accounts": [{ "id": string, "region": string, "seats": number, "used": number }], "events": [event] }`.
+- [ ] Valid event types are `reserve`, `release`, and `transfer`.
+- [ ] `reserve` events have keys `id`, `type`, `account`, `qty`, `priority`, and `effective_at`.
+- [ ] `release` events have keys `id`, `type`, `account`, `qty`, `priority`, and `effective_at`.
+- [ ] `transfer` events have keys `id`, `type`, `from`, `to`, `qty`, `priority`, `effective_at`, and optional `allow_cross_region`.
+- [ ] Before processing any event, duplicate event ids are invalid input: exit `2`, write exactly one JSON error object `{ "error": "duplicate_event_id", "id": string }` to stderr, and write no stdout.
+- [ ] Before processing any event, account rows must have unique ids, non-empty string `id` and `region`, integer `seats >= 0`, and integer `used` with `0 <= used <= seats`. Invalid account input exits `2` with one JSON error object and no stdout.
+- [ ] Before processing any event, every event `qty` must be a positive integer, every `priority` must be an integer, and every `effective_at` must be a non-empty string. Invalid event input exits `2` with one JSON error object and no stdout.
+- [ ] Process events globally by `priority` descending, then `effective_at` ascending, then `id` ascending.
+- [ ] `reserve` accepts only when the account exists and has at least `qty` free seats. On accept, increase that account's `used` by `qty`. Otherwise reject the event with reason `unknown_account` or `no_capacity`.
+- [ ] `release` accepts only when the account exists and `used >= qty`. On accept, decrease that account's `used` by `qty`. Otherwise reject the event with reason `unknown_account` or `insufficient_used`.
+- [ ] `transfer` is all-or-nothing. It accepts only when both accounts exist, the source has `used >= qty`, the destination has at least `qty` free seats, and both accounts have the same `region` unless `allow_cross_region` is `true`.
+- [ ] A rejected `transfer` must not change either account. Use reason `unknown_account`, `region_mismatch`, `insufficient_used`, or `no_capacity` for the first failing transfer rule in the order listed above.
+- [ ] On success, write exactly one JSON object to stdout and no stderr. Keys: `applied`, `rejected`, `accounts`.
+- [ ] `applied` is ordered in processing order. Each row has keys `id`, `type`.
+- [ ] `rejected` is ordered in the original input event order. Each row has keys `id`, `reason`.
+- [ ] `accounts` is sorted by account id ascending. Each row has keys `id`, `region`, `seats`, `used`, `free`, where `free = seats - used`.
+- [ ] `tests/cli.test.js` is updated. Existing tests still pass and at least two new tests cover `rebalance-seats`: one successful priority/transfer scenario and one validation failure.
+## Constraints
+- **No new npm dependencies.**
+- **No hidden mutable global state.** The command must derive output only from the input JSON for that invocation.
+- **No silent catches.** Parse and file-read failures must emit a visible JSON error to stderr and exit `2`.
+- **No extra stdout/stderr text** on the success path; downstream tooling parses stdout as JSON.
+## Out of Scope
+- Persisting account state between command invocations.
+- Adding billing invoices, plan catalogs, or currency calculations.
+- Adding web UI or server routes.
+- Touching `server/`, `web/`, or `tests/server.test.js`.
+## Verification
+- `node --test tests/cli.test.js` exits 0.
+- A later high-priority transfer is processed before an earlier low-priority reserve, and the low-priority reserve can lose capacity because of that ordering.
+- A rejected transfer leaves both source and destination account usage unchanged.
+- Region mismatch rejects a transfer unless `allow_cross_region` is `true`.
+- `rejected` rows are reported in the original input event order, even though processing order is priority based.
+- Duplicate event ids exit `2`, print exactly `{ "error": "duplicate_event_id", "id": string }` to stderr, and print no stdout.
+- Final `accounts` rows are sorted by id and include exact `free` values.
+- `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js` touched.
+- Solo-headroom hypothesis: solo_claude is expected to miss transfer rollback or rejected-row ordering under priority processing; observable command `node "$BENCH_FIXTURE_DIR/verifiers/priority-transfer-rollback.js"` exposes the miss.

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/task.txt ADDED

@@ -0,0 +1,7 @@
+Add a bench-cli rebalance-seats command that reads account capacity and seat events from a JSON file, processes events by priority with all-or-nothing transfers, rejects invalid per-event operations without corrupting state, and prints exact applied, rejected, and final account rows.
+The command should be `bench-cli rebalance-seats --input <path>`. The input JSON has account rows with id, region, seats, and used, plus event rows for reserve, release, and transfer operations. Process all events by priority descending, then effective_at ascending, then id ascending.
+Transfers must be all-or-nothing: if either account is missing, the source lacks used seats, the destination lacks free seats, or the regions differ without allow_cross_region, reject the transfer and leave both accounts unchanged. Reserve and release should reject per-event failures without aborting the whole command. Duplicate event ids are invalid input and must exit 2 with exactly one JSON error object on stderr and no stdout.
+On success, stdout must be exactly one JSON object with applied rows in processing order, rejected rows in original input order, and final account rows sorted by account id with free = seats - used. Update `tests/cli.test.js` so existing tests still pass and at least two new tests cover rebalance-seats, including one successful priority/transfer scenario and one validation failure. Do not add dependencies or touch server/web files.

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/duplicate-event-error.js ADDED

@@ -0,0 +1,35 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const fs = require('node:fs');
+const os = require('node:os');
+const path = require('node:path');
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'f31-duplicate-'));
+const input = path.join(tmp, 'events.json');
+fs.writeFileSync(input, JSON.stringify({
+  accounts: [
+    { id: 'team-a', region: 'us', seats: 5, used: 1 }
+  ],
+  events: [
+    { id: 'dup', type: 'reserve', account: 'team-a', qty: 1, priority: 2, effective_at: '2026-01-01T09:00:00Z' },
+    { id: 'dup', type: 'release', account: 'team-a', qty: 1, priority: 1, effective_at: '2026-01-01T09:01:00Z' }
+  ]
+}), 'utf8');
+const result = spawnSync('node', [cli, 'rebalance-seats', '--input', input], {
+  cwd: work,
+  encoding: 'utf8'
+});
+assert.strictEqual(result.status, 2);
+assert.strictEqual(result.stdout, '');
+assert.deepStrictEqual(JSON.parse(result.stderr), {
+  error: 'duplicate_event_id',
+  id: 'dup'
+});
+console.log(JSON.stringify({ ok: true }));
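The check this verifier exercises can be sketched as a pure function that runs before any event is processed. This is a minimal illustration, not the package's `bin/cli.js`. `duplicateEventError` is a hypothetical name, and reporting the first id seen a second time in input order is an assumption, since the spec only requires `id` to be a string.

```javascript
'use strict';

// Hypothetical helper: returns the error payload for the first id that
// appears a second time in input order, or null when all ids are unique.
function duplicateEventError(events) {
  const seen = new Set();
  for (const event of events) {
    if (seen.has(event.id)) {
      return { error: 'duplicate_event_id', id: event.id };
    }
    seen.add(event.id);
  }
  return null;
}

// Same event ids as the verifier input above.
const error = duplicateEventError([
  { id: 'dup', type: 'reserve' },
  { id: 'dup', type: 'release' },
]);
console.log(JSON.stringify(error)); // {"error":"duplicate_event_id","id":"dup"}
```

A caller would write this payload to stderr as the only output and exit with status `2`, leaving stdout empty, which is what the assertions in the verifier check.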

package/benchmark/auto-resolve/fixtures/F31-cli-seat-rebalance/verifiers/priority-transfer-rollback.js ADDED

@@ -0,0 +1,53 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const fs = require('node:fs');
+const os = require('node:os');
+const path = require('node:path');
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+const tmp = fs.mkdtempSync(path.join(os.tmpdir(), 'f31-rebalance-'));
+const input = path.join(tmp, 'events.json');
+fs.writeFileSync(input, JSON.stringify({
+  accounts: [
+    { id: 'team-a', region: 'us', seats: 5, used: 3 },
+    { id: 'team-b', region: 'us', seats: 4, used: 1 },
+    { id: 'team-eu', region: 'eu', seats: 4, used: 0 }
+  ],
+  events: [
+    { id: 'low-reserve', type: 'reserve', account: 'team-b', qty: 3, priority: 1, effective_at: '2026-01-01T09:00:00Z' },
+    { id: 'bad-cross', type: 'transfer', from: 'team-a', to: 'team-eu', qty: 1, priority: 8, effective_at: '2026-01-01T09:02:00Z' },
+    { id: 'high-transfer', type: 'transfer', from: 'team-a', to: 'team-b', qty: 2, priority: 10, effective_at: '2026-01-01T09:05:00Z' },
+    { id: 'after-release', type: 'release', account: 'team-a', qty: 1, priority: 7, effective_at: '2026-01-01T09:03:00Z' },
+    { id: 'after-reserve', type: 'reserve', account: 'team-a', qty: 5, priority: 6, effective_at: '2026-01-01T09:04:00Z' }
+  ]
+}), 'utf8');
+const result = spawnSync('node', [cli, 'rebalance-seats', '--input', input], {
+  cwd: work,
+  encoding: 'utf8'
+});
+assert.strictEqual(result.status, 0, result.stderr || result.stdout);
+assert.strictEqual(result.stderr, '');
+const parsed = JSON.parse(result.stdout);
+assert.deepStrictEqual(parsed, {
+  applied: [
+    { id: 'high-transfer', type: 'transfer' },
+    { id: 'after-release', type: 'release' },
+    { id: 'after-reserve', type: 'reserve' }
+  ],
+  rejected: [
+    { id: 'low-reserve', reason: 'no_capacity' },
+    { id: 'bad-cross', reason: 'region_mismatch' }
+  ],
+  accounts: [
+    { id: 'team-a', region: 'us', seats: 5, used: 5, free: 0 },
+    { id: 'team-b', region: 'us', seats: 4, used: 3, free: 1 },
+    { id: 'team-eu', region: 'eu', seats: 4, used: 0, free: 4 }
+  ]
+});
+console.log(JSON.stringify({ ok: true }));
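The behavior this verifier expects can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the package's implementation: `rebalance`, `rejectReason`, and `compare` are hypothetical names, input validation is omitted, `effective_at` values are compared as strings, and transfer rules are checked in the order of the listed reasons (`unknown_account`, `region_mismatch`, `insufficient_used`, `no_capacity`). The spec's "order listed above" could also mean the order of the acceptance conditions, so that last choice is a guess.

```javascript
'use strict';

function compare(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Returns a rejection reason, or null when the event can be applied.
// Every check runs before any mutation, which keeps transfers all-or-nothing.
function rejectReason(event, accounts) {
  if (event.type === 'transfer') {
    const from = accounts.get(event.from);
    const to = accounts.get(event.to);
    if (!from || !to) return 'unknown_account';
    if (from.region !== to.region && event.allow_cross_region !== true) return 'region_mismatch';
    if (from.used < event.qty) return 'insufficient_used';
    if (to.seats - to.used < event.qty) return 'no_capacity';
    return null;
  }
  const account = accounts.get(event.account);
  if (!account) return 'unknown_account';
  if (event.type === 'reserve') {
    return account.seats - account.used < event.qty ? 'no_capacity' : null;
  }
  return account.used < event.qty ? 'insufficient_used' : null; // release
}

function rebalance(input) {
  // Copy the account rows so the caller's input is never mutated.
  const accounts = new Map(input.accounts.map((a) => [a.id, { ...a }]));
  // Processing order: priority descending, effective_at ascending, id ascending.
  const ordered = input.events
    .map((event, index) => ({ event, index }))
    .sort((x, y) =>
      y.event.priority - x.event.priority ||
      compare(x.event.effective_at, y.event.effective_at) ||
      compare(x.event.id, y.event.id));

  const applied = [];
  const rejected = [];
  for (const { event, index } of ordered) {
    const reason = rejectReason(event, accounts);
    if (reason) {
      rejected.push({ index, id: event.id, reason });
    } else if (event.type === 'transfer') {
      accounts.get(event.from).used -= event.qty;
      accounts.get(event.to).used += event.qty;
      applied.push({ id: event.id, type: event.type });
    } else {
      accounts.get(event.account).used += event.type === 'reserve' ? event.qty : -event.qty;
      applied.push({ id: event.id, type: event.type });
    }
  }

  return {
    applied, // already in processing order
    rejected: rejected
      .sort((a, b) => a.index - b.index) // back to original input order
      .map(({ id, reason }) => ({ id, reason })),
    accounts: [...accounts.values()]
      .sort((a, b) => compare(a.id, b.id))
      .map((a) => ({ id: a.id, region: a.region, seats: a.seats, used: a.used, free: a.seats - a.used })),
  };
}
```

The design point the fixture targets is the split between the two orderings: `applied` follows the priority sort, while `rejected` is restored to input order by carrying each event's original index through the sort.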

package/benchmark/auto-resolve/fixtures/F32-cli-subscription-renewal/NOTES.md ADDED

@@ -0,0 +1,38 @@
+# F32 Subscription renewal command
+## Failure mode
+This fixture targets billing-style state mutation where an implementation can
+look correct on isolated cases but fail the interaction between renewal
+priority, tentative credit application, rollback after `payment_required`, exact
+credit consumption order, and strict JSON row shapes.
+## Pipeline phase coverage
+- PLAN must preserve the exact input/output field names and ordering clauses.
+- RISK_PROBES should derive a compound priority + rollback + shape probe.
+- IMPLEMENT must avoid input-order processing and must not leak tentative credit
+  consumption from rejected renewals.
+- VERIFY pair mode should catch aliased keys, extra keys, and weak tests that
+  check only one field rather than the full parsed output.
+## Why existing fixtures do not cover it
+F25 covers pricing math and output shape, F31 covers entitlement transfers and
+duplicate-id errors, and F23 covers fulfillment rollback. F32 combines billing
+credits with a failed high-priority renewal that must roll back before a later
+renewal can consume credits, plus exact nested output key sets.
+## Retirement criteria
+Retire or replace this fixture if both `bare` and `solo_claude` score above 95
+on two current-model runs, or if another active fixture covers priority-ordered
+tentative monetary credit rollback with exact nested output shape and duplicate
+ID error contracts.
+## Pair-candidate status
+Rejected as pair-lift evidence by `20260512-f32-subscription-renewal-headroom`:
+bare scored 33, but solo_claude scored 98 and passed all 3 verification
+commands. Keep it as a billing rollback/shape control, not as a pair arm target,
+unless it is reworked and clears a fresh headroom gate.