devlyn-cli 2.3.0 → 2.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (219)

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/expected.json ADDED

@@ -0,0 +1,56 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node --test tests/cli.test.js",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": ["not ok "]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Process refund requests globally by `priority` descending, then `requested_on` ascending, then original input order ascending.",
+        "A refund rejects with reason `window_expired` when `requested_on` is more than `refund_window_days` after `purchased_on`.",
+        "A refund accepts only when the order's remaining refundable cents is at least the requested `cents`.",
+        "A rejected refund with reason `over_refund` must not change that order's remaining refundable cents.",
+        "For each accepted refund, decrement that order's remaining refundable cents by the requested `cents`.",
+        "For each accepted refund, compute `fee_cents` as the category policy's `restocking_fee_cents` capped at the requested `cents`, and compute `net_cents = cents - fee_cents`.",
+        "`approved` is ordered in processing order. Each row has keys `id`, `order`, `refund_cents`, `fee_cents`, and `net_cents`.",
+        "`rejected` is ordered in the original input refund order. Each row has keys `id`, `reason`.",
+        "`orders` is ordered by order id ascending. Each row has keys `id` and `remaining_refundable_cents`.",
+        "On success, write exactly one JSON object to stdout and no stderr. Keys: `approved`, `rejected`, `orders`."
+      ]
+    },
+    {
+      "cmd": "node \"$BENCH_FIXTURE_DIR/verifiers/duplicate-refund-error.js\"",
+      "exit_code": 0,
+      "stdout_contains": ["\"ok\":true"],
+      "stdout_not_contains": [],
+      "contract_refs": [
+        "Before settling any refund, duplicate refund ids are invalid input: exit `2`, write exactly one JSON error object `{ \"error\": \"duplicate_refund_id\", \"id\": string }` to stderr, and write no stdout."
+      ]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined|''|\\{\\}|\\[\\])",
+      "description": "silent catch returning fallback in settle-refunds path",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    },
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch block",
+      "files": ["bin/cli.js", "tests/cli.test.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["bin/cli.js", "tests/cli.test.js"],
+  "forbidden_files": [],
+  "tier_a_waivers": [],
+  "spec_output_files": ["bin/cli.js", "tests/cli.test.js"],
+  "max_deps_added": 0
+}
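
The two `forbidden_patterns` entries are regular expressions. The following is a minimal sketch of what they flag, assuming the benchmark harness applies them as ordinary JavaScript `RegExp` tests over file contents (the harness code is not part of this diff). The `samples` snippets and the `classify` helper are hypothetical and exist only for illustration.

```javascript
'use strict';

// Patterns copied from expected.json, after JSON string unescaping.
const silentFallback = /catch\s*\([^)]*\)\s*\{[^}]*return\s+(null|undefined|''|\{\}|\[\])/;
const emptyCatch = /catch\s*\([^)]*\)\s*\{\s*\}/;

// Hypothetical source snippets, for illustration only.
const samples = {
  empty: 'try { JSON.parse(raw); } catch (err) { }',
  fallback: 'try { return JSON.parse(raw); } catch (err) { return null; }',
  surfaced: 'try { return JSON.parse(raw); } catch (err) { fail(err.message); }',
};

// Reports which of the two disqualifier patterns a snippet would trip.
function classify(source) {
  return {
    emptyCatch: emptyCatch.test(source),
    silentFallback: silentFallback.test(source),
  };
}
```

Under that assumption, only a catch block that surfaces the error (as in `samples.surfaced`) passes both checks.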

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "S6-cli-refund-window-ledger",
+  "category": "high-risk",
+  "difficulty": "high",
+  "timeout_seconds": 900,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "Add a refund ledger CLI command that applies category refund windows, priority-ordered refund requests, cumulative per-order refundable balances, duplicate refund rejection, and exact JSON output shape."
+}

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/setup.sh ADDED

@@ -0,0 +1,3 @@
+#!/usr/bin/env bash
+set -euo pipefail
+# S6 reuses the baseline test-repo state.

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/spec.md ADDED

@@ -0,0 +1,59 @@
+---
+id: "S6-cli-refund-window-ledger"
+title: "Add refund window ledger command"
+status: planned
+complexity: high
+depends-on: []
+---
+# S6 Add Refund Window Ledger Command
+## Context
+Finance operations needs a deterministic CLI command that settles refund
+requests against original orders. The command must combine category refund
+windows, priority ordering, cumulative per-order refundable balances, duplicate
+id rejection, and exact machine-readable output.
+## Requirements
+- [ ] Add `settle-refunds` to `bin/cli.js`.
+- [ ] Accept `--policies <json>` as a JSON object whose keys are category names and whose values have keys `refund_window_days` and `restocking_fee_cents`.
+- [ ] Accept `--orders <json>` as a JSON array of order objects. Each order has keys `id`, `category`, `paid_cents`, `purchased_on`, and `fulfilled`.
+- [ ] Accept `--refunds <json>` as a JSON array of refund request objects. Each refund has keys `id`, `order`, `cents`, `priority`, and `requested_on`.
+- [ ] Before settling any refund, duplicate refund ids are invalid input: exit `2`, write exactly one JSON error object `{ "error": "duplicate_refund_id", "id": string }` to stderr, and write no stdout.
+- [ ] Process refund requests globally by `priority` descending, then `requested_on` ascending, then original input order ascending.
+- [ ] A refund rejects with reason `unknown_order` when the order does not exist.
+- [ ] A refund rejects with reason `unfulfilled_order` when the order exists but `fulfilled` is not `true`.
+- [ ] A refund rejects with reason `unknown_policy` when the order category has no policy.
+- [ ] A refund rejects with reason `window_expired` when `requested_on` is more than `refund_window_days` after `purchased_on`.
+- [ ] A refund accepts only when the order's remaining refundable cents is at least the requested `cents`.
+- [ ] A rejected refund with reason `over_refund` must not change that order's remaining refundable cents.
+- [ ] For each accepted refund, decrement that order's remaining refundable cents by the requested `cents`.
+- [ ] For each accepted refund, compute `fee_cents` as the category policy's `restocking_fee_cents` capped at the requested `cents`, and compute `net_cents = cents - fee_cents`.
+- [ ] `approved` is ordered in processing order. Each row has keys `id`, `order`, `refund_cents`, `fee_cents`, and `net_cents`.
+- [ ] `rejected` is ordered in the original input refund order. Each row has keys `id`, `reason`.
+- [ ] `orders` is ordered by order id ascending. Each row has keys `id` and `remaining_refundable_cents`.
+- [ ] On success, write exactly one JSON object to stdout and no stderr. Keys: `approved`, `rejected`, `orders`.
+## Constraints
+- Use only Node.js built-ins; add no npm dependencies.
+- Touch only `bin/cli.js` and `tests/cli.test.js`.
+- Do not silently catch JSON parse or validation errors. Surface invalid input as a user-visible error with nonzero exit.
+- Do not persist refund balances between command invocations.
+- All public money amounts are integer cents.
+## Out of Scope
+- Reading input from files.
+- Taxes, payment gateway calls, currency conversion, or store-credit issuance.
+- Partial approval of a single refund request.
+- Changing `hello`, `version`, server routes, or package metadata.
+## Verification
+- `node --test tests/cli.test.js` passes.
+- `node "$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js"` prints `{"ok":true}`.
+- `node "$BENCH_FIXTURE_DIR/verifiers/duplicate-refund-error.js"` prints `{"ok":true}`.
+- Solo-headroom hypothesis: solo_claude is expected to miss cumulative remaining refundable cents or original-order rejected rows under priority-ordered refund settlement; observable command `node "$BENCH_FIXTURE_DIR/verifiers/priority-refund-ledger.js"` exposes the miss.
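
The requirements in this spec can be sketched as one pure function. This is not the package's implementation (the fixture ships no reference `bin/cli.js` in this diff); it is an illustrative reading under stated assumptions: the refundable balance starts at `paid_cents`, dates are ISO `YYYY-MM-DD` strings, rejection reasons are checked in the order the spec lists them, and order ids sort as plain strings. The names `settleRefunds`, `daysBetween`, and `compareStrings` are hypothetical.

```javascript
'use strict';

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two ISO `YYYY-MM-DD` dates (Date.parse reads these as UTC).
function daysBetween(from, to) {
  return Math.round((Date.parse(to) - Date.parse(from)) / MS_PER_DAY);
}

function compareStrings(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

function settleRefunds(policies, orders, refunds) {
  // Duplicate ids are invalid input and are detected before any settlement.
  const seen = new Set();
  for (const refund of refunds) {
    if (seen.has(refund.id)) {
      const err = new Error('duplicate_refund_id');
      err.id = refund.id; // a CLI layer would map this to exit 2 + stderr JSON
      throw err;
    }
    seen.add(refund.id);
  }

  const orderById = new Map(orders.map((o) => [o.id, o]));
  // Assumption: each order's refundable balance starts at paid_cents.
  const remaining = new Map(orders.map((o) => [o.id, o.paid_cents]));

  // Processing order: priority desc, requested_on asc, input order asc.
  const queue = refunds
    .map((refund, index) => ({ refund, index }))
    .sort((a, b) =>
      b.refund.priority - a.refund.priority ||
      compareStrings(a.refund.requested_on, b.refund.requested_on) ||
      a.index - b.index);

  const approved = [];
  const rejected = [];
  for (const { refund, index } of queue) {
    const order = orderById.get(refund.order);
    const hasPolicy = order !== undefined &&
      Object.prototype.hasOwnProperty.call(policies, order.category);
    const policy = hasPolicy ? policies[order.category] : undefined;

    let reason = null;
    if (!order) reason = 'unknown_order';
    else if (order.fulfilled !== true) reason = 'unfulfilled_order';
    else if (!policy) reason = 'unknown_policy';
    else if (daysBetween(order.purchased_on, refund.requested_on) > policy.refund_window_days) reason = 'window_expired';
    else if (remaining.get(order.id) < refund.cents) reason = 'over_refund';

    if (reason) {
      // Rejections never touch the order's balance.
      rejected.push({ index, row: { id: refund.id, reason } });
      continue;
    }

    const feeCents = Math.min(policy.restocking_fee_cents, refund.cents);
    remaining.set(order.id, remaining.get(order.id) - refund.cents);
    approved.push({
      id: refund.id,
      order: order.id,
      refund_cents: refund.cents,
      fee_cents: feeCents,
      net_cents: refund.cents - feeCents,
    });
  }

  return {
    approved, // already in processing order
    rejected: rejected.sort((a, b) => a.index - b.index).map((r) => r.row),
    orders: [...remaining.keys()].sort().map((id) => ({
      id,
      remaining_refundable_cents: remaining.get(id),
    })),
  };
}
```

The two behaviors the solo-headroom hypothesis targets are visible here: `remaining` is shared across all refunds for an order, so a high-priority request can exhaust the balance before an earlier-listed low-priority one is processed, and `rejected` is re-sorted by input index rather than left in processing order.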

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/task.txt ADDED

@@ -0,0 +1 @@
+Add a `settle-refunds` command to bench-cli. It must read policies, orders, and refund requests from JSON CLI arguments, process refund requests by priority, maintain per-order remaining refundable cents, reject duplicates before processing, and emit exact JSON output.

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/duplicate-refund-error.js ADDED

@@ -0,0 +1,41 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+const policies = JSON.stringify({
+  apparel: { refund_window_days: 45, restocking_fee_cents: 25 }
+});
+const orders = JSON.stringify([
+  { id: 'ord-a', category: 'apparel', paid_cents: 600, purchased_on: '2026-01-10', fulfilled: true }
+]);
+const refunds = JSON.stringify([
+  { id: 'dup', order: 'ord-a', cents: 100, priority: 2, requested_on: '2026-01-11' },
+  { id: 'dup', order: 'ord-a', cents: 100, priority: 1, requested_on: '2026-01-12' }
+]);
+const result = spawnSync('node', [
+  cli,
+  'settle-refunds',
+  '--policies',
+  policies,
+  '--orders',
+  orders,
+  '--refunds',
+  refunds
+], {
+  cwd: work,
+  encoding: 'utf8'
+});
+assert.strictEqual(result.status, 2, result.stdout || result.stderr);
+assert.strictEqual(result.stdout, '');
+assert.deepStrictEqual(JSON.parse(result.stderr), {
+  error: 'duplicate_refund_id',
+  id: 'dup'
+});
+console.log(JSON.stringify({ ok: true }));

package/benchmark/auto-resolve/shadow-fixtures/S6-cli-refund-window-ledger/verifiers/priority-refund-ledger.js ADDED

@@ -0,0 +1,65 @@
+'use strict';
+const assert = require('node:assert');
+const { spawnSync } = require('node:child_process');
+const path = require('node:path');
+const work = process.env.BENCH_WORKDIR || process.cwd();
+const cli = path.join(work, 'bin', 'cli.js');
+const policies = JSON.stringify({
+  electronics: { refund_window_days: 30, restocking_fee_cents: 150 },
+  apparel: { refund_window_days: 45, restocking_fee_cents: 25 }
+});
+const orders = JSON.stringify([
+  { id: 'ord-a', category: 'electronics', paid_cents: 1000, purchased_on: '2026-01-01', fulfilled: true },
+  { id: 'ord-b', category: 'apparel', paid_cents: 600, purchased_on: '2026-01-10', fulfilled: true },
+  { id: 'ord-c', category: 'electronics', paid_cents: 400, purchased_on: '2025-12-01', fulfilled: true },
+  { id: 'ord-d', category: 'apparel', paid_cents: 500, purchased_on: '2026-01-15', fulfilled: false }
+]);
+const refunds = JSON.stringify([
+  { id: 'low-a', order: 'ord-a', cents: 500, priority: 1, requested_on: '2026-01-08' },
+  { id: 'expired-c', order: 'ord-c', cents: 100, priority: 9, requested_on: '2026-02-01' },
+  { id: 'high-a', order: 'ord-a', cents: 800, priority: 10, requested_on: '2026-01-09' },
+  { id: 'unknown', order: 'missing', cents: 50, priority: 8, requested_on: '2026-01-09' },
+  { id: 'unfulfilled', order: 'ord-d', cents: 50, priority: 7, requested_on: '2026-01-20' },
+  { id: 'apparel-ok', order: 'ord-b', cents: 300, priority: 6, requested_on: '2026-01-20' }
+]);
+const result = spawnSync('node', [
+  cli,
+  'settle-refunds',
+  '--policies',
+  policies,
+  '--orders',
+  orders,
+  '--refunds',
+  refunds
+], {
+  cwd: work,
+  encoding: 'utf8'
+});
+assert.strictEqual(result.status, 0, result.stderr || result.stdout);
+assert.strictEqual(result.stderr, '');
+const parsed = JSON.parse(result.stdout);
+assert.deepStrictEqual(parsed, {
+  approved: [
+    { id: 'high-a', order: 'ord-a', refund_cents: 800, fee_cents: 150, net_cents: 650 },
+    { id: 'apparel-ok', order: 'ord-b', refund_cents: 300, fee_cents: 25, net_cents: 275 }
+  ],
+  rejected: [
+    { id: 'low-a', reason: 'over_refund' },
+    { id: 'expired-c', reason: 'window_expired' },
+    { id: 'unknown', reason: 'unknown_order' },
+    { id: 'unfulfilled', reason: 'unfulfilled_order' }
+  ],
+  orders: [
+    { id: 'ord-a', remaining_refundable_cents: 200 },
+    { id: 'ord-b', remaining_refundable_cents: 300 },
+    { id: 'ord-c', remaining_refundable_cents: 400 },
+    { id: 'ord-d', remaining_refundable_cents: 500 }
+  ]
+});
+console.log(JSON.stringify({ ok: true }));

package/bin/devlyn.js CHANGED

@@ -22,7 +22,7 @@ const CLI_TARGETS = {
     // Codex auto-loads skills from ~/.codex/skills/ (user-global). Same
     // SKILL.md format as Claude Code; descriptions must stay ≤1024 chars.
     skillsDir: path.join(os.homedir(), '.codex', 'skills'),
-    skillsToInstall: ['devlyn:resolve', 'devlyn:ideate', '_shared'],
+    skillsToInstall: ['devlyn:resolve', 'devlyn:ideate', 'devlyn:design-ui', '_shared'],
     detect: () => fs.existsSync(path.join(process.cwd(), 'AGENTS.md')) || fs.existsSync(path.join(process.cwd(), '.codex')),
   },
   gemini: {
@@ -183,7 +183,6 @@ const OPTIONAL_ADDONS = [
   { name: 'devlyn:pencil-push', desc: 'Push codebase UI to Pencil canvas for design sync', type: 'local' },
   { name: 'devlyn:reap', desc: 'Safely reap orphaned MCP / codex / Superset child processes left behind by long Claude sessions', type: 'local' },
   { name: 'devlyn:design-system', desc: 'Extract design tokens from a chosen UI style for exact reproduction (creative power-user)', type: 'local' },
-  { name: 'devlyn:design-ui', desc: 'N (default 5) distinct UI style explorations from a single Lead Designer (creative power-user)', type: 'local' },
   { name: 'devlyn:team-design-ui', desc: '5 distinct UI style explorations from a full design team (creative power-user)', type: 'local' },
   // External skill packs (installed via npx skills add)
   { name: 'vercel-labs/agent-skills', desc: 'React, Next.js, React Native best practices', type: 'external' },
@@ -194,7 +193,7 @@ const OPTIONAL_ADDONS = [
   // MCP servers (installed via claude mcp add)
   // Note: the Codex integration uses the local `codex` CLI binary (not MCP).
   // Install the CLI separately per https://platform.openai.com/docs/codex — the
-  // harness auto-detects availability and downgrades to Claude-only on failure.
+  // pair/risk-probe routes fail closed when Codex is required but unavailable.
   { name: 'playwright', desc: 'Playwright MCP for browser testing — powers /devlyn:resolve BUILD_GATE browser tier', type: 'mcp', command: 'npx -y @anthropic-ai/mcp-playwright' },
 ];
@@ -524,7 +523,7 @@ function detectOtherCLIs() {
   return detected;
 }
-// Install /devlyn:resolve + /devlyn:ideate + _shared skills into a CLI's
+// Install devlyn:resolve + devlyn:ideate + devlyn:design-ui + _shared skills into a CLI's
 // global skills directory (e.g. ~/.codex/skills/). Returns count of skills
 // copied. Skipped silently for CLIs without a skillsDir (e.g. cursor, copilot
 // at the time of writing — they don't have an analogous skill-loader).
@@ -608,11 +607,11 @@ function installAgentsForCLI(cliKey) {
   }
   // If this CLI also supports a global skill-loader (currently Codex), install
-  // /devlyn:resolve + /devlyn:ideate + _shared so the same slash commands work
-  // there. Skipped for CLIs without a skillsDir entry.
+  // devlyn:resolve + devlyn:ideate + devlyn:design-ui + _shared. Codex invokes
+  // these as skills (for example `$devlyn:resolve`), not Claude slash commands.
   const skillsCopied = installSkillsForCLI(cliKey);
   if (skillsCopied > 0) {
-    log(`  → ${skillsCopied} skill${skillsCopied > 1 ? 's' : ''} installed (devlyn:resolve / devlyn:ideate / _shared)`, 'dim');
+    log(`  → ${skillsCopied} skill${skillsCopied > 1 ? 's' : ''} installed (devlyn:resolve / devlyn:ideate / devlyn:design-ui / _shared)`, 'dim');
   }
   return true;
@@ -689,7 +688,7 @@ async function init(skipPrompts = false) {
     }
   }
   if (!settings.env) settings.env = {};
-  // Auto-allow pipeline state directory and common git commands so auto-resolve doesn't prompt
+  // Auto-allow pipeline state directory and common git commands so resolve doesn't prompt
   if (!settings.permissions) settings.permissions = {};
   if (!settings.permissions.allow) settings.permissions.allow = [];
   const pipelinePermissions = [
@@ -762,7 +761,7 @@ async function init(skipPrompts = false) {
     if (cli.configDir) {
       desc = `Install agents into ${cli.configDir}/`;
     } else if (cli.skillsDir) {
-      desc = `Install ${cli.instructionsFile} + /devlyn:resolve + /devlyn:ideate skills (~/.codex/skills/)`;
+      desc = `Install ${cli.instructionsFile} + devlyn:resolve/devlyn:ideate/devlyn:design-ui skills (~/.codex/skills/; use $devlyn:* in Codex)`;
     } else {
       desc = `Install ${cli.instructionsFile}`;
     }
@@ -777,7 +776,7 @@ async function init(skipPrompts = false) {
     log(`  ✅ Agent instructions installed for ${agentsInstalled} CLI${agentsInstalled !== 1 ? 's' : ''}`, 'green');
   } else {
     log('💡 No additional CLI instructions selected', 'dim');
-    log('   Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md + /devlyn skills', 'dim');
+    log('   Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md + devlyn skills', 'dim');
   }
   // Ask about optional addons (local skills + external packs)
@@ -808,8 +807,14 @@ function showHelp() {
   log('  npx devlyn-cli -y           Install without prompts');
   log('  npx devlyn-cli agents       Install agents for detected CLIs');
   log('  npx devlyn-cli agents all   Install agents for all supported CLIs');
-  log('  npx devlyn-cli benchmark    Run the full A/B benchmark suite vs bare');
-  log('  npx devlyn-cli benchmark --n 3 --bless   Ship-decision run + promote baseline if pass');
+  log('  npx devlyn-cli benchmark    Run the resolve benchmark suite');
+  log('  npx devlyn-cli benchmark recent              Show compact recent benchmark results');
+  log('  npx devlyn-cli benchmark frontier            Show pair candidate frontier scores/triggers without providers');
+  log('  npx devlyn-cli benchmark audit               Audit pair evidence readiness');
+  log('  npx devlyn-cli benchmark audit-headroom      Audit failed headroom results');
+  log('  npx devlyn-cli benchmark headroom <fixtures...>  Score bare vs solo_claude headroom');
+  log('  npx devlyn-cli benchmark pair <fixtures...>      Score solo_claude vs pair path');
+  log('  npx devlyn-cli benchmark --bless         If ship-gate passes, promote baseline');
   log('  npx devlyn-cli benchmark --dry-run       Validate suite setup without model invocation');
   log('  npx devlyn-cli --help       Show this help\n');
   log('Optional skills (select during install):', 'green');
@@ -831,6 +836,170 @@ function showHelp() {
   log('');
 }
+function showBenchmarkHelp() {
+  log('Usage:', 'green');
+  log('  npx devlyn-cli benchmark [suite] [options] [fixtures...]');
+  log('  npx devlyn-cli benchmark recent [options]');
+  log('  npx devlyn-cli benchmark frontier [options]');
+  log('  npx devlyn-cli benchmark audit [options]');
+  log('  npx devlyn-cli benchmark audit-headroom [options]');
+  log('  npx devlyn-cli benchmark headroom [options] <fixtures...>');
+  log('  npx devlyn-cli benchmark pair [options] <fixtures...>');
+  log('');
+  log('Score-focused runs:', 'green');
+  log('  recent   Show compact, wrap-safe recent benchmark results');
+  log('  frontier Show active rejected/evidence/unmeasured pair candidates, scores, and triggers without providers');
+  log('  audit     Fail on unmeasured pair candidates and invalid headroom rejections');
+  log('            Prints frontier score rows plus headroom and pair quality handoff rows');
+  log('  audit-headroom  Fail on active failed or unsupported headroom rejections');
+  log('  headroom  Score bare vs solo_claude before spending the pair arm');
+  log('  pair      Score solo_claude vs the selected pair path and print gate tables');
+  log('');
+  log('Shadow suite:', 'green');
+  log('  npx devlyn-cli benchmark suite --suite shadow --dry-run');
+  log('            Lists shadow tasks only; use headroom/pair with explicit S* ids for real measurement');
+  log('');
+  log('Examples:', 'green');
+  log('  npx devlyn-cli benchmark --dry-run F1-cli-trivial-flag');
+  log('  npx devlyn-cli benchmark recent');
+  log('  npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md');
+  log('  npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md');
+  log('  npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit');
+  log('  npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json');
+  log('  npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+  log('  npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+  log('');
+}
+function showBenchmarkModeHelp(mode) {
+  if (mode === 'recent') {
+    log('Usage:', 'green');
+    log('  npx devlyn-cli benchmark recent [options]');
+    log('');
+    log('Options:', 'green');
+    log('  --out-json PATH');
+    log('  --out-md PATH');
+    log('  --fixtures-root PATH');
+    log('  --registry PATH');
+    log('  --results-root PATH');
+    log('  --max-width N  default: 92');
+    log('  --min-pair-margin N  default: 5');
+    log('  --max-pair-solo-wall-ratio N  default: 3');
+    log('');
+    log('Output:', 'green');
+    log('  Prints compact, wrap-safe benchmark status and pair-evidence cards without wide tables');
+    log('');
+    log('Example:', 'green');
+    log('  npx devlyn-cli benchmark recent');
+    log('  npx devlyn-cli benchmark recent --out-md /tmp/devlyn-recent-benchmark.md');
+    log('');
+    return;
+  }
+  if (mode === 'frontier') {
+    log('Usage:', 'green');
+    log('  npx devlyn-cli benchmark frontier [options]');
+    log('');
+    log('Options:', 'green');
+    log('  --out-json PATH');
+    log('  --out-md PATH');
+    log('  --fixtures-root PATH');
+    log('  --registry PATH');
+    log('  --results-root PATH');
+    log('  --min-pair-margin N  default: 5');
+    log('  --max-pair-solo-wall-ratio N  default: 3');
+    log('  --fail-on-unmeasured');
+    log('');
+    log('Output:', 'green');
+    log('  Prints pair evidence score rows with trigger reasons; --out-md includes a Triggers column');
+    log('');
+    log('Example:', 'green');
+    log('  npx devlyn-cli benchmark frontier --out-md /tmp/devlyn-pair-frontier.md');
+    log('');
+    return;
+  }
+  if (mode === 'audit-headroom') {
+    log('Usage:', 'green');
+    log('  npx devlyn-cli benchmark audit-headroom [options]');
+    log('');
+    log('Options:', 'green');
+    log('  --out-json PATH');
+    log('  --fixtures-root PATH');
+    log('  --registry PATH');
+    log('  --results-root PATH');
+    log('');
+    log('Example:', 'green');
+    log('  npx devlyn-cli benchmark audit-headroom --out-json /tmp/devlyn-headroom-audit.json');
+    log('');
+    return;
+  }
+  if (mode === 'audit') {
+    log('Usage:', 'green');
+    log('  npx devlyn-cli benchmark audit [options]');
+    log('');
+    log('Options:', 'green');
+    log('  --out-dir PATH');
+    log('  --fixtures-root PATH');
+    log('  --registry PATH');
+    log('  --results-root PATH');
+    log('  --min-pair-evidence N  default: 4');
+    log('  --min-pair-margin N  default: 5');
+    log('  --max-pair-solo-wall-ratio N  default: 3');
+    log('  --require-hypothesis-trigger');
+    log('');
+    log('Output:', 'green');
+    log('  Prints frontier score rows plus headroom_rejections=PASS/FAIL, pair_evidence_quality=PASS/FAIL, pair_trigger_reasons=PASS/FAIL, pair_evidence_hypotheses=PASS/FAIL, pair_evidence_hypothesis_triggers=PASS/WARN/FAIL, historical-alias, and hypothesis-trigger gap handoff rows');
+    log('');
+    log('Example:', 'green');
+    log('  npx devlyn-cli benchmark audit --out-dir /tmp/devlyn-benchmark-audit');
+    log('  npx devlyn-cli benchmark audit --require-hypothesis-trigger --out-dir /tmp/devlyn-benchmark-audit-strict');
+    log('');
+    return;
+  }
+  if (mode === 'headroom') {
+    log('Usage:', 'green');
+    log('  npx devlyn-cli benchmark headroom [options] <fixtures...>');
+    log('');
+    log('Options:', 'green');
+    log('  --run-id ID');
+    log('  --bare-max N       default: 60');
+    log('  --solo-max N       default: 80');
+    log('  --min-bare-headroom N  default: 5');
+    log('  --min-solo-headroom N  default: 5');
+    log('  --min-fixtures N   default: 2; use 3 for F16/F23/F25 proof reruns; audit requires 4 passing evidence rows');
+    log('  --allow-rejected-fixtures  active-fixture diagnostics only');
+    log('  --dry-run          validate args/fixtures and print replay command only');
+    log('');
+    log('Example:', 'green');
+    log('  npx devlyn-cli benchmark headroom --min-fixtures 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+    log('');
+    return;
+  }
+  if (mode === 'pair') {
+    log('Usage:', 'green');
+    log('  npx devlyn-cli benchmark pair [options] <fixtures...>');
+    log('');
+    log('Options:', 'green');
+    log('  --run-id ID');
+    log('  --bare-max N');
+    log('  --solo-max N');
+    log('  --min-bare-headroom N  default: 5');
+    log('  --min-solo-headroom N  default: 5');
+    log('  --min-fixtures N   default: 2; use 3 for F16/F23/F25 proof reruns; audit requires 4 passing evidence rows');
+    log('  --min-pair-margin N  default: 5');
+    log('  --max-pair-solo-wall-ratio N  default: 3');
+    log('  --pair-arm ARM  default: l2_risk_probes; l2_gated is diagnostic');
+    log('  --reuse-calibrated-from RUN_ID');
+    log('  --allow-rejected-fixtures  active-fixture diagnostics only');
+    log('  --dry-run       validate args/fixtures and print replay command only');
+    log('');
+    log('Example:', 'green');
+    log('  npx devlyn-cli benchmark pair --min-fixtures 3 --max-pair-solo-wall-ratio 3 F16-cli-quote-tax-rules F23-cli-fulfillment-wave F25-cli-cart-promotion-rules');
+    log('');
+    return;
+  }
+  showBenchmarkHelp();
+}
 // Main
 const args = process.argv.slice(2);
 const command = args[0];
@@ -850,16 +1019,40 @@ switch (command) {
     break;
   case 'benchmark':
   case 'bench': {
-    // Delegate to benchmark/auto-resolve/scripts/run-suite.sh with all remaining args.
-    const runSuite = path.join(__dirname, '..', 'benchmark', 'auto-resolve', 'scripts', 'run-suite.sh');
-    if (!fs.existsSync(runSuite)) {
+    const benchmarkScripts = {
+      suite: 'run-suite.sh',
+      recent: 'recent-benchmark-summary.py',
+      frontier: 'pair-candidate-frontier.py',
+      audit: 'audit-pair-evidence.py',
+      'audit-headroom': 'audit-headroom-rejections.py',
+      headroom: 'run-headroom-candidate.sh',
+      pair: 'run-full-pipeline-pair-candidate.sh',
+    };
+    let forwardedArgs = args.slice(1);
+    if (forwardedArgs[0] === '--help' || forwardedArgs[0] === '-h') {
+      showBenchmarkHelp();
+      break;
+    }
+    let benchmarkMode = 'suite';
+    if (forwardedArgs[0] === 'suite' || forwardedArgs[0] === 'recent' || forwardedArgs[0] === 'frontier' || forwardedArgs[0] === 'audit' || forwardedArgs[0] === 'audit-headroom' || forwardedArgs[0] === 'headroom' || forwardedArgs[0] === 'pair') {
+      benchmarkMode = forwardedArgs[0];
+      forwardedArgs = forwardedArgs.slice(1);
+    }
+    if (forwardedArgs[0] === '--help' || forwardedArgs[0] === '-h') {
+      showBenchmarkModeHelp(benchmarkMode);
+      break;
+    }
+    const runnerName = benchmarkScripts[benchmarkMode];
+    const runner = path.join(__dirname, '..', 'benchmark', 'auto-resolve', 'scripts', runnerName);
+    if (!fs.existsSync(runner)) {
       log('❌ Benchmark suite runner missing — is this a clean devlyn-cli checkout?', 'yellow');
-      log(`   Expected: ${runSuite}`, 'dim');
+      log(`   Expected: ${runner}`, 'dim');
       process.exit(1);
     }
     const { spawnSync } = require('child_process');
-    const forwardedArgs = args.slice(1);
-    const res = spawnSync('bash', [runSuite, ...forwardedArgs], { stdio: 'inherit' });
+    const env = { ...process.env, DEVLYN_BENCHMARK_CLI_SUBCOMMAND: benchmarkMode };
+    const executable = (benchmarkMode === 'recent' || benchmarkMode === 'frontier' || benchmarkMode === 'audit' || benchmarkMode === 'audit-headroom') ? 'python3' : 'bash';
+    const res = spawnSync(executable, [runner, ...forwardedArgs], { stdio: 'inherit', env });
     process.exit(res.status ?? 1);
     break;
   }
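
The new `benchmark` dispatch in the hunk above can be read as a mapping from the arguments after `benchmark` to a mode, a runner script, and an interpreter. The sketch below restates that mapping as a pure function for illustration. `resolveBenchmarkInvocation` and `isHelpFlag` are hypothetical names that do not exist in the package, and the interpreter is chosen here by file extension, whereas the diffed code lists the Python modes explicitly (the two agree for the current script table).

```javascript
'use strict';

// Script table copied from the diff above.
const BENCHMARK_SCRIPTS = {
  suite: 'run-suite.sh',
  recent: 'recent-benchmark-summary.py',
  frontier: 'pair-candidate-frontier.py',
  audit: 'audit-pair-evidence.py',
  'audit-headroom': 'audit-headroom-rejections.py',
  headroom: 'run-headroom-candidate.sh',
  pair: 'run-full-pipeline-pair-candidate.sh',
};

const isHelpFlag = (arg) => arg === '--help' || arg === '-h';

// Hypothetical helper: describes what the CLI would do for the given args.
function resolveBenchmarkInvocation(benchArgs) {
  // A leading help flag shows the general benchmark help.
  if (isHelpFlag(benchArgs[0])) return { help: 'general' };

  // A recognized first argument selects the mode; otherwise default to suite.
  let mode = 'suite';
  let forwarded = benchArgs;
  if (Object.prototype.hasOwnProperty.call(BENCHMARK_SCRIPTS, benchArgs[0])) {
    mode = benchArgs[0];
    forwarded = benchArgs.slice(1);
  }

  // A help flag right after the mode shows that mode's help.
  if (isHelpFlag(forwarded[0])) return { help: mode };

  const script = BENCHMARK_SCRIPTS[mode];
  // Assumption for this sketch: .py runners use python3, the rest use bash.
  const executable = script.endsWith('.py') ? 'python3' : 'bash';
  return { mode, script, executable, forwarded };
}
```

For example, `resolveBenchmarkInvocation(['audit', '--out-dir', '/tmp/devlyn-benchmark-audit'])` selects `audit-pair-evidence.py` under `python3`, while a bare fixture list such as `['--dry-run', 'F1-cli-trivial-flag']` falls through to the default `suite` mode with all arguments forwarded.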

package/config/skills/_shared/adapters/README.md CHANGED

@@ -30,6 +30,9 @@ Verbosity, formatting, length conventions specific to this model.
 ## Tool-use posture
 When to use tools, when to reason, parallel/sequential preferences.
+## Effort and autonomy
+Optional. Model-specific guidance for effort levels or autonomous-vs-interactive runs when the vendor guide calls this out.
 ## Validation pattern
 How this model verifies its work — mechanical-first vs self-check, etc.

package/config/skills/_shared/adapters/gpt-5-5.md CHANGED

@@ -8,7 +8,7 @@ You are GPT-5.5 by OpenAI. OpenAI's prompt-guidance for this model governs your
 ## Output discipline
-Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use headers, bullets, and bold sparingly — favor short paragraphs and natural transitions unless the canonical body or user requests structure. When `text.verbosity` is `low`, prefer even shorter responses.
+Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use Markdown only where it carries structure (`inline code`, code fences, short lists/tables); otherwise favor short paragraphs and natural transitions. When `text.verbosity` is `low`, prefer even shorter responses.
 ## Tool-use posture
@@ -26,4 +26,8 @@ The official guide warns explicitly about carrying over instructions from older
 2. **Don't over-specify process when the destination is clear.** If the canonical body names the outcome, choose the path; do not narrate every step.
 3. **Stop rules are explicit.** When the canonical body or the harness asks you to stop / abstain / ask, follow the stop rule rather than retrying loops indefinitely. Loop-minimization does not outrank correctness or required citation.
+## Prompt-maintenance cue
+When asked to improve a failed prompt, act as GPT-5.5 metaprompter for itself: name the observed failure, then propose the smallest instruction to add, remove, or relocate. Prefer subtractive changes before adding new rules; keep the canonical body model-neutral and put only GPT-specific tactics in this adapter.
 Do not narrate internal deliberation. State results and decisions directly.

package/config/skills/_shared/adapters/opus-4-7.md CHANGED

@@ -10,10 +10,18 @@ You are Claude Opus 4.7 by Anthropic. Anthropic's prompt-engineering guide for t
 You calibrate response length to task complexity automatically — keep simple lookups short, scale up only when the task warrants it. Do NOT pad with context the user didn't ask for. When the canonical body sets a structural format (XML, JSON, sections), follow it literally; do not silently restructure.
+## Examples and structure
+When prompt maintenance adds examples for Claude, prefer concise positive examples over lists of negative prohibitions. Wrap examples in `<example>` tags (or `<examples>` for several) so examples stay distinct from instructions and variable inputs.
 ## Tool-use posture
 You default to fewer tool calls than prior Claude generations. When the canonical body lists tools, use them when their result would change your answer. Make independent tool calls in parallel; chain only when one depends on another's output. Do not narrate "I'll now call X" preambles unless the canonical body requests progress updates.
+## Effort and autonomy
+For long-horizon coding, review, and agentic runs, assume the harness selected `high` or `xhigh` effort unless told otherwise. Spend that depth on upfront task/constraint understanding and end-state verification, not on verbose narration. If the user or orchestrator gives a complete task in one turn, proceed autonomously instead of requiring progressive clarification.
 ## Validation pattern
 When the canonical body asks you to verify your output before declaring done ("self-check" instructions), execute that step literally — re-read the spec's acceptance criteria, run the listed verification commands if available, list any gap. This is not optional. Mechanical gates owned by the harness (spec-verify-check.py, build-gate.py) are the primary correctness guard; your self-check is the secondary layer that catches what regex cannot.
@@ -22,7 +30,7 @@ When the canonical body asks you to verify your output before declaring done ("s
 You interpret instructions more literally than prior Claude versions. The official guide is explicit about three failure modes:
-1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones. Do NOT pre-filter for importance; the harness has a separate filter step.
+1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones; do not filter for importance or confidence. The harness has a separate filter step.
 2. **Subagent over-spawning**: do NOT spawn a subagent for work you can complete in a single response. Spawn only when the canonical body explicitly requests it OR when fanning out across independent items.
 3. **Overengineering**: do NOT add files, abstractions, error handling, validation, or "future flexibility" beyond what the spec asks. A bug fix doesn't need surrounding cleanup. The right complexity is the minimum needed for the current task.