npm - devlyn-cli - Versions diffs - 1.14.0 → 2.0.0 - Mend

devlyn-cli 1.14.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (148)

package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh ADDED

@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+# F8 setup — no changes. The ambiguity IS the test.
+set -e
+exit 0

package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md ADDED

@@ -0,0 +1,48 @@
+---
+id: "F8-known-limit-ambiguous"
+title: "Improve the CLI"
+status: planned
+complexity: ambiguous
+depends-on: []
+---
+# F8 Improve the CLI
+## Context
+> "Users have been saying the CLI could be better. Make it better."
+That is the entire ask. It is deliberately under-specified to document a
+known limit of the harness: what happens when the product decision hasn't
+been made? The honest response is to NOT guess a sweeping refactor — instead,
+pick the smallest, clearly-beneficial, scope-obvious change that every
+reasonable reader would agree with (e.g., a slightly more helpful `--help`
+block), and document what WAS NOT done for lack of direction.
+## Requirements
+- [ ] Make a small, defensible improvement to the CLI. Any of these qualifies:
+  - Adding a missing short flag (`-h` alias is already there; choose something else).
+  - Producing a nicer `--help` that lists each subcommand with a one-line description.
+  - Distinguishing unknown-command and no-command cases in error output.
+- [ ] Existing behavior is not regressed. `node bin/cli.js hello` / `version` / `--help` still exit 0 with their current semantics.
+- [ ] Add at least one test that asserts the new behavior.
+- [ ] Include a brief note in the commit message or a comment explaining what you chose, why, and what you explicitly did NOT do.
+## Constraints
+- **No new npm dependencies.**
+- **No sweeping refactors.** Do not rewrite the argument parser or invert the file's control flow.
+- **No breaking changes** to current subcommands.
+## Out of Scope
+- Adding new subcommands the user didn't ask for.
+- Restyling, renaming, or deleting existing subcommands.
+- Touching `server/` or `web/`.
+## Verification
+- Existing baseline commands behave identically.
+- At least one new assertion in `tests/` exercises the change.
+- `node bin/cli.js --help` (if changed) is valid UTF-8 and lists every real subcommand once.

package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt ADDED

@@ -0,0 +1 @@
+Users have been saying the CLI could be better. Make it better.

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md ADDED

@@ -0,0 +1,93 @@
+# F9 — Notes (2-skill contract, post iter-0033a)
+## Purpose
+**Load-bearing for the novice-user contract.** The suite ship-gate requires
+F9 to pass (variant arm margin ≥ +5) on every shipped version. If F9 fails,
+the "type `/devlyn:ideate` and ship worldclass software" promise is not being
+met.
+Renamed 2026-04-30 (iter-0033a) from the `-to-preflight` legacy id to match
+the shipped 2-skill product surface: `/devlyn:ideate` → `/devlyn:resolve --spec`.
+The pre-rename copy is preserved at `fixtures/retired/F9-e2e-ideate-to-preflight/`
+for recovery if the OLD 3-skill chain ever needs replay.
+## What the variant arm does (2-skill chain)
+A novice-simulating prompt (`task.txt` is identical to what the user typed)
+is delivered to a fresh Claude session. The session has the new 2-skill kit
+installed. The pipeline arm is expected to:
+1. Recognize this is a vague idea, not a spec → invoke `/devlyn:ideate`.
+2. Ideate produces `docs/specs/<id>-<slug>/spec.md` + `spec.expected.json`
+   and announces `spec ready — /devlyn:resolve --spec <emitted-path>`.
+3. Run `/devlyn:resolve --spec <emitted-path>` (PLAN → IMPLEMENT → BUILD_GATE
+   → CLEANUP → VERIFY in one skill). VERIFY is the fresh-subagent final
+   phase, replacing the standalone `/devlyn:preflight` skill from the
+   3-skill era.
+The variant prompt explicitly instructs this chain so the test isn't about
+Claude inventing the chain — it's about the new tools being usable end-to-end
+when invoked.
+## What the bare arm does
+Same raw task delivered as a direct prompt with anti-skill rules. Bare
+implements `gitstats` using its own judgment. Bare does NOT produce any
+`docs/specs/**` artifacts (and isn't expected to).
+## Why margin ≥ +5 is required (vs L0 / bare)
+The pipeline's whole value prop is that it trades some bare-case tokens for
+quality uplift on novice flows. If this fixture can't show ≥ +5 margin
+vs L0, we're paying pipeline cost without delivering on the novice promise.
+**OLD-vs-NEW comparison is NOT measured here.** OLD `/devlyn:ideate` was
+replaced in iter-0032 (the new ideate is the only ideate at HEAD). Calling
+the OLD F9 chain (`/devlyn:ideate` → `/devlyn:auto-resolve` → `/devlyn:preflight`)
+at HEAD would invoke NEW ideate against OLD auto-resolve — a broken hybrid.
+The harness refuses `--resolve-skill old` on F9 with a hard error.
+## Scoring notes
+- The variant's `docs/specs/<id>-<slug>/spec.md` + `spec.expected.json` ARE
+  part of the judge's evaluation. The judge sees the full product (code +
+  spec + tests), not just the diff to `bin/cli.js`.
+- Bare doesn't produce spec files, so bare's judge payload is code+test only.
+  This asymmetry is INTENTIONAL — the fixture tests total-output quality,
+  not per-file quality.
+## Variant artifact check (out-of-band, NOT in expected.json)
+Per Codex R0.5 §B: `expected.json.verification_commands` apply to ALL arms
+(see `run-fixture.sh:472`). A `docs/specs/**` check in expected.json would
+punish the bare arm (which doesn't run ideate). Variant-only artifact
+verification lives in `scripts/check-f9-artifacts.py`, which runs AFTER
+the per-fixture verification block and asserts variant/solo arms produced:
+- `docs/specs/<id>-<slug>/spec.md` exists.
+- `docs/specs/<id>-<slug>/spec.expected.json` exists.
+- transcript contains `/devlyn:resolve --spec` exactly once.
+- transcript does NOT contain `/devlyn:auto-resolve` or `/devlyn:preflight`.
+## Failure modes detected
+- **Pipeline skips ideate.** Variant invokes `/devlyn:resolve` directly on
+  the raw idea → free-form classifier kicks in → spec quality is shallow.
+  Caught by `scripts/check-f9-artifacts.py`: `docs/specs/**` files missing.
+- **Bare over-engineers.** Without a skeleton, bare builds too much,
+  touches wrong files, adds deps. Caught by spec constraints (no new deps,
+  forbidden empty catch).
+- **Variant chains the OLD names.** If the variant transcript contains
+  `/devlyn:auto-resolve` or `/devlyn:preflight`, the prompt-following gate
+  fails. iter-0033a's harness change ensures the variant prompt names only
+  the 2 surviving skills.
+- **Spec emit path divergence.** If the new ideate refactors away from
+  `<spec-dir>/<id>-<slug>/spec.md`, the harness check fails (path-shape
+  regression smoke #4 of iter-0033a catches it before benchmark runs).
+## Rotation trigger
+F9 is the last fixture we rotate — it's the anchor. If it saturates
+(variant consistently > 95), the whole suite needs a harder novice-flow
+anchor before we retire this one.
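
The transcript assertions listed under "Variant artifact check" can be illustrated with a small sketch. This is not the shipped `scripts/check-f9-artifacts.py` (that script is not part of this excerpt); `countOccurrences` and `checkTranscript` are hypothetical names, and the file-existence checks for `spec.md` and `spec.expected.json` are left out.

```javascript
// Hypothetical sketch of the transcript assertions described in NOTES.md.
// The real check is scripts/check-f9-artifacts.py, which is not shown in this diff.

// Count non-overlapping occurrences of a literal substring.
function countOccurrences(haystack, needle) {
  let count = 0;
  let from = 0;
  while (true) {
    const idx = haystack.indexOf(needle, from);
    if (idx === -1) return count;
    count += 1;
    from = idx + needle.length;
  }
}

// Return a list of human-readable failures; an empty list means the transcript passes.
function checkTranscript(transcript) {
  const failures = [];
  const resolveCalls = countOccurrences(transcript, '/devlyn:resolve --spec');
  if (resolveCalls !== 1) {
    failures.push(`expected exactly one "/devlyn:resolve --spec", found ${resolveCalls}`);
  }
  for (const retired of ['/devlyn:auto-resolve', '/devlyn:preflight']) {
    if (transcript.includes(retired)) {
      failures.push(`transcript names retired skill ${retired}`);
    }
  }
  return failures;
}
```

Returning a list of failures rather than a boolean keeps each violated rule visible in the report, which matches how the notes enumerate the failure modes one by one.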

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json ADDED

@@ -0,0 +1,74 @@
+{
+  "verification_commands": [
+    {
+      "cmd": "node bin/cli.js gitstats",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Commits:",
+        "Last commit:"
+      ],
+      "stdout_not_contains": [
+        "Error:"
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js gitstats --json",
+      "exit_code": 0,
+      "stdout_contains": [
+        "{",
+        "commits",
+        "authors"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "cd /tmp && node -e 'const { spawnSync } = require(\"child_process\"); const p = process.env.BENCH_WORKDIR || process.cwd(); console.log(spawnSync(\"node\", [p + \"/bin/cli.js\", \"gitstats\"], { encoding: \"utf8\", cwd: \"/tmp\" }).status)'",
+      "exit_code": 0,
+      "stdout_contains": [
+        "2"
+      ],
+      "stdout_not_contains": [
+        "0"
+      ]
+    },
+    {
+      "cmd": "node bin/cli.js hello",
+      "exit_code": 0,
+      "stdout_contains": [
+        "Hello, world!"
+      ],
+      "stdout_not_contains": []
+    },
+    {
+      "cmd": "node --test tests/",
+      "exit_code": 0,
+      "stdout_contains": [],
+      "stdout_not_contains": []
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+      "description": "empty catch",
+      "files": [
+        "bin/cli.js"
+      ],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": [
+    "bin/cli.js"
+  ],
+  "forbidden_files": [],
+  "max_deps_added": 0,
+  "tier_a_waivers": [
+    "docs/specs/**",
+    "docs/VISION.md",
+    "docs/ROADMAP.md",
+    "docs/roadmap/**"
+  ],
+  "spec_output_files": [
+    "bin/**",
+    "tests/**"
+  ]
+}
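
A rough sketch of how one `verification_commands` entry and one `forbidden_patterns` entry might be evaluated. The actual logic lives in `run-fixture.sh`, which this excerpt does not show, so everything here is an assumption: `evaluateCommand` and `matchesForbidden` are hypothetical names, `stdout_contains` / `stdout_not_contains` are treated as plain substring checks, and `pattern` is interpreted with JavaScript `RegExp` semantics.

```javascript
// Sketch only: the real evaluation is done by run-fixture.sh, which this diff does not show.
// Assumption: `stdout_contains` / `stdout_not_contains` are plain substring checks.
// `result` is a hypothetical { exitCode, stdout } captured after running `spec.cmd`.
function evaluateCommand(spec, result) {
  const failures = [];
  if (result.exitCode !== spec.exit_code) {
    failures.push(`exit code ${result.exitCode}, expected ${spec.exit_code}`);
  }
  for (const needle of spec.stdout_contains) {
    if (!result.stdout.includes(needle)) failures.push(`stdout is missing "${needle}"`);
  }
  for (const needle of spec.stdout_not_contains) {
    if (result.stdout.includes(needle)) failures.push(`stdout contains "${needle}"`);
  }
  return { pass: failures.length === 0, failures };
}

// Assumption: `pattern` is interpreted with JavaScript RegExp semantics.
function matchesForbidden(patternSpec, text) {
  return new RegExp(patternSpec.pattern).test(text);
}

// Expectations copied from the outside-a-repo entry above (its long `cmd` is omitted here).
const outsideRepoExpectation = { exit_code: 0, stdout_contains: ['2'], stdout_not_contains: ['0'] };
// The "empty catch" pattern after JSON unescaping, written as a JavaScript string literal.
const emptyCatch = { pattern: 'catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}' };

console.log(evaluateCommand(outsideRepoExpectation, { exitCode: 0, stdout: '2\n' }).pass);
console.log(matchesForbidden(emptyCatch, 'try { run(); } catch (err) { }'));
console.log(matchesForbidden(emptyCatch, 'try { run(); } catch (err) { console.error(err); }'));
```

Note that the outside-a-repo entry expects `exit_code: 0` even though the spec requires `gitstats` to exit 2 there: the wrapper script itself succeeds and prints the inner process's status, so the `2` is asserted through stdout.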

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json ADDED

@@ -0,0 +1,10 @@
+{
+  "id": "F9-e2e-ideate-to-resolve",
+  "category": "e2e",
+  "difficulty": "high",
+  "timeout_seconds": 3600,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "End-to-end novice flow (2-skill contract): from a vague idea ('git stats CLI for the current repo') the variant must run /devlyn:ideate → /devlyn:resolve --spec <emitted-path> to produce spec + implemented code + verified output. VERIFY is the fresh-subagent final phase of resolve (no separate preflight skill). The bare arm receives the same vague idea as a direct prompt. This fixture gates the novice-user contract."
+}

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh ADDED

@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+# F9 setup — seed a few synthetic commits with different authors so the
+# `gitstats` subcommand's "top 3 authors by commit count" requirement is
+# meaningfully exercised. Without this, every commit author is the runner's
+# default and the ranking test is a no-op.
+set -e
+commit_as() {
+  local name="$1" email="$2" file="$3" message="$4"
+  echo "$(date +%s%N) $name" >> "$file"
+  git add "$file"
+  git -c user.name="$name" -c user.email="$email" commit -q -m "$message"
+}
+mkdir -p .bench-seed
+commit_as "Alpha Author"   "alpha@bench.test"   .bench-seed/log "seed: alpha 1"
+commit_as "Alpha Author"   "alpha@bench.test"   .bench-seed/log "seed: alpha 2"
+commit_as "Alpha Author"   "alpha@bench.test"   .bench-seed/log "seed: alpha 3"
+commit_as "Alpha Author"   "alpha@bench.test"   .bench-seed/log "seed: alpha 4"
+commit_as "Beta Author"    "beta@bench.test"    .bench-seed/log "seed: beta 1"
+commit_as "Beta Author"    "beta@bench.test"    .bench-seed/log "seed: beta 2"
+commit_as "Beta Author"    "beta@bench.test"    .bench-seed/log "seed: beta 3"
+commit_as "Gamma Author"   "gamma@bench.test"   .bench-seed/log "seed: gamma 1"
+commit_as "Gamma Author"   "gamma@bench.test"   .bench-seed/log "seed: gamma 2"
+commit_as "Delta Author"   "delta@bench.test"   .bench-seed/log "seed: delta 1"
+echo "F9 setup: seeded 10 commits across 4 authors (Alpha 4 / Beta 3 / Gamma 2 / Delta 1)"

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md ADDED

@@ -0,0 +1,62 @@
+---
+id: "F9-e2e-ideate-to-resolve"
+title: "End-to-end: idea → shipped CLI feature (2-skill contract)"
+status: planned
+complexity: high
+depends-on: []
+---
+# F9 End-to-End Novice Flow (2-skill chain)
+## Context
+A first-time user has a vague idea:
+> "I want a CLI subcommand that shows basic stats about the current git repo — commit count, last commit date, top 3 authors. Call it `gitstats`."
+The variant arm is expected to use the 2-skill chain:
+`/devlyn:ideate` → `/devlyn:resolve --spec <emitted-path>`. The bare arm
+receives the same idea as a direct prompt and implements it without the
+pipeline.
+This fixture is the suite's most important gate for the "novice user contract":
+a first-time user typing `/devlyn:ideate` should land at working,
+well-structured software. VERIFY runs as the fresh-subagent final phase
+inside `/devlyn:resolve` (no separate preflight skill in the 2-skill design).
+## Requirements
+- [ ] A new `gitstats` subcommand exists in `bin/cli.js`.
+- [ ] `node bin/cli.js gitstats` (run inside a git repo) prints:
+  - Line 1: commit count (e.g., `Commits: 42`).
+  - Line 2: last commit ISO date (e.g., `Last commit: 2026-04-23T12:00:00Z`).
+  - Lines 3-5: top 3 authors by commit count, format `<rank>. <name> <count>`.
+- [ ] Run outside a git repo → stderr message `Error: not a git repository` and exit 2.
+- [ ] `node bin/cli.js gitstats --json` emits valid JSON with the same data.
+- [ ] Existing subcommands (`hello`, `version`) unchanged.
+- [ ] Add at least one test.
+## Constraints
+- **No new npm dependencies.** Use `child_process` to shell out to `git`.
+- **No silent catches.**
+- **Non-git-repo handling.** Do not assume the user is always in a repo.
+- **Lifecycle note.** The harness's CLEANUP/VERIFY phases may flip this
+  spec's frontmatter `status` after implementation completes — that is
+  benchmark lifecycle bookkeeping, not a scope violation.
+## Out of Scope
+- Parsing commit messages, tags, branches.
+- Remote API calls.
+- Touching `server/` or `web/`.
+## Verification
+- Inside this worktree (which IS a git repo): `node bin/cli.js gitstats` exits 0 and prints at least 5 lines of summary.
+- `node bin/cli.js gitstats --json | node -e 'const d=JSON.parse(require("fs").readFileSync(0,"utf8")); console.log(typeof d.commits)'` prints `number`.
+- `cd /tmp && node <worktree>/bin/cli.js gitstats` (from outside a repo — use the worktree's absolute path) exits 2.
+- `node --test tests/` passes.
+(Variant-only artifact checks — `docs/specs/<id>-<slug>/spec.md` + `spec.expected.json` existence, transcript fingerprint — live in `scripts/check-f9-artifacts.py`, NOT in the shared verification block above. See NOTES.md.)
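
For orientation, the "top 3 authors by commit count" requirement reduces to a count-and-rank step. The sketch below is one possible shape, not the fixture's reference solution: `topAuthors` is a hypothetical helper, it assumes the author names were already collected (for example one name per commit from `git log --format=%an`), and the alphabetical tie-break is an assumption because the spec does not define one.

```javascript
// Hypothetical helper: rank authors by commit count and format as `<rank>. <name> <count>`.
// Assumption: `authorNames` holds one entry per commit, already read from git.
// Assumption: ties are broken alphabetically; the spec leaves tie-breaking undefined.
function topAuthors(authorNames, limit = 3) {
  const counts = new Map();
  for (const name of authorNames) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([name, count], index) => `${index + 1}. ${name} ${count}`);
}

// Illustrative input only: author names in arbitrary commit order.
const sampleAuthors = ['Beta Author', 'Alpha Author', 'Alpha Author', 'Gamma Author', 'Beta Author', 'Alpha Author'];
console.log(topAuthors(sampleAuthors).join('\n'));
```

Keeping the ranking as a pure function over a list of names makes it testable with `node --test` without shelling out to `git`.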

package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt ADDED

@@ -0,0 +1,5 @@
+I want a CLI subcommand that shows basic stats about the current git repo — commit count, last commit date, top 3 authors. Call it `gitstats`.
+Should work inside this repo when I run `node bin/cli.js gitstats`, and fail cleanly if I'm not in a git repo. A `--json` flag for machine-readable output would be useful too.
+Keep the existing `hello` and `version` subcommands working. Add a test. No new npm dependencies.

package/benchmark/auto-resolve/fixtures/SCHEMA.md ADDED

@@ -0,0 +1,130 @@
+# Fixture Schema
+Every fixture is a directory under `benchmark/auto-resolve/fixtures/F<N>-<slug>/` with these files. **All six files are required** (setup.sh may be empty when the starting `test-repo` copy needs no modification).
+## metadata.json
+```json
+{
+  "id": "F2-cli-medium-subcommand",
+  "category": "medium",
+  "difficulty": "medium",
+  "timeout_seconds": 1200,
+  "required_tools": ["node"],
+  "browser": false,
+  "deps_change_expected": false,
+  "intent": "One-sentence plain-language statement of the work, the SINGLE source of truth for spec.md and task.txt."
+}
+```
+- **id** — matches directory name, used across artifacts.
+- **category** — one of `trivial | medium | high-risk | stress | edge | e2e`. Drives which ship-gate rule applies.
+- **difficulty** — expected difficulty independent of category. Rubric uses this only for saturation detection (when both arms > 95 for two versions, flag fixture for rotation).
+- **timeout_seconds** — per-arm hard timeout. Runner kills the arm at this limit and marks result `TIMEOUT`.
+- **required_tools** — binaries the arm's environment must provide. Runner checks before invocation.
+- **browser** — true if arm must be able to run Playwright. Runner uses this to decide whether `test-repo`'s Playwright deps get installed before the arm starts.
+- **deps_change_expected** — true if the task involves modifying `package.json` / lockfiles. Variant's CRITIC security sub-pass is expected to trigger native `security-review` dep audit when true.
+- **intent** — **load-bearing**. A short plain-language statement shared by both arms. `spec.md` formalizes it into auto-resolve-ready form; `task.txt` renders it as a direct prompt. A CI lint ensures both derive from this field and stay in sync.
+## spec.md
+Auto-resolve-ready spec for the pipeline arm. Same format `/devlyn:ideate` produces:
+```markdown
+---
+id: "<fixture-id>"
+title: "<short title>"
+status: planned
+complexity: medium
+depends-on: []
+---
+# <fixture-id> <Title>
+## Context
+2-3 sentences describing WHY (not HOW). Must be traceable back to `metadata.intent`.
+## Requirements
+- [ ] Specific, testable, scoped.
+- [ ] ...
+## Constraints
+- Concrete, with reasoning for each (not bare).
+## Out of Scope
+- Explicit "must NOT build" list. Audited by preflight as anti-commitments.
+## Verification
+- Concrete commands whose expected behavior is named.
+```
+## task.txt
+Bare-arm input. Plain English, same intent, but framed as a user request rather than a formal spec. Intentionally lacks the structured Requirements/Constraints/Out-of-Scope sections — bare must make those calls itself. Must not leak "use the devlyn skill" hints.
+## expected.json
+Machine-readable acceptance criteria used by both `run-fixture.sh` verification and the judge's rubric anchor.
+```json
+{
+  "verification_commands": [
+    {
+      "cmd": "node bin/cli.js doctor",
+      "exit_code": 0,
+      "stdout_contains": ["doctor: "],
+      "stdout_not_contains": ["undefined"]
+    }
+  ],
+  "forbidden_patterns": [
+    {
+      "pattern": "catch\\s*\\(\\s*[a-zA-Z_]*\\s*\\)\\s*\\{\\s*return",
+      "description": "silent catch returning a fallback value — violates no-silent-catches policy",
+      "files": ["bin/cli.js"],
+      "severity": "disqualifier"
+    }
+  ],
+  "required_files": ["bin/cli.js"],
+  "forbidden_files": [],
+  "max_deps_added": 0
+}
+```
+- **verification_commands** — runner executes each. Each command's pass/fail contributes to the arm's `verify_score`.
+- **forbidden_patterns** — regexes scanned across `diff.patch`. Match at `severity: "disqualifier"` is a hard-floor fail. Match at `severity: "warning"` goes into the judge's critical-findings report.
+- **required_files** — must exist after the arm runs.
+- **forbidden_files** — must NOT appear in the arm's diff.
+- **max_deps_added** — count of new entries under `dependencies`/`devDependencies` in `package.json`. Exceeds → hard-floor fail.
+## NOTES.md
+Human-readable explanation of why this fixture exists. Must answer:
+1. What specific failure mode does this fixture detect?
+2. What pipeline phase(s) is this testing?
+3. Why can't another fixture cover this?
+4. When should this fixture be retired or replaced?
+Notes are read during suite design review, not during runs.
+## setup.sh
+Deterministic starting state. Runs against a fresh copy of `benchmark/auto-resolve/fixtures/test-repo/` before either arm starts. Common uses:
+- Install extra deps (`npm install --prefix . something`).
+- Apply a `.patch` that introduces a bug to fix.
+- Create pre-existing files referenced by the spec.
+Script must be idempotent when re-applied. Empty file (just `#!/usr/bin/env bash\nset -e\n`) is valid when no setup needed.
+---
+## Drift Prevention
+A CI lint step (`scripts/lint-fixtures.sh`) verifies:
+- All six files present per fixture.
+- `metadata.intent` substring appears in both `spec.md::Context` and `task.txt` (≥ 60% token overlap using simple tokenization).
+- `spec.md` frontmatter `id` matches directory name.
+- `expected.json` is valid JSON.
+- `setup.sh` is executable.
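
The "≥ 60% token overlap using simple tokenization" rule from Drift Prevention could look roughly like the sketch below. `scripts/lint-fixtures.sh` is not included in this excerpt, so the tokenizer (lowercase, split on non-alphanumerics) and the direction of the ratio (share of distinct intent tokens found in the other text) are both assumptions, and `tokenize`, `tokenOverlap`, and `intentInSync` are hypothetical names.

```javascript
// Assumed "simple tokenization": lowercase, split on runs of non-alphanumeric characters.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Fraction of distinct intent tokens that also occur in the candidate text.
// Assumption: overlap is measured relative to the intent, not the candidate.
function tokenOverlap(intent, candidate) {
  const intentTokens = new Set(tokenize(intent));
  if (intentTokens.size === 0) return 0;
  const candidateTokens = new Set(tokenize(candidate));
  let shared = 0;
  for (const token of intentTokens) {
    if (candidateTokens.has(token)) shared += 1;
  }
  return shared / intentTokens.size;
}

// True when the candidate (spec Context or task.txt) still reflects the intent.
function intentInSync(intent, candidate, threshold = 0.6) {
  return tokenOverlap(intent, candidate) >= threshold;
}
```

Measuring against the intent's tokens means a long `spec.md` Context section is not penalized for the extra detail it adds, only for dropping words the intent contains.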

package/benchmark/auto-resolve/fixtures/test-repo/README.md ADDED

@@ -0,0 +1,27 @@
+# bench-test-repo
+Deterministic base Node project used by every devlyn-cli auto-resolve
+benchmark fixture. Fixtures extend this skeleton via `setup.sh` patches.
+## What's in it
+- `bin/cli.js` — tiny CLI (`hello`, `version`)
+- `server/index.js` — tiny Express app (`/health`, `/items`, `/items/:id`)
+- `web/index.html` — minimal static page with a click interaction
+- `tests/cli.test.js`, `tests/server.test.js` — node:test fixtures
+- `playwright.config.js` — used by web/browser fixtures only
+- `package.json` — `express` dep, `engines: node >= 18`
+## How it's used
+`run-fixture.sh` copies this directory to a temp path per run, applies the
+fixture's `setup.sh`, then invokes the arm (variant or bare) against that
+copy. No fixture modifies this source tree — modifications happen only in
+the per-run temp copies.
+## Keep it minimal
+Adding features to `test-repo` enlarges the surface every fixture works
+against. Add only when an existing fixture can't express itself against the
+current baseline. Preferred path: push complexity into the fixture's
+`setup.sh`, not into this base.

package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js ADDED

@@ -0,0 +1,63 @@
+#!/usr/bin/env node
+// bench-test-repo — tiny CLI used as the deterministic base for benchmark fixtures.
+// Fixtures extend or modify this file; keep the baseline minimal and obvious.
+const fs = require('fs');
+const path = require('path');
+const USAGE = `Usage: bench-cli <command> [options]
+Commands:
+  hello [--name NAME]        Print a greeting (default name: "world")
+  version                    Print the CLI version from package.json
+  --help, -h                 Show this help
+Examples:
+  bench-cli hello
+  bench-cli hello --name alice
+  bench-cli version
+`;
+function readPackageVersion() {
+  const pkgPath = path.join(__dirname, '..', 'package.json');
+  const raw = fs.readFileSync(pkgPath, 'utf8');
+  return JSON.parse(raw).version;
+}
+function parseNameFlag(argv) {
+  const idx = argv.indexOf('--name');
+  if (idx === -1) return 'world';
+  const value = argv[idx + 1];
+  if (!value || value.startsWith('-')) {
+    console.error('--name requires a value');
+    process.exit(1);
+  }
+  return value;
+}
+function main(argv) {
+  const [command, ...rest] = argv;
+  if (!command || command === '--help' || command === '-h') {
+    process.stdout.write(USAGE);
+    return;
+  }
+  switch (command) {
+    case 'hello': {
+      const name = parseNameFlag(rest);
+      console.log(`Hello, ${name}!`);
+      return;
+    }
+    case 'version': {
+      console.log(readPackageVersion());
+      return;
+    }
+    default:
+      console.error(`Unknown command: ${command}`);
+      process.stderr.write(USAGE);
+      process.exit(1);
+  }
+}
+main(process.argv.slice(2));