npm - devlyn-cli - Versions diffs - 1.15.0 → 2.1.0 - Mend

devlyn-cli 1.15.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (158) hide show

package/AGENTS.md ADDED Viewed

@@ -0,0 +1,104 @@
+# Codex Project Instructions
+devlyn-cli installs `/devlyn:ideate` (optional) and `/devlyn:resolve` (required) into Claude Code, plus the contract below. Codex CLI reads this file when invoked inside a project that has it. The principles are non-negotiable on every change.
+## Core principles
+Seven rules govern every change. Cite them by name when a decision touches one.
+1. **No workaround** — fix the root cause, never the symptom. No `any`, no `@ts-ignore`, no silent `catch`, no hardcoded fallback that hides a broken contract.
+2. **No overengineering** — smallest change that closes the goal. New abstractions require an observed failure mode they prevent. Subtractive-first: ask "what can I delete instead?" before writing anything new.
+3. **No guesswork** — verify with the actual files, logs, diffs, and run output before forming conclusions. State the falsifiable prediction BEFORE the experiment; record raw results AFTER.
+4. **Worldclass** — code that survives review at a non-trivial codebase. Zero CRITICAL, zero HIGH security/design findings on the shippable path.
+5. **Best practice** — idiomatic for the language and framework. Use standard primitives; do not hand-roll what the library already provides.
+6. **Optimized** — efficient on the resource that matters (wall-time, tokens, attention, cognitive load on the next reader). "Slower but more thoughtful" is not free.
+7. **Production ready** — error states are explicit and visible; behavior under failure is what the user expects, not silent corruption.
+Three discipline rules govern HOW the principles are applied:
+- **Root cause via flexible why-chain.** Keep asking "why?" until you find the violated invariant. **If the answer surfaces in 2 questions, stop.** If it takes 5 or 7, keep going. Strict counts are wrong; until-found is right.
+- **First-principles thinking.** Challenge the requirement before optimizing the answer. Most "we have to do X" assumptions are habit, not necessity. Reduce to irreducible truths and rebuild from there.
+- **Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.** — Saint-Exupéry. The operating definition of "done." A change is finished when no further line, branch, flag, or doc paragraph can be removed without breaking a learned failure mode.
+## Quick Start
+```text
+ideate (optional)  ->  resolve  ->  ship
+```
+- `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project` (multi-feature).
+- `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <ref> --spec <path>`. Phases run inline: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh-subagent, findings-only).
+- Three creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and install only when the user opts in.
+Each skill's `SKILL.md` is the source of truth for flags and workflow. Do not duplicate.
+## Subtractive-First Editing
+Before writing any change, answer in this order:
+1. What can I delete that makes the addition unnecessary?
+2. What can I delete that makes the addition smaller?
+3. What is the minimum addition still required?
+Hard rules:
+- A pure-addition diff needs a citation: an explicit user/spec requirement OR an observed failure mode.
+- Refactor-only changes should reduce line count unless a cited failure requires the new shape.
+- Do not add flags, branches, or options for hypothetical users.
+- Do not add defensive wrappers when an upstream contract can be corrected instead.
+- Doc growth has the same cost as code growth. Delete the now-stale sentence before adding new prose.
+- A change is not done until you have attempted one more deletion and confirmed it would break something.
+## Goal-Locked Execution
+Default mode is execution toward the user's stated goal. Do not drift.
+Refuse these patterns:
+- Unrequested work ("while here, also fix...").
+- Tangential cleanup in files the task does not require.
+- Speculative robustness for cases not observed in production, tests, findings, or the spec.
+- Mid-flight re-scoping without user approval.
+- Curiosity exploration that is not on the critical path.
+Drift test: **did the user ask for this, OR does the stated goal strictly require it?** If both no, surface it as a follow-up note and continue on the original path.
+In interactive sessions, ask a concise clarification when scope expansion is real. In hands-free pipelines, stay on scope and log the assumption in the final artifact.
+## Error Handling
+No silent fallbacks.
+- Show clear errors, retry paths, or actionable guidance.
+- Fallbacks allowed only when widely accepted and harmless (CSS fallback fonts, CDN failover, image placeholders).
+- Silent `catch` blocks are bugs.
+- Logging is not user-visible error handling.
+- The Codex CLI availability downgrade is the one documented exception: emit `engine downgraded: codex-unavailable` and behave exactly like explicit Claude routing.
+## Evidence Over Claim
+Every finding cites concrete evidence:
+- Code: `file:line` you opened.
+- Missing implementation: state exactly what you searched and found absent.
+- Doc: cite the stale text + section/line.
+- Browser: route/URL + screenshot or observed evidence.
+- Benchmark: run id, fixture id, metric, raw result path.
+Exclude vague claims. They produce vague fixes.
+## Working In a devlyn-cli Project
+- Check `git status --short` before editing.
+- Never revert user changes unless explicitly asked.
+- Use `rg` / `rg --files` for search.
+- Keep changes scoped to the task; stop when the core request is answered or the change is verified.
+- For installer or skill edits inside the devlyn-cli source repo, run `bash scripts/lint-skills.sh`.
+- Treat the project's own `docs/VISION.md`, `docs/ROADMAP.md`, `docs/roadmap/**`, and any local `AGENTS.md` / `CLAUDE.md` overrides as authoritative when present.
+## Communication
+- Lead with objective evidence before opinion.
+- Be concise and specific.
+- State blockers plainly.
+- Separate completed work, verification, and remaining risks.

package/CLAUDE.md CHANGED Viewed

@@ -1,15 +1,113 @@
 # Project Instructions
+devlyn-cli installs `/devlyn:ideate` (optional) and `/devlyn:resolve` (required) into Claude Code, plus the contract below. These principles are non-negotiable on every change — yours and any sub-agent's.
+## Core principles
+Seven rules govern every change. Cite them by name when a decision touches one.
+1. **No workaround** — fix the root cause, never the symptom. No `any`, no `@ts-ignore`, no silent `catch`, no hardcoded fallback that hides a broken contract. Configuration-level skips that bypass the real issue are also rejects.
+2. **No overengineering** — smallest change that closes the goal. New abstractions require an observed failure mode they prevent. Subtractive-first: ask "what can I delete instead?" before writing anything new.
+3. **No guesswork** — verify with the actual files, logs, diffs, and run output before forming conclusions. State the falsifiable prediction BEFORE the experiment; record raw results AFTER. Retroactive prediction edits are dishonest.
+4. **Worldclass** — code that survives review at a non-trivial codebase. Zero CRITICAL, zero HIGH security/design findings on the shippable path.
+5. **Best practice** — idiomatic for the language and framework. Use standard primitives; do not hand-roll what the library already provides.
+6. **Optimized** — efficient on the resource that matters (wall-time, tokens, attention, cognitive load on the next reader). "Slower but more thoughtful" is not free. Each layer of composition or process must beat the simpler baseline.
+7. **Production ready** — error states are explicit and visible; behavior under failure is what the user expects, not silent corruption.
+Three discipline rules govern HOW the principles are applied:
+- **Root cause via flexible why-chain.** Keep asking "why?" until you find the violated invariant. **If the answer surfaces in 2 questions, stop.** If it takes 5 or 7, keep going. Strict counts are wrong; until-found is right.
+- **First-principles thinking.** Challenge the requirement before optimizing the answer. Most "we have to do X" assumptions are habit, not necessity. Reduce the problem to its irreducible truths and rebuild from there.
+- **Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.** — Saint-Exupéry. This is the operating definition of "done." A change is finished when no further line, branch, flag, or doc paragraph can be removed without breaking a learned failure mode. Not before.
+The runtime sub-agent contract below (Subtractive-first / Goal-locked / No-workaround discipline / Evidence over claim) expands these principles into concrete operational tests. Sub-agents in `/devlyn:resolve` and `/devlyn:ideate` enforce them at every phase.
 ## Quick Start
-Three commands cover most work. All default to `--engine auto` — Codex GPT-5.4 builds, Claude Opus 4.7 critiques (cross-model GAN dynamic).
+Two skills cover the full cycle post iter-0034 Phase 4 cutover (2026-05-04). `/devlyn:ideate` is OPTIONAL; `/devlyn:resolve` is REQUIRED. **Both default to `--engine claude`** — pair / multi-engine routing is research-only at HEAD per the iter-0020 + iter-0033g + iter-0034 close-outs (see [`autoresearch/iterations/0020-pair-policy-narrow.md`](autoresearch/iterations/0020-pair-policy-narrow.md) + [`autoresearch/iterations/0034-phase-4-cutover.md`](autoresearch/iterations/0034-phase-4-cutover.md)). Pass `--engine auto` or `--engine codex` explicitly to opt into the research path; the harness silently downgrades to `claude` and emits a banner if the Codex CLI is missing.
-1. `/devlyn:ideate` — unstructured idea → VISION/ROADMAP/item specs
-2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → ship
-3. `/devlyn:preflight` — verify the implementation matches the roadmap
+1. `/devlyn:ideate` (optional) — unstructured idea → `docs/specs/<id>/spec.md` + `spec.expected.json`. Modes: default Q&A, `--quick` (autonomous-pipeline-safe), `--from-spec <path>`, `--project`.
+2. `/devlyn:resolve` — hands-free pipeline for any coding task. Free-form goal, `--spec <path>`, or `--verify-only <diff> --spec <path>`. Phases: PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY (fresh subagent, findings-only).
 Each skill's `SKILL.md` is the source of truth for its flags and workflow — don't duplicate them here.
+### When to use which
+| Situation | Command |
+|-----------|---------|
+| New project / multi-feature plan / external spec to normalize | `/devlyn:ideate` (`--project`, default, or `--from-spec`) |
+| One-line goal you want turned into a spec autonomously | `/devlyn:ideate --quick` |
+| Spec already in hand | `/devlyn:resolve --spec <path>` |
+| Free-form fix / feature / refactor / debug / PR review | `/devlyn:resolve "<describe the work>"` |
+| Verify a diff or PR against a spec | `/devlyn:resolve --verify-only <ref> --spec <path>` |
+### Subtractive-first editing — perfection = nothing left to remove
+<!-- runtime-principles:section=subtractive-first:begin -->
+> "Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away." — Saint-Exupéry. **This is the operating definition of "done" in this repo.** A change is finished when no further line, branch, flag, or doc paragraph can be removed without breaking a learned failure mode. Not before.
+This rule overrides instinct. LLMs (including you) are trained on corpora that reward elaborate, defensive, "thorough" code — so the default impulse is to add. That impulse is wrong here. Read the rules below as hard tests, not aesthetic preferences. They are not optional, not negotiable, and not satisfiable by writing more careful additions.
+**Mandatory pre-edit question.** Before writing any change, you must answer in this order:
+1. **What can I delete that makes the addition unnecessary?** If the addition becomes redundant after the deletion, ship the deletion alone.
+2. **What can I delete that makes the addition smaller?** Trim the surrounding accretion before adding.
+3. **Only then**, what is the minimum addition required?
+If you skip question 1 or 2, you are violating this rule even if the resulting code looks clean.
+**Hard tests every edit must pass:**
+- **Net-negative is the default; pure-addition needs a citation.** A diff that adds N lines and removes 0 must point to a specific cause: a previously-observed failure mode (commit hash, fixture ID, finding ID, user-reported incident), OR an explicit user request / spec requirement that demands new user-visible behavior. The latter is a sufficient citation — do not block legitimate requested additions on the absence of a past failure. What is rejected: vague justifications like "it seems clearer," "for future flexibility," "just in case," "to be safe," "for completeness," "to handle edge cases" — these are the exact phrases that produce accretion.
+- **Delete the line that makes the bug impossible, not the line that catches it.** Defensive wrappers, validation layers, error normalizers, and `try/catch` shells are usually evidence that an upstream contract is unclear. Fix the contract upstream and remove the defenses downstream. The trap: adding the wrapper feels like progress because it makes a test pass. The wrapper is debt; the contract fix is the work. **Scope guard**: if the upstream contract fix is outside the user's stated scope, stop and surface the scope expansion to the user before editing — Goal-locked execution overrides this. The right scope-expansion outcome is "user authorizes the upstream fix" or "user accepts a scoped local fix and a follow-up for upstream"; never silently restructure something the user didn't ask you to.
+- **A new flag, branch, or option is admitting two failures**: (a) the default was wrong, (b) every reader pays attention cost forever. Default-fix-and-delete-flag beats add-flag-with-better-default. The bar for adding a configuration knob is "I have observed two real users with genuinely conflicting needs," not "this might be useful someday."
+- **Doc additions are subject to the same rule.** Before adding a section to any `.md` file (CLAUDE.md, SKILL.md, README, references/), find the now-stale sentence or section the new one supersedes — delete that first. A growing instructions file dilutes the instructions that actually need to be followed; readers (human and LLM) skim long files and miss load-bearing rules.
+- **A "cleaner" refactor that grows line count is not cleaner.** It is a sideways move that increases context, parsing, and review cost. **For refactor-only changes**, line count must drop unless a cited observed failure requires the new shape. **Never delete tests, contracts, public API, comments documenting non-obvious WHY, or user-facing behavior just to win the count** — that is gaming the metric, not honoring the principle. The metric serves complexity reduction; if a deletion would lose information not recoverable from code + commit history, it is the wrong deletion.
+- **Stop adding when no further deletion is possible.** This is the Saint-Exupéry test inverted into a stopping rule: if you have made an addition and you cannot identify anything else that can be removed, examine the addition itself — is part of it still removable? Iterate until the diff is irreducible.
+**Anti-rationalization clause** — explicitly guarding against LLM-style hedging:
+- "More explicit is safer" is **not** a justification. Explicitness has a cost in attention and rot. Required-explicit goes in; nice-to-explicit gets cut.
+- "Adding context for future readers" is **not** a justification. Future readers benefit more from shorter files than from explanatory prose. The code and the commit message together carry the why.
+- "Defense-in-depth" is **not** a justification at the harness layer. Two layers that catch the same bug are evidence one of them should be the only layer.
+- If you find yourself writing the phrase "in case" in a comment, code reviewer note, or doc, **stop and re-evaluate** — that phrase predicts an unjustified addition.
+**Stopping rule.** A change is done when (a) all hypotheses it was meant to close are closed, AND (b) you have attempted at least one further deletion and confirmed it would break something. If you have not tried to delete more, you are not done. If nothing can be deleted to justify the current addition, the addition itself is too large — re-scope or surface the conflict to the user before proceeding.
+**Never grow surface area silently.** Every accretion-shaped change must be visible: in the commit message, in the iteration file, or in a flagged review. Silent growth is the failure mode this rule exists to prevent.
+<!-- runtime-principles:section=subtractive-first:end -->
+### Goal-locked execution — stay on the North Star, do not wander
+<!-- runtime-principles:section=goal-locked:begin -->
+Even with a North Star defined, work drifts off-course ("산으로 간다" / "삼천포로 빠진다" — going up the wrong mountain instead of forward). The harness must **actively block** this drift at run time, not merely discourage it. The default is ruler-straight execution toward the user's stated goal; any deviation requires explicit justification, not the inverse.
+This rule exists because LLMs (including you) are trained to be helpful, comprehensive, and thorough — and "helpful" easily becomes "did more than asked." Doing more than asked is not helpfulness; it is scope creep. Read the rules below as hard blocks, not soft preferences.
+**The five drift patterns you must refuse to execute on:**
+1. **Unrequested work.** "While I'm here, I noticed X is broken/ugly/inefficient" → **stop**. The user did not ask for X. If X is a real defect, surface it as a finding, a follow-up suggestion, or an entry in a TODO list — do NOT fix it inside the current change. Mixing unrequested work with requested work is what makes diffs unreviewable and PRs eternal.
+2. **Tangential cleanup.** "This file looks messy, let me also tidy..." → **stop**. The current task is the only task. Unrelated cleanup is a separate change requiring its own justification, scope, and pre-flight 0 check.
+3. **Speculative robustness.** "Just adding a check / fallback / handler for the case where..." → **stop**. If the case has not been observed (in production, in tests, in a finding), it does not belong in this change. Defensive code added for unobserved cases is the most common form of accretion debt — it never gets removed because nobody can prove the case never happens.
+4. **Re-scoping mid-flight.** "Actually, the better way to do this is to also restructure / rename / migrate..." → **stop**. If you discover the requested approach is wrong, surface that to the user with evidence and let them adjudicate. Do NOT silently expand scope. The user's explicit redirect is the only authorization to enlarge a task.
+5. **Curiosity detours.** "Let me also explore how Y works to understand this better..." → **stop**, unless Y is provably on the goal's critical path. Curiosity-driven exploration is creative-mode; default is execution-mode.
+**The single drift test before any deviation from the stated goal:** *"Did the user ask for this, OR does the user's stated goal strictly require it?"* If the answer to both is no, do not do it. Surface it as a note (commit message, end-of-turn summary, finding) and continue on the original path.
+**Creative-mode is the narrow exception, not the default.** Creative-mode applies only when (a) the user explicitly invoked an ideation/exploration surface (`/devlyn:ideate`, optional `/devlyn:design-system`, "let's brainstorm", "explore options for"), OR (b) the goal is genuinely under-specified and a clarifying question is impossible (extremely rare — usually you should ask). For everything else — bug fixes, feature work, refactors, doc updates, pipeline runs, code review, debugging — execution-mode is the default and drift is a defect, not a feature.
+**Anti-rationalization clause** — explicitly guarding against LLM hedging:
+- "It's a small extra change" is **not** a justification. Small accretions compound; one of them is always small.
+- "It's related to what they asked for" is **not** a justification. Related ≠ requested. Requested is the only standard.
+- "It would be incomplete without this" is **not** a justification. The user defines completeness, not your sense of it.
+- "I'm being thorough" is **not** a justification. Thoroughness on the requested goal is required; thoroughness extending past the goal is drift.
+**When in doubt, ask — outside hands-free pipelines.** In interactive sessions a short clarification ("the requested fix touches the X code path; I notice Y also looks broken — should I fix it in this change or surface it as a follow-up?") is always cheaper than a wrong-scope diff. Asking is not a weakness; silently expanding scope is. **Inside hands-free pipelines** (`/devlyn:resolve`, scheduled remote agents, autonomous skill runs) the contract forbids mid-pipeline prompts — there asking is unsafe because there is no user to answer. The substitute is: stay strictly on the requested goal, do not expand scope, and log the question/assumption explicitly in the final report (or `.devlyn/runs/<run_id>/` artifacts) so the user can adjudicate after the run completes. Choosing scope creep over logging-and-staying-on-path is always wrong.
+**Stopping rule.** A task is done when the user's stated goal is closed AND no off-path work was added. If you find yourself hesitating because "I should also do Z" — Z is drift. Note it for follow-up, do not execute.
+<!-- runtime-principles:section=goal-locked:end -->
 ## Error Handling Philosophy
 **No silent fallbacks.** Handle errors explicitly and show the user what happened.
@@ -18,33 +116,49 @@ Each skill's `SKILL.md` is the source of truth for its flags and workflow — do
 - **Fallbacks are the exception.** Only use them when it's a widely accepted best practice (CSS fallback fonts, CDN failover, image placeholders). Otherwise handle the error explicitly.
 - **Pattern**: `try { doThing() } catch (error) { showErrorUI(error) }` — NOT `try { doThing() } catch { return fallbackValue }`.
-## Investigation Workflow
+### No-workaround discipline (runtime salience)
+<!-- runtime-principles:section=no-workaround:begin -->
+No `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scripts that bypass the root cause. Fix root causes; handle errors with user-visible state per the rule above.
+**Permitted exceptions** (explicitly carved out):
+- CSS fallback fonts, CDN failover, image placeholders — widely-accepted best practices.
+- Codex CLI availability downgrade — the one documented silent fallback in this repo. Fires when the resolved engine is `auto` or `codex` (either via skill default or explicit `--engine` flag) and the Codex CLI is absent. Banner `engine downgraded: codex-unavailable` always prints; verdict identical to `--engine claude`. Any other silent fallback in skills code is a bug — file it against the skill that introduced it.
+<!-- runtime-principles:section=no-workaround:end -->
+### Evidence over claim
+<!-- runtime-principles:section=evidence:begin -->
+Every finding cites concrete evidence. Vague claims are speculation; exclude them.
+- **Code findings**: `file:line` you have opened.
+- **Missing findings**: explicit "searched X and found no implementation" statement.
+- **Doc findings**: quote of the stale text + section/line reference.
+- **Browser findings**: screenshot reference + URL/route.
+A finding without one of these forms is excluded. Vague findings produce vague fixes.
+<!-- runtime-principles:section=evidence:end -->
-When investigating bugs, analyzing features, or exploring code:
+## Codex invocation
-1. **Define exit criteria upfront** — ask "what does 'done' look like?" before starting.
-2. **Checkpoint progress** — use `TaskCreate`/`TaskUpdate` every 5–10 minutes.
-3. **Intermediate summaries** — output "current understanding" snapshots so work isn't lost if interrupted.
-4. **Always deliver findings** — never end mid-analysis. Minimum output: files examined, key findings, remaining unknowns, recommended next steps.
+When `/devlyn:resolve` or `/devlyn:ideate` route a phase to Codex (`--engine codex` or `--engine auto`), the wrapper-form contract lives in `config/skills/_shared/codex-config.md` (or `.claude/skills/_shared/codex-config.md` once installed). Omit `-m <model>` — the CLI's current flagship is used automatically. MCP is not in the loop. If the Codex CLI is absent the harness silently downgrades to Claude and prints `engine downgraded: codex-unavailable` in the final report.
-For complex investigations, use `/devlyn:team-resolve` for a multi-perspective team, or spawn parallel `Agent` subagents.
+## Working Mode
-## Context Window Management
+- **Checkpoint with TaskCreate / TaskUpdate.** Long investigations or multi-phase work: create tasks at start, mark completed as each one closes — don't batch.
+- **Don't stop early on token-budget concerns.** Context auto-compacts; the model resumes after compaction. Run the work to a real stopping point.
+- **Persist across context windows via disk.** `/devlyn:resolve` writes `.devlyn/pipeline.state.json` plus per-phase log/findings under `.devlyn/runs/<run_id>/`; for ad-hoc long work use `HANDOFF.md` and resume with `@HANDOFF.md continue`.
+- **Parallelize independent tool calls; reserve `Agent` subagents for independent fan-out or isolated-context verification** — a single perspective is the default on the resolve hot path.
-Claude 4.5/4.6/4.7 auto-compact as context approaches the limit — you can work indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns; the model resumes from where it left off after compaction.
+## Skill Boundary Policy
-For multi-context-window work (e.g., a large roadmap), persist state to disk:
-- auto-resolve writes durable state to `.devlyn/runs/<run_id>/` (pipeline.state.json, `<phase>.findings.jsonl`, `<phase>.log.md`) plus git commits. Pick up by reading `state.json` first; drill into JSONL/log files as needed.
-- preflight writes `.devlyn/PREFLIGHT-REPORT.md`.
-- For long investigations, write progress to `HANDOFF.md`; resume with `@HANDOFF.md continue`.
+Post iter-0034 Phase 4 cutover (2026-05-04) the runtime product surface is two skills — `/devlyn:resolve` and `/devlyn:ideate`. `/devlyn:resolve` runs PLAN → IMPLEMENT → BUILD_GATE → CLEANUP → VERIFY inline; verification, cleanup, and security review (delegated to the native `security-review` Claude Code skill from BUILD_GATE) all live inside the pipeline. There are no standalone `/devlyn:review`, `/devlyn:evaluate`, `/devlyn:team-resolve`, etc. surfaces to delegate to — those skills were folded into resolve's phases or removed in iter-0034. Three creative power-user skills (`/devlyn:reap`, `/devlyn:design-system`, `/devlyn:team-design-ui`) live in `optional-skills/` and are user-invoked only; resolve never delegates to them.
-Manually `/clear` only when context is genuinely irrelevant to the next task.
+Browser validation routes through `_shared/browser-runner.sh` (Chrome MCP → Playwright → curl tier) directly from BUILD_GATE — there is no separate `/devlyn:browser-validate` skill at HEAD.
 ## Communication Style
-- Lead with **objective data** (popularity, benchmarks, community adoption) before personal opinions.
-- When the user asks "what's popular" or "what do others use", answer with data.
-- Keep recommendations actionable and specific.
+Lead with **objective data** (popularity, benchmarks, community adoption) before opinions — especially when the user asks "what's popular" or "what do others use."
 ## Commit Conventions

package/README.md CHANGED Viewed

@@ -27,124 +27,71 @@ If devlyn-cli saved you time, [give it a star](https://github.com/fysoul17/devly
 npx devlyn-cli
 ```
-That's it. The interactive installer handles everything. Run it again anytime to update.
+That's it. The interactive installer handles everything. Claude Code config is installed by default; optional AI CLI instructions can be selected during install. Choose **Codex CLI (OpenAI)** to install `AGENTS.md` AND `/devlyn:resolve` + `/devlyn:ideate` skills into `~/.codex/skills/` so the same slash commands work inside Codex too. Run it again anytime to update.
 ---
-## How It Works — Three Steps, Full Cycle
+## How It Works — Two Skills, Full Cycle
-devlyn-cli turns Claude Code into an autonomous development pipeline. The core loop is simple:
+devlyn-cli turns Claude Code into a hands-free development pipeline. The product surface is two skills:
 ```
-ideate  →  auto-resolve  →  preflight  →  fix gaps  →  ship
+ideate (optional)  →  resolve  →  ship
 ```
-### Step 1 — Plan with `/devlyn:ideate`
+### Step 1 (optional) — Plan with `/devlyn:ideate`
-Turn a raw idea into structured, implementation-ready specs.
+Turn a raw idea into a verifiable spec — single-feature, multi-feature, or "normalize this external doc".
 ```
 /devlyn:ideate "I want to build a habit tracking app with AI nudges"
 ```
-This produces three documents through interactive brainstorming:
+Default mode produces a `docs/specs/<id>-<slug>/spec.md` plus `spec.expected.json` (mechanical verification block) that `/devlyn:resolve --spec` consumes directly. Modes:
-| Document | What It Contains |
+| Mode | When to use |
 |---|---|
-| `docs/VISION.md` | North star, principles, anti-goals |
-| `docs/ROADMAP.md` | Phased roadmap with links to each spec |
-| `docs/roadmap/phase-N/*.md` | Self-contained spec per feature — ready for auto-resolve |
+| `default` | One feature, AI drives focused Q&A |
+| `--quick` | One-line goal → assume-and-confirm spec, single-turn (autonomous-pipeline-safe) |
+| `--from-spec <path>` | You already wrote a spec; ideate normalizes + lints it |
+| `--project` | Multi-feature project: emits `plan.md` index + N child specs |
-Need to add features later? Run ideate again — it expands the existing roadmap.
+Skip ideate entirely if you have a spec or just want to describe the work — `/devlyn:resolve` accepts free-form goals too.
-### Step 2 — Build with `/devlyn:auto-resolve`
+### Step 2 — Resolve with `/devlyn:resolve`
-Point it at a spec (or just describe what you want) and walk away.
+Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Pass a spec, a free-form goal, or a diff to verify.
 ```
-/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-1/1.1-user-auth.md"
+/devlyn:resolve "fix the login bug"                                # free-form
+/devlyn:resolve --spec docs/specs/2026-05-04-auth/spec.md          # spec mode
+/devlyn:resolve --verify-only <diff-or-PR-ref> --spec <path>       # verify-only
 ```
-It runs a **10-phase pipeline** autonomously:
+Internal phases run sequentially with file-based handoff via `.devlyn/pipeline.state.json`:
 ```
-Build → Build Gate → Browser Test → Evaluate → Fix Loop → Simplify → Review → Security → Clean → Docs
+PLAN  →  IMPLEMENT  →  BUILD_GATE  →  CLEANUP  →  VERIFY (fresh subagent, findings-only)
 ```
-- Each phase runs as a separate agent with fresh context
-- Git checkpoints at every phase for safe rollback
-- **Build Gate** runs your project's real compilers, typecheckers, and linters — catches type errors, cross-package drift, and Docker build failures that tests alone miss. Auto-detects project type (Next.js, Rust, Go, Solidity, Expo, Swift, and more) and Dockerfiles.
-- Browser validation tests your feature end-to-end (clicks, forms, verification)
-- Evaluation grades against done-criteria — if it fails, auto-fix and re-evaluate
+- **PLAN** is the heaviest phase by design — formalizes invariants from the spec/goal and the file list to touch.
+- **BUILD_GATE** runs your project's real compilers, typecheckers, linters, and `python3 .claude/skills/_shared/spec-verify-check.py` (verification commands literal-match). Auto-detects Next.js, Rust, Go, Solidity, Expo, Swift, and Dockerfiles. Browser flows route through Chrome MCP → Playwright → curl tier.
+- **VERIFY** runs in a fresh subagent context with no code-mutation tools — findings only, structurally independent.
+- Git checkpoints at every phase for safe rollback. Fix-loop budget shared across BUILD_GATE and VERIFY (`--max-rounds N`, default 4).
-Skip phases you don't need: `--skip-browser`, `--skip-review`, `--skip-clean`, `--skip-docs`, `--skip-build-gate`, `--max-rounds 6`
-Customize the build gate: `--build-gate strict` (warnings = errors), `--build-gate no-docker` (skip Docker builds for speed)
-Use dual-model routing: `--engine auto` (Codex builds, Claude evaluates — see below)
+Common flags: `--engine claude|codex|auto` (default `claude`), `--bypass build-gate,cleanup`, `--pair-verify` (force pair-mode JUDGE in VERIFY), `--perf` (per-phase timing).
-### Step 3 — Verify with `/devlyn:preflight`
+### Engine selection — Claude solo by default
-After implementing all roadmap items, run a final alignment check:
+`--engine claude` (default) is the canonical surface. Every phase routes to Claude.
-```
-/devlyn:preflight
-```
-Reads every commitment from your vision, roadmap, and item specs, then audits the codebase evidence-based. Catches what you missed:
-| Category | What It Finds |
-|---|---|
-| `MISSING` | In roadmap but not implemented |
-| `INCOMPLETE` | Started but unfinished |
-| `DIVERGENT` | Implemented differently than spec |
-| `BROKEN` | Has a bug preventing it from working |
-| `STALE_DOC` | Docs don't match current code |
-Confirmed gaps become new roadmap items — feed them back into auto-resolve. Use `--autofix` to do this automatically, or `--phase 2` to check only one phase.
-### Bonus — Intelligent Model Routing with `--engine`
-Install the Codex MCP server during setup, then:
+`--engine codex` routes IMPLEMENT to Codex; `--engine auto` opts into the experimental dual-engine routing where applicable. Both are research-only at HEAD: iter-0020 closed Codex BUILD/IMPLEMENT below the quality floor on the 9-fixture suite (L2 vs L1 = −3.6, 3/8 gated fixtures cleared the +5 margin floor — release-readiness FAIL); iter-0033g + iter-0034 closed PLAN-pair as research-only with explicit unblock conditions (container/sandbox infra OR production telemetry capturing positive evidence of subagent introspection). Install the Codex CLI (https://platform.openai.com/docs/codex) and pass the flag explicitly to opt in:
 ```
-/devlyn:auto-resolve "fix the auth bug" --engine auto
+/devlyn:resolve "fix the auth bug" --engine auto   # experimental, research-only
 ```
-**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
-> `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
-Works across the full pipeline:
-```
-/devlyn:auto-resolve "implement feature" --engine auto
-/devlyn:ideate "plan new project" --engine auto
-/devlyn:preflight --engine auto
-```
-<details>
-<summary><strong>How routing works</strong> — A/B tested on 6 roles, 11 integration tests</summary>
-**Pipeline phases** — builder and critic are always different models (GAN dynamic):
-| Phase | Model | Why |
-|---|---|---|
-| Build (implementation) | **Codex GPT-5.4** | SWE-bench Pro +11.7pp for hard coding tasks |
-| Evaluate | **Claude** | Long-context (MRCR +28pp) for full-diff grading |
-| Fix Loop | **Codex GPT-5.4** | Same advantage as Build |
-| Challenge | **Claude** | Fresh skeptical review needs different model family |
-| Browser Validate | **Claude** | Chrome MCP session-bound |
-**Team roles** — each of 21 roles routes to the best model:
-| Engine | Roles | Examples |
-|---|---|---|
-| Claude (11) | Analysis, design, architecture | root-cause-analyst, architecture-reviewer, ux-designer, product-analyst |
-| Codex (4) | Code generation, performance | implementation-planner, test-engineer, performance-engineer |
-| Dual (6) | Both models find unique issues | security-auditor, quality-reviewer, api-designer |
-**Key finding**: Benchmark predictions were only 33% accurate. 4 of 6 A/B-tested roles needed routing changes after real testing — proving that benchmarks alone are insufficient for optimal routing.
-</details>
+If Codex is absent when `--engine auto` or `--engine codex` is requested, the harness silently downgrades to `--engine claude` and emits a banner in the final report.
 <details>
 <summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
@@ -169,7 +116,7 @@ Works across the full pipeline:
 Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
 - **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
-- **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
+- **`--with-codex` consolidated into `--engine auto`** — auto covers BUILD/FIX + team roles + ideate CHALLENGE critic. Legacy flag still accepted with a graceful handoff. *(Note: post iter-0020 close-out, `--engine auto` is experimental research-only; default is `--engine claude`.)*
 - **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
 - **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
@@ -177,47 +124,16 @@ Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against A
 ---
-## Manual Commands
-When you want step-by-step control instead of the full pipeline.
-### Debugging & Resolution
-| Command | Use When |
-|---|---|
-| `/devlyn:resolve` | Simple bugs (1-2 files) |
-| `/devlyn:team-resolve` | Complex issues — spawns root-cause analyst, test engineer, security auditor |
-| `/devlyn:browser-validate` | Test a web feature in a real browser (Chrome MCP → Playwright → curl fallback) |
+## Optional Power-User Skills
-### Code Review & Quality
+Two creative skills have moved to `optional-skills/` — install them via the interactive installer when you need them.
 | Command | Use When |
 |---|---|
-| `/devlyn:review` | Solo review — security, quality, best practices checklist |
-| `/devlyn:team-review` | Multi-reviewer team — security, testing, performance, product perspectives |
-| `/devlyn:evaluate` | Grade work against done-criteria with calibrated skepticism |
-| `/devlyn:clean` | Remove dead code, unused deps, complexity hotspots |
-### UI Design Pipeline
-| Step | Command | What It Does |
-|---|---|---|
-| 1 | `/devlyn:design-ui` | Generate 5 distinct style explorations |
-| 2 | `/devlyn:design-system` | Extract design tokens from chosen style |
-| 3 | `/devlyn:implement-ui` | Team builds it — component architect, UX, accessibility, responsive, visual QA |
+| `/devlyn:design-system` | Extract exact design tokens (colors, type scale, spacing) from a chosen UI style |
+| `/devlyn:team-design-ui` | Multi-perspective design team generates 5 distinct UI style explorations |
-> Use `/devlyn:team-design-ui` for step 1 with a full creative team.
-### Planning & Docs
-| Command | What It Does |
-|---|---|
-| `/devlyn:preflight` | Verify codebase matches vision/roadmap — gap analysis with evidence |
-| `/devlyn:product-spec` | Generate or update product specs |
-| `/devlyn:feature-spec` | Turn product spec → implementable feature spec |
-| `/devlyn:discover-product` | Scan codebase → auto-generate product docs |
-| `/devlyn:recommend-features` | Prioritize top 5 features to build next |
-| `/devlyn:update-docs` | Sync all docs with current codebase |
+> Earlier versions of devlyn-cli shipped 16+ skills (auto-resolve / preflight / evaluate / review / team-review / clean / update-docs / browser-validate / product-spec / feature-spec / recommend-features / discover-product / design-ui / implement-ui). These were consolidated into `/devlyn:resolve` (which folds verification, review, and cleanup into its phases) plus `/devlyn:ideate` (which absorbs the planning surfaces) in the iter-0034 Phase 4 cutover (2026-05-04). Upgrades automatically remove the legacy skill directories from `~/.claude/skills/`.
 ---
@@ -231,7 +147,6 @@ These activate automatically — no commands needed. They shape how Claude think
 | `code-review-standards` | Reviews — severity framework, approval criteria |
 | `ui-implementation-standards` | UI work — design fidelity, accessibility, responsiveness |
 | `code-health-standards` | Maintenance — dead code prevention, complexity thresholds |
-| `workflow-routing` | Any task — guides you to the right command |
 ---
@@ -253,6 +168,9 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 | `dokkit` | Document template filling for DOCX/HWPX |
 | `devlyn:pencil-pull` | Pull Pencil designs into code |
 | `devlyn:pencil-push` | Push codebase UI to Pencil canvas |
+| `devlyn:reap` | Safely reap orphaned MCP / codex / Superset child processes |
+| `devlyn:design-system` | Extract design tokens from a chosen UI style for exact reproduction |
+| `devlyn:team-design-ui` | 5 distinct UI style explorations from a full design team |
 </details>
@@ -274,8 +192,9 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 | Server | Description |
 |---|---|
-| `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing |
-| `playwright` | Playwright MCP — powers browser-validate Tier 2 |
+| `playwright` | Playwright MCP — powers `/devlyn:resolve` BUILD_GATE browser tier (Chrome MCP → Playwright → curl fallback) |
+> `--engine auto/codex` uses the local `codex` CLI binary, not MCP. Install from https://platform.openai.com/docs/codex; the harness silently downgrades to `--engine claude` if the CLI is missing.
 </details>
@@ -290,9 +209,8 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 ## Contributing
-- **Add a command** — `.md` file in `config/commands/`
 - **Add a skill** — directory in `config/skills/` with `SKILL.md`
-- **Add optional skill** — add to `optional-skills/` and `OPTIONAL_ADDONS`
+- **Add optional skill** — add to `optional-skills/` and `OPTIONAL_ADDONS` in [`bin/devlyn.js`](bin/devlyn.js)
 - **Suggest a pack** — PR to the pack list
 ## Star History