npm - devlyn-cli - Versions diffs - 1.11.0 → 1.12.1 - Mend

devlyn-cli 1.11.0 → 1.12.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12) hide show

package/CLAUDE.md +3 -1
package/README.md +52 -4
package/bin/devlyn.js +24 -0
package/config/skills/devlyn:auto-resolve/SKILL.md +28 -9
package/config/skills/devlyn:auto-resolve/references/codex-integration.md +6 -3
package/config/skills/devlyn:auto-resolve/references/engine-routing.md +205 -0
package/config/skills/devlyn:ideate/SKILL.md +20 -4
package/config/skills/devlyn:ideate/references/codex-debate.md +6 -3
package/config/skills/devlyn:preflight/SKILL.md +9 -0
package/config/skills/devlyn:team-resolve/SKILL.md +20 -1
package/config/skills/devlyn:team-review/SKILL.md +20 -1
package/package.json +1 -1

package/CLAUDE.md CHANGED Viewed

@@ -72,7 +72,8 @@ Optional flags:
 - `--skip-review` — skip team-review phase
 - `--skip-clean` — skip clean phase
 - `--skip-docs` — skip update-docs phase
-- `--with-codex [evaluate|review|both]` — use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
+- `--engine auto|codex|claude` — intelligent model routing. `auto` routes each phase and team role to the optimal model (Claude or Codex GPT-5.4) based on benchmark data. `codex` forces Codex for implementation, Claude for evaluation. `claude` (default) uses Claude for everything. Requires codex-mcp-server.
+- `--with-codex [evaluate|review|both]` — (legacy, superseded by `--engine`) use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
 ## Preflight Check (Post-Roadmap Verification)
@@ -91,6 +92,7 @@ Optional flags:
 - `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
 - `--skip-browser` — skip browser validation
 - `--skip-docs` — skip documentation audit
+- `--engine auto|codex|claude` — route code-auditor to Codex (better at code analysis), docs/browser to Claude
 **Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)

package/README.md CHANGED Viewed

@@ -79,6 +79,7 @@ Build → Build Gate → Browser Test → Evaluate → Fix Loop → Simplify →
 Skip phases you don't need: `--skip-browser`, `--skip-review`, `--skip-clean`, `--skip-docs`, `--skip-build-gate`, `--max-rounds 6`
 Customize the build gate: `--build-gate strict` (warnings = errors), `--build-gate no-docker` (skip Docker builds for speed)
+Use dual-model routing: `--engine auto` (Codex builds, Claude evaluates — see below)
 ### Step 3 — Verify with `/devlyn:preflight`
@@ -100,18 +101,64 @@ Reads every commitment from your vision, roadmap, and item specs, then audits th
 Confirmed gaps become new roadmap items — feed them back into auto-resolve. Use `--autofix` to do this automatically, or `--phase 2` to check only one phase.
-### Bonus — Dual-Model Mode with Codex
+### Bonus — Intelligent Model Routing with `--engine`
 Install the Codex MCP server during setup, then:
 ```
-/devlyn:auto-resolve "fix the auth bug" --with-codex
+/devlyn:auto-resolve "fix the auth bug" --engine auto
+```
+**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.6 or GPT-5.4) — validated through A/B testing, not just benchmarks.
+> `--engine auto` (recommended) · `--engine codex` (force Codex for build) · `--engine claude` (default, Claude only)
+Works across the full pipeline:
+```
+/devlyn:auto-resolve "implement feature" --engine auto
+/devlyn:ideate "plan new project" --engine auto
+/devlyn:preflight --engine auto
 ```
-Claude builds, **OpenAI Codex evaluates independently** — two models collaborating, catching what a single model misses.
+<details>
+<summary><strong>How routing works</strong> — A/B tested on 6 roles, 11 integration tests</summary>
+**Pipeline phases** — builder and critic are always different models (GAN dynamic):
+| Phase | Model | Why |
+|---|---|---|
+| Build (implementation) | **Codex GPT-5.4** | SWE-bench Pro +11.7pp for hard coding tasks |
+| Evaluate | **Claude** | Long-context (MRCR +28pp) for full-diff grading |
+| Fix Loop | **Codex GPT-5.4** | Same advantage as Build |
+| Challenge | **Claude** | Fresh skeptical review needs different model family |
+| Browser Validate | **Claude** | Chrome MCP session-bound |
+**Team roles** — each of 21 roles routes to the best model:
+| Engine | Roles | Examples |
+|---|---|---|
+| Claude (11) | Analysis, design, architecture | root-cause-analyst, architecture-reviewer, ux-designer, product-analyst |
+| Codex (4) | Code generation, performance | implementation-planner, test-engineer, performance-engineer |
+| Dual (6) | Both models find unique issues | security-auditor, quality-reviewer, api-designer |
+**Key finding**: Benchmark predictions were only 33% accurate. 4 of 6 A/B-tested roles needed routing changes after real testing — proving that benchmarks alone are insufficient for optimal routing.
+</details>
+<details>
+<summary>Legacy: <code>--with-codex</code> (superseded by <code>--engine</code>)</summary>
+```
+/devlyn:auto-resolve "fix the auth bug" --with-codex
+```
 > `--with-codex evaluate` (default) · `--with-codex review` · `--with-codex both`
+`--engine auto` subsumes `--with-codex both` with broader coverage — Codex is used for build, fix, and 4 team roles, not just evaluate/review.
+</details>
 ---
 ## Manual Commands
@@ -181,6 +228,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 | Skill | Description |
 |---|---|
+| `asset-creator` | AI pixel art game asset pipeline — generate, chroma-key, catalog |
 | `cloudflare-nextjs-setup` | Cloudflare Workers + Next.js with OpenNext |
 | `generate-skill` | Create Claude Code skills following Anthropic best practices |
 | `prompt-engineering` | Claude 4 prompt optimization |
@@ -210,7 +258,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 | Server | Description |
 |---|---|
-| `codex-cli` | Codex MCP server — enables `--with-codex` dual-model mode |
+| `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing and legacy `--with-codex` mode |
 | `playwright` | Playwright MCP — powers browser-validate Tier 2 |
 </details>

package/bin/devlyn.js CHANGED Viewed

@@ -2,6 +2,7 @@
 const fs = require('fs');
 const path = require('path');
+const os = require('os');
 const readline = require('readline');
 const { execSync } = require('child_process');
@@ -581,6 +582,29 @@ async function init(skipPrompts = false) {
     log('  → settings.json (agent teams + pipeline permissions)', 'dim');
   }
+  // Configure global Claude Code settings (~/.claude/settings.json)
+  const globalClaudeDir = path.join(os.homedir(), '.claude');
+  const globalSettingsPath = path.join(globalClaudeDir, 'settings.json');
+  let globalSettings = {};
+  if (fs.existsSync(globalSettingsPath)) {
+    try {
+      globalSettings = JSON.parse(fs.readFileSync(globalSettingsPath, 'utf8'));
+    } catch {
+      globalSettings = {};
+    }
+  }
+  if (!globalSettings.env) globalSettings.env = {};
+  let globalSettingsChanged = false;
+  if (!globalSettings.env.CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING) {
+    globalSettings.env.CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING = '1';
+    globalSettingsChanged = true;
+  }
+  if (globalSettingsChanged) {
+    if (!fs.existsSync(globalClaudeDir)) fs.mkdirSync(globalClaudeDir, { recursive: true });
+    fs.writeFileSync(globalSettingsPath, JSON.stringify(globalSettings, null, 2) + '\n');
+    log('  → ~/.claude/settings.json (disabled adaptive thinking)', 'dim');
+  }
   // Install agents for other detected CLIs
   const detected = detectOtherCLIs();
   if (detected.length > 0) {

package/config/skills/devlyn:auto-resolve/SKILL.md CHANGED Viewed

@@ -33,28 +33,36 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
    - `--skip-docs` (false) — skip update-docs phase
    - `--skip-build-gate` (false) — skip the deterministic build gate (Phase 1.4). Not recommended — the build gate is the primary defense against "tests pass locally, breaks in CI/Docker/production" class of bugs.
    - `--build-gate MODE` (auto) — controls build gate behavior. `auto`: detect project type and run appropriate build/typecheck/lint commands; if Dockerfile(s) are present, Docker builds are included automatically. `strict`: auto + treat warnings as errors. `no-docker`: auto but skip Docker builds even if Dockerfiles exist (for faster iteration). `skip`: same as --skip-build-gate.
-   - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques.
+   - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques. **Ignored if `--engine` is set** (engine routing subsumes this).
+   - `--engine MODE` (claude) — controls which model handles each pipeline phase and team role. Modes:
+     - `claude` (default): all phases use Claude subagents. Current behavior, no Codex calls.
+     - `codex`: Codex handles implementation/analysis phases, Claude handles orchestration, evaluation, and Chrome MCP.
+     - `auto`: each phase and team role routes to the optimal model based on benchmark data. Recommended when Codex MCP server is available. Subsumes `--with-codex both`.
    Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
-   Codex examples: `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
+   Engine examples: `--engine auto`, `--engine codex`, `--engine claude`
+   Codex examples (legacy): `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
    If no flags are present, use defaults.
-3. **If `--with-codex` is enabled**: Read `references/codex-integration.md` and run the "PRE-FLIGHT CHECK" section to verify Codex MCP server availability before proceeding.
+3. **If `--engine` is `auto` or `codex`**: Read `references/engine-routing.md` for the full routing table. Then call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude` (fallback), [2] Abort. If `--engine` is not set but `--with-codex` is enabled, read `references/codex-integration.md` instead and run its pre-flight check.
 4. Announce the pipeline plan:
 ```
 Auto-resolve pipeline starting
 Task: [extracted task description]
+Engine: [auto / codex / claude]
 Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
 Max evaluation rounds: [N]
-Cross-model evaluation (Codex): [evaluate / review / both / disabled]
+Cross-model evaluation (Codex): [evaluate / review / both / disabled / subsumed by --engine]
 ```
 ## PHASE 1: BUILD
+**Engine routing**: If `--engine` is `auto` or `codex`, read `references/engine-routing.md` "How to Spawn a Codex BUILD/FIX Agent" section. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the full agent prompt below as the `prompt` parameter. If `--engine` is `claude` (default), spawn a Claude subagent as described below.
 Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to investigate and implement the fix. The subagent does NOT have access to skills, so include all necessary instructions inline.
-Agent prompt — pass this to the Agent tool:
+Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 Investigate and implement the following task. Work through these phases in order:
@@ -74,6 +82,8 @@ Read relevant files in parallel. Build a clear picture of what exists and what n
 - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
 Each teammate investigates from their perspective and sends findings back.
+**Engine routing for teammates**: If the orchestrator's `--engine` is `auto` or `codex`, read `references/engine-routing.md` for per-role routing. Roles marked **Codex** are called via `mcp__codex-cli__codex` instead of spawning Agent teammates — include the full role prompt and issue context inline. Roles marked **Claude** use normal Agent teammates. Roles marked **Dual** run both in parallel and merge findings. The orchestrator relays Codex role outputs to Claude teammates that need them.
 **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
 **Phase E — Update done criteria**: Mark each criterion in `.devlyn/done-criteria.md` as satisfied. Run the full test suite.
@@ -127,9 +137,11 @@ Triggered only when PHASE 1.4 returns FAIL.
 Track a round counter (shared with the main fix loop counter against `max-rounds`). If `round >= max-rounds`, stop with a clear failure report — do NOT continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
+**Engine routing**: Same as PHASE 2.5 FIX LOOP — if `--engine` is `auto` or `codex`, use `mcp__codex-cli__codex` with `workspace-write` and `fullAuto: true`. If `--engine` is `claude`, spawn a Claude subagent.
 Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
-Agent prompt — pass this to the Agent tool:
+Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 Read `.devlyn/BUILD-GATE.md` — it contains deterministic build/typecheck/lint failures from real compiler output. These are not opinions; the compiler rejected this code. Fix every listed failure at the root cause level.
@@ -220,7 +232,8 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
 **After the agent completes**:
 1. Read `.devlyn/EVAL-FINDINGS.md`
 2. Extract the verdict
-3. **If `--with-codex` includes `evaluate` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
+3. **If `--engine` is `auto` or `codex`**: The evaluate phase always uses Claude (see `references/engine-routing.md`). When `--engine auto`, the builder was Codex — Claude evaluating Codex's work creates the GAN dynamic automatically. No separate Codex evaluation pass is needed.
+   **If `--engine` is not set and `--with-codex` includes `evaluate` or `both`** (legacy): Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
 4. Branch on verdict (from the merged findings if Codex was used):
    - `PASS` → skip to PHASE 3
    - `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
@@ -232,9 +245,11 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
 Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
+**Engine routing**: If `--engine` is `auto` or `codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the fix prompt below. Use a fresh call each round (no sessionId reuse — sandbox/fullAuto only apply on first call of a session). If `--engine` is `claude`, spawn a Claude subagent as below.
 Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to fix the evaluation findings.
-Agent prompt — pass this to the Agent tool:
+Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the evaluator returns PASS — there is no "shippable with issues" shortcut.
@@ -268,11 +283,14 @@ Agent prompt — pass this to the Agent tool:
 Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes.
+**Engine routing for reviewers**: If the orchestrator passed `--engine auto` or `--engine codex`, read `references/engine-routing.md` for per-role routing in the "team-review roles" table. Route each reviewer to Claude Agent or `mcp__codex-cli__codex` accordingly. For Dual roles (security-reviewer), run both models in parallel and merge findings per the "How to Spawn a Dual Role" section. For `--engine claude`, all reviewers are Claude Agent teammates.
 Each reviewer evaluates from their perspective, sends findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW). After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
 Clean up the team after completion.
-**If `--with-codex` includes `review` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
+**If `--engine` is set**: engine routing already handles cross-model review via per-role routing — skip the legacy `--with-codex` review step below.
+**If `--with-codex` includes `review` or `both`** (legacy, only when `--engine` is not set): Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
 **After the review phase completes**:
 1. If CRITICAL issues remain unfixed, log a warning in the final report
@@ -403,6 +421,7 @@ After all phases complete:
 ### Auto-Resolve Pipeline Complete
 **Task**: [original task description]
+**Engine**: [auto / codex / claude — if auto, note which phases used which model]
 **Pipeline Summary**:
 | Phase | Status | Notes |

package/config/skills/devlyn:auto-resolve/references/codex-integration.md CHANGED Viewed

@@ -1,6 +1,8 @@
-# Codex Cross-Model Integration
+# Codex Cross-Model Integration (Legacy)
-Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline. Only read this file when `--with-codex` is enabled.
+> **Note**: This file is the legacy `--with-codex` integration. For the newer `--engine` flag (which subsumes `--with-codex`), see `references/engine-routing.md`. Only read this file when `--with-codex` is enabled AND `--engine` is NOT set.
+Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline.
 Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). This creates a GAN-like adversarial dynamic — Claude builds and Codex critiques, reducing shared blind spots between model families.
@@ -37,7 +39,8 @@ Call `mcp__codex-cli__codex` with:
 - `prompt`: Include the full content of `.devlyn/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
 - `workingDirectory`: the project root
 - `sandbox`: `"read-only"` (Codex should only read, not modify files)
-- `reasoningEffort`: `"high"`
+- `reasoningEffort`: `"high"` (note: for `--engine auto`, the engine-routing.md uses `"xhigh"` by default)
+- `model`: `"gpt-5.4"` (pass explicitly — the MCP schema default may be outdated)
 Example prompt to pass:
 ```

package/config/skills/devlyn:auto-resolve/references/engine-routing.md ADDED Viewed

@@ -0,0 +1,205 @@
+# Engine Routing: Intelligent Model Selection
+Instructions for routing work to the optimal model (Claude or Codex) per role and phase. Only read this file when `--engine` is set to `auto` or `codex`.
+The routing table below is derived from published benchmarks (April 2026) comparing Claude Opus 4.6 and GPT-5.4 across task-relevant dimensions. The principle: each role's work goes to the model that objectively performs better at that task type.
+---
+## Benchmark Basis
+| Dimension | Claude Opus 4.6 | GPT-5.4 | Gap | Source |
+|-----------|-----------------|---------|-----|--------|
+| Long-context retrieval (256k) | 92% | ~64% | Claude +28pp | MRCR v2 |
+| Graduate-level reasoning | 87.4% | 83.9% | Claude +3.5pp | GPQA Diamond |
+| Hard coding problems | ~46% | 57.7% | Codex +11.7pp | SWE-bench Pro |
+| Function-level code gen | 90.4% | 93.1% | Codex +2.7pp | HumanEval |
+| Terminal/CLI tasks | 65.4% | 75.1% | Codex +9.7pp | Terminal-Bench 2.0 |
+| Real-world issue resolution | ~80% | ~80% | Tied | SWE-bench Verified |
+| Security vulnerability detection | — | — | Tied | Semgrep 2025 study |
+| Agentic computer use | 72.7% | 75.0% | Codex +2.3pp | OSWorld |
+| Ambiguous intent handling | Preferred by 70% devs | — | Claude | Developer surveys |
+---
+## Codex Call Defaults
+Every Codex call in this file uses these defaults unless stated otherwise:
+```
+model: "gpt-5.4"
+reasoningEffort: "xhigh"
+sandbox: varies per role (see table)
+workingDirectory: project root
+```
+The `model` field accepts any string — pass `"gpt-5.4"` even if the MCP schema lists older defaults. The Codex CLI resolves it.
+---
+## Role Routing Table
+### team-resolve roles
+| Role | Engine | Sandbox | Rationale |
+|------|--------|---------|-----------|
+| root-cause-analyst | **Claude** | — | A/B test: Claude traced git history (15 tool calls) finding exact commit + unchecked migration plan. Codex analyzed structure well but lacked git history depth. Tool access > SWE-bench Pro advantage for this role. |
+| test-engineer | **Codex** | workspace-write | Test code generation = HumanEval (+2.7pp), needs file write |
+| security-auditor | **Dual** | read-only | Semgrep: both find unique vulns; GAN > single model |
+| implementation-planner | **Codex** | read-only | Implementation planning = SWE-bench Pro (+11.7pp) |
+| product-designer | **Claude** | — | Ambiguous requirements, user intent = Claude strength |
+| ui-designer | **Claude** | — | Visual spec, design reasoning = non-coding task |
+| ux-designer | **Claude** | — | User flow analysis = ambiguous intent handling |
+| accessibility-auditor | **Claude** | — | A/B test: Claude found 12 issues (1 CRITICAL) vs Codex 4. WCAG auditing requires thoroughness and domain knowledge depth, not code generation speed. Claude 3x coverage. |
+| product-analyst | **Claude** | — | Requirements clarity, scope judgment = ambiguity handling |
+| architecture-reviewer | **Claude** | — | Codebase-wide pattern review = MRCR long-context (+28pp) |
+| performance-engineer | **Codex** | read-only | Terminal tasks + algorithm analysis = Terminal-Bench (+9.7pp) |
+| api-designer | **Dual** | read-only | A/B test: Claude found 9 issues, Codex found 6, with unique findings on both sides (Claude: --version, exit codes; Codex: YAML folded scalar parsing bug). Dual maximizes coverage for API surface review. |
+### team-review roles
+| Role | Engine | Sandbox | Rationale |
+|------|--------|---------|-----------|
+| security-reviewer | **Dual** | read-only | Same as team-resolve security-auditor |
+| quality-reviewer | **Dual** | read-only | A/B test: Claude found 14 issues (2 HIGH), Codex found 11 (3 HIGH), only ~6 overlap. Dual yields ~19 unique findings (+36-73% coverage). Both models find HIGH-severity issues the other misses. |
+| test-analyst | **Codex** | workspace-write | Test gap analysis + test code suggestions |
+| ux-reviewer | **Claude** | — | UX flow assessment = ambiguity handling |
+| ui-reviewer | **Claude** | — | Design token consistency = non-coding task |
+| accessibility-reviewer | **Claude** | — | Same rationale as team-resolve accessibility-auditor: Claude 3x finding coverage on WCAG audits |
+| product-validator | **Claude** | — | Business logic intent = ambiguity handling |
+| api-reviewer | **Dual** | read-only | Same rationale as team-resolve api-designer: both models find unique API issues |
+| performance-reviewer | **Codex** | read-only | Algorithm complexity = Terminal-Bench (+9.7pp) |
+### Summary distribution
+| Engine | team-resolve (12) | team-review (9) | Total |
+|--------|-------------------|-----------------|-------|
+| Claude | 7 | 4 | 11 |
+| Codex | 2 | 2 | 4 |
+| Dual | 3 | 3 | 6 |
+---
+## Pipeline Phase Routing (auto-resolve)
+| Phase | --engine auto | --engine codex | --engine claude |
+|-------|--------------|----------------|-----------------|
+| BUILD (implementation) | **Codex** | Codex | Claude |
+| BUILD GATE | bash (model-agnostic) | bash | bash |
+| BROWSER VALIDATE | Claude (Chrome MCP only) | Claude | Claude |
+| EVALUATE | **Claude** | Claude | Claude |
+| FIX LOOP | **Codex** | Codex | Claude |
+| SIMPLIFY | Claude | Codex | Claude |
+| REVIEW (team) | **Mixed per table** | Codex all | Claude all |
+| CHALLENGE | **Claude** | Claude | Claude |
+| SECURITY REVIEW | **Dual** | Codex | Claude |
+| CLEAN | Claude | Codex | Claude |
+| DOCS | Claude | Codex | Claude |
+Rationale for `--engine auto` choices:
+- BUILD/FIX: Codex — SWE-bench Pro 57.7% vs 46%. The biggest model gap is in hard coding tasks.
+- EVALUATE/CHALLENGE: Claude — evaluating a full diff requires long-context retrieval (MRCR +28pp) and skeptical reasoning (GPQA +3.5pp). Different model family from builder creates GAN dynamic.
+- BROWSER: Claude — Chrome MCP tools are Claude Code session-bound.
+- SECURITY: Dual — Semgrep study shows both models find unique vulnerabilities.
+---
+## Pipeline Phase Routing (ideate)
+| Phase | --engine auto | --engine codex | --engine claude |
+|-------|--------------|----------------|-----------------|
+| FRAME | **Claude** | Codex | Claude |
+| EXPLORE | **Claude** | Codex | Claude |
+| CONVERGE | **Claude** | Codex | Claude |
+| CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
+| DOCUMENT | **Claude** | Codex | Claude |
+Rationale:
+- FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
+- CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic (same role as `--with-codex` but automatic). When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
+- DOCUMENT: Claude — writing quality for spec generation.
+---
+## Pipeline Phase Routing (preflight)
+| Phase | --engine auto | --engine codex | --engine claude |
+|-------|--------------|----------------|-----------------|
+| EXTRACT COMMITMENTS | Claude | Codex | Claude |
+| CODE AUDIT | **Codex** | Codex | Claude |
+| DOCS AUDIT | **Claude** | Codex | Claude |
+| BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
+| SYNTHESIZE | Claude | Claude | Claude |
+---
+## How to Spawn a Codex Role
+For roles marked **Codex** in the routing table, call `mcp__codex-cli__codex` instead of spawning a Claude Agent subagent. Package the role's full prompt (from the skill's teammate prompt section) into the Codex call.
+Template:
+```
+mcp__codex-cli__codex({
+  prompt: "[full role prompt with issue context, file paths, and deliverable format]",
+  model: "gpt-5.4",
+  reasoningEffort: "xhigh",
+  sandbox: "[read-only or workspace-write per table]",
+  workingDirectory: "[project root]"
+})
+```
+**Important**: Codex has no access to team infrastructure (TeamCreate, SendMessage, TaskCreate). For Codex roles:
+- Include ALL context inline in the prompt (issue description, file paths from investigation, deliverable format)
+- The orchestrator collects Codex's response and routes it where it would have gone via SendMessage
+- Codex roles cannot communicate with other teammates directly — the orchestrator relays findings
+For roles marked **Claude**, spawn a normal Agent subagent as before.
+---
+## How to Spawn a Dual Role
+For roles marked **Dual**, run BOTH models in parallel and merge findings:
+1. Spawn a Claude Agent subagent with the role's prompt
+2. Call `mcp__codex-cli__codex` with the same role's prompt (sandbox: "read-only")
+3. Wait for both to complete
+4. Merge findings:
+   - Same finding from both → keep more detailed description, mark "confirmed by both models"
+   - Claude-only → keep as-is
+   - Codex-only → prefix with `[codex]`
+   - Conflicting findings → keep both, note the disagreement
+   - Take the MORE SEVERE verdict between the two
+---
+## How to Spawn a Codex BUILD/FIX Agent
+For BUILD and FIX LOOP phases when engine routes to Codex:
+```
+mcp__codex-cli__codex({
+  prompt: "[full build/fix prompt with task description, done criteria, and implementation instructions]",
+  model: "gpt-5.4",
+  reasoningEffort: "xhigh",
+  sandbox: "workspace-write",
+  fullAuto: true,
+  workingDirectory: "[project root]"
+})
+```
+**After Codex completes**: verify changes were made (`git diff --stat`), then proceed to the next phase as normal. The file-based handoff (`.devlyn/done-criteria.md`, `.devlyn/EVAL-FINDINGS.md`, etc.) works identically — Codex writes the same files Claude would.
+**Session management**: For FIX LOOP iterations, use a fresh call each time (no `sessionId` reuse) because sandbox/fullAuto parameters only apply on the first call of a session.
+---
+## Override Behavior
+- `--engine claude` → all roles and phases use Claude (current default behavior, no Codex calls)
+- `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
+- `--engine auto` → each role and phase routes to the optimal model per this table
+- `--engine auto` is the recommended default when Codex MCP server is available
+`--engine` and `--with-codex` are **mutually exclusive**. `--engine auto` subsumes `--with-codex both` — it uses Codex where it's optimal (broader than just evaluate/review). If both flags are passed, `--engine` takes precedence and `--with-codex` is ignored with a warning.

package/config/skills/devlyn:ideate/SKILL.md CHANGED Viewed

@@ -24,9 +24,15 @@ Concretely:
 Parse these from the user's invocation message:
-- `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire reason for the flag is maximum reasoning from a second model family.
+- `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire reason for the flag is maximum reasoning from a second model family. **Ignored if `--engine` is set** (engine routing subsumes this).
+- `--engine MODE` (claude) — controls which model handles each ideation phase. Modes:
+  - `claude` (default): all phases use Claude. Current behavior.
+  - `codex`: Codex handles FRAME/EXPLORE/CONVERGE/DOCUMENT, Claude runs CHALLENGE (role reversal — builder and critic are always different models).
+  - `auto`: Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic). Subsumes `--with-codex`. Recommended when Codex MCP is available.
-**If `--with-codex` is set**: read `references/challenge-rubric.md` and `references/codex-debate.md` up front, then run the pre-flight check described in `codex-debate.md` to verify the Codex MCP server is available before starting the pipeline. If the server is unavailable and the user opts to continue without Codex, the solo CHALLENGE pass still runs — only the cross-model rubric pass is disabled.
+**If `--engine` is `auto` or `codex`**: call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort. Also read `references/challenge-rubric.md` up front. The engine routing table is defined in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)".
+**If `--engine` is not set and `--with-codex` is set** (legacy): read `references/challenge-rubric.md` and `references/codex-debate.md` up front, then run the pre-flight check described in `codex-debate.md` to verify the Codex MCP server is available before starting the pipeline. If the server is unavailable and the user opts to continue without Codex, the solo CHALLENGE pass still runs — only the cross-model rubric pass is disabled.
 <why_this_matters>
 When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
@@ -307,9 +313,13 @@ For Quick Add with one new item, one solo pass is enough. For a full greenfield
 If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
-### Codex pass (only if `--with-codex` is set)
+### Codex pass (engine-routed or legacy `--with-codex`)
+**If `--engine auto`**: Codex runs the CHALLENGE rubric pass automatically. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the packaged plan + rubric as prompt (same format as `codex-debate.md` Step 2). Reconcile findings: same finding from both → "confirmed by both", Codex-only → prefix `[codex]`.
+**If `--engine codex`**: Role reversal — Codex built the plan (FRAME/EXPLORE/CONVERGE/DOCUMENT), so Claude runs the solo CHALLENGE pass. Do NOT also run Codex on CHALLENGE — builder and critic must be different models. Skip this section entirely.
-If the flag is set you have already loaded `references/codex-debate.md` during argument parsing — follow its "PHASE 3.5-CODEX" section now. Codex applies the rubric from `challenge-rubric.md` independently at `reasoningEffort: "xhigh"`. Reconcile findings as `codex-debate.md` describes — findings raised by both sides get "confirmed by both", Codex-only findings get prefixed `[codex]` in internal notes so the user can see where each push came from.
+**If `--engine claude` or `--engine` not set, and `--with-codex` is set** (legacy): follow `references/codex-debate.md` "PHASE 3.5-CODEX" section. Codex applies the rubric from `challenge-rubric.md` independently at `reasoningEffort: "xhigh"`. Reconcile findings as `codex-debate.md` describes — findings raised by both sides get "confirmed by both", Codex-only findings get prefixed `[codex]` in internal notes so the user can see where each push came from.
 ### Respect explicit user intent
@@ -346,6 +356,12 @@ Get explicit confirmation before proceeding to DOCUMENT.
 For single-item additions, run one solo rubric pass on just the new item. Even then do not skip — single-item additions are exactly where overengineering and workarounds slip in unnoticed, because the lack of surrounding context makes a bad item look self-contained and harmless.
+## Engine Routing for FRAME / EXPLORE / CONVERGE / DOCUMENT
+**If `--engine codex`**: Phases 1-3 and Phase 4 are delegated to Codex. For each phase, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, and the phase instructions + user context as the prompt. Use `sessionId` to maintain conversational context across phases (note: sandbox/fullAuto only apply on the first call). Claude remains the orchestrator — it reads Codex's output, manages the conversation with the user (confirmation prompts, clarifying questions), and routes findings between phases.
+**If `--engine auto` or `--engine claude`**: All planning phases use Claude directly (current behavior). Claude's ambiguous intent handling and writing quality benchmarks favor it for planning tasks.
 ## Phase 4: DOCUMENT
 <phase_goal>Generate the three-layer document set.</phase_goal>

package/config/skills/devlyn:ideate/references/codex-debate.md CHANGED Viewed

@@ -1,15 +1,17 @@
-# Codex Cross-Model Rubric Pass
+# Codex Cross-Model Rubric Pass (Legacy)
+> **Note**: This file is the legacy `--with-codex` integration for ideate. For the newer `--engine` flag (which subsumes `--with-codex`), see the engine routing section in SKILL.md. Only read this file when `--with-codex` is set AND `--engine` is NOT set.
 ## Contents
 - Pre-flight check (verify Codex MCP server availability)
 - PHASE 3.5-CODEX: packaging the plan, calling Codex, reconciling findings with the solo pass
 - Cost notes (one Codex call per ideation session)
-Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE. Only read this file when `--with-codex` is set. The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
+Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE.  The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
 Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). The intent: one opinionated rubric pass from a different model family, applied right before the user sees the plan. Two model families catch different blind spots; one pass at maximum effort catches more than multiple shallow passes.
-**Always use `reasoningEffort: "xhigh"` and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model.
+**Always use `model: "gpt-5.4"`, `reasoningEffort: "xhigh"` and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model. Pass `model: "gpt-5.4"` explicitly as the MCP schema default may be outdated.
 ---
@@ -69,6 +71,7 @@ Call `mcp__codex-cli__codex` with:
 - `prompt`: the packaged context above, followed by the instructions below
 - `workingDirectory`: the project root
 - `sandbox`: `"read-only"`
+- `model`: `"gpt-5.4"` — pass explicitly; the MCP schema default may still show `gpt-5.3-codex`
 - `reasoningEffort`: `"xhigh"` — the highest setting in the Codex enum (`none < minimal < low < medium < high < xhigh`). Always pick the top level; this is the entire reason for the flag.
 Instructions to append to the packaged context. **Before sending, inline the full text of `references/challenge-rubric.md` into the prompt under a `## Rubric` heading** — Codex does not have filesystem access to this project, so Claude must ship the rubric itself. Claude already has the rubric loaded from Phase 3.5 setup.

package/config/skills/devlyn:preflight/SKILL.md CHANGED Viewed

@@ -44,8 +44,15 @@ Parse from `<preflight_config>`:
 - `--autofix` — auto-promote all findings to roadmap items and run auto-resolve on each
 - `--skip-browser` — skip browser validation
 - `--skip-docs` — skip documentation audit
+- `--engine MODE` (claude) — controls which model handles audit phases. Modes:
+  - `claude` (default): all auditors use Claude subagents.
+  - `codex`: code-auditor uses Codex, docs-auditor and browser-auditor use Claude.
+  - `auto`: code-auditor uses Codex (SWE-bench Pro +11.7pp for code analysis), docs-auditor uses Claude (writing quality), browser-auditor uses Claude (Chrome MCP). Recommended when Codex MCP is available.
 Example: `/devlyn:preflight --phase 2 --skip-browser`
+Example with engine: `/devlyn:preflight --engine auto`
+**If `--engine` is `auto` or `codex`**: call `mcp__codex-cli__ping` to verify Codex MCP availability. If ping fails, fall back to `--engine claude` with a warning.
 ## PHASE 0: DISCOVER & SCOPE
@@ -128,6 +135,8 @@ Spawn all applicable auditors in parallel. Each reads `.devlyn/commitment-regist
 ### code-auditor (always)
+**Engine routing**: If `--engine auto` or `--engine codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the full code-auditor prompt (read from `references/auditors/code-auditor.md`). Include the commitment registry content inline in the prompt since Codex cannot read `.devlyn/commitment-registry.md` directly in read-only sandbox. If `--engine claude`, spawn a Claude subagent as below.
 Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/code-auditor.md` and pass it to the subagent.
 The code-auditor classifies each commitment as IMPLEMENTED, MISSING, INCOMPLETE, DIVERGENT, or BROKEN — with file:line evidence. Also catches cross-feature integration gaps and constraint violations. Writes to `.devlyn/audit-code.md`.

package/config/skills/devlyn:team-resolve/SKILL.md CHANGED Viewed

@@ -130,9 +130,28 @@ Use the Agent Teams infrastructure:
 **IMPORTANT**: When spawning teammates, replace `{team-name}` in each prompt below with the actual team name you chose (e.g., `resolve-null-user-crash`). Include the relevant file paths from your Phase 1 investigation in the spawn prompt.
+### Engine-Routed Teammate Spawning
+If the caller passed `--engine auto` or `--engine codex` (check the orchestrator's context or the pipeline config), read the auto-resolve skill's `references/engine-routing.md` for per-role routing under "team-resolve roles".
+**For roles routed to Codex**: Instead of spawning a Claude Agent teammate, call `mcp__codex-cli__codex` with:
+- `model`: `"gpt-5.4"`
+- `reasoningEffort`: `"xhigh"`
+- `sandbox`: per routing table (`"read-only"` or `"workspace-write"`)
+- `workingDirectory`: project root
+- `prompt`: the full teammate prompt below, with issue context and file paths included inline
+Codex roles cannot use TeamCreate/SendMessage — the Team Lead (you) relays their findings to other teammates and collects their output directly from the MCP call response.
+**For roles routed to Claude**: Spawn via Task tool as normal (prompts below).
+**For Dual roles** (e.g., security-auditor): Run BOTH a Claude Agent teammate AND a `mcp__codex-cli__codex` call in parallel with the same prompt. Merge findings per `engine-routing.md` "How to Spawn a Dual Role" section.
+If `--engine claude` or no `--engine` flag: all roles use Claude Agent teammates (current default behavior).
 ### Teammate Prompts
-When spawning each teammate via the Task tool, use these prompts:
+When spawning each teammate via the Task tool (or passing to `mcp__codex-cli__codex` for Codex-routed roles), use these prompts:
 <root_cause_analyst_prompt>
 You are the **Root Cause Analyst** on an Agent Team resolving an issue.

package/config/skills/devlyn:team-review/SKILL.md CHANGED Viewed

@@ -66,9 +66,28 @@ Use the Agent Teams infrastructure:
 **IMPORTANT**: When spawning reviewers, replace `{team-name}` in each prompt below with the actual team name you chose. Include the specific changed file paths in each reviewer's spawn prompt.
+### Engine-Routed Reviewer Spawning
+If the caller passed `--engine auto` or `--engine codex` (check the orchestrator's context or the pipeline config), read the auto-resolve skill's `references/engine-routing.md` for per-role routing under "team-review roles".
+**For roles routed to Codex**: Instead of spawning a Claude Agent reviewer, call `mcp__codex-cli__codex` with:
+- `model`: `"gpt-5.4"`
+- `reasoningEffort`: `"xhigh"`
+- `sandbox`: per routing table (`"read-only"` or `"workspace-write"`)
+- `workingDirectory`: project root
+- `prompt`: the full reviewer prompt below, with changed file paths and diff included inline
+Codex reviewers cannot use TeamCreate/SendMessage — the Review Lead (you) collects their output directly from the MCP call response and relays cross-cutting findings to other reviewers.
+**For roles routed to Claude**: Spawn via Task tool as normal (prompts below).
+**For Dual roles** (e.g., security-reviewer): Run BOTH a Claude Agent reviewer AND a `mcp__codex-cli__codex` call in parallel with the same prompt. Merge findings per `engine-routing.md` "How to Spawn a Dual Role" section.
+If `--engine claude` or no `--engine` flag: all roles use Claude Agent reviewers (current default behavior).
 ### Reviewer Prompts
-When spawning each reviewer via the Task tool, use these prompts:
+When spawning each reviewer via the Task tool (or passing to `mcp__codex-cli__codex` for Codex-routed roles), use these prompts:
 <security_reviewer_prompt>
 You are the **Security Reviewer** on an Agent Team performing a code review.

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "1.11.0",
+  "version": "1.12.1",
   "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
   "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
   "bin": {