devlyn-cli 1.10.0 → 1.12.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CLAUDE.md CHANGED
@@ -72,7 +72,8 @@ Optional flags:
  - `--skip-review` — skip team-review phase
  - `--skip-clean` — skip clean phase
  - `--skip-docs` — skip update-docs phase
- - `--with-codex [evaluate|review|both]` — use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
+ - `--engine auto|codex|claude` — intelligent model routing. `auto` routes each phase and team role to the optimal model (Claude or Codex GPT-5.4) based on benchmark data. `codex` forces Codex for implementation, Claude for evaluation. `claude` (default) uses Claude for everything. Requires codex-mcp-server.
+ - `--with-codex [evaluate|review|both]` — (legacy, superseded by `--engine`) use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
 
  ## Preflight Check (Post-Roadmap Verification)
 
@@ -91,6 +92,7 @@ Optional flags:
  - `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
  - `--skip-browser` — skip browser validation
  - `--skip-docs` — skip documentation audit
+ - `--engine auto|codex|claude` — route code-auditor to Codex (better at code analysis), docs/browser to Claude
 
  **Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)
 
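For illustration, the flag semantics above (a new `--engine` mode with `--with-codex` kept as a legacy alias) can be sketched as a small resolver. The function name and return shape are hypothetical, not part of the package:

```javascript
// Sketch: resolve --engine / --with-codex into an effective engine mode.
// --engine takes precedence; the legacy --with-codex path keeps Claude as
// builder and only adds Codex as evaluator/reviewer.
// Shapes are illustrative, not the package's actual implementation.
function resolveEngine({ engine, withCodex }) {
  const warnings = [];
  if (engine && withCodex) {
    warnings.push("--with-codex ignored: --engine takes precedence");
  }
  if (engine) return { engine, warnings };
  if (withCodex) {
    // legacy path: Codex as cross-model evaluator/reviewer only
    return { engine: "claude", legacyCodex: withCodex, warnings };
  }
  return { engine: "claude", warnings }; // default: Claude for everything
}
```

With this sketch, `--engine auto --with-codex both` resolves to `auto` plus a warning, matching the documented precedence.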
package/README.md CHANGED
@@ -100,18 +100,31 @@ Reads every commitment from your vision, roadmap, and item specs, then audits th
 
  Confirmed gaps become new roadmap items — feed them back into auto-resolve. Use `--autofix` to do this automatically, or `--phase 2` to check only one phase.
 
- ### Bonus — Dual-Model Mode with Codex
+ ### Bonus — Intelligent Model Routing with `--engine`
 
  Install the Codex MCP server during setup, then:
 
  ```
- /devlyn:auto-resolve "fix the auth bug" --with-codex
+ /devlyn:auto-resolve "fix the auth bug" --engine auto
  ```
 
- Claude builds, **OpenAI Codex evaluates independently** — two models collaborating, catching what a single model misses.
+ **`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.6 or GPT-5.4) based on benchmark data. Codex handles implementation (SWE-bench Pro +11.7pp); Claude handles evaluation and architecture review (MRCR +28pp). Security roles run both models in parallel for maximum coverage.
+
+ > `--engine auto` (recommended) · `--engine codex` (force Codex for implementation) · `--engine claude` (default, Claude only)
+
+ Also works with `/devlyn:ideate --engine auto` and `/devlyn:preflight --engine auto`.
+
+ <details>
+ <summary>Legacy: <code>--with-codex</code> (superseded by <code>--engine</code>)</summary>
+
+ ```
+ /devlyn:auto-resolve "fix the auth bug" --with-codex
+ ```
 
  > `--with-codex evaluate` (default) · `--with-codex review` · `--with-codex both`
 
+ </details>
+
  ---
 
  ## Manual Commands
@@ -210,7 +223,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 
  | Server | Description |
  |---|---|
- | `codex-cli` | Codex MCP server — enables `--with-codex` dual-model mode |
+ | `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing and legacy `--with-codex` mode |
  | `playwright` | Playwright MCP — powers browser-validate Tier 2 |
 
  </details>
@@ -33,28 +33,36 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
  - `--skip-docs` (false) — skip update-docs phase
  - `--skip-build-gate` (false) — skip the deterministic build gate (Phase 1.4). Not recommended — the build gate is the primary defense against "tests pass locally, breaks in CI/Docker/production" class of bugs.
  - `--build-gate MODE` (auto) — controls build gate behavior. `auto`: detect project type and run appropriate build/typecheck/lint commands; if Dockerfile(s) are present, Docker builds are included automatically. `strict`: auto + treat warnings as errors. `no-docker`: auto but skip Docker builds even if Dockerfiles exist (for faster iteration). `skip`: same as --skip-build-gate.
- - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques.
+ - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques. **Ignored if `--engine` is set** (engine routing subsumes this).
+ - `--engine MODE` (claude) — controls which model handles each pipeline phase and team role. Modes:
+   - `claude` (default): all phases use Claude subagents. Current behavior, no Codex calls.
+   - `codex`: Codex handles implementation/analysis phases, Claude handles orchestration, evaluation, and Chrome MCP.
+   - `auto`: each phase and team role routes to the optimal model based on benchmark data. Recommended when Codex MCP server is available. Subsumes `--with-codex both`.
 
  Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
- Codex examples: `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
+ Engine examples: `--engine auto`, `--engine codex`, `--engine claude`
+ Codex examples (legacy): `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
  If no flags are present, use defaults.
 
- 3. **If `--with-codex` is enabled**: Read `references/codex-integration.md` and run the "PRE-FLIGHT CHECK" section to verify Codex MCP server availability before proceeding.
+ 3. **If `--engine` is `auto` or `codex`**: Read `references/engine-routing.md` for the full routing table. Then call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude` (fallback), [2] Abort. If `--engine` is not set but `--with-codex` is enabled, read `references/codex-integration.md` instead and run its pre-flight check.
 
  4. Announce the pipeline plan:
  ```
  Auto-resolve pipeline starting
  Task: [extracted task description]
+ Engine: [auto / codex / claude]
  Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
  Max evaluation rounds: [N]
- Cross-model evaluation (Codex): [evaluate / review / both / disabled]
+ Cross-model evaluation (Codex): [evaluate / review / both / disabled / subsumed by --engine]
  ```
 
  ## PHASE 1: BUILD
 
+ **Engine routing**: If `--engine` is `auto` or `codex`, read `references/engine-routing.md` "How to Spawn a Codex BUILD/FIX Agent" section. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the full agent prompt below as the `prompt` parameter. If `--engine` is `claude` (default), spawn a Claude subagent as described below.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to investigate and implement the fix. The subagent does NOT have access to skills, so include all necessary instructions inline.
 
- Agent prompt — pass this to the Agent tool:
+ Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 
  Investigate and implement the following task. Work through these phases in order:
 
@@ -74,6 +82,8 @@ Read relevant files in parallel. Build a clear picture of what exists and what n
  - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
  Each teammate investigates from their perspective and sends findings back.
 
+ **Engine routing for teammates**: If the orchestrator's `--engine` is `auto` or `codex`, read `references/engine-routing.md` for per-role routing. Roles marked **Codex** are called via `mcp__codex-cli__codex` instead of spawning Agent teammates — include the full role prompt and issue context inline. Roles marked **Claude** use normal Agent teammates. Roles marked **Dual** run both in parallel and merge findings. The orchestrator relays Codex role outputs to Claude teammates that need them.
+
  **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
 
  **Phase E — Update done criteria**: Mark each criterion in `.devlyn/done-criteria.md` as satisfied. Run the full test suite.
@@ -127,9 +137,11 @@ Triggered only when PHASE 1.4 returns FAIL.
 
  Track a round counter (shared with the main fix loop counter against `max-rounds`). If `round >= max-rounds`, stop with a clear failure report — do NOT continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
 
+ **Engine routing**: Same as PHASE 2.5 FIX LOOP — if `--engine` is `auto` or `codex`, use `mcp__codex-cli__codex` with `workspace-write` and `fullAuto: true`. If `--engine` is `claude`, spawn a Claude subagent.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
 
- Agent prompt — pass this to the Agent tool:
+ Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 
  Read `.devlyn/BUILD-GATE.md` — it contains deterministic build/typecheck/lint failures from real compiler output. These are not opinions; the compiler rejected this code. Fix every listed failure at the root cause level.
 
@@ -220,7 +232,8 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
  **After the agent completes**:
  1. Read `.devlyn/EVAL-FINDINGS.md`
  2. Extract the verdict
- 3. **If `--with-codex` includes `evaluate` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
+ 3. **If `--engine` is `auto` or `codex`**: The evaluate phase always uses Claude (see `references/engine-routing.md`). When `--engine auto`, the builder was Codex — Claude evaluating Codex's work creates the GAN dynamic automatically. No separate Codex evaluation pass is needed.
+ **If `--engine` is not set and `--with-codex` includes `evaluate` or `both`** (legacy): Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
  4. Branch on verdict (from the merged findings if Codex was used):
  - `PASS` → skip to PHASE 3
  - `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
@@ -232,9 +245,11 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
 
  Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
 
+ **Engine routing**: If `--engine` is `auto` or `codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the fix prompt below. Use a fresh call each round (no sessionId reuse — sandbox/fullAuto only apply on first call of a session). If `--engine` is `claude`, spawn a Claude subagent as below.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to fix the evaluation findings.
 
- Agent prompt — pass this to the Agent tool:
+ Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 
  Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the evaluator returns PASS — there is no "shippable with issues" shortcut.
 
@@ -268,11 +283,14 @@ Agent prompt — pass this to the Agent tool:
 
  Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes.
 
+ **Engine routing for reviewers**: If the orchestrator passed `--engine auto` or `--engine codex`, read `references/engine-routing.md` for per-role routing in the "team-review roles" table. Route each reviewer to Claude Agent or `mcp__codex-cli__codex` accordingly. For Dual roles (security-reviewer), run both models in parallel and merge findings per the "How to Spawn a Dual Role" section. For `--engine claude`, all reviewers are Claude Agent teammates.
+
  Each reviewer evaluates from their perspective, sends findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW). After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
 
  Clean up the team after completion.
 
- **If `--with-codex` includes `review` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
+ **If `--engine` is set**: engine routing already handles cross-model review via per-role routing — skip the legacy `--with-codex` review step below.
+ **If `--with-codex` includes `review` or `both`** (legacy, only when `--engine` is not set): Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
 
  **After the review phase completes**:
  1. If CRITICAL issues remain unfixed, log a warning in the final report
@@ -403,6 +421,7 @@ After all phases complete:
  ### Auto-Resolve Pipeline Complete
 
  **Task**: [original task description]
+ **Engine**: [auto / codex / claude — if auto, note which phases used which model]
 
  **Pipeline Summary**:
  | Phase | Status | Notes |
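The shared round counter described in the FIX LOOP phases above can be sketched as follows. `evaluate` and `fix` are hypothetical stand-ins for the real evaluate phase and the per-round fix agent (a Claude subagent, or a fresh `mcp__codex-cli__codex` call when the engine routes to Codex):

```javascript
// Sketch of the fix-loop control flow: re-evaluate after each fix round,
// stop on PASS or when the round counter reaches max-rounds.
// `evaluate` returns a verdict string; `fix` performs one fix round
// (fresh call each round, no sessionId reuse).
function runFixLoop(evaluate, fix, maxRounds) {
  let round = 0;
  let verdict = evaluate();
  while (verdict !== "PASS" && round < maxRounds) {
    round += 1;
    fix(round);
    verdict = evaluate();
  }
  // unresolved findings remain if we hit the cap without a PASS
  return { verdict, rounds: round, unresolved: verdict !== "PASS" };
}
```

As in the pipeline, hitting the cap does not silently succeed: the caller sees `unresolved: true` and can attach the warning to the final report.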
@@ -1,6 +1,8 @@
- # Codex Cross-Model Integration
+ # Codex Cross-Model Integration (Legacy)
 
- Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline. Only read this file when `--with-codex` is enabled.
+ > **Note**: This file is the legacy `--with-codex` integration. For the newer `--engine` flag (which subsumes `--with-codex`), see `references/engine-routing.md`. Only read this file when `--with-codex` is enabled AND `--engine` is NOT set.
+
+ Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline.
 
  Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). This creates a GAN-like adversarial dynamic — Claude builds and Codex critiques, reducing shared blind spots between model families.
 
@@ -37,7 +39,8 @@ Call `mcp__codex-cli__codex` with:
  - `prompt`: Include the full content of `.devlyn/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
  - `workingDirectory`: the project root
  - `sandbox`: `"read-only"` (Codex should only read, not modify files)
- - `reasoningEffort`: `"high"`
+ - `reasoningEffort`: `"high"` (note: for `--engine auto`, `engine-routing.md` defaults to `"xhigh"`)
+ - `model`: `"gpt-5.4"` (pass explicitly — the MCP schema default may be outdated)
 
  Example prompt to pass:
  ```
@@ -0,0 +1,205 @@
+ # Engine Routing: Intelligent Model Selection
+
+ Instructions for routing work to the optimal model (Claude or Codex) per role and phase. Only read this file when `--engine` is set to `auto` or `codex`.
+
+ The routing table below is derived from published benchmarks (April 2026) comparing Claude Opus 4.6 and GPT-5.4 across task-relevant dimensions. The principle: each role's work goes to the model that objectively performs better at that task type.
+
+ ---
+
+ ## Benchmark Basis
+
+ | Dimension | Claude Opus 4.6 | GPT-5.4 | Gap | Source |
+ |-----------|-----------------|---------|-----|--------|
+ | Long-context retrieval (256k) | 92% | ~64% | Claude +28pp | MRCR v2 |
+ | Graduate-level reasoning | 87.4% | 83.9% | Claude +3.5pp | GPQA Diamond |
+ | Hard coding problems | ~46% | 57.7% | Codex +11.7pp | SWE-bench Pro |
+ | Function-level code gen | 90.4% | 93.1% | Codex +2.7pp | HumanEval |
+ | Terminal/CLI tasks | 65.4% | 75.1% | Codex +9.7pp | Terminal-Bench 2.0 |
+ | Real-world issue resolution | ~80% | ~80% | Tied | SWE-bench Verified |
+ | Security vulnerability detection | — | — | Tied | Semgrep 2025 study |
+ | Agentic computer use | 72.7% | 75.0% | Codex +2.3pp | OSWorld |
+ | Ambiguous intent handling | Preferred by 70% devs | — | Claude | Developer surveys |
+
+ ---
+
+ ## Codex Call Defaults
+
+ Every Codex call in this file uses these defaults unless stated otherwise:
+
+ ```
+ model: "gpt-5.4"
+ reasoningEffort: "xhigh"
+ sandbox: varies per role (see table)
+ workingDirectory: project root
+ ```
+
+ The `model` field accepts any string — pass `"gpt-5.4"` even if the MCP schema lists older defaults. The Codex CLI resolves it.
+
+ ---
+
+ ## Role Routing Table
+
+ ### team-resolve roles
+
+ | Role | Engine | Sandbox | Rationale |
+ |------|--------|---------|-----------|
+ | root-cause-analyst | **Claude** | — | A/B test: Claude traced git history (15 tool calls) finding exact commit + unchecked migration plan. Codex analyzed structure well but lacked git history depth. Tool access > SWE-bench Pro advantage for this role. |
+ | test-engineer | **Codex** | workspace-write | Test code generation = HumanEval (+2.7pp), needs file write |
+ | security-auditor | **Dual** | read-only | Semgrep: both find unique vulns; GAN > single model |
+ | implementation-planner | **Codex** | read-only | Implementation planning = SWE-bench Pro (+11.7pp) |
+ | product-designer | **Claude** | — | Ambiguous requirements, user intent = Claude strength |
+ | ui-designer | **Claude** | — | Visual spec, design reasoning = non-coding task |
+ | ux-designer | **Claude** | — | User flow analysis = ambiguous intent handling |
+ | accessibility-auditor | **Claude** | — | A/B test: Claude found 12 issues (1 CRITICAL) vs Codex 4. WCAG auditing requires thoroughness and domain knowledge depth, not code generation speed. Claude 3x coverage. |
+ | product-analyst | **Claude** | — | Requirements clarity, scope judgment = ambiguity handling |
+ | architecture-reviewer | **Claude** | — | Codebase-wide pattern review = MRCR long-context (+28pp) |
+ | performance-engineer | **Codex** | read-only | Terminal tasks + algorithm analysis = Terminal-Bench (+9.7pp) |
+ | api-designer | **Dual** | read-only | A/B test: Claude found 9 issues, Codex found 6, with unique findings on both sides (Claude: --version, exit codes; Codex: YAML folded scalar parsing bug). Dual maximizes coverage for API surface review. |
+
+ ### team-review roles
+
+ | Role | Engine | Sandbox | Rationale |
+ |------|--------|---------|-----------|
+ | security-reviewer | **Dual** | read-only | Same as team-resolve security-auditor |
+ | quality-reviewer | **Dual** | read-only | A/B test: Claude found 14 issues (2 HIGH), Codex found 11 (3 HIGH), only ~6 overlap. Dual yields ~19 unique findings (+36-73% coverage). Both models find HIGH-severity issues the other misses. |
+ | test-analyst | **Codex** | workspace-write | Test gap analysis + test code suggestions |
+ | ux-reviewer | **Claude** | — | UX flow assessment = ambiguity handling |
+ | ui-reviewer | **Claude** | — | Design token consistency = non-coding task |
+ | accessibility-reviewer | **Claude** | — | Same rationale as team-resolve accessibility-auditor: Claude 3x finding coverage on WCAG audits |
+ | product-validator | **Claude** | — | Business logic intent = ambiguity handling |
+ | api-reviewer | **Dual** | read-only | Same rationale as team-resolve api-designer: both models find unique API issues |
+ | performance-reviewer | **Codex** | read-only | Algorithm complexity = Terminal-Bench (+9.7pp) |
+
+ ### Summary distribution
+
+ | Engine | team-resolve (12) | team-review (9) | Total |
+ |--------|-------------------|-----------------|-------|
+ | Claude | 7 | 4 | 11 |
+ | Codex | 2 | 2 | 4 |
+ | Dual | 3 | 3 | 6 |
+
+ ---
+
+ ## Pipeline Phase Routing (auto-resolve)
+
+ | Phase | --engine auto | --engine codex | --engine claude |
+ |-------|--------------|----------------|-----------------|
+ | BUILD (implementation) | **Codex** | Codex | Claude |
+ | BUILD GATE | bash (model-agnostic) | bash | bash |
+ | BROWSER VALIDATE | Claude (Chrome MCP only) | Claude | Claude |
+ | EVALUATE | **Claude** | Claude | Claude |
+ | FIX LOOP | **Codex** | Codex | Claude |
+ | SIMPLIFY | Claude | Codex | Claude |
+ | REVIEW (team) | **Mixed per table** | Codex all | Claude all |
+ | CHALLENGE | **Claude** | Claude | Claude |
+ | SECURITY REVIEW | **Dual** | Codex | Claude |
+ | CLEAN | Claude | Codex | Claude |
+ | DOCS | Claude | Codex | Claude |
+
+ Rationale for `--engine auto` choices:
+ - BUILD/FIX: Codex — SWE-bench Pro 57.7% vs 46%. The biggest model gap is in hard coding tasks.
+ - EVALUATE/CHALLENGE: Claude — evaluating a full diff requires long-context retrieval (MRCR +28pp) and skeptical reasoning (GPQA +3.5pp). Different model family from builder creates GAN dynamic.
+ - BROWSER: Claude — Chrome MCP tools are Claude Code session-bound.
+ - SECURITY: Dual — Semgrep study shows both models find unique vulnerabilities.
+
+ ---
+
+ ## Pipeline Phase Routing (ideate)
+
+ | Phase | --engine auto | --engine codex | --engine claude |
+ |-------|--------------|----------------|-----------------|
+ | FRAME | **Claude** | Codex | Claude |
+ | EXPLORE | **Claude** | Codex | Claude |
+ | CONVERGE | **Claude** | Codex | Claude |
+ | CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
+ | DOCUMENT | **Claude** | Codex | Claude |
+
+ Rationale:
+ - FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
+ - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic (same role as `--with-codex` but automatic). When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
+ - DOCUMENT: Claude — writing quality for spec generation.
+
+ ---
+
+ ## Pipeline Phase Routing (preflight)
+
+ | Phase | --engine auto | --engine codex | --engine claude |
+ |-------|--------------|----------------|-----------------|
+ | EXTRACT COMMITMENTS | Claude | Codex | Claude |
+ | CODE AUDIT | **Codex** | Codex | Claude |
+ | DOCS AUDIT | **Claude** | Codex | Claude |
+ | BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
+ | SYNTHESIZE | Claude | Claude | Claude |
+
+ ---
+
+ ## How to Spawn a Codex Role
+
+ For roles marked **Codex** in the routing table, call `mcp__codex-cli__codex` instead of spawning a Claude Agent subagent. Package the role's full prompt (from the skill's teammate prompt section) into the Codex call.
+
+ Template:
+
+ ```
+ mcp__codex-cli__codex({
+   prompt: "[full role prompt with issue context, file paths, and deliverable format]",
+   model: "gpt-5.4",
+   reasoningEffort: "xhigh",
+   sandbox: "[read-only or workspace-write per table]",
+   workingDirectory: "[project root]"
+ })
+ ```
+
+ **Important**: Codex has no access to team infrastructure (TeamCreate, SendMessage, TaskCreate). For Codex roles:
+ - Include ALL context inline in the prompt (issue description, file paths from investigation, deliverable format)
+ - The orchestrator collects Codex's response and routes it where it would have gone via SendMessage
+ - Codex roles cannot communicate with other teammates directly — the orchestrator relays findings
+
+ For roles marked **Claude**, spawn a normal Agent subagent as before.
+
+ ---
+
+ ## How to Spawn a Dual Role
+
+ For roles marked **Dual**, run BOTH models in parallel and merge findings:
+
+ 1. Spawn a Claude Agent subagent with the role's prompt
+ 2. Call `mcp__codex-cli__codex` with the same role's prompt (sandbox: "read-only")
+ 3. Wait for both to complete
+ 4. Merge findings:
+    - Same finding from both → keep more detailed description, mark "confirmed by both models"
+    - Claude-only → keep as-is
+    - Codex-only → prefix with `[codex]`
+    - Conflicting findings → keep both, note the disagreement
+    - Take the MORE SEVERE verdict between the two
+
+ ---
+
+ ## How to Spawn a Codex BUILD/FIX Agent
+
+ For BUILD and FIX LOOP phases when engine routes to Codex:
+
+ ```
+ mcp__codex-cli__codex({
+   prompt: "[full build/fix prompt with task description, done criteria, and implementation instructions]",
+   model: "gpt-5.4",
+   reasoningEffort: "xhigh",
+   sandbox: "workspace-write",
+   fullAuto: true,
+   workingDirectory: "[project root]"
+ })
+ ```
+
+ **After Codex completes**: verify changes were made (`git diff --stat`), then proceed to the next phase as normal. The file-based handoff (`.devlyn/done-criteria.md`, `.devlyn/EVAL-FINDINGS.md`, etc.) works identically — Codex writes the same files Claude would.
+
+ **Session management**: For FIX LOOP iterations, use a fresh call each time (no `sessionId` reuse) because sandbox/fullAuto parameters only apply on the first call of a session.
+
+ ---
+
+ ## Override Behavior
+
+ - `--engine claude` → all roles and phases use Claude (current default behavior, no Codex calls)
+ - `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
+ - `--engine auto` → each role and phase routes to the optimal model per this table
+ - `--engine auto` is the recommended default when Codex MCP server is available
+
+ `--engine` and `--with-codex` are **mutually exclusive**. `--engine auto` subsumes `--with-codex both` — it uses Codex where it's optimal (broader than just evaluate/review). If both flags are passed, `--engine` takes precedence and `--with-codex` is ignored with a warning.
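The Dual-role merge rules in "How to Spawn a Dual Role" can be sketched as a small function. The finding shape and the severity-based verdict are illustrative simplifications of the documented rules (keep the more detailed shared finding, prefix Codex-only findings, take the more severe verdict):

```javascript
// Sketch of the Dual-role merge: findings are { id, severity, detail }.
// Shared findings keep the more detailed description and are marked confirmed;
// Codex-only findings get a [codex] prefix. Shapes are illustrative.
const SEVERITY_RANK = { LOW: 0, MEDIUM: 1, HIGH: 2, CRITICAL: 3 };

function mergeFindings(claude, codex) {
  const merged = [];
  const codexById = new Map(codex.map((f) => [f.id, f]));
  for (const c of claude) {
    const twin = codexById.get(c.id);
    if (twin) {
      codexById.delete(c.id);
      // same finding from both: keep the more detailed description
      const richer = twin.detail.length > c.detail.length ? twin : c;
      merged.push({ ...richer, confirmedByBoth: true });
    } else {
      merged.push(c); // Claude-only: keep as-is
    }
  }
  for (const f of codexById.values()) {
    merged.push({ ...f, detail: `[codex] ${f.detail}` }); // Codex-only
  }
  // take the more severe verdict between the two models' worst findings
  const worst = (list) =>
    list.reduce((m, f) => Math.max(m, SEVERITY_RANK[f.severity]), 0);
  const rank = Math.max(worst(claude), worst(codex));
  const verdict = Object.keys(SEVERITY_RANK).find((k) => SEVERITY_RANK[k] === rank);
  return { merged, verdict };
}
```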
@@ -1,6 +1,6 @@
  ---
  name: devlyn:ideate
- description: Transform unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and multi-perspective synthesis. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) that eliminates context pollution in the implementation pipeline. Use when the user wants to brainstorm, plan a new project or feature set, explore possibilities, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring. Also triggers when the user shares links or resources for a new initiative and needs them synthesized into a plan, or wants to update an existing roadmap with new ideas.
+ description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline. Optional --with-codex flag adds OpenAI Codex as a cross-model critic. Use when the user wants to brainstorm, plan a new project or feature set, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring.
  ---
 
  # Ideation to Implementation Bridge
@@ -20,6 +20,20 @@ Concretely:
  - If you catch yourself about to open a source file to make a code change, stop — that's a signal you've left ideation mode
  </hard_boundary>
 
+ ## Arguments
+
+ Parse these from the user's invocation message:
+
+ - `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire reason for the flag is maximum reasoning from a second model family. **Ignored if `--engine` is set** (engine routing subsumes this).
+ - `--engine MODE` (claude) — controls which model handles each ideation phase. Modes:
+   - `claude` (default): all phases use Claude. Current behavior.
+   - `codex`: Codex handles FRAME/EXPLORE/CONVERGE/DOCUMENT, Claude runs CHALLENGE (role reversal — builder and critic are always different models).
+   - `auto`: Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic). Subsumes `--with-codex`. Recommended when Codex MCP is available.
+
+ **If `--engine` is `auto` or `codex`**: call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort. Also read `references/challenge-rubric.md` up front. The engine routing table is defined in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)".
+
+ **If `--engine` is not set and `--with-codex` is set** (legacy): read `references/challenge-rubric.md` and `references/codex-debate.md` up front, then run the pre-flight check described in `codex-debate.md` to verify the Codex MCP server is available before starting the pipeline. If the server is unavailable and the user opts to continue without Codex, the solo CHALLENGE pass still runs — only the cross-model rubric pass is disabled.
+
  <why_this_matters>
  When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
  - Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
@@ -271,8 +285,51 @@ Within each phase:
  ### Architecture Decisions
  Surface decisions that affect multiple items — technology choices, data model, integration approaches, UX patterns. For each: **What** was decided, **Why** (tradeoffs), and **What alternatives** were considered. These become decision records.
 
- ### Confirmation
- Before generating documents, present a final summary:
+ ### Internal draft — do not show the user yet
+
+ At this point you have an internal convergence draft: themes, phases, items, decisions. **Do not present it to the user yet.** Phase 3.5 CHALLENGE runs next, and the user will see exactly one summary — the post-challenge plan, with visibility into what CHALLENGE changed. Showing the pre-challenge draft first and then changing it after challenge creates a two-round confirmation loop that burns the user's trust.
+
+ ## Phase 3.5: CHALLENGE
+
+ <phase_goal>Apply a strict 5-axis rubric to the internal convergence draft, then present one post-challenge summary to the user for confirmation. Always runs.</phase_goal>
+
+ <thinking_effort>
+ Engage maximum thinking effort here — both the solo rubric pass and, if enabled, the Codex pass. Use extended thinking ("ultrathink") when reading each item, applying each axis, and producing revisions. The default Claude failure mode in self-review is nodding along to the draft you just produced; shallow thinking here is the exact pattern this phase exists to prevent.
+
+ Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
300
+ </thinking_effort>
301
+
302
+ The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
303
+
304
+ ### The rubric — single source of truth
305
+
306
+ Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
307
+
308
+ ### Solo pass (always runs)
309
+
310
+ Apply the rubric to the internal convergence draft. Produce findings in the format specified in `challenge-rubric.md` (Severity / Quote / Axis / Why / Fix).
311
+
312
+ For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
313
+
314
+ If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
315
+
316
+ ### Codex pass (engine-routed or legacy `--with-codex`)
317
+
318
+ **If `--engine auto`**: Codex runs the CHALLENGE rubric pass automatically. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the packaged plan + rubric as prompt (same format as `codex-debate.md` Step 2). Reconcile findings: same finding from both → "confirmed by both", Codex-only → prefix `[codex]`.
319
+
320
+ **If `--engine codex`**: Role reversal — Codex built the plan (FRAME/EXPLORE/CONVERGE/DOCUMENT), so Claude runs the solo CHALLENGE pass. Do NOT also run Codex on CHALLENGE — builder and critic must be different models. Skip this section entirely.
321
+
322
+ **If `--engine claude` or `--engine` not set, and `--with-codex` is set** (legacy): follow `references/codex-debate.md` "PHASE 3.5-CODEX" section. Codex applies the rubric from `challenge-rubric.md` independently at `reasoningEffort: "xhigh"`. Reconcile findings as `codex-debate.md` describes — findings raised by both sides get "confirmed by both", Codex-only findings get prefixed `[codex]` in internal notes so the user can see where each push came from.
323
+
324
+ ### Respect explicit user intent
325
+
326
+ The rubric is a quality lens, not an override. If a finding conflicts with something the user explicitly and clearly asked for, follow the "Hard rule" section in `challenge-rubric.md`: record the finding, **do not silently rewrite the plan**, and surface it as an open question in the summary below. The user makes the call.
327
+
328
+ ### User-facing summary (the first and only time the user sees the plan)
329
+
330
+ After the rubric pass(es), present the post-challenge plan to the user for confirmation. This is the first time the user sees the converged plan — by design, so they see a rubric-checked result rather than a draft that immediately gets revised.
331
+
332
+ Format:
276
333
  ```
277
334
  Vision: [one sentence]
278
335
  Phases: [N] phases, [M] total items
@@ -280,9 +337,30 @@ Phase 1 ([theme]): [items with brief descriptions]
280
337
  Phase 2 ([theme]): [items]
281
338
  Key decisions: [list]
282
339
  Deferred: [items with reasons]
340
+
341
+ ## CHALLENGE results
342
+
343
+ Solo pass: [N findings, M applied]
344
+ Codex pass: [N findings, M applied] ← only if a Codex pass ran (--engine auto or legacy --with-codex)
345
+
346
+ Changes applied during CHALLENGE:
347
+ - [item]: [what changed and which axis triggered it]
348
+
349
+ Open questions for you (rubric flagged something you explicitly asked for):
350
+ - [item]: rubric says [finding]; you asked for [original]; here is the tradeoff — proceed as-is, or adopt the alternative?
283
351
  ```
284
352
 
285
- Get explicit confirmation before proceeding to document generation.
353
+ Get explicit confirmation before proceeding to DOCUMENT.
354
+
355
+ ### Quick Add mode
356
+
357
+ For single-item additions, run one solo rubric pass on just the new item. Even then do not skip — single-item additions are exactly where overengineering and workarounds slip in unnoticed, because the lack of surrounding context makes a bad item look self-contained and harmless.
358
+
359
+ ## Engine Routing for FRAME / EXPLORE / CONVERGE / DOCUMENT
360
+
361
+ **If `--engine codex`**: Phases 1-3 and Phase 4 are delegated to Codex. For each phase, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, and the phase instructions + user context as the prompt. Use `sessionId` to maintain conversational context across phases (note: sandbox/fullAuto only apply on the first call). Claude remains the orchestrator — it reads Codex's output, manages the conversation with the user (confirmation prompts, clarifying questions), and routes findings between phases.
362
+
363
+ **If `--engine auto` or `--engine claude`**: All planning phases use Claude directly (current behavior). Claude's stronger handling of ambiguous intent and its writing-quality benchmarks favor it for planning tasks.
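The per-phase delegation call can be sketched as a payload builder (the function itself is illustrative; the parameter names match the MCP tool parameters described above):

```javascript
// Sketch: build the mcp__codex-cli__codex payload for one delegated phase.
// Only the first call in a session carries sandbox (it applies once);
// later calls pass sessionId so Codex keeps context across phases.
function buildPhaseCall(phase, context, sessionId = null) {
  const payload = {
    model: "gpt-5.4",
    reasoningEffort: "xhigh",
    workingDirectory: context.projectRoot,
    prompt: `${phase.instructions}\n\n## User context\n${context.userContext}`,
  };
  if (sessionId === null) {
    payload.sandbox = "workspace-write"; // first call only
  } else {
    payload.sessionId = sessionId; // continue the same Codex session
  }
  return payload;
}
```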
286
364
 
287
365
  ## Phase 4: DOCUMENT
288
366
 
@@ -376,6 +454,9 @@ Before finalizing, verify:
376
454
  - [ ] No spec requires reading VISION.md to be understood (self-contained)
377
455
  - [ ] Dependencies between items are documented in both specs
378
456
  - [ ] Architecture decisions include reasoning and alternatives considered
457
+ - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex if `--engine auto` or legacy `--with-codex` enabled it); no item still fails any axis at CRITICAL or HIGH severity
458
+ - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
459
+ - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
379
460
 
380
461
  ## Language
381
462
 
@@ -0,0 +1,122 @@
1
+ # CHALLENGE Rubric (single source of truth)
2
+
3
+ ## Contents
4
+ - Context — this is a planning rubric
5
+ - The 5 axes (NO WORKAROUND, NO GUESSWORK, NO OVERENGINEERING, WORLD-CLASS BEST PRACTICE, OPTIMIZED)
6
+ - Hard rule — respect explicit user intent
7
+ - Finding format
8
+ - Examples (good vs bad findings, plus a detour-sequencing example)
9
+
10
+ The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex pass (enabled by `--engine auto` or legacy `--with-codex`) use this file — there is exactly one definition of the rubric, and SKILL.md points both paths directly at this file.
11
+
12
+ The rubric exists because plans produced in a single pass, by a single model, in a single conversation almost always fail at least one axis somewhere. The user's historical experience: every time they asked "is this really no-workaround, no-guesswork, no-overengineering, world-class, optimized?", the honest answer was no. This phase makes the answer honestly yes before the user even has to ask.
13
+
14
+ ## Context — this is a PLANNING rubric, not a code rubric
15
+
16
+ This rubric judges the shape of the roadmap: what items exist, in what order, why. It does NOT judge implementation details, code style, or abstractions in code. "Overengineering" here means overengineering the plan, not overengineering a function. When applying it, keep asking: *is this the most direct, optimized path from the user's stated problem to a working outcome?*
17
+
18
+ ## The 5 axes
19
+
20
+ ### 1. NO WORKAROUND
21
+
22
+ Does the item solve the actual problem directly, or does it route around a missing capability? If the direct path is "build X" and the item is "work around not having X", it fails.
23
+
24
+ Canonical failure pattern: the user asks for a feature that papers over a missing foundation. Building the feature adds an item to the plan without solving the real problem, and often makes the real problem harder to fix later.
25
+
26
+ ### 2. NO GUESSWORK
27
+
28
+ Every requirement must be grounded in something the user explicitly confirmed, or in something verifiable from the problem framing. Silent assumptions, "I think the user probably wants...", and requirements invented to fill gaps all fail.
29
+
30
+ Canonical failure pattern: vague user input ("improve the dashboard") leads to a fully-specified plan full of invented detail. Correct handling is to mark every assumed fact as [ASSUMED], ask clarifying questions, and keep the plan minimal until the user fills in the gaps.
31
+
32
+ ### 3. NO OVERENGINEERING (planning-stage)
33
+
34
+ The plan fails this axis when it contains any of:
35
+
36
+ - **Luxury items** — polish, theming, animations, nice-to-haves that do not serve the stated problem. A polish/theming item in Phase 1 of a tool that does not yet solve its core job.
37
+ - **Filler items** — items added to pad a phase or make the plan feel complete. If an item has no testable requirement a real user would notice if absent, it is filler.
38
+ - **Detour sequencing** — the plan takes the long way around when a direct route exists. Three items building toward X when one item could deliver X. Separate scaffold / store / deploy items when they could be bundled into the actual feature they enable.
39
+ - **Roadmap workarounds masquerading as features** — see axis 1. The same failure can fire on axis 1 (paper-over) and axis 3 (padding the roadmap with the workaround).
40
+
41
+ The question to ask for every item: *"Is this the most direct, optimized path to the stated goal, or are we decorating / detouring / papering over?"*
42
+
43
+ ### 4. WORLD-CLASS BEST PRACTICE
44
+
45
+ Would a senior team at a top company structure the roadmap this way for this kind of product today? If a known-good pattern exists for sequencing or decomposing this kind of problem, name it and use it.
46
+
47
+ Canonical failure pattern: the plan uses a familiar-but-mediocre decomposition when a better-known-good pattern exists for the specific problem type. Example: using manual export/import for cross-device sync when autosave + cloud draft storage is the standard pattern across mainstream editing tools (Notion, Linear, Gmail, Google Docs).
48
+
49
+ ### 5. OPTIMIZED
50
+
51
+ Does the sequencing minimize wait time, front-load risk, and ship user-visible value at every phase boundary? Dead phases — phases that are pure setup with no visible win for a real user — are a fail.
52
+
53
+ Canonical failure pattern: Phase 1 is entirely infrastructure (scaffold, models, deploy) and the first user-facing win arrives in Phase 2. Better: Phase 1 ships one thin vertical slice that a real user can use, even if it is small.
54
+
55
+ ## Hard rule — respect explicit user intent
56
+
57
+ The rubric is a tool to prevent drift from quality, not a tool to override the user. If the user has explicitly and clearly stated a preference ("I want X, not Y"), the rubric does not silently replace X with Y. Instead:
58
+
59
+ - Run the rubric as normal.
60
+ - If an axis flags X, do not rewrite the plan. Record the finding and surface it to the user as an open question: "The rubric flags X on [axis] because [reason]. You explicitly asked for X — confirm you want to proceed, or consider [alternative]."
61
+ - The user makes the call. The rubric's job is to make the tradeoff visible, not to make the decision.
62
+
63
+ This rule exists because the 5-axis rubric is an opinionated lens, and opinionated lenses are wrong sometimes. The user's stated intent is ground truth when it is explicit. The rubric is ground truth only for things the user did not explicitly decide.
64
+
65
+ ## Finding format
66
+
67
+ For every item that fails any axis, produce a finding in this exact format:
68
+
69
+ ```
70
+ Severity: CRITICAL / HIGH / MEDIUM / LOW
71
+ Quote: [copy the specific item title or line you are critiquing — one line]
72
+ Axis: [which of the five]
73
+ Why it fails: [one sentence]
74
+ Fix: [one concrete revision — not "reconsider X", say what to do instead]
75
+ ```
76
+
77
+ For the plan as a whole, give a one-line pass/fail per axis with one-sentence reasoning.
78
+
79
+ End with a verdict: `PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED`.
80
+
81
+ The Quote field is load-bearing. It anchors each finding to a specific line in the plan, which prevents the common failure mode of generic unanchored critiques ("too much in Phase 1", "consider refactoring"). Anchored findings are actionable; unanchored findings are noise.
82
+
83
+ ## Examples
84
+
85
+ <example>
86
+ BAD finding (too vague, not actionable):
87
+ Severity: HIGH
88
+ Axis: NO OVERENGINEERING
89
+ Why: Phase 1 has too much.
90
+ Fix: Reduce scope.
91
+
92
+ GOOD finding (anchored, specific, actionable):
93
+ Severity: HIGH
94
+ Quote: "1.3 — Theme customization (light/dark/custom accent colors)"
95
+ Axis: NO OVERENGINEERING (luxury item)
96
+ Why it fails: The product does not yet solve its core job of letting users save a session; theming is a decoration item that does not move the primary problem forward.
97
+ Fix: Move 1.3 to backlog. Phase 1 is shorter by one item. Revisit theming only after the core save flow is shipped and used.
98
+ </example>
99
+
100
+ <example>
101
+ BAD finding:
102
+ Severity: HIGH
103
+ Axis: NO WORKAROUND
104
+ Why: Item 2.1 is a workaround.
105
+ Fix: Do it properly.
106
+
107
+ GOOD finding:
108
+ Severity: CRITICAL
109
+ Quote: "2.1 — Export/import session as JSON file so users can move work between devices"
110
+ Axis: NO WORKAROUND
111
+ Why it fails: The real problem is cross-device sync. File export is a roadmap workaround that asks the user to do the sync manually; it adds an item to the plan without solving the stated problem, and makes the real problem harder to fix later.
112
+ Fix: Replace 2.1 with "Cloud-backed session storage" as a direct cross-device solution. If cloud storage is out of scope for the current phase, explicitly defer cross-device sync to a later phase rather than shipping a manual workaround as if it were the feature.
113
+ </example>
114
+
115
+ <example>
116
+ Detour sequencing finding:
117
+ Severity: MEDIUM
118
+ Quote: "Phase 1: 1.1-scaffold, 1.2-data-store, 1.3-log-today, 1.4-streak-display, 1.5-history-view, 1.6-manage-habits, 1.7-deploy"
119
+ Axis: NO OVERENGINEERING (detour sequencing)
120
+ Why it fails: Scaffold, data store, streak display, and deploy are not features a user would notice as separate items — they are implementation steps of the three actual user capabilities (log a habit, see streak, see history). Splitting them into standalone roadmap items pads the plan without delivering value at each boundary.
121
+ Fix: Collapse Phase 1 to 2 items: "1.1 — Log a habit and see streak" (bundles scaffold + store + log + streak), "1.2 — History view". Deploy is part of each item's done criteria, not a standalone item. Result: 7 items → 2 items, same delivered scope.
122
+ </example>
@@ -0,0 +1,112 @@
1
+ # Codex Cross-Model Rubric Pass (Legacy)
2
+
3
+ > **Note**: This file is the legacy `--with-codex` integration for ideate. For the newer `--engine` flag (which subsumes `--with-codex`), see the engine routing section in SKILL.md. Only read this file when `--with-codex` is set AND `--engine` is NOT set.
4
+
5
+ ## Contents
6
+ - Pre-flight check (verify Codex MCP server availability)
7
+ - PHASE 3.5-CODEX: packaging the plan, calling Codex, reconciling findings with the solo pass
8
+ - Cost notes (one Codex call per ideation session)
9
+
10
+ Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE. The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
11
+
12
+ Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). The intent: one opinionated rubric pass from a different model family, applied right before the user sees the plan. Two model families catch different blind spots; one pass at maximum effort catches more than multiple shallow passes.
13
+
14
+ **Always use `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model. Pass `model: "gpt-5.4"` explicitly, because the MCP schema default may be outdated.
15
+
16
+ ---
17
+
18
+ ## PRE-FLIGHT CHECK
19
+
20
+ Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
21
+
22
+ - **If ping succeeds**: continue.
23
+ - **If ping fails or `mcp__codex-cli__ping` is not found**: warn the user and ask:
24
+ ```
25
+ ⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
26
+
27
+ To install:
28
+ npm i -g @openai/codex
29
+ claude mcp add codex-cli -- npx -y codex-mcp-server
30
+
31
+ Options:
32
+ [1] Continue without --with-codex (Claude-only solo CHALLENGE pass)
33
+ [2] Abort
34
+ ```
35
+ If [1], disable `--with-codex` and continue with the solo CHALLENGE. If [2], stop.
36
+
37
+ ---
38
+
39
+ ## PHASE 3.5-CODEX: Codex rubric pass
40
+
41
+ Run after the solo CHALLENGE pass completes, before the user-facing summary.
42
+
43
+ ### Step 1 — Package the post-solo plan
44
+
45
+ Use the plan as it stands after the solo rubric pass. Package the full context Codex needs:
46
+
47
+ ```
48
+ ## Problem framing (from FRAME phase)
49
+ [problem statement, constraints, success criteria, anti-goals]
50
+
51
+ ## Confirmed facts vs assumptions
52
+ Confirmed by user: [list]
53
+ Assumptions (not yet confirmed): [list]
54
+
55
+ ## Plan (post-solo-CHALLENGE)
56
+ Vision: [one sentence]
57
+ Phase 1 ([theme]): [items, dependencies, one-line descriptions]
58
+ Phase 2 ([theme]): ...
59
+ Architecture decisions: [each with what / why / alternatives]
60
+ Deferred to backlog: [items + reason]
61
+
62
+ ## Findings from the solo rubric pass
63
+ [list each with: axis, quote, why, fix, whether applied]
64
+ ```
65
+
66
+ Include the framing and assumptions — Codex can only judge whether the plan fits the user's reality if it sees what the user actually said.
67
+
68
+ ### Step 2 — Codex challenge pass
69
+
70
+ Call `mcp__codex-cli__codex` with:
71
+ - `prompt`: the packaged context above, followed by the instructions below
72
+ - `workingDirectory`: the project root
73
+ - `sandbox`: `"read-only"`
74
+ - `model`: `"gpt-5.4"` — pass explicitly; the MCP schema default may still show `gpt-5.3-codex`
75
+ - `reasoningEffort`: `"xhigh"` — the highest setting in the Codex enum (`none < minimal < low < medium < high < xhigh`). Always pick the top level; this is the entire reason for the flag.
76
+
77
+ Instructions to append to the packaged context. **Before sending, inline the full text of `references/challenge-rubric.md` into the prompt under a `## Rubric` heading** — Codex does not have filesystem access to this project, so Claude must ship the rubric itself. Claude already has the rubric loaded from Phase 3.5 setup.
78
+
79
+ Template for the appended instructions:
80
+
81
+ ```
82
+ You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user has explicitly asked to be challenged because soft-pedaled plans waste their time.
83
+
84
+ ## Rubric
85
+ [Claude inlines the full text of references/challenge-rubric.md here]
86
+
87
+ ## Your job
88
+ - You are running AFTER a solo pass by Claude. Catch what the solo pass missed, do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" or "I would frame this differently" with a reason. Then add your own findings that the solo pass missed.
89
+ - Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
90
+ - Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let Claude surface it to the user.
91
+
92
+ End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, and a one-line explanation.
93
+ ```
94
+
95
+ ### Step 3 — Reconcile solo and Codex findings
96
+
97
+ Merge the two finding lists:
98
+ - Same finding from both → keep the more specific wording, mark "confirmed by both".
99
+ - Codex-only → prefix `[codex]` in internal notes so the user-facing summary can show where each push came from.
100
+ - Solo-only → keep as-is.
101
+ - Conflicts (solo says X, Codex says not-X) → record both, do not silently pick one; if the conflict is material, include it as an open question in the user-facing summary.
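The merge rules above can be sketched as follows (matching findings on their Quote + Axis pair is an assumption; the skill does not prescribe a key):

```javascript
// Sketch: merge solo and Codex finding lists. Material conflicts must
// still be spotted by the orchestrator; this sketch only handles
// agreement and provenance marking.
function reconcile(solo, codex) {
  const key = (f) => `${f.quote}::${f.axis}`;
  const soloKeys = new Set(solo.map(key));
  const merged = solo.map((f) =>
    codex.some((c) => key(c) === key(f))
      ? { ...f, note: "confirmed by both" } // raised by both sides
      : f // solo-only: keep as-is
  );
  for (const c of codex) {
    if (!soloKeys.has(key(c))) {
      merged.push({ ...c, note: "[codex]" }); // Codex-only: mark provenance
    }
  }
  return merged;
}
```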
102
+
103
+ If Codex raised CRITICAL or HIGH findings that the solo pass missed, apply the fixes to the plan before presenting the user-facing summary. If fixing would change something the user explicitly asked for, follow the "Respect explicit user intent" rule already loaded from the rubric: do not silently rewrite — surface it.
104
+
105
+ Do not loop. One Codex pass is enough. If the result is still FAIL after one pass, that is a signal that the plan has structural problems the user should see directly, not a signal to keep iterating in the background.
106
+
107
+ ---
108
+
109
+ ## Cost notes
110
+
111
+ - One Codex call at `reasoningEffort: "xhigh"` typically takes 30–90s and is not cheap. This integration is bounded: exactly one Codex call per ideation session.
112
+ - In Quick Add mode on a single new item, one Codex call is still worth it — small scope, huge signal, and single-item additions are exactly where workarounds slip in unnoticed.
@@ -44,8 +44,15 @@ Parse from `<preflight_config>`:
44
44
  - `--autofix` — auto-promote all findings to roadmap items and run auto-resolve on each
45
45
  - `--skip-browser` — skip browser validation
46
46
  - `--skip-docs` — skip documentation audit
47
+ - `--engine MODE` (claude) — controls which model handles audit phases. Modes:
48
+ - `claude` (default): all auditors use Claude subagents.
49
+ - `codex`: code-auditor uses Codex, docs-auditor and browser-auditor use Claude.
50
+ - `auto`: code-auditor uses Codex (SWE-bench Pro +11.7pp for code analysis), docs-auditor uses Claude (writing quality), browser-auditor uses Claude (Chrome MCP). Recommended when Codex MCP is available.
47
51
 
48
52
  Example: `/devlyn:preflight --phase 2 --skip-browser`
53
+ Example with engine: `/devlyn:preflight --engine auto`
54
+
55
+ **If `--engine` is `auto` or `codex`**: call `mcp__codex-cli__ping` to verify Codex MCP availability. If ping fails, fall back to `--engine claude` with a warning.
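The fallback behavior reduces to one check. A minimal sketch, where `codexAvailable` stands in for the result of `mcp__codex-cli__ping` (the function name is illustrative):

```javascript
// Sketch: downgrade to claude when a Codex-backed mode is requested
// but the Codex MCP server is unreachable.
function resolveEngineWithFallback(requested, codexAvailable) {
  if (requested !== "auto" && requested !== "codex") return requested;
  if (codexAvailable) return requested;
  console.warn("Codex MCP server not detected; falling back to --engine claude");
  return "claude";
}
```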
49
56
 
50
57
  ## PHASE 0: DISCOVER & SCOPE
51
58
 
@@ -128,6 +135,8 @@ Spawn all applicable auditors in parallel. Each reads `.devlyn/commitment-regist
128
135
 
129
136
  ### code-auditor (always)
130
137
 
138
+ **Engine routing**: If `--engine auto` or `--engine codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the full code-auditor prompt (read from `references/auditors/code-auditor.md`). Include the commitment registry content inline in the prompt, since Codex cannot read `.devlyn/commitment-registry.md` directly in the read-only sandbox. If `--engine claude`, spawn a Claude subagent as below.
139
+
131
140
  Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/code-auditor.md` and pass it to the subagent.
132
141
 
133
142
  The code-auditor classifies each commitment as IMPLEMENTED, MISSING, INCOMPLETE, DIVERGENT, or BROKEN — with file:line evidence. Also catches cross-feature integration gaps and constraint violations. Writes to `.devlyn/audit-code.md`.
@@ -130,9 +130,28 @@ Use the Agent Teams infrastructure:
130
130
 
131
131
  **IMPORTANT**: When spawning teammates, replace `{team-name}` in each prompt below with the actual team name you chose (e.g., `resolve-null-user-crash`). Include the relevant file paths from your Phase 1 investigation in the spawn prompt.
132
132
 
133
+ ### Engine-Routed Teammate Spawning
134
+
135
+ If the caller passed `--engine auto` or `--engine codex` (check the orchestrator's context or the pipeline config), read the auto-resolve skill's `references/engine-routing.md` for per-role routing under "team-resolve roles".
136
+
137
+ **For roles routed to Codex**: Instead of spawning a Claude Agent teammate, call `mcp__codex-cli__codex` with:
138
+ - `model`: `"gpt-5.4"`
139
+ - `reasoningEffort`: `"xhigh"`
140
+ - `sandbox`: per routing table (`"read-only"` or `"workspace-write"`)
141
+ - `workingDirectory`: project root
142
+ - `prompt`: the full teammate prompt below, with issue context and file paths included inline
143
+
144
+ Codex roles cannot use TeamCreate/SendMessage — the Team Lead (you) relays their findings to other teammates and collects their output directly from the MCP call response.
145
+
146
+ **For roles routed to Claude**: Spawn via Task tool as normal (prompts below).
147
+
148
+ **For Dual roles** (e.g., security-auditor): Run BOTH a Claude Agent teammate AND a `mcp__codex-cli__codex` call in parallel with the same prompt. Merge findings per `engine-routing.md` "How to Spawn a Dual Role" section.
149
+
150
+ If `--engine claude` or no `--engine` flag: all roles use Claude Agent teammates (current default behavior).
151
+
133
152
  ### Teammate Prompts
134
153
 
135
- When spawning each teammate via the Task tool, use these prompts:
154
+ When spawning each teammate via the Task tool (or passing to `mcp__codex-cli__codex` for Codex-routed roles), use these prompts:
136
155
 
137
156
  <root_cause_analyst_prompt>
138
157
  You are the **Root Cause Analyst** on an Agent Team resolving an issue.
@@ -66,9 +66,28 @@ Use the Agent Teams infrastructure:
66
66
 
67
67
  **IMPORTANT**: When spawning reviewers, replace `{team-name}` in each prompt below with the actual team name you chose. Include the specific changed file paths in each reviewer's spawn prompt.
68
68
 
69
+ ### Engine-Routed Reviewer Spawning
70
+
71
+ If the caller passed `--engine auto` or `--engine codex` (check the orchestrator's context or the pipeline config), read the auto-resolve skill's `references/engine-routing.md` for per-role routing under "team-review roles".
72
+
73
+ **For roles routed to Codex**: Instead of spawning a Claude Agent reviewer, call `mcp__codex-cli__codex` with:
74
+ - `model`: `"gpt-5.4"`
75
+ - `reasoningEffort`: `"xhigh"`
76
+ - `sandbox`: per routing table (`"read-only"` or `"workspace-write"`)
77
+ - `workingDirectory`: project root
78
+ - `prompt`: the full reviewer prompt below, with changed file paths and diff included inline
79
+
80
+ Codex reviewers cannot use TeamCreate/SendMessage — the Review Lead (you) collects their output directly from the MCP call response and relays cross-cutting findings to other reviewers.
81
+
82
+ **For roles routed to Claude**: Spawn via Task tool as normal (prompts below).
83
+
84
+ **For Dual roles** (e.g., security-reviewer): Run BOTH a Claude Agent reviewer AND a `mcp__codex-cli__codex` call in parallel with the same prompt. Merge findings per `engine-routing.md` "How to Spawn a Dual Role" section.
85
+
86
+ If `--engine claude` or no `--engine` flag: all roles use Claude Agent reviewers (current default behavior).
87
+
69
88
  ### Reviewer Prompts
70
89
 
71
- When spawning each reviewer via the Task tool, use these prompts:
90
+ When spawning each reviewer via the Task tool (or passing to `mcp__codex-cli__codex` for Codex-routed roles), use these prompts:
72
91
 
73
92
  <security_reviewer_prompt>
74
93
  You are the **Security Reviewer** on an Agent Team performing a code review.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "devlyn-cli",
3
- "version": "1.10.0",
3
+ "version": "1.12.0",
4
4
  "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
5
5
  "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
6
6
  "bin": {
@@ -9,6 +9,10 @@
9
9
  "files": [
10
10
  "bin",
11
11
  "config",
12
+ "!config/skills/preflight-workspace",
13
+ "!config/skills/preflight-workspace/**",
14
+ "!config/skills/devlyn:ideate-workspace",
15
+ "!config/skills/devlyn:ideate-workspace/**",
12
16
  "agents-config",
13
17
  "optional-skills",
14
18
  "CLAUDE.md"