devlyn-cli 1.10.0 → 1.12.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CLAUDE.md CHANGED
@@ -72,7 +72,8 @@ Optional flags:
  - `--skip-review` — skip team-review phase
  - `--skip-clean` — skip clean phase
  - `--skip-docs` — skip update-docs phase
- - `--with-codex [evaluate|review|both]` — use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
+ - `--engine auto|codex|claude` — intelligent model routing. `auto` routes each phase and team role to the optimal model (Claude or Codex GPT-5.4) based on benchmark data. `codex` forces Codex for implementation, Claude for evaluation. `claude` (default) uses Claude for everything. Requires codex-mcp-server.
+ - `--with-codex [evaluate|review|both]` — (legacy, superseded by `--engine`) use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
 
  ## Preflight Check (Post-Roadmap Verification)
 
@@ -91,6 +92,7 @@ Optional flags:
  - `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
  - `--skip-browser` — skip browser validation
  - `--skip-docs` — skip documentation audit
+ - `--engine auto|codex|claude` — route code-auditor to Codex (better at code analysis), docs/browser to Claude
 
  **Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)
 
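For illustration, the flag semantics above (a new `--engine` mode with `--with-codex` kept as a legacy alias) can be sketched as a small resolver. The function name and return shape are hypothetical, not part of the package:

```javascript
// Sketch: resolve --engine / --with-codex into an effective engine mode.
// --engine takes precedence; the legacy --with-codex path keeps Claude as
// builder and only adds Codex as evaluator/reviewer.
// Shapes are illustrative, not the package's actual implementation.
function resolveEngine({ engine, withCodex }) {
  const warnings = [];
  if (engine && withCodex) {
    warnings.push("--with-codex ignored: --engine takes precedence");
  }
  if (engine) return { engine, warnings };
  if (withCodex) {
    // legacy path: Codex as cross-model evaluator/reviewer only
    return { engine: "claude", legacyCodex: withCodex, warnings };
  }
  return { engine: "claude", warnings }; // default: Claude for everything
}
```

With this sketch, `--engine auto --with-codex both` resolves to `auto` plus a warning, matching the documented precedence.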
package/README.md CHANGED
@@ -100,18 +100,31 @@ Reads every commitment from your vision, roadmap, and item specs, then audits th
 
  Confirmed gaps become new roadmap items — feed them back into auto-resolve. Use `--autofix` to do this automatically, or `--phase 2` to check only one phase.
 
- ### Bonus — Dual-Model Mode with Codex
+ ### Bonus — Intelligent Model Routing with `--engine`
 
  Install the Codex MCP server during setup, then:
 
  ```
- /devlyn:auto-resolve "fix the auth bug" --with-codex
+ /devlyn:auto-resolve "fix the auth bug" --engine auto
  ```
 
- Claude builds, **OpenAI Codex evaluates independently** — two models collaborating, catching what a single model misses.
+ **`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.6 or GPT-5.4) based on benchmark data. Codex handles implementation (SWE-bench Pro +11.7pp); Claude handles evaluation and architecture review (MRCR +28pp). Security roles run both models in parallel for maximum coverage.
+
+ > `--engine auto` (recommended) · `--engine codex` (force Codex for implementation) · `--engine claude` (default, Claude only)
+
+ Also works with `/devlyn:ideate --engine auto` and `/devlyn:preflight --engine auto`.
+
+ <details>
+ <summary>Legacy: <code>--with-codex</code> (superseded by <code>--engine</code>)</summary>
+
+ ```
+ /devlyn:auto-resolve "fix the auth bug" --with-codex
+ ```
 
  > `--with-codex evaluate` (default) · `--with-codex review` · `--with-codex both`
 
+ </details>
+
  ---
 
  ## Manual Commands
@@ -210,7 +223,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 
  | Server | Description |
  |---|---|
- | `codex-cli` | Codex MCP server — enables `--with-codex` dual-model mode |
+ | `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing and legacy `--with-codex` mode |
  | `playwright` | Playwright MCP — powers browser-validate Tier 2 |
 
  </details>
@@ -33,28 +33,36 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
  - `--skip-docs` (false) — skip update-docs phase
  - `--skip-build-gate` (false) — skip the deterministic build gate (Phase 1.4). Not recommended — the build gate is the primary defense against "tests pass locally, breaks in CI/Docker/production" class of bugs.
  - `--build-gate MODE` (auto) — controls build gate behavior. `auto`: detect project type and run appropriate build/typecheck/lint commands; if Dockerfile(s) are present, Docker builds are included automatically. `strict`: auto + treat warnings as errors. `no-docker`: auto but skip Docker builds even if Dockerfiles exist (for faster iteration). `skip`: same as --skip-build-gate.
- - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques.
+ - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques. **Ignored if `--engine` is set** (engine routing subsumes this).
+ - `--engine MODE` (claude) — controls which model handles each pipeline phase and team role. Modes:
+   - `claude` (default): all phases use Claude subagents. Current behavior, no Codex calls.
+   - `codex`: Codex handles implementation/analysis phases, Claude handles orchestration, evaluation, and Chrome MCP.
+   - `auto`: each phase and team role routes to the optimal model based on benchmark data. Recommended when Codex MCP server is available. Subsumes `--with-codex both`.
 
  Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
- Codex examples: `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
+ Engine examples: `--engine auto`, `--engine codex`, `--engine claude`
+ Codex examples (legacy): `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
  If no flags are present, use defaults.
 
- 3. **If `--with-codex` is enabled**: Read `references/codex-integration.md` and run the "PRE-FLIGHT CHECK" section to verify Codex MCP server availability before proceeding.
+ 3. **If `--engine` is `auto` or `codex`**: Read `references/engine-routing.md` for the full routing table. Then call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude` (fallback), [2] Abort. If `--engine` is not set but `--with-codex` is enabled, read `references/codex-integration.md` instead and run its pre-flight check.
 
  4. Announce the pipeline plan:
  ```
  Auto-resolve pipeline starting
  Task: [extracted task description]
+ Engine: [auto / codex / claude]
  Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
  Max evaluation rounds: [N]
- Cross-model evaluation (Codex): [evaluate / review / both / disabled]
+ Cross-model evaluation (Codex): [evaluate / review / both / disabled / subsumed by --engine]
  ```
 
  ## PHASE 1: BUILD
 
+ **Engine routing**: If `--engine` is `auto` or `codex`, read `references/engine-routing.md` "How to Spawn a Codex BUILD/FIX Agent" section. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the full agent prompt below as the `prompt` parameter. If `--engine` is `claude` (default), spawn a Claude subagent as described below.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to investigate and implement the fix. The subagent does NOT have access to skills, so include all necessary instructions inline.
 
- Agent prompt — pass this to the Agent tool:
+ Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 
  Investigate and implement the following task. Work through these phases in order:
 
@@ -74,6 +82,8 @@ Read relevant files in parallel. Build a clear picture of what exists and what n
  - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
  Each teammate investigates from their perspective and sends findings back.
 
+ **Engine routing for teammates**: If the orchestrator's `--engine` is `auto` or `codex`, read `references/engine-routing.md` for per-role routing. Roles marked **Codex** are called via `mcp__codex-cli__codex` instead of spawning Agent teammates — include the full role prompt and issue context inline. Roles marked **Claude** use normal Agent teammates. Roles marked **Dual** run both in parallel and merge findings. The orchestrator relays Codex role outputs to Claude teammates that need them.
+
  **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
 
  **Phase E — Update done criteria**: Mark each criterion in `.devlyn/done-criteria.md` as satisfied. Run the full test suite.
@@ -127,9 +137,11 @@ Triggered only when PHASE 1.4 returns FAIL.
 
  Track a round counter (shared with the main fix loop counter against `max-rounds`). If `round >= max-rounds`, stop with a clear failure report — do NOT continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
 
+ **Engine routing**: Same as PHASE 2.5 FIX LOOP — if `--engine` is `auto` or `codex`, use `mcp__codex-cli__codex` with `workspace-write` and `fullAuto: true`. If `--engine` is `claude`, spawn a Claude subagent.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
 
- Agent prompt — pass this to the Agent tool:
+ Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 
  Read `.devlyn/BUILD-GATE.md` — it contains deterministic build/typecheck/lint failures from real compiler output. These are not opinions; the compiler rejected this code. Fix every listed failure at the root cause level.
 
@@ -220,7 +232,8 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
  **After the agent completes**:
  1. Read `.devlyn/EVAL-FINDINGS.md`
  2. Extract the verdict
- 3. **If `--with-codex` includes `evaluate` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
+ 3. **If `--engine` is `auto` or `codex`**: The evaluate phase always uses Claude (see `references/engine-routing.md`). When `--engine auto`, the builder was Codex — Claude evaluating Codex's work creates the GAN dynamic automatically. No separate Codex evaluation pass is needed.
+ **If `--engine` is not set and `--with-codex` includes `evaluate` or `both`** (legacy): Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
  4. Branch on verdict (from the merged findings if Codex was used):
  - `PASS` → skip to PHASE 3
  - `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
@@ -232,9 +245,11 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
 
  Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
 
+ **Engine routing**: If `--engine` is `auto` or `codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the fix prompt below. Use a fresh call each round (no sessionId reuse — sandbox/fullAuto only apply on first call of a session). If `--engine` is `claude`, spawn a Claude subagent as below.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to fix the evaluation findings.
 
- Agent prompt — pass this to the Agent tool:
+ Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
 
  Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the evaluator returns PASS — there is no "shippable with issues" shortcut.
 
@@ -268,11 +283,14 @@ Agent prompt — pass this to the Agent tool:
 
  Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes.
 
+ **Engine routing for reviewers**: If the orchestrator passed `--engine auto` or `--engine codex`, read `references/engine-routing.md` for per-role routing in the "team-review roles" table. Route each reviewer to Claude Agent or `mcp__codex-cli__codex` accordingly. For Dual roles (security-reviewer), run both models in parallel and merge findings per the "How to Spawn a Dual Role" section. For `--engine claude`, all reviewers are Claude Agent teammates.
+
  Each reviewer evaluates from their perspective, sends findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW). After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
 
  Clean up the team after completion.
 
- **If `--with-codex` includes `review` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
+ **If `--engine` is set**: engine routing already handles cross-model review via per-role routing — skip the legacy `--with-codex` review step below.
+ **If `--with-codex` includes `review` or `both`** (legacy, only when `--engine` is not set): Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
 
  **After the review phase completes**:
  1. If CRITICAL issues remain unfixed, log a warning in the final report
@@ -403,6 +421,7 @@ After all phases complete:
  ### Auto-Resolve Pipeline Complete
 
  **Task**: [original task description]
+ **Engine**: [auto / codex / claude — if auto, note which phases used which model]
 
  **Pipeline Summary**:
  | Phase | Status | Notes |
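The shared round counter described in the FIX LOOP phases above can be sketched as follows. `evaluate` and `fix` are hypothetical stand-ins for the real evaluate phase and the per-round fix agent (a Claude subagent, or a fresh `mcp__codex-cli__codex` call when the engine routes to Codex):

```javascript
// Sketch of the fix-loop control flow: re-evaluate after each fix round,
// stop on PASS or when the round counter reaches max-rounds.
// `evaluate` returns a verdict string; `fix` performs one fix round
// (fresh call each round, no sessionId reuse).
function runFixLoop(evaluate, fix, maxRounds) {
  let round = 0;
  let verdict = evaluate();
  while (verdict !== "PASS" && round < maxRounds) {
    round += 1;
    fix(round);
    verdict = evaluate();
  }
  // unresolved findings remain if we hit the cap without a PASS
  return { verdict, rounds: round, unresolved: verdict !== "PASS" };
}
```

As in the pipeline, hitting the cap does not silently succeed: the caller sees `unresolved: true` and can attach the warning to the final report.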
@@ -1,6 +1,8 @@
- # Codex Cross-Model Integration
+ # Codex Cross-Model Integration (Legacy)
 
- Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline. Only read this file when `--with-codex` is enabled.
+ > **Note**: This file is the legacy `--with-codex` integration. For the newer `--engine` flag (which subsumes `--with-codex`), see `references/engine-routing.md`. Only read this file when `--with-codex` is enabled AND `--engine` is NOT set.
+
+ Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline.
 
  Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). This creates a GAN-like adversarial dynamic — Claude builds and Codex critiques, reducing shared blind spots between model families.
 
@@ -37,7 +39,8 @@ Call `mcp__codex-cli__codex` with:
  - `prompt`: Include the full content of `.devlyn/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
  - `workingDirectory`: the project root
  - `sandbox`: `"read-only"` (Codex should only read, not modify files)
- - `reasoningEffort`: `"high"`
+ - `reasoningEffort`: `"high"` (note: for `--engine auto`, `engine-routing.md` defaults to `"xhigh"`)
+ - `model`: `"gpt-5.4"` (pass explicitly — the MCP schema default may be outdated)
 
  Example prompt to pass:
  ```
@@ -0,0 +1,205 @@
+ # Engine Routing: Intelligent Model Selection
+
+ Instructions for routing work to the optimal model (Claude or Codex) per role and phase. Only read this file when `--engine` is set to `auto` or `codex`.
+
+ The routing table below is derived from published benchmarks (April 2026) comparing Claude Opus 4.6 and GPT-5.4 across task-relevant dimensions. The principle: each role's work goes to the model that objectively performs better at that task type.
+
+ ---
+
+ ## Benchmark Basis
+
+ | Dimension | Claude Opus 4.6 | GPT-5.4 | Gap | Source |
+ |-----------|-----------------|---------|-----|--------|
+ | Long-context retrieval (256k) | 92% | ~64% | Claude +28pp | MRCR v2 |
+ | Graduate-level reasoning | 87.4% | 83.9% | Claude +3.5pp | GPQA Diamond |
+ | Hard coding problems | ~46% | 57.7% | Codex +11.7pp | SWE-bench Pro |
+ | Function-level code gen | 90.4% | 93.1% | Codex +2.7pp | HumanEval |
+ | Terminal/CLI tasks | 65.4% | 75.1% | Codex +9.7pp | Terminal-Bench 2.0 |
+ | Real-world issue resolution | ~80% | ~80% | Tied | SWE-bench Verified |
+ | Security vulnerability detection | — | — | Tied | Semgrep 2025 study |
+ | Agentic computer use | 72.7% | 75.0% | Codex +2.3pp | OSWorld |
+ | Ambiguous intent handling | Preferred by 70% devs | — | Claude | Developer surveys |
+
+ ---
+
+ ## Codex Call Defaults
+
+ Every Codex call in this file uses these defaults unless stated otherwise:
+
+ ```
+ model: "gpt-5.4"
+ reasoningEffort: "xhigh"
+ sandbox: varies per role (see table)
+ workingDirectory: project root
+ ```
+
+ The `model` field accepts any string — pass `"gpt-5.4"` even if the MCP schema lists older defaults. The Codex CLI resolves it.
+
+ ---
+
+ ## Role Routing Table
+
+ ### team-resolve roles
+
+ | Role | Engine | Sandbox | Rationale |
+ |------|--------|---------|-----------|
+ | root-cause-analyst | **Claude** | — | A/B test: Claude traced git history (15 tool calls) finding exact commit + unchecked migration plan. Codex analyzed structure well but lacked git history depth. Tool access > SWE-bench Pro advantage for this role. |
+ | test-engineer | **Codex** | workspace-write | Test code generation = HumanEval (+2.7pp), needs file write |
+ | security-auditor | **Dual** | read-only | Semgrep: both find unique vulns; GAN > single model |
+ | implementation-planner | **Codex** | read-only | Implementation planning = SWE-bench Pro (+11.7pp) |
+ | product-designer | **Claude** | — | Ambiguous requirements, user intent = Claude strength |
+ | ui-designer | **Claude** | — | Visual spec, design reasoning = non-coding task |
+ | ux-designer | **Claude** | — | User flow analysis = ambiguous intent handling |
+ | accessibility-auditor | **Claude** | — | A/B test: Claude found 12 issues (1 CRITICAL) vs Codex 4. WCAG auditing requires thoroughness and domain knowledge depth, not code generation speed. Claude 3x coverage. |
+ | product-analyst | **Claude** | — | Requirements clarity, scope judgment = ambiguity handling |
+ | architecture-reviewer | **Claude** | — | Codebase-wide pattern review = MRCR long-context (+28pp) |
+ | performance-engineer | **Codex** | read-only | Terminal tasks + algorithm analysis = Terminal-Bench (+9.7pp) |
+ | api-designer | **Dual** | read-only | A/B test: Claude found 9 issues, Codex found 6, with unique findings on both sides (Claude: --version, exit codes; Codex: YAML folded scalar parsing bug). Dual maximizes coverage for API surface review. |
+
+ ### team-review roles
+
+ | Role | Engine | Sandbox | Rationale |
+ |------|--------|---------|-----------|
+ | security-reviewer | **Dual** | read-only | Same as team-resolve security-auditor |
+ | quality-reviewer | **Dual** | read-only | A/B test: Claude found 14 issues (2 HIGH), Codex found 11 (3 HIGH), only ~6 overlap. Dual yields ~19 unique findings (+36-73% coverage). Both models find HIGH-severity issues the other misses. |
+ | test-analyst | **Codex** | workspace-write | Test gap analysis + test code suggestions |
+ | ux-reviewer | **Claude** | — | UX flow assessment = ambiguity handling |
+ | ui-reviewer | **Claude** | — | Design token consistency = non-coding task |
+ | accessibility-reviewer | **Claude** | — | Same rationale as team-resolve accessibility-auditor: Claude 3x finding coverage on WCAG audits |
+ | product-validator | **Claude** | — | Business logic intent = ambiguity handling |
+ | api-reviewer | **Dual** | read-only | Same rationale as team-resolve api-designer: both models find unique API issues |
+ | performance-reviewer | **Codex** | read-only | Algorithm complexity = Terminal-Bench (+9.7pp) |
+
+ ### Summary distribution
+
+ | Engine | team-resolve (12) | team-review (9) | Total |
+ |--------|-------------------|-----------------|-------|
+ | Claude | 7 | 4 | 11 |
+ | Codex | 2 | 2 | 4 |
+ | Dual | 3 | 3 | 6 |
+
+ ---
+
+ ## Pipeline Phase Routing (auto-resolve)
+
+ | Phase | --engine auto | --engine codex | --engine claude |
+ |-------|--------------|----------------|-----------------|
+ | BUILD (implementation) | **Codex** | Codex | Claude |
+ | BUILD GATE | bash (model-agnostic) | bash | bash |
+ | BROWSER VALIDATE | Claude (Chrome MCP only) | Claude | Claude |
+ | EVALUATE | **Claude** | Claude | Claude |
+ | FIX LOOP | **Codex** | Codex | Claude |
+ | SIMPLIFY | Claude | Codex | Claude |
+ | REVIEW (team) | **Mixed per table** | Codex all | Claude all |
+ | CHALLENGE | **Claude** | Claude | Claude |
+ | SECURITY REVIEW | **Dual** | Codex | Claude |
+ | CLEAN | Claude | Codex | Claude |
+ | DOCS | Claude | Codex | Claude |
+
+ Rationale for `--engine auto` choices:
+ - BUILD/FIX: Codex — SWE-bench Pro 57.7% vs 46%. The biggest model gap is in hard coding tasks.
+ - EVALUATE/CHALLENGE: Claude — evaluating a full diff requires long-context retrieval (MRCR +28pp) and skeptical reasoning (GPQA +3.5pp). Different model family from builder creates GAN dynamic.
+ - BROWSER: Claude — Chrome MCP tools are Claude Code session-bound.
+ - SECURITY: Dual — Semgrep study shows both models find unique vulnerabilities.
+
+ ---
+
+ ## Pipeline Phase Routing (ideate)
+
+ | Phase | --engine auto | --engine codex | --engine claude |
+ |-------|--------------|----------------|-----------------|
+ | FRAME | **Claude** | Codex | Claude |
+ | EXPLORE | **Claude** | Codex | Claude |
+ | CONVERGE | **Claude** | Codex | Claude |
+ | CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
+ | DOCUMENT | **Claude** | Codex | Claude |
+
+ Rationale:
+ - FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
+ - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic (same role as `--with-codex` but automatic). When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
+ - DOCUMENT: Claude — writing quality for spec generation.
+
+ ---
+
+ ## Pipeline Phase Routing (preflight)
+
+ | Phase | --engine auto | --engine codex | --engine claude |
+ |-------|--------------|----------------|-----------------|
+ | EXTRACT COMMITMENTS | Claude | Codex | Claude |
+ | CODE AUDIT | **Codex** | Codex | Claude |
+ | DOCS AUDIT | **Claude** | Codex | Claude |
+ | BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
+ | SYNTHESIZE | Claude | Claude | Claude |
+
+ ---
+
+ ## How to Spawn a Codex Role
+
+ For roles marked **Codex** in the routing table, call `mcp__codex-cli__codex` instead of spawning a Claude Agent subagent. Package the role's full prompt (from the skill's teammate prompt section) into the Codex call.
+
+ Template:
+
+ ```
+ mcp__codex-cli__codex({
+   prompt: "[full role prompt with issue context, file paths, and deliverable format]",
+   model: "gpt-5.4",
+   reasoningEffort: "xhigh",
+   sandbox: "[read-only or workspace-write per table]",
+   workingDirectory: "[project root]"
+ })
+ ```
+
+ **Important**: Codex has no access to team infrastructure (TeamCreate, SendMessage, TaskCreate). For Codex roles:
+ - Include ALL context inline in the prompt (issue description, file paths from investigation, deliverable format)
+ - The orchestrator collects Codex's response and routes it where it would have gone via SendMessage
+ - Codex roles cannot communicate with other teammates directly — the orchestrator relays findings
+
+ For roles marked **Claude**, spawn a normal Agent subagent as before.
+
+ ---
+
+ ## How to Spawn a Dual Role
+
+ For roles marked **Dual**, run BOTH models in parallel and merge findings:
+
+ 1. Spawn a Claude Agent subagent with the role's prompt
+ 2. Call `mcp__codex-cli__codex` with the same role's prompt (sandbox: "read-only")
+ 3. Wait for both to complete
+ 4. Merge findings:
+    - Same finding from both → keep more detailed description, mark "confirmed by both models"
+    - Claude-only → keep as-is
+    - Codex-only → prefix with `[codex]`
+    - Conflicting findings → keep both, note the disagreement
+    - Take the MORE SEVERE verdict between the two
+
+ ---
+
+ ## How to Spawn a Codex BUILD/FIX Agent
+
+ For BUILD and FIX LOOP phases when engine routes to Codex:
+
+ ```
+ mcp__codex-cli__codex({
+   prompt: "[full build/fix prompt with task description, done criteria, and implementation instructions]",
+   model: "gpt-5.4",
+   reasoningEffort: "xhigh",
+   sandbox: "workspace-write",
+   fullAuto: true,
+   workingDirectory: "[project root]"
+ })
+ ```
+
+ **After Codex completes**: verify changes were made (`git diff --stat`), then proceed to the next phase as normal. The file-based handoff (`.devlyn/done-criteria.md`, `.devlyn/EVAL-FINDINGS.md`, etc.) works identically — Codex writes the same files Claude would.
+
+ **Session management**: For FIX LOOP iterations, use a fresh call each time (no `sessionId` reuse) because sandbox/fullAuto parameters only apply on the first call of a session.
+
+ ---
+
+ ## Override Behavior
+
+ - `--engine claude` → all roles and phases use Claude (current default behavior, no Codex calls)
+ - `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
+ - `--engine auto` → each role and phase routes to the optimal model per this table
+ - `--engine auto` is the recommended default when Codex MCP server is available
+
+ `--engine` and `--with-codex` are **mutually exclusive**. `--engine auto` subsumes `--with-codex both` — it uses Codex where it's optimal (broader than just evaluate/review). If both flags are passed, `--engine` takes precedence and `--with-codex` is ignored with a warning.
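The Dual-role merge rules in "How to Spawn a Dual Role" can be sketched as a small function. The finding shape and the severity-based verdict are illustrative simplifications of the documented rules (keep the more detailed shared finding, prefix Codex-only findings, take the more severe verdict):

```javascript
// Sketch of the Dual-role merge: findings are { id, severity, detail }.
// Shared findings keep the more detailed description and are marked confirmed;
// Codex-only findings get a [codex] prefix. Shapes are illustrative.
const SEVERITY_RANK = { LOW: 0, MEDIUM: 1, HIGH: 2, CRITICAL: 3 };

function mergeFindings(claude, codex) {
  const merged = [];
  const codexById = new Map(codex.map((f) => [f.id, f]));
  for (const c of claude) {
    const twin = codexById.get(c.id);
    if (twin) {
      codexById.delete(c.id);
      // same finding from both: keep the more detailed description
      const richer = twin.detail.length > c.detail.length ? twin : c;
      merged.push({ ...richer, confirmedByBoth: true });
    } else {
      merged.push(c); // Claude-only: keep as-is
    }
  }
  for (const f of codexById.values()) {
    merged.push({ ...f, detail: `[codex] ${f.detail}` }); // Codex-only
  }
  // take the more severe verdict between the two models' worst findings
  const worst = (list) =>
    list.reduce((m, f) => Math.max(m, SEVERITY_RANK[f.severity]), 0);
  const rank = Math.max(worst(claude), worst(codex));
  const verdict = Object.keys(SEVERITY_RANK).find((k) => SEVERITY_RANK[k] === rank);
  return { merged, verdict };
}
```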
@@ -1,6 +1,6 @@
  ---
  name: devlyn:ideate
- description: Transform unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and multi-perspective synthesis. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) that eliminates context pollution in the implementation pipeline. Use when the user wants to brainstorm, plan a new project or feature set, explore possibilities, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring. Also triggers when the user shares links or resources for a new initiative and needs them synthesized into a plan, or wants to update an existing roadmap with new ideas.
+ description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline. Optional --with-codex flag adds OpenAI Codex as a cross-model critic. Use when the user wants to brainstorm, plan a new project or feature set, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring.
  ---
 
  # Ideation to Implementation Bridge
@@ -20,6 +20,20 @@ Concretely:
  - If you catch yourself about to open a source file to make a code change, stop — that's a signal you've left ideation mode
  </hard_boundary>
 
+ ## Arguments
+
+ Parse these from the user's invocation message:
+
+ - `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire reason for the flag is maximum reasoning from a second model family. **Ignored if `--engine` is set** (engine routing subsumes this).
+ - `--engine MODE` (claude) — controls which model handles each ideation phase. Modes:
+   - `claude` (default): all phases use Claude. Current behavior.
+   - `codex`: Codex handles FRAME/EXPLORE/CONVERGE/DOCUMENT, Claude runs CHALLENGE (role reversal — builder and critic are always different models).
+   - `auto`: Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic). Subsumes `--with-codex`. Recommended when Codex MCP is available.
+
+ **If `--engine` is `auto` or `codex`**: call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort. Also read `references/challenge-rubric.md` up front. The engine routing table is defined in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)".
+
+ **If `--engine` is not set and `--with-codex` is set** (legacy): read `references/challenge-rubric.md` and `references/codex-debate.md` up front, then run the pre-flight check described in `codex-debate.md` to verify the Codex MCP server is available before starting the pipeline. If the server is unavailable and the user opts to continue without Codex, the solo CHALLENGE pass still runs — only the cross-model rubric pass is disabled.
+
  <why_this_matters>
  When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
  - Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
@@ -271,8 +285,51 @@ Within each phase:
  ### Architecture Decisions
  Surface decisions that affect multiple items — technology choices, data model, integration approaches, UX patterns. For each: **What** was decided, **Why** (tradeoffs), and **What alternatives** were considered. These become decision records.
 
- ### Confirmation
- Before generating documents, present a final summary:
+ ### Internal draft — do not show the user yet
+
+ At this point you have an internal convergence draft: themes, phases, items, decisions. **Do not present it to the user yet.** Phase 3.5 CHALLENGE runs next, and the user will see exactly one summary — the post-challenge plan, with visibility into what CHALLENGE changed. Showing the pre-challenge draft first and then changing it after challenge creates a two-round confirmation loop that burns the user's trust.
+
+ ## Phase 3.5: CHALLENGE
+
+ <phase_goal>Apply a strict 5-axis rubric to the internal convergence draft, then present one post-challenge summary to the user for confirmation. Always runs.</phase_goal>
+
+ <thinking_effort>
+ Engage maximum thinking effort here — both the solo rubric pass and, if enabled, the Codex pass. Use extended thinking ("ultrathink") when reading each item, applying each axis, and producing revisions. The default Claude failure mode in self-review is nodding along to the draft you just produced; shallow thinking here is the exact pattern this phase exists to prevent.
+
+ Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
300
+ </thinking_effort>
301
+
302
+ The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
303
+
304
+ ### The rubric — single source of truth
305
+
306
+ Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
307
+
308
+ ### Solo pass (always runs)
309
+
310
+ Apply the rubric to the internal convergence draft. Produce findings in the format specified in `challenge-rubric.md` (Severity / Quote / Axis / Why / Fix).
311
+
312
+ For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
313
+
314
+ If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
315
+
316
+ ### Codex pass (engine-routed or legacy `--with-codex`)
317
+
318
+ **If `--engine auto`**: Codex runs the CHALLENGE rubric pass automatically. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the packaged plan + rubric as prompt (same format as `codex-debate.md` Step 2). Reconcile findings: same finding from both → "confirmed by both", Codex-only → prefix `[codex]`.
319
+
320
+ **If `--engine codex`**: Role reversal — Codex built the plan (FRAME/EXPLORE/CONVERGE/DOCUMENT), so Claude runs the solo CHALLENGE pass. Do NOT also run Codex on CHALLENGE — builder and critic must be different models. Skip this section entirely.
321
+
322
+ **If `--engine claude` or `--engine` not set, and `--with-codex` is set** (legacy): follow `references/codex-debate.md` "PHASE 3.5-CODEX" section. Codex applies the rubric from `challenge-rubric.md` independently at `reasoningEffort: "xhigh"`. Reconcile findings as `codex-debate.md` describes — findings raised by both sides get "confirmed by both", Codex-only findings get prefixed `[codex]` in internal notes so the user can see where each push came from.
323
+
324
+ ### Respect explicit user intent
325
+
326
+ The rubric is a quality lens, not an override. If a finding conflicts with something the user explicitly and clearly asked for, follow the "Hard rule" section in `challenge-rubric.md`: record the finding, **do not silently rewrite the plan**, and surface it as an open question in the summary below. The user makes the call.
327
+
328
+ ### User-facing summary (the first and only time the user sees the plan)
329
+
330
+ After the rubric pass(es), present the post-challenge plan to the user for confirmation. This is the first time the user sees the converged plan — by design, so they see a rubric-checked result rather than a draft that immediately gets revised.
331
+
332
+ Format:
276
333
  ```
277
334
  Vision: [one sentence]
278
335
  Phases: [N] phases, [M] total items
@@ -280,9 +337,30 @@ Phase 1 ([theme]): [items with brief descriptions]
280
337
  Phase 2 ([theme]): [items]
281
338
  Key decisions: [list]
282
339
  Deferred: [items with reasons]
340
+
341
+ ## CHALLENGE results
342
+
343
+ Solo pass: [N findings, M applied]
344
+ Codex pass: [N findings, M applied] ← only if a Codex pass ran (--engine auto or legacy --with-codex)
345
+
346
+ Changes applied during CHALLENGE:
347
+ - [item]: [what changed and which axis triggered it]
348
+
349
+ Open questions for you (rubric flagged something you explicitly asked for):
350
+ - [item]: rubric says [finding]; you asked for [original]; here is the tradeoff — proceed as-is, or adopt the alternative?
283
351
  ```
284
352
 
285
- Get explicit confirmation before proceeding to document generation.
353
+ Get explicit confirmation before proceeding to DOCUMENT.
354
+
355
+ ### Quick Add mode
356
+
357
+ For single-item additions, run one solo rubric pass on just the new item. Even then do not skip — single-item additions are exactly where overengineering and workarounds slip in unnoticed, because the lack of surrounding context makes a bad item look self-contained and harmless.
358
+
359
+ ## Engine Routing for FRAME / EXPLORE / CONVERGE / DOCUMENT
360
+
361
+ **If `--engine codex`**: Phases 1-3 and Phase 4 are delegated to Codex. For each phase, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, and the phase instructions + user context as the prompt. Use `sessionId` to maintain conversational context across phases (note: sandbox/fullAuto only apply on the first call). Claude remains the orchestrator — it reads Codex's output, manages the conversation with the user (confirmation prompts, clarifying questions), and routes findings between phases.
362
+
363
+ **If `--engine auto` or `--engine claude`**: All planning phases use Claude directly (current behavior). Claude's stronger handling of ambiguous intent and its writing-quality benchmarks favor it for planning tasks.
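The per-phase delegation call can be sketched as a payload builder (the function itself is illustrative; the parameter names match the MCP tool parameters described above):

```javascript
// Sketch: build the mcp__codex-cli__codex payload for one delegated phase.
// Only the first call in a session carries sandbox (it applies once);
// later calls pass sessionId so Codex keeps context across phases.
function buildPhaseCall(phase, context, sessionId = null) {
  const payload = {
    model: "gpt-5.4",
    reasoningEffort: "xhigh",
    workingDirectory: context.projectRoot,
    prompt: `${phase.instructions}\n\n## User context\n${context.userContext}`,
  };
  if (sessionId === null) {
    payload.sandbox = "workspace-write"; // first call only
  } else {
    payload.sessionId = sessionId; // continue the same Codex session
  }
  return payload;
}
```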
286
364
 
287
365
  ## Phase 4: DOCUMENT
288
366
 
@@ -376,6 +454,9 @@ Before finalizing, verify:
376
454
  - [ ] No spec requires reading VISION.md to be understood (self-contained)
377
455
  - [ ] Dependencies between items are documented in both specs
378
456
  - [ ] Architecture decisions include reasoning and alternatives considered
457
+ - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex if `--engine auto` or legacy `--with-codex` enabled it); no item still fails any axis at CRITICAL or HIGH severity
458
+ - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
459
+ - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
379
460
 
380
461
  ## Language
381
462
 
@@ -0,0 +1,122 @@
1
+ # CHALLENGE Rubric (single source of truth)
2
+
3
+ ## Contents
4
+ - Context — this is a planning rubric
5
+ - The 5 axes (NO WORKAROUND, NO GUESSWORK, NO OVERENGINEERING, WORLD-CLASS BEST PRACTICE, OPTIMIZED)
6
+ - Hard rule — respect explicit user intent
7
+ - Finding format
8
+ - Examples (good vs bad findings, plus a detour-sequencing example)
9
+
10
+ The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex pass (enabled by `--engine auto` or legacy `--with-codex`) use this file — there is exactly one definition of the rubric, and SKILL.md points both paths directly at this file.
11
+
12
+ The rubric exists because plans produced in a single pass, by a single model, in a single conversation almost always fail at least one axis somewhere. The user's historical experience: every time they asked "is this really no-workaround, no-guesswork, no-overengineering, world-class, optimized?", the honest answer was no. This phase makes the answer honestly yes before the user even has to ask.
13
+
14
+ ## Context — this is a PLANNING rubric, not a code rubric
15
+
16
+ This rubric judges the shape of the roadmap: what items exist, in what order, why. It does NOT judge implementation details, code style, or abstractions in code. "Overengineering" here means overengineering the plan, not overengineering a function. When applying it, keep asking: *is this the most direct, optimized path from the user's stated problem to a working outcome?*
17
+
18
+ ## The 5 axes
19
+
20
+ ### 1. NO WORKAROUND
21
+
22
+ Does the item solve the actual problem directly, or does it route around a missing capability? If the direct path is "build X" and the item is "work around not having X", it fails.
23
+
24
+ Canonical failure pattern: the user asks for a feature that papers over a missing foundation. Building the feature adds an item to the plan without solving the real problem, and often makes the real problem harder to fix later.
25
+
26
+ ### 2. NO GUESSWORK
27
+
28
+ Every requirement must be grounded in something the user explicitly confirmed, or in something verifiable from the problem framing. Silent assumptions, "I think the user probably wants...", and requirements invented to fill gaps all fail.
29
+
30
+ Canonical failure pattern: vague user input ("improve the dashboard") leads to a fully-specified plan full of invented detail. Correct handling is to mark every assumed fact as [ASSUMED], ask clarifying questions, and keep the plan minimal until the user fills in the gaps.
31
+
32
+ ### 3. NO OVERENGINEERING (planning-stage)
33
+
34
+ The plan fails this axis when it contains any of:
35
+
36
+ - **Luxury items** — polish, theming, animations, nice-to-haves that do not serve the stated problem. A polish/theming item in Phase 1 of a tool that does not yet solve its core job.
37
+ - **Filler items** — items added to pad a phase or make the plan feel complete. If an item has no testable requirement a real user would notice if absent, it is filler.
38
+ - **Detour sequencing** — the plan takes the long way around when a direct route exists. Three items building toward X when one item could deliver X. Separate scaffold / store / deploy items when they could be bundled into the actual feature they enable.
39
+ - **Roadmap workarounds masquerading as features** — see axis 1. The same failure can fire on axis 1 (paper-over) and axis 3 (padding the roadmap with the workaround).
40
+
41
+ The question to ask for every item: *"Is this the most direct, optimized path to the stated goal, or are we decorating / detouring / papering over?"*
42
+
43
+ ### 4. WORLD-CLASS BEST PRACTICE
44
+
45
+ Would a senior team at a top company structure the roadmap this way for this kind of product today? If a known-good pattern exists for sequencing or decomposing this kind of problem, name it and use it.
46
+
47
+ Canonical failure pattern: the plan uses a familiar-but-mediocre decomposition when a better-known-good pattern exists for the specific problem type. Example: using manual export/import for cross-device sync when autosave + cloud draft storage is the standard pattern across mainstream editing tools (Notion, Linear, Gmail, Google Docs).
48
+
49
+ ### 5. OPTIMIZED
50
+
51
+ Does the sequencing minimize wait time, front-load risk, and ship user-visible value at every phase boundary? Dead phases — phases that are pure setup with no visible win for a real user — are a fail.
52
+
53
+ Canonical failure pattern: Phase 1 is entirely infrastructure (scaffold, models, deploy) and the first user-facing win arrives in Phase 2. Better: Phase 1 ships one thin vertical slice that a real user can use, even if it is small.
54
+
55
+ ## Hard rule — respect explicit user intent
56
+
57
+ The rubric is a tool to prevent drift from quality, not a tool to override the user. If the user has explicitly and clearly stated a preference ("I want X, not Y"), the rubric does not silently replace X with Y. Instead:
58
+
59
+ - Run the rubric as normal.
60
+ - If an axis flags X, do not rewrite the plan. Record the finding and surface it to the user as an open question: "The rubric flags X on [axis] because [reason]. You explicitly asked for X — confirm you want to proceed, or consider [alternative]."
61
+ - The user makes the call. The rubric's job is to make the tradeoff visible, not to make the decision.
62
+
63
+ This rule exists because the 5-axis rubric is an opinionated lens, and opinionated lenses are wrong sometimes. The user's stated intent is ground truth when it is explicit. The rubric is ground truth only for things the user did not explicitly decide.
64
+
65
+ ## Finding format
66
+
67
+ For every item that fails any axis, produce a finding in this exact format:
68
+
69
+ ```
70
+ Severity: CRITICAL / HIGH / MEDIUM / LOW
71
+ Quote: [copy the specific item title or line you are critiquing — one line]
72
+ Axis: [which of the five]
73
+ Why it fails: [one sentence]
74
+ Fix: [one concrete revision — not "reconsider X", say what to do instead]
75
+ ```
76
+
77
+ For the plan as a whole, give a one-line pass/fail per axis with one-sentence reasoning.
78
+
79
+ End with a verdict: `PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED`.
80
+
81
+ The Quote field is load-bearing. It anchors each finding to a specific line in the plan, which prevents the common failure mode of generic unanchored critiques ("too much in Phase 1", "consider refactoring"). Anchored findings are actionable; unanchored findings are noise.
82
+
83
+ ## Examples
84
+
85
+ <example>
86
+ BAD finding (too vague, not actionable):
87
+ Severity: HIGH
88
+ Axis: NO OVERENGINEERING
89
+ Why: Phase 1 has too much.
90
+ Fix: Reduce scope.
91
+
92
+ GOOD finding (anchored, specific, actionable):
93
+ Severity: HIGH
94
+ Quote: "1.3 — Theme customization (light/dark/custom accent colors)"
95
+ Axis: NO OVERENGINEERING (luxury item)
96
+ Why it fails: The product does not yet solve its core job of letting users save a session; theming is a decoration item that does not move the primary problem forward.
97
+ Fix: Move 1.3 to backlog. Phase 1 is shorter by one item. Revisit theming only after the core save flow is shipped and used.
98
+ </example>
99
+
100
+ <example>
101
+ BAD finding:
102
+ Severity: HIGH
103
+ Axis: NO WORKAROUND
104
+ Why: Item 2.1 is a workaround.
105
+ Fix: Do it properly.
106
+
107
+ GOOD finding:
108
+ Severity: CRITICAL
109
+ Quote: "2.1 — Export/import session as JSON file so users can move work between devices"
110
+ Axis: NO WORKAROUND
111
+ Why it fails: The real problem is cross-device sync. File export is a roadmap workaround that asks the user to do the sync manually; it adds an item to the plan without solving the stated problem, and makes the real problem harder to fix later.
112
+ Fix: Replace 2.1 with "Cloud-backed session storage" as a direct cross-device solution. If cloud storage is out of scope for the current phase, explicitly defer cross-device sync to a later phase rather than shipping a manual workaround as if it were the feature.
113
+ </example>
114
+
115
+ <example>
116
+ Detour sequencing finding:
117
+ Severity: MEDIUM
118
+ Quote: "Phase 1: 1.1-scaffold, 1.2-data-store, 1.3-log-today, 1.4-streak-display, 1.5-history-view, 1.6-manage-habits, 1.7-deploy"
119
+ Axis: NO OVERENGINEERING (detour sequencing)
120
+ Why it fails: Scaffold, data store, streak display, and deploy are not features a user would notice as separate items — they are implementation steps of the three actual user capabilities (log a habit, see streak, see history). Splitting them into standalone roadmap items pads the plan without delivering value at each boundary.
121
+ Fix: Collapse Phase 1 to 2 items: "1.1 — Log a habit and see streak" (bundles scaffold + store + log + streak), "1.2 — History view". Deploy is part of each item's done criteria, not a standalone item. Result: 7 items → 2 items, same delivered scope.
122
+ </example>
@@ -0,0 +1,112 @@
1
+ # Codex Cross-Model Rubric Pass (Legacy)
2
+
3
+ > **Note**: This file is the legacy `--with-codex` integration for ideate. For the newer `--engine` flag (which subsumes `--with-codex`), see the engine routing section in SKILL.md. Only read this file when `--with-codex` is set AND `--engine` is NOT set.
4
+
5
+ ## Contents
6
+ - Pre-flight check (verify Codex MCP server availability)
7
+ - PHASE 3.5-CODEX: packaging the plan, calling Codex, reconciling findings with the solo pass
8
+ - Cost notes (one Codex call per ideation session)
9
+
10
+ Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE. The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
11
+
12
+ Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). The intent: one opinionated rubric pass from a different model family, applied right before the user sees the plan. Two model families catch different blind spots; one pass at maximum effort catches more than multiple shallow passes.
13
+
14
+ **Always use `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model. Pass `model: "gpt-5.4"` explicitly, because the MCP schema default may be outdated.
15
+
16
+ ---
17
+
18
+ ## PRE-FLIGHT CHECK
19
+
20
+ Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
21
+
22
+ - **If ping succeeds**: continue.
23
+ - **If ping fails or `mcp__codex-cli__ping` is not found**: warn the user and ask:
24
+ ```
25
+ ⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
26
+
27
+ To install:
28
+ npm i -g @openai/codex
29
+ claude mcp add codex-cli -- npx -y codex-mcp-server
30
+
31
+ Options:
32
+ [1] Continue without --with-codex (Claude-only solo CHALLENGE pass)
33
+ [2] Abort
34
+ ```
35
+ If [1], disable `--with-codex` and continue with the solo CHALLENGE. If [2], stop.
36
+
37
+ ---
38
+
39
+ ## PHASE 3.5-CODEX: Codex rubric pass
40
+
41
+ Run after the solo CHALLENGE pass completes, before the user-facing summary.
42
+
43
+ ### Step 1 — Package the post-solo plan
44
+
45
+ Use the plan as it stands after the solo rubric pass. Package the full context Codex needs:
46
+
47
+ ```
48
+ ## Problem framing (from FRAME phase)
49
+ [problem statement, constraints, success criteria, anti-goals]
50
+
51
+ ## Confirmed facts vs assumptions
52
+ Confirmed by user: [list]
53
+ Assumptions (not yet confirmed): [list]
54
+
55
+ ## Plan (post-solo-CHALLENGE)
56
+ Vision: [one sentence]
57
+ Phase 1 ([theme]): [items, dependencies, one-line descriptions]
58
+ Phase 2 ([theme]): ...
59
+ Architecture decisions: [each with what / why / alternatives]
60
+ Deferred to backlog: [items + reason]
61
+
62
+ ## Findings from the solo rubric pass
63
+ [list each with: axis, quote, why, fix, whether applied]
64
+ ```
65
+
66
+ Include the framing and assumptions — Codex can only judge whether the plan fits the user's reality if it sees what the user actually said.
67
+
68
+ ### Step 2 — Codex challenge pass
69
+
70
+ Call `mcp__codex-cli__codex` with:
71
+ - `prompt`: the packaged context above, followed by the instructions below
72
+ - `workingDirectory`: the project root
73
+ - `sandbox`: `"read-only"`
74
+ - `model`: `"gpt-5.4"` — pass explicitly; the MCP schema default may still show `gpt-5.3-codex`
75
+ - `reasoningEffort`: `"xhigh"` — the highest setting in the Codex enum (`none < minimal < low < medium < high < xhigh`). Always pick the top level; this is the entire reason for the flag.
76
+
77
+ Instructions to append to the packaged context. **Before sending, inline the full text of `references/challenge-rubric.md` into the prompt under a `## Rubric` heading** — Codex does not have filesystem access to this project, so Claude must ship the rubric itself. Claude already has the rubric loaded from Phase 3.5 setup.
78
+
79
+ Template for the appended instructions:
80
+
81
+ ```
82
+ You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user has explicitly asked to be challenged because soft-pedaled plans waste their time.
83
+
84
+ ## Rubric
85
+ [Claude inlines the full text of references/challenge-rubric.md here]
86
+
87
+ ## Your job
88
+ - You are running AFTER a solo pass by Claude. Catch what the solo pass missed, do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" or "I would frame this differently" with a reason. Then add your own findings that the solo pass missed.
89
+ - Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
90
+ - Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let Claude surface it to the user.
91
+
92
+ End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, and a one-line explanation.
93
+ ```
94
+
95
+ ### Step 3 — Reconcile solo and Codex findings
96
+
97
+ Merge the two finding lists:
98
+ - Same finding from both → keep the more specific wording, mark "confirmed by both".
99
+ - Codex-only → prefix `[codex]` in internal notes so the user-facing summary can show where each push came from.
100
+ - Solo-only → keep as-is.
101
+ - Conflicts (solo says X, Codex says not-X) → record both, do not silently pick one; if the conflict is material, include it as an open question in the user-facing summary.
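The merge rules above can be sketched as follows (matching findings on their Quote + Axis pair is an assumption; the skill does not prescribe a key):

```javascript
// Sketch: merge solo and Codex finding lists. Material conflicts must
// still be spotted by the orchestrator; this sketch only handles
// agreement and provenance marking.
function reconcile(solo, codex) {
  const key = (f) => `${f.quote}::${f.axis}`;
  const soloKeys = new Set(solo.map(key));
  const merged = solo.map((f) =>
    codex.some((c) => key(c) === key(f))
      ? { ...f, note: "confirmed by both" } // raised by both sides
      : f // solo-only: keep as-is
  );
  for (const c of codex) {
    if (!soloKeys.has(key(c))) {
      merged.push({ ...c, note: "[codex]" }); // Codex-only: mark provenance
    }
  }
  return merged;
}
```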
102
+
103
+ If Codex raised CRITICAL or HIGH findings that the solo pass missed, apply the fixes to the plan before presenting the user-facing summary. If fixing would change something the user explicitly asked for, follow the "Respect explicit user intent" rule already loaded from the rubric: do not silently rewrite — surface it.
104
+
105
+ Do not loop. One Codex pass is enough. If the result is still FAIL after one pass, that is a signal that the plan has structural problems the user should see directly, not a signal to keep iterating in the background.
106
+
107
+ ---
108
+
109
+ ## Cost notes
110
+
111
+ - One Codex call at `reasoningEffort: "xhigh"` typically takes 30–90s and is not cheap. This integration is bounded: exactly one Codex call per ideation session.
112
+ - In Quick Add mode on a single new item, one Codex call is still worth it — small scope, huge signal, and single-item additions are exactly where workarounds slip in unnoticed.
@@ -44,8 +44,15 @@ Parse from `<preflight_config>`:
44
44
  - `--autofix` — auto-promote all findings to roadmap items and run auto-resolve on each
45
45
  - `--skip-browser` — skip browser validation
46
46
  - `--skip-docs` — skip documentation audit
47
+ - `--engine MODE` (claude) — controls which model handles audit phases. Modes:
48
+ - `claude` (default): all auditors use Claude subagents.
49
+ - `codex`: code-auditor uses Codex, docs-auditor and browser-auditor use Claude.
50
+ - `auto`: code-auditor uses Codex (SWE-bench Pro +11.7pp for code analysis), docs-auditor uses Claude (writing quality), browser-auditor uses Claude (Chrome MCP). Recommended when Codex MCP is available.
47
51
 
48
52
  Example: `/devlyn:preflight --phase 2 --skip-browser`
53
+ Example with engine: `/devlyn:preflight --engine auto`
54
+
55
+ **If `--engine` is `auto` or `codex`**: call `mcp__codex-cli__ping` to verify Codex MCP availability. If ping fails, fall back to `--engine claude` with a warning.
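The fallback behavior reduces to one check. A minimal sketch, where `codexAvailable` stands in for the result of `mcp__codex-cli__ping` (the function name is illustrative):

```javascript
// Sketch: downgrade to claude when a Codex-backed mode is requested
// but the Codex MCP server is unreachable.
function resolveEngineWithFallback(requested, codexAvailable) {
  if (requested !== "auto" && requested !== "codex") return requested;
  if (codexAvailable) return requested;
  console.warn("Codex MCP server not detected; falling back to --engine claude");
  return "claude";
}
```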
49
56
 
50
57
  ## PHASE 0: DISCOVER & SCOPE
51
58
 
@@ -128,6 +135,8 @@ Spawn all applicable auditors in parallel. Each reads `.devlyn/commitment-regist
128
135
 
129
136
  ### code-auditor (always)
130
137
 
138
+ **Engine routing**: If `--engine auto` or `--engine codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the full code-auditor prompt (read from `references/auditors/code-auditor.md`). Include the commitment registry content inline in the prompt, since Codex cannot read `.devlyn/commitment-registry.md` directly in the read-only sandbox. If `--engine claude`, spawn a Claude subagent as below.
139
+
131
140
  Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/code-auditor.md` and pass it to the subagent.
132
141
 
133
142
  The code-auditor classifies each commitment as IMPLEMENTED, MISSING, INCOMPLETE, DIVERGENT, or BROKEN — with file:line evidence. Also catches cross-feature integration gaps and constraint violations. Writes to `.devlyn/audit-code.md`.
@@ -130,9 +130,28 @@ Use the Agent Teams infrastructure:
130
130
 
131
131
  **IMPORTANT**: When spawning teammates, replace `{team-name}` in each prompt below with the actual team name you chose (e.g., `resolve-null-user-crash`). Include the relevant file paths from your Phase 1 investigation in the spawn prompt.
132
132
 
133
+ ### Engine-Routed Teammate Spawning
134
+
135
+ If the caller passed `--engine auto` or `--engine codex` (check the orchestrator's context or the pipeline config), read the auto-resolve skill's `references/engine-routing.md` for per-role routing under "team-resolve roles".
136
+
137
+ **For roles routed to Codex**: Instead of spawning a Claude Agent teammate, call `mcp__codex-cli__codex` with:
138
+ - `model`: `"gpt-5.4"`
139
+ - `reasoningEffort`: `"xhigh"`
140
+ - `sandbox`: per routing table (`"read-only"` or `"workspace-write"`)
141
+ - `workingDirectory`: project root
142
+ - `prompt`: the full teammate prompt below, with issue context and file paths included inline
143
+
144
+ Codex roles cannot use TeamCreate/SendMessage — the Team Lead (you) relays their findings to other teammates and collects their output directly from the MCP call response.
145
+
146
+ **For roles routed to Claude**: Spawn via Task tool as normal (prompts below).
147
+
148
+ **For Dual roles** (e.g., security-auditor): Run BOTH a Claude Agent teammate AND a `mcp__codex-cli__codex` call in parallel with the same prompt. Merge findings per `engine-routing.md` "How to Spawn a Dual Role" section.
149
+
150
+ If `--engine claude` or no `--engine` flag: all roles use Claude Agent teammates (current default behavior).
151
+
133
152
  ### Teammate Prompts
134
153
 
135
- When spawning each teammate via the Task tool, use these prompts:
154
+ When spawning each teammate via the Task tool (or passing to `mcp__codex-cli__codex` for Codex-routed roles), use these prompts:
136
155
 
137
156
  <root_cause_analyst_prompt>
138
157
  You are the **Root Cause Analyst** on an Agent Team resolving an issue.
@@ -66,9 +66,28 @@ Use the Agent Teams infrastructure:
66
66
 
67
67
  **IMPORTANT**: When spawning reviewers, replace `{team-name}` in each prompt below with the actual team name you chose. Include the specific changed file paths in each reviewer's spawn prompt.
68
68
 
69
+ ### Engine-Routed Reviewer Spawning
70
+
71
+ If the caller passed `--engine auto` or `--engine codex` (check the orchestrator's context or the pipeline config), read the auto-resolve skill's `references/engine-routing.md` for per-role routing under "team-review roles".
72
+
73
+ **For roles routed to Codex**: Instead of spawning a Claude Agent reviewer, call `mcp__codex-cli__codex` with:
74
+ - `model`: `"gpt-5.4"`
75
+ - `reasoningEffort`: `"xhigh"`
76
+ - `sandbox`: per routing table (`"read-only"` or `"workspace-write"`)
77
+ - `workingDirectory`: project root
78
+ - `prompt`: the full reviewer prompt below, with changed file paths and diff included inline
79
+
80
+ Codex reviewers cannot use TeamCreate/SendMessage — the Review Lead (you) collects their output directly from the MCP call response and relays cross-cutting findings to other reviewers.
81
+
82
+ **For roles routed to Claude**: Spawn via Task tool as normal (prompts below).
83
+
84
+ **For Dual roles** (e.g., security-reviewer): Run BOTH a Claude Agent reviewer AND a `mcp__codex-cli__codex` call in parallel with the same prompt. Merge findings per `engine-routing.md` "How to Spawn a Dual Role" section.
85
+
86
+ If `--engine claude` or no `--engine` flag: all roles use Claude Agent reviewers (current default behavior).
87
+
69
88
  ### Reviewer Prompts
70
89
 
71
- When spawning each reviewer via the Task tool, use these prompts:
90
+ When spawning each reviewer via the Task tool (or passing to `mcp__codex-cli__codex` for Codex-routed roles), use these prompts:
72
91
 
73
92
  <security_reviewer_prompt>
74
93
  You are the **Security Reviewer** on an Agent Team performing a code review.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "devlyn-cli",
3
- "version": "1.10.0",
3
+ "version": "1.12.0",
4
4
  "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
5
5
  "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
6
6
  "bin": {
@@ -9,6 +9,10 @@
9
9
  "files": [
10
10
  "bin",
11
11
  "config",
12
+ "!config/skills/preflight-workspace",
13
+ "!config/skills/preflight-workspace/**",
14
+ "!config/skills/devlyn:ideate-workspace",
15
+ "!config/skills/devlyn:ideate-workspace/**",
12
16
  "agents-config",
13
17
  "optional-skills",
14
18
  "CLAUDE.md"