devlyn-cli 1.12.4 → 1.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CLAUDE.md CHANGED
@@ -1,5 +1,15 @@
  # Project Instructions

+ ## Quick Start
+
+ For most work, the recommended sequence is:
+
+ 1. `/devlyn:ideate` — turn an idea into roadmap-ready specs
+ 2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → polish
+ 3. `/devlyn:preflight` — verify the implementation matches the roadmap before shipping
+
+ All three default to `--engine auto`, which routes each phase to the optimal model (Codex GPT-5.4 for hard coding, Claude Opus 4.7 for evaluation/critique). The cross-model GAN dynamic — different models build vs critique — catches what single-model pipelines miss.
+
  ## General

  - Proactively use subagents and skills where needed
@@ -20,7 +30,7 @@
  When investigating bugs, analyzing features, or exploring code:

  1. **Define exit criteria upfront** - Ask "What does 'done' look like?" before starting
- 2. **Checkpoint progress** - Use TodoWrite every 5-10 minutes to save findings
+ 2. **Checkpoint progress** - Use the task tools (TaskCreate / TaskUpdate) every 5-10 minutes to save findings
  3. **Output intermediate summaries** - Provide "Current Understanding" snapshots so work isn't lost if interrupted
  4. **Always deliver findings** - Never end mid-analysis; at minimum output:
  - Files examined
@@ -28,7 +38,7 @@ When investigating bugs, analyzing features, or exploring code:
  - Remaining unknowns
  - Recommended next steps

- For complex investigations, use `/devlyn:team-resolve` to assemble a multi-perspective investigation team, or spawn parallel Task agents to explore different areas simultaneously.
+ For complex investigations, use `/devlyn:team-resolve` to assemble a multi-perspective investigation team, or spawn parallel Agent subagents to explore different areas simultaneously.

  ## UI/UX Workflow

@@ -42,11 +52,11 @@ The full design-to-implementation pipeline:
  ## Feature Development

  1. **Plan first** - Always output a concrete implementation plan with specific file changes before writing code
- 2. **Track progress** - Use TodoWrite to checkpoint each phase
+ 2. **Track progress** - Use the task tools (TaskCreate / TaskUpdate) to checkpoint each phase
  3. **Test validation** - Write tests alongside implementation; iterate until green
  4. **Small commits** - Commit working increments rather than large changesets

- For complex features, use the Plan agent to design the approach before implementation.
+ For complex features, spawn the `Plan` subagent (`Agent` tool with `subagent_type: "Plan"`) to design the approach before implementation.

  ## Automated Pipeline (Recommended Starting Point)

@@ -72,8 +82,7 @@ Optional flags:
  - `--skip-review` — skip team-review phase
  - `--skip-clean` — skip clean phase
  - `--skip-docs` — skip update-docs phase
- - `--engine auto|codex|claude` — intelligent model routing. `auto` (default) routes each phase and team role to the optimal model (Claude or Codex GPT-5.4) based on benchmark data. `codex` forces Codex for implementation, Claude for evaluation. `claude` uses Claude for everything. Requires codex-mcp-server for `auto` and `codex` modes.
- - `--with-codex [evaluate|review|both]` — (legacy, superseded by `--engine`) use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
+ - `--engine auto|codex|claude` — intelligent model routing. `auto` (default) routes each phase and team role to the optimal model based on benchmark data: Codex GPT-5.4 handles BUILD and FIX (SWE-bench Pro lead), Claude Opus 4.7 handles EVALUATE and CHALLENGE (long-context retrieval + skeptical reasoning). Different models build vs critique — the cross-model GAN dynamic catches what single-model pipelines miss. `codex` forces Codex for implementation, Claude for orchestration and Chrome MCP. `claude` uses Claude for everything. Requires codex-mcp-server for `auto` and `codex` modes.

  ## Preflight Check (Post-Roadmap Verification)

@@ -92,7 +101,7 @@ Optional flags:
  - `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
  - `--skip-browser` — skip browser validation
  - `--skip-docs` — skip documentation audit
- - `--engine auto|codex|claude` — route code-auditor to Codex (better at code analysis), docs/browser to Claude
+ - `--engine auto|codex|claude` — `auto` (default) routes the code-auditor to Codex (SWE-bench Pro +11.7pp on code analysis); the docs-auditor and browser-auditor always use Claude regardless of `--engine` (writing-quality strength on docs drift; Chrome MCP tools are session-bound to Claude Code)

  **Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)

@@ -152,11 +161,13 @@ Steps 4-6 are optional depending on the scope of changes. `/simplify` should alw

  ## Context Window Management

- When a conversation approaches context limits (50k+ tokens):
- 1. Check usage with `/context`
- 2. Create a HANDOFF.md summarizing: what was attempted, what succeeded, what failed, and next steps
- 3. Start a new session with `/clear`
- 4. Load context: `@HANDOFF.md Read this file and continue the work`
+ Claude 4.5 / 4.6 / 4.7 models auto-compact the conversation as it approaches the context limit, so you can keep working indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns — the model continues from where it left off after compaction.
+
+ For genuinely multi-context-window work (e.g., a roadmap with many phases), persist state to disk so the next instance can resume:
+ - All `auto-resolve` and `preflight` runs already write durable state to `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS, PREFLIGHT-REPORT) and to git commits — pick up by reading those files plus `git log`.
+ - For long investigations, write progress notes to a `HANDOFF.md` and resume with `@HANDOFF.md continue from where this left off` if you need a fresh window.
+
+ Manually clearing with `/clear` is rarely necessary — only do it when context is genuinely irrelevant to the next task.

  ## Communication Style

package/README.md CHANGED
@@ -146,19 +146,6 @@ Works across the full pipeline:

  </details>

- <details>
- <summary>Legacy: <code>--with-codex</code> (superseded by <code>--engine</code>)</summary>
-
- ```
- /devlyn:auto-resolve "fix the auth bug" --with-codex
- ```
-
- > `--with-codex evaluate` (default) · `--with-codex review` · `--with-codex both`
-
- `--engine auto` subsumes `--with-codex both` with broader coverage — Codex is used for build, fix, and 4 team roles, not just evaluate/review.
-
- </details>
-
  ---

  ## Manual Commands
@@ -258,7 +245,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.

  | Server | Description |
  |---|---|
- | `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing and legacy `--with-codex` mode |
+ | `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing |
  | `playwright` | Playwright MCP — powers browser-validate Tier 2 |

  </details>
@@ -11,6 +11,14 @@ $ARGUMENTS

  <pipeline_workflow>

+ <orchestrator_context>
+ This pipeline is long-horizon agentic work. As the orchestrator, you spawn many subagents and read their handoff files; your own context grows over the run.
+
+ - Your context window is auto-compacted as it approaches its limit, so do not stop tasks early due to token-budget concerns. Keep the run going.
+ - All durable state lives in `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS) and in git commits. If your context is cleared mid-run, the next instance can resume from those files plus `git log`. Keep them up to date.
+ - Best results come from `xhigh` effort. If you are running on lower effort and notice shallow reasoning during phase decisions, escalate.
+ </orchestrator_context>
+
  <autonomy_contract>
  This pipeline runs hands-free. The user launches it to walk away and come back to finished work, so the quality of this run is measured by how far it gets without human intervention. Apply these behaviors throughout every phase:

@@ -21,6 +29,17 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
  5. **Treat questions as a signal to act instead.** If you notice yourself drafting a question to the user mid-pipeline, convert it into a decision + log entry and spawn the next phase.
  </autonomy_contract>

+ <engine_routing_convention>
+ Every phase in this pipeline routes its work to the optimal model per `references/engine-routing.md`. The convention is the same everywhere:
+
+ - The phase prompt body below is **engine-agnostic** — same instructions whether Codex or Claude executes it.
+ - For phases routed to **Codex** (per the routing table), call `mcp__codex-cli__codex` per the patterns in `engine-routing.md` (How to Spawn a Codex BUILD/FIX Agent / How to Spawn a Codex Role / How to Spawn a Dual Role).
+ - For phases routed to **Claude**, spawn an Agent subagent with `mode: "bypassPermissions"` and pass the prompt body verbatim.
+ - `--engine claude` forces all phases to Claude. `--engine codex` forces implementation/analysis to Codex (Claude still handles orchestration and Chrome MCP). `--engine auto` (default) uses the routing table per phase.
+
+ Phase-level "Engine routing" notes below are short reminders only — `engine-routing.md` is the single source of truth.
+ </engine_routing_convention>
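+ The convention reduces to a small routing rule. A minimal sketch (illustrative TypeScript, not devlyn source; the phase set is abbreviated and the real table lives in `references/engine-routing.md`):
+
+ ```typescript
+ // Simplified routing rule per the convention above; table contents are illustrative.
+ type Engine = "auto" | "codex" | "claude";
+ type Phase = "BUILD" | "FIX" | "EVALUATE" | "CHALLENGE" | "BROWSER";
+
+ // auto: builder phases go to Codex, critic/browser phases stay on Claude.
+ const autoTable: Record<Phase, "codex" | "claude"> = {
+   BUILD: "codex",
+   FIX: "codex",
+   EVALUATE: "claude",
+   CHALLENGE: "claude",
+   BROWSER: "claude",
+ };
+
+ function routePhase(engine: Engine, phase: Phase): "codex" | "claude" {
+   if (engine === "claude") return "claude"; // force Claude everywhere
+   if (engine === "codex") {
+     // Codex for implementation; Claude keeps orchestration + Chrome MCP.
+     return phase === "BUILD" || phase === "FIX" ? "codex" : "claude";
+   }
+   return autoTable[phase]; // auto (default): per-phase routing table
+ }
+ ```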
+
  ## PHASE 0: PARSE INPUT

  1. Extract the task/issue description from `<pipeline_config>`.
@@ -33,22 +52,21 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
  - `--skip-docs` (false) — skip update-docs phase
  - `--skip-build-gate` (false) — skip the deterministic build gate (Phase 1.4). Not recommended — the build gate is the primary defense against "tests pass locally, breaks in CI/Docker/production" class of bugs.
  - `--build-gate MODE` (auto) — controls build gate behavior. `auto`: detect project type and run appropriate build/typecheck/lint commands; if Dockerfile(s) are present, Docker builds are included automatically. `strict`: auto + treat warnings as errors. `no-docker`: auto but skip Docker builds even if Dockerfiles exist (for faster iteration). `skip`: same as --skip-build-gate.
- - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques. **Ignored if `--engine` is set** (engine routing subsumes this).
  - `--engine MODE` (auto) — controls which model handles each pipeline phase and team role. Modes:
- - `auto` (default): each phase and team role routes to the optimal model based on benchmark data. Requires Codex MCP server. Subsumes `--with-codex both`.
+ - `auto` (default): each phase and team role routes to the optimal model based on benchmark data. Requires Codex MCP server. Codex handles BUILD/FIX (SWE-bench Pro lead) and several team roles; Claude handles EVALUATE, CHALLENGE, BROWSER, and orchestration — creating a GAN-like dynamic where the builder and critic are always different models.
  - `codex`: Codex handles implementation/analysis phases, Claude handles orchestration, evaluation, and Chrome MCP.
  - `claude`: all phases use Claude subagents. No Codex calls.

  Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
  Engine examples: `--engine auto`, `--engine codex`, `--engine claude`
- Codex examples (legacy): `--with-codex` (both), `--with-codex evaluate`, `--with-codex review`
- If no flags are present, use defaults. **The default engine is `auto` — if the user does not pass `--engine`, treat it as `--engine auto`.**
+ If no flags are present, use defaults. The default engine is `auto` — if the user does not pass `--engine`, treat it as `--engine auto`.
+
+ **Consolidated flag**: `--with-codex` (and its variants `evaluate`/`review`/`both`) was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which provides broader Codex coverage — Codex now handles BUILD, FIX, and several team roles automatically. No flag needed. Continuing with `--engine auto`."
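+ The consolidation rule can be sketched as follows (illustrative TypeScript, not devlyn source; `engine` and `withCodex` are assumed parsed-option names):
+
+ ```typescript
+ // Hypothetical sketch of the flag-consolidation rule described above.
+ type Engine = "auto" | "codex" | "claude";
+
+ function resolveEngine(opts: { engine?: Engine; withCodex?: boolean }): Engine {
+   if (!opts.engine && opts.withCodex) {
+     // Inform once, then proceed — the legacy flag maps onto the default.
+     console.warn("Note: --with-codex was consolidated into --engine auto (default).");
+   }
+   return opts.engine ?? "auto"; // the default engine is auto, never claude
+ }
+ ```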

  3. **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
- - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — NOT `claude`.
+ - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
  - Read `references/engine-routing.md` for the full routing table.
  - Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude` (fallback), [2] Abort.
- - Exception: if `--engine` is not set AND `--with-codex` is explicitly enabled (legacy), read `references/codex-integration.md` instead and run its pre-flight check.

  4. Announce the pipeline plan:
  ```
@@ -57,16 +75,13 @@ Task: [extracted task description]
  Engine: [auto / codex / claude]
  Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
  Max evaluation rounds: [N]
- Cross-model evaluation (Codex): [evaluate / review / both / disabled / subsumed by --engine]
  ```

  ## PHASE 1: BUILD

- **Engine routing**: If `--engine` is `auto` or `codex`, read `references/engine-routing.md` "How to Spawn a Codex BUILD/FIX Agent" section. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the full agent prompt below as the `prompt` parameter. If `--engine` is `claude`, spawn a Claude subagent as described below.
+ **Engine**: BUILD row of the routing table — Codex on `auto`/`codex`, Claude on `claude`. Per `<engine_routing_convention>` above. Subagents do not have access to skills, so the prompt below includes everything they need inline.
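+ For reference, the Codex call shape from the removed routing note looks roughly like this (illustrative sketch; the parameter names are quoted from that note, but the surrounding object shape is an assumption, not the exact MCP client API):
+
+ ```typescript
+ // Illustrative shape of a Codex BUILD call; the prompt placeholder stands in
+ // for the full engine-agnostic agent prompt that follows in this phase.
+ const buildAgentPrompt = "..."; // full prompt body, passed verbatim
+
+ const codexBuildCall = {
+   name: "mcp__codex-cli__codex",
+   arguments: {
+     model: "gpt-5.4",           // implementation model
+     reasoningEffort: "xhigh",   // best results per the orchestrator notes
+     sandbox: "workspace-write", // allow edits in the workspace
+     fullAuto: true,             // hands-free run
+     prompt: buildAgentPrompt,
+   },
+ };
+ ```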

- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to investigate and implement the fix. The subagent does NOT have access to skills, so include all necessary instructions inline.
-
- Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
+ Agent prompt — pass this to the spawned executor:

  Investigate and implement the following task. Work through these phases in order:

@@ -84,9 +99,7 @@ Read relevant files in parallel. Build a clear picture of what exists and what n
  - Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
  - Refactor: architecture-reviewer + test-engineer
  - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
- Each teammate investigates from their perspective and sends findings back.
-
- **Engine routing for teammates**: If the orchestrator's `--engine` is `auto` or `codex`, read `references/engine-routing.md` for per-role routing. Roles marked **Codex** are called via `mcp__codex-cli__codex` instead of spawning Agent teammates — include the full role prompt and issue context inline. Roles marked **Claude** use normal Agent teammates. Roles marked **Dual** run both in parallel and merge findings. The orchestrator relays Codex role outputs to Claude teammates that need them.
+ Each teammate investigates from their perspective and sends findings back. Per-role engine routing follows the team-resolve table in `references/engine-routing.md`; Dual roles run both models in parallel.

  **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.

@@ -139,13 +152,11 @@ For failures: include the FULL error output (not truncated) and extract root fil

  Triggered only when PHASE 1.4 returns FAIL.

- Track a round counter (shared with the main fix loop counter against `max-rounds`). If `round >= max-rounds`, stop with a clear failure report — do NOT continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
+ Track a round counter. The build-gate fix loop and the main evaluate fix loop share **one global round counter** capped at `max-rounds` — increments from this loop and from PHASE 2.5 both count against the same total. If `round >= max-rounds`, stop with a clear failure report and do not continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
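+ The shared-counter rule can be sketched as (illustrative TypeScript, not devlyn source):
+
+ ```typescript
+ // Minimal sketch of the single global round counter shared by both fix loops.
+ const maxRounds = 5; // from --max-rounds
+ let round = 0;       // ONE counter for build-gate fixes AND evaluate fixes
+
+ function tryStartFixRound(trigger: "build-gate" | "evaluate"): boolean {
+   if (round >= maxRounds) return false; // stop: report failure, skip later phases
+   round += 1; // increments from either loop count against the same total
+   return true;
+ }
+ ```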

- **Engine routing**: Same as PHASE 2.5 FIX LOOP — if `--engine` is `auto` or `codex`, use `mcp__codex-cli__codex` with `workspace-write` and `fullAuto: true`. If `--engine` is `claude`, spawn a Claude subagent.
+ **Engine**: FIX LOOP row of the routing table.

- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
-
- Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
+ Agent prompt — pass this to the spawned executor:

  Read `.devlyn/BUILD-GATE.md` — it contains deterministic build/typecheck/lint failures from real compiler output. These are not opinions; the compiler rejected this code. Fix every listed failure at the root cause level.

@@ -158,7 +169,7 @@ For each failure:

  **After the agent completes**:
  1. **Checkpoint**: `git add -A && git commit -m "chore(pipeline): build gate fix round [N]"`
- 2. Increment round counter
+ 2. Increment the global round counter (shared with PHASE 2.5)
  3. Go back to PHASE 1.4 (re-run the gate)

  ## PHASE 1.5: BROWSER VALIDATE (conditional)
@@ -186,11 +197,23 @@ You are a browser validation agent. Read the skill instructions at `.claude/skil

  ## PHASE 2: EVALUATE

- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to evaluate the work. Include all evaluation instructions inline.
+ **Engine**: EVALUATE row of the routing table — Claude on every engine. When `--engine auto`, Codex built the code, so Claude evaluating Codex's work is the GAN dynamic by default; no separate Codex evaluation pass is needed.

- Agent prompt — pass this to the Agent tool:
+ Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`. Include all evaluation instructions inline (subagents do not have access to skills).
+
+ Agent prompt — pass this to the spawned executor:

- You are an independent evaluator. Your job is to grade work produced by another agent, not to praise it. You will be too lenient by default — fight this tendency. When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs minutes of review. The cost is asymmetric.
+ You are an independent evaluator. Your job is to grade work produced by another agent against a specific rubric, not to praise it.
+
+ <investigate_before_answering>
+ Never claim a file:line or assert a behavior you have not opened and read. The done-criteria file is the rubric — read it first. Then read every changed/new file in full before marking anything VERIFIED or FAILED. Findings without a real file:line behind them are speculation; exclude them.
+ </investigate_before_answering>
+
+ <coverage_over_filtering>
+ Your goal is coverage at this stage, not severity filtering. Report every issue you find — uncertain ones, low-severity ones, all of them. The fix loop and the orchestrator's verdict logic do the filtering downstream. Each finding includes its severity and your confidence so the downstream layers can rank them; your job is to surface them, not pre-decide which ones matter.
+
+ This matters because under-reporting is the asymmetric cost: a missed bug ships broken code, a flagged non-issue costs a few minutes of review.
+ </coverage_over_filtering>

  **Step 1 — Read the done criteria**: Read `.devlyn/done-criteria.md`. This is your primary grading rubric. Every criterion must be verified with evidence.

@@ -215,56 +238,64 @@ You are an independent evaluator. Your job is to grade work produced by another
  - [ ] criterion — FAILED: what's wrong, file:line
  ## Findings Requiring Action
  ### CRITICAL
- - `file:line` — description — Fix: suggested approach
+ - `file:line` — description — Confidence: high/med/low — Fix: suggested approach
  ### HIGH
- - `file:line` — description — Fix: suggested approach
+ - `file:line` — description — Confidence: high/med/low — Fix: suggested approach
+ ### MEDIUM / LOW
+ - `file:line` — description — Confidence: high/med/low — Fix: suggested approach
  ## Cross-Cutting Patterns
  - pattern description
  ```

- Verdict rules: BLOCKED = any CRITICAL issues. NEEDS WORK = HIGH or MEDIUM issues that should be fixed. PASS WITH ISSUES = only LOW cosmetic notes. PASS = clean.
+ Verdict rules:
+ - `BLOCKED` — any CRITICAL issues
+ - `NEEDS WORK` — HIGH or MEDIUM issues
+ - `PASS WITH ISSUES` — only LOW cosmetic notes
+ - `PASS` — clean
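+ As a sketch (illustrative TypeScript, not devlyn source), the four rules reduce to:
+
+ ```typescript
+ // Encodes the verdict rules above; input is the severity of each finding.
+ type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
+ type Verdict = "BLOCKED" | "NEEDS WORK" | "PASS WITH ISSUES" | "PASS";
+
+ function verdictFor(findings: Severity[]): Verdict {
+   if (findings.includes("CRITICAL")) return "BLOCKED";
+   if (findings.includes("HIGH") || findings.includes("MEDIUM")) return "NEEDS WORK";
+   if (findings.includes("LOW")) return "PASS WITH ISSUES";
+   return "PASS"; // clean
+ }
+ ```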

- Important: Do NOT label findings as "pre-existing" or "out of scope" to avoid fixing them. If a problem exists in the current code and relates to the done criteria, it's a finding regardless of when it was introduced. The goal is working software, not blame attribution.
+ Findings labeled "pre-existing" or "out of scope" still count if they relate to the done criteria. The goal is working software, not blame attribution.

- Calibration examples to guide your judgment:
- - A catch block that logs but doesn't surface error to user = HIGH (not MEDIUM). Logging is not error handling.
- - A `let` that could be `const` = LOW note only. Linters catch this.
- - "The error handling is generally quite good" = WRONG. Count the instances. Name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line..."
+ Calibration examples:
+ - A catch block that logs but doesn't surface the error to the user — HIGH (not MEDIUM). Logging is not error handling.
+ - A `let` that could be `const` — LOW. Linters catch this.
+ - "The error handling is generally quite good" is not a finding. Count the instances and name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line..."

- Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the orchestrator needs them.
+ Do not delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the orchestrator needs them.

  **After the agent completes**:
  1. Read `.devlyn/EVAL-FINDINGS.md`
  2. Extract the verdict
- 3. **If `--engine` is `auto` or `codex`**: The evaluate phase always uses Claude (see `references/engine-routing.md`). When `--engine auto`, the builder was Codex — Claude evaluating Codex's work creates the GAN dynamic automatically. No separate Codex evaluation pass is needed.
- **If `--engine` is not set and `--with-codex` includes `evaluate` or `both`** (legacy): Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
- 4. Branch on verdict (from the merged findings if Codex was used):
+ 3. Branch on verdict:
  - `PASS` → skip to PHASE 3
  - `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
  - `NEEDS WORK` → go to PHASE 2.5 (fix loop)
  - `BLOCKED` → go to PHASE 2.5 (fix loop)
- 5. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as NEEDS WORK and log a warning — absence of evidence is not evidence of absence
+ 4. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as NEEDS WORK and log a warning — absence of evidence is not evidence of absence

  ## PHASE 2.5: FIX LOOP (conditional)

  Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.

- **Engine routing**: If `--engine` is `auto` or `codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "workspace-write"`, `fullAuto: true`, and the fix prompt below. Use a fresh call each round (no sessionId reuse — sandbox/fullAuto only apply on first call of a session). If `--engine` is `claude`, spawn a Claude subagent as below.
+ **Engine**: FIX LOOP row of the routing table. Use a fresh Codex call each round (no `sessionId` reuse — sandbox/fullAuto only apply on the first call of a session).

- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to fix the evaluation findings.
+ Agent prompt — pass this to the spawned executor:

- Agent prompt — pass this to the Agent tool (or to `mcp__codex-cli__codex` prompt if engine routes to Codex):
+ Read every findings file present in `.devlyn/`:
+ - `.devlyn/EVAL-FINDINGS.md` — issues from the independent evaluator (PHASE 2)
+ - `.devlyn/BROWSER-RESULTS.md` — issues from browser validation (PHASE 1.5), if present and the verdict is `NEEDS WORK` or `BLOCKED`

- Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the evaluator returns PASS — there is no "shippable with issues" shortcut.
+ Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the relevant verdict returns PASS — there is no "shippable with issues" shortcut.

  The original done criteria are in `.devlyn/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.

- For each finding: read the referenced file:line, understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.devlyn/done-criteria.md` to mark fixed items.
+ For each finding: read the referenced file:line (or browser step / console error), understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.devlyn/done-criteria.md` to mark fixed items.

  **After the agent completes**:
  1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): fix round [N] complete"` to preserve the fix
- 2. Increment round counter
- 3. Go back to PHASE 2 (re-evaluate)
+ 2. Increment the global round counter (shared with PHASE 1.4-fix)
+ 3. Re-run the phase that triggered the fix:
+ - If invoked from PHASE 2 (eval failure) → go back to PHASE 2 to re-evaluate
+ - If invoked from PHASE 1.5 (browser failure) → go back to PHASE 1.5 to re-validate the browser, then proceed to PHASE 2 only if browser passes
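+ Sketched (illustrative TypeScript, not devlyn source), the re-entry rule is:
+
+ ```typescript
+ // Re-run whichever phase triggered the fix; browser must pass before evaluation.
+ type FixTrigger = "evaluate" | "browser";
+
+ function phaseAfterFix(trigger: FixTrigger, browserPassed = false): string {
+   if (trigger === "evaluate") return "PHASE 2";   // re-evaluate
+   return browserPassed ? "PHASE 2" : "PHASE 1.5"; // re-validate first
+ }
+ ```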

  ## PHASE 3: SIMPLIFY

@@ -281,21 +312,18 @@ Review the recently changed files (use `git diff HEAD~1` to see what changed). L

  Skip if `--skip-review` was set.

- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` for a multi-perspective review.
+ **Engine**: REVIEW (team) — per-role routing per the team-review table in `references/engine-routing.md`. Dual roles run both models in parallel and merge findings.

- Agent prompt — pass this to the Agent tool:
+ Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.

- Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes.
+ Agent prompt — pass this to the spawned executor:

- **Engine routing for reviewers**: If the orchestrator passed `--engine auto` or `--engine codex`, read `references/engine-routing.md` for per-role routing in the "team-review roles" table. Route each reviewer to Claude Agent or `mcp__codex-cli__codex` accordingly. For Dual roles (security-reviewer), run both models in parallel and merge findings per the "How to Spawn a Dual Role" section. For `--engine claude`, all reviewers are Claude Agent teammates.
+ Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes. Per-role engine routing follows the team-review table in `references/engine-routing.md`; Dual roles run both models in parallel and merge findings.

- Each reviewer evaluates from their perspective, sends findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW). After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
+ Each reviewer reports findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW) and a confidence level. After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.

  Clean up the team after completion.

- **If `--engine` is set**: engine routing already handles cross-model review via per-role routing — skip the legacy `--with-codex` review step below.
- **If `--with-codex` includes `review` or `both`** (legacy, only when `--engine` is not set): Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
-
  **After the review phase completes**:
  1. If CRITICAL issues remain unfixed, log a warning in the final report
  2. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): review fixes complete"` if there are changes
@@ -306,23 +334,27 @@ Every prior phase used checklists, done-criteria, or structured categories. This
306
334
 
307
335
  This is what catches the things structured reviews miss — subtle logic that technically works but isn't the right approach, assumptions nobody questioned, patterns that are fine but not best-practice, and integration seams that look correct in isolation but feel wrong when you read the whole changeset.
308
336
 
337
+ **Engine**: CHALLENGE row — Claude on every engine. The diff was likely produced by Codex on `--engine auto`; Claude reading it cold preserves the cross-model dynamic.
338
+
309
339
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
310
340
 
311
- Agent prompt — pass this to the Agent tool:
341
+ Agent prompt — pass this to the spawned executor:
312
342
 
313
- You are a senior engineer doing a final skeptical review before this code ships to production. You have NOT seen any prior reviews, test results, or design docs — read the code cold.
343
+ You are a senior engineer doing a final skeptical review before this code ships to production. You have not seen any prior reviews, test results, or design docs — read the code cold.
314
344
 
315
- Run `git diff main` to see all changes. Read every changed file in full (not just the diff hunks — you need surrounding context).
345
+ <investigate_before_answering>
346
+ Anchor every finding in code you have actually opened. Run `git diff main` for the change surface, then read each changed file in full (not just the hunks — surrounding context matters). Findings without a real file:line and a quote from the code are speculation; exclude them.
347
+ </investigate_before_answering>
316
348
 
317
- Your job is NOT to check boxes. Your job is to find the things that would make a staff engineer say "hold on, let's talk about this before we ship." Think about:
349
+ Your job is not to check boxes. Your job is to find the things that would make a staff engineer say "hold on, let's talk about this before we ship." Think about:
318
350
 
319
351
  - Would this approach survive a 10x traffic spike? A midnight oncall page? A junior dev maintaining it 6 months from now?
320
352
  - Are there assumptions baked in that nobody stated out loud? Hardcoded limits, implicit ordering, missing edge cases in business logic?
321
353
  - Is the error handling actually helpful, or does it just prevent crashes while leaving the user confused?
322
354
  - Are there simpler, more idiomatic ways to do what this code does? Not "clever" alternatives — genuinely better approaches?
323
- - Would you mass-confidence approve this PR, or would you leave comments?
355
+ - Would you confidently approve this PR, or would you leave comments?
324
356
 
325
- Be brutally honest. Do NOT start with praise. Do NOT soften findings. Every finding must include `file:line` and a concrete fix — not "consider improving" but "change X to Y because Z."
357
+ Be direct and concrete. Do not open with praise. Every finding must include `file:line` and a concrete fix — not "consider improving" but "change X to Y because Z."
326
358
 
327
359
  Write `.devlyn/CHALLENGE-FINDINGS.md`:
328
360
 
@@ -334,7 +366,27 @@ Write `.devlyn/CHALLENGE-FINDINGS.md`:
334
366
  - `file:line` — what's wrong — Fix: concrete change
335
367
  ```
336
368
 
337
- Verdict: PASS only if you would mass-confidently mass-ship this code with your name on it. If you found anything CRITICAL or HIGH, verdict is NEEDS WORK.
369
+ <examples>
370
+ <example index="1">
371
+ GOOD finding (anchored, specific, fixable):
372
+ ### CRITICAL
373
+ - `src/api/orders/cancel.ts:42` — `await db.transaction(...)` is missing — the read of `order.status` and the write of `order.status = "cancelled"` are not atomic, so two concurrent cancellations both succeed and the inventory hook fires twice. Fix: wrap the read+write in `db.transaction()` and re-check `order.status === "pending"` inside the transaction before the update.
374
+ </example>
375
+ <example index="2">
376
+ BAD finding (vague, unanchored, not actionable):
377
+ ### HIGH
378
+ - The error handling could be improved. Consider being more defensive throughout.
379
+
380
+ Why this is bad: no file:line, no specific failure, no concrete fix. Either delete the finding or replace it with a real one anchored to a specific call site.
381
+ </example>
382
+ <example index="3">
383
+ GOOD finding (idiom / approach issue):
384
+ ### MEDIUM
385
+ - `src/components/UserList.tsx:18-34` — fetching `/api/users` inside `useEffect` and managing loading/error state by hand re-implements what the project already does with the `useFetch` hook in `src/hooks/useFetch.ts`. Fix: replace the manual `useState`+`useEffect` with `useFetch('/api/users')` so this list inherits retry, cache, and abort handling.
386
+ </example>
387
+ </examples>
388
+
389
+ Verdict: `PASS` only if you would confidently ship this code with your name on it. If you found anything CRITICAL or HIGH, verdict is `NEEDS WORK`.
338
390
 
339
391
  **After the agent completes**:
340
392
  1. Read `.devlyn/CHALLENGE-FINDINGS.md`
@@ -402,9 +454,37 @@ Skip if `--skip-docs` was set.
402
454
 
403
455
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
404
456
 
405
- Agent prompt — pass this to the Agent tool:
457
+ Agent prompt — pass this to the Agent tool (include the original task description from `<pipeline_config>` so the agent can detect spec paths):
458
+
459
+ You are the Docs phase of the auto-resolve pipeline. You have two jobs, in this order.
460
+
461
+ **Job 1 — Roadmap Sync** (run first, only if this task implemented a roadmap item)
462
+
463
+ The ideate skill produces specs at `docs/roadmap/phase-N/{id}-{slug}.md` and tracks them in `docs/ROADMAP.md`. When auto-resolve finishes a task for one of those specs, the index lies until someone flips it — and nobody does, so it rots. Your job is to flip it.
464
+
465
+ 1. **Detect whether this task was a spec implementation.** Look at the original task description you were passed. Match against this regex: `docs/roadmap/phase-\d+/[^\s"'\`)]+\.md`. If there is no match, or if `docs/ROADMAP.md` does not exist in the repo, Job 1 is a no-op — skip straight to Job 2.
466
+ 2. **Sanity-check against the diff.** Run `git diff main --stat` (or `git diff HEAD~N --stat` if on main). If the diff is empty or contains only doc changes, the build phase produced nothing — do not flip any status. Leave Job 1 untouched and continue to Job 2.
467
+ 3. **Read the spec file** at the matched path. If its frontmatter already has `status: done`, Job 1 is already done — skip to Job 2. Otherwise:
468
+ - Set `status: done` in the frontmatter.
469
+ - Add a `completed: YYYY-MM-DD` field (use today's date from `date +%Y-%m-%d`).
470
+ - Do not change any other fields, and do not touch the body of the spec.
471
+ 4. **Update `docs/ROADMAP.md`.** Find the row whose `#` column matches the spec's `id` (e.g., row starting `| 2.3 |`). Change its Status column to `Done`. Do not touch any other row, and do not reformat the table.
472
+ 5. **Check whether the phase is now fully Done.** Read every row of the phase's table (the one containing the just-flipped row). If every row's Status is `Done`, archive the phase:
473
+ - Cut the phase's `## Phase N: …` heading and table out of the active section of ROADMAP.md.
474
+ - If no `## Completed` section exists at the bottom of the file, create one just above end-of-file (below Decisions if Decisions exists).
475
+ - Add a `<details>` block for the phase inside Completed, using the format defined in the devlyn:ideate skill's Context Archiving section. Pull each item's completion date from its spec file's `completed:` frontmatter; if a spec has none, use today's date.
476
+ - Item spec files stay on disk — do not delete them. Only the index row moves.
477
+ 6. **Report.** In your summary, say explicitly what you did: "Flipped spec 2.3 to done, updated ROADMAP.md row." And if applicable: "Phase 2 was fully Done — archived to Completed block."
478
+
479
+ **Safety invariants** — violating any of these means stop Job 1 and report it:
480
+ - Never flip a spec to `done` without a non-empty `git diff` touching non-doc files.
481
+ - Never flip multiple specs in one run — one task, one spec.
482
+ - Never edit a row whose `#` doesn't exactly match the spec's `id`.
483
+ - Never delete spec files.
484
+
485
+ **Job 2 — General doc sync**
406
486
 
407
- Synchronize documentation with recent code changes. Use `git log --oneline -20` and `git diff main` to understand what changed. Update any docs that reference changed APIs, features, or behaviors. Do not create new documentation files unless the changes introduced entirely new features with no existing docs. Preserve all forward-looking content: roadmaps, future plans, visions, open questions.
487
+ Synchronize the rest of the documentation with recent code changes. Use `git log --oneline -20` and `git diff main` to understand what changed. Update any docs that reference changed APIs, features, or behaviors. Do not create new documentation files unless the changes introduced entirely new features with no existing docs. Preserve all forward-looking content: future plans, visions, open questions. (Job 1 already handled the roadmap index — don't second-guess it here.)
408
488
 
409
489
  **After the agent completes**:
410
490
  1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): docs updated"` if there are changes
@@ -430,22 +510,20 @@ After all phases complete:
430
510
  **Pipeline Summary**:
431
511
  | Phase | Status | Notes |
432
512
  |-------|--------|-------|
433
- | Build (team-resolve) | [completed] | [brief summary] |
513
+ | Build (team-resolve) | [completed] | [brief summary; engine that ran it] |
434
514
  | Build gate | [completed / skipped / FAIL after N rounds] | [project types detected, commands run, pass/fail per command] |
435
515
  | Browser validate | [completed / skipped / auto-skipped] | [verdict, tier used, console errors, flow results] |
436
- | Evaluate (Claude) | [PASS/NEEDS WORK after N rounds] | [verdict + key findings] |
437
- | Evaluate (Codex) | [completed / skipped] | [Codex-only findings count, merged verdict] |
516
+ | Evaluate | [PASS/NEEDS WORK after N rounds] | [verdict + key findings] |
438
517
  | Fix rounds | [N rounds / skipped] | [what was fixed] |
439
518
  | Simplify | [completed / skipped] | [changes made] |
440
- | Review (Claude team) | [completed / skipped] | [findings summary] |
441
- | Review (Codex) | [completed / skipped] | [Codex-only findings, agreed findings] |
519
+ | Review (team) | [completed / skipped] | [findings summary; per-role engines if --engine auto] |
442
520
  | Challenge | [PASS / NEEDS WORK] | [findings count, fixes applied] |
443
521
  | Security review | [completed / skipped / auto-skipped] | [findings or "no security-sensitive changes"] |
444
522
  | Clean | [completed / skipped] | [items cleaned] |
445
523
  | Docs (update-docs) | [completed / skipped] | [docs updated] |
446
524
 
447
- **Evaluation Rounds**: [N] of [max-rounds] used
448
- **Final Verdict**: [last evaluation verdict]
525
+ **Evaluation Rounds**: [N] of [max-rounds] used (shared budget across PHASE 1.4-fix and PHASE 2.5)
526
+ **Final Verdict**: [last evaluation verdict, or "BUILD GATE FAILED — code does not compile" if PHASE 1.4 exhausted the round budget before PHASE 2 ran]
449
527
 
450
528
  **Commits created**:
451
529
  [git log output]
@@ -116,7 +116,7 @@ Rationale for `--engine auto` choices:
116
116
 
117
117
  Rationale:
118
118
  - FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
119
- - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic (same role as `--with-codex` but automatic). When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
119
+ - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic automatically on every run. When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
120
120
  - DOCUMENT: Claude — writing quality for spec generation.
121
121
 
122
122
  ---
@@ -127,10 +127,12 @@ Rationale:
127
127
  |-------|--------------|----------------|-----------------|
128
128
  | EXTRACT COMMITMENTS | Claude | Codex | Claude |
129
129
  | CODE AUDIT | **Codex** | Codex | Claude |
130
- | DOCS AUDIT | **Claude** | Codex | Claude |
130
+ | DOCS AUDIT | **Claude** | **Claude** | Claude |
131
131
  | BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
132
132
  | SYNTHESIZE | Claude | Claude | Claude |
133
133
 
134
+ DOCS AUDIT is always Claude regardless of `--engine` — writing-quality strength on documentation drift detection (READMEs, VISION.md prose, spec status accuracy) is the deciding factor, not code analysis. BROWSER AUDIT is always Claude because Chrome MCP tools are session-bound to Claude Code.
135
+
134
136
  ---
135
137
 
136
138
  ## How to Spawn a Codex Role
@@ -199,7 +201,4 @@ mcp__codex-cli__codex({
199
201
 
200
202
  - `--engine claude` → all roles and phases use Claude (no Codex calls)
201
203
  - `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
202
- - `--engine auto` → each role and phase routes to the optimal model per this table
203
- - `--engine auto` is the recommended default when Codex MCP server is available
204
-
205
- `--engine` and `--with-codex` are **mutually exclusive**. `--engine auto` subsumes `--with-codex both` — it uses Codex where it's optimal (broader than just evaluate/review). If both flags are passed, `--engine` takes precedence and `--with-codex` is ignored with a warning.
204
+ - `--engine auto` (default) → each role and phase routes to the optimal model per this table
@@ -1,6 +1,6 @@
1
1
  ---
2
2
  name: devlyn:ideate
3
- description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline. Optional --with-codex flag adds OpenAI Codex as a cross-model critic. Use when the user wants to brainstorm, plan a new project or feature set, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring.
3
+ description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline. Default `--engine auto` routes the CHALLENGE rubric pass to OpenAI Codex (GPT-5.4) as a cross-model critic for a GAN dynamic. Use when the user wants to brainstorm, plan a new project or feature set, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring.
4
4
  ---
5
5
 
6
6
  # Ideation to Implementation Bridge
@@ -24,18 +24,17 @@ Concretely:
24
24
 
25
25
  Parse these from the user's invocation message:
26
26
 
27
- - `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire reason for the flag is maximum reasoning from a second model family. **Ignored if `--engine` is set** (engine routing subsumes this).
28
27
  - `--engine MODE` (auto) — controls which model handles each ideation phase. Modes:
29
- - `auto` (default): Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic). Subsumes `--with-codex`. Requires Codex MCP server.
28
+ - `auto` (default): Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic). Requires Codex MCP server.
30
29
  - `codex`: Codex handles FRAME/EXPLORE/CONVERGE/DOCUMENT, Claude runs CHALLENGE (role reversal — builder and critic are always different models).
31
30
  - `claude`: all phases use Claude. No Codex calls.
32
31
 
33
32
  **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
34
- - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — NOT `claude`.
33
+ - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
35
34
  - Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort.
36
- - Also read `references/challenge-rubric.md` up front. The engine routing table is defined in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)".
35
+ - Read `references/challenge-rubric.md` up front. The engine routing table lives in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)" — read that on demand when routing decisions are needed.
37
36
 
38
- **If `--engine` is not set and `--with-codex` is explicitly set** (legacy): read `references/challenge-rubric.md` and `references/codex-debate.md` up front, then run the pre-flight check described in `codex-debate.md` to verify the Codex MCP server is available before starting the pipeline. If the server is unavailable and the user opts to continue without Codex, the solo CHALLENGE pass still runs only the cross-model rubric pass is disabled.
37
+ **Consolidated flag**: `--with-codex` was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which routes the CHALLENGE rubric pass to Codex automatically. No flag needed. Continuing with `--engine auto`."
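The pre-flight's decision tree is small enough to state exactly. A hypothetical Python sketch — the real skill does this conversationally, and the fallback after a failed ping is offered to the user, not taken silently:

```python
def resolve_engine(passed, codex_ping_ok):
    """Resolve the effective engine; default is auto, never claude."""
    engine = passed or "auto"   # no --engine flag means auto
    if engine == "claude":
        return "claude"         # explicit claude: no pre-flight, no Codex calls
    if not codex_ping_ok:
        return "claude"         # ping failed and the user chose to continue
    return engine
```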
39
38
 
40
39
  <why_this_matters>
41
40
  When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
@@ -115,7 +114,8 @@ Expand is the most common mode after initial setup — the user already has Visi
115
114
  **On entry:**
116
115
  1. Read `docs/VISION.md`, `docs/ROADMAP.md`, and existing phase `_overview.md` files to understand the established context
117
116
  2. Scan existing item specs to understand what's built and what's planned
118
- 3. Summarize your understanding: "Here's what exists: [phases, item count, current status]. You want to add [new area]. Does this expand an existing phase or warrant a new one?"
117
+ 3. **Run the Archive Pass** (see Context Archiving below) before summarizing. Summarizing a stale roadmap to the user wastes the exchange: they'll see "Phase 1 has 4 items" when really all 4 are already Done and the phase should be collapsed.
118
+ 4. Summarize your understanding: "Here's what exists: [phases, item count, current status]. You want to add [new area]. Does this expand an existing phase or warrant a new one?"
119
119
 
120
120
  **During ideation:**
121
121
  - FRAME is lighter — the vision already exists, focus on framing the NEW area only
@@ -129,7 +129,7 @@ Expand is the most common mode after initial setup — the user already has Visi
129
129
  - New item specs can reference existing items in their Dependencies section
130
130
  - If new items change the meaning of existing items, flag this: "Adding [X] may affect the scope of existing item [Y]. Should we update [Y]'s spec?"
131
131
 
132
- In Replan mode, also read existing docs first, then focus on the Converge phase to reprioritize.
132
+ In Replan mode: read existing docs first, **run the Archive Pass** (see Context Archiving below) before any reprioritization — you can't sensibly reorder work that's already finished — then focus on the Converge phase to reprioritize what remains. The Archive Pass also surfaces Backlog items whose Revisit date has passed, which are natural candidates when replanning.
133
133
 
134
134
  ### Quick Add Mode Detail
135
135
 
@@ -137,8 +137,9 @@ Quick Add is for when the user has a single concrete idea, bug report, or improv
137
137
 
138
138
  **On entry:**
139
139
  1. Read `docs/ROADMAP.md` and relevant phase `_overview.md` files
140
- 2. Identify the best-fit phase for the new item (or suggest a new phase if it doesn't fit)
141
- 3. Determine the next available item ID (e.g., if phase 2 has 2.1-2.4, the new item is 2.5)
140
+ 2. **Run the Archive Pass first** (see Context Archiving below). Do this *before* you figure out where the new item goes; a stale roadmap will mislead phase selection and ID numbering. If the pass moves a phase out of the active section, the new item's natural home may change.
141
+ 3. Identify the best-fit phase for the new item (or suggest a new phase if it doesn't fit)
142
+ 4. Determine the next available item ID (e.g., if phase 2 has 2.1-2.4, the new item is 2.5)
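The ID rule in step 4 is mechanical. A sketch, assuming `major.minor` IDs like those in the example:

```python
def next_item_id(phase, existing_ids):
    """Next free minor number within a phase, e.g. 2.1-2.4 present -> '2.5'."""
    minors = [int(item.split('.')[1]) for item in existing_ids
              if item.startswith(f'{phase}.')]
    return f'{phase}.{max(minors, default=0) + 1}'
```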
142
143
 
143
144
  **Workflow (minimal — no full Frame/Explore/Converge):**
144
145
  1. Confirm the idea with the user: "I'll add this as [item title] in Phase [N]. That sound right?"
@@ -157,12 +158,25 @@ To implement:
157
158
 
158
159
  ### Context Archiving
159
160
 
160
- As projects progress, completed work accumulates and dilutes the active roadmap. Archive stale context at these trigger points:
161
+ ROADMAP.md is the tactical index. Every row that isn't Planned / In Progress / Blocked is noise — it dilutes attention, pads the file past its 150-line target, and makes future ideation sessions read stale context they'll have to mentally filter out. Done work should move; it shouldn't disappear.
161
162
 
162
- **When an entire phase is complete:**
163
- 1. Move the phase's table from the active section to a `## Completed` section at the bottom of ROADMAP.md
164
- 2. Keep it collapsed — just phase name, completion date, and item count
165
- 3. Item spec files stay in place (they're self-contained and may be referenced by dependencies)
163
+ The goal state: the active section of ROADMAP.md only lists work that still needs doing. Everything completed lives under a collapsed `## Completed` block at the bottom. Item spec files themselves stay in place — they remain on disk at `docs/roadmap/phase-N/{id}.md` because other specs may reference them — only the index row moves.
164
+
165
+ #### The Archive Pass
166
+
167
+ Run this at the start of every Quick Add, Expand, and Replan session (each mode's "On entry" checklist tells you when). It's deterministic and cheap. Never skip it to "save time" — the time you save by skipping it is immediately spent by you and the user arguing about a roadmap that shows phantom work.
168
+
169
+ 1. **Read `docs/ROADMAP.md`.** For each phase, look at the Status column of every row.
170
+ 2. **For each phase where every row is `Done`:** archive the whole phase.
171
+ - Cut the phase's `## Phase N: …` heading and table out of the active section.
172
+ - If no `## Completed` section exists yet at the bottom of the file, create one.
173
+ - Add a `<details>` block inside Completed for this phase (see format below). Use the latest completion date you can find in the item spec frontmatter (`completed:` field, or today's date if absent). Item count is the row count.
174
+ 3. **For individual `Done` rows inside an otherwise-active phase:** leave them in place. A row only moves when its whole phase is finished. (Mixed-state phases stay mixed so the user can see recent wins alongside open work.)
175
+ 4. **Scan the Backlog table.** Surface any row whose "Revisit" date has passed — mention it to the user as a replan candidate. Don't auto-promote it; that's a conversation.
176
+ 5. **Scan `docs/roadmap/decisions/`.** Flag any decision whose status is `accepted` but whose reasoning is visibly contradicted by the work that's now Done. Don't silently edit decisions; raise them as open questions.
177
+ 6. **Report what you did.** Before moving on to the mode's main work, tell the user in one short paragraph: "Archived Phase 1 (3 items). Active roadmap is now Phase 2 (2 items). Proceeding with [Quick Add / Expand / Replan]." Skip the report only if nothing changed.
178
+
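Steps 1-3 of the pass reduce to a table scan. A minimal Python sketch under the layout assumptions above (`## Phase N:` headings, markdown tables with a Status column) — illustrative, not part of the skill:

```python
import re

def phases_ready_to_archive(roadmap_md):
    """Names of phases whose every roadmap table row has Status == Done."""
    ready = []
    for block in re.split(r'(?m)^## ', roadmap_md)[1:]:
        heading, _, body = block.partition('\n')
        if not heading.startswith('Phase'):
            continue  # skip Backlog, Decisions, Completed, ...
        rows = [line for line in body.splitlines()
                if line.lstrip().startswith('|') and '---' not in line]
        if len(rows) < 2:
            continue  # header only, or no table at all
        header, *items = rows
        cols = [c.strip().lower() for c in header.strip('|').split('|')]
        if 'status' not in cols:
            continue
        idx = cols.index('status')
        statuses = [row.strip('|').split('|')[idx].strip() for row in items]
        if all(s.lower() == 'done' for s in statuses):
            ready.append(heading.strip())
    return ready
```

A phase with even one non-Done row never appears in the result, which is exactly the mixed-state rule in step 3.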
179
+ **Completed block format** (place at the bottom of ROADMAP.md, below Decisions):
166
180
 
167
181
  ```markdown
168
182
  ## Completed
@@ -178,16 +192,13 @@ As projects progress, completed work accumulates and dilutes the active roadmap.
178
192
  </details>
179
193
  ```
180
194
 
181
- **When entering Expand or Replan mode:**
182
- 1. Scan ROADMAP.md for items marked Done — if all items in a phase are Done, archive that phase
183
- 2. Check Backlog for items whose "Revisit" date has passed — surface them as candidates for the new phase
184
- 3. Review decisions — flag any marked `accepted` that may need revisiting given new context
195
+ If the `## Completed` section already exists and you're archiving an additional phase, append a new `<details>` block — don't rewrite existing ones.
185
196
 
186
- **When a decision becomes outdated:**
187
- - Don't delete it — mark status as `superseded` and add a note pointing to the replacement decision
188
- - This preserves the reasoning history for future reference
197
+ #### Outdated decisions
189
198
 
190
- The goal: ROADMAP.md's active section should only show work that's planned, in-progress, or blocked. Everything else moves to Completed or gets re-evaluated.
199
+ When a decision becomes wrong because the world changed under it:
200
+ - Don't delete it — set its `status:` to `superseded` in the decision file's frontmatter and add a one-line pointer to the replacement decision record.
201
+ - This preserves the reasoning history for future reference, which matters more than a tidy decisions table.
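Concretely, superseding is a two-line frontmatter edit. Hypothetical decision IDs and a hypothetical `superseded-by` field name, shown only to illustrate the shape:

```markdown
---
id: D-007
title: Store sessions in Redis
status: superseded   # was: accepted
superseded-by: D-012 # one-line pointer to the replacement decision
---
```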
191
202
 
192
203
  ## Phase 1: FRAME
193
204
 
@@ -211,6 +222,10 @@ Don't write documents yet. The output of this phase is a shared mental model bet
211
222
 
212
223
  This is the creative core — the phase that should take the most conversational turns. The user chose to ideate with AI because they want perspectives, research, and creative expansion they wouldn't get alone.
213
224
 
225
+ <use_parallel_tool_calls>
226
+ EXPLORE often needs several independent lookups: web search for prior art, doc fetches, repo greps for existing patterns. When tool calls have no dependencies on each other, issue them in parallel in the same response. Spawn subagents in parallel when fanning out across distinct research topics. Only chain calls that depend on a previous call's output. Pace research across turns rather than front-loading every lookup before the user has framed direction — EXPLORE is dialogue-driven, parallel is just for the lookups inside any single turn.
227
+ </use_parallel_tool_calls>
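The fan-out shape above can be sketched in plain Python — illustrative only, with `search` standing in for whatever lookup tool a turn needs (web search, doc fetch, repo grep):

```python
from concurrent.futures import ThreadPoolExecutor

def explore_lookups(queries, search):
    """Run independent research lookups concurrently; results keep query order."""
    with ThreadPoolExecutor(max_workers=len(queries) or 1) as pool:
        return list(pool.map(search, queries))
```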
228
+
214
229
  <research_protocol>
215
230
  When relevant, actively research before and during brainstorming:
216
231
  - **Existing solutions**: What's already out there? (web search, documentation)
@@ -316,13 +331,60 @@ For Quick Add with one new item, one solo pass is enough. For a full greenfield
316
331
 
317
332
  If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
318
333
 
319
- ### Codex pass (engine-routed or legacy `--with-codex`)
334
+ ### Codex critic pass (engine-routed)
335
+
336
+ **If `--engine auto`** (default): Codex runs the CHALLENGE rubric pass automatically as critic.
337
+
338
+ Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, `workingDirectory: <project root>`. The `prompt` parameter is built from the packaged plan + the inlined rubric + the appended Codex instructions. Codex has no filesystem access to this project, so everything it needs travels in the prompt.
339
+
340
+ **Step 1 — Package the post-solo plan.** Build the prompt with these sections in this order:
341
+
342
+ ```
343
+ ## Problem framing (from FRAME phase)
344
+ [problem statement, constraints, success criteria, anti-goals]
345
+
346
+ ## Confirmed facts vs assumptions
347
+ Confirmed by user: [list each fact the user explicitly confirmed]
348
+ Assumptions (not yet confirmed): [list each assumption the agent made]
349
+
350
+ ## Plan (post-solo-CHALLENGE)
351
+ Vision: [one sentence]
352
+ Phase 1 ([theme]): [items with one-line descriptions and dependencies]
353
+ Phase 2 ([theme]): ...
354
+ Architecture decisions: [each with what / why / alternatives considered]
355
+ Deferred to backlog: [items + reason]
356
+
357
+ ## Findings from the solo rubric pass
358
+ [list each with: severity, axis, quote, why, fix, whether applied]
359
+
360
+ ## Rubric
361
+ [INLINE the full text of references/challenge-rubric.md here verbatim — Codex needs the rubric definition in the prompt itself]
362
+
363
+ ## Your job
364
+ You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user explicitly asked to be challenged because soft-pedaled plans waste their time.
365
+
366
+ You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" (with one-line agreement) or "I would frame this differently" (with a reason). Then add your own findings that the solo pass missed.
367
+
368
+ Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
369
+
370
+ Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let the orchestrator surface it to the user.
371
+
372
+ End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, plus a one-line explanation.
373
+ ```
374
+
375
+ **Step 2 — Reconcile.** Merge the two finding lists:
376
+ - Same finding from both → keep the more specific wording, mark "confirmed by both"
377
+ - Codex-only → prefix `[codex]` in internal notes so the user-facing summary can attribute correctly
378
+ - Solo-only → keep as-is
379
+ - Conflicts (solo says X, Codex says not-X) → record both, do not silently pick one; if material, surface as an open question in the user-facing summary
380
+
381
+ If Codex raised CRITICAL or HIGH findings the solo pass missed, apply the fixes to the plan before presenting the user-facing summary — unless fixing would change something the user explicitly confirmed, in which case follow the rubric's "Respect explicit user intent" rule.
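The merge rules can be sketched as a keyed merge. Hypothetical shapes: real findings are prose blocks, "more specific" is a judgment call (approximated here by length), and the conflict rule (record both, surface as an open question) is omitted for brevity:

```python
def reconcile(solo, codex):
    """Merge solo and Codex finding lists keyed by a normalized finding id."""
    merged = {}
    for key in solo.keys() | codex.keys():
        if key in solo and key in codex:
            wording = max(solo[key], codex[key], key=len)  # keep the more specific
            merged[key] = wording + ' (confirmed by both)'
        elif key in codex:
            merged[key] = '[codex] ' + codex[key]          # Codex-only, attributed
        else:
            merged[key] = solo[key]                        # solo-only, kept as-is
    return merged
```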
 
- **If `--engine auto`**: Codex runs the CHALLENGE rubric pass automatically. Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the packaged plan + rubric as prompt (same format as `codex-debate.md` Step 2). Reconcile findings: same finding from both → "confirmed by both", Codex-only → prefix `[codex]`.
+ **Do not loop.** One Codex pass is enough. If the result is still FAIL after reconciliation, the plan has structural problems that belong in the user-facing summary as open questions rather than further iteration.
 
- **If `--engine codex`**: Role reversal — Codex built the plan (FRAME/EXPLORE/CONVERGE/DOCUMENT), so Claude runs the solo CHALLENGE pass. Do NOT also run Codex on CHALLENGE — builder and critic must be different models. Skip this section entirely.
+ **If `--engine codex`**: Role reversal — Codex built the plan, so Claude runs the solo CHALLENGE pass and that is the only pass. Do not also run Codex on CHALLENGE — builder and critic should always be different models. Skip this section.
 
- **If `--engine claude` or `--engine` not set, and `--with-codex` is set** (legacy): follow `references/codex-debate.md` "PHASE 3.5-CODEX" section. Codex applies the rubric from `challenge-rubric.md` independently at `reasoningEffort: "xhigh"`. Reconcile findings as `codex-debate.md` describes — findings raised by both sides get "confirmed by both", Codex-only findings get prefixed `[codex]` in internal notes so the user can see where each push came from.
+ **If `--engine claude`**: No Codex calls. The solo pass is the only pass.
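The three `--engine` branches pin down who builds and who critiques. A toy sketch of the dispatch (function and label names hypothetical; the real routing lives in prose instructions, not code):

```python
def challenge_roles(engine: str) -> tuple:
    """Map the --engine flag to (builder, critic) for the CHALLENGE phase."""
    if engine == "auto":
        # Claude built the plan, so Codex runs the rubric pass.
        return ("claude", "codex")
    if engine == "codex":
        # Role reversal: Codex built the plan, so Claude is the sole critic.
        return ("codex", "claude")
    if engine == "claude":
        # No Codex calls: Claude critiques its own plan in a solo pass.
        return ("claude", "claude")
    raise ValueError(f"unknown engine: {engine!r}")

# The cross-model pairing holds whenever Codex is in play:
for engine in ("auto", "codex"):
    builder, critic = challenge_roles(engine)
    assert builder != critic
```

Only `--engine claude` collapses builder and critic into the same model, which is exactly why that branch runs a solo pass and nothing more.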
 
 ### Respect explicit user intent
 
@@ -344,7 +406,7 @@ Deferred: [items with reasons]
 ## CHALLENGE results
 
 Solo pass: [N findings, M applied]
- Codex pass: [N findings, M applied] ← only if --with-codex was set
+ Codex pass: [N findings, M applied] ← only on --engine auto
 
 Changes applied during CHALLENGE:
 - [item]: [what changed and which axis triggered it]
@@ -457,7 +519,7 @@ Before finalizing, verify:
 - [ ] No spec requires reading VISION.md to be understood (self-contained)
 - [ ] Dependencies between items are documented in both specs
 - [ ] Architecture decisions include reasoning and alternatives considered
- - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex if `--with-codex` was set); no item still fails any axis at CRITICAL or HIGH severity
+ - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex critic on `--engine auto`); no item still fails any axis at CRITICAL or HIGH severity
 - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
 - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
 
@@ -7,7 +7,7 @@
 - Finding format
 - Examples (good vs bad findings, plus a detour-sequencing example)
 
- The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex pass (when `--with-codex` is set) use this file — there is exactly one definition of the rubric, and both paths read it directly from SKILL.md.
+ The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex critic pass (on `--engine auto`) use this file — there is exactly one definition of the rubric, and `SKILL.md` instructs both passes to read it directly from here.
 
 The rubric exists because plans produced in a single pass, by a single model, in a single conversation almost always fail at least one axis somewhere. The user's historical experience: every time they asked "is this really no-workaround, no-guesswork, no-overengineering, world-class, optimized?", the honest answer was no. This phase makes the answer honestly yes before the user even has to ask.
 
@@ -58,6 +58,10 @@ Example with engine: `/devlyn:preflight --engine auto`
 
 ## PHASE 0: DISCOVER & SCOPE
 
+ <use_parallel_tool_calls>
+ Phase 0 and Phase 1 do many independent reads (planning docs, item specs, prior state). When tool calls have no dependencies between them, issue them in parallel in a single response — that includes globbing for spec files and reading several specs at once. Only chain calls that depend on values from a previous call.
+ </use_parallel_tool_calls>
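The same fan-out-then-chain rule applies to ordinary code. Purely as an analogy (file paths arbitrary), independent reads can be dispatched together instead of one after another, chaining only work that needs a prior result:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_many(paths):
    """Read a set of independent files concurrently.

    Missing files map to None instead of raising, so callers can
    decide per-document how to handle absence.
    """
    def read_one(path):
        p = Path(path)
        return path, (p.read_text() if p.is_file() else None)
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(read_one, paths))
```

For example, `read_many(["docs/VISION.md", "docs/ROADMAP.md"])` fetches both planning documents in one batch.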
+
 1. **Find planning documents** — search in parallel:
    - `docs/VISION.md`
    - `docs/ROADMAP.md`
@@ -80,7 +84,7 @@ Scope: [Phase N / All phases]
 Documents: VISION.md, ROADMAP.md, [N] item specs
 Deferred items (excluded): [N]
 Previous run: [found — will show delta / none]
- Phases: Extract → Audit [Browser] [Docs] → Report → Triage
+ Phases: 1 Extract → 2 Audit (code + docs + browser) → 3 Report → 4 Triage
 ```
 
 ## PHASE 1: EXTRACT COMMITMENTS
@@ -102,9 +106,9 @@ Read all in-scope planning documents and build a **commitment registry** — eve
 - Items with `status: cut` in ROADMAP.md
 - Out of Scope entries — these are anti-commitments (things promised NOT to build)
 
- 5. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are NOT expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do NOT audit them or report them as findings. This distinction matters — flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.
+ 5. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are not expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do **not** audit them or report them as findings. Flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.
 
- 5. **Write to `.devlyn/commitment-registry.md`**:
+ 6. **Write to `.devlyn/commitment-registry.md`**:
 
 ```markdown
 # Commitment Registry
@@ -137,20 +141,20 @@ Spawn all applicable auditors in parallel. Each reads `.devlyn/commitment-regist
 
 ### code-auditor (always)
 
- **Engine routing**: If `--engine auto` or `--engine codex`, call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, and the full code-auditor prompt (read from `references/auditors/code-auditor.md`). Include the commitment registry content inline in the prompt since Codex cannot read `.devlyn/commitment-registry.md` directly in read-only sandbox. If `--engine claude`, spawn a Claude subagent as below.
-
- Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/code-auditor.md` and pass it to the subagent.
+ Engine routing follows the auto-resolve skill's `references/engine-routing.md` ("Pipeline Phase Routing (preflight)", CODE AUDIT row): Codex on `--engine auto`/`codex`, Claude on `--engine claude`. When the route is **Codex**, call `mcp__codex-cli__codex` with the auditor prompt inline (Codex cannot read `.devlyn/commitment-registry.md` directly under the `read-only` sandbox, so paste the registry into the prompt). When the route is **Claude**, spawn a subagent with `mode: "bypassPermissions"`. Either way, read the auditor prompt from `references/auditors/code-auditor.md`.
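As a rough sketch, the Codex-route call could be assembled like this. The helper function is hypothetical; the argument names mirror the `mcp__codex-cli__codex` parameters used throughout this diff, and the model/effort values follow the inline instruction this hunk replaces rather than `engine-routing.md` itself:

```python
from pathlib import Path

def build_codex_audit_call(project_root: str) -> dict:
    """Assemble arguments for the code-auditor's Codex call.

    The commitment registry is pasted into the prompt because Codex
    cannot read .devlyn/ files under the read-only sandbox.
    """
    root = Path(project_root)
    auditor_prompt = (root / "references" / "auditors" / "code-auditor.md").read_text()
    registry = (root / ".devlyn" / "commitment-registry.md").read_text()
    return {
        "model": "gpt-5.4",
        "reasoningEffort": "xhigh",
        "sandbox": "read-only",
        "workingDirectory": str(root),
        "prompt": f"{auditor_prompt}\n\n## Commitment registry\n\n{registry}",
    }
```

The returned dict is what would be handed to the MCP tool; on the Claude route none of this applies, since a subagent reads the files itself.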
 
 The code-auditor classifies each commitment as IMPLEMENTED, MISSING, INCOMPLETE, DIVERGENT, or BROKEN — with file:line evidence. Also catches cross-feature integration gaps and constraint violations. Writes to `.devlyn/audit-code.md`.
 
 ### docs-auditor (unless --skip-docs)
 
- Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/docs-auditor.md` and pass it to the subagent.
+ Always Claude (writing-quality strength) regardless of `--engine`. Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/docs-auditor.md` and pass it to the subagent.
 
 Checks: ROADMAP.md status accuracy, README alignment, API doc coverage, VISION.md currency, item spec status. Writes to `.devlyn/audit-docs.md`.
 
 ### browser-auditor (conditional)
 
+ Always Claude (Chrome MCP tools are session-bound) regardless of `--engine`.
+
 **Skip conditions** (check in order):
 1. `--skip-browser` flag → skip
 2. No web-relevant files in project (no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.html`, `page.*`, `layout.*`) → skip with note "Browser validation skipped — no web files detected"
@@ -340,7 +344,7 @@ Triage complete.
 
 Next steps:
 - To implement fixes: /devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/[id]-[name].md"
- - For high-stakes fixes (CRITICAL severity or complex DIVERGENT findings), add `--with-codex both` to cross-validate the fix and review with Codex
+ - For CRITICAL severity or complex DIVERGENT findings, the default `--engine auto` already routes BUILD/FIX to Codex and EVALUATE/CHALLENGE to Claude (cross-model GAN dynamic). No extra flag needed.
 - To re-run preflight after fixes: /devlyn:preflight [same flags]
 - To add new features discovered during audit: /devlyn:ideate expand
 ```
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "devlyn-cli",
- "version": "1.12.4",
+ "version": "1.13.0",
 "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
 "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
 "bin": {
@@ -1,106 +0,0 @@
- # Codex Cross-Model Integration (Legacy)
-
- > **Note**: This file is the legacy `--with-codex` integration. For the newer `--engine` flag (which subsumes `--with-codex`), see `references/engine-routing.md`. Only read this file when `--with-codex` is enabled AND `--engine` is NOT set.
-
- Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline.
-
- Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). This creates a GAN-like adversarial dynamic — Claude builds and Codex critiques, reducing shared blind spots between model families.
-
- ---
-
- ## PRE-FLIGHT CHECK
-
- Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
-
- - **If ping succeeds**: continue normally.
- - **If ping fails or `mcp__codex-cli__ping` tool is not found**: warn the user and ask how to proceed:
- ```
- ⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
-
- To install:
- npm i -g @openai/codex
- claude mcp add codex-cli -- npx -y codex-mcp-server
-
- Options:
- [1] Continue without --with-codex (Claude-only evaluation/review)
- [2] Abort pipeline
- ```
- If the user chooses [1], disable `--with-codex` and continue. If [2], stop.
-
- ---
-
- ## PHASE 2-CODEX: CROSS-MODEL EVALUATE
-
- Run after the Claude evaluator (Phase 2) completes, only if `--with-codex` includes `evaluate` or `both`.
-
- ### Step 1 — Get Codex's evaluation
-
- Call `mcp__codex-cli__codex` with:
- - `prompt`: Include the full content of `.devlyn/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
- - `workingDirectory`: the project root
- - `sandbox`: `"read-only"` (Codex should only read, not modify files)
- - `reasoningEffort`: `"high"` (note: for `--engine auto`, the engine-routing.md uses `"xhigh"` by default)
- - `model`: `"gpt-5.4"` (pass explicitly — the MCP schema default may be outdated)
-
- Example prompt to pass:
- ```
- You are an independent code evaluator. Grade the following code changes against the done criteria below. Be strict — when in doubt, flag it.
-
- ## Done Criteria
- [paste contents of .devlyn/done-criteria.md]
-
- ## Code Changes
- [paste output of git diff HEAD~1]
-
- For each criterion, mark VERIFIED (with evidence) or FAILED (with file:line and what's wrong).
- Then list all issues found grouped by severity: CRITICAL, HIGH, MEDIUM, LOW.
- For each issue provide: file:line, description, and suggested fix.
- End with a verdict: PASS, PASS WITH ISSUES, NEEDS WORK, or BLOCKED.
- ```
-
- ### Step 2 — Merge findings
-
- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to merge Claude's and Codex's evaluations.
-
- Agent prompt:
-
- Read `.devlyn/EVAL-FINDINGS.md` (Claude's evaluation) and the Codex evaluation output below. Merge them into a single unified `.devlyn/EVAL-FINDINGS.md` following the existing format. Rules:
- - Take the MORE SEVERE verdict between the two evaluators
- - Deduplicate findings that reference the same file:line or describe the same issue
- - When both evaluators flag the same issue, keep the more detailed description
- - Prefix Codex-only findings with `[codex]` so the fix loop knows the source
- - Preserve the exact structure: Verdict, Done Criteria Results, Findings Requiring Action (CRITICAL/HIGH), Cross-Cutting Patterns
-
- Codex evaluation:
- [paste Codex's response here]
-
- ---
-
- ## PHASE 4B: CODEX REVIEW
-
- Run after the Claude team review (Phase 4A) completes, only if `--with-codex` includes `review` or `both`.
-
- ### Step 1 — Run Codex review
-
- Call `mcp__codex-cli__review` with:
- - `base`: `"main"` — review all changes since main
- - `workingDirectory`: the project root
- - `title`: `"Cross-model review (Codex)"`
-
- This runs OpenAI Codex's built-in code review against the diff. The review tool returns structured findings automatically — no custom prompt needed.
-
- ### Step 2 — Reconcile both reviews
-
- Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to reconcile both reviews.
-
- Agent prompt:
-
- Two independent reviews have been conducted on recent changes — one by a Claude team review and one by OpenAI Codex. Reconcile them:
-
- Claude team review findings: [paste Phase 4A agent's output summary]
- Codex review findings: [paste mcp__codex-cli__review output]
-
- 1. Deduplicate findings that describe the same issue
- 2. For unique Codex findings not caught by Claude's team, prefix with `[codex]` and assess severity
- 3. Fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
- 4. Write a brief reconciliation summary to stdout listing: findings from both (agreed), Claude-only, Codex-only, and what was fixed
@@ -1,112 +0,0 @@
- # Codex Cross-Model Rubric Pass (Legacy)
-
- > **Note**: This file is the legacy `--with-codex` integration for ideate. For the newer `--engine` flag (which subsumes `--with-codex`), see the engine routing section in SKILL.md. Only read this file when `--with-codex` is set AND `--engine` is NOT set.
-
- ## Contents
- - Pre-flight check (verify Codex MCP server availability)
- - PHASE 3.5-CODEX: packaging the plan, calling Codex, reconciling findings with the solo pass
- - Cost notes (one Codex call per ideation session)
-
- Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE. The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
-
- Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). The intent: one opinionated rubric pass from a different model family, applied right before the user sees the plan. Two model families catch different blind spots; one pass at maximum effort catches more than multiple shallow passes.
-
- **Always use `model: "gpt-5.4"`, `reasoningEffort: "xhigh"` and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model. Pass `model: "gpt-5.4"` explicitly as the MCP schema default may be outdated.
-
- ---
-
- ## PRE-FLIGHT CHECK
-
- Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
-
- - **If ping succeeds**: continue.
- - **If ping fails or `mcp__codex-cli__ping` is not found**: warn the user and ask:
- ```
- ⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
-
- To install:
- npm i -g @openai/codex
- claude mcp add codex-cli -- npx -y codex-mcp-server
-
- Options:
- [1] Continue without --with-codex (Claude-only solo CHALLENGE pass)
- [2] Abort
- ```
- If [1], disable `--with-codex` and continue with the solo CHALLENGE. If [2], stop.
-
- ---
-
- ## PHASE 3.5-CODEX: Codex rubric pass
-
- Run after the solo CHALLENGE pass completes, before the user-facing summary.
-
- ### Step 1 — Package the post-solo plan
-
- Use the plan as it stands after the solo rubric pass. Package the full context Codex needs:
-
- ```
- ## Problem framing (from FRAME phase)
- [problem statement, constraints, success criteria, anti-goals]
-
- ## Confirmed facts vs assumptions
- Confirmed by user: [list]
- Assumptions (not yet confirmed): [list]
-
- ## Plan (post-solo-CHALLENGE)
- Vision: [one sentence]
- Phase 1 ([theme]): [items, dependencies, one-line descriptions]
- Phase 2 ([theme]): ...
- Architecture decisions: [each with what / why / alternatives]
- Deferred to backlog: [items + reason]
-
- ## Findings from the solo rubric pass
- [list each with: axis, quote, why, fix, whether applied]
- ```
-
- Include the framing and assumptions — Codex can only judge whether the plan fits the user's reality if it sees what the user actually said.
-
- ### Step 2 — Codex challenge pass
-
- Call `mcp__codex-cli__codex` with:
- - `prompt`: the packaged context above, followed by the instructions below
- - `workingDirectory`: the project root
- - `sandbox`: `"read-only"`
- - `model`: `"gpt-5.4"` — pass explicitly; the MCP schema default may still show `gpt-5.3-codex`
- - `reasoningEffort`: `"xhigh"` — the highest setting in the Codex enum (`none < minimal < low < medium < high < xhigh`). Always pick the top level; this is the entire reason for the flag.
-
- Instructions to append to the packaged context. **Before sending, inline the full text of `references/challenge-rubric.md` into the prompt under a `## Rubric` heading** — Codex does not have filesystem access to this project, so Claude must ship the rubric itself. Claude already has the rubric loaded from Phase 3.5 setup.
-
- Template for the appended instructions:
-
- ```
- You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user has explicitly asked to be challenged because soft-pedaled plans waste their time.
-
- ## Rubric
- [Claude inlines the full text of references/challenge-rubric.md here]
-
- ## Your job
- - You are running AFTER a solo pass by Claude. Catch what the solo pass missed, do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" or "I would frame this differently" with a reason. Then add your own findings that the solo pass missed.
- - Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
- - Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let Claude surface it to the user.
-
- End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, and a one-line explanation.
- ```
-
- ### Step 3 — Reconcile solo and Codex findings
-
- Merge the two finding lists:
- - Same finding from both → keep the more specific wording, mark "confirmed by both".
- - Codex-only → prefix `[codex]` in internal notes so the user-facing summary can show where each push came from.
- - Solo-only → keep as-is.
- - Conflicts (solo says X, Codex says not-X) → record both, do not silently pick one; if the conflict is material, include it as an open question in the user-facing summary.
-
- If Codex raised CRITICAL or HIGH findings that the solo pass missed, apply the fixes to the plan before presenting the user-facing summary. If fixing would change something the user explicitly asked for, follow the "Respect explicit user intent" rule already loaded from the rubric: do not silently rewrite — surface it.
-
- Do not loop. One Codex pass is enough. If the result is still FAIL after one pass, that is signal that the plan has structural problems the user should see directly, not signal to keep iterating in the background.
-
- ---
-
- ## Cost notes
-
- - One Codex call at `reasoningEffort: "xhigh"` typically takes 30–90s and is not cheap. This integration is bounded: exactly one Codex call per ideation session.
- - In Quick Add mode on a single new item, one Codex call is still worth it — small scope, huge signal, and single-item additions are exactly where workarounds slip in unnoticed.