devlyn-cli 1.12.5 → 1.14.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +23 -12
- package/README.md +24 -8
- package/config/skills/devlyn:auto-resolve/SKILL.md +182 -68
- package/config/skills/devlyn:auto-resolve/references/engine-routing.md +5 -6
- package/config/skills/devlyn:ideate/SKILL.md +66 -25
- package/config/skills/devlyn:ideate/references/challenge-rubric.md +1 -1
- package/config/skills/devlyn:ideate/references/templates/item-spec.md +4 -0
- package/config/skills/devlyn:preflight/SKILL.md +12 -8
- package/package.json +1 -1
- package/config/skills/devlyn:auto-resolve/references/codex-integration.md +0 -106
- package/config/skills/devlyn:ideate/references/codex-debate.md +0 -112
package/CLAUDE.md
CHANGED
@@ -1,5 +1,15 @@
 # Project Instructions
 
+## Quick Start
+
+For most work, the recommended sequence is:
+
+1. `/devlyn:ideate` — turn an idea into roadmap-ready specs
+2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → polish
+3. `/devlyn:preflight` — verify the implementation matches the roadmap before shipping
+
+All three default to `--engine auto`, which routes each phase to the optimal model (Codex GPT-5.4 for hard coding, Claude Opus 4.7 for evaluation/critique). The cross-model GAN dynamic — different models build vs critique — catches what single-model pipelines miss.
+
 ## General
 
 - Proactively use subagents and skills where needed
@@ -20,7 +30,7 @@
 When investigating bugs, analyzing features, or exploring code:
 
 1. **Define exit criteria upfront** - Ask "What does 'done' look like?" before starting
-2. **Checkpoint progress** - Use
+2. **Checkpoint progress** - Use the task tools (TaskCreate / TaskUpdate) every 5-10 minutes to save findings
 3. **Output intermediate summaries** - Provide "Current Understanding" snapshots so work isn't lost if interrupted
 4. **Always deliver findings** - Never end mid-analysis; at minimum output:
    - Files examined
@@ -28,7 +38,7 @@ When investigating bugs, analyzing features, or exploring code:
    - Remaining unknowns
    - Recommended next steps
 
-For complex investigations, use `/devlyn:team-resolve` to assemble a multi-perspective investigation team, or spawn parallel
+For complex investigations, use `/devlyn:team-resolve` to assemble a multi-perspective investigation team, or spawn parallel Agent subagents to explore different areas simultaneously.
 
 ## UI/UX Workflow
 
@@ -42,11 +52,11 @@ The full design-to-implementation pipeline:
 ## Feature Development
 
 1. **Plan first** - Always output a concrete implementation plan with specific file changes before writing code
-2. **Track progress** - Use
+2. **Track progress** - Use the task tools (TaskCreate / TaskUpdate) to checkpoint each phase
 3. **Test validation** - Write tests alongside implementation; iterate until green
 4. **Small commits** - Commit working increments rather than large changesets
 
-For complex features,
+For complex features, spawn the `Plan` subagent (`Agent` tool with `subagent_type: "Plan"`) to design the approach before implementation.
 
 ## Automated Pipeline (Recommended Starting Point)
 
@@ -72,8 +82,7 @@ Optional flags:
 - `--skip-review` — skip team-review phase
 - `--skip-clean` — skip clean phase
 - `--skip-docs` — skip update-docs phase
-- `--engine auto|codex|claude` — intelligent model routing. `auto` (default) routes each phase and team role to the optimal model
-- `--with-codex [evaluate|review|both]` — (legacy, superseded by `--engine`) use OpenAI Codex as cross-model evaluator/reviewer (requires codex-mcp-server)
+- `--engine auto|codex|claude` — intelligent model routing. `auto` (default) routes each phase and team role to the optimal model based on benchmark data: Codex GPT-5.4 handles BUILD and FIX (SWE-bench Pro lead), Claude Opus 4.7 handles EVALUATE and CHALLENGE (long-context retrieval + skeptical reasoning). Different models build vs critique — the cross-model GAN dynamic catches what single-model pipelines miss. `codex` forces Codex for implementation, Claude for orchestration and Chrome MCP. `claude` uses Claude for everything. Requires codex-mcp-server for `auto` and `codex` modes.
 
 ## Preflight Check (Post-Roadmap Verification)
 
@@ -92,7 +101,7 @@
 - `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
 - `--skip-browser` — skip browser validation
 - `--skip-docs` — skip documentation audit
-- `--engine auto|codex|claude` —
+- `--engine auto|codex|claude` — `auto` (default) routes the code-auditor to Codex (SWE-bench Pro +11.7pp on code analysis); the docs-auditor and browser-auditor always use Claude regardless of `--engine` (writing-quality strength on docs drift; Chrome MCP tools are session-bound to Claude Code)
 
 **Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)
 
@@ -152,11 +161,13 @@ Steps 4-6 are optional depending on the scope of changes. `/simplify` should alw
 
 ## Context Window Management
 
-
-
-
-
-
+Claude 4.5 / 4.6 / 4.7 models auto-compact the conversation as it approaches the context limit, so you can keep working indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns — the model continues from where it left off after compaction.
+
+For genuinely multi-context-window work (e.g., a roadmap with many phases), persist state to disk so the next instance can resume:
+- All `auto-resolve` and `preflight` runs already write durable state to `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS, PREFLIGHT-REPORT) and to git commits — pick up by reading those files plus `git log`.
+- For long investigations, write progress notes to a `HANDOFF.md` and resume with `@HANDOFF.md continue from where this left off` if you need a fresh window.
+
+Manually clearing with `/clear` is rarely necessary — only do it when context is genuinely irrelevant to the next task.
 
 ## Communication Style
 
package/README.md
CHANGED
@@ -109,7 +109,7 @@ Install the Codex MCP server during setup, then:
 /devlyn:auto-resolve "fix the auth bug" --engine auto
 ```
 
-**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.
+**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
 
 > `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
 
@@ -147,15 +147,31 @@ Works across the full pipeline:
 </details>
 
 <details>
-<summary>
+<summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
 
-
-
-
+`/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
+
+- **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
+- **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
+- **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
+- **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
+- **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
+- **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
+- **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
+- **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
+- **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
+
+</details>
+
+<details>
+<summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
 
-
+Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
 
-
+- **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
+- **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
+- **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
+- **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
 
 </details>
 
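The Customer Frame bullet above is easier to picture with a tiny spec fragment. Everything here is invented for illustration; only the section order (Context, then `## Customer Frame`, then Objective) and the JTBD sentence shape come from the release notes:

```
## Context
Signup currently drops users on an empty dashboard, and activation falls off there.

## Customer Frame
When a new user finishes signup, they want a pre-populated starter project
so they can see the product working before investing any setup effort.

## Objective
Seed every new account with a starter project on first login.
```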
@@ -258,7 +274,7 @@ Selected during install. Run `npx devlyn-cli` again to add more.
 
 | Server | Description |
 |---|---|
-| `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing
+| `codex-cli` | Codex MCP server — enables `--engine auto/codex` intelligent model routing |
 | `playwright` | Playwright MCP — powers browser-validate Tier 2 |
 
 </details>
package/config/skills/devlyn:auto-resolve/SKILL.md
CHANGED
@@ -11,6 +11,14 @@ $ARGUMENTS
 
 <pipeline_workflow>
 
+<orchestrator_context>
+This pipeline is long-horizon agentic work. As the orchestrator, you spawn many subagents and read their handoff files; your own context grows over the run.
+
+- Your context window is auto-compacted as it approaches its limit, so do not stop tasks early due to token-budget concerns. Keep the run going.
+- All durable state lives in `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS) and in git commits. If your context is cleared mid-run, the next instance can resume from those files plus `git log`. Keep them up to date.
+- Best results come from `xhigh` effort. If you are running on lower effort and notice shallow reasoning during phase decisions, escalate.
+</orchestrator_context>
+
 <autonomy_contract>
 This pipeline runs hands-free. The user launches it to walk away and come back to finished work, so the quality of this run is measured by how far it gets without human intervention. Apply these behaviors throughout every phase:
 
@@ -21,6 +29,17 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
 5. **Treat questions as a signal to act instead.** If you notice yourself drafting a question to the user mid-pipeline, convert it into a decision + log entry and spawn the next phase.
 </autonomy_contract>
 
+<engine_routing_convention>
+Every phase in this pipeline routes its work to the optimal model per `references/engine-routing.md`. The convention is the same everywhere:
+
+- The phase prompt body below is **engine-agnostic** — same instructions whether Codex or Claude executes it.
+- For phases routed to **Codex** (per the routing table), call `mcp__codex-cli__codex` per the patterns in `engine-routing.md` (How to Spawn a Codex BUILD/FIX Agent / How to Spawn a Codex Role / How to Spawn a Dual Role).
+- For phases routed to **Claude**, spawn an Agent subagent with `mode: "bypassPermissions"` and pass the prompt body verbatim.
+- `--engine claude` forces all phases to Claude. `--engine codex` forces implementation/analysis to Codex (Claude still handles orchestration and Chrome MCP). `--engine auto` (default) uses the routing table per phase.
+
+Phase-level "Engine routing" notes below are short reminders only — `engine-routing.md` is the single source of truth.
+</engine_routing_convention>
+
 ## PHASE 0: PARSE INPUT
 
 1. Extract the task/issue description from `<pipeline_config>`.
@@ -33,22 +52,21 @@
 - `--skip-docs` (false) — skip update-docs phase
 - `--skip-build-gate` (false) — skip the deterministic build gate (Phase 1.4). Not recommended — the build gate is the primary defense against "tests pass locally, breaks in CI/Docker/production" class of bugs.
 - `--build-gate MODE` (auto) — controls build gate behavior. `auto`: detect project type and run appropriate build/typecheck/lint commands; if Dockerfile(s) are present, Docker builds are included automatically. `strict`: auto + treat warnings as errors. `no-docker`: auto but skip Docker builds even if Dockerfiles exist (for faster iteration). `skip`: same as --skip-build-gate.
-- `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques. **Ignored if `--engine` is set** (engine routing subsumes this).
 - `--engine MODE` (auto) — controls which model handles each pipeline phase and team role. Modes:
-  - `auto` (default): each phase and team role routes to the optimal model based on benchmark data. Requires Codex MCP server.
+  - `auto` (default): each phase and team role routes to the optimal model based on benchmark data. Requires Codex MCP server. Codex handles BUILD/FIX (SWE-bench Pro lead) and several team roles; Claude handles EVALUATE, CHALLENGE, BROWSER, and orchestration — creating a GAN-like dynamic where the builder and critic are always different models.
   - `codex`: Codex handles implementation/analysis phases, Claude handles orchestration, evaluation, and Chrome MCP.
   - `claude`: all phases use Claude subagents. No Codex calls.
 
 Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
 Engine examples: `--engine auto`, `--engine codex`, `--engine claude`
-
-
+If no flags are present, use defaults. The default engine is `auto` — if the user does not pass `--engine`, treat it as `--engine auto`.
+
+**Consolidated flag**: `--with-codex` (and its variants `evaluate`/`review`/`both`) was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which provides broader Codex coverage — Codex now handles BUILD, FIX, and several team roles automatically. No flag needed. Continuing with `--engine auto`."
 
 3. **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
-   - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` —
+   - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
    - Read `references/engine-routing.md` for the full routing table.
    - Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude` (fallback), [2] Abort.
-   - Exception: if `--engine` is not set AND `--with-codex` is explicitly enabled (legacy), read `references/codex-integration.md` instead and run its pre-flight check.
 
 4. Announce the pipeline plan:
 ```
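A minimal Python sketch of the PHASE 0 parsing described above (boolean skip flags, valued flags, the `--with-codex` consolidation into `--engine auto`). The function name and exact flag set are illustrative; the shipped skill does this parsing in-prompt rather than in code, and no `--max-rounds` default is stated in this hunk:

```python
import re

# Defaults per PHASE 0: engine is `auto` unless the user passes --engine.
# (--max-rounds has a default in the real skill; it is not stated in this hunk.)
DEFAULTS = {"max-rounds": None, "skip-docs": False, "engine": "auto"}

def parse_pipeline_args(raw):
    """Split a natural-language invocation into (task, flags). Illustrative only."""
    flags = dict(DEFAULTS)
    # Boolean flags: presence means True.
    for name in ("skip-docs", "skip-review", "skip-clean", "skip-build-gate"):
        if f"--{name}" in raw:
            flags[name] = True
            raw = raw.replace(f"--{name}", "")
    # Valued flags: --max-rounds N and --engine MODE.
    m = re.search(r"--max-rounds\s+(\d+)", raw)
    if m:
        flags["max-rounds"] = int(m.group(1))
        raw = raw[: m.start()] + raw[m.end():]
    engine_explicit = False
    m = re.search(r"--engine\s+(auto|codex|claude)", raw)
    if m:
        flags["engine"] = m.group(1)
        engine_explicit = True
        raw = raw[: m.start()] + raw[m.end():]
    # Legacy --with-codex (optionally valued) folds into --engine auto,
    # but is ignored when --engine was set explicitly.
    m = re.search(r"--with-codex(\s+(evaluate|review|both))?", raw)
    if m:
        if not engine_explicit:
            flags["engine"] = "auto"  # inform the user once, then continue
        raw = raw[: m.start()] + raw[m.end():]
    return " ".join(raw.split()), flags
```

The remaining words after flag removal are the task description, matching the "flags can be passed naturally" example above.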
@@ -57,36 +75,95 @@ Task: [extracted task description]
 Engine: [auto / codex / claude]
 Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
 Max evaluation rounds: [N]
-Cross-model evaluation (Codex): [evaluate / review / both / disabled / subsumed by --engine]
 ```
 
-## PHASE
+## PHASE 0.5: SPEC PREFLIGHT (conditional)
+
+This phase exists because the ideate skill produces specs that are explicitly designed to be auto-resolve's contract — `Requirements` *are* the done-criteria, `Out of Scope` bounds over-building, `Dependencies` gates sequencing. When a run ignores that contract and re-derives everything from the raw task string, 25–40% of BUILD's reasoning is spent re-inventing material the spec already owns. This phase makes the contract load-bearing.
+
+Scan the task description from `<pipeline_config>` for a path matching the regex `docs/roadmap/phase-\d+/[^\s"'`)]+\.md`. If no match, skip this entire phase (non-spec tasks fall back to BUILD's open-ended discovery — that mode is still supported).
+
+If a match is found:
+
+1. **Read the spec file.** If the file does not exist, stop with a `BLOCKED` verdict in the final report — do not proceed to BUILD with a missing spec. The task description is lying and recovering from that silently is worse than halting.
+
+2. **Verify internal dependencies.** For each entry under the spec's `## Dependencies` → `Internal` list (e.g., `1.1 User Auth`), locate the matching spec file at `docs/roadmap/phase-*/[id]-*.md` and read its frontmatter `status` field. If any internal dependency does not have `status: done`, stop with a `BLOCKED` verdict listing the unmet deps. Implementing out of sequence wastes the whole pipeline and produces code that fails at the first integration point.
+
+3. **Write `.devlyn/SPEC-CONTEXT.md`** so downstream subagents read spec-owned content from a single canonical place without re-parsing the spec file. Copy these spec sections verbatim (do not paraphrase or compress — they are the contract):
+
+```
+---
+id: [from frontmatter]
+complexity: [from frontmatter]
+priority: [from frontmatter]
+depends-on: [from frontmatter]
+source-spec: [path to the spec file]
+---
+
+## Customer Frame
+[verbatim]
+
+## Objective
+[verbatim]
+
+## Requirements
+[verbatim — these become done-criteria in PHASE 1]
+
+## Constraints
+[verbatim]
+
+## Out of Scope
+[verbatim — honored explicitly by BUILD in Phase D]
+
+## Architecture Notes
+[verbatim, or "(none)" if absent]
+
+## Dependencies
+[verbatim]
+
+## Verification
+[verbatim]
+```
+
+4. **Announce the preflight outcome.** One line: `Spec preflight: [spec path] — complexity [low/medium/high], [N] internal deps verified done, proceeding.` This appears in the final report under the Build row.
+
+Downstream phases detect `.devlyn/SPEC-CONTEXT.md` and prefer its content over re-derivation. If it is absent, they use their current open-ended behavior.
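The detection and gating logic of PHASE 0.5 can be sketched in Python. This is an illustrative reading of the rules above, not the skill's actual implementation (the skill executes these steps as agent instructions), and the `## Dependencies` layout and frontmatter shape are assumed from the hunk's examples:

```python
import re
from pathlib import Path

# Spec-path pattern quoted in PHASE 0.5 above.
SPEC_RE = re.compile(r"docs/roadmap/phase-\d+/[^\s\"'`)]+\.md")

def frontmatter_status(spec_text):
    """Pull the `status` field out of a spec's YAML frontmatter block."""
    m = re.match(r"---\n(.*?)\n---", spec_text, re.DOTALL)
    if not m:
        return None
    s = re.search(r"^status:\s*(\S+)", m.group(1), re.MULTILINE)
    return s.group(1) if s else None

def spec_preflight(task, root):
    """Return 'SKIP', 'BLOCKED: ...', or 'OK: <spec path>' per the rules above."""
    m = SPEC_RE.search(task)
    if not m:
        return "SKIP"  # non-spec task: BUILD keeps its open-ended discovery
    spec_path = root / m.group(0)
    if not spec_path.exists():
        return f"BLOCKED: missing spec {m.group(0)}"
    text = spec_path.read_text()
    # Internal deps such as `- 1.1 User Auth` must already be `status: done`.
    unmet = []
    deps = re.search(r"## Dependencies.*?Internal[^\n]*\n(.*?)(\n##|\Z)", text, re.DOTALL)
    if deps:
        for dep_id in re.findall(r"-\s*(\d+\.\d+)", deps.group(1)):
            found = sorted(root.glob(f"docs/roadmap/phase-*/{dep_id}-*.md"))
            if not found or frontmatter_status(found[0].read_text()) != "done":
                unmet.append(dep_id)
    if unmet:
        return "BLOCKED: unmet deps " + ", ".join(unmet)
    return f"OK: {m.group(0)}"
```

Halting on `BLOCKED` rather than guessing mirrors the hunk's rationale: a missing spec or an un-done dependency invalidates the whole downstream pipeline.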
 
-
+## PHASE 1: BUILD
 
-
+**Engine**: BUILD row of the routing table — Codex on `auto`/`codex`, Claude on `claude`. Per `<engine_routing_convention>` above. Subagents do not have access to skills, so the prompt below includes everything they need inline.
 
-Agent prompt — pass this to the
+Agent prompt — pass this to the spawned executor:
 
 Investigate and implement the following task. Work through these phases in order:
 
-**Phase A — Understand the task**:
+**Phase A — Understand the task**: If `.devlyn/SPEC-CONTEXT.md` exists, read it first. The spec has already decided the task shape — use its `Objective`, `Constraints`, `Architecture Notes`, `Dependencies`, and `complexity` as the exploration boundary. Do not re-classify the task type open-endedly; the spec already bounds the problem. Read only the files the spec implicates (Architecture Notes + Dependencies + any existing files touched by referenced patterns), then move on.
+
+If no spec context file exists, read the raw task description and classify the task type:
 - **Bug fix**: trace from symptom to root cause. Read error logs and affected code paths.
 - **Feature**: explore the codebase to find existing patterns, integration points, and relevant modules.
 - **Refactor/Chore**: understand current implementation, identify what needs to change and why.
 - **UI/UX**: review existing components, design system, and user flows.
 Read relevant files in parallel. Build a clear picture of what exists and what needs to change.
 
-**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md
+**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md`.
+
+First check whether `.devlyn/SPEC-CONTEXT.md` exists (produced by PHASE 0.5 when this run implements an ideate-produced spec). If it does, the spec is the contract — copy its `## Requirements` section verbatim into `done-criteria.md` as the primary done-criteria list, copy its `## Out of Scope` section as an `## Out of Scope` section in done-criteria.md, and copy its `## Verification` section as a `## Verification Method` section. Do not paraphrase, compress, or re-derive these — the ideate skill's CHALLENGE rubric already validated them, and weakening them here silently undoes that work. You may ADD criteria the spec obviously missed (e.g., if Requirements mention an API but omit an obvious error state) but never REMOVE or reword existing ones.
+
+If `.devlyn/SPEC-CONTEXT.md` does not exist, synthesize done-criteria from the raw task description. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section.
+
+This file is required — downstream evaluation depends on it.
+
+**Phase C — Assemble a team (complexity-gated)**: Check `.devlyn/SPEC-CONTEXT.md` frontmatter for `complexity`.
 
-
+If `complexity: low` AND the spec does not touch security/auth/API/data/UI risk areas (check by grepping the spec for keywords: `auth`, `login`, `session`, `token`, `secret`, `password`, `crypto`, `api`, `env`, `permission`, `access`, `database`, `migration`, `payment`), skip TeamCreate entirely and implement directly — the multi-perspective team exists to catch ambiguity that low-complexity specs have already resolved.
+
+Otherwise (complexity medium or high, risk areas present, or no spec context), use TeamCreate to create a team. Select teammates based on task type:
 - Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
 - Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
 - Refactor: architecture-reviewer + test-engineer
 - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
-Each teammate investigates from their perspective and sends findings back.
-
-**Engine routing for teammates**: If the orchestrator's `--engine` is `auto` or `codex`, read `references/engine-routing.md` for per-role routing. Roles marked **Codex** are called via `mcp__codex-cli__codex` instead of spawning Agent teammates — include the full role prompt and issue context inline. Roles marked **Claude** use normal Agent teammates. Roles marked **Dual** run both in parallel and merge findings. The orchestrator relays Codex role outputs to Claude teammates that need them.
+Each teammate investigates from their perspective and sends findings back. Per-role engine routing follows the team-resolve table in `references/engine-routing.md`; Dual roles run both models in parallel.
 
 **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
 
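Phase C's complexity gate reduces to a small predicate. A sketch, assuming a naive substring scan (which over-triggers on words like "rapid" containing "api"; that is the safe direction, since a false positive only adds team ceremony). The keyword list is the one quoted above:

```python
RISK_KEYWORDS = (
    "auth", "login", "session", "token", "secret", "password", "crypto",
    "api", "env", "permission", "access", "database", "migration", "payment",
)

def needs_team(complexity, spec_text):
    """Phase C gate: low-complexity specs with no risk keywords skip TeamCreate."""
    if complexity != "low":
        return True  # medium/high complexity always assembles the team
    lowered = spec_text.lower()
    # Substring scan over-triggers on purpose; erring toward assembling
    # the team is the safe direction for a hands-free pipeline.
    return any(kw in lowered for kw in RISK_KEYWORDS)
```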
@@ -139,13 +216,11 @@ For failures: include the FULL error output (not truncated) and extract root fil
 
 Triggered only when PHASE 1.4 returns FAIL.
 
-Track a round counter
-
-**Engine routing**: Same as PHASE 2.5 FIX LOOP — if `--engine` is `auto` or `codex`, use `mcp__codex-cli__codex` with `workspace-write` and `fullAuto: true`. If `--engine` is `claude`, spawn a Claude subagent.
+Track a round counter. The build-gate fix loop and the main evaluate fix loop share **one global round counter** capped at `max-rounds` — increments from this loop and from PHASE 2.5 both count against the same total. If `round >= max-rounds`, stop with a clear failure report and do not continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
 
-
+**Engine**: FIX LOOP row of the routing table.
 
-Agent prompt — pass this to the
+Agent prompt — pass this to the spawned executor:
 
 Read `.devlyn/BUILD-GATE.md` — it contains deterministic build/typecheck/lint failures from real compiler output. These are not opinions; the compiler rejected this code. Fix every listed failure at the root cause level.
 
@@ -158,7 +233,7 @@ For each failure:
 
 **After the agent completes**:
 1. **Checkpoint**: `git add -A && git commit -m "chore(pipeline): build gate fix round [N]"`
-2. Increment round counter
+2. Increment the global round counter (shared with PHASE 2.5)
 3. Go back to PHASE 1.4 (re-run the gate)
 
 ## PHASE 1.5: BROWSER VALIDATE (conditional)
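The shared-counter rule in the two fix loops above can be pictured as a tiny budget object (class and method names invented for illustration):

```python
class FixLoopBudget:
    """One global round counter shared by the build-gate fix loop (PHASE 1.4-fix)
    and the evaluate fix loop (PHASE 2.5), capped at --max-rounds."""

    def __init__(self, max_rounds):
        self.max_rounds = max_rounds
        self.history = []  # which loop consumed each round

    def start_round(self, phase):
        """True if another fix round may run; False once the budget is spent."""
        if len(self.history) >= self.max_rounds:
            return False  # stop with a graceful failure report instead
        self.history.append(phase)  # rounds from either loop count together
        return True
```

Keeping one budget rather than one per loop is what prevents the 1.13.0 bug noted in the README changelog, where the two loops could each burn a full `max-rounds` allocation.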
@@ -186,11 +261,23 @@ You are a browser validation agent. Read the skill instructions at `.claude/skil
 
 ## PHASE 2: EVALUATE
 
-
+**Engine**: EVALUATE row of the routing table — Claude on every engine. When `--engine auto`, Codex built the code, so Claude evaluating Codex's work is the GAN dynamic by default; no separate Codex evaluation pass is needed.
 
-
+Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`. Include all evaluation instructions inline (subagents do not have access to skills).
+
+Agent prompt — pass this to the spawned executor:
 
-You are an independent evaluator. Your job is to grade work produced by another agent
+You are an independent evaluator. Your job is to grade work produced by another agent against a specific rubric, not to praise it.
+
+<investigate_before_answering>
+Never claim a file:line or assert a behavior you have not opened and read. The done-criteria file is the rubric — read it first. Then read every changed/new file in full before marking anything VERIFIED or FAILED. Findings without a real file:line behind them are speculation; exclude them.
+</investigate_before_answering>
+
+<coverage_over_filtering>
+Your goal is coverage at this stage, not severity filtering. Report every issue you find — uncertain ones, low-severity ones, all of them. The fix loop and the orchestrator's verdict logic do the filtering downstream. Each finding includes its severity and your confidence so the downstream layers can rank them; your job is to surface them, not pre-decide which ones matter.
+
+This matters because under-reporting is the asymmetric cost: a missed bug ships broken code, a flagged non-issue costs a few minutes of review.
+</coverage_over_filtering>
 
 **Step 1 — Read the done criteria**: Read `.devlyn/done-criteria.md`. This is your primary grading rubric. Every criterion must be verified with evidence.
 
@@ -215,56 +302,64 @@ You are an independent evaluator. Your job is to grade work produced by another
|
|
|
215
302
|
- [ ] criterion — FAILED: what's wrong, file:line
|
|
216
303
|
## Findings Requiring Action
|
|
217
304
|
### CRITICAL
|
|
218
|
-
- `file:line` — description — Fix: suggested approach
|
|
305
|
+
- `file:line` — description — Confidence: high/med/low — Fix: suggested approach
|
|
219
306
|
### HIGH
|
|
220
|
-
- `file:line` — description — Fix: suggested approach
|
|
307
|
+
- `file:line` — description — Confidence: high/med/low — Fix: suggested approach
|
|
308
|
+
### MEDIUM / LOW
|
|
309
|
+
- `file:line` — description — Confidence: high/med/low — Fix: suggested approach
|
|
221
310
|
## Cross-Cutting Patterns
|
|
222
311
|
- pattern description
|
|
223
312
|
```
|
|
224
313
|
|
|
- Verdict rules:
+ Verdict rules:
+ - `BLOCKED` — any CRITICAL issues
+ - `NEEDS WORK` — HIGH or MEDIUM issues
+ - `PASS WITH ISSUES` — only LOW cosmetic notes
+ - `PASS` — clean
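The verdict rules just added are mechanical; as an illustration, they reduce to a few lines of Python (hypothetical helper — the skill applies these rules in prose, not code):

```python
def verdict(counts: dict[str, int]) -> str:
    """Map severity counts from EVAL-FINDINGS.md to a pipeline verdict."""
    if counts.get("CRITICAL", 0) > 0:
        return "BLOCKED"
    if counts.get("HIGH", 0) > 0 or counts.get("MEDIUM", 0) > 0:
        return "NEEDS WORK"
    if counts.get("LOW", 0) > 0:
        return "PASS WITH ISSUES"
    return "PASS"
```

Every verdict except `PASS` routes to the fix loop, so the ordering of checks only affects the label, not the control flow.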

-
+ Findings labeled "pre-existing" or "out of scope" still count if they relate to the done criteria. The goal is working software, not blame attribution.

- Calibration examples
- - A catch block that logs but doesn't surface error to user
- - A `let` that could be `const`
- - "The error handling is generally quite good"
+ Calibration examples:
+ - A catch block that logs but doesn't surface the error to the user → HIGH (not MEDIUM). Logging is not error handling.
+ - A `let` that could be `const` → LOW. Linters catch this.
+ - "The error handling is generally quite good" is not a finding. Count the instances and name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line…"

- Do
+ Do not delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the orchestrator needs them.

  **After the agent completes**:
  1. Read `.devlyn/EVAL-FINDINGS.md`
  2. Extract the verdict
- 3.
- **If `--engine` is not set and `--with-codex` includes `evaluate` or `both`** (legacy): Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
- 4. Branch on verdict (from the merged findings if Codex was used):
+ 3. Branch on verdict:
  - `PASS` → skip to PHASE 3
  - `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
  - `NEEDS WORK` → go to PHASE 2.5 (fix loop)
  - `BLOCKED` → go to PHASE 2.5 (fix loop)
-
+ 4. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as NEEDS WORK and log a warning — absence of evidence is not evidence of absence

  ## PHASE 2.5: FIX LOOP (conditional)

  Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
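As a sketch of that budget bookkeeping (hypothetical names — the orchestrator tracks this in conversation state, not code), the shared counter behaves like:

```python
def advance_round(round_num: int, max_rounds: int) -> tuple[int, bool]:
    # One global counter is shared by build-gate fix rounds (PHASE 1.4)
    # and eval fix rounds (PHASE 2.5): both draw from the same budget.
    round_num += 1
    # Once the budget is spent, the pipeline proceeds to PHASE 3 with a
    # warning that unresolved findings remain.
    return round_num, round_num < max_rounds
```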

- **Engine
+ **Engine**: FIX LOOP row of the routing table. Use a fresh Codex call each round (no `sessionId` reuse — sandbox/fullAuto only apply on the first call of a session).

-
+ Agent prompt — pass this to the spawned executor:

-
+ Read every findings file present in `.devlyn/`:
+ - `.devlyn/EVAL-FINDINGS.md` — issues from the independent evaluator (PHASE 2)
+ - `.devlyn/BROWSER-RESULTS.md` — issues from browser validation (PHASE 1.5), if present and the verdict is `NEEDS WORK` or `BLOCKED`

-
+ Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the relevant verdict returns PASS — there is no "shippable with issues" shortcut.

  The original done criteria are in `.devlyn/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.

- For each finding: read the referenced file:line, understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.devlyn/done-criteria.md` to mark fixed items.
+ For each finding: read the referenced file:line (or browser step / console error), understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.devlyn/done-criteria.md` to mark fixed items.

  **After the agent completes**:
  1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): fix round [N] complete"` to preserve the fix
- 2. Increment round counter
- 3.
+ 2. Increment the global round counter (shared with PHASE 1.4-fix)
+ 3. Re-run the phase that triggered the fix:
+ - If invoked from PHASE 2 (eval failure) → go back to PHASE 2 to re-evaluate
+ - If invoked from PHASE 1.5 (browser failure) → go back to PHASE 1.5 to re-validate the browser, then proceed to PHASE 2 only if browser passes

  ## PHASE 3: SIMPLIFY

@@ -281,21 +376,18 @@ Review the recently changed files (use `git diff HEAD~1` to see what changed). L

  Skip if `--skip-review` was set.

-
+ **Engine**: REVIEW (team) — per-role routing per the team-review table in `references/engine-routing.md`. Dual roles run both models in parallel and merge findings.

-
+ Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.

-
+ Agent prompt — pass this to the spawned executor:

-
+ Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes. Per-role engine routing follows the team-review table in `references/engine-routing.md`; Dual roles run both models in parallel and merge findings.

- Each reviewer
+ Each reviewer reports findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW) and a confidence level. After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.

  Clean up the team after completion.

- **If `--engine` is set**: engine routing already handles cross-model review via per-role routing — skip the legacy `--with-codex` review step below.
- **If `--with-codex` includes `review` or `both`** (legacy, only when `--engine` is not set): Read `references/codex-integration.md` and follow the "PHASE 4B: CODEX REVIEW" section. This runs Codex's independent code review and reconciles findings with the Claude team review.
-
  **After the review phase completes**:
  1. If CRITICAL issues remain unfixed, log a warning in the final report
  2. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): review fixes complete"` if there are changes
@@ -306,23 +398,27 @@ Every prior phase used checklists, done-criteria, or structured categories. This

  This is what catches the things structured reviews miss — subtle logic that technically works but isn't the right approach, assumptions nobody questioned, patterns that are fine but not best-practice, and integration seams that look correct in isolation but feel wrong when you read the whole changeset.

+ **Engine**: CHALLENGE row — Claude on every engine. The diff was likely produced by Codex on `--engine auto`; Claude reading it cold preserves the cross-model dynamic.
+
  Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.

- Agent prompt — pass this to the
+ Agent prompt — pass this to the spawned executor:

- You are a senior engineer doing a final skeptical review before this code ships to production. You have
+ You are a senior engineer doing a final skeptical review before this code ships to production. You have not seen any prior reviews, test results, or design docs — read the code cold.

-
+ <investigate_before_answering>
+ Anchor every finding in code you have actually opened. Run `git diff main` for the change surface, then read each changed file in full (not just the hunks — surrounding context matters). Findings without a real file:line and a quote from the code are speculation; exclude them.
+ </investigate_before_answering>

- Your job is
+ Your job is not to check boxes. Your job is to find the things that would make a staff engineer say "hold on, let's talk about this before we ship." Think about:

  - Would this approach survive a 10x traffic spike? A midnight oncall page? A junior dev maintaining it 6 months from now?
  - Are there assumptions baked in that nobody stated out loud? Hardcoded limits, implicit ordering, missing edge cases in business logic?
  - Is the error handling actually helpful, or does it just prevent crashes while leaving the user confused?
  - Are there simpler, more idiomatic ways to do what this code does? Not "clever" alternatives — genuinely better approaches?
- - Would you
+ - Would you confidently approve this PR, or would you leave comments?

- Be
+ Be direct and concrete. Do not open with praise. Every finding must include `file:line` and a concrete fix — not "consider improving" but "change X to Y because Z."

  Write `.devlyn/CHALLENGE-FINDINGS.md`:

@@ -334,7 +430,27 @@ Write `.devlyn/CHALLENGE-FINDINGS.md`:
  - `file:line` — what's wrong — Fix: concrete change
  ```

-
+ <examples>
+ <example index="1">
+ GOOD finding (anchored, specific, fixable):
+ ### CRITICAL
+ - `src/api/orders/cancel.ts:42` — `await db.transaction(...)` is missing — the read of `order.status` and the write of `order.status = "cancelled"` are not atomic, so two concurrent cancellations both succeed and the inventory hook fires twice. Fix: wrap the read+write in `db.transaction()` and re-check `order.status === "pending"` inside the transaction before the update.
+ </example>
+ <example index="2">
+ BAD finding (vague, unanchored, not actionable):
+ ### HIGH
+ - The error handling could be improved. Consider being more defensive throughout.
+
+ Why this is bad: no file:line, no specific failure, no concrete fix. Either delete the finding or replace it with a real one anchored to a specific call site.
+ </example>
+ <example index="3">
+ GOOD finding (idiom / approach issue):
+ ### MEDIUM
+ - `src/components/UserList.tsx:18-34` — fetching `/api/users` inside `useEffect` and managing loading/error state by hand re-implements what the project already does with the `useFetch` hook in `src/hooks/useFetch.ts`. Fix: replace the manual `useState`+`useEffect` with `useFetch('/api/users')` so this list inherits retry, cache, and abort handling.
+ </example>
+ </examples>
+
+ Verdict: `PASS` only if you would confidently ship this code with your name on it. If you found anything CRITICAL or HIGH, verdict is `NEEDS WORK`.

  **After the agent completes**:
  1. Read `.devlyn/CHALLENGE-FINDINGS.md`
@@ -458,22 +574,20 @@ After all phases complete:
  **Pipeline Summary**:
  | Phase | Status | Notes |
  |-------|--------|-------|
- | Build (team-resolve) | [completed] | [brief summary] |
+ | Build (team-resolve) | [completed] | [brief summary; engine that ran it] |
  | Build gate | [completed / skipped / FAIL after N rounds] | [project types detected, commands run, pass/fail per command] |
  | Browser validate | [completed / skipped / auto-skipped] | [verdict, tier used, console errors, flow results] |
- | Evaluate
- | Evaluate (Codex) | [completed / skipped] | [Codex-only findings count, merged verdict] |
+ | Evaluate | [PASS/NEEDS WORK after N rounds] | [verdict + key findings] |
  | Fix rounds | [N rounds / skipped] | [what was fixed] |
  | Simplify | [completed / skipped] | [changes made] |
- | Review (
- | Review (Codex) | [completed / skipped] | [Codex-only findings, agreed findings] |
+ | Review (team) | [completed / skipped] | [findings summary; per-role engines if --engine auto] |
  | Challenge | [PASS / NEEDS WORK] | [findings count, fixes applied] |
  | Security review | [completed / skipped / auto-skipped] | [findings or "no security-sensitive changes"] |
  | Clean | [completed / skipped] | [items cleaned] |
  | Docs (update-docs) | [completed / skipped] | [docs updated] |

- **Evaluation Rounds**: [N] of [max-rounds] used
- **Final Verdict**: [last evaluation verdict]
+ **Evaluation Rounds**: [N] of [max-rounds] used (shared budget across PHASE 1.4-fix and PHASE 2.5)
+ **Final Verdict**: [last evaluation verdict, or "BUILD GATE FAILED — code does not compile" if PHASE 1.4 exhausted the round budget before PHASE 2 ran]

  **Commits created**:
  [git log output]
@@ -116,7 +116,7 @@ Rationale for `--engine auto` choices:

  Rationale:
  - FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
- - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic
+ - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic — automatic on every run. When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
  - DOCUMENT: Claude — writing quality for spec generation.

  ---
@@ -127,10 +127,12 @@ Rationale:
  |-------|--------------|----------------|-----------------|
  | EXTRACT COMMITMENTS | Claude | Codex | Claude |
  | CODE AUDIT | **Codex** | Codex | Claude |
- | DOCS AUDIT | **Claude** |
+ | DOCS AUDIT | **Claude** | **Claude** | Claude |
  | BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
  | SYNTHESIZE | Claude | Claude | Claude |

+ DOCS AUDIT is always Claude regardless of `--engine` — writing-quality strength on documentation drift detection (READMEs, VISION.md prose, spec status accuracy) is the deciding factor, not code analysis. BROWSER AUDIT is always Claude because Chrome MCP tools are session-bound to Claude Code.
+
  ---

  ## How to Spawn a Codex Role
@@ -199,7 +201,4 @@ mcp__codex-cli__codex({

  - `--engine claude` → all roles and phases use Claude (no Codex calls)
  - `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
- - `--engine auto` → each role and phase routes to the optimal model per this table
- - `--engine auto` is the recommended default when Codex MCP server is available
-
- `--engine` and `--with-codex` are **mutually exclusive**. `--engine auto` subsumes `--with-codex both` — it uses Codex where it's optimal (broader than just evaluate/review). If both flags are passed, `--engine` takes precedence and `--with-codex` is ignored with a warning.
+ - `--engine auto` (default) → each role and phase routes to the optimal model per this table
@@ -1,6 +1,6 @@
  ---
  name: devlyn:ideate
- description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline.
+ description: Transforms unstructured ideas into implementation-ready planning documents through structured brainstorming, research, and a built-in self-skeptical rubric pass. Produces a three-layer document architecture (Vision, Roadmap index, auto-resolve-ready specs) to eliminate context pollution in the implementation pipeline. Default `--engine auto` routes the CHALLENGE rubric pass to OpenAI Codex (GPT-5.4) as a cross-model critic for a GAN dynamic. Use when the user wants to brainstorm, plan a new project or feature set, create a vision and roadmap, or structure scattered ideas into an actionable plan. Triggers on "let's brainstorm", "let's plan", "ideate", "I have an idea for", "help me think through", "let's explore", new project planning, feature discovery, roadmap creation, or when the user is throwing ideas that need structuring.
  ---

  # Ideation to Implementation Bridge
@@ -24,27 +24,17 @@ Concretely:

  Parse these from the user's invocation message:

- - `--with-codex` (default: off) — bare flag. When set, OpenAI Codex runs an independent rubric pass during Phase 3.5 CHALLENGE via `mcp__codex-cli__*` MCP tools, using the same rubric as the solo pass. Codex always runs at `reasoningEffort: "xhigh"` — the entire reason for the flag is maximum reasoning from a second model family. **Ignored if `--engine` is set** (engine routing subsumes this).
  - `--engine MODE` (auto) — controls which model handles each ideation phase. Modes:
- - `auto` (default): Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic).
+ - `auto` (default): Claude handles FRAME/EXPLORE/CONVERGE/DOCUMENT (ambiguous intent, writing quality), Codex runs the CHALLENGE rubric pass as critic (GAN dynamic). Requires Codex MCP server.
  - `codex`: Codex handles FRAME/EXPLORE/CONVERGE/DOCUMENT, Claude runs CHALLENGE (role reversal — builder and critic are always different models).
  - `claude`: all phases use Claude. No Codex calls.

  **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
- - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` —
+ - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
  - Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort.
- -
+ - Read `references/challenge-rubric.md` up front.

- **
-
- <why_this_matters>
- When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
- - Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
- - Full roadmaps create attention noise (49 irrelevant items dilute focus on item #3)
- - Done criteria generated from vague prompts miss the user's actual intent
-
- This skill solves the context engineering problem by producing **self-contained specs** — each carries just enough context for auto-resolve to work autonomously.
- </why_this_matters>
+ **Consolidated flag**: `--with-codex` was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which routes the CHALLENGE rubric pass to Codex automatically. No flag needed. Continuing with `--engine auto`."

  ## Output Architecture

@@ -106,6 +96,8 @@ Before starting, identify what the user needs:
  | User shares links/resources to process | **Research-first** | Lead with Explore (research synthesis), then standard flow |
  | Existing roadmap, user wants to reprioritize | **Replan** | Read existing docs, focus on Converge, update documents |

+ **Tie-breaks when a request matches two modes:** choose the narrowest mode that satisfies the request. Quick Add wins over Expand when the user has one concrete item in mind. Research-first wins over Deep-dive when links or resources are the primary input. Deep-dive wins over Expand when one topic specifically needs depth. Replan is chosen only when priority or order changes are explicit. If two modes still look equally plausible after applying these rules, present the top two to the user and let them pick — silently choosing one wastes the session if the other was right.
+
  Announce the detected mode and confirm before proceeding.

  ### Expand Mode Detail
@@ -208,7 +200,7 @@ When a decision becomes wrong because the world changed under it:
  The biggest risk in ideation is premature convergence — jumping to solutions before understanding the problem. This phase prevents that.

  Establish through conversation:
- 1. **
+ 1. **Job-to-be-Done**: In one sentence — "When [situation], [user] wants to [motivation], so they can [outcome]." Capture this before anything else. If the user cannot produce it, that is itself the finding — pause and explore the situation until the sentence exists. A bare problem statement without this frame is a state description, not a job, and downstream specs built from it will describe system behavior instead of customer progress.
  2. **Constraints**: What can't change? (tech stack, timeline, existing commitments)
  3. **Success criteria**: How will we know this worked? (outcomes, not outputs)
  4. **Anti-goals**: What are we explicitly NOT trying to do?
@@ -223,12 +215,17 @@ Don't write documents yet. The output of this phase is a shared mental model bet

  This is the creative core — the phase that should take the most conversational turns. The user chose to ideate with AI because they want perspectives, research, and creative expansion they wouldn't get alone.

+ <use_parallel_tool_calls>
+ EXPLORE often needs several independent lookups: web search for prior art, doc fetches, repo greps for existing patterns. When tool calls have no dependencies on each other, issue them in parallel in the same response. Spawn subagents in parallel when fanning out across distinct research topics. Only chain calls that depend on a previous call's output. Pace research across turns rather than front-loading every lookup before the user has framed direction — EXPLORE is dialogue-driven, parallel is just for the lookups inside any single turn.
+ </use_parallel_tool_calls>
+
  <research_protocol>
  When relevant, actively research before and during brainstorming:
  - **Existing solutions**: What's already out there? (web search, documentation)
  - **Technical feasibility**: Can this be built within the constraints? Where are the hard parts?
  - **Patterns and prior art**: How have similar problems been solved?
  - **Market/user context**: Who else needs this? What do they currently use?
+ - **Evidence discipline**: Treat prior art as source-backed only when verified by a fetched link or documentation the user can open. If a pattern is inferred from memory or analogy, label it `[UNVERIFIED]` inline and do not present it as market fact. The CHALLENGE rubric's NO GUESSWORK axis fires hard on unlabeled claims that look authoritative but are actually recall.

  Not every ideation needs all of these — a personal side project doesn't need market research. Judge what's relevant and use subagents for parallel research when multiple topics need investigation.
  </research_protocol>
@@ -314,8 +311,6 @@ Engage maximum thinking effort here — both the solo rubric pass and, if enable
  Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
  </thinking_effort>

- The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
-
  ### The rubric — single source of truth

  Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
@@ -326,15 +321,60 @@ Apply the rubric to the internal convergence draft. Produce findings in the form

  For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.

-
+ ### Codex critic pass (engine-routed)
+
+ **If `--engine auto`** (default): Codex runs the CHALLENGE rubric pass automatically as critic.
+
+ Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, `workingDirectory: <project root>`. The `prompt` parameter is built from the packaged plan + the inlined rubric + the appended Codex instructions. Codex has no filesystem access to this project, so everything it needs travels in the prompt.
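A sketch of the resulting parameter payload (field names follow the text above; the exact tool schema is an assumption of this illustration):

```python
def codex_challenge_params(project_root: str, packaged_prompt: str) -> dict:
    # Payload for the mcp__codex-cli__codex tool call. The read-only
    # sandbox means Codex cannot open project files, so the packaged
    # plan, rubric, and instructions all travel in the prompt string.
    return {
        "model": "gpt-5.4",
        "reasoningEffort": "xhigh",
        "sandbox": "read-only",
        "workingDirectory": project_root,
        "prompt": packaged_prompt,
    }
```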

+ **Step 1 — Package the post-solo plan.** Build the prompt with these sections in this order:
+
+ ```
+ ## Problem framing (from FRAME phase)
+ [problem statement, constraints, success criteria, anti-goals]
+
+ ## Confirmed facts vs assumptions
+ Confirmed by user: [list each fact the user explicitly confirmed]
+ Assumptions (not yet confirmed): [list each assumption the agent made]
+
+ ## Plan (post-solo-CHALLENGE)
+ Vision: [one sentence]
+ Phase 1 ([theme]): [items with one-line descriptions and dependencies]
+ Phase 2 ([theme]): ...
+ Architecture decisions: [each with what / why / alternatives considered]
+ Deferred to backlog: [items + reason]
+
+ ## Findings from the solo rubric pass
+ [list each with: severity, axis, quote, why, fix, whether applied]
+
+ ## Rubric
+ [INLINE the full text of references/challenge-rubric.md here verbatim — Codex needs the rubric definition in the prompt itself]
+
+ ## Your job
+ You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user explicitly asked to be challenged because soft-pedaled plans waste their time.
+
+ You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" (with one-line agreement) or "I would frame this differently" (with a reason). Then add your own findings that the solo pass missed.
+
+ Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
+
+ Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let the orchestrator surface it to the user.
+
+ End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, plus a one-line explanation.
+ ```
+
+ **Step 2 — Reconcile.** Merge the two finding lists:
+ - Same finding from both → keep the more specific wording, mark "confirmed by both"
+ - Codex-only → prefix `[codex]` in internal notes so the user-facing summary can attribute correctly
+ - Solo-only → keep as-is
+ - Conflicts (solo says X, Codex says not-X) → record both, do not silently pick one; if material, surface as an open question in the user-facing summary
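A simplified sketch of those merge rules (hypothetical shape — it keys findings by a normalized quote and keeps the solo wording on agreement rather than judging which is more specific):

```python
def reconcile(solo: dict[str, str], codex: dict[str, str]) -> list[tuple[str, str, str]]:
    """Merge solo and Codex finding lists into (key, text, note) tuples."""
    merged = []
    for key in sorted(set(solo) | set(codex)):
        if key in solo and key in codex:
            if solo[key] == codex[key]:
                merged.append((key, solo[key], "confirmed by both"))
            else:
                # Conflict: record both, never silently pick one.
                merged.append((key, solo[key], "conflict: solo"))
                merged.append((key, codex[key], "conflict: codex"))
        elif key in codex:
            merged.append((key, "[codex] " + codex[key], "codex-only"))
        else:
            merged.append((key, solo[key], "solo-only"))
    return merged
```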
330
370
|
|
|
331
|
-
|
|
371
|
+
If Codex raised CRITICAL or HIGH findings the solo pass missed, apply the fixes to the plan before presenting the user-facing summary — unless fixing would change something the user explicitly confirmed, in which case follow the rubric's "Respect explicit user intent" rule.
|
|
332
372
|
|
|
333
|
-
**
|
|
373
|
+
**Do not loop.** One Codex pass is enough. If the result is still FAIL after reconciliation, the plan has structural problems that belong in the user-facing summary as open questions rather than further iteration.
|
|
334
374
|
|
|
335
|
-
**If `--engine codex`**: Role reversal — Codex built the plan
|
|
375
|
+
**If `--engine codex`**: Role reversal — Codex built the plan, so Claude runs the solo CHALLENGE pass and that is the only pass. Do not also run Codex on CHALLENGE — builder and critic should always be different models. Skip this section.
|
|
336
376
|
|
|
337
|
-
**If `--engine claude
|
|
377
|
+
**If `--engine claude`**: No Codex calls. The solo pass is the only pass.
|
|
338
378
|
|
|
339
379
|
### Respect explicit user intent
|
|
340
380
|
|
|
@@ -356,7 +396,7 @@ Deferred: [items with reasons]
 ## CHALLENGE results

 Solo pass: [N findings, M applied]
-Codex pass: [N findings, M applied] ← only
+Codex pass: [N findings, M applied] ← only on --engine auto

 Changes applied during CHALLENGE:
 - [item]: [what changed and which axis triggered it]
@@ -469,9 +509,10 @@ Before finalizing, verify:
 - [ ] No spec requires reading VISION.md to be understood (self-contained)
 - [ ] Dependencies between items are documented in both specs
 - [ ] Architecture decisions include reasoning and alternatives considered
-- [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex
+- [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex critic on `--engine auto`); no item still fails any axis at CRITICAL or HIGH severity
 - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
 - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
+- [ ] Every requirement is traceable to a confirmed fact, a verified source, or an explicitly labeled assumption — no unmarked guesses slipped into the specs

 ## Language

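The new traceability check in that list can be sketched as a filter. This is a hypothetical illustration: the `basis` field and its three allowed values are an assumed representation of "confirmed fact / verified source / labeled assumption", not a schema the skill defines.

```python
def untraceable(requirements):
    """Return requirements whose basis is not one of the three allowed groundings."""
    allowed = {"confirmed fact", "verified source", "labeled assumption"}
    return [r["text"] for r in requirements if r.get("basis") not in allowed]
```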
package/config/skills/devlyn:ideate/references/challenge-rubric.md
CHANGED
@@ -7,7 +7,7 @@
 - Finding format
 - Examples (good vs bad findings, plus a detour-sequencing example)

-The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex pass (
+The 5-axis rubric applied in Phase 3.5 CHALLENGE of `devlyn:ideate`. Both the solo Claude pass and the Codex critic pass (on `--engine auto`) use this file — there is exactly one definition of the rubric, and `SKILL.md` instructs both passes to read it directly from here.

 The rubric exists because plans produced in a single pass, by a single model, in a single conversation almost always fail at least one axis somewhere. The user's historical experience: every time they asked "is this really no-workaround, no-guesswork, no-overengineering, world-class, optimized?", the honest answer was no. This phase makes the answer honestly yes before the user even has to ask.

package/config/skills/devlyn:ideate/references/templates/item-spec.md
CHANGED
@@ -22,6 +22,10 @@ depends-on: []
 <!-- Extract only the relevant context from the vision — don't make the implementation agent read the full vision document. -->
 [Project] does [what]. This feature [enables/improves/fixes] [specific user capability].

+## Customer Frame
+<!-- One sentence. When [situation], [user] wants to [motivation] so they can [outcome]. -->
+<!-- Use this to resolve ambiguous requirements: prefer the behavior that best serves this user outcome, and do not add capabilities outside this frame. -->
+
 ## Objective
 <!-- One sentence: what the user can do after this is implemented. -->

package/config/skills/devlyn:preflight/SKILL.md
CHANGED
@@ -58,6 +58,10 @@ Example with engine: `/devlyn:preflight --engine auto`

 ## PHASE 0: DISCOVER & SCOPE

+<use_parallel_tool_calls>
+Phase 0 and Phase 1 do many independent reads (planning docs, item specs, prior state). When tool calls have no dependencies between them, issue them in parallel in a single response — that includes globbing for spec files and reading several specs at once. Only chain calls that depend on values from a previous call.
+</use_parallel_tool_calls>
+
 1. **Find planning documents** — search in parallel:
 - `docs/VISION.md`
 - `docs/ROADMAP.md`
@@ -80,7 +84,7 @@ Scope: [Phase N / All phases]
 Documents: VISION.md, ROADMAP.md, [N] item specs
 Deferred items (excluded): [N]
 Previous run: [found — will show delta / none]
-Phases: Extract → Audit
+Phases: 1 Extract → 2 Audit (code + docs + browser) → 3 Report → 4 Triage
 ```

 ## PHASE 1: EXTRACT COMMITMENTS
@@ -102,9 +106,9 @@ Read all in-scope planning documents and build a **commitment registry** — eve
 - Items with `status: cut` in ROADMAP.md
 - Out of Scope entries — these are anti-commitments (things promised NOT to build)

-5. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are
+5. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are not expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do **not** audit them or report them as findings. Flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.

-
+6. **Write to `.devlyn/commitment-registry.md`**:

 ```markdown
 # Commitment Registry
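The planned-item separation above amounts to a simple partition. A minimal sketch, assuming commitments are represented as dicts with an optional `status` key (an illustrative shape, not the registry's actual format):

```python
def split_registry(commitments):
    """Partition commitments: planned items are listed for visibility but never audited."""
    audit_scope = [c for c in commitments if c.get("status") != "planned"]
    planned = [c for c in commitments if c.get("status") == "planned"]
    return audit_scope, planned
```

Only `audit_scope` feeds the auditors; `planned` goes under the `[PLANNED]` heading so the report never flags unbuilt future work as MISSING.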
@@ -137,20 +141,20 @@ Spawn all applicable auditors in parallel. Each reads `.devlyn/commitment-regist

 ### code-auditor (always)

-
-
-Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/code-auditor.md` and pass it to the subagent.
+Engine routes per the auto-resolve skill's `references/engine-routing.md` ("Pipeline Phase Routing (preflight)" → CODE AUDIT row): Codex on `--engine auto`/`codex`, Claude on `--engine claude`. When the route is **Codex**, call `mcp__codex-cli__codex` with the auditor prompt inline (Codex cannot read `.devlyn/commitment-registry.md` directly under `read-only` sandbox, so paste the registry into the prompt). When the route is **Claude**, spawn a subagent with `mode: "bypassPermissions"`. Read the auditor prompt from `references/auditors/code-auditor.md` either way.

 The code-auditor classifies each commitment as IMPLEMENTED, MISSING, INCOMPLETE, DIVERGENT, or BROKEN — with file:line evidence. Also catches cross-feature integration gaps and constraint violations. Writes to `.devlyn/audit-code.md`.

 ### docs-auditor (unless --skip-docs)

-Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/docs-auditor.md` and pass it to the subagent.
+Always Claude (writing-quality strength) regardless of `--engine`. Spawn a subagent with `mode: "bypassPermissions"`. Read the full prompt from `references/auditors/docs-auditor.md` and pass it to the subagent.

 Checks: ROADMAP.md status accuracy, README alignment, API doc coverage, VISION.md currency, item spec status. Writes to `.devlyn/audit-docs.md`.

 ### browser-auditor (conditional)

+Always Claude (Chrome MCP tools are session-bound) regardless of `--engine`.
+
 **Skip conditions** (check in order):
 1. `--skip-browser` flag → skip
 2. No web-relevant files in project (no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.html`, `page.*`, `layout.*`) → skip with note "Browser validation skipped — no web files detected"
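The CODE AUDIT routing described above can be sketched as a decision function. A hedged sketch: the dict keys mirror the parameter and tool names this skill mentions (`mcp__codex-cli__codex`, `sandbox: "read-only"`, `mode: "bypassPermissions"`), but the actual call mechanism belongs to the orchestrator, and the payload shape here is illustrative.

```python
def route_code_auditor(engine, auditor_prompt, registry_text):
    """Route CODE AUDIT: Codex on --engine auto/codex, Claude subagent on --engine claude."""
    if engine in ("auto", "codex"):
        # Codex runs under a read-only sandbox and cannot open .devlyn/ files
        # itself, so the commitment registry is pasted into the prompt.
        return {
            "tool": "mcp__codex-cli__codex",
            "sandbox": "read-only",
            "prompt": auditor_prompt + "\n\n## Commitment Registry\n" + registry_text,
        }
    return {"tool": "Agent", "mode": "bypassPermissions", "prompt": auditor_prompt}
```

The docs-auditor and browser-auditor skip this routing entirely: they are always Claude, whatever `--engine` says.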
@@ -340,7 +344,7 @@ Triage complete.

 Next steps:
 - To implement fixes: /devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/[id]-[name].md"
-- For
+- For CRITICAL severity or complex DIVERGENT findings, the default `--engine auto` already routes BUILD/FIX to Codex and EVALUATE/CHALLENGE to Claude (cross-model GAN dynamic). No extra flag needed.
 - To re-run preflight after fixes: /devlyn:preflight [same flags]
 - To add new features discovered during audit: /devlyn:ideate expand
 ```
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "1.
+  "version": "1.14.0",
   "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
   "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
   "bin": {
package/config/skills/devlyn:auto-resolve/references/codex-integration.md
DELETED
@@ -1,106 +0,0 @@
-# Codex Cross-Model Integration (Legacy)
-
-> **Note**: This file is the legacy `--with-codex` integration. For the newer `--engine` flag (which subsumes `--with-codex`), see `references/engine-routing.md`. Only read this file when `--with-codex` is enabled AND `--engine` is NOT set.
-
-Instructions for using OpenAI Codex as an independent evaluator/reviewer in the auto-resolve pipeline.
-
-Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). This creates a GAN-like adversarial dynamic — Claude builds and Codex critiques, reducing shared blind spots between model families.
-
----
-
-## PRE-FLIGHT CHECK
-
-Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
-
-- **If ping succeeds**: continue normally.
-- **If ping fails or `mcp__codex-cli__ping` tool is not found**: warn the user and ask how to proceed:
-```
-⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
-
-To install:
-npm i -g @openai/codex
-claude mcp add codex-cli -- npx -y codex-mcp-server
-
-Options:
-[1] Continue without --with-codex (Claude-only evaluation/review)
-[2] Abort pipeline
-```
-If the user chooses [1], disable `--with-codex` and continue. If [2], stop.
-
----
-
-## PHASE 2-CODEX: CROSS-MODEL EVALUATE
-
-Run after the Claude evaluator (Phase 2) completes, only if `--with-codex` includes `evaluate` or `both`.
-
-### Step 1 — Get Codex's evaluation
-
-Call `mcp__codex-cli__codex` with:
-- `prompt`: Include the full content of `.devlyn/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
-- `workingDirectory`: the project root
-- `sandbox`: `"read-only"` (Codex should only read, not modify files)
-- `reasoningEffort`: `"high"` (note: for `--engine auto`, the engine-routing.md uses `"xhigh"` by default)
-- `model`: `"gpt-5.4"` (pass explicitly — the MCP schema default may be outdated)
-
-Example prompt to pass:
-```
-You are an independent code evaluator. Grade the following code changes against the done criteria below. Be strict — when in doubt, flag it.
-
-## Done Criteria
-[paste contents of .devlyn/done-criteria.md]
-
-## Code Changes
-[paste output of git diff HEAD~1]
-
-For each criterion, mark VERIFIED (with evidence) or FAILED (with file:line and what's wrong).
-Then list all issues found grouped by severity: CRITICAL, HIGH, MEDIUM, LOW.
-For each issue provide: file:line, description, and suggested fix.
-End with a verdict: PASS, PASS WITH ISSUES, NEEDS WORK, or BLOCKED.
-```
-
-### Step 2 — Merge findings
-
-Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to merge Claude's and Codex's evaluations.
-
-Agent prompt:
-
-Read `.devlyn/EVAL-FINDINGS.md` (Claude's evaluation) and the Codex evaluation output below. Merge them into a single unified `.devlyn/EVAL-FINDINGS.md` following the existing format. Rules:
-- Take the MORE SEVERE verdict between the two evaluators
-- Deduplicate findings that reference the same file:line or describe the same issue
-- When both evaluators flag the same issue, keep the more detailed description
-- Prefix Codex-only findings with `[codex]` so the fix loop knows the source
-- Preserve the exact structure: Verdict, Done Criteria Results, Findings Requiring Action (CRITICAL/HIGH), Cross-Cutting Patterns
-
-Codex evaluation:
-[paste Codex's response here]
-
----
-
-## PHASE 4B: CODEX REVIEW
-
-Run after the Claude team review (Phase 4A) completes, only if `--with-codex` includes `review` or `both`.
-
-### Step 1 — Run Codex review
-
-Call `mcp__codex-cli__review` with:
-- `base`: `"main"` — review all changes since main
-- `workingDirectory`: the project root
-- `title`: `"Cross-model review (Codex)"`
-
-This runs OpenAI Codex's built-in code review against the diff. The review tool returns structured findings automatically — no custom prompt needed.
-
-### Step 2 — Reconcile both reviews
-
-Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to reconcile both reviews.
-
-Agent prompt:
-
-Two independent reviews have been conducted on recent changes — one by a Claude team review and one by OpenAI Codex. Reconcile them:
-
-Claude team review findings: [paste Phase 4A agent's output summary]
-Codex review findings: [paste mcp__codex-cli__review output]
-
-1. Deduplicate findings that describe the same issue
-2. For unique Codex findings not caught by Claude's team, prefix with `[codex]` and assess severity
-3. Fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
-4. Write a brief reconciliation summary to stdout listing: findings from both (agreed), Claude-only, Codex-only, and what was fixed
package/config/skills/devlyn:ideate/references/codex-debate.md
DELETED
@@ -1,112 +0,0 @@
-# Codex Cross-Model Rubric Pass (Legacy)
-
-> **Note**: This file is the legacy `--with-codex` integration for ideate. For the newer `--engine` flag (which subsumes `--with-codex`), see the engine routing section in SKILL.md. Only read this file when `--with-codex` is set AND `--engine` is NOT set.
-
-## Contents
-- Pre-flight check (verify Codex MCP server availability)
-- PHASE 3.5-CODEX: packaging the plan, calling Codex, reconciling findings with the solo pass
-- Cost notes (one Codex call per ideation session)
-
-Instructions for using OpenAI Codex as an independent critic during Phase 3.5 CHALLENGE. The 5-axis rubric itself lives in `challenge-rubric.md` — Claude loads that file directly from SKILL.md, not via this file.
-
-Codex is accessed via `mcp__codex-cli__*` MCP tools (provided by codex-mcp-server). The intent: one opinionated rubric pass from a different model family, applied right before the user sees the plan. Two model families catch different blind spots; one pass at maximum effort catches more than multiple shallow passes.
-
-**Always use `model: "gpt-5.4"`, `reasoningEffort: "xhigh"` and `sandbox: "read-only"` for every Codex call in this file.** Maximum reasoning is the whole reason the `--with-codex` flag exists — lowering it defeats the purpose of bringing in a second model. Pass `model: "gpt-5.4"` explicitly as the MCP schema default may be outdated.
-
----
-
-## PRE-FLIGHT CHECK
-
-Before starting the pipeline, verify the Codex MCP server is available by calling `mcp__codex-cli__ping`.
-
-- **If ping succeeds**: continue.
-- **If ping fails or `mcp__codex-cli__ping` is not found**: warn the user and ask:
-```
-⚠ Codex MCP server not detected. --with-codex requires codex-mcp-server.
-
-To install:
-npm i -g @openai/codex
-claude mcp add codex-cli -- npx -y codex-mcp-server
-
-Options:
-[1] Continue without --with-codex (Claude-only solo CHALLENGE pass)
-[2] Abort
-```
-If [1], disable `--with-codex` and continue with the solo CHALLENGE. If [2], stop.
-
----
-
-## PHASE 3.5-CODEX: Codex rubric pass
-
-Run after the solo CHALLENGE pass completes, before the user-facing summary.
-
-### Step 1 — Package the post-solo plan
-
-Use the plan as it stands after the solo rubric pass. Package the full context Codex needs:
-
-```
-## Problem framing (from FRAME phase)
-[problem statement, constraints, success criteria, anti-goals]
-
-## Confirmed facts vs assumptions
-Confirmed by user: [list]
-Assumptions (not yet confirmed): [list]
-
-## Plan (post-solo-CHALLENGE)
-Vision: [one sentence]
-Phase 1 ([theme]): [items, dependencies, one-line descriptions]
-Phase 2 ([theme]): ...
-Architecture decisions: [each with what / why / alternatives]
-Deferred to backlog: [items + reason]
-
-## Findings from the solo rubric pass
-[list each with: axis, quote, why, fix, whether applied]
-```
-
-Include the framing and assumptions — Codex can only judge whether the plan fits the user's reality if it sees what the user actually said.
-
-### Step 2 — Codex challenge pass
-
-Call `mcp__codex-cli__codex` with:
-- `prompt`: the packaged context above, followed by the instructions below
-- `workingDirectory`: the project root
-- `sandbox`: `"read-only"`
-- `model`: `"gpt-5.4"` — pass explicitly; the MCP schema default may still show `gpt-5.3-codex`
-- `reasoningEffort`: `"xhigh"` — the highest setting in the Codex enum (`none < minimal < low < medium < high < xhigh`). Always pick the top level; this is the entire reason for the flag.
-
-Instructions to append to the packaged context. **Before sending, inline the full text of `references/challenge-rubric.md` into the prompt under a `## Rubric` heading** — Codex does not have filesystem access to this project, so Claude must ship the rubric itself. Claude already has the rubric loaded from Phase 3.5 setup.
-
-Template for the appended instructions:
-
-```
-You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user has explicitly asked to be challenged because soft-pedaled plans waste their time.
-
-## Rubric
-[Claude inlines the full text of references/challenge-rubric.md here]
-
-## Your job
-- You are running AFTER a solo pass by Claude. Catch what the solo pass missed, do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" or "I would frame this differently" with a reason. Then add your own findings that the solo pass missed.
-- Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
-- Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let Claude surface it to the user.
-
-End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, and a one-line explanation.
-```
-
-### Step 3 — Reconcile solo and Codex findings
-
-Merge the two finding lists:
-- Same finding from both → keep the more specific wording, mark "confirmed by both".
-- Codex-only → prefix `[codex]` in internal notes so the user-facing summary can show where each push came from.
-- Solo-only → keep as-is.
-- Conflicts (solo says X, Codex says not-X) → record both, do not silently pick one; if the conflict is material, include it as an open question in the user-facing summary.
-
-If Codex raised CRITICAL or HIGH findings that the solo pass missed, apply the fixes to the plan before presenting the user-facing summary. If fixing would change something the user explicitly asked for, follow the "Respect explicit user intent" rule already loaded from the rubric: do not silently rewrite — surface it.
-
-Do not loop. One Codex pass is enough. If the result is still FAIL after one pass, that is signal that the plan has structural problems the user should see directly, not signal to keep iterating in the background.
-
----
-
-## Cost notes
-
-- One Codex call at `reasoningEffort: "xhigh"` typically takes 30–90s and is not cheap. This integration is bounded: exactly one Codex call per ideation session.
-- In Quick Add mode on a single new item, one Codex call is still worth it — small scope, huge signal, and single-item additions are exactly where workarounds slip in unnoticed.