devlyn-cli 1.13.0 → 1.14.0
package/README.md
CHANGED
@@ -109,7 +109,7 @@ Install the Codex MCP server during setup, then:
 /devlyn:auto-resolve "fix the auth bug" --engine auto
 ```
 
-**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.
+**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
 
 > `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
 
@@ -146,6 +146,35 @@ Works across the full pipeline:
 
 </details>
 
+<details>
+<summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
+
+`/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
+
+- **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
+- **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
+- **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
+- **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
+- **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
+- **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
+- **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
+- **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
+- **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
+
+</details>
+
+<details>
+<summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
+
+Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
+
+- **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
+- **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
+- **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
+- **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
+
+</details>
+
 ---
 
 ## Manual Commands
@@ -77,6 +77,58 @@ Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed]
 Max evaluation rounds: [N]
 ```
 
+## PHASE 0.5: SPEC PREFLIGHT (conditional)
+
+This phase exists because the ideate skill produces specs that are explicitly designed to be auto-resolve's contract — `Requirements` *are* the done-criteria, `Out of Scope` bounds over-building, `Dependencies` gates sequencing. When a run ignores that contract and re-derives everything from the raw task string, 25–40% of BUILD's reasoning is spent re-inventing material the spec already owns. This phase makes the contract load-bearing.
+
+Scan the task description from `<pipeline_config>` for a path matching the regex `docs/roadmap/phase-\d+/[^\s"'`)]+\.md`. If no match, skip this entire phase (non-spec tasks fall back to BUILD's open-ended discovery — that mode is still supported).
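
The path scan described above can be illustrated with a minimal Python sketch. The regex is copied from the added text; the function name `find_spec_path` and the sample task strings are hypothetical, and the package's own skill prompt may apply the pattern differently.

```python
import re
from typing import Optional

# Pattern copied from the PHASE 0.5 text. The path segment stops at
# whitespace, quotes, a backtick, or a closing parenthesis.
SPEC_PATH_RE = re.compile(r"""docs/roadmap/phase-\d+/[^\s"'`)]+\.md""")


def find_spec_path(task_description: str) -> Optional[str]:
    """Return the first roadmap spec path named in the task, or None."""
    match = SPEC_PATH_RE.search(task_description)
    return match.group(0) if match else None


# A task that names a spec triggers the preflight; a free-form task does not.
print(find_spec_path('implement "docs/roadmap/phase-1/1.2-session-refresh.md"'))
print(find_spec_path("fix the auth bug"))
```

Returning `None` corresponds to the "skip this entire phase" branch, so callers can fall back to open-ended discovery without a separate flag.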
+
+If a match is found:
+
+1. **Read the spec file.** If the file does not exist, stop with a `BLOCKED` verdict in the final report — do not proceed to BUILD with a missing spec. The task description is lying and recovering from that silently is worse than halting.
+
+2. **Verify internal dependencies.** For each entry under the spec's `## Dependencies` → `Internal` list (e.g., `1.1 User Auth`), locate the matching spec file at `docs/roadmap/phase-*/[id]-*.md` and read its frontmatter `status` field. If any internal dependency does not have `status: done`, stop with a `BLOCKED` verdict listing the unmet deps. Implementing out of sequence wastes the whole pipeline and produces code that fails at the first integration point.
+
+3. **Write `.devlyn/SPEC-CONTEXT.md`** so downstream subagents read spec-owned content from a single canonical place without re-parsing the spec file. Copy these spec sections verbatim (do not paraphrase or compress — they are the contract):
+
+```
+---
+id: [from frontmatter]
+complexity: [from frontmatter]
+priority: [from frontmatter]
+depends-on: [from frontmatter]
+source-spec: [path to the spec file]
+---
+
+## Customer Frame
+[verbatim]
+
+## Objective
+[verbatim]
+
+## Requirements
+[verbatim — these become done-criteria in PHASE 1]
+
+## Constraints
+[verbatim]
+
+## Out of Scope
+[verbatim — honored explicitly by BUILD in Phase D]
+
+## Architecture Notes
+[verbatim, or "(none)" if absent]
+
+## Dependencies
+[verbatim]
+
+## Verification
+[verbatim]
+```
+
+4. **Announce the preflight outcome.** One line: `Spec preflight: [spec path] — complexity [low/medium/high], [N] internal deps verified done, proceeding.` This appears in the final report under the Build row.
+
+Downstream phases detect `.devlyn/SPEC-CONTEXT.md` and prefer its content over re-derivation. If it is absent, they use their current open-ended behavior.
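
The dependency check in step 2 can be sketched as follows. This is only an illustration under stated assumptions: frontmatter is taken to be flat `key: value` lines between two `---` markers, the dependency id is taken to be the first whitespace-separated token of each entry, and the helper names `read_frontmatter` and `unmet_internal_deps` are hypothetical. In the package itself the check is carried out by the agent following the prompt, not by this code.

```python
from pathlib import Path
from typing import Dict, List


def read_frontmatter(path: Path) -> Dict[str, str]:
    """Parse flat `key: value` lines between the first two `---` markers."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def unmet_internal_deps(repo_root: Path, dep_entries: List[str]) -> List[str]:
    """Return entries whose spec file is missing or not `status: done`."""
    unmet: List[str] = []
    for entry in dep_entries:
        if not entry.strip():
            continue
        dep_id = entry.split()[0]  # "1.1 User Auth" -> "1.1" (assumed format)
        matches = sorted(repo_root.glob(f"docs/roadmap/phase-*/{dep_id}-*.md"))
        status = read_frontmatter(matches[0]).get("status") if matches else None
        if status != "done":
            unmet.append(entry)
    return unmet
```

A non-empty result maps to the `BLOCKED` verdict, and the returned entries are the unmet dependencies to list in the report.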
+
 ## PHASE 1: BUILD
 
 **Engine**: BUILD row of the routing table — Codex on `auto`/`codex`, Claude on `claude`. Per `<engine_routing_convention>` above. Subagents do not have access to skills, so the prompt below includes everything they need inline.
@@ -85,16 +137,28 @@ Agent prompt — pass this to the spawned executor:
 
 Investigate and implement the following task. Work through these phases in order:
 
-**Phase A — Understand the task**:
+**Phase A — Understand the task**: If `.devlyn/SPEC-CONTEXT.md` exists, read it first. The spec has already decided the task shape — use its `Objective`, `Constraints`, `Architecture Notes`, `Dependencies`, and `complexity` as the exploration boundary. Do not re-classify the task type open-endedly; the spec already bounds the problem. Read only the files the spec implicates (Architecture Notes + Dependencies + any existing files touched by referenced patterns), then move on.
+
+If no spec context file exists, read the raw task description and classify the task type:
 - **Bug fix**: trace from symptom to root cause. Read error logs and affected code paths.
 - **Feature**: explore the codebase to find existing patterns, integration points, and relevant modules.
 - **Refactor/Chore**: understand current implementation, identify what needs to change and why.
 - **UI/UX**: review existing components, design system, and user flows.
 Read relevant files in parallel. Build a clear picture of what exists and what needs to change.
 
-**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md
+**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md`.
+
+First check whether `.devlyn/SPEC-CONTEXT.md` exists (produced by PHASE 0.5 when this run implements an ideate-produced spec). If it does, the spec is the contract — copy its `## Requirements` section verbatim into `done-criteria.md` as the primary done-criteria list, copy its `## Out of Scope` section as an `## Out of Scope` section in done-criteria.md, and copy its `## Verification` section as a `## Verification Method` section. Do not paraphrase, compress, or re-derive these — the ideate skill's CHALLENGE rubric already validated them, and weakening them here silently undoes that work. You may ADD criteria the spec obviously missed (e.g., if Requirements mention an API but omit an obvious error state) but never REMOVE or reword existing ones.
+
+If `.devlyn/SPEC-CONTEXT.md` does not exist, synthesize done-criteria from the raw task description. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section.
+
+This file is required — downstream evaluation depends on it.
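
The verbatim copy in Phase B can be pictured with a short Python sketch. It assumes the spec context uses level-two `## ` headings exactly as in the PHASE 0.5 template and that no fenced block inside a section starts a line with `## `. The function names and the `## Done Criteria` heading are hypothetical, since the source only says Requirements become "the primary done-criteria list".

```python
import re
from typing import Dict


def split_sections(markdown: str) -> Dict[str, str]:
    """Map each `## Heading` to its body text, preserved character for character."""
    sections: Dict[str, str] = {}
    current = None
    for line in markdown.splitlines(keepends=True):
        heading = re.match(r"^## (.+?)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line
    return sections


def _terminated(body: str) -> str:
    """Ensure a body ends with a newline so the next heading starts cleanly."""
    return body if body.endswith("\n") else body + "\n"


def build_done_criteria(spec_context: str) -> str:
    """Copy Requirements, Out of Scope, and Verification without rewording."""
    sections = split_sections(spec_context)
    return (
        "## Done Criteria\n" + _terminated(sections["Requirements"])  # heading name assumed
        + "## Out of Scope\n" + _terminated(sections["Out of Scope"])
        + "## Verification Method\n" + _terminated(sections["Verification"])
    )
```

A missing section raises `KeyError` here, which seems preferable to silently emitting an empty list; extra criteria the spec missed would be appended after the copied ones rather than merged into them.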
+
+**Phase C — Assemble a team (complexity-gated)**: Check `.devlyn/SPEC-CONTEXT.md` frontmatter for `complexity`.
+
+If `complexity: low` AND the spec does not touch security/auth/API/data/UI risk areas (check by greping the spec for keywords: `auth`, `login`, `session`, `token`, `secret`, `password`, `crypto`, `api`, `env`, `permission`, `access`, `database`, `migration`, `payment`), skip TeamCreate entirely and implement directly — the multi-perspective team exists to catch ambiguity that low-complexity specs have already resolved.
 
-
+Otherwise (complexity medium or high, risk areas present, or no spec context), use TeamCreate to create a team. Select teammates based on task type:
 - Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
 - Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
 - Refactor: architecture-reviewer + test-engineer
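
The Phase C gate reduces to a small predicate, sketched here in Python. The keyword list is copied from the added text; the function name `should_skip_team` is hypothetical, and case-insensitive substring matching is an assumption about what "greping" means in practice.

```python
from typing import Optional

# Keyword list copied from the Phase C text.
RISK_KEYWORDS = (
    "auth", "login", "session", "token", "secret", "password", "crypto",
    "api", "env", "permission", "access", "database", "migration", "payment",
)


def should_skip_team(complexity: Optional[str], spec_text: str) -> bool:
    """True only for low-complexity specs that mention no risk keyword."""
    if complexity != "low":
        # Medium, high, or no spec context: assemble the team as before.
        return False
    lowered = spec_text.lower()
    return not any(keyword in lowered for keyword in RISK_KEYWORDS)
```

Substring matching errs on the side of assembling the team: a word such as "rapid" contains `api` and would disable the shortcut, which costs ceremony but never skips a review that was warranted.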
@@ -32,19 +32,10 @@ Parse these from the user's invocation message:
|
|
|
32
32
|
**Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
|
|
33
33
|
- The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
|
|
34
34
|
- Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort.
|
|
35
|
-
- Read `references/challenge-rubric.md` up front.
|
|
35
|
+
- Read `references/challenge-rubric.md` up front.
|
|
36
36
|
|
|
37
37
|
**Consolidated flag**: `--with-codex` was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which routes the CHALLENGE rubric pass to Codex automatically. No flag needed. Continuing with `--engine auto`."
|
|
38
38
|
|
|
39
|
-
<why_this_matters>
|
|
40
|
-
When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
|
|
41
|
-
- Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
|
|
42
|
-
- Full roadmaps create attention noise (49 irrelevant items dilute focus on item #3)
|
|
43
|
-
- Done criteria generated from vague prompts miss the user's actual intent
|
|
44
|
-
|
|
45
|
-
This skill solves the context engineering problem by producing **self-contained specs** — each carries just enough context for auto-resolve to work autonomously.
|
|
46
|
-
</why_this_matters>
|
|
47
|
-
|
|
48
39
|
## Output Architecture
|
|
49
40
|
|
|
50
41
|
The skill produces a three-layer progressive disclosure structure:
|
|
@@ -105,6 +96,8 @@ Before starting, identify what the user needs:
 | User shares links/resources to process | **Research-first** | Lead with Explore (research synthesis), then standard flow |
 | Existing roadmap, user wants to reprioritize | **Replan** | Read existing docs, focus on Converge, update documents |
 
+**Tie-breaks when a request matches two modes:** choose the narrowest mode that satisfies the request. Quick Add wins over Expand when the user has one concrete item in mind. Research-first wins over Deep-dive when links or resources are the primary input. Deep-dive wins over Expand when one topic specifically needs depth. Replan is chosen only when priority or order changes are explicit. If two modes still look equally plausible after applying these rules, present the top two to the user and let them pick — silently choosing one wastes the session if the other was right.
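
One way to read the tie-break rules as a deterministic procedure is sketched below in Python. The mode names come from the text; the `Signals` fields and the `break_tie` function are hypothetical stand-ins for judgments the skill makes from the conversation, so this is an interpretation rather than the skill's actual logic.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Signals:
    """Hypothetical flags summarizing what the request looks like."""
    one_concrete_item: bool = False
    links_are_primary_input: bool = False
    one_topic_needs_depth: bool = False
    explicit_reprioritization: bool = False


def break_tie(candidates: Set[str], s: Signals) -> List[str]:
    """Apply the pairwise rules; more than one survivor means ask the user."""
    remaining = set(candidates)
    if not s.explicit_reprioritization and len(remaining) > 1:
        remaining.discard("Replan")
    if {"Quick Add", "Expand"} <= remaining and s.one_concrete_item:
        remaining.discard("Expand")
    if {"Research-first", "Deep-dive"} <= remaining and s.links_are_primary_input:
        remaining.discard("Deep-dive")
    if {"Deep-dive", "Expand"} <= remaining and s.one_topic_needs_depth:
        remaining.discard("Expand")
    return sorted(remaining)
```

A single-element result is the chosen mode. Two or more elements correspond to the final rule in the paragraph above: present the remaining modes to the user instead of picking silently.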
+
 Announce the detected mode and confirm before proceeding.
 
 ### Expand Mode Detail
@@ -207,7 +200,7 @@ When a decision becomes wrong because the world changed under it:
 The biggest risk in ideation is premature convergence — jumping to solutions before understanding the problem. This phase prevents that.
 
 Establish through conversation:
-1. **
+1. **Job-to-be-Done**: In one sentence — "When [situation], [user] wants to [motivation], so they can [outcome]." Capture this before anything else. If the user cannot produce it, that is itself the finding — pause and explore the situation until the sentence exists. A bare problem statement without this frame is a state description, not a job, and downstream specs built from it will describe system behavior instead of customer progress.
 2. **Constraints**: What can't change? (tech stack, timeline, existing commitments)
 3. **Success criteria**: How will we know this worked? (outcomes, not outputs)
 4. **Anti-goals**: What are we explicitly NOT trying to do?
@@ -232,6 +225,7 @@ When relevant, actively research before and during brainstorming:
 - **Technical feasibility**: Can this be built within the constraints? Where are the hard parts?
 - **Patterns and prior art**: How have similar problems been solved?
 - **Market/user context**: Who else needs this? What do they currently use?
+- **Evidence discipline**: Treat prior art as source-backed only when verified by a fetched link or documentation the user can open. If a pattern is inferred from memory or analogy, label it `[UNVERIFIED]` inline and do not present it as market fact. The CHALLENGE rubric's NO GUESSWORK axis fires hard on unlabeled claims that look authoritative but are actually recall.
 
 Not every ideation needs all of these — a personal side project doesn't need market research. Judge what's relevant and use subagents for parallel research when multiple topics need investigation.
 </research_protocol>
@@ -317,8 +311,6 @@ Engage maximum thinking effort here — both the solo rubric pass and, if enable
 Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
 </thinking_effort>
 
-The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
-
 ### The rubric — single source of truth
 
 Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
@@ -329,8 +321,6 @@ Apply the rubric to the internal convergence draft. Produce findings in the form
 
 For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
 
-If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
-
 ### Codex critic pass (engine-routed)
 
 **If `--engine auto`** (default): Codex runs the CHALLENGE rubric pass automatically as critic.
@@ -522,6 +512,7 @@ Before finalizing, verify:
 - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex critic on `--engine auto`); no item still fails any axis at CRITICAL or HIGH severity
 - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
 - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
+- [ ] Every requirement is traceable to a confirmed fact, a verified source, or an explicitly labeled assumption — no unmarked guesses slipped into the specs
 
 ## Language
 
@@ -22,6 +22,10 @@ depends-on: []
 <!-- Extract only the relevant context from the vision — don't make the implementation agent read the full vision document. -->
 [Project] does [what]. This feature [enables/improves/fixes] [specific user capability].
 
+## Customer Frame
+<!-- One sentence. When [situation], [user] wants to [motivation] so they can [outcome]. -->
+<!-- Use this to resolve ambiguous requirements: prefer the behavior that best serves this user outcome, and do not add capabilities outside this frame. -->
+
 ## Objective
 <!-- One sentence: what the user can do after this is implemented. -->
 
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "devlyn-cli",
-"version": "1.
+"version": "1.14.0",
 "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
 "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
 "bin": {