devlyn-cli 1.13.0 → 1.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5)

package/README.md +30 -1
package/config/skills/devlyn:auto-resolve/SKILL.md +67 -3
package/config/skills/devlyn:ideate/SKILL.md +6 -15
package/config/skills/devlyn:ideate/references/templates/item-spec.md +4 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -109,7 +109,7 @@ Install the Codex MCP server during setup, then:
 /devlyn:auto-resolve "fix the auth bug" --engine auto
 ```
-**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.6 or GPT-5.4) — validated through A/B testing, not just benchmarks.
+**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
 > `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
@@ -146,6 +146,35 @@ Works across the full pipeline:
 </details>
+<details>
+<summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
+`/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
+- **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
+- **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
+- **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
+- **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
+- **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
+- **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
+- **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
+- **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
+- **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
+</details>
+<details>
+<summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
+Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
+- **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
+- **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
+- **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
+- **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
+</details>
 ---
 ## Manual Commands

package/config/skills/devlyn:auto-resolve/SKILL.md CHANGED

@@ -77,6 +77,58 @@ Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed]
 Max evaluation rounds: [N]
 ```
+## PHASE 0.5: SPEC PREFLIGHT (conditional)
+This phase exists because the ideate skill produces specs that are explicitly designed to be auto-resolve's contract — `Requirements` *are* the done-criteria, `Out of Scope` bounds over-building, `Dependencies` gates sequencing. When a run ignores that contract and re-derives everything from the raw task string, 25–40% of BUILD's reasoning is spent re-inventing material the spec already owns. This phase makes the contract load-bearing.
+Scan the task description from `<pipeline_config>` for a path matching the regex `docs/roadmap/phase-\d+/[^\s"'`)]+\.md`. If no match, skip this entire phase (non-spec tasks fall back to BUILD's open-ended discovery — that mode is still supported).
+If a match is found:
+1. **Read the spec file.** If the file does not exist, stop with a `BLOCKED` verdict in the final report — do not proceed to BUILD with a missing spec. The task description is lying and recovering from that silently is worse than halting.
+2. **Verify internal dependencies.** For each entry under the spec's `## Dependencies` → `Internal` list (e.g., `1.1 User Auth`), locate the matching spec file at `docs/roadmap/phase-*/[id]-*.md` and read its frontmatter `status` field. If any internal dependency does not have `status: done`, stop with a `BLOCKED` verdict listing the unmet deps. Implementing out of sequence wastes the whole pipeline and produces code that fails at the first integration point.
+3. **Write `.devlyn/SPEC-CONTEXT.md`** so downstream subagents read spec-owned content from a single canonical place without re-parsing the spec file. Copy these spec sections verbatim (do not paraphrase or compress — they are the contract):
+   ```
+   ---
+   id: [from frontmatter]
+   complexity: [from frontmatter]
+   priority: [from frontmatter]
+   depends-on: [from frontmatter]
+   source-spec: [path to the spec file]
+   ---
+   ## Customer Frame
+   [verbatim]
+   ## Objective
+   [verbatim]
+   ## Requirements
+   [verbatim — these become done-criteria in PHASE 1]
+   ## Constraints
+   [verbatim]
+   ## Out of Scope
+   [verbatim — honored explicitly by BUILD in Phase D]
+   ## Architecture Notes
+   [verbatim, or "(none)" if absent]
+   ## Dependencies
+   [verbatim]
+   ## Verification
+   [verbatim]
+   ```
+4. **Announce the preflight outcome.** One line: `Spec preflight: [spec path] — complexity [low/medium/high], [N] internal deps verified done, proceeding.` This appears in the final report under the Build row.
+Downstream phases detect `.devlyn/SPEC-CONTEXT.md` and prefer its content over re-derivation. If it is absent, they use their current open-ended behavior.
 ## PHASE 1: BUILD
 **Engine**: BUILD row of the routing table — Codex on `auto`/`codex`, Claude on `claude`. Per `<engine_routing_convention>` above. Subagents do not have access to skills, so the prompt below includes everything they need inline.
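Reviewer note on the hunk above: the PHASE 0.5 gate can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the skill's actual implementation (the skill is a prompt, not code). Only the regex and the `BLOCKED` verdict come from the diff; the function names, the `SKIP` and `PROCEED` labels, and the `dep_status` mapping are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Regex quoted from the PHASE 0.5 text in the hunk above.
SPEC_PATH_RE = re.compile(r"docs/roadmap/phase-\d+/[^\s\"'`)]+\.md")


def find_spec_path(task: str) -> Optional[str]:
    """Return the first roadmap spec path named in the task, if any."""
    match = SPEC_PATH_RE.search(task)
    return match.group(0) if match else None


def unmet_dependencies(dep_status: Dict[str, str]) -> List[str]:
    """List internal dependencies whose frontmatter status is not 'done'.

    dep_status is a hypothetical mapping of dependency id -> status.
    """
    return sorted(dep for dep, status in dep_status.items() if status != "done")


def preflight_verdict(task: str, spec_exists: bool, dep_status: Dict[str, str]) -> str:
    """Mirror steps 1 and 2 of the phase: missing spec or unmet deps block the run."""
    if find_spec_path(task) is None:
        return "SKIP"  # hypothetical label: non-spec task, open-ended BUILD
    if not spec_exists:
        return "BLOCKED: spec file missing"
    unmet = unmet_dependencies(dep_status)
    if unmet:
        return "BLOCKED: unmet deps " + ", ".join(unmet)
    return "PROCEED"  # hypothetical label: write SPEC-CONTEXT.md and continue
```

For example, `preflight_verdict("implement docs/roadmap/phase-2/2.1-billing.md", True, {"1.1": "in-progress"})` would report the unmet dependency rather than proceeding to BUILD.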
@@ -85,16 +137,28 @@ Agent prompt — pass this to the spawned executor:
 Investigate and implement the following task. Work through these phases in order:
-**Phase A — Understand the task**: Read the task description carefully. Classify the task type:
+**Phase A — Understand the task**: If `.devlyn/SPEC-CONTEXT.md` exists, read it first. The spec has already decided the task shape — use its `Objective`, `Constraints`, `Architecture Notes`, `Dependencies`, and `complexity` as the exploration boundary. Do not re-classify the task type open-endedly; the spec already bounds the problem. Read only the files the spec implicates (Architecture Notes + Dependencies + any existing files touched by referenced patterns), then move on.
+If no spec context file exists, read the raw task description and classify the task type:
 - **Bug fix**: trace from symptom to root cause. Read error logs and affected code paths.
 - **Feature**: explore the codebase to find existing patterns, integration points, and relevant modules.
 - **Refactor/Chore**: understand current implementation, identify what needs to change and why.
 - **UI/UX**: review existing components, design system, and user flows.
 Read relevant files in parallel. Build a clear picture of what exists and what needs to change.
-**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md` with testable success criteria. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section. This file is required — downstream evaluation depends on it.
+**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md`.
+First check whether `.devlyn/SPEC-CONTEXT.md` exists (produced by PHASE 0.5 when this run implements an ideate-produced spec). If it does, the spec is the contract — copy its `## Requirements` section verbatim into `done-criteria.md` as the primary done-criteria list, copy its `## Out of Scope` section as an `## Out of Scope` section in done-criteria.md, and copy its `## Verification` section as a `## Verification Method` section. Do not paraphrase, compress, or re-derive these — the ideate skill's CHALLENGE rubric already validated them, and weakening them here silently undoes that work. You may ADD criteria the spec obviously missed (e.g., if Requirements mention an API but omit an obvious error state) but never REMOVE or reword existing ones.
+If `.devlyn/SPEC-CONTEXT.md` does not exist, synthesize done-criteria from the raw task description. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section.
+This file is required — downstream evaluation depends on it.
+**Phase C — Assemble a team (complexity-gated)**: Check `.devlyn/SPEC-CONTEXT.md` frontmatter for `complexity`.
+If `complexity: low` AND the spec does not touch security/auth/API/data/UI risk areas (check by greping the spec for keywords: `auth`, `login`, `session`, `token`, `secret`, `password`, `crypto`, `api`, `env`, `permission`, `access`, `database`, `migration`, `payment`), skip TeamCreate entirely and implement directly — the multi-perspective team exists to catch ambiguity that low-complexity specs have already resolved.
-**Phase C — Assemble a team**: Use TeamCreate to create a team. Select teammates based on task type:
+Otherwise (complexity medium or high, risk areas present, or no spec context), use TeamCreate to create a team. Select teammates based on task type:
 - Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
 - Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
 - Refactor: architecture-reviewer + test-engineer
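Reviewer note on the Phase C change above: the complexity gate reduces to one boolean check. The sketch below is illustrative only. The keyword list is copied from the diff, but the function names and the choice of case-insensitive substring matching are assumptions; the skill text only says to grep the spec for the keywords.

```python
from typing import List, Optional

# Keyword list copied from the Phase C text in the hunk above.
RISK_KEYWORDS = (
    "auth", "login", "session", "token", "secret", "password", "crypto",
    "api", "env", "permission", "access", "database", "migration", "payment",
)


def risk_hits(spec_text: str) -> List[str]:
    """Return the risk keywords found in the spec, in list order.

    Assumption: case-insensitive substring match, like a plain `grep -i`.
    """
    lowered = spec_text.lower()
    return [keyword for keyword in RISK_KEYWORDS if keyword in lowered]


def should_skip_team(complexity: Optional[str], spec_text: str) -> bool:
    """Skip TeamCreate only for low-complexity specs with no risk keywords.

    complexity is None when no SPEC-CONTEXT.md exists, which keeps the team.
    """
    return complexity == "low" and not risk_hits(spec_text)
```

Under substring matching, short keywords such as `api` and `env` also match words like "rapid" or "environment", so the gate errs toward assembling the team. That is the conservative direction for a risk check, but it means few specs would qualify for the skip.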

package/config/skills/devlyn:ideate/SKILL.md CHANGED

@@ -32,19 +32,10 @@ Parse these from the user's invocation message:
 **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
 - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
 - Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort.
-- Read `references/challenge-rubric.md` up front. The engine routing table lives in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)" — read that on demand when routing decisions are needed.
+- Read `references/challenge-rubric.md` up front.
 **Consolidated flag**: `--with-codex` was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which routes the CHALLENGE rubric pass to Codex automatically. No flag needed. Continuing with `--engine auto`."
-<why_this_matters>
-When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
-- Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
-- Full roadmaps create attention noise (49 irrelevant items dilute focus on item #3)
-- Done criteria generated from vague prompts miss the user's actual intent
-This skill solves the context engineering problem by producing **self-contained specs** — each carries just enough context for auto-resolve to work autonomously.
-</why_this_matters>
 ## Output Architecture
 The skill produces a three-layer progressive disclosure structure:
@@ -105,6 +96,8 @@ Before starting, identify what the user needs:
 | User shares links/resources to process | **Research-first** | Lead with Explore (research synthesis), then standard flow |
 | Existing roadmap, user wants to reprioritize | **Replan** | Read existing docs, focus on Converge, update documents |
+**Tie-breaks when a request matches two modes:** choose the narrowest mode that satisfies the request. Quick Add wins over Expand when the user has one concrete item in mind. Research-first wins over Deep-dive when links or resources are the primary input. Deep-dive wins over Expand when one topic specifically needs depth. Replan is chosen only when priority or order changes are explicit. If two modes still look equally plausible after applying these rules, present the top two to the user and let them pick — silently choosing one wastes the session if the other was right.
 Announce the detected mode and confirm before proceeding.
 ### Expand Mode Detail
@@ -207,7 +200,7 @@ When a decision becomes wrong because the world changed under it:
 The biggest risk in ideation is premature convergence — jumping to solutions before understanding the problem. This phase prevents that.
 Establish through conversation:
-1. **Problem statement**: What problem or opportunity? For whom? Why now?
+1. **Job-to-be-Done**: In one sentence — "When [situation], [user] wants to [motivation], so they can [outcome]." Capture this before anything else. If the user cannot produce it, that is itself the finding — pause and explore the situation until the sentence exists. A bare problem statement without this frame is a state description, not a job, and downstream specs built from it will describe system behavior instead of customer progress.
 2. **Constraints**: What can't change? (tech stack, timeline, existing commitments)
 3. **Success criteria**: How will we know this worked? (outcomes, not outputs)
 4. **Anti-goals**: What are we explicitly NOT trying to do?
@@ -232,6 +225,7 @@ When relevant, actively research before and during brainstorming:
 - **Technical feasibility**: Can this be built within the constraints? Where are the hard parts?
 - **Patterns and prior art**: How have similar problems been solved?
 - **Market/user context**: Who else needs this? What do they currently use?
+- **Evidence discipline**: Treat prior art as source-backed only when verified by a fetched link or documentation the user can open. If a pattern is inferred from memory or analogy, label it `[UNVERIFIED]` inline and do not present it as market fact. The CHALLENGE rubric's NO GUESSWORK axis fires hard on unlabeled claims that look authoritative but are actually recall.
 Not every ideation needs all of these — a personal side project doesn't need market research. Judge what's relevant and use subagents for parallel research when multiple topics need investigation.
 </research_protocol>
@@ -317,8 +311,6 @@ Engage maximum thinking effort here — both the solo rubric pass and, if enable
 Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
 </thinking_effort>
-The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
 ### The rubric — single source of truth
 Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
@@ -329,8 +321,6 @@ Apply the rubric to the internal convergence draft. Produce findings in the form
 For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
-If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
 ### Codex critic pass (engine-routed)
 **If `--engine auto`** (default): Codex runs the CHALLENGE rubric pass automatically as critic.
@@ -522,6 +512,7 @@ Before finalizing, verify:
 - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex critic on `--engine auto`); no item still fails any axis at CRITICAL or HIGH severity
 - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
 - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
+- [ ] Every requirement is traceable to a confirmed fact, a verified source, or an explicitly labeled assumption — no unmarked guesses slipped into the specs
 ## Language

package/config/skills/devlyn:ideate/references/templates/item-spec.md CHANGED

@@ -22,6 +22,10 @@ depends-on: []
 <!-- Extract only the relevant context from the vision — don't make the implementation agent read the full vision document. -->
 [Project] does [what]. This feature [enables/improves/fixes] [specific user capability].
+## Customer Frame
+<!-- One sentence. When [situation], [user] wants to [motivation] so they can [outcome]. -->
+<!-- Use this to resolve ambiguous requirements: prefer the behavior that best serves this user outcome, and do not add capabilities outside this frame. -->
 ## Objective
 <!-- One sentence: what the user can do after this is implemented. -->
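Reviewer note on the new `## Customer Frame` section: the one-sentence template is regular enough that a spec linter could check it mechanically. The following is a hypothetical sketch; nothing in the package performs this check, and the function name and regex are assumptions. It accepts both wordings that appear in this diff ("wants [motivation]" and "wants to [motivation]").

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for: "When [situation], [user] wants to [motivation]
# so they can [outcome]." The "to" and the comma before "so" are optional.
JTBD_RE = re.compile(
    r"^When\s+(?P<situation>.+?),\s+(?P<user>.+?)\s+wants?\s+(?:to\s+)?"
    r"(?P<motivation>.+?),?\s+so\s+they\s+can\s+(?P<outcome>.+?)\.?$"
)


def parse_customer_frame(sentence: str) -> Optional[Dict[str, str]]:
    """Split a Customer Frame sentence into its four slots, or return None."""
    match = JTBD_RE.match(sentence.strip())
    return match.groupdict() if match else None
```

A bare problem statement such as "Login is broken" returns `None`, which matches the FRAME phase's point that a state description is not a job.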

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "1.13.0",
+  "version": "1.14.0",
   "description": "AI development toolkit for Claude Code — ideate, auto-resolve, and ship with context engineering and agent orchestration",
   "homepage": "https://github.com/fysoul17/devlyn-cli#readme",
   "bin": {