npm - devlyn-cli - Versions diffs - 1.13.0 → 1.15.0 - Mend

devlyn-cli 1.13.0 → 1.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24) hide show

package/CLAUDE.md CHANGED Viewed

@@ -2,175 +2,54 @@
 ## Quick Start
-For most work, the recommended sequence is:
+Three commands cover most work. All default to `--engine auto` — Codex GPT-5.4 builds, Claude Opus 4.7 critiques (cross-model GAN dynamic).
-1. `/devlyn:ideate` — turn an idea into roadmap-ready specs
-2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → polish
-3. `/devlyn:preflight` — verify the implementation matches the roadmap before shipping
+1. `/devlyn:ideate` — unstructured idea → VISION/ROADMAP/item specs
+2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → ship
+3. `/devlyn:preflight` — verify the implementation matches the roadmap
-All three default to `--engine auto`, which routes each phase to the optimal model (Codex GPT-5.4 for hard coding, Claude Opus 4.7 for evaluation/critique). The cross-model GAN dynamic — different models build vs critique — catches what single-model pipelines miss.
-## General
-- Proactively use subagents and skills where needed
-- Follow commit conventions in `.claude/commit-conventions.md`
-- Follow design system in `docs/design-system.md` for UI/UX work if exist
+Each skill's `SKILL.md` is the source of truth for its flags and workflow — don't duplicate them here.
 ## Error Handling Philosophy
 **No silent fallbacks.** Handle errors explicitly and show the user what happened.
-- **Default behavior**: When something fails, display a clear error state in the UI (error message, retry option, or actionable guidance). Do NOT silently fall back to default/placeholder data.
-- **Fallbacks are the exception, not the rule.** Only use fallbacks when it is a widely accepted best practice (e.g., fallback fonts in CSS, CDN failover, graceful image loading with placeholder). If unsure, handle the error explicitly instead.
-- **Never hide failures.** The user should always know when something went wrong. A visible error with a retry button is better UX than silently showing stale/default data.
-- **Pattern**: `try { doThing() } catch (error) { showErrorUI(error) }` — NOT `try { doThing() } catch { return fallbackValue }`
+- **Default**: when something fails, display a clear error state — message, retry option, or actionable guidance. Do NOT silently fall back to default/placeholder data.
+- **Fallbacks are the exception.** Only use them when it's a widely accepted best practice (CSS fallback fonts, CDN failover, image placeholders). Otherwise handle the error explicitly.
+- **Pattern**: `try { doThing() } catch (error) { showErrorUI(error) }` — NOT `try { doThing() } catch { return fallbackValue }`.
 ## Investigation Workflow
 When investigating bugs, analyzing features, or exploring code:
-1. **Define exit criteria upfront** - Ask "What does 'done' look like?" before starting
-2. **Checkpoint progress** - Use the task tools (TaskCreate / TaskUpdate) every 5-10 minutes to save findings
-3. **Output intermediate summaries** - Provide "Current Understanding" snapshots so work isn't lost if interrupted
-4. **Always deliver findings** - Never end mid-analysis; at minimum output:
-   - Files examined
-   - Key findings
-   - Remaining unknowns
-   - Recommended next steps
-For complex investigations, use `/devlyn:team-resolve` to assemble a multi-perspective investigation team, or spawn parallel Agent subagents to explore different areas simultaneously.
-## UI/UX Workflow
-The full design-to-implementation pipeline:
-1. `/devlyn:design-ui` → Generate 5 style explorations
-2. `/devlyn:design-system [N]` → Extract tokens from chosen style
-3. `/devlyn:implement-ui` → Team implements or improves UI from design system
-4. `/devlyn:team-resolve [feature]` → Add features on top
-## Feature Development
-1. **Plan first** - Always output a concrete implementation plan with specific file changes before writing code
-2. **Track progress** - Use the task tools (TaskCreate / TaskUpdate) to checkpoint each phase
-3. **Test validation** - Write tests alongside implementation; iterate until green
-4. **Small commits** - Commit working increments rather than large changesets
-For complex features, spawn the `Plan` subagent (`Agent` tool with `subagent_type: "Plan"`) to design the approach before implementation.
-## Automated Pipeline (Recommended Starting Point)
-For hands-free build-evaluate-polish cycles — works for bugs, features, refactors, and chores:
-```
-/devlyn:auto-resolve [task description]
-```
-This runs the full pipeline automatically: **Build → Build Gate → Browser Validate → Evaluate → Fix Loop → Simplify → Review → Challenge → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.devlyn/done-criteria.md`, `.devlyn/BUILD-GATE.md`, `.devlyn/EVAL-FINDINGS.md`, `.devlyn/BROWSER-RESULTS.md`, `.devlyn/CHALLENGE-FINDINGS.md`).
-The **Build Gate** (Phase 1.4) runs real compilers, typecheckers, and linters — the same commands CI/Docker/production will run. It auto-detects project types (Next.js, Rust, Go, Solidity, Expo, Swift, etc.) and Dockerfiles. This is the primary defense against "tests pass locally, breaks in CI/Docker" class of bugs (type errors in un-tested files, cross-package drift, Dockerfile copy mismatches).
-The **Challenge** phase (Phase 4.5) is a fresh skeptical review with no checklist — a subagent reads the entire diff cold with zero context from prior phases and asks "would I ship this to production with my name on it?" This catches the subtle issues that structured checklist-driven reviews miss: wrong-but-working approaches, unstated assumptions, non-idiomatic patterns, and integration gaps.
-For web projects, the Browser Validate phase starts the dev server and tests the implemented feature in a real browser — clicking buttons, filling forms, verifying results. If the feature doesn't work, findings feed back into the fix loop.
-Optional flags:
-- `--max-rounds 6` — increase max evaluate-fix iterations (default: 4)
-- `--skip-browser` — skip browser validation phase (auto-skipped for non-web changes)
-- `--skip-build-gate` — skip the deterministic build gate (not recommended)
-- `--build-gate strict` — treat warnings as errors; `--build-gate no-docker` — skip Docker builds for speed
-- `--skip-review` — skip team-review phase
-- `--skip-clean` — skip clean phase
-- `--skip-docs` — skip update-docs phase
-- `--engine auto|codex|claude` — intelligent model routing. `auto` (default) routes each phase and team role to the optimal model based on benchmark data: Codex GPT-5.4 handles BUILD and FIX (SWE-bench Pro lead), Claude Opus 4.7 handles EVALUATE and CHALLENGE (long-context retrieval + skeptical reasoning). Different models build vs critique — the cross-model GAN dynamic catches what single-model pipelines miss. `codex` forces Codex for implementation, Claude for orchestration and Chrome MCP. `claude` uses Claude for everything. Requires codex-mcp-server for `auto` and `codex` modes.
-## Preflight Check (Post-Roadmap Verification)
-After completing a roadmap (or a phase), verify that everything was actually implemented correctly:
-```
-/devlyn:preflight
-```
+1. **Define exit criteria upfront** — ask "what does 'done' look like?" before starting.
+2. **Checkpoint progress** — use `TaskCreate`/`TaskUpdate` every 5–10 minutes.
+3. **Intermediate summaries** — output "current understanding" snapshots so work isn't lost if interrupted.
+4. **Always deliver findings** — never end mid-analysis. Minimum output: files examined, key findings, remaining unknowns, recommended next steps.
-This reads every commitment from VISION.md, ROADMAP.md, and item specs, then audits the codebase evidence-based. The code auditor now runs real build/typecheck commands as its first step — any project that doesn't compile is flagged as BROKEN at CRITICAL severity before individual commitments are even checked. Also checks in the browser for web projects.
+For complex investigations, use `/devlyn:team-resolve` for a multi-perspective team, or spawn parallel `Agent` subagents.
-Output: `.devlyn/PREFLIGHT-REPORT.md` with categorized findings (MISSING, INCOMPLETE, DIVERGENT, BROKEN, STALE_DOC). Confirmed gaps can be promoted to new roadmap items for auto-resolve.
-Optional flags:
-- `--phase N` — audit only phase N items
-- `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
-- `--skip-browser` — skip browser validation
-- `--skip-docs` — skip documentation audit
-- `--engine auto|codex|claude` — `auto` (default) routes the code-auditor to Codex (SWE-bench Pro +11.7pp on code analysis); the docs-auditor and browser-auditor always use Claude regardless of `--engine` (writing-quality strength on docs drift; Chrome MCP tools are session-bound to Claude Code)
-**Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)
-## Manual Pipeline (Step-by-Step Control)
-When you want to run each step yourself with review between phases:
-1. `/devlyn:team-resolve [issue]` → Investigate + implement (writes `.devlyn/done-criteria.md`)
-2. `/devlyn:evaluate` → Grade against done-criteria (writes `.devlyn/EVAL-FINDINGS.md`)
-3. If findings exist: `/devlyn:team-resolve "Fix issues in .devlyn/EVAL-FINDINGS.md"` → Fix loop
-4. `/simplify` → Quick cleanup pass
-5. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
-6. `/devlyn:clean` → Codebase hygiene
-7. `/devlyn:update-docs` → Keep docs in sync
-Steps 5-7 are optional depending on scope.
-## Vibe Coding Workflow
-The recommended sequence after writing code:
-1. **Write code** (vibe coding)
-2. `/simplify` → Quick cleanup pass (reuse, quality, efficiency)
-3. `/devlyn:review` → Thorough solo review with security-first checklist
-4. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
-5. `/devlyn:clean` → Periodic codebase-wide hygiene
-6. `/devlyn:update-docs` → Keep docs in sync
-Steps 4-6 are optional depending on the scope of changes. `/simplify` should always run before `/devlyn:review` to catch low-hanging fruit cheaply.
-## Documentation Workflow
-- **Sync docs with codebase**: Use `/devlyn:update-docs` to clean up stale content, update outdated info, and generate missing docs
-- **Focused doc update**: Use `/devlyn:update-docs [area]` for targeted updates (e.g., "API reference", "getting-started")
-- Preserves all forward-looking content: roadmaps, future plans, visions, open questions
-- If no docs exist, proposes a tailored docs structure and generates initial content
-## Browser Testing Workflow
-- **Standalone**: Use `/devlyn:browser-validate` to test any web feature in the browser — starts the dev server, tests the feature end-to-end, fixes issues it finds
-- **In pipeline**: Auto-resolve includes browser validation automatically for web projects (between Build and Evaluate phases)
-- **Tiered**: Uses chrome MCP tools if available, falls back to Playwright, then curl
-- **Feature-first**: Tests the implemented feature (from done-criteria), not just "does the page load"
-## Debugging Workflow
+## Context Window Management
-- **Simple bugs**: Use `/devlyn:resolve` for systematic bug fixing with test-driven validation
-- **Complex bugs**: Use `/devlyn:team-resolve` for multi-perspective investigation with a full agent team
-- **Hands-free**: Use `/devlyn:auto-resolve` for fully automated resolve → evaluate → fix → polish pipeline
-- **Post-fix review**: Use `/devlyn:team-review` for thorough multi-reviewer validation
+Claude 4.5/4.6/4.7 auto-compact as context approaches the limit — you can work indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns; the model resumes from where it left off after compaction.
-## Maintenance Workflow
+For multi-context-window work (e.g., a large roadmap), persist state to disk:
+- auto-resolve writes durable state to `.devlyn/runs/<run_id>/` (pipeline.state.json, `<phase>.findings.jsonl`, `<phase>.log.md`) plus git commits. Pick up by reading `state.json` first; drill into JSONL/log files as needed.
+- preflight writes `.devlyn/PREFLIGHT-REPORT.md`.
+- For long investigations, write progress to `HANDOFF.md`; resume with `@HANDOFF.md continue`.
-- **Codebase cleanup**: Use `/devlyn:clean` to detect and remove dead code, unused dependencies, complexity hotspots, and tech debt
-- **Focused cleanup**: Use `/devlyn:clean [category]` for targeted sweeps (dead code, deps, tests, complexity, hygiene)
-- **Periodic maintenance sequence**: `/devlyn:clean` → `/simplify` → `/devlyn:update-docs` → `/devlyn:review`
+Manually `/clear` only when context is genuinely irrelevant to the next task.
-## Context Window Management
+## Communication Style
-Claude 4.5 / 4.6 / 4.7 models auto-compact the conversation as it approaches the context limit, so you can keep working indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns — the model continues from where it left off after compaction.
+- Lead with **objective data** (popularity, benchmarks, community adoption) before personal opinions.
+- When the user asks "what's popular" or "what do others use", answer with data.
+- Keep recommendations actionable and specific.
-For genuinely multi-context-window work (e.g., a roadmap with many phases), persist state to disk so the next instance can resume:
-- All `auto-resolve` and `preflight` runs already write durable state to `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS, PREFLIGHT-REPORT) and to git commits — pick up by reading those files plus `git log`.
-- For long investigations, write progress notes to a `HANDOFF.md` and resume with `@HANDOFF.md continue from where this left off` if you need a fresh window.
+## Commit Conventions
-Manually clearing with `/clear` is rarely necessary — only do it when context is genuinely irrelevant to the next task.
+Follow `.claude/commit-conventions.md`.
-## Communication Style
+## Design System
-- Lead with **objective data** (popularity, benchmarks, community adoption) before personal opinions
-- When user asks "what's popular" or "what do others use", provide data-driven answers
-- Keep recommendations actionable and specific
+When doing UI/UX work, follow `docs/design-system.md` if it exists.

package/README.md CHANGED Viewed

@@ -109,7 +109,7 @@ Install the Codex MCP server during setup, then:
 /devlyn:auto-resolve "fix the auth bug" --engine auto
 ```
-**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.6 or GPT-5.4) — validated through A/B testing, not just benchmarks.
+**`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
 > `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
@@ -146,6 +146,35 @@ Works across the full pipeline:
 </details>
+<details>
+<summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
+`/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
+- **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
+- **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
+- **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
+- **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
+- **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
+- **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
+- **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
+- **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
+- **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
+</details>
+<details>
+<summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
+Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
+- **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
+- **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
+- **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
+- **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
+</details>
 ---
 ## Manual Commands