devlyn-cli 1.13.0 → 1.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (24)
  1. package/CLAUDE.md +28 -149
  2. package/README.md +30 -1
  3. package/config/skills/devlyn:auto-resolve/SKILL.md +167 -453
  4. package/config/skills/devlyn:auto-resolve/evals/evals.json +21 -0
  5. package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +42 -0
  6. package/config/skills/devlyn:auto-resolve/references/build-gate.md +36 -22
  7. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +43 -165
  8. package/config/skills/devlyn:auto-resolve/references/findings-schema.md +103 -0
  9. package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +54 -0
  10. package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +45 -0
  11. package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +84 -0
  12. package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +114 -0
  13. package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +201 -0
  14. package/config/skills/devlyn:auto-resolve/scripts/archive_run.py +104 -0
  15. package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +96 -0
  16. package/config/skills/devlyn:ideate/SKILL.md +17 -78
  17. package/config/skills/devlyn:ideate/references/codex-critic-template.md +42 -0
  18. package/config/skills/devlyn:ideate/references/templates/item-spec.md +4 -0
  19. package/config/skills/devlyn:preflight/SKILL.md +25 -40
  20. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +6 -10
  21. package/config/skills/devlyn:reap/SKILL.md +104 -0
  22. package/config/skills/devlyn:reap/scripts/reap.sh +129 -0
  23. package/config/skills/devlyn:reap/scripts/scan.sh +116 -0
  24. package/package.json +5 -1
package/CLAUDE.md CHANGED
@@ -2,175 +2,54 @@
2
2
 
3
3
  ## Quick Start
4
4
 
5
- For most work, the recommended sequence is:
5
+ Three commands cover most work. All default to `--engine auto` — Codex GPT-5.4 builds, Claude Opus 4.7 critiques (cross-model GAN dynamic).
6
6
 
7
- 1. `/devlyn:ideate` — turn an idea into roadmap-ready specs
8
- 2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → polish
9
- 3. `/devlyn:preflight` — verify the implementation matches the roadmap before shipping
7
+ 1. `/devlyn:ideate` — unstructured idea → VISION/ROADMAP/item specs
8
+ 2. `/devlyn:auto-resolve "Implement per spec at docs/roadmap/phase-N/X-name.md"` — hands-free build → evaluate → ship
9
+ 3. `/devlyn:preflight` — verify the implementation matches the roadmap
10
10
 
11
- All three default to `--engine auto`, which routes each phase to the optimal model (Codex GPT-5.4 for hard coding, Claude Opus 4.7 for evaluation/critique). The cross-model GAN dynamic (different models build vs critique) catches what single-model pipelines miss.
12
-
13
- ## General
14
-
15
- - Proactively use subagents and skills where needed
16
- - Follow commit conventions in `.claude/commit-conventions.md`
17
- - Follow design system in `docs/design-system.md` for UI/UX work if exist
11
+ Each skill's `SKILL.md` is the source of truth for its flags and workflow; don't duplicate them here.
18
12
 
19
13
  ## Error Handling Philosophy
20
14
 
21
15
  **No silent fallbacks.** Handle errors explicitly and show the user what happened.
22
16
 
23
- - **Default behavior**: When something fails, display a clear error state in the UI (error message, retry option, or actionable guidance). Do NOT silently fall back to default/placeholder data.
24
- - **Fallbacks are the exception, not the rule.** Only use fallbacks when it is a widely accepted best practice (e.g., fallback fonts in CSS, CDN failover, graceful image loading with placeholder). If unsure, handle the error explicitly instead.
25
- - **Never hide failures.** The user should always know when something went wrong. A visible error with a retry button is better UX than silently showing stale/default data.
26
- - **Pattern**: `try { doThing() } catch (error) { showErrorUI(error) }` — NOT `try { doThing() } catch { return fallbackValue }`
17
+ - **Default**: when something fails, display a clear error state: message, retry option, or actionable guidance. Do NOT silently fall back to default/placeholder data.
18
+ - **Fallbacks are the exception.** Only use them when it's a widely accepted best practice (CSS fallback fonts, CDN failover, image placeholders). Otherwise handle the error explicitly.
19
+ - **Pattern**: `try { doThing() } catch (error) { showErrorUI(error) }`, NOT `try { doThing() } catch { return fallbackValue }`.
27
20
 
28
21
  ## Investigation Workflow
29
22
 
30
23
  When investigating bugs, analyzing features, or exploring code:
31
24
 
32
- 1. **Define exit criteria upfront** - Ask "What does 'done' look like?" before starting
33
- 2. **Checkpoint progress** - Use the task tools (TaskCreate / TaskUpdate) every 5-10 minutes to save findings
34
- 3. **Output intermediate summaries** - Provide "Current Understanding" snapshots so work isn't lost if interrupted
35
- 4. **Always deliver findings** - Never end mid-analysis; at minimum output:
36
- - Files examined
37
- - Key findings
38
- - Remaining unknowns
39
- - Recommended next steps
40
-
41
- For complex investigations, use `/devlyn:team-resolve` to assemble a multi-perspective investigation team, or spawn parallel Agent subagents to explore different areas simultaneously.
42
-
43
- ## UI/UX Workflow
44
-
45
- The full design-to-implementation pipeline:
46
-
47
- 1. `/devlyn:design-ui` → Generate 5 style explorations
48
- 2. `/devlyn:design-system [N]` → Extract tokens from chosen style
49
- 3. `/devlyn:implement-ui` → Team implements or improves UI from design system
50
- 4. `/devlyn:team-resolve [feature]` → Add features on top
51
-
52
- ## Feature Development
53
-
54
- 1. **Plan first** - Always output a concrete implementation plan with specific file changes before writing code
55
- 2. **Track progress** - Use the task tools (TaskCreate / TaskUpdate) to checkpoint each phase
56
- 3. **Test validation** - Write tests alongside implementation; iterate until green
57
- 4. **Small commits** - Commit working increments rather than large changesets
58
-
59
- For complex features, spawn the `Plan` subagent (`Agent` tool with `subagent_type: "Plan"`) to design the approach before implementation.
60
-
61
- ## Automated Pipeline (Recommended Starting Point)
62
-
63
- For hands-free build-evaluate-polish cycles — works for bugs, features, refactors, and chores:
64
-
65
- ```
66
- /devlyn:auto-resolve [task description]
67
- ```
68
-
69
- This runs the full pipeline automatically: **Build → Build Gate → Browser Validate → Evaluate → Fix Loop → Simplify → Review → Challenge → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.devlyn/done-criteria.md`, `.devlyn/BUILD-GATE.md`, `.devlyn/EVAL-FINDINGS.md`, `.devlyn/BROWSER-RESULTS.md`, `.devlyn/CHALLENGE-FINDINGS.md`).
70
-
71
- The **Build Gate** (Phase 1.4) runs real compilers, typecheckers, and linters — the same commands CI/Docker/production will run. It auto-detects project types (Next.js, Rust, Go, Solidity, Expo, Swift, etc.) and Dockerfiles. This is the primary defense against "tests pass locally, breaks in CI/Docker" class of bugs (type errors in un-tested files, cross-package drift, Dockerfile copy mismatches).
72
-
73
- The **Challenge** phase (Phase 4.5) is a fresh skeptical review with no checklist — a subagent reads the entire diff cold with zero context from prior phases and asks "would I ship this to production with my name on it?" This catches the subtle issues that structured checklist-driven reviews miss: wrong-but-working approaches, unstated assumptions, non-idiomatic patterns, and integration gaps.
74
-
75
- For web projects, the Browser Validate phase starts the dev server and tests the implemented feature in a real browser — clicking buttons, filling forms, verifying results. If the feature doesn't work, findings feed back into the fix loop.
76
-
77
- Optional flags:
78
- - `--max-rounds 6` — increase max evaluate-fix iterations (default: 4)
79
- - `--skip-browser` — skip browser validation phase (auto-skipped for non-web changes)
80
- - `--skip-build-gate` — skip the deterministic build gate (not recommended)
81
- - `--build-gate strict` — treat warnings as errors; `--build-gate no-docker` — skip Docker builds for speed
82
- - `--skip-review` — skip team-review phase
83
- - `--skip-clean` — skip clean phase
84
- - `--skip-docs` — skip update-docs phase
85
- - `--engine auto|codex|claude` — intelligent model routing. `auto` (default) routes each phase and team role to the optimal model based on benchmark data: Codex GPT-5.4 handles BUILD and FIX (SWE-bench Pro lead), Claude Opus 4.7 handles EVALUATE and CHALLENGE (long-context retrieval + skeptical reasoning). Different models build vs critique — the cross-model GAN dynamic catches what single-model pipelines miss. `codex` forces Codex for implementation, Claude for orchestration and Chrome MCP. `claude` uses Claude for everything. Requires codex-mcp-server for `auto` and `codex` modes.
86
-
87
- ## Preflight Check (Post-Roadmap Verification)
88
-
89
- After completing a roadmap (or a phase), verify that everything was actually implemented correctly:
90
-
91
- ```
92
- /devlyn:preflight
93
- ```
25
+ 1. **Define exit criteria upfront**: ask "what does 'done' look like?" before starting.
26
+ 2. **Checkpoint progress**: use `TaskCreate`/`TaskUpdate` every 5-10 minutes.
27
+ 3. **Intermediate summaries**: output "current understanding" snapshots so work isn't lost if interrupted.
28
+ 4. **Always deliver findings**: never end mid-analysis. Minimum output: files examined, key findings, remaining unknowns, recommended next steps.
94
29
 
95
- This reads every commitment from VISION.md, ROADMAP.md, and item specs, then audits the codebase evidence-based. The code auditor now runs real build/typecheck commands as its first step — any project that doesn't compile is flagged as BROKEN at CRITICAL severity before individual commitments are even checked. Also checks in the browser for web projects.
30
+ For complex investigations, use `/devlyn:team-resolve` for a multi-perspective team, or spawn parallel `Agent` subagents.
96
31
 
97
- Output: `.devlyn/PREFLIGHT-REPORT.md` with categorized findings (MISSING, INCOMPLETE, DIVERGENT, BROKEN, STALE_DOC). Confirmed gaps can be promoted to new roadmap items for auto-resolve.
98
-
99
- Optional flags:
100
- - `--phase N` — audit only phase N items
101
- - `--autofix` — auto-promote CRITICAL/HIGH findings and run auto-resolve
102
- - `--skip-browser` — skip browser validation
103
- - `--skip-docs` — skip documentation audit
104
- - `--engine auto|codex|claude` — `auto` (default) routes the code-auditor to Codex (SWE-bench Pro +11.7pp on code analysis); the docs-auditor and browser-auditor always use Claude regardless of `--engine` (writing-quality strength on docs drift; Chrome MCP tools are session-bound to Claude Code)
105
-
106
- **Recommended workflow**: `/devlyn:ideate` → `/devlyn:auto-resolve` (repeat) → `/devlyn:preflight` → fix gaps → `/devlyn:preflight` (verify)
107
-
108
- ## Manual Pipeline (Step-by-Step Control)
109
-
110
- When you want to run each step yourself with review between phases:
111
-
112
- 1. `/devlyn:team-resolve [issue]` → Investigate + implement (writes `.devlyn/done-criteria.md`)
113
- 2. `/devlyn:evaluate` → Grade against done-criteria (writes `.devlyn/EVAL-FINDINGS.md`)
114
- 3. If findings exist: `/devlyn:team-resolve "Fix issues in .devlyn/EVAL-FINDINGS.md"` → Fix loop
115
- 4. `/simplify` → Quick cleanup pass
116
- 5. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
117
- 6. `/devlyn:clean` → Codebase hygiene
118
- 7. `/devlyn:update-docs` → Keep docs in sync
119
-
120
- Steps 5-7 are optional depending on scope.
121
-
122
- ## Vibe Coding Workflow
123
-
124
- The recommended sequence after writing code:
125
-
126
- 1. **Write code** (vibe coding)
127
- 2. `/simplify` → Quick cleanup pass (reuse, quality, efficiency)
128
- 3. `/devlyn:review` → Thorough solo review with security-first checklist
129
- 4. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
130
- 5. `/devlyn:clean` → Periodic codebase-wide hygiene
131
- 6. `/devlyn:update-docs` → Keep docs in sync
132
-
133
- Steps 4-6 are optional depending on the scope of changes. `/simplify` should always run before `/devlyn:review` to catch low-hanging fruit cheaply.
134
-
135
- ## Documentation Workflow
136
-
137
- - **Sync docs with codebase**: Use `/devlyn:update-docs` to clean up stale content, update outdated info, and generate missing docs
138
- - **Focused doc update**: Use `/devlyn:update-docs [area]` for targeted updates (e.g., "API reference", "getting-started")
139
- - Preserves all forward-looking content: roadmaps, future plans, visions, open questions
140
- - If no docs exist, proposes a tailored docs structure and generates initial content
141
-
142
- ## Browser Testing Workflow
143
-
144
- - **Standalone**: Use `/devlyn:browser-validate` to test any web feature in the browser — starts the dev server, tests the feature end-to-end, fixes issues it finds
145
- - **In pipeline**: Auto-resolve includes browser validation automatically for web projects (between Build and Evaluate phases)
146
- - **Tiered**: Uses chrome MCP tools if available, falls back to Playwright, then curl
147
- - **Feature-first**: Tests the implemented feature (from done-criteria), not just "does the page load"
148
-
149
- ## Debugging Workflow
32
+ ## Context Window Management
150
33
 
151
- - **Simple bugs**: Use `/devlyn:resolve` for systematic bug fixing with test-driven validation
152
- - **Complex bugs**: Use `/devlyn:team-resolve` for multi-perspective investigation with a full agent team
153
- - **Hands-free**: Use `/devlyn:auto-resolve` for fully automated resolve → evaluate → fix → polish pipeline
154
- - **Post-fix review**: Use `/devlyn:team-review` for thorough multi-reviewer validation
34
+ Claude 4.5/4.6/4.7 auto-compact as context approaches the limit, so you can work indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns; the model resumes from where it left off after compaction.
155
35
 
156
- ## Maintenance Workflow
36
+ For multi-context-window work (e.g., a large roadmap), persist state to disk:
37
+ - auto-resolve writes durable state to `.devlyn/runs/<run_id>/` (pipeline.state.json, `<phase>.findings.jsonl`, `<phase>.log.md`) plus git commits. Pick up by reading `state.json` first; drill into JSONL/log files as needed.
38
+ - preflight writes `.devlyn/PREFLIGHT-REPORT.md`.
39
+ - For long investigations, write progress to `HANDOFF.md`; resume with `@HANDOFF.md continue`.
157
40
 
158
- - **Codebase cleanup**: Use `/devlyn:clean` to detect and remove dead code, unused dependencies, complexity hotspots, and tech debt
159
- - **Focused cleanup**: Use `/devlyn:clean [category]` for targeted sweeps (dead code, deps, tests, complexity, hygiene)
160
- - **Periodic maintenance sequence**: `/devlyn:clean` → `/simplify` → `/devlyn:update-docs` → `/devlyn:review`
41
+ Manually `/clear` only when context is genuinely irrelevant to the next task.
161
42
 
162
- ## Context Window Management
43
+ ## Communication Style
163
44
 
164
- Claude 4.5 / 4.6 / 4.7 models auto-compact the conversation as it approaches the context limit, so you can keep working indefinitely without manual handoffs in most cases. Don't stop early due to token-budget concerns — the model continues from where it left off after compaction.
45
+ - Lead with **objective data** (popularity, benchmarks, community adoption) before personal opinions.
46
+ - When the user asks "what's popular" or "what do others use", answer with data.
47
+ - Keep recommendations actionable and specific.
165
48
 
166
- For genuinely multi-context-window work (e.g., a roadmap with many phases), persist state to disk so the next instance can resume:
167
- - All `auto-resolve` and `preflight` runs already write durable state to `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS, PREFLIGHT-REPORT) and to git commits — pick up by reading those files plus `git log`.
168
- - For long investigations, write progress notes to a `HANDOFF.md` and resume with `@HANDOFF.md continue from where this left off` if you need a fresh window.
49
+ ## Commit Conventions
169
50
 
170
- Manually clearing with `/clear` is rarely necessary — only do it when context is genuinely irrelevant to the next task.
51
+ Follow `.claude/commit-conventions.md`.
171
52
 
172
- ## Communication Style
53
+ ## Design System
173
54
 
174
- - Lead with **objective data** (popularity, benchmarks, community adoption) before personal opinions
175
- - When user asks "what's popular" or "what do others use", provide data-driven answers
176
- - Keep recommendations actionable and specific
55
+ When doing UI/UX work, follow `docs/design-system.md` if it exists.
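Reader's note on the Context Window Management bullets in the hunk above: picking a run back up from disk could look roughly like the sketch below. It assumes only the file names the text gives (`pipeline.state.json` and `<phase>.findings.jsonl` under `.devlyn/runs/<run_id>/`). The `load_run` helper is hypothetical and not part of devlyn-cli, and nothing is assumed about the fields inside either file.

```python
import json
from pathlib import Path


def load_run(run_dir):
    """Read the state file first, then every per-phase findings file.

    Hypothetical helper: only the file names come from the CLAUDE.md text;
    the JSON contents are treated as opaque.
    """
    run_dir = Path(run_dir)
    state_path = run_dir / "pipeline.state.json"
    if not state_path.is_file():
        # No silent fallback: a missing state file is reported, not replaced
        # with an empty default.
        raise FileNotFoundError(f"no pipeline.state.json in {run_dir}")
    state = json.loads(state_path.read_text(encoding="utf-8"))

    findings = {}
    for path in sorted(run_dir.glob("*.findings.jsonl")):
        phase = path.name[: -len(".findings.jsonl")]
        lines = path.read_text(encoding="utf-8").splitlines()
        # JSONL: one JSON document per non-empty line.
        findings[phase] = [json.loads(line) for line in lines if line.strip()]
    return state, findings
```

Called as `state, findings = load_run(".devlyn/runs/<run_id>")`, this returns the parsed state plus a dict of findings keyed by phase name. The `<phase>.log.md` files are left for manual reading.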
package/README.md CHANGED
@@ -109,7 +109,7 @@ Install the Codex MCP server during setup, then:
109
109
  /devlyn:auto-resolve "fix the auth bug" --engine auto
110
110
  ```
111
111
 
112
- **`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.6 or GPT-5.4) — validated through A/B testing, not just benchmarks.
112
+ **`--engine auto`** routes each pipeline phase and team role to the optimal model (Claude Opus 4.7 or GPT-5.4) — validated through A/B testing, not just benchmarks.
113
113
 
114
114
  > `--engine auto` (default, recommended) · `--engine codex` (force Codex for build) · `--engine claude` (Claude only)
115
115
 
@@ -146,6 +146,35 @@ Works across the full pipeline:
146
146
 
147
147
  </details>
148
148
 
149
+ <details>
150
+ <summary><strong>What's new in 1.14.0</strong> — CPO lens + handoff enforcement</summary>
151
+
152
+ `/devlyn:ideate` now thinks like a world-class Product Owner, and `/devlyn:auto-resolve` finally honors the spec contract the ideate skill was already designed to produce. Validated with 19 parallel eval subagents, 1.2M tokens of evidence — Customer Frame propagation went from 0/20 to 20/20 across seven test scenarios.
153
+
154
+ - **Jobs-to-be-Done forcing in FRAME** — ideate's opening FRAME phase now requires a one-sentence JTBD statement ("When [situation], [user] wants [motivation] so they can [outcome]") before anything else. A bare problem statement is a state description, not a job — downstream specs built without this frame describe system behavior instead of customer progress.
155
+ - **Customer Frame field on every item spec** — item-spec template gains a `## Customer Frame` section between Context and Objective that carries the per-item JTBD sentence all the way through to auto-resolve's build agent. The build agent uses this line to resolve ambiguity in Requirements rather than inventing interpretations.
156
+ - **PHASE 0.5 SPEC PREFLIGHT on auto-resolve** — when the task names a `docs/roadmap/phase-N/...md` spec, auto-resolve now reads it BEFORE BUILD, verifies internal dependencies are `status: done`, and writes `.devlyn/SPEC-CONTEXT.md` so downstream phases stop re-deriving what the spec already owns. Un-done deps halt the pipeline with `BLOCKED` rather than shipping out-of-sequence code.
157
+ - **Done-criteria verbatim copy** — when PHASE 0.5 found a spec, BUILD's Phase B copies the spec's `Requirements`, `Out of Scope`, and `Verification` sections verbatim into `.devlyn/done-criteria.md`. No silent re-derivation; the ideate CHALLENGE rubric's validation is preserved through the handoff.
158
+ - **Spec-bounded exploration** — BUILD's Phase A uses the spec's `Architecture Notes` + `Dependencies` as the exploration boundary instead of re-classifying the task type open-endedly.
159
+ - **Complexity-gated team ceremony** — `complexity: low` specs with no security/auth/API/data risk keywords skip TeamCreate entirely. Medium/high complexity or risk-flagged specs still assemble the team as before.
160
+ - **Evidence discipline in ideate EXPLORE** — research phase now labels unsourced market/tech claims `[UNVERIFIED]` inline rather than presenting recall as fact. The CHALLENGE rubric's NO GUESSWORK axis fires on unlabeled authoritative claims.
161
+ - **Mode tie-break rule** — when a request matches two ideate modes (Quick Add vs Expand, Research-first vs Deep-dive), the narrowest mode wins. Deterministic selection replaces intuitive match.
162
+ - **Bloat removal** — three redundant motivational blocks deleted from ideate SKILL.md (`<why_this_matters>` rationale, duplicate CHALLENGE preamble, external engine-routing pointer). SKILL.md shrank from 529 to 519 lines despite the new features.
163
+
164
+ </details>
165
+
166
+ <details>
167
+ <summary><strong>What's new in 1.13.0</strong> — Opus 4.7 pipeline pass</summary>
168
+
169
+ Core pipeline skills (`ideate`, `auto-resolve`, `preflight`) rewritten against Anthropic's Opus 4.7 prompting guidance, validated by multi-round comprehension and quality-grading subagents.
170
+
171
+ - **4.7 prompt patterns** — `<investigate_before_answering>` on evaluator and challenge, `<coverage_over_filtering>` with per-finding confidence, 3 few-shot examples in the Challenge phase, `<orchestrator_context>` (auto-compaction + xhigh effort), `<use_parallel_tool_calls>` in ideate EXPLORE and preflight Phase 0.
172
+ - **`--with-codex` consolidated into `--engine auto`** — auto now covers BUILD/FIX + team roles + ideate CHALLENGE critic (broader than `--with-codex both` ever was). Legacy flag still accepted with a graceful handoff.
173
+ - **Bug fixes** — PHASE 1.5 BLOCKED browser failures re-route correctly via PHASE 2.5; PHASE 1.4-fix and PHASE 2.5 share one global round counter; preflight PHASE 1 numbering fixed; build-gate-exhausted now produces a graceful final report.
174
+ - **CLAUDE.md refresh** (shipped to `npx` installers) — Quick Start pointing to ideate → auto-resolve → preflight, Context Window Management updated for Opus 4.7 auto-compaction, terminology refresh (TodoWrite → task tools, Task agents → Agent subagents).
175
+
176
+ </details>
177
+
149
178
  ---
150
179
 
151
180
  ## Manual Commands
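Reader's note on the 1.14.0 entry in the README hunk above: the PHASE 0.5 dependency check (every internal dependency must be `status: done`, otherwise the pipeline halts with `BLOCKED`) can be pictured with the minimal sketch below. The function name, the input shape (a mapping from dependency name to status string), and the `"OK"` label are assumptions for illustration. Only `done` and `BLOCKED` come from the release notes, and the real skill is a prompt-driven workflow rather than this code.

```python
def spec_preflight(dependency_status):
    """Gate a build on spec dependencies.

    dependency_status: hypothetical mapping of dependency name -> status string.
    Returns ("OK", []) when every dependency is "done", otherwise
    ("BLOCKED", [names of unfinished dependencies, sorted]).
    """
    not_done = sorted(
        name for name, status in dependency_status.items() if status != "done"
    )
    if not_done:
        # Halt rather than build out of sequence.
        return ("BLOCKED", not_done)
    return ("OK", [])
```

A spec with no dependencies passes trivially. Any status other than `done` blocks the run and names the offending items, so the caller can report them instead of proceeding silently.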