devlyn-cli 1.3.1 → 1.3.2

package/CLAUDE.md CHANGED
@@ -56,10 +56,13 @@ For hands-free build-evaluate-polish cycles — works for bugs, features, refact
  /devlyn:auto-resolve [task description]
  ```
 
- This runs the full pipeline automatically: **Build → Evaluate → Fix Loop → Simplify → Review → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.claude/done-criteria.md`, `.claude/EVAL-FINDINGS.md`).
+ This runs the full pipeline automatically: **Build → Browser Validate → Evaluate → Fix Loop → Simplify → Review → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.claude/done-criteria.md`, `.claude/EVAL-FINDINGS.md`, `.claude/BROWSER-RESULTS.md`).
+
+ For web projects, the Browser Validate phase starts the dev server and tests the implemented feature in a real browser — clicking buttons, filling forms, verifying results. If the feature doesn't work, findings feed back into the fix loop.
 
  Optional flags:
  - `--max-rounds 3` — increase max evaluate-fix iterations (default: 2)
+ - `--skip-browser` — skip browser validation phase (auto-skipped for non-web changes)
  - `--skip-review` — skip team-review phase
  - `--skip-clean` — skip clean phase
  - `--skip-docs` — skip update-docs phase
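
The phase order and skip flags in this hunk can be read as a simple filter over an ordered list. The following is a minimal sketch of that reading, not the package's implementation: `PIPELINE`, `SKIP_FLAGS`, `plan_phases`, and the `is_web_change` parameter are hypothetical names, and the flag-to-phase mapping is an assumption based on the flag descriptions above.

```python
# Phase order as listed in the updated CLAUDE.md text.
PIPELINE = [
    "Build", "Browser Validate", "Evaluate", "Fix Loop", "Simplify",
    "Review", "Security Review", "Clean", "Docs",
]

# Assumed mapping from each skip flag to the single phase it removes.
SKIP_FLAGS = {
    "--skip-browser": "Browser Validate",
    "--skip-review": "Review",
    "--skip-clean": "Clean",
    "--skip-docs": "Docs",
}


def plan_phases(flags, is_web_change=True):
    """Return the phases that would run for the given flags (hypothetical helper)."""
    skipped = {SKIP_FLAGS[f] for f in flags if f in SKIP_FLAGS}
    # The text says browser validation is auto-skipped for non-web changes.
    if not is_web_change:
        skipped.add("Browser Validate")
    return [phase for phase in PIPELINE if phase not in skipped]
```

For example, `plan_phases(["--skip-browser", "--skip-docs"])` would drop the Browser Validate and Docs phases and keep the other seven in order.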
@@ -99,6 +102,13 @@ Steps 4-6 are optional depending on the scope of changes. `/simplify` should alw
  - Preserves all forward-looking content: roadmaps, future plans, visions, open questions
  - If no docs exist, proposes a tailored docs structure and generates initial content
 
+ ## Browser Testing Workflow
+
+ - **Standalone**: Use `/devlyn:browser-validate` to test any web feature in the browser — starts the dev server, tests the feature end-to-end, fixes issues it finds
+ - **In pipeline**: Auto-resolve includes browser validation automatically for web projects (between Build and Evaluate phases)
+ - **Tiered**: Uses chrome MCP tools if available, falls back to Playwright, then curl
+ - **Feature-first**: Tests the implemented feature (from done-criteria), not just "does the page load"
+
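
The tiered fallback described in the added section amounts to picking the first available tool from a fixed preference order. A minimal sketch under that assumption follows; the tier identifiers and `pick_tier` are hypothetical names chosen for illustration, and how availability is actually detected is not shown in this diff.

```python
# Preference order taken from the text: Chrome MCP, then Playwright, then curl.
TIERS = ["chrome-mcp", "playwright", "curl"]


def pick_tier(available):
    """Return the highest-preference tier present in `available`, or None."""
    for tier in TIERS:
        if tier in available:
            return tier
    return None
```

With only Playwright and curl available, `pick_tier({"playwright", "curl"})` would select Playwright, which the README hunk below refers to as Tier 2.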
  ## Debugging Workflow
 
  - **Simple bugs**: Use `/devlyn:resolve` for systematic bug fixing with test-driven validation
package/README.md CHANGED
@@ -39,9 +39,10 @@ Structured prompts and role-based instructions that shape _what the AI knows and
 
  Pipeline orchestration that controls _how agents execute_ — permissions, state management, multi-phase workflows, and cross-model evaluation.
 
- - **`/devlyn:auto-resolve`** — 8-phase automated pipeline (build → evaluate → fix loop → simplify → review → security → clean → docs)
+ - **`/devlyn:auto-resolve`** — 9-phase automated pipeline (build → browser validate → evaluate → fix loop → simplify → review → security → clean → docs)
+ - **`/devlyn:browser-validate`** — feature verification in a real browser with tiered fallback (Chrome MCP → Playwright → curl)
  - **`bypassPermissions` mode** for autonomous subagent execution
- - **File-based state machine** — agents communicate via `.claude/done-criteria.md` and `EVAL-FINDINGS.md`
+ - **File-based state machine** — agents communicate via `.claude/done-criteria.md`, `EVAL-FINDINGS.md`, and `BROWSER-RESULTS.md`
  - **Git checkpoints** at each phase for rollback safety
  - **Cross-model evaluation** via `--with-codex` flag (OpenAI Codex as independent evaluator)
 
@@ -89,7 +90,8 @@ Slash commands are invoked directly in Claude Code conversations (e.g., type `/d
  |---|---|
  | `/devlyn:resolve` | Systematic bug fixing with root-cause analysis and test-driven validation |
  | `/devlyn:team-resolve` | Spawns a full agent team — root cause analyst, test engineer, security auditor — to investigate complex issues |
- | `/devlyn:auto-resolve` | Fully automated pipeline for any task — bugs, features, refactors, chores. Build → evaluate → fix loop → simplify → review → clean → docs. One command, zero human intervention. Supports `--with-codex` for cross-model evaluation via OpenAI Codex |
+ | `/devlyn:auto-resolve` | Fully automated pipeline for any task — bugs, features, refactors, chores. Build → browser validate → evaluate → fix loop → simplify → review → clean → docs. One command, zero human intervention. Supports `--with-codex` for cross-model evaluation via OpenAI Codex |
+ | `/devlyn:browser-validate` | Verify implemented features work in a real browser — starts dev server, tests the feature end-to-end (clicks, forms, verification), with tiered fallback (Chrome MCP → Playwright → curl) |
 
  ### Code Review & Quality
 
@@ -151,6 +153,7 @@ One command runs the full cycle — no human intervention needed:
  | Phase | What Happens |
  |---|---|
  | **Build** | `team-resolve` investigates and implements, writes testable done criteria |
+ | **Browser Validate** | For web projects: starts dev server, tests the implemented feature end-to-end in a real browser, fixes issues found |
  | **Evaluate** | Independent evaluator grades against done criteria with calibrated skepticism |
  | **Fix Loop** | If evaluation fails, fixes findings and re-evaluates (up to N rounds) |
  | **Simplify** | Quick cleanup pass for reuse and efficiency |
@@ -159,7 +162,7 @@ One command runs the full cycle — no human intervention needed:
  | **Clean** | Remove dead code and unused dependencies |
  | **Docs** | Sync documentation with changes |
 
- Each phase runs as a separate subagent (fresh context), communicates via files, and commits a git checkpoint for rollback safety. Skip phases with flags: `--skip-review`, `--skip-clean`, `--skip-docs`, `--max-rounds 3`, `--with-codex` (cross-model evaluation via OpenAI Codex).
+ Each phase runs as a separate subagent (fresh context), communicates via files, and commits a git checkpoint for rollback safety. Skip phases with flags: `--skip-browser`, `--skip-review`, `--skip-clean`, `--skip-docs`, `--max-rounds 3`, `--with-codex` (cross-model evaluation via OpenAI Codex).
 
  ### Manual Workflow
 
@@ -237,6 +240,15 @@ Installed via the [skills CLI](https://github.com/anthropics/skills) (`npx skill
  | `anthropics/skills` | Official Anthropic skill-creator with eval framework and description optimizer |
  | `Leonxlnx/taste-skill` | Premium frontend design skills — modern layouts, animations, and visual refinement |
 
+ ### MCP Servers
+
+ Installed via `claude mcp add` during setup.
+
+ | Server | Description |
+ |---|---|
+ | `codex-cli` | Codex MCP server for cross-model evaluation via OpenAI Codex |
+ | `playwright` | Playwright MCP for browser testing — powers `devlyn:browser-validate` Tier 2 |
+
  > **Want to add a pack?** Open a PR adding your pack to the `OPTIONAL_ADDONS` array in [`bin/devlyn.js`](bin/devlyn.js).
 
  ## How It Works
  ## How It Works
@@ -91,8 +91,12 @@ You are a browser validation agent. Read the skill instructions at `.claude/skil
  **After the agent completes**:
  1. Read `.claude/BROWSER-RESULTS.md`
  2. Extract the verdict
- 3. If `BLOCKED` → the app doesn't even render. Go directly to PHASE 2.5 fix loop with browser findings as context.
- 4. Otherwise → continue to PHASE 2 (the evaluator will read `BROWSER-RESULTS.md` as additional evidence)
+ 3. Branch on verdict:
+ - `PASS` → continue to PHASE 2
+ - `PASS WITH ISSUES` → continue to PHASE 2 (evaluator reads browser results as extra context)
+ - `PARTIALLY VERIFIED` → continue to PHASE 2, but flag to the evaluator that browser coverage was incomplete — unverified features should be weighted more heavily
+ - `NEEDS WORK` → features don't work in the browser. Go to PHASE 2.5 fix loop. Fix agent reads `.claude/BROWSER-RESULTS.md` for which criterion failed, at what step, with what error. After fixing, re-run PHASE 1.5 to verify the fix before proceeding to Evaluate.
+ - `BLOCKED` → app doesn't render. Go to PHASE 2.5 fix loop. After fixing, re-run PHASE 1.5.
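
The new branching can be summarized as a lookup from verdict to next phase. The sketch below is illustrative only: in the package this logic is carried out by an agent following prose instructions, and `NEXT_STEP`, `extract_verdict`, and `route` are hypothetical names. It assumes the verdict appears on a `## Verdict:` heading line, as in the results template shown later in this diff.

```python
import re

# Next phase per verdict, following the branch list above.
NEXT_STEP = {
    "PASS": "PHASE 2",
    "PASS WITH ISSUES": "PHASE 2",
    "PARTIALLY VERIFIED": "PHASE 2",
    "NEEDS WORK": "PHASE 2.5",
    "BLOCKED": "PHASE 2.5",
}


def extract_verdict(results_md):
    """Pull the verdict from a '## Verdict: ...' line; None if absent or unknown."""
    match = re.search(r"^## Verdict:\s*(.+?)\s*$", results_md, re.MULTILINE)
    if match is None:
        return None
    verdict = match.group(1).strip("[] ").upper()
    return verdict if verdict in NEXT_STEP else None


def route(results_md):
    """Decide what the orchestrator does after browser validation."""
    verdict = extract_verdict(results_md)
    if verdict is None:
        raise ValueError("no recognizable verdict in browser results")
    next_phase = NEXT_STEP[verdict]
    return {
        "verdict": verdict,
        "next": next_phase,
        # Both fix-loop branches re-run PHASE 1.5 after fixing.
        "rerun_browser_after_fix": next_phase == "PHASE 2.5",
        # Only this verdict asks the evaluator to weight unverified features more.
        "flag_incomplete_coverage": verdict == "PARTIALLY VERIFIED",
    }
```

Compared with the removed two-way branch, the notable change is that `NEEDS WORK` now goes to the fix loop instead of continuing to evaluation.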
 
  ## PHASE 2: EVALUATE
 
@@ -71,9 +71,11 @@ Read `.claude/done-criteria.md` (or infer from git diff what was built). For eac
  2. **Execute it**: Navigate to the page, find the interactive elements, perform the actions, verify the outcome. Read `references/flow-testing.md` for patterns on converting criteria to browser steps.
  3. **Capture evidence**: Screenshot at each key step. Record console errors and network failures that happen during the interaction.
  4. **If it fails — try to fix**: Read the error (console, network, or the UI state) to understand why the feature broke. Fix the source code, let hot-reload update, and re-test. Up to 2 fix attempts per criterion.
- 5. **Record the result**: For each criterion — PASS (feature works as specified), FAIL (feature doesn't work, include what went wrong), or SKIPPED (criterion isn't browser-testable, e.g., "API returns 401").
+ 5. **Record the result**: For each criterion — PASS (feature works as specified), FAIL (feature doesn't work, include what went wrong), SKIPPED (criterion isn't browser-testable, e.g., "API returns 401"), or UNVERIFIABLE (feature depends on external services not available in the test environment — e.g., real API keys, third-party auth, paid services).
 
- The verdict depends primarily on this phase. If the implemented features don't work in the browser, the validation fails even if every page renders perfectly and the layout looks great.
+ **Don't churn on external dependencies.** If a feature test is blocked because an API times out, a third-party service isn't configured, or auth credentials aren't available — that's not a bug to fix, it's a test environment limitation. Note it as UNVERIFIABLE, move on to the next criterion. Don't spend more than 30 seconds waiting for a response that's never coming. The goal is to verify what *can* be verified in the current environment, and be honest about what can't.
+
+ The verdict depends primarily on this phase. If the implemented features don't work in the browser, the validation fails — even if every page renders perfectly and the layout looks great. And if most features couldn't be verified due to environment limitations, be honest about that — don't call it PASS.
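
One way to read how per-criterion results roll up into the overall verdict is sketched below. The precedence order (BLOCKED, then NEEDS WORK, then PARTIALLY VERIFIED, then PASS WITH ISSUES) is an assumption inferred from the verdict rules in this diff, not something the skill states as an algorithm. `derive_verdict`, `server_ok`, and `visual_issues` are hypothetical names.

```python
def derive_verdict(results, server_ok=True, visual_issues=False):
    """Roll per-criterion results (PASS/FAIL/SKIPPED/UNVERIFIABLE) into one verdict.

    Assumed precedence, from most to least severe.
    """
    if not server_ok:
        # Server won't start or the app doesn't render.
        return "BLOCKED"
    if "FAIL" in results:
        # Any implemented feature that doesn't work fails the validation.
        return "NEEDS WORK"
    if "UNVERIFIABLE" in results:
        # Environment limitations must not be reported as a plain PASS.
        return "PARTIALLY VERIFIED"
    if visual_issues:
        return "PASS WITH ISSUES"
    # SKIPPED criteria are not browser-testable, so they don't affect the verdict here.
    return "PASS"
```

Under this reading, a single UNVERIFIABLE criterion is enough to prevent a PASS, which is stricter than the "most features" wording above; a threshold could be substituted if that wording is meant literally.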
 
  ## PHASE 5: VISUAL (supporting check)
 
@@ -91,12 +93,13 @@ Write `.claude/BROWSER-RESULTS.md`:
  ```markdown
  # Browser Validation Results
 
- ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+ ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / PARTIALLY VERIFIED / BLOCKED]
  Verdict rules:
  - BLOCKED = server won't start or app doesn't render
- - NEEDS WORK = implemented features don't work in the browser (this is the primary failure mode)
- - PASS WITH ISSUES = features work but visual issues or minor warnings exist
- - PASS = features verified working, pages render, layout clean
+ - NEEDS WORK = implemented features don't work in the browser
+ - PARTIALLY VERIFIED = some features verified working, but others couldn't be tested due to environment limitations (missing API keys, external service dependencies). Be explicit about what was and wasn't verified.
+ - PASS WITH ISSUES = all testable features work but visual issues or minor warnings exist
+ - PASS = all testable features verified working, pages render, layout clean
 
  ## What Was Tested
  [Brief description of the feature/task from done-criteria or git diff]
@@ -104,7 +107,10 @@ Verdict rules:
  ## Feature Verification (primary)
  | Criterion | Test Steps | Result | Evidence |
  |-----------|-----------|--------|----------|
- | [what should work] | [what you did] | PASS/FAIL/SKIPPED | [screenshot, errors, what went wrong] |
+ | [what should work] | [what you did] | PASS/FAIL/SKIPPED/UNVERIFIABLE | [screenshot, errors, what went wrong] |
+
+ ## Unverifiable Features (if any)
+ [List features that couldn't be tested and why — e.g., "Badge rendering requires /api/backends/status which needs real API keys not present in test env. Verified via source code and unit tests instead."]
 
  ## Smoke Test (prerequisite)
  | Route | Renders | Console Errors | Network Failures |
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "devlyn-cli",
- "version": "1.3.1",
+ "version": "1.3.2",
  "description": "Claude Code configuration toolkit for teams",
  "bin": {
  "devlyn": "bin/devlyn.js"