devlyn-cli 1.3.1 → 1.4.0

package/CLAUDE.md CHANGED
@@ -56,10 +56,13 @@ For hands-free build-evaluate-polish cycles — works for bugs, features, refact
  /devlyn:auto-resolve [task description]
  ```

- This runs the full pipeline automatically: **Build → Evaluate → Fix Loop → Simplify → Review → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.claude/done-criteria.md`, `.claude/EVAL-FINDINGS.md`).
+ This runs the full pipeline automatically: **Build → Browser Validate → Evaluate → Fix Loop → Simplify → Review → Security Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.devlyn/done-criteria.md`, `.devlyn/EVAL-FINDINGS.md`, `.devlyn/BROWSER-RESULTS.md`).
+
+ For web projects, the Browser Validate phase starts the dev server and tests the implemented feature in a real browser — clicking buttons, filling forms, verifying results. If the feature doesn't work, findings feed back into the fix loop.

  Optional flags:
  - `--max-rounds 3` — increase max evaluate-fix iterations (default: 2)
+ - `--skip-browser` — skip browser validation phase (auto-skipped for non-web changes)
  - `--skip-review` — skip team-review phase
  - `--skip-clean` — skip clean phase
  - `--skip-docs` — skip update-docs phase
@@ -69,9 +72,9 @@ Optional flags:

  When you want to run each step yourself with review between phases:

- 1. `/devlyn:team-resolve [issue]` → Investigate + implement (writes `.claude/done-criteria.md`)
- 2. `/devlyn:evaluate` → Grade against done-criteria (writes `.claude/EVAL-FINDINGS.md`)
- 3. If findings exist: `/devlyn:team-resolve "Fix issues in .claude/EVAL-FINDINGS.md"` → Fix loop
+ 1. `/devlyn:team-resolve [issue]` → Investigate + implement (writes `.devlyn/done-criteria.md`)
+ 2. `/devlyn:evaluate` → Grade against done-criteria (writes `.devlyn/EVAL-FINDINGS.md`)
+ 3. If findings exist: `/devlyn:team-resolve "Fix issues in .devlyn/EVAL-FINDINGS.md"` → Fix loop
  4. `/simplify` → Quick cleanup pass
  5. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
  6. `/devlyn:clean` → Codebase hygiene
@@ -99,6 +102,13 @@ Steps 4-6 are optional depending on the scope of changes. `/simplify` should alw
  - Preserves all forward-looking content: roadmaps, future plans, visions, open questions
  - If no docs exist, proposes a tailored docs structure and generates initial content

+ ## Browser Testing Workflow
+
+ - **Standalone**: Use `/devlyn:browser-validate` to test any web feature in the browser — starts the dev server, tests the feature end-to-end, fixes issues it finds
+ - **In pipeline**: Auto-resolve includes browser validation automatically for web projects (between Build and Evaluate phases)
+ - **Tiered**: Uses Chrome MCP tools if available, falls back to Playwright, then curl
+ - **Feature-first**: Tests the implemented feature (from done-criteria), not just "does the page load"
+
  ## Debugging Workflow

  - **Simple bugs**: Use `/devlyn:resolve` for systematic bug fixing with test-driven validation
package/README.md CHANGED
@@ -39,9 +39,10 @@ Structured prompts and role-based instructions that shape _what the AI knows and

  Pipeline orchestration that controls _how agents execute_ — permissions, state management, multi-phase workflows, and cross-model evaluation.

- - **`/devlyn:auto-resolve`** — 8-phase automated pipeline (build → evaluate → fix loop → simplify → review → security → clean → docs)
+ - **`/devlyn:auto-resolve`** — 9-phase automated pipeline (build → browser validate → evaluate → fix loop → simplify → review → security → clean → docs)
+ - **`/devlyn:browser-validate`** — feature verification in a real browser with tiered fallback (Chrome MCP → Playwright → curl)
  - **`bypassPermissions` mode** for autonomous subagent execution
- - **File-based state machine** — agents communicate via `.claude/done-criteria.md` and `EVAL-FINDINGS.md`
+ - **File-based state machine** — agents communicate via `.devlyn/done-criteria.md`, `EVAL-FINDINGS.md`, and `BROWSER-RESULTS.md`
  - **Git checkpoints** at each phase for rollback safety
  - **Cross-model evaluation** via `--with-codex` flag (OpenAI Codex as independent evaluator)

@@ -89,7 +90,8 @@ Slash commands are invoked directly in Claude Code conversations (e.g., type `/d
  |---|---|
  | `/devlyn:resolve` | Systematic bug fixing with root-cause analysis and test-driven validation |
  | `/devlyn:team-resolve` | Spawns a full agent team — root cause analyst, test engineer, security auditor — to investigate complex issues |
- | `/devlyn:auto-resolve` | Fully automated pipeline for any task — bugs, features, refactors, chores. Build → evaluate → fix loop → simplify → review → clean → docs. One command, zero human intervention. Supports `--with-codex` for cross-model evaluation via OpenAI Codex |
+ | `/devlyn:auto-resolve` | Fully automated pipeline for any task — bugs, features, refactors, chores. Build → browser validate → evaluate → fix loop → simplify → review → clean → docs. One command, zero human intervention. Supports `--with-codex` for cross-model evaluation via OpenAI Codex |
+ | `/devlyn:browser-validate` | Verify implemented features work in a real browser — starts dev server, tests the feature end-to-end (clicks, forms, verification), with tiered fallback (Chrome MCP → Playwright → curl) |

  ### Code Review & Quality

@@ -151,6 +153,7 @@ One command runs the full cycle — no human intervention needed:
  | Phase | What Happens |
  |---|---|
  | **Build** | `team-resolve` investigates and implements, writes testable done criteria |
+ | **Browser Validate** | For web projects: starts dev server, tests the implemented feature end-to-end in a real browser, fixes issues found |
  | **Evaluate** | Independent evaluator grades against done criteria with calibrated skepticism |
  | **Fix Loop** | If evaluation fails, fixes findings and re-evaluates (up to N rounds) |
  | **Simplify** | Quick cleanup pass for reuse and efficiency |
@@ -159,7 +162,7 @@ One command runs the full cycle — no human intervention needed:
  | **Clean** | Remove dead code and unused dependencies |
  | **Docs** | Sync documentation with changes |

- Each phase runs as a separate subagent (fresh context), communicates via files, and commits a git checkpoint for rollback safety. Skip phases with flags: `--skip-review`, `--skip-clean`, `--skip-docs`, `--max-rounds 3`, `--with-codex` (cross-model evaluation via OpenAI Codex).
+ Each phase runs as a separate subagent (fresh context), communicates via files, and commits a git checkpoint for rollback safety. Skip phases with `--skip-browser`, `--skip-review`, `--skip-clean`, `--skip-docs`; tune behavior with `--max-rounds 3` and `--with-codex` (cross-model evaluation via OpenAI Codex).

  ### Manual Workflow

@@ -169,7 +172,7 @@ For step-by-step control between phases:
  |---|---|---|
  | 1. **Resolve** | `/devlyn:resolve` or `/devlyn:team-resolve` | Fix the issue — solo for focused bugs (1-2 modules), team for complex issues (3+ modules) |
  | 2. **Evaluate** | `/devlyn:evaluate` | Independent quality evaluation — grades against done criteria written in step 1 |
- | | | *If the evaluation finds issues: `/devlyn:team-resolve "Fix issues in .claude/EVAL-FINDINGS.md"`* |
+ | | | *If the evaluation finds issues: `/devlyn:team-resolve "Fix issues in .devlyn/EVAL-FINDINGS.md"`* |
  | 3. **Simplify** | `/simplify` | Quick cleanup pass for reuse, quality, and efficiency *(built-in Claude Code command)* |
  | 4. **Review** | `/devlyn:review` or `/devlyn:team-review` | Audit the changes — solo for small PRs (< 10 files), team for large PRs (10+ files) |
  | 5. **Clean** | `/devlyn:clean` | Remove dead code, unused dependencies, and complexity hotspots |
@@ -237,6 +240,15 @@ Installed via the [skills CLI](https://github.com/anthropics/skills) (`npx skill
  | `anthropics/skills` | Official Anthropic skill-creator with eval framework and description optimizer |
  | `Leonxlnx/taste-skill` | Premium frontend design skills — modern layouts, animations, and visual refinement |

+ ### MCP Servers
+
+ Installed via `claude mcp add` during setup.
+
+ | Server | Description |
+ |---|---|
+ | `codex-cli` | Codex MCP server for cross-model evaluation via OpenAI Codex |
+ | `playwright` | Playwright MCP for browser testing — powers `devlyn:browser-validate` Tier 2 |
+
  > **Want to add a pack?** Open a PR adding your pack to the `OPTIONAL_ADDONS` array in [`bin/devlyn.js`](bin/devlyn.js).

  ## How It Works
@@ -4,7 +4,7 @@ You are a code quality evaluator. Your job is to audit work produced by another

  ## Before You Start

- 1. **Check for done criteria**: Read `.claude/done-criteria.md` if it exists. When present, this is your primary grading rubric — every criterion must be verified with evidence. When absent, fall back to the checklists below.
+ 1. **Check for done criteria**: Read `.devlyn/done-criteria.md` if it exists. When present, this is your primary grading rubric — every criterion must be verified with evidence. When absent, fall back to the checklists below.

  ## Calibration

@@ -36,7 +36,7 @@ You will be too lenient by default. You will identify real issues, then talk you

  ## Output

- Write findings to `.claude/EVAL-FINDINGS.md` for downstream consumption:
+ Write findings to `.devlyn/EVAL-FINDINGS.md` for downstream consumption:

  ```markdown
  # Evaluation Findings
@@ -61,4 +61,4 @@ Write findings to `.claude/EVAL-FINDINGS.md` for downstream consumption:
  - [positive observations]
  ```

- Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the orchestrator or user is responsible for cleanup.
+ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the orchestrator or user is responsible for cleanup.
package/bin/devlyn.js CHANGED
@@ -528,6 +528,19 @@ async function init(skipPrompts = false) {
  log(' → CLAUDE.md', 'dim');
  }

+ // Add .devlyn/ (pipeline state directory) to .gitignore
+ const gitignorePath = path.join(process.cwd(), '.gitignore');
+ const gitignoreEntry = '.devlyn/';
+ let gitignoreContent = fs.existsSync(gitignorePath)
+   ? fs.readFileSync(gitignorePath, 'utf8')
+   : '';
+ if (!gitignoreContent.split('\n').some((line) => line.trim() === gitignoreEntry || line.trim() === '.devlyn')) {
+   const prefix = gitignoreContent && !gitignoreContent.endsWith('\n') ? '\n' : '';
+   const header = gitignoreContent ? '\n# devlyn-cli pipeline state\n' : '# devlyn-cli pipeline state\n';
+   fs.writeFileSync(gitignorePath, gitignoreContent + prefix + header + gitignoreEntry + '\n');
+   log(' → .gitignore (added .devlyn/)', 'dim');
+ }
+
  // Enable agent teams in project settings
  const settingsPath = path.join(targetDir, 'settings.json');
  let settings = {};
@@ -539,16 +552,12 @@ async function init(skipPrompts = false) {
  }
  }
  if (!settings.env) settings.env = {};
- // Auto-allow pipeline files so auto-resolve doesn't prompt for permission
+ // Auto-allow pipeline state directory and common git commands so auto-resolve doesn't prompt
  if (!settings.permissions) settings.permissions = {};
  if (!settings.permissions.allow) settings.permissions.allow = [];
  const pipelinePermissions = [
-   'Write(.claude/done-criteria.md)',
-   'Write(.claude/EVAL-FINDINGS.md)',
-   'Write(.claude/BROWSER-RESULTS.md)',
-   'Edit(.claude/done-criteria.md)',
-   'Edit(.claude/EVAL-FINDINGS.md)',
-   'Edit(.claude/BROWSER-RESULTS.md)',
+   'Write(.devlyn/**)',
+   'Edit(.devlyn/**)',
    'Bash(git add *)',
    'Bash(git commit *)',
    'Bash(git diff *)',
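The `.gitignore` logic added to `init()` above boils down to an idempotent append. A minimal sketch of that behavior as a pure function — `appendGitignoreEntry` is a hypothetical name for illustration; the real code inlines the logic and reads/writes the file via `fs`:

```javascript
// Sketch of the idempotent .gitignore append shown in the bin/devlyn.js diff.
// `appendGitignoreEntry` is a hypothetical helper; the real code operates on
// the file directly. A pure function makes the idempotency easy to see.
function appendGitignoreEntry(content, entry) {
  const bare = entry.replace(/\/$/, ''); // also match the entry without a trailing slash
  const present = content
    .split('\n')
    .some((line) => line.trim() === entry || line.trim() === bare);
  if (present) return content; // already ignored — do nothing

  // Ensure existing content ends with a newline before appending the header
  const prefix = content && !content.endsWith('\n') ? '\n' : '';
  const header = content ? '\n# devlyn-cli pipeline state\n' : '# devlyn-cli pipeline state\n';
  return content + prefix + header + entry + '\n';
}

// Appending twice is a no-op, so init() can run repeatedly without duplicates:
const once = appendGitignoreEntry('node_modules/', '.devlyn/');
console.log(once === appendGitignoreEntry(once, '.devlyn/')); // true
```

Because the membership check accepts both `.devlyn/` and `.devlyn`, re-running `devlyn init` in a project that already ignores the directory (in either spelling) leaves `.gitignore` untouched.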
@@ -53,7 +53,7 @@ Investigate and implement the following task. Work through these phases in order
  - **UI/UX**: review existing components, design system, and user flows.
  Read relevant files in parallel. Build a clear picture of what exists and what needs to change.

- **Phase B — Define done criteria**: Before writing any code, create `.claude/done-criteria.md` with testable success criteria. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section. This file is required — downstream evaluation depends on it.
+ **Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md` with testable success criteria. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section. This file is required — downstream evaluation depends on it.

  **Phase C — Assemble a team**: Use TeamCreate to create a team. Select teammates based on task type:
  - Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
@@ -64,14 +64,14 @@ Each teammate investigates from their perspective and sends findings back.

  **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.

- **Phase E — Update done criteria**: Mark each criterion in `.claude/done-criteria.md` as satisfied. Run the full test suite.
+ **Phase E — Update done criteria**: Mark each criterion in `.devlyn/done-criteria.md` as satisfied. Run the full test suite.

  **Phase F — Cleanup**: Shut down all teammates and delete the team.

  The task is: [paste the task description here]

  **After the agent completes**:
- 1. Verify `.claude/done-criteria.md` exists — if missing, create a basic one from the agent's output summary
+ 1. Verify `.devlyn/done-criteria.md` exists — if missing, create a basic one from the agent's output summary
  2. Run `git diff --stat` to confirm code was actually changed
  3. If no changes were made, report failure and stop
  4. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"` to create a rollback point
@@ -86,13 +86,17 @@ Skip if `--skip-browser` was set.

  Agent prompt — pass this to the Agent tool:

- You are a browser validation agent. Read the skill instructions at `.claude/skills/devlyn:browser-validate/SKILL.md` and follow the full workflow to validate this web application. The dev server should be started, tested, and left running (pass `--keep-server` internally) — the pipeline will clean it up later. Write your findings to `.claude/BROWSER-RESULTS.md`.
+ You are a browser validation agent. Read the skill instructions at `.claude/skills/devlyn:browser-validate/SKILL.md` and follow the full workflow to validate this web application. The dev server should be started, tested, and left running (pass `--keep-server` internally) — the pipeline will clean it up later. Write your findings to `.devlyn/BROWSER-RESULTS.md`.

  **After the agent completes**:
- 1. Read `.claude/BROWSER-RESULTS.md`
+ 1. Read `.devlyn/BROWSER-RESULTS.md`
  2. Extract the verdict
- 3. If `BLOCKED` → the app doesn't even render. Go directly to PHASE 2.5 fix loop with browser findings as context.
- 4. Otherwise → continue to PHASE 2 (the evaluator will read `BROWSER-RESULTS.md` as additional evidence)
+ 3. Branch on verdict:
+ - `PASS` → continue to PHASE 2
+ - `PASS WITH ISSUES` → continue to PHASE 2 (evaluator reads browser results as extra context)
+ - `PARTIALLY VERIFIED` → continue to PHASE 2, but flag to the evaluator that browser coverage was incomplete — unverified features should be weighted more heavily
+ - `NEEDS WORK` → features don't work in the browser. Go to PHASE 2.5 fix loop. Fix agent reads `.devlyn/BROWSER-RESULTS.md` for which criterion failed, at what step, with what error. After fixing, re-run PHASE 1.5 to verify the fix before proceeding to Evaluate.
+ - `BLOCKED` → app doesn't render. Go to PHASE 2.5 fix loop. After fixing, re-run PHASE 1.5.

  ## PHASE 2: EVALUATE

@@ -102,7 +106,7 @@ Agent prompt — pass this to the Agent tool:

  You are an independent evaluator. Your job is to grade work produced by another agent, not to praise it. You will be too lenient by default — fight this tendency. When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs minutes of review. The cost is asymmetric.

- **Step 1 — Read the done criteria**: Read `.claude/done-criteria.md`. This is your primary grading rubric. Every criterion must be verified with evidence.
+ **Step 1 — Read the done criteria**: Read `.devlyn/done-criteria.md`. This is your primary grading rubric. Every criterion must be verified with evidence.

  **Step 2 — Discover changes**: Run `git diff HEAD~1` and `git status` to see what changed. Read all changed/new files in parallel.

@@ -115,7 +119,7 @@ You are an independent evaluator. Your job is to grade work produced by another

  **Step 4 — Grade against done criteria**: For each criterion in done-criteria.md, mark VERIFIED (with evidence) or FAILED (with file:line and what's wrong).

- **Step 5 — Write findings**: Write `.claude/EVAL-FINDINGS.md` with this exact structure:
+ **Step 5 — Write findings**: Write `.devlyn/EVAL-FINDINGS.md` with this exact structure:

  ```
  # Evaluation Findings
@@ -139,10 +143,10 @@ Calibration examples to guide your judgment:
  - A `let` that could be `const` = LOW note only. Linters catch this.
  - "The error handling is generally quite good" = WRONG. Count the instances. Name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line..."

- Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the orchestrator needs them.
+ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the orchestrator needs them.

  **After the agent completes**:
- 1. Read `.claude/EVAL-FINDINGS.md`
+ 1. Read `.devlyn/EVAL-FINDINGS.md`
  2. Extract the verdict
  3. **If `--with-codex` includes `evaluate` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
  4. Branch on verdict (from the merged findings if Codex was used):
@@ -150,7 +154,7 @@ Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the o
  - `PASS WITH ISSUES` → skip to PHASE 3 (issues are shippable)
  - `NEEDS WORK` → go to PHASE 2.5 (fix loop)
  - `BLOCKED` → go to PHASE 2.5 (fix loop)
- 5. If `.claude/EVAL-FINDINGS.md` was not created, treat as PASS WITH ISSUES and log a warning
+ 5. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as PASS WITH ISSUES and log a warning

  ## PHASE 2.5: FIX LOOP (conditional)

@@ -160,11 +164,11 @@ Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to fix th

  Agent prompt — pass this to the Agent tool:

- Read `.claude/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every CRITICAL and HIGH finding. Address MEDIUM findings if straightforward.
+ Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every CRITICAL and HIGH finding. Address MEDIUM findings if straightforward.

- The original done criteria are in `.claude/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.
+ The original done criteria are in `.devlyn/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.

- For each finding: read the referenced file:line, understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.claude/done-criteria.md` to mark fixed items.
+ For each finding: read the referenced file:line, understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.devlyn/done-criteria.md` to mark fixed items.

  **After the agent completes**:
  1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): fix round [N] complete"` to preserve the fix
@@ -267,10 +271,7 @@ Synchronize documentation with recent code changes. Use `git log --oneline -20`
  After all phases complete:

  1. Clean up temporary files:
- - Delete `.claude/done-criteria.md`
- - Delete `.claude/EVAL-FINDINGS.md`
- - Delete `.claude/BROWSER-RESULTS.md` (if exists)
- - Delete `.claude/screenshots/` directory (if exists)
+ - Delete the `.devlyn/` directory entirely (contains done-criteria.md, EVAL-FINDINGS.md, BROWSER-RESULTS.md, screenshots/, Playwright temp files)
  - Kill any dev server process still running from browser validation

  2. Run `git log --oneline -10` to show commits made during the pipeline
@@ -34,7 +34,7 @@ Run after the Claude evaluator (Phase 2) completes, only if `--with-codex` inclu
  ### Step 1 — Get Codex's evaluation

  Call `mcp__codex-cli__codex` with:
- - `prompt`: Include the full content of `.claude/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
+ - `prompt`: Include the full content of `.devlyn/done-criteria.md` and the output of `git diff HEAD~1`. Ask Codex to evaluate the changes against the done criteria and report issues by severity (CRITICAL, HIGH, MEDIUM, LOW) with file:line references.
  - `workingDirectory`: the project root
  - `sandbox`: `"read-only"` (Codex should only read, not modify files)
  - `reasoningEffort`: `"high"`
@@ -44,7 +44,7 @@ Example prompt to pass:
  You are an independent code evaluator. Grade the following code changes against the done criteria below. Be strict — when in doubt, flag it.

  ## Done Criteria
- [paste contents of .claude/done-criteria.md]
+ [paste contents of .devlyn/done-criteria.md]

  ## Code Changes
  [paste output of git diff HEAD~1]
@@ -61,7 +61,7 @@ Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to merge

  Agent prompt:

- Read `.claude/EVAL-FINDINGS.md` (Claude's evaluation) and the Codex evaluation output below. Merge them into a single unified `.claude/EVAL-FINDINGS.md` following the existing format. Rules:
+ Read `.devlyn/EVAL-FINDINGS.md` (Claude's evaluation) and the Codex evaluation output below. Merge them into a single unified `.devlyn/EVAL-FINDINGS.md` following the existing format. Rules:
  - Take the MORE SEVERE verdict between the two evaluators
  - Deduplicate findings that reference the same file:line or describe the same issue
  - When both evaluators flag the same issue, keep the more detailed description
@@ -15,7 +15,7 @@ $ARGUMENTS

  ## PHASE 1: DETECT

- 1. **What was built**: This is the most important input. Read `.claude/done-criteria.md` if it exists — it tells you what the feature is supposed to do. If it doesn't exist, read `git diff --stat` and `git log -1` to understand what changed. You need to know what to test before anything else.
+ 1. **What was built**: This is the most important input. Read `.devlyn/done-criteria.md` if it exists — it tells you what the feature is supposed to do. If it doesn't exist, read `git diff --stat` and `git log -1` to understand what changed. You need to know what to test before anything else.

  2. **Framework detection**: Read `package.json` → identify framework and start command from `scripts.dev`, `scripts.start`, or `scripts.preview`.

@@ -65,15 +65,17 @@ If the app isn't rendering, the verdict is BLOCKED — feature testing can't hap

  This is the primary purpose of browser validation. Everything else is in service of getting here.

- Read `.claude/done-criteria.md` (or infer from git diff what was built). For each criterion that describes something a user can do or see in the UI, test it end-to-end in the browser:
+ Read `.devlyn/done-criteria.md` (or infer from git diff what was built). For each criterion that describes something a user can do or see in the UI, test it end-to-end in the browser:

  1. **Plan the test**: What would a user do to verify this feature works? Navigate where, click what, type what, expect what result?
  2. **Execute it**: Navigate to the page, find the interactive elements, perform the actions, verify the outcome. Read `references/flow-testing.md` for patterns on converting criteria to browser steps.
  3. **Capture evidence**: Screenshot at each key step. Record console errors and network failures that happen during the interaction.
  4. **If it fails — try to fix**: Read the error (console, network, or the UI state) to understand why the feature broke. Fix the source code, let hot-reload update, and re-test. Up to 2 fix attempts per criterion.
- 5. **Record the result**: For each criterion — PASS (feature works as specified), FAIL (feature doesn't work, include what went wrong), or SKIPPED (criterion isn't browser-testable, e.g., "API returns 401").
+ 5. **Record the result**: For each criterion — PASS (feature works as specified), FAIL (feature doesn't work, include what went wrong), SKIPPED (criterion isn't browser-testable, e.g., "API returns 401"), or UNVERIFIABLE (feature depends on external services not available in the test environment — e.g., real API keys, third-party auth, paid services).

- The verdict depends primarily on this phase. If the implemented features don't work in the browser, the validation fails even if every page renders perfectly and the layout looks great.
+ **Don't churn on external dependencies.** If a feature test is blocked because an API times out, a third-party service isn't configured, or auth credentials aren't available — that's not a bug to fix, it's a test environment limitation. Note it as UNVERIFIABLE and move on to the next criterion. Don't spend more than 30 seconds waiting for a response that's never coming. The goal is to verify what *can* be verified in the current environment, and be honest about what can't.
+
+ The verdict depends primarily on this phase. If the implemented features don't work in the browser, the validation fails — even if every page renders perfectly and the layout looks great. And if most features couldn't be verified due to environment limitations, be honest about that — don't call it PASS.

  ## PHASE 5: VISUAL (supporting check)

@@ -86,17 +88,18 @@ Judgment-based — look at the screenshots and report visible issues.

  ## PHASE 6: REPORT

- Write `.claude/BROWSER-RESULTS.md`:
+ Write `.devlyn/BROWSER-RESULTS.md`:

  ```markdown
  # Browser Validation Results

- ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+ ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / PARTIALLY VERIFIED / BLOCKED]
  Verdict rules:
  - BLOCKED = server won't start or app doesn't render
- - NEEDS WORK = implemented features don't work in the browser (this is the primary failure mode)
- - PASS WITH ISSUES = features work but visual issues or minor warnings exist
- - PASS = features verified working, pages render, layout clean
+ - NEEDS WORK = implemented features don't work in the browser
+ - PARTIALLY VERIFIED = some features verified working, but others couldn't be tested due to environment limitations (missing API keys, external service dependencies). Be explicit about what was and wasn't verified.
+ - PASS WITH ISSUES = all testable features work but visual issues or minor warnings exist
+ - PASS = all testable features verified working, pages render, layout clean

  ## What Was Tested
  [Brief description of the feature/task from done-criteria or git diff]
@@ -104,7 +107,10 @@ Verdict rules:
  ## Feature Verification (primary)
  | Criterion | Test Steps | Result | Evidence |
  |-----------|-----------|--------|----------|
- | [what should work] | [what you did] | PASS/FAIL/SKIPPED | [screenshot, errors, what went wrong] |
+ | [what should work] | [what you did] | PASS/FAIL/SKIPPED/UNVERIFIABLE | [screenshot, errors, what went wrong] |
+
+ ## Unverifiable Features (if any)
+ [List features that couldn't be tested and why — e.g., "Badge rendering requires /api/backends/status which needs real API keys not present in test env. Verified via source code and unit tests instead."]

  ## Smoke Test (prerequisite)
  | Route | Renders | Console Errors | Network Failures |
@@ -1,6 +1,6 @@
  # Flow Testing: Done-Criteria to Browser Steps

- How to read `.claude/done-criteria.md` and convert testable criteria into browser action sequences. This is the bridge between "what should work" and "prove it works in the browser."
+ How to read `.devlyn/done-criteria.md` and convert testable criteria into browser action sequences. This is the bridge between "what should work" and "prove it works in the browser."

  Read this file only during PHASE 4 (FLOW) when done-criteria exists.

@@ -8,7 +8,7 @@ Read this file only during PHASE 4 (FLOW) when done-criteria exists.

  ## Step 1: Classify Each Criterion

- Read `.claude/done-criteria.md` and classify each criterion:
+ Read `.devlyn/done-criteria.md` and classify each criterion:

  **Browser-testable** — the criterion describes something a user can see or do in the UI:
  - "User can create a new project from the dashboard"
@@ -44,7 +44,7 @@ Generate a temporary test script from the test steps, run it with Playwright's J

  ## Script Generation

- For each phase (smoke, flow, visual), generate a test script at `.claude/browser-test.spec.ts`.
+ For each phase (smoke, flow, visual), generate a test script at `.devlyn/browser-test.spec.ts`.

  ### Smoke Test Script Template

@@ -89,7 +89,7 @@ test.describe('Smoke Tests', () => {
  const pageUrl = page.url();
  expect(title, 'Page shows a browser error — server may be down').not.toBe(pageUrl);
 
- await page.screenshot({ path: `.claude/screenshots/smoke${route.replace(/\//g, '-') || '-root'}.png`, fullPage: true });
+ await page.screenshot({ path: `.devlyn/screenshots/smoke${route.replace(/\//g, '-') || '-root'}.png`, fullPage: true });
 
  if (errors.length > 0) {
  test.info().annotations.push({ type: 'console_errors', description: errors.join(' | ') });
@@ -123,7 +123,7 @@ test('flow: [criterion description]', async ({ page }) => {
  await expect(page.locator('[verification selector]')).toBeVisible();
 
  // Screenshot
- await page.screenshot({ path: '.claude/screenshots/flow-[name].png' });
+ await page.screenshot({ path: '.devlyn/screenshots/flow-[name].png' });
  });
  ```
 
@@ -135,7 +135,7 @@ test.describe('Visual - Mobile', () => {
  for (const route of ROUTES) {
  test(`visual-mobile: ${route}`, async ({ page }) => {
  await page.goto(`http://localhost:${PORT}${route}`, { waitUntil: 'networkidle' });
- await page.screenshot({ path: `.claude/screenshots/visual-mobile${route.replace(/\//g, '-') || '-root'}.png`, fullPage: true });
+ await page.screenshot({ path: `.devlyn/screenshots/visual-mobile${route.replace(/\//g, '-') || '-root'}.png`, fullPage: true });
  });
  }
  });
@@ -145,7 +145,7 @@ test.describe('Visual - Desktop', () => {
  for (const route of ROUTES) {
  test(`visual-desktop: ${route}`, async ({ page }) => {
  await page.goto(`http://localhost:${PORT}${route}`, { waitUntil: 'networkidle' });
- await page.screenshot({ path: `.claude/screenshots/visual-desktop${route.replace(/\//g, '-') || '-root'}.png`, fullPage: true });
+ await page.screenshot({ path: `.devlyn/screenshots/visual-desktop${route.replace(/\//g, '-') || '-root'}.png`, fullPage: true });
  });
  }
  });
@@ -154,16 +154,16 @@ test.describe('Visual - Desktop', () => {
  ## Execution
 
  ```bash
- mkdir -p .claude/screenshots
- npx playwright test .claude/browser-test.spec.ts \
+ mkdir -p .devlyn/screenshots
+ npx playwright test .devlyn/browser-test.spec.ts \
  --reporter=json \
- --output=.claude/playwright-results \
- 2>&1 | tee .claude/playwright-output.json
+ --output=.devlyn/playwright-results \
+ 2>&1 | tee .devlyn/playwright-output.json
  ```
 
  ## Parsing Results
 
- Read `.claude/playwright-output.json`. The JSON structure contains:
+ Read `.devlyn/playwright-output.json`. The JSON structure contains:
  - `suites[].specs[].tests[].results[].status` — `"passed"`, `"failed"`, `"timedOut"`
  - `suites[].specs[].tests[].results[].errors` — error messages with stack traces
  - `suites[].specs[].tests[].annotations` — custom annotations (console_errors, network_failures)
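The reporter structure listed above can be walked with a small helper when turning raw results into report rows. A minimal sketch, assuming the Playwright JSON reporter shape described in the diff; the `summarize` helper and the `Pw*` type names are illustrative, not part of the skill:

```typescript
// Hypothetical helper: reduce Playwright JSON reporter output to
// PASS/FAIL verdict lines, carrying along any custom annotations.
type PwResult = { status: string };
type PwTest = { results: PwResult[]; annotations?: { type: string; description?: string }[] };
type PwSpec = { title: string; tests: PwTest[] };
type PwSuite = { specs: PwSpec[]; suites?: PwSuite[] };

function summarize(report: { suites: PwSuite[] }): string[] {
  const lines: string[] = [];
  const walk = (suite: PwSuite): void => {
    for (const spec of suite.specs) {
      for (const test of spec.tests) {
        // Use the last result: retries append entries, so the final one wins.
        const last = test.results[test.results.length - 1];
        const verdict = last.status === "passed" ? "PASS" : "FAIL";
        const notes = (test.annotations ?? [])
          .map((a) => `${a.type}: ${a.description ?? ""}`)
          .join("; ");
        lines.push(`${verdict} ${spec.title}${notes ? ` (${notes})` : ""}`);
      }
    }
    (suite.suites ?? []).forEach(walk); // suites can nest
  };
  report.suites.forEach(walk);
  return lines;
}

const report = {
  suites: [{
    specs: [{
      title: "smoke: /",
      tests: [{ results: [{ status: "passed" }], annotations: [] }],
    }],
  }],
};
console.log(summarize(report)); // → [ 'PASS smoke: /' ]
```

Note that `"timedOut"` and any other non-`"passed"` status collapse to FAIL here; a fuller version might keep the raw status in the Evidence column.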
@@ -177,12 +177,12 @@ Map these to BROWSER-RESULTS.md findings:
 
  After parsing results:
  ```bash
- rm -f .claude/browser-test.spec.ts
- rm -rf .claude/playwright-results
- rm -f .claude/playwright-output.json
+ rm -f .devlyn/browser-test.spec.ts
+ rm -rf .devlyn/playwright-results
+ rm -f .devlyn/playwright-output.json
  ```
 
- Keep `.claude/screenshots/` — those are evidence referenced by the report.
+ Keep `.devlyn/screenshots/` — those are evidence referenced by the report.
 
  ## Limitations vs Tier 1
 
@@ -23,7 +23,7 @@ Before spawning any evaluators, understand what you're evaluating:
  - **"recent changes"** or no argument: Use `git diff HEAD` for unstaged changes, `git status` for new files
  - **Running session / live monitoring**: Take a baseline snapshot with `git status --short | wc -l`, then poll every 30-45 seconds for new changes using `git status` and `find . -newer <reference-file> -type f`. Report findings incrementally as changes appear.
 
- 2. **Check for done criteria**: Read `.claude/done-criteria.md` if it exists. This file contains testable success criteria written by the generator (e.g., `/devlyn:team-resolve` Phase 1.5). When present, it is the primary grading rubric — every criterion in it must be verified. When absent, fall back to the evaluation checklists below.
+ 2. **Check for done criteria**: Read `.devlyn/done-criteria.md` if it exists. This file contains testable success criteria written by the generator (e.g., `/devlyn:team-resolve` Phase 1.5). When present, it is the primary grading rubric — every criterion in it must be verified. When absent, fall back to the evaluation checklists below.
 
  3. Build the evaluation baseline:
  - Run `git status --short` to see all changed and new files
@@ -297,9 +297,9 @@ LOW (note):
  4. For each catch block: is the error surfaced to the user or silently swallowed?
  5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
  6. Compare against existing components for pattern consistency
- 7. **Browser evidence** (when available): Read `.claude/BROWSER-RESULTS.md` if it exists — it contains pre-collected smoke test results, flow test results, console errors, network failures, and screenshots from the `devlyn:browser-validate` skill. Use this as additional evidence in your evaluation. Do not re-run smoke tests that are already covered.
+ 7. **Browser evidence** (when available): Read `.devlyn/BROWSER-RESULTS.md` if it exists — it contains pre-collected smoke test results, flow test results, console errors, network failures, and screenshots from the `devlyn:browser-validate` skill. Use this as additional evidence in your evaluation. Do not re-run smoke tests that are already covered.
  If the dev server is still running and you need deeper investigation on a specific interaction, use browser tools directly (check if `mcp__claude-in-chrome__*` tools are available, or fall back to Playwright). Focus on verifying specific findings, not duplicating the full smoke/flow suite.
- If neither `.claude/BROWSER-RESULTS.md` exists nor browser tools are available, note "Live testing skipped — no browser validation available" in your deliverable.
+ If neither `.devlyn/BROWSER-RESULTS.md` exists nor browser tools are available, note "Live testing skipped — no browser validation available" in your deliverable.
 
  **Your deliverable**: Send a message to the team lead with:
  1. Component quality assessment for each new/changed component
@@ -480,7 +480,7 @@ After receiving all evaluator findings:
 
  1. Present the evaluation report to the user (format below).
 
- 2. **Write findings to `.claude/EVAL-FINDINGS.md`** for downstream consumption by other agents (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`). This file enables the feedback loop — the generator can read it and fix the issues without human relay.
+ 2. **Write findings to `.devlyn/EVAL-FINDINGS.md`** for downstream consumption by other agents (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`). This file enables the feedback loop — the generator can read it and fix the issues without human relay.
 
  ```markdown
  # Evaluation Findings
@@ -502,7 +502,7 @@ After receiving all evaluator findings:
  - [pattern description]
  ```
 
- 3. Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — downstream consumers (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`) may need to read them. The orchestrator or user is responsible for cleanup.
+ 3. Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — downstream consumers (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`) may need to read them. The orchestrator or user is responsible for cleanup.
 
  ## Phase 6: CLEANUP
 
@@ -93,7 +93,7 @@ Teammates: [list of roles being spawned and why each was chosen]
 
  Before any code is written, define what "done" looks like. This prevents self-evaluation bias and gives external evaluators (like `/devlyn:evaluate`) concrete criteria to grade against.
 
- 1. Based on your Phase 1 investigation, write testable success criteria to `.claude/done-criteria.md`:
+ 1. Based on your Phase 1 investigation, write testable success criteria to `.devlyn/done-criteria.md`:
 
  ```markdown
  # Done Criteria: [issue summary]
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "devlyn-cli",
- "version": "1.3.1",
+ "version": "1.4.0",
  "description": "Claude Code configuration toolkit for teams",
  "bin": {
  "devlyn": "bin/devlyn.js"