npm - devlyn-cli - Versions diffs - 1.2.1 → 1.3.1 - Mend

devlyn-cli 1.2.1 → 1.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/bin/devlyn.js CHANGED

@@ -154,6 +154,7 @@ const OPTIONAL_ADDONS = [
   { name: 'Leonxlnx/taste-skill', desc: 'Premium frontend design skills — modern layouts, animations, and visual refinement', type: 'external' },
   // MCP servers (installed via claude mcp add)
   { name: 'codex-cli', desc: 'Codex MCP server for cross-model evaluation via OpenAI Codex', type: 'mcp', command: 'npx -y codex-mcp-server' },
+  { name: 'playwright', desc: 'Playwright MCP for browser testing — powers devlyn:browser-validate Tier 2', type: 'mcp', command: 'npx -y @anthropic-ai/mcp-playwright' },
 ];
 function log(msg, color = 'reset') {
@@ -544,8 +545,10 @@ async function init(skipPrompts = false) {
   const pipelinePermissions = [
     'Write(.claude/done-criteria.md)',
     'Write(.claude/EVAL-FINDINGS.md)',
+    'Write(.claude/BROWSER-RESULTS.md)',
     'Edit(.claude/done-criteria.md)',
     'Edit(.claude/EVAL-FINDINGS.md)',
+    'Edit(.claude/BROWSER-RESULTS.md)',
     'Bash(git add *)',
     'Bash(git commit *)',
     'Bash(git diff *)',

package/config/skills/devlyn:auto-resolve/SKILL.md CHANGED

@@ -19,6 +19,7 @@ $ARGUMENTS
    - `--skip-review` (false) — skip team-review phase
    - `--security-review` (auto) — run dedicated security audit. Auto-detects: runs when changes touch auth, secrets, user data, API endpoints, env/config, or crypto. Force with `--security-review always` or skip with `--security-review skip`
    - `--skip-clean` (false) — skip clean phase
+   - `--skip-browser` (false) — skip browser validation phase (auto-skipped for non-web changes)
    - `--skip-docs` (false) — skip update-docs phase
    - `--with-codex` (false) — use OpenAI Codex as a cross-model evaluator/reviewer via `mcp__codex-cli__*` MCP tools. Accepts: `evaluate`, `review`, or `both` (default when flag is present without value). When enabled, Codex provides an independent second opinion from a different model family, creating a GAN-like dynamic where Claude builds and Codex critiques.
@@ -32,7 +33,7 @@ $ARGUMENTS
 ```
 Auto-resolve pipeline starting
 Task: [extracted task description]
-Phases: Build → Evaluate → [Fix loop if needed] → Simplify → [Review] → [Security] → [Clean] → [Docs]
+Phases: Build → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → [Security] → [Clean] → [Docs]
 Max evaluation rounds: [N]
 Cross-model evaluation (Codex): [evaluate / review / both / disabled]
 ```
@@ -75,6 +76,24 @@ The task is: [paste the task description here]
 3. If no changes were made, report failure and stop
 4. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"` to create a rollback point
+## PHASE 1.5: BROWSER VALIDATE (conditional)
+Skip if `--skip-browser` was set.
+1. **Check relevance**: Run `git diff --name-only` and check for web-relevant files (`*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.css`, `*.html`, `page.*`, `layout.*`, `route.*`). If none found, skip and note "Browser validation skipped — no web changes detected."
+2. **Run validation**: Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
+Agent prompt — pass this to the Agent tool:
+You are a browser validation agent. Read the skill instructions at `.claude/skills/devlyn:browser-validate/SKILL.md` and follow the full workflow to validate this web application. The dev server should be started, tested, and left running (pass `--keep-server` internally) — the pipeline will clean it up later. Write your findings to `.claude/BROWSER-RESULTS.md`.
+**After the agent completes**:
+1. Read `.claude/BROWSER-RESULTS.md`
+2. Extract the verdict
+3. If `BLOCKED` → the app doesn't even render. Go directly to PHASE 2.5 fix loop with browser findings as context.
+4. Otherwise → continue to PHASE 2 (the evaluator will read `BROWSER-RESULTS.md` as additional evidence)
 ## PHASE 2: EVALUATE
 Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to evaluate the work. Include all evaluation instructions inline.
@@ -250,6 +269,9 @@ After all phases complete:
 1. Clean up temporary files:
    - Delete `.claude/done-criteria.md`
    - Delete `.claude/EVAL-FINDINGS.md`
+   - Delete `.claude/BROWSER-RESULTS.md` (if exists)
+   - Delete `.claude/screenshots/` directory (if exists)
+   - Kill any dev server process still running from browser validation
 2. Run `git log --oneline -10` to show commits made during the pipeline
@@ -264,6 +286,7 @@ After all phases complete:
 | Phase | Status | Notes |
 |-------|--------|-------|
 | Build (team-resolve) | [completed] | [brief summary] |
+| Browser validate | [completed / skipped / auto-skipped] | [verdict, tier used, console errors, flow results] |
 | Evaluate (Claude) | [PASS/NEEDS WORK after N rounds] | [verdict + key findings] |
 | Evaluate (Codex) | [completed / skipped] | [Codex-only findings count, merged verdict] |
 | Fix rounds | [N rounds / skipped] | [what was fixed] |
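
Reviewer note (not part of the diff): the Phase 1.5 relevance check described above could be sketched roughly as follows. The helper names `isWebRelevant` and `shouldRunBrowserPhase` are hypothetical and do not exist in devlyn-cli; the patterns are copied from step 1 of Phase 1.5, and the changed-file list is assumed to come from `git diff --name-only`.

```typescript
// Hypothetical helpers illustrating the Phase 1.5 skip logic; not devlyn-cli code.

// Extensions listed in step 1: *.tsx, *.jsx, *.vue, *.svelte, *.css, *.html
const WEB_EXTENSION = /\.(tsx|jsx|vue|svelte|css|html)$/;
// Basename patterns listed in step 1: page.*, layout.*, route.*
const WEB_BASENAME = /^(page|layout|route)\./;

function isWebRelevant(path: string): boolean {
  // Match against the file name only, not the directory part.
  const parts = path.split("/");
  const base = parts[parts.length - 1];
  return WEB_EXTENSION.test(base) || WEB_BASENAME.test(base);
}

function shouldRunBrowserPhase(changedFiles: string[], skipBrowser: boolean): boolean {
  // --skip-browser wins; otherwise run only if at least one web-relevant file changed.
  return !skipBrowser && changedFiles.some(isWebRelevant);
}
```

With this reading, a change set of only `README.md` auto-skips the phase, while a single touched `page.tsx` triggers it unless `--skip-browser` is set.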

package/config/skills/devlyn:browser-validate/SKILL.md ADDED

@@ -0,0 +1,134 @@
+---
+name: devlyn:browser-validate
+description: Browser-based validation for web applications — verifies that implemented features actually work by testing them in a real browser. Starts the dev server, tests the feature end-to-end (click buttons, fill forms, verify results), and reports what's broken with screenshot evidence. Use this skill whenever the user says "test in browser", "check if it works", "does the feature work", "browser test", "validate the UI", or when auto-resolve needs to verify web changes actually function correctly. Also use proactively after implementing UI changes. The primary goal is feature verification, not just checking if pages render.
+---
+Verify that implemented features actually work in the browser. The primary job is to test the feature that was just built — click the button, fill the form, check the result. Smoke tests and visual checks are supporting checks, not the main event.
+The whole point of browser validation is to catch the gap between "code looks correct" and "user can actually do the thing." Static analysis and unit tests can confirm the code is well-structured. Browser validation confirms it *works*.
+<config>
+$ARGUMENTS
+</config>
+<workflow>
+## PHASE 1: DETECT
+1. **What was built**: This is the most important input. Read `.claude/done-criteria.md` if it exists — it tells you what the feature is supposed to do. If it doesn't exist, read `git diff --stat` and `git log -1` to understand what changed. You need to know what to test before anything else.
+2. **Framework detection**: Read `package.json` → identify framework and start command from `scripts.dev`, `scripts.start`, or `scripts.preview`.
+3. **Port inference**: Defaults — Next.js: 3000, Vite: 5173, CRA: 3000, Nuxt: 3000, Astro: 4321, Angular: 4200. Override with `--port` flag.
+4. **Affected routes**: Map changed files to routes (e.g., `app/dashboard/page.tsx` → `/dashboard`).
+5. **Tier selection** — pick the best available browser tool:
+   - Check if `mcp__claude-in-chrome__*` tools exist → **Tier 1** (Chrome DevTools). Read `references/tier1-chrome.md`.
+   - Else check if `mcp__playwright__*` tools exist or `npx playwright --version` succeeds → **Tier 2** (Playwright). Read `references/tier2-playwright.md`.
+   - Else → **Tier 3** (HTTP smoke). Read `references/tier3-curl.md`.
+6. **Skip gate**: If no web-relevant files changed (no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.astro`, `*.css`, `*.scss`, `*.html`, `page.*`, `layout.*`, `route.*`, `+page.*`, `+layout.*`), skip. Report: "Browser validation skipped — no web changes detected."
+7. **Parse flags** from `<config>`:
+   - `--skip-feature` — skip feature testing, only run smoke + visual
+   - `--port PORT` — override detected port
+   - `--tier N` — force a specific tier (1, 2, or 3)
+   - `--mobile-only` / `--desktop-only` — limit viewport testing
+Announce:
+```
+Browser validation starting
+Feature: [what was built, from done-criteria or git diff]
+Framework: [detected] | Port: [PORT] | Tier: [N — name]
+Phases: Server → Smoke → Feature Test → Visual → Report
+```
+## PHASE 2: SERVER
+Get the dev server running. If it doesn't start, diagnose and fix — don't just report failure.
+1. Start the dev server in background via Bash with `run_in_background: true`.
+2. Health-check: poll `http://localhost:PORT` every 2s, timeout 30s. Ready when you get an HTTP response.
+3. **If it doesn't come up — troubleshoot** (up to 2 attempts): read stderr for the error, fix it (npm install, port conflict, build error, etc.), restart, re-check.
+4. If still down after 2 attempts: write BLOCKED verdict and stop.
+## PHASE 3: SMOKE (quick prerequisite)
+Quick check that the app is alive. This is not the main test — it's a gate to make sure feature testing is even possible.
+Navigate to `/` and each affected route. For each page, judge: is this the actual application, or an error page? A connection error, framework error overlay, or blank shell is not the app. If broken, try to fix (read console errors, fix source, let hot-reload pick it up). Up to 2 fix attempts per route.
+If the app isn't rendering, the verdict is BLOCKED — feature testing can't happen.
+## PHASE 4: FEATURE TEST (the main event)
+This is the primary purpose of browser validation. Everything else is in service of getting here.
+Read `.claude/done-criteria.md` (or infer from git diff what was built). For each criterion that describes something a user can do or see in the UI, test it end-to-end in the browser:
+1. **Plan the test**: What would a user do to verify this feature works? Navigate where, click what, type what, expect what result?
+2. **Execute it**: Navigate to the page, find the interactive elements, perform the actions, verify the outcome. Read `references/flow-testing.md` for patterns on converting criteria to browser steps.
+3. **Capture evidence**: Screenshot at each key step. Record console errors and network failures that happen during the interaction.
+4. **If it fails — try to fix**: Read the error (console, network, or the UI state) to understand why the feature broke. Fix the source code, let hot-reload update, and re-test. Up to 2 fix attempts per criterion.
+5. **Record the result**: For each criterion — PASS (feature works as specified), FAIL (feature doesn't work, include what went wrong), or SKIPPED (criterion isn't browser-testable, e.g., "API returns 401").
+The verdict depends primarily on this phase. If the implemented features don't work in the browser, the validation fails — even if every page renders perfectly and the layout looks great.
+## PHASE 5: VISUAL (supporting check)
+Quick layout check at two viewports (with `--mobile-only` or `--desktop-only`, check only that viewport):
+1. **Mobile** (375x812): screenshot each affected route, check for overflow/overlap/unreadable text
+2. **Desktop** (1280x800): screenshot each affected route, check for broken layouts
+Judgment-based — look at the screenshots and report visible issues.
+## PHASE 6: REPORT
+Write `.claude/BROWSER-RESULTS.md`:
+```markdown
+# Browser Validation Results
+## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+Verdict rules:
+- BLOCKED = server won't start or app doesn't render
+- NEEDS WORK = implemented features don't work in the browser (this is the primary failure mode)
+- PASS WITH ISSUES = features work but visual issues or minor warnings exist
+- PASS = features verified working, pages render, layout clean
+## What Was Tested
+[Brief description of the feature/task from done-criteria or git diff]
+## Feature Verification (primary)
+| Criterion | Test Steps | Result | Evidence |
+|-----------|-----------|--------|----------|
+| [what should work] | [what you did] | PASS/FAIL/SKIPPED | [screenshot, errors, what went wrong] |
+## Smoke Test (prerequisite)
+| Route | Renders | Console Errors | Network Failures |
+|-------|---------|---------------|-----------------|
+| / | YES/NO | [count] | [count] |
+## Visual Check
+| Viewport | Route | Issues |
+|----------|-------|--------|
+| Mobile (375px) | / | [issues or "Clean"] |
+| Desktop (1280px) | / | [issues or "Clean"] |
+## Fixes Applied During Validation
+[List any bugs found and fixed during testing — server startup issues, broken routes, feature bugs]
+## Runtime Errors
+[Console errors captured during testing]
+## Failed Network Requests
+[Failed API calls captured during testing]
+```
+## PHASE 7: CLEANUP
+Kill the dev server PID. If `--keep-server` was passed (auto-resolve pipeline), skip — the pipeline handles cleanup.
+</workflow>
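
Reviewer note (not part of the diff): the verdict rules in PHASE 6 read as a strict priority order, which the sketch below tries to make explicit. The `ValidationSummary` shape and the `decideVerdict` name are assumptions for illustration; the skill itself leaves this judgment to the agent.

```typescript
// Hypothetical encoding of the PHASE 6 verdict rules; not devlyn-cli code.
type Verdict = "PASS" | "PASS WITH ISSUES" | "NEEDS WORK" | "BLOCKED";
type CriterionResult = "PASS" | "FAIL" | "SKIPPED";

interface ValidationSummary {
  serverStarted: boolean;      // PHASE 2 outcome
  appRenders: boolean;         // PHASE 3 outcome
  criteria: CriterionResult[]; // PHASE 4 outcome, one entry per criterion
  visualIssues: number;        // PHASE 5 findings
  minorWarnings: number;       // non-fatal warnings noted along the way
}

function decideVerdict(s: ValidationSummary): Verdict {
  // BLOCKED: server won't start or app doesn't render.
  if (!s.serverStarted || !s.appRenders) return "BLOCKED";
  // NEEDS WORK: any implemented feature fails in the browser.
  if (s.criteria.some((r) => r === "FAIL")) return "NEEDS WORK";
  // PASS WITH ISSUES: features work, but visual issues or minor warnings exist.
  if (s.visualIssues > 0 || s.minorWarnings > 0) return "PASS WITH ISSUES";
  return "PASS";
}
```

Checking in this order means a feature failure always outranks a layout problem, which matches the statement that the verdict depends primarily on PHASE 4.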

package/config/skills/devlyn:browser-validate/references/flow-testing.md ADDED

@@ -0,0 +1,118 @@
+# Flow Testing: Done-Criteria to Browser Steps
+How to read `.claude/done-criteria.md` and convert testable criteria into browser action sequences. This is the bridge between "what should work" and "prove it works in the browser."
+Read this file only during PHASE 4 (FEATURE TEST) when done-criteria exists.
+---
+## Step 1: Classify Each Criterion
+Read `.claude/done-criteria.md` and classify each criterion:
+**Browser-testable** — the criterion describes something a user can see or do in the UI:
+- "User can create a new project from the dashboard"
+- "Error message appears when form is submitted with empty fields"
+- "Navigation shows active state on current page"
+- "Data table loads and displays 10 rows"
+**Not browser-testable** — the criterion is about backend logic, data integrity, or code quality:
+- "API returns 401 for unauthenticated requests"
+- "Database migration runs without errors"
+- "Test coverage exceeds 80%"
+- "No TypeScript errors"
+Skip non-browser-testable criteria. Note them as "Skipped — not browser-testable" in the report.
+## Step 2: Convert to Action Sequences
+For each browser-testable criterion, generate a sequence of steps:
+### Pattern: Navigation + Verification
+```
+Criterion: "Dashboard shows project count"
+Steps:
+  1. Navigate to /dashboard
+  2. Find element containing project count (look for text matching a number pattern)
+  3. Verify: element exists and contains a numeric value
+  4. Screenshot
+```
+### Pattern: Form Interaction
+```
+Criterion: "User can create a new project"
+Steps:
+  1. Navigate to /dashboard (or wherever the create action lives)
+  2. Find "Create" or "New Project" button
+  3. Click it
+  4. Find form fields (name, description, etc.)
+  5. Fill with test data: name="Test Project", description="Browser validation test"
+  6. Find and click submit button
+  7. Verify: success indicator appears (toast, redirect, new item in list)
+  8. Screenshot at steps 3, 6, and 7
+```
+### Pattern: Error State
+```
+Criterion: "Error message shows when form submitted empty"
+Steps:
+  1. Navigate to the form page
+  2. Find submit button
+  3. Click submit without filling any fields
+  4. Verify: error message(s) visible
+  5. Screenshot showing error state
+```
+### Pattern: Conditional UI
+```
+Criterion: "Empty state shows when no data exists"
+Steps:
+  1. Navigate to the list/table page
+  2. Check if data exists — if so, this test needs a clean state
+  3. If clean state achievable: verify empty state message/illustration
+  4. If not: skip with note "Cannot verify empty state — data already exists"
+  5. Screenshot
+```
+## Step 3: Handle Data Dependencies
+Some flow tests need specific data to exist (or not exist). Approach:
+1. **Read-only tests preferred** — test flows that verify existing state rather than create/modify
+2. **Create test data if safe** — if the flow creates something (like a project), use obvious test names ("Browser Validation Test — safe to delete")
+3. **Skip if destructive** — don't test delete flows, don't modify existing data, don't test flows that send emails or notifications
+4. **Note dependencies** — if a test can't run because of missing data, note it as "Skipped — requires [specific data state]"
+## Step 4: Handle Auth-Protected Pages
+If a route requires authentication:
+1. Check if the app redirects to a login page
+2. If login is a simple form (email + password): note "Auth required — skipping unless test credentials available"
+3. If login uses OAuth/SSO: skip entirely, note "Skipped — requires OAuth flow"
+4. Do not attempt to log in with guessed credentials
+## Test Data Guidelines
+When filling forms during flow tests, use obviously fake but valid data:
+- Name: "Test User" or "Browser Validate Test"
+- Email: "test@browser-validate.local"
+- Description: "Created by browser-validate skill — safe to delete"
+- Numbers: use small, obvious values (1, 10, 100)
+This makes test data easy to identify and clean up later.
+## Output Format
+For each flow test, report:
+```
+Criterion: [original text from done-criteria]
+Classification: browser-testable | skipped
+Steps executed: [N of total]
+Result: PASS | FAIL | SKIPPED
+Evidence:
+  - Screenshot: [path]
+  - Console errors during flow: [count] — [details]
+  - Network failures during flow: [count] — [details]
+  - Failure point: [which step failed and why]
+```
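
Reviewer note (not part of the diff): Steps 3 and 4 above amount to a small skip policy. A minimal sketch follows, assuming a hypothetical `FlowPlan` description of each flow; the field names and note strings are illustrative and are not defined by the skill.

```typescript
// Hypothetical skip policy for flow tests (Steps 3 and 4); not devlyn-cli code.
interface FlowPlan {
  criterion: string;
  deletesData: boolean;         // delete flows are never tested
  modifiesExistingData: boolean;
  sendsNotifications: boolean;  // emails, push notifications, etc.
  auth: "none" | "form" | "oauth";
  hasTestCredentials: boolean;  // only meaningful when auth === "form"
}

interface FlowDecision {
  run: boolean;
  note: string;
}

function decideFlow(plan: FlowPlan): FlowDecision {
  // Step 3: skip anything destructive or with external side effects.
  if (plan.deletesData || plan.modifiesExistingData || plan.sendsNotifications) {
    return { run: false, note: "Skipped: destructive or side-effecting flow" };
  }
  // Step 4: OAuth/SSO is skipped entirely; form login needs known test credentials.
  if (plan.auth === "oauth") {
    return { run: false, note: "Skipped: requires OAuth flow" };
  }
  if (plan.auth === "form" && !plan.hasTestCredentials) {
    return { run: false, note: "Auth required: skipping unless test credentials available" };
  }
  return { run: true, note: "Browser-testable" };
}
```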

package/config/skills/devlyn:browser-validate/references/tier1-chrome.md ADDED

@@ -0,0 +1,132 @@
+# Tier 1: Chrome DevTools (claude-in-chrome)
+The richest testing tier. Requires the claude-in-chrome MCP extension running in a Chrome browser. Provides full DOM interaction, console monitoring, network inspection, screenshots, and GIF recording.
+Read this file only when Tier 1 was selected during DETECT phase.
+---
+## Setup
+Before any browser interaction, load the tools you need via ToolSearch:
+```
+ToolSearch: "select:mcp__claude-in-chrome__tabs_context_mcp"
+ToolSearch: "select:mcp__claude-in-chrome__tabs_create_mcp"
+ToolSearch: "select:mcp__claude-in-chrome__navigate"
+ToolSearch: "select:mcp__claude-in-chrome__get_page_text"
+ToolSearch: "select:mcp__claude-in-chrome__read_page"
+ToolSearch: "select:mcp__claude-in-chrome__find"
+ToolSearch: "select:mcp__claude-in-chrome__computer"
+ToolSearch: "select:mcp__claude-in-chrome__form_input"
+ToolSearch: "select:mcp__claude-in-chrome__resize_window"
+ToolSearch: "select:mcp__claude-in-chrome__read_console_messages"
+ToolSearch: "select:mcp__claude-in-chrome__read_network_requests"
+ToolSearch: "select:mcp__claude-in-chrome__gif_creator"
+ToolSearch: "select:mcp__claude-in-chrome__javascript_tool"
+```
+Then call `tabs_context_mcp` first to understand current browser state. Create a new tab for testing — never reuse existing user tabs.
+## Tool Mapping by Action
+### Navigate to a page
+```
+tabs_create_mcp → create new tab with URL http://localhost:{PORT}{route}
+  OR
+navigate → go to URL in existing tab
+```
+After navigating, wait 2-3 seconds for client-side rendering, then call `get_page_text` to verify content loaded.
+### Check if page rendered
+```
+get_page_text → extract visible text content
+```
+Read the text and judge: is this the actual application, or an error/fallback page? Browser error pages, framework error overlays, "Unable to connect" screens, and empty shells all have text — but they're not the app. If the page content doesn't look like what the application is supposed to show, it's a failure.
+### Read page structure
+```
+read_page → get DOM structure and layout info
+```
+Use this to understand component hierarchy before interacting.
+### Find interactive elements
+```
+find → locate buttons, links, inputs by text content or attributes
+```
+Returns element positions for clicking.
+### Click elements
+```
+computer → click at coordinates returned by find
+```
+After clicking, wait 1-2 seconds, then check console + network for errors.
+### Fill form fields
+```
+form_input → set values on input fields, selects, textareas
+```
+Identify fields with `find` first, then use `form_input` with the field selector.
+### Take screenshots
+```
+computer → screenshot action captures the visible viewport
+```
+Save screenshots with descriptive names: `smoke-root.png`, `flow-create-project-step3.png`, `visual-mobile-dashboard.png`.
+### Resize viewport
+```
+resize_window → set width and height
+```
+Mobile: `resize_window(375, 812)`. Desktop: `resize_window(1280, 800)`.
+### Read console messages
+```
+read_console_messages → get all console output
+```
+Use `pattern` parameter to filter. Useful patterns:
+- `"error|Error|ERROR"` — catch errors
+- `"warn|Warning"` — catch warnings
+- Exclude known noise: React dev warnings (`"Warning: "` prefix), HMR messages (`"[vite]"`, `"[HMR]"`, `"[Fast Refresh]"`), favicon 404s
+### Read network requests
+```
+read_network_requests → get all HTTP requests with status codes
+```
+Flag: any request with status 4xx or 5xx (excluding `/favicon.ico`). Flag: any CORS error. Ignore: HMR websocket connections, source map requests (`.map`).
+### Record multi-step flows
+```
+gif_creator → record a sequence of actions as an animated GIF
+```
+Use for flow tests with 3+ steps. Capture extra frames before and after actions for smooth playback. Name meaningfully: `flow-user-registration.gif`.
+### Run custom assertions
+```
+javascript_tool → execute JS in the page context
+```
+Useful for checking specific DOM state that other tools can't easily verify:
+- `document.querySelectorAll('.error-message').length` — count error elements
+- `window.__NEXT_DATA__` — check Next.js hydration data
+- `document.title` — verify page title
+Avoid triggering alerts or confirms — they block the extension. Use `console.log` + `read_console_messages` instead.
+## Error Filtering
+Not every console message is a real problem. Apply these filters:
+**Ignore (dev noise)**:
+- `[HMR]`, `[vite]`, `[Fast Refresh]`, `[webpack-dev-server]`
+- `Warning: ReactDOM.render is no longer supported` (React 18 dev warning)
+- `Download the React DevTools`
+- `/favicon.ico` 404
+- Source map warnings
+**Flag as errors**:
+- `Uncaught` anything
+- `TypeError`, `ReferenceError`, `SyntaxError`
+- `Failed to fetch` (network errors)
+- `CORS` errors
+- `Hydration` mismatches
+- `ChunkLoadError` (code splitting failures)
+- Any `console.error` call from application code
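
Reviewer note (not part of the diff): the ignore/flag lists above could be applied mechanically along these lines. This is a sketch under two assumptions that the file does not state: noise patterns are checked before error patterns, and each message arrives with its console level. `classifyConsoleMessage` is a hypothetical name.

```typescript
// Hypothetical classifier for the Error Filtering lists; not devlyn-cli code.
const NOISE_PATTERNS: RegExp[] = [
  /\[HMR\]/,
  /\[vite\]/,
  /\[Fast Refresh\]/,
  /\[webpack-dev-server\]/,
  /Warning: ReactDOM\.render is no longer supported/,
  /Download the React DevTools/,
  /\/favicon\.ico/,
  /source ?map/i,
];

const ERROR_PATTERNS: RegExp[] = [
  /Uncaught/,
  /TypeError|ReferenceError|SyntaxError/,
  /Failed to fetch/,
  /CORS/,
  /Hydration/i,
  /ChunkLoadError/,
];

type Classification = "ignore" | "error" | "other";

function classifyConsoleMessage(text: string, level: string): Classification {
  // Assumption: dev noise is filtered first, even when logged at error level.
  if (NOISE_PATTERNS.some((p) => p.test(text))) return "ignore";
  // Any remaining console.error call, or a message matching a flagged pattern.
  if (level === "error" || ERROR_PATTERNS.some((p) => p.test(text))) return "error";
  return "other";
}
```

Checking noise first keeps a favicon 404 logged at error level out of the report; the opposite order would be equally defensible if a team prefers to over-report.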

package/config/skills/devlyn:browser-validate/references/tier2-playwright.md ADDED

@@ -0,0 +1,192 @@
+# Tier 2: Playwright (Headless Browser)
+Solid middle-ground tier. No browser extension needed — works in CI, SSH, Docker, and headless environments. Provides DOM interaction, console monitoring, screenshots, and network inspection. No GIF recording.
+Read this file only when Tier 2 was selected during DETECT phase.
+---
+## Two Modes
+Playwright Tier 2 has two sub-modes depending on what's available. The skill auto-detects which to use.
+### Mode A: Playwright MCP (preferred)
+If `mcp__playwright__*` tools are available (installed via `npx devlyn-cli` → select "playwright" MCP), use them directly. This gives interactive browser control similar to Tier 1:
+- `mcp__playwright__browser_navigate` — navigate to URL
+- `mcp__playwright__browser_screenshot` — capture screenshot
+- `mcp__playwright__browser_click` — click elements
+- `mcp__playwright__browser_type` — type into inputs
+- `mcp__playwright__browser_console` — read console messages
+- `mcp__playwright__browser_network` — read network requests
+- `mcp__playwright__browser_resize` — resize viewport
+When Playwright MCP is available, follow the same interaction pattern as Tier 1 (navigate → check → interact → screenshot) but using `mcp__playwright__*` tools instead of `mcp__claude-in-chrome__*`.
+Load tools via ToolSearch before use: `ToolSearch: "select:mcp__playwright__browser_navigate"` etc.
+### Mode B: Script Generation (fallback)
+If Playwright MCP is not installed but `npx playwright` CLI is available, generate and execute test scripts. This is the approach documented below.
+## Setup (Mode B only)
+Playwright runs via `npx` with auto-download. No global install needed. If browsers aren't installed yet:
+```bash
+npx playwright install chromium 2>/dev/null
+```
+This downloads only Chromium (~130MB), not all browsers. It's a one-time cost.
+## Approach (Mode B)
+Generate a temporary test script from the test steps, run it with Playwright's JSON reporter, then parse the results. This avoids needing a persistent test infrastructure — the script is created, executed, and cleaned up.
+## Script Generation
+For each phase (smoke, flow, visual), generate a test script at `.claude/browser-test.spec.ts`.
+### Smoke Test Script Template
+```typescript
+import { test, expect } from '@playwright/test';
+const PORT = {PORT};
+const ROUTES = {ROUTES_JSON_ARRAY};
+test.describe('Smoke Tests', () => {
+  for (const route of ROUTES) {
+    test(`smoke: ${route}`, async ({ page }) => {
+      const errors: string[] = [];
+      const failedRequests: string[] = [];
+      page.on('console', msg => {
+        if (msg.type() === 'error') errors.push(msg.text());
+      });
+      page.on('response', response => {
+        if (response.status() >= 400 && !response.url().includes('favicon')) {
+          failedRequests.push(`${response.status()} ${response.url()}`);
+        }
+      });
+      // If goto throws (connection refused), the test fails — that's correct behavior
+      await page.goto(`http://localhost:${PORT}${route}`, { waitUntil: 'networkidle', timeout: 15000 });
+      // Verify this is the actual application, not an error page.
+      // When a server is down or a route is broken, the browser shows an error page
+      // that still has text content — "Unable to connect", "This site can't be reached", etc.
+      // A naive length check would pass on these. The title is the best signal:
+      // browser error pages have titles like "Problem loading page" or the URL itself,
+      // while real apps have meaningful titles set by the application.
+      const title = await page.title();
+      const bodyText = await page.textContent('body') || '';
+      // Page must have substantive content
+      expect(bodyText.trim().length, 'Page body is empty').toBeGreaterThan(0);
+      // Fail if the page navigation itself failed (Playwright sets title to the URL on error)
+      const pageUrl = page.url();
+      expect(title, 'Page shows a browser error — server may be down').not.toBe(pageUrl);
+      await page.screenshot({ path: `.claude/screenshots/smoke${route === '/' ? '-root' : route.replace(/\//g, '-')}.png`, fullPage: true });
+      if (errors.length > 0) {
+        test.info().annotations.push({ type: 'console_errors', description: errors.join(' | ') });
+      }
+      if (failedRequests.length > 0) {
+        test.info().annotations.push({ type: 'network_failures', description: failedRequests.join(' | ') });
+      }
+      expect(errors.filter(e => !e.includes('[HMR]') && !e.includes('favicon'))).toHaveLength(0);
+      expect(failedRequests).toHaveLength(0);
+    });
+  }
+});
+```
+### Flow Test Script Template
+For each flow test step from done-criteria, generate a test block:
+```typescript
+test('flow: [criterion description]', async ({ page }) => {
+  // Navigate
+  await page.goto(`http://localhost:${PORT}{start_route}`);
+  // Find and interact
+  await page.click('[text or selector]');
+  await page.fill('[selector]', '[value]');
+  await page.click('[submit selector]');
+  // Verify
+  await expect(page.locator('[verification selector]')).toBeVisible();
+  // Screenshot
+  await page.screenshot({ path: '.claude/screenshots/flow-[name].png' });
+});
+```
+### Visual Test Script Template
+```typescript
+test.describe('Visual - Mobile', () => {
+  test.use({ viewport: { width: 375, height: 812 } });
+  for (const route of ROUTES) {
+    test(`visual-mobile: ${route}`, async ({ page }) => {
+      await page.goto(`http://localhost:${PORT}${route}`, { waitUntil: 'networkidle' });
+      await page.screenshot({ path: `.claude/screenshots/visual-mobile${route === '/' ? '-root' : route.replace(/\//g, '-')}.png`, fullPage: true });
+    });
+  }
+});
+test.describe('Visual - Desktop', () => {
+  test.use({ viewport: { width: 1280, height: 800 } });
+  for (const route of ROUTES) {
+    test(`visual-desktop: ${route}`, async ({ page }) => {
+      await page.goto(`http://localhost:${PORT}${route}`, { waitUntil: 'networkidle' });
+      await page.screenshot({ path: `.claude/screenshots/visual-desktop${route === '/' ? '-root' : route.replace(/\//g, '-')}.png`, fullPage: true });
+    });
+  }
+});
+```
+## Execution
+```bash
+mkdir -p .claude/screenshots
+npx playwright test .claude/browser-test.spec.ts \
+  --reporter=json \
+  --output=.claude/playwright-results \
+  | tee .claude/playwright-output.json
+```
+## Parsing Results
+Read `.claude/playwright-output.json`. The JSON structure contains:
+- `suites[].specs[].tests[].results[].status` — `"passed"`, `"failed"`, `"timedOut"`
+- `suites[].specs[].tests[].results[].errors` — error messages with stack traces
+- `suites[].specs[].tests[].annotations` — custom annotations (console_errors, network_failures)
+Map these to BROWSER-RESULTS.md findings:
+- `failed` → route fails smoke, include error message
+- Annotations with `console_errors` → list in Runtime Errors section
+- Annotations with `network_failures` → list in Failed Network Requests section
+## Cleanup
+After parsing results:
+```bash
+rm -f .claude/browser-test.spec.ts
+rm -rf .claude/playwright-results
+rm -f .claude/playwright-output.json
+```
+Keep `.claude/screenshots/` — those are evidence referenced by the report.
+## Limitations vs Tier 1
+- No GIF recording (can't capture multi-step flow animations)
+- No live DOM exploration (tests are scripted, not interactive)
+- Screenshots are full-page captures, not viewport-specific (use `fullPage: true`)
+- Console filtering is code-based (less flexible than chrome MCP pattern matching)
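
Reviewer note (not part of the diff): a minimal sketch of the "Parsing Results" step, assuming the JSON shape described above plus one detail the file does not mention: `test.describe` blocks appear as nested `suites` inside each top-level suite, so a flat `suites[].specs[]` walk may miss them. The interfaces below cover only the fields used here, and `collectFindings` is a hypothetical name.

```typescript
// Hypothetical walker over a Playwright JSON report; only the fields used are typed.
interface ReportAnnotation { type: string; description?: string }
interface ReportTest { annotations?: ReportAnnotation[]; results: { status: string }[] }
interface ReportSpec { title: string; tests: ReportTest[] }
interface ReportSuite { specs?: ReportSpec[]; suites?: ReportSuite[] }
interface Report { suites: ReportSuite[] }

interface Finding {
  title: string;
  status: string;
  consoleErrors: string[];
  networkFailures: string[];
}

// The templates above join annotation values with " | ", so split on the same token.
function annotationValues(test: ReportTest, type: string): string[] {
  const values: string[] = [];
  for (const a of test.annotations || []) {
    if (a.type === type && a.description) {
      for (const part of a.description.split(" | ")) values.push(part);
    }
  }
  return values;
}

function collectFindings(report: Report): Finding[] {
  const findings: Finding[] = [];
  const visit = (suite: ReportSuite): void => {
    for (const spec of suite.specs || []) {
      for (const t of spec.tests) {
        // Use the last result so a retried test reports its final status.
        const last = t.results[t.results.length - 1];
        findings.push({
          title: spec.title,
          status: last ? last.status : "no-result",
          consoleErrors: annotationValues(t, "console_errors"),
          networkFailures: annotationValues(t, "network_failures"),
        });
      }
    }
    // Recurse into describe blocks.
    for (const child of suite.suites || []) visit(child);
  };
  for (const suite of report.suites) visit(suite);
  return findings;
}
```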

package/config/skills/devlyn:browser-validate/references/tier3-curl.md ADDED

@@ -0,0 +1,57 @@
+# Tier 3: HTTP Smoke (curl)
+Bare-minimum fallback. No browser, no JavaScript execution, no interaction testing. This tier confirms the dev server responds and pages return valid HTML. It catches "app doesn't start" and "page returns 500" but nothing subtler.
+Read this file only when Tier 3 was selected during DETECT phase.
+---
+## What You Can Test
+- Server responds on the expected port
+- Pages return HTTP 200
+- HTML contains a `<body>` with content (not an empty shell)
+- No server-side error indicators in the HTML
+## What You Cannot Test
+- Client-side rendering (SPA content won't appear in curl output)
+- JavaScript errors or console output
+- Network requests made by the client
+- Interactive elements (forms, buttons, navigation)
+- Visual layout or responsive behavior
+- Screenshots
+## Smoke Test
+For each affected route:
+```bash
+# Check HTTP status
+STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:{PORT}{route} --max-time 10)
+# Get HTML content
+HTML=$(curl -s http://localhost:{PORT}{route} --max-time 10)
+```
+### Pass Criteria
+A route passes if:
+1. curl succeeds (doesn't error out with connection refused or timeout)
+2. `STATUS` is `200` (or `301`, `302`, `304`) — not `000`, not `5xx`
+3. HTML contains `<body` tag
+4. HTML body has more than 100 characters of text content (not just empty divs)
+5. HTML does not contain server error indicators: `Internal Server Error`, `500`, `ECONNREFUSED`, `Cannot GET`, `404`
+### Parsing HTML Content
+Since curl returns raw HTML (no JS execution), for SPAs the body may only contain a root `<div id="root"></div>` or `<div id="__next"></div>`. This is normal and counts as a PASS for Tier 3 — note it as "SPA shell detected, client-side rendering not verifiable at this tier."
+For SSR frameworks (Next.js with server components, Nuxt, Astro), the HTML should contain actual rendered content.
+## Report Adjustments
+When writing BROWSER-RESULTS.md from Tier 3:
+- Set confidence level to LOW
+- Mark the Runtime Errors, Failed Network Requests, Feature Verification, and Visual Check sections as "N/A — Tier 3 (HTTP only)"
+- Note the limitation: "Tier 3 testing provides HTTP-level validation only. Client-side behavior, JavaScript errors, and visual rendering were not tested. For comprehensive browser validation, install the claude-in-chrome extension (Tier 1) or Playwright (Tier 2)."
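
Reviewer note (not part of the diff): the Tier 3 pass criteria could be combined into one check roughly as below. Two points are my own reading rather than something the file states: an SPA shell is accepted before the 100-character rule is applied (the file says a shell counts as a PASS), and the numeric indicators `500` and `404` are covered by the status check instead of a substring match, to avoid failing pages that merely contain those digits. `evaluateRoute` is a hypothetical name.

```typescript
// Hypothetical Tier 3 route check; takes the STATUS and HTML values captured by curl.
interface RouteCheck {
  pass: boolean;
  note: string;
}

const OK_STATUSES = [200, 301, 302, 304];
const ERROR_INDICATORS = ["Internal Server Error", "ECONNREFUSED", "Cannot GET"];

// Rough text extraction: drop scripts, styles and tags, then collapse whitespace.
function visibleText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function evaluateRoute(status: number, html: string): RouteCheck {
  if (OK_STATUSES.indexOf(status) === -1) {
    return { pass: false, note: `Unexpected HTTP status ${status}` };
  }
  if (!/<body/i.test(html)) {
    return { pass: false, note: "No <body tag in response" };
  }
  const text = visibleText(html);
  for (const indicator of ERROR_INDICATORS) {
    if (text.indexOf(indicator) !== -1) {
      return { pass: false, note: `Server error indicator found: ${indicator}` };
    }
  }
  // Assumption: an empty SPA mount point passes, with the limitation noted.
  if (/<div id="(root|__next)">\s*<\/div>/.test(html)) {
    return { pass: true, note: "SPA shell detected, client-side rendering not verifiable at this tier" };
  }
  if (text.length <= 100) {
    return { pass: false, note: "Body has 100 or fewer characters of text content" };
  }
  return { pass: true, note: "OK" };
}
```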

package/config/skills/devlyn:evaluate/SKILL.md CHANGED

@@ -297,14 +297,9 @@ LOW (note):
 4. For each catch block: is the error surfaced to the user or silently swallowed?
 5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
 6. Compare against existing components for pattern consistency
-7. **Live app testing** (when browser tools are available): If `mcp__claude-in-chrome__*` tools are available, test the running application directly:
-   - Navigate to the affected pages
-   - Click through the user flow end-to-end
-   - Test interactive elements (forms, buttons, modals, navigation)
-   - Verify loading, error, and empty states render correctly
-   - Screenshot any visual issues as evidence
-   - Test responsive behavior at mobile/tablet/desktop widths
-   If browser tools are NOT available, skip this step and note "Live testing skipped — no browser tools" in your deliverable.
+7. **Browser evidence** (when available): Read `.claude/BROWSER-RESULTS.md` if it exists — it contains pre-collected smoke test results, flow test results, console errors, network failures, and screenshots from the `devlyn:browser-validate` skill. Use this as additional evidence in your evaluation. Do not re-run smoke tests that are already covered.
+   If the dev server is still running and you need deeper investigation on a specific interaction, use browser tools directly (check if `mcp__claude-in-chrome__*` tools are available, or fall back to Playwright). Focus on verifying specific findings, not duplicating the full smoke/flow suite.
+   If neither `.claude/BROWSER-RESULTS.md` exists nor browser tools are available, note "Live testing skipped — no browser validation available" in your deliverable.
 **Your deliverable**: Send a message to the team lead with:
 1. Component quality assessment for each new/changed component
@@ -312,7 +307,7 @@ LOW (note):
 3. Silent failure points that violate error handling policy
 4. React anti-patterns found
 5. Pattern consistency with existing components
-6. Live testing results (if browser tools were available): screenshots, interaction bugs, visual regressions
+6. Browser validation results (from BROWSER-RESULTS.md or live testing): screenshots, interaction bugs, runtime errors, visual regressions
 Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Coordinate with api-contract-evaluator about client-server type alignment via SendMessage.
 </frontend_evaluator_prompt>

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "devlyn-cli",
-  "version": "1.2.1",
+  "version": "1.3.1",
   "description": "Claude Code configuration toolkit for teams",
   "bin": {
     "devlyn": "bin/devlyn.js"