devlyn-cli 1.5.2 → 1.5.4
package/CLAUDE.md
CHANGED
@@ -61,7 +61,7 @@ This runs the full pipeline automatically: **Build → Browser Validate → Eval
 For web projects, the Browser Validate phase starts the dev server and tests the implemented feature in a real browser — clicking buttons, filling forms, verifying results. If the feature doesn't work, findings feed back into the fix loop.
 
 Optional flags:
-- `--max-rounds
+- `--max-rounds 6` — increase max evaluate-fix iterations (default: 4)
 - `--skip-browser` — skip browser validation phase (auto-skipped for non-web changes)
 - `--skip-review` — skip team-review phase
 - `--skip-clean` — skip clean phase
@@ -25,7 +25,7 @@ This pipeline runs hands-free. The user launches it to walk away and come back t
 
 1. Extract the task/issue description from `<pipeline_config>`.
 2. Determine optional flags from the input (defaults in parentheses):
-- `--max-rounds N` (
+- `--max-rounds N` (4) — max evaluate-fix loops before stopping with a report
 - `--skip-review` (false) — skip team-review phase
 - `--security-review` (auto) — run dedicated security audit. Auto-detects: runs when changes touch auth, secrets, user data, API endpoints, env/config, or crypto. Force with `--security-review always` or skip with `--security-review skip`
 - `--skip-clean` (false) — skip clean phase
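The flag defaults listed in this hunk can be sketched as an argument parser. This is only an illustration: in the package the flags are interpreted by an agent reading the skill text, and nothing in the diff says a parser like this exists. The names `build_parser` and `parse_flags` are hypothetical, and requiring an explicit value after `--security-review` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="pipeline-flags", add_help=False)
    # Default of 4 rounds comes from the hunk above.
    parser.add_argument("--max-rounds", type=int, default=4)
    parser.add_argument("--skip-review", action="store_true")
    # Assumption: the mode is always given as an explicit value.
    parser.add_argument(
        "--security-review", choices=["auto", "always", "skip"], default="auto"
    )
    parser.add_argument("--skip-clean", action="store_true")
    return parser


def parse_flags(argv):
    """Split argv into known flags and the free-text task description."""
    flags, task_words = build_parser().parse_known_args(argv)
    return flags, " ".join(task_words)
```

Anything that is not a recognized flag is treated as part of the task description, which loosely matches step 1 of the list (extracting the task from the input).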
@@ -101,10 +101,11 @@ You are a browser validation agent. Read the skill instructions at `.claude/skil
 **After the agent completes**:
 1. Read `.devlyn/BROWSER-RESULTS.md`
 2. Extract the verdict
-3.
+3. **Validate the verdict is real**: If the verdict says "code-level pass" or indicates no actual browser interaction occurred (no screenshots taken, no pages navigated, no DOM inspected), the validation did NOT happen. Treat this as if no browser validation ran — re-run PHASE 1.5 with `--tier 2` to force Playwright, or `--tier 3` for HTTP smoke. A "PARTIALLY VERIFIED" based on reading source code is not browser validation.
+4. Branch on verdict:
 - `PASS` → continue to PHASE 2
 - `PASS WITH ISSUES` → continue to PHASE 2 (evaluator reads browser results as extra context)
-- `PARTIALLY VERIFIED` → continue to PHASE 2, but flag to the evaluator that browser coverage was incomplete — unverified features should be weighted more heavily
+- `PARTIALLY VERIFIED` → continue to PHASE 2, but flag to the evaluator that browser coverage was incomplete — unverified features should be weighted more heavily. This verdict is only valid when features were actually tested in a browser and some couldn't be verified due to environment limitations (missing API keys, external services). It is NOT valid as a substitute for "browser tools didn't work."
 - `NEEDS WORK` → features don't work in the browser. Go to PHASE 2.5 fix loop. Fix agent reads `.devlyn/BROWSER-RESULTS.md` for which criterion failed, at what step, with what error. After fixing, re-run PHASE 1.5 to verify the fix before proceeding to Evaluate.
 - `BLOCKED` → app doesn't render. Go to PHASE 2.5 fix loop. After fixing, re-run PHASE 1.5.
 
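The "validate the verdict is real" step and the branch that follows it could be modeled roughly as below. The `BrowserResult` fields are hypothetical: the diff does not show the layout of `.devlyn/BROWSER-RESULTS.md`, so the interaction counters are an assumed stand-in for "screenshots taken, pages navigated, DOM inspected".

```python
from dataclasses import dataclass


@dataclass
class BrowserResult:
    """Hypothetical summary of a browser validation run."""
    verdict: str
    screenshots: int = 0
    pages_navigated: int = 0
    dom_inspections: int = 0


def verdict_is_real(result: BrowserResult) -> bool:
    """A verdict counts only if some real browser interaction happened."""
    if "code-level" in result.verdict.lower():
        return False
    interactions = (
        result.screenshots + result.pages_navigated + result.dom_inspections
    )
    return interactions > 0


def next_step(result: BrowserResult) -> str:
    """Map a browser verdict to the next pipeline step."""
    if not verdict_is_real(result):
        # Treated as if no browser validation ran at all.
        return "RERUN PHASE 1.5"
    verdict = result.verdict.upper()
    if verdict in ("PASS", "PASS WITH ISSUES", "PARTIALLY VERIFIED"):
        return "PHASE 2"
    if verdict in ("NEEDS WORK", "BLOCKED"):
        return "PHASE 2.5"
    raise ValueError(f"unknown verdict: {result.verdict!r}")
```

Checking for real interaction before branching means a `PARTIALLY VERIFIED` verdict with no recorded browser activity never reaches PHASE 2.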
@@ -146,7 +147,9 @@ You are an independent evaluator. Your job is to grade work produced by another
 - pattern description
 ```
 
-Verdict rules: BLOCKED = any CRITICAL issues. NEEDS WORK = HIGH issues that should be fixed. PASS WITH ISSUES = only
+Verdict rules: BLOCKED = any CRITICAL issues. NEEDS WORK = HIGH or MEDIUM issues that should be fixed. PASS WITH ISSUES = only LOW cosmetic notes. PASS = clean.
+
+Important: Do NOT label findings as "pre-existing" or "out of scope" to avoid fixing them. If a problem exists in the current code and relates to the done criteria, it's a finding regardless of when it was introduced. The goal is working software, not blame attribution.
 
 Calibration examples to guide your judgment:
 - A catch block that logs but doesn't surface error to user = HIGH (not MEDIUM). Logging is not error handling.
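The updated verdict rules in this hunk amount to taking the worst severity present. A minimal sketch, assuming findings are reduced to a list of severity strings (the function name `verdict_for` is hypothetical):

```python
def verdict_for(severities):
    """Derive the evaluator verdict from the severities of all findings."""
    found = {s.upper() for s in severities}
    if "CRITICAL" in found:
        return "BLOCKED"
    if found & {"HIGH", "MEDIUM"}:
        # MEDIUM now forces NEEDS WORK, per the added line in this hunk.
        return "NEEDS WORK"
    if "LOW" in found:
        return "PASS WITH ISSUES"
    return "PASS"
```

For example, `verdict_for(["LOW", "MEDIUM"])` yields `"NEEDS WORK"`, whereas the removed line suggests MEDIUM findings were previously not enough to force it.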
@@ -161,10 +164,10 @@ Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — the o
 3. **If `--with-codex` includes `evaluate` or `both`**: Read `references/codex-integration.md` and follow the "PHASE 2-CODEX: CROSS-MODEL EVALUATE" section. This runs Codex as a second evaluator and merges findings into `EVAL-FINDINGS.md`.
 4. Branch on verdict (from the merged findings if Codex was used):
 - `PASS` → skip to PHASE 3
-- `PASS WITH ISSUES` →
+- `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
 - `NEEDS WORK` → go to PHASE 2.5 (fix loop)
 - `BLOCKED` → go to PHASE 2.5 (fix loop)
-5. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as
+5. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as NEEDS WORK and log a warning — absence of evidence is not evidence of absence
 
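Steps 4 and 5 above reduce to a small decision function: only `PASS` skips the fix loop, and a missing findings file is downgraded to `NEEDS WORK`. The sketch below assumes the caller has already checked whether `.devlyn/EVAL-FINDINGS.md` exists; `phase_after_evaluate` and its parameters are hypothetical names.

```python
import logging

logger = logging.getLogger("pipeline")


def phase_after_evaluate(verdict: str, findings_exist: bool) -> str:
    """Pick the phase that follows evaluation."""
    if not findings_exist:
        # No findings file means no evidence, so the verdict cannot be trusted.
        logger.warning("EVAL-FINDINGS.md missing; treating verdict as NEEDS WORK")
        verdict = "NEEDS WORK"
    if verdict == "PASS":
        return "PHASE 3"
    # PASS WITH ISSUES, NEEDS WORK and BLOCKED all enter the fix loop.
    return "PHASE 2.5"
```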
 ## PHASE 2.5: FIX LOOP (conditional)
 
@@ -174,7 +177,7 @@ Spawn a subagent using the Agent tool with `mode: "bypassPermissions"` to fix th
 
 Agent prompt — pass this to the Agent tool:
 
-Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every CRITICAL and
+Read `.devlyn/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the evaluator returns PASS — there is no "shippable with issues" shortcut.
 
 The original done criteria are in `.devlyn/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.
 
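The "loop until PASS" behavior described in the new prompt text, bounded by `--max-rounds`, might look like the following. How rounds are counted and what the final report contains are not shown in the diff, so the counting here (one round per fix attempt) is an assumption, and `evaluate` and `fix` are hypothetical callables standing in for the evaluator and fix agents.

```python
def run_until_pass(evaluate, fix, max_rounds=4):
    """Alternate fix and evaluate until PASS or the round budget runs out.

    Returns (final_verdict, fix_rounds_used).
    """
    verdict = evaluate()
    rounds = 0
    while verdict != "PASS" and rounds < max_rounds:
        fix()                 # fix every finding, regardless of severity
        rounds += 1
        verdict = evaluate()  # re-grade after each fix attempt
    return verdict, rounds
```

A caller would inspect the returned verdict: anything other than `"PASS"` after `max_rounds` attempts is the "stop with a report" case.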
@@ -23,10 +23,12 @@ $ARGUMENTS
 
 4. **Affected routes**: Map changed files to routes (e.g., `app/dashboard/page.tsx` → `/dashboard`).
 
-5. **Tier selection** — pick the best available browser tool:
-- Check if `mcp__claude-in-chrome__*` tools exist
---
---
+5. **Tier selection** — pick the best available browser tool. **You must verify each tier actually works before committing to it** — tools can be registered but not connected:
+- **Tier 1 probe** (Chrome DevTools): Check if `mcp__claude-in-chrome__*` tools exist. If they do, load `mcp__claude-in-chrome__tabs_context_mcp` via ToolSearch and call it. If the call **succeeds** (returns tab data without error), use Tier 1. Read `references/tier1-chrome.md`. If the call **fails** (timeout, connection error, extension not running), Tier 1 is unavailable — fall through to Tier 2.
+- **Tier 2 probe** (Playwright): Check if `mcp__playwright__*` tools exist (try ToolSearch for `mcp__playwright__browser_navigate`). If they exist and respond, use Tier 2 Mode A. Else run `npx playwright --version 2>/dev/null` — if it succeeds, use Tier 2 Mode B. Read `references/tier2-playwright.md`.
+- **Tier 3** (HTTP smoke): Fallback when no browser tool is functional. Read `references/tier3-curl.md`.
+
+**Critical rule**: Never treat a tier as available just because its tools appear in the tool list. Deferred/registered tools may not have a running backend. Always probe before committing.
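The probe-before-commit rule can be illustrated as an ordered list of probes where a tier is chosen only if its probe actually succeeds. This is a sketch under assumptions: the MCP tool calls are made by the agent and cannot be reproduced here, so they appear as hypothetical callables; only the `npx playwright --version` check comes from the text above, and the 60-second timeout is an arbitrary choice.

```python
import subprocess
from typing import Callable, Sequence, Tuple

Probe = Callable[[], bool]


def playwright_cli_available(timeout_s: float = 60.0) -> bool:
    """Tier 2 Mode B probe: does `npx playwright --version` exit cleanly?"""
    try:
        completed = subprocess.run(
            ["npx", "playwright", "--version"],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # npx missing or hung: the tier is not usable.
        return False
    return completed.returncode == 0


def select_tier(probes: Sequence[Tuple[str, Probe]], fallback: str = "Tier 3") -> str:
    """Return the first tier whose probe succeeds, else the fallback."""
    for name, probe in probes:
        try:
            if probe():
                return name
        except Exception:
            # A registered tool with no running backend fails here.
            continue
    return fallback
```

A caller might pass `[("Tier 1", chrome_probe), ("Tier 2 Mode A", playwright_mcp_probe), ("Tier 2 Mode B", playwright_cli_available)]`, where the first two probes are hypothetical wrappers around the corresponding tool calls.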
 
 6. **Skip gate**: If no web-relevant files changed (no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.astro`, `*.css`, `*.scss`, `*.html`, `page.*`, `layout.*`, `route.*`, `+page.*`, `+layout.*`), skip. Report: "Browser validation skipped — no web changes detected."
 
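The skip gate is a glob match over the changed file names. A minimal sketch using the patterns listed in step 6; matching on the file name only (not the full path) and case-sensitively are assumptions, and `has_web_changes` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the skip gate in step 6.
WEB_PATTERNS = (
    "*.tsx", "*.jsx", "*.vue", "*.svelte", "*.astro",
    "*.css", "*.scss", "*.html",
    "page.*", "layout.*", "route.*", "+page.*", "+layout.*",
)


def has_web_changes(changed_files) -> bool:
    """True if any changed file name matches a web-relevant pattern."""
    return any(
        fnmatchcase(PurePosixPath(path).name, pattern)
        for path in changed_files
        for pattern in WEB_PATTERNS
    )
```

When this returns `False`, the phase would be skipped with the "no web changes detected" report.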
@@ -79,6 +81,8 @@ Quick check that the app is alive. This is not the main test — it's a gate to
 
 Navigate to `/` and each affected route. For each page, judge: is this the actual application, or an error page? A connection error, framework error overlay, or blank shell is not the app. If broken, try to fix (read console errors, fix source, let hot-reload pick it up). Up to 2 fix attempts per route.
 
+**Tier downgrade on failure**: If you're on Tier 1 or Tier 2 Mode A and the browser tool consistently fails during smoke (connection errors, timeouts, extension disconnected), **do not skip browser testing**. Instead, downgrade to the next tier (Tier 1 → Tier 2 → Tier 3), re-read the corresponding reference file, and retry the smoke phase with the new tier. Announce: `"Tier [N] browser tools not responding — downgrading to Tier [N+1]."` The goal is to always run the best available browser test, not to give up.
+
 If the app isn't rendering, the verdict is BLOCKED — feature testing can't happen.
 
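The tier downgrade rule above can be sketched as a retry loop that moves down the tier order instead of skipping. Assumptions: `run_smoke` is a hypothetical callable that returns `True` when the browser tool responded (whether the app itself rendered is a separate question, decided by the BLOCKED rule), and "consistently fails" is interpreted as a fixed number of attempts per tier.

```python
TIER_ORDER = ["Tier 1", "Tier 2", "Tier 3"]


def smoke_with_downgrade(start_tier, run_smoke, attempts_per_tier=2, on_downgrade=None):
    """Retry smoke on lower tiers when the current tool does not respond.

    Returns the tier that responded, or None if even the last tier failed.
    """
    index = TIER_ORDER.index(start_tier)
    while True:
        tier = TIER_ORDER[index]
        for _ in range(attempts_per_tier):
            try:
                if run_smoke(tier):
                    return tier
            except (ConnectionError, TimeoutError):
                pass  # counts as a failed attempt on this tier
        if index == len(TIER_ORDER) - 1:
            return None
        if on_downgrade is not None:
            # Hook for the announcement the skill text asks for.
            on_downgrade(tier, TIER_ORDER[index + 1])
        index += 1
```

The `on_downgrade` hook is where the announcement quoted in the rule would be emitted.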
 ## PHASE 4: FEATURE TEST (the main event)