devlyn-cli 1.14.0 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +104 -0
- package/CLAUDE.md +112 -119
- package/README.md +43 -125
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
- package/benchmark/auto-resolve/README.md +114 -0
- package/benchmark/auto-resolve/RUBRIC.md +162 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
- package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
- package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
- package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
- package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
- package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
- package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
- package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
- package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
- package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
- package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
- package/benchmark/auto-resolve/scripts/judge.sh +359 -0
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
- package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
- package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
- package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
- package/bin/devlyn.js +129 -17
- package/config/skills/_shared/adapters/README.md +64 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
- package/config/skills/_shared/adapters/opus-4-7.md +29 -0
- package/config/skills/_shared/archive_run.py +130 -0
- package/config/skills/_shared/codex-config.md +54 -0
- package/config/skills/_shared/codex-monitored.sh +141 -0
- package/config/skills/_shared/engine-preflight.md +35 -0
- package/config/skills/_shared/expected.schema.json +93 -0
- package/config/skills/_shared/pair-plan-schema.md +298 -0
- package/config/skills/_shared/runtime-principles.md +110 -0
- package/config/skills/_shared/spec-verify-check.py +519 -0
- package/config/skills/devlyn:ideate/SKILL.md +99 -481
- package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
- package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
- package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
- package/config/skills/devlyn:resolve/SKILL.md +172 -184
- package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
- package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
- package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
- package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
- package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
- package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
- package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
- package/optional-skills/devlyn:reap/SKILL.md +105 -0
- package/optional-skills/devlyn:reap/scripts/reap.sh +129 -0
- package/optional-skills/devlyn:reap/scripts/scan.sh +116 -0
- package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
- package/package.json +16 -2
- package/scripts/lint-skills.sh +431 -0
- package/config/skills/devlyn:auto-resolve/SKILL.md +0 -602
- package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -116
- package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -204
- package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
- package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
- package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
- package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
- package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
- package/config/skills/devlyn:clean/SKILL.md +0 -285
- package/config/skills/devlyn:design-ui/SKILL.md +0 -351
- package/config/skills/devlyn:discover-product/SKILL.md +0 -124
- package/config/skills/devlyn:evaluate/SKILL.md +0 -564
- package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
- package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
- package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
- package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
- package/config/skills/devlyn:preflight/SKILL.md +0 -370
- package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
- package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -90
- package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
- package/config/skills/devlyn:product-spec/SKILL.md +0 -603
- package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
- package/config/skills/devlyn:review/SKILL.md +0 -161
- package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
- package/config/skills/devlyn:team-review/SKILL.md +0 -493
- package/config/skills/devlyn:update-docs/SKILL.md +0 -463
- package/config/skills/workflow-routing/SKILL.md +0 -73
@@ -1,116 +0,0 @@
-# Build Gate — Project Type Detection & Commands
-
-Reference for PHASE 1.4 (Build Gate). The build gate agent reads this file to determine which commands to run.
-
----
-
-## Project Type Detection Matrix
-
-Inspect the repository root and subdirectories (up to 2 levels). A repo can match **multiple** signals — run ALL matching gates. Do not pick "the main one"; a monorepo with a Next.js dashboard + Rust service needs both.
-
-| Signal file(s) | Project type | Gate commands (run in order) |
-|---|---|---|
-| `package.json` with `next` dep | Next.js | `npx tsc --noEmit` → `npx next build` |
-| `package.json` with `nuxt` dep | Nuxt | `npx nuxi typecheck` → `npx nuxi build` |
-| `package.json` with `vite` + `tsconfig.json` | Vite+TS | `npx tsc --noEmit` → `npm run build` (if script exists) |
-| `package.json` with `expo` dep | Expo (React Native) | `npx tsc --noEmit` → `npx expo-doctor` |
-| `package.json` with `react-native` (no expo) | React Native | `npx tsc --noEmit` |
-| `package.json` with `svelte` + `@sveltejs/kit` | SvelteKit | `npm run check` → `npm run build` |
-| `package.json` only, has `build` script | Generic Node | `npm run build` |
-| `package.json` only, has `tsconfig.json` but no `build` | TS library | `npx tsc --noEmit` |
-| `pnpm-workspace.yaml` / `turbo.json` / `lerna.json` | Monorepo | `pnpm -r build` or `turbo run build typecheck lint` — **workspace-wide**, NOT just the changed package |
-| `Cargo.toml` | Rust | `cargo check --all-targets` → `cargo clippy -- -D warnings` |
-| `go.mod` | Go | `go build ./...` → `go vet ./...` |
-| `foundry.toml` | Foundry (Solidity) | `forge build` |
-| `hardhat.config.{js,ts,cjs}` | Hardhat (Solidity) | `npx hardhat compile` |
-| `Anchor.toml` | Anchor (Solana) | `anchor build` |
-| `Move.toml` | Move (Sui/Aptos) | `sui move build` or `aptos move compile` |
-| `pyproject.toml` / `setup.py` + mypy config | Python+mypy | `mypy .` |
-| `pyproject.toml` with `ruff` | Python+Ruff | `ruff check .` |
-| `Package.swift` | Swift package | `swift build` |
-| `*.xcodeproj` / `*.xcworkspace` | iOS/macOS (Xcode) | Skip by default — log "Xcode project detected, manual build gate recommended". Too project-specific without knowing the scheme. |
-| `build.gradle*` / `settings.gradle*` | Gradle/Android | `./gradlew assembleDebug` (debug, not release — keep it fast) |
-| `CMakeLists.txt` | C/C++ (CMake) | `cmake -B build && cmake --build build` |
-| `Makefile` (with no other signals) | Generic Make | `make` (only if no other type matched — Makefiles are too generic) |
-| `Unity/ProjectSettings/` or `ProjectSettings/ProjectVersion.txt` | Unity | Skip by default — log "Unity project detected, manual build gate recommended" |
-| `project.godot` | Godot | Skip by default — log "Godot project detected, manual build gate recommended" |
-| `Dockerfile*` | Docker | `docker build -f <dockerfile> -t _pipeline_gate_test .` — included by default in `auto` mode. Skip with `--build-gate no-docker`. |
-
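A fragment of the detection matrix above can be sketched in code. This is an illustrative helper, not part of the package; it covers only a few rows of the table (Next.js, generic Node, TS library, Rust, Go) and, per the rule above, collects every matching gate rather than picking one:

```python
import json
from pathlib import Path


def detect_gates(root: str) -> list:
    """Return gate command sequences for EVERY matching signal, not just one."""
    base = Path(root)
    gates = []
    pkg = base / "package.json"
    if pkg.exists():
        data = json.loads(pkg.read_text())
        deps = {}
        deps.update(data.get("dependencies", {}))
        deps.update(data.get("devDependencies", {}))
        if "next" in deps:
            gates.append(["npx tsc --noEmit", "npx next build"])
        elif data.get("scripts", {}).get("build"):
            gates.append(["npm run build"])
        elif (base / "tsconfig.json").exists():
            gates.append(["npx tsc --noEmit"])
    if (base / "Cargo.toml").exists():
        gates.append(["cargo check --all-targets", "cargo clippy -- -D warnings"])
    if (base / "go.mod").exists():
        gates.append(["go build ./...", "go vet ./..."])
    return gates  # a repo can match several signals; run all of them
```

A monorepo with both `Cargo.toml` and `go.mod` would return two gate sequences, matching the "run ALL matching gates" rule.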
-## Package Manager Detection
-
-Respect the project's package manager. Check in order:
-1. `packageManager` field in root `package.json` → use that
-2. `pnpm-lock.yaml` exists → `pnpm`
-3. `yarn.lock` exists → `yarn`
-4. `bun.lockb` / `bun.lock` exists → `bun`
-5. Default → `npm`
-
-Replace `npm run build` / `npx` accordingly: `pnpm build` / `pnpm exec`, `yarn build` / `yarn`, `bun run build` / `bunx`.
-
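The precedence order above can be sketched as a small resolver. This is illustrative only, and it assumes the `packageManager` field holds a `name@version` string as Corepack writes it:

```python
import json
from pathlib import Path


def detect_package_manager(root: str) -> str:
    """Resolve the package manager using the precedence order above."""
    base = Path(root)
    pkg = base / "package.json"
    if pkg.exists():
        pm = json.loads(pkg.read_text()).get("packageManager")
        if pm:  # e.g. "pnpm@9.1.0" resolves to "pnpm"
            return pm.split("@")[0]
    if (base / "pnpm-lock.yaml").exists():
        return "pnpm"
    if (base / "yarn.lock").exists():
        return "yarn"
    if (base / "bun.lockb").exists() or (base / "bun.lock").exists():
        return "bun"
    return "npm"  # default when no signal is found
```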
-## Monorepo Handling
-
-Monorepo is the most critical case — cross-package type drift is the #1 source of "tests pass locally, build fails in CI."
-
-1. Detect workspace root markers: `pnpm-workspace.yaml`, `turbo.json`, `lerna.json`, `workspaces` in root `package.json`
-2. Run gates at the **workspace root** level, not per-changed-package:
-   - Turbo: `turbo run build typecheck lint` (respects dependency graph)
-   - pnpm: `pnpm -r build` (runs in topological order)
-   - yarn workspaces: `yarn workspaces foreach -A run build`
-   - npm workspaces: `npm run build --workspaces`
-3. This ensures Package A's type change that breaks Package B's consumer is caught, even if only Package A was directly modified.
-
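The marker detection in step 1 and the command choice in step 2 can be sketched together. A hypothetical helper, covering a subset of the markers (the real skill also handles `lerna.json` and yarn workspaces):

```python
import json
from pathlib import Path
from typing import Optional


def workspace_build_command(root: str) -> Optional[str]:
    """Pick a workspace-wide gate command from the root markers above."""
    base = Path(root)
    if (base / "turbo.json").exists():
        return "turbo run build typecheck lint"
    if (base / "pnpm-workspace.yaml").exists():
        return "pnpm -r build"
    pkg = base / "package.json"
    if pkg.exists() and json.loads(pkg.read_text()).get("workspaces"):
        return "npm run build --workspaces"
    return None  # not a workspace root; fall back to single-package gates
```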
-## Strict Mode (`--build-gate strict`)
-
-When strict mode is set, treat warnings as failures:
-- TypeScript: add `--strict` if not already in tsconfig (or verify it's set)
-- Clippy: `-D warnings` (already default in the matrix)
-- ESLint: `--max-warnings 0`
-- Go vet: already treats warnings as errors
-- Foundry: `--deny-warnings`
-
-In default (auto) mode, only hard errors (non-zero exit code from the tool's perspective) block.
-
-## Docker Build (default in `auto` mode)
-
-When `Dockerfile*` files are detected AND `--build-gate no-docker` is NOT set:
-1. Run all non-Docker gates first (they're faster and catch most errors before the slow Docker step)
-2. Then run `docker build -f <dockerfile> -t _pipeline_gate_test .` for each Dockerfile found in the repo root and subdirectories (up to 2 levels)
-3. If Docker daemon is not available, log the skip with a warning but do NOT fail — developers without Docker should not be blocked. The warning should note: "Docker builds were skipped because the Docker daemon is unavailable. Use `--build-gate no-docker` to suppress this warning, or ensure Docker is running to catch Dockerfile-specific issues."
-4. This catches Dockerfile-specific issues that no other gate can: COPY paths referencing files excluded by .dockerignore, multi-stage build failures, production-only dependency resolution, and environment differences between dev and container builds
-
-Use `--build-gate no-docker` to skip Docker builds for faster iteration during development — the language-level gates (tsc, cargo check, etc.) still run and catch the majority of issues. Docker builds are most valuable as a final gate before shipping.
-
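The daemon check in step 3 can be probed before attempting any image build. A minimal sketch, assuming `docker info` is a cheap way to confirm the daemon answers (it exits non-zero when the daemon is unreachable):

```python
import shutil
import subprocess


def docker_available(timeout: int = 30) -> bool:
    """True only if a docker CLI exists and the daemon answers `docker info`."""
    if shutil.which("docker") is None:
        return False  # no CLI at all: skip with a warning, do not fail the gate
    try:
        result = subprocess.run(["docker", "info"],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```

When this returns `False`, the gate logs the skip warning quoted above instead of failing.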
-## Output Format
-
-Write results to `.devlyn/BUILD-GATE.md`:
-
-```markdown
-# Build Gate Results
-## Verdict: [PASS / FAIL]
-## Detected Project Types
-- [type] ([path/])
-## Gate Commands Run
-| # | Command | Dir | Exit | Status | Time |
-|---|---|---|---|---|---|
-| 1 | `npx tsc --noEmit` | dashboard/ | 0 | PASS | 4.2s |
-| 2 | `npx next build` | dashboard/ | 1 | FAIL | 9.8s |
-| 3 | `cargo check --all-targets` | services/indexer/ | 0 | PASS | 12.1s |
-
-## Failures
-
-### Command #2: `npx next build` (dashboard/, exit 1)
-```
-[full error output — do NOT truncate. Build errors reference files from earlier in output.]
-```
-
-**Root file:line(s)**:
-- `dashboard/app/(dashboard)/settings/page.tsx:90` — Type error: Property 'config' does not exist on type 'SettingsTabsProps'
-
-**Fix guidance**:
-Read `dashboard/app/(dashboard)/settings/page.tsx:88-93` and `dashboard/components/settings/SettingsTabs.tsx` (the SettingsTabsProps type definition). Either add `config` to SettingsTabsProps or remove the prop from the parent. Then re-run `npx next build` from `dashboard/` to verify.
-```
-
-Verdict rules:
-- Any exit code != 0 → **FAIL**
-- All exit codes == 0 → **PASS**
-- No gates detected → **PASS** with note "No build gate detected — project type unknown. Consider adding `--build-gate deploy` if Dockerfiles are present."
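The verdict rules above reduce to a few lines; a sketch:

```python
def gate_verdict(exit_codes: list) -> str:
    """Apply the verdict rules above to the collected gate exit codes."""
    if not exit_codes:
        # no gates detected: PASS, but the report should carry the note
        return "PASS"
    return "FAIL" if any(code != 0 for code in exit_codes) else "PASS"
```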
@@ -1,204 +0,0 @@
-# Engine Routing: Intelligent Model Selection
-
-Instructions for routing work to the optimal model (Claude or Codex) per role and phase. Only read this file when `--engine` is set to `auto` or `codex`.
-
-The routing table below is derived from published benchmarks (April 2026) comparing Claude Opus 4.6 and GPT-5.4 across task-relevant dimensions. The principle: each role's work goes to the model that objectively performs better at that task type.
-
----
-
-## Benchmark Basis
-
-| Dimension | Claude Opus 4.6 | GPT-5.4 | Gap | Source |
-|-----------|-----------------|---------|-----|--------|
-| Long-context retrieval (256k) | 92% | ~64% | Claude +28pp | MRCR v2 |
-| Graduate-level reasoning | 87.4% | 83.9% | Claude +3.5pp | GPQA Diamond |
-| Hard coding problems | ~46% | 57.7% | Codex +11.7pp | SWE-bench Pro |
-| Function-level code gen | 90.4% | 93.1% | Codex +2.7pp | HumanEval |
-| Terminal/CLI tasks | 65.4% | 75.1% | Codex +9.7pp | Terminal-Bench 2.0 |
-| Real-world issue resolution | ~80% | ~80% | Tied | SWE-bench Verified |
-| Security vulnerability detection | — | — | Tied | Semgrep 2025 study |
-| Agentic computer use | 72.7% | 75.0% | Codex +2.3pp | OSWorld |
-| Ambiguous intent handling | Preferred by 70% devs | — | Claude | Developer surveys |
-
----
-
-## Codex Call Defaults
-
-Every Codex call in this file uses these defaults unless stated otherwise:
-
-```
-model: "gpt-5.4"
-reasoningEffort: "xhigh"
-sandbox: varies per role (see table)
-workingDirectory: project root
-```
-
-The `model` field accepts any string — pass `"gpt-5.4"` even if the MCP schema lists older defaults. The Codex CLI resolves it.
-
----
-
-## Role Routing Table
-
-### team-resolve roles
-
-| Role | Engine | Sandbox | Rationale |
-|------|--------|---------|-----------|
-| root-cause-analyst | **Claude** | — | A/B test: Claude traced git history (15 tool calls) finding exact commit + unchecked migration plan. Codex analyzed structure well but lacked git history depth. Tool access > SWE-bench Pro advantage for this role. |
-| test-engineer | **Codex** | workspace-write | Test code generation = HumanEval (+2.7pp), needs file write |
-| security-auditor | **Dual** | read-only | Semgrep: both find unique vulns; GAN > single model |
-| implementation-planner | **Codex** | read-only | Implementation planning = SWE-bench Pro (+11.7pp) |
-| product-designer | **Claude** | — | Ambiguous requirements, user intent = Claude strength |
-| ui-designer | **Claude** | — | Visual spec, design reasoning = non-coding task |
-| ux-designer | **Claude** | — | User flow analysis = ambiguous intent handling |
-| accessibility-auditor | **Claude** | — | A/B test: Claude found 12 issues (1 CRITICAL) vs Codex 4. WCAG auditing requires thoroughness and domain knowledge depth, not code generation speed. Claude 3x coverage. |
-| product-analyst | **Claude** | — | Requirements clarity, scope judgment = ambiguity handling |
-| architecture-reviewer | **Claude** | — | Codebase-wide pattern review = MRCR long-context (+28pp) |
-| performance-engineer | **Codex** | read-only | Terminal tasks + algorithm analysis = Terminal-Bench (+9.7pp) |
-| api-designer | **Dual** | read-only | A/B test: Claude found 9 issues, Codex found 6, with unique findings on both sides (Claude: --version, exit codes; Codex: YAML folded scalar parsing bug). Dual maximizes coverage for API surface review. |
-
-### team-review roles
-
-| Role | Engine | Sandbox | Rationale |
-|------|--------|---------|-----------|
-| security-reviewer | **Dual** | read-only | Same as team-resolve security-auditor |
-| quality-reviewer | **Dual** | read-only | A/B test: Claude found 14 issues (2 HIGH), Codex found 11 (3 HIGH), only ~6 overlap. Dual yields ~19 unique findings (+36-73% coverage). Both models find HIGH-severity issues the other misses. |
-| test-analyst | **Codex** | workspace-write | Test gap analysis + test code suggestions |
-| ux-reviewer | **Claude** | — | UX flow assessment = ambiguity handling |
-| ui-reviewer | **Claude** | — | Design token consistency = non-coding task |
-| accessibility-reviewer | **Claude** | — | Same rationale as team-resolve accessibility-auditor: Claude 3x finding coverage on WCAG audits |
-| product-validator | **Claude** | — | Business logic intent = ambiguity handling |
-| api-reviewer | **Dual** | read-only | Same rationale as team-resolve api-designer: both models find unique API issues |
-| performance-reviewer | **Codex** | read-only | Algorithm complexity = Terminal-Bench (+9.7pp) |
-
-### Summary distribution
-
-| Engine | team-resolve (12) | team-review (9) | Total |
-|--------|-------------------|-----------------|-------|
-| Claude | 7 | 4 | 11 |
-| Codex | 2 | 2 | 4 |
-| Dual | 3 | 3 | 6 |
-
----
-
-## Pipeline Phase Routing (auto-resolve)
-
-| Phase | --engine auto | --engine codex | --engine claude |
-|-------|--------------|----------------|-----------------|
-| BUILD (implementation) | **Codex** | Codex | Claude |
-| BUILD GATE | bash (model-agnostic) | bash | bash |
-| BROWSER VALIDATE | Claude (Chrome MCP only) | Claude | Claude |
-| EVALUATE | **Claude** | Claude | Claude |
-| FIX LOOP | **Codex** | Codex | Claude |
-| SIMPLIFY | Claude | Codex | Claude |
-| REVIEW (team) | **Mixed per table** | Codex all | Claude all |
-| CHALLENGE | **Claude** | Claude | Claude |
-| SECURITY REVIEW | **Dual** | Codex | Claude |
-| CLEAN | Claude | Codex | Claude |
-| DOCS | Claude | Codex | Claude |
-
-Rationale for `--engine auto` choices:
-- BUILD/FIX: Codex — SWE-bench Pro 57.7% vs 46%. The biggest model gap is in hard coding tasks.
-- EVALUATE/CHALLENGE: Claude — evaluating a full diff requires long-context retrieval (MRCR +28pp) and skeptical reasoning (GPQA +3.5pp). Different model family from builder creates GAN dynamic.
-- BROWSER: Claude — Chrome MCP tools are Claude Code session-bound.
-- SECURITY: Dual — Semgrep study shows both models find unique vulnerabilities.
-
----
-
-## Pipeline Phase Routing (ideate)
-
-| Phase | --engine auto | --engine codex | --engine claude |
-|-------|--------------|----------------|-----------------|
-| FRAME | **Claude** | Codex | Claude |
-| EXPLORE | **Claude** | Codex | Claude |
-| CONVERGE | **Claude** | Codex | Claude |
-| CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
-| DOCUMENT | **Claude** | Codex | Claude |
-
-Rationale:
-- FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
-- CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic — automatic on every run. When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
-- DOCUMENT: Claude — writing quality for spec generation.
-
----
-
-## Pipeline Phase Routing (preflight)
-
-| Phase | --engine auto | --engine codex | --engine claude |
-|-------|--------------|----------------|-----------------|
-| EXTRACT COMMITMENTS | Claude | Codex | Claude |
-| CODE AUDIT | **Codex** | Codex | Claude |
-| DOCS AUDIT | **Claude** | **Claude** | Claude |
-| BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
-| SYNTHESIZE | Claude | Claude | Claude |
-
-DOCS AUDIT is always Claude regardless of `--engine` — writing-quality strength on documentation drift detection (READMEs, VISION.md prose, spec status accuracy) is the deciding factor, not code analysis. BROWSER AUDIT is always Claude because Chrome MCP tools are session-bound to Claude Code.
-
----
-
-## How to Spawn a Codex Role
-
-For roles marked **Codex** in the routing table, call `mcp__codex-cli__codex` instead of spawning a Claude Agent subagent. Package the role's full prompt (from the skill's teammate prompt section) into the Codex call.
-
-Template:
-
-```
-mcp__codex-cli__codex({
-  prompt: "[full role prompt with issue context, file paths, and deliverable format]",
-  model: "gpt-5.4",
-  reasoningEffort: "xhigh",
-  sandbox: "[read-only or workspace-write per table]",
-  workingDirectory: "[project root]"
-})
-```
-
-**Important**: Codex has no access to team infrastructure (TeamCreate, SendMessage, TaskCreate). For Codex roles:
-- Include ALL context inline in the prompt (issue description, file paths from investigation, deliverable format)
-- The orchestrator collects Codex's response and routes it where it would have gone via SendMessage
-- Codex roles cannot communicate with other teammates directly — the orchestrator relays findings
-
-For roles marked **Claude**, spawn a normal Agent subagent as before.
-
----
-
-## How to Spawn a Dual Role
-
-For roles marked **Dual**, run BOTH models in parallel and merge findings:
-
-1. Spawn a Claude Agent subagent with the role's prompt
-2. Call `mcp__codex-cli__codex` with the same role's prompt (sandbox: "read-only")
-3. Wait for both to complete
-4. Merge findings:
-   - Same finding from both → keep more detailed description, mark "confirmed by both models"
-   - Claude-only → keep as-is
-   - Codex-only → prefix with `[codex]`
-   - Conflicting findings → keep both, note the disagreement
-   - Take the MORE SEVERE verdict between the two
-
----
-
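The merge policy above can be sketched for findings keyed by a normalized finding id. The flat `id -> description` shape is a simplifying assumption for illustration; real findings would also carry severity and file locations:

```python
def merge_dual_findings(claude: dict, codex: dict) -> dict:
    """Merge the two models' findings per the rules above."""
    merged = {}
    for key in claude.keys() | codex.keys():
        if key in claude and key in codex:
            # same finding from both: keep the more detailed description
            detail = max(claude[key], codex[key], key=len)
            merged[key] = f"{detail} (confirmed by both models)"
        elif key in claude:
            merged[key] = claude[key]  # Claude-only: keep as-is
        else:
            merged[key] = f"[codex] {codex[key]}"  # Codex-only: prefix
    return merged
```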
-## How to Spawn a Codex BUILD/FIX Agent
-
-For BUILD and FIX LOOP phases when engine routes to Codex:
-
-```
-mcp__codex-cli__codex({
-  prompt: "[full build/fix prompt with task description, done criteria, and implementation instructions]",
-  model: "gpt-5.4",
-  reasoningEffort: "xhigh",
-  sandbox: "workspace-write",
-  fullAuto: true,
-  workingDirectory: "[project root]"
-})
-```
-
-**After Codex completes**: verify changes were made (`git diff --stat`), then proceed to the next phase as normal. The file-based handoff (`.devlyn/done-criteria.md`, `.devlyn/EVAL-FINDINGS.md`, etc.) works identically — Codex writes the same files Claude would.
-
-**Session management**: For FIX LOOP iterations, use a fresh call each time (no `sessionId` reuse) because sandbox/fullAuto parameters only apply on the first call of a session.
-
----
-
-## Override Behavior
-
-- `--engine claude` → all roles and phases use Claude (no Codex calls)
-- `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
-- `--engine auto` (default) → each role and phase routes to the optimal model per this table
@@ -1,164 +0,0 @@
---
name: devlyn:browser-validate
description: Browser-based validation for web applications — verifies that implemented features actually work by testing them in a real browser. Starts the dev server, tests the feature end-to-end (click buttons, fill forms, verify results), and reports what's broken with screenshot evidence. Use this skill whenever the user says "test in browser", "check if it works", "does the feature work", "browser test", "validate the UI", or when auto-resolve needs to verify web changes actually function correctly. Also use proactively after implementing UI changes. The primary goal is feature verification, not just checking if pages render.
---

Verify that implemented features actually work in the browser. The primary job is to test the feature that was just built — click the button, fill the form, check the result. Smoke tests and visual checks are supporting checks, not the main event.

The whole point of browser validation is to catch the gap between "code looks correct" and "user can actually do the thing." Static analysis and unit tests can confirm the code is well-structured. Browser validation confirms it *works*.

<config>
$ARGUMENTS
</config>

<workflow>

## PHASE 1: DETECT

1. **What was built**: This is the most important input. Read `.devlyn/done-criteria.md` if it exists — it tells you what the feature is supposed to do. If it doesn't exist, read `git diff --stat` and `git log -1` to understand what changed. You need to know what to test before anything else.

2. **Framework detection**: Read `package.json` → identify framework and start command from `scripts.dev`, `scripts.start`, or `scripts.preview`.

3. **Port inference**: Defaults — Next.js: 3000, Vite: 5173, CRA: 3000, Nuxt: 3000, Astro: 4321, Angular: 4200. Override with `--port` flag.
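The port defaults above can be sketched as a quick probe of `package.json` dependencies. This is a minimal sketch only (the helper name `detect_port` is hypothetical, and real projects may override the port in their dev script), but it shows the idea:

```shell
# Sketch: infer the default dev-server port from package.json dependencies.
# Falls back to 3000 (CRA / Nuxt / unknown); a --port flag always wins.
detect_port() {
  local pkg="${1:-package.json}"
  if grep -q '"next"' "$pkg"; then echo 3000
  elif grep -q '"vite"' "$pkg"; then echo 5173
  elif grep -q '"astro"' "$pkg"; then echo 4321
  elif grep -q '"@angular/core"' "$pkg"; then echo 4200
  else echo 3000
  fi
}
```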
4. **Affected routes**: Map changed files to routes (e.g., `app/dashboard/page.tsx` → `/dashboard`).
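For app-router-style projects, that mapping can be sketched with `sed` (an illustration only; the helper name `file_to_route` is hypothetical, and other frameworks, e.g. `pages/` or `src/routes/`, need their own rules):

```shell
# Sketch: map a Next.js app-router file to its route path.
# app/dashboard/page.tsx -> /dashboard ; app/page.tsx -> /
file_to_route() {
  echo "$1" | sed -E 's|^app||; s|/page\.[a-z]+$||; s|^$|/|'
}
```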
5. **Tier selection** — pick the best available browser tool. **You must verify each tier actually works before committing to it** — tools can be registered but not connected:
   - **Tier 1 probe** (Chrome DevTools): Check if `mcp__claude-in-chrome__*` tools exist. If they do, load `mcp__claude-in-chrome__tabs_context_mcp` via ToolSearch and call it. If the call **succeeds** (returns tab data without error), use Tier 1. Read `references/tier1-chrome.md`. If the call **fails** (timeout, connection error, extension not running), Tier 1 is unavailable — fall through to Tier 2.
   - **Tier 2 probe** (Playwright): Check if `mcp__playwright__*` tools exist (try ToolSearch for `mcp__playwright__browser_navigate`). If they exist and respond, use Tier 2 Mode A. Else run `npx playwright --version 2>/dev/null` — if it succeeds, use Tier 2 Mode B. Read `references/tier2-playwright.md`.
   - **Tier 3** (HTTP smoke): Fallback when no browser tool is functional. Read `references/tier3-curl.md`.

**Critical rule**: Never treat a tier as available just because its tools appear in the tool list. Deferred/registered tools may not have a running backend. Always probe before committing.
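For the CLI-based probes, the probe-before-commit rule can be sketched in shell (MCP tool probes happen through tool calls, not shell; the helper name `probe_cmd` is hypothetical):

```shell
# Sketch: treat a tier as available only if its probe command actually runs.
probe_cmd() { "$@" >/dev/null 2>&1; }

# Tier 2 Mode B probe -- the Playwright CLI must respond, not just be installed.
if probe_cmd npx playwright --version; then
  tier="2B"
else
  tier="3"   # fall back to HTTP smoke via curl
fi
```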
6. **Skip gate**: If no web-relevant files changed (no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.astro`, `*.css`, `*.scss`, `*.html`, `page.*`, `layout.*`, `route.*`, `+page.*`, `+layout.*`), skip. Report: "Browser validation skipped — no web changes detected."
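The gate can be sketched as a filter over `git diff --name-only` (the regex mirrors the list above; the helper name `is_web_change` is hypothetical, and the pattern may need tuning per project):

```shell
# Sketch: does a list of changed files (one per line, on stdin) touch the web UI?
is_web_change() {
  grep -qE '\.(tsx|jsx|vue|svelte|astro|css|scss|html)$|(^|/)(page|layout|route|\+page|\+layout)\.'
}

# Usage: git diff --name-only HEAD | is_web_change || echo "skip"
```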
7. **Parse flags** from `<config>`:
   - `--skip-feature` — skip feature testing, only run smoke + visual
   - `--port PORT` — override detected port
   - `--tier N` — force a specific tier (1, 2, or 3)
   - `--mobile-only` / `--desktop-only` — limit viewport testing
   - `--topic SLUG` — override the auto-derived screenshot topic slug
   - `--keep-server` — leave the dev server running after validation (used by the auto-resolve pipeline; see PHASE 7)

8. **Derive the screenshot topic slug**. All screenshots for this run go under `.devlyn/screenshots/<topic-slug>/` so runs for different features don't pile up together. Resolution order:
   1. `--topic` flag value, kebab-cased
   2. First non-blank heading/line of `.devlyn/done-criteria.md` (strip `#`, kebab-case, max 40 chars)
   3. Current git branch name, if not `main`/`master`/`HEAD`
   4. Fallback: `run-<YYYYMMDD-HHMM>`
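Step 2 of the resolution order can be sketched as a small kebab-casing pipeline (a sketch; the helper name `slugify` is hypothetical, while the `#` stripping and 40-char cap follow the rules above):

```shell
# Sketch: strip markdown heading markers, kebab-case, cap at 40 chars.
slugify() {
  echo "$1" \
    | sed -E 's/^#+ *//' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -c1-40
}
```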
Then wipe and recreate the topic dir (fresh evidence per run; don't touch other topics' dirs):

```bash
SCREENSHOT_DIR=".devlyn/screenshots/<topic-slug>"
rm -rf "$SCREENSHOT_DIR"
mkdir -p "$SCREENSHOT_DIR"/{smoke,feature,visual}
```

Record `$SCREENSHOT_DIR` and reuse it through the run. All screenshot paths below are **relative to `$SCREENSHOT_DIR`**:
- Smoke: `smoke/<route-slug>.png` (root → `smoke/root.png`)
- Feature: `feature/<criterion-slug>-step<N>.png`
- Visual: `visual/<viewport>-<route-slug>.png` (e.g., `visual/mobile-dashboard.png`)

Announce:
```
Browser validation starting
Feature: [what was built, from done-criteria or git diff]
Framework: [detected] | Port: [PORT] | Tier: [N — name]
Topic: [topic-slug] → .devlyn/screenshots/<topic-slug>/
Phases: Server → Smoke → Feature Test → Visual → Report
```

## PHASE 2: SERVER

Get the dev server running. If it doesn't start, diagnose and fix — don't just report failure.

1. Start the dev server in background via Bash with `run_in_background: true`.
2. Health-check: poll `http://localhost:PORT` every 2s, timeout 30s. Ready when you get an HTTP response.
3. **If it doesn't come up — troubleshoot** (up to 2 attempts): read stderr for the error, fix it (npm install, port conflict, build error, etc.), restart, re-check.
4. If still down after 2 attempts: write BLOCKED verdict and stop.
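The health check in step 2 can be sketched as a curl polling loop (a sketch; the helper name `wait_for_server` is hypothetical, and any HTTP response, including a 404 or 500, counts as "server is up"):

```shell
# Sketch: poll a URL every 2s until it answers or the deadline passes.
wait_for_server() {
  local url="$1" deadline=$(( $(date +%s) + ${2:-30} ))
  until curl -s --max-time 2 -o /dev/null "$url"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 2
  done
}
```

Usage: `wait_for_server "http://localhost:$PORT" 30 || echo "server never came up"`.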
## PHASE 3: SMOKE (quick prerequisite)

Quick check that the app is alive. This is not the main test — it's a gate to make sure feature testing is even possible.

Navigate to `/` and each affected route. For each page, judge: is this the actual application, or an error page? A connection error, framework error overlay, or blank shell is not the app. If broken, try to fix (read console errors, fix source, let hot-reload pick it up). Up to 2 fix attempts per route.

**Tier downgrade on failure**: If you're on Tier 1 or Tier 2 Mode A and the browser tool consistently fails during smoke (connection errors, timeouts, extension disconnected), **do not skip browser testing**. Instead, downgrade to the next tier (Tier 1 → Tier 2 → Tier 3), re-read the corresponding reference file, and retry the smoke phase with the new tier. Announce: `"Tier [N] browser tools not responding — downgrading to Tier [N+1]."` The goal is to always run the best available browser test, not to give up.

If the app isn't rendering, the verdict is BLOCKED — feature testing can't happen.

## PHASE 4: FEATURE TEST (the main event)

This is the primary purpose of browser validation. Everything else is in service of getting here.

Read `.devlyn/done-criteria.md` (or infer from git diff what was built). For each criterion that describes something a user can do or see in the UI, test it end-to-end in the browser:

1. **Plan the test**: What would a user do to verify this feature works? Navigate where, click what, type what, expect what result?
2. **Execute it**: Navigate to the page, find the interactive elements, perform the actions, verify the outcome. Read `references/flow-testing.md` for patterns on converting criteria to browser steps.
3. **Capture evidence**: Screenshot at each key step. Record console errors and network failures that happen during the interaction.
4. **If it fails — try to fix**: Read the error (console, network, or the UI state) to understand why the feature broke. Fix the source code, let hot-reload update, and re-test. Up to 2 fix attempts per criterion.
5. **Record the result**: For each criterion — PASS (feature works as specified), FAIL (feature doesn't work, include what went wrong), SKIPPED (criterion isn't browser-testable, e.g., "API returns 401"), or UNVERIFIABLE (feature depends on external services not available in the test environment — e.g., real API keys, third-party auth, paid services).

**Don't churn on external dependencies.** If a feature test is blocked because an API times out, a third-party service isn't configured, or auth credentials aren't available — that's not a bug to fix, it's a test environment limitation. Note it as UNVERIFIABLE, move on to the next criterion. Don't spend more than 30 seconds waiting for a response that's never coming. The goal is to verify what *can* be verified in the current environment, and be honest about what can't.

The verdict depends primarily on this phase. If the implemented features don't work in the browser, the validation fails — even if every page renders perfectly and the layout looks great. And if most features couldn't be verified due to environment limitations, be honest about that — don't call it PASS.

## PHASE 5: VISUAL (supporting check)

Quick layout check at two viewports (skip if `--mobile-only` or `--desktop-only`):

1. **Mobile** (375x812): screenshot each affected route, check for overflow/overlap/unreadable text
2. **Desktop** (1280x800): screenshot each affected route, check for broken layouts

Judgment-based — look at the screenshots and report visible issues.

## PHASE 6: REPORT

Write `.devlyn/BROWSER-RESULTS.md`:

```markdown
# Browser Validation Results

## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / PARTIALLY VERIFIED / BLOCKED]
Verdict rules:
- BLOCKED = server won't start or app doesn't render
- NEEDS WORK = implemented features don't work in the browser
- PARTIALLY VERIFIED = some features verified working, but others couldn't be tested due to environment limitations (missing API keys, external service dependencies). Be explicit about what was and wasn't verified.
- PASS WITH ISSUES = all testable features work but visual issues or minor warnings exist
- PASS = all testable features verified working, pages render, layout clean

## What Was Tested
[Brief description of the feature/task from done-criteria or git diff]

## Feature Verification (primary)
| Criterion | Test Steps | Result | Evidence |
|-----------|-----------|--------|----------|
| [what should work] | [what you did] | PASS/FAIL/SKIPPED/UNVERIFIABLE | [screenshot, errors, what went wrong] |

## Unverifiable Features (if any)
[List features that couldn't be tested and why — e.g., "Badge rendering requires /api/backends/status which needs real API keys not present in test env. Verified via source code and unit tests instead."]

## Smoke Test (prerequisite)
| Route | Renders | Console Errors | Network Failures |
|-------|---------|---------------|-----------------|
| / | YES/NO | [count] | [count] |

## Visual Check
| Viewport | Route | Issues |
|----------|-------|--------|
| Mobile (375px) | / | [issues or "Clean"] |
| Desktop (1280px) | / | [issues or "Clean"] |

## Fixes Applied During Validation
[List any bugs found and fixed during testing — server startup issues, broken routes, feature bugs]

## Runtime Errors
[Console errors captured during testing]

## Failed Network Requests
[Failed API calls captured during testing]
```

## PHASE 7: CLEANUP

Kill the dev server PID. If `--keep-server` was passed (auto-resolve pipeline), skip — the pipeline handles cleanup.

</workflow>
@@ -1,118 +0,0 @@
# Flow Testing: Done-Criteria to Browser Steps

How to read `.devlyn/done-criteria.md` and convert testable criteria into browser action sequences. This is the bridge between "what should work" and "prove it works in the browser."

Read this file only during PHASE 4 (FEATURE TEST) when done-criteria exists.

---

## Step 1: Classify Each Criterion

Read `.devlyn/done-criteria.md` and classify each criterion:

**Browser-testable** — the criterion describes something a user can see or do in the UI:
- "User can create a new project from the dashboard"
- "Error message appears when form is submitted with empty fields"
- "Navigation shows active state on current page"
- "Data table loads and displays 10 rows"

**Not browser-testable** — the criterion is about backend logic, data integrity, or code quality:
- "API returns 401 for unauthenticated requests"
- "Database migration runs without errors"
- "Test coverage exceeds 80%"
- "No TypeScript errors"

Skip non-browser-testable criteria. Note them as "Skipped — not browser-testable" in the report.

## Step 2: Convert to Action Sequences

For each browser-testable criterion, generate a sequence of steps:

### Pattern: Navigation + Verification
```
Criterion: "Dashboard shows project count"
Steps:
1. Navigate to /dashboard
2. Find element containing project count (look for text matching a number pattern)
3. Verify: element exists and contains a numeric value
4. Screenshot
```

### Pattern: Form Interaction
```
Criterion: "User can create a new project"
Steps:
1. Navigate to /dashboard (or wherever the create action lives)
2. Find "Create" or "New Project" button
3. Click it
4. Find form fields (name, description, etc.)
5. Fill with test data: name="Test Project", description="Browser validation test"
6. Find and click submit button
7. Verify: success indicator appears (toast, redirect, new item in list)
8. Screenshot at steps 3, 6, and 7
```

### Pattern: Error State
```
Criterion: "Error message shows when form submitted empty"
Steps:
1. Navigate to the form page
2. Find submit button
3. Click submit without filling any fields
4. Verify: error message(s) visible
5. Screenshot showing error state
```

### Pattern: Conditional UI
```
Criterion: "Empty state shows when no data exists"
Steps:
1. Navigate to the list/table page
2. Check if data exists — if so, this test needs a clean state
3. If clean state achievable: verify empty state message/illustration
4. If not: skip with note "Cannot verify empty state — data already exists"
5. Screenshot
```

## Step 3: Handle Data Dependencies

Some flow tests need specific data to exist (or not exist). Approach:

1. **Read-only tests preferred** — test flows that verify existing state rather than create/modify
2. **Create test data if safe** — if the flow creates something (like a project), use obvious test names ("Browser Validation Test — safe to delete")
3. **Skip if destructive** — don't test delete flows, don't modify existing data, don't test flows that send emails or notifications
4. **Note dependencies** — if a test can't run because of missing data, note it as "Skipped — requires [specific data state]"

## Step 4: Handle Auth-Protected Pages

If a route requires authentication:
1. Check if the app redirects to a login page
2. If login is a simple form (email + password): note "Auth required — skipping unless test credentials available"
3. If login uses OAuth/SSO: skip entirely, note "Skipped — requires OAuth flow"
4. Do not attempt to log in with guessed credentials

## Test Data Guidelines

When filling forms during flow tests, use obviously fake but valid data:
- Name: "Test User" or "Browser Validate Test"
- Email: "test@browser-validate.local"
- Description: "Created by browser-validate skill — safe to delete"
- Numbers: use small, obvious values (1, 10, 100)

This makes test data easy to identify and clean up later.

## Output Format

For each flow test, report:

```
Criterion: [original text from done-criteria]
Classification: browser-testable | skipped
Steps executed: [N of total]
Result: PASS | FAIL | SKIPPED | UNVERIFIABLE
Evidence:
- Screenshot: [path]
- Console errors during flow: [count] — [details]
- Network failures during flow: [count] — [details]
- Failure point: [which step failed and why]
```