codebyplan 1.13.55 → 1.13.57
- package/dist/cli.js +992 -586
- package/package.json +1 -1
- package/templates/agents/cbp-e2e-maestro.md +97 -8
- package/templates/agents/cbp-e2e-playwright.md +118 -15
- package/templates/agents/cbp-verify-reviewer.md +8 -0
- package/templates/context/testing/e2e.md +43 -2
- package/templates/github-workflows/ci.yml +19 -14
- package/templates/hooks/cbp-statusline.mjs +25 -18
- package/templates/hooks/cbp-statusline.py +13 -0
- package/templates/hooks/cbp-statusline.sh +12 -0
- package/templates/rules/e2e-mandatory.md +21 -0
- package/templates/rules/two-tier-ci.md +14 -6
package/package.json
CHANGED
package/templates/agents/cbp-e2e-maestro.md
CHANGED
@@ -132,7 +132,8 @@ One subdirectory per app module. Shared flows under `_shared/`. Probe under `_pr
 
 ## Spec-Writing Patterns
 
-**One flow per screen/feature.**
+**One flow per screen/feature.** A flow that only taps and asserts visibility is NOT done —
+prove behavior:
 
 ```yaml
 appId: ${APP_ID}
@@ -141,15 +142,97 @@ tags:
 ---
 - runFlow: _shared/login.yaml
 - assertVisible: "Dashboard"
--
--
+- waitForAnimationToEnd
+- assertNoDefectsWithAI: # AI visual-defect check — see AI Assertions below
+    optional: false
+- takeScreenshot: "dashboard-loaded" # NEW states only — see Visual Baselines
+- tapOn:
+    text: "Create"
+    enabled: true # state selector — waits for interactivity, catches broken gating
 - assertVisible: "New item"
-- takeScreenshot: "create-modal-open"
 ```
 
 Use text-based targeting first (`tapOn: "Button"`); use testID when ambiguous
-(`tapOn: { id: "btn" }`).
-
+(`tapOn: { id: "btn" }`). `text`/`id` are REGEX by default — escape `$` and `[`; quote
+`YES`/`NO`/`ON`/`OFF` (unquoted they parse as YAML booleans).
+
+**Assertion depth requirements**:
+
+- **State selectors prove logic**: `enabled`, `checked`, `focused`, `selected` — e.g. assert
+  Submit is `enabled: false` before required fields are filled, `enabled: true` after.
+- **Data round-trips** via `copyTextFrom` + `assertTrue`: copy the value on screen A
+  (snapshot into `output.*` via `evalScript` before the next copy overwrites
+  `maestro.copiedText`), navigate, assert screen B shows the same value.
+- **Persistence proof** for create/edit flows — after the UI reports success, verify via
+  `runScript` `http.get` against the backend API (`json()` parse + `assertTrue` on the
+  field), or at minimum kill + relaunch and re-assert:
+
+  ```yaml
+  - killApp
+  - launchApp: { stopApp: false }
+  - assertVisible: ${output.createdTitle}
+  ```
+
+- For CRUD: create + verify (round-trip); edit + verify updated; delete + confirm + verify
+  removed.
+
+## Visual Baselines (assertScreenshot)
+
+Committed PNGs under `e2e/screenshots/maestro/` are BASELINES, not run artifacts.
+
+- **New state** (`git ls-files --error-unmatch <path>` exits non-zero): `waitForAnimationToEnd`,
+  then `takeScreenshot: "{flow}-{state}"` and `git add` the PNG (auto-new model).
+- **Existing baseline**: do NOT retake/overwrite. Assert against it:
+
+  ```yaml
+  - waitForAnimationToEnd
+  - assertScreenshot:
+      path: e2e/screenshots/maestro/{flow}-{state}.png
+      thresholdPercentage: 95
+  ```
+
+  On failure classify `visual_regression`: capture the live screen under a transient
+  diagnostic name (`{flow}-{state}-actual`, written to `--test-output-dir`), report it in
+  `screenshots[]`, and NEVER overwrite the committed baseline. The user accepts the change at
+  `/cbp-verify`; only then is the baseline re-captured and re-added.
+- `baseline_diff_pct` stays `null` (Maestro reports threshold pass/fail, not a percentage);
+  set `is_new` per git tracking as before.
+
+## AI Assertions (assertNoDefectsWithAI / assertWithAI)
+
+Maestro's AI commands screenshot the current screen and detect rendering defects (cut-off
+text, overlapping elements, mis-centered content). Run `assertNoDefectsWithAI` at every
+primary screen state; use `assertWithAI` for states selectors can't express:
+
+```yaml
+- assertNoDefectsWithAI:
+    optional: false
+- assertWithAI:
+    assertion: The 6-digit verification input is visible with all six boxes empty.
+    optional: false
+```
+
+**Critical**: AI commands default to `optional: true` (warn-only — a detected defect does
+NOT fail the flow). ALWAYS set `optional: false`.
+
+**Auth preflight (Step 6.5.1 addition)**: AI commands require Maestro auth — a `maestro login`
+session or `MAESTRO_CLOUD_API_KEY` (a free account suffices; the legacy `MAESTRO_CLI_AI_KEY`
+BYO-key path is retired). Probe before authoring AI steps. When unavailable, ask the user once
+(provide key / skip AI), record `ai_checks: 'unavailable'` in the output, omit AI commands,
+and rely on `assertScreenshot` baselines — never let an AI step fail a run on a missing key.
+AI artifacts (`ai-report-*.html`, `ai-*.json`) land under `--test-output-dir`; reference them
+in `critical_issues[].reason` when a defect is found.
+
+## Anti-Patterns
+
+- `waitForAnimationToEnd` is NOT an assertion — it succeeds even on timeout; always pair it
+  with a real assert or screenshot.
+- Don't wrap whole flows in `retry` (hides product flakiness); bound `repeat` loops with
+  `times` + `while` together.
+- No `point:` coordinate taps — device-dependent; combine attribute + relational selectors instead.
+- Don't max out timeouts ("60s everywhere") — defaults catch performance regressions.
+- Platform limits: `back` is Android/Web only; airplane-mode commands are Android-only;
+  Android `inputText` is ASCII-only; system biometric/HealthKit dialogs need XCUITest.
 
 ## Screenshot Capture
 
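The hunk above notes that Maestro `text`/`id` selectors are regular expressions by default and that `YES`/`NO`/`ON`/`OFF` must be quoted in YAML. A minimal sketch of what that means when flow YAML is generated from literal UI strings follows. The helper names `escapeMaestroText` and `quoteIfYamlBoolean` are hypothetical and are not part of Maestro or codebyplan; the boolean word list is limited to the four words the note names.

```typescript
// Hypothetical helpers, shown only to illustrate the selector note above.

/** Escape regex metacharacters so a literal label is matched character for character. */
function escapeMaestroText(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The words the note says are read as YAML booleans when left unquoted.
const YAML_BOOLEAN_WORDS = new Set(["YES", "NO", "ON", "OFF"]);

/** Wrap a value in double quotes when YAML would otherwise parse it as a boolean. */
function quoteIfYamlBoolean(value: string): string {
  return YAML_BOOLEAN_WORDS.has(value.toUpperCase()) ? `"${value}"` : value;
}
```

For a button labelled `$9.99 [sale]`, the escaped form is what would go into `tapOn: { text: ... }`; without escaping, `$` and `[` would be interpreted as regex syntax.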
@@ -161,7 +244,9 @@ Screenshots written to `e2e/screenshots/maestro/` (via `screenshotsDir` in `conf
 Committed path convention: `e2e/screenshots/maestro/{flow}-{state}.png` (repo root).
 This path is intentionally outside `apps/web/e2e/screenshots/` (which is gitignored).
 
-After the flow completes, `git add
+After the flow completes, `git add` each NEW PNG individually — never `git add` the whole
+directory (that silently stages drifted baselines; existing states are gated by
+`assertScreenshot`, see Visual Baselines).
 
 **`is_new` detection**: `git ls-files --error-unmatch <path>` exits non-zero → `is_new: true`.
 
@@ -186,9 +271,13 @@ Include this in the specialist output alongside `screenshots[]`.
 ## Run Command
 
 ```bash
-maestro test maestro/flows/{module}/{flow}.yaml --format
+maestro test maestro/flows/{module}/{flow}.yaml --format junit --output maestro/results.xml \
+  --test-output-dir maestro/output
 ```
 
+`maestro/output/` holds transient diagnostics (AI reports, `-actual` regression captures) —
+gitignore it; committed baselines live only under `e2e/screenshots/maestro/`.
+
 ## pnpm Scripts
 
 ```json
package/templates/agents/cbp-e2e-playwright.md
CHANGED
@@ -21,7 +21,7 @@ accordingly.
 ## Install
 
 ```bash
-pnpm add -D @playwright/test
+pnpm add -D @playwright/test @axe-core/playwright
 pnpm exec playwright install chromium
 # CI with system deps:
 pnpm exec playwright install --with-deps chromium
@@ -265,32 +265,123 @@ port from `.codebyplan/server.local.json` (worktree overlay, checked first) then
 `.codebyplan/server.json` (committed base). On mismatch ask which is correct, then propose
 an Edit to align them.
 
+## Quality Fixture (MANDATORY)
+
+`apps/{app}/e2e/fixtures.ts` — the single `test` source for ALL specs. It auto-enforces the
+console-clean mandate (an `{ auto: true }` fixture runs in every test with zero per-spec
+opt-in) and provides the axe builder. Create it if absent; when touching an existing spec
+that still imports from `@playwright/test`, migrate its import.
+
+```ts
+import { test as base, expect } from "@playwright/test";
+import AxeBuilder from "@axe-core/playwright";
+
+// Known, triaged errors only — every entry needs a comment linking its fix task.
+const ALLOWED_CONSOLE: RegExp[] = [];
+
+type QualityFixtures = {
+  consoleGuard: void;
+  makeAxeBuilder: () => AxeBuilder;
+};
+
+export const test = base.extend<QualityFixtures>({
+  consoleGuard: [
+    async ({ page, baseURL }, use) => {
+      const errors: string[] = [];
+      page.on("console", (msg) => {
+        if (msg.type() === "error" && !ALLOWED_CONSOLE.some((re) => re.test(msg.text())))
+          errors.push(`console.error: ${msg.text()}`);
+      });
+      page.on("pageerror", (err) => errors.push(`pageerror: ${err.message}`));
+      page.on("requestfailed", (req) => {
+        // Own-origin, non-aborted failures only (cancelled prefetches are noise)
+        if (baseURL && req.url().startsWith(baseURL) && req.failure()?.errorText !== "net::ERR_ABORTED")
+          errors.push(`requestfailed: ${req.method()} ${req.url()} — ${req.failure()?.errorText}`);
+      });
+      await use();
+      expect(errors, "console/page errors captured during test").toEqual([]);
+    },
+    { auto: true },
+  ],
+  makeAxeBuilder: async ({ page }, use) => {
+    await use(() => new AxeBuilder({ page }).withTags(["wcag2a", "wcag2aa", "wcag21a", "wcag21aa"]));
+  },
+});
+export { expect };
+```
+
+Collected errors from failing tests feed the `console_errors[]` output field (see Output
+Additions below).
+
 ## Spec-Writing Patterns
 
-**One spec file per page/flow.**
+**One spec file per page/flow.** Specs import `{ test, expect }` from the quality fixture
+(`./fixtures` or relative path) — NEVER directly from `@playwright/test`.
 
-
-
+Mandatory per spec — a spec that only proves elements are visible is NOT done:
+
+- Smoke test: loads, title correct (the console guard fails it on any console/page error).
+- Primary user flow: main interaction **with a behavior proof** (below).
 - Visual regression: `toHaveScreenshot` at every primary state.
+- Structure: `toMatchAriaSnapshot` on the primary state — catches hierarchy/label/role
+  breakage without pixel fragility.
+- Accessibility: one axe scan per page state, zero violations.
+
+### Functional Proof (mutations)
 
-
-
+Every flow that mutates state MUST prove the mutation happened — asserting the optimistic UI
+is not proof:
 
 ```ts
-
+// 1. Prove the API call succeeded
+const resp = page.waitForResponse((r) => r.url().includes("/api/items") && r.request().method() === "POST");
+await page.getByRole("button", { name: "Create" }).click();
+expect((await resp).status()).toBeLessThan(400);
+
+// 2. Prove persistence — reload and re-assert (or poll the API for eventual consistency)
+await page.reload();
+await expect(page.getByRole("listitem").filter({ hasText: itemName })).toBeVisible();
+// await expect.poll(async () => (await page.request.get(`/api/items/${id}`)).status()).toBe(200);
+```
 
-
-test.beforeEach(async ({ page }) => {
-  await page.goto("/");
-});
+### Error-State Proof (forms / CRUD)
 
-
-
-
-
+At least one test per form/CRUD spec injects a failure and asserts the rendered error UI —
+error paths are where untested UIs break in production:
+
+```ts
+await page.route("**/api/items", (r) => r.fulfill({ status: 500 }));
+await page.getByRole("button", { name: "Create" }).click();
+await expect(page.getByRole("alert")).toContainText(/failed|went wrong/i);
+```
+
+### Permission / RLS Proof
+
+When the route is role-gated, include one denial test (lower-privilege storage state or
+seeded non-member): assert the explicit denial UI or redirect — a blank render is a bug,
+not a pass.
+
+### Accessibility Scan
+
+```ts
+test("a11y: dashboard has no WCAG A/AA violations", async ({ page, makeAxeBuilder }) => {
+  await page.goto("/dashboard");
+  const results = await makeAxeBuilder().analyze();
+  expect(results.violations).toEqual([]);
 });
 ```
 
+Known issues are excluded via `.disableRules([...])` with a comment linking the fix task —
+never by deleting the scan.
+
+### Anti-Patterns (reject in review)
+
+- `page.waitForTimeout(...)` — web-first assertions auto-retry; hard sleeps mask races.
+- `expect(await locator.isVisible()).toBe(true)` — one-shot, no retry; use `await expect(locator).toBeVisible()`.
+- `.nth(n)` / `.first()` positional selection — except the documented SCSS-module fallback.
+- In-spec env skips (`test.skip(!process.env.X, ...)`) — forbidden per `rules/e2e-mandatory.md`.
+- Visibility-only assertions after a mutation — see Functional Proof.
+
 ## Screenshot Capture
 
 **Baseline regression** (preferred):
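The two filters inside the `consoleGuard` fixture in the hunk above (which console messages count, and which failed requests count) can be read in isolation as pure predicates. The sketch below restates that logic outside Playwright so it can be reasoned about or unit-tested. The function names are hypothetical, the fixture itself inlines these checks, and the `ResizeObserver` allowlist entry in the usage note is an invented example.

```typescript
// Hypothetical, framework-free restatement of the fixture's two filters.

/** A console message is recorded only when it is an error not matched by the allowlist. */
function isReportableConsoleMessage(type: string, text: string, allowed: RegExp[]): boolean {
  return type === "error" && !allowed.some((re) => re.test(text));
}

/** A failed request is recorded only when it is own-origin and was not aborted. */
function isReportableRequestFailure(
  url: string,
  baseURL: string | undefined,
  errorText: string | undefined,
): boolean {
  if (baseURL === undefined || baseURL === "") return false; // no base URL configured: nothing is own-origin
  return url.startsWith(baseURL) && errorText !== "net::ERR_ABORTED";
}
```

With an allowlist of `[/ResizeObserver loop/]`, a `TypeError` logged at level `error` is still recorded, while a `warning` or an allowlisted message is not. A third-party CDN failure is ignored because its URL does not start with `baseURL`.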
@@ -332,6 +423,18 @@ when the playwright.config project/device emulation indicates a mobile viewport
 
 Include this in the specialist output alongside `screenshots[]`.
 
+## Output Additions (Playwright)
+
+Beyond the shared contract, ALWAYS report:
+
+- `console_errors[]` — every entry the console guard collected on failed tests
+  (`{test_name, type: 'console' | 'pageerror' | 'requestfailed', text}`). Empty array on a
+  clean run — never omit the field.
+- `a11y` — `{scanned_pages: string[], violations: [{rule, impact, page}]}` aggregated from
+  the axe scans. A `status: 'completed'` output with non-empty `violations` is inconsistent —
+  fix in-scope or classify the failures as category `a11y`; `codebyplan e2e verify-round`
+  hard-fails the inconsistency.
+
 ## Run Command
 
 ```bash
package/templates/agents/cbp-verify-reviewer.md
CHANGED
@@ -202,6 +202,14 @@ The deterministic e2e gate (`codebyplan e2e verify-round`) and the unit/lint/typ
 here). If the diff touches an e2e-eligible UI surface, note it in `summary` so the orchestrator
 confirms its gate ran — but do not assert a build/test result this agent did not run.
 
+E2E verdict gates (refuse `READY` per `rules/e2e-mandatory.md`): a zero-assertion run
+(`passed === 0 && skipped > 0` on a touched path); an empty `e2e_gallery[]` when the round
+touched UI for an eligible framework (sole exception: `vscode-test`-only rounds with explicit
+`e2e_gallery: []`); a `status: 'completed'` e2e output carrying non-empty `console_errors[]`
+or `a11y.violations[]`. Treat a `{type: 'shallow_coverage'}` critical issue on a mutation
+surface as a real finding (visibility-only specs prove rendering, not behavior) — severity
+`medium` minimum, routed to a follow-up round.
+
 ### Phase 6: Build Findings, Verdict & Routing
 
 Assign severity by impact: `critical` (runtime error / data corruption / security), `high`
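The three verdict gates added in the hunk above amount to a small decision procedure. A minimal sketch follows, assuming a flattened input object; the `E2eRound` shape, the `readyBlockers` name, and the blocker labels are all hypothetical and do not reflect how `cbp-verify-reviewer` or `codebyplan e2e verify-round` is implemented. It also assumes the caller passes counts for touched paths only.

```typescript
// Hypothetical input shape: only the fields the gates mention.
interface E2eRound {
  passed: number;
  skipped: number;
  touchedEligibleUi: boolean;
  frameworks: string[];
  e2eGallery: unknown[] | undefined; // undefined = field omitted from the output
  status: "completed" | "failed";
  consoleErrors: unknown[];
  a11yViolations: unknown[];
}

/** Returns the reasons READY must be refused; an empty array means no gate fired. */
function readyBlockers(r: E2eRound): string[] {
  const blockers: string[] = [];

  // Gate 1: zero-assertion run.
  if (r.passed === 0 && r.skipped > 0) blockers.push("zero_assertion_run");

  // Gate 2: empty gallery on a UI-touching round. The sole exception is a
  // vscode-test-only round that reports an explicit empty array.
  const vscodeOnly = r.frameworks.length > 0 && r.frameworks.every((f) => f === "vscode-test");
  const galleryEmpty = r.e2eGallery === undefined || r.e2eGallery.length === 0;
  const exempt = vscodeOnly && r.e2eGallery !== undefined;
  if (r.touchedEligibleUi && galleryEmpty && !exempt) blockers.push("empty_gallery");

  // Gate 3: a "completed" output must not carry console errors or a11y violations.
  if (r.status === "completed" && (r.consoleErrors.length > 0 || r.a11yViolations.length > 0)) {
    blockers.push("inconsistent_completed_status");
  }
  return blockers;
}
```

Returning a list rather than a boolean keeps every fired gate visible in the reviewer's findings instead of stopping at the first one.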
package/templates/context/testing/e2e.md
CHANGED
@@ -54,7 +54,7 @@ output:
     - test_name: string
       error: string
       file: string
-      category: 'env' | 'auth' | 'access' | 'flake' | 'real' | 'visual_regression'
+      category: 'env' | 'auth' | 'access' | 'flake' | 'real' | 'visual_regression' | 'console_error' | 'a11y'
       classification_reason: string
   framework_configured: boolean
   preflight:
@@ -77,6 +77,17 @@ output:
       committed_path: string # repo-relative; MUST be git-tracked after the run
       is_new: boolean # true => no prior baseline; auto-captured+committed this run
       baseline_diff_pct: number | null # null for non-playwright frameworks
+  console_errors: # REQUIRED for playwright (empty array on a clean run);
+    - test_name: string # null/omitted for frameworks without console capture
+      type: 'console' | 'pageerror' | 'requestfailed'
+      text: string
+  a11y: # REQUIRED for playwright; null/omitted otherwise
+    scanned_pages: string[]
+    violations:
+      - rule: string # axe rule id (e.g. color-contrast)
+        impact: string # critical | serious | moderate | minor
+        page: string
+  ai_checks: 'ran' | 'unavailable' | null # maestro only — AI assertion availability (see agent body)
   user_interactions: [{question, answer}]
   tech_stack_reconciliation:
     db_framework: string | null
@@ -177,12 +188,32 @@ For each failed test, assign exactly one category:
 | `auth` | Login-page redirect, 401 after credential submit, `invalid_grant`, `email_not_confirmed` | AskUserQuestion per Step 6.5.3 |
 | `access` | 403/404 on an accessible route, RLS denial text, missing seed data | AskUserQuestion: "Test failed with access error: `{error}`. Options: (1) fix + reply steps, (2) abort." |
 | `flake` | Timeout on first run, passes on immediate retry, network jitter | Retry up to 3 times before reclassifying to `real` |
-| `visual_regression` | `toHaveScreenshot`
+| `visual_regression` | `toHaveScreenshot` / `assertScreenshot` diff exceeded threshold | Do NOT retry. Include baseline + actual paths in `screenshots[]` with `baseline_diff_pct`. Do NOT auto-accept baselines. |
+| `console_error` | Console guard collected console/page/request errors during the flow | App defect — fix in-scope or report; never allowlist without a linked fix task |
+| `a11y` | Axe scan reported WCAG A/AA violations | Do NOT retry. Report rule ids in `a11y.violations`; fix in-scope or surface at `/cbp-verify` |
 | `real` | Assertion failure on app behavior (wrong text, state, navigation) | Attempt fix (selector, timeout, assertion), max 3 attempts, then report |
 
 `env`, `auth`, `access` failures MUST NOT count toward `test_results.failed` until
 preflight passes — they block the run instead.
 
+## Functional Assertion Mandate
+
+Visibility-only specs are NOT sufficient coverage — they prove rendering, not behavior.
+Every spec/flow covering a mutation (create / edit / delete / submit) MUST include at
+least one behavior proof:
+
+- **network success proof** — response-status assertion on the mutating call
+  (`waitForResponse` in Playwright; `runScript` `http.*` in Maestro), AND/OR
+- **persistence proof** — reload / kill-and-relaunch / direct API re-read showing the
+  change survived, PLUS
+- **one error-state test per form/CRUD surface** — inject a failure (`page.route` 500 in
+  Playwright) and assert the rendered error UI.
+
+When a suite's assertions are entirely visibility/navigation-level, the specialist MUST
+report `critical_issues[]` entry `{type: 'shallow_coverage', ...}` — the run may pass, but
+the gap is flagged for the next round. `cbp-verify-reviewer` treats `shallow_coverage` on a
+mutation surface as a finding, not noise.
+
 ## Committed-Screenshot Mandate
 
 Every eligible e2e run MUST persist relevant screenshots to the framework's committed
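The failure-category table in the hunk above implies a per-category handling policy: some categories block the run, some are retried a bounded number of times, and some must never be retried. The sketch below encodes that reading as a single function. The `nextAction` name and the action labels are hypothetical, and `attempts` is assumed to be the number of retries or fix attempts already made.

```typescript
type FailureCategory =
  | "env" | "auth" | "access" | "flake" | "real"
  | "visual_regression" | "console_error" | "a11y";

// Hypothetical action labels summarising the table's third column.
type FailureAction =
  | "block_run" | "retry" | "reclassify_real"
  | "attempt_fix" | "fix_or_report" | "report";

function nextAction(category: FailureCategory, attempts: number): FailureAction {
  switch (category) {
    case "env":
    case "auth":
    case "access":
      return "block_run"; // these block the run instead of counting as failed tests
    case "flake":
      return attempts < 3 ? "retry" : "reclassify_real"; // retry up to 3 times
    case "real":
      return attempts < 3 ? "attempt_fix" : "report"; // max 3 fix attempts, then report
    case "console_error":
    case "a11y":
      return "fix_or_report"; // never retried; fix in scope or surface it
    case "visual_regression":
      return "report"; // never retried, baseline never auto-accepted
  }
}
```

Keeping the policy in one exhaustive `switch` means the compiler flags any newly added category, such as the `console_error` and `a11y` values this release introduces, until it is given an explicit action.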
@@ -215,6 +246,11 @@ classify as `visual_regression`. Do NOT auto-update. Surface as a blocking accep
 at `/cbp-verify` (round scope). The user must explicitly approve (`--update-snapshots`) or open a
 fix task. This relaxes the prior always-manual contract ONLY for new screens.
 
+The model applies to ALL screenshot-capable frameworks, not just Playwright: Maestro gates
+existing baselines with `assertScreenshot` against the committed PNG (the agent never
+retakes/overwrites an existing baseline; acceptance = re-capture + `git add` after user
+approval at `/cbp-verify`).
+
 ## Screenshot Collection Rule
 
 After every run, enumerate all committed PNGs and populate BOTH `screenshots[]` and
@@ -242,6 +278,11 @@ New-screen auto-capture (above) is the only exception to the always-manual contr
 - `tests_run === true`
 - `preflight.*.ok === true` for every required prerequisite
 - Every failure has `category` other than `env`, `auth`, or `access`
+- `console_errors[]` is empty and `a11y.violations[]` is empty (where the framework reports
+  them — Playwright always does). Non-empty values with `status: 'completed'` are
+  inconsistent and hard-fail `codebyplan e2e verify-round` (`console_errors_reported`,
+  `a11y_violations_reported`); either fix in-scope or return `status: 'failed'` with the
+  matching failure category.
 
 Otherwise return `status: 'failed'`.
 
package/templates/github-workflows/ci.yml
CHANGED
@@ -17,9 +17,10 @@
 # Two jobs:
 # ci SOFT tier (authoritative required check) — the baseline-tolerant
 # inner loop: lint, typecheck, test, build across the repo.
-# ci-strict HARDCORE tier
-# `codebyplan check --scope merged --no-baseline`.
-#
+# ci-strict HARDCORE tier — whole-repo ABSOLUTE GREEN via
+# `codebyplan check --scope merged --no-baseline`. Report-only by
+# default; set `workflow.strict_check_enforced: true` in
+# `.codebyplan/ci.json` to make it a real gate (then enforce-check).
 
 name: CI
 
@@ -69,19 +70,21 @@ jobs:
       - name: Build
         run: pnpm turbo build
 
-  # ── HARDCORE strict tier
+  # ── HARDCORE strict tier ────────────────────────────────────────────────────
   # Whole-repo ABSOLUTE GREEN: `codebyplan check --scope merged --no-baseline`
   # ignores .check-baseline.json entirely, so ANY failing package (lint,
-  # typecheck, test) fails this job. This is the
+  # typecheck, test) fails this job. This is the checkpoint→main gate.
   #
-  # report-only
-  # `
-  #
-  #
+  # report-only vs enforced is driven by `.codebyplan/ci.json`
+  # `workflow.strict_check_enforced` (scaffold-ci-workflow substitutes the
+  # tokens below): when false (default) the job name carries " (report-only)"
+  # and `continue-on-error: true` keeps it non-blocking; when true the suffix is
+  # dropped and `continue-on-error: false` makes it a real gate. Only flip the
+  # flag once the whole repo is absolute-green AND the job has run green in CI,
+  # then add it to branch protection via `codebyplan ci enforce-check`.
   ci-strict:
-    name: Strict whole-repo green
-    runs-on: ubuntu-latest
-    continue-on-error: true
+    name: Strict whole-repo green{{STRICT_NAME_SUFFIX}}
+    runs-on: ubuntu-latest{{STRICT_CONTINUE_ON_ERROR_LINE}}
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -112,10 +115,12 @@ jobs:
       # In the monorepo run the freshly-built bundle directly (the bin shim may
       # be missing because dist/cli.js did not exist at install time); in a
       # consumer repo that path is absent, so fall back to the installed bin.
+      # --concurrency=1 serializes turbo so the whole-repo matrix does not
+      # CPU-starve timing-sensitive test suites into flaky timeouts on the runner.
       - name: Strict check (no baseline)
         run: |
           if [ -f packages/codebyplan-package/dist/cli.js ]; then
-            node packages/codebyplan-package/dist/cli.js check --scope merged --no-baseline
+            node packages/codebyplan-package/dist/cli.js check --scope merged --no-baseline --concurrency=1
           else
-            pnpm exec codebyplan check --scope merged --no-baseline
+            pnpm exec codebyplan check --scope merged --no-baseline --concurrency=1
           fi
package/templates/hooks/cbp-statusline.mjs
CHANGED
@@ -420,9 +420,6 @@ function main() {
   if (shouldShow("PACKAGE_FRESHNESS", cfg.package_freshness)) {
     let guarded = false;
     let installed = "";
-    let newer = false;
-    let latest = "";
-    let inSync = true;
 
     const cachePath = path.join(
       root,
@@ -444,9 +441,6 @@ function main() {
         } else {
           installed =
             typeof cache.installed === "string" ? cache.installed : "";
-          newer = cache.newer === true;
-          latest = typeof cache.latest === "string" ? cache.latest : "";
-          inSync = cache.in_sync !== false;
         }
       }
     } catch {
@@ -466,21 +460,14 @@ function main() {
         guarded = true;
       } else {
         try {
-
-
-
-
+          // Reading + parsing the manifest validates it — unreadable/invalid
+          // JSON falls into the catch below and hides the segment. (The
+          // manifest-vs-installed ⟳ nag was removed in CHK-195.)
+          JSON.parse(fs.readFileSync(manifestPath, "utf8"));
           const pRaw = fs.readFileSync(pkgPath, "utf8");
           const pParsed = JSON.parse(pRaw);
-
+          installed =
             typeof pParsed?.version === "string" ? pParsed.version : "";
-          installed = iVer;
-          if (mVer && iVer && mVer !== iVer) {
-            // manifest ≠ installed → .claude is out of sync. The ⟳ nag was removed
-            // (CHK-195); only the bare version renders. inSync is retained for the
-            // guard shape but no longer drives any output.
-            inSync = false;
-          }
         } catch {
           // Can't read files → hide segment.
           guarded = true;
@@ -492,6 +479,26 @@ function main() {
       const L8 = `${C.DIM}cbp${C.RST} ${installed}`;
       out.push(L8);
     }
+
+    // Settings-contract violations (read from cache regardless of guard state).
+    if (fs.existsSync(cachePath)) {
+      try {
+        const cacheRaw2 = fs.readFileSync(cachePath, "utf8");
+        const cache2 = JSON.parse(cacheRaw2);
+        if (cache2 && typeof cache2 === "object") {
+          const settingsMissing = cache2.settings_missing === true;
+          const settingsIgnored = cache2.settings_ignored === true;
+          if (settingsMissing) {
+            out.push(`${C.RED}⚠ settings.json missing!${C.RST}`);
+          }
+          if (settingsIgnored) {
+            out.push(`${C.YELLOW}⚠ settings.json gitignored!${C.RST}`);
+          }
+        }
+      } catch {
+        // Unreadable / invalid → no indicators
+      }
+    }
   }
 
   process.stdout.write(out.length ? out.join("\n") + "\n" : "");
package/templates/hooks/cbp-statusline.py
CHANGED
@@ -401,6 +401,19 @@ def main():
             l8 = "%scbp%s %s" % (DIM, RST, _installed)
             out.append(l8)
 
+        # Settings-contract violations (read from cache regardless of guard state).
+        if os.path.isfile(cache_path):
+            try:
+                with open(cache_path, "r", encoding="utf-8") as fh:
+                    cache2 = json.load(fh)
+                if isinstance(cache2, dict):
+                    if cache2.get("settings_missing") is True:
+                        out.append("%s⚠ settings.json missing!%s" % (RED, RST))
+                    if cache2.get("settings_ignored") is True:
+                        out.append("%s⚠ settings.json gitignored!%s" % (YELLOW, RST))
+            except Exception:
+                pass  # Unreadable / invalid → no indicators
+
     sys.stdout.write(("\n".join(out) + "\n") if out else "")
 
 
package/templates/hooks/cbp-statusline.sh
CHANGED
@@ -494,4 +494,16 @@ if should_show PACKAGE_FRESHNESS "$CFG_PACKAGE_FRESHNESS"; then
     L8="${DIM}cbp${RST} ${_CBP_INSTALLED}"
     printf "%b\n" "$L8"
   fi
+
+  # Settings-contract violations (read from cache regardless of guard state).
+  if [ -f "$CBP_STATUS_CACHE" ] && command -v jq >/dev/null 2>&1; then
+    _CBP_SETTINGS_MISSING="$(jq -r '.settings_missing == true' "$CBP_STATUS_CACHE" 2>/dev/null)"
+    _CBP_SETTINGS_IGNORED="$(jq -r '.settings_ignored == true' "$CBP_STATUS_CACHE" 2>/dev/null)"
+    if [ "$_CBP_SETTINGS_MISSING" = "true" ]; then
+      printf "%b\n" "${RED}⚠ settings.json missing!${RST}"
+    fi
+    if [ "$_CBP_SETTINGS_IGNORED" = "true" ]; then
+      printf "%b\n" "${YELLOW}⚠ settings.json gitignored!${RST}"
+    fi
+  fi
 fi
package/templates/rules/e2e-mandatory.md
CHANGED
@@ -73,6 +73,27 @@ The sole exception is `vscode-test`: the committed dir may be empty when the ext
 has no visual output (behavior-only tests). Agents must still define the dir and report
 `e2e_gallery: []` explicitly — not omit the field.
 
+## Quality-Capture Mandates
+
+A green run that captured no quality signals is not evidence. Per framework:
+
+- **playwright**: every spec imports `test` from the shared quality fixture
+  (`e2e/fixtures.ts`) — console/pageerror guard auto-active in every test; one axe WCAG A/AA
+  scan per page state. The output MUST carry `console_errors[]` (empty on clean) and `a11y`
+  per `context/testing/e2e.md`.
+- **maestro**: existing committed screenshots are baselines — gate them with
+  `assertScreenshot` (never retake/overwrite); run `assertNoDefectsWithAI` with
+  `optional: false` at primary states when Maestro auth is available (the default
+  `optional: true` is warn-only and forbidden; record `ai_checks: 'unavailable'` when auth
+  is absent).
+- A `status: 'completed'` output carrying non-empty `console_errors[]` or
+  `a11y.violations[]` is inconsistent — `codebyplan e2e verify-round` hard-fails it
+  (`console_errors_reported`, `a11y_violations_reported`).
+
+Mutation flows MUST carry a behavior proof per `context/testing/e2e.md` § Functional
+Assertion Mandate (network success proof / persistence proof / error-state test);
+visibility-only suites are flagged `{type: 'shallow_coverage'}` in `critical_issues[]`.
+
 ## Cross-References
 
 - `context/testing/e2e.md` — Input/Output contract, pre-flight loop, failure classification,
package/templates/rules/two-tier-ci.md
CHANGED
@@ -47,13 +47,21 @@ The branch model is **feat→main direct**; `.codebyplan/git.json` has `integrat
 IS the per-checkpoint feat branch. The hardcore tier runs against that feat branch's merged
 state before it lands on main; do not assume a staging/integration hop exists.
 
-##
+## Strict-Tier Enforcement (report-only ⇄ enforced)
 
-The whole-repo hardcore CI **job**
-
-
-
-
+The whole-repo hardcore CI **job** (`ci-strict`) is config-driven via `.codebyplan/ci.json`
+`workflow.strict_check_enforced`, which `codebyplan ci scaffold-workflow` substitutes into the
+generated `.github/workflows/ci.yml`:
+
+- **`false` (default)** — report-only: the job carries the " (report-only)" name suffix and
+  `continue-on-error: true`, so `--scope merged --no-baseline` is advisory in CI — surfaced, not
+  enforced. A repo whose baseline is still red keeps merging while it pays the baseline down.
+- **`true`** — enforced: the suffix is dropped and `continue-on-error` is omitted (defaults to
+  `false`), making the job a real gate. Flip ONLY after the whole repo is absolute-green AND the
+  job has already run green in CI, then wire the check name `Strict whole-repo green` into branch
+  protection via `codebyplan ci enforce-check --check-name "Strict whole-repo green"`.
+
+Locally, `cbp-verify` runs and reports the same check regardless of the flag.
 
 ## Cross-References
 
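The rule text above describes how `workflow.strict_check_enforced` turns into the two template tokens seen in `ci.yml` (`{{STRICT_NAME_SUFFIX}}` and `{{STRICT_CONTINUE_ON_ERROR_LINE}}`). A minimal sketch of that substitution follows. It is not the actual `scaffold-workflow` code: the `renderStrictJob` name is hypothetical, and the exact replacement strings (a leading newline plus four-space indentation, and an empty string when enforced) are assumptions based on the token sitting at the end of the `runs-on` line. The workflow comment describes the enforced case as `continue-on-error: false`, while this rule says the key is omitted; both give the same behavior, since `false` is the GitHub Actions default.

```typescript
/** Substitute the two strict-tier tokens according to the enforcement flag. */
function renderStrictJob(template: string, enforced: boolean): string {
  const nameSuffix = enforced ? "" : " (report-only)";
  // Assumption: the replacement carries its own newline and job-level indent.
  const continueLine = enforced ? "" : "\n    continue-on-error: true";
  return template
    .split("{{STRICT_NAME_SUFFIX}}").join(nameSuffix)
    .split("{{STRICT_CONTINUE_ON_ERROR_LINE}}").join(continueLine);
}

// The job header as it appears in the ci.yml template hunk.
const strictJobTemplate = [
  "  ci-strict:",
  "    name: Strict whole-repo green{{STRICT_NAME_SUFFIX}}",
  "    runs-on: ubuntu-latest{{STRICT_CONTINUE_ON_ERROR_LINE}}",
].join("\n");

const reportOnlyJob = renderStrictJob(strictJobTemplate, false);
const enforcedJob = renderStrictJob(strictJobTemplate, true);
```

In the enforced rendering the job name is exactly `Strict whole-repo green`, which is why that string is the check name passed to `codebyplan ci enforce-check --check-name`.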