@brunosps00/dev-workflow 0.11.0 → 0.13.0
- package/README.md +48 -5
- package/lib/constants.js +20 -20
- package/lib/init.js +24 -1
- package/lib/migrate-skills.js +129 -0
- package/lib/removed-bundled-skills.js +16 -0
- package/lib/uninstall.js +6 -2
- package/lib/utils.js +43 -1
- package/package.json +1 -1
- package/scaffold/en/agent-instructions.md +68 -0
- package/scaffold/en/commands/dw-autopilot.md +1 -1
- package/scaffold/en/commands/dw-brainstorm.md +1 -1
- package/scaffold/en/commands/dw-bugfix.md +3 -3
- package/scaffold/en/commands/dw-create-techspec.md +1 -1
- package/scaffold/en/commands/dw-deps-audit.md +1 -1
- package/scaffold/en/commands/dw-fix-qa.md +1 -1
- package/scaffold/en/commands/dw-functional-doc.md +2 -2
- package/scaffold/en/commands/dw-help.md +1 -1
- package/scaffold/en/commands/dw-redesign-ui.md +7 -7
- package/scaffold/en/commands/dw-run-qa.md +4 -4
- package/scaffold/en/commands/dw-run-task.md +2 -2
- package/scaffold/en/templates/constitution-template.md +1 -1
- package/scaffold/pt-br/agent-instructions.md +68 -0
- package/scaffold/pt-br/commands/dw-autopilot.md +1 -1
- package/scaffold/pt-br/commands/dw-brainstorm.md +1 -1
- package/scaffold/pt-br/commands/dw-bugfix.md +3 -3
- package/scaffold/pt-br/commands/dw-create-techspec.md +1 -1
- package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
- package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
- package/scaffold/pt-br/commands/dw-functional-doc.md +2 -2
- package/scaffold/pt-br/commands/dw-help.md +1 -1
- package/scaffold/pt-br/commands/dw-redesign-ui.md +7 -7
- package/scaffold/pt-br/commands/dw-run-qa.md +4 -4
- package/scaffold/pt-br/commands/dw-run-task.md +2 -2
- package/scaffold/pt-br/templates/constitution-template.md +1 -1
- package/scaffold/skills/dw-council/SKILL.md +1 -1
- package/scaffold/skills/dw-testing-discipline/SKILL.md +148 -0
- package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +170 -0
- package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +336 -0
- package/scaffold/skills/dw-testing-discipline/references/flaky-discipline.md +163 -0
- package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +128 -0
- package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +282 -0
- package/scaffold/skills/dw-testing-discipline/references/positive-patterns.md +241 -0
- package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/security-boundary.md +1 -1
- package/scaffold/skills/dw-ui-discipline/SKILL.md +128 -0
- package/scaffold/skills/dw-ui-discipline/references/accessibility-floor.md +225 -0
- package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +162 -0
- package/scaffold/skills/dw-ui-discipline/references/curated-defaults.md +195 -0
- package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +142 -0
- package/scaffold/skills/dw-ui-discipline/references/state-matrix.md +101 -0
- package/scaffold/skills/ui-ux-pro-max/LICENSE +0 -21
- package/scaffold/skills/ui-ux-pro-max/SKILL.md +0 -659
- package/scaffold/skills/ui-ux-pro-max/data/_sync_all.py +0 -414
- package/scaffold/skills/ui-ux-pro-max/data/app-interface.csv +0 -31
- package/scaffold/skills/ui-ux-pro-max/data/charts.csv +0 -26
- package/scaffold/skills/ui-ux-pro-max/data/colors.csv +0 -162
- package/scaffold/skills/ui-ux-pro-max/data/design.csv +0 -1776
- package/scaffold/skills/ui-ux-pro-max/data/draft.csv +0 -1779
- package/scaffold/skills/ui-ux-pro-max/data/google-fonts.csv +0 -1924
- package/scaffold/skills/ui-ux-pro-max/data/icons.csv +0 -106
- package/scaffold/skills/ui-ux-pro-max/data/landing.csv +0 -35
- package/scaffold/skills/ui-ux-pro-max/data/products.csv +0 -162
- package/scaffold/skills/ui-ux-pro-max/data/react-performance.csv +0 -45
- package/scaffold/skills/ui-ux-pro-max/data/stacks/angular.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/astro.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/flutter.csv +0 -53
- package/scaffold/skills/ui-ux-pro-max/data/stacks/html-tailwind.csv +0 -56
- package/scaffold/skills/ui-ux-pro-max/data/stacks/jetpack-compose.csv +0 -53
- package/scaffold/skills/ui-ux-pro-max/data/stacks/laravel.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/nextjs.csv +0 -53
- package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxt-ui.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/nuxtjs.csv +0 -59
- package/scaffold/skills/ui-ux-pro-max/data/stacks/react-native.csv +0 -52
- package/scaffold/skills/ui-ux-pro-max/data/stacks/react.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/shadcn.csv +0 -61
- package/scaffold/skills/ui-ux-pro-max/data/stacks/svelte.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/swiftui.csv +0 -51
- package/scaffold/skills/ui-ux-pro-max/data/stacks/threejs.csv +0 -54
- package/scaffold/skills/ui-ux-pro-max/data/stacks/vue.csv +0 -50
- package/scaffold/skills/ui-ux-pro-max/data/styles.csv +0 -85
- package/scaffold/skills/ui-ux-pro-max/data/typography.csv +0 -74
- package/scaffold/skills/ui-ux-pro-max/data/ui-reasoning.csv +0 -162
- package/scaffold/skills/ui-ux-pro-max/data/ux-guidelines.csv +0 -100
- package/scaffold/skills/ui-ux-pro-max/scripts/core.py +0 -262
- package/scaffold/skills/ui-ux-pro-max/scripts/design_system.py +0 -1148
- package/scaffold/skills/ui-ux-pro-max/scripts/search.py +0 -114
- package/scaffold/skills/ui-ux-pro-max/skills/brand/SKILL.md +0 -97
- package/scaffold/skills/ui-ux-pro-max/skills/design/SKILL.md +0 -302
- package/scaffold/skills/ui-ux-pro-max/skills/design-system/SKILL.md +0 -244
- package/scaffold/skills/ui-ux-pro-max/templates/base/quick-reference.md +0 -297
- package/scaffold/skills/ui-ux-pro-max/templates/base/skill-content.md +0 -358
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/agent.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/augment.json +0 -18
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/claude.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/codebuddy.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/codex.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/continue.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/copilot.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/cursor.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/droid.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/gemini.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/kilocode.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/kiro.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/opencode.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/qoder.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/roocode.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/trae.json +0 -21
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/warp.json +0 -18
- package/scaffold/skills/ui-ux-pro-max/templates/platforms/windsurf.json +0 -21
- package/scaffold/skills/webapp-testing/SKILL.md +0 -138
- package/scaffold/skills/webapp-testing/assets/test-helper.js +0 -56
- package/scaffold/skills/{webapp-testing → dw-testing-discipline}/references/three-workflow-patterns.md +0 -0
|
@@ -0,0 +1,170 @@
|
|
|
1
|
+
# Seven AI agent gates — mandatory when an LLM writes tests
|
|
2
|
+
|
|
3
|
+
LLMs have characteristic failure modes when authoring tests. These gates are forcing functions for the seven most common.
|
|
4
|
+
|
|
5
|
+
Every test produced by an agent (via `/dw-run-task`, `/dw-bugfix`, `/dw-autopilot`, or any other code-generating flow) must pass all seven gates BEFORE the diff is presented for review.
|
|
6
|
+
|
|
7
|
+
## Gate 1: Invariant first
|
|
8
|
+
|
|
9
|
+
**The failure mode it blocks:** Agent writes 200 lines of test code without articulating what the test is supposed to prove.
|
|
10
|
+
|
|
11
|
+
**The gate:**
|
|
12
|
+
|
|
13
|
+
Before writing any test code, the agent prints:
|
|
14
|
+
|
|
15
|
+
```
|
|
16
|
+
INVARIANT: <one sentence: what behavior is true that the test verifies>
|
|
17
|
+
OWNING_LAYER: <unit | integration | contract | e2e>
|
|
18
|
+
EXISTING_SUITE: <path to existing test file the new test joins>
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
**Why it works:**
|
|
22
|
+
- "Invariant" forces specific behavior naming.
|
|
23
|
+
- "Owning layer" forces Law 2 (lowest detectable layer).
|
|
24
|
+
- "Existing suite" forces extending coverage rather than spawning new files.
|
|
25
|
+
|
|
26
|
+
**Verification:** In `/dw-code-review`, look for this 3-line preamble in the PR description or the commit body. Missing = REJECTED.
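
A minimal sketch of how the preamble check could be automated, assuming the three lines appear verbatim in the commit body or PR description. `findMissingPreambleFields` is a hypothetical helper, not something dev-workflow is known to ship; the allowed layer values mirror the list in the gate.

```javascript
// Hypothetical helper: reports which Gate 1 preamble fields are missing
// or malformed in a commit body / PR description.
const LAYERS = ['unit', 'integration', 'contract', 'e2e'];

function findMissingPreambleFields(text) {
  const missing = [];
  // Returns the value after "NAME:" on the same line, or null if absent/empty.
  const field = (name) => {
    const match = text.match(new RegExp(`^${name}:[ \\t]*(\\S.*)$`, 'm'));
    return match ? match[1].trim() : null;
  };

  if (!field('INVARIANT')) missing.push('INVARIANT');

  const layer = field('OWNING_LAYER');
  if (!layer || !LAYERS.includes(layer.toLowerCase())) missing.push('OWNING_LAYER');

  if (!field('EXISTING_SUITE')) missing.push('EXISTING_SUITE');
  return missing;
}

const commitBody = [
  'Add refund rounding test',
  '',
  'INVARIANT: refunds never exceed the captured amount',
  'OWNING_LAYER: unit',
  'EXISTING_SUITE: tests/unit/refund.test.js',
].join('\n');

console.log(findMissingPreambleFields(commitBody));          // []
console.log(findMissingPreambleFields('INVARIANT: x only')); // [ 'OWNING_LAYER', 'EXISTING_SUITE' ]
```

A non-empty result maps to the "Missing = REJECTED" rule. The sketch only checks that the fields are present; whether the invariant is meaningful still needs a reviewer.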
|
|
27
|
+
|
|
28
|
+
## Gate 2: Owning layer
|
|
29
|
+
|
|
30
|
+
**The failure mode it blocks:** Agent creates a new test file every time, scattering coverage across orphan files. Or, agent writes E2E tests for things unit could prove.
|
|
31
|
+
|
|
32
|
+
**The gate:**
|
|
33
|
+
|
|
34
|
+
The agent must:
|
|
35
|
+
1. Identify the existing test suite that owns the module under test.
|
|
36
|
+
2. Extend that suite, OR document why a new suite is needed (genuinely new module, new test pyramid layer).
|
|
37
|
+
3. Map the test to the right layer per Law 2.
|
|
38
|
+
|
|
39
|
+
**Verification:**
|
|
40
|
+
- New test file in PR but existing file covers the same module? REJECTED.
|
|
41
|
+
- E2E test for pure-logic invariant? REJECTED unless documented.
|
|
42
|
+
- Unit test for cross-service flow? REJECTED — push to integration/E2E.
|
|
43
|
+
|
|
44
|
+
## Gate 3: Real execution
|
|
45
|
+
|
|
46
|
+
**The failure mode it blocks:** Agent writes tests that mock everything. They pass green forever and validate nothing.
|
|
47
|
+
|
|
48
|
+
**The gate:**
|
|
49
|
+
|
|
50
|
+
Every test path the agent writes must, at SOME layer, run against real systems before the diff merges:
|
|
51
|
+
|
|
52
|
+
- Pure logic: unit only is fine.
|
|
53
|
+
- Code that touches DB: must have at least one integration test running real DB (testcontainers, ephemeral container, dedicated test DB).
|
|
54
|
+
- Code that calls external services: must have a contract test OR a sandbox-account smoke test.
|
|
55
|
+
- UI interactions: must have at least one E2E run on a real preview environment.
|
|
56
|
+
|
|
57
|
+
**Verification:** PR description must list the real-system runs that exercise the touched code. If no real-system path covers the change, REJECTED.
|
|
58
|
+
|
|
59
|
+
## Gate 4: Failure → fix production
|
|
60
|
+
|
|
61
|
+
**The failure mode it blocks:** Agent sees test red, modifies the test until green. Bug ships.
|
|
62
|
+
|
|
63
|
+
**The gate:**
|
|
64
|
+
|
|
65
|
+
When the agent encounters a failing test (its own or pre-existing):
|
|
66
|
+
|
|
67
|
+
1. Print: `INVESTIGATING FAILURE: <test name>`
|
|
68
|
+
2. Read production code in the path that produces the observed value.
|
|
69
|
+
3. Print: `ANALYSIS: <2-3 sentences on whether production is wrong, test is wrong, or invariant changed>`
|
|
70
|
+
4. Decide:
|
|
71
|
+
- Production wrong → fix production.
|
|
72
|
+
- Test wrong → fix test AND document the change in the commit body.
|
|
73
|
+
- Invariant changed → update the test AND open an ADR if the change is a public contract change.
|
|
74
|
+
|
|
75
|
+
**Verification:** Every commit that changes a previously-green test must have an `ANALYSIS:` line in the commit body explaining the decision. Missing = REJECTED.
|
|
76
|
+
|
|
77
|
+
## Gate 5: No snapshot without contract
|
|
78
|
+
|
|
79
|
+
**The failure mode it blocks:** Agent reaches for `toMatchSnapshot()` whenever it doesn't know what to assert. Snapshot becomes the assertion. Drift goes unnoticed.
|
|
80
|
+
|
|
81
|
+
**The gate:**
|
|
82
|
+
|
|
83
|
+
Before adding a snapshot assertion, the agent classifies the artifact:
|
|
84
|
+
|
|
85
|
+
- **PRODUCT_CONTRACT**: a stable contract worth pinning (e.g., serialized output of a public API, schema of a stored record). Snapshot is appropriate. Document the classification.
|
|
86
|
+
- **IMPLEMENTATION_DETAIL**: HTML structure, internal representation, component tree shape. Snapshot is FORBIDDEN. Write specific assertions instead.
|
|
87
|
+
|
|
88
|
+
**Verification:** Snapshots in the diff without a classification comment = REJECTED. Snapshots classified as IMPLEMENTATION_DETAIL = REJECTED.
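
The verification rule can be sketched as a line scan. The source does not fix a format for the classification comment, so this example assumes it sits on the line directly above the snapshot call; `reviewSnapshots` and the finding shape are hypothetical.

```javascript
// Assumption: the classification is a comment on the line directly above
// the snapshot assertion, e.g. `// PRODUCT_CONTRACT: public API payload`.
const SNAPSHOT_CALL = /\.toMatch(Inline)?Snapshot\(/;
const CLASSIFICATION = /\/\/\s*(PRODUCT_CONTRACT|IMPLEMENTATION_DETAIL)\b/;

function reviewSnapshots(source) {
  const lines = source.split('\n');
  const findings = [];
  lines.forEach((line, i) => {
    if (!SNAPSHOT_CALL.test(line)) return;
    const tag = (lines[i - 1] || '').match(CLASSIFICATION);
    if (!tag) {
      findings.push({ line: i + 1, verdict: 'REJECTED', reason: 'unclassified snapshot' });
    } else if (tag[1] === 'IMPLEMENTATION_DETAIL') {
      findings.push({ line: i + 1, verdict: 'REJECTED', reason: 'snapshot of implementation detail' });
    }
  });
  return findings;
}

const sample = [
  '// PRODUCT_CONTRACT: serialized /orders response',
  'expect(serialize(order)).toMatchSnapshot();',
  'expect(rendered).toMatchSnapshot();',
].join('\n');

console.log(reviewSnapshots(sample));
// one finding: line 3, unclassified snapshot
```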
|
|
89
|
+
|
|
90
|
+
## Gate 6: No assertion on self-set mock
|
|
91
|
+
|
|
92
|
+
**The failure mode it blocks:** Agent writes `mockFn.mockReturnValue('X')`, then asserts `expect(mockFn()).toBe('X')`. Proves nothing.
|
|
93
|
+
|
|
94
|
+
**The gate:**
|
|
95
|
+
|
|
96
|
+
The agent cannot assert on values it directly fed into a mock. Assertions must be on:
|
|
97
|
+
- The OUTPUT of production code that consumed the mock.
|
|
98
|
+
- The SIDE EFFECTS (DB state, network calls, event emissions) caused by production code.
|
|
99
|
+
- The VISIBLE behavior (UI change, log line, response) the user/caller observes.
|
|
100
|
+
|
|
101
|
+
**Verification:** Diff analysis flags pairs where a literal value appears in BOTH a mock setup AND an assertion. Flagged = REJECTED unless the agent can show the value passed through production code.
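
One way such a diff analysis might look, as a rough sketch: collect string literals passed to mock setup calls, collect string literals used in equality assertions, and report the overlap. The regexes cover only a few common Jest/Vitest call shapes, and `selfSetMockLiterals` is a hypothetical name.

```javascript
// Hypothetical diff-analysis pass: collects string literals fed into mocks
// and string literals used in equality assertions, then reports the overlap.
const MOCK_SETUP = /\.mock(?:Return|Resolved)Value(?:Once)?\(\s*(['"`])(.*?)\1/g;
const ASSERTION = /\.(?:toBe|toEqual|toStrictEqual)\(\s*(['"`])(.*?)\1/g;

function literalsMatching(pattern, source) {
  return new Set([...source.matchAll(pattern)].map((m) => m[2]));
}

function selfSetMockLiterals(source) {
  const fedToMocks = literalsMatching(MOCK_SETUP, source);
  const asserted = literalsMatching(ASSERTION, source);
  return [...fedToMocks].filter((value) => asserted.has(value));
}

const suspicious = `
  mockFn.mockReturnValue('X');
  expect(mockFn()).toBe('X');
`;
const fine = `
  fetchRate.mockResolvedValue('0.25');
  expect(await quote(100)).toBe('125.00');
`;

console.log(selfSetMockLiterals(suspicious)); // [ 'X' ]
console.log(selfSetMockLiterals(fine));       // []
```

A hit is a flag for review, not an automatic verdict: the rule still allows the value when the agent can show it passed through production code.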
|
|
102
|
+
|
|
103
|
+
## Gate 7: Negative companion
|
|
104
|
+
|
|
105
|
+
**The failure mode it blocks:** Agent writes happy-path-only tests. Edge cases, error paths, boundaries uncovered.
|
|
106
|
+
|
|
107
|
+
**The gate:**
|
|
108
|
+
|
|
109
|
+
Every positive assertion the agent writes ships WITH at least one negative companion:
|
|
110
|
+
|
|
111
|
+
- Asserting `createUser(validInput)` succeeds → also assert `createUser(invalidInput)` fails with a specific error.
|
|
112
|
+
- Asserting `parseDate(validString)` returns a Date → also assert `parseDate(invalidString)` throws/returns null.
|
|
113
|
+
- Asserting `transferFunds(...)` succeeds with sufficient balance → also assert it fails with insufficient balance.
|
|
114
|
+
|
|
115
|
+
**Verification:** A PR adding N positive assertions must add ≥1 negative assertion per public path. Imbalance >3:1 (positive:negative) on a public path = REJECTED.
|
|
116
|
+
|
|
117
|
+
## How the gates compose
|
|
118
|
+
|
|
119
|
+
Together, the seven gates produce tests that:
|
|
120
|
+
1. State what they prove (invariant first).
|
|
121
|
+
2. Live at the right layer (owning layer).
|
|
122
|
+
3. Exercise reality somewhere (real execution).
|
|
123
|
+
4. Reveal bugs when red (failure → fix production).
|
|
124
|
+
5. Assert specifically, not via snapshots (no snapshot w/o contract).
|
|
125
|
+
6. Assert outputs, not setup (no self-mock assertion).
|
|
126
|
+
7. Cover failures, not just success (negative companion).
|
|
127
|
+
|
|
128
|
+
A test passing all seven is a test worth running. A test missing any one is more likely to mislead than help.
|
|
129
|
+
|
|
130
|
+
## Override procedure
|
|
131
|
+
|
|
132
|
+
If an agent (or user) wants to skip a gate, they must:
|
|
133
|
+
1. State which gate is being skipped.
|
|
134
|
+
2. State why (one sentence).
|
|
135
|
+
3. Add a `// SKIP-GATE-N: <reason>` comment in the test.
|
|
136
|
+
4. Open a follow-up issue tracking the gap.
|
|
137
|
+
|
|
138
|
+
Without all four, the gate is enforced.
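
A reviewer tool could pick up the override comments with a scan like the one below. This is a sketch under the assumption that the comment follows the `// SKIP-GATE-N: <reason>` shape exactly; `parseGateOverrides` is a hypothetical name, and the check for a follow-up issue (step 4) is out of scope here.

```javascript
// Hypothetical parser for override comments of the form `// SKIP-GATE-N: <reason>`.
const SKIP_GATE = /\/\/[ \t]*SKIP-GATE-([1-7]):[ \t]*(.*)$/gm;

function parseGateOverrides(source) {
  return [...source.matchAll(SKIP_GATE)].map((m) => ({
    gate: Number(m[1]),
    reason: m[2].trim(),
    valid: m[2].trim().length > 0, // an override without a reason does not count
  }));
}

const testFile = `
// SKIP-GATE-3: payment sandbox is down, tracked in a follow-up issue
test('charges the card', () => {});
// SKIP-GATE-7:
test('renders the receipt', () => {});
`;

console.log(parseGateOverrides(testFile));
// two overrides: gate 3 (valid), gate 7 (invalid, empty reason)
```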
|
|
139
|
+
|
|
140
|
+
## Prompt block to include when invoking the agent
|
|
141
|
+
|
|
142
|
+
```
|
|
143
|
+
You are about to write tests. Before producing test code, complete the
|
|
144
|
+
three-line Gate 1 preamble:
|
|
145
|
+
|
|
146
|
+
INVARIANT: ___
|
|
147
|
+
OWNING_LAYER: ___
|
|
148
|
+
EXISTING_SUITE: ___
|
|
149
|
+
|
|
150
|
+
If you cannot complete these three lines, STOP and ask the user for
|
|
151
|
+
the requirement (do not invent an invariant).
|
|
152
|
+
|
|
153
|
+
Then, while writing tests:
|
|
154
|
+
- Real execution: name the real-system path covering this code.
|
|
155
|
+
- On red: investigate production before changing tests; print ANALYSIS.
|
|
156
|
+
- Snapshots: classify as PRODUCT_CONTRACT or IMPLEMENTATION_DETAIL.
|
|
157
|
+
- Assertions: never assert on values you fed into a mock.
|
|
158
|
+
- Coverage: every positive assertion needs a negative companion.
|
|
159
|
+
|
|
160
|
+
Tests that violate gates without explicit SKIP-GATE-N comments will be
|
|
161
|
+
REJECTED at review.
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
`/dw-run-task` and `/dw-bugfix` inject this prompt before generating test code.
|
|
165
|
+
|
|
166
|
+
## Why these seven and not more
|
|
167
|
+
|
|
168
|
+
These are the seven LLM failure modes empirically observed in test generation across multiple projects (per pedronauck/skills/testing-boss, MIT, plus dev-workflow internal observation). Other tendencies exist; they're either covered by the positive patterns (e.g., wall-clock waits) or have lower hit-rate.
|
|
169
|
+
|
|
170
|
+
If a NEW LLM failure mode appears that none of the seven catches, add a gate AND document the failure mode that motivated it. Don't add gates speculatively.
|
|
@@ -0,0 +1,336 @@
|
|
|
1
|
+
# Anti-patterns — 25 patterns across 5 families
|
|
2
|
+
|
|
3
|
+
Each anti-pattern names the smell, shows the violation in pseudo-code, gives the fix, and notes how `/dw-code-review` detects it.
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Family 1: Brittleness (tests bound to internals)
|
|
8
|
+
|
|
9
|
+
### A1. Implementation-detail selectors
|
|
10
|
+
|
|
11
|
+
**Violation:**
|
|
12
|
+
```javascript
|
|
13
|
+
await page.click('.btn.btn-primary.checkout-button');
|
|
14
|
+
```
|
|
15
|
+
|
|
16
|
+
**Fix:** Use `getByRole('button', { name: 'Checkout' })`.
|
|
17
|
+
|
|
18
|
+
**Detection:** Grep for `.click(`, `.querySelector(`, `cy.get('.`, `getByTestId(` with a class-flavored argument.
|
|
19
|
+
|
|
20
|
+
### A2. Testing internal structure vs observable behavior
|
|
21
|
+
|
|
22
|
+
**Violation:**
|
|
23
|
+
```javascript
|
|
24
|
+
expect(component.state.cart.items.length).toBe(3);
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
**Fix:** Assert what the user sees: `await expect(page.getByText('3 items in cart')).toBeVisible()`.
|
|
28
|
+
|
|
29
|
+
**Detection:** Tests that import/inspect internal state, refs, or private fields. Class-based component tests that read `.state` directly.
|
|
30
|
+
|
|
31
|
+
### A3. Testing private methods directly
|
|
32
|
+
|
|
33
|
+
**Violation:**
|
|
34
|
+
```javascript
|
|
35
|
+
expect(orderService._calculateTax(...)).toBe(8.5);
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
**Fix:** Test the public method that uses tax calculation; verify the result. If the private method is independently complex, extract it to a module and test that module's public API.
|
|
39
|
+
|
|
40
|
+
**Detection:** Identifiers starting with `_` accessed from tests.
|
|
41
|
+
|
|
42
|
+
### A4. Snapshot-as-test (snapshot replacing assertion)
|
|
43
|
+
|
|
44
|
+
**Violation:**
|
|
45
|
+
```javascript
|
|
46
|
+
expect(rendered).toMatchSnapshot(); // ← only assertion in test
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
**Fix:** Either:
|
|
50
|
+
1. Write specific assertions about what the component renders, OR
|
|
51
|
+
2. Use a snapshot AS A SECONDARY check after specific assertions, with a comment classifying the snapshot as `PRODUCT_CONTRACT` (UI contract worth pinning) — never `IMPLEMENTATION_DETAIL`.
|
|
52
|
+
|
|
53
|
+
**Detection:** Tests where the only assertion is `toMatchSnapshot` or equivalent.
|
|
54
|
+
|
|
55
|
+
### A5. Vague existence assertions
|
|
56
|
+
|
|
57
|
+
**Violation:**
|
|
58
|
+
```javascript
|
|
59
|
+
expect(result).toBeTruthy();
|
|
60
|
+
expect(element).toBeDefined();
|
|
61
|
+
cy.get('button').should('exist');
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
**Fix:** Assert what you actually want: `expect(result.status).toBe('success')`, `expect(button).toBeEnabled()`, `expect(button).toHaveText('Continue')`.
|
|
65
|
+
|
|
66
|
+
**Detection:** Tests asserting only existence/truthiness without follow-up semantic check.
|
|
67
|
+
|
|
68
|
+
### A6. Action without assertion
|
|
69
|
+
|
|
70
|
+
**Violation:**
|
|
71
|
+
```javascript
|
|
72
|
+
test('clicking save works', async () => {
|
|
73
|
+
await page.getByRole('button', { name: 'Save' }).click();
|
|
74
|
+
// ← no assertion. What did "works" mean?
|
|
75
|
+
});
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
**Fix:** Define what "works" means. Assert the observable outcome: URL changed, modal closed, success message visible, data persisted.
|
|
79
|
+
|
|
80
|
+
**Detection:** Tests with `await x.click()` or `await x.type()` followed by no `expect(...)`.
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Family 2: Flakiness (tests randomizing verdicts)
|
|
85
|
+
|
|
86
|
+
### A7. Static sleeps / fixed-timeout waits
|
|
87
|
+
|
|
88
|
+
**Violation:**
|
|
89
|
+
```javascript
|
|
90
|
+
await page.waitForTimeout(2000);
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
**Fix:** `await expect(page.getByText(/order confirmed/i)).toBeVisible({ timeout: 5000 })` — wait on the actual condition.
|
|
94
|
+
|
|
95
|
+
**Detection:** Grep for `waitForTimeout`, `cy.wait(<number>)`, `sleep(`, `Thread.sleep`, `time.sleep` in test files.
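
The grep can also be expressed as a small function, which makes it easier to tell a numeric `cy.wait(500)` apart from an alias wait such as `cy.wait('@createOrder')`. A sketch only; `findStaticSleeps` is a hypothetical name and the pattern list simply mirrors the detection rule above.

```javascript
// Sketch of the grep as a function. The pattern list mirrors the detection
// rule above; extend it for other runners as needed.
const STATIC_SLEEPS = [
  /\bwaitForTimeout\(/,
  /\bcy\.wait\(\s*\d+\s*\)/,
  /\bsleep\(/,
  /\bThread\.sleep\b/,
  /\btime\.sleep\b/,
];

function findStaticSleeps(source) {
  return source
    .split('\n')
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter(({ text }) => STATIC_SLEEPS.some((pattern) => pattern.test(text)));
}

const spec = [
  "await page.getByRole('button', { name: 'Pay' }).click();",
  'await page.waitForTimeout(2000);',
  "cy.wait('@createOrder');",
  'cy.wait(500);',
].join('\n');

console.log(findStaticSleeps(spec).map((hit) => hit.line)); // [ 2, 4 ]
```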
|
|
96
|
+
|
|
97
|
+
### A8. Test order dependency / hidden shared state
|
|
98
|
+
|
|
99
|
+
**Violation:** Test B passes only after Test A has run because A populates a shared cache or DB row.
|
|
100
|
+
|
|
101
|
+
**Fix:** Each test sets up its own state in `beforeEach`. Verify by running tests with `--shuffle` or `--randomize`.
|
|
102
|
+
|
|
103
|
+
**Detection:** Tests fail when run with `.only`. Tests fail with `--shuffle`. Setup in `beforeAll` instead of `beforeEach`.
|
|
104
|
+
|
|
105
|
+
### A9. Non-deterministic inputs (clock, RNG, locale)
|
|
106
|
+
|
|
107
|
+
**Violation:**
|
|
108
|
+
```javascript
|
|
109
|
+
test('today is Monday', () => {
|
|
110
|
+
expect(new Date().getDay()).toBe(1); // fails 6 days a week
|
|
111
|
+
});
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
**Fix:** Mock the clock (`vi.useFakeTimers()`, `jest.useFakeTimers()`, `freezegun` in Python). Seed RNG. Pin locale.
|
|
115
|
+
|
|
116
|
+
**Detection:** Tests using `new Date()`, `Date.now()`, `Math.random()`, `Intl.DateTimeFormat` without fakes.
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## Family 3: Mock misuse (tests testing the test setup)
|
|
121
|
+
|
|
122
|
+
### A10. Asserting the mock exists
|
|
123
|
+
|
|
124
|
+
**Violation:**
|
|
125
|
+
```javascript
|
|
126
|
+
expect(mockFn).toBeDefined();
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
**Fix:** Don't assert on mock setup. If the mock is wrong, the behavior assertion downstream will fail naturally.
|
|
130
|
+
|
|
131
|
+
**Detection:** Mock functions referenced in assertions without `toHaveBeenCalled` semantics.
|
|
132
|
+
|
|
133
|
+
### A11. Mock drift
|
|
134
|
+
|
|
135
|
+
**Violation:** Mocked API response set up 6 months ago still returns `{ status: 'OK' }` while the real API now returns `{ ok: true }`.
|
|
136
|
+
|
|
137
|
+
**Fix:** Contract testing (Pact, schemathesis) or periodic recording (msw + real-traffic capture). Re-validate mocks against real APIs quarterly.
|
|
138
|
+
|
|
139
|
+
**Detection:** Tests with mocks that haven't been touched in >90 days against APIs that have changed. Hard to detect in CI; needs explicit contract checks.
|
|
140
|
+
|
|
141
|
+
### A12. Over-mocking child components
|
|
142
|
+
|
|
143
|
+
**Violation:**
|
|
144
|
+
```javascript
|
|
145
|
+
vi.mock('./UserAvatar');
|
|
146
|
+
vi.mock('./UserMenu');
|
|
147
|
+
vi.mock('./UserBanner');
|
|
148
|
+
// ... testing nothing real
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
**Fix:** Mock at boundaries (HTTP, DB, third-party SDKs). Render real children unless they're genuinely expensive or test-irrelevant.
|
|
152
|
+
|
|
153
|
+
**Detection:** Test files with >3 module mocks of internal modules.
|
|
154
|
+
|
|
155
|
+
### A13. Incomplete mocks (missing fields the code reads)
|
|
156
|
+
|
|
157
|
+
**Violation:**
|
|
158
|
+
```javascript
|
|
159
|
+
const mockUser = { id: 1 }; // missing email, but code reads user.email
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
**Fix:** Use a factory that supplies sensible defaults for ALL fields the type/contract declares.
|
|
163
|
+
|
|
164
|
+
**Detection:** Runtime errors like `Cannot read properties of undefined (reading 'X')` inside production code under test.
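
A minimal sketch of such a factory. The field names and `greetingFor` are assumptions made for the example; in a real project the defaults would follow the actual type or contract.

```javascript
// Hypothetical factory: every field the code under test reads gets a default;
// tests override only what matters to them.
function makeUser(overrides = {}) {
  return {
    id: 1,
    email: 'user@example.test',
    name: 'Test User',
    roles: ['member'],
    ...overrides,
  };
}

// Hypothetical production code that reads more than `id`.
function greetingFor(user) {
  return `Hello ${user.name} <${user.email.toLowerCase()}>`;
}

console.log(greetingFor(makeUser({ name: 'Ada' })));
// Hello Ada <user@example.test>
```

Passing the bare `{ id: 1 }` object from the violation to `greetingFor` throws a `TypeError`, which is the failure the factory prevents.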
|
|
165
|
+
|
|
166
|
+
### A14. Mocking wrong level (mocking methods the logic depends on)
|
|
167
|
+
|
|
168
|
+
**Violation:**
|
|
169
|
+
```javascript
|
|
170
|
+
// Testing OrderService, but mocking its private calculate() method
|
|
171
|
+
const service = new OrderService();
|
|
172
|
+
vi.spyOn(service, 'calculate').mockReturnValue(100);
|
|
173
|
+
expect(service.processOrder(...)).toBe(/* uses mocked 100 */);
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
You've tested the SCAFFOLD, not the logic.
|
|
177
|
+
|
|
178
|
+
**Fix:** Mock at the EDGE (DB call, HTTP call, time). Let internal logic run.
|
|
179
|
+
|
|
180
|
+
**Detection:** Spies on methods of the System Under Test itself.
|
|
181
|
+
|
|
182
|
+
---
|
|
183
|
+
|
|
184
|
+
## Family 4: Process (team and suite pathologies)
|
|
185
|
+
|
|
186
|
+
### A15. Coverage as vanity metric
|
|
187
|
+
|
|
188
|
+
**Violation:** PR comments demanding "you need to hit 90% coverage" with no discussion of what the coverage means.
|
|
189
|
+
|
|
190
|
+
**Fix:** Coverage is a flashlight. Use it to FIND blind spots. Don't optimize for the number.
|
|
191
|
+
|
|
192
|
+
**Detection:** Cultural; visible in PR templates that gate on coverage percentage.
|
|
193
|
+
|
|
194
|
+
### A16. Happy-path-only coverage
|
|
195
|
+
|
|
196
|
+
**Violation:** Every test exercises the success case. Edge cases, error paths, boundary values uncovered.
|
|
197
|
+
|
|
198
|
+
**Fix:** For each unit, write at minimum: happy path + 1 boundary + 1 invalid input + 1 failure path.
|
|
199
|
+
|
|
200
|
+
**Detection:** Tests where every assertion is positive (`toBe`, `toEqual`) and none is negative (`toThrow`, `rejects.toThrow`).
|
|
201
|
+
|
|
202
|
+
### A17. Eternal `beforeAll` / shared setup hiding dependencies
|
|
203
|
+
|
|
204
|
+
**Violation:**
|
|
205
|
+
```javascript
|
|
206
|
+
beforeAll(async () => {
|
|
207
|
+
await db.users.create([100 users]);
|
|
208
|
+
await db.orders.create([500 orders]);
|
|
209
|
+
});
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
Tests now SHARE state. Order matters. Cleanup is fragile.
|
|
213
|
+
|
|
214
|
+
**Fix:** `beforeEach` with minimal setup specific to each test.
|
|
215
|
+
|
|
216
|
+
**Detection:** `beforeAll` blocks creating data (vs `beforeAll` blocks doing one-time framework setup like spinning testcontainers).
|
|
217
|
+
|
|
218
|
+
### A18. Cleanup in `afterEach` (use `beforeEach` instead)
|
|
219
|
+
|
|
220
|
+
**Violation:**
|
|
221
|
+
```javascript
|
|
222
|
+
afterEach(async () => {
|
|
223
|
+
await db.users.deleteAll();
|
|
224
|
+
});
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
If a test fails mid-run, cleanup might not run; next test starts dirty.
|
|
228
|
+
|
|
229
|
+
**Fix:** `beforeEach` with explicit setup-from-clean (truncate + seed). Reliable regardless of previous test outcome.
|
|
230
|
+
|
|
231
|
+
**Detection:** `afterEach` blocks doing state reset.
|
|
232
|
+
|
|
233
|
+
### A19. Magic strings and logic in tests
|
|
234
|
+
|
|
235
|
+
**Violation:**
|
|
236
|
+
```javascript
|
|
237
|
+
const TIMESTAMP = '2024-01-15T10:30:00Z'; // why?
|
|
238
|
+
expect(formatted).toBe('a long string with embedded specifics');
|
|
239
|
+
```
|
|
240
|
+
|
|
241
|
+
When the test fails, what was the test's INTENT?
|
|
242
|
+
|
|
243
|
+
**Fix:** Use factories with named defaults. Extract magic values to constants with documenting names. Use snapshot testing for legitimate snapshot cases (with classification).
|
|
244
|
+
|
|
245
|
+
**Detection:** Test files with ≥10 string literals not bound to a named variable.
|
|
246
|
+
|
|
247
|
+
### A20. Testing against third-party sites you don't control
|
|
248
|
+
|
|
249
|
+
**Violation:**
|
|
250
|
+
```javascript
|
|
251
|
+
test('Google homepage loads', async ({ page }) => {
|
|
252
|
+
await page.goto('https://google.com');
|
|
253
|
+
expect(await page.title()).toContain('Google');
|
|
254
|
+
});
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
You're testing Google's availability, not your code.
|
|
258
|
+
|
|
259
|
+
**Fix:** Mock the third party. Use WireMock or msw to fake their responses. If you must call them, do it in a separate "external dependencies up" smoke test, not in unit or integration tests.
|
|
260
|
+
|
|
261
|
+
**Detection:** External URLs in test code outside designated smoke tests.
|
|
262
|
+
|
|
263
|
+
### A21. Quarantine-as-cemetery
|
|
264
|
+
|
|
265
|
+
**Violation:**
|
|
266
|
+
```javascript
|
|
267
|
+
test.skip('flaky on CI sometimes', () => { /* ... */ });
|
|
268
|
+
// skipped 8 months ago, no owner, no fix-by date
|
|
269
|
+
```
|
|
270
|
+
|
|
271
|
+
**Fix:** Every skip/quarantine has a NAMED OWNER and a FIX-BY DATE. Tracking issue exists. PR that introduces the skip says exactly when the test gets fixed.
|
|
272
|
+
|
|
273
|
+
**Detection:** Skipped tests without comments/labels naming owner and date.
|
|
274
|
+
|
|
275
|
+
### A22. Retry-as-fix (auto-retry hiding real bugs)
|
|
276
|
+
|
|
277
|
+
**Violation:**
|
|
278
|
+
```javascript
|
|
279
|
+
// playwright.config (Jest equivalent: jest.retryTimes(3) in the test file)
|
|
280
|
+
retries: 3,
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
A flaky test is a SIGNAL. Retrying until green hides it.
|
|
284
|
+
|
|
285
|
+
**Fix:** When a test is flaky, FIX IT (probably a race condition or non-deterministic input). Quarantine if you can't fix immediately. Never just retry.
|
|
286
|
+
|
|
287
|
+
**Detection:** CI config with retry counts. Test runners showing "1 retry succeeded" badges.
|
|
288
|
+
|
|
289
|
+
### A23. Duplicate tests across pyramid layers
|
|
290
|
+
|
|
291
|
+
**Violation:** Same scenario tested at unit, integration, AND E2E. Triple maintenance, no triple value.
|
|
292
|
+
|
|
293
|
+
**Fix:** Apply Law 2 — lowest layer wins. Drop higher-layer duplicates.
|
|
294
|
+
|
|
295
|
+
**Detection:** Search for the same scenario name across `tests/unit`, `tests/integration`, `tests/e2e`.
|
|
296
|
+
|
|
297
|
+
### A24. Weakening tests to make them pass
|
|
298
|
+
|
|
299
|
+
**Violation:**
|
|
300
|
+
```diff
|
|
301
|
+
- expect(orders.length).toBe(5);
|
|
302
|
+
+ expect(orders.length).toBeGreaterThan(0);
|
|
303
|
+
```
|
|
304
|
+
|
|
305
|
+
The "fix" makes the test useless.
|
|
306
|
+
|
|
307
|
+
**Fix:** Read Law 3. Fix production OR document WHY the assertion is weaker.
|
|
308
|
+
|
|
309
|
+
**Detection:** PR diff shows assertion relaxation without commit body explanation.
|
|
310
|
+
|
|
311
|
+
### A25. Mock-driven confidence (test asserts on its own setup)
|
|
312
|
+
|
|
313
|
+
**Violation:**
|
|
314
|
+
```javascript
|
|
315
|
+
const mock = vi.fn().mockReturnValue('hello');
|
|
316
|
+
expect(mock()).toBe('hello');
|
|
317
|
+
```
|
|
318
|
+
|
|
319
|
+
You wrote `hello` in the mock. You asserted `hello`. You proved nothing.
|
|
320
|
+
|
|
321
|
+
**Fix:** Assert on the OUTPUT of the production code that consumed the mock — not on the mock itself.
|
|
322
|
+
|
|
323
|
+
**Detection:** Tests asserting equality between a value the test body created and a value the test body retrieved.
|
|
324
|
+
|
|
325
|
+
---
|
|
326
|
+
|
|
327
|
+
## How `/dw-code-review` uses this catalog
|
|
328
|
+
|
|
329
|
+
For each diff hunk under a test path:
|
|
330
|
+
1. Run regex scans for the patterns flagged "Detection" above.
|
|
331
|
+
2. Each hit becomes a finding with severity from this skill's `dw-review-rigor` integration.
|
|
332
|
+
3. Hits classified as Brittleness/Flakiness/Mock-misuse → severity `high`.
|
|
333
|
+
4. Hits classified as Process → severity `medium`.
|
|
334
|
+
5. Hits where the SAME test has multiple patterns → severity `critical` (suite-health smell, not just one test).
|
|
335
|
+
|
|
336
|
+
A PR with ≥1 `high` test anti-pattern that lacks ADR justification gets REJECTED.
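The severity rules above could be sketched roughly as follows. This is an illustration only, not the actual `/dw-code-review` implementation: the hit shape (`test`, `pattern`, `category`), the pattern names, and the category labels are all assumptions made for the example.

```javascript
// Categories the list above maps to `high` (label spellings are assumed).
const HIGH_CATEGORIES = new Set(['brittleness', 'flakiness', 'mock-misuse']);

// `hits` is a hypothetical shape: { test, pattern, category }.
function assignSeverities(hits) {
  // Collect the distinct patterns found in each test.
  const patternsByTest = new Map();
  for (const hit of hits) {
    if (!patternsByTest.has(hit.test)) patternsByTest.set(hit.test, new Set());
    patternsByTest.get(hit.test).add(hit.pattern);
  }

  return hits.map((hit) => {
    // Assumption: anything not in HIGH_CATEGORIES is a Process hit.
    let severity = 'medium';
    if (HIGH_CATEGORIES.has(hit.category)) severity = 'high';
    // Several patterns in the same test outrank the per-category severity.
    if (patternsByTest.get(hit.test).size > 1) severity = 'critical';
    return { ...hit, severity };
  });
}

const findings = assignSeverities([
  { test: 'checkout retries', pattern: 'retry-count', category: 'flakiness' },
  { test: 'orders list', pattern: 'relaxed-assertion', category: 'process' },
  { test: 'greets user', pattern: 'self-asserting-mock', category: 'mock-misuse' },
  { test: 'greets user', pattern: 'relaxed-assertion', category: 'process' },
]);
```

In this sample, both hits on `greets user` end up `critical` because the same test carries two different patterns.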
|
|
@@ -0,0 +1,163 @@
|
|
|
1
|
+
# Flaky discipline — taxonomy, quarantine, SLOs
|
|
2
|
+
|
|
3
|
+
A flaky test is one that produces different verdicts (pass/fail) on the same code across runs. They corrode trust in the suite faster than any other category of test debt.
|
|
4
|
+
|
|
5
|
+
## The four root causes (in order of frequency)
|
|
6
|
+
|
|
7
|
+
### Cause 1: Race conditions (concurrency)
|
|
8
|
+
|
|
9
|
+
**Tells:**
|
|
10
|
+
- Test passes locally, fails in CI (or vice versa).
|
|
11
|
+
- Failure rate correlates with CI machine load.
|
|
12
|
+
- Adding `await page.waitForTimeout(100)` "fixes" it.
|
|
13
|
+
|
|
14
|
+
**Common scenarios:**
|
|
15
|
+
- Async operation completes after test moves on (missing `await`).
|
|
16
|
+
- Two requests sent simultaneously, response order matters.
|
|
17
|
+
- DOM update happens after assertion runs.
|
|
18
|
+
- Database write not yet committed when read fires.
|
|
19
|
+
|
|
20
|
+
**Fix:**
|
|
21
|
+
- Replace wall-clock waits with condition-based waits (`waitFor`, `toBeVisible`, `expect.poll`).
|
|
22
|
+
- Add proper `await` on every async operation.
|
|
23
|
+
- Use transaction boundaries explicitly when test reads its own write.
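To illustrate the first fix, here is a minimal sketch of a condition-based wait in plain JavaScript. The `waitFor` helper below is hypothetical and hand-written for the example; it is not the implementation shipped by any testing library, though library helpers with the same name follow the same idea of polling until a condition holds or a deadline passes.

```javascript
// Hypothetical helper: polls `condition` until it is truthy or the deadline passes.
async function waitFor(condition, { timeoutMs = 1000, intervalMs = 10 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const value = await condition();
    if (value) return value;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    // Yield to the event loop before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulated async work that finishes after a delay the test does not control.
let orderStatus = 'pending';
setTimeout(() => { orderStatus = 'paid'; }, 30);

// Waits exactly as long as needed instead of guessing a fixed sleep duration.
const waited = waitFor(() => orderStatus === 'paid');
```

Unlike a fixed `waitForTimeout(100)`, this returns as soon as the condition holds and fails with a clear message when it never does.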
|
|
24
|
+
|
|
25
|
+
### Cause 2: Test order dependency
|
|
26
|
+
|
|
27
|
+
**Tells:**
|
|
28
|
+
- Test passes when suite runs in order, fails with `--shuffle`.
|
|
29
|
+
- Test fails when run with `.only` in isolation.
|
|
30
|
+
- Failures cluster on first run after CI restart but not afterwards.
|
|
31
|
+
|
|
32
|
+
**Common scenarios:**
|
|
33
|
+
- `beforeAll` populates shared state; second test mutates it; third test fails.
|
|
34
|
+
- Test A creates a global mock; Test B inherits it unexpectedly.
|
|
35
|
+
- Database row persists across tests because cleanup is in `afterEach` but a test threw mid-execution.
|
|
36
|
+
|
|
37
|
+
**Fix:**
|
|
38
|
+
- Move state creation from `beforeAll` to `beforeEach`.
|
|
39
|
+
- Reset shared state in `beforeEach` (clean slate every test).
|
|
40
|
+
- Avoid global mocks; scope mocks to the test that needs them.
|
|
41
|
+
- Run with `--shuffle` in CI to catch new order dependencies.
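The "clean slate every test" idea can be sketched without a test runner. In the example below, `createCart`, the two test functions, and `runAll` are hypothetical; `createCart` plays the role of a `beforeEach` that builds fresh state, and running both orders mimics what a shuffled run would check.

```javascript
// Fresh fixture per test: the plain-JS equivalent of building state in beforeEach.
function createCart() {
  return { items: [] };
}

// Each test receives its own cart and returns true when its expectation holds.
const tests = {
  addsItem(cart) {
    cart.items.push('book');
    return cart.items.length === 1;
  },
  startsEmpty(cart) {
    return cart.items.length === 0;
  },
};

// Run the tests in the given order, handing each one a brand-new cart.
function runAll(order) {
  return order.map((name) => tests[name](createCart()));
}

const forward = runAll(['addsItem', 'startsEmpty']);
const reversed = runAll(['startsEmpty', 'addsItem']);
```

If both tests shared one cart created once (the `beforeAll` shape), `startsEmpty` would pass only when it happened to run first.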
|
|
42
|
+
|
|
43
|
+
### Cause 3: Non-deterministic inputs
|
|
44
|
+
|
|
45
|
+
**Tells:**
|
|
46
|
+
- Test fails at month boundary, year boundary, DST change.
|
|
47
|
+
- Test fails based on hostname, locale, timezone.
|
|
48
|
+
- Test fails when an unseeded RNG produces edge values.
|
|
49
|
+
|
|
50
|
+
**Common scenarios:**
|
|
51
|
+
- `new Date()` in production code, tested without clock fake.
|
|
52
|
+
- `Math.random()` for IDs, tested without seed.
|
|
53
|
+
- `Intl.DateTimeFormat` rendering based on system locale.
|
|
54
|
+
- File paths with timestamps, hash IDs based on time.
|
|
55
|
+
|
|
56
|
+
**Fix:**
|
|
57
|
+
- Mock the clock (`vi.useFakeTimers`, `freezegun`).
|
|
58
|
+
- Make the RNG deterministic in tests: stub it (`Math.random = () => 0.5`) or inject a seeded generator via DI. `Math.random` itself cannot be seeded.
|
|
59
|
+
- Pin locale and timezone in CI environment AND in test setup.
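A minimal sketch of the dependency-injection route, assuming a hypothetical `createOrderId` function. The clock and the random source are parameters with production defaults, so a test can pin both without patching globals.

```javascript
// Hypothetical production code: clock and RNG are injectable, with real defaults.
function createOrderId({ now = () => new Date(), random = Math.random } = {}) {
  // toISOString() is always UTC, so the date part does not depend on the host timezone.
  const day = now().toISOString().slice(0, 10);
  const suffix = Math.floor(random() * 1e6).toString().padStart(6, '0');
  return `${day}-${suffix}`;
}

// In a test: pin both inputs so the result is the same on every run and machine.
const fixedNow = () => new Date('2026-01-01T00:00:00Z');
const fixedRandom = () => 0.5;

const orderId = createOrderId({ now: fixedNow, random: fixedRandom });
// orderId === '2026-01-01-500000'
```

Fake-timer utilities achieve the same effect for code that calls `new Date()` directly; injection simply makes the dependency visible in the signature.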
|
|
60
|
+
|
|
61
|
+
### Cause 4: External dependencies
|
|
62
|
+
|
|
63
|
+
**Tells:**
|
|
64
|
+
- Test fails when a third-party service has an outage.
|
|
65
|
+
- Test fails when CI runs against a real API and hits rate limits.
|
|
66
|
+
- Test fails differently for different geographic CI runners.
|
|
67
|
+
|
|
68
|
+
**Common scenarios:**
|
|
69
|
+
- Direct call to external API in unit tests.
|
|
70
|
+
- DNS lookup baked into test execution path.
|
|
71
|
+
- CDN-hosted resources in E2E tests.
|
|
72
|
+
|
|
73
|
+
**Fix:**
|
|
74
|
+
- Mock external services at unit/integration layers.
|
|
75
|
+
- Use contract tests instead of live calls.
|
|
76
|
+
- For E2E, use a sandbox account / dedicated test environment.
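As a sketch of the first fix, the example below replaces a live rate service with a stub at the unit layer. `convert`, `rateClient`, and `getRate` are hypothetical names chosen for illustration; no real API is implied.

```javascript
// Hypothetical production code: the external client is injected, not imported.
async function convert(amount, currency, rateClient) {
  const rate = await rateClient.getRate(currency);
  // Round to two decimal places.
  return Math.round(amount * rate * 100) / 100;
}

// Stub for the external service: no network, no rate limits, no outages.
const stubRateClient = {
  async getRate(currency) {
    if (currency !== 'EUR') throw new Error(`unexpected currency: ${currency}`);
    return 0.5;
  },
};

// The result depends only on the production logic and the pinned rate.
const converted = convert(10, 'EUR', stubRateClient);
```

The stub only stays trustworthy if something verifies that the real service still answers in the same shape, which is the job of the contract tests mentioned above.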
|
|
77
|
+
|
|
78
|
+
## Quarantine workflow
|
|
79
|
+
|
|
80
|
+
When a test flakes:
|
|
81
|
+
|
|
82
|
+
### Within 1 hour of detection
|
|
83
|
+
|
|
84
|
+
1. **Quarantine the test.** Add `.skip` or equivalent. Add a comment:
|
|
85
|
+
|
|
86
|
+
```javascript
|
|
87
|
+
test.skip('FLAKY-2026-05-12: race condition in checkout flow — owner: bruno, fix-by: 2026-05-19', () => {
|
|
88
|
+
// ...
|
|
89
|
+
});
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
2. **File a tracking issue.** Title: `FLAKY: <test name>`. Body includes:
|
|
93
|
+
- Test name and file
|
|
94
|
+
- Failure mode observed (race? order-dependency? non-determinism?)
|
|
95
|
+
- First detection: CI run URL, timestamp
|
|
96
|
+
- Hypothesis (if any)
|
|
97
|
+
- Owner and fix-by date
|
|
98
|
+
|
|
99
|
+
3. **Note in CI.** The next CI run shows "1 quarantined" — make this visible on the dashboard.
|
|
100
|
+
|
|
101
|
+
### Within 24 hours
|
|
102
|
+
|
|
103
|
+
1. **Named owner assigned.** Not "team X" — a person.
|
|
104
|
+
2. **Fix-by date set.** Default 5 business days. Major flake (production-path test): 2 days.
|
|
105
|
+
|
|
106
|
+
### When fix-by passes without fix
|
|
107
|
+
|
|
108
|
+
Escalate:
|
|
109
|
+
- Pair the owner with someone for a debug session.
|
|
110
|
+
- If still unfixed after 2× the fix-by window, the test is removed (not skipped). A failing un-skipped test is better than a perpetually skipped test.
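The escalation rule can be checked mechanically from the quarantine marker. The sketch below assumes the marker format shown in the quarantine comment earlier (`FLAKY-<date>: ... owner: <name>, fix-by: <date>`), measures the fix-by window in calendar days from the quarantine date, and uses hypothetical names throughout; business-day handling is left out.

```javascript
// Matches the marker format used in the quarantine comment (assumed to be stable).
const MARKER = /FLAKY-(\d{4}-\d{2}-\d{2}):.*owner: ([^,]+), fix-by: (\d{4}-\d{2}-\d{2})/;

// Returns the next action for a quarantined test, or null if there is no marker.
function quarantineStatus(title, today) {
  const match = MARKER.exec(title);
  if (!match) return null;
  const [, quarantinedOn, owner, fixBy] = match;

  // Date-only ISO strings parse as UTC midnight, so differences are whole days.
  const start = Date.parse(quarantinedOn);
  const due = Date.parse(fixBy);
  const now = Date.parse(today);

  // Assumption: "2x the fix-by window" is counted from the quarantine date.
  const removeAfter = start + 2 * (due - start);

  let action = 'wait';
  if (now > removeAfter) action = 'remove';
  else if (now > due) action = 'escalate';
  return { owner, fixBy, action };
}

const title = 'FLAKY-2026-05-12: race condition in checkout flow - owner: bruno, fix-by: 2026-05-19';
const onTime = quarantineStatus(title, '2026-05-15');
const overdue = quarantineStatus(title, '2026-05-20');
const expired = quarantineStatus(title, '2026-05-27');
```

With a 7-day window, the test is due on 2026-05-19, escalates the day after, and becomes a removal candidate after 2026-05-26.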
|
|
111
|
+
|
|
112
|
+
## SLOs
|
|
113
|
+
|
|
114
|
+
### `flaky_rate` (first-class metric)
|
|
115
|
+
|
|
116
|
+
- Definition: `(test runs that fail on the first attempt but pass on retry) / (total test runs)`.
|
|
117
|
+
- Target: < 1–2% per week.
|
|
118
|
+
- Alert at: > 5% on any given day.
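A minimal sketch of how `flaky_rate` might be computed from CI records. The record shape (`attempts`, `passed`) is an assumption; real CI exports will differ.

```javascript
// Each record is one test run: how many attempts it took and whether it ended green.
// (Hypothetical shape; adapt to whatever the CI system exports.)
function flakyRate(runs) {
  if (runs.length === 0) return 0;
  // Flaky = failed on the first attempt but passed on a later one.
  const flaky = runs.filter((run) => run.attempts > 1 && run.passed).length;
  return flaky / runs.length;
}

const dailyRuns = [
  { test: 'login', attempts: 1, passed: true },
  { test: 'checkout', attempts: 2, passed: true },  // flaky
  { test: 'search', attempts: 1, passed: true },
  { test: 'export', attempts: 2, passed: false },   // consistently failing, not flaky
  { test: 'profile', attempts: 1, passed: true },
];

const rate = flakyRate(dailyRuns);        // 1 flaky run out of 5
const shouldAlert = rate > 0.05;          // daily alert threshold from the list above
```

Runs that fail on every attempt are excluded from the numerator: they are real failures, not flakes.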
|
|
119
|
+
|
|
120
|
+
### `time-to-fix-flaky`
|
|
121
|
+
|
|
122
|
+
- Definition: hours from quarantine to fix-merged.
|
|
123
|
+
- Target: median < 24 hours; p95 < 7 days.
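The targets above could be evaluated as in the sketch below. It uses the nearest-rank percentile definition, which is one of several common conventions and is an assumption here (for an even number of samples its median is the lower middle value, not the average of the two). Function names are hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Checks the two targets: median < 24 hours and p95 < 7 days.
function timeToFixSlo(hoursToFix) {
  const median = percentile(hoursToFix, 50);
  const p95 = percentile(hoursToFix, 95);
  return { median, p95, met: median < 24 && p95 < 7 * 24 };
}

// Hours from quarantine to fix-merged for five flakes (made-up sample data).
const slo = timeToFixSlo([30, 2, 200, 8, 5]);
```

In this sample the median is 8 hours, but the slowest fix took 200 hours, so the p95 target of 168 hours is missed.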
|
|
124
|
+
|
|
125
|
+
### `quarantine inventory`
|
|
126
|
+
|
|
127
|
+
- Definition: count of currently-skipped tests with `FLAKY-*` markers.
|
|
128
|
+
- Target: < 10 at any time.
|
|
129
|
+
- Alert at: > 25 (the quarantine has become a cemetery — emergency cleanup).
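Counting the inventory can be as simple as scanning test sources for skipped tests whose title starts with `FLAKY-`. The sketch below works on in-memory strings to stay self-contained; the regex, the function name, and the `above target` label for the range between the two thresholds are assumptions.

```javascript
// Matches test.skip / it.skip / describe.skip whose title starts with FLAKY-.
const FLAKY_SKIP = /\b(?:test|it|describe)\.skip\(\s*['"`]FLAKY-/g;

// `sources` is a list of test file contents (read from disk in a real script).
function quarantineInventory(sources) {
  let count = 0;
  for (const source of sources) {
    count += (source.match(FLAKY_SKIP) || []).length;
  }
  let status = 'within target';          // fewer than 10
  if (count > 25) status = 'alert';      // emergency cleanup
  else if (count >= 10) status = 'above target';
  return { count, status };
}

const inventory = quarantineInventory([
  "test.skip('FLAKY-2026-05-12: race in checkout', () => {});\ntest('ok', () => {});",
  "it.skip('FLAKY-2026-05-13: order dependency', () => {});\ntest.skip('no marker', () => {});",
]);
```

Skips without a `FLAKY-*` marker are deliberately not counted here; they are the unowned skips the next section warns about and deserve their own check.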
|
|
130
|
+
|
|
131
|
+
## What NOT to do
|
|
132
|
+
|
|
133
|
+
- **Auto-retry as fix.** `retries: 3` in CI config is hiding flakes, not fixing them. The 4th run that finally passes still validated nothing.
|
|
134
|
+
- **Increase timeouts indefinitely.** A timeout that grows from 5s to 30s "to make CI pass" means the test isn't waiting on the right condition.
|
|
135
|
+
- **Remove the test without investigation.** "It's been flaky forever, delete it" — sometimes correct, but make sure the underlying invariant is captured elsewhere.
|
|
136
|
+
- **Mark skip without owner.** A skip is a debt. An unowned debt is a perpetual liability.
|
|
137
|
+
|
|
138
|
+
## When a test should be permanently removed
|
|
139
|
+
|
|
140
|
+
A flaky test should be DELETED (not just skipped) when:
|
|
141
|
+
|
|
142
|
+
1. The invariant it tests is covered elsewhere (duplicate per A23).
|
|
143
|
+
2. The invariant it tests is no longer a real requirement.
|
|
144
|
+
3. The test was always probabilistic by design and never had value.
|
|
145
|
+
|
|
146
|
+
Deletion is acceptable; abandonment-by-skip is not.
|
|
147
|
+
|
|
148
|
+
## Real-systems-at-final-gate principle
|
|
149
|
+
|
|
150
|
+
Many flakes come from mocks drifting from reality. The defense:
|
|
151
|
+
|
|
152
|
+
- **Unit:** mock the world; fast feedback; flake budget tiny here.
|
|
153
|
+
- **Integration:** real DB (testcontainers); mock external services with contract validation.
|
|
154
|
+
- **Contract:** Pact / schemathesis verifying producer-consumer agreement.
|
|
155
|
+
- **E2E:** real services in a preview environment; near-zero mocks.
|
|
156
|
+
|
|
157
|
+
When CI is wired this way, a flake at the unit layer usually means a race or an order dependency (fixable in the test). A flake at the E2E layer usually means a real environment issue (fix the environment, not the test).
|
|
158
|
+
|
|
159
|
+
## Integration with dev-workflow
|
|
160
|
+
|
|
161
|
+
- `/dw-fix-qa` uses this taxonomy when retest cycles produce inconsistent results: classify the flake, apply the right fix, document.
|
|
162
|
+
- `/dw-code-review` flags tests being modified that have a `FLAKY-*` marker — review must verify the flake is now actually fixed, not just made less likely.
|
|
163
|
+
- `/dw-run-qa` weekly summary includes the `flaky_rate` metric.
|