@brunosps00/dev-workflow 0.13.0 → 0.15.0
- package/README.md +9 -3
- package/package.json +1 -1
- package/scaffold/en/commands/dw-bugfix.md +2 -1
- package/scaffold/en/commands/dw-code-review.md +1 -0
- package/scaffold/en/commands/dw-create-tasks.md +6 -0
- package/scaffold/en/commands/dw-deps-audit.md +1 -1
- package/scaffold/en/commands/dw-fix-qa.md +1 -1
- package/scaffold/en/commands/dw-functional-doc.md +1 -1
- package/scaffold/en/commands/dw-help.md +1 -1
- package/scaffold/en/commands/dw-redesign-ui.md +1 -1
- package/scaffold/en/commands/dw-run-qa.md +2 -1
- package/scaffold/en/commands/dw-run-task.md +1 -1
- package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
- package/scaffold/pt-br/commands/dw-code-review.md +1 -0
- package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
- package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
- package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
- package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
- package/scaffold/pt-br/commands/dw-help.md +1 -1
- package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
- package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
- package/scaffold/pt-br/commands/dw-run-task.md +1 -1
- package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
- package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
- package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
- package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
- package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
- package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
- package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
- package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
- package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
- package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
- package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
- package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
- package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
- package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
- package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
- package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
- package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
- package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
- package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
- package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
- package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
- package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
- package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
- /package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0
|
@@ -1,97 +1,126 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: dw-testing-discipline
|
|
3
|
-
description: Use when authoring, reviewing, or debugging tests — enforces
|
|
3
|
+
description: Use when authoring, reviewing, or debugging tests — enforces six core rules (assert behavior, push to lowest layer, fix prod first on red, real systems gate merge, mutation > coverage, no test backdoors), a catalog of anti-patterns, agent-authoring guardrails, and flaky-test discipline so tests reveal bugs instead of decorating CI.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
6
|
# Testing Discipline
|
|
7
7
|
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
## Cardinal Premise
|
|
8
|
+
## Founding principle
|
|
11
9
|
|
|
12
10
|
> Tests exist to expose defects, not to keep CI green.
|
|
13
11
|
> A test that fails has done its job.
|
|
14
12
|
> A test that passes for the wrong reason is worse than no test.
|
|
15
13
|
|
|
16
|
-
|
|
14
|
+
Everything else in this skill follows from that.
|
|
15
|
+
|
|
16
|
+
## The six core rules
|
|
17
17
|
|
|
18
18
|
```
|
|
19
19
|
1. Test the behavior, never the mock.
|
|
20
|
-
2. Push
|
|
21
|
-
3. When a test fails,
|
|
20
|
+
2. Push each test to the lowest layer that can detect the defect.
|
|
21
|
+
3. When a test fails, read production first — change the test only with documented justification.
|
|
22
22
|
4. Real systems gate the merge. Mocks isolate; they do not validate.
|
|
23
|
-
5. Coverage is a flashlight
|
|
23
|
+
5. Coverage is a flashlight; mutation score is a quality probe. Neither is a target.
|
|
24
24
|
6. No test-only methods, branches, or flags leak into production code.
|
|
25
25
|
```
|
|
26
26
|
|
|
27
|
-
Each
|
|
27
|
+
Each rule has nuance; read `references/core-rules.md` for the long version with examples.
|
|
28
|
+
|
|
29
|
+
## When to use
|
|
28
30
|
|
|
29
|
-
|
|
31
|
+
- Authoring any test (unit, integration, contract, E2E).
|
|
32
|
+
- Reviewing a PR diff under test paths.
|
|
33
|
+
- Debugging a flaky test (or considering retry-as-fix — read `references/flaky-discipline.md` first).
|
|
34
|
+
- Generating tests via an AI agent → invokes `references/agent-guardrails.md` automatically.
|
|
35
|
+
- Browser-based E2E with Playwright → recipes in `references/playwright-recipes.md`.
|
|
36
|
+
- Verifying browser-side trust boundaries (auth, CSRF, headers) → `references/security-boundary.md`.
|
|
37
|
+
- Picking which test workflow applies (UI / network / perf) → `references/three-workflow-patterns.md`.
|
|
30
38
|
|
|
31
|
-
|
|
32
|
-
|------|-----------|
|
|
33
|
-
| Deciding where a test belongs | `references/iron-laws.md` (Law 2 deep-dive) |
|
|
34
|
-
| Writing new tests | `references/positive-patterns.md` |
|
|
35
|
-
| Reviewing / debugging tests | `references/anti-patterns.md` |
|
|
36
|
-
| Test authored by an AI agent | `references/ai-agent-gates.md` + `references/anti-patterns.md` |
|
|
37
|
-
| Flaky tests appeared | `references/flaky-discipline.md` |
|
|
38
|
-
| Browser-based E2E with Playwright | `references/playwright-recipes.md` |
|
|
39
|
-
| Browser security boundary testing | `references/security-boundary.md` |
|
|
40
|
-
| Picking the right test workflow (UI vs network vs perf) | `references/three-workflow-patterns.md` |
|
|
39
|
+
## Reference router
|
|
41
40
|
|
|
42
|
-
|
|
41
|
+
| Doing what | Read |
|
|
42
|
+
|------------|------|
|
|
43
|
+
| Placing a new test (which layer?) | `references/core-rules.md` (Rule 2 deep-dive) |
|
|
44
|
+
| Writing new tests | `references/patterns.md` |
|
|
45
|
+
| Reviewing tests / spotting smells | `references/anti-patterns.md` |
|
|
46
|
+
| Agent-generated tests | `references/agent-guardrails.md` + `references/anti-patterns.md` |
|
|
47
|
+
| Flaky tests | `references/flaky-discipline.md` |
|
|
48
|
+
| Playwright E2E | `references/playwright-recipes.md` |
|
|
49
|
+
| Browser trust boundary | `references/security-boundary.md` |
|
|
50
|
+
| Picking the right workflow | `references/three-workflow-patterns.md` |
|
|
51
|
+
|
|
52
|
+
## Patterns that produce reliable tests (one-liners; full in `references/patterns.md`)
|
|
43
53
|
|
|
44
54
|
1. Query by behavior and accessible role; never CSS selectors or DOM indices.
|
|
45
|
-
2. Selector
|
|
55
|
+
2. Selector ladder: role → label → text → test-id → structural. Stop at the highest rung that disambiguates.
|
|
46
56
|
3. Wait on observable conditions; never wall-clock sleeps.
|
|
47
|
-
4. Each test independent and order-free;
|
|
48
|
-
5. One behavior per test; as many assertions as that behavior
|
|
49
|
-
6. Names read
|
|
57
|
+
4. Each test independent and order-free; lean on `beforeEach`, not `beforeAll`.
|
|
58
|
+
5. One behavior per test; as many assertions as that behavior requires.
|
|
59
|
+
6. Names read as specifications: `should <outcome> when <condition> given <state>`.
|
|
50
60
|
7. Table-driven / parameterized when inputs vary.
|
|
51
61
|
8. Build test data via factories; literal blobs only for fields under test.
|
|
52
|
-
9. Mock at boundaries you don't control; real wiring for
|
|
53
|
-
10. Real systems gate final merge; contract tests bridge unit and E2E.
|
|
62
|
+
9. Mock at boundaries you don't control; real wiring for the systems you own.
|
|
63
|
+
10. Real systems gate the final merge; contract tests bridge unit and E2E.
|
|
54
64
|
11. Mutation score, not coverage percentage, measures suite strength.
|
|
55
|
-
12. Page Object Model is a tool
|
|
56
|
-
|
|
57
|
-
##
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
|
|
84
|
-
|
|
85
|
-
|
|
86
|
-
|
|
65
|
+
12. Page Object Model is a tool; collapse it for small suites where it adds noise.
|
|
66
|
+
|
|
67
|
+
## Anti-pattern catalog (four families, full in `references/anti-patterns.md`)
|
|
68
|
+
|
|
69
|
+
The four kinds of smell that produce most test debt:
|
|
70
|
+
|
|
71
|
+
**A. Fragile to refactor** — tests bound to internals, not behavior:
|
|
72
|
+
- Implementation-detail selectors.
|
|
73
|
+
- Asserting internal structure instead of observable outcome.
|
|
74
|
+
- Testing private methods directly.
|
|
75
|
+
- Snapshots replacing real assertions.
|
|
76
|
+
- Vague existence assertions (`toBeTruthy`, `should('exist')`).
|
|
77
|
+
- Actions with no assertion ("clicking save works").
|
|
78
|
+
|
|
79
|
+
**B. Non-deterministic outcomes** — tests that flip verdict on the same code:
|
|
80
|
+
- Static sleeps / fixed-timeout waits.
|
|
81
|
+
- Test order dependency / hidden shared state.
|
|
82
|
+
- Non-deterministic inputs (clock, RNG, locale).
|
|
83
|
+
|
|
84
|
+
**C. Mock-driven false confidence** — tests testing the test setup:
|
|
85
|
+
- Asserting the mock exists.
|
|
86
|
+
- Mock drift (mocked response no longer matches real API).
|
|
87
|
+
- Over-mocking child components.
|
|
88
|
+
- Incomplete mocks (missing fields the code reads).
|
|
89
|
+
- Mocking the wrong level (mocking methods of the SUT itself).
|
|
90
|
+
- Asserting on a value the test body fed into a mock.
|
|
91
|
+
|
|
92
|
+
**D. Suite hygiene problems** — team and suite-level pathologies:
|
|
93
|
+
- Coverage as vanity metric.
|
|
94
|
+
- Happy-path-only coverage.
|
|
95
|
+
- Eternal `beforeAll` hiding dependencies.
|
|
96
|
+
- Cleanup in `afterEach` (move to `beforeEach`).
|
|
97
|
+
- Magic strings and logic in tests.
|
|
98
|
+
- Testing against third-party sites.
|
|
99
|
+
- Quarantine-as-cemetery (skip without owner or deadline).
|
|
100
|
+
- Retry-as-fix (auto-retry hiding real bugs).
|
|
101
|
+
- Duplicate tests across pyramid layers.
|
|
102
|
+
- Weakening assertions to make tests pass.
|
|
103
|
+
|
|
104
|
+
Total: 25 specific patterns across the four families.
|
|
105
|
+
|
|
106
|
+
## Agent-authoring guardrails (mandatory when an LLM writes tests)
|
|
107
|
+
|
|
108
|
+
Six guardrails block the most common failure modes when an LLM produces test code. Each is a pre-condition before the diff goes to review. Full prompts and verification in `references/agent-guardrails.md`:
|
|
109
|
+
|
|
110
|
+
1. **State the invariant first** — agent prints `INVARIANT`, `OWNING_LAYER`, `EXISTING_SUITE` before writing code.
|
|
111
|
+
2. **Extend, don't sprawl** — agent extends an existing suite; new files require a named invariant.
|
|
112
|
+
3. **Real execution somewhere** — at least one test path runs against real systems before merge.
|
|
113
|
+
4. **Red? Read production** — on failure, the agent reads production code first and prints `ANALYSIS:` before changing tests.
|
|
114
|
+
5. **Classify before snapshot** — snapshots only with explicit `PRODUCT_CONTRACT` classification; `IMPLEMENTATION_DETAIL` forbids them.
|
|
115
|
+
6. **Negative companion** — every positive assertion ships with a negative test for invalid input or failure mode.
|
|
87
116
|
|
|
88
117
|
## Placement doctrine (tripwires)
|
|
89
118
|
|
|
90
119
|
Before writing test code:
|
|
91
120
|
|
|
92
121
|
- Name the invariant in **one sentence**. Fuzzy language signals unclear requirements — stop and clarify.
|
|
93
|
-
- Place the test at the **lowest layer** capable of detecting the
|
|
94
|
-
- Reject tests where `
|
|
122
|
+
- Place the test at the **lowest layer** capable of detecting the defect when the invariant breaks.
|
|
123
|
+
- Reject tests where (`likelihood-of-bug` × `blast-radius`) falls below a ten-minute-maintenance threshold (the test is more expensive to maintain than the bug would be to fix).
|
|
95
124
|
|
|
96
125
|
## Flaky discipline (tripwires)
|
|
97
126
|
|
|
@@ -103,9 +132,9 @@ Full taxonomy in `references/flaky-discipline.md`.
|
|
|
103
132
|
|
|
104
133
|
## Cross-cutting red flags
|
|
105
134
|
|
|
106
|
-
Any of these in a PR
|
|
135
|
+
Any of these in a PR is enough for a REJECTED verdict:
|
|
107
136
|
|
|
108
|
-
- Mock setup larger than test logic.
|
|
137
|
+
- Mock setup larger than the test logic.
|
|
109
138
|
- Test breaks when an internal method is renamed (not the public contract).
|
|
110
139
|
- Removing the assertion body leaves the test green.
|
|
111
140
|
- Test fails when run with `.only` in isolation.
|
|
@@ -117,9 +146,9 @@ Any of these in a PR triggers REJECTED in `/dw-code-review`:
|
|
|
117
146
|
- Failing tests auto-retried until green; no investigation.
|
|
118
147
|
- Skipped/quarantined tests without named owner and fix-by date.
|
|
119
148
|
- Test depends on `new Date()`, `Math.random()`, or system locale.
|
|
120
|
-
- `afterEach` resets database
|
|
121
|
-
-
|
|
122
|
-
-
|
|
149
|
+
- `afterEach` resets database state.
|
|
150
|
+
- Agent-written test has 6+ assertions and zero edge cases.
|
|
151
|
+
- The diff contains the phrase "I'll mock this to be safe."
|
|
123
152
|
|
|
124
153
|
## When NOT to use this skill
|
|
125
154
|
|
|
@@ -131,18 +160,12 @@ Any of these in a PR triggers REJECTED in `/dw-code-review`:
|
|
|
131
160
|
|
|
132
161
|
## Integration with dev-workflow commands
|
|
133
162
|
|
|
134
|
-
- `/dw-create-tasks`
|
|
135
|
-
- `/dw-run-task`
|
|
163
|
+
- `/dw-create-tasks` applies the placement doctrine — each test-adding task names the invariant.
|
|
164
|
+
- `/dw-run-task` runs the 6 agent guardrails when generating tests during implementation.
|
|
136
165
|
- `/dw-code-review` runs the anti-pattern checks on diff hunks under test paths.
|
|
137
|
-
- `/dw-fix-qa`
|
|
166
|
+
- `/dw-fix-qa` applies the flaky-discipline taxonomy in retest cycles.
|
|
138
167
|
- `/dw-run-qa` (UI mode) references `playwright-recipes.md` for concrete recipes.
|
|
139
168
|
|
|
140
|
-
## Why this skill exists
|
|
141
|
-
|
|
142
|
-
The previous bundled skill (`webapp-testing`) mixed Playwright recipes with two discipline references (`security-boundary`, `three-workflow-patterns`) added later. The discipline references were buried in a tactical skill that the agent didn't reach for as doctrine.
|
|
143
|
-
|
|
144
|
-
This skill consolidates: doctrine at the top, Playwright recipes as one reference, security and workflow patterns as their own references. One skill, coherent voice, doctrine-first.
|
|
145
|
-
|
|
146
169
|
## Bottom line
|
|
147
170
|
|
|
148
|
-
> A test that cannot fail is decorative. A test that fails for the wrong reason is misleading. Build tests that fail for exactly one reason — the reason the invariant was violated — and trust them when they do. Mocks isolate. Real systems validate. Coverage shines a light. Mutation score grades the suite. Agents will reach for the mock and the snapshot; the
|
|
171
|
+
> A test that cannot fail is decorative. A test that fails for the wrong reason is misleading. Build tests that fail for exactly one reason — the reason the invariant was violated — and trust them when they do. Mocks isolate. Real systems validate. Coverage shines a light. Mutation score grades the suite. Agents will reach for the mock and the snapshot; the guardrails make them put both down. Tests reveal bugs, not just pass.
|
|
@@ -0,0 +1,170 @@
|
|
|
1
|
+
# Six guardrails — mandatory when an agent writes tests
|
|
2
|
+
|
|
3
|
+
LLMs have characteristic failure modes when authoring tests. Six guardrails are forcing functions for the most common ones.
|
|
4
|
+
|
|
5
|
+
Every test produced by an agent (via `/dw-run-task`, `/dw-bugfix`, `/dw-autopilot`, or any code-generating flow) must clear all six BEFORE the diff goes to review.
|
|
6
|
+
|
|
7
|
+
## Guardrail 1 — State the invariant, layer, and host suite first
|
|
8
|
+
|
|
9
|
+
**Failure mode it blocks:** agent writes 200 lines of test code without articulating what the test is supposed to prove or where it belongs.
|
|
10
|
+
|
|
11
|
+
**What the agent must do:**
|
|
12
|
+
|
|
13
|
+
Before any test code, print:
|
|
14
|
+
|
|
15
|
+
```
|
|
16
|
+
INVARIANT: <one sentence — what behavior the test verifies>
|
|
17
|
+
OWNING_LAYER: <unit | integration | contract | e2e>
|
|
18
|
+
EXISTING_SUITE: <path to the existing test file the new test joins, or "NEW: <reason>">
|
|
19
|
+
```
|
|
20
|
+
|
|
21
|
+
If the agent can't fill any line, it stops and asks the user — it does NOT invent an invariant.
|
|
22
|
+
|
|
23
|
+
**Why it works:**
|
|
24
|
+
- "Invariant" forces specific behavior naming.
|
|
25
|
+
- "Owning layer" forces Rule 2 (lowest detectable layer).
|
|
26
|
+
- "Existing suite" forces extending coverage rather than spawning orphan files.
|
|
27
|
+
|
|
28
|
+
**Verification:** `/dw-code-review` looks for this 3-line preamble in the PR description or commit body. Missing = REJECTED.
|
|
29
|
+
|
|
30
|
+
## Guardrail 2 — Real execution somewhere
|
|
31
|
+
|
|
32
|
+
**Failure mode it blocks:** agent writes tests that mock everything. They pass green forever and validate nothing.
|
|
33
|
+
|
|
34
|
+
**What the agent must do:**
|
|
35
|
+
|
|
36
|
+
At SOME layer, the test path must run against real systems before merge:
|
|
37
|
+
|
|
38
|
+
- Pure logic: unit only is sufficient.
|
|
39
|
+
- Code touching DB: at least one integration test with real DB (testcontainers, ephemeral container, dedicated test DB).
|
|
40
|
+
- Code calling external services: a contract test OR a sandbox-account smoke test.
|
|
41
|
+
- UI interactions: at least one E2E run on a real preview environment.
|
|
42
|
+
|
|
43
|
+
**Verification:** PR description lists the real-system runs covering the touched code. If no real-system path covers the change → REJECTED.
|
|
44
|
+
|
|
45
|
+
## Guardrail 3 — On red, read production first
|
|
46
|
+
|
|
47
|
+
**Failure mode it blocks:** agent sees a test go red and modifies the test until green. Bug ships.
|
|
48
|
+
|
|
49
|
+
**What the agent must do:**
|
|
50
|
+
|
|
51
|
+
When a test fails (its own or pre-existing):
|
|
52
|
+
|
|
53
|
+
1. Print: `INVESTIGATING FAILURE: <test name>`.
|
|
54
|
+
2. Read production code in the path that produced the observed value.
|
|
55
|
+
3. Print: `ANALYSIS: <2-3 sentences — is production wrong, the test wrong, or has the invariant changed?>`.
|
|
56
|
+
4. Decide:
|
|
57
|
+
- Production wrong → fix production.
|
|
58
|
+
- Test wrong → fix the test AND document the change in the commit body.
|
|
59
|
+
- Invariant changed → update the test AND open an ADR if it's a public-contract change.
|
|
60
|
+
|
|
61
|
+
**Verification:** every commit changing a previously-green test must have an `ANALYSIS:` line. Missing = REJECTED.
|
|
62
|
+
|
|
63
|
+
## Guardrail 4 — No self-confirming assertions
|
|
64
|
+
|
|
65
|
+
**Failure mode it blocks:** two related shortcuts —
|
|
66
|
+
- Agent writes `mockFn.mockReturnValue('X')` then asserts `expect(mockFn()).toBe('X')`. Proves nothing — the test asserts what the test set up.
|
|
67
|
+
- Agent reaches for `toMatchSnapshot()` whenever unsure what to assert. The snapshot becomes the assertion; drift goes unnoticed.
|
|
68
|
+
|
|
69
|
+
**What the agent must do:**
|
|
70
|
+
|
|
71
|
+
**For mocks:** never assert on a value the test body fed into a mock. Assert on:
|
|
72
|
+
- The OUTPUT of production code that consumed the mock.
|
|
73
|
+
- The SIDE EFFECTS (DB state, network calls, event emissions) caused by production code.
|
|
74
|
+
- The VISIBLE behavior (UI change, log line, response) the user/caller observes.
|
|
75
|
+
|
|
76
|
+
**For snapshots:** before adding `toMatchSnapshot()`, classify the artifact:
|
|
77
|
+
- `PRODUCT_CONTRACT` — a stable contract worth pinning (serialized API output, stored-record schema). Snapshot OK; document the classification in a comment.
|
|
78
|
+
- `IMPLEMENTATION_DETAIL` — HTML structure, internal representation, component tree shape. Snapshot is FORBIDDEN; write specific assertions instead.
|
|
79
|
+
|
|
80
|
+
**Verification:**
|
|
81
|
+
- Mock value flowing directly from setup to assertion without passing through production code → REJECTED.
|
|
82
|
+
- Snapshot in diff without classification comment → REJECTED.
|
|
83
|
+
- Snapshot classified `IMPLEMENTATION_DETAIL` → REJECTED.
|
|
84
|
+
|
|
85
|
+
## Guardrail 5 — Negative companion
|
|
86
|
+
|
|
87
|
+
**Failure mode it blocks:** agent writes happy-path-only tests. Edge cases, error paths, boundary inputs uncovered.
|
|
88
|
+
|
|
89
|
+
**What the agent must do:**
|
|
90
|
+
|
|
91
|
+
Every positive assertion ships WITH at least one negative companion:
|
|
92
|
+
|
|
93
|
+
- Asserting `createUser(validInput)` succeeds → also assert `createUser(invalidInput)` fails with a specific error.
|
|
94
|
+
- Asserting `parseDate(validString)` returns a Date → also assert `parseDate(invalidString)` throws or returns null.
|
|
95
|
+
- Asserting `transferFunds(...)` succeeds with sufficient balance → also assert it fails with insufficient balance.
|
|
96
|
+
|
|
97
|
+
**Verification:** a PR adding N positive assertions to a public path must add ≥1 negative assertion. Imbalance >3:1 (positive:negative) on a public path → REJECTED.
|
|
98
|
+
|
|
99
|
+
## Guardrail 6 — Don't expand the surface to test it
|
|
100
|
+
|
|
101
|
+
**Failure mode it blocks:** agent exports internals, adds `*ForTesting` methods, or introduces `process.env.NODE_ENV === 'test'` branches in production code to make the test possible.
|
|
102
|
+
|
|
103
|
+
**What the agent must do:**
|
|
104
|
+
|
|
105
|
+
If the test needs access the production API doesn't grant:
|
|
106
|
+
- **Refactor production for testability** via dependency injection or interface seams.
|
|
107
|
+
- **Emit an observable side effect** the test can verify (event, log line, metric) that production also benefits from.
|
|
108
|
+
- **Use a dedicated test environment** with test credentials, not a backdoor flag.
|
|
109
|
+
|
|
110
|
+
The agent does NOT:
|
|
111
|
+
- Export `_internal` symbols just for tests.
|
|
112
|
+
- Add `// for testing only` methods on classes.
|
|
113
|
+
- Wrap production logic in `if (process.env.NODE_ENV !== 'test')` branches.
|
|
114
|
+
|
|
115
|
+
**Verification:** diff includes new production exports, env checks, or `*ForTesting`-style symbols → REJECTED. Refactor the surface or change the test approach.
|
|
116
|
+
|
|
117
|
+
## How the six guardrails compose
|
|
118
|
+
|
|
119
|
+
A test that passes all six:
|
|
120
|
+
1. States the invariant, layer, and host suite up front (Guardrail 1).
|
|
121
|
+
2. Exercises real systems somewhere in the pipeline (Guardrail 2).
|
|
122
|
+
3. When red, reads production first and documents the analysis (Guardrail 3).
|
|
123
|
+
4. Asserts on production's observable output, not its own setup (Guardrail 4).
|
|
124
|
+
5. Covers failures alongside successes (Guardrail 5).
|
|
125
|
+
6. Lives inside the production API's existing surface (Guardrail 6).
|
|
126
|
+
|
|
127
|
+
Tests passing all six are worth running. Tests missing any one are more likely to mislead than to help.
|
|
128
|
+
|
|
129
|
+
## Override procedure
|
|
130
|
+
|
|
131
|
+
To skip a guardrail explicitly:
|
|
132
|
+
1. State which guardrail is skipped.
|
|
133
|
+
2. State why in one sentence.
|
|
134
|
+
3. Add a `// SKIP-GUARDRAIL-N: <reason>` comment in the test.
|
|
135
|
+
4. Open a follow-up issue tracking the gap.
|
|
136
|
+
|
|
137
|
+
Without all four, the guardrail is enforced.
|
|
138
|
+
|
|
139
|
+
## Prompt block injected when an agent writes tests
|
|
140
|
+
|
|
141
|
+
```
|
|
142
|
+
You are about to write tests. Before producing test code, complete the
|
|
143
|
+
6-guardrail preamble:
|
|
144
|
+
|
|
145
|
+
INVARIANT: ___
|
|
146
|
+
OWNING_LAYER: ___
|
|
147
|
+
EXISTING_SUITE: ___
|
|
148
|
+
|
|
149
|
+
If you cannot complete these three lines, STOP and ask the user for
|
|
150
|
+
the requirement. Do not invent an invariant.
|
|
151
|
+
|
|
152
|
+
Then, while writing tests:
|
|
153
|
+
- Real execution: name the real-system path covering this code.
|
|
154
|
+
- On red: read production first; print ANALYSIS: before changing the test.
|
|
155
|
+
- Mocks: never assert on values fed into a mock.
|
|
156
|
+
- Snapshots: classify PRODUCT_CONTRACT or IMPLEMENTATION_DETAIL — the latter is forbidden.
|
|
157
|
+
- Coverage: every positive assertion needs a negative companion.
|
|
158
|
+
- Production surface: don't export internals or add test-only branches.
|
|
159
|
+
|
|
160
|
+
Tests violating guardrails without explicit SKIP-GUARDRAIL-N comments
|
|
161
|
+
will be REJECTED at review.
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
`/dw-run-task` and `/dw-bugfix` inject this prompt block before generating test code.
|
|
165
|
+
|
|
166
|
+
## Why six and not more
|
|
167
|
+
|
|
168
|
+
These are the highest-frequency LLM failure modes observed across multiple projects. Other tendencies exist but are either covered by the positive patterns (e.g., wall-clock waits) or are lower-frequency than these six.
|
|
169
|
+
|
|
170
|
+
If a new failure mode appears that none of the six catches, add a guardrail AND document the failure that motivated it. Don't add guardrails speculatively.
|
|
@@ -1,10 +1,10 @@
|
|
|
1
|
-
# Anti-patterns — 25
|
|
1
|
+
# Anti-patterns — 25 smells across 4 families
|
|
2
2
|
|
|
3
|
-
Each anti-pattern names the smell, shows the violation in pseudo-code, gives the fix, and notes how `/dw-code-review` detects it.
|
|
3
|
+
Each anti-pattern names the smell, shows the violation in pseudo-code, gives the fix, and notes how `/dw-code-review` detects it. Agent-specific failure modes are covered separately in `agent-guardrails.md`.
|
|
4
4
|
|
|
5
5
|
---
|
|
6
6
|
|
|
7
|
-
## Family
|
|
7
|
+
## Family A: Fragile to refactor (tests bound to internals, not behavior)
|
|
8
8
|
|
|
9
9
|
### A1. Implementation-detail selectors
|
|
10
10
|
|
|
@@ -81,7 +81,7 @@ test('clicking save works', async () => {
|
|
|
81
81
|
|
|
82
82
|
---
|
|
83
83
|
|
|
84
|
-
## Family
|
|
84
|
+
## Family B: Non-deterministic outcomes (tests that flip verdict on the same code)
|
|
85
85
|
|
|
86
86
|
### A7. Static sleeps / fixed-timeout waits
|
|
87
87
|
|
|
@@ -117,7 +117,7 @@ test('today is Monday', () => {
|
|
|
117
117
|
|
|
118
118
|
---
|
|
119
119
|
|
|
120
|
-
## Family
|
|
120
|
+
## Family C: Mock-driven false confidence (tests asserting on their own setup)
|
|
121
121
|
|
|
122
122
|
### A10. Asserting the mock exists
|
|
123
123
|
|
|
@@ -181,7 +181,7 @@ You've tested the SCAFFOLD, not the logic.
|
|
|
181
181
|
|
|
182
182
|
---
|
|
183
183
|
|
|
184
|
-
## Family
|
|
184
|
+
## Family D: Suite hygiene problems (team and suite-level pathologies)
|
|
185
185
|
|
|
186
186
|
### A15. Coverage as vanity metric
|
|
187
187
|
|
|
@@ -0,0 +1,128 @@
|
|
|
1
|
+
# Six core rules — expanded with examples
|
|
2
|
+
|
|
3
|
+
The rules are short for memorization. Each carries nuance that matters in practice.
|
|
4
|
+
|
|
5
|
+
## Rule 1: Test the behavior, never the mock
|
|
6
|
+
|
|
7
|
+
**What it means:** the test asserts what the system DOES from the caller's perspective. It does not assert that internal call X was made with internal argument Y.
|
|
8
|
+
|
|
9
|
+
**Why it matters:** a test bound to internal calls breaks the day you refactor — even when behavior didn't change. The "test is red, behavior is fine" experience erodes trust. Soon no one runs the suite.
|
|
10
|
+
|
|
11
|
+
**Violation:**
|
|
12
|
+
|
|
13
|
+
```javascript
|
|
14
|
+
// BAD — asserting on mock internals
|
|
15
|
+
test('createOrder calls inventory.reserve', () => {
|
|
16
|
+
const inventory = { reserve: vi.fn() };
|
|
17
|
+
createOrder({ items: [...] }, inventory);
|
|
18
|
+
expect(inventory.reserve).toHaveBeenCalledWith(items, 'reserve');
|
|
19
|
+
});
|
|
20
|
+
```
|
|
21
|
+
|
|
22
|
+
You've asserted that `createOrder` USES the inventory adapter a specific way. The refactor that consolidates `reserve` into `commit-with-reservation` breaks this even though the order still gets created.
|
|
23
|
+
|
|
24
|
+
**Correct version:**
|
|
25
|
+
|
|
26
|
+
```javascript
|
|
27
|
+
// GOOD — asserting behavior
|
|
28
|
+
test('createOrder reserves inventory before confirming', async () => {
|
|
29
|
+
const result = await createOrder({ items: [...] });
|
|
30
|
+
expect(result.status).toBe('confirmed');
|
|
31
|
+
expect(await getInventoryFor(items[0].sku)).toBe(originalStock - 1);
|
|
32
|
+
});
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
Now the test cares about the OUTCOME (inventory decremented, order confirmed), not the path.
|
|
36
|
+
|
|
37
|
+
## Rule 2: Push each test to the lowest layer that can detect the defect
|
|
38
|
+
|
|
39
|
+
**What it means:** if a unit test can catch the bug, use a unit. If only an integration test can, integration. If only an end-to-end run can, E2E.
|
|
40
|
+
|
|
41
|
+
**Why it matters:** lower layers run faster, fail more precisely, isolate the cause better. A bug in pure logic caught at unit takes 50ms and points at the exact function. The same bug at E2E takes 30 seconds and tells you "checkout failed."
|
|
42
|
+
|
|
43
|
+
**The pyramid resolved:**
|
|
44
|
+
|
|
45
|
+
| Layer | Catches | Speed | Cost |
|
|
46
|
+
|-------|---------|-------|------|
|
|
47
|
+
| Unit | Pure logic, math, parsing, formatters | <100ms | low |
|
|
48
|
+
| Integration | Module composition, DB queries, HTTP handlers | 500ms–5s | medium |
|
|
49
|
+
| Contract | Producer/consumer agreement at API boundary | 1–10s | medium |
|
|
50
|
+
| E2E | User journey across multiple services | 10s–60s | high |
|
|
51
|
+
|
|
52
|
+
**Rule of thumb:**
|
|
53
|
+
- If you can write a unit test, do so.
|
|
54
|
+
- If unit can't reach it (needs DB, queue, real HTTP), write integration.
|
|
55
|
+
- E2E only for journeys NO lower layer can detect: browser-renders-correctly, third-party-callback-arrives, multi-step session state.
|
|
56
|
+
|
|
57
|
+
## Rule 3: When a test fails, read production first — change the test only with documented justification
|
|
58
|
+
|
|
59
|
+
**What it means:** a red test is a signal. The first question is "what's wrong with production?" — not "why is the test wrong?"
|
|
60
|
+
|
|
61
|
+
**Why it matters:** tests are weakened to pass far more often than they should be. "The behavior is fine; the test is too strict" is the slope that leaves a green suite full of meaningless assertions.
|
|
62
|
+
|
|
63
|
+
**Process when a test goes red:**
|
|
64
|
+
|
|
65
|
+
1. **Read the failure message.** What invariant did the test claim? What did it observe?
|
|
66
|
+
2. **Read production code** in the path that produces the observation.
|
|
67
|
+
3. **Decide which is wrong.** If production violates the invariant, fix production. If the test mis-states the invariant, document WHY before relaxing.
|
|
68
|
+
4. **Commit the analysis** in the test's commit message or PR body. "Relaxed assertion from X to Y because `<reason>`" is auditable; "fix test" is not.
|
|
69
|
+
|
|
70
|
+
**Anti-patterns:** re-running the test until it goes green. Auto-retrying on flake. Adding `.only` so the rest of the suite is skipped.
|
|
71
|
+
|
|
72
|
+
## Rule 4: Real systems gate the merge. Mocks isolate; they do not validate.
|
|
73
|
+
|
|
74
|
+
**What it means:** before code merges to main, at least ONE test path must have exercised real systems: a real DB, a real route, a real external integration in a sandbox or test account. Mocks are fine for fast unit feedback; they cannot decide "safe to ship."
|
|
75
|
+
|
|
76
|
+
**Why it matters:** mock drift is real. The mocked HTTP response from 3 months ago no longer matches the actual API. Tests pass; production fails on first real call.
|
|
77
|
+
|
|
78
|
+
**Practical pattern:**
|
|
79
|
+
|
|
80
|
+
- Unit tests: mock the world; run on every keystroke / commit.
|
|
81
|
+
- Integration tests: real local DB (testcontainers, in-memory if equivalent); run on every PR.
|
|
82
|
+
- Contract tests: real producer/consumer agreement check; run on every PR.
|
|
83
|
+
- E2E: real preview environment with real services; run on PRs before merge to main.
|
|
84
|
+
|
|
85
|
+
The discipline: no merge without a green E2E (or equivalent real-system check) for the touched path.
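One way to picture mock drift, as a sketch under stated assumptions: the `UserResponse` shape, the `isUserResponse` guard, and both sample payloads are hypothetical, and `liveUser` is hardcoded here only to stand in for what a real call in the integration or E2E stage would return.

```typescript
// Hypothetical response shape for a user-lookup API (assumed for illustration).
interface UserResponse {
  id: string;
  email: string;
  createdAt: string;
}

// Runtime shape check that both the unit fixture and the real-system test can share.
function isUserResponse(value: unknown): value is UserResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["id"] === "string" &&
    typeof v["email"] === "string" &&
    typeof v["createdAt"] === "string"
  );
}

// Fixture recorded some time ago and used by fast unit tests.
const mockedUser: unknown = {
  id: "u_1",
  email: "a@example.com",
  createdAt: "2024-01-01T00:00:00Z",
};

// Stand-in for today's real response: the field was renamed upstream.
const liveUser: unknown = {
  id: "u_1",
  email: "a@example.com",
  created_at: "2024-01-01T00:00:00Z",
};

// The mock still satisfies the old shape, so mock-only tests stay green.
const mockStillValid = isUserResponse(mockedUser);
// Only a check against the real payload exposes the drift.
const liveMatchesMock = isUserResponse(liveUser);
```

The point of the sketch is the asymmetry: nothing in the mocked path can notice the rename, which is why a real-system check gates the merge.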
|
|
86
|
+
|
|
87
|
+
## Rule 5: Coverage is a flashlight; mutation score is a quality probe. Neither is a target.
|
|
88
|
+
|
|
89
|
+
**What it means:**
|
|
90
|
+
- **Coverage** tells you what lines executed. Useful as a NEGATIVE signal — 30% coverage means lots of dark code. Useless as a positive signal — 95% coverage with weak assertions is decorative.
|
|
91
|
+
- **Mutation score** comes from mutation testing, which introduces small bugs (mutations) into the code and measures whether the tests catch them. A high mutation score means tests actually probe behavior, not just execute lines.
|
|
92
|
+
- Neither should be a number you optimize for. They're diagnostics.
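The difference between the two diagnostics can be sketched by hand. Everything below is hypothetical: `qualifiesForFreeShipping` and its threshold are invented, and the mutant is written out manually to imitate what a mutation-testing tool would generate automatically.

```typescript
// Code under test (hypothetical): free shipping at or above a threshold.
function qualifiesForFreeShipping(totalCents: number): boolean {
  return totalCents >= 5000;
}

// A hand-written mutant: `>=` changed to `>`, as a mutation tool might do.
function qualifiesMutant(totalCents: number): boolean {
  return totalCents > 5000;
}

type Impl = (totalCents: number) => boolean;

// Weak test: executes every line (full line coverage) but skips the boundary.
function weakTestPasses(impl: Impl): boolean {
  return impl(10000) === true && impl(100) === false;
}

// Stronger test: same coverage, plus the boundary value.
function strongTestPasses(impl: Impl): boolean {
  return weakTestPasses(impl) && impl(5000) === true;
}

const mutationResults = {
  weakOnOriginal: weakTestPasses(qualifiesForFreeShipping),
  weakOnMutant: weakTestPasses(qualifiesMutant), // passes too: the mutant survives
  strongOnOriginal: strongTestPasses(qualifiesForFreeShipping),
  strongOnMutant: strongTestPasses(qualifiesMutant), // fails: the mutant is killed
};
```

Both tests report identical line coverage; only the surviving mutant reveals that the weak one never checked the boundary.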
|
|
93
|
+
|
|
94
|
+
**Anti-pattern:** "We need 90% coverage to merge." Coverage as a gate produces tests written to pass the gate, not to find bugs.
|
|
95
|
+
|
|
96
|
+
**Healthier framing:** "What lines in the touched diff are NOT covered? Why?" Sometimes the answer is "we don't care, it's logging." Sometimes it's "actually that's a critical branch — add a test."
|
|
97
|
+
|
|
98
|
+
## Rule 6: No test-only methods, branches, or flags leak into production code
|
|
99
|
+
|
|
100
|
+
**What it means:** production code should not have `if (process.env.NODE_ENV === 'test') { ... }` branches, `// for testing only` methods exposed on classes, or internals exported just for assertions.
|
|
101
|
+
|
|
102
|
+
**Why it matters:** test-only logic in production code is test scaffolding shipped in the artifact users run. The bug surface grows, and the test environment diverges from production.
|
|
103
|
+
|
|
104
|
+
**Correct patterns:**
|
|
105
|
+
|
|
106
|
+
- Need to inject a dependency for testing? Use constructor injection / dependency injection.
|
|
107
|
+
- Need to assert on internal state? Add a logging hook or event emission that production also benefits from.
|
|
108
|
+
- Need to bypass auth in tests? Use a dedicated test environment with test credentials, not a backdoor flag.
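The first pattern, constructor injection, might look like the sketch below. `Clock`, `SystemClock`, `TrialPolicy`, and `FixedClock` are hypothetical names chosen for illustration; the 14-day default is likewise an assumption, not something from this skill.

```typescript
// Dependency expressed as an interface so the caller picks the implementation.
interface Clock {
  now(): Date;
}

// Production implementation: the real system clock.
class SystemClock implements Clock {
  now(): Date {
    return new Date();
  }
}

// Hypothetical production class: no NODE_ENV branch, no *ForTesting method.
class TrialPolicy {
  private readonly clock: Clock;
  private readonly trialDays: number;

  constructor(clock: Clock, trialDays: number = 14) {
    this.clock = clock;
    this.trialDays = trialDays;
  }

  isExpired(startedAt: Date): boolean {
    const elapsedMs = this.clock.now().getTime() - startedAt.getTime();
    return elapsedMs > this.trialDays * 24 * 60 * 60 * 1000;
  }
}

// Test side only: a fixed clock that never ships in the production bundle.
class FixedClock implements Clock {
  private readonly fixed: Date;

  constructor(fixed: Date) {
    this.fixed = fixed;
  }

  now(): Date {
    return this.fixed;
  }
}

const trialStart = new Date("2024-01-01T00:00:00Z");
const policyOnDay10 = new TrialPolicy(new FixedClock(new Date("2024-01-11T00:00:00Z")));
const policyOnDay20 = new TrialPolicy(new FixedClock(new Date("2024-01-21T00:00:00Z")));
```

Production wires in `new SystemClock()`; the test wires in `FixedClock`. The class under test is identical in both cases, which is what keeps the footprint at zero.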
|
|
109
|
+
|
|
110
|
+
**Tells:**
|
|
111
|
+
- `// only used in tests` comments.
|
|
112
|
+
- `*ForTesting` suffix on methods.
|
|
113
|
+
- `vi.spyOn(module, '_internal')` accessing underscore-prefixed members.
|
|
114
|
+
- `process.env.E2E_MODE` reaching into production runtime decisions.
|
|
115
|
+
|
|
116
|
+
If you see these, the test design is wrong. Refactor production to be testable; don't add backdoors.
|
|
117
|
+
|
|
118
|
+
## Putting the rules together
|
|
119
|
+
|
|
120
|
+
A healthy test:
|
|
121
|
+
1. Asserts behavior visible to a caller (Rule 1).
|
|
122
|
+
2. Sits at the lowest layer that can prove that behavior (Rule 2).
|
|
123
|
+
3. When red, sends you to read production code (Rule 3).
|
|
124
|
+
4. Has a sibling exercising real systems somewhere in the pipeline (Rule 4).
|
|
125
|
+
5. Survives a mutation in the code it claims to cover (Rule 5).
|
|
126
|
+
6. Has zero footprint in production code (Rule 6).
|
|
127
|
+
|
|
128
|
+
Any test that fails two or more of these is accumulating technical debt. `/dw-code-review` flags them.
|
|
@@ -2,7 +2,7 @@
|
|
|
2
2
|
|
|
3
3
|
Practical Playwright code for the common scenarios. Use this when `/dw-run-qa` runs in UI mode or when `/dw-functional-doc` needs E2E coverage.
|
|
4
4
|
|
|
5
|
-
> These recipes ASSUME the doctrine in this skill (
|
|
5
|
+
> These recipes ASSUME the doctrine in this skill (core rules, positive patterns) has already been applied. Recipes are the HOW once the WHY is settled.
|
|
6
6
|
|
|
7
7
|
## Basic Navigation
|
|
8
8
|
|
|
@@ -275,7 +275,7 @@ Tests now start authenticated. No login loop in every test.
|
|
|
275
275
|
## Cross-skill integration
|
|
276
276
|
|
|
277
277
|
When running these recipes, the doctrine in this skill applies:
|
|
278
|
-
- Apply
|
|
278
|
+
- Apply core rules (especially Rule 2: lowest layer first — many E2E tests should be integration tests instead).
|
|
279
279
|
- Run the 7 AI Agent Gates if a coding agent is producing this test.
|
|
280
280
|
- Check for Anti-Patterns 1, 5, 7, 20 in the diff.
|
|
281
281
|
- For browser security scenarios (auth bypass, XSS, CSRF), see `security-boundary.md`.
|