@brunosps00/dev-workflow 0.13.0 → 0.15.0

Files changed (46)
  1. package/README.md +9 -3
  2. package/package.json +1 -1
  3. package/scaffold/en/commands/dw-bugfix.md +2 -1
  4. package/scaffold/en/commands/dw-code-review.md +1 -0
  5. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  6. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  7. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  8. package/scaffold/en/commands/dw-functional-doc.md +1 -1
  9. package/scaffold/en/commands/dw-help.md +1 -1
  10. package/scaffold/en/commands/dw-redesign-ui.md +1 -1
  11. package/scaffold/en/commands/dw-run-qa.md +2 -1
  12. package/scaffold/en/commands/dw-run-task.md +1 -1
  13. package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
  14. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  15. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  16. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
  19. package/scaffold/pt-br/commands/dw-help.md +1 -1
  20. package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
  21. package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
  22. package/scaffold/pt-br/commands/dw-run-task.md +1 -1
  23. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  24. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  25. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  26. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  27. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  28. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  29. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  30. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  31. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  32. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  33. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  34. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  35. package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
  36. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  37. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
  38. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  39. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
  40. package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
  41. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
  42. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  43. package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
  44. package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
  45. package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
  46. package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0
@@ -1,97 +1,126 @@
1
1
  ---
2
2
  name: dw-testing-discipline
3
- description: Use when authoring, reviewing, or debugging tests — enforces Six Iron Laws (behavior over mocks, push to lowest layer, fix prod first on red, real systems gate merge), 25 anti-patterns, 7 AI agent gates, and flaky-test discipline so tests reveal bugs instead of decorating CI.
3
+ description: Use when authoring, reviewing, or debugging tests — enforces six core rules (assert behavior, push to lowest layer, fix prod first on red, real systems gate merge, mutation > coverage, no test backdoors), a catalog of anti-patterns, agent-authoring guardrails, and flaky-test discipline so tests reveal bugs instead of decorating CI.
4
4
  ---
5
5
 
6
6
  # Testing Discipline
7
7
 
8
- > **Inspired by** [`pedronauck/skills/testing-boss`](https://github.com/pedronauck/skills/tree/main/skills/mine/testing-boss) (MIT). Six Iron Laws, positive/anti-pattern catalogs, AI agent gates, and flaky-test taxonomy adapted from Pedro Nauck's work. The browser security-boundary and three-workflow-patterns references additionally cite [`addyosmani/agent-skills/browser-devtools`](https://github.com/addyosmani/agent-skills) (MIT), and Playwright recipes carry over from earlier dev-workflow work.
9
-
10
- ## Cardinal Premise
8
+ ## Founding principle
11
9
 
12
10
  > Tests exist to expose defects, not to keep CI green.
13
11
  > A test that fails has done its job.
14
12
  > A test that passes for the wrong reason is worse than no test.
15
13
 
16
- ## Six Iron Laws
14
+ Everything else in this skill follows from that.
15
+
16
+ ## The six core rules
17
17
 
18
18
  ```
19
19
  1. Test the behavior, never the mock.
20
- 2. Push every test to the lowest layer that can detect the failure.
21
- 3. When a test fails, fix production first — change the test only after writing why.
20
+ 2. Push each test to the lowest layer that can detect the defect.
21
+ 3. When a test fails, read production first — change the test only with documented justification.
22
22
  4. Real systems gate the merge. Mocks isolate; they do not validate.
23
- 5. Coverage is a flashlight. Mutation score is a quality probe. Neither is a target.
23
+ 5. Coverage is a flashlight; mutation score is a quality probe. Neither is a target.
24
24
  6. No test-only methods, branches, or flags leak into production code.
25
25
  ```
26
26
 
27
- Each law has nuance; read `references/iron-laws.md` for the full version with examples.
27
+ Each rule has nuance; read `references/core-rules.md` for the long version with examples.
28
+
29
+ ## When to use
28
30
 
29
- ## Required Reading Router
31
+ - Authoring any test (unit, integration, contract, E2E).
32
+ - Reviewing a PR diff under test paths.
33
+ - Debugging a flaky test (or considering retry-as-fix — read `references/flaky-discipline.md` first).
34
+ - Generating tests via an AI agent → invokes `references/agent-guardrails.md` automatically.
35
+ - Browser-based E2E with Playwright → recipes in `references/playwright-recipes.md`.
36
+ - Verifying browser-side trust boundaries (auth, CSRF, headers) → `references/security-boundary.md`.
37
+ - Picking which test workflow applies (UI / network / perf) → `references/three-workflow-patterns.md`.
30
38
 
31
- | Task | MUST read |
32
- |------|-----------|
33
- | Deciding where a test belongs | `references/iron-laws.md` (Law 2 deep-dive) |
34
- | Writing new tests | `references/positive-patterns.md` |
35
- | Reviewing / debugging tests | `references/anti-patterns.md` |
36
- | Test authored by an AI agent | `references/ai-agent-gates.md` + `references/anti-patterns.md` |
37
- | Flaky tests appeared | `references/flaky-discipline.md` |
38
- | Browser-based E2E with Playwright | `references/playwright-recipes.md` |
39
- | Browser security boundary testing | `references/security-boundary.md` |
40
- | Picking the right test workflow (UI vs network vs perf) | `references/three-workflow-patterns.md` |
39
+ ## Reference router
41
40
 
42
- ## Twelve positive patterns (one-liners, full version in references/positive-patterns.md)
41
+ | Doing what | Read |
42
+ |------------|------|
43
+ | Placing a new test (which layer?) | `references/core-rules.md` (Rule 2 deep-dive) |
44
+ | Writing new tests | `references/patterns.md` |
45
+ | Reviewing tests / spotting smells | `references/anti-patterns.md` |
46
+ | Agent-generated tests | `references/agent-guardrails.md` + `references/anti-patterns.md` |
47
+ | Flaky tests | `references/flaky-discipline.md` |
48
+ | Playwright E2E | `references/playwright-recipes.md` |
49
+ | Browser trust boundary | `references/security-boundary.md` |
50
+ | Picking the right workflow | `references/three-workflow-patterns.md` |
51
+
52
+ ## Patterns that produce reliable tests (one-liners; full in `references/patterns.md`)
43
53
 
44
54
  1. Query by behavior and accessible role; never CSS selectors or DOM indices.
45
- 2. Selector hierarchy: role → label → text → test-id → structural (stop at highest rung that disambiguates).
55
+ 2. Selector ladder: role → label → text → test-id → structural. Stop at the highest rung that disambiguates.
46
56
  3. Wait on observable conditions; never wall-clock sleeps.
47
- 4. Each test independent and order-free; setup over teardown.
48
- 5. One behavior per test; as many assertions as that behavior needs.
49
- 6. Names read like specifications: `should <outcome> when <condition> given <state>`.
57
+ 4. Each test independent and order-free; lean on `beforeEach`, not `beforeAll`.
58
+ 5. One behavior per test; as many assertions as that behavior requires.
59
+ 6. Names read as specifications: `should <outcome> when <condition> given <state>`.
50
60
  7. Table-driven / parameterized when inputs vary.
51
61
  8. Build test data via factories; literal blobs only for fields under test.
52
- 9. Mock at boundaries you don't control; real wiring for owned systems.
53
- 10. Real systems gate final merge; contract tests bridge unit and E2E.
62
+ 9. Mock at boundaries you don't control; real wiring for the systems you own.
63
+ 10. Real systems gate the final merge; contract tests bridge unit and E2E.
54
64
  11. Mutation score, not coverage percentage, measures suite strength.
55
- 12. Page Object Model is a tool, not a religion; collapse for small suites.
56
-
57
- ## Five anti-pattern families (25 total, full catalog in references/anti-patterns.md)
58
-
59
- **Brittleness** (tests bound to internals):
60
- - Implementation-detail selectors, internal-structure assertions, testing private methods, snapshot-as-test, vague existence assertions, action-without-assertion.
61
-
62
- **Flakiness** (tests randomizing verdicts):
63
- - Static sleeps, test order dependency, non-deterministic inputs (clock, RNG, locale).
64
-
65
- **Mock misuse** (tests testing the test setup):
66
- - Asserting the mock exists, mock drift, over-mocking children, incomplete mocks, mocking wrong level.
67
-
68
- **Process** — team and suite pathologies:
69
- - Coverage-as-vanity, happy-path-only, eternal `beforeAll`, cleanup in `afterEach`, magic strings, testing third-party sites, quarantine-as-cemetery, retry-as-fix, duplicate tests across layers, weakening tests to make them pass, mock-driven confidence.
70
-
71
- **AI-specific** (agent failure modes):
72
- - The seven failure modes that gates in `ai-agent-gates.md` block.
73
-
74
- ## Seven AI agent gates (mandatory when an agent writes tests)
75
-
76
- These are mandatory pre-conditions whenever an LLM produces test code. Each gate is a forcing function against a specific LLM tendency:
77
-
78
- 1. **Invariant first**: agent prints `INVARIANT: …`, `OWNING_LAYER: …`, `EXISTING_SUITE: …` before any code.
79
- 2. **Owning layer**: extend an existing suite; reject new files without a named invariant.
80
- 3. **Real execution**: every test runs against real DB / real route / real external integration at least once before merging.
81
- 4. **Failure → fix production** — on a red test, the next move reads production code, NOT the test. Document the analysis before changing either.
82
- 5. **No snapshot without contract** — classify the artifact as `PRODUCT_CONTRACT` or `IMPLEMENTATION_DETAIL`. The latter forbids snapshots.
83
- 6. **No assertion on self-set mock**: the test cannot assert on values the same test body wrote into the mock.
84
- 7. **Negative companion** — every positive assertion ships with a negative test for invalid input or failure mode.
85
-
86
- Full prompt blocks and verification recipes in `references/ai-agent-gates.md`.
65
+ 12. Page Object Model is a tool; collapse it for small suites where it adds noise.
66
+
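Patterns 7 and 8 in the list above (table-driven cases, factory-built data) can be sketched roughly as follows. This is an illustrative sketch only: `buildUser`, `canPublish`, and every field name are hypothetical and do not come from the package.

```javascript
// Hypothetical test-data factory; names and fields are illustrative only.
let nextId = 1;

function buildUser(overrides = {}) {
  // Valid defaults for every field, so a test states only what it cares about.
  const id = nextId++;
  return {
    id,
    email: `user${id}@example.test`,
    role: "member",
    active: true,
    ...overrides,
  };
}

// Hypothetical system under test.
function canPublish(user) {
  return user.active && (user.role === "editor" || user.role === "admin");
}

// Table-driven cases: only the fields under test appear as literals.
const cases = [
  { overrides: { role: "editor" }, expected: true },
  { overrides: { role: "member" }, expected: false },
  { overrides: { role: "admin", active: false }, expected: false },
];

const results = cases.map(({ overrides }) => canPublish(buildUser(overrides)));
console.log(results);
```

Each row names the one or two fields that drive the outcome; the factory supplies the rest, so adding a field to the user shape later does not touch these cases.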
67
+ ## Anti-pattern catalog (four families, full in `references/anti-patterns.md`)
68
+
69
+ The four kinds of smell that produce most test debt:
70
+
71
+ **A. Fragile to refactor** — tests bound to internals, not behavior:
72
+ - Implementation-detail selectors.
73
+ - Asserting internal structure instead of observable outcome.
74
+ - Testing private methods directly.
75
+ - Snapshots replacing real assertions.
76
+ - Vague existence assertions (`toBeTruthy`, `should('exist')`).
77
+ - Actions with no assertion ("clicking save works").
78
+
79
+ **B. Non-deterministic outcomes** (tests that flip verdict on the same code):
80
+ - Static sleeps / fixed-timeout waits.
81
+ - Test order dependency / hidden shared state.
82
+ - Non-deterministic inputs (clock, RNG, locale).
83
+
84
+ **C. Mock-driven false confidence** (tests testing the test setup):
85
+ - Asserting the mock exists.
86
+ - Mock drift (mocked response no longer matches real API).
87
+ - Over-mocking child components.
88
+ - Incomplete mocks (missing fields the code reads).
89
+ - Mocking the wrong level (mocking methods of the SUT itself).
90
+ - Asserting on a value the test body fed into a mock.
91
+
92
+ **D. Suite hygiene problems** — team and suite-level pathologies:
93
+ - Coverage as vanity metric.
94
+ - Happy-path-only coverage.
95
+ - Eternal `beforeAll` hiding dependencies.
96
+ - Cleanup in `afterEach` (move to `beforeEach`).
97
+ - Magic strings and logic in tests.
98
+ - Testing against third-party sites.
99
+ - Quarantine-as-cemetery (skip without owner or deadline).
100
+ - Retry-as-fix (auto-retry hiding real bugs).
101
+ - Duplicate tests across pyramid layers.
102
+ - Weakening assertions to make tests pass.
103
+
104
+ Total: 25 specific patterns across the four families.
105
+
106
+ ## Agent-authoring guardrails (mandatory when an LLM writes tests)
107
+
108
+ Six guardrails block the most common failure modes when an LLM produces test code. Each is a pre-condition before the diff goes to review. Full prompts and verification in `references/agent-guardrails.md`:
109
+
110
+ 1. **State the invariant first** — agent prints `INVARIANT`, `OWNING_LAYER`, `EXISTING_SUITE` before writing code.
111
+ 2. **Extend, don't sprawl** — agent extends an existing suite; new files require a named invariant.
112
+ 3. **Real execution somewhere** — at least one test path runs against real systems before merge.
113
+ 4. **Red? Read production** — on failure, the agent reads production code first and prints `ANALYSIS:` before changing tests.
114
+ 5. **Classify before snapshot** — snapshots only with explicit `PRODUCT_CONTRACT` classification; `IMPLEMENTATION_DETAIL` forbids them.
115
+ 6. **Negative companion** — every positive assertion ships with a negative test for invalid input or failure mode.
87
116
 
88
117
  ## Placement doctrine (tripwires)
89
118
 
90
119
  Before writing test code:
91
120
 
92
121
  - Name the invariant in **one sentence**. Fuzzy language signals unclear requirements — stop and clarify.
93
- - Place the test at the **lowest layer** capable of detecting the failure when the invariant breaks.
94
- - Reject tests where `(likelihood × blast-radius)` falls below the ten-minute-maintenance threshold (the test is more expensive to maintain than the bug would be to fix).
122
+ - Place the test at the **lowest layer** capable of detecting the defect when the invariant breaks.
123
+ - Reject tests where (`likelihood-of-bug` × `blast-radius`) falls below a ten-minute-maintenance threshold (the test is more expensive to maintain than the bug would be to fix).
95
124
 
96
125
  ## Flaky discipline (tripwires)
97
126
 
@@ -103,9 +132,9 @@ Full taxonomy in `references/flaky-discipline.md`.
103
132
 
104
133
  ## Cross-cutting red flags
105
134
 
106
- Any of these in a PR triggers REJECTED in `/dw-code-review`:
135
+ Any of these in a PR is enough for a REJECTED verdict:
107
136
 
108
- - Mock setup larger than test logic.
137
+ - Mock setup larger than the test logic.
109
138
  - Test breaks when an internal method is renamed (not the public contract).
110
139
  - Removing the assertion body leaves the test green.
111
140
  - Test fails when run with `.only` in isolation.
@@ -117,9 +146,9 @@ Any of these in a PR triggers REJECTED in `/dw-code-review`:
117
146
  - Failing tests auto-retried until green; no investigation.
118
147
  - Skipped/quarantined tests without named owner and fix-by date.
119
148
  - Test depends on `new Date()`, `Math.random()`, or system locale.
120
- - `afterEach` resets database (move to `beforeEach`).
121
- - AI-written test has 6+ assertions and zero edge cases.
122
- - Phrase "I'll mock this to be safe" appears in the diff.
149
+ - `afterEach` resets database state.
150
+ - Agent-written test has 6+ assertions and zero edge cases.
151
+ - The diff contains the phrase "I'll mock this to be safe."
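The `new Date()` / `Math.random()` red flag in the list above is usually resolved by passing the clock or random source in as a parameter. A minimal sketch, assuming hypothetical functions `isWeekend` and `pickWinner` that are not part of the package:

```javascript
// Hypothetical production functions: the clock and the random source are
// parameters with real defaults, so production call sites stay unchanged.
function isWeekend(now = new Date()) {
  const day = now.getUTCDay(); // 0 = Sunday, 6 = Saturday; UTC avoids locale drift
  return day === 0 || day === 6;
}

function pickWinner(entries, random = Math.random) {
  return entries[Math.floor(random() * entries.length)];
}

// Tests pass fixed inputs instead of reading the system clock or RNG.
const saturday = new Date("2024-06-01T12:00:00Z");
const monday = new Date("2024-06-03T12:00:00Z");

console.log(isWeekend(saturday), isWeekend(monday));
console.log(pickWinner(["a", "b", "c"], () => 0.5));
```

The default arguments mean no test-only branch is needed in production code; the test simply supplies its own values.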
123
152
 
124
153
  ## When NOT to use this skill
125
154
 
@@ -131,18 +160,12 @@ Any of these in a PR triggers REJECTED in `/dw-code-review`:
131
160
 
132
161
  ## Integration with dev-workflow commands
133
162
 
134
- - `/dw-create-tasks` uses the placement doctrine — each test-adding task must name the invariant.
135
- - `/dw-run-task` applies the 7 AI gates when generating tests as part of implementation.
163
+ - `/dw-create-tasks` applies the placement doctrine — each test-adding task names the invariant.
164
+ - `/dw-run-task` runs the 6 agent guardrails when generating tests during implementation.
136
165
  - `/dw-code-review` runs the anti-pattern checks on diff hunks under test paths.
137
- - `/dw-fix-qa` runs flaky-discipline taxonomy when retesting bugs.
166
+ - `/dw-fix-qa` applies the flaky-discipline taxonomy in retest cycles.
138
167
  - `/dw-run-qa` (UI mode) references `playwright-recipes.md` for concrete recipes.
139
168
 
140
- ## Why this skill exists
141
-
142
- The previous bundled skill (`webapp-testing`) mixed Playwright recipes with two discipline references (`security-boundary`, `three-workflow-patterns`) added later. The discipline references were interred in a tactical skill that the agent didn't reach for as doctrine.
143
-
144
- This skill consolidates: doctrine at the top, Playwright recipes as one reference, security and workflow patterns as their own references. One skill, coherent voice, doctrine-first.
145
-
146
169
  ## Bottom line
147
170
 
148
- > A test that cannot fail is decorative. A test that fails for the wrong reason is misleading. Build tests that fail for exactly one reason — the reason the invariant was violated — and trust them when they do. Mocks isolate. Real systems validate. Coverage shines a light. Mutation score grades the suite. Agents will reach for the mock and the snapshot; the gates here make them put both down. Tests reveal bugs, not just pass.
171
+ > A test that cannot fail is decorative. A test that fails for the wrong reason is misleading. Build tests that fail for exactly one reason — the reason the invariant was violated — and trust them when they do. Mocks isolate. Real systems validate. Coverage shines a light. Mutation score grades the suite. Agents will reach for the mock and the snapshot; the guardrails make them put both down. Tests reveal bugs, not just pass.
@@ -0,0 +1,170 @@
1
+ # Six guardrails — mandatory when an agent writes tests
2
+
3
+ LLMs have characteristic failure modes when authoring tests. Six guardrails are forcing functions for the most common ones.
4
+
5
+ Every test produced by an agent (via `/dw-run-task`, `/dw-bugfix`, `/dw-autopilot`, or any code-generating flow) must clear all six BEFORE the diff goes to review.
6
+
7
+ ## Guardrail 1 — State the invariant, layer, and host suite first
8
+
9
+ **Failure mode it blocks:** agent writes 200 lines of test code without articulating what the test is supposed to prove or where it belongs.
10
+
11
+ **What the agent must do:**
12
+
13
+ Before any test code, print:
14
+
15
+ ```
16
+ INVARIANT: <one sentence — what behavior the test verifies>
17
+ OWNING_LAYER: <unit | integration | contract | e2e>
18
+ EXISTING_SUITE: <path to the existing test file the new test joins, or "NEW: <reason>">
19
+ ```
20
+
21
+ If the agent can't fill any line, it stops and asks the user — it does NOT invent an invariant.
22
+
23
+ **Why it works:**
24
+ - "Invariant" forces specific behavior naming.
25
+ - "Owning layer" forces Rule 2 (lowest detectable layer).
26
+ - "Existing suite" forces extending coverage rather than spawning orphan files.
27
+
28
+ **Verification:** `/dw-code-review` looks for this 3-line preamble in the PR description or commit body. Missing = REJECTED.
29
+
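The three-line preamble check in Guardrail 1 could be approximated along these lines. This is a hypothetical sketch, not the actual `/dw-code-review` implementation; `checkPreamble` and its return shape are assumptions made for illustration.

```javascript
// Hypothetical checker; not the real /dw-code-review logic.
const REQUIRED_KEYS = ["INVARIANT", "OWNING_LAYER", "EXISTING_SUITE"];
const LAYERS = ["unit", "integration", "contract", "e2e"];

function checkPreamble(text) {
  const problems = [];
  const values = {};
  for (const key of REQUIRED_KEYS) {
    // The key must start a line and be followed by a non-empty value.
    const match = text.match(new RegExp(`^${key}:[ \\t]*(\\S.*)$`, "m"));
    if (!match) {
      problems.push(`missing ${key}`);
    } else {
      values[key] = match[1].trim();
    }
  }
  if (values.OWNING_LAYER && !LAYERS.includes(values.OWNING_LAYER)) {
    problems.push(`unknown layer: ${values.OWNING_LAYER}`);
  }
  return { ok: problems.length === 0, problems };
}

const prBody = [
  "INVARIANT: an order is confirmed only after stock is reserved",
  "OWNING_LAYER: integration",
  "EXISTING_SUITE: NEW: no order suite exists yet",
].join("\n");

console.log(checkPreamble(prBody));
```

A text check like this only proves the lines are present; whether the stated invariant is meaningful still needs a reviewer.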
30
+ ## Guardrail 2 — Real execution somewhere
31
+
32
+ **Failure mode it blocks:** agent writes tests that mock everything. They pass green forever and validate nothing.
33
+
34
+ **What the agent must do:**
35
+
36
+ At SOME layer, the test path must run against real systems before merge:
37
+
38
+ - Pure logic: unit only is sufficient.
39
+ - Code touching DB: at least one integration test with real DB (testcontainers, ephemeral container, dedicated test DB).
40
+ - Code calling external services: a contract test OR a sandbox-account smoke test.
41
+ - UI interactions: at least one E2E run on a real preview environment.
42
+
43
+ **Verification:** PR description lists the real-system runs covering the touched code. If no real-system path covers the change → REJECTED.
44
+
45
+ ## Guardrail 3 — On red, read production first
46
+
47
+ **Failure mode it blocks:** agent sees a test go red and modifies the test until green. Bug ships.
48
+
49
+ **What the agent must do:**
50
+
51
+ When a test fails (its own or pre-existing):
52
+
53
+ 1. Print: `INVESTIGATING FAILURE: <test name>`.
54
+ 2. Read production code in the path that produced the observed value.
55
+ 3. Print: `ANALYSIS: <2-3 sentences — is production wrong, the test wrong, or has the invariant changed?>`.
56
+ 4. Decide:
57
+ - Production wrong → fix production.
58
+ - Test wrong → fix the test AND document the change in the commit body.
59
+ - Invariant changed → update the test AND open an ADR if it's a public-contract change.
60
+
61
+ **Verification:** every commit changing a previously-green test must have an `ANALYSIS:` line. Missing = REJECTED.
62
+
63
+ ## Guardrail 4 — No self-confirming assertions
64
+
65
+ **Failure mode it blocks:** two related shortcuts —
66
+ - Agent writes `mockFn.mockReturnValue('X')` then asserts `expect(mockFn()).toBe('X')`. Proves nothing — the test asserts what the test set up.
67
+ - Agent reaches for `toMatchSnapshot()` whenever unsure what to assert. The snapshot becomes the assertion; drift goes unnoticed.
68
+
69
+ **What the agent must do:**
70
+
71
+ **For mocks:** never assert on a value the test body fed into a mock. Assert on:
72
+ - The OUTPUT of production code that consumed the mock.
73
+ - The SIDE EFFECTS (DB state, network calls, event emissions) caused by production code.
74
+ - The VISIBLE behavior (UI change, log line, response) the user/caller observes.
75
+
76
+ **For snapshots:** before adding `toMatchSnapshot()`, classify the artifact:
77
+ - `PRODUCT_CONTRACT` — a stable contract worth pinning (serialized API output, stored-record schema). Snapshot OK; document the classification in a comment.
78
+ - `IMPLEMENTATION_DETAIL` — HTML structure, internal representation, component tree shape. Snapshot is FORBIDDEN; write specific assertions instead.
79
+
80
+ **Verification:**
81
+ - Mock value flowing directly from setup to assertion without passing through production code → REJECTED.
82
+ - Snapshot in diff without classification comment → REJECTED.
83
+ - Snapshot classified `IMPLEMENTATION_DETAIL` → REJECTED.
84
+
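The mock half of Guardrail 4 comes down to where the stubbed value travels before it is asserted on. A minimal sketch with a hand-rolled stub instead of a mocking library; `grossPriceCents` and `stubLookup` are hypothetical names chosen for illustration.

```javascript
// Hypothetical production code: computes a gross price from an injected rate lookup.
function grossPriceCents(netCents, country, lookupRatePercent) {
  const rate = lookupRatePercent(country);
  return Math.round((netCents * (100 + rate)) / 100);
}

// Hand-rolled stub standing in for a tax service the test does not control.
const stubLookup = () => 20;

// Self-confirming: this only proves the stub returns what the test told it to.
const selfConfirming = stubLookup("DE") === 20;

// Meaningful: the stubbed value passes through production code before the check,
// so a bug in the rounding or the formula would change the result.
const gross = grossPriceCents(1000, "DE", stubLookup);

console.log(selfConfirming, gross);
```

The first check stays green even if `grossPriceCents` is deleted; the second cannot.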
85
+ ## Guardrail 5 — Negative companion
86
+
87
+ **Failure mode it blocks:** agent writes happy-path-only tests. Edge cases, error paths, boundary inputs uncovered.
88
+
89
+ **What the agent must do:**
90
+
91
+ Every positive assertion ships WITH at least one negative companion:
92
+
93
+ - Asserting `createUser(validInput)` succeeds → also assert `createUser(invalidInput)` fails with a specific error.
94
+ - Asserting `parseDate(validString)` returns a Date → also assert `parseDate(invalidString)` throws or returns null.
95
+ - Asserting `transferFunds(...)` succeeds with sufficient balance → also assert it fails with insufficient balance.
96
+
97
+ **Verification:** a PR adding N positive assertions to a public path must add ≥1 negative assertion. Imbalance >3:1 (positive:negative) on a public path → REJECTED.
98
+
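The `parseDate` pairing from Guardrail 5 might look like this in practice. The implementation below is a hypothetical stand-in (strict `YYYY-MM-DD`, returning `null` on bad input), written only so the positive and negative assertions have something to run against.

```javascript
// Hypothetical parser: accepts strict YYYY-MM-DD, returns a Date or null.
function parseDate(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
  if (!match) return null;
  const [year, month, day] = match.slice(1).map(Number);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Reject values that Date silently rolls over, such as 2024-02-30.
  const roundTrips =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return roundTrips ? date : null;
}

// Positive assertion: a valid string yields the expected instant.
const valid = parseDate("2024-02-29");

// Negative companions: malformed and impossible inputs are rejected.
const impossible = parseDate("2024-02-30");
const wrongFormat = parseDate("29/02/2024");

console.log(valid.toISOString(), impossible, wrongFormat);
```

The impossible-date case is the one a happy-path-only suite would miss, because `Date` accepts it without complaint and rolls it forward into March.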
99
+ ## Guardrail 6 — Don't expand the surface to test it
100
+
101
+ **Failure mode it blocks:** agent exports internals, adds `*ForTesting` methods, or introduces `process.env.NODE_ENV === 'test'` branches in production code to make the test possible.
102
+
103
+ **What the agent must do:**
104
+
105
+ If the test needs access the production API doesn't grant:
106
+ - **Refactor production for testability** via dependency injection or interface seams.
107
+ - **Emit an observable side effect** the test can verify (event, log line, metric) that production also benefits from.
108
+ - **Use a dedicated test environment** with test credentials, not a backdoor flag.
109
+
110
+ The agent does NOT:
111
+ - Export `_internal` symbols just for tests.
112
+ - Add `// for testing only` methods on classes.
113
+ - Wrap production logic in `if (process.env.NODE_ENV !== 'test')` branches.
114
+
115
+ **Verification:** diff includes new production exports, env checks, or `*ForTesting`-style symbols → REJECTED. Refactor the surface or change the test approach.
116
+
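The dependency-injection route in Guardrail 6 can be sketched as below. `createSignupService` and its `notify` seam are hypothetical; the point is only that the test reaches the side effect through a constructor argument rather than a `NODE_ENV` branch or an exported internal.

```javascript
// Hypothetical production code: the notifier is an injected seam, not an env check.
function createSignupService({ notify }) {
  const users = new Map();
  return {
    register(email) {
      if (!email.includes("@")) throw new Error("invalid email");
      if (users.has(email)) throw new Error("already registered");
      users.set(email, { email });
      notify({ type: "welcome", to: email });
      return { email };
    },
  };
}

// Test wiring: a recording notifier replaces the real transport.
// No NODE_ENV branch, no exported internals, no *ForTesting method.
const sent = [];
const service = createSignupService({ notify: (message) => sent.push(message) });
service.register("ada@example.test");

console.log(sent);
```

Production would pass a real transport through the same parameter, so the seam pays for itself outside tests as well.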
117
+ ## How the six guardrails compose
118
+
119
+ A test that passes all six:
120
+ 1. States the invariant, layer, and host suite up front (Guardrail 1).
121
+ 2. Exercises real systems somewhere in the pipeline (Guardrail 2).
122
+ 3. When red, reads production first and documents the analysis (Guardrail 3).
123
+ 4. Asserts on production's observable output, not its own setup (Guardrail 4).
124
+ 5. Covers failures alongside successes (Guardrail 5).
125
+ 6. Lives inside the production API's existing surface (Guardrail 6).
126
+
127
+ Tests passing all six are worth running. Tests missing any one are more likely to mislead than to help.
128
+
129
+ ## Override procedure
130
+
131
+ To skip a guardrail explicitly:
132
+ 1. State which guardrail is skipped.
133
+ 2. State why in one sentence.
134
+ 3. Add a `// SKIP-GUARDRAIL-N: <reason>` comment in the test.
135
+ 4. Open a follow-up issue tracking the gap.
136
+
137
+ Without all four, the guardrail is enforced.
138
+
139
+ ## Prompt block injected when an agent writes tests
140
+
141
+ ```
142
+ You are about to write tests. Before producing test code, complete the
143
+ 6-guardrail preamble:
144
+
145
+ INVARIANT: ___
146
+ OWNING_LAYER: ___
147
+ EXISTING_SUITE: ___
148
+
149
+ If you cannot complete these three lines, STOP and ask the user for
150
+ the requirement. Do not invent an invariant.
151
+
152
+ Then, while writing tests:
153
+ - Real execution: name the real-system path covering this code.
154
+ - On red: read production first; print ANALYSIS: before changing the test.
155
+ - Mocks: never assert on values fed into a mock.
156
+ - Snapshots: classify PRODUCT_CONTRACT or IMPLEMENTATION_DETAIL — the latter is forbidden.
157
+ - Coverage: every positive assertion needs a negative companion.
158
+ - Production surface: don't export internals or add test-only branches.
159
+
160
+ Tests violating guardrails without explicit SKIP-GUARDRAIL-N comments
161
+ will be REJECTED at review.
162
+ ```
163
+
164
+ `/dw-run-task` and `/dw-bugfix` inject this prompt block before generating test code.
165
+
166
+ ## Why six and not more
167
+
168
+ These are the highest-frequency LLM failure modes observed across multiple projects. Other tendencies exist but are either covered by the positive patterns (e.g., wall-clock waits) or are lower-frequency than these six.
169
+
170
+ If a new failure mode appears that none of the six catches, add a guardrail AND document the failure that motivated it. Don't add guardrails speculatively.
@@ -1,10 +1,10 @@
1
- # Anti-patterns — 25 patterns across 5 families
1
+ # Anti-patterns — 25 smells across 4 families
2
2
 
3
- Each anti-pattern names the smell, shows the violation in pseudo-code, gives the fix, and notes how `/dw-code-review` detects it.
3
+ Each anti-pattern names the smell, shows the violation in pseudo-code, gives the fix, and notes how `/dw-code-review` detects it. Agent-specific failure modes are covered separately in `agent-guardrails.md`.
4
4
 
5
5
  ---
6
6
 
7
- ## Family 1: Brittleness (tests bound to internals)
7
+ ## Family A: Fragile to refactor (tests bound to internals, not behavior)
8
8
 
9
9
  ### A1. Implementation-detail selectors
10
10
 
@@ -81,7 +81,7 @@ test('clicking save works', async () => {
81
81
 
82
82
  ---
83
83
 
84
- ## Family 2: Flakiness (tests randomizing verdicts)
84
+ ## Family B: Non-deterministic outcomes (tests that flip verdict on the same code)
85
85
 
86
86
  ### A7. Static sleeps / fixed-timeout waits
87
87
 
@@ -117,7 +117,7 @@ test('today is Monday', () => {
117
117
 
118
118
  ---
119
119
 
120
- ## Family 3: Mock misuse (tests testing the test setup)
120
+ ## Family C: Mock-driven false confidence (tests asserting on their own setup)
121
121
 
122
122
  ### A10. Asserting the mock exists
123
123
 
@@ -181,7 +181,7 @@ You've tested the SCAFFOLD, not the logic.
181
181
 
182
182
  ---
183
183
 
184
- ## Family 4: Process (team and suite pathologies)
184
+ ## Family D: Suite hygiene problems (team and suite-level pathologies)
185
185
 
186
186
  ### A15. Coverage as vanity metric
187
187
 
@@ -0,0 +1,128 @@
1
+ # Six core rules — expanded with examples
2
+
3
+ The rules are short for memorization. Each carries nuance that matters in practice.
4
+
5
+ ## Rule 1: Test the behavior, never the mock
6
+
7
+ **What it means:** the test asserts what the system DOES from the caller's perspective. It does not assert that internal call X was made with internal argument Y.
8
+
9
+ **Why it matters:** a test bound to internal calls breaks the day you refactor — even when behavior didn't change. The "test is red, behavior is fine" experience erodes trust. Soon no one runs the suite.
10
+
11
+ **Violation:**
12
+
13
+ ```javascript
14
+ // BAD — asserting on mock internals
15
+ test('createOrder calls inventory.reserve', () => {
16
+ const inventory = { reserve: vi.fn() };
17
+ createOrder({ items: [...] }, inventory);
18
+ expect(inventory.reserve).toHaveBeenCalledWith(items, 'reserve');
19
+ });
20
+ ```
21
+
22
+ You've asserted that `createOrder` USES the inventory adapter a specific way. The refactor that consolidates `reserve` into `commit-with-reservation` breaks this even though the order still gets created.
23
+
24
+ **Correct version:**
25
+
26
+ ```javascript
27
+ // GOOD — asserting behavior
28
+ test('createOrder reserves inventory before confirming', async () => {
29
+ const result = await createOrder({ items: [...] });
30
+ expect(result.status).toBe('confirmed');
31
+ expect(await getInventoryFor(items[0].sku)).toBe(originalStock - 1);
32
+ });
33
+ ```
34
+
35
+ Now the test cares about the OUTCOME (inventory decremented, order confirmed), not the path.
36
+
37
+ ## Rule 2: Push each test to the lowest layer that can detect the defect
38
+
39
+ **What it means:** if a unit test can catch the bug, use a unit. If only an integration test can, integration. If only an end-to-end run can, E2E.
40
+
41
+ **Why it matters:** lower layers run faster, fail more precisely, isolate the cause better. A bug in pure logic caught at unit takes 50ms and points at the exact function. The same bug at E2E takes 30 seconds and tells you "checkout failed."
42
+
43
+ **The pyramid resolved:**
44
+
45
+ | Layer | Catches | Speed | Cost |
46
+ |-------|---------|-------|------|
47
+ | Unit | Pure logic, math, parsing, formatters | <100ms | low |
48
+ | Integration | Module composition, DB queries, HTTP handlers | 500ms–5s | medium |
49
+ | Contract | Producer/consumer agreement at API boundary | 1–10s | medium |
50
+ | E2E | User journey across multiple services | 10s–60s | high |
51
+
52
+ **Rule of thumb:**
53
+ - If you can write a unit test, do so.
54
+ - If unit can't reach it (needs DB, queue, real HTTP), write integration.
55
+ - E2E only for journeys NO lower layer can detect: browser-renders-correctly, third-party-callback-arrives, multi-step session state.
56
+
57
+ ## Rule 3: When a test fails, read production first — change the test only with documented justification
58
+
59
+ **What it means:** a red test is a signal. The first question is "what's wrong with production?" — not "why is the test wrong?"
60
+
61
+ **Why it matters:** tests are weakened to pass far more often than they should be. "The behavior is fine; the test is too strict" is the slippery slope that ends in a green suite full of meaningless assertions.
62
+
63
+ **Process when a test goes red:**
64
+
65
+ 1. **Read the failure message.** What invariant did the test claim? What did it observe?
66
+ 2. **Read production code** in the path that produces the observation.
67
+ 3. **Decide which is wrong.** If production violates the invariant, fix production. If the test mis-states the invariant, document WHY before relaxing.
68
+ 4. **Commit the analysis** in the test's commit message or PR body. "Relaxed assertion from X to Y because `<reason>`" is auditable; "fix test" is not.
69
+
70
+ **Anti-patterns:** re-running the test until it goes green; auto-retrying on flake; adding `.only` so the rest of the suite is skipped.
71
+
72
+ ## Rule 4: Real systems gate the merge. Mocks isolate; they do not validate.
73
+
74
+ **What it means:** before code merges to main, at least ONE test path has exercised real systems: a real DB, a real route, a real external integration in a sandbox or test account. Mocks are fine for fast unit feedback; they cannot decide "safe to ship."
75
+
76
+ **Why it matters:** mock drift is real. The mocked HTTP response from 3 months ago no longer matches the actual API. Tests pass; production fails on first real call.
77
+
78
+ **Practical pattern:**
79
+
80
+ - Unit tests: mock the world; run on every keystroke / commit.
81
+ - Integration tests: real local DB (testcontainers, in-memory if equivalent); run on every PR.
82
+ - Contract tests: real producer/consumer agreement check; run on every PR.
83
+ - E2E: real preview environment with real services; run on PRs before merge to main.
84
+
85
+ The discipline: no merge without a green E2E (or equivalent real-system check) for the touched path.
86
+
87
+ ## Rule 5: Coverage is a flashlight; mutation score is a quality probe. Neither is a target.
88
+
89
+ **What it means:**
90
+ - **Coverage** tells you what lines executed. Useful as a NEGATIVE signal — 30% coverage means lots of dark code. Useless as a positive signal — 95% coverage with weak assertions is decorative.
91
+ - **Mutation score** comes from mutation testing, which introduces small bugs (mutations) into the code and measures how many of them the tests catch. A high mutation score means tests actually probe behavior, not just execute lines.
92
+ - Neither should be a number you optimize for. They're diagnostics.
93
+
94
+ **Anti-pattern:** "We need 90% coverage to merge." Coverage as a gate produces tests written to pass the gate, not to find bugs.
95
+
96
+ **Healthier framing:** "What lines in the touched diff are NOT covered? Why?" Sometimes the answer is "we don't care, it's logging." Sometimes it's "actually that's a critical branch — add a test."
97
+
98
+ ## Rule 6: No test-only methods, branches, or flags leak into production code
99
+
100
+ **What it means:** production code should not have `if (process.env.NODE_ENV === 'test') { ... }` branches, `// for testing only` methods exposed on classes, or internals exported just for assertions.
101
+
102
+ **Why it matters:** test-only logic in production code is test scaffolding leaking into the artifact users run. Bug surface grows; the test environment diverges from production.
103
+
104
+ **Correct patterns:**
105
+
106
+ - Need to inject a dependency for testing? Use constructor injection / dependency injection.
107
+ - Need to assert on internal state? Add a logging hook or event emission that production also benefits from.
108
+ - Need to bypass auth in tests? Use a dedicated test environment with test credentials, not a backdoor flag.
109
+
110
+ **Tells:**
111
+ - `// only used in tests` comments.
112
+ - `*ForTesting` suffix on methods.
113
+ - `vi.spyOn(module, '_internal')` accessing underscore-prefixed members.
114
+ - `process.env.E2E_MODE` reaching into production runtime decisions.
115
+
116
+ If you see these, the test design is wrong. Refactor production to be testable; don't add backdoors.
117
+
118
+ ## Putting the rules together
119
+
120
+ A healthy test:
121
+ 1. Asserts behavior visible to a caller (Rule 1).
122
+ 2. Sits at the lowest layer that can prove that behavior (Rule 2).
123
+ 3. When red, sends you to read production code (Rule 3).
124
+ 4. Has a sibling exercising real systems somewhere in the pipeline (Rule 4).
125
+ 5. Survives a mutation in the code it claims to cover (Rule 5).
126
+ 6. Has zero footprint in production code (Rule 6).
127
+
128
+ Any test that fails two or more of these checks is accumulating technical debt. `/dw-code-review` flags them.
@@ -2,7 +2,7 @@
2
2
 
3
3
  Practical Playwright code for the common scenarios. Use this when `/dw-run-qa` runs in UI mode or when `/dw-functional-doc` needs E2E coverage.
4
4
 
5
- > These recipes ASSUME the doctrine in this skill (Iron Laws, positive patterns) has already been applied. Recipes are the HOW once the WHY is settled.
5
+ > These recipes ASSUME the doctrine in this skill (core rules, positive patterns) has already been applied. Recipes are the HOW once the WHY is settled.
6
6
 
7
7
  ## Basic Navigation
8
8
 
@@ -275,7 +275,7 @@ Tests now start authenticated. No login loop in every test.
275
275
  ## Cross-skill integration
276
276
 
277
277
  When running these recipes, the doctrine in this skill applies:
278
- - Apply Iron Laws (especially Law 2: lowest layer first — many E2E tests should be integration tests instead).
278
+ - Apply core rules (especially Rule 2: lowest layer first — many E2E tests should be integration tests instead).
279
279
  - Run the 7 AI Agent Gates if a coding agent is producing this test.
280
280
  - Check for Anti-Patterns 1, 5, 7, 20 in the diff.
281
281
  - For browser security scenarios (auth bypass, XSS, CSRF), see `security-boundary.md`.