npm - @brunosps00/dev-workflow - Versions diffs - 0.11.0 → 0.15.0 - Mend

@brunosps00/dev-workflow 0.11.0 → 0.15.0


Files changed (127)

package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md ADDED

@@ -0,0 +1,170 @@
+# Six guardrails — mandatory when an agent writes tests
+LLMs have characteristic failure modes when authoring tests. Six guardrails are forcing functions for the most common ones.
+Every test produced by an agent (via `/dw-run-task`, `/dw-bugfix`, `/dw-autopilot`, or any code-generating flow) must clear all six BEFORE the diff goes to review.
+## Guardrail 1 — State the invariant, layer, and host suite first
+**Failure mode it blocks:** agent writes 200 lines of test code without articulating what the test is supposed to prove or where it belongs.
+**What the agent must do:**
+Before any test code, print:
+```
+INVARIANT: <one sentence — what behavior the test verifies>
+OWNING_LAYER: <unit | integration | contract | e2e>
+EXISTING_SUITE: <path to the existing test file the new test joins, or "NEW: <reason>">
+```
+If the agent can't fill in all three lines, it stops and asks the user; it does NOT invent an invariant.
+**Why it works:**
+- "Invariant" forces specific behavior naming.
+- "Owning layer" forces Rule 2 (lowest detectable layer).
+- "Existing suite" forces extending coverage rather than spawning orphan files.
+**Verification:** `/dw-code-review` looks for this 3-line preamble in the PR description or commit body. Missing = REJECTED.
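A reviewer-side check for the three-line preamble could be sketched as follows. This is only an illustration: `checkPreamble` and its return shape are hypothetical names, and the source does not say how `/dw-code-review` actually performs the lookup.

```javascript
// Hypothetical helper: checks a PR description or commit body for the
// three preamble lines described above. Not the real /dw-code-review logic.
const REQUIRED_KEYS = ['INVARIANT', 'OWNING_LAYER', 'EXISTING_SUITE'];
const LAYERS = ['unit', 'integration', 'contract', 'e2e'];

function checkPreamble(text) {
  const found = {};
  for (const line of text.split('\n')) {
    // Matches lines such as "OWNING_LAYER: unit"
    const match = /^([A-Z_]+):\s*(.+)$/.exec(line.trim());
    if (match && REQUIRED_KEYS.includes(match[1])) {
      found[match[1]] = match[2].trim();
    }
  }
  const missing = REQUIRED_KEYS.filter((key) => !found[key]);
  const badLayer =
    found.OWNING_LAYER !== undefined && !LAYERS.includes(found.OWNING_LAYER);
  return { ok: missing.length === 0 && !badLayer, missing, badLayer };
}
```

A body that omits `EXISTING_SUITE` or names a layer outside the four listed would come back with `ok: false`, which a reviewer could map to REJECTED.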
+## Guardrail 2 — Real execution somewhere
+**Failure mode it blocks:** agent writes tests that mock everything. They pass green forever and validate nothing.
+**What the agent must do:**
+At SOME layer, the test path must run against real systems before merge:
+- Pure logic: unit only is sufficient.
+- Code touching DB: at least one integration test with real DB (testcontainers, ephemeral container, dedicated test DB).
+- Code calling external services: a contract test OR a sandbox-account smoke test.
+- UI interactions: at least one E2E run on a real preview environment.
+**Verification:** PR description lists the real-system runs covering the touched code. If no real-system path covers the change → REJECTED.
+## Guardrail 3 — On red, read production first
+**Failure mode it blocks:** agent sees a test go red and modifies the test until green. Bug ships.
+**What the agent must do:**
+When a test fails (its own or pre-existing):
+1. Print: `INVESTIGATING FAILURE: <test name>`.
+2. Read production code in the path that produced the observed value.
+3. Print: `ANALYSIS: <2-3 sentences — is production wrong, the test wrong, or has the invariant changed?>`.
+4. Decide:
+   - Production wrong → fix production.
+   - Test wrong → fix the test AND document the change in the commit body.
+   - Invariant changed → update the test AND open an ADR if it's a public-contract change.
+**Verification:** every commit changing a previously-green test must have an `ANALYSIS:` line. Missing = REJECTED.
+## Guardrail 4 — No self-confirming assertions
+**Failure mode it blocks:** two related shortcuts —
+- Agent writes `mockFn.mockReturnValue('X')` then asserts `expect(mockFn()).toBe('X')`. Proves nothing — the test asserts what the test set up.
+- Agent reaches for `toMatchSnapshot()` whenever unsure what to assert. The snapshot becomes the assertion; drift goes unnoticed.
+**What the agent must do:**
+**For mocks:** never assert on a value the test body fed into a mock. Assert on:
+- The OUTPUT of production code that consumed the mock.
+- The SIDE EFFECTS (DB state, network calls, event emissions) caused by production code.
+- The VISIBLE behavior (UI change, log line, response) the user/caller observes.
+**For snapshots:** before adding `toMatchSnapshot()`, classify the artifact:
+- `PRODUCT_CONTRACT` — a stable contract worth pinning (serialized API output, stored-record schema). Snapshot OK; document the classification in a comment.
+- `IMPLEMENTATION_DETAIL` — HTML structure, internal representation, component tree shape. Snapshot is FORBIDDEN; write specific assertions instead.
+**Verification:**
+- Mock value flowing directly from setup to assertion without passing through production code → REJECTED.
+- Snapshot in diff without classification comment → REJECTED.
+- Snapshot classified `IMPLEMENTATION_DETAIL` → REJECTED.
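The mock half of this guardrail can be illustrated with a small sketch. Everything here is hypothetical: `greetUser` stands in for production code, and `makeStub` is a hand-rolled substitute for `vi.fn()` so the example runs without a test framework.

```javascript
// Production code under test (hypothetical): formats a greeting from
// whatever the injected user lookup returns.
function greetUser(userId, lookupUser) {
  const user = lookupUser(userId);
  if (!user) return 'Hello, guest!';
  return `Hello, ${user.name}!`;
}

// A hand-rolled stub that records its calls, standing in for vi.fn().
function makeStub(returnValue) {
  const stub = (...args) => {
    stub.calls.push(args);
    return returnValue;
  };
  stub.calls = [];
  return stub;
}

// Self-confirming (rejected): the stub's return value goes straight into
// the comparison without ever passing through production code.
const selfConfirming = makeStub({ name: 'Ada' });
const provesNothing = selfConfirming(1).name === 'Ada';

// Acceptable: the value checked is what greetUser produced from the stub.
const lookup = makeStub({ name: 'Ada' });
const greeting = greetUser(1, lookup);
```

`provesNothing` is true no matter what `greetUser` does, which is exactly why it validates nothing; `greeting` changes the moment the production formatting changes.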
+## Guardrail 5 — Negative companion
+**Failure mode it blocks:** agent writes happy-path-only tests. Edge cases, error paths, boundary inputs uncovered.
+**What the agent must do:**
+Every positive assertion ships WITH at least one negative companion:
+- Asserting `createUser(validInput)` succeeds → also assert `createUser(invalidInput)` fails with a specific error.
+- Asserting `parseDate(validString)` returns a Date → also assert `parseDate(invalidString)` throws or returns null.
+- Asserting `transferFunds(...)` succeeds with sufficient balance → also assert it fails with insufficient balance.
+**Verification:** a PR adding N positive assertions to a public path must add ≥1 negative assertion. Imbalance >3:1 (positive:negative) on a public path → REJECTED.
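The arithmetic of the verification rule can be sketched as below. `companionVerdict` is a hypothetical name, and reading "Imbalance >3:1" as "strictly more than three positive assertions per negative one" is an assumption about the intended threshold.

```javascript
// Hypothetical check of the positive:negative balance on a public path.
function companionVerdict(positiveCount, negativeCount) {
  if (positiveCount === 0) return 'OK';        // nothing to balance
  if (negativeCount === 0) return 'REJECTED';  // no negative companion at all
  // Assumed reading: a ratio strictly above 3:1 is an imbalance.
  return positiveCount / negativeCount > 3 ? 'REJECTED' : 'OK';
}
```

Under that reading, three positive assertions with one negative pass, while four positive with one negative do not.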
+## Guardrail 6 — Don't expand the surface to test it
+**Failure mode it blocks:** agent exports internals, adds `*ForTesting` methods, or introduces `process.env.NODE_ENV === 'test'` branches in production code to make the test possible.
+**What the agent must do:**
+If the test needs access the production API doesn't grant:
+- **Refactor production for testability** via dependency injection or interface seams.
+- **Emit an observable side effect** the test can verify (event, log line, metric) that production also benefits from.
+- **Use a dedicated test environment** with test credentials, not a backdoor flag.
+The agent does NOT:
+- Export `_internal` symbols just for tests.
+- Add `// for testing only` methods on classes.
+- Wrap production logic in `if (process.env.NODE_ENV !== 'test')` branches.
+**Verification:** diff includes new production exports, env checks, or `*ForTesting`-style symbols → REJECTED. Refactor the surface or change the test approach.
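The dependency-injection route might look like the sketch below. `WelcomeService` and the mailer shape are hypothetical and only illustrate the seam; the source does not prescribe a particular design.

```javascript
// Before (rejected): production branches on the test environment.
//   function sendWelcome(user) {
//     if (process.env.NODE_ENV !== 'test') mailer.send(user.email, 'Welcome');
//   }

// After: the mailer is a constructor dependency; there is no test-only branch.
class WelcomeService {
  constructor(mailer) {
    this.mailer = mailer;
  }

  sendWelcome(user) {
    if (!user.email) throw new Error('user has no email');
    this.mailer.send(user.email, 'Welcome');
  }
}

// In a test, a recording fake goes through the same public constructor
// that production uses for the real mailer.
const sent = [];
const fakeMailer = { send: (to, subject) => sent.push({ to, subject }) };
new WelcomeService(fakeMailer).sendWelcome({ email: 'ada@example.com' });
```

The test observes the side effect through the injected collaborator, so nothing new is exported and no environment check enters production code.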
+## How the six guardrails compose
+A test that passes all six:
+1. States the invariant, layer, and host suite up front (Guardrail 1).
+2. Exercises real systems somewhere in the pipeline (Guardrail 2).
+3. When red, reads production first and documents the analysis (Guardrail 3).
+4. Asserts on production's observable output, not its own setup (Guardrail 4).
+5. Covers failures alongside successes (Guardrail 5).
+6. Lives inside the production API's existing surface (Guardrail 6).
+Tests passing all six are worth running. Tests missing any one are more likely to mislead than to help.
+## Override procedure
+To skip a guardrail explicitly:
+1. State which guardrail is skipped.
+2. State why in one sentence.
+3. Add a `// SKIP-GUARDRAIL-N: <reason>` comment in the test.
+4. Open a follow-up issue tracking the gap.
+Without all four, the guardrail is enforced.
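Step 3 of the override lends itself to a mechanical check. The sketch below is a hypothetical scanner for the `// SKIP-GUARDRAIL-N: <reason>` comment; the function name and result shape are illustrative, and it covers only the comment, not the follow-up issue.

```javascript
// Matches "// SKIP-GUARDRAIL-N: reason" where N is 1 to 6 and the reason
// is non-empty.
const SKIP_PATTERN = /\/\/\s*SKIP-GUARDRAIL-([1-6]):\s*(\S.*)$/;

function findSkips(testSource) {
  const skips = [];
  testSource.split('\n').forEach((line, index) => {
    const match = SKIP_PATTERN.exec(line);
    if (match) {
      skips.push({
        line: index + 1,
        guardrail: Number(match[1]),
        reason: match[2].trim(),
      });
    }
  });
  return skips;
}
```

A comment with no reason after the colon, or with a guardrail number outside 1 to 6, is not recognized as a valid skip.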
+## Prompt block injected when an agent writes tests
+```
+You are about to write tests. Before producing test code, complete the
+three-line preamble from Guardrail 1:
+INVARIANT: ___
+OWNING_LAYER: ___
+EXISTING_SUITE: ___
+If you cannot complete these three lines, STOP and ask the user for
+the requirement. Do not invent an invariant.
+Then, while writing tests:
+- Real execution: name the real-system path covering this code.
+- On red: read production first; print ANALYSIS: before changing the test.
+- Mocks: never assert on values fed into a mock.
+- Snapshots: classify PRODUCT_CONTRACT or IMPLEMENTATION_DETAIL — the latter is forbidden.
+- Coverage: every positive assertion needs a negative companion.
+- Production surface: don't export internals or add test-only branches.
+Tests violating guardrails without explicit SKIP-GUARDRAIL-N comments
+will be REJECTED at review.
+```
+`/dw-run-task` and `/dw-bugfix` inject this prompt block before generating test code.
+## Why six and not more
+These are the highest-frequency LLM failure modes observed across multiple projects. Other tendencies exist but are either covered by the positive patterns (e.g., wall-clock waits) or are lower-frequency than these six.
+If a new failure mode appears that none of the six catches, add a guardrail AND document the failure that motivated it. Don't add guardrails speculatively.

package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md ADDED

@@ -0,0 +1,336 @@
+# Anti-patterns — 25 smells across 4 families
+Each anti-pattern names the smell, shows the violation in pseudo-code, gives the fix, and notes how `/dw-code-review` detects it. Agent-specific failure modes are covered separately in `agent-guardrails.md`.
+---
+## Family A: Fragile to refactor (tests bound to internals, not behavior)
+### A1. Implementation-detail selectors
+**Violation:**
+```javascript
+await page.click('.btn.btn-primary.checkout-button');
+```
+**Fix:** Use `getByRole('button', { name: 'Checkout' })`.
+**Detection:** Grep for `.click(`, `.querySelector(`, `cy.get('.`, `getByTestId(` with a class-flavored argument.
+### A2. Testing internal structure vs observable behavior
+**Violation:**
+```javascript
+expect(component.state.cart.items.length).toBe(3);
+```
+**Fix:** Assert what the user sees: `await expect(page.getByText('3 items in cart')).toBeVisible()`.
+**Detection:** Tests that import/inspect internal state, refs, or private fields. Class-based component tests that read `.state` directly.
+### A3. Testing private methods directly
+**Violation:**
+```javascript
+expect(orderService._calculateTax(...)).toBe(8.5);
+```
+**Fix:** Test the public method that uses tax calculation; verify the result. If the private method is independently complex, extract it to a module and test that module's public API.
+**Detection:** Identifiers starting with `_` accessed from tests.
+### A4. Snapshot-as-test (snapshot replacing assertion)
+**Violation:**
+```javascript
+expect(rendered).toMatchSnapshot();  // ← only assertion in test
+```
+**Fix:** Either:
+1. Write specific assertions about what the component renders, OR
+2. Use a snapshot AS A SECONDARY check after specific assertions, with a comment classifying the snapshot as `PRODUCT_CONTRACT` (UI contract worth pinning) — never `IMPLEMENTATION_DETAIL`.
+**Detection:** Tests where the only assertion is `toMatchSnapshot` or equivalent.
+### A5. Vague existence assertions
+**Violation:**
+```javascript
+expect(result).toBeTruthy();
+expect(element).toBeDefined();
+cy.get('button').should('exist');
+```
+**Fix:** Assert what you actually want: `expect(result.status).toBe('success')`, `expect(button).toBeEnabled()`, `expect(button).toHaveText('Continue')`.
+**Detection:** Tests asserting only existence/truthiness without follow-up semantic check.
+### A6. Action without assertion
+**Violation:**
+```javascript
+test('clicking save works', async () => {
+  await page.getByRole('button', { name: 'Save' }).click();
+  // ← no assertion. What did "works" mean?
+});
+```
+**Fix:** Define what "works" means. Assert the observable outcome: URL changed, modal closed, success message visible, data persisted.
+**Detection:** Tests with `await x.click()` or `await x.type()` followed by no `expect(...)`.
+---
+## Family B: Non-deterministic outcomes (tests that flip verdict on the same code)
+### A7. Static sleeps / fixed-timeout waits
+**Violation:**
+```javascript
+await page.waitForTimeout(2000);
+```
+**Fix:** `await expect(page.getByText(/order confirmed/i)).toBeVisible({ timeout: 5000 })` — wait on the actual condition.
+**Detection:** Grep for `waitForTimeout`, `cy.wait(<number>)`, `sleep(`, `Thread.sleep`, `time.sleep` in test files.
+### A8. Test order dependency / hidden shared state
+**Violation:** Test B passes only after Test A has run because A populates a shared cache or DB row.
+**Fix:** Each test sets up its own state in `beforeEach`. Verify by running tests with `--shuffle` or `--randomize`.
+**Detection:** Tests fail when run with `.only`. Tests fail with `--shuffle`. Setup in `beforeAll` instead of `beforeEach`.
+### A9. Non-deterministic inputs (clock, RNG, locale)
+**Violation:**
+```javascript
+test('today is Monday', () => {
+  expect(new Date().getDay()).toBe(1);  // fails 6 days a week
+});
+```
+**Fix:** Mock the clock (`vi.useFakeTimers()`, `jest.useFakeTimers()`, `freezegun` in Python). Seed RNG. Pin locale.
+**Detection:** Tests using `new Date()`, `Date.now()`, `Math.random()`, `Intl.DateTimeFormat` without fakes.
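Besides fake timers, the clock can also be passed in as a parameter, which is a different technique from the one named in the fix. The sketch below assumes a hypothetical `isMonday` production function and uses UTC so the result does not depend on the machine's time zone.

```javascript
// Hypothetical production function: the clock is a parameter with a
// real-time default, so production callers pass nothing.
function isMonday(now = new Date()) {
  return now.getUTCDay() === 1;
}

// Deterministic on any day of the week: fixed instants, no fake timers.
const aMonday = new Date('2024-01-15T10:30:00Z');
const aTuesday = new Date('2024-01-16T10:30:00Z');
```

A test then asserts `isMonday(aMonday)` is true and `isMonday(aTuesday)` is false, covering both the positive case and its negative companion.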
+---
+## Family C: Mock-driven false confidence (tests asserting on their own setup)
+### A10. Asserting the mock exists
+**Violation:**
+```javascript
+expect(mockFn).toBeDefined();
+```
+**Fix:** Don't assert on mock setup. If the mock is wrong, the behavior assertion downstream will fail naturally.
+**Detection:** Mock functions referenced in assertions without `toHaveBeenCalled` semantics.
+### A11. Mock drift
+**Violation:** Mocked API response set up 6 months ago still returns `{ status: 'OK' }` while the real API now returns `{ ok: true }`.
+**Fix:** Contract testing (Pact, schemathesis) or periodic recording (msw + real-traffic capture). Re-validate mocks against real APIs quarterly.
+**Detection:** Tests with mocks that haven't been touched in >90 days against APIs that have changed. Hard to detect in CI; needs explicit contract checks.
+### A12. Over-mocking child components
+**Violation:**
+```javascript
+vi.mock('./UserAvatar');
+vi.mock('./UserMenu');
+vi.mock('./UserBanner');
+// ... testing nothing real
+```
+**Fix:** Mock at boundaries (HTTP, DB, third-party SDKs). Render real children unless they're genuinely expensive or test-irrelevant.
+**Detection:** Test files with >3 module mocks of internal modules.
+### A13. Incomplete mocks (missing fields the code reads)
+**Violation:**
+```javascript
+const mockUser = { id: 1 };  // missing email, but code reads user.email
+```
+**Fix:** Use a factory that supplies sensible defaults for ALL fields the type/contract declares.
+**Detection:** Runtime errors like `Cannot read properties of undefined (reading 'X')` inside production code under test, or `undefined` values surfacing in its output.
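A factory of the kind the fix describes might look like this. The field list is an assumption made for illustration; a real factory would mirror whatever the actual `User` type or contract declares.

```javascript
// Hypothetical factory: every field of the (assumed) User contract gets a
// default, and a test overrides only the fields it cares about.
function buildUser(overrides = {}) {
  return {
    id: 1,
    email: 'user@example.com',
    name: 'Test User',
    roles: ['member'],
    ...overrides,
  };
}

const admin = buildUser({ roles: ['admin'] });
```

Because `admin` still carries a default `email`, production code that reads `user.email` keeps working even though the test only cared about roles.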
+### A14. Mocking wrong level (mocking methods the logic depends on)
+**Violation:**
+```javascript
+// Testing OrderService, but mocking its private calculate() method
+const service = new OrderService();
+vi.spyOn(service, 'calculate').mockReturnValue(100);
+expect(service.processOrder(...)).toBe(/* uses mocked 100 */);
+```
+You've tested the SCAFFOLD, not the logic.
+**Fix:** Mock at the EDGE (DB call, HTTP call, time). Let internal logic run.
+**Detection:** Spies on methods of the System Under Test itself.
+---
+## Family D: Suite hygiene problems (team and suite-level pathologies)
+### A15. Coverage as vanity metric
+**Violation:** PR comments demanding "you need to hit 90% coverage" with no discussion of what the coverage means.
+**Fix:** Coverage is a flashlight. Use it to FIND blind spots. Don't optimize for the number.
+**Detection:** Cultural; visible in PR templates that gate on coverage percentage.
+### A16. Happy-path-only coverage
+**Violation:** Every test exercises the success case. Edge cases, error paths, boundary values uncovered.
+**Fix:** For each unit, write at minimum: happy path + 1 boundary + 1 invalid input + 1 failure path.
+**Detection:** Tests where every assertion is positive (`toBe`, `toEqual`) and none is negative (`toThrow`, `rejects.toThrow`).
+### A17. Eternal `beforeAll` / shared setup hiding dependencies
+**Violation:**
+```javascript
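+beforeAll(async () => {
+  // Bulk-seeds shared rows once for the whole file (the smell)
+  await db.users.create(Array.from({ length: 100 }, (_, i) => ({ id: i + 1 })));
+  await db.orders.create(Array.from({ length: 500 }, (_, i) => ({ id: i + 1 })));
+});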
+```
+Tests now SHARE state. Order matters. Cleanup is fragile.
+**Fix:** `beforeEach` with minimal setup specific to each test.
+**Detection:** `beforeAll` blocks creating data (vs `beforeAll` blocks doing one-time framework setup like spinning testcontainers).
+### A18. Cleanup in `afterEach` (use `beforeEach` instead)
+**Violation:**
+```javascript
+afterEach(async () => {
+  await db.users.deleteAll();
+});
+```
+If a test fails mid-run, cleanup might not run; next test starts dirty.
+**Fix:** `beforeEach` with explicit setup-from-clean (truncate + seed). Reliable regardless of previous test outcome.
+**Detection:** `afterEach` blocks doing state reset.
+### A19. Magic strings and logic in tests
+**Violation:**
+```javascript
+const TIMESTAMP = '2024-01-15T10:30:00Z'; // why?
+expect(formatted).toBe('a long string with embedded specifics');
+```
+When the test fails, what was the test's INTENT?
+**Fix:** Use factories with named defaults. Extract magic values to constants with documenting names. Use snapshot testing for legitimate snapshot cases (with classification).
+**Detection:** Test files with ≥10 string literals not bound to a named variable.
+### A20. Testing against third-party sites you don't control
+**Violation:**
+```javascript
+test('Google homepage loads', async ({ page }) => {
+  await page.goto('https://google.com');
+  expect(await page.title()).toContain('Google');
+});
+```
+You're testing Google's availability, not your code.
+**Fix:** Mock the third party. Use WireMock or msw to fake their responses. If you must call them, do it in a separate "external dependencies up" smoke test, not in unit or integration tests.
+**Detection:** External URLs in test code outside designated smoke tests.
+### A21. Quarantine-as-cemetery
+**Violation:**
+```javascript
+test.skip('flaky on CI sometimes', () => { /* ... */ });
+// commented 8 months ago, no owner, no fix-by date
+```
+**Fix:** Every skip/quarantine has a NAMED OWNER and a FIX-BY DATE. Tracking issue exists. PR that introduces the skip says exactly when the test gets fixed.
+**Detection:** Skipped tests without comments/labels naming owner and date.
+### A22. Retry-as-fix (auto-retry hiding real bugs)
+**Violation:**
+```javascript
+// playwright.config (Jest's counterpart is jest.retryTimes(3) in a test file)
+retries: 3,
+```
+A flaky test is a SIGNAL. Retrying until green hides it.
+**Fix:** When a test is flaky, FIX IT (probably a race condition or non-deterministic input). Quarantine if you can't fix immediately. Never just retry.
+**Detection:** CI config with retry counts. Test runners showing "1 retry succeeded" badges.
+### A23. Duplicate tests across pyramid layers
+**Violation:** Same scenario tested at unit, integration, AND E2E. Triple maintenance, no triple value.
+**Fix:** Apply Rule 2: the lowest layer wins. Drop higher-layer duplicates.
+**Detection:** Search for the same scenario name across `tests/unit`, `tests/integration`, `tests/e2e`.
+### A24. Weakening tests to make them pass
+**Violation:**
+```diff
+- expect(orders.length).toBe(5);
++ expect(orders.length).toBeGreaterThan(0);
+```
+The "fix" makes the test useless.
+**Fix:** Apply Rule 3. Fix production OR document WHY the assertion is weaker in the commit body.
+**Detection:** PR diff shows assertion relaxation without commit body explanation.
+### A25. Mock-driven confidence (test asserts on its own setup)
+**Violation:**
+```javascript
+const mock = vi.fn().mockReturnValue('hello');
+expect(mock()).toBe('hello');
+```
+You wrote `hello` in the mock. You asserted `hello`. You proved nothing.
+**Fix:** Assert on the OUTPUT of the production code that consumed the mock — not on the mock itself.
+**Detection:** Tests asserting equality between a value the test body created and a value the test body retrieved.
+---
+## How `/dw-code-review` uses this catalog
+For each diff hunk under a test path:
+1. Run regex scans for the patterns flagged "Detection" above.
+2. Each hit becomes a finding with severity from this skill's `dw-review-rigor` integration.
+3. Hits classified as Brittleness (Family A), Flakiness (Family B), or Mock-misuse (Family C) → severity `high`.
+4. Hits classified as Process (Family D) → severity `medium`.
+5. Hits where the SAME test has multiple patterns → severity `critical` (suite-health smell, not just one test).
+A PR with ≥1 `high` test anti-pattern that lacks ADR justification gets REJECTED.
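The severity steps above can be sketched as a small grading function. The finding shape, the function name, and the choice to let `critical` block a PR the same way `high` does are all assumptions; the source only states the rule for `high`.

```javascript
const HIGH_CLASSES = new Set(['Brittleness', 'Flakiness', 'Mock-misuse']);

// findings: [{ test: string, pattern: string, classification: string }]
function gradeFindings(findings, hasAdrJustification) {
  // Collect the distinct patterns hit by each test.
  const patternsByTest = new Map();
  for (const f of findings) {
    if (!patternsByTest.has(f.test)) patternsByTest.set(f.test, new Set());
    patternsByTest.get(f.test).add(f.pattern);
  }
  const graded = findings.map((f) => {
    let severity = HIGH_CLASSES.has(f.classification) ? 'high' : 'medium';
    // Several distinct patterns on the same test escalate to critical.
    if (patternsByTest.get(f.test).size > 1) severity = 'critical';
    return { ...f, severity };
  });
  // Assumption: 'critical' blocks the PR the same way 'high' does.
  const blocking = graded.some((f) => f.severity !== 'medium');
  const verdict = blocking && !hasAdrJustification ? 'REJECTED' : 'OK';
  return { graded, verdict };
}
```

A single Process finding stays `medium` and does not block, while one Flakiness finding without ADR justification is enough for REJECTED.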

package/scaffold/skills/dw-testing-discipline/references/core-rules.md ADDED

@@ -0,0 +1,128 @@
+# Six core rules — expanded with examples
+The rules are short for memorization. Each carries nuance that matters in practice.
+## Rule 1: Test the behavior, never the mock
+**What it means:** the test asserts what the system DOES from the caller's perspective. It does not assert that internal call X was made with internal argument Y.
+**Why it matters:** a test bound to internal calls breaks the day you refactor — even when behavior didn't change. The "test is red, behavior is fine" experience erodes trust. Soon no one runs the suite.
+**Violation:**
+```javascript
+import { test, expect, vi } from 'vitest';
+// BAD: asserting on mock internals
+test('createOrder calls inventory.reserve', () => {
+  const items = [{ sku: 'SKU-1', qty: 1 }];  // illustrative payload
+  const inventory = { reserve: vi.fn() };
+  createOrder({ items }, inventory);
+  expect(inventory.reserve).toHaveBeenCalledWith(items, 'reserve');
+});
+```
+You've asserted that `createOrder` USES the inventory adapter a specific way. The refactor that consolidates `reserve` into `commit-with-reservation` breaks this even though the order still gets created.
+**Correct version:**
+```javascript
+// GOOD: asserting behavior
+test('createOrder reserves inventory before confirming', async () => {
+  const items = [{ sku: 'SKU-1', qty: 1 }];  // illustrative payload
+  const originalStock = await getInventoryFor(items[0].sku);
+  const result = await createOrder({ items });
+  expect(result.status).toBe('confirmed');
+  expect(await getInventoryFor(items[0].sku)).toBe(originalStock - 1);
+});
+```
+Now the test cares about the OUTCOME (inventory decremented, order confirmed), not the path.
+## Rule 2: Push each test to the lowest layer that can detect the defect
+**What it means:** if a unit test can catch the bug, use a unit. If only an integration test can, integration. If only an end-to-end run can, E2E.
+**Why it matters:** lower layers run faster, fail more precisely, isolate the cause better. A bug in pure logic caught at unit takes 50ms and points at the exact function. The same bug at E2E takes 30 seconds and tells you "checkout failed."
+**The pyramid resolved:**
+| Layer | Catches | Speed | Cost |
+|-------|---------|-------|------|
+| Unit | Pure logic, math, parsing, formatters | <100ms | low |
+| Integration | Module composition, DB queries, HTTP handlers | 500ms–5s | medium |
+| Contract | Producer/consumer agreement at API boundary | 1–10s | medium |
+| E2E | User journey across multiple services | 10s–60s | high |
+**Rule of thumb:**
+- If you can write a unit test, do so.
+- If unit can't reach it (needs DB, queue, real HTTP), write integration.
+- E2E only for journeys NO lower layer can detect: browser-renders-correctly, third-party-callback-arrives, multi-step session state.
+## Rule 3: When a test fails, read production first — change the test only with documented justification
+**What it means:** a red test is a signal. The first question is "what's wrong with production?" — not "why is the test wrong?"
+**Why it matters:** tests are weakened to pass far more often than they should be. "The behavior is fine; the test is too strict" is the slope that leaves a green suite full of meaningless assertions.
+**Process when a test goes red:**
+1. **Read the failure message.** What invariant did the test claim? What did it observe?
+2. **Read production code** in the path that produces the observation.
+3. **Decide which is wrong.** If production violates the invariant, fix production. If the test mis-states the invariant, document WHY before relaxing.
+4. **Commit the analysis** in the test's commit message or PR body. "Relaxed assertion from X to Y because `<reason>`" is auditable; "fix test" is not.
+**Anti-pattern:** re-run the test until green. Auto-retry on flake. Add `.only` to skip the rest.
+## Rule 4: Real systems gate the merge. Mocks isolate; they do not validate.
+**What it means:** before code merges to main, at least ONE test path exercised real systems — real DB, real route, real external integration in a sandbox or test account. Mocks are fine for fast unit feedback; they cannot decide "safe to ship."
+**Why it matters:** mock drift is real. The mocked HTTP response from 3 months ago no longer matches the actual API. Tests pass; production fails on first real call.
+**Practical pattern:**
+- Unit tests: mock the world; run on every keystroke / commit.
+- Integration tests: real local DB (testcontainers, in-memory if equivalent); run on every PR.
+- Contract tests: real producer/consumer agreement check; run on every PR.
+- E2E: real preview environment with real services; run on PRs before merge to main.
+The discipline: no merge without a green E2E (or equivalent real-system check) for the touched path.
+## Rule 5: Coverage is a flashlight; mutation score is a quality probe. Neither is a target.
+**What it means:**
+- **Coverage** tells you what lines executed. Useful as a NEGATIVE signal — 30% coverage means lots of dark code. Useless as a positive signal — 95% coverage with weak assertions is decorative.
+- **Mutation score** introduces small bugs (mutations) and measures whether tests catch them. A high mutation score means tests actually probe behavior, not just execute lines.
+- Neither should be a number you optimize for. They're diagnostics.
+**Anti-pattern:** "We need 90% coverage to merge." Coverage as a gate produces tests written to pass the gate, not to find bugs.
+**Healthier framing:** "What lines in the touched diff are NOT covered? Why?" Sometimes the answer is "we don't care, it's logging." Sometimes it's "actually that's a critical branch — add a test."
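The gap between coverage and mutation score can be shown with a hand-made mutant. This is a toy sketch with hypothetical names; real mutation-testing tools generate and run mutants automatically.

```javascript
// Code under test and a hand-made mutant (boundary flipped from >= to >).
const isAdult = (age) => age >= 18;
const isAdultMutant = (age) => age > 18;

// Weak test: executes every line (full line coverage) but avoids the boundary.
const weakTest = (fn) => fn(30) === true && fn(5) === false;

// Stronger test: also pins the boundary value.
const strongTest = (fn) => weakTest(fn) && fn(18) === true;

const report = {
  weakPassesOriginal: weakTest(isAdult),
  weakKillsMutant: !weakTest(isAdultMutant),
  strongPassesOriginal: strongTest(isAdult),
  strongKillsMutant: !strongTest(isAdultMutant),
};
```

Both tests pass on the original and both reach full line coverage, but only the stronger one fails on the mutant, so only it would raise the mutation score.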
+## Rule 6: No test-only methods, branches, or flags leak into production code
+**What it means:** production code should not have `if (process.env.NODE_ENV === 'test') { ... }` branches, `// for testing only` methods exposed on classes, or internals exported just for assertions.
+**Why it matters:** test-only logic in production code ships in the artifact users run. Bug surface grows; the test environment diverges from production.
+**Correct patterns:**
+- Need to inject a dependency for testing? Use constructor injection / dependency injection.
+- Need to assert on internal state? Add a logging hook or event emission that production also benefits from.
+- Need to bypass auth in tests? Use a dedicated test environment with test credentials, not a backdoor flag.
+**Tells:**
+- `// only used in tests` comments.
+- `*ForTesting` suffix on methods.
+- `vi.spyOn(module, '_internal')` accessing underscore-prefixed members.
+- `process.env.E2E_MODE` reaching into production runtime decisions.
+If you see these, the test design is wrong. Refactor production to be testable; don't add backdoors.
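Some of these tells can be found with a plain text scan of production files. The sketch below is hypothetical: the pattern list and `findTells` are illustrative, the regexes are deliberately loose, and the `vi.spyOn` tell is left out because it appears in test files rather than production code.

```javascript
// Illustrative patterns for test-only leakage in production source.
const TELLS = [
  { name: 'test-only comment', pattern: /\/\/\s*(only used in tests|for testing only)/i },
  { name: 'ForTesting suffix', pattern: /\b\w+ForTesting\b/ },
  { name: 'NODE_ENV test branch', pattern: /process\.env\.NODE_ENV\s*[!=]==?\s*['"]test['"]/ },
  { name: 'E2E_MODE flag', pattern: /process\.env\.E2E_MODE\b/ },
];

// Returns the names of all tells found in a production source string.
function findTells(productionSource) {
  return TELLS
    .filter((tell) => tell.pattern.test(productionSource))
    .map((tell) => tell.name);
}
```

A hit is a prompt to look at the test design, not proof of a violation; a human reviewer still decides whether the symbol really exists only for tests.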
+## Putting the rules together
+A healthy test:
+1. Asserts behavior visible to a caller (Rule 1).
+2. Sits at the lowest layer that can prove that behavior (Rule 2).
+3. When red, sends you to read production code (Rule 3).
+4. Has a sibling exercising real systems somewhere in the pipeline (Rule 4).
+5. Survives a mutation in the code it claims to cover (Rule 5).
+6. Has zero footprint in production code (Rule 6).
+Any test failing ≥2 of these is accumulating technical debt. `/dw-code-review` flags them.