@clipboard-health/ai-rules 2.15.8 → 2.15.9
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/flaky-test-debugger/SKILL.md +16 -288
- package/skills/flaky-test-debugger/references/datadog-apm-traces.md +5 -1
- package/skills/flaky-test-debugger/references/fix.md +65 -0
- package/skills/flaky-test-debugger/references/plan.md +226 -0
- package/skills/write-bug-ticket/SKILL.md +5 -6
- package/skills/write-feature-ticket/SKILL.md +7 -8
- package/skills/write-tech-debt-ticket/SKILL.md +4 -5
- package/skills/linear-duplicate-finder/SKILL.md +0 -101
package/package.json
CHANGED

package/skills/flaky-test-debugger/SKILL.md
CHANGED

@@ -3,7 +3,16 @@ name: flaky-test-debugger
 description: Debug and fix flaky tests including Playwright E2E, NestJS service/integration, React component, and unit tests. Use this skill when investigating intermittent test failures, triaging flaky tests, or fixing test instability.
 ---
 
-
+Phases run in order. Skip a phase if you already have the information it produces. Phase 3 runs only in fix mode.
+
+## Mode: plan vs fix
+
+This skill runs in one of two modes:
+
+- **Fix mode (default):** produce a plan, then apply it.
+- **Plan mode:** produce a plan and stop, for human review.
+
+Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it". Otherwise default to fix mode. Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
 
 ## Phase 1: Classify Test Type
 
@@ -18,10 +27,6 @@ Determine the test type from the user's input before doing anything else. The ty
 
 If the type is ambiguous, check the test file extension and imports to confirm.
 
-**Routing:** After completing Phase 1, always proceed to Phase 1b before investigating further.
-
----
-
 ## Phase 1b: Check for Existing Fixes
 
 Before investigating, check whether someone (or another agent) has already fixed this flake.
@@ -40,291 +45,14 @@ If an existing fix is found, report:
 - A brief summary of what it addresses
 - Whether it fully covers the current flake or only partially
 
-If no existing fix is found, proceed to
-
-- **E2E (Playwright):** Go to [Phase 2E: E2E Triage Snapshot](#phase-2e-e2e-triage-snapshot)
-- **Service, React component, or Unit:** Go to [Phase 2: Fast Path](#phase-2-fast-path-non-e2e)
-
----
-
-## Phase 2: Fast Path (non-E2E)
-
-For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose and fix the flake. Do not over-investigate -- read the evidence, read the code, fix it.
-
-### 2a: Gather Failure Context
-
-Capture from the user's input (ask if missing):
-
-- **Test file and name** -- exact file path and test title
-- **Error message and stack trace** -- the raw failure output
-- **Framework** -- Jest, Vitest, etc.
-- **Whether it's a new flaky** -- first occurrence vs. recurring
-- **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
-
-### 2b: Read the Test and Code Under Test
-
-1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
-2. Read the production code that the test exercises -- follow imports from the test file.
-3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
-
-### 2c: Classify the Flake Pattern
-
-| Category | Test Types | Signal |
-| --- | --- | --- |
-| **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
-| **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
-| **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
-| **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
-| **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
-| **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
-| **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
-| **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
-| **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
-| **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
-
-### 2d: Diagnose and Fix
-
-Apply the appropriate fix based on the pattern:
-
-**Service test fixes:**
-
-- Ensure `afterAll` closes the app _and_ awaits all open connections (DB, Redis, queues) before returning
-- Pass `{ forceCloseConnections: true }` to `NestFactory.create()` (NestJS v10+) to auto-close keep-alive connections on shutdown, or explicitly close the Mongoose/TypeORM connection in `afterAll`
-- Use dynamic/random ports (`listen(0)`) to avoid EADDRINUSE
-- Isolate database state: use unique collection prefixes, transaction rollbacks, or per-test database cleanup
-- If the test uses `setTimeout` or event-driven patterns, ensure the test awaits completion rather than relying on timing
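For illustration, the `afterAll` teardown rule above (close the app and await every open connection before returning) can be sketched as a tiny registry. This is a hypothetical helper, not a NestJS API: each opened resource registers a closer, and `closeAll()` drains them in reverse-open order so dependents close before the connections they rely on.

```typescript
// Hypothetical sketch of disciplined teardown (not a real NestJS API):
// every opened resource registers a closer; afterAll calls closeAll(),
// which runs closers LIFO so nothing outlives the app.
class TeardownRegistry {
  private closers: Array<{ name: string; close: () => void }> = [];

  register(name: string, close: () => void): void {
    this.closers.push({ name, close });
  }

  // Returns the names in the order they were closed, for verification.
  closeAll(): string[] {
    const order: string[] = [];
    while (this.closers.length > 0) {
      const { name, close } = this.closers.pop()!;
      close();
      order.push(name);
    }
    return order;
  }
}
```

In a real suite the closers would be `app.close()`, `connection.close()`, and so on, each awaited; the point is that `afterAll` owns a complete list rather than hoping individual tests cleaned up.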
-
-**React component test fixes:**
-
-- Wrap state-triggering actions in `act()` or use `waitFor`/`findBy*` queries that handle async updates
-- When using fake timers, advance them explicitly (`jest.advanceTimersByTime`, `jest.runAllTimers`) and restore real timers in `afterEach`
-- Ensure cleanup with `cleanup()` in `afterEach` (React Testing Library does this automatically unless disabled)
-- Restore mocks in `afterEach` -- prefer `jest.restoreAllMocks()` in a shared setup
-
-**Unit test fixes:**
-
-- Eliminate shared mutable state: clone or reset objects in `beforeEach`, or make the module-level binding `const`
-- Mock `Date.now` / `new Date()` explicitly when time matters; restore in `afterEach`
-- If order-dependent, check for missing setup that another test was implicitly providing
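The `Date.now` rule above can be illustrated with a plain-Node sketch. In Jest you would typically reach for `jest.spyOn(Date, "now")` plus `afterEach` restore; `withFixedNow` below is a hypothetical stand-in showing the same pin-then-restore shape.

```typescript
// Hypothetical helper: pin Date.now() for the duration of a callback and
// always restore it, even if the callback throws, so the stub never bleeds
// into other tests.
function withFixedNow<T>(fixedMs: number, fn: () => T): T {
  const realNow = Date.now;
  Date.now = () => fixedMs;
  try {
    return fn();
  } finally {
    Date.now = realNow; // restore even on throw
  }
}
```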
-
-### 2e: Evidence Standard (Fast Path)
-
-Before proposing a fix, include at minimum:
-
-- The **error message and stack trace** from the failure
-- The **specific code path** in the test or production code that caused the flake
-- A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
-- A **confidence score** (1-5, same scale as [E2E evidence standard](#confidence-score))
-
-If confidence is 2 or below, recommend reproduction steps or instrumentation before committing to a fix.
-
-Skip to [Phase 5: Fix Decision Tree](#phase-5-fix-decision-tree).
-
----
-
-## Phase 2E: E2E Triage Snapshot
-
-Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
-
-- Failing test file and name
-- GitHub Actions run URL to fetch the LLM report
-
-### Fetch the LLM Report
-
-Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
-
-```bash
-bash scripts/fetch-llm-report.sh "<github-actions-url>"
-```
-
-This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
-
-## Phase 3E: Quick Classification
-
-For the full report schema, field reference, caps, and example reports:
-
-1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — exact version match to the report.
-2. Otherwise, fetch the latest docs from GitHub:
-   - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
-   - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
-
-Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
-
-Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
-
-Classify the flake to narrow the search space:
-
-| Category | Signal | Timeline Pattern |
-| --- | --- | --- |
-| **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
-| **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
-| **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
-| **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
-| **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
-| **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
-| **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
-| **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
-
-## Phase 4E: Analyze LLM Report
-
-### 4Ea: Walk the Timeline
-
-**Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
-
-```text
-step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
-```
-
-For each timeline entry:
-
-- **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
-- **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
-- **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
-
-All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
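Walking the timeline is mechanical once entries are sorted by `offsetMs`. A minimal sketch, assuming a simplified entry shape with only the fields named above (`failureWindow` is a hypothetical helper, not part of the reporter package):

```typescript
// Simplified timeline entry: only the fields discussed above.
type TimelineEntry = {
  kind: "step" | "network" | "console";
  offsetMs: number;
  title?: string;   // steps
  status?: number;  // network
  error?: string;   // failing steps
};

// Return the first failing step plus the entries that immediately preceded
// it: the window where the root cause usually hides.
function failureWindow(timeline: TimelineEntry[], windowSize = 3) {
  const sorted = [...timeline].sort((a, b) => a.offsetMs - b.offsetMs);
  const failIdx = sorted.findIndex(
    (e) => e.kind === "step" && e.error !== undefined,
  );
  if (failIdx === -1) return undefined;
  return {
    failure: sorted[failIdx],
    preceding: sorted.slice(Math.max(0, failIdx - windowSize), failIdx),
  };
}
```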
-
-### 4Eb: Compare pass vs fail (flaky tests)
-
-If you don't have passing and failing attempts for the same test, skip to 4Ec.
-
-Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
-
-1. Align both timelines by step title sequence
-2. Find the first step/network/console entry that differs between attempts
-3. The divergence answers "what was different this time?" directly
-
-Common divergence patterns:
-
-- **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
-- **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
-- **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
-- **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
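The alignment step reduces to: project each attempt onto its sequence of step titles, then report the first index where the sequences disagree. A minimal sketch under that assumption (`firstDivergence` is a hypothetical helper):

```typescript
// Hypothetical helper: given the step-title sequences of a passed and a
// failed attempt, return the index of the first divergence, or -1 if they
// are identical. If one sequence is a prefix of the other, the divergence
// is the missing tail (e.g. a network call or step that never happened).
function firstDivergence(passed: string[], failed: string[]): number {
  const n = Math.min(passed.length, failed.length);
  for (let i = 0; i < n; i++) {
    if (passed[i] !== failed[i]) return i;
  }
  return passed.length === failed.length ? -1 : n;
}
```

The entry at that index in each timeline is the "what was different this time?" answer; inspect the surrounding network and console entries from there.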
-
-### 4Ec: Identify failing tests
-
-Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
-
-- **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
-- **`location`**: Source file, line, and column — jump straight to the code.
-- **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
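As a sketch, the filter over `tests[]` looks like the following (the `ReportTest` type is a simplified assumption; only the `status` and `flaky` fields named above are modeled):

```typescript
// Simplified view of a tests[] entry from llm-report.json.
type ReportTest = { title: string; status: string; flaky?: boolean };

// Keep tests that failed outright or were marked flaky (passed on retry).
function failingOrFlaky(tests: ReportTest[]): ReportTest[] {
  return tests.filter((t) => t.status === "failed" || t.flaky === true);
}
```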
-
-### 4Ed: Examine attempts for retry patterns
-
-For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
-
-**Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
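Decoding the screenshot is one line of `Buffer` work in Node; a minimal sketch (the output path is illustrative, not mandated by the report format):

```typescript
import { writeFileSync } from "node:fs";

// Decode the base64 screenshot to raw bytes; optionally persist it so the
// image can be opened and inspected. The outPath is illustrative.
function decodeScreenshot(base64: string, outPath?: string): Buffer {
  const bytes = Buffer.from(base64, "base64");
  if (outPath !== undefined) writeFileSync(outPath, bytes);
  return bytes;
}
```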
-
-### 4Ee: Inspect network activity and extract trace IDs
-
-Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
-
-**`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`references/datadog-apm-traces.md`](./references/datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
-
-Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
-
-### 4Ef: Review test steps
-
-Prefer the timeline view (4Ea) which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
-
-## Phase 4E Evidence Standard
-
-Do not propose a fix without concrete artifacts. At minimum, include:
-
-- One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
-- One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
-- A **specific code path** that consumed that state — use `tests[].location` to jump to the source
-- When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
-- When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
-- A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
-
-### Confidence Score
-
-Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
-
-| Score | Meaning | Criteria |
-| --- | --- | --- |
-| **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
-| **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
-| **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
-| **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
-| **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
-
-If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
-
----
-
-## Phase 5: Fix Decision Tree
-
-Applies to all test types.
-
-Apply fixes in this order of priority:
-
-1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
-
-2. **Test harness fix** (when the failure is non-product):
-   - Reset cookies, storage, and session between retries
-   - Isolate test data; generate stronger unique identities
-   - Make retry blocks idempotent
-   - Wait on deterministic app signals, not arbitrary sleeps
-   - (Service tests) Close connections and app properly in `afterAll`
-   - (Component tests) Flush pending state updates and timers before asserting
-   - (Unit tests) Reset shared mutable state in `beforeEach`
-
-3. **Product fix** (when real users would hit the same issue):
-   - Handle stale or intermediate states safely
-   - Make routing/render logic robust to eventual consistency
-   - Add telemetry for ambiguous transitions
-
-4. **Both** if user impact exists _and_ tests are fragile.
-
-## Phase 6: Fix Sibling Instances
-
-After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
-
-### When to search
-
-Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
-
-- Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
-- Hardcoded ports instead of dynamic allocation
-- Shared mutable state without per-test reset
-- Missing `act()` wrappers or `waitFor` around async assertions
-- Fake timers not restored in `afterEach`
-- Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
-
-### When NOT to search
-
-Skip this step when the fix is **specific to one test's logic** -- for example, a wrong assertion value, a test-specific race condition in a unique setup, or a one-off typo.
-
-### How to search
-
-1. Identify the anti-pattern as a grep-able code pattern. Examples:
-   - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
-   - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
-   - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
-   - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
-
-2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
-
-3. Apply the same fix to each sibling. Keep changes minimal -- fix the anti-pattern, nothing else.
-
-4. List the sibling files you fixed in the output so reviewers can verify them.
-
-## Phase 7: Verification
+If no existing fix is found, proceed to Phase 2.
 
-
+## Phase 2: Produce a plan
 
-
+Follow [`references/plan.md`](./references/plan.md). It walks investigation, diagnosis, evidence gathering, and the fix decision tree, and produces a structured plan with confidence score.
 
-
+If you are in plan mode, present the plan and stop here.
 
-
+## Phase 3: Apply the plan (fix mode only)
 
--
-- **Symptom:** what failed and where
-- **Root cause:** concise technical explanation
-- **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
-- **Fix:** test-only, product-only, or both
-- **Siblings fixed:** list of other files where the same anti-pattern was corrected (or "N/A -- fix was test-specific")
-- **Validation:** commands and suites run
-- **Residual risk:** what could still be flaky
+Follow [`references/fix.md`](./references/fix.md). It takes the plan from Phase 2, applies the proposed fix, searches for sibling anti-patterns, and verifies. PR creation is out of scope -- if the user later opens one (or invokes a PR-shipping skill), label it `flaky-test-fix`.
package/skills/flaky-test-debugger/references/datadog-apm-traces.md
CHANGED

@@ -9,7 +9,7 @@ The `pup` CLI must be installed and authenticated. Two auth paths are supported:
 - **macOS Keychain** (via `pup auth login`) — the default on developer machines.
 - **Environment variables** (`DD_API_KEY` + `DD_APP_KEY`) — the path used in sandboxes and CI.
 
-
+Don't run `pup auth status` to verify auth. It fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly. If the query fails with an auth error, check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`.
 
 ## Key pup conventions
 

@@ -78,3 +78,7 @@ pup traces search --query="trace_id:<TRACE_ID>" --from=30d --limit=1000 \
     error: .attributes.custom.error.message
   }]'
 ```
+
+### 4. Query additional data
+
+If additional data would help diagnose the issue (e.g. logs, rum, cicd), use the pup CLI.
package/skills/flaky-test-debugger/references/fix.md
ADDED

@@ -0,0 +1,65 @@
+# Apply a Flaky Test Fix
+
+Apply phase of the flaky-test-debugger skill. Takes a plan produced by [`plan.md`](./plan.md) and applies it.
+
+## Preflight
+
+Confirm the plan from `plan.md` has confidence ≥ 3. If confidence is 1-2, do not apply -- return to `plan.md` and gather more evidence first.
+
+## Apply the Proposed Fix
+
+Edit the files listed in the plan's **Proposed fix** field. Keep the change minimal -- the plan already chose between test harness, product, and both.
+
+## Fix Sibling Instances
+
+After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
+
+### When to search
+
+Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
+
+- Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
+- Hardcoded ports instead of dynamic allocation
+- Shared mutable state without per-test reset
+- Missing `act()` wrappers or `waitFor` around async assertions
+- Fake timers not restored in `afterEach`
+- Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
+
+### When NOT to search
+
+Skip this step when the fix is **specific to one test's logic** -- for example a test-specific race condition in a unique setup or a one-off typo.
+
+### How to search
+
+1. Identify the anti-pattern as a grep-able code pattern. Examples:
+   - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
+   - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
+   - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
+   - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
+
+2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
+
+3. Apply the same fix to each sibling. Keep changes minimal; fix the anti-pattern, nothing else.
+
+4. List the sibling files you fixed in the output so reviewers can verify them.
+
+## Verification
+
+Run the plan's **Validation plan** commands — including the previously-flaky test, repeated enough times to give reasonable confidence the flake is gone. Lint and type-check touched files as the floor; do not stop there.
+
+## Output Format
+
+When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
+
+When documenting the fix in a PR or issue, use this structure. Carry **Confidence**, **Symptom**, **Root cause**, **Evidence**, and **Residual risk** straight over from the plan. Three plan fields rename: **Proposed fix** → **Fix**, **Sibling candidates** → **Siblings fixed**, **Validation plan** → **Validation**. Drop **Open questions** (resolved by fix time):
+
+- **Test ID:** if provided in prompt
+- **Agent session ID:** your running session ID to resume if needed
+- **Confidence:** score (1-5) with brief justification
+- **Symptom:** what failed and where
+- **Root cause:** concise technical explanation
+- **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
+- **Fix:** test-only, product-only, or both
+- **Siblings fixed:** list of other files where the same anti-pattern was or should be corrected (or "N/A -- fix was test-specific")
+- **Validation:** commands and suites run
+- **Residual risk:** what could still be flaky
@@ -0,0 +1,226 @@
+# Plan a Flaky Test Fix
+
+Diagnosis and planning phase of the flaky-test-debugger skill. Produces a structured plan that the user reviews. In fix mode, the plan is consumed by [`fix.md`](./fix.md).
+
+Route by the test type identified in Phase 1 of SKILL.md:
+
+- **E2E (Playwright):** start with [E2E Triage Snapshot](#e2e-triage-snapshot)
+- **Service, React component, or Unit:** start with [Fast Path (non-E2E)](#fast-path-non-e2e)
+
+## Fast Path (non-E2E)
+
+For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose the flake. Do not over-investigate -- read the evidence, read the code, plan the fix.
+
+### Gather Failure Context
+
+Capture from the user's input (ask if missing):
+
+- **Test file and name** -- exact file path and test title
+- **Error message and stack trace** -- the raw failure output
+- **Framework** -- Jest, Vitest, etc.
+- **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
+
+### Read the Test and Code Under Test
+
+1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
+2. Read the production code that the test exercises -- follow imports from the test file.
+3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
+
+### Classify the Flake Pattern
+
+| Category | Test Types | Signal |
+| --- | --- | --- |
+| **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
+| **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
+| **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
+| **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
+| **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
+| **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
+| **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
+| **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
+| **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
+| **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
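The "shared mutable state" and "test ordering dependency" rows can be illustrated with a minimal, framework-free sketch (the cache, function, and values here are invented for illustration; in real suites the same pattern hides behind `jest` lifecycle hooks):

```typescript
// A module-level cache that survives across tests -- the classic shared-state flake.
const userCache = new Map<string, string>();

function getUser(id: string): string {
  // Production code memoizes; because the cache is module-level, one test's
  // writes are visible to the next test in the same worker.
  if (!userCache.has(id)) {
    userCache.set(id, `user-${id}`);
  }
  return userCache.get(id)!;
}

// "Test A" pollutes the cache.
userCache.set("42", "stale-user");

// "Test B" now observes Test A's state: passes or fails depending on run order.
const polluted = getUser("42"); // "stale-user", not "user-42"

// The fix mirrors a beforeEach reset: clear shared state before each test.
userCache.clear();
const clean = getUser("42"); // "user-42"

console.log(polluted, clean);
```

The same reset-before-each discipline applies to singletons, environment variables, and mocked modules.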
+
+### Diagnose with Evidence
+
+Before proposing a fix, gather:
+
+- The **error message and stack trace** from the failure
+- The **specific code path** in the test or production code that caused the flake
+- A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
+- A **confidence score** (1-5, see [Confidence Score](#confidence-score))
+
+If confidence is 2 or below, the plan is to gather more data: recommend specific reproduction steps or instrumentation rather than a code fix.
+
+If >2, continue to [Decide Fix Approach](#decide-fix-approach).
+
+## E2E Triage Snapshot
+
+Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
+
+- Failing test file and name
+- GitHub Actions run URL to fetch the LLM report
+
+### Fetch the LLM Report
+
+This script downloads the `playwright-llm-report` artifact from a GitHub Actions run:
+
+```bash
+bash scripts/fetch-llm-report.sh "<github-actions-url>"
+```
+
+It extracts the artifact to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
+
+## Quick Classification (E2E)
+
+For the full report schema, field reference, caps, and example reports:
+
+1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — exact version match to the report.
+2. Otherwise, fetch the latest docs from GitHub:
+   - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
+   - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
+
+Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
+
+Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
+
+Classify the flake to narrow the search space:
+
+| Category | Signal | Timeline Pattern |
+| --- | --- | --- |
+| **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
+| **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
+| **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
+| **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
+| **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
+| **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
+| **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
+| **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
+
+## Analyze LLM Report
+
+### Walk the Timeline
+
+**Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
+
+```text
+step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
+```
+
+For each timeline entry:
+
+- **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
+- **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
+- **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
+
+All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
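The timeline walk described above can be sketched as follows. This is an illustrative simplification, not part of the package: the entry shapes mirror the field names in the text, but the sample data and helper logic are invented:

```typescript
// Simplified timeline entry types, modeled on the kinds described above.
type TimelineEntry =
  | { kind: "step"; offsetMs: number; title: string; error?: string }
  | { kind: "network"; offsetMs: number; method: string; url: string; status: number }
  | { kind: "console"; offsetMs: number; type: string; text: string };

// Invented sample data matching the example sequence in the text.
const timeline: TimelineEntry[] = [
  { kind: "step", offsetMs: 10, title: 'click "Submit"' },
  { kind: "network", offsetMs: 120, method: "POST", url: "/api/orders", status: 201 },
  { kind: "console", offsetMs: 400, type: "error", text: "Cannot read property..." },
  { kind: "step", offsetMs: 450, title: "expect toBeVisible", error: "Timed out" },
];

// Locate the first failed step, then collect everything that happened before it.
const failedIndex = timeline.findIndex((e) => e.kind === "step" && e.error !== undefined);
const leadUp = timeline.slice(0, failedIndex);

// Console errors in the lead-up are prime suspects for the root cause.
const suspectErrors = leadUp.filter((e) => e.kind === "console" && e.type === "error");

console.log(failedIndex, suspectErrors.length);
```

Because every entry carries `offsetMs`, the same walk can also window the lead-up to, say, the last 500 ms before the failure.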
+
+### Compare pass vs fail (flaky tests)
+
+If you don't have passing and failing attempts for the same test, skip to [Identify failing tests](#identify-failing-tests).
+
+Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
+
+1. Align both timelines by step title sequence
+2. Find the first step/network/console entry that differs between attempts
+3. The divergence answers "what was different this time?" directly
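The alignment steps above can be sketched as a small divergence finder (a hypothetical helper; the step shape and sample runs are invented for illustration):

```typescript
// One aligned event per test step, reduced to the fields that matter for alignment.
interface StepEvent {
  title: string;
  outcome: "passed" | "failed";
}

// Walk both timelines in lockstep and report the first index that differs.
function firstDivergence(passed: StepEvent[], failed: StepEvent[]): number {
  const len = Math.min(passed.length, failed.length);
  for (let i = 0; i < len; i++) {
    if (passed[i].title !== failed[i].title || passed[i].outcome !== failed[i].outcome) {
      return i; // first entry that differs between attempts
    }
  }
  // One timeline is a prefix of the other: divergence is where the shorter ends.
  return passed.length === failed.length ? -1 : len;
}

const passedRun: StepEvent[] = [
  { title: "login", outcome: "passed" },
  { title: "open orders", outcome: "passed" },
  { title: "expect order visible", outcome: "passed" },
];
const failedRun: StepEvent[] = [
  { title: "login", outcome: "passed" },
  { title: "open orders", outcome: "passed" },
  { title: "expect order visible", outcome: "failed" },
];

console.log(firstDivergence(passedRun, failedRun)); // 2 -- same steps, different outcome
```

In practice you would align the full interleaved timelines (steps plus network plus console), but the lockstep comparison is the same idea.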
+
+Common divergence patterns:
+
+- **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
+- **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
+- **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
+- **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
+
+### Identify failing tests
+
+Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
+
+- **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
+- **`location`**: Source file, line, and column — jump straight to the code.
+- **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
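The filter above is a one-liner once the report is loaded. A minimal sketch, with an invented, simplified test shape (the real report carries many more fields):

```typescript
// Simplified view of a tests[] entry: just the fields the filter and jump-to-code need.
interface TestResult {
  title: string;
  status: "passed" | "failed" | "skipped";
  flaky: boolean;
  location: { file: string; line: number };
}

// Invented sample data.
const tests: TestResult[] = [
  { title: "checkout happy path", status: "passed", flaky: false, location: { file: "checkout.spec.ts", line: 12 } },
  { title: "order appears after refresh", status: "passed", flaky: true, location: { file: "orders.spec.ts", line: 40 } },
  { title: "cancel order", status: "failed", flaky: false, location: { file: "orders.spec.ts", line: 88 } },
];

// Failed OR flaky entries both need triage; flaky tests passed only on retry.
const needsTriage = tests.filter((t) => t.status === "failed" || t.flaky);

// location gives a direct jump target for each.
console.log(needsTriage.map((t) => `${t.location.file}:${t.location.line}`));
```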
+
+### Examine attempts for retry patterns
+
+For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
+
+**Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
+
+### Inspect network activity and extract trace IDs
+
+Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
+
+**`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`./datadog-apm-traces.md`](./datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
+
+Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
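The instance-to-group join and the saturation check described above can be sketched like this. The field names mirror the text; the record shapes and sample values are invented for illustration:

```typescript
// Simplified shapes for the network section of an attempt.
interface NetworkInstance { networkId: string; groupId: string; status: number; durationMs: number; }
interface NetworkGroup { failureText?: string; wasAborted: boolean; resourceType: string; occurrenceCount: number; }

// Invented sample data: one failing request, one healthy one.
const instances: NetworkInstance[] = [
  { networkId: "n1", groupId: "g1", status: 500, durationMs: 1200 },
  { networkId: "n2", groupId: "g2", status: 200, durationMs: 80 },
];
const groups: Record<string, NetworkGroup> = {
  g1: { failureText: "internal error", wasAborted: false, resourceType: "xhr", occurrenceCount: 3 },
  g2: { wasAborted: false, resourceType: "fetch", occurrenceCount: 10 },
};
const summary = {
  instancesDroppedByGroupCap: 0,
  instancesDroppedByInstanceCap: 4,
  instancesEvictedAfterAdmission: 0,
  instancesDroppedByFilter: 120, // expected: static assets are filtered by design
};

// 4xx/5xx instances joined to their group's shape-level signal.
const suspects = instances
  .filter((i) => i.status >= 400)
  .map((i) => ({ ...i, ...groups[i.groupId] }));

// Any non-filter drop means the retained set is a sample: lower confidence.
const saturated =
  summary.instancesDroppedByGroupCap > 0 ||
  summary.instancesDroppedByInstanceCap > 0 ||
  summary.instancesEvictedAfterAdmission > 0;

console.log(suspects.length, saturated);
```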
+
+### Review test steps
+
+Prefer the timeline view above, which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
+
+## Evidence Standard (E2E)
+
+Do not propose a fix without concrete artifacts. At minimum, include:
+
+- One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
+- One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
+- A **specific code path** that consumed that state — use `tests[].location` to jump to the source
+- When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
+- When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
+- A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
+
+If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
+
+If >2, continue to [Decide Fix Approach](#decide-fix-approach).
+
+### Confidence Score
+
+Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
+
+| Score | Meaning | Criteria |
+| --- | --- | --- |
+| **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
+| **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
+| **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
+| **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
+| **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
+
+## Decide Fix Approach
+
+Applies to all test types.
+
+Choose one of these approaches in priority order:
+
+1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
+
+2. **Test harness fix** (when the failure is non-product):
+   - Reset cookies, storage, and session between retries
+   - Isolate test data; generate stronger unique identities
+   - Make retry blocks idempotent
+   - Wait on deterministic app signals, not arbitrary sleeps
+   - (Service tests) Close connections and app properly in `afterAll`
+   - (Component tests) Flush pending state updates and timers before asserting
+   - (Unit tests) Reset shared mutable state in `beforeEach`
+
+3. **Product fix** (when real users would hit the same issue):
+   - Handle stale or intermediate states safely
+   - Make routing/render logic robust to eventual consistency
+   - Add telemetry for ambiguous transitions
+
+4. **Both** if user impact exists _and_ tests are fragile.
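The "wait on deterministic app signals, not arbitrary sleeps" bullet generalizes to a tiny polling helper: wait for a predicate to become true (or a deadline to pass) instead of sleeping a fixed duration. A framework-free sketch; the helper name and defaults are invented:

```typescript
// Poll a predicate until it holds, or fail loudly at the deadline.
// Unlike a fixed sleep, this finishes as soon as the signal is true
// and fails with a clear message instead of a flaky late assertion.
async function waitFor(
  predicate: () => boolean,
  { timeoutMs = 2000, intervalMs = 20 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!predicate()) {
    if (Date.now() > deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: flip a flag asynchronously and wait on it deterministically.
let ready = false;
setTimeout(() => { ready = true; }, 50);

waitFor(() => ready).then(() => console.log("ready"));
```

Playwright's built-in `expect(...).toPass()` and auto-waiting locators serve the same purpose in E2E code; the helper above is for harness code outside the browser.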
+
+## Plan Output Format
+
+Produce the plan with these fields:
+
+- **Test ID:** if provided in prompt
+- **Agent session ID:** your running session ID to resume if needed
+- **Confidence:** score (1-5) with brief justification
+- **Symptom:** what failed and where
+- **Root cause:** concise technical explanation
+- **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
+- **Proposed fix:** test harness, product, or both — with the specific file(s) and the change you would make
+- **Sibling candidates:** files that appear to share the same anti-pattern, for the reviewer (or fix.md) to confirm. Or "N/A -- fix is test-specific" if the issue is one-off (see [`fix.md`](./fix.md) for what counts as a structural anti-pattern worth searching for).
+- **Validation plan:** lint/typecheck commands and test commands to run after applying the fix
+- **Open questions:** anything that needs human input before fixing
+- **Residual risk:** what could still be flaky after the fix

@@ -14,12 +14,11 @@ Structure and write Linear bug reports from evidence that already exists in the
 ## Process

 1. **Gather context** — collect evidence from the conversation: investigation findings, user reports, Datadog links, error details
-2. **
-3. **
-4. **
-5. **
-6. **
-7. **Create in Linear** — only after explicit approval
+2. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected — ask before drafting. NEVER invent answers. Up to 3 rounds.
+3. **Draft** — title + description, structure scaled to complexity (see format below)
+4. **Self-review** — check every Red Flag below before presenting
+5. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
+6. **Create in Linear** — only after explicit approval

 ## Hard Rules

@@ -15,14 +15,13 @@ Draft Linear feature request tickets that describe what users need and why — n
 - **1-2 factual gaps** (missing repo, unclear who) → ask the user directly. Don't dispatch the full interview for a single missing data point.
 - **Structural problems** (solution-shaped framing, no problem articulated, mostly unknowns) → dispatch `interview-feature` skill. Receive a structured problem brief. Re-check gate against the brief.
 - If `interview-feature` terminates without producing a problem brief (user refused to articulate a problem), abort the ticket process. Inform the user that the ticket cannot be created without a problem statement.
-3. **
-4. **
-5. **
-6. **
-7. **
-8. **
-9. **
-10. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
+3. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check — it doesn't blindly trust upstream context.
+4. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
+5. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
+6. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
+7. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
+8. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
+9. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.

 ## Final Validation Checklist

@@ -21,11 +21,10 @@ Draft Linear tech debt tickets that justify _why_ the debt matters — cost to c
 - Maintainability/DX → `git log` for change frequency and bug-fix commits, grep for workarounds
 - Security → check dependency versions, scan for vulnerability patterns
 5. **Assess interest & risk** — produce structured ratings with evidence (see reference.md for rating framework)
-6. **
-7. **
-8. **
-9. **
-10. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
+6. **Draft** — title + description, structure scaled to complexity (see format below)
+7. **Self-review** — check every Red Flag below before presenting
+8. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
+9. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.

 ## Hard Rules

@@ -1,101 +0,0 @@
----
-name: linear-duplicate-finder
-description: Use when checking if a Linear ticket already exists before creating one. Searches across teams, archived tickets, and multiple phrasings to find duplicates and related tickets. Dispatched by ticket-writing skills before creation.
----
-
-# Linear Duplicate Finder
-
-Search Linear for duplicate or related tickets before creating a new one. Casts a wide net across teams, statuses, and phrasings, then classifies matches by similarity.
-
-## Inputs
-
-1. **A Linear ticket ID** (e.g., "ENG-1234") — fetch its details and search for duplicates.
-2. **A ticket title + description** — search Linear directly for matches.
-3. **Multiple ticket IDs** — cross-reference against each other and the backlog.
-
-## Process
-
-### Step 1: Understand the Source
-
-If given a ticket ID:
-
-- Fetch the ticket using `mcp__linear__get_issue` with `includeRelations: true` to see if duplicates are already marked.
-- Extract the title, description, labels, team, and project.
-
-If given a title/description:
-
-- Parse the key concepts, features, and domain terms.
-
-### Step 2: Generate Search Queries
-
-Break the ticket down into multiple search angles:
-
-- **Exact title keywords**: Most distinctive terms from the title.
-- **Core concept**: The fundamental ask, using different phrasings.
-- **Domain/feature area**: The feature area or system component involved.
-- **Synonyms and alternative phrasings**: 2-3 alternative ways to describe the same thing.
-
-### Step 3: Execute Searches
-
-Run multiple `mcp__linear__list_issues` searches in parallel using the `query` parameter with different search terms:
-
-- Use `limit: 50` to cast a wide net.
-- Include `includeArchived: true` to catch completed or cancelled tickets.
-- Filter by team when known, but also do at least one cross-team search.
-
-Also use `mcp__linear__query_data` with natural language queries for concept-based matching.
-
-Run at least 3-5 different searches with varied query terms.
-
-### Step 4: Analyze and Score Results
-
-| Dimension | Weight | Description |
-| --- | --- | --- |
-| **Title similarity** | High | Do the titles describe the same thing? |
-| **Description overlap** | High | Do the descriptions reference the same problem? |
-| **Same feature area** | Medium | Are they about the same system/feature? |
-| **Same team/project** | Low | Same team increases likelihood but isn't required. |
-| **Status** | Info | Cancelled/completed duplicates are still worth noting. |
-
-### Step 5: Classify Matches
-
-- **Duplicate**: Exact same work. Creating both = redundant effort.
-- **Closely Related**: Overlapping scope — completing one partially addresses the other. Should cross-reference.
-- **Same Area**: Same domain but different aspects. Useful context, not duplicates.
-
-### Step 6: Present Results
-
-```text
-## Duplicate Detection Results
-
-### Source
-**[ID] Title** or **Potential ticket**: "description"
-
-### Duplicates Found
-1. **[TEAM-123] Title** — Status: In Progress
-   - **Why**: [specific overlap]
-   - **Key difference**: [if any]
-
-### Closely Related
-1. **[TEAM-456] Title** — Status: Backlog
-   - **Overlap**: [what's shared]
-   - **Difference**: [what's distinct]
-
-### Same Area (Context)
-1. **[TEAM-789] Title** — Status: Done
-   - **Relevance**: [why worth noting]
-
-### Recommendation
-[Proceed, merge with existing, or add context to related ticket?]
-```
-
-If no duplicates found, say so clearly and recommend proceeding.
-
-## Guidelines
-
-- **Wide net, then narrow.** Better to surface a false positive than miss a real duplicate.
-- **Search across teams.** Duplicates often live on different teams.
-- **Check archived/cancelled tickets.** May contain valuable context about why work was previously rejected.
-- **Look at different time ranges.** Duplicates can be months old.
-- **Be specific.** Don't say "similar title" — explain exactly what overlaps and differs.
-- **When in doubt, include it.** A false positive is cheap; a missed duplicate wastes engineering effort.
|