@clipboard-health/ai-rules 2.15.8 → 2.15.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@clipboard-health/ai-rules",
3
- "version": "2.15.8",
3
+ "version": "2.15.9",
4
4
  "description": "Pre-built AI agent rules for consistent coding standards.",
5
5
  "keywords": [
6
6
  "ai",
@@ -3,7 +3,16 @@ name: flaky-test-debugger
3
3
  description: Debug and fix flaky tests including Playwright E2E, NestJS service/integration, React component, and unit tests. Use this skill when investigating intermittent test failures, triaging flaky tests, or fixing test instability.
4
4
  ---
5
5
 
6
- Work through these phases in order. Skip phases only when you already have the information they produce.
6
+ Phases run in order. Skip a phase if you already have the information it produces. Phase 3 runs only in fix mode.
7
+
8
+ ## Mode: plan vs fix
9
+
10
+ This skill runs in one of two modes:
11
+
12
+ - **Fix mode (default):** produce a plan, then apply it.
13
+ - **Plan mode:** produce a plan and stop, for human review.
14
+
15
+ Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it". Otherwise default to fix mode. Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
7
16
 
8
17
  ## Phase 1: Classify Test Type
9
18
 
@@ -18,10 +27,6 @@ Determine the test type from the user's input before doing anything else. The ty
18
27
 
19
28
  If the type is ambiguous, check the test file extension and imports to confirm.
20
29
 
21
- **Routing:** After completing Phase 1, always proceed to Phase 1b before investigating further.
22
-
23
- ---
24
-
25
30
  ## Phase 1b: Check for Existing Fixes
26
31
 
27
32
  Before investigating, check whether someone (or another agent) has already fixed this flake.
@@ -40,291 +45,14 @@ If an existing fix is found, report:
40
45
  - A brief summary of what it addresses
41
46
  - Whether it fully covers the current flake or only partially
42
47
 
43
- If no existing fix is found, proceed to investigation:
44
-
45
- - **E2E (Playwright):** Go to [Phase 2E: E2E Triage Snapshot](#phase-2e-e2e-triage-snapshot)
46
- - **Service, React component, or Unit:** Go to [Phase 2: Fast Path](#phase-2-fast-path-non-e2e)
47
-
48
- ---
49
-
50
- ## Phase 2: Fast Path (non-E2E)
51
-
52
- For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose and fix the flake. Do not over-investigate -- read the evidence, read the code, fix it.
53
-
54
- ### 2a: Gather Failure Context
55
-
56
- Capture from the user's input (ask if missing):
57
-
58
- - **Test file and name** -- exact file path and test title
59
- - **Error message and stack trace** -- the raw failure output
60
- - **Framework** -- Jest, Vitest, etc.
61
- - **Whether it's a new flaky** -- first occurrence vs. recurring
62
- - **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
63
-
64
- ### 2b: Read the Test and Code Under Test
65
-
66
- 1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
67
- 2. Read the production code that the test exercises -- follow imports from the test file.
68
- 3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
69
-
70
- ### 2c: Classify the Flake Pattern
71
-
72
- | Category | Test Types | Signal |
73
- | ------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------ |
74
- | **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
75
- | **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
76
- | **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
77
- | **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
78
- | **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
79
- | **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
80
- | **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
81
- | **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
82
- | **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
83
- | **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
84
-
85
- ### 2d: Diagnose and Fix
86
-
87
- Apply the appropriate fix based on the pattern:
88
-
89
- **Service test fixes:**
90
-
91
- - Ensure `afterAll` closes the app _and_ awaits all open connections (DB, Redis, queues) before returning
92
- - Pass `{ forceCloseConnections: true }` to `NestFactory.create()` (NestJS v10+) to auto-close keep-alive connections on shutdown, or explicitly close the Mongoose/TypeORM connection in `afterAll`
93
- - Use dynamic/random ports (`listen(0)`) to avoid EADDRINUSE
94
- - Isolate database state: use unique collection prefixes, transaction rollbacks, or per-test database cleanup
95
- - If the test uses `setTimeout` or event-driven patterns, ensure the test awaits completion rather than relying on timing
96
-
97
- **React component test fixes:**
98
-
99
- - Wrap state-triggering actions in `act()` or use `waitFor`/`findBy*` queries that handle async updates
100
- - When using fake timers, advance them explicitly (`jest.advanceTimersByTime`, `jest.runAllTimers`) and restore real timers in `afterEach`
101
- - Ensure cleanup with `cleanup()` in `afterEach` (React Testing Library does this automatically unless disabled)
102
- - Restore mocks in `afterEach` -- prefer `jest.restoreAllMocks()` in a shared setup
103
-
104
- **Unit test fixes:**
105
-
106
- - Eliminate shared mutable state: clone or reset objects in `beforeEach`, or make the module-level binding `const`
107
- - Mock `Date.now` / `new Date()` explicitly when time matters; restore in `afterEach`
108
- - If order-dependent, check for missing setup that another test was implicitly providing
109
-
110
- ### 2e: Evidence Standard (Fast Path)
111
-
112
- Before proposing a fix, include at minimum:
113
-
114
- - The **error message and stack trace** from the failure
115
- - The **specific code path** in the test or production code that caused the flake
116
- - A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
117
- - A **confidence score** (1-5, same scale as [E2E evidence standard](#confidence-score))
118
-
119
- If confidence is 2 or below, recommend reproduction steps or instrumentation before committing to a fix.
120
-
121
- Skip to [Phase 5: Fix Decision Tree](#phase-5-fix-decision-tree).
122
-
123
- ---
124
-
125
- ## Phase 2E: E2E Triage Snapshot
126
-
127
- Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
128
-
129
- - Failing test file and name
130
- - GitHub Actions run URL to fetch the LLM report
131
-
132
- ### Fetch the LLM Report
133
-
134
- Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
135
-
136
- ```bash
137
- bash scripts/fetch-llm-report.sh "<github-actions-url>"
138
- ```
139
-
140
- This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
141
-
142
- ## Phase 3E: Quick Classification
143
-
144
- For the full report schema, field reference, caps, and example reports:
145
-
146
- 1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — exact version match to the report.
147
- 2. Otherwise, fetch the latest docs from GitHub:
148
- - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
149
- - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
150
-
151
- Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
152
-
153
- Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
154
-
155
- Classify the flake to narrow the search space:
156
-
157
- | Category | Signal | Timeline Pattern |
158
- | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
159
- | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
160
- | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
161
- | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
162
- | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
163
- | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
164
- | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
165
- | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
166
- | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
167
-
168
- ## Phase 4E: Analyze LLM Report
169
-
170
- ### 4Ea: Walk the Timeline
171
-
172
- **Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
173
-
174
- ```text
175
- step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
176
- ```
177
-
178
- For each timeline entry:
179
-
180
- - **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
181
- - **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
182
- - **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
183
-
184
- All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
185
-
186
- ### 4Eb: Compare pass vs fail (flaky tests)
187
-
188
- If you don't have passing and failing attempts for the same test, skip to 4Ec.
189
-
190
- Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
191
-
192
- 1. Align both timelines by step title sequence
193
- 2. Find the first step/network/console entry that differs between attempts
194
- 3. The divergence answers "what was different this time?" directly
195
-
196
- Common divergence patterns:
197
-
198
- - **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
199
- - **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
200
- - **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
201
- - **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
202
-
203
- ### 4Ec: Identify failing tests
204
-
205
- Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
206
-
207
- - **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
208
- - **`location`**: Source file, line, and column — jump straight to the code.
209
- - **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
210
-
211
- ### 4Ed: Examine attempts for retry patterns
212
-
213
- For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
214
-
215
- **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
216
-
217
- ### 4Ee: Inspect network activity and extract trace IDs
218
-
219
- Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
220
-
221
- **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`references/datadog-apm-traces.md`](./references/datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
222
-
223
- Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
224
-
225
- ### 4Ef: Review test steps
226
-
227
- Prefer the timeline view (4Ea) which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
228
-
229
- ## Phase 4E Evidence Standard
230
-
231
- Do not propose a fix without concrete artifacts. At minimum, include:
232
-
233
- - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
234
- - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
235
- - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
236
- - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
237
- - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
238
- - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
239
-
240
- ### Confidence Score
241
-
242
- Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
243
-
244
- | Score | Meaning | Criteria |
245
- | ----- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
246
- | **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
247
- | **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
248
- | **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
249
- | **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
250
- | **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
251
-
252
- If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
253
-
254
- ---
255
-
256
- ## Phase 5: Fix Decision Tree
257
-
258
- Applies to all test types.
259
-
260
- Apply fixes in this order of priority:
261
-
262
- 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
263
-
264
- 2. **Test harness fix** (when the failure is non-product):
265
- - Reset cookies, storage, and session between retries
266
- - Isolate test data; generate stronger unique identities
267
- - Make retry blocks idempotent
268
- - Wait on deterministic app signals, not arbitrary sleeps
269
- - (Service tests) Close connections and app properly in `afterAll`
270
- - (Component tests) Flush pending state updates and timers before asserting
271
- - (Unit tests) Reset shared mutable state in `beforeEach`
272
-
273
- 3. **Product fix** (when real users would hit the same issue):
274
- - Handle stale or intermediate states safely
275
- - Make routing/render logic robust to eventual consistency
276
- - Add telemetry for ambiguous transitions
277
-
278
- 4. **Both** if user impact exists _and_ tests are fragile.
279
-
280
- ## Phase 6: Fix Sibling Instances
281
-
282
- After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
283
-
284
- ### When to search
285
-
286
- Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
287
-
288
- - Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
289
- - Hardcoded ports instead of dynamic allocation
290
- - Shared mutable state without per-test reset
291
- - Missing `act()` wrappers or `waitFor` around async assertions
292
- - Fake timers not restored in `afterEach`
293
- - Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
294
-
295
- ### When NOT to search
296
-
297
- Skip this step when the fix is **specific to one test's logic** -- for example, a wrong assertion value, a test-specific race condition in a unique setup, or a one-off typo.
298
-
299
- ### How to search
300
-
301
- 1. Identify the anti-pattern as a grep-able code pattern. Examples:
302
- - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
303
- - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
304
- - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
305
- - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
306
-
307
- 2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
308
-
309
- 3. Apply the same fix to each sibling. Keep changes minimal -- fix the anti-pattern, nothing else.
310
-
311
- 4. List the sibling files you fixed in the output so reviewers can verify them.
312
-
313
- ## Phase 7: Verification
48
+ If no existing fix is found, proceed to Phase 2.
314
49
 
315
- Lint and type-check touched files.
50
+ ## Phase 2: Produce a plan
316
51
 
317
- ## Output Format
52
+ Follow [`references/plan.md`](./references/plan.md). It walks investigation, diagnosis, evidence gathering, and the fix decision tree, and produces a structured plan with a confidence score.
318
53
 
319
- When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
54
+ If you are in plan mode, present the plan and stop here.
320
55
 
321
- When documenting the fix in a PR or issue, use this structure:
56
+ ## Phase 3: Apply the plan (fix mode only)
322
57
 
323
- - **Confidence:** score (1-5) with brief justification
324
- - **Symptom:** what failed and where
325
- - **Root cause:** concise technical explanation
326
- - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
327
- - **Fix:** test-only, product-only, or both
328
- - **Siblings fixed:** list of other files where the same anti-pattern was corrected (or "N/A -- fix was test-specific")
329
- - **Validation:** commands and suites run
330
- - **Residual risk:** what could still be flaky
58
+ Follow [`references/fix.md`](./references/fix.md). It takes the plan from Phase 2, applies the proposed fix, searches for sibling anti-patterns, and verifies. PR creation is out of scope -- if the user later opens one (or invokes a PR-shipping skill), label it `flaky-test-fix`.
@@ -9,7 +9,7 @@ The `pup` CLI must be installed and authenticated. Two auth paths are supported:
9
9
  - **macOS Keychain** (via `pup auth login`) — the default on developer machines.
10
10
  - **Environment variables** (`DD_API_KEY` + `DD_APP_KEY`) — the path used in sandboxes and CI.
11
11
 
12
- **Do not run `pup auth status` to verify auth.** It only reads the Keychain, so it fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly — env-var auth takes effect there. If the query fails with an auth error, surface it then (check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`).
12
+ Don't run `pup auth status` to verify auth. It fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly. If the query fails with an auth error, check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`.
13
13
 
14
14
  ## Key pup conventions
15
15
 
@@ -78,3 +78,7 @@ pup traces search --query="trace_id:<TRACE_ID>" --from=30d --limit=1000 \
78
78
  error: .attributes.custom.error.message
79
79
  }]'
80
80
  ```
81
+
82
+ ### 4. Query additional data
83
+
84
+ If additional data would help diagnose the issue (e.g. logs, rum, cicd), use the pup CLI.
@@ -0,0 +1,65 @@
1
+ # Apply a Flaky Test Fix
2
+
3
+ Apply phase of the flaky-test-debugger skill. Takes a plan produced by [`plan.md`](./plan.md) and applies it.
4
+
5
+ ## Preflight
6
+
7
+ Confirm the plan from `plan.md` has confidence ≥ 3. If confidence is 1-2, do not apply -- return to `plan.md` and gather more evidence first.
8
+
9
+ ## Apply the Proposed Fix
10
+
11
+ Edit the files listed in the plan's **Proposed fix** field. Keep the change minimal -- the plan already chose between test harness, product, and both.
12
+
13
+ ## Fix Sibling Instances
14
+
15
+ After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
16
+
17
+ ### When to search
18
+
19
+ Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
20
+
21
+ - Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
22
+ - Hardcoded ports instead of dynamic allocation
23
+ - Shared mutable state without per-test reset
24
+ - Missing `act()` wrappers or `waitFor` around async assertions
25
+ - Fake timers not restored in `afterEach`
26
+ - Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
27
+
28
+ ### When NOT to search
29
+
30
+ Skip this step when the fix is **specific to one test's logic** -- for example, a test-specific race condition in a unique setup or a one-off typo.
31
+
32
+ ### How to search
33
+
34
+ 1. Identify the anti-pattern as a grep-able code pattern. Examples:
35
+ - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
36
+ - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
37
+ - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
38
+ - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
39
+
40
+ 2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
41
+
42
+ 3. Apply the same fix to each sibling. Keep changes minimal; fix the anti-pattern, nothing else.
43
+
44
+ 4. List the sibling files you fixed in the output so reviewers can verify them.
45
+
46
+ ## Verification
47
+
48
+ Run the plan's **Validation plan** commands — including the previously-flaky test, repeated enough times to give reasonable confidence the flake is gone. Lint and type-check touched files as the floor; do not stop there.
49
+
50
+ ## Output Format
51
+
52
+ When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
53
+
54
+ When documenting the fix in a PR or issue, use this structure. Carry **Confidence**, **Symptom**, **Root cause**, **Evidence**, and **Residual risk** straight over from the plan. Three plan fields are renamed: **Proposed fix** → **Fix**, **Sibling candidates** → **Siblings fixed**, **Validation plan** → **Validation**. Drop **Open questions** (resolved by fix time):
55
+
56
+ - **Test ID:** if provided in prompt
57
+ - **Agent session ID:** your running session ID to resume if needed
58
+ - **Confidence:** score (1-5) with brief justification
59
+ - **Symptom:** what failed and where
60
+ - **Root cause:** concise technical explanation
61
+ - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
62
+ - **Fix:** test-only, product-only, or both
63
+ - **Siblings fixed:** list of other files where the same anti-pattern was or should be corrected (or "N/A -- fix was test-specific")
64
+ - **Validation:** commands and suites run
65
+ - **Residual risk:** what could still be flaky
@@ -0,0 +1,226 @@
1
+ # Plan a Flaky Test Fix
2
+
3
+ Diagnosis and planning phase of the flaky-test-debugger skill. Produces a structured plan that the user reviews. In fix mode, the plan is consumed by [`fix.md`](./fix.md).
4
+
5
+ Route by the test type identified in Phase 1 of SKILL.md:
6
+
7
+ - **E2E (Playwright):** start with [E2E Triage Snapshot](#e2e-triage-snapshot)
8
+ - **Service, React component, or Unit:** start with [Fast Path (non-E2E)](#fast-path-non-e2e)
9
+
10
+ ## Fast Path (non-E2E)
11
+
12
+ For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose the flake. Do not over-investigate -- read the evidence, read the code, plan the fix.
13
+
14
+ ### Gather Failure Context
15
+
16
+ Capture from the user's input (ask if missing):
17
+
18
+ - **Test file and name** -- exact file path and test title
19
+ - **Error message and stack trace** -- the raw failure output
20
+ - **Framework** -- Jest, Vitest, etc.
21
+ - **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
22
+
23
+ ### Read the Test and Code Under Test
24
+
25
+ 1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
26
+ 2. Read the production code that the test exercises -- follow imports from the test file.
27
+ 3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
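
Illustrating step 3 above, the setup/teardown shape to check for looks roughly like this (a minimal Jest + NestJS sketch; `AppModule` and the import path are placeholders, not names from this repo):

```ts
import { INestApplication } from "@nestjs/common";
import { Test, TestingModule } from "@nestjs/testing";

import { AppModule } from "../src/app.module"; // placeholder path

describe("orders service (integration)", () => {
  let app: INestApplication;

  beforeAll(async () => {
    const moduleRef: TestingModule = await Test.createTestingModule({
      imports: [AppModule],
    }).compile();
    app = moduleRef.createNestApplication();
    await app.init();
    // If a real port is needed, prefer `await app.listen(0)` (random port) over a
    // hardcoded one so parallel test files don't collide on EADDRINUSE.
  });

  afterAll(async () => {
    // The cleanup to look for: close the app and await every open connection
    // (DB, Redis, queues) so nothing outlives this test file.
    await app.close();
  });

  it("creates an order", async () => {
    // ...exercise the app (e.g. via supertest) and assert
  });
});
```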
28
+
29
+ ### Classify the Flake Pattern
30
+
31
+ | Category | Test Types | Signal |
32
+ | ------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------ |
33
+ | **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
34
+ | **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
35
+ | **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
36
+ | **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
37
+ | **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
38
+ | **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
39
+ | **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
40
+ | **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
41
+ | **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
42
+ | **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
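
As a concrete instance of the **Mock not restored** row above, a hypothetical Jest example (module names and values invented for illustration):

```ts
import * as currency from "./currency"; // hypothetical dependency
import { formatPrice } from "./format-price"; // hypothetical module under test

describe("formatPrice", () => {
  afterEach(() => {
    // The fix: restore every spy after each test (or set `restoreMocks: true`
    // in the Jest config) so mocks can't bleed into the next test.
    jest.restoreAllMocks();
  });

  it("formats with a mocked exchange rate", () => {
    jest.spyOn(currency, "getRate").mockReturnValue(2);
    expect(formatPrice(10, "EUR")).toBe("20.00 EUR");
  });

  it("formats with the real exchange rate", () => {
    // Without the afterEach above, this test still sees the spy from the previous
    // test, so its outcome depends on execution order -- a classic flake.
    expect(formatPrice(10, "EUR")).toBe("10.70 EUR");
  });
});
```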
43
+
44
+ ### Diagnose with Evidence
45
+
46
+ Before proposing a fix, gather:
47
+
48
+ - The **error message and stack trace** from the failure
49
+ - The **specific code path** in the test or production code that caused the flake
50
+ - A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
51
+ - A **confidence score** (1-5, see [Confidence Score](#confidence-score))
52
+
53
+ If confidence is 2 or below, the plan is to gather more data: recommend specific reproduction steps or instrumentation rather than a code fix.
54
+
55
+ If >2, continue to [Decide Fix Approach](#decide-fix-approach).
56
+
57
+ ## E2E Triage Snapshot
58
+
59
+ Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
60
+
61
+ - Failing test file and name
62
+ - GitHub Actions run URL to fetch the LLM report
63
+
64
+ ### Fetch the LLM Report
65
+
66
+ Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
67
+
68
+ ```bash
69
+ bash scripts/fetch-llm-report.sh "<github-actions-url>"
70
+ ```
71
+
72
+ This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
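
A minimal TypeScript/Node sketch for loading the report that the later steps analyze (the path mirrors the script output above; `runId` is whatever the script printed):

```ts
import { readFileSync } from "node:fs";

const runId = process.argv[2]; // e.g. the GitHub Actions run id
const report = JSON.parse(
  readFileSync(`/tmp/playwright-llm-report-${runId}/llm-report.json`, "utf8"),
);

console.log(report.schemaVersion, `${report.tests.length} tests`);
```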
73
+
74
+ ## Quick Classification (E2E)
75
+
76
+ For the full report schema, field reference, caps, and example reports:
77
+
78
+ 1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — exact version match to the report.
79
+ 2. Otherwise, fetch the latest docs from GitHub:
80
+ - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
81
+ - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
82
+
83
+ Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
84
+
85
+ Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
86
+
87
+ Classify the flake to narrow the search space:
88
+
89
+ | Category | Signal | Timeline Pattern |
90
+ | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
91
+ | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
92
+ | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
93
+ | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
94
+ | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
95
+ | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
96
+ | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
97
+ | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
98
+ | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
99
+
100
+ ## Analyze LLM Report
101
+
102
+ ### Walk the Timeline
103
+
104
+ **Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
105
+
106
+ ```text
107
+ step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
108
+ ```
109
+
110
+ For each timeline entry:
111
+
112
+ - **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
113
+ - **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
114
+ - **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
115
+
116
+ All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
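
A sketch of that walk in TypeScript -- the field names are the ones listed above; treat the exact shapes as illustrative until checked against the reporter docs:

```ts
// Reconstruct the event sequence from one attempt's unified timeline.
function printTimeline(attempt: any): void {
  for (const entry of attempt.timeline ?? []) {
    switch (entry.kind) {
      case "step":
        console.log(`${entry.offsetMs}ms  step     ${entry.title}${entry.error ? "  FAILED" : ""}`);
        break;
      case "network":
        console.log(`${entry.offsetMs}ms  network  ${entry.method} ${entry.url} -> ${entry.status}`);
        break;
      case "console":
        console.log(`${entry.offsetMs}ms  console  [${entry.type}] ${entry.text}`);
        break;
    }
  }
}
```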
117
+
118
+ ### Compare pass vs fail (flaky tests)
119
+
120
+ If you don't have passing and failing attempts for the same test, skip to [Identify failing tests](#identify-failing-tests).
121
+
122
+ Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
123
+
124
+ 1. Align both timelines by step title sequence
125
+ 2. Find the first step/network/console entry that differs between attempts
126
+ 3. The divergence answers "what was different this time?" directly
127
+
128
+ Common divergence patterns:
129
+
130
+ - **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
131
+ - **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
132
+ - **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
133
+ - **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
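
A rough sketch of the side-by-side walk (illustrative only; tune what counts as "different" -- step outcome, presence of a network call, console errors -- to the case at hand):

```ts
// Compare a passing and a failing attempt's timelines entry by entry and return
// the first index where they diverge.
function firstDivergence(passed: any[], failed: any[]) {
  const key = (e: any) =>
    e.kind === "step"
      ? `step:${e.title}:${e.error ? "failed" : "ok"}`
      : e.kind === "network"
        ? `network:${e.method} ${e.url} ${e.status}`
        : `console:${e.type}`;

  const length = Math.max(passed.length, failed.length);
  for (let i = 0; i < length; i++) {
    if (!passed[i] || !failed[i] || key(passed[i]) !== key(failed[i])) {
      return { index: i, passed: passed[i], failed: failed[i] };
    }
  }
  return undefined; // no divergence found
}
```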
134
+
135
+ ### Identify failing tests
136
+
137
+ Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
138
+
139
+ - **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
140
+ - **`location`**: Source file, line, and column — jump straight to the code.
141
+ - **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
142
+
143
+ ### Examine attempts for retry patterns
144
+
145
+ For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
146
+
147
+ **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
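
For example, a small Node sketch to dump the screenshot to disk so it can actually be viewed (`attempt` is one entry of `tests[].attempts[]`):

```ts
import { writeFileSync } from "node:fs";

// Write the failure screenshot (if any) to a file and return its path.
function saveFailureScreenshot(attempt: any, outPath = "/tmp/failure.png"): string | undefined {
  const base64 = attempt.failureArtifacts?.screenshotBase64;
  if (!base64) return undefined;
  writeFileSync(outPath, Buffer.from(base64, "base64"));
  return outPath;
}
```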
148
+
149
+ ### Inspect network activity and extract trace IDs
150
+
151
+ Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
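
A sketch of that scan, assuming each instance carries an `id` matching the timeline entry's `networkId` (confirm the exact join field against the reporter docs):

```ts
// Find network activity near the failing assertion that looks unhealthy, joined to
// per-request detail (timings, trace ids) and group shape (failureText, wasAborted).
function suspiciousRequests(attempt: any, failureOffsetMs: number, windowMs = 10_000) {
  return (attempt.timeline ?? [])
    .filter(
      (e: any) => e.kind === "network" && Math.abs(e.offsetMs - failureOffsetMs) <= windowMs,
    )
    .map((e: any) => {
      const instance = attempt.network.instances.find((i: any) => i.id === e.networkId);
      const group = instance ? attempt.network.groups[instance.groupId] : undefined;
      return { entry: e, instance, group };
    })
    .filter(
      ({ entry, group }: any) => (entry.status ?? 0) >= 400 || group?.wasAborted || group?.failureText,
    )
    .map(({ entry, instance, group }: any) => ({
      url: entry.url,
      status: entry.status,
      durationMs: instance?.durationMs,
      timings: instance?.timings,
      traceId: instance?.traceId, // when present, follow the Datadog APM reference below
      failureText: group?.failureText,
    }));
}
```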
152
+
153
+ **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`./datadog-apm-traces.md`](./datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
154
+
155
+ Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
156
+
157
+ ### Review test steps
158
+
159
+ Prefer the timeline view above, which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
160
+
161
+ ## Evidence Standard (E2E)
162
+
163
+ Do not propose a fix without concrete artifacts. At minimum, include:
164
+
165
+ - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
166
+ - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
167
+ - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
168
+ - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
169
+ - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
170
+ - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
171
+
172
+ If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
173
+
174
+ If >2, continue to [Decide Fix Approach](#decide-fix-approach).
175
+
176
+ ### Confidence Score
177
+
178
+ Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
179
+
180
+ | Score | Meaning | Criteria |
181
+ | ----- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
182
+ | **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
183
+ | **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
184
+ | **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
185
+ | **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
186
+ | **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
187
+
188
+ ## Decide Fix Approach
189
+
190
+ Applies to all test types.
191
+
192
+ Decide the fix approach in this priority order:
193
+
194
+ 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
195
+
196
+ 2. **Test harness fix** (when the failure is non-product):
197
+ - Reset cookies, storage, and session between retries
198
+ - Isolate test data; generate stronger unique identities
199
+ - Make retry blocks idempotent
200
+ - Wait on deterministic app signals, not arbitrary sleeps
201
+ - (Service tests) Close connections and app properly in `afterAll`
202
+ - (Component tests) Flush pending state updates and timers before asserting
203
+ - (Unit tests) Reset shared mutable state in `beforeEach`
204
+
205
+ 3. **Product fix** (when real users would hit the same issue):
206
+ - Handle stale or intermediate states safely
207
+ - Make routing/render logic robust to eventual consistency
208
+ - Add telemetry for ambiguous transitions
209
+
210
+ 4. **Both** if user impact exists _and_ tests are fragile.
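
One of the harness-fix bullets above, "wait on deterministic app signals, not arbitrary sleeps", shown as a hypothetical Playwright example (route, endpoint, and test id are invented):

```ts
import { expect, test } from "@playwright/test";

test("order appears after submit", async ({ page }) => {
  await page.goto("/orders/new"); // hypothetical route
  await page.getByRole("button", { name: "Submit" }).click();

  // Flaky: hopes two seconds is enough for the POST and the re-render.
  // await page.waitForTimeout(2000);

  // Deterministic: wait for the response and the rendered result the test needs.
  await page.waitForResponse((res) => res.url().includes("/api/orders") && res.ok());
  await expect(page.getByTestId("order-confirmation")).toBeVisible();
});
```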
211
+
212
+ ## Plan Output Format
213
+
214
+ Produce the plan with these fields:
215
+
216
+ - **Test ID:** if provided in prompt
217
+ - **Agent session ID:** your running session ID to resume if needed
218
+ - **Confidence:** score (1-5) with brief justification
219
+ - **Symptom:** what failed and where
220
+ - **Root cause:** concise technical explanation
221
+ - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
222
+ - **Proposed fix:** test harness, product, or both — with the specific file(s) and the change you would make
223
+ - **Sibling candidates:** files that appear to share the same anti-pattern, for the reviewer (or fix.md) to confirm. Or "N/A -- fix is test-specific" if the issue is one-off (see [`fix.md`](./fix.md) for what counts as a structural anti-pattern worth searching for).
224
+ - **Validation plan:** lint/typecheck commands and test commands to run after applying the fix
225
+ - **Open questions:** anything that needs human input before fixing
226
+ - **Residual risk:** what could still be flaky after the fix
@@ -14,12 +14,11 @@ Structure and write Linear bug reports from evidence that already exists in the
14
14
  ## Process
15
15
 
16
16
  1. **Gather context** — collect evidence from the conversation: investigation findings, user reports, Datadog links, error details
17
- 2. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the symptom description, error messages, and affected service/endpoint. If duplicates are found, present them to the user before proceeding.
18
- 3. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected — ask before drafting. NEVER invent answers. Up to 3 rounds.
19
- 4. **Draft** — title + description, structure scaled to complexity (see format below)
20
- 5. **Self-review** — check every Red Flag below before presenting
21
- 6. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
22
- 7. **Create in Linear** — only after explicit approval
17
+ 2. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected, ask before drafting. NEVER invent answers. Up to 3 rounds.
18
+ 3. **Draft** — title + description, structure scaled to complexity (see format below)
19
+ 4. **Self-review** — check every Red Flag below before presenting
20
+ 5. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
21
+ 6. **Create in Linear** — only after explicit approval
23
22
 
24
23
  ## Hard Rules
25
24
 
@@ -15,14 +15,13 @@ Draft Linear feature request tickets that describe what users need and why — n
15
15
  - **1-2 factual gaps** (missing repo, unclear who) → ask the user directly. Don't dispatch the full interview for a single missing data point.
16
16
  - **Structural problems** (solution-shaped framing, no problem articulated, mostly unknowns) → dispatch `interview-feature` skill. Receive a structured problem brief. Re-check gate against the brief.
17
17
  - If `interview-feature` terminates without producing a problem brief (user refused to articulate a problem), abort the ticket process. Inform the user that the ticket cannot be created without a problem statement.
18
- 3. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the problem's key terms, affected users, and domain area. If duplicates or closely related tickets are found, present them to the user and ask whether to proceed, merge with an existing ticket, or stop.
19
- 4. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check it doesn't blindly trust upstream context.
20
- 5. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
21
- 6. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
22
- 7. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
23
- 8. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
24
- 9. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
25
- 10. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
18
+ 3. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check; it doesn't blindly trust upstream context.
19
+ 4. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
20
+ 5. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
21
+ 6. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
22
+ 7. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
23
+ 8. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
24
+ 9. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
26
25
 
27
26
  ## Final Validation Checklist
28
27
 
@@ -21,11 +21,10 @@ Draft Linear tech debt tickets that justify _why_ the debt matters — cost to c
21
21
  - Maintainability/DX → `git log` for change frequency and bug-fix commits, grep for workarounds
22
22
  - Security → check dependency versions, scan for vulnerability patterns
23
23
  5. **Assess interest & risk** — produce structured ratings with evidence (see reference.md for rating framework)
24
- 6. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the debt's key terms, component names, and domain area. If duplicates are found, present them to the user before proceeding.
25
- 7. **Draft** — title + description, structure scaled to complexity (see format below)
26
- 8. **Self-review** — check every Red Flag below before presenting
27
- 9. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
28
- 10. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
24
+ 6. **Draft** — title + description, structure scaled to complexity (see format below)
25
+ 7. **Self-review** — check every Red Flag below before presenting
26
+ 8. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
27
+ 9. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
29
28
 
30
29
  ## Hard Rules
31
30
 
@@ -1,101 +0,0 @@
1
- ---
2
- name: linear-duplicate-finder
3
- description: Use when checking if a Linear ticket already exists before creating one. Searches across teams, archived tickets, and multiple phrasings to find duplicates and related tickets. Dispatched by ticket-writing skills before creation.
4
- ---
5
-
6
- # Linear Duplicate Finder
7
-
8
- Search Linear for duplicate or related tickets before creating a new one. Casts a wide net across teams, statuses, and phrasings, then classifies matches by similarity.
9
-
10
- ## Inputs
11
-
12
- 1. **A Linear ticket ID** (e.g., "ENG-1234") — fetch its details and search for duplicates.
13
- 2. **A ticket title + description** — search Linear directly for matches.
14
- 3. **Multiple ticket IDs** — cross-reference against each other and the backlog.
15
-
16
- ## Process
17
-
18
- ### Step 1: Understand the Source
19
-
20
- If given a ticket ID:
21
-
22
- - Fetch the ticket using `mcp__linear__get_issue` with `includeRelations: true` to see if duplicates are already marked.
23
- - Extract the title, description, labels, team, and project.
24
-
25
- If given a title/description:
26
-
27
- - Parse the key concepts, features, and domain terms.
28
-
29
- ### Step 2: Generate Search Queries
30
-
31
- Break the ticket down into multiple search angles:
32
-
33
- - **Exact title keywords**: Most distinctive terms from the title.
34
- - **Core concept**: The fundamental ask, using different phrasings.
35
- - **Domain/feature area**: The feature area or system component involved.
36
- - **Synonyms and alternative phrasings**: 2-3 alternative ways to describe the same thing.
37
-
38
- ### Step 3: Execute Searches
39
-
40
- Run multiple `mcp__linear__list_issues` searches in parallel using the `query` parameter with different search terms:
41
-
42
- - Use `limit: 50` to cast a wide net.
43
- - Include `includeArchived: true` to catch completed or cancelled tickets.
44
- - Filter by team when known, but also do at least one cross-team search.
45
-
46
- Also use `mcp__linear__query_data` with natural language queries for concept-based matching.
47
-
48
- Run at least 3-5 different searches with varied query terms.
49
-
50
- ### Step 4: Analyze and Score Results
51
-
52
- | Dimension | Weight | Description |
53
- | ----------------------- | ------ | ------------------------------------------------------ |
54
- | **Title similarity** | High | Do the titles describe the same thing? |
55
- | **Description overlap** | High | Do the descriptions reference the same problem? |
56
- | **Same feature area** | Medium | Are they about the same system/feature? |
57
- | **Same team/project** | Low | Same team increases likelihood but isn't required. |
58
- | **Status** | Info | Cancelled/completed duplicates are still worth noting. |
59
-
60
- ### Step 5: Classify Matches
61
-
62
- - **Duplicate**: Exact same work. Creating both = redundant effort.
63
- - **Closely Related**: Overlapping scope — completing one partially addresses the other. Should cross-reference.
64
- - **Same Area**: Same domain but different aspects. Useful context, not duplicates.
65
-
66
- ### Step 6: Present Results
67
-
68
- ```text
69
- ## Duplicate Detection Results
70
-
71
- ### Source
72
- **[ID] Title** or **Potential ticket**: "description"
73
-
74
- ### Duplicates Found
75
- 1. **[TEAM-123] Title** — Status: In Progress
76
- - **Why**: [specific overlap]
77
- - **Key difference**: [if any]
78
-
79
- ### Closely Related
80
- 1. **[TEAM-456] Title** — Status: Backlog
81
- - **Overlap**: [what's shared]
82
- - **Difference**: [what's distinct]
83
-
84
- ### Same Area (Context)
85
- 1. **[TEAM-789] Title** — Status: Done
86
- - **Relevance**: [why worth noting]
87
-
88
- ### Recommendation
89
- [Proceed, merge with existing, or add context to related ticket?]
90
- ```
91
-
92
- If no duplicates found, say so clearly and recommend proceeding.
93
-
94
- ## Guidelines
95
-
96
- - **Wide net, then narrow.** Better to surface a false positive than miss a real duplicate.
97
- - **Search across teams.** Duplicates often live on different teams.
98
- - **Check archived/cancelled tickets.** May contain valuable context about why work was previously rejected.
99
- - **Look at different time ranges.** Duplicates can be months old.
100
- - **Be specific.** Don't say "similar title" — explain exactly what overlaps and differs.
101
- - **When in doubt, include it.** A false positive is cheap; a missed duplicate wastes engineering effort.