@clipboard-health/ai-rules 2.15.8 → 2.15.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@clipboard-health/ai-rules",
3
- "version": "2.15.8",
3
+ "version": "2.15.10",
4
4
  "description": "Pre-built AI agent rules for consistent coding standards.",
5
5
  "keywords": [
6
6
  "ai",
@@ -59,7 +59,12 @@ for my $section (@sections) {
59
59
  my $raw_file_name = trim($1);
60
60
  my $file_content = $2;
61
61
 
62
- while ($file_content =~ /`(\d+(?:-\d+)?)`:\s*(?:_[^_]+_\s*\|\s*_[^_]+_\s*)?\*\*([^*]+)\*\*\s*([\s\S]*?)(?=---|\n`\d|<\/blockquote>|$)/g) {
62
+ # Category prefix is optional. CodeRabbit emits 0–N underscore-wrapped tags
63
+ # separated by `|` (e.g. `_⚠️ Potential issue_ | _🟠 Major_ | _⚡ Quick win_`
64
+ # or just `_💤 Low value_` on lower-confidence findings). The previous
65
+ # regex required exactly two tags and silently dropped one-tag and
66
+ # three-tag variants.
67
+ while ($file_content =~ /`(\d+(?:-\d+)?)`:\s*(?:_[^_]+_(?:\s*\|\s*_[^_]+_)*\s*)?\*\*([^*]+)\*\*\s*([\s\S]*?)(?=---|\n`\d|<\/blockquote>|$)/g) {
63
68
  my $line_range = $1;
64
69
  my $title = trim($2);
65
70
  my $clean_body = clean_comment_body(trim($3));
@@ -14,20 +14,13 @@ description: Commit, push, and open a PR. Use when the user wants to ship change
14
14
 
15
15
  ## Your task
16
16
 
17
- **First, decide from the context above. If `Commits ahead of default branch` is `(unknown)`, skip this decision and use the full flow below.**
17
+ If `Commits ahead of default branch` is `(unknown)`, `origin/HEAD` couldn't be resolved — stop and tell the user to run `git remote set-head origin -a` (or otherwise set the default branch) before retrying, since the simplify step also depends on it. Otherwise, if `Git status`, `Commits ahead of default branch`, and `Existing PR` are all empty/none, stop and reply `nothing to ship.`. Otherwise:
18
18
 
19
- - If `Git status` is empty AND `Commits ahead of default branch` is empty AND `Existing PR` is `none`: stop. Reply with `nothing to ship.` and do nothing else.
20
- - If `Git status` is empty but there are `Commits ahead of default branch` or an `Existing PR`: still run step 1 and step 2. Then re-check `git status --short`; skip step 3 only if it remains empty. Then continue with step 4 and step 5.
21
- - Otherwise: proceed with all steps below.
22
-
23
- Based on the above changes:
24
-
25
- 1. Create a new branch if on main (e.g., `feat/add-user-validation`, `fix/null-check-in-parser`)
26
- 2. Run the `simplify` skill on the files included in the PR before committing, pushing, or opening the PR. If the working tree is clean, run `simplify` against `git diff $(git merge-base HEAD origin/HEAD)..HEAD`. If the working tree has changes, run it against `git diff HEAD`; when local commits also exist, include `git diff $(git merge-base HEAD origin/HEAD)..HEAD` as additional PR context. Invoke the skill by name using this agent's normal skill invocation mechanism. Wait for the skill to finish, then include any resulting fixes in the commit step below.
27
- 3. Re-check `git status --short`. If there are changes, create a single conventional commit message.
28
- 4. Push the branch to origin
19
+ 1. Create a new branch if on main (e.g., `feat/add-user-validation`, `fix/null-check-in-parser`).
20
+ 2. Run the `simplify` skill on the full PR diff `git diff $(git merge-base HEAD origin/HEAD)..HEAD` plus any uncommitted changes. Wait for it to finish, then include any resulting fixes in the commit.
21
+ 3. If `git status --short` shows changes, create a single conventional commit.
22
+ 4. Push the branch to origin.
29
23
  5. Check for an existing PR with `gh pr view`.
30
- - No PR exists: Create with `gh pr create`. Title = commit subject line. Description = brief explanation of **why**, not what. Append `<!-- commit-push-pr:created v1 -->` on its own line at the end of the PR description so skill-created PRs can be identified later.
31
- - PR exists: Report the URL and move on.
32
- 6. You have the capability to call multiple tools in a single response. After the `simplify` skill completes, do the remaining git and PR operations in a single message. Do not use any other tools or do anything else.
33
- 7. After tool calls complete, send one short final text response with the branch name and the full PR URL (e.g., `https://github.com/clipboardhealth/core-utils/pull/123`). Never use shorthand like `repo#123` — always output the complete URL so it is clickable.
24
+ - No PR: create with `gh pr create`. Title = commit subject. Description = brief explanation of **why**, not what. Append `<!-- commit-push-pr:created v1 -->` on its own line at the end so skill-created PRs can be identified later.
25
+ - PR exists: report the URL and move on.
26
+ 6. End with one short text response: branch name and the full PR URL (e.g., `https://github.com/clipboardhealth/core-utils/pull/123`). Never use shorthand like `repo#123`; always output the complete URL.
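The diffs that step 2 feeds to `simplify` can be sketched as shell helpers. This is a sketch, and it assumes `origin/HEAD` resolves — the same precondition the `(unknown)` guard above exists to catch:

```shell
# Committed PR changes: everything since the branch diverged from the
# default branch. This is the "full PR diff" of step 2.
pr_diff() {
  local base
  base=$(git merge-base HEAD origin/HEAD) || return 1
  git diff "$base"..HEAD
}

# Uncommitted changes, layered on top of the PR diff when present.
worktree_diff() {
  git diff HEAD
}
```

If `pr_diff` fails, `git remote set-head origin -a` is the usual fix, matching the guidance above.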
@@ -3,7 +3,16 @@ name: flaky-test-debugger
3
3
  description: Debug and fix flaky tests including Playwright E2E, NestJS service/integration, React component, and unit tests. Use this skill when investigating intermittent test failures, triaging flaky tests, or fixing test instability.
4
4
  ---
5
5
 
6
- Work through these phases in order. Skip phases only when you already have the information they produce.
6
+ Phases run in order. Skip a phase if you already have the information it produces. Phase 3 runs only in fix mode.
7
+
8
+ ## Mode: plan vs fix
9
+
10
+ This skill runs in one of two modes:
11
+
12
+ - **Fix mode (default):** produce a plan, then apply it.
13
+ - **Plan mode:** produce a plan and stop, for human review.
14
+
15
+ Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it". Otherwise default to fix mode. Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
7
16
 
8
17
  ## Phase 1: Classify Test Type
9
18
 
@@ -18,10 +27,6 @@ Determine the test type from the user's input before doing anything else. The ty
18
27
 
19
28
  If the type is ambiguous, check the test file extension and imports to confirm.
20
29
 
21
- **Routing:** After completing Phase 1, always proceed to Phase 1b before investigating further.
22
-
23
- ---
24
-
25
30
  ## Phase 1b: Check for Existing Fixes
26
31
 
27
32
  Before investigating, check whether someone (or another agent) has already fixed this flake.
@@ -40,291 +45,14 @@ If an existing fix is found, report:
40
45
  - A brief summary of what it addresses
41
46
  - Whether it fully covers the current flake or only partially
42
47
 
43
- If no existing fix is found, proceed to investigation:
44
-
45
- - **E2E (Playwright):** Go to [Phase 2E: E2E Triage Snapshot](#phase-2e-e2e-triage-snapshot)
46
- - **Service, React component, or Unit:** Go to [Phase 2: Fast Path](#phase-2-fast-path-non-e2e)
47
-
48
- ---
49
-
50
- ## Phase 2: Fast Path (non-E2E)
51
-
52
- For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose and fix the flake. Do not over-investigate -- read the evidence, read the code, fix it.
53
-
54
- ### 2a: Gather Failure Context
55
-
56
- Capture from the user's input (ask if missing):
57
-
58
- - **Test file and name** -- exact file path and test title
59
- - **Error message and stack trace** -- the raw failure output
60
- - **Framework** -- Jest, Vitest, etc.
61
- - **Whether it's a new flaky** -- first occurrence vs. recurring
62
- - **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
63
-
64
- ### 2b: Read the Test and Code Under Test
65
-
66
- 1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
67
- 2. Read the production code that the test exercises -- follow imports from the test file.
68
- 3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
69
-
70
- ### 2c: Classify the Flake Pattern
71
-
72
- | Category | Test Types | Signal |
73
- | ------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------ |
74
- | **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
75
- | **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
76
- | **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
77
- | **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
78
- | **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
79
- | **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
80
- | **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
81
- | **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
82
- | **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
83
- | **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
84
-
85
- ### 2d: Diagnose and Fix
86
-
87
- Apply the appropriate fix based on the pattern:
88
-
89
- **Service test fixes:**
90
-
91
- - Ensure `afterAll` closes the app _and_ awaits all open connections (DB, Redis, queues) before returning
92
- - Pass `{ forceCloseConnections: true }` to `NestFactory.create()` (NestJS v10+) to auto-close keep-alive connections on shutdown, or explicitly close the Mongoose/TypeORM connection in `afterAll`
93
- - Use dynamic/random ports (`listen(0)`) to avoid EADDRINUSE
94
- - Isolate database state: use unique collection prefixes, transaction rollbacks, or per-test database cleanup
95
- - If the test uses `setTimeout` or event-driven patterns, ensure the test awaits completion rather than relying on timing
96
-
97
- **React component test fixes:**
98
-
99
- - Wrap state-triggering actions in `act()` or use `waitFor`/`findBy*` queries that handle async updates
100
- - When using fake timers, advance them explicitly (`jest.advanceTimersByTime`, `jest.runAllTimers`) and restore real timers in `afterEach`
101
- - Ensure cleanup with `cleanup()` in `afterEach` (React Testing Library does this automatically unless disabled)
102
- - Restore mocks in `afterEach` -- prefer `jest.restoreAllMocks()` in a shared setup
103
-
104
- **Unit test fixes:**
105
-
106
- - Eliminate shared mutable state: clone or reset objects in `beforeEach`, or make the module-level binding `const`
107
- - Mock `Date.now` / `new Date()` explicitly when time matters; restore in `afterEach`
108
- - If order-dependent, check for missing setup that another test was implicitly providing
109
-
110
- ### 2e: Evidence Standard (Fast Path)
111
-
112
- Before proposing a fix, include at minimum:
113
-
114
- - The **error message and stack trace** from the failure
115
- - The **specific code path** in the test or production code that caused the flake
116
- - A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
117
- - A **confidence score** (1-5, same scale as [E2E evidence standard](#confidence-score))
118
-
119
- If confidence is 2 or below, recommend reproduction steps or instrumentation before committing to a fix.
120
-
121
- Skip to [Phase 5: Fix Decision Tree](#phase-5-fix-decision-tree).
122
-
123
- ---
124
-
125
- ## Phase 2E: E2E Triage Snapshot
126
-
127
- Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
128
-
129
- - Failing test file and name
130
- - GitHub Actions run URL to fetch the LLM report
131
-
132
- ### Fetch the LLM Report
133
-
134
- Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
135
-
136
- ```bash
137
- bash scripts/fetch-llm-report.sh "<github-actions-url>"
138
- ```
139
-
140
- This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
141
-
142
- ## Phase 3E: Quick Classification
143
-
144
- For the full report schema, field reference, caps, and example reports:
145
-
146
- 1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — exact version match to the report.
147
- 2. Otherwise, fetch the latest docs from GitHub:
148
- - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
149
- - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
150
-
151
- Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
152
-
153
- Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
154
-
155
- Classify the flake to narrow the search space:
156
-
157
- | Category | Signal | Timeline Pattern |
158
- | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
159
- | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
160
- | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
161
- | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
162
- | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
163
- | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
164
- | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
165
- | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
166
- | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
167
-
168
- ## Phase 4E: Analyze LLM Report
169
-
170
- ### 4Ea: Walk the Timeline
171
-
172
- **Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
173
-
174
- ```text
175
- step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
176
- ```
177
-
178
- For each timeline entry:
179
-
180
- - **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
181
- - **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
182
- - **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
183
-
184
- All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
185
-
186
- ### 4Eb: Compare pass vs fail (flaky tests)
187
-
188
- If you don't have passing and failing attempts for the same test, skip to 4Ec.
189
-
190
- Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
191
-
192
- 1. Align both timelines by step title sequence
193
- 2. Find the first step/network/console entry that differs between attempts
194
- 3. The divergence answers "what was different this time?" directly
195
-
196
- Common divergence patterns:
197
-
198
- - **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
199
- - **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
200
- - **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
201
- - **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
202
-
203
- ### 4Ec: Identify failing tests
204
-
205
- Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
206
-
207
- - **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
208
- - **`location`**: Source file, line, and column — jump straight to the code.
209
- - **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
210
-
211
- ### 4Ed: Examine attempts for retry patterns
212
-
213
- For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
214
-
215
- **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
216
-
217
- ### 4Ee: Inspect network activity and extract trace IDs
218
-
219
- Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
220
-
221
- **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`references/datadog-apm-traces.md`](./references/datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
222
-
223
- Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
224
-
225
- ### 4Ef: Review test steps
226
-
227
- Prefer the timeline view (4Ea) which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
228
-
229
- ## Phase 4E Evidence Standard
230
-
231
- Do not propose a fix without concrete artifacts. At minimum, include:
232
-
233
- - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
234
- - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
235
- - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
236
- - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
237
- - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
238
- - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
239
-
240
- ### Confidence Score
241
-
242
- Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
243
-
244
- | Score | Meaning | Criteria |
245
- | ----- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
246
- | **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
247
- | **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
248
- | **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
249
- | **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
250
- | **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
251
-
252
- If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
253
-
254
- ---
255
-
256
- ## Phase 5: Fix Decision Tree
257
-
258
- Applies to all test types.
259
-
260
- Apply fixes in this order of priority:
261
-
262
- 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
263
-
264
- 2. **Test harness fix** (when the failure is non-product):
265
- - Reset cookies, storage, and session between retries
266
- - Isolate test data; generate stronger unique identities
267
- - Make retry blocks idempotent
268
- - Wait on deterministic app signals, not arbitrary sleeps
269
- - (Service tests) Close connections and app properly in `afterAll`
270
- - (Component tests) Flush pending state updates and timers before asserting
271
- - (Unit tests) Reset shared mutable state in `beforeEach`
272
-
273
- 3. **Product fix** (when real users would hit the same issue):
274
- - Handle stale or intermediate states safely
275
- - Make routing/render logic robust to eventual consistency
276
- - Add telemetry for ambiguous transitions
277
-
278
- 4. **Both** if user impact exists _and_ tests are fragile.
279
-
280
- ## Phase 6: Fix Sibling Instances
281
-
282
- After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
283
-
284
- ### When to search
285
-
286
- Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
287
-
288
- - Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
289
- - Hardcoded ports instead of dynamic allocation
290
- - Shared mutable state without per-test reset
291
- - Missing `act()` wrappers or `waitFor` around async assertions
292
- - Fake timers not restored in `afterEach`
293
- - Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
294
-
295
- ### When NOT to search
296
-
297
- Skip this step when the fix is **specific to one test's logic** -- for example, a wrong assertion value, a test-specific race condition in a unique setup, or a one-off typo.
298
-
299
- ### How to search
300
-
301
- 1. Identify the anti-pattern as a grep-able code pattern. Examples:
302
- - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
303
- - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
304
- - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
305
- - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
306
-
307
- 2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
308
-
309
- 3. Apply the same fix to each sibling. Keep changes minimal -- fix the anti-pattern, nothing else.
310
-
311
- 4. List the sibling files you fixed in the output so reviewers can verify them.
312
-
313
- ## Phase 7: Verification
48
+ If no existing fix is found, proceed to Phase 2.
314
49
 
315
- Lint and type-check touched files.
50
+ ## Phase 2: Produce a plan
316
51
 
317
- ## Output Format
52
+ Follow [`references/plan.md`](./references/plan.md). It walks investigation, diagnosis, evidence gathering, and the fix decision tree, and produces a structured plan with confidence score.
318
53
 
319
- When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
54
+ If you are in plan mode, present the plan and stop here.
320
55
 
321
- When documenting the fix in a PR or issue, use this structure:
56
+ ## Phase 3: Apply the plan (fix mode only)
322
57
 
323
- - **Confidence:** score (1-5) with brief justification
324
- - **Symptom:** what failed and where
325
- - **Root cause:** concise technical explanation
326
- - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
327
- - **Fix:** test-only, product-only, or both
328
- - **Siblings fixed:** list of other files where the same anti-pattern was corrected (or "N/A -- fix was test-specific")
329
- - **Validation:** commands and suites run
330
- - **Residual risk:** what could still be flaky
58
+ Follow [`references/fix.md`](./references/fix.md). It takes the plan from Phase 2, applies the proposed fix, searches for sibling anti-patterns, and verifies. PR creation is out of scope -- if the user later opens one (or invokes a PR-shipping skill), label it `flaky-test-fix`.
@@ -9,7 +9,7 @@ The `pup` CLI must be installed and authenticated. Two auth paths are supported:
9
9
  - **macOS Keychain** (via `pup auth login`) — the default on developer machines.
10
10
  - **Environment variables** (`DD_API_KEY` + `DD_APP_KEY`) — the path used in sandboxes and CI.
11
11
 
12
- **Do not run `pup auth status` to verify auth.** It only reads the Keychain, so it fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly — env-var auth takes effect there. If the query fails with an auth error, surface it then (check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`).
12
+ Don't run `pup auth status` to verify auth. It fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly. If the query fails with an auth error, check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`.
13
13
 
14
14
  ## Key pup conventions
15
15
 
@@ -78,3 +78,7 @@ pup traces search --query="trace_id:<TRACE_ID>" --from=30d --limit=1000 \
78
78
  error: .attributes.custom.error.message
79
79
  }]'
80
80
  ```
81
+
82
+ ### 4. Query additional data
83
+
84
+ If additional data would help diagnose the issue (e.g. logs, RUM, CI/CD), use the `pup` CLI.
@@ -0,0 +1,65 @@
1
+ # Apply a Flaky Test Fix
2
+
3
+ Apply phase of the flaky-test-debugger skill. Takes a plan produced by [`plan.md`](./plan.md) and applies it.
4
+
5
+ ## Preflight
6
+
7
+ Confirm the plan from `plan.md` has confidence ≥ 3. If confidence is 1-2, do not apply -- return to `plan.md` and gather more evidence first.
8
+
9
+ ## Apply the Proposed Fix
10
+
11
+ Edit the files listed in the plan's **Proposed fix** field. Keep the change minimal -- the plan already chose between test harness, product, and both.
12
+
13
+ ## Fix Sibling Instances
14
+
15
+ After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
16
+
17
+ ### When to search
18
+
19
+ Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
20
+
21
+ - Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
22
+ - Hardcoded ports instead of dynamic allocation
23
+ - Shared mutable state without per-test reset
24
+ - Missing `act()` wrappers or `waitFor` around async assertions
25
+ - Fake timers not restored in `afterEach`
26
+ - Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
27
+
28
+ ### When NOT to search
29
+
30
+ Skip this step when the fix is **specific to one test's logic** -- for example, a test-specific race condition in a unique setup or a one-off typo.
31
+
32
+ ### How to search
33
+
34
+ 1. Identify the anti-pattern as a grep-able code pattern. Examples:
35
+ - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
36
+ - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
37
+ - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
38
+ - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
39
+
40
+ 2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
41
+
42
+ 3. Apply the same fix to each sibling. Keep changes minimal; fix the anti-pattern, nothing else.
43
+
44
+ 4. List the sibling files you fixed in the output so reviewers can verify them.
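The probes in step 1 might look like the following. The paths, globs, and patterns are illustrative, not prescribed by the skill; `|| true` absorbs grep's non-zero exit when a probe finds nothing:

```shell
dir=${1:-src}  # directory to scan; "src" is an illustrative default

# Files that bootstrap a Nest testing module -- inspect each for afterAll teardown.
grep -rl 'createTestingModule' --include='*.spec.ts' --include='*.test.ts' "$dir" || true

# Hardcoded ports in tests.
grep -rn 'listen(3000)' --include='*.test.*' --include='*.spec.*' "$dir" || true

# Files that spy but never restore: -L lists files missing the restore call.
grep -rl 'jest.spyOn' --include='*.test.*' "$dir" | xargs -r grep -L 'restoreAllMocks' || true
```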
45
+
46
+ ## Verification
47
+
48
+ Run the plan's **Validation plan** commands — including the previously-flaky test, repeated enough times to give reasonable confidence the flake is gone. Lint and type-check touched files as the floor; do not stop there.
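One way to get those repeated runs, sketched as a helper; the runner command is whatever the plan's validation step names, and `npx jest …` in the usage line is a placeholder, not an assumption about the repo:

```shell
# Re-run a previously-flaky test N times; stop at the first failure.
# The command is word-split, so keep it to a simple runner + path.
repeat_test() {
  local cmd=$1 runs=${2:-20} i
  for i in $(seq 1 "$runs"); do
    $cmd || { echo "flaked on run $i"; return 1; }
  done
  echo "passed $runs/$runs"
}

# usage: repeat_test "npx jest path/to/flaky.test.ts" 20
```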
49
+
50
+ ## Output Format
51
+
52
+ When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
53
+
54
+ When documenting the fix in a PR or issue, use this structure. Carry **Confidence**, **Symptom**, **Root cause**, **Evidence**, and **Residual risk** straight over from the plan. Three plan fields are renamed: **Proposed fix** → **Fix**, **Sibling candidates** → **Siblings fixed**, **Validation plan** → **Validation**. Drop **Open questions** (resolved by fix time):
55
+
56
+ - **Test ID:** if provided in prompt
57
+ - **Agent session ID:** your running session ID to resume if needed
58
+ - **Confidence:** score (1-5) with brief justification
59
+ - **Symptom:** what failed and where
60
+ - **Root cause:** concise technical explanation
61
+ - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
62
+ - **Fix:** test-only, product-only, or both
63
+ - **Siblings fixed:** list of other files where the same anti-pattern was or should be corrected (or "N/A -- fix was test-specific")
64
+ - **Validation:** commands and suites run
65
+ - **Residual risk:** what could still be flaky
@@ -0,0 +1,226 @@
1
+ # Plan a Flaky Test Fix
2
+
3
+ Diagnosis and planning phase of the flaky-test-debugger skill. Produces a structured plan that the user reviews. In fix mode, the plan is consumed by [`fix.md`](./fix.md).
4
+
5
+ Route by the test type identified in Phase 1 of SKILL.md:
6
+
7
+ - **E2E (Playwright):** start with [E2E Triage Snapshot](#e2e-triage-snapshot)
8
+ - **Service, React component, or Unit:** start with [Fast Path (non-E2E)](#fast-path-non-e2e)
9
+
10
+ ## Fast Path (non-E2E)
11
+
12
+ For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose the flake. Do not over-investigate -- read the evidence, read the code, plan the fix.
13
+
14
+ ### Gather Failure Context
15
+
16
+ Capture from the user's input (ask if missing):
17
+
18
+ - **Test file and name** -- exact file path and test title
19
+ - **Error message and stack trace** -- the raw failure output
20
+ - **Framework** -- Jest, Vitest, etc.
21
+ - **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
22
+
23
+ ### Read the Test and Code Under Test
24
+
25
+ 1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
26
+ 2. Read the production code that the test exercises -- follow imports from the test file.
27
+ 3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
28
+
29
+ ### Classify the Flake Pattern
30
+
31
+ | Category | Test Types | Signal |
32
+ | ------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------ |
33
+ | **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
34
+ | **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
35
+ | **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
36
+ | **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
37
+ | **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
38
+ | **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
39
+ | **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
40
+ | **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
41
+ | **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
42
+ | **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
43
+
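To make one row of the table concrete, here is a minimal, self-contained sketch of the **shared mutable state** pattern — a module-level cache seeded by one test and silently observed by the next. The cache and `loadUser` are invented for illustration.

```javascript
// Module-level state shared across tests.
const cache = { user: null };

function loadUser(id) {
  if (!cache.user) cache.user = { id };
  return cache.user;
}

// "testA" seeds the cache as a side effect.
loadUser(1);
// "testB" expects a fresh load but observes testA's leftover state.
const b = loadUser(2);
console.log(b.id);
// -> 1, not 2: the result depends on test ordering
```

The fix is the one listed under the harness options below: reset the shared state in `beforeEach` (here, `cache.user = null`).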
44
+ ### Diagnose with Evidence
45
+
46
+ Before proposing a fix, gather:
47
+
48
+ - The **error message and stack trace** from the failure
49
+ - The **specific code path** in the test or production code that caused the flake
50
+ - A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
51
+ - A **confidence score** (1-5, see [Confidence Score](#confidence-score))
52
+
53
+ If confidence is 2 or below, the plan is to gather more data: recommend specific reproduction steps or instrumentation rather than a code fix.
54
+
55
+ If >2, continue to [Decide Fix Approach](#decide-fix-approach).
56
+
57
+ ## E2E Triage Snapshot
58
+
59
+ Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
60
+
61
+ - Failing test file and name
62
+ - GitHub Actions run URL to fetch the LLM report
63
+
64
+ ### Fetch the LLM Report
65
+
66
+ Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
67
+
68
+ ```bash
69
+ bash scripts/fetch-llm-report.sh "<github-actions-url>"
70
+ ```
71
+
72
+ This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
73
+
74
+ ## Quick Classification (E2E)
75
+
76
+ For the full report schema, field reference, caps, and example reports:
77
+
78
+ 1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — they are an exact version match for the report.
79
+ 2. Otherwise, fetch the latest docs from GitHub:
80
+ - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
81
+ - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
82
+
83
+ Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
84
+
85
+ Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
86
+
87
+ Classify the flake to narrow the search space:
88
+
89
+ | Category | Signal | Timeline Pattern |
90
+ | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
91
+ | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
92
+ | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
93
+ | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
94
+ | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
95
+ | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
96
+ | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
97
+ | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
98
+ | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
99
+
100
+ ## Analyze LLM Report
101
+
102
+ ### Walk the Timeline
103
+
104
+ **Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
105
+
106
+ ```text
107
+ step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
108
+ ```
109
+
110
+ For each timeline entry:
111
+
112
+ - **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
113
+ - **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
114
+ - **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
115
+
116
+ All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
117
+
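A minimal sketch of walking a timeline-shaped array into the one-line sequence shown above. The field names (`kind`, `offsetMs`, `title`, etc.) follow the schema described here; the entries themselves are fabricated for illustration.

```javascript
const timeline = [
  { kind: "step", offsetMs: 10, title: 'click "Submit"' },
  { kind: "network", offsetMs: 120, method: "POST", url: "/api/orders", status: 201 },
  { kind: "console", offsetMs: 340, type: "error", text: "Cannot read property..." },
  { kind: "step", offsetMs: 350, title: "expect toBeVisible", error: "timeout" },
];

// Render each entry by kind; entries are already offsetMs-sorted.
const line = timeline
  .map((e) => {
    if (e.kind === "network") return `network(${e.method} ${e.url}, ${e.status})`;
    if (e.kind === "console") return `console(${e.type}: "${e.text}")`;
    return `step(${e.title})${e.error ? " FAILED" : ""}`;
  })
  .join(" -> ");
console.log(line);
// -> step(click "Submit") -> network(POST /api/orders, 201) -> console(error: "Cannot read property...") -> step(expect toBeVisible) FAILED
```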
118
+ ### Compare pass vs fail (flaky tests)
119
+
120
+ If you don't have passing and failing attempts for the same test, skip to [Identify failing tests](#identify-failing-tests).
121
+
122
+ Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
123
+
124
+ 1. Align both timelines by step title sequence
125
+ 2. Find the first step/network/console entry that differs between attempts
126
+ 3. The divergence answers "what was different this time?" directly
127
+
128
+ Common divergence patterns:
129
+
130
+ - **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
131
+ - **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
132
+ - **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
133
+ - **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
134
+
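The alignment steps above can be sketched as a first-divergence finder. This is a simplification — it compares only step titles and error presence, where a real comparison would also walk interleaved network and console entries; the step arrays are made up.

```javascript
// Return the first index where the two attempts' steps differ, or null.
function firstDivergence(passed, failed) {
  const n = Math.min(passed.length, failed.length);
  for (let i = 0; i < n; i++) {
    const p = passed[i];
    const f = failed[i];
    if (p.title !== f.title || Boolean(p.error) !== Boolean(f.error)) {
      return { index: i, passed: p, failed: f };
    }
  }
  return null;
}

const passedSteps = [
  { title: "click Submit" },
  { title: "wait for /confirmation" },
  { title: "expect banner visible" },
];
const failedSteps = [
  { title: "click Submit" },
  { title: "wait for /confirmation", error: "timeout 5000ms" },
];
console.log(firstDivergence(passedSteps, failedSteps).index);
// -> 1: same step, error only in the failed attempt
```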
135
+ ### Identify failing tests
136
+
137
+ Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
138
+
139
+ - **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
140
+ - **`location`**: Source file, line, and column — jump straight to the code.
141
+ - **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
142
+
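The filter is a one-liner over `tests[]`. The report object below is a stub with fabricated tests; real data comes from `llm-report.json`.

```javascript
const report = {
  tests: [
    { title: "checkout", status: "passed", flaky: false },
    { title: "login", status: "failed", flaky: false, errors: [{ message: "timeout" }] },
    { title: "search", status: "passed", flaky: true },
  ],
};

// Keep outright failures plus tests that passed only on retry.
const suspects = report.tests.filter((t) => t.status === "failed" || t.flaky);
console.log(suspects.map((t) => t.title));
// -> [ 'login', 'search' ]
```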
143
+ ### Examine attempts for retry patterns
144
+
145
+ For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
146
+
147
+ **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
148
+
149
+ ### Inspect network activity and extract trace IDs
150
+
151
+ Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
152
+
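The instance → group → body joins described above look roughly like this. Field names follow the report schema; the ids, timings, and payloads are fabricated.

```javascript
const network = {
  instances: [
    {
      networkId: "n1",
      method: "GET",
      url: "/api/cart",
      status: 500,
      groupId: "g1",
      responseBodyRef: "b1",
      timings: { wait: 4200 },
    },
  ],
  groups: { g1: { resourceType: "fetch", failureText: null, wasAborted: false, occurrenceCount: 3 } },
  bodies: { b1: '{"error":"upstream timeout"}' },
};

// Pick the failing instance, then join shape and payload by id/ref.
const inst = network.instances.find((i) => i.status >= 500);
const detail = {
  ...inst,
  group: network.groups[inst.groupId],
  responseBody: network.bodies[inst.responseBodyRef],
};
console.log(detail.responseBody);
// -> {"error":"upstream timeout"}
```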
153
+ **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`./datadog-apm-traces.md`](./datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
154
+
155
+ Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
156
+
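The saturation check reduces to summing the three cap/eviction counters (names from the summary fields above; counts fabricated):

```javascript
const summary = {
  instancesDroppedByFilter: 120, // expected: static assets filtered by design
  instancesDroppedByGroupCap: 0,
  instancesDroppedByInstanceCap: 7,
  instancesEvictedAfterAdmission: 0,
};

// Any non-zero cap/eviction counter means the capture is a sample.
const sampled =
  summary.instancesDroppedByGroupCap +
    summary.instancesDroppedByInstanceCap +
    summary.instancesEvictedAfterAdmission >
  0;
console.log(sampled);
// -> true: note this as a confidence-reducing factor
```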
157
+ ### Review test steps
158
+
159
+ Prefer the timeline view above which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
160
+
161
+ ## Evidence Standard (E2E)
162
+
163
+ Do not propose a fix without concrete artifacts. At minimum, include:
164
+
165
+ - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
166
+ - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
167
+ - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
168
+ - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
169
+ - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
170
+ - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
171
+
172
+ If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
173
+
174
+ If >2, continue to [Decide Fix Approach](#decide-fix-approach).
175
+
176
+ ### Confidence Score
177
+
178
+ Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
179
+
180
+ | Score | Meaning | Criteria |
181
+ | ----- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
182
+ | **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
183
+ | **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
184
+ | **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
185
+ | **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
186
+ | **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
187
+
188
+ ## Decide Fix Approach
189
+
190
+ Applies to all test types.
191
+
192
+ Choose one of these approaches in priority order:
193
+
194
+ 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
195
+
196
+ 2. **Test harness fix** (when the failure is non-product):
197
+ - Reset cookies, storage, and session between retries
198
+ - Isolate test data; generate stronger unique identities
199
+ - Make retry blocks idempotent
200
+ - Wait on deterministic app signals, not arbitrary sleeps
201
+ - (Service tests) Close connections and app properly in `afterAll`
202
+ - (Component tests) Flush pending state updates and timers before asserting
203
+ - (Unit tests) Reset shared mutable state in `beforeEach`
204
+
205
+ 3. **Product fix** (when real users would hit the same issue):
206
+ - Handle stale or intermediate states safely
207
+ - Make routing/render logic robust to eventual consistency
208
+ - Add telemetry for ambiguous transitions
209
+
210
+ 4. **Both** if user impact exists _and_ tests are fragile.
211
+
212
+ ## Plan Output Format
213
+
214
+ Produce the plan with these fields:
215
+
216
+ - **Test ID:** if provided in prompt
217
+ - **Agent session ID:** your running session ID to resume if needed
218
+ - **Confidence:** score (1-5) with brief justification
219
+ - **Symptom:** what failed and where
220
+ - **Root cause:** concise technical explanation
221
+ - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
222
+ - **Proposed fix:** test harness, product, or both — with the specific file(s) and the change you would make
223
+ - **Sibling candidates:** files that appear to share the same anti-pattern, for the reviewer (or fix.md) to confirm. Or "N/A -- fix is test-specific" if the issue is one-off (see [`fix.md`](./fix.md) for what counts as a structural anti-pattern worth searching for).
224
+ - **Validation plan:** lint/typecheck commands and test commands to run after applying the fix
225
+ - **Open questions:** anything that needs human input before fixing
226
+ - **Residual risk:** what could still be flaky after the fix
@@ -4,59 +4,43 @@ description: Implement a plan file or direct request end-to-end, then hand off t
4
4
 
5
5
  # Go: Execute Work and Ship It
6
6
 
7
- Given a plan file or direct implementation request, implement the requested work, then invoke the `commit-push-pr` skill to commit, push, and open a PR.
7
+ Given a plan file or direct implementation request, implement the requested work, then invoke the `commit-push-pr` skill to create a PR.
8
8
 
9
9
  This skill is the bridge between intent and shipping. It does not re-plan or re-design unless the user asked for that. If the request is wrong or cannot be implemented safely, surface that; do not silently improvise.
10
10
 
11
- ## Phase 1: Understand the Input
11
+ ## Phase 1: Resolve the Input
12
12
 
13
- The user may invoke this skill with either a plan file path/identifier or a direct implementation request such as "add a button to the homepage to log in." Resolve the input in this order:
13
+ The user invokes this skill with either a plan file path/identifier or a direct request like "add a login button."
14
14
 
15
- 1. **Absolute path** (starts with `/`): read it directly only when it is inside the current workspace/repo root or the host's plans directory. If it is outside those roots, ask the user to confirm before reading it; if they do not confirm, stop and ask for an approved path.
16
- 2. **Relative path** (contains `/` or starts with `./`): resolve from the current working directory.
17
- 3. **Bare name** (no path separators, e.g. `my-plan` or `my-plan.md`): if the host exposes a plans directory, look there first. Use whatever directory the current host stores plan files in — do not hard-code a host-specific path. If no plan is found and the input is ambiguous, ask whether it is a plan identifier or an implementation request.
18
- 4. **Natural-language request** (for example, contains spaces or reads like an implementation task): treat it as the implementation request. There may be no separate plan file.
19
- 5. **No argument given**: ask the user to provide either the plan file path or the implementation request. Do not guess.
15
+ - **Path-like input** (absolute, relative, or bare name): resolve the path first; for bare names, check the host's plans directory. If the resolved path lies outside the workspace/repo or the host's plans directory, ask the user to confirm before reading it (applies to both absolute and relative inputs — `..` escapes count). If the input doesn't resolve to a readable file, stop and tell the user; do not reinterpret a missing path as a direct request.
16
+ - **Natural-language request**: treat as the implementation task. There may be no plan file.
17
+ - **No argument**: ask for either a plan path or the implementation request.
20
18
 
21
- If a path-like input does not exist or cannot be read, stop and tell the user. Do not reinterpret a missing path as a direct request.
19
+ When using a plan, read it fully before starting. Note **Critical files**, **Approach**, and **Verification** sections (or equivalents).
22
20
 
23
- When using a plan, read the entire plan before starting. Note the **Critical files**, **Approach**, and **Verification** sections (or their equivalents).
21
+ ## Phase 2: Implement
24
22
 
25
- ## Phase 2: Implement the Request
23
+ The plan or request is the source of truth for scope:
26
24
 
27
- Treat the plan or direct request as the source of truth for scope:
28
-
29
- Make exactly the changes requested: no more, no less. Do not add extra refactors, tests, or cleanup the request did not ask for.
30
- - If a plan references files, functions, or utilities that no longer exist, stop and surface the discrepancy before guessing.
31
- - If you discover the request is wrong (the codebase has moved, an assumption is invalid, a constraint was missed), stop and report. Do not silently rewrite the approach.
32
- - For direct implementation requests, inspect the relevant code before editing and use the repo's existing patterns.
33
- - Do **not** modify the plan file itself when working from a plan.
25
+ - Make exactly the changes requested: no extra refactors, tests, or cleanup.
26
+ - If the plan references files or utilities that no longer exist, stop and report.
27
+ - If an assumption is invalid or a constraint was missed, stop and report. Do not silently rewrite the approach.
28
+ - Do not modify the plan file itself.
34
29
 
35
30
  ## Phase 3: Validate
36
31
 
37
- Validation should follow the repo's instructions and the cost profile of the repo. Some repos intentionally leave slow checks to CI; do not run broad or slow suites unless the repo instructions explicitly require them.
38
-
39
- Run validation in this order:
40
-
41
- 1. **Repo instructions.** Look up the repo's standard validation guidance. Common signals, in priority order:
42
- - An explicit instruction in the repo's agent or contributor instructions, such as `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, or an equivalent host-specific instruction file (e.g. "MUST run before opening PRs"). This wins over everything else.
43
- - A relevant `package.json` script when the repo instructions point to it, such as `affected`, `precommit`, `validate`, `check`, `verify`, `test`, `lint`, `typecheck`, or repo-equivalent.
44
- - A repo-specific equivalent in the rare case the project does not use `package.json`.
45
-
46
- If the repo primarily relies on git pre-commit or pre-push hooks and does not mandate a manual validation command, do not invent one. Let the hooks run during the `commit-push-pr` handoff.
47
-
48
- 2. **Request-specific verification.** If the plan or user request names specific checks, run them when they are practical and consistent with the repo instructions. If a requested check is clearly a slow CI-only suite, ask before running it.
49
-
50
- If any check you run fails, fix the failures and re-run the same check. Do not hand off to `commit-push-pr` with known failing checks unless the user explicitly accepts a pre-existing or unrelated failure.
32
+ - Read `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, or equivalent contributor instructions for the mandated pre-PR command (e.g. `npm run affected`). That wins over everything.
33
+ - If the repo relies on pre-commit/pre-push hooks and mandates no manual command, don't invent one — let the hooks run during the `commit-push-pr` handoff.
34
+ - If the plan names specific checks, run them when practical. Ask before running anything that's clearly a slow CI-only suite.
51
35
 
52
- ## Phase 4: Hand Off to commit-push-pr
36
+ Fix any failures before handing off. Do not hand off with known failing checks unless the user explicitly accepts a pre-existing or unrelated failure.
53
37
 
54
- Once implementation is complete and Phase 3 validation is satisfied, invoke the `commit-push-pr` skill by name using this agent's normal skill invocation mechanism.
38
+ ## Phase 4: Hand Off
55
39
 
56
- Do **not** commit, push, or open the PR yourself in this skill. Let `commit-push-pr` own the shipping flow.
40
+ Invoke the `commit-push-pr` skill. Do not commit, push, or open the PR yourself.
57
41
 
58
- If `commit-push-pr` reports `nothing to ship.` (the work resulted in no actual file changes), surface that to the user — it usually means the request was already implemented or was a no-op.
42
+ If `commit-push-pr` reports `nothing to ship.`, surface that.
59
43
 
60
44
  ## Phase 5: Final Output
61
45
 
62
- After `commit-push-pr` returns, ensure the user sees one short final response with the branch name and the full PR URL it produced. If the current host already surfaced that response, do not duplicate it. If `commit-push-pr` reported nothing to ship, say so plainly.
46
+ Ensure the user sees the branch name and full PR URL.
@@ -14,12 +14,11 @@ Structure and write Linear bug reports from evidence that already exists in the
14
14
  ## Process
15
15
 
16
16
  1. **Gather context** — collect evidence from the conversation: investigation findings, user reports, Datadog links, error details
17
- 2. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the symptom description, error messages, and affected service/endpoint. If duplicates are found, present them to the user before proceeding.
18
- 3. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected — ask before drafting. NEVER invent answers. Up to 3 rounds.
19
- 4. **Draft** — title + description, structure scaled to complexity (see format below)
20
- 5. **Self-review** — check every Red Flag below before presenting
21
- 6. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
22
- 7. **Create in Linear** — only after explicit approval
17
+ 2. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected — ask before drafting. NEVER invent answers. Up to 3 rounds.
18
+ 3. **Draft** — title + description, structure scaled to complexity (see format below)
19
+ 4. **Self-review** — check every Red Flag below before presenting
20
+ 5. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
21
+ 6. **Create in Linear** — only after explicit approval
23
22
 
24
23
  ## Hard Rules
25
24
 
@@ -15,14 +15,13 @@ Draft Linear feature request tickets that describe what users need and why — n
15
15
  - **1-2 factual gaps** (missing repo, unclear who) → ask the user directly. Don't dispatch the full interview for a single missing data point.
16
16
  - **Structural problems** (solution-shaped framing, no problem articulated, mostly unknowns) → dispatch `interview-feature` skill. Receive a structured problem brief. Re-check gate against the brief.
17
17
  - If `interview-feature` terminates without producing a problem brief (user refused to articulate a problem), abort the ticket process. Inform the user that the ticket cannot be created without a problem statement.
18
- 3. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the problem's key terms, affected users, and domain area. If duplicates or closely related tickets are found, present them to the user and ask whether to proceed, merge with an existing ticket, or stop.
19
- 3. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check; it doesn't blindly trust upstream context.
20
- 5. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
21
- 6. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
22
- 7. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
23
- 8. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
24
- 9. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
25
- 10. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
18
+ 3. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check; it doesn't blindly trust upstream context.
19
+ 4. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
20
+ 5. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
21
+ 6. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
22
+ 7. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
23
+ 8. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
24
+ 9. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
26
25
 
27
26
  ## Final Validation Checklist
28
27
 
@@ -21,11 +21,10 @@ Draft Linear tech debt tickets that justify _why_ the debt matters — cost to c
21
21
  - Maintainability/DX → `git log` for change frequency and bug-fix commits, grep for workarounds
22
22
  - Security → check dependency versions, scan for vulnerability patterns
23
23
  5. **Assess interest & risk** — produce structured ratings with evidence (see reference.md for rating framework)
24
- 6. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the debt's key terms, component names, and domain area. If duplicates are found, present them to the user before proceeding.
25
- 7. **Draft** — title + description, structure scaled to complexity (see format below)
26
- 8. **Self-review** — check every Red Flag below before presenting
27
- 9. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
28
- 10. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
24
+ 6. **Draft** — title + description, structure scaled to complexity (see format below)
25
+ 7. **Self-review** — check every Red Flag below before presenting
26
+ 8. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
27
+ 9. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
29
28
 
30
29
  ## Hard Rules
31
30
 
@@ -1,101 +0,0 @@
1
- ---
2
- name: linear-duplicate-finder
3
- description: Use when checking if a Linear ticket already exists before creating one. Searches across teams, archived tickets, and multiple phrasings to find duplicates and related tickets. Dispatched by ticket-writing skills before creation.
4
- ---
5
-
6
- # Linear Duplicate Finder
7
-
8
- Search Linear for duplicate or related tickets before creating a new one. Cast a wide net across teams, statuses, and phrasings, then classify matches by similarity.
9
-
10
- ## Inputs
11
-
12
- 1. **A Linear ticket ID** (e.g., "ENG-1234") — fetch its details and search for duplicates.
13
- 2. **A ticket title + description** — search Linear directly for matches.
14
- 3. **Multiple ticket IDs** — cross-reference against each other and the backlog.
15
-
16
- ## Process
17
-
18
- ### Step 1: Understand the Source
19
-
20
- If given a ticket ID:
21
-
22
- - Fetch the ticket using `mcp__linear__get_issue` with `includeRelations: true` to see if duplicates are already marked.
23
- - Extract the title, description, labels, team, and project.
24
-
25
- If given a title/description:
26
-
27
- - Parse the key concepts, features, and domain terms.
28
-
29
- ### Step 2: Generate Search Queries
30
-
31
- Break the ticket down into multiple search angles:
32
-
33
- - **Exact title keywords**: Most distinctive terms from the title.
34
- - **Core concept**: The fundamental ask, using different phrasings.
35
- - **Domain/feature area**: The feature area or system component involved.
36
- - **Synonyms and alternative phrasings**: 2-3 alternative ways to describe the same thing.
37
-
38
- ### Step 3: Execute Searches
39
-
40
- Run multiple `mcp__linear__list_issues` searches in parallel using the `query` parameter with different search terms:
41
-
42
- - Use `limit: 50` to cast a wide net.
43
- - Include `includeArchived: true` to catch completed or cancelled tickets.
44
- - Filter by team when known, but also do at least one cross-team search.
45
-
46
- Also use `mcp__linear__query_data` with natural language queries for concept-based matching.
47
-
48
- Run at least 3-5 different searches with varied query terms.
49
-
50
- ### Step 4: Analyze and Score Results
51
-
52
- | Dimension | Weight | Description |
53
- | ----------------------- | ------ | ------------------------------------------------------ |
54
- | **Title similarity** | High | Do the titles describe the same thing? |
55
- | **Description overlap** | High | Do the descriptions reference the same problem? |
56
- | **Same feature area** | Medium | Are they about the same system/feature? |
57
- | **Same team/project** | Low | Same team increases likelihood but isn't required. |
58
- | **Status** | Info | Cancelled/completed duplicates are still worth noting. |
59
-
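The weighting table above can be sketched numerically (the High/Medium/Low-to-number mapping is an illustrative assumption; Status is informational only, so it is not scored):

```python
WEIGHTS = {  # assumed numeric stand-ins for High/High/Medium/Low
    "title_similarity": 3.0,
    "description_overlap": 3.0,
    "same_feature_area": 2.0,
    "same_team_or_project": 1.0,
}

def similarity_score(ratings: dict[str, float]) -> float:
    """Combine per-dimension ratings (each 0.0-1.0) into a single
    weighted score, normalized back to the 0.0-1.0 range."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * ratings.get(d, 0.0) for d in WEIGHTS) / total

score = similarity_score({
    "title_similarity": 0.9,
    "description_overlap": 0.8,
    "same_feature_area": 1.0,
    "same_team_or_project": 0.0,
})
```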
60
- ### Step 5: Classify Matches
61
-
62
- - **Duplicate**: Exact same work. Creating both = redundant effort.
63
- - **Closely Related**: Overlapping scope — completing one partially addresses the other. Should cross-reference.
64
- - **Same Area**: Same domain but different aspects. Useful context, not duplicates.
65
-
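Given a combined similarity score, the three classes above could be bucketed like this (the cutoffs are illustrative assumptions, not part of the skill):

```python
def classify(score: float) -> str:
    """Bucket a 0.0-1.0 similarity score into the three match classes."""
    if score >= 0.8:
        return "Duplicate"        # exact same work
    if score >= 0.5:
        return "Closely Related"  # overlapping scope; cross-reference
    return "Same Area"            # useful context, not a duplicate
```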
66
- ### Step 6: Present Results
67
-
68
- ```text
69
- ## Duplicate Detection Results
70
-
71
- ### Source
72
- **[ID] Title** or **Potential ticket**: "description"
73
-
74
- ### Duplicates Found
75
- 1. **[TEAM-123] Title** — Status: In Progress
76
- - **Why**: [specific overlap]
77
- - **Key difference**: [if any]
78
-
79
- ### Closely Related
80
- 1. **[TEAM-456] Title** — Status: Backlog
81
- - **Overlap**: [what's shared]
82
- - **Difference**: [what's distinct]
83
-
84
- ### Same Area (Context)
85
- 1. **[TEAM-789] Title** — Status: Done
86
- - **Relevance**: [why worth noting]
87
-
88
- ### Recommendation
89
- [Proceed, merge with existing, or add context to related ticket?]
90
- ```
91
-
92
- If no duplicates are found, say so clearly and recommend proceeding.
93
-
94
- ## Guidelines
95
-
96
- - **Wide net, then narrow.** Better to surface a false positive than miss a real duplicate.
97
- - **Search across teams.** Duplicates often live on different teams.
98
- - **Check archived/cancelled tickets.** May contain valuable context about why work was previously rejected.
99
- - **Look at different time ranges.** Duplicates can be months old.
100
- - **Be specific.** Don't say "similar title" — explain exactly what overlaps and differs.
101
- - **When in doubt, include it.** A false positive is cheap; a missed duplicate wastes engineering effort.