@clipboard-health/ai-rules 2.22.3 → 2.22.4

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@clipboard-health/ai-rules",
3
- "version": "2.22.3",
3
+ "version": "2.22.4",
4
4
  "description": "Pre-built AI agent rules for consistent coding standards.",
5
5
  "keywords": [
6
6
  "ai",
@@ -9,14 +9,29 @@ Phases run in order. Skip a phase if you already have the information it produce
9
9
 
10
10
  This skill runs in one of two modes:
11
11
 
12
- - **Fix mode (default):** produce a plan, then apply it.
12
+ - **Fix mode (default for local/unit-sized fixes):** produce a plan, then apply it.
13
13
  - **Plan mode:** produce a plan and stop, for human review.
14
14
 
15
- Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it". Otherwise default to fix mode. Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
15
+ Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it".
16
16
 
17
- ## Phase 1: Classify Test Type
17
+ For CI-sourced E2E flakes, prefer plan mode unless the user explicitly asks you to implement a fix or the root cause is already clear, high-confidence, and local to the repository. E2E flakes often originate in CI setup, auth/test-data infrastructure, backend behavior, deployment assets, or product code; avoid editing the test just because that is where the failure surfaced.
18
18
 
19
- Determine the test type from the user's input before doing anything else. The type dictates the investigation path.
19
+ Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
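The mode-selection rules above can be read as a small decision function. The following is only a sketch of one reading of them: the `ModeInputs` field names are hypothetical labels for conditions the text states in prose, and they are not part of the skill itself.

```typescript
type Mode = "fix" | "plan";

// Hypothetical field names; each one stands for a condition described in prose above.
interface ModeInputs {
  userAskedForPlanOnly: boolean; // plan, investigation, triage report, "just plan it"
  userAskedToImplementFix: boolean;
  isCiSourcedE2eFlake: boolean;
  rootCauseClearConfidentAndLocal: boolean;
}

function chooseMode(input: ModeInputs): Mode {
  // An explicit request for a plan always selects plan mode.
  if (input.userAskedForPlanOnly) return "plan";

  // CI-sourced E2E flakes default to plan mode unless one of the two exceptions applies.
  if (input.isCiSourcedE2eFlake) {
    const mayFix = input.userAskedToImplementFix || input.rootCauseClearConfidentAndLocal;
    return mayFix ? "fix" : "plan";
  }

  // Local/unit-sized fixes keep fix mode as the default.
  return "fix";
}
```

Under this reading, a CI-sourced E2E flake with no explicit fix request and an unclear root cause resolves to plan mode.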
20
+
21
+ ## Phase 1: Classify Failure Surface and Test Type
22
+
23
+ For E2E flakes, first classify where the failure surfaced in the lifecycle, then identify the test type. The failure surface dictates how broadly to investigate before reading or editing the test.
24
+
25
+ Common E2E failure surfaces:
26
+
27
+ - **CI/job setup:** dependency installation, CLI/tooling, environment setup, build/deploy, artifact download.
28
+ - **Test setup/auth/data:** token minting, login bootstrap, seeded users/entities, one-time credentials, external service setup.
29
+ - **App bootstrap/navigation:** static assets, route load, hydration, browser console/page errors before the user action.
30
+ - **User action:** click/input completed but the expected request, dialog, route change, or state transition did not start.
31
+ - **Backend request:** request emitted; backend returned error, stale data, unexpected shape, or excessive latency.
32
+ - **Assertion/locator:** app state is correct, but the assertion/selector is brittle or out of sync with the intended UX.
33
+
34
+ Then determine the test type from the user's input. The type dictates the detailed investigation path.
20
35
 
21
36
  | Type | Signals |
22
37
  | -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -6,10 +6,18 @@ Apply phase of the flaky-test-debugger skill. Takes a plan produced by [`plan.md
6
6
 
7
7
  Confirm the plan from `plan.md` has confidence ≥ 3. If confidence is 1-2, do not apply -- return to `plan.md` and gather more evidence first.
8
8
 
9
+ Before editing, verify the plan is still current:
10
+
11
+ - The failing commit's code path still exists on current `main`, or the plan has been adjusted for the current code.
12
+ - The proposed fix targets the diagnosed failure surface, not only the final assertion.
13
+ - Any retry/wait change is safe and idempotent; it must not repeat one-time credentials, duplicate writes, destructive actions, or rate-limited setup calls.
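As an illustration, the confidence threshold and the three pre-edit checks above could be collected into one gate. This is a sketch under assumptions: `PlanCheck` and its fields are hypothetical names, and the real plan is a Markdown document rather than a typed object.

```typescript
// Hypothetical shape; the plan itself is prose, not a typed object.
interface PlanCheck {
  confidence: number; // 1-5, carried over from plan.md
  pathExistsOnMainOrPlanAdjusted: boolean;
  fixTargetsDiagnosedSurface: boolean;
  changesRetryOrWait: boolean;
  retryChangeIsSafeAndIdempotent: boolean; // only consulted when changesRetryOrWait is true
}

// Returns the reasons not to apply yet; an empty list means editing can start.
function blockersBeforeApply(plan: PlanCheck): string[] {
  const blockers: string[] = [];
  if (plan.confidence < 3) {
    blockers.push("confidence below 3: return to plan.md and gather evidence");
  }
  if (!plan.pathExistsOnMainOrPlanAdjusted) {
    blockers.push("failing code path is gone from main and the plan was not adjusted");
  }
  if (!plan.fixTargetsDiagnosedSurface) {
    blockers.push("fix targets only the final assertion, not the diagnosed surface");
  }
  if (plan.changesRetryOrWait && !plan.retryChangeIsSafeAndIdempotent) {
    blockers.push("retry/wait change is not known to be safe and idempotent");
  }
  return blockers;
}
```

Returning a list of reasons rather than a boolean keeps the output usable as the explanation for going back to the plan.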
14
+
9
15
  ## Apply the Proposed Fix
10
16
 
11
17
  Edit the files listed in the plan's **Proposed fix** field. Keep the change minimal -- the plan already chose between test harness, product, and both.
12
18
 
19
+ Do not convert an infrastructure, backend, auth/data, or product-state root cause into a frontend timeout or locator retry. If the plan's evidence no longer supports the proposed fix, stop and revise the plan.
20
+
13
21
  ## Fix Sibling Instances
14
22
 
15
23
  After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
@@ -56,6 +64,8 @@ When documenting the fix in a PR or issue, use this structure. Carry **Confidenc
56
64
  - **Test ID:** if provided in prompt
57
65
  - **Agent session ID:** your running session ID to resume if needed
58
66
  - **Confidence:** score (1-5) with brief justification
67
+ - **Failure surface:** where the failure first surfaced and why the fix belongs there
68
+ - **Current main status:** whether the failure path still existed when the fix was made
59
69
  - **Symptom:** what failed and where
60
70
  - **Root cause:** concise technical explanation
61
71
  - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
@@ -62,6 +62,8 @@ Capture these details first so the investigation is reproducible. If the user ha
62
62
 
63
63
  - Failing test file and name
64
64
  - GitHub Actions run URL to fetch the LLM report
65
+ - Branch, commit, shard, timestamp, and issue/ticket link when available
66
+ - Whether the failure is one test, a retry-only flake, or a burst across many tests/shards
65
67
 
66
68
  ### Fetch the LLM Report
67
69
 
@@ -73,6 +75,33 @@ bash scripts/fetch-llm-report.sh "<github-actions-url>"
73
75
 
74
76
  This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
75
77
 
78
+ ### Cluster Before Per-Test Debugging
79
+
80
+ Before reading the failing test as if it is isolated, check whether the failure belongs to a broader pattern:
81
+
82
+ 1. **Within the report:** group failures by first error line, stack frame, setup helper, console error, and failing lifecycle stage.
83
+ 2. **Across attempts:** compare failed and passed retries for the same test; note the first divergence.
84
+ 3. **Across CI context:** if issue tracker, GitHub Actions logs, CI logs, or test analytics are available, search the exact error and stack snippet across nearby runs, shards, commits, and runners.
85
+ 4. **Across code history:** check whether the failing commit is still representative of current `main`; recent migrations can make an implementation plan stale even when the old failure was real.
86
+
87
+ If many tests fail in the same setup helper, dependency install step, auth/token path, or external service call, diagnose the shared mechanism first. Do not create per-test fixes for a shared failure.
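A minimal sketch of the "within the report" grouping step, keyed on the first error line only. `FailureRecord` is a hypothetical, simplified shape and does not mirror the real `llm-report.json` schema; mapping report fields into it is left to the reader.

```typescript
// Hypothetical simplified record; not the actual llm-report.json schema.
interface FailureRecord {
  testName: string;
  errorMessage: string;
}

// Groups test names by the first line of their error message.
function clusterByFirstErrorLine(failures: FailureRecord[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const failure of failures) {
    const firstLine = (failure.errorMessage.split("\n")[0] ?? "").trim();
    const names = clusters.get(firstLine) ?? [];
    names.push(failure.testName);
    clusters.set(firstLine, names);
  }
  return clusters;
}

// Clusters containing more than one test hint at a shared mechanism.
function sharedClusters(failures: FailureRecord[]): [string, string[]][] {
  return Array.from(clusterByFirstErrorLine(failures)).filter(([, names]) => names.length > 1);
}
```

Any cluster returned by `sharedClusters` is a candidate for diagnosing the shared mechanism first, before any per-test fix.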
88
+
89
+ ### Classify the E2E Failure Surface
90
+
91
+ Use the earliest observed failure surface to decide where to investigate first:
92
+
93
+ | Surface | Primary signal | First places to look |
94
+ | ------------------------ | ----------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
95
+ | **CI/job setup** | Failure before browser/user-flow code; install/build/config/tooling errors | GitHub Actions logs, workflow YAML, dependency versions, cache keys, recent tooling releases |
96
+ | **Test setup/auth/data** | Failure in fixtures, token minting, login, seed data, or one-time credentials | Setup helpers, external identity/data services, idempotency of retries, CI logs, service logs |
97
+ | **App bootstrap** | Blank page, static asset errors, hydration errors, route never becomes ready | Browser console, static asset/network entries, deployment version, route bootstrap helpers |
98
+ | **User action no-op** | Click/input completes but expected request/dialog/route/state never starts | Playwright trace, console/page errors, React error boundaries, remounts, permissions/session state |
99
+ | **Backend request** | Expected request is emitted and fails, stalls, or returns unexpected data | Request/response body, timings, trace/log correlation, backend code, data freshness |
100
+ | **Post-success render** | Request succeeds with expected data but UI does not reflect it | Client cache invalidation, state updates, rendering conditions, console errors |
101
+ | **Assertion/locator** | App state is correct but selector/assertion no longer matches intended UX | Test selector, accessible names, UX drift, deterministic app-ready signal |
102
+
103
+ This classification can change as evidence improves. State the final surface in the plan.
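One way to pick the "earliest observed" surface is to treat the table rows as an ordered lifecycle. The sketch below assumes that the top-to-bottom row order reflects lifecycle order, which the table suggests but does not state; the function name is hypothetical.

```typescript
// Assumption: the table's row order is the lifecycle order, earliest first.
const SURFACE_ORDER = [
  "CI/job setup",
  "Test setup/auth/data",
  "App bootstrap",
  "User action no-op",
  "Backend request",
  "Post-success render",
  "Assertion/locator",
] as const;

type Surface = (typeof SURFACE_ORDER)[number];

// Returns the earliest lifecycle surface among those that showed failure signals.
function earliestSurface(observed: Surface[]): Surface | undefined {
  for (const surface of SURFACE_ORDER) {
    if (observed.indexOf(surface) !== -1) return surface;
  }
  return undefined;
}
```

For example, a run showing both an assertion failure and a token-minting error would start the investigation at "Test setup/auth/data".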
104
+
76
105
  ## Quick Classification (E2E)
77
106
 
78
107
  For the full report schema, field reference, caps, and example reports:
@@ -88,16 +117,19 @@ Read the docs if you need field semantics or limits; otherwise the field names u
88
117
 
89
118
  Classify the flake to narrow the search space:
90
119
 
91
- | Category | Signal | Timeline Pattern |
92
- | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
93
- | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
94
- | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
95
- | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
96
- | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
97
- | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
98
- | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
99
- | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
100
- | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
120
+ | Category | Signal | Timeline Pattern |
121
+ | ---------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
122
+ | **Shared setup/tooling** | Many tests fail before user-flow assertions in the same helper or CI step | `job/setup` or `beforeEach` fails across shards/runners with the same stack or command error |
123
+ | **Auth/data setup** | Token mint, login, seeded user/entity, or external test setup fails | `beforeEach/setup` service/CLI error before page actions |
124
+ | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
125
+ | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
126
+ | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
127
+ | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
128
+ | **Expected request missing** | User action succeeds but the expected network request never appears | `step(click/input)` completes → no matching `network(...)` → assertion/wait fails |
129
+ | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
130
+ | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
131
+ | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
132
+ | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
101
133
 
102
134
  ## Analyze LLM Report
103
135
 
@@ -148,14 +180,30 @@ For each attempt, compare `status`, `durationMs`, and `error` across retries —
148
180
 
149
181
  **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
150
182
 
183
+ If the LLM report does not show whether a click resolved, an element detached, a dialog appeared/disappeared, or a request was never emitted, download and inspect the full Playwright HTML report/trace artifact when available. The LLM report is an index, not the ceiling for evidence.
184
+
151
185
  ### Inspect network activity and extract trace IDs
152
186
 
153
187
  Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
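The scan described above might look like the following. This is a sketch only: `groupId` and `traceId` are named in the text, but placing `status` and `offsetMs` directly on each instance, and the `windowMs` tolerance, are assumptions to check against the report schema docs.

```typescript
// Partial, assumed shape of one entry in attempts[].network.instances[].
interface NetworkInstance {
  groupId: string | number;
  status: number;   // assumed field name for the HTTP status
  offsetMs: number; // assumed field name for time since attempt start
  traceId?: string;
}

// Returns 4xx/5xx instances within windowMs of the failure offset.
function failingInstancesNear(
  instances: NetworkInstance[],
  failureOffsetMs: number,
  windowMs: number,
): NetworkInstance[] {
  return instances.filter(
    (instance) =>
      instance.status >= 400 &&
      Math.abs(instance.offsetMs - failureOffsetMs) <= windowMs,
  );
}

// Collects the trace IDs worth correlating with backend telemetry.
function traceIdsToFollow(instances: NetworkInstance[]): string[] {
  return instances
    .map((instance) => instance.traceId)
    .filter((id): id is string => typeof id === "string");
}
```

An empty result from `failingInstancesNear` is itself a finding: it points away from a failed request and toward a missing or successful one.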
154
188
 
155
189
  **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`./datadog-apm-traces.md`](./datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
156
190
 
191
+ If no expected request was emitted, say that explicitly and do not diagnose backend latency for that action. Instead, use the trace, console/page errors, session/permission state, and frontend code path to explain why the request never started.
192
+
157
193
  Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
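The saturation check can be expressed as a small predicate. The four counter names come from the paragraph above; the surrounding `NetworkSummary` interface is a partial, assumed shape, and the function name is hypothetical.

```typescript
// Partial, assumed shape of attempts[].network.summary; counter names are from the text.
interface NetworkSummary {
  instancesDroppedByGroupCap: number;
  instancesDroppedByInstanceCap: number;
  instancesEvictedAfterAdmission: number;
  instancesDroppedByFilter: number; // expected to be non-zero; static assets are filtered by design
}

// True when the retained network content is only a sample, which should reduce confidence.
function isNetworkEvidenceSampled(summary: NetworkSummary): boolean {
  return (
    summary.instancesDroppedByGroupCap > 0 ||
    summary.instancesDroppedByInstanceCap > 0 ||
    summary.instancesEvictedAfterAdmission > 0
  );
}
```

`instancesDroppedByFilter` is deliberately ignored, matching the note that filter drops alone are expected.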
158
194
 
195
+ ### Use Available Observability Beyond APM
196
+
197
+ APM traces are valuable when an application request exists, but many E2E flakes fail before that. Use the observability surface that matches the failure surface:
198
+
199
+ - **CI/job setup:** GitHub Actions logs, cache keys, installed tool versions, runner/shard distribution.
200
+ - **Auth/data setup:** service logs, identity-provider/API audit logs, rate-limit/throttle metrics, setup helper output.
201
+ - **App bootstrap:** browser console, static asset responses, deployment/version metadata, CDN or webapp logs.
202
+ - **User action no-op:** Playwright trace actions, console/page errors, RUM/session events when available.
203
+ - **Backend request emitted:** APM traces, backend logs, request/response bodies, data store/cache traces.
204
+
205
+ Search exact error strings across a relevant time window and commit/shard context. Prefer primary telemetry over inference from test code when both are available.
206
+
159
207
  ### Review test steps
160
208
 
161
209
  Prefer the timeline view above which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
@@ -165,10 +213,14 @@ Prefer the timeline view above which interleaves steps with network and console.
165
213
  Do not propose a fix without concrete artifacts. At minimum, include:
166
214
 
167
215
  - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
168
- - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
216
+ - One **network or lifecycle artifact**:
217
+ - If a request was emitted: an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
218
+ - If no request was emitted: the step/trace evidence showing the triggering action completed and the expected request/dialog/route transition never started
219
+ - If failure happened before the app action: CI/setup/auth log evidence showing the failing command/service call
169
220
  - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
170
221
  - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
171
222
  - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
223
+ - When relevant: **logs/RUM/CI evidence** that confirms whether the issue is app, backend, infra, auth/test-data, or CI tooling
172
224
  - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
173
225
 
174
226
  If confidence is less than 5/5, identify the missing evidence and propose concrete frontend and/or backend observability changes that would make the next occurrence diagnosable at 5/5 confidence. These changes may span multiple repositories.
@@ -193,25 +245,54 @@ Rate your confidence in the root cause on a 1-5 scale. Report this score alongsi
193
245
 
194
246
  Applies to all test types.
195
247
 
196
- Choose one of these approaches in priority order:
248
+ Choose the fix locus from the evidence, not from where the assertion failed. A flaky E2E test can be exposing a CI dependency issue, auth/test-data service issue, backend bug, deployment problem, product state bug, or test harness bug.
249
+
250
+ Use this decision order:
251
+
252
+ 1. **Shared setup or CI fix** when many tests fail before user-flow assertions or all failures share a tool, cache, install, auth, seed-data, or fixture path.
253
+ 2. **Backend/service/data fix** when the expected request is emitted and backend telemetry or response bodies show errors, throttling, stale data, inconsistent state, or unexpected latency.
254
+ 3. **Product fix** when real users can hit the same unsafe intermediate state, render error, permission/session race, stale cache, or missing error handling.
255
+ 4. **Test data or harness fix** when the scenario is not user-realistic, the test setup is semantically wrong, or the test needs a deterministic app-ready signal.
256
+ 5. **Assertion/locator fix** only when the app state is correct and the selector/assertion is the only broken part.
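The decision order above is a first-match priority list, which can be sketched as follows. The `FixEvidence` flags are hypothetical summaries of the evidence each step asks for; deciding whether a flag is true remains a judgment call that this sketch does not automate.

```typescript
type FixLocus =
  | "shared-setup-or-ci"
  | "backend-service-data"
  | "product"
  | "test-data-or-harness"
  | "assertion-locator"
  | "undetermined";

// Hypothetical flags, one per step in the decision order.
interface FixEvidence {
  manyTestsShareSetupFailure: boolean;
  backendTelemetryShowsFault: boolean;
  realUsersCanHitIt: boolean;
  scenarioUnrealisticOrSetupWrong: boolean;
  onlySelectorOrAssertionBroken: boolean;
}

// Returns the first locus whose evidence is present, in the listed priority order.
function chooseFixLocus(evidence: FixEvidence): FixLocus {
  if (evidence.manyTestsShareSetupFailure) return "shared-setup-or-ci";
  if (evidence.backendTelemetryShowsFault) return "backend-service-data";
  if (evidence.realUsersCanHitIt) return "product";
  if (evidence.scenarioUnrealisticOrSetupWrong) return "test-data-or-harness";
  if (evidence.onlySelectorOrAssertionBroken) return "assertion-locator";
  return "undetermined"; // no step matched: gather more evidence before proposing a fix
}
```

Placing the assertion/locator branch last mirrors the "only when" wording in step 5.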
257
+
258
+ Before proposing any retry, timeout, or wait change, pass the idempotency check:
259
+
260
+ - Is the retried operation safe to repeat?
261
+ - Does retrying preserve the same test scenario?
262
+ - Could retrying amplify the root cause, such as rate limits, one-time credentials, duplicate writes, or destructive mutations?
263
+ - Is there a deterministic signal to wait on instead of a longer timeout?
264
+
265
+ If the answer is no or unclear, do not add a retry/wait as the fix. Propose a root-cause fix or instrumentation instead.
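The idempotency check can be sketched as a gate over the four questions. This encodes one interpretation, labeled here as an assumption: the first two questions must be answered "yes", the amplification question must be answered "no", and an available deterministic signal is preferred over a retry or longer timeout. All type and function names are hypothetical.

```typescript
type Answer = "yes" | "no" | "unclear";

// One field per question in the checklist above.
interface IdempotencyCheck {
  safeToRepeat: Answer;
  preservesScenario: Answer;
  couldAmplifyRootCause: Answer;
  deterministicSignalAvailable: Answer;
}

type RetryDecision =
  | "allow-retry-or-wait"
  | "wait-on-deterministic-signal"
  | "fix-root-cause-or-instrument";

function decideRetry(check: IdempotencyCheck): RetryDecision {
  // Any unfavorable or unclear answer on the safety questions rules out a retry/wait fix.
  const unsafe =
    check.safeToRepeat !== "yes" ||
    check.preservesScenario !== "yes" ||
    check.couldAmplifyRootCause !== "no";
  if (unsafe) return "fix-root-cause-or-instrument";

  // A deterministic signal is preferred over a longer timeout.
  if (check.deterministicSignalAvailable === "yes") return "wait-on-deterministic-signal";
  if (check.deterministicSignalAvailable === "unclear") return "fix-root-cause-or-instrument";
  return "allow-retry-or-wait";
}
```

Treating "unclear" the same as an unfavorable answer keeps the gate conservative.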
266
+
267
+ Common valid fix types:
268
+
269
+ - **Shared setup / CI**
270
+ - Pin or lock runtime tools and dependencies
271
+ - Fail fast on incompatible tool contracts
272
+ - Remove mutable global state from CI setup
273
+ - Add setup-level diagnostics before sharding
197
274
 
198
- 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
275
+ - **Test harness / data** (when the failure is non-product):
276
+ - Reset cookies, storage, and session between retries
277
+ - Isolate test data; generate stronger unique identities
278
+ - Make retry blocks idempotent
279
+ - Wait on deterministic app signals, not arbitrary sleeps
280
+ - (Service tests) Close connections and app properly in `afterAll`
281
+ - (Component tests) Flush pending state updates and timers before asserting
282
+ - (Unit tests) Reset shared mutable state in `beforeEach`
199
283
 
200
- 2. **Test harness fix** (when the failure is non-product):
201
- - Reset cookies, storage, and session between retries
202
- - Isolate test data; generate stronger unique identities
203
- - Make retry blocks idempotent
204
- - Wait on deterministic app signals, not arbitrary sleeps
205
- - (Service tests) Close connections and app properly in `afterAll`
206
- - (Component tests) Flush pending state updates and timers before asserting
207
- - (Unit tests) Reset shared mutable state in `beforeEach`
284
+ - **Product** (when real users would hit the same issue):
285
+ - Handle stale or intermediate states safely
286
+ - Make routing/render logic robust to eventual consistency
287
+ - Add telemetry for ambiguous transitions
208
288
 
209
- 3. **Product fix** (when real users would hit the same issue):
210
- - Handle stale or intermediate states safely
211
- - Make routing/render logic robust to eventual consistency
212
- - Add telemetry for ambiguous transitions
289
+ - **Backend/service**
290
+ - Remove avoidable shared mutable writes from hot paths
291
+ - Make setup operations idempotent or explicitly rate limited
292
+ - Fix stale reads, cache invalidation, and eventual-consistency assumptions
293
+ - Add trace/log correlation for ambiguous failures
213
294
 
214
- 4. **Both** if user impact exists _and_ tests are fragile.
295
+ Choose **both** if user impact exists _and_ tests are fragile.
215
296
 
216
297
  ## Plan Output Format
217
298
 
@@ -220,6 +301,8 @@ Produce the plan with these fields:
220
301
  - **Test ID:** if provided in prompt
221
302
  - **Agent session ID:** your running session ID to resume if needed
222
303
  - **Confidence:** score (1-5) with brief justification
304
+ - **Failure surface:** CI/job setup, test setup/auth/data, app bootstrap, user action no-op, backend request, post-success render, assertion/locator, or mixed
305
+ - **Current main status:** whether the failing commit's code path still exists on current `main`, has already been fixed, or has changed enough that the plan must be adjusted
223
306
  - **Symptom:** what failed and where
224
307
  - **Root cause:** concise technical explanation
225
308
  - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)