npm - @clipboard-health/ai-rules - Versions diffs - 2.22.3 → 2.22.4 - Mend

@clipboard-health/ai-rules 2.22.3 → 2.22.4

Files changed (4)

package/package.json +1 -1
package/skills/flaky-test-debugger/SKILL.md +19 -4
package/skills/flaky-test-debugger/references/fix.md +10 -0
package/skills/flaky-test-debugger/references/plan.md +109 -26

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@clipboard-health/ai-rules",
-  "version": "2.22.3",
+  "version": "2.22.4",
   "description": "Pre-built AI agent rules for consistent coding standards.",
   "keywords": [
     "ai",

package/skills/flaky-test-debugger/SKILL.md CHANGED

@@ -9,14 +9,29 @@ Phases run in order. Skip a phase if you already have the information it produce
 This skill runs in one of two modes:
-- **Fix mode (default):** produce a plan, then apply it.
+- **Fix mode (default for local/unit-sized fixes):** produce a plan, then apply it.
 - **Plan mode:** produce a plan and stop, for human review.
-Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it". Otherwise default to fix mode. Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
+Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it".
-## Phase 1: Classify Test Type
+For CI-sourced E2E flakes, prefer plan mode unless the user explicitly asks you to implement a fix or the root cause is already clear, high-confidence, and local to the repository. E2E flakes often originate in CI setup, auth/test-data infrastructure, backend behavior, deployment assets, or product code; avoid editing the test just because that is where the failure surfaced.
-Determine the test type from the user's input before doing anything else. The type dictates the investigation path.
+Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
+## Phase 1: Classify Failure Surface and Test Type
+For E2E flakes, first classify where the failure surfaced in the lifecycle, then identify the test type. The failure surface dictates how broadly to investigate before reading or editing the test.
+Common E2E failure surfaces:
+- **CI/job setup:** dependency installation, CLI/tooling, environment setup, build/deploy, artifact download.
+- **Test setup/auth/data:** token minting, login bootstrap, seeded users/entities, one-time credentials, external service setup.
+- **App bootstrap/navigation:** static assets, route load, hydration, browser console/page errors before the user action.
+- **User action:** click/input completed but the expected request, dialog, route change, or state transition did not start.
+- **Backend request:** request emitted; backend returned error, stale data, unexpected shape, or excessive latency.
+- **Assertion/locator:** app state is correct, but the assertion/selector is brittle or out of sync with the intended UX.
+Then determine the test type from the user's input. The type dictates the detailed investigation path.
 | Type                             | Signals                                                                                                                                                              |
 | -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |

package/skills/flaky-test-debugger/references/fix.md CHANGED

@@ -6,10 +6,18 @@ Apply phase of the flaky-test-debugger skill. Takes a plan produced by [`plan.md
 Confirm the plan from `plan.md` has confidence ≥ 3. If confidence is 1-2, do not apply -- return to `plan.md` and gather more evidence first.
+Before editing, verify the plan is still current:
+- The failing commit's code path still exists on current `main`, or the plan has been adjusted for the current code.
+- The proposed fix targets the diagnosed failure surface, not only the final assertion.
+- Any retry/wait change is safe and idempotent; it must not repeat one-time credentials, duplicate writes, destructive actions, or rate-limited setup calls.
 ## Apply the Proposed Fix
 Edit the files listed in the plan's **Proposed fix** field. Keep the change minimal -- the plan already chose between test harness, product, and both.
+Do not convert an infrastructure, backend, auth/data, or product-state root cause into a frontend timeout or locator retry. If the plan's evidence no longer supports the proposed fix, stop and revise the plan.
 ## Fix Sibling Instances
 After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
@@ -56,6 +64,8 @@ When documenting the fix in a PR or issue, use this structure. Carry **Confidenc
 - **Test ID:** if provided in prompt
 - **Agent session ID:** your running session ID to resume if needed
 - **Confidence:** score (1-5) with brief justification
+- **Failure surface:** where the failure first surfaced and why the fix belongs there
+- **Current main status:** whether the failure path still existed when the fix was made
 - **Symptom:** what failed and where
 - **Root cause:** concise technical explanation
 - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
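
Reviewer note: the pre-edit checks added to `fix.md` above (confidence of at least 3, plan still current against `main`, fix aimed at the diagnosed failure surface, retries idempotent) can be read as a simple gate. The sketch below is only an illustration of that reading; the `PlanCheck` class and all of its field names are hypothetical and are not part of the package.

```python
from dataclasses import dataclass


@dataclass
class PlanCheck:
    """Hypothetical summary of a plan; field names are illustrative only."""
    confidence: int                # 1-5 score carried over from plan.md
    path_exists_on_main: bool      # failing code path still present on current main
    plan_adjusted_for_main: bool   # or the plan was updated for the current code
    targets_failure_surface: bool  # fix addresses the diagnosed surface, not only the assertion
    adds_retry_or_wait: bool       # proposed fix includes a retry/wait change
    retry_is_idempotent: bool      # the retried operation is safe to repeat


def blockers(plan: PlanCheck) -> list[str]:
    """Return reasons the plan should not be applied yet (empty list means proceed)."""
    reasons = []
    if plan.confidence < 3:
        reasons.append("confidence below 3: return to plan.md and gather evidence")
    if not (plan.path_exists_on_main or plan.plan_adjusted_for_main):
        reasons.append("plan is stale relative to current main")
    if not plan.targets_failure_surface:
        reasons.append("fix targets only the final assertion")
    if plan.adds_retry_or_wait and not plan.retry_is_idempotent:
        reasons.append("retry/wait change is not idempotent")
    return reasons
```

Returning the list of reasons, rather than a single boolean, keeps the "stop and revise the plan" outcome explainable in the PR or issue write-up.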

package/skills/flaky-test-debugger/references/plan.md CHANGED

@@ -62,6 +62,8 @@ Capture these details first so the investigation is reproducible. If the user ha
 - Failing test file and name
 - GitHub Actions run URL to fetch the LLM report
+- Branch, commit, shard, timestamp, and issue/ticket link when available
+- Whether the failure is one test, a retry-only flake, or a burst across many tests/shards
 ### Fetch the LLM Report
@@ -73,6 +75,33 @@ bash scripts/fetch-llm-report.sh "<github-actions-url>"
 This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
+### Cluster Before Per-Test Debugging
+Before reading the failing test as if it is isolated, check whether the failure belongs to a broader pattern:
+1. **Within the report:** group failures by first error line, stack frame, setup helper, console error, and failing lifecycle stage.
+2. **Across attempts:** compare failed and passed retries for the same test; note the first divergence.
+3. **Across CI context:** if issue tracker, GitHub Actions logs, CI logs, or test analytics are available, search the exact error and stack snippet across nearby runs, shards, commits, and runners.
+4. **Across code history:** check whether the failing commit is still representative of current `main`; recent migrations can make an implementation plan stale even when the old failure was real.
+If many tests fail in the same setup helper, dependency install step, auth/token path, or external service call, diagnose the shared mechanism first. Do not create per-test fixes for a shared failure.
+### Classify the E2E Failure Surface
+Use the earliest observed failure surface to decide where to investigate first:
+| Surface                  | Primary signal                                                                | First places to look                                                                               |
+| ------------------------ | ----------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
+| **CI/job setup**         | Failure before browser/user-flow code; install/build/config/tooling errors    | GitHub Actions logs, workflow YAML, dependency versions, cache keys, recent tooling releases       |
+| **Test setup/auth/data** | Failure in fixtures, token minting, login, seed data, or one-time credentials | Setup helpers, external identity/data services, idempotency of retries, CI logs, service logs      |
+| **App bootstrap**        | Blank page, static asset errors, hydration errors, route never becomes ready  | Browser console, static asset/network entries, deployment version, route bootstrap helpers         |
+| **User action no-op**    | Click/input completes but expected request/dialog/route/state never starts    | Playwright trace, console/page errors, React error boundaries, remounts, permissions/session state |
+| **Backend request**      | Expected request is emitted and fails, stalls, or returns unexpected data     | Request/response body, timings, trace/log correlation, backend code, data freshness                |
+| **Post-success render**  | Request succeeds with expected data but UI does not reflect it                | Client cache invalidation, state updates, rendering conditions, console errors                     |
+| **Assertion/locator**    | App state is correct but selector/assertion no longer matches intended UX     | Test selector, accessible names, UX drift, deterministic app-ready signal                          |
+This classification can change as evidence improves. State the final surface in the plan.
 ## Quick Classification (E2E)
 For the full report schema, field reference, caps, and example reports:
@@ -88,16 +117,19 @@ Read the docs if you need field semantics or limits; otherwise the field names u
 Classify the flake to narrow the search space:
-| Category                   | Signal                                                                            | Timeline Pattern                                                                              |
-| -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
-| **Test-state leakage**     | Retries or earlier tests leave auth, cookies, storage, or server state behind     | `attempts[]` — different outcomes across retries                                              |
-| **Data collision**         | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors                                                 |
-| **Backend stale data**     | API returned 200 but response body shows old state                                | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
-| **Frontend cache stale**   | No network request after navigation/reload for the relevant endpoint              | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint      |
-| **Silent network failure** | CORS, DNS, or transport error prevented the request from completing               | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL`                    |
-| **Render/hydration bug**   | API returned correct data but component didn't render it                          | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors                   |
-| **Environment / infra**    | Transient 5xx, timeouts, DNS/network instability                                  | `network` entries with 5xx status; `consoleMessages[]` with connection errors                 |
-| **Locator / UX drift**     | Selector is valid but brittle against small UI changes                            | `errors[]` — locator/selector text in error message                                           |
+| Category                     | Signal                                                                            | Timeline Pattern                                                                              |
+| ---------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
+| **Shared setup/tooling**     | Many tests fail before user-flow assertions in the same helper or CI step         | `job/setup` or `beforeEach` fails across shards/runners with the same stack or command error  |
+| **Auth/data setup**          | Token mint, login, seeded user/entity, or external test setup fails               | `beforeEach/setup` → service/CLI error before page actions                                    |
+| **Test-state leakage**       | Retries or earlier tests leave auth, cookies, storage, or server state behind     | `attempts[]` — different outcomes across retries                                              |
+| **Data collision**           | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors                                                 |
+| **Backend stale data**       | API returned 200 but response body shows old state                                | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
+| **Frontend cache stale**     | No network request after navigation/reload for the relevant endpoint              | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint      |
+| **Expected request missing** | User action succeeds but the expected network request never appears               | `step(click/input)` completes → no matching `network(...)` → assertion/wait fails             |
+| **Silent network failure**   | CORS, DNS, or transport error prevented the request from completing               | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL`                    |
+| **Render/hydration bug**     | API returned correct data but component didn't render it                          | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors                   |
+| **Environment / infra**      | Transient 5xx, timeouts, DNS/network instability                                  | `network` entries with 5xx status; `consoleMessages[]` with connection errors                 |
+| **Locator / UX drift**       | Selector is valid but brittle against small UI changes                            | `errors[]` — locator/selector text in error message                                           |
 ## Analyze LLM Report
@@ -148,14 +180,30 @@ For each attempt, compare `status`, `durationMs`, and `error` across retries —
 **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
+If the LLM report does not show whether a click resolved, an element detached, a dialog appeared/disappeared, or a request was never emitted, download and inspect the full Playwright HTML report/trace artifact when available. The LLM report is an index, not the ceiling for evidence.
 ### Inspect network activity and extract trace IDs
 Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
 **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`./datadog-apm-traces.md`](./datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
+If no expected request was emitted, say that explicitly and do not diagnose backend latency for that action. Instead, use the trace, console/page errors, session/permission state, and frontend code path to explain why the request never started.
 Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
+### Use Available Observability Beyond APM
+APM traces are valuable when an application request exists, but many E2E flakes fail before that. Use the observability surface that matches the failure surface:
+- **CI/job setup:** GitHub Actions logs, cache keys, installed tool versions, runner/shard distribution.
+- **Auth/data setup:** service logs, identity-provider/API audit logs, rate-limit/throttle metrics, setup helper output.
+- **App bootstrap:** browser console, static asset responses, deployment/version metadata, CDN or webapp logs.
+- **User action no-op:** Playwright trace actions, console/page errors, RUM/session events when available.
+- **Backend request emitted:** APM traces, backend logs, request/response bodies, data store/cache traces.
+Search exact error strings across a relevant time window and commit/shard context. Prefer primary telemetry over inference from test code when both are available.
 ### Review test steps
 Prefer the timeline view above which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
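
Reviewer note: the saturation guidance in this hunk (non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` lowers confidence, while `instancesDroppedByFilter` alone is expected) could be checked with a few lines of Python. This is a minimal sketch that assumes `attempts[].network.summary` has already been loaded as a plain dict of integer counters; only the counter names come from the diff.

```python
# Counter names are taken from the plan.md text; the flat dict shape is an assumption.
SATURATION_KEYS = (
    "instancesDroppedByGroupCap",
    "instancesDroppedByInstanceCap",
    "instancesEvictedAfterAdmission",
)


def saturation_flags(summary: dict) -> dict:
    """Return the non-zero saturation counters from one attempt's network summary."""
    return {key: summary[key] for key in SATURATION_KEYS if summary.get(key, 0) > 0}
```

An empty result means the retained network instances were not sampled by the caps; any entry in the result is worth citing as a confidence-reducing factor.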
@@ -165,10 +213,14 @@ Prefer the timeline view above which interleaves steps with network and console.
 Do not propose a fix without concrete artifacts. At minimum, include:
 - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
-- One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
+- One **network or lifecycle artifact**:
+  - If a request was emitted: an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
+  - If no request was emitted: the step/trace evidence showing the triggering action completed and the expected request/dialog/route transition never started
+  - If failure happened before the app action: CI/setup/auth log evidence showing the failing command/service call
 - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
 - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
 - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
+- When relevant: **logs/RUM/CI evidence** that confirms whether the issue is app, backend, infra, auth/test-data, or CI tooling
 - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
 If confidence is less than 5/5, identify the missing evidence and propose concrete frontend and/or backend observability changes that would make the next occurrence diagnosable at 5/5 confidence. These changes may span multiple repositories.
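
Reviewer note: before choosing the single error artifact required above, it may help to group `tests[].errors[]` by first error line, as the "Cluster Before Per-Test Debugging" step in this file recommends. The sketch below assumes each error is a dict with a `message` string and each test has a `title`; both keys are assumptions for illustration, and the real field names should be taken from the report schema docs.

```python
from collections import defaultdict


def first_error_line(error: dict) -> str:
    """First non-empty line of an error message; the `message` key is an assumption."""
    for line in str(error.get("message", "")).splitlines():
        if line.strip():
            return line.strip()
    return "<no message>"


def cluster_by_first_error(report: dict) -> dict:
    """Map each first error line to the titles of the tests that produced it."""
    clusters = defaultdict(list)
    for test in report.get("tests", []):
        for error in test.get("errors", []):
            # `title` is a hypothetical key used only to label the test.
            clusters[first_error_line(error)].append(test.get("title", "<untitled>"))
    return dict(clusters)
```

A cluster containing many tests points at a shared mechanism (setup helper, auth path, tooling) that should be diagnosed before any per-test fix.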
@@ -193,25 +245,54 @@ Rate your confidence in the root cause on a 1-5 scale. Report this score alongsi
 Applies to all test types.
-Choose one of these approaches in priority order:
+Choose the fix locus from the evidence, not from where the assertion failed. A flaky E2E test can be exposing a CI dependency issue, auth/test-data service issue, backend bug, deployment problem, product state bug, or test harness bug.
+Use this decision order:
+1. **Shared setup or CI fix** when many tests fail before user-flow assertions or all failures share a tool, cache, install, auth, seed-data, or fixture path.
+2. **Backend/service/data fix** when the expected request is emitted and backend telemetry or response bodies show errors, throttling, stale data, inconsistent state, or unexpected latency.
+3. **Product fix** when real users can hit the same unsafe intermediate state, render error, permission/session race, stale cache, or missing error handling.
+4. **Test data or harness fix** when the scenario is not user-realistic, the test setup is semantically wrong, or the test needs a deterministic app-ready signal.
+5. **Assertion/locator fix** only when the app state is correct and the selector/assertion is the only broken part.
+Before proposing any retry, timeout, or wait change, pass the idempotency check:
+- Is the retried operation safe to repeat?
+- Does retrying preserve the same test scenario?
+- Could retrying amplify the root cause, such as rate limits, one-time credentials, duplicate writes, or destructive mutations?
+- Is there a deterministic signal to wait on instead of a longer timeout?
+If the answer is no or unclear, do not add a retry/wait as the fix. Propose a root-cause fix or instrumentation instead.
+Common valid fix types:
+- **Shared setup / CI**
+  - Pin or lock runtime tools and dependencies
+  - Fail fast on incompatible tool contracts
+  - Remove mutable global state from CI setup
+  - Add setup-level diagnostics before sharding
-1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
+- **Test harness / data** (when the failure is non-product):
+  - Reset cookies, storage, and session between retries
+  - Isolate test data; generate stronger unique identities
+  - Make retry blocks idempotent
+  - Wait on deterministic app signals, not arbitrary sleeps
+  - (Service tests) Close connections and app properly in `afterAll`
+  - (Component tests) Flush pending state updates and timers before asserting
+  - (Unit tests) Reset shared mutable state in `beforeEach`
-2. **Test harness fix** (when the failure is non-product):
-   - Reset cookies, storage, and session between retries
-   - Isolate test data; generate stronger unique identities
-   - Make retry blocks idempotent
-   - Wait on deterministic app signals, not arbitrary sleeps
-   - (Service tests) Close connections and app properly in `afterAll`
-   - (Component tests) Flush pending state updates and timers before asserting
-   - (Unit tests) Reset shared mutable state in `beforeEach`
+- **Product** (when real users would hit the same issue):
+  - Handle stale or intermediate states safely
+  - Make routing/render logic robust to eventual consistency
+  - Add telemetry for ambiguous transitions
-3. **Product fix** (when real users would hit the same issue):
-   - Handle stale or intermediate states safely
-   - Make routing/render logic robust to eventual consistency
-   - Add telemetry for ambiguous transitions
+- **Backend/service**
+  - Remove avoidable shared mutable writes from hot paths
+  - Make setup operations idempotent or explicitly rate limited
+  - Fix stale reads, cache invalidation, and eventual-consistency assumptions
+  - Add trace/log correlation for ambiguous failures
-4. **Both** if user impact exists _and_ tests are fragile.
+Choose **both** if user impact exists _and_ tests are fragile.
 ## Plan Output Format
@@ -220,6 +301,8 @@ Produce the plan with these fields:
 - **Test ID:** if provided in prompt
 - **Agent session ID:** your running session ID to resume if needed
 - **Confidence:** score (1-5) with brief justification
+- **Failure surface:** CI/job setup, test setup/auth/data, app bootstrap, user action no-op, backend request, post-success render, assertion/locator, or mixed
+- **Current main status:** whether the failing commit's code path still exists on current `main`, has already been fixed, or has changed enough that the plan must be adjusted
 - **Symptom:** what failed and where
 - **Root cause:** concise technical explanation
 - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
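
Reviewer note: the five-step decision order added to `plan.md` in this release can be read as a first-match rule over the collected evidence. The sketch below illustrates that reading only; the evidence keys are hypothetical booleans invented for this example, and it deliberately leaves out the "choose both" case, which needs a human judgment about user impact and test fragility.

```python
def choose_fix_locus(evidence: dict) -> str:
    """Return the first matching fix locus, following the decision order in plan.md."""
    # Keys are hypothetical flags; labels paraphrase the five steps in the diff.
    ordered_rules = [
        ("shared_setup_failure", "shared setup or CI fix"),
        ("backend_fault_on_emitted_request", "backend/service/data fix"),
        ("user_reachable", "product fix"),
        ("unrealistic_or_wrong_setup", "test data or harness fix"),
        ("only_assertion_broken", "assertion/locator fix"),
    ]
    for key, locus in ordered_rules:
        if evidence.get(key, False):
            return locus
    return "undetermined: gather more evidence"
```

Because the rules are ordered, an assertion/locator fix is returned only when no earlier, broader cause matched, which mirrors the intent of choosing the fix locus from evidence rather than from where the assertion failed.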