@clipboard-health/ai-rules 2.15.8 → 2.15.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@clipboard-health/ai-rules",
3
- "version": "2.15.8",
3
+ "version": "2.15.10",
4
4
  "description": "Pre-built AI agent rules for consistent coding standards.",
5
5
  "keywords": [
6
6
  "ai",
@@ -59,7 +59,12 @@ for my $section (@sections) {
59
59
  my $raw_file_name = trim($1);
60
60
  my $file_content = $2;
61
61
 
62
- while ($file_content =~ /`(\d+(?:-\d+)?)`:\s*(?:_[^_]+_\s*\|\s*_[^_]+_\s*)?\*\*([^*]+)\*\*\s*([\s\S]*?)(?=---|\n`\d|<\/blockquote>|$)/g) {
62
+ # Category prefix is optional. CodeRabbit emits 0–N underscore-wrapped tags
63
+ # separated by `|` (e.g. `_⚠️ Potential issue_ | _🟠 Major_ | _⚡ Quick win_`
64
+ # or just `_💤 Low value_` on lower-confidence findings). The previous
65
+ # regex required exactly two tags and silently dropped one-tag and
66
+ # three-tag variants.
67
+ while ($file_content =~ /`(\d+(?:-\d+)?)`:\s*(?:_[^_]+_(?:\s*\|\s*_[^_]+_)*\s*)?\*\*([^*]+)\*\*\s*([\s\S]*?)(?=---|\n`\d|<\/blockquote>|$)/g) {
63
68
  my $line_range = $1;
64
69
  my $title = trim($2);
65
70
  my $clean_body = clean_comment_body(trim($3));
@@ -14,20 +14,13 @@ description: Commit, push, and open a PR. Use when the user wants to ship change
14
14
 
15
15
  ## Your task
16
16
 
17
- **First, decide from the context above. If `Commits ahead of default branch` is `(unknown)`, skip this decision and use the full flow below.**
17
+ If `Commits ahead of default branch` is `(unknown)`, `origin/HEAD` couldn't be resolved — stop and tell the user to run `git remote set-head origin -a` (or otherwise set the default branch) before retrying, since the simplify step also depends on it. Otherwise, if `Git status`, `Commits ahead of default branch`, and `Existing PR` are all empty/none, stop and reply `nothing to ship.`. Otherwise:
18
18
 
19
- - If `Git status` is empty AND `Commits ahead of default branch` is empty AND `Existing PR` is `none`: stop. Reply with `nothing to ship.` and do nothing else.
20
- - If `Git status` is empty but there are `Commits ahead of default branch` or an `Existing PR`: still run step 1 and step 2. Then re-check `git status --short`; skip step 3 only if it remains empty. Then continue with step 4 and step 5.
21
- - Otherwise: proceed with all steps below.
22
-
23
- Based on the above changes:
24
-
25
- 1. Create a new branch if on main (e.g., `feat/add-user-validation`, `fix/null-check-in-parser`)
26
- 2. Run the `simplify` skill on the files included in the PR before committing, pushing, or opening the PR. If the working tree is clean, run `simplify` against `git diff $(git merge-base HEAD origin/HEAD)..HEAD`. If the working tree has changes, run it against `git diff HEAD`; when local commits also exist, include `git diff $(git merge-base HEAD origin/HEAD)..HEAD` as additional PR context. Invoke the skill by name using this agent's normal skill invocation mechanism. Wait for the skill to finish, then include any resulting fixes in the commit step below.
27
- 3. Re-check `git status --short`. If there are changes, create a single conventional commit message.
28
- 4. Push the branch to origin
19
+ 1. Create a new branch if on main (e.g., `feat/add-user-validation`, `fix/null-check-in-parser`).
20
+ 2. Run the `simplify` skill on the full PR diff `git diff $(git merge-base HEAD origin/HEAD)..HEAD` plus any uncommitted changes. Wait for it to finish, then include any resulting fixes in the commit.
21
+ 3. If `git status --short` shows changes, create a single conventional commit.
22
+ 4. Push the branch to origin.
29
23
  5. Check for an existing PR with `gh pr view`.
30
- - No PR exists: Create with `gh pr create`. Title = commit subject line. Description = brief explanation of **why**, not what. Append `<!-- commit-push-pr:created v1 -->` on its own line at the end of the PR description so skill-created PRs can be identified later.
31
- - PR exists: Report the URL and move on.
32
- 6. You have the capability to call multiple tools in a single response. After the `simplify` skill completes, do the remaining git and PR operations in a single message. Do not use any other tools or do anything else.
33
- 7. After tool calls complete, send one short final text response with the branch name and the full PR URL (e.g., `https://github.com/clipboardhealth/core-utils/pull/123`). Never use shorthand like `repo#123` — always output the complete URL so it is clickable.
24
+ - No PR: create with `gh pr create`. Title = commit subject. Description = brief explanation of **why**, not what. Append `<!-- commit-push-pr:created v1 -->` on its own line at the end so skill-created PRs can be identified later.
25
+ - PR exists: report the URL and move on.
26
+ 6. End with one short text response: branch name and the full PR URL (e.g., `https://github.com/clipboardhealth/core-utils/pull/123`). Never use shorthand like `repo#123`; always output the complete URL.
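The diffs that step 2 feeds to `simplify` can be sketched as shell helpers. This is a sketch, and it assumes `origin/HEAD` resolves — the same precondition the `(unknown)` guard above exists to catch:

```shell
# Committed PR changes: everything since the branch diverged from the
# default branch. This is the "full PR diff" of step 2.
pr_diff() {
  local base
  base=$(git merge-base HEAD origin/HEAD) || return 1
  git diff "$base"..HEAD
}

# Uncommitted changes, layered on top of the PR diff when present.
worktree_diff() {
  git diff HEAD
}
```

If `pr_diff` fails, `git remote set-head origin -a` is the usual fix, matching the guidance above.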
@@ -3,7 +3,16 @@ name: flaky-test-debugger
3
3
  description: Debug and fix flaky tests including Playwright E2E, NestJS service/integration, React component, and unit tests. Use this skill when investigating intermittent test failures, triaging flaky tests, or fixing test instability.
4
4
  ---
5
5
 
6
- Work through these phases in order. Skip phases only when you already have the information they produce.
6
+ Phases run in order. Skip a phase if you already have the information it produces. Phase 3 runs only in fix mode.
7
+
8
+ ## Mode: plan vs fix
9
+
10
+ This skill runs in one of two modes:
11
+
12
+ - **Fix mode (default):** produce a plan, then apply it.
13
+ - **Plan mode:** produce a plan and stop, for human review.
14
+
15
+ Use plan mode when the user asks for a plan, an investigation, a triage report, or says "don't fix yet" / "just plan it". Otherwise default to fix mode. Both modes share the same diagnosis path; the plan is the artifact you hand to a reviewer (plan mode) or to yourself (fix mode) before editing code.
7
16
 
8
17
  ## Phase 1: Classify Test Type
9
18
 
@@ -18,10 +27,6 @@ Determine the test type from the user's input before doing anything else. The ty
18
27
 
19
28
  If the type is ambiguous, check the test file extension and imports to confirm.
20
29
 
21
- **Routing:** After completing Phase 1, always proceed to Phase 1b before investigating further.
22
-
23
- ---
24
-
25
30
  ## Phase 1b: Check for Existing Fixes
26
31
 
27
32
  Before investigating, check whether someone (or another agent) has already fixed this flake.
@@ -40,291 +45,14 @@ If an existing fix is found, report:
40
45
  - A brief summary of what it addresses
41
46
  - Whether it fully covers the current flake or only partially
42
47
 
43
- If no existing fix is found, proceed to investigation:
44
-
45
- - **E2E (Playwright):** Go to [Phase 2E: E2E Triage Snapshot](#phase-2e-e2e-triage-snapshot)
46
- - **Service, React component, or Unit:** Go to [Phase 2: Fast Path](#phase-2-fast-path-non-e2e)
47
-
48
- ---
49
-
50
- ## Phase 2: Fast Path (non-E2E)
51
-
52
- For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose and fix the flake. Do not over-investigate -- read the evidence, read the code, fix it.
53
-
54
- ### 2a: Gather Failure Context
55
-
56
- Capture from the user's input (ask if missing):
57
-
58
- - **Test file and name** -- exact file path and test title
59
- - **Error message and stack trace** -- the raw failure output
60
- - **Framework** -- Jest, Vitest, etc.
61
- - **Whether it's a new flaky** -- first occurrence vs. recurring
62
- - **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
63
-
64
- ### 2b: Read the Test and Code Under Test
65
-
66
- 1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
67
- 2. Read the production code that the test exercises -- follow imports from the test file.
68
- 3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
69
-
70
- ### 2c: Classify the Flake Pattern
71
-
72
- | Category | Test Types | Signal |
73
- | ------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------ |
74
- | **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
75
- | **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
76
- | **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
77
- | **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
78
- | **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
79
- | **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
80
- | **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
81
- | **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
82
- | **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
83
- | **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
84
-
85
- ### 2d: Diagnose and Fix
86
-
87
- Apply the appropriate fix based on the pattern:
88
-
89
- **Service test fixes:**
90
-
91
- - Ensure `afterAll` closes the app _and_ awaits all open connections (DB, Redis, queues) before returning
92
- - Pass `{ forceCloseConnections: true }` to `NestFactory.create()` (NestJS v10+) to auto-close keep-alive connections on shutdown, or explicitly close the Mongoose/TypeORM connection in `afterAll`
93
- - Use dynamic/random ports (`listen(0)`) to avoid EADDRINUSE
94
- - Isolate database state: use unique collection prefixes, transaction rollbacks, or per-test database cleanup
95
- - If the test uses `setTimeout` or event-driven patterns, ensure the test awaits completion rather than relying on timing
96
-
97
- **React component test fixes:**
98
-
99
- - Wrap state-triggering actions in `act()` or use `waitFor`/`findBy*` queries that handle async updates
100
- - When using fake timers, advance them explicitly (`jest.advanceTimersByTime`, `jest.runAllTimers`) and restore real timers in `afterEach`
101
- - Ensure cleanup with `cleanup()` in `afterEach` (React Testing Library does this automatically unless disabled)
102
- - Restore mocks in `afterEach` -- prefer `jest.restoreAllMocks()` in a shared setup
103
-
104
- **Unit test fixes:**
105
-
106
- - Eliminate shared mutable state: clone or reset objects in `beforeEach`, or make the module-level binding `const`
107
- - Mock `Date.now` / `new Date()` explicitly when time matters; restore in `afterEach`
108
- - If order-dependent, check for missing setup that another test was implicitly providing
109
-
110
- ### 2e: Evidence Standard (Fast Path)
111
-
112
- Before proposing a fix, include at minimum:
113
-
114
- - The **error message and stack trace** from the failure
115
- - The **specific code path** in the test or production code that caused the flake
116
- - A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
117
- - A **confidence score** (1-5, same scale as [E2E evidence standard](#confidence-score))
118
-
119
- If confidence is 2 or below, recommend reproduction steps or instrumentation before committing to a fix.
120
-
121
- Skip to [Phase 5: Fix Decision Tree](#phase-5-fix-decision-tree).
122
-
123
- ---
124
-
125
- ## Phase 2E: E2E Triage Snapshot
126
-
127
- Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
128
-
129
- - Failing test file and name
130
- - GitHub Actions run URL to fetch the LLM report
131
-
132
- ### Fetch the LLM Report
133
-
134
- Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
135
-
136
- ```bash
137
- bash scripts/fetch-llm-report.sh "<github-actions-url>"
138
- ```
139
-
140
- This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
141
-
142
- ## Phase 3E: Quick Classification
143
-
144
- For the full report schema, field reference, caps, and example reports:
145
-
146
- 1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — exact version match to the report.
147
- 2. Otherwise, fetch the latest docs from GitHub:
148
- - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
149
- - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
150
-
151
- Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
152
-
153
- Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
154
-
155
- Classify the flake to narrow the search space:
156
-
157
- | Category | Signal | Timeline Pattern |
158
- | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
159
- | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
160
- | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
161
- | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
162
- | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
163
- | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
164
- | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
165
- | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
166
- | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
167
-
168
- ## Phase 4E: Analyze LLM Report
169
-
170
- ### 4Ea: Walk the Timeline
171
-
172
- **Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
173
-
174
- ```text
175
- step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
176
- ```
177
-
178
- For each timeline entry:
179
-
180
- - **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
181
- - **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
182
- - **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
183
-
184
- All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
185
-
186
- ### 4Eb: Compare pass vs fail (flaky tests)
187
-
188
- If you don't have passing and failing attempts for the same test, skip to 4Ec.
189
-
190
- Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
191
-
192
- 1. Align both timelines by step title sequence
193
- 2. Find the first step/network/console entry that differs between attempts
194
- 3. The divergence answers "what was different this time?" directly
195
-
196
- Common divergence patterns:
197
-
198
- - **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
199
- - **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
200
- - **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
201
- - **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
202
-
203
- ### 4Ec: Identify failing tests
204
-
205
- Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
206
-
207
- - **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
208
- - **`location`**: Source file, line, and column — jump straight to the code.
209
- - **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
210
-
211
- ### 4Ed: Examine attempts for retry patterns
212
-
213
- For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
214
-
215
- **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
216
-
217
- ### 4Ee: Inspect network activity and extract trace IDs
218
-
219
- Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
220
-
221
- **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`references/datadog-apm-traces.md`](./references/datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
222
-
223
- Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
224
-
225
- ### 4Ef: Review test steps
226
-
227
- Prefer the timeline view (4Ea) which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
228
-
229
- ## Phase 4E Evidence Standard
230
-
231
- Do not propose a fix without concrete artifacts. At minimum, include:
232
-
233
- - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
234
- - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
235
- - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
236
- - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
237
- - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
238
- - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
239
-
240
- ### Confidence Score
241
-
242
- Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
243
-
244
- | Score | Meaning | Criteria |
245
- | ----- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
246
- | **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
247
- | **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
248
- | **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
249
- | **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
250
- | **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
251
-
252
- If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
253
-
254
- ---
255
-
256
- ## Phase 5: Fix Decision Tree
257
-
258
- Applies to all test types.
259
-
260
- Apply fixes in this order of priority:
261
-
262
- 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
263
-
264
- 2. **Test harness fix** (when the failure is non-product):
265
- - Reset cookies, storage, and session between retries
266
- - Isolate test data; generate stronger unique identities
267
- - Make retry blocks idempotent
268
- - Wait on deterministic app signals, not arbitrary sleeps
269
- - (Service tests) Close connections and app properly in `afterAll`
270
- - (Component tests) Flush pending state updates and timers before asserting
271
- - (Unit tests) Reset shared mutable state in `beforeEach`
272
-
273
- 3. **Product fix** (when real users would hit the same issue):
274
- - Handle stale or intermediate states safely
275
- - Make routing/render logic robust to eventual consistency
276
- - Add telemetry for ambiguous transitions
277
-
278
- 4. **Both** if user impact exists _and_ tests are fragile.
279
-
280
- ## Phase 6: Fix Sibling Instances
281
-
282
- After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
283
-
284
- ### When to search
285
-
286
- Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
287
-
288
- - Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
289
- - Hardcoded ports instead of dynamic allocation
290
- - Shared mutable state without per-test reset
291
- - Missing `act()` wrappers or `waitFor` around async assertions
292
- - Fake timers not restored in `afterEach`
293
- - Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
294
-
295
- ### When NOT to search
296
-
297
- Skip this step when the fix is **specific to one test's logic** -- for example, a wrong assertion value, a test-specific race condition in a unique setup, or a one-off typo.
298
-
299
- ### How to search
300
-
301
- 1. Identify the anti-pattern as a grep-able code pattern. Examples:
302
- - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
303
- - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
304
- - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
305
- - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
306
-
307
- 2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
308
-
309
- 3. Apply the same fix to each sibling. Keep changes minimal -- fix the anti-pattern, nothing else.
310
-
311
- 4. List the sibling files you fixed in the output so reviewers can verify them.
312
-
313
- ## Phase 7: Verification
48
+ If no existing fix is found, proceed to Phase 2.
314
49
 
315
- Lint and type-check touched files.
50
+ ## Phase 2: Produce a plan
316
51
 
317
- ## Output Format
52
+ Follow [`references/plan.md`](./references/plan.md). It walks investigation, diagnosis, evidence gathering, and the fix decision tree, and produces a structured plan with confidence score.
318
53
 
319
- When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
54
+ If you are in plan mode, present the plan and stop here.
320
55
 
321
- When documenting the fix in a PR or issue, use this structure:
56
+ ## Phase 3: Apply the plan (fix mode only)
322
57
 
323
- - **Confidence:** score (1-5) with brief justification
324
- - **Symptom:** what failed and where
325
- - **Root cause:** concise technical explanation
326
- - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
327
- - **Fix:** test-only, product-only, or both
328
- - **Siblings fixed:** list of other files where the same anti-pattern was corrected (or "N/A -- fix was test-specific")
329
- - **Validation:** commands and suites run
330
- - **Residual risk:** what could still be flaky
58
+ Follow [`references/fix.md`](./references/fix.md). It takes the plan from Phase 2, applies the proposed fix, searches for sibling anti-patterns, and verifies. PR creation is out of scope -- if the user later opens one (or invokes a PR-shipping skill), label it `flaky-test-fix`.
@@ -9,7 +9,7 @@ The `pup` CLI must be installed and authenticated. Two auth paths are supported:
9
9
  - **macOS Keychain** (via `pup auth login`) — the default on developer machines.
10
10
  - **Environment variables** (`DD_API_KEY` + `DD_APP_KEY`) — the path used in sandboxes and CI.
11
11
 
12
- **Do not run `pup auth status` to verify auth.** It only reads the Keychain, so it fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly — env-var auth takes effect there. If the query fails with an auth error, surface it then (check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`).
12
+ Don't run `pup auth status` to verify auth. It fails in sandboxes even when env-var auth is working. Call `pup traces search …` directly. If the query fails with an auth error, check `DD_API_KEY` / `DD_APP_KEY` or run `pup auth login`.
13
13
 
14
14
  ## Key pup conventions
15
15
 
@@ -78,3 +78,7 @@ pup traces search --query="trace_id:<TRACE_ID>" --from=30d --limit=1000 \
78
78
  error: .attributes.custom.error.message
79
79
  }]'
80
80
  ```
81
+
82
+ ### 4. Query additional data
83
+
84
+ If additional data would help diagnose the issue (e.g. logs, RUM, CI/CD), use the `pup` CLI.
@@ -0,0 +1,65 @@
1
+ # Apply a Flaky Test Fix
2
+
3
+ Apply phase of the flaky-test-debugger skill. Takes a plan produced by [`plan.md`](./plan.md) and applies it.
4
+
5
+ ## Preflight
6
+
7
+ Confirm the plan from `plan.md` has confidence ≥ 3. If confidence is 1-2, do not apply -- return to `plan.md` and gather more evidence first.
8
+
9
+ ## Apply the Proposed Fix
10
+
11
+ Edit the files listed in the plan's **Proposed fix** field. Keep the change minimal -- the plan already chose between test harness, product, and both.
12
+
13
+ ## Fix Sibling Instances
14
+
15
+ After fixing the root cause, search for other tests that exhibit the same anti-pattern and fix them too. A flaky pattern in one test file almost always has siblings nearby.
16
+
17
+ ### When to search
18
+
19
+ Search for siblings when the root cause is a **structural anti-pattern** -- something that would be wrong regardless of the specific test logic:
20
+
21
+ - Missing or incomplete teardown (`afterAll`/`afterEach` not closing connections, not restoring mocks)
22
+ - Hardcoded ports instead of dynamic allocation
23
+ - Shared mutable state without per-test reset
24
+ - Missing `act()` wrappers or `waitFor` around async assertions
25
+ - Fake timers not restored in `afterEach`
26
+ - Stale data patterns (E2E: missing reload/re-fetch; service: no DB cleanup between tests)
27
+
28
+ ### When NOT to search
29
+
30
+ Skip this step when the fix is **specific to one test's logic** -- for example, a test-specific race condition in a unique setup or a one-off typo.
31
+
32
+ ### How to search
33
+
34
+ 1. Identify the anti-pattern as a grep-able code pattern. Examples:
35
+ - Missing connection cleanup: grep for `createTestingModule` in test files and check each for proper `afterAll` teardown
36
+ - Hardcoded port: grep for `listen(3000)` or `listen(PORT)` in test files
37
+ - Missing mock restore: grep for `jest.spyOn` in files that lack `restoreAllMocks`
38
+ - Missing `act()`: grep for `render(` in `.test.tsx` files that call state-changing functions without `act` or `waitFor`
39
+
40
+ 2. Scope the search to the same area of the codebase first (same package or directory), then widen if the pattern is pervasive.
41
+
42
+ 3. Apply the same fix to each sibling. Keep changes minimal; fix the anti-pattern, nothing else.
43
+
44
+ 4. List the sibling files you fixed in the output so reviewers can verify them.
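The probes in step 1 might look like the following. The paths, globs, and patterns are illustrative, not prescribed by the skill; `|| true` absorbs grep's non-zero exit when a probe finds nothing:

```shell
dir=${1:-src}  # directory to scan; "src" is an illustrative default

# Files that bootstrap a Nest testing module -- inspect each for afterAll teardown.
grep -rl 'createTestingModule' --include='*.spec.ts' --include='*.test.ts' "$dir" || true

# Hardcoded ports in tests.
grep -rn 'listen(3000)' --include='*.test.*' --include='*.spec.*' "$dir" || true

# Files that spy but never restore: -L lists files missing the restore call.
grep -rl 'jest.spyOn' --include='*.test.*' "$dir" | xargs -r grep -L 'restoreAllMocks' || true
```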
45
+
46
+ ## Verification
47
+
48
+ Run the plan's **Validation plan** commands — including the previously-flaky test, repeated enough times to give reasonable confidence the flake is gone. Lint and type-check touched files as the floor; do not stop there.
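One way to get those repeated runs, sketched as a helper; the runner command is whatever the plan's validation step names, and `npx jest …` in the usage line is a placeholder, not an assumption about the repo:

```shell
# Re-run a previously-flaky test N times; stop at the first failure.
# The command is word-split, so keep it to a simple runner + path.
repeat_test() {
  local cmd=$1 runs=${2:-20} i
  for i in $(seq 1 "$runs"); do
    $cmd || { echo "flaked on run $i"; return 1; }
  done
  echo "passed $runs/$runs"
}

# usage: repeat_test "npx jest path/to/flaky.test.ts" 20
```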
49
+
50
+ ## Output Format
51
+
52
+ When opening a PR for a flaky test fix, include `--label flaky-test-fix` in the `gh pr create` command so other agents can find it during Phase 1b deduplication.
53
+
54
+ When documenting the fix in a PR or issue, use this structure. Carry **Confidence**, **Symptom**, **Root cause**, **Evidence**, and **Residual risk** straight over from the plan. Three plan fields are renamed: **Proposed fix** → **Fix**, **Sibling candidates** → **Siblings fixed**, **Validation plan** → **Validation**. Drop **Open questions** (resolved by fix time):
55
+
56
+ - **Test ID:** if provided in prompt
57
+ - **Agent session ID:** your running session ID to resume if needed
58
+ - **Confidence:** score (1-5) with brief justification
59
+ - **Symptom:** what failed and where
60
+ - **Root cause:** concise technical explanation
61
+ - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
62
+ - **Fix:** test-only, product-only, or both
63
+ - **Siblings fixed:** list of other files where the same anti-pattern was or should be corrected (or "N/A -- fix was test-specific")
64
+ - **Validation:** commands and suites run
65
+ - **Residual risk:** what could still be flaky
@@ -0,0 +1,226 @@
1
+ # Plan a Flaky Test Fix
2
+
3
+ Diagnosis and planning phase of the flaky-test-debugger skill. Produces a structured plan that the user reviews. In fix mode, the plan is consumed by [`fix.md`](./fix.md).
4
+
5
+ Route by the test type identified in Phase 1 of SKILL.md:
6
+
7
+ - **E2E (Playwright):** start with [E2E Triage Snapshot](#e2e-triage-snapshot)
8
+ - **Service, React component, or Unit:** start with [Fast Path (non-E2E)](#fast-path-non-e2e)
9
+
10
+ ## Fast Path (non-E2E)
11
+
12
+ For service, component, and unit tests, the failure information plus the test source code is usually sufficient to diagnose the flake. Do not over-investigate -- read the evidence, read the code, plan the fix.
13
+
14
+ ### Gather Failure Context
15
+
16
+ Capture from the user's input (ask if missing):
17
+
18
+ - **Test file and name** -- exact file path and test title
19
+ - **Error message and stack trace** -- the raw failure output
20
+ - **Framework** -- Jest, Vitest, etc.
21
+ - **Failure metadata** -- branch, pipeline URL, duration, shard, timestamp (when available)
22
+
23
+ ### Read the Test and Code Under Test
24
+
25
+ 1. Read the failing test file. Focus on the specific failing test and its surrounding `describe`/`beforeEach`/`afterEach`/`afterAll` blocks.
26
+ 2. Read the production code that the test exercises -- follow imports from the test file.
27
+ 3. For service tests: also read the test module setup (the `Test.createTestingModule(...)` or app bootstrap code), and check for `afterAll` cleanup that closes the app/database connections.
28
+
29
+ ### Classify the Flake Pattern
30
+
31
+ | Category | Test Types | Signal |
32
+ | ------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------ |
33
+ | **Connection lifecycle** | Service | "connection closed", "topology destroyed", socket errors in stack -- app/DB not fully ready or torn down too early |
34
+ | **Port conflict** | Service | EADDRINUSE -- multiple test files bootstrapping on the same port |
35
+ | **Async teardown race** | Service | Errors appear after test passes -- `afterAll` closes the app while background work is still running |
36
+ | **Database state leakage** | Service | Test depends on DB state that a parallel/prior test modified |
37
+ | **Unresolved async work** | Component | "not wrapped in act()" warnings, state updates after unmount |
38
+ | **Timer/animation not flushed** | Component | Test asserts before `setTimeout`/`requestAnimationFrame` fires, or `useFakeTimers` not advanced |
39
+ | **Mock not restored** | Component, Unit | `jest.spyOn` or `jest.mock` bleeds into the next test -- missing `mockRestore`/`restoreAllMocks` |
40
+ | **Shared mutable state** | Unit | Module-level variable or singleton mutated by one test, observed by another |
41
+ | **Date/time sensitivity** | Unit | Test assumes a specific date, time zone, or `Date.now()` value that shifts across runs |
42
+ | **Test ordering dependency** | All | Passes in isolation, fails when run with other tests (or vice versa) |
43
+
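To make one row of the table concrete, here is a minimal, self-contained sketch of the **shared mutable state** pattern — a module-level cache seeded by one test and silently observed by the next. The cache and `loadUser` are invented for illustration.

```javascript
// Module-level state shared across tests.
const cache = { user: null };

function loadUser(id) {
  if (!cache.user) cache.user = { id };
  return cache.user;
}

// "testA" seeds the cache as a side effect.
loadUser(1);
// "testB" expects a fresh load but observes testA's leftover state.
const b = loadUser(2);
console.log(b.id);
// -> 1, not 2: the result depends on test ordering
```

The fix is the one listed under the harness options below: reset the shared state in `beforeEach` (here, `cache.user = null`).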
44
+ ### Diagnose with Evidence
45
+
46
+ Before proposing a fix, gather:
47
+
48
+ - The **error message and stack trace** from the failure
49
+ - The **specific code path** in the test or production code that caused the flake
50
+ - A brief **explanation** of why the flake is intermittent (what timing or state condition triggers it)
51
+ - A **confidence score** (1-5, see [Confidence Score](#confidence-score))
52
+
53
+ If confidence is 2 or below, the plan is to gather more data: recommend specific reproduction steps or instrumentation rather than a code fix.
54
+
55
+ If >2, continue to [Decide Fix Approach](#decide-fix-approach).
56
+
57
+ ## E2E Triage Snapshot
58
+
59
+ Capture these details first so the investigation is reproducible. If the user hasn't provided them, ask.
60
+
61
+ - Failing test file and name
62
+ - GitHub Actions run URL to fetch the LLM report
63
+
64
+ ### Fetch the LLM Report
65
+
66
+ Downloads the `playwright-llm-report` artifact from a GitHub Actions run.
67
+
68
+ ```bash
69
+ bash scripts/fetch-llm-report.sh "<github-actions-url>"
70
+ ```
71
+
72
+ This downloads and extracts to `/tmp/playwright-llm-report-{runId}/`. The report is a single `llm-report.json` file.
73
+
74
+ ## Quick Classification (E2E)
75
+
76
+ For the full report schema, field reference, caps, and example reports:
77
+
78
+ 1. If the repo has `node_modules/@clipboard-health/playwright-reporter-llm/`, read `README.md` and `docs/example-report.json` from there — they are an exact version match for the report.
79
+ 2. Otherwise, fetch the latest docs from GitHub:
80
+ - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/README.md`
81
+ - `https://raw.githubusercontent.com/ClipboardHealth/core-utils/refs/heads/main/packages/playwright-reporter-llm/docs/example-report.json`
82
+
83
+ Cross-check the report's `schemaVersion` against the docs — if they disagree, the `main` docs describe a different version and some field semantics may not apply.
84
+
85
+ Read the docs if you need field semantics or limits; otherwise the field names used below are enough to drive the investigation.
86
+
87
+ Classify the flake to narrow the search space:
88
+
89
+ | Category | Signal | Timeline Pattern |
90
+ | -------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
91
+ | **Test-state leakage** | Retries or earlier tests leave auth, cookies, storage, or server state behind | `attempts[]` — different outcomes across retries |
92
+ | **Data collision** | "Random" identities aren't unique enough and collide with existing users/entities | `errors[]` — duplicate key or conflict errors |
93
+ | **Backend stale data** | API returned 200 but response body shows old state | `step(action)` → `network(GET, 200)` → `step(assert) FAIL` — API succeeded but data was stale |
94
+ | **Frontend cache stale** | No network request after navigation/reload for the relevant endpoint | `step(reload)` → `step(assert) FAIL` — no intervening network call for expected endpoint |
95
+ | **Silent network failure** | CORS, DNS, or transport error prevented the request from completing | `step(action)` → `console(error: "net::ERR_FAILED")` → `step(assert) FAIL` |
96
+ | **Render/hydration bug** | API returned correct data but component didn't render it | `network(GET, 200, correct data)` → `step(assert) FAIL` — no console errors |
97
+ | **Environment / infra** | Transient 5xx, timeouts, DNS/network instability | `network` entries with 5xx status; `consoleMessages[]` with connection errors |
98
+ | **Locator / UX drift** | Selector is valid but brittle against small UI changes | `errors[]` — locator/selector text in error message |
99
+
100
+ ## Analyze LLM Report
101
+
102
+ ### Walk the Timeline
103
+
104
+ **Use `attempts[].timeline[]` as the primary analysis view.** The timeline is a unified, `offsetMs`-sorted array of all steps, network requests, and console entries. Walk it to reconstruct the exact event sequence around the failure:
105
+
106
+ ```text
107
+ step(click "Submit") → network(POST /api/orders, 201) → step(waitForURL /confirmation) → console(error: "Cannot read property...") → step(expect toBeVisible) FAILED
108
+ ```
109
+
110
+ For each timeline entry:
111
+
112
+ - **`kind: "step"`** — test action with `title`, `category`, `durationMs`, `depth`, optional `error`
113
+ - **`kind: "network"`** — slimmed HTTP entry with `method`, `url`, `status`, and `networkId`. Resolve `networkId` against `attempts[].network.instances[]` for per-request detail (`durationMs`, `timings`, `traceId`/`spanId`/`requestId`/`correlationId`, `requestBodyRef`/`responseBodyRef`, redirect links), and against `attempts[].network.groups[instance.groupId]` for shared shape (`resourceType`, `failureText`, `wasAborted`, occurrence counts)
114
+ - **`kind: "console"`** — browser message with `type` (warning/error/pageerror/page-closed/page-crashed) and `text`
115
+
116
+ All entries share `offsetMs` (milliseconds since attempt start), giving a single temporal view.
117
+
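A minimal sketch of walking a timeline-shaped array into the one-line sequence shown above. The field names (`kind`, `offsetMs`, `title`, etc.) follow the schema described here; the entries themselves are fabricated for illustration.

```javascript
const timeline = [
  { kind: "step", offsetMs: 10, title: 'click "Submit"' },
  { kind: "network", offsetMs: 120, method: "POST", url: "/api/orders", status: 201 },
  { kind: "console", offsetMs: 340, type: "error", text: "Cannot read property..." },
  { kind: "step", offsetMs: 350, title: "expect toBeVisible", error: "timeout" },
];

// Render each entry by kind; entries are already offsetMs-sorted.
const line = timeline
  .map((e) => {
    if (e.kind === "network") return `network(${e.method} ${e.url}, ${e.status})`;
    if (e.kind === "console") return `console(${e.type}: "${e.text}")`;
    return `step(${e.title})${e.error ? " FAILED" : ""}`;
  })
  .join(" -> ");
console.log(line);
// -> step(click "Submit") -> network(POST /api/orders, 201) -> console(error: "Cannot read property...") -> step(expect toBeVisible) FAILED
```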
118
+ ### Compare pass vs fail (flaky tests)
119
+
120
+ If you don't have passing and failing attempts for the same test, skip to [Identify failing tests](#identify-failing-tests).
121
+
122
+ Walk the failed attempt's timeline and the passed attempt's timeline side-by-side to identify the first divergence point:
123
+
124
+ 1. Align both timelines by step title sequence
125
+ 2. Find the first step/network/console entry that differs between attempts
126
+ 3. The divergence answers "what was different this time?" directly
127
+
128
+ Common divergence patterns:
129
+
130
+ - **Same step, different network response** — backend returned different data (stale cache, race condition, eventual consistency)
131
+ - **Same step, network call missing in failed attempt** — frontend cache served stale data, or request was silently blocked
132
+ - **Same step, console error only in failed attempt** — CORS/network failure, or JS exception from unexpected state
133
+ - **Different step timing** — failed attempt took much longer before the assertion, suggesting resource contention or slow backend
134
+
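The alignment steps above can be sketched as a first-divergence finder. This is a simplification — it compares only step titles and error presence, where a real comparison would also walk interleaved network and console entries; the step arrays are made up.

```javascript
// Return the first index where the two attempts' steps differ, or null.
function firstDivergence(passed, failed) {
  const n = Math.min(passed.length, failed.length);
  for (let i = 0; i < n; i++) {
    const p = passed[i];
    const f = failed[i];
    if (p.title !== f.title || Boolean(p.error) !== Boolean(f.error)) {
      return { index: i, passed: p, failed: f };
    }
  }
  return null;
}

const passedSteps = [
  { title: "click Submit" },
  { title: "wait for /confirmation" },
  { title: "expect banner visible" },
];
const failedSteps = [
  { title: "click Submit" },
  { title: "wait for /confirmation", error: "timeout 5000ms" },
];
console.log(firstDivergence(passedSteps, failedSteps).index);
// -> 1: same step, error only in the failed attempt
```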
135
+ ### Identify failing tests
136
+
137
+ Filter `tests[]` for entries where `status` is `"failed"` or `flaky` is `true`. For each:
138
+
139
+ - **`errors[]`**: Contains clean error text with extracted assertion diffs and file/line location. This is usually enough to understand what went wrong.
140
+ - **`location`**: Source file, line, and column — jump straight to the code.
141
+ - **`attempts[]`**: Full retry history. Compare attempt outcomes, durations, and errors to see if the failure is consistent or intermittent.
142
+
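The filter is a one-liner over `tests[]`. The report object below is a stub with fabricated tests; real data comes from `llm-report.json`.

```javascript
const report = {
  tests: [
    { title: "checkout", status: "passed", flaky: false },
    { title: "login", status: "failed", flaky: false, errors: [{ message: "timeout" }] },
    { title: "search", status: "passed", flaky: true },
  ],
};

// Keep outright failures plus tests that passed only on retry.
const suspects = report.tests.filter((t) => t.status === "failed" || t.flaky);
console.log(suspects.map((t) => t.title));
// -> [ 'login', 'search' ]
```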
143
+ ### Examine attempts for retry patterns
144
+
145
+ For each attempt, compare `status`, `durationMs`, and `error` across retries — timing or error-shape differences between attempts often point at the trigger.
146
+
147
+ **Always decode `failureArtifacts.screenshotBase64` when present.** The page state at failure often reveals modals, loading spinners, error banners, or unexpected navigation that the assertion text alone doesn't explain.
148
+
149
+ ### Inspect network activity and extract trace IDs
150
+
151
+ Scan `attempts[].network.instances[]` for 4xx/5xx responses near the failure's `offsetMs` and use per-instance `timings` to isolate slow phases (DNS, connect, wait, receive). Join each instance to its group via `attempts[].network.groups[instance.groupId]` for shape-level signal (`failureText`, `wasAborted`, `resourceType`, `occurrenceCount`). Resolve payloads via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when the body matters.
152
+
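The instance → group → body joins described above look roughly like this. Field names follow the report schema; the ids, timings, and payloads are fabricated.

```javascript
const network = {
  instances: [
    {
      networkId: "n1",
      method: "GET",
      url: "/api/cart",
      status: 500,
      groupId: "g1",
      responseBodyRef: "b1",
      timings: { wait: 4200 },
    },
  ],
  groups: { g1: { resourceType: "fetch", failureText: null, wasAborted: false, occurrenceCount: 3 } },
  bodies: { b1: '{"error":"upstream timeout"}' },
};

// Pick the failing instance, then join shape and payload by id/ref.
const inst = network.instances.find((i) => i.status >= 500);
const detail = {
  ...inst,
  group: network.groups[inst.groupId],
  responseBody: network.bodies[inst.responseBodyRef],
};
console.log(detail.responseBody);
// -> {"error":"upstream timeout"}
```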
153
+ **`traceId`** — when present on a failing instance (`attempts[].network.instances[].traceId`), you must follow [`./datadog-apm-traces.md`](./datadog-apm-traces.md) to correlate with backend behavior. This is the bridge between frontend test failure and potential backend root cause.
154
+
155
+ Check `attempts[].network.summary` for saturation. Non-zero `instancesDroppedByGroupCap`, `instancesDroppedByInstanceCap`, or `instancesEvictedAfterAdmission` means retained content is a sample and the request you care about may have been dropped — note this as a confidence-reducing factor. `instancesDroppedByFilter` alone is expected (static assets are filtered by design). v3 caps: instances 500, groups 200, bodies 100 per attempt.
156
+
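The saturation check reduces to summing the three cap/eviction counters (names from the summary fields above; counts fabricated):

```javascript
const summary = {
  instancesDroppedByFilter: 120, // expected: static assets filtered by design
  instancesDroppedByGroupCap: 0,
  instancesDroppedByInstanceCap: 7,
  instancesEvictedAfterAdmission: 0,
};

// Any non-zero cap/eviction counter means the capture is a sample.
const sampled =
  summary.instancesDroppedByGroupCap +
    summary.instancesDroppedByInstanceCap +
    summary.instancesEvictedAfterAdmission >
  0;
console.log(sampled);
// -> true: note this as a confidence-reducing factor
```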
157
+ ### Review test steps
158
+
159
+ Prefer the timeline view above which interleaves steps with network and console. Fall back to `tests[].attempts[].steps[]` directly when you need the full nesting hierarchy via `depth`.
160
+
161
+ ## Evidence Standard (E2E)
162
+
163
+ Do not propose a fix without concrete artifacts. At minimum, include:
164
+
165
+ - One **error artifact** — from `tests[].errors[]` (assertion diff, timeout message) or a trace/log entry
166
+ - One **network artifact** — an instance from `attempts[].network.instances[]` (status, timing, trace ids) joined to its group via `attempts[].network.groups[instance.groupId]` (shape, `failureText`/`wasAborted`, occurrence counts), plus the body via `attempts[].network.bodies[instance.requestBodyRef | instance.responseBodyRef]` when relevant
167
+ - A **specific code path** that consumed that state — use `tests[].location` to jump to the source
168
+ - When available: **screenshot** from `failureArtifacts.screenshotBase64` showing page state at failure
169
+ - When available: **Datadog trace** via `attempts[].network.instances[].traceId` showing backend behavior for the failing request
170
+ - A **confidence score** from 1 to 5 rating how certain you are in the root cause diagnosis
171
+
172
+ If confidence is 2 or below, do not propose a code fix. Instead, recommend specific instrumentation or reproduction steps to raise confidence.
173
+
174
+ If >2, continue to [Decide Fix Approach](#decide-fix-approach).
175
+
176
+ ### Confidence Score
177
+
178
+ Rate your confidence in the root cause on a 1-5 scale. Report this score alongside your evidence.
179
+
180
+ | Score | Meaning | Criteria |
181
+ | ----- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
182
+ | **5** | Certain | Root cause is directly visible in artifacts (e.g., assertion diff shows stale data, network response confirms 5xx, screenshot shows error banner) |
183
+ | **4** | High confidence | Evidence strongly supports the diagnosis but one link in the chain is inferred rather than observed (e.g., timeline shows the right sequence but no Datadog trace to confirm backend behavior) |
184
+ | **3** | Moderate confidence | Evidence is consistent with the diagnosis but alternative explanations remain plausible. Flag the alternatives explicitly |
185
+ | **2** | Low confidence | Limited evidence, mostly reasoning from code patterns rather than observed artifacts. Recommend gathering more data before committing to a fix |
186
+ | **1** | Speculative | No direct evidence for the root cause. The fix is a best guess. Recommend reproducing the failure locally or adding instrumentation before proceeding |
187
+
188
+ ## Decide Fix Approach
189
+
190
+ Applies to all test types.
191
+
192
+ Choose one of these approaches in priority order:
193
+
194
+ 1. **Validate scenario realism first.** Is the failure path possible for real users, or is it purely a test-setup artifact? If not user-realistic, prioritize test/data/harness fixes over product changes.
195
+
196
+ 2. **Test harness fix** (when the failure is non-product):
197
+ - Reset cookies, storage, and session between retries
198
+ - Isolate test data; generate stronger unique identities
199
+ - Make retry blocks idempotent
200
+ - Wait on deterministic app signals, not arbitrary sleeps
201
+ - (Service tests) Close connections and app properly in `afterAll`
202
+ - (Component tests) Flush pending state updates and timers before asserting
203
+ - (Unit tests) Reset shared mutable state in `beforeEach`
204
+
205
+ 3. **Product fix** (when real users would hit the same issue):
206
+ - Handle stale or intermediate states safely
207
+ - Make routing/render logic robust to eventual consistency
208
+ - Add telemetry for ambiguous transitions
209
+
210
+ 4. **Both** if user impact exists _and_ tests are fragile.
211
+
212
+ ## Plan Output Format
213
+
214
+ Produce the plan with these fields:
215
+
216
+ - **Test ID:** if provided in prompt
217
+ - **Agent session ID:** your running session ID to resume if needed
218
+ - **Confidence:** score (1-5) with brief justification
219
+ - **Symptom:** what failed and where
220
+ - **Root cause:** concise technical explanation
221
+ - **Evidence:** artifacts supporting the diagnosis (traces, network, error messages, screenshots as applicable)
222
+ - **Proposed fix:** test harness, product, or both — with the specific file(s) and the change you would make
223
+ - **Sibling candidates:** files that appear to share the same anti-pattern, for the reviewer (or fix.md) to confirm. Or "N/A -- fix is test-specific" if the issue is one-off (see [`fix.md`](./fix.md) for what counts as a structural anti-pattern worth searching for).
224
+ - **Validation plan:** lint/typecheck commands and test commands to run after applying the fix
225
+ - **Open questions:** anything that needs human input before fixing
226
+ - **Residual risk:** what could still be flaky after the fix
@@ -4,59 +4,43 @@ description: Implement a plan file or direct request end-to-end, then hand off t
4
4
 
5
5
  # Go: Execute Work and Ship It
6
6
 
7
- Given a plan file or direct implementation request, implement the requested work, then invoke the `commit-push-pr` skill to commit, push, and open a PR.
7
+ Given a plan file or direct implementation request, implement the requested work, then invoke the `commit-push-pr` skill to create a PR.
8
8
 
9
9
  This skill is the bridge between intent and shipping. It does not re-plan or re-design unless the user asked for that. If the request is wrong or cannot be implemented safely, surface that; do not silently improvise.
10
10
 
11
- ## Phase 1: Understand the Input
11
+ ## Phase 1: Resolve the Input
12
12
 
13
- The user may invoke this skill with either a plan file path/identifier or a direct implementation request such as "add a button to the homepage to log in." Resolve the input in this order:
13
+ The user invokes this skill with either a plan file path/identifier or a direct request like "add a login button."
14
14
 
15
- 1. **Absolute path** (starts with `/`): read it directly only when it is inside the current workspace/repo root or the host's plans directory. If it is outside those roots, ask the user to confirm before reading it; if they do not confirm, stop and ask for an approved path.
16
- 2. **Relative path** (contains `/` or starts with `./`): resolve from the current working directory.
17
- 3. **Bare name** (no path separators, e.g. `my-plan` or `my-plan.md`): if the host exposes a plans directory, look there first. Use whatever directory the current host stores plan files in — do not hard-code a host-specific path. If no plan is found and the input is ambiguous, ask whether it is a plan identifier or an implementation request.
18
- 4. **Natural-language request** (for example, contains spaces or reads like an implementation task): treat it as the implementation request. There may be no separate plan file.
19
- 5. **No argument given**: ask the user to provide either the plan file path or the implementation request. Do not guess.
15
+ - **Path-like input** (absolute, relative, or bare name): resolve the path first; for bare names, check the host's plans directory. If the resolved path lies outside the workspace/repo or the host's plans directory, ask the user to confirm before reading it (applies to both absolute and relative inputs — `..` escapes count). If the input doesn't resolve to a readable file, stop and tell the user; do not reinterpret a missing path as a direct request.
16
+ - **Natural-language request**: treat as the implementation task. There may be no plan file.
17
+ - **No argument**: ask for either a plan path or the implementation request.
20
18
 
21
- If a path-like input does not exist or cannot be read, stop and tell the user. Do not reinterpret a missing path as a direct request.
19
+ When using a plan, read it fully before starting. Note **Critical files**, **Approach**, and **Verification** sections (or equivalents).
22
20
 
23
- When using a plan, read the entire plan before starting. Note the **Critical files**, **Approach**, and **Verification** sections (or their equivalents).
21
+ ## Phase 2: Implement
24
22
 
25
- ## Phase 2: Implement the Request
23
+ The plan or request is the source of truth for scope:
26
24
 
27
- Treat the plan or direct request as the source of truth for scope:
28
-
29
- Make exactly the changes requested: no more, no less. Do not add extra refactors, tests, or cleanup the request did not ask for.
30
- - If a plan references files, functions, or utilities that no longer exist, stop and surface the discrepancy before guessing.
31
- - If you discover the request is wrong (the codebase has moved, an assumption is invalid, a constraint was missed), stop and report. Do not silently rewrite the approach.
32
- - For direct implementation requests, inspect the relevant code before editing and use the repo's existing patterns.
33
- - Do **not** modify the plan file itself when working from a plan.
25
+ - Make exactly the changes requested: no extra refactors, tests, or cleanup.
26
+ - If the plan references files or utilities that no longer exist, stop and report.
27
+ - If an assumption is invalid or a constraint was missed, stop and report. Do not silently rewrite the approach.
28
+ - Do not modify the plan file itself.
34
29
 
35
30
  ## Phase 3: Validate
36
31
 
37
- Validation should follow the repo's instructions and the cost profile of the repo. Some repos intentionally leave slow checks to CI; do not run broad or slow suites unless the repo instructions explicitly require them.
38
-
39
- Run validation in this order:
40
-
41
- 1. **Repo instructions.** Look up the repo's standard validation guidance. Common signals, in priority order:
42
- - An explicit instruction in the repo's agent or contributor instructions, such as `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, or an equivalent host-specific instruction file (e.g. "MUST run before opening PRs"). This wins over everything else.
43
- - A relevant `package.json` script when the repo instructions point to it, such as `affected`, `precommit`, `validate`, `check`, `verify`, `test`, `lint`, `typecheck`, or repo-equivalent.
44
- - A repo-specific equivalent in the rare case the project does not use `package.json`.
45
-
46
- If the repo primarily relies on git pre-commit or pre-push hooks and does not mandate a manual validation command, do not invent one. Let the hooks run during the `commit-push-pr` handoff.
47
-
48
- 2. **Request-specific verification.** If the plan or user request names specific checks, run them when they are practical and consistent with the repo instructions. If a requested check is clearly a slow CI-only suite, ask before running it.
49
-
50
- If any check you run fails, fix the failures and re-run the same check. Do not hand off to `commit-push-pr` with known failing checks unless the user explicitly accepts a pre-existing or unrelated failure.
32
+ - Read `AGENTS.md`, `CLAUDE.md`, `CONTRIBUTING.md`, or equivalent contributor instructions for the mandated pre-PR command (e.g. `npm run affected`). That wins over everything.
33
+ - If the repo relies on pre-commit/pre-push hooks and mandates no manual command, don't invent one — let the hooks run during the `commit-push-pr` handoff.
34
+ - If the plan names specific checks, run them when practical. Ask before running anything that's clearly a slow CI-only suite.
51
35
 
52
- ## Phase 4: Hand Off to commit-push-pr
36
+ Fix any failures before handing off. Do not hand off with known failing checks unless the user explicitly accepts a pre-existing or unrelated failure.
53
37
 
54
- Once implementation is complete and Phase 3 validation is satisfied, invoke the `commit-push-pr` skill by name using this agent's normal skill invocation mechanism.
38
+ ## Phase 4: Hand Off
55
39
 
56
- Do **not** commit, push, or open the PR yourself in this skill. Let `commit-push-pr` own the shipping flow.
40
+ Invoke the `commit-push-pr` skill. Do not commit, push, or open the PR yourself.
57
41
 
58
- If `commit-push-pr` reports `nothing to ship.` (the work resulted in no actual file changes), surface that to the user — it usually means the request was already implemented or was a no-op.
42
+ If `commit-push-pr` reports `nothing to ship.`, surface that.
59
43
 
60
44
  ## Phase 5: Final Output
61
45
 
62
- After `commit-push-pr` returns, ensure the user sees one short final response with the branch name and the full PR URL it produced. If the current host already surfaced that response, do not duplicate it. If `commit-push-pr` reported nothing to ship, say so plainly.
46
+ Ensure the user sees the branch name and full PR URL.
@@ -14,12 +14,11 @@ Structure and write Linear bug reports from evidence that already exists in the
14
14
  ## Process
15
15
 
16
16
  1. **Gather context** — collect evidence from the conversation: investigation findings, user reports, Datadog links, error details
17
- 2. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the symptom description, error messages, and affected service/endpoint. If duplicates are found, present them to the user before proceeding.
18
- 3. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected — ask before drafting. NEVER invent answers. Up to 3 rounds.
19
- 4. **Draft** — title + description, structure scaled to complexity (see format below)
20
- 5. **Self-review** — check every Red Flag below before presenting
21
- 6. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
22
- 7. **Create in Linear** — only after explicit approval
17
+ 2. **Clarify (conditional)** — if missing: (a) expected behavior, (b) actual behavior, or (c) who's affected — ask before drafting. NEVER invent answers. Up to 3 rounds.
18
+ 3. **Draft** — title + description, structure scaled to complexity (see format below)
19
+ 4. **Self-review** — check every Red Flag below before presenting
20
+ 5. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
21
+ 6. **Create in Linear** — only after explicit approval
23
22
 
24
23
  ## Hard Rules
25
24
 
@@ -15,14 +15,13 @@ Draft Linear feature request tickets that describe what users need and why — n
15
15
  - **1-2 factual gaps** (missing repo, unclear who) → ask the user directly. Don't dispatch the full interview for a single missing data point.
16
16
  - **Structural problems** (solution-shaped framing, no problem articulated, mostly unknowns) → dispatch `interview-feature` skill. Receive a structured problem brief. Re-check gate against the brief.
17
17
  - If `interview-feature` terminates without producing a problem brief (user refused to articulate a problem), abort the ticket process. Inform the user that the ticket cannot be created without a problem statement.
18
- 3. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the problem's key terms, affected users, and domain area. If duplicates or closely related tickets are found, present them to the user and ask whether to proceed, merge with an existing ticket, or stop.
19
- 3. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check; it doesn't blindly trust upstream context.
20
- 5. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
21
- 6. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
22
- 7. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
23
- 8. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
24
- 9. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
25
- 10. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
18
+ 3. **Final validation** — run the checklist below before drafting. This is the ticket skill's own quality check; it doesn't blindly trust upstream context.
19
+ 4. **Assess scope** — does the problem contain multiple independent user-facing outcomes? If so, decompose into parent + sub-issues, each describing one outcome. Decomposition is about what the user gets, not how the engineer builds it.
20
+ 5. **Draft** — title + description, structure scaled to complexity (see Ticket Format below)
21
+ 6. **Self-review** — check every item in [red-flags.md](red-flags.md) before presenting
22
+ 7. **Suggest metadata (conditional)** — priority (Urgent/High/Medium/Low/No Priority), labels, project when context supports it. Present metadata suggestions BELOW the ticket body, separate from the description.
23
+ 8. **Present for review** — show the draft to the user. Wait for explicit approval before proceeding.
24
+ 9. **Create in Linear** — once the user approves (or approves with changes), create the ticket in Linear using the Linear MCP tools. For sub-issues, create parent first, then children linked to it. Apply any confirmed metadata. NEVER create without user approval.
26
25
 
27
26
  ## Final Validation Checklist
28
27
 
@@ -21,11 +21,10 @@ Draft Linear tech debt tickets that justify _why_ the debt matters — cost to c
21
21
  - Maintainability/DX → `git log` for change frequency and bug-fix commits, grep for workarounds
22
22
  - Security → check dependency versions, scan for vulnerability patterns
23
23
  5. **Assess interest & risk** — produce structured ratings with evidence (see reference.md for rating framework)
24
- 6. **Check for duplicates** — follow the `linear-duplicate-finder` process (read its [SKILL.md](../linear-duplicate-finder/SKILL.md)). Generate search queries from the debt's key terms, component names, and domain area. If duplicates are found, present them to the user before proceeding.
25
- 7. **Draft** — title + description, structure scaled to complexity (see format below)
26
- 8. **Self-review** — check every Red Flag below before presenting
27
- 9. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
28
- 10. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
24
+ 6. **Draft** — title + description, structure scaled to complexity (see format below)
25
+ 7. **Self-review** — check every Red Flag below before presenting
26
+ 8. **Present for review** — show ONLY the draft and metadata suggestions. Ask for team/assignee.
27
+ 9. **Create in Linear** — only after explicit approval. Apply `technical-debt` label.
29
28
 
30
29
  ## Hard Rules
31
30
 
@@ -1,101 +0,0 @@
1
- ---
2
- name: linear-duplicate-finder
3
- description: Use when checking if a Linear ticket already exists before creating one. Searches across teams, archived tickets, and multiple phrasings to find duplicates and related tickets. Dispatched by ticket-writing skills before creation.
4
- ---
5
-
6
- # Linear Duplicate Finder
7
-
8
- Search Linear for duplicate or related tickets before creating a new one. Cast a wide net across teams, statuses, and phrasings, then classify matches by similarity.
9
-
10
- ## Inputs
11
-
12
- 1. **A Linear ticket ID** (e.g., "ENG-1234") — fetch its details and search for duplicates.
13
- 2. **A ticket title + description** — search Linear directly for matches.
14
- 3. **Multiple ticket IDs** — cross-reference against each other and the backlog.
15
-
16
- ## Process
17
-
18
- ### Step 1: Understand the Source
19
-
20
- If given a ticket ID:
21
-
22
- - Fetch the ticket using `mcp__linear__get_issue` with `includeRelations: true` to see if duplicates are already marked.
23
- - Extract the title, description, labels, team, and project.
24
-
25
- If given a title/description:
26
-
27
- - Parse the key concepts, features, and domain terms.
28
-
29
- ### Step 2: Generate Search Queries
30
-
31
- Break the ticket down into multiple search angles:
32
-
33
- - **Exact title keywords**: Most distinctive terms from the title.
34
- - **Core concept**: The fundamental ask, using different phrasings.
35
- - **Domain/feature area**: The feature area or system component involved.
36
- - **Synonyms and alternative phrasings**: 2-3 alternative ways to describe the same thing.
37
-
38
- ### Step 3: Execute Searches
39
-
40
- Run multiple `mcp__linear__list_issues` searches in parallel using the `query` parameter with different search terms:
41
-
42
- - Use `limit: 50` to cast a wide net.
43
- - Include `includeArchived: true` to catch completed or cancelled tickets.
44
- - Filter by team when known, but also do at least one cross-team search.
45
-
46
- Also use `mcp__linear__query_data` with natural language queries for concept-based matching.
47
-
48
- Run at least 3-5 different searches with varied query terms.
49
-
50
- ### Step 4: Analyze and Score Results
51
-
52
- | Dimension | Weight | Description |
53
- | ----------------------- | ------ | ------------------------------------------------------ |
54
- | **Title similarity** | High | Do the titles describe the same thing? |
55
- | **Description overlap** | High | Do the descriptions reference the same problem? |
56
- | **Same feature area** | Medium | Are they about the same system/feature? |
57
- | **Same team/project** | Low | Same team increases likelihood but isn't required. |
58
- | **Status** | Info | Cancelled/completed duplicates are still worth noting. |
59
-
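The weighting table above can be sketched numerically (the High/Medium/Low-to-number mapping is an illustrative assumption; Status is informational only, so it is not scored):

```python
WEIGHTS = {  # assumed numeric stand-ins for High/High/Medium/Low
    "title_similarity": 3.0,
    "description_overlap": 3.0,
    "same_feature_area": 2.0,
    "same_team_or_project": 1.0,
}

def similarity_score(ratings: dict[str, float]) -> float:
    """Combine per-dimension ratings (each 0.0-1.0) into a single
    weighted score, normalized back to the 0.0-1.0 range."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * ratings.get(d, 0.0) for d in WEIGHTS) / total

score = similarity_score({
    "title_similarity": 0.9,
    "description_overlap": 0.8,
    "same_feature_area": 1.0,
    "same_team_or_project": 0.0,
})
```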
60
- ### Step 5: Classify Matches
61
-
62
- - **Duplicate**: Exact same work. Creating both = redundant effort.
63
- - **Closely Related**: Overlapping scope — completing one partially addresses the other. Should cross-reference.
64
- - **Same Area**: Same domain but different aspects. Useful context, not duplicates.
65
-
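Given a combined similarity score, the three classes above could be bucketed like this (the cutoffs are illustrative assumptions, not part of the skill):

```python
def classify(score: float) -> str:
    """Bucket a 0.0-1.0 similarity score into the three match classes."""
    if score >= 0.8:
        return "Duplicate"        # exact same work
    if score >= 0.5:
        return "Closely Related"  # overlapping scope; cross-reference
    return "Same Area"            # useful context, not a duplicate
```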
66
- ### Step 6: Present Results
67
-
68
- ```text
69
- ## Duplicate Detection Results
70
-
71
- ### Source
72
- **[ID] Title** or **Potential ticket**: "description"
73
-
74
- ### Duplicates Found
75
- 1. **[TEAM-123] Title** — Status: In Progress
76
- - **Why**: [specific overlap]
77
- - **Key difference**: [if any]
78
-
79
- ### Closely Related
80
- 1. **[TEAM-456] Title** — Status: Backlog
81
- - **Overlap**: [what's shared]
82
- - **Difference**: [what's distinct]
83
-
84
- ### Same Area (Context)
85
- 1. **[TEAM-789] Title** — Status: Done
86
- - **Relevance**: [why worth noting]
87
-
88
- ### Recommendation
89
- [Proceed, merge with existing, or add context to related ticket?]
90
- ```
91
-
92
- If no duplicates are found, say so clearly and recommend proceeding.
93
-
94
- ## Guidelines
95
-
96
- - **Wide net, then narrow.** Better to surface a false positive than miss a real duplicate.
97
- - **Search across teams.** Duplicates often live on different teams.
98
- - **Check archived/cancelled tickets.** May contain valuable context about why work was previously rejected.
99
- - **Look at different time ranges.** Duplicates can be months old.
100
- - **Be specific.** Don't say "similar title" — explain exactly what overlaps and differs.
101
- - **When in doubt, include it.** A false positive is cheap; a missed duplicate wastes engineering effort.