npm - @exodus/xqa - Versions diffs - 5.0.0 → 5.2.0 - Mend

@exodus/xqa 5.0.0 → 5.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5)

package/README.md +3 -0
package/dist/skills/xqa-test-plan/SKILL.md +720 -53
package/dist/skills/xqa-test-plan/SKILL.test.md +122 -0
package/dist/xqa.cjs +1195 -770
package/package.json +3 -3

package/dist/skills/xqa-test-plan/SKILL.md CHANGED

@@ -12,16 +12,104 @@ license: MIT
 - User says "what should I test", "generate a test plan", "QA my branch", "test my changes before pushing", "what should I QA?"
 - Self-activate on implied intent when the user is asking for pre-push manual verification on the current branch
+## Voice
+Narrate as a peer engineer, not a CLI. Direct short phrases. Avoid mechanical state dumps ("Auto-selecting UDID", "Invoking planner", "Rendering correlated report"). When uncertain, ask one short question.
+## Asking the user
+### Platform detection
+MUST inspect tool list at conversation start. `AskUserQuestion` present → use it at every gate. Absent → use `### Platform fallback` format. Never attempt `AskUserQuestion` when absent (produces mid-conversation tool-call error). Detect once; apply for whole session.
+### Gate rules
+MUST use `AskUserQuestion` (when available) or platform-fallback (when not) at every decision gate. No other form permitted.
+Decision gates (no exceptions):
+- Plan approval (after rendering checklist)
+- Run-go (after plan approval, before first dispatch; confirms app state)
+- Simulator selection (>1 booted)
+- Sim state (multi-profile, before first dispatch)
+- Transition-ready between scenario groups
+- Destructive-delete seed-backup confirmation
+- Existing plan: Rerun / Regenerate / Extend
+- Post-interruption: Resume / Report / Abort
+- Existing PR test plan: Use / Enrich / Regenerate
+- Update PR: write plan + tACK / write plan only / skip (merged from former write-back + self-tACK post gates)
+- Self-tACK update (fires only when prior tACK exists and content differs)
+Free-text prompts allowed ONLY for:
+- Spec edit feedback (user's own words)
+- Narrated acknowledgements (no decision needed)
+`AskUserQuestion` auto-adds "Other". MUST parse free-form replies ("approve" / "go" / "ready") identically.
+### Forbidden gate phrasings (canonical)
+MUST NOT emit at any gate:
+- slash-delimited option lists ("Reply approve / run it", "use / enrich / regenerate")
+- numbered option lists ("(1) Option A (2) Option B", "Type 1 for X")
+- "Reply [word]" / "Say [word]" / "Pick [word]" in any form
+- "say go when", "let me know when", "enter your choice"
+- bare "Go?" as standalone sentence
+- any sentence ending with slash-delimited choice list
+- "describe edits" as prompt suffix
+- "(y/n)" or "yes/no" choice suffix
+- "write to PR?" + "post tACK?" as separate consecutive prompts — these are the merged Update PR gate; emit one combined gate, not two
+- "should I update the PR and post a tACK?" as free text — use Update PR gate form
+SELF-CHECK before emitting any gate text: if output matches any pattern above, STOP — apply gate form instead. Rule applies to skill output only, not to vocabulary tables describing accepted user replies.
+Each gate section below assumes this canonical list. Gate sections do not re-enumerate forbidden patterns.
+### Platform fallback
+When `AskUserQuestion` unavailable, every gate MUST use this format:
+```
+**<Gate name>**
+- **<Option A>** — <one-phrase description>
+- **<Option B>** — <one-phrase description>
+_(or describe your situation)_
+```
+Rules:
+- First line: bold gate name, no punctuation, no question mark
+- One bullet per option: bold label, em-dash, description ≤8 words
+- Last line: exactly `_(or describe your situation)_`
+- No numbered lists
+- Canonical forbidden phrasings apply here too
+Example — plan approval gate in fallback mode:
+```
+**Plan approval**
+- **Approve** — run as-is
+- **Discuss** — describe changes
+_(or describe your situation)_
+```
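The fallback format above is mechanical enough to render from data. A minimal sketch, assuming a hypothetical `render_gate` helper (not part of the skill or the CLI) that takes a gate name and `(label, description)` pairs; the dash required by the format is written as the `\u2014` escape:

```python
from typing import List, Tuple

DASH = "\u2014"  # the dash character the fallback format requires between label and description
CLOSING_LINE = "_(or describe your situation)_"


def render_gate(name: str, options: List[Tuple[str, str]]) -> str:
    """Render a platform-fallback gate. Hypothetical helper for illustration only."""
    for _label, description in options:
        # Format rule: each description is at most 8 words.
        if len(description.split()) > 8:
            raise ValueError(f"description longer than 8 words: {description!r}")
    lines = [f"**{name}**"]  # bold gate name, no punctuation
    lines += [f"- **{label}** {DASH} {description}" for label, description in options]
    lines.append(CLOSING_LINE)  # fixed last line
    return "\n".join(lines)
```

Calling `render_gate("Plan approval", [("Approve", "run as-is"), ("Discuss", "describe changes")])` reproduces the plan approval example above.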
 ## Process
 ```
 Detect state → Generate → Approve → Run → Report → Re-run / Regenerate / Extend
 ```
-IMPORTANT: The skill orchestrates the CLI; it never writes `.test.md` files directly. Every spec mutation goes through `xqa plan` or `xqa plan edit`.
+MUST orchestrate CLI only; skill MUST NOT write `.test.md` directly. Every spec mutation goes through `xqa plan` or `xqa plan edit`.
 ## Detect state
+MUST suppress intermediate state narration. Emit one opening sentence naming branch and chosen simulator by name (not UDID), then transition into the plan. Template: "Looking at `<branch>` — using `<sim-name>`, which is already booted. Here's what I'd verify before pushing:"
 Resolve the current branch and its plan directory before anything else.
 ```bash
@@ -44,63 +132,207 @@ Plan directory: `.xqa/test-plan/<slug>/`.
 ### Auto-prune stale siblings
-List every child of `.xqa/test-plan/`. For each sibling directory:
+Slug → branch mapping is lossy (non-`[a-zA-Z0-9._-]` chars collapse to `-`). MUST NOT invert — iterate forward instead.
-1. Decode the slug back to an approximate branch name. WARNING: slug decoding is ambiguous (dashes could have been slashes or other chars). Treat the approximation as best-effort.
-2. Probe the local branch list:
+1. List all local branches:
    ```bash
-   git branch --list '<approximate-name>'
+   git for-each-ref refs/heads --format='%(refname:short)'
    ```
-3. If the probe is empty, the directory is stale — `rm -rf` it.
+2. Apply `branchToSlug` (slug rules table above) to each branch name. Build the set of live slugs.
+3. List every child of `.xqa/test-plan/`. Directory is stale if its name is NOT in the live-slug set.
+4. `rm -rf` stale directories only.
-IMPORTANT: Never prune the current branch's slug directory, even if the approximate-name probe looks empty.
-IMPORTANT: Auto-prune only removes directories whose approximation has NO matching local branch. When in doubt, leave it.
+MUST NOT prune current branch's slug directory regardless of slug-set membership.
+MUST leave any directory whose stale-status is uncertain (e.g. slug computation fails on a branch name).
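The forward-iteration rule can be sketched as a pure function over names. This is an illustration under assumptions: `branch_to_slug` is a hypothetical stand-in for `branchToSlug` that replaces each run of characters outside `[a-zA-Z0-9._-]` with one `-` (the real slug rules table is not shown in this hunk), and `stale_slugs` is an invented helper name.

```python
import re
from typing import Iterable, List


def branch_to_slug(branch: str) -> str:
    """Hypothetical stand-in for branchToSlug (assumed rule, see lead-in)."""
    return re.sub(r"[^a-zA-Z0-9._-]+", "-", branch)


def stale_slugs(plan_dirs: Iterable[str], branches: Iterable[str], current_branch: str) -> List[str]:
    """Return plan directory names with no live branch behind them."""
    live = {branch_to_slug(name) for name in branches}
    # The current branch's slug is always treated as live, so it is never pruned.
    live.add(branch_to_slug(current_branch))
    return sorted(name for name in plan_dirs if name not in live)
```

Both `feat/login` and `feat-login` map to `feat-login`, which is why the mapping cannot be inverted and the live set is built forward from the branch list instead.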
 ### Existing plan detection
 Check `.xqa/test-plan/<slug>/` for `*.test.md` files.
-| State                 | Next action                                                |
-| --------------------- | ---------------------------------------------------------- |
-| No specs              | Proceed to Generate flow                                   |
-| Specs already present | Offer three choices: **Rerun**, **Regenerate**, **Extend** |
+| State                 | Next action                                 |
+| --------------------- | ------------------------------------------- |
+| No specs              | Proceed to Generate flow                    |
+| Specs already present | Use `AskUserQuestion` — see gate spec below |
-## Generate flow
+Gate form (canonical forbidden phrasings apply). Fallback header: "Plan exists"; options: Rerun / Regenerate / Extend.
-### 1. Summarize intent
+`AskUserQuestion` gate for existing plan:
-Scan the last ~5 turns of chat history. Compress the user's stated goal into a single sentence suitable for `--intent`. If no intent emerges, pass an empty string.
+- question: "A plan already exists for this branch. What do you want to do?"
+- header: "Plan exists"
+- options:
+  - `Rerun` — "Run existing specs as-is"
+  - `Regenerate` — "Discard specs and re-plan from current diff"
+  - `Extend` — "Add scenarios for new commits"
-### 2. Detect booted simulators
+## PR detection
+### Parallel probes at Generate start
+Probes are independent — MUST emit in single Bash batch (parallel tool use), not sequentially. Wait for all three before planner invocation.
 ```bash
+# Probe 1 — branch slug (already resolved in Detect state; re-use the value)
+# Probe 2 — booted simulator list
 xcrun simctl list devices booted --json
+# Probe 3 — open PR for current branch
+gh pr view --json number,body,url,headRefName,baseRefName,author 2>/dev/null
 ```
-Count booted devices:
+If `gh` is not installed or not authenticated, or the branch has no open PR, Probe 3 returns a non-zero exit or empty output — treat this as "no PR". Skip all PR-integration steps silently and proceed with the existing local-only flow. Never fabricate a PR number or URL.
+### PR test plan quality evaluation
+After fetching the PR body, locate the test plan section using the same heading variants as tack-pr (in priority order):
+- `## Test plan` / `### Test plan` / `**Test plan**`
+- `## Testing` / `### Testing`
+- `## Test Plan` / `## QA steps`
+Extract every `- [ ] ...` and `- [x] ...` line under that heading, up to the next heading or end of body. Let M = item count.
+Compute three diagnostic signals mechanically. No qualitative judgment.
+**Signal A — Change coverage.** Build changed-surface token set via `git diff <base>..HEAD --name-only` + extraction rules below. For each item: lowercase, replace non-alphanumeric with spaces, tokenize. Match if any changed-surface token appears in item tokens. Report: `N_covered of M items reference a changed surface`.
+**Signal B — CI-step contamination.** Match items (case-insensitive, word-boundary) against CI-step token list below + shape `run {test|lint|build|typecheck|coverage}`. Report: `K items look like CI steps (not manual QA)`.
+**Signal C — Placeholder tokens.** Flag items with `TODO`, `TBD`, or `???` as a whole token. Report: `P items have placeholder tokens (TODO/TBD/???)`.
+Identical PR bodies MUST produce identical diagnostics. No model judgment beyond the three signals.
+### Changed surface extraction
+From the diff (via `git diff <base>..HEAD --name-only`):
+- Filter to paths under `app/`, `apps/`, `src/`, `packages/` (adjust to project convention if obvious from tree structure)
+- Exclude test files (`**/__tests__/**`, `*.test.*`, `*.spec.*`)
+- Exclude config/build files (`*.json`, `*.yaml`, `*.toml`, `*.config.*`)
+- For each remaining path: extract basename (minus extension) and immediate parent directory name as tokens
+- Union all tokens to produce the "changed-surface token set"
+Token matching against a test-plan item:
+- Lowercase the item text; replace non-alphanumeric with spaces; tokenize by whitespace
+- Item references a changed surface if ANY changed-surface token appears in the item's token set
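The extraction and matching rules above can be sketched as follows. Assumptions not stated in the source: changed-surface tokens are lowercased before comparison, "under `app/`" means the first path component, and the function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, Set

SOURCE_ROOTS = {"app", "apps", "src", "packages"}
EXCLUDED_NAMES = ("*.test.*", "*.spec.*", "*.json", "*.yaml", "*.toml", "*.config.*")


def changed_surface_tokens(paths: Iterable[str]) -> Set[str]:
    """Build the changed-surface token set from `git diff --name-only` output lines."""
    tokens: Set[str] = set()
    for raw in paths:
        path = PurePosixPath(raw)
        parts = path.parts
        if len(parts) < 2 or parts[0] not in SOURCE_ROOTS:
            continue  # outside the source roots
        if "__tests__" in parts[:-1]:
            continue  # test directory
        if any(fnmatchcase(path.name, glob) for glob in EXCLUDED_NAMES):
            continue  # test, config, or build file
        tokens.add(path.stem.lower())   # basename minus extension
        tokens.add(parts[-2].lower())   # immediate parent directory
    return tokens


def item_tokens(item: str) -> Set[str]:
    """Lowercase, replace non-alphanumerics with spaces, split on whitespace."""
    return set(re.sub(r"[^a-z0-9]+", " ", item.lower()).split())


def count_covered(items: Iterable[str], surface: Set[str]) -> int:
    """Signal A: number of items sharing at least one token with the changed surface."""
    return sum(1 for item in items if item_tokens(item) & surface)
```

Because the comparison is on whole tokens, a file named `send-flow.tsx` would yield the token `send-flow`, which can never match an item's alphanumeric tokens; the source does not say whether basenames should be split further.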
+### CI-step token list
+Items matching any of these tokens (case-insensitive, word-boundary) are flagged:
+| Category     | Tokens                                                                                                                       |
+| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
+| Test runners | `npm test`, `yarn test`, `pnpm test`, `vitest`, `jest`, `mocha`, `playwright test`                                           |
+| Build        | `npm build`, `yarn build`, `pnpm build`, `turbo build`, `tsc`, `typecheck`, `type-check`                                     |
+| Lint         | `npm lint`, `yarn lint`, `pnpm lint`, `eslint`, `prettier`, `biome`                                                          |
+| CI meta      | `ci passes`, `ci green`, `pipeline`, `github actions`, `workflow`, `unit tests`, `integration tests`, `coverage`, `snapshot` |
+Also flag shape "run {test|lint|build|typecheck|coverage}" as a CI reference.
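Signals B and C reduce to regular-expression checks. A minimal sketch with hypothetical function names: the token list is copied from the table above, while the case-sensitive treatment of `TODO`/`TBD` and the "not adjacent to another `?`" reading of a whole-token `???` are assumptions.

```python
import re

CI_TOKENS = [
    "npm test", "yarn test", "pnpm test", "vitest", "jest", "mocha", "playwright test",
    "npm build", "yarn build", "pnpm build", "turbo build", "tsc", "typecheck", "type-check",
    "npm lint", "yarn lint", "pnpm lint", "eslint", "prettier", "biome",
    "ci passes", "ci green", "pipeline", "github actions", "workflow",
    "unit tests", "integration tests", "coverage", "snapshot",
]

# Word-boundary match on either side of each token, case-insensitive.
_CI_TOKEN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(token) for token in CI_TOKENS) + r")(?!\w)",
    re.IGNORECASE,
)
_RUN_SHAPE = re.compile(r"\brun\s+(?:test|lint|build|typecheck|coverage)\b", re.IGNORECASE)
_PLACEHOLDER = re.compile(r"\bTODO\b|\bTBD\b|(?<!\?)\?\?\?(?!\?)")


def is_ci_step(item: str) -> bool:
    """Signal B: the item names a CI step rather than a manual QA action."""
    return bool(_CI_TOKEN.search(item) or _RUN_SHAPE.search(item))


def has_placeholder(item: str) -> bool:
    """Signal C: the item contains TODO, TBD, or ??? as a whole token."""
    return bool(_PLACEHOLDER.search(item))
```

K and P are then the counts of items for which `is_ci_step` and `has_placeholder` return true, which keeps the diagnostics identical for identical PR bodies.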
+### PR-aware branching
+| PR state               | Next action                           |
+| ---------------------- | ------------------------------------- |
+| No open PR             | Proceed with normal Generate flow     |
+| PR exists, no section  | Proceed with normal Generate flow     |
+| PR exists, has section | Emit diagnostic report → gate (below) |
+### Diagnostic report and gate
+Emit diagnostics before the gate:
+```
+PR #<N> has a test plan (<M> items). Diagnostic:
+- <N_covered>/<M> items reference a changed surface
+- <K> items look like CI steps (not manual QA)
+- <P> items have placeholder tokens (TODO/TBD/???)
+```
+Gate form (canonical forbidden phrasings apply).
+`AskUserQuestion` gate for existing PR test plan:
+- question: "How do you want to proceed with the existing PR test plan?"
+- header: "PR plan"
+- options:
+  - `Use as-is` — "Run the existing plan verbatim"
+  - `Enrich` — "Append planner-generated scenarios to cover gaps"
+  - `Regenerate` — "Discard and author a fresh plan from the diff"
+When the user chooses `Use as-is`: skip planner invocation, convert the extracted checklist items directly into the approval checklist, then proceed to the Approval loop.
+When the user chooses `Enrich`: run the planner normally, then merge its output with the PR checklist using this deterministic dedup algorithm: normalize each line to lowercase, replace non-alphanumeric chars with single spaces, collapse whitespace. Match if ALL tokens of the PR item appear as tokens in the planner item (order-independent), or vice versa. When matched, keep the planner version (it has structured intents). Append PR-only items (no match found) to the end of the merged list.
+When the user chooses `Regenerate`: proceed with normal Generate flow, ignoring the PR body.
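The `Enrich` dedup rule can be sketched as a subset test on normalized tokens. Assumptions: items arrive with the `- [ ]` / `- [x]` marker already stripped (otherwise `x` would count as a token), items with no tokens never match, and the function names are hypothetical.

```python
import re
from typing import List, Set


def normalized_tokens(line: str) -> Set[str]:
    """Lowercase, turn non-alphanumeric runs into single spaces, split into tokens."""
    return set(re.sub(r"[^a-z0-9]+", " ", line.lower()).split())


def same_item(pr_item: str, planner_item: str) -> bool:
    """Match when all tokens of one item appear in the other, in either direction."""
    a, b = normalized_tokens(pr_item), normalized_tokens(planner_item)
    if not a or not b:
        return False  # assumption: empty items never match
    return a <= b or b <= a


def merge_checklists(planner_items: List[str], pr_items: List[str]) -> List[str]:
    """Keep every planner item; append PR items that matched nothing."""
    merged = list(planner_items)
    merged.extend(
        pr_item
        for pr_item in pr_items
        if not any(same_item(pr_item, planner_item) for planner_item in planner_items)
    )
    return merged
```

A short PR item such as "Empty-state banner visible" is absorbed by a longer planner item containing the same words, so the planner version with its structured intents survives.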
+## Generate flow
+### 1. Summarize intent
+Scan the last ~5 turns of chat history. Compress the user's stated goal into a single sentence suitable for `--intent`. If no intent emerges, pass an empty string.
+### 2. Detect booted simulators
+Result is already available from the parallel Probe 2 above. Count booted devices:
 | Count | Behavior                                                                   |
 | ----- | -------------------------------------------------------------------------- |
 | 0     | STOP — tell user to boot a simulator (`xcrun simctl boot <udid>`) and wait |
 | 1     | Auto-select its UDID                                                       |
-| >1    | Ask user to pick one by name + UDID                                        |
+| >1    | Use simulator selection gate below                                         |
+**Simulator selection gate** (multi-booted case)
-Remember the chosen UDID for the Run step.
+Gate form (canonical forbidden phrasings apply).
-### 3. Invoke planner
+`AskUserQuestion`:
+- question: "Which simulator should we use?"
+- header: "Simulator"
+- options: one per booted sim (max 4) — label = device name (`iPhone 17 Pro`), description = OS + UDID prefix (`iOS 18.1 · 442A7152`)
+- >4 booted: list 3 by recency; rely on "Other"
+Remember chosen UDID for Run step.
+### 3. Detect base ref
+MUST resolve `--base` before invoking planner; wrong diff causes model abstention (`emptyReason: model-abstained`).
+| Detection method                   | Command                                                                    | Use when                       |
+| ---------------------------------- | -------------------------------------------------------------------------- | ------------------------------ |
+| Open PR (authoritative)            | PR Probe 3 result — read `baseRefName` from the JSON already fetched       | `gh` available + PR exists     |
+| Local upstream tracking (fallback) | `git rev-parse --abbrev-ref @{upstream}` — strip remote prefix (`origin/`) | No PR; upstream set            |
+| None                               | Omit `--base` entirely                                                     | Both methods fail or empty out |
+MUST NOT fabricate base ref when detection fails. Omit `--base` and warn user: "No base ref detected — diff may be broader than expected." Never assert what CLI defaults to; let CLI's own behavior surface.
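The detection precedence can be sketched as a pure function over the two probe results. `resolve_base_ref` and `plan_argv` are hypothetical helper names; the flags `--intent`, `--base`, and `--out` are the ones shown in this document, and returning a slash-free upstream unchanged is an assumption.

```python
from typing import List, Optional


def resolve_base_ref(pr_json: Optional[dict], upstream: Optional[str]) -> Optional[str]:
    """Open PR first, upstream tracking second, otherwise None (omit --base)."""
    if pr_json and pr_json.get("baseRefName"):
        return pr_json["baseRefName"]
    if upstream and upstream.strip():
        remote, sep, branch = upstream.strip().partition("/")
        # Strip the remote prefix, e.g. "origin/main" becomes "main".
        return branch if sep and branch else remote
    return None  # never fabricate a base ref


def plan_argv(intent: str, out_dir: str, base: Optional[str]) -> List[str]:
    """Build the planner command line, leaving --base out when detection failed."""
    argv = ["xqa", "plan", "--intent", intent]
    if base:
        argv += ["--base", base]
    argv += ["--out", out_dir]
    return argv
```

When `resolve_base_ref` returns `None`, the caller is expected to emit the "No base ref detected" warning and let the CLI's own default apply.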
+### 4. Invoke planner
 ```bash
-xqa plan --intent "<one-sentence summary>" --out .xqa/test-plan/<slug>
+xqa plan --intent "<one-sentence summary>" --base <base-ref> --out .xqa/test-plan/<slug>
 ```
-### 4. Render approval checklist
+Omit `--base <base-ref>` when detection produced no result.
+### 5. Render approval checklist
 - Parse planner stdout as JSON; extract the `specs[]` array.
 - Read each `<path>.test.md` that the planner wrote.
-- Extract the step intents (first line of each numbered step, before `→` or `[hint:`).
-- Render a numbered checklist, one line per scenario, showing scenario title and 1-3 key steps.
+- Extract step intents: first line of each numbered step, before `→` or `[hint:`.
+- Render a flat markdown task list — one `- [ ]` per step intent across all scenarios combined.
+- No scenario titles, no UDID, no numbering prefixes. Each line MUST be ≤10 words. If the planner's step intent exceeds 10 words, truncate at the last full word that fits (no ellipsis, no abbreviation).
+- When multiple scenarios exist, insert bold scenario headers (`**Scenario: <feature>**`) between groups of bullets; bullets stay flat. When only one scenario exists, omit the header — emit the flat list directly.
-Ask: "Does this cover what you want to QA? Reply **approve** / **run it** / **looks good** to proceed, or describe edits."
+Gate form (canonical forbidden phrasings apply). Never route yes/no through prose even when planner output suggests it. Fallback header: "Plan approval"; options: Approve / Discuss.
+Use `AskUserQuestion` after rendering the checklist:
+- question: "Does this cover what you want to test?"
+- header: "Plan approval"
+- options:
+  - `Approve` — "Run the plan as-is"
+  - `Discuss` — "Describe changes to the plan"
+- "Other" provides free-form fallback for inline edit feedback
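The step-intent extraction and 10-word limit from the rendering rules above can be sketched like this. The leading-number pattern (`1.` or `1)`) is an assumption about how spec steps are written, and the function names are hypothetical.

```python
import re


def step_intent(first_line: str) -> str:
    """Keep the text before the arrow or the [hint: marker, without the step number."""
    text = re.sub(r"^\s*\d+[.)]\s*", "", first_line)  # assumed numbering style
    text = re.split(r"→|\[hint:", text, maxsplit=1)[0]
    return text.strip()


def clamp_words(text: str, limit: int = 10) -> str:
    """Truncate at the last full word that fits; no ellipsis is added."""
    return " ".join(text.split()[:limit])


def checklist_line(first_line: str) -> str:
    """Render one flat task-list entry for a numbered spec step."""
    return "- [ ] " + clamp_words(step_intent(first_line))
```

For instance, `checklist_line("1. Tap Send → amount screen opens")` yields `- [ ] Tap Send`.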
 ## Approval loop
@@ -113,7 +345,144 @@ Ask: "Does this cover what you want to QA? Reply **approve** / **run it** / **lo
 After every edit call, re-read the updated spec and re-render the checklist. Loop until approval.
-IMPORTANT: Never hand-edit `.test.md` files. Every change flows through `xqa plan edit`.
+MUST NOT hand-edit `.test.md` files. Every change flows through `xqa plan edit`.
+## Run confirmation
+MUST emit Run-go gate after plan approval for all scenarios, regardless of profile count or destructiveness. Plan approval ≠ run approval. User MUST confirm sim/app state before dispatch.
+After approval, emit peer-engineer transition naming count, first scenario, precondition. Template: "Cool — [N] scenarios queued. First: [title] (needs [label])."
+### Run-go gate
+Gate form (canonical forbidden phrasings apply). Fallback header: "Run gate"; options: Go / Not yet.
+Use `AskUserQuestion`:
+- question: "App in the right state to start? ([label] expected)"
+- header: "Run gate"
+- options:
+  - `Go` — "App is in [label] state, dispatch run"
+  - `Not yet` — "I need to set up the app state first"
+Substitute `[label]` with the first scenario's precondition so the question explicitly references the expected app state.
+Free-form replies via `AskUserQuestion` "Other" (or platform-fallback free-form line):
+- Accepted affirmatives: "go" / "ready" / "yes" / "run it" / any clear affirmative
+- Negatives ("no" / "wait" / "hold" / "not yet"): reply "No problem — let me know when you're ready." and wait
+MUST NOT auto-dispatch after plan approval. MUST NOT skip Run-go gate for any run configuration (single-profile, non-destructive, or otherwise).
+## Setup coordination
+Read the `## Setup` section from each scenario's `.test.md`. Group scenarios by `setup` text (string equality). Sort groups: least-destructive first (heuristic: Wallet present, empty < Wallet present, funded < No wallet).
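Grouping by exact setup text and ordering by destructiveness can be sketched as below. The keyword ranking is an assumption made for illustration (the source only gives the ordering empty < funded < no wallet); unknown setup text is placed between funded and no-wallet, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def destructiveness(setup: str) -> int:
    """Assumed keyword heuristic: lower rank runs earlier."""
    text = setup.lower()
    if "no wallet" in text:
        return 3
    if "empty" in text:
        return 0
    if "funded" in text:
        return 1
    return 2  # unrecognized setup text


def group_by_setup(scenarios: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (title, setup) pairs by exact setup string, least destructive first."""
    groups: Dict[str, List[str]] = {}
    for title, setup in scenarios:
        groups.setdefault(setup, []).append(title)
    # sorted() is stable, so groups with equal rank keep first-seen order.
    return sorted(groups.items(), key=lambda group: destructiveness(group[0]))
```

A single returned group corresponds to the single-profile flow; more than one group triggers the multi-profile flow.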
+**Single-profile flow:** all scenarios share one `setup` — Run-go gate confirms app state, then dispatch the batch without re-asking between scenarios.
+**Multi-profile flow:**
+1. Before first dispatch: "I'll run the [profile-A] scenarios first, then [profile-B] to keep setup changes minimal."
+2. Gate form (canonical forbidden phrasings apply). Fallback header: "Sim state"; one option per precondition profile (max 4), canonical wallet vocab.
+   Use `AskUserQuestion` for current sim state:
+   - question: "Before I start — what state is the sim in right now?"
+   - header: "Sim state"
+   - options: one per distinct precondition profile (max 4), labels using canonical wallet vocab (`Wallet imported, empty` / `Wallet imported, funded` / `No wallet` etc.). When more than 4 profiles exist, list the 3 most-needed and rely on "Other".
+3. Per group dispatch: `xqa run --spec <single-spec-file> --udid <udid>` per scenario in the group (or globbed across the group's spec files).
+4. Between groups: emit tip from lookup table. Gate form (canonical forbidden phrasings apply). Fallback header: "Setup ready"; options: Ready / Skip this group / Abort. Use `AskUserQuestion`:
+   - question: "Ready for the next group needing [profile]?"
+   - header: "Setup ready"
+   - options:
+     - `Ready` — "Sim is in the new state"
+     - `Skip this group` — "Move to next group without running this one"
+     - `Abort` — "Stop here and report what ran"
+Tip lookup table — no creative latitude; use exact text:
+| Setup text contains                                                    | Suggested tip                                                                                                                                                                                                          |
+| ---------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `No wallet` / "fresh install" / "no wallet"                            | "Delete app and reinstall, OR delete the current wallet from Settings (you'll lose access without the seed phrase — confirm you have it before continuing)."                                                           |
+| `Wallet imported, empty` / `Wallet created, empty` / "empty wallet"    | "If you're starting from no wallet: import a fresh test seed or create a new wallet. If you have a funded wallet: you'll need to delete it first (confirm seed backup) — that's the only way to reach an empty state." |
+| `Wallet imported, funded` / `Wallet created, funded` / "funded wallet" | "If wallet is empty: send a small amount from another wallet or use the faucet on testnet. If no wallet: import a known funded test seed."                                                                             |
+| `Wallet imported, <asset>` / "specific asset held"                     | "Send the asset from another wallet or swap into it — wallet must already be imported/created."                                                                                                                        |
+| `Wallet imported, <feature> enabled/disabled`                          | "Toggle the feature in Settings — wallet must already be imported/created."                                                                                                                                            |
+| anything else                                                          | "Check the spec for setup notes — I couldn't find a standard tip."                                                                                                                                                     |
+WARNING: Any tip involving deleting existing wallet is destructive. Before suggesting delete path, gate via canonical form. Fallback header: "Seed backup"; options: Yes, proceed / No, cancel delete. Use `AskUserQuestion`:
+- question: "Deleting the current wallet is irreversible without the seed. Backed up?"
+- header: "Seed backup"
+- options:
+  - `Yes, proceed` — "I have the seed phrase saved"
+  - `No, cancel delete` — "I'll reach the target state another way"
+Never suggest a delete without this gate. If the user selects "No, cancel delete", ask them to manually navigate to a fresh-wallet state by other means and report when ready.
+### Worked example (multi-profile)
+Branch `kichelm/wallet-states` has 3 scenarios: 2 need `Wallet imported, empty`, 1 needs `Wallet imported, funded`.
+---
+Skill opening:
+> Looking at `kichelm/wallet-states` — using iPhone 17 Pro, which is already booted. Here's what I'd verify before pushing:
+>
+> **Scenario: empty-state onboarding**
+>
+> - [ ] Open app — empty-state banner visible on Portfolio
+> - [ ] Try to send — error shown, send blocked
+>
+> **Scenario: funded wallet history**
+>
+> - [ ] Tap history — transaction list populated, amounts correct
+[AskUserQuestion: "Does this cover what you want to test?" / header: "Plan approval" / options: Approve, Discuss]
+User: Approve
+> Cool — 3 scenarios queued. First: empty-state onboarding (needs Wallet imported, empty).
+[AskUserQuestion: "App in the right state to start? (Wallet imported, empty expected)" / header: "Run gate" / options: Go, Not yet]
+User: Go
+> I'll run the two empty-wallet scenarios first, then the funded-wallet one to keep setup changes minimal.
+[AskUserQuestion: "Before I start — what state is the sim in right now?" / header: "Sim state" / options: Wallet imported, empty | Wallet imported, funded | No wallet]
+User: Wallet imported, empty
+> Running the Wallet imported, empty group...
+(scenarios complete)
+> Next up needs Wallet imported, funded. Here's a quick way to get there: if wallet is empty: send a small amount from another wallet or use the faucet on testnet. If no wallet: import a known funded test seed.
+[AskUserQuestion: "Ready for the next group needing Wallet imported, funded?" / header: "Setup ready" / options: Ready, Skip this group, Abort]
+---
+### Cancellation
+| User says     | Skill action                                              |
+| ------------- | --------------------------------------------------------- |
+| abort / stop  | Kill `xqa run` (SIGTERM); then use next-step gate below   |
+| skip          | Skip current scenario; proceed to next or Reporting       |
+| skip all      | Skip remaining scenarios; go straight to Reporting        |
+| rerun [title] | Re-enter run-confirmation gate with that single spec path |
+WARNING: `xqa run` dispatches atomically — kill may produce incomplete findings JSON. Report what exists; flag interrupted scenarios.
+After a kill (abort / stop): gate form (canonical forbidden phrasings apply). Fallback header: "Next step"; options: Resume from next / Report what ran / Abort. Use `AskUserQuestion`:
+- question: "Run interrupted. What next?"
+- header: "Next step"
+- options:
+  - `Resume from next` — "Continue with the next scenario"
+  - `Report what ran` — "Skip remaining and show partial findings"
+  - `Abort` — "Stop and report"
 ## Run
@@ -157,25 +526,334 @@ mv <findings-path>/../shots .xqa/test-plan/<slug>/runs/<iso-timestamp>/shots
 Use an ISO-8601 UTC timestamp (e.g. `2026-04-22T15-30-00Z`) as the directory name. Use `mv`, not `cp` — the plan directory owns the canonical copy.
-WARNING: After moving, all downstream steps reference the new paths under `.xqa/test-plan/<slug>/runs/`.
+After moving, all downstream steps MUST reference new paths under `.xqa/test-plan/<slug>/runs/`.
+Artifacts under `runs/` are immutable — MUST NOT edit or overwrite once written.
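The run directory name uses hyphens in place of the colons in a standard ISO-8601 time, as in the `2026-04-22T15-30-00Z` example. A small sketch with a hypothetical `run_dir_name` helper:

```python
from datetime import datetime, timezone


def run_dir_name(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC directory name like 2026-04-22T15-30-00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
```

Passing `datetime.now(timezone.utc)` gives the name for a run that has just finished.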
 ## Report
+MUST NOT compute finding correlation in the skill. `xqa plan report` owns correlation. No ad-hoc string matchers.
+### xqa plan report usage
+Flags (verified against CLI source):
+- `--findings <path>` — required; path to `findings.json`
+- `--specs <dir>` — optional; scenarios directory (defaults to `<xqa>/test-plan/default` when omitted)
+- `--runs <path>` — optional; path to `scenario-runs.json`; when omitted, resolved as `scenario-runs.json` sibling to `findings.json`
+Default invocation using skill's path convention:
 ```bash
 xqa plan report \
   --findings .xqa/test-plan/<slug>/runs/<iso-timestamp>/findings.json \
   --specs .xqa/test-plan/<slug>
 ```
-Parse stdout JSON as `CorrelatedReport`. Render grouped output:
+Omit `--runs` — auto-discovery resolves `scenario-runs.json` from the same directory as `findings.json`.
+Override `--runs` only when `scenario-runs.json` lives elsewhere:
+```bash
+xqa plan report \
+  --findings .xqa/test-plan/<slug>/runs/<iso-timestamp>/findings.json \
+  --specs .xqa/test-plan/<slug> \
+  --runs .xqa/test-plan/<slug>/runs/<iso-timestamp>/scenario-runs.json
+```
+Common failure modes:
+- `FINDINGS_READ_FAILED` on `findings.json` — run has not completed yet or path is wrong; check the `mv` step completed
+- Empty `scenarios` in report / all `not_run` — `--specs` points to wrong directory; slug mismatch between plan dir and findings path
+- All `not_run` with `(no run record)` — `scenario-runs.json` absent; run predates attestation file; re-run with current `xqa run`
+- Report writes `report.json` next to `findings.json` — not stdout; parse the emitted JSON from that file
+Parse the report JSON (written to `report.json` next to `findings.json`, per the failure-mode note above) as `CorrelatedReport`. Open the report with one sentence based on the bucket distribution:
+- All `passed`: "All clear — every scenario ran clean."
+- Any `failed`: "A few things came up:" (then list)
+- `passed > 0` + any `not_run` (no `failed`): "Some passed; a few didn't complete cleanly:"
+- Any `not_run` only (no `passed`, no `failed`): "Heads up — some scenarios didn't complete:" (then list)
+- Mix of `failed` + `not_run`: "Some issues, plus some scenarios that didn't complete cleanly:"
+Render grouped output:
+| Status             | Rendering                                                                                                                                                                                                                                                      |
+| ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `passed`           | Green check `- [x]` on the scenario line. Title rendered plain (no strikethrough).                                                                                                                                                                             |
+| `failed`           | Red `- [x]` (or `- [!]` if monospace renderer) with title in strikethrough. Inline each finding description. Link screenshot paths on a sub-bullet.                                                                                                            |
+| `not_run`          | Yellow/muted `- [ ]` with the scenario title. Append `*(not run — explorer outcome: <outcome>)*` where `<outcome>` is the run record's outcome verb (`errored`, `timed_out`, `aborted`) or `*(no run record)*` when the runs file is absent for that scenario. |
+| Unmatched findings | Separate "Also noticed:" bullet list at the end.                                                                                                                                                                                                               |
+`not_run` means explorer never finished the scenario cleanly (crashed, timed out, aborted). NOT a pass, NOT a failure. Render muted/yellow; surface outcome verb so user knows whether to investigate or rerun. `passed` is reserved for scenarios where explorer completed AND emitted zero findings — real green check.
+`not_run` with `(no run record)`: findings file has no sibling `scenario-runs.json` (pre-attestation run). Suggest re-run with current `xqa run`.
+**Worked example — mixed outcomes:**
+Suppose 3 scenarios ran. Scenario A had a finding, B passed, C timed out.
+> A few things came up:
+>
+> - [x] ~~Verify send flow blocks on zero balance~~
+>       Finding: error message wraps off-screen on small displays. [shot](.xqa/test-plan/<slug>/runs/<ts>/shots/002.png)
+> - [x] Open app — empty-state banner visible on Portfolio
+> - [ ] Tap history — transaction list populates _(not run — explorer outcome: timed_out)_
+>
+> 1 passed · 1 failed · 1 not run.
+The footer summary line ("N passed · N failed · N not run") is mandatory whenever the run had >1 scenario.
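A minimal sketch of the footer rule above, assuming the per-scenario statuses have already been extracted to one value per line (`passed`, `failed`, or `not_run`). The function name `summary_footer` is hypothetical and not part of `xqa`.

```shell
# Hypothetical helper: reads one status per line on stdin and prints the footer.
summary_footer() {
  awk '
    $0 == "passed"  { p++ }
    $0 == "failed"  { f++ }
    $0 == "not_run" { n++ }
    END {
      total = p + f + n
      # The footer is only mandatory when the run had more than one scenario.
      if (total > 1) printf "%d passed · %d failed · %d not run.\n", p, f, n
    }
  '
}

# Usage, mirroring the worked example (A failed, B passed, C timed out):
printf 'failed\npassed\nnot_run\n' | summary_footer
# prints: 1 passed · 1 failed · 1 not run.
```

A single-scenario run prints nothing in this sketch, since the rule only requires the footer for runs with more than one scenario.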
+## Write back to PR
+### When to run
+After the report renders (clean or with findings), if Probe 3 during Generate returned an open PR, run these two probes in parallel before asking the user anything:
+MUST run the gh-user resolver step FIRST (see below), then emit Probe A and Probe B in a single Bash batch. Probe B MUST run AFTER the resolver so that `$GH_USER` is set; never pass an unresolved placeholder as the login.
+```bash
+# Step 0 — resolve GH user (primary)
+GH_USER=$(gh api user --jq .login) || GH_USER=$(gh auth status --active -h github.com 2>&1 | grep -oE 'Logged in to github\.com account [^ ]+' | awk '{print $NF}')
+# Probe A — current PR body (may have changed since Generate start)
+gh pr view <number> --json body -q .body > /tmp/current_pr_body.md
+# Probe B — existing self-tACK comment lookup (needed for Self-tACK section below)
+gh pr view <number> --json comments --jq "[.comments[] | select(.author.login == \"$GH_USER\")] | map(select(.body | startswith(\"tACK\")))" > /tmp/existing_tack.json
+```
+### Gate — Update PR (write-back + self-tACK combined)
+The write-back and self-tACK post are adjacent actions on the same PR. They are consolidated into a single "Update PR" gate. When the user selects an option, both sub-actions execute sequentially without further confirmation (this gate IS the confirmation for both). The self-tACK UPDATE confirmation (when a prior tACK exists and content differs) is a separate destructive-action gate that fires after the sub-actions — it is not merged here.
+Gate form (canonical forbidden phrasings apply).
+`AskUserQuestion` gate for Update PR:
+- question: "Update PR #<N> with the executed test plan?"
+- header: "Update PR"
+- options:
+  - `Write plan + tACK` — "Replace ## Test plan section AND post self-tACK comment"
+  - `Write plan only` — "Replace ## Test plan section, skip tACK"
+  - `Skip` — "Keep PR body unchanged, skip tACK"
+- "Other" allows free-form (e.g. "tACK only" — parse intent and act accordingly)
+On `Write plan + tACK`: execute write-back mechanics (below), then execute self-tACK posting (build comment → existing tACK detection → new or update path).
+On `Write plan only`: execute write-back mechanics only; stop before self-tACK.
+On `Skip`: stop. Do not write back or post tACK.
+### Write-back mechanics
+If the user selects `Write plan + tACK` or `Write plan only`:
+1. Read `/tmp/current_pr_body.md`.
+2. Locate the test plan section using the same heading detection rules as PR detection.
+3. Replace that section with the new `## Test plan` section (see format below). If no section exists, append the new section at the end of the body.
+4. Preserve all content above and below the original test plan section verbatim.
+5. Write the updated body to `/tmp/updated_pr_body.md`.
+6. Apply with:
+```bash
+gh pr edit <number> --body-file /tmp/updated_pr_body.md
+```
+Use `--body-file` — never interpolate the body as a string argument.
+If `gh pr edit` fails, surface the error verbatim. Do not silently retry.
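Steps 1 to 5 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: the helper name `replace_test_plan` is hypothetical, the heading aliases are taken from the normalization rule (`## Testing` / `## Test Plan` / `## QA steps`) because the real detection rules are defined elsewhere, and the old section is assumed to end at the next level-1 or level-2 heading.

```shell
# Hypothetical helper.
# $1 = current PR body file, $2 = file holding the new "## Test plan" section.
replace_test_plan() {
  awk -v new="$2" '
    function emit_new(   line) {
      while ((getline line < new) > 0) print line
      close(new)
    }
    # Assumed heading aliases; only the first match is replaced.
    /^## (Test [Pp]lan|Testing|QA steps)[ \t]*$/ && !done {
      emit_new(); done = 1; skipping = 1; next
    }
    # The next level-1 or level-2 heading ends the old section.
    skipping && /^##? / { skipping = 0 }
    !skipping { print }
    # No existing section: append the new one at the end of the body.
    END { if (!done) { print ""; emit_new() } }
  ' "$1"
}

# Usage (paths from the steps above):
# replace_test_plan /tmp/current_pr_body.md /tmp/new_section.md > /tmp/updated_pr_body.md
```

The sketch does not special-case headings that appear inside fenced code blocks within the old section; a body with that shape would need extra handling.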
+### Approved checklist write-back format
+PR `## Test plan` section MUST reproduce approved checklist verbatim: every item user approved, as `- [ ]` bullets, unchecked. Boxes stay unchecked in the PR body at every step; checked `[x]` boxes appear only in the self-tACK comment.
+Write-back MUST NOT:
+- Write scenario titles as sole content
+- Paraphrase, compress, or summarize
+- Check any box at write-back time
+Write-back layout:
+- Single-scenario plan: emit `## Test plan` then the flat `- [ ]` bullet list. No scenario subheader.
+- Multi-scenario plan: emit `## Test plan`, then for each scenario: `### <Scenario name>` subheader (verbatim from planner output), then the flat `- [ ]` bullets for that scenario's approved steps.
+Example — single scenario (5 approved steps):
+```
+## Test plan
+- [ ] Swipe down Portfolio — Profile overlay slides down
+- [ ] Tap Settings — Settings list appears
+- [ ] Scroll to Test Warning row — row visible
+- [ ] Tap Test Warning — modal appears with content
+- [ ] Dismiss modal via swipe down or tap outside
+```
+Example — two scenarios:
+```
+## Test plan
+### Settings warning modal
+- [ ] Swipe down Portfolio — Profile overlay slides down
+- [ ] Tap Settings — Settings list appears
+- [ ] Scroll to Test Warning row — row visible
+- [ ] Tap Test Warning — modal appears with content
+- [ ] Dismiss modal via swipe down or tap outside
+### Send flow
+- [ ] Tap Send — send sheet opens
+- [ ] Enter recipient — address field accepts input
+- [ ] Confirm — transaction submitted
+```
+Validation before posting (mirrors tack-pr discipline):
+- Every approved checklist item appears as a `- [ ]` bullet in the write-back
+- Scenario subheaders match scenario titles from the planner output exactly
+- Single-scenario plan has no subheader
+- Multi-scenario plan has one subheader per scenario, no skipped scenarios
+- No `[x]` boxes at write-back time — all boxes MUST be unchecked
+Forbidden write-back shapes (MUST NOT emit):
+- Scenario title as sole `## Test plan` content
+- Paraphrased bullet text
+- Summarized plan ("5 steps verified")
+- Bullets with `[x]` at write-back time (checked boxes belong in the self-tACK comment, not the PR body)
+- `## Test plan` section with fewer bullets than approved checklist items
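Two of the checks above (no checked boxes, every approved item present verbatim) lend themselves to a mechanical pre-flight. A minimal sketch, assuming the approved checklist has been saved with one `- [ ]` bullet per line; the function name `validate_writeback` and both file arguments are hypothetical.

```shell
# Hypothetical helper.
# $1 = approved checklist (one "- [ ] ..." bullet per line)
# $2 = proposed "## Test plan" section
validate_writeback() {
  # No box may be checked at write-back time.
  if grep -qE '^[[:space:]]*- \[[xX]\]' "$2"; then
    echo "checked box found at write-back time" >&2
    return 1
  fi
  # Every approved item must appear as an identical line (no paraphrase).
  while IFS= read -r item; do
    [ -z "$item" ] && continue
    grep -Fxq -- "$item" "$2" || { echo "missing: $item" >&2; return 1; }
  done < "$1"
  return 0
}
```

Subheader checks (one `###` per scenario, none for single-scenario plans) are left out here because they depend on the planner output format.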
+### Post-run outcome note
+The `## Test plan` section MUST stay verbatim — same bullets as approved checklist, always `- [ ]` unchecked. Per-item run status goes in the findings report, not the PR body.
+Optional: append a status footnote below the checklist (plain markdown, not checkbox items):
+```
+_Executed N scenarios: P passed · F failed · R not run. See [run report](<link>) for details._
+```
+Rules:
+- MUST NOT flip `[ ]` to `[x]` in `## Test plan` at any step — tACK is the checkbox-flip signal, not write-back
+- MUST NOT rewrite bullets as scenario titles after findings — bullets stay verbatim
+- MUST NOT inject `<!-- failed -->` or similar comments into the canonical checklist
+- Section heading MUST be `## Test plan` (normalized from `## Testing` / `## Test Plan` / `## QA steps`)
+Rationale: tACK reproduces the `## Test plan` checklist with `[x]`. If write-back mutates bullets post-run, tACK diverges from test plan. They MUST remain identical in text (tACK ⊆ test plan).
+## Self-tACK
+### Posting (triggered by Update PR gate — no separate gate for new tACK)
+When Update PR gate selected `Write plan + tACK` (or free-form "tACK only"): proceed to self-tACK posting. No additional gate needed for new tACK — Update PR selection IS the confirmation. Self-tACK UPDATE confirmation (prior tACK exists + content differs) fires separately because diff preview is essential context before overwriting public comment.
+### Existing tACK detection
+Use the result from Probe B (already fetched in parallel with Probe A above). Parse `/tmp/existing_tack.json`:
+- If the array is empty → no prior tACK exists → proceed to New tACK path
+- If the array is non-empty → prior tACK exists → proceed to Update path
+### Build the tACK comment body
+Construct the comment following tack-pr conventions exactly:
+- First line: `tACK` (exact casing)
+- Blank line
+- Every checkbox item from the PR `## Test plan` section reproduced with `[x]` — all boxes checked regardless of pass/fail/not_run status (tACK means "I followed the plan", not "everything passed")
+- Text of each item MUST match PR `## Test plan` item verbatim (modulo `[ ]` → `[x]`)
+- Nested bullets, indented lines, inline code, links: preserved byte-for-byte
+- Scenario subheaders (`### <name>`) preserved when present in test plan
+- Nothing else added — no sign-off, no commentary
+Write the comment to `/tmp/tack_comment.md`.
+Validation before rendering (same discipline as tack-pr):
+- Source item count == rendered item count
+- Every box is `[x]`, none are `[ ]`
+- Item text identical to PR `## Test plan` item — no paraphrase, no invented items, no reordering
+- `tACK` is first line; blank line follows
+- No trailing commentary
+- tACK body matches PR `## Test plan` content item for item (same items + structure, only checkbox state differs)
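The build and validation rules above could be sketched as follows. Assumptions: the `## Test plan` section has been saved to its own file, the section heading itself is dropped from the comment (the rules list only items and `###` subheaders), and the names `build_tack` and `check_tack` are hypothetical.

```shell
# Hypothetical helper. $1 = file containing the PR "## Test plan" section.
build_tack() {
  printf 'tACK\n\n'
  # Drop the section heading, keep all other lines as they are,
  # and flip only the checkbox state.
  sed -e '/^## Test plan[[:space:]]*$/d' \
      -e 's/^\([[:space:]]*- \)\[ \]/\1[x]/' "$1"
}

# Hypothetical helper. $1 = source section file, $2 = rendered comment file.
check_tack() {
  src=$(grep -cE '^[[:space:]]*- \[ \] ' "$1")
  out=$(grep -cE '^[[:space:]]*- \[x\] ' "$2")
  [ "$src" -eq "$out" ] || { echo "item count mismatch: $src vs $out" >&2; return 1; }
  if grep -qE '^[[:space:]]*- \[ \] ' "$2"; then
    echo "unchecked box in tACK" >&2
    return 1
  fi
  [ "$(head -n 1 "$2")" = "tACK" ] || { echo "first line is not tACK" >&2; return 1; }
}

# Usage (the section file path is hypothetical):
# build_tack /tmp/test_plan_section.md > /tmp/tack_comment.md
# check_tack /tmp/test_plan_section.md /tmp/tack_comment.md
```

Because `sed` rewrites only the checkbox marker, nested bullets, inline code, and links pass through unchanged.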
+### New tACK path
+No prior tACK comment exists. The Update PR gate already confirmed intent to post — proceed directly.
+Render the proposed comment in chat (for transparency):
+> Posting tACK to PR #<N> at <url>:
+>
+> (rendered comment body)
+Then post immediately:
+```bash
+gh pr comment <number> --body-file /tmp/tack_comment.md
+```
+Confirm: "tACK posted to PR #<N>: <url>"
+### Update path
+A prior tACK comment from the current user already exists (ID from Probe B JSON: `.id` field).
+Diff existing vs new content:
+- Parse the existing comment body from the Probe B JSON (`.body` field of the first tACK entry)
+- Compare item-by-item to the new `/tmp/tack_comment.md`
+- Compute: added items, removed items, items with changed checkbox state
+If the diff is empty (content is identical): tell the user "Self-tACK already posted matches the new plan — no update needed." Stop.
+If the diff is non-empty: render the diff in chat before gating:
+> Existing tACK comment differs from the new plan:
+>
+> Added:
+>
+> - [x] <new item>
+>
+> Removed:
+>
+> - [x] <old item>
+>
+> Status changes:
+>
+> - [x] <item that was unchecked, now checked>
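One way to compute the added and removed lists is a line-level comparison of the checkbox items. A minimal sketch, assuming the existing comment body has been written to a file first; the function name `tack_diff` is hypothetical.

```shell
# Hypothetical helper. $1 = existing comment body file, $2 = new comment body file.
tack_diff() {
  old_items=$(mktemp)
  new_items=$(mktemp)
  # Compare checkbox lines only; the "tACK" line and blank lines are ignored.
  grep -E '^[[:space:]]*- \[[ xX]\] ' "$1" > "$old_items" || true
  grep -E '^[[:space:]]*- \[[ xX]\] ' "$2" > "$new_items" || true
  echo "Added:"
  grep -Fxv -f "$old_items" "$new_items" || true
  echo "Removed:"
  grep -Fxv -f "$new_items" "$old_items" || true
  rm -f "$old_items" "$new_items"
}
```

An item whose checkbox state changed shows up under both headings in this sketch; separating it out as a status change would require comparing the item text with the marker stripped. If both lists are empty, the two comments match and no update is needed.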
+Gate form (canonical forbidden phrasings apply).
+Then use `AskUserQuestion`:
+- question: "Update the existing tACK comment on PR #<N>?"
+- header: "Confirm tACK update"
+- options:
+  - `Update` — "Overwrite the existing tACK comment"
+  - `Cancel` — "Leave existing comment unchanged"
+WARNING: Overwrites public comment on shared PR. Diff MUST be shown before gate fires. MUST NOT update without diff + explicit confirmation.
+On Update:
+```bash
+gh api --method PATCH /repos/{owner}/{repo}/issues/comments/<comment-id> \
+  --field body=@/tmp/tack_comment.md
+```
+Where `<comment-id>` is the `.databaseId` or `.id` (numeric) from Probe B JSON. Use `--field body=@/tmp/tack_comment.md` to read from file, avoiding shell escaping issues.
+Confirm: "tACK comment updated on PR #<N>: <url>"
+On Cancel: stop.
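The REST endpoint above expects a numeric comment ID. If the `id` in the Probe B JSON turns out to be a non-numeric node ID (worth checking against the actual output), one fallback is to derive the number from the comment's `url`, assuming it ends in `#issuecomment-<digits>`. The helper name and the sample URL below are hypothetical.

```shell
# Hypothetical helper. $1 = comment URL taken from the Probe B JSON.
comment_id_from_url() {
  id=${1##*issuecomment-}
  case "$id" in
    # Empty, or anything other than digits: refuse rather than guess.
    ''|*[!0-9]*) echo "no numeric comment id in: $1" >&2; return 1 ;;
    *) printf '%s\n' "$id" ;;
  esac
}

# Usage with a made-up URL:
comment_id_from_url 'https://github.com/o/r/pull/7#issuecomment-123456'
# prints: 123456
```

Failing loudly on an unexpected URL keeps the PATCH call from targeting the wrong comment.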
-| Group                    | Rendering                                                                                  |
-| ------------------------ | ------------------------------------------------------------------------------------------ |
-| `has_findings` scenarios | Strikethrough title + fail marker. Inline each finding description. Link screenshot paths. |
-| `not_run` scenarios      | Muted/grey. Prefix with "(no findings referenced)" — do not mark as passed.                |
-| Unmatched findings       | Separate "Findings without a scenario" section at the end.                                 |
+### Failure modes (both paths)
-IMPORTANT: `not_run` does NOT mean "passed". It means the agent did not emit a finding that correlated to this scenario. We literally don't know the outcome. Do not print a green check.
+- `gh` not authenticated → surface `gh auth login` error verbatim; do not retry
+- Post/update fails → surface error verbatim; do not silently retry
+- Item count mismatch in validation → abort, tell user which items are missing or duplicated; do not post
 ## Rerun / Regenerate / Extend
@@ -185,27 +863,16 @@ IMPORTANT: `not_run` does NOT mean "passed". It means the agent did not emit a f
 | Regenerate | `rm .xqa/test-plan/<slug>/*.test.md` (specs only — preserve `runs/`). Re-run Generate flow.     |
 | Extend     | `xqa plan extend --out .xqa/test-plan/<slug>`. Planner appends new scenarios for fresh commits. |
-IMPORTANT: Regenerate never deletes `runs/`. Run history is preserved across regenerations.
+Regenerate MUST NOT delete `runs/`. Run history is preserved across regenerations.
-## Guardrails
+## Anti-patterns
-- IMPORTANT: The skill NEVER writes `.test.md` files directly. Only `xqa plan` and `xqa plan edit` produce or modify them. `scenarioId` and meta injection live in the planner — hand-authoring breaks correlation.
-- IMPORTANT: The skill NEVER computes finding correlation. Only `xqa plan report` does. Do not write ad-hoc string matchers.
-- WARNING: The skill NEVER modifies another branch's slug directory. Auto-prune only removes directories whose local branch is gone AND which are not the current branch.
-- WARNING: The skill NEVER edits artifacts under `.xqa/test-plan/<slug>/runs/`. Those are immutable run records.
-- IMPORTANT: The skill NEVER claims `not_run` scenarios passed. Report them as unknown, not green.
+- Over-confirming: one "go" gate per phase; never two consecutive gates for the same transition
+- Context-free prompts: every gate states what's about to happen (scenario name, precondition, transition steps)
+- Phantom continuity: re-state current state from scratch at each gate; never assume the user remembers prior context
+- Silent state assumption: ask user to confirm sim state when preconditions differ between groups
+- Approval conflation: plan approval ≠ run approval — always gate Run-go before first dispatch to confirm app state
 ## Manual test plan
-- [ ] Skill activates on `/xqa-test-plan`
-- [ ] Skill activates on implied intent ("what should I QA?")
-- [ ] Detect state correctly computes slug for current branch
-- [ ] Detect state auto-prunes stale sibling dirs but never current
-- [ ] Generate flow handles 0 booted simulators (error), 1 (auto), >1 (prompt)
-- [ ] Generate flow passes `--intent` correctly
-- [ ] Approval loop calls `xqa plan edit` per file
-- [ ] Run flow preflights simulator before dispatching
-- [ ] Report flow renders two-state status (has_findings / not_run) correctly
-- [ ] Rerun doesn't regenerate specs
-- [ ] Regenerate wipes specs before re-invoking xqa plan
-- [ ] Extend appends scenario-N+1 without re-writing existing scenarios
+See [SKILL.test.md](./SKILL.test.md) for the full manual verification checklist.