@exodus/xqa 5.0.0 → 5.2.0

@@ -12,16 +12,104 @@ license: MIT
12
12
  - User says "what should I test", "generate a test plan", "QA my branch", "test my changes before pushing", "what should I QA?"
13
13
  - Self-activate on implied intent when the user is asking for pre-push manual verification on the current branch
14
14
 
15
+ ## Voice
16
+
17
+ Narrate as a peer engineer, not a CLI. Direct short phrases. Avoid mechanical state dumps ("Auto-selecting UDID", "Invoking planner", "Rendering correlated report"). When uncertain, ask one short question.
18
+
19
+ ## Asking the user
20
+
21
+ ### Platform detection
22
+
23
+ MUST inspect tool list at conversation start. `AskUserQuestion` present → use it at every gate. Absent → use `### Platform fallback` format. Never attempt `AskUserQuestion` when absent (produces mid-conversation tool-call error). Detect once; apply for whole session.
24
+
25
+ ### Gate rules
26
+
27
+ MUST use `AskUserQuestion` (when available) or platform-fallback (when not) at every decision gate. No other form permitted.
28
+
29
+ Decision gates (no exceptions):
30
+
31
+ - Plan approval (after rendering checklist)
32
+ - Run-go (after plan approval, before first dispatch; confirms app state)
33
+ - Simulator selection (>1 booted)
34
+ - Sim state (multi-profile, before first dispatch)
35
+ - Transition-ready between scenario groups
36
+ - Destructive-delete seed-backup confirmation
37
+ - Existing plan: Rerun / Regenerate / Extend
38
+ - Post-interruption: Resume / Report / Abort
39
+ - Existing PR test plan: Use / Enrich / Regenerate
40
+ - Update PR: write plan + tACK / write plan only / skip (merged from former write-back + self-tACK post gates)
41
+ - Self-tACK update (fires only when prior tACK exists and content differs)
42
+
43
+ Free-text prompts allowed ONLY for:
44
+
45
+ - Spec edit feedback (user's own words)
46
+ - Narrated acknowledgements (no decision needed)
47
+
48
+ `AskUserQuestion` auto-adds "Other". MUST parse free-form replies ("approve" / "go" / "ready") identically.
49
+
50
+ ### Forbidden gate phrasings (canonical)
51
+
52
+ MUST NOT emit at any gate:
53
+
54
+ - slash-delimited option lists ("Reply approve / run it", "use / enrich / regenerate")
55
+ - numbered option lists ("(1) Option A (2) Option B", "Type 1 for X")
56
+ - "Reply [word]" / "Say [word]" / "Pick [word]" in any form
57
+ - "say go when", "let me know when", "enter your choice"
58
+ - bare "Go?" as standalone sentence
59
+ - any sentence ending with slash-delimited choice list
60
+ - "describe edits" as prompt suffix
61
+ - "(y/n)" or "yes/no" choice suffix
62
+ - "write to PR?" + "post tACK?" as separate consecutive prompts — these are the merged Update PR gate; emit one combined gate, not two
63
+ - "should I update the PR and post a tACK?" as free text — use Update PR gate form
64
+
65
+ SELF-CHECK before emitting any gate text: if output matches any pattern above, STOP — apply gate form instead. Rule applies to skill output only, not to vocabulary tables describing accepted user replies.
66
+
67
+ Each gate section below assumes this canonical list. Gate sections do not re-enumerate forbidden patterns.
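The self-check above could be approximated mechanically. The following is a minimal sketch, not part of the package: the `FORBIDDEN` table and `gate_violations` helper are hypothetical names, and the regexes cover only a subset of the canonical list (the merged Update PR rules are not modelled).

```python
import re

# Hypothetical, partial encoding of the canonical forbidden phrasings.
FORBIDDEN = {
    "slash-delimited options": re.compile(r"\S\s+/\s+\S"),
    "numbered options": re.compile(r"\(\d+\)\s|\btype\s+\d+\s+for\b", re.I),
    "reply/say/pick prompt": re.compile(r"\b(?:reply|say|pick)\s+\S+", re.I),
    "stock waiting phrase": re.compile(
        r"\b(?:say go when|let me know when|enter your choice)\b", re.I
    ),
    "bare Go?": re.compile(r"^\s*go\?\s*$", re.I | re.M),
    "yes/no suffix": re.compile(r"\(y/n\)|\byes/no\b", re.I),
}


def gate_violations(text):
    """Return the names of every forbidden pattern found in candidate gate text."""
    return [name for name, pattern in FORBIDDEN.items() if pattern.search(text)]
```

An empty result means none of the sketched patterns matched; it does not prove the text complies with the full list.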
68
+
69
+ ### Platform fallback
70
+
71
+ When `AskUserQuestion` unavailable, every gate MUST use this format:
72
+
73
+ ```
74
+ **<Gate name>**
75
+
76
+ - **<Option A>** — <one-phrase description>
77
+ - **<Option B>** — <one-phrase description>
78
+
79
+ _(or describe your situation)_
80
+ ```
81
+
82
+ Rules:
83
+
84
+ - First line: bold gate name, no punctuation, no question mark
85
+ - One bullet per option: bold label, em-dash, description ≤8 words
86
+ - Last line: exactly `_(or describe your situation)_`
87
+ - No numbered lists
88
+ - Canonical forbidden phrasings apply here too
89
+
90
+ Example — plan approval gate in fallback mode:
91
+
92
+ ```
93
+ **Plan approval**
94
+
95
+ - **Approve** — run as-is
96
+ - **Discuss** — describe changes
97
+
98
+ _(or describe your situation)_
99
+ ```
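As an illustration of the fallback rules, here is a small sketch that renders a gate in that format and rejects inputs that break two of the rules (punctuation in the gate name, descriptions over 8 words). `render_fallback_gate` is a hypothetical helper, not something the package ships.

```python
EM_DASH = "\u2014"  # separator required by the fallback format
FOOTER = "_(or describe your situation)_"


def render_fallback_gate(name, options):
    """Render a gate; options is a list of (label, description) pairs."""
    if any(ch in name for ch in "?.!:,;"):
        raise ValueError("gate name must carry no punctuation")
    lines = [f"**{name}**", ""]
    for label, description in options:
        if len(description.split()) > 8:
            raise ValueError(f"description for {label!r} exceeds 8 words")
        lines.append(f"- **{label}** {EM_DASH} {description}")
    lines += ["", FOOTER]
    return "\n".join(lines)
```

Calling it with `"Plan approval"` and the `Approve` / `Discuss` options reproduces the example block above.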
100
+
15
101
  ## Process
16
102
 
17
103
  ```
18
104
  Detect state → Generate → Approve → Run → Report → Re-run / Regenerate / Extend
19
105
  ```
20
106
 
21
- IMPORTANT: The skill orchestrates the CLI; it never writes `.test.md` files directly. Every spec mutation goes through `xqa plan` or `xqa plan edit`.
107
+ MUST orchestrate CLI only; skill MUST NOT write `.test.md` directly. Every spec mutation goes through `xqa plan` or `xqa plan edit`.
22
108
 
23
109
  ## Detect state
24
110
 
111
+ MUST suppress intermediate state narration. Emit one opening sentence naming branch and chosen simulator by name (not UDID), then transition into the plan. Template: "Looking at `<branch>` — using `<sim-name>`, which is already booted. Here's what I'd verify before pushing:"
112
+
25
113
  Resolve the current branch and its plan directory before anything else.
26
114
 
27
115
  ```bash
@@ -44,63 +132,207 @@ Plan directory: `.xqa/test-plan/<slug>/`.
44
132
 
45
133
  ### Auto-prune stale siblings
46
134
 
47
- List every child of `.xqa/test-plan/`. For each sibling directory:
135
+ Branch-to-slug mapping is lossy (non-`[a-zA-Z0-9._-]` chars collapse to `-`). MUST NOT invert; iterate forward instead.
48
136
 
49
- 1. Decode the slug back to an approximate branch name. WARNING: slug decoding is ambiguous (dashes could have been slashes or other chars). Treat the approximation as best-effort.
50
- 2. Probe the local branch list:
137
+ 1. List all local branches:
51
138
  ```bash
52
- git branch --list '<approximate-name>'
139
+ git for-each-ref refs/heads --format='%(refname:short)'
53
140
  ```
54
- 3. If the probe is empty, the directory is stale: `rm -rf` it.
141
+ 2. Apply `branchToSlug` (slug rules table above) to each branch name. Build the set of live slugs.
142
+ 3. List every child of `.xqa/test-plan/`. Directory is stale if its name is NOT in the live-slug set.
143
+ 4. `rm -rf` stale directories only.
55
144
 
56
- IMPORTANT: Never prune the current branch's slug directory, even if the approximate-name probe looks empty.
57
- IMPORTANT: Auto-prune only removes directories whose approximation has NO matching local branch. When in doubt, leave it.
145
+ MUST NOT prune current branch's slug directory regardless of slug-set membership.
146
+ MUST leave any directory whose stale-status is uncertain (e.g. slug computation fails on a branch name).
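The forward iteration can be sketched as a pure function. This is an illustration under stated assumptions: the real `branchToSlug` rules live in a table outside this excerpt, so `branch_to_slug` below only guesses that each run of disallowed characters becomes a single `-`; `stale_plan_dirs` is a hypothetical name.

```python
import re


def branch_to_slug(branch):
    # Assumption: each run of characters outside [a-zA-Z0-9._-] becomes one "-".
    return re.sub(r"[^a-zA-Z0-9._-]+", "-", branch)


def stale_plan_dirs(branches, plan_dirs, current_branch):
    """Return plan directories whose name matches no live branch slug."""
    live = {branch_to_slug(b) for b in branches}
    # The current branch's slug is always treated as live, so it is never pruned.
    live.add(branch_to_slug(current_branch))
    return sorted(d for d in plan_dirs if d not in live)
```

Only the returned names would be candidates for removal; the branch list itself would come from the `git for-each-ref` command in step 1.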
58
147
 
59
148
  ### Existing plan detection
60
149
 
61
150
  Check `.xqa/test-plan/<slug>/` for `*.test.md` files.
62
151
 
63
- | State | Next action |
64
- | --------------------- | ---------------------------------------------------------- |
65
- | No specs | Proceed to Generate flow |
66
- | Specs already present | Offer three choices: **Rerun**, **Regenerate**, **Extend** |
152
+ | State | Next action |
153
+ | --------------------- | ------------------------------------------- |
154
+ | No specs | Proceed to Generate flow |
155
+ | Specs already present | Use `AskUserQuestion` (see gate spec below) |
67
156
 
68
- ## Generate flow
157
+ Gate form (canonical forbidden phrasings apply). Fallback header: "Plan exists"; options: Rerun / Regenerate / Extend.
69
158
 
70
- ### 1. Summarize intent
159
+ `AskUserQuestion` gate for existing plan:
71
160
 
72
- Scan the last ~5 turns of chat history. Compress the user's stated goal into a single sentence suitable for `--intent`. If no intent emerges, pass an empty string.
161
+ - question: "A plan already exists for this branch. What do you want to do?"
162
+ - header: "Plan exists"
163
+ - options:
164
+ - `Rerun` — "Run existing specs as-is"
165
+ - `Regenerate` — "Discard specs and re-plan from current diff"
166
+ - `Extend` — "Add scenarios for new commits"
73
167
 
74
- ### 2. Detect booted simulators
168
+ ## PR detection
169
+
170
+ ### Parallel probes at Generate start
171
+
172
+ Probes are independent — MUST emit in single Bash batch (parallel tool use), not sequentially. Wait for all three before planner invocation.
75
173
 
76
174
  ```bash
175
+ # Probe 1 — branch slug (already resolved in Detect state; re-use the value)
176
+ # Probe 2 — booted simulator list
77
177
  xcrun simctl list devices booted --json
178
+ # Probe 3 — open PR for current branch
179
+ gh pr view --json number,body,url,headRefName,baseRefName,author 2>/dev/null
78
180
  ```
79
181
 
80
- Count booted devices:
182
+ If `gh` is not installed or not authenticated, or the branch has no open PR, Probe 3 returns a non-zero exit or empty output — treat this as "no PR". Skip all PR-integration steps silently and proceed with the existing local-only flow. Never fabricate a PR number or URL.
183
+
184
+ ### PR test plan quality evaluation
185
+
186
+ After fetching the PR body, locate the test plan section using the same heading variants as tack-pr (in priority order):
187
+
188
+ - `## Test plan` / `### Test plan` / `**Test plan**`
189
+ - `## Testing` / `### Testing`
190
+ - `## Test Plan` / `## QA steps`
191
+
192
+ Extract every `- [ ] ...` and `- [x] ...` line under that heading, up to the next heading or end of body. Let M = item count.
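A possible shape for the extraction step, offered as a sketch rather than the tack-pr implementation. It assumes headings match exactly (case-sensitive, as the variant list distinguishes `Test plan` from `Test Plan`) and that "next heading" means a line starting with `#`; `extract_items` is a hypothetical name.

```python
import re

# Heading variants in the priority order listed above.
HEADINGS = [
    "## Test plan", "### Test plan", "**Test plan**",
    "## Testing", "### Testing",
    "## Test Plan", "## QA steps",
]
ITEM = re.compile(r"^\s*- \[[ x]\] ")


def extract_items(body):
    """Return checklist lines under the highest-priority test plan heading."""
    lines = body.splitlines()
    for heading in HEADINGS:
        for i, line in enumerate(lines):
            if line.strip() != heading:
                continue
            items = []
            for following in lines[i + 1:]:
                if following.startswith("#"):  # assumed section boundary
                    break
                if ITEM.match(following):
                    items.append(following.strip())
            return items
    return []
```

`len(extract_items(body))` then gives M for the diagnostics.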
193
+
194
+ Compute three diagnostic signals mechanically. No qualitative judgment.
195
+
196
+ **Signal A — Change coverage.** Build changed-surface token set via `git diff <base>..HEAD --name-only` + extraction rules below. For each item: lowercase, replace non-alphanumeric with spaces, tokenize. Match if any changed-surface token appears in item tokens. Report: `N_covered of M items reference a changed surface`.
197
+
198
+ **Signal B — CI-step contamination.** Match items (case-insensitive, word-boundary) against CI-step token list below + shape `run {test|lint|build|typecheck|coverage}`. Report: `K items look like CI steps (not manual QA)`.
199
+
200
+ **Signal C — Placeholder tokens.** Flag items with `TODO`, `TBD`, or `???` as a whole token. Report: `P items have placeholder tokens (TODO/TBD/???)`.
201
+
202
+ Identical PR bodies MUST produce identical diagnostics. No model judgment beyond the three signals.
203
+
204
+ ### Changed surface extraction
205
+
206
+ From the diff (via `git diff <base>..HEAD --name-only`):
207
+
208
+ - Filter to paths under `app/`, `apps/`, `src/`, `packages/` (adjust to project convention if obvious from tree structure)
209
+ - Exclude test files (`**/__tests__/**`, `*.test.*`, `*.spec.*`)
210
+ - Exclude config/build files (`*.json`, `*.yaml`, `*.toml`, `*.config.*`)
211
+ - For each remaining path: extract basename (minus extension) and immediate parent directory name as tokens
212
+ - Union all tokens to produce the "changed-surface token set"
213
+
214
+ Token matching against a test-plan item:
215
+
216
+ - Lowercase the item text; replace non-alphanumeric with spaces; tokenize by whitespace
217
+ - Item references a changed surface if ANY changed-surface token appears in the item's token set
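The extraction and matching rules above could look roughly like this. It is a sketch under assumptions: path tokens are lowercased so they can match lowercased item tokens (the text does not say this explicitly), "minus extension" is read as dropping only the final suffix, and all function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

SOURCE_ROOTS = ("app/", "apps/", "src/", "packages/")
EXCLUDED = (
    "*/__tests__/*", "*.test.*", "*.spec.*",          # test files
    "*.json", "*.yaml", "*.toml", "*.config.*",       # config/build files
)


def changed_surface_tokens(paths):
    """Build the changed-surface token set from `git diff --name-only` output."""
    tokens = set()
    for path in paths:
        if not path.startswith(SOURCE_ROOTS):
            continue
        if any(fnmatchcase(path, pattern) for pattern in EXCLUDED):
            continue
        p = PurePosixPath(path)
        tokens.add(p.stem.lower())         # basename minus final extension
        tokens.add(p.parent.name.lower())  # immediate parent directory
    return tokens


def item_tokens(item):
    """Lowercase, replace non-alphanumerics with spaces, split on whitespace."""
    return set(re.sub(r"[^a-z0-9]+", " ", item.lower()).split())


def covered_count(items, surface):
    """Signal A: number of items that mention any changed-surface token."""
    return sum(1 for item in items if surface & item_tokens(item))
```

One consequence of this reading: a basename containing a hyphen or dot can never match, because item text is split on those characters.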
218
+
219
+ ### CI-step token list
220
+
221
+ Items matching any of these tokens (case-insensitive, word-boundary) are flagged:
222
+
223
+ | Category | Tokens |
224
+ | ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
225
+ | Test runners | `npm test`, `yarn test`, `pnpm test`, `vitest`, `jest`, `mocha`, `playwright test` |
226
+ | Build | `npm build`, `yarn build`, `pnpm build`, `turbo build`, `tsc`, `typecheck`, `type-check` |
227
+ | Lint | `npm lint`, `yarn lint`, `pnpm lint`, `eslint`, `prettier`, `biome` |
228
+ | CI meta | `ci passes`, `ci green`, `pipeline`, `github actions`, `workflow`, `unit tests`, `integration tests`, `coverage`, `snapshot` |
229
+
230
+ Also flag shape "run {test|lint|build|typecheck|coverage}" as a CI reference.
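Signals B and C reduce to two regular expressions. The sketch below copies the token table verbatim and assumes placeholder matching is case-sensitive as written (`TODO`, `TBD`, `???`); the helper names are hypothetical.

```python
import re

CI_TOKENS = [
    "npm test", "yarn test", "pnpm test", "vitest", "jest", "mocha", "playwright test",
    "npm build", "yarn build", "pnpm build", "turbo build", "tsc", "typecheck", "type-check",
    "npm lint", "yarn lint", "pnpm lint", "eslint", "prettier", "biome",
    "ci passes", "ci green", "pipeline", "github actions", "workflow",
    "unit tests", "integration tests", "coverage", "snapshot",
]

# Word-boundary, case-insensitive match on any token or the "run <step>" shape.
CI_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(token) for token in CI_TOKENS)
    + r"|run\s+(?:test|lint|build|typecheck|coverage))(?!\w)",
    re.IGNORECASE,
)
# Assumption: placeholders are matched exactly as spelled, as whole tokens.
PLACEHOLDER_RE = re.compile(r"(?<!\w)(?:TODO|TBD|\?\?\?)(?!\w)")


def is_ci_step(item):
    return CI_RE.search(item) is not None


def has_placeholder(item):
    return PLACEHOLDER_RE.search(item) is not None
```

K and P are then plain counts of items for which each predicate is true, which keeps the diagnostics identical for identical PR bodies.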
231
+
232
+ ### PR-aware branching
233
+
234
+ | PR state | Next action |
235
+ | --------------------- | ------------------------------------- |
236
+ | No open PR | Proceed with normal Generate flow |
237
+ | PR exists, no section | Proceed with normal Generate flow |
238
+ | PR exists | Emit diagnostic report → gate (below) |
239
+
240
+ ### Diagnostic report and gate
241
+
242
+ Emit diagnostics before the gate:
243
+
244
+ ```
245
+ PR #<N> has a test plan (<M> items). Diagnostic:
246
+ - <N_covered>/<M> items reference a changed surface
247
+ - <K> items look like CI steps (not manual QA)
248
+ - <P> items have placeholder tokens (TODO/TBD/???)
249
+ ```
250
+
251
+ Gate form (canonical forbidden phrasings apply).
252
+
253
+ `AskUserQuestion` gate for existing PR test plan:
254
+
255
+ - question: "How do you want to proceed with the existing PR test plan?"
256
+ - header: "PR plan"
257
+ - options:
258
+ - `Use as-is` — "Run the existing plan verbatim"
259
+ - `Enrich` — "Append planner-generated scenarios to cover gaps"
260
+ - `Regenerate` — "Discard and author a fresh plan from the diff"
261
+
262
+ When the user chooses `Use as-is`: skip planner invocation, convert the extracted checklist items directly into the approval checklist, then proceed to the Approval loop.
263
+
264
+ When the user chooses `Enrich`: run the planner normally, then merge its output with the PR checklist using this deterministic dedup algorithm: normalize each line to lowercase, replace non-alphanumeric chars with single spaces, collapse whitespace. Match if ALL tokens of the PR item appear as tokens in the planner item (order-independent), or vice versa. When matched, keep the planner version (it has structured intents). Append PR-only items (no match found) to the end of the merged list.
265
+
266
+ When the user chooses `Regenerate`: proceed with normal Generate flow, ignoring the PR body.
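The `Enrich` merge described above can be sketched as follows. Assumptions: checklist prefixes such as `- [ ]` are already stripped from both lists, and an item that normalizes to no tokens at all is treated as unmatched (the text does not cover that case). `merge_enriched` is a hypothetical name.

```python
import re


def normalize(line):
    """Lowercase, replace non-alphanumerics with spaces, return the token set."""
    return set(re.sub(r"[^a-z0-9]+", " ", line.lower()).split())


def merge_enriched(planner_items, pr_items):
    """Keep every planner item; append PR items that match no planner item."""
    merged = list(planner_items)
    planner_tokens = [normalize(item) for item in planner_items]
    for item in pr_items:
        tokens = normalize(item)
        matched = bool(tokens) and any(
            tokens <= other or (other and other <= tokens)
            for other in planner_tokens
        )
        if not matched:
            merged.append(item)
    return merged
```

Because matching is a subset test on token sets, word order and punctuation differences between the PR item and the planner item do not matter.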
267
+
268
+ ## Generate flow
269
+
270
+ ### 1. Summarize intent
271
+
272
+ Scan the last ~5 turns of chat history. Compress the user's stated goal into a single sentence suitable for `--intent`. If no intent emerges, pass an empty string.
273
+
274
+ ### 2. Detect booted simulators
275
+
276
+ Result is already available from the parallel Probe 2 above. Count booted devices:
81
277
 
82
278
  | Count | Behavior |
83
279
  | ----- | -------------------------------------------------------------------------- |
84
280
  | 0 | STOP — tell user to boot a simulator (`xcrun simctl boot <udid>`) and wait |
85
281
  | 1 | Auto-select its UDID |
86
- | >1 | Ask user to pick one by name + UDID |
282
+ | >1 | Use simulator selection gate below |
283
+
284
+ **Simulator selection gate** (multi-booted case)
87
285
 
88
- Remember the chosen UDID for the Run step.
286
+ Gate form (canonical forbidden phrasings apply).
89
287
 
90
- ### 3. Invoke planner
288
+ `AskUserQuestion`:
289
+
290
+ - question: "Which simulator should we use?"
291
+ - header: "Simulator"
292
+ - options: one per booted sim (max 4) — label = device name (`iPhone 17 Pro`), description = OS + UDID prefix (`iOS 18.1 · 442A7152`)
293
+ - >4 booted: list 3 by recency; rely on "Other"
294
+
295
+ Remember chosen UDID for Run step.
296
+
297
+ ### 3. Detect base ref
298
+
299
+ MUST resolve `--base` before invoking planner; wrong diff causes model abstention (`emptyReason: model-abstained`).
300
+
301
+ | Detection method | Command | Use when |
302
+ | ---------------------------------- | -------------------------------------------------------------------------- | ------------------------------ |
303
+ | Open PR (authoritative) | PR Probe 3 result — read `baseRefName` from the JSON already fetched | `gh` available + PR exists |
304
+ | Local upstream tracking (fallback) | `git rev-parse --abbrev-ref @{upstream}` — strip remote prefix (`origin/`) | No PR; upstream set |
305
+ | None | Omit `--base` entirely | Both methods fail or empty out |
306
+
307
+ MUST NOT fabricate base ref when detection fails. Omit `--base` and warn user: "No base ref detected — diff may be broader than expected." Never assert what CLI defaults to; let CLI's own behavior surface.
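The detection table amounts to a short fallback chain. A minimal sketch, assuming the PR probe result is available as a parsed dict (or `None`) and the upstream probe as a string (or `None`); `resolve_base_args` is a hypothetical helper.

```python
def resolve_base_args(pr, upstream):
    """Return the `--base` arguments for the planner, or [] when nothing is detected."""
    base = (pr or {}).get("baseRefName")
    if not base and upstream:
        # Strip the remote prefix: "origin/main" -> "main".
        base = upstream.split("/", 1)[1] if "/" in upstream else upstream
    return ["--base", base] if base else []
```

An empty list means the flag is omitted entirely and the warning above should be shown; nothing is guessed.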
308
+
309
+ ### 4. Invoke planner
91
310
 
92
311
  ```bash
93
- xqa plan --intent "<one-sentence summary>" --out .xqa/test-plan/<slug>
312
+ xqa plan --intent "<one-sentence summary>" --base <base-ref> --out .xqa/test-plan/<slug>
94
313
  ```
95
314
 
96
- ### 4. Render approval checklist
315
+ Omit `--base <base-ref>` when detection produced no result.
316
+
317
+ ### 5. Render approval checklist
97
318
 
98
319
  - Parse planner stdout as JSON; extract the `specs[]` array.
99
320
  - Read each `<path>.test.md` that the planner wrote.
100
- - Extract the step intents (first line of each numbered step, before `→` or `[hint:`).
101
- - Render a numbered checklist, one line per scenario, showing scenario title and 1-3 key steps.
321
+ - Extract step intents: first line of each numbered step, before `→` or `[hint:`.
322
+ - Render a flat markdown task list — one `- [ ]` per step intent across all scenarios combined.
323
+ - No scenario titles, no UDID, no numbering prefixes. Each line MUST be ≤10 words. If the planner's step intent exceeds 10 words, truncate at the last full word that fits (no ellipsis, no abbreviation).
324
+ - When multiple scenarios exist, insert bold scenario headers (`**Scenario: <feature>**`) between groups of bullets; bullets stay flat. When only one scenario exists, omit the header — emit the flat list directly.
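The intent extraction and 10-word cap can be sketched like this. It assumes numbered steps look like `1. text` in the spec file, which this excerpt does not confirm; both function names are hypothetical.

```python
import re


def step_intent(step_line):
    """Text of a numbered step before the arrow or the hint marker."""
    text = re.sub(r"^\s*\d+\.\s*", "", step_line)  # assumed "1. " numbering
    text = re.split(r"→|\[hint:", text, maxsplit=1)[0]
    return text.strip()


def checklist_line(intent, max_words=10):
    """Task-list bullet truncated at the last full word that fits, no ellipsis."""
    return "- [ ] " + " ".join(intent.split()[:max_words])
```

Truncating on whole words keeps each bullet within the limit without abbreviating anything.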
102
325
 
103
- Ask: "Does this cover what you want to QA? Reply **approve** / **run it** / **looks good** to proceed, or describe edits."
326
+ Gate form (canonical forbidden phrasings apply). Never route yes/no through prose even when planner output suggests it. Fallback header: "Plan approval"; options: Approve / Discuss.
327
+
328
+ Use `AskUserQuestion` after rendering the checklist:
329
+
330
+ - question: "Does this cover what you want to test?"
331
+ - header: "Plan approval"
332
+ - options:
333
+ - `Approve` — "Run the plan as-is"
334
+ - `Discuss` — "Describe changes to the plan"
335
+ - "Other" provides free-form fallback for inline edit feedback
104
336
 
105
337
  ## Approval loop
106
338
 
@@ -113,7 +345,144 @@ Ask: "Does this cover what you want to QA? Reply **approve** / **run it** / **lo
113
345
 
114
346
  After every edit call, re-read the updated spec and re-render the checklist. Loop until approval.
115
347
 
116
- IMPORTANT: Never hand-edit `.test.md` files. Every change flows through `xqa plan edit`.
348
+ MUST NOT hand-edit `.test.md` files. Every change flows through `xqa plan edit`.
349
+
350
+ ## Run confirmation
351
+
352
+ MUST emit Run-go gate after plan approval for all scenarios, regardless of profile count or destructiveness. Plan approval ≠ run approval. User MUST confirm sim/app state before dispatch.
353
+
354
+ After approval, emit peer-engineer transition naming count, first scenario, precondition. Template: "Cool — [N] scenarios queued. First: [title] (needs [label])."
355
+
356
+ ### Run-go gate
357
+
358
+ Gate form (canonical forbidden phrasings apply). Fallback header: "Run gate"; options: Go / Not yet.
359
+
360
+ Use `AskUserQuestion`:
361
+
362
+ - question: "App in the right state to start? ([label] expected)"
363
+ - header: "Run gate"
364
+ - options:
365
+ - `Go` — "App is in [label] state, dispatch run"
366
+ - `Not yet` — "I need to set up the app state first"
367
+
368
+ Substitute `[label]` with the first scenario's precondition so the question explicitly references the expected app state.
369
+
370
+ Free-form replies via `AskUserQuestion` "Other" (or platform-fallback free-form line):
371
+
372
+ - Accepted affirmatives: "go" / "ready" / "yes" / "run it" / any clear affirmative
373
+ - Negatives ("no" / "wait" / "hold" / "not yet"): reply "No problem — let me know when you're ready." and wait
374
+
375
+ MUST NOT auto-dispatch after plan approval. MUST NOT skip Run-go gate for any run configuration (single-profile, non-destructive, or otherwise).
376
+
377
+ ## Setup coordination
378
+
379
+ Read the `## Setup` section from each scenario's `.test.md`. Group scenarios by `setup` text (string equality). Sort groups: least-destructive first (heuristic: Wallet present, empty < Wallet present, funded < No wallet).
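Grouping and ordering might be sketched as below. The ranking is an assumption layered on the stated heuristic: it keys on the substrings `empty`, `funded`, and `no wallet`, and places unrecognized setups last; `group_by_setup` and `destructiveness` are hypothetical names.

```python
def destructiveness(setup):
    """Lower rank means a less destructive precondition (assumed substring heuristic)."""
    text = setup.lower()
    if "no wallet" in text:
        return 2
    if "funded" in text:
        return 1
    if "empty" in text:
        return 0
    return 3  # assumption: unknown setups run last


def group_by_setup(scenarios):
    """scenarios: list of (title, setup). Groups by exact setup string, then sorts."""
    groups = {}
    for title, setup in scenarios:
        groups.setdefault(setup, []).append(title)
    return sorted(groups.items(), key=lambda pair: destructiveness(pair[0]))
```

The sort is stable, so groups with the same rank keep the order in which their setup text first appeared.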
380
+
381
+ **Single-profile flow:** all scenarios share one `setup` — Run-go gate confirms app state, then dispatch the batch without re-asking between scenarios.
382
+
383
+ **Multi-profile flow:**
384
+
385
+ 1. Before first dispatch: "I'll run the [profile-A] scenarios first, then [profile-B] to keep setup changes minimal."
386
+ 2. Gate form (canonical forbidden phrasings apply). Fallback header: "Sim state"; one option per precondition profile (max 4), canonical wallet vocab.
387
+ Use `AskUserQuestion` for current sim state:
388
+ - question: "Before I start — what state is the sim in right now?"
389
+ - header: "Sim state"
390
+ - options: one per distinct precondition profile (max 4), labels using canonical wallet vocab (`Wallet imported, empty` / `Wallet imported, funded` / `No wallet` etc.). When more than 4 profiles exist, list the 3 most-needed and rely on "Other".
391
+
392
+ 3. Per group dispatch: `xqa run --spec <single-spec-file> --udid <udid>` per scenario in the group (or globbed across the group's spec files).
393
+ 4. Between groups: emit tip from lookup table. Gate form (canonical forbidden phrasings apply). Fallback header: "Setup ready"; options: Ready / Skip this group / Abort. Use `AskUserQuestion`:
394
+ - question: "Ready for the next group needing [profile]?"
395
+ - header: "Setup ready"
396
+ - options:
397
+ - `Ready` — "Sim is in the new state"
398
+ - `Skip this group` — "Move to next group without running this one"
399
+ - `Abort` — "Stop here and report what ran"
400
+
401
+ Tip lookup table — no creative latitude; use exact text:
402
+
403
+ | Setup text contains | Suggested tip |
404
+ | ---------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
405
+ | `No wallet` / "fresh install" / "no wallet" | "Delete app and reinstall, OR delete the current wallet from Settings (you'll lose access without the seed phrase — confirm you have it before continuing)." |
406
+ | `Wallet imported, empty` / `Wallet created, empty` / "empty wallet" | "If you're starting from no wallet: import a fresh test seed or create a new wallet. If you have a funded wallet: you'll need to delete it first (confirm seed backup) — that's the only way to reach an empty state." |
407
+ | `Wallet imported, funded` / `Wallet created, funded` / "funded wallet" | "If wallet is empty: send a small amount from another wallet or use the faucet on testnet. If no wallet: import a known funded test seed." |
408
+ | `Wallet imported, <asset>` / "specific asset held" | "Send the asset from another wallet or swap into it — wallet must already be imported/created." |
409
+ | `Wallet imported, <feature> enabled/disabled` | "Toggle the feature in Settings — wallet must already be imported/created." |
410
+ | anything else | "Check the spec for setup notes — I couldn't find a standard tip." |
411
+
412
+ WARNING: Any tip involving deleting existing wallet is destructive. Before suggesting delete path, gate via canonical form. Fallback header: "Seed backup"; options: Yes, proceed / No, cancel delete. Use `AskUserQuestion`:
413
+
414
+ - question: "Deleting the current wallet is irreversible without the seed. Backed up?"
415
+ - header: "Seed backup"
416
+ - options:
417
+ - `Yes, proceed` — "I have the seed phrase saved"
418
+ - `No, cancel delete` — "I'll reach the target state another way"
419
+
420
+ Never suggest a delete without this gate. If the user selects "No, cancel delete", ask them to manually navigate to a fresh-wallet state by other means and report when ready.
421
+
422
+ ### Worked example (multi-profile)
423
+
424
+ Branch `kichelm/wallet-states` has 3 scenarios: 2 need `Wallet imported, empty`, 1 needs `Wallet imported, funded`.
425
+
426
+ ---
427
+
428
+ Skill opening:
429
+
430
+ > Looking at `kichelm/wallet-states` — using iPhone 17 Pro, which is already booted. Here's what I'd verify before pushing:
431
+ >
432
+ > **Scenario: empty-state onboarding**
433
+ >
434
+ > - [ ] Open app — empty-state banner visible on Portfolio
435
+ > - [ ] Try to send — error shown, send blocked
436
+ >
437
+ > **Scenario: funded wallet history**
438
+ >
439
+ > - [ ] Tap history — transaction list populated, amounts correct
440
+
441
+ [AskUserQuestion: "Does this cover what you want to test?" / header: "Plan approval" / options: Approve, Discuss]
442
+
443
+ User: Approve
444
+
445
+ > Cool — 3 scenarios queued. First: empty-state onboarding (needs Wallet imported, empty).
446
+
447
+ [AskUserQuestion: "App in the right state to start? (Wallet imported, empty expected)" / header: "Run gate" / options: Go, Not yet]
448
+
449
+ User: Go
450
+
451
+ > I'll run the two empty-wallet scenarios first, then the funded-wallet one to keep setup changes minimal.
452
+
453
+ [AskUserQuestion: "Before I start — what state is the sim in right now?" / header: "Sim state" / options: Wallet imported, empty | Wallet imported, funded | No wallet]
454
+
455
+ User: Wallet imported, empty
456
+
457
+ > Running the Wallet imported, empty group...
458
+
459
+ (scenarios complete)
460
+
461
+ > Next up needs Wallet imported, funded. Here's a quick way to get there: if wallet is empty: send a small amount from another wallet or use the faucet on testnet. If no wallet: import a known funded test seed.
462
+
463
+ [AskUserQuestion: "Ready for the next group needing Wallet imported, funded?" / header: "Setup ready" / options: Ready, Skip this group, Abort]
464
+
465
+ ---
466
+
467
+ ### Cancellation
468
+
469
+ | User says | Skill action |
470
+ | ------------- | --------------------------------------------------------- |
471
+ | abort / stop | Kill `xqa run` (SIGTERM); then use next-step gate below |
472
+ | skip | Skip current scenario; proceed to next or Reporting |
473
+ | skip all | Skip remaining scenarios; go straight to Reporting |
474
+ | rerun [title] | Re-enter run-confirmation gate with that single spec path |
475
+
476
+ WARNING: `xqa run` dispatches atomically — kill may produce incomplete findings JSON. Report what exists; flag interrupted scenarios.
477
+
478
+ After a kill (abort / stop): gate form (canonical forbidden phrasings apply). Fallback header: "Next step"; options: Resume from next / Report what ran / Abort. Use `AskUserQuestion`:
479
+
480
+ - question: "Run interrupted. What next?"
481
+ - header: "Next step"
482
+ - options:
483
+ - `Resume from next` — "Continue with the next scenario"
484
+ - `Report what ran` — "Skip remaining and show partial findings"
485
+ - `Abort` — "Stop and report"
117
486
 
118
487
  ## Run
119
488
 
@@ -157,25 +526,334 @@ mv <findings-path>/../shots .xqa/test-plan/<slug>/runs/<iso-timestamp>/shots
157
526
 
158
527
  Use an ISO-8601 UTC timestamp (e.g. `2026-04-22T15-30-00Z`) as the directory name. Use `mv`, not `cp` — the plan directory owns the canonical copy.
159
528
 
160
- WARNING: After moving, all downstream steps reference the new paths under `.xqa/test-plan/<slug>/runs/`.
529
+ After moving, all downstream steps MUST reference new paths under `.xqa/test-plan/<slug>/runs/`.
530
+
531
+ Artifacts under `runs/` are immutable — MUST NOT edit or overwrite once written.
161
532
 
162
533
  ## Report
163
534
 
535
+ MUST NOT compute finding correlation in the skill. `xqa plan report` owns correlation. No ad-hoc string matchers.
536
+
537
+ ### xqa plan report usage
538
+
539
+ Flags (verified against CLI source):
540
+
541
+ - `--findings <path>` — required; path to `findings.json`
542
+ - `--specs <dir>` — optional; scenarios directory (defaults to `<xqa>/test-plan/default` when omitted)
543
+ - `--runs <path>` — optional; path to `scenario-runs.json`; when omitted, resolved as `scenario-runs.json` sibling to `findings.json`
544
+
545
+ Default invocation using skill's path convention:
546
+
164
547
  ```bash
165
548
  xqa plan report \
166
549
  --findings .xqa/test-plan/<slug>/runs/<iso-timestamp>/findings.json \
167
550
  --specs .xqa/test-plan/<slug>
168
551
  ```
169
552
 
170
- Parse stdout JSON as `CorrelatedReport`. Render grouped output:
553
+ Omit `--runs` by default; auto-discovery resolves `scenario-runs.json` from the same directory as `findings.json`.
554
+
555
+ Override `--runs` only when `scenario-runs.json` lives elsewhere:
556
+
557
+ ```bash
558
+ xqa plan report \
559
+ --findings .xqa/test-plan/<slug>/runs/<iso-timestamp>/findings.json \
560
+ --specs .xqa/test-plan/<slug> \
561
+ --runs .xqa/test-plan/<slug>/runs/<iso-timestamp>/scenario-runs.json
562
+ ```
563
+
564
+ Common failure modes:
565
+
566
+ - `FINDINGS_READ_FAILED` on `findings.json` — run has not completed yet or path is wrong; check the `mv` step completed
567
+ - Empty `scenarios` in report / all `not_run` — `--specs` points to wrong directory; slug mismatch between plan dir and findings path
568
+ - All `not_run` with `(no run record)` — `scenario-runs.json` absent; run predates attestation file; re-run with current `xqa run`
569
+ - Report writes `report.json` next to `findings.json` — not stdout; parse the emitted JSON from that file
570
+
571
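+ Parse the report JSON as `CorrelatedReport` (read it from `report.json` next to `findings.json` when the CLI writes it there rather than to stdout). Open the report with one sentence based on the bucket distribution: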
572
+
573
+ - All `passed`: "All clear — every scenario ran clean."
574
+ - Any `failed`: "A few things came up:" (then list)
575
+ - `passed > 0` + any `not_run` (no `failed`): "Some passed; a few didn't complete cleanly:"
576
+ - Any `not_run` only (no `passed`, no `failed`): "Heads up — some scenarios didn't complete:" (then list)
577
+ - Mix of `failed` + `not_run`: "Some issues, plus some scenarios that didn't complete cleanly:"
578
+
579
+ Render grouped output:
580
+
581
+ | Status | Rendering |
582
+ | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
583
+ | `passed` | Green check `- [x]` on the scenario line. Title rendered plain (no strikethrough). |
584
+ | `failed` | Red `- [x]` (or `- [!]` if monospace renderer) with title in strikethrough. Inline each finding description. Link screenshot paths on a sub-bullet. |
585
+ | `not_run` | Yellow/muted `- [ ]` with the scenario title. Append `*(not run — explorer outcome: <outcome>)*` where `<outcome>` is the run record's outcome verb (`errored`, `timed_out`, `aborted`) or `*(no run record)*` when the runs file is absent for that scenario. |
586
+ | Unmatched findings | Separate "Also noticed:" bullet list at the end. |
587
+
588
+ `not_run` means explorer never finished the scenario cleanly (crashed, timed out, aborted). NOT a pass, NOT a failure. Render muted/yellow; surface outcome verb so user knows whether to investigate or rerun. `passed` is reserved for scenarios where explorer completed AND emitted zero findings — real green check.
589
+
590
+ `not_run` with `(no run record)`: findings file has no sibling `scenario-runs.json` (pre-attestation run). Suggest re-run with current `xqa run`.
591
+
592
+ **Worked example — mixed outcomes:**
593
+
594
+ Suppose 3 scenarios ran. Scenario A had a finding, B passed, C timed out.
595
+
596
+ > Some issues, plus some scenarios that didn't complete cleanly:
597
+ >
598
+ > - [x] ~~Verify send flow blocks on zero balance~~
599
+ > Finding: error message wraps off-screen on small displays. [shot](.xqa/test-plan/<slug>/runs/<ts>/shots/002.png)
600
+ > - [x] Open app — empty-state banner visible on Portfolio
601
+ > - [ ] Tap history — transaction list populates _(not run — explorer outcome: timed_out)_
602
+ >
603
+ > 1 passed · 1 failed · 1 not run.
604
+
605
+ The footer summary line ("N passed · N failed · N not run") is mandatory whenever the run had >1 scenario.
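As a minimal sketch of the footer format, the helper below prints the summary line from three counts. The function name `summary_footer` is hypothetical and not part of `xqa`; the caller is assumed to have already counted statuses from the report.

```shell
# Hypothetical helper: format the mandatory footer line from three counts.
# $1 = passed count, $2 = failed count, $3 = not_run count
summary_footer() {
  printf '%s passed · %s failed · %s not run.\n' "$1" "$2" "$3"
}

# Example matching the worked example above.
summary_footer 1 1 1
```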
606
+
607
+ ## Write back to PR
608
+
609
+ ### When to run
610
+
611
+ After the report renders (clean or with findings), if Probe 3 during Generate returned an open PR, run these two probes in parallel before asking the user anything:
612
+
613
+ MUST run the gh-user resolver step FIRST (Step 0 below), then emit Probe A and Probe B in a single Bash batch. Probe B MUST run AFTER the resolver so `$GH_USER` is set; never pass the placeholder through as a literal string.
614
+
615
+ ```bash
616
+ # Step 0 — resolve GH user (primary)
617
+ GH_USER=$(gh api user --jq .login) || GH_USER=$(gh auth status --active -h github.com 2>&1 | grep -oE 'Logged in to github\.com account [^ ]+' | awk '{print $NF}')
618
+ # Probe A — current PR body (may have changed since Generate start)
619
+ gh pr view <number> --json body -q .body > /tmp/current_pr_body.md
620
+ # Probe B — existing self-tACK comment lookup (needed for Self-tACK section below)
621
+ gh pr view <number> --json comments --jq "[.comments[] | select(.author.login == \"$GH_USER\")] | map(select(.body | startswith(\"tACK\")))" > /tmp/existing_tack.json
622
+ ```
623
+
624
+ ### Gate — Update PR (write-back + self-tACK combined)
625
+
626
+ The write-back and self-tACK post are adjacent actions on the same PR. They are consolidated into a single "Update PR" gate. When the user selects an option, both sub-actions execute sequentially without further confirmation (this gate IS the confirmation for both). The self-tACK UPDATE confirmation (when a prior tACK exists and content differs) is a separate destructive-action gate that fires after the sub-actions — it is not merged here.
627
+
628
+ Gate form (canonical forbidden phrasings apply).
629
+
630
+ `AskUserQuestion` gate for Update PR:
631
+
632
+ - question: "Update PR #<N> with the executed test plan?"
633
+ - header: "Update PR"
634
+ - options:
635
+ - `Write plan + tACK` — "Replace ## Test plan section AND post self-tACK comment"
636
+ - `Write plan only` — "Replace ## Test plan section, skip tACK"
637
+ - `Skip` — "Keep PR body unchanged, skip tACK"
638
+ - "Other" allows free-form (e.g. "tACK only" — parse intent and act accordingly)
639
+
640
+ On `Write plan + tACK`: execute write-back mechanics (below), then execute self-tACK posting (build comment → existing tACK detection → new or update path).
641
+ On `Write plan only`: execute write-back mechanics only; stop before self-tACK.
642
+ On `Skip`: stop. Do not write back or post tACK.
643
+
644
+ ### Write-back mechanics
645
+
646
+ If the user selects `Write plan + tACK` or `Write plan only`:
647
+
648
+ 1. Read `/tmp/current_pr_body.md`.
649
+ 2. Locate the test plan section using the same heading detection rules as PR detection.
650
+ 3. Replace that section with the new `## Test plan` section (see format below). If no section exists, append the new section at the end of the body.
651
+ 4. Preserve all content below the original test plan section verbatim.
652
+ 5. Write the updated body to `/tmp/updated_pr_body.md`.
653
+ 6. Apply with:
654
+
655
+ ```bash
656
+ gh pr edit <number> --body-file /tmp/updated_pr_body.md
657
+ ```
658
+
659
+ Use `--body-file` — never interpolate the body as a string argument.
660
+
661
+ If `gh pr edit` fails, surface the error verbatim. Do not silently retry.
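Steps 1 through 5 can be sketched with `awk`. This is an illustration under stated assumptions, not the skill's actual implementation: the function name `replace_test_plan` is hypothetical, the heading alternatives are the exact-case variants named in this document's normalization rule, and the section is assumed to end at the next `## ` heading. The real heading detection rules are defined in the PR detection step and may be broader.

```shell
# Hypothetical sketch: replace (or append) the test plan section of a PR body.
# $1 = current body file, $2 = file holding the new "## Test plan" section.
# Prints the updated body on stdout.
replace_test_plan() {
  awk -v new="$2" '
    # Print the replacement section read from the file named in "new".
    function emit_new(   line) {
      while ((getline line < new) > 0) print line
      close(new)
    }
    # First matching heading: emit the new section, skip the old one.
    /^## (Test plan|Test Plan|Testing|QA steps)[ \t]*$/ && !done {
      emit_new(); done = 1; skipping = 1; next
    }
    # The next level-2 heading ends the old section; keep everything after it.
    skipping && /^## / { skipping = 0; print "" }
    !skipping { print }
    # No section found: append the new one at the end of the body.
    END { if (!done) { print ""; emit_new() } }
  ' "$1"
}
```

Redirecting the output to `/tmp/updated_pr_body.md` would then feed the `gh pr edit` call above. A `## ` line inside a fenced code block in the old section would end the skip early, so a real implementation may need fence tracking.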
662
+
663
+ ### Approved checklist write-back format
664
+
665
+ PR `## Test plan` section MUST reproduce approved checklist verbatim — every item user approved, as `- [ ]` bullets, unchecked (user has not confirmed run results in PR yet — findings step owns check mark flip to `[x]`).
666
+
667
+ Write-back MUST NOT:
668
+
669
+ - Write scenario titles as sole content
670
+ - Paraphrase, compress, or summarize
671
+ - Check any box at write-back time
672
+
673
+ Write-back layout:
674
+
675
+ - Single-scenario plan: emit `## Test plan` then the flat `- [ ]` bullet list. No scenario subheader.
676
+ - Multi-scenario plan: emit `## Test plan`, then for each scenario: `### <Scenario name>` subheader (verbatim from planner output), then the flat `- [ ]` bullets for that scenario's approved steps.
677
+
678
+ Example — single scenario (5 approved steps):
679
+
680
+ ```
681
+ ## Test plan
682
+
683
+ - [ ] Swipe down Portfolio — Profile overlay slides down
684
+ - [ ] Tap Settings — Settings list appears
685
+ - [ ] Scroll to Test Warning row — row visible
686
+ - [ ] Tap Test Warning — modal appears with content
687
+ - [ ] Dismiss modal via swipe down or tap outside
688
+ ```
689
+
690
+ Example — two scenarios:
691
+
692
+ ```
693
+ ## Test plan
694
+
695
+ ### Settings warning modal
696
+
697
+ - [ ] Swipe down Portfolio — Profile overlay slides down
698
+ - [ ] Tap Settings — Settings list appears
699
+ - [ ] Scroll to Test Warning row — row visible
700
+ - [ ] Tap Test Warning — modal appears with content
701
+ - [ ] Dismiss modal via swipe down or tap outside
702
+
703
+ ### Send flow
704
+
705
+ - [ ] Tap Send — send sheet opens
706
+ - [ ] Enter recipient — address field accepts input
707
+ - [ ] Confirm — transaction submitted
708
+ ```
709
+
710
+ Validation before posting (mirrors tack-pr discipline):
711
+
712
+ - Every approved checklist item appears as a `- [ ]` bullet in the write-back
713
+ - Scenario subheaders match scenario titles from the planner output exactly
714
+ - Single-scenario plan has no subheader
715
+ - Multi-scenario plan has one subheader per scenario, no skipped scenarios
716
+ - No `[x]` boxes at write-back time — all boxes MUST be unchecked
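Two of these checks (no checked boxes, every approved bullet present verbatim) lend themselves to a mechanical test. The sketch below assumes the approved checklist and the proposed section are available as files; `validate_writeback` and both file arguments are hypothetical names, and the subheader checks are left out.

```shell
# Hypothetical sketch: reject a write-back that has checked boxes or whose
# unchecked bullets differ from the approved checklist.
# $1 = approved checklist file, $2 = proposed "## Test plan" section file
validate_writeback() {
  approved_file=$1
  writeback_file=$2
  # Any checked box is a forbidden write-back shape.
  if grep -qE '^[[:space:]]*- \[[xX]\] ' "$writeback_file"; then
    echo "checked box found in write-back" >&2
    return 1
  fi
  # Unchecked bullets must match the approved checklist line for line.
  approved=$(grep -E '^[[:space:]]*- \[ \] ' "$approved_file" || true)
  written=$(grep -E '^[[:space:]]*- \[ \] ' "$writeback_file" || true)
  if [ "$approved" != "$written" ]; then
    echo "write-back bullets differ from approved checklist" >&2
    return 1
  fi
  return 0
}
```

Comparing the bullet lines as whole strings catches missing, paraphrased, and reordered items in one step.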
717
+
718
+ Forbidden write-back shapes (MUST NOT emit):
719
+
720
+ - Scenario title as sole `## Test plan` content
721
+ - Paraphrased bullet text
722
+ - Summarized plan ("5 steps verified")
723
+ - Bullets with `[x]` at write-back time (findings step owns the check mark)
724
+ - `## Test plan` section with fewer bullets than approved checklist items
725
+
726
+ ### Post-run outcome note
727
+
728
+ The `## Test plan` section MUST stay verbatim — same bullets as approved checklist, always `- [ ]` unchecked. Per-item run status goes in the findings report, not the PR body.
729
+
730
+ Optional: append a status footnote below the checklist (plain markdown, not checkbox items):
731
+
732
+ ```
733
+ _Executed N scenarios: P passed · F failed · R not run. See [run report](<link>) for details._
734
+ ```
735
+
736
+ Rules:
737
+
738
+ - MUST NOT flip `[ ]` to `[x]` in `## Test plan` at any step — tACK is the checkbox-flip signal, not write-back
739
+ - MUST NOT rewrite bullets as scenario titles after findings — bullets stay verbatim
740
+ - MUST NOT inject `<!-- failed -->` or similar comments into the canonical checklist
741
+ - Section heading MUST be `## Test plan` (normalized from `## Testing` / `## Test Plan` / `## QA steps`)
742
+
743
+ Rationale: tACK reproduces the `## Test plan` checklist with `[x]`. If write-back mutates bullets post-run, tACK diverges from test plan. They MUST remain identical in text (tACK ⊆ test plan).
744
+
745
+ ## Self-tACK
746
+
747
+ ### Posting (triggered by Update PR gate — no separate gate for new tACK)
748
+
749
+ When Update PR gate selected `Write plan + tACK` (or free-form "tACK only"): proceed to self-tACK posting. No additional gate needed for new tACK — Update PR selection IS the confirmation. Self-tACK UPDATE confirmation (prior tACK exists + content differs) fires separately because diff preview is essential context before overwriting public comment.
750
+
751
+ ### Existing tACK detection
752
+
753
+ Use the result from Probe B (already fetched in parallel with Probe A above). Parse `/tmp/existing_tack.json`:
754
+
755
+ - If the array is empty → no prior tACK exists → proceed to New tACK path
756
+ - If the array is non-empty → prior tACK exists → proceed to Update path
757
+
758
+ ### Build the tACK comment body
759
+
760
+ Construct the comment following tack-pr conventions exactly:
761
+
762
+ - First line: `tACK` (exact casing)
763
+ - Blank line
764
+ - Every checkbox item from the PR `## Test plan` section reproduced with `[x]` — all boxes checked regardless of pass/fail/not_run status (tACK means "I followed the plan", not "everything passed")
765
+ - Text of each item MUST match PR `## Test plan` item verbatim (modulo `[ ]` → `[x]`)
766
+ - Nested bullets, indented lines, inline code, links: preserved byte-for-byte
767
+ - Scenario subheaders (`### <name>`) preserved when present in test plan
768
+ - Nothing else added — no sign-off, no commentary
769
+
770
+ Write the comment to `/tmp/tack_comment.md`.
771
+
772
+ Validation before rendering (same discipline as tack-pr):
773
+
774
+ - Source item count == rendered item count
775
+ - Every box is `[x]`, none are `[ ]`
776
+ - Item text identical to PR `## Test plan` item — no paraphrase, no invented items, no reordering
777
+ - `tACK` is first line; blank line follows
778
+ - No trailing commentary
779
+ - tACK body contains nothing beyond PR `## Test plan` content (same items + structure, only checkbox state differs)
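A minimal sketch of the build step, assuming the PR `## Test plan` section has already been extracted to a file. `build_tack` is a hypothetical name, and dropping the `## Test plan` heading line itself is an assumption based on the "nothing else added" rule; scenario subheaders are kept.

```shell
# Hypothetical sketch: build the tACK comment body from the test plan section.
# $1 = file holding the PR "## Test plan" section. Prints the comment body.
build_tack() {
  printf 'tACK\n\n'
  # Drop the section heading, flip every "[ ]" box to "[x]",
  # then strip leading blank lines so the body starts right after the blank line.
  sed -e '/^## Test plan[ ]*$/d' \
      -e 's/^\([[:space:]]*- \)\[ \]/\1[x]/' "$1" | sed '/./,$!d'
}
```

Writing the output to `/tmp/tack_comment.md` and comparing the count of `- [x]` lines against the count of `- [ ]` lines in the source covers the first two validation checks.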
780
+
781
+ ### New tACK path
782
+
783
+ No prior tACK comment exists. The Update PR gate already confirmed intent to post — proceed directly.
784
+
785
+ Render the proposed comment in chat (for transparency):
786
+
787
+ > Posting tACK to PR #<N> at <url>:
788
+ >
789
+ > (rendered comment body)
790
+
791
+ Then post immediately:
792
+
793
+ ```bash
794
+ gh pr comment <number> --body-file /tmp/tack_comment.md
795
+ ```
796
+
797
+ Confirm: "tACK posted to PR #<N>: <url>"
798
+
799
+ ### Update path
800
+
801
+ A prior tACK comment from the current user already exists (ID from Probe B JSON: `.id` field).
802
+
803
+ Diff existing vs new content:
804
+
805
+ - Parse the existing comment body from the Probe B JSON (`.body` field of the first tACK entry)
806
+ - Compare item-by-item to the new `/tmp/tack_comment.md`
807
+ - Compute: added items, removed items, items with changed checkbox state
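The added and removed sets can be approximated with `sort` and `comm`. This is a rough sketch with a hypothetical function name `tack_diff`; it treats each checkbox line as an opaque string, so an item whose checkbox state changed shows up under both headings instead of in a separate status-change list.

```shell
# Hypothetical sketch: list checkbox items added to or removed from a tACK.
# $1 = existing tACK body file, $2 = new tACK body file
tack_diff() {
  old_items=$(mktemp)
  new_items=$(mktemp)
  # Keep only checkbox lines; comm requires sorted input.
  grep -E '^[[:space:]]*- \[[ xX]\] ' "$1" | sort > "$old_items"
  grep -E '^[[:space:]]*- \[[ xX]\] ' "$2" | sort > "$new_items"
  echo "Added:"
  comm -13 "$old_items" "$new_items"   # lines only in the new body
  echo "Removed:"
  comm -23 "$old_items" "$new_items"   # lines only in the existing body
  rm -f "$old_items" "$new_items"
}
```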
808
+
809
+ If the diff is empty (content is identical): tell the user "Self-tACK already posted matches the new plan — no update needed." Stop.
810
+
811
+ If the diff is non-empty: render the diff in chat before gating:
812
+
813
+ > Existing tACK comment differs from the new plan:
814
+ >
815
+ > Added:
816
+ >
817
+ > - [x] <new item>
818
+ >
819
+ > Removed:
820
+ >
821
+ > - [x] <old item>
822
+ >
823
+ > Status changes:
824
+ >
825
+ > - [x] <item that was unchecked, now checked>
826
+
827
+ Gate form (canonical forbidden phrasings apply).
828
+
829
+ Then use `AskUserQuestion`:
830
+
831
+ - question: "Update the existing tACK comment on PR #<N>?"
832
+ - header: "Confirm tACK update"
833
+ - options:
834
+ - `Update` — "Overwrite the existing tACK comment"
835
+ - `Cancel` — "Leave existing comment unchanged"
836
+
837
+ WARNING: Overwrites public comment on shared PR. Diff MUST be shown before gate fires. MUST NOT update without diff + explicit confirmation.
838
+
839
+ On Update:
840
+
841
+ ```bash
842
+ gh api --method PATCH /repos/{owner}/{repo}/issues/comments/<comment-id> \
843
+ --field body=@/tmp/tack_comment.md
844
+ ```
845
+
846
+ Where `<comment-id>` is the `.databaseId` or `.id` (numeric) from Probe B JSON. Use `--field body=@/tmp/tack_comment.md` to read from file, avoiding shell escaping issues.
847
+
848
+ Confirm: "tACK comment updated on PR #<N>: <url>"
849
+
850
+ On Cancel: stop.
171
851
 
172
- | Group | Rendering |
173
- | ------------------------ | ------------------------------------------------------------------------------------------ |
174
- | `has_findings` scenarios | Strikethrough title + fail marker. Inline each finding description. Link screenshot paths. |
175
- | `not_run` scenarios | Muted/grey. Prefix with "(no findings referenced)" — do not mark as passed. |
176
- | Unmatched findings | Separate "Findings without a scenario" section at the end. |
852
+ ### Failure modes (both paths)
177
853
 
178
- IMPORTANT: `not_run` does NOT mean "passed". It means the agent did not emit a finding that correlated to this scenario. We literally don't know the outcome. Do not print a green check.
854
+ - `gh` not authenticated → surface `gh auth login` error verbatim; do not retry
855
+ - Post/update fails → surface error verbatim; do not silently retry
856
+ - Item count mismatch in validation → abort, tell user which items are missing or duplicated; do not post
179
857
 
180
858
  ## Rerun / Regenerate / Extend
181
859
 
@@ -185,27 +863,16 @@ IMPORTANT: `not_run` does NOT mean "passed". It means the agent did not emit a f
185
863
  | Regenerate | `rm .xqa/test-plan/<slug>/*.test.md` (specs only — preserve `runs/`). Re-run Generate flow. |
186
864
  | Extend | `xqa plan extend --out .xqa/test-plan/<slug>`. Planner appends new scenarios for fresh commits. |
187
865
 
188
- IMPORTANT: Regenerate never deletes `runs/`. Run history is preserved across regenerations.
866
+ Regenerate MUST NOT delete `runs/`. Run history is preserved across regenerations.
189
867
 
190
- ## Guardrails
868
+ ## Anti-patterns
191
869
 
192
- - IMPORTANT: The skill NEVER writes `.test.md` files directly. Only `xqa plan` and `xqa plan edit` produce or modify them. `scenarioId` and meta injection live in the planner — hand-authoring breaks correlation.
193
- - IMPORTANT: The skill NEVER computes finding correlation. Only `xqa plan report` does. Do not write ad-hoc string matchers.
194
- - WARNING: The skill NEVER modifies another branch's slug directory. Auto-prune only removes directories whose local branch is gone AND which are not the current branch.
195
- - WARNING: The skill NEVER edits artifacts under `.xqa/test-plan/<slug>/runs/`. Those are immutable run records.
196
- - IMPORTANT: The skill NEVER claims `not_run` scenarios passed. Report them as unknown, not green.
870
+ - Over-confirming: one "go" gate per phase; never two consecutive gates for the same transition
871
+ - Context-free prompts: every gate states what's about to happen (scenario name, precondition, transition steps)
872
+ - Phantom continuity: re-state current state from scratch at each gate; never assume the user remembers prior context
873
+ - Silent state assumption: ask user to confirm sim state when preconditions differ between groups
874
+ - Approval conflation: plan approval run approval always gate Run-go before first dispatch to confirm app state
197
875
 
198
876
  ## Manual test plan
199
877
 
200
- - [ ] Skill activates on `/xqa-test-plan`
201
- - [ ] Skill activates on implied intent ("what should I QA?")
202
- - [ ] Detect state correctly computes slug for current branch
203
- - [ ] Detect state auto-prunes stale sibling dirs but never current
204
- - [ ] Generate flow handles 0 booted simulators (error), 1 (auto), >1 (prompt)
205
- - [ ] Generate flow passes `--intent` correctly
206
- - [ ] Approval loop calls `xqa plan edit` per file
207
- - [ ] Run flow preflights simulator before dispatching
208
- - [ ] Report flow renders two-state status (has_findings / not_run) correctly
209
- - [ ] Rerun doesn't regenerate specs
210
- - [ ] Regenerate wipes specs before re-invoking xqa plan
211
- - [ ] Extend appends scenario-N+1 without re-writing existing scenarios
878
+ See [SKILL.test.md](./SKILL.test.md) for the full manual verification checklist.