pi-agent-browser-native 0.2.33 → 0.2.35

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (44) hide show
  1. package/CHANGELOG.md +46 -0
  2. package/README.md +47 -17
  3. package/docs/ARCHITECTURE.md +25 -13
  4. package/docs/COMMAND_REFERENCE.md +285 -47
  5. package/docs/ELECTRON.md +3 -3
  6. package/docs/RELEASE.md +22 -14
  7. package/docs/REQUIREMENTS.md +5 -5
  8. package/docs/SUPPORT_MATRIX.md +26 -22
  9. package/docs/TOOL_CONTRACT.md +97 -32
  10. package/extensions/agent-browser/index.ts +519 -2402
  11. package/extensions/agent-browser/lib/argv-descriptor.ts +90 -0
  12. package/extensions/agent-browser/lib/argv-grammar.ts +128 -0
  13. package/extensions/agent-browser/lib/command-policy.ts +71 -0
  14. package/extensions/agent-browser/lib/command-taxonomy.ts +336 -0
  15. package/extensions/agent-browser/lib/electron/cleanup.ts +1 -0
  16. package/extensions/agent-browser/lib/executable-path.ts +19 -0
  17. package/extensions/agent-browser/lib/input-modes/job.ts +62 -0
  18. package/extensions/agent-browser/lib/input-modes/params.ts +8 -8
  19. package/extensions/agent-browser/lib/input-modes.ts +3 -0
  20. package/extensions/agent-browser/lib/orchestration/batch-stdin.ts +65 -0
  21. package/extensions/agent-browser/lib/orchestration/browser-run/browser-action-model.ts +154 -0
  22. package/extensions/agent-browser/lib/orchestration/browser-run/click-dispatch.ts +149 -0
  23. package/extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts +77 -29
  24. package/extensions/agent-browser/lib/orchestration/browser-run/final-result.ts +6 -2
  25. package/extensions/agent-browser/lib/orchestration/browser-run/index.ts +33 -27
  26. package/extensions/agent-browser/lib/orchestration/browser-run/prepare.ts +74 -23
  27. package/extensions/agent-browser/lib/orchestration/browser-run/process-output.ts +67 -17
  28. package/extensions/agent-browser/lib/orchestration/browser-run/prompt-guards.ts +93 -0
  29. package/extensions/agent-browser/lib/orchestration/browser-run/session-state.ts +19 -123
  30. package/extensions/agent-browser/lib/orchestration/browser-run/types.ts +32 -1
  31. package/extensions/agent-browser/lib/orchestration/electron-host/index.ts +860 -0
  32. package/extensions/agent-browser/lib/playbook.ts +24 -23
  33. package/extensions/agent-browser/lib/prompt-policy.ts +122 -0
  34. package/extensions/agent-browser/lib/results/action-recommendations.ts +3 -23
  35. package/extensions/agent-browser/lib/results/categories.ts +1 -1
  36. package/extensions/agent-browser/lib/results/presentation/navigation.ts +2 -34
  37. package/extensions/agent-browser/lib/results/presentation/registry.ts +34 -6
  38. package/extensions/agent-browser/lib/results/presentation/semantic-action.ts +133 -0
  39. package/extensions/agent-browser/lib/results/presentation.ts +11 -6
  40. package/extensions/agent-browser/lib/runtime.ts +93 -227
  41. package/extensions/agent-browser/lib/session-page-state.ts +31 -14
  42. package/extensions/agent-browser/lib/temp.ts +148 -23
  43. package/package.json +4 -4
  44. package/scripts/agent-browser-capability-baseline.mjs +198 -1
package/docs/ELECTRON.md CHANGED
@@ -198,9 +198,9 @@ Closes the tracked managed session, stops only the wrapper-tracked process, veri
198
198
  - arbitrary Electron processes the wrapper did not start
199
199
  - explicit screenshots, downloads, PDFs, traces, HAR files, or recordings saved to caller-chosen paths
200
200
 
201
- For manual launches, `close` only closes the browser/CDP session. Close the app yourself and clean its profile/temp files with normal host tools.
201
+ For manual launches, close commands (`close`, `quit`, or `exit`) only close the browser/CDP session. Close the app yourself and clean its profile/temp files with normal host tools.
202
202
 
203
- On Pi session shutdown, active wrapper-owned Electron launches are best-effort cleaned. Stale restored records (PID gone, port dead) are **reported** instead of guessed at or killed.
203
+ On Pi `quit`, active wrapper-owned Electron launches are best-effort cleaned. On `/reload`, the current branch-visible active Electron launch and its isolated temp `userDataDir` are preserved for continuity while off-branch owned Electron launches are cleaned before process-local ownership is cleared. If cleanup is partial and skips or fails `user-data-dir` removal because the process or debug port is still live, the generic temp sweep preserves that profile path across reload, quit, repeated temp cleanup, process-exit cleanup, and stale temp-root pruning after restart rather than deleting it out from under the remaining host resource. If `electron.cleanup` closes the attached managed session but host process/profile cleanup is partial, later default browser calls still rotate away from that closed wrapper-managed session. Stale restored records (PID gone, port dead) are **reported** instead of guessed at or killed.
204
204
 
205
205
  ### `timeoutMs` by action (quick reference)
206
206
 
@@ -210,7 +210,7 @@ On Pi session shutdown, active wrapper-owned Electron launches are best-effort c
210
210
  | --- | --- | --- |
211
211
  | `launch` | Host-side wait for `DevToolsActivePort` and CDP readiness | **15 s**, hard-capped at **120 s** (`normalizeTimeoutMs` in `extensions/agent-browser/lib/electron/launch.ts`) |
212
212
  | `status` | Optional managed-session `get title` / `get url` reads used for mismatch diagnostics | Normal tool subprocess budget from `runAgentBrowserProcess` / `AGENT_BROWSER_DEFAULT_TIMEOUT`; localhost CDP HTTP probes keep a short fixed budget (`ELECTRON_STATUS_FETCH_TIMEOUT_MS` in `extensions/agent-browser/lib/electron/cleanup.ts`) |
213
- | `cleanup` | One combined budget for managed-session `close`, tracked process exit, debug-port verification, and temp profile removal | `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else **5000 ms** (`getImplicitSessionCloseTimeoutMs` in `extensions/agent-browser/lib/runtime.ts`, passed through `cleanupTrackedElectronLaunches` in `extensions/agent-browser/index.ts`) |
213
+ | `cleanup` | One combined budget for managed-session `close`, tracked process exit, debug-port verification, and temp profile removal | `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else **5000 ms** (`getImplicitSessionCloseTimeoutMs` in `extensions/agent-browser/lib/runtime.ts`, passed through `cleanupTrackedElectronHostLaunches` in `extensions/agent-browser/lib/orchestration/electron-host/index.ts`) |
214
214
  | `probe` | **Each** upstream read in the probe chain (`get title`, `get url`, focused `eval --stdin`, `tab list`, `snapshot -i`) | Same default as other tool calls (typically **28 s** per subprocess unless `AGENT_BROWSER_DEFAULT_TIMEOUT` / `PI_AGENT_BROWSER_PROCESS_TIMEOUT_MS` overrides `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts`) |
215
215
 
216
216
  ## `qa.attached` — current-session smoke check
package/docs/RELEASE.md CHANGED
@@ -35,13 +35,21 @@ npm run verify -- release
35
35
 
36
36
  `npm publish` runs npm’s `prepublishOnly` script from `package.json`, which executes the same `npm run verify -- release` gate and then `npm pack --dry-run`. That concatenated gate is everything in the default `npm run verify` step (generated playbook drift, TypeScript, the unit/fake suite, generated command-reference blocks, and live upstream command-reference sampling against the targeted `agent-browser` on `PATH`) plus the packaged Pi smoke in `package-pi`. Using `npm publish --ignore-scripts` skips that contract intentionally.
37
37
 
38
- `prepublishOnly` intentionally does **not** run `npm run verify -- lifecycle`, `npm run verify -- real-upstream`, or `npm run verify -- benchmark`; those are separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). Treat the bullets below as the full pre-publish contract even though only the `release` slice is automated at publish time.
38
+ `prepublishOnly` intentionally does **not** run `npm run verify -- lifecycle`, `npm run verify -- real-upstream`, `npm run verify -- dogfood`, or `npm run verify -- benchmark`; those are separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). Treat the bullets below as the full pre-publish contract even though only the `release` slice is automated at publish time.
39
39
 
40
- Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --no-extensions --no-skills -e .` from the checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Automated localhost and fake-upstream gates do not replace this human-readable live-site transcript evidence. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
40
+ For a deterministic real-browser wrapper smoke without model choice in the loop, run:
41
+
42
+ ```bash
43
+ npm run verify -- dogfood
44
+ ```
45
+
46
+ This mode uses the extension harness and the real `agent-browser` on `PATH` against public `example.com`, then verifies top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close. Use `npm run verify -- dogfood --keep-artifacts` or `--artifact-dir <path>` only while debugging, then delete retained screenshots. This smoke complements, but does not replace, human-readable interactive transcript evidence.
47
+
48
+ Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --no-extensions --no-skills -e .` from the checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Automated localhost, fake-upstream, and deterministic dogfood gates do not replace this human-readable live-site transcript evidence. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
41
49
 
42
50
  When reviewing saved session JSONL after a failed smoke or a `qa` preset that reclassified an upstream-successful batch, expect `agent_browser` tool rows to carry `isError: true` whenever `details.resultCategory` is `failure`. For normal prose output, model-visible text should end with a `Pi tool isError: true` category line; for caller-requested `--json` output, the hook preserves parseable JSON and only patches `isError`. The extension applies that patch on the `tool_result` path so Pi’s transcript matches the wrapper contract ([`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details)). Preserve a normal Pi session directory for those checks; avoiding `--no-session` keeps this evidence intact ([`AGENTS.md`](../AGENTS.md) preferred validation workflow).
43
51
 
44
- The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload` plus restart/`/resume` behavior:
52
+ The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload`, full relaunch with the same exact Pi 0.76 `--session-id`, managed-session continuity, persisted artifacts, and Pi failure-patch behavior. Branch-backed `session_tree` rehydration and cleanup ownership are validated by focused extension harness tests:
45
53
 
46
54
  ```bash
47
55
  npm run verify -- lifecycle
@@ -111,13 +119,13 @@ Please gather enough evidence to support the smoke result:
111
119
  Return a concise PASS/FAIL report with evidence and any tool or workflow issues you noticed. Do not create a dogfood-output report directory.
112
120
  ```
113
121
 
114
- Evaluator expectations after the queued Sauce Demo fixes: the agent should independently choose efficient, safe browser operations; native add-to-cart clicks should mutate cart state without JavaScript fallback; same-snapshot form fills may be batched safely when the agent chooses that route; the selected sort order should be verified; checkout must stop before Finish and must not place the order; screenshot and recording must use the requested paths or be explicitly reported unavailable; `network requests` may show public-demo telemetry 401s; `console` may report offline-cache logs; `errors` should show no page errors; and the browser session plus temp artifacts should be cleaned up after evidence is recorded. A run that clicks Finish despite the stop instruction or silently substitutes artifact paths is a workflow failure even if the store flow itself works.
122
+ Evaluator expectations after the queued Sauce Demo fixes: the agent should independently choose efficient, safe browser operations; native add-to-cart clicks should mutate cart state without the agent authoring `eval`/DOM-click fallbacks (the wrapper may fail with `details.clickDispatch` when upstream reports click success but no trusted DOM event reached the target); same-snapshot form fills may be batched safely when the agent chooses that route; the selected sort order should be verified; checkout must stop before Finish and must not place the order; if the agent attempts Finish or another likely final submit action, the wrapper should block it with `details.promptGuard.reason: "explicit-user-stop-boundary"`; screenshot and recording must use the requested paths or be explicitly reported unavailable, and close should be blocked with `details.promptGuard.reason: "requested-artifacts-missing-before-close"` until required screenshot paths are verified; `network requests` may show public-demo telemetry 401s; `console` may report offline-cache logs; `errors` should show no page errors; and the browser session plus temp artifacts should be cleaned up after evidence is recorded. A run that reaches `checkout-complete.html` or silently substitutes artifact paths is a workflow failure even if other store flow steps work.
115
123
 
116
124
  ## Deterministic agent efficiency benchmark
117
125
 
118
126
  [`scripts/agent-browser-efficiency-benchmark.mjs`](../scripts/agent-browser-efficiency-benchmark.mjs) is an accounting-only benchmark: it does not shell out to `agent-browser`, launch a browser, or read or write Pi sessions. It models representative `agent_browser` call shapes (including optional `stdin` for `batch` and top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` objects that compile to batch) and aggregates success rate, tool-call counts, UTF-8 size of model-visible strings, stale-ref failure and recovery counts, artifact success, distinct failure-category coverage, and summed elapsed-time estimates. When extending scenarios, keep them aligned with the closed `RQ-0068` “no reusable recipe layer” rationale in [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet) (benchmark ids cited there are the canonical inventory for that evidence bar).
119
127
 
120
- - **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes).
128
+ - **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes). Optional `--sample-jsonl path/to/session.jsonl` adds a `jsonlSample` section with real UTF-8 byte totals and per-workflow/overall p95 sizes for model-visible `agent_browser` tool-result text without changing deterministic scenario metrics; comparison ignores `jsonlSample` blocks.
121
129
  - **Default gate:** `npm run verify` checks generated playbook drift, runs `tsc --noEmit`, runs the full unit/fake suite under `test/**/*.test.ts` (including [`test/agent-browser.efficiency-benchmark.test.ts`](../test/agent-browser.efficiency-benchmark.test.ts) for scenario coverage and comparison behavior), verifies generated command-reference baseline blocks, and samples live upstream command-reference tokens. It does not spawn the standalone benchmark script’s JSON/Markdown run; that is what the opt-in slice below adds.
122
130
  - **Opt-in slice:** `npm run verify -- benchmark` runs the benchmark script once with `--json` and then that same test module alone. It is intentionally **not** part of `npm run verify -- release`, so routine publish gates stay decoupled from benchmark churn while still allowing a focused check after editing scenarios or `CURRENT_BENCHMARK_VERSION`.
123
131
 
@@ -172,7 +180,7 @@ Before publishing, validate both local-checkout modes without mixing their assum
172
180
 
173
181
  For expanded-surface validation, the smoke prompt should cover native tool invocation rather than shelling out to `agent-browser`: `--version`, `--help`, `skills list`, `skills get core --full`, `open` with `sessionMode: "fresh"`, `snapshot -i`, `click`, top-level `semanticAction` (locator shorthand compiled to upstream `find` and native dropdown selection compiled to upstream `select`, optionally with `semanticAction.session` when you need the same named upstream session as a prior explicit `--session` call), `eval --stdin`, `batch` via stdin, top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` (compiled batch smoke), `screenshot <path>`, explicit `--session … open` plus `--session … close`, `network requests`, `console` / `errors`, `diff snapshot`, `stream status` plus `stream disable`, `dashboard start` plus `dashboard stop`, and `chat <message>` (credential failure is acceptable evidence of wrapper pass-through when `AI_GATEWAY_API_KEY` is intentionally unset). Clean up any opened browser session with `close`, remove temporary files, and kill the tmux session before ending validation.
174
182
 
175
- This checklist assumes a real `agent-browser` on `PATH`. It complements, but does not overlap, `npm run verify -- lifecycle`: that harness swaps in a fake upstream binary and focuses on `/reload`, full restart, `/resume`, managed-session continuity, and spill-path persistence (`scripts/verify-lifecycle.mjs`), not the full command matrix above.
183
+ This checklist assumes a real `agent-browser` on `PATH`. It complements, but does not overlap, `npm run verify -- lifecycle`: that harness swaps in a fake upstream binary and focuses on `/reload`, exact `--session-id` relaunch, managed-session continuity, spill-path persistence, and Pi `tool_result` failure-patch semantics (`scripts/verify-lifecycle.mjs`), not the full command matrix above.
176
184
 
177
185
  When a smoke or dogfood run fails after `sessionMode: "fresh"` (missing binary, timeout, upstream error, or **`qa`** preset reclassification), read `details.managedSessionOutcome` before assuming which managed session the next default `sessionMode: "auto"` call will follow; the same struct can appear without the extra `Managed session outcome: …` prose line on `"auto"` failures. Field-level semantics and append ordering relative to other diagnostic tails are documented in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) and the session-mode notes in [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md).
178
186
 
@@ -184,7 +192,7 @@ Run the automated harness for deterministic configured-source lifecycle regressi
184
192
  npm run verify -- lifecycle
185
193
  ```
186
194
 
187
- The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs plain `pi` in `tmux` with default model **`zai/glm-5.1`**, puts a deterministic fake `agent-browser` first on `PATH`, and drives `/reload`, full restart, and `/resume`. Per-step tmux waits default to **180000 ms** (three minutes) in [`scripts/verify-lifecycle.mjs`](../scripts/verify-lifecycle.mjs) (`DEFAULT_TIMEOUT_MS`); override with `--timeout-ms <ms>` when slower models or cold starts need more headroom. Override the model when needed:
195
+ The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs `pi` in `tmux` with default model **`zai/glm-5.1`** and a deterministic `--session-id`, puts a deterministic fake `agent-browser` first on `PATH`, drives `/reload`, closes Pi, and relaunches with the same exact session id instead of typing `/resume`. It also asserts the JSONL session header id, same-page managed-session continuity, persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification. Per-step tmux waits default to **180000 ms** (three minutes) in [`scripts/verify-lifecycle.mjs`](../scripts/verify-lifecycle.mjs) (`DEFAULT_TIMEOUT_MS`); override with `--timeout-ms <ms>` when slower models or cold starts need more headroom. Override the model when needed:
188
196
 
189
197
  ```bash
190
198
  npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal
@@ -196,7 +204,7 @@ Combine flags in one invocation when both apply (order after `lifecycle` is flex
196
204
  npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --timeout-ms 600000
197
205
  ```
198
206
 
199
- It asserts same-page managed-session continuity, persisted `details.fullOutputPath` reachability after resume, and updated extension-code pickup through a temporary sentinel command. On failure it retains transcripts/session artifacts; on success it performs best-effort cleanup. It does not replace occasional real-browser manual smoke testing.
207
+ On failure it retains transcripts/session artifacts; on success it performs best-effort cleanup. It does not replace occasional real-browser manual smoke testing.
200
208
 
201
209
  **Lifecycle triage:** a timeout on sentinel `v2` after `/reload` often means Pi rejected reload while the TUI still showed `Working…` (`Wait for the current response to finish before reloading`), even when the session JSONL already has a final assistant message. Re-run with `--keep-artifacts --verbose`, inspect the retained pane capture, and confirm the configured model follows tool prompts reliably. Slower models may need a higher `--timeout-ms` than the **180000 ms** default.
202
210
 
@@ -217,7 +225,7 @@ Manual validation remains useful for release confidence and installed-package ch
217
225
 
218
226
  1. Configure exactly one active source for this extension in Pi settings: this checkout path before publishing, or the installed package after publishing.
219
227
  2. Launch plain `pi` so extension discovery is active.
220
- 3. Validate managed-session continuity with `/reload` and a full restart + `/resume`.
228
+ 3. Validate managed-session continuity with `/reload` and a full restart plus exact `--session-id` relaunch or `/resume`.
221
229
  4. Re-check local extension-side docs (`README.md`, `docs/COMMAND_REFERENCE.md`, `docs/TOOL_CONTRACT.md`, including the [`semanticAction`](TOOL_CONTRACT.md#semanticaction) rules when that shorthand or upstream `find` / `select` behavior changes) and regenerated prompt fragments from `extensions/agent-browser/lib/playbook.ts` via `npm run docs -- playbook check` or `npm run docs`. When the upstream `agent-browser` version or help surface changed, run `npm run verify -- command-reference`.
222
230
 
223
231
  ### Real upstream contract validation
@@ -264,8 +272,8 @@ Recommended configured-source lifecycle follow-up:
264
272
 
265
273
  1. Open a page with the implicit managed session and confirm the title.
266
274
  2. Run `/reload`, then ask for `snapshot -i` and confirm the same page is still active.
267
- 3. Exit `pi`, relaunch it against the same session file or use `/resume`, then ask for `snapshot -i` again and confirm the same page is still active.
268
- 4. Open a large page that compacts its snapshot output and confirm `details.fullOutputPath` still exists after the restart/resume flow.
275
+ 3. Exit `pi`, relaunch it against the same exact session id/path or use `/resume`, then ask for `snapshot -i` again and confirm the same page is still active.
276
+ 4. Open a large page that compacts its snapshot output and confirm `details.fullOutputPath` still exists after the restart/resume/exact-session flow.
269
277
  5. Trigger an oversized non-snapshot output (for example a deliberately large `eval --stdin` result) and confirm the tool prints the actual spill file path directly in content instead of only referencing a details key.
270
278
  6. Validate at least one direct file-download flow with `download <selector> <path>`.
271
279
  7. Validate at least one asynchronous export flow with `click` followed by `wait --download <path>`, confirming the wait result reports `savedFilePath`/`savedFile` and checking `details.artifacts[].exists` before relying on the requested path being present on disk.
@@ -286,7 +294,7 @@ Then run the real-browser smoke prompt:
286
294
  Use the agent_browser tool to open https://react.dev and then take an interactive snapshot.
287
295
  ```
288
296
 
289
- Only use plain `pi` for installed-package validation after temporarily disabling or removing the checkout source or any other active source for this extension from Pi settings. Then confirm `pi` exposes the native `agent_browser` tool, that a basic `open` + `snapshot -i` flow works, and that `/reload` plus restart/`/resume` keep following the same implicit managed browser session.
297
+ Only use plain `pi` for installed-package validation after temporarily disabling or removing the checkout source or any other active source for this extension from Pi settings. Then confirm `pi` exposes the native `agent_browser` tool, that a basic `open` + `snapshot -i` flow works, and that `/reload` plus restart with exact `--session-id` relaunch or `/resume` keep following the same implicit managed browser session.
290
298
 
291
299
  ## Release notes checklist
292
300
 
@@ -302,7 +310,7 @@ Before publishing:
302
310
  - confirm both local-checkout modes still work for pre-release validation: isolated `pi --no-extensions -e .` smoke testing for general checkout loading (add `--no-skills` for extension-focused bounded smokes) and configured-source lifecycle validation
303
311
  - complete interactive `tmux` live-site extension smoke with `pi --no-extensions --no-skills -e .` and the native `agent_browser` tool (at least one simple static site and one real documentation/product site; include `qa` or `job`/`batch` when those surfaces changed; use the [public Grafana stress checklist](#public-grafana-stress-checklist) when dashboard/diagnostic/artifact behavior changed; close sessions and remove screenshots/temp artifacts; record evidence). Run separate skill-enabled dogfood only when validating skill routing/report-generation behavior—see [Pre-release checks](#pre-release-checks); automated gates are not a substitute
304
312
  - rerun `npm run verify -- release`
305
- - run `npm run verify -- lifecycle` for configured-source `/reload` plus restart/`/resume` regression coverage (required before publish; see [Pre-release checks](#pre-release-checks))
313
+ - run `npm run verify -- lifecycle` for configured-source `/reload`, exact `--session-id` relaunch, managed-session continuity, persisted-spill, and Pi failure-patch regression coverage (required before publish; see [Pre-release checks](#pre-release-checks))
306
314
  - confirm [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md) still maps every current baseline inventory section to docs, runtime handling, tests, and validation status
307
- - manually exercise real-browser `/reload` and full restart + `/resume` continuity when release risk warrants browser-level confidence beyond the fake upstream harness
315
+ - manually exercise real-browser `/reload` and full restart plus exact `--session-id` relaunch or `/resume` continuity when release risk warrants browser-level confidence beyond the fake upstream harness
308
316
  - publish only after the tarball contents and isolated packaged-extension smoke check match expectations
@@ -53,7 +53,7 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
53
53
  - Keep the current local-checkout path documented as the practical pre-release and development flow.
54
54
  - Most users will install this extension globally rather than as a project-local extension.
55
55
  - Local checkout smoke testing should use explicit CLI loading such as `pi --no-extensions -e .` or `pi --no-extensions -e /absolute/path/to/pi-agent-browser-native`; Pi settings are bypassed in this mode and code edits require a process restart for validation.
56
- - Local checkout hot-reload and resume validation should use configured-source lifecycle mode: exactly one active checkout/package source in Pi settings, launched with plain `pi`, so `/reload` exercises discovered/configured resources.
56
+ - Local checkout hot-reload and exact-session relaunch validation should use configured-source lifecycle mode: exactly one active checkout/package source in Pi settings, launched with plain `pi` (or the lifecycle harness' exact `--session-id` relaunch path), so `/reload` and relaunch events exercise discovered/configured resources. Focused extension harness tests validate Pi `session_tree` branch rehydration and cleanup ownership.
57
57
  - Do **not** rely on repo-local `.pi/extensions/` auto-discovery for this package, because it conflicts with the global installed-package path.
58
58
 
59
59
  ### Native-tool preference
@@ -85,10 +85,10 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
85
85
  - The primary confidence path is a real `pi` session driven in `tmux`.
86
86
  - For quick local checkout smoke validation, launch `pi --no-extensions -e .` from the repository root so only the checkout copy loads; do not rely on Pi settings or `/reload` semantics in this isolated mode.
87
87
  - For hot-reload validation, configure exactly one active source for this extension in Pi settings and launch plain `pi`; validate `/reload` there because it exercises auto-discovered/configured resources.
88
- - Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart, and `/resume`, and asserts managed-session continuity plus persisted artifact survival. It is its own `npm run verify` mode rather than part of the default `npm run verify` sequence, but operators still run it before every publish. The harness defaults Pi to model `zai/glm-5.1` (`scripts/verify-lifecycle.mjs`); pass `--model <id>` after `lifecycle` when a different model is required. Keep `docs/RELEASE.md` accurate about the harness behavior, cleanup, transcript retention, and limitations.
89
- - Validate a full `pi` restart with `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths.
88
+ - Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart plus exact `--session-id` relaunch, and asserts managed-session continuity, persisted artifact survival, and real Pi `tool_result` failure-patch semantics. It is its own `npm run verify` mode rather than part of the default `npm run verify` sequence, but operators still run it before every publish. The harness defaults Pi to model `zai/glm-5.1` (`scripts/verify-lifecycle.mjs`); pass `--model <id>` after `lifecycle` when a different model is required. Keep `docs/RELEASE.md` accurate about the harness behavior, cleanup, transcript retention, and limitations.
89
+ - Validate a full `pi` restart with exact `--session-id` relaunch or `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths. Validate branch-backed state changes with the focused `session_tree` harness tests.
90
90
  - Prefer full `pi` restart over `/reload` when validating extension changes beyond a quick reload smoke check.
91
- - Use `/resume` when needed after restart.
91
+ - Use `/resume` or an explicit session id/path when needed after restart.
92
92
  - Keep testing broader than a single smoke site like `example.com`.
93
93
  - Bounded release smokes that validate this extension should disable auto-loaded skills with `--no-skills`; run skill-enabled dogfood separately only when validating external skill routing or report-generation behavior.
94
94
  - Maintain a concrete release/package verification workflow in `docs/RELEASE.md` and matching repository scripts.
@@ -111,7 +111,7 @@ The design should comfortably support workflows such as:
111
111
  - Package-manifest behavior matters more than repo-local development wiring.
112
112
  - The extension should use official `pi` hooks and package resources where possible.
113
113
  - The wrapper should stay thin, with upstream `agent-browser` remaining the source of truth for command semantics.
114
- - Successful and failed tool outcomes should surface bounded machine-readable fields on Pi-facing `details` (`resultCategory`, `successCategory`, `failureCategory`, optional structured `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`, optional `artifactVerification` with the same shape on successful `batchSteps[]` rows) so agents can branch without parsing prose; stateful commands (`auth`, `cookies`, `storage`, `dialog`, `frame`, `state`) plus other structured diagnostics (for example `network`, `diff`, `trace`, `stream`, `dashboard`, `chat`) and `batch` should redact secret-bearing payloads in model-facing `details.data`, including the compact per-step `batch` roll-up on the parent result (full per-step payloads live on `batchSteps[]`). The contract lives in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), enums and classifier precedence live in `extensions/agent-browser/lib/results/categories.ts` and `contracts.ts` (also re-exported from `shared.ts`), and presentation-time summaries, redaction, network request follow-ups, and artifact verification rollups are assembled in `extensions/agent-browser/lib/results/presentation.ts` (`buildPageChangeSummary`, `PAGE_CHANGE_SUMMARY_COMMANDS`, `redactPresentationData`, `buildArtifactVerificationSummary`, `buildBatchPresentation`).
114
+ - Successful and failed tool outcomes should surface bounded machine-readable fields on Pi-facing `details` (`resultCategory`, `successCategory`, `failureCategory`, optional structured `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`, optional `artifactVerification` with the same shape on successful `batchSteps[]` rows) so agents can branch without parsing prose; stateful commands (`auth`, `cookies`, `storage`, `dialog`, `frame`, `state`) plus other structured diagnostics (for example `network`, `diff`, `trace`, `stream`, `dashboard`, `chat`) and `batch` should redact secret-bearing payloads in model-facing `details.data`, including the compact per-step `batch` roll-up on the parent result (full per-step payloads live on `batchSteps[]`). The contract lives in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), enums and classifier precedence live in `extensions/agent-browser/lib/results/categories.ts` and `contracts.ts` (also re-exported from `shared.ts`), and presentation-time summaries, redaction, network request follow-ups, and artifact verification rollups are assembled in `extensions/agent-browser/lib/results/presentation.ts` (`buildPageChangeSummary`, command taxonomy predicates from `command-taxonomy.ts`, `redactPresentationData`, `buildArtifactVerificationSummary`, `buildBatchPresentation`).
115
115
  - User-facing docs belong in `README.md` and the canonical published files under `docs/`.
116
116
  - Agent workflow and deeper testing procedures can stay in `AGENTS.md`, but published docs must not depend on that file being present.
117
117
  - When upstream `agent-browser` changes, refresh the local command reference, prompt guidance, and other extension-side docs so agents still have a repo-readable equivalent of the blocked direct-binary help path.
@@ -27,8 +27,8 @@ When upstream ships a new `agent-browser` or the inventory changes:
27
27
 
28
28
  - Target upstream: `agent-browser 0.27.0` (must match `CAPABILITY_BASELINE.targetVersion` in [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs)).
29
29
  - Source of truth: `CAPABILITY_BASELINE.inventorySections` in the same file (stable `id` keys: `skills`, `core-commands`, `state-tabs-frames-dialogs`, `network-storage-artifacts-diagnostics`, `batch-auth-setup-ai`, `options-and-env`).
30
- - Status: supported for the current wrapper contract.
31
- - High-priority support gaps: none identified in the baseline audit.
30
+ - Status: supported for the current wrapper contract after the 2026-05-26 all-command audit.
31
+ - High-priority support gaps: 2026-05-26 audit found sessionless local commands and command-scoped value flags needed sharper wrapper handling; runtime/tests/docs now cover those paths. Remaining upstream-owned caveat: `agent-browser 0.27.0` help mentions `wait <selector> --state hidden`, but source parsing does not implement that distinct wait mode, so wrapper docs steer agents to `wait --fn` predicates.
32
32
  - Post-`v0.2.29` review state: commits `eb55320` through `86abbfb` add browser guidance/smoke coverage plus `RQ-0086` click-probe reduction, `RQ-0087` same-snapshot form fill batching, `RQ-0088` current-ref fallback on locator misses, `RQ-0089` direct-upstream click mutation investigation, and `RQ-0090` stop-boundary/artifact-path guidance. Verification gates below were rerun on 2026-05-18 after those tasks landed. Constrained `job` (`RQ-0064`), the lightweight `qa` preset (`RQ-0065`), the experimental `sourceLookup` helper (`RQ-0066`), and the experimental `networkSourceLookup` helper (`RQ-0067`) are implemented; see [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#job), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sourcelookup), and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#networksourcelookup). Reusable browser recipes (`RQ-0068`) are intentionally not adopted as a runtime surface; see [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet).
33
33
 
34
34
  ## Verification evidence
@@ -37,23 +37,25 @@ Re-run the gates below before each release; this table records what the closure
37
37
 
38
38
  | Gate | Evidence | Status |
39
39
  | --- | --- | --- |
40
- | Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-18 (`npm run verify`, `agent-browser 0.27.0` on `PATH`). |
41
- | Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-18 (`npm run verify -- real-upstream`). |
42
- | Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-18 (`npm run verify -- package-pi`). |
43
- | `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-18 (`npm run verify -- release`). `prepublishOnly` still needs a fresh run during actual publish. |
44
- | Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`, restart, `/resume`, session continuity, slash-command sentinel tokens (`v1` then `v2` after rewriting the packaged extension to simulate pickup), and persisted spill reachability with a fake upstream on `PATH`. Default Pi model is `zai/glm-5.1`; default per-step wait is **180000 ms** (`DEFAULT_TIMEOUT_MS`); override model with `--model <id>` and waits with `--timeout-ms <ms>`. Passthrough flags in [`scripts/project.mjs`](../scripts/project.mjs): `--keep-artifacts`, `--model`, `--verbose`, and `--timeout-ms` plus a value (for example `npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --keep-artifacts --verbose --timeout-ms 600000`). | Pass on 2026-05-18 (`npm run verify -- lifecycle`). Treat any future unexplained red lifecycle gate as a release blocker. |
45
- | Quick isolated Pi smoke | `pi --no-extensions -e .` from repo root; native `agent_browser` only. | Pass on 2026-05-18 for a fresh interactive tmux smoke: the agent opened `https://example.com`, waited for `Example Domain`, saved `/tmp/piab-isolated-smoke.png` with verified `image/png` artifact metadata, closed the browser session, and reported PASS. Broader historical coverage also includes version/help/skills, open/snapshot/click, eval stdin, batch stdin, screenshot, explicit session, `sessionMode: "fresh"`, network requests, console/errors, diff snapshot, stream status/disable, dashboard start/stop, and chat credential-failure pass-through during RQ-0055. |
40
+ | Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-27 (`npm run verify`, `agent-browser 0.27.0` on `PATH`). |
41
+ | Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-27 (`npm run verify -- real-upstream`, `agent-browser 0.27.0` on `PATH`). |
42
+ | Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-27 (`npm run verify -- package-pi`). |
43
+ | Deterministic dogfood smoke | `npm run verify -- dogfood` (`scripts/verify-agent-browser-dogfood.ts`) drives the native wrapper against public `example.com` through top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close with the real `agent-browser` on `PATH`. | Pass on 2026-05-24 (`npm run verify -- dogfood --artifact-dir /tmp/pi-agent-browser-release-dogfood --json`; artifacts removed). |
44
+ | Efficiency benchmark | `npm run verify -- benchmark` runs deterministic browser workflow accounting plus focused benchmark tests, including JSONL sampling fixtures and job/qa/sourceLookup/networkSourceLookup/Electron scenario coverage. | Pass on 2026-05-24 (`npm run verify -- benchmark`). |
45
+ | `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, dogfood, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-24 (`npm run verify -- release`). `prepublishOnly` still needs a fresh run during actual publish. |
46
+ | Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`, closes and relaunches Pi with the same exact `--session-id`, checks the JSONL session header id, session continuity, slash-command sentinel tokens (`v1` then `v2` after rewriting the packaged extension to simulate pickup), persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification with a fake upstream on `PATH`. Default Pi model is `zai/glm-5.1`; default per-step wait is **180000 ms** (`DEFAULT_TIMEOUT_MS`); override model with `--model <id>` and waits with `--timeout-ms <ms>`. Passthrough flags in [`scripts/project.mjs`](../scripts/project.mjs): `--keep-artifacts`, `--model`, `--verbose`, and `--timeout-ms` plus a value (for example `npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --keep-artifacts --verbose --timeout-ms 600000`). | Pass on 2026-05-27 (`npm run verify -- lifecycle --timeout-ms 300000`). Treat any future unexplained red lifecycle gate as a release blocker. |
47
+ | Quick isolated Pi smoke | `pi --no-extensions --no-skills -e . --tools agent_browser` from repo root; native `agent_browser` only. | Pass on 2026-05-27 for an interactive tmux checkout smoke (`pi --no-extensions --no-skills -e . --session-dir <temp> --model zai/glm-5.1`): opened `https://example.com` with `sessionMode: "fresh"`, ran `snapshot -i`, verified `Example Domain`, closed the browser session, reported PASS, and removed the temp session dir/tmux session. Broader historical coverage also includes version/help/skills, open/snapshot/click, eval stdin, batch stdin, screenshot, explicit session, `sessionMode: "fresh"`, network requests, console/errors, diff snapshot, stream status/disable, dashboard start/stop, and chat credential-failure pass-through during RQ-0055. |
46
48
 
47
49
  ## Baseline checklist by inventory section
48
50
 
49
51
  | Baseline section | Baseline items | Documentation | Runtime handling | Test coverage | Validation status |
50
52
  | --- | --- | --- | --- | --- | --- |
51
- | Built-in skills | `skills list`, `skills get core`, `skills get core --full`, `skills get <name>`, `skills get electron`, `skills get slack`, `skills get dogfood`, `skills get vercel-sandbox`, `skills get agentcore`, `skills path [name]` | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills), generated baseline block, README proof section, release docs. | `isStatelessInspectionCommand` keeps read-only `skills list` / `skills get` / `skills path` JSON inspection stateless while preserving thin upstream passthrough. | `test/agent-browser.runtime.test.ts`; `test/agent-browser.extension-validation.test.ts` skills/provider matrix; real-upstream inspection/skills group. | Supported. Real upstream covers `skills list`, `skills get core --full`, `skills path core`; fake matrix covers specialized skills. |
52
- | Core page, element, navigation, and extraction commands | `open <url>`, `click <sel>`, `dblclick <sel>`, `type <sel> <text>`, `fill <sel> <text>`, `press <key>`, `keyboard type <text>`, `keyboard inserttext <text>`, `keydown Shift`, `keyup Shift`, `hover <sel>`, `focus <sel>`, `check <sel>`, `uncheck <sel>`, `select <sel> <val...>`, `drag <src> <dst>`, `upload <sel> <files...>`, `download <sel> <path>`, `scroll <dir> [px]`, `scrollintoview <sel>`, `wait <sel|ms>`, `screenshot [path]`, `screenshot --full`, `screenshot --annotate`, `pdf <path>`, `snapshot`, `eval <js>`, `connect <port|url>`, `close [--all]`, `back`, `forward`, `reload`, `pushstate <url>`, `get <what> [selector]`, `is <what> <selector>`, `find <locator> <value> <action>`, `mouse <action> [args]`, `set <setting> [value]` | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md), README quick start. | Thin upstream passthrough with wrapper-owned `--json`, managed-session planning, stale-ref guidance, artifact verification, page-change summaries, no-op scroll diagnostics, focused-combobox diagnostics, first-class `semanticAction` / `job` compile paths for upstream `select`, and redaction. | Real-upstream core matrix covers representative interactions/navigation/extraction/artifacts plus raw/semantic/job `select`; fake core matrix covers additional passthrough and ordering plus no-op scroll, combobox-focus diagnostics, and select compiler validation. | Supported. Some upstream semantics remain upstream-owned; wrapper contract and artifact metadata are tested. |
53
- | Sessions, state, tabs, frames, dialogs, and windows | `session`, `session list`, `state save <path>`, `state load <path>`, `tab list`, `tab new --label <name> [url]`, `tab <t<N>|label>`, `frame <selector|main>`, `dialog accept [text]`, `dialog dismiss`, `dialog status`, `window new` | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands) (session/state/tabs/frames/dialogs/windows), stateful workflow notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Stateful presentation summaries/redaction; state save artifact handling; explicit/implicit session restore; tab target pinning; frame/dialog/window passthrough. | `test/agent-browser.extension-validation.test.ts` stateful matrix; runtime session/resume tests; presentation stateful redaction tests; lifecycle harness for reload/resume. | Supported. External profile/auth state remains operator-owned and documented. |
54
- | Network, storage, artifacts, diagnostics, and performance | `network <action>`, `network route <url> [--abort|--body <json>] [--resource-type <csv>]`, `network request <requestId>`, `cookies [get|set|clear]`, `cookies set --curl <file>`, `storage <local|session>`, `diff snapshot`, `diff screenshot --baseline`, `diff url <u1> <u2>`, `trace start|stop [path]`, `profiler start|stop [path]`, `record start <path> [url]`, `record restart <path> [url]`, `record stop`, `console [--clear]`, `errors [--clear]`, `highlight <sel>`, `inspect`, `clipboard <op> [text]`, `stream enable [--port <n>]`, `stream disable`, `stream status`, `react tree`, `react inspect <id>`, `react renders start`, `react renders stop [--json]`, `react suspense [--only-dynamic] [--json]`, `vitals [url] [--json]`, `removeinitscript <id>` | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage) and diagnostic sections; [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Thin passthrough plus command-specific compact diagnostic summaries, artifact metadata for HAR/diff/trace/profile/record, early missing-ffmpeg recording warnings, sensitive-data redaction, timeout bounds, and cleanup-pair guidance. | Fake non-core matrix covers network/diff/trace/profiler/record/console/errors/highlight/inspect/clipboard/stream/dashboard/chat JSON shapes and redaction; real-upstream covers safe network requests/HAR, diff, trace/profiler, console/errors/highlight, stream, vitals, and React missing-renderer. | Supported. Browser-opening or environment-sensitive operations (`inspect`, OS clipboard, full React app inspection) are delegated thinly and documented as needing suitable local/browser state. |
55
- | Batch, auth, confirmations, setup, dashboard, and AI commands | `batch [--bail]`, `auth save <name>`, `auth save <name> --password-stdin`, `auth login <name>`, `auth list`, `auth show <name>`, `auth delete <name>`, `confirm <id>`, `deny <id>`, `chat <message>`, `dashboard start --port <n>`, `dashboard stop`, `install`, `install --with-deps`, `upgrade`, `doctor [--fix]`, `doctor --offline --quick`, `doctor --json`, `profiles` | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-and-setup), README security notes, release docs. | Batch stdin is native-tool-only; top-level `job`, `qa`, and experimental `sourceLookup` / `networkSourceLookup` compile to `batch` with generated stdin (caller `stdin` rejected for those modes); job `select` compiles to upstream `select <selector> <value...>`; auth/confirmation details are redacted; dashboard/chat/setup/doctor are passed through thinly with timeout/cleanup guidance; package doctor remains separate and read-only. | Unit/fake tests cover batch, auth password stdin, confirmations, dashboard/chat summaries, and doctor diagnostics; extension-validation covers `job`, `qa`, `sourceLookup`, and `networkSourceLookup` compilation plus `details.sourceLookup` / `details.networkSourceLookup` evidence, including job `select`; [`scripts/agent-browser-efficiency-benchmark.mjs`](../scripts/agent-browser-efficiency-benchmark.mjs) includes `source-lookup-visible-element` and `network-source-lookup-failed-request` scenarios; quick isolated Pi smoke covered dashboard start/stop and chat credential-failure pass-through. | Supported. `install`, `upgrade`, `doctor --fix`, and interactive auth/chat/setup flows are upstream-owned and should be run only when the operator intends those side effects. |
56
- | Global flags, config, providers, policy, and environment | `--profile <name|path>`, `AGENT_BROWSER_PROFILE`, `--session <name>`, `AGENT_BROWSER_SESSION`, `--session-name <name>`, `AGENT_BROWSER_SESSION_NAME`, `--state <path>`, `AGENT_BROWSER_STATE`, `--auto-connect`, `AGENT_BROWSER_AUTO_CONNECT`, `--headers <json>`, `--init-script <path>`, `AGENT_BROWSER_INIT_SCRIPTS`, `--enable <feature>`, `AGENT_BROWSER_ENABLE`, `--executable-path <path>`, `AGENT_BROWSER_EXECUTABLE_PATH`, `--extension <path>`, `AGENT_BROWSER_EXTENSIONS`, `--args <args>`, `AGENT_BROWSER_ARGS`, `--user-agent <ua>`, `AGENT_BROWSER_USER_AGENT`, `--proxy <server>`, `AGENT_BROWSER_PROXY`, `HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, `--proxy-bypass <hosts>`, `AGENT_BROWSER_PROXY_BYPASS`, `NO_PROXY`, `--ignore-https-errors`, `AGENT_BROWSER_IGNORE_HTTPS_ERRORS`, `--allow-file-access`, `AGENT_BROWSER_ALLOW_FILE_ACCESS`, `--headed`, `AGENT_BROWSER_HEADED`, `--cdp <port>`, `--color-scheme <scheme>`, `AGENT_BROWSER_COLOR_SCHEME`, `--download-path <path>`, `AGENT_BROWSER_DOWNLOAD_PATH`, `--engine <name>`, `AGENT_BROWSER_ENGINE`, `--no-auto-dialog`, `AGENT_BROWSER_NO_AUTO_DIALOG`, `--json`, `AGENT_BROWSER_JSON`, `--annotate`, `AGENT_BROWSER_ANNOTATE`, `--screenshot-dir <path>`, `AGENT_BROWSER_SCREENSHOT_DIR`, `--screenshot-quality <n>`, `AGENT_BROWSER_SCREENSHOT_QUALITY`, `--screenshot-format <fmt>`, `AGENT_BROWSER_SCREENSHOT_FORMAT`, `--content-boundaries`, `AGENT_BROWSER_CONTENT_BOUNDARIES`, `--max-output <chars>`, `AGENT_BROWSER_MAX_OUTPUT`, `--allowed-domains <list>`, `AGENT_BROWSER_ALLOWED_DOMAINS`, `--action-policy <path>`, `AGENT_BROWSER_ACTION_POLICY`, `--confirm-actions <list>`, `AGENT_BROWSER_CONFIRM_ACTIONS`, `--confirm-interactive`, `AGENT_BROWSER_CONFIRM_INTERACTIVE`, `-p, --provider <name>`, `AGENT_BROWSER_PROVIDER`, `browserbase`, `kernel`, `browseruse`, `browserless`, `agentcore`, `--device <name>`, `AGENT_BROWSER_IOS_DEVICE`, `agent-browser -p ios device list`, `agent-browser -p ios swipe up`, `agent-browser -p ios tap @e1`, `--model <name>`, `AI_GATEWAY_MODEL`, `-v, --verbose`, `-q, --quiet`, `--debug`, `AGENT_BROWSER_DEBUG`, `AGENT_BROWSER_CONFIG`, `AGENT_BROWSER_DEFAULT_TIMEOUT`, `AGENT_BROWSER_STREAM_PORT`, `AGENT_BROWSER_IDLE_TIMEOUT_MS`, `AGENT_BROWSER_ENCRYPTION_KEY`, `AGENT_BROWSER_STATE_EXPIRE_DAYS`, `AGENT_BROWSER_IOS_UDID`, `AI_GATEWAY_URL`, `AI_GATEWAY_API_KEY` | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment), README provider/setup notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sessionmode), architecture/runtime docs. | Runtime handles value flags, launch-scoped flags, redacted invocation echoes, `sessionMode: "fresh"` recovery hints, explicit sessions, and provider/device launch-scoping. Process env forwards a curated allowlist/prefix set for upstream/provider credentials without cloning the whole parent env. Subprocess completion uses `watchSpawnedChildCompletion` so inherited stdio from detached descendants cannot stall the tool after the direct `agent-browser` child exits (`RQ-0097`). | Runtime tests cover launch-scoped flags, provider/device planning, redaction, stateless inspections, and explicit/fresh sessions. Process tests cover provider env prefixes, exit-vs-close completion when descendants keep stdio open, and timeout (`124`) resolution under the same stdio-inheritance pattern (`RQ-0097`). Fake provider/specialized-skill matrix covers provider argv/env passthrough. Package doctor checks version/source drift. | Supported. Provider clouds, iOS/Appium, Browserbase/Kernel/BrowserUse/Browserless/AgentCore, proxies, profiles, and credentials require external setup; the wrapper documents and forwards them thinly rather than emulating provider behavior. |
53
+ | Built-in skills | 13 canonical tokens from baseline section `skills`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills), generated baseline block, README proof section, release docs. | `needsManagedSession` keeps read-only skills inspection sessionless while preserving thin upstream passthrough. | Runtime and extension-validation skills/provider matrix; real-upstream inspection/skills group. | Supported. |
54
+ | Core page, element, navigation, and extraction commands | 74 canonical tokens from baseline section `core-commands`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md), README quick start. | Thin passthrough with wrapper-owned JSON/session planning, ref guidance, artifact verification, page-change summaries, click-dispatch diagnostics, no-op scroll/focus diagnostics, shorthand compilers, and redaction. | Real-upstream core matrix plus fake core matrix for passthrough, ordering, diagnostics, and compiler validation. | Supported. Upstream semantics remain upstream-owned. |
55
+ | Sessions, state, tabs, frames, dialogs, and windows | 20 canonical tokens from baseline section `state-tabs-frames-dialogs`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands), stateful workflow notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Stateful summaries/redaction, state artifact handling, sessionless local command planning, managed-session restore, tab target pinning, and close alias cleanup. | Extension-validation stateful matrix, runtime session/resume tests, presentation redaction tests, lifecycle harness. | Supported. External profile/auth state remains operator-owned. |
56
+ | Network, storage, artifacts, diagnostics, and performance | 42 canonical tokens from baseline section `network-storage-artifacts-diagnostics`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage), diagnostic sections, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Thin passthrough plus compact diagnostics, artifact metadata, missing-ffmpeg warnings, sensitive-data redaction, timeout bounds, and cleanup-pair guidance. | Fake non-core matrix and safe real-upstream coverage for network/HAR, diff, trace/profiler, console/errors/highlight, stream, vitals, and React missing-renderer. | Supported. Environment-sensitive operations need suitable local/browser state. |
57
+ | Batch, auth, confirmations, setup, dashboard, devices, and AI commands | 24 canonical tokens from baseline section `batch-auth-setup-ai`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-devices-and-setup). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-devices-and-setup), README security notes, release docs. | Native-tool batch stdin, generated `job`/`qa`/lookup batch plans, auth/confirmation redaction, sessionless local auth/setup/dashboard/doctor planning, timeout/cleanup guidance. | Unit/fake batch/auth/confirmation/dashboard/chat/doctor tests; extension-validation for structured input modes; efficiency benchmark scenarios. | Supported. Interactive side-effecting setup/auth/chat remains upstream-owned. |
58
+ | Global flags, config, providers, policy, and environment | 117 canonical tokens from baseline section `options-and-env`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment), README provider/setup notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sessionmode), architecture/runtime docs. | Runtime handles command discovery, value-flag prevalidation, launch-scoped flags, redacted echoes, fresh-session recovery hints, explicit sessions, provider/device launch-scoping, curated env forwarding, and subprocess completion. | Runtime tests for flags/planning/redaction/session behavior; process tests for env and stdio-linger completion; fake provider/specialized-skill matrix; package doctor. | Supported. Provider clouds, iOS/Appium, proxies, profiles, and credentials require external setup. |
57
59
 
58
60
  ## Follow-up decision after closure
59
61
 
@@ -71,15 +73,17 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
71
73
 
72
74
  `RQ-0091` keeps advanced release smoke tests focused on extension behavior instead of external skill routing: the Sauce Demo smoke in [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt) now launches with `--no-skills`, restricts tools to `agent_browser`, and uses bounded release-smoke wording rather than dogfood/exploratory QA language. Runtime guidance remains the concise stop-boundary and exact-artifact-path contract from `extensions/agent-browser/lib/playbook.ts`; no site-specific automation or recipe layer was added. Evidence from the failed high/low local-shop runs showed skill/report drift (`dogfood-output` substitution) and reasoning complexity, not a wrapper command defect, so skill-enabled dogfood remains a separate validation mode. Human workflow: [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt), [`AGENTS.md`](../AGENTS.md#preferred-testing-workflow), and [`REQUIREMENTS.md`](REQUIREMENTS.md#testing-guidance).
73
75
 
74
- `RQ-0097` keeps upstream subprocess completion reliable when detached descendants inherit the child’s stdio handles: `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts` uses `watchSpawnedChildCompletion` to observe both Node `exit` and `close`, leaves piped stdio intact during the short post-`exit` grace (`EXIT_STDIO_GRACE_MS`, currently **100 ms**) so normal `close` can still win, destroys those streams only if the fallback resolves, and resolves with exit-code precedence `close` wrapper timeout (**124**) post-`exit` fallback for the direct child spawn failure (**127**) when `close` is still delayed so the Pi tool cannot hang after `agent-browser` has already exited. Human context: [`ARCHITECTURE.md`](ARCHITECTURE.md#direct-subprocess-execution) (subprocess bullet) and [`AGENTS.md`](../AGENTS.md) (**Runtime planning** → **Upstream subprocess completion**); fake coverage: `runAgentBrowserProcess resolves after exit when descendants keep stdio handles open` and `runAgentBrowserProcess returns timeout exit code when descendants keep stdio handles open` in [`test/agent-browser.process.test.ts`](../test/agent-browser.process.test.ts).
76
+ `RQ-0090` now backs stop-boundary and exact-artifact-path guidance with prompt-derived preflight guards instead of relying only on model instruction-following. `buildPromptPolicy` in `extensions/agent-browser/lib/prompt-policy.ts` extracts explicit stop-before-order/submit wording and exact requested artifact paths from the latest user message. `extensions/agent-browser/lib/orchestration/browser-run/browser-action-model.ts` normalizes click-like actions plus `press`/`key` Enter/Return submits; `prompt-guards.ts` blocks likely final targets on those covered shapes (including `@ref` role/name metadata from `details.refSnapshot`, selectors such as `finish`, and matching batch click/find steps) with `details.promptGuard.reason: "explicit-user-stop-boundary"` and documents excluded flows (`eval`, generic fill/type, `keyboard type`/`keyboard inserttext`, non-Enter keypresses). It also blocks browser `close` / `quit` / `exit` with `reason: "requested-artifacts-missing-before-close"` until required prompt screenshot paths are verified in `details.artifactManifest` (optional recording paths are required only when recording appears available). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`promptGuard`); human workflow: README stop-boundary/artifact notes and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md); fake coverage: `agentBrowserExtension blocks likely final order clicks when the user set a stop boundary`, `agentBrowserExtension blocks close until required prompt screenshot artifacts are saved`, and `buildPromptPolicy detects stop boundaries and requested artifact paths`.
75
77
 
76
- `RQ-0096` ships first-class Electron desktop-app support without adding a generic recipe runtime: top-level `electron` covers wrapper-owned `list`, isolated `launch` with snapshot/tabs/connect handoff, `status`, `cleanup`, and compact current-session or launch-scoped `probe`; `qa.attached` extends the existing QA preset for attached Electron/CDP sessions without introducing `electron.qa`. `launch.handoff` still defaults to `"snapshot"`, while `handoff: "tabs"` is documented as the safer diagnostic starting point when refs/content capture is not needed yet. Host install discovery (`discoverElectronApps`) is macOS/Linux-only today: on Windows `electron.list` reports `platform: "unsupported"` with an empty catalog and name/bundle targets cannot resolve from scans—use `executablePath` (or a host path to the Electron binary) for Windows launch targeting. Discovery adds non-blocking likely-sensitive app annotations plus visible isolated-profile/auth-state warnings; launch output and `details.electron.profileIsolation` state that wrapper launches do not reuse existing signed-in app profiles or attach to already-running authenticated apps, and point agents to the host debug-port launch plus raw `connect` path when signed-in local app state is the goal; launch timeout failures include PID/profile/DevToolsActivePort/timing diagnostics; status/probe add launch/session identifiers, liveness, mismatch/reattach next actions, and dead-launch context for `about:blank`; post-mutation Electron death is upgraded to `tab-drift` with `details.electronPostCommandHealth`; Electron fills can add `details.fillVerification`; Electron `@e…` mutations can add same-URL ref freshness guidance; broad Electron `get text` selectors add scope warnings; cleanup ownership is bounded to wrapper-created launch records and temp profiles; externally launched debug ports stay on the manual `args: ["connect", "<port-or-url>"]` path and remain host-owned. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#electron) plus [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) for `qa.attached`; human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps) and README common calls; implementation: `extensions/agent-browser/index.ts` and `extensions/agent-browser/lib/electron/`; deterministic efficiency evidence: `electron-lifecycle` and `electron-probe` in `scripts/agent-browser-efficiency-benchmark.mjs`; fake coverage includes Electron schema/probe/mismatch/post-command-health/fill-verification/broad-text/discovery-sensitivity and packaged-sourceLookup cases in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts). This plan is the `RQ-0068` revisit evidence for Electron specifically: [`docs/plans/electron-extension-2026-05-20.md`](plans/electron-extension-2026-05-20.md) documents repeated failure-prone discover/launch/attach/cleanup and multi-call state-probe sequences, plus bounded owner/versioning/test/docs artifacts.
78
+ `RQ-0097` keeps upstream subprocess completion reliable when detached descendants inherit the child’s stdio handles: `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts` uses `watchSpawnedChildCompletion` to observe both Node `exit` and `close`, leaves piped stdio intact during the short post-`exit` grace (`EXIT_STDIO_GRACE_MS`, currently **100 ms**) so normal `close` can still win, destroys those streams only if the fallback resolves, and resolves with exit-code precedence `close` wrapper timeout (**124**) post-`exit` fallback for the direct child spawn failure (**127**) when `close` is still delayed so the Pi tool cannot hang after `agent-browser` has already exited. Human context: [`ARCHITECTURE.md`](ARCHITECTURE.md#direct-subprocess-execution) (subprocess bullet) and [`AGENTS.md`](../AGENTS.md) (**Runtime planning** **Upstream subprocess completion**); fake coverage: `runAgentBrowserProcess resolves after exit when descendants keep stdio handles open` asserts the post-exit fallback returns near the 100 ms grace window instead of the process timeout, and `runAgentBrowserProcess returns timeout exit code when descendants keep stdio handles open` in [`test/agent-browser.process.test.ts`](../test/agent-browser.process.test.ts).
79
+
80
+ `RQ-0096` ships first-class Electron desktop-app support without adding a generic recipe runtime: top-level `electron` covers wrapper-owned `list`, isolated `launch` with snapshot/tabs/connect handoff, `status`, `cleanup`, and compact current-session or launch-scoped `probe`; `qa.attached` extends the existing QA preset for attached Electron/CDP sessions without introducing `electron.qa`. `launch.handoff` still defaults to `"snapshot"`, while `handoff: "tabs"` is documented as the safer diagnostic starting point when refs/content capture is not needed yet. Host install discovery (`discoverElectronApps`) is macOS/Linux-only today: on Windows `electron.list` reports `platform: "unsupported"` with an empty catalog and name/bundle targets cannot resolve from scans—use `executablePath` (or a host path to the Electron binary) for Windows launch targeting. Discovery adds non-blocking likely-sensitive app annotations plus visible isolated-profile/auth-state warnings; launch output and `details.electron.profileIsolation` state that wrapper launches do not reuse existing signed-in app profiles or attach to already-running authenticated apps, and point agents to the host debug-port launch plus raw `connect` path when signed-in local app state is the goal; launch timeout failures include PID/profile/DevToolsActivePort/timing diagnostics; status/probe add launch/session identifiers, liveness, mismatch/reattach next actions, and dead-launch context for `about:blank`; post-mutation Electron death is upgraded to `tab-drift` with `details.electronPostCommandHealth`; Electron fills can add `details.fillVerification`; Electron `@e…` mutations can add same-URL ref freshness guidance; broad Electron `get text` selectors add scope warnings; cleanup ownership is bounded to wrapper-created launch records and temp profiles; externally launched debug ports stay on the manual `args: ["connect", "<port-or-url>"]` path and remain host-owned. Runtime-owned off-branch launch records remain visible to `electron.status { launchId }`, `electron.status { all: true }`, `electron.probe { launchId }`, and `electron.cleanup`; default current-session `electron.probe` stays scoped to the active managed session, and no-arg status/cleanup reports ambiguity when multiple active branch/off-branch records are still owned. Explicit cleanup is serialized with managed-session work, records managed-session close success independently from partial process/profile cleanup, clears live/restore managed-session state for the closed wrapper session, updates branch-visible Electron state only with selected cleanup records instead of unrelated off-branch lookup records, and rotates the next default auto browser call away from that closed name. `/reload` preserves the current branch-visible active Electron launch and its isolated temp `userDataDir` for continuity, cleans off-branch owned Electron launches before clearing process-local ownership, and durably protects profile dirs from generic temp cleanup, quit cleanup, process-exit cleanup, and stale temp-root pruning after restart when partial Electron cleanup deliberately skips or fails `user-data-dir` removal. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#electron) plus [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) for `qa.attached`; human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps) and README common calls; implementation: `extensions/agent-browser/lib/input-modes/electron.ts`, `extensions/agent-browser/lib/orchestration/electron-host/`, `extensions/agent-browser/lib/orchestration/browser-run/`, `extensions/agent-browser/lib/electron/`, and dispatch/state wiring in `extensions/agent-browser/index.ts`; deterministic efficiency evidence: `electron-lifecycle` and `electron-probe` in `scripts/agent-browser-efficiency-benchmark.mjs`; fake coverage includes Electron schema/probe/mismatch/post-command-health/fill-verification/broad-text/discovery-sensitivity and packaged-sourceLookup cases in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus off-branch Electron status/probe/cleanup, targeted cleanup without unrelated branch promotion, explicit cleanup serialization, current and restored cleanup managed-session retirement, active-Electron reload preservation, off-branch Electron reload cleanup, durable partial off-branch reload/quit profile preservation, protected temp-root process-exit and stale-prune cleanup, and partial-cleanup managed-session untracking in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts). This plan is the `RQ-0068` revisit evidence for Electron specifically: [`docs/plans/electron-extension-2026-05-20.md`](plans/electron-extension-2026-05-20.md) documents repeated failure-prone discover/launch/attach/cleanup and multi-call state-probe sequences, plus bounded owner/versioning/test/docs artifacts.
77
81
 
78
82
  `RQ-0097` completes manual CDP attach recovery without making manually launched apps wrapper-owned: successful raw `connect` results append the session-scoped safe tab-list action `list-connected-session-tabs`; `snapshot -i` failures whose upstream error says `No active page` append the safe tab-list action `list-tabs-after-no-active-page` when a session is known. Agents then choose a stable `tab t<N>` target and run `snapshot -i` explicitly; the wrapper does not emit raw-connect or no-active-page snapshot retry ids without a wrapper-observed safe tab id. The runtime source of truth for these recovery ids is `AGENT_BROWSER_RECOVERY_NEXT_ACTION_IDS` in `extensions/agent-browser/lib/results/recovery-actions.ts` (re-exported from `shared.ts`). The guidance keeps manual signed-in desktop apps and explicit artifacts host-owned while `close` remains a browser/CDP-session close and `electron.cleanup` remains limited to wrapper-created `electron.launch` records. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`ELECTRON.md`](ELECTRON.md#manual-host-launch-pattern) and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps); fake coverage: raw connect and no-active snapshot assertions in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus central next-action helper coverage in [`test/agent-browser.results.test.ts`](../test/agent-browser.results.test.ts).
79
83
 
80
84
  `RQ-0068` remains closed with a no-adopt decision for a reusable named browser recipe runtime. The Electron evidence above justified a narrow typed shorthand and compact probe, not an open-ended recipe layer; future reusable recipes still require concrete repeated workflow evidence and a defined owner/versioning/test plan.
81
85
 
82
- `RQ-0098` completes the docs/playbook groundwork for desktop readiness and wait orchestration without adding a runtime primitive or reusable recipe layer. The accepted ladder is: prefer condition waits (`wait --text`, `wait --url`, `wait --fn`, `wait --load <state>`, `wait --download`) when a real condition exists; after raw manual CDP `connect`, inspect `tab list`, select a stable `tab t<N>` surface, then run a condition wait or `snapshot -i`; after wrapper-owned `electron.launch`, use `electron.probe` / `electron.status` when launch health or target mismatch matters; use `qa.attached` for current-session text/selector diagnostics; keep fixed waits as a last resort below the wrapper IPC budget; and treat fixed-wait payloads such as `"waited":"timeout"` as elapsed time rather than completion evidence. Manual signed-in attach docs now also restate that `connect` readiness is not immediate readiness, `close` only closes the browser/CDP session, `electron.cleanup` remains wrapper-owned, and manually launched apps plus explicit artifacts stay host-owned. Human workflow: [`ELECTRON.md`](ELECTRON.md#readiness-and-waits), [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#wait-for-page-readiness-or-downloads), README Electron section, and generated playbook text from `extensions/agent-browser/lib/playbook.ts`. Revisit a first-class host-idle primitive only with repeated desktop smoke evidence that condition waits, `qa.attached`, `electron.probe`, snapshots, and screenshots cannot cover the workflow. Verification: `npm run docs` keeps generated playbook fragments aligned; no runtime `details.nextActions` are part of this RQ.
86
+ `RQ-0098` completes the docs/playbook groundwork for desktop readiness and wait orchestration without adding a runtime primitive or reusable recipe layer. The accepted ladder is: prefer condition waits (`wait --text`, `wait --url`, `wait --fn`, `wait --load <state>`, `wait --download`) when a real condition exists; after raw manual CDP `connect`, inspect `tab list`, select a stable `tab t<N>` surface, then run a condition wait or `snapshot -i`; after wrapper-owned `electron.launch`, use `electron.probe` / `electron.status` when launch health or target mismatch matters; use `qa.attached` for current-session text/selector diagnostics; keep fixed waits as a last resort below the wrapper IPC budget; and treat fixed-wait payloads such as `"waited":"timeout"` as elapsed time rather than completion evidence. Manual signed-in attach docs now also restate that `connect` readiness is not immediate readiness, close commands (`close`, `quit`, or `exit`) only close the browser/CDP session, `electron.cleanup` remains wrapper-owned, and manually launched apps plus explicit artifacts stay host-owned. Human workflow: [`ELECTRON.md`](ELECTRON.md#readiness-and-waits), [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#wait-for-page-readiness-or-downloads), README Electron section, and generated playbook text from `extensions/agent-browser/lib/playbook.ts`. Revisit a first-class host-idle primitive only with repeated desktop smoke evidence that condition waits, `qa.attached`, `electron.probe`, snapshots, and screenshots cannot cover the workflow. Verification: `npm run docs` keeps generated playbook fragments aligned; no runtime `details.nextActions` are part of this RQ.
83
87
 
84
88
  `RQ-0100` makes desktop tab/surface drift recovery machine-readable without adding routine tab-list probes for normal clicks. When existing wrapper state already identifies a target tab, about:blank and tab-drift paths append `list-tabs-for-about-blank-recovery` or `list-tabs-for-tab-drift-recovery`, then `select-intended-tab-after-drift` and `snapshot-after-tab-recovery` when the stable `t<N>` id is known. The implementation reuses `priorSessionTabTarget`, `aboutBlankSessionMismatch`, `sessionTabCorrection`, `openResultTabCorrection`, and existing tab-correction outputs; it does not probe tabs for ordinary clicks beyond the RQ-0086-gated drift paths. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#tabs) and [`ELECTRON.md`](ELECTRON.md#troubleshooting); fake coverage: about:blank recovery and explicit-about:blank negatives in [`test/agent-browser.extension-tab-recovery.test.ts`](../test/agent-browser.extension-tab-recovery.test.ts), early tab-drift failure assertions in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), and central next-action helper coverage in [`test/agent-browser.results.test.ts`](../test/agent-browser.results.test.ts).
85
89
 
@@ -93,19 +97,19 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
93
97
 
94
98
  `RQ-0088` adds current-snapshot ref fallback for selector misses: when raw `find` or compiled `semanticAction` fails with `failureCategory: "selector-not-found"`, `extensions/agent-browser/index.ts` may take one fresh session-scoped `snapshot -i`, then `extensions/agent-browser/lib/results/selector-recovery.ts` looks for exact normalized role/name matches for the failed target and emits `details.visibleRefFallback` plus visible `Current snapshot ref fallback`. Non-fill matches append bounded direct-ref next actions (`try-current-visible-ref` / `try-current-visible-ref-N`); fill matches omit direct args/text and feed the RQ-0099 rich-input recovery path when the ref is editable. The matcher is intentionally narrow: role locators require `--name`; text-click maps only to exact-name `button`/`link` refs; label/placeholder fill maps only to exact-name textbox/searchbox-style refs; prefixes/fuzzy matches are ignored, and duplicate exact matches carry ambiguity safety copy. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`visibleRefFallback`, nextActions); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) selector strategy and README pitfalls; fake coverage: `agentBrowserExtension suggests current snapshot refs when raw find role locators miss` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts).
95
99
 
96
- `RQ-0072` guards page-scoped `@e…` refs against silent recycling: successful `snapshot` (or the last `snapshot` step inside a successful `batch`) records `details.refSnapshot` with ref ids and the snapshot page URL; `extensions/agent-browser/lib/session-page-state.ts` replays per-session snapshots and `refSnapshotInvalidation` markers from the transcript on reload/resume, clears them on successful `close`, invalidates prior refs when a session `snapshot` fails with `No active page`, rejects mutation-prone ref argv before spawn when the tab URL diverges, a ref id is missing from the latest snapshot, or the session refs are invalidated, blocks `batch` stdin that uses `@e…` on a guarded command after an earlier step that can navigate or mutate until a `snapshot` step appears later in the same stdin array (pre-spawn latch reset only), and prefixes `refresh-interactive-refs` with `--session` when the call names a session (including upstream-classified `stale-ref` outcomes). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`refSnapshot`, `refSnapshotInvalidation`, `stale-ref`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) snapshot/ref notes and README pitfalls; fake coverage: `agentBrowserExtension recommends tab recovery after No active page snapshot failures` and `agentBrowserExtension invalidates refs after No active page snapshot failures inside batch` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus `agentBrowserExtension blocks page-scoped ref reuse…`, `…blocks stale refs after page-changing steps inside a batch`, `…allows same-snapshot form fills before a batch click`, `…allows batch stdin ref steps after snapshot following an invalidating step`, `…records snapshot refs returned inside a successful batch`, and `…rejects refs absent from the latest same-page snapshot` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts).
100
+ `RQ-0072` guards page-scoped `@e…` refs against silent recycling: successful `snapshot` (or the last `snapshot` step inside a successful `batch`) records `details.refSnapshot` with ref ids and the snapshot page URL; `extensions/agent-browser/lib/session-page-state.ts` replays per-session snapshots and `refSnapshotInvalidation` markers from the active transcript branch on `session_start` and Pi 0.76 `session_tree` branch changes, clears them on successful close commands (`close`, `quit`, or `exit`), invalidates prior refs when a session `snapshot` fails with `No active page`, rejects mutation-prone ref argv before spawn when the tab URL diverges, a ref id is missing from the latest snapshot, or the session refs are invalidated, blocks `batch` stdin that uses `@e…` on a guarded command after an earlier step that can navigate or mutate until a `snapshot` step appears later in the same stdin array (pre-spawn latch reset only), and prefixes `refresh-interactive-refs` with `--session` when the call names a session (including upstream-classified `stale-ref` outcomes). The entrypoint also serializes `session_tree` restore and wrapper-owned browser commands with managed-session work, guards independent caller-owned explicit-session completions with a branch-state generation check, keeps process-owned cleanup registries for managed sessions and wrapper-launched Electron records separate from the branch-visible view, treats explicit wrapper-owned close rows and Electron cleanup managed-session steps as restore-visible close events, closes off-branch owned managed sessions and Electron launches on non-quit reload shutdown, preserves current branch-visible active managed/Electron sessions and active Electron temp profiles for reload continuity, and preserves fresh-session allocation monotonicity across branch restores. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`refSnapshot`, `refSnapshotInvalidation`, `stale-ref`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) snapshot/ref notes and README pitfalls; fake coverage: `agentBrowserExtension recommends tab recovery after No active page snapshot failures` and `agentBrowserExtension invalidates refs after No active page snapshot failures inside batch` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus `agentBrowserExtension blocks page-scoped ref reuse…`, `…rehydrates page-scoped refs from the current tree branch`, `…rehydrates managed browser session state from the current tree branch`, `…rehydrates artifact manifest state from the current tree branch`, `…keeps Electron cleanup ownership after session_tree switches away from the launch branch`, `…blocks stale refs after page-changing steps inside a batch`, `…allows same-snapshot form fills before a batch click`, `…allows batch stdin ref steps after snapshot following an invalidating step`, `…records snapshot refs returned inside a successful batch`, and `…rejects refs absent from the latest same-page snapshot` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts); managed-session reload cleanup, explicit close untracking/state rotation/restore, generated fresh-name reservation after repeated explicit closes, explicit-session command versus `session_tree` generation-guard coverage, explicit close versus in-flight implicit command serialization, and fresh-ordinal coverage lives in [`test/agent-browser.resume-state.test.ts`](../test/agent-browser.resume-state.test.ts).
97
101
 
98
102
  `RQ-0087` keeps the RQ-0072 guard but removes `fill` from the batch invalidation latch: `fill @e…` rows remain guarded against stale/missing refs, yet multiple same-snapshot form fills can run before the first click/submit/navigation step in one upstream `batch`. A later guarded ref after `click`, `open`, `reload`, or other invalidating rows still fails before spawn unless the batch includes a fresh `snapshot` step first. This improves login/checkout efficiency without permitting likely post-navigation ref reuse. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`Batch stdin ordering`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) ref notes; fake coverage: `agentBrowserExtension allows same-snapshot form fills before a batch click` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts).
99
103
 
100
- `RQ-0073` surfaces likely overlay blockers after no-navigation clicks without inventing blind targets: for **top-level** `click` results (unified command `click`, not `batch`-wrapped steps) whose upstream JSON includes `data.clicked`, whose prior pinned tab URL and post-click URL (from `details.navigationSummary`, gathered by one read-only `eval` summary when the click payload omits **both** string `data.url` and `data.title`) stay equal after the same fragment-insensitive normalization used for ref preflight, and where the same unified result did **not** already apply session tab correction or about-blank mismatch recovery, `extensions/agent-browser/index.ts` takes one fresh session-scoped `snapshot -i`, scans `refs` for strong modal context (`dialog` / `alertdialog`) plus up to three close/dismiss-pattern `button`/`link`/`menuitem` controls, and only then emits `details.overlayBlockers` (`candidates`, `summary`, and a `snapshot` map that can advance `refSnapshot`), visible `Possible overlay blockers`, and `inspect-overlay-state` / `try-overlay-blocker-candidate-*` next actions (with `--session` prefix when the session is named) appended after presentation follow-ups such as `inspect-after-mutation`. Page-wide privacy/sign-in/banner text without a dialog role is deliberately ignored to avoid warnings after ordinary same-page clicks. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`overlayBlockers`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) no-navigation click note and README pitfalls; fake coverage: `agentBrowserExtension surfaces likely overlay blockers after a no-op click` and `agentBrowserExtension does not report overlay blockers from unrelated page chrome after a successful same-page click` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
104
+ `RQ-0073` surfaces likely overlay blockers after no-navigation clicks without inventing blind targets: for **top-level** `click` results (unified command `click`, not `batch`-wrapped steps) whose upstream JSON includes `data.clicked`, whose prior pinned tab URL and post-click URL (from `details.navigationSummary`, gathered by one read-only `eval` summary when the click payload omits **both** string `data.url` and `data.title`) stay equal after the same fragment-insensitive normalization used for ref preflight, and where the same unified result did **not** already apply session tab correction, about-blank mismatch recovery, or `details.clickDispatch` fired for the same result, `extensions/agent-browser/index.ts` takes one fresh session-scoped `snapshot -i`, scans `refs` for strong modal context (`dialog` / `alertdialog`) plus up to three close/dismiss-pattern `button`/`link`/`menuitem` controls, and only then emits `details.overlayBlockers` (`candidates`, `summary`, and a `snapshot` map that can advance `refSnapshot`), visible `Possible overlay blockers`, and `inspect-overlay-state` / `try-overlay-blocker-candidate-*` next actions (with `--session` prefix when the session is named) appended after presentation follow-ups such as `inspect-after-mutation`. Page-wide privacy/sign-in/banner text without a dialog role is deliberately ignored to avoid warnings after ordinary same-page clicks. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`overlayBlockers`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) no-navigation click note and README pitfalls; fake coverage: `agentBrowserExtension surfaces likely overlay blockers after a no-op click` and `agentBrowserExtension does not report overlay blockers from unrelated page chrome after a successful same-page click` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
101
105
 
102
106
  `RQ-0086` reduces wrapper-induced click fragility found during Sauce Demo smokes: navigation-summary enrichment for click/back/forward/reload/dblclick now uses one read-only `eval` (`({ title: document.title, url: location.href })`) instead of serial `get title` plus `get url` probes, including tab-pinned batch wrappers. Tab pinning/post-command tab correction now runs only after the wrapper has evidence of tab-drift risk (profile restore correction, overlapping stale opens, or restored session state), so ordinary same-session clicks no longer get repeated `tab list` probes. This keeps `details.navigationSummary`, overlay blocker checks, and drift recovery intact while avoiding the upstream `agent-browser 0.27.0` sequence that could report later clicks as successful without dispatching pointer/click events after repeated getter/tab/snapshot probes. Fake coverage: `agentBrowserExtension enriches click results with a post-navigation title and url summary` in [`test/agent-browser.extension-tabs.test.ts`](../test/agent-browser.extension-tabs.test.ts), plus `agentBrowserExtension pins the intended tab inside a follow-up command when reconnect drift would otherwise steal focus` and about-blank/tab overlap assertions in [`test/agent-browser.extension-tab-recovery.test.ts`](../test/agent-browser.extension-tab-recovery.test.ts); manual validation source: [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt).
103
107
 
104
- `RQ-0089` investigated remaining Sauce Demo no-op clicks after RQ-0086. Minimal direct-upstream probes against `agent-browser 0.27.0` reproduced the residual `Finish` behavior without the wrapper: both CSS `click [data-test="finish"]` and `find role button click --name Finish` returned success, but a page-level click listener recorded no click event and the URL stayed on `checkout-step-two.html` after a 1s wait; a separate cart-link check showed normal trusted click events and navigation when the app handled the event. Conclusion: the residual issue is upstream/site interaction rather than wrapper post-click probes. Runtime behavior stays thin/no site-specific DOM-click fallback; docs now state that click success is attempted-action evidence only and agents must verify important mutations with URL/text/state checks or fresh snapshots before continuing. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`pageChangeSummary`, `nextActions`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) click verification notes; manual evidence: direct-upstream RQ-0089 development probe plus [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt) smoke caveats.
108
+ `RQ-0089` investigated Sauce Demo no-op clicks after RQ-0086, and the 2026-05-26 release smoke reproduced the failure against direct upstream `agent-browser 0.27.0`: CSS `click [data-test=add-to-cart-sauce-labs-backpack]` and current `@ref` clicks returned success, but a page-level listener recorded no trusted pointer/mouse/click events and the cart stayed unchanged; an in-page `element.click()` did mutate the cart. The wrapper now adds a bounded top-level non-Electron `click` dispatch probe before standalone clicks. If upstream reports success but no trusted DOM event reached the target, it fails the tool, records `details.clickDispatch.status: "no-native-event-observed"`, and appends `inspect-click-dispatch-miss` / `retry-click-after-dispatch-miss` next actions; it does **not** replay clicks in-page. This is not site-specific and does not alter `batch`/`job`/`qa` click steps. For `@e…` refs, the probe uses role/name metadata persisted in `details.refSnapshot` from the latest snapshot instead of running a pre-click snapshot that could recycle upstream refs. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`clickDispatch`, `refSnapshot`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) click verification notes; fake coverage: `agentBrowserExtension reports click dispatch diagnostic when upstream reports success without dispatching DOM events` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
105
109
 
106
110
  `RQ-0074` warns when `get text <selector>` may read hidden or tabbed DOM content: for non-ref CSS selectors, `extensions/agent-browser/index.ts` runs a read-only `eval --stdin` visibility probe after successful text reads, emits `details.selectorTextVisibility` plus visible warning text when the first match is hidden while visible matches exist or when multiple matches make the upstream first-match choice ambiguous, preserves multiple batched warnings in `details.selectorTextVisibilityAll`, and appends `inspect-visible-text-candidates` next actions. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`selectorTextVisibility`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) extraction note and README pitfalls; fake coverage: `agentBrowserExtension warns when get text may read hidden selector matches` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
107
111
 
108
- `RQ-0075` classifies QA and diagnostic network failures by likely impact: `summarizeNetworkFailures` / `classifyNetworkRequestFailure` in `extensions/agent-browser/lib/results/network.ts` (re-exported from `shared.ts`) split rows that already count as failed (`isFailedNetworkRequest`) into actionable versus benign low-impact browser icon asset misses (`isBenignAssetFailure`: favicon/apple-touch-icon basename patterns, 404/`failed`/string `error` signals, and image-like `resourceType`/`mimeType` when present). `analyzeQaPresetResults` fails `qa` only for actionable network failures while preserving benign rows in `qaPreset.warnings`, and network request presentation adds a compact actionable/benign summary plus per-row impact tags, ordered with actionable/benign failed rows before successful rows so late failures are visible even in capped previews. Because real Pi ignores returned `isError` fields from custom tool `execute`, `extensions/agent-browser/index.ts` also realigns `details.resultCategory: "failure"` outcomes to Pi-visible tool errors through a `tool_result` handler; it appends the exact failure category plus `Pi tool isError: true` to prose output and preserves caller-requested `--json` output as parseable JSON while patching `isError`. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) QA and network diagnostic notes; fake coverage: `agentBrowserExtension compiles lightweight QA presets and fails diagnostics` in [`test/agent-browser.extension-input-modes.test.ts`](../test/agent-browser.extension-input-modes.test.ts) plus network presentation assertions in [`test/agent-browser.presentation.test.ts`](../test/agent-browser.presentation.test.ts); real-Pi evidence should use saved JSONL sessions per [`AGENTS.md`](../AGENTS.md#preferred-testing-workflow).
112
+ `RQ-0075` classifies QA and diagnostic network failures by likely impact: `summarizeNetworkFailures` / `classifyNetworkRequestFailure` in `extensions/agent-browser/lib/results/network.ts` (re-exported from `shared.ts`) split rows that already count as failed (`isFailedNetworkRequest`) into actionable versus benign low-impact browser icon asset misses (`isBenignAssetFailure`: favicon/apple-touch-icon basename patterns, 404/`failed`/string `error` signals, and image-like `resourceType`/`mimeType` when present). `analyzeQaPresetResults` fails `qa` only for actionable network failures while preserving benign rows in `qaPreset.warnings`, and network request presentation adds a compact actionable/benign summary plus per-row impact tags, ordered with actionable/benign failed rows before successful rows so late failures are visible even in capped previews. Because real Pi ignores returned `isError` fields from custom tool `execute`, `extensions/agent-browser/index.ts` also realigns `details.resultCategory: "failure"` outcomes to Pi-visible tool errors through a `tool_result` handler; it appends the exact failure category plus `Pi tool isError: true` to prose output and preserves caller-requested `--json` output as parseable JSON while patching `isError`. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) QA and network diagnostic notes; fake coverage: `agentBrowserExtension compiles lightweight QA presets and fails diagnostics` in [`test/agent-browser.extension-input-modes.test.ts`](../test/agent-browser.extension-input-modes.test.ts) plus network presentation assertions in [`test/agent-browser.presentation.test.ts`](../test/agent-browser.presentation.test.ts); model-free real-Pi pipeline coverage in [`test/agent-browser.pi-pipeline.test.ts`](../test/agent-browser.pi-pipeline.test.ts) asserts both in-memory and persisted JSONL tool results for QA prose patching, parseable caller-requested `--json` failures, and strict public-schema rejection before upstream spawn; `npm run verify -- lifecycle` asserts the QA failure-patch line in a saved JSONL session.
109
113
 
110
114
  `RQ-0076` adds best-effort timeout recovery when the wrapper watchdog kills a stuck upstream process: `extensions/agent-browser/index.ts` calls `collectTimeoutPartialProgress` / `formatTimeoutPartialProgressText` to build `details.timeoutPartialProgress` from the compiled `job` or `qa` step list or parsed caller `batch` stdin, session-scoped `get url` / `get title` (plus optional planned-URL fallback from `open`/`navigate`/`pushstate` steps), and declared artifact paths (`screenshot`, `pdf`, `download`, `wait --download`) with existence/size checks, then appends a visible `Timeout partial progress` block with redacted URLs/paths. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) wrapper timeout note and README job section; fake coverage: `agentBrowserExtension reports partial progress and artifacts after job timeout` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
111
115
 
@@ -113,7 +117,7 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
113
117
 
114
118
  `RQ-0078` improves getter/eval discoverability: `extensions/agent-browser/lib/results/presentation/errors.ts` matches upstream failure text containing `unknown command`, `unknown subcommand`, or `unrecognized command` (case-insensitive) when the failed command token is one of `attr`, `count`, `html`, `text`, `title`, `url`, or `value`, then adds grouped-`get` prose; only `title` / `url` also emit read-only `nextActions` (`use-get-title` / `use-get-url`, with `--session` when the failed call named a session). The getter block is skipped when selector recovery already injected an `Agent-browser hint:` line into the same error string. `extensions/agent-browser/index.ts` adds `details.evalStdinHint` plus visible `Eval stdin hint` when `looksLikeFunctionEvalStdin` matches trimmed stdin and upstream JSON carries a plain empty-object `data.result`; empty arrays such as `[]` are valid eval results and are not warned. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`nextActions`, `evalStdinHint`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) extraction note and README quick start; fake coverage: `buildToolPresentation suggests grouped getter commands for common unknown getter shortcuts` and `agentBrowserExtension warns when eval stdin returns an empty object from a function-shaped snippet`.
115
119
 
116
- `RQ-0079` clarifies artifact lifecycle and cleanup ownership: `extensions/agent-browser/index.ts` adds `details.artifactCleanup` and visible `Artifact lifecycle` copy on successful `close` when `artifactManifest.entries` is non-empty (`getArtifactCleanupGuidance`), stating that close does not delete explicit artifacts; `explicitArtifactPaths` carries up to ten distinct existing `explicit-path` manifest paths after a filesystem existence check, skipping stale paths already removed by host tools (possibly empty when the recent window has no existing explicit rows). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`artifactCleanup`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) artifact retention section and README artifact notes; fake coverage: `agentBrowserExtension reports artifact lifecycle guidance on close` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
120
+ `RQ-0079` clarifies artifact lifecycle and cleanup ownership: `extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts` builds `details.artifactCleanup`, surfaced by process-output with visible `Artifact lifecycle` copy on successful close commands (`close`, `quit`, or `exit`) when `artifactManifest.entries` is non-empty (`getArtifactCleanupGuidance`), stating that close commands do not delete explicit artifacts; `explicitArtifactPaths` carries up to ten distinct existing `explicit-path` manifest paths after a filesystem existence check, skipping stale paths already removed by host tools (possibly empty when the recent window has no existing explicit rows). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`artifactCleanup`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) artifact retention section and README artifact notes; fake coverage: `agentBrowserExtension reports artifact lifecycle guidance on close` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts), plus close-alias unit coverage in [`test/agent-browser.runtime.test.ts`](../test/agent-browser.runtime.test.ts) and [`test/agent-browser.session-page-state.test.ts`](../test/agent-browser.session-page-state.test.ts).
117
121
 
118
122
  `RQ-0080` adds no-op scroll recovery for dense dashboards and nested panes: for successful top-level `scroll`, `extensions/agent-browser/index.ts` samples viewport and prominent scroll-container positions before and after execution with read-only session-scoped `eval --stdin` probes. If no sampled position changes, it emits `details.scrollNoop`, appends visible `Scroll diagnostic: no observed scroll movement`, appends exact `inspect-after-noop-scroll` / `verify-noop-scroll-visually` next actions, and updates `pageChangeSummary.nextActionIds` so agents can branch without parsing prose. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`scrollNoop`, `nextActions`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) scroll note; fake coverage: `agentBrowserExtension reports no-op scroll diagnostics with recovery next actions`.
119
123