pi-agent-browser-native 0.2.34 → 0.2.36
- package/CHANGELOG.md +44 -0
- package/README.md +25 -15
- package/docs/ARCHITECTURE.md +19 -13
- package/docs/COMMAND_REFERENCE.md +274 -44
- package/docs/ELECTRON.md +3 -3
- package/docs/RELEASE.md +11 -11
- package/docs/REQUIREMENTS.md +5 -5
- package/docs/SUPPORT_MATRIX.md +43 -24
- package/docs/TOOL_CONTRACT.md +50 -30
- package/extensions/agent-browser/index.ts +518 -2402
- package/extensions/agent-browser/lib/argv-descriptor.ts +90 -0
- package/extensions/agent-browser/lib/argv-grammar.ts +128 -0
- package/extensions/agent-browser/lib/command-policy.ts +71 -0
- package/extensions/agent-browser/lib/command-taxonomy.ts +336 -0
- package/extensions/agent-browser/lib/electron/cleanup.ts +1 -0
- package/extensions/agent-browser/lib/executable-path.ts +19 -0
- package/extensions/agent-browser/lib/input-modes/params.ts +6 -6
- package/extensions/agent-browser/lib/orchestration/batch-stdin.ts +65 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/browser-action-model.ts +154 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/click-dispatch.ts +149 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts +56 -30
- package/extensions/agent-browser/lib/orchestration/browser-run/final-result.ts +13 -3
- package/extensions/agent-browser/lib/orchestration/browser-run/index.ts +33 -27
- package/extensions/agent-browser/lib/orchestration/browser-run/prepare.ts +48 -22
- package/extensions/agent-browser/lib/orchestration/browser-run/process-output.ts +39 -10
- package/extensions/agent-browser/lib/orchestration/browser-run/prompt-guards.ts +93 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/session-state.ts +98 -124
- package/extensions/agent-browser/lib/orchestration/browser-run/types.ts +40 -1
- package/extensions/agent-browser/lib/orchestration/electron-host/index.ts +860 -0
- package/extensions/agent-browser/lib/playbook.ts +10 -10
- package/extensions/agent-browser/lib/prompt-policy.ts +122 -0
- package/extensions/agent-browser/lib/results/action-recommendations.ts +3 -23
- package/extensions/agent-browser/lib/results/presentation/navigation.ts +2 -34
- package/extensions/agent-browser/lib/runtime.ts +93 -227
- package/extensions/agent-browser/lib/session-page-state.ts +31 -14
- package/extensions/agent-browser/lib/temp.ts +148 -23
- package/package.json +4 -4
- package/scripts/agent-browser-capability-baseline.mjs +198 -1
package/docs/ELECTRON.md
CHANGED
|
@@ -198,9 +198,9 @@ Closes the tracked managed session, stops only the wrapper-tracked process, veri
|
|
|
198
198
|
- arbitrary Electron processes the wrapper did not start
|
|
199
199
|
- explicit screenshots, downloads, PDFs, traces, HAR files, or recordings saved to caller-chosen paths
|
|
200
200
|
|
|
201
|
-
For manual launches, `close` only
|
|
201
|
+
For manual launches, close commands (`close`, `quit`, or `exit`) only close the browser/CDP session. Close the app yourself and clean its profile/temp files with normal host tools.
|
|
202
202
|
|
|
203
|
-
On Pi
|
|
203
|
+
On Pi `quit`, active wrapper-owned Electron launches are best-effort cleaned. On `/reload`, the current branch-visible active Electron launch and its isolated temp `userDataDir` are preserved for continuity while off-branch owned Electron launches are cleaned before process-local ownership is cleared. If cleanup is partial and skips or fails `user-data-dir` removal because the process or debug port is still live, the generic temp sweep preserves that profile path across reload, quit, repeated temp cleanup, process-exit cleanup, and stale temp-root pruning after restart rather than deleting it out from under the remaining host resource. If `electron.cleanup` closes the attached managed session but host process/profile cleanup is partial, later default browser calls still rotate away from that closed wrapper-managed session. Stale restored records (PID gone, port dead) are **reported** instead of guessed at or killed.
|
|
204
204
|
|
|
205
205
|
### `timeoutMs` by action (quick reference)
|
|
206
206
|
|
|
@@ -210,7 +210,7 @@ On Pi session shutdown, active wrapper-owned Electron launches are best-effort c
|
|
|
210
210
|
| --- | --- | --- |
|
|
211
211
|
| `launch` | Host-side wait for `DevToolsActivePort` and CDP readiness | **15 s**, hard-capped at **120 s** (`normalizeTimeoutMs` in `extensions/agent-browser/lib/electron/launch.ts`) |
|
|
212
212
|
| `status` | Optional managed-session `get title` / `get url` reads used for mismatch diagnostics | Normal tool subprocess budget from `runAgentBrowserProcess` / `AGENT_BROWSER_DEFAULT_TIMEOUT`; localhost CDP HTTP probes keep a short fixed budget (`ELECTRON_STATUS_FETCH_TIMEOUT_MS` in `extensions/agent-browser/lib/electron/cleanup.ts`) |
|
|
213
|
-
| `cleanup` | One combined budget for managed-session `close`, tracked process exit, debug-port verification, and temp profile removal | `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else **5000 ms** (`getImplicitSessionCloseTimeoutMs` in `extensions/agent-browser/lib/runtime.ts`, passed through `
|
|
213
|
+
| `cleanup` | One combined budget for managed-session `close`, tracked process exit, debug-port verification, and temp profile removal | `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else **5000 ms** (`getImplicitSessionCloseTimeoutMs` in `extensions/agent-browser/lib/runtime.ts`, passed through `cleanupTrackedElectronHostLaunches` in `extensions/agent-browser/lib/orchestration/electron-host/index.ts`) |
|
|
214
214
|
| `probe` | **Each** upstream read in the probe chain (`get title`, `get url`, focused `eval --stdin`, `tab list`, `snapshot -i`) | Same default as other tool calls (typically **28 s** per subprocess unless `AGENT_BROWSER_DEFAULT_TIMEOUT` / `PI_AGENT_BROWSER_PROCESS_TIMEOUT_MS` overrides `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts`) |
|
|
215
215
|
|
|
216
216
|
## `qa.attached` — current-session smoke check
|
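The `timeoutMs` table above describes two budget rules: `launch` defaults to 15 s with a 120 s hard cap, and `cleanup` uses `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else 5000 ms. A minimal sketch of those rules follows. The helper names `resolveLaunchTimeoutMs` and `resolveCleanupTimeoutMs` are hypothetical, the environment is passed in as a plain object, and the handling of invalid values (fall back to the default) is an assumption; the package's real `normalizeTimeoutMs` and `getImplicitSessionCloseTimeoutMs` may behave differently.

```typescript
// Hypothetical sketch: helper names and invalid-value handling are assumptions,
// not the package's actual exports.
type Env = Record<string, string | undefined>;

const DEFAULT_LAUNCH_TIMEOUT_MS = 15_000; // documented launch default (15 s)
const MAX_LAUNCH_TIMEOUT_MS = 120_000; // documented launch hard cap (120 s)
const DEFAULT_CLEANUP_TIMEOUT_MS = 5_000; // documented cleanup fallback (5000 ms)

// Clamp a caller-supplied launch timeout to the documented cap.
// Missing or non-positive values fall back to the default (assumption).
function resolveLaunchTimeoutMs(requested?: number): number {
  if (requested === undefined || !Number.isFinite(requested) || requested <= 0) {
    return DEFAULT_LAUNCH_TIMEOUT_MS;
  }
  return Math.min(Math.floor(requested), MAX_LAUNCH_TIMEOUT_MS);
}

// Prefer the environment override for the combined cleanup budget,
// otherwise use the documented 5000 ms fallback.
function resolveCleanupTimeoutMs(env: Env): number {
  const raw = env["PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0
    ? Math.floor(parsed)
    : DEFAULT_CLEANUP_TIMEOUT_MS;
}
```

Passing the environment as an argument keeps the sketch testable without touching the real process environment.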
package/docs/RELEASE.md
CHANGED
|
@@ -49,7 +49,7 @@ Every release also requires interactive `tmux`-driven Pi dogfood with the native
|
|
|
49
49
|
|
|
50
50
|
When reviewing saved session JSONL after a failed smoke or a `qa` preset that reclassified an upstream-successful batch, expect `agent_browser` tool rows to carry `isError: true` whenever `details.resultCategory` is `failure`. For normal prose output, model-visible text should end with a `Pi tool isError: true` category line; for caller-requested `--json` output, the hook preserves parseable JSON and only patches `isError`. The extension applies that patch on the `tool_result` path so Pi’s transcript matches the wrapper contract ([`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details)). Preserve a normal Pi session directory for those checks; avoiding `--no-session` keeps this evidence intact ([`AGENTS.md`](../AGENTS.md) preferred validation workflow).
|
|
51
51
|
|
|
52
|
-
The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload
|
|
52
|
+
The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload`, full relaunch with the same exact Pi 0.76 `--session-id`, managed-session continuity, persisted artifacts, and Pi failure-patch behavior. Branch-backed `session_tree` rehydration and cleanup ownership are validated by focused extension harness tests:
|
|
53
53
|
|
|
54
54
|
```bash
|
|
55
55
|
npm run verify -- lifecycle
|
|
@@ -119,7 +119,7 @@ Please gather enough evidence to support the smoke result:
|
|
|
119
119
|
Return a concise PASS/FAIL report with evidence and any tool or workflow issues you noticed. Do not create a dogfood-output report directory.
|
|
120
120
|
```
|
|
121
121
|
|
|
122
|
-
Evaluator expectations after the queued Sauce Demo fixes: the agent should independently choose efficient, safe browser operations; native add-to-cart clicks should mutate cart state without
|
|
122
|
+
Evaluator expectations after the queued Sauce Demo fixes: the agent should independently choose efficient, safe browser operations; native add-to-cart clicks should mutate cart state without the agent authoring `eval`/DOM-click fallbacks (the wrapper may fail with `details.clickDispatch` when upstream reports click success but no trusted DOM event reached the target); same-snapshot form fills may be batched safely when the agent chooses that route; the selected sort order should be verified; checkout must stop before Finish and must not place the order; if the agent attempts Finish or another likely final submit action, the wrapper should block it with `details.promptGuard.reason: "explicit-user-stop-boundary"`; screenshot and recording must use the requested paths or be explicitly reported unavailable, and close should be blocked with `details.promptGuard.reason: "requested-artifacts-missing-before-close"` until required screenshot paths are verified; `network requests` may show public-demo telemetry 401s; `console` may report offline-cache logs; `errors` should show no page errors; and the browser session plus temp artifacts should be cleaned up after evidence is recorded. A run that reaches `checkout-complete.html` or silently substitutes artifact paths is a workflow failure even if other store flow steps work.
|
|
123
123
|
|
|
124
124
|
## Deterministic agent efficiency benchmark
|
|
125
125
|
|
|
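The evaluator expectations in this hunk name two `details.promptGuard.reason` values and a `details.clickDispatch` failure signal. A minimal sketch of how an evaluator script might map those onto a verdict is shown below. The `Details` interface is a reduced, assumed shape (only the fields named in the text), and `classifyGuardOutcome` plus its return labels are hypothetical.

```typescript
// Hypothetical sketch: the Details shape is reduced to fields named in the
// release notes; the function name and verdict labels are illustrative only.
interface Details {
  promptGuard?: { reason?: string };
  clickDispatch?: unknown;
}

type GuardVerdict =
  | "blocked-at-stop-boundary"
  | "blocked-until-artifacts-verified"
  | "click-not-dispatched"
  | "no-guard-signal";

function classifyGuardOutcome(details: Details): GuardVerdict {
  switch (details.promptGuard?.reason) {
    case "explicit-user-stop-boundary":
      // The wrapper refused a likely final submit such as Finish.
      return "blocked-at-stop-boundary";
    case "requested-artifacts-missing-before-close":
      // Close was refused because required screenshot paths are unverified.
      return "blocked-until-artifacts-verified";
    default:
      // Upstream reported click success but no trusted DOM event was observed.
      return details.clickDispatch !== undefined
        ? "click-not-dispatched"
        : "no-guard-signal";
  }
}
```

Matching on the exact reason strings keeps the check independent of the prose the wrapper prints alongside them.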
@@ -180,7 +180,7 @@ Before publishing, validate both local-checkout modes without mixing their assum
|
|
|
180
180
|
|
|
181
181
|
For expanded-surface validation, the smoke prompt should cover native tool invocation rather than shelling out to `agent-browser`: `--version`, `--help`, `skills list`, `skills get core --full`, `open` with `sessionMode: "fresh"`, `snapshot -i`, `click`, top-level `semanticAction` (locator shorthand compiled to upstream `find` and native dropdown selection compiled to upstream `select`, optionally with `semanticAction.session` when you need the same named upstream session as a prior explicit `--session` call), `eval --stdin`, `batch` via stdin, top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` (compiled batch smoke), `screenshot <path>`, explicit `--session … open` plus `--session … close`, `network requests`, `console` / `errors`, `diff snapshot`, `stream status` plus `stream disable`, `dashboard start` plus `dashboard stop`, and `chat <message>` (credential failure is acceptable evidence of wrapper pass-through when `AI_GATEWAY_API_KEY` is intentionally unset). Clean up any opened browser session with `close`, remove temporary files, and kill the tmux session before ending validation.
|
|
182
182
|
|
|
183
|
-
This checklist assumes a real `agent-browser` on `PATH`. It complements, but does not overlap, `npm run verify -- lifecycle`: that harness swaps in a fake upstream binary and focuses on `/reload`,
|
|
183
|
+
This checklist assumes a real `agent-browser` on `PATH`. It complements, but does not overlap, `npm run verify -- lifecycle`: that harness swaps in a fake upstream binary and focuses on `/reload`, exact `--session-id` relaunch, managed-session continuity, spill-path persistence, and Pi `tool_result` failure-patch semantics (`scripts/verify-lifecycle.mjs`), not the full command matrix above.
|
|
184
184
|
|
|
185
185
|
When a smoke or dogfood run fails after `sessionMode: "fresh"` (missing binary, timeout, upstream error, or **`qa`** preset reclassification), read `details.managedSessionOutcome` before assuming which managed session the next default `sessionMode: "auto"` call will follow; the same struct can appear without the extra `Managed session outcome: …` prose line on `"auto"` failures. Field-level semantics and append ordering relative to other diagnostic tails are documented in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) and the session-mode notes in [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md).
|
|
186
186
|
|
|
@@ -192,7 +192,7 @@ Run the automated harness for deterministic configured-source lifecycle regressi
|
|
|
192
192
|
npm run verify -- lifecycle
|
|
193
193
|
```
|
|
194
194
|
|
|
195
|
-
The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs
|
|
195
|
+
The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs `pi` in `tmux` with default model **`zai/glm-5.1`** and a deterministic `--session-id`, puts a deterministic fake `agent-browser` first on `PATH`, drives `/reload`, closes Pi, and relaunches with the same exact session id instead of typing `/resume`. It also asserts the JSONL session header id, same-page managed-session continuity, persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification. Per-step tmux waits default to **180000 ms** (three minutes) in [`scripts/verify-lifecycle.mjs`](../scripts/verify-lifecycle.mjs) (`DEFAULT_TIMEOUT_MS`); override with `--timeout-ms <ms>` when slower models or cold starts need more headroom. Override the model when needed:
|
|
196
196
|
|
|
197
197
|
```bash
|
|
198
198
|
npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal
|
|
@@ -204,7 +204,7 @@ Combine flags in one invocation when both apply (order after `lifecycle` is flex
|
|
|
204
204
|
npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --timeout-ms 600000
|
|
205
205
|
```
|
|
206
206
|
|
|
207
|
-
|
|
207
|
+
On failure it retains transcripts/session artifacts; on success it performs best-effort cleanup. It does not replace occasional real-browser manual smoke testing.
|
|
208
208
|
|
|
209
209
|
**Lifecycle triage:** a timeout on sentinel `v2` after `/reload` often means Pi rejected reload while the TUI still showed `Working…` (`Wait for the current response to finish before reloading`), even when the session JSONL already has a final assistant message. Re-run with `--keep-artifacts --verbose`, inspect the retained pane capture, and confirm the configured model follows tool prompts reliably. Slower models may need a higher `--timeout-ms` than the **180000 ms** default.
|
|
210
210
|
|
|
@@ -225,7 +225,7 @@ Manual validation remains useful for release confidence and installed-package ch
|
|
|
225
225
|
|
|
226
226
|
1. Configure exactly one active source for this extension in Pi settings: this checkout path before publishing, or the installed package after publishing.
|
|
227
227
|
2. Launch plain `pi` so extension discovery is active.
|
|
228
|
-
3. Validate managed-session continuity with `/reload` and a full restart
|
|
228
|
+
3. Validate managed-session continuity with `/reload` and a full restart plus exact `--session-id` relaunch or `/resume`.
|
|
229
229
|
4. Re-check local extension-side docs (`README.md`, `docs/COMMAND_REFERENCE.md`, `docs/TOOL_CONTRACT.md`, including the [`semanticAction`](TOOL_CONTRACT.md#semanticaction) rules when that shorthand or upstream `find` / `select` behavior changes) and regenerated prompt fragments from `extensions/agent-browser/lib/playbook.ts` via `npm run docs -- playbook check` or `npm run docs`. When the upstream `agent-browser` version or help surface changed, run `npm run verify -- command-reference`.
|
|
230
230
|
|
|
231
231
|
### Real upstream contract validation
|
|
@@ -272,8 +272,8 @@ Recommended configured-source lifecycle follow-up:
|
|
|
272
272
|
|
|
273
273
|
1. Open a page with the implicit managed session and confirm the title.
|
|
274
274
|
2. Run `/reload`, then ask for `snapshot -i` and confirm the same page is still active.
|
|
275
|
-
3. Exit `pi`, relaunch it against the same session
|
|
276
|
-
4. Open a large page that compacts its snapshot output and confirm `details.fullOutputPath` still exists after the restart/resume flow.
|
|
275
|
+
3. Exit `pi`, relaunch it against the same exact session id/path or use `/resume`, then ask for `snapshot -i` again and confirm the same page is still active.
|
|
276
|
+
4. Open a large page that compacts its snapshot output and confirm `details.fullOutputPath` still exists after the restart/resume/exact-session flow.
|
|
277
277
|
5. Trigger an oversized non-snapshot output (for example a deliberately large `eval --stdin` result) and confirm the tool prints the actual spill file path directly in content instead of only referencing a details key.
|
|
278
278
|
6. Validate at least one direct file-download flow with `download <selector> <path>`.
|
|
279
279
|
7. Validate at least one asynchronous export flow with `click` followed by `wait --download <path>`, confirming the wait result reports `savedFilePath`/`savedFile` and checking `details.artifacts[].exists` before relying on the requested path being present on disk.
|
|
@@ -294,7 +294,7 @@ Then run the real-browser smoke prompt:
|
|
|
294
294
|
Use the agent_browser tool to open https://react.dev and then take an interactive snapshot.
|
|
295
295
|
```
|
|
296
296
|
|
|
297
|
-
Only use plain `pi` for installed-package validation after temporarily disabling or removing the checkout source or any other active source for this extension from Pi settings. Then confirm `pi` exposes the native `agent_browser` tool, that a basic `open` + `snapshot -i` flow works, and that `/reload` plus restart
|
|
297
|
+
Only use plain `pi` for installed-package validation after temporarily disabling or removing the checkout source or any other active source for this extension from Pi settings. Then confirm `pi` exposes the native `agent_browser` tool, that a basic `open` + `snapshot -i` flow works, and that `/reload` plus restart with exact `--session-id` relaunch or `/resume` keep following the same implicit managed browser session.
|
|
298
298
|
|
|
299
299
|
## Release notes checklist
|
|
300
300
|
|
|
@@ -310,7 +310,7 @@ Before publishing:
|
|
|
310
310
|
- confirm both local-checkout modes still work for pre-release validation: isolated `pi --no-extensions -e .` smoke testing for general checkout loading (add `--no-skills` for extension-focused bounded smokes) and configured-source lifecycle validation
|
|
311
311
|
- complete interactive `tmux` live-site extension smoke with `pi --no-extensions --no-skills -e .` and the native `agent_browser` tool (at least one simple static site and one real documentation/product site; include `qa` or `job`/`batch` when those surfaces changed; use the [public Grafana stress checklist](#public-grafana-stress-checklist) when dashboard/diagnostic/artifact behavior changed; close sessions and remove screenshots/temp artifacts; record evidence). Run separate skill-enabled dogfood only when validating skill routing/report-generation behavior—see [Pre-release checks](#pre-release-checks); automated gates are not a substitute
|
|
312
312
|
- rerun `npm run verify -- release`
|
|
313
|
-
- run `npm run verify -- lifecycle` for configured-source `/reload
|
|
313
|
+
- run `npm run verify -- lifecycle` for configured-source `/reload`, exact `--session-id` relaunch, managed-session continuity, persisted-spill, and Pi failure-patch regression coverage (required before publish; see [Pre-release checks](#pre-release-checks))
|
|
314
314
|
- confirm [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md) still maps every current baseline inventory section to docs, runtime handling, tests, and validation status
|
|
315
|
-
- manually exercise real-browser `/reload` and full restart
|
|
315
|
+
- manually exercise real-browser `/reload` and full restart plus exact `--session-id` relaunch or `/resume` continuity when release risk warrants browser-level confidence beyond the fake upstream harness
|
|
316
316
|
- publish only after the tarball contents and isolated packaged-extension smoke check match expectations
|
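The RELEASE.md changes above state that saved session JSONL should carry `isError: true` on `agent_browser` tool rows whenever `details.resultCategory` is `failure`. A minimal sketch of an audit for that rule follows. The row shape (`toolName`, `isError`, `details`) is an assumption about the JSONL layout, and `findUnpatchedFailures` is a hypothetical helper, not part of the lifecycle harness.

```typescript
// Hypothetical sketch: the JSONL row shape below is assumed for illustration
// and may not match Pi's real session format.
interface ToolRow {
  toolName?: string;
  isError?: boolean;
  details?: { resultCategory?: string };
}

// Return the 1-based line numbers of agent_browser rows whose details report
// a failure but which were not patched with isError: true.
function findUnpatchedFailures(jsonl: string): number[] {
  const offending: number[] = [];
  jsonl.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    const row = JSON.parse(line) as ToolRow;
    if (row.toolName !== "agent_browser") return; // other tools are out of scope
    if (row.details?.resultCategory === "failure" && row.isError !== true) {
      offending.push(index + 1);
    }
  });
  return offending;
}
```

An empty result means every failing `agent_browser` row in the input carried the patch; any returned line number points at a row to inspect by hand.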
package/docs/REQUIREMENTS.md
CHANGED
|
@@ -53,7 +53,7 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
|
|
|
53
53
|
- Keep the current local-checkout path documented as the practical pre-release and development flow.
|
|
54
54
|
- Most users will install this extension globally rather than as a project-local extension.
|
|
55
55
|
- Local checkout smoke testing should use explicit CLI loading such as `pi --no-extensions -e .` or `pi --no-extensions -e /absolute/path/to/pi-agent-browser-native`; Pi settings are bypassed in this mode and code edits require a process restart for validation.
|
|
56
|
-
- Local checkout hot-reload and
|
|
56
|
+
- Local checkout hot-reload and exact-session relaunch validation should use configured-source lifecycle mode: exactly one active checkout/package source in Pi settings, launched with plain `pi` (or the lifecycle harness' exact `--session-id` relaunch path), so `/reload` and relaunch events exercise discovered/configured resources. Focused extension harness tests validate Pi `session_tree` branch rehydration and cleanup ownership.
|
|
57
57
|
- Do **not** rely on repo-local `.pi/extensions/` auto-discovery for this package, because it conflicts with the global installed-package path.
|
|
58
58
|
|
|
59
59
|
### Native-tool preference
|
|
@@ -85,10 +85,10 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
|
|
|
85
85
|
- The primary confidence path is a real `pi` session driven in `tmux`.
|
|
86
86
|
- For quick local checkout smoke validation, launch `pi --no-extensions -e .` from the repository root so only the checkout copy loads; do not rely on Pi settings or `/reload` semantics in this isolated mode.
|
|
87
87
|
- For hot-reload validation, configure exactly one active source for this extension in Pi settings and launch plain `pi`; validate `/reload` there because it exercises auto-discovered/configured resources.
|
|
88
|
-
- Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart
|
|
89
|
-
- Validate a full `pi` restart with `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths.
|
|
88
|
+
- Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart plus exact `--session-id` relaunch, and asserts managed-session continuity, persisted artifact survival, and real Pi `tool_result` failure-patch semantics. It is its own `npm run verify` mode rather than part of the default `npm run verify` sequence, but operators still run it before every publish. The harness defaults Pi to model `zai/glm-5.1` (`scripts/verify-lifecycle.mjs`); pass `--model <id>` after `lifecycle` when a different model is required. Keep `docs/RELEASE.md` accurate about the harness behavior, cleanup, transcript retention, and limitations.
|
|
89
|
+
- Validate a full `pi` restart with exact `--session-id` relaunch or `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths. Validate branch-backed state changes with the focused `session_tree` harness tests.
|
|
90
90
|
- Prefer full `pi` restart over `/reload` when validating extension changes beyond a quick reload smoke check.
|
|
91
|
-
- Use `/resume` when needed after restart.
|
|
91
|
+
- Use `/resume` or an explicit session id/path when needed after restart.
|
|
92
92
|
- Keep testing broader than a single smoke site like `example.com`.
|
|
93
93
|
- Bounded release smokes that validate this extension should disable auto-loaded skills with `--no-skills`; run skill-enabled dogfood separately only when validating external skill routing or report-generation behavior.
|
|
94
94
|
- Maintain a concrete release/package verification workflow in `docs/RELEASE.md` and matching repository scripts.
|
|
@@ -111,7 +111,7 @@ The design should comfortably support workflows such as:
|
|
|
111
111
|
- Package-manifest behavior matters more than repo-local development wiring.
|
|
112
112
|
- The extension should use official `pi` hooks and package resources where possible.
|
|
113
113
|
- The wrapper should stay thin, with upstream `agent-browser` remaining the source of truth for command semantics.
|
|
114
|
-
- Successful and failed tool outcomes should surface bounded machine-readable fields on Pi-facing `details` (`resultCategory`, `successCategory`, `failureCategory`, optional structured `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`, optional `artifactVerification` with the same shape on successful `batchSteps[]` rows) so agents can branch without parsing prose; stateful commands (`auth`, `cookies`, `storage`, `dialog`, `frame`, `state`) plus other structured diagnostics (for example `network`, `diff`, `trace`, `stream`, `dashboard`, `chat`) and `batch` should redact secret-bearing payloads in model-facing `details.data`, including the compact per-step `batch` roll-up on the parent result (full per-step payloads live on `batchSteps[]`). The contract lives in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), enums and classifier precedence live in `extensions/agent-browser/lib/results/categories.ts` and `contracts.ts` (also re-exported from `shared.ts`), and presentation-time summaries, redaction, network request follow-ups, and artifact verification rollups are assembled in `extensions/agent-browser/lib/results/presentation.ts` (`buildPageChangeSummary`, `
|
|
114
|
+
- Successful and failed tool outcomes should surface bounded machine-readable fields on Pi-facing `details` (`resultCategory`, `successCategory`, `failureCategory`, optional structured `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`, optional `artifactVerification` with the same shape on successful `batchSteps[]` rows) so agents can branch without parsing prose; stateful commands (`auth`, `cookies`, `storage`, `dialog`, `frame`, `state`) plus other structured diagnostics (for example `network`, `diff`, `trace`, `stream`, `dashboard`, `chat`) and `batch` should redact secret-bearing payloads in model-facing `details.data`, including the compact per-step `batch` roll-up on the parent result (full per-step payloads live on `batchSteps[]`). The contract lives in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), enums and classifier precedence live in `extensions/agent-browser/lib/results/categories.ts` and `contracts.ts` (also re-exported from `shared.ts`), and presentation-time summaries, redaction, network request follow-ups, and artifact verification rollups are assembled in `extensions/agent-browser/lib/results/presentation.ts` (`buildPageChangeSummary`, command taxonomy predicates from `command-taxonomy.ts`, `redactPresentationData`, `buildArtifactVerificationSummary`, `buildBatchPresentation`).
|
|
115
115
|
- User-facing docs belong in `README.md` and the canonical published files under `docs/`.
|
|
116
116
|
- Agent workflow and deeper testing procedures can stay in `AGENTS.md`, but published docs must not depend on that file being present.
|
|
117
117
|
- When upstream `agent-browser` changes, refresh the local command reference, prompt guidance, and other extension-side docs so agents still have a repo-readable equivalent of the blocked direct-binary help path.
|
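The REQUIREMENTS.md line above says the bounded `details` fields exist so agents can branch without parsing prose. A minimal sketch of such a branch is shown below. The interface lists only fields named in that line, the `"success"` value for `resultCategory` is an assumption (only `failure` is confirmed by the release notes), and `decideFromDetails` with its `Decision` type is hypothetical.

```typescript
// Hypothetical sketch: only field names come from the requirements text;
// the "success" category value and the Decision type are assumptions.
interface AgentBrowserDetails {
  resultCategory?: string;
  successCategory?: string;
  failureCategory?: string;
  nextActions?: unknown[];
}

type Decision =
  | { kind: "proceed"; category: string }
  | { kind: "recover"; category: string; suggestedActions: number }
  | { kind: "unknown" };

function decideFromDetails(details: AgentBrowserDetails): Decision {
  if (details.resultCategory === "failure") {
    return {
      kind: "recover",
      category: details.failureCategory ?? "unspecified",
      // Count structured follow-ups instead of reading prose hints.
      suggestedActions: details.nextActions?.length ?? 0,
    };
  }
  if (details.resultCategory === "success") {
    return { kind: "proceed", category: details.successCategory ?? "unspecified" };
  }
  return { kind: "unknown" };
}
```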
package/docs/SUPPORT_MATRIX.md
CHANGED
|
@@ -27,35 +27,52 @@ When upstream ships a new `agent-browser` or the inventory changes:
|
|
|
27
27
|
|
|
28
28
|
- Target upstream: `agent-browser 0.27.0` (must match `CAPABILITY_BASELINE.targetVersion` in [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs)).
|
|
29
29
|
- Source of truth: `CAPABILITY_BASELINE.inventorySections` in the same file (stable `id` keys: `skills`, `core-commands`, `state-tabs-frames-dialogs`, `network-storage-artifacts-diagnostics`, `batch-auth-setup-ai`, `options-and-env`).
|
|
30
|
-
- Status: supported for the current wrapper contract.
|
|
31
|
-
- High-priority support gaps:
|
|
30
|
+
- Status: supported for the current wrapper contract after the 2026-05-26 all-command audit.
|
|
31
|
+
- High-priority support gaps: 2026-05-26 audit found sessionless local commands and command-scoped value flags needed sharper wrapper handling; runtime/tests/docs now cover those paths. Remaining upstream-owned caveat: `agent-browser 0.27.0` help mentions `wait <selector> --state hidden`, but source parsing does not implement that distinct wait mode, so wrapper docs steer agents to `wait --fn` predicates.
|
|
32
32
|
- Post-`v0.2.29` review state: commits `eb55320` through `86abbfb` add browser guidance/smoke coverage plus `RQ-0086` click-probe reduction, `RQ-0087` same-snapshot form fill batching, `RQ-0088` current-ref fallback on locator misses, `RQ-0089` direct-upstream click mutation investigation, and `RQ-0090` stop-boundary/artifact-path guidance. Verification gates below were rerun on 2026-05-18 after those tasks landed. Constrained `job` (`RQ-0064`), the lightweight `qa` preset (`RQ-0065`), the experimental `sourceLookup` helper (`RQ-0066`), and the experimental `networkSourceLookup` helper (`RQ-0067`) are implemented; see [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#job), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sourcelookup), and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#networksourcelookup). Reusable browser recipes (`RQ-0068`) are intentionally not adopted as a runtime surface; see [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet).
|
|
33
33
|
|
|
34
|
+
## Open UX/reliability follow-ups from 2026-05-29 agent feedback
|
|
35
|
+
|
|
36
|
+
Phase 1 triage (2026-05-29): IDs **RQ-0110–RQ-0117** track this feedback batch. **Do not reuse RQ-0101** here—that id is already shipped for compact-snapshot high-value controls (see closure section below).
|
|
37
|
+
|
|
38
|
+
These rows track this feedback batch. Some rows are docs-only or environment-owned; rows marked shipped have code/tests in this change but still need release-gate evidence before being treated as release closure.
|
|
39
|
+
|
|
40
|
+
| ID | Feedback | Owner | Phase 1 classification | Evidence (2026-05-29, `agent-browser 0.27.0` on maintainer macOS unless noted) | Next implementation action | Likely files / tests |
|
|
41
|
+
| --- | --- | --- | --- | --- | --- | --- |
|
|
42
|
+
| RQ-0110 | Headed demos are hard to discover and hard to verify. | Wrapper + upstream (`--headed`) | **docs/playbook-mitigated** (`README`, `TOOL_CONTRACT`, `COMMAND_REFERENCE`, generated playbook guidance); visibility proof **out-of-scope/host-owned** until upstream exposes a portable signal. | `--headed open https://example.com` succeeds with JSON success; no upstream field proves an OS window is visible. Docs/playbook now document `sessionMode: "fresh"` and screenshot/tab/get-url evidence. | No further wrapper action planned for this batch without an upstream/OS portable visibility signal. | `extensions/agent-browser/lib/playbook.ts` (`npm run docs -- playbook write`), README, `docs/COMMAND_REFERENCE.md`, `docs/TOOL_CONTRACT.md`. |
|
|
43
|
+
| RQ-0111 | Local `localhost` / `127.0.0.1` fixture servers can fail with `ERR_EMPTY_RESPONSE` from the browser host. | Environment + upstream (navigation) | **docs-mitigated** (loopback host mismatch + `ERR_EMPTY_RESPONSE` meaning); browser-host reachability remains **environment-owned**. | Reproduced loopback navigation failures on 2026-05-29 maintainer macOS: accept-then-close without HTTP can surface as `net::ERR_EMPTY_RESPONSE` or `net::ERR_SOCKET_NOT_CONNECTED`; nothing listening yields `net::ERR_CONNECTION_REFUSED`. Same-machine `python3 -m http.server` (or harness `SimpleHTTPRequestHandler`) + `open http://127.0.0.1:<port>/fixture.html` succeeds. `npm run verify -- real-upstream` already uses localhost fixtures successfully on this host. | No wrapper server manager or classifier in this batch: failures are not specific enough to prove browser-host loopback mismatch. Keep guidance on host-reachable addresses, `file://` static fallback, and harness-owned servers. | README, `docs/COMMAND_REFERENCE.md`, `docs/TOOL_CONTRACT.md`, `test/helpers/agent-browser-harness.ts`. |
|
|
44
|
+
| RQ-0112 | `eval --stdin` can silently return `null` on `file://` pages, blocking DOM verification. | Upstream (eval channel) + wrapper (warning UX) | **docs-mitigated** (treat `file://` null as inconclusive); **wrapper-owned shipped** (`details.evalResultWarning` + visible `Eval result warning` on `file:` + `result === null`; upstream null channel remains environment/upstream-owned). | Reproduced on `file://` fixture: expressions `null`, `undefined`, `(() => null)()`, and missing-element queries return `"success":true` with `"result":null`; `JSON.stringify(null)` returns the string `"null"`. Simple DOM reads (`document.getElementById(...).textContent`) return real values on the same page. Focused fake coverage asserts the warning without failing the tool. | No further wrapper action planned unless real upstream exposes a richer error. Keep release validation focused on the non-failing warning and redaction-safe visible copy. | `extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts`, `process-output.ts`, `final-result.ts`, `types.ts`, `docs/TOOL_CONTRACT.md`, `test/agent-browser.extension-errors-artifacts.test.ts`. |
|
|
45
|
+
| RQ-0113 | A successful click may not lead to the expected DOM mutation. | Wrapper + upstream (click semantics) | **docs-mitigated / existing-runtime-mitigated** by `RQ-0089` click-dispatch, `RQ-0073` overlay diagnostics, and stronger verification guidance; arbitrary app no-op handlers remain **app/upstream semantics**, not proof of wrapper failure. | Direct upstream can correctly report `click #noop` success while app state stays unchanged; `click #mutate` updates DOM. This shows click success is target activation evidence, not expected-state proof. Wrapper already probes missing trusted DOM events and overlay blockers, but cannot infer arbitrary expected mutations without task-specific assertions. | No additional generic post-click probe in this batch to avoid false positives. Use task-specific verification (`snapshot`, `wait --text`, `assertText`, screenshot, `pageChangeSummary`) after state-changing clicks. | Existing `clickDispatch`/overlay tests plus README / `docs/COMMAND_REFERENCE.md` verification guidance. |
|
|
46
|
+
| RQ-0114 | `get text` selector ambiguity remains hard to resolve when several matches are visible. | Wrapper + upstream (first-match `get text`) | **wrapper-owned shipped** (`visibleCandidates` on selector probe + visible previews); first-match behavior remains upstream semantics. `RQ-0074` warning path already shipped. | Upstream CLI: `get text ".item"` with two visible matches returns only `Alpha`. Wrapper `RQ-0074` already warns when `matchCount > 1` (including all-visible cases) and now exposes bounded visible candidate previews/indexes for safer narrowing. | No further wrapper action planned for this batch. Future improvement: derive safe selector suggestions only if redaction rules can keep them non-sensitive. | `extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts`, `test/agent-browser.extension-errors-artifacts.test.ts`, `docs/TOOL_CONTRACT.md`. |
|
|
47
|
+
| RQ-0115 | Temporary local HTTP server port management is manual and leaked processes block later runs. | Environment (host/process lifecycle) | **out-of-scope/host-owned** (no fixture-server runtime per architecture); **docs-mitigated** (harness pointer in `COMMAND_REFERENCE`). | By design outside `agent_browser` per architecture no-recipe policy. Repo test harness already exposes `startAgentBrowserContractFixtureServer()` for deterministic localhost pages; leaked `python3 -m http.server` / Node listeners are operator or CI cleanup. Phase 1 added a maintainer pointer from [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#headed-demo-and-local-page-checks) to the harness. | Phase 2: **no wrapper server manager** without new design evidence. Parent decision if a separate npm script (`verify` helper) is wanted—out of scope for thin integration. | `test/helpers/agent-browser-harness.ts`, `docs/COMMAND_REFERENCE.md`, `docs/ARCHITECTURE.md` (if explicit anti-scope note is needed). |
|
|
48
|
+
| RQ-0116 | Fresh-session failure prose is opaque and exposes internal generated session ids without clear recovery. | Wrapper | **wrapper-owned shipped** (action-oriented visible recovery + `nextActions`; `attemptedSessionName` remains in `details`). Struct + visible line already exist (`RQ-0077`). | `buildManagedSessionOutcome` still keeps full generated-session transition details in `details.managedSessionOutcome`, while visible failure prose now summarizes preserved/abandoned/replaced outcomes without repeating generated ids. Focused fake coverage covers preserved, missing-binary, abandoned, and QA-reclassification paths. | No further wrapper action planned for this batch unless reviewer finds recovery actions unsafe or insufficient. | `extensions/agent-browser/lib/orchestration/browser-run/session-state.ts`, `final-result.ts`, `docs/TOOL_CONTRACT.md`, `test/agent-browser.extension-errors-artifacts.test.ts`, `test/agent-browser.extension-input-modes.test.ts`. |
|
|
49
|
+
| RQ-0117 | There is no machine-readable confirmation that headed mode is visible to the user. | Wrapper gap + environment (display) | **documented unsupported** for this batch; true OS visibility is **out-of-scope/host-owned** until upstream exposes a portable signal. Pairs with RQ-0110. | Same root cause as RQ-0110: no portable upstream/wrapper field observed. Headed launch success is not visibility proof, and adding a constant `details.headedVisibility: "unsupported"` would add noise without a decision signal. | No runtime field in this batch. Keep the explicit contract limitation and independent screenshot/tab/get-url evidence guidance. | README, `docs/TOOL_CONTRACT.md`, `docs/COMMAND_REFERENCE.md`, generated playbook guidance. |
|
|
50
|
+
|
|
34
51
|
## Verification evidence
|
|
35
52
|
|
|
36
53
|
Re-run the gates below before each release; this table records what the closure audit exercised.
|
|
37
54
|
|
|
38
55
|
| Gate | Evidence | Status |
|
|
39
56
|
| --- | --- | --- |
|
|
40
|
-
| Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-
|
|
41
|
-
| Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-
|
|
42
|
-
| Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-
|
|
43
|
-
| Deterministic dogfood smoke | `npm run verify -- dogfood` (`scripts/verify-agent-browser-dogfood.ts`) drives the native wrapper against public `example.com` through top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close with the real `agent-browser` on `PATH`. | Pass on 2026-05-
|
|
44
|
-
| Efficiency benchmark | `npm run verify -- benchmark` runs deterministic browser workflow accounting plus focused benchmark tests, including JSONL sampling fixtures and job/qa/sourceLookup/networkSourceLookup/Electron scenario coverage. | Pass on 2026-05-
|
|
45
|
-
| `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, dogfood, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-
|
|
46
|
-
| Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`,
|
|
47
|
-
| Quick isolated Pi smoke | `pi --no-extensions --no-skills -e . --tools agent_browser` from repo root; native `agent_browser` only. | Pass on 2026-05-
|
|
57
|
+
| Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-29 (`npm run verify`, `agent-browser 0.27.0` on `PATH`). |
|
|
58
|
+
| Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-29 (`npm run verify -- real-upstream`, `agent-browser 0.27.0` on `PATH`). |
|
|
59
|
+
| Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-29 (`npm run verify -- package-pi`). |
|
|
60
|
+
| Deterministic dogfood smoke | `npm run verify -- dogfood` (`scripts/verify-agent-browser-dogfood.ts`) drives the native wrapper against public `example.com` through top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close with the real `agent-browser` on `PATH`. | Pass on 2026-05-29 (`npm run verify -- dogfood`; artifacts cleaned by the harness). |
|
|
61
|
+
| Efficiency benchmark | `npm run verify -- benchmark` runs deterministic browser workflow accounting plus focused benchmark tests, including JSONL sampling fixtures and job/qa/sourceLookup/networkSourceLookup/Electron scenario coverage. | Pass on 2026-05-29 (`npm run verify -- benchmark`). |
|
|
62
|
+
| `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, dogfood, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-29 (`npm run verify -- release`). `prepublishOnly` will rerun this during `npm publish`. |
|
|
63
|
+
| Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`, closes and relaunches Pi with the same exact `--session-id`, checks the JSONL session header id, session continuity, slash-command sentinel tokens (`v1` then `v2` after rewriting the packaged extension to simulate pickup), persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification with a fake upstream on `PATH`. Default Pi model is `zai/glm-5.1`; default per-step wait is **180000 ms** (`DEFAULT_TIMEOUT_MS`); override model with `--model <id>` and waits with `--timeout-ms <ms>`. Passthrough flags in [`scripts/project.mjs`](../scripts/project.mjs): `--keep-artifacts`, `--model`, `--verbose`, and `--timeout-ms` plus a value (for example `npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --keep-artifacts --verbose --timeout-ms 600000`). | Pass on 2026-05-29 (`npm run verify -- lifecycle`). Treat any future unexplained red lifecycle gate as a release blocker. |
|
|
64
|
+
| Quick isolated Pi smoke | `pi --no-extensions --no-skills -e . --tools agent_browser` from repo root; native `agent_browser` only. | Pass on 2026-05-29 for an interactive tmux checkout smoke (`pi --no-extensions --no-skills -e . --session-dir <temp> --model zai/glm-5.1`): prompted native `agent_browser --version`, verified `agent-browser 0.27.0`, reported PASS, and removed the temp session dir/tmux session. Broader historical coverage also includes version/help/skills, open/snapshot/click, eval stdin, batch stdin, screenshot, explicit session, `sessionMode: "fresh"`, network requests, console/errors, diff snapshot, stream status/disable, dashboard start/stop, and chat credential-failure pass-through during RQ-0055. |
|
|
48
65
|
|
|
49
66
|
## Baseline checklist by inventory section
|
|
50
67
|
|
|
51
68
|
| Baseline section | Baseline items | Documentation | Runtime handling | Test coverage | Validation status |
|
|
52
69
|
| --- | --- | --- | --- | --- | --- |
|
|
53
|
-
| Built-in skills |
|
|
54
|
-
| Core page, element, navigation, and extraction commands |
|
|
55
|
-
| Sessions, state, tabs, frames, dialogs, and windows |
|
|
56
|
-
| Network, storage, artifacts, diagnostics, and performance |
|
|
57
|
-
| Batch, auth, confirmations, setup, dashboard, and AI commands |
|
|
58
|
-
| Global flags, config, providers, policy, and environment |
|
|
70
|
+
| Built-in skills | 13 canonical tokens from baseline section `skills`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills), generated baseline block, README proof section, release docs. | `needsManagedSession` keeps read-only skills inspection sessionless while preserving thin upstream passthrough. | Runtime and extension-validation skills/provider matrix; real-upstream inspection/skills group. | Supported. |
|
|
71
|
+
| Core page, element, navigation, and extraction commands | 74 canonical tokens from baseline section `core-commands`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md), README quick start. | Thin passthrough with wrapper-owned JSON/session planning, ref guidance, artifact verification, page-change summaries, click-dispatch diagnostics, no-op scroll/focus diagnostics, shorthand compilers, and redaction. | Real-upstream core matrix plus fake core matrix for passthrough, ordering, diagnostics, and compiler validation. | Supported. Upstream semantics remain upstream-owned. |
|
|
72
|
+
| Sessions, state, tabs, frames, dialogs, and windows | 20 canonical tokens from baseline section `state-tabs-frames-dialogs`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands), stateful workflow notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Stateful summaries/redaction, state artifact handling, sessionless local command planning, managed-session restore, tab target pinning, and close alias cleanup. | Extension-validation stateful matrix, runtime session/resume tests, presentation redaction tests, lifecycle harness. | Supported. External profile/auth state remains operator-owned. |
|
|
73
|
+
| Network, storage, artifacts, diagnostics, and performance | 42 canonical tokens from baseline section `network-storage-artifacts-diagnostics`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage), diagnostic sections, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Thin passthrough plus compact diagnostics, artifact metadata, missing-ffmpeg warnings, sensitive-data redaction, timeout bounds, and cleanup-pair guidance. | Fake non-core matrix and safe real-upstream coverage for network/HAR, diff, trace/profiler, console/errors/highlight, stream, vitals, and React missing-renderer. | Supported. Environment-sensitive operations need suitable local/browser state. |
|
|
74
|
+
| Batch, auth, confirmations, setup, dashboard, devices, and AI commands | 24 canonical tokens from baseline section `batch-auth-setup-ai`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-devices-and-setup). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-devices-and-setup), README security notes, release docs. | Native-tool batch stdin, generated `job`/`qa`/lookup batch plans, auth/confirmation redaction, sessionless local auth/setup/dashboard/doctor planning, timeout/cleanup guidance. | Unit/fake batch/auth/confirmation/dashboard/chat/doctor tests; extension-validation for structured input modes; efficiency benchmark scenarios. | Supported. Interactive side-effecting setup/auth/chat remains upstream-owned. |
|
|
75
|
+
| Global flags, config, providers, policy, and environment | 117 canonical tokens from baseline section `options-and-env`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment), README provider/setup notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sessionmode), architecture/runtime docs. | Runtime handles command discovery, value-flag prevalidation, launch-scoped flags, redacted echoes, fresh-session recovery hints, explicit sessions, provider/device launch-scoping, curated env forwarding, and subprocess completion. | Runtime tests for flags/planning/redaction/session behavior; process tests for env and stdio-linger completion; fake provider/specialized-skill matrix; package doctor. | Supported. Provider clouds, iOS/Appium, proxies, profiles, and credentials require external setup. |
|
|
59
76
|
|
|
60
77
|
## Follow-up decision after closure
|
|
61
78
|
|
|
@@ -73,15 +90,17 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
|
|
|
73
90
|
|
|
74
91
|
`RQ-0091` keeps advanced release smoke tests focused on extension behavior instead of external skill routing: the Sauce Demo smoke in [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt) now launches with `--no-skills`, restricts tools to `agent_browser`, and uses bounded release-smoke wording rather than dogfood/exploratory QA language. Runtime guidance remains the concise stop-boundary and exact-artifact-path contract from `extensions/agent-browser/lib/playbook.ts`; no site-specific automation or recipe layer was added. Evidence from the failed high/low local-shop runs showed skill/report drift (`dogfood-output` substitution) and reasoning complexity, not a wrapper command defect, so skill-enabled dogfood remains a separate validation mode. Human workflow: [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt), [`AGENTS.md`](../AGENTS.md#preferred-testing-workflow), and [`REQUIREMENTS.md`](REQUIREMENTS.md#testing-guidance).
|
|
75
92
|
|
|
76
|
-
`RQ-
|
|
93
|
+
`RQ-0090` now backs stop-boundary and exact-artifact-path guidance with prompt-derived preflight guards instead of relying only on model instruction-following. `buildPromptPolicy` in `extensions/agent-browser/lib/prompt-policy.ts` extracts explicit stop-before-order/submit wording and exact requested artifact paths from the latest user message. `extensions/agent-browser/lib/orchestration/browser-run/browser-action-model.ts` normalizes click-like actions plus `press`/`key` Enter/Return submits; `prompt-guards.ts` blocks likely final targets on those covered shapes (including `@ref` role/name metadata from `details.refSnapshot`, selectors such as `finish`, and matching batch click/find steps) with `details.promptGuard.reason: "explicit-user-stop-boundary"` and documents excluded flows (`eval`, generic fill/type, `keyboard type`/`keyboard inserttext`, non-Enter keypresses). It also blocks browser `close` / `quit` / `exit` with `reason: "requested-artifacts-missing-before-close"` until required prompt screenshot paths are verified in `details.artifactManifest` (optional recording paths are required only when recording appears available). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`promptGuard`); human workflow: README stop-boundary/artifact notes and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md); fake coverage: `agentBrowserExtension blocks likely final order clicks when the user set a stop boundary`, `agentBrowserExtension blocks close until required prompt screenshot artifacts are saved`, and `buildPromptPolicy detects stop boundaries and requested artifact paths`.
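The close guard described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code in `prompt-guards.ts`: `guardClose`, `CloseGuardResult`, and the parameter shapes are hypothetical names introduced here; only the close command names and the `reason` string are taken from the paragraph above.

```typescript
// Hypothetical result shape for illustration; the real `details.promptGuard`
// contract is defined in TOOL_CONTRACT.md and may carry other fields.
interface CloseGuardResult {
  blocked: boolean;
  reason?: string;
  missingPaths: string[];
}

// Close-like commands named in the text above.
const CLOSE_COMMANDS: string[] = ["close", "quit", "exit"];

// Blocks a close-like command while any required screenshot path is still
// absent from the list of verified artifact paths.
function guardClose(
  command: string,
  requiredScreenshotPaths: string[],
  verifiedArtifactPaths: string[],
): CloseGuardResult {
  // Commands that are not close-like are never blocked by this guard.
  if (CLOSE_COMMANDS.indexOf(command) === -1) {
    return { blocked: false, missingPaths: [] };
  }
  const missingPaths = requiredScreenshotPaths.filter(
    (path) => verifiedArtifactPaths.indexOf(path) === -1,
  );
  if (missingPaths.length === 0) {
    return { blocked: false, missingPaths };
  }
  return {
    blocked: true,
    reason: "requested-artifacts-missing-before-close",
    missingPaths,
  };
}
```

In this sketch a blocked result lists the paths still missing, so a caller could save those screenshots and retry the close. Whether the real guard reports missing paths this way is an assumption.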
|
|
94
|
+
|
|
95
|
+
`RQ-0097` keeps upstream subprocess completion reliable when detached descendants inherit the child’s stdio handles. `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts` uses `watchSpawnedChildCompletion` to observe both Node `exit` and `close`. It leaves piped stdio intact during the short post-`exit` grace (`EXIT_STDIO_GRACE_MS`, currently **100 ms**) so normal `close` can still win, and destroys those streams only if the fallback resolves. When `close` is still delayed, it resolves with exit-code precedence `close` → wrapper timeout (**124**) → post-`exit` fallback for the direct child → spawn failure (**127**), so the Pi tool cannot hang after `agent-browser` has already exited. Human context: [`ARCHITECTURE.md`](ARCHITECTURE.md#direct-subprocess-execution) (subprocess bullet) and [`AGENTS.md`](../AGENTS.md) (**Runtime planning** → **Upstream subprocess completion**). Fake coverage in [`test/agent-browser.process.test.ts`](../test/agent-browser.process.test.ts): `runAgentBrowserProcess resolves after exit when descendants keep stdio handles open` asserts the post-exit fallback returns near the 100 ms grace window instead of the process timeout, alongside `runAgentBrowserProcess returns timeout exit code when descendants keep stdio handles open`.
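One reading of the exit-code precedence above, written as a pure function. This is a sketch under stated assumptions, not the implementation in `process.ts`: `CompletionSignals` and `resolveExitCode` are hypothetical names, and treating a `null` code as "signal not observed" is a simplification made here.

```typescript
// Hypothetical summary of what was observed for one spawned child.
interface CompletionSignals {
  closeCode: number | null; // code from Node's `close` event, null if not observed
  timedOut: boolean;        // the wrapper timeout elapsed
  exitCode: number | null;  // code from `exit` for the direct child, null if not observed
  spawnFailed: boolean;     // the spawn itself failed
}

const TIMEOUT_EXIT_CODE = 124;       // wrapper timeout, per the text above
const SPAWN_FAILURE_EXIT_CODE = 127; // spawn failure, per the text above

// Applies the precedence: close, then wrapper timeout, then the post-exit
// fallback for the direct child, then spawn failure.
function resolveExitCode(signals: CompletionSignals): number | undefined {
  if (signals.closeCode !== null) return signals.closeCode;
  if (signals.timedOut) return TIMEOUT_EXIT_CODE;
  if (signals.exitCode !== null) return signals.exitCode;
  if (signals.spawnFailed) return SPAWN_FAILURE_EXIT_CODE;
  return undefined; // nothing observed yet, so the caller keeps waiting
}
```

Keeping the precedence in one function makes the ordering easy to test without spawning real processes; the grace timer and stream teardown would live around it.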
|
|
77
96
|
|
|
78
|
-
`RQ-0096` ships first-class Electron desktop-app support without adding a generic recipe runtime: top-level `electron` covers wrapper-owned `list`, isolated `launch` with snapshot/tabs/connect handoff, `status`, `cleanup`, and compact current-session or launch-scoped `probe`; `qa.attached` extends the existing QA preset for attached Electron/CDP sessions without introducing `electron.qa`. `launch.handoff` still defaults to `"snapshot"`, while `handoff: "tabs"` is documented as the safer diagnostic starting point when refs/content capture is not needed yet. Host install discovery (`discoverElectronApps`) is macOS/Linux-only today: on Windows `electron.list` reports `platform: "unsupported"` with an empty catalog and name/bundle targets cannot resolve from scans—use `executablePath` (or a host path to the Electron binary) for Windows launch targeting. Discovery adds non-blocking likely-sensitive app annotations plus visible isolated-profile/auth-state warnings; launch output and `details.electron.profileIsolation` state that wrapper launches do not reuse existing signed-in app profiles or attach to already-running authenticated apps, and point agents to the host debug-port launch plus raw `connect` path when signed-in local app state is the goal; launch timeout failures include PID/profile/DevToolsActivePort/timing diagnostics; status/probe add launch/session identifiers, liveness, mismatch/reattach next actions, and dead-launch context for `about:blank`; post-mutation Electron death is upgraded to `tab-drift` with `details.electronPostCommandHealth`; Electron fills can add `details.fillVerification`; Electron `@e…` mutations can add same-URL ref freshness guidance; broad Electron `get text` selectors add scope warnings; cleanup ownership is bounded to wrapper-created launch records and temp profiles; externally launched debug ports stay on the manual `args: ["connect", "<port-or-url>"]` path and remain host-owned. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#electron) plus [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) for `qa.attached`; human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps) and README common calls; implementation: `extensions/agent-browser/
|
|
97
|
+
`RQ-0096` ships first-class Electron desktop-app support without adding a generic recipe runtime: top-level `electron` covers wrapper-owned `list`, isolated `launch` with snapshot/tabs/connect handoff, `status`, `cleanup`, and compact current-session or launch-scoped `probe`; `qa.attached` extends the existing QA preset for attached Electron/CDP sessions without introducing `electron.qa`. `launch.handoff` still defaults to `"snapshot"`, while `handoff: "tabs"` is documented as the safer diagnostic starting point when refs/content capture is not needed yet. Host install discovery (`discoverElectronApps`) is macOS/Linux-only today: on Windows `electron.list` reports `platform: "unsupported"` with an empty catalog and name/bundle targets cannot resolve from scans—use `executablePath` (or a host path to the Electron binary) for Windows launch targeting. Discovery adds non-blocking likely-sensitive app annotations plus visible isolated-profile/auth-state warnings; launch output and `details.electron.profileIsolation` state that wrapper launches do not reuse existing signed-in app profiles or attach to already-running authenticated apps, and point agents to the host debug-port launch plus raw `connect` path when signed-in local app state is the goal; launch timeout failures include PID/profile/DevToolsActivePort/timing diagnostics; status/probe add launch/session identifiers, liveness, mismatch/reattach next actions, and dead-launch context for `about:blank`; post-mutation Electron death is upgraded to `tab-drift` with `details.electronPostCommandHealth`; Electron fills can add `details.fillVerification`; Electron `@e…` mutations can add same-URL ref freshness guidance; broad Electron `get text` selectors add scope warnings; cleanup ownership is bounded to wrapper-created launch records and temp profiles; externally launched debug ports stay on the manual `args: ["connect", "<port-or-url>"]` path and remain host-owned. 
Runtime-owned off-branch launch records remain visible to `electron.status { launchId }`, `electron.status { all: true }`, `electron.probe { launchId }`, and `electron.cleanup`; default current-session `electron.probe` stays scoped to the active managed session, and no-arg status/cleanup reports ambiguity when multiple active branch/off-branch records are still owned. Explicit cleanup is serialized with managed-session work, records managed-session close success independently from partial process/profile cleanup, clears live/restore managed-session state for the closed wrapper session, updates branch-visible Electron state only with selected cleanup records instead of unrelated off-branch lookup records, and rotates the next default auto browser call away from that closed name. `/reload` preserves the current branch-visible active Electron launch and its isolated temp `userDataDir` for continuity, cleans off-branch owned Electron launches before clearing process-local ownership, and durably protects profile dirs from generic temp cleanup, quit cleanup, process-exit cleanup, and stale temp-root pruning after restart when partial Electron cleanup deliberately skips or fails `user-data-dir` removal. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#electron) plus [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) for `qa.attached`; human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps) and README common calls; implementation: `extensions/agent-browser/lib/input-modes/electron.ts`, `extensions/agent-browser/lib/orchestration/electron-host/`, `extensions/agent-browser/lib/orchestration/browser-run/`, `extensions/agent-browser/lib/electron/`, and dispatch/state wiring in `extensions/agent-browser/index.ts`; deterministic efficiency evidence: `electron-lifecycle` and `electron-probe` in `scripts/agent-browser-efficiency-benchmark.mjs`; fake coverage includes Electron schema/probe/mismatch/post-command-health/fill-verification/broad-text/discovery-sensitivity and packaged-sourceLookup cases in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus off-branch Electron status/probe/cleanup, targeted cleanup without unrelated branch promotion, explicit cleanup serialization, current and restored cleanup managed-session retirement, active-Electron reload preservation, off-branch Electron reload cleanup, durable partial off-branch reload/quit profile preservation, protected temp-root process-exit and stale-prune cleanup, and partial-cleanup managed-session untracking in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts). This plan is the `RQ-0068` revisit evidence for Electron specifically: [`docs/plans/electron-extension-2026-05-20.md`](plans/electron-extension-2026-05-20.md) documents repeated failure-prone discover/launch/attach/cleanup and multi-call state-probe sequences, plus bounded owner/versioning/test/docs artifacts.
|
|
79
98
|
|
|
80
99
|
`RQ-0097` completes manual CDP attach recovery without making manually launched apps wrapper-owned: successful raw `connect` results append the session-scoped safe tab-list action `list-connected-session-tabs`; `snapshot -i` failures whose upstream error says `No active page` append the safe tab-list action `list-tabs-after-no-active-page` when a session is known. Agents then choose a stable `tab t<N>` target and run `snapshot -i` explicitly; the wrapper does not emit raw-connect or no-active-page snapshot retry ids without a wrapper-observed safe tab id. The runtime source of truth for these recovery ids is `AGENT_BROWSER_RECOVERY_NEXT_ACTION_IDS` in `extensions/agent-browser/lib/results/recovery-actions.ts` (re-exported from `shared.ts`). The guidance keeps manual signed-in desktop apps and explicit artifacts host-owned while `close` remains a browser/CDP-session close and `electron.cleanup` remains limited to wrapper-created `electron.launch` records. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`ELECTRON.md`](ELECTRON.md#manual-host-launch-pattern) and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps); fake coverage: raw connect and no-active snapshot assertions in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus central next-action helper coverage in [`test/agent-browser.results.test.ts`](../test/agent-browser.results.test.ts).
|
|
81
100
|
|
|
82
101
|
`RQ-0068` remains closed with a no-adopt decision for a reusable named browser recipe runtime. The Electron evidence above justified a narrow typed shorthand and compact probe, not an open-ended recipe layer; future reusable recipes still require concrete repeated workflow evidence and a defined owner/versioning/test plan.
|
|
83
102
|
|
|
84
|
-
`RQ-0098` completes the docs/playbook groundwork for desktop readiness and wait orchestration without adding a runtime primitive or reusable recipe layer. The accepted ladder is: prefer condition waits (`wait --text`, `wait --url`, `wait --fn`, `wait --load <state>`, `wait --download`) when a real condition exists; after raw manual CDP `connect`, inspect `tab list`, select a stable `tab t<N>` surface, then run a condition wait or `snapshot -i`; after wrapper-owned `electron.launch`, use `electron.probe` / `electron.status` when launch health or target mismatch matters; use `qa.attached` for current-session text/selector diagnostics; keep fixed waits as a last resort below the wrapper IPC budget; and treat fixed-wait payloads such as `"waited":"timeout"` as elapsed time rather than completion evidence. Manual signed-in attach docs now also restate that `connect` readiness is not immediate readiness, `close` only
|
|
103
|
+
`RQ-0098` completes the docs/playbook groundwork for desktop readiness and wait orchestration without adding a runtime primitive or reusable recipe layer. The accepted ladder is: prefer condition waits (`wait --text`, `wait --url`, `wait --fn`, `wait --load <state>`, `wait --download`) when a real condition exists; after raw manual CDP `connect`, inspect `tab list`, select a stable `tab t<N>` surface, then run a condition wait or `snapshot -i`; after wrapper-owned `electron.launch`, use `electron.probe` / `electron.status` when launch health or target mismatch matters; use `qa.attached` for current-session text/selector diagnostics; keep fixed waits as a last resort below the wrapper IPC budget; and treat fixed-wait payloads such as `"waited":"timeout"` as elapsed time rather than completion evidence. Manual signed-in attach docs now also restate that `connect` readiness is not immediate readiness, close commands (`close`, `quit`, or `exit`) only close the browser/CDP session, `electron.cleanup` remains wrapper-owned, and manually launched apps plus explicit artifacts stay host-owned. Human workflow: [`ELECTRON.md`](ELECTRON.md#readiness-and-waits), [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#wait-for-page-readiness-or-downloads), README Electron section, and generated playbook text from `extensions/agent-browser/lib/playbook.ts`. Revisit a first-class host-idle primitive only with repeated desktop smoke evidence that condition waits, `qa.attached`, `electron.probe`, snapshots, and screenshots cannot cover the workflow. Verification: `npm run docs` keeps generated playbook fragments aligned; no runtime `details.nextActions` are part of this RQ.
|
|
85
104
|
|
|
86
105
|
`RQ-0100` makes desktop tab/surface drift recovery machine-readable without adding routine tab-list probes for normal clicks. When existing wrapper state already identifies a target tab, about:blank and tab-drift paths append `list-tabs-for-about-blank-recovery` or `list-tabs-for-tab-drift-recovery`, then `select-intended-tab-after-drift` and `snapshot-after-tab-recovery` when the stable `t<N>` id is known. The implementation reuses `priorSessionTabTarget`, `aboutBlankSessionMismatch`, `sessionTabCorrection`, `openResultTabCorrection`, and existing tab-correction outputs; it does not probe tabs for ordinary clicks beyond the RQ-0086-gated drift paths. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#tabs) and [`ELECTRON.md`](ELECTRON.md#troubleshooting); fake coverage: about:blank recovery and explicit-about:blank negatives in [`test/agent-browser.extension-tab-recovery.test.ts`](../test/agent-browser.extension-tab-recovery.test.ts), early tab-drift failure assertions in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), and central next-action helper coverage in [`test/agent-browser.results.test.ts`](../test/agent-browser.results.test.ts).
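A minimal sketch of how the RQ-0100 recovery ids could be ordered, assuming a helper that is told which drift path fired and whether a stable `t<N>` id is known. The `DriftKind` type and the `driftRecoveryActionIds` function are hypothetical names for illustration; only the id strings come from the text above.

```typescript
// Hypothetical sketch: type and function names are illustrative, not the wrapper's API.
type DriftKind = "about-blank" | "tab-drift";

// Build the ordered recovery action ids for a drift path.
// The id strings are the ones quoted in the RQ-0100 text.
function driftRecoveryActionIds(kind: DriftKind, stableTabId?: string): string[] {
  const ids: string[] = [
    kind === "about-blank"
      ? "list-tabs-for-about-blank-recovery"
      : "list-tabs-for-tab-drift-recovery",
  ];
  // Select and snapshot follow-ups are only added when a stable t<N> id is known.
  if (stableTabId !== undefined && /^t\d+$/.test(stableTabId)) {
    ids.push("select-intended-tab-after-drift", "snapshot-after-tab-recovery");
  }
  return ids;
}
```

With no stable id the result holds only the list-tabs action, so a caller can branch on the array length instead of parsing prose.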
|
|
87
106
|
|
|
@@ -95,19 +114,19 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
|
|
|
95
114
|
|
|
96
115
|
`RQ-0088` adds current-snapshot ref fallback for selector misses: when raw `find` or compiled `semanticAction` fails with `failureCategory: "selector-not-found"`, `extensions/agent-browser/index.ts` may take one fresh session-scoped `snapshot -i`, then `extensions/agent-browser/lib/results/selector-recovery.ts` looks for exact normalized role/name matches for the failed target and emits `details.visibleRefFallback` plus visible `Current snapshot ref fallback`. Non-fill matches append bounded direct-ref next actions (`try-current-visible-ref` / `try-current-visible-ref-N`); fill matches omit direct args/text and feed the RQ-0099 rich-input recovery path when the ref is editable. The matcher is intentionally narrow: role locators require `--name`; text-click maps only to exact-name `button`/`link` refs; label/placeholder fill maps only to exact-name textbox/searchbox-style refs; prefixes/fuzzy matches are ignored, and duplicate exact matches carry ambiguity safety copy. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`visibleRefFallback`, nextActions); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) selector strategy and README pitfalls; fake coverage: `agentBrowserExtension suggests current snapshot refs when raw find role locators miss` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts).
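The narrow matcher described above can be sketched as an exact comparison on normalized names. This is an illustration under assumptions: the `SnapshotRef` shape, the `normalizeName` rules (trim, collapse whitespace, lowercase), and the function name are hypothetical and may differ from `selector-recovery.ts`.

```typescript
// Hypothetical shapes; the real snapshot ref format may differ.
interface SnapshotRef { id: string; role: string; name: string }
interface FallbackResult { matches: SnapshotRef[]; ambiguous: boolean }

// Assumed normalization: trim, collapse inner whitespace, lowercase.
function normalizeName(name: string): string {
  return name.trim().replace(/\s+/g, " ").toLowerCase();
}

// Keep only refs whose role and normalized name match exactly.
// Prefix and fuzzy matches are ignored; duplicates are flagged as ambiguous.
function findExactRefMatches(refs: SnapshotRef[], role: string, name: string): FallbackResult {
  const wanted = normalizeName(name);
  const matches = refs.filter(
    (ref) => ref.role === role && normalizeName(ref.name) === wanted,
  );
  return { matches, ambiguous: matches.length > 1 };
}
```

A caller would attach ambiguity safety copy when `ambiguous` is true rather than picking one of the duplicates.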
|
|
97
116
|
|
|
98
|
-
`RQ-0072` guards page-scoped `@e…` refs against silent recycling: successful `snapshot` (or the last `snapshot` step inside a successful `batch`) records `details.refSnapshot` with ref ids and the snapshot page URL; `extensions/agent-browser/lib/session-page-state.ts` replays per-session snapshots and `refSnapshotInvalidation` markers from the transcript on
|
|
117
|
+
`RQ-0072` guards page-scoped `@e…` refs against silent recycling: successful `snapshot` (or the last `snapshot` step inside a successful `batch`) records `details.refSnapshot` with ref ids and the snapshot page URL; `extensions/agent-browser/lib/session-page-state.ts` replays per-session snapshots and `refSnapshotInvalidation` markers from the active transcript branch on `session_start` and Pi 0.76 `session_tree` branch changes, clears them on successful close commands (`close`, `quit`, or `exit`), invalidates prior refs when a session `snapshot` fails with `No active page`, rejects mutation-prone ref argv before spawn when the tab URL diverges, a ref id is missing from the latest snapshot, or the session refs are invalidated, blocks `batch` stdin that uses `@e…` on a guarded command after an earlier step that can navigate or mutate until a `snapshot` step appears later in the same stdin array (pre-spawn latch reset only), and prefixes `refresh-interactive-refs` with `--session` when the call names a session (including upstream-classified `stale-ref` outcomes). The entrypoint also serializes `session_tree` restore and wrapper-owned browser commands with managed-session work, guards independent caller-owned explicit-session completions with a branch-state generation check, keeps process-owned cleanup registries for managed sessions and wrapper-launched Electron records separate from the branch-visible view, treats explicit wrapper-owned close rows and Electron cleanup managed-session steps as restore-visible close events, closes off-branch owned managed sessions and Electron launches on non-quit reload shutdown, preserves current branch-visible active managed/Electron sessions and active Electron temp profiles for reload continuity, and preserves fresh-session allocation monotonicity across branch restores. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`refSnapshot`, `refSnapshotInvalidation`, `stale-ref`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) snapshot/ref notes and README pitfalls; fake coverage: `agentBrowserExtension recommends tab recovery after No active page snapshot failures` and `agentBrowserExtension invalidates refs after No active page snapshot failures inside batch` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus `agentBrowserExtension blocks page-scoped ref reuse…`, `…rehydrates page-scoped refs from the current tree branch`, `…rehydrates managed browser session state from the current tree branch`, `…rehydrates artifact manifest state from the current tree branch`, `…keeps Electron cleanup ownership after session_tree switches away from the launch branch`, `…blocks stale refs after page-changing steps inside a batch`, `…allows same-snapshot form fills before a batch click`, `…allows batch stdin ref steps after snapshot following an invalidating step`, `…records snapshot refs returned inside a successful batch`, and `…rejects refs absent from the latest same-page snapshot` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts); managed-session reload cleanup, explicit close untracking/state rotation/restore, generated fresh-name reservation after repeated explicit closes, explicit-session command versus `session_tree` generation-guard coverage, explicit close versus in-flight implicit command serialization, and fresh-ordinal coverage lives in [`test/agent-browser.resume-state.test.ts`](../test/agent-browser.resume-state.test.ts).
|
|
99
118
|
|
|
100
119
|
`RQ-0087` keeps the RQ-0072 guard but removes `fill` from the batch invalidation latch: `fill @e…` rows remain guarded against stale/missing refs, yet multiple same-snapshot form fills can run before the first click/submit/navigation step in one upstream `batch`. A later guarded ref after `click`, `open`, `reload`, or other invalidating rows still fails before spawn unless the batch includes a fresh `snapshot` step first. This improves login/checkout efficiency without permitting likely post-navigation ref reuse. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`Batch stdin ordering`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) ref notes; fake coverage: `agentBrowserExtension allows same-snapshot form fills before a batch click` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts).
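The batch ordering rule can be sketched as a single pass with a latch. This is a simplified illustration, not the wrapper's code: the `BatchStep` shape, the `INVALIDATING` set (limited to the three commands the text names), and the assumption that any step carrying an `@e` argument is guarded are all hypothetical.

```typescript
// Hypothetical sketch: the step shape and the invalidating set are assumptions.
type BatchStep = string[]; // e.g. ["fill", "@e3", "alice"]

// Assumed subset: the text names click, open, and reload, and mentions
// other invalidating rows without listing them.
const INVALIDATING = new Set(["click", "open", "reload"]);

// Return the index of the first step that reuses an @e ref after an
// invalidating step with no snapshot in between, or -1 if ordering is fine.
function firstBlockedStepIndex(steps: BatchStep[]): number {
  let latched = false;
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    if (!step) continue;
    const command = step[0];
    if (command === undefined) continue;
    if (command === "snapshot") {
      latched = false; // a fresh snapshot resets the latch
      continue;
    }
    const usesRef = step.slice(1).some((arg) => arg.startsWith("@e"));
    if (latched && usesRef) return i;
    // fill stays guarded but does not trip the latch (RQ-0087).
    if (INVALIDATING.has(command)) latched = true;
  }
  return -1;
}
```

For example, two `fill @e…` rows followed by a `click @e…` pass, while a `fill @e…` placed after the `click` is blocked unless a `snapshot` step comes first.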
|
|
101
120
|
|
|
102
|
-
`RQ-0073` surfaces likely overlay blockers after no-navigation clicks without inventing blind targets: for **top-level** `click` results (unified command `click`, not `batch`-wrapped steps) whose upstream JSON includes `data.clicked`, whose prior pinned tab URL and post-click URL (from `details.navigationSummary`, gathered by one read-only `eval` summary when the click payload omits **both** string `data.url` and `data.title`) stay equal after the same fragment-insensitive normalization used for ref preflight, and where the same unified result did **not** already apply session tab correction
|
|
121
|
+
`RQ-0073` surfaces likely overlay blockers after no-navigation clicks without inventing blind targets: for **top-level** `click` results (unified command `click`, not `batch`-wrapped steps) whose upstream JSON includes `data.clicked`, whose prior pinned tab URL and post-click URL (from `details.navigationSummary`, gathered by one read-only `eval` summary when the click payload omits **both** string `data.url` and `data.title`) stay equal after the same fragment-insensitive normalization used for ref preflight, and where the same unified result did **not** already apply session tab correction or about-blank mismatch recovery, and `details.clickDispatch` did not fire for it, `extensions/agent-browser/index.ts` takes one fresh session-scoped `snapshot -i`, scans `refs` for strong modal context (`dialog` / `alertdialog`) plus up to three close/dismiss-pattern `button`/`link`/`menuitem` controls, and only then emits `details.overlayBlockers` (`candidates`, `summary`, and a `snapshot` map that can advance `refSnapshot`), visible `Possible overlay blockers`, and `inspect-overlay-state` / `try-overlay-blocker-candidate-*` next actions (with `--session` prefix when the session is named) appended after presentation follow-ups such as `inspect-after-mutation`. Page-wide privacy/sign-in/banner text without a dialog role is deliberately ignored to avoid warnings after ordinary same-page clicks. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`overlayBlockers`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) no-navigation click note and README pitfalls; fake coverage: `agentBrowserExtension surfaces likely overlay blockers after a no-op click` and `agentBrowserExtension does not report overlay blockers from unrelated page chrome after a successful same-page click` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
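The RQ-0073 ref scan can be sketched as a two-stage filter: require modal context first, then collect at most three dismiss-style controls. The `Ref` shape, the function name, and especially `DISMISS_PATTERN` are assumptions for illustration; the source does not spell out the actual close/dismiss pattern.

```typescript
// Hypothetical ref shape for illustration.
interface Ref { id: string; role: string; name: string }

// Assumed close/dismiss pattern; the real pattern is not given in the text.
const DISMISS_PATTERN = /\b(close|dismiss|cancel|no thanks|got it)\b/i;
const MODAL_ROLES = new Set(["dialog", "alertdialog"]);
const CONTROL_ROLES = new Set(["button", "link", "menuitem"]);

// Return up to three dismiss-style controls, but only when the snapshot
// also contains a dialog or alertdialog role.
function findOverlayCandidates(refs: Ref[]): Ref[] {
  if (!refs.some((ref) => MODAL_ROLES.has(ref.role))) return [];
  return refs
    .filter((ref) => CONTROL_ROLES.has(ref.role) && DISMISS_PATTERN.test(ref.name))
    .slice(0, 3);
}
```

Returning an empty list when no dialog role is present mirrors the stated intent of ignoring page-wide banner text after ordinary same-page clicks.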
|
|
103
122
|
|
|
104
123
|
`RQ-0086` reduces wrapper-induced click fragility found during Sauce Demo smokes: navigation-summary enrichment for click/back/forward/reload/dblclick now uses one read-only `eval` (`({ title: document.title, url: location.href })`) instead of serial `get title` plus `get url` probes, including tab-pinned batch wrappers. Tab pinning/post-command tab correction now runs only after the wrapper has evidence of tab-drift risk (profile restore correction, overlapping stale opens, or restored session state), so ordinary same-session clicks no longer get repeated `tab list` probes. This keeps `details.navigationSummary`, overlay blocker checks, and drift recovery intact while avoiding the upstream `agent-browser 0.27.0` sequence that could report later clicks as successful without dispatching pointer/click events after repeated getter/tab/snapshot probes. Fake coverage: `agentBrowserExtension enriches click results with a post-navigation title and url summary` in [`test/agent-browser.extension-tabs.test.ts`](../test/agent-browser.extension-tabs.test.ts), plus `agentBrowserExtension pins the intended tab inside a follow-up command when reconnect drift would otherwise steal focus` and about-blank/tab overlap assertions in [`test/agent-browser.extension-tab-recovery.test.ts`](../test/agent-browser.extension-tab-recovery.test.ts); manual validation source: [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt).
|
|
105
124
|
|
|
106
|
-
`RQ-0089` investigated
|
|
125
|
+
`RQ-0089` investigated Sauce Demo no-op clicks after RQ-0086, and the 2026-05-26 release smoke reproduced the failure against direct upstream `agent-browser 0.27.0`: CSS `click [data-test=add-to-cart-sauce-labs-backpack]` and current `@ref` clicks returned success, but a page-level listener recorded no trusted pointer/mouse/click events and the cart stayed unchanged; an in-page `element.click()` did mutate the cart. The wrapper now adds a bounded top-level non-Electron `click` dispatch probe before standalone clicks. If upstream reports success but no trusted DOM event reached the target, it fails the tool, records `details.clickDispatch.status: "no-native-event-observed"`, and appends `inspect-click-dispatch-miss` / `retry-click-after-dispatch-miss` next actions; it does **not** replay clicks in-page. This is not site-specific and does not alter `batch`/`job`/`qa` click steps. For `@e…` refs, the probe uses role/name metadata persisted in `details.refSnapshot` from the latest snapshot instead of running a pre-click snapshot that could recycle upstream refs. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`clickDispatch`, `refSnapshot`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) click verification notes; fake coverage: `agentBrowserExtension reports click dispatch diagnostic when upstream reports success without dispatching DOM events` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
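The decision at the heart of the dispatch probe can be sketched as a pure check over the events a page-level listener recorded. Everything here is a hypothetical simplification: the `ObservedEvent` shape, the event-type list, and the function name are assumptions, and the sketch leaves out how the listener is installed.

```typescript
// Hypothetical record of what a page-level listener saw during the click.
interface ObservedEvent { type: string; isTrusted: boolean }

// Assumed set of pointer/mouse/click event types to look for.
const CLICK_EVENT_TYPES = new Set(["pointerdown", "pointerup", "mousedown", "mouseup", "click"]);

// True when upstream reported success but no trusted pointer/mouse/click
// event was observed, which is the "no-native-event-observed" case.
function isDispatchMiss(upstreamReportedSuccess: boolean, observed: ObservedEvent[]): boolean {
  if (!upstreamReportedSuccess) return false; // upstream failures are reported as such
  const sawTrusted = observed.some(
    (event) => event.isTrusted && CLICK_EVENT_TYPES.has(event.type),
  );
  return !sawTrusted;
}
```

Untrusted events, such as those produced by an in-page `element.click()`, do not count as evidence, which is consistent with the wrapper not replaying clicks in-page.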
|
|
107
126
|
|
|
108
127
|
`RQ-0074` warns when `get text <selector>` may read hidden or tabbed DOM content: for non-ref CSS selectors, `extensions/agent-browser/index.ts` runs a read-only `eval --stdin` visibility probe after successful text reads, emits `details.selectorTextVisibility` plus visible warning text when the first match is hidden while visible matches exist or when multiple matches make the upstream first-match choice ambiguous, preserves multiple batched warnings in `details.selectorTextVisibilityAll`, and appends `inspect-visible-text-candidates` next actions. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`selectorTextVisibility`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) extraction note and README pitfalls; fake coverage: `agentBrowserExtension warns when get text may read hidden selector matches` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
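The two warning conditions can be sketched as a small classifier over per-match visibility flags. The function name and the returned label strings are hypothetical; the source only names the `details.selectorTextVisibility` field, not its internal values.

```typescript
// Hypothetical labels for the two warning conditions in the text.
type VisibilityWarning = "first-match-hidden" | "ambiguous-multiple-matches" | null;

// matchesVisible[i] is true when the i-th selector match is visible.
function classifySelectorTextVisibility(matchesVisible: boolean[]): VisibilityWarning {
  if (matchesVisible.length === 0) return null;
  const firstHidden = matchesVisible[0] === false;
  // Upstream reads the first match, so a hidden first match is risky
  // when a visible alternative exists.
  if (firstHidden && matchesVisible.some((visible) => visible)) return "first-match-hidden";
  if (matchesVisible.length > 1) return "ambiguous-multiple-matches";
  return null;
}
```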
|
|
109
128
|
|
|
110
|
-
`RQ-0075` classifies QA and diagnostic network failures by likely impact: `summarizeNetworkFailures` / `classifyNetworkRequestFailure` in `extensions/agent-browser/lib/results/network.ts` (re-exported from `shared.ts`) split rows that already count as failed (`isFailedNetworkRequest`) into actionable versus benign low-impact browser icon asset misses (`isBenignAssetFailure`: favicon/apple-touch-icon basename patterns, 404/`failed`/string `error` signals, and image-like `resourceType`/`mimeType` when present). `analyzeQaPresetResults` fails `qa` only for actionable network failures while preserving benign rows in `qaPreset.warnings`, and network request presentation adds a compact actionable/benign summary plus per-row impact tags, ordered with actionable/benign failed rows before successful rows so late failures are visible even in capped previews. Because real Pi ignores returned `isError` fields from custom tool `execute`, `extensions/agent-browser/index.ts` also realigns `details.resultCategory: "failure"` outcomes to Pi-visible tool errors through a `tool_result` handler; it appends the exact failure category plus `Pi tool isError: true` to prose output and preserves caller-requested `--json` output as parseable JSON while patching `isError`. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) QA and network diagnostic notes; fake coverage: `agentBrowserExtension compiles lightweight QA presets and fails diagnostics` in [`test/agent-browser.extension-input-modes.test.ts`](../test/agent-browser.extension-input-modes.test.ts) plus network presentation assertions in [`test/agent-browser.presentation.test.ts`](../test/agent-browser.presentation.test.ts); real-Pi
|
|
129
|
+
`RQ-0075` classifies QA and diagnostic network failures by likely impact: `summarizeNetworkFailures` / `classifyNetworkRequestFailure` in `extensions/agent-browser/lib/results/network.ts` (re-exported from `shared.ts`) split rows that already count as failed (`isFailedNetworkRequest`) into actionable versus benign low-impact browser icon asset misses (`isBenignAssetFailure`: favicon/apple-touch-icon basename patterns, 404/`failed`/string `error` signals, and image-like `resourceType`/`mimeType` when present). `analyzeQaPresetResults` fails `qa` only for actionable network failures while preserving benign rows in `qaPreset.warnings`, and network request presentation adds a compact actionable/benign summary plus per-row impact tags, ordered with actionable/benign failed rows before successful rows so late failures are visible even in capped previews. Because real Pi ignores returned `isError` fields from custom tool `execute`, `extensions/agent-browser/index.ts` also realigns `details.resultCategory: "failure"` outcomes to Pi-visible tool errors through a `tool_result` handler; it appends the exact failure category plus `Pi tool isError: true` to prose output and preserves caller-requested `--json` output as parseable JSON while patching `isError`. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) QA and network diagnostic notes; fake coverage: `agentBrowserExtension compiles lightweight QA presets and fails diagnostics` in [`test/agent-browser.extension-input-modes.test.ts`](../test/agent-browser.extension-input-modes.test.ts) plus network presentation assertions in [`test/agent-browser.presentation.test.ts`](../test/agent-browser.presentation.test.ts); model-free real-Pi pipeline coverage in [`test/agent-browser.pi-pipeline.test.ts`](../test/agent-browser.pi-pipeline.test.ts) asserts both in-memory and persisted JSONL tool results for QA prose patching, parseable caller-requested `--json` failures, and strict public-schema rejection before upstream spawn; `npm run verify -- lifecycle` asserts the QA failure-patch line in a saved JSONL session.
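A rough sketch of the actionable versus benign split, assuming rows have already been identified as failed. The `NetworkRow` shape, the `ICON_BASENAME` regex, and both function names are illustrative assumptions and are not the signatures of `classifyNetworkRequestFailure` or `isBenignAssetFailure`.

```typescript
// Hypothetical row shape; the real network row has more fields.
interface NetworkRow { url: string; status?: number; error?: string; resourceType?: string }
type Impact = "actionable" | "benign";

// Assumed basename pattern for browser icon assets.
const ICON_BASENAME = /^(favicon|apple-touch-icon)[^/]*\.(ico|png|svg)$/i;

// Classify a row that already counts as failed.
function classifyFailedRow(row: NetworkRow): Impact {
  const withoutQuery = row.url.split(/[?#]/)[0] ?? "";
  const basename = withoutQuery.split("/").pop() ?? "";
  const iconLike = ICON_BASENAME.test(basename);
  const missSignal = row.status === 404 || typeof row.error === "string";
  // resourceType only narrows the decision when it is present.
  const imageLike = row.resourceType === undefined || /image|icon/i.test(row.resourceType);
  return iconLike && missSignal && imageLike ? "benign" : "actionable";
}

// QA fails only when at least one failed row is actionable.
function qaShouldFail(failedRows: NetworkRow[]): boolean {
  return failedRows.some((row) => classifyFailedRow(row) === "actionable");
}
```

Benign rows would still be kept as warnings; this sketch covers only the pass/fail decision.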
|
|
111
130
|
|
|
112
131
|
`RQ-0076` adds best-effort timeout recovery when the wrapper watchdog kills a stuck upstream process: `extensions/agent-browser/index.ts` calls `collectTimeoutPartialProgress` / `formatTimeoutPartialProgressText` to build `details.timeoutPartialProgress` from the compiled `job` or `qa` step list or parsed caller `batch` stdin, session-scoped `get url` / `get title` (plus optional planned-URL fallback from `open`/`navigate`/`pushstate` steps), and declared artifact paths (`screenshot`, `pdf`, `download`, `wait --download`) with existence/size checks, then appends a visible `Timeout partial progress` block with redacted URLs/paths. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) wrapper timeout note and README job section; fake coverage: `agentBrowserExtension reports partial progress and artifacts after job timeout` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
|
|
113
132
|
|
|
@@ -115,7 +134,7 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
|
|
|
115
134
|
|
|
116
135
|
`RQ-0078` improves getter/eval discoverability: `extensions/agent-browser/lib/results/presentation/errors.ts` matches upstream failure text containing `unknown command`, `unknown subcommand`, or `unrecognized command` (case-insensitive) when the failed command token is one of `attr`, `count`, `html`, `text`, `title`, `url`, or `value`, then adds grouped-`get` prose; only `title` / `url` also emit read-only `nextActions` (`use-get-title` / `use-get-url`, with `--session` when the failed call named a session). The getter block is skipped when selector recovery already injected an `Agent-browser hint:` line into the same error string. `extensions/agent-browser/index.ts` adds `details.evalStdinHint` plus visible `Eval stdin hint` when `looksLikeFunctionEvalStdin` matches trimmed stdin and upstream JSON carries a plain empty-object `data.result`; empty arrays such as `[]` are valid eval results and are not warned. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`nextActions`, `evalStdinHint`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) extraction note and README quick start; fake coverage: `buildToolPresentation suggests grouped getter commands for common unknown getter shortcuts` and `agentBrowserExtension warns when eval stdin returns an empty object from a function-shaped snippet`.
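The getter-shortcut rule can be sketched as follows. The token list, the three matched phrases, the `Agent-browser hint:` skip, and the `use-get-title` / `use-get-url` ids come from the text above; the `GetterHint` shape, the function name, and the prose wording are hypothetical.

```typescript
const GETTER_TOKENS = new Set(["attr", "count", "html", "text", "title", "url", "value"]);
const UNKNOWN_COMMAND = /unknown command|unknown subcommand|unrecognized command/i;

// Hypothetical result shape for illustration.
interface GetterHint { prose: string; nextActionId?: string }

function getterHint(failedToken: string, errorText: string): GetterHint | undefined {
  if (!GETTER_TOKENS.has(failedToken) || !UNKNOWN_COMMAND.test(errorText)) return undefined;
  // Skip when selector recovery already injected its own hint line.
  if (errorText.includes("Agent-browser hint:")) return undefined;
  const hint: GetterHint = { prose: `Use the grouped form: get ${failedToken}` };
  // Only title and url also get a read-only next action.
  if (failedToken === "title" || failedToken === "url") {
    hint.nextActionId = `use-get-${failedToken}`;
  }
  return hint;
}
```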
|
|
117
136
|
|
|
118
|
-
`RQ-0079` clarifies artifact lifecycle and cleanup ownership: `extensions/agent-browser/
|
|
137
|
+
`RQ-0079` clarifies artifact lifecycle and cleanup ownership: `extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts` builds `details.artifactCleanup`, surfaced by process-output with visible `Artifact lifecycle` copy on successful close commands (`close`, `quit`, or `exit`) when `artifactManifest.entries` is non-empty (`getArtifactCleanupGuidance`), stating that close commands do not delete explicit artifacts; `explicitArtifactPaths` carries up to ten distinct existing `explicit-path` manifest paths after a filesystem existence check, skipping stale paths already removed by host tools (possibly empty when the recent window has no existing explicit rows). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`artifactCleanup`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) artifact retention section and README artifact notes; fake coverage: `agentBrowserExtension reports artifact lifecycle guidance on close` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts), plus close-alias unit coverage in [`test/agent-browser.runtime.test.ts`](../test/agent-browser.runtime.test.ts) and [`test/agent-browser.session-page-state.test.ts`](../test/agent-browser.session-page-state.test.ts).
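The `explicitArtifactPaths` selection can be sketched as a filter with an injected existence check. The `ManifestEntry` shape and the function signature are assumptions for illustration; passing `exists` as a parameter keeps the sketch free of filesystem access, whereas the wrapper performs a real filesystem check.

```typescript
// Hypothetical manifest entry shape.
interface ManifestEntry { path: string; kind: string }

// Collect up to `limit` distinct explicit-path entries that still exist.
// `exists` stands in for a filesystem existence check.
function explicitArtifactPaths(
  entries: ManifestEntry[],
  exists: (path: string) => boolean,
  limit = 10,
): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const entry of entries) {
    if (entry.kind !== "explicit-path" || seen.has(entry.path)) continue;
    seen.add(entry.path);
    if (!exists(entry.path)) continue; // stale path already removed by host tools
    result.push(entry.path);
    if (result.length >= limit) break;
  }
  return result;
}
```

The result may be empty, which matches the note that the list can be empty when no explicit rows still exist.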
|
|
119
138
|
|
|
120
139
|
`RQ-0080` adds no-op scroll recovery for dense dashboards and nested panes: for successful top-level `scroll`, `extensions/agent-browser/index.ts` samples viewport and prominent scroll-container positions before and after execution with read-only session-scoped `eval --stdin` probes. If no sampled position changes, it emits `details.scrollNoop`, appends visible `Scroll diagnostic: no observed scroll movement`, appends exact `inspect-after-noop-scroll` / `verify-noop-scroll-visually` next actions, and updates `pageChangeSummary.nextActionIds` so agents can branch without parsing prose. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`scrollNoop`, `nextActions`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) scroll note; fake coverage: `agentBrowserExtension reports no-op scroll diagnostics with recovery next actions`.
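The before/after comparison can be sketched as below, assuming each probe returns a list of sampled positions keyed by a target label. The `ScrollSample` shape and the function name are hypothetical, and treating a target missing from the second sample as unchanged is an assumption of this sketch.

```typescript
// Hypothetical sample shape: one entry for the viewport plus one per
// prominent scroll container.
interface ScrollSample { target: string; top: number; left: number }

// True when no sampled position changed between the two probes.
function isNoopScroll(before: ScrollSample[], after: ScrollSample[]): boolean {
  const afterByTarget = new Map<string, ScrollSample>();
  for (const sample of after) afterByTarget.set(sample.target, sample);
  return before.every((b) => {
    const a = afterByTarget.get(b.target);
    // Assumption: a target missing from the second sample counts as unchanged.
    return a === undefined || (a.top === b.top && a.left === b.left);
  });
}
```

When this returns true, a caller would attach the `details.scrollNoop` diagnostic and the two recovery next actions named above.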
|
|
121
140
|
|