pi-agent-browser-native 0.2.33 → 0.2.35
- package/CHANGELOG.md +46 -0
- package/README.md +47 -17
- package/docs/ARCHITECTURE.md +25 -13
- package/docs/COMMAND_REFERENCE.md +285 -47
- package/docs/ELECTRON.md +3 -3
- package/docs/RELEASE.md +22 -14
- package/docs/REQUIREMENTS.md +5 -5
- package/docs/SUPPORT_MATRIX.md +26 -22
- package/docs/TOOL_CONTRACT.md +97 -32
- package/extensions/agent-browser/index.ts +519 -2402
- package/extensions/agent-browser/lib/argv-descriptor.ts +90 -0
- package/extensions/agent-browser/lib/argv-grammar.ts +128 -0
- package/extensions/agent-browser/lib/command-policy.ts +71 -0
- package/extensions/agent-browser/lib/command-taxonomy.ts +336 -0
- package/extensions/agent-browser/lib/electron/cleanup.ts +1 -0
- package/extensions/agent-browser/lib/executable-path.ts +19 -0
- package/extensions/agent-browser/lib/input-modes/job.ts +62 -0
- package/extensions/agent-browser/lib/input-modes/params.ts +8 -8
- package/extensions/agent-browser/lib/input-modes.ts +3 -0
- package/extensions/agent-browser/lib/orchestration/batch-stdin.ts +65 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/browser-action-model.ts +154 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/click-dispatch.ts +149 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts +77 -29
- package/extensions/agent-browser/lib/orchestration/browser-run/final-result.ts +6 -2
- package/extensions/agent-browser/lib/orchestration/browser-run/index.ts +33 -27
- package/extensions/agent-browser/lib/orchestration/browser-run/prepare.ts +74 -23
- package/extensions/agent-browser/lib/orchestration/browser-run/process-output.ts +67 -17
- package/extensions/agent-browser/lib/orchestration/browser-run/prompt-guards.ts +93 -0
- package/extensions/agent-browser/lib/orchestration/browser-run/session-state.ts +19 -123
- package/extensions/agent-browser/lib/orchestration/browser-run/types.ts +32 -1
- package/extensions/agent-browser/lib/orchestration/electron-host/index.ts +860 -0
- package/extensions/agent-browser/lib/playbook.ts +24 -23
- package/extensions/agent-browser/lib/prompt-policy.ts +122 -0
- package/extensions/agent-browser/lib/results/action-recommendations.ts +3 -23
- package/extensions/agent-browser/lib/results/categories.ts +1 -1
- package/extensions/agent-browser/lib/results/presentation/navigation.ts +2 -34
- package/extensions/agent-browser/lib/results/presentation/registry.ts +34 -6
- package/extensions/agent-browser/lib/results/presentation/semantic-action.ts +133 -0
- package/extensions/agent-browser/lib/results/presentation.ts +11 -6
- package/extensions/agent-browser/lib/runtime.ts +93 -227
- package/extensions/agent-browser/lib/session-page-state.ts +31 -14
- package/extensions/agent-browser/lib/temp.ts +148 -23
- package/package.json +4 -4
- package/scripts/agent-browser-capability-baseline.mjs +198 -1
package/docs/ELECTRON.md
CHANGED
|
@@ -198,9 +198,9 @@ Closes the tracked managed session, stops only the wrapper-tracked process, veri
|
|
|
198
198
|
- arbitrary Electron processes the wrapper did not start
|
|
199
199
|
- explicit screenshots, downloads, PDFs, traces, HAR files, or recordings saved to caller-chosen paths
|
|
200
200
|
|
|
201
|
-
For manual launches, `close` only
|
|
201
|
+
For manual launches, close commands (`close`, `quit`, or `exit`) only close the browser/CDP session. Close the app yourself and clean its profile/temp files with normal host tools.
|
|
202
202
|
|
|
203
|
-
On Pi
|
|
203
|
+
On Pi `quit`, active wrapper-owned Electron launches are best-effort cleaned. On `/reload`, the current branch-visible active Electron launch and its isolated temp `userDataDir` are preserved for continuity while off-branch owned Electron launches are cleaned before process-local ownership is cleared. If cleanup is partial and skips or fails `user-data-dir` removal because the process or debug port is still live, the generic temp sweep preserves that profile path across reload, quit, repeated temp cleanup, process-exit cleanup, and stale temp-root pruning after restart rather than deleting it out from under the remaining host resource. If `electron.cleanup` closes the attached managed session but host process/profile cleanup is partial, later default browser calls still rotate away from that closed wrapper-managed session. Stale restored records (PID gone, port dead) are **reported** instead of guessed at or killed.
|
|
204
204
|
|
|
205
205
|
### `timeoutMs` by action (quick reference)
|
|
206
206
|
|
|
@@ -210,7 +210,7 @@ On Pi session shutdown, active wrapper-owned Electron launches are best-effort c
|
|
|
210
210
|
| --- | --- | --- |
|
|
211
211
|
| `launch` | Host-side wait for `DevToolsActivePort` and CDP readiness | **15 s**, hard-capped at **120 s** (`normalizeTimeoutMs` in `extensions/agent-browser/lib/electron/launch.ts`) |
|
|
212
212
|
| `status` | Optional managed-session `get title` / `get url` reads used for mismatch diagnostics | Normal tool subprocess budget from `runAgentBrowserProcess` / `AGENT_BROWSER_DEFAULT_TIMEOUT`; localhost CDP HTTP probes keep a short fixed budget (`ELECTRON_STATUS_FETCH_TIMEOUT_MS` in `extensions/agent-browser/lib/electron/cleanup.ts`) |
|
|
213
|
-
| `cleanup` | One combined budget for managed-session `close`, tracked process exit, debug-port verification, and temp profile removal | `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else **5000 ms** (`getImplicitSessionCloseTimeoutMs` in `extensions/agent-browser/lib/runtime.ts`, passed through `
|
|
213
|
+
| `cleanup` | One combined budget for managed-session `close`, tracked process exit, debug-port verification, and temp profile removal | `PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS` when set, else **5000 ms** (`getImplicitSessionCloseTimeoutMs` in `extensions/agent-browser/lib/runtime.ts`, passed through `cleanupTrackedElectronHostLaunches` in `extensions/agent-browser/lib/orchestration/electron-host/index.ts`) |
|
|
214
214
|
| `probe` | **Each** upstream read in the probe chain (`get title`, `get url`, focused `eval --stdin`, `tab list`, `snapshot -i`) | Same default as other tool calls (typically **28 s** per subprocess unless `AGENT_BROWSER_DEFAULT_TIMEOUT` / `PI_AGENT_BROWSER_PROCESS_TIMEOUT_MS` overrides `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts`) |
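The budgets in the table above can be read as two small resolution rules. The following is a minimal sketch only, not the package's code: `resolveCleanupTimeoutMs` and `clampLaunchTimeoutMs` are hypothetical names, the environment is passed in as a plain map, and the fallback for invalid values is an assumption, since the real parsing in `getImplicitSessionCloseTimeoutMs` and `normalizeTimeoutMs` is not shown in this diff.

```typescript
// Hypothetical helpers illustrating the table; not the package's implementation.
type EnvMap = Record<string, string | undefined>;

const DEFAULT_CLEANUP_TIMEOUT_MS = 5000; // "else 5000 ms" in the `cleanup` row
const DEFAULT_LAUNCH_TIMEOUT_MS = 15000; // "15 s" in the `launch` row
const MAX_LAUNCH_TIMEOUT_MS = 120000; // "hard-capped at 120 s"

// Use the env override when it parses to a positive number, else the default.
// Assumption: empty, non-numeric, or non-positive values fall back to the default.
function resolveCleanupTimeoutMs(env: EnvMap): number {
  const raw = env["PI_AGENT_BROWSER_IMPLICIT_SESSION_CLOSE_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_CLEANUP_TIMEOUT_MS;
  const parsed = Number(raw);
  if (!isFinite(parsed) || parsed <= 0) return DEFAULT_CLEANUP_TIMEOUT_MS;
  return Math.floor(parsed);
}

// Apply the launch default, then clamp to the hard cap.
// Assumption: a missing or non-positive request means "use the default".
function clampLaunchTimeoutMs(requestedMs?: number): number {
  if (requestedMs === undefined || !isFinite(requestedMs) || requestedMs <= 0) {
    return DEFAULT_LAUNCH_TIMEOUT_MS;
  }
  return Math.min(Math.floor(requestedMs), MAX_LAUNCH_TIMEOUT_MS);
}
```

Passing the environment as an argument keeps the sketch testable without touching real process state.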
|
|
215
215
|
|
|
216
216
|
## `qa.attached` — current-session smoke check
|
package/docs/RELEASE.md
CHANGED
|
@@ -35,13 +35,21 @@ npm run verify -- release
|
|
|
35
35
|
|
|
36
36
|
`npm publish` runs npm’s `prepublishOnly` script from `package.json`, which executes the same `npm run verify -- release` gate and then `npm pack --dry-run`. That concatenated gate is everything in the default `npm run verify` step (generated playbook drift, TypeScript, the unit/fake suite, generated command-reference blocks, and live upstream command-reference sampling against the targeted `agent-browser` on `PATH`) plus the packaged Pi smoke in `package-pi`. Using `npm publish --ignore-scripts` skips that contract intentionally.
|
|
37
37
|
|
|
38
|
-
`prepublishOnly` intentionally does **not** run `npm run verify -- lifecycle`, `npm run verify -- real-upstream`, or `npm run verify -- benchmark`; those are separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). Treat the bullets below as the full pre-publish contract even though only the `release` slice is automated at publish time.
|
|
38
|
+
`prepublishOnly` intentionally does **not** run `npm run verify -- lifecycle`, `npm run verify -- real-upstream`, `npm run verify -- dogfood`, or `npm run verify -- benchmark`; those are separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). Treat the bullets below as the full pre-publish contract even though only the `release` slice is automated at publish time.
|
|
39
39
|
|
|
40
|
-
|
|
40
|
+
For a deterministic real-browser wrapper smoke without model choice in the loop, run:
|
|
41
|
+
|
|
42
|
+
```bash
|
|
43
|
+
npm run verify -- dogfood
|
|
44
|
+
```
|
|
45
|
+
|
|
46
|
+
This mode uses the extension harness and the real `agent-browser` on `PATH` against public `example.com`, then verifies top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close. Use `npm run verify -- dogfood --keep-artifacts` or `--artifact-dir <path>` only while debugging, then delete retained screenshots. This smoke complements, but does not replace, human-readable interactive transcript evidence.
|
|
47
|
+
|
|
48
|
+
Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --no-extensions --no-skills -e .` from the checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Automated localhost, fake-upstream, and deterministic dogfood gates do not replace this human-readable live-site transcript evidence. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
|
|
41
49
|
|
|
42
50
|
When reviewing saved session JSONL after a failed smoke or a `qa` preset that reclassified an upstream-successful batch, expect `agent_browser` tool rows to carry `isError: true` whenever `details.resultCategory` is `failure`. For normal prose output, model-visible text should end with a `Pi tool isError: true` category line; for caller-requested `--json` output, the hook preserves parseable JSON and only patches `isError`. The extension applies that patch on the `tool_result` path so Pi’s transcript matches the wrapper contract ([`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details)). Preserve a normal Pi session directory for those checks; avoiding `--no-session` keeps this evidence intact ([`AGENTS.md`](../AGENTS.md) preferred validation workflow).
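The failure-patch rule described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the extension's hook: the `ToolResult` shape and the `patchIsError` name are hypothetical, and the exact wording of the trailing category line is assumed to be the quoted `Pi tool isError: true` text.

```typescript
// Hypothetical shapes; the real Pi `tool_result` payload is not shown in this diff.
interface ToolResult {
  content: string;
  isError?: boolean;
  details?: { resultCategory?: string };
}

// Assumed marker text, taken from the prose above.
const IS_ERROR_LINE = "Pi tool isError: true";

function patchIsError(result: ToolResult, jsonRequested: boolean): ToolResult {
  // Only failures are patched; everything else passes through untouched.
  if (result.details?.resultCategory !== "failure") return result;

  // Caller-requested JSON stays byte-for-byte parseable: flip the flag only.
  if (jsonRequested) return { ...result, isError: true };

  // Prose output: make sure the model-visible text ends with the marker line.
  const alreadyMarked = result.content.slice(-IS_ERROR_LINE.length) === IS_ERROR_LINE;
  const content = alreadyMarked ? result.content : `${result.content}\n${IS_ERROR_LINE}`;
  return { ...result, isError: true, content };
}
```

When reviewing a saved session, a row whose `details.resultCategory` is `failure` but whose `isError` is unset would indicate the patch did not run.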
|
|
43
51
|
|
|
44
|
-
The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload
|
|
52
|
+
The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload`, full relaunch with the same exact Pi 0.76 `--session-id`, managed-session continuity, persisted artifacts, and Pi failure-patch behavior. Branch-backed `session_tree` rehydration and cleanup ownership are validated by focused extension harness tests:
|
|
45
53
|
|
|
46
54
|
```bash
|
|
47
55
|
npm run verify -- lifecycle
|
|
@@ -111,13 +119,13 @@ Please gather enough evidence to support the smoke result:
|
|
|
111
119
|
Return a concise PASS/FAIL report with evidence and any tool or workflow issues you noticed. Do not create a dogfood-output report directory.
|
|
112
120
|
```
|
|
113
121
|
|
|
114
|
-
Evaluator expectations after the queued Sauce Demo fixes: the agent should independently choose efficient, safe browser operations; native add-to-cart clicks should mutate cart state without
|
|
122
|
+
Evaluator expectations after the queued Sauce Demo fixes: the agent should independently choose efficient, safe browser operations; native add-to-cart clicks should mutate cart state without the agent authoring `eval`/DOM-click fallbacks (the wrapper may fail with `details.clickDispatch` when upstream reports click success but no trusted DOM event reached the target); same-snapshot form fills may be batched safely when the agent chooses that route; the selected sort order should be verified; checkout must stop before Finish and must not place the order; if the agent attempts Finish or another likely final submit action, the wrapper should block it with `details.promptGuard.reason: "explicit-user-stop-boundary"`; screenshot and recording must use the requested paths or be explicitly reported unavailable, and close should be blocked with `details.promptGuard.reason: "requested-artifacts-missing-before-close"` until required screenshot paths are verified; `network requests` may show public-demo telemetry 401s; `console` may report offline-cache logs; `errors` should show no page errors; and the browser session plus temp artifacts should be cleaned up after evidence is recorded. A run that reaches `checkout-complete.html` or silently substitutes artifact paths is a workflow failure even if other store flow steps work.
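An evaluator reading these expectations might check them roughly as follows. This is a hedged sketch: only the two `details.promptGuard.reason` strings and the `checkout-complete.html` page name come from the text above, while the `GuardDetails` shape, the function names, and the idea of passing visited URLs and saved paths as arrays are assumptions made for illustration.

```typescript
// Hypothetical slice of `details`; only `promptGuard.reason` is named in the text.
interface GuardDetails {
  promptGuard?: { reason?: string };
}

// Reason strings quoted in the evaluator expectations.
const STOP_BOUNDARY = "explicit-user-stop-boundary";
const ARTIFACTS_MISSING = "requested-artifacts-missing-before-close";

// Translate a guard reason into a short evaluator note, or null when no known guard fired.
function explainGuard(details: GuardDetails): string | null {
  const reason = details.promptGuard?.reason;
  if (reason === STOP_BOUNDARY) return "blocked: final submit is past the user's stop boundary";
  if (reason === ARTIFACTS_MISSING) return "blocked: verify requested screenshot paths before close";
  return null;
}

// A run fails the workflow if it reached the order-complete page or saved
// an artifact to a path the prompt did not request.
function isWorkflowFailure(
  visitedUrls: string[],
  requestedPaths: string[],
  savedPaths: string[],
): boolean {
  const reachedFinish = visitedUrls.some((url) => url.indexOf("checkout-complete.html") !== -1);
  const substitutedPath = savedPaths.some((path) => requestedPaths.indexOf(path) === -1);
  return reachedFinish || substitutedPath;
}
```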
|
|
115
123
|
|
|
116
124
|
## Deterministic agent efficiency benchmark
|
|
117
125
|
|
|
118
126
|
[`scripts/agent-browser-efficiency-benchmark.mjs`](../scripts/agent-browser-efficiency-benchmark.mjs) is an accounting-only benchmark: it does not shell out to `agent-browser`, launch a browser, or read or write Pi sessions. It models representative `agent_browser` call shapes (including optional `stdin` for `batch` and top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` objects that compile to batch) and aggregates success rate, tool-call counts, UTF-8 size of model-visible strings, stale-ref failure and recovery counts, artifact success, distinct failure-category coverage, and summed elapsed-time estimates. When extending scenarios, keep them aligned with the closed `RQ-0068` “no reusable recipe layer” rationale in [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet) (benchmark ids cited there are the canonical inventory for that evidence bar).
|
|
119
127
|
|
|
120
|
-
- **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes).
|
|
128
|
+
- **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes). Optional `--sample-jsonl path/to/session.jsonl` adds a `jsonlSample` section with real UTF-8 byte totals and per-workflow/overall p95 sizes for model-visible `agent_browser` tool-result text without changing deterministic scenario metrics; comparison ignores `jsonlSample` blocks.
|
|
121
129
|
- **Default gate:** `npm run verify` checks generated playbook drift, runs `tsc --noEmit`, runs the full unit/fake suite under `test/**/*.test.ts` (including [`test/agent-browser.efficiency-benchmark.test.ts`](../test/agent-browser.efficiency-benchmark.test.ts) for scenario coverage and comparison behavior), verifies generated command-reference baseline blocks, and samples live upstream command-reference tokens. It does not spawn the standalone benchmark script’s JSON/Markdown run; that is what the opt-in slice below adds.
|
|
122
130
|
- **Opt-in slice:** `npm run verify -- benchmark` runs the benchmark script once with `--json` and then that same test module alone. It is intentionally **not** part of `npm run verify -- release`, so routine publish gates stay decoupled from benchmark churn while still allowing a focused check after editing scenarios or `CURRENT_BENCHMARK_VERSION`.
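The `jsonlSample` figures mentioned in the development bullet (UTF-8 byte totals and p95 sizes) can be illustrated with a short sketch. It is not the benchmark script's code: the function names are hypothetical, and the nearest-rank percentile is an assumed method, since this diff does not say how the script computes p95.

```typescript
// Count the UTF-8 bytes of a string without relying on TextEncoder.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < text.length) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair encodes as one 4-byte sequence
        i++;
      } else {
        bytes += 3; // lone surrogate, counted like U+FFFD
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Nearest-rank p95 (assumed method): the value at rank ceil(0.95 * n) in ascending order.
function p95(values: number[]): number {
  if (values.length === 0) return 0; // assumption: an empty sample reports 0
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// Summarize model-visible tool-result texts as total bytes and p95 size.
function summarizeSizes(texts: string[]): { totalBytes: number; p95Bytes: number } {
  const sizes = texts.map(utf8ByteLength);
  const totalBytes = sizes.reduce((sum, size) => sum + size, 0);
  return { totalBytes, p95Bytes: p95(sizes) };
}
```

Measuring bytes rather than `string.length` matters for non-ASCII output, where one UTF-16 code unit can take two or three bytes.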
|
|
123
131
|
|
|
@@ -172,7 +180,7 @@ Before publishing, validate both local-checkout modes without mixing their assum
|
|
|
172
180
|
|
|
173
181
|
For expanded-surface validation, the smoke prompt should cover native tool invocation rather than shelling out to `agent-browser`: `--version`, `--help`, `skills list`, `skills get core --full`, `open` with `sessionMode: "fresh"`, `snapshot -i`, `click`, top-level `semanticAction` (locator shorthand compiled to upstream `find` and native dropdown selection compiled to upstream `select`, optionally with `semanticAction.session` when you need the same named upstream session as a prior explicit `--session` call), `eval --stdin`, `batch` via stdin, top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` (compiled batch smoke), `screenshot <path>`, explicit `--session … open` plus `--session … close`, `network requests`, `console` / `errors`, `diff snapshot`, `stream status` plus `stream disable`, `dashboard start` plus `dashboard stop`, and `chat <message>` (credential failure is acceptable evidence of wrapper pass-through when `AI_GATEWAY_API_KEY` is intentionally unset). Clean up any opened browser session with `close`, remove temporary files, and kill the tmux session before ending validation.
|
|
174
182
|
|
|
175
|
-
This checklist assumes a real `agent-browser` on `PATH`. It complements, but does not overlap, `npm run verify -- lifecycle`: that harness swaps in a fake upstream binary and focuses on `/reload`,
|
|
183
|
+
This checklist assumes a real `agent-browser` on `PATH`. It complements, but does not overlap, `npm run verify -- lifecycle`: that harness swaps in a fake upstream binary and focuses on `/reload`, exact `--session-id` relaunch, managed-session continuity, spill-path persistence, and Pi `tool_result` failure-patch semantics (`scripts/verify-lifecycle.mjs`), not the full command matrix above.
|
|
176
184
|
|
|
177
185
|
When a smoke or dogfood run fails after `sessionMode: "fresh"` (missing binary, timeout, upstream error, or **`qa`** preset reclassification), read `details.managedSessionOutcome` before assuming which managed session the next default `sessionMode: "auto"` call will follow; the same struct can appear without the extra `Managed session outcome: …` prose line on `"auto"` failures. Field-level semantics and append ordering relative to other diagnostic tails are documented in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) and the session-mode notes in [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md).
|
|
178
186
|
|
|
@@ -184,7 +192,7 @@ Run the automated harness for deterministic configured-source lifecycle regressi
|
|
|
184
192
|
npm run verify -- lifecycle
|
|
185
193
|
```
|
|
186
194
|
|
|
187
|
-
The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs
|
|
195
|
+
The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs `pi` in `tmux` with default model **`zai/glm-5.1`** and a deterministic `--session-id`, puts a deterministic fake `agent-browser` first on `PATH`, drives `/reload`, closes Pi, and relaunches with the same exact session id instead of typing `/resume`. It also asserts the JSONL session header id, same-page managed-session continuity, persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification. Per-step tmux waits default to **180000 ms** (three minutes) in [`scripts/verify-lifecycle.mjs`](../scripts/verify-lifecycle.mjs) (`DEFAULT_TIMEOUT_MS`); override with `--timeout-ms <ms>` when slower models or cold starts need more headroom. Override the model when needed:
|
|
188
196
|
|
|
189
197
|
```bash
|
|
190
198
|
npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal
|
|
@@ -196,7 +204,7 @@ Combine flags in one invocation when both apply (order after `lifecycle` is flex
|
|
|
196
204
|
npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --timeout-ms 600000
|
|
197
205
|
```
|
|
198
206
|
|
|
199
|
-
|
|
207
|
+
On failure it retains transcripts/session artifacts; on success it performs best-effort cleanup. It does not replace occasional real-browser manual smoke testing.
|
|
200
208
|
|
|
201
209
|
**Lifecycle triage:** a timeout on sentinel `v2` after `/reload` often means Pi rejected reload while the TUI still showed `Working…` (`Wait for the current response to finish before reloading`), even when the session JSONL already has a final assistant message. Re-run with `--keep-artifacts --verbose`, inspect the retained pane capture, and confirm the configured model follows tool prompts reliably. Slower models may need a higher `--timeout-ms` than the **180000 ms** default.
|
|
202
210
|
|
|
@@ -217,7 +225,7 @@ Manual validation remains useful for release confidence and installed-package ch
|
|
|
217
225
|
|
|
218
226
|
1. Configure exactly one active source for this extension in Pi settings: this checkout path before publishing, or the installed package after publishing.
|
|
219
227
|
2. Launch plain `pi` so extension discovery is active.
|
|
220
|
-
3. Validate managed-session continuity with `/reload` and a full restart
|
|
228
|
+
3. Validate managed-session continuity with `/reload` and a full restart plus exact `--session-id` relaunch or `/resume`.
|
|
221
229
|
4. Re-check local extension-side docs (`README.md`, `docs/COMMAND_REFERENCE.md`, `docs/TOOL_CONTRACT.md`, including the [`semanticAction`](TOOL_CONTRACT.md#semanticaction) rules when that shorthand or upstream `find` / `select` behavior changes) and regenerated prompt fragments from `extensions/agent-browser/lib/playbook.ts` via `npm run docs -- playbook check` or `npm run docs`. When the upstream `agent-browser` version or help surface changed, run `npm run verify -- command-reference`.
|
|
222
230
|
|
|
223
231
|
### Real upstream contract validation
|
|
@@ -264,8 +272,8 @@ Recommended configured-source lifecycle follow-up:
|
|
|
264
272
|
|
|
265
273
|
1. Open a page with the implicit managed session and confirm the title.
|
|
266
274
|
2. Run `/reload`, then ask for `snapshot -i` and confirm the same page is still active.
|
|
267
|
-
3. Exit `pi`, relaunch it against the same session
|
|
268
|
-
4. Open a large page that compacts its snapshot output and confirm `details.fullOutputPath` still exists after the restart/resume flow.
|
|
275
|
+
3. Exit `pi`, relaunch it against the same exact session id/path or use `/resume`, then ask for `snapshot -i` again and confirm the same page is still active.
|
|
276
|
+
4. Open a large page that compacts its snapshot output and confirm `details.fullOutputPath` still exists after the restart/resume/exact-session flow.
|
|
269
277
|
5. Trigger an oversized non-snapshot output (for example a deliberately large `eval --stdin` result) and confirm the tool prints the actual spill file path directly in content instead of only referencing a details key.
|
|
270
278
|
6. Validate at least one direct file-download flow with `download <selector> <path>`.
|
|
271
279
|
7. Validate at least one asynchronous export flow with `click` followed by `wait --download <path>`, confirming the wait result reports `savedFilePath`/`savedFile` and checking `details.artifacts[].exists` before relying on the requested path being present on disk.
|
|
@@ -286,7 +294,7 @@ Then run the real-browser smoke prompt:
|
|
|
286
294
|
Use the agent_browser tool to open https://react.dev and then take an interactive snapshot.
|
|
287
295
|
```
|
|
288
296
|
|
|
289
|
-
Only use plain `pi` for installed-package validation after temporarily disabling or removing the checkout source or any other active source for this extension from Pi settings. Then confirm `pi` exposes the native `agent_browser` tool, that a basic `open` + `snapshot -i` flow works, and that `/reload` plus restart
|
|
297
|
+
Only use plain `pi` for installed-package validation after temporarily disabling or removing the checkout source or any other active source for this extension from Pi settings. Then confirm `pi` exposes the native `agent_browser` tool, that a basic `open` + `snapshot -i` flow works, and that `/reload` plus restart with exact `--session-id` relaunch or `/resume` keep following the same implicit managed browser session.
|
|
290
298
|
|
|
291
299
|
## Release notes checklist
|
|
292
300
|
|
|
@@ -302,7 +310,7 @@ Before publishing:
|
|
|
302
310
|
- confirm both local-checkout modes still work for pre-release validation: isolated `pi --no-extensions -e .` smoke testing for general checkout loading (add `--no-skills` for extension-focused bounded smokes) and configured-source lifecycle validation
|
|
303
311
|
- complete interactive `tmux` live-site extension smoke with `pi --no-extensions --no-skills -e .` and the native `agent_browser` tool (at least one simple static site and one real documentation/product site; include `qa` or `job`/`batch` when those surfaces changed; use the [public Grafana stress checklist](#public-grafana-stress-checklist) when dashboard/diagnostic/artifact behavior changed; close sessions and remove screenshots/temp artifacts; record evidence). Run separate skill-enabled dogfood only when validating skill routing/report-generation behavior—see [Pre-release checks](#pre-release-checks); automated gates are not a substitute
|
|
304
312
|
- rerun `npm run verify -- release`
|
|
305
|
-
- run `npm run verify -- lifecycle` for configured-source `/reload
|
|
313
|
+
- run `npm run verify -- lifecycle` for configured-source `/reload`, exact `--session-id` relaunch, managed-session continuity, persisted-spill, and Pi failure-patch regression coverage (required before publish; see [Pre-release checks](#pre-release-checks))
|
|
306
314
|
- confirm [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md) still maps every current baseline inventory section to docs, runtime handling, tests, and validation status
|
|
307
|
-
- manually exercise real-browser `/reload` and full restart
|
|
315
|
+
- manually exercise real-browser `/reload` and full restart plus exact `--session-id` relaunch or `/resume` continuity when release risk warrants browser-level confidence beyond the fake upstream harness
|
|
308
316
|
- publish only after the tarball contents and isolated packaged-extension smoke check match expectations
|
package/docs/REQUIREMENTS.md
CHANGED
|
@@ -53,7 +53,7 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
|
|
|
53
53
|
- Keep the current local-checkout path documented as the practical pre-release and development flow.
|
|
54
54
|
- Most users will install this extension globally rather than as a project-local extension.
|
|
55
55
|
- Local checkout smoke testing should use explicit CLI loading such as `pi --no-extensions -e .` or `pi --no-extensions -e /absolute/path/to/pi-agent-browser-native`; Pi settings are bypassed in this mode and code edits require a process restart for validation.
|
|
56
|
-
- Local checkout hot-reload and
|
|
56
|
+
- Local checkout hot-reload and exact-session relaunch validation should use configured-source lifecycle mode: exactly one active checkout/package source in Pi settings, launched with plain `pi` (or the lifecycle harness' exact `--session-id` relaunch path), so `/reload` and relaunch events exercise discovered/configured resources. Focused extension harness tests validate Pi `session_tree` branch rehydration and cleanup ownership.
|
|
57
57
|
- Do **not** rely on repo-local `.pi/extensions/` auto-discovery for this package, because it conflicts with the global installed-package path.
|
|
58
58
|
|
|
59
59
|
### Native-tool preference
|
|
@@ -85,10 +85,10 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
|
|
|
85
85
|
- The primary confidence path is a real `pi` session driven in `tmux`.
|
|
86
86
|
- For quick local checkout smoke validation, launch `pi --no-extensions -e .` from the repository root so only the checkout copy loads; do not rely on Pi settings or `/reload` semantics in this isolated mode.
|
|
87
87
|
- For hot-reload validation, configure exactly one active source for this extension in Pi settings and launch plain `pi`; validate `/reload` there because it exercises auto-discovered/configured resources.
|
|
88
|
-
- Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart
|
|
89
|
-
- Validate a full `pi` restart with `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths.
|
|
88
|
+
- Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart plus exact `--session-id` relaunch, and asserts managed-session continuity, persisted artifact survival, and real Pi `tool_result` failure-patch semantics. It is its own `npm run verify` mode rather than part of the default `npm run verify` sequence, but operators still run it before every publish. The harness defaults Pi to model `zai/glm-5.1` (`scripts/verify-lifecycle.mjs`); pass `--model <id>` after `lifecycle` when a different model is required. Keep `docs/RELEASE.md` accurate about the harness behavior, cleanup, transcript retention, and limitations.
|
|
89
|
+
- Validate a full `pi` restart with exact `--session-id` relaunch or `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths. Validate branch-backed state changes with the focused `session_tree` harness tests.
|
|
90
90
|
- Prefer full `pi` restart over `/reload` when validating extension changes beyond a quick reload smoke check.
|
|
91
|
-
- Use `/resume` when needed after restart.
|
|
91
|
+
- Use `/resume` or an explicit session id/path when needed after restart.
|
|
92
92
|
- Keep testing broader than a single smoke site like `example.com`.
|
|
93
93
|
- Bounded release smokes that validate this extension should disable auto-loaded skills with `--no-skills`; run skill-enabled dogfood separately only when validating external skill routing or report-generation behavior.
|
|
94
94
|
- Maintain a concrete release/package verification workflow in `docs/RELEASE.md` and matching repository scripts.
|
|
@@ -111,7 +111,7 @@ The design should comfortably support workflows such as:
|
|
|
111
111
|
- Package-manifest behavior matters more than repo-local development wiring.
|
|
112
112
|
- The extension should use official `pi` hooks and package resources where possible.
|
|
113
113
|
- The wrapper should stay thin, with upstream `agent-browser` remaining the source of truth for command semantics.
|
|
114
|
-
- Successful and failed tool outcomes should surface bounded machine-readable fields on Pi-facing `details` (`resultCategory`, `successCategory`, `failureCategory`, optional structured `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`, optional `artifactVerification` with the same shape on successful `batchSteps[]` rows) so agents can branch without parsing prose; stateful commands (`auth`, `cookies`, `storage`, `dialog`, `frame`, `state`) plus other structured diagnostics (for example `network`, `diff`, `trace`, `stream`, `dashboard`, `chat`) and `batch` should redact secret-bearing payloads in model-facing `details.data`, including the compact per-step `batch` roll-up on the parent result (full per-step payloads live on `batchSteps[]`). The contract lives in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), enums and classifier precedence live in `extensions/agent-browser/lib/results/categories.ts` and `contracts.ts` (also re-exported from `shared.ts`), and presentation-time summaries, redaction, network request follow-ups, and artifact verification rollups are assembled in `extensions/agent-browser/lib/results/presentation.ts` (`buildPageChangeSummary`, `
|
|
114
|
+
- Successful and failed tool outcomes should surface bounded machine-readable fields on Pi-facing `details` (`resultCategory`, `successCategory`, `failureCategory`, optional structured `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`, optional `artifactVerification` with the same shape on successful `batchSteps[]` rows) so agents can branch without parsing prose; stateful commands (`auth`, `cookies`, `storage`, `dialog`, `frame`, `state`) plus other structured diagnostics (for example `network`, `diff`, `trace`, `stream`, `dashboard`, `chat`) and `batch` should redact secret-bearing payloads in model-facing `details.data`, including the compact per-step `batch` roll-up on the parent result (full per-step payloads live on `batchSteps[]`). The contract lives in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), enums and classifier precedence live in `extensions/agent-browser/lib/results/categories.ts` and `contracts.ts` (also re-exported from `shared.ts`), and presentation-time summaries, redaction, network request follow-ups, and artifact verification rollups are assembled in `extensions/agent-browser/lib/results/presentation.ts` (`buildPageChangeSummary`, command taxonomy predicates from `command-taxonomy.ts`, `redactPresentationData`, `buildArtifactVerificationSummary`, `buildBatchPresentation`).
|
|
115
115
|
- User-facing docs belong in `README.md` and the canonical published files under `docs/`.
|
|
116
116
|
- Agent workflow and deeper testing procedures can stay in `AGENTS.md`, but published docs must not depend on that file being present.
|
|
117
117
|
- When upstream `agent-browser` changes, refresh the local command reference, prompt guidance, and other extension-side docs so agents still have a repo-readable equivalent of the blocked direct-binary help path.
|
package/docs/SUPPORT_MATRIX.md
CHANGED
|
@@ -27,8 +27,8 @@ When upstream ships a new `agent-browser` or the inventory changes:
|
|
|
27
27
|
|
|
28
28
|
- Target upstream: `agent-browser 0.27.0` (must match `CAPABILITY_BASELINE.targetVersion` in [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs)).
|
|
29
29
|
- Source of truth: `CAPABILITY_BASELINE.inventorySections` in the same file (stable `id` keys: `skills`, `core-commands`, `state-tabs-frames-dialogs`, `network-storage-artifacts-diagnostics`, `batch-auth-setup-ai`, `options-and-env`).
|
|
30
|
-
- Status: supported for the current wrapper contract.
|
|
31
|
-
- High-priority support gaps:
|
|
30
|
+
- Status: supported for the current wrapper contract after the 2026-05-26 all-command audit.
|
|
31
|
+
- High-priority support gaps: 2026-05-26 audit found sessionless local commands and command-scoped value flags needed sharper wrapper handling; runtime/tests/docs now cover those paths. Remaining upstream-owned caveat: `agent-browser 0.27.0` help mentions `wait <selector> --state hidden`, but source parsing does not implement that distinct wait mode, so wrapper docs steer agents to `wait --fn` predicates.
|
|
32
32
|
- Post-`v0.2.29` review state: commits `eb55320` through `86abbfb` add browser guidance/smoke coverage plus `RQ-0086` click-probe reduction, `RQ-0087` same-snapshot form fill batching, `RQ-0088` current-ref fallback on locator misses, `RQ-0089` direct-upstream click mutation investigation, and `RQ-0090` stop-boundary/artifact-path guidance. Verification gates below were rerun on 2026-05-18 after those tasks landed. Constrained `job` (`RQ-0064`), the lightweight `qa` preset (`RQ-0065`), the experimental `sourceLookup` helper (`RQ-0066`), and the experimental `networkSourceLookup` helper (`RQ-0067`) are implemented; see [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#job), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sourcelookup), and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#networksourcelookup). Reusable browser recipes (`RQ-0068`) are intentionally not adopted as a runtime surface; see [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet).
|
|
33
33
|
|
|
34
34
|
## Verification evidence
|
|
@@ -37,23 +37,25 @@ Re-run the gates below before each release; this table records what the closure
|
|
|
37
37
|
|
|
38
38
|
| Gate | Evidence | Status |
|
|
39
39
|
| --- | --- | --- |
|
|
40
|
-
| Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-
|
|
41
|
-
| Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-
|
|
42
|
-
| Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-
|
|
43
|
-
|
|
|
44
|
-
|
|
|
45
|
-
|
|
|
40
|
+
| Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-27 (`npm run verify`, `agent-browser 0.27.0` on `PATH`). |
|
|
41
|
+
| Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-27 (`npm run verify -- real-upstream`, `agent-browser 0.27.0` on `PATH`). |
|
|
42
|
+
| Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-27 (`npm run verify -- package-pi`). |
|
|
43
|
+
| Deterministic dogfood smoke | `npm run verify -- dogfood` (`scripts/verify-agent-browser-dogfood.ts`) drives the native wrapper against public `example.com` through top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close with the real `agent-browser` on `PATH`. | Pass on 2026-05-24 (`npm run verify -- dogfood --artifact-dir /tmp/pi-agent-browser-release-dogfood --json`; artifacts removed). |
|
|
44
|
+
| Efficiency benchmark | `npm run verify -- benchmark` runs deterministic browser workflow accounting plus focused benchmark tests, including JSONL sampling fixtures and job/qa/sourceLookup/networkSourceLookup/Electron scenario coverage. | Pass on 2026-05-24 (`npm run verify -- benchmark`). |
|
|
45
|
+
| `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, dogfood, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-24 (`npm run verify -- release`). `prepublishOnly` still needs a fresh run during actual publish. |
|
|
46
|
+
| Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`, closes and relaunches Pi with the same exact `--session-id`, checks the JSONL session header id, session continuity, slash-command sentinel tokens (`v1` then `v2` after rewriting the packaged extension to simulate pickup), persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification with a fake upstream on `PATH`. Default Pi model is `zai/glm-5.1`; default per-step wait is **180000 ms** (`DEFAULT_TIMEOUT_MS`); override model with `--model <id>` and waits with `--timeout-ms <ms>`. Passthrough flags in [`scripts/project.mjs`](../scripts/project.mjs): `--keep-artifacts`, `--model`, `--verbose`, and `--timeout-ms` plus a value (for example `npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --keep-artifacts --verbose --timeout-ms 600000`). | Pass on 2026-05-27 (`npm run verify -- lifecycle --timeout-ms 300000`). Treat any future unexplained red lifecycle gate as a release blocker. |
|
|
47
|
+
| Quick isolated Pi smoke | `pi --no-extensions --no-skills -e . --tools agent_browser` from repo root; native `agent_browser` only. | Pass on 2026-05-27 for an interactive tmux checkout smoke (`pi --no-extensions --no-skills -e . --session-dir <temp> --model zai/glm-5.1`): opened `https://example.com` with `sessionMode: "fresh"`, ran `snapshot -i`, verified `Example Domain`, closed the browser session, reported PASS, and removed the temp session dir/tmux session. Broader historical coverage also includes version/help/skills, open/snapshot/click, eval stdin, batch stdin, screenshot, explicit session, `sessionMode: "fresh"`, network requests, console/errors, diff snapshot, stream status/disable, dashboard start/stop, and chat credential-failure pass-through during RQ-0055. |
|
|
46
48
|
|
|
47
49
|
## Baseline checklist by inventory section
|
|
48
50
|
|
|
49
51
|
| Baseline section | Baseline items | Documentation | Runtime handling | Test coverage | Validation status |
|
|
50
52
|
| --- | --- | --- | --- | --- | --- |
|
|
51
|
-
| Built-in skills |
|
|
52
|
-
| Core page, element, navigation, and extraction commands |
|
|
53
|
-
| Sessions, state, tabs, frames, dialogs, and windows |
|
|
54
|
-
| Network, storage, artifacts, diagnostics, and performance |
|
|
55
|
-
| Batch, auth, confirmations, setup, dashboard, and AI commands |
|
|
56
|
-
| Global flags, config, providers, policy, and environment |
|
|
53
|
+
| Built-in skills | 13 canonical tokens from baseline section `skills`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#built-in-skills), generated baseline block, README proof section, release docs. | `needsManagedSession` keeps read-only skills inspection sessionless while preserving thin upstream passthrough. | Runtime and extension-validation skills/provider matrix; real-upstream inspection/skills group. | Supported. |
|
|
54
|
+
| Core page, element, navigation, and extraction commands | 74 canonical tokens from baseline section `core-commands`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#core-page-and-element-commands), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md), README quick start. | Thin passthrough with wrapper-owned JSON/session planning, ref guidance, artifact verification, page-change summaries, click-dispatch diagnostics, no-op scroll/focus diagnostics, shorthand compilers, and redaction. | Real-upstream core matrix plus fake core matrix for passthrough, ordering, diagnostics, and compiler validation. | Supported. Upstream semantics remain upstream-owned. |
|
|
55
|
+
| Sessions, state, tabs, frames, dialogs, and windows | 20 canonical tokens from baseline section `state-tabs-frames-dialogs`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#session-state-frames-dialogs-windows-and-inspection-commands), stateful workflow notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Stateful summaries/redaction, state artifact handling, sessionless local command planning, managed-session restore, tab target pinning, and close alias cleanup. | Extension-validation stateful matrix, runtime session/resume tests, presentation redaction tests, lifecycle harness. | Supported. External profile/auth state remains operator-owned. |
|
|
56
|
+
| Network, storage, artifacts, diagnostics, and performance | 42 canonical tokens from baseline section `network-storage-artifacts-diagnostics`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#page-state-finding-mouse-settings-network-and-storage), diagnostic sections, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details). | Thin passthrough plus compact diagnostics, artifact metadata, missing-ffmpeg warnings, sensitive-data redaction, timeout bounds, and cleanup-pair guidance. | Fake non-core matrix and safe real-upstream coverage for network/HAR, diff, trace/profiler, console/errors/highlight, stream, vitals, and React missing-renderer. | Supported. Environment-sensitive operations need suitable local/browser state. |
|
|
57
|
+
| Batch, auth, confirmations, setup, dashboard, devices, and AI commands | 24 canonical tokens from baseline section `batch-auth-setup-ai`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-devices-and-setup). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#batch-auth-confirmations-sessions-chat-dashboard-devices-and-setup), README security notes, release docs. | Native-tool batch stdin, generated `job`/`qa`/lookup batch plans, auth/confirmation redaction, sessionless local auth/setup/dashboard/doctor planning, timeout/cleanup guidance. | Unit/fake batch/auth/confirmation/dashboard/chat/doctor tests; extension-validation for structured input modes; efficiency benchmark scenarios. | Supported. Interactive side-effecting setup/auth/chat remains upstream-owned. |
|
|
58
|
+
| Global flags, config, providers, policy, and environment | 117 canonical tokens from baseline section `options-and-env`; see [`scripts/agent-browser-capability-baseline.mjs`](../scripts/agent-browser-capability-baseline.mjs) and generated [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment). | [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#important-global-flags-config-and-environment), README provider/setup notes, [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sessionmode), architecture/runtime docs. | Runtime handles command discovery, value-flag prevalidation, launch-scoped flags, redacted echoes, fresh-session recovery hints, explicit sessions, provider/device launch-scoping, curated env forwarding, and subprocess completion. | Runtime tests for flags/planning/redaction/session behavior; process tests for env and stdio-linger completion; fake provider/specialized-skill matrix; package doctor. | Supported. Provider clouds, iOS/Appium, proxies, profiles, and credentials require external setup. |
|
|
57
59
|
|
|
58
60
|
## Follow-up decision after closure
|
|
59
61
|
|
|
@@ -71,15 +73,17 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
|
|
|
71
73
|
|
|
72
74
|
`RQ-0091` keeps advanced release smoke tests focused on extension behavior instead of external skill routing: the Sauce Demo smoke in [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt) now launches with `--no-skills`, restricts tools to `agent_browser`, and uses bounded release-smoke wording rather than dogfood/exploratory QA language. Runtime guidance remains the concise stop-boundary and exact-artifact-path contract from `extensions/agent-browser/lib/playbook.ts`; no site-specific automation or recipe layer was added. Evidence from the failed high/low local-shop runs showed skill/report drift (`dogfood-output` substitution) and reasoning complexity, not a wrapper command defect, so skill-enabled dogfood remains a separate validation mode. Human workflow: [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt), [`AGENTS.md`](../AGENTS.md#preferred-testing-workflow), and [`REQUIREMENTS.md`](REQUIREMENTS.md#testing-guidance).
|
|
73
75
|
|
|
74
|
-
`RQ-
|
|
76
|
+
`RQ-0090` now backs stop-boundary and exact-artifact-path guidance with prompt-derived preflight guards instead of relying only on model instruction-following. `buildPromptPolicy` in `extensions/agent-browser/lib/prompt-policy.ts` extracts explicit stop-before-order/submit wording and exact requested artifact paths from the latest user message. `extensions/agent-browser/lib/orchestration/browser-run/browser-action-model.ts` normalizes click-like actions plus `press`/`key` Enter/Return submits; `prompt-guards.ts` blocks likely final targets on those covered shapes (including `@ref` role/name metadata from `details.refSnapshot`, selectors such as `finish`, and matching batch click/find steps) with `details.promptGuard.reason: "explicit-user-stop-boundary"` and documents excluded flows (`eval`, generic fill/type, `keyboard type`/`keyboard inserttext`, non-Enter keypresses). It also blocks browser `close` / `quit` / `exit` with `reason: "requested-artifacts-missing-before-close"` until required prompt screenshot paths are verified in `details.artifactManifest` (optional recording paths are required only when recording appears available). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`promptGuard`); human workflow: README stop-boundary/artifact notes and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md); fake coverage: `agentBrowserExtension blocks likely final order clicks when the user set a stop boundary`, `agentBrowserExtension blocks close until required prompt screenshot artifacts are saved`, and `buildPromptPolicy detects stop boundaries and requested artifact paths`.
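A minimal sketch of the stop-boundary preflight described in the `RQ-0090` entry above, assuming the guard sees the planned argv plus a label for the resolved target. The names `isCoveredShape`, `checkStopBoundary`, `CLICK_LIKE`, and `LIKELY_FINAL` are hypothetical, and the word list is illustrative only; the actual logic in `prompt-guards.ts` is not shown in this diff.

```typescript
// Hypothetical sketch; not the code in prompt-guards.ts.
type GuardDecision =
  | { blocked: false }
  | { blocked: true; reason: "explicit-user-stop-boundary" };

// Assumption: only `click` is treated as click-like here.
const CLICK_LIKE = new Set(["click"]);
const SUBMIT_KEYS = new Set(["enter", "return"]);
// Illustrative wording only; the real target heuristics are not documented here.
const LIKELY_FINAL = /\b(finish|place order|submit order)\b/i;

// Covered shapes: click-like actions plus press/key Enter or Return.
// eval, fill/type, and non-Enter keypresses fall through as excluded flows.
function isCoveredShape(argv: string[]): boolean {
  const [command, ...rest] = argv;
  if (command === undefined) return false;
  if (CLICK_LIKE.has(command)) return true;
  if (command === "press" || command === "key") {
    return rest.some((arg) => SUBMIT_KEYS.has(arg.toLowerCase()));
  }
  return false;
}

function checkStopBoundary(
  argv: string[],
  targetLabel: string,
  stopBoundaryActive: boolean,
): GuardDecision {
  if (stopBoundaryActive && isCoveredShape(argv) && LIKELY_FINAL.test(targetLabel)) {
    return { blocked: true, reason: "explicit-user-stop-boundary" };
  }
  return { blocked: false };
}
```

In this sketch, `checkStopBoundary(["click", "@e7"], "button Finish", true)` is blocked, while a `fill` against the same label passes through because generic fills are an excluded flow.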
|
|
75
77
|
|
|
76
|
-
`RQ-
|
|
78
|
+
`RQ-0097` keeps upstream subprocess completion reliable when detached descendants inherit the child’s stdio handles: `runAgentBrowserProcess` in `extensions/agent-browser/lib/process.ts` uses `watchSpawnedChildCompletion` to observe both Node `exit` and `close`, leaves piped stdio intact during the short post-`exit` grace (`EXIT_STDIO_GRACE_MS`, currently **100 ms**) so normal `close` can still win, destroys those streams only if the fallback resolves, and resolves with exit-code precedence `close` → wrapper timeout (**124**) → post-`exit` fallback for the direct child → spawn failure (**127**) when `close` is still delayed so the Pi tool cannot hang after `agent-browser` has already exited. Human context: [`ARCHITECTURE.md`](ARCHITECTURE.md#direct-subprocess-execution) (subprocess bullet) and [`AGENTS.md`](../AGENTS.md) (**Runtime planning** → **Upstream subprocess completion**); fake coverage: `runAgentBrowserProcess resolves after exit when descendants keep stdio handles open` asserts the post-exit fallback returns near the 100 ms grace window instead of the process timeout, and `runAgentBrowserProcess returns timeout exit code when descendants keep stdio handles open` in [`test/agent-browser.process.test.ts`](../test/agent-browser.process.test.ts).
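The exit-code precedence in the entry above can be sketched as a small pure function. This is an illustration under assumptions, not the code in `process.ts`: the `CompletionSignals` shape and the `resolveExitCode` name are hypothetical, and only the ordering and the **124** / **127** codes come from the text.

```typescript
// Hypothetical sketch of the documented precedence; not the code in process.ts.
interface CompletionSignals {
  closeCode?: number; // code delivered with Node's `close` event, if it fired
  timedOut: boolean; // the wrapper timeout elapsed
  exitCode?: number; // code from the direct child's `exit` event (post-exit fallback)
  spawnFailed: boolean; // the child could not be spawned
}

const TIMEOUT_EXIT_CODE = 124;
const SPAWN_FAILURE_EXIT_CODE = 127;

// Returns the exit code to report, or undefined while nothing has resolved yet.
function resolveExitCode(signals: CompletionSignals): number | undefined {
  if (signals.closeCode !== undefined) return signals.closeCode; // 1. normal close wins
  if (signals.timedOut) return TIMEOUT_EXIT_CODE; // 2. wrapper timeout
  if (signals.exitCode !== undefined) return signals.exitCode; // 3. post-exit fallback
  if (signals.spawnFailed) return SPAWN_FAILURE_EXIT_CODE; // 4. spawn failure
  return undefined;
}
```

Keeping the ordering in one function makes the "normal `close` can still win" rule easy to assert: a `close` code that arrives inside the grace window takes priority over every fallback.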
|
|
79
|
+
|
|
80
|
+
`RQ-0096` ships first-class Electron desktop-app support without adding a generic recipe runtime: top-level `electron` covers wrapper-owned `list`, isolated `launch` with snapshot/tabs/connect handoff, `status`, `cleanup`, and compact current-session or launch-scoped `probe`; `qa.attached` extends the existing QA preset for attached Electron/CDP sessions without introducing `electron.qa`. `launch.handoff` still defaults to `"snapshot"`, while `handoff: "tabs"` is documented as the safer diagnostic starting point when refs/content capture is not needed yet. Host install discovery (`discoverElectronApps`) is macOS/Linux-only today: on Windows `electron.list` reports `platform: "unsupported"` with an empty catalog and name/bundle targets cannot resolve from scans—use `executablePath` (or a host path to the Electron binary) for Windows launch targeting. Discovery adds non-blocking likely-sensitive app annotations plus visible isolated-profile/auth-state warnings; launch output and `details.electron.profileIsolation` state that wrapper launches do not reuse existing signed-in app profiles or attach to already-running authenticated apps, and point agents to the host debug-port launch plus raw `connect` path when signed-in local app state is the goal; launch timeout failures include PID/profile/DevToolsActivePort/timing diagnostics; status/probe add launch/session identifiers, liveness, mismatch/reattach next actions, and dead-launch context for `about:blank`; post-mutation Electron death is upgraded to `tab-drift` with `details.electronPostCommandHealth`; Electron fills can add `details.fillVerification`; Electron `@e…` mutations can add same-URL ref freshness guidance; broad Electron `get text` selectors add scope warnings; cleanup ownership is bounded to wrapper-created launch records and temp profiles; externally launched debug ports stay on the manual `args: ["connect", "<port-or-url>"]` path and remain host-owned. 
Runtime-owned off-branch launch records remain visible to `electron.status { launchId }`, `electron.status { all: true }`, `electron.probe { launchId }`, and `electron.cleanup`; default current-session `electron.probe` stays scoped to the active managed session, and no-arg status/cleanup reports ambiguity when multiple active branch/off-branch records are still owned. Explicit cleanup is serialized with managed-session work, records managed-session close success independently from partial process/profile cleanup, clears live/restore managed-session state for the closed wrapper session, updates branch-visible Electron state only with selected cleanup records instead of unrelated off-branch lookup records, and rotates the next default auto browser call away from that closed name. `/reload` preserves the current branch-visible active Electron launch and its isolated temp `userDataDir` for continuity, cleans off-branch owned Electron launches before clearing process-local ownership, and durably protects profile dirs from generic temp cleanup, quit cleanup, process-exit cleanup, and stale temp-root pruning after restart when partial Electron cleanup deliberately skips or fails `user-data-dir` removal. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#electron) plus [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) for `qa.attached`; human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps) and README common calls; implementation: `extensions/agent-browser/lib/input-modes/electron.ts`, `extensions/agent-browser/lib/orchestration/electron-host/`, `extensions/agent-browser/lib/orchestration/browser-run/`, `extensions/agent-browser/lib/electron/`, and dispatch/state wiring in `extensions/agent-browser/index.ts`; deterministic efficiency evidence: `electron-lifecycle` and `electron-probe` in `scripts/agent-browser-efficiency-benchmark.mjs`; fake coverage includes Electron schema/probe/mismatch/post-command-health/fill-verification/broad-text/discovery-sensitivity and packaged-sourceLookup cases in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus off-branch Electron status/probe/cleanup, targeted cleanup without unrelated branch promotion, explicit cleanup serialization, current and restored cleanup managed-session retirement, active-Electron reload preservation, off-branch Electron reload cleanup, durable partial off-branch reload/quit profile preservation, protected temp-root process-exit and stale-prune cleanup, and partial-cleanup managed-session untracking in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts). This plan is the `RQ-0068` revisit evidence for Electron specifically: [`docs/plans/electron-extension-2026-05-20.md`](plans/electron-extension-2026-05-20.md) documents repeated failure-prone discover/launch/attach/cleanup and multi-call state-probe sequences, plus bounded owner/versioning/test/docs artifacts.
|
|
77
81
|
|
|
78
82
|
`RQ-0097` completes manual CDP attach recovery without making manually launched apps wrapper-owned: successful raw `connect` results append the session-scoped safe tab-list action `list-connected-session-tabs`; `snapshot -i` failures whose upstream error says `No active page` append the safe tab-list action `list-tabs-after-no-active-page` when a session is known. Agents then choose a stable `tab t<N>` target and run `snapshot -i` explicitly; the wrapper does not emit raw-connect or no-active-page snapshot retry ids without a wrapper-observed safe tab id. The runtime source of truth for these recovery ids is `AGENT_BROWSER_RECOVERY_NEXT_ACTION_IDS` in `extensions/agent-browser/lib/results/recovery-actions.ts` (re-exported from `shared.ts`). The guidance keeps manual signed-in desktop apps and explicit artifacts host-owned while `close` remains a browser/CDP-session close and `electron.cleanup` remains limited to wrapper-created `electron.launch` records. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`ELECTRON.md`](ELECTRON.md#manual-host-launch-pattern) and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#electron-desktop-apps); fake coverage: raw connect and no-active snapshot assertions in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus central next-action helper coverage in [`test/agent-browser.results.test.ts`](../test/agent-browser.results.test.ts).
|
|
79
83
|
|
|
80
84
|
`RQ-0068` remains closed with a no-adopt decision for a reusable named browser recipe runtime. The Electron evidence above justified a narrow typed shorthand and compact probe, not an open-ended recipe layer; future reusable recipes still require concrete repeated workflow evidence and a defined owner/versioning/test plan.
|
|
81
85
|
|
|
82
|
-
`RQ-0098` completes the docs/playbook groundwork for desktop readiness and wait orchestration without adding a runtime primitive or reusable recipe layer. The accepted ladder is: prefer condition waits (`wait --text`, `wait --url`, `wait --fn`, `wait --load <state>`, `wait --download`) when a real condition exists; after raw manual CDP `connect`, inspect `tab list`, select a stable `tab t<N>` surface, then run a condition wait or `snapshot -i`; after wrapper-owned `electron.launch`, use `electron.probe` / `electron.status` when launch health or target mismatch matters; use `qa.attached` for current-session text/selector diagnostics; keep fixed waits as a last resort below the wrapper IPC budget; and treat fixed-wait payloads such as `"waited":"timeout"` as elapsed time rather than completion evidence. Manual signed-in attach docs now also restate that `connect` readiness is not immediate readiness, `close` only
|
|
86
|
+
`RQ-0098` completes the docs/playbook groundwork for desktop readiness and wait orchestration without adding a runtime primitive or reusable recipe layer. The accepted ladder is: prefer condition waits (`wait --text`, `wait --url`, `wait --fn`, `wait --load <state>`, `wait --download`) when a real condition exists; after raw manual CDP `connect`, inspect `tab list`, select a stable `tab t<N>` surface, then run a condition wait or `snapshot -i`; after wrapper-owned `electron.launch`, use `electron.probe` / `electron.status` when launch health or target mismatch matters; use `qa.attached` for current-session text/selector diagnostics; keep fixed waits as a last resort below the wrapper IPC budget; and treat fixed-wait payloads such as `"waited":"timeout"` as elapsed time rather than completion evidence. Manual signed-in attach docs now also restate that `connect` readiness is not immediate readiness, close commands (`close`, `quit`, or `exit`) only close the browser/CDP session, `electron.cleanup` remains wrapper-owned, and manually launched apps plus explicit artifacts stay host-owned. Human workflow: [`ELECTRON.md`](ELECTRON.md#readiness-and-waits), [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#wait-for-page-readiness-or-downloads), README Electron section, and generated playbook text from `extensions/agent-browser/lib/playbook.ts`. Revisit a first-class host-idle primitive only with repeated desktop smoke evidence that condition waits, `qa.attached`, `electron.probe`, snapshots, and screenshots cannot cover the workflow. Verification: `npm run docs` keeps generated playbook fragments aligned; no runtime `details.nextActions` are part of this RQ.
|
|
83
87
|
|
|
84
88
|
`RQ-0100` makes desktop tab/surface drift recovery machine-readable without adding routine tab-list probes for normal clicks. When existing wrapper state already identifies a target tab, about:blank and tab-drift paths append `list-tabs-for-about-blank-recovery` or `list-tabs-for-tab-drift-recovery`, then `select-intended-tab-after-drift` and `snapshot-after-tab-recovery` when the stable `t<N>` id is known. The implementation reuses `priorSessionTabTarget`, `aboutBlankSessionMismatch`, `sessionTabCorrection`, `openResultTabCorrection`, and existing tab-correction outputs; it does not probe tabs for ordinary clicks beyond the RQ-0086-gated drift paths. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#tabs) and [`ELECTRON.md`](ELECTRON.md#troubleshooting); fake coverage: about:blank recovery and explicit-about:blank negatives in [`test/agent-browser.extension-tab-recovery.test.ts`](../test/agent-browser.extension-tab-recovery.test.ts), early tab-drift failure assertions in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), and central next-action helper coverage in [`test/agent-browser.results.test.ts`](../test/agent-browser.results.test.ts).
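As a rough illustration of the `RQ-0100` ordering above, the recovery ids could be assembled as follows. The `recoveryActionIds` helper and `DriftKind` type are hypothetical; only the four id strings and the "stable `t<N>` id is known" condition come from the text, and the real next-action objects carry more than an id.

```typescript
// Hypothetical sketch; the real helper lives behind details.nextActions and is not shown here.
type DriftKind = "about-blank" | "tab-drift";

function recoveryActionIds(kind: DriftKind, stableTabId?: string): string[] {
  // The tab-list action always comes first for either drift path.
  const ids = [
    kind === "about-blank"
      ? "list-tabs-for-about-blank-recovery"
      : "list-tabs-for-tab-drift-recovery",
  ];
  // Select/snapshot follow-ups only appear when a stable t<N> id is already known.
  if (stableTabId !== undefined && /^t\d+$/.test(stableTabId)) {
    ids.push("select-intended-tab-after-drift", "snapshot-after-tab-recovery");
  }
  return ids;
}
```

Without a known tab id the sketch returns only the tab-list action, which mirrors the rule that the wrapper does not suggest a select or snapshot retry for a tab it has not observed.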
|
|
85
89
|
|
|
@@ -93,19 +97,19 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
|
|
|
93
97
|
|
|
94
98
|
`RQ-0088` adds current-snapshot ref fallback for selector misses: when raw `find` or compiled `semanticAction` fails with `failureCategory: "selector-not-found"`, `extensions/agent-browser/index.ts` may take one fresh session-scoped `snapshot -i`, then `extensions/agent-browser/lib/results/selector-recovery.ts` looks for exact normalized role/name matches for the failed target and emits `details.visibleRefFallback` plus visible `Current snapshot ref fallback`. Non-fill matches append bounded direct-ref next actions (`try-current-visible-ref` / `try-current-visible-ref-N`); fill matches omit direct args/text and feed the RQ-0099 rich-input recovery path when the ref is editable. The matcher is intentionally narrow: role locators require `--name`; text-click maps only to exact-name `button`/`link` refs; label/placeholder fill maps only to exact-name textbox/searchbox-style refs; prefixes/fuzzy matches are ignored, and duplicate exact matches carry ambiguity safety copy. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`visibleRefFallback`, nextActions); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) selector strategy and README pitfalls; fake coverage: `agentBrowserExtension suggests current snapshot refs when raw find role locators miss` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts).
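The narrow matcher rules in the `RQ-0088` entry above can be sketched like this. The types, the `normalize` helper, and the exact role lists are assumptions for illustration; the code in `selector-recovery.ts` is not shown in this diff and may normalize names differently.

```typescript
// Hypothetical sketch of the exact-match rules; not the code in selector-recovery.ts.
interface SnapshotRef {
  ref: string; // e.g. "@e4"
  role: string;
  name: string;
}

type FailedTarget =
  | { kind: "role"; role: string; name?: string }
  | { kind: "text-click"; text: string }
  | { kind: "fill"; label: string };

// Assumption: normalization trims, collapses whitespace, and lowercases.
const normalize = (value: string): string =>
  value.trim().replace(/\s+/g, " ").toLowerCase();

function candidateFor(target: FailedTarget): { roles: string[]; name: string } | undefined {
  if (target.kind === "role") {
    // Role locators require a name; without one there is nothing to match exactly.
    return target.name === undefined ? undefined : { roles: [target.role], name: target.name };
  }
  if (target.kind === "text-click") {
    return { roles: ["button", "link"], name: target.text };
  }
  // Label/placeholder fills map only to textbox/searchbox-style refs (assumed role list).
  return { roles: ["textbox", "searchbox"], name: target.label };
}

// Returns every exact match; more than one result means the caller should warn about ambiguity.
function matchVisibleRefs(target: FailedTarget, refs: SnapshotRef[]): SnapshotRef[] {
  const candidate = candidateFor(target);
  if (candidate === undefined) return [];
  const wanted = normalize(candidate.name);
  return refs.filter(
    (entry) => candidate.roles.includes(entry.role) && normalize(entry.name) === wanted,
  );
}
```

Because the comparison is equality on normalized names, a failed text click for `Check` does not match a `Checkout` button, which is the prefix/fuzzy exclusion the entry describes.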
|
|
95
99
|
|
|
96
|
-
`RQ-0072` guards page-scoped `@e…` refs against silent recycling: successful `snapshot` (or the last `snapshot` step inside a successful `batch`) records `details.refSnapshot` with ref ids and the snapshot page URL; `extensions/agent-browser/lib/session-page-state.ts` replays per-session snapshots and `refSnapshotInvalidation` markers from the transcript on
|
|
100
|
+
`RQ-0072` guards page-scoped `@e…` refs against silent recycling: successful `snapshot` (or the last `snapshot` step inside a successful `batch`) records `details.refSnapshot` with ref ids and the snapshot page URL; `extensions/agent-browser/lib/session-page-state.ts` replays per-session snapshots and `refSnapshotInvalidation` markers from the active transcript branch on `session_start` and Pi 0.76 `session_tree` branch changes, clears them on successful close commands (`close`, `quit`, or `exit`), invalidates prior refs when a session `snapshot` fails with `No active page`, rejects mutation-prone ref argv before spawn when the tab URL diverges, a ref id is missing from the latest snapshot, or the session refs are invalidated, blocks `batch` stdin that uses `@e…` on a guarded command after an earlier step that can navigate or mutate until a `snapshot` step appears later in the same stdin array (pre-spawn latch reset only), and prefixes `refresh-interactive-refs` with `--session` when the call names a session (including upstream-classified `stale-ref` outcomes). The entrypoint also serializes `session_tree` restore and wrapper-owned browser commands with managed-session work, guards independent caller-owned explicit-session completions with a branch-state generation check, keeps process-owned cleanup registries for managed sessions and wrapper-launched Electron records separate from the branch-visible view, treats explicit wrapper-owned close rows and Electron cleanup managed-session steps as restore-visible close events, closes off-branch owned managed sessions and Electron launches on non-quit reload shutdown, preserves current branch-visible active managed/Electron sessions and active Electron temp profiles for reload continuity, and preserves fresh-session allocation monotonicity across branch restores. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`refSnapshot`, `refSnapshotInvalidation`, `stale-ref`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) snapshot/ref notes and README pitfalls; fake coverage: `agentBrowserExtension recommends tab recovery after No active page snapshot failures` and `agentBrowserExtension invalidates refs after No active page snapshot failures inside batch` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts), plus `agentBrowserExtension blocks page-scoped ref reuse…`, `…rehydrates page-scoped refs from the current tree branch`, `…rehydrates managed browser session state from the current tree branch`, `…rehydrates artifact manifest state from the current tree branch`, `…keeps Electron cleanup ownership after session_tree switches away from the launch branch`, `…blocks stale refs after page-changing steps inside a batch`, `…allows same-snapshot form fills before a batch click`, `…allows batch stdin ref steps after snapshot following an invalidating step`, `…records snapshot refs returned inside a successful batch`, and `…rejects refs absent from the latest same-page snapshot` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts); managed-session reload cleanup, explicit close untracking/state rotation/restore, generated fresh-name reservation after repeated explicit closes, explicit-session command versus `session_tree` generation-guard coverage, explicit close versus in-flight implicit command serialization, and fresh-ordinal coverage lives in [`test/agent-browser.resume-state.test.ts`](../test/agent-browser.resume-state.test.ts).
|
|
97
101
|
|
|
98
102
|
`RQ-0087` keeps the RQ-0072 guard but removes `fill` from the batch invalidation latch: `fill @e…` rows remain guarded against stale/missing refs, yet multiple same-snapshot form fills can run before the first click/submit/navigation step in one upstream `batch`. A later guarded ref after `click`, `open`, `reload`, or other invalidating rows still fails before spawn unless the batch includes a fresh `snapshot` step first. This improves login/checkout efficiency without permitting likely post-navigation ref reuse. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`Batch stdin ordering`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) ref notes; fake coverage: `agentBrowserExtension allows same-snapshot form fills before a batch click` in [`test/agent-browser.extension-ref-guards.test.ts`](../test/agent-browser.extension-ref-guards.test.ts).
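
The ordering rule above can be sketched as a small pre-spawn check. This is a minimal illustration under stated assumptions, not the extension's code: `BatchStep`, `INVALIDATING`, `usesRef`, and `findStaleRefStep` are hypothetical names, the ref pattern `@e<digits>` is assumed, and the invalidating set lists only the commands the paragraph names.

```typescript
// Hypothetical, simplified model of one batch stdin row: argv tokens only.
type BatchStep = string[];

// Only the invalidating commands named in the text; the real list is longer.
const INVALIDATING = new Set(["click", "open", "reload"]);

// True when any argument is a page-scoped ref such as "@e12" (assumed format).
function usesRef(step: BatchStep): boolean {
  return step.slice(1).some((token) => /^@e\d+$/.test(token));
}

// Returns the index of the first step that reuses a ref after an invalidating
// step without a fresh snapshot in between, or -1 when the batch is acceptable.
function findStaleRefStep(steps: BatchStep[]): number {
  let latched = false;
  for (let i = 0; i < steps.length; i++) {
    const command = steps[i][0];
    if (command === "snapshot") {
      latched = false; // a fresh snapshot resets the latch
      continue;
    }
    if (latched && usesRef(steps[i])) return i;
    // "fill" is deliberately absent from INVALIDATING, mirroring RQ-0087.
    if (INVALIDATING.has(command)) latched = true;
  }
  return -1;
}
```

In this sketch, several `fill @e…` rows followed by one `click @e…` pass, while a `fill @e…` after a `click` is flagged unless a `snapshot` row sits between them.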
|
|
99
103
|
|
|
100
|
-
`RQ-0073` surfaces likely overlay blockers after no-navigation clicks without inventing blind targets: for **top-level** `click` results (unified command `click`, not `batch`-wrapped steps) whose upstream JSON includes `data.clicked`, whose prior pinned tab URL and post-click URL (from `details.navigationSummary`, gathered by one read-only `eval` summary when the click payload omits **both** string `data.url` and `data.title`) stay equal after the same fragment-insensitive normalization used for ref preflight, and where the same unified result did **not** already apply session tab correction
|
|
104
|
+
`RQ-0073` surfaces likely overlay blockers after no-navigation clicks without inventing blind targets: for **top-level** `click` results (unified command `click`, not `batch`-wrapped steps) whose upstream JSON includes `data.clicked`, whose prior pinned tab URL and post-click URL (from `details.navigationSummary`, gathered by one read-only `eval` summary when the click payload omits **both** string `data.url` and `data.title`) stay equal after the same fragment-insensitive normalization used for ref preflight, and where the same unified result did **not** already apply session tab correction, about-blank mismatch recovery, or `details.clickDispatch` fired for the same result, `extensions/agent-browser/index.ts` takes one fresh session-scoped `snapshot -i`, scans `refs` for strong modal context (`dialog` / `alertdialog`) plus up to three close/dismiss-pattern `button`/`link`/`menuitem` controls, and only then emits `details.overlayBlockers` (`candidates`, `summary`, and a `snapshot` map that can advance `refSnapshot`), visible `Possible overlay blockers`, and `inspect-overlay-state` / `try-overlay-blocker-candidate-*` next actions (with `--session` prefix when the session is named) appended after presentation follow-ups such as `inspect-after-mutation`. Page-wide privacy/sign-in/banner text without a dialog role is deliberately ignored to avoid warnings after ordinary same-page clicks. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`overlayBlockers`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) no-navigation click note and README pitfalls; fake coverage: `agentBrowserExtension surfaces likely overlay blockers after a no-op click` and `agentBrowserExtension does not report overlay blockers from unrelated page chrome after a successful same-page click` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
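
The candidate scan described above (strong modal context first, then at most three dismiss-style controls) might look roughly like the following. It is a sketch only: the `RefEntry` shape, the `DISMISS_PATTERN` regex, and `findOverlayCandidates` are assumptions made for illustration, since the text does not give the real snapshot shape or the matching pattern.

```typescript
// Hypothetical shape of one entry in a snapshot's `refs` map.
interface RefEntry {
  role: string;
  name: string;
}

const MODAL_ROLES = new Set(["dialog", "alertdialog"]);
const CONTROL_ROLES = new Set(["button", "link", "menuitem"]);
// Assumed pattern; the real close/dismiss matcher is not given in the text.
const DISMISS_PATTERN = /\b(close|dismiss)\b/i;

// Returns up to three ref ids of dismiss-style controls, but only when the
// snapshot also contains a dialog/alertdialog role.
function findOverlayCandidates(refs: Record<string, RefEntry>): string[] {
  const entries = Object.entries(refs);
  const hasModal = entries.some(([, ref]) => MODAL_ROLES.has(ref.role));
  if (!hasModal) return []; // banner text without a dialog role is ignored
  return entries
    .filter(([, ref]) => CONTROL_ROLES.has(ref.role) && DISMISS_PATTERN.test(ref.name))
    .slice(0, 3)
    .map(([id]) => id);
}
```

Requiring the modal role before looking at buttons is what keeps an ordinary page with a "Close" link somewhere in its chrome from producing a warning.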
|
|
101
105
|
|
|
102
106
|
`RQ-0086` reduces wrapper-induced click fragility found during Sauce Demo smokes: navigation-summary enrichment for click/back/forward/reload/dblclick now uses one read-only `eval` (`({ title: document.title, url: location.href })`) instead of serial `get title` plus `get url` probes, including tab-pinned batch wrappers. Tab pinning/post-command tab correction now runs only after the wrapper has evidence of tab-drift risk (profile restore correction, overlapping stale opens, or restored session state), so ordinary same-session clicks no longer get repeated `tab list` probes. This keeps `details.navigationSummary`, overlay blocker checks, and drift recovery intact while avoiding the upstream `agent-browser 0.27.0` sequence that could report later clicks as successful without dispatching pointer/click events after repeated getter/tab/snapshot probes. Fake coverage: `agentBrowserExtension enriches click results with a post-navigation title and url summary` in [`test/agent-browser.extension-tabs.test.ts`](../test/agent-browser.extension-tabs.test.ts), plus `agentBrowserExtension pins the intended tab inside a follow-up command when reconnect drift would otherwise steal focus` and about-blank/tab overlap assertions in [`test/agent-browser.extension-tab-recovery.test.ts`](../test/agent-browser.extension-tab-recovery.test.ts); manual validation source: [`RELEASE.md`](RELEASE.md#public-sauce-demo-checkout-smoke-prompt).
|
|
103
107
|
|
|
104
|
-
`RQ-0089` investigated
|
|
108
|
+
`RQ-0089` investigated Sauce Demo no-op clicks after RQ-0086, and the 2026-05-26 release smoke reproduced the failure against direct upstream `agent-browser 0.27.0`: CSS `click [data-test=add-to-cart-sauce-labs-backpack]` and current `@ref` clicks returned success, but a page-level listener recorded no trusted pointer/mouse/click events and the cart stayed unchanged; an in-page `element.click()` did mutate the cart. The wrapper now adds a bounded top-level non-Electron `click` dispatch probe before standalone clicks. If upstream reports success but no trusted DOM event reached the target, it fails the tool, records `details.clickDispatch.status: "no-native-event-observed"`, and appends `inspect-click-dispatch-miss` / `retry-click-after-dispatch-miss` next actions; it does **not** replay clicks in-page. This is not site-specific and does not alter `batch`/`job`/`qa` click steps. For `@e…` refs, the probe uses role/name metadata persisted in `details.refSnapshot` from the latest snapshot instead of running a pre-click snapshot that could recycle upstream refs. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`clickDispatch`, `refSnapshot`); human workflow: README and [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) click verification notes; fake coverage: `agentBrowserExtension reports click dispatch diagnostic when upstream reports success without dispatching DOM events` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
|
|
105
109
|
|
|
106
110
|
`RQ-0074` warns when `get text <selector>` may read hidden or tabbed DOM content: for non-ref CSS selectors, `extensions/agent-browser/index.ts` runs a read-only `eval --stdin` visibility probe after successful text reads, emits `details.selectorTextVisibility` plus visible warning text when the first match is hidden while visible matches exist or when multiple matches make the upstream first-match choice ambiguous, preserves multiple batched warnings in `details.selectorTextVisibilityAll`, and appends `inspect-visible-text-candidates` next actions. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`selectorTextVisibility`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) extraction note and README pitfalls; fake coverage: `agentBrowserExtension warns when get text may read hidden selector matches` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
|
|
107
111
|
|
|
108
|
-
`RQ-0075` classifies QA and diagnostic network failures by likely impact: `summarizeNetworkFailures` / `classifyNetworkRequestFailure` in `extensions/agent-browser/lib/results/network.ts` (re-exported from `shared.ts`) split rows that already count as failed (`isFailedNetworkRequest`) into actionable versus benign low-impact browser icon asset misses (`isBenignAssetFailure`: favicon/apple-touch-icon basename patterns, 404/`failed`/string `error` signals, and image-like `resourceType`/`mimeType` when present). `analyzeQaPresetResults` fails `qa` only for actionable network failures while preserving benign rows in `qaPreset.warnings`, and network request presentation adds a compact actionable/benign summary plus per-row impact tags, ordered with actionable/benign failed rows before successful rows so late failures are visible even in capped previews. Because real Pi ignores returned `isError` fields from custom tool `execute`, `extensions/agent-browser/index.ts` also realigns `details.resultCategory: "failure"` outcomes to Pi-visible tool errors through a `tool_result` handler; it appends the exact failure category plus `Pi tool isError: true` to prose output and preserves caller-requested `--json` output as parseable JSON while patching `isError`. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) QA and network diagnostic notes; fake coverage: `agentBrowserExtension compiles lightweight QA presets and fails diagnostics` in [`test/agent-browser.extension-input-modes.test.ts`](../test/agent-browser.extension-input-modes.test.ts) plus network presentation assertions in [`test/agent-browser.presentation.test.ts`](../test/agent-browser.presentation.test.ts); real-Pi
|
|
112
|
+
`RQ-0075` classifies QA and diagnostic network failures by likely impact: `summarizeNetworkFailures` / `classifyNetworkRequestFailure` in `extensions/agent-browser/lib/results/network.ts` (re-exported from `shared.ts`) split rows that already count as failed (`isFailedNetworkRequest`) into actionable versus benign low-impact browser icon asset misses (`isBenignAssetFailure`: favicon/apple-touch-icon basename patterns, 404/`failed`/string `error` signals, and image-like `resourceType`/`mimeType` when present). `analyzeQaPresetResults` fails `qa` only for actionable network failures while preserving benign rows in `qaPreset.warnings`, and network request presentation adds a compact actionable/benign summary plus per-row impact tags, ordered with actionable/benign failed rows before successful rows so late failures are visible even in capped previews. Because real Pi ignores returned `isError` fields from custom tool `execute`, `extensions/agent-browser/index.ts` also realigns `details.resultCategory: "failure"` outcomes to Pi-visible tool errors through a `tool_result` handler; it appends the exact failure category plus `Pi tool isError: true` to prose output and preserves caller-requested `--json` output as parseable JSON while patching `isError`. 
Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#qa) and [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) QA and network diagnostic notes; fake coverage: `agentBrowserExtension compiles lightweight QA presets and fails diagnostics` in [`test/agent-browser.extension-input-modes.test.ts`](../test/agent-browser.extension-input-modes.test.ts) plus network presentation assertions in [`test/agent-browser.presentation.test.ts`](../test/agent-browser.presentation.test.ts); model-free real-Pi pipeline coverage in [`test/agent-browser.pi-pipeline.test.ts`](../test/agent-browser.pi-pipeline.test.ts) asserts both in-memory and persisted JSONL tool results for QA prose patching, parseable caller-requested `--json` failures, and strict public-schema rejection before upstream spawn; `npm run verify -- lifecycle` asserts the QA failure-patch line in a saved JSONL session.
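
To make the actionable-versus-benign split concrete, here is a minimal sketch of the idea. It does not reproduce `classifyNetworkRequestFailure` or `isBenignAssetFailure`; the `NetworkRow` shape, the `ICON_BASENAME` regex, and the helper names are hypothetical, and the row is assumed to have already been judged as failed.

```typescript
// Hypothetical shape of an already-failed network row.
interface NetworkRow {
  url: string;
  status?: number;
  error?: string;
  resourceType?: string;
  mimeType?: string;
}

// Assumed basename pattern for browser icon assets.
const ICON_BASENAME = /^(favicon|apple-touch-icon)[^/]*\.(ico|png|svg)$/i;

// Last path segment of the URL, with query string and fragment removed.
function urlBasename(url: string): string {
  const path = url.split(/[?#]/)[0];
  return path.slice(path.lastIndexOf("/") + 1);
}

// Type metadata only disqualifies a row when it is present and not image-like.
function imageLikeWhenPresent(row: NetworkRow): boolean {
  if (row.resourceType !== undefined && row.resourceType.toLowerCase() !== "image") return false;
  if (row.mimeType !== undefined && !row.mimeType.toLowerCase().startsWith("image/")) return false;
  return true;
}

function classifyFailedRow(row: NetworkRow): "actionable" | "benign" {
  const lowImpactSignal = row.status === 404 || typeof row.error === "string";
  const benign = ICON_BASENAME.test(urlBasename(row.url)) && lowImpactSignal && imageLikeWhenPresent(row);
  return benign ? "benign" : "actionable";
}
```

Under this sketch a `qa` run would fail only when at least one row classifies as `"actionable"`, while `"benign"` rows would be kept as warnings.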
|
|
109
113
|
|
|
110
114
|
`RQ-0076` adds best-effort timeout recovery when the wrapper watchdog kills a stuck upstream process: `extensions/agent-browser/index.ts` calls `collectTimeoutPartialProgress` / `formatTimeoutPartialProgressText` to build `details.timeoutPartialProgress` from the compiled `job` or `qa` step list or parsed caller `batch` stdin, session-scoped `get url` / `get title` (plus optional planned-URL fallback from `open`/`navigate`/`pushstate` steps), and declared artifact paths (`screenshot`, `pdf`, `download`, `wait --download`) with existence/size checks, then appends a visible `Timeout partial progress` block with redacted URLs/paths. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) wrapper timeout note and README job section; fake coverage: `agentBrowserExtension reports partial progress and artifacts after job timeout` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts).
|
|
111
115
|
|
|
@@ -113,7 +117,7 @@ Native `job`, `qa`, experimental `sourceLookup`, experimental `networkSourceLook
|
|
|
113
117
|
|
|
114
118
|
`RQ-0078` improves getter/eval discoverability: `extensions/agent-browser/lib/results/presentation/errors.ts` matches upstream failure text containing `unknown command`, `unknown subcommand`, or `unrecognized command` (case-insensitive) when the failed command token is one of `attr`, `count`, `html`, `text`, `title`, `url`, or `value`, then adds grouped-`get` prose; only `title` / `url` also emit read-only `nextActions` (`use-get-title` / `use-get-url`, with `--session` when the failed call named a session). The getter block is skipped when selector recovery already injected an `Agent-browser hint:` line into the same error string. `extensions/agent-browser/index.ts` adds `details.evalStdinHint` plus visible `Eval stdin hint` when `looksLikeFunctionEvalStdin` matches trimmed stdin and upstream JSON carries a plain empty-object `data.result`; empty arrays such as `[]` are valid eval results and are not warned. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`nextActions`, `evalStdinHint`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) extraction note and README quick start; fake coverage: `buildToolPresentation suggests grouped getter commands for common unknown getter shortcuts` and `agentBrowserExtension warns when eval stdin returns an empty object from a function-shaped snippet`.
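
The getter-hint matching rule can be illustrated with a short sketch. The token list, the three error phrases, and the `Agent-browser hint:` skip condition come from the paragraph above; the function name `suggestGroupedGet` and its return shape are hypothetical.

```typescript
const GETTER_TOKENS = new Set(["attr", "count", "html", "text", "title", "url", "value"]);
const UNKNOWN_COMMAND = /unknown command|unknown subcommand|unrecognized command/i;

// Returns the grouped `get` form to suggest, or undefined when no hint applies.
function suggestGroupedGet(commandToken: string, errorText: string): string | undefined {
  if (!GETTER_TOKENS.has(commandToken)) return undefined;
  if (!UNKNOWN_COMMAND.test(errorText)) return undefined;
  // Skip when selector recovery already injected its own hint line.
  if (errorText.includes("Agent-browser hint:")) return undefined;
  return `get ${commandToken}`;
}
```

For example, a failed `title` call whose error says `Unknown command: title` would map to `get title` in this sketch, whereas a failed `click` would produce no suggestion.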
|
|
115
119
|
|
|
116
|
-
`RQ-0079` clarifies artifact lifecycle and cleanup ownership: `extensions/agent-browser/
|
|
120
|
+
`RQ-0079` clarifies artifact lifecycle and cleanup ownership: `extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts` builds `details.artifactCleanup`, surfaced by process-output with visible `Artifact lifecycle` copy on successful close commands (`close`, `quit`, or `exit`) when `artifactManifest.entries` is non-empty (`getArtifactCleanupGuidance`), stating that close commands do not delete explicit artifacts; `explicitArtifactPaths` carries up to ten distinct existing `explicit-path` manifest paths after a filesystem existence check, skipping stale paths already removed by host tools (possibly empty when the recent window has no existing explicit rows). Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`artifactCleanup`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) artifact retention section and README artifact notes; fake coverage: `agentBrowserExtension reports artifact lifecycle guidance on close` in [`test/agent-browser.extension-errors-artifacts.test.ts`](../test/agent-browser.extension-errors-artifacts.test.ts), plus close-alias unit coverage in [`test/agent-browser.runtime.test.ts`](../test/agent-browser.runtime.test.ts) and [`test/agent-browser.session-page-state.test.ts`](../test/agent-browser.session-page-state.test.ts).
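
A minimal sketch of how a capped, de-duplicated list such as `explicitArtifactPaths` could be assembled. The helper name `pickExplicitArtifactPaths` is hypothetical, and the existence check is injected as a parameter so the example stays self-contained; a real caller would presumably pass a filesystem check.

```typescript
// Keeps up to `limit` distinct paths that still exist, in manifest order.
function pickExplicitArtifactPaths(
  manifestPaths: string[],
  exists: (path: string) => boolean,
  limit = 10,
): string[] {
  const kept = new Set<string>();
  for (const path of manifestPaths) {
    if (kept.size >= limit) break;
    // Skip duplicates and stale paths already removed by host tools.
    if (!kept.has(path) && exists(path)) kept.add(path);
  }
  return Array.from(kept);
}
```

The result may be empty when every manifest path is stale, which matches the "possibly empty" note above.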
|
|
117
121
|
|
|
118
122
|
`RQ-0080` adds no-op scroll recovery for dense dashboards and nested panes: for successful top-level `scroll`, `extensions/agent-browser/index.ts` samples viewport and prominent scroll-container positions before and after execution with read-only session-scoped `eval --stdin` probes. If no sampled position changes, it emits `details.scrollNoop`, appends visible `Scroll diagnostic: no observed scroll movement`, appends exact `inspect-after-noop-scroll` / `verify-noop-scroll-visually` next actions, and updates `pageChangeSummary.nextActionIds` so agents can branch without parsing prose. Contract: [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details) (`scrollNoop`, `nextActions`); human workflow: [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md) scroll note; fake coverage: `agentBrowserExtension reports no-op scroll diagnostics with recovery next actions`.
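
The before/after comparison behind `details.scrollNoop` can be sketched as follows. The `ScrollSample` shape and `isScrollNoop` are hypothetical; the real probes run as `eval --stdin` snippets whose output format is not shown here.

```typescript
// Hypothetical sample for the viewport or one prominent scroll container.
interface ScrollSample {
  id: string;
  top: number;
  left: number;
}

// True when every container sampled before the scroll reports the same
// position afterwards. A container missing from `after` counts as a change
// in this sketch, so it does not produce a no-op diagnostic.
function isScrollNoop(before: ScrollSample[], after: ScrollSample[]): boolean {
  const afterById = new Map(after.map((sample) => [sample.id, sample] as const));
  return before.every((prev) => {
    const next = afterById.get(prev.id);
    return next !== undefined && next.top === prev.top && next.left === prev.left;
  });
}
```

Sampling nested containers as well as the viewport matters because a dashboard pane can scroll while `window.scrollY` stays at zero.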
|
|
119
123
|
|