npm - pi-agent-browser-native - Versions diffs - 0.2.33 → 0.2.34 - Mend

pi-agent-browser-native 0.2.33 → 0.2.34

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21) hide show

package/CHANGELOG.md +19 -0
package/README.md +34 -4
package/docs/ARCHITECTURE.md +6 -0
package/docs/COMMAND_REFERENCE.md +28 -5
package/docs/RELEASE.md +11 -3
package/docs/SUPPORT_MATRIX.md +8 -6
package/docs/TOOL_CONTRACT.md +61 -7
package/extensions/agent-browser/index.ts +2 -1
package/extensions/agent-browser/lib/input-modes/job.ts +62 -0
package/extensions/agent-browser/lib/input-modes/params.ts +4 -4
package/extensions/agent-browser/lib/input-modes.ts +3 -0
package/extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts +67 -1
package/extensions/agent-browser/lib/orchestration/browser-run/prepare.ts +26 -1
package/extensions/agent-browser/lib/orchestration/browser-run/process-output.ts +34 -7
package/extensions/agent-browser/lib/orchestration/browser-run/types.ts +6 -0
package/extensions/agent-browser/lib/playbook.ts +17 -16
package/extensions/agent-browser/lib/results/categories.ts +1 -1
package/extensions/agent-browser/lib/results/presentation/registry.ts +34 -6
package/extensions/agent-browser/lib/results/presentation/semantic-action.ts +133 -0
package/extensions/agent-browser/lib/results/presentation.ts +11 -6
package/package.json +1 -1

package/CHANGELOG.md CHANGED Viewed

@@ -2,6 +2,25 @@
 ## Unreleased
+## 0.2.34 - 2026-05-24
+### Added
+- Deterministic maintainer dogfood mode: `npm run verify -- dogfood` runs a model-free live-browser smoke through the native wrapper against public `example.com`, covering top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close.
+- Opt-in efficiency-benchmark JSONL sampling via `--sample-jsonl`, so maintainers can measure real transcript model-visible byte output without changing deterministic scenario metrics.
+- Architecture note for the Tier A/Tier B prompt-guidance budget, keeping always-on `promptGuidelines` short while preserving detailed browser playbook guidance in docs.
+### Changed
+- Always-on `agent_browser` prompt guidance is smaller and focused on Tier A rules: input-mode choice, refs/session/artifacts/nextActions, extraction basics, and explicit stop-before-order/post/purchase/submit boundaries.
+- `semanticAction` success output now better mirrors raw browser-action navigation and page-change summaries, while docs make the input-mode chooser clearer.
+- QA preset pass output is more compact, and `qa.attached` preflight now treats URL-only current-page checks as valid attached-session evidence.
+- Constrained `job` docs now make post-click navigation assertions explicit with `assertUrl` / `assertText` instead of implying hidden automatic navigation checks.
+### Fixed
+- Stabilized concurrency-sensitive fake-upstream tests by waiting for the older explicit-session open to reach the fake binary before launching the newer one, and by covering the documented planned-URL fallback separately from strict live current-page recovery assertions.
 ## 0.2.33 - 2026-05-23
 ### Added

package/README.md CHANGED Viewed

@@ -150,13 +150,15 @@ It does **not** edit Pi settings and does **not** run upstream `agent-browser do
 You usually prompt the agent in natural language. These JSON snippets show the exact native tool shape the agent should use.
-Open a page and inspect it:
+Open a page and inspect it (first-call recipe: open → snapshot -i → interact with current `@refs` → snapshot -i after changes). Do not pass `--json` in `args`; the wrapper injects it.
 ```json
 { "args": ["open", "https://example.com"] }
 { "args": ["snapshot", "-i"] }
 ```
+On `https://example.com/`, the main link label is **Learn more**—use exact visible text from your snapshot, not guessed copy such as `More information...`.
 Click a visible ref, then refresh refs after navigation or a DOM update:
 ```json
@@ -211,7 +213,8 @@ For supported upstream `find` flows and native dropdown selection you can omit h
 Typical pitfalls:
-- Supply **exactly one** of `args`, `semanticAction`, `job`, `qa`, `sourceLookup`, `networkSourceLookup`, or `electron` per call (not more, not none).
+- Supply **exactly one** of `args`, `semanticAction`, `job`, `qa`, `sourceLookup`, `networkSourceLookup`, or `electron` per call (not more, not none). Prefer `args` for routine browse; `semanticAction` for stable locators; `job`/`qa` for multi-step checks; `electron` for desktop apps; treat `sourceLookup` / `networkSourceLookup` as experimental candidates-only.
+- Do not pass `--json` in `args`; the wrapper injects it automatically.
 - `semanticAction` and `job` are **not** valid inside `batch` stdin; batch steps stay upstream argv string arrays (spell a `find` step as tokens there if you need it in a batch).
 - Commands or locators outside the supported shorthand still require explicit `args`. Common page getters are grouped under `get`: use `get title`, `get url`, or `get text <selector>` rather than shortcut commands such as `title` or `url`; unknown getter shortcuts can return read-only `details.nextActions` like `use-get-title`.
 - For `locator: "role"`, pass either `value: "button"` or `role: "button"`; if both are present they must match.
@@ -228,6 +231,8 @@ Typical pitfalls:
 For short repeatable workflows, pass a top-level `job` instead of hand-writing `batch` stdin. The wrapper only supports constrained steps (`open`, `click`, `fill`, `select`, `wait`, `assertText`, `assertUrl`, `waitForDownload`, and `screenshot`), compiles them to existing upstream `batch` commands, and echoes the compiled commands as `details.compiledJob` for auditability. The same compile path backs top-level `qa`, so long `qa` runs surface the same timeout evidence shape. If a long `job`, `qa`, or `batch` hits the wrapper watchdog, `details.timeoutPartialProgress` may recover planned steps, current page title/URL, and declared artifact paths that already exist on disk (see [`docs/TOOL_CONTRACT.md#details`](docs/TOOL_CONTRACT.md#details)). There is no separate catalog of reusable named browser recipes above `job`, `qa`, and raw `batch`; see [`docs/ARCHITECTURE.md#no-reusable-recipe-layer-yet`](docs/ARCHITECTURE.md#no-reusable-recipe-layer-yet) for the closed `RQ-0068` decision and when to revisit it.
+**Navigation inside `job` is explicit.** A successful `click` does not prove the next page loaded; add `assertUrl` and/or `assertText` after navigation-prone clicks (forms, checkout, tabs, submit buttons) before screenshots or steps that assume the new page.
 ```json
 {
   "job": {
@@ -240,6 +245,21 @@ For short repeatable workflows, pass a top-level `job` instead of hand-writing `
 }
 ```
+```json
+{
+  "job": {
+    "steps": [
+      { "action": "open", "url": "https://shop.example/checkout" },
+      { "action": "fill", "selector": "#email", "text": "user@example.com" },
+      { "action": "click", "selector": "#continue" },
+      { "action": "assertUrl", "url": "**/shipping" },
+      { "action": "assertText", "text": "Shipping address" },
+      { "action": "screenshot", "path": ".dogfood/shipping.png" }
+    ]
+  }
+}
+```
 On app pages that expose a native dropdown, add a `select` step such as `{ "action": "select", "selector": "#flavor", "value": "chocolate" }` before the assertion that depends on it.
 Use raw `args`/`stdin` when you need full upstream `batch` power, custom flags, or commands outside the constrained job schema. Do not pass `stdin` with `job`, `qa`, `sourceLookup`, `networkSourceLookup`, or `electron`; those modes generate or manage their own input.
@@ -315,6 +335,8 @@ For asynchronous exports, click first and then wait for the download:
 When a user gives exact artifact paths for screenshots, recordings, downloads, PDFs, traces, or HAR files, use those paths or explicitly report why the artifact was unavailable; do not silently substitute a different path in the final report. With upstream `agent-browser 0.27.0`, treat `details.savedFilePath` as upstream-reported metadata and confirm `details.artifacts[].exists` before relying on the requested `wait --download <path>` file being present on disk.
+For evidence-only screenshots or QA captures, branch on `details.artifactVerification` and `details.artifacts` before reporting PASS/FAIL; inline image attachments are optional when size limits allow—do not require vision review unless the user asked for visual inspection.
 Artifact cleanup is host-owned, not a browser command. `close` shuts down the browser session but does **not** delete explicit screenshots, downloads, PDFs, traces, HAR files, or recordings saved to paths you chose. When the session’s non-empty `details.artifactManifest` is in scope, a successful `close` appends an `Artifact lifecycle` note and sets `details.artifactCleanup` with the same retention summary as `details.artifactRetentionSummary`, a fixed `note` about host-owned cleanup, and `explicitArtifactPaths`: up to ten distinct paths from manifest rows whose `storageScope` is `explicit-path` (this list can be empty if the recent window only holds spills or other non-explicit inventory). Remove any listed paths with normal file tools after inspection.
 Start a fresh profiled browser after the implicit public-browsing session already exists:
@@ -381,7 +403,7 @@ Use SPA and Web Vitals helpers as normal command tokens:
 ```json
 { "args": ["pushstate", "/dashboard"] }
-{ "args": ["vitals", "https://example.com", "--json"] }
+{ "args": ["vitals", "https://example.com"] }
 ```
 For setup that must happen before first navigation, open a blank fresh page, stage routes/cookies/scripts, then navigate:
@@ -417,7 +439,7 @@ The full `npm run verify` gate runs:
 - command-reference baseline checks
 - live command-reference verification against the targeted installed upstream `agent-browser`
-Step order and which subprocesses run live in [`scripts/project.mjs`](scripts/project.mjs); [`test/project-verify.test.ts`](test/project-verify.test.ts) locks default, `release`, `real-upstream`, `package-pi`, and combined-docs orchestration so a gate cannot disappear accidentally. Run `npm run verify -- --help` for opt-in modes and supported passthrough flags.
+Step order and which subprocesses run live in [`scripts/project.mjs`](scripts/project.mjs); [`test/project-verify.test.ts`](test/project-verify.test.ts) locks default, `release`, `real-upstream`, `dogfood`, `package-pi`, and combined-docs orchestration so a gate cannot disappear accidentally. Run `npm run verify -- --help` for opt-in modes and supported passthrough flags.
 The deterministic agent-efficiency benchmark’s **standalone JSON/Markdown accounting run** is not part of default `npm run verify` (only `npm run verify -- benchmark` or `npm run benchmark:agent-browser` invokes the script). The full unit suite still exercises `test/agent-browser.efficiency-benchmark.test.ts`. Use the script before and after agent-facing abstractions to prove call-count, output-size, stale-ref, artifact, failure-category coverage, success-rate, and elapsed-time effects before changing the wrapper UX:
@@ -438,6 +460,14 @@ npm run verify -- real-upstream
 That mode sets `PI_AGENT_BROWSER_REAL_UPSTREAM=1` and runs `test/agent-browser.real-upstream-contract.test.ts` against the real `agent-browser` on `PATH` (version must match the capability baseline). It covers inspection, skills, a broad core interaction and navigation matrix on localhost fixtures (including `batch` stdin and `pushstate`), plus `vitals`, network route/requests/HAR, diff snapshot/screenshot/url, trace/profiler, console/errors/highlight, stream enable/status/disable, `cookies set --curl`, a `react tree` missing-renderer path, and `wait --download` with the on-disk caveat documented in release notes. The harness uses a throwaway temp `HOME` and dedicated socket/screenshot directories so the run does not touch your normal browser profile paths. Browser-opening or credential-dependent families such as `inspect`, `dashboard`, `chat`, provider clouds, and OS clipboard flows stay in fake-upstream or manual validation unless a safe deterministic fixture is added. For prerequisites, isolation details, and troubleshooting, see [`docs/RELEASE.md`](docs/RELEASE.md#real-upstream-contract-validation).
+A deterministic live-browser wrapper smoke is available without an LLM choosing tool calls:
+```bash
+npm run verify -- dogfood
+```
+That mode drives the native wrapper through top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close against public `example.com`. It complements, but does not replace, the interactive Pi/tmux release dogfood in [`docs/RELEASE.md`](docs/RELEASE.md#pre-release-checks).
 For package release confidence, follow [`docs/RELEASE.md`](docs/RELEASE.md). The release gate is:
 ```bash

package/docs/ARCHITECTURE.md CHANGED Viewed

@@ -53,6 +53,12 @@ That means:
 - no manual user orchestration as the main workflow
 - any future slash commands should be minimal and secondary
+### Prompt guidance budget
+Runtime `promptGuidelines` are a Tier A budget, not a full manual. They stay short enough to load on every `agent_browser`-aware turn and carry only high-impact rules: input-mode choice, the open → snapshot → ref loop, launch-scoped session handling, artifact verification, structured `nextActions`, extraction basics, and hard safety boundaries such as “stop before order/post/purchase/submit.”
+Tier B guidance lives in `SHARED_BROWSER_PLAYBOOK_GUIDELINES`, generated README/command-reference fragments, and targeted docs. When a workflow needs examples, caveats, or long command-family coverage, add it there instead of expanding always-on prompt text. If a Tier B rule prevents a repeated real failure, promote only the smallest durable sentence into Tier A and keep the generated-doc mirrors aligned.
 ### No reusable recipe layer yet
 Do **not** add reusable browser recipes as a first-class runtime surface yet.

package/docs/COMMAND_REFERENCE.md CHANGED Viewed

@@ -27,6 +27,8 @@ Use `npm run benchmark:agent-browser` or `npm run verify -- benchmark` before an
 ## Core mental model
+Input mode chooser (one per call): **`args`** for the default open → snapshot -i → click/fill `@refs` flow; **`semanticAction`** for stable role/text/label targets; **`job`** / **`qa`** for multi-step checks; **`electron`** for desktop apps only; **`sourceLookup`** / **`networkSourceLookup`** are **experimental candidates-only** helpers (not authoritative mappings). Do not pass `--json` in `args`—the wrapper injects it. Match link and button text to the latest snapshot (on `https://example.com/` the main link is `Learn more`, not legacy `More information...` copy). See [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#input-mode-chooser) for snapshot variants (`-i` vs `--compact` vs full) and batching three or more getters.
 Tool parameters (use exactly one of `args`, `semanticAction`, `job`, `qa`, `sourceLookup`, `networkSourceLookup`, or `electron`):
 ```json
@@ -63,8 +65,8 @@ Tool parameters (use exactly one of `args`, `semanticAction`, `job`, `qa`, `sour
 - `semanticAction`: optional shorthand for common `find` flows and native dropdown `select`; compiles to upstream argv and is rejected together with `args`, `job`, `qa`, `sourceLookup`, `networkSourceLookup`, or `electron` on the same call.
 - `job`: optional constrained short-workflow schema; compiles to existing upstream `batch` args/stdin and reports the compiled plan in `details.compiledJob`.
 - `qa`: optional lightweight QA preset; compiles to the same batch path and reports `details.compiledQaPreset` plus `details.qaPreset` pass/fail evidence.
-- `sourceLookup`: optional experimental helper for local UI-to-source *candidates*; compiles to the same `batch` path, reports `details.compiledSourceLookup` and `details.sourceLookup`, and never reclassifies a fully successful upstream batch as failed the way `qa` can (see [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sourcelookup) and the longer notes below).
-- `networkSourceLookup`: optional experimental helper for failed request-to-source *candidates*; compiles to generated `batch`, reports `details.compiledNetworkSourceLookup` and `details.networkSourceLookup`, and never assigns blame or edits files.
+- `sourceLookup`: **EXPERIMENTAL — candidates only** for local UI-to-source hints; compiles to the same `batch` path, reports `details.compiledSourceLookup` and `details.sourceLookup`, and never reclassifies a fully successful upstream batch as failed the way `qa` can (see [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#sourcelookup) and the longer notes below).
+- `networkSourceLookup`: **EXPERIMENTAL — candidates only** for failed request-to-source hints; compiles to generated `batch`, reports `details.compiledNetworkSourceLookup` and `details.networkSourceLookup`, and never assigns blame or edits files.
 - `electron`: optional Electron desktop-app shorthand. `list`, `status`, `cleanup`, and `probe` are wrapper-owned host/session helpers; `launch` starts a wrapper-owned isolated Electron profile and attaches through upstream `connect`.
 - `stdin`: only for `batch`, `eval --stdin`, and `auth save --password-stdin`; other command/stdin combinations are rejected before `agent-browser` is launched. `job`, `qa`, `sourceLookup`, `networkSourceLookup`, and `electron` generate or manage their own input.
 - `sessionMode`:
@@ -105,7 +107,7 @@ React introspection requires the React DevTools init hook to be installed before
 Use `vitals [url]` for Core Web Vitals plus React hydration timing when available, and `pushstate <url>` for client-side SPA navigation without a full reload:
 ```json
-{ "args": ["vitals", "https://example.com", "--json"] }
+{ "args": ["vitals", "https://example.com"] }
 { "args": ["pushstate", "/dashboard?tab=settings"] }
 ```
@@ -186,6 +188,8 @@ Use `batch --bail` when later steps should stop after the first failed command.
 For short constrained flows, use top-level `job` instead of hand-writing `batch` stdin. Supported job steps are `open`, `click`, `fill`, `select`, `wait`, `assertText`, `assertUrl`, `waitForDownload`, and `screenshot`; `select` requires `selector` plus `value` or `values`, and compiles to upstream `select <selector> <value...>`. The wrapper compiles steps to upstream `batch` and records `details.compiledJob.steps[]`. There is still no separate first-class catalog of reusable named browser recipes above `job`, the `qa` preset, and raw `batch`; see [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet) for the closed `RQ-0068` decision and revisit bar.
+**Job navigation is explicit.** A `click` step (or other navigation-prone interaction) does not prove the next page loaded. The wrapper does not auto-insert `assertUrl` or `assertText` after clicks inside `job`; add those steps yourself with the URL pattern or on-page text you expect, especially after forms, checkout, tabs, or submit buttons, before screenshots or later steps.
 ```json
 {
   "job": {
@@ -198,11 +202,28 @@ For short constrained flows, use top-level `job` instead of hand-writing `batch`
 }
 ```
+Navigation-prone flow (open → fill → click → assert destination → screenshot):
+```json
+{
+  "job": {
+    "steps": [
+      { "action": "open", "url": "https://shop.example/checkout" },
+      { "action": "fill", "selector": "#email", "text": "user@example.com" },
+      { "action": "click", "selector": "#continue" },
+      { "action": "assertUrl", "url": "**/shipping" },
+      { "action": "assertText", "text": "Shipping address" },
+      { "action": "screenshot", "path": ".dogfood/shipping.png" }
+    ]
+  }
+}
+```
 On app pages that expose a native dropdown, add a `select` step such as `{ "action": "select", "selector": "#flavor", "value": "chocolate" }` before the assertion that depends on it.
 Use raw `args: ["batch"]` with `stdin` when you need arbitrary upstream commands, flags, or batch failure policies outside the constrained schema. Do not pass `stdin` with `job`, `qa`, `sourceLookup`, `networkSourceLookup`, or `electron`; those modes generate or manage their own input.
-For quick smoke/QA checks, use top-level `qa`. It clears enabled network/console/page-error buffers before opening the target URL, waits for page readiness, checks expected text/selector, inspects fresh network requests, console messages, and page errors, and can capture an evidence screenshot. The readiness wait defaults to `loadState: "domcontentloaded"`; set `loadState` to `"load"` or `"networkidle"` only when that stricter state is useful and the site is not expected to keep background requests alive. QA network diagnostics classify failed requests by likely impact and list failed rows first in the network preview: actionable document/script/API-style failures fail the preset, while common low-impact browser icon misses such as `favicon.ico` are surfaced as warnings (`qaPreset.warnings`) so they do not fail an otherwise healthy page. Failed QA presets report `details.resultCategory: "failure"`, `failureCategory: "qa-failure"`, and real Pi sessions treat the diagnostic as a failed tool result. Prose output also gets a model-visible result-category line including `Pi tool isError: true`; caller-requested `--json` output keeps the JSON string parseable and relies on the patched `isError` plus `details` fields.
+For quick smoke/QA checks, use top-level `qa`. It clears enabled network/console/page-error buffers before opening the target URL, waits for page readiness, checks expected text/selector, inspects fresh network requests, console messages, and page errors, and can capture an evidence screenshot. The readiness wait defaults to `loadState: "domcontentloaded"`; set `loadState` to `"load"` or `"networkidle"` only when that stricter state is useful and the site is not expected to keep background requests alive. QA network diagnostics classify failed requests by likely impact and list failed rows first in the network preview: actionable document/script/API-style failures fail the preset, while common low-impact browser icon misses such as `favicon.ico` are surfaced as warnings (`qaPreset.warnings`) so they do not fail an otherwise healthy page. Successful QA with no failed checks returns compact model-visible prose (page URL/title when known, checks run, optional screenshot verification) while keeping the full step matrix in `details.qaPreset` and `details.batchSteps`. Failed QA presets report `details.resultCategory: "failure"`, `failureCategory: "qa-failure"`, keep verbose per-step batch output, and real Pi sessions treat the diagnostic as a failed tool result. Prose output also gets a model-visible result-category line including `Pi tool isError: true`; caller-requested `--json` output keeps the JSON string parseable and relies on the patched `isError` plus `details` fields.
 The same classification drives plain `network requests` presentation: when any row counts as failed (HTTP status ≥ 400, `failed: true`, or a string `error`), model-facing text starts with a line like `Network failure summary: 0 actionable, 1 benign low-impact (1 total).`, and each preview line can end with an impact tag such as `[benign: low-impact browser icon asset]` or `[actionable: document, script, API, or non-benign request failure]`. When safe request IDs are present, `details.nextActions` adds bounded read-only follow-ups such as `network request <id>`, `networkSourceLookup` for actionable failed rows, `network requests --filter <path>`, and `network har start`; prefer those payloads over rebuilding request-id commands from prose. Rules live in `classifyNetworkRequestFailure` / `summarizeNetworkFailures` in `extensions/agent-browser/lib/results/network.ts`; QA aggregation is `analyzeQaPresetResults` in `extensions/agent-browser/index.ts`.
@@ -212,7 +233,7 @@ The same classification drives plain `network requests` presentation: when any r
 Optional `loadState`, `checkNetwork`, `checkConsole`, and `checkErrors` default to `"domcontentloaded"`, `true`, `true`, and `true`; set a check to `false` to skip that diagnostic. Omit `expectedText` and `expectedSelector` when you only need load plus diagnostics.
-For attached Electron or manually connected CDP sessions, use `qa.attached` after the session exists. It does not open a URL and rejects `sessionMode: "fresh"` because it checks the current managed session.
+For attached Electron or manually connected CDP sessions, use `qa.attached` after the session exists. It does not open a URL and rejects `sessionMode: "fresh"` because it checks the current managed session. Before running diagnostics, the wrapper requires a readable `http:` or `https:` page URL on the attached session; missing URLs, read failures, and non-http(s) surfaces fail fast with recovery `nextActions` such as `tab list` and `snapshot -i` instead of running the full QA batch.
 ```json
 { "qa": { "attached": true, "expectedText": "Explorer", "screenshotPath": ".dogfood/electron.png" } }
@@ -310,6 +331,8 @@ The upstream screenshot aliases are `screenshot --full` for full-page capture an
 Prefer `download <selector> <path>` when the target element itself is the downloadable link/control. Use `click` plus `wait --download [path]` when a previous action starts the download indirectly.
+For evidence-only screenshots, QA captures, or audit artifacts, save to an explicit path and branch on `details.artifactVerification` plus `details.artifacts` before reporting PASS/FAIL. Inline image attachments are optional convenience when size limits allow; do not require vision review unless the user asked for visual inspection.
 Wrapper result rendering is metadata-first for saved files:
 - screenshots return a saved-path summary, visible artifact metadata, structured `details.artifacts` metadata, and an inline image attachment when safe; the visible block includes artifact type, requested path, absolute path, existence, size, cwd, session, and repair/copy status when applicable
 - downloads, PDFs, `wait --download` files, `state save` state files, diff screenshot output images, traces, CPU profiles, completed WebM recordings from `record stop`, and path-bearing HAR captures return concise saved-path summaries plus structured `details.artifacts` metadata without inlining large files

package/docs/RELEASE.md CHANGED Viewed

@@ -35,9 +35,17 @@ npm run verify -- release
 `npm publish` runs npm’s `prepublishOnly` script from `package.json`, which executes the same `npm run verify -- release` gate and then `npm pack --dry-run`. That concatenated gate is everything in the default `npm run verify` step (generated playbook drift, TypeScript, the unit/fake suite, generated command-reference blocks, and live upstream command-reference sampling against the targeted `agent-browser` on `PATH`) plus the packaged Pi smoke in `package-pi`. Using `npm publish --ignore-scripts` skips that contract intentionally.
-`prepublishOnly` intentionally does **not** run `npm run verify -- lifecycle`, `npm run verify -- real-upstream`, or `npm run verify -- benchmark`; those are separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). Treat the bullets below as the full pre-publish contract even though only the `release` slice is automated at publish time.
+`prepublishOnly` intentionally does **not** run `npm run verify -- lifecycle`, `npm run verify -- real-upstream`, `npm run verify -- dogfood`, or `npm run verify -- benchmark`; those are separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). Treat the bullets below as the full pre-publish contract even though only the `release` slice is automated at publish time.
-Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --no-extensions --no-skills -e .` from the checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Automated localhost and fake-upstream gates do not replace this human-readable live-site transcript evidence. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
+For a deterministic real-browser wrapper smoke without model choice in the loop, run:
+```bash
+npm run verify -- dogfood
+```
+This mode uses the extension harness and the real `agent-browser` on `PATH` against public `example.com`, then verifies top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close. Use `npm run verify -- dogfood --keep-artifacts` or `--artifact-dir <path>` only while debugging, then delete retained screenshots. This smoke complements, but does not replace, human-readable interactive transcript evidence.
+Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --no-extensions --no-skills -e .` from the checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Automated localhost, fake-upstream, and deterministic dogfood gates do not replace this human-readable live-site transcript evidence. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
 When reviewing saved session JSONL after a failed smoke or a `qa` preset that reclassified an upstream-successful batch, expect `agent_browser` tool rows to carry `isError: true` whenever `details.resultCategory` is `failure`. For normal prose output, model-visible text should end with a `Pi tool isError: true` category line; for caller-requested `--json` output, the hook preserves parseable JSON and only patches `isError`. The extension applies that patch on the `tool_result` path so Pi’s transcript matches the wrapper contract ([`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details)). Preserve a normal Pi session directory for those checks; avoiding `--no-session` keeps this evidence intact ([`AGENTS.md`](../AGENTS.md) preferred validation workflow).
@@ -117,7 +125,7 @@ Evaluator expectations after the queued Sauce Demo fixes: the agent should indep
 [`scripts/agent-browser-efficiency-benchmark.mjs`](../scripts/agent-browser-efficiency-benchmark.mjs) is an accounting-only benchmark: it does not shell out to `agent-browser`, launch a browser, or read or write Pi sessions. It models representative `agent_browser` call shapes (including optional `stdin` for `batch` and top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` objects that compile to batch) and aggregates success rate, tool-call counts, UTF-8 size of model-visible strings, stale-ref failure and recovery counts, artifact success, distinct failure-category coverage, and summed elapsed-time estimates. When extending scenarios, keep them aligned with the closed `RQ-0068` “no reusable recipe layer” rationale in [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet) (benchmark ids cited there are the canonical inventory for that evidence bar).
-- **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes).
+- **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes). Optional `--sample-jsonl path/to/session.jsonl` adds a `jsonlSample` section with real UTF-8 byte totals and per-workflow/overall p95 sizes for model-visible `agent_browser` tool-result text without changing deterministic scenario metrics; comparison ignores `jsonlSample` blocks.
 - **Default gate:** `npm run verify` checks generated playbook drift, runs `tsc --noEmit`, runs the full unit/fake suite under `test/**/*.test.ts` (including [`test/agent-browser.efficiency-benchmark.test.ts`](../test/agent-browser.efficiency-benchmark.test.ts) for scenario coverage and comparison behavior), verifies generated command-reference baseline blocks, and samples live upstream command-reference tokens. It does not spawn the standalone benchmark script’s JSON/Markdown run; that is what the opt-in slice below adds.
 - **Opt-in slice:** `npm run verify -- benchmark` runs the benchmark script once with `--json` and then that same test module alone. It is intentionally **not** part of `npm run verify -- release`, so routine publish gates stay decoupled from benchmark churn while still allowing a focused check after editing scenarios or `CURRENT_BENCHMARK_VERSION`.

package/docs/SUPPORT_MATRIX.md CHANGED Viewed

@@ -37,12 +37,14 @@ Re-run the gates below before each release; this table records what the closure
 | Gate | Evidence | Status |
 | --- | --- | --- |
-| Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-18 (`npm run verify`, `agent-browser 0.27.0` on `PATH`). |
-| Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-18 (`npm run verify -- real-upstream`). |
-| Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-18 (`npm run verify -- package-pi`). |
-| `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-18 (`npm run verify -- release`). `prepublishOnly` still needs a fresh run during actual publish. |
-| Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`, restart, `/resume`, session continuity, slash-command sentinel tokens (`v1` then `v2` after rewriting the packaged extension to simulate pickup), and persisted spill reachability with a fake upstream on `PATH`. Default Pi model is `zai/glm-5.1`; default per-step wait is **180000 ms** (`DEFAULT_TIMEOUT_MS`); override model with `--model <id>` and waits with `--timeout-ms <ms>`. Passthrough flags in [`scripts/project.mjs`](../scripts/project.mjs): `--keep-artifacts`, `--model`, `--verbose`, and `--timeout-ms` plus a value (for example `npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --keep-artifacts --verbose --timeout-ms 600000`). | Pass on 2026-05-18 (`npm run verify -- lifecycle`). Treat any future unexplained red lifecycle gate as a release blocker. |
-| Quick isolated Pi smoke | `pi --no-extensions -e .` from repo root; native `agent_browser` only. | Pass on 2026-05-18 for a fresh interactive tmux smoke: the agent opened `https://example.com`, waited for `Example Domain`, saved `/tmp/piab-isolated-smoke.png` with verified `image/png` artifact metadata, closed the browser session, and reported PASS. Broader historical coverage also includes version/help/skills, open/snapshot/click, eval stdin, batch stdin, screenshot, explicit session, `sessionMode: "fresh"`, network requests, console/errors, diff snapshot, stream status/disable, dashboard start/stop, and chat credential-failure pass-through during RQ-0055. |
+| Default local gate | `npm run verify` checks generated playbook drift, `tsc --noEmit`, unit/fake tests, generated command-reference blocks, and live command-reference sampling. | Pass on 2026-05-24 (`npm run verify`, `agent-browser 0.27.0` on `PATH`). |
+| Real upstream contract | `npm run verify -- real-upstream` runs the localhost fixture matrix against the real installed `agent-browser` matching the baseline. | Pass on 2026-05-24 (`npm run verify -- real-upstream`, `agent-browser 0.27.0` on `PATH`). |
+| Packaged Pi smoke | `npm run verify -- package-pi` validates package contents, loads exactly one packaged `agent_browser` tool, and executes fake-upstream `--version`. | Pass on 2026-05-24 as part of `npm run verify -- release`. |
+| Deterministic dogfood smoke | `npm run verify -- dogfood` (`scripts/verify-agent-browser-dogfood.ts`) drives the native wrapper against public `example.com` through top-level `qa`, `semanticAction`, `qa.attached`, constrained `job`, screenshot artifact verification, and session close with the real `agent-browser` on `PATH`. | Pass on 2026-05-24 (`npm run verify -- dogfood --artifact-dir /tmp/pi-agent-browser-release-dogfood --json`; artifacts removed). |
+| Efficiency benchmark | `npm run verify -- benchmark` runs deterministic browser workflow accounting plus focused benchmark tests, including JSONL sampling fixtures and job/qa/sourceLookup/networkSourceLookup/Electron scenario coverage. | Pass on 2026-05-24 (`npm run verify -- benchmark`). |
+| `verify -- release` / `prepublishOnly` | `npm run verify -- release` chains the default gate with packaged Pi smoke (`verifySteps` `release` in [`scripts/project.mjs`](../scripts/project.mjs)). `package.json` `prepublishOnly` runs that compose before `npm pack --dry-run` during `npm publish`. It intentionally omits lifecycle, real-upstream, dogfood, and benchmark modes—see [`RELEASE.md`](RELEASE.md#pre-release-checks). | Pass on 2026-05-24 (`npm run verify -- release`). `prepublishOnly` still needs a fresh run during actual publish. |
+| Configured-source lifecycle | `npm run verify -- lifecycle` (`scripts/verify-lifecycle.mjs`) drives `/reload`, restart, `/resume`, session continuity, slash-command sentinel tokens (`v1` then `v2` after rewriting the packaged extension to simulate pickup), and persisted spill reachability with a fake upstream on `PATH`. Default Pi model is `zai/glm-5.1`; default per-step wait is **180000 ms** (`DEFAULT_TIMEOUT_MS`); override model with `--model <id>` and waits with `--timeout-ms <ms>`. Passthrough flags in [`scripts/project.mjs`](../scripts/project.mjs): `--keep-artifacts`, `--model`, `--verbose`, and `--timeout-ms` plus a value (for example `npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal --keep-artifacts --verbose --timeout-ms 600000`). | Pass on 2026-05-24 (`npm run verify -- lifecycle`). Treat any future unexplained red lifecycle gate as a release blocker. |
+| Quick isolated Pi smoke | `pi --no-extensions --no-skills -e . --tools agent_browser` from repo root; native `agent_browser` only. | Pass on 2026-05-24 for a fresh interactive tmux smoke: the agent opened `https://example.com`, verified `Example Domain`, clicked `Learn more` to `https://www.iana.org/help/example-domains`, opened `https://github.com/fitchmultz/pi-agent-browser-native`, verified the project page, saved `/tmp/piab-release-artifacts-TL5YWe/release-smoke.png` with verified `image/png` artifact metadata, closed the browser session, and reported PASS; screenshot, session dir, and tmux session were removed after recording evidence. Broader historical coverage also includes version/help/skills, open/snapshot/click, eval stdin, batch stdin, screenshot, explicit session, `sessionMode: "fresh"`, network requests, console/errors, diff snapshot, stream status/disable, dashboard start/stop, and chat credential-failure pass-through during RQ-0055. |
 ## Baseline checklist by inventory section

package/docs/TOOL_CONTRACT.md CHANGED Viewed

@@ -33,11 +33,38 @@ The native command reference in `docs/COMMAND_REFERENCE.md` is driven by the sam
 Agent-facing efficiency claims are measured with `npm run benchmark:agent-browser` or `npm run verify -- benchmark`. The benchmark is deterministic and does not launch a browser; it tracks representative workflow success, tool calls, model-visible output size, stale-ref failures and recoveries, artifact success, failure-category coverage, and elapsed-time estimates so future abstractions can prove they reduce agent work before replacing raw tool use.
+## Input mode chooser
+Use exactly one top-level input per call:
+| When you need | Use | Notes |
+| --- | --- | --- |
+| Routine browse, click, fill, screenshots, upstream commands | `args` | Default path: `open` → `snapshot -i` → `click`/`fill` `@eN` → `snapshot -i` after navigation or DOM changes. Do not pass `--json`; the wrapper injects it (see [Wrapper `--json`](#wrapper-json)). |
+| Stable visible role/text/label/placeholder targets | `semanticAction` | Compiles to upstream `find` or `select`; optional `session` for a named upstream browser. |
+| Short multi-step smoke or evidence flows | `job` or `qa` | Both compile to `batch`; `qa` may reclassify diagnostics as failure. |
+| Desktop Electron apps (list/launch/probe/cleanup) | `electron` | Wrapper-owned lifecycle; not for ordinary websites. |
+| Local UI source hints (experimental) | `sourceLookup` | **Candidates only** with confidence/evidence; not guaranteed DOM-to-file mappings. |
+| Failed fetch/API source hints (experimental) | `networkSourceLookup` | **Candidates only** from initiator metadata and bounded workspace URL literals; not definitive blame. |
+For link and button text, use the **exact** visible label from the latest `snapshot -i` (or `semanticAction` locators), not guessed copy. The `https://example.com/` smoke page uses heading `Example Domain` and link `Learn more`; do not assume older example.com strings such as `More information...`.
+## Snapshot and getter batching
+- **`snapshot -i`**: default for interaction—interactive `@eN` refs, main-content-first trimming, and the usual click/fill workflow.
+- **`snapshot --compact`**: denser same-page tree when you still need refs but want less output than full interactive snapshot.
+- **Full `snapshot`** (no `-i`): use only when you need the complete accessibility tree; expect larger output and possible spill files.
+- Re-run `snapshot -i` after navigation, scrolling, rerendering, or other major DOM changes; refs are page-scoped.
+- When you need **three or more** `get title` / `get url` / `get text` / similar reads for known refs or selectors on the same page, prefer one `batch` stdin array (for example `[["get","text","@e1"],["get","text","@e2"]]`) instead of serial tool calls.
+## Wrapper `--json`
+The extension always plans normal browser commands with `--json` prepended in `effectiveArgs` so upstream returns structured JSON for presentation and `details`. **Do not** include `--json` in caller `args`; it is unnecessary and can confuse planning or transcript hooks that treat caller-requested JSON differently. Plain-text inspection (`--help`, `--version`) and read-only `skills list` / `skills get` / `skills path` keep their own output shapes and skip implicit session injection as documented under `sessionMode`.
 <!-- agent-browser-playbook:start shared-guidelines -->
 <!-- Generated from extensions/agent-browser/lib/playbook.ts. Run `npm run docs -- playbook write` to update. -->
 - Standard workflow: open the page, snapshot -i, interact using current @refs from that snapshot, and re-snapshot after navigation, scrolling, rerendering, or other major DOM changes because refs are page-scoped; the wrapper fails mutation-prone stale/recycled refs before upstream can silently target a different current-page element.
 - For ordinary forms from one snapshot, batch multiple fill @refs before the submit/click step to avoid serial tool calls; if a fill may autosubmit, navigate, or rerender later fields, split the flow and refresh refs first.
-- When snapshot -i compacts because the tree is oversized, scan visible output for Omitted high-value controls and optional details.data.highValueControlRefIds before opening the spill file: those list bounded searchboxes, textboxes, comboboxes, buttons, tabs, checkboxes, radios, options, and menuitems that did not fit the key/other ref previews.
+- Snapshot choice: prefer snapshot -i for routine clicks/fills (interactive @refs, main-content-first). Use snapshot --compact when you need a denser same-page tree without full spill; use full snapshot (no -i) only when you need the complete accessibility tree. Re-snapshot after navigation or major DOM changes. When snapshot -i compacts because the tree is oversized, scan visible output for Omitted high-value controls and optional details.data.highValueControlRefIds before opening the spill file: those list bounded searchboxes, textboxes, comboboxes, buttons, tabs, checkboxes, radios, options, and menuitems that did not fit the key/other ref previews.
 - When a visible text or accessible-name target should survive ref churn, prefer find locators such as role, text, label, placeholder, alt, title, or testid with the intended action instead of guessing a CSS selector.
 - For desktop or host-controlled rich inputs, if semanticAction fill misses, refresh refs and prefer a current editable @ref from details.richInputRecovery or the latest snapshot; focus or click that ref, then use keyboard inserttext or keyboard type with the intended text. Do not auto-submit with Enter or a submit button unless the user flow explicitly calls for it.
 - Do not assume Playwright selector dialects such as text=Close or button:has-text('Close') are supported wrapper syntax unless current upstream agent-browser behavior has been verified.
@@ -62,6 +89,9 @@ Agent-facing efficiency claims are measured with `npm run benchmark:agent-browse
 - When using eval --stdin for extraction, return the value you want instead of relying on console.log as the primary result channel. Prefer plain expressions like ({ title: document.title }) or explicitly invoked functions like (() => ({ title: document.title }))(); if a function-shaped snippet returns {}, details.evalStdinHint may warn that the function was serialized instead of called. If get text on a CSS selector surfaces details.selectorTextVisibility or selectorTextVisibilityAll, prefer a visible @ref, a more specific selector, or the inspect-visible-text-candidates nextAction over hidden tab content.
 - When details.pageChangeSummary is present, use changeType and summary as a compact signal for navigation, DOM mutation, confirmations, or artifacts; when nextActionIds is set, match those ids to entries in details.nextActions (or per-step nextActions inside batch) for concrete follow-up payloads instead of inferring from prose alone. If a no-navigation click surfaces details.overlayBlockers, inspect the fresh snapshot evidence before using a close/dismiss candidate nextAction; ordinary page chrome without dialog/alertdialog evidence should not trigger this diagnostic.
 - When commands save or spill files (screenshots, downloads, PDFs, traces, recordings, HAR, large snapshot spills), use the user's exact requested paths when given and treat paths as provisional until details.artifactVerification shows every row verified: branch on missingCount, pendingCount, unverifiedCount, per-entry state, and optional limitation before downstream file use or PASS/FAIL reporting.
+- For evidence-only screenshots, QA captures, or other audit artifacts, save to an explicit path and branch on details.artifactVerification plus details.artifacts before reporting PASS/FAIL; do not require vision review of inline image attachments unless the user asked for visual inspection.
+- Respect explicit user stop boundaries: if the user says to stop before order/post/purchase/submit, do not click that final action.
+- Successful record stop needs ffmpeg on PATH; the wrapper may warn after record start when ffmpeg is missing.
 - Do not call --help or other exploratory inspection commands unless the user explicitly asks for them or debugging the browser integration is necessary.
 <!-- agent-browser-playbook:end shared-guidelines -->
@@ -90,6 +120,8 @@ Illustrative shapes (each real call uses exactly one of `args`, `semanticAction`
 - exact CLI args passed after `agent-browser`
 - no shell operators
 - do not include the binary name
+- do not include `--json`; the wrapper injects it (see [Wrapper `--json`](#wrapper-json))
+- first-call recipe: `open` → `snapshot -i` → `click` / `fill` with current `@eN` refs from that snapshot → `snapshot -i` again after navigation or DOM changes
 Examples:
@@ -167,7 +199,9 @@ Examples:
   - `waitForDownload` with `path` (compiled as `wait --download <path>`)
   - `screenshot` with `path`
-Example:
+**Navigation assertions are explicit only.** `job` never treats a successful `click` (or a `select` / submit-style interaction that may navigate) as proof that the expected next page loaded. Top-level `click` may still surface optional `details.navigationSummary` or `pageChangeSummary` hints for operators, but compiled `job` / `batch` steps do **not** auto-insert `assertUrl` or `assertText` after clicks—there is no deterministic expected URL source without caller intent. After any navigation-prone step (link/submit clicks, checkout or form flows, tab-sensitive UI), add an explicit `assertUrl` with the destination pattern you expect, `assertText` for on-page copy, or both, **before** screenshots or steps that assume the new page state.
+Example (static landing page):
 ```json
 {
@@ -181,12 +215,29 @@ Example:
 }
 ```
-Compiled shape:
+Example (open → fill → click → assert destination → screenshot):
+```json
+{
+  "job": {
+    "steps": [
+      { "action": "open", "url": "https://shop.example/checkout" },
+      { "action": "fill", "selector": "#email", "text": "user@example.com" },
+      { "action": "click", "selector": "#continue" },
+      { "action": "assertUrl", "url": "**/shipping" },
+      { "action": "assertText", "text": "Shipping address" },
+      { "action": "screenshot", "path": ".dogfood/shipping.png" }
+    ]
+  }
+}
+```
+Compiled shape for the navigation example:
 ```json
 {
   "args": ["batch"],
-  "stdin": "[[\"open\",\"https://example.com\"],[\"wait\",\"--text\",\"Example Domain\"],[\"screenshot\",\".dogfood/example.png\"]]"
+  "stdin": "[[\"open\",\"https://shop.example/checkout\"],[\"fill\",\"#email\",\"user@example.com\"],[\"click\",\"#continue\"],[\"wait\",\"--url\",\"**/shipping\"],[\"wait\",\"--text\",\"Shipping address\"],[\"screenshot\",\".dogfood/shipping.png\"]]"
 }
 ```
@@ -202,11 +253,12 @@ Because `job` still executes as upstream `batch` with generated stdin, the same
 - optional; mutually exclusive with `args`, `semanticAction`, `job`, `sourceLookup`, `networkSourceLookup`, and `electron`
 - lightweight preset built on the same batch compiler path as `job`
 - URL form: clears enabled diagnostic buffers first (`network requests --clear`, `console --clear`, `errors --clear`), then opens `url`, waits with `wait --load <state>` using the resolved `loadState`, optionally asserts `expectedText` (string or string array) and/or `expectedSelector` (each may be omitted for a load-plus-diagnostics-only smoke), then runs enabled diagnostics: `network requests`, `console`, and `errors`
-- attached form: `qa: { attached: true, expectedText?, expectedSelector?, screenshotPath?, checkNetwork?, checkConsole?, checkErrors?, loadState? }` runs the same waits, optional assertions, diagnostics, and screenshot against the current attached managed session without opening a URL. It rejects `url` and cannot be used with `sessionMode: "fresh"`; attach first with `electron.launch` or raw `args: ["connect", "<port-or-url>"]`, then run `qa.attached`.
+- attached form: `qa: { attached: true, expectedText?, expectedSelector?, screenshotPath?, checkNetwork?, checkConsole?, checkErrors?, loadState? }` runs the same waits, optional assertions, diagnostics, and screenshot against the current attached managed session without opening a URL. It rejects `url` and cannot be used with `sessionMode: "fresh"`; attach first with `electron.launch` or raw `args: ["connect", "<port-or-url>"]`, then run `qa.attached`. Before spawning the diagnostic batch, the wrapper preflights the attached session: `get url` must succeed and return an `http:` or `https:` page URL. Missing URLs, read failures, and non-http(s) surfaces fail fast with `failureCategory: "validation-error"`, `details.validationError`, and recovery `nextActions` such as `list-tabs-before-qa-attached` and `snapshot-before-qa-attached` instead of running the full QA batch.
 - `loadState` is optional and must be `domcontentloaded`, `load`, or `networkidle`; it defaults to `domcontentloaded` so analytics-heavy or long-polling pages do not hang routine QA. Use `networkidle` only when the site is expected to go fully quiet.
 - `checkNetwork`, `checkConsole`, and `checkErrors` default to `true`; set a field to `false` to omit that diagnostic
 - optional `screenshotPath` adds an evidence screenshot step
 - reports `details.compiledQaPreset` with the compiled batch plan and resolved `loadState`, plus `details.qaPreset` with `{ passed, failedChecks, warnings, summary }`
+- on success with no failed checks, model-visible prose collapses to a compact pass summary (current page URL/title when known, checks run, optional screenshot path plus artifact verification, pointer to `details.qaPreset` and `details.batchSteps` for the full matrix). Failed QA and QA reclassified to `qa-failure` keep the verbose per-step batch output.
 - fails the native tool result with `failureCategory: "qa-failure"` when diagnostics report page errors, console error messages, actionable failed network requests, or any batch step failure. Benign classification (implementation: `classifyNetworkRequestFailure` → `isBenignAssetFailure` in `extensions/agent-browser/lib/results/network.ts`) applies only when the row is already treated as failed (`status >= 400`, `failed: true`, or a string `error`—see `isFailedNetworkRequest`), the URL path’s last segment matches the icon basename heuristic (`favicon` plus `.ico`/`.png`/`.svg`, or `apple-touch-icon` plus `.png`, each allowing an optional `[-.\w]*` stem suffix before the extension), **and** at least one of `status === 404`, `failed === true`, or `typeof request.error === "string"` holds (so a **status-only** failure such as `500` on that path with neither `failed` nor a string `error` stays actionable). It also requires the upstream `resourceType` / `mimeType` (whichever is present) to be absent or look image-like: `image`, `img`, `other`, or a value starting with `image/`. Those rows are counted in `qaPreset.warnings` (for example `N benign network request failure(s) ignored`) and omitted from the actionable failed-network tally; every other failed request stays actionable.
 Example:
@@ -358,7 +410,7 @@ For an app you launched manually with remote debugging enabled, skip `electron.c
 - type: object with at least one of `selector`, `reactFiberId`, or `componentName`
 - optional; mutually exclusive with `args`, `semanticAction`, `job`, `qa`, `networkSourceLookup`, and `electron`
-- experimental opt-in helper for local app debugging; it reports candidate source locations with confidence and evidence instead of claiming a guaranteed DOM-to-file mapping
+- **EXPERIMENTAL — candidates only:** opt-in helper for local app debugging; it reports candidate source locations with confidence and evidence instead of claiming a guaranteed DOM-to-file mapping. Do not treat output as authoritative file ownership or edit targets without verification.
 - compiles to existing upstream `batch` commands only:
   - `selector` adds `is visible <selector>` and, unless `includeDomHints: false`, adds `get html <selector>` for source-like DOM attributes (`data-source-file`, `data-file`, `data-component-file`, `data-source`, plus optional `data-source-line` / `data-line` and `data-source-column` / `data-column`) and for `.ts`/`.tsx`/`.js`/`.jsx` paths embedded in HTML text
   - `reactFiberId` runs `react inspect <id>`; this requires the page to have been launched with `--enable react-devtools` before first navigation and for the app build to expose source information
@@ -383,7 +435,7 @@ Use raw `args` for direct upstream React inspection when you already know the ex
 - type: object with at least one of `requestId`, `filter`, or `url`, plus optional `maxWorkspaceFiles`
 - optional; mutually exclusive with `args`, `semanticAction`, `job`, `qa`, `sourceLookup`, and `electron`
-- experimental failed-request source-hint helper; it reports failed network requests and candidate source hints with evidence instead of assigning blame
+- **EXPERIMENTAL — candidates only:** failed-request source-hint helper; it reports failed network requests and candidate source hints with evidence instead of assigning blame or proving root cause
 - compiles to existing upstream `batch` commands only: `network request <requestId>` when provided plus `network requests` with `--filter <filter-or-url>` when a filter or URL is provided (if both are set, `filter` wins; when only `url` is set, it becomes the `--filter` argument); optional `session` prepends `--session <name>` before that generated `batch`
 - detects failed requests from `status >= 400`, `failed: true`, or an `error` field
 - candidate sources come from source-like initiator/stack metadata in upstream network results and bounded local workspace search for URL/path literals under the Pi session cwd
@@ -442,6 +494,8 @@ Recommended use:
 ## Wrapper behavior
+Caller `args` should omit `--json`; the wrapper prepends it for normal execution so `details` and presentation stay structured. See [Wrapper `--json`](#wrapper-json).
 The extension should:
 - inject `--json`
 - invoke `agent-browser` directly, not through a shell

package/extensions/agent-browser/index.ts CHANGED Viewed

@@ -717,6 +717,7 @@ interface ElectronBroadGetTextScopeDiagnostic {
 }
 interface QaAttachedTarget {
+	error?: string;
 	sessionName: string;
 	title?: string;
 	url?: string;
@@ -3062,7 +3063,7 @@ export default function agentBrowserExtension(pi: ExtensionAPI) {
 		name: "agent_browser",
 		label: "Agent Browser",
 		description:
-			"Browse and interact with websites using agent-browser. Use this for web research, reading live docs, opening pages, taking snapshots or screenshots, clicking links, filling forms, extracting page content, and authenticated/profile-based browser work.",
+			"Browse and interact with websites using agent-browser. Use this for web research, reading live docs, opening pages, taking snapshots or screenshots, clicking links, filling forms, extracting page content, and authenticated/profile-based browser work. Input choice: default `args` for open → snapshot -i → click/fill @refs; `semanticAction` for stable role/text/label targets; `job` or `qa` for multi-step checks; `electron` only for desktop apps; experimental `sourceLookup` / `networkSourceLookup` for candidates only.",
 		promptSnippet:
 			"Browse websites, read live docs, click and fill pages, extract browser content, take screenshots, and automate real web workflows.",
 		promptGuidelines: toolPromptGuidelines,

package/extensions/agent-browser/lib/input-modes/job.ts CHANGED Viewed

@@ -4,6 +4,7 @@
  * Scope: Job and QA modes only.
  */
+import type { ArtifactVerificationSummary } from "../results/contracts.js";
 import { isRecord } from "../parsing.js";
 import { summarizeNetworkFailures } from "../results/network.js";
 import { getBatchResultItems, getCommandNameFromBatchItem, getSelectValues } from "./shared.js";
@@ -93,6 +94,67 @@ export function compileAgentBrowserJob(input: unknown): { compiled?: CompiledAge
 	return { compiled: { args: ["batch"], stdin: JSON.stringify(steps.map((step) => step.args)), steps } };
 }
+export function isHttpOrHttpsUrl(url: string): boolean {
+	try {
+		const protocol = new URL(url).protocol;
+		return protocol === "http:" || protocol === "https:";
+	} catch {
+		return false;
+	}
+}
+function describeQaChecksRun(checks: CompiledAgentBrowserQaPreset["checks"]): string {
+	const parts = [`load:${checks.loadState}`];
+	if (checks.expectedText.length > 0) parts.push(`text×${checks.expectedText.length}`);
+	if (checks.expectedSelector) parts.push("selector");
+	if (checks.checkNetwork) parts.push("network");
+	if (checks.checkConsole) parts.push("console");
+	if (checks.checkErrors) parts.push("errors");
+	if (checks.screenshotPath) parts.push("screenshot");
+	return parts.join(", ");
+}
+export function extractQaPageContext(options: {
+	attachedTarget?: { title?: string; url?: string };
+	batchData?: unknown;
+	compiled?: CompiledAgentBrowserQaPreset;
+}): { title?: string; url?: string } {
+	if (options.attachedTarget?.title || options.attachedTarget?.url) {
+		return { title: options.attachedTarget.title, url: options.attachedTarget.url };
+	}
+	for (const item of getBatchResultItems(options.batchData)) {
+		if (getCommandNameFromBatchItem(item) !== "open" || !isRecord(item.result)) continue;
+		const url = typeof item.result.url === "string" ? item.result.url : undefined;
+		const title = typeof item.result.title === "string" ? item.result.title : undefined;
+		if (url || title) return { title, url };
+	}
+	if (options.compiled?.checks.url) {
+		return { url: options.compiled.checks.url };
+	}
+	return {};
+}
+export function buildQaCompactPassText(options: {
+	artifactVerification?: ArtifactVerificationSummary;
+	batchStepCount: number;
+	checks: CompiledAgentBrowserQaPreset["checks"];
+	page?: { title?: string; url?: string };
+	qaPreset: AgentBrowserQaPresetAnalysis;
+}): string {
+	const lines = [options.qaPreset.summary];
+	const pageParts = [options.page?.title, options.page?.url].filter((part): part is string => typeof part === "string" && part.length > 0);
+	if (pageParts.length > 0) lines.push(`Page: ${pageParts.join(" — ")}`);
+	lines.push(`Checks run: ${describeQaChecksRun(options.checks)} (${options.batchStepCount} batch step${options.batchStepCount === 1 ? "" : "s"})`);
+	if (options.checks.screenshotPath) {
+		const verification = options.artifactVerification;
+		lines.push(verification
+			? `Screenshot: ${options.checks.screenshotPath} (${verification.verifiedCount}/${verification.artifacts.length} verified on disk)`
+			: `Screenshot: ${options.checks.screenshotPath}`);
+	}
+	lines.push("Full diagnostic matrix: see details.qaPreset and details.batchSteps.");
+	return lines.join("\n");
+}
 export function analyzeQaPresetResults(data: unknown): AgentBrowserQaPresetAnalysis | undefined {
 	const items = getBatchResultItems(data);
 	if (items.length === 0) return undefined;

package/extensions/agent-browser/lib/input-modes/params.ts CHANGED Viewed

@@ -25,8 +25,8 @@ import {
 export const AGENT_BROWSER_PARAMS = Type.Object({
 	args: Type.Optional(
-		Type.Array(Type.String({ description: "Exact agent-browser CLI arguments, excluding the binary name." }), {
-			description: "Exact agent-browser CLI arguments, excluding the binary name and any shell operators. Required unless semanticAction, job, qa, sourceLookup, networkSourceLookup, or electron is provided.",
+		Type.Array(Type.String({ description: "Exact agent-browser CLI arguments, excluding the binary name. Do not pass --json; the wrapper injects it. First-call recipe: open → snapshot -i → click/fill @eN → snapshot -i." }), {
+			description: "Exact agent-browser CLI arguments, excluding the binary name and any shell operators. Required unless semanticAction, job, qa, sourceLookup, networkSourceLookup, or electron is provided. Do not include --json (wrapper injects it). Typical first calls: open, snapshot -i, click/fill current @refs, then snapshot -i again after navigation or DOM changes.",
 			minItems: 1,
 		}),
 	),
@@ -79,7 +79,7 @@ export const AGENT_BROWSER_PARAMS = Type.Object({
 			componentName: Type.Optional(Type.String({ description: "Component name to correlate with react tree output and bounded local workspace search." })),
 			includeDomHints: Type.Optional(Type.Boolean({ description: "Whether selector lookups should inspect DOM HTML attributes for source-like metadata. Defaults to true." })),
 			maxWorkspaceFiles: Type.Optional(Type.Number({ description: "Maximum local source files to scan when componentName is provided. Defaults to 2000 and cannot exceed 5000.", minimum: 1, maximum: SOURCE_LOOKUP_MAX_WORKSPACE_FILES })),
-		}),
+		}, { description: "EXPERIMENTAL: local UI-to-source candidates only (confidence/evidence, not guaranteed mappings). Compiles to batch; mutually exclusive with other input modes." }),
 	),
 	networkSourceLookup: Type.Optional(
 		Type.Object({
@@ -88,7 +88,7 @@ export const AGENT_BROWSER_PARAMS = Type.Object({
 			session: Type.Optional(Type.String({ description: "Optional upstream session name; prepends --session <name> before the generated batch." })),
 			url: Type.Optional(Type.String({ description: "Optional failed request URL or URL fragment to correlate with local source." })),
 			maxWorkspaceFiles: Type.Optional(Type.Number({ description: "Maximum local source files to scan for URL literals. Defaults to 2000 and cannot exceed 5000.", minimum: 1, maximum: SOURCE_LOOKUP_MAX_WORKSPACE_FILES })),
-		}),
+		}, { description: "EXPERIMENTAL: failed-request-to-source candidates only (initiator metadata and bounded workspace URL literals; not definitive blame). Compiles to batch; mutually exclusive with other input modes." }),
 	),
 	electron: Type.Optional(
 		Type.Union([