pi-agent-browser-native 0.2.46 → 0.2.48

Files changed (38)
  1. package/CHANGELOG.md +64 -20
  2. package/README.md +45 -20
  3. package/docs/ARCHITECTURE.md +14 -14
  4. package/docs/COMMAND_REFERENCE.md +37 -23
  5. package/docs/ELECTRON.md +3 -3
  6. package/docs/RELEASE.md +33 -24
  7. package/docs/REQUIREMENTS.md +4 -4
  8. package/docs/SUPPORT_MATRIX.md +34 -106
  9. package/docs/TOOL_CONTRACT.md +24 -22
  10. package/docs/platform-smoke.md +2 -2
  11. package/extensions/agent-browser/index.ts +20 -2
  12. package/extensions/agent-browser/lib/config-policy.js +16 -5
  13. package/extensions/agent-browser/lib/config.ts +17 -4
  14. package/extensions/agent-browser/lib/input-modes/job.ts +138 -62
  15. package/extensions/agent-browser/lib/input-modes/params.ts +2 -2
  16. package/extensions/agent-browser/lib/orchestration/browser-run/artifact-paths.ts +44 -0
  17. package/extensions/agent-browser/lib/orchestration/browser-run/click-dispatch.ts +42 -19
  18. package/extensions/agent-browser/lib/orchestration/browser-run/diagnostics.ts +6 -4
  19. package/extensions/agent-browser/lib/orchestration/browser-run/final-result.ts +18 -9
  20. package/extensions/agent-browser/lib/orchestration/browser-run/prepare/direct-anchor-download.ts +158 -0
  21. package/extensions/agent-browser/lib/orchestration/browser-run/prepare/network-page-filter.ts +116 -0
  22. package/extensions/agent-browser/lib/orchestration/browser-run/prepare/scroll-shims.ts +147 -0
  23. package/extensions/agent-browser/lib/orchestration/browser-run/prepare/snapshot-filter.ts +183 -0
  24. package/extensions/agent-browser/lib/orchestration/browser-run/prepare/wait-timeouts.ts +58 -0
  25. package/extensions/agent-browser/lib/orchestration/browser-run/prepare.ts +19 -653
  26. package/extensions/agent-browser/lib/orchestration/browser-run/process-output.ts +1 -6
  27. package/extensions/agent-browser/lib/orchestration/browser-run/session-artifacts.ts +8 -0
  28. package/extensions/agent-browser/lib/orchestration/browser-run/types.ts +1 -0
  29. package/extensions/agent-browser/lib/pi-tool-rendering.ts +34 -19
  30. package/extensions/agent-browser/lib/playbook.ts +4 -4
  31. package/extensions/agent-browser/lib/results/action-recommendations.ts +3 -3
  32. package/extensions/agent-browser/lib/web-search.ts +11 -4
  33. package/package.json +4 -4
  34. package/scripts/agent-browser-capability-baseline.mjs +6 -3
  35. package/scripts/doctor.mjs +12 -11
  36. package/scripts/platform-smoke/platform-build-windows.ps1 +2 -2
  37. package/scripts/platform-smoke/targets.mjs +7 -3
  38. package/scripts/platform-smoke.mjs +2 -2
package/docs/RELEASE.md CHANGED
@@ -8,9 +8,9 @@ Related docs:
  - [`ELECTRON.md`](ELECTRON.md)
  - [`platform-smoke.md`](platform-smoke.md)
  - [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md)
- - Bounded `agent_browser` outcome metadata on `details` (`resultCategory`, `successCategory`, `failureCategory`, optional `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`): contract in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); maintainer checklists under “Tool result categories” and “Page-change summaries” in [`../AGENTS.md`](../AGENTS.md)
- - Post-success `get text` selector visibility (`RQ-0074`): optional `details.selectorTextVisibility` / `selectorTextVisibilityAll`, visible warnings, and `inspect-visible-text-candidates*` next actions after read-only visibility probes—[`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), and [`../AGENTS.md`](../AGENTS.md) maintainer checklist
- - Managed-session outcomes (`RQ-0077`): after extension-managed implicit or fresh `--session` injection reaches process execution, `details.managedSessionOutcome` records the transition (`created` / `replaced` / `unchanged` / `closed` on success; `preserved` / `abandoned` when a plan fails before a new session becomes current). Failing `sessionMode: "fresh"` calls also append model-visible `Managed session outcome: …`—[`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md), [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md), and [`../AGENTS.md`](../AGENTS.md) maintainer checklist
+ - Bounded `agent_browser` outcome metadata on `details` (`resultCategory`, `successCategory`, `failureCategory`, optional `nextActions`, optional `pageChangeSummary` with per-step summaries on `batch`): contract in [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); maintainer checklists under “Tool result categories” and “Page-change summaries” in [`../AGENTS.md`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/AGENTS.md)
+ - Post-success `get text` selector visibility (`RQ-0074`): optional `details.selectorTextVisibility` / `selectorTextVisibilityAll`, visible warnings, and `inspect-visible-text-candidates*` next actions after read-only visibility probes—[`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md), [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), and [`../AGENTS.md`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/AGENTS.md) maintainer checklist
+ - Managed-session outcomes (`RQ-0077`): after extension-managed implicit or fresh `--session` injection reaches process execution, `details.managedSessionOutcome` records the transition (`created` / `replaced` / `unchanged` / `closed` on success; `preserved` / `abandoned` when a plan fails before a new session becomes current). Failing `sessionMode: "fresh"` calls also append model-visible `Managed session outcome: …`—[`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details), [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md), [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md), and [`../AGENTS.md`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/AGENTS.md) maintainer checklist
  - Stateful context commands (`cookies`, `storage`, `auth`, `dialog`, `frame`, `state`) and aggregate `batch` results: model-facing `details.data` is summarized or redacted per [`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details); aggregate `batch` replaces top-level `details.data` with a compact per-step matrix (`success`, argv-redacted `command`, redacted `result` or scrubbed `error`) while full per-step payloads, artifacts, and categories remain on `batchSteps[]`—operational notes in [`COMMAND_REFERENCE.md`](COMMAND_REFERENCE.md#use-stateful-browser-context-commands-safely), assembly in `extensions/agent-browser/lib/results/presentation/batch.ts`
 
  ## Purpose
@@ -30,7 +30,15 @@ npm run smoke:platform:doctor
  npm run verify -- release
  ```
 
- `npm run doctor` is a read-only first-run diagnostic for PATH, targeted upstream version, the recommended Pi release floor, and duplicate package/checkout source conflicts. The Pi version check is a warning, not a hard runtime requirement. It does not replace upstream `agent-browser doctor` for browser runtime health and does not edit Pi settings.
+ `npm run doctor` is a read-only first-run diagnostic for PATH, targeted upstream version, the minimum Pi runtime floor, and duplicate package/checkout source conflicts. The package keeps Pi core imports as wildcard `peerDependencies` because installed Pi package docs require the host Pi install to provide those packages, while the doctor fails setup when `pi --version` is below the enforced floor. It does not replace upstream `agent-browser doctor` for browser runtime health and does not edit Pi settings.
+
+ For PR-ready local confidence before release-only lifecycle and platform cost, run:
+
+ ```bash
+ npm run verify -- pre-pr
+ ```
+
+ `pre-pr` composes the default gate with `npm run verify -- package`: generated docs, TypeScript, the full unit/fake suite, live command-reference sampling, and package-content verification. It intentionally does not run lifecycle, packaged Pi smoke, Crabbox platform smoke, real-upstream, dogfood, or benchmark modes.
 
  `npm run verify -- release` runs:
 
@@ -41,7 +49,7 @@ npm run verify -- release
 
  `npm publish` runs npm’s `prepublishOnly` script from `package.json`, which executes the same `npm run verify -- release` gate and then `npm pack --dry-run`. That concatenated gate is everything in the default `npm run verify` step (generated playbook drift, TypeScript, the unit/fake suite, generated command-reference blocks, and live upstream command-reference sampling against the targeted `agent-browser` on `PATH`), the configured-source lifecycle harness, the packaged Pi smoke in `package-pi`, and the release-blocking Crabbox platform matrix. Using `npm publish --ignore-scripts` skips that contract intentionally.
 
- `prepublishOnly` intentionally does **not** run the standalone host-only `npm run verify -- real-upstream`, `npm run verify -- dogfood`, or `npm run verify -- benchmark` modes; those remain separate `npm run verify` modes in [`scripts/project.mjs`](../scripts/project.mjs). The platform matrix includes its own fast target-local build/package gate and browser dogfood suite, and is automated through the `release` slice.
+ `prepublishOnly` intentionally does **not** run the standalone host-only `npm run verify -- real-upstream`, `npm run verify -- dogfood`, or `npm run verify -- benchmark` modes; those remain separate `npm run verify` modes in [`scripts/project.mjs`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/scripts/project.mjs). The platform matrix includes its own fast target-local build/package gate and browser dogfood suite, and is automated through the `release` slice.
 
  For a deterministic host-only real-browser wrapper smoke without model choice in the loop, run:
 
@@ -64,11 +72,11 @@ The Crabbox gate is only green when suite assertions and artifact manifests unde
 
  The deterministic dogfood mode uses the extension harness and the real `agent-browser` on `PATH` against a deterministic local file fixture, then verifies top-level `qa`, `semanticAction`, constrained `job`, screenshot artifact verification, and session close. Use `npm run verify -- dogfood --keep-artifacts` or `--artifact-dir <path>` only while debugging, then delete retained screenshots. This smoke complements, but does not replace, human-readable interactive transcript evidence.
 
- Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --no-extensions --no-skills -e .` from the checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Automated localhost, fake-upstream, and deterministic dogfood gates do not replace this human-readable live-site transcript evidence. When `agent_browser_web_search` or package config changed, add one key-free smoke proving the optional tool is absent without config, one fake/unit-backed smoke in the default suite, and one opt-in live Exa or Brave Search check with a real key while confirming the key does not appear in transcripts, stdout/stderr, config status, PR text, or artifacts. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
+ Every release also requires interactive `tmux`-driven Pi dogfood with the native `agent_browser` tool against real sites. For extension-focused release smokes, use `pi --approve --no-extensions --no-skills -e .` from the trusted checkout before publish so auto-loaded dogfood/QA skills cannot replace the bounded smoke workflow; omit `--approve` only when the smoke is explicitly testing Pi's Project Trust prompt. Run separate skill-enabled dogfood only when validating skill routing or report-generation behavior. Drive prompts with `tmux send-keys`, exercise at least one simple static site and one real documentation/product site, include the higher-level `qa` or `job`/`batch` surfaces when they changed, close every opened browser session, remove screenshots/temp artifacts, and record the outcome in the release notes or support-matrix evidence. Do not paste raw multi-line prompts into a tmux Pi pane: plain newlines submit separate queued user messages. For scripted smoke driving, collapse prompt files to one line before sending (`PROMPT=$(tr '\n' ' ' < /tmp/smoke-prompt.md); tmux send-keys -t "$SESSION":0.0 -l "$PROMPT"; tmux send-keys -t "$SESSION":0.0 Enter`). For manual multi-line editing, use Pi's external editor shortcut (`Ctrl+G`) or configure tmux extended keys so Pi can receive `Shift+Enter` for newlines; see the installed Pi `docs/tmux.md` guidance. Automated localhost, fake-upstream, and deterministic dogfood gates do not replace this human-readable live-site transcript evidence. When `agent_browser_web_search` or package config changed, add one key-free smoke proving the optional tool is absent without config, one fake/unit-backed smoke in the default suite, and one opt-in live Exa or Brave Search check with a real key while confirming the key does not appear in transcripts, stdout/stderr, config status, PR text, or artifacts. When `electron.*` surfaces, attached-session diagnostics, or `qa.attached` changed, add a local Electron pass: `electron.list` → `electron.launch` (expect isolated profile behavior) → `snapshot -i` or `electron.probe` / `qa.attached` → `electron.cleanup` with the returned `launchId`, verifying status/mismatch guidance if you simulate a dead renderer or stale refs. For dense-dashboard stress coverage, use the [public Grafana stress checklist](#public-grafana-stress-checklist) below; it is a maintainer workflow, not bundled product skill or recipe runtime.
 
- When reviewing saved session JSONL after a failed smoke or a `qa` preset that reclassified an upstream-successful batch, expect `agent_browser` tool rows to carry `isError: true` whenever `details.resultCategory` is `failure`. For normal prose output, model-visible text should end with a `Pi tool isError: true` category line; for caller-requested `--json` output, the hook preserves parseable JSON and only patches `isError`. The extension applies that patch on the `tool_result` path so Pi’s transcript matches the wrapper contract ([`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details)). Preserve a normal Pi session directory for those checks; avoiding `--no-session` keeps this evidence intact ([`AGENTS.md`](../AGENTS.md) preferred validation workflow).
+ When reviewing saved session JSONL after a failed smoke or a `qa` preset that reclassified an upstream-successful batch, expect `agent_browser` tool rows to carry `isError: true` whenever `details.resultCategory` is `failure`. For normal prose output, model-visible text should end with a `Pi tool isError: true` category line; for caller-requested `--json` output, the hook preserves parseable JSON and only patches `isError`. The extension applies that patch on the `tool_result` path so Pi’s transcript matches the wrapper contract ([`TOOL_CONTRACT.md`](TOOL_CONTRACT.md#details)). Preserve a normal Pi session directory for those checks; avoiding `--no-session` keeps this evidence intact ([`AGENTS.md`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/AGENTS.md) preferred validation workflow).
 
- The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` and validates `/reload`, full relaunch with the same exact Pi 0.78 `--session-id`, managed-session continuity, persisted artifacts, and Pi failure-patch behavior. Branch-backed `session_tree` rehydration and cleanup ownership are validated by focused extension harness tests:
+ The configured-source lifecycle regression harness is required before release because it launches an interactive `pi` process under `tmux` with `--approve` and validates `/reload`, full relaunch with the same exact Pi 0.79 `--session-id`, managed-session continuity, persisted artifacts, and Pi failure-patch behavior. Branch-backed `session_tree` rehydration and cleanup ownership are validated by focused extension harness tests:
 
  ```bash
  npm run verify -- lifecycle
@@ -106,11 +114,13 @@ Use this validation prompt after changing click enrichment, tab pinning, ref pre
  Run it in an isolated checkout session with skills disabled so the run validates the extension browser workflow instead of external dogfood/QA skill routing. It is fine to restrict active tools at launch so the checkout extension is the only browser surface, but keep those launch details out of the user prompt:
 
  ```bash
- pi --no-extensions --no-skills -e . --model openai-codex/gpt-5.5:minimal --tools agent_browser --session-dir "$SESSION_DIR"
+ pi --approve --no-extensions --no-skills -e . --model openai-codex/gpt-5.5:minimal --tools agent_browser --session-dir "$SESSION_DIR"
  ```
 
  Repeat with `--model openai-codex/gpt-5.5:medium` when validating instruction-following robustness. Use unique temp paths for each run and delete them afterward. Run separate skill-enabled dogfood sessions only when the thing under test is skill integration, not this bounded release smoke.
 
+ Submit the prompt as one Pi message. In tmux automation, write it to a temp file with placeholders replaced, collapse newlines to spaces, and send that one line; for manual multiline entry, use Pi's `Ctrl+G` external editor or a tmux setup that preserves `Shift+Enter` newlines. Do not paste the raw block into a tmux pane line-by-line.
+
  Copy/paste prompt, replacing the two artifact placeholders with exact absolute paths:
 
  ```text
@@ -142,13 +152,14 @@ Evaluator expectations after the queued Sauce Demo fixes: the agent should indep
 
  ## Deterministic agent efficiency benchmark
 
- [`scripts/agent-browser-efficiency-benchmark.mjs`](../scripts/agent-browser-efficiency-benchmark.mjs) is an accounting-only benchmark: it does not shell out to `agent-browser`, launch a browser, or read or write Pi sessions. It models representative `agent_browser` call shapes (including optional `stdin` for `batch` and top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` objects that compile to batch) and aggregates success rate, tool-call counts, UTF-8 size of model-visible strings, stale-ref failure and recovery counts, artifact success, distinct failure-category coverage, and summed elapsed-time estimates. When extending scenarios, keep them aligned with the closed `RQ-0068` “no reusable recipe layer” rationale in [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet) (benchmark ids cited there are the canonical inventory for that evidence bar).
+ [`scripts/agent-browser-efficiency-benchmark.mjs`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/scripts/agent-browser-efficiency-benchmark.mjs) is an accounting-only benchmark: it does not shell out to `agent-browser`, launch a browser, or read or write Pi sessions. It models representative `agent_browser` call shapes (including optional `stdin` for `batch` and top-level `job`, `qa`, or experimental `sourceLookup` / `networkSourceLookup` objects that compile to batch) and aggregates success rate, tool-call counts, UTF-8 size of model-visible strings, stale-ref failure and recovery counts, artifact success, distinct failure-category coverage, and summed elapsed-time estimates. When extending scenarios, keep them aligned with the closed `RQ-0068` “no reusable recipe layer” rationale in [`ARCHITECTURE.md`](ARCHITECTURE.md#no-reusable-recipe-layer-yet) (benchmark ids cited there are the canonical inventory for that evidence bar).
 
  - **During development:** `npm run benchmark:agent-browser` prints a Markdown report; `npm run benchmark:agent-browser -- --json` saves machine-readable metrics; `npm run benchmark:agent-browser -- --compare path/to/prior.json` fails with exit code `1` on regressions (see the script’s `--help` for exit codes). Optional `--sample-jsonl path/to/session.jsonl` adds a `jsonlSample` section with real UTF-8 byte totals and per-workflow/overall p95 sizes for model-visible `agent_browser` tool-result text without changing deterministic scenario metrics; comparison ignores `jsonlSample` blocks.
- - **Default gate:** `npm run verify` checks generated playbook drift, runs `tsc --noEmit`, runs the full unit/fake suite under `test/**/*.test.ts` with Node test concurrency pinned to `1` (including [`test/agent-browser.efficiency-benchmark.test.ts`](../test/agent-browser.efficiency-benchmark.test.ts) for scenario coverage and comparison behavior), verifies generated command-reference baseline blocks, and samples live upstream command-reference tokens. It does not spawn the standalone benchmark script’s JSON/Markdown run; that is what the opt-in slice below adds.
- - **Opt-in slice:** `npm run verify -- benchmark` runs the benchmark script once with `--json` and then that same test module alone. It is intentionally **not** part of `npm run verify -- release`, so routine publish gates stay decoupled from benchmark churn while still allowing a focused check after editing scenarios or `CURRENT_BENCHMARK_VERSION`.
+ - **Default gate:** `npm run verify` checks generated playbook drift, runs `tsc --noEmit`, runs the full unit/fake suite under `test/**/*.test.ts` with Node test concurrency pinned to `1` (including [`test/agent-browser.efficiency-benchmark.test.ts`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/test/agent-browser.efficiency-benchmark.test.ts) for scenario coverage and comparison behavior), verifies generated command-reference baseline blocks, and samples live upstream command-reference tokens. It does not spawn the standalone benchmark script’s JSON/Markdown run; that is what the opt-in slice below adds.
+ - **Pre-PR gate:** `npm run verify -- pre-pr` runs the default gate plus `npm run verify -- package` for larger handoffs that need package-content confidence without lifecycle, platform, real-upstream, dogfood, or benchmark cost.
+ - **Opt-in slice:** `npm run verify -- benchmark` runs the benchmark script once with `--json` and then that same test module alone. It is intentionally **not** part of `npm run verify -- pre-pr` or `npm run verify -- release`, so routine handoff and publish gates stay decoupled from benchmark churn while still allowing a focused check after editing scenarios or `CURRENT_BENCHMARK_VERSION`.
 
- Maintainer constraints for evolving scenarios and version bumps are summarized under “Agent browser efficiency benchmark” in [`../AGENTS.md`](../AGENTS.md).
+ Maintainer constraints for evolving scenarios and version bumps are summarized under “Agent browser efficiency benchmark” in [`../AGENTS.md`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/AGENTS.md).
 
  ## What package verification checks
 
@@ -174,9 +185,7 @@ The packaged execution smoke intentionally uses a temporary fake `agent-browser`
  Current forbidden packed files include:
 
  - `AGENTS.md`
- - `docs/IMPLEMENTATION_PLAN.md`
- - `docs/native-integration-design.md`
- - `docs/v1-tool-contract.md`
+ - archived planning drafts under `docs/archive/`
  - `.pi/extensions/agent-browser.ts`
  - test and repo-only maintenance files
 
@@ -193,7 +202,7 @@ Before publishing, validate both local-checkout modes without mixing their assum
  ### Quick isolated checkout smoke test
 
  1. Install `agent-browser` separately.
- 2. Launch `pi --no-extensions -e .` from this repository root.
+ 2. Launch `pi --approve --no-extensions -e .` from this trusted repository root. Omit `--approve` only when testing Pi's Project Trust prompt.
  3. Confirm the checkout extension loads from `extensions/agent-browser/index.ts`.
  4. Run a smoke prompt that exercises `agent_browser`.
  5. Restart the `pi` process after extension edits; Pi settings and `/reload` are not the validation target in this isolated mode.
@@ -212,7 +221,7 @@ Run the automated harness for deterministic configured-source lifecycle regressi
  npm run verify -- lifecycle
  ```
 
- The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs `pi` in `tmux` with default model **`zai/glm-5.1`** and a deterministic `--session-id`, puts a deterministic fake `agent-browser` first on `PATH`, drives `/reload`, closes Pi, and relaunches with the same exact session id instead of typing `/resume`. It also asserts the JSONL session header id, same-page managed-session continuity, persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification. Per-step tmux waits default to **180000 ms** (three minutes) in [`scripts/verify-lifecycle.mjs`](../scripts/verify-lifecycle.mjs) (`DEFAULT_TIMEOUT_MS`); override with `--timeout-ms <ms>` when slower models or cold starts need more headroom. Override the model when needed:
+ The harness creates an isolated `PI_CODING_AGENT_DIR`, writes settings with exactly one temporary configured package source, runs `pi` in `tmux` with `--approve`, default model **`zai/glm-5.1`**, and a deterministic `--session-id`, puts a deterministic fake `agent-browser` first on `PATH`, drives `/reload`, closes Pi, and relaunches with the same exact session id instead of typing `/resume`. It also asserts the JSONL session header id, same-page managed-session continuity, persisted spill reachability, and real Pi `tool_result` failure-patch semantics for a QA reclassification. Per-step tmux waits default to **180000 ms** (three minutes) in [`scripts/verify-lifecycle.mjs`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/scripts/verify-lifecycle.mjs) (`DEFAULT_TIMEOUT_MS`); override with `--timeout-ms <ms>` when slower models or cold starts need more headroom. Override the model when needed:
 
  ```bash
  npm run verify -- lifecycle --model openai-codex/gpt-5.5:minimal
@@ -234,7 +243,7 @@ These show up often in cloud dev boxes and scripted smokes; they are maintainer
 
  | Topic | What to watch for | Mitigation |
  | --- | --- | --- |
- | **Pi CLI vs repo devDependencies** | Global `pi` older than the recommended Pi floor for the release can change TUI behavior, `/reload`, package installs, and tool routing during lifecycle or checkout smokes. | Run `npm run doctor` and align `pi` with the current audited baseline before release gates (`pi update` or install the matching version). The published peer range stays non-pinning; the local release gate should use the audited Pi version. |
+ | **Pi CLI vs repo devDependencies** | Global `pi` older than the minimum Pi runtime floor for the release can change TUI behavior, `/reload`, package installs, and tool routing during lifecycle or checkout smokes. | Run `npm run doctor` and align `pi` with the current audited baseline before release gates (`pi update` or install the matching version). The published peer range stays wildcard per Pi package docs, and the doctor enforces the minimum Pi runtime floor before package validation. |
  | **npm lockfile (`packageManager`)** | `package.json` pins **npm@11**. npm 10 may only strip optional `libc` metadata on `@esbuild/*` platform entries in `package-lock.json` (no dependency version change). | Prefer `npx -y npm@11.14.0 install` when refreshing the lockfile; do not commit npm-10-only lockfile churn. |
  | **`pi -p` / print mode** | Non-interactive `pi -p` may hang or emit no stdout for long real-browser smokes without a TTY. | Use **tmux**-driven interactive `pi` for release evidence and checkout smokes; reserve `-p` for short, non-browser checks. |
  | **Real-browser cleanup** | `real-upstream`, Sauce Demo, and live-site runs can leave defunct Chrome/`agent-browser` children if a session aborts mid-flow. | Close via `agent_browser` / `agent-browser` `close`, kill stray tmux sessions, and remove temp screenshots/HARs under `/tmp` or your chosen artifact dirs. |
@@ -261,11 +270,11 @@ That npm script sets `PI_AGENT_BROWSER_REAL_UPSTREAM=1` for the test process. To
  This suite requires the installed `agent-browser --version` to exactly match `scripts/agent-browser-capability-baseline.mjs`. It serves fixture pages from localhost and checks stable `details`/`data` keys via `test/fixtures/agent-browser-real-output-shapes.json`. Coverage groups:
 
  - **Inspection and skills (stateless JSON):** `--version`, `--help`, `snapshot --help`, `skills list`, `skills get … --full`, `skills path …` (no managed `sessionName` / `usedImplicitSession`).
- - **Managed session core and safe diagnostic matrix:** fresh `open` on the contract fixture, then implicit reuse across `eval --stdin`, `snapshot -i`, interaction commands (`click`, `dblclick`, `fill`, `type`, `focus`, `keyboard` with `type` / `inserttext`, `press`, `hover`, `check`, `uncheck`, `select`, `upload`, `drag`, `mouse`, `scroll`, `scrollintoview`, `wait` on a selector), extraction (`get` variants, `is` variants, label `find … fill`, inline `eval`), file outputs (`screenshot`, `pdf`), navigation (`back`, `forward`, `reload`, `tab list`, another `open` to the same fixture), `batch` stdin, `pushstate`, `vitals … --json`, network route/requests/HAR, diff snapshot/screenshot/url, trace/profiler, console/errors/highlight, stream enable/status/disable, and `cookies set --curl`.
+ - **Managed session core and safe diagnostic matrix:** fresh `open` on the contract fixture, then implicit reuse across `eval --stdin`, `snapshot -i`, interaction commands (`click`, `dblclick`, `fill`, `type`, `type --clear --delay`, `focus`, `keyboard` with `type` / `inserttext`, `press`, `hover`, `check`, `uncheck`, `select`, failed `select` no-match, `upload`, `drag`, `mouse`, `scroll`, off-viewport click, `scrollintoview`, `wait` on selectors in the main frame and a selected iframe), extraction (`get` variants, `is` variants, `find label … fill` via native `<label>`, `aria-label`, and `aria-labelledby`, inline `eval`), file outputs (`screenshot`, `pdf`), navigation (`back`, `forward`, `reload`, `tab list`, another `open` to the same fixture), `batch` stdin, `pushstate`, `vitals … --json`, network route/requests/HAR, diff snapshot/screenshot/url, trace/profiler, console/errors/highlight, stream enable/status/disable, and `cookies set --curl`.
  - **Failure shape:** `react tree` on a page opened with `--enable react-devtools` but without a React app (expects a clear missing-renderer error with session-bound `details`).
  - **Async download:** `open` on the `/download` fixture, anchor-triggered export, then `wait --download <path>` metadata and wrapper artifact reporting for the requested path.
 
- The default unit suite also runs `agentBrowserExtension passes through core command coverage fallback matrix` in [`test/agent-browser.extension-validation.test.ts`](../test/agent-browser.extension-validation.test.ts): a fake upstream records argv so `connect 9222`, `download` with a selector and path, `get url`, `snapshot --compact`, and `tab new` / `tab 0` / `tab close` still prove `--json` plus implicit `--session` ordering without a browser. A second fake-upstream matrix in the same file (`agentBrowserExtension passes through non-core network debug diff stream dashboard and chat families`) pins representative `network`, `diff`, `trace` / `profiler` / `record`, `console` / `errors` / `highlight` / `inspect` / `clipboard`, `stream`, `dashboard`, and `chat` JSON shapes plus redacted `details.data` and argv echoes without a browser. A third matrix (`agentBrowserExtension passes through provider and specialized skill workflows`) asserts provider `open` argv shapes still receive `--json` plus implicit `--session` while read-only `skills get …` stays stateless (no managed session fields) and provider credential env vars are forwarded into the fake upstream log. Extend those matrices when adding passthrough coverage that should stay out of the slow real-upstream loop.
+ The default unit suite also runs `agentBrowserExtension passes through core command coverage fallback matrix` in [`test/agent-browser.extension-validation.test.ts`](https://github.com/fitchmultz/pi-agent-browser-native/blob/main/test/agent-browser.extension-validation.test.ts): a fake upstream records argv so `connect 9222`, `download` with a selector and path, `get url`, `snapshot --compact`, and `tab new` / `tab 0` / `tab close` still prove `--json` plus implicit `--session` ordering without a browser. A second fake-upstream matrix in the same file (`agentBrowserExtension passes through non-core network debug diff stream dashboard and chat families`) pins representative `network`, `diff`, `trace` / `profiler` / `record`, `console` / `errors` / `highlight` / `inspect` / `clipboard`, `stream`, `dashboard`, and `chat` JSON shapes plus redacted `details.data` and argv echoes without a browser. A third matrix (`agentBrowserExtension passes through provider and specialized skill workflows`) asserts provider `open` argv shapes still receive `--json` plus implicit `--session` while read-only `skills get …` stays stateless (no managed session fields) and provider credential env vars are forwarded into the fake upstream log. Extend those matrices when adding passthrough coverage that should stay out of the slow real-upstream loop.
 
  ### Real upstream suite mechanics, isolation, and troubleshooting
 
@@ -280,7 +289,7 @@ The default unit suite also runs `agentBrowserExtension passes through core comm
  - **Missing or extra `details` / `data` keys:** Update `test/fixtures/agent-browser-real-output-shapes.json` in the same change as the wrapper or presentation code that shifts those keys.
  - **Timeouts:** A 120s bound covers the full matrix; repeated timeouts usually mean a hung browser, blocked loopback, or an environment preventing headful/headless launch—check upstream logs and local security tooling before loosening timeouts.
 
- The current upstream `agent-browser 0.27.1` `wait --download <path>` saveAs persistence limitation is tracked at [vercel-labs/agent-browser#1300](https://github.com/vercel-labs/agent-browser/issues/1300); until it is fixed, release validation must treat `details.savedFilePath` as upstream-reported metadata and use `details.artifacts[].exists` as the filesystem truth (the contract asserts the requested path is absent on disk while upstream still reports success). If the suite fails because JSON/detail keys drifted, update the wrapper behavior or refresh `test/fixtures/agent-browser-real-output-shapes.json` together with the presentation work that consumes those shapes.
+ The current upstream `agent-browser 0.27.2` `wait --download <path>` saveAs persistence limitation is tracked at [vercel-labs/agent-browser#1300](https://github.com/vercel-labs/agent-browser/issues/1300); until it is fixed, release validation must treat `details.savedFilePath` as upstream-reported metadata and use `details.artifacts[].exists` as the filesystem truth (the contract asserts the requested path is absent on disk while upstream still reports success). If the suite fails because JSON/detail keys drifted, update the wrapper behavior or refresh `test/fixtures/agent-browser-real-output-shapes.json` together with the presentation work that consumes those shapes.
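
The rule in the added paragraph above (trust `details.artifacts[].exists`, not `details.savedFilePath`) can be sketched as a small classifier. This is a minimal illustration, not the wrapper's actual code: only `savedFilePath` and `artifacts[].exists` are named in the text, while the `path` field, the type names, and `classifyDownload` are hypothetical.

```typescript
// Hypothetical shapes: only `savedFilePath` and `artifacts[].exists` come from the text above.
// The `path` field on each artifact is an assumption made for this sketch.
interface ArtifactReport {
  path: string;
  exists: boolean;
}

interface DownloadDetails {
  savedFilePath?: string;
  artifacts?: ArtifactReport[];
}

type DownloadVerdict =
  | { status: "persisted"; path: string }
  | { status: "reported-only"; path: string }
  | { status: "unknown" };

// Treat the upstream-reported path as a claim, and the artifact `exists` flag as the truth.
function classifyDownload(details: DownloadDetails): DownloadVerdict {
  const reported = details.savedFilePath;
  if (reported === undefined) {
    return { status: "unknown" };
  }
  const artifact = (details.artifacts ?? []).find((a) => a.path === reported);
  if (artifact?.exists === true) {
    return { status: "persisted", path: reported };
  }
  // Upstream reported success, but nothing confirms the file is on disk.
  return { status: "reported-only", path: reported };
}
```

Under the limitation tracked in the linked upstream issue, a release check written this way would expect `reported-only` for the requested path, and would start returning `persisted` once upstream saves the file.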
 
  Example smoke prompt:
 
@@ -327,8 +336,8 @@ Before publishing:
  - run `npm run verify -- command-reference` if the installed upstream `agent-browser` version or help surface changed
  - run `npm run doctor` and confirm any duplicate-source remediation matches the active package/checkout setup
  - run `npm run verify -- real-upstream` for upstream runtime, result-presentation, or managed-session changes
- - confirm both local-checkout modes still work for pre-release validation: isolated `pi --no-extensions -e .` smoke testing for general checkout loading (add `--no-skills` for extension-focused bounded smokes) and configured-source lifecycle validation
- - complete interactive `tmux` live-site extension smoke with `pi --no-extensions --no-skills -e .` and the native `agent_browser` tool (at least one simple static site and one real documentation/product site; include `qa` or `job`/`batch` when those surfaces changed; use the [public Grafana stress checklist](#public-grafana-stress-checklist) when dashboard/diagnostic/artifact behavior changed; close sessions and remove screenshots/temp artifacts; record evidence). Run separate skill-enabled dogfood only when validating skill routing/report-generation behavior—see [Pre-release checks](#pre-release-checks); automated gates are not a substitute
+ - confirm both local-checkout modes still work for pre-release validation: isolated `pi --approve --no-extensions -e .` smoke testing for general trusted checkout loading (add `--no-skills` for extension-focused bounded smokes; omit `--approve` only to test the trust prompt) and configured-source lifecycle validation
+ - complete interactive `tmux` live-site extension smoke with `pi --approve --no-extensions --no-skills -e .` and the native `agent_browser` tool (at least one simple static site and one real documentation/product site; include `qa` or `job`/`batch` when those surfaces changed; use the [public Grafana stress checklist](#public-grafana-stress-checklist) when dashboard/diagnostic/artifact behavior changed; close sessions and remove screenshots/temp artifacts; record evidence). Run separate skill-enabled dogfood only when validating skill routing/report-generation behavior—see [Pre-release checks](#pre-release-checks); automated gates are not a substitute
  - rerun `npm run verify -- release` and confirm the embedded Crabbox `platform-build` plus `browser-dogfood-smoke` matrix passed on `macos`, `ubuntu`, and `windows-native` with artifacts under `.artifacts/platform-smoke/`
  - run `npm run verify -- lifecycle` for configured-source `/reload`, exact `--session-id` relaunch, managed-session continuity, persisted-spill, and Pi failure-patch regression coverage (required before publish; see [Pre-release checks](#pre-release-checks))
  - confirm [`SUPPORT_MATRIX.md`](SUPPORT_MATRIX.md) still maps every current baseline inventory section to docs, runtime handling, tests, and validation status
@@ -47,12 +47,12 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
  ### Install priority
 
  - Prioritize the package install path first.
- - User-facing install docs should lead with `pi install npm:pi-agent-browser-native`; ephemeral package trials and validation must use `pi --no-extensions -e npm:pi-agent-browser-native[@<version>]` so configured checkout or global sources cannot duplicate `agent_browser`.
+ - User-facing install docs should lead with `pi install npm:pi-agent-browser-native`; ephemeral package trials and validation should use `pi --no-extensions -e npm:pi-agent-browser-native[@<version>]` so configured checkout or global sources cannot duplicate `agent_browser`, adding `--approve` in Pi 0.79+ automation when the current project is intentionally trusted.
  - User-facing install docs should also include the GitHub source path `pi install https://github.com/fitchmultz/pi-agent-browser-native`.
  - Provide a read-only package-level doctor command that checks upstream `agent-browser` PATH/version and duplicate Pi package/checkout sources before first use. It must not mutate Pi settings and must remain distinct from upstream `agent-browser doctor`.
  - Keep the current local-checkout path documented as the practical pre-release and development flow.
  - Most users will install this extension globally rather than as a project-local extension.
- - Local checkout smoke testing should use explicit CLI loading such as `pi --no-extensions -e .` or `pi --no-extensions -e /absolute/path/to/pi-agent-browser-native`; Pi settings are bypassed in this mode and code edits require a process restart for validation.
+ - Local trusted-checkout smoke testing should use explicit CLI loading such as `pi --approve --no-extensions -e .` or `pi --approve --no-extensions -e /absolute/path/to/pi-agent-browser-native`; Pi settings are bypassed in this mode and code edits require a process restart for validation. Omit `--approve` only when the test is meant to cover Pi's Project Trust prompt.
  - Local checkout hot-reload and exact-session relaunch validation should use configured-source lifecycle mode: exactly one active checkout/package source in Pi settings, launched with plain `pi` (or the lifecycle harness' exact `--session-id` relaunch path), so `/reload` and relaunch events exercise discovered/configured resources. Focused extension harness tests validate Pi `session_tree` branch rehydration and cleanup ownership.
  - Do **not** rely on repo-local `.pi/extensions/` auto-discovery for this package, because it conflicts with the global installed-package path.
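
The isolated launch forms listed above differ only in a few flags, which the sketch below assembles into an argv array. It is illustrative only: the flags (`--approve`, `--no-extensions`, `--no-skills`, `-e`) are taken from the text, while `SmokeLaunchOptions` and `buildIsolatedPiArgv` are hypothetical names that do not exist in the package.

```typescript
// Hypothetical helper: assembles the isolated `pi` command line described in the list above.
interface SmokeLaunchOptions {
  // ".", an absolute checkout path, or an npm source such as "npm:pi-agent-browser-native".
  source: string;
  // Pass --approve for an intentionally trusted project; false exercises the trust prompt.
  approve: boolean;
  // Pass --no-skills for extension-focused bounded smokes.
  noSkills?: boolean;
}

function buildIsolatedPiArgv(options: SmokeLaunchOptions): string[] {
  const argv: string[] = ["pi"];
  if (options.approve) {
    argv.push("--approve");
  }
  // Always bypass configured sources so only the explicit source loads `agent_browser`.
  argv.push("--no-extensions");
  if (options.noSkills === true) {
    argv.push("--no-skills");
  }
  argv.push("-e", options.source);
  return argv;
}

// Trusted local checkout smoke: pi --approve --no-extensions -e .
console.log(buildIsolatedPiArgv({ source: ".", approve: true }).join(" "));
```

Keeping `approve` a required field forces each caller to decide whether the run is a trusted smoke or a deliberate test of the Project Trust prompt.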
 
@@ -81,12 +81,12 @@ Define the product requirements and constraints for `pi-agent-browser-native`.
  - Because direct-binary usage is commonly blocked in normal agent sessions, the repo must carry a local command reference for the effective `agent_browser` surface and keep it in sync with upstream changes.
  - Repository verification must include a lightweight command-reference drift check against the targeted installed upstream `agent-browser` version.
  - Published package contents should include the canonical user-facing docs plus `LICENSE`.
- - Published package contents should exclude agent-only and superseded docs such as `AGENTS.md`, `docs/v1-tool-contract.md`, and `docs/native-integration-design.md`.
+ - Published package contents should exclude agent-only and superseded docs such as `AGENTS.md` and archived drafts under `docs/archive/`.
 
  ### Testing guidance
 
  - The primary confidence path is a real `pi` session driven in `tmux`.
- - For quick local checkout smoke validation, launch `pi --no-extensions -e .` from the repository root so only the checkout copy loads; do not rely on Pi settings or `/reload` semantics in this isolated mode.
+ - For quick local checkout smoke validation, launch `pi --approve --no-extensions -e .` from the repository root so only the checkout copy loads; do not rely on Pi settings or `/reload` semantics in this isolated mode.
  - For hot-reload validation, configure exactly one active source for this extension in Pi settings and launch plain `pi`; validate `/reload` there because it exercises auto-discovered/configured resources.
  - Maintain a tmux-driven configured-source lifecycle harness (`npm run verify -- lifecycle`; required before release per `docs/RELEASE.md`) that isolates Pi settings, uses exactly one configured source, exercises `/reload`, full restart plus exact `--session-id` relaunch, and asserts managed-session continuity, persisted artifact survival, and real Pi `tool_result` failure-patch semantics. It remains outside the default `npm run verify` sequence, but it is embedded in `npm run verify -- release` so `prepublishOnly` enforces it before publish unless scripts are intentionally skipped. The harness defaults Pi to model `zai/glm-5.1` (`scripts/verify-lifecycle.mjs`); pass `--model <id>` after `lifecycle` when a different model is required. Keep `docs/RELEASE.md` accurate about the harness behavior, cleanup, transcript retention, and limitations.
  - Validate a full `pi` restart with exact `--session-id` relaunch or `/resume` when changes touch managed-session continuity, reload behavior, or persisted artifact paths. Validate branch-backed state changes with the focused `session_tree` harness tests.