pi-chrome 0.14.5 → 0.14.7

This diff shows the content of publicly available package versions as released to their respective public registries. It is provided for informational purposes only.
package/CHANGELOG.md CHANGED
@@ -2,6 +2,14 @@
 
  All notable user-facing changes to `pi-chrome`.
 
+ ## 0.14.7
+
+ - Replace "30+ challenges" hand-wave in README + COMPARISON.md with the accurate framing from chrome-benchmark: **38 primitive challenges + 4 hermetic BrowserGym-style long-horizon tasks**, scored by **expected-outcome-by-mode** (not raw PASS count). Explains why a synthetic-events tool isn't supposed to satisfy a clipboard user-activation gate — matching that expectation is the pass.
+
+ ## 0.14.6
+
+ - Fix Browser Use license in `docs/COMPARISON.md`: MIT (not Apache-2.0). Confirmed against upstream LICENSE on GitHub.
+
  ## 0.14.5
 
  - `docs/COMPARISON.md` rewritten with a three-axis landscape (drivers / agent frameworks / cloud providers). Adds Browser Use, Stagehand, Skyvern, Magnitude, Alumnium, OpenAI Operator, Project Mariner, Surfer 2, Anthropic Computer Use, Browserbase, Steel.dev, Hyperbrowser, Anchor, Browserless. Adds Interop section, public-benchmark cheat sheet (WebArena, WorkArena++, BrowseComp, Mind2Web 2, WebChoreArena, MiniWoB++, BrowserGym).
package/README.md CHANGED
@@ -32,10 +32,10 @@ You: [keeps coding — agent never asked you to log in]
  | Multi-session safe | ✅ shared local bridge | ❌ port collisions | ❌ | ❌ |
  | Network/console capture | ✅ built-in | ✅ | ✅ | ⚠️ via extensions |
  | Honest result envelopes¹ | ✅ | ⚠️ | ❌ | ❌ |
- | Built-in benchmark suite² | ✅ 30+ challenges | n/a | n/a | n/a |
+ | Built-in benchmark suite² | ✅ 38 primitives + 4 long-horizon | n/a | n/a | n/a |
 
  ¹ Every action returns `pageMutated`, `defaultPrevented`, `elementVisible`, `occludedBy`, and `valueMatches` so the agent knows when a click didn't take effect — instead of looping blindly.
- ² See [`test-suite/`](./test-suite) — static pages that grade any browser-control tool on trusted clicks, pointer humanization, keyboard fidelity, drag/drop, clipboard, Shadow DOM, iframes, file uploads, network capture, and fingerprint leaks.
+ ² See [`test-suite/`](./test-suite) — 38 primitive challenges plus 4 hermetic BrowserGym-style tasks. Scoring is expected-outcome-by-mode (`synthetic` / `trusted` / `manual`), not raw PASS count. Pages grade any browser-control tool on trusted clicks, pointer humanization, keyboard fidelity, drag/drop, clipboard, Shadow DOM, iframes, file uploads, network capture, and fingerprint leaks.
 
  ---
 
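Footnote ¹ above names the envelope fields each action returns. As a rough TypeScript sketch of what such an envelope could look like (only the field names come from the README footnote; the shape, the types, and the `clickSucceeded` helper are illustrative assumptions, not pi-chrome's published API):

```typescript
// Illustrative sketch only: field names come from the README footnote;
// the exact shape and types are assumptions, not pi-chrome's actual API.
interface ActionResultEnvelope {
  pageMutated: boolean;       // did the action observably change the DOM?
  defaultPrevented: boolean;  // did the page cancel the dispatched event?
  elementVisible: boolean;    // was the target visible when acted on?
  occludedBy: string | null;  // selector of an element covering the target, if any
  valueMatches: boolean;      // for inputs: does the final value match what was typed?
}

// Hypothetical helper: an agent can branch on the envelope
// instead of looping blindly on a click that never landed.
function clickSucceeded(r: ActionResultEnvelope): boolean {
  return r.pageMutated && !r.defaultPrevented && r.elementVisible && r.occludedBy === null;
}
```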
@@ -240,7 +240,11 @@ There is no network exposure; the bridge binds to loopback only.
 
  ## Built-in benchmark suite
 
- [`test-suite/`](./test-suite) is a static page benchmark for **any** browser-control agent (not just pi-chrome). Each challenge exposes `window.__verdict` / `window.__reason` / `window.__events` and a manifest entry with expected results per mode (`synthetic`, `trusted`, `manual`).
+ [`test-suite/`](./test-suite) is a benchmark for **any** browser-control agent (not just pi-chrome). It includes **38 primitive challenges** plus **4 hermetic BrowserGym-style long-horizon tasks**.
+
+ Scoring is **expected-outcome-by-mode**, not raw PASS count: each challenge has an expected verdict per mode (`synthetic`, `trusted`, `manual`) and a tool grades itself by whether its actual outcome matches the expected one. This avoids false equivalence between modes — a synthetic-events tool isn't supposed to satisfy a clipboard user-activation gate; matching that expectation is the pass.
+
+ Each challenge exposes `window.__verdict` / `window.__reason` / `window.__events` and a manifest entry with expected results per mode.
 
  ```bash
  cd test-suite && python3 -m http.server 8765
package/docs/COMPARISON.md CHANGED
@@ -64,7 +64,7 @@ These wrap a driver with an LLM loop. They are **higher-level than pi-chrome** a
 
  | Framework | Driver underneath | Approach | Open source |
  | ------------------------ | ------------------------------ | --------------------------------------------------------------------------------------------- | --------------- |
- | **Browser Use** | Playwright | DOM + a11y tree → LLM → action JSON. Open-source leader; widely cited on WebVoyager. | Apache-2.0 (Python) |
+ | **Browser Use** | Playwright | DOM + a11y tree → LLM → action JSON. Open-source leader; widely cited on WebVoyager. | MIT (Python) |
  | **Stagehand** (Browserbase) | Playwright | Natural-language `.act()` / `.observe()` / `.extract()`; deterministic + AI mix. | MIT (TypeScript)|
  | **Skyvern** | Playwright + own DOM model | Vision-first + DOM; YAML workflows for form/workflow automation. | AGPL (Python) |
  | **Magnitude** | Playwright | NL test authoring; QA-focused. | open |
@@ -134,7 +134,7 @@ If your threat model excludes extensions with broad permissions, neither approac
 
  ## Public benchmarks worth knowing (for axis 2 / axis 3 comparison)
 
- Pi-chrome itself ships a per-primitive benchmark suite ([`../test-suite/`](../test-suite)) covering trusted-input, pointer humanization, keyboard fidelity, drag/drop, Shadow DOM, file uploads, network observability, fingerprint leaks, and agent-safety honeypots. That's **driver-level** grading.
+ Pi-chrome itself ships a benchmark suite ([`../test-suite/`](../test-suite)) of **38 primitive challenges** plus **4 hermetic BrowserGym-style long-horizon tasks** covering trusted-input, pointer humanization, keyboard fidelity, drag/drop, Shadow DOM, file uploads, network observability, fingerprint leaks, and agent-safety honeypots. Scoring is **expected-outcome-by-mode** (not raw PASS count): each challenge has expected verdicts per mode (`synthetic` / `trusted` / `manual`) and a tool grades itself by whether its actual outcome matches expectations. That's **driver-level** grading.
 
  For **agent-level** comparison (axis 2), the public benchmarks worth citing:
 
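The expected-outcome-by-mode rule described in the README and COMPARISON.md hunks above reduces to a small comparison. A minimal sketch (only `window.__verdict` and the three mode names appear in the package docs; `ChallengeManifest`, `gradeChallenge`, and the `Verdict` type are hypothetical names):

```typescript
// Sketch of expected-outcome-by-mode grading as described above.
// `ChallengeManifest` and `gradeChallenge` are hypothetical names; only
// window.__verdict and the mode names ("synthetic" / "trusted" / "manual")
// come from the package docs.
type Mode = "synthetic" | "trusted" | "manual";
type Verdict = "PASS" | "FAIL";

interface ChallengeManifest {
  id: string;
  expected: Record<Mode, Verdict>; // expected verdict per mode
}

// A challenge is scored as a match against expectations, not a raw PASS:
// a synthetic-events tool that FAILs a clipboard user-activation gate
// still scores correct when FAIL is the expected verdict for that mode.
function gradeChallenge(
  manifest: ChallengeManifest,
  mode: Mode,
  actual: Verdict, // read from window.__verdict on the challenge page
): boolean {
  return actual === manifest.expected[mode];
}
```

Under this rule, a suite score is the fraction of challenges whose actual verdict matches the expected verdict for the mode under test.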
@@ -1,7 +1,7 @@
  {
  "manifest_version": 3,
  "name": "Pi Chrome Connector",
- "version": "0.14.5",
+ "version": "0.14.7",
  "description": "Lets Pi control tabs in Chrome via a local connector at 127.0.0.1.",
  "permissions": ["tabs", "scripting", "storage", "activeTab", "alarms", "webNavigation", "debugger"],
  "host_permissions": ["<all_urls>", "http://127.0.0.1:17318/*"],
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "pi-chrome",
- "version": "0.14.5",
+ "version": "0.14.7",
  "description": "The de-facto browser automation toolkit for Pi agents. Drive your existing logged-in Chrome — no re-login, no throwaway profile, no CDP. 20+ tools (click, type, navigate, screenshot, network capture, file upload, drag, scroll, touch) + honest result envelopes + a built-in benchmark suite.",
  "keywords": [
  "pi",