barebrowse 0.9.0 → 0.10.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,117 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.10.0
4
+
5
+ ### Ad/tracker URL blocking + canvas-noise stealth + Chromium pgid reap fix
6
+
7
+ Scrapling-inspired additions to make every snapshot quieter and every
8
+ headless session less fingerprintable, plus a flake fix surfaced by the
9
+ new work.
10
+
11
+ - **Ad/tracker URL blocking via CDP `Network.setBlockedURLs`.** New
12
+ `src/blocklist.js` ships ~120 hand-curated glob patterns covering the
13
+ high-frequency tracker families: Google ads + analytics, Facebook
14
+ Pixel, Amazon ads, MS Clarity/Bing, Adobe Marketing Cloud, the
15
+ consumer-pixel cluster (LinkedIn/Twitter/TikTok/Snap/Pinterest), the
16
+ SaaS analytics stacks (Segment/Amplitude/Mixpanel/Heap/PostHog),
17
+ session-replay (Hotjar/FullStory/LogRocket/Crazy Egg/Mouseflow),
18
+ content recommendation (Criteo/Taboola/Outbrain), supply-side ad
19
+ networks (AppNexus/Rubicon/PubMatic/OpenX/Trade Desk), and marketing
20
+ automation (HubSpot/Marketo/Pardot/Intercom/Drift). Curated by traffic
21
+ frequency rather than pulled wholesale from Peter Lowe — CDP does
22
+ linear pattern matching per request, so the long tail of regional
23
+ networks was a measurable cost (~10ms cumulative on a 100-request page)
24
+ for ~5% extra coverage we'd rarely hit in agent traffic. Net effect:
25
+ smaller ARIA snapshots and faster page loads.
26
+ - **`opts.blockAds` and `opts.blockUrls` on `connect()` and `browse()`.**
27
+ `blockAds` defaults to `true` for launched browsers and `false` in
28
+ attach mode (would otherwise affect any tab in the user's running
29
+ browser). Explicit `blockAds: true` in attach mode is honored and
30
+ follows the session across `switchTab()`. `blockUrls` accepts extra
31
+ glob patterns merged with the default unless `blockAds: false`.
32
+ - **CLI flags on `bb open`: `--no-block-ads` and `--block-urls=PATTERN`**
33
+ (the latter repeatable). Plumbed through `cli.js`, `src/daemon.js`
34
+ startDaemon args, and `runDaemon` → `connect()`. Not exposed via MCP
35
+ or bareagent on purpose — agents inside a session shouldn't be
36
+ reconfiguring infra per tool call; the decision belongs at session
37
+ start.
38
+ - **Canvas fingerprint noise** in `src/stealth.js`. After WebGL
39
+ (already spoofed in v0.9.0), canvas `toDataURL` / `getImageData` is
40
+ the second-most-checked fingerprint vector — the pixel output of
41
+ rendered text/shapes depends on GPU, driver, and font rasterizer in
42
+ ways that are stable per machine but unique across machines, which
43
+ makes it a tracking signal that survives cookie clearing. The patch
44
+ XORs ~1 bit per 64-byte stride into the read pixels, with the bit
45
+ derived from a position-mixed hash of a per-session
46
+ `crypto.getRandomValues`-seeded value. Output is stable within a
47
+ session (so legitimate canvas use doesn't flicker) and different
48
+ across sessions (so fingerprinters see a fresh hash on every visit).
49
+ The canvas bitmap is snapshotted and restored around encoding so any
50
+ downstream legitimate read sees the original pixels.
51
+ - **Pre-existing Chromium subprocess reap flake fixed.** Chromium
52
+ spawns renderer/GPU/network/utility subprocesses that, under
53
+ `--site-per-process` (v0.9.0 H2), can outlive SIGTERM on the
54
+ Chromium parent by seconds while still holding profile-dir file
55
+ handles. Without `detached: true`, all of them shared Node's process
56
+ group — there was no way to signal the whole Chromium tree without
57
+ enumerating PIDs. `src/chromium.js` now spawns with `detached: true`
58
+ so each Chromium becomes its own process-group leader, and
59
+ `cleanupBrowser` / `reapAllSync` send SIGKILL to the negative PID
60
+ (the whole group) before `rmSync`. Latent in `main`, but the new
61
+ blocklist's added CDP setup overlapped the cleanup window enough to
62
+ trigger it in roughly 1 of 3 runs under parallel test load. Side effect: terminal SIGINT
63
+ now goes to Node's pgid only — `registerExitHandlers`' SIGINT
64
+ reaper is what kills Chromium under Ctrl-C and must not be removed.
65
+ - **`startDaemon` poll deadline 15s → 30s** for cold-boot margin on
66
+ slower hardware (CI / older boxes) now that the blocklist adds a
67
+ small amount of CDP setup time to the session-startup path.
68
+ - **Tests:** 138 total (10 new). New: 5-test unit suite for
69
+ `DEFAULT_BLOCKLIST` (shape/coverage drift guards, must-cover
70
+ tracker families, no dups); 2-test integration suite that proves
71
+ `Network.setBlockedURLs` actually drops the matching subresource
72
+ and that `blockAds:false` lets it through; 2 new canvas-noise
73
+ subtests (patch installed, stable within session, different across
74
+ sessions); 1 end-to-end `bb open --block-urls=PATTERN URL` test
75
+ that proves the flag survives every hop through `cli.js` →
76
+ `startDaemon` → daemon-internal → `connect()` → `setBlockedURLs`
77
+ and that the tracker server sees zero hits.
78
+
79
+ ## 0.9.1
80
+
81
+ ### Pruning — `pruneMode` reaches MCP / bareagent and `read` finally works
82
+
83
+ - **`mode: 'read'` is now a real alias for `mode: 'browse'`** in `prune()`.
84
+ Previously, the CLI (`barebrowse snapshot --mode=read`) and the SKILL.md
85
+ advertised a `read` mode that did not exist — `MODE_REGIONS[mode] ||
86
+ MODE_REGIONS.act` silently fell back to act-mode pruning. Articles, docs,
87
+ and blog posts therefore came back gutted no matter which mode the agent
88
+ asked for, which is why Claude tended to give up and fall back to
89
+ WebFetch. One-line alias at the top of `prune()` fixes it; `act|browse|
90
+ navigate|full` still behave unchanged.
91
+ - **MCP `browse` and `snapshot` tools gained a `pruneMode: 'act'|'read'`
92
+ parameter** (mcp-server.js). Before this, the MCP surface had no way to
93
+ ask for any mode other than `act` — `browse`'s `mode` param was browser
94
+ mode (headless/headed/hybrid), and `snapshot` accepted only `maxChars`.
95
+ Tool descriptions now tell the caller when to pick `read` (content-heavy
96
+ pages: articles, docs, blogs).
97
+ - **bareagent `browse` and `snapshot` tools gained the same `pruneMode`
98
+ parameter** (`src/bareagent.js`) with identical semantics. The `browse`
99
+ handler preserves any caller-supplied default `opts.pruneMode` when the
100
+ tool is called without an arg (`pruneMode ? { ...opts, pruneMode } : opts`).
101
+ - **Auto-hint when act-mode looks suspect.** When `page.snapshot()` or
102
+ `browse()` is called in act mode against a substantial page (raw > 5 KB)
103
+ and the pruned output collapses to under 500 chars AND under 5% of raw,
104
+ the result includes a one-line `hint: act mode dropped most of the page
105
+ — retry with pruneMode='read' …` directly between the stats line and the
106
+ tree. Thresholds are deliberately conservative: an e-commerce or
107
+ search-results page (many interactive elements kept) won't trigger it;
108
+ a paragraph-heavy article will.
109
+ - **Regression test:** `test/unit/prune.test.js` — "aliases mode='read' to
110
+ browse mode" pins the alias contract by asserting `prune(tree, {mode:
111
+ 'read'})` deep-equals `prune(tree, {mode: 'browse'})` and that paragraphs
112
+ survive (the act-mode-style stripping that previously masqueraded as
113
+ read-mode is gone).
114
+
3
115
  ## 0.9.0
4
116
 
5
117
  Phase B — every H1–H9 from `docs/02-features/fix-plan.md` shipped one
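Reviewer note on the 0.9.1 auto-hint entry above: the trigger is a conjunction of three conditions, which is easy to misread as an "either/or". The sketch below spells the check out. It assumes lengths are counted in characters and that "5 KB" means 5 × 1024; the name `shouldHintReadMode` and the constants are hypothetical and are not taken from the package source.

```javascript
// Hypothetical constants mirroring the thresholds quoted in the changelog.
const RAW_MIN_CHARS = 5 * 1024; // "substantial page": raw > 5 KB (assumed to mean 5120 chars)
const PRUNED_MAX_CHARS = 500;   // pruned output must be under 500 chars
const PRUNED_MAX_RATIO = 0.05;  // ...AND under 5% of the raw size

// Returns true when an act-mode snapshot looks suspiciously empty.
function shouldHintReadMode(mode, rawLength, prunedLength) {
  if (mode !== 'act') return false;             // the hint only applies to act mode
  if (rawLength <= RAW_MIN_CHARS) return false; // small pages never trigger it
  return prunedLength < PRUNED_MAX_CHARS
    && prunedLength < rawLength * PRUNED_MAX_RATIO;
}
```

Under these assumptions the 5% bound is the stricter one for raw sizes below 10,000 chars, and the 500-char cap is the stricter one above that, so both conditions matter in practice.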
package/README.md CHANGED
@@ -94,6 +94,8 @@ Or manually add to your config (`claude_desktop_config.json`, `.cursor/mcp.json`
94
94
 
95
95
  18 tools: `browse`, `goto`, `snapshot`, `click`, `type`, `press`, `scroll`, `hover`, `select`, `back`, `forward`, `reload`, `drag`, `upload`, `pdf`, `screenshot`, `wait_for`, `tabs`. Plus `assess` (privacy scan) if [wearehere](https://github.com/hamr0/wearehere) is installed. Plus opt-in `eval` (`BAREBROWSE_MCP_EVAL=1`) — runs JS in the authenticated session, off by default because it can read cookies/localStorage. Session runs in hybrid mode with automatic cookie injection. Per-tool timeouts (goto/reload/wait_for 60s, back/forward 30s, interactive ops 15s, pdf/screenshot/upload 45s) with auto-retry on transient failures (idempotent only — mutating tools fail loudly to avoid double-submits).
96
96
 
97
+ `browse` and `snapshot` accept `pruneMode: 'act'|'read'` (v0.9.1). `act` (default) keeps interactive elements — best for clicking/filling. `read` keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. If act-mode collapses a content-heavy page near-totally, the snapshot includes a `hint: …` line suggesting `pruneMode='read'` so the agent doesn't bail to a separate HTTP fetch.
98
+
97
99
  Troubleshooting MCP setup: `npx barebrowse doctor` scans every known config location and flags scope conflicts. `npx barebrowse install --force` overwrites an existing entry pointing at a different endpoint.
98
100
 
99
101
  ### 3. Library -- for agentic automation
@@ -1,7 +1,7 @@
1
1
  # barebrowse -- Integration Guide
2
2
 
3
3
  > For AI assistants and developers wiring barebrowse into a project.
4
- > v0.9.0 | Node.js >= 22 | 0 required deps | Apache-2.0
4
+ > v0.9.1 | Node.js >= 22 | 0 required deps | Apache-2.0
5
5
 
6
6
  ## What this is
7
7
 
@@ -45,6 +45,8 @@ const snapshot = await browse('https://example.com', {
45
45
  prune: true, // apply ARIA pruning (47-95% token reduction)
46
46
  pruneMode: 'act', // 'act' (interactive elements) | 'read' (all content)
47
47
  consent: true, // auto-dismiss cookie consent dialogs
48
+ blockAds: true, // block ~120 ad/tracker URL patterns (default on for owned browsers)
49
+ blockUrls: [], // extra URL globs to block (merged with the default)
48
50
  timeout: 30000, // navigation timeout in ms
49
51
  });
50
52
  ```
@@ -91,6 +93,8 @@ const snapshot = await browse('https://example.com', {
91
93
  - `viewport: '1280x720'` — Set viewport dimensions
92
94
  - `storageState: 'file.json'` — Load cookies/localStorage from saved state
93
95
  - `downloadPath: '/abs/dir'` — Where downloads land. Default: per-session `mkdtemp` under `/tmp/barebrowse-dl-*` that gets removed on `close()`. Caller-supplied paths are not cleaned up — caller owns the lifecycle.
96
+ - `blockAds: true|false` — CDP-level URL blocking of ~120 common ad/tracker patterns (Google ads/analytics, FB/Amazon/MS/Adobe ad+analytics, Segment/Amplitude/Mixpanel/Heap, Hotjar/FullStory/LogRocket, Criteo/Taboola/Outbrain, the consumer-pixel cluster, AppNexus/Rubicon/PubMatic supply, marketing automation). Default `true` for launched browsers, `false` in attach mode (would affect any tab in the user's running browser). Explicit `true` in attach mode is honored and follows the session across `switchTab()`. Shrinks ARIA snapshots and speeds page loads.
97
+ - `blockUrls: ['*://foo.com/*', ...]` — Extra glob patterns (CDP `Network.setBlockedURLs` format) to block in addition to the default. Merged with the default unless `blockAds: false`.
94
98
 
95
99
  ## Snapshot format
96
100
 
@@ -161,7 +165,8 @@ barebrowse can inject cookies from the user's real browser sessions, bypassing l
161
165
  | Form submission | `press('Enter')` triggers onsubmit | Both |
162
166
  | SPA navigation | `waitForNavigation()` uses loadEventFired + frameNavigated | Both |
163
167
  | Bot detection | v0.9.0 (H9): Cloudflare-strong phrases ("Just a moment", "Attention Required", "verify you are human") fire alone; generic phrases ("access denied", "unknown error") only fire on near-empty pages — no more false-positive headed-launches on legitimate 4xx/5xx pages. `botBlocked` flag set after every `goto()`. Hybrid fallback switches to headed. Snapshot shows `[BOT CHALLENGE DETECTED]` warning. | Hybrid |
164
- | Stealth (headless tells) | v0.9.0 (H4): `Network.setUserAgentOverride` strips "HeadlessChrome" from UA in HTTP headers AND `navigator.userAgent`; JS patches for webdriver, plugins, languages, full `chrome.runtime` enum shape, `Notification` constructor + `permission: 'default'`, `hardwareConcurrency: 8`, `deviceMemory: 8`, WebGL `UNMASKED_VENDOR_WEBGL`/`UNMASKED_RENDERER_WEBGL` spoofed to Intel | Headless |
168
+ | Stealth (headless tells) | v0.9.0 (H4): `Network.setUserAgentOverride` strips "HeadlessChrome" from UA in HTTP headers AND `navigator.userAgent`; JS patches for webdriver, plugins, languages, full `chrome.runtime` enum shape, `Notification` constructor + `permission: 'default'`, `hardwareConcurrency: 8`, `deviceMemory: 8`, WebGL `UNMASKED_VENDOR_WEBGL`/`UNMASKED_RENDERER_WEBGL` spoofed to Intel. v0.10.0: canvas fingerprint noise — `toDataURL`/`getImageData` XOR a per-session `crypto.getRandomValues`-seeded mask into ~1 byte per 64-byte stride (stable within a session, different across sessions; bitmap is restored after encoding so legitimate canvas use is unaffected). | Headless |
169
+ | Ad / tracker URL blocking | v0.10.0: CDP `Network.setBlockedURLs` with ~120 curated patterns (Google/FB/Amazon/MS/Adobe ad+analytics, the major SaaS analytics + session-replay stacks, content-rec, supply-side ad networks, marketing automation). Default on for launched browsers, off in attach mode. `opts.blockUrls` extends; `opts.blockAds: false` disables. Shrinks ARIA snapshots and speeds loads. | Launched |
165
170
  | iframe / OOPIF content (Stripe, reCAPTCHA, embedded forms) | v0.9.0 (H2): `Target.setAutoAttach({flatten:true})` registers a CDP session per iframe; `ariaTree()` walks `Page.getFrameTree`, fetches each frame's AX tree on the right session, splices children under iframe placeholders via `DOM.getFrameOwner`. Refs route via `{session, backendNodeId}` so clicks dispatch in the iframe's Input domain. `--site-per-process` launch flag forces every iframe — including same-origin — into OOPIF so coords work. | Both |
166
171
  | Downloads | v0.9.0 (H7): `Browser.setDownloadBehavior({behavior:'allowAndName', downloadPath, eventsEnabled:true})` + listeners populate `page.downloads`. Files land at `savedPath` (under `--download-path` if supplied, else per-session `/tmp/barebrowse-dl-*`). | Headless + Headed (skipped in attach mode) |
167
172
  | Profile locking | Unique temp dir per headless instance | Headless |
@@ -255,6 +260,8 @@ Action tools return `'ok'` -- the agent calls `snapshot` explicitly to observe.
255
260
 
256
261
  `browse` and `snapshot` accept a `maxChars` param (default 30000). If the snapshot exceeds the limit, it's saved to `.barebrowse/page-<timestamp>.yml` and a short message with the file path is returned instead. `screenshot` always saves to `.barebrowse/screenshot-<timestamp>.{png,jpeg,webp}` and returns the file path (raw base64 in a JSON-RPC response would blow `maxChars`). `tabs` returns the JSON array, or with `switchTo: N` it switches and returns `'ok'`.
257
262
 
263
+ `browse` and `snapshot` also accept `pruneMode: 'act'|'read'`. `act` (the default) keeps interactive elements and short labels — best for clicking/filling. `read` keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. Same surface on the bareagent adapter. If act mode collapses a content-heavy page (raw > 5 KB → pruned < 500 chars AND < 5% of raw), the result includes a `hint: act mode dropped most of the page — retry with pruneMode='read' …` line between the stats and the tree so the caller knows to re-snapshot in read mode instead of bailing to a separate HTTP fetch.
264
+
258
265
  Session runs in hybrid mode (headless with automatic headed fallback on bot detection). `goto` injects cookies from the user's browser before navigation for authenticated access.
259
266
 
260
267
  Session tools share a singleton page, lazy-created on first use. All session tools have auto-retry on transient failures (browser crash, WebSocket close, navigation timeout) on a per-tool deadline (v0.9.0 H5): `goto`/`reload`/`wait_for` 60s, `back`/`forward` 30s, interactive ops (`click`/`type`/`press`/`scroll`/`hover`/`select`/`drag`/`snapshot`/`eval`) 15s, `tabs` 5s, heavy I/O (`pdf`/`screenshot`/`upload`) 45s, replacing the prior blanket 30s. Session resets between attempts. Idempotent tools retry once; mutating tools (`click`/`type`/`upload`/etc.) use `{ retry: false }` so partial first attempts don't replay on a fresh page. Scroll accepts `direction: "up"/"down"` in addition to numeric `deltaY`. Click falls back to JS `.click()` when elements have no layout. `browse` has a 60s timeout (no retry, because it is stateless). Assess tries headless first; if bot-blocked, retries headed. Browser OOM/crash auto-recovers (session resets, server stays alive).
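Reviewer note on the canvas-noise row in the table above: the description (a per-session seed, a position-mixed hash, roughly one perturbed value per 64-byte stride) can be illustrated with a small, self-contained sketch. This is not the code in `src/stealth.js`; the mixer, the choice to flip only the lowest bit, and the names `newSessionSeed`, `mix32`, and `applyCanvasNoise` are all assumptions made for illustration.

```javascript
const STRIDE = 64; // one perturbed byte per 64-byte stride, as described in the changelog

// Hypothetical helper: draw one 32-bit seed per session.
function newSessionSeed() {
  return crypto.getRandomValues(new Uint32Array(1))[0];
}

// Generic 32-bit integer mixer (an assumption; the package may hash differently).
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
  x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
  return (x ^ (x >>> 16)) >>> 0;
}

// Flip the lowest bit of one hash-chosen byte in every stride, in place.
// `data` is any byte array, e.g. the Uint8ClampedArray from getImageData().data.
function applyCanvasNoise(data, seed) {
  for (let base = 0; base < data.length; base += STRIDE) {
    const span = Math.min(STRIDE, data.length - base); // last stride may be short
    const offset = mix32(seed ^ base) % span;          // position-mixed choice of byte
    data[base + offset] ^= 1;
  }
  return data;
}
```

Because the perturbation depends only on the seed and the byte position, the same seed yields the same output within a session. XOR is also its own inverse, so applying the same noise twice returns the original bytes; the changelog instead describes snapshotting and restoring the bitmap, which this sketch does not attempt to reproduce.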
package/cli.js CHANGED
@@ -117,6 +117,8 @@ async function cmdOpen() {
117
117
  viewport: parseFlag('--viewport'),
118
118
  storageState: parseFlag('--storage-state'),
119
119
  downloadPath: parseFlag('--download-path'),
120
+ blockAds: hasFlag('--no-block-ads') ? false : undefined,
121
+ blockUrls: parseFlagAll('--block-urls'),
120
122
  };
121
123
 
122
124
  try {
@@ -218,6 +220,8 @@ async function runDaemonInternal() {
218
220
  viewport: parseFlag('--viewport'),
219
221
  storageState: parseFlag('--storage-state'),
220
222
  downloadPath: parseFlag('--download-path'),
223
+ blockAds: hasFlag('--no-block-ads') ? false : undefined,
224
+ blockUrls: parseFlagAll('--block-urls'),
221
225
  };
222
226
  const outputDir = parseFlag('--output-dir') || resolve('.barebrowse');
223
227
  const url = parseFlag('--url');
@@ -240,6 +244,20 @@ function hasFlag(name) {
240
244
  return args.includes(name);
241
245
  }
242
246
 
247
+ // Collects every occurrence of a repeatable flag (--name=val or --name val).
248
+ // Returns undefined when absent so the opts object stays sparse and callers
249
+ // can distinguish "not provided" from "provided but empty".
250
+ function parseFlagAll(name) {
251
+ const out = [];
252
+ for (let i = 0; i < args.length; i++) {
253
+ if (args[i].startsWith(name + '=')) out.push(args[i].slice(name.length + 1));
254
+ else if (args[i] === name && args[i + 1] && !args[i + 1].startsWith('--')) {
255
+ out.push(args[i + 1]); i++;
256
+ }
257
+ }
258
+ return out.length ? out : undefined;
259
+ }
260
+
243
261
 
244
262
  // --- MCP auto-installer ---
245
263
 
@@ -467,6 +485,10 @@ Session:
467
485
  --viewport=WxH Viewport size (e.g. 1280x720)
468
486
  --storage-state=FILE Load cookies/localStorage from JSON file
469
487
  --download-path=DIR Directory for downloaded files (default: per-session temp dir)
488
+ --no-block-ads Disable the built-in ad/tracker blocklist (~120 patterns).
489
+ Default: enabled in owned-browser modes, disabled in attach mode.
490
+ --block-urls=PATTERN Extra URL glob to block (repeatable, e.g. --block-urls='*://*.foo.com/*').
491
+ Use the =VALUE form when the pattern could be mistaken for a flag.
470
492
 
471
493
  Navigation:
472
494
  barebrowse goto <url> Navigate to URL
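Reviewer note on the new `cli.js` flags: `--no-block-ads` maps to `blockAds: false` and otherwise leaves the option `undefined`, so the library default (on for launched browsers, off in attach mode) still applies. The sketch below shows one way that resolution could work. It is an assumption-laden illustration, not the code in `connect()`: the function name `resolveBlockedUrls`, the `attached` parameter, the de-duplication step, and the choice to still apply `blockUrls` on their own when `blockAds` is `false` are all hypothetical.

```javascript
// Placeholder standing in for the real DEFAULT_BLOCKLIST from src/blocklist.js.
const DEFAULT_BLOCKLIST = ['*://*.doubleclick.net/*', '*://*.google-analytics.com/*'];

// Hypothetical resolver: turns user options into the list passed to Network.setBlockedURLs.
function resolveBlockedUrls(opts = {}, { attached = false } = {}) {
  // undefined means "use the default": on for launched browsers, off when attached.
  const blockAds = opts.blockAds ?? !attached;
  const extra = opts.blockUrls ?? [];
  // Assumption: extra patterns are honored even when the default list is disabled.
  const patterns = blockAds ? [...DEFAULT_BLOCKLIST, ...extra] : [...extra];
  return [...new Set(patterns)]; // drop duplicates (an addition of this sketch)
}
```

For example, `resolveBlockedUrls({ blockUrls: ['*://x.test/*'] })` would yield the default list plus the extra pattern, while the same call with `{ attached: true }` as the second argument would yield only the extra pattern.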
package/mcp-server.js CHANGED
@@ -150,6 +150,7 @@ export const TOOLS = [
150
150
  properties: {
151
151
  url: { type: 'string', description: 'URL to browse' },
152
152
  mode: { type: 'string', enum: ['headless', 'headed', 'hybrid'], description: 'Browser mode (default: headless)' },
153
+ pruneMode: { type: 'string', enum: ['act', 'read'], description: 'Pruning mode. "act" (default) keeps interactive elements and short labels — best for clicking/filling. "read" keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. If the page is content-heavy and act-mode returns mostly empty, retry with "read".' },
153
154
  maxChars: { type: 'number', description: 'Max chars to return inline. Larger snapshots are saved to .barebrowse/ and a file path is returned instead. Default: 30000.' },
154
155
  },
155
156
  required: ['url'],
@@ -172,6 +173,7 @@ export const TOOLS = [
172
173
  inputSchema: {
173
174
  type: 'object',
174
175
  properties: {
176
+ pruneMode: { type: 'string', enum: ['act', 'read'], description: 'Pruning mode. "act" (default) keeps interactive elements and short labels — best for clicking/filling. "read" keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. If a previous snapshot looked empty on a content-heavy page, retry with "read".' },
175
177
  maxChars: { type: 'number', description: 'Max chars to return inline. Larger snapshots are saved to .barebrowse/ and a file path is returned instead. Default: 30000.' },
176
178
  },
177
179
  },
@@ -374,7 +376,7 @@ async function handleToolCall(name, args) {
374
376
  case 'browse': {
375
377
  let timer;
376
378
  const text = await Promise.race([
377
- browse(args.url, { mode: args.mode }),
379
+ browse(args.url, { mode: args.mode, pruneMode: args.pruneMode }),
378
380
  new Promise((_, rej) => { timer = setTimeout(() => rej(new Error('browse timed out after 60s')), 60000); }),
379
381
  ]);
380
382
  clearTimeout(timer);
@@ -393,7 +395,7 @@ async function handleToolCall(name, args) {
393
395
  }, TIMEOUTS.goto);
394
396
  case 'snapshot': return withRetry(async () => {
395
397
  const page = await getPage();
396
- const text = await page.snapshot();
398
+ const text = await page.snapshot(args.pruneMode ? { mode: args.pruneMode } : undefined);
397
399
  const limit = args.maxChars ?? MAX_CHARS_DEFAULT;
398
400
  if (text.length > limit) {
399
401
  const file = saveSnapshot(text);
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "barebrowse",
3
- "version": "0.9.0",
3
+ "version": "0.10.0",
4
4
  "description": "Authenticated web browsing for autonomous agents via CDP. URL in, pruned ARIA snapshot out.",
5
5
  "type": "module",
6
6
  "main": "src/index.js",
package/src/bareagent.js CHANGED
@@ -50,10 +50,11 @@ export function createBrowseTools(opts = {}) {
50
50
  type: 'object',
51
51
  properties: {
52
52
  url: { type: 'string', description: 'URL to browse' },
53
+ pruneMode: { type: 'string', enum: ['act', 'read'], description: '"act" (default) for interactive elements only; "read" for paragraphs and long text (articles/docs).' },
53
54
  },
54
55
  required: ['url'],
55
56
  },
56
- execute: async ({ url }) => await browse(url, opts),
57
+ execute: async ({ url, pruneMode }) => await browse(url, pruneMode ? { ...opts, pruneMode } : opts),
57
58
  },
58
59
  {
59
60
  name: 'goto',
@@ -70,10 +71,15 @@ export function createBrowseTools(opts = {}) {
70
71
  {
71
72
  name: 'snapshot',
72
73
  description: 'Get the current ARIA snapshot. Returns a YAML-like tree with [ref=N] markers on interactive elements.',
73
- parameters: { type: 'object', properties: {} },
74
- execute: async () => {
74
+ parameters: {
75
+ type: 'object',
76
+ properties: {
77
+ pruneMode: { type: 'string', enum: ['act', 'read'], description: '"act" (default) for interactive elements only; "read" for paragraphs and long text (articles/docs).' },
78
+ },
79
+ },
80
+ execute: async ({ pruneMode } = {}) => {
75
81
  const page = await getPage();
76
- return await page.snapshot();
82
+ return await page.snapshot(pruneMode ? { mode: pruneMode } : undefined);
77
83
  },
78
84
  },
79
85
  {
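Reviewer note on the `browse` handler in `bareagent.js`: the conditional spread `pruneMode ? { ...opts, pruneMode } : opts` is what keeps a caller-supplied default alive when the tool is invoked without the argument. The sketch below isolates that expression in a hypothetical helper named `withPruneMode` and contrasts it with an unconditional spread.

```javascript
// Hypothetical helper mirroring the expression used in the bareagent `browse` handler.
function withPruneMode(opts, pruneMode) {
  return pruneMode ? { ...opts, pruneMode } : opts;
}

// An unconditional spread would clobber the default with `undefined`.
function naiveMerge(opts, pruneMode) {
  return { ...opts, pruneMode };
}
```

With `opts = { pruneMode: 'read' }` and no tool argument, `withPruneMode` returns `opts` unchanged, whereas `naiveMerge` produces an object whose `pruneMode` is `undefined`.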
@@ -0,0 +1,190 @@
1
+ /**
2
+ * blocklist.js — Ad/tracker URL patterns for CDP Network.setBlockedURLs.
3
+ *
4
+ * Curated by real-world frequency, not pulled wholesale from Peter Lowe /
5
+ * EasyList. CDP does linear pattern matching per request, so 3,000-entry
6
+ * lists add ~150ms cumulative cost on a typical page for ~5% extra coverage
7
+ * (long-tail regional networks the agent rarely encounters). The set below
8
+ * is ~120 patterns covering the trackers that actually show up in agent
9
+ * traffic: Google/FB/Amazon/MS/Adobe ad+analytics, product analytics
10
+ * (Segment/Amplitude/Mixpanel/Heap/PostHog), session replay and heatmaps
11
+ * (Hotjar/FullStory/LogRocket/Crazy Egg/Mouseflow), A/B testing (Optimizely/VWO),
12
+ * marketing automation (HubSpot), content recommendation (Taboola/Outbrain/Criteo),
13
+ * and the consumer-pixel cluster (LinkedIn/Twitter/TikTok/Snap/Pinterest/Reddit).
14
+ *
15
+ * Patterns are CDP-format globs: '*' matches any character run.
16
+ *
17
+ * To extend at runtime, pass connect({ blockUrls: [...] }) — your patterns
18
+ * are merged with this default. To turn the default off entirely, pass
19
+ * { blockAds: false }.
20
+ */
21
+
22
+ export const DEFAULT_BLOCKLIST = [
23
+ // --- Google ads + analytics (the single biggest cluster) ---
24
+ '*://*.doubleclick.net/*',
25
+ '*://*.googlesyndication.com/*',
26
+ '*://*.googleadservices.com/*',
27
+ '*://*.googletagservices.com/*',
28
+ '*://*.googletagmanager.com/*',
29
+ '*://*.google-analytics.com/*',
30
+ '*://*.adservice.google.com/*',
31
+ '*://pagead2.googlesyndication.com/*',
32
+ '*://www.googleadservices.com/pagead/*',
33
+ '*://ssl.google-analytics.com/*',
34
+ '*://stats.g.doubleclick.net/*',
35
+
36
+ // --- Facebook / Meta ---
37
+ '*://connect.facebook.net/*',
38
+ '*://*.facebook.com/tr*', // Pixel (matches both /tr/... and /tr?...)
39
+ '*://*.fbcdn.net/signals/*',
40
+
41
+ // --- Amazon ads ---
42
+ '*://*.amazon-adsystem.com/*',
43
+ '*://aax.amazon-adsystem.com/*',
44
+ '*://s.amazon-adsystem.com/*',
45
+
46
+ // --- Microsoft (Bing ads + Clarity) ---
47
+ '*://bat.bing.com/*',
48
+ '*://*.clarity.ms/*',
49
+
50
+ // --- Yandex ---
51
+ '*://mc.yandex.ru/*',
52
+ '*://an.yandex.ru/*',
53
+ '*://yandex.ru/ads/*',
54
+
55
+ // --- Adobe Marketing Cloud ---
56
+ '*://*.omtrdc.net/*',
57
+ '*://*.demdex.net/*',
58
+ '*://*.everesttech.net/*',
59
+ '*://*.2o7.net/*',
60
+ '*://*.adobedtm.com/*',
61
+
62
+ // --- LinkedIn ---
63
+ '*://px.ads.linkedin.com/*',
64
+ '*://snap.licdn.com/li.lms-analytics/*',
65
+
66
+ // --- Twitter/X ---
67
+ '*://analytics.twitter.com/*',
68
+ '*://static.ads-twitter.com/*',
69
+ '*://t.co/i/adsct*', // bare host; the '*.' form only matches subdomains
70
+
71
+ // --- TikTok ---
72
+ '*://analytics.tiktok.com/*',
73
+ '*://business-api.tiktok.com/*',
74
+ '*://*.tiktokcdn.com/tiktok/*',
75
+
76
+ // --- Snap ---
77
+ '*://tr.snapchat.com/*',
78
+ '*://sc-static.net/scevent.min.js*',
79
+
80
+ // --- Pinterest ---
81
+ '*://ct.pinterest.com/*',
82
+ '*://*.pinimg.com/ct/*',
83
+
84
+ // --- Reddit ---
85
+ '*://events.redditmedia.com/*',
86
+ '*://www.redditstatic.com/ads/*',
87
+
88
+ // --- Quantcast / ComScore / Chartbeat ---
89
+ '*://pixel.quantserve.com/*',
90
+ '*://*.quantcount.com/*',
91
+ '*://*.scorecardresearch.com/*',
92
+ '*://ping.chartbeat.net/*',
93
+ '*://static.chartbeat.com/*',
94
+
95
+ // --- Criteo / Taboola / Outbrain (content + retargeting) ---
96
+ '*://*.criteo.com/*',
97
+ '*://*.criteo.net/*',
98
+ '*://cdn.taboola.com/*',
99
+ '*://trc.taboola.com/*',
100
+ '*://widgets.outbrain.com/*',
101
+ '*://*.outbrain.com/utils/*',
102
+
103
+ // --- Tealium / Marketo / Pardot / Salesforce marketing ---
104
+ '*://tags.tiqcdn.com/*',
105
+ '*://*.tealiumiq.com/*',
106
+ '*://munchkin.marketo.net/*',
107
+ '*://*.marketo.com/munchkin*',
108
+ '*://pi.pardot.com/*',
109
+ '*://*.exacttarget.com/cdn/*',
110
+
111
+ // --- Yahoo / Verizon Media ---
112
+ '*://*.yahoo.com/p.gif*',
113
+ '*://ad.yieldmanager.com/*',
114
+ '*://sp.analytics.yahoo.com/*',
115
+
116
+ // --- RUM / front-end perf (debatable, but commonly noisy) ---
117
+ '*://rum.pingdom.net/*',
118
+ '*://bam.nr-data.net/*',
119
+ '*://bam-cell.nr-data.net/*',
120
+ '*://js-agent.newrelic.com/*',
121
+ '*://*.browser-intake-datadoghq.com/*',
122
+ '*://*.browser-intake-datadoghq.eu/*',
123
+
124
+ // --- Session replay + heatmaps ---
125
+ '*://*.hotjar.com/*',
126
+ '*://*.hotjar.io/*',
127
+ '*://*.fullstory.com/s/*',
128
+ '*://*.fullstory.com/rec/*',
129
+ '*://r.lr-ingest.io/*',
130
+ '*://*.logrocket.io/*',
131
+ '*://cdn.lr-ingest.com/*',
132
+ '*://script.crazyegg.com/*',
133
+ '*://cdn.mouseflow.com/*',
134
+ '*://*.mouseflow.com/projects/*',
135
+
136
+ // --- A/B testing ---
137
+ '*://cdn.optimizely.com/*',
138
+ '*://*.optimizely.com/event*',
139
+ '*://dev.visualwebsiteoptimizer.com/*',
140
+ '*://*.vwo.com/*',
141
+
142
+ // --- Product analytics ---
143
+ '*://api.segment.io/*',
144
+ '*://cdn.segment.com/*',
145
+ '*://*.segment.io/v1/*',
146
+ '*://api.amplitude.com/*',
147
+ '*://api2.amplitude.com/*',
148
+ '*://cdn.amplitude.com/*',
149
+ '*://api.mixpanel.com/*',
150
+ '*://cdn.mxpnl.com/*',
151
+ '*://*.heapanalytics.com/*',
152
+ '*://heapanalytics.com/h*',
153
+ '*://*.posthog.com/e/*',
154
+ '*://*.posthog.com/decide/*',
155
+
156
+ // --- Marketing automation ---
157
+ '*://track.hubspot.com/*',
158
+ '*://js.hs-scripts.com/*',
159
+ '*://js.hs-analytics.net/*',
160
+ '*://js.hsforms.net/*',
161
+
162
+ // --- Customer messaging (these load chat widgets that bloat ARIA) ---
163
+ '*://widget.intercom.io/*',
164
+ '*://api-iam.intercom.io/messenger/*',
165
+ '*://js.intercomcdn.com/*',
166
+ '*://js.driftt.com/*',
167
+ '*://event.api.drift.com/*',
168
+
169
+ // --- Error reporters (Sentry kept off — agents may want to see errors) ---
170
+ '*://sessions.bugsnag.com/*',
171
+ '*://notify.bugsnag.com/*',
172
+
173
+ // --- Misc widely-deployed ad networks ---
174
+ '*://*.adnxs.com/*', // AppNexus / Xandr
175
+ '*://*.rubiconproject.com/*',
176
+ '*://*.pubmatic.com/*',
177
+ '*://*.openx.net/*',
178
+ '*://*.casalemedia.com/*',
179
+ '*://*.bidswitch.net/*',
180
+ '*://*.adsrvr.org/*', // The Trade Desk
181
+ '*://*.media.net/*',
182
+ '*://*.mediavoice.com/*',
183
+ '*://*.serving-sys.com/*', // Sizmek
184
+ '*://*.smartadserver.com/*',
185
+ '*://*.indexww.com/*',
186
+ '*://*.mathtag.com/*',
187
+ '*://*.tapad.com/*',
188
+ '*://*.bluekai.com/*', // Oracle Data Cloud
189
+ '*://*.krxd.net/*', // Salesforce / Krux
190
+ ];
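Reviewer note on the pattern format: the file header says `'*'` matches any character run. A minimal sketch of that reading follows, assuming plain wildcard matching anchored at both ends of the URL. Whether Chromium's matcher anchors the same way or deviates in other details is not confirmed by this diff, and the helpers `globToRegExp` and `isBlocked` are hypothetical, written only to reason about what a pattern would cover.

```javascript
// Convert a '*' glob into an anchored RegExp; every other character is literal.
function globToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&'); // escape all regex metacharacters except '*'
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// True when any pattern in the list matches the full URL.
function isBlocked(url, patterns) {
  return patterns.some((p) => globToRegExp(p).test(url));
}
```

Two consequences of this reading are worth keeping in mind when adding patterns. A leading `*.` requires a subdomain, so `'*://*.hotjar.com/*'` would cover `static.hotjar.com` but not the bare `hotjar.com` host. A trailing `*` without a separator is a prefix match, so `'*://*.facebook.com/tr*'` would cover `/tr?id=1` and `/tr/` but also any other path that starts with `/tr`.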
package/src/chromium.js CHANGED
@@ -16,9 +16,12 @@ let exitHandlersRegistered = false;
16
16
  function reapAllSync() {
17
17
  const toReap = [...activeBrowsers];
18
18
  activeBrowsers.clear();
19
- // Send SIGKILL to everything first so the kernel reaps in parallel
19
+ // Send SIGKILL to the parent AND the whole process group (detached:true
20
+ // gives each Chromium its own pgid, so -pid targets every renderer/GPU/
21
+ // network child without touching Node or its other children).
20
22
  for (const b of toReap) {
21
23
  try { if (!b.process.killed) b.process.kill('SIGKILL'); } catch {}
24
+ try { process.kill(-b.process.pid, 'SIGKILL'); } catch {}
22
25
  }
23
26
  // Then poll each for actual death before removing its profile dir —
24
27
  // Chromium can hold file handles briefly even after SIGKILL, which would
@@ -151,8 +154,22 @@ export async function launch(opts = {}) {
151
154
  // about:blank as initial page
152
155
  args.push('about:blank');
153
156
 
157
+ // detached:true makes Node call setsid() so Chromium becomes its own
158
+ // process-group leader. Without this, the renderer/GPU/network children
159
+ // it forks share the Node parent's process group — SIGTERM on the
160
+ // Chromium PID only signals Chromium itself and the children linger,
161
+ // holding profile-dir files for seconds after the parent exits. Under
162
+ // parallel test load that races our rmSync cleanup. With a separate
163
+ // pgid, cleanupBrowser can signal the whole group with process.kill(-pid).
164
+ //
165
+ // Trade-off: a terminal SIGINT (Ctrl-C) is delivered to the foreground
166
+ // process group, which is now Node's — Chromium will NOT receive it
167
+ // directly. The SIGINT handler in registerExitHandlers() that calls
168
+ // reapAllSync() is what actually kills Chromium under Ctrl-C now. Do not
169
+ // remove that handler without restoring some other path to reap children.
154
170
  const child = spawn(binary, args, {
155
171
  stdio: ['ignore', 'pipe', 'pipe'],
172
+ detached: true,
156
173
  });
157
174
 
158
175
  // Parse the WebSocket URL from stderr
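Reviewer note on the `detached: true` change: the comment above explains that a separate process group lets cleanup signal the whole Chromium tree through a negative PID. A minimal sketch of that pattern follows, assuming a POSIX platform. The helper names `launchDetached` and `killProcessGroup` are hypothetical, and the guard against PIDs of 1 or lower is an addition of this sketch rather than something shown in the diff.

```javascript
// Spawn a child as the leader of its own process group.
// The dynamic import keeps the sketch loadable as either CommonJS or ESM.
async function launchDetached(binary, args = []) {
  const { spawn } = await import('node:child_process');
  return spawn(binary, args, { stdio: ['ignore', 'pipe', 'pipe'], detached: true });
}

// SIGKILL every process in the group led by `pid`.
// Returns true if the signal was sent, false if there was nothing to signal.
function killProcessGroup(pid) {
  // Guard: -0 would target our own group and -1 would target every process we may signal.
  if (!Number.isInteger(pid) || pid <= 1) return false;
  try {
    process.kill(-pid, 'SIGKILL'); // negative PID addresses the whole process group
    return true;
  } catch (err) {
    if (err.code === 'ESRCH') return false; // group already gone
    throw err;
  }
}
```

A caller would typically keep the `ChildProcess` returned by `launchDetached`, and on shutdown call `killProcessGroup(child.pid)` before removing any directory the children may still hold open. As the diff comment notes, a terminal Ctrl-C no longer reaches the detached group, so some exit handler has to make that call.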
@@ -216,20 +233,33 @@ export async function cleanupBrowser(browser) {
216
233
  });
217
234
  try { browser.process.kill(); } catch {}
218
235
  await exited;
236
+ // SIGKILL the whole Chromium process group. The parent may have exited
237
+ // already (above) but renderer/GPU/network children — separate processes
238
+ // under --site-per-process — can outlive it by seconds, and they hold
239
+ // profile-dir file handles. Because launch() spawned with detached:true,
240
+ // the children share Chromium's pgid (not Node's), so process.kill on a
241
+ // negative PID reaps the whole group without touching anything else.
242
+ try { process.kill(-browser.process.pid, 'SIGKILL'); } catch {
243
+ // ESRCH = group already gone; anything else is best-effort here.
244
+ }
219
245
  }
220
246
  if (browser.ownedProfileDir) {
221
- // Chromium can still flush files for ~hundreds of ms after exit; with
222
- // --site-per-process (added in H2) every iframe is its own renderer
223
- // process, each with its own pending file handles, so the old 10×100ms
224
- // window (1s) wasn't always enough under parallel test load. Now
225
- // 25×100ms (2.5s) plus a polling jitter to avoid every concurrent
226
- // cleanup hammering at the same tick.
227
- for (let i = 0; i < 25; i++) {
247
+ // Chromium spawns renderer + GPU + network + utility subprocesses (one
248
+ // per site under --site-per-process from H2), and SIGTERM on the parent
249
+ // doesn't guarantee the children have closed their profile-file handles
250
+ // by the time the parent's exit event fires. Under parallel test load
251
+ // we've seen handle-release take >2.5s. Retry budget here is 60 tries with
252
+ // a 100-149ms jittered sleep each (6-9s total). Retry on ANY error but ENOENT:
253
+ // earlier code only retried ENOTEMPTY/EBUSY but Linux also reports
254
+ // EPERM/EACCES transiently when an open-deleted file is still being
255
+ // written to. force:true already swallows ENOENT, so the catch only
256
+ // sees real failures.
257
+ for (let i = 0; i < 60; i++) {
228
258
  try {
229
259
  rmSync(browser.ownedProfileDir, { recursive: true, force: true });
230
260
  break;
231
261
  } catch (err) {
232
- if (err.code !== 'ENOTEMPTY' && err.code !== 'EBUSY') break;
262
+ if (err.code === 'ENOENT') break; // already gone
233
263
  await new Promise((r) => setTimeout(r, 100 + Math.floor(Math.random() * 50)));
234
264
  }
235
265
  }
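The retry budget quoted in the cleanup comment can be checked with a little arithmetic. The sketch below is not part of the package: `retrySleepBounds` is a hypothetical helper that assumes every attempt fails and is followed by one sleep of `baseMs + Math.floor(Math.random() * jitterMs)` milliseconds, as in the loop shown in this hunk.

```javascript
// Hypothetical helper (not in barebrowse): bounds on the total time spent
// sleeping when every attempt fails and each failure sleeps
// baseMs + Math.floor(Math.random() * jitterMs) milliseconds.
function retrySleepBounds(attempts, baseMs, jitterMs) {
  const minSleep = baseMs;                                 // random() returned 0
  const maxSleep = baseMs + Math.max(0, jitterMs - 1);     // floor() peaks at jitterMs - 1
  return { minMs: attempts * minSleep, maxMs: attempts * maxSleep };
}

const oldBudget = retrySleepBounds(25, 100, 50); // previous loop: 25 tries
const newBudget = retrySleepBounds(60, 100, 50); // current loop: 60 tries
console.log(oldBudget, newBudget);
```

Under those assumptions the sleep time alone grows from 2.5-3.7s to 6-8.9s; time spent inside `rmSync` itself comes on top of that.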
package/src/daemon.js CHANGED
@@ -40,6 +40,10 @@ export async function startDaemon(opts, outputDir, initialUrl) {
40
40
  if (opts.viewport) args.push('--viewport', opts.viewport);
41
41
  if (opts.storageState) args.push('--storage-state', opts.storageState);
42
42
  if (opts.downloadPath) args.push('--download-path', opts.downloadPath);
43
+ if (opts.blockAds === false) args.push('--no-block-ads');
44
+ if (Array.isArray(opts.blockUrls)) {
45
+ for (const p of opts.blockUrls) args.push('--block-urls', p);
46
+ }
43
47
 
44
48
  const child = spawn(process.execPath, args, {
45
49
  detached: true,
@@ -48,8 +52,11 @@ export async function startDaemon(opts, outputDir, initialUrl) {
48
52
  });
49
53
  child.unref();
50
54
 
51
- // Poll for session.json (50ms interval, 15s timeout)
52
- const deadline = Date.now() + 15000;
55
+ // Poll for session.json (50ms interval, 30s timeout). 30s covers cold
56
+ // Chromium boot plus initial-URL navigation on slower CI/older hardware;
57
+ // the previous 15s was tight enough that the ad-blocklist's added
58
+ // CDP setup time pushed real boots past it on stress runs.
59
+ const deadline = Date.now() + 30000;
53
60
  while (Date.now() < deadline) {
54
61
  if (existsSync(sessionPath)) {
55
62
  try {
@@ -59,7 +66,7 @@ export async function startDaemon(opts, outputDir, initialUrl) {
59
66
  }
60
67
  await new Promise((r) => setTimeout(r, 50));
61
68
  }
62
- throw new Error('Daemon failed to start within 15s');
69
+ throw new Error('Daemon failed to start within 30s');
63
70
  }
64
71
 
65
72
  /**
@@ -79,6 +86,8 @@ export async function runDaemon(opts, outputDir, initialUrl) {
79
86
  viewport: opts.viewport,
80
87
  storageState: opts.storageState,
81
88
  downloadPath: opts.downloadPath,
89
+ blockAds: opts.blockAds,
90
+ blockUrls: opts.blockUrls,
82
91
  });
83
92
 
84
93
  // Console log capture
package/src/index.js CHANGED
@@ -16,6 +16,7 @@ import { prune as pruneTree } from './prune.js';
16
16
  import { click as cdpClick, type as cdpType, scroll as cdpScroll, press as cdpPress, hover as cdpHover, select as cdpSelect, drag as cdpDrag, upload as cdpUpload } from './interact.js';
17
17
  import { dismissConsent } from './consent.js';
18
18
  import { applyStealth } from './stealth.js';
19
+ import { DEFAULT_BLOCKLIST } from './blocklist.js';
19
20
  import { waitForNetworkIdle } from './network-idle.js';
20
21
  import { join as pathJoin } from 'node:path';
21
22
 
@@ -29,6 +30,11 @@ import { join as pathJoin } from 'node:path';
29
30
  * @param {boolean} [opts.cookies=true] - Inject user's cookies (Phase 2)
30
31
  * @param {boolean} [opts.prune=true] - Apply ARIA pruning (Phase 2)
31
32
  * @param {number} [opts.timeout=30000] - Navigation timeout in ms
33
+ * @param {boolean} [opts.blockAds=true] - Block ~120 common ad/tracker
34
+ * URL patterns via CDP. Shrinks ARIA snapshots and speeds page loads.
35
+ * See src/blocklist.js for the default set. Set false to disable.
36
+ * @param {string[]} [opts.blockUrls] - Extra URL glob patterns to block,
37
+ * merged with the default list; with blockAds:false only these are blocked.
32
38
  * @returns {Promise<string>} ARIA snapshot text
33
39
  */
34
40
  export async function browse(url, opts = {}) {
@@ -53,7 +59,8 @@ export async function browse(url, opts = {}) {
53
59
  }
54
60
 
55
61
  // Step 2: Create a new page target and attach
56
- let page = await createPage(cdp, mode !== 'headed', { viewport: opts.viewport });
62
+ const pageOpts = { viewport: opts.viewport, blockAds: opts.blockAds, blockUrls: opts.blockUrls };
63
+ let page = await createPage(cdp, mode !== 'headed', pageOpts);
57
64
 
58
65
  // Step 2.5: Suppress permission prompts
59
66
  await suppressPermissions(cdp);
@@ -87,7 +94,7 @@ export async function browse(url, opts = {}) {
87
94
  try {
88
95
  browser = await launch({ ...launchOpts, headed: true });
89
96
  cdp = await createCDP(browser.wsUrl);
90
- page = await createPage(cdp, false, { viewport: opts.viewport });
97
+ page = await createPage(cdp, false, pageOpts);
91
98
  await suppressPermissions(cdp);
92
99
  if (opts.cookies !== false) {
93
100
  try { await authenticate(page.session, url, { browser: opts.browser }); } catch {}
@@ -110,7 +117,11 @@ export async function browse(url, opts = {}) {
110
117
  snapshot = raw;
111
118
  }
112
119
  const stats = `url: ${url}\n${raw.length.toLocaleString()} chars → ${snapshot.length.toLocaleString()} chars (${Math.round((1 - snapshot.length / raw.length) * 100)}% pruned)`;
113
- snapshot = stats + '\n' + snapshot;
120
+ const actMode = !opts.pruneMode || opts.pruneMode === 'act';
121
+ const hint = (actMode && raw.length > 5000 && snapshot.length < 500 && snapshot.length < raw.length * 0.05)
122
+ ? `hint: act mode dropped most of the page — retry with pruneMode='read' for paragraphs and long text\n`
123
+ : '';
124
+ snapshot = stats + '\n' + hint + snapshot;
114
125
 
115
126
  // Step 7: Clean up
116
127
  await cdp.send('Target.closeTarget', { targetId: page.targetId });
@@ -135,6 +146,14 @@ export async function browse(url, opts = {}) {
135
146
  * Default: a per-session subdirectory under the OS temp dir. Downloads
136
147
  * land here as <guid>; check `page.downloads` for { url, suggestedFilename,
137
148
  * savedPath, state, totalBytes, receivedBytes } per file.
149
+ * @param {boolean} [opts.blockAds] - Block ~120 common ad/tracker URL
150
+ * patterns via CDP. Defaults to true for launched browsers, false in
151
+ * attach mode (would affect any tab attached to the user's running
152
+ * session). Setting blockAds:true explicitly in attach mode honors the
153
+ * request — blocking applies to whichever tab the session is currently
154
+ * attached to and follows the session across switchTab() until close.
155
+ * @param {string[]} [opts.blockUrls] - Extra URL glob patterns to block,
156
+ * merged with the default list; with blockAds false only these are blocked.
138
157
  * @returns {Promise<object>} Page handle with goto, snapshot, close
139
158
  */
140
159
  export async function connect(opts = {}) {
@@ -165,7 +184,15 @@ export async function connect(opts = {}) {
165
184
  // (they'd persist in the user's session via addScriptToEvaluateOnNewDocument)
166
185
  // and the headed→headless rewind in goto() is gated off below.
167
186
  let currentlyHeaded = attachMode || (mode === 'headed');
168
- let page = await createPage(cdp, !currentlyHeaded, { viewport: opts.viewport });
187
+ // Default blockAds on for owned browsers, off in attach mode (would affect
188
+ // any tab we attach to in the user's running session). Caller can flip with
189
+ // explicit blockAds:true/false.
190
+ const pageOpts = {
191
+ viewport: opts.viewport,
192
+ blockAds: opts.blockAds !== undefined ? opts.blockAds : !attachMode,
193
+ blockUrls: opts.blockUrls,
194
+ };
195
+ let page = await createPage(cdp, !currentlyHeaded, pageOpts);
169
196
  let refMap = new Map();
170
197
  let botBlocked = false;
171
198
 
@@ -300,7 +327,7 @@ export async function connect(opts = {}) {
300
327
 
301
328
  browser = await launch(launchOpts);
302
329
  cdp = await createCDP(browser.wsUrl);
303
- page = await createPage(cdp, true, { viewport: opts.viewport });
330
+ page = await createPage(cdp, true, pageOpts);
304
331
  setupDialogHandler(page.session);
305
332
  await suppressPermissions(cdp);
306
333
  currentlyHeaded = false;
@@ -326,7 +353,7 @@ export async function connect(opts = {}) {
326
353
  try {
327
354
  browser = await launch({ ...launchOpts, headed: true });
328
355
  cdp = await createCDP(browser.wsUrl);
329
- page = await createPage(cdp, false, { viewport: opts.viewport });
356
+ page = await createPage(cdp, false, pageOpts);
330
357
  setupDialogHandler(page.session);
331
358
  await suppressPermissions(cdp);
332
359
  await navigate(page, url, timeout);
@@ -382,10 +409,14 @@ export async function connect(opts = {}) {
382
409
  const pageUrl = entries[currentIndex]?.url || '';
383
410
  const warn = botBlocked ? '[BOT CHALLENGE DETECTED — page content may be incomplete or blocked]\n' : '';
384
411
  if (pruneOpts === false) return `url: ${pageUrl}\n` + warn + raw;
385
- const pruned = pruneTree(result.tree, { mode: pruneOpts?.mode || 'act' });
412
+ const mode = pruneOpts?.mode || 'act';
413
+ const pruned = pruneTree(result.tree, { mode });
386
414
  const out = formatTree(pruned);
387
415
  const stats = `url: ${pageUrl}\n${raw.length.toLocaleString()} chars → ${out.length.toLocaleString()} chars (${Math.round((1 - out.length / raw.length) * 100)}% pruned)`;
388
- return stats + '\n' + warn + out;
416
+ const hint = (mode === 'act' && raw.length > 5000 && out.length < 500 && out.length < raw.length * 0.05)
417
+ ? `hint: act mode dropped most of the page — retry with pruneMode='read' for paragraphs and long text\n`
418
+ : '';
419
+ return stats + '\n' + hint + warn + out;
389
420
  },
390
421
 
391
422
  async click(ref) {
@@ -465,7 +496,7 @@ export async function connect(opts = {}) {
465
496
  // closure handle used by every method below, so swapping it makes
466
497
  // snapshot/click/type/etc. operate on the new tab.
467
498
  const oldSessionId = page.sessionId;
468
- page = await attachToExistingTarget(cdp, target.targetId);
499
+ page = await attachToExistingTarget(cdp, target.targetId, pageOpts);
469
500
  refMap = new Map(); // refs from the previous tab are no longer valid
470
501
  setupDialogHandler(page.session);
471
502
  try { await cdp.send('Target.detachFromTarget', { sessionId: oldSessionId }); } catch {}
@@ -553,7 +584,7 @@ export async function connect(opts = {}) {
553
584
  get cdp() { return page.session; },
554
585
 
555
586
  async createTab() {
556
- const tab = await createPage(cdp, !currentlyHeaded, { viewport: opts.viewport });
587
+ const tab = await createPage(cdp, !currentlyHeaded, pageOpts);
557
588
  await suppressPermissions(cdp);
558
589
  setupDialogHandler(tab.session);
559
590
  let tabBotBlocked = false;
@@ -645,6 +676,12 @@ async function createPage(cdp, stealth = false, pageOpts = {}) {
645
676
  await applyStealth(session);
646
677
  }
647
678
 
679
+ // Ad/tracker URL blocking via CDP. Default on for owned browsers: shrinks
680
+ // ARIA snapshots, speeds loads. connect() defaults it off in attach mode (it
681
+ // would affect tabs in the user's running browser); blockAds:false skips it.
682
+ // blockUrls patterns extend the default; with blockAds:false they apply alone.
683
+ await applyBlocklist(session, pageOpts);
684
+
648
685
  // Set viewport size if specified (e.g. "1280x720")
649
686
  if (pageOpts.viewport) {
650
687
  const [w, h] = pageOpts.viewport.split('x').map(Number);
@@ -710,16 +747,35 @@ async function attachFrameTracking(cdp, mainSession) {
710
747
  * Attach a CDP session to an existing target (e.g. a tab opened by window.open).
711
748
  * Enables the same domains as createPage so snapshot/click/type work uniformly.
712
749
  */
713
- async function attachToExistingTarget(cdp, targetId) {
750
+ async function attachToExistingTarget(cdp, targetId, pageOpts = {}) {
714
751
  const { sessionId } = await cdp.send('Target.attachToTarget', { targetId, flatten: true });
715
752
  const session = cdp.session(sessionId);
716
753
  await session.send('Page.enable');
717
754
  await session.send('Network.enable');
718
755
  await session.send('DOM.enable');
756
+ await applyBlocklist(session, pageOpts);
719
757
  const framesByFrameId = await attachFrameTracking(cdp, session);
720
758
  return { session, targetId, sessionId, framesByFrameId };
721
759
  }
722
760
 
761
+ /**
762
+ * Apply Network.setBlockedURLs for ad/tracker blocking on a session.
763
+ * Default list is on; pass blockAds:false to drop it, blockUrls:[...] to add patterns.
764
+ * Silent on failure — older Chrome / unusual modes shouldn't break the page.
765
+ */
766
+ async function applyBlocklist(session, pageOpts) {
767
+ if (pageOpts.blockAds === false && !pageOpts.blockUrls) return;
768
+ const patterns = pageOpts.blockAds === false
769
+ ? (pageOpts.blockUrls || [])
770
+ : [...DEFAULT_BLOCKLIST, ...(pageOpts.blockUrls || [])];
771
+ if (!patterns.length) return;
772
+ try {
773
+ await session.send('Network.setBlockedURLs', { urls: patterns });
774
+ } catch {
775
+ // Network.setBlockedURLs unsupported on this Chrome — skip silently.
776
+ }
777
+ }
778
+
723
779
  /**
724
780
  * Navigate to a URL and wait for the page to load.
725
781
  */
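The pattern-resolution rules in `applyBlocklist` can be restated as a pure function. This is a minimal sketch, not the package's code: `resolveBlockPatterns` is a hypothetical name, and `DEFAULTS` holds two placeholder globs standing in for the real `DEFAULT_BLOCKLIST` from `src/blocklist.js`.

```javascript
// Placeholder stand-in for DEFAULT_BLOCKLIST; the real list lives in
// src/blocklist.js and these two globs are invented for illustration.
const DEFAULTS = ['*://ads.example/*', '*://tracker.example/*'];

// Hypothetical helper: returns the list that would be handed to
// Network.setBlockedURLs, or [] when no call should be made at all.
function resolveBlockPatterns(pageOpts = {}, defaults = DEFAULTS) {
  const extra = pageOpts.blockUrls || [];
  // blockAds:false drops the defaults but still honors explicit blockUrls.
  if (pageOpts.blockAds === false) return [...extra];
  // blockAds undefined or true: defaults first, caller patterns appended.
  return [...defaults, ...extra];
}

console.log(resolveBlockPatterns({}));
console.log(resolveBlockPatterns({ blockAds: false }));
console.log(resolveBlockPatterns({ blockAds: false, blockUrls: ['*://cdn.example/beacon*'] }));
```

Note that an undefined `blockAds` counts as on at this level; the attach-mode default of off is decided earlier, in `connect()`, before the options reach this function.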
package/src/prune.js CHANGED
@@ -65,7 +65,8 @@ const SKIP_ROLES = new Set([
65
65
  * @returns {object|null} Pruned tree
66
66
  */
67
67
  export function prune(tree, options = {}) {
68
- const { mode = 'act', context = '' } = options;
68
+ let { mode = 'act', context = '' } = options;
69
+ if (mode === 'read') mode = 'browse';
69
70
  const allowedRegions = MODE_REGIONS[mode] || MODE_REGIONS.act;
70
71
  const isBrowse = mode === 'browse';
71
72
  const keywords = context
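The `'read'` alias added here pairs with the snapshot hint added in `src/index.js`. The sketch below restates both conditions as standalone functions; `normalizeMode` and `shouldHintReadMode` are hypothetical names used only for illustration and do not exist in the package.

```javascript
// Hypothetical helper: 'read' is accepted as an alias for the 'browse' mode.
function normalizeMode(mode = 'act') {
  return mode === 'read' ? 'browse' : mode;
}

// Hypothetical helper: the hint fires only in act mode, on a large page
// (> 5000 chars raw) that pruning reduced to under 500 chars AND under 5%
// of the raw snapshot length.
function shouldHintReadMode(mode, rawLen, prunedLen) {
  return mode === 'act'
    && rawLen > 5000
    && prunedLen < 500
    && prunedLen < rawLen * 0.05;
}

console.log(normalizeMode('read'));                   // alias resolves to browse
console.log(shouldHintReadMode('act', 20000, 300));   // large page, tiny result
console.log(shouldHintReadMode('act', 6000, 400));    // 400 is not under 5% of 6000
```

Both size thresholds must hold, so a 6,000-character page pruned to 400 characters stays quiet even though 400 is under the 500-character cap.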
package/src/stealth.js CHANGED
@@ -92,6 +92,75 @@ const STEALTH_SCRIPT = `
92
92
  return origGetParam2.apply(this, arguments);
93
93
  };
94
94
  }
95
+
96
+ // Canvas fingerprinting — sites render standard text/shapes, then read
97
+ // pixels via toDataURL or getImageData. The output is stable per machine
98
+ // (GPU, font rasterizer, anti-aliasing) but unique across machines, which
99
+ // makes it the second-most-common fingerprint after WebGL. Defense: nudge
100
+ // a few RGB channels by ±1 per session so the hash changes each visit
101
+ // while the canvas still looks identical to the human eye. The per-tab
102
+ // seed keeps reads stable within a session so legitimate canvas use
103
+ // (image processing, games) doesn't flicker.
104
+ // crypto.getRandomValues makes seed collisions between browsing contexts
105
+ // vanishingly unlikely; Math.random alone can collide when two fresh V8
106
+ // contexts start within microseconds of each other (we observed this under
107
+ // parallel tests). performance.now mixes in a per-context timestamp as a
108
+ // belt-and-braces guard against contexts where crypto is somehow stubbed.
109
+ const _seedBuf = new Uint32Array(1);
110
+ crypto.getRandomValues(_seedBuf);
111
+ const CANVAS_SEED = (_seedBuf[0] ^ ((performance.now() * 1e6) | 0)) >>> 0;
112
+ function shiftPixels(data) {
113
+ // Touch ~1 byte per 64-byte stride. The bit we XOR with is taken from a
114
+ // position-dependent SLICE of a seed-mixed hash, not its low bit. A
115
+ // naive 'mix & 1' collapses to one of only two masks, chosen by seed
116
+ // parity: every stride index i is even, so i * k has a zero low bit, and
117
+ // the odd seed multiplier leaves the seed's low bit as is. Indexing the
118
+ // hash by (i/64) mod 32 makes every stride position sample a different
119
+ // bit, so two distinct seeds produce different mask patterns.
120
+ for (let i = 0; i < data.length; i += 64) {
121
+ const h = ((CANVAS_SEED * 2654435761) ^ (i * 1597334677)) >>> 0;
122
+ const bit = (h >>> ((i >>> 6) & 31)) & 1;
123
+ data[i] = (data[i] ^ bit) & 0xff;
124
+ }
125
+ return data;
126
+ }
127
+ // Capture originals BEFORE replacing — toDataURL must read pixels via the
128
+ // native getImageData (not the patched one), otherwise the patch double-
129
+ // applies and the second XOR cancels the first, leaving output unchanged.
130
+ const origGetImageData = CanvasRenderingContext2D.prototype.getImageData;
131
+ const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
132
+ HTMLCanvasElement.prototype.toDataURL = function() {
133
+ const ctx = this.getContext('2d');
134
+ if (ctx && this.width > 0 && this.height > 0) {
135
+ try {
136
+ const img = origGetImageData.call(ctx, 0, 0, this.width, this.height);
137
+ // Snapshot the original bytes so we can restore them after encoding.
138
+ // Without this, repeated toDataURL() alternates noisy/clean: call 1
139
+ // XORs the canvas in place, call 2 reads the noisy canvas and XORs
140
+ // again (self-inverse), call 3 again, etc. Same XOR-cancellation
141
+ // class as the earlier double-apply bug, just through canvas state
142
+ // rather than method composition. The restore also keeps the
143
+ // bitmap idempotent for any downstream legitimate canvas reads.
144
+ const original = new Uint8ClampedArray(img.data);
145
+ shiftPixels(img.data);
146
+ ctx.putImageData(img, 0, 0);
147
+ const result = origToDataURL.apply(this, arguments);
148
+ img.data.set(original);
149
+ ctx.putImageData(img, 0, 0);
150
+ return result;
151
+ } catch {
152
+ // Tainted canvas (cross-origin image) — can't read; skip the nudge
153
+ // and fall through to the original call so the page sees the
154
+ // expected SecurityError instead of silent corruption.
155
+ }
156
+ }
157
+ return origToDataURL.apply(this, arguments);
158
+ };
159
+ CanvasRenderingContext2D.prototype.getImageData = function() {
160
+ const img = origGetImageData.apply(this, arguments);
161
+ shiftPixels(img.data);
162
+ return img;
163
+ };
95
164
  `;
96
165
 
97
166
  /**