barebrowse 0.9.1 → 0.10.1

package/CHANGELOG.md CHANGED
@@ -1,5 +1,124 @@
  # Changelog
 
+ ## 0.10.1
+
+ ### Blocklist long-tail additions + legacy-Chrome warn + switchTab attach-mode test
+
+ Carry-forward items from the v0.10.0 backlog. All additive, no behavior
+ change on supported Chrome.
+
+ - **8 new patterns in `src/blocklist.js`** (120 → 128, still in the
+   curated 80–200 band):
+   - Mobile-measurement-on-web cluster (increasingly served from web
+     pages, not just SDKs): `*.appsflyer.com`, `*.branch.io`,
+     `*.adjust.com`.
+   - Privacy-friendly analytics that still tracks from an agent POV:
+     `static.cloudflareinsights.com` (Cloudflare Web Analytics),
+     `*.matomo.cloud` (Matomo Cloud's hosted tier).
+   - Broader Outbrain coverage: `amplify.outbrain.com`,
+     `log.outbrain.com` (in addition to the existing
+     `widgets.outbrain.com` and `*.outbrain.com/utils/*`).
+   - Broader PostHog: `*.posthog.com/static/array.js*` (the snippet
+     loader, in addition to the existing `/e/` and `/decide/` endpoints).
+ - **One-time `console.warn` when `Network.setBlockedURLs` rejects.**
+   Legacy Chromium builds lacking the method previously failed silently
+   inside `applyBlocklist`; now a single warn per process surfaces the
+   reason so callers don't wonder why blocking isn't engaging. Stays
+   silent on supported Chrome (success path), stays silent when
+   `blockAds: false` opts out entirely. Module-scoped flag —
+   intentionally not per-session, since the failure mode is the
+   browser, not the session.
+ - **`switchTab()` + `blockAds:true` attach-mode integration test.**
+   The v0.10.0 JSDoc claimed blocklist follows `switchTab()` in attach
+   mode but had no automated guard. New test in
+   `test/integration/blocklist.test.js` launches a real browser, opens
+   a second tab via raw CDP (bypassing barebrowse so the tab simulates
+   one the user already had open), attaches with explicit
+   `blockAds: true` + `blockUrls: [pattern]`, switches into that tab,
+   and asserts the tracker server gets zero hits and the tracker script
+   never executed. Locks in the post-switch `applyBlocklist` call site
+   that was added in v0.10.0.
+ - **Tests:** 143 total (5 new). 4 new unit tests in
+   `test/unit/blocklist.test.js` (long-tail coverage drift guard +
+   3-subtest warn-once suite covering rejection, success path, and
+   opted-out paths); 1 new integration test as above.
+
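Reviewer sketch for the warn-once behaviour described in the 0.10.1 entry above. The real `applyBlocklist` is not part of this diff, so everything here is an assumption made for illustration: the function name `applyBlocklistSketch`, the `session.send(method, params)` shape, and the wording of the warning.

```javascript
// Module-scoped on purpose: the failure is a property of the browser build,
// not of a single session, so one warning per process is enough.
let warnedSetBlockedUrls = false;

// `session.send(method, params)` is an assumed CDP-session interface.
async function applyBlocklistSketch(session, patterns) {
  // Opted out (or nothing to block): do nothing and stay silent.
  if (!patterns || patterns.length === 0) return false;
  try {
    await session.send('Network.setBlockedURLs', { urls: patterns });
    return true; // success path: stay silent
  } catch (err) {
    if (!warnedSetBlockedUrls) {
      warnedSetBlockedUrls = true;
      console.warn(`URL blocking unavailable (Network.setBlockedURLs rejected): ${err.message}`);
    }
    return false;
  }
}
```

Calling it twice against a session whose `send` rejects should produce a single warning; a session whose `send` resolves, or an empty pattern list, should produce none.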
+ ## 0.10.0
+
+ ### Ad/tracker URL blocking + canvas-noise stealth + Chromium pgid reap fix
+
+ Scrapling-inspired additions to make every snapshot quieter and every
+ headless session less fingerprintable, plus a flake fix surfaced by the
+ new work.
+
+ - **Ad/tracker URL blocking via CDP `Network.setBlockedURLs`.** New
+   `src/blocklist.js` ships ~120 hand-curated glob patterns covering the
+   high-frequency tracker families: Google ads + analytics, Facebook
+   Pixel, Amazon ads, MS Clarity/Bing, Adobe Marketing Cloud, the
+   consumer-pixel cluster (LinkedIn/Twitter/TikTok/Snap/Pinterest), the
+   SaaS analytics stacks (Segment/Amplitude/Mixpanel/Heap/PostHog),
+   session-replay (Hotjar/FullStory/LogRocket/Crazy Egg/Mouseflow),
+   content recommendation (Criteo/Taboola/Outbrain), supply-side ad
+   networks (AppNexus/Rubicon/PubMatic/OpenX/Trade Desk), and marketing
+   automation (HubSpot/Marketo/Pardot/Intercom/Drift). Curated by traffic
+   frequency rather than pulled wholesale from Peter Lowe — CDP does
+   linear pattern matching per request, so the long tail of regional
+   networks was measurable cost (~10ms cumulative on a 100-request page)
+   for ~5% extra coverage we'd rarely hit in agent traffic. Net effect:
+   smaller ARIA snapshots and faster page loads.
+ - **`opts.blockAds` and `opts.blockUrls` on `connect()` and `browse()`.**
+   `blockAds` defaults to `true` for launched browsers and `false` in
+   attach mode (would otherwise affect any tab in the user's running
+   browser). Explicit `blockAds: true` in attach mode is honored and
+   follows the session across `switchTab()`. `blockUrls` accepts extra
+   glob patterns merged with the default unless `blockAds: false`.
+ - **CLI flags on `bb open`: `--no-block-ads` and `--block-urls=PATTERN`**
+   (the latter repeatable). Plumbed through `cli.js`, `src/daemon.js`
+   startDaemon args, and `runDaemon` → `connect()`. Not exposed via MCP
+   or bareagent on purpose — agents inside a session shouldn't be
+   reconfiguring infra per tool call; the decision belongs at session
+   start.
+ - **Canvas fingerprint noise** in `src/stealth.js`. After WebGL
+   (already spoofed in v0.9.0), canvas `toDataURL` / `getImageData` is
+   the second-most-checked fingerprint vector — the pixel output of
+   rendered text/shapes depends on GPU, driver, and font rasterizer in
+   ways that are stable per machine but unique across machines, which
+   makes it a tracking signal that survives cookie clearing. The patch
+   XORs ~1 bit per 64-byte stride into the read pixels, with the bit
+   derived from a position-mixed hash of a per-session
+   `crypto.getRandomValues`-seeded value. Output is stable within a
+   session (so legitimate canvas use doesn't flicker) and different
+   across sessions (so fingerprinters see a fresh hash on every visit).
+   The canvas bitmap is snapshotted and restored around encoding so any
+   downstream legitimate read sees the original pixels.
+ - **Pre-existing Chromium subprocess reap flake fixed.** Chromium
+   spawns renderer/GPU/network/utility subprocesses that, under
+   `--site-per-process` (v0.9.0 H2), can outlive SIGTERM on the
+   Chromium parent by seconds while still holding profile-dir file
+   handles. Without `detached: true`, all of them shared Node's process
+   group — there was no way to signal the whole Chromium tree without
+   enumerating PIDs. `src/chromium.js` now spawns with `detached: true`
+   so each Chromium becomes its own process-group leader, and
+   `cleanupBrowser` / `reapAllSync` send SIGKILL to the negative PID
+   (the whole group) before `rmSync`. Latent in `main`, but the new
+   blocklist's added CDP setup overlapped the cleanup window enough to
+   hit ~1-in-3 under parallel test load. Side effect: terminal SIGINT
+   now goes to Node's pgid only — `registerExitHandlers`' SIGINT
+   reaper is what kills Chromium under Ctrl-C and must not be removed.
+ - **`startDaemon` poll deadline 15s → 30s** for cold-boot margin on
+   slower hardware (CI / older boxes) now that the blocklist adds a
+   small amount of CDP setup time to the session-startup path.
+ - **Tests:** 138 total (10 new). New: 5-test unit suite for
+   `DEFAULT_BLOCKLIST` (shape/coverage drift guards, must-cover
+   tracker families, no dups); 2-test integration suite that proves
+   `Network.setBlockedURLs` actually drops the matching subresource
+   and that `blockAds:false` lets it through; 2 new canvas-noise
+   subtests (patch installed, stable within session, different across
+   sessions); 1 end-to-end `bb open --block-urls=PATTERN URL` test
+   that proves the flag survives every hop through `cli.js` →
+   `startDaemon` → daemon-internal → `connect()` → `setBlockedURLs`
+   and that the tracker server sees zero hits.
+
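Reviewer sketch for the canvas-noise bullet in the 0.10.0 entry above. `src/stealth.js` is not included in this diff, so this is not the package's code. It only illustrates the described idea (one flipped bit per 64-byte stride, position chosen by a hash mixed with a per-session seed). The names `mix32` and `applyCanvasNoise`, the choice of hash, and flipping the lowest bit are all assumptions.

```javascript
// 32-bit integer mixer (murmur3-style finalizer); any decent mixer would do.
function mix32(x) {
  x = Math.imul(x ^ (x >>> 16), 0x85ebca6b);
  x = Math.imul(x ^ (x >>> 13), 0xc2b2ae35);
  return (x ^ (x >>> 16)) >>> 0;
}

// Flip the lowest bit of one byte in every `stride`-byte window.
// Same seed -> same bytes flipped (stable within a session);
// a different seed will usually pick different bytes.
function applyCanvasNoise(pixels, seed, stride = 64) {
  for (let base = 0; base < pixels.length; base += stride) {
    const span = Math.min(stride, pixels.length - base);
    const h = mix32(seed ^ Math.imul(base + 1, 0x9e3779b1));
    pixels[base + (h % span)] ^= 1;
  }
  return pixels;
}
```

In a page, the seed could be drawn once per session with `crypto.getRandomValues(new Uint32Array(1))[0]`. Because the noise is an XOR, applying it a second time with the same seed restores the original bytes.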
  ## 0.9.1
 
  ### Pruning — `pruneMode` reaches MCP / bareagent and `read` finally works
@@ -45,6 +45,8 @@ const snapshot = await browse('https://example.com', {
    prune: true, // apply ARIA pruning (47-95% token reduction)
    pruneMode: 'act', // 'act' (interactive elements) | 'read' (all content)
    consent: true, // auto-dismiss cookie consent dialogs
+   blockAds: true, // block 128 ad/tracker URL patterns (default on for owned browsers)
+   blockUrls: [], // extra URL globs to block (merged with the default)
    timeout: 30000, // navigation timeout in ms
  });
  ```
@@ -91,6 +93,8 @@ const snapshot = await browse('https://example.com', {
  - `viewport: '1280x720'` — Set viewport dimensions
  - `storageState: 'file.json'` — Load cookies/localStorage from saved state
  - `downloadPath: '/abs/dir'` — Where downloads land. Default: per-session `mkdtemp` under `/tmp/barebrowse-dl-*` that gets removed on `close()`. Caller-supplied paths are not cleaned up — caller owns the lifecycle.
+ - `blockAds: true|false` — CDP-level URL blocking of 128 common ad/tracker patterns (Google ads/analytics, FB/Amazon/MS/Adobe ad+analytics, Segment/Amplitude/Mixpanel/Heap/PostHog, Hotjar/FullStory/LogRocket, Criteo/Taboola/Outbrain, the consumer-pixel cluster, AppNexus/Rubicon/PubMatic supply, marketing automation; v0.10.1 added AppsFlyer/Branch/Adjust, Cloudflare Web Analytics, Matomo Cloud). Default `true` for launched browsers, `false` in attach mode (would affect any tab in the user's running browser). Explicit `true` in attach mode is honored and follows the session across `switchTab()` (regression-tested). Shrinks ARIA snapshots and speeds page loads. On legacy Chromium lacking `Network.setBlockedURLs` a one-time `console.warn` surfaces the fallback.
+ - `blockUrls: ['*://foo.com/*', ...]` — Extra glob patterns (CDP `Network.setBlockedURLs` format) to block in addition to the default. Merged with the default unless `blockAds: false`.
 
  ## Snapshot format
 
@@ -161,7 +165,8 @@ barebrowse can inject cookies from the user's real browser sessions, bypassing l
  | Form submission | `press('Enter')` triggers onsubmit | Both |
  | SPA navigation | `waitForNavigation()` uses loadEventFired + frameNavigated | Both |
  | Bot detection | v0.9.0 (H9): Cloudflare-strong phrases ("Just a moment", "Attention Required", "verify you are human") fire alone; generic phrases ("access denied", "unknown error") only fire on near-empty pages — no more false-positive headed-launches on legitimate 4xx/5xx pages. `botBlocked` flag set after every `goto()`. Hybrid fallback switches to headed. Snapshot shows `[BOT CHALLENGE DETECTED]` warning. | Hybrid |
- | Stealth (headless tells) | v0.9.0 (H4): `Network.setUserAgentOverride` strips "HeadlessChrome" from UA in HTTP headers AND `navigator.userAgent`; JS patches for webdriver, plugins, languages, full `chrome.runtime` enum shape, `Notification` constructor + `permission: 'default'`, `hardwareConcurrency: 8`, `deviceMemory: 8`, WebGL `UNMASKED_VENDOR_WEBGL`/`UNMASKED_RENDERER_WEBGL` spoofed to Intel | Headless |
+ | Stealth (headless tells) | v0.9.0 (H4): `Network.setUserAgentOverride` strips "HeadlessChrome" from UA in HTTP headers AND `navigator.userAgent`; JS patches for webdriver, plugins, languages, full `chrome.runtime` enum shape, `Notification` constructor + `permission: 'default'`, `hardwareConcurrency: 8`, `deviceMemory: 8`, WebGL `UNMASKED_VENDOR_WEBGL`/`UNMASKED_RENDERER_WEBGL` spoofed to Intel. v0.10.0: canvas fingerprint noise — `toDataURL`/`getImageData` XOR a per-session `crypto.getRandomValues`-seeded mask into ~1 byte per 64-byte stride (stable within a session, different across sessions; bitmap is restored after encoding so legitimate canvas use is unaffected). | Headless |
+ | Ad / tracker URL blocking | v0.10.0: CDP `Network.setBlockedURLs` with 128 curated patterns (Google/FB/Amazon/MS/Adobe ad+analytics, the major SaaS analytics + session-replay stacks, content-rec, supply-side ad networks, marketing automation). v0.10.1 added long-tail: AppsFlyer/Branch/Adjust, Cloudflare Web Analytics, Matomo Cloud, broader Outbrain (`amplify`/`log`) and PostHog (`/static/array.js`). Default on for launched browsers, off in attach mode. `opts.blockUrls` extends; `opts.blockAds: false` disables. Shrinks ARIA snapshots and speeds loads. v0.10.1: regression-tested across `switchTab()` in attach mode; one-time `console.warn` if Chromium lacks the CDP method. | Launched |
  | iframe / OOPIF content (Stripe, reCAPTCHA, embedded forms) | v0.9.0 (H2): `Target.setAutoAttach({flatten:true})` registers a CDP session per iframe; `ariaTree()` walks `Page.getFrameTree`, fetches each frame's AX tree on the right session, splices children under iframe placeholders via `DOM.getFrameOwner`. Refs route via `{session, backendNodeId}` so clicks dispatch in the iframe's Input domain. `--site-per-process` launch flag forces every iframe — including same-origin — into OOPIF so coords work. | Both |
  | Downloads | v0.9.0 (H7): `Browser.setDownloadBehavior({behavior:'allowAndName', downloadPath, eventsEnabled:true})` + listeners populate `page.downloads`. Files land at `savedPath` (under `--download-path` if supplied, else per-session `/tmp/barebrowse-dl-*`). | Headless + Headed (skipped in attach mode) |
  | Profile locking | Unique temp dir per headless instance | Headless |
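Reviewer sketch of how the documented `blockAds` / `blockUrls` options could resolve to a final pattern list. `createPage` is not shown in this diff, so the helper name `resolveBlockPatterns` is hypothetical. Two behaviours are assumptions: that `blockAds: false` disables `blockUrls` as well (one reading of "merged with the default unless `blockAds: false`"), and that duplicates are removed.

```javascript
// defaults: the built-in list (DEFAULT_BLOCKLIST in the package).
// attachMode: true when attaching to the user's already-running browser.
function resolveBlockPatterns(opts, attachMode, defaults) {
  // Unset -> on for launched browsers, off in attach mode.
  const blockAds = opts.blockAds !== undefined ? opts.blockAds : !attachMode;
  if (!blockAds) return []; // assumption: opting out also ignores blockUrls
  // Extras are appended to the default list; Set drops duplicates.
  return [...new Set([...defaults, ...(opts.blockUrls ?? [])])];
}
```

With `defaults = ['*://a/*']`, a launched browser and no options yields the default list, attach mode yields an empty list, and attach mode with `{ blockAds: true, blockUrls: ['*://b/*'] }` yields both patterns.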
package/cli.js CHANGED
@@ -117,6 +117,8 @@ async function cmdOpen() {
      viewport: parseFlag('--viewport'),
      storageState: parseFlag('--storage-state'),
      downloadPath: parseFlag('--download-path'),
+     blockAds: hasFlag('--no-block-ads') ? false : undefined,
+     blockUrls: parseFlagAll('--block-urls'),
    };
 
    try {
@@ -218,6 +220,8 @@ async function runDaemonInternal() {
      viewport: parseFlag('--viewport'),
      storageState: parseFlag('--storage-state'),
      downloadPath: parseFlag('--download-path'),
+     blockAds: hasFlag('--no-block-ads') ? false : undefined,
+     blockUrls: parseFlagAll('--block-urls'),
    };
    const outputDir = parseFlag('--output-dir') || resolve('.barebrowse');
    const url = parseFlag('--url');
@@ -240,6 +244,20 @@ function hasFlag(name) {
    return args.includes(name);
  }
 
+ // Collects every occurrence of a repeatable flag (--name=val or --name val).
+ // Returns undefined when absent so the opts object stays sparse and callers
+ // can distinguish "not provided" from "provided but empty".
+ function parseFlagAll(name) {
+   const out = [];
+   for (let i = 0; i < args.length; i++) {
+     if (args[i].startsWith(name + '=')) out.push(args[i].slice(name.length + 1));
+     else if (args[i] === name && args[i + 1] && !args[i + 1].startsWith('--')) {
+       out.push(args[i + 1]); i++;
+     }
+   }
+   return out.length ? out : undefined;
+ }
+
 
  // --- MCP auto-installer ---
 
@@ -467,6 +485,10 @@ Session:
    --viewport=WxH          Viewport size (e.g. 1280x720)
    --storage-state=FILE    Load cookies/localStorage from JSON file
    --download-path=DIR     Directory for downloaded files (default: per-session temp dir)
+   --no-block-ads          Disable the built-in ad/tracker blocklist (~120 patterns).
+                           Default: enabled in owned-browser modes, disabled in attach mode.
+   --block-urls=PATTERN    Extra URL glob to block (repeatable, e.g. --block-urls='*://*.foo.com/*').
+                           Use the =VALUE form when the pattern could be mistaken for a flag.
 
  Navigation:
    barebrowse goto <url>   Navigate to URL
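Reviewer sketch: a standalone copy of the `parseFlagAll` logic added in the hunk above, changed only to take the argument list as a parameter so it can be exercised in isolation. In `cli.js` it closes over the module-level `args`. It shows the `--name=value` form, the `--name value` form, and the "absent means `undefined`" contract.

```javascript
// Collect every occurrence of a repeatable flag from an argv-style array.
function parseFlagAll(args, name) {
  const out = [];
  for (let i = 0; i < args.length; i++) {
    if (args[i].startsWith(name + '=')) {
      out.push(args[i].slice(name.length + 1)); // --name=value
    } else if (args[i] === name && args[i + 1] && !args[i + 1].startsWith('--')) {
      out.push(args[i + 1]); // --name value
      i++; // skip the consumed value
    }
  }
  // undefined (not []) when absent, so callers can tell "not provided" apart.
  return out.length ? out : undefined;
}

const argv = ['open', '--block-urls=*://a.example/*', '--block-urls', '*://b.example/*', '--headed'];
const extra = parseFlagAll(argv, '--block-urls'); // both patterns, in order
```

A value that itself starts with `--` is skipped in the space-separated form, which is why the help text recommends the `=VALUE` form for such patterns.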
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "barebrowse",
-   "version": "0.9.1",
+   "version": "0.10.1",
    "description": "Authenticated web browsing for autonomous agents via CDP. URL in, pruned ARIA snapshot out.",
    "type": "module",
    "main": "src/index.js",
package/src/blocklist.js ADDED
@@ -0,0 +1,202 @@
+ /**
+  * blocklist.js — Ad/tracker URL patterns for CDP Network.setBlockedURLs.
+  *
+  * Curated by real-world frequency, not pulled wholesale from Peter Lowe /
+  * EasyList. CDP does linear pattern matching per request, so 3,000-entry
+  * lists add ~150ms cumulative cost on a typical page for ~5% extra coverage
+  * (long-tail regional networks the agent rarely encounters). The set below
+  * is ~120 patterns covering the trackers that actually show up in agent
+  * traffic: Google/FB/Amazon/MS/Adobe ad+analytics, the major SaaS analytics
+  * stacks (Segment/Amplitude/Mixpanel/HubSpot/Hotjar/FullStory/Heap/Mouseflow),
+  * session-replay (LogRocket/Crazy Egg/Optimizely/VWO), content-recommendation
+  * (Taboola/Outbrain/Criteo), and the consumer-pixel cluster (LinkedIn/Twitter/
+  * TikTok/Snap/Pinterest/Reddit).
+  *
+  * Patterns are CDP-format globs: '*' matches any character run.
+  *
+  * To extend at runtime, pass connect({ blockUrls: [...] }) — your patterns
+  * are merged with this default. To turn the default off entirely, pass
+  * { blockAds: false }.
+  */
+
+ export const DEFAULT_BLOCKLIST = [
+   // --- Google ads + analytics (the single biggest cluster) ---
+   '*://*.doubleclick.net/*',
+   '*://*.googlesyndication.com/*',
+   '*://*.googleadservices.com/*',
+   '*://*.googletagservices.com/*',
+   '*://*.googletagmanager.com/*',
+   '*://*.google-analytics.com/*',
+   '*://*.adservice.google.com/*',
+   '*://pagead2.googlesyndication.com/*',
+   '*://www.googleadservices.com/pagead/*',
+   '*://ssl.google-analytics.com/*',
+   '*://stats.g.doubleclick.net/*',
+
+   // --- Facebook / Meta ---
+   '*://connect.facebook.net/*',
+   '*://*.facebook.com/tr*', // Pixel (matches both /tr/... and /tr?...)
+   '*://*.fbcdn.net/signals/*',
+
+   // --- Amazon ads ---
+   '*://*.amazon-adsystem.com/*',
+   '*://aax.amazon-adsystem.com/*',
+   '*://s.amazon-adsystem.com/*',
+
+   // --- Microsoft (Bing ads + Clarity) ---
+   '*://bat.bing.com/*',
+   '*://*.clarity.ms/*',
+
+   // --- Yandex ---
+   '*://mc.yandex.ru/*',
+   '*://an.yandex.ru/*',
+   '*://yandex.ru/ads/*',
+
+   // --- Adobe Marketing Cloud ---
+   '*://*.omtrdc.net/*',
+   '*://*.demdex.net/*',
+   '*://*.everesttech.net/*',
+   '*://*.2o7.net/*',
+   '*://*.adobedtm.com/*',
+
+   // --- LinkedIn ---
+   '*://px.ads.linkedin.com/*',
+   '*://snap.licdn.com/li.lms-analytics/*',
+
+   // --- Twitter/X ---
+   '*://analytics.twitter.com/*',
+   '*://static.ads-twitter.com/*',
+   '*://*.t.co/i/adsct*',
+
+   // --- TikTok ---
+   '*://analytics.tiktok.com/*',
+   '*://business-api.tiktok.com/*',
+   '*://*.tiktokcdn.com/tiktok/*',
+
+   // --- Snap ---
+   '*://tr.snapchat.com/*',
+   '*://sc-static.net/scevent.min.js*',
+
+   // --- Pinterest ---
+   '*://ct.pinterest.com/*',
+   '*://*.pinimg.com/ct/*',
+
+   // --- Reddit ---
+   '*://events.redditmedia.com/*',
+   '*://www.redditstatic.com/ads/*',
+
+   // --- Quantcast / ComScore / Chartbeat ---
+   '*://pixel.quantserve.com/*',
+   '*://*.quantcount.com/*',
+   '*://*.scorecardresearch.com/*',
+   '*://ping.chartbeat.net/*',
+   '*://static.chartbeat.com/*',
+
+   // --- Criteo / Taboola / Outbrain (content + retargeting) ---
+   '*://*.criteo.com/*',
+   '*://*.criteo.net/*',
+   '*://cdn.taboola.com/*',
+   '*://trc.taboola.com/*',
+   '*://widgets.outbrain.com/*',
+   '*://*.outbrain.com/utils/*',
+   '*://amplify.outbrain.com/*',
+   '*://log.outbrain.com/*',
+
+   // --- Tealium / Marketo / Pardot / Salesforce marketing ---
+   '*://tags.tiqcdn.com/*',
+   '*://*.tealiumiq.com/*',
+   '*://munchkin.marketo.net/*',
+   '*://*.marketo.com/munchkin*',
+   '*://pi.pardot.com/*',
+   '*://*.exacttarget.com/cdn/*',
+
+   // --- Yahoo / Verizon Media ---
+   '*://*.yahoo.com/p.gif*',
+   '*://ad.yieldmanager.com/*',
+   '*://sp.analytics.yahoo.com/*',
+
+   // --- RUM / front-end perf (debatable, but commonly noisy) ---
+   '*://rum.pingdom.net/*',
+   '*://bam.nr-data.net/*',
+   '*://bam-cell.nr-data.net/*',
+   '*://js-agent.newrelic.com/*',
+   '*://*.browser-intake-datadoghq.com/*',
+   '*://*.browser-intake-datadoghq.eu/*',
+
+   // --- Session replay + heatmaps ---
+   '*://*.hotjar.com/*',
+   '*://*.hotjar.io/*',
+   '*://*.fullstory.com/s/*',
+   '*://*.fullstory.com/rec/*',
+   '*://r.lr-ingest.io/*',
+   '*://*.logrocket.io/*',
+   '*://cdn.lr-ingest.com/*',
+   '*://script.crazyegg.com/*',
+   '*://cdn.mouseflow.com/*',
+   '*://*.mouseflow.com/projects/*',
+
+   // --- A/B testing ---
+   '*://cdn.optimizely.com/*',
+   '*://*.optimizely.com/event*',
+   '*://dev.visualwebsiteoptimizer.com/*',
+   '*://*.vwo.com/*',
+
+   // --- Product analytics ---
+   '*://api.segment.io/*',
+   '*://cdn.segment.com/*',
+   '*://*.segment.io/v1/*',
+   '*://api.amplitude.com/*',
+   '*://api2.amplitude.com/*',
+   '*://cdn.amplitude.com/*',
+   '*://api.mixpanel.com/*',
+   '*://cdn.mxpnl.com/*',
+   '*://*.heapanalytics.com/*',
+   '*://heapanalytics.com/h*',
+   '*://*.posthog.com/e/*',
+   '*://*.posthog.com/decide/*',
+   '*://*.posthog.com/static/array.js*',
+
+   // --- Marketing automation ---
+   '*://track.hubspot.com/*',
+   '*://js.hs-scripts.com/*',
+   '*://js.hs-analytics.net/*',
+   '*://js.hsforms.net/*',
+
+   // --- Customer messaging (these load chat widgets that bloat ARIA) ---
+   '*://widget.intercom.io/*',
+   '*://api-iam.intercom.io/messenger/*',
+   '*://js.intercomcdn.com/*',
+   '*://js.driftt.com/*',
+   '*://event.api.drift.com/*',
+
+   // --- Error reporters (Sentry kept off — agents may want to see errors) ---
+   '*://sessions.bugsnag.com/*',
+   '*://notify.bugsnag.com/*',
+
+   // --- Mobile-measurement (increasingly served on web too) ---
+   '*://*.appsflyer.com/*',
+   '*://*.branch.io/*',
+   '*://*.adjust.com/*',
+
+   // --- Privacy-friendly analytics (still trackers from an agent POV) ---
+   '*://static.cloudflareinsights.com/*',
+   '*://*.matomo.cloud/*',
+
+   // --- Misc widely-deployed ad networks ---
+   '*://*.adnxs.com/*', // AppNexus / Xandr
+   '*://*.rubiconproject.com/*',
+   '*://*.pubmatic.com/*',
+   '*://*.openx.net/*',
+   '*://*.casalemedia.com/*',
+   '*://*.bidswitch.net/*',
+   '*://*.adsrvr.org/*', // The Trade Desk
+   '*://*.media.net/*',
+   '*://*.mediavoice.com/*',
+   '*://*.serving-sys.com/*', // Sizmek
+   '*://*.smartadserver.com/*',
+   '*://*.indexww.com/*',
+   '*://*.mathtag.com/*',
+   '*://*.tapad.com/*',
+   '*://*.bluekai.com/*', // Oracle Data Cloud
+   '*://*.krxd.net/*', // Salesforce / Krux
+ ];
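Reviewer sketch for checking a candidate pattern locally before adding it to the list. The file comment says `*` matches any character run; the helper below turns a glob into an anchored regular expression on that basis. It only approximates how Chromium evaluates `Network.setBlockedURLs` patterns, and the real matcher may differ in edge cases. `globToRegExp` and `isBlocked` are hypothetical names, not part of the package.

```javascript
// Convert a '*' glob into an anchored RegExp: escape regex metacharacters,
// then let each '*' match any run of characters.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '.*');
  return new RegExp(`^${escaped}$`);
}

// Linear scan, mirroring the per-request cost model described in the file comment.
function isBlocked(url, patterns) {
  return patterns.some((p) => globToRegExp(p).test(url));
}

const sample = ['*://*.facebook.com/tr*', '*://bat.bing.com/*'];
const pixelHit = isBlocked('https://www.facebook.com/tr?id=1&ev=PageView', sample);
const pageHit = isBlocked('https://example.com/index.html', sample);
```

Under this approximation `pixelHit` is `true` and `pageHit` is `false`. Note that a pattern such as `*://*.criteo.com/*` would not match the bare apex `https://criteo.com/` here; whether Chromium treats that case the same way is not verified.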
package/src/chromium.js CHANGED
@@ -16,9 +16,12 @@ let exitHandlersRegistered = false;
  function reapAllSync() {
    const toReap = [...activeBrowsers];
    activeBrowsers.clear();
-   // Send SIGKILL to everything first so the kernel reaps in parallel
+   // Send SIGKILL to the parent AND the whole process group (detached:true
+   // gives each Chromium its own pgid, so -pid targets every renderer/GPU/
+   // network child without touching Node or its other children).
    for (const b of toReap) {
      try { if (!b.process.killed) b.process.kill('SIGKILL'); } catch {}
+     try { process.kill(-b.process.pid, 'SIGKILL'); } catch {}
    }
    // Then poll each for actual death before removing its profile dir —
    // Chromium can hold file handles briefly even after SIGKILL, which would
@@ -151,8 +154,22 @@ export async function launch(opts = {}) {
    // about:blank as initial page
    args.push('about:blank');
 
+   // detached:true makes Node call setsid() so Chromium becomes its own
+   // process-group leader. Without this, the renderer/GPU/network children
+   // it forks share the Node parent's process group — SIGTERM on the
+   // Chromium PID only signals Chromium itself and the children linger,
+   // holding profile-dir files for seconds after the parent exits. Under
+   // parallel test load that races our rmSync cleanup. With a separate
+   // pgid, cleanupBrowser can signal the whole group with process.kill(-pid).
+   //
+   // Trade-off: a terminal SIGINT (Ctrl-C) is delivered to the foreground
+   // process group, which is now Node's — Chromium will NOT receive it
+   // directly. The SIGINT handler in registerExitHandlers() that calls
+   // reapAllSync() is what actually kills Chromium under Ctrl-C now. Do not
+   // remove that handler without restoring some other path to reap children.
    const child = spawn(binary, args, {
      stdio: ['ignore', 'pipe', 'pipe'],
+     detached: true,
    });
 
    // Parse the WebSocket URL from stderr
@@ -216,20 +233,33 @@ export async function cleanupBrowser(browser) {
      });
      try { browser.process.kill(); } catch {}
      await exited;
+     // SIGKILL the whole Chromium process group. The parent may have exited
+     // already (above) but renderer/GPU/network children — separate processes
+     // under --site-per-process — can outlive it by seconds, and they hold
+     // profile-dir file handles. Because launch() spawned with detached:true,
+     // the children share Chromium's pgid (not Node's), so process.kill on a
+     // negative PID reaps the whole group without touching anything else.
+     try { process.kill(-browser.process.pid, 'SIGKILL'); } catch {
+       // ESRCH = group already gone; anything else is best-effort here.
+     }
    }
    if (browser.ownedProfileDir) {
-     // Chromium can still flush files for ~hundreds of ms after exit; with
-     // --site-per-process (added in H2) every iframe is its own renderer
-     // process, each with its own pending file handles, so the old 10×100ms
-     // window (1s) wasn't always enough under parallel test load. Now
-     // 25×100ms (2.5s) plus a polling jitter to avoid every concurrent
-     // cleanup hammering at the same tick.
-     for (let i = 0; i < 25; i++) {
+     // Chromium spawns renderer + GPU + network + utility subprocesses (one
+     // per site under --site-per-process from H2), and SIGTERM on the parent
+     // doesn't guarantee the children have closed their profile-file handles
+     // by the time the parent's exit event fires. Under parallel test load
+     // we've seen handle-release take >2.5s. Retry budget here is 60×100ms
+     // jittered (~6s+ worst case). Retry on ANY error short of ENOENT —
+     // earlier code only retried ENOTEMPTY/EBUSY but Linux also reports
+     // EPERM/EACCES transiently when an open-deleted file is still being
+     // written to. force:true already swallows ENOENT, so the catch only
+     // sees real failures.
+     for (let i = 0; i < 60; i++) {
        try {
          rmSync(browser.ownedProfileDir, { recursive: true, force: true });
          break;
        } catch (err) {
-         if (err.code !== 'ENOTEMPTY' && err.code !== 'EBUSY') break;
+         if (err.code === 'ENOENT') break; // already gone
          await new Promise((r) => setTimeout(r, 100 + Math.floor(Math.random() * 50)));
        }
      }
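Reviewer sketch of the group-kill step used in both `reapAllSync` and `cleanupBrowser` above, factored into a small helper. The helper name `killProcessGroup` and the `pid <= 1` guard are additions for illustration, not part of the package. The negative-PID form of `process.kill` relies on POSIX process groups, so this assumes Linux or macOS and a child spawned with `detached: true`.

```javascript
// Send a signal to every process in the group led by `pid`.
// Returns true if the signal was delivered, false if there was nothing to
// signal (ESRCH) or the call failed for another reason (best-effort).
function killProcessGroup(pid, signal = 'SIGKILL') {
  // Guard: -0 would target our own group and -1 every process we may signal.
  if (!Number.isInteger(pid) || pid <= 1) return false;
  try {
    process.kill(-pid, signal); // negative PID = whole process group
    return true;
  } catch {
    return false;
  }
}
```

A caller would use it as `killProcessGroup(child.pid)` after the parent's exit event, before removing the profile directory.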
package/src/daemon.js CHANGED
@@ -40,6 +40,10 @@ export async function startDaemon(opts, outputDir, initialUrl) {
    if (opts.viewport) args.push('--viewport', opts.viewport);
    if (opts.storageState) args.push('--storage-state', opts.storageState);
    if (opts.downloadPath) args.push('--download-path', opts.downloadPath);
+   if (opts.blockAds === false) args.push('--no-block-ads');
+   if (Array.isArray(opts.blockUrls)) {
+     for (const p of opts.blockUrls) args.push('--block-urls', p);
+   }
 
    const child = spawn(process.execPath, args, {
      detached: true,
@@ -48,8 +52,11 @@ export async function startDaemon(opts, outputDir, initialUrl) {
    });
    child.unref();
 
-   // Poll for session.json (50ms interval, 15s timeout)
-   const deadline = Date.now() + 15000;
+   // Poll for session.json (50ms interval, 30s timeout). 30s covers cold
+   // Chromium boot plus initial-URL navigation on slower CI/older hardware;
+   // the previous 15s was tight enough that the ad-blocklist's added
+   // CDP setup time pushed real boots past it on stress runs.
+   const deadline = Date.now() + 30000;
    while (Date.now() < deadline) {
      if (existsSync(sessionPath)) {
        try {
@@ -59,7 +66,7 @@ export async function startDaemon(opts, outputDir, initialUrl) {
      }
      await new Promise((r) => setTimeout(r, 50));
    }
-   throw new Error('Daemon failed to start within 15s');
+   throw new Error('Daemon failed to start within 30s');
  }
 
  /**
@@ -79,6 +86,8 @@ export async function runDaemon(opts, outputDir, initialUrl) {
      viewport: opts.viewport,
      storageState: opts.storageState,
      downloadPath: opts.downloadPath,
+     blockAds: opts.blockAds,
+     blockUrls: opts.blockUrls,
    });
 
    // Console log capture
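Reviewer sketch of the poll-with-deadline pattern that `startDaemon` uses for `session.json` (50 ms interval, now a 30 s budget), written as a generic helper. `pollUntil` and its option names are hypothetical; the package inlines the loop rather than exposing a function like this.

```javascript
// Call `check()` every `intervalMs` until it returns something other than
// undefined, or reject once `timeoutMs` has elapsed.
async function pollUntil(check, { intervalMs = 50, timeoutMs = 30000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const value = check();
    if (value !== undefined) return value;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```

For the daemon case, `check` would test whether the session file exists and parses, returning the parsed object or `undefined` to keep waiting.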
package/src/index.js CHANGED
@@ -16,6 +16,7 @@ import { prune as pruneTree } from './prune.js';
  import { click as cdpClick, type as cdpType, scroll as cdpScroll, press as cdpPress, hover as cdpHover, select as cdpSelect, drag as cdpDrag, upload as cdpUpload } from './interact.js';
  import { dismissConsent } from './consent.js';
  import { applyStealth } from './stealth.js';
+ import { DEFAULT_BLOCKLIST } from './blocklist.js';
  import { waitForNetworkIdle } from './network-idle.js';
  import { join as pathJoin } from 'node:path';
 
@@ -29,6 +30,11 @@ import { join as pathJoin } from 'node:path';
   * @param {boolean} [opts.cookies=true] - Inject user's cookies (Phase 2)
   * @param {boolean} [opts.prune=true] - Apply ARIA pruning (Phase 2)
   * @param {number} [opts.timeout=30000] - Navigation timeout in ms
+  * @param {boolean} [opts.blockAds=true] - Block ~120 common ad/tracker
+  *   URL patterns via CDP. Shrinks ARIA snapshots and speeds page loads.
+  *   See src/blocklist.js for the default set. Set false to disable.
+  * @param {string[]} [opts.blockUrls] - Extra URL glob patterns to block,
+  *   merged with the default unless blockAds:false.
   * @returns {Promise<string>} ARIA snapshot text
   */
  export async function browse(url, opts = {}) {
  export async function browse(url, opts = {}) {
@@ -53,7 +59,8 @@ export async function browse(url, opts = {}) {
    }
 
    // Step 2: Create a new page target and attach
-   let page = await createPage(cdp, mode !== 'headed', { viewport: opts.viewport });
+   const pageOpts = { viewport: opts.viewport, blockAds: opts.blockAds, blockUrls: opts.blockUrls };
+   let page = await createPage(cdp, mode !== 'headed', pageOpts);
 
    // Step 2.5: Suppress permission prompts
    await suppressPermissions(cdp);
@@ -87,7 +94,7 @@ export async function browse(url, opts = {}) {
      try {
        browser = await launch({ ...launchOpts, headed: true });
        cdp = await createCDP(browser.wsUrl);
-       page = await createPage(cdp, false, { viewport: opts.viewport });
+       page = await createPage(cdp, false, pageOpts);
        await suppressPermissions(cdp);
        if (opts.cookies !== false) {
          try { await authenticate(page.session, url, { browser: opts.browser }); } catch {}
@@ -139,6 +146,14 @@ export async function browse(url, opts = {}) {
   *   Default: a per-session subdirectory under the OS temp dir. Downloads
   *   land here as <guid>; check `page.downloads` for { url, suggestedFilename,
   *   savedPath, state, totalBytes, receivedBytes } per file.
+  * @param {boolean} [opts.blockAds] - Block ~120 common ad/tracker URL
+  *   patterns via CDP. Defaults to true for launched browsers, false in
+  *   attach mode (would affect any tab attached to the user's running
+  *   session). Setting blockAds:true explicitly in attach mode honors the
+  *   request — blocking applies to whichever tab the session is currently
+  *   attached to and follows the session across switchTab() until close.
+  * @param {string[]} [opts.blockUrls] - Extra URL glob patterns to block,
+  *   merged with the default unless blockAds is false.
   * @returns {Promise<object>} Page handle with goto, snapshot, close
   */
  export async function connect(opts = {}) {
  export async function connect(opts = {}) {
@@ -169,7 +184,15 @@ export async function connect(opts = {}) {
169
184
  // (they'd persist in the user's session via addScriptToEvaluateOnNewDocument)
170
185
  // and the headed→headless rewind in goto() is gated off below.
171
186
  let currentlyHeaded = attachMode || (mode === 'headed');
172
- let page = await createPage(cdp, !currentlyHeaded, { viewport: opts.viewport });
187
+ // Default blockAds on for owned browsers, off in attach mode (would affect
188
+ // any tab we attach to in the user's running session). Caller can flip with
189
+ // explicit blockAds:true/false.
190
+ const pageOpts = {
191
+ viewport: opts.viewport,
192
+ blockAds: opts.blockAds !== undefined ? opts.blockAds : !attachMode,
193
+ blockUrls: opts.blockUrls,
194
+ };
195
+ let page = await createPage(cdp, !currentlyHeaded, pageOpts);
173
196
  let refMap = new Map();
174
197
  let botBlocked = false;
175
198
 
@@ -304,7 +327,7 @@ export async function connect(opts = {}) {
304
327
 
305
328
  browser = await launch(launchOpts);
306
329
  cdp = await createCDP(browser.wsUrl);
307
- page = await createPage(cdp, true, { viewport: opts.viewport });
330
+ page = await createPage(cdp, true, pageOpts);
308
331
  setupDialogHandler(page.session);
309
332
  await suppressPermissions(cdp);
310
333
  currentlyHeaded = false;
@@ -330,7 +353,7 @@ export async function connect(opts = {}) {
330
353
  try {
331
354
  browser = await launch({ ...launchOpts, headed: true });
332
355
  cdp = await createCDP(browser.wsUrl);
333
- page = await createPage(cdp, false, { viewport: opts.viewport });
356
+ page = await createPage(cdp, false, pageOpts);
334
357
  setupDialogHandler(page.session);
335
358
  await suppressPermissions(cdp);
336
359
  await navigate(page, url, timeout);
@@ -473,7 +496,7 @@ export async function connect(opts = {}) {
473
496
  // closure handle used by every method below, so swapping it makes
474
497
  // snapshot/click/type/etc. operate on the new tab.
475
498
  const oldSessionId = page.sessionId;
476
- page = await attachToExistingTarget(cdp, target.targetId);
499
+ page = await attachToExistingTarget(cdp, target.targetId, pageOpts);
477
500
  refMap = new Map(); // refs from the previous tab are no longer valid
478
501
  setupDialogHandler(page.session);
479
502
  try { await cdp.send('Target.detachFromTarget', { sessionId: oldSessionId }); } catch {}
@@ -561,7 +584,7 @@ export async function connect(opts = {}) {
561
584
  get cdp() { return page.session; },
562
585
 
563
586
  async createTab() {
564
- const tab = await createPage(cdp, !currentlyHeaded, { viewport: opts.viewport });
587
+ const tab = await createPage(cdp, !currentlyHeaded, pageOpts);
565
588
  await suppressPermissions(cdp);
566
589
  setupDialogHandler(tab.session);
567
590
  let tabBotBlocked = false;
@@ -653,6 +676,12 @@ async function createPage(cdp, stealth = false, pageOpts = {}) {
653
676
  await applyStealth(session);
654
677
  }
655
678
 
679
+ // Ad/tracker URL blocking via CDP. Default on for owned browsers — shrinks
680
+ // ARIA, speeds loads. Skipped in attach mode (would affect the user's
681
+ // running browser globally) and skippable per-call via blockAds:false.
682
+ // Custom patterns in blockUrls extend the default unless blockAds is false.
683
+ await applyBlocklist(session, pageOpts);
684
+
656
685
  // Set viewport size if specified (e.g. "1280x720")
657
686
  if (pageOpts.viewport) {
658
687
  const [w, h] = pageOpts.viewport.split('x').map(Number);
@@ -718,16 +747,52 @@ async function attachFrameTracking(cdp, mainSession) {
718
747
  * Attach a CDP session to an existing target (e.g. a tab opened by window.open).
719
748
  * Enables the same domains as createPage so snapshot/click/type work uniformly.
720
749
  */
721
- async function attachToExistingTarget(cdp, targetId) {
750
+ async function attachToExistingTarget(cdp, targetId, pageOpts = {}) {
722
751
  const { sessionId } = await cdp.send('Target.attachToTarget', { targetId, flatten: true });
723
752
  const session = cdp.session(sessionId);
724
753
  await session.send('Page.enable');
725
754
  await session.send('Network.enable');
726
755
  await session.send('DOM.enable');
756
+ await applyBlocklist(session, pageOpts);
727
757
  const framesByFrameId = await attachFrameTracking(cdp, session);
728
758
  return { session, targetId, sessionId, framesByFrameId };
729
759
  }
730
760
 
761
+ // One-time warn flag for Network.setBlockedURLs reject. Module-scoped so the
762
+ // warn fires once per process across every session — legacy Chrome will keep
763
+ // rejecting and we don't want to spam.
764
+ let blocklistWarned = false;
765
+
766
+ /**
767
+ * Apply Network.setBlockedURLs for ad/tracker blocking on a session.
768
+ * Default list is on; blockAds:false drops it, blockUrls adds extra patterns.
769
+ * On failure (legacy Chrome lacking the method) warns once and continues —
770
+ * blocking is an enhancement, not a hard requirement.
771
+ *
772
+ * Exported for unit testing of the warn-once behavior; not part of the public
773
+ * API surface.
774
+ */
775
+ export async function applyBlocklist(session, pageOpts) {
776
+ if (pageOpts.blockAds === false && !pageOpts.blockUrls) return;
777
+ const patterns = pageOpts.blockAds === false
778
+ ? (pageOpts.blockUrls || [])
779
+ : [...DEFAULT_BLOCKLIST, ...(pageOpts.blockUrls || [])];
780
+ if (!patterns.length) return;
781
+ try {
782
+ await session.send('Network.setBlockedURLs', { urls: patterns });
783
+ } catch (err) {
784
+ if (!blocklistWarned) {
785
+ blocklistWarned = true;
786
+ console.warn(`barebrowse: Network.setBlockedURLs unsupported — ad/tracker blocking disabled (${err.message})`);
787
+ }
788
+ }
789
+ }
790
+
791
+ /** Test-only: reset the warn-once flag. Not part of the public API. */
792
+ export function _resetBlocklistWarning() {
793
+ blocklistWarned = false;
794
+ }
795
+
731
796
  /**
732
797
  * Navigate to a URL and wait for the page to load.
733
798
  */
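The option handling in this file's diff can be read as two small pure decisions: whether blocking defaults on (`connect()` turns it off in attach mode unless asked), and which patterns reach `Network.setBlockedURLs`. The sketch below mirrors those branches without any CDP call. `SAMPLE_DEFAULTS`, `resolveBlockAds`, and `resolvePatterns` are hypothetical names used only for illustration; the package itself does this inline in `connect()` and `applyBlocklist`.

```javascript
// Hypothetical stand-in for DEFAULT_BLOCKLIST from src/blocklist.js.
const SAMPLE_DEFAULTS = ['*.appsflyer.com', '*.branch.io'];

// connect() default: on for launched browsers, off in attach mode,
// unless the caller passes blockAds explicitly.
function resolveBlockAds(opts, attachMode) {
  return opts.blockAds !== undefined ? opts.blockAds : !attachMode;
}

// Pattern selection as in applyBlocklist, minus the CDP call:
// blockAds:false drops the default list but still honors blockUrls.
function resolvePatterns(pageOpts, defaults = SAMPLE_DEFAULTS) {
  const extra = pageOpts.blockUrls || [];
  return pageOpts.blockAds === false ? [...extra] : [...defaults, ...extra];
}
```

One consequence worth noting: `blockAds: false` combined with `blockUrls` still blocks the custom patterns, and an empty result means nothing is sent to the browser at all.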
package/src/stealth.js CHANGED
@@ -92,6 +92,75 @@ const STEALTH_SCRIPT = `
92
92
  return origGetParam2.apply(this, arguments);
93
93
  };
94
94
  }
95
+
96
+ // Canvas fingerprinting — sites render standard text/shapes, then read
97
+ // pixels via toDataURL or getImageData. The output is stable per machine
98
+ // (GPU, font rasterizer, anti-aliasing) but unique across machines, which
99
+ // makes it the second-most-common fingerprint after WebGL. Defense: nudge
100
+ // a few RGB channels by ±1 per session so the hash changes each visit
101
+ // while the canvas still looks identical to the human eye. The per-tab
102
+ // seed keeps reads stable within a session so legitimate canvas use
103
+ // (image processing, games) doesn't flicker.
104
+ // crypto.getRandomValues draws from a CSPRNG, so seed collisions between
105
+ // browsing contexts are vanishingly unlikely; Math.random alone can collide
106
+ // when two fresh V8 contexts start within microseconds of each other (we
107
+ // observed this in parallel test runs). performance.now mixes in a
108
+ // timestamp as a belt-and-braces guard where crypto is somehow stubbed.
109
+ const _seedBuf = new Uint32Array(1);
110
+ crypto.getRandomValues(_seedBuf);
111
+ const CANVAS_SEED = (_seedBuf[0] ^ ((performance.now() * 1e6) | 0)) >>> 0;
112
+ function shiftPixels(data) {
113
+ // Touch ~1 byte per 64-byte stride. The bit we XOR with is taken from a
114
+ // position-dependent SLICE of a seed-mixed hash, not its low bit — a
115
+ // naive 'mix & 1' collapses to only two possible outputs per seed
116
+ // parity because every stride index is even (the position multiplier
117
+ // is odd, so the low bit only depends on seed parity). Indexing the
118
+ // hash by (i/64) mod 32 makes every stride position sample a different
119
+ // bit, so two distinct seeds produce different mask patterns.
120
+ for (let i = 0; i < data.length; i += 64) {
121
+ const h = ((CANVAS_SEED * 2654435761) ^ (i * 1597334677)) >>> 0;
122
+ const bit = (h >>> ((i >>> 6) & 31)) & 1;
123
+ data[i] = (data[i] ^ bit) & 0xff;
124
+ }
125
+ return data;
126
+ }
127
+ // Capture originals BEFORE replacing — toDataURL must read pixels via the
128
+ // native getImageData (not the patched one), otherwise the patch double-
129
+ // applies and the second XOR cancels the first, leaving output unchanged.
130
+ const origGetImageData = CanvasRenderingContext2D.prototype.getImageData;
131
+ const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
132
+ HTMLCanvasElement.prototype.toDataURL = function() {
133
+ const ctx = this.getContext('2d');
134
+ if (ctx && this.width > 0 && this.height > 0) {
135
+ try {
136
+ const img = origGetImageData.call(ctx, 0, 0, this.width, this.height);
137
+ // Snapshot the original bytes so we can restore them after encoding.
138
+ // Without this, repeated toDataURL() alternates noisy/clean: call 1
139
+ // XORs the canvas in place, call 2 reads the noisy canvas and XORs
140
+ // again (self-inverse), call 3 again, etc. Same XOR-cancellation
141
+ // class as the earlier double-apply bug, just through canvas state
142
+ // rather than method composition. The restore also keeps the
143
+ // bitmap idempotent for any downstream legitimate canvas reads.
144
+ const original = new Uint8ClampedArray(img.data);
145
+ shiftPixels(img.data);
146
+ ctx.putImageData(img, 0, 0);
147
+ const result = origToDataURL.apply(this, arguments);
148
+ img.data.set(original);
149
+ ctx.putImageData(img, 0, 0);
150
+ return result;
151
+ } catch {
152
+ // Tainted canvas (cross-origin image) — can't read; skip the nudge
153
+ // and fall through to the original call so the page sees the
154
+ // expected SecurityError instead of silent corruption.
155
+ }
156
+ }
157
+ return origToDataURL.apply(this, arguments);
158
+ };
159
+ CanvasRenderingContext2D.prototype.getImageData = function() {
160
+ const img = origGetImageData.apply(this, arguments);
161
+ shiftPixels(img.data);
162
+ return img;
163
+ };
95
164
  `;
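The comments in the canvas patch rely on the noise being a self-inverse XOR: applying it twice restores the original bytes, which is why `toDataURL` reads through the native `getImageData` and restores the bitmap after encoding. A minimal sketch of that property follows, runnable outside a browser. `shiftPixelsWithSeed` is a hypothetical variant that takes the seed as a parameter rather than reading `CANVAS_SEED`, and the 256-byte buffer is made-up sample data standing in for `ImageData.data`.

```javascript
// Same per-stride XOR as shiftPixels in the diff, with the seed passed in
// explicitly (an assumption, so the sketch runs without a DOM).
function shiftPixelsWithSeed(data, seed) {
  for (let i = 0; i < data.length; i += 64) {
    const h = ((seed * 2654435761) ^ (i * 1597334677)) >>> 0;
    const bit = (h >>> ((i >>> 6) & 31)) & 1; // position-dependent bit of the hash
    data[i] = (data[i] ^ bit) & 0xff;         // flips at most the low bit
  }
  return data;
}

// Hypothetical sample buffer standing in for ImageData.data.
const pixels = Uint8ClampedArray.from({ length: 256 }, (_, i) => (i * 7) & 0xff);
const original = new Uint8ClampedArray(pixels);

// One application adds noise; a second application with the same seed undoes it.
const once = shiftPixelsWithSeed(new Uint8ClampedArray(pixels), 12345);
const twice = shiftPixelsWithSeed(new Uint8ClampedArray(once), 12345);
```

Because the second pass cancels the first, any code path that applies the noise an even number of times to the same bytes ends up returning clean pixels, which is the failure mode the patch's snapshot-and-restore step guards against.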
96
165
 
97
166
  /**