barebrowse 0.9.0 → 0.10.0
- package/CHANGELOG.md +112 -0
- package/README.md +2 -0
- package/barebrowse.context.md +9 -2
- package/cli.js +22 -0
- package/mcp-server.js +4 -2
- package/package.json +1 -1
- package/src/bareagent.js +10 -4
- package/src/blocklist.js +190 -0
- package/src/chromium.js +39 -9
- package/src/daemon.js +12 -3
- package/src/index.js +67 -11
- package/src/prune.js +2 -1
- package/src/stealth.js +69 -0
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,117 @@
 # Changelog
 
+## 0.10.0
+
+### Ad/tracker URL blocking + canvas-noise stealth + Chromium pgid reap fix
+
+Scrapling-inspired additions to make every snapshot quieter and every
+headless session less fingerprintable, plus a flake fix surfaced by the
+new work.
+
+- **Ad/tracker URL blocking via CDP `Network.setBlockedURLs`.** New
+  `src/blocklist.js` ships ~120 hand-curated glob patterns covering the
+  high-frequency tracker families: Google ads + analytics, Facebook
+  Pixel, Amazon ads, MS Clarity/Bing, Adobe Marketing Cloud, the
+  consumer-pixel cluster (LinkedIn/Twitter/TikTok/Snap/Pinterest), the
+  SaaS analytics stacks (Segment/Amplitude/Mixpanel/Heap/PostHog),
+  session-replay (Hotjar/FullStory/LogRocket/Crazy Egg/Mouseflow),
+  content recommendation (Criteo/Taboola/Outbrain), supply-side ad
+  networks (AppNexus/Rubicon/PubMatic/OpenX/Trade Desk), and marketing
+  automation (HubSpot/Marketo/Pardot/Intercom/Drift). Curated by traffic
+  frequency rather than pulled wholesale from Peter Lowe — CDP does
+  linear pattern matching per request, so the long tail of regional
+  networks was measurable cost (~10ms cumulative on a 100-request page)
+  for ~5% extra coverage we'd rarely hit in agent traffic. Net effect:
+  smaller ARIA snapshots and faster page loads.
+- **`opts.blockAds` and `opts.blockUrls` on `connect()` and `browse()`.**
+  `blockAds` defaults to `true` for launched browsers and `false` in
+  attach mode (would otherwise affect any tab in the user's running
+  browser). Explicit `blockAds: true` in attach mode is honored and
+  follows the session across `switchTab()`. `blockUrls` accepts extra
+  glob patterns merged with the default unless `blockAds: false`.
+- **CLI flags on `bb open`: `--no-block-ads` and `--block-urls=PATTERN`**
+  (the latter repeatable). Plumbed through `cli.js`, `src/daemon.js`
+  startDaemon args, and `runDaemon` → `connect()`. Not exposed via MCP
+  or bareagent on purpose — agents inside a session shouldn't be
+  reconfiguring infra per tool call; the decision belongs at session
+  start.
+- **Canvas fingerprint noise** in `src/stealth.js`. After WebGL
+  (already spoofed in v0.9.0), canvas `toDataURL` / `getImageData` is
+  the second-most-checked fingerprint vector — the pixel output of
+  rendered text/shapes depends on GPU, driver, and font rasterizer in
+  ways that are stable per machine but unique across machines, which
+  makes it a tracking signal that survives cookie clearing. The patch
+  XORs ~1 bit per 64-byte stride into the read pixels, with the bit
+  derived from a position-mixed hash of a per-session
+  `crypto.getRandomValues`-seeded value. Output is stable within a
+  session (so legitimate canvas use doesn't flicker) and different
+  across sessions (so fingerprinters see a fresh hash on every visit).
+  The canvas bitmap is snapshotted and restored around encoding so any
+  downstream legitimate read sees the original pixels.
+- **Pre-existing Chromium subprocess reap flake fixed.** Chromium
+  spawns renderer/GPU/network/utility subprocesses that, under
+  `--site-per-process` (v0.9.0 H2), can outlive SIGTERM on the
+  Chromium parent by seconds while still holding profile-dir file
+  handles. Without `detached: true`, all of them shared Node's process
+  group — there was no way to signal the whole Chromium tree without
+  enumerating PIDs. `src/chromium.js` now spawns with `detached: true`
+  so each Chromium becomes its own process-group leader, and
+  `cleanupBrowser` / `reapAllSync` send SIGKILL to the negative PID
+  (the whole group) before `rmSync`. Latent in `main`, but the new
+  blocklist's added CDP setup overlapped the cleanup window enough to
+  hit ~1-in-3 under parallel test load. Side effect: terminal SIGINT
+  now goes to Node's pgid only — `registerExitHandlers`' SIGINT
+  reaper is what kills Chromium under Ctrl-C and must not be removed.
+- **`startDaemon` poll deadline 15s → 30s** for cold-boot margin on
+  slower hardware (CI / older boxes) now that the blocklist adds a
+  small amount of CDP setup time to the session-startup path.
+- **Tests:** 138 total (10 new). New: 5-test unit suite for
+  `DEFAULT_BLOCKLIST` (shape/coverage drift guards, must-cover
+  tracker families, no dups); 2-test integration suite that proves
+  `Network.setBlockedURLs` actually drops the matching subresource
+  and that `blockAds:false` lets it through; 2 new canvas-noise
+  subtests (patch installed, stable within session, different across
+  sessions); 1 end-to-end `bb open --block-urls=PATTERN URL` test
+  that proves the flag survives every hop through `cli.js` →
+  `startDaemon` → daemon-internal → `connect()` → `setBlockedURLs`
+  and that the tracker server sees zero hits.
+
+## 0.9.1
+
+### Pruning — `pruneMode` reaches MCP / bareagent and `read` finally works
+
+- **`mode: 'read'` is now a real alias for `mode: 'browse'`** in `prune()`.
+  Previously, the CLI (`barebrowse snapshot --mode=read`) and the SKILL.md
+  advertised a `read` mode that did not exist — `MODE_REGIONS[mode] ||
+  MODE_REGIONS.act` silently fell back to act-mode pruning. Articles, docs,
+  and blog posts therefore came back gutted no matter which mode the agent
+  asked for, which is why Claude tended to give up and fall back to
+  WebFetch. One-line alias at the top of `prune()` fixes it; `act|browse|
+  navigate|full` still behave unchanged.
+- **MCP `browse` and `snapshot` tools gained a `pruneMode: 'act'|'read'`
+  parameter** (mcp-server.js). Before this, the MCP surface had no way to
+  ask for any mode other than `act` — `browse`'s `mode` param was browser
+  mode (headless/headed/hybrid), and `snapshot` accepted only `maxChars`.
+  Tool descriptions now tell the caller when to pick `read` (content-heavy
+  pages: articles, docs, blogs).
+- **bareagent `browse` and `snapshot` tools gained the same `pruneMode`
+  parameter** (`src/bareagent.js`) with identical semantics. The `browse`
+  handler preserves any caller-supplied default `opts.pruneMode` when the
+  tool is called without an arg (`pruneMode ? { ...opts, pruneMode } : opts`).
+- **Auto-hint when act-mode looks suspect.** When `page.snapshot()` or
+  `browse()` is called in act mode against a substantial page (raw > 5 KB)
+  and the pruned output collapses to under 500 chars AND under 5% of raw,
+  the result includes a one-line `hint: act mode dropped most of the page
+  — retry with pruneMode='read' …` directly between the stats line and the
+  tree. Thresholds are deliberately conservative: an e-commerce or
+  search-results page (many interactive elements kept) won't trigger it;
+  a paragraph-heavy article will.
+- **Regression test:** `test/unit/prune.test.js` — "aliases mode='read' to
+  browse mode" pins the alias contract by asserting `prune(tree, {mode:
+  'read'})` deep-equals `prune(tree, {mode: 'browse'})` and that paragraphs
+  survive (the act-mode-style stripping that previously masqueraded as
+  read-mode is gone).
+
 ## 0.9.0
 
 Phase B — every H1–H9 from `docs/02-features/fix-plan.md` shipped one
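The defaulting rules in the 0.10.0 entry (on for launched browsers, off in attach mode, explicit values honored, `blockUrls` merged with the default) can be sketched as a small resolver. This is an illustration only: `resolveBlockedUrls`, the `attached` flag, and the two-entry list are hypothetical, the real logic in `src/index.js` is not shown in this diff, and applying `blockUrls` on their own when `blockAds: false` is one reading of the changelog, not confirmed behavior.

```javascript
// Hypothetical sketch of the option defaulting described in the changelog.
// Two entries stand in for the ~120 patterns in src/blocklist.js.
const DEFAULT_BLOCKLIST = ['*://*.doubleclick.net/*', '*://*.google-analytics.com/*'];

function resolveBlockedUrls(opts = {}, { attached = false } = {}) {
  // An explicit boolean wins; otherwise default on for launched, off for attach.
  const blockAds = typeof opts.blockAds === 'boolean' ? opts.blockAds : !attached;
  const extra = opts.blockUrls ?? [];
  // Assumption: with the default list off, caller-supplied patterns still apply.
  const merged = blockAds ? [...DEFAULT_BLOCKLIST, ...extra] : [...extra];
  return [...new Set(merged)]; // drop duplicates, keep order
}
```

Because the CLI passes `blockAds: undefined` when `--no-block-ads` is absent, a resolver of this shape would fall through to the mode-dependent default rather than treating the missing flag as `false`.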
package/README.md
CHANGED
@@ -94,6 +94,8 @@ Or manually add to your config (`claude_desktop_config.json`, `.cursor/mcp.json`
 
 18 tools: `browse`, `goto`, `snapshot`, `click`, `type`, `press`, `scroll`, `hover`, `select`, `back`, `forward`, `reload`, `drag`, `upload`, `pdf`, `screenshot`, `wait_for`, `tabs`. Plus `assess` (privacy scan) if [wearehere](https://github.com/hamr0/wearehere) is installed. Plus opt-in `eval` (`BAREBROWSE_MCP_EVAL=1`) — runs JS in the authenticated session, off by default because it can read cookies/localStorage. Session runs in hybrid mode with automatic cookie injection. Per-tool timeouts (goto/reload/wait_for 60s, back/forward 30s, interactive ops 15s, pdf/screenshot/upload 45s) with auto-retry on transient failures (idempotent only — mutating tools fail loudly to avoid double-submits).
 
+`browse` and `snapshot` accept `pruneMode: 'act'|'read'` (v0.9.1). `act` (default) keeps interactive elements — best for clicking/filling. `read` keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. If act-mode collapses a content-heavy page near-totally, the snapshot includes a `hint: …` line suggesting `pruneMode='read'` so the agent doesn't bail to a separate HTTP fetch.
+
 Troubleshooting MCP setup: `npx barebrowse doctor` scans every known config location and flags scope conflicts. `npx barebrowse install --force` overwrites an existing entry pointing at a different endpoint.
 
 ### 3. Library -- for agentic automation
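The `hint: …` line mentioned in the README addition fires on thresholds that the changelog states as raw > 5 KB, pruned < 500 chars, and pruned < 5% of raw. A minimal sketch of that check follows; `shouldHintReadMode` and the constant names are hypothetical, and reading "5 KB" as 5 * 1024 characters is an assumption rather than something this diff confirms.

```javascript
// Hypothetical sketch of the act-mode "looks suspect" check.
const MIN_RAW_CHARS = 5 * 1024;  // "raw > 5 KB" (assumed to mean 5 * 1024 characters)
const MAX_PRUNED_CHARS = 500;    // pruned output under 500 chars
const MAX_PRUNED_RATIO = 0.05;   // and under 5% of the raw size

function shouldHintReadMode(rawLength, prunedLength, mode = 'act') {
  if (mode !== 'act') return false; // the hint only applies to act mode
  return rawLength > MIN_RAW_CHARS
    && prunedLength < MAX_PRUNED_CHARS
    && prunedLength < rawLength * MAX_PRUNED_RATIO;
}
```

Requiring both the absolute and the relative bound is what keeps the check conservative: a 6,000-character page pruned to 400 characters passes the absolute bound but not the 5% one, so no hint is emitted.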
package/barebrowse.context.md
CHANGED
@@ -1,7 +1,7 @@
 # barebrowse -- Integration Guide
 
 > For AI assistants and developers wiring barebrowse into a project.
-> v0.9.
+> v0.9.1 | Node.js >= 22 | 0 required deps | Apache-2.0
 
 ## What this is
 
@@ -45,6 +45,8 @@ const snapshot = await browse('https://example.com', {
   prune: true,  // apply ARIA pruning (47-95% token reduction)
   pruneMode: 'act',  // 'act' (interactive elements) | 'read' (all content)
   consent: true,  // auto-dismiss cookie consent dialogs
+  blockAds: true,  // block ~120 ad/tracker URL patterns (default on for owned browsers)
+  blockUrls: [],  // extra URL globs to block (merged with the default)
   timeout: 30000,  // navigation timeout in ms
 });
 ```
@@ -91,6 +93,8 @@ const snapshot = await browse('https://example.com', {
 - `viewport: '1280x720'` — Set viewport dimensions
 - `storageState: 'file.json'` — Load cookies/localStorage from saved state
 - `downloadPath: '/abs/dir'` — Where downloads land. Default: per-session `mkdtemp` under `/tmp/barebrowse-dl-*` that gets removed on `close()`. Caller-supplied paths are not cleaned up — caller owns the lifecycle.
+- `blockAds: true|false` — CDP-level URL blocking of ~120 common ad/tracker patterns (Google ads/analytics, FB/Amazon/MS/Adobe ad+analytics, Segment/Amplitude/Mixpanel/Heap, Hotjar/FullStory/LogRocket, Criteo/Taboola/Outbrain, the consumer-pixel cluster, AppNexus/Rubicon/PubMatic supply, marketing automation). Default `true` for launched browsers, `false` in attach mode (would affect any tab in the user's running browser). Explicit `true` in attach mode is honored and follows the session across `switchTab()`. Shrinks ARIA snapshots and speeds page loads.
+- `blockUrls: ['*://foo.com/*', ...]` — Extra glob patterns (CDP `Network.setBlockedURLs` format) to block in addition to the default. Merged with the default unless `blockAds: false`.
 
 ## Snapshot format
 
@@ -161,7 +165,8 @@ barebrowse can inject cookies from the user's real browser sessions, bypassing l
 | Form submission | `press('Enter')` triggers onsubmit | Both |
 | SPA navigation | `waitForNavigation()` uses loadEventFired + frameNavigated | Both |
 | Bot detection | v0.9.0 (H9): Cloudflare-strong phrases ("Just a moment", "Attention Required", "verify you are human") fire alone; generic phrases ("access denied", "unknown error") only fire on near-empty pages — no more false-positive headed-launches on legitimate 4xx/5xx pages. `botBlocked` flag set after every `goto()`. Hybrid fallback switches to headed. Snapshot shows `[BOT CHALLENGE DETECTED]` warning. | Hybrid |
-| Stealth (headless tells) | v0.9.0 (H4): `Network.setUserAgentOverride` strips "HeadlessChrome" from UA in HTTP headers AND `navigator.userAgent`; JS patches for webdriver, plugins, languages, full `chrome.runtime` enum shape, `Notification` constructor + `permission: 'default'`, `hardwareConcurrency: 8`, `deviceMemory: 8`, WebGL `UNMASKED_VENDOR_WEBGL`/`UNMASKED_RENDERER_WEBGL` spoofed to Intel | Headless |
+| Stealth (headless tells) | v0.9.0 (H4): `Network.setUserAgentOverride` strips "HeadlessChrome" from UA in HTTP headers AND `navigator.userAgent`; JS patches for webdriver, plugins, languages, full `chrome.runtime` enum shape, `Notification` constructor + `permission: 'default'`, `hardwareConcurrency: 8`, `deviceMemory: 8`, WebGL `UNMASKED_VENDOR_WEBGL`/`UNMASKED_RENDERER_WEBGL` spoofed to Intel. v0.10.0: canvas fingerprint noise — `toDataURL`/`getImageData` XOR a per-session `crypto.getRandomValues`-seeded mask into ~1 byte per 64-byte stride (stable within a session, different across sessions; bitmap is restored after encoding so legitimate canvas use is unaffected). | Headless |
+| Ad / tracker URL blocking | v0.10.0: CDP `Network.setBlockedURLs` with ~120 curated patterns (Google/FB/Amazon/MS/Adobe ad+analytics, the major SaaS analytics + session-replay stacks, content-rec, supply-side ad networks, marketing automation). Default on for launched browsers, off in attach mode. `opts.blockUrls` extends; `opts.blockAds: false` disables. Shrinks ARIA snapshots and speeds loads. | Launched |
 | iframe / OOPIF content (Stripe, reCAPTCHA, embedded forms) | v0.9.0 (H2): `Target.setAutoAttach({flatten:true})` registers a CDP session per iframe; `ariaTree()` walks `Page.getFrameTree`, fetches each frame's AX tree on the right session, splices children under iframe placeholders via `DOM.getFrameOwner`. Refs route via `{session, backendNodeId}` so clicks dispatch in the iframe's Input domain. `--site-per-process` launch flag forces every iframe — including same-origin — into OOPIF so coords work. | Both |
 | Downloads | v0.9.0 (H7): `Browser.setDownloadBehavior({behavior:'allowAndName', downloadPath, eventsEnabled:true})` + listeners populate `page.downloads`. Files land at `savedPath` (under `--download-path` if supplied, else per-session `/tmp/barebrowse-dl-*`). | Headless + Headed (skipped in attach mode) |
 | Profile locking | Unique temp dir per headless instance | Headless |
@@ -255,6 +260,8 @@ Action tools return `'ok'` -- the agent calls `snapshot` explicitly to observe.
 
 `browse` and `snapshot` accept a `maxChars` param (default 30000). If the snapshot exceeds the limit, it's saved to `.barebrowse/page-<timestamp>.yml` and a short message with the file path is returned instead. `screenshot` always saves to `.barebrowse/screenshot-<timestamp>.{png,jpeg,webp}` and returns the file path (raw base64 in a JSON-RPC response would blow `maxChars`). `tabs` returns the JSON array, or with `switchTo: N` it switches and returns `'ok'`.
 
+`browse` and `snapshot` also accept `pruneMode: 'act'|'read'`. `act` (the default) keeps interactive elements and short labels — best for clicking/filling. `read` keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. Same surface on the bareagent adapter. If act mode collapses a content-heavy page (raw > 5 KB → pruned < 500 chars AND < 5% of raw), the result includes a `hint: act mode dropped most of the page — retry with pruneMode='read' …` line between the stats and the tree so the caller knows to re-snapshot in read mode instead of bailing to a separate HTTP fetch.
+
 Session runs in hybrid mode (headless with automatic headed fallback on bot detection). `goto` injects cookies from the user's browser before navigation for authenticated access.
 
 Session tools share a singleton page, lazy-created on first use. All session tools have auto-retry on transient failures (browser crash, WebSocket close, navigation timeout) on a per-tool deadline (v0.9.0 H5): `goto`/`reload`/`wait_for` 60s, `back`/`forward` 30s, interactive ops (`click`/`type`/`press`/`scroll`/`hover`/`select`/`drag`/`snapshot`/`eval`) 15s, `tabs` 5s, heavy I/O (`pdf`/`screenshot`/`upload`) 45s — replaces the prior blanket 30s. Session resets between attempts. Idempotent tools retry once; mutating tools (`click`/`type`/`upload`/etc.) `{ retry: false }` so partial first attempts don't replay on a fresh page. Scroll accepts `direction: "up"/"down"` in addition to numeric `deltaY`. Click falls back to JS `.click()` when elements have no layout. `browse` has a 60s timeout (no retry — stateless). Assess tries headless first; if bot-blocked, retries headed. Browser OOM/crash auto-recovers (session resets, server stays alive).
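The Stealth row describes canvas noise that is stable within a session and different across sessions. One way such a scheme could work is sketched below. It is not the code in `src/stealth.js` (that file's diff is not part of this excerpt): `mixHash`, `applyCanvasNoise`, and the choice to touch the first byte of each 64-byte stride are all assumptions made for illustration, and the sketch works on a copy instead of snapshotting and restoring the canvas bitmap.

```javascript
// Hypothetical sketch: XOR one hash-derived bit into each 64-byte stride.
const STRIDE = 64;

// 32-bit integer mixer (murmur3-style finalizer) over seed and byte position.
function mixHash(seed, index) {
  let h = (seed ^ Math.imul(index, 0x9e3779b1)) >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b);
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35);
  return (h ^ (h >>> 16)) >>> 0;
}

// Returns a noisy copy of the pixel bytes; the input array is left untouched.
function applyCanvasNoise(pixels, seed) {
  const out = new Uint8ClampedArray(pixels);
  for (let i = 0; i < out.length; i += STRIDE) {
    out[i] ^= mixHash(seed, i) & 1; // flips the low bit, or leaves it, per position
  }
  return out;
}
```

In a browser the seed would be drawn once per session, for example with `crypto.getRandomValues(new Uint32Array(1))[0]`, so repeated reads of the same canvas agree with each other while a new session produces a different fingerprint hash.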
package/cli.js
CHANGED
@@ -117,6 +117,8 @@ async function cmdOpen() {
     viewport: parseFlag('--viewport'),
     storageState: parseFlag('--storage-state'),
     downloadPath: parseFlag('--download-path'),
+    blockAds: hasFlag('--no-block-ads') ? false : undefined,
+    blockUrls: parseFlagAll('--block-urls'),
   };
 
   try {
@@ -218,6 +220,8 @@ async function runDaemonInternal() {
     viewport: parseFlag('--viewport'),
     storageState: parseFlag('--storage-state'),
     downloadPath: parseFlag('--download-path'),
+    blockAds: hasFlag('--no-block-ads') ? false : undefined,
+    blockUrls: parseFlagAll('--block-urls'),
   };
   const outputDir = parseFlag('--output-dir') || resolve('.barebrowse');
   const url = parseFlag('--url');
@@ -240,6 +244,20 @@ function hasFlag(name) {
   return args.includes(name);
 }
 
+// Collects every occurrence of a repeatable flag (--name=val or --name val).
+// Returns undefined when absent so the opts object stays sparse and callers
+// can distinguish "not provided" from "provided but empty".
+function parseFlagAll(name) {
+  const out = [];
+  for (let i = 0; i < args.length; i++) {
+    if (args[i].startsWith(name + '=')) out.push(args[i].slice(name.length + 1));
+    else if (args[i] === name && args[i + 1] && !args[i + 1].startsWith('--')) {
+      out.push(args[i + 1]); i++;
+    }
+  }
+  return out.length ? out : undefined;
+}
+
 
 // --- MCP auto-installer ---
 
@@ -467,6 +485,10 @@ Session:
   --viewport=WxH         Viewport size (e.g. 1280x720)
   --storage-state=FILE   Load cookies/localStorage from JSON file
   --download-path=DIR    Directory for downloaded files (default: per-session temp dir)
+  --no-block-ads         Disable the built-in ad/tracker blocklist (~120 patterns).
+                         Default: enabled in owned-browser modes, disabled in attach mode.
+  --block-urls=PATTERN   Extra URL glob to block (repeatable, e.g. --block-urls='*://*.foo.com/*').
+                         Use the =VALUE form when the pattern could be mistaken for a flag.
 
 Navigation:
   barebrowse goto <url>  Navigate to URL
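To see how the new repeatable-flag parser behaves, here is a standalone variant of `parseFlagAll` from the hunk above. The only change is that it takes the argument list as a parameter (`argv`, a name introduced here) instead of reading the module-level `args`, so it can be run in isolation; the sample arguments are made up for illustration.

```javascript
// Standalone adaptation of parseFlagAll; `argv` replaces the module-level `args`.
function parseFlagAll(argv, name) {
  const out = [];
  for (let i = 0; i < argv.length; i++) {
    // --name=value form
    if (argv[i].startsWith(name + '=')) out.push(argv[i].slice(name.length + 1));
    // --name value form; a following token that looks like a flag is not consumed
    else if (argv[i] === name && argv[i + 1] && !argv[i + 1].startsWith('--')) {
      out.push(argv[i + 1]); i++;
    }
  }
  return out.length ? out : undefined;
}

const sample = ['open', '--block-urls=*://a.example/*', '--block-urls', '*://b.example/*'];
console.log(parseFlagAll(sample, '--block-urls'));
```

The `startsWith('--')` guard on the space-separated form is why the help text recommends `--block-urls=PATTERN` when a pattern could be mistaken for a flag: a bare `--block-urls` followed by another flag yields nothing for that occurrence.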
package/mcp-server.js
CHANGED
@@ -150,6 +150,7 @@ export const TOOLS = [
       properties: {
         url: { type: 'string', description: 'URL to browse' },
         mode: { type: 'string', enum: ['headless', 'headed', 'hybrid'], description: 'Browser mode (default: headless)' },
+        pruneMode: { type: 'string', enum: ['act', 'read'], description: 'Pruning mode. "act" (default) keeps interactive elements and short labels — best for clicking/filling. "read" keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. If the page is content-heavy and act-mode returns mostly empty, retry with "read".' },
         maxChars: { type: 'number', description: 'Max chars to return inline. Larger snapshots are saved to .barebrowse/ and a file path is returned instead. Default: 30000.' },
       },
       required: ['url'],
@@ -172,6 +173,7 @@ export const TOOLS = [
     inputSchema: {
       type: 'object',
       properties: {
+        pruneMode: { type: 'string', enum: ['act', 'read'], description: 'Pruning mode. "act" (default) keeps interactive elements and short labels — best for clicking/filling. "read" keeps paragraphs, headings, and long text — best for articles, docs, and content extraction. If a previous snapshot looked empty on a content-heavy page, retry with "read".' },
         maxChars: { type: 'number', description: 'Max chars to return inline. Larger snapshots are saved to .barebrowse/ and a file path is returned instead. Default: 30000.' },
       },
     },
@@ -374,7 +376,7 @@ async function handleToolCall(name, args) {
     case 'browse': {
       let timer;
       const text = await Promise.race([
-        browse(args.url, { mode: args.mode }),
+        browse(args.url, { mode: args.mode, pruneMode: args.pruneMode }),
         new Promise((_, rej) => { timer = setTimeout(() => rej(new Error('browse timed out after 60s')), 60000); }),
       ]);
       clearTimeout(timer);
@@ -393,7 +395,7 @@ async function handleToolCall(name, args) {
     }, TIMEOUTS.goto);
     case 'snapshot': return withRetry(async () => {
       const page = await getPage();
-      const text = await page.snapshot();
+      const text = await page.snapshot(args.pruneMode ? { mode: args.pruneMode } : undefined);
       const limit = args.maxChars ?? MAX_CHARS_DEFAULT;
       if (text.length > limit) {
         const file = saveSnapshot(text);
package/package.json
CHANGED
package/src/bareagent.js
CHANGED
@@ -50,10 +50,11 @@ export function createBrowseTools(opts = {}) {
         type: 'object',
         properties: {
           url: { type: 'string', description: 'URL to browse' },
+          pruneMode: { type: 'string', enum: ['act', 'read'], description: '"act" (default) for interactive elements only; "read" for paragraphs and long text (articles/docs).' },
         },
         required: ['url'],
       },
-      execute: async ({ url }) => await browse(url, opts),
+      execute: async ({ url, pruneMode }) => await browse(url, pruneMode ? { ...opts, pruneMode } : opts),
     },
     {
       name: 'goto',
@@ -70,10 +71,15 @@ export function createBrowseTools(opts = {}) {
     {
       name: 'snapshot',
       description: 'Get the current ARIA snapshot. Returns a YAML-like tree with [ref=N] markers on interactive elements.',
-      parameters: {
-
+      parameters: {
+        type: 'object',
+        properties: {
+          pruneMode: { type: 'string', enum: ['act', 'read'], description: '"act" (default) for interactive elements only; "read" for paragraphs and long text (articles/docs).' },
+        },
+      },
+      execute: async ({ pruneMode } = {}) => {
         const page = await getPage();
-        return await page.snapshot();
+        return await page.snapshot(pruneMode ? { mode: pruneMode } : undefined);
       },
     },
     {
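The `browse` handler above uses `pruneMode ? { ...opts, pruneMode } : opts` so that a tool call without the argument does not overwrite a default the caller configured. The difference from an unconditional spread can be shown with a small sketch; `withPruneMode` and `naiveMerge` are illustrative names, not functions from the package.

```javascript
// Conditional form used in the diff: only override when an argument was given.
function withPruneMode(opts, pruneMode) {
  return pruneMode ? { ...opts, pruneMode } : opts;
}

// Unconditional spread, shown for contrast: an absent argument clobbers the default.
function naiveMerge(opts, pruneMode) {
  return { ...opts, pruneMode };
}

const defaults = { pruneMode: 'read', timeout: 30000 };
console.log(withPruneMode(defaults, undefined).pruneMode); // 'read'
console.log(naiveMerge(defaults, undefined).pruneMode);    // undefined
console.log(withPruneMode(defaults, 'act').pruneMode);     // 'act'
```

The conditional form also avoids allocating a new object when nothing changes, though the main point is keeping the caller's default intact.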
package/src/blocklist.js
ADDED
|
@@ -0,0 +1,190 @@
|
|
|
1
|
+
/**
|
|
2
|
+
* blocklist.js — Ad/tracker URL patterns for CDP Network.setBlockedURLs.
|
|
3
|
+
*
|
|
4
|
+
* Curated by real-world frequency, not pulled wholesale from Peter Lowe /
|
|
5
|
+
* EasyList. CDP does linear pattern matching per request, so 3,000-entry
|
|
6
|
+
* lists add ~150ms cumulative cost on a typical page for ~5% extra coverage
|
|
7
|
+
* (long-tail regional networks the agent rarely encounters). The set below
|
|
8
|
+
* is ~120 patterns covering the trackers that actually show up in agent
|
|
9
|
+
* traffic: Google/FB/Amazon/MS/Adobe ad+analytics, the major SaaS analytics
|
|
10
|
+
* stacks (Segment/Amplitude/Mixpanel/HubSpot/Hotjar/FullStory/Heap/Mouseflow),
|
|
11
|
+
* session-replay (LogRocket/Crazy Egg/Optimizely/VWO), content-recommendation
|
|
12
|
+
* (Taboola/Outbrain/Criteo), and the consumer-pixel cluster (LinkedIn/Twitter/
|
|
13
|
+
* TikTok/Snap/Pinterest/Reddit).
|
|
14
|
+
*
|
|
15
|
+
* Patterns are CDP-format globs: '*' matches any character run.
|
|
16
|
+
*
|
|
17
|
+
* To extend at runtime, pass connect({ blockUrls: [...] }) — your patterns
|
|
18
|
+
* are merged with this default. To turn the default off entirely, pass
|
|
19
|
+
* { blockAds: false }.
|
|
20
|
+
*/
|
|
21
|
+
|
|
22
|
+
export const DEFAULT_BLOCKLIST = [
|
|
23
|
+
// --- Google ads + analytics (the single biggest cluster) ---
|
|
24
|
+
'*://*.doubleclick.net/*',
|
|
25
|
+
'*://*.googlesyndication.com/*',
|
|
26
|
+
'*://*.googleadservices.com/*',
|
|
27
|
+
'*://*.googletagservices.com/*',
|
|
28
|
+
'*://*.googletagmanager.com/*',
|
|
29
|
+
'*://*.google-analytics.com/*',
|
|
30
|
+
'*://*.adservice.google.com/*',
|
|
31
|
+
'*://pagead2.googlesyndication.com/*',
|
|
32
|
+
'*://www.googleadservices.com/pagead/*',
|
|
33
|
+
'*://ssl.google-analytics.com/*',
|
|
34
|
+
'*://stats.g.doubleclick.net/*',
|
|
35
|
+
|
|
36
|
+
// --- Facebook / Meta ---
|
|
37
|
+
'*://connect.facebook.net/*',
|
|
38
|
+
'*://*.facebook.com/tr*', // Pixel (matches both /tr/... and /tr?...)
|
|
39
|
+
'*://*.fbcdn.net/signals/*',
|
|
40
|
+
|
|
41
|
+
// --- Amazon ads ---
|
|
42
|
+
'*://*.amazon-adsystem.com/*',
|
|
43
|
+
'*://aax.amazon-adsystem.com/*',
|
|
44
|
+
'*://s.amazon-adsystem.com/*',
|
|
45
|
+
|
|
46
|
+
// --- Microsoft (Bing ads + Clarity) ---
|
|
47
|
+
'*://bat.bing.com/*',
|
|
48
|
+
'*://*.clarity.ms/*',
|
|
49
|
+
|
|
50
|
+
// --- Yandex ---
|
|
51
|
+
'*://mc.yandex.ru/*',
|
|
52
|
+
'*://an.yandex.ru/*',
|
|
53
|
+
'*://yandex.ru/ads/*',
|
|
54
|
+
|
|
55
|
+
// --- Adobe Marketing Cloud ---
|
|
56
|
+
'*://*.omtrdc.net/*',
|
|
57
|
+
'*://*.demdex.net/*',
|
|
58
|
+
'*://*.everesttech.net/*',
|
|
59
|
+
'*://*.2o7.net/*',
|
|
60
|
+
'*://*.adobedtm.com/*',
|
|
61
|
+
|
|
62
|
+
// --- LinkedIn ---
|
|
63
|
+
'*://px.ads.linkedin.com/*',
|
|
64
|
+
'*://snap.licdn.com/li.lms-analytics/*',
|
|
65
|
+
|
|
66
|
+
// --- Twitter/X ---
|
|
67
|
+
'*://analytics.twitter.com/*',
|
|
68
|
+
'*://static.ads-twitter.com/*',
|
|
69
|
+
'*://*.t.co/i/adsct*',
|
|
70
|
+
|
|
71
|
+
// --- TikTok ---
|
|
72
|
+
'*://analytics.tiktok.com/*',
|
|
73
|
+
'*://business-api.tiktok.com/*',
|
|
74
|
+
'*://*.tiktokcdn.com/tiktok/*',
|
|
75
|
+
|
|
76
|
+
// --- Snap ---
|
|
77
|
+
'*://tr.snapchat.com/*',
|
|
78
|
+
'*://sc-static.net/scevent.min.js*',
|
|
79
|
+
|
|
80
|
+
// --- Pinterest ---
|
|
81
|
+
'*://ct.pinterest.com/*',
|
|
82
|
+
'*://*.pinimg.com/ct/*',
|
|
83
|
+
|
|
84
|
+
// --- Reddit ---
|
|
85
|
+
'*://events.redditmedia.com/*',
|
|
86
|
+
'*://www.redditstatic.com/ads/*',
|
|
87
|
+
|
|
88
|
+
// --- Quantcast / ComScore / Chartbeat ---
|
|
89
|
+
'*://pixel.quantserve.com/*',
|
|
90
|
+
'*://*.quantcount.com/*',
|
|
91
|
+
'*://*.scorecardresearch.com/*',
|
|
92
|
+
'*://ping.chartbeat.net/*',
|
|
93
|
+
'*://static.chartbeat.com/*',
|
|
94
|
+
|
|
95
|
+
// --- Criteo / Taboola / Outbrain (content + retargeting) ---
|
|
96
|
+
'*://*.criteo.com/*',
|
|
97
|
+
'*://*.criteo.net/*',
|
|
98
|
+
'*://cdn.taboola.com/*',
|
|
99
|
+
'*://trc.taboola.com/*',
|
|
100
|
+
'*://widgets.outbrain.com/*',
|
|
101
|
+
'*://*.outbrain.com/utils/*',
|
|
102
|
+
|
|
103
|
+
// --- Tealium / Marketo / Pardot / Salesforce marketing ---
|
|
104
|
+
'*://tags.tiqcdn.com/*',
|
|
105
|
+
'*://*.tealiumiq.com/*',
|
|
106
|
+
'*://munchkin.marketo.net/*',
|
|
107
|
+
'*://*.marketo.com/munchkin*',
|
|
108
|
+
'*://pi.pardot.com/*',
|
|
109
|
+
'*://*.exacttarget.com/cdn/*',
|
|
110
|
+
|
|
111
|
+
// --- Yahoo / Verizon Media ---
|
|
112
|
+
'*://*.yahoo.com/p.gif*',
|
|
113
|
+
'*://ad.yieldmanager.com/*',
|
|
114
|
+
'*://sp.analytics.yahoo.com/*',
|
|
115
|
+
|
|
116
|
+
// --- RUM / front-end perf (debatable, but commonly noisy) ---
|
|
117
|
+
'*://rum.pingdom.net/*',
|
|
118
|
+
'*://bam.nr-data.net/*',
|
|
119
|
+
'*://bam-cell.nr-data.net/*',
|
|
120
|
+
'*://js-agent.newrelic.com/*',
|
|
121
|
+
'*://*.browser-intake-datadoghq.com/*',
|
|
122
|
+
'*://*.browser-intake-datadoghq.eu/*',
|
|
123
|
+
|
|
124
|
+
// --- Session replay + heatmaps ---
|
|
125
|
+
'*://*.hotjar.com/*',
|
|
126
|
+
'*://*.hotjar.io/*',
|
|
127
|
+
'*://*.fullstory.com/s/*',
|
|
128
|
+
'*://*.fullstory.com/rec/*',
|
|
129
|
+
'*://r.lr-ingest.io/*',
|
|
130
|
+
'*://*.logrocket.io/*',
|
|
131
|
+
'*://cdn.lr-ingest.com/*',
|
|
132
|
+
'*://script.crazyegg.com/*',
|
|
133
|
+
'*://cdn.mouseflow.com/*',
|
|
134
|
+
'*://*.mouseflow.com/projects/*',
|
|
135
|
+
|
|
136
|
+
// --- A/B testing ---
|
|
137
|
+
'*://cdn.optimizely.com/*',
|
|
138
|
+
'*://*.optimizely.com/event*',
|
|
139
|
+
'*://dev.visualwebsiteoptimizer.com/*',
|
|
140
|
+
'*://*.vwo.com/*',
|
|
141
|
+
|
|
142
|
+
// --- Product analytics ---
|
|
143
|
+
'*://api.segment.io/*',
|
|
144
|
+
'*://cdn.segment.com/*',
|
|
145
|
+
'*://*.segment.io/v1/*',
|
|
146
|
+
'*://api.amplitude.com/*',
|
|
147
|
+
'*://api2.amplitude.com/*',
|
|
148
|
+
'*://cdn.amplitude.com/*',
|
|
149
|
+
'*://api.mixpanel.com/*',
|
|
150
|
+
'*://cdn.mxpnl.com/*',
|
|
151
|
+
'*://*.heapanalytics.com/*',
|
|
152
|
+
'*://heapanalytics.com/h*',
|
|
153
|
+
'*://*.posthog.com/e/*',
|
|
154
|
+
'*://*.posthog.com/decide/*',
|
|
155
|
+
|
|
156
|
+
// --- Marketing automation ---
|
|
157
|
+
'*://track.hubspot.com/*',
|
|
158
|
+
'*://js.hs-scripts.com/*',
|
|
159
|
+
'*://js.hs-analytics.net/*',
|
|
160
|
+
'*://js.hsforms.net/*',
|
|
161
|
+
|
|
162
|
+
// --- Customer messaging (these load chat widgets that bloat ARIA) ---
|
|
163
|
+
'*://widget.intercom.io/*',
|
|
164
|
+
'*://api-iam.intercom.io/messenger/*',
|
|
165
|
+
'*://js.intercomcdn.com/*',
|
|
166
|
+
'*://js.driftt.com/*',
|
|
167
|
+
'*://event.api.drift.com/*',
|
|
168
|
+
|
|
169
|
+
// --- Error reporters (Sentry kept off — agents may want to see errors) ---
|
|
170
|
+
'*://sessions.bugsnag.com/*',
|
|
171
|
+
'*://notify.bugsnag.com/*',
|
|
172
|
+
|
|
173
|
+
// --- Misc widely-deployed ad networks ---
|
|
174
|
+
'*://*.adnxs.com/*', // AppNexus / Xandr
|
|
175
|
+
'*://*.rubiconproject.com/*',
|
|
176
|
+
'*://*.pubmatic.com/*',
|
|
177
|
+
'*://*.openx.net/*',
|
|
178
|
+
'*://*.casalemedia.com/*',
|
|
179
|
+
'*://*.bidswitch.net/*',
|
|
180
|
+
'*://*.adsrvr.org/*', // The Trade Desk
|
|
181
|
+
'*://*.media.net/*',
|
|
182
|
+
'*://*.mediavoice.com/*',
|
|
183
|
+
'*://*.serving-sys.com/*', // Sizmek
|
|
184
|
+
'*://*.smartadserver.com/*',
|
|
185
|
+
'*://*.indexww.com/*',
|
|
186
|
+
'*://*.mathtag.com/*',
|
|
187
|
+
'*://*.tapad.com/*',
|
|
188
|
+
'*://*.bluekai.com/*', // Oracle Data Cloud
|
|
189
|
+
'*://*.krxd.net/*', // Salesforce / Krux
|
|
190
|
+
];
|
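The entries above use `*` as a wildcard. The sketch below shows one plausible reading of that syntax, in which `*` matches any run of characters and everything else is literal. `globToRegExp`, `isBlocked` and `samplePatterns` are hypothetical names written for illustration. They are not part of barebrowse, and Chromium's own matcher may differ in edge cases.

```javascript
// Hypothetical helper: turn a '*'-wildcard pattern into an anchored RegExp.
// Assumption: '*' means "any run of characters"; all other characters are literal.
function globToRegExp(pattern) {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`);
}

// True when the URL matches at least one pattern in the list.
function isBlocked(url, patterns) {
  return patterns.some((p) => globToRegExp(p).test(url));
}

// Three entries copied from the blocklist above.
const samplePatterns = [
  '*://*.hotjar.com/*',
  '*://*.fullstory.com/s/*',
  '*://heapanalytics.com/h*',
];
```

Under this reading, `*://*.hotjar.com/*` requires a subdomain, so a request to the bare `hotjar.com` host would not match. That may be why the list carries `*://heapanalytics.com/h*` alongside `*://*.heapanalytics.com/*`.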
package/src/chromium.js
CHANGED
|
@@ -16,9 +16,12 @@ let exitHandlersRegistered = false;
|
|
|
16
16
|
function reapAllSync() {
|
|
17
17
|
const toReap = [...activeBrowsers];
|
|
18
18
|
activeBrowsers.clear();
|
|
19
|
-
// Send SIGKILL to
|
|
19
|
+
// Send SIGKILL to the parent AND the whole process group (detached:true
|
|
20
|
+
// gives each Chromium its own pgid, so -pid targets every renderer/GPU/
|
|
21
|
+
// network child without touching Node or its other children).
|
|
20
22
|
for (const b of toReap) {
|
|
21
23
|
try { if (!b.process.killed) b.process.kill('SIGKILL'); } catch {}
|
|
24
|
+
try { process.kill(-b.process.pid, 'SIGKILL'); } catch {}
|
|
22
25
|
}
|
|
23
26
|
// Then poll each for actual death before removing its profile dir —
|
|
24
27
|
// Chromium can hold file handles briefly even after SIGKILL, which would
|
|
@@ -151,8 +154,22 @@ export async function launch(opts = {}) {
|
|
|
151
154
|
// about:blank as initial page
|
|
152
155
|
args.push('about:blank');
|
|
153
156
|
|
|
157
|
+
// detached:true makes Node call setsid() so Chromium becomes its own
|
|
158
|
+
// process-group leader. Without this, the renderer/GPU/network children
|
|
159
|
+
// it forks share the Node parent's process group — SIGTERM on the
|
|
160
|
+
// Chromium PID only signals Chromium itself and the children linger,
|
|
161
|
+
// holding profile-dir files for seconds after the parent exits. Under
|
|
162
|
+
// parallel test load that races our rmSync cleanup. With a separate
|
|
163
|
+
// pgid, cleanupBrowser can signal the whole group with process.kill(-pid).
|
|
164
|
+
//
|
|
165
|
+
// Trade-off: a terminal SIGINT (Ctrl-C) is delivered to the foreground
|
|
166
|
+
// process group, which is now Node's — Chromium will NOT receive it
|
|
167
|
+
// directly. The SIGINT handler in registerExitHandlers() that calls
|
|
168
|
+
// reapAllSync() is what actually kills Chromium under Ctrl-C now. Do not
|
|
169
|
+
// remove that handler without restoring some other path to reap children.
|
|
154
170
|
const child = spawn(binary, args, {
|
|
155
171
|
stdio: ['ignore', 'pipe', 'pipe'],
|
|
172
|
+
detached: true,
|
|
156
173
|
});
|
|
157
174
|
|
|
158
175
|
// Parse the WebSocket URL from stderr
|
|
@@ -216,20 +233,33 @@ export async function cleanupBrowser(browser) {
|
|
|
216
233
|
});
|
|
217
234
|
try { browser.process.kill(); } catch {}
|
|
218
235
|
await exited;
|
|
236
|
+
// SIGKILL the whole Chromium process group. The parent may have exited
|
|
237
|
+
// already (above) but renderer/GPU/network children — separate processes
|
|
238
|
+
// under --site-per-process — can outlive it by seconds, and they hold
|
|
239
|
+
// profile-dir file handles. Because launch() spawned with detached:true,
|
|
240
|
+
// the children share Chromium's pgid (not Node's), so process.kill on a
|
|
241
|
+
// negative PID reaps the whole group without touching anything else.
|
|
242
|
+
try { process.kill(-browser.process.pid, 'SIGKILL'); } catch {
|
|
243
|
+
// ESRCH = group already gone; anything else is best-effort here.
|
|
244
|
+
}
|
|
219
245
|
}
|
|
220
246
|
if (browser.ownedProfileDir) {
|
|
221
|
-
// Chromium
|
|
222
|
-
// --site-per-process
|
|
223
|
-
//
|
|
224
|
-
//
|
|
225
|
-
//
|
|
226
|
-
//
|
|
227
|
-
|
|
247
|
+
// Chromium spawns renderer + GPU + network + utility subprocesses (one
|
|
248
|
+
// per site under --site-per-process from H2), and SIGTERM on the parent
|
|
249
|
+
// doesn't guarantee the children have closed their profile-file handles
|
|
250
|
+
// by the time the parent's exit event fires. Under parallel test load
|
|
251
|
+
// we've seen handle-release take >2.5s. Retry budget here is 60 tries with
|
|
252
|
+
// 100–149ms jittered sleeps (~6–9s worst case). Retry on ANY error short of ENOENT;
|
|
253
|
+
// earlier code only retried ENOTEMPTY/EBUSY but Linux also reports
|
|
254
|
+
// EPERM/EACCES transiently when an open-deleted file is still being
|
|
255
|
+
// written to. force:true already swallows ENOENT, so the catch only
|
|
256
|
+
// sees real failures.
|
|
257
|
+
for (let i = 0; i < 60; i++) {
|
|
228
258
|
try {
|
|
229
259
|
rmSync(browser.ownedProfileDir, { recursive: true, force: true });
|
|
230
260
|
break;
|
|
231
261
|
} catch (err) {
|
|
232
|
-
if (err.code
|
|
262
|
+
if (err.code === 'ENOENT') break; // already gone
|
|
233
263
|
await new Promise((r) => setTimeout(r, 100 + Math.floor(Math.random() * 50)));
|
|
234
264
|
}
|
|
235
265
|
}
|
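The group-kill calls in this file pass a negative PID to `process.kill`. As a minimal sketch of that idea, the hypothetical helper below wraps the call and reports what happened. `killProcessGroup` and its injectable `kill` parameter are assumptions made for illustration and testing. The code in the diff simply swallows every error.

```javascript
// Hypothetical helper, not barebrowse's cleanupBrowser. `kill` is injectable
// so the behaviour can be exercised without spawning a real browser.
function killProcessGroup(pid, signal = 'SIGKILL', kill = process.kill.bind(process)) {
  try {
    // A negative PID addresses the process group whose id is |pid| (POSIX only).
    kill(-pid, signal);
    return 'signalled';
  } catch (err) {
    // ESRCH: no such process group, i.e. everything has already exited.
    return err.code === 'ESRCH' ? 'already-gone' : 'failed';
  }
}
```

This only reaches the intended processes when the child was spawned with `detached: true`. Otherwise the child is not a group leader and there is typically no group with that id to signal.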
package/src/daemon.js
CHANGED
|
@@ -40,6 +40,10 @@ export async function startDaemon(opts, outputDir, initialUrl) {
|
|
|
40
40
|
if (opts.viewport) args.push('--viewport', opts.viewport);
|
|
41
41
|
if (opts.storageState) args.push('--storage-state', opts.storageState);
|
|
42
42
|
if (opts.downloadPath) args.push('--download-path', opts.downloadPath);
|
|
43
|
+
if (opts.blockAds === false) args.push('--no-block-ads');
|
|
44
|
+
if (Array.isArray(opts.blockUrls)) {
|
|
45
|
+
for (const p of opts.blockUrls) args.push('--block-urls', p);
|
|
46
|
+
}
|
|
43
47
|
|
|
44
48
|
const child = spawn(process.execPath, args, {
|
|
45
49
|
detached: true,
|
|
@@ -48,8 +52,11 @@ export async function startDaemon(opts, outputDir, initialUrl) {
|
|
|
48
52
|
});
|
|
49
53
|
child.unref();
|
|
50
54
|
|
|
51
|
-
// Poll for session.json (50ms interval,
|
|
52
|
-
|
|
55
|
+
// Poll for session.json (50ms interval, 30s timeout). 30s covers cold
|
|
56
|
+
// Chromium boot plus initial-URL navigation on slower CI/older hardware;
|
|
57
|
+
// the previous 15s was tight enough that the ad-blocklist's added
|
|
58
|
+
// CDP setup time pushed real boots past it on stress runs.
|
|
59
|
+
const deadline = Date.now() + 30000;
|
|
53
60
|
while (Date.now() < deadline) {
|
|
54
61
|
if (existsSync(sessionPath)) {
|
|
55
62
|
try {
|
|
@@ -59,7 +66,7 @@ export async function startDaemon(opts, outputDir, initialUrl) {
|
|
|
59
66
|
}
|
|
60
67
|
await new Promise((r) => setTimeout(r, 50));
|
|
61
68
|
}
|
|
62
|
-
throw new Error('Daemon failed to start within
|
|
69
|
+
throw new Error('Daemon failed to start within 30s');
|
|
63
70
|
}
|
|
64
71
|
|
|
65
72
|
/**
|
|
@@ -79,6 +86,8 @@ export async function runDaemon(opts, outputDir, initialUrl) {
|
|
|
79
86
|
viewport: opts.viewport,
|
|
80
87
|
storageState: opts.storageState,
|
|
81
88
|
downloadPath: opts.downloadPath,
|
|
89
|
+
blockAds: opts.blockAds,
|
|
90
|
+
blockUrls: opts.blockUrls,
|
|
82
91
|
});
|
|
83
92
|
|
|
84
93
|
// Console log capture
|
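The `startDaemon` hunk above forwards the two new options to the daemon process as CLI flags. A minimal sketch of that mapping as a pure function follows. The name `blockFlags` is hypothetical, while the flag names and conditions are taken from the diff.

```javascript
// Hypothetical extraction of the flag-building lines added to startDaemon.
function blockFlags(opts = {}) {
  const args = [];
  // Only an explicit `false` disables the default list.
  if (opts.blockAds === false) args.push('--no-block-ads');
  // Each extra pattern gets its own --block-urls flag.
  if (Array.isArray(opts.blockUrls)) {
    for (const p of opts.blockUrls) args.push('--block-urls', p);
  }
  return args;
}
```

Leaving `blockAds` undefined emits no flag, so the daemon falls back to whatever default the receiving side applies.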
package/src/index.js
CHANGED
|
@@ -16,6 +16,7 @@ import { prune as pruneTree } from './prune.js';
|
|
|
16
16
|
import { click as cdpClick, type as cdpType, scroll as cdpScroll, press as cdpPress, hover as cdpHover, select as cdpSelect, drag as cdpDrag, upload as cdpUpload } from './interact.js';
|
|
17
17
|
import { dismissConsent } from './consent.js';
|
|
18
18
|
import { applyStealth } from './stealth.js';
|
|
19
|
+
import { DEFAULT_BLOCKLIST } from './blocklist.js';
|
|
19
20
|
import { waitForNetworkIdle } from './network-idle.js';
|
|
20
21
|
import { join as pathJoin } from 'node:path';
|
|
21
22
|
|
|
@@ -29,6 +30,11 @@ import { join as pathJoin } from 'node:path';
|
|
|
29
30
|
* @param {boolean} [opts.cookies=true] - Inject user's cookies (Phase 2)
|
|
30
31
|
* @param {boolean} [opts.prune=true] - Apply ARIA pruning (Phase 2)
|
|
31
32
|
* @param {number} [opts.timeout=30000] - Navigation timeout in ms
|
|
33
|
+
* @param {boolean} [opts.blockAds=true] - Block ~120 common ad/tracker
|
|
34
|
+
* URL patterns via CDP. Shrinks ARIA snapshots and speeds page loads.
|
|
35
|
+
* See src/blocklist.js for the default set. Set false to disable.
|
|
36
|
+
* @param {string[]} [opts.blockUrls] - Extra URL glob patterns to block,
|
|
37
|
+
* merged with the default unless blockAds:false.
|
|
32
38
|
* @returns {Promise<string>} ARIA snapshot text
|
|
33
39
|
*/
|
|
34
40
|
export async function browse(url, opts = {}) {
|
|
@@ -53,7 +59,8 @@ export async function browse(url, opts = {}) {
|
|
|
53
59
|
}
|
|
54
60
|
|
|
55
61
|
// Step 2: Create a new page target and attach
|
|
56
|
-
|
|
62
|
+
const pageOpts = { viewport: opts.viewport, blockAds: opts.blockAds, blockUrls: opts.blockUrls };
|
|
63
|
+
let page = await createPage(cdp, mode !== 'headed', pageOpts);
|
|
57
64
|
|
|
58
65
|
// Step 2.5: Suppress permission prompts
|
|
59
66
|
await suppressPermissions(cdp);
|
|
@@ -87,7 +94,7 @@ export async function browse(url, opts = {}) {
|
|
|
87
94
|
try {
|
|
88
95
|
browser = await launch({ ...launchOpts, headed: true });
|
|
89
96
|
cdp = await createCDP(browser.wsUrl);
|
|
90
|
-
page = await createPage(cdp, false,
|
|
97
|
+
page = await createPage(cdp, false, pageOpts);
|
|
91
98
|
await suppressPermissions(cdp);
|
|
92
99
|
if (opts.cookies !== false) {
|
|
93
100
|
try { await authenticate(page.session, url, { browser: opts.browser }); } catch {}
|
|
@@ -110,7 +117,11 @@ export async function browse(url, opts = {}) {
|
|
|
110
117
|
snapshot = raw;
|
|
111
118
|
}
|
|
112
119
|
const stats = `url: ${url}\n${raw.length.toLocaleString()} chars → ${snapshot.length.toLocaleString()} chars (${Math.round((1 - snapshot.length / raw.length) * 100)}% pruned)`;
|
|
113
|
-
|
|
120
|
+
const actMode = !opts.pruneMode || opts.pruneMode === 'act';
|
|
121
|
+
const hint = (actMode && raw.length > 5000 && snapshot.length < 500 && snapshot.length < raw.length * 0.05)
|
|
122
|
+
? `hint: act mode dropped most of the page — retry with pruneMode='read' for paragraphs and long text\n`
|
|
123
|
+
: '';
|
|
124
|
+
snapshot = stats + '\n' + hint + snapshot;
|
|
114
125
|
|
|
115
126
|
// Step 7: Clean up
|
|
116
127
|
await cdp.send('Target.closeTarget', { targetId: page.targetId });
|
|
@@ -135,6 +146,14 @@ export async function browse(url, opts = {}) {
|
|
|
135
146
|
* Default: a per-session subdirectory under the OS temp dir. Downloads
|
|
136
147
|
* land here as <guid>; check `page.downloads` for { url, suggestedFilename,
|
|
137
148
|
* savedPath, state, totalBytes, receivedBytes } per file.
|
|
149
|
+
* @param {boolean} [opts.blockAds] - Block ~120 common ad/tracker URL
|
|
150
|
+
* patterns via CDP. Defaults to true for launched browsers, false in
|
|
151
|
+
* attach mode (would affect any tab attached to the user's running
|
|
152
|
+
* session). Setting blockAds:true explicitly in attach mode honors the
|
|
153
|
+
* request — blocking applies to whichever tab the session is currently
|
|
154
|
+
* attached to and follows the session across switchTab() until close.
|
|
155
|
+
* @param {string[]} [opts.blockUrls] - Extra URL glob patterns to block,
|
|
156
|
+
* merged with the default unless blockAds is false.
|
|
138
157
|
* @returns {Promise<object>} Page handle with goto, snapshot, close
|
|
139
158
|
*/
|
|
140
159
|
export async function connect(opts = {}) {
|
|
@@ -165,7 +184,15 @@ export async function connect(opts = {}) {
|
|
|
165
184
|
// (they'd persist in the user's session via addScriptToEvaluateOnNewDocument)
|
|
166
185
|
// and the headed→headless rewind in goto() is gated off below.
|
|
167
186
|
let currentlyHeaded = attachMode || (mode === 'headed');
|
|
168
|
-
|
|
187
|
+
// Default blockAds on for owned browsers, off in attach mode (would affect
|
|
188
|
+
// any tab we attach to in the user's running session). Caller can flip with
|
|
189
|
+
// explicit blockAds:true/false.
|
|
190
|
+
const pageOpts = {
|
|
191
|
+
viewport: opts.viewport,
|
|
192
|
+
blockAds: opts.blockAds !== undefined ? opts.blockAds : !attachMode,
|
|
193
|
+
blockUrls: opts.blockUrls,
|
|
194
|
+
};
|
|
195
|
+
let page = await createPage(cdp, !currentlyHeaded, pageOpts);
|
|
169
196
|
let refMap = new Map();
|
|
170
197
|
let botBlocked = false;
|
|
171
198
|
|
|
@@ -300,7 +327,7 @@ export async function connect(opts = {}) {
|
|
|
300
327
|
|
|
301
328
|
browser = await launch(launchOpts);
|
|
302
329
|
cdp = await createCDP(browser.wsUrl);
|
|
303
|
-
page = await createPage(cdp, true,
|
|
330
|
+
page = await createPage(cdp, true, pageOpts);
|
|
304
331
|
setupDialogHandler(page.session);
|
|
305
332
|
await suppressPermissions(cdp);
|
|
306
333
|
currentlyHeaded = false;
|
|
@@ -326,7 +353,7 @@ export async function connect(opts = {}) {
|
|
|
326
353
|
try {
|
|
327
354
|
browser = await launch({ ...launchOpts, headed: true });
|
|
328
355
|
cdp = await createCDP(browser.wsUrl);
|
|
329
|
-
page = await createPage(cdp, false,
|
|
356
|
+
page = await createPage(cdp, false, pageOpts);
|
|
330
357
|
setupDialogHandler(page.session);
|
|
331
358
|
await suppressPermissions(cdp);
|
|
332
359
|
await navigate(page, url, timeout);
|
|
@@ -382,10 +409,14 @@ export async function connect(opts = {}) {
|
|
|
382
409
|
const pageUrl = entries[currentIndex]?.url || '';
|
|
383
410
|
const warn = botBlocked ? '[BOT CHALLENGE DETECTED — page content may be incomplete or blocked]\n' : '';
|
|
384
411
|
if (pruneOpts === false) return `url: ${pageUrl}\n` + warn + raw;
|
|
385
|
-
const
|
|
412
|
+
const mode = pruneOpts?.mode || 'act';
|
|
413
|
+
const pruned = pruneTree(result.tree, { mode });
|
|
386
414
|
const out = formatTree(pruned);
|
|
387
415
|
const stats = `url: ${pageUrl}\n${raw.length.toLocaleString()} chars → ${out.length.toLocaleString()} chars (${Math.round((1 - out.length / raw.length) * 100)}% pruned)`;
|
|
388
|
-
|
|
416
|
+
const hint = (mode === 'act' && raw.length > 5000 && out.length < 500 && out.length < raw.length * 0.05)
|
|
417
|
+
? `hint: act mode dropped most of the page — retry with pruneMode='read' for paragraphs and long text\n`
|
|
418
|
+
: '';
|
|
419
|
+
return stats + '\n' + hint + warn + out;
|
|
389
420
|
},
|
|
390
421
|
|
|
391
422
|
async click(ref) {
|
|
@@ -465,7 +496,7 @@ export async function connect(opts = {}) {
|
|
|
465
496
|
// closure handle used by every method below, so swapping it makes
|
|
466
497
|
// snapshot/click/type/etc. operate on the new tab.
|
|
467
498
|
const oldSessionId = page.sessionId;
|
|
468
|
-
page = await attachToExistingTarget(cdp, target.targetId);
|
|
499
|
+
page = await attachToExistingTarget(cdp, target.targetId, pageOpts);
|
|
469
500
|
refMap = new Map(); // refs from the previous tab are no longer valid
|
|
470
501
|
setupDialogHandler(page.session);
|
|
471
502
|
try { await cdp.send('Target.detachFromTarget', { sessionId: oldSessionId }); } catch {}
|
|
@@ -553,7 +584,7 @@ export async function connect(opts = {}) {
|
|
|
553
584
|
get cdp() { return page.session; },
|
|
554
585
|
|
|
555
586
|
async createTab() {
|
|
556
|
-
const tab = await createPage(cdp, !currentlyHeaded,
|
|
587
|
+
const tab = await createPage(cdp, !currentlyHeaded, pageOpts);
|
|
557
588
|
await suppressPermissions(cdp);
|
|
558
589
|
setupDialogHandler(tab.session);
|
|
559
590
|
let tabBotBlocked = false;
|
|
@@ -645,6 +676,12 @@ async function createPage(cdp, stealth = false, pageOpts = {}) {
|
|
|
645
676
|
await applyStealth(session);
|
|
646
677
|
}
|
|
647
678
|
|
|
679
|
+
// Ad/tracker URL blocking via CDP. Default on for owned browsers — shrinks
|
|
680
|
+
// ARIA, speeds loads. Skipped in attach mode (would affect the user's
|
|
681
|
+
// running browser globally) and skippable per-call via blockAds:false.
|
|
682
|
+
// Custom patterns in blockUrls extend the default unless blockAds is false.
|
|
683
|
+
await applyBlocklist(session, pageOpts);
|
|
684
|
+
|
|
648
685
|
// Set viewport size if specified (e.g. "1280x720")
|
|
649
686
|
if (pageOpts.viewport) {
|
|
650
687
|
const [w, h] = pageOpts.viewport.split('x').map(Number);
|
|
@@ -710,16 +747,35 @@ async function attachFrameTracking(cdp, mainSession) {
|
|
|
710
747
|
* Attach a CDP session to an existing target (e.g. a tab opened by window.open).
|
|
711
748
|
* Enables the same domains as createPage so snapshot/click/type work uniformly.
|
|
712
749
|
*/
|
|
713
|
-
async function attachToExistingTarget(cdp, targetId) {
|
|
750
|
+
async function attachToExistingTarget(cdp, targetId, pageOpts = {}) {
|
|
714
751
|
const { sessionId } = await cdp.send('Target.attachToTarget', { targetId, flatten: true });
|
|
715
752
|
const session = cdp.session(sessionId);
|
|
716
753
|
await session.send('Page.enable');
|
|
717
754
|
await session.send('Network.enable');
|
|
718
755
|
await session.send('DOM.enable');
|
|
756
|
+
await applyBlocklist(session, pageOpts);
|
|
719
757
|
const framesByFrameId = await attachFrameTracking(cdp, session);
|
|
720
758
|
return { session, targetId, sessionId, framesByFrameId };
|
|
721
759
|
}
|
|
722
760
|
|
|
761
|
+
/**
|
|
762
|
+
* Apply Network.setBlockedURLs for ad/tracker blocking on a session.
|
|
763
|
+
* Default list is on; pass blockAds:false to skip, blockUrls:[] to extend.
|
|
764
|
+
* Silent on failure — older Chrome / unusual modes shouldn't break the page.
|
|
765
|
+
*/
|
|
766
|
+
async function applyBlocklist(session, pageOpts) {
|
|
767
|
+
if (pageOpts.blockAds === false && !pageOpts.blockUrls) return;
|
|
768
|
+
const patterns = pageOpts.blockAds === false
|
|
769
|
+
? (pageOpts.blockUrls || [])
|
|
770
|
+
: [...DEFAULT_BLOCKLIST, ...(pageOpts.blockUrls || [])];
|
|
771
|
+
if (!patterns.length) return;
|
|
772
|
+
try {
|
|
773
|
+
await session.send('Network.setBlockedURLs', { urls: patterns });
|
|
774
|
+
} catch {
|
|
775
|
+
// Network.setBlockedURLs unsupported on this Chrome — skip silently.
|
|
776
|
+
}
|
|
777
|
+
}
|
|
778
|
+
|
|
723
779
|
/**
|
|
724
780
|
* Navigate to a URL and wait for the page to load.
|
|
725
781
|
*/
|
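The interaction between `blockAds`, `blockUrls` and attach mode in the hunks above is easy to misread, so here is a sketch of the same decisions as pure functions. `resolvePatterns`, `resolveBlockAds` and `DEFAULT_PATTERNS` are hypothetical names. The real default list is the `DEFAULT_BLOCKLIST` export from `src/blocklist.js`.

```javascript
// Stand-in for DEFAULT_BLOCKLIST; two entries are enough for illustration.
const DEFAULT_PATTERNS = ['*://*.hotjar.com/*', '*://api.segment.io/*'];

// Hypothetical pure version of the pattern selection inside applyBlocklist.
function resolvePatterns(pageOpts = {}, defaults = DEFAULT_PATTERNS) {
  const extra = pageOpts.blockUrls || [];
  // blockAds:false drops the defaults but still honours custom patterns.
  return pageOpts.blockAds === false ? [...extra] : [...defaults, ...extra];
}

// Hypothetical version of the connect() default:
// on for owned browsers, off when attaching, unless the caller says otherwise.
function resolveBlockAds(opts = {}, attachMode = false) {
  return opts.blockAds !== undefined ? opts.blockAds : !attachMode;
}
```

Note that `blockAds: false` combined with `blockUrls` still blocks the custom patterns. Only the default list is skipped, and an empty result means `Network.setBlockedURLs` is never sent.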
package/src/prune.js
CHANGED
|
@@ -65,7 +65,8 @@ const SKIP_ROLES = new Set([
|
|
|
65
65
|
* @returns {object|null} Pruned tree
|
|
66
66
|
*/
|
|
67
67
|
export function prune(tree, options = {}) {
|
|
68
|
-
|
|
68
|
+
let { mode = 'act', context = '' } = options;
|
|
69
|
+
if (mode === 'read') mode = 'browse';
|
|
69
70
|
const allowedRegions = MODE_REGIONS[mode] || MODE_REGIONS.act;
|
|
70
71
|
const isBrowse = mode === 'browse';
|
|
71
72
|
const keywords = context
|
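The `prune.js` change above makes `'read'` an alias for `'browse'`, which is the mode the new snapshot hint in `index.js` recommends. A minimal sketch of both rules as pure functions follows. `normalizeMode` and `shouldHintReadMode` are hypothetical names, and the thresholds are copied from the diff.

```javascript
// 'read' is accepted as an alias for 'browse'; a missing mode means 'act'.
function normalizeMode(mode = 'act') {
  return mode === 'read' ? 'browse' : mode;
}

// The hint fires only in act mode, on a large page, when pruning left
// fewer than 500 characters and less than 5% of the raw snapshot.
function shouldHintReadMode(mode, rawLength, prunedLength) {
  return mode === 'act'
    && rawLength > 5000
    && prunedLength < 500
    && prunedLength < rawLength * 0.05;
}
```

Both size conditions must hold, so a 6,000-character page pruned to 400 characters does not trigger the hint, because 400 is more than 5% of 6,000.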
package/src/stealth.js
CHANGED
|
@@ -92,6 +92,75 @@ const STEALTH_SCRIPT = `
|
|
|
92
92
|
return origGetParam2.apply(this, arguments);
|
|
93
93
|
};
|
|
94
94
|
}
|
|
95
|
+
|
|
96
|
+
// Canvas fingerprinting — sites render standard text/shapes, then read
|
|
97
|
+
// pixels via toDataURL or getImageData. The output is stable per machine
|
|
98
|
+
// (GPU, font rasterizer, anti-aliasing) but unique across machines, which
|
|
99
|
+
// makes it the second-most-common fingerprint after WebGL. Defense: nudge
|
|
100
|
+
// a few RGB channels by ±1 per session so the hash changes each visit
|
|
101
|
+
// while the canvas still looks identical to the human eye. The per-tab
|
|
102
|
+
// seed keeps reads stable within a session so legitimate canvas use
|
|
103
|
+
// (image processing, games) doesn't flicker.
|
|
104
|
+
// crypto.getRandomValues draws an independent random seed for each browsing
|
|
105
|
+
// context; Math.random alone can collide when two fresh V8 contexts start
|
|
106
|
+
// within microseconds of each other (we observed this under parallel
|
|
107
|
+
// tests). performance.now mixes in a high-resolution timestamp as a
|
|
108
|
+
// belt-and-braces guard against contexts where crypto is somehow stubbed.
|
|
109
|
+
const _seedBuf = new Uint32Array(1);
|
|
110
|
+
crypto.getRandomValues(_seedBuf);
|
|
111
|
+
const CANVAS_SEED = (_seedBuf[0] ^ ((performance.now() * 1e6) | 0)) >>> 0;
|
|
112
|
+
function shiftPixels(data) {
|
|
113
|
+
// Touch ~1 byte per 64-byte stride. The bit we XOR with is taken from a
|
|
114
|
+
// position-dependent SLICE of a seed-mixed hash, not its low bit — a
|
|
115
|
+
// naive 'mix & 1' collapses to only two possible outputs per seed
|
|
116
|
+
// parity because every stride index is even (the position multiplier
|
|
117
|
+
// is odd, so the low bit only depends on seed parity). Indexing the
|
|
118
|
+
// hash by (i/64) mod 32 makes every stride position sample a different
|
|
119
|
+
// bit, so two distinct seeds produce different mask patterns.
|
|
120
|
+
for (let i = 0; i < data.length; i += 64) {
|
|
121
|
+
const h = ((CANVAS_SEED * 2654435761) ^ (i * 1597334677)) >>> 0;
|
|
122
|
+
const bit = (h >>> ((i >>> 6) & 31)) & 1;
|
|
123
|
+
data[i] = (data[i] ^ bit) & 0xff;
|
|
124
|
+
}
|
|
125
|
+
return data;
|
|
126
|
+
}
|
|
127
|
+
// Capture originals BEFORE replacing — toDataURL must read pixels via the
|
|
128
|
+
// native getImageData (not the patched one), otherwise the patch double-
|
|
129
|
+
// applies and the second XOR cancels the first, leaving output unchanged.
|
|
130
|
+
const origGetImageData = CanvasRenderingContext2D.prototype.getImageData;
|
|
131
|
+
const origToDataURL = HTMLCanvasElement.prototype.toDataURL;
|
|
132
|
+
HTMLCanvasElement.prototype.toDataURL = function() {
|
|
133
|
+
const ctx = this.getContext('2d');
|
|
134
|
+
if (ctx && this.width > 0 && this.height > 0) {
|
|
135
|
+
try {
|
|
136
|
+
const img = origGetImageData.call(ctx, 0, 0, this.width, this.height);
|
|
137
|
+
// Snapshot the original bytes so we can restore them after encoding.
|
|
138
|
+
// Without this, repeated toDataURL() alternates noisy/clean: call 1
|
|
139
|
+
// XORs the canvas in place, call 2 reads the noisy canvas and XORs
|
|
140
|
+
// again (self-inverse), call 3 again, etc. Same XOR-cancellation
|
|
141
|
+
// class as the earlier double-apply bug, just through canvas state
|
|
142
|
+
// rather than method composition. The restore also keeps the
|
|
143
|
+
// bitmap idempotent for any downstream legitimate canvas reads.
|
|
144
|
+
const original = new Uint8ClampedArray(img.data);
|
|
145
|
+
shiftPixels(img.data);
|
|
146
|
+
ctx.putImageData(img, 0, 0);
|
|
147
|
+
const result = origToDataURL.apply(this, arguments);
|
|
148
|
+
img.data.set(original);
|
|
149
|
+
ctx.putImageData(img, 0, 0);
|
|
150
|
+
return result;
|
|
151
|
+
} catch {
|
|
152
|
+
// Tainted canvas (cross-origin image) — can't read; skip the nudge
|
|
153
|
+
// and fall through to the original call so the page sees the
|
|
154
|
+
// expected SecurityError instead of silent corruption.
|
|
155
|
+
}
|
|
156
|
+
}
|
|
157
|
+
return origToDataURL.apply(this, arguments);
|
|
158
|
+
};
|
|
159
|
+
CanvasRenderingContext2D.prototype.getImageData = function() {
|
|
160
|
+
const img = origGetImageData.apply(this, arguments);
|
|
161
|
+
shiftPixels(img.data);
|
|
162
|
+
return img;
|
|
163
|
+
};
|
|
95
164
|
`;
|
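The comments in the canvas patch above rely on the XOR mask being self-inverse. The sketch below repeats the stride and bit-selection arithmetic from the diff, but takes the seed as an explicit parameter so it can run outside a browser. That parameter is an assumption made for illustration. The shipped script uses a per-context `CANVAS_SEED` constant instead.

```javascript
// Same arithmetic as shiftPixels in the diff, with the seed passed in.
function shiftPixels(data, seed) {
  // Visit one byte per 64-byte stride.
  for (let i = 0; i < data.length; i += 64) {
    // Mix the seed with the byte position into a 32-bit value.
    const h = ((seed * 2654435761) ^ (i * 1597334677)) >>> 0;
    // Pick a position-dependent bit of that value rather than its low bit.
    const bit = (h >>> ((i >>> 6) & 31)) & 1;
    // XOR flips at most the low bit, so the byte moves by at most 1.
    data[i] = (data[i] ^ bit) & 0xff;
  }
  return data;
}

// Applying the mask twice with the same seed restores the input.
const pixels = new Uint8ClampedArray(256).fill(128);
const noisy = shiftPixels(new Uint8ClampedArray(pixels), 12345);
const restored = shiftPixels(new Uint8ClampedArray(noisy), 12345);
```

Because a second pass undoes the first, the patched `toDataURL` has to read pixels through the original `getImageData` and put the untouched bytes back after encoding, as the comments in the diff describe.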
|
96
165
|
|
|
97
166
|
/**
|