@kritchoff/agent-browser 0.9.2

Files changed (88)
  1. package/LICENSE +201 -0
  2. package/README.md +903 -0
  3. package/README.sdk.md +77 -0
  4. package/bin/agent-browser-linux-x64 +0 -0
  5. package/bin/agent-browser.js +109 -0
  6. package/dist/actions.d.ts +17 -0
  7. package/dist/actions.d.ts.map +1 -0
  8. package/dist/actions.js +1427 -0
  9. package/dist/actions.js.map +1 -0
  10. package/dist/browser.d.ts +474 -0
  11. package/dist/browser.d.ts.map +1 -0
  12. package/dist/browser.js +1566 -0
  13. package/dist/browser.js.map +1 -0
  14. package/dist/cdp-client.d.ts +103 -0
  15. package/dist/cdp-client.d.ts.map +1 -0
  16. package/dist/cdp-client.js +223 -0
  17. package/dist/cdp-client.js.map +1 -0
  18. package/dist/daemon.d.ts +60 -0
  19. package/dist/daemon.d.ts.map +1 -0
  20. package/dist/daemon.js +401 -0
  21. package/dist/daemon.js.map +1 -0
  22. package/dist/dualmode-config.d.ts +37 -0
  23. package/dist/dualmode-config.d.ts.map +1 -0
  24. package/dist/dualmode-config.js +44 -0
  25. package/dist/dualmode-config.js.map +1 -0
  26. package/dist/dualmode-fetcher.d.ts +60 -0
  27. package/dist/dualmode-fetcher.d.ts.map +1 -0
  28. package/dist/dualmode-fetcher.js +449 -0
  29. package/dist/dualmode-fetcher.js.map +1 -0
  30. package/dist/dualmode-types.d.ts +183 -0
  31. package/dist/dualmode-types.d.ts.map +1 -0
  32. package/dist/dualmode-types.js +8 -0
  33. package/dist/dualmode-types.js.map +1 -0
  34. package/dist/ios-actions.d.ts +11 -0
  35. package/dist/ios-actions.d.ts.map +1 -0
  36. package/dist/ios-actions.js +228 -0
  37. package/dist/ios-actions.js.map +1 -0
  38. package/dist/ios-manager.d.ts +266 -0
  39. package/dist/ios-manager.d.ts.map +1 -0
  40. package/dist/ios-manager.js +1073 -0
  41. package/dist/ios-manager.js.map +1 -0
  42. package/dist/protocol.d.ts +26 -0
  43. package/dist/protocol.d.ts.map +1 -0
  44. package/dist/protocol.js +832 -0
  45. package/dist/protocol.js.map +1 -0
  46. package/dist/snapshot.d.ts +83 -0
  47. package/dist/snapshot.d.ts.map +1 -0
  48. package/dist/snapshot.js +653 -0
  49. package/dist/snapshot.js.map +1 -0
  50. package/dist/stream-server.d.ts +117 -0
  51. package/dist/stream-server.d.ts.map +1 -0
  52. package/dist/stream-server.js +305 -0
  53. package/dist/stream-server.js.map +1 -0
  54. package/dist/types.d.ts +742 -0
  55. package/dist/types.d.ts.map +1 -0
  56. package/dist/types.js +2 -0
  57. package/dist/types.js.map +1 -0
  58. package/docker-compose.sdk.yml +45 -0
  59. package/package.json +85 -0
  60. package/scripts/benchmark.sh +80 -0
  61. package/scripts/build-all-platforms.sh +68 -0
  62. package/scripts/check-version-sync.js +39 -0
  63. package/scripts/copy-native.js +36 -0
  64. package/scripts/fast_reset.sh +108 -0
  65. package/scripts/postinstall.js +235 -0
  66. package/scripts/publish_images.sh +55 -0
  67. package/scripts/snapshot_manager.sh +293 -0
  68. package/scripts/start-android-agent.sh +49 -0
  69. package/scripts/sync-version.js +69 -0
  70. package/scripts/vaccine-run +26 -0
  71. package/sdk.sh +153 -0
  72. package/skills/agent-browser/SKILL.md +217 -0
  73. package/skills/agent-browser/references/authentication.md +202 -0
  74. package/skills/agent-browser/references/commands.md +259 -0
  75. package/skills/agent-browser/references/proxy-support.md +188 -0
  76. package/skills/agent-browser/references/session-management.md +193 -0
  77. package/skills/agent-browser/references/snapshot-refs.md +194 -0
  78. package/skills/agent-browser/references/video-recording.md +173 -0
  79. package/skills/agent-browser/templates/authenticated-session.sh +97 -0
  80. package/skills/agent-browser/templates/capture-workflow.sh +69 -0
  81. package/skills/agent-browser/templates/form-automation.sh +62 -0
  82. package/skills/skill-creator/LICENSE.txt +202 -0
  83. package/skills/skill-creator/SKILL.md +356 -0
  84. package/skills/skill-creator/references/output-patterns.md +82 -0
  85. package/skills/skill-creator/references/workflows.md +28 -0
  86. package/skills/skill-creator/scripts/init_skill.py +303 -0
  87. package/skills/skill-creator/scripts/package_skill.py +113 -0
  88. package/skills/skill-creator/scripts/quick_validate.py +95 -0
package/README.md ADDED
@@ -0,0 +1,903 @@
1
+ # agent-browser
2
+
3
+ Headless browser automation CLI for AI agents. Fast Rust CLI with Node.js fallback.
4
+
5
+ ## Installation
6
+
7
+ ### npm (recommended)
8
+
9
+ ```bash
10
+ npm install -g agent-browser
11
+ agent-browser install # Download Chromium
12
+ ```
13
+
14
+ ### Homebrew (macOS)
15
+
16
+ ```bash
17
+ brew install agent-browser
18
+ agent-browser install # Download Chromium
19
+ ```
20
+
21
+ ### From Source
22
+
23
+ ```bash
24
+ git clone https://github.com/vercel-labs/agent-browser
25
+ cd agent-browser
26
+ pnpm install
27
+ pnpm build
28
+ pnpm build:native # Requires Rust (https://rustup.rs)
29
+ pnpm link --global # Makes agent-browser available globally
30
+ agent-browser install
31
+ ```
32
+
33
+ ### Linux Dependencies
34
+
35
+ On Linux, install system dependencies:
36
+
37
+ ```bash
38
+ agent-browser install --with-deps
39
+ # or manually: npx playwright install-deps chromium
40
+ ```
41
+
42
+ ## Quick Start
43
+
44
+ ```bash
45
+ agent-browser open example.com
46
+ agent-browser snapshot # Get accessibility tree with refs
47
+ agent-browser click @e2 # Click by ref from snapshot
48
+ agent-browser fill @e3 "test@example.com" # Fill by ref
49
+ agent-browser get text @e1 # Get text by ref
50
+ agent-browser screenshot page.png
51
+ agent-browser close
52
+ ```
53
+
54
+ ### Traditional Selectors (also supported)
55
+
56
+ ```bash
57
+ agent-browser click "#submit"
58
+ agent-browser fill "#email" "test@example.com"
59
+ agent-browser find role button click --name "Submit"
60
+ ```
61
+
62
+ ## Commands
63
+
64
+ ### Core Commands
65
+
66
+ ```bash
67
+ agent-browser open <url> # Navigate to URL (aliases: goto, navigate)
68
+ agent-browser click <sel> # Click element
69
+ agent-browser dblclick <sel> # Double-click element
70
+ agent-browser focus <sel> # Focus element
71
+ agent-browser type <sel> <text> # Type into element
72
+ agent-browser fill <sel> <text> # Clear and fill
73
+ agent-browser press <key> # Press key (Enter, Tab, Control+a) (alias: key)
74
+ agent-browser keydown <key> # Hold key down
75
+ agent-browser keyup <key> # Release key
76
+ agent-browser hover <sel> # Hover element
77
+ agent-browser select <sel> <val> # Select dropdown option
78
+ agent-browser check <sel> # Check checkbox
79
+ agent-browser uncheck <sel> # Uncheck checkbox
80
+ agent-browser scroll <dir> [px] # Scroll (up/down/left/right)
81
+ agent-browser scrollintoview <sel> # Scroll element into view (alias: scrollinto)
82
+ agent-browser drag <src> <tgt> # Drag and drop
83
+ agent-browser upload <sel> <files> # Upload files
84
+ agent-browser screenshot [path] # Take screenshot (--full for full page, saves to a temporary directory if no path)
85
+ agent-browser pdf <path> # Save as PDF
86
+ agent-browser snapshot # Accessibility tree with refs (best for AI)
87
+ agent-browser eval <js> # Run JavaScript (-b for base64, --stdin for piped input)
88
+ agent-browser connect <port> # Connect to browser via CDP
89
+ agent-browser close # Close browser (aliases: quit, exit)
90
+ ```
91
+
92
+ ### Get Info
93
+
94
+ ```bash
95
+ agent-browser get text <sel> # Get text content
96
+ agent-browser get html <sel> # Get innerHTML
97
+ agent-browser get value <sel> # Get input value
98
+ agent-browser get attr <sel> <attr> # Get attribute
99
+ agent-browser get title # Get page title
100
+ agent-browser get url # Get current URL
101
+ agent-browser get count <sel> # Count matching elements
102
+ agent-browser get box <sel> # Get bounding box
103
+ ```
104
+
105
+ ### Check State
106
+
107
+ ```bash
108
+ agent-browser is visible <sel> # Check if visible
109
+ agent-browser is enabled <sel> # Check if enabled
110
+ agent-browser is checked <sel> # Check if checked
111
+ ```
112
+
113
+ ### Find Elements (Semantic Locators)
114
+
115
+ ```bash
116
+ agent-browser find role <role> <action> [value] # By ARIA role
117
+ agent-browser find text <text> <action> # By text content
118
+ agent-browser find label <label> <action> [value] # By label
119
+ agent-browser find placeholder <ph> <action> [value] # By placeholder
120
+ agent-browser find alt <text> <action> # By alt text
121
+ agent-browser find title <text> <action> # By title attr
122
+ agent-browser find testid <id> <action> [value] # By data-testid
123
+ agent-browser find first <sel> <action> [value] # First match
124
+ agent-browser find last <sel> <action> [value] # Last match
125
+ agent-browser find nth <n> <sel> <action> [value] # Nth match
126
+ ```
127
+
128
+ **Actions:** `click`, `fill`, `check`, `hover`, `text`
129
+
130
+ **Examples:**
131
+ ```bash
132
+ agent-browser find role button click --name "Submit"
133
+ agent-browser find text "Sign In" click
134
+ agent-browser find label "Email" fill "test@test.com"
135
+ agent-browser find first ".item" click
136
+ agent-browser find nth 2 "a" text
137
+ ```
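The `find` forms above share one shape: locator kind, query, action, optional value, then filters. A minimal sketch of a wrapper-side argument builder follows. `findArgs`, `LocatorKind`, and `FindFilters` are hypothetical names for illustration and are not part of agent-browser. The positional `first`/`last`/`nth` forms are left out.

```typescript
// Locator kinds and actions as listed in the README section above.
type LocatorKind = "role" | "text" | "label" | "placeholder" | "alt" | "title" | "testid";
type FindAction = "click" | "fill" | "check" | "hover" | "text";

interface FindFilters {
  name?: string;   // --name: locator name filter
  exact?: boolean; // --exact: exact text match
}

// Hypothetical helper: build the argument list for `agent-browser find ...`.
function findArgs(
  kind: LocatorKind,
  query: string,
  action: FindAction,
  value?: string,
  filters: FindFilters = {},
): string[] {
  const args: string[] = ["find", kind, query, action];
  if (value !== undefined) args.push(value);
  if (filters.name !== undefined) args.push("--name", filters.name);
  if (filters.exact) args.push("--exact");
  return args;
}
```

Returning an argument array rather than a single string lets the caller hand it to a process-spawning API without shell quoting concerns.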
138
+
139
+ ### Wait
140
+
141
+ ```bash
142
+ agent-browser wait <selector> # Wait for element to be visible
143
+ agent-browser wait <ms> # Wait for time (milliseconds)
144
+ agent-browser wait --text "Welcome" # Wait for text to appear
145
+ agent-browser wait --url "**/dash" # Wait for URL pattern
146
+ agent-browser wait --load networkidle # Wait for load state
147
+ agent-browser wait --fn "window.ready === true" # Wait for JS condition
148
+ ```
149
+
150
+ **Load states:** `load`, `domcontentloaded`, `networkidle`
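The wait variants above can be modeled as a discriminated union, so a caller cannot mix, say, a load state with a text condition. This is only a sketch: `WaitCondition` and `waitArgs` are hypothetical names, and the flags are the ones listed above.

```typescript
// One variant per documented form of `agent-browser wait`.
type WaitCondition =
  | { kind: "selector"; selector: string }
  | { kind: "time"; ms: number }
  | { kind: "text"; text: string }
  | { kind: "url"; pattern: string }
  | { kind: "load"; state: "load" | "domcontentloaded" | "networkidle" }
  | { kind: "fn"; expression: string };

// Hypothetical helper: translate a condition into CLI arguments.
function waitArgs(condition: WaitCondition): string[] {
  switch (condition.kind) {
    case "selector":
      return ["wait", condition.selector];
    case "time":
      return ["wait", String(condition.ms)];
    case "text":
      return ["wait", "--text", condition.text];
    case "url":
      return ["wait", "--url", condition.pattern];
    case "load":
      return ["wait", "--load", condition.state];
    case "fn":
      return ["wait", "--fn", condition.expression];
  }
}
```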
151
+
152
+ ### Mouse Control
153
+
154
+ ```bash
155
+ agent-browser mouse move <x> <y> # Move mouse
156
+ agent-browser mouse down [button] # Press button (left/right/middle)
157
+ agent-browser mouse up [button] # Release button
158
+ agent-browser mouse wheel <dy> [dx] # Scroll wheel
159
+ ```
160
+
161
+ ### Browser Settings
162
+
163
+ ```bash
164
+ agent-browser set viewport <w> <h> # Set viewport size
165
+ agent-browser set device <name> # Emulate device ("iPhone 14")
166
+ agent-browser set geo <lat> <lng> # Set geolocation
167
+ agent-browser set offline [on|off] # Toggle offline mode
168
+ agent-browser set headers <json> # Extra HTTP headers
169
+ agent-browser set credentials <u> <p> # HTTP basic auth
170
+ agent-browser set media [dark|light] # Emulate color scheme
171
+ ```
172
+
173
+ ### Cookies & Storage
174
+
175
+ ```bash
176
+ agent-browser cookies # Get all cookies
177
+ agent-browser cookies set <name> <val> # Set cookie
178
+ agent-browser cookies clear # Clear cookies
179
+
180
+ agent-browser storage local # Get all localStorage
181
+ agent-browser storage local <key> # Get specific key
182
+ agent-browser storage local set <k> <v> # Set value
183
+ agent-browser storage local clear # Clear all
184
+
185
+ agent-browser storage session # Same for sessionStorage
186
+ ```
187
+
188
+ ### Network
189
+
190
+ ```bash
191
+ agent-browser network route <url> # Intercept requests
192
+ agent-browser network route <url> --abort # Block requests
193
+ agent-browser network route <url> --body <json> # Mock response
194
+ agent-browser network unroute [url] # Remove routes
195
+ agent-browser network requests # View tracked requests
196
+ agent-browser network requests --filter api # Filter requests
197
+ ```
198
+
199
+ ### Tabs & Windows
200
+
201
+ ```bash
202
+ agent-browser tab # List tabs
203
+ agent-browser tab new [url] # New tab (optionally with URL)
204
+ agent-browser tab <n> # Switch to tab n
205
+ agent-browser tab close [n] # Close tab
206
+ agent-browser window new # New window
207
+ ```
208
+
209
+ ### Frames
210
+
211
+ ```bash
212
+ agent-browser frame <sel> # Switch to iframe
213
+ agent-browser frame main # Back to main frame
214
+ ```
215
+
216
+ ### Dialogs
217
+
218
+ ```bash
219
+ agent-browser dialog accept [text] # Accept (with optional prompt text)
220
+ agent-browser dialog dismiss # Dismiss
221
+ ```
222
+
223
+ ### Debug
224
+
225
+ ```bash
226
+ agent-browser trace start [path] # Start recording trace
227
+ agent-browser trace stop [path] # Stop and save trace
228
+ agent-browser console # View console messages (log, error, warn, info)
229
+ agent-browser console --clear # Clear console
230
+ agent-browser errors # View page errors (uncaught JavaScript exceptions)
231
+ agent-browser errors --clear # Clear errors
232
+ agent-browser highlight <sel> # Highlight element
233
+ agent-browser state save <path> # Save auth state
234
+ agent-browser state load <path> # Load auth state
235
+ ```
236
+
237
+ ### Navigation
238
+
239
+ ```bash
240
+ agent-browser back # Go back
241
+ agent-browser forward # Go forward
242
+ agent-browser reload # Reload page
243
+ ```
244
+
245
+ ### Setup
246
+
247
+ ```bash
248
+ agent-browser install # Download Chromium browser
249
+ agent-browser install --with-deps # Also install system deps (Linux)
250
+ ```
251
+
252
+ ## Sessions
253
+
254
+ Run multiple isolated browser instances:
255
+
256
+ ```bash
257
+ # Different sessions
258
+ agent-browser --session agent1 open site-a.com
259
+ agent-browser --session agent2 open site-b.com
260
+
261
+ # Or via environment variable
262
+ AGENT_BROWSER_SESSION=agent1 agent-browser click "#btn"
263
+
264
+ # List active sessions
265
+ agent-browser session list
266
+ # Output:
267
+ # Active sessions:
268
+ # -> default
269
+ # agent1
270
+
271
+ # Show current session
272
+ agent-browser session
273
+ ```
274
+
275
+ Each session has its own:
276
+ - Browser instance
277
+ - Cookies and storage
278
+ - Navigation history
279
+ - Authentication state
280
+
281
+ ## Persistent Profiles
282
+
283
+ By default, browser state (cookies, localStorage, login sessions) is ephemeral and lost when the browser closes. Use `--profile` to persist state across browser restarts:
284
+
285
+ ```bash
286
+ # Use a persistent profile directory
287
+ agent-browser --profile ~/.myapp-profile open myapp.com
288
+
289
+ # Login once, then reuse the authenticated session
290
+ agent-browser --profile ~/.myapp-profile open myapp.com/dashboard
291
+
292
+ # Or via environment variable
293
+ AGENT_BROWSER_PROFILE=~/.myapp-profile agent-browser open myapp.com
294
+ ```
295
+
296
+ The profile directory stores:
297
+ - Cookies and localStorage
298
+ - IndexedDB data
299
+ - Service workers
300
+ - Browser cache
301
+ - Login sessions
302
+
303
+ **Tip**: Use different profile paths for different projects to keep their browser state isolated.
304
+
305
+ ## Snapshot Options
306
+
307
+ The `snapshot` command supports filtering to reduce output size:
308
+
309
+ ```bash
310
+ agent-browser snapshot # Full accessibility tree
311
+ agent-browser snapshot -i # Interactive elements only (buttons, inputs, links)
312
+ agent-browser snapshot -i -C # Include cursor-interactive elements (divs with onclick, etc.)
313
+ agent-browser snapshot -c # Compact (remove empty structural elements)
314
+ agent-browser snapshot -d 3 # Limit depth to 3 levels
315
+ agent-browser snapshot -s "#main" # Scope to CSS selector
316
+ agent-browser snapshot -i -c -d 5 # Combine options
317
+ ```
318
+
319
+ | Option | Description |
320
+ |--------|-------------|
321
+ | `-i, --interactive` | Only show interactive elements (buttons, links, inputs) |
322
+ | `-C, --cursor` | Include cursor-interactive elements (cursor:pointer, onclick, tabindex) |
323
+ | `-c, --compact` | Remove empty structural elements |
324
+ | `-d, --depth <n>` | Limit tree depth |
325
+ | `-s, --selector <sel>` | Scope to CSS selector |
326
+
327
+ The `-C` flag is useful for modern web apps that use custom clickable elements (divs, spans) instead of standard buttons/links.
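As a rough illustration of how the snapshot flags combine, the sketch below maps a typed options object onto the short flags from the table. `SnapshotOptions` and `snapshotArgs` are hypothetical names, not an agent-browser API.

```typescript
interface SnapshotOptions {
  interactive?: boolean; // -i
  cursor?: boolean;      // -C
  compact?: boolean;     // -c
  depth?: number;        // -d <n>
  selector?: string;     // -s <sel>
}

// Hypothetical helper: build the argument list for `agent-browser snapshot`.
function snapshotArgs(options: SnapshotOptions = {}): string[] {
  const args: string[] = ["snapshot"];
  if (options.interactive) args.push("-i");
  if (options.cursor) args.push("-C");
  if (options.compact) args.push("-c");
  if (options.depth !== undefined) args.push("-d", String(options.depth));
  if (options.selector !== undefined) args.push("-s", options.selector);
  return args;
}
```

For example, `snapshotArgs({ interactive: true, compact: true, depth: 5 })` yields the same flags as the "Combine options" line above.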
328
+
329
+ ## Options
330
+
331
+ | Option | Description |
332
+ |--------|-------------|
333
+ | `--session <name>` | Use isolated session (or `AGENT_BROWSER_SESSION` env) |
334
+ | `--profile <path>` | Persistent browser profile directory (or `AGENT_BROWSER_PROFILE` env) |
335
+ | `--headers <json>` | Set HTTP headers scoped to the URL's origin |
336
+ | `--executable-path <path>` | Custom browser executable (or `AGENT_BROWSER_EXECUTABLE_PATH` env) |
337
+ | `--args <args>` | Browser launch args, comma or newline separated (or `AGENT_BROWSER_ARGS` env) |
338
+ | `--user-agent <ua>` | Custom User-Agent string (or `AGENT_BROWSER_USER_AGENT` env) |
339
+ | `--proxy <url>` | Proxy server URL with optional auth (or `AGENT_BROWSER_PROXY` env) |
340
+ | `--proxy-bypass <hosts>` | Hosts to bypass proxy (or `AGENT_BROWSER_PROXY_BYPASS` env) |
341
+ | `-p, --provider <name>` | Cloud browser provider (or `AGENT_BROWSER_PROVIDER` env) |
342
+ | `--json` | JSON output (for agents) |
343
+ | `--full, -f` | Full page screenshot |
344
+ | `--name, -n` | Locator name filter |
345
+ | `--exact` | Exact text match |
346
+ | `--headed` | Show browser window (not headless) |
347
+ | `--cdp <port>` | Connect via Chrome DevTools Protocol |
348
+ | `--ignore-https-errors` | Ignore HTTPS certificate errors (useful for self-signed certs) |
349
+ | `--allow-file-access` | Allow file:// URLs to access local files (Chromium only) |
350
+ | `--debug` | Debug output |
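Several options in the table fall back to an environment variable. The sketch below shows one way a wrapper might resolve them. It assumes an explicit flag takes precedence over the environment, which the table does not state. `resolveOption` and `splitLaunchArgs` are hypothetical helpers.

```typescript
// Flag-to-environment-variable pairs taken from the options table.
const ENV_FALLBACKS = {
  session: "AGENT_BROWSER_SESSION",
  profile: "AGENT_BROWSER_PROFILE",
  executablePath: "AGENT_BROWSER_EXECUTABLE_PATH",
  args: "AGENT_BROWSER_ARGS",
  userAgent: "AGENT_BROWSER_USER_AGENT",
  proxy: "AGENT_BROWSER_PROXY",
  proxyBypass: "AGENT_BROWSER_PROXY_BYPASS",
  provider: "AGENT_BROWSER_PROVIDER",
} as const;

type OptionName = keyof typeof ENV_FALLBACKS;
type EnvVars = Record<string, string | undefined>;

// Assumption: a flag given on the command line wins over the environment.
function resolveOption(
  option: OptionName,
  flags: Partial<Record<OptionName, string>>,
  env: EnvVars,
): string | undefined {
  return flags[option] ?? env[ENV_FALLBACKS[option]];
}

// `--args` values are comma or newline separated; empty entries are dropped.
function splitLaunchArgs(raw: string): string[] {
  return raw
    .split(/[,\n]/)
    .map((arg) => arg.trim())
    .filter((arg) => arg.length > 0);
}
```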
351
+
352
+ ## Selectors
353
+
354
+ ### Refs (Recommended for AI)
355
+
356
+ Refs provide deterministic element selection from snapshots:
357
+
358
+ ```bash
359
+ # 1. Get snapshot with refs
360
+ agent-browser snapshot
361
+ # Output:
362
+ # - heading "Example Domain" [ref=e1] [level=1]
363
+ # - button "Submit" [ref=e2]
364
+ # - textbox "Email" [ref=e3]
365
+ # - link "Learn more" [ref=e4]
366
+
367
+ # 2. Use refs to interact
368
+ agent-browser click @e2 # Click the button
369
+ agent-browser fill @e3 "test@example.com" # Fill the textbox
370
+ agent-browser get text @e1 # Get heading text
371
+ agent-browser hover @e4 # Hover the link
372
+ ```
373
+
374
+ **Why use refs?**
375
+ - **Deterministic**: A ref points to the exact element captured in the snapshot
376
+ - **Fast**: No DOM re-query needed
377
+ - **AI-friendly**: Snapshot + ref workflow is optimal for LLMs
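A minimal sketch of how an agent-side script might pull refs out of the text snapshot. It assumes each element line has the form `- role "name" [ref=eN]`, as in the sample output above. `SnapshotRef`, `parseSnapshotRefs`, and `toSelector` are hypothetical names. The `--json` output described under Agent Mode is likely the more robust source when available.

```typescript
interface SnapshotRef {
  ref: string;  // e.g. "e2"
  role: string; // e.g. "button"
  name: string; // accessible name, e.g. "Submit"
}

// Hypothetical parser for snapshot lines such as:
//   - button "Submit" [ref=e2]
function parseSnapshotRefs(snapshot: string): SnapshotRef[] {
  const linePattern = /^\s*-\s+(\w+)\s+"([^"]*)".*\[ref=(\w+)\]/;
  const refs: SnapshotRef[] = [];
  for (const line of snapshot.split("\n")) {
    const match = linePattern.exec(line);
    if (match) {
      const [, role = "", name = "", ref = ""] = match;
      refs.push({ role, name, ref });
    }
  }
  return refs;
}

// Turn a parsed ref into the selector form the CLI accepts, e.g. "@e2".
function toSelector(entry: SnapshotRef): string {
  return `@${entry.ref}`;
}
```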
378
+
379
+ ### CSS Selectors
380
+
381
+ ```bash
382
+ agent-browser click "#id"
383
+ agent-browser click ".class"
384
+ agent-browser click "div > button"
385
+ ```
386
+
387
+ ### Text & XPath
388
+
389
+ ```bash
390
+ agent-browser click "text=Submit"
391
+ agent-browser click "xpath=//button"
392
+ ```
393
+
394
+ ### Semantic Locators
395
+
396
+ ```bash
397
+ agent-browser find role button click --name "Submit"
398
+ agent-browser find label "Email" fill "test@test.com"
399
+ ```
400
+
401
+ ## Agent Mode
402
+
403
+ Use `--json` for machine-readable output:
404
+
405
+ ```bash
406
+ agent-browser snapshot --json
407
+ # Returns: {"success":true,"data":{"snapshot":"...","refs":{"e1":{"role":"heading","name":"Title"},...}}}
408
+
409
+ agent-browser get text @e1 --json
410
+ agent-browser is visible @e2 --json
411
+ ```
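The JSON shape in the comment above can be given types so an agent harness can filter refs by role. This is a sketch under assumptions. Only the success shape appears in the README, so `data` is treated as optional in case a failed command omits it. `SnapshotResult`, `RefInfo`, and `refsByRole` are hypothetical names.

```typescript
interface RefInfo {
  role: string;
  name?: string;
}

interface SnapshotResult {
  success: boolean;
  // Assumption: `data` may be absent when `success` is false.
  data?: {
    snapshot: string;
    refs: Record<string, RefInfo>;
  };
}

// Return selectors (e.g. "@e2") for every ref with the given role.
function refsByRole(jsonOutput: string, role: string): string[] {
  const result = JSON.parse(jsonOutput) as SnapshotResult;
  if (!result.success || !result.data) return [];
  const refs = result.data.refs;
  return Object.keys(refs)
    .filter((ref) => refs[ref]?.role === role)
    .map((ref) => `@${ref}`);
}
```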
412
+
413
+ ### Optimal AI Workflow
414
+
415
+ ```bash
416
+ # 1. Navigate and get snapshot
417
+ agent-browser open example.com
418
+ agent-browser snapshot -i --json # AI parses tree and refs
419
+
420
+ # 2. AI identifies target refs from snapshot
421
+ # 3. Execute actions using refs
422
+ agent-browser click @e2
423
+ agent-browser fill @e3 "input text"
424
+
425
+ # 4. Get new snapshot if page changed
426
+ agent-browser snapshot -i --json
427
+ ```
428
+
429
+ ## Headed Mode
430
+
431
+ Show the browser window for debugging:
432
+
433
+ ```bash
434
+ agent-browser open example.com --headed
435
+ ```
436
+
437
+ This opens a visible browser window instead of running headless.
438
+
439
+ ## Authenticated Sessions
440
+
441
+ Use `--headers` to set HTTP headers for a specific origin, enabling authentication without login flows:
442
+
443
+ ```bash
444
+ # Headers are scoped to api.example.com only
445
+ agent-browser open api.example.com --headers '{"Authorization": "Bearer <token>"}'
446
+
447
+ # Requests to api.example.com include the auth header
448
+ agent-browser snapshot -i --json
449
+ agent-browser click @e2
450
+
451
+ # Navigate to another domain - headers are NOT sent (safe!)
452
+ agent-browser open other-site.com
453
+ ```
454
+
455
+ This is useful for:
456
+ - **Skipping login flows** - Authenticate via headers instead of UI
457
+ - **Switching users** - Start new sessions with different auth tokens
458
+ - **API testing** - Access protected endpoints directly
459
+ - **Security** - Headers are scoped to the origin, not leaked to other domains
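To make the origin-scoping idea concrete, here is a small sketch of a header store keyed by origin (scheme, host, and port). It illustrates the concept only and is not agent-browser's implementation. `OriginScopedHeaders` is a hypothetical class. Unlike the CLI, which accepts bare hostnames, this sketch expects full URLs with a scheme.

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical store: headers are looked up by the request's origin.
class OriginScopedHeaders {
  private readonly byOrigin: Record<string, HeaderMap> = {};

  // Register headers for the origin of `url`.
  set(url: string, headers: HeaderMap): void {
    this.byOrigin[new URL(url).origin] = headers;
  }

  // Headers to attach to a request; empty when the origin has none registered.
  headersFor(requestUrl: string): HeaderMap {
    return this.byOrigin[new URL(requestUrl).origin] ?? {};
  }
}
```

With headers registered for `https://api.example.com`, a request to `https://other-site.com/` gets an empty map, which mirrors the "headers are NOT sent" behaviour described above.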
460
+
461
+ To set headers for multiple origins, use `--headers` with each `open` command:
462
+
463
+ ```bash
464
+ agent-browser open api.example.com --headers '{"Authorization": "Bearer token1"}'
465
+ agent-browser open api.acme.com --headers '{"Authorization": "Bearer token2"}'
466
+ ```
467
+
468
+ For global headers (all domains), use `set headers`:
469
+
470
+ ```bash
471
+ agent-browser set headers '{"X-Custom-Header": "value"}'
472
+ ```
473
+
474
+ ## Custom Browser Executable
475
+
476
+ Use a custom browser executable instead of the bundled Chromium. This is useful for:
477
+ - **Serverless deployment**: Use lightweight Chromium builds like `@sparticuz/chromium` (~50MB vs ~684MB)
478
+ - **System browsers**: Use an existing Chrome/Chromium installation
479
+ - **Custom builds**: Use modified browser builds
480
+
481
+ ### CLI Usage
482
+
483
+ ```bash
484
+ # Via flag
485
+ agent-browser --executable-path /path/to/chromium open example.com
486
+
487
+ # Via environment variable
488
+ AGENT_BROWSER_EXECUTABLE_PATH=/path/to/chromium agent-browser open example.com
489
+ ```
490
+
491
+ ### Serverless Example (Vercel/AWS Lambda)
492
+
493
+ ```typescript
494
+ import chromium from '@sparticuz/chromium';
495
+ import { BrowserManager } from 'agent-browser';
496
+
497
+ export async function handler() {
498
+   const browser = new BrowserManager();
499
+   await browser.launch({
500
+     executablePath: await chromium.executablePath(),
501
+     headless: true,
502
+   });
503
+   // ... use browser
504
+ }
505
+ ```
506
+
507
+ ## Local Files
508
+
509
+ Open and interact with local files (PDFs, HTML, etc.) using `file://` URLs:
510
+
511
+ ```bash
512
+ # Enable file access (required for JavaScript to access local files)
513
+ agent-browser --allow-file-access open file:///path/to/document.pdf
514
+ agent-browser --allow-file-access open file:///path/to/page.html
515
+
516
+ # Take screenshot of a local PDF
517
+ agent-browser --allow-file-access open file:///Users/me/report.pdf
518
+ agent-browser screenshot report.png
519
+ ```
520
+
521
+ The `--allow-file-access` flag adds Chromium flags (`--allow-file-access-from-files`, `--allow-file-access`) that allow `file://` URLs to:
522
+ - Load and render local files
523
+ - Access other local files via JavaScript (XHR, fetch)
524
+ - Load local resources (images, scripts, stylesheets)
525
+
526
+ **Note:** This flag only works with Chromium. For security, it's disabled by default.
527
+
528
+ ## CDP Mode
529
+
530
+ Connect to an existing browser via Chrome DevTools Protocol:
531
+
532
+ ```bash
533
+ # Start Chrome with: google-chrome --remote-debugging-port=9222
534
+
535
+ # Connect once, then run commands without --cdp
536
+ agent-browser connect 9222
537
+ agent-browser snapshot
538
+ agent-browser tab
539
+ agent-browser close
540
+
541
+ # Or pass --cdp on each command
542
+ agent-browser --cdp 9222 snapshot
543
+
544
+ # Connect to remote browser via WebSocket URL
545
+ agent-browser --cdp "wss://your-browser-service.com/cdp?token=..." snapshot
546
+ ```
547
+
548
+ The `--cdp` flag accepts either:
549
+ - A port number (e.g., `9222`) for local connections via `http://localhost:{port}`
550
+ - A full WebSocket URL (e.g., `wss://...` or `ws://...`) for remote browser services
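The two accepted forms of `--cdp` can be told apart with a simple check. The sketch below follows the description above: a bare port maps to `http://localhost:{port}`, and a `ws://` or `wss://` URL is used as given. `resolveCdpEndpoint` is a hypothetical helper, and rejecting anything else is an assumption about reasonable behaviour.

```typescript
// Hypothetical helper: normalize a `--cdp` value into an endpoint URL.
function resolveCdpEndpoint(value: string): string {
  if (/^\d+$/.test(value)) {
    // Port number: local connection.
    return `http://localhost:${value}`;
  }
  if (/^wss?:\/\//.test(value)) {
    // Full WebSocket URL: remote browser service.
    return value;
  }
  // Assumption: any other form is treated as an error.
  throw new Error(`Unsupported --cdp value: ${value}`);
}
```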
551
+
552
+ This enables control of:
553
+ - Electron apps
554
+ - Chrome/Chromium instances with remote debugging
555
+ - WebView2 applications
556
+ - Any browser exposing a CDP endpoint
557
+
558
+ ## Streaming (Browser Preview)
559
+
560
+ Stream the browser viewport via WebSocket for live preview or "pair browsing" where a human can watch and interact alongside an AI agent.
561
+
562
+ ### Enable Streaming
563
+
564
+ Set the `AGENT_BROWSER_STREAM_PORT` environment variable:
565
+
566
+ ```bash
567
+ AGENT_BROWSER_STREAM_PORT=9223 agent-browser open example.com
568
+ ```
569
+
570
+ This starts a WebSocket server on the specified port that streams the browser viewport and accepts input events.
571
+
572
+ ### WebSocket Protocol
573
+
574
+ Connect to `ws://localhost:9223` to receive frames and send input:
575
+
576
+ **Receive frames:**
577
+ ```json
578
+ {
579
+   "type": "frame",
580
+   "data": "<base64-encoded-jpeg>",
581
+   "metadata": {
582
+     "deviceWidth": 1280,
583
+     "deviceHeight": 720,
584
+     "pageScaleFactor": 1,
585
+     "offsetTop": 0,
586
+     "scrollOffsetX": 0,
587
+     "scrollOffsetY": 0
588
+   }
589
+ }
590
+ ```
591
+
592
+ **Send mouse events:**
593
+ ```json
594
+ {
595
+   "type": "input_mouse",
596
+   "eventType": "mousePressed",
597
+   "x": 100,
598
+   "y": 200,
599
+   "button": "left",
600
+   "clickCount": 1
601
+ }
602
+ ```
603
+
604
+ **Send keyboard events:**
605
+ ```json
606
+ {
607
+   "type": "input_keyboard",
608
+   "eventType": "keyDown",
609
+   "key": "Enter",
610
+   "code": "Enter"
611
+ }
612
+ ```
613
+
614
+ **Send touch events:**
615
+ ```json
616
+ {
617
+   "type": "input_touch",
618
+   "eventType": "touchStart",
619
+   "touchPoints": [{ "x": 100, "y": 200 }]
620
+ }
621
+ ```
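The four message shapes above can be written down as types so a preview client encodes and decodes them consistently. This is a sketch based only on the JSON examples shown. The interface and function names are hypothetical. `eventType` is left as a plain string because only `mousePressed`, `keyDown`, and `touchStart` are documented here.

```typescript
interface FrameMessage {
  type: "frame";
  data: string; // base64-encoded JPEG
  metadata: {
    deviceWidth: number;
    deviceHeight: number;
    pageScaleFactor: number;
    offsetTop: number;
    scrollOffsetX: number;
    scrollOffsetY: number;
  };
}

interface MouseInput {
  type: "input_mouse";
  eventType: string;
  x: number;
  y: number;
  button: string;
  clickCount: number;
}

interface KeyboardInput {
  type: "input_keyboard";
  eventType: string;
  key: string;
  code: string;
}

interface TouchInput {
  type: "input_touch";
  eventType: string;
  touchPoints: { x: number; y: number }[];
}

type InputMessage = MouseInput | KeyboardInput | TouchInput;

// Serialize an input event for sending over the WebSocket.
function encodeInput(message: InputMessage): string {
  return JSON.stringify(message);
}

// Parse an incoming message; returns undefined for anything that is not a frame.
function decodeFrame(raw: string): FrameMessage | undefined {
  const parsed = JSON.parse(raw) as { type?: unknown };
  return parsed.type === "frame" ? (parsed as FrameMessage) : undefined;
}

// Convenience builder matching the mouse example above.
function mousePress(x: number, y: number): MouseInput {
  return { type: "input_mouse", eventType: "mousePressed", x, y, button: "left", clickCount: 1 };
}
```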
622
+
623
+ ### Programmatic API
624
+
625
+ For advanced use, control streaming directly via the protocol:
626
+
627
+ ```typescript
628
+ import { BrowserManager } from 'agent-browser';
629
+
630
+ const browser = new BrowserManager();
631
+ await browser.launch({ headless: true });
632
+ await browser.navigate('https://example.com');
633
+
634
+ // Start screencast
635
+ await browser.startScreencast((frame) => {
636
+   // frame.data is base64-encoded image
637
+   // frame.metadata contains viewport info
638
+   console.log('Frame received:', frame.metadata.deviceWidth, 'x', frame.metadata.deviceHeight);
639
+ }, {
640
+   format: 'jpeg',
641
+   quality: 80,
642
+   maxWidth: 1280,
643
+   maxHeight: 720,
644
+ });
645
+
646
+ // Inject mouse events
647
+ await browser.injectMouseEvent({
648
+   type: 'mousePressed',
649
+   x: 100,
650
+   y: 200,
651
+   button: 'left',
652
+ });
653
+
654
+ // Inject keyboard events
655
+ await browser.injectKeyboardEvent({
656
+   type: 'keyDown',
657
+   key: 'Enter',
658
+   code: 'Enter',
659
+ });
660
+
661
+ // Stop when done
662
+ await browser.stopScreencast();
663
+ ```
664
+
665
+ ## Architecture
666
+
667
+ agent-browser uses a client-daemon architecture:
668
+
669
+ 1. **Rust CLI** (fast native binary) - Parses commands, communicates with daemon
670
+ 2. **Node.js Daemon** - Manages Playwright browser instance
671
+ 3. **Fallback** - If native binary unavailable, uses Node.js directly
672
+
673
+ The daemon starts automatically on first command and persists between commands for fast subsequent operations.
674
+
675
+ **Browser Engine:** Uses Chromium by default. The daemon also supports Firefox and WebKit via the Playwright protocol.
676
+
677
+ ## Platforms
678
+
679
+ | Platform | Binary | Fallback |
680
+ |----------|--------|----------|
681
+ | macOS ARM64 | Native Rust | Node.js |
682
+ | macOS x64 | Native Rust | Node.js |
683
+ | Linux ARM64 | Native Rust | Node.js |
684
+ | Linux x64 | Native Rust | Node.js |
685
+ | Windows x64 | Native Rust | Node.js |
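A rough sketch of how a launcher could choose between the native binary and the Node.js fallback for the platforms in the table. The `agent-browser-<platform>-<arch>` naming is inferred from the `agent-browser-linux-x64` file in this package. The `darwin`/`win32` segments and the `.exe` suffix are assumptions. `nativeBinaryName` is a hypothetical helper.

```typescript
// Targets from the platform table, written with Node.js platform/arch names (assumed).
const NATIVE_TARGETS = ["darwin-arm64", "darwin-x64", "linux-arm64", "linux-x64", "win32-x64"];

// Returns the expected binary file name, or undefined when the caller
// should fall back to running the Node.js implementation directly.
function nativeBinaryName(platform: string, arch: string): string | undefined {
  const target = `${platform}-${arch}`;
  if (NATIVE_TARGETS.indexOf(target) === -1) return undefined;
  const suffix = platform === "win32" ? ".exe" : ""; // assumption
  return `agent-browser-${target}${suffix}`;
}
```

In Node.js the inputs would typically come from `process.platform` and `process.arch`.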
686
+
687
+ ## Usage with AI Agents
688
+
689
+ ### Just ask the agent
690
+
691
+ The simplest approach - just tell your agent to use it:
692
+
693
+ ```
694
+ Use agent-browser to test the login flow. Run agent-browser --help to see available commands.
695
+ ```
696
+
697
+ The `--help` output is comprehensive and most agents can figure it out from there.
698
+
699
+ ### AI Coding Assistants
700
+
701
+ Add the skill to your AI coding assistant for richer context:
702
+
703
+ ```bash
704
+ npx skills add vercel-labs/agent-browser
705
+ ```
706
+
707
+ This works with Claude Code, Codex, Cursor, Gemini CLI, GitHub Copilot, Goose, OpenCode, and Windsurf.
708
+
709
+ ### AGENTS.md / CLAUDE.md
710
+
711
+ For more consistent results, add to your project or global instructions file:
712
+
713
+ ```markdown
714
+ ## Browser Automation
715
+
716
+ Use `agent-browser` for web automation. Run `agent-browser --help` for all commands.
717
+
718
+ Core workflow:
719
+ 1. `agent-browser open <url>` - Navigate to page
720
+ 2. `agent-browser snapshot -i` - Get interactive elements with refs (@e1, @e2)
721
+ 3. `agent-browser click @e1` / `fill @e2 "text"` - Interact using refs
722
+ 4. Re-snapshot after page changes
723
+ ```
724
+
725
+ ## Integrations
726
+
727
+ ### iOS Simulator
728
+
729
+ Control real Mobile Safari in the iOS Simulator for authentic mobile web testing. Requires macOS with Xcode.
730
+
731
+ **Setup:**
732
+
733
+ ```bash
734
+ # Install Appium and XCUITest driver
735
+ npm install -g appium
736
+ appium driver install xcuitest
737
+ ```
738
+
739
+ **Usage:**
740
+
741
+ ```bash
742
+ # List available iOS simulators
743
+ agent-browser device list
744
+
745
+ # Launch Safari on a specific device
746
+ agent-browser -p ios --device "iPhone 16 Pro" open https://example.com
747
+
748
+ # Same commands as desktop
749
+ agent-browser -p ios snapshot -i
750
+ agent-browser -p ios tap @e1
751
+ agent-browser -p ios fill @e2 "text"
752
+ agent-browser -p ios screenshot mobile.png
753
+
754
+ # Mobile-specific commands
755
+ agent-browser -p ios swipe up
756
+ agent-browser -p ios swipe down 500
757
+
758
+ # Close session
759
+ agent-browser -p ios close
760
+ ```
761
+
762
+ Or use environment variables:
763
+
764
+ ```bash
765
+ export AGENT_BROWSER_PROVIDER=ios
766
+ export AGENT_BROWSER_IOS_DEVICE="iPhone 16 Pro"
767
+ agent-browser open https://example.com
768
+ ```
769
+
770
+ | Variable | Description |
771
+ |----------|-------------|
772
+ | `AGENT_BROWSER_PROVIDER` | Set to `ios` to enable iOS mode |
773
+ | `AGENT_BROWSER_IOS_DEVICE` | Device name (e.g., "iPhone 16 Pro", "iPad Pro") |
774
+ | `AGENT_BROWSER_IOS_UDID` | Device UDID (alternative to device name) |
775
+
776
+ **Supported devices:** All iOS Simulators available in Xcode (iPhones, iPads), plus real iOS devices.
777
+
778
+ **Note:** The iOS provider boots the simulator, starts Appium, and controls Safari. First launch takes ~30-60 seconds; subsequent commands are fast.
779
+
780
+ #### Real Device Support
781
+
782
+ Appium also supports real iOS devices connected via USB. This requires additional one-time setup:
783
+
784
+ **1. Get your device UDID:**
785
+ ```bash
786
+ xcrun xctrace list devices
787
+ # or
788
+ system_profiler SPUSBDataType | grep -A 5 "iPhone\|iPad"
789
+ ```
790
+
791
+ **2. Sign WebDriverAgent (one-time):**
792
+ ```bash
793
+ # Open the WebDriverAgent Xcode project
794
+ cd ~/.appium/node_modules/appium-xcuitest-driver/node_modules/appium-webdriveragent
795
+ open WebDriverAgent.xcodeproj
796
+ ```
797
+
798
+ In Xcode:
799
+ - Select the `WebDriverAgentRunner` target
800
+ - Go to Signing & Capabilities
801
+ - Select your Team (requires Apple Developer account, free tier works)
802
+ - Let Xcode manage signing automatically
803
+
804
+ **3. Use with agent-browser:**
805
+ ```bash
806
+ # Connect device via USB, then:
807
+ agent-browser -p ios --device "<DEVICE_UDID>" open https://example.com
808
+
809
+ # Or use the device name if unique
810
+ agent-browser -p ios --device "John's iPhone" open https://example.com
811
+ ```
812
+
813
+ **Real device notes:**
814
+ - First run installs WebDriverAgent on the device (you may need to accept a Trust prompt)
815
+ - Device must be unlocked and connected via USB
816
+ - Slightly slower initial connection than simulator
817
+ - Tests against real Safari performance and behavior
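
Step 1 above lists devices, but the UDID still has to be copied out by hand. As a rough sketch, the helper below pulls the UDID for a named device from that listing. It assumes each device line has the shape `Name (OS version) (UDID)`, which is how `xcrun xctrace list devices` has typically printed devices. `udid_for_device` is a hypothetical name, not part of agent-browser.

```bash
# Hypothetical helper (not part of agent-browser): read device lines on stdin
# and print the UDID of the first line containing the given device name.
# Assumes each matching line ends with "(UDID)", e.g. "Name (17.4) (<udid>)".
udid_for_device() {
  grep -F -- "$1" \
    | sed -n 's/.*(\([^()]*\))[[:space:]]*$/\1/p' \
    | head -n 1
}

# Usage (requires Xcode command line tools and a connected device):
#   udid=$(xcrun xctrace list devices 2>&1 | udid_for_device "John's iPhone")
#   agent-browser -p ios --device "$udid" open https://example.com
```

If a simulator and a real device share the same name, the first match wins, so passing the UDID directly remains the safer choice.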
818
+
819
+ ### Browserbase
820
+
821
+ [Browserbase](https://browserbase.com) provides remote browser infrastructure for deploying browser agents. Use it when running the agent-browser CLI in an environment where a local browser isn't feasible.
822
+
823
+ To enable Browserbase, use the `-p` flag:
824
+
825
+ ```bash
826
+ export BROWSERBASE_API_KEY="your-api-key"
827
+ export BROWSERBASE_PROJECT_ID="your-project-id"
828
+ agent-browser -p browserbase open https://example.com
829
+ ```
830
+
831
+ Or use environment variables for CI/scripts:
832
+
833
+ ```bash
834
+ export AGENT_BROWSER_PROVIDER=browserbase
835
+ export BROWSERBASE_API_KEY="your-api-key"
836
+ export BROWSERBASE_PROJECT_ID="your-project-id"
837
+ agent-browser open https://example.com
838
+ ```
839
+
840
+ When enabled, agent-browser connects to a Browserbase session instead of launching a local browser. All commands work identically.
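
In CI scripts, a missing credential is easier to diagnose if the script stops before agent-browser runs. One possible preflight check is sketched below in plain POSIX shell. `require_env` is a hypothetical helper, not something agent-browser provides.

```bash
# Hypothetical helper (not part of agent-browser): return non-zero if any of
# the named environment variables is unset or empty, naming each one on stderr.
require_env() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_env BROWSERBASE_API_KEY BROWSERBASE_PROJECT_ID \
#     && agent-browser -p browserbase open https://example.com
```

The same helper works for the other providers by passing their variable names instead.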
841
+
842
+ Get your API key and project ID from the [Browserbase Dashboard](https://browserbase.com/overview).
843
+
844
+ ### Browser Use
845
+
846
+ [Browser Use](https://browser-use.com) provides cloud browser infrastructure for AI agents. Use it when running agent-browser in environments where a local browser isn't available (serverless, CI/CD, etc.).
847
+
848
+ To enable Browser Use, use the `-p` flag:
849
+
850
+ ```bash
851
+ export BROWSER_USE_API_KEY="your-api-key"
852
+ agent-browser -p browseruse open https://example.com
853
+ ```
854
+
855
+ Or use environment variables for CI/scripts:
856
+
857
+ ```bash
858
+ export AGENT_BROWSER_PROVIDER=browseruse
859
+ export BROWSER_USE_API_KEY="your-api-key"
860
+ agent-browser open https://example.com
861
+ ```
862
+
863
+ When enabled, agent-browser connects to a Browser Use cloud session instead of launching a local browser. All commands work identically.
864
+
865
+ Get your API key from the [Browser Use Cloud Dashboard](https://cloud.browser-use.com/settings?tab=api-keys). Free credits are available to get started, with pay-as-you-go pricing afterward.
866
+
867
+ ### Kernel
868
+
869
+ [Kernel](https://www.kernel.sh) provides cloud browser infrastructure for AI agents with features like stealth mode and persistent profiles.
870
+
871
+ To enable Kernel, use the `-p` flag:
872
+
873
+ ```bash
874
+ export KERNEL_API_KEY="your-api-key"
875
+ agent-browser -p kernel open https://example.com
876
+ ```
877
+
878
+ Or use environment variables for CI/scripts:
879
+
880
+ ```bash
881
+ export AGENT_BROWSER_PROVIDER=kernel
882
+ export KERNEL_API_KEY="your-api-key"
883
+ agent-browser open https://example.com
884
+ ```
885
+
886
+ Optional configuration via environment variables:
887
+
888
+ | Variable | Description | Default |
889
+ |----------|-------------|---------|
890
+ | `KERNEL_HEADLESS` | Run browser in headless mode (`true`/`false`) | `false` |
891
+ | `KERNEL_STEALTH` | Enable stealth mode to avoid bot detection (`true`/`false`) | `true` |
892
+ | `KERNEL_TIMEOUT_SECONDS` | Session timeout in seconds | `300` |
893
+ | `KERNEL_PROFILE_NAME` | Browser profile name for persistent cookies/logins (created if it doesn't exist) | (none) |
894
+
895
+ When enabled, agent-browser connects to a Kernel cloud session instead of launching a local browser. All commands work identically.
896
+
897
+ **Profile Persistence:** When `KERNEL_PROFILE_NAME` is set, the profile will be created if it doesn't already exist. Cookies, logins, and session data are automatically saved back to the profile when the browser session ends, making them available for future sessions.
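
To reuse a profile without exporting these variables globally, the settings can be scoped to a single invocation. This is a minimal sketch, assuming `KERNEL_API_KEY` is already exported. `kernel_session` is a hypothetical wrapper name, and the headless and timeout values are illustrative choices rather than recommendations.

```bash
# Hypothetical wrapper (not part of agent-browser): run one agent-browser
# command against Kernel with a named persistent profile.
# Usage: kernel_session <profile-name> <agent-browser args...>
kernel_session() {
  profile_name=$1
  shift
  # Variables set on the command line apply only to this invocation.
  AGENT_BROWSER_PROVIDER=kernel \
  KERNEL_PROFILE_NAME="$profile_name" \
  KERNEL_HEADLESS=true \
  KERNEL_TIMEOUT_SECONDS=600 \
    agent-browser "$@"
}

# Example:
#   kernel_session my-profile open https://example.com
#   kernel_session my-profile close
```

Because profile data is written back when the session ends, closing the session before starting the next one is what makes the saved cookies and logins available to it.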
898
+
899
+ Get your API key from the [Kernel Dashboard](https://dashboard.onkernel.com).
900
+
901
+ ## License
902
+
903
+ Apache-2.0