browser-autopilot 0.4.4 → 0.5.1
- package/README.md +352 -178
- package/dist/agent/loop.d.ts +5 -0
- package/dist/agent/loop.js +13 -3
- package/dist/agent/loop.js.map +1 -1
- package/dist/agent/tools.d.ts +1 -1
- package/dist/orchestrator.d.ts +1 -0
- package/dist/orchestrator.js +96 -20
- package/dist/orchestrator.js.map +1 -1
- package/dist/x11/agent.d.ts +64 -2
- package/dist/x11/agent.js +176 -60
- package/dist/x11/agent.js.map +1 -1
- package/dist/x11/input.d.ts +4 -0
- package/dist/x11/input.js +58 -8
- package/dist/x11/input.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -1,251 +1,425 @@
 # browser-autopilot
 
-
+Autonomous browser automation for real Chrome, built for local or self-hosted execution.
 
-
+It uses:
+- raw Chrome DevTools Protocol for fast, structured browser control
+- local X11 mouse and keyboard events when the task needs literal screen-level interaction
+- the Vercel AI SDK for multi-step model-driven reasoning and tool use
 
-
+The browser always runs on your machine or your infrastructure. The package sends screenshots and state to your model provider, but it does not run the browser on OpenAI's machines.
 
-
+## What It Is
 
-
+`browser-autopilot` is a TypeScript package for browser agents that need more than a scraper and less than a full remote browser service.
 
-
+It is designed for:
+- authenticated browser workflows
+- long multi-step tasks
+- sites where DOM-first automation is useful most of the time
+- cases where you still want a local fallback for literal mouse and keyboard control
 
-
-
-
-
-
-
-
-
-
-
-
-
+It is not built on Playwright or Puppeteer. The package talks to Chrome over raw CDP and can fall back to X11-level input on Linux when needed.
+
+## Core Model
+
+There are two execution modes:
+
+- `CDP mode`
+  Fast, structured control through Chrome DevTools Protocol. The model sees a screenshot plus an indexed DOM-like view and can call browser tools like `click`, `input`, `navigate`, `extract`, `evaluate`, `upload_file`, `scroll`, and `done`.
+
+- `X11 mode`
+  Local OS-level control of the actual browser window through the X Window System. The model works from screenshots and issues actions like `CLICK`, `DOUBLE_CLICK`, `MOVE`, `DRAG`, `SCROLL`, `TYPE`, `KEYPRESS`, `WAIT`, and `DONE`.
+
+`orchestrate(...)` combines them:
+
+```text
+orchestrate({ credentials, loginUrl, successUrlContains, task })
 │
-
+├─ Cached session? ───────────────→ run task in CDP mode
+├─ CDP login works? ──────────────→ run task in CDP mode
+└─ CDP looks blocked or unusable? → relaunch and continue through X11
 ```
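The decision order in the diagram above can be sketched as a small pure function. This is only an illustration of the three branches, not the package's orchestrator: `LoginSignals`, `ModeDecision`, and `chooseMode` are hypothetical names, and the real `orchestrate(...)` presumably gathers these signals by probing the browser rather than receiving them as booleans.

```typescript
// Hypothetical sketch of the decision order shown in the README diagram.
// None of these names come from browser-autopilot itself.
type ExecutionMode = "cdp" | "x11";

interface LoginSignals {
  hasCachedSession: boolean;  // a saved session already passes the success check
  cdpLoginSucceeded: boolean; // the quick CDP login attempt worked
}

interface ModeDecision {
  mode: ExecutionMode;
  relaunch: boolean; // the diagram relaunches the browser before continuing through X11
  why: string;
}

function chooseMode(signals: LoginSignals): ModeDecision {
  if (signals.hasCachedSession) {
    return { mode: "cdp", relaunch: false, why: "cached session" };
  }
  if (signals.cdpLoginSucceeded) {
    return { mode: "cdp", relaunch: false, why: "CDP login worked" };
  }
  // Neither shortcut applied, so treat CDP as blocked or unusable.
  return { mode: "x11", relaunch: true, why: "CDP blocked or unusable" };
}
```

Keeping the choice in a pure function like this makes the branch order easy to test without launching Chrome.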
 
-
+The CDP agent can also explicitly request a handoff by calling `switch_to_x11`.
+
+## What X11 Means
+
+In this package, `X11` means local OS-level browser control using the machine's real windowing and input stack, not DOM or CDP APIs.
 
-
+That is implemented with Linux/X11 utilities such as:
+- `xdotool` for mouse and keyboard events
+- `xclip` for real clipboard paste
+- ImageMagick `import` for screenshots
+- `Xvfb` and `openbox` for headful containerized environments
 
-
+This is useful when the page is hostile to DOM automation, when the DOM is not trustworthy enough, or when the task truly needs screen-level interaction.
 
-##
+## AI SDK Integration
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+`browser-autopilot` already uses the Vercel AI SDK internally:
+
+- `generateText(...)` drives the step loop
+- `gateway(model)` resolves the model
+- browser tools and your custom tools use standard AI SDK `tool(...)` definitions
+
+That means you can extend the agent with normal AI SDK tools and pass them into `runAgent(...)` or `orchestrate(...)`.
+
+```ts
+import { tool } from "ai";
+import { z } from "zod";
+import { CDPBrowser, runAgent } from "browser-autopilot";
+
+const browser = new CDPBrowser();
+await browser.connect();
+
+const result = await runAgent({
+  browser,
+  task: "Log in and fetch the latest invoice PDF",
+  extraTools: {
+    get_2fa_code: tool({
+      description: "Return the latest 2FA code",
+      inputSchema: z.object({}),
+      execute: async () => "123456",
+    }),
+  },
+});
+```
+
+Today the package is wired to the AI SDK gateway path, so you configure the model by name and provide `AI_GATEWAY_API_KEY`.
+
+## Installation
+
+```bash
+npm install browser-autopilot ai zod
+```
+
+Environment:
+
+```bash
+export AI_GATEWAY_API_KEY=...
+export AGENT_MODEL=claude-sonnet-4-6
+```
+
+For local Chrome/CDP use, you also need a Chrome or Chromium binary available.
+
+For X11 mode on Linux, you need:
+- `xdotool`
+- `xclip`
+- ImageMagick `import`
+- an X11 display or an Xvfb-based container/VM
 
 ## Quick Start
 
-###
+### 1. Login plus task orchestration
 
-
+Use `orchestrate(...)` when you want the package to handle:
+- cached-session reuse
+- a quick CDP login attempt
+- fallback to X11 when needed
+- running the actual task afterward
+
+```ts
 import { orchestrate } from "browser-autopilot";
 
-// One call — handles login (auto CDP/X11) + task
 const { result, success, loginMethod } = await orchestrate({
   credentials: {
-    username: "
-    password: "
-    email:
-    totpKey: "
+    username: process.env.MY_USER ?? "",
+    password: process.env.MY_PASS ?? "",
+    email: process.env.MY_EMAIL ?? "",
+    totpKey: process.env.MY_TOTP_KEY ?? "",
   },
   loginUrl: "https://x.com/login",
   successUrlContains: "/home",
-  task: "
+  task: "Open settings and tell me which account email is configured.",
 });
 
-console.log(
+console.log({ success, loginMethod, result });
 ```
 
-### CDP
+### 2. CDP-only task execution
+
+Use `runAgent(...)` when:
+- you are already logged in
+- the task does not need auth
+- you want direct control over browser lifecycle
 
-```
+```ts
 import { CDPBrowser, runAgent } from "browser-autopilot";
 
 const browser = new CDPBrowser();
 await browser.connect();
 
-const { result } = await runAgent({
-  task: "Go to wikipedia.org and find the population of Tokyo",
+const { result, success } = await runAgent({
   browser,
+  task: "Go to wikipedia.org and find the population of Tokyo.",
 });
-```
-
-### As Docker (for servers/TEEs)
-
-```bash
-docker build -f docker/Dockerfile -t browser-autopilot .
 
-
-
-  -e TWITTER_USER=myuser \
-  -e TWITTER_PASS=mypass \
-  -e TWITTER_EMAIL=me@example.com \
-  -e TWITTER_TOTP_KEY=ABCDEF123456 \
-  -e PROXY_HOST=1.2.3.4 \
-  -e PROXY_PORT=45001 \
-  -e PROXY_USER=proxyuser \
-  -e PROXY_PASS=proxypass \
-  -v mydata:/data \
-  browser-autopilot
+console.log({ success, result });
+await browser.disconnect();
 ```
 
-###
+### 3. Direct X11 control
 
-
+Use `X11Agent` directly when you want a pure screenshot-and-actions loop on Linux/X11.
+
+```ts
 import { X11Agent } from "browser-autopilot";
 import * as chrome from "browser-autopilot/x11/chrome";
 
-chrome.launch("https://
+chrome.launch("https://example.com/login", "example-profile");
 
 const agent = new X11Agent();
-await agent.
-  systemPrompt: `
-
-
-
+const result = await agent.runDetailed({
+  systemPrompt: `
+    You are controlling a Chrome browser with local X11 mouse and keyboard actions.
+    Log in, then say ACTION: DONE Logged in successfully.
+  `,
+  successCheck: () => chrome.pageUrlContains("/dashboard"),
 });
+
+console.log(result);
 ```
 
-
+## Supported Systems
 
-
-
-
-
+| Environment | CDP mode | X11 mode | Notes |
+|---|---|---|---|
+| Linux desktop with X11 | Supported | Supported | Best native environment for the full stack |
+| Linux VM/container with Xvfb | Supported | Supported | Recommended for cloud/self-hosted deployments |
+| macOS | Supported | Not supported natively | Use CDP mode locally; use Docker/Linux for X11 fallback |
+| Windows | Partial | Not supported natively | Chrome path detection exists, but the X11 stack does not |
+| Serverless functions | Poor fit | Poor fit | Headful Chrome + long-lived sessions are usually the wrong shape |
 
-
-await browser.connect();
+### Linux
 
-
-  task: "Login and download my invoice",
-  browser,
-  extraTools: {
-    get_2fa: tool({
-      description: "Get 2FA code from authenticator",
-      inputSchema: z.object({}),
-      execute: async () => "123456",
-    }),
-  },
-});
-```
+Linux is the primary target for the full feature set.
 
-
+Use Linux if you want:
+- `orchestrate(...)` with reliable X11 fallback
+- `switch_to_x11`
+- Docker/Xvfb deployment
+- noVNC live viewing
 
-
+If your desktop runs Wayland, the X11 path may require XWayland or an X11 session. The X11 tools in this repo are not written against native Wayland APIs.
 
-
-2. **Index DOM** — interactive elements get sequential numbers: `[1] button "Submit"`, `[2] textbox "Email"`. The LLM references elements by these numbers.
-3. **Build context** — format state as text, append action history (truncated), add loop detection nudges if needed
-4. **Send to LLM** — screenshot as vision input + state text + tool definitions (all prompt-cached)
-5. **Parse response** — extract reasoning (evaluation, memory, next goal) + tool calls
-6. **Execute tools** — up to `maxActionsPerStep` tool calls per step (click, type, navigate, etc.)
-7. **Record history** — step added to history with all actions + results + errors
-8. **Check termination** — done tool called? max steps? max failures?
+### macOS
 
-
+macOS is a good fit for CDP-only use:
+- local development
+- authenticated flows that succeed without X11 fallback
+- browser tasks driven through `runAgent(...)`
+
+Native macOS is not a full X11 target for this package. `src/x11/input.ts` depends on Linux/X11 tools like `xdotool`, `xclip`, and ImageMagick window capture.
+
+Practical guidance on macOS:
+- Use `runAgent(...)` locally for CDP-first tasks.
+- Use `orchestrate(...)` only if you are comfortable with the fact that X11 fallback is not available natively.
+- If you need the full stack, run the Docker image or a Linux VM.
+
+### Windows
+
+The Chrome launch helper knows common Windows Chrome paths, but the broader package is not a native Windows-first stack.
+
+Practical guidance on Windows:
+- treat local CDP usage as experimental
+- do not expect native X11 fallback
+- use a Linux VM or container for production use
+
+## Local Execution Model
+
+This package is local or self-hosted by design.
+
+What runs locally:
+- Chrome
+- CDP client
+- X11 input execution
+- screenshots
+- file uploads and downloads
+- browser profiles and cookies
+
+What goes to the model provider:
+- task text
+- browser state text
+- screenshots
+- tool-call context and results
+
+So if you use OpenAI, Anthropic, or another provider through AI SDK Gateway, the reasoning is remote but the browser execution stays on your machine or your own servers.
+
+## Cloud Deployment
 
+The best cloud shape is a long-lived Linux container or VM with a headful browser.
+
+Good fits:
+- Docker on a VM
+- Kubernetes workloads with persistent storage
+- self-hosted Linux boxes
+- isolated agent workers or TEEs that can run a browser session for minutes at a time
+
+Poor fits:
+- stateless serverless functions
+- environments where headful Chrome cannot start
+- platforms without persistent disk for browser profiles
+
+### Docker
+
+The repo includes a Docker image that sets up:
+- Google Chrome
+- Xvfb
+- openbox
+- `xdotool`
+- `xclip`
+- ImageMagick
+- optional noVNC viewer
+
+Build:
+
+```bash
+docker build -f docker/Dockerfile -t browser-autopilot .
+```
+
+Run:
+
+```bash
+docker run --rm \
+  -e AI_GATEWAY_API_KEY=$AI_GATEWAY_API_KEY \
+  -e LOGIN_URL=https://x.com/login \
+  -e SUCCESS_URL=/home \
+  -e TWITTER_USER=myuser \
+  -e TWITTER_PASS=mypass \
+  -e TWITTER_EMAIL=me@example.com \
+  -e TWITTER_TOTP_KEY=ABCDEF123456 \
+  -e AGENT_TASK="Open settings and summarize what you find." \
+  -v browser-autopilot-data:/data \
+  browser-autopilot
 ```
-browser-autopilot/
-├── src/
-│ ├── config.ts # All env vars
-│ ├── index.ts # CLI entrypoint + library exports
-│ ├── orchestrator.ts # Auto mode selection (cached → CDP → X11)
-│ ├── x11/
-│ │ ├── agent.ts # Generic X11Agent (works for any site)
-│ │ ├── chrome.ts # Chrome launch, focus, status check
-│ │ ├── input.ts # xdotool: type, key, click, screenshot
-│ │ └── login.ts # Twitter login built on X11Agent
-│ ├── browser/
-│ │ ├── cdp.ts # Raw CDP client (nav, click, type, tabs, cookies, dialogs)
-│ │ ├── dom.ts # DOM indexer — [1] [2] [3] element refs
-│ │ └── snapshot.ts # AX tree serialization utilities
-│ ├── agent/
-│ │ ├── loop.ts # Step-based agent loop (the core)
-│ │ ├── tools.ts # 25+ browser tools for the agent
-│ │ ├── state.ts # Per-step browser state capture
-│ │ ├── history.ts # History tracking, loop detection, truncation
-│ │ └── run.ts # CLI entrypoint (Twitter dev portal example)
-│ └── captcha/
-│ └── solver.ts # Capsolver + 2Captcha unified solver
-├── docker/
-│ ├── Dockerfile # Production image (amd64, Google Chrome, gost, xdotool)
-│ └── entrypoint.sh # dbus → Xvfb → openbox → gost → login → agent
-├── tests/
-│ ├── dom.test.ts # DOM indexer tests
-│ ├── history.test.ts # History + loop detection tests
-│ ├── state.test.ts # Browser state formatting tests
-│ └── config.test.ts # Config tests
-├── package.json
-└── tsconfig.json
 
+Notes:
+- The Docker image is `linux/amd64` today.
+- Persist `/data` if you want browser sessions and outputs to survive across runs.
+- Set `ENABLE_VIEWER=1` if you want the noVNC viewer for debugging.
+
+### Cloud Architecture Guidance
+
+For production-ish deployments:
+- prefer one browser session per worker
+- persist the browser profile directory
+- keep timezone, locale, and proxy geography aligned
+- use a real Linux/X11 stack if you depend on X11 fallback
+- treat the browser as stateful infrastructure, not a short-lived lambda
+
+## Browser Tools
+
+The CDP agent exposes a broad tool surface, including:
+
+- `navigate`
+- `click`
+- `click_at`
+- `input`
+- `type_text`
+- `send_keys`
+- `scroll`
+- `find_text`
+- `switch_tab`
+- `new_tab`
+- `close_tab`
+- `upload_file`
+- `click_and_upload`
+- `paste_content`
+- `paste_image`
+- `extract`
+- `evaluate`
+- `handle_dialog`
+- `wait`
+- `save_page_snapshot`
+- `save_element_html`
+- `shell`
+- `solve_captcha`
+- `inject_captcha_token`
+- `solve_datadome`
+- `done`
+
+The X11 agent supports local screen-level actions such as:
+
+- `CLICK x y`
+- `DOUBLE_CLICK x y`
+- `MOVE x y`
+- `DRAG x1 y1 x2 y2`
+- `SCROLL up|down amount`
+- `TYPE text`
+- `KEYPRESS key`
+- `WAIT seconds`
+- `SCREENSHOT`
+- `DONE result`
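The X11 action list above reads like a small line-based grammar, so a parser for it can be sketched as follows. This is an assumption-heavy illustration: the diff does not show how the package actually parses model output, `X11Action`, `parseNumbers`, and `parseX11Action` are hypothetical names, and the optional `ACTION:` prefix is inferred only from the Quick Start prompt text.

```typescript
// Hypothetical parser for the action lines listed in the README.
// The wire format (verb, then space-separated arguments) is an assumption.
type X11Action =
  | { kind: "CLICK" | "DOUBLE_CLICK" | "MOVE"; x: number; y: number }
  | { kind: "DRAG"; x1: number; y1: number; x2: number; y2: number }
  | { kind: "SCROLL"; direction: "up" | "down"; amount: number }
  | { kind: "TYPE"; text: string }
  | { kind: "KEYPRESS"; key: string }
  | { kind: "WAIT"; seconds: number }
  | { kind: "SCREENSHOT" }
  | { kind: "DONE"; result: string };

// Parse exactly `count` finite numbers from a whitespace-separated string.
function parseNumbers(rest: string, count: number): number[] | null {
  const parts = rest === "" ? [] : rest.split(/\s+/).map(Number);
  return parts.length === count && parts.every((n) => Number.isFinite(n)) ? parts : null;
}

function parseX11Action(line: string): X11Action | null {
  // Assumption: the model may prefix its action with "ACTION:".
  const text = line.trim().replace(/^ACTION:\s*/, "");
  const spaceAt = text.indexOf(" ");
  const verb = spaceAt === -1 ? text : text.slice(0, spaceAt);
  const rest = spaceAt === -1 ? "" : text.slice(spaceAt + 1).trim();

  switch (verb) {
    case "CLICK":
    case "DOUBLE_CLICK":
    case "MOVE": {
      const p = parseNumbers(rest, 2);
      return p ? { kind: verb, x: p[0], y: p[1] } : null;
    }
    case "DRAG": {
      const p = parseNumbers(rest, 4);
      return p ? { kind: "DRAG", x1: p[0], y1: p[1], x2: p[2], y2: p[3] } : null;
    }
    case "SCROLL": {
      const [direction, amount] = rest.split(/\s+/);
      const n = parseNumbers(amount ?? "", 1);
      if ((direction !== "up" && direction !== "down") || !n) return null;
      return { kind: "SCROLL", direction, amount: n[0] };
    }
    case "TYPE":
      return rest ? { kind: "TYPE", text: rest } : null;
    case "KEYPRESS":
      return rest ? { kind: "KEYPRESS", key: rest } : null;
    case "WAIT": {
      const n = parseNumbers(rest, 1);
      return n ? { kind: "WAIT", seconds: n[0] } : null;
    }
    case "SCREENSHOT":
      return { kind: "SCREENSHOT" };
    case "DONE":
      return { kind: "DONE", result: rest };
    default:
      return null; // unknown verb: let the caller decide how to recover
  }
}
```

Returning `null` for malformed lines, rather than throwing, lets a caller re-prompt the model instead of aborting the run.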
+
+## Sensitive Data and Custom Tools
+
+The package supports `sensitiveData` masking so secrets already present in prompts can be redacted in the model-facing task text. For high-value credentials or payment details, prefer explicit AI SDK tools over dumping everything directly into the task.
+
+That pattern looks like:
+- pass non-sensitive workflow context in `task`
+- expose just-in-time secrets through `extraTools`
+- let the agent request them only when needed
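The masking idea described above can be sketched as a plain string transform. This is a minimal illustration, assuming `sensitiveData` maps a label to the secret value (its declared type in `loop.d.ts` is `Record<string, string>`); the `<label>` placeholder format and the `maskSensitive` name are hypothetical, and the package's actual redaction logic is not shown in this diff.

```typescript
// Hypothetical sketch: replace each secret value in model-facing text with a
// placeholder derived from its label. Not the package's implementation.
function maskSensitive(text: string, sensitiveData: Record<string, string>): string {
  const entries = Object.entries(sensitiveData)
    // Skip empty values: splitting on "" would break the text into characters.
    .filter(([, value]) => value.length > 0)
    // Mask longer secrets first so a short secret that is a substring of a
    // longer one cannot leave part of the longer one visible.
    .sort((a, b) => b[1].length - a[1].length);

  let masked = text;
  for (const [label, value] of entries) {
    // split/join replaces every occurrence without regex-escaping the secret.
    masked = masked.split(value).join(`<${label}>`);
  }
  return masked;
}
```

For example, `maskSensitive("Log in as alice with hunter2", { username: "alice", password: "hunter2" })` yields `"Log in as <username> with <password>"`.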
+
+## Project Structure
+
+```text
+src/
+  agent/           Step-based CDP agent loop and tool definitions
+  browser/         Raw CDP client and DOM indexing
+  captcha/         CAPTCHA solving helpers
+  viewer/          Optional live viewer for Xvfb environments
+  x11/             X11 agent, local input primitives, Chrome launch helpers
+  orchestrator.ts
+  config.ts
+  index.ts
+docker/
+  Dockerfile
+  entrypoint.sh
+tests/
+docs/
 ```
 
-##
+## Important Constraints
 
-
-
-
-
-
-| **Timezone must match proxy geo** | `Intl.DateTimeFormat` timezone vs IP geolocation mismatch = flagged |
-| **SOCKS5 auth needs gost** | Chromium doesn't support SOCKS5 username/password natively |
-| **TOTP must be on-demand** | Codes expire in 30s — never bake into prompts, always generate fresh |
-| **Clean profile locks** | Stale `SingletonLock` files from previous runs prevent Chrome startup |
+- Use headful Chrome, not headless Chrome.
+- X11 fallback is a Linux/X11 feature, not a cross-platform abstraction.
+- If you depend on stealthy login fallback, deploy on Linux.
+- If you only need structured browser automation, CDP mode is the simpler path.
+- The package currently chooses models through AI SDK Gateway, so configure `AI_GATEWAY_API_KEY` and `AGENT_MODEL`.
 
 ## Environment Variables
 
 | Variable | Required | Description |
 |---|---|---|
-| `
-| `
-| `
-| `
-| `
+| `AI_GATEWAY_API_KEY` | Yes | API key for AI SDK Gateway |
+| `AGENT_MODEL` | No | Model name, defaults to `claude-sonnet-4-6` |
+| `LOGIN_URL` | CLI only | Login URL for the top-level entrypoint |
+| `SUCCESS_URL` | CLI only | Post-login URL substring for the top-level entrypoint |
+| `TWITTER_USER` | Optional | Username used by the default CLI entrypoint |
+| `TWITTER_PASS` | Optional | Password used by the default CLI entrypoint |
+| `TWITTER_EMAIL` | Optional | Email used by the default CLI entrypoint |
+| `TWITTER_TOTP_KEY` | Optional | TOTP seed used by the default CLI entrypoint |
 | `PROXY_HOST` | No | SOCKS5 proxy host |
 | `PROXY_PORT` | No | SOCKS5 proxy port |
-| `PROXY_USER` | No |
-| `PROXY_PASS` | No |
+| `PROXY_USER` | No | Proxy username |
+| `PROXY_PASS` | No | Proxy password |
 | `CAPSOLVER_KEY` | No | Capsolver API key |
 | `TWOCAPTCHA_KEY` | No | 2Captcha API key |
-| `CDP_PORT` | No | Chrome debugging port
+| `CDP_PORT` | No | Chrome remote debugging port, defaults to `9222` |
 | `PROFILE_DIR` | No | Browser profile directory |
-| `DATA_DIR` | No | Data
-| `AGENT_TASK` | No |
-| `MAX_STEPS` | No | Max agent steps
-
-
-
-
-
-
-
-
-
-
-
-
-| openbox | Minimal window manager |
-| gost | SOCKS5 proxy chaining |
-| zod | Tool parameter validation |
-| otplib | TOTP generation |
-| vitest | Testing |
+| `DATA_DIR` | No | Data/output directory |
+| `AGENT_TASK` | No | Task used by the top-level CLI entrypoint |
+| `MAX_STEPS` | No | Max agent steps, defaults to `80` |
+| `ENABLE_VIEWER` | No | Enable the noVNC viewer in Docker |
+| `CHROME_PATH` | No | Override Chrome binary path |
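The defaults stated in the table above (`claude-sonnet-4-6`, `9222`, `80`) can be captured in a small loader. This is a sketch under assumptions, not the package's `config.ts`: `AutopilotEnvConfig`, `parsePositiveInt`, and `readEnvConfig` are hypothetical names, and only four of the listed variables are covered.

```typescript
// Hypothetical loader for a subset of the documented environment variables.
interface AutopilotEnvConfig {
  gatewayApiKey: string;
  model: string;
  cdpPort: number;
  maxSteps: number;
}

// Return `fallback` when the variable is unset or blank; reject non-integers.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`expected a positive integer, got "${raw}"`);
  }
  return n;
}

function readEnvConfig(env: Record<string, string | undefined>): AutopilotEnvConfig {
  const gatewayApiKey = env["AI_GATEWAY_API_KEY"];
  if (!gatewayApiKey) {
    // The table marks this variable as required.
    throw new Error("AI_GATEWAY_API_KEY is required");
  }
  return {
    gatewayApiKey,
    model: env["AGENT_MODEL"] ?? "claude-sonnet-4-6", // documented default
    cdpPort: parsePositiveInt(env["CDP_PORT"], 9222), // documented default
    maxSteps: parsePositiveInt(env["MAX_STEPS"], 80), // documented default
  };
}
```

In a Node.js process the function would typically be called with `process.env`; taking the environment as a parameter keeps it testable.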
+
+## Development
+
+```bash
+npm install
+npm test
+npm run build
+```
+
+Architecture notes live in [docs/architecture.md](docs/architecture.md).
package/dist/agent/loop.d.ts
CHANGED
@@ -18,8 +18,13 @@ export interface AgentOptions {
     onStep?: (step: StepRecord) => void;
     sensitiveData?: Record<string, string>;
 }
+export interface AgentModeSwitchRequest {
+    mode: "x11";
+    reason: string;
+}
 export declare function runAgent(opts: AgentOptions): Promise<{
     result: string | null;
     success: boolean;
     history: AgentHistory;
+    requestedModeSwitch: AgentModeSwitchRequest | null;
 }>;
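The new `requestedModeSwitch` field in the declaration above suggests that a caller of `runAgent(...)` can now tell a finished run apart from a requested X11 handoff. A minimal sketch of that branching follows. The two interfaces are local mirrors of the declared shapes (trimmed to the fields used here) so the block stands alone, and `NextStep` and `planNextStep` are hypothetical names, not part of the package.

```typescript
// Local mirrors of the shapes declared in loop.d.ts, trimmed for this sketch.
interface AgentModeSwitchRequest {
  mode: "x11";
  reason: string;
}

interface AgentRunResult {
  result: string | null;
  success: boolean;
  requestedModeSwitch: AgentModeSwitchRequest | null;
}

// Hypothetical caller-side view of what to do after a CDP run returns.
type NextStep =
  | { kind: "finished"; result: string | null }
  | { kind: "handoff"; mode: "x11"; reason: string }
  | { kind: "failed" };

function planNextStep(run: AgentRunResult): NextStep {
  // A handoff request takes priority over the success flag.
  if (run.requestedModeSwitch !== null) {
    const { mode, reason } = run.requestedModeSwitch;
    return { kind: "handoff", mode, reason };
  }
  return run.success ? { kind: "finished", result: run.result } : { kind: "failed" };
}
```

Checking the handoff request before `success` is a design choice of this sketch; whether the package sets `success` when a switch is requested is not visible in the diff.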