windows-use 0.1.0 → 0.2.1
- package/README.md +188 -70
- package/dist/cli.js +689 -129
- package/dist/cli.js.map +1 -1
- package/dist/index.d.ts +88 -16
- package/dist/index.js +467 -88
- package/dist/index.js.map +1 -1
- package/dist/mcp/server.js +504 -124
- package/dist/mcp/server.js.map +1 -1
- package/package.json +10 -1
package/README.md
CHANGED
|
@@ -1,8 +1,8 @@
|
|
|
1
1
|
# windows-use
|
|
2
2
|
|
|
3
|
-
Let big LLMs delegate Windows
|
|
3
|
+
Let big LLMs delegate Windows & browser automation to small LLMs via sessions.
|
|
4
4
|
|
|
5
|
-
When AI agents use tools to operate Windows or browsers, screenshots and other multimodal returns consume massive context. This project solves that
|
|
5
|
+
When AI agents use tools to operate Windows or browsers, screenshots and other multimodal returns consume massive context. This project solves that with a "big model directs small model" architecture — the small model (Qwen, GPT-4o-mini, etc.) does the actual work autonomously and reports back concise summaries with optional embedded screenshots.
|
|
6
6
|
|
|
7
7
|
```
|
|
8
8
|
Big Model windows-use Small Model
|
|
@@ -10,8 +10,10 @@ Big Model windows-use Small Model
|
|
|
10
10
|
├─ create_session ────────► │ │
|
|
11
11
|
│ │ │
|
|
12
12
|
├─ send_instruction ──────► │ ── tools + instruction ──► │
|
|
13
|
-
│ │
|
|
14
|
-
│
|
|
13
|
+
│ │ screenshot → analyze │
|
|
14
|
+
│ │ → click → verify → ... │
|
|
15
|
+
│ │ ◄── report ──────────────── │
|
|
16
|
+
│ ◄── text + images ────── │ │
|
|
15
17
|
│ │ │
|
|
16
18
|
├─ done_session ──────────► │ cleanup │
|
|
17
19
|
```
|
|
@@ -22,28 +24,50 @@ Big Model windows-use Small Model
|
|
|
22
24
|
npm install -g windows-use
|
|
23
25
|
```
|
|
24
26
|
|
|
25
|
-
##
|
|
26
|
-
|
|
27
|
-
Set these environment variables (or pass as CLI flags / MCP tool params):
|
|
27
|
+
## Quick Start
|
|
28
28
|
|
|
29
29
|
```bash
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
export WINDOWS_USE_MODEL=qwen3.5-flash
|
|
30
|
+
# Interactive setup — saves config to ~/.windows-use.json
|
|
31
|
+
windows-use init
|
|
33
32
|
```
|
|
34
33
|
|
|
35
|
-
|
|
34
|
+
You'll be prompted for:
|
|
35
|
+
- **Base URL** — any OpenAI-compatible endpoint (Qwen, DeepSeek, Ollama, vLLM, etc.)
|
|
36
|
+
- **API Key**
|
|
37
|
+
- **Model name** — e.g. `qwen3.5-flash`, `gpt-4o-mini`
|
|
36
38
|
|
|
37
39
|
## Usage
|
|
38
40
|
|
|
39
41
|
### CLI Mode
|
|
40
42
|
|
|
41
43
|
```bash
|
|
42
|
-
#
|
|
44
|
+
# Single task
|
|
43
45
|
windows-use "Open Notepad and type Hello World"
|
|
44
46
|
|
|
45
|
-
#
|
|
46
|
-
windows-use
|
|
47
|
+
# Interactive REPL session
|
|
48
|
+
windows-use
|
|
49
|
+
> Open Chrome and go to github.com
|
|
50
|
+
> Find the trending repositories
|
|
51
|
+
> exit
|
|
52
|
+
|
|
53
|
+
# With explicit config flags
|
|
54
|
+
windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o-mini "Take a screenshot"
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
CLI shows real-time step-by-step progress:
|
|
58
|
+
|
|
59
|
+
```
|
|
60
|
+
> Take a screenshot of the desktop
|
|
61
|
+
[windows-use] Running...
|
|
62
|
+
|
|
63
|
+
[step 1] 🔧 screenshot
|
|
64
|
+
[step 1] 📎 Screenshot captured (img_1)
|
|
65
|
+
[step 1] 💭 I can see the Windows desktop with...
|
|
66
|
+
[step 2] 🔧 report { status: 'completed', content: '...' }
|
|
67
|
+
|
|
68
|
+
✅ [completed]
|
|
69
|
+
Here is the current desktop:
|
|
70
|
+
📸 desktop: http://127.0.0.1:54321/s1.jpg
|
|
47
71
|
```
|
|
48
72
|
|
|
49
73
|
### MCP Server Mode
|
|
@@ -52,7 +76,19 @@ windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4
|
|
|
52
76
|
windows-use --mcp
|
|
53
77
|
```
|
|
54
78
|
|
|
55
|
-
|
|
79
|
+
Exposes 3 tools over MCP (stdio transport):
|
|
80
|
+
|
|
81
|
+
| Tool | Description |
|
|
82
|
+
|------|-------------|
|
|
83
|
+
| `create_session` | Create a new agent session. Returns `session_id`. |
|
|
84
|
+
| `send_instruction` | Send a task to the agent. Returns rich report with text + images. |
|
|
85
|
+
| `done_session` | Terminate a session and free resources. |
|
|
86
|
+
|
|
87
|
+
## MCP Client Configuration
|
|
88
|
+
|
|
89
|
+
### Claude Desktop
|
|
90
|
+
|
|
91
|
+
Edit `claude_desktop_config.json`:
|
|
56
92
|
|
|
57
93
|
```json
|
|
58
94
|
{
|
|
@@ -61,8 +97,8 @@ Add to Claude Desktop (`claude_desktop_config.json`):
|
|
|
61
97
|
"command": "npx",
|
|
62
98
|
"args": ["-y", "windows-use", "--mcp"],
|
|
63
99
|
"env": {
|
|
64
|
-
"WINDOWS_USE_API_KEY": "
|
|
65
|
-
"WINDOWS_USE_BASE_URL": "https://
|
|
100
|
+
"WINDOWS_USE_API_KEY": "sk-xxx",
|
|
101
|
+
"WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
|
|
66
102
|
"WINDOWS_USE_MODEL": "qwen3.5-flash"
|
|
67
103
|
}
|
|
68
104
|
}
|
|
@@ -70,15 +106,116 @@ Add to Claude Desktop (`claude_desktop_config.json`):
|
|
|
70
106
|
}
|
|
71
107
|
```
|
|
72
108
|
|
|
73
|
-
|
|
109
|
+
### VS Code (Claude Code / Copilot Chat)
|
|
110
|
+
|
|
111
|
+
Add to `.vscode/settings.json` or global settings:
|
|
112
|
+
|
|
113
|
+
```json
|
|
114
|
+
{
|
|
115
|
+
"mcp": {
|
|
116
|
+
"servers": {
|
|
117
|
+
"windows-use": {
|
|
118
|
+
"command": "npx",
|
|
119
|
+
"args": ["-y", "windows-use", "--mcp"],
|
|
120
|
+
"env": {
|
|
121
|
+
"WINDOWS_USE_API_KEY": "sk-xxx",
|
|
122
|
+
"WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
|
|
123
|
+
"WINDOWS_USE_MODEL": "qwen3.5-flash"
|
|
124
|
+
}
|
|
125
|
+
}
|
|
126
|
+
}
|
|
127
|
+
}
|
|
128
|
+
}
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
> If you've run `windows-use init`, the config is saved in `~/.windows-use.json` and you can omit the `env` block entirely.
|
|
132
|
+
|
|
133
|
+
## Configuration
|
|
134
|
+
|
|
135
|
+
Config priority: **CLI flags > environment variables > `~/.windows-use.json` > defaults**
|
|
136
|
+
|
|
137
|
+
| Option | CLI Flag | Env Var | Default |
|
|
138
|
+
|--------|----------|---------|---------|
|
|
139
|
+
| API Key | `--api-key` | `WINDOWS_USE_API_KEY` | — (required) |
|
|
140
|
+
| Base URL | `--base-url` | `WINDOWS_USE_BASE_URL` | — (required) |
|
|
141
|
+
| Model | `--model` | `WINDOWS_USE_MODEL` | — (required) |
|
|
142
|
+
| CDP URL | `--cdp-url` | `WINDOWS_USE_CDP_URL` | `http://localhost:9222` |
|
|
143
|
+
| Max Steps | `--max-steps` | `WINDOWS_USE_MAX_STEPS` | `50` |
|
|
144
|
+
| Max Rounds | `--max-rounds` | `WINDOWS_USE_MAX_ROUNDS` | `20` |
|
|
145
|
+
| Timeout | — | `WINDOWS_USE_TIMEOUT_MS` | `300000` (5 min) |
|
|
146
|
+
|
|
147
|
+
- **Max Steps** — tool-calling rounds per instruction (how many actions the small model can take for one task)
|
|
148
|
+
- **Max Rounds** — instruction rounds per session (how many `send_instruction` calls before the session expires)
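
The priority order stated above (CLI flags, then environment variables, then `~/.windows-use.json`, then defaults) can be illustrated with a small sketch. This is not the package's `loadConfig` implementation; the `Settings` type, its field names, and the `pick` and `resolveSettings` helpers are hypothetical names introduced only for illustration. The default values are the ones listed in the table.

```typescript
// Hypothetical settings shape; field names are illustrative, not the package's schema.
interface Settings {
  apiKey?: string;
  baseUrl?: string;
  model?: string;
  cdpUrl?: string;
  maxSteps?: number;
  maxRounds?: number;
}

// Defaults taken from the configuration table in the README.
const DEFAULTS: Settings = {
  cdpUrl: "http://localhost:9222",
  maxSteps: 50,
  maxRounds: 20,
};

// Return the first candidate that is actually set, in priority order.
function pick<T>(...candidates: (T | undefined)[]): T | undefined {
  for (const candidate of candidates) {
    if (candidate !== undefined) return candidate;
  }
  return undefined;
}

// Merge the sources field by field: CLI > env > file > defaults.
function resolveSettings(cli: Settings, env: Settings, file: Settings): Settings {
  return {
    apiKey: pick(cli.apiKey, env.apiKey, file.apiKey),
    baseUrl: pick(cli.baseUrl, env.baseUrl, file.baseUrl),
    model: pick(cli.model, env.model, file.model),
    cdpUrl: pick(cli.cdpUrl, env.cdpUrl, file.cdpUrl, DEFAULTS.cdpUrl),
    maxSteps: pick(cli.maxSteps, env.maxSteps, file.maxSteps, DEFAULTS.maxSteps),
    maxRounds: pick(cli.maxRounds, env.maxRounds, file.maxRounds, DEFAULTS.maxRounds),
  };
}
```

Resolving each field separately means a source that leaves a value unset never masks a lower-priority source that sets it. Required options (API key, base URL, model) stay `undefined` here when no source provides them; a real loader would presumably reject that case.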
|
|
149
|
+
|
|
150
|
+
## Browser Setup
|
|
151
|
+
|
|
152
|
+
Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging:
|
|
153
|
+
|
|
154
|
+
```bash
|
|
155
|
+
# Windows
|
|
156
|
+
chrome.exe --remote-debugging-port=9222
|
|
157
|
+
|
|
158
|
+
# macOS
|
|
159
|
+
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
Uses your existing cookies, login state, and extensions — no webdriver detection flags.
|
|
163
|
+
|
|
164
|
+
## Small Model Tools
|
|
165
|
+
|
|
166
|
+
The small model agent has access to 16 tools:
|
|
167
|
+
|
|
168
|
+
### Screen & Input
|
|
74
169
|
|
|
75
170
|
| Tool | Description |
|
|
76
171
|
|------|-------------|
|
|
77
|
-
| `
|
|
78
|
-
| `
|
|
79
|
-
| `
|
|
172
|
+
| `screenshot` | Full-screen capture with **coordinate grid overlay** (auto-scaled to logical resolution, grid coordinates match mouse_click) |
|
|
173
|
+
| `mouse_click(x, y, button)` | Click at screen coordinates |
|
|
174
|
+
| `mouse_move(x, y)` | Move mouse without clicking |
|
|
175
|
+
| `mouse_scroll(direction, amount)` | Scroll up/down |
|
|
176
|
+
| `keyboard_type(text)` | Type text character by character |
|
|
177
|
+
| `keyboard_press(keys)` | Key combos like `["Ctrl", "C"]`, `["Alt", "F4"]` |
|
|
178
|
+
| `run_command(command)` | Execute shell command (PowerShell on Windows) |
|
|
179
|
+
|
|
180
|
+
### Browser
|
|
181
|
+
|
|
182
|
+
| Tool | Description |
|
|
183
|
+
|------|-------------|
|
|
184
|
+
| `browser_navigate(url)` | Open a URL |
|
|
185
|
+
| `browser_click(selector)` | Click element by CSS selector or `text=...` |
|
|
186
|
+
| `browser_type(selector, text)` | Type into input field |
|
|
187
|
+
| `browser_screenshot(fullPage?)` | Page screenshot (JPEG quality 70) |
|
|
188
|
+
| `browser_content` | Get visible text content of page |
|
|
189
|
+
| `browser_scroll(direction, amount)` | Scroll page |
|
|
190
|
+
|
|
191
|
+
### File & Image
|
|
192
|
+
|
|
193
|
+
| Tool | Description |
|
|
194
|
+
|------|-------------|
|
|
195
|
+
| `file_read(path)` | Read file contents |
|
|
196
|
+
| `file_write(path, content)` | Write file |
|
|
197
|
+
| `use_local_image(path)` | Load a local image and get a screenshot ID for embedding in reports |
|
|
198
|
+
|
|
199
|
+
### Control
|
|
200
|
+
|
|
201
|
+
| Tool | Description |
|
|
202
|
+
|------|-------------|
|
|
203
|
+
| `report(status, content)` | Submit a rich report and stop execution |
|
|
80
204
|
|
|
81
|
-
###
|
|
205
|
+
### Rich Reports
|
|
206
|
+
|
|
207
|
+
Each screenshot tool returns an ID (e.g. `img_1`, `img_2`). The `report` tool supports a rich document format — mix text with embedded screenshots using `[Image:img_X]` markers:
|
|
208
|
+
|
|
209
|
+
```
|
|
210
|
+
report({
|
|
211
|
+
status: "completed",
|
|
212
|
+
content: "Here's what I found:\n[Image:img_2]\nThe page shows search results.\n[Image:img_3]\nI also checked the sidebar."
|
|
213
|
+
})
|
|
214
|
+
```
|
|
215
|
+
|
|
216
|
+
When delivered to the user (CLI) or big model (MCP), these markers expand into actual multimodal image content.
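
One way to picture the marker expansion described above is a splitter that turns the `content` string into an ordered list of text and image parts. This is a sketch under assumptions, not the package's code: the `ReportPart` type and `splitReport` function are hypothetical, and the marker pattern is inferred from the `[Image:img_X]` examples in the README.

```typescript
// Hypothetical part type: either a run of text or a reference to a screenshot ID.
type ReportPart =
  | { type: "text"; text: string }
  | { type: "image"; id: string };

// Split report content on [Image:img_N] markers, preserving the original order.
function splitReport(content: string): ReportPart[] {
  const parts: ReportPart[] = [];
  const marker = /\[Image:(img_\d+)\]/g;
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(content)) !== null) {
    // Text between the previous marker and this one, if any.
    const text = content.slice(last, match.index).trim();
    if (text) parts.push({ type: "text", text });
    parts.push({ type: "image", id: match[1] });
    last = match.index + match[0].length;
  }
  // Trailing text after the final marker.
  const tail = content.slice(last).trim();
  if (tail) parts.push({ type: "text", text: tail });
  return parts;
}
```

A consumer could then map each `image` part to the stored screenshot with that ID, and pass `text` parts through unchanged.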
|
|
217
|
+
|
|
218
|
+
## Programmatic API
|
|
82
219
|
|
|
83
220
|
```typescript
|
|
84
221
|
import { loadConfig, SessionRegistry } from 'windows-use';
|
|
@@ -92,60 +229,41 @@ const config = loadConfig({
|
|
|
92
229
|
const registry = new SessionRegistry();
|
|
93
230
|
const session = registry.create(config);
|
|
94
231
|
|
|
232
|
+
// Real-time step events
|
|
233
|
+
session.runner.setOnStep((event) => {
|
|
234
|
+
if (event.type === 'tool_call') console.log(`Step ${event.step}: ${event.name}`);
|
|
235
|
+
if (event.type === 'thinking') console.log(`Step ${event.step}: ${event.content}`);
|
|
236
|
+
});
|
|
237
|
+
|
|
95
238
|
const result = await session.runner.run('Open calculator and compute 2+2');
|
|
96
|
-
console.log(result.
|
|
239
|
+
console.log(result.status); // 'completed' | 'blocked' | 'need_guidance'
|
|
240
|
+
console.log(result.content); // Rich text with [Image:img_X] markers
|
|
97
241
|
|
|
98
|
-
await registry.
|
|
242
|
+
await registry.destroyAll();
|
|
99
243
|
```
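
The step callback shown in the example above could also reproduce the CLI's `[step N]` progress lines. The sketch below assumes only the two event types and fields that appear in that example (`tool_call` with `name`, `thinking` with `content`); the `StepEvent` type and `formatStep` helper are hypothetical, and the real event union may contain more variants.

```typescript
// Hypothetical event union, limited to the two variants used in the README example.
type StepEvent =
  | { type: "tool_call"; step: number; name: string }
  | { type: "thinking"; step: number; content: string };

// Format an event in the style of the CLI progress output shown in the README.
function formatStep(event: StepEvent): string {
  switch (event.type) {
    case "tool_call":
      return `[step ${event.step}] 🔧 ${event.name}`;
    case "thinking":
      return `[step ${event.step}] 💭 ${event.content}`;
  }
}
```

Under these assumptions, a handler passed to `setOnStep` could simply log `formatStep(event)` for each event it recognizes.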
|
|
100
244
|
|
|
101
|
-
##
|
|
245
|
+
## Architecture
|
|
102
246
|
|
|
103
|
-
Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging enabled:
|
|
104
|
-
|
|
105
|
-
```bash
|
|
106
|
-
# Windows
|
|
107
|
-
m
|
|
108
|
-
|
|
109
|
-
# macOS
|
|
110
|
-
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
|
|
111
247
|
```
|
|
112
|
-
|
|
113
|
-
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
-
|
|
123
|
-
-
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
- `browser_screenshot` — Capture page
|
|
132
|
-
- `browser_content` — Get page text
|
|
133
|
-
- `browser_scroll` — Scroll page
|
|
134
|
-
|
|
135
|
-
**Control**
|
|
136
|
-
- `report` — Report progress back to the big model (completed / blocked / need_guidance)
|
|
137
|
-
|
|
138
|
-
## How It Works
|
|
139
|
-
|
|
140
|
-
1. The big model calls `create_session` to initialize an agent with a small LLM.
|
|
141
|
-
2. The big model sends high-level instructions via `send_instruction`.
|
|
142
|
-
3. The small model autonomously executes tasks using the tools above:
|
|
143
|
-
- Takes screenshots to understand the current state
|
|
144
|
-
- Performs actions (clicks, types, navigates)
|
|
145
|
-
- Verifies results with another screenshot
|
|
146
|
-
- Calls `report` when done or stuck
|
|
147
|
-
4. The big model receives only a text summary + optional screenshot — not all the intermediate data.
|
|
148
|
-
5. The big model can send follow-up instructions or end the session.
|
|
248
|
+
src/
|
|
249
|
+
├── cli.ts # CLI entry (single task + interactive REPL)
|
|
250
|
+
├── index.ts # Public API exports
|
|
251
|
+
├── config/ # Zod config schema + env/file loader
|
|
252
|
+
├── mcp/
|
|
253
|
+
│ ├── server.ts # MCP stdio transport
|
|
254
|
+
│ ├── tools.ts # 3 MCP tools (create/send/done)
|
|
255
|
+
│ └── session-registry.ts # Session lifecycle + timeout
|
|
256
|
+
├── agent/
|
|
257
|
+
│ ├── runner.ts # Tool-calling loop + step events
|
|
258
|
+
│ ├── llm-client.ts # OpenAI SDK wrapper (any compatible endpoint)
|
|
259
|
+
│ ├── context-manager.ts # Full message history
|
|
260
|
+
│ └── system-prompt.ts # Small model system prompt
|
|
261
|
+
└── tools/
|
|
262
|
+
├── windows/ # screenshot (with grid), mouse, keyboard, command
|
|
263
|
+
├── browser/ # navigate, click, type, screenshot, content, scroll
|
|
264
|
+
├── file/ # read, write, use_local_image
|
|
265
|
+
└── control/ # report
|
|
266
|
+
```
|
|
149
267
|
|
|
150
268
|
## License
|
|
151
269
|
|