windows-use 0.1.0 → 0.2.1

package/README.md CHANGED
@@ -1,8 +1,8 @@
  # windows-use
 
- Let big LLMs delegate Windows/browser automation to small LLMs via sessions.
+ Let big LLMs delegate Windows & browser automation to small LLMs via sessions.
 
- When AI agents use tools to operate Windows or browsers, screenshots and other multimodal returns consume massive context. This project solves that by having a big model (Claude, GPT, etc.) direct a small model (Qwen, etc.) through sessions — the small model does the actual work and reports back concise text summaries.
+ When AI agents use tools to operate Windows or browsers, screenshots and other multimodal returns consume massive context. This project solves that with a "big model directs small model" architecture: the small model (Qwen, GPT-4o-mini, etc.) does the actual work autonomously and reports back concise summaries with optional embedded screenshots.
 
  ```
  Big Model windows-use Small Model
@@ -10,8 +10,10 @@ Big Model windows-use Small Model
  ├─ create_session ────────► │ │
  │ │ │
  ├─ send_instruction ──────► │ ── tools + instruction ──► │
- │ │ ◄── report (summary) ──────
- ◄── text summary ──────
+ │ │ screenshot analyze
+ │ → click verify → ...
+ │ │ ◄── report ──────────────── │
+ │ ◄── text + images ────── │ │
  │ │ │
  ├─ done_session ──────────► │ cleanup │
  ```
@@ -22,28 +24,50 @@ Big Model windows-use Small Model
  npm install -g windows-use
  ```
 
- ## Configuration
-
- Set these environment variables (or pass as CLI flags / MCP tool params):
+ ## Quick Start
 
  ```bash
- export WINDOWS_USE_API_KEY=your-api-key
- export WINDOWS_USE_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
- export WINDOWS_USE_MODEL=qwen3.5-flash
+ # Interactive setup — saves config to ~/.windows-use.json
+ windows-use init
  ```
 
- Any OpenAI-compatible endpoint works (Qwen, DeepSeek, Ollama, vLLM, etc.).
+ You'll be prompted for:
+ - **Base URL** — any OpenAI-compatible endpoint (Qwen, DeepSeek, Ollama, vLLM, etc.)
+ - **API Key**
+ - **Model name** — e.g. `qwen3.5-flash`, `gpt-4o-mini`
 
  ## Usage
 
  ### CLI Mode
 
  ```bash
- # Run a single task
+ # Single task
  windows-use "Open Notepad and type Hello World"
 
- # With explicit config
- windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o "Take a screenshot"
+ # Interactive REPL session
+ windows-use
+ > Open Chrome and go to github.com
+ > Find the trending repositories
+ > exit
+
+ # With explicit config flags
+ windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o-mini "Take a screenshot"
+ ```
+
+ CLI shows real-time step-by-step progress:
+
+ ```
+ > Take a screenshot of the desktop
+ [windows-use] Running...
+
+ [step 1] 🔧 screenshot
+ [step 1] 📎 Screenshot captured (img_1)
+ [step 1] 💭 I can see the Windows desktop with...
+ [step 2] 🔧 report { status: 'completed', content: '...' }
+
+ ✅ [completed]
+ Here is the current desktop:
+ 📸 desktop: http://127.0.0.1:54321/s1.jpg
  ```
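The step markers in the progress output above map one-to-one onto the runner's step events. A minimal TypeScript formatter sketch (the event shapes follow the `setOnStep` payloads shown in the Programmatic API section; the `attachment` variant and the exact strings are assumptions, not the package's real CLI code):

```typescript
// Illustrative sketch only — the package's actual CLI formatting may differ.
type StepEvent =
  | { type: "tool_call"; step: number; name: string }
  | { type: "thinking"; step: number; content: string }
  | { type: "attachment"; step: number; id: string; label: string }; // assumed variant

function formatStep(e: StepEvent): string {
  switch (e.type) {
    case "tool_call":
      return `[step ${e.step}] 🔧 ${e.name}`;
    case "thinking":
      return `[step ${e.step}] 💭 ${e.content}`;
    case "attachment":
      return `[step ${e.step}] 📎 ${e.label} (${e.id})`;
  }
}

console.log(formatStep({ type: "tool_call", step: 1, name: "screenshot" }));
// [step 1] 🔧 screenshot
```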
 
  ### MCP Server Mode
@@ -52,7 +76,19 @@ windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4
  windows-use --mcp
  ```
 
- Add to Claude Desktop (`claude_desktop_config.json`):
+ Exposes 3 tools over MCP (stdio transport):
+
+ | Tool | Description |
+ |------|-------------|
+ | `create_session` | Create a new agent session. Returns `session_id`. |
+ | `send_instruction` | Send a task to the agent. Returns rich report with text + images. |
+ | `done_session` | Terminate a session and free resources. |
+
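The big-model side of this protocol is just the three calls above in order. A self-contained TypeScript sketch with in-memory stand-ins (purely illustrative stubs showing the call sequence and return shapes, not the package's MCP server):

```typescript
// Illustrative in-memory stand-ins for the three MCP tools — not the real server.
type Session = { id: string; rounds: number };
const sessions = new Map<string, Session>();
let nextId = 0;

function create_session(): { session_id: string } {
  const id = `s${++nextId}`;
  sessions.set(id, { id, rounds: 0 });
  return { session_id: id };
}

function send_instruction(session_id: string, instruction: string): { status: string; content: string } {
  const s = sessions.get(session_id);
  if (!s) return { status: "blocked", content: "unknown session" };
  s.rounds += 1;
  // The real small model would act autonomously here; we echo a placeholder report.
  return { status: "completed", content: `did: ${instruction}` };
}

function done_session(session_id: string): boolean {
  return sessions.delete(session_id); // cleanup
}

// Typical lifecycle: create, instruct (possibly several times), then done.
const { session_id } = create_session();
const report = send_instruction(session_id, "Open Notepad");
done_session(session_id);
```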
+ ## MCP Client Configuration
+
+ ### Claude Desktop
+
+ Edit `claude_desktop_config.json`:
 
  ```json
  {
@@ -61,8 +97,8 @@ Add to Claude Desktop (`claude_desktop_config.json`):
      "command": "npx",
      "args": ["-y", "windows-use", "--mcp"],
      "env": {
-       "WINDOWS_USE_API_KEY": "your-key",
-       "WINDOWS_USE_BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+       "WINDOWS_USE_API_KEY": "sk-xxx",
+       "WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
        "WINDOWS_USE_MODEL": "qwen3.5-flash"
      }
    }
@@ -70,15 +106,116 @@ Add to Claude Desktop (`claude_desktop_config.json`):
  }
  ```
 
- The MCP server exposes 3 tools:
+ ### VS Code (Claude Code / Copilot Chat)
+
+ Add to `.vscode/settings.json` or global settings:
+
+ ```json
+ {
+   "mcp": {
+     "servers": {
+       "windows-use": {
+         "command": "npx",
+         "args": ["-y", "windows-use", "--mcp"],
+         "env": {
+           "WINDOWS_USE_API_KEY": "sk-xxx",
+           "WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
+           "WINDOWS_USE_MODEL": "qwen3.5-flash"
+         }
+       }
+     }
+   }
+ }
+ ```
+
+ > If you've run `windows-use init`, the config is saved in `~/.windows-use.json` and you can omit the `env` block entirely.
+
+ ## Configuration
+
+ Config priority: **CLI flags > environment variables > `~/.windows-use.json` > defaults**
+
+ | Option | CLI Flag | Env Var | Default |
+ |--------|----------|---------|---------|
+ | API Key | `--api-key` | `WINDOWS_USE_API_KEY` | — (required) |
+ | Base URL | `--base-url` | `WINDOWS_USE_BASE_URL` | — (required) |
+ | Model | `--model` | `WINDOWS_USE_MODEL` | — (required) |
+ | CDP URL | `--cdp-url` | `WINDOWS_USE_CDP_URL` | `http://localhost:9222` |
+ | Max Steps | `--max-steps` | `WINDOWS_USE_MAX_STEPS` | `50` |
+ | Max Rounds | `--max-rounds` | `WINDOWS_USE_MAX_ROUNDS` | `20` |
+ | Timeout | — | `WINDOWS_USE_TIMEOUT_MS` | `300000` (5 min) |
+
+ - **Max Steps** — tool-calling rounds per instruction (how many actions the small model can take for one task)
+ - **Max Rounds** — instruction rounds per session (how many `send_instruction` calls before the session expires)
+
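The precedence chain above can be sketched as a left-to-right merge in which later layers win (a hypothetical helper for illustration, not the package's actual `loadConfig`):

```typescript
// Hypothetical sketch of "CLI flags > env vars > config file > defaults".
// Not the package's real loadConfig implementation.
interface Config {
  apiKey?: string;
  baseUrl?: string;
  model?: string;
  maxSteps: number;
  maxRounds: number;
}

const defaults: Config = { maxSteps: 50, maxRounds: 20 };

// Pass layers lowest-priority first: file, then env, then CLI flags.
function resolveConfig(...layers: Partial<Config>[]): Config {
  return layers.reduce<Config>((acc, layer) => {
    for (const [k, v] of Object.entries(layer)) {
      if (v !== undefined) (acc as any)[k] = v; // later layer overrides earlier
    }
    return acc;
  }, { ...defaults });
}

const cfg = resolveConfig(
  { model: "qwen3.5-flash" }, // ~/.windows-use.json
  { model: "gpt-4o-mini" },   // WINDOWS_USE_MODEL
  { maxSteps: 10 },           // --max-steps
);
// cfg.model === "gpt-4o-mini", cfg.maxSteps === 10, cfg.maxRounds === 20
```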
+ ## Browser Setup
+
+ Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging:
+
+ ```bash
+ # Windows
+ chrome.exe --remote-debugging-port=9222
+
+ # macOS
+ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
+ ```
+
+ Uses your existing cookies, login state, and extensions — no webdriver detection flags.
+
+ ## Small Model Tools
+
+ The small model agent has access to 17 tools:
+
+ ### Screen & Input
 
  | Tool | Description |
  |------|-------------|
- | `create_session` | Create a new agent session. Returns `session_id`. |
- | `send_instruction` | Send a task to the agent. Returns status + summary. |
- | `done_session` | Terminate a session and free resources. |
+ | `screenshot` | Full-screen capture with **coordinate grid overlay** (auto-scaled to logical resolution, grid coordinates match `mouse_click`) |
+ | `mouse_click(x, y, button)` | Click at screen coordinates |
+ | `mouse_move(x, y)` | Move mouse without clicking |
+ | `mouse_scroll(direction, amount)` | Scroll up/down |
+ | `keyboard_type(text)` | Type text character by character |
+ | `keyboard_press(keys)` | Key combos like `["Ctrl", "C"]`, `["Alt", "F4"]` |
+ | `run_command(command)` | Execute shell command (PowerShell on Windows) |
+
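The claim that grid coordinates match `mouse_click` comes down to mapping physical screenshot pixels onto logical desktop coordinates. A sketch of that conversion, assuming a single uniform DPI scale factor (illustrative only, not the package's implementation):

```typescript
// Illustrative DPI conversion — assumes one uniform scale factor,
// e.g. 1.5 for 150% Windows display scaling. Not the package's actual code.
function physicalToLogical(px: number, py: number, scale: number): { x: number; y: number } {
  return { x: Math.round(px / scale), y: Math.round(py / scale) };
}

// A grid label at physical pixel (1920, 1080) on a 150%-scaled display
// corresponds to logical point (1280, 720) — the coordinates mouse_click expects.
const p = physicalToLogical(1920, 1080, 1.5);
```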
+ ### Browser
+
+ | Tool | Description |
+ |------|-------------|
+ | `browser_navigate(url)` | Open a URL |
+ | `browser_click(selector)` | Click element by CSS selector or `text=...` |
+ | `browser_type(selector, text)` | Type into input field |
+ | `browser_screenshot(fullPage?)` | Page screenshot (JPEG quality 70) |
+ | `browser_content` | Get visible text content of page |
+ | `browser_scroll(direction, amount)` | Scroll page |
+
+ ### File & Image
+
+ | Tool | Description |
+ |------|-------------|
+ | `file_read(path)` | Read file contents |
+ | `file_write(path, content)` | Write file |
+ | `use_local_image(path)` | Load a local image and get a screenshot ID for embedding in reports |
+
+ ### Control
+
+ | Tool | Description |
+ |------|-------------|
+ | `report(status, content)` | Submit a rich report and stop execution |
 
- ### Programmatic Usage
+ ### Rich Reports
+
+ Each screenshot tool returns an ID (e.g. `img_1`, `img_2`). The `report` tool supports a rich document format — mix text with embedded screenshots using `[Image:img_X]` markers:
+
+ ```
+ report({
+   status: "completed",
+   content: "Here's what I found:\n[Image:img_2]\nThe page shows search results.\n[Image:img_3]\nI also checked the sidebar."
+ })
+ ```
+
+ When delivered to the user (CLI) or big model (MCP), these markers expand into actual multimodal image content.
+
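Expanding `[Image:img_X]` markers into multimodal parts is a straightforward split over the report text. A TypeScript sketch (the part shape here is an assumption for illustration, not the package's exact wire format):

```typescript
// Illustrative marker expansion — the real multimodal part shape may differ.
type Part = { type: "text"; text: string } | { type: "image"; id: string };

function expandReport(content: string): Part[] {
  const parts: Part[] = [];
  const re = /\[Image:(img_\d+)\]/g;
  let last = 0;
  for (const m of content.matchAll(re)) {
    const idx = m.index ?? 0;
    const before = content.slice(last, idx).trim();
    if (before) parts.push({ type: "text", text: before });
    parts.push({ type: "image", id: m[1] }); // marker becomes an image part
    last = idx + m[0].length;
  }
  const tail = content.slice(last).trim();
  if (tail) parts.push({ type: "text", text: tail });
  return parts;
}

const parts = expandReport("Here's what I found:\n[Image:img_2]\nThe page shows search results.");
// → text part, image part (img_2), text part
```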
+ ## Programmatic API
 
  ```typescript
  import { loadConfig, SessionRegistry } from 'windows-use';
@@ -92,60 +229,41 @@ const config = loadConfig({
  const registry = new SessionRegistry();
  const session = registry.create(config);
 
+ // Real-time step events
+ session.runner.setOnStep((event) => {
+   if (event.type === 'tool_call') console.log(`Step ${event.step}: ${event.name}`);
+   if (event.type === 'thinking') console.log(`Step ${event.step}: ${event.content}`);
+ });
+
  const result = await session.runner.run('Open calculator and compute 2+2');
- console.log(result.summary);
+ console.log(result.status);  // 'completed' | 'blocked' | 'need_guidance'
+ console.log(result.content); // Rich text with [Image:img_X] markers
 
- await registry.destroy(session.id);
+ await registry.destroyAll();
  ```
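The shape of the runner's loop (act, observe, stop on `report` or when Max Steps runs out) can be sketched with a stubbed model; this is a simplification for illustration, not the actual `runner.ts`:

```typescript
// Simplified sketch of a bounded tool-calling loop — not the package's runner.ts.
type Action = { tool: string; args?: unknown };
type Model = (history: Action[]) => Action;

function runLoop(model: Model, maxSteps: number): { status: string; steps: number } {
  const history: Action[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const action = model(history);
    history.push(action);
    if (action.tool === "report") return { status: "completed", steps: step };
    // Other tools (screenshot, mouse_click, ...) would execute here.
  }
  return { status: "blocked", steps: maxSteps }; // ran out of steps
}

// Stub model scripted to: screenshot, then click, then report.
const script: Action[] = [{ tool: "screenshot" }, { tool: "mouse_click" }, { tool: "report" }];
const out = runLoop((h) => script[h.length]!, 50);
// out.status === "completed", out.steps === 3
```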
 
- ## Browser Automation
+ ## Architecture
 
- Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging enabled:
-
- ```bash
- # Windows
- chrome.exe --remote-debugging-port=9222
-
- # macOS
- /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
  ```
-
- This uses your existing cookies, login state, and extensions — no automation detection flags.
-
- ## Small Model Tools
-
- The small model agent has access to these tools:
-
- **Windows**
- - `screenshot` — Capture full screen
- - `mouse_click` / `mouse_move` / `mouse_scroll` — Mouse control
- - `keyboard_type` / `keyboard_press` — Keyboard input
- - `run_command` — Execute shell commands
-
- **File**
- - `file_read` / `file_write` — Read and write files
-
- **Browser**
- - `browser_navigate` — Open URL
- - `browser_click` / `browser_type` — Interact with page elements
- - `browser_screenshot` — Capture page
- - `browser_content` — Get page text
- - `browser_scroll` — Scroll page
-
- **Control**
- - `report` — Report progress back to the big model (completed / blocked / need_guidance)
-
- ## How It Works
-
- 1. The big model calls `create_session` to initialize an agent with a small LLM.
- 2. The big model sends high-level instructions via `send_instruction`.
- 3. The small model autonomously executes tasks using the tools above:
-    - Takes screenshots to understand the current state
-    - Performs actions (clicks, types, navigates)
-    - Verifies results with another screenshot
-    - Calls `report` when done or stuck
- 4. The big model receives only a text summary + optional screenshot — not all the intermediate data.
- 5. The big model can send follow-up instructions or end the session.
+ src/
+ ├── cli.ts                  # CLI entry (single task + interactive REPL)
+ ├── index.ts                # Public API exports
+ ├── config/                 # Zod config schema + env/file loader
+ ├── mcp/
+ │   ├── server.ts           # MCP stdio transport
+ │   ├── tools.ts            # 3 MCP tools (create/send/done)
+ │   └── session-registry.ts # Session lifecycle + timeout
+ ├── agent/
+ │   ├── runner.ts           # Tool-calling loop + step events
+ │   ├── llm-client.ts       # OpenAI SDK wrapper (any compatible endpoint)
+ │   ├── context-manager.ts  # Full message history
+ │   └── system-prompt.ts    # Small model system prompt
+ └── tools/
+     ├── windows/            # screenshot (with grid), mouse, keyboard, command
+     ├── browser/            # navigate, click, type, screenshot, content, scroll
+     ├── file/               # read, write, use_local_image
+     └── control/            # report
+ ```
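`session-registry.ts` pairs session lifecycle with a timeout. The idea can be sketched with lazy expiry timestamps and an injectable clock (illustrative; the real registry may use actual timers):

```typescript
// Illustrative lazy-expiry registry — the real session-registry.ts may differ.
type Entry = { createdAt: number; timeoutMs: number };

class Registry {
  private entries = new Map<string, Entry>();
  constructor(private now: () => number = Date.now) {}

  create(id: string, timeoutMs = 300_000): void {
    this.entries.set(id, { createdAt: this.now(), timeoutMs });
  }

  // Expired sessions are removed on access instead of by a timer.
  get(id: string): Entry | undefined {
    const e = this.entries.get(id);
    if (e && this.now() - e.createdAt > e.timeoutMs) {
      this.entries.delete(id); // expired — free resources
      return undefined;
    }
    return e;
  }

  destroyAll(): void {
    this.entries.clear();
  }
}

// Fake clock makes expiry deterministic.
let t = 0;
const reg = new Registry(() => t);
reg.create("s1", 1000);
t = 500;  // still alive
const alive = reg.get("s1") !== undefined;
t = 2000; // past the timeout
const expired = reg.get("s1") === undefined;
```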
 
  ## License