windows-use 0.1.0

package/README.md ADDED
# windows-use

Let big LLMs delegate Windows/browser automation to small LLMs via sessions.

When AI agents use tools to operate Windows or browsers, screenshots and other multimodal tool results consume large amounts of context. This project solves that by having a big model (Claude, GPT, etc.) direct a small model (Qwen, etc.) through sessions: the small model does the actual work and reports back concise text summaries.

```
Big Model                 windows-use                  Small Model
    │                          │                            │
    ├─ create_session ────────►│                            │
    │                          │                            │
    ├─ send_instruction ──────►│── tools + instruction ────►│
    │                          │◄── report (summary) ───────│
    │◄── text summary ─────────│                            │
    │                          │                            │
    ├─ done_session ──────────►│          cleanup           │
```

## Install

```bash
npm install -g windows-use
```

## Configuration

Set these environment variables (or pass them as CLI flags / MCP tool parameters):

```bash
export WINDOWS_USE_API_KEY=your-api-key
export WINDOWS_USE_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
export WINDOWS_USE_MODEL=qwen3.5-flash
```

Any OpenAI-compatible endpoint works (Qwen, DeepSeek, Ollama, vLLM, etc.).

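As a sketch of the precedence implied above (explicit values win over environment variables), assuming illustrative names; this is not the package's actual `loadConfig`:

```typescript
// Illustrative config resolution (assumption: explicit options override
// WINDOWS_USE_* environment variables; names here are for illustration only).
export interface Options {
  apiKey?: string;
  baseURL?: string;
  model?: string;
}

export function resolveConfig(
  opts: Options,
  env: Record<string, string | undefined>
): Options {
  return {
    apiKey: opts.apiKey ?? env.WINDOWS_USE_API_KEY,
    baseURL: opts.baseURL ?? env.WINDOWS_USE_BASE_URL,
    model: opts.model ?? env.WINDOWS_USE_MODEL,
  };
}
```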
## Usage

### CLI Mode

```bash
# Run a single task
windows-use "Open Notepad and type Hello World"

# With explicit config
windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o "Take a screenshot"
```

### MCP Server Mode

```bash
windows-use --mcp
```

Add to Claude Desktop (`claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "windows-use": {
      "command": "npx",
      "args": ["-y", "windows-use", "--mcp"],
      "env": {
        "WINDOWS_USE_API_KEY": "your-key",
        "WINDOWS_USE_BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "WINDOWS_USE_MODEL": "qwen3.5-flash"
      }
    }
  }
}
```

The MCP server exposes three tools:

| Tool | Description |
|------|-------------|
| `create_session` | Create a new agent session. Returns a `session_id`. |
| `send_instruction` | Send a task to the agent. Returns status and a summary. |
| `done_session` | Terminate a session and free its resources. |

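A host drives these three tools in a create / instruct / clean-up sequence. A minimal sketch of that loop, where `callTool` stands in for whatever MCP client you use (its shape, and the `session_id` argument name, are assumptions, not this package's published schema):

```typescript
// A generic stand-in for an MCP client's tool-call method (shape assumed).
export type ToolCaller = (
  name: string,
  args: Record<string, unknown>
) => Promise<string>;

// Run one task through a fresh session, always cleaning the session up.
export async function runTask(
  callTool: ToolCaller,
  instruction: string
): Promise<string> {
  const sessionId = await callTool("create_session", {});
  try {
    // Returns a concise text status + summary, not raw screenshots.
    return await callTool("send_instruction", {
      session_id: sessionId,
      instruction,
    });
  } finally {
    await callTool("done_session", { session_id: sessionId });
  }
}
```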

### Programmatic Usage

```typescript
import { loadConfig, SessionRegistry } from 'windows-use';

const config = loadConfig({
  apiKey: 'sk-xxx',
  baseURL: 'https://api.example.com/v1',
  model: 'qwen3.5-flash',
});

const registry = new SessionRegistry();
const session = registry.create(config);

const result = await session.runner.run('Open calculator and compute 2+2');
console.log(result.summary);

await registry.destroy(session.id);
```

## Browser Automation

Browser tools connect to your real Chrome instance via CDP (the Chrome DevTools Protocol). Start Chrome with remote debugging enabled:

```bash
# Windows
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222

# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
```

This uses your existing cookies, login state, and extensions, with no automation-detection flags.

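To verify Chrome is actually listening, CDP serves browser metadata over plain HTTP on the debugging port. A small sketch (the port 9222 matches the commands above; the helper names are illustrative):

```typescript
// CDP exposes browser metadata at /json/version on the debugging port.
export interface CdpVersion {
  Browser: string;
  webSocketDebuggerUrl: string;
}

export function describeCdpTarget(v: CdpVersion): string {
  return `${v.Browser} (${v.webSocketDebuggerUrl})`;
}

// Not invoked automatically: requires Chrome running with --remote-debugging-port=9222.
export async function checkCdp(): Promise<string> {
  const res = await fetch("http://localhost:9222/json/version");
  return describeCdpTarget((await res.json()) as CdpVersion);
}
```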
## Small Model Tools

The small-model agent has access to these tools:

**Windows**
- `screenshot` — Capture the full screen
- `mouse_click` / `mouse_move` / `mouse_scroll` — Mouse control
- `keyboard_type` / `keyboard_press` — Keyboard input
- `run_command` — Execute shell commands

**File**
- `file_read` / `file_write` — Read and write files

**Browser**
- `browser_navigate` — Open a URL
- `browser_click` / `browser_type` — Interact with page elements
- `browser_screenshot` — Capture the page
- `browser_content` — Get the page text
- `browser_scroll` — Scroll the page

**Control**
- `report` — Report progress back to the big model (completed / blocked / need_guidance)

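The `report` statuses listed above suggest a payload along these lines (the field names are an assumption for illustration, not the package's documented schema):

```typescript
// Assumed shape of a report from the small model. Only "completed" is terminal;
// "blocked" and "need_guidance" hand control back to the big model.
export type ReportStatus = "completed" | "blocked" | "need_guidance";

export interface Report {
  status: ReportStatus;
  summary: string;
}

export function needsBigModel(r: Report): boolean {
  return r.status === "blocked" || r.status === "need_guidance";
}
```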
## How It Works

1. The big model calls `create_session` to initialize an agent backed by a small LLM.
2. The big model sends high-level instructions via `send_instruction`.
3. The small model autonomously executes each task using the tools above:
   - takes screenshots to understand the current state,
   - performs actions (clicks, types, navigates),
   - verifies results with another screenshot,
   - calls `report` when done or stuck.
4. The big model receives only a text summary plus an optional screenshot, not all the intermediate data.
5. The big model can send follow-up instructions or end the session.

150
+ ## License
151
+
152
+ MIT
package/dist/cli.d.ts ADDED
#!/usr/bin/env node