windows-use 0.1.0 → 0.2.1

package/README.md CHANGED
@@ -1,8 +1,8 @@
  # windows-use
 
- Let big LLMs delegate Windows/browser automation to small LLMs via sessions.
+ Let big LLMs delegate Windows & browser automation to small LLMs via sessions.
 
- When AI agents use tools to operate Windows or browsers, screenshots and other multimodal returns consume massive context. This project solves that by having a big model (Claude, GPT, etc.) direct a small model (Qwen, etc.) through sessions — the small model does the actual work and reports back concise text summaries.
+ When AI agents use tools to operate Windows or browsers, screenshots and other multimodal returns consume massive context. This project solves that with a "big model directs small model" architecture: the small model (Qwen, GPT-4o-mini, etc.) does the actual work autonomously and reports back concise summaries with optional embedded screenshots.
 
  ```
  Big Model windows-use Small Model
@@ -10,8 +10,10 @@ Big Model windows-use Small Model
  ├─ create_session ────────► │ │
  │ │ │
  ├─ send_instruction ──────► │ ── tools + instruction ──► │
- │ │ ◄── report (summary) ──────
- ◄── text summary ──────
+ │ │ screenshot analyze
+ │ → click verify → ...
+ │ │ ◄── report ──────────────── │
+ │ ◄── text + images ────── │ │
  │ │ │
  ├─ done_session ──────────► │ cleanup │
  ```
@@ -22,28 +24,50 @@ Big Model windows-use Small Model
  npm install -g windows-use
  ```
 
- ## Configuration
-
- Set these environment variables (or pass as CLI flags / MCP tool params):
+ ## Quick Start
 
  ```bash
- export WINDOWS_USE_API_KEY=your-api-key
- export WINDOWS_USE_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
- export WINDOWS_USE_MODEL=qwen3.5-flash
+ # Interactive setup — saves config to ~/.windows-use.json
+ windows-use init
  ```
 
- Any OpenAI-compatible endpoint works (Qwen, DeepSeek, Ollama, vLLM, etc.).
+ You'll be prompted for:
+ - **Base URL** — any OpenAI-compatible endpoint (Qwen, DeepSeek, Ollama, vLLM, etc.)
+ - **API Key**
+ - **Model name** — e.g. `qwen3.5-flash`, `gpt-4o-mini`
 
  ## Usage
 
  ### CLI Mode
 
  ```bash
- # Run a single task
+ # Single task
  windows-use "Open Notepad and type Hello World"
 
- # With explicit config
- windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o "Take a screenshot"
+ # Interactive REPL session
+ windows-use
+ > Open Chrome and go to github.com
+ > Find the trending repositories
+ > exit
+
+ # With explicit config flags
+ windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4o-mini "Take a screenshot"
+ ```
+
+ CLI shows real-time step-by-step progress:
+
+ ```
+ > Take a screenshot of the desktop
+ [windows-use] Running...
+
+ [step 1] 🔧 screenshot
+ [step 1] 📎 Screenshot captured (img_1)
+ [step 1] 💭 I can see the Windows desktop with...
+ [step 2] 🔧 report { status: 'completed', content: '...' }
+
+ ✅ [completed]
+ Here is the current desktop:
+ 📸 desktop: http://127.0.0.1:54321/s1.jpg
  ```
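The step markers in the progress output above map one-to-one onto the runner's step events. A minimal TypeScript formatter sketch (the event shapes follow the `setOnStep` payloads shown in the Programmatic API section; the `attachment` variant and the exact strings are assumptions, not the package's real CLI code):

```typescript
// Illustrative sketch only — the package's actual CLI formatting may differ.
type StepEvent =
  | { type: "tool_call"; step: number; name: string }
  | { type: "thinking"; step: number; content: string }
  | { type: "attachment"; step: number; id: string; label: string }; // assumed variant

function formatStep(e: StepEvent): string {
  switch (e.type) {
    case "tool_call":
      return `[step ${e.step}] 🔧 ${e.name}`;
    case "thinking":
      return `[step ${e.step}] 💭 ${e.content}`;
    case "attachment":
      return `[step ${e.step}] 📎 ${e.label} (${e.id})`;
  }
}

console.log(formatStep({ type: "tool_call", step: 1, name: "screenshot" }));
// [step 1] 🔧 screenshot
```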
 
  ### MCP Server Mode
@@ -52,7 +76,19 @@ windows-use --api-key sk-xxx --base-url https://api.example.com/v1 --model gpt-4
  windows-use --mcp
  ```
 
- Add to Claude Desktop (`claude_desktop_config.json`):
+ Exposes 3 tools over MCP (stdio transport):
+
+ | Tool | Description |
+ |------|-------------|
+ | `create_session` | Create a new agent session. Returns `session_id`. |
+ | `send_instruction` | Send a task to the agent. Returns rich report with text + images. |
+ | `done_session` | Terminate a session and free resources. |
+
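The big-model side of this protocol is just the three calls above in order. A self-contained TypeScript sketch with in-memory stand-ins (purely illustrative stubs showing the call sequence and return shapes, not the package's MCP server):

```typescript
// Illustrative in-memory stand-ins for the three MCP tools — not the real server.
type Session = { id: string; rounds: number };
const sessions = new Map<string, Session>();
let nextId = 0;

function create_session(): { session_id: string } {
  const id = `s${++nextId}`;
  sessions.set(id, { id, rounds: 0 });
  return { session_id: id };
}

function send_instruction(session_id: string, instruction: string): { status: string; content: string } {
  const s = sessions.get(session_id);
  if (!s) return { status: "blocked", content: "unknown session" };
  s.rounds += 1;
  // The real small model would act autonomously here; we echo a placeholder report.
  return { status: "completed", content: `did: ${instruction}` };
}

function done_session(session_id: string): boolean {
  return sessions.delete(session_id); // cleanup
}

// Typical lifecycle: create, instruct (possibly several times), then done.
const { session_id } = create_session();
const report = send_instruction(session_id, "Open Notepad");
done_session(session_id);
```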
+ ## MCP Client Configuration
+
+ ### Claude Desktop
+
+ Edit `claude_desktop_config.json`:
 
  ```json
  {
@@ -61,8 +97,8 @@ Add to Claude Desktop (`claude_desktop_config.json`):
      "command": "npx",
      "args": ["-y", "windows-use", "--mcp"],
      "env": {
-       "WINDOWS_USE_API_KEY": "your-key",
-       "WINDOWS_USE_BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+       "WINDOWS_USE_API_KEY": "sk-xxx",
+       "WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
        "WINDOWS_USE_MODEL": "qwen3.5-flash"
      }
    }
@@ -70,15 +106,116 @@ Add to Claude Desktop (`claude_desktop_config.json`):
  }
  ```
 
- The MCP server exposes 3 tools:
+ ### VS Code (Claude Code / Copilot Chat)
+
+ Add to `.vscode/settings.json` or global settings:
+
+ ```json
+ {
+   "mcp": {
+     "servers": {
+       "windows-use": {
+         "command": "npx",
+         "args": ["-y", "windows-use", "--mcp"],
+         "env": {
+           "WINDOWS_USE_API_KEY": "sk-xxx",
+           "WINDOWS_USE_BASE_URL": "https://your-api.com/v1",
+           "WINDOWS_USE_MODEL": "qwen3.5-flash"
+         }
+       }
+     }
+   }
+ }
+ ```
+
+ > If you've run `windows-use init`, the config is saved in `~/.windows-use.json` and you can omit the `env` block entirely.
+
+ ## Configuration
+
+ Config priority: **CLI flags > environment variables > `~/.windows-use.json` > defaults**
+
+ | Option | CLI Flag | Env Var | Default |
+ |--------|----------|---------|---------|
+ | API Key | `--api-key` | `WINDOWS_USE_API_KEY` | — (required) |
+ | Base URL | `--base-url` | `WINDOWS_USE_BASE_URL` | — (required) |
+ | Model | `--model` | `WINDOWS_USE_MODEL` | — (required) |
+ | CDP URL | `--cdp-url` | `WINDOWS_USE_CDP_URL` | `http://localhost:9222` |
+ | Max Steps | `--max-steps` | `WINDOWS_USE_MAX_STEPS` | `50` |
+ | Max Rounds | `--max-rounds` | `WINDOWS_USE_MAX_ROUNDS` | `20` |
+ | Timeout | — | `WINDOWS_USE_TIMEOUT_MS` | `300000` (5 min) |
+
+ - **Max Steps** — tool-calling rounds per instruction (how many actions the small model can take for one task)
+ - **Max Rounds** — instruction rounds per session (how many `send_instruction` calls before the session expires)
+
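The precedence chain above can be sketched as a left-to-right merge in which later layers win (a hypothetical helper for illustration, not the package's actual `loadConfig`):

```typescript
// Hypothetical sketch of "CLI flags > env vars > config file > defaults".
// Not the package's real loadConfig implementation.
interface Config {
  apiKey?: string;
  baseUrl?: string;
  model?: string;
  maxSteps: number;
  maxRounds: number;
}

const defaults: Config = { maxSteps: 50, maxRounds: 20 };

// Pass layers lowest-priority first: file, then env, then CLI flags.
function resolveConfig(...layers: Partial<Config>[]): Config {
  return layers.reduce<Config>((acc, layer) => {
    for (const [k, v] of Object.entries(layer)) {
      if (v !== undefined) (acc as any)[k] = v; // later layer overrides earlier
    }
    return acc;
  }, { ...defaults });
}

const cfg = resolveConfig(
  { model: "qwen3.5-flash" }, // ~/.windows-use.json
  { model: "gpt-4o-mini" },   // WINDOWS_USE_MODEL
  { maxSteps: 10 },           // --max-steps
);
// cfg.model === "gpt-4o-mini", cfg.maxSteps === 10, cfg.maxRounds === 20
```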
+ ## Browser Setup
+
+ Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging:
+
+ ```bash
+ # Windows
+ chrome.exe --remote-debugging-port=9222
+
+ # macOS
+ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
+ ```
+
+ Uses your existing cookies, login state, and extensions — no webdriver detection flags.
+
+ ## Small Model Tools
+
+ The small model agent has access to 17 tools:
+
+ ### Screen & Input
 
  | Tool | Description |
  |------|-------------|
- | `create_session` | Create a new agent session. Returns `session_id`. |
- | `send_instruction` | Send a task to the agent. Returns status + summary. |
- | `done_session` | Terminate a session and free resources. |
+ | `screenshot` | Full-screen capture with **coordinate grid overlay** (auto-scaled to logical resolution, grid coordinates match `mouse_click`) |
+ | `mouse_click(x, y, button)` | Click at screen coordinates |
+ | `mouse_move(x, y)` | Move mouse without clicking |
+ | `mouse_scroll(direction, amount)` | Scroll up/down |
+ | `keyboard_type(text)` | Type text character by character |
+ | `keyboard_press(keys)` | Key combos like `["Ctrl", "C"]`, `["Alt", "F4"]` |
+ | `run_command(command)` | Execute shell command (PowerShell on Windows) |
+
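The claim that grid coordinates match `mouse_click` comes down to mapping physical screenshot pixels onto logical desktop coordinates. A sketch of that conversion, assuming a single uniform DPI scale factor (illustrative only, not the package's implementation):

```typescript
// Illustrative DPI conversion — assumes one uniform scale factor,
// e.g. 1.5 for 150% Windows display scaling. Not the package's actual code.
function physicalToLogical(px: number, py: number, scale: number): { x: number; y: number } {
  return { x: Math.round(px / scale), y: Math.round(py / scale) };
}

// A grid label at physical pixel (1920, 1080) on a 150%-scaled display
// corresponds to logical point (1280, 720) — the coordinates mouse_click expects.
const p = physicalToLogical(1920, 1080, 1.5);
```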
+ ### Browser
+
+ | Tool | Description |
+ |------|-------------|
+ | `browser_navigate(url)` | Open a URL |
+ | `browser_click(selector)` | Click element by CSS selector or `text=...` |
+ | `browser_type(selector, text)` | Type into input field |
+ | `browser_screenshot(fullPage?)` | Page screenshot (JPEG quality 70) |
+ | `browser_content` | Get visible text content of page |
+ | `browser_scroll(direction, amount)` | Scroll page |
+
+ ### File & Image
+
+ | Tool | Description |
+ |------|-------------|
+ | `file_read(path)` | Read file contents |
+ | `file_write(path, content)` | Write file |
+ | `use_local_image(path)` | Load a local image and get a screenshot ID for embedding in reports |
+
+ ### Control
+
+ | Tool | Description |
+ |------|-------------|
+ | `report(status, content)` | Submit a rich report and stop execution |
 
- ### Programmatic Usage
+ ### Rich Reports
+
+ Each screenshot tool returns an ID (e.g. `img_1`, `img_2`). The `report` tool supports a rich document format — mix text with embedded screenshots using `[Image:img_X]` markers:
+
+ ```
+ report({
+   status: "completed",
+   content: "Here's what I found:\n[Image:img_2]\nThe page shows search results.\n[Image:img_3]\nI also checked the sidebar."
+ })
+ ```
+
+ When delivered to the user (CLI) or big model (MCP), these markers expand into actual multimodal image content.
+
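Expanding `[Image:img_X]` markers into multimodal parts is a straightforward split over the report text. A TypeScript sketch (the part shape here is an assumption for illustration, not the package's exact wire format):

```typescript
// Illustrative marker expansion — the real multimodal part shape may differ.
type Part = { type: "text"; text: string } | { type: "image"; id: string };

function expandReport(content: string): Part[] {
  const parts: Part[] = [];
  const re = /\[Image:(img_\d+)\]/g;
  let last = 0;
  for (const m of content.matchAll(re)) {
    const idx = m.index ?? 0;
    const before = content.slice(last, idx).trim();
    if (before) parts.push({ type: "text", text: before });
    parts.push({ type: "image", id: m[1] }); // marker becomes an image part
    last = idx + m[0].length;
  }
  const tail = content.slice(last).trim();
  if (tail) parts.push({ type: "text", text: tail });
  return parts;
}

const parts = expandReport("Here's what I found:\n[Image:img_2]\nThe page shows search results.");
// → text part, image part (img_2), text part
```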
+ ## Programmatic API
 
  ```typescript
  import { loadConfig, SessionRegistry } from 'windows-use';
@@ -92,60 +229,41 @@ const config = loadConfig({
  const registry = new SessionRegistry();
  const session = registry.create(config);
 
+ // Real-time step events
+ session.runner.setOnStep((event) => {
+   if (event.type === 'tool_call') console.log(`Step ${event.step}: ${event.name}`);
+   if (event.type === 'thinking') console.log(`Step ${event.step}: ${event.content}`);
+ });
+
  const result = await session.runner.run('Open calculator and compute 2+2');
- console.log(result.summary);
+ console.log(result.status);  // 'completed' | 'blocked' | 'need_guidance'
+ console.log(result.content); // Rich text with [Image:img_X] markers
 
- await registry.destroy(session.id);
+ await registry.destroyAll();
  ```
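The shape of the runner's loop (act, observe, stop on `report` or when Max Steps runs out) can be sketched with a stubbed model; this is a simplification for illustration, not the actual `runner.ts`:

```typescript
// Simplified sketch of a bounded tool-calling loop — not the package's runner.ts.
type Action = { tool: string; args?: unknown };
type Model = (history: Action[]) => Action;

function runLoop(model: Model, maxSteps: number): { status: string; steps: number } {
  const history: Action[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const action = model(history);
    history.push(action);
    if (action.tool === "report") return { status: "completed", steps: step };
    // Other tools (screenshot, mouse_click, ...) would execute here.
  }
  return { status: "blocked", steps: maxSteps }; // ran out of steps
}

// Stub model scripted to: screenshot, then click, then report.
const script: Action[] = [{ tool: "screenshot" }, { tool: "mouse_click" }, { tool: "report" }];
const out = runLoop((h) => script[h.length]!, 50);
// out.status === "completed", out.steps === 3
```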
 
- ## Browser Automation
+ ## Architecture
 
- Browser tools connect to your real Chrome via CDP. Start Chrome with remote debugging enabled:
-
- ```bash
- # Windows
- chrome.exe --remote-debugging-port=9222
-
- # macOS
- /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
  ```
-
- This uses your existing cookies, login state, and extensions — no automation detection flags.
-
- ## Small Model Tools
-
- The small model agent has access to these tools:
-
- **Windows**
- - `screenshot` — Capture full screen
- - `mouse_click` / `mouse_move` / `mouse_scroll` — Mouse control
- - `keyboard_type` / `keyboard_press` — Keyboard input
- - `run_command` — Execute shell commands
-
- **File**
- - `file_read` / `file_write` — Read and write files
-
- **Browser**
- - `browser_navigate` — Open URL
- - `browser_click` / `browser_type` — Interact with page elements
- - `browser_screenshot` — Capture page
- - `browser_content` — Get page text
- - `browser_scroll` — Scroll page
-
- **Control**
- - `report` — Report progress back to the big model (completed / blocked / need_guidance)
-
- ## How It Works
-
- 1. The big model calls `create_session` to initialize an agent with a small LLM.
- 2. The big model sends high-level instructions via `send_instruction`.
- 3. The small model autonomously executes tasks using the tools above:
-    - Takes screenshots to understand the current state
-    - Performs actions (clicks, types, navigates)
-    - Verifies results with another screenshot
-    - Calls `report` when done or stuck
- 4. The big model receives only a text summary + optional screenshot — not all the intermediate data.
- 5. The big model can send follow-up instructions or end the session.
+ src/
+ ├── cli.ts                  # CLI entry (single task + interactive REPL)
+ ├── index.ts                # Public API exports
+ ├── config/                 # Zod config schema + env/file loader
+ ├── mcp/
+ │   ├── server.ts           # MCP stdio transport
+ │   ├── tools.ts            # 3 MCP tools (create/send/done)
+ │   └── session-registry.ts # Session lifecycle + timeout
+ ├── agent/
+ │   ├── runner.ts           # Tool-calling loop + step events
+ │   ├── llm-client.ts       # OpenAI SDK wrapper (any compatible endpoint)
+ │   ├── context-manager.ts  # Full message history
+ │   └── system-prompt.ts    # Small model system prompt
+ └── tools/
+     ├── windows/            # screenshot (with grid), mouse, keyboard, command
+     ├── browser/            # navigate, click, type, screenshot, content, scroll
+     ├── file/               # read, write, use_local_image
+     └── control/            # report
+ ```
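`session-registry.ts` pairs session lifecycle with a timeout. The idea can be sketched with lazy expiry timestamps and an injectable clock (illustrative; the real registry may use actual timers):

```typescript
// Illustrative lazy-expiry registry — the real session-registry.ts may differ.
type Entry = { createdAt: number; timeoutMs: number };

class Registry {
  private entries = new Map<string, Entry>();
  constructor(private now: () => number = Date.now) {}

  create(id: string, timeoutMs = 300_000): void {
    this.entries.set(id, { createdAt: this.now(), timeoutMs });
  }

  // Expired sessions are removed on access instead of by a timer.
  get(id: string): Entry | undefined {
    const e = this.entries.get(id);
    if (e && this.now() - e.createdAt > e.timeoutMs) {
      this.entries.delete(id); // expired — free resources
      return undefined;
    }
    return e;
  }

  destroyAll(): void {
    this.entries.clear();
  }
}

// Fake clock makes expiry deterministic.
let t = 0;
const reg = new Registry(() => t);
reg.create("s1", 1000);
t = 500;  // still alive
const alive = reg.get("s1") !== undefined;
t = 2000; // past the timeout
const expired = reg.get("s1") === undefined;
```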
 
  ## License