native-devtools-mcp 0.1.8 → 0.2.2
- package/README.md +107 -349
- package/bin/cli.js +12 -5
- package/package.json +5 -3
package/README.md
CHANGED
@@ -1,157 +1,80 @@
 # native-devtools-mcp
 
-
+<div align="center">
 
-
+
+
+
+
 
-
+**Give your AI agent "eyes" and "hands" for native desktop applications.**
 
-
+A Model Context Protocol (MCP) server that provides **Computer Use** capabilities: screenshots, OCR, input simulation, and window management.
 
-
-|----------|--------|
-| **macOS** | Supported |
-| **Windows** | Planned |
-| **Linux** | Planned |
+[Features](#features) • [Installation](#installation) • [For AI Agents](#for-ai-agents-llms) • [Permissions](#required-permissions-macos)
 
-
+
 
-
+</div>
 
-
-- **Screenshots** - Capture full screen, windows, or regions
-- **Input simulation** - Click, type, scroll, drag via platform-native events
-- **Window/app enumeration** - List and focus windows and applications
+---
 
-##
+## 🚀 Features
 
-
+- **👀 Computer Vision:** Capture screenshots of screens, windows, or specific regions. Includes built-in OCR (text recognition) to "read" the screen.
+- **🖱️ Input Simulation:** Click, drag, scroll, and type text naturally. Supports global coordinates and window-relative actions.
+- **🪟 Window Management:** List open windows, find applications, and bring them to focus.
+- **🔒 Local & Private:** 100% local execution. No screenshots or data are ever sent to external servers.
+- **🔌 Dual-Mode Interaction:**
+  1. **Visual/Native:** Works with *any* app via screenshots & coordinates (Universal).
+  2. **AppDebugKit:** Deep integration for supported apps to inspect the UI tree (DOM-like structure).
 
-
+## 🤖 For AI Agents (LLMs)
 
-
-- **Element targeting by ID** - Click buttons, fill text fields by element reference
-- **CSS-like selectors** - Query elements with `#id`, `.ClassName`, `[title=Save]`
-- **View hierarchy inspection** - Traverse the UI tree programmatically
-- **Framework-aware** - Works with AppKit and SwiftUI controls
+This MCP server is designed to be **highly discoverable and usable** by AI models (Claude, Gemini, GPT).
 
-**
+- **[📄 Read `agents.md`](./agents.md):** A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
 
-
-
-
+**Core Capabilities for System Prompts:**
+1. `take_screenshot`: The "eyes". Returns images + layout metadata + text locations (OCR).
+2. `click` / `type_text`: The "hands". Interacts with the system based on visual feedback.
+3. `find_text`: A shortcut to find text on screen and get its coordinates immediately.
 
-
+## 📦 Installation
 
-
-- **Screen coordinate targeting** - Click at (x, y) positions
-- **Works with any UI framework** - egui, Electron, Qt, games, anything
-- **No app modification required** - Just needs Accessibility permission
+### Option 1: Run with `npx` (No install needed)
 
-
+The easiest way to use this with Claude Desktop or other MCP clients.
 
+```bash
+npx -y native-devtools-mcp
 ```
-take_screenshot → (analyze visually) → click(x=500, y=300)
-```
-
-### Why Two Approaches?
-
-The LLM needs to choose the right approach based on what it observes:
-
-| Scenario | Recommended Approach |
-|----------|---------------------|
-| App with AppDebugKit embedded | `app_*` tools - reliable element IDs |
-| egui/Electron/Qt app | CGEvent tools - coordinate-based |
-| Unknown app | Try `app_connect`, fall back to CGEvent |
-| App with poor view hierarchy | CGEvent even if AppDebugKit connected |
-
-Merging them into auto-fallback would hide important context from the LLM and reduce its ability to make informed decisions.
 
-
-
-### Option 1: npm (Recommended)
+### Option 2: Global Install
 
 ```bash
-# Install globally
 npm install -g native-devtools-mcp
-
-# Or run directly with npx
-npx native-devtools-mcp
 ```
 
-### Option
+### Option 3: Build from Source (Rust)
+
+<details>
+<summary>Click to expand build instructions</summary>
 
 ```bash
-# Clone the repository
 git clone https://github.com/sh3ll3x3c/native-devtools-mcp
 cd native-devtools-mcp
-
-# Build
 cargo build --release
-
-# Binary location
-./target/release/native-devtools-mcp
+# Binary: ./target/release/native-devtools-mcp
 ```
+</details>
 
-##
-
-This MCP server requires macOS privacy permissions to capture screenshots and simulate input. **These permissions are required for the tools to function.**
-
-### Step-by-Step Setup
-
-#### 1. Screen Recording Permission (required for screenshots)
-
-1. Open **System Settings** → **Privacy & Security** → **Screen Recording**
-2. Click the **+** button (you may need to unlock with your password)
-3. Add the app that runs Claude Code:
-   - **VS Code**: `/Applications/Visual Studio Code.app`
-   - **Terminal**: `/Applications/Utilities/Terminal.app`
-   - **iTerm**: `/Applications/iTerm.app`
-4. **Quit and restart the app completely** (not just reload)
-
-#### 2. Accessibility Permission (required for click, type, scroll)
-
-1. Open **System Settings** → **Privacy & Security** → **Accessibility**
-2. Click the **+** button
-3. Add the same app as above (VS Code, Terminal, etc.)
-4. **Quit and restart the app completely**
-
-#### 3. macOS Version (for OCR features)
-
-The `find_text` tool and `take_screenshot` OCR feature use Apple's Vision framework for text recognition. This is the **recommended way** to interact with apps when AppDebugKit is not available.
-
-- **macOS 10.15+ (Catalina)** required for OCR (`VNRecognizeTextRequest`)
-- No additional software installation needed - Vision is built into macOS
-
-### Important Notes
-
-- **Grant permissions to the host app** (VS Code, Terminal), not to the MCP server binary itself
-- **Restart is required** - Permissions don't take effect until you fully quit and reopen the app
-- **No popup appears** - macOS won't prompt you; it silently fails if permissions are missing
-- If you see `could not create image from display`, you need Screen Recording permission
-- If clicks don't work, you need Accessibility permission
-
-### During Automation (Important)
-
-These tools assume the target window stays focused. If you use the mouse/keyboard, a macOS permission prompt appears, or Claude Code asks to approve a tool call, focus can change and actions may be sent to the wrong app or field.
-
-- Pre-grant Screen Recording and Accessibility permissions before running.
-- Pre-approve Claude Code tool permissions for this MCP server so no prompts appear mid-run.
-- Avoid interacting with the computer while scenarios are executing.
-
-### Privacy & Security
-
-All data stays on your machine:
-- Screenshots are captured locally and sent directly to Claude via the MCP protocol
-- No data is uploaded to external servers
-- The MCP server runs entirely offline
-- Source code is open for audit
-
-## MCP Configuration
+## ⚙️ Configuration
 
-
+Add this to your **Claude Desktop** configuration file:
 
-
+**macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
+**Windows:** `%APPDATA%\Claude\claude_desktop_config.json`
 
 ```json
 {
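Reviewer note on the hunk above: the new "Core Capabilities" list names `find_text` and `click` as the main tools. As a rough sketch of how a client might chain them, the snippet below builds JSON-RPC 2.0 `tools/call` requests, the method MCP uses to invoke a tool. The argument names `query`, `x`, and `y` come from the old README's intent and coordinate tables; the result shape (`matches` entries carrying `x` and `y`) is an assumption, and `buildToolCall` and `clickFirstMatch` are hypothetical helpers, not part of this package.

```javascript
// Hypothetical helper: wrap a tool invocation in a JSON-RPC 2.0 `tools/call` request.
let nextId = 1;
function buildToolCall(name, args) {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Assumed result shape: { matches: [{ text, x, y }, ...] } in screen coordinates.
function clickFirstMatch(findTextResult) {
  const [first] = findTextResult.matches || [];
  if (!first) {
    return null; // nothing was found on screen, so there is nothing to click
  }
  return buildToolCall("click", { x: first.x, y: first.y });
}

const findRequest = buildToolCall("find_text", { query: "Submit" });
const clickRequest = clickFirstMatch({
  matches: [{ text: "Submit", x: 450, y: 320 }],
});
console.log(JSON.stringify(findRequest));
console.log(JSON.stringify(clickRequest));
```

Keeping the request builder separate from the decision logic makes it easy to check the "no match" path without a running server.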
@@ -164,262 +87,97 @@ Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):
 }
 ```
 
-Note
+> **Note:** Requires Node.js 18+ installed.
 
-###
+### For Claude Code (CLI) Users
+
+To avoid approving every single tool call (clicks, screenshots), you can add this wildcard permission to your project's settings or global config:
+
+**File:** `.claude/settings.local.json` (or similar)
 
 ```json
 {
-  "
-    "
-      "command": "/path/to/native-devtools-mcp"
-    }
+  "permissions": {
+    "allow": ["mcp__native-devtools__*"]
   }
 }
 ```
 
-##
-
-### System Tools (work with any app)
-
-| Tool | Description |
-|------|-------------|
-| `take_screenshot` | Capture screen, window, or region (base64 PNG). Returns screenshot metadata for coordinate conversion and includes OCR text annotations by default (`include_ocr: true`). |
-| `list_windows` | List visible windows with IDs, titles, bounds |
-| `list_apps` | List running applications |
-| `focus_window` | Bring window/app to front |
-| `get_displays` | Get display info (bounds, scale factors) for coordinate conversion |
-| `find_text` | Find text on screen using OCR; returns screen coordinates for clicking |
-
-### CGEvent Input Tools (work with any app, require Accessibility permission)
-
-| Tool | Description |
-|------|-------------|
-| `click` | Click at screen/window/screenshot coordinates (supports captured screenshot metadata) |
-| `type_text` | Type text at cursor position |
-| `press_key` | Press key combo (e.g., "return", modifiers: ["command"]) |
-| `scroll` | Scroll at position |
-| `drag` | Drag from point to point |
-| `move_mouse` | Move cursor to position |
-
-### AppDebugKit Tools (require app to embed AppDebugKit)
-
-| Tool | Description |
-|------|-------------|
-| `app_connect` | Connect to app's debug server (ws://127.0.0.1:9222). Supports `expected_bundle_id` and `expected_app_name` validation. |
-| `app_disconnect` | Disconnect from app |
-| `app_get_info` | Get app metadata (name, bundle ID, version) |
-| `app_get_tree` | Get view hierarchy |
-| `app_query` | Find elements by CSS-like selector |
-| `app_get_element` | Get element details by ID |
-| `app_click` | Click element by ID |
-| `app_type` | Type text into element |
-| `app_press_key` | Press key in app context |
-| `app_focus` | Focus element (make first responder) |
-| `app_screenshot` | Screenshot element or window |
-| `app_list_windows` | List app's windows |
-| `app_focus_window` | Focus specific window |
-
-Note: app_* tools (except `app_connect`) are only listed after a successful connection. The server emits a tools list change on connect/disconnect, so some clients may need to refresh/re-list tools to see the app_* set.
-
-<details>
-<summary><strong>Agent Context (for automated agents)</strong></summary>
-
-This section provides a compact, machine-readable summary for LLM agents. For a dedicated agent-first index, see `agents.md`.
-
-### Capabilities Matrix
-
-| Intent | Tools | Outputs |
-|--------|-------|---------|
-| Capture screen or window | `take_screenshot` | base64 PNG, metadata (origin, scale), optional OCR text |
-| Find text and click it | `find_text` → `click` | coordinates, click action |
-| List and focus windows | `list_windows` → `focus_window` | window list, focus action |
-| Element-level UI control | `app_connect` → `app_query` → `app_click` | element IDs, click action |
-
-### Structured Intent (YAML)
-
-```yaml
-intents:
-  - name: capture_screenshot
-    tools: [take_screenshot]
-    inputs:
-      scope: { type: string, enum: [screen, window, region] }
-      window_id: { type: number, optional: true }
-      region: { type: object, optional: true }
-      include_ocr: { type: boolean, default: true }
-    outputs:
-      image_base64: { type: string }
-      metadata: { type: object, optional: true }
-      ocr: { type: array, optional: true }
-  - name: find_text_and_click
-    tools: [find_text, click]
-    inputs:
-      query: { type: string }
-      window_id: { type: number, optional: true }
-    outputs:
-      matches: { type: array }
-      clicked: { type: boolean }
-  - name: list_and_focus_window
-    tools: [list_windows, focus_window]
-    inputs:
-      app_name: { type: string, optional: true }
-    outputs:
-      windows: { type: array }
-      focused: { type: boolean }
-  - name: element_level_interaction
-    tools: [app_connect, app_query, app_click, app_type]
-    inputs:
-      selector: { type: string }
-      element_id: { type: string, optional: true }
-      text: { type: string, optional: true }
-    outputs:
-      element: { type: object }
-      ok: { type: boolean }
-```
-
-### Prompt -> Tool -> Output Examples
-
-| User prompt | Tool sequence | Expected output |
-|-------------|---------------|-----------------|
-| "Take a screenshot of the Settings window" | `list_windows` → `take_screenshot(window_id)` | base64 PNG, metadata, OCR text |
-| "Click the OK button" | `take_screenshot` → (vision) → `click(screenshot_x/y + metadata)` | click action |
-| "Find text 'Submit' and click it" | `find_text(query)` → `click(x,y)` | coordinates, click action |
-| "Click the Save button in the AppDebugKit app" | `app_connect` → `app_query("[title=Save]")` → `app_click(element_id)` | element ID, click action |
-
-### Coordinate Usage
-
-| Coordinate source | Click parameters |
-|-------------------|------------------|
-| `find_text` or OCR annotation | `x`, `y` (direct screen coords) |
-| Visual inspection of screenshot | `screenshot_x/y` + metadata from `take_screenshot` |
-
-</details>
-
-## How Screenshots and Clicking Work (macOS)
-
-- **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. Metadata for the screenshot origin and backing scale factor is included for deterministic coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
-- **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using captured metadata when available, otherwise window bounds and display scale are looked up at click time.
-
-## Coordinate Systems and Display Scaling
-
-The `click` tool supports multiple coordinate input methods. Choose based on how you obtained the coordinates:
-
-### OCR Coordinates (from `find_text` or `take_screenshot` OCR)
-
-OCR results return **screen-absolute coordinates** that are ready to use directly:
-
-```json
-// OCR returns: "Submit" at (450, 320)
-// Use direct screen coordinates:
-{ "x": 450, "y": 320 }
-```
+## 🔍 Two Approaches to Interaction
 
-
+We provide two ways for agents to interact, allowing them to choose the best tool for the job.
 
-
+### 1. The "Visual" Approach (Universal)
+**Best for:** 99% of apps (Electron, Qt, Games, Browsers).
+* **How it works:** The agent takes a screenshot, analyzes it visually (or uses OCR), and clicks at coordinates.
+* **Tools:** `take_screenshot`, `find_text`, `click`, `type_text`.
+* **Example:** "Click the button that looks like a gear icon."
 
-
-
-
-
-
-// Pass both the pixel coords and the metadata:
-{ "screenshot_x": 200, "screenshot_y": 100, "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
-```
+### 2. The "Structural" Approach (AppDebugKit)
+**Best for:** Apps specifically instrumented with our AppDebugKit library (mostly for developers testing their own apps).
+* **How it works:** The agent connects to a debug port and queries the UI tree (like HTML DOM).
+* **Tools:** `app_connect`, `app_query`, `app_click`.
+* **Example:** `app_click(element_id="submit-button")`.
 
-
+## 🏗️ Architecture
 
-```
-
-
+```mermaid
+graph TD
+    Client[Claude / LLM Client] <-->|JSON-RPC 2.0| Server[native-devtools-mcp]
+    Server -->|Direct API| Sys[System APIs]
+    Server -->|WebSocket| Debug[AppDebugKit]
 
-
-
+    subgraph "Your Machine"
+        Sys -->|Screen/OCR| macOS[CoreGraphics / Vision]
+        Sys -->|Input| Win[Win32 / SendInput]
+        Debug -.->|Inspect| App[Target App]
+    end
 ```
 
-
-
-| Coordinate source | Click parameters |
-|-------------------|------------------|
-| `find_text` result | `x`, `y` (direct) |
-| `take_screenshot` OCR annotation | `x`, `y` (direct) |
-| Visual inspection of screenshot | `screenshot_x`, `screenshot_y` + metadata |
-| Known window-relative position | `window_x`, `window_y`, `window_id` |
-
-Use `get_displays` to understand the display configuration:
-```json
-{
-  "displays": [{
-    "id": 1,
-    "is_main": true,
-    "bounds": { "x": 0, "y": 0, "width": 3008, "height": 1692 },
-    "backing_scale_factor": 2.0,
-    "pixel_width": 6016,
-    "pixel_height": 3384
-  }]
-}
-```
+<details>
+<summary><strong>🔧 Technical Details (Under the Hood)</strong></summary>
 
-
+| OS | Feature | API Used |
+|----|---------|----------|
+| **macOS** | Screenshots | `screencapture` (CLI) |
+| | Input | `CGEvent` (CoreGraphics) |
+| | OCR | `VNRecognizeTextRequest` (Vision Framework) |
+| **Windows** | Screenshots | `BitBlt` (GDI) |
+| | Input | `SendInput` (Win32) |
+| | OCR | `Windows.Media.Ocr` (WinRT) |
 
-
+</details>
 
-
-```
-User: Click the Submit button in the app
+## 🛡️ Privacy, Safety & Best Practices
 
-
-
-
-
+### 🔒 Privacy First
+* **100% Local:** All processing (screenshots, OCR, logic) happens on your device.
+* **No Cloud:** Images are never uploaded to any third-party server by this tool.
+* **Open Source:** You can inspect the code to verify exactly what it does.
 
-
-
-
+### ⚠️ Operational Safety
+* **Hands Off:** When the agent is "driving" (clicking/typing), **do not move your mouse or type**.
+  * *Why?* Real hardware inputs can conflict with the simulated ones, causing clicks to land in the wrong place.
+* **Focus Matters:** Ensure the window you want the agent to use is visible. If a popup steals focus, the agent might type into the wrong window unless it checks first.
 
-
-[receives image + metadata: {"screenshot_origin_x": 0, "screenshot_origin_y": 0, "screenshot_scale": 2.0}]
-[visually identifies icon at pixel (300, 150) in the image]
-[calls click with screenshot_x=300, screenshot_y=150, screenshot_origin_x=0, screenshot_origin_y=0, screenshot_scale=2.0]
-```
+## 🔐 Required Permissions (macOS)
 
-
+On macOS, you must grant permissions to the **host application** (e.g., Terminal, VS Code, Claude Desktop) to allow screen recording and input control.
 
-
-
+1. **Screen Recording:** Required for `take_screenshot`.
+   * *System Settings > Privacy & Security > Screen Recording*
+2. **Accessibility:** Required for `click`, `type_text`, `scroll`.
+   * *System Settings > Privacy & Security > Accessibility*
 
-
-[calls app_query with selector="[title=Submit]"]
-[receives element_id="view-42"]
-[calls app_click with element_id="view-42"]
-```
+> **Restart Required:** After granting permissions, you must fully quit and restart the host application.
 
-##
+## 🪟 Windows Support
 
-
-
-
-
-└─────────────────┘ └────────┬─────────┘
-│
-┌────────────────────────┼────────────────────────┐
-│ │ │
-▼ ▼ ▼
-┌─────────────┐ ┌─────────────┐ ┌─────────────┐
-│ AppDebugKit │ │ CGEvent │ │ System │
-│ (WebSocket) │ │ Input │ │ APIs │
-│ │ │ │ │ │
-│ Element-level│ │ Coordinate │ │ Screenshots │
-│ interaction │ │ based input │ │ Window enum │
-└─────────────┘ └─────────────┘ └─────────────┘
-│ │
-▼ ▼
-┌─────────────┐ ┌─────────────┐
-│ Apps with │ │ Any app │
-│ AppDebugKit │ │ (egui, etc) │
-└─────────────┘ └─────────────┘
-```
+Works out of the box on **Windows 10/11**.
+* Uses standard Win32 APIs (GDI, SendInput).
+* OCR uses the built-in Windows Media OCR engine (offline).
+* **Note:** Cannot interact with "Run as Administrator" windows unless the MCP server itself is also running as Administrator.
 
-## License
+## 📜 License
 
-MIT
+MIT © [sh3ll3x3c](https://github.com/sh3ll3x3c)
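Reviewer note on the removed coordinate text: the old README asked callers to pass `screenshot_x`/`screenshot_y` together with `screenshot_origin_x`, `screenshot_origin_y`, and `screenshot_scale`, but it never spelled out the arithmetic. The sketch below shows one plausible conversion, assuming the server divides the pixel offset by the backing scale factor and adds the screenshot origin. That formula is an assumption inferred from the `get_displays` example (3008 points wide, 6016 pixels wide, scale 2.0), and `screenshotToScreen` is a hypothetical helper.

```javascript
// Assumed formula (not confirmed by the README): screen = origin + pixel / scale.
function screenshotToScreen(meta) {
  const {
    screenshot_x,
    screenshot_y,
    screenshot_origin_x,
    screenshot_origin_y,
    screenshot_scale,
  } = meta;
  if (!(screenshot_scale > 0)) {
    throw new RangeError("screenshot_scale must be a positive number");
  }
  return {
    x: screenshot_origin_x + screenshot_x / screenshot_scale,
    y: screenshot_origin_y + screenshot_y / screenshot_scale,
  };
}

// Values taken from the removed README example.
const point = screenshotToScreen({
  screenshot_x: 200,
  screenshot_y: 100,
  screenshot_origin_x: 50,
  screenshot_origin_y: 80,
  screenshot_scale: 2.0,
});
console.log(point); // { x: 150, y: 130 }
```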
package/bin/cli.js
CHANGED
@@ -6,6 +6,7 @@ const fs = require("fs");
 
 const PLATFORMS = {
   "darwin-arm64": "@sh3ll3x3c/native-devtools-mcp-darwin-arm64",
+  "win32-x64": "@sh3ll3x3c/native-devtools-mcp-win32-x64",
 };
 
 function getPlatformPackage() {
@@ -16,7 +17,9 @@ function getPlatformPackage() {
   const pkg = PLATFORMS[key];
   if (!pkg) {
     console.error(`Unsupported platform: ${platform}-${arch}`);
-    console.error(
+    console.error(
+      "native-devtools-mcp supports: darwin-arm64 (Apple Silicon), win32-x64 (Windows x64)"
+    );
     process.exit(1);
   }
 
@@ -29,16 +32,20 @@ function findBinary() {
   const platformDir = `${platform}-${arch}`;
   const pkg = getPlatformPackage();
 
+  // Binary name differs by platform
+  const binaryName =
+    platform === "win32" ? "native-devtools-mcp.exe" : "native-devtools-mcp";
+
   // Try to find the platform-specific package
   const possiblePaths = [
     // Local development (binary in sibling directory)
-    path.join(__dirname, "..", platformDir, "bin",
+    path.join(__dirname, "..", platformDir, "bin", binaryName),
     // When installed as a dependency
-    path.join(__dirname, "..", "node_modules", pkg, "bin",
+    path.join(__dirname, "..", "node_modules", pkg, "bin", binaryName),
     // When installed globally or via npx
-    path.join(__dirname, "..", "..", pkg, "bin",
+    path.join(__dirname, "..", "..", pkg, "bin", binaryName),
     // Hoisted in node_modules
-    path.join(__dirname, "..", "..", "..", pkg, "bin",
+    path.join(__dirname, "..", "..", "..", pkg, "bin", binaryName),
   ];
 
   for (const binPath of possiblePaths) {
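Reviewer note on the `cli.js` hunks: the change adds a `win32-x64` entry and a platform-dependent binary name. The snippet below is a minimal, standalone sketch of that lookup. It copies the `PLATFORMS` table and the four candidate locations from the diff, but `candidatePaths` is a hypothetical helper that takes the platform, architecture, and base directory as arguments (the real CLI reads them from the running process and uses `__dirname`) so that the logic can be exercised on any machine.

```javascript
const path = require("path");

// Same table as in the diff above.
const PLATFORMS = {
  "darwin-arm64": "@sh3ll3x3c/native-devtools-mcp-darwin-arm64",
  "win32-x64": "@sh3ll3x3c/native-devtools-mcp-win32-x64",
};

// Hypothetical pure helper: returns the locations to probe, in order.
function candidatePaths(platform, arch, baseDir) {
  const platformDir = `${platform}-${arch}`;
  const pkg = PLATFORMS[platformDir];
  if (!pkg) {
    return []; // unsupported platform; the real CLI prints an error and exits
  }
  // Windows executables carry an .exe suffix; other platforms do not.
  const binaryName =
    platform === "win32" ? "native-devtools-mcp.exe" : "native-devtools-mcp";
  return [
    path.join(baseDir, "..", platformDir, "bin", binaryName),
    path.join(baseDir, "..", "node_modules", pkg, "bin", binaryName),
    path.join(baseDir, "..", "..", pkg, "bin", binaryName),
    path.join(baseDir, "..", "..", "..", pkg, "bin", binaryName),
  ];
}

// Demo: uses the current working directory as a stand-in for __dirname.
console.log(candidatePaths(process.platform, process.arch, process.cwd()));
```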
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "native-devtools-mcp",
-  "version": "0.
+  "version": "0.2.2",
   "description": "MCP server for testing native desktop applications",
   "license": "MIT",
   "repository": {
@@ -15,7 +15,8 @@
     "desktop",
     "testing",
     "automation",
-    "macos"
+    "macos",
+    "windows"
   ],
   "bin": {
     "native-devtools-mcp": "bin/cli.js"
@@ -24,7 +25,8 @@
     "bin"
   ],
   "optionalDependencies": {
-    "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.
+    "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.2.2",
+    "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.2.2"
   },
   "engines": {
     "node": ">=18"
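Reviewer note on the `package.json` hunks: in 0.2.2 both platform packages under `optionalDependencies` carry the same version as the wrapper package. Whether the project enforces that convention is not stated in the diff, so the following is only a sketch of how such a check could look. `findVersionMismatches` is a hypothetical helper, and the manifest literal repeats only the fields visible above.

```javascript
// Hypothetical check: list optional platform packages whose pinned version
// differs from the wrapper package's own version.
function findVersionMismatches(pkg) {
  const deps = pkg.optionalDependencies || {};
  return Object.entries(deps)
    .filter(([, version]) => version !== pkg.version)
    .map(([name, version]) => `${name}@${version} (expected ${pkg.version})`);
}

// Fields copied from the new side of the diff.
const manifest = {
  name: "native-devtools-mcp",
  version: "0.2.2",
  optionalDependencies: {
    "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.2.2",
    "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.2.2",
  },
};

console.log(findVersionMismatches(manifest)); // []
```

A check like this could run before publishing so that the wrapper never points at a platform binary from a different release.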