native-devtools-mcp 0.1.8 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/bin/cli.js +12 -5
  2. package/package.json +5 -3
  3. package/README.md +0 -425
package/bin/cli.js CHANGED
@@ -6,6 +6,7 @@ const fs = require("fs");
 
  const PLATFORMS = {
    "darwin-arm64": "@sh3ll3x3c/native-devtools-mcp-darwin-arm64",
+   "win32-x64": "@sh3ll3x3c/native-devtools-mcp-win32-x64",
  };
 
  function getPlatformPackage() {
@@ -16,7 +17,9 @@ function getPlatformPackage() {
    const pkg = PLATFORMS[key];
    if (!pkg) {
      console.error(`Unsupported platform: ${platform}-${arch}`);
-     console.error("native-devtools-mcp supports: darwin-arm64 (Apple Silicon)");
+     console.error(
+       "native-devtools-mcp supports: darwin-arm64 (Apple Silicon), win32-x64 (Windows x64)"
+     );
      process.exit(1);
    }
 
@@ -29,16 +32,20 @@ function findBinary() {
    const platformDir = `${platform}-${arch}`;
    const pkg = getPlatformPackage();
 
+   // Binary name differs by platform
+   const binaryName =
+     platform === "win32" ? "native-devtools-mcp.exe" : "native-devtools-mcp";
+
    // Try to find the platform-specific package
    const possiblePaths = [
      // Local development (binary in sibling directory)
-     path.join(__dirname, "..", platformDir, "bin", "native-devtools-mcp"),
+     path.join(__dirname, "..", platformDir, "bin", binaryName),
      // When installed as a dependency
-     path.join(__dirname, "..", "node_modules", pkg, "bin", "native-devtools-mcp"),
+     path.join(__dirname, "..", "node_modules", pkg, "bin", binaryName),
      // When installed globally or via npx
-     path.join(__dirname, "..", "..", pkg, "bin", "native-devtools-mcp"),
+     path.join(__dirname, "..", "..", pkg, "bin", binaryName),
      // Hoisted in node_modules
-     path.join(__dirname, "..", "..", "..", pkg, "bin", "native-devtools-mcp"),
+     path.join(__dirname, "..", "..", "..", pkg, "bin", binaryName),
    ];
 
    for (const binPath of possiblePaths) {
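Reviewer note: the net effect of the cli.js hunks above is that the launcher now maps two platform keys to packages and picks an `.exe` binary name on Windows. The sketch below restates that lookup as a standalone function so the behavior can be checked in isolation. It is not the package's code: `candidatePaths` and its `baseDir` parameter are hypothetical names introduced here, while the platform keys, package names, and path order are copied from the diff.

```javascript
const path = require("path");

// Platform key -> optional dependency name, as listed in PLATFORMS in 0.2.1.
const PLATFORMS = {
  "darwin-arm64": "@sh3ll3x3c/native-devtools-mcp-darwin-arm64",
  "win32-x64": "@sh3ll3x3c/native-devtools-mcp-win32-x64",
};

// Hypothetical helper (not in cli.js): returns the candidate binary paths in
// the same order the diff lists them, or null for an unsupported platform.
function candidatePaths(baseDir, platform, arch) {
  const platformDir = `${platform}-${arch}`;
  const pkg = PLATFORMS[platformDir];
  if (!pkg) return null;

  // Binary name differs by platform (the change introduced in this diff).
  const binaryName =
    platform === "win32" ? "native-devtools-mcp.exe" : "native-devtools-mcp";

  return [
    path.join(baseDir, "..", platformDir, "bin", binaryName), // local development
    path.join(baseDir, "..", "node_modules", pkg, "bin", binaryName), // dependency
    path.join(baseDir, "..", "..", pkg, "bin", binaryName), // global or npx
    path.join(baseDir, "..", "..", "..", pkg, "bin", binaryName), // hoisted
  ];
}
```

Calling `candidatePaths(__dirname, process.platform, process.arch)` on an unsupported platform such as linux-x64 returns `null` in this sketch, whereas the real launcher prints the supported list and exits with status 1.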
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "native-devtools-mcp",
-   "version": "0.1.8",
+   "version": "0.2.1",
    "description": "MCP server for testing native desktop applications",
    "license": "MIT",
    "repository": {
@@ -15,7 +15,8 @@
      "desktop",
      "testing",
      "automation",
-     "macos"
+     "macos",
+     "windows"
    ],
    "bin": {
      "native-devtools-mcp": "bin/cli.js"
@@ -24,7 +25,8 @@
      "bin"
    ],
    "optionalDependencies": {
-     "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.8"
+     "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.2.1",
+     "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.2.1"
    },
    "engines": {
      "node": ">=18"
package/README.md DELETED
@@ -1,425 +0,0 @@
- # native-devtools-mcp
-
- A Model Context Protocol (MCP) server for testing native desktop applications, similar to how Chrome DevTools enables web UI testing.
-
- > **100% Local & Private** - All processing happens on your machine. No data is sent to external servers. Screenshots, UI interactions, and app data never leave your device.
-
- ![Demo](demo.gif)
-
- ## Platform Support
-
- | Platform | Status |
- |----------|--------|
- | **macOS** | Supported |
- | **Windows** | Planned |
- | **Linux** | Planned |
-
- > Windows and Linux support will be added in future releases with platform-specific backends.
-
- ## Overview
-
- This MCP server enables LLM-driven testing of native desktop apps by providing:
- - **Screenshots** - Capture full screen, windows, or regions
- - **Input simulation** - Click, type, scroll, drag via platform-native events
- - **Window/app enumeration** - List and focus windows and applications
-
- ## Two Approaches to UI Interaction
-
- This server provides **two distinct approaches** for interacting with application UIs. They are kept separate intentionally to give the LLM full control over which approach to use based on context.
-
- ### 1. AppDebugKit (`app_*` tools) - Element-Level Precision
-
- For applications that embed [AppDebugKit](./AppDebugKit/), you get:
- - **Element targeting by ID** - Click buttons, fill text fields by element reference
- - **CSS-like selectors** - Query elements with `#id`, `.ClassName`, `[title=Save]`
- - **View hierarchy inspection** - Traverse the UI tree programmatically
- - **Framework-aware** - Works with AppKit and SwiftUI controls
-
- **Best for:** Apps you control, AppKit/SwiftUI apps, when you need reliable element targeting.
-
- ```
- app_connect → app_get_tree → app_query(".NSButton") → app_click(element_id)
- ```
-
- ### 2. CGEvent (`click`, `type_text`, etc.) - Universal Compatibility
-
- For **any application** regardless of framework:
- - **Screen coordinate targeting** - Click at (x, y) positions
- - **Works with any UI framework** - egui, Electron, Qt, games, anything
- - **No app modification required** - Just needs Accessibility permission
-
- **Best for:** Third-party apps, egui/Electron/Qt apps, when AppDebugKit isn't available.
-
- ```
- take_screenshot → (analyze visually) → click(x=500, y=300)
- ```
-
- ### Why Two Approaches?
-
- The LLM needs to choose the right approach based on what it observes:
-
- | Scenario | Recommended Approach |
- |----------|---------------------|
- | App with AppDebugKit embedded | `app_*` tools - reliable element IDs |
- | egui/Electron/Qt app | CGEvent tools - coordinate-based |
- | Unknown app | Try `app_connect`, fall back to CGEvent |
- | App with poor view hierarchy | CGEvent even if AppDebugKit connected |
-
- Merging them into auto-fallback would hide important context from the LLM and reduce its ability to make informed decisions.
-
- ## Installation
-
- ### Option 1: npm (Recommended)
-
- ```bash
- # Install globally
- npm install -g native-devtools-mcp
-
- # Or run directly with npx
- npx native-devtools-mcp
- ```
-
- ### Option 2: Build from source
-
- ```bash
- # Clone the repository
- git clone https://github.com/sh3ll3x3c/native-devtools-mcp
- cd native-devtools-mcp
-
- # Build
- cargo build --release
-
- # Binary location
- ./target/release/native-devtools-mcp
- ```
-
- ## Required Permissions (macOS)
-
- This MCP server requires macOS privacy permissions to capture screenshots and simulate input. **These permissions are required for the tools to function.**
-
- ### Step-by-Step Setup
-
- #### 1. Screen Recording Permission (required for screenshots)
-
- 1. Open **System Settings** → **Privacy & Security** → **Screen Recording**
- 2. Click the **+** button (you may need to unlock with your password)
- 3. Add the app that runs Claude Code:
-    - **VS Code**: `/Applications/Visual Studio Code.app`
-    - **Terminal**: `/Applications/Utilities/Terminal.app`
-    - **iTerm**: `/Applications/iTerm.app`
- 4. **Quit and restart the app completely** (not just reload)
-
- #### 2. Accessibility Permission (required for click, type, scroll)
-
- 1. Open **System Settings** → **Privacy & Security** → **Accessibility**
- 2. Click the **+** button
- 3. Add the same app as above (VS Code, Terminal, etc.)
- 4. **Quit and restart the app completely**
-
- #### 3. macOS Version (for OCR features)
-
- The `find_text` tool and `take_screenshot` OCR feature use Apple's Vision framework for text recognition. This is the **recommended way** to interact with apps when AppDebugKit is not available.
-
- - **macOS 10.15+ (Catalina)** required for OCR (`VNRecognizeTextRequest`)
- - No additional software installation needed - Vision is built into macOS
-
- ### Important Notes
-
- - **Grant permissions to the host app** (VS Code, Terminal), not to the MCP server binary itself
- - **Restart is required** - Permissions don't take effect until you fully quit and reopen the app
- - **No popup appears** - macOS won't prompt you; it silently fails if permissions are missing
- - If you see `could not create image from display`, you need Screen Recording permission
- - If clicks don't work, you need Accessibility permission
-
- ### During Automation (Important)
-
- These tools assume the target window stays focused. If you use the mouse/keyboard, a macOS permission prompt appears, or Claude Code asks to approve a tool call, focus can change and actions may be sent to the wrong app or field.
-
- - Pre-grant Screen Recording and Accessibility permissions before running.
- - Pre-approve Claude Code tool permissions for this MCP server so no prompts appear mid-run.
- - Avoid interacting with the computer while scenarios are executing.
-
- ### Privacy & Security
-
- All data stays on your machine:
- - Screenshots are captured locally and sent directly to Claude via the MCP protocol
- - No data is uploaded to external servers
- - The MCP server runs entirely offline
- - Source code is open for audit
-
- ## MCP Configuration
-
- ### Getting started
-
- Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):
-
- ```json
- {
-   "mcpServers": {
-     "native-devtools": {
-       "command": "npx",
-       "args": ["-y", "native-devtools-mcp"]
-     }
-   }
- }
- ```
-
- Note: Use `native-devtools-mcp@latest` if you want to always run the newest version.
-
- ### Build from source
-
- ```json
- {
-   "mcpServers": {
-     "native-devtools": {
-       "command": "/path/to/native-devtools-mcp"
-     }
-   }
- }
- ```
-
- ## Tools
-
- ### System Tools (work with any app)
-
- | Tool | Description |
- |------|-------------|
- | `take_screenshot` | Capture screen, window, or region (base64 PNG). Returns screenshot metadata for coordinate conversion and includes OCR text annotations by default (`include_ocr: true`). |
- | `list_windows` | List visible windows with IDs, titles, bounds |
- | `list_apps` | List running applications |
- | `focus_window` | Bring window/app to front |
- | `get_displays` | Get display info (bounds, scale factors) for coordinate conversion |
- | `find_text` | Find text on screen using OCR; returns screen coordinates for clicking |
-
- ### CGEvent Input Tools (work with any app, require Accessibility permission)
-
- | Tool | Description |
- |------|-------------|
- | `click` | Click at screen/window/screenshot coordinates (supports captured screenshot metadata) |
- | `type_text` | Type text at cursor position |
- | `press_key` | Press key combo (e.g., "return", modifiers: ["command"]) |
- | `scroll` | Scroll at position |
- | `drag` | Drag from point to point |
- | `move_mouse` | Move cursor to position |
-
- ### AppDebugKit Tools (require app to embed AppDebugKit)
-
- | Tool | Description |
- |------|-------------|
- | `app_connect` | Connect to app's debug server (ws://127.0.0.1:9222). Supports `expected_bundle_id` and `expected_app_name` validation. |
- | `app_disconnect` | Disconnect from app |
- | `app_get_info` | Get app metadata (name, bundle ID, version) |
- | `app_get_tree` | Get view hierarchy |
- | `app_query` | Find elements by CSS-like selector |
- | `app_get_element` | Get element details by ID |
- | `app_click` | Click element by ID |
- | `app_type` | Type text into element |
- | `app_press_key` | Press key in app context |
- | `app_focus` | Focus element (make first responder) |
- | `app_screenshot` | Screenshot element or window |
- | `app_list_windows` | List app's windows |
- | `app_focus_window` | Focus specific window |
-
- Note: app_* tools (except `app_connect`) are only listed after a successful connection. The server emits a tools list change on connect/disconnect, so some clients may need to refresh/re-list tools to see the app_* set.
-
- <details>
- <summary><strong>Agent Context (for automated agents)</strong></summary>
-
- This section provides a compact, machine-readable summary for LLM agents. For a dedicated agent-first index, see `agents.md`.
-
- ### Capabilities Matrix
-
- | Intent | Tools | Outputs |
- |--------|-------|---------|
- | Capture screen or window | `take_screenshot` | base64 PNG, metadata (origin, scale), optional OCR text |
- | Find text and click it | `find_text` → `click` | coordinates, click action |
- | List and focus windows | `list_windows` → `focus_window` | window list, focus action |
- | Element-level UI control | `app_connect` → `app_query` → `app_click` | element IDs, click action |
-
- ### Structured Intent (YAML)
-
- ```yaml
- intents:
-   - name: capture_screenshot
-     tools: [take_screenshot]
-     inputs:
-       scope: { type: string, enum: [screen, window, region] }
-       window_id: { type: number, optional: true }
-       region: { type: object, optional: true }
-       include_ocr: { type: boolean, default: true }
-     outputs:
-       image_base64: { type: string }
-       metadata: { type: object, optional: true }
-       ocr: { type: array, optional: true }
-   - name: find_text_and_click
-     tools: [find_text, click]
-     inputs:
-       query: { type: string }
-       window_id: { type: number, optional: true }
-     outputs:
-       matches: { type: array }
-       clicked: { type: boolean }
-   - name: list_and_focus_window
-     tools: [list_windows, focus_window]
-     inputs:
-       app_name: { type: string, optional: true }
-     outputs:
-       windows: { type: array }
-       focused: { type: boolean }
-   - name: element_level_interaction
-     tools: [app_connect, app_query, app_click, app_type]
-     inputs:
-       selector: { type: string }
-       element_id: { type: string, optional: true }
-       text: { type: string, optional: true }
-     outputs:
-       element: { type: object }
-       ok: { type: boolean }
- ```
-
- ### Prompt -> Tool -> Output Examples
-
- | User prompt | Tool sequence | Expected output |
- |-------------|---------------|-----------------|
- | "Take a screenshot of the Settings window" | `list_windows` → `take_screenshot(window_id)` | base64 PNG, metadata, OCR text |
- | "Click the OK button" | `take_screenshot` → (vision) → `click(screenshot_x/y + metadata)` | click action |
- | "Find text 'Submit' and click it" | `find_text(query)` → `click(x,y)` | coordinates, click action |
- | "Click the Save button in the AppDebugKit app" | `app_connect` → `app_query("[title=Save]")` → `app_click(element_id)` | element ID, click action |
-
- ### Coordinate Usage
-
- | Coordinate source | Click parameters |
- |-------------------|------------------|
- | `find_text` or OCR annotation | `x`, `y` (direct screen coords) |
- | Visual inspection of screenshot | `screenshot_x/y` + metadata from `take_screenshot` |
-
- </details>
-
- ## How Screenshots and Clicking Work (macOS)
-
- - **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. Metadata for the screenshot origin and backing scale factor is included for deterministic coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
- - **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using captured metadata when available, otherwise window bounds and display scale are looked up at click time.
-
- ## Coordinate Systems and Display Scaling
-
- The `click` tool supports multiple coordinate input methods. Choose based on how you obtained the coordinates:
-
- ### OCR Coordinates (from `find_text` or `take_screenshot` OCR)
-
- OCR results return **screen-absolute coordinates** that are ready to use directly:
-
- ```json
- // OCR returns: "Submit" at (450, 320)
- // Use direct screen coordinates:
- { "x": 450, "y": 320 }
- ```
-
- ### Screenshot Pixel Coordinates (from visual inspection)
-
- When you visually identify a click target in a screenshot image, use the pixel coordinates with the screenshot metadata:
-
- ```json
- // take_screenshot returns metadata:
- // { "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
- //
- // You identify a button at pixel (200, 100) in the image.
- // Pass both the pixel coords and the metadata:
- { "screenshot_x": 200, "screenshot_y": 100, "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
- ```
-
- ### Other Coordinate Methods
-
- ```json
- // Window-relative (converted using window bounds)
- { "window_x": 100, "window_y": 50, "window_id": 1234 }
-
- // Screenshot pixels (legacy: window lookup at click time - less reliable)
- { "screenshot_x": 200, "screenshot_y": 100, "screenshot_window_id": 1234 }
- ```
-
- ### Quick Reference
-
- | Coordinate source | Click parameters |
- |-------------------|------------------|
- | `find_text` result | `x`, `y` (direct) |
- | `take_screenshot` OCR annotation | `x`, `y` (direct) |
- | Visual inspection of screenshot | `screenshot_x`, `screenshot_y` + metadata |
- | Known window-relative position | `window_x`, `window_y`, `window_id` |
-
- Use `get_displays` to understand the display configuration:
- ```json
- {
-   "displays": [{
-     "id": 1,
-     "is_main": true,
-     "bounds": { "x": 0, "y": 0, "width": 3008, "height": 1692 },
-     "backing_scale_factor": 2.0,
-     "pixel_width": 6016,
-     "pixel_height": 3384
-   }]
- }
- ```
-
- ## Example Usage
-
- ### With CGEvent (any app)
-
- **Option A: Using OCR (recommended for text elements)**
- ```
- User: Click the Submit button in the app
-
- Claude: [calls find_text with text="Submit"]
-         [receives: {"text": "Submit", "x": 450, "y": 320}]
-         [calls click with x=450, y=320]
- ```
-
- **Option B: Using screenshot metadata (for any visual element)**
- ```
- User: Click the icon next to Settings
-
- Claude: [calls take_screenshot]
-         [receives image + metadata: {"screenshot_origin_x": 0, "screenshot_origin_y": 0, "screenshot_scale": 2.0}]
-         [visually identifies icon at pixel (300, 150) in the image]
-         [calls click with screenshot_x=300, screenshot_y=150, screenshot_origin_x=0, screenshot_origin_y=0, screenshot_scale=2.0]
- ```
-
- ### With AppDebugKit (embedded app)
-
- ```
- User: Click the Submit button in the app
-
- Claude: [calls app_connect with url="ws://127.0.0.1:9222"]
-         [calls app_query with selector="[title=Submit]"]
-         [receives element_id="view-42"]
-         [calls app_click with element_id="view-42"]
- ```
-
- ## Architecture
-
- ```
- ┌─────────────────┐     JSON-RPC 2.0     ┌──────────────────┐
- │  Claude/Client  │ ◄──────────────────► │ native-devtools  │
- │  (with vision)  │        stdio         │    MCP Server    │
- └─────────────────┘                      └────────┬─────────┘
-                                                   │
-                          ┌────────────────────────┼────────────────────────┐
-                          │                        │                        │
-                          ▼                        ▼                        ▼
-                   ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
-                   │ AppDebugKit │          │   CGEvent   │          │   System    │
-                   │ (WebSocket) │          │    Input    │          │    APIs     │
-                   │             │          │             │          │             │
-                   │ Element-level│         │ Coordinate  │          │ Screenshots │
-                   │ interaction │          │ based input │          │ Window enum │
-                   └─────────────┘          └─────────────┘          └─────────────┘
-                          │                        │
-                          ▼                        ▼
-                   ┌─────────────┐          ┌─────────────┐
-                   │ Apps with   │          │   Any app   │
-                   │ AppDebugKit │          │ (egui, etc) │
-                   └─────────────┘          └─────────────┘
- ```
-
- ## License
-
- MIT
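Reviewer note: the deleted README says screenshot-pixel coordinates are converted to screen coordinates using the screenshot origin and backing scale factor, but it does not spell out the arithmetic. The sketch below shows one plausible reading, assuming the conversion is `screen = origin + pixel / scale`. That formula is an assumption made for illustration, not taken from the package source, and `screenshotToScreen` is a hypothetical name. The field names and sample values come from the README's own example.

```javascript
// Hypothetical helper: converts screenshot pixel coordinates to screen
// coordinates. ASSUMPTION: screen = origin + pixel / scale, i.e. the image is
// captured at `screenshot_scale` pixels per screen point.
function screenshotToScreen(click) {
  const {
    screenshot_x,
    screenshot_y,
    screenshot_origin_x,
    screenshot_origin_y,
    screenshot_scale,
  } = click;

  // Reject zero, negative, or missing scale values before dividing.
  if (!(screenshot_scale > 0)) {
    throw new RangeError("screenshot_scale must be a positive number");
  }

  return {
    x: screenshot_origin_x + screenshot_x / screenshot_scale,
    y: screenshot_origin_y + screenshot_y / screenshot_scale,
  };
}

// Sample values from the README's "Screenshot Pixel Coordinates" example.
const point = screenshotToScreen({
  screenshot_x: 200,
  screenshot_y: 100,
  screenshot_origin_x: 50,
  screenshot_origin_y: 80,
  screenshot_scale: 2.0,
});
console.log(point); // { x: 150, y: 130 } under the assumed formula
```

Under this assumption a pixel at (200, 100) in a 2x capture whose origin is (50, 80) lands at screen point (150, 130). Coordinates returned by `find_text` or OCR annotations would bypass this step, since the README describes them as already screen-absolute.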