native-devtools-mcp 0.1.7 → 0.1.8
- package/README.md +144 -16
- package/package.json +2 -2
package/README.md
CHANGED
|
@@ -4,6 +4,8 @@ A Model Context Protocol (MCP) server for testing native desktop applications, s
|
|
|
4
4
|
|
|
5
5
|
> **100% Local & Private** - All processing happens on your machine. No data is sent to external servers. Screenshots, UI interactions, and app data never leave your device.
|
|
6
6
|
|
|
7
|
+

|
|
8
|
+
|
|
7
9
|
## Platform Support
|
|
8
10
|
|
|
9
11
|
| Platform | Status |
|
|
@@ -114,13 +116,12 @@ This MCP server requires macOS privacy permissions to capture screenshots and si
|
|
|
114
116
|
3. Add the same app as above (VS Code, Terminal, etc.)
|
|
115
117
|
4. **Quit and restart the app completely**
|
|
116
118
|
|
|
117
|
-
#### 3.
|
|
119
|
+
#### 3. macOS Version (for OCR features)
|
|
118
120
|
|
|
119
|
-
The `find_text` tool
|
|
121
|
+
The `find_text` tool and the `take_screenshot` OCR feature use Apple's Vision framework for text recognition. OCR is the **recommended way** to interact with apps when AppDebugKit is not available.
|
|
120
122
|
|
|
121
|
-
|
|
122
|
-
|
|
123
|
-
```
|
|
123
|
+
- **macOS 10.15 (Catalina) or later** is required for OCR (`VNRecognizeTextRequest`)
|
|
124
|
+
- No additional software installation needed - Vision is built into macOS
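
A client could check this requirement before relying on OCR. The sketch below is an illustration only and not part of the package: `supports_vision_ocr` is a hypothetical helper that compares a macOS version string (for example the one returned by Python's `platform.mac_ver()[0]`) against the 10.15 threshold stated above.

```python
import platform


def supports_vision_ocr(version: str) -> bool:
    """Return True if a macOS version string is 10.15 or newer.

    Hypothetical helper for illustration; the 10.15 threshold is taken
    from the requirement listed above.
    """
    if not version:
        # platform.mac_ver() reports an empty string on non-macOS systems.
        return False
    # Compare only major and minor components, e.g. "10.15.7" -> [10, 15].
    parts = [int(p) for p in version.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts) >= (10, 15)


if __name__ == "__main__":
    current = platform.mac_ver()[0]
    print("OCR available:", supports_vision_ocr(current))
```

The version is passed in as a string so the check can be exercised on any platform, not only on macOS.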
|
|
124
125
|
|
|
125
126
|
### Important Notes
|
|
126
127
|
|
|
@@ -130,6 +131,14 @@ brew install tesseract
|
|
|
130
131
|
- If you see `could not create image from display`, you need Screen Recording permission
|
|
131
132
|
- If clicks don't work, you need Accessibility permission
|
|
132
133
|
|
|
134
|
+
### During Automation (Important)
|
|
135
|
+
|
|
136
|
+
These tools assume the target window stays focused. If you use the mouse/keyboard, a macOS permission prompt appears, or Claude Code asks to approve a tool call, focus can change and actions may be sent to the wrong app or field.
|
|
137
|
+
|
|
138
|
+
- Pre-grant Screen Recording and Accessibility permissions before running.
|
|
139
|
+
- Pre-approve Claude Code tool permissions for this MCP server so no prompts appear mid-run.
|
|
140
|
+
- Avoid interacting with the computer while scenarios are executing.
|
|
141
|
+
|
|
133
142
|
### Privacy & Security
|
|
134
143
|
|
|
135
144
|
All data stays on your machine:
|
|
@@ -140,6 +149,8 @@ All data stays on your machine:
|
|
|
140
149
|
|
|
141
150
|
## MCP Configuration
|
|
142
151
|
|
|
152
|
+
### Getting started
|
|
153
|
+
|
|
143
154
|
Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):
|
|
144
155
|
|
|
145
156
|
```json
|
|
@@ -153,7 +164,9 @@ Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):
|
|
|
153
164
|
}
|
|
154
165
|
```
|
|
155
166
|
|
|
156
|
-
|
|
167
|
+
Note: Use `native-devtools-mcp@latest` if you want to always run the newest version.
|
|
168
|
+
|
|
169
|
+
### Build from source
|
|
157
170
|
|
|
158
171
|
```json
|
|
159
172
|
{
|
|
@@ -171,7 +184,7 @@ Or if you built from source:
|
|
|
171
184
|
|
|
172
185
|
| Tool | Description |
|
|
173
186
|
|------|-------------|
|
|
174
|
-
| `take_screenshot` | Capture screen, window, or region (base64 PNG).
|
|
187
|
+
| `take_screenshot` | Capture screen, window, or region (base64 PNG). Returns screenshot metadata for coordinate conversion and includes OCR text annotations by default (`include_ocr: true`). |
|
|
175
188
|
| `list_windows` | List visible windows with IDs, titles, bounds |
|
|
176
189
|
| `list_apps` | List running applications |
|
|
177
190
|
| `focus_window` | Bring window/app to front |
|
|
@@ -182,7 +195,7 @@ Or if you built from source:
|
|
|
182
195
|
|
|
183
196
|
| Tool | Description |
|
|
184
197
|
|------|-------------|
|
|
185
|
-
| `click` | Click at screen/window/screenshot coordinates |
|
|
198
|
+
| `click` | Click at screen/window/screenshot coordinates (supports captured screenshot metadata) |
|
|
186
199
|
| `type_text` | Type text at cursor position |
|
|
187
200
|
| `press_key` | Press key combo (e.g., "return", modifiers: ["command"]) |
|
|
188
201
|
| `scroll` | Scroll at position |
|
|
@@ -209,26 +222,130 @@ Or if you built from source:
|
|
|
209
222
|
|
|
210
223
|
Note: `app_*` tools (except `app_connect`) are listed only after a successful connection. The server emits a tools-list-changed notification on connect and disconnect, so some clients may need to refresh or re-list tools to see the `app_*` set.
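
One way a client might handle this is to re-list tools whenever the change notification arrives. The sketch below assumes the standard MCP notification method `notifications/tools/list_changed`; the `ToolCache` class and the `list_tools` callable are hypothetical stand-ins for whatever the client library provides.

```python
from typing import Callable, Dict, List

# Notification method defined by the Model Context Protocol for tool list updates.
TOOLS_LIST_CHANGED = "notifications/tools/list_changed"


class ToolCache:
    """Keeps a local copy of tool names and refreshes it on change notifications.

    Hypothetical helper: `list_tools` is any callable that returns the
    server's current tool names.
    """

    def __init__(self, list_tools: Callable[[], List[str]]) -> None:
        self._list_tools = list_tools
        self.names: List[str] = list_tools()

    def handle_notification(self, message: Dict) -> bool:
        """Re-list tools if the message is a list-changed notification.

        Returns True when a refresh happened, False otherwise.
        """
        if message.get("method") == TOOLS_LIST_CHANGED:
            self.names = self._list_tools()
            return True
        return False
```

With a cache like this, the `app_*` tools become visible to the client as soon as the notification sent after `app_connect` has been processed.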
|
|
211
224
|
|
|
225
|
+
<details>
|
|
226
|
+
<summary><strong>Agent Context (for automated agents)</strong></summary>
|
|
227
|
+
|
|
228
|
+
This section provides a compact, machine-readable summary for LLM agents. For a dedicated agent-first index, see `agents.md`.
|
|
229
|
+
|
|
230
|
+
### Capabilities Matrix
|
|
231
|
+
|
|
232
|
+
| Intent | Tools | Outputs |
|
|
233
|
+
|--------|-------|---------|
|
|
234
|
+
| Capture screen or window | `take_screenshot` | base64 PNG, metadata (origin, scale), optional OCR text |
|
|
235
|
+
| Find text and click it | `find_text` → `click` | coordinates, click action |
|
|
236
|
+
| List and focus windows | `list_windows` → `focus_window` | window list, focus action |
|
|
237
|
+
| Element-level UI control | `app_connect` → `app_query` → `app_click` | element IDs, click action |
|
|
238
|
+
|
|
239
|
+
### Structured Intent (YAML)
|
|
240
|
+
|
|
241
|
+
```yaml
|
|
242
|
+
intents:
|
|
243
|
+
- name: capture_screenshot
|
|
244
|
+
tools: [take_screenshot]
|
|
245
|
+
inputs:
|
|
246
|
+
scope: { type: string, enum: [screen, window, region] }
|
|
247
|
+
window_id: { type: number, optional: true }
|
|
248
|
+
region: { type: object, optional: true }
|
|
249
|
+
include_ocr: { type: boolean, default: true }
|
|
250
|
+
outputs:
|
|
251
|
+
image_base64: { type: string }
|
|
252
|
+
metadata: { type: object, optional: true }
|
|
253
|
+
ocr: { type: array, optional: true }
|
|
254
|
+
- name: find_text_and_click
|
|
255
|
+
tools: [find_text, click]
|
|
256
|
+
inputs:
|
|
257
|
+
query: { type: string }
|
|
258
|
+
window_id: { type: number, optional: true }
|
|
259
|
+
outputs:
|
|
260
|
+
matches: { type: array }
|
|
261
|
+
clicked: { type: boolean }
|
|
262
|
+
- name: list_and_focus_window
|
|
263
|
+
tools: [list_windows, focus_window]
|
|
264
|
+
inputs:
|
|
265
|
+
app_name: { type: string, optional: true }
|
|
266
|
+
outputs:
|
|
267
|
+
windows: { type: array }
|
|
268
|
+
focused: { type: boolean }
|
|
269
|
+
- name: element_level_interaction
|
|
270
|
+
tools: [app_connect, app_query, app_click, app_type]
|
|
271
|
+
inputs:
|
|
272
|
+
selector: { type: string }
|
|
273
|
+
element_id: { type: string, optional: true }
|
|
274
|
+
text: { type: string, optional: true }
|
|
275
|
+
outputs:
|
|
276
|
+
element: { type: object }
|
|
277
|
+
ok: { type: boolean }
|
|
278
|
+
```
|
|
279
|
+
|
|
280
|
+
### Prompt -> Tool -> Output Examples
|
|
281
|
+
|
|
282
|
+
| User prompt | Tool sequence | Expected output |
|
|
283
|
+
|-------------|---------------|-----------------|
|
|
284
|
+
| "Take a screenshot of the Settings window" | `list_windows` → `take_screenshot(window_id)` | base64 PNG, metadata, OCR text |
|
|
285
|
+
| "Click the OK button" | `take_screenshot` → (vision) → `click(screenshot_x/y + metadata)` | click action |
|
|
286
|
+
| "Find text 'Submit' and click it" | `find_text(query)` → `click(x,y)` | coordinates, click action |
|
|
287
|
+
| "Click the Save button in the AppDebugKit app" | `app_connect` → `app_query("[title=Save]")` → `app_click(element_id)` | element ID, click action |
|
|
288
|
+
|
|
289
|
+
### Coordinate Usage
|
|
290
|
+
|
|
291
|
+
| Coordinate source | Click parameters |
|
|
292
|
+
|-------------------|------------------|
|
|
293
|
+
| `find_text` or OCR annotation | `x`, `y` (direct screen coords) |
|
|
294
|
+
| Visual inspection of screenshot | `screenshot_x/y` + metadata from `take_screenshot` |
|
|
295
|
+
|
|
296
|
+
</details>
|
|
297
|
+
|
|
212
298
|
## How Screenshots and Clicking Work (macOS)
|
|
213
299
|
|
|
214
|
-
- **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64.
|
|
215
|
-
- **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using window bounds and display scale.
|
|
300
|
+
- **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. Metadata for the screenshot origin and backing scale factor is included for deterministic coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
|
|
301
|
+
- **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using the captured metadata when available; otherwise, window bounds and display scale are looked up at click time.
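
To make the `screencapture` flags above concrete, here is a minimal sketch that assembles a command line from them. How the server actually builds its command is not shown in this README, so `build_screencapture_args` and its parameters are hypothetical; only the flags themselves (`-x`, `-C`, `-R`, `-l`, `-o`) come from the text.

```python
from typing import List, Optional, Tuple


def build_screencapture_args(
    output_path: str,
    window_id: Optional[int] = None,
    region: Optional[Tuple[int, int, int, int]] = None,
    include_cursor: bool = False,
) -> List[str]:
    """Assemble a screencapture argument list (hypothetical helper).

    A window capture takes precedence over a region capture; with neither,
    the whole screen is captured.
    """
    args = ["screencapture", "-x"]  # -x: capture silently
    if include_cursor:
        args.append("-C")  # -C: include the cursor
    if window_id is not None:
        # -l<id>: capture one window; -o: exclude its shadow
        args += ["-o", f"-l{window_id}"]
    elif region is not None:
        x, y, w, h = region
        args.append(f"-R{x},{y},{w},{h}")  # -R: capture a screen rectangle
    args.append(output_path)
    return args
```

The resulting list could be passed to `subprocess.run` on macOS; the function itself only builds arguments, so it can be inspected on any platform.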
|
|
216
302
|
|
|
217
303
|
## Coordinate Systems and Display Scaling
|
|
218
304
|
|
|
219
|
-
The `click` tool supports
|
|
305
|
+
The `click` tool supports multiple coordinate input methods. Choose based on how you obtained the coordinates:
|
|
306
|
+
|
|
307
|
+
### OCR Coordinates (from `find_text` or `take_screenshot` OCR)
|
|
308
|
+
|
|
309
|
+
OCR results return **screen-absolute coordinates** that are ready to use directly:
|
|
220
310
|
|
|
221
311
|
```json
|
|
222
|
-
//
|
|
223
|
-
|
|
312
|
+
// OCR returns: "Submit" at (450, 320)
|
|
313
|
+
// Use direct screen coordinates:
|
|
314
|
+
{ "x": 450, "y": 320 }
|
|
315
|
+
```
|
|
316
|
+
|
|
317
|
+
### Screenshot Pixel Coordinates (from visual inspection)
|
|
318
|
+
|
|
319
|
+
When you visually identify a click target in a screenshot image, use the pixel coordinates with the screenshot metadata:
|
|
320
|
+
|
|
321
|
+
```json
|
|
322
|
+
// take_screenshot returns metadata:
|
|
323
|
+
// { "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
|
|
324
|
+
//
|
|
325
|
+
// You identify a button at pixel (200, 100) in the image.
|
|
326
|
+
// Pass both the pixel coords and the metadata:
|
|
327
|
+
{ "screenshot_x": 200, "screenshot_y": 100, "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
|
|
328
|
+
```
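
For intuition, the conversion implied by this metadata can be sketched as follows. This assumes the origin is expressed in screen points and that `screenshot_scale` is the number of image pixels per point; the server's exact arithmetic is not shown here, and `screenshot_pixel_to_screen` is a hypothetical name.

```python
from typing import Tuple


def screenshot_pixel_to_screen(
    screenshot_x: float,
    screenshot_y: float,
    screenshot_origin_x: float,
    screenshot_origin_y: float,
    screenshot_scale: float,
) -> Tuple[float, float]:
    """Map a pixel in the screenshot image to screen coordinates.

    Assumed formula: screen = origin + pixel / scale.
    """
    if screenshot_scale <= 0:
        raise ValueError("screenshot_scale must be positive")
    return (
        screenshot_origin_x + screenshot_x / screenshot_scale,
        screenshot_origin_y + screenshot_y / screenshot_scale,
    )
```

Under that assumption, the values from the example above (pixel (200, 100), origin (50, 80), scale 2.0) map to the screen point (150.0, 130.0).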
|
|
224
329
|
|
|
330
|
+
### Other Coordinate Methods
|
|
331
|
+
|
|
332
|
+
```json
|
|
225
333
|
// Window-relative (converted using window bounds)
|
|
226
334
|
{ "window_x": 100, "window_y": 50, "window_id": 1234 }
|
|
227
335
|
|
|
228
|
-
// Screenshot pixels (
|
|
336
|
+
// Screenshot pixels (legacy: window lookup at click time - less reliable)
|
|
229
337
|
{ "screenshot_x": 200, "screenshot_y": 100, "screenshot_window_id": 1234 }
|
|
230
338
|
```
|
|
231
339
|
|
|
340
|
+
### Quick Reference
|
|
341
|
+
|
|
342
|
+
| Coordinate source | Click parameters |
|
|
343
|
+
|-------------------|------------------|
|
|
344
|
+
| `find_text` result | `x`, `y` (direct) |
|
|
345
|
+
| `take_screenshot` OCR annotation | `x`, `y` (direct) |
|
|
346
|
+
| Visual inspection of screenshot | `screenshot_x`, `screenshot_y` + metadata |
|
|
347
|
+
| Known window-relative position | `window_x`, `window_y`, `window_id` |
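
The table above can also be expressed as a small helper that picks the `click` parameters for a given coordinate source. This is a sketch for illustration: `build_click_args` and its `source` labels are hypothetical, while the parameter names in the returned dictionaries are the ones listed in the table.

```python
from typing import Any, Dict, Optional


def build_click_args(
    source: str,
    x: float,
    y: float,
    metadata: Optional[Dict[str, float]] = None,
    window_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Return click arguments for a coordinate source (hypothetical helper).

    source: "find_text" or "ocr" for OCR results, "screenshot" for pixels
    read off an image, "window" for window-relative positions.
    """
    if source in ("find_text", "ocr"):
        # OCR results are already screen coordinates.
        return {"x": x, "y": y}
    if source == "screenshot":
        if metadata is None:
            raise ValueError("screenshot coordinates need take_screenshot metadata")
        # Pass the pixel position together with the captured metadata.
        return {"screenshot_x": x, "screenshot_y": y, **metadata}
    if source == "window":
        if window_id is None:
            raise ValueError("window-relative coordinates need a window_id")
        return {"window_x": x, "window_y": y, "window_id": window_id}
    raise ValueError(f"unknown coordinate source: {source!r}")
```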
|
|
348
|
+
|
|
232
349
|
Use `get_displays` to understand the display configuration:
|
|
233
350
|
```json
|
|
234
351
|
{
|
|
@@ -247,14 +364,25 @@ Use `get_displays` to understand the display configuration:
|
|
|
247
364
|
|
|
248
365
|
### With CGEvent (any app)
|
|
249
366
|
|
|
367
|
+
**Option A: Using OCR (recommended for text elements)**
|
|
250
368
|
```
|
|
251
369
|
User: Click the Submit button in the app
|
|
252
370
|
|
|
253
|
-
Claude: [calls
|
|
254
|
-
[
|
|
371
|
+
Claude: [calls find_text with text="Submit"]
|
|
372
|
+
[receives: {"text": "Submit", "x": 450, "y": 320}]
|
|
255
373
|
[calls click with x=450, y=320]
|
|
256
374
|
```
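
The same flow can be sketched in code. Here `call_tool` is a hypothetical callable standing in for an MCP client's tool invocation, the `text` argument name follows the transcript above, and the result is assumed to be a list of matches that each carry `x` and `y`; the real response shape may differ.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical signature: call_tool(tool_name, arguments) -> tool result
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def click_text(call_tool: ToolCaller, text: str) -> Optional[Dict[str, Any]]:
    """Find on-screen text via OCR and click the first match.

    Returns the clicked match, or None if nothing was found.
    """
    matches = call_tool("find_text", {"text": text})
    if not matches:
        return None
    first = matches[0]
    # OCR coordinates are screen-absolute, so they are passed to click directly.
    call_tool("click", {"x": first["x"], "y": first["y"]})
    return first
```

Returning `None` when there is no match leaves the fallback decision, such as taking a screenshot and inspecting it visually, to the caller.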
|
|
257
375
|
|
|
376
|
+
**Option B: Using screenshot metadata (for any visual element)**
|
|
377
|
+
```
|
|
378
|
+
User: Click the icon next to Settings
|
|
379
|
+
|
|
380
|
+
Claude: [calls take_screenshot]
|
|
381
|
+
[receives image + metadata: {"screenshot_origin_x": 0, "screenshot_origin_y": 0, "screenshot_scale": 2.0}]
|
|
382
|
+
[visually identifies icon at pixel (300, 150) in the image]
|
|
383
|
+
[calls click with screenshot_x=300, screenshot_y=150, screenshot_origin_x=0, screenshot_origin_y=0, screenshot_scale=2.0]
|
|
384
|
+
```
|
|
385
|
+
|
|
258
386
|
### With AppDebugKit (embedded app)
|
|
259
387
|
|
|
260
388
|
```
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "native-devtools-mcp",
|
|
3
|
-
"version": "0.1.
|
|
3
|
+
"version": "0.1.8",
|
|
4
4
|
"description": "MCP server for testing native desktop applications",
|
|
5
5
|
"license": "MIT",
|
|
6
6
|
"repository": {
|
|
@@ -24,7 +24,7 @@
|
|
|
24
24
|
"bin"
|
|
25
25
|
],
|
|
26
26
|
"optionalDependencies": {
|
|
27
|
-
"@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.
|
|
27
|
+
"@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.8"
|
|
28
28
|
},
|
|
29
29
|
"engines": {
|
|
30
30
|
"node": ">=18"
|