native-devtools-mcp 0.1.7 → 0.1.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +144 -16
  2. package/package.json +2 -2
package/README.md CHANGED
@@ -4,6 +4,8 @@ A Model Context Protocol (MCP) server for testing native desktop applications, s

  > **100% Local & Private** - All processing happens on your machine. No data is sent to external servers. Screenshots, UI interactions, and app data never leave your device.

+ ![Demo](demo.gif)
+
  ## Platform Support

  | Platform | Status |
@@ -114,13 +116,12 @@ This MCP server requires macOS privacy permissions to capture screenshots and si
  3. Add the same app as above (VS Code, Terminal, etc.)
  4. **Quit and restart the app completely**

- #### 3. Tesseract OCR (required for `find_text`, and for `take_screenshot` OCR)
+ #### 3. macOS Version (for OCR features)

- The `find_text` tool uses OCR to locate text on screen and return clickable coordinates. The `take_screenshot` tool also runs OCR by default (`include_ocr: true`) to return text annotations. This is the **recommended way** to interact with apps when AppDebugKit is not available.
+ The `find_text` tool and `take_screenshot` OCR feature use Apple's Vision framework for text recognition. This is the **recommended way** to interact with apps when AppDebugKit is not available.

- ```bash
- brew install tesseract
- ```
+ - **macOS 10.15+ (Catalina)** required for OCR (`VNRecognizeTextRequest`)
+ - No additional software installation needed - Vision is built into macOS

  ### Important Notes

@@ -130,6 +131,14 @@ brew install tesseract
  - If you see `could not create image from display`, you need Screen Recording permission
  - If clicks don't work, you need Accessibility permission

+ ### During Automation (Important)
+
+ These tools assume the target window stays focused. If you use the mouse/keyboard, a macOS permission prompt appears, or Claude Code asks to approve a tool call, focus can change and actions may be sent to the wrong app or field.
+
+ - Pre-grant Screen Recording and Accessibility permissions before running.
+ - Pre-approve Claude Code tool permissions for this MCP server so no prompts appear mid-run.
+ - Avoid interacting with the computer while scenarios are executing.
+
  ### Privacy & Security

  All data stays on your machine:
@@ -140,6 +149,8 @@ All data stays on your machine:

  ## MCP Configuration

+ ### Getting started
+
  Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):

  ```json
@@ -153,7 +164,9 @@ Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):
  }
  ```

- Or if you built from source:
+ Note: Use `native-devtools-mcp@latest` if you want to always run the newest version.
+
+ ### Build from source

  ```json
  {
@@ -171,7 +184,7 @@ Or if you built from source:

  | Tool | Description |
  |------|-------------|
- | `take_screenshot` | Capture screen, window, or region (base64 PNG). Includes OCR text annotations by default (`include_ocr: true`). |
+ | `take_screenshot` | Capture screen, window, or region (base64 PNG). Returns screenshot metadata for coordinate conversion and includes OCR text annotations by default (`include_ocr: true`). |
  | `list_windows` | List visible windows with IDs, titles, bounds |
  | `list_apps` | List running applications |
  | `focus_window` | Bring window/app to front |
@@ -182,7 +195,7 @@ Or if you built from source:

  | Tool | Description |
  |------|-------------|
- | `click` | Click at screen/window/screenshot coordinates |
+ | `click` | Click at screen/window/screenshot coordinates (supports captured screenshot metadata) |
  | `type_text` | Type text at cursor position |
  | `press_key` | Press key combo (e.g., "return", modifiers: ["command"]) |
  | `scroll` | Scroll at position |
@@ -209,26 +222,130 @@ Or if you built from source:

  Note: app_* tools (except `app_connect`) are only listed after a successful connection. The server emits a tools list change on connect/disconnect, so some clients may need to refresh/re-list tools to see the app_* set.

+ <details>
+ <summary><strong>Agent Context (for automated agents)</strong></summary>
+
+ This section provides a compact, machine-readable summary for LLM agents. For a dedicated agent-first index, see `agents.md`.
+
+ ### Capabilities Matrix
+
+ | Intent | Tools | Outputs |
+ |--------|-------|---------|
+ | Capture screen or window | `take_screenshot` | base64 PNG, metadata (origin, scale), optional OCR text |
+ | Find text and click it | `find_text` → `click` | coordinates, click action |
+ | List and focus windows | `list_windows` → `focus_window` | window list, focus action |
+ | Element-level UI control | `app_connect` → `app_query` → `app_click` | element IDs, click action |
+
+ ### Structured Intent (YAML)
+
+ ```yaml
+ intents:
+   - name: capture_screenshot
+     tools: [take_screenshot]
+     inputs:
+       scope: { type: string, enum: [screen, window, region] }
+       window_id: { type: number, optional: true }
+       region: { type: object, optional: true }
+       include_ocr: { type: boolean, default: true }
+     outputs:
+       image_base64: { type: string }
+       metadata: { type: object, optional: true }
+       ocr: { type: array, optional: true }
+   - name: find_text_and_click
+     tools: [find_text, click]
+     inputs:
+       query: { type: string }
+       window_id: { type: number, optional: true }
+     outputs:
+       matches: { type: array }
+       clicked: { type: boolean }
+   - name: list_and_focus_window
+     tools: [list_windows, focus_window]
+     inputs:
+       app_name: { type: string, optional: true }
+     outputs:
+       windows: { type: array }
+       focused: { type: boolean }
+   - name: element_level_interaction
+     tools: [app_connect, app_query, app_click, app_type]
+     inputs:
+       selector: { type: string }
+       element_id: { type: string, optional: true }
+       text: { type: string, optional: true }
+     outputs:
+       element: { type: object }
+       ok: { type: boolean }
+ ```
+
+ ### Prompt → Tool → Output Examples
+
+ | User prompt | Tool sequence | Expected output |
+ |-------------|---------------|-----------------|
+ | "Take a screenshot of the Settings window" | `list_windows` → `take_screenshot(window_id)` | base64 PNG, metadata, OCR text |
+ | "Click the OK button" | `take_screenshot` → (vision) → `click(screenshot_x/y + metadata)` | click action |
+ | "Find text 'Submit' and click it" | `find_text(query)` → `click(x,y)` | coordinates, click action |
+ | "Click the Save button in the AppDebugKit app" | `app_connect` → `app_query("[title=Save]")` → `app_click(element_id)` | element ID, click action |
+
+ ### Coordinate Usage
+
+ | Coordinate source | Click parameters |
+ |-------------------|------------------|
+ | `find_text` or OCR annotation | `x`, `y` (direct screen coords) |
+ | Visual inspection of screenshot | `screenshot_x/y` + metadata from `take_screenshot` |
+
+ </details>
+
  ## How Screenshots and Clicking Work (macOS)

- - **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. The backing scale factor is tracked for coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
- - **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using window bounds and display scale.
+ - **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. Metadata for the screenshot origin and backing scale factor is included for deterministic coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
+ - **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using captured metadata when available; otherwise window bounds and display scale are looked up at click time.

  ## Coordinate Systems and Display Scaling

- The `click` tool supports three coordinate input methods to handle macOS display scaling:
+ The `click` tool supports multiple coordinate input methods. Choose based on how you obtained the coordinates:
+
+ ### OCR Coordinates (from `find_text` or `take_screenshot` OCR)
+
+ OCR results return **screen-absolute coordinates** that are ready to use directly:

  ```json
- // Direct screen coordinates
- { "x": 500, "y": 300 }
+ // OCR returns: "Submit" at (450, 320)
+ // Use direct screen coordinates:
+ { "x": 450, "y": 320 }
+ ```
+
+ ### Screenshot Pixel Coordinates (from visual inspection)
+
+ When you visually identify a click target in a screenshot image, use the pixel coordinates with the screenshot metadata:
+
+ ```json
+ // take_screenshot returns metadata:
+ // { "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
+ //
+ // You identify a button at pixel (200, 100) in the image.
+ // Pass both the pixel coords and the metadata:
+ { "screenshot_x": 200, "screenshot_y": 100, "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
+ ```

+ ### Other Coordinate Methods
+
+ ```json
  // Window-relative (converted using window bounds)
  { "window_x": 100, "window_y": 50, "window_id": 1234 }

- // Screenshot pixels (converted using backing scale factor)
+ // Screenshot pixels (legacy: window lookup at click time - less reliable)
  { "screenshot_x": 200, "screenshot_y": 100, "screenshot_window_id": 1234 }
  ```

+ ### Quick Reference
+
+ | Coordinate source | Click parameters |
+ |-------------------|------------------|
+ | `find_text` result | `x`, `y` (direct) |
+ | `take_screenshot` OCR annotation | `x`, `y` (direct) |
+ | Visual inspection of screenshot | `screenshot_x`, `screenshot_y` + metadata |
+ | Known window-relative position | `window_x`, `window_y`, `window_id` |
+
  Use `get_displays` to understand the display configuration:
  ```json
  {
@@ -247,14 +364,25 @@ Use `get_displays` to understand the display configuration:

  ### With CGEvent (any app)

+ **Option A: Using OCR (recommended for text elements)**
  ```
  User: Click the Submit button in the app

- Claude: [calls take_screenshot]
- [analyzes screenshot, finds Submit button at approximately x=450, y=320]
+ Claude: [calls find_text with text="Submit"]
+ [receives: {"text": "Submit", "x": 450, "y": 320}]
  [calls click with x=450, y=320]
  ```

+ **Option B: Using screenshot metadata (for any visual element)**
+ ```
+ User: Click the icon next to Settings
+
+ Claude: [calls take_screenshot]
+ [receives image + metadata: {"screenshot_origin_x": 0, "screenshot_origin_y": 0, "screenshot_scale": 2.0}]
+ [visually identifies icon at pixel (300, 150) in the image]
+ [calls click with screenshot_x=300, screenshot_y=150, screenshot_origin_x=0, screenshot_origin_y=0, screenshot_scale=2.0]
+ ```
+
  ### With AppDebugKit (embedded app)

  ```
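The new `screenshot_origin_x/y` and `screenshot_scale` metadata fields let a client convert screenshot pixel coordinates to screen coordinates without a window lookup at click time. The sketch below shows the conversion the diff implies, assuming screen coordinate = origin + pixel / backing scale (the README's example metadata uses `screenshot_scale: 2.0` for Retina); the `toScreen` helper is hypothetical and not part of the package.

```typescript
// Screenshot metadata fields as returned by take_screenshot (per the README).
interface ScreenshotMeta {
  screenshot_origin_x: number;
  screenshot_origin_y: number;
  screenshot_scale: number;
}

// ASSUMPTION: screen point = screenshot origin + image pixel / backing scale.
// This mirrors the "deterministic coordinate conversion" the diff describes.
function toScreen(px: number, py: number, meta: ScreenshotMeta) {
  return {
    x: meta.screenshot_origin_x + px / meta.screenshot_scale,
    y: meta.screenshot_origin_y + py / meta.screenshot_scale,
  };
}

// README example: button at image pixel (200, 100),
// metadata { origin: (50, 80), scale: 2.0 }.
const meta: ScreenshotMeta = {
  screenshot_origin_x: 50,
  screenshot_origin_y: 80,
  screenshot_scale: 2.0,
};
console.log(toScreen(200, 100, meta)); // { x: 150, y: 130 } under the assumed formula
```

In practice the client does not need to do this math itself: passing `screenshot_x/y` together with the metadata fields to `click` lets the server perform the same conversion.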
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "native-devtools-mcp",
-   "version": "0.1.7",
+   "version": "0.1.8",
    "description": "MCP server for testing native desktop applications",
    "license": "MIT",
    "repository": {
@@ -24,7 +24,7 @@
      "bin"
    ],
    "optionalDependencies": {
-     "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.7"
+     "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.8"
    },
    "engines": {
      "node": ">=18"