native-devtools-mcp 0.1.7 → 0.1.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +144 -16
  2. package/package.json +2 -2
package/README.md CHANGED
@@ -4,6 +4,8 @@ A Model Context Protocol (MCP) server for testing native desktop applications, s

  > **100% Local & Private** - All processing happens on your machine. No data is sent to external servers. Screenshots, UI interactions, and app data never leave your device.

+ ![Demo](demo.gif)
+
  ## Platform Support

  | Platform | Status |
@@ -114,13 +116,12 @@ This MCP server requires macOS privacy permissions to capture screenshots and si
  3. Add the same app as above (VS Code, Terminal, etc.)
  4. **Quit and restart the app completely**

- #### 3. Tesseract OCR (required for `find_text`, and for `take_screenshot` OCR)
+ #### 3. macOS Version (for OCR features)

- The `find_text` tool uses OCR to locate text on screen and return clickable coordinates. The `take_screenshot` tool also runs OCR by default (`include_ocr: true`) to return text annotations. This is the **recommended way** to interact with apps when AppDebugKit is not available.
+ The `find_text` tool and `take_screenshot` OCR feature use Apple's Vision framework for text recognition. This is the **recommended way** to interact with apps when AppDebugKit is not available.

- ```bash
- brew install tesseract
- ```
+ - **macOS 10.15+ (Catalina)** required for OCR (`VNRecognizeTextRequest`)
+ - No additional software installation needed - Vision is built into macOS

  ### Important Notes

@@ -130,6 +131,14 @@ brew install tesseract
  - If you see `could not create image from display`, you need Screen Recording permission
  - If clicks don't work, you need Accessibility permission

+ ### During Automation (Important)
+
+ These tools assume the target window stays focused. If you use the mouse/keyboard, a macOS permission prompt appears, or Claude Code asks to approve a tool call, focus can change and actions may be sent to the wrong app or field.
+
+ - Pre-grant Screen Recording and Accessibility permissions before running.
+ - Pre-approve Claude Code tool permissions for this MCP server so no prompts appear mid-run.
+ - Avoid interacting with the computer while scenarios are executing.
+
  ### Privacy & Security

  All data stays on your machine:
@@ -140,6 +149,8 @@ All data stays on your machine:

  ## MCP Configuration

+ ### Getting started
+
  Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):

  ```json
@@ -153,7 +164,9 @@ Add to your Claude Code MCP config (`~/.claude/claude_desktop_config.json`):
  }
  ```

- Or if you built from source:
+ Note: Use `native-devtools-mcp@latest` if you want to always run the newest version.
+
+ ### Build from source

  ```json
  {
@@ -171,7 +184,7 @@ Or if you built from source:

  | Tool | Description |
  |------|-------------|
- | `take_screenshot` | Capture screen, window, or region (base64 PNG). Includes OCR text annotations by default (`include_ocr: true`). |
+ | `take_screenshot` | Capture screen, window, or region (base64 PNG). Returns screenshot metadata for coordinate conversion and includes OCR text annotations by default (`include_ocr: true`). |
  | `list_windows` | List visible windows with IDs, titles, bounds |
  | `list_apps` | List running applications |
  | `focus_window` | Bring window/app to front |
@@ -182,7 +195,7 @@ Or if you built from source:

  | Tool | Description |
  |------|-------------|
- | `click` | Click at screen/window/screenshot coordinates |
+ | `click` | Click at screen/window/screenshot coordinates (supports captured screenshot metadata) |
  | `type_text` | Type text at cursor position |
  | `press_key` | Press key combo (e.g., "return", modifiers: ["command"]) |
  | `scroll` | Scroll at position |
@@ -209,26 +222,130 @@ Or if you built from source:

  Note: app_* tools (except `app_connect`) are only listed after a successful connection. The server emits a tools list change on connect/disconnect, so some clients may need to refresh/re-list tools to see the app_* set.

+ <details>
+ <summary><strong>Agent Context (for automated agents)</strong></summary>
+
+ This section provides a compact, machine-readable summary for LLM agents. For a dedicated agent-first index, see `agents.md`.
+
+ ### Capabilities Matrix
+
+ | Intent | Tools | Outputs |
+ |--------|-------|---------|
+ | Capture screen or window | `take_screenshot` | base64 PNG, metadata (origin, scale), optional OCR text |
+ | Find text and click it | `find_text` → `click` | coordinates, click action |
+ | List and focus windows | `list_windows` → `focus_window` | window list, focus action |
+ | Element-level UI control | `app_connect` → `app_query` → `app_click` | element IDs, click action |
+
+ ### Structured Intent (YAML)
+
+ ```yaml
+ intents:
+   - name: capture_screenshot
+     tools: [take_screenshot]
+     inputs:
+       scope: { type: string, enum: [screen, window, region] }
+       window_id: { type: number, optional: true }
+       region: { type: object, optional: true }
+       include_ocr: { type: boolean, default: true }
+     outputs:
+       image_base64: { type: string }
+       metadata: { type: object, optional: true }
+       ocr: { type: array, optional: true }
+   - name: find_text_and_click
+     tools: [find_text, click]
+     inputs:
+       query: { type: string }
+       window_id: { type: number, optional: true }
+     outputs:
+       matches: { type: array }
+       clicked: { type: boolean }
+   - name: list_and_focus_window
+     tools: [list_windows, focus_window]
+     inputs:
+       app_name: { type: string, optional: true }
+     outputs:
+       windows: { type: array }
+       focused: { type: boolean }
+   - name: element_level_interaction
+     tools: [app_connect, app_query, app_click, app_type]
+     inputs:
+       selector: { type: string }
+       element_id: { type: string, optional: true }
+       text: { type: string, optional: true }
+     outputs:
+       element: { type: object }
+       ok: { type: boolean }
+ ```
+
+ ### Prompt → Tool → Output Examples
+
+ | User prompt | Tool sequence | Expected output |
+ |-------------|---------------|-----------------|
+ | "Take a screenshot of the Settings window" | `list_windows` → `take_screenshot(window_id)` | base64 PNG, metadata, OCR text |
+ | "Click the OK button" | `take_screenshot` → (vision) → `click(screenshot_x/y + metadata)` | click action |
+ | "Find text 'Submit' and click it" | `find_text(query)` → `click(x,y)` | coordinates, click action |
+ | "Click the Save button in the AppDebugKit app" | `app_connect` → `app_query("[title=Save]")` → `app_click(element_id)` | element ID, click action |
+
+ ### Coordinate Usage
+
+ | Coordinate source | Click parameters |
+ |-------------------|------------------|
+ | `find_text` or OCR annotation | `x`, `y` (direct screen coords) |
+ | Visual inspection of screenshot | `screenshot_x/y` + metadata from `take_screenshot` |
+
+ </details>
+
  ## How Screenshots and Clicking Work (macOS)

- - **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. The backing scale factor is tracked for coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
- - **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using window bounds and display scale.
+ - **Screenshots** are captured via the system `screencapture` utility (`-x` silent, `-C` include cursor, `-R` region, `-l` window with `-o` to exclude shadow), written to a temp PNG, and returned as base64. Metadata for the screenshot origin and backing scale factor is included for deterministic coordinate conversion. Window screenshots exclude shadows so that pixel coordinates align exactly with `CGWindowBounds`, and OCR coordinates are automatically offset into screen space.
+ - **Clicks/inputs** use CoreGraphics CGEvent injection (HID event tap). This requires Accessibility permission and works across AppKit, SwiftUI, Electron, egui, etc. Window-relative or screenshot-pixel coordinates are converted to screen coordinates using captured metadata when available; otherwise window bounds and display scale are looked up at click time.

  ## Coordinate Systems and Display Scaling

- The `click` tool supports three coordinate input methods to handle macOS display scaling:
+ The `click` tool supports multiple coordinate input methods. Choose based on how you obtained the coordinates:
+
+ ### OCR Coordinates (from `find_text` or `take_screenshot` OCR)
+
+ OCR results return **screen-absolute coordinates** that are ready to use directly:

  ```json
- // Direct screen coordinates
- { "x": 500, "y": 300 }
+ // OCR returns: "Submit" at (450, 320)
+ // Use direct screen coordinates:
+ { "x": 450, "y": 320 }
+ ```
+
+ ### Screenshot Pixel Coordinates (from visual inspection)
+
+ When you visually identify a click target in a screenshot image, use the pixel coordinates with the screenshot metadata:
+
+ ```json
+ // take_screenshot returns metadata:
+ // { "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
+ //
+ // You identify a button at pixel (200, 100) in the image.
+ // Pass both the pixel coords and the metadata:
+ { "screenshot_x": 200, "screenshot_y": 100, "screenshot_origin_x": 50, "screenshot_origin_y": 80, "screenshot_scale": 2.0 }
+ ```

+ ### Other Coordinate Methods
+
+ ```json
  // Window-relative (converted using window bounds)
  { "window_x": 100, "window_y": 50, "window_id": 1234 }

- // Screenshot pixels (converted using backing scale factor)
+ // Screenshot pixels (legacy: window lookup at click time - less reliable)
  { "screenshot_x": 200, "screenshot_y": 100, "screenshot_window_id": 1234 }
  ```

+ ### Quick Reference
+
+ | Coordinate source | Click parameters |
+ |-------------------|------------------|
+ | `find_text` result | `x`, `y` (direct) |
+ | `take_screenshot` OCR annotation | `x`, `y` (direct) |
+ | Visual inspection of screenshot | `screenshot_x`, `screenshot_y` + metadata |
+ | Known window-relative position | `window_x`, `window_y`, `window_id` |
+
  Use `get_displays` to understand the display configuration:
  ```json
  {
@@ -247,14 +364,25 @@ Use `get_displays` to understand the display configuration:

  ### With CGEvent (any app)

+ **Option A: Using OCR (recommended for text elements)**
  ```
  User: Click the Submit button in the app

- Claude: [calls take_screenshot]
- [analyzes screenshot, finds Submit button at approximately x=450, y=320]
+ Claude: [calls find_text with text="Submit"]
+ [receives: {"text": "Submit", "x": 450, "y": 320}]
  [calls click with x=450, y=320]
  ```

+ **Option B: Using screenshot metadata (for any visual element)**
+ ```
+ User: Click the icon next to Settings
+
+ Claude: [calls take_screenshot]
+ [receives image + metadata: {"screenshot_origin_x": 0, "screenshot_origin_y": 0, "screenshot_scale": 2.0}]
+ [visually identifies icon at pixel (300, 150) in the image]
+ [calls click with screenshot_x=300, screenshot_y=150, screenshot_origin_x=0, screenshot_origin_y=0, screenshot_scale=2.0]
+ ```
+
  ### With AppDebugKit (embedded app)

  ```
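The new `screenshot_origin_x/y` and `screenshot_scale` metadata fields let a client convert screenshot pixel coordinates to screen coordinates without a window lookup at click time. The sketch below shows the conversion the diff implies, assuming screen coordinate = origin + pixel / backing scale (the README's example metadata uses `screenshot_scale: 2.0` for Retina); the `toScreen` helper is hypothetical and not part of the package.

```typescript
// Screenshot metadata fields as returned by take_screenshot (per the README).
interface ScreenshotMeta {
  screenshot_origin_x: number;
  screenshot_origin_y: number;
  screenshot_scale: number;
}

// ASSUMPTION: screen point = screenshot origin + image pixel / backing scale.
// This mirrors the "deterministic coordinate conversion" the diff describes.
function toScreen(px: number, py: number, meta: ScreenshotMeta) {
  return {
    x: meta.screenshot_origin_x + px / meta.screenshot_scale,
    y: meta.screenshot_origin_y + py / meta.screenshot_scale,
  };
}

// README example: button at image pixel (200, 100),
// metadata { origin: (50, 80), scale: 2.0 }.
const meta: ScreenshotMeta = {
  screenshot_origin_x: 50,
  screenshot_origin_y: 80,
  screenshot_scale: 2.0,
};
console.log(toScreen(200, 100, meta)); // { x: 150, y: 130 } under the assumed formula
```

In practice the client does not need to do this math itself: passing `screenshot_x/y` together with the metadata fields to `click` lets the server perform the same conversion.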
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "native-devtools-mcp",
-   "version": "0.1.7",
+   "version": "0.1.8",
    "description": "MCP server for testing native desktop applications",
    "license": "MIT",
    "repository": {
@@ -24,7 +24,7 @@
      "bin"
    ],
    "optionalDependencies": {
-     "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.7"
+     "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.1.8"
    },
    "engines": {
      "node": ">=18"