npm - native-devtools-mcp - Versions diffs - 0.3.1 → 0.3.3 - Mend

native-devtools-mcp 0.3.1 → 0.3.3

Files changed (2)

package/README.md +46 -3
package/package.json +18 -4

package/README.md CHANGED

@@ -11,6 +11,8 @@
 A Model Context Protocol (MCP) server that provides **Computer Use** capabilities: screenshots, OCR, input simulation, and window management.
+[//]: # "Search keywords: MCP, Model Context Protocol, computer use, desktop automation, UI automation, RPA, screenshots, OCR, mouse, keyboard, screen reading, macOS, Windows, native-devtools-mcp"
 [Features](#-features) • [Installation](#-installation) • [For AI Agents](#-for-ai-agents-llms) • [Permissions](#-required-permissions-macos)
 ![Demo](demo.gif)
@@ -19,11 +21,16 @@ A Model Context Protocol (MCP) server that provides **Computer Use** capabilitie
 ---
+## 🔍 Search Keywords
+MCP, Model Context Protocol, computer use, desktop automation, UI automation, RPA, screenshots, OCR, screen reading, mouse, keyboard, macOS, Windows, native-devtools-mcp.
 ## 🚀 Features
 - **👀 Computer Vision:** Capture screenshots of screens, windows, or specific regions. Includes built-in OCR (text recognition) to "read" the screen.
 - **🖱️ Input Simulation:** Click, drag, scroll, and type text naturally. Supports global coordinates and window-relative actions.
 - **🪟 Window Management:** List open windows, find applications, and bring them to focus.
+- **🧩 Template Matching:** Find non-text UI elements (icons, shapes) using `load_image` + `find_image`, returning precise click coordinates.
 - **🔒 Local & Private:** 100% local execution. No screenshots or data are ever sent to external servers.
 - **🔌 Dual-Mode Interaction:**
     1.  **Visual/Native:** Works with *any* app via screenshots & coordinates (Universal).
@@ -33,12 +40,13 @@ A Model Context Protocol (MCP) server that provides **Computer Use** capabilitie
 This MCP server is designed to be **highly discoverable and usable** by AI models (Claude, Gemini, GPT).
-- **[📄 Read `agents.md`](./agents.md):** A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
+- **[📄 Read `AGENTS.md`](./AGENTS.md):** A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
 **Core Capabilities for System Prompts:**
 1.  `take_screenshot`: The "eyes". Returns images + layout metadata + text locations (OCR).
 2.  `click` / `type_text`: The "hands". Interacts with the system based on visual feedback.
 3.  `find_text`: A shortcut to find text on screen and get its coordinates immediately.
+4.  `load_image` / `find_image`: Template matching for non-text UI elements (icons, shapes), returning screen coordinates for clicking.
 ## 📦 Installation (macOS + Windows)
@@ -137,8 +145,8 @@ We provide two ways for agents to interact, allowing them to choose the best too
 ### 1. The "Visual" Approach (Universal)
 **Best for:** 99% of apps (Electron, Qt, Games, Browsers).
 *   **How it works:** The agent takes a screenshot, analyzes it visually (or uses OCR), and clicks at coordinates.
-*   **Tools:** `take_screenshot`, `find_text`, `click`, `type_text`.
-*   **Example:** "Click the button that looks like a gear icon."
+*   **Tools:** `take_screenshot`, `find_text`, `click`, `type_text` (plus `load_image` / `find_image` for icons and shapes).
+*   **Example:** "Click the button that looks like a gear icon." → use `find_image` with a gear template.
 ### 2. The "Structural" Approach (AppDebugKit)
 **Best for:** Apps specifically instrumented with our AppDebugKit library (mostly for developers testing their own apps).
@@ -146,6 +154,22 @@ We provide two ways for agents to interact, allowing them to choose the best too
 *   **Tools:** `app_connect`, `app_query`, `app_click`.
 *   **Example:** `app_click(element_id="submit-button")`.
+## 🧩 Template Matching (find_image)
+Use `find_image` when the target is **not text** (icons, toggles, custom controls) and OCR or `find_text` cannot identify it.
+**Typical flow:**
+1. `take_screenshot(app_name="MyApp")` → `screenshot_id`
+2. `load_image(path="/path/to/icon.png")` → `template_id`
+3. `find_image(screenshot_id="...", template_id="...")` → `matches` with `screen_x/screen_y`
+4. `click(x=..., y=...)`
+**Fast vs Accurate:**
+- **fast** (default): uses downscaling and early-exit for speed.
+- **accurate**: uses full-resolution, wider scale search, and smaller stride for thorough matching.
+Optional inputs like `mask_id`, `search_region`, `scales`, and `rotations` can improve precision and performance.
 ## 🏗️ Architecture
 ```mermaid
@@ -173,6 +197,25 @@ graph TD
 | | Input | `SendInput` (Win32) |
 | | OCR | `Windows.Media.Ocr` (WinRT) |
+### Screenshot Coordinate Precision
+Screenshots include metadata for accurate coordinate conversion:
+- `screenshot_origin_x/y`: Screen-space origin of the captured area (in points)
+- `screenshot_scale`: Display scale factor (e.g., 2.0 for Retina displays)
+- `screenshot_pixel_width/height`: Actual pixel dimensions of the image
+- `screenshot_window_id`: Window ID (for window captures)
+**Coordinate conversion:**
+```
+screen_x = screenshot_origin_x + (pixel_x / screenshot_scale)
+screen_y = screenshot_origin_y + (pixel_y / screenshot_scale)
+```
+**Implementation notes:**
+- **Window captures** (macOS): Uses `screencapture -o` which excludes window shadow. The captured image dimensions match `kCGWindowBounds × scale` exactly, ensuring click coordinates derived from screenshots land on intended UI elements.
+- **Region captures**: Origin coordinates are aligned to integers to match the actual captured area.
 </details>
 ## 🛡️ Privacy, Safety & Best Practices
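
The coordinate conversion added in the README hunk above can be illustrated with a short sketch. This is only an illustration of the two formulas as written in the diff, not the package's implementation: the `ScreenshotMeta` class and the `pixel_to_screen` helper are hypothetical names, and only the three metadata fields used by the formulas are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenshotMeta:
    """Hypothetical container for the metadata fields named in the README diff."""
    screenshot_origin_x: float  # screen-space origin of the capture, in points
    screenshot_origin_y: float
    screenshot_scale: float     # display scale factor, e.g. 2.0 on Retina


def pixel_to_screen(meta: ScreenshotMeta, pixel_x: float, pixel_y: float) -> tuple[float, float]:
    """Map an image pixel to screen coordinates using the README formulas."""
    if meta.screenshot_scale <= 0:
        raise ValueError("screenshot_scale must be positive")
    screen_x = meta.screenshot_origin_x + pixel_x / meta.screenshot_scale
    screen_y = meta.screenshot_origin_y + pixel_y / meta.screenshot_scale
    return screen_x, screen_y


# A window captured at origin (100, 50) on a 2x display:
# pixel (300, 120) in the image maps to point (250.0, 110.0) on screen.
meta = ScreenshotMeta(screenshot_origin_x=100.0, screenshot_origin_y=50.0, screenshot_scale=2.0)
print(pixel_to_screen(meta, 300, 120))
```

Dividing by the scale factor before adding the origin matters because, per the README text, the origin is expressed in points while image positions are in pixels.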

package/package.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "native-devtools-mcp",
-  "version": "0.3.1",
-  "description": "MCP server for testing native desktop applications",
+  "version": "0.3.3",
+  "description": "MCP server for computer-use / desktop automation of native apps (screenshots, OCR, input)",
   "license": "MIT",
   "repository": {
     "type": "git",
@@ -11,6 +11,20 @@
   "keywords": [
     "mcp",
     "model-context-protocol",
+    "computer-use",
+    "desktop-automation",
+    "ui-automation",
+    "rpa",
+    "ocr",
+    "screenshot",
+    "screen-reading",
+    "mouse",
+    "keyboard",
+    "ai-agent",
+    "llm",
+    "claude",
+    "gemini",
+    "gpt",
     "devtools",
     "desktop",
     "testing",
@@ -25,8 +39,8 @@
     "bin"
   ],
   "optionalDependencies": {
-    "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.3.1",
-    "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.3.1"
+    "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.3.3",
+    "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.3.3"
   },
   "engines": {
     "node": ">=18"