native-devtools-mcp 0.3.1 → 0.3.3

Files changed (2)
  1. package/README.md +46 -3
  2. package/package.json +18 -4
package/README.md CHANGED
@@ -11,6 +11,8 @@
 
  A Model Context Protocol (MCP) server that provides **Computer Use** capabilities: screenshots, OCR, input simulation, and window management.
 
+ [//]: # "Search keywords: MCP, Model Context Protocol, computer use, desktop automation, UI automation, RPA, screenshots, OCR, mouse, keyboard, screen reading, macOS, Windows, native-devtools-mcp"
+
  [Features](#-features) • [Installation](#-installation) • [For AI Agents](#-for-ai-agents-llms) • [Permissions](#-required-permissions-macos)
 
  ![Demo](demo.gif)
@@ -19,11 +21,16 @@ A Model Context Protocol (MCP) server that provides **Computer Use** capabilitie
 
  ---
 
+ ## 🔍 Search Keywords
+
+ MCP, Model Context Protocol, computer use, desktop automation, UI automation, RPA, screenshots, OCR, screen reading, mouse, keyboard, macOS, Windows, native-devtools-mcp.
+
  ## 🚀 Features
 
  - **👀 Computer Vision:** Capture screenshots of screens, windows, or specific regions. Includes built-in OCR (text recognition) to "read" the screen.
  - **🖱️ Input Simulation:** Click, drag, scroll, and type text naturally. Supports global coordinates and window-relative actions.
  - **🪟 Window Management:** List open windows, find applications, and bring them to focus.
+ - **🧩 Template Matching:** Find non-text UI elements (icons, shapes) using `load_image` + `find_image`, returning precise click coordinates.
  - **🔒 Local & Private:** 100% local execution. No screenshots or data are ever sent to external servers.
  - **🔌 Dual-Mode Interaction:**
  1. **Visual/Native:** Works with *any* app via screenshots & coordinates (Universal).
@@ -33,12 +40,13 @@ A Model Context Protocol (MCP) server that provides **Computer Use** capabilitie
 
  This MCP server is designed to be **highly discoverable and usable** by AI models (Claude, Gemini, GPT).
 
- - **[📄 Read `agents.md`](./agents.md):** A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
+ - **[📄 Read `AGENTS.md`](./AGENTS.md):** A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
 
  **Core Capabilities for System Prompts:**
  1. `take_screenshot`: The "eyes". Returns images + layout metadata + text locations (OCR).
  2. `click` / `type_text`: The "hands". Interacts with the system based on visual feedback.
  3. `find_text`: A shortcut to find text on screen and get its coordinates immediately.
+ 4. `load_image` / `find_image`: Template matching for non-text UI elements (icons, shapes), returning screen coordinates for clicking.
 
  ## 📦 Installation (macOS + Windows)
 
@@ -137,8 +145,8 @@ We provide two ways for agents to interact, allowing them to choose the best too
  ### 1. The "Visual" Approach (Universal)
  **Best for:** 99% of apps (Electron, Qt, Games, Browsers).
  * **How it works:** The agent takes a screenshot, analyzes it visually (or uses OCR), and clicks at coordinates.
- * **Tools:** `take_screenshot`, `find_text`, `click`, `type_text`.
- * **Example:** "Click the button that looks like a gear icon."
+ * **Tools:** `take_screenshot`, `find_text`, `click`, `type_text` (plus `load_image` / `find_image` for icons and shapes).
+ * **Example:** "Click the button that looks like a gear icon." → use `find_image` with a gear template.
 
  ### 2. The "Structural" Approach (AppDebugKit)
  **Best for:** Apps specifically instrumented with our AppDebugKit library (mostly for developers testing their own apps).
@@ -146,6 +154,22 @@ We provide two ways for agents to interact, allowing them to choose the best too
  * **Tools:** `app_connect`, `app_query`, `app_click`.
  * **Example:** `app_click(element_id="submit-button")`.
 
+ ## 🧩 Template Matching (find_image)
+
+ Use `find_image` when the target is **not text** (icons, toggles, custom controls) and OCR or `find_text` cannot identify it.
+
+ **Typical flow:**
+ 1. `take_screenshot(app_name="MyApp")` → `screenshot_id`
+ 2. `load_image(path="/path/to/icon.png")` → `template_id`
+ 3. `find_image(screenshot_id="...", template_id="...")` → `matches` with `screen_x/screen_y`
+ 4. `click(x=..., y=...)`
+
+ **Fast vs Accurate:**
+ - **fast** (default): uses downscaling and early-exit for speed.
+ - **accurate**: uses full-resolution, wider scale search, and smaller stride for thorough matching.
+
+ Optional inputs like `mask_id`, `search_region`, `scales`, and `rotations` can improve precision and performance.
+
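Reviewer note: the four-step "Typical flow" added in this hunk can be sketched as a small Python helper. This is a minimal sketch under stated assumptions, not the package's client API. `call_tool` is a hypothetical callable standing in for whatever MCP client invokes the server's tools. The response shapes are also assumptions inferred from the README text: a `screenshot_id` key, a `template_id` key, and a `matches` list whose entries carry `screen_x` and `screen_y`. Treating the first match as the best candidate is another assumption.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical signature: (tool_name, arguments) -> decoded tool result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def click_template(
    call_tool: ToolCaller, app_name: str, template_path: str
) -> Optional[Tuple[float, float]]:
    """Run take_screenshot -> load_image -> find_image -> click.

    Returns the clicked screen point, or None when nothing matched.
    """
    # Step 1: capture the target app (assumed to return a screenshot_id).
    shot = call_tool("take_screenshot", {"app_name": app_name})
    # Step 2: load the template image (assumed to return a template_id).
    template = call_tool("load_image", {"path": template_path})
    # Step 3: search the screenshot for the template.
    found = call_tool(
        "find_image",
        {
            "screenshot_id": shot["screenshot_id"],
            "template_id": template["template_id"],
        },
    )
    matches = found.get("matches", [])
    if not matches:
        return None  # nothing found, so do not click blindly
    # Assumption: the first match is the best candidate.
    best = matches[0]
    x, y = best["screen_x"], best["screen_y"]
    # Step 4: click at the screen coordinates reported by the match.
    call_tool("click", {"x": x, "y": y})
    return (x, y)
```

Returning `None` instead of raising keeps the no-match case cheap to handle, so a caller could retry with the accurate mode or fall back to `find_text`.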
  ## 🏗️ Architecture
 
  ```mermaid
@@ -173,6 +197,25 @@ graph TD
  | | Input | `SendInput` (Win32) |
  | | OCR | `Windows.Media.Ocr` (WinRT) |
 
+ ### Screenshot Coordinate Precision
+
+ Screenshots include metadata for accurate coordinate conversion:
+
+ - `screenshot_origin_x/y`: Screen-space origin of the captured area (in points)
+ - `screenshot_scale`: Display scale factor (e.g., 2.0 for Retina displays)
+ - `screenshot_pixel_width/height`: Actual pixel dimensions of the image
+ - `screenshot_window_id`: Window ID (for window captures)
+
+ **Coordinate conversion:**
+ ```
+ screen_x = screenshot_origin_x + (pixel_x / screenshot_scale)
+ screen_y = screenshot_origin_y + (pixel_y / screenshot_scale)
+ ```
+
+ **Implementation notes:**
+ - **Window captures** (macOS): Uses `screencapture -o` which excludes window shadow. The captured image dimensions match `kCGWindowBounds × scale` exactly, ensuring click coordinates derived from screenshots land on intended UI elements.
+ - **Region captures**: Origin coordinates are aligned to integers to match the actual captured area.
+
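Reviewer note: the coordinate conversion added in this hunk is plain arithmetic, and a worked example makes the units easier to check. The sketch below applies the two formulas as written. The `ScreenshotMeta` class and the `pixel_to_screen` function are hypothetical names introduced for illustration. Only the field names come from the README, and how the server actually delivers this metadata is not shown in the diff.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScreenshotMeta:
    """Hypothetical container for the metadata fields named in the README."""

    screenshot_origin_x: float  # screen-space origin, in points
    screenshot_origin_y: float
    screenshot_scale: float  # e.g. 2.0 on a Retina display


def pixel_to_screen(
    meta: ScreenshotMeta, pixel_x: float, pixel_y: float
) -> Tuple[float, float]:
    """Convert image pixel coordinates to screen coordinates in points."""
    if meta.screenshot_scale <= 0:
        raise ValueError("screenshot_scale must be positive")
    # screen = origin + pixel / scale, applied per axis.
    screen_x = meta.screenshot_origin_x + pixel_x / meta.screenshot_scale
    screen_y = meta.screenshot_origin_y + pixel_y / meta.screenshot_scale
    return (screen_x, screen_y)


# A window captured at origin (100, 50) on a 2x display:
meta = ScreenshotMeta(100.0, 50.0, 2.0)
print(pixel_to_screen(meta, 400, 300))  # (300.0, 200.0)
```

Dividing by the scale first matters: pixel (400, 300) in a 2x capture is only 200 by 150 points from the capture origin, so the click lands at (300, 200) rather than (500, 350).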
  </details>
 
  ## 🛡️ Privacy, Safety & Best Practices
  ## 🛡️ Privacy, Safety & Best Practices
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
  "name": "native-devtools-mcp",
- "version": "0.3.1",
- "description": "MCP server for testing native desktop applications",
+ "version": "0.3.3",
+ "description": "MCP server for computer-use / desktop automation of native apps (screenshots, OCR, input)",
  "license": "MIT",
  "repository": {
  "type": "git",
@@ -11,6 +11,20 @@
  "keywords": [
  "mcp",
  "model-context-protocol",
+ "computer-use",
+ "desktop-automation",
+ "ui-automation",
+ "rpa",
+ "ocr",
+ "screenshot",
+ "screen-reading",
+ "mouse",
+ "keyboard",
+ "ai-agent",
+ "llm",
+ "claude",
+ "gemini",
+ "gpt",
  "devtools",
  "desktop",
  "testing",
@@ -25,8 +39,8 @@
  "bin"
  ],
  "optionalDependencies": {
- "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.3.1",
- "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.3.1"
+ "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.3.3",
+ "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.3.3"
  },
  "engines": {
  "node": ">=18"