native-devtools-mcp 0.3.2 → 0.3.3

Files changed (2)
  1. package/README.md +20 -2
  2. package/package.json +3 -3
package/README.md CHANGED
@@ -30,6 +30,7 @@ MCP, Model Context Protocol, computer use, desktop automation, UI automation, RP
  - **👀 Computer Vision:** Capture screenshots of screens, windows, or specific regions. Includes built-in OCR (text recognition) to "read" the screen.
  - **🖱️ Input Simulation:** Click, drag, scroll, and type text naturally. Supports global coordinates and window-relative actions.
  - **🪟 Window Management:** List open windows, find applications, and bring them to focus.
+ - **🧩 Template Matching:** Find non-text UI elements (icons, shapes) using `load_image` + `find_image`, returning precise click coordinates.
  - **🔒 Local & Private:** 100% local execution. No screenshots or data are ever sent to external servers.
  - **🔌 Dual-Mode Interaction:**
  1. **Visual/Native:** Works with *any* app via screenshots & coordinates (Universal).
@@ -45,6 +46,7 @@ This MCP server is designed to be **highly discoverable and usable** by AI model
  1. `take_screenshot`: The "eyes". Returns images + layout metadata + text locations (OCR).
  2. `click` / `type_text`: The "hands". Interacts with the system based on visual feedback.
  3. `find_text`: A shortcut to find text on screen and get its coordinates immediately.
+ 4. `load_image` / `find_image`: Template matching for non-text UI elements (icons, shapes), returning screen coordinates for clicking.
 
  ## 📦 Installation (macOS + Windows)
 
@@ -143,8 +145,8 @@ We provide two ways for agents to interact, allowing them to choose the best too
  ### 1. The "Visual" Approach (Universal)
  **Best for:** 99% of apps (Electron, Qt, Games, Browsers).
  * **How it works:** The agent takes a screenshot, analyzes it visually (or uses OCR), and clicks at coordinates.
- * **Tools:** `take_screenshot`, `find_text`, `click`, `type_text`.
- * **Example:** "Click the button that looks like a gear icon."
+ * **Tools:** `take_screenshot`, `find_text`, `click`, `type_text` (plus `load_image` / `find_image` for icons and shapes).
+ * **Example:** "Click the button that looks like a gear icon." → use `find_image` with a gear template.
 
  ### 2. The "Structural" Approach (AppDebugKit)
  **Best for:** Apps specifically instrumented with our AppDebugKit library (mostly for developers testing their own apps).
@@ -152,6 +154,22 @@ We provide two ways for agents to interact, allowing them to choose the best too
  * **Tools:** `app_connect`, `app_query`, `app_click`.
  * **Example:** `app_click(element_id="submit-button")`.
 
+ ## 🧩 Template Matching (find_image)
+
+ Use `find_image` when the target is **not text** (icons, toggles, custom controls) and OCR or `find_text` cannot identify it.
+
+ **Typical flow:**
+ 1. `take_screenshot(app_name="MyApp")` → `screenshot_id`
+ 2. `load_image(path="/path/to/icon.png")` → `template_id`
+ 3. `find_image(screenshot_id="...", template_id="...")` → `matches` with `screen_x/screen_y`
+ 4. `click(x=..., y=...)`
+
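The four-step flow added in this hunk can be sketched as a small driver. This is an illustration only, not the package's client API: `call_tool` is a hypothetical transport that a real MCP client would supply, and the dict-shaped results are an assumption. Only the tool names and the `screenshot_id`, `template_id`, `matches`, `screen_x`, and `screen_y` fields come from the README text.

```python
from typing import Any, Callable, Optional

# Hypothetical transport: takes a tool name and its arguments and returns the
# tool result as a dict. A real MCP client would provide this.
ToolCaller = Callable[[str, dict[str, Any]], dict[str, Any]]


def click_template(
    call_tool: ToolCaller, app_name: str, template_path: str
) -> Optional[tuple[int, int]]:
    """Run take_screenshot -> load_image -> find_image -> click."""
    shot = call_tool("take_screenshot", {"app_name": app_name})
    template = call_tool("load_image", {"path": template_path})
    found = call_tool(
        "find_image",
        {
            "screenshot_id": shot["screenshot_id"],
            "template_id": template["template_id"],
        },
    )
    matches = found.get("matches", [])
    if not matches:
        return None  # nothing matched; the caller decides whether to retry
    # Assumption: the first match is the one to click.
    x, y = matches[0]["screen_x"], matches[0]["screen_y"]
    call_tool("click", {"x": x, "y": y})
    return (x, y)


# Stand-in for a real server so the sketch runs on its own.
calls: list[tuple[str, dict[str, Any]]] = []


def fake_call_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    calls.append((name, args))
    canned: dict[str, dict[str, Any]] = {
        "take_screenshot": {"screenshot_id": "shot-1"},
        "load_image": {"template_id": "tpl-1"},
        "find_image": {"matches": [{"screen_x": 120, "screen_y": 48}]},
        "click": {},
    }
    return canned[name]


clicked = click_template(fake_call_tool, "MyApp", "/path/to/icon.png")
print(clicked)
```

Passing the transport in as a plain callable keeps the flow testable without a running server; swapping `fake_call_tool` for a real client call should be the only change needed.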
+ **Fast vs Accurate:**
+ - **fast** (default): uses downscaling and early-exit for speed.
+ - **accurate**: uses full-resolution, wider scale search, and smaller stride for thorough matching.
+
+ Optional inputs like `mask_id`, `search_region`, `scales`, and `rotations` can improve precision and performance.
+
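One way an agent might use the fast and accurate settings described above is to try the cheap search first and fall back to the thorough one only when nothing is found. The sketch below assumes the setting is passed as an argument named `mode`; the README text does not name the parameter, so treat that key, the `call_tool` transport, and the result shape as hypothetical.

```python
from typing import Any, Callable, Optional

# Hypothetical transport, as a real MCP client would provide.
ToolCaller = Callable[[str, dict[str, Any]], dict[str, Any]]


def find_with_fallback(
    call_tool: ToolCaller,
    screenshot_id: str,
    template_id: str,
    mode_key: str = "mode",  # assumed argument name, not confirmed by the README
) -> tuple[Optional[str], list[dict[str, Any]]]:
    """Try the fast search first, then the accurate one if nothing matched."""
    base = {"screenshot_id": screenshot_id, "template_id": template_id}
    for mode in ("fast", "accurate"):
        result = call_tool("find_image", {**base, mode_key: mode})
        matches = result.get("matches", [])
        if matches:
            return mode, matches
    return None, []


# Stand-in server: the fast pass misses, the accurate pass finds one match.
attempted: list[str] = []


def fake_call_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    attempted.append(args["mode"])
    if args["mode"] == "accurate":
        return {"matches": [{"screen_x": 300, "screen_y": 220}]}
    return {"matches": []}


used_mode, hits = find_with_fallback(fake_call_tool, "shot-1", "tpl-1")
print(used_mode, hits)
```

The same pattern would extend to the optional inputs the README lists (`mask_id`, `search_region`, `scales`, `rotations`) by adding them to `base`, though their exact value formats are not given here.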
  ## 🏗️ Architecture
 
  ```mermaid
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "native-devtools-mcp",
- "version": "0.3.2",
+ "version": "0.3.3",
  "description": "MCP server for computer-use / desktop automation of native apps (screenshots, OCR, input)",
  "license": "MIT",
  "repository": {
@@ -39,8 +39,8 @@
  "bin"
  ],
  "optionalDependencies": {
- "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.3.2",
- "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.3.2"
+ "@sh3ll3x3c/native-devtools-mcp-darwin-arm64": "0.3.3",
+ "@sh3ll3x3c/native-devtools-mcp-win32-x64": "0.3.3"
  },
  "engines": {
  "node": ">=18"