@harusame64/desktop-touch-mcp 0.15.6 → 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (5) hide show
  1. package/LICENSE +21 -21
  2. package/README.ja.md +616 -516
  3. package/README.md +781 -685
  4. package/bin/launcher.js +83 -7
  5. package/package.json +79 -78
package/README.md CHANGED
@@ -1,694 +1,790 @@
1
- # desktop-touch-mcp
2
-
3
- [![desktop-touch-mcp MCP server](https://glama.ai/mcp/servers/Harusame64/desktop-touch-mcp/badges/card.svg)](https://glama.ai/mcp/servers/Harusame64/desktop-touch-mcp)
4
-
5
- [日本語](README.ja.md)
6
-
7
- > **Stop pasting screenshots. Let Claude see and control your desktop directly.**
8
-
9
- An MCP server that gives Claude eyes and hands on Windows — 57 tools covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
10
-
11
- > *v0.15: **82× average speedup** via Rust native engine — UIA focus queries in 2 ms, SSE2-accelerated image diffing at 13–15× native speed. Zero-config: the engine auto-loads when present, with transparent PowerShell fallback.*
1
+ # desktop-touch-mcp
2
+
3
+ [![desktop-touch-mcp MCP server](https://glama.ai/mcp/servers/Harusame64/desktop-touch-mcp/badges/card.svg)](https://glama.ai/mcp/servers/Harusame64/desktop-touch-mcp)
4
+
5
+ [日本語](README.ja.md)
6
+
7
+ > **Stop pasting screenshots. Let Claude see and control your desktop directly.**
8
+
9
+ > **Project site:** [harusame64.github.io/desktop-touch-mcp](https://harusame64.github.io/desktop-touch-mcp/)
10
+ > Start here for the public explainer, client setup guides, and the Reactive Perception Graph overview.
11
+
12
+ An MCP server that gives Claude eyes and hands on Windows — 28 public tools (26 stub catalog + 2 dynamic v2 World-Graph: `desktop_discover` / `desktop_act`) covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
13
+
14
+ > *v0.15: **82× average speedup** via Rust native engine — UIA focus queries in 2 ms, SSE2-accelerated image diffing at 13–15× native speed. Zero-config: the engine auto-loads when present, with transparent PowerShell fallback.*
12
15
  > *v0.15.5: **Pinned release verification** — the npm launcher now fetches only the matching GitHub Release tag and verifies the Windows runtime zip before extraction.*
13
-
14
- ---
15
-
16
- ## Features
17
-
18
- - **⚡ High-performance Rust Native Core** — The UIA bridge and image-diff engine are written in Rust (`napi-rs` + `windows-rs`) and loaded as a native `.node` addon. Direct COM calls from a dedicated MTA thread eliminate PowerShell process spawning — `getFocusedElement` completes in **2 ms** (160× faster), and `getUiElements` returns full trees in **~100 ms** with a batch BFS algorithm that minimizes cross-process RPC. Image-diff operations use **SSE2 SIMD** for 13–15× throughput. When the native engine is unavailable, every function transparently falls back to PowerShell — zero config required.
19
- - **🎯 Set-of-Marks (SoM) visual fallback** — Games, RDP sessions, and non-accessible Electron apps return clickable elements even when UIA is completely blind. `screenshot(detail="text")` automatically detects UIA sparsity and activates a Hybrid Non-CDP pipeline: Rust-powered grayscale + bilinear upscale → Windows OCR → clustering → red bounding-box annotation with numbered badges (`[1]`, `[2]`…). Two parallel representations returned: a visual PNG for spatial orientation and a semantic `elements[]` list with `clickAt` coords — no CDP required.
20
- - **LLM-native design** — Built around how LLMs think, not how humans click. `run_macro` batches multiple operations into a single API call; `diffMode` sends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.
21
- - **Reactive Perception Graph** — Register a `lensId` for a window or browser tab, pass it to action tools, and get guard-checked `post.perception` feedback after each action. It reduces repeated `screenshot` / `get_context` calls and prevents wrong-window typing or stale-coordinate clicks.
22
- - **Full CJK support** — Uses Win32 `GetWindowTextW` for window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.
23
- - **3-tier token reduction** — `detail="image"` (~443 tok) / `detail="text"` (~100–300 tok) / `diffMode=true` (~160 tok). Send pixels only when you actually need to see them.
24
- - **1:1 coordinate mode** — `dotByDot=true` captures at native resolution (WebP). Image pixel = screen coordinate — no scale math needed. With `origin`+`scale` passed to `mouse_click`, the server converts coords for you — eliminating off-by-one / scale bugs.
25
- - **Browser capture data reduction** — `grayscale=true` (~50% size), `dotByDotMaxDimension=1280` (auto-scaled with coord preservation), and `windowTitle + region` sub-crops help exclude browser chrome and other irrelevant pixels. Typical reduction for heavy captures: 50–70%.
26
- - **Chromium smart fallback** — `detail="text"` on Chrome/Edge/Brave auto-skips UIA (prohibitively slow there) and runs Windows OCR. `hints.chromiumGuard` + `hints.ocrFallbackFired` flag the path taken.
27
- - **UIA element extraction** — `detail="text"` returns button names and `clickAt` coords as JSON. Claude can click the right element without ever looking at a screenshot.
28
- - **Auto-dock CLI** — `dock_window` snaps any window to a screen corner with always-on-top. Set `DESKTOP_TOUCH_DOCK_TITLE='@parent'` to auto-dock the terminal hosting Claude on MCP startup — the process-tree walker finds the right window regardless of title.
29
- - **Emergency stop (Failsafe)** — Move the mouse to the **top-left corner (within 10px of 0,0)** to immediately terminate the MCP server.
30
-
31
- ---
32
-
33
- ## Requirements
34
-
35
- | | |
36
- |---|---|
37
- | OS | Windows 10 / 11 (64-bit) |
38
- | Node.js | v20+ recommended (tested on v22+) |
39
- | PowerShell | 5.1+ (bundled with Windows) — used only as fallback when the Rust native engine is unavailable |
40
- | Claude CLI | `claude` command must be available |
41
-
42
- > **Note:** nut-js native bindings require the Visual C++ Redistributable.
43
- > Download from [Microsoft](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist) if not already installed.
44
-
45
- ---
46
-
47
- ## Installation
48
-
49
- ```bash
50
- npx -y @harusame64/desktop-touch-mcp
51
- ```
52
-
16
+
17
+ ---
18
+
19
+ ## Features
20
+
21
+ - **⚡ High-performance Rust Native Core** — The UIA bridge and image-diff engine are written in Rust (`napi-rs` + `windows-rs`) and loaded as a native `.node` addon. Direct COM calls from a dedicated MTA thread eliminate PowerShell process spawning — `getFocusedElement` completes in **2 ms** (160× faster), and `getUiElements` returns full trees in **~100 ms** with a batch BFS algorithm that minimizes cross-process RPC. Image-diff operations use **SSE2 SIMD** for 13–15× throughput. When the native engine is unavailable, every function transparently falls back to PowerShell — zero config required.
22
+ - **🎯 Set-of-Marks (SoM) visual fallback** — Games, RDP sessions, and non-accessible Electron apps return clickable elements even when UIA is completely blind. `screenshot(detail="text")` automatically detects UIA sparsity and activates a Hybrid Non-CDP pipeline: Rust-powered grayscale + bilinear upscale → Windows OCR → clustering → red bounding-box annotation with numbered badges (`[1]`, `[2]`…). Two parallel representations returned: a visual PNG for spatial orientation and a semantic `elements[]` list with `clickAt` coords — no CDP required.
23
+ - **LLM-native design** — Built around how LLMs think, not how humans click. `run_macro` batches multiple operations into a single API call; `diffMode` sends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.
24
+ - **Reactive Perception Graph** — Register a `lensId` for a window or browser tab, pass it to action tools, and get guard-checked `post.perception` feedback after each action. It reduces repeated `screenshot` / `desktop_state` calls and prevents wrong-window typing or stale-coordinate clicks.
25
+ - **Full CJK support** — Uses Win32 `GetWindowTextW` for window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.
26
+ - **3-tier token reduction** — `detail="image"` (~443 tok) / `detail="text"` (~100–300 tok) / `diffMode=true` (~160 tok). Send pixels only when you actually need to see them.
27
+ - **1:1 coordinate mode** — `dotByDot=true` captures at native resolution (WebP). Image pixel = screen coordinate — no scale math needed. With `origin`+`scale` passed to `mouse_click`, the server converts coords for you — eliminating off-by-one / scale bugs.
28
+ - **Browser capture data reduction** — `grayscale=true` (~50% size), `dotByDotMaxDimension=1280` (auto-scaled with coord preservation), and `windowTitle + region` sub-crops help exclude browser chrome and other irrelevant pixels. Typical reduction for heavy captures: 50–70%.
29
+ - **Chromium smart fallback** — `detail="text"` on Chrome/Edge/Brave auto-skips UIA (prohibitively slow there) and runs Windows OCR. `hints.chromiumGuard` + `hints.ocrFallbackFired` flag the path taken.
30
+ - **UIA element extraction** — `detail="text"` returns button names and `clickAt` coords as JSON. Claude can click the right element without ever looking at a screenshot.
31
+ - **Auto-dock CLI** — `window_dock(action='dock')` snaps any window to a screen corner with always-on-top. Set `DESKTOP_TOUCH_DOCK_TITLE='@parent'` to auto-dock the terminal hosting Claude on MCP startup — the process-tree walker finds the right window regardless of title.
32
+ - **Emergency stop (Failsafe)** — Move the mouse to the **top-left corner (within 10px of 0,0)** to immediately terminate the MCP server.
33
+
34
+ ---
35
+
36
+ ## Requirements
37
+
38
+ | | |
39
+ |---|---|
40
+ | OS | Windows 10 / 11 (64-bit) |
41
+ | Node.js | v20+ recommended (tested on v22+) |
42
+ | PowerShell | 5.1+ (bundled with Windows) — used only as fallback when the Rust native engine is unavailable |
43
+ | Claude CLI | `claude` command must be available |
44
+
45
+ > **Note:** nut-js native bindings require the Visual C++ Redistributable.
46
+ > Download from [Microsoft](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist) if not already installed.
47
+
48
+ ---
49
+
50
+ ## Installation
51
+
52
+ ```bash
53
+ npx -y @harusame64/desktop-touch-mcp
54
+ ```
55
+
53
56
  The npm launcher resolves runtime strictly by npm package version. For package `X.Y.Z`, it fetches only GitHub Release tag `vX.Y.Z`, downloads `desktop-touch-mcp-windows.zip`, verifies its SHA256 digest, and only then expands it under `%USERPROFILE%\.desktop-touch-mcp`. Verified cached releases are reused on later runs.
54
57
 
55
58
  Set `DESKTOP_TOUCH_MCP_HOME` to override the cache root directory.
56
-
57
- ### Register with Claude CLI
58
-
59
- Add to `~/.claude.json` under `mcpServers`:
60
-
61
- ```json
62
- {
63
- "mcpServers": {
64
- "desktop-touch": {
65
- "type": "stdio",
66
- "command": "npx",
67
- "args": ["-y", "@harusame64/desktop-touch-mcp"]
68
- }
69
- }
70
- }
71
- ```
72
-
73
- **No system prompt needed.** The command reference is automatically injected into Claude via the MCP `initialize` response's `instructions` field.
74
-
75
- ### Register with other clients (HTTP mode)
76
-
77
- Clients that require an HTTP endpoint (GPT Desktop, VS Code Copilot, Cursor, etc.) can use the built-in Streamable HTTP transport:
78
-
79
- ```bash
80
- npx -y @harusame64/desktop-touch-mcp --http
81
- # or with a custom port:
82
- npx -y @harusame64/desktop-touch-mcp --http --port 8080
83
- ```
84
-
85
- The server starts at `http://127.0.0.1:23847/mcp` (localhost only). Register the URL in your MCP client settings. A health check is available at `http://127.0.0.1:<port>/health`.
86
-
87
- In HTTP mode the system tray icon shows the active URL and provides quick-copy and open-in-browser shortcuts.
88
-
89
- ### Development install
90
-
91
- ```bash
92
- git clone https://github.com/Harusame64/desktop-touch-mcp.git
93
- cd desktop-touch-mcp
94
- npm install
95
- ```
96
-
59
+
60
+ ### Register with Claude CLI
61
+
62
+ Add to `~/.claude.json` under `mcpServers`:
63
+
64
+ ```json
65
+ {
66
+ "mcpServers": {
67
+ "desktop-touch": {
68
+ "type": "stdio",
69
+ "command": "npx",
70
+ "args": ["-y", "@harusame64/desktop-touch-mcp"]
71
+ }
72
+ }
73
+ }
74
+ ```
75
+
76
+ **No system prompt needed.** The command reference is automatically injected into Claude via the MCP `initialize` response's `instructions` field.
77
+
78
+ ### Register with other clients (HTTP mode)
79
+
80
+ Clients that require an HTTP endpoint (GPT Desktop, VS Code Copilot, Cursor, etc.) can use the built-in Streamable HTTP transport:
81
+
82
+ ```bash
83
+ npx -y @harusame64/desktop-touch-mcp --http
84
+ # or with a custom port:
85
+ npx -y @harusame64/desktop-touch-mcp --http --port 8080
86
+ ```
87
+
88
+ The server starts at `http://127.0.0.1:23847/mcp` (localhost only). Register the URL in your MCP client settings. A health check is available at `http://127.0.0.1:<port>/health`.
89
+
90
+ In HTTP mode the system tray icon shows the active URL and provides quick-copy and open-in-browser shortcuts.
91
+
92
+ ### Development install
93
+
94
+ ```bash
95
+ git clone https://github.com/Harusame64/desktop-touch-mcp.git
96
+ cd desktop-touch-mcp
97
+ npm install
98
+ ```
99
+
97
100
  Build after install:
98
101
 
99
102
  ```bash
100
103
  npm run build
101
104
  ```
102
-
103
- For a local checkout, register the built server directly:
104
-
105
- ```json
106
- {
107
- "mcpServers": {
108
- "desktop-touch": {
109
- "type": "stdio",
110
- "command": "node",
111
- "args": ["D:/path/to/desktop-touch-mcp/dist/index.js"]
112
- }
113
- }
114
- }
115
- ```
116
-
117
- > **Note:** Replace `D:/path/to/desktop-touch-mcp` with the actual path where you cloned this repository.
118
-
119
- ---
120
-
121
- ## Tools (57 total)
122
-
123
- > 📖 **Full command reference**: [`docs/system-overview.md`](docs/system-overview.md) — every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
124
-
125
-
126
- ### Screenshot (5)
127
- | Tool | Description |
128
- |---|---|
129
- | `screenshot` | Main capture. Supports `detail`, `dotByDot`, `dotByDotMaxDimension`, `grayscale`, `region` sub-crop, `diffMode`. `detail="text"` auto-activates the SoM pipeline when UIA is blind (games, RDP, custom Electron) |
130
- | `screenshot_background` | Capture a background window without focusing it (PrintWindow API) |
131
- | `screenshot_ocr` | Windows.Media.Ocr on a window; returns word-level text + screen clickAt coords |
132
- | `get_screen_info` | Monitor layout, DPI, cursor position |
133
- | `scroll_capture` | Full-page stitch by scrolling (MAE overlap detection + 10% fallback) |
134
-
135
- ### Window management (4)
136
- | Tool | Description |
137
- |---|---|
138
- | `get_windows` | List all windows in Z-order |
139
- | `get_active_window` | Info about the focused window |
140
- | `focus_window` | Bring a window to foreground by partial title match |
141
- | `dock_window` | Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
142
-
143
- ### Mouse (5)
144
- | Tool | Description |
145
- |---|---|
146
- | `mouse_move` / `mouse_click` / `mouse_drag` | Move, click, drag. `doubleClick` / `tripleClick` (line-select). Accept `speed` and `homing` parameters |
147
- | `scroll` | Scroll in any direction. Accepts `speed` and `homing` parameters |
148
- | `get_cursor_position` | Current cursor coordinates |
149
-
150
- ### Keyboard (2)
151
- | Tool | Description |
152
- |---|---|
153
- | `keyboard_type` | Type text. `use_clipboard=true` bypasses IME (required for em-dash / smart quotes). `replaceAll=true` sends Ctrl+A before typing. Non-ASCII symbols trigger clipboard mode automatically (opt-out: `forceKeystrokes=true`) |
154
- | `keyboard_press` | Key combos (`ctrl+c`, `alt+f4`, etc.) |
155
-
156
- ### UI Automation (4)
157
- | Tool | Description |
158
- |---|---|
159
- | `get_ui_elements` | Full UIA element tree for a window |
160
- | `click_element` | Click a button by name or automationId — no coordinates needed |
161
- | `set_element_value` | Write directly to a text field |
162
- | `scope_element` | High-res zoom crop of an element + its child tree |
163
-
164
- ### Browser CDP (12)
165
- | Tool | Description |
166
- |---|---|
167
- | `browser_launch` | Launch Chrome/Edge/Brave with `--remote-debugging-port` and wait for the CDP endpoint (idempotent) |
168
- | `browser_connect` | Connect to Chrome/Edge via CDP; lists open tabs with `active:true/false` |
169
- | `browser_find_element` | CSS selector exact physical screen coords |
170
- | `browser_click_element` | Find DOM element + click in one step |
171
- | `browser_eval` | Evaluate JS expression in the browser tab |
172
- | `browser_fill_input` | Fill React/Vue/Svelte controlled inputs via CDP — works where `browser_eval` value assignment doesn't update framework state |
173
- | `browser_get_dom` | Get outerHTML of element or `document.body` |
174
- | `browser_get_interactive` | Enumerate links / buttons / inputs + **ARIA toggles** with `state.{checked,pressed,selected,expanded}`; each element includes `viewportPosition` |
175
- | `browser_get_app_state` | **SPA state extractor** — one CDP call that scans `__NEXT_DATA__`, `__NUXT_DATA__`, `__REMIX_CONTEXT__`, `__APOLLO_STATE__`, GitHub `react-app` embeddedData, JSON-LD, `window.__INITIAL_STATE__` |
176
- | `browser_search` | Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
177
- | `browser_navigate` | Navigate via CDP `Page.navigate`; `waitForLoad:true` (default) returns once `readyState==='complete'` |
178
- | `browser_disconnect` | Close cached CDP WebSocket sessions |
179
-
180
- All `browser_*` tools that touch the DOM accept `includeContext:false` to omit the trailing `activeTab:` / `readyState:` lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
181
-
182
- ### Workspace (2)
183
- | Tool | Description |
184
- |---|---|
185
- | `workspace_snapshot` | All windows: thumbnails + UI summaries in one call |
186
- | `workspace_launch` | Launch an app and auto-detect the new window |
187
-
188
- ### Context / Wait / History (8)
189
- | Tool | Description |
190
- |---|---|
191
- | `get_context` | Lightweight snapshot of focused window, element, cursor, and page state |
192
- | `get_history` | Retrieve recent tool invocation history |
193
- | `get_document_state` | Chrome page state (URL/title/readyState/scroll) via CDP |
194
- | `engine_status` | Returns which backend is active: `uia` (native Rust or powershell) and `imageDiff` (native Rust SSE2 or typescript). Diagnostic call once per session when troubleshooting performance |
195
- | `wait_until` | Server-side wait for window/focus/terminal/browser DOM state changes |
196
- | `events_subscribe` / `events_poll` / `events_unsubscribe` / `events_list` | Subscribe to and poll window appearance/disappearance/focus events |
197
-
198
- ### Terminal (2)
199
- | Tool | Description |
200
- |---|---|
201
- | `terminal_read` | Read text from Windows Terminal / PowerShell / cmd / WSL via UIA/OCR. Supports `sinceMarker` for diff reads |
202
- | `terminal_send` | Send commands to a terminal. Uses clipboard paste by default for IME safety |
203
-
204
- ### Pin / Macro (3)
205
- | Tool | Description |
206
- |---|---|
207
- | `pin_window` / `unpin_window` | Always-on-top toggle |
208
- | `run_macro` | Execute up to 50 steps sequentially in one MCP call |
209
-
210
- ### Clipboard (2)
211
- | Tool | Description |
212
- |---|---|
213
- | `clipboard_read` | Read the current Windows clipboard text (non-text payloads return empty string) |
214
- | `clipboard_write` | Write text to the Windows clipboard; full Unicode / emoji / CJK support |
215
-
216
- ### Notification (1)
217
- | Tool | Description |
218
- |---|---|
219
- | `notification_show` | Show a Windows system tray balloon notification — useful to alert the user when a long-running task finishes |
220
-
221
- ### Scroll (2)
222
- | Tool | Description |
223
- |---|---|
224
- | `scroll_to_element` | Scroll a named element into the viewport without computing scroll amounts. Chrome path: `selector` + `block` alignment. Native path: `name` + `windowTitle` via UIA ScrollItemPattern |
225
- | `smart_scroll` | **SmartScroll** — unified scroll dispatcher: CDP → UIA → image binary-search fallback. Handles nested containers, virtualised lists (TanStack/React Virtualized), sticky-header occlusion, and image-only environments. Returns `pageRatio`, `ancestors[]`, and hash-verified `scrolled` |
226
-
227
- ---
228
-
229
- ## Browser CDP automation
230
-
231
- For web automation, connect Chrome or Edge with the remote debugging port enabled — no Selenium or Playwright needed.
232
-
233
- ```bash
234
- # Launch Chrome in CDP mode
235
- chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp
236
- ```
237
-
238
- ```
239
- browser_launch() launch Chrome/Edge/Brave in debug mode (idempotent)
240
- browser_connect() → list open tabs + get tabIds
241
- browser_find_element("#submit") → CSS selector → physical screen coords
242
- browser_click_element("#submit") → find + click in one step (auto-focuses browser)
243
- browser_eval("document.title") → evaluate JS, returns result
244
- browser_fill_input("#email", "user@example.com") fill React/Vue/Svelte controlled input (state-safe)
245
- browser_get_dom("#main", maxLength=5000)outerHTML, truncated to maxLength chars
246
- browser_get_interactive() → links/buttons/inputs + ARIA toggles + viewportPosition per element
247
- browser_get_app_state() one-shot SPA state (Next/Nuxt/Remix/Apollo/GitHub react-app/Redux SSR)
248
- browser_search(by="text", pattern="...")→ grep DOM with confidence ranking
249
- browser_navigate("https://example.com") → navigate via CDP (no address bar interaction)
250
- browser_disconnect() → clean up WebSocket sessions
251
- ```
252
-
253
- For chained calls in the same tab, pass `includeContext:false` to omit the activeTab/readyState annotation (~150 tok/call saved). Boolean / object params accept the LLM-friendly string spellings (`"true"`, `"{}"`).
254
-
255
- Coordinates returned by `browser_find_element` account for the browser chrome (tab strip + address bar height) and `devicePixelRatio`, so they can be passed directly to `mouse_click` without any scaling.
256
-
257
- **Recommended web workflow:**
258
- ```
259
- browser_connect() browser_get_dom() browser_find_element(selector) browser_click_element(selector)
260
- ```
261
-
262
- ---
263
-
264
- ## Auto-dock CLI on startup
265
-
266
- Keep Claude CLI visible while operating other apps full-screen. Set env vars in your MCP config and the docked window auto-snaps into place every MCP startup.
267
-
268
- ```json
269
- {
270
- "mcpServers": {
271
- "desktop-touch": {
272
- "type": "stdio",
273
- "command": "npx",
274
- "args": ["-y", "@harusame64/desktop-touch-mcp"],
275
- "env": {
276
- "DESKTOP_TOUCH_DOCK_TITLE": "@parent",
277
- "DESKTOP_TOUCH_DOCK_CORNER": "bottom-right",
278
- "DESKTOP_TOUCH_DOCK_WIDTH": "480",
279
- "DESKTOP_TOUCH_DOCK_HEIGHT": "360",
280
- "DESKTOP_TOUCH_DOCK_PIN": "true"
281
- }
282
- }
283
- }
284
- }
285
- ```
286
-
287
- | Env var | Default | Notes |
288
- |---|---|---|
289
- | `DESKTOP_TOUCH_DOCK_TITLE` | *(unset = off)* | `@parent` walks the MCP process tree to find the hosting terminal — immune to title / branch / project changes. Or use a literal substring. |
290
- | `DESKTOP_TOUCH_DOCK_CORNER` | `bottom-right` | `top-left` / `top-right` / `bottom-left` / `bottom-right` |
291
- | `DESKTOP_TOUCH_DOCK_WIDTH` / `HEIGHT` | `480` / `360` | px (`"480"`) or ratio of work area (`"25%"`) — 4K/8K auto-adapts |
292
- | `DESKTOP_TOUCH_DOCK_PIN` | `true` | Always-on-top toggle |
293
- | `DESKTOP_TOUCH_DOCK_MONITOR` | primary | Monitor id from `get_screen_info` |
294
- | `DESKTOP_TOUCH_DOCK_SCALE_DPI` | `false` | If true, multiply px values by `dpi / 96` (opt-in per-monitor scaling) |
295
- | `DESKTOP_TOUCH_DOCK_MARGIN` | `8` | Screen-edge padding (px) |
296
- | `DESKTOP_TOUCH_DOCK_TIMEOUT_MS` | `5000` | Max wait for the target window to appear |
297
-
298
- > **Input routing gotcha:** when a pinned window is active (e.g. Claude CLI), `keyboard_type` / `keyboard_press` send keys to it, **not** the app you wanted to type into. Always call `focus_window(title=...)` before keyboard operations, then verify `isActive=true` via `screenshot(detail='meta')`.
299
-
300
- ### Reactive Perception Graph (4)
301
-
302
- | Tool | Description |
303
- |---|---|
304
- | `perception_register` | Register a live perception lens on a window or browser tab. Returns a `lensId` to pass to action tools |
305
- | `perception_read` | Force-refresh the lens and return a full perception envelope when attention is dirty/stale/blocked |
306
- | `perception_forget` | Release a lens when the workflow ends or the target was replaced |
307
- | `perception_list` | List active lenses so Claude can reuse or clean up existing tracking |
308
-
309
- Reactive Perception Graph is desktop-touch's low-cost situational awareness layer. It keeps the target identity, focus, rect, readiness, and guard state alive across actions so Claude does not need to re-check everything with a screenshot after every small move.
310
-
311
- ```
312
- # Register a lens on the target window or browser tab
313
- perception_register({name:"editor", target:{kind:"window", match:{titleIncludes:"Notepad"}}})
314
- {lensId:"perc-1", ...}
315
-
316
- # Pass lensId to action tools. Guards run before the action;
317
- # compact feedback arrives in post.perception after the action.
318
- keyboard_type({text:"hello", windowTitle:"Notepad", lensId:"perc-1"})
319
- post.perception: {attention:"ok", guards:{...}, latest:{target:{title, rect, foreground}}}
320
-
321
- # If the app restarts or focus moves away, guards fail closed before unsafe input:
322
- keyboard_type({text:"x", lensId:"perc-1"})
323
- {ok:false, code:"GuardFailed", suggest:["Re-register lens for the new process instance"]}
324
- ```
325
-
326
- `lensId` is opt-in on all action tools (`keyboard_type`, `keyboard_press`, `mouse_click`, `mouse_drag`, `click_element`, `set_element_value`, `browser_click_element`, `browser_navigate`, `browser_eval`). Omitting `lensId` preserves existing behavior exactly.
327
-
328
- ---
329
-
330
- ## Mouse homing correction
331
-
332
- When Claude calls `screenshot(detail='text')` to read coordinates and then `mouse_click` seconds later, the target window may have moved. The homing system corrects this automatically.
333
-
334
- | Tier | How to enable | Latency | What it does |
335
- |------|--------------|---------|--------------|
336
- | 1 | Always-on (if cache exists) | <1ms | Applies (dx, dy) offset when window moved |
337
- | 2 | Pass `windowTitle` hint | ~100ms | Auto-focuses window if it went behind another |
338
- | 3 | Pass `elementName`/`elementId` + `windowTitle` | 1–3s | UIA re-query for fresh coords on resize |
339
-
340
- ```
341
- # Tier 1 only (automatic)
342
- mouse_click(x=500, y=300)
343
-
344
- # Tier 1 + 2: also bring window to front if hidden
345
- mouse_click(x=500, y=300, windowTitle="Notepad")
346
-
347
- # Tier 1 + 2 + 3: also re-query UIA if window resized
348
- mouse_click(x=500, y=300, windowTitle="Notepad", elementName="Save")
349
-
350
- # Traction control OFF no correction
351
- mouse_click(x=500, y=300, homing=false)
352
- ```
353
-
354
- The `homing` parameter is available on `mouse_click`, `mouse_move`, `mouse_drag`, and `scroll`. The cache is updated automatically on every `screenshot()`, `get_windows()`, `focus_window()`, and `workspace_snapshot()` call.
355
-
356
- ### `mouse_click` image-local coords (origin + scale)
357
-
358
- When you take a `dotByDot` screenshot with `dotByDotMaxDimension`, the response prints the `origin` and `scale` values. Instead of computing screen coords manually, copy them into `mouse_click`:
359
-
360
- ```
361
- # Screenshot response:
362
- # origin: (0, 120) | scale: 0.6667
363
- # To click image pixel (ix, iy): mouse_click(x=ix, y=iy, origin={x:0, y:120}, scale=0.6667)
364
-
365
- mouse_click(x=640, y=300, origin={x:0, y:120}, scale=0.6667, windowTitle="Chrome")
366
- # Server converts: screen = (0 + 640/0.6667, 120 + 300/0.6667) = (960, 570)
367
- ```
368
-
369
- This eliminates a whole class of off-by-one and scale bugs. Without origin/scale, `x`/`y` remain absolute screen pixels (unchanged behavior).
370
-
371
- ---
372
-
373
- ## `screenshot` key parameters
374
-
375
- ```
376
- detail="image" — PNG/WebP pixels (default)
377
- detail="text" — UIA element JSON + clickAt coords (no image, ~100–300 tok)
378
- detail="meta" — Title + region only (cheapest, ~20 tok/window)
379
- dotByDot=true — 1:1 WebP; image_px + origin = screen_px
380
- dotByDotMaxDimension=N — cap longest edge (response includes scale for coord math)
381
- grayscale=true — ~50% smaller for text-heavy captures (code/AWS console)
382
- region={x,y,w,h} — with windowTitle: window-local coords (exclude browser chrome)
383
- without: virtual screen coords
384
- diffMode=true — I-frame first call, P-frame (changed windows only) after (~160 tok)
385
- ocrFallback="auto" — detail='text' auto-fires Windows OCR on uiaSparse or empty
386
- ```
387
-
388
- **Recommended Chrome combo** (50–70% data reduction):
389
- ```
390
- screenshot(windowTitle="Chrome",
391
- dotByDot=true, dotByDotMaxDimension=1280, grayscale=true,
392
- region={x:0, y:120, width:1920, height:900}) # skip browser chrome
393
- ```
394
-
395
- **Recommended workflow:**
396
- ```
397
- workspace_snapshot() → full orientation (resets diff buffer)
398
- screenshot(detail="text", windowTitle=X) → get actionable[].clickAt coords
399
- mouse_click(x, y) → click directly, no math needed
400
- screenshot(diffMode=true) → check only what changed (~160 tok)
401
- ```
402
-
403
- ---
404
-
405
- ## Security
406
-
407
- ### Emergency stop (Failsafe)
408
-
409
- **Move the mouse to the top-left corner of the screen (within 10px of 0,0) to immediately terminate the MCP server.**
410
-
411
- - **Per-tool check**: `checkFailsafe()` runs before every tool handler
412
- - **Background monitor**: 500ms polling as a backup for long-running operations
413
- - Trigger radius: 10px
414
-
415
- ### Blocked operations
416
-
417
- **`workspace_launch` blocklist:**
418
- `cmd.exe`, `powershell.exe`, `pwsh.exe`, `wscript.exe`, `cscript.exe`, `mshta.exe`, `regsvr32.exe`, `rundll32.exe`, `msiexec.exe`, `bash.exe`, `wsl.exe` are blocked.
419
- Script extensions (`.bat`, `.ps1`, `.vbs`, etc.) are rejected. Arguments containing `;`, `&`, `|`, `` ` ``, `$(`, `${` are also rejected.
420
-
421
- **`keyboard_press` blocklist:**
422
- `Win+R` (Run dialog), `Win+X` (admin menu), `Win+S` (search), `Win+L` (lock screen) are blocked.
423
-
424
- ### PowerShell injection protection
425
-
426
- All `-like` patterns in the UIA bridge PowerShell fallback path are sanitized with `escapeLike()`, which escapes wildcard characters (`*`, `?`, `[`, `]`) before they reach PowerShell. When the Rust native engine is active, PowerShell is not invoked for UIA operations.
427
-
428
- ### Allowlist for `workspace_launch`
429
-
430
- Shell interpreters are blocked by default. To allow specific executables, create an allowlist file:
431
-
432
- **File locations (searched in order):**
433
- 1. Path in `DESKTOP_TOUCH_ALLOWLIST` environment variable
434
- 2. `~/.claude/desktop-touch-allowlist.json`
435
- 3. `desktop-touch-allowlist.json` in the server's working directory
436
-
437
- **Format:**
438
- ```json
439
- {
440
- "allowedExecutables": [
441
- "pwsh.exe",
442
- "C:\\Tools\\myapp.exe"
443
- ]
444
- }
445
- ```
446
-
447
- Changes take effect immediately — no restart needed.
448
-
449
- ---
450
-
451
- ## Mouse movement speed
452
-
453
- All mouse tools (`mouse_move`, `mouse_click`, `mouse_drag`, `scroll`) accept an optional `speed` parameter:
454
-
455
- | Value | Behavior |
456
- |---|---|
457
- | Omitted | Uses the configured default (see below) |
458
- | `0` | Instant teleport — `setPosition()`, no animation |
459
- | `1–N` | Animated movement at N px/sec |
460
-
461
- **Default speed** is 1500 px/sec. Change it permanently via the `DESKTOP_TOUCH_MOUSE_SPEED` environment variable:
462
-
463
- ```json
464
- {
465
- "mcpServers": {
466
- "desktop-touch": {
467
- "type": "stdio",
468
- "command": "npx",
469
- "args": ["-y", "@harusame64/desktop-touch-mcp"],
470
- "env": {
471
- "DESKTOP_TOUCH_MOUSE_SPEED": "3000"
472
- }
473
- }
474
- }
475
- }
476
- ```
477
-
478
- Common values: `0` = teleport, `1500` = default gentle, `3000` = fast, `5000` = very fast.
479
-
480
- ---
481
-
482
- ## Force-Focus (AttachThreadInput)
483
-
484
- Windows foreground-stealing protection can prevent `SetForegroundWindow` from succeeding when another window (such as a pinned Claude CLI) is in the foreground. This causes subsequent keystrokes or clicks to land in the wrong window — a silent failure.
485
-
486
- `mouse_click`, `keyboard_type`, `keyboard_press`, and `terminal_send` all accept a `forceFocus` parameter that bypasses this protection using `AttachThreadInput`:
487
-
488
- ```json
489
- {
490
- "name": "mouse_click",
491
- "arguments": {
492
- "x": 500,
493
- "y": 300,
494
- "windowTitle": "Google Chrome",
495
- "forceFocus": true
496
- }
497
- }
498
- ```
499
-
500
- If the force attempt is refused despite `AttachThreadInput`, the response includes `hints.warnings: ["ForceFocusRefused"]`.
501
-
502
- **Global default via environment variable:**
503
-
504
- ```json
505
- {
506
- "mcpServers": {
507
- "desktop-touch": {
508
- "env": {
509
- "DESKTOP_TOUCH_FORCE_FOCUS": "1"
510
- }
511
- }
512
- }
513
- }
514
- ```
515
-
516
- Setting `DESKTOP_TOUCH_FORCE_FOCUS=1` makes `forceFocus: true` the default for all four tools without changing each call.
517
-
518
- **Known tradeoffs:**
519
-
520
- - During the ~10ms `AttachThreadInput` window, key state and mouse capture are shared between the two threads. In rapid macro sequences this can cause a race condition (rare in practice).
521
- - Disable `forceFocus` (or unset the env var) when the user is manually operating another app to avoid unexpected focus shifts.
522
-
523
- ---
524
-
525
- ## Auto Guard (v0.12+)
526
-
527
- Action tools (`mouse_click`, `mouse_drag`, `keyboard_type`, `keyboard_press`, `click_element`, `set_element_value`, `browser_click_element`, `browser_navigate`) automatically guard each action when you pass `windowTitle` / `tabId`:
528
-
529
- - Verifies target window identity (process restart / HWND replacement detected)
530
- - Confirms click coordinates are inside the target window rect
531
- - Returns `post.perception.status` on every response — including failures — so the LLM can recover without a screenshot
532
-
533
- **Disabling auto guard** — set `DESKTOP_TOUCH_AUTO_GUARD=0` to restore v0.11.12 behavior (no auto guard):
534
-
535
- ```json
536
- {
537
- "mcpServers": {
538
- "desktop-touch": {
539
- "type": "stdio",
540
- "command": "npx",
541
- "args": ["-y", "@harusame64/desktop-touch-mcp"],
542
- "env": {
543
- "DESKTOP_TOUCH_AUTO_GUARD": "0"
544
- }
545
- }
546
- }
547
- }
548
- ```
549
-
550
- When auto guard is enabled (default), `post.perception.status` will be one of:
551
-
552
- | Status | Meaning |
553
- |---|---|
554
- | `ok` | Guard passed target verified |
555
- | `unguarded` | `windowTitle` not provided; action ran without guard |
556
- | `target_not_found` | No window matched the given title |
557
- | `ambiguous_target` | Multiple windows matched; use a more specific title |
558
- | `identity_changed` | Window was replaced (process restart / HWND change) |
559
- | `unsafe_coordinates` | Click coordinates are outside the target window rect |
560
- | `needs_escalation` | Use `browser_click_element` or specify `windowTitle` |
561
-
562
- When `unsafe_coordinates` or `identity_changed` is returned, the response may include a `suggestedFix.fixId`. Pass that `fixId` to the relevant tool call to approve the recovery:
563
-
564
- ```json
565
- { "name": "mouse_click", "arguments": { "fixId": "fix-..." } }
566
- { "name": "keyboard_type", "arguments": { "fixId": "fix-...", "text": "hello" } }
567
- { "name": "click_element", "arguments": { "fixId": "fix-..." } }
568
- { "name": "browser_click_element", "arguments": { "fixId": "fix-..." } }
569
- ```
570
-
571
- The fix is one-shot and expires in 15 seconds. The server revalidates the target process identity before executing.
572
-
573
- ---
574
-
575
- ## v0.13 Additions
576
-
577
- ### Target-Identity Timeline
578
-
579
- The server tracks a semantic timeline of what happened to each target window/tab. Recent events are included in:
580
-
581
- - `get_history` `recentTargetKeys`: array of 3 most recently active target keys (compact, no event bodies)
582
- - `perception_read(lensId)` `recentEvents`: up to 10 events for that lens's target, each with `tsMs`, `semantic`, `summary`
583
-
584
- Enable the MCP resources below to browse timelines:
585
-
586
- ```json
587
- { "env": { "DESKTOP_TOUCH_PERCEPTION_RESOURCES": "1" } }
588
- ```
589
-
590
- MCP resources available when enabled:
591
-
592
- | URI | Content |
593
- |---|---|
594
- | `perception://target/{targetKey}/timeline` | Semantic event timeline for a target |
595
- | `perception://targets/recent` | Most recently active target keys |
596
- | `perception://lens/{lensId}/summary` | Lens attention/guard state |
597
-
598
- ### Manual Lens Eviction: FIFO LRU
599
-
600
- Manual lenses (created via `perception_register`) are now evicted by **least-recently-used** instead of insertion order. Using `perception_read`, `evaluatePreToolGuards`, or `buildEnvelopeFor` on a lens promotes it. The hard limit of 16 active lenses is unchanged.
601
-
602
- ### browser_eval Structured Mode
603
-
604
- Pass `withPerception: true` to receive a structured JSON response with `post.perception` instead of raw text:
605
-
606
- ```json
607
- { "name": "browser_eval", "arguments": { "expression": "document.title", "withPerception": true } }
608
- ```
609
-
610
- Returns `{ ok: true, result: "...", post: { perception: { status: "ok", ... } } }`.
611
-
612
- ### mouse_drag Cross-Window Guard
613
-
614
- `mouse_drag` now guards both start and end coordinates. Drags that cross window boundaries (or reach the desktop wallpaper) are blocked by default. To allow intentional cross-window or range-selection drags:
615
-
616
- ```json
617
- { "name": "mouse_drag", "arguments": { "startX": 100, "startY": 100, "endX": 900, "endY": 900, "allowCrossWindowDrag": true } }
618
- ```
619
-
620
- ---
621
-
622
- ## Performance (v0.15 Rust Native Engine)
623
-
624
- The Rust native engine (`@harusame64/desktop-touch-engine`) replaces PowerShell process spawning with direct COM calls over a persistent MTA thread. It loads automatically as a `.node` addon — no configuration needed.
625
-
626
- ### UIA Benchmark (vs PowerShell baseline)
627
-
628
- | Function | Rust Native | PowerShell | Speedup |
629
- |---|---|---|---|
630
- | `getFocusedElement` | **2.2 ms** | 366 ms | **163.9×** |
631
- | `getUiElements` (Explorer, ~60 elements) | **106.5 ms** | 346 ms | **3.3×** |
632
- | **Weighted average** | | | **~82×** |
633
-
634
- ### Image Diff Benchmark (SSE2 SIMD)
635
-
636
- | Function | Rust (SSE2) | TypeScript | Speedup |
637
- |---|---|---|---|
638
- | `computeChangeFraction` (1920×1080) | **0.26 ms** | 3.8 ms | **~15×** |
639
- | `dHash` (perceptual hash) | **0.09 ms** | 1.2 ms | **~13×** |
640
-
641
- ### Architecture
642
-
643
- ```
644
- Claude CLI / MCP Client
645
- │ stdio or HTTP (MCP protocol)
646
-
647
- desktop-touch-mcp (TypeScript)
648
-
649
- ├── Rust Native Engine (.node addon) ← NEW in v0.15
650
- │ ├── UIA: 13 functions via napi-rs + windows-rs 0.62
651
- │ │ └── Dedicated COM thread (MTA) + batch BFS algorithm
652
- │ └── Image: SSE2 SIMD pixel diff + perceptual hashing
653
-
654
- └── PowerShell Fallback (automatic)
655
- └── Activates transparently if .node is unavailable
656
- ```
657
-
658
- ### Why `getUiElements` is 3.3× (not 160×)
659
-
660
- The 160× speedup on `getFocusedElement` comes from eliminating PowerShell process startup (~200 ms) and .NET assembly loading. For `getUiElements`, the bottleneck shifts to the **UIA provider** inside the target application (e.g., Explorer) — it must enumerate its UI tree regardless of who asks. The Rust engine uses a **batch BFS algorithm** (`FindAllBuildCache` + `TreeScope_Children`) that minimizes cross-process RPC calls and supports `maxElements` early exit, making it dramatically faster on large trees (VS Code, browsers with 1000+ elements).
661
-
662
- ---
663
-
664
- ## Known limitations
665
-
666
- | Limitation | Detail | Workaround |
667
- |---|---|---|
668
- | Games / video players may return black or hang in background capture | DirectX fullscreen apps may not work even with `PW_RENDERFULLCONTENT` | Retry with `screenshot_background(fullContent=false)`; if still black, use foreground `screenshot` |
669
- | UIA call overhead | ~2 ms (focus) / ~100 ms (tree) via Rust native engine; ~300 ms via PowerShell fallback | Rust engine loads automatically; `workspace_snapshot` uses a 2 s timeout internally |
670
- | Chrome / WinUI3 UIA elements are empty | Chromium exposes only limited UIA | `screenshot(detail='text')` auto-detects Chromium and falls back to Windows OCR (`hints.chromiumGuard=true`). For richer DOM access use `browser_connect` + `browser_find_element` |
671
- | Chromium title-regex misses when sites rewrite `document.title` | Guard relies on the ` - Google Chrome` suffix being present; some sites push it off the end of a long title | Title is treated as plain Chrome (UIA runs). OCR path is still reachable via `ocrFallback='always'` or when UIA returns `<5` elements (`uiaSparse`) |
672
- | `browser_*` CDP tools need Chrome launched with `--remote-debugging-port` | If Chrome is already running on the default profile without the flag, `browser_launch` / `browser_connect` fail. The CDP E2E suite (`tests/e2e/browser-cdp.test.ts`) will also fail in that state | Close Chrome first, then `browser_launch` will relaunch it in debug mode, or start Chrome manually with `--remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp` |
673
- | Layer buffer TTL | Buffer auto-clears after 90s of inactivity → next `diffMode` becomes an I-frame | After long waits, call `workspace_snapshot` to explicitly reset the buffer |
674
- | `keyboard_type` / `keyboard_press` follow focus | When `dock_window(pin=true)` keeps another window on top (e.g. Claude CLI), keystrokes may be absorbed by that window | Call `focus_window(title=...)` first and verify `isActive=true` via `screenshot(detail='meta')` before sending keys |
675
- | `keyboard_type` em-dash / smart quotes in Chrome/Edge | Non-ASCII punctuation (em-dash `—`, en-dash `–`, smart quotes `"" ''`) can be intercepted as keyboard accelerators, shifting focus to the address bar | Always use `use_clipboard=true` when the text contains such characters |
676
- | `browser_eval` on React / Vue / Svelte inputs | Setting `element.value = ...` or dispatching synthetic events does not update the framework's internal state | Use `browser_fill_input(selector, value)` it uses native prototype setter + InputEvent which does update React/Vue/Svelte state |
677
-
678
- ---
679
-
680
- ## Token cost reference
681
-
682
- | Mode | Tokens | Use case |
683
- |---|---|---|
684
- | `screenshot` (768px PNG) | ~443 tok | General visual check |
685
- | `screenshot(dotByDot=true)` window | ~800 tok | Precise clicking (no coordinate math) |
686
- | `screenshot(diffMode=true)` | ~160 tok | Post-action diff |
687
- | `screenshot(detail="text")` | ~100–300 tok | UI interaction (no image) |
688
- | `workspace_snapshot` | ~2000 tok | Full session orientation |
689
-
690
- ---
691
-
692
- ## License
693
-
694
- MIT
105
+
106
+ For a local checkout, register the built server directly:
107
+
108
+ ```json
109
+ {
110
+ "mcpServers": {
111
+ "desktop-touch": {
112
+ "type": "stdio",
113
+ "command": "node",
114
+ "args": ["D:/path/to/desktop-touch-mcp/dist/index.js"]
115
+ }
116
+ }
117
+ }
118
+ ```
119
+
120
+ > **Note:** Replace `D:/path/to/desktop-touch-mcp` with the actual path where you cloned this repository.
121
+
122
+ ---
123
+
124
+ ## Tools (28 total — 26 stub catalog + 2 dynamic v2)
125
+
126
+ > 📖 **Full command reference**: [`docs/system-overview.md`](docs/system-overview.md) — every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
127
+
128
+
129
+ ### Screenshot (1)
130
+ | Tool | Description |
131
+ |---|---|
132
+ | `screenshot` | Main capture. Supports `detail` (`meta`/`text`/`image`/`som`/`ocr`), `mode` (`normal`/`background` — Phase 4 absorbs former screenshot_background), `region` sub-crop (Phase 4 absorbs former scope_element after `desktop_discover` gives you element bounds), `dotByDot`, `dotByDotMaxDimension`, `grayscale`, `diffMode`. `detail='text'` auto-activates the SoM pipeline when UIA is blind. `detail='ocr'` (Phase 4 absorbs former screenshot_ocr) returns Windows OCR words with screen-pixel clickAt coords. `scroll(action='capture')` provides full-page stitched capture. |
133
+
134
+ ### Observation (1) World-Graph core
135
+ | Tool | Description |
136
+ |---|---|
137
+ | `desktop_state` | Cheapest read-only observation. Always returns focused window, focused element, modal flag, attention signal. Phase 4: `includeCursor:true` adds `cursor.{x,y,monitorId}` (former get_cursor_position); `includeScreen:true` adds `screen.{displays[],...}` (former get_screen_info); `includeDocument:true` adds `document.{url,title,readyState,scroll,...}` via CDP (former get_document_state). focusedWindow always covers the former get_active_window response. |
138
+
139
+ ### Discovery + Action — World-Graph dispatchers (2 dynamic)
140
+ | Tool | Description |
141
+ |---|---|
142
+ | `desktop_discover` | Find actionable entities and emit leases. Returns `entities[]` (former `get_ui_elements` equivalent — `entityId`/`label`/`role`/`confidence`/`sources`/`primaryAction`/`lease`, plus `rect` when `debug:true`) + `windows[]` (former `get_windows` equivalent — `zOrder`/`title`/`hwnd`/`region`/`isActive`/`isMinimized`/`isMaximized`/`processName`). Pre-Phase-1 `desktop_see`. |
143
+ | `desktop_act` | Lease-consuming action (`click` / `type` / `setValue` / `scroll` / `select`). Phase 4: `action='setValue'` absorbs former set_element_value via UIA ValuePattern (UIA entities) or CDP fill (browser entities). Pre-Phase-1 `desktop_touch`. |
144
+
145
+ ### Window management (1)
146
+ | Tool | Description |
147
+ |---|---|
148
+ | `focus_window` | Bring a window to foreground by partial title match. Use `desktop_discover` to list available titles. |
149
+ | `window_dock(action='dock')` | Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
150
+
151
+ ### Mouse (2)
152
+ | Tool | Description |
153
+ |---|---|
154
+ | `mouse_click` / `mouse_drag` | Click, drag. `doubleClick` / `tripleClick` (line-select). Accept `speed` and `homing` parameters. (`mouse_move` privatized in Phase 4.) |
155
+ | `scroll` | Scroll in any direction (`action='raw'` / `'to_element'` / `'smart'` / `'capture'`). Accepts `speed` and `homing` parameters. |
156
+
157
+ ### Keyboard (1)
158
+ | Tool | Description |
159
+ |---|---|
160
+ | `keyboard` | Send keyboard input. `action='type'` for text (`use_clipboard=true` bypasses IME / required for em-dash / smart quotes; `replaceAll=true` sends Ctrl+A before typing). `action='press'` for key combos (`ctrl+c`, `alt+f4`, etc.). |
161
+
162
+ ### UI Automation (1)
163
+ | Tool | Description |
164
+ |---|---|
165
+ | `click_element` | Click a button by name or automationId no coordinates needed. Use `desktop_discover` to find target entities. (`get_ui_elements` / `set_element_value` / `scope_element` privatized in Phase 4 — see desktop_discover / desktop_act / screenshot.region.) |
166
+
167
+ ### Browser CDP (9)
168
+ | Tool | Description |
169
+ |---|---|
170
+ | `browser_open` | Connect to Chrome/Edge via CDP; lists open tabs with `active:true/false`. Pass `launch:{}` (or with overrides) to auto-spawn a debug-mode browser when no CDP endpoint is live (idempotent) |
171
+ | `browser_locate` | CSS selector exact physical screen coords |
172
+ | `browser_click` | Find DOM element + click in one step |
173
+ | `browser_eval` | Inspect or operate on a tab via 3 actions: `js` (evaluate JS), `dom` (get HTML), `appState` (extract SSR-injected SPA state — `__NEXT_DATA__`, `__NUXT_DATA__`, `__REMIX_CONTEXT__`, `__APOLLO_STATE__`, GitHub `react-app`, JSON-LD, Redux SSR) |
174
+ | `browser_fill` | Fill React/Vue/Svelte controlled inputs via CDP works where `browser_eval(action='js')` value assignment doesn't update framework state |
175
+ | `browser_form` | Inspect form fields (input/select/textarea/button) with name, type, value, label resolved via for/aria/ancestor LABEL |
176
+ | `browser_overview` | Enumerate links / buttons / inputs + **ARIA toggles** with `state.{checked,pressed,selected,expanded}`; each element includes `viewportPosition` |
177
+ | `browser_search` | Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
178
+ | `browser_navigate` | Navigate via CDP `Page.navigate`; `waitForLoad:true` (default) returns once `readyState==='complete'` |
179
+
180
+ All `browser_*` tools that touch the DOM accept `includeContext:false` to omit the trailing `activeTab:` / `readyState:` lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
181
+
182
+ ### Workspace (2)
183
+ | Tool | Description |
184
+ |---|---|
185
+ | `workspace_snapshot` | All windows: thumbnails + UI summaries in one call |
186
+ | `workspace_launch` | Launch an app and auto-detect the new window |
187
+
188
+ ### Wait / Status (2)
189
+ | Tool | Description |
190
+ |---|---|
191
+ | `wait_until` | Server-side wait for window/focus/terminal/browser DOM state changes. Replaces the former events_* polling family — use `condition='window_appears'` / `'terminal_output_contains'` / `'element_matches'` / `'focus_changes'` for one-shot waits. |
192
+ | `server_status` | Returns which backend is active: `uia` (native Rust or powershell) and `imageDiff` (native Rust SSE2 or typescript). Diagnostic — call once per session when troubleshooting performance. |
193
+
194
+ ### Terminal (2)
195
+ | Tool | Description |
196
+ |---|---|
197
+ | `terminal(action='read')` | Read text from Windows Terminal / PowerShell / cmd / WSL via UIA/OCR. Supports `sinceMarker` for diff reads |
198
+ | `terminal(action='send')` | Send commands to a terminal. Uses clipboard paste by default for IME safety |
199
+
200
+ ### Pin / Macro (2)
201
+ | Tool | Description |
202
+ |---|---|
203
+ | `window_dock(action='pin'\|'unpin'\|'dock')` | Always-on-top toggle / docking |
204
+ | `run_macro` | Execute up to 50 steps sequentially in one MCP call. Phase 4: DSL accepts the v1.0.0 dispatcher names (`keyboard({action:'type'})` etc.) only. |
205
+
206
+ ### Clipboard (2)
207
+ | Tool | Description |
208
+ |---|---|
209
+ | `clipboard(action='read')` | Read the current Windows clipboard text (non-text payloads return empty string) |
210
+ | `clipboard(action='write')` | Write text to the Windows clipboard; full Unicode / emoji / CJK support |
211
+
212
+ ### Notification (1)
213
+ | Tool | Description |
214
+ |---|---|
215
+ | `notification_show` | Show a Windows system tray balloon notification — useful to alert the user when a long-running task finishes |
216
+
217
+ ### Scroll (2)
218
+ | Tool | Description |
219
+ |---|---|
220
+ | `scroll(action='to_element')` | Scroll a named element into the viewport without computing scroll amounts. Chrome path: `selector` + `block` alignment. Native path: `name` + `windowTitle` via UIA ScrollItemPattern |
221
+ | `scroll(action='smart')` | **SmartScroll** — unified scroll dispatcher: CDP → UIA → image binary-search fallback. Handles nested containers, virtualised lists (TanStack/React Virtualized), sticky-header occlusion, and image-only environments. Returns `pageRatio`, `ancestors[]`, and hash-verified `scrolled` |
222
+
223
+ ---
224
+
225
+ ## Standard workflow (v1.0.0)
226
+
227
+ The v2 World-Graph surface (`desktop_discover` / `desktop_act`) is the recommended dispatch path. The four-call shape works for native apps, browsers, and terminals identically.
228
+
229
+ ```
230
+ desktop_state → orient: focused window/element, modal, attention signal
231
+ desktop_discover → find actionable entities (returns lease + windows[])
232
+ desktop_act(lease, …) → act on entity (returns attention + post.perception)
233
+ desktop_state → confirm the world changed as expected
234
+ ```
235
+
236
+ Clicking — priority order:
237
+
238
+ ```
239
+ browser_click(selector) → Chrome / Edge (CDP, stable across repaints)
240
+ desktop_act(lease, action='click') → native / dialog / visual (entity-based; use after desktop_discover)
241
+ click_element(name | automationId) → native UIA fallback if desktop_act returns ok:false
242
+ mouse_click(x, y, origin?, scale?) pixel last resort; origin+scale from dotByDot screenshots only
243
+ ```
244
+
245
+ Recovery hints read `response.attention` after every observation and `response.warnings[]` on `desktop_discover` / `desktop_act`. Common reasons:
246
+
247
+ - `lease_expired` / `lease_generation_mismatch` / `lease_digest_mismatch` / `entity_not_found` re-call `desktop_discover`
248
+ - `modal_blocking` dismiss via `click_element`, then retry
249
+ - `entity_outside_viewport` `scroll(action='to_element' | 'raw')`, then re-call `desktop_discover`
250
+ - `executor_failed` fall back to `click_element` / `mouse_click` / `browser_click`
251
+
252
+ Lease lifecycle:
253
+
254
+ - Each `desktop_discover` response carries `softExpiresAtMs` (≈ 60 % of the TTL window). Past that timestamp the LLM should consider re-calling `desktop_discover` even though the lease is still technically valid — `lease.expiresAtMs` is the only correctness wall.
255
+ - TTL adapts to `view` mode (`action`/`explore`/`debug`), entity count, and response payload size. Cap is 60 s.
256
+ - Set `DESKTOP_TOUCH_DISABLE_FUKUWARAI_V2=1` to fall back to the v1 tool surface (`get_windows` / `get_ui_elements` / `set_element_value`) for troubleshooting only V2 is the recommended default.
257
+
258
+ ---
259
+
260
+ ## Browser CDP automation
261
+
262
+ For web automation, connect Chrome or Edge with the remote debugging port enabled — no Selenium or Playwright needed.
263
+
264
+ ```bash
265
+ # Launch Chrome in CDP mode
266
+ chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp
267
+ ```
268
+
269
+ ```
270
+ browser_open({launch:{}}) → spawn-if-needed Chrome in debug mode + list tabs (idempotent)
271
+ browser_open() → connect-only (fail if no CDP endpoint live)
272
+ browser_locate({selector:"#submit"}) → CSS selector → physical screen coords
273
+ browser_click({selector:"#submit"}) → find + click in one step (auto-focuses browser)
274
+ browser_eval({action:"js", expression:"document.title"}) → evaluate JS, returns result
275
+ browser_eval({action:"dom", selector:"#main", maxLength:5000}) → outerHTML, truncated to maxLength chars
276
+ browser_eval({action:"appState"}) → one-shot SPA state (Next/Nuxt/Remix/Apollo/GitHub react-app/Redux SSR)
277
+ browser_fill({selector:"#email", value:"user@example.com"}) → fill React/Vue/Svelte controlled input (state-safe)
278
+ browser_overview() → links/buttons/inputs + ARIA toggles + viewportPosition per element
279
+ browser_search({by:"text", pattern:"..."}) → grep DOM with confidence ranking
280
+ browser_navigate({url:"https://example.com"}) → navigate via CDP (no address bar interaction)
281
+ ```
282
+
283
+ For chained calls in the same tab, pass `includeContext:false` to omit the activeTab/readyState annotation (~150 tok/call saved). Boolean / object params accept the LLM-friendly string spellings (`"true"`, `"{}"`).
284
+
285
+ Coordinates returned by `browser_locate` account for the browser chrome (tab strip + address bar height) and `devicePixelRatio`, so they can be passed directly to `mouse_click` without any scaling.
286
+
287
+ **Recommended web workflow:**
288
+ ```
289
+ browser_open({launch:{}}) → browser_eval({action:"dom"}) → browser_locate(selector) → browser_click(selector)
290
+ ```
291
+
292
+ ---
293
+
294
+ ## Auto-dock CLI on startup
295
+
296
+ Keep Claude CLI visible while operating other apps full-screen. Set env vars in your MCP config and the docked window auto-snaps into place every MCP startup.
297
+
298
+ ```json
299
+ {
300
+ "mcpServers": {
301
+ "desktop-touch": {
302
+ "type": "stdio",
303
+ "command": "npx",
304
+ "args": ["-y", "@harusame64/desktop-touch-mcp"],
305
+ "env": {
306
+ "DESKTOP_TOUCH_DOCK_TITLE": "@parent",
307
+ "DESKTOP_TOUCH_DOCK_CORNER": "bottom-right",
308
+ "DESKTOP_TOUCH_DOCK_WIDTH": "480",
309
+ "DESKTOP_TOUCH_DOCK_HEIGHT": "360",
310
+ "DESKTOP_TOUCH_DOCK_PIN": "true"
311
+ }
312
+ }
313
+ }
314
+ }
315
+ ```
316
+
317
+ | Env var | Default | Notes |
318
+ |---|---|---|
319
+ | `DESKTOP_TOUCH_DOCK_TITLE` | *(unset = off)* | `@parent` walks the MCP process tree to find the hosting terminal — immune to title / branch / project changes. Or use a literal substring. |
320
+ | `DESKTOP_TOUCH_DOCK_CORNER` | `bottom-right` | `top-left` / `top-right` / `bottom-left` / `bottom-right` |
321
+ | `DESKTOP_TOUCH_DOCK_WIDTH` / `HEIGHT` | `480` / `360` | px (`"480"`) or ratio of work area (`"25%"`) — 4K/8K auto-adapts |
322
+ | `DESKTOP_TOUCH_DOCK_PIN` | `true` | Always-on-top toggle |
323
+ | `DESKTOP_TOUCH_DOCK_MONITOR` | primary | Monitor id from `desktop_state({includeScreen:true})` |
324
+ | `DESKTOP_TOUCH_DOCK_SCALE_DPI` | `false` | If true, multiply px values by `dpi / 96` (opt-in per-monitor scaling) |
325
+ | `DESKTOP_TOUCH_DOCK_MARGIN` | `8` | Screen-edge padding (px) |
326
+ | `DESKTOP_TOUCH_DOCK_TIMEOUT_MS` | `5000` | Max wait for the target window to appear |
327
+
328
+ > **Input routing gotcha:** when a pinned window is active (e.g. Claude CLI), `keyboard(action='type')` / `keyboard(action='press')` send keys to it, **not** the app you wanted to type into. Always call `focus_window(title=...)` before keyboard operations, then verify `isActive=true` via `screenshot(detail='meta')`.
329
+
330
+ ### Auto Perception (always-on)
331
+
332
+ Phase 4 privatizes the explicit `perception_*` tool family — the v0.12 Auto
333
+ Perception layer attaches an `attention` signal to every `desktop_state` and
334
+ `desktop_act` response automatically. Action tools also auto-guard when given
335
+ a `windowTitle`. There is no longer a need to register / read / forget lenses
336
+ manually.
337
+
338
+ ```
339
+ # desktop_state always returns the attention signal
340
+ desktop_state() {focusedWindow, focusedElement, modal, attention:"ok", ...}
341
+
342
+ # Action tools auto-guard when windowTitle is given:
343
+ keyboard({action:"type", text:"hello", windowTitle:"Notepad"})
344
+ post.perception:{status:"ok"} // unsafe input blocked if guards fail
345
+
346
+ # When attention is dirty / stale / settling, refresh with desktop_state:
347
+ desktop_state() // re-evaluates attention via Auto Perception
348
+ ```
349
+
350
+ For advanced pinned-target workflows, the `lensId` parameter remains on action
351
+ tools (`keyboard`, `mouse_click`, `mouse_drag`, `click_element`,
352
+ `browser_click`, `browser_navigate`, `browser_eval`, `desktop_act`). Omit
353
+ `lensId` for the normal Auto Perception path. The underlying registry, hot
354
+ target cache, and sensor loop are unchanged; only the explicit
355
+ `perception_register / perception_read / perception_forget / perception_list`
356
+ tools were retired.
357
+
358
+ ---
359
+
360
+ ## Mouse homing correction
361
+
362
+ When Claude calls `screenshot(detail='text')` to read coordinates and then `mouse_click` seconds later, the target window may have moved. The homing system corrects this automatically.
363
+
364
+ | Tier | How to enable | Latency | What it does |
365
+ |------|--------------|---------|--------------|
366
+ | 1 | Always-on (if cache exists) | <1ms | Applies (dx, dy) offset when window moved |
367
+ | 2 | Pass `windowTitle` hint | ~100ms | Auto-focuses window if it went behind another |
368
+ | 3 | Pass `elementName`/`elementId` + `windowTitle` | 1–3s | UIA re-query for fresh coords on resize |
369
+
370
+ ```
371
+ # Tier 1 only (automatic)
372
+ mouse_click(x=500, y=300)
373
+
374
+ # Tier 1 + 2: also bring window to front if hidden
375
+ mouse_click(x=500, y=300, windowTitle="Notepad")
376
+
377
+ # Tier 1 + 2 + 3: also re-query UIA if window resized
378
+ mouse_click(x=500, y=300, windowTitle="Notepad", elementName="Save")
379
+
380
+ # Traction control OFF no correction
381
+ mouse_click(x=500, y=300, homing=false)
382
+ ```
383
+
384
+ The `homing` parameter is available on `mouse_click`, `mouse_drag`, and `scroll`. The cache is updated automatically on every `screenshot()`, `desktop_discover()`, `focus_window()`, and `workspace_snapshot()` call.
385
+
386
+ ### `mouse_click` image-local coords (origin + scale)
387
+
388
+ When you take a `dotByDot` screenshot with `dotByDotMaxDimension`, the response prints the `origin` and `scale` values. Instead of computing screen coords manually, copy them into `mouse_click`:
389
+
390
+ ```
391
+ # Screenshot response:
392
+ # origin: (0, 120) | scale: 0.6667
393
+ # To click image pixel (ix, iy): mouse_click(x=ix, y=iy, origin={x:0, y:120}, scale=0.6667)
394
+
395
+ mouse_click(x=640, y=300, origin={x:0, y:120}, scale=0.6667, windowTitle="Chrome")
396
+ # Server converts: screen = (0 + 640/0.6667, 120 + 300/0.6667) = (960, 570)
397
+ ```
398
+
399
+ This eliminates a whole class of off-by-one and scale bugs. Without origin/scale, `x`/`y` remain absolute screen pixels (unchanged behavior).
400
+
401
+ ---
402
+
403
+ ## `screenshot` key parameters
404
+
405
+ ```
406
+ detail="image" — PNG/WebP pixels (default)
407
+ detail="text" — UIA element JSON + clickAt coords (no image, ~100–300 tok)
408
+ detail="meta" — Title + region only (cheapest, ~20 tok/window)
409
+ dotByDot=true — 1:1 WebP; image_px + origin = screen_px
410
+ dotByDotMaxDimension=N — cap longest edge (response includes scale for coord math)
411
+ grayscale=true — ~50% smaller for text-heavy captures (code/AWS console)
412
+ region={x,y,w,h} — with windowTitle: window-local coords (exclude browser chrome)
413
+ without: virtual screen coords
414
+ diffMode=true — I-frame first call, P-frame (changed windows only) after (~160 tok)
415
+ ocrFallback="auto" — detail='text' auto-fires Windows OCR on uiaSparse or empty
416
+ ```
417
+
418
+ **Recommended Chrome combo** (50–70% data reduction):
419
+ ```
420
+ screenshot(windowTitle="Chrome",
421
+ dotByDot=true, dotByDotMaxDimension=1280, grayscale=true,
422
+ region={x:0, y:120, width:1920, height:900}) # skip browser chrome
423
+ ```
424
+
425
+ **Recommended workflow:**
426
+ ```
427
+ workspace_snapshot() → full orientation (resets diff buffer)
428
+ screenshot(detail="text", windowTitle=X) → get actionable[].clickAt coords
429
+ mouse_click(x, y) click directly, no math needed
430
+ screenshot(diffMode=true) → check only what changed (~160 tok)
431
+ ```
432
+
433
+ ---
434
+
435
+ ## Security
436
+
437
+ ### Emergency stop (Failsafe)
438
+
439
+ **Move the mouse to the top-left corner of the screen (within 10px of 0,0) to immediately terminate the MCP server.**
440
+
441
+ - **Per-tool check**: `checkFailsafe()` runs before every tool handler
442
+ - **Background monitor**: 500ms polling as a backup for long-running operations
443
+ - Trigger radius: 10px
444
+
445
+ ### Blocked operations
446
+
447
+ **`workspace_launch` blocklist:**
448
+ `cmd.exe`, `powershell.exe`, `pwsh.exe`, `wscript.exe`, `cscript.exe`, `mshta.exe`, `regsvr32.exe`, `rundll32.exe`, `msiexec.exe`, `bash.exe`, `wsl.exe` are blocked.
449
+ Script extensions (`.bat`, `.ps1`, `.vbs`, etc.) are rejected. Arguments containing `;`, `&`, `|`, `` ` ``, `$(`, `${` are also rejected.
450
+
451
+ **`keyboard(action='press')` blocklist:**
452
+ `Win+R` (Run dialog), `Win+X` (admin menu), `Win+S` (search), `Win+L` (lock screen) are blocked.
453
+
454
+ ### PowerShell injection protection
455
+
456
+ All `-like` patterns in the UIA bridge PowerShell fallback path are sanitized with `escapeLike()`, which escapes wildcard characters (`*`, `?`, `[`, `]`) before they reach PowerShell. When the Rust native engine is active, PowerShell is not invoked for UIA operations.
457
+
458
+ ### Allowlist for `workspace_launch`
459
+
460
+ Shell interpreters are blocked by default. To allow specific executables, create an allowlist file:
461
+
462
+ **File locations (searched in order):**
463
+ 1. Path in `DESKTOP_TOUCH_ALLOWLIST` environment variable
464
+ 2. `~/.claude/desktop-touch-allowlist.json`
465
+ 3. `desktop-touch-allowlist.json` in the server's working directory
466
+
467
+ **Format:**
468
+ ```json
469
+ {
470
+ "allowedExecutables": [
471
+ "pwsh.exe",
472
+ "C:\\Tools\\myapp.exe"
473
+ ]
474
+ }
475
+ ```
476
+
477
+ Changes take effect immediately — no restart needed.
478
+
479
+ ---
480
+
481
+ ## Mouse movement speed
482
+
483
+ All mouse tools (`mouse_click`, `mouse_drag`, `scroll`) accept an optional `speed` parameter:
484
+
485
+ | Value | Behavior |
486
+ |---|---|
487
+ | Omitted | Uses the configured default (see below) |
488
+ | `0` | Instant teleport — `setPosition()`, no animation |
489
+ | `1–N` | Animated movement at N px/sec |
490
+
491
+ **Default speed** is 1500 px/sec. Change it permanently via the `DESKTOP_TOUCH_MOUSE_SPEED` environment variable:
492
+
493
+ ```json
494
+ {
495
+ "mcpServers": {
496
+ "desktop-touch": {
497
+ "type": "stdio",
498
+ "command": "npx",
499
+ "args": ["-y", "@harusame64/desktop-touch-mcp"],
500
+ "env": {
501
+ "DESKTOP_TOUCH_MOUSE_SPEED": "3000"
502
+ }
503
+ }
504
+ }
505
+ }
506
+ ```
507
+
508
+ Common values: `0` = teleport, `1500` = default gentle, `3000` = fast, `5000` = very fast.
509
+
510
+ ---
511
+
512
+ ## Force-Focus (AttachThreadInput)
513
+
514
+ Windows foreground-stealing protection can prevent `SetForegroundWindow` from succeeding when another window (such as a pinned Claude CLI) is in the foreground. This causes subsequent keystrokes or clicks to land in the wrong window — a silent failure.
515
+
516
+ `mouse_click`, `keyboard(action='type')`, `keyboard(action='press')`, and `terminal(action='send')` all accept a `forceFocus` parameter that bypasses this protection using `AttachThreadInput`:
517
+
518
+ ```json
519
+ {
520
+ "name": "mouse_click",
521
+ "arguments": {
522
+ "x": 500,
523
+ "y": 300,
524
+ "windowTitle": "Google Chrome",
525
+ "forceFocus": true
526
+ }
527
+ }
528
+ ```
529
+
530
+ If the force attempt is refused despite `AttachThreadInput`, the response includes `hints.warnings: ["ForceFocusRefused"]`.
531
+
532
+ **Global default via environment variable:**
533
+
534
+ ```json
535
+ {
536
+ "mcpServers": {
537
+ "desktop-touch": {
538
+ "env": {
539
+ "DESKTOP_TOUCH_FORCE_FOCUS": "1"
540
+ }
541
+ }
542
+ }
543
+ }
544
+ ```
545
+
546
+ Setting `DESKTOP_TOUCH_FORCE_FOCUS=1` makes `forceFocus: true` the default for all four tools without changing each call.
547
+
548
+ **Known tradeoffs:**
549
+
550
+ - During the ~10ms `AttachThreadInput` window, key state and mouse capture are shared between the two threads. In rapid macro sequences this can cause a race condition (rare in practice).
551
+ - Disable `forceFocus` (or unset the env var) when the user is manually operating another app to avoid unexpected focus shifts.
552
+
553
+ ---
554
+
555
+ ## Auto Guard (v0.12+)
556
+
557
+ Action tools (`mouse_click`, `mouse_drag`, `keyboard(action='type'/'press')`, `click_element`, `desktop_act`, `browser_click`, `browser_navigate`) automatically guard each action when you pass `windowTitle` / `tabId`:
558
+
559
+ - Verifies target window identity (process restart / HWND replacement detected)
560
+ - Confirms click coordinates are inside the target window rect
561
+ - Returns `post.perception.status` on every response including failures so the LLM can recover without a screenshot
562
+
563
+ **Disabling auto guard** set `DESKTOP_TOUCH_AUTO_GUARD=0` to restore v0.11.12 behavior (no auto guard):
564
+
565
+ ```json
566
+ {
567
+ "mcpServers": {
568
+ "desktop-touch": {
569
+ "type": "stdio",
570
+ "command": "npx",
571
+ "args": ["-y", "@harusame64/desktop-touch-mcp"],
572
+ "env": {
573
+ "DESKTOP_TOUCH_AUTO_GUARD": "0"
574
+ }
575
+ }
576
+ }
577
+ }
578
+ ```
579
+
580
+ When auto guard is enabled (default), `post.perception.status` will be one of:
581
+
582
+ | Status | Meaning |
583
+ |---|---|
584
+ | `ok` | Guard passed target verified |
585
+ | `unguarded` | `windowTitle` not provided; action ran without guard |
586
+ | `target_not_found` | No window matched the given title |
587
+ | `ambiguous_target` | Multiple windows matched; use a more specific title |
588
+ | `identity_changed` | Window was replaced (process restart / HWND change) |
589
+ | `unsafe_coordinates` | Click coordinates are outside the target window rect |
590
+ | `needs_escalation` | Use `browser_click` or specify `windowTitle` |
591
+
592
+ When `unsafe_coordinates` or `identity_changed` is returned, the response may include a `suggestedFix.fixId`. Pass that `fixId` to the relevant tool call to approve the recovery:
593
+
594
+ ```json
595
+ { "name": "mouse_click", "arguments": { "fixId": "fix-..." } }
596
+ { "name": "keyboard(action='type')", "arguments": { "fixId": "fix-...", "text": "hello" } }
597
+ { "name": "click_element", "arguments": { "fixId": "fix-..." } }
598
+ { "name": "browser_click", "arguments": { "fixId": "fix-..." } }
599
+ ```
600
+
601
+ The fix is one-shot and expires in 15 seconds. The server revalidates the target process identity before executing.
602
+
603
+ ---
604
+
605
+ ## v0.13 Additions
606
+
607
+ ### Target-Identity Timeline
608
+
609
+ The server tracks a semantic timeline of what happened to each target window/tab. Recent events are included in:
610
+
611
+ - `get_history` → `recentTargetKeys`: array of 3 most recently active target keys (compact, no event bodies)
612
+ - `perception_read(lensId)` → `recentEvents`: up to 10 events for that lens's target, each with `tsMs`, `semantic`, `summary`
613
+
614
+ Enable the MCP resources below to browse timelines:
615
+
616
+ ```json
617
+ { "env": { "DESKTOP_TOUCH_PERCEPTION_RESOURCES": "1" } }
618
+ ```
619
+
620
+ MCP resources available when enabled:
621
+
622
+ | URI | Content |
623
+ |---|---|
624
+ | `perception://target/{targetKey}/timeline` | Semantic event timeline for a target |
625
+ | `perception://targets/recent` | Most recently active target keys |
626
+ | `perception://lens/{lensId}/summary` | Lens attention/guard state |
627
+
628
+ ### Manual Lens Eviction: FIFO → LRU
629
+
630
+ Manual lenses (created via `perception_register`) are now evicted by **least-recently-used** instead of insertion order. Using `perception_read`, `evaluatePreToolGuards`, or `buildEnvelopeFor` on a lens promotes it. The hard limit of 16 active lenses is unchanged.
631
+
632
+ ### browser_eval Structured Mode
633
+
634
+ Pass `withPerception: true` to receive a structured JSON response with `post.perception` instead of raw text:
635
+
636
+ ```json
637
+ { "name": "browser_eval", "arguments": { "expression": "document.title", "withPerception": true } }
638
+ ```
639
+
640
+ Returns `{ ok: true, result: "...", post: { perception: { status: "ok", ... } } }`.
641
+
642
+ ### mouse_drag Cross-Window Guard
643
+
644
+ `mouse_drag` now guards both start and end coordinates. Drags that cross window boundaries (or reach the desktop wallpaper) are blocked by default. To allow intentional cross-window or range-selection drags:
645
+
646
+ ```json
647
+ { "name": "mouse_drag", "arguments": { "startX": 100, "startY": 100, "endX": 900, "endY": 900, "allowCrossWindowDrag": true } }
648
+ ```
649
+
650
+ ---
651
+
652
+ ## Performance (v0.15 Rust Native Engine)
653
+
654
+ The Rust native engine (`@harusame64/desktop-touch-engine`) replaces PowerShell process spawning with direct COM calls over a persistent MTA thread. It loads automatically as a `.node` addon — no configuration needed.
655
+
656
+ ### UIA Benchmark (vs PowerShell baseline)
657
+
658
+ | Function | Rust Native | PowerShell | Speedup |
659
+ |---|---|---|---|
660
+ | `getFocusedElement` | **2.2 ms** | 366 ms | **163.9×** |
661
+ | `getUiElements` (Explorer, ~60 elements) | **106.5 ms** | 346 ms | **3.3×** |
662
+ | **Weighted average** | | | **~82×** |
663
+
664
+ ### Image Diff Benchmark (SSE2 SIMD)
665
+
666
+ | Function | Rust (SSE2) | TypeScript | Speedup |
667
+ |---|---|---|---|
668
+ | `computeChangeFraction` (1920×1080) | **0.26 ms** | 3.8 ms | **~15×** |
669
+ | `dHash` (perceptual hash) | **0.09 ms** | 1.2 ms | **~13×** |
670
+
671
+ ### Architecture
672
+
673
+ ```
674
+ Claude CLI / MCP Client
675
+ │ stdio or HTTP (MCP protocol)
676
+
677
+ desktop-touch-mcp (TypeScript)
678
+
679
+ ├── Rust Native Engine (.node addon) NEW in v0.15
680
+ │ ├── UIA: 13 functions via napi-rs + windows-rs 0.62
681
+ │ │ └── Dedicated COM thread (MTA) + batch BFS algorithm
682
+ │ └── Image: SSE2 SIMD pixel diff + perceptual hashing
683
+
684
+ └── PowerShell Fallback (automatic)
685
+ └── Activates transparently if .node is unavailable
686
+ ```
687
+
688
+ ### Why `getUiElements` is 3.3× (not 160×)
689
+
690
+ The 160× speedup on `getFocusedElement` comes from eliminating PowerShell process startup (~200 ms) and .NET assembly loading. For `getUiElements`, the bottleneck shifts to the **UIA provider** inside the target application (e.g., Explorer) — it must enumerate its UI tree regardless of who asks. The Rust engine uses a **batch BFS algorithm** (`FindAllBuildCache` + `TreeScope_Children`) that minimizes cross-process RPC calls and supports `maxElements` early exit, making it dramatically faster on large trees (VS Code, browsers with 1000+ elements).
691
+
692
+ ---
693
+
694
+ ## UI Operating Layer (V2)
695
+
696
+ > **Status: Default ON since v0.17.** `desktop_discover` and `desktop_act` are available out of the box.
697
+
698
+ V2 introduces two new tools that replace coordinate-based clicking with entity-based interaction:
699
+
700
+ | Tool | Description |
701
+ |---|---|
702
+ | `desktop_discover` | Observe a window or browser tab. Returns interactive entities with leases — no raw screen coordinates. Supports UIA (native), CDP (browser), terminal, and visual GPU lanes. |
703
+ | `desktop_act` | Interact with an entity returned by `desktop_discover`. Validates the lease before executing. Returns a semantic diff (`entity_disappeared`, `modal_appeared`, `focus_shifted`, …). |
704
+
705
+ ### Clicking — priority order
706
+
707
+ When multiple tools could perform the same click, prefer them in this order:
708
+
709
+ 1. `browser_click(selector)` — Chrome / Edge over CDP (stable across repaints)
710
+ 2. `desktop_act(lease)` — native windows, dialogs, visual-only targets (entity-based; use after `desktop_discover`)
711
+ 3. `click_element(name | automationId)` — native UIA fallback when `desktop_act` returns `ok:false`
712
+ 4. `mouse_click(x, y)` — pixel-level last resort (`origin` + `scale` from `dotByDot` screenshots only)
713
+
714
+ ### Disabling V2 (kill switch)
715
+
716
+ To hide `desktop_discover` / `desktop_act` from the tool catalog, add the disable flag and restart:
717
+
718
+ ```json
719
+ {
720
+ "mcpServers": {
721
+ "desktop-touch": {
722
+ "type": "stdio",
723
+ "command": "npx",
724
+ "args": ["-y", "@harusame64/desktop-touch-mcp"],
725
+ "env": {
726
+ "DESKTOP_TOUCH_DISABLE_FUKUWARAI_V2": "1"
727
+ }
728
+ }
729
+ }
730
+ }
731
+ ```
732
+
733
+ All V1 tools continue to work without interruption — no reinstall required. Remove the env entry and restart to re-enable.
734
+
735
+ Flag semantics (exact-match: only the literal string `"1"` counts):
736
+
737
+ | `DISABLE_FUKUWARAI_V2` | `ENABLE_FUKUWARAI_V2` | V2 state |
738
+ |---|---|---|
739
+ | unset / not `"1"` | unset / not `"1"` | **ON** (default) |
740
+ | unset / not `"1"` | `"1"` | ON (legacy flag — see below) |
741
+ | `"1"` | any | **OFF** — DISABLE wins |
742
+
743
+ ### Deprecated: `DESKTOP_TOUCH_ENABLE_FUKUWARAI_V2`
744
+
745
+ This was the opt-in switch in v0.16.x. From v0.17 it is accepted for compatibility but no longer required — the server prints a deprecation warning on startup when it is set. It will be removed in v0.18. Remove it from your config when you upgrade.
746
+
747
+ ### Recovery when V2 fails
748
+
749
+ If `desktop_act` returns `ok: false`, read `reason` and follow the built-in recovery hints in the tool description. Common paths:
750
+
751
+ - `lease_expired` / `*_mismatch` / `entity_not_found` → re-call `desktop_discover`
752
+ - `modal_blocking` → dismiss the modal with `click_element`, then retry
753
+ - `entity_outside_viewport` → `scroll` / `scroll(action='to_element')`, then re-call `desktop_discover`
754
+ - `executor_failed` → fall back to `click_element` / `mouse_click` / `browser_click`
755
+
756
+ For `desktop_discover` warnings (`visual_provider_unavailable`, `visual_provider_warming`, `cdp_provider_failed`, …), V1 tools (`screenshot`, `click_element`, `get_ui_elements`, `terminal(action='send')`, …) remain available as an escape hatch.
757
+
758
+ ---
759
+
760
+ ## Known limitations
761
+
762
+ | Limitation | Detail | Workaround |
763
+ |---|---|---|
764
+ | Games / video players may return black or hang in background capture | DirectX fullscreen apps may not work even with `PW_RENDERFULLCONTENT` | Retry with `screenshot_background(fullContent=false)`; if still black, use foreground `screenshot` |
765
+ | UIA call overhead | ~2 ms (focus) / ~100 ms (tree) via Rust native engine; ~300 ms via PowerShell fallback | Rust engine loads automatically; `workspace_snapshot` uses a 2 s timeout internally |
766
+ | Chrome / WinUI3 UIA elements are empty | Chromium exposes only limited UIA | `screenshot(detail='text')` auto-detects Chromium and falls back to Windows OCR (`hints.chromiumGuard=true`). For richer DOM access use `browser_open` + `browser_locate` |
767
+ | Chromium title-regex misses when sites rewrite `document.title` | Guard relies on the ` - Google Chrome` suffix being present; some sites push it off the end of a long title | Title is treated as plain Chrome (UIA runs). OCR path is still reachable via `ocrFallback='always'` or when UIA returns `<5` elements (`uiaSparse`) |
768
+ | `browser_*` CDP tools need Chrome launched with `--remote-debugging-port` | If Chrome is already running on the default profile without the flag, `browser_open` fails. The CDP E2E suite (`tests/e2e/browser-cdp.test.ts`) will also fail in that state | Close Chrome first, then `browser_open({launch:{}})` will relaunch it in debug mode, or start Chrome manually with `--remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp` |
769
+ | Layer buffer TTL | Buffer auto-clears after 90s of inactivity → next `diffMode` becomes an I-frame | After long waits, call `workspace_snapshot` to explicitly reset the buffer |
770
+ | `keyboard(action='type')` / `keyboard(action='press')` follow focus | When `window_dock(action='dock')(pin=true)` keeps another window on top (e.g. Claude CLI), keystrokes may be absorbed by that window | Call `focus_window(title=...)` first and verify `isActive=true` via `screenshot(detail='meta')` before sending keys |
771
+ | `keyboard(action='type')` em-dash / smart quotes in Chrome/Edge | Non-ASCII punctuation (em-dash `—`, en-dash `–`, smart quotes `"" ''`) can be intercepted as keyboard accelerators, shifting focus to the address bar | Always use `use_clipboard=true` when the text contains such characters |
772
+ | `browser_eval(action='js')` on React / Vue / Svelte inputs | Setting `element.value = ...` or dispatching synthetic events does not update the framework's internal state | Use `browser_fill(selector, value)` — it uses native prototype setter + InputEvent which does update React/Vue/Svelte state |
773
+
774
+ ---
775
+
776
+ ## Token cost reference
777
+
778
+ | Mode | Tokens | Use case |
779
+ |---|---|---|
780
+ | `screenshot` (768px PNG) | ~443 tok | General visual check |
781
+ | `screenshot(dotByDot=true)` window | ~800 tok | Precise clicking (no coordinate math) |
782
+ | `screenshot(diffMode=true)` | ~160 tok | Post-action diff |
783
+ | `screenshot(detail="text")` | ~100–300 tok | UI interaction (no image) |
784
+ | `workspace_snapshot` | ~2000 tok | Full session orientation |
785
+
786
+ ---
787
+
788
+ ## License
789
+
790
+ MIT