@harusame64/desktop-touch-mcp 1.0.3 → 1.0.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.ja.md +2 -6
- package/README.md +29 -87
- package/bin/launcher.js +3 -3
- package/package.json +2 -2
package/README.ja.md
CHANGED
|
@@ -4,13 +4,9 @@
|
|
|
4
4
|
|
|
5
5
|
[English](README.md)
|
|
6
6
|
|
|
7
|
-
>
|
|
7
|
+
> **「座標ルーレット」はもう不要。セマンティック・ワールドグラフと自動認識ガードによる、LLMネイティブなWindows自動化。**
|
|
8
8
|
|
|
9
|
-
|
|
10
|
-
> まずはここから、公開向けの解説、client setup、Reactive Perception Graph の紹介を読めます。
|
|
11
|
-
|
|
12
|
-
Claude がデスクトップを直接見て、直接操作する。
|
|
13
|
-
マウス・キーボード・スクリーンショット・Windows UI Automation・Chrome DevTools Protocol・ターミナル・SmartScroll・Reactive Perception Graph を統合した 28 個の public ツール (26 stub catalog + 2 dynamic v2 World-Graph: `desktop_discover` / `desktop_act`) を提供する MCP サーバーです。
|
|
9
|
+
ClaudeがWindowsを直接「見て」「操作する」ためのMCPサーバー。単なるピクセル指定の座標当てゲームを卒業し、**セマンティック・ワールドグラフ** (`desktop_discover`) による確実なエンティティ特定と、**自動認識(Auto-Perception)ガード**による安全な実行を実現します。28個の最適化されたツール群が、バックグラウンド入力、UIA、Chrome CDP、ターミナル、そしてトークン効率に優れたP-frame差分転送をサポートします。
|
|
14
10
|
|
|
15
11
|
> *v0.15: Rust ネイティブエンジンにより**平均 82 倍高速化** — UIA フォーカス取得 2ms、SSE2 SIMD 画像差分 13〜15 倍速。設定不要:エンジンは自動ロード、不在時は PowerShell に透過フォールバック。*
|
|
16
12
|
> *v0.15.5: **固定リリース検証** — npm ランチャーは対応する GitHub Release tag だけを取得し、Windows runtime zip を検証してから展開します。*
|
package/README.md
CHANGED
|
@@ -4,12 +4,9 @@
|
|
|
4
4
|
|
|
5
5
|
[日本語](README.ja.md)
|
|
6
6
|
|
|
7
|
-
> **
|
|
7
|
+
> **Beyond Coordinate Roulette: LLM-native Windows automation with a Semantic World-Graph and Auto-Perception.**
|
|
8
8
|
|
|
9
|
-
|
|
10
|
-
> Start here for the public explainer, client setup guides, and the Reactive Perception Graph overview.
|
|
11
|
-
|
|
12
|
-
An MCP server that gives Claude eyes and hands on Windows — 28 public tools (26 stub catalog + 2 dynamic v2 World-Graph: `desktop_discover` / `desktop_act`) covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
|
|
9
|
+
An MCP server that gives Claude eyes and hands on Windows. It moves beyond pixel-guessing by grounding all interactions in a **Semantic World-Graph** (`desktop_discover`) and verifying every action with **Auto-Perception** guards. Optimized into 28 high-signal tools covering screenshots, background input (WM_CHAR), UIA, Chrome CDP, terminal, and token-efficient P-frame diffing.
|
|
13
10
|
|
|
14
11
|
> *v0.15: **82× average speedup** via Rust native engine — UIA focus queries in 2 ms, SSE2-accelerated image diffing at 13–15× native speed. Zero-config: the engine auto-loads when present, with transparent PowerShell fallback.*
|
|
15
12
|
> *v0.15.5: **Pinned release verification** — the npm launcher now fetches only the matching GitHub Release tag and verifies the Windows runtime zip before extraction.*
|
|
@@ -121,104 +118,49 @@ For a local checkout, register the built server directly:
|
|
|
121
118
|
|
|
122
119
|
---
|
|
123
120
|
|
|
124
|
-
## Tools (28
|
|
125
|
-
|
|
126
|
-
> 📖 **Full command reference**: [`docs/system-overview.md`](docs/system-overview.md) — every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
### Screenshot (1)
|
|
130
|
-
| Tool | Description |
|
|
131
|
-
|---|---|
|
|
132
|
-
| `screenshot` | Main capture. Supports `detail` (`meta`/`text`/`image`/`som`/`ocr`), `mode` (`normal`/`background` — Phase 4 absorbs former screenshot_background), `region` sub-crop (Phase 4 absorbs former scope_element after `desktop_discover` gives you element bounds), `dotByDot`, `dotByDotMaxDimension`, `grayscale`, `diffMode`. `detail='text'` auto-activates the SoM pipeline when UIA is blind. `detail='ocr'` (Phase 4 absorbs former screenshot_ocr) returns Windows OCR words with screen-pixel clickAt coords. `scroll(action='capture')` provides full-page stitched capture. |
|
|
133
|
-
|
|
134
|
-
### Observation (1) — World-Graph core
|
|
135
|
-
| Tool | Description |
|
|
136
|
-
|---|---|
|
|
137
|
-
| `desktop_state` | Cheapest read-only observation. Always returns focused window, focused element, modal flag, attention signal. Phase 4: `includeCursor:true` adds `cursor.{x,y,monitorId}` (former get_cursor_position); `includeScreen:true` adds `screen.{displays[],...}` (former get_screen_info); `includeDocument:true` adds `document.{url,title,readyState,scroll,...}` via CDP (former get_document_state). focusedWindow always covers the former get_active_window response. |
|
|
138
|
-
|
|
139
|
-
### Discovery + Action — World-Graph dispatchers (2 dynamic)
|
|
140
|
-
| Tool | Description |
|
|
141
|
-
|---|---|
|
|
142
|
-
| `desktop_discover` | Find actionable entities and emit leases. Returns `entities[]` (former `get_ui_elements` equivalent — `entityId`/`label`/`role`/`confidence`/`sources`/`primaryAction`/`lease`, plus `rect` when `debug:true`) + `windows[]` (former `get_windows` equivalent — `zOrder`/`title`/`hwnd`/`region`/`isActive`/`isMinimized`/`isMaximized`/`processName`). Pre-Phase-1 `desktop_see`. |
|
|
143
|
-
| `desktop_act` | Lease-consuming action (`click` / `type` / `setValue` / `scroll` / `select`). Phase 4: `action='setValue'` absorbs former set_element_value via UIA ValuePattern (UIA entities) or CDP fill (browser entities). Pre-Phase-1 `desktop_touch`. |
|
|
144
|
-
|
|
145
|
-
### Window management (1)
|
|
146
|
-
| Tool | Description |
|
|
147
|
-
|---|---|
|
|
148
|
-
| `focus_window` | Bring a window to foreground by partial title match. Use `desktop_discover` to list available titles. |
|
|
149
|
-
| `window_dock(action='dock')` | Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
|
|
121
|
+
## Tools (28 Optimized Tools)
|
|
150
122
|
|
|
151
|
-
|
|
152
|
-
| Tool | Description |
|
|
153
|
-
|---|---|
|
|
154
|
-
| `mouse_click` / `mouse_drag` | Click, drag. `doubleClick` / `tripleClick` (line-select). Accept `speed` and `homing` parameters. (`mouse_move` privatized in Phase 4.) |
|
|
155
|
-
| `scroll` | Scroll in any direction (`action='raw'` / `'to_element'` / `'smart'` / `'capture'`). Accepts `speed` and `homing` parameters. |
|
|
156
|
-
|
|
157
|
-
### Keyboard (1)
|
|
158
|
-
| Tool | Description |
|
|
159
|
-
|---|---|
|
|
160
|
-
| `keyboard` | Send keyboard input. `action='type'` for text (`use_clipboard=true` bypasses IME / required for em-dash / smart quotes; `replaceAll=true` sends Ctrl+A before typing). `action='press'` for key combos (`ctrl+c`, `alt+f4`, etc.). |
|
|
161
|
-
|
|
162
|
-
### UI Automation (1)
|
|
163
|
-
| Tool | Description |
|
|
164
|
-
|---|---|
|
|
165
|
-
| `click_element` | Click a button by name or automationId — no coordinates needed. Use `desktop_discover` to find target entities. (`get_ui_elements` / `set_element_value` / `scope_element` privatized in Phase 4 — see desktop_discover / desktop_act / screenshot.region.) |
|
|
166
|
-
|
|
167
|
-
### Browser CDP (9)
|
|
168
|
-
| Tool | Description |
|
|
169
|
-
|---|---|
|
|
170
|
-
| `browser_open` | Connect to Chrome/Edge via CDP; lists open tabs with `active:true/false`. Pass `launch:{}` (or with overrides) to auto-spawn a debug-mode browser when no CDP endpoint is live (idempotent) |
|
|
171
|
-
| `browser_locate` | CSS selector → exact physical screen coords |
|
|
172
|
-
| `browser_click` | Find DOM element + click in one step |
|
|
173
|
-
| `browser_eval` | Inspect or operate on a tab via 3 actions: `js` (evaluate JS), `dom` (get HTML), `appState` (extract SSR-injected SPA state — `__NEXT_DATA__`, `__NUXT_DATA__`, `__REMIX_CONTEXT__`, `__APOLLO_STATE__`, GitHub `react-app`, JSON-LD, Redux SSR) |
|
|
174
|
-
| `browser_fill` | Fill React/Vue/Svelte controlled inputs via CDP — works where `browser_eval(action='js')` value assignment doesn't update framework state |
|
|
175
|
-
| `browser_form` | Inspect form fields (input/select/textarea/button) with name, type, value, label resolved via for/aria/ancestor LABEL |
|
|
176
|
-
| `browser_overview` | Enumerate links / buttons / inputs + **ARIA toggles** with `state.{checked,pressed,selected,expanded}`; each element includes `viewportPosition` |
|
|
177
|
-
| `browser_search` | Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
|
|
178
|
-
| `browser_navigate` | Navigate via CDP `Page.navigate`; `waitForLoad:true` (default) returns once `readyState==='complete'` |
|
|
179
|
-
|
|
180
|
-
All `browser_*` tools that touch the DOM accept `includeContext:false` to omit the trailing `activeTab:` / `readyState:` lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
|
|
181
|
-
|
|
182
|
-
### Workspace (2)
|
|
183
|
-
| Tool | Description |
|
|
184
|
-
|---|---|
|
|
185
|
-
| `workspace_snapshot` | All windows: thumbnails + UI summaries in one call |
|
|
186
|
-
| `workspace_launch` | Launch an app and auto-detect the new window |
|
|
187
|
-
|
|
188
|
-
### Wait / Status (2)
|
|
189
|
-
| Tool | Description |
|
|
190
|
-
|---|---|
|
|
191
|
-
| `wait_until` | Server-side wait for window/focus/terminal/browser DOM state changes. Replaces the former events_* polling family — use `condition='window_appears'` / `'terminal_output_contains'` / `'element_matches'` / `'focus_changes'` for one-shot waits. |
|
|
192
|
-
| `server_status` | Returns which backend is active: `uia` (native Rust or powershell) and `imageDiff` (native Rust SSE2 or typescript). Diagnostic — call once per session when troubleshooting performance. |
|
|
123
|
+
> 📖 **Full Reference**: [`docs/system-overview.md`](docs/system-overview.md) — Exhaustive guide on parameters, return schemas, and coordinate math.
|
|
193
124
|
|
|
194
|
-
###
|
|
125
|
+
### 🌐 World-Graph V2 (Primary Path)
|
|
195
126
|
| Tool | Description |
|
|
196
127
|
|---|---|
|
|
197
|
-
| `
|
|
198
|
-
| `
|
|
128
|
+
| `desktop_discover` | Observe the desktop. Returns interactive entities with leases (UIA, CDP, Terminal, Visual SoM). |
|
|
129
|
+
| `desktop_act` | Perform actions (click, type, drag, select) on entities via lease validation. Returns semantic diffs. |
|
|
199
130
|
|
|
200
|
-
###
|
|
131
|
+
### 👁️ Observation & State
|
|
201
132
|
| Tool | Description |
|
|
202
133
|
|---|---|
|
|
203
|
-
| `
|
|
204
|
-
| `
|
|
134
|
+
| `desktop_state` | Lightweight check of focus, active window, cursor, and Auto-Perception attention signal. |
|
|
135
|
+
| `screenshot` | Multi-mode capture: `detail='text'` (UIA/OCR), `diffMode` (P-frame), `dotByDot` (1:1), and `background`. |
|
|
136
|
+
| `workspace_snapshot` | Instant session orientation: all window thumbnails + UI summaries in one call. |
|
|
137
|
+
| `server_status` | Diagnostic check for native engine health and feature activation. |
|
|
205
138
|
|
|
206
|
-
###
|
|
139
|
+
### ⌨️ Input & Control
|
|
207
140
|
| Tool | Description |
|
|
208
141
|
|---|---|
|
|
209
|
-
| `
|
|
210
|
-
| `
|
|
142
|
+
| `keyboard` | Send keyboard input. Supports background input (WM_CHAR) and IME-safe clipboard bypass. |
|
|
143
|
+
| `mouse_click` / `mouse_drag` | Precision coordinate-based interaction with homing and force-focus protection. |
|
|
144
|
+
| `scroll` | Multi-strategy: `raw` (notches), `to_element`, `smart` (virtual lists), and `capture` (stitch). |
|
|
145
|
+
| `click_element` | Legacy UIA-based click by name/ID (fallback when entities are unavailable). |
|
|
211
146
|
|
|
212
|
-
###
|
|
147
|
+
### 🌐 Browser CDP (Chrome/Edge/Brave)
|
|
213
148
|
| Tool | Description |
|
|
214
149
|
|---|---|
|
|
215
|
-
| `
|
|
150
|
+
| `browser_open` / `browser_navigate` | Idempotent debug-mode launch and reliable navigation. |
|
|
151
|
+
| `browser_click` / `browser_fill` / `browser_form` | High-level DOM interaction stable across repaints and framework re-renders. |
|
|
152
|
+
| `browser_eval` | Deep inspection via `js` (scripting), `dom` (HTML), and `appState` (SPA data extraction). |
|
|
153
|
+
| `browser_overview` / `browser_search` / `browser_locate` | Semantic discovery, grep-like DOM search, and pixel-accurate coordinate lookup. |
|
|
216
154
|
|
|
217
|
-
###
|
|
155
|
+
### 🛠️ Utilities & Workflow
|
|
218
156
|
| Tool | Description |
|
|
219
157
|
|---|---|
|
|
220
|
-
| `
|
|
221
|
-
| `
|
|
158
|
+
| `terminal` | Unified command execution: `run` (send + wait + read), `read` (OCR/UIA), and `send`. |
|
|
159
|
+
| `wait_until` | Efficient server-side polling for window, focus, text, or URL state changes. |
|
|
160
|
+
| `window_dock` / `focus_window` | Window management: `pin` (always-on-top), `unpin`, `dock` (corner snap), and `focus`. |
|
|
161
|
+
| `workspace_launch` | Launch apps and auto-detect new HWNDs (supports localized titles). |
|
|
162
|
+
| `run_macro` | Batch up to 50 operations into a single round-trip for maximum efficiency. |
|
|
163
|
+
| `clipboard` / `notification_show` | System-level text exchange and user alerts. |
|
|
222
164
|
|
|
223
165
|
---
|
|
224
166
|
|
package/bin/launcher.js
CHANGED
|
@@ -18,15 +18,15 @@ import path from "node:path";
|
|
|
18
18
|
import { Readable } from "node:stream";
|
|
19
19
|
import { pipeline } from "node:stream/promises";
|
|
20
20
|
|
|
21
|
-
const PACKAGE_VERSION = "1.0.
|
|
21
|
+
const PACKAGE_VERSION = "1.0.4";
|
|
22
22
|
const RELEASE_TAG = `v${PACKAGE_VERSION}`;
|
|
23
23
|
const REPO_API_URL = `https://api.github.com/repos/Harusame64/desktop-touch-mcp/releases/tags/${RELEASE_TAG}`;
|
|
24
24
|
const ASSET_NAME = "desktop-touch-mcp-windows.zip";
|
|
25
25
|
const RELEASE_METADATA_FILE = ".desktop-touch-release.json";
|
|
26
26
|
const RELEASE_MANIFEST = {
|
|
27
|
-
tagName: "v1.0.
|
|
27
|
+
tagName: "v1.0.4",
|
|
28
28
|
assetName: ASSET_NAME,
|
|
29
|
-
sha256: "
|
|
29
|
+
sha256: "8e04bd3edef0e243e84fb4a534867b2c6e47d516ed95b8219752275a1e9a3e59",
|
|
30
30
|
};
|
|
31
31
|
const CACHE_ROOT = process.env.DESKTOP_TOUCH_MCP_HOME
|
|
32
32
|
? path.resolve(process.env.DESKTOP_TOUCH_MCP_HOME)
|
package/package.json
CHANGED
|
@@ -1,8 +1,8 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@harusame64/desktop-touch-mcp",
|
|
3
|
-
"version": "1.0.
|
|
3
|
+
"version": "1.0.4",
|
|
4
4
|
"mcpName": "io.github.Harusame64/desktop-touch-mcp",
|
|
5
|
-
"description": "LLM-native Windows computer-use MCP server with
|
|
5
|
+
"description": "LLM-native Windows computer-use MCP server with 28 tools for screenshots, UIA, mouse/keyboard, Chrome CDP, terminal, SmartScroll, and perception guards",
|
|
6
6
|
"engines": {
|
|
7
7
|
"node": ">=20.0.0"
|
|
8
8
|
},
|