@harusame64/desktop-touch-mcp 1.0.3 → 1.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.ja.md CHANGED
@@ -4,13 +4,9 @@
4
4
 
5
5
  [English](README.md)
6
6
 
7
- > **「Claude CLI にスクショを毎回コピーしていたあなたへ。」**
7
+ > **「座標ルーレット」はもう不要。セマンティック・ワールドグラフと自動認識ガードによる、LLMネイティブなWindows自動化。**
8
8
 
9
- > **公開サイト:** [harusame64.github.io/desktop-touch-mcp](https://harusame64.github.io/desktop-touch-mcp/)
10
- > まずはここから、公開向けの解説、client setup、Reactive Perception Graph の紹介を読めます。
11
-
12
- Claude がデスクトップを直接見て、直接操作する。
13
- マウス・キーボード・スクリーンショット・Windows UI Automation・Chrome DevTools Protocol・ターミナル・SmartScroll・Reactive Perception Graph を統合した 28 個の public ツール (26 stub catalog + 2 dynamic v2 World-Graph: `desktop_discover` / `desktop_act`) を提供する MCP サーバーです。
9
+ ClaudeがWindowsを直接「見て」「操作する」ためのMCPサーバー。単なるピクセル指定の座標当てゲームを卒業し、**セマンティック・ワールドグラフ** (`desktop_discover`) による確実なエンティティ特定と、**自動認識(Auto-Perception)ガード**による安全な実行を実現します。28個の最適化されたツール群が、バックグラウンド入力、UIA、Chrome CDP、ターミナル、そしてトークン効率に優れたP-frame差分転送をサポートします。
14
10
 
15
11
  > *v0.15: Rust ネイティブエンジンにより**平均 82 倍高速化** — UIA フォーカス取得 2ms、SSE2 SIMD 画像差分 13〜15 倍速。設定不要:エンジンは自動ロード、不在時は PowerShell に透過フォールバック。*
16
12
  > *v0.15.5: **固定リリース検証** — npm ランチャーは対応する GitHub Release tag だけを取得し、Windows runtime zip を検証してから展開します。*
package/README.md CHANGED
@@ -4,12 +4,9 @@
4
4
 
5
5
  [日本語](README.ja.md)
6
6
 
7
- > **Stop pasting screenshots. Let Claude see and control your desktop directly.**
7
+ > **Beyond Coordinate Roulette: LLM-native Windows automation with a Semantic World-Graph and Auto-Perception.**
8
8
 
9
- > **Project site:** [harusame64.github.io/desktop-touch-mcp](https://harusame64.github.io/desktop-touch-mcp/)
10
- > Start here for the public explainer, client setup guides, and the Reactive Perception Graph overview.
11
-
12
- An MCP server that gives Claude eyes and hands on Windows — 28 public tools (26 stub catalog + 2 dynamic v2 World-Graph: `desktop_discover` / `desktop_act`) covering screenshots, mouse, keyboard, Windows UI Automation, Chrome DevTools Protocol, clipboard, desktop notifications, SmartScroll, and a Reactive Perception Graph for safe multi-step automation, designed from the ground up for LLM efficiency.
9
+ An MCP server that gives Claude eyes and hands on Windows. It moves beyond pixel-guessing by grounding all interactions in a **Semantic World-Graph** (`desktop_discover`) and verifying every action with **Auto-Perception** guards. Optimized into 28 high-signal tools covering screenshots, background input (WM_CHAR), UIA, Chrome CDP, terminal, and token-efficient P-frame diffing.
13
10
 
14
11
  > *v0.15: **82× average speedup** via Rust native engine — UIA focus queries in 2 ms, SSE2-accelerated image diffing at 13–15× native speed. Zero-config: the engine auto-loads when present, with transparent PowerShell fallback.*
15
12
  > *v0.15.5: **Pinned release verification** — the npm launcher now fetches only the matching GitHub Release tag and verifies the Windows runtime zip before extraction.*
@@ -121,104 +118,49 @@ For a local checkout, register the built server directly:
121
118
 
122
119
  ---
123
120
 
124
- ## Tools (28 total — 26 stub catalog + 2 dynamic v2)
125
-
126
- > 📖 **Full command reference**: [`docs/system-overview.md`](docs/system-overview.md) — every tool's parameters, response shape, coordinate math, layer-buffer strategy, and engineering notes in one place.
127
-
128
-
129
- ### Screenshot (1)
130
- | Tool | Description |
131
- |---|---|
132
- | `screenshot` | Main capture. Supports `detail` (`meta`/`text`/`image`/`som`/`ocr`), `mode` (`normal`/`background` — Phase 4 absorbs former screenshot_background), `region` sub-crop (Phase 4 absorbs former scope_element after `desktop_discover` gives you element bounds), `dotByDot`, `dotByDotMaxDimension`, `grayscale`, `diffMode`. `detail='text'` auto-activates the SoM pipeline when UIA is blind. `detail='ocr'` (Phase 4 absorbs former screenshot_ocr) returns Windows OCR words with screen-pixel clickAt coords. `scroll(action='capture')` provides full-page stitched capture. |
133
-
134
- ### Observation (1) — World-Graph core
135
- | Tool | Description |
136
- |---|---|
137
- | `desktop_state` | Cheapest read-only observation. Always returns focused window, focused element, modal flag, attention signal. Phase 4: `includeCursor:true` adds `cursor.{x,y,monitorId}` (former get_cursor_position); `includeScreen:true` adds `screen.{displays[],...}` (former get_screen_info); `includeDocument:true` adds `document.{url,title,readyState,scroll,...}` via CDP (former get_document_state). focusedWindow always covers the former get_active_window response. |
138
-
139
- ### Discovery + Action — World-Graph dispatchers (2 dynamic)
140
- | Tool | Description |
141
- |---|---|
142
- | `desktop_discover` | Find actionable entities and emit leases. Returns `entities[]` (former `get_ui_elements` equivalent — `entityId`/`label`/`role`/`confidence`/`sources`/`primaryAction`/`lease`, plus `rect` when `debug:true`) + `windows[]` (former `get_windows` equivalent — `zOrder`/`title`/`hwnd`/`region`/`isActive`/`isMinimized`/`isMaximized`/`processName`). Pre-Phase-1 `desktop_see`. |
143
- | `desktop_act` | Lease-consuming action (`click` / `type` / `setValue` / `scroll` / `select`). Phase 4: `action='setValue'` absorbs former set_element_value via UIA ValuePattern (UIA entities) or CDP fill (browser entities). Pre-Phase-1 `desktop_touch`. |
144
-
145
- ### Window management (1)
146
- | Tool | Description |
147
- |---|---|
148
- | `focus_window` | Bring a window to foreground by partial title match. Use `desktop_discover` to list available titles. |
149
- | `window_dock(action='dock')` | Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible) |
121
+ ## Tools (28 Optimized Tools)
150
122
 
151
- ### Mouse (2)
152
- | Tool | Description |
153
- |---|---|
154
- | `mouse_click` / `mouse_drag` | Click, drag. `doubleClick` / `tripleClick` (line-select). Accept `speed` and `homing` parameters. (`mouse_move` privatized in Phase 4.) |
155
- | `scroll` | Scroll in any direction (`action='raw'` / `'to_element'` / `'smart'` / `'capture'`). Accepts `speed` and `homing` parameters. |
156
-
157
- ### Keyboard (1)
158
- | Tool | Description |
159
- |---|---|
160
- | `keyboard` | Send keyboard input. `action='type'` for text (`use_clipboard=true` bypasses IME / required for em-dash / smart quotes; `replaceAll=true` sends Ctrl+A before typing). `action='press'` for key combos (`ctrl+c`, `alt+f4`, etc.). |
161
-
162
- ### UI Automation (1)
163
- | Tool | Description |
164
- |---|---|
165
- | `click_element` | Click a button by name or automationId — no coordinates needed. Use `desktop_discover` to find target entities. (`get_ui_elements` / `set_element_value` / `scope_element` privatized in Phase 4 — see desktop_discover / desktop_act / screenshot.region.) |
166
-
167
- ### Browser CDP (9)
168
- | Tool | Description |
169
- |---|---|
170
- | `browser_open` | Connect to Chrome/Edge via CDP; lists open tabs with `active:true/false`. Pass `launch:{}` (or with overrides) to auto-spawn a debug-mode browser when no CDP endpoint is live (idempotent) |
171
- | `browser_locate` | CSS selector → exact physical screen coords |
172
- | `browser_click` | Find DOM element + click in one step |
173
- | `browser_eval` | Inspect or operate on a tab via 3 actions: `js` (evaluate JS), `dom` (get HTML), `appState` (extract SSR-injected SPA state — `__NEXT_DATA__`, `__NUXT_DATA__`, `__REMIX_CONTEXT__`, `__APOLLO_STATE__`, GitHub `react-app`, JSON-LD, Redux SSR) |
174
- | `browser_fill` | Fill React/Vue/Svelte controlled inputs via CDP — works where `browser_eval(action='js')` value assignment doesn't update framework state |
175
- | `browser_form` | Inspect form fields (input/select/textarea/button) with name, type, value, label resolved via for/aria/ancestor LABEL |
176
- | `browser_overview` | Enumerate links / buttons / inputs + **ARIA toggles** with `state.{checked,pressed,selected,expanded}`; each element includes `viewportPosition` |
177
- | `browser_search` | Grep DOM by text / regex / role / ariaLabel / selector with confidence ranking |
178
- | `browser_navigate` | Navigate via CDP `Page.navigate`; `waitForLoad:true` (default) returns once `readyState==='complete'` |
179
-
180
- All `browser_*` tools that touch the DOM accept `includeContext:false` to omit the trailing `activeTab:` / `readyState:` lines (saves ~150 tok/call on chained invocations). Within a 500 ms window, consecutive calls reuse one tab-context fetch automatically.
181
-
182
- ### Workspace (2)
183
- | Tool | Description |
184
- |---|---|
185
- | `workspace_snapshot` | All windows: thumbnails + UI summaries in one call |
186
- | `workspace_launch` | Launch an app and auto-detect the new window |
187
-
188
- ### Wait / Status (2)
189
- | Tool | Description |
190
- |---|---|
191
- | `wait_until` | Server-side wait for window/focus/terminal/browser DOM state changes. Replaces the former events_* polling family — use `condition='window_appears'` / `'terminal_output_contains'` / `'element_matches'` / `'focus_changes'` for one-shot waits. |
192
- | `server_status` | Returns which backend is active: `uia` (native Rust or powershell) and `imageDiff` (native Rust SSE2 or typescript). Diagnostic — call once per session when troubleshooting performance. |
123
+ > 📖 **Full Reference**: [`docs/system-overview.md`](docs/system-overview.md) — Exhaustive guide on parameters, return schemas, and coordinate math.
193
124
 
194
- ### Terminal (2)
125
+ ### 🌐 World-Graph V2 (Primary Path)
195
126
  | Tool | Description |
196
127
  |---|---|
197
- | `terminal(action='read')` | Read text from Windows Terminal / PowerShell / cmd / WSL via UIA/OCR. Supports `sinceMarker` for diff reads |
198
- | `terminal(action='send')` | Send commands to a terminal. Uses clipboard paste by default for IME safety |
128
+ | `desktop_discover` | Observe the desktop. Returns interactive entities with leases (UIA, CDP, Terminal, Visual SoM). |
129
+ | `desktop_act` | Perform actions (click, type, drag, select) on entities via lease validation. Returns semantic diffs. |
199
130
 
200
- ### Pin / Macro (2)
131
+ ### 👁️ Observation & State
201
132
  | Tool | Description |
202
133
  |---|---|
203
- | `window_dock(action='pin'\|'unpin'\|'dock')` | Always-on-top toggle / docking |
204
- | `run_macro` | Execute up to 50 steps sequentially in one MCP call. Phase 4: DSL accepts the v1.0.0 dispatcher names (`keyboard({action:'type'})` etc.) only. |
134
+ | `desktop_state` | Lightweight check of focus, active window, cursor, and Auto-Perception attention signal. |
135
+ | `screenshot` | Multi-mode capture: `detail='text'` (UIA/OCR), `diffMode` (P-frame), `dotByDot` (1:1), and `background`. |
136
+ | `workspace_snapshot` | Instant session orientation: all window thumbnails + UI summaries in one call. |
137
+ | `server_status` | Diagnostic check for native engine health and feature activation. |
205
138
 
206
- ### Clipboard (2)
139
+ ### ⌨️ Input & Control
207
140
  | Tool | Description |
208
141
  |---|---|
209
- | `clipboard(action='read')` | Read the current Windows clipboard text (non-text payloads return empty string) |
210
- | `clipboard(action='write')` | Write text to the Windows clipboard; full Unicode / emoji / CJK support |
142
+ | `keyboard` | Send keyboard input. Supports background input (WM_CHAR) and IME-safe clipboard bypass. |
143
+ | `mouse_click` / `mouse_drag` | Precision coordinate-based interaction with homing and force-focus protection. |
144
+ | `scroll` | Multi-strategy: `raw` (notches), `to_element`, `smart` (virtual lists), and `capture` (stitch). |
145
+ | `click_element` | Legacy UIA-based click by name/ID (fallback when entities are unavailable). |
211
146
 
212
- ### Notification (1)
147
+ ### 🌐 Browser CDP (Chrome/Edge/Brave)
213
148
  | Tool | Description |
214
149
  |---|---|
215
- | `notification_show` | Show a Windows system tray balloon notification useful to alert the user when a long-running task finishes |
150
+ | `browser_open` / `browser_navigate` | Idempotent debug-mode launch and reliable navigation. |
151
+ | `browser_click` / `browser_fill` / `browser_form` | High-level DOM interaction stable across repaints and framework re-renders. |
152
+ | `browser_eval` | Deep inspection via `js` (scripting), `dom` (HTML), and `appState` (SPA data extraction). |
153
+ | `browser_overview` / `browser_search` / `browser_locate` | Semantic discovery, grep-like DOM search, and pixel-accurate coordinate lookup. |
216
154
 
217
- ### Scroll (2)
155
+ ### 🛠️ Utilities & Workflow
218
156
  | Tool | Description |
219
157
  |---|---|
220
- | `scroll(action='to_element')` | Scroll a named element into the viewport without computing scroll amounts. Chrome path: `selector` + `block` alignment. Native path: `name` + `windowTitle` via UIA ScrollItemPattern |
221
- | `scroll(action='smart')` | **SmartScroll** — unified scroll dispatcher: CDP → UIA → image binary-search fallback. Handles nested containers, virtualised lists (TanStack/React Virtualized), sticky-header occlusion, and image-only environments. Returns `pageRatio`, `ancestors[]`, and hash-verified `scrolled` |
158
+ | `terminal` | Unified command execution: `run` (send + wait + read), `read` (OCR/UIA), and `send`. |
159
+ | `wait_until` | Efficient server-side polling for window, focus, text, or URL state changes. |
160
+ | `window_dock` / `focus_window` | Window management: `pin` (always-on-top), `unpin`, `dock` (corner snap), and `focus`. |
161
+ | `workspace_launch` | Launch apps and auto-detect new HWNDs (supports localized titles). |
162
+ | `run_macro` | Batch up to 50 operations into a single round-trip for maximum efficiency. |
163
+ | `clipboard` / `notification_show` | System-level text exchange and user alerts. |
222
164
 
223
165
  ---
224
166
 
package/bin/launcher.js CHANGED
@@ -18,15 +18,15 @@ import path from "node:path";
18
18
  import { Readable } from "node:stream";
19
19
  import { pipeline } from "node:stream/promises";
20
20
 
21
- const PACKAGE_VERSION = "1.0.3";
21
+ const PACKAGE_VERSION = "1.0.4";
22
22
  const RELEASE_TAG = `v${PACKAGE_VERSION}`;
23
23
  const REPO_API_URL = `https://api.github.com/repos/Harusame64/desktop-touch-mcp/releases/tags/${RELEASE_TAG}`;
24
24
  const ASSET_NAME = "desktop-touch-mcp-windows.zip";
25
25
  const RELEASE_METADATA_FILE = ".desktop-touch-release.json";
26
26
  const RELEASE_MANIFEST = {
27
- tagName: "v1.0.3",
27
+ tagName: "v1.0.4",
28
28
  assetName: ASSET_NAME,
29
- sha256: "f019db6e0194dcf09f7428e8e42af7902de6f842695568eb7897750803d0807e",
29
+ sha256: "8e04bd3edef0e243e84fb4a534867b2c6e47d516ed95b8219752275a1e9a3e59",
30
30
  };
31
31
  const CACHE_ROOT = process.env.DESKTOP_TOUCH_MCP_HOME
32
32
  ? path.resolve(process.env.DESKTOP_TOUCH_MCP_HOME)
package/package.json CHANGED
@@ -1,8 +1,8 @@
1
1
  {
2
2
  "name": "@harusame64/desktop-touch-mcp",
3
- "version": "1.0.3",
3
+ "version": "1.0.4",
4
4
  "mcpName": "io.github.Harusame64/desktop-touch-mcp",
5
- "description": "LLM-native Windows computer-use MCP server with 56 tools for screenshots, UIA, mouse/keyboard, Chrome CDP, terminal, SmartScroll, and perception guards",
5
+ "description": "LLM-native Windows computer-use MCP server with 28 tools for screenshots, UIA, mouse/keyboard, Chrome CDP, terminal, SmartScroll, and perception guards",
6
6
  "engines": {
7
7
  "node": ">=20.0.0"
8
8
  },