@harusame64/desktop-touch-mcp 1.9.2 → 1.10.1

package/README.ja.md CHANGED
@@ -21,6 +21,7 @@ npx -y @harusame64/desktop-touch-mcp
 
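  - **⚡ High-performance Rust native core** — The UIA bridge and image-diff engine are implemented in Rust (`napi-rs` + `windows-rs`) and loaded as a native `.node` addon. Direct COM calls from a dedicated MTA thread eliminate PowerShell process startup: `getFocusedElement` takes **2ms** (160× faster), and `getUiElements` completes in **about 100ms** using a batched BFS algorithm that minimizes cross-process RPC. Image diffing reaches 13–15× throughput with **SSE2 SIMD**. When the native engine is unavailable, every function falls back transparently to PowerShell, with no configuration required.
  - **🎯 Set-of-Marks (SoM) visual fallback** — Even when UIA does not fully work (games, RDP, unsupported Electron apps), `screenshot(detail="text")` automatically starts the Hybrid Non-CDP pipeline. Rust image preprocessing → Windows OCR → clustering → a PNG annotated with red bounding boxes and numbered badges (`[1]`, `[2]`…), returned together with an element list carrying `clickAt` coordinates. No CDP required.
+ - **🔁 One-call confirmation on visual-only targets** — On targets where UIA does not work (Electron, PWAs, games, custom-drawn canvases, RDP windows), `desktop_act` can fold the post-action confirmation into the response itself. On success it bundles an optional `roiCapture` (a PNG cropped to just the region that changed, plus a lease-less preview of the elements now present there), so you can confirm the result of the click and the next target without separately calling `desktop_state` + `screenshot`. On visual-only targets it is attached **by default** whenever there is a change (`returnCapture:"on-change"`). Use `returnCapture:"never"` to suppress it, or `"always"` to always attach it. It is not attached on structured targets (browser/CDP, UIA-rich native apps), so those responses are unchanged (there, `desktop_state` is cheaper and more accurate).
  - **LLM-native design** — Designed not to imitate human operation but around how an LLM can act quickly without consuming context. Combining batched execution of multiple operations via `run_macro` (fewer API round trips) with **MPEG P-frame-style layer diffs** (`diffMode`) strips wasted image transfers and inference loops to the minimum.
  - **Reactive Perception Graph** — Register a `lensId` for a window or browser tab and pass it to subsequent action tools to receive a safety guard before the action and `post.perception` feedback after it. This reduces repeated `screenshot` / `desktop_state` calls and prevents typing into the wrong window or clicking stale coordinates.
  - **Full Japanese/CJK support** — Uses Win32 `GetWindowTextW` to read window titles, avoiding nut-js mojibake. IME-bypass input is also supported.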
@@ -487,7 +488,7 @@ V2 は、座標ベースのクリックをエンティティベースの操作
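  | Tool | Description |
  |---|---|
  | `desktop_discover` | Observes a window or browser tab and returns interactive entities. Does not return raw coordinates. Supports the UIA (native), CDP (browser), terminal, and GPU visual lanes. |
- | `desktop_act` | Operates on an entity returned by `desktop_discover`. Validates the lease before executing and returns a semantic diff (`entity_disappeared`, `modal_appeared`, `focus_shifted`, etc.). |
+ | `desktop_act` | Operates on an entity returned by `desktop_discover`. Validates the lease before executing and returns a semantic diff (`entity_disappeared`, `modal_appeared`, `focus_shifted`, etc.). On visual-only targets it can bundle a `roiCapture` on success (a PNG of the changed region + a lease-less preview of the next target), completing "result confirmation" and "next-target discovery" in one call (`returnCapture`: `on-change` is the default and attaches on change / `never` suppresses / `always` always attaches). |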
 
  ### Click priority order
 
package/README.md CHANGED
@@ -22,6 +22,7 @@ npx -y @harusame64/desktop-touch-mcp
 
  - **⚡ High-performance Rust Native Core** — The UIA bridge and image-diff engine are written in Rust (`napi-rs` + `windows-rs`) and loaded as a native `.node` addon. Direct COM calls from a dedicated MTA thread eliminate PowerShell process spawning — `getFocusedElement` completes in **2 ms** (160× faster), and `getUiElements` returns full trees in **~100 ms** with a batch BFS algorithm that minimizes cross-process RPC. Image-diff operations use **SSE2 SIMD** for 13–15× throughput. When the native engine is unavailable, every function transparently falls back to PowerShell — zero config required.
  - **🎯 Set-of-Marks (SoM) visual fallback** — Games, RDP sessions, and non-accessible Electron apps return clickable elements even when UIA is completely blind. `screenshot(detail="text")` automatically detects UIA sparsity and activates a Hybrid Non-CDP pipeline: Rust-powered grayscale + bilinear upscale → Windows OCR → clustering → red bounding-box annotation with numbered badges (`[1]`, `[2]`…). Two parallel representations returned: a visual PNG for spatial orientation and a semantic `elements[]` list with `clickAt` coords — no CDP required.
+ - **🔁 One-call confirmation on visual-only targets** — On UIA-blind targets (Electron, PWAs, games, custom canvases, RDP windows), `desktop_act` can fold the post-action confirmation into its own response: an optional `roiCapture` carrying a PNG crop of *just the region that changed* plus a lease-less preview of the controls now visible there. The agent confirms what its click did and finds the next target without a separate `desktop_state` + `screenshot`. On visual-only targets it is **on by default** for a visible change (`returnCapture:"on-change"`); pass `returnCapture:"never"` to suppress it, or `"always"` to force it. Never attached on structured targets (browser/CDP, UIA-rich native), where `desktop_state` is cheaper and exact — so those responses are unchanged.
  - **LLM-native design** — Built around how LLMs think, not how humans click. `run_macro` batches multiple operations into a single API call; `diffMode` sends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.
  - **Reactive Perception Graph** — Register a `lensId` for a window or browser tab, pass it to action tools, and get guard-checked `post.perception` feedback after each action. It reduces repeated `screenshot` / `desktop_state` calls and prevents wrong-window typing or stale-coordinate clicks.
  - **Full CJK support** — Uses Win32 `GetWindowTextW` for window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.
@@ -131,7 +132,7 @@ For a local checkout, register the built server directly:
  | Tool | Description |
  |---|---|
  | `desktop_discover` | Observe the desktop. Returns interactive entities with leases (UIA, CDP, Terminal, Visual SoM). |
- | `desktop_act` | Perform actions (click, type, drag, select) on entities via lease validation. Returns semantic diffs. |
+ | `desktop_act` | Perform actions (click, type, drag, select) on entities via lease validation. Returns semantic diffs — plus an optional `roiCapture` (changed-region PNG + next-target preview) on visual-only targets. |
 
  ### 👁️ Observation & State
  | Tool | Description |
@@ -687,7 +688,7 @@ V2 introduces two new tools that replace coordinate-based clicking with entity-b
  | Tool | Description |
  |---|---|
  | `desktop_discover` | Observe a window or browser tab. Returns interactive entities with leases — no raw screen coordinates. Supports UIA (native), CDP (browser), terminal, and visual GPU lanes. |
- | `desktop_act` | Interact with an entity returned by `desktop_discover`. Validates the lease before executing. Returns a semantic diff (`entity_disappeared`, `modal_appeared`, `focus_shifted`, …). |
+ | `desktop_act` | Interact with an entity returned by `desktop_discover`. Validates the lease before executing. Returns a semantic diff (`entity_disappeared`, `modal_appeared`, `focus_shifted`, …). On visual-only targets a successful act can bundle a `roiCapture` (a PNG crop of the changed region + a lease-less next-target preview) so you confirm the result and find the next target in one call — controlled by `returnCapture` (`on-change`, the default on a visible change; `never` to suppress; `always` to force). |
 
  ### Clicking — priority order
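A minimal sketch of how a client might act on the `returnCapture` behaviour described in the README change above. Only `desktop_act`, `desktop_state`, `roiCapture`, and `returnCapture` with its three values come from the diff; `buildActArgs`, `nextStep`, and the `lease` / `action` argument names are hypothetical, and the real request and response shapes may differ.

```javascript
// The three `returnCapture` values named in the README diff.
const RETURN_CAPTURE_MODES = ["on-change", "never", "always"];

// Hypothetical helper: build the arguments for a `desktop_act` call.
// `lease` and `action` are placeholder field names (assumptions);
// only `returnCapture` is taken from the README text.
function buildActArgs(lease, action, returnCapture) {
  if (returnCapture !== undefined && !RETURN_CAPTURE_MODES.includes(returnCapture)) {
    throw new RangeError(`unknown returnCapture mode: ${returnCapture}`);
  }
  const args = { lease, action };
  // Omit the field entirely so the server-side default applies.
  if (returnCapture !== undefined) args.returnCapture = returnCapture;
  return args;
}

// Hypothetical helper: decide the follow-up step from a `desktop_act` response.
// When `roiCapture` is present, the response already carries the confirmation;
// otherwise the client observes separately (for example via `desktop_state`).
function nextStep(response) {
  return response.roiCapture ? "use-roi-capture" : "observe-separately";
}
```

Leaving `returnCapture` unset keeps the documented default (`on-change` on visual-only targets), so a client only needs to pass it to suppress or force the capture.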
 
package/bin/launcher.js CHANGED
@@ -18,15 +18,15 @@ import path from "node:path";
  import { Readable } from "node:stream";
  import { pipeline } from "node:stream/promises";
 
- const PACKAGE_VERSION = "1.9.2";
+ const PACKAGE_VERSION = "1.10.1";
  const RELEASE_TAG = `v${PACKAGE_VERSION}`;
  const REPO_API_URL = `https://api.github.com/repos/Harusame64/desktop-touch-mcp/releases/tags/${RELEASE_TAG}`;
  const ASSET_NAME = "desktop-touch-mcp-windows.zip";
  const RELEASE_METADATA_FILE = ".desktop-touch-release.json";
  const RELEASE_MANIFEST = {
-   tagName: "v1.9.2",
+   tagName: "v1.10.1",
    assetName: ASSET_NAME,
-   sha256: "be6ede4602e5c0b9d52b620e794c35b4e77c19a014c9e827fade960ca1caab0e",
+   sha256: "09515799119459d754bf8a24cae2f3c931bcc4946a85d2071467d1a0e8779856",
  };
  const CACHE_ROOT = process.env.DESKTOP_TOUCH_MCP_HOME
    ? path.resolve(process.env.DESKTOP_TOUCH_MCP_HOME)
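The manifest pins a SHA-256 digest for the release asset, which suggests the downloaded zip is checked against it; the checking code itself is not part of this diff. The following is a minimal sketch of such a check, assuming Node's built-in `node:crypto`. `sha256Hex` and `assertAssetMatches` are hypothetical helpers, not functions from `launcher.js`, and the sketch uses `require` so it runs as a plain script even though `launcher.js` itself uses ESM imports.

```javascript
const { createHash } = require("node:crypto");

// Manifest values copied from the 1.10.1 side of the diff.
const RELEASE_MANIFEST = {
  tagName: "v1.10.1",
  assetName: "desktop-touch-mcp-windows.zip",
  sha256: "09515799119459d754bf8a24cae2f3c931bcc4946a85d2071467d1a0e8779856",
};

// Hypothetical helper: lowercase hex SHA-256 of a Buffer.
function sha256Hex(bytes) {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical helper: throw unless the bytes hash to the pinned digest.
// Returns the computed digest on success.
function assertAssetMatches(bytes, manifest) {
  const actual = sha256Hex(bytes);
  if (actual !== manifest.sha256.toLowerCase()) {
    throw new Error(
      `${manifest.assetName}: expected sha256 ${manifest.sha256}, got ${actual}`
    );
  }
  return actual;
}
```

Because the digest changes with every release, a version bump that updates `PACKAGE_VERSION` and `tagName` without also updating `sha256` would make a check like this reject the new asset.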
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@harusame64/desktop-touch-mcp",
-   "version": "1.9.2",
+   "version": "1.10.1",
    "mcpName": "io.github.Harusame64/desktop-touch-mcp",
    "description": "Let Claude, Cursor, or any MCP client see and operate your Windows 10/11 desktop. 29 tools for screenshots, UI Automation, Chrome CDP, keyboard/mouse, terminal, with semantic discover-then-act targeting and per-action perception guards that avoid wrong-window typing and stale-coordinate clicks.",
    "keywords": [
@@ -95,6 +95,7 @@
      "check:native-types": "node scripts/check-native-types.mjs",
      "check:no-koffi": "node scripts/check-no-koffi.mjs",
      "check:rs-workspace": "cargo check --workspace --locked",
+     "check:rs-test-compile": "cargo test --workspace --locked --no-run",
      "check:expansion-disjoint": "node scripts/check-expansion-disjoint.mjs",
      "build:rs": "node scripts/build-rs.mjs --release",
      "build:rs:debug": "node scripts/build-rs.mjs --debug",
@@ -104,17 +105,17 @@
    "devDependencies": {
      "@eslint/js": "^10.0.1",
      "@modelcontextprotocol/sdk": "^1.10.0",
-     "@napi-rs/cli": "^3.6.2",
+     "@napi-rs/cli": "^3.7.0",
      "@nut-tree-fork/nut-js": "^4.2.6",
      "@types/node": "^25.9.1",
      "@types/ws": "^8.18.1",
-     "eslint": "^10.4.0",
+     "eslint": "^10.4.1",
      "fast-check": "^4.8.0",
      "globals": "^17.5.0",
      "sharp": "^0.34.5",
      "typescript": "^6.0.2",
-     "typescript-eslint": "^8.59.4",
-     "vitest": "^4.1.7",
+     "typescript-eslint": "^8.60.1",
+     "vitest": "^4.1.8",
      "ws": "^8.21.0",
      "zod": "^4.3.6"
    }