ucu-mcp 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -5,67 +5,30 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
- ## [0.2.0] - 2026-06-05
8
+ ## [0.3.0] - 2026-06-06
9
9
 
10
- ### Changed
11
-
12
- - Replaced JXA keyboard/mouse input with native Swift CGEvent helper (`native/cgevent/cgevent-helper`), eliminating SIGSEGV crashes on macOS Sequoia+
13
- - `listWindows` switched from `CGWindowListCopyWindowInfo` to System Events for reliable window enumeration
14
- - `getWindowState` adapted to use System Events window IDs instead of CGWindow IDs
15
- - Fixed OCR JXA script — `isValid` guard now correctly handles missing/broken references
16
-
17
- ### Fixed
18
-
19
- - `typeInElement` now properly escapes `$` in text to prevent JXA template-literal interpolation errors
20
- - AX element cache now refetches stale references instead of throwing
21
- - MCP server version now resolves from `package.json` instead of advertising stale `0.1.0`
22
- - `screenshot.maxWidth`, `screenshot.windowId`, and action `captureAfter` encode options now reach the execution path
23
- - `captureAfter` now returns a separate MCP image content item instead of embedding screenshot bytes in JSON text
24
- - Window-relative coordinate tools now reject stale `windowId` values instead of falling back to raw screen coordinates
25
- - Real input actions no longer use the shared retry wrapper after a partial failure
26
- - macOS AX traversal now uses `uiElements()` with `elements()` fallback, fixing TextEdit `AXTextArea` discovery
27
- - User activity monitoring now starts with the MCP server and initializes the cursor baseline before polling
28
- - Added client-friendly aliases and defaults: `press_key.modifiers`, `scroll.deltaX=0`, `wait_for_element.timeoutMs/intervalMs`, and `move.captureAfter`
29
- - README tool tables and OCR/captureAfter response examples now match the live MCP schema
30
- - macOS platform failures now use structured `UcuError` subclasses for screenshots, window lookup, AX permissions, stale elements, cursor queries, and input synthesis
31
- - MCP tool failures now return `isError: true` with JSON `error.name`, `error.code`, `error.retryable`, `error.message`, and `error.recovery` instead of forcing clients to parse plain text
32
- - `wait_for_element` no longer masks Accessibility/platform failures as ordinary timeouts; missing elements still time out, but real lookup failures surface through the structured MCP error response
33
- - macOS `listWindows` now uses a short defensive-copy cache for repeated window lookups, reducing back-to-back window resolution calls from seconds to near-zero while `focusApp` still invalidates before activating a target app
34
- - Added optional real client CLI smoke coverage for Claude Code CLI, Codex CLI, and OpenCode MCP visibility
35
- - README now includes verified `claude mcp add`, `codex mcp add`, and OpenCode `opencode.json` setup paths
36
-
37
- ### Tests
38
-
39
- - Unit test count grew from 83 → 161
40
- - Optional client CLI smoke: 3/3 passing with `npm run test:client-cli`
41
- - GUI smoke tests 6/6 passing (`UCU_MACOS_GUI_SMOKE=1`)
42
-
43
- ## [0.1.0] - 2026-06-02
44
10
 
45
11
  ### Added
46
12
 
47
- - Initial release of UCU-MCP (Universal Computer Use MCP Server)
48
- - **22 MCP tools** for desktop automation via Model Context Protocol:
49
- - Screen capture: `screen_capture`, `screen_capture_active_window`
50
- - Mouse control: `mouse_move`, `mouse_click`, `mouse_double_click`, `mouse_drag`, `mouse_scroll`
51
- - Keyboard control: `keyboard_type`, `keyboard_hotkey`, `keyboard_key`
52
- - Clipboard: `clipboard_read`, `clipboard_write`
53
- - Window management: `window_list`, `window_activate`, `window_close`
54
- - Application control: `app_launch`, `app_quit`
55
- - System: `system_info`, `process_list`, `process_terminate`
56
- - Safety: `doctor` command for permission and environment diagnostics
57
- - **Safety features**:
58
- - URL blocklist to prevent navigation to sensitive sites
59
- - Lock screen guard (macOS) — blocks automation when screen is locked
60
- - Typed text injection scan — validates keyboard input before injection
61
- - Focus steal suppression — prevents accidental focus changes during automation
62
- - User interaction monitor — tracks user activity for safety coordination
63
- - **macOS platform support** with Accessibility API integration
64
- - TypeScript-first codebase with full type definitions
65
- - CLI entry point with `doctor` diagnostic command
13
+ - Scenario-based MCP Instructions — tool-usage guidance organized by task pattern (form fill, menu bar click, screen read, app switch, verify action, wait for change, recover stale target, clipboard)
14
+ - findElement multi-strategy: `value` filter (AX value, respects textMode), `index` selector (0-based Nth match), `near` sorter (ascending distance to point)
15
+ - wait_for_element `until` parameter: `appear` (default), `disappear` (poll until gone), `value_change` (poll until first match value differs)
16
+ - Action Receipt v1 unified receipt structure for all action-class tools (click, double_click, scroll, drag, move, type_text, press_key, click_element, set_value, type_in_element)
17
+ - Receipt fields: actionId (base36-timestamp unique ID), action, status (ok/partial/blocked), target (location context), result (business result), capture (screenshot metadata), warnings, next (suggested next step)
18
+ - Partial receipt when action succeeds but post-action screenshot fails: status="partial", capture.error contains error details, warnings includes "Post-action screenshot capture failed"
19
+ - Target Session v1 — `focus_app` now returns stable target metadata (`targetId`, `appName`, `pid`, `windowId`, `title`, `capturedAt`) for follow-up tool calls
20
+ - `TARGET_STALE` structured errors for active target windows that disappear before `get_window_state`
66
21
 
67
22
  ### Changed
68
23
 
24
+ - MCP instructions rewritten from generic description to scenario-driven workflow recommendations
25
+ - `wait_for_element` description updated to reflect `until` parameter semantics
26
+ - `find_element` schema extended with `value`, `index`, `near` parameters
27
+ - Action tool responses now wrap business results under `result` instead of returning them at the top level
28
+ - captureAfter failures now surface through receipt.capture.error instead of a flat captureError object
29
+ - `get_window_state` can use the prior `focus_app` target when `windowId` is omitted
30
+ - AX tools (`find_element`, `wait_for_element`, `click_element`, `set_value`, `type_in_element`) can use the prior `focus_app` target when `app` is omitted
31
+
69
32
  - Rewrote `src/mcp/tools.ts` with comprehensive 22-tool registry:
70
33
  - Unified `withSafety` wrapper for all automation actions
71
34
  - `captureAfter` helper for post-action screenshots
package/README.md CHANGED
@@ -81,7 +81,7 @@ UCU-MCP provides 22 tools across five categories:
81
81
  | `screenshot` | Capture screen, window, or region as PNG/JPEG image content | `display?`, `windowId?`, `region?`, `maxWidth?`, `format?` |
82
82
  | `list_windows` | List all on-screen windows with IDs, titles, bounds | `includeMinimized?` |
83
83
  | `list_apps` | List visible macOS apps with pid, frontmost state, and window count | — |
84
- | `focus_app` | Select an app/window target context for later AX tools | `app` |
84
+ | `focus_app` | Select an app/window target context for later AX tools; returns `targetId`, `appName`, `pid`, `windowId`, `title`, and `capturedAt` | `app` |
85
85
  | `get_window_state` | Get accessibility tree of a window, or the prior focus_app target when windowId is omitted | `windowId?`, `depth?`, `includeBounds?` |
86
86
  | `get_screen_size` | Get screen dimensions | `display?` |
87
87
  | `ocr` | Perform OCR on screen or region; returns text with bounding boxes and confidence | `display?`, `region?` |
@@ -108,8 +108,8 @@ UCU-MCP provides 22 tools across five categories:
108
108
 
109
109
  | Tool | Description | Key Parameters |
110
110
  |------|-------------|----------------|
111
- | `find_element` | Find UI element by text, role, or description using AX APIs | `text?`, `role?`, `app?`, `depth?`, `includeBounds?`, `maxResults?` |
112
- | `click_element` | Click an AX element by its id (from find_element); refetches equivalent elements after UI updates | `elementId`, `app?`, `captureAfter?` |
111
+ | `find_element` | Find UI element by text, role, or description using AX APIs, using the current focus_app target when app is omitted | `text?`, `role?`, `app?`, `depth?`, `includeBounds?`, `maxResults?` |
112
+ | `click_element` | Click an AX element by its id (from find_element), using the current focus_app target when app is omitted; refetches equivalent elements after UI updates | `elementId`, `app?`, `captureAfter?` |
113
113
  | `set_value` | Set an AX element's value directly without focusing it, using the current focus_app target when app is omitted | `elementId`, `value`, `app?`, `captureAfter?` |
114
114
  | `type_in_element` | Type text into a specific AX text field element; may focus the element and refetches equivalent elements after UI updates | `elementId`, `text`, `app?`, `clearFirst?`, `captureAfter?` |
115
115
 
@@ -121,10 +121,12 @@ UCU-MCP provides 22 tools across five categories:
121
121
  | `wait` | Wait for UI state to settle after launches, animations, or navigation | `ms` |
122
122
  | `wait_for_element` | Poll the AX tree until a matching element appears | `text?`, `role?`, `app?`, `timeout?`, `timeoutMs?`, `interval?`, `intervalMs?` |
123
123
 
124
- Action tools accept `captureAfter`, `captureMaxWidth`, and `captureFormat` so an agent can receive a post-action screenshot as a second MCP image content item in the same response instead of spending another round trip on `screenshot`.
124
+ Action tools accept `captureAfter`, `captureMaxWidth`, and `captureFormat` so an agent can receive a post-action screenshot as a second MCP image content item in the same response instead of spending another round trip on `screenshot`. When `captureAfter` is requested and the action succeeds, the tool returns an `ActionReceipt` (see the Action Receipt section below) with `capture.status: "ok"`. If post-action capture fails, the receipt has `status: "partial"` and `capture.status: "error"` with the error details. If `captureAfter` is omitted, `capture.status` is `"skipped"`.
125
125
 
126
126
  For fast AX discovery on large windows, use `find_element` with `includeBounds=false` and a small `maxResults`. Keep bounds enabled when the result may be used for coordinate fallback.
127
127
 
128
+ `focus_app` establishes a session target for follow-up observation and AX actions. After focusing an app, `get_window_state` may omit `windowId`, and AX tools may omit `app`. If the focused window closes or is replaced, UCU-MCP returns a structured `TARGET_STALE` error so the agent can refresh with `focus_app` or `list_windows` instead of silently acting on a different target.
129
+
128
130
  ## OCR Tool Usage
129
131
 
130
132
  The `ocr` tool captures a screenshot and runs optical character recognition, returning each detected text element with its position and confidence score.
@@ -210,6 +212,90 @@ The AX (Accessibility) element tools let you interact with UI controls by their
210
212
  }
211
213
  ```
212
214
 
215
+ ## Action Receipt
216
+
217
+ Action tools (`click`, `double_click`, `scroll`, `drag`, `move`, `type_text`, `press_key`, `click_element`, `set_value`, `type_in_element`) return a unified `ActionReceipt` JSON object that wraps the action result, target information, and optional post-action screenshot metadata.
218
+
219
+ ### Receipt structure
220
+
221
+ | Field | Type | Description |
222
+ |-------|------|-------------|
223
+ | `actionId` | `string` | Unique base36-timestamp ID (e.g. `a1x9z2k-1`) |
224
+ | `action` | `string` | Tool name that produced this receipt |
225
+ | `status` | `"ok" \| "partial" \| "blocked"` | Overall action status |
226
+ | `target` | `object` | What was acted upon (coordinates, elementId, app, windowId) |
227
+ | `result` | `object` | Original business result (clicked, x, y, etc.) |
228
+ | `capture` | `object` | Screenshot metadata (requested, status, format, maxWidth, error) |
229
+ | `warnings` | `string[]` | Non-fatal warnings array |
230
+ | `next` | `string` | Suggested next action |
231
+
232
+ ### Examples
233
+
234
+ **Success with captureAfter:**
235
+
236
+ ```json
237
+ {
238
+ "actionId": "a1x9z2k-1",
239
+ "action": "click",
240
+ "status": "ok",
241
+ "target": { "x": 100, "y": 200 },
242
+ "result": { "clicked": true, "x": 100, "y": 200 },
243
+ "capture": {
244
+ "requested": true,
245
+ "status": "ok",
246
+ "format": "jpeg",
247
+ "maxWidth": 1280
248
+ },
249
+ "warnings": [],
250
+ "next": "find_element or get_window_state"
251
+ }
252
+ ```
253
+
254
+ **Success without captureAfter:**
255
+
256
+ ```json
257
+ {
258
+ "actionId": "a1x9z2k-2",
259
+ "action": "click_element",
260
+ "status": "ok",
261
+ "target": { "elementId": "AXButton-42", "app": "Safari" },
262
+ "result": { "clicked": true, "elementId": "AXButton-42" },
263
+ "capture": {
264
+ "requested": false,
265
+ "status": "skipped"
266
+ },
267
+ "warnings": [],
268
+ "next": "find_element or get_window_state"
269
+ }
270
+ ```
271
+
272
+ **Partial when capture fails:**
273
+
274
+ ```json
275
+ {
276
+ "actionId": "a1x9z2k-3",
277
+ "action": "click",
278
+ "status": "partial",
279
+ "target": { "x": 100, "y": 200 },
280
+ "result": { "clicked": true, "x": 100, "y": 200 },
281
+ "capture": {
282
+ "requested": true,
283
+ "status": "error",
284
+ "format": "jpeg",
285
+ "maxWidth": 1280,
286
+ "error": {
287
+ "name": "CaptureError",
288
+ "code": "CAPTURE_FAILED",
289
+ "retryable": true,
290
+ "message": "Screenshot capture failed after action",
291
+ "recovery": "Check Screen Recording permission and retry."
292
+ }
293
+ },
294
+ "warnings": ["Post-action screenshot capture failed"],
295
+ "next": "screenshot"
296
+ }
297
+ ```
298
+
213
299
  ## macOS Permission Setup
214
300
 
215
301
  UCU-MCP on macOS requires two system permissions:
@@ -5,15 +5,20 @@ import { fileURLToPath } from "node:url";
5
5
  import { createStdioTransport } from "./transport.js";
6
6
  import { registerTools, startUserActivityMonitor } from "./tools.js";
7
7
  const UCU_MCP_INSTRUCTIONS = `
8
- UCU-MCP is a cross-client computer-use server for Claude Code CLI, Claude Code Desktop, OpenCode, and other MCP clients.
8
+ UCU-MCP is a cross-client computer-use server for Claude Code CLI/Desktop, OpenCode, and other MCP clients.
9
9
 
10
- Use screenshots and window state to observe before acting. On macOS, prefer list_apps/focus_app to establish the target app context, then use AX element tools when an element can be identified: find_element, then click_element, set_value, or type_in_element. Fall back to coordinates only when AX lookup is unavailable or ambiguous.
10
+ Pick the right tool sequence for the task:
11
11
 
12
- Before repeated UI work, call get_screen_size and list_windows so coordinates and target windows are explicit. Use get_window_state for structured UI trees. Use ocr when visible text is not exposed through Accessibility. For tight observe-act loops, set captureAfter=true on action tools to receive a post-action screenshot in the same tool response.
12
+ Fill a form field find_element (text/role) + type_in_element or set_value. Prefer AX over coordinates.
13
+ • Click a menu bar item → get_screen_size + click with coordinates (menu bar is not in the AX tree).
14
+ • Read what's on screen → screenshot; for text not in AX use ocr; for a structured tree use get_window_state.
15
+ • Switch between apps → list_apps, then focus_app; subsequent tools use the active target context.
16
+ • Verify an action succeeded → captureAfter=true on action tools, or call screenshot afterwards.
17
+ • Wait for UI to change → wait_for_element (until: "appear" default; also "disappear" or "value_change").
18
+ • Recover from TARGET_STALE → call focus_app again for the target app, then retry the action.
19
+ • Read or write the clipboard → clipboard_read / clipboard_write.
13
20
 
14
- Safety model: actions are blocked while macOS is locked, dangerous shortcuts and sensitive windows are blocked, and suspicious injected text is rejected. For text entry into UI controls, prefer type_in_element because it can refetch equivalent AX elements if the UI tree changes.
15
-
16
- For Claude Code CLI/Desktop and OpenCode configs, run the ucu-mcp executable over stdio. If tools fail on macOS, run doctor first to check Accessibility and Screen Recording permissions. Windows and Linux adapters are explicit stubs until their native backends are implemented.
21
+ General rules: on macOS call list_apps/focus_app first to establish target context, then prefer AX tools (find_element → click_element / type_in_element / set_value). Use coordinates only when AX lookup is unavailable. Actions are blocked while macOS is locked; dangerous shortcuts and sensitive windows are blocked; suspicious injected text is rejected. type_in_element can refetch equivalent AX elements when the UI tree changes. Run doctor to check Accessibility and Screen Recording permissions. Windows and Linux adapters are explicit stubs until their native backends are implemented.
17
22
  `.trim();
18
23
  function getPackageVersion() {
19
24
  let dir = dirname(fileURLToPath(import.meta.url));
@@ -1,10 +1,15 @@
1
1
  /**
2
2
  * Tool registry for UCU-MCP.
3
3
  *
4
- * Registers 22 MCP tools on the server and dispatches each call through
4
+ * Registers 24 MCP tools on the server and dispatches each call through
5
5
  * a shared safety/permission/retry pipeline (`withSafety`).
6
6
  */
7
7
  import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
8
+ import type { AppTarget } from "../platform/base.js";
9
+ /**
10
+ * Get the currently active target context (set by focus_app).
11
+ */
12
+ export declare function getActiveTarget(): AppTarget | undefined;
8
13
  export declare function startUserActivityMonitor(): void;
9
14
  export declare function stopUserActivityMonitor(): void;
10
15
  export declare function registerTools(server: McpServer): void;