ucu-mcp 0.4.3 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -5,6 +5,38 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [0.5.0] - 2026-06-15
9
+
10
+ cc-switch 操控诊断与视觉通道修复:打通 OCR + 字母快捷键 + 托盘 + 视觉降级 fallback + 内置 agent skill。
11
+
12
+ ### Added
13
+
14
+ - **方向3 `describe_screen` 工具 + screenshot `describe` 选项(25→26 工具)**:结构化文本屏幕描述(OCR blocks + AX tree + foreground window),是 image content 在中转环境被降级为 URL 时的视觉 fallback。OCR/AX 各自 try/catch,失败聚合到 `errors[]` 不抛出。密码字段(`AXSecureTextField` 或 name 匹配 `/password|secret|token/i`)自动脱敏为 `[REDACTED]`。`screenshot(describe:true)` 在 image block 后追加 text block;独立 `describe_screen` 工具纯文本无 image。新增 `ScreenDescription` 类型(`base.ts`);`axDepth` 用 `Math.min(depth,10)` 与 ax-tree 一致;`ocrBlocks` 默认 50 cap。
15
+ - **方向4a `click_menu_bar_extra` 托盘支持 + SystemUIServer 遍历**:`findMenuBarExtra` JXA 两阶段——先查 app 自身 menuBarItems(`host:"self"`),为空或仅 Apple 菜单时追加遍历 `SystemUIServer` 进程托管的状态项(`host:"systemuiserver"`),按 description/name 双向 includes 匹配。`clickMenuBarExtra` 按 `host` 在正确进程重定位(SystemUIServer 顺序不稳定,按 name/description 二次匹配)。`MenuBarExtraItem` 加 `host` 字段。纯 LSUIElement 托盘应用现在可达。
16
+ - **Agent Skill 三层**:`skills/ucu-mcp/SKILL.md`(精简入口 + YAML frontmatter)+ `references/{tool-reference,workflows,troubleshooting}.md`(深度文档,按需读取)+ `agents/openai.yaml`(Codex 接口元数据)。随 npm 包发布(`files` 字段加 `skills/`)。可通过 `npx skills add ucu-mcp -g -a codex/claude-code` 安装。README 加 "Agent Skill" 节。
17
+ - `UCU_MCP_INSTRUCTIONS` 加 describe_screen 指引(image 不可见时的 fallback)。
18
+ - `OBSERVE_ACTIONS` 加 `describe_screen`(observe 类,不触发 user-activity pause)。
19
+ - 工具数断言 25→26 同步更新(tools-layer / cli-mcp / client-cli-smoke)。
20
+ - +19 测试:describe_screen(OCR/AX 失败聚合、ocrBlocks cap、密码脱敏 AXSecureTextField/token、ocr=false/includeAx=false)+ screenshot describe=true + SystemUIServer 遍历(source 断言 / host 字段 / click 重定位)。
21
+
22
+ ### Fixed
23
+
24
+ - **方向1 OCR 路径修复(两 bug 叠加)**:`ocrNative` candidates 加 4 级 `../../../../native/ocr/ocr-helper`(npm prod 下 screen.js 在 4 级深,原 3 级候选全 MISSING);`ocrJxa` 重写绕开 CGImage 脆弱路径(`VNImageRequestHandler.initWithURLOptions` + `Ref()` + `ObjC.unwrap` + 像素维度 `pixelsWide/pixelsHigh`),消除 `NSNull unrecognized selector` 崩溃。删 `ocrJxa` 死代码 `scaleFactorX`(`buf.readUInt32BE(16)` 在权限缺失时抛 RangeError 绕过 hint)。`window.ts` `resolveNativeHelper` 同源 4 级路径修复。
25
+ - **方向2 press_key 字母/数字键**:`typeText` 的 letterMap/digitMap 提升为模块级 `MAC_LETTER_KEY_CODES`/`MAC_DIGIT_KEY_CODES`;`pressKey` falsy-safe 三元回退(先特殊键 → 字母 `in` 判定 → 数字;用 `in` 防 `'a'`=0 穿透)。`keyboard-tools.ts` zod describe 更新(支持 a-z / 0-9)。
26
+ - **SafetyGuard blocklist 修正(方向2 引入的回归)**:字母 q 可解析后 `cmd+shift+q`(注销)绕过 blocklist(只有 cmd+q)。blocklist 加 `cmd+shift+q` + `cmd+option+q`;`normalizeShortcut` 加修饰键别名归一化(alt→option / ctrl→control / cmd→command)根治 `cmd+alt+esc` 绕过 `cmd+option+esc`。
27
+ - `findMenuBarExtra` JXA 进程名大小写/空格/连字符/下划线容差(`_norm`);`matchMenuBarExtra` 无 selector 时过滤 Apple 菜单;`clickMenuBarExtra` null deref 防护。
28
+
29
+ ### Changed
30
+
31
+ - 工具数 24→26(+`describe_screen`、+`click_menu_bar_extra`)。README/index.ts 注释同步。
32
+ - `focus_app` 托盘回退建立 `windowId:"tray"` 的 activeTarget(pid:0),`validateActiveTarget` 对 tray 直接 return(不查 listWindows)。
33
+
34
+ ## [0.4.3] - 2026-06-14
35
+
36
+ ### Fixed
37
+
38
+ - `registerTool` 回调漏调 `registry.register(name)`,导致启动日志 `Registered tools count:0`。一行修复(行为无变化,24 工具始终正常注册到 MCP server)。
39
+
8
40
  ## [0.4.2] - 2026-06-13
9
41
 
10
42
  ### Security
package/README.md CHANGED
@@ -72,16 +72,17 @@ If you installed globally, you can use the shorter form:
72
72
 
73
73
  ## Tool List
74
74
 
75
- UCU-MCP provides 22 tools across five categories:
75
+ UCU-MCP provides 26 tools across five categories:
76
76
 
77
77
  ### Screen & Window
78
78
 
79
79
  | Tool | Description | Key Parameters |
80
80
  |------|-------------|----------------|
81
- | `screenshot` | Capture screen, window, or region as PNG/JPEG image content | `display?`, `windowId?`, `region?`, `maxWidth?`, `format?` |
81
+ | `screenshot` | Capture screen, window, or region as PNG/JPEG image content; `describe=true` appends a structured text description (OCR + AX) for vision-degraded environments | `display?`, `windowId?`, `region?`, `maxWidth?`, `format?`, `describe?`, `describeOptions?` |
82
+ | `describe_screen` | Structured text description of the screen (OCR blocks + AX tree + foreground window) — the vision-degraded fallback when image content is not visible to the model. Password fields are masked. | `display?`, `ocr?`, `includeAx?`, `axDepth?`, `ocrBlocks?`, `windowId?` |
82
83
  | `list_windows` | List all on-screen windows with IDs, titles, bounds | `includeMinimized?` |
83
84
  | `list_apps` | List visible macOS apps with pid, frontmost state, and window count | — |
84
- | `focus_app` | Select an app/window target context for later AX tools; returns `targetId`, `appName`, `pid`, `windowId`, `title`, and `capturedAt` | `app` |
85
+ | `focus_app` | Select an app/window target context for later AX tools; returns `targetId`, `appName`, `pid`, `windowId`, `title`, and `capturedAt`. Falls back to a tray target for menu-bar-only apps. | `app` |
85
86
  | `get_window_state` | Get accessibility tree of a window, or the prior focus_app target when windowId is omitted | `windowId?`, `depth?`, `includeBounds?` |
86
87
  | `get_screen_size` | Get screen dimensions | `display?` |
87
88
  | `ocr` | Perform OCR on screen or region; returns text with bounding boxes and confidence | `display?`, `region?` |
@@ -112,6 +113,7 @@ UCU-MCP provides 22 tools across five categories:
112
113
  | `click_element` | Click an AX element by its id (from find_element), using the current focus_app target when app is omitted; refetches equivalent elements after UI updates | `elementId`, `app?`, `captureAfter?` |
113
114
  | `set_value` | Set an AX element's value directly without focusing it, using the current focus_app target when app is omitted | `elementId`, `value`, `app?`, `captureAfter?` |
114
115
  | `type_in_element` | Type text into a specific AX text field element; may focus the element and refetches equivalent elements after UI updates | `elementId`, `text`, `app?`, `clearFirst?`, `captureAfter?` |
116
+ | `click_menu_bar_extra` | Click a menu-bar status item (tray icon) — for menu-bar-only apps (e.g. cc-switch) that focus_app cannot target. Finds items in the app's own menu bar or hosted by SystemUIServer. | `app`, `description?`, `name?`, `index?`, `captureAfter?` |
115
117
 
116
118
  ### Runtime & Synchronization
117
119
 
@@ -120,6 +122,8 @@ UCU-MCP provides 22 tools across five categories:
120
122
  | `doctor` | Check platform readiness, permissions, lock-screen state, and client integration hints | — |
121
123
  | `wait` | Wait for UI state to settle after launches, animations, or navigation | `ms` |
122
124
  | `wait_for_element` | Poll the AX tree until a matching element appears | `text?`, `role?`, `app?`, `timeout?`, `timeoutMs?`, `interval?`, `intervalMs?` |
125
+ | `clipboard_read` | Read the current contents of the system clipboard | — |
126
+ | `clipboard_write` | Write text to the system clipboard (text-injection patterns are blocked) | `text`, `captureAfter?` |
123
127
 
124
128
  Action tools accept `captureAfter`, `captureMaxWidth`, and `captureFormat` so an agent can receive a post-action screenshot as a second MCP image content item in the same response instead of spending another round trip on `screenshot`. When `captureAfter` is requested and the action succeeds, the tool returns an `ActionReceipt` (see the Action Receipt section below) with `capture.status: "ok"`. If post-action capture fails, the receipt has `status: "partial"` and `capture.status: "error"` with the error details. If `captureAfter` is omitted, `capture.status` is `"skipped"`.
125
129
 
@@ -415,6 +419,27 @@ ucu-mcp doctor
415
419
 
416
420
  The same readiness report is also available as the MCP `doctor` tool.
417
421
 
422
+ ## Agent Skill
423
+
424
+ UCU-MCP ships an installable **agent skill** that gives Codex, Claude Code, and
425
+ other agent runtimes richer guidance than the embedded MCP `instructions:` field
426
+ alone — structured tool-selection rules, task playbooks, and an error-recovery
427
+ reference. The skill lives at [`skills/ucu-mcp/`](skills/ucu-mcp/SKILL.md) and
428
+ is included in the npm tarball.
429
+
430
+ Install it for your agent runtime with the [`skills` CLI](https://www.npmjs.com/package/skills):
431
+
432
+ ```bash
433
+ # Codex
434
+ npx skills add ucu-mcp -g -a codex --skill ucu-mcp -y
435
+ # Claude Code
436
+ npx skills add ucu-mcp -g -a claude-code --skill ucu-mcp -y
437
+ ```
438
+
439
+ Or reference it directly: the entry point is
440
+ [`skills/ucu-mcp/SKILL.md`](skills/ucu-mcp/SKILL.md), with deeper content in
441
+ `skills/ucu-mcp/references/` (tool reference, workflows, troubleshooting).
442
+
418
443
  ## Safety
419
444
 
420
445
  ### Built-in safety rules
@@ -12,6 +12,7 @@ Pick the right tool sequence for the task:
12
12
  • Fill a form field → find_element (text/role) + type_in_element or set_value. Prefer AX over coordinates.
13
13
  • Click a menu bar item → get_screen_size + click with coordinates (menu bar is not in the AX tree).
14
14
  • Read what's on screen → screenshot; for text not in AX use ocr; for a structured tree use get_window_state.
15
+ • When image content is not visible (relayed/downgraded to URL) → screenshot with describe=true, or describe_screen for a text-only structured view (OCR + AX tree + foreground window).
15
16
  • Switch between apps → list_apps, then focus_app; subsequent tools use the active target context.
16
17
  • Verify an action succeeded → captureAfter=true on action tools, or call screenshot afterwards.
17
18
  • Wait for UI to change → wait_for_element (until: "appear" default; also "disappear" or "value_change").
@@ -56,4 +56,15 @@ export function registerElementTools(registerTool) {
56
56
  await withSafety({ action: "type_in_element", params: { text: params.text, ...safetyCtx }, requiresAccessibility: true, execute: () => getPlatform().typeInElement(params.elementId, params.text, effectiveApp, params.clearFirst) });
57
57
  return actionResponse("type_in_element", { typed: true, elementId: params.elementId, charCount: params.text.length }, { elementId: params.elementId, app: effectiveApp }, params.captureAfter, params.captureFormat, params.captureMaxWidth);
58
58
  });
59
+ registerTool("click_menu_bar_extra", "Click a menu bar status item (tray icon) — for menu-bar-only apps (e.g. cc-switch) that focus_app cannot target. After clicking, the menu opens; use find_element to locate menu items, or screenshot + ocr if the menu's AX tree is opaque.", {
60
+ app: z.string().describe("Target app name"),
61
+ description: z.string().optional().describe("Match menu bar item by description/name substring"),
62
+ name: z.string().optional().describe("Match menu bar item by name/description substring"),
63
+ index: z.number().int().nonnegative().optional().describe("0-based index among matched items (default 0)"),
64
+ ...captureAfterFields,
65
+ }, async (params) => {
66
+ const safetyCtx = await getSafetyContext();
67
+ await withSafety({ action: "click_menu_bar_extra", params: { ...safetyCtx }, requiresAccessibility: true, execute: () => getPlatform().clickMenuBarExtra(params.app, { description: params.description, name: params.name, index: params.index }) });
68
+ return actionResponse("click_menu_bar_extra", { clicked: true, app: params.app }, { app: params.app }, params.captureAfter, params.captureFormat, params.captureMaxWidth);
69
+ });
59
70
  }
@@ -1,7 +1,7 @@
1
1
  /**
2
2
  * Tool registry for UCU-MCP.
3
3
  *
4
- * Registers 24 MCP tools on the server and dispatches each call through
4
+ * Registers 26 MCP tools on the server and dispatches each call through
5
5
  * a shared safety/permission/retry pipeline (`withSafety`).
6
6
  */
7
7
  import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
@@ -1,7 +1,7 @@
1
1
  /**
2
2
  * Tool registry for UCU-MCP.
3
3
  *
4
- * Registers 24 MCP tools on the server and dispatches each call through
4
+ * Registers 26 MCP tools on the server and dispatches each call through
5
5
  * a shared safety/permission/retry pipeline (`withSafety`).
6
6
  */
7
7
  import { createLogger } from "../../util/logger.js";
@@ -14,8 +14,8 @@ export function registerKeyboardTools(registerTool) {
14
14
  return actionResponse("type_text", { typed: true, charCount: params.text.length }, {}, params.captureAfter, params.captureFormat, params.captureMaxWidth);
15
15
  });
16
16
  registerTool("press_key", "Press a keyboard shortcut", {
17
- keys: z.array(z.string()).optional().describe("Keys to press simultaneously"),
18
- key: z.string().optional().describe("Single key to press (alias for keys)"),
17
+ keys: z.array(z.string()).optional().describe("Keys to press simultaneously — special keys (enter/escape/tab/f1-f12/arrows...), single letters a-z, or single digits 0-9"),
18
+ key: z.string().optional().describe("Single key (alias for keys) — special keys, single letter a-z, or single digit 0-9"),
19
19
  modifiers: z.array(z.string()).optional().describe("Modifier keys used with key, such as cmd, shift, alt, or ctrl"),
20
20
  windowId: z.string().optional().describe("UNSUPPORTED: windowId-targeted key events are not implemented"),
21
21
  ...captureAfterFields,
@@ -1,14 +1,111 @@
1
1
  import { z } from "zod";
2
2
  import { checkPermission } from "../../safety/permissions.js";
3
3
  import { UnsupportedParameterError } from "../../util/errors.js";
4
- import { getPlatform, getActiveTarget, withSafety, } from "./helpers.js";
4
+ import { getPlatform, getActiveTarget, withSafety, jsonText, } from "./helpers.js";
5
+ /** Sensitive-field detection regex for AX tree masking (password fields, secrets). */
6
+ const SENSITIVE_NAME_RE = /password|passwd|secret|pincode|pin\b|token|credential/i;
7
+ const SENSITIVE_ROLES = new Set(["AXSecureTextField", "AXPasswordField"]);
8
+ /**
9
+ * Recursively mask values of password / secret fields in an AX subtree.
10
+ * Mutates in place. Covers AXSecureTextField role and heuristic name matches.
11
+ */
12
+ function maskSensitiveFields(el) {
13
+ if (!el)
14
+ return;
15
+ const isSensitive = SENSITIVE_ROLES.has(el.role) || SENSITIVE_NAME_RE.test(el.name || "") || SENSITIVE_NAME_RE.test(el.description || "");
16
+ if (isSensitive && el.value)
17
+ el.value = "[REDACTED]";
18
+ if (el.children)
19
+ for (const child of el.children)
20
+ maskSensitiveFields(child);
21
+ }
22
+ /**
23
+ * Build a structured ScreenDescription — each source (screen / foreground / OCR / AX)
24
+ * is collected independently inside try/catch. Failures go to `errors` and set the
25
+ * corresponding status; the function never throws. This is the text fallback for
26
+ * environments where image content blocks are downgraded to URLs.
27
+ */
28
+ async function buildScreenDescription(opts) {
29
+ const platform = getPlatform();
30
+ const errors = [];
31
+ const capturedAt = new Date().toISOString();
32
+ // screen — sync, almost never fails
33
+ let screen;
34
+ try {
35
+ screen = platform.getScreenSize(opts.display);
36
+ }
37
+ catch (e) {
38
+ screen = { width: 0, height: 0, scaleFactor: 1, estimated: true };
39
+ errors.push({ source: "screen", message: `getScreenSize failed: ${e.message}` });
40
+ }
41
+ // foreground window — listApps() isFrontmost → listWindows() filter by processName + isOnScreen
42
+ let foregroundWindow;
43
+ try {
44
+ if (platform.listApps) {
45
+ const apps = await platform.listApps();
46
+ const front = apps.find((a) => a.isFrontmost);
47
+ if (front) {
48
+ const wins = await platform.listWindows(true);
49
+ foregroundWindow = wins.find((w) => w.processName === front.name && w.isOnScreen);
50
+ }
51
+ }
52
+ }
53
+ catch (e) {
54
+ errors.push({ source: "foreground", message: `foreground window resolution failed: ${e.message}` });
55
+ }
56
+ // OCR — cap blocks to ocrBlocks via slice
57
+ let ocr;
58
+ if (opts.runOcr) {
59
+ try {
60
+ const result = await platform.ocr(opts.display);
61
+ ocr = {
62
+ blocks: result.elements.slice(0, opts.ocrBlocks),
63
+ fullText: result.fullText,
64
+ status: "ok",
65
+ };
66
+ }
67
+ catch (e) {
68
+ ocr = { blocks: [], fullText: "", status: "failed" };
69
+ errors.push({ source: "ocr", message: `ocr failed: ${e.message}` });
70
+ }
71
+ }
72
+ else {
73
+ ocr = { blocks: [], fullText: "", status: "skipped" };
74
+ }
75
+ // AX — getWindowState with depth cap, password masking applied
76
+ let ax;
77
+ if (opts.includeAx) {
78
+ const effectiveWindowId = opts.windowId ?? getActiveTarget()?.windowId;
79
+ try {
80
+ const cappedDepth = Math.min(opts.axDepth, 10);
81
+ const state = await platform.getWindowState(effectiveWindowId, cappedDepth, true);
82
+ maskSensitiveFields(state.tree);
83
+ maskSensitiveFields(state.focusedElement);
84
+ ax = { elements: state.tree, status: "ok", windowId: effectiveWindowId };
85
+ }
86
+ catch (e) {
87
+ ax = { status: "failed", windowId: effectiveWindowId };
88
+ errors.push({ source: "ax", message: `getWindowState failed: ${e.message}` });
89
+ }
90
+ }
91
+ else {
92
+ ax = { status: "skipped" };
93
+ }
94
+ return { capturedAt, screen, foregroundWindow, ocr, ax, errors };
95
+ }
5
96
  export function registerScreenTools(registerTool) {
6
- registerTool("screenshot", "Capture a screenshot of the entire screen or a region", {
97
+ registerTool("screenshot", "Capture a screenshot of the entire screen or a region. Set describe=true to also append a structured text description (OCR + AX tree) — useful when image content blocks may not be visible to the model.", {
7
98
  display: z.number().optional().describe("Display index (default 0)"),
8
99
  windowId: z.string().optional().describe("Window ID from list_windows; when set, captures that window"),
9
100
  region: z.object({ x: z.number(), y: z.number(), width: z.number(), height: z.number() }).optional().describe("Region to capture"),
10
101
  format: z.enum(["png", "jpeg"]).default("png").describe("Image format"),
11
102
  maxWidth: z.number().default(1280).describe("Maximum output width in pixels. Aspect ratio is preserved."),
103
+ describe: z.boolean().default(false).describe("When true, append a text content block with a structured screen description (OCR + AX tree) after the image"),
104
+ describeOptions: z.object({
105
+ axDepth: z.number().int().positive().default(3).describe("AX tree depth (capped at 10)"),
106
+ ocrBlocks: z.number().int().positive().default(50).describe("Max OCR blocks to include"),
107
+ includeAx: z.boolean().default(true).describe("Include the AX tree in the description"),
108
+ }).optional().describe("Options for the appended description (only used when describe=true)"),
12
109
  }, async (params) => {
13
110
  if (params.windowId && params.region)
14
111
  throw new UnsupportedParameterError("screenshot windowId cannot be combined with region");
@@ -23,7 +120,20 @@ export function registerScreenTools(registerTool) {
23
120
  : Promise.reject(new UnsupportedParameterError("window screenshots are not implemented on this platform"))
24
121
  : getPlatform().screenshot(params.display, params.region, options),
25
122
  });
26
- return { content: [{ type: "image", data: buf.toString("base64"), mimeType: `image/${params.format}` }] };
123
+ const content = [{ type: "image", data: buf.toString("base64"), mimeType: `image/${params.format}` }];
124
+ if (params.describe) {
125
+ const opts = params.describeOptions ?? { axDepth: 3, ocrBlocks: 50, includeAx: true };
126
+ const desc = await buildScreenDescription({
127
+ display: params.display,
128
+ runOcr: true,
129
+ includeAx: opts.includeAx ?? true,
130
+ axDepth: opts.axDepth ?? 3,
131
+ ocrBlocks: opts.ocrBlocks ?? 50,
132
+ windowId: params.windowId,
133
+ });
134
+ content.push(jsonText(desc));
135
+ }
136
+ return { content };
27
137
  });
28
138
  registerTool("list_windows", "List all visible windows on screen", {
29
139
  includeMinimized: z.boolean().optional().describe("Include minimized windows"),
@@ -66,4 +176,30 @@ export function registerScreenTools(registerTool) {
66
176
  const result = await withSafety({ action: "ocr", params: {}, requiresScreenRecording: true, execute: () => getPlatform().ocr(params.display, params.region) });
67
177
  return { content: [{ type: "text", text: JSON.stringify(result, null, 2) }] };
68
178
  });
179
+ registerTool("describe_screen", "Get a structured text description of the screen (OCR blocks + AX tree + foreground window). Use this when image content blocks are not visible to the model (e.g. relayed/downgraded to URLs) or when you need a machine-readable screen layout. With ocr=false it does not require Screen Recording. Sensitive fields (passwords) are masked.", {
180
+ display: z.number().optional().describe("Display index (default 0)"),
181
+ ocr: z.boolean().default(true).describe("Run OCR and include text blocks (requires Screen Recording)"),
182
+ includeAx: z.boolean().default(true).describe("Include the AX tree of the foreground/active window"),
183
+ axDepth: z.number().int().positive().default(3).describe("AX tree depth (capped at 10)"),
184
+ ocrBlocks: z.number().int().positive().default(50).describe("Max OCR blocks to include"),
185
+ windowId: z.string().optional().describe("Window ID for AX traversal (defaults to active target's window)"),
186
+ }, async (params) => {
187
+ const desc = await withSafety({
188
+ action: "describe_screen",
189
+ params,
190
+ requiresScreenRecording: params.ocr,
191
+ requiresAccessibility: params.includeAx,
192
+ // describe_screen never throws from inner source failures (they go to errors[]),
193
+ // but withSafety still needs to enforce rate-limit / lock / dry-run semantics around it.
194
+ execute: () => buildScreenDescription({
195
+ display: params.display,
196
+ runOcr: params.ocr,
197
+ includeAx: params.includeAx,
198
+ axDepth: params.axDepth,
199
+ ocrBlocks: params.ocrBlocks,
200
+ windowId: params.windowId,
201
+ }),
202
+ });
203
+ return { content: [jsonText(desc)] };
204
+ });
69
205
  }
@@ -126,6 +126,31 @@ export interface WindowState {
126
126
  focusedElement?: ElementInfo;
127
127
  tree?: ElementInfo;
128
128
  }
129
+ /**
130
+ * Structured text description of the screen — a fallback for environments where
131
+ * image content blocks are downgraded to URLs (so the model cannot see screenshots).
132
+ * Each source (OCR / AX / foreground) is collected independently; failures are
133
+ * aggregated in `errors` rather than thrown.
134
+ */
135
+ export interface ScreenDescription {
136
+ capturedAt: string;
137
+ screen: ScreenSize;
138
+ foregroundWindow?: WindowInfo;
139
+ ocr: {
140
+ blocks: OcrElement[];
141
+ fullText: string;
142
+ status: "ok" | "skipped" | "failed";
143
+ };
144
+ ax: {
145
+ elements?: ElementInfo;
146
+ status: "ok" | "skipped" | "failed";
147
+ windowId?: string;
148
+ };
149
+ errors: Array<{
150
+ source: "ocr" | "ax" | "foreground" | "screen";
151
+ message: string;
152
+ }>;
153
+ }
129
154
  export interface Platform {
130
155
  screenshot(display?: number, region?: ScreenRegion, options?: ScreenshotOptions): Promise<Buffer>;
131
156
  screenshotWindow?(windowId: string, options?: ScreenshotOptions): Promise<Buffer>;
@@ -147,6 +172,12 @@ export interface Platform {
147
172
  clickElement(elementId: string, app?: string): Promise<void>;
148
173
  typeInElement(elementId: string, text: string, app?: string, clearFirst?: boolean): Promise<void>;
149
174
  setElementValue?(elementId: string, value: string, app?: string): Promise<void>;
175
+ findMenuBarExtra?(app: string): Promise<unknown[]>;
176
+ clickMenuBarExtra?(app: string, selector?: {
177
+ description?: string;
178
+ name?: string;
179
+ index?: number;
180
+ }): Promise<void>;
150
181
  isScreenLocked?(): boolean;
151
182
  saveFocus?(): Promise<void>;
152
183
  restoreFocus?(): Promise<void>;
@@ -5,7 +5,7 @@ import { screenshot, screenshotWindow, getScreenSize, isScreenLocked, ocr } from
5
5
  import { listApps, focusApp, getActiveBrowserContext, listWindows } from "./window.js";
6
6
  import { getWindowState, findElement } from "./ax-tree.js";
7
7
  import { click, move, drag, scroll, getCursorPosition, type as typeMethod, key } from "./input.js";
8
- import { clickElement, typeInElement, setElementValue } from "./element.js";
8
+ import { clickElement, typeInElement, setElementValue, findMenuBarExtra, clickMenuBarExtra } from "./element.js";
9
9
  import { readClipboard, writeClipboard } from "./clipboard.js";
10
10
  export type { MacOSPlatformOptions } from "./helpers.js";
11
11
  export declare class MacOSPlatform implements Platform {
@@ -52,6 +52,8 @@ export declare class MacOSPlatform implements Platform {
52
52
  clickElement: typeof clickElement;
53
53
  typeInElement: typeof typeInElement;
54
54
  setElementValue: typeof setElementValue;
55
+ findMenuBarExtra: typeof findMenuBarExtra;
56
+ clickMenuBarExtra: typeof clickMenuBarExtra;
55
57
  readClipboard: typeof readClipboard;
56
58
  writeClipboard: typeof writeClipboard;
57
59
  }
@@ -4,7 +4,7 @@ import { screenshot, screenshotWindow, getScreenSize, isScreenLocked, ocr } from
4
4
  import { listApps, focusApp, getActiveBrowserContext, listWindows } from "./window.js";
5
5
  import { getWindowState, findElement } from "./ax-tree.js";
6
6
  import { click, move, drag, scroll, getCursorPosition, type as typeMethod, key } from "./input.js";
7
- import { clickElement, typeInElement, setElementValue } from "./element.js";
7
+ import { clickElement, typeInElement, setElementValue, findMenuBarExtra, clickMenuBarExtra } from "./element.js";
8
8
  import { readClipboard, writeClipboard } from "./clipboard.js";
9
9
  export class MacOSPlatform {
10
10
  _nativeHelperPaths;
@@ -53,6 +53,10 @@ export class MacOSPlatform {
53
53
  async validateActiveTarget() {
54
54
  if (!this.activeTarget?.windowId)
55
55
  return;
56
+ // 托盘 target(focus_app 找不到窗口时回退到 status item)没有真实窗口,
57
+ // 恒有效直到模型显式 focus_app 其他应用——不查 listWindows(查了必然失配)。
58
+ if (this.activeTarget.windowId === "tray")
59
+ return;
56
60
  this.windowCache = undefined;
57
61
  const windows = await this.listWindows(true);
58
62
  const match = windows.find(w => w.id === this.activeTarget.windowId);
@@ -87,6 +91,8 @@ export class MacOSPlatform {
87
91
  clickElement = clickElement;
88
92
  typeInElement = typeInElement;
89
93
  setElementValue = setElementValue;
94
+ findMenuBarExtra = findMenuBarExtra;
95
+ clickMenuBarExtra = clickMenuBarExtra;
90
96
  readClipboard = readClipboard;
91
97
  writeClipboard = writeClipboard;
92
98
  }
@@ -2,3 +2,23 @@ import type { MacOSPlatform } from "./base.js";
2
2
  export declare function clickElement(this: MacOSPlatform, elementId: string, app?: string): Promise<void>;
3
3
  export declare function typeInElement(this: MacOSPlatform, elementId: string, text: string, app?: string, clearFirst?: boolean): Promise<void>;
4
4
  export declare function setElementValue(this: MacOSPlatform, elementId: string, value: string, app?: string): Promise<void>;
5
+ export interface MenuBarExtraItem {
6
+ menuBar: number;
7
+ index: number;
8
+ name: string;
9
+ description: string;
10
+ x: number;
11
+ y: number;
12
+ width: number;
13
+ height: number;
14
+ /** Which process hosts this status item. "self" = app's own menu bar; "systemuiserver" = third-party tray hosted by SystemUIServer. */
15
+ host: "self" | "systemuiserver";
16
+ }
17
+ export interface MenuBarExtraSelector {
18
+ description?: string;
19
+ name?: string;
20
+ index?: number;
21
+ }
22
+ export declare function findMenuBarExtra(this: MacOSPlatform, app: string): Promise<MenuBarExtraItem[]>;
23
+ export declare function matchMenuBarExtra(items: MenuBarExtraItem[], selector: MenuBarExtraSelector): MenuBarExtraItem | undefined;
24
+ export declare function clickMenuBarExtra(this: MacOSPlatform, app: string, selector?: MenuBarExtraSelector): Promise<void>;