@amaster.ai/pi-computer-use 0.1.2-beta.5 → 0.1.2-beta.7
- package/README.md +44 -30
- package/bin/darwin-arm64/.version +2 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/CodeResources +0 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Info.plist +32 -0
- package/bin/darwin-arm64/{cua-driver → CuaDriver.app/Contents/MacOS/cua-driver} +0 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/README.md +140 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/RECORDING.md +113 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/SKILL.md +887 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/TESTS.md +232 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/WEB_APPS.md +471 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/_CodeSignature/CodeResources +172 -0
- package/bin/darwin-x64/.version +2 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/CodeResources +0 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Info.plist +32 -0
- package/bin/darwin-x64/{cua-driver → CuaDriver.app/Contents/MacOS/cua-driver} +0 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/README.md +140 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/RECORDING.md +113 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/SKILL.md +887 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/TESTS.md +232 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/WEB_APPS.md +471 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/_CodeSignature/CodeResources +172 -0
- package/bin/linux-x64/.version +2 -0
- package/bin/linux-x64/cua-driver +0 -0
- package/bin/win32-arm64/.version +2 -0
- package/bin/win32-arm64/cua-driver-uia.exe +0 -0
- package/bin/win32-arm64/cua-driver.exe +0 -0
- package/bin/win32-x64/.version +2 -0
- package/bin/win32-x64/cua-driver-uia.exe +0 -0
- package/bin/win32-x64/cua-driver.exe +0 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +422 -38
- package/dist/index.js.map +1 -1
- package/dist/mcp-client.d.ts.map +1 -1
- package/dist/mcp-client.js +9 -2
- package/dist/mcp-client.js.map +1 -1
- package/package.json +2 -2
@@ -0,0 +1,887 @@
---
name: cua-driver
description: Drive a native macOS app via the cua-driver CLI (default) or MCP server — snapshot its AX tree, click/type/scroll by element_index, verify via re-snapshot. Use when the user asks you to operate, drive, automate, or perform a GUI task in a real macOS application on the host (e.g. "open a file in TextEdit", "navigate to /Applications in Finder", "click the Save button in Numbers").
---

# cua-driver

Orchestrates macOS app automation via `cua-driver`. Whenever a user
asks to drive a native macOS app, follow the loop in this skill rather
than calling tools ad-hoc — the snapshot-before-action invariant is not
optional and silently breaks if you skip it.
## The no-foreground contract — read this first

**The user's frontmost app MUST NOT change.** This is the whole
reason cua-driver exists. Users pay for the right to keep typing in
their editor while an agent drives another app in the background.
Violate this rule and every other nice property the driver gives
you (no cursor warp, no Space switch, no window raise) stops
mattering — you just shipped the Accessibility Inspector with extra
steps.

Before running any shell command, ask: **"does this raise,
activate, foreground, or make-key any app?"** If yes, don't run it.
Every one of the commands below activates the target on macOS and
is therefore forbidden unless the user **explicitly** asked for
frontmost state:
- **Every form of the `open` CLI — `open -a <App>`, `open -b
  <bundle-id>`, `open <file>`, `open <path-to-App.app>`, `open
  <url>` — always activates.** macOS routes all forms through
  LaunchServices, which unhides and foregrounds the target
  regardless of whether you passed an app name, a bundle id, a
  document, a URL, or the bundle path itself. The activation
  happens even when the only intent was "start the process."
  **Never use `open` for any app launch.** This includes launching
  a just-built .app from a local build dir (e.g. `open
  build/Build/Products/Debug/MyApp.app`) — resolve the
  `CFBundleIdentifier` from `Info.plist` and use `launch_app`
  with that id. See the `FocusRestoreGuard` note below for why
  `launch_app` is safe even when the app internally calls
  `NSApp.activate`.
- `osascript -e 'tell application "X" to activate'` —
  activates by design. Same for `... to open <file>`,
  `... to launch`, and anything with `activate` in the tell block.
- `osascript -e 'tell application "System Events" to ... frontmost'`
  in a mutating form (setting `frontmost` rather than reading it).
- AppleScript files that invoke `activate`, `launch`, or `open`
  against the target app.
- `cliclick` (moves the user's real cursor to the target coords
  before clicking — a focus-steal-equivalent even if the app's
  window state is unchanged).
- `CGEventPost` with `cghidEventTap` targeting a coordinate over
  a different app's window (warps the cursor, possibly activates
  on hit).
- `AppleScriptTask`, `NSAppleScript`, `Process` wrapping `osascript`
  that contains any of the above.
- `NSRunningApplication.activate(options:)` called from your own
  helper binary — same class.
- Dock clicks and any `open` invocation (see the first bullet —
  every form of `open` goes through LaunchServices which
  activates, full stop).
- **Keyboard shortcuts that semantically mean "focus here" —
  most notably Chrome / Safari / Arc's `⌘L` (focus omnibox) and
  Finder's `⌘⇧G` (Go to Folder).** These aren't pure key events —
  the receiving app interprets "user wants to type here" as
  activation intent and raises its window to be key. Even when
  delivered to a backgrounded pid via `hotkey`, the downstream app
  pulls focus. **For omnibox navigation specifically**, the correct
  path is `launch_app({bundle_id: "com.google.Chrome", urls:
  ["https://…"]})` — no omnibox dance, no `⌘L`, no focus-steal. Do
  NOT try `set_value` on the omnibox: Chrome's commit logic requires
  a "user-typed" signal that neither an AX value write nor
  `CGEvent.postToPid` keystrokes supply from a backgrounded pid —
  the URL lands in the field but Return fires as a no-op. See
  `WEB_APPS.md` → "Navigate to a URL" for the full pattern. The
  general principle: a shortcut that says "put my cursor inside this
  app" is a focus-steal; a shortcut that says "do this thing" (copy,
  save, quit) is fine.
- **Tab-switching shortcuts in browsers (`⌘1..⌘9`, `⌘]`, `⌘[`,
  `⌘⇧[`, `⌘⇧]`) are visibly disruptive even when delivered to a
  backgrounded pid.** The app's key handler processes the shortcut,
  the window re-renders the new tab's content, the user sees their
  tabs flipping. There is no AX-only workaround: page content (HTML,
  form state, `AXWebArea`) populates only for the focused tab;
  inspecting a background tab requires activating it, which is the
  visible flip. Observed with Dia; the same mechanic applies to every
  Chromium-family browser (Chrome, Arc, Brave, Edge).
**Prefer the windows-over-tabs pattern**: for each URL you need to
drive backgrounded, use `launch_app({bundle_id, urls: [url]})` —
browsers open each URL in a new **window**. Each window has its own
`window_id`, its own AX tree, and can be inspected / interacted with
via `element_index` without activating or switching anything. Tabs
are a UX grouping for humans; cua-driver workflows should default to
windows. See `WEB_APPS.md` → "Tabs vs windows" for the full pattern.

Tab-title enumeration (read-only) IS safe — walk a window's toolbar
AX tree for `AXTab` / `AXRadioButton` children and read their
`AXTitle`s. Tab switching (activating one) is not.

Reading frontmost state is fine (`osascript -e 'tell application
"System Events" to get name of first application process whose
frontmost is true'`). Mutating it is not.
**Corollary — the AXMenuBar rule.** `AXMenuBarItem` + AXPick
dispatches at the AX layer regardless of which app is frontmost,
but macOS's on-screen menu bar always belongs to the frontmost
app. If you drive a *backgrounded* app's menu bar, the AX call
succeeds but the viewer sees the dispatch rendered over the
*frontmost* app's menu bar — confusing in any observed session and
routinely a silent no-op too, because action menu items go
`DISABLED` when their owning app isn't the key window. **So: only
use menu-bar navigation when the target is already frontmost.** For
backgrounded targets, read state via in-window AX (window title,
toolbar `AXStaticText`) and dispatch via in-window `element_index`
or pixel clicks — both paths are frontmost-insensitive. Full
rationale in "Navigating native menu bars" below.

**"Open \<app\>" in user speech means launch, not activate.**
`cua-driver launch_app` is the one correct path for process
startup — it's idempotent (no-op on a running app), returns the
pid, and has an internal `FocusRestoreGuard` that catches
`NSApp.activate(ignoringOtherApps:)` calls the target makes during
`application(_:open:)` and restores the frontmost app to what it
was before the launch. That guard is why `launch_app` with `urls`
(e.g. `{"bundle_id": "com.colliderli.iina", "urls": ["~/video.mp4"]}`)
is safe even for apps that normally foreground on media-load
(Chrome, Electron, media players).
## Defaults — always prefer cua-driver over shell shims

**Default transport is the `cua-driver` CLI** — `Bash` shelling out
to `cua-driver <tool-name> '<JSON-args>'`. MCP tools (prefix
`mcp__cua-driver__*`) only when the user explicitly asks for them.
CLI wins because it picks up rebuilds instantly, failures are
easier to diagnose, and there's no per-tool schema-load overhead.

Every reference to `click(...)`, `get_window_state(...)` etc. in this
skill means `cua-driver click '{...}'` — translate to MCP form only
when MCP is requested.
### Claude Code computer-use compatibility mode

For normal Claude Code use, keep the default CLI or `cua-driver` MCP server path above. If the user explicitly wants Claude Code's vision/computer-use-style flow, they can register:

```bash
claude mcp add --transport stdio cua-computer-use -- cua-driver mcp --claude-code-computer-use-compat
```

Observation: Claude Code vision flows appear to treat a screenshot MCP tool as the image-grounding anchor. This compatibility mode keeps the normal CuaDriver tools and changes only `screenshot`. The compatibility `screenshot` requires `pid` and `window_id`, captures only that target window, and returns the window-local pixel coordinate frame. Start with `launch_app` or `list_windows`, then call `screenshot({pid, window_id})`; do not assume desktop coordinates or a full-screen capture.

Use MCP for this Claude Code vision/computer-use-style path. Do not shell out to `cua-driver screenshot` as a substitute: CLI screenshots still work as CuaDriver calls, but they do not expose the `mcp__cua-computer-use__screenshot` tool name that Claude Code appears to use as the image-grounding cue.

Intent → tool mapping. If you find yourself reaching for the right
column, something has gone wrong — re-read "The no-foreground
contract" above:

| Intent | Use | Don't use |
|---|---|---|
| Open / launch an app | `launch_app({bundle_id})` or `launch_app({bundle_id, urls:[...]})` | `open -a`, `osascript 'tell app … to launch/activate/open'` |
| Find a pid | `list_apps` or `launch_app`'s return | `pgrep`, `ps`, `osascript frontmost` |
| Enumerate an app's windows | `list_windows({pid})` — or read the `windows` array `launch_app` already returns | `osascript 'every window of app …'` |
| Click / type / scroll / keys | `click`, `type_text`, `scroll`, `press_key`, `hotkey` | `osascript`, `cliclick`, raw `CGEvent`, `open <url>` |
| Drag / drag-and-drop / marquee select | `drag({pid, from_x, from_y, to_x, to_y})` (pixel-only — macOS AX has no semantic drag) | `cliclick dd:`, `osascript drag` |
| Screenshot | `screenshot` or the PNG in `get_window_state` | `screencapture` |
| Quit an app | ask the user first, then `hotkey({pid, keys:["cmd","q"]})` | `kill`, `killall`, `pkill` |
| Hand a file/URL to an app | `launch_app({bundle_id, urls:[<path>]})` | `open -a <App> <path>`, `open <url>` |
### The narrow carve-out

The **only** legitimate use of `osascript -e 'tell app X to
activate'` is when the user **explicitly** asked for frontmost
state ("bring Chrome to the front", "make it frontmost", "I want
to see X"). Reaching for it because a tool call returned something
confusing is wrong — that's the skill's classic foot-in-the-door
failure mode and it steals focus every time.

When a cua-driver call surprises you, diagnose cua-driver first:

- **Tiny screenshot / empty `tree_markdown`?** Check
  `cua-driver get_config` → `capture_mode`. Default `"som"` returns
  both the AX tree and screenshot. `"vision"` omits the AX tree
  (PNG only), `"ax"` omits the PNG. If a snapshot lacks a tree,
  `capture_mode` is almost certainly `"vision"` — either reason
  purely from the PNG or flip to `"som"` / `"ax"` via `set_config`.
- **`has_screenshot: false`?** The window capture failed (transient
  race against a close, or the window has no backing store yet).
  Re-snapshot; if persistent, pick a different `window_id` via
  `list_windows`.
- **`Invalid element_index` / `No cached AX state`?** You either
  skipped `get_window_state` this turn or passed a different
  `window_id` than the one the snapshot cached against. The cache
  is keyed on `(pid, window_id)` — indices don't carry across
  windows of the same app. Re-snapshot with the same `window_id`
  you're about to click in.
- **Sparse Chromium AX tree?** Retry `get_window_state` once — the
  tree populates on the second call.

Only after those are ruled out, and only if the user's action
genuinely needs frontmost state, fall through to the activate
fallback. Always name the focus steal in your response ("I'll
briefly bring Chrome to the front because …").
### Self-check pattern

Before every `Bash` call whose command line touches any macOS app
(launching, opening, clicking, typing, scripting, screenshotting),
run the self-check:

1. **Does this command foreground the target?** If yes — stop and
   translate to the cua-driver equivalent from the mapping table.
2. **Does this command move the user's real cursor?** (`cliclick`,
   any `CGEventPost` at `cghidEventTap` over another app's window).
   If yes — stop; use `click({pid, x, y})` which routes per-pid
   via SkyLight and never warps the cursor.
3. **Does this command bypass cua-driver entirely?** (`osascript`
   mutating GUI state, AppleScript files, external helpers.) If
   yes — stop; find the cua-driver tool that does the intent.

If all three are "no," the command is safe. If you can't answer,
default to stop and ask rather than proceed. A single `open -a`
run by accident kills the demo, the trust, and the user's in-flight
editor state.
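
The first two self-check questions can be approximated with a pattern match on the command line before it runs. The sketch below is illustrative only: `is_focus_stealing` is a hypothetical helper, its pattern list is a hand-picked subset of the forbidden commands named in this skill, and it exempts the `open -n -g -a CuaDriver` daemon form that this skill itself recommends.

```bash
# Hypothetical pre-flight filter; not part of cua-driver.
# Returns 0 when the command matches a forbidden pattern, 1 otherwise.
is_focus_stealing() {
  cmd=$1
  case "$cmd" in
    "open -n -g -a CuaDriver "*) return 1 ;;  # documented daemon start, allowed
    open|"open "*)               return 0 ;;  # every other form of `open` activates
    cliclick|"cliclick "*)       return 0 ;;  # moves the user's real cursor
    osascript*activate*)         return 0 ;;  # activates by design
    osascript*"to launch"*)      return 0 ;;
    osascript*"to open"*)        return 0 ;;
    osascript*"set frontmost"*)  return 0 ;;  # mutating frontmost state
    *)                           return 1 ;;  # no known forbidden pattern
  esac
}

# Example: refuse to run a flagged command.
candidate='open -a Safari'
if is_focus_stealing "$candidate"; then
  echo "blocked: $candidate"
fi
```

A "not flagged" result only means none of the listed patterns matched; the third question (does the command bypass cua-driver entirely?) still needs a human or agent judgment.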
## Prerequisites — check before starting

1. `cua-driver` is on `$PATH` (`which cua-driver`). If not, point the
   user at `scripts/install-local.sh` and stop.
2. Run `cua-driver check_permissions` (with the daemon up — see step 3).
   The default behavior also raises the system permission dialogs for
   any missing grants, so the user can grant on the spot. If either
   grant still reads `false` after that (user dismissed the dialog),
   tell them to open System Settings → Privacy & Security and grant
   Accessibility and Screen Recording to `CuaDriver.app`, then stop.
   Pass `'{"prompt":false}'` for a purely read-only status check that
   won't steal focus.
3. Start the daemon with `open -n -g -a CuaDriver --args serve` (the
   recommended form — goes through LaunchServices so TCC attributes
   the process to CuaDriver.app). `cua-driver serve &` also works;
   the CLI auto-relaunches through `open -n -g -a CuaDriver` when it
   detects a wrong-TCC context (any IDE-spawned shell: Claude Code,
   Cursor, VS Code, Conductor). Verify with `cua-driver status`.
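
Prerequisite 1 can be scripted as a small guard. This is a minimal sketch: `on_path` and `check_driver` are hypothetical helper names, and the binary name is a parameter (defaulting to `cua-driver`) so the check can be exercised on a machine without the driver installed.

```bash
# Succeeds when the named command is reachable on $PATH.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether the driver binary is reachable; fail with the
# remediation hint from step 1 when it is not.
check_driver() {
  if on_path "${1:-cua-driver}"; then
    echo "ok"
  else
    echo "missing: point the user at scripts/install-local.sh and stop"
    return 1
  fi
}
```

Steps 2 and 3 are left out on purpose: they talk to the daemon and to macOS permission state, which this sketch does not try to model.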
## Using cua-driver from the shell

Tool names are `snake_case`, management subcommands are
`kebab-case` — no ambiguity. Tools are invoked as `cua-driver
<tool-name> '<JSON-args>'`. Management subcommands:

- `open -n -g -a CuaDriver --args serve` — start persistent daemon
  (**required** for `element_index` workflows; without it each CLI
  invocation spawns a fresh process and the per-pid element cache
  dies between calls). `cua-driver serve &` also works — the CLI
  auto-relaunches via `open` when the shell's TCC context is wrong.
  Pass `--no-relaunch` / `CUA_DRIVER_NO_RELAUNCH=1` to opt out.
- `cua-driver stop` / `status`
- `cua-driver list-tools`, `describe <tool>`
- `cua-driver recording start|stop|status` — see `RECORDING.md`

Canonical multi-step workflow:

```
open -n -g -a CuaDriver --args serve
cua-driver launch_app '{"bundle_id":"com.apple.calculator"}'
# → {pid: 844, windows: [{window_id: 10725, ...}]}
cua-driver get_window_state '{"pid":844,"window_id":10725}'
cua-driver click '{"pid":844,"window_id":10725,"element_index":14}'
cua-driver stop
```
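
Because every tool call has the same `cua-driver <tool-name> '<JSON-args>'` shape, the JSON arguments can be built by a helper instead of being typed by hand. The sketch below is an assumption-laden convenience wrapper, not part of cua-driver: `cua`, `target_args`, and the `DRY_RUN` variable are hypothetical names, and the pid, window id, and element index are the example values from the workflow above.

```bash
# Print (DRY_RUN=1) or run a tool call in the documented CLI form.
cua() {
  tool=$1
  args=$2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf "cua-driver %s '%s'\n" "$tool" "$args"
  else
    cua-driver "$tool" "$args"
  fi
}

# Build the (pid, window_id) arguments shared by get_window_state and
# click; an optional third argument adds element_index.
target_args() {
  if [ -n "${3:-}" ]; then
    printf '{"pid":%s,"window_id":%s,"element_index":%s}' "$1" "$2" "$3"
  else
    printf '{"pid":%s,"window_id":%s}' "$1" "$2"
  fi
}

# Dry-run rendering of the snapshot and click steps.
DRY_RUN=1
cua get_window_state "$(target_args 844 10725)"
cua click "$(target_args 844 10725 14)"
```

With `DRY_RUN=1` nothing is executed; the two calls only print the command lines that would run, which makes the pid and `window_id` pairing easy to review before acting.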
## Agent cursor overlay

Visual cursor overlay for demos and screen recordings. Default:
enabled. Toggle with `cua-driver set_agent_cursor_enabled
'{"enabled":true|false}'`. A triangle pointer Bezier-glides to each
click target, ring-ripples on landing, idle-hides after ~1.5s.
Motion knobs: `set_agent_cursor_motion` takes any subset of
`start_handle`, `end_handle`, `arc_size`, `arc_flow`, `spring` —
tuneable at runtime, persisted to config.

Requires an AppKit runloop, which `cua-driver serve` / `mcp`
bootstraps. One-shot CLI invocations skip the overlay entirely.
## The core invariant — snapshot before AND after every action

**Every action MUST be bracketed by `get_window_state(pid, window_id)`**:

- **Before** — the pre-action snapshot resolves the `element_index`
  you're about to use. Indices from previous turns are stale; the
  server replaces the element index map on every snapshot, keyed
  on `(pid, window_id)`. Indices from turn N don't resolve in turn
  N+1, and indices from window A don't resolve against window B of
  the same app. Skip this and element-indexed actions fail with
  `No cached AX state`.
- **After** — the post-action snapshot verifies the action actually
  landed. Without it you can't tell a silent no-op from a real
  effect. The AX tree change (new value, new window, disappeared
  menu, disabled button, etc.) is your evidence that the action
  fired. If nothing changed, the action probably failed silently —
  say so, don't assume success.

This applies to pixel clicks too — re-snapshot after to confirm the
click landed on the intended target.
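
The "after" half of the bracket amounts to comparing two snapshots. As a rough sketch, assuming the caller has already extracted the `tree_markdown` text from the pre-action and post-action responses, a hypothetical `snapshot_verdict` helper could report whether anything changed:

```bash
# Compare tree_markdown captured before and after an action.
# Prints "unchanged" for identical trees and "changed" otherwise.
snapshot_verdict() {
  before=$1
  after=$2
  if [ "$before" = "$after" ]; then
    echo "unchanged"   # probable silent no-op: report it, don't assume success
  else
    echo "changed"     # evidence that the action had some effect
  fi
}
```

A "changed" verdict is necessary but not sufficient: the tree still has to be read to confirm that the change is the intended one rather than an unrelated update.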
### Why window selection is the caller's job now

`get_app_state` used to pick a window for you via a max-area heuristic
that returned the wrong surface on apps with large off-screen utility
panels. Concrete reproducer: IINA's OpenSubtitles helper (600×432
off-screen) out-area'd the visible 320×240 player window, so
`get_app_state(pid)` screenshot'd the invisible panel and clicks landed
there silently. The new `get_window_state(pid, window_id)` makes the
caller name the window explicitly — the driver validates that the
window belongs to the pid and is on the current Space, then snapshots
exactly what was asked for. Enumerate candidates via `list_windows` or
read the `windows` array `launch_app` already returns.
## Behavior matrix

Two orthogonal axes shape what the agent can do.

**capture_mode → addressing mode**

| `capture_mode` | `get_window_state` returns | Use for actions |
|---|---|---|
| **`som`** (default) | tree + screenshot | `element_index` preferred; pixel fallback |
| **`ax`** | tree only (no PNG) | `element_index` only |
| **`vision`** | PNG only (no tree) | pixel only — see [SCREENSHOT.md](./SCREENSHOT.md) |

`vision` was renamed from `screenshot` — the old name still decodes
as a deprecated alias, so an on-disk `"capture_mode": "screenshot"`
keeps working. Default is `som` so element_index clicks work the
first time a user calls `get_window_state`; the other modes are
opt-in when the caller specifically doesn't want one half of the
work. Note the tool named `screenshot` is separate (raw PNG, no AX
walk) and unrelated to the capture mode.

When a snapshot looks wrong (tiny screenshot / empty tree), check
`cua-driver get_config` for `capture_mode` before anything else.

Pure-vision mode has its own caveats — Claude Code's vision
pipeline downsamples dense text aggressively, so pixel grounding
takes multiple correction cycles on text-heavy UIs. Read
[SCREENSHOT.md](./SCREENSHOT.md) before driving anything in that
mode; it documents the iterate/annotate/verify recipe plus the
JPEG-over-PNG finding.
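
The capture-mode table is a lookup, and writing it as one makes the deprecated alias explicit. The sketch below is illustrative: `addressing_mode` is a hypothetical helper that takes a mode string as its argument. It does not read the driver's configuration, since the output format of `get_config` is not described here.

```bash
# Map a capture_mode value to the addressing mode from the table.
addressing_mode() {
  case "$1" in
    som)               echo "element_index preferred; pixel fallback" ;;
    ax)                echo "element_index only" ;;
    vision|screenshot) echo "pixel only" ;;  # "screenshot" is the deprecated alias
    *)                 echo "unknown capture_mode: $1" >&2
                       return 1 ;;
  esac
}
```

Treating an unrecognized value as an error, rather than falling back to `som`, keeps a typo in the config from silently selecting the wrong addressing mode.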
**Window state → what works**

| state | `get_window_state` | `click`/`set_value` (AX) | `press_key` commit (Return/Space/Tab) | pixel click |
|---|---|---|---|---|
| frontmost | ✅ | ✅ | ✅ | ✅ |
| backgrounded / visible | ✅ | ✅ | ✅ | ✅ |
| **minimized** (Dock genie) | ✅ | ✅ (no deminiaturize — AX actions fire on the minimized window in place) | ❌ silent no-op / system beep — use `set_value` or click equivalent | ❌ no on-screen bounds |
| hidden (`hides=true` / `NSApp.hide`) | ✅ | ✅ | depends | ❌ |
| on another Space | ⚠️ AX tree often stripped to menu-bar-only on SwiftUI apps (System Settings) — AppKit apps usually fine. Response carries `off_space: true` + `window_space_ids` so you can detect it | ✅ | ✅ | ❌ window not in current-Space list |

**Critical cell — minimized + keyboard commit.** The keystroke
reaches the app but AX focus doesn't propagate to renderer focus on
a minimized window. Workarounds in order of preference:
`set_value` to write the field's entire value directly, or AX-click
a commit-equivalent button (Go, Submit, checkbox). Tell the user
the window needs to un-minimize only as a last resort.
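
The `press_key` column of the matrix can be read as a decision rule for how to commit a text field. The following sketch encodes it under stated assumptions: `commit_strategy` and the state labels (`frontmost`, `backgrounded`, `minimized`, `hidden`, `off_space`) are hypothetical names, and mapping `hidden` to `set_value` is a conservative choice made here because the matrix only says "depends" for that cell.

```bash
# Choose a commit path for a text field from the window state.
commit_strategy() {
  case "$1" in
    frontmost|backgrounded|off_space)
      echo "press_key" ;;   # keyboard commit works per the matrix
    minimized)
      echo "set_value" ;;   # press_key is a silent no-op or beep here
    hidden)
      echo "set_value" ;;   # matrix says "depends"; prefer the AX path
    *)
      echo "unknown window state: $1" >&2
      return 1 ;;
  esac
}
```

When `set_value` is not applicable to the field, the next preference from the text above is an AX click on a commit-equivalent button.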
## The canonical loop

```
launch_app(target)
  → pick window_id from the returned `windows` array
    (or call list_windows(pid) separately)
  → get_window_state(pid, window_id)
  → [act]            # every action also takes (pid, window_id)
  → get_window_state(pid, window_id) → verify
```

`launch_app` now returns a `windows` array alongside the pid, so the
common case collapses to two calls (`launch_app` → `get_window_state`)
without a separate `list_windows` hop.
### 1. Resolve target pid — always via `launch_app`

**Always start with `launch_app`**, whether or not the target is already
running. It's idempotent (relaunching returns the existing pid with no
side effects) and gives you the pid in one call — no `list_apps` hop.

- `launch_app({bundle_id: "com.apple.finder"})` — preferred, unambiguous.
- `launch_app({name: "Calculator"})` — when bundle_id isn't known.

`launch_app` is a **hidden-launch primitive by design** — that's the
entire point of cua-driver: agents drive apps in the background while
the user keeps typing in their real foreground app. The target's
window is initialized (AX tree fully populated, clickable via
`element_index`, the pid appears in `list_apps`) but not drawn on
screen. The driver never activates or unhides apps on its own; that
would violate the no-foreground contract the whole driver exists to
protect.

If the user explicitly wants the window visible (usually for a demo
or recording), they unhide it themselves — Dock click, Cmd-Tab, or
Spotlight. Do not reach for `open` / `osascript activate` as a
shortcut to make the window visible; those paths break the backgrounded
invariant on every call, not just the call that "needed" the
foreground. Say out loud what the user needs to do ("click the
Todo app in your Dock to bring it forward") and let them do it.

Never shell out to **any** form of `open` (including `open
<path-to-App.app>` for a just-built binary — resolve the bundle id
from `Info.plist` and use `launch_app` with that), `osascript 'tell
app … to launch/open'`, or similar. Those paths activate the target,
bypass the driver's focus-restore guard, and require a Bash
permission prompt the agent loop shouldn't be burning on app launch.
See "Defaults — always prefer cua-driver over shell shims" above for
the full intent → tool mapping.

`list_apps` is for app-level discovery (answering "what's installed /
running / frontmost?") — not part of the core action loop. Skip it in
the loop. For **window-level** questions — "does this app have a
visible window?", "which Space is this window on?", "which of this
pid's windows is the main one?" — call `list_windows` instead; the
app record doesn't carry window state on purpose. In the common
single-window case you can skip `list_windows` entirely and read the
`windows` array that `launch_app` already returned.
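
Since `launch_app` takes a bundle id and an optional `urls` array, its JSON arguments are easy to assemble in the shell. This is a minimal sketch with a hypothetical `launch_args` helper; it does no JSON escaping, so it assumes the bundle id and URLs contain no double quotes or backslashes.

```bash
# Build launch_app arguments: a bundle id plus zero or more URLs or paths.
# Assumption: values need no JSON escaping.
launch_args() {
  bundle_id=$1
  shift
  urls=""
  for u in "$@"; do
    # Append each value as a quoted string, comma-separated.
    urls="${urls:+$urls,}\"$u\""
  done
  if [ -n "$urls" ]; then
    printf '{"bundle_id":"%s","urls":[%s]}' "$bundle_id" "$urls"
  else
    printf '{"bundle_id":"%s"}' "$bundle_id"
  fi
}
```

The output is meant to be passed as the single JSON argument, for example `cua-driver launch_app "$(launch_args com.apple.finder)"`.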
|
|
425
|
+
### 2. Snapshot and act by element_index
|
|
426
|
+
|
|
427
|
+
Call `get_window_state({pid, window_id})` with the `window_id` from
|
|
428
|
+
`launch_app`'s `windows` array (or a fresh `list_windows({pid})` if
|
|
429
|
+
you're interacting with a long-lived process). The default `som`
|
|
430
|
+
capture_mode returns **both the AX tree and screenshot**, so the
|
|
431
|
+
canonical loop works immediately without any config change. The rest
|
|
432
|
+
of this section walks through `som` mode. If you're in `vision` mode
|
|
433
|
+
(PNG only, no AX tree), flip back: `cua-driver set_config '{"key":
|
|
434
|
+
"capture_mode", "value": "som"}'`.
|
|
435
|
+
|
|
436
|
+
In `som` mode (the default) the response carries:
|
|
437
|
+
|
|
438
|
+
- `tree_markdown` — every actionable element tagged `[N]`. That `N`
|
|
439
|
+
is the `element_index`. The tree can be very large (Finder is
|
|
440
|
+
~1600 elements, ~190 KB); when it exceeds token limits the MCP
|
|
441
|
+
harness saves it to a file and returns the path. Use `Bash` +
|
|
442
|
+
`jq -r '.tree_markdown'` + `grep` to pull the section you need.
|
|
443
|
+
- `screenshot_file_path` — absolute path to the saved screenshot when
|
|
444
|
+
`screenshot_out_file` was passed. Absent otherwise.
|
|
445
|
+
- `screenshot_width` / `_height` / `_scale_factor` — dimensions of the
|
|
446
|
+
captured image. Present whenever a screenshot was taken.

**Getting the screenshot as a file (CLI and context-constrained agents):**

```bash
# write to file — stdout stays readable (AX tree / summary only, no base64)
cua-driver get_window_state '{"pid":N,"window_id":W,"screenshot_out_file":"/tmp/shot.jpg"}'

# CLI --screenshot-out-file flag is equivalent and works for all capture modes
cua-driver get_window_state '{"pid":N,"window_id":W}' --screenshot-out-file /tmp/shot.jpg
```

Pass `screenshot_out_file` when using `get_window_state` via CLI or from an
agent whose context window can't absorb ~31 KB of inline base64 (e.g.
OpenCode with a local Ollama model). The MCP image content block is omitted
from the response when this param is set — the model receives only the AX
tree and `screenshot_file_path`, then reads the image from disk.
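
When scripting the CLI form, building the JSON argument with `json.dumps` avoids shell-quoting mistakes. This is a minimal sketch that assumes `cua-driver` is on `PATH`; the helper names are hypothetical and the shape of stdout is not assumed beyond "text".

```python
import json
import subprocess


def get_window_state_argv(pid, window_id, screenshot_out_file):
    """Build the argv for the CLI call shown above; json.dumps handles quoting."""
    payload = {
        "pid": pid,
        "window_id": window_id,
        "screenshot_out_file": screenshot_out_file,
    }
    return ["cua-driver", "get_window_state", json.dumps(payload)]


def snapshot_to_file(pid, window_id, screenshot_out_file):
    """Run the call and return stdout as text (requires cua-driver installed)."""
    argv = get_window_state_argv(pid, window_id, screenshot_out_file)
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```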

**Reason over both the tree AND the screenshot — they're
complementary, not redundant.** In `som` mode every
turn's `get_window_state` gives you both halves and you should pull
signal from each:

- The **AX tree** tells you *what's clickable* — roles, labels,
  `element_index` handles, advertised actions, parent-child
  structure. This is the ground truth for dispatching.
- The **screenshot** tells you *which one* — the tree often has
  many buttons with similar or empty labels ("Delete", "OK",
  anonymous UUID-labeled buttons, five `AXStaticText = " "`), and
  visual context disambiguates. Captions, colors, layout relationships
  visible in pixels often don't show up in the AX tree at all
  (especially in Chromium / Electron / web content).

Canonical pattern: look at the screenshot to decide "the blue
Subscribe button on the top-right of the video card", then walk the
tree to find the matching `AXButton` and dispatch by its
`element_index`. Don't try to do it from just the tree — you'll
pick the wrong element when labels repeat. Don't try to do it from
just the screenshot — you lose the reliable AX-action path and the
safe backgrounded-dispatch.

Reach for pixel coordinates only when the target is a canvas /
video / WebGL / custom-drawn surface that isn't in the AX tree
(see Pixel-coordinate clicks below).

The `actions=[...]` list on each element is **advisory**, not
authoritative. cua-driver does not pre-flight check against it —
`click({pid, window_id, element_index})` always attempts `AXPress` (or the
action you pass) and surfaces whatever the target returns. Many
apps accept `AXPress` on elements that don't advertise it — Chrome's
omnibox suggestion `AXMenuItem` is a live example. **Try the click
first** — pivot only on the returned AX error code.
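
The try-first, pivot-on-error rule can be sketched as below. The `click` callable and the `AXActionError` exception are hypothetical stand-ins for whatever client wrapper you use and however it surfaces the AX error code; only the action names (`show_menu`, `pick`, and so on) come from this document.

```python
class AXActionError(Exception):
    """Hypothetical stand-in for an AX error code surfaced by the driver."""


def click_then_pivot(click, target, fallback_actions):
    """Try the default click first; on an AX error, try each fallback in order.

    `click` is any callable that takes the payload dict and raises
    AXActionError on failure. Returns the name of the action that succeeded.
    """
    try:
        click(dict(target))  # default action (AXPress)
        return "press"
    except AXActionError as first_error:
        last_error = first_error
    for action in fallback_actions:
        try:
            click({**target, "action": action})
            return action
        except AXActionError as err:
            last_error = err
    raise last_error  # every attempt failed: surface the last AX error
```

The caller chooses which fallbacks are safe to try for a given element; the sketch deliberately does not guess.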

Dispatch table (every row assumes a `(pid, window_id)` pair from the
last `get_window_state`; `window_id` is required alongside
`element_index`, ignored on pixel-only forms unless you want to
anchor the conversion against a specific window):

| Intent | Tool | Notes |
|---|---|---|
| List an app's windows | `list_windows({pid})` | returns `window_id`, `title`, `bounds`, `z_index`, `is_on_screen`, `on_current_space`. Already included in `launch_app`'s response — only call this for long-lived pids |
| Snapshot a window | `get_window_state({pid, window_id})` | returns `tree_markdown` + `screenshot_*`; populates the `(pid, window_id)` element_index cache |
| Left click | `click({pid, window_id, element_index})` | default `action: "press"`. Pixel form: `click({pid, x, y})` (`window_id` optional; when supplied, it pins the anchor window). Add `modifier: ["cmd"]` for a cmd-click |
| Double-click / open | `double_click({pid, window_id, element_index})` | AXOpen when advertised (Finder items, openable rows); else stamped pixel double-click at the element's center. Pixel form: `double_click({pid, x, y})`, a primer-gated recipe that lands on backgrounded targets such as Chromium web content (YouTube fullscreen) and Finder (open-on-double-click). `click({..., count: 2})` still works and routes through the same recipe; `double_click` is the intent-first spelling |
| Right click / context menu | `right_click({pid, window_id, element_index})` or `click({pid, window_id, element_index, action: "show_menu"})` | Chromium web-content coerces pixel right-click to left — see `WEB_APPS.md` |
| Type at cursor | `type_text({pid, text, window_id, element_index})` | `AXSelectedText` write; focuses first |
| Set whole field value | `set_value({pid, window_id, element_index, value})` | sliders, steppers, text fields; **use for keyboard-commit workarounds on minimized windows** |
| Scroll | `scroll({pid, direction, amount, by, window_id, element_index})` | synthesizes PageUp/PageDown/arrows via SLEventPostToPid |
| Focus + send key | `press_key({pid, key, window_id, element_index, modifiers})` | element_index sets AXFocused, then posts key |
| Send key to pid | `press_key({pid, key, modifiers})` | no focus change; key goes to pid's current focus |
| Modifier combo | `hotkey({pid, keys})` | e.g. `["cmd","c"]`; posted per-pid, not HID tap |
| Unicode keystrokes | `type_text({pid, text, delay_ms})` | AX write with automatic CGEvent fallback; reaches Chromium/Electron inputs |

**All keyboard/text primitives require `pid`.** There is no
frontmost-routed variant — every key goes to the named target via
`CGEvent.postToPid`, so the driver cannot leak keystrokes into the
user's foreground app.

**Why `element_index` is the primary path:** works on hidden /
occluded / off-Space windows, no focus steal, stable across
rebuilds, labels tell you what you're clicking. Reach for pixel
coordinates only when AX can't.

### Pixel-coordinate clicks

The pixel path (`click({pid, x, y})`) is for surfaces the AX tree
doesn't reach — canvases, video players, WebGL, custom-drawn controls.
Coords are **window-local screenshot pixels** (same space as the PNG
`get_window_state` returns). Top-left origin, y-down. The driver
handles screen-point conversion internally. Passing `window_id`
alongside `x, y` is optional but recommended — it pins the
coordinate conversion to the window whose screenshot produced the
pixel, rather than the driver's heuristic choice.

#### Reading coordinates from the PNG

PNGs returned by `get_window_state` are capped at **1568 px
long-side by default** (`max_image_dimension` config), matching
Anthropic's multimodal-vision downsampling limit. That means the
image the model reasons over and the image the click tool's
coordinate system lives in are the **same resolution** — just look
at the PNG, pick a pixel, click at that pixel. No scaling math.

This is the default because the mismatch between "rendered
thumbnail" and "native PNG" was a recurring coord-estimation
footgun. If you opt out (explicit `max_image_dimension=0` for
pixel-perfect verification flows), the old rule applies: don't
eyeball coords from whatever your client renders — it may be
2-4× smaller than the PNG on disk, and a 2% error in thumbnail
space becomes ~80 px in the real image. Use the crosshair recipe
below against the full-resolution file in that case.
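
To make the arithmetic concrete, here is a small sketch of the two cases. It assumes the cap preserves aspect ratio and rounds to the nearest pixel, which may differ from the driver's exact rounding; the function names and the 4000 px example size are illustrative.

```python
def cap_long_side(width, height, max_dim=1568):
    """Scale (width, height) so the longer side is at most max_dim."""
    long_side = max(width, height)
    if max_dim <= 0 or long_side <= max_dim:
        return width, height  # max_dim=0 is the opt-out: native size is kept
    scale = max_dim / long_side
    return round(width * scale), round(height * scale)


def thumbnail_to_image(x, y, thumb_size, image_size):
    """Map a point picked in a rendered thumbnail onto the full-size image."""
    thumb_w, thumb_h = thumb_size
    image_w, image_h = image_size
    return round(x * image_w / thumb_w), round(y * image_h / thumb_h)


print(cap_long_side(3136, 2000))  # (1568, 1000)
print(thumbnail_to_image(500, 250, (1000, 625), (4000, 2500)))  # (2000, 1000)
print(4000 * 2 // 100)  # 80: a 2% error on a 4000 px side, in pixels
```

With the default cap in place the second function is unnecessary, which is the point of the default.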

1. `get_window_state({pid, window_id})` returns an image capped
   at 1568 long-side (default) plus its dimensions
   (`screenshot_width` / `screenshot_height`). Write the bytes to
   disk with `--screenshot-out-file <path>` in any capture mode — works
   identically in `vision` (where it's the only way) and `som`
   (where it sidesteps the jq + base64 dance on the spliced
   `screenshot_png_b64` field).
2. You are a multimodal model — look at the PNG. Since the PNG
   matches what you see, pick the target pixel directly. No
   fractional math needed.
3. When precision matters (small targets, dense UIs), draw a
   crosshair on the image (do **not** crop — cropping loses the
   coordinate system and requires error-prone offset math) and
   show it before clicking:

   ```python
   from PIL import Image, ImageDraw

   img = Image.open('/tmp/shot.png')  # the file you wrote in step 1
   draw = ImageDraw.Draw(img)
   x, y = 640, 360  # placeholder: replace with the pixel you picked
   r = 18
   # circle plus horizontal and vertical hairlines centred on (x, y)
   draw.ellipse([x-r, y-r, x+r, y+r], outline='red', width=4)
   draw.line([x-30, y, x+30, y], fill='red', width=3)
   draw.line([x, y-30, x, y+30], fill='red', width=3)
   img.save('/tmp/shot_annotated.png')
   ```

4. Only dispatch the click after the user (or your own re-read of
   the annotated image) confirms the crosshair is on target.

#### Addressing variants

- `click({pid, x, y})` — single left-click.
- `click({pid, x, y, count: 2})` — double-click.
- `click({pid, x, y, modifier: ["cmd"]})` — cmd-click. Accepts any
  subset of `cmd/shift/option/ctrl`.
- `right_click({pid, x, y})` — also takes `modifier`.

The pixel path animates the agent cursor overlay but never warps
the real cursor. If the pid has no on-screen window the call errors
with `pid X has no on-screen window` — you need a visible window to
anchor the conversion.

#### How the pixel click is dispatched

The recipe is the backgrounded "noraise" sequence: yabai's
focus-without-raise SLPS event records followed by an off-screen
user-activation primer and the real click, all stamped via
`SLEventPostToPid`. The target app becomes AppKit-active for event
routing but its window does **not** rise to the front of the
z-stack, and macOS's "switch to Space with windows for app" follow
is suppressed. Full mechanics in
`Sources/CuaDriverCore/Input/MouseInput.swift` (`clickViaAuthSignedPost`)
and the companion `FocusWithoutRaise.swift`.

#### Known limits

- **Chromium `<video>` play/pause**: depending on the build, the pixel
  click is often rejected by HTML5's click-to-play handler. Use keyboard
  instead: `press_key({pid, key: "k"})` (YouTube) or
  `press_key({pid, key: "space"})` (generic). Keyboard events
  travel through a different auth envelope.
- **Pixel right-click on Chromium web content** coerces to a
  left-click — a known Chromium renderer-IPC limitation that affects
  every non-HID-tap synthesis path. For context menus on
  AX-addressable elements (links, buttons, toolbar items), use
  `right_click({pid, window_id, element_index})` instead.

### Canvases, viewports, games (Blender, Unity, GHOST, Qt, wxWidgets)

Apps whose main surface is an OpenGL / Metal / Qt / wxWidgets
viewport expose **no useful AX tree** — the whole surface is one
opaque `AXGroup` or `AXWindow` from AX's perspective. Per-pid event
paths (`SLEventPostToPid`, `CGEvent.postToPid`) are filtered by the
viewport's own event-source check and silently dropped — the event
loop wants "real HID origin".

The working pattern:

1. Bring the target frontmost (a brief `osascript activate` is
   acceptable here — this is the carve-out the skill's osascript
   gate allows).
2. `CGEvent.post(tap: .cghidEventTap)` with a leading `mouseMoved`
   event (~30 ms before the click). `cua-driver click` takes this
   path automatically when the target is frontmost.
3. Accept that the real cursor visibly moves: `cghidEventTap` is
   the system HID stream, so the cursor warps to the click point.

There is no backgrounded path that reaches these apps today.

## Navigating native menu bars (AXMenuBar)

**Only drive the menu bar when the target app is frontmost.** This
is the single most-misused cua-driver capability. If the target is
backgrounded, don't reach for `AXMenuBarItem` + AXPick — use
in-window `element_index` or pixel clicks instead. Two reasons, one
functional and one perceptual:

- **Functional:** menu items that touch document/playback/editor
  state go `DISABLED` when their owning app isn't the key window
  (Preview rotate, IINA speed change, most editor commands). AXPick
  + AXPress will dispatch successfully from the driver's side but
  no-op at the target — you get a silent false-pass.
- **Perceptual (matters for demos, screen recordings, and anything
  the user watches live):** macOS's screen-rendered menu bar
  always belongs to the *frontmost* app. AXPick on a backgrounded
  app's `AXMenuBarItem` dispatches to that app's per-process menu at
  the AX layer, but any visible menu render happens over the
  frontmost app's menu bar — the viewer sees an IINA submenu
  flashing on top of Chrome's menus, which reads as "the agent
  clicked the wrong app." The AX call was correct; the frame the
  user sees is not. For recorded or observed sessions, this is an
  integrity bug even though it's not a correctness bug.

**Good decision rule:** if the target is not already frontmost, do
not use `AXMenuBarItem` at all. For *reading* in-window state,
snapshot the window AX tree — most apps expose the same state via
an in-window `AXStaticText`, title bar, or toolbar. For *dispatching*
actions, use in-window `element_index` (buttons, toolbar items) or
pixel clicks on in-window controls — both dispatch via AppKit's
window-under-pointer hit-test and are **not** frontmost-gated.

When the target IS frontmost, the menu-bar flow below is fine and
the canonical path for menus.

### The two-snapshot pattern (target frontmost only)

Menu contents are a two-snapshot flow. Closed AXMenu subtrees are
deliberately skipped during snapshot — otherwise every app's File /
Edit / View hierarchy plus every Recent Items macOS has ever seen
would inflate the tree 10-100x. But once a menu is *open*, its
AXMenuItem children do receive `element_index` values so you can
click them normally.

1. Find the `[N] AXMenuBarItem "<Menu Name>"` in the tree.
2. `click({pid, window_id, element_index: N, action: "pick"})` — menu bar items
   implement `AXPick` ("open my submenu"), not `AXPress`. Using the
   default action on an AXMenuBarItem is a no-op.
3. Re-snapshot. The expanded menu's items now appear under the bar
   item as `[M] AXMenuItem "<Item Name>"`.
4. Click the target item — most items respond to `AXPress` (default
   action). Submenus nest under the item and are walked the same way.
5. Re-snapshot and verify.

If you ever need to back out without selecting, `press_key({pid, key:
"escape"})` closes the open menu. Leaving a menu expanded between
turns poisons subsequent snapshots for that pid.
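
The five steps and the escape rule can be sketched as one function. The `driver` object here is a hypothetical client whose `get_window_state`, `click`, and `press_key` methods take the payload dicts shown in this document and return dicts; it is not an API cua-driver ships. The tree line format follows the `[N] AXMenuBarItem "<Menu Name>"` shape quoted above.

```python
import re


def find_index(tree_markdown, role, label):
    """Return the element_index of the first `[N] <role> "<label>"` line, or None."""
    pattern = re.compile(
        r"\[(\d+)\]\s+" + re.escape(role) + r'\s+"' + re.escape(label) + '"'
    )
    match = pattern.search(tree_markdown)
    return int(match.group(1)) if match else None


def pick_menu_item(driver, pid, window_id, menu, item):
    """Two-snapshot menu flow; assumes the target app is already frontmost.

    Returns the final tree_markdown for verification, or None if the menu
    or the item could not be found.
    """
    target = {"pid": pid, "window_id": window_id}
    tree = driver.get_window_state(target)["tree_markdown"]  # snapshot 1
    bar_index = find_index(tree, "AXMenuBarItem", menu)
    if bar_index is None:
        return None
    driver.click({**target, "element_index": bar_index, "action": "pick"})
    tree = driver.get_window_state(target)["tree_markdown"]  # snapshot 2
    item_index = find_index(tree, "AXMenuItem", item)
    if item_index is None:
        # Back out so the open menu does not poison later snapshots.
        driver.press_key({"pid": pid, "key": "escape"})
        return None
    driver.click({**target, "element_index": item_index})  # default AXPress
    return driver.get_window_state(target)["tree_markdown"]  # verify step
```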

### Commands gated on the target being frontmost

Some menu items and global shortcuts (Preview's Tools → Rotate
Right, ⌘R; anything in the View menu that manipulates the current
document; most editor commands) are **disabled unless the target
app is the key / frontmost window**. You'll see it in the AX tree
as `DISABLED` on the menu item even though the user's intent is
obviously valid.

Before activating, confirm you're in this narrow case — the menu
item still reads `DISABLED` after a fresh snapshot AND the action
the user requested genuinely requires frontmost (Preview rotate,
View menu document manipulation, editor commands). If either
check fails, don't activate.

When both checks pass, activate the app. The driver has no `activate`
tool (deliberately, since the whole point is backgroundable control),
so this is the one legitimate `osascript` fallback:

```bash
osascript -e 'tell application "<App Name>" to activate'
```

Then re-snapshot — the menu item loses its `DISABLED` tag — and
`click({action: "pick"})` the item. Alternatively, a `hotkey`
call delivered to the now-frontmost app works for the shortcut
form (`⌘R`, `⌘+`, etc.).

**Always name the focus steal in your response** so the user isn't
surprised — "Briefly activating Preview to enable Tools → Rotate
Right" or similar. Don't silently steal focus. You don't need to
restore the previous frontmost afterwards unless the user asks —
they can cmd-tab back.

## Web-rendered apps (browsers, Electron, Tauri)

For Chrome / Edge / Brave / Arc / Safari, Electron apps (Slack,
VSCode, Notion, Discord), and Tauri apps — see **`WEB_APPS.md`**.

Covers: sparse AX tree population (retry-once pattern for Chromium),
URL navigation via omnibox suggestions, the `set_value` workaround
for keyboard commits on **minimized** windows (Return silently
no-ops — symptom is a macOS system beep; use `set_value` or click a
clickable equivalent), scrolling via synthetic PageUp/Down keystrokes,
in-page clicks, and typing into web inputs.

Chromium web content specifically also coerces `right_click` back to
left — use `element_index` for AX-addressable targets and accept the
limit otherwise.

### Browser JS primitives — `page` tool and `get_window_state(javascript=)`

When the AX tree doesn't expose the data you need (common in
Chromium/Electron — the tree is sparse for web content), use the
`page` tool or the `javascript` param on `get_window_state` to query
the DOM directly via Apple Events. Requires "Allow JavaScript from
Apple Events" to be enabled — see `WEB_APPS.md` for the setup path.

**Three actions on the `page` tool:**

- `page({pid, window_id, action: "get_text"})` — returns
  `document.body.innerText`. Fastest way to read page content, prices,
  article text, or any raw text the AX tree truncates or omits.

- `page({pid, window_id, action: "query_dom", css_selector: "a[href]",
  attributes: ["href"]})` — runs `querySelectorAll` and returns each
  match's tag, text, and requested attributes as a JSON array. Use for
  table rows, link hrefs, data attributes, structured page data.

- `page({pid, window_id, action: "execute_javascript", javascript:
  "..."})` — raw JS. Wrap in an IIFE with try-catch. Don't use this for
  elements already indexed by `get_window_state` — `click` and
  `set_value` are more reliable there.
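
One way to apply the "wrap in an IIFE with try-catch" advice is to generate the wrapper rather than hand-write it each time. The sketch below builds the argument dict for `execute_javascript`; the helper names, the example pid and window_id, and the `'ERROR: '` prefix convention are assumptions made for illustration.

```python
import json


def wrap_in_iife(js_body):
    """Wrap a JS snippet in an IIFE with try/catch so errors come back as data.

    `js_body` should contain a `return` statement for the value you want.
    """
    return (
        "(function () {\n"
        "  try {\n"
        f"    {js_body}\n"
        "  } catch (err) {\n"
        "    return 'ERROR: ' + String(err);\n"
        "  }\n"
        "})()"
    )


def execute_javascript_payload(pid, window_id, js_body):
    """Build the arguments for page({..., action: "execute_javascript"})."""
    return {
        "pid": pid,
        "window_id": window_id,
        "action": "execute_javascript",
        "javascript": wrap_in_iife(js_body),
    }


payload = execute_javascript_payload(844, 6123, "return document.title;")
print(json.dumps(payload, indent=2))
```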

**Co-located read — `get_window_state` with `javascript`:**

```
get_window_state({pid, window_id, javascript: "document.title"})
```

Runs the JS and appends the result as a `## JavaScript result` section
alongside the AX snapshot — one round-trip instead of two. Use this
when you need both the element tree (for subsequent clicks) and some
page data in the same turn.

**Decision rule — AX vs JS:**

| Need | Use |
|---|---|
| Click / type into an element | `get_window_state` → `click` / `set_value` (AX, works backgrounded) |
| Read text the AX tree drops | `page(get_text)` or `get_window_state(javascript=)` |
| Scrape structured data (tables, hrefs) | `page(query_dom)` |
| Trigger JS events / mutations | `page(execute_javascript)` |

Supported backends:

| App type | How | Context |
|---|---|---|
| Chrome / Brave / Edge | Apple Events `execute javascript` | Full DOM ✅ |
| Safari | Apple Events `do JavaScript` | Full DOM ✅ |
| Electron (VS Code, Cursor…) | SIGUSR1 → V8 inspector → CDP | Main process only: `process`, `Buffer` — no `document`, no `require` in sandboxed apps |
| Electron (with `--remote-debugging-port`) | CDP page target | Full DOM ✅ |

**Electron sandbox note:** SIGUSR1 connects to the Node.js *main* process.
Sandboxed Electron apps (VS Code, Cursor) strip `require` and Electron
APIs there. Useful for: `process.env`, `process.versions`, `process.cwd()`,
`process.pid`. For full DOM/renderer access, launch the app with
`--remote-debugging-port=9222` — cua-driver will detect and prefer the
page target automatically.

Arc returns no values; Firefox has no JS-via-AppleEvents support — see
`WEB_APPS.md` for the full matrix.

### 3. Re-snapshot and verify — mandatory

**Always** call `get_window_state({pid, window_id})` after the action.
This isn't optional verification — it's the second half of the
snapshot invariant.

Check the AX tree diff: a changed value, a new element, a new
window, or the disappearance of the thing you just clicked (menus
collapse after selection, buttons may become disabled, etc.). If
nothing changed, the action likely failed silently — **tell the
user what you attempted and what you observed**, and don't paper over
it with "done" language. Agents that skip this step report success on
silently-dropped actions — the single most common failure mode.
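
A minimal way to check the tree diff is a line-level comparison of the two `tree_markdown` strings. This sketch uses the standard-library `difflib`; the function names and report wording are illustrative, and on very large trees a cheaper comparison may be preferable.

```python
import difflib


def tree_changes(before, after):
    """Return added ('+ ') and removed ('- ') lines between two snapshots."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def report(action, before, after):
    """Describe what was attempted and what was observed, without claiming success."""
    changes = tree_changes(before, after)
    if not changes:
        return f"Attempted {action}; the AX tree did not change, so it likely failed silently."
    return f"Attempted {action}; observed {len(changes)} changed tree line(s)."


before = '[1] AXButton "Play"\n[2] AXStaticText "0:00"'
after = '[1] AXButton "Pause"\n[2] AXStaticText "0:00"'
print(report("click on [1]", before, after))
print(report("click on [1]", before, before))
```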

## Recording trajectories

Session-scoped action recording + replay, for demos, regressions, and
training data. Only invoke when the user explicitly asks to record a
session — the skill does not auto-enable this. CLI surface:
`cua-driver recording start|stop|status`; raw tool: `set_recording`.

See **`RECORDING.md`** for the full flow: enable/disable, turn folder
contents, replay via `replay_trajectory`, and the element_index
doesn't-survive-across-sessions caveat.

## Common error patterns

| Error text | Meaning | Fix |
|---|---|---|
| `No cached AX state for pid X window_id W` | You either skipped `get_window_state` this turn, or passed a different `window_id` to the click than the one the snapshot cached against | Call `get_window_state({pid: X, window_id: W})` first — the same window_id you intend to click in |
| `Invalid element_index N for pid X window_id W` | Index is stale or out of range | Re-run `get_window_state` with the same window_id, pick a fresh index from the new tree |
| `window_id W belongs to pid P, not …` | Passed a window_id that's owned by a different process | Use `list_windows({pid: X})` to enumerate this pid's own windows |
| `AX action AXPress failed with code …` | Element doesn't support AXPress | Try `show_menu`, `confirm`, `cancel`, or `pick` |
| macOS system-alert beep on `press_key` with no visible change | Target window is minimized; Return / Space / Tab commits don't establish real renderer focus on minimized windows | AX-click a clickable equivalent (Go button, Submit button, checkbox) instead of pressing the key; see "Keyboard commits on minimized windows" under the Browser section |
| `Accessibility permission not granted` | TCC not granted | Stop; tell user to grant in System Settings |
| `Screen Recording permission not granted` | TCC not granted for capture | Affects `screenshot` and `get_window_state` (which always captures). Grant in System Settings — the driver can't operate without it |
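
The table lends itself to a small lookup so an agent loop reacts to an error the same way every time. The patterns below are derived from the error texts in the table; the exact wording the driver emits may vary, so treat the regexes and the `suggest_fix` helper as assumptions.

```python
import re

# (pattern, suggested fix) pairs, paraphrased from the table above.
ERROR_FIXES = [
    (r"No cached AX state", "Call get_window_state with the same pid and window_id first."),
    (r"Invalid element_index", "Re-run get_window_state and pick a fresh index."),
    (r"window_id \S+ belongs to pid", "Use list_windows for this pid to enumerate its own windows."),
    (r"AX action AXPress failed", "Try show_menu, confirm, cancel, or pick."),
    (r"Accessibility permission not granted", "Stop and ask the user to grant it in System Settings."),
    (r"Screen Recording permission not granted", "Ask the user to grant it in System Settings."),
]


def suggest_fix(error_text):
    """Return the suggested fix for a known error text, or None if unrecognized."""
    for pattern, fix in ERROR_FIXES:
        if re.search(pattern, error_text):
            return fix
    return None


print(suggest_fix("Invalid element_index 42 for pid 844 window_id 6123"))
```

An unrecognized error returns `None`, which should be reported to the user as-is rather than guessed at.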

## Things to avoid

- **Never** reuse an `element_index` across a re-snapshot of the same pid.
- **Never** translate screenshot pixels into a click when the target is
  in the AX tree. There the screenshot is for visual disambiguation,
  not coordinates; use the `element_index`.
- **Prefer AX over pixels.** `click({pid, x, y})` works for
  canvas / WebView regions, but it lands blindly, with no label to
  confirm what it hit. Exhaust AX paths (menu bars, cmd-k palettes,
  toolbar items, keyboard shortcuts) before dropping to coordinates.
- **Never** drive destructive actions (delete files, close unsaved
  documents, send messages, submit forms) without explicit user
  intent for that specific destructive step.
- **Never** launch apps autonomously; confirm with the user first
  unless their original request clearly implies the launch.

## Example end-to-end task

**User:** "Open the Downloads folder in Finder."

1. `launch_app({bundle_id: "com.apple.finder", urls: ["~/Downloads"]})`
   → `{pid: 844, windows: [{window_id: 6123, title: "Downloads", ...}]}`.
   The launch is idempotent, and Finder opens a hidden window rooted at
   `~/Downloads` via `application(_:open:)`: zero activation, no
   focus steal. The `windows` array lets you skip a `list_windows` hop.
2. `get_window_state({pid: 844, window_id: 6123})` → verify an
   `AXWindow` whose title contains "Downloads" is present with a
   populated AX subtree (sidebar, list view, files).
3. Done.

If the user instead asks to navigate *within* an already-open Finder
window, use the menu-bar flow from the "Navigating native menu bars"
section above (pick the Go menu bar item → re-snapshot → click the
menu item → re-snapshot to verify).