@amaster.ai/pi-computer-use 0.1.2-beta.6 → 0.1.2-beta.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (22)
  1. package/bin/darwin-arm64/CuaDriver.app/Contents/CodeResources +0 -0
  2. package/bin/darwin-arm64/CuaDriver.app/Contents/Info.plist +32 -0
  3. package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/README.md +140 -0
  4. package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/RECORDING.md +113 -0
  5. package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/SKILL.md +887 -0
  6. package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/TESTS.md +232 -0
  7. package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/WEB_APPS.md +471 -0
  8. package/bin/darwin-arm64/CuaDriver.app/Contents/_CodeSignature/CodeResources +172 -0
  9. package/bin/darwin-x64/CuaDriver.app/Contents/CodeResources +0 -0
  10. package/bin/darwin-x64/CuaDriver.app/Contents/Info.plist +32 -0
  11. package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/README.md +140 -0
  12. package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/RECORDING.md +113 -0
  13. package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/SKILL.md +887 -0
  14. package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/TESTS.md +232 -0
  15. package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/WEB_APPS.md +471 -0
  16. package/bin/darwin-x64/CuaDriver.app/Contents/_CodeSignature/CodeResources +172 -0
  17. package/dist/mcp-client.d.ts.map +1 -1
  18. package/dist/mcp-client.js +9 -2
  19. package/dist/mcp-client.js.map +1 -1
  20. package/package.json +2 -2
  21. /package/bin/darwin-arm64/{cua-driver → CuaDriver.app/Contents/MacOS/cua-driver} +0 -0
  22. /package/bin/darwin-x64/{cua-driver → CuaDriver.app/Contents/MacOS/cua-driver} +0 -0
@@ -0,0 +1,887 @@
1
+ ---
2
+ name: cua-driver
3
+ description: Drive a native macOS app via the cua-driver CLI (default) or MCP server — snapshot its AX tree, click/type/scroll by element_index, verify via re-snapshot. Use when the user asks you to operate, drive, automate, or perform a GUI task in a real macOS application on the host (e.g. "open a file in TextEdit", "navigate to /Applications in Finder", "click the Save button in Numbers").
4
+ ---
5
+
6
+ # cua-driver
7
+
8
+ Orchestrates macOS app automation via `cua-driver`. Whenever a user
9
+ asks to drive a native macOS app, follow the loop in this skill rather
10
+ than calling tools ad-hoc — the snapshot-before-action invariant is not
11
+ optional and silently breaks if you skip it.
12
+
13
+ ## The no-foreground contract — read this first
14
+
15
+ **The user's frontmost app MUST NOT change.** This is the whole
16
+ reason cua-driver exists. Users pay for the right to keep typing in
17
+ their editor while an agent drives another app in the background.
18
+ Violate this rule and every other nice property the driver gives
19
+ you (no cursor warp, no Space switch, no window raise) stops
20
+ mattering — you just shipped the Accessibility Inspector with extra
21
+ steps.
22
+
23
+ Before running any shell command, ask: **"does this raise,
24
+ activate, foreground, or make-key any app?"** If yes, don't run it.
25
+ Every one of the commands below activates the target on macOS and
26
+ is therefore forbidden unless the user **explicitly** asked for
27
+ frontmost state:
28
+
29
+ - **Every form of the `open` CLI — `open -a <App>`, `open -b
30
+ <bundle-id>`, `open <file>`, `open <path-to-App.app>`, `open
31
+ <url>` — always activates.** macOS routes all forms through
32
+ LaunchServices, which unhides and foregrounds the target
33
+ regardless of whether you passed an app name, a bundle id, a
34
+ document, a URL, or the bundle path itself. The activation
35
+ happens even when the only intent was "start the process."
36
+ **Never use `open` for any app launch.** This includes launching
37
+ a just-built .app from a local build dir (e.g. `open
38
+ build/Build/Products/Debug/MyApp.app`) — resolve the
39
+ `CFBundleIdentifier` from `Info.plist` and use `launch_app`
40
+ with that id. See the "Open \<app\>" paragraph below for why
41
+ `launch_app` is safe even when the app internally calls
42
+ `NSApp.activate`.
43
+ - `osascript -e 'tell application "X" to activate'` —
44
+ activates by design. Same for `... to open <file>`,
45
+ `... to launch`, and anything with `activate` in the tell block.
46
+ - `osascript -e 'tell application "System Events" to ... frontmost'`
47
+ in a mutating form (setting `frontmost` rather than reading it).
48
+ - AppleScript files that invoke `activate`, `launch`, or `open`
49
+ against the target app.
50
+ - `cliclick` (moves the user's real cursor to the target coords
51
+ before clicking — a focus-steal-equivalent even if the app's
52
+ window state is unchanged).
53
+ - `CGEventPost` with `cghidEventTap` targeting a coordinate over
54
+ a different app's window (warps the cursor, possibly activates
55
+ on hit).
56
+ - `AppleScriptTask`, `NSAppleScript`, `Process` wrapping `osascript`
57
+ that contains any of the above.
58
+ - `NSRunningApplication.activate(options:)` called from your own
59
+ helper binary — same class.
60
+ - Dock clicks and any `open` invocation (see the first bullet —
61
+ every form of `open` goes through LaunchServices which
62
+ activates, full stop).
63
+ - **Keyboard shortcuts that semantically mean "focus here" —
64
+ most notably Chrome / Safari / Arc's `⌘L` (focus omnibox) and
65
+ Finder's `⌘⇧G` (Go to Folder).** These aren't pure key events —
66
+ the receiving app interprets "user wants to type here" as
67
+ activation intent and raises its window to be key. Even when
68
+ delivered to a backgrounded pid via `hotkey`, the downstream app
69
+ pulls focus. **For omnibox navigation specifically**, the correct
70
+ path is `launch_app({bundle_id: "com.google.Chrome", urls:
71
+ ["https://…"]})` — no omnibox dance, no `⌘L`, no focus-steal. Do
72
+ NOT try `set_value` on the omnibox: Chrome's commit logic requires
73
+ a "user-typed" signal that neither an AX value write nor
74
+ `CGEvent.postToPid` keystrokes supply from a backgrounded pid —
75
+ the URL lands in the field but Return fires as a no-op. See
76
+ `WEB_APPS.md` → "Navigate to a URL" for the full pattern. The
77
+ general principle: a shortcut that says "put my cursor inside this
78
+ app" is a focus-steal; a shortcut that says "do this thing" (copy,
79
+ save, quit) is fine.
80
+ - **Tab-switching shortcuts in browsers (`⌘1..⌘9`, `⌘]`, `⌘[`,
81
+ `⌘⇧[`, `⌘⇧]`) are visibly disruptive even when delivered to a
82
+ backgrounded pid.** The app's key handler processes the shortcut,
83
+ the window re-renders the new tab's content, the user sees their
84
+ tabs flipping. There is no AX-only workaround: page content (HTML,
85
+ form state, `AXWebArea`) populates only for the focused tab;
86
+ inspecting a background tab requires activating it, which is the
87
+ visible flip. Observed with Dia; the same mechanic applies to every
88
+ Chromium-family browser (Chrome, Arc, Brave, Edge).
89
+
90
+ **Prefer the windows-over-tabs pattern**: for each URL you need to
91
+ drive backgrounded, use `launch_app({bundle_id, urls: [url]})` —
92
+ browsers open each URL in a new **window**. Each window has its own
93
+ `window_id`, its own AX tree, and can be inspected / interacted with
94
+ via `element_index` without activating or switching anything. Tabs
95
+ are a UX grouping for humans; cua-driver workflows should default to
96
+ windows. See `WEB_APPS.md` → "Tabs vs windows" for the full pattern.
97
+
98
+ Tab-title enumeration (read-only) IS safe — walk a window's toolbar
99
+ AX tree for `AXTab` / `AXRadioButton` children and read their
100
+ `AXTitle`s. Tab switching (activating one) is not.
101
+
102
+ Reading frontmost state is fine (`osascript -e 'tell application
103
+ "System Events" to get name of first application process whose
104
+ frontmost is true'`). Mutating it is not.
105
+
106
+ **Corollary — the AXMenuBar rule.** `AXMenuBarItem` + AXPick
107
+ dispatches at the AX layer regardless of which app is frontmost,
108
+ but macOS's on-screen menu bar always belongs to the frontmost
109
+ app. If you drive a *backgrounded* app's menu bar, the AX call
110
+ succeeds but the viewer sees the dispatch rendered over the
111
+ *frontmost* app's menu bar — confusing in any observed session and
112
+ routinely a silent no-op too, because action menu items go
113
+ `DISABLED` when their owning app isn't the key window. **So: only
114
+ use menu-bar navigation when the target is already frontmost.** For
115
+ backgrounded targets, read state via in-window AX (window title,
116
+ toolbar `AXStaticText`) and dispatch via in-window `element_index`
117
+ or pixel clicks — both paths are frontmost-insensitive. Full
118
+ rationale in "Navigating native menu bars" below.
119
+
120
+ **"Open \<app\>" in user speech means launch, not activate.**
121
+ `cua-driver launch_app` is the one correct path for process
122
+ startup — it's idempotent (no-op on a running app), returns the
123
+ pid, and has an internal `FocusRestoreGuard` that catches
124
+ `NSApp.activate(ignoringOtherApps:)` calls the target makes during
125
+ `application(_:open:)` and restores the frontmost app to what it
126
+ was before the launch. That guard is why `launch_app` with `urls`
127
+ (e.g. `{"bundle_id": "com.colliderli.iina", "urls": ["~/video.mp4"]}`)
128
+ is safe even for apps that normally foreground on media-load
129
+ (Chrome, Electron, media players).
130
+
131
+ ## Defaults — always prefer cua-driver over shell shims
132
+
133
+ **Default transport is the `cua-driver` CLI** — `Bash` shelling out
134
+ to `cua-driver <tool-name> '<JSON-args>'`. Use MCP tools (prefix
135
+ `mcp__cua-driver__*`) only when the user explicitly asks for them.
136
+ CLI wins because it picks up rebuilds instantly, failures are
137
+ easier to diagnose, and there's no per-tool schema-load overhead.
138
+
139
+ Every reference to `click(...)`, `get_window_state(...)` etc. in this
140
+ skill means `cua-driver click '{...}'` — translate to MCP form only
141
+ when MCP is requested.
142
+
143
+ ### Claude Code computer-use compatibility mode
144
+
145
+ For normal Claude Code use, keep the default CLI or `cua-driver` MCP server path above. If the user explicitly wants Claude Code's vision/computer-use-style flow, they can register:
146
+
147
+ ```bash
148
+ claude mcp add --transport stdio cua-computer-use -- cua-driver mcp --claude-code-computer-use-compat
149
+ ```
150
+
151
+ Observation: Claude Code vision flows appear to treat a screenshot MCP tool as the image-grounding anchor. This compatibility mode keeps the normal CuaDriver tools and changes only `screenshot`. The compatibility `screenshot` requires `pid` and `window_id`, captures only that target window, and returns the window-local pixel coordinate frame. Start with `launch_app` or `list_windows`, then call `screenshot({pid, window_id})`; do not assume desktop coordinates or a full-screen capture.
152
+
153
+ Use MCP for this Claude Code vision/computer-use-style path. Do not shell out to `cua-driver screenshot` as a substitute: CLI screenshots still work as CuaDriver calls, but they do not expose the `mcp__cua-computer-use__screenshot` tool name that Claude Code appears to use as the image-grounding cue.
154
+
155
+ Intent → tool mapping. If you find yourself reaching for the right
156
+ column, something has gone wrong — re-read "The no-foreground
157
+ contract" above:
158
+
159
+ | Intent | Use | Don't use |
160
+ |---|---|---|
161
+ | Open / launch an app | `launch_app({bundle_id})` or `launch_app({bundle_id, urls:[...]})` | `open -a`, `osascript 'tell app … to launch/activate/open'` |
162
+ | Find a pid | `list_apps` or `launch_app`'s return | `pgrep`, `ps`, `osascript frontmost` |
163
+ | Enumerate an app's windows | `list_windows({pid})` — or read the `windows` array `launch_app` already returns | `osascript 'every window of app …'` |
164
+ | Click / type / scroll / keys | `click`, `type_text`, `scroll`, `press_key`, `hotkey` | `osascript`, `cliclick`, raw `CGEvent`, `open <url>` |
165
+ | Drag / drag-and-drop / marquee select | `drag({pid, from_x, from_y, to_x, to_y})` (pixel-only — macOS AX has no semantic drag) | `cliclick dd:`, `osascript drag` |
166
+ | Screenshot | `screenshot` or the PNG in `get_window_state` | `screencapture` |
167
+ | Quit an app | ask the user first, then `hotkey({pid, keys:["cmd","q"]})` | `kill`, `killall`, `pkill` |
168
+ | Hand a file/URL to an app | `launch_app({bundle_id, urls:[<path>]})` | `open -a <App> <path>`, `open <url>` |
169
+
170
+ ### The narrow carve-out
171
+
172
+ The **only** legitimate use of `osascript -e 'tell app X to
173
+ activate'` is when the user **explicitly** asked for frontmost
174
+ state ("bring Chrome to the front", "make it frontmost", "I want
175
+ to see X"). Reaching for it because a tool call returned something
176
+ confusing is wrong — that's the skill's classic foot-in-the-door
177
+ failure mode and it steals focus every time.
178
+
179
+ When a cua-driver call surprises you, diagnose cua-driver first:
180
+
181
+ - **Tiny screenshot / empty `tree_markdown`?** Check
182
+ `cua-driver get_config` → `capture_mode`. Default `"som"` returns
183
+ both the AX tree and screenshot. `"vision"` omits the AX tree
184
+ (PNG only), `"ax"` omits the PNG. If a snapshot lacks a tree,
185
+ `capture_mode` is almost certainly `"vision"` — either reason
186
+ purely from the PNG or flip to `"som"` / `"ax"` via `set_config`.
187
+ - **`has_screenshot: false`?** The window capture failed (transient
188
+ race against a close, or the window has no backing store yet).
189
+ Re-snapshot; if persistent, pick a different `window_id` via
190
+ `list_windows`.
191
+ - **`Invalid element_index` / `No cached AX state`?** You either
192
+ skipped `get_window_state` this turn or passed a different
193
+ `window_id` than the one the snapshot cached against. The cache
194
+ is keyed on `(pid, window_id)` — indices don't carry across
195
+ windows of the same app. Re-snapshot with the same window_id
196
+ you're about to click in.
197
+ - **Sparse Chromium AX tree?** Retry `get_window_state` once — the
198
+ tree populates on second call.
199
+
200
+ Only after those are ruled out, and only if the user's action
201
+ genuinely needs frontmost state, fall through to the activate
202
+ fallback. Always name the focus steal in your response ("I'll
203
+ briefly bring Chrome to the front because …").
204
+
205
+ ### Self-check pattern
206
+
207
+ Before every `Bash` call whose command line touches any macOS app
208
+ (launching, opening, clicking, typing, scripting, screenshotting),
209
+ run the self-check:
210
+
211
+ 1. **Does this command foreground the target?** If yes — stop and
212
+ translate to the cua-driver equivalent from the mapping table.
213
+ 2. **Does this command move the user's real cursor?** (`cliclick`,
214
+ any `CGEventPost` at `cghidEventTap` over another app's window).
215
+ If yes — stop; use `click({pid, x, y})` which routes per-pid
216
+ via SkyLight and never warps the cursor.
217
+ 3. **Does this command bypass cua-driver entirely?** (`osascript`
218
+ mutating GUI state, AppleScript files, external helpers.) If
219
+ yes — stop; find the cua-driver tool that does the intent.
220
+
221
+ If all three are "no," the command is safe. If you can't answer,
222
+ default to stop and ask rather than proceed. A single `open -a`
223
+ run by accident kills the demo, the trust, and the user's in-flight
224
+ editor state.
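
The self-check above can be partly mechanized. The following is a minimal sketch, assuming a hypothetical helper named `steals_focus` that is not part of cua-driver. It matches only a few literal patterns from the no-foreground contract (`open`, `cliclick`, and mutating `osascript` forms), so it catches accidents rather than replacing the three questions.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of cua-driver): exits 0 when a
# command line matches a focus-stealing pattern from the contract above.
steals_focus() {
  case "$1" in
    # The documented daemon start is the one sanctioned `open` invocation.
    "open -n -g -a CuaDriver --args serve") return 1 ;;
    # Every other form of `open` goes through LaunchServices and activates.
    open|"open "*) return 0 ;;
    # cliclick moves the user's real cursor.
    cliclick|"cliclick "*) return 0 ;;
    # osascript that activates, launches, opens, or sets frontmost.
    osascript*activate*|osascript*" to launch"*|osascript*" to open"*) return 0 ;;
    osascript*"set frontmost"*) return 0 ;;
  esac
  return 1
}
```

A caller might guard a command with `if steals_focus "$cmd"; then echo "refusing: $cmd" >&2; fi`. Reading frontmost state through `osascript` does not match any pattern, which is consistent with the rule that reads are fine and mutations are not.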
225
+
226
+ ## Prerequisites — check before starting
227
+
228
+ 1. `cua-driver` is on `$PATH` (`which cua-driver`). If not, point the
229
+ user at `scripts/install-local.sh` and stop.
230
+ 2. Run `cua-driver check_permissions` (with the daemon up — see step 3).
231
+ The default behavior also raises the system permission dialogs for
232
+ any missing grants, so the user can grant on the spot. If either
233
+ grant still reads `false` after that (user dismissed the dialog),
234
+ tell them to open System Settings → Privacy & Security and grant
235
+ Accessibility and Screen Recording to `CuaDriver.app`, then stop.
236
+ Pass `'{"prompt":false}'` for a purely read-only status check that
237
+ won't steal focus.
238
+ 3. Start the daemon with `open -n -g -a CuaDriver --args serve` (the
239
+ recommended form — goes through LaunchServices so TCC attributes
240
+ the process to CuaDriver.app). `cua-driver serve &` also works;
241
+ the CLI auto-relaunches through `open -n -g -a CuaDriver` when it
242
+ detects a wrong-TCC context (any IDE-spawned shell: Claude Code,
243
+ Cursor, VS Code, Conductor). Verify with `cua-driver status`.
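
The three prerequisite checks can be strung together as below. This is a sketch under stated assumptions: `preflight` and the `CUA` override are hypothetical names introduced here for illustration, and the sketch assumes `cua-driver status` and `check_permissions` exit non-zero on failure, which the text above does not confirm. The daemon check runs before the permission check because `check_permissions` expects the daemon to be up.

```bash
#!/usr/bin/env bash
# CUA is a hypothetical override so the sketch can be dry-run against a
# stand-in binary; it is not an environment variable cua-driver reads.
preflight() {
  local cua="${CUA:-cua-driver}"

  # Step 1: the binary must be on $PATH.
  if ! command -v "$cua" >/dev/null 2>&1; then
    echo "cua-driver not found on PATH; see scripts/install-local.sh" >&2
    return 1
  fi

  # Step 3 (checked early): the daemon must be running.
  # Assumption: `status` exits non-zero when the daemon is down.
  if ! "$cua" status; then
    echo "daemon not running; start it with: open -n -g -a CuaDriver --args serve" >&2
    return 2
  fi

  # Step 2: read-only permission status; prompt:false avoids raising dialogs.
  "$cua" check_permissions '{"prompt":false}' || return 3
}
```

The sketch deliberately prints a hint instead of starting the daemon itself, so the caller decides when the documented `open -n -g -a CuaDriver --args serve` form runs.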
244
+
245
+ ## Using cua-driver from the shell
246
+
247
+ Tool names are `snake_case`, management subcommands are
248
+ `kebab-case`, so there is no ambiguity. Tools are invoked as `cua-driver
249
+ <tool-name> '<JSON-args>'`. Management subcommands:
250
+
251
+ - `open -n -g -a CuaDriver --args serve` — start persistent daemon
252
+ (**required** for `element_index` workflows; without it each CLI
253
+ invocation spawns a fresh process and the per-pid element cache
254
+ dies between calls). `cua-driver serve &` also works — the CLI
255
+ auto-relaunches via `open` when the shell's TCC context is wrong.
256
+ Pass `--no-relaunch` / `CUA_DRIVER_NO_RELAUNCH=1` to opt out.
257
+ - `cua-driver stop` / `status`
258
+ - `cua-driver list-tools`, `describe <tool>`
259
+ - `cua-driver recording start|stop|status` — see `RECORDING.md`
260
+
261
+ Canonical multi-step workflow:
262
+
263
+ ```bash
264
+ open -n -g -a CuaDriver --args serve
265
+ cua-driver launch_app '{"bundle_id":"com.apple.calculator"}'
266
+ # → {pid: 844, windows: [{window_id: 10725, ...}]}
267
+ cua-driver get_window_state '{"pid":844,"window_id":10725}'
268
+ cua-driver click '{"pid":844,"window_id":10725,"element_index":14}'
269
+ cua-driver stop
270
+ ```
271
+
272
+ ## Agent cursor overlay
273
+
274
+ Visual cursor overlay for demos and screen recordings. Default:
275
+ enabled. Toggle with `cua-driver set_agent_cursor_enabled
276
+ '{"enabled":true|false}'`. A triangle pointer Bezier-glides to each
277
+ click target, ring-ripples on landing, idle-hides after ~1.5s.
278
+ Motion knobs: `set_agent_cursor_motion` takes any subset of
279
+ `start_handle`, `end_handle`, `arc_size`, `arc_flow`, `spring` —
280
+ tuneable at runtime, persisted to config.
281
+
282
+ Requires an AppKit runloop, which `cua-driver serve` / `mcp`
283
+ bootstraps. One-shot CLI invocations skip the overlay entirely.
284
+
285
+ ## The core invariant — snapshot before AND after every action
286
+
287
+ **Every action MUST be bracketed by `get_window_state(pid, window_id)`**:
288
+
289
+ - **Before** — the pre-action snapshot resolves the `element_index`
290
+ you're about to use. Indices from previous turns are stale; the
291
+ server replaces the element index map on every snapshot, keyed
292
+ on `(pid, window_id)`. Indices from turn N don't resolve in turn
293
+ N+1, and indices from window A don't resolve against window B of
294
+ the same app. Skip this and element-indexed actions fail with
295
+ `No cached AX state`.
296
+ - **After** — the post-action snapshot verifies the action actually
297
+ landed. Without it you can't tell a silent no-op from a real
298
+ effect. The AX tree change (new value, new window, disappeared
299
+ menu, disabled button, etc.) is your evidence that the action
300
+ fired. If nothing changed, the action probably failed silently —
301
+ say so, don't assume success.
302
+
303
+ This applies to pixel clicks too — re-snapshot after to confirm the
304
+ click landed on the intended target.
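
The "after" half of the invariant can be made hard to forget by pairing every click with its verification snapshot. The following is a minimal sketch, assuming a hypothetical wrapper `click_and_resnapshot` and a hypothetical `CUA` override used only for dry runs. It uses only the `click` and `get_window_state` argument shapes shown in this document. The "before" snapshot stays with the caller, because the `element_index` has to be chosen from it.

```bash
#!/usr/bin/env bash
# CUA is a hypothetical override used only so this sketch can be dry-run
# (for example CUA=echo); cua-driver itself does not read it.
click_and_resnapshot() {
  local pid="$1" window_id="$2" element_index="$3"
  local cua="${CUA:-cua-driver}"

  # Act: element_index must come from a get_window_state call made this turn
  # against the same (pid, window_id).
  "$cua" click "$(printf '{"pid":%d,"window_id":%d,"element_index":%d}' \
    "$pid" "$window_id" "$element_index")" || return 1

  # Verify: the post-action snapshot is the evidence that the click landed.
  "$cua" get_window_state "$(printf '{"pid":%d,"window_id":%d}' "$pid" "$window_id")"
}
```

Running `CUA=echo click_and_resnapshot 844 10725 14` prints the two commands that would be issued without touching any app, which is a cheap way to check the JSON quoting.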
305
+
306
+ ### Why window selection is the caller's job now
307
+
308
+ `get_app_state` used to pick a window for you via a max-area heuristic
309
+ that returned the wrong surface on apps with large off-screen utility
310
+ panels. Concrete reproducer: IINA's OpenSubtitles helper (600×432
311
+ off-screen) out-area'd the visible 320×240 player window, so
312
+ `get_app_state(pid)` screenshot'd the invisible panel and clicks landed
313
+ there silently. The new `get_window_state(pid, window_id)` makes the
314
+ caller name the window explicitly — the driver validates that the
315
+ window belongs to the pid and is on the current Space, then snapshots
316
+ exactly what was asked for. Enumerate candidates via `list_windows` or
317
+ read the `windows` array `launch_app` already returns.
318
+
319
+ ## Behavior matrix
320
+
321
+ Two orthogonal axes shape what the agent can do.
322
+
323
+ **capture_mode → addressing mode**
324
+
325
+ | `capture_mode` | `get_window_state` returns | Use for actions |
326
+ |---|---|---|
327
+ | **`som`** (default) | tree + screenshot | `element_index` preferred; pixel fallback |
328
+ | **`ax`** | tree only (no PNG) | `element_index` only |
329
+ | **`vision`** | PNG only (no tree) | pixel only — see [SCREENSHOT.md](./SCREENSHOT.md) |
330
+
331
+ `vision` was renamed from `screenshot` — the old name still decodes
332
+ as a deprecated alias, so an on-disk `"capture_mode": "screenshot"`
333
+ keeps working. Default is `som` so element_index clicks work the
334
+ first time a user calls `get_window_state`; the other modes are
335
+ opt-in when the caller specifically doesn't want one half of the
336
+ work. Note the tool named `screenshot` is separate (raw PNG, no AX
337
+ walk) and unrelated to the capture mode.
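
The capture-mode table, including the deprecated alias, reduces to a small lookup. This is an illustrative sketch only: `addressing_for_capture_mode` is a hypothetical helper, and how the mode string is extracted from `cua-driver get_config` output is left to the caller because the output shape is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical lookup mirroring the capture_mode table; not a cua-driver command.
addressing_for_capture_mode() {
  case "$1" in
    som)               echo "element_index preferred; pixel fallback" ;;
    ax)                echo "element_index only" ;;
    # "screenshot" is the deprecated alias for "vision".
    vision|screenshot) echo "pixel only" ;;
    *)                 echo "unknown capture_mode: $1" >&2; return 1 ;;
  esac
}
```

An agent that sees an empty `tree_markdown` could call this with the configured mode and learn at once that `vision` leaves only the pixel path.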
338
+
339
+ When a snapshot looks wrong (tiny screenshot / empty tree), check
340
+ `cua-driver get_config` for `capture_mode` before anything else.
341
+
342
+ Pure-vision mode has its own caveats — Claude Code's vision
343
+ pipeline downsamples dense text aggressively, so pixel grounding
344
+ takes multiple correction cycles on text-heavy UIs. Read
345
+ [SCREENSHOT.md](./SCREENSHOT.md) before driving anything in that
346
+ mode; it documents the iterate/annotate/verify recipe plus the
347
+ JPEG-over-PNG finding.
348
+
349
+ **Window state → what works**
350
+
351
+ | state | `get_window_state` | `click`/`set_value` (AX) | `press_key` commit (Return/Space/Tab) | pixel click |
352
+ |---|---|---|---|---|
353
+ | frontmost | ✅ | ✅ | ✅ | ✅ |
354
+ | backgrounded / visible | ✅ | ✅ | ✅ | ✅ |
355
+ | **minimized** (Dock genie) | ✅ | ✅ (no deminiaturize — AX actions fire on the minimized window in place) | ❌ silent no-op / system beep — use `set_value` or click equivalent | ❌ no on-screen bounds |
356
+ | hidden (`hides=true` / `NSApp.hide`) | ✅ | ✅ | depends | ❌ |
357
+ | on another Space | ⚠️ AX tree often stripped to menu-bar-only on SwiftUI apps (System Settings) — AppKit apps usually fine. Response carries `off_space: true` + `window_space_ids` so you can detect it | ✅ | ✅ | ❌ window not in current-Space list |
358
+
359
+ **Critical cell — minimized + keyboard commit.** The keystroke
360
+ reaches the app but AX focus doesn't propagate to renderer focus on
361
+ a minimized window. Workarounds in order of preference:
362
+ `set_value` to write the field's entire value directly, or AX-click
363
+ a commit-equivalent button (Go, Submit, checkbox). Tell the user
364
+ the window needs to un-minimize only as a last resort.
365
+
366
+ ## The canonical loop
367
+
368
+ ```
369
+ launch_app(target)
370
+ → pick window_id from the returned `windows` array
371
+ (or call list_windows(pid) separately)
372
+ → get_window_state(pid, window_id)
373
+ → [act] # every action also takes (pid, window_id)
374
+ → get_window_state(pid, window_id) → verify
375
+ ```
376
+
377
+ `launch_app` now returns a `windows` array alongside the pid, so the
378
+ common case collapses to two calls (`launch_app` → `get_window_state`)
379
+ without a separate `list_windows` hop.
380
+
381
+ ### 1. Resolve target pid — always via `launch_app`
382
+
383
+ **Always start with `launch_app`**, whether or not the target is already
384
+ running. It's idempotent (relaunching returns the existing pid with no
385
+ side effects) and gives you the pid in one call — no `list_apps` hop.
386
+
387
+ - `launch_app({bundle_id: "com.apple.finder"})` — preferred, unambiguous.
388
+ - `launch_app({name: "Calculator"})` — when bundle_id isn't known.
389
+
390
+ `launch_app` is a **hidden-launch primitive by design** — that's the
391
+ entire point of cua-driver: agents drive apps in the background while
392
+ the user keeps typing in their real foreground app. The target's
393
+ window is initialized (AX tree fully populated, clickable via
394
+ `element_index`, the pid appears in `list_apps`) but not drawn on
395
+ screen. The driver never activates or unhides apps on its own; that
396
+ would violate the no-foreground contract the whole driver exists to
397
+ protect.
398
+
399
+ If the user explicitly wants the window visible (usually for a demo
400
+ or recording), they unhide it themselves — Dock click, Cmd-Tab, or
401
+ Spotlight. Do not reach for `open` / `osascript activate` as a
402
+ shortcut to make the window visible; those paths break the backgrounded
403
+ invariant on every call, not just the call that "needed" the
404
+ foreground. Say out loud what the user needs to do ("click the
405
+ Todo app in your Dock to bring it forward") and let them do it.
406
+
407
+ Never shell out to **any** form of `open` (including `open
408
+ <path-to-App.app>` for a just-built binary — resolve the bundle id
409
+ from `Info.plist` and use `launch_app` with that), `osascript 'tell
410
+ app … to launch/open'`, or similar. Those paths activate the target,
411
+ bypass the driver's focus-restore guard, and require a Bash
412
+ permission prompt the agent loop shouldn't be burning on app launch.
413
+ See "Defaults" above (always prefer cua-driver over shell shims) for the full
414
+ intent → tool mapping.
415
+
416
+ `list_apps` is for app-level discovery (answering "what's installed /
417
+ running / frontmost?") — not part of the core action loop. Skip it in
418
+ the loop. For **window-level** questions — "does this app have a
419
+ visible window?", "which Space is this window on?", "which of this
420
+ pid's windows is the main one?" — call `list_windows` instead; the
421
+ app record doesn't carry window state on purpose. In the common
422
+ single-window case you can skip `list_windows` entirely and read the
423
+ `windows` array that `launch_app` already returned.
424
+
425
+ ### 2. Snapshot and act by element_index
426
+
427
+ Call `get_window_state({pid, window_id})` with the `window_id` from
428
+ `launch_app`'s `windows` array (or a fresh `list_windows({pid})` if
429
+ you're interacting with a long-lived process). The default `som`
430
+ capture_mode returns **both the AX tree and screenshot**, so the
431
+ canonical loop works immediately without any config change. The rest
432
+ of this section walks through `som` mode. If you're in `vision` mode
433
+ (PNG only, no AX tree), flip back: `cua-driver set_config '{"key":
434
+ "capture_mode", "value": "som"}'`.
435
+
436
+ In `som` mode (the default) the response carries:
437
+
438
+ - `tree_markdown` — every actionable element tagged `[N]`. That `N`
439
+ is the `element_index`. The tree can be very large (Finder is
440
+ ~1600 elements, ~190 KB); when it exceeds token limits the MCP
441
+ harness saves it to a file and returns the path. Use `Bash` +
442
+ `jq -r '.tree_markdown'` + `grep` to pull the section you need.
443
+ - `screenshot_file_path` — absolute path to the saved screenshot when
444
+ `screenshot_out_file` was passed. Absent otherwise.
445
+ - `screenshot_width` / `_height` / `_scale_factor` — dimensions of the
446
+ captured image. Present whenever a screenshot was taken.
447
+ **Getting the screenshot as a file (CLI and context-constrained agents):**
448
+
449
+ ```bash
450
+ # write to file — stdout stays readable (AX tree / summary only, no base64)
451
+ cua-driver get_window_state '{"pid":N,"window_id":W,"screenshot_out_file":"/tmp/shot.jpg"}'
452
+
453
+ # CLI --screenshot-out-file flag is equivalent and works for all capture modes
454
+ cua-driver get_window_state '{"pid":N,"window_id":W}' --screenshot-out-file /tmp/shot.jpg
455
+ ```
456
+
457
+ Pass `screenshot_out_file` when using `get_window_state` via CLI or from an
458
+ agent whose context window can't absorb ~31 KB of inline base64 (e.g.
459
+ OpenCode with a local Ollama model). The MCP image content block is omitted
460
+ from the response when this param is set — the model receives only the AX
461
+ tree and `screenshot_file_path`, then reads the image from disk.
462
+
463
+ **Reason over both the tree AND the screenshot — they're
464
+ complementary, not redundant.** In `som` mode every
465
+ turn's `get_window_state` gives you both halves and you should pull
466
+ signal from each:
467
+
468
+ - The **AX tree** tells you *what's clickable* — roles, labels,
469
+ `element_index` handles, advertised actions, parent-child
470
+ structure. This is the ground truth for dispatching.
471
+ - The **screenshot** tells you *which one* — the tree often has
472
+ many buttons with similar or empty labels ("Delete", "OK",
473
+ anonymous UUID-labeled buttons, five `AXStaticText = " "`), and
474
+ visual context disambiguates. Captions, colors, layout relationships
475
+ visible in pixels often don't show up in the AX tree at all
476
+ (especially in Chromium / Electron / web content).
477
+
478
+ Canonical pattern: look at the screenshot to decide "the blue
479
+ Subscribe button on the top-right of the video card", then walk the
480
+ tree to find the matching `AXButton` and dispatch by its
481
+ `element_index`. Don't try to do it from just the tree — you'll
482
+ pick the wrong element when labels repeat. Don't try to do it from
483
+ just the screenshot — you lose the reliable AX-action path and the
484
+ safe backgrounded-dispatch.
485
+
486
+ Reach for pixel coordinates only when the target is a canvas /
487
+ video / WebGL / custom-drawn surface that isn't in the AX tree
488
+ (see Pixel-coordinate clicks below).
489
+
490
+ The `actions=[...]` list on each element is **advisory**, not
491
+ authoritative. cua-driver does not pre-flight check against it —
492
+ `click({pid, element_index})` always attempts `AXPress` (or the
493
+ action you pass) and surfaces whatever the target returns. Many
494
+ apps accept `AXPress` on elements that don't advertise it — Chrome's
495
+ omnibox suggestion `AXMenuItem` is a live example. **Try the click
496
+ first** — pivot only on the returned AX error code.
497
+
498
+ Dispatch table (every row assumes a `(pid, window_id)` pair from the
499
+ last `get_window_state`; `window_id` is required alongside
500
+ `element_index`, ignored on pixel-only forms unless you want to
501
+ anchor the conversion against a specific window):
502
+
503
+ | Intent | Tool | Notes |
504
+ |---|---|---|
505
+ | List an app's windows | `list_windows({pid})` | returns `window_id`, `title`, `bounds`, `z_index`, `is_on_screen`, `on_current_space`. Already included in `launch_app`'s response — only call this for long-lived pids |
506
+ | Snapshot a window | `get_window_state({pid, window_id})` | returns `tree_markdown` + `screenshot_*`; populates the `(pid, window_id)` element_index cache |
507
+ | Left click | `click({pid, window_id, element_index})` | default `action: "press"`. Pixel form: `click({pid, x, y})` (window_id optional — when supplied, pinpoints the anchor window) — `modifier: ["cmd"]` |
508
+ | Double-click / open | `double_click({pid, window_id, element_index})` | AXOpen when advertised (Finder items, openable rows); else stamped pixel double-click at the element's center. Pixel form: `double_click({pid, x, y})` — primer-gated recipe lands on backgrounded Chromium web content (YouTube fullscreen, Finder open-on-dbl). `click({..., count: 2})` still works and routes through the same recipe; `double_click` is the intent-first spelling |
509
+ | Right click / context menu | `right_click({pid, window_id, element_index})` or `click({pid, window_id, element_index, action: "show_menu"})` | Chromium web-content coerces pixel right-click to left — see `WEB_APPS.md` |
510
+ | Type at cursor | `type_text({pid, text, window_id, element_index})` | `AXSelectedText` write; focuses first |
511
+ | Set whole field value | `set_value({pid, window_id, element_index, value})` | sliders, steppers, text fields; **use for keyboard-commit workarounds on minimized windows** |
512
+ | Scroll | `scroll({pid, direction, amount, by, window_id, element_index})` | synthesizes PageUp/PageDown/arrows via SLEventPostToPid |
513
+ | Focus + send key | `press_key({pid, key, window_id, element_index, modifiers})` | element_index sets AXFocused, then posts key |
514
+ | Send key to pid | `press_key({pid, key, modifiers})` | no focus change; key goes to pid's current focus |
515
+ | Modifier combo | `hotkey({pid, keys})` | e.g. `["cmd","c"]`; posted per-pid, not HID tap |
516
+ | Unicode keystrokes | `type_text({pid, text, delay_ms})` | AX write with automatic CGEvent fallback; reaches Chromium/Electron inputs |
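
The addressing rule stated above the table (`window_id` required with `element_index`, optional on the pixel form) can be checked before a call is dispatched. This is a minimal sketch, assuming tool arguments are held in a plain dict. `check_click_args` is a hypothetical client-side helper, not part of cua-driver, and the driver's own validation may differ.

```python
def check_click_args(args: dict) -> str:
    """Return "element" or "pixel" for a well-formed click call, else raise ValueError.

    Hypothetical helper: it mirrors the rules stated in the text,
    not the driver's actual validation logic.
    """
    if "pid" not in args:
        raise ValueError("pid is required")
    if "element_index" in args:
        # The AX form needs the window whose snapshot produced the index.
        if "window_id" not in args:
            raise ValueError("window_id is required alongside element_index")
        return "element"
    if "x" in args and "y" in args:
        # Pixel form: window_id is optional and only anchors the conversion.
        return "pixel"
    raise ValueError("need element_index, or both x and y")


print(check_click_args({"pid": 844, "window_id": 6123, "element_index": 7}))
print(check_click_args({"pid": 844, "x": 120, "y": 340}))
```

Running a check like this on the client side turns a missing `window_id` into an immediate, local error rather than a round-trip to the driver.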
517
+
518
+ **All keyboard/text primitives require `pid`.** There is no
519
+ frontmost-routed variant — every key goes to the named target via
520
+ `CGEvent.postToPid`, so the driver cannot leak keystrokes into the
521
+ user's foreground app.
522
+
523
+ **Why `element_index` is the primary path:** works on hidden /
524
+ occluded / off-Space windows, no focus steal, stable across
525
+ rebuilds, labels tell you what you're clicking. Reach for pixel
526
+ coordinates only when AX can't.
527
+
528
+ ### Pixel-coordinate clicks
529
+
530
+ The pixel path (`click({pid, x, y})`) is for surfaces the AX tree
531
+ doesn't reach — canvases, video players, WebGL, custom-drawn controls.
532
+ Coords are **window-local screenshot pixels** (same space as the PNG
533
+ `get_window_state` returns). Top-left origin, y-down. The driver
534
+ handles screen-point conversion internally. Passing `window_id`
535
+ alongside `x, y` is optional but recommended — it pins the
536
+ coordinate conversion to the window whose screenshot produced the
537
+ pixel, rather than the driver's heuristic choice.
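
Because coordinates are window-local screenshot pixels with a top-left origin, a candidate point can be sanity-checked against the screenshot dimensions before it is sent. The sketch below assumes the caller already has `screenshot_width` and `screenshot_height` from the last `get_window_state`. `pixel_click_args` is a hypothetical helper name, and the driver may apply its own checks.

```python
def pixel_click_args(pid, x, y, screenshot_width, screenshot_height, window_id=None):
    """Build arguments for a pixel click after checking the point lies inside the screenshot.

    Hypothetical helper. Coordinates are window-local, origin at the top-left, y grows downward.
    """
    if not (0 <= x < screenshot_width and 0 <= y < screenshot_height):
        raise ValueError(
            f"({x}, {y}) is outside the {screenshot_width}x{screenshot_height} screenshot"
        )
    args = {"pid": pid, "x": x, "y": y}
    if window_id is not None:
        # Optional but recommended: pin the conversion to the snapshotted window.
        args["window_id"] = window_id
    return args


print(pixel_click_args(844, 120, 340, 1568, 980, window_id=6123))
```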
538
+
539
+ #### Reading coordinates from the PNG
540
+
541
+ PNGs returned by `get_window_state` are capped at **1568 px
542
+ long-side by default** (`max_image_dimension` config), matching
543
+ Anthropic's multimodal-vision downsampling limit. That means the
544
+ image the model reasons over and the image the click tool's
545
+ coordinate system lives in are the **same resolution** — just look
546
+ at the PNG, pick a pixel, click at that pixel. No scaling math.
547
+
548
+ This is the default because the mismatch between "rendered
549
+ thumbnail" and "native PNG" was a recurring coord-estimation
550
+ footgun. If you opt out (explicit `max_image_dimension=0` for
551
+ pixel-perfect verification flows), the old rule applies: don't
552
+ eyeball coords from whatever your client renders — it may be
553
+ 2-4× smaller than the PNG on disk, and a 2% error in thumbnail
554
+ space becomes ~80 px in the real image. Use the crosshair recipe
555
+ below against the full-resolution file in that case.
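
To make the opt-out arithmetic concrete, here is a small worked sketch of mapping a point from a client-rendered thumbnail back to the full-resolution PNG. The sizes are illustrative assumptions (a 4000x2500 PNG displayed 4x smaller), and `thumb_to_png` is a hypothetical helper, not a driver function.

```python
def thumb_to_png(x_thumb, y_thumb, thumb_size, png_size):
    """Map a point in a downscaled thumbnail to pixel coordinates in the full PNG."""
    thumb_w, thumb_h = thumb_size
    png_w, png_h = png_size
    return round(x_thumb * png_w / thumb_w), round(y_thumb * png_h / thumb_h)


# Illustrative sizes only: a 4000x2500 PNG rendered as a 1000x625 thumbnail.
png = (4000, 2500)
thumb = (1000, 625)

target = thumb_to_png(500, 300, thumb, png)
print(target)  # (2000, 1200)

# A 2% horizontal misjudgement in the thumbnail is 20 of its 1000 px.
misjudged = thumb_to_png(520, 300, thumb, png)
error_px = misjudged[0] - target[0]
print(error_px)  # 80
```

With the default 1568 px cap none of this is needed, since the image the model sees and the click coordinate space already match.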
556
+
557
+ 1. `get_window_state({pid, window_id})` returns an image capped
558
+ at 1568 long-side (default) plus its dimensions
559
+ (`screenshot_width` / `screenshot_height`). Write the bytes to
560
+ disk with `--screenshot-out-file <path>` in any capture mode — works
561
+ identically in `vision` (where it's the only way) and `som`
562
+ (where it sidesteps the jq + base64 dance on the spliced
563
+ `screenshot_png_b64` field).
564
+ 2. You are a multimodal model — look at the PNG. Since the PNG
565
+ matches what you see, pick the target pixel directly. No
566
+ fractional math needed.
567
+ 3. When precision matters (small targets, dense UIs), draw a
568
+ crosshair on the image (do **not** crop — cropping loses the
569
+ coordinate system and requires error-prone offset math) and
570
+ show it before clicking:
571
+
572
+ ```python
573
+ from PIL import Image, ImageDraw
574
+ img = Image.open('/tmp/shot.png')
575
+ draw = ImageDraw.Draw(img)
576
+ x, y = 640, 480  # placeholder: replace with your target pixel
577
+ r = 18
578
+ draw.ellipse([x-r, y-r, x+r, y+r], outline='red', width=4)
579
+ draw.line([x-30, y, x+30, y], fill='red', width=3)
580
+ draw.line([x, y-30, x, y+30], fill='red', width=3)
581
+ img.save('/tmp/shot_annotated.png')
582
+ ```
583
+
584
+ 4. Only dispatch the click after the user (or your own re-read of
585
+ the annotated image) confirms the crosshair is on target.
586
+
587
+ #### Addressing variants
588
+
589
+ - `click({pid, x, y})` — single left-click.
590
+ - `click({pid, x, y, count: 2})` — double-click.
591
+ - `click({pid, x, y, modifier: ["cmd"]})` — cmd-click. Accepts any
592
+ subset of `cmd/shift/option/ctrl`.
593
+ - `right_click({pid, x, y})` — also takes `modifier`.
594
+
595
+ The pixel path animates the agent cursor overlay but never warps
596
+ the real cursor. If the pid has no on-screen window the call errors
597
+ with `pid X has no on-screen window` — you need a visible window to
598
+ anchor the conversion.
599
+
600
+ #### How the pixel click is dispatched
601
+
602
+ The recipe is the backgrounded "noraise" sequence: yabai's
603
+ focus-without-raise SLPS event records followed by an off-screen
604
+ user-activation primer and the real click, all stamped via
605
+ `SLEventPostToPid`. The target app becomes AppKit-active for event
606
+ routing but its window does **not** rise to the front of the
607
+ z-stack, and macOS's "switch to Space with windows for app" follow
608
+ is suppressed. Full mechanics in
609
+ `Sources/CuaDriverCore/Input/MouseInput.swift` (`clickViaAuthSignedPost`)
610
+ and the companion `FocusWithoutRaise.swift`.
611
+
612
+ #### Known limits
613
+
614
+ - **Chromium `<video>` play/pause**: pixel click is often rejected
615
+ by HTML5's click-to-play handler on some builds. Use keyboard
616
+ instead: `press_key({pid, key: "k"})` (YouTube) or
617
+ `press_key({pid, key: "space"})` (generic). Keyboard events
618
+ travel through a different auth envelope.
619
+ - **Pixel right-click on Chromium web content** coerces to a
620
+ left-click — a known Chromium renderer-IPC limitation that affects
621
+ every non-HID-tap synthesis path. For context menus on
622
+ AX-addressable elements (links, buttons, toolbar items), use
623
+ `right_click({pid, window_id, element_index})` instead.
624
+
625
+ ### Canvases, viewports, games (Blender, Unity, GHOST, Qt, wxWidgets)
626
+
627
+ Apps whose main surface is an OpenGL / Metal / Qt / wxWidgets
628
+ viewport expose **no useful AX tree** — the whole surface is one
629
+ opaque `AXGroup` or `AXWindow` from AX's perspective. Per-pid event
630
+ paths (`SLEventPostToPid`, `CGEvent.postToPid`) are filtered by the
631
+ viewport's own event-source check and silently dropped — the event
632
+ loop wants "real HID origin".
633
+
634
+ The working pattern:
635
+
636
+ 1. Bring the target frontmost (a brief `osascript activate` is
637
+ acceptable here — this is the carve-out the skill's osascript
638
+ gate allows).
639
+ 2. `CGEvent.post(tap: .cghidEventTap)` with a leading `mouseMoved`
640
+ event (~30 ms before the click). `cua-driver click` when the
641
+ target is frontmost automatically takes this path.
642
+ 3. Accept that the real cursor visibly moves — `cghidEventTap` is
643
+ the system HID stream, the cursor warps to the click point.
644
+
645
+ There is no backgrounded path that reaches these apps today.
646
+
647
+ ## Navigating native menu bars (AXMenuBar)
648
+
649
+ **Only drive the menu bar when the target app is frontmost.** This
650
+ is the single most-misused cua-driver capability. If the target is
651
+ backgrounded, don't reach for `AXMenuBarItem` + AXPick — use
652
+ in-window `element_index` or pixel clicks instead. Two reasons, one
653
+ functional and one perceptual:
654
+
655
+ - **Functional:** menu items that touch document/playback/editor
656
+ state go `DISABLED` when their owning app isn't the key window
657
+ (Preview rotate, IINA speed change, most editor commands). AXPick
658
+ + AXPress will dispatch successfully from the driver's side but
659
+ no-op at the target — you get a silent false-pass.
660
+ - **Perceptual (matters for demos, screen recordings, and anything
661
+ the user watches live):** macOS's screen-rendered menu bar
662
+ always belongs to the *frontmost* app. AXPick on a backgrounded
663
+ app's `AXMenuBarItem` dispatches to that app's per-process menu at
664
+ the AX layer, but any visible menu render happens over the
665
+ frontmost app's menu bar — the viewer sees an IINA submenu
666
+ flashing on top of Chrome's menus, which reads as "the agent
667
+ clicked the wrong app." The AX call was correct; the frame the
668
+ user sees is not. For recorded or observed sessions, this is an
669
+ integrity bug even though it's not a correctness bug.
670
+
671
+ **Good decision rule:** if the target is not already frontmost, do
672
+ not use `AXMenuBarItem` at all. For *reading* in-window state,
673
+ snapshot the window AX tree — most apps expose the same state via
674
+ an in-window `AXStaticText`, title bar, or toolbar. For *dispatching*
675
+ actions, use in-window `element_index` (buttons, toolbar items) or
676
+ pixel clicks on in-window controls — both dispatch via AppKit's
677
+ window-under-pointer hit-test and are **not** frontmost-gated.
678
+
679
+ When the target IS frontmost, the menu-bar flow below is fine and
680
+ is the canonical path for menus.
681
+
682
+ ### The two-snapshot pattern (target frontmost only)
683
+
684
+ Menu contents are a two-snapshot flow. Closed AXMenu subtrees are
685
+ deliberately skipped during snapshot — otherwise every app's File /
686
+ Edit / View hierarchy plus every Recent Items macOS has ever seen
687
+ would inflate the tree 10-100x. But once a menu is *open*, its
688
+ AXMenuItem children do receive `element_index` values so you can
689
+ click them normally.
690
+
691
+ 1. Find the `[N] AXMenuBarItem "<Menu Name>"` in the tree.
692
+ 2. `click({pid, window_id, element_index: N, action: "pick"})`: menu bar items
693
+ implement `AXPick` ("open my submenu"), not `AXPress`. Using the
694
+ default action on an AXMenuBarItem is a no-op.
695
+ 3. Re-snapshot. The expanded menu's items now appear under the bar
696
+ item as `[M] AXMenuItem "<Item Name>"`.
697
+ 4. Click the target item — most items respond to `AXPress` (default
698
+ action). Submenus nest under the item and are walked the same way.
699
+ 5. Re-snapshot and verify.
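
Steps 1 and 3 both come down to locating a `[N] Role "Title"` entry in the snapshot text. The sketch below shows one way to do that with a regular expression. It assumes only the `[N] AXMenuBarItem "<Menu Name>"` shape quoted above. The sample fragments are made up for illustration, and real `tree_markdown` lines may carry more detail. `find_index` is a hypothetical helper.

```python
import re


def find_index(tree_markdown: str, role: str, title: str):
    """Return the element_index of the first `[N] <role> "<title>"` entry, or None."""
    pattern = re.compile(
        r'\[(\d+)\]\s+' + re.escape(role) + r'\s+"' + re.escape(title) + r'"'
    )
    match = pattern.search(tree_markdown)
    return int(match.group(1)) if match else None


# Made-up snapshot fragments: menu closed, then re-snapshotted after the pick.
closed = '- [12] AXMenuBarItem "File"\n- [13] AXMenuBarItem "Edit"\n'
opened = closed + '  - [40] AXMenuItem "Undo"\n'

print(find_index(closed, "AXMenuBarItem", "Edit"))  # 13
print(find_index(closed, "AXMenuItem", "Undo"))     # None
print(find_index(opened, "AXMenuItem", "Undo"))     # 40
```

The item index only appears in the second snapshot, which is why the lookup has to be repeated after the pick rather than cached from the first tree.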
700
+
701
+ If you ever need to back out without selecting, `press_key({pid, key:
702
+ "escape"})` closes the open menu. Leaving a menu expanded between
703
+ turns poisons subsequent snapshots for that pid.
704
+
705
+ ### Commands gated on the target being frontmost
706
+
707
+ Some menu items and global shortcuts (Preview's Tools → Rotate
708
+ Right, ⌘R; anything in the View menu that manipulates the current
709
+ document; most editor commands) are **disabled unless the target
710
+ app is the key / frontmost window**. You'll see it in the AX tree
711
+ as `DISABLED` on the menu item even though the user's intent is
712
+ obviously valid.
713
+
714
+ Before activating, confirm you're in this narrow case — the menu
715
+ item still reads `DISABLED` after a fresh snapshot AND the action
716
+ the user requested genuinely requires frontmost (Preview rotate,
717
+ View menu document manipulation, editor commands). If either
718
+ check fails, don't activate.
719
+
720
+ When both checks pass, the driver has no `activate` tool
721
+ (deliberately — the whole point is backgroundable control), so
722
+ this is the one legitimate `osascript` fallback:
723
+
724
+ ```
725
+ osascript -e 'tell application "<App Name>" to activate'
726
+ ```
727
+
728
+ Then re-snapshot — the menu item loses its `DISABLED` tag — and
729
+ `click({action: "pick"})` the item. Alternatively, a `hotkey`
730
+ call delivered to the now-frontmost app works for the shortcut
731
+ form (`⌘R`, `⌘+`, etc.).
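
The two-check gate can be written down explicitly so the activation never happens by default. This is a minimal sketch in which both checks are supplied by the caller as booleans. `activate_command` is a hypothetical helper that only builds the `osascript` argument list shown above and does not run it.

```python
def activate_command(app_name: str, item_disabled: bool, needs_frontmost: bool):
    """Return the osascript argv when both checks pass, else None.

    item_disabled: the menu item still reads DISABLED after a fresh snapshot.
    needs_frontmost: the requested action genuinely requires the app to be frontmost.
    """
    if not (item_disabled and needs_frontmost):
        return None
    # Escape backslashes and double quotes for the AppleScript string literal.
    escaped = app_name.replace("\\", "\\\\").replace('"', '\\"')
    return ["osascript", "-e", f'tell application "{escaped}" to activate']


print(activate_command("Preview", item_disabled=True, needs_frontmost=True))
print(activate_command("Preview", item_disabled=True, needs_frontmost=False))
```

The returned list can be handed to `subprocess.run(argv, check=True)`. Passing an argument list rather than a shell string avoids a second layer of quoting.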
732
+
733
+ **Always name the focus steal in your response** so the user isn't
734
+ surprised — "Briefly activating Preview to enable Tools → Rotate
735
+ Right" or similar. Don't silently steal focus. You don't need to
736
+ restore the previous frontmost afterwards unless the user asks —
737
+ they can cmd-tab back.
738
+
739
+ ## Web-rendered apps (browsers, Electron, Tauri)
740
+
741
+ For Chrome / Edge / Brave / Arc / Safari, Electron apps (Slack,
742
+ VSCode, Notion, Discord), and Tauri apps — see **`WEB_APPS.md`**.
743
+
744
+ Covers: sparse AX tree population (retry-once pattern for Chromium),
745
+ URL navigation via omnibox suggestions, the `set_value` workaround
746
+ for keyboard commits on **minimized** windows (Return silently
747
+ no-ops — symptom is a macOS system beep; use `set_value` or click a
748
+ clickable equivalent), scrolling via synthetic PageUp/Down keystrokes,
749
+ in-page clicks, and typing into web inputs.
750
+
751
+ Chromium web content specifically also coerces `right_click` back to
752
+ left — use `element_index` for AX-addressable targets and accept the
753
+ limit otherwise.
754
+
755
+ ### Browser JS primitives — `page` tool and `get_window_state(javascript=)`
756
+
757
+ When the AX tree doesn't expose the data you need (common in
758
+ Chromium/Electron — the tree is sparse for web content), use the
759
+ `page` tool or the `javascript` param on `get_window_state` to query
760
+ the DOM directly via Apple Events. Requires "Allow JavaScript from
761
+ Apple Events" to be enabled — see `WEB_APPS.md` for the setup path.
762
+
763
+ **Three actions on the `page` tool:**
764
+
765
+ - `page({pid, window_id, action: "get_text"})` — returns
766
+ `document.body.innerText`. Fastest way to read page content, prices,
767
+ article text, or any raw text the AX tree truncates or omits.
768
+
769
+ - `page({pid, window_id, action: "query_dom", css_selector: "a[href]",
770
+ attributes: ["href"]})` — runs `querySelectorAll` and returns each
771
+ match's tag, text, and requested attributes as a JSON array. Use for
772
+ table rows, link hrefs, data attributes, structured page data.
773
+
774
+ - `page({pid, window_id, action: "execute_javascript", javascript:
775
+ "..."})` — raw JS. Wrap in an IIFE with try-catch. Don't use this for
776
+ elements already indexed by `get_window_state` — `click` and
777
+ `set_value` are more reliable there.
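
As an illustration of the "IIFE with try-catch" advice, the sketch below wraps a JavaScript expression on the client side and builds the argument dict for `execute_javascript`. `wrap_js` and `page_call` are hypothetical helper names, and returning the error as an `ERROR: ...` string is an assumption about what is convenient to read back, not documented driver behaviour.

```python
def wrap_js(expression: str) -> str:
    """Wrap a JS expression in an IIFE with try/catch so failures come back as a string."""
    return (
        "(() => { try { return (" + expression + "); } "
        "catch (e) { return 'ERROR: ' + String(e); } })()"
    )


def page_call(pid: int, window_id: int, expression: str) -> dict:
    """Build arguments for the page tool's execute_javascript action."""
    return {
        "pid": pid,
        "window_id": window_id,
        "action": "execute_javascript",
        "javascript": wrap_js(expression),
    }


print(wrap_js("document.title"))
print(page_call(844, 6123, "document.querySelectorAll('tr').length")["action"])
```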
778
+
779
+ **Co-located read — `get_window_state` with `javascript`:**
780
+
781
+ ```
782
+ get_window_state({pid, window_id, javascript: "document.title"})
783
+ ```
784
+
785
+ Runs the JS and appends the result as a `## JavaScript result` section
786
+ alongside the AX snapshot — one round-trip instead of two. Use this
787
+ when you need both the element tree (for subsequent clicks) and some
788
+ page data in the same turn.
789
+
790
+ **Decision rule — AX vs JS:**
791
+
792
+ | Need | Use |
793
+ |---|---|
794
+ | Click / type into an element | `get_window_state` → `click` / `set_value` (AX, works backgrounded) |
795
+ | Read text the AX tree drops | `page(get_text)` or `get_window_state(javascript=)` |
796
+ | Scrape structured data (tables, hrefs) | `page(query_dom)` |
797
+ | Trigger JS events / mutations | `page(execute_javascript)` |
798
+
799
+ Supported backends:
800
+
801
+ | App type | How | Context |
802
+ |---|---|---|
803
+ | Chrome / Brave / Edge | Apple Events `execute javascript` | Full DOM ✅ |
804
+ | Safari | Apple Events `do JavaScript` | Full DOM ✅ |
805
+ | Electron (VS Code, Cursor…) | SIGUSR1 → V8 inspector → CDP | Main process only: `process`, `Buffer` — no `document`, no `require` in sandboxed apps |
806
+ | Electron (with `--remote-debugging-port`) | CDP page target | Full DOM ✅ |
807
+
808
+ **Electron sandbox note:** SIGUSR1 connects to the Node.js *main* process.
809
+ Sandboxed Electron apps (VS Code, Cursor) strip `require` and Electron
810
+ APIs there. Useful for: `process.env`, `process.versions`, `process.cwd()`,
811
+ `process.pid`. For full DOM/renderer access, launch the app with
812
+ `--remote-debugging-port=9222` — cua-driver will detect and prefer the
813
+ page target automatically.
814
+
815
+ Arc returns no values; Firefox has no JS-via-AppleEvents support — see
816
+ `WEB_APPS.md` for the full matrix.
817
+
818
+ ### 3. Re-snapshot and verify — mandatory
819
+
820
+ **Always** call `get_window_state({pid, window_id})` after the action.
821
+ This isn't optional verification — it's the second half of the
822
+ snapshot invariant.
823
+
824
+ Check the AX tree diff: a changed value, a new element, a new
825
+ window, or the disappearance of the thing you just clicked (menus
826
+ collapse after selection, buttons may become disabled, etc.). If
827
+ nothing changed, the action likely failed silently — **tell the
828
+ user what you attempted and what you observed**, don't paper over
829
+ with "done" language. Agents that skip this step report success on
830
+ silently-dropped actions — the single most common failure mode.
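
One way to check "did anything change" is a line diff between the `tree_markdown` from before and after the action. The sketch below uses Python's standard `difflib`. The snapshot strings are made-up fragments, and `tree_changes` is a hypothetical helper rather than a driver feature.

```python
import difflib


def tree_changes(before: str, after: str) -> list:
    """Return the lines removed ("- ") or added ("+ ") between two tree snapshots."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only real changes.
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Made-up fragments: a Play button that flips to Pause after the click.
before = '- [3] AXButton "Play"\n- [4] AXStaticText "0:00"'
after = '- [3] AXButton "Pause"\n- [4] AXStaticText "0:00"'

changes = tree_changes(before, after)
for line in changes:
    print(line)

if not tree_changes(before, before):
    print("no change: report what was attempted and what was observed")
```

Indices can shift between snapshots, so a noisy diff is not by itself proof of success. An empty diff, though, is a strong signal that the action was dropped.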
831
+
832
+ ## Recording trajectories
833
+
834
+ Session-scoped action recording + replay, for demos, regressions, and
835
+ training data. Only invoke when the user explicitly asks to record a
836
+ session — the skill does not auto-enable this. CLI surface:
837
+ `cua-driver recording start|stop|status`; raw tool: `set_recording`.
838
+
839
+ See **`RECORDING.md`** for the full flow: enable/disable, turn folder
840
+ contents, replay via `replay_trajectory`, and the element_index
841
+ doesn't-survive-across-sessions caveat.
842
+
843
+ ## Common error patterns
844
+
845
+ | Error text | Meaning | Fix |
846
+ |---|---|---|
847
+ | `No cached AX state for pid X window_id W` | You either skipped `get_window_state` this turn, or passed a different `window_id` to the click than the one the snapshot cached against | Call `get_window_state({pid: X, window_id: W})` first — the same window_id you intend to click in |
848
+ | `Invalid element_index N for pid X window_id W` | Index is stale or out of range | Re-run `get_window_state` with the same window_id, pick a fresh index from the new tree |
849
+ | `window_id W belongs to pid P, not …` | Passed a window_id that's owned by a different process | Use `list_windows({pid: X})` to enumerate this pid's own windows |
850
+ | `AX action AXPress failed with code …` | Element doesn't support AXPress | Try `show_menu`, `confirm`, `cancel`, or `pick` |
851
+ | macOS system-alert beep on `press_key` with no visible change | Target window is minimized; Return / Space / Tab commits don't establish real renderer focus on minimized windows | AX-click a clickable equivalent (Go button, Submit button, checkbox) instead of pressing the key; see "Keyboard commits on minimized windows" under the Browser section |
852
+ | `Accessibility permission not granted` | TCC not granted | Stop; tell user to grant in System Settings |
853
+ | `Screen Recording permission not granted` | TCC not granted for capture | Affects `screenshot` and `get_window_state` (which always captures). Grant in System Settings — the driver can't operate without it |
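
The table above lends itself to a simple lookup from error text to next step. This is a hedged sketch: the patterns are taken from the error strings in the table, while `ERROR_FIXES` and `suggest_fix` are hypothetical names and the fix strings are paraphrases, not driver output.

```python
import re

# (pattern, suggested next step), in the same order as the table above.
ERROR_FIXES = [
    (r"No cached AX state for pid",
     "call get_window_state with the same pid and window_id first"),
    (r"Invalid element_index",
     "re-run get_window_state and pick a fresh index"),
    (r"window_id \S+ belongs to pid",
     "call list_windows for the intended pid"),
    (r"AX action AXPress failed",
     "retry with show_menu, confirm, cancel, or pick"),
    (r"(Accessibility|Screen Recording) permission not granted",
     "stop and ask the user to grant it in System Settings"),
]


def suggest_fix(error_text: str):
    """Return the suggested next step for a known error message, or None."""
    for pattern, fix in ERROR_FIXES:
        if re.search(pattern, error_text):
            return fix
    return None


print(suggest_fix("Invalid element_index 7 for pid 844 window_id 6123"))
print(suggest_fix("Screen Recording permission not granted"))
```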
854
+
855
+ ## Things to avoid
856
+
857
+ - **Never** reuse an `element_index` across a re-snapshot of the same pid.
858
+ - **Never** translate screenshot pixels into a click when the target
859
+ has an `element_index`. Pixel coordinates are only for surfaces the
860
+ AX tree doesn't reach.
861
+ - **Prefer AX over pixels.** `click({pid, x, y})` works for
862
+ canvas / WebView regions, but it lands blindly, with no AX label
863
+ to confirm the target. Exhaust AX paths (menu bars, cmd-k palettes,
864
+ toolbar items, keyboard shortcuts) before dropping to coordinates.
865
+ - **Never** drive destructive actions (delete files, close unsaved
866
+ documents, send messages, submit forms) without explicit user
867
+ intent for that specific destructive step.
868
+ - **Never** launch apps autonomously; confirm with the user first
869
+ unless their original request clearly implies the launch.
870
+
871
+ ## Example end-to-end task
872
+
873
+ **User:** "Open the Downloads folder in Finder."
874
+
875
+ 1. `launch_app({bundle_id: "com.apple.finder", urls: ["~/Downloads"]})`
876
+ → `{pid: 844, windows: [{window_id: 6123, title: "Downloads", ...}]}`.
877
+ Idempotent launch; Finder also opens a hidden window rooted at
878
+ `~/Downloads` via `application(_:open:)` — zero activation, no
879
+ focus steal. The `windows` array lets you skip a `list_windows` hop.
880
+ 2. `get_window_state({pid: 844, window_id: 6123})` → verify an
881
+ `AXWindow` whose title contains "Downloads" is present with a
882
+ populated AX subtree (sidebar, list view, files).
883
+ 3. Done.
884
+
885
+ If the user instead asks to navigate *within* an already-open Finder
886
+ window, use the menu-bar flow from the "Navigating native menu bars"
887
+ section above (click Go → pick a menu item → re-snapshot → click it).