@amaster.ai/pi-computer-use 0.1.1-beta.0 → 0.1.1
- package/README.md +136 -0
- package/bin/darwin-arm64/.version +2 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/CodeResources +0 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Info.plist +32 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/MacOS/cua-driver +0 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/README.md +140 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/RECORDING.md +113 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/SKILL.md +887 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/TESTS.md +232 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/Resources/Skills/cua-driver/WEB_APPS.md +471 -0
- package/bin/darwin-arm64/CuaDriver.app/Contents/_CodeSignature/CodeResources +172 -0
- package/bin/darwin-x64/.version +2 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/CodeResources +0 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Info.plist +32 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/MacOS/cua-driver +0 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/README.md +140 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/RECORDING.md +113 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/SKILL.md +887 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/TESTS.md +232 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/Resources/Skills/cua-driver/WEB_APPS.md +471 -0
- package/bin/darwin-x64/CuaDriver.app/Contents/_CodeSignature/CodeResources +172 -0
- package/bin/linux-x64/.version +2 -0
- package/bin/linux-x64/cua-driver +0 -0
- package/bin/win32-arm64/.version +2 -0
- package/bin/win32-arm64/cua-driver-uia.exe +0 -0
- package/bin/win32-arm64/cua-driver.exe +0 -0
- package/bin/win32-x64/.version +2 -0
- package/bin/win32-x64/cua-driver-uia.exe +0 -0
- package/bin/win32-x64/cua-driver.exe +0 -0
- package/dist/config.d.ts +7 -20
- package/dist/config.d.ts.map +1 -1
- package/dist/config.js +8 -18
- package/dist/config.js.map +1 -1
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +553 -72
- package/dist/index.js.map +1 -1
- package/dist/mcp-client.d.ts +22 -0
- package/dist/mcp-client.d.ts.map +1 -0
- package/dist/mcp-client.js +91 -0
- package/dist/mcp-client.js.map +1 -0
- package/dist/vision.d.ts.map +1 -1
- package/dist/vision.js +19 -0
- package/dist/vision.js.map +1 -1
- package/package.json +25 -5
- package/preview.png +0 -0
- package/scripts/postinstall.js +29 -0
- package/dist/__tests__/computer-client.test.d.ts +0 -2
- package/dist/__tests__/computer-client.test.d.ts.map +0 -1
- package/dist/__tests__/computer-client.test.js +0 -174
- package/dist/__tests__/computer-client.test.js.map +0 -1
- package/dist/__tests__/index.test.d.ts +0 -2
- package/dist/__tests__/index.test.d.ts.map +0 -1
- package/dist/__tests__/index.test.js +0 -385
- package/dist/__tests__/index.test.js.map +0 -1
- package/dist/__tests__/server-process.test.d.ts +0 -2
- package/dist/__tests__/server-process.test.d.ts.map +0 -1
- package/dist/__tests__/server-process.test.js +0 -127
- package/dist/__tests__/server-process.test.js.map +0 -1
- package/dist/__tests__/vision.test.d.ts +0 -2
- package/dist/__tests__/vision.test.d.ts.map +0 -1
- package/dist/__tests__/vision.test.js +0 -36
- package/dist/__tests__/vision.test.js.map +0 -1
- package/dist/actions.d.ts +0 -15
- package/dist/actions.d.ts.map +0 -1
- package/dist/actions.js +0 -45
- package/dist/actions.js.map +0 -1
- package/dist/computer-client.d.ts +0 -13
- package/dist/computer-client.d.ts.map +0 -1
- package/dist/computer-client.js +0 -109
- package/dist/computer-client.js.map +0 -1
- package/dist/server-process.d.ts +0 -9
- package/dist/server-process.d.ts.map +0 -1
- package/dist/server-process.js +0 -76
- package/dist/server-process.js.map +0 -1
package/README.md
ADDED
@@ -0,0 +1,136 @@
+# @amaster.ai/pi-computer-use
+
+
+
+pi-coding-agent extension that wraps [cua-driver-rs](https://github.com/trycua/cua/), exposing desktop automation tools with a `computer_use_` prefix.
+
+## Features
+
+- **Zero external dependencies** — pre-compiled cua-driver-rs binaries bundled for all platforms
+- **MCP stdio communication** — spawns `cua-driver mcp` via `StdioClientTransport`, JSON-RPC over stdio
+- **Dynamic tool discovery** — auto-discovers upstream MCP tools and registers with `computer_use_` prefix; falls back to a built-in tool list when cua-driver fails to start
+- **Smart tool filtering** — excludes non-essential tools (agent cursor, recording, config, raw screenshot), exposes 17 action tools + 1 vision tool
+- **Optional visual analysis** — `computer_use_analyze_screenshot` via configurable vision model
+- **Cross-platform permission handling** — detects platform-specific permission issues (macOS TCC, Windows UAC, Linux display server access) and returns actionable guidance
+- **Graceful degradation** — tools are always registered even when cua-driver cannot connect; lazy reconnect is attempted on each tool call
+
+## Install
+
+```bash
+bun add @amaster.ai/pi-computer-use
+```
+
+Requires Node.js >= 20 and `@earendil-works/pi-coding-agent >= 0.74.0`.
+
+## Usage
+
+Install the package and pi-coding-agent will automatically discover and load the extension. All tools are registered on `session_start`.
+
+Configure via `.pi/settings.json` (project-level) or `~/.pi/agent/settings.json` (user-level) under the `"pi-computer-use"` key:
+
+```json
+{
+  "pi-computer-use": {
+    "mode": "bundled"
+  }
+}
+```
+
+## Configuration
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `mode` | `'bundled' \| 'path'` | `'bundled'` | Binary resolution strategy |
+| `binaryPath` | `string` | — | Custom cua-driver binary path (requires `mode: 'path'`) |
+| `extraArgs` | `string[]` | — | Extra CLI arguments passed to cua-driver |
+| `visionModel` | `VisionModelConfig` | — | Enable visual screenshot analysis |
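The option table above can be read as a small validation rule: `mode` defaults to `'bundled'`, and `binaryPath` only makes sense together with `mode: 'path'`. The sketch below is one way to express that rule, not the package's actual code. The names `ComputerUseConfig` and `parse_config` are hypothetical, and raising an error when `mode` is `'path'` without a `binaryPath` is an assumption about how such a mismatch might be handled.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ComputerUseConfig:
    # Field names mirror the README table; the class itself is hypothetical.
    mode: str = "bundled"
    binary_path: Optional[str] = None
    extra_args: list = field(default_factory=list)
    vision_model: Optional[dict] = None


def parse_config(settings: dict) -> ComputerUseConfig:
    """Read the "pi-computer-use" key from a settings dict and validate it."""
    raw: dict[str, Any] = settings.get("pi-computer-use", {})
    mode = raw.get("mode", "bundled")  # documented default
    if mode not in ("bundled", "path"):
        raise ValueError(f"unknown mode: {mode!r}")
    binary_path = raw.get("binaryPath")
    # Assumption: a 'path' mode without a binaryPath is treated as an error.
    if mode == "path" and not binary_path:
        raise ValueError("mode 'path' requires binaryPath")
    return ComputerUseConfig(
        mode=mode,
        binary_path=binary_path,
        extra_args=list(raw.get("extraArgs", [])),
        vision_model=raw.get("visionModel"),
    )
```

Calling `parse_config({})` yields the bundled defaults, which matches the minimal settings file shown in the Usage section.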
+
+### Vision Model (Optional)
+
+Enable `computer_use_analyze_screenshot` by referencing a model already configured in Pi's model registry (`models.json`):
+
+```json
+{
+  "pi-computer-use": {
+    "visionModel": {
+      "provider": "openai",
+      "model": "gpt-4o"
+    }
+  }
+}
+```
+
+The extension resolves API key, base URL, and headers from the model registry automatically — no need to duplicate credentials here.
+
+## Exposed Tools (17 + 1 vision)
+
+### Input
+
+| Tool | Description |
+|------|-------------|
+| `computer_use_click` | Left-click via element_index or x/y coordinates |
+| `computer_use_double_click` | Double-click at x/y or on an AX element |
+| `computer_use_right_click` | Right-click (context menu) |
+| `computer_use_type_text` | Insert text via AX or CGEvent fallback |
+| `computer_use_press_key` | Press and release a single key |
+| `computer_use_hotkey` | Press a key combination (e.g. Cmd+C) |
+| `computer_use_scroll` | Scroll by line or page in a direction |
+| `computer_use_drag` | Press-drag-release gesture between two points |
+| `computer_use_set_value` | Set value on UI elements (popups, sliders, steppers) |
+
+### Query
+
+| Tool | Description |
+|------|-------------|
+| `computer_use_get_screen_size` | Get display dimensions and scale factor |
+| `computer_use_get_cursor_position` | Get current mouse cursor position |
+| `computer_use_get_accessibility_tree` | Lightweight desktop snapshot (apps, windows, bounds) |
+| `computer_use_get_window_state` | Full AX tree of a window with actionable element indices |
+| `computer_use_list_windows` | List all top-level windows with bounds and z-order |
+| `computer_use_list_apps` | List running and installed apps with state flags |
+
+### App Lifecycle
+
+| Tool | Description |
+|------|-------------|
+| `computer_use_launch_app` | Launch an app in the background without focus steal |
+| `computer_use_kill_app` | Force-terminate a process by pid |
+
+### Vision (requires `visionModel` config)
+
+| Tool | Description |
+|------|-------------|
+| `computer_use_analyze_screenshot` | Take a screenshot and analyze it with a vision model |
+
+## Excluded Tools (16)
+
+Agent cursor styling, recording/replay, config management, zoom, raw screenshot (use `analyze_screenshot` instead), and browser-specific operations are filtered out.
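The filter-then-prefix behaviour described by the tool tables and the exclusion paragraph can be sketched as follows. This is an illustration only, not the extension's implementation. The `EXCLUDED` set is a hypothetical, partial stand-in built from tool names that appear elsewhere in this diff (`screenshot`, `set_recording`, `get_recording_state`, `replay_trajectory`); the README names 16 excluded tools by category and does not list them individually.

```python
PREFIX = "computer_use_"

# Hypothetical, incomplete exclusion set: the README gives categories,
# not the 16 exact upstream names.
EXCLUDED = {"screenshot", "set_recording", "get_recording_state", "replay_trajectory"}


def expose_tools(upstream_names, excluded=EXCLUDED, prefix=PREFIX):
    """Drop excluded upstream tools and prefix the rest, preserving order."""
    return [prefix + name for name in upstream_names if name not in excluded]


if __name__ == "__main__":
    discovered = ["click", "screenshot", "set_recording", "list_windows"]
    print(expose_tools(discovered))
```

Keeping the exclusion list as data rather than hard-coding the exposed names means newly added upstream tools would be exposed by default, which is one plausible reading of "dynamic tool discovery".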
+
+## Permissions
+
+On `session_start`, the extension checks permissions via cua-driver's `check_permissions` tool. Platform-specific guidance is provided:
+
+| Platform | Accessibility | Screen Capture |
+|----------|--------------|----------------|
+| macOS | System Settings → Privacy & Security → Accessibility | System Settings → Privacy & Security → Screen & System Audio Recording |
+| Windows | Run as Administrator / UI Automation access | Check DRM or security policy |
+| Linux | AT-SPI accessibility service | PipeWire portal or X11 access |
+
+When cua-driver fails to connect (missing permissions, binary not found, etc.):
+1. User is notified with a platform-appropriate warning
+2. Tools are still registered using a built-in fallback schema
+3. On each tool call, lazy reconnect is attempted; if it still fails, a friendly error with permission instructions is returned
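The three fallback steps above amount to a lazy-reconnect wrapper: the tool stays registered, and each call first tries to establish the connection if none exists. The sketch below models that idea under stated assumptions. `LazyDriver`, the `connect` callable, and the result dictionary shape are all hypothetical; the real extension talks to cua-driver over MCP stdio, which is not reproduced here.

```python
class LazyDriver:
    """Retry the connection on every call until it succeeds (hypothetical sketch)."""

    def __init__(self, connect, guidance):
        self._connect = connect    # callable returning a client, or raising on failure
        self._guidance = guidance  # platform-specific permission instructions
        self._client = None

    def call(self, tool, args):
        if self._client is None:
            try:
                self._client = self._connect()
            except Exception as exc:
                # Still failing: return a readable error instead of raising.
                return {"ok": False, "error": f"cua-driver unavailable: {exc}. {self._guidance}"}
        # The client is modelled as a plain callable for illustration.
        return {"ok": True, "result": self._client(tool, args)}
```

Once a connection succeeds it is cached, so later calls skip the reconnect attempt.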
+
+## Supported Platforms
+
+| Platform | Binary |
+|----------|--------|
+| macOS ARM64 | `bin/darwin-arm64/cua-driver` |
+| macOS x64 | `bin/darwin-x64/cua-driver` |
+| Linux x64 | `bin/linux-x64/cua-driver` |
+| Windows x64 | `bin/win32-x64/cua-driver.exe` |
+| Windows ARM64 | `bin/win32-arm64/cua-driver.exe` |
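A lookup keyed on platform and architecture is one simple way to model the table above. The paths below are copied from that table as written; note that the file listing at the top of this diff places the macOS executable inside `CuaDriver.app/Contents/MacOS/`, so the real resolution logic may differ. The function name `bundled_binary` and the use of Node-style platform names (`darwin`, `linux`, `win32`, `arm64`, `x64`) are assumptions.

```python
# Paths as given in the README table (relative to the package root).
SUPPORTED = {
    "darwin-arm64": "bin/darwin-arm64/cua-driver",
    "darwin-x64": "bin/darwin-x64/cua-driver",
    "linux-x64": "bin/linux-x64/cua-driver",
    "win32-x64": "bin/win32-x64/cua-driver.exe",
    "win32-arm64": "bin/win32-arm64/cua-driver.exe",
}


def bundled_binary(platform_name: str, arch: str) -> str:
    """Return the bundled binary path for a Node-style platform/arch pair."""
    key = f"{platform_name}-{arch}"
    try:
        return SUPPORTED[key]
    except KeyError:
        raise RuntimeError(f"unsupported platform: {key}") from None
```

An unsupported combination such as `linux-arm64` fails loudly here; with `mode: 'path'` a user could presumably point at their own build instead.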
+
+## License
+
+Apache-2.0
Binary file
@@ -0,0 +1,32 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+  <key>CFBundleIdentifier</key>
+  <string>com.trycua.driver</string>
+  <key>CFBundleName</key>
+  <string>Cua Driver</string>
+  <key>CFBundleDisplayName</key>
+  <string>Cua Driver</string>
+  <key>CFBundleExecutable</key>
+  <string>cua-driver</string>
+  <key>CFBundleIconFile</key>
+  <string>AppIcon</string>
+  <key>CFBundleIconName</key>
+  <string>AppIcon</string>
+  <key>CFBundlePackageType</key>
+  <string>APPL</string>
+  <key>CFBundleShortVersionString</key>
+  <string>0.2.0</string>
+  <key>CFBundleVersion</key>
+  <string>1</string>
+  <key>LSMinimumSystemVersion</key>
+  <string>14.0</string>
+  <key>LSUIElement</key>
+  <true/>
+  <key>NSHighResolutionCapable</key>
+  <true/>
+  <key>NSSupportsAutomaticTermination</key>
+  <true/>
+</dict>
+</plist>
Binary file
@@ -0,0 +1,140 @@
+# cua-driver — Claude Code skill
+
+A [Claude Code](https://code.claude.com) skill that teaches Claude to
+drive native macOS apps via the
+[`cua-driver`](https://github.com/trycua/cua/tree/main/libs/cua-driver)
+CLI — snapshot an app's accessibility tree, click/type/scroll by
+`element_index`, and verify via re-snapshot. Backgrounded-first: no
+focus steal, no cursor warp, no Space follow.
+
+## What the skill covers
+
+- The snapshot-before-AND-after invariant that keeps the agent honest
+  about whether an action actually landed.
+- The backgrounded-click recipe (yabai focus-without-raise + stamped
+  SLEventPostToPid) that lets synthetic clicks land on Chrome web
+  content without raising the window or pulling the user across Spaces.
+- Web-app quirks (`WEB_APPS.md`) — Chromium/WebKit/Electron/Tauri,
+  including the minimized-Chrome keyboard-commit caveat and the
+  `set_value` workaround.
+- Trajectory recording (`RECORDING.md`) — optional per-session
+  recording + replay for demos and regressions.
+- Canvas/viewport apps (Blender, Unity, GHOST, Qt, wxWidgets) —
+  HID-tap fallback when AX is empty.
+
+See `SKILL.md` for the main body.
+
+## Prerequisites
+
+1. **macOS 14 or newer** — the driver depends on SkyLight private SPIs
+   that were stabilized in Sonoma.
+2. **`cua-driver` CLI + `CuaDriver.app`** — installable one-liner:
+   ```bash
+   /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh)"
+   ```
+   Or from a clone of `trycua/cua`:
+   ```bash
+   cd libs/cua-driver
+   scripts/install-local.sh # builds + installs + symlinks for dev use
+   ```
+   The driver runs as an `.app` bundle because macOS TCC grants are
+   tied to a stable bundle id (`com.trycua.driver`). The CLI symlink
+   lets Claude invoke tools via plain shell.
+3. **TCC grants on `CuaDriver.app`** — **Accessibility** and
+   **Screen Recording** in System Settings → Privacy & Security.
+   Verify with:
+   ```bash
+   cua-driver check_permissions
+   ```
+   Both fields must be `true`. If not, the app appears in the
+   relevant panes of System Settings after first use; toggle it on
+   there.
+
+## Install
+
+The skill is two drop-in directories.
+
+**Personal scope** (all Claude Code sessions on your machine):
+
+```bash
+mkdir -p ~/.claude/skills
+cp -R Skills/cua-driver ~/.claude/skills/
+```
+
+Or symlink if you want edits-in-place:
+
+```bash
+ln -s "$PWD/Skills/cua-driver" ~/.claude/skills/cua-driver
+```
+
+**Project scope** (committed alongside a specific repo):
+
+```bash
+mkdir -p .claude/skills
+cp -R /path/to/cua/libs/cua-driver/Skills/cua-driver .claude/skills/
+```
+
+## Invoking the skill
+
+Claude Code auto-invokes the skill when you ask for macOS GUI
+automation — e.g. "open the Downloads folder in Finder", "click the
+Save button in Numbers", "navigate to trycua.com in Chrome". You can
+also invoke it explicitly:
+
+```
+/cua-driver
+```
+
+## Claude Code MCP compatibility mode
+
+For normal skill-driven use, prefer the CLI or the standard MCP server. If you want Claude Code's vision/computer-use-style flow to ground on CuaDriver screenshots, register the compatibility server:
+
+```bash
+claude mcp add --transport stdio cua-computer-use -- cua-driver mcp --claude-code-computer-use-compat
+```
+
+This mode exposes the normal CuaDriver tools and changes only `screenshot`. The compatibility screenshot requires `pid` and `window_id`, captures that window only, and establishes a window-local pixel coordinate frame. It does not call Anthropic APIs or expose Anthropic's native computer-use API tool.
+
+Use MCP for this Claude Code vision/computer-use-style path. CLI screenshots still work as CuaDriver calls, but they do not expose the `mcp__cua-computer-use__screenshot` tool name that Claude Code appears to use as the image-grounding cue.
+
+## Files
+
+- `SKILL.md` — the main skill body (~500 lines). Loaded on first
+  invocation; stays in context for the session.
+- `WEB_APPS.md` — browsers, Electron, Tauri (Chromium + WebKit). Loaded
+  on demand when SKILL.md's pointer is followed.
+- `RECORDING.md` — trajectory recording / replay. Loaded on demand.
+- `TESTS.md` — manual test scripts for end-to-end skill verification.
+
+## Troubleshooting
+
+- `cua-driver: command not found` → re-run the installer or add
+  `.build/CuaDriver.app/Contents/MacOS/` to `$PATH`.
+- `No cached AX state for pid X window_id W` → element_index was
+  reused across turns, or across different windows of the same app.
+  Call `get_window_state({pid, window_id})` first in the same turn,
+  with the same window_id you're about to act against.
+- Empty `tree_markdown` → `capture_mode` is set to `vision`, which
+  skips the AX walk by design. Flip back to the default `som`
+  (`cua-driver config set capture_mode som`) to get the tree.
+  Tiny screenshot → likely a stale window capture. See "Behavior
+  matrix" in SKILL.md for the full mode table.
+- System-alert beep when pressing Return on a minimized Chrome
+  omnibox → the keyboard-commit-on-minimized limitation. Use
+  `set_value` on the field instead, or AX-click a Go/Submit button.
+  See `WEB_APPS.md`.
+
+## Updates
+
+The skill evolves alongside the driver. To update:
+
+```bash
+cd /path/to/cua && git pull
+# if you copied: re-copy
+cp -R libs/cua-driver/Skills/cua-driver ~/.claude/skills/
+# if you symlinked: nothing needed
+```
+
+## License
+
+MIT. Same license as the parent `trycua/cua` repo.
@@ -0,0 +1,113 @@
+# Recording & replaying trajectories
+
+Session-scoped capture of action sequences + pre/post state, suitable
+for demos, regression diffs, and training data. Invoked only when the
+user explicitly asks to record — the skill does not auto-enable this.
+
+`set_recording` turns on a session-scoped trajectory recorder. While
+enabled, every action-tool call (`click`, `right_click`, `scroll`,
+`type_text`, `press_key`, `hotkey`, `set_value`)
+writes a numbered turn folder under a caller-chosen output
+directory. Read-only tools (`get_window_state`, `list_windows`,
+`screenshot`, `list_apps`, permission probes, agent-cursor getters /
+setters, and `set_recording` itself) are not recorded.
+
+## Enable / disable
+
+Two equivalent surfaces: the `set_recording` MCP tool, or the
+friendlier `cua-driver recording` subcommand group (wraps
+`set_recording` + `get_recording_state` with human-readable output).
+
+```
+cua-driver recording start ~/cua-trajectories/run-1
+# … run the workflow …
+cua-driver recording status # -> enabled / disabled, next_turn, output_dir
+cua-driver recording stop # -> "Recording disabled (N turns captured in …)"
+```
+
+Raw-tool equivalent:
+
+```
+cua-driver set_recording '{"enabled":true,"output_dir":"~/cua-trajectories/run-1"}'
+cua-driver get_recording_state
+cua-driver set_recording '{"enabled":false}'
+```
+
+The `recording` subcommands require a running daemon (`cua-driver
+serve &`) because recording state is per-process. `output_dir` expands
+`~` and is created (with intermediates) if missing. Turn numbering
+starts at `1` every time recording is (re-)enabled, regardless of any
+existing contents in the directory. State lives in memory only — a
+daemon restart resets to disabled.
+
+## What each turn folder contains
+
+Each action writes to `turn-NNNNN/` (five-digit zero-padded counter):
+
+- `app_state.json` — post-action AX snapshot for the target pid, same
+  shape `get_window_state` returns (tree_markdown, element_count,
+  turn_id, etc.) minus the screenshot fields. The recorder resolves a
+  frontmost window internally (visible + on-current-Space preferred,
+  max-area fallback) since individual action tools carry a
+  window_id but the recorder has no caller-supplied anchor.
+- `screenshot.png` — post-action capture of the same window the
+  recorder just snapshotted. Omitted when the pid has no visible
+  window.
+- `action.json` — the tool name, full input arguments, result
+  summary, pid, click point (when applicable), ISO-8601 timestamp.
+- `click.png` — only for click-family actions (`click`,
+  `right_click`): a copy of `screenshot.png` with a red dot drawn at
+  the click point (screen-absolute point → window-local pixels via
+  the screenshot's `scale_factor`). Absent for other tools and for
+  clicks whose point falls outside the captured window.
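The `click.png` description implies a coordinate conversion from a screen-absolute point to window-local pixels, followed by a bounds check. A minimal sketch of that arithmetic is below. It assumes the window origin is its top-left corner in screen points and that the screenshot size is given in pixels; neither detail is stated in the text, and the function name `to_window_pixels` is hypothetical.

```python
def to_window_pixels(click_pt, window_origin, scale_factor, image_size):
    """Map a screen-absolute point to window-local pixels.

    Returns None when the point falls outside the captured image, which
    corresponds to the case where no click.png would be written.
    """
    # Translate into window-local points, then scale points to pixels.
    px = round((click_pt[0] - window_origin[0]) * scale_factor)
    py = round((click_pt[1] - window_origin[1]) * scale_factor)
    width, height = image_size
    if 0 <= px < width and 0 <= py < height:
        return (px, py)
    return None
```

For example, a click at (150, 220) in a window whose origin is (100, 200), captured at `scale_factor` 2.0, lands at pixel (100, 40) in the screenshot.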
+
+## When to use it
+
+- Demos and screen recordings — play the turn folder back to show
+  exactly what the agent saw and what it did.
+- Replay for regression — re-run the same sequence against a future
+  build and diff the new trajectory against the saved one.
+- Training data collection — each turn is a
+  `(state, action, next_state)` triple ready for offline learning.
+
+## When to invoke it
+
+This skill does **not** auto-enable recording. The client invokes
+`set_recording` explicitly when the user asks to capture a session.
+If the user says "record this session" or similar, call
+`set_recording({enabled:true, output_dir:…})` before the first
+action, and `set_recording({enabled:false})` when done.
+
+## Replaying a recorded trajectory
+
+`replay_trajectory({dir})` walks `<dir>/turn-NNNNN/` folders in
+lexical order, reads each `action.json`, and re-invokes the recorded
+tool with its recorded `arguments`. Optional knobs: `delay_ms`
+(pacing between turns, default 500) and `stop_on_error` (halt on
+first failure, default true).
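The replay loop described above can be sketched in a few lines: sort the turn folders by name, read each `action.json`, and hand the recorded call to an invoker. This is an illustration of the described behaviour, not the driver's implementation. The `invoke` callback is hypothetical, and the `"tool"` key in `action.json` is an assumed field name; only `arguments` is named in the text.

```python
import json
import time
from pathlib import Path


def replay(dir_path, invoke, delay_ms=500, stop_on_error=True):
    """Re-invoke recorded actions in lexical turn order.

    Returns a list of (turn_name, succeeded) pairs.
    """
    root = Path(dir_path).expanduser()
    turns = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.startswith("turn-")),
        key=lambda p: p.name,  # zero-padded names make lexical order numeric
    )
    results = []
    for i, turn in enumerate(turns):
        action = json.loads((turn / "action.json").read_text())
        try:
            # "tool" is an assumed key; "arguments" is named in the text above.
            invoke(action["tool"], action["arguments"])
            results.append((turn.name, True))
        except Exception:
            results.append((turn.name, False))
            if stop_on_error:
                break
        if i < len(turns) - 1:
            time.sleep(delay_ms / 1000)  # pacing between turns
    return results
```

The five-digit zero padding of `turn-NNNNN` is what makes a plain lexical sort safe: `turn-00010` sorts after `turn-00009`.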
+
+```
+cua-driver recording start ~/cua-trajectories/demo1
+# … run the workflow …
+cua-driver recording stop
+# Later: replay against a new build.
+cua-driver replay_trajectory '{"dir":"~/cua-trajectories/demo1","delay_ms":500}'
+```
+
+Important caveat: **element_index doesn't survive across sessions**.
+Indices are assigned fresh on every `get_window_state` snapshot,
+keyed on `(pid, window_id)`, so a recorded
+`click({pid, window_id, element_index: 14})` from yesterday won't
+resolve today — the pid is usually different, the window_id always
+is. The call returns `Invalid element_index` or `No cached AX
+state`. Pixel clicks (`click({pid, x, y})`) and keyboard tools
+(`press_key`, `hotkey`, `type_text` without element_index) replay cleanly; element-indexed actions require a
+live snapshot that replay doesn't currently re-emit (read-only tools
+like `get_window_state` aren't recorded). For a reliable replay, either
+compose the trajectory from pixel + keyboard primitives, or capture
+it as a regression artifact (compare the failure/success pattern
+across builds) rather than a re-driving script.
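The caveat above suggests a quick pre-flight check before replaying: flag every recorded action whose arguments carry an `element_index`, since those are the ones that will not resolve in a new session. The helper below is a hypothetical sketch that works on already-parsed `action.json` dictionaries; the `"tool"` key is an assumed field name.

```python
def replays_cleanly(action: dict) -> bool:
    """True when the recorded action does not depend on a cached AX snapshot."""
    return "element_index" not in action.get("arguments", {})


def partition_actions(actions):
    """Split recorded actions into (safe, fragile) lists, preserving order."""
    safe = [a for a in actions if replays_cleanly(a)]
    fragile = [a for a in actions if not replays_cleanly(a)]
    return safe, fragile
```

A trajectory with an empty `fragile` list is built only from pixel and keyboard primitives, which is the first of the two options recommended above.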
+
+If recording is still enabled while replay runs, the replay is
+itself recorded into the current output directory — that's the
+intended regression-diff workflow.