screenhand 0.1.1 → 0.2.0

Files changed (177)
  1. package/README.md +458 -93
  2. package/dist/.audit-log.jsonl +55 -0
  3. package/dist/.screenhand/memory/.lock +1 -0
  4. package/dist/.screenhand/memory/actions.jsonl +85 -0
  5. package/dist/.screenhand/memory/errors.jsonl +5 -0
  6. package/dist/.screenhand/memory/errors.jsonl.bak +4 -0
  7. package/dist/.screenhand/memory/state.json +35 -0
  8. package/dist/.screenhand/memory/state.json.bak +35 -0
  9. package/dist/.screenhand/memory/strategies.jsonl +12 -0
  10. package/dist/agent/cli.js +73 -0
  11. package/dist/agent/loop.js +258 -0
  12. package/dist/config.js +9 -0
  13. package/dist/index.js +56 -0
  14. package/dist/logging/timeline-logger.js +29 -0
  15. package/dist/mcp/mcp-stdio-server.js +448 -0
  16. package/dist/mcp/server.js +347 -0
  17. package/dist/mcp-desktop.js +2731 -0
  18. package/dist/mcp-entry.js +59 -0
  19. package/dist/memory/recall.js +160 -0
  20. package/dist/memory/research.js +98 -0
  21. package/dist/memory/seeds.js +89 -0
  22. package/dist/memory/session.js +161 -0
  23. package/dist/memory/store.js +391 -0
  24. package/dist/memory/types.js +4 -0
  25. package/dist/monitor/codex-monitor.js +377 -0
  26. package/dist/monitor/task-queue.js +84 -0
  27. package/dist/monitor/types.js +49 -0
  28. package/dist/native/bridge-client.js +174 -0
  29. package/dist/native/macos-bridge-client.js +5 -0
  30. package/dist/npm-publish-helper.js +117 -0
  31. package/dist/npm-token-cdp.js +113 -0
  32. package/dist/npm-token-create.js +135 -0
  33. package/dist/npm-token-finish.js +126 -0
  34. package/dist/playbook/engine.js +193 -0
  35. package/dist/playbook/index.js +4 -0
  36. package/dist/playbook/recorder.js +519 -0
  37. package/dist/playbook/runner.js +392 -0
  38. package/dist/playbook/store.js +166 -0
  39. package/dist/playbook/types.js +4 -0
  40. package/dist/runtime/accessibility-adapter.js +377 -0
  41. package/dist/runtime/app-adapter.js +48 -0
  42. package/dist/runtime/applescript-adapter.js +283 -0
  43. package/dist/runtime/ax-role-map.js +80 -0
  44. package/dist/runtime/browser-adapter.js +36 -0
  45. package/dist/runtime/cdp-chrome-adapter.js +505 -0
  46. package/dist/runtime/composite-adapter.js +205 -0
  47. package/dist/runtime/executor.js +250 -0
  48. package/dist/runtime/locator-cache.js +12 -0
  49. package/dist/runtime/planning-loop.js +47 -0
  50. package/dist/runtime/service.js +372 -0
  51. package/dist/runtime/session-manager.js +28 -0
  52. package/dist/runtime/state-observer.js +105 -0
  53. package/dist/runtime/vision-adapter.js +208 -0
  54. package/dist/scripts/codex-monitor-daemon.js +335 -0
  55. package/dist/scripts/supervisor-daemon.js +272 -0
  56. package/dist/scripts/worker-daemon.js +228 -0
  57. package/dist/src/agent/cli.js +82 -0
  58. package/dist/src/agent/loop.js +274 -0
  59. package/{src/config.ts → dist/src/config.js} +5 -10
  60. package/{src/index.ts → dist/src/index.js} +32 -52
  61. package/dist/src/jobs/manager.js +237 -0
  62. package/dist/src/jobs/runner.js +683 -0
  63. package/dist/src/jobs/store.js +102 -0
  64. package/dist/src/jobs/types.js +30 -0
  65. package/dist/src/jobs/worker.js +97 -0
  66. package/dist/src/logging/timeline-logger.js +45 -0
  67. package/dist/src/mcp/mcp-stdio-server.js +464 -0
  68. package/dist/src/mcp/server.js +363 -0
  69. package/dist/src/mcp-entry.js +60 -0
  70. package/dist/src/memory/recall.js +170 -0
  71. package/dist/src/memory/research.js +104 -0
  72. package/dist/src/memory/seeds.js +101 -0
  73. package/dist/src/memory/service.js +421 -0
  74. package/dist/src/memory/session.js +169 -0
  75. package/dist/src/memory/store.js +422 -0
  76. package/dist/src/memory/types.js +17 -0
  77. package/dist/src/monitor/codex-monitor.js +382 -0
  78. package/dist/src/monitor/task-queue.js +97 -0
  79. package/dist/src/monitor/types.js +62 -0
  80. package/dist/src/native/bridge-client.js +190 -0
  81. package/{src/native/macos-bridge-client.ts → dist/src/native/macos-bridge-client.js} +0 -1
  82. package/dist/src/playbook/engine.js +201 -0
  83. package/dist/src/playbook/index.js +20 -0
  84. package/dist/src/playbook/recorder.js +535 -0
  85. package/dist/src/playbook/runner.js +408 -0
  86. package/dist/src/playbook/store.js +183 -0
  87. package/dist/src/playbook/types.js +17 -0
  88. package/dist/src/runtime/accessibility-adapter.js +393 -0
  89. package/dist/src/runtime/app-adapter.js +64 -0
  90. package/dist/src/runtime/applescript-adapter.js +299 -0
  91. package/dist/src/runtime/ax-role-map.js +96 -0
  92. package/dist/src/runtime/browser-adapter.js +52 -0
  93. package/dist/src/runtime/cdp-chrome-adapter.js +521 -0
  94. package/dist/src/runtime/composite-adapter.js +221 -0
  95. package/dist/src/runtime/execution-contract.js +159 -0
  96. package/dist/src/runtime/executor.js +266 -0
  97. package/{src/runtime/locator-cache.ts → dist/src/runtime/locator-cache.js} +10 -15
  98. package/dist/src/runtime/planning-loop.js +63 -0
  99. package/dist/src/runtime/service.js +388 -0
  100. package/dist/src/runtime/session-manager.js +60 -0
  101. package/dist/src/runtime/state-observer.js +121 -0
  102. package/dist/src/runtime/vision-adapter.js +224 -0
  103. package/dist/src/supervisor/locks.js +186 -0
  104. package/dist/src/supervisor/supervisor.js +403 -0
  105. package/dist/src/supervisor/types.js +30 -0
  106. package/dist/src/test-mcp-protocol.js +154 -0
  107. package/dist/src/types.js +17 -0
  108. package/dist/src/util/atomic-write.js +118 -0
  109. package/dist/test-mcp-protocol.js +138 -0
  110. package/dist/types.js +1 -0
  111. package/package.json +18 -4
  112. package/.claude/commands/automate.md +0 -28
  113. package/.claude/commands/debug-ui.md +0 -19
  114. package/.claude/commands/screenshot.md +0 -15
  115. package/.github/FUNDING.yml +0 -1
  116. package/.github/ISSUE_TEMPLATE/bug_report.md +0 -27
  117. package/.github/ISSUE_TEMPLATE/feature_request.md +0 -20
  118. package/.mcp.json +0 -8
  119. package/DESKTOP_MCP_GUIDE.md +0 -92
  120. package/SECURITY.md +0 -44
  121. package/docs/architecture.md +0 -47
  122. package/install-skills.sh +0 -19
  123. package/mcp-bridge.ts +0 -271
  124. package/mcp-desktop.ts +0 -1221
  125. package/native/macos-bridge/Package.swift +0 -21
  126. package/native/macos-bridge/Sources/AccessibilityBridge.swift +0 -261
  127. package/native/macos-bridge/Sources/AppManagement.swift +0 -129
  128. package/native/macos-bridge/Sources/CoreGraphicsBridge.swift +0 -242
  129. package/native/macos-bridge/Sources/ObserverBridge.swift +0 -120
  130. package/native/macos-bridge/Sources/VisionBridge.swift +0 -80
  131. package/native/macos-bridge/Sources/main.swift +0 -345
  132. package/native/windows-bridge/AppManagement.cs +0 -234
  133. package/native/windows-bridge/InputBridge.cs +0 -436
  134. package/native/windows-bridge/Program.cs +0 -265
  135. package/native/windows-bridge/ScreenCapture.cs +0 -329
  136. package/native/windows-bridge/UIAutomationBridge.cs +0 -571
  137. package/native/windows-bridge/WindowsBridge.csproj +0 -17
  138. package/playbooks/devpost.json +0 -186
  139. package/playbooks/instagram.json +0 -41
  140. package/playbooks/instagram_v2.json +0 -201
  141. package/playbooks/x_v1.json +0 -211
  142. package/scripts/devpost-live-loop.mjs +0 -421
  143. package/src/logging/timeline-logger.ts +0 -55
  144. package/src/mcp/server.ts +0 -449
  145. package/src/memory/recall.ts +0 -191
  146. package/src/memory/research.ts +0 -146
  147. package/src/memory/seeds.ts +0 -123
  148. package/src/memory/session.ts +0 -201
  149. package/src/memory/store.ts +0 -434
  150. package/src/memory/types.ts +0 -69
  151. package/src/native/bridge-client.ts +0 -239
  152. package/src/runtime/accessibility-adapter.ts +0 -487
  153. package/src/runtime/app-adapter.ts +0 -169
  154. package/src/runtime/applescript-adapter.ts +0 -376
  155. package/src/runtime/ax-role-map.ts +0 -102
  156. package/src/runtime/browser-adapter.ts +0 -129
  157. package/src/runtime/cdp-chrome-adapter.ts +0 -676
  158. package/src/runtime/composite-adapter.ts +0 -274
  159. package/src/runtime/executor.ts +0 -396
  160. package/src/runtime/planning-loop.ts +0 -81
  161. package/src/runtime/service.ts +0 -448
  162. package/src/runtime/session-manager.ts +0 -50
  163. package/src/runtime/state-observer.ts +0 -136
  164. package/src/runtime/vision-adapter.ts +0 -297
  165. package/src/types.ts +0 -297
  166. package/tests/bridge-client.test.ts +0 -176
  167. package/tests/browser-stealth.test.ts +0 -210
  168. package/tests/composite-adapter.test.ts +0 -64
  169. package/tests/mcp-server.test.ts +0 -151
  170. package/tests/memory-recall.test.ts +0 -339
  171. package/tests/memory-research.test.ts +0 -159
  172. package/tests/memory-seeds.test.ts +0 -120
  173. package/tests/memory-store.test.ts +0 -392
  174. package/tests/types.test.ts +0 -92
  175. package/tsconfig.check.json +0 -17
  176. package/tsconfig.json +0 -19
  177. package/vitest.config.ts +0 -8
package/README.md CHANGED
@@ -2,48 +2,62 @@
2
2
 
3
3
  # ScreenHand
4
4
 
5
- **Give AI eyes and hands on your desktop.**
5
+ **Native desktop control for MCP agents.**
6
6
 
7
- ScreenHand is an [MCP server](https://modelcontextprotocol.io/) that lets AI agents see your screen, click buttons, type text, and control any app on macOS and Windows.
7
+ An open-source [MCP server](https://modelcontextprotocol.io/) for macOS and Windows that gives Claude, Cursor, Codex CLI, and OpenClaw fast desktop control via Accessibility/UI Automation, OCR, and Chrome CDP.
8
8
 
9
9
  [![License: AGPL-3.0](https://img.shields.io/badge/License-AGPL--3.0-blue.svg)](LICENSE)
10
10
  [![npm: screenhand](https://img.shields.io/npm/v/screenhand)](https://www.npmjs.com/package/screenhand)
11
+ [![CI](https://github.com/manushi4/screenhand/actions/workflows/ci.yml/badge.svg)](https://github.com/manushi4/screenhand/actions/workflows/ci.yml)
11
12
  [![Platform: macOS & Windows](https://img.shields.io/badge/Platform-macOS%20%7C%20Windows-green)]()
12
13
  [![MCP Compatible](https://img.shields.io/badge/MCP-Compatible-purple)]()
13
14
 
14
- [Website](https://screenhand.com) | [Quick Start](#quick-start) | [Use Cases](#use-cases) | [FAQ](#faq)
15
+ [Website](https://screenhand.com) | [Quick Start](#quick-start) | [Why ScreenHand](#why-screenhand) | [Tools](#tools) | [FAQ](#faq)
15
16
 
16
17
  </div>
17
18
 
18
19
  ---
19
20
 
20
- ## The Problem
21
+ ## Why ScreenHand?
21
22
 
22
- AI assistants are powerful — but they're blind. They can't see what's on your screen, click a button, or type into an app. If you want Claude to help you automate a workflow, debug a UI, or fill out a form, you're stuck copy-pasting screenshots and describing what you see.
23
+ - `~50ms` native UI actions via Accessibility APIs and Windows UI Automation
24
+ - `0` extra AI calls for native clicks, typing, and UI element lookup
25
+ - `70+` tools across desktop apps, browser automation, OCR, memory, sessions, jobs, and playbooks
26
+ - `macOS + Windows` behind the same MCP interface
27
+ - **Multi-agent safe** — session leases prevent conflicts between Claude, Cursor, and Codex
28
+ - **Background worker** — queue jobs and let the daemon process them continuously
23
29
 
24
- **ScreenHand fixes that.** It gives any AI agent direct access to your desktop through native OS APIs — not slow screenshot-and-guess loops.
30
+ ## What is ScreenHand?
25
31
 
26
- ## How It Works
27
-
28
- You connect ScreenHand to your AI client (Claude, Cursor, Codex CLI, etc.) via the [Model Context Protocol](https://modelcontextprotocol.io/). Once connected, your AI can:
32
+ ScreenHand is a **desktop automation bridge for AI**. It connects AI assistants like Claude to your operating system so they can:
29
33
 
30
34
  - **See** your screen via screenshots and OCR
31
- - **Read** UI elements directly via native Accessibility APIs
35
+ - **Read** UI elements via Accessibility APIs (macOS) or UI Automation (Windows)
32
36
  - **Click** buttons, menus, and links
33
37
  - **Type** text into any input field
34
38
  - **Control** Chrome tabs via DevTools Protocol
35
- - **Automate** cross-app workflows
36
-
37
- ```
38
- Your AI Client (Claude, Cursor, etc.)
39
- | MCP protocol (stdio)
40
- ScreenHand
41
- | Native OS APIs
42
- Your Desktop (any app, any browser)
43
- ```
39
+ - **Run** AppleScript commands (macOS)
40
+ - **Queue & execute** multi-step jobs via playbooks with a background worker daemon
41
+ - **Coordinate** multiple AI agents with session leases and stall detection
42
+
43
+ It works as an [MCP (Model Context Protocol)](https://modelcontextprotocol.io/) server, meaning any MCP-compatible AI client can use it out of the box.
44
+
45
+ | Problem | ScreenHand Solution |
46
+ |---|---|
47
+ | AI can't see your screen | Screenshots + OCR return all visible text |
48
+ | AI can't click UI elements | Accessibility API finds and clicks elements in ~50ms |
49
+ | AI can't control browsers | Chrome DevTools Protocol gives full page control |
50
+ | AI can't automate workflows | 70+ tools for cross-app automation |
51
+ | Only works on one OS | Native bridges for both macOS and Windows |
52
+ | Multiple agents conflict | Session leases with heartbeat and stall detection |
53
+ | Jobs need manual triggering | Worker daemon processes the queue continuously |
44
54
 
45
55
  ## Quick Start
46
56
 
57
+ ### Source install (recommended today)
58
+
59
+ ScreenHand currently builds a native bridge locally for Accessibility/UI Automation, so the fastest reliable setup is still from source:
60
+
47
61
  ```bash
48
62
  git clone https://github.com/manushi4/screenhand.git
49
63
  cd screenhand
@@ -52,10 +66,9 @@ npm run build:native # macOS — builds Swift bridge
52
66
  # npm run build:native:windows # Windows — builds .NET bridge
53
67
  ```
54
68
 
55
- ### Connect to Your AI Client
69
+ Then connect ScreenHand to your AI client.
56
70
 
57
- <details>
58
- <summary><strong>Claude Desktop</strong></summary>
71
+ ### Claude Desktop
59
72
 
60
73
  Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
61
74
 
@@ -69,10 +82,8 @@ Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
69
82
  }
70
83
  }
71
84
  ```
72
- </details>
73
85
 
74
- <details>
75
- <summary><strong>Claude Code</strong></summary>
86
+ ### Claude Code
76
87
 
77
88
  Add to your project `.mcp.json` or `~/.claude/settings.json`:
78
89
 
@@ -86,12 +97,10 @@ Add to your project `.mcp.json` or `~/.claude/settings.json`:
86
97
  }
87
98
  }
88
99
  ```
89
- </details>
90
100
 
91
- <details>
92
- <summary><strong>Cursor</strong></summary>
101
+ ### Cursor
93
102
 
94
- Add to `.cursor/mcp.json` in your project (or `~/.cursor/mcp.json` globally):
103
+ Add to `.cursor/mcp.json` in your project (or `~/.cursor/mcp.json` for global):
95
104
 
96
105
  ```json
97
106
  {
@@ -103,10 +112,8 @@ Add to `.cursor/mcp.json` in your project (or `~/.cursor/mcp.json` globally):
103
112
  }
104
113
  }
105
114
  ```
106
- </details>
107
115
 
108
- <details>
109
- <summary><strong>OpenAI Codex CLI</strong></summary>
116
+ ### OpenAI Codex CLI
110
117
 
111
118
  Add to `~/.codex/config.toml`:
112
119
 
@@ -116,117 +123,475 @@ command = "npx"
116
123
  args = ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
117
124
  transport = "stdio"
118
125
  ```
119
- </details>
120
126
 
121
- <details>
122
- <summary><strong>Any MCP Client</strong></summary>
127
+ ### OpenClaw
128
+
129
+ Add to your `openclaw.json`:
130
+
131
+ ```json
132
+ {
133
+ "mcpServers": {
134
+ "screenhand": {
135
+ "command": "npx",
136
+ "args": ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
137
+ }
138
+ }
139
+ }
140
+ ```
141
+
142
+ > **Why?** OpenClaw's built-in desktop control sends a screenshot to an LLM for every click (~3-5s, costs an API call). ScreenHand uses native Accessibility APIs — `press('Send')` runs in ~50ms with zero AI calls. See the full [integration guide](docs/openclaw-integration.md).
143
+
144
+ ### Any MCP Client
123
145
 
124
- ScreenHand is a standard MCP server over stdio. Point any MCP-compatible client at `mcp-desktop.ts`.
125
- </details>
146
+ ScreenHand is a standard MCP server over stdio. It works with any MCP-compatible client — just point it at `mcp-desktop.ts`.
126
147
 
127
148
  Replace `/path/to/screenhand` with the actual path where you cloned the repo.
128
149
 
150
+ ## Tools
151
+
152
+ ScreenHand exposes 70+ tools organized by category.
153
+
154
+ ### See the Screen
155
+
156
+ | Tool | What it does | Speed |
157
+ |------|-------------|-------|
158
+ | `screenshot` | Full screenshot + OCR — returns all visible text | ~600ms |
159
+ | `screenshot_file` | Screenshot saved to file (for viewing the image) | ~400ms |
160
+ | `ocr` | OCR with element positions and bounding boxes | ~600ms |
161
+
162
+ ### Control Any App (Accessibility / UI Automation)
163
+
164
+ | Tool | What it does | Speed |
165
+ |------|-------------|-------|
166
+ | `apps` | List running apps with bundle IDs and PIDs | ~10ms |
167
+ | `windows` | List visible windows with positions and sizes | ~10ms |
168
+ | `focus` | Bring an app to the front | ~10ms |
169
+ | `launch` | Launch an app by bundle ID or name | ~1s |
170
+ | `ui_tree` | Full UI element tree — instant, no OCR needed | ~50ms |
171
+ | `ui_find` | Find a UI element by text or title | ~50ms |
172
+ | `ui_press` | Click a UI element by its title | ~50ms |
173
+ | `ui_set_value` | Set value of a text field, slider, etc. | ~50ms |
174
+ | `menu_click` | Click a menu bar item by path | ~100ms |
175
+
176
+ ### Keyboard and Mouse
177
+
178
+ | Tool | What it does |
179
+ |------|-------------|
180
+ | `click` | Click at screen coordinates |
181
+ | `click_text` | Find text via OCR and click it (fallback) |
182
+ | `type_text` | Type text via keyboard |
183
+ | `key` | Key combo (e.g. `cmd+s`, `ctrl+shift+n`) |
184
+ | `drag` | Drag from point A to B |
185
+ | `scroll` | Scroll at a position |
186
+
187
+ ### Chrome Browser (CDP)
188
+
189
+ | Tool | What it does |
190
+ |------|-------------|
191
+ | `browser_tabs` | List all open Chrome tabs |
192
+ | `browser_open` | Open URL in new tab |
193
+ | `browser_navigate` | Navigate active tab to URL |
194
+ | `browser_js` | Run JavaScript in a tab |
195
+ | `browser_dom` | Query DOM with CSS selectors |
196
+ | `browser_click` | Click element by CSS selector (uses CDP mouse events) |
197
+ | `browser_type` | Type into an input field (uses CDP keyboard events, React-compatible) |
198
+ | `browser_wait` | Wait for a page condition |
199
+ | `browser_page_info` | Get page title, URL, and content |
200
+
201
+ ### Anti-Detection & Stealth (CDP)
202
+
203
+ Tools for interacting with sites that have bot detection (Instagram, LinkedIn, etc.):
204
+
205
+ | Tool | What it does |
206
+ |------|-------------|
207
+ | `browser_stealth` | Inject anti-detection patches (hides webdriver flag, fakes plugins/languages) |
208
+ | `browser_fill_form` | Human-like typing with random delays via CDP keyboard events |
209
+ | `browser_human_click` | Realistic mouse event sequence (mouseMoved → mousePressed → mouseReleased) |
210
+
211
+ > **Tip:** Call `browser_stealth` once after navigating to a protected site. Then use `browser_fill_form` and `browser_human_click` for interactions. The regular `browser_type` and `browser_click` also use CDP Input events now.
212
+
213
+ ### Smart Execution (fallback chain)
214
+
215
+ Tools that automatically choose the best method (Accessibility → CDP → OCR → coordinates):
216
+
217
+ | Tool | What it does |
218
+ |------|-------------|
219
+ | `execution_plan` | Generate an execution plan for a task |
220
+ | `click_with_fallback` | Click using the best available method |
221
+ | `type_with_fallback` | Type using the best available method |
222
+ | `read_with_fallback` | Read content using the best available method |
223
+ | `locate_with_fallback` | Find an element using the best available method |
224
+ | `select_with_fallback` | Select an option using the best available method |
225
+ | `scroll_with_fallback` | Scroll using the best available method |
226
+ | `wait_for_state` | Wait for a UI state using the best available method |
227
+
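The fallback order named above (Accessibility → CDP → OCR → coordinates) can be pictured as a loop over candidate methods that stops at the first success. This is a minimal sketch under stated assumptions, not ScreenHand's actual implementation: the `Method` type, the `Attempt` and `FallbackResult` shapes, and the `runWithFallback` helper are hypothetical names introduced only for illustration.

```typescript
// Hypothetical sketch of a fallback chain; names are illustrative, not ScreenHand's API.
type Method = "accessibility" | "cdp" | "ocr" | "coordinates";

interface Attempt<T> {
  method: Method;
  run: () => Promise<T>;
}

interface FallbackResult<T> {
  method: Method;     // the method that finally succeeded
  value: T;
  failures: string[]; // messages from methods that were tried first and failed
}

// Try each attempt in the given order and return the first one that resolves.
async function runWithFallback<T>(attempts: Attempt<T>[]): Promise<FallbackResult<T>> {
  const failures: string[] = [];
  for (const attempt of attempts) {
    try {
      const value = await attempt.run();
      return { method: attempt.method, value, failures };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${attempt.method}: ${message}`);
    }
  }
  // Every method failed: surface all collected reasons in one error.
  throw new Error(`all methods failed: ${failures.join("; ")}`);
}
```

Returning the list of earlier failures alongside the result lets a caller log why the faster methods were skipped, which is the kind of information an error memory could later reuse.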
228
+ ### Platform Playbooks (lazy-loaded)
229
+
230
+ Pre-built automation knowledge for specific platforms — selectors, URLs, flows, and **error solutions**.
231
+
232
+ | Tool | What it does |
233
+ |------|-------------|
234
+ | `platform_guide` | Get automation guide for a platform (selectors, URLs, flows, errors+solutions) |
235
+ | `export_playbook` | Auto-generate a playbook from your session. Share it to help others. |
236
+
237
+ ```
238
+ platform_guide({ platform: "devpost", section: "errors" }) # Just errors + solutions
239
+ platform_guide({ platform: "devpost", section: "selectors" }) # All CSS selectors
240
+ platform_guide({ platform: "devpost", section: "flows" }) # Step-by-step workflows
241
+ platform_guide({ platform: "devpost" }) # Full playbook
242
+ ```
243
+
244
+ **Contributing playbooks:** After automating any site, run:
245
+ ```
246
+ export_playbook({ platform: "twitter", domain: "twitter.com" })
247
+ ```
248
+ This auto-extracts URLs, selectors, errors+solutions from your session and saves a ready-to-share `playbooks/twitter.json`.
249
+
250
+ Available platforms: `devpost`. Add more by running `export_playbook` or creating JSON files in `playbooks/`.
251
+
252
+ Zero performance cost — files only read when `platform_guide` is called.
253
+
254
+ ### AppleScript (macOS only)
255
+
256
+ | Tool | What it does |
257
+ |------|-------------|
258
+ | `applescript` | Run any AppleScript command |
259
+
260
+ ### Memory (Learning) — zero-config, zero-latency
261
+
262
+ ScreenHand gets smarter every time you use it — **no manual setup needed**.
263
+
264
+ **What happens automatically:**
265
+ - Every tool call is logged (async, non-blocking — adds ~0ms to response time)
266
+ - After 3+ consecutive successes, the winning sequence is saved as a reusable strategy
267
+ - Known error patterns are tracked with resolutions (e.g. "launch times out → use focus() instead")
268
+ - On every tool call, the response includes **auto-recall hints**:
269
+ - Error warnings if the tool has failed before
270
+ - Next-step suggestions if you're mid-way through a known strategy
271
+
272
+ **Predefined seed strategies:**
273
+ - Ships with 12 common macOS workflows (Photo Booth, Chrome navigation, copy/paste, Finder, export PDF, etc.)
274
+ - Loaded automatically on first boot — the system has knowledge from day one
275
+ - Seeds are searchable via `memory_recall` and provide next-step hints like any learned strategy
276
+
277
+ **Background web research:**
278
+ - When a tool fails and no resolution exists, ScreenHand searches for a fix in the background (non-blocking)
279
+ - Uses Claude API (haiku, if `ANTHROPIC_API_KEY` is set) or DuckDuckGo instant answers as fallback
280
+ - Resolutions are saved to both error cache and strategy store — zero-latency recall next time
281
+ - Completely silent and fire-and-forget — never blocks tool responses or throws errors
282
+
283
+ **Fingerprint matching & feedback loop:**
284
+ - Each strategy is fingerprinted by its tool sequence (e.g. `apps→focus→ui_press`)
285
+ - O(1) exact-match lookup when the agent follows a known sequence
286
+ - Success/failure outcomes are tracked per strategy — unreliable strategies are auto-penalized and eventually skipped
287
+ - Keyword-based fuzzy search with reliability scoring for `memory_recall`
288
+
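The fingerprint idea above can be sketched as a map keyed by the joined tool sequence. This is an illustration only: the `Strategy` shape, `StrategyIndex` class, and `reliability` score are hypothetical, and the real scoring and penalty rules are not described in this README.

```typescript
// Hypothetical sketch: fingerprint strategies by tool sequence for exact-match lookup.
interface Strategy {
  task: string;
  tools: string[]; // ordered tool names, e.g. ["apps", "focus", "ui_press"]
  successes: number;
  failures: number;
}

// Join tool names with the arrow used in the README example.
function fingerprint(tools: string[]): string {
  return tools.join("→");
}

class StrategyIndex {
  private byFingerprint = new Map<string, Strategy>();

  add(strategy: Strategy): void {
    this.byFingerprint.set(fingerprint(strategy.tools), strategy);
  }

  // Exact match on the full sequence: a single hash lookup once the key is built.
  lookup(tools: string[]): Strategy | undefined {
    return this.byFingerprint.get(fingerprint(tools));
  }

  // Record an outcome so unreliable strategies can be ranked lower later.
  recordOutcome(tools: string[], ok: boolean): void {
    const strategy = this.lookup(tools);
    if (!strategy) return;
    if (ok) strategy.successes += 1;
    else strategy.failures += 1;
  }
}

// Fraction of recorded runs that succeeded; 0 when there is no history yet.
function reliability(strategy: Strategy): number {
  const total = strategy.successes + strategy.failures;
  return total === 0 ? 0 : strategy.successes / total;
}
```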
289
+ **Production-grade under the hood:**
290
+ - All data cached in RAM at startup — lookups are ~0ms, disk is only for persistence
291
+ - Disk writes are async and buffered (100ms debounce) — never block tool calls
292
+ - Sync flush on process exit (SIGINT/SIGTERM) — no lost writes
293
+ - Per-line JSONL parsing — corrupted lines are skipped, not fatal
294
+ - LRU eviction: 500 strategies, 200 error patterns max (oldest evicted automatically)
295
+ - File locking (`.lock` + PID) prevents corruption from concurrent instances
296
+ - Action log auto-rotates at 10 MB
297
+ - Data lives in `.screenhand/memory/` as JSONL (grep-friendly, no database)
298
+
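The LRU caps listed above can be illustrated with a small in-memory store built on `Map`, which iterates in insertion order. This is a sketch under assumptions, not the package's code: `LruStore` is a hypothetical class, and only the limits (500 strategies, 200 error patterns) come from the README.

```typescript
// Hypothetical sketch of a size-capped LRU store; not ScreenHand's actual code.
class LruStore<V> {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private entries = new Map<string, V>();
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (the first key in iteration order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Caps taken from the README: 500 strategies, 200 error patterns.
const strategies = new LruStore<object>(500);
const errorPatterns = new LruStore<object>(200);
```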
299
+ | Tool | What it does |
300
+ |------|-------------|
301
+ | `memory_snapshot` | Get current memory state snapshot |
302
+ | `memory_recall` | Search past strategies by task description |
303
+ | `memory_save` | Manually save the current session as a strategy |
304
+ | `memory_record_error` | Record an error pattern with an optional fix |
305
+ | `memory_record_learning` | Record a verified pattern (what works/fails) |
306
+ | `memory_query_patterns` | Search learnings by scope and method |
307
+ | `memory_errors` | View all known error patterns and resolutions |
308
+ | `memory_stats` | Action counts, success rates, top tools, disk usage |
309
+ | `memory_clear` | Clear actions, strategies, errors, or all data |
310
+
311
+ ### Session Supervisor — multi-agent coordination
312
+
313
+ Lease-based window locking with heartbeat, stall detection, and automatic recovery. Prevents multiple AI agents from fighting over the same app window.
314
+
315
+ | Tool | What it does |
316
+ |------|-------------|
317
+ | `session_claim` | Claim exclusive control of an app window |
318
+ | `session_heartbeat` | Keep your lease alive (call every 60s) |
319
+ | `session_release` | Release your session lease |
320
+ | `supervisor_status` | Active sessions, health metrics, stall detection |
321
+ | `supervisor_start` | Start the supervisor background daemon |
322
+ | `supervisor_stop` | Stop the supervisor daemon |
323
+ | `supervisor_pause` | Pause supervisor monitoring |
324
+ | `supervisor_resume` | Resume supervisor monitoring |
325
+ | `supervisor_install` | Install supervisor as a launchd service (macOS) |
326
+ | `supervisor_uninstall` | Uninstall supervisor launchd service |
327
+ | `recovery_queue_add` | Add a recovery action to the supervisor's queue |
328
+ | `recovery_queue_list` | List pending recovery actions |
329
+
330
+ The supervisor runs as a **detached daemon** that survives MCP/client restarts. It monitors active sessions, detects stalls, expires abandoned leases, and queues recovery actions.
331
+
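Lease-based locking with heartbeats can be sketched as a table of window leases that expire when heartbeats stop. The names here (`Lease`, `LeaseTable`, `windowKey`) and the time-to-live value are assumptions for illustration; the README only states that `session_heartbeat` should be called every 60s and does not say when a lease is considered abandoned.

```typescript
// Hypothetical sketch of lease-based window locking; field names and TTL are assumptions.
interface Lease {
  sessionId: string;
  windowKey: string; // e.g. "bundleId:windowId"
  lastHeartbeatMs: number;
}

class LeaseTable {
  private leases = new Map<string, Lease>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  private isExpired(lease: Lease, nowMs: number): boolean {
    return nowMs - lease.lastHeartbeatMs > this.ttlMs;
  }

  // Succeeds if the window is free, the old lease expired, or the caller already owns it.
  claim(sessionId: string, windowKey: string, nowMs: number): boolean {
    const current = this.leases.get(windowKey);
    if (current && current.sessionId !== sessionId && !this.isExpired(current, nowMs)) {
      return false;
    }
    this.leases.set(windowKey, { sessionId, windowKey, lastHeartbeatMs: nowMs });
    return true;
  }

  // Refreshes the lease; fails for non-owners and for leases that already expired.
  heartbeat(sessionId: string, windowKey: string, nowMs: number): boolean {
    const current = this.leases.get(windowKey);
    if (!current || current.sessionId !== sessionId || this.isExpired(current, nowMs)) {
      return false;
    }
    current.lastHeartbeatMs = nowMs;
    return true;
  }

  release(sessionId: string, windowKey: string): boolean {
    const current = this.leases.get(windowKey);
    if (!current || current.sessionId !== sessionId) return false;
    return this.leases.delete(windowKey);
  }

  // What a supervisor pass might do: drop abandoned leases and report which windows freed up.
  expireStale(nowMs: number): string[] {
    const expired: string[] = [];
    for (const [key, lease] of this.leases) {
      if (this.isExpired(lease, nowMs)) {
        this.leases.delete(key);
        expired.push(key);
      }
    }
    return expired;
  }
}
```

Passing the clock in as `nowMs` keeps the sketch deterministic; a real daemon would presumably read the system time and persist leases to disk.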
332
+ ### Jobs & Worker Daemon
333
+
334
+ Queue multi-step automation jobs and let a background worker process them continuously. Jobs can target specific apps/windows and execute via playbook engine or free-form steps.
335
+
336
+ | Tool | What it does |
337
+ |------|-------------|
338
+ | `job_create` | Create a job with steps (optionally tied to a playbook + bundleId/windowId) |
339
+ | `job_status` | Get the status of a job |
340
+ | `job_list` | List jobs by state (queued, running, done, failed, blocked) |
341
+ | `job_transition` | Transition a job to a new state |
342
+ | `job_step_done` | Mark a job step as done |
343
+ | `job_step_fail` | Mark a job step as failed |
344
+ | `job_resume` | Resume a blocked/waiting job |
345
+ | `job_dequeue` | Dequeue the next queued job |
346
+ | `job_remove` | Remove a job |
347
+ | `job_run` | Execute a single queued job through the runner |
348
+ | `job_run_all` | Process all queued jobs sequentially |
349
+ | `worker_start` | Start the background worker daemon |
350
+ | `worker_stop` | Stop the worker daemon |
351
+ | `worker_status` | Get worker daemon status and recent results |
352
+
353
+ **Job state machine:** `queued → running → done | failed | blocked | waiting_human`
354
+
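The state machine above can be written as a transition table. The `queued → running` and `running → done | failed | blocked | waiting_human` edges come from the README; the edges that send `blocked` and `waiting_human` back to `queued` are an assumption about what `job_resume` does, and `canTransition` / `transition` are hypothetical helper names.

```typescript
// Hypothetical sketch of the job state machine as a transition table.
type JobState = "queued" | "running" | "done" | "failed" | "blocked" | "waiting_human";

const allowedTransitions: Record<JobState, JobState[]> = {
  queued: ["running"],
  running: ["done", "failed", "blocked", "waiting_human"],
  blocked: ["queued"],       // assumption: resuming re-queues the job
  waiting_human: ["queued"], // assumption: same as blocked
  done: [],                  // terminal
  failed: [],                // terminal
};

function canTransition(from: JobState, to: JobState): boolean {
  return allowedTransitions[from].includes(to);
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: JobState, to: JobState): JobState {
  if (!canTransition(from, to)) {
    throw new Error(`invalid job transition: ${from} -> ${to}`);
  }
  return to;
}
```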
355
+ **Worker daemon features:**
356
+ - Runs as a detached process — survives MCP/client restarts
357
+ - Continuously polls the job queue and executes via JobRunner
358
+ - Playbook integration — jobs with a `playbookId` execute through PlaybookEngine
359
+ - Focuses/validates the target `bundleId`/`windowId` before each step
360
+ - Persists status and recent results to `~/.screenhand/worker/state.json`
361
+ - Single-instance enforcement via PID file
362
+ - Graceful shutdown on SIGINT/SIGTERM
363
+
364
+ ```bash
365
+ # Start the worker daemon directly
366
+ npx tsx scripts/worker-daemon.ts
367
+ npx tsx scripts/worker-daemon.ts --poll 5000 --max-jobs 10
368
+
369
+ # Or via MCP tools
370
+ worker_start → worker_status → worker_stop
371
+ ```
372
+
373
+ ## Architecture
374
+
375
+ ```
376
+ ┌─────────────────────────────────────────────────────┐
377
+ │ MCP Client (Claude, Cursor, Codex CLI, OpenClaw) │
378
+ └────────────────────────┬────────────────────────────┘
379
+ │ stdio JSON-RPC
380
+ ┌────────────────────────▼────────────────────────────┐
381
+ │ mcp-desktop.ts │
382
+ │ (MCP Server — 70+ tools) │
383
+ ├───────────┬──────────┬──────────────────────────────┤
384
+ │ Native │ Chrome │ Memory / Supervisor / Jobs │
385
+ │ Bridge │ CDP │ / Playbooks / Worker │
386
+ └─────┬─────┴────┬─────┴──────────────────────────────┘
387
+ │ │
388
+ ┌─────▼─────┐ ┌──▼──────┐ ┌──────────────┐ ┌──────────────┐
389
+ │macos-bridge│ │ Chrome │ │ Supervisor │ │ Worker │
390
+ │(Swift, AX) │ │DevTools │ │ Daemon │ │ Daemon │
391
+ └────────────┘ └─────────┘ └──────────────┘ └──────────────┘
392
+ ```
393
+
394
+ ### Key modules
395
+
396
+ | Path | Purpose |
397
+ |---|---|
398
+ | `mcp-desktop.ts` | MCP server entrypoint — all tool definitions |
399
+ | `src/native/bridge-client.ts` | TypeScript ↔ native bridge communication |
400
+ | `native/macos-bridge/` | Swift binary using Accessibility API + OCR |
401
+ | `native/windows-bridge/` | C# binary using UI Automation + SendInput |
402
+ | `src/memory/` | Persistent memory service (strategies, errors, learnings) |
403
+ | `src/supervisor/` | Session leases, stall detection, recovery |
404
+ | `src/jobs/` | Job queue, runner, worker state persistence |
405
+ | `src/playbook/` | Playbook engine and store |
406
+ | `src/runtime/` | Execution contract, accessibility adapter, fallback chain |
407
+ | `scripts/worker-daemon.ts` | Standalone worker daemon process |
408
+ | `scripts/supervisor-daemon.ts` | Standalone supervisor daemon process |
409
+
410
+ ### State files
411
+
412
+ All persistent state lives under `~/.screenhand/`:
413
+
414
+ ```
415
+ ~/.screenhand/
416
+ ├── memory/ # strategies, errors, learnings (JSONL)
417
+ ├── supervisor/ # supervisor daemon state
418
+ ├── locks/ # session lease files
419
+ ├── jobs/ # job queue persistence
420
+ ├── worker/ # worker daemon state, PID, logs
421
+ └── playbooks/ # saved playbook definitions
422
+ ```
423
+
424
+ ## How It Works
425
+
426
+ ScreenHand has three layers:
427
+
428
+ ```
429
+ AI Client (Claude, Cursor, etc.)
430
+ ↓ MCP protocol (stdio)
431
+ ScreenHand MCP Server (TypeScript)
432
+ ↓ JSON-RPC (stdio)
433
+ Native Bridge (Swift on macOS / C# on Windows)
434
+ ↓ Platform APIs
435
+ Operating System (Accessibility, CoreGraphics, UI Automation, SendInput)
436
+ ```
437
+
438
+ 1. **Native bridge** — talks directly to OS-level APIs:
439
+ - **macOS**: Swift binary using Accessibility APIs, CoreGraphics, and Vision framework (OCR)
440
+ - **Windows**: C# (.NET 8) binary using UI Automation, SendInput, GDI+, and Windows.Media.Ocr
441
+ 2. **TypeScript MCP server** — routes tools to the correct bridge, handles Chrome CDP, manages sessions, runs jobs
442
+ 3. **MCP protocol** — standard Model Context Protocol so any AI client can connect
443
+
444
+ The native bridge is auto-selected based on your OS. Both bridges speak the same JSON-RPC protocol, so all tools work identically on both platforms.
445
+
129
446
  ## Use Cases
130
447
 
131
- ### Automate Repetitive Workflows
132
- Tell your AI "submit this form on 10 websites" or "export all these reports as PDFs" and it does it. ScreenHand handles the clicking, typing, and navigating across any app.
448
+ ### App Debugging
449
+ Claude reads UI trees, clicks through flows, and checks element states, faster than clicking around yourself.
133
450
 
134
- ### Debug UIs Faster
135
- Instead of clicking through your app manually, let Claude inspect the full UI element tree, check states, and walk through flows — all from your terminal.
451
+ ### Design Inspection
452
+ Screenshots + OCR to read exactly what's on screen. `ui_tree` shows component structure like React DevTools but for any native app.
136
453
 
137
- ### Browser Automation Without Selenium
138
- Fill forms, scrape data, run JavaScript, and navigate pages through Chrome DevTools Protocol. Works with sites that block traditional automation.
454
+ ### Browser Automation
455
+ Fill forms, scrape data, run JavaScript, navigate pages — all through Chrome DevTools Protocol.
139
456
 
140
457
  ### Cross-App Workflows
141
- Read data from a spreadsheet, search it in Chrome, paste results into Notes — chain actions across your entire desktop.
458
+ Read from one app, paste into another, chain actions across your whole desktop. Example: extract data from a spreadsheet, search it in Chrome, paste results into Notes.
142
459
 
143
- ### AI-Powered UI Testing
144
- Click buttons, verify text appears, check element states, and catch regressions, all driven by your AI agent.
460
+ ### Multi-Agent Coordination
461
+ Run Claude, Cursor, and Codex simultaneously; each claims its own app window via session leases. The supervisor detects stalls and recovers.
145
462
 
146
- ## What's Included
463
+ ### Background Job Processing
464
+ Queue automation jobs with `job_create`, start the worker daemon with `worker_start`, and let it process tasks continuously — even after you close your AI client.
147
465
 
148
- ScreenHand exposes **70+ tools** organized by what you need to do:
466
+ ### UI Testing
467
+ Click buttons, verify text appears, catch visual regressions — all driven by AI.
149
468
 
150
- | Category | Examples | What For |
151
- |----------|----------|----------|
152
- | **Screen** | `screenshot`, `ocr` | See what's on screen, read all visible text |
153
- | **App Control** | `ui_tree`, `ui_press`, `menu_click` | Read and interact with any native app |
154
- | **Keyboard & Mouse** | `click`, `type_text`, `key`, `drag` | Direct input control |
155
- | **Chrome Browser** | `browser_navigate`, `browser_js`, `browser_dom` | Full browser automation via CDP |
156
- | **Memory** | `memory_recall`, `memory_save` | ScreenHand learns from past sessions |
157
- | **AppleScript** | `applescript` | Run AppleScript on macOS |
469
+ ## Requirements
158
470
 
159
- For the full tool reference, see the [tool documentation](DESKTOP_MCP_GUIDE.md).
471
+ ### macOS
160
472
 
161
- ## Requirements
473
+ - macOS 12+
474
+ - Node.js 18+
475
+ - Accessibility permissions: System Settings > Privacy & Security > Accessibility > enable your terminal
476
+ - Chrome with `--remote-debugging-port=9222` (only for browser tools)
162
477
 
163
- | | macOS | Windows |
164
- |---|---|---|
165
- | **OS** | macOS 12+ | Windows 10 (1809+) |
166
- | **Runtime** | Node.js 18+ | Node.js 18+ |
167
- | **Permissions** | Accessibility (System Settings) | None (no admin needed) |
168
- | **Browser tools** | Chrome with `--remote-debugging-port=9222` | Same |
169
- | **Build** | `npm run build:native` | `npm run build:native:windows` |
478
+ ### Windows
479
+
480
+ - Windows 10 (1809+)
481
+ - Node.js 18+
482
+ - [.NET 8 SDK](https://dotnet.microsoft.com/download/dotnet/8.0)
483
+ - No special permissions needed: UI Automation works without admin
484
+ - Chrome with `--remote-debugging-port=9222` (only for browser tools)
485
+ - Build: `npm run build:native:windows`
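Before using the browser tools, it can help to confirm that Chrome is actually listening on the debugging port listed in the requirements. The sketch below is not part of ScreenHand; `cdpVersionUrl` and `isChromeDebuggable` are hypothetical helpers. It relies on Chrome's `/json/version` HTTP endpoint and on the global `fetch` available in Node.js 18+.

```typescript
// Build the URL of Chrome's DevTools HTTP endpoint that reports version info.
function cdpVersionUrl(port: number = 9222, host: string = "127.0.0.1"): string {
  return `http://${host}:${port}/json/version`;
}

// Resolve to true if a DevTools endpoint answers on the given port.
async function isChromeDebuggable(port: number = 9222): Promise<boolean> {
  try {
    const response = await fetch(cdpVersionUrl(port));
    if (!response.ok) return false;
    // A debuggable Chrome reports the WebSocket URL that CDP clients attach to.
    const info = (await response.json()) as { webSocketDebuggerUrl?: string };
    return typeof info.webSocketDebuggerUrl === "string";
  } catch {
    // Connection refused or invalid JSON: Chrome is likely not running with the flag.
    return false;
  }
}
```

Calling `await isChromeDebuggable()` returns `false` rather than throwing when nothing is listening, so it can be used as a simple preflight check.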
486
+
487
+ ## Skills (Slash Commands)
488
+
489
+ ScreenHand ships with Claude Code slash commands:
490
+
491
+ - `/screenshot` — capture your screen and describe what's visible
492
+ - `/debug-ui` — inspect the UI tree of any app
493
+ - `/automate` — describe a task and Claude does it
494
+
495
+ **Install globally** so they work in any project:
496
+
497
+ ```bash
498
+ ./install-skills.sh
499
+ ```
170
500
 
171
501
  ## Development
172
502
 
173
503
  ```bash
174
- npm run check # type-check
175
- npm test # run test suite
176
- npm run build # compile TypeScript
177
- npm run build:native # build native bridge
504
+ npm run dev # Run MCP server with tsx (hot reload)
505
+ npm run check # type-check (covers all entry files)
506
+ npm test # run test suite
507
+ npm run build # compile TypeScript
508
+ npm run build:native # build Swift bridge (macOS)
509
+ npm run build:native:windows # build .NET bridge (Windows)
178
510
  ```
179
511
 
180
512
  ## FAQ
181
513
 
182
- <details>
183
- <summary><strong>What is ScreenHand?</strong></summary>
514
+ ### What is ScreenHand?
515
+ ScreenHand is an open-source MCP server that gives AI assistants like Claude the ability to see and control your desktop. It provides 70+ tools for screenshots, UI inspection, clicking, typing, browser automation, session management, job queuing, and playbook execution on both macOS and Windows.
184
516
 
185
- An MCP server that gives AI agents the ability to see and control your desktop. It uses native OS APIs (Accessibility on macOS, UI Automation on Windows) for fast, reliable automation — not slow screenshot-based guessing.
186
- </details>
517
+ ### How does ScreenHand differ from Anthropic's Computer Use?
518
+ Anthropic's Computer Use is a cloud-based feature built into Claude. ScreenHand is an open-source, local-first tool that runs entirely on your machine with no cloud dependency. It uses native OS APIs (Accessibility on macOS, UI Automation on Windows) which are faster and more reliable than screenshot-based approaches.
187
519
 
188
- <details>
189
- <summary><strong>How is this different from Anthropic's Computer Use?</strong></summary>
520
+ ### How does ScreenHand work with OpenClaw?
190
521
 
191
- Computer Use is cloud-based and built into Claude. ScreenHand is open-source, runs locally on your machine, and uses native OS APIs which are faster and more reliable than screenshot-based approaches. It also works with any MCP-compatible client, not just Claude.
192
- </details>
522
+ ScreenHand **integrates with** OpenClaw as an MCP server, giving your Claw agent native desktop speed instead of screenshot-based clicking.
193
523
 
194
- <details>
195
- <summary><strong>Is it safe?</strong></summary>
524
+ | | Without ScreenHand | With ScreenHand |
525
+ |---|---|---|
526
+ | **Clicking a button** | Screenshot → LLM interprets → coordinate click (~3-5s) | `press('Send')` via Accessibility API (~50ms) |
527
+ | **Cost per action** | 1 LLM API call per click | 0 LLM calls — native OS API |
528
+ | **Accuracy** | Coordinate guessing — can miss if layout shifts | Exact element targeting by role/name |
529
+
530
+ **Setup** — add to your `openclaw.json`:
531
+
532
+ ```json
533
+ {
534
+ "mcpServers": {
535
+ "screenhand": {
536
+ "command": "npx",
537
+ "args": ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
538
+ }
539
+ }
540
+ }
541
+ ```
542
+
543
+ Your Claw keeps its visual understanding for complex tasks, but now has 70+ fast native tools for clicks, typing, menus, scrolling, browser control, and more. See the full [integration guide](docs/openclaw-integration.md).
544
+
545
+ ### Does ScreenHand work on Windows?
546
+ Yes. ScreenHand supports both macOS and Windows. On macOS it uses a Swift native bridge with Accessibility APIs. On Windows it uses a C# (.NET 8) bridge with UI Automation and SendInput.
196
547
 
197
- ScreenHand runs entirely on your machine — no screen data is sent to external servers. All tool calls are audit-logged. See our [Security Policy](SECURITY.md) for details on permissions and boundaries.
198
- </details>
548
+ ### What AI clients work with ScreenHand?
549
+ Any MCP-compatible client: Claude Desktop, Claude Code, Cursor, Windsurf, OpenAI Codex CLI, and any other tool that supports the Model Context Protocol.
199
550
 
200
- <details>
201
- <summary><strong>What AI clients work with it?</strong></summary>
551
+ ### Does ScreenHand need admin/root permissions?
552
+ On macOS, you need to grant Accessibility permissions to your terminal app. On Windows, no special permissions are needed — UI Automation works without admin for most applications.
202
553
 
203
- Any MCP-compatible client: Claude Desktop, Claude Code, Cursor, Windsurf, OpenAI Codex CLI, and more.
204
- </details>
554
+ ### Is ScreenHand safe to use?
555
+ ScreenHand runs locally and never sends screen data to external servers. Dangerous tools (AppleScript, browser JS execution) are audit-logged. You control which AI client connects to it via MCP configuration.
205
556
 
206
- <details>
207
- <summary><strong>Can it control any app?</strong></summary>
557
+ ### Can ScreenHand control any application?
558
+ On macOS, it can control any app that exposes Accessibility elements (most apps do). On Windows, it works with any app that supports UI Automation. Some apps with custom rendering (games, some Electron apps) may have limited element tree support — use OCR as a fallback.
208
559
 
209
- On macOS, any app that exposes Accessibility elements (most do). On Windows, any app supporting UI Automation. For apps with custom rendering (games, some Electron apps), OCR is available as a fallback.
210
- </details>
560
+ ### How fast is ScreenHand?
561
+ Accessibility/UI Automation operations take ~50ms. Chrome CDP operations take ~10ms. Screenshots with OCR take ~600ms. Memory lookups add ~0ms (in-memory cache). ScreenHand is significantly faster than screenshot-only approaches because it reads the UI tree directly.
562
+
563
+ ### Does the learning memory affect performance?
564
+ No. All memory data is loaded into RAM at startup. Lookups are O(1) hash map reads. Disk writes are async and buffered — they never block tool calls. The memory system adds effectively zero latency to any tool call.
565
+
566
+ ### Is the memory data safe from corruption?
567
+ Yes. JSONL files are parsed line-by-line — a single corrupted line is skipped without affecting other entries. File locking prevents concurrent write corruption. Pending writes are flushed synchronously on exit (SIGINT/SIGTERM). Cache sizes are capped with LRU eviction to prevent unbounded growth.
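Two of the safeguards described above, skipping corrupted JSONL lines and capping a cache with LRU eviction, can be sketched as follows. This is an illustration of the general techniques, not ScreenHand's actual implementation; `parseJsonl` and `LruCache` are hypothetical names.

```typescript
interface ParseResult<T> {
  entries: T[];
  skipped: number;
}

// Parse JSONL text line by line; a line that fails to parse is counted and skipped.
function parseJsonl<T>(text: string): ParseResult<T> {
  const entries: T[] = [];
  let skipped = 0;
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines are not treated as corruption
    try {
      entries.push(JSON.parse(trimmed) as T);
    } catch {
      skipped++;
    }
  }
  return { entries, skipped };
}

// Bounded cache that evicts the least recently used key once capacity is exceeded.
// A Map iterates in insertion order, so re-inserting a key marks it most recent.
class LruCache<K, V> {
  private map = new Map<K, V>();
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  has(key: K): boolean {
    return this.map.has(key);
  }
}
```

Both `get` and `set` are constant-time on average, which is consistent with the O(1) lookup claim in the previous answer.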
211
568
 
212
569
  ## Contributing
213
570
 
214
- Contributions welcome! Please open an issue first to discuss what you'd like to change.
571
+ Contributions are welcome! Please open an issue first to discuss what you'd like to change.
215
572
 
216
573
  ```bash
217
574
  git clone https://github.com/manushi4/screenhand.git
218
575
  cd screenhand
219
- npm install && npm run build:native && npm test
576
+ npm install
577
+ npm run build:native
578
+ npm test
220
579
  ```
221
580
 
581
+ ## Contact
582
+
583
+ - **Email**: [khushi@clazro.com](mailto:khushi@clazro.com)
584
+ - **Issues**: [github.com/manushi4/screenhand/issues](https://github.com/manushi4/screenhand/issues)
585
+ - **Website**: [screenhand.com](https://screenhand.com)
586
+
222
587
  ## License
223
588
 
224
- [AGPL-3.0](LICENSE) — Copyright (C) 2025 Clazro Technology Private Limited
589
+ AGPL-3.0-only — Copyright (C) 2025 Clazro Technology Private Limited
225
590
 
226
591
  ---
227
592
 
228
593
  <div align="center">
229
594
 
230
- **[screenhand.com](https://screenhand.com)** | Built by **[Clazro Technology Private Limited](https://github.com/manushi4)**
595
+ **[screenhand.com](https://screenhand.com)** | [khushi@clazro.com](mailto:khushi@clazro.com) | A product of **Clazro Technology Private Limited**
231
596
 
232
597
  </div>