screenhand 0.3.3 → 0.3.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/mcp-desktop.js +117 -67
- package/package.json +1 -1
package/dist/mcp-desktop.js
CHANGED
@@ -267,84 +267,134 @@ async function ensureCDP(overridePort) {
 throw new Error("Chrome not running with --remote-debugging-port. Launch with: /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug");
 }
 const server = new McpServer({ name: "screenhand", version: "3.0.0" }, {
-instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools
+instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools. Never click blind — always follow: KNOW → SEE → NAVIGATE → ACT → VERIFY → STOP.
 
-##
+## The Golden Sequence (follow this order)
 
-
-
-
-
-**Cross-app**: focus("com.apple.Notes") → type_text() → key("cmd+s") — chain apps freely
+### 1. KNOW (before touching anything)
+platform_guide("figma") → get selectors, flows, known errors for this app/site
+memory_recall("figma export") → check if you've done this before — reuse past strategies
+scan_menu_bar() → discover all menu items in the current app
 
-
+If platform_guide() has no data: platform_explore("bundleId") to auto-discover the app, or platform_learn("domain") for websites.
 
-###
-
-
-
-
-- Pattern: perception_start() → do work → world_state() to verify → perception_stop()
+### 2. SEE (understand current state)
+apps() → what's running?
+perception_start() → turn on continuous monitoring (3-rate: 100ms/300ms/1000ms)
+world_state() → current app, windows, controls, dialogs
+screenshot() → visual confirmation if needed
 
-
-- **Learning is automatic** — every tool call teaches ScreenHand which selectors work, which fail, optimal timing per app. No action needed.
-- **memory_save(key, value)** — save a strategy or finding for future sessions (persists to disk).
-- **memory_recall(query)** — retrieve saved strategies, past errors, what worked before. ALWAYS recall before attempting unfamiliar platforms.
-- **learning_status()** — see what ScreenHand has learned: locator preferences, recovery rankings, timing budgets per app.
-- **learning_reset()** — nuclear option, clears all learning. Rarely needed.
-- Pattern: memory_recall("instagram post") → use recalled strategy → if new approach works, memory_save() it
+perception_start() keeps world_state() continuously updated. Use it for complex multi-step workflows.
 
-###
-
-
-
-- ***_with_fallback tools** (click_with_fallback, type_with_fallback, etc.) — use these instead of bare click/type when reliability matters. They auto-try multiple methods.
-- Pattern: Use *_with_fallback tools for critical actions. If something still fails, check recovery_status() to understand why.
+### 3. NAVIGATE (get to the right place)
+focus("com.figma.Desktop") → bring app to front
+ui_tree() → see all clickable elements with roles and labels
+ui_find("Export") → check if a specific target exists before clicking
 
-###
-
-
-
-
-
+### 4. ACT (do the thing)
+click_with_fallback("Export") → click element (auto-tries AX → CDP → OCR → coordinates)
+type_with_fallback("filename") → type text with auto-fallback
+key("cmd+shift+e") → keyboard shortcuts
+drag(fromX, fromY, toX, toY) → drag and drop
+scroll(direction) → scroll up/down/left/right
 
-
-- **job_create(name, steps[])** — define a multi-step workflow that persists to disk.
-- **job_run(jobId)** — execute a job. Survives MCP client restarts.
-- **worker_start()** — start background daemon that processes jobs autonomously.
-- **playbook_record()** / **export_playbook()** — record your actions into reusable playbooks.
-- Pattern: For repeatable workflows, record as playbook → export → job_create from playbook → worker_start
+Always prefer *_with_fallback tools over bare click/type — they auto-recover when one method fails.
 
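The auto-fallback behavior this hunk describes can be sketched in plain JavaScript. This is an illustrative stand-in, not ScreenHand's actual implementation; the strategy names mirror the AX → CDP → OCR → coordinates chain named in the tool description, and the attempt functions are hypothetical:

```javascript
// Try each locator strategy in order and return on the first success.
// Failures are collected so a caller can see why every method failed,
// similar in spirit to what recovery_status() reports.
async function clickWithFallback(target, strategies) {
  const errors = [];
  for (const [name, attempt] of strategies) {
    try {
      const result = await attempt(target);
      return { ok: true, method: name, result };
    } catch (err) {
      errors.push(`${name}: ${err.message}`); // remember why this method failed
    }
  }
  return { ok: false, errors };
}
```

A caller would pass something like `[["ax", axClick], ["cdp", cdpClick], ["ocr", ocrClick]]` and only fall through to slower methods when faster ones throw.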
-###
-
-
-
-- **supervisor_start()** — background daemon that detects stalled agents and recovers.
-- Pattern: session_claim() → do work → session_heartbeat() periodically → session_release()
+### 5. VERIFY (confirm it worked)
+world_state() → did UI change as expected?
+world_state_diff() → what exactly changed since last check?
+screenshot() → visual proof
 
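The verify step's "what exactly changed" idea can be sketched as a shallow diff between two state snapshots. The field names (`app`, `dialogs`) are hypothetical, not the real output schema of `world_state_diff()`:

```javascript
// Shallow diff of two world-state snapshots: report every top-level
// field whose value changed between checks, with before/after values.
function worldStateDiff(before, after) {
  const changes = {};
  for (const key of new Set([...Object.keys(before), ...Object.keys(after)])) {
    const a = JSON.stringify(before[key]);
    const b = JSON.stringify(after[key]);
    if (a !== b) changes[key] = { before: before[key], after: after[key] };
  }
  return changes;
}
```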
-###
-
-
-- **plan_step(goalId)** — execute one step at a time (for more control than plan_execute).
-- **plan_step_resolve(goalId, tool, params)** — when a plan pauses at an LLM step, YOU decide which tool and params to use. The server executes it, verifies postconditions, and advances.
-- **plan_status(goalId)** — check progress: which step you're on, what's done, what's left.
-- **plan_list()** — see all goals (active, completed, failed).
-- **plan_cancel(goalId)** — abort a goal.
-- Pattern: plan_goal("do X") → review steps → plan_execute() → resolve LLM steps as they pause → on success, strategy auto-saved to memory
+### 6. STOP (clean up)
+perception_stop() → stop monitoring (save resources)
+memory_save("figma_export", ...) → save successful strategy for next time
 
-##
-
-
-
-
-
+## For Web/Browser (Chrome, Electron apps)
+browser_navigate("https://...") → go to URL
+browser_stealth() → activate FIRST if site has bot detection
+browser_dom() → read page structure (CSS selectors)
+browser_click("#submit") → click element by CSS selector
+browser_type("input", "text") → type into form field
+browser_fill_form({...}) → fill multiple fields at once (human-like timing)
+browser_js("return ...") → run JavaScript for complex extraction/actions
+browser_wait("selector") → wait for element to appear
+browser_human_click(x, y) → human-like click with randomized timing
 
-
-
-
-
--
-
+All browser tools work in the background (~10ms) — no need to focus Chrome.
+
+## For Complex Multi-Step Tasks (let ScreenHand plan it)
+plan_goal("Export video as H.264") → describe WHAT you want — ScreenHand generates steps from playbooks, strategies, and references
+plan_execute(goalId) → auto-run deterministic steps, pauses at LLM steps for your judgment
+plan_step_resolve(goalId, tool, params) → you provide the tool+params for LLM steps
+plan_status(goalId) → check progress
+plan_cancel(goalId) → abort if needed
+
+On success, the strategy is auto-saved to memory for future reuse.
+
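The run-until-an-LLM-step-pauses behavior described here can be modeled with a small loop. The step shape (`kind`, `resolved`) is made up for illustration; the real planner's schema is not shown in this diff:

```javascript
// Execute a plan's steps in order. Deterministic steps run immediately;
// the first unresolved step of kind "llm" pauses the run so the caller
// can supply tool + params (mirroring plan_execute / plan_step_resolve).
async function executePlan(steps, runStep) {
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    if (step.kind === "llm" && !step.resolved) {
      return { status: "paused", atStep: i }; // wait for a resolution
    }
    await runStep(step);
  }
  return { status: "done" };
}
```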
+## For Repeatable Workflows (automate once, run forever)
+playbook_record() → start recording your actions
+... do the work ...
+export_playbook() → save as reusable playbook
+job_create("daily post", steps) → make it a persistent job
+worker_start() → background daemon runs jobs autonomously
+
+Jobs survive MCP client restarts. worker_start() runs independently.
+
+## For Multi-Agent Coordination
+session_claim() → claim exclusive access to an app window (lease-based)
+session_heartbeat() → keep your lease alive (call periodically)
+session_release() → release when done
+supervisor_start() → daemon that detects stalled agents and auto-recovers
+
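The lease-based claim/heartbeat/release pattern above can be sketched with expiry timestamps. This is a hypothetical in-memory store for illustration; how ScreenHand actually persists leases is not shown in the diff:

```javascript
// Minimal lease table: a claim succeeds only if no live lease exists
// for the resource; heartbeats extend the expiry; a stalled agent just
// stops heartbeating, so its lease lapses and others can reclaim.
function createLeaseStore(ttlMs, now = Date.now) {
  const leases = new Map(); // resource -> { owner, expiresAt }
  return {
    claim(resource, owner) {
      const lease = leases.get(resource);
      if (lease && lease.expiresAt > now() && lease.owner !== owner) return false;
      leases.set(resource, { owner, expiresAt: now() + ttlMs });
      return true;
    },
    heartbeat(resource, owner) {
      const lease = leases.get(resource);
      if (!lease || lease.owner !== owner) return false;
      lease.expiresAt = now() + ttlMs;
      return true;
    },
    release(resource, owner) {
      const lease = leases.get(resource);
      if (lease && lease.owner === owner) leases.delete(resource);
    },
  };
}
```

Injecting `now` makes the expiry logic testable with a fake clock instead of real timers.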
+## Self-Healing (automatic — no action needed)
+When any tool fails, ScreenHand automatically tries alternative strategies (AX → CDP → OCR → coordinates). Learning is also automatic — every tool call teaches which selectors work, optimal timing, and recovery rankings per app. Check with:
+- learning_status() → see learned preferences per app
+- recovery_status() → see active cooldowns and cached strategies
+- recovery_configure() → tune recovery budget (max time, max retries)
+
+## Tool Speed Priority
+1. **ui_tree + ui_press** — native Accessibility API, ~50ms (fastest, most reliable)
+2. **browser_* tools** — Chrome DevTools Protocol, ~10ms (background, no focus needed)
+3. ***_with_fallback** — auto-tries multiple methods (~100-500ms)
+4. **screenshot + ocr** — visual capture, ~600ms (only for canvas apps)
+5. **applescript** — macOS scripting (Finder, Mail, Safari, etc.)
+
+## Decision Flow (run BEFORE step 1 of Golden Sequence)
+
+Before starting any automation, ask two questions to pick your strategy:
+
+### "Should I learn first or just go?" → coverage_report(bundleId)
+- 0 shortcuts, 0 selectors, 0 flows → LEARN FIRST: scan_menu_bar() + platform_explore() before acting
+- Has selectors + flows but 0 playbooks → CAN ACT, but start playbook_record() to save for next time
+- Has everything + high stability → GO FAST: use direct tools (ui_press, key, type_text)
+- Has error patterns for your tool → BE CAREFUL: use *_with_fallback tools
+
+### "Should I use fast or safe tools?" → learning_status(bundleId)
+- 100+ timing samples → FAST: app is well-known, use direct tools (ui_press, key, type_text ~50ms)
+- 1-99 timing samples → SAFE: use *_with_fallback tools (~100-500ms)
+- 0 timing samples → LEARN: platform_explore() first, then *_with_fallback
+- AX score > 0.9 → use ui_tree + ui_press (native accessibility, fastest)
+- AX low, CDP high → it's a web app, use browser_* tools
+- Both low, Vision high → canvas app, use screenshot + ocr + click_text
+
+### "Do I need perception?"
+- Single action (click a button) → NO, just ui_find + ui_press
+- Multi-step workflow (5+ steps) → YES, perception_start()
+- Visual app (Figma, DaVinci) → YES, with vision (default)
+- Text-heavy app (Notes, Terminal) → AX-only is enough
+
+### Decision Tree Summary
+1. coverage_report(bundleId) → do we know this app?
+   - YES (has references) → use known selectors/flows directly
+   - NO (empty) → scan_menu_bar + platform_explore FIRST
+2. learning_status(bundleId) → how well do we know it?
+   - 100+ samples → direct tools (fast)
+   - <100 samples → *_with_fallback (safe)
+   - 0 samples → learn first, then fallback
+3. Multi-step? → perception_start() : skip perception
+
|
+
## Key Rule
|
|
397
|
+
Never click blind. Always: coverage_report → learning_status → KNOW → SEE → NAVIGATE → ACT → VERIFY → STOP.
|
|
348
398
|
`,
|
|
349
399
|
});
|
|
350
400
|
// ═══════════════════════════════════════════════
|
|
@@ -6232,7 +6282,7 @@ server.tool("ingest_tutorial", "Extract structured playbook steps from a video t
 }],
 };
 });
-server.tool("coverage_report", "
+server.tool("coverage_report", "CALL THIS FIRST before automating any app. Shows what ScreenHand knows: shortcuts, selectors, flows, playbooks, error patterns, and stability %. Use the result to decide your strategy: learn first (if empty), go fast (if high coverage), or use fallback tools (if error patterns exist). See Decision Flow in server instructions.", {
 bundleId: z.string().describe("macOS bundle ID (e.g. com.blackmagic-design.DaVinciResolveLite)"),
 appName: z.string().describe("Human-readable app name"),
 includeLiveMenuScan: z.boolean().optional().describe("Also scan the live menu bar for comparison (requires app to be running, needs pid)"),
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "screenhand",
-"version": "0.3.
+"version": "0.3.5",
 "mcpName": "io.github.manushi4/screenhand",
 "description": "Give AI eyes and hands on your desktop. ScreenHand is an open-source MCP server that lets Claude and other AI agents see your screen, click buttons, type text, and control any app on macOS and Windows.",
 "homepage": "https://screenhand.com",