screenhand 0.3.3 → 0.3.5

Files changed (2)
  1. package/dist/mcp-desktop.js +117 -67
  2. package/package.json +1 -1
@@ -267,84 +267,134 @@ async function ensureCDP(overridePort) {
  throw new Error("Chrome not running with --remote-debugging-port. Launch with: /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug");
  }
  const server = new McpServer({ name: "screenhand", version: "3.0.0" }, {
- instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools across 7 capability tiers.
+ instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools. Never click blind — always follow: KNOW → SEE → NAVIGATE → ACT → VERIFY → STOP.

- ## Quick Patterns
+ ## The Golden Sequence (follow this order)

- **Click something**: ui_find("Search") → ui_press("Search") (~50ms, no screenshot needed)
- **Type text**: click target first, then type_text("hello") or key("cmd+a") for shortcuts
- **Read screen**: ui_tree() for structured elements, screenshot() + ocr() for visual content
- **Browser**: browser_navigate/browser_js/browser_click works in background via CDP (~10ms)
- **Cross-app**: focus("com.apple.Notes") → type_text() → key("cmd+s") — chain apps freely
+ ### 1. KNOW (before touching anything)
+ platform_guide("figma") → get selectors, flows, and known errors for this app/site
+ memory_recall("figma export") → check if you've done this before; reuse past strategies
+ scan_menu_bar() → discover all menu items in the current app

- ## When to Use Advanced Features
+ If platform_guide() has no data: platform_explore("bundleId") to auto-discover the app, or platform_learn("domain") for websites.
 
- ### World State & Perception (know what's on screen)
- - **perception_start()** — turn on continuous screen monitoring (3-rate: 100ms/300ms/1000ms). Use BEFORE complex multi-step workflows so you always know what's on screen.
- - **world_state()** — check current app, windows, controls, dialogs. Use to verify state before acting. Use verbose=true to see all controls.
- - **world_state_diff()** — find stale/outdated UI elements. Use after long pauses to check what changed.
- - **perception_stop()** — turn off when done to save resources.
- - Pattern: perception_start() → do work → world_state() to verify → perception_stop()
+ ### 2. SEE (understand current state)
+ apps() → what's running?
+ perception_start() → turn on continuous monitoring (3-rate: 100ms/300ms/1000ms)
+ world_state() → current app, windows, controls, dialogs
+ screenshot() → visual confirmation if needed

- ### Learning & Memory (get smarter over time)
- - **Learning is automatic** — every tool call teaches ScreenHand which selectors work, which fail, and optimal timing per app. No action needed.
- - **memory_save(key, value)** — save a strategy or finding for future sessions (persists to disk).
- - **memory_recall(query)** — retrieve saved strategies, past errors, what worked before. ALWAYS recall before attempting unfamiliar platforms.
- - **learning_status()** — see what ScreenHand has learned: locator preferences, recovery rankings, timing budgets per app.
- - **learning_reset()** — nuclear option, clears all learning. Rarely needed.
- - Pattern: memory_recall("instagram post") → use recalled strategy → if new approach works, memory_save() it
+ perception_start() keeps world_state() continuously updated. Use it for complex multi-step workflows.

- ### Self-Healing & Recovery (handle errors automatically)
- - **Recovery is automatic** — when a tool fails, ScreenHand tries alternative strategies (AX → CDP → OCR → coordinates) without you doing anything.
- - **recovery_status()** — see cooldowns, active strategies, which fixes are cached.
- - **recovery_configure()** — adjust recovery budget (max time, max strategies to try).
- - ***_with_fallback tools** (click_with_fallback, type_with_fallback, etc.) — use these instead of bare click/type when reliability matters. They auto-try multiple methods.
- - Pattern: Use *_with_fallback tools for critical actions. If something still fails, check recovery_status() to understand why.
+ ### 3. NAVIGATE (get to the right place)
+ focus("com.figma.Desktop") → bring app to front
+ ui_tree() → see all clickable elements with roles and labels
+ ui_find("Export") → check if a specific target exists before clicking
 
- ### Platform Knowledge (know HOW to automate an app)
- - **platform_guide("figma")** — get selectors, flows, known errors for a platform. Call FIRST when automating any app/site.
- - **platform_explore("bundleId")** — auto-discover an unknown app's UI structure.
- - **platform_learn("domain")** — learn a website's structure by crawling.
- - **scan_menu_bar()** — discover all menu items in the current app.
- - Pattern: platform_guide() first; if not found, platform_explore() → then automate
+ ### 4. ACT (do the thing)
+ click_with_fallback("Export") → click element (auto-tries AX → CDP → OCR → coordinates)
+ type_with_fallback("filename") → type text with auto-fallback
+ key("cmd+shift+e") → keyboard shortcuts
+ drag(fromX, fromY, toX, toY) → drag and drop
+ scroll(direction) → scroll up/down/left/right

- ### Jobs & Multi-Step Workflows (survive restarts)
- - **job_create(name, steps[])** — define a multi-step workflow that persists to disk.
- - **job_run(jobId)** — execute a job. Survives MCP client restarts.
- - **worker_start()** — start background daemon that processes jobs autonomously.
- - **playbook_record()** / **export_playbook()** — record your actions into reusable playbooks.
- - Pattern: For repeatable workflows, record as playbook → export → job_create from playbook → worker_start
+ Always prefer *_with_fallback tools over bare click/type — they auto-recover when one method fails.

- ### Multi-Agent Coordination (multiple AI agents sharing one machine)
- - **session_claim()** — claim exclusive access to an app window (lease-based).
- - **session_heartbeat()** — keep your lease alive.
- - **session_release()** — release when done.
- - **supervisor_start()** — background daemon that detects stalled agents and recovers.
- - Pattern: session_claim() → do work → session_heartbeat() periodically → session_release()
+ ### 5. VERIFY (confirm it worked)
+ world_state() → did the UI change as expected?
+ world_state_diff() → what exactly changed since last check?
+ screenshot() → visual proof
 
- ### Planning (let ScreenHand figure out the steps)
- - **plan_goal("Export video as H.264")** — describe WHAT you want; ScreenHand generates a step-by-step plan. It searches playbooks, saved strategies, and reference knowledge to build the plan. Does NOT execute — returns the plan for review.
- - **plan_execute(goalId)** — run the plan automatically. Deterministic steps (known selectors/flows) run internally. Pauses at LLM steps where your judgment is needed — resolve them with plan_step_resolve().
- - **plan_step(goalId)** — execute one step at a time (for more control than plan_execute).
- - **plan_step_resolve(goalId, tool, params)** — when a plan pauses at an LLM step, YOU decide which tool and params to use. The server executes it, verifies postconditions, and advances.
- - **plan_status(goalId)** — check progress: which step you're on, what's done, what's left.
- - **plan_list()** — see all goals (active, completed, failed).
- - **plan_cancel(goalId)** — abort a goal.
- - Pattern: plan_goal("do X") → review steps → plan_execute() → resolve LLM steps as they pause → on success, strategy auto-saved to memory
+ ### 6. STOP (clean up)
+ perception_stop() → stop monitoring (save resources)
+ memory_save("figma_export", ...) → save successful strategy for next time

- ## Tool Selection Priority
- 1. **ui_tree + ui_press** for native app elements (fastest, most reliable)
- 2. **browser_* tools** for web content in Chrome/Electron
- 3. ***_with_fallback** when you're unsure which method will work
- 4. **screenshot + ocr** only for canvas apps or visual verification
- 5. **applescript** for macOS-specific automation (Finder, Mail, etc.)
+ ## For Web/Browser (Chrome, Electron apps)
+ browser_navigate("https://...") → go to URL
+ browser_stealth() → activate FIRST if site has bot detection
+ browser_dom() → read page structure (CSS selectors)
+ browser_click("#submit") → click element by CSS selector
+ browser_type("input", "text") → type into form field
+ browser_fill_form({...}) → fill multiple fields at once (human-like timing)
+ browser_js("return ...") → run JavaScript for complex extraction/actions
+ browser_wait("selector") → wait for element to appear
+ browser_human_click(x, y) → human-like click with randomized timing
 
- ## Tips
- - Always call platform_guide() before automating a new app/site
- - Use memory_recall() before attempting something you might have done before
- - Start perception_start() for complex workflows, stop when done
- - Prefer *_with_fallback tools over bare tools for reliability
- - browser_stealth() before visiting sites with bot detection
+ All browser tools work in the background (~10ms) — no need to focus Chrome.
+
+ ## For Complex Multi-Step Tasks (let ScreenHand plan it)
+ plan_goal("Export video as H.264") → describe WHAT you want — ScreenHand generates steps from playbooks, strategies, and references
+ plan_execute(goalId) → auto-run deterministic steps; pauses at LLM steps for your judgment
+ plan_step_resolve(goalId, tool, params) → you provide the tool+params for LLM steps
+ plan_status(goalId) → check progress
+ plan_cancel(goalId) → abort if needed
+
+ On success, the strategy is auto-saved to memory for future reuse.
+
+ ## For Repeatable Workflows (automate once, run forever)
+ playbook_record() → start recording your actions
+ ... do the work ...
+ export_playbook() → save as reusable playbook
+ job_create("daily post", steps) → make it a persistent job
+ worker_start() → background daemon runs jobs autonomously
+
+ Jobs survive MCP client restarts. worker_start() runs independently.
+
+ ## For Multi-Agent Coordination
+ session_claim() → claim exclusive access to an app window (lease-based)
+ session_heartbeat() → keep your lease alive (call periodically)
+ session_release() → release when done
+ supervisor_start() → daemon that detects stalled agents and auto-recovers
+
+ ## Self-Healing (automatic — no action needed)
+ When any tool fails, ScreenHand automatically tries alternative strategies (AX → CDP → OCR → coordinates). Learning is also automatic — every tool call teaches which selectors work, optimal timing, and recovery rankings per app. Check with:
+ - learning_status() → see learned preferences per app
+ - recovery_status() → see active cooldowns and cached strategies
+ - recovery_configure() → tune recovery budget (max time, max retries)
+
+ ## Tool Speed Priority
+ 1. **ui_tree + ui_press** — native Accessibility API, ~50ms (fastest, most reliable)
+ 2. **browser_* tools** — Chrome DevTools Protocol, ~10ms (background, no focus needed)
+ 3. ***_with_fallback** — auto-tries multiple methods (~100-500ms)
+ 4. **screenshot + ocr** — visual capture, ~600ms (only for canvas apps)
+ 5. **applescript** — macOS scripting (Finder, Mail, Safari, etc.)
+
+ ## Decision Flow (run BEFORE step 1 of the Golden Sequence)
+
+ Before starting any automation, ask two questions to pick your strategy:
+
+ ### "Should I learn first or just go?" → coverage_report(bundleId)
+ - 0 shortcuts, 0 selectors, 0 flows → LEARN FIRST: scan_menu_bar() + platform_explore() before acting
+ - Has selectors + flows but 0 playbooks → CAN ACT, but start playbook_record() to save for next time
+ - Has everything + high stability → GO FAST: use direct tools (ui_press, key, type_text)
+ - Has error patterns for your tool → BE CAREFUL: use *_with_fallback tools
+
+ ### "Should I use fast or safe tools?" → learning_status(bundleId)
+ - 100+ timing samples → FAST: app is well-known, use direct tools (ui_press, key, type_text, ~50ms)
+ - 1-99 timing samples → SAFE: use *_with_fallback tools (~100-500ms)
+ - 0 timing samples → LEARN: platform_explore() first, then *_with_fallback
+ - AX score > 0.9 → use ui_tree + ui_press (native accessibility, fastest)
+ - AX low, CDP high → it's a web app, use browser_* tools
+ - Both low, Vision high → canvas app, use screenshot + ocr + click_text
+
+ ### "Do I need perception?"
+ - Single action (click a button) → NO, just ui_find + ui_press
+ - Multi-step workflow (5+ steps) → YES, perception_start()
+ - Visual app (Figma, DaVinci) → YES, with vision (default)
+ - Text-heavy app (Notes, Terminal) → AX-only is enough
+
+ ### Decision Tree Summary
+ 1. coverage_report(bundleId) → do we know this app?
+    - YES (has references) → use known selectors/flows directly
+    - NO (empty) → scan_menu_bar + platform_explore FIRST
+ 2. learning_status(bundleId) → how well do we know it?
+    - 100+ samples → direct tools (fast)
+    - <100 samples → *_with_fallback (safe)
+    - 0 samples → learn first, then fallback
+ 3. Multi-step? → perception_start() : skip perception
+
+ ## Key Rule
+ Never click blind. Always: coverage_report → learning_status → KNOW → SEE → NAVIGATE → ACT → VERIFY → STOP.
  `,
  });
  // ═══════════════════════════════════════════════
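The `*_with_fallback` behavior described in the new instructions (try AX, then CDP, then OCR, then raw coordinates) can be sketched as a simple ordered strategy chain. This is an illustrative TypeScript sketch, not ScreenHand's actual internals; `Strategy`, `AttemptFn`, and `clickWithFallback` are hypothetical names:

```typescript
// Hypothetical sketch of a locator fallback chain: try each method in the
// documented order until one succeeds. Not ScreenHand's real implementation.
type Strategy = "ax" | "cdp" | "ocr" | "coordinates";

type AttemptFn = (strategy: Strategy) => boolean;

// Walk AX → CDP → OCR → coordinates; return the first strategy that worked,
// or null when every method failed.
function clickWithFallback(attempt: AttemptFn): Strategy | null {
  const chain: Strategy[] = ["ax", "cdp", "ocr", "coordinates"];
  for (const strategy of chain) {
    if (attempt(strategy)) return strategy;
  }
  return null;
}
```

For example, if the accessibility lookup and CDP both fail but OCR finds the target, the chain stops at `"ocr"`; the instructions' claim that fallback tools "auto-recover when one method fails" is exactly this ordered retry.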
@@ -6232,7 +6282,7 @@ server.tool("ingest_tutorial", "Extract structured playbook steps from a video t
  }],
  };
  });
- server.tool("coverage_report", "Generate a coverage report for an app — shows what knowledge we have (shortcuts, selectors, flows, playbooks, errors) and identifies gaps with recommendations.", {
+ server.tool("coverage_report", "CALL THIS FIRST before automating any app. Shows what ScreenHand knows: shortcuts, selectors, flows, playbooks, error patterns, and stability %. Use the result to decide your strategy: learn first (if empty), go fast (if high coverage), or use fallback tools (if error patterns exist). See Decision Flow in server instructions.", {
  bundleId: z.string().describe("macOS bundle ID (e.g. com.blackmagic-design.DaVinciResolveLite)"),
  appName: z.string().describe("Human-readable app name"),
  includeLiveMenuScan: z.boolean().optional().describe("Also scan the live menu bar for comparison (requires app to be running, needs pid)"),
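The new `coverage_report` description tells the agent to turn the report into a strategy choice (learn first, use fallback tools, or go fast). A simplified version of that decision logic can be sketched in TypeScript; the `Coverage` field names and `pickStrategy` function are illustrative assumptions, not ScreenHand's actual data model:

```typescript
// Hypothetical sketch of the strategy choice driven by coverage_report and
// learning_status data, following the thresholds in the server instructions.
interface Coverage {
  selectors: number;
  flows: number;
  playbooks: number;
  errorPatterns: number;
}

function pickStrategy(
  coverage: Coverage,
  timingSamples: number
): "learn-first" | "fallback" | "direct" {
  // Nothing known about the app: explore before acting.
  if (coverage.selectors === 0 && coverage.flows === 0) return "learn-first";
  // Known error patterns, or fewer than 100 timing samples: prefer
  // *_with_fallback tools for safety.
  if (coverage.errorPatterns > 0 || timingSamples < 100) return "fallback";
  // Well-known app (100+ samples, no error patterns): direct tools are fast.
  return "direct";
}
```

This collapses the three-branch "Decision Tree Summary" from the instructions into one function; a real implementation would also weigh AX/CDP/Vision scores when choosing between native, browser, and OCR tools.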
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "screenhand",
- "version": "0.3.3",
+ "version": "0.3.5",
  "mcpName": "io.github.manushi4/screenhand",
  "description": "Give AI eyes and hands on your desktop. ScreenHand is an open-source MCP server that lets Claude and other AI agents see your screen, click buttons, type text, and control any app on macOS and Windows.",
  "homepage": "https://screenhand.com",