screenhand 0.3.5 → 0.3.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/dist/mcp-desktop.js +67 -109
  2. package/package.json +1 -1
@@ -267,134 +267,92 @@ async function ensureCDP(overridePort) {
  throw new Error("Chrome not running with --remote-debugging-port. Launch with: /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug");
  }
  const server = new McpServer({ name: "screenhand", version: "3.0.0" }, {
- instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools. Never click blind — always follow: KNOW → SEE → NAVIGATE → ACT → VERIFY → STOP.
+ instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools.

- ## The Golden Sequence (follow this order)
+ ## Quick Actions (just do it)
+ For simple tasks, go direct — no setup needed:

- ### 1. KNOW (before touching anything)
- platform_guide("figma") → get selectors, flows, known errors for this app/site
- memory_recall("figma export") → check if you've done this before — reuse past strategies
- scan_menu_bar() → discover all menu items in the current app
+ focus("com.apple.Notes") → ui_press("New Note") → type_text("hello") → key("cmd+s")
+ browser_navigate("https://...") → browser_click("#btn") → browser_js("return ...")
+
+ ## Tool Speed (fastest first)
+ 1. **ui_press / key / type_text** — native AX, ~50ms
+ 2. **browser_* tools** — CDP, ~10ms (background, no focus needed)
+ 3. ***_with_fallback** — auto-tries AX → CDP → OCR (~100-500ms)
+ 4. **screenshot + ocr** — visual, ~600ms (canvas apps only)
+ 5. **applescript** — macOS scripting (Finder, Mail, Safari)

- If platform_guide() has no data: platform_explore("bundleId") to auto-discover the app, or platform_learn("domain") for websites.
+ ## The Golden Sequence (for multi-step workflows)
+ For complex tasks with 3+ steps, follow this order:
+
+ ### 1. KNOW (before touching anything)
+ platform_guide("figma") → get selectors, flows, known errors
+ memory_recall("figma export") → reuse past strategies
+ If unknown app: platform_explore("bundleId") or platform_learn("domain")

  ### 2. SEE (understand current state)
  apps() → what's running?
- perception_start() → turn on continuous monitoring (3-rate: 100ms/300ms/1000ms)
- world_state() → current app, windows, controls, dialogs
- screenshot() → visual confirmation if needed
+ perception_start() → continuous monitoring (for multi-step only)
+ world_state() → current app, windows, controls

- perception_start() keeps world_state() continuously updated. Use it for complex multi-step workflows.
-
- ### 3. NAVIGATE (get to the right place)
+ ### 3. NAVIGATE

  focus("com.figma.Desktop") → bring app to front
- ui_tree() → see all clickable elements with roles and labels
- ui_find("Export") → check if a specific target exists before clicking
+ ui_tree() → see all clickable elements
+ ui_find("Export") → check if target exists

- ### 4. ACT (do the thing)
- click_with_fallback("Export") → click element (auto-tries AX → CDP → OCR → coordinates)
- type_with_fallback("filename") → type text with auto-fallback
+ ### 4. ACT
+ click_with_fallback("Export") → click (auto-tries multiple methods)
+ type_with_fallback("filename") → type with fallback
  key("cmd+shift+e") → keyboard shortcuts
- drag(fromX, fromY, toX, toY) → drag and drop
- scroll(direction) → scroll up/down/left/right
-
- Always prefer *_with_fallback tools over bare click/type — they auto-recover when one method fails.
-
- ### 5. VERIFY (confirm it worked)
- world_state() → did UI change as expected?
- world_state_diff() → what exactly changed since last check?
- screenshot() → visual proof
-
- ### 6. STOP (clean up)
- perception_stop() → stop monitoring (save resources)
- memory_save("figma_export", ...) → save successful strategy for next time
-
- ## For Web/Browser (Chrome, Electron apps)
- browser_navigate("https://...") → go to URL
- browser_stealth() → activate FIRST if site has bot detection
- browser_dom() → read page structure (CSS selectors)
- browser_click("#submit") → click element by CSS selector
- browser_type("input", "text") → type into form field
- browser_fill_form({...}) → fill multiple fields at once (human-like timing)
- browser_js("return ...") → run JavaScript for complex extraction/actions
- browser_wait("selector") → wait for element to appear
- browser_human_click(x, y) → human-like click with randomized timing
-
- All browser tools work in the background (~10ms) — no need to focus Chrome.
-
- ## For Complex Multi-Step Tasks (let ScreenHand plan it)
- plan_goal("Export video as H.264") → describe WHAT you want — ScreenHand generates steps from playbooks, strategies, and references
- plan_execute(goalId) → auto-run deterministic steps, pauses at LLM steps for your judgment
- plan_step_resolve(goalId, tool, params) → you provide the tool+params for LLM steps
- plan_status(goalId) → check progress
- plan_cancel(goalId) → abort if needed
-
- On success, the strategy is auto-saved to memory for future reuse.
-
- ## For Repeatable Workflows (automate once, run forever)
- playbook_record() → start recording your actions
- ... do the work ...
- export_playbook() → save as reusable playbook
- job_create("daily post", steps) → make it a persistent job
- worker_start() → background daemon runs jobs autonomously
-
- Jobs survive MCP client restarts. worker_start() runs independently.

- ## For Multi-Agent Coordination
- session_claim() → claim exclusive access to an app window (lease-based)
- session_heartbeat() → keep your lease alive (call periodically)
- session_release() → release when done
- supervisor_start() → daemon that detects stalled agents and auto-recovers
+ ### 5. VERIFY
+ world_state() → did UI change?
+ world_state_diff() → what changed?

- ## Self-Healing (automatic — no action needed)
- When any tool fails, ScreenHand automatically tries alternative strategies (AX → CDP → OCR → coordinates). Learning is also automatic — every tool call teaches which selectors work, optimal timing, and recovery rankings per app. Check with:
- - learning_status() → see learned preferences per app
- - recovery_status() → see active cooldowns and cached strategies
- - recovery_configure() → tune recovery budget (max time, max retries)
+ ### 6. STOP
+ perception_stop() → stop monitoring
+ memory_save("task", ...) → save strategy for next time

- ## Tool Speed Priority
- 1. **ui_tree + ui_press** — native Accessibility API, ~50ms (fastest, most reliable)
- 2. **browser_* tools** — Chrome DevTools Protocol, ~10ms (background, no focus needed)
- 3. ***_with_fallback** — auto-tries multiple methods (~100-500ms)
- 4. **screenshot + ocr** — visual capture, ~600ms (only for canvas apps)
- 5. **applescript** — macOS scripting (Finder, Mail, Safari, etc.)
+ ## Strategy Selection (optional — for when you want to be smart about it)
+ Use these tools to pick the best approach. Skip for quick one-off actions.

- ## Decision Flow (run BEFORE step 1 of Golden Sequence)
+ **coverage_report(bundleId)** — what does ScreenHand know about this app?
+ - Empty (0 selectors/flows) → learn first: scan_menu_bar() + platform_explore()
+ - Has data + high stability → go fast: direct tools (ui_press, key)
+ - Has error patterns → be careful: use *_with_fallback tools

- Before starting any automation, ask two questions to pick your strategy:
+ **learning_status(bundleId)** — how experienced is ScreenHand with this app?
+ - 100+ samples → app is well-known, direct tools are safe
+ - 0 samples → unknown app, use *_with_fallback
+ - AX score high → use ui_tree + ui_press
+ - CDP score high → it's a web app, use browser_* tools
+ - Vision score high → canvas app, use screenshot + ocr

- ### "Should I learn first or just go?" → coverage_report(bundleId)
- - 0 shortcuts, 0 selectors, 0 flows → LEARN FIRST: scan_menu_bar() + platform_explore() before acting
- - Has selectors + flows but 0 playbooks → CAN ACT, but start playbook_record() to save for next time
- - Has everything + high stability → GO FAST: use direct tools (ui_press, key, type_text)
- - Has error patterns for your tool → BE CAREFUL: use *_with_fallback tools
+ ## Browser Automation
+ browser_navigate/browser_click/browser_type/browser_js all work in background (~10ms)
+ browser_stealth() → activate before sites with bot detection
+ browser_fill_form({...}) → human-like multi-field form filling
+ browser_human_click(x, y) → randomized timing to avoid detection

- ### "Should I use fast or safe tools?" → learning_status(bundleId)
- - 100+ timing samples → FAST: app is well-known, use direct tools (ui_press, key, type_text ~50ms)
- - 1-99 timing samples → SAFE: use *_with_fallback tools (~100-500ms)
- - 0 timing samples → LEARN: platform_explore() first, then *_with_fallback
- - AX score > 0.9 → use ui_tree + ui_press (native accessibility, fastest)
- - AX low, CDP high → it's a web app, use browser_* tools
- - Both low, Vision high → canvas app, use screenshot + ocr + click_text
+ ## Planning (let ScreenHand figure out the steps)
+ plan_goal("Export video as H.264") → generates step-by-step plan from playbooks/strategies/references
+ plan_execute(goalId) → auto-runs known steps, pauses at LLM steps for your judgment
+ plan_step_resolve(goalId, tool, params) → you resolve paused steps
+ plan_status(goalId) / plan_list() / plan_cancel(goalId)

- ### "Do I need perception?"
- - Single action (click a button) → NO, just ui_find + ui_press
- - Multi-step workflow (5+ steps) → YES, perception_start()
- - Visual app (Figma, DaVinci) → YES, with vision (default)
- - Text-heavy app (Notes, Terminal) → AX-only is enough
+ ## Repeatable Workflows
+ playbook_record() → do work → export_playbook() → job_create("name", steps) → worker_start()
+ Jobs survive restarts. Worker daemon runs independently.

- ### Decision Tree Summary
- 1. coverage_report(bundleId) → do we know this app?
-    - YES (has references) → use known selectors/flows directly
-    - NO (empty) → scan_menu_bar + platform_explore FIRST
- 2. learning_status(bundleId) → how well do we know it?
-    - 100+ samples → direct tools (fast)
-    - <100 samples → *_with_fallback (safe)
-    - 0 samples → learn first, then fallback
- 3. Multi-step? → perception_start() : skip perception
+ ## Multi-Agent
+ session_claim() → work → session_heartbeat() → session_release()
+ supervisor_start() → auto-detects stalled agents and recovers

- ## Key Rule
- Never click blind. Always: coverage_report → learning_status → KNOW → SEE → NAVIGATE → ACT → VERIFY → STOP.
+ ## Self-Healing (automatic)
+ Tool failures auto-retry with alternative strategies. Learning is automatic — every call improves selectors, timing, and recovery per app.
+ - learning_status() — inspect learned knowledge
+ - recovery_status() — check recovery state
+ - recovery_configure() — tune recovery budget
  `,
  });
  // ═══════════════════════════════════════════════
@@ -6282,7 +6240,7 @@ server.tool("ingest_tutorial", "Extract structured playbook steps from a video t
  }],
  };
  });
- server.tool("coverage_report", "CALL THIS FIRST before automating any app. Shows what ScreenHand knows: shortcuts, selectors, flows, playbooks, error patterns, and stability %. Use the result to decide your strategy: learn first (if empty), go fast (if high coverage), or use fallback tools (if error patterns exist). See Decision Flow in server instructions.", {
+ server.tool("coverage_report", "Check what ScreenHand knows about an app: shortcuts, selectors, flows, playbooks, error patterns, and stability %. Useful before complex workflows to decide strategy: learn first (if empty), go fast (if high coverage), or use fallback tools (if error patterns exist). Optional for quick actions.", {
  bundleId: z.string().describe("macOS bundle ID (e.g. com.blackmagic-design.DaVinciResolveLite)"),
  appName: z.string().describe("Human-readable app name"),
  includeLiveMenuScan: z.boolean().optional().describe("Also scan the live menu bar for comparison (requires app to be running, needs pid)"),
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "screenhand",
- "version": "0.3.5",
+ "version": "0.3.6",
  "mcpName": "io.github.manushi4/screenhand",
  "description": "Give AI eyes and hands on your desktop. ScreenHand is an open-source MCP server that lets Claude and other AI agents see your screen, click buttons, type text, and control any app on macOS and Windows.",
  "homepage": "https://screenhand.com",
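
The strategy thresholds described in the rewritten instructions (sample counts and per-channel scores reported by learning_status) can be sketched as a standalone function. This is an illustration only: the pickStrategy function and the LearningStatus shape are hypothetical and not part of the screenhand package; the 0.9 AX-score cutoff comes from the removed instruction text.

```typescript
// Hypothetical sketch, not part of screenhand: encodes the decision
// rules from the server instructions (sample counts, channel scores).
interface LearningStatus {
  samples: number;      // timing samples collected for this app
  axScore: number;      // native Accessibility reliability (0-1)
  cdpScore: number;     // Chrome DevTools Protocol reliability (0-1)
  visionScore: number;  // screenshot + OCR reliability (0-1)
}

function pickStrategy(s: LearningStatus): string {
  if (s.samples === 0) {
    // Unknown app: explore first, then act with fallbacks.
    return "platform_explore, then *_with_fallback";
  }
  if (s.samples < 100) {
    // Some experience, but not enough to trust direct tools.
    return "*_with_fallback";
  }
  // Well-known app: pick the channel with the best track record.
  if (s.axScore > 0.9) return "ui_tree + ui_press";
  if (s.cdpScore > s.axScore) return "browser_* tools";
  if (s.visionScore > s.axScore) return "screenshot + ocr";
  return "direct tools (ui_press, key, type_text)";
}
```

The 0.3.6 instructions collapse this logic into bullets under "Strategy Selection"; the function above just makes the branch order explicit.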