screenhand 0.3.7 → 0.3.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -267,82 +267,86 @@ async function ensureCDP(overridePort) {
  throw new Error("Chrome not running with --remote-debugging-port. Launch with: /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug");
  }
  const server = new McpServer({ name: "screenhand", version: "3.0.0" }, {
- instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools.
-
- ## Quick Actions (just do it)
- For simple tasks, go direct — no setup needed:
+ instructions: `ScreenHand gives you native desktop control on macOS/Windows. 111 tools across 7 layers.

+ ## Quick Actions (1-2 steps, no setup)
  focus("com.apple.Notes") → ui_press("New Note") → type_text("hello") → key("cmd+s")
  browser_navigate("https://...") → browser_click("#btn") → browser_js("return ...")

- ## Tool Speed (fastest first)
- 1. **ui_press / key / type_text** — native AX, ~50ms
- 2. **browser_* tools** — CDP, ~10ms (background, no focus needed)
- 3. ***_with_fallback** — auto-tries AX → CDP → OCR (~100-500ms)
- 4. **screenshot + ocr** — visual, ~600ms (canvas apps only)
- 5. **applescript** — macOS scripting (Finder, Mail, Safari)
-
- ## The Golden Sequence (for multi-step workflows)
- For complex tasks with 3+ steps, follow this order:
+ ## Smart Decision Flow (3+ steps)

- ### 1. KNOW (before touching anything)
- platform_guide("figma") → get selectors, flows, known errors
- memory_recall("figma export") → reuse past strategies
- If unknown app: platform_explore("bundleId") or platform_learn("domain")
+ ### Step 0: DECIDE — learn or go?
+ coverage_report(bundleId, appName) tells you exactly what ScreenHand knows
+ - "0 selectors, 0 flows" → LEARN FIRST (Step 0a)
+ - "Has selectors + flows" → GO (skip to Step 1)
+ - "Has error patterns for your tool" → use *_with_fallback tools

- ### 2. SEE (understand current state)
- apps() → what's running?
- perception_start() → continuous monitoring (for multi-step only)
- world_state() → current app, windows, controls
+ learning_status(bundleId) tells you WHICH tools to use
+ - AX score > 0.9 → use ui_press/ui_tree (fastest, ~50ms)
+ - CDP score high → it's a web app → use browser_* tools (~10ms)
+ - Vision score high → canvas app → use screenshot + ocr (~600ms)
+ - 0 samples → unknown app → always use *_with_fallback

- ### 3. NAVIGATE
- focus("com.figma.Desktop") → bring app to front
- ui_tree() → see all clickable elements
- ui_find("Export") → check if target exists
+ ### Step 0a: LEARN (only if coverage_report says gaps)
+ scan_menu_bar() → discover shortcuts + menu structure
+ platform_explore("bundleId") → map all interactive elements
+ platform_guide("platform") → load curated selectors/flows/errors
+ memory_recall("task description") → reuse past strategies
+ Then go to Step 1.

- ### 4. ACT
- click_with_fallback("Export") → click (auto-tries multiple methods)
- type_with_fallback("filename") → type with fallback
- key("cmd+shift+e") → keyboard shortcuts
+ ### Step 1: SEE
+ perception_start() turns on continuous monitoring (3 rates: AX 100ms, CDP 300ms, Vision 1s)
+ world_state() → verify windows + controls are tracked
+ If world_state shows 0 controls → wait 1-2s for perception to populate, then retry.

- ### 5. VERIFY
- world_state() → did UI change?
- world_state_diff() → what changed?
+ While perception runs, you get automatic features:
+ - Auto world_state_diff after every action tool (Δ line in response)
+ - Auto dialog dismissal (learning-ranked: Cancel/OK/Escape)
+ - Auto context switch when apps change (loads new reference)

- ### 6. STOP
- perception_stop() → stop monitoring
- memory_save("task", ...) → save strategy for next time
+ ### Step 2: ACT + VERIFY (loop)
+ Each action tool response includes: world summary + Δ changes + perception freshness + learning hints.
+ No need to manually call world_state() or world_state_diff() — it's automatic.

- ## Strategy Selection (optional — for when you want to be smart about it)
- Use these tools to pick the best approach. Skip for quick one-off actions.
+ **Tool priority:**
+ 1. ui_press / key / type_text — native AX, ~50ms (when AX score high)
+ 2. browser_* tools — CDP, ~10ms, background (web content)
+ 3. *_with_fallback — auto-tries AX → CDP → OCR (~100-500ms, when unsure)
+ 4. screenshot + ocr — visual (~600ms, canvas apps / visual verification)
+ 5. applescript — macOS scripting (Finder, Mail, bulk ops)

- **coverage_report(bundleId)** → what does ScreenHand know about this app?
- - Empty (0 selectors/flows) → learn first: scan_menu_bar() + platform_explore()
- - Has data + high stability → go fast: direct tools (ui_press, key)
- - Has error patterns → be careful: use *_with_fallback tools
+ **Read the Δ line after each action:**
+ - controls: 690→728 → UI changed, action worked
+ - dialogs: 0→1 → dialog appeared, auto-dismiss will handle it
+ - No Δ line → nothing changed, action may have failed

- **learning_status(bundleId)** → how experienced is ScreenHand with this app?
- - 100+ samples → app is well-known, direct tools are safe
- - 0 samples → unknown app, use *_with_fallback
- - AX score high → use ui_tree + ui_press
- - CDP score high → it's a web app, use browser_* tools
- - Vision score high → canvas app, use screenshot + ocr
+ ### Step 3: RECORD (optional — make it repeatable)
+ playbook_record(action="start", platform="notes") → start capturing
+ ... do the workflow ...
+ playbook_record(action="clean") → auto-remove failed steps + retries
+ playbook_record(action="status") → review steps (shows ⚠️ FAILED markers)
+ playbook_record(action="trim", removeSteps=[2,5]) → remove specific bad steps
+ playbook_record(action="stop", name="my workflow") → save as reusable playbook

- ## Browser Automation
- browser_navigate/browser_click/browser_type/browser_js all work in background (~10ms)
- browser_stealth() → activate before sites with bot detection
- browser_fill_form({...}) — human-like multi-field form filling
- browser_human_click(x, y) — randomized timing to avoid detection
+ ### Step 4: STOP
+ perception_stop() → stop monitoring, save resources
+ memory_save("key", "strategy") → save what worked for next time

  ## Planning (let ScreenHand figure out the steps)
- plan_goal("Export video as H.264") → generates step-by-step plan from playbooks/strategies/references
- plan_execute(goalId) → auto-runs known steps, pauses at LLM steps for your judgment
+ plan_goal("Export video as H.264") → generates plan from playbooks/strategies/references
+ plan_execute(goalId) → auto-runs known steps, pauses at LLM steps
  plan_step_resolve(goalId, tool, params) → you resolve paused steps
  plan_status(goalId) / plan_list() / plan_cancel(goalId)

- ## Repeatable Workflows
- playbook_record() → do work → export_playbook() → job_create("name", steps) → worker_start()
- Jobs survive restarts. Worker daemon runs independently.
+ ## Browser
+ browser_navigate/click/type/js — background via CDP (~10ms)
+ browser_stealth() before sites with bot detection
+ browser_fill_form({...}) — human-like form filling
+ browser_human_click(x, y) — randomized timing
+ All browser tools accept cdpPort param for Electron apps (e.g. 9333)
+
+ ## Jobs (survive restarts)
+ playbook → job_create("name", steps) → job_run(id) or worker_start() for background

  ## Multi-Agent
  session_claim() → work → session_heartbeat() → session_release()
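For concreteness, the rewritten decision flow maps onto ordinary MCP tool calls. A minimal client-side sketch, assuming the standard MCP TypeScript SDK; the tool names come from the instructions string above, while the launch command, bundle id, and argument shapes are illustrative:

```js
// Minimal sketch of the Step 0 → Step 4 flow from an MCP client.
// Assumes @modelcontextprotocol/sdk; bundle id and arguments are examples.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "example-agent", version: "0.0.1" });
await client.connect(new StdioClientTransport({ command: "npx", args: ["screenhand"] }));

// Step 0: DECIDE — what does ScreenHand already know about this app?
const coverage = await client.callTool({
  name: "coverage_report",
  arguments: { bundleId: "com.apple.Notes", appName: "Notes" },
});
console.log(coverage.content); // "0 selectors, 0 flows" would mean: learn first (Step 0a)

// Step 1/2: SEE, then ACT — each action response carries the Δ line described above
await client.callTool({ name: "perception_start", arguments: {} });
await client.callTool({ name: "click_with_fallback", arguments: { target: "New Note" } });

// Step 4: STOP
await client.callTool({ name: "perception_stop", arguments: {} });
```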
@@ -449,6 +453,39 @@ recoveryEngine.setAppMap(appMap);
  planner.setToolRegistry(toolRegistry);
  planner.setAppMap(appMap);
  perceptionManager.setLearningEngine(learningEngine);
+ // ── Reactive event loop: wire perception events to automatic responses ──
+ // These fire at perception speed (100-300ms), not LLM speed (~2-3s).
+ perceptionManager.on("dialog_detected", (event) => {
+ // Auto-dismiss unexpected dialogs using the best strategy from learning
+ const bundleId = worldModel.getState().focusedApp?.bundleId;
+ const ranked = bundleId
+ ? learningEngine.rankRecoveryStrategies("unexpected_dialog", bundleId)
+ : [];
+ // Pick the top-ranked strategy, or default to Escape
+ const bestStrategy = ranked.length > 0 && ranked[0].score > 0.3
+ ? ranked[0].strategyId
+ : "dismiss_dialog_escape";
+ // Map strategy to tool call
+ const strategyActions = {
+ dismiss_dialog_cancel: { tool: "click_text", params: { text: "Cancel" } },
+ dismiss_dialog_ok: { tool: "click_text", params: { text: "OK" } },
+ dismiss_dialog_escape: { tool: "key", params: { combo: "Escape" } },
+ grant_permission_allow: { tool: "click_text", params: { text: "Allow" } },
+ grant_permission_ok: { tool: "click_text", params: { text: "OK" } },
+ };
+ const action = strategyActions[bestStrategy] ?? strategyActions["dismiss_dialog_escape"];
+ console.error(`[reactive] Dialog detected: "${event.title}" (pid=${event.pid}) → auto-${bestStrategy}`);
+ // Execute non-blocking — fire and forget, don't block perception loop
+ toolRegistry.toExecutor()(action.tool, action.params).catch((err) => {
+ console.error(`[reactive] Auto-dismiss failed: ${err instanceof Error ? err.message : err}`);
+ });
+ });
+ perceptionManager.on("app_switched", (event) => {
+ // Auto-update context tracker when app switches (loads new reference/playbook)
+ contextTracker.updateContext("focus", { bundleId: event.bundleId });
+ // Log for observability
+ console.error(`[reactive] App switched to ${event.bundleId} (pid=${event.pid})`);
+ });
  const mcpRecorder = new McpPlaybookRecorder(playbooksDir);
  const referenceMerger = new ReferenceMerger(referencesDir);
  const communityPublisher = new PlaybookPublisher();
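The perception manager is subscribed to with .on(...), so the new dialog handler can be exercised with a synthetic event. A sketch; the payload shape { title, pid } is inferred from the console.error call above, and with no learned strategies the handler falls back to Escape:

```js
// Sketch: drive the new handler without a real dialog. With an empty
// strategy ranking, bestStrategy defaults to "dismiss_dialog_escape",
// so the handler fires the key tool with combo "Escape".
perceptionManager.emit("dialog_detected", { title: "Save changes?", pid: 4242 });
// stderr: [reactive] Dialog detected: "Save changes?" (pid=4242) → auto-dismiss_dialog_escape
```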
@@ -4182,6 +4219,27 @@ server.tool("click_with_fallback", "Click a target by text using the canonical f
  }
  throw new Error("Target not found via OCR");
  }
+ case "window_buffer": {
+ // Last resort: capture GPU window buffer (works even when window is hidden),
+ // OCR it, find target text, translate window-relative to screen-absolute coords
+ const wbWindowId = await resolveWindowId(targetPid);
+ if (!wbWindowId)
+ throw new Error("No window found for window_buffer capture");
+ const wbShot = await bridge.call("cg.captureWindow", { windowId: wbWindowId });
+ const wbMatches = await bridge.call("vision.findText", { imagePath: wbShot.path, searchText: target });
+ const wbMatch = Array.isArray(wbMatches) ? wbMatches[0] : null;
+ if (!wbMatch?.bounds)
+ throw new Error("Target not found via window buffer OCR");
+ // Translate window-relative coords to screen-absolute
+ const allWins = await bridge.call("app.windows");
+ const winInfo = allWins.find((w) => w.windowId === wbWindowId);
+ const winX = winInfo?.bounds?.x ?? 0;
+ const winY = winInfo?.bounds?.y ?? 0;
+ const absX = winX + wbMatch.bounds.x + wbMatch.bounds.width / 2;
+ const absY = winY + wbMatch.bounds.y + wbMatch.bounds.height / 2;
+ await bridge.call("cg.mouseClick", { x: absX, y: absY });
+ return { ok: true, method, durationMs: Date.now() - start, fallbackFrom: null, retries: attempt, error: null, target: `${target} at (${Math.round(absX)},${Math.round(absY)}) [window_buffer]` };
+ }
  }
  throw new Error(`Unknown method: ${method}`);
  }
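The window-relative to screen-absolute translation in the new case is easy to check by hand. A standalone sketch of the same arithmetic; the helper name is illustrative, not a ScreenHand API:

```js
// A window at (100, 200) with an OCR match at (40, 60), size 80x20,
// should be clicked at the match's center in screen coordinates.
function toScreenCenter(winBounds, matchBounds) {
  return {
    x: winBounds.x + matchBounds.x + matchBounds.width / 2,
    y: winBounds.y + matchBounds.y + matchBounds.height / 2,
  };
}
console.log(toScreenCenter({ x: 100, y: 200 }, { x: 40, y: 60, width: 80, height: 20 }));
// → { x: 180, y: 270 }
```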
@@ -4516,6 +4574,22 @@ server.tool("read_with_fallback", "Read text content from the screen or a specif
  const ocr = await bridge.call("vision.ocr", { imagePath: shot.path });
  return { ok: true, method, durationMs: Date.now() - start, fallbackFrom: null, retries: attempt, error: null, target: ocr.text?.slice(0, 4000) ?? "" };
  }
+ case "window_buffer": {
+ // GPU window buffer capture — reads content even when window is behind other apps
+ const rbWindowId = await resolveWindowId(targetPid);
+ if (!rbWindowId)
+ throw new Error("No window found for window_buffer read");
+ const rbShot = await bridge.call("cg.captureWindow", { windowId: rbWindowId });
+ if (target) {
+ const rbMatches = await bridge.call("vision.findText", { imagePath: rbShot.path, searchText: target });
+ const rbMatch = Array.isArray(rbMatches) ? rbMatches[0] : null;
+ if (!rbMatch)
+ throw new Error("Text not found via window buffer OCR");
+ return { ok: true, method, durationMs: Date.now() - start, fallbackFrom: null, retries: attempt, error: null, target: rbMatch.text };
+ }
+ const rbOcr = await bridge.call("vision.ocr", { imagePath: rbShot.path });
+ return { ok: true, method, durationMs: Date.now() - start, fallbackFrom: null, retries: attempt, error: null, target: rbOcr.text?.slice(0, 4000) ?? "" };
+ }
  }
  throw new Error(`Method ${method} does not support read`);
  }
@@ -4631,6 +4705,29 @@ server.tool("locate_with_fallback", "Find an element's position on screen using
  const b = match.bounds;
  return { ok: true, method, durationMs: Date.now() - start, fallbackFrom: null, retries: attempt, error: null, target: `${target} at (${b.x},${b.y} ${b.width}x${b.height})` };
  }
+ case "window_buffer": {
+ // GPU window buffer capture + OCR — works even when window is hidden
+ const lbWindowId = await resolveWindowId(targetPid);
+ if (!lbWindowId)
+ throw new Error("No window found for window_buffer locate");
+ const lbShot = await bridge.call("cg.captureWindow", { windowId: lbWindowId });
+ const lbMatches = await bridge.call("vision.findText", { imagePath: lbShot.path, searchText: target });
+ const lbMatch = Array.isArray(lbMatches) ? lbMatches[0] : null;
+ if (!lbMatch?.bounds)
+ throw new Error("Target not found via window buffer OCR");
+ // Translate window-relative to screen-absolute bounds
+ const lbWins = await bridge.call("app.windows");
+ const lbWinInfo = lbWins.find((w) => w.windowId === lbWindowId);
+ const lbOffX = lbWinInfo?.bounds?.x ?? 0;
+ const lbOffY = lbWinInfo?.bounds?.y ?? 0;
+ const lbBounds = {
+ x: lbOffX + lbMatch.bounds.x,
+ y: lbOffY + lbMatch.bounds.y,
+ width: lbMatch.bounds.width,
+ height: lbMatch.bounds.height,
+ };
+ return { ok: true, method, durationMs: Date.now() - start, fallbackFrom: null, retries: attempt, error: null, target: `${target} at (${lbBounds.x},${lbBounds.y} ${lbBounds.width}x${lbBounds.height}) [window_buffer]` };
+ }
  }
  throw new Error(`Method ${method} does not support locate`);
  }
@@ -126,6 +126,26 @@ export class LearningEngine {
  rankSensors(bundleId) {
  return this.sensors.rank(bundleId);
  }
+ /**
+ * Detect whether an app is "vision-only" — AX can't see its content,
+ * so window buffer capture + OCR is the only viable perception source.
+ * Returns true when AX has failed enough times with a low score and
+ * at least one other source (vision/ocr) has succeeded.
+ */
+ isVisionOnlyApp(bundleId) {
+ const ranked = this.sensors.rank(bundleId);
+ if (ranked.length < 2)
+ return false;
+ const ax = ranked.find(r => r.sourceType === "ax");
+ const vision = ranked.find(r => r.sourceType === "vision" || r.sourceType === "ocr");
+ // AX score near zero + vision/ocr has some success
+ if (ax && ax.score < 0.15 && vision && vision.score > 0.3)
+ return true;
+ // No AX entry at all but vision works
+ if (!ax && vision && vision.score > 0.3)
+ return true;
+ return false;
+ }
  /**
  * Query verified UI patterns for a given app, optionally filtered by tool.
  */
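The heuristic reads most easily against concrete rank tables. An illustrative sketch — the scores are invented, and sensors.rank() is assumed to return { sourceType, score } entries, as the code above implies:

```js
// Invented rank tables and where the thresholds bite:
//   [{ sourceType: "ax", score: 0.05 }, { sourceType: "vision", score: 0.6 }]
//     → true  (AX below 0.15, vision above 0.3: vision-only app)
//   [{ sourceType: "ax", score: 0.8 }, { sourceType: "vision", score: 0.6 }]
//     → false (AX works; no need to force vision)
//   [{ sourceType: "vision", score: 0.6 }, { sourceType: "ocr", score: 0.4 }]
//     → true  (no AX entry at all, but vision succeeds)
//   [{ sourceType: "vision", score: 0.6 }]
//     → false (fewer than 2 ranked sources — not enough evidence)
const visionOnly = learningEngine.isVisionOnlyApp("com.example.CanvasApp");
```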
@@ -768,13 +768,18 @@ export class PerceptionCoordinator extends EventEmitter {
  // Safe CLI mode is already enabled via setSafeCLI() in start().
  // This allows vision/OCR for canvas-heavy apps like Canva in Chrome.
  // Skip vision if learning engine shows it consistently fails for this app,
- // but retry every 20th cycle to re-evaluate (apps may gain windows later)
+ // but retry every 20th cycle to re-evaluate (apps may gain windows later).
+ // Exception: vision-only apps (AX blind) — vision/OCR is their ONLY perception
+ // source, so never skip it. Window buffer capture works even when window is hidden.
  if (this.learningEngine && this.activeAppContext) {
- const ranked = this.learningEngine.rankSensors(this.activeAppContext.bundleId);
- const visionRank = ranked.find(r => r.sourceType === "vision");
- if (visionRank && visionRank.score < 0.1 && ranked.length >= 2 && this.stats.slowCycles % 20 !== 0) {
- this.stats.slowCycles++;
- return; // Vision consistently fails for this app — skip (retry every 20th cycle)
+ const isVisionOnly = this.learningEngine.isVisionOnlyApp(this.activeAppContext.bundleId);
+ if (!isVisionOnly) {
+ const ranked = this.learningEngine.rankSensors(this.activeAppContext.bundleId);
+ const visionRank = ranked.find(r => r.sourceType === "vision");
+ if (visionRank && visionRank.score < 0.1 && ranked.length >= 2 && this.stats.slowCycles % 20 !== 0) {
+ this.stats.slowCycles++;
+ return; // Vision consistently fails for this app — skip (retry every 20th cycle)
+ }
  }
  }
  const timestamp = new Date().toISOString();
@@ -860,14 +865,24 @@ export class PerceptionCoordinator extends EventEmitter {
  },
  });
  }
- // Record vision sensor outcome
+ // Record vision sensor outcome — also record as window_buffer for vision-only apps
+ // so the fallback chain knows this source works for element location
  if (this.learningEngine && this.activeAppContext) {
+ const latencyMs = Date.now() - new Date(timestamp).getTime();
  this.learningEngine.recordSensorOutcome({
  bundleId: this.activeAppContext.bundleId,
  sourceType: "vision",
  success: !!diffEvent,
- latencyMs: Date.now() - new Date(timestamp).getTime(),
+ latencyMs,
  });
+ if (this.learningEngine.isVisionOnlyApp(this.activeAppContext.bundleId) && ocrEvent) {
+ this.learningEngine.recordSensorOutcome({
+ bundleId: this.activeAppContext.bundleId,
+ sourceType: "window_buffer",
+ success: true,
+ latencyMs,
+ });
+ }
  }
  }
  catch {
@@ -21,6 +21,7 @@ export const DEFAULT_PERCEPTION_CONFIG = {
21
21
  enableAX: true,
22
22
  enableCDP: true,
23
23
  enableVision: true,
24
+ enableWindowBuffer: true,
24
25
  maxROIsPerCycle: 3,
25
26
  skipCaptureLock: false,
26
27
  };
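Callers can presumably override the new flag by spreading over the defaults. A one-line sketch; the actual merge site is not shown in this diff:

```js
// Assumption: the coordinator accepts a plain-merged config object.
const config = { ...DEFAULT_PERCEPTION_CONFIG, enableWindowBuffer: false };
```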
@@ -23,7 +23,7 @@
  */
  // ── 1. Fallback Chain ──────────────────────────────────────────────────
  /** Ordered list of execution methods, from fastest/most reliable to slowest/least reliable */
- const EXECUTION_METHODS = ["ax", "cdp", "ocr", "coordinates"];
+ const EXECUTION_METHODS = ["ax", "cdp", "ocr", "window_buffer", "coordinates"];
  const METHOD_CAPABILITIES = {
  ax: {
  method: "ax",
@@ -61,6 +61,18 @@ const METHOD_CAPABILITIES = {
  requiresBridge: true,
  requiresCDP: false,
  },
+ window_buffer: {
+ method: "window_buffer",
+ canClick: true,
+ canType: false,
+ canRead: true,
+ canLocate: true,
+ canSelect: false,
+ canScroll: false,
+ avgLatencyMs: 350,
+ requiresBridge: true,
+ requiresCDP: false,
+ },
  coordinates: {
  method: "coordinates",
  canClick: true,
@@ -90,6 +102,7 @@ const SENSOR_TO_METHOD = {
  chrome: "cdp",
  ocr: "ocr",
  vision: "ocr",
+ window_buffer: "window_buffer",
  coordinates: "coordinates",
  };
  /**
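Together these three edits make window_buffer a first-class method in the chain. A sketch of how the ordered list and the capability table compose; the pickMethods helper is illustrative, not part of the package:

```js
// Filter the ordered chain through the capability table for one operation.
function pickMethods(op) {
  const flag = "can" + op[0].toUpperCase() + op.slice(1); // "read" → "canRead"
  return EXECUTION_METHODS.filter((m) => METHOD_CAPABILITIES[m][flag]);
}
// Per the window_buffer entry above, it is offered for click/read/locate
// but never for type/select/scroll:
pickMethods("read").includes("window_buffer");   // true  (canRead: true)
pickMethods("scroll").includes("window_buffer"); // false (canScroll: false)
```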
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "screenhand",
- "version": "0.3.7",
+ "version": "0.3.9",
  "mcpName": "io.github.manushi4/screenhand",
  "description": "Give AI eyes and hands on your desktop. ScreenHand is an open-source MCP server that lets Claude and other AI agents see your screen, click buttons, type text, and control any app on macOS and Windows.",
  "homepage": "https://screenhand.com",