screenhand 0.1.0 → 0.1.1

Files changed (103)
  1. package/.claude/commands/automate.md +28 -0
  2. package/.claude/commands/debug-ui.md +19 -0
  3. package/.claude/commands/screenshot.md +15 -0
  4. package/.github/FUNDING.yml +1 -0
  5. package/.github/ISSUE_TEMPLATE/bug_report.md +27 -0
  6. package/.github/ISSUE_TEMPLATE/feature_request.md +20 -0
  7. package/.mcp.json +8 -0
  8. package/DESKTOP_MCP_GUIDE.md +92 -0
  9. package/LICENSE +661 -21
  10. package/README.md +97 -292
  11. package/SECURITY.md +44 -0
  12. package/docs/architecture.md +47 -0
  13. package/install-skills.sh +19 -0
  14. package/mcp-bridge.ts +271 -0
  15. package/mcp-desktop.ts +1221 -0
  16. package/native/macos-bridge/Package.swift +21 -0
  17. package/native/macos-bridge/Sources/AccessibilityBridge.swift +261 -0
  18. package/native/macos-bridge/Sources/AppManagement.swift +129 -0
  19. package/native/macos-bridge/Sources/CoreGraphicsBridge.swift +242 -0
  20. package/native/macos-bridge/Sources/ObserverBridge.swift +120 -0
  21. package/native/macos-bridge/Sources/VisionBridge.swift +80 -0
  22. package/native/macos-bridge/Sources/main.swift +345 -0
  23. package/native/windows-bridge/AppManagement.cs +234 -0
  24. package/native/windows-bridge/InputBridge.cs +436 -0
  25. package/native/windows-bridge/Program.cs +265 -0
  26. package/native/windows-bridge/ScreenCapture.cs +329 -0
  27. package/native/windows-bridge/UIAutomationBridge.cs +571 -0
  28. package/native/windows-bridge/WindowsBridge.csproj +17 -0
  29. package/package.json +3 -14
  30. package/playbooks/devpost.json +186 -0
  31. package/playbooks/instagram.json +41 -0
  32. package/playbooks/instagram_v2.json +201 -0
  33. package/playbooks/x_v1.json +211 -0
  34. package/scripts/devpost-live-loop.mjs +421 -0
  35. package/src/config.ts +30 -0
  36. package/src/index.ts +92 -0
  37. package/src/logging/timeline-logger.ts +55 -0
  38. package/src/mcp/server.ts +449 -0
  39. package/src/memory/recall.ts +191 -0
  40. package/src/memory/research.ts +146 -0
  41. package/src/memory/seeds.ts +123 -0
  42. package/src/memory/session.ts +201 -0
  43. package/src/memory/store.ts +434 -0
  44. package/src/memory/types.ts +69 -0
  45. package/src/native/bridge-client.ts +239 -0
  46. package/src/native/macos-bridge-client.ts +22 -0
  47. package/src/runtime/accessibility-adapter.ts +487 -0
  48. package/src/runtime/app-adapter.ts +169 -0
  49. package/src/runtime/applescript-adapter.ts +376 -0
  50. package/src/runtime/ax-role-map.ts +102 -0
  51. package/src/runtime/browser-adapter.ts +129 -0
  52. package/src/runtime/cdp-chrome-adapter.ts +676 -0
  53. package/src/runtime/composite-adapter.ts +274 -0
  54. package/src/runtime/executor.ts +396 -0
  55. package/src/runtime/locator-cache.ts +33 -0
  56. package/src/runtime/planning-loop.ts +81 -0
  57. package/src/runtime/service.ts +448 -0
  58. package/src/runtime/session-manager.ts +50 -0
  59. package/src/runtime/state-observer.ts +136 -0
  60. package/src/runtime/vision-adapter.ts +297 -0
  61. package/src/types.ts +297 -0
  62. package/tests/bridge-client.test.ts +176 -0
  63. package/tests/browser-stealth.test.ts +210 -0
  64. package/tests/composite-adapter.test.ts +64 -0
  65. package/tests/mcp-server.test.ts +151 -0
  66. package/tests/memory-recall.test.ts +339 -0
  67. package/tests/memory-research.test.ts +159 -0
  68. package/tests/memory-seeds.test.ts +120 -0
  69. package/tests/memory-store.test.ts +392 -0
  70. package/tests/types.test.ts +92 -0
  71. package/tsconfig.check.json +17 -0
  72. package/tsconfig.json +19 -0
  73. package/vitest.config.ts +8 -0
  74. package/dist/config.js +0 -9
  75. package/dist/index.js +0 -55
  76. package/dist/logging/timeline-logger.js +0 -29
  77. package/dist/mcp/mcp-stdio-server.js +0 -284
  78. package/dist/mcp/server.js +0 -347
  79. package/dist/mcp-entry.js +0 -62
  80. package/dist/memory/recall.js +0 -160
  81. package/dist/memory/research.js +0 -98
  82. package/dist/memory/seeds.js +0 -89
  83. package/dist/memory/session.js +0 -161
  84. package/dist/memory/store.js +0 -391
  85. package/dist/memory/types.js +0 -4
  86. package/dist/native/bridge-client.js +0 -173
  87. package/dist/native/macos-bridge-client.js +0 -5
  88. package/dist/runtime/accessibility-adapter.js +0 -377
  89. package/dist/runtime/app-adapter.js +0 -48
  90. package/dist/runtime/applescript-adapter.js +0 -283
  91. package/dist/runtime/ax-role-map.js +0 -80
  92. package/dist/runtime/browser-adapter.js +0 -36
  93. package/dist/runtime/cdp-chrome-adapter.js +0 -505
  94. package/dist/runtime/composite-adapter.js +0 -205
  95. package/dist/runtime/executor.js +0 -250
  96. package/dist/runtime/locator-cache.js +0 -12
  97. package/dist/runtime/planning-loop.js +0 -47
  98. package/dist/runtime/service.js +0 -372
  99. package/dist/runtime/session-manager.js +0 -28
  100. package/dist/runtime/state-observer.js +0 -105
  101. package/dist/runtime/vision-adapter.js +0 -208
  102. package/dist/test-mcp-protocol.js +0 -138
  103. package/dist/types.js +0 -1
package/.claude/commands/automate.md ADDED
@@ -0,0 +1,28 @@
+ Automate a desktop workflow described by the user.
+
+ The user will describe what they want done: $ARGUMENTS
+
+ Plan and execute the workflow step by step using the desktop automation MCP tools:
+
+ ## Planning
+ 1. Break the task into discrete steps
+ 2. Identify which apps are involved (`apps`, `windows`)
+ 3. For each step, pick the FASTEST approach — try them in this order:
+ - **Accessibility (FASTEST — always try first)**: `ui_tree` → `ui_find` → `ui_press` / `ui_set_value`. ~50ms per action, no screenshots.
+ - **Keyboard shortcuts**: `key` for known shortcuts (cmd+s, cmd+c, etc.) — instant
+ - **AppleScript**: `applescript` for scriptable apps (Finder, Mail, Notes) — fast
+ - **Chrome CDP**: `browser_dom` → `browser_click` / `browser_type` — direct DOM, no vision
+ - **Visual (LAST RESORT only)**: `screenshot` → `click_text` — slow, only when Accessibility can't see the element (canvas, games, images)
+
+ IMPORTANT: Do NOT use screenshot/OCR/click_text to interact with standard UI elements. Use ui_tree + ui_press instead — it's 10x faster and more reliable.
+
+ ## Execution
+ - Execute each step, verifying success before moving to the next
+ - After key actions, use `screenshot` or `ui_tree` to confirm the expected state
+ - If a step fails, try an alternative approach before giving up
+ - Report progress as you go
+
+ ## Completion
+ - Summarize what was done
+ - Note any steps that required fallbacks
+ - Flag anything that didn't work as expected
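The planning list in this hunk describes a fixed fallback order: Accessibility first, then keyboard shortcuts, AppleScript, Chrome CDP, and vision only as a last resort. A minimal sketch of that ordering follows. Everything named here (`Strategy`, `Target`, `pickStrategy`, and the boolean fields) is hypothetical and chosen for illustration; none of it is taken from the ScreenHand source.

```typescript
// Hypothetical illustration of the fallback order in automate.md.
// These types and names are assumptions, not ScreenHand's API.
type Strategy = "accessibility" | "keyboard" | "applescript" | "cdp" | "visual";

interface Target {
  inAccessibilityTree: boolean; // element is visible to ui_tree / ui_find
  shortcut?: string;            // a known key combo such as "cmd+s"
  scriptable: boolean;          // the app can be driven through AppleScript
  inBrowserDom: boolean;        // element is reachable through browser_dom
}

// The order mirrors the planning list: cheapest approach first, vision last.
const ORDER: Array<[Strategy, (t: Target) => boolean]> = [
  ["accessibility", (t) => t.inAccessibilityTree],
  ["keyboard", (t) => t.shortcut !== undefined],
  ["applescript", (t) => t.scriptable],
  ["cdp", (t) => t.inBrowserDom],
  ["visual", () => true], // always applicable, so it acts as the fallback
];

function pickStrategy(target: Target): Strategy {
  for (const [name, applies] of ORDER) {
    if (applies(target)) return name;
  }
  return "visual"; // not reached in practice; keeps the return type total
}
```

For example, a canvas element that none of the structured approaches can see falls through to `"visual"`, while a standard button that appears in the Accessibility tree resolves to `"accessibility"` even if the app is also scriptable.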
package/.claude/commands/debug-ui.md ADDED
@@ -0,0 +1,19 @@
+ Inspect and debug the UI structure of an app.
+
+ 1. Use `apps` to list running applications
+ 2. If the user specified an app name ($ARGUMENTS), find its PID. Otherwise use the frontmost app.
+ 3. Use `focus` to bring the app to the front
+ 4. Use `ui_tree` with the app's PID to get the full Accessibility tree
+ 5. Use `windows` to get the window bounds
+
+ Then analyze and report:
+ - App name and bundle ID
+ - Window hierarchy and layout
+ - Interactive elements (buttons, text fields, menus) with their states (enabled/disabled, value)
+ - Navigation structure
+ - Any elements that look broken or inaccessible
+ - Suggested selectors for automating key actions (titles to use with `ui_press`, `ui_find`)
+
+ Format as a structured report with sections.
+
+ $ARGUMENTS
package/.claude/commands/screenshot.md ADDED
@@ -0,0 +1,15 @@
+ Take a screenshot of the current screen and describe what you see.
+
+ 1. Use the `screenshot` MCP tool to capture the screen and OCR it
+ 2. Use `apps` to identify which apps are running
+ 3. Use `windows` to see window positions
+
+ Then provide a clear summary:
+ - What apps are visible
+ - What the user appears to be doing
+ - Key UI elements and text on screen
+ - Any notable state (dialogs open, errors visible, etc.)
+
+ If the user provides an app name as argument, focus on that app: use `focus` to bring it forward first, then screenshot.
+
+ $ARGUMENTS
package/.github/FUNDING.yml ADDED
@@ -0,0 +1 @@
+ github: manushi4
package/.github/ISSUE_TEMPLATE/bug_report.md ADDED
@@ -0,0 +1,27 @@
+ ---
+ name: Bug Report
+ about: Report a bug in ScreenHand
+ title: "[Bug] "
+ labels: bug
+ ---
+
+ **Platform**
+ - [ ] macOS
+ - [ ] Windows
+
+ **Describe the bug**
+ A clear description of what went wrong.
+
+ **To reproduce**
+ 1. Tool called: `...`
+ 2. Parameters: `...`
+ 3. Error/unexpected output: `...`
+
+ **Expected behavior**
+ What you expected to happen.
+
+ **Environment**
+ - OS version:
+ - Node.js version:
+ - ScreenHand version:
+ - AI client (Claude Desktop / Claude Code / Cursor / other):
package/.github/ISSUE_TEMPLATE/feature_request.md ADDED
@@ -0,0 +1,20 @@
+ ---
+ name: Feature Request
+ about: Suggest a new tool or improvement for ScreenHand
+ title: "[Feature] "
+ labels: enhancement
+ ---
+
+ **What problem does this solve?**
+ Describe the use case.
+
+ **Proposed solution**
+ How should it work? What MCP tool name/parameters would you expect?
+
+ **Alternatives considered**
+ Any workarounds you've tried.
+
+ **Platform**
+ - [ ] macOS
+ - [ ] Windows
+ - [ ] Both
package/.mcp.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "mcpServers": {
+     "desktop": {
+       "command": "npx",
+       "args": ["tsx", "mcp-desktop.ts"]
+     }
+   }
+ }
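The `.mcp.json` added above registers one server, `desktop`, as a command plus an argument list. As a rough sketch of how a client might turn that entry into a process to launch, the snippet below parses the same JSON and joins `command` with `args`. The `McpConfig` and `McpServerEntry` interfaces and the `commandLineFor` helper are hypothetical names introduced for illustration; how any particular MCP client actually consumes this file is an assumption here.

```typescript
// Hypothetical sketch: resolve the launch command for a server in .mcp.json.
// The interfaces and helper name are assumptions made for illustration.
interface McpServerEntry {
  command: string;
  args?: string[];
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Same content as the file added in this hunk.
const rawMcpJson = `{
  "mcpServers": {
    "desktop": {
      "command": "npx",
      "args": ["tsx", "mcp-desktop.ts"]
    }
  }
}`;

function commandLineFor(config: McpConfig, name: string): string[] {
  const entry = config.mcpServers[name];
  if (!entry) {
    throw new Error(`no MCP server named "${name}"`);
  }
  // The command comes first, followed by its arguments in order.
  return [entry.command, ...(entry.args ?? [])];
}

const parsedMcpConfig = JSON.parse(rawMcpJson) as McpConfig;
console.log(commandLineFor(parsedMcpConfig, "desktop").join(" "));
// prints: npx tsx mcp-desktop.ts
```

Because the entry runs `mcp-desktop.ts` through `tsx`, the server starts from the TypeScript source directly, which is consistent with this version dropping the compiled `dist/` files from the package.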
package/DESKTOP_MCP_GUIDE.md ADDED
@@ -0,0 +1,92 @@
+ # ScreenHand MCP — Usage Guide
+
+ You have access to the ScreenHand MCP server that can control any macOS/Windows application and Chrome browser. Use it for app debugging, design inspection, UI testing, and automation.
+
+ ## Quick Reference
+
+ ### How to discover what's on screen
+ 1. `apps` → get running apps with PIDs
+ 2. `windows` → get window IDs and positions
+ 3. `ui_tree(pid, maxDepth)` → get full UI structure instantly (50ms, no OCR)
+ 4. `screenshot(windowId)` → capture + OCR a window (600ms, use when you need visual text)
+
+ ### How to interact with native macOS apps (Finder, Notes, Xcode, etc.)
+ - **Read UI structure**: `ui_tree(pid=1234, maxDepth=4)` — returns every button, menu, text field with positions
+ - **Find element**: `ui_find(pid=1234, title="Save")` — find by text
+ - **Click element**: `ui_press(pid=1234, title="Save")` — click by accessibility
+ - **Click menu**: `menu_click(pid=1234, menuPath="File/Save As...")` — click any menu item
+ - **Set text**: `ui_set_value(pid=1234, title="Search", value="hello")`
+ - **Key combo**: `key(combo="cmd+s")` or `key(combo="cmd+shift+n")`
+
+ ### How to interact with Chrome/web pages (FAST — use this, not OCR)
+ - **List tabs**: `browser_tabs()`
+ - **Open URL**: `browser_open(url="https://example.com")`
+ - **Run JS**: `browser_js(code="document.title")` — execute any JavaScript, returns result
+ - **Query DOM**: `browser_dom(selector="button.primary")` — find elements with text, positions, attributes
+ - **Click element**: `browser_click(selector="#submit-btn")`
+ - **Type in input**: `browser_type(selector="input[name=search]", text="hello")`
+ - **Wait for load**: `browser_wait(condition="document.querySelector('.results')")`
+ - **Page content**: `browser_page_info()` — title, URL, text content
+
+ ### How to use AppleScript
+ - `applescript(script='tell application "Finder" to get name of every file of desktop')`
+ - `applescript(script='tell application "Safari" to set URL of current tab of front window to "https://google.com"')`
+
+ ## Rules & Best Practices
+
+ ### Speed hierarchy — always prefer the fastest method:
+ 1. **Accessibility (ui_tree, ui_press, menu_click)** — 50ms, structured, reliable. Use for ALL native app interactions.
+ 2. **CDP (browser_js, browser_dom, browser_click)** — 10ms, structured. Use for ALL web/browser interactions.
+ 3. **AppleScript** — 50ms, for app-specific scripting (Finder files, Safari URLs, Mail compose).
+ 4. **OCR (screenshot, click_text)** — 600ms, last resort. Only use when AX and CDP aren't available.
+
+ ### For app debugging:
+ - Start with `apps` to find the PID
+ - Use `ui_tree(pid, maxDepth=4)` to see the full UI hierarchy — every button, text field, label, with positions
+ - Use `ui_tree(pid, maxDepth=6)` for deep inspection of complex views
+ - Use `screenshot(windowId)` only when you need to see actual rendered text/images
+
+ ### For design inspection:
+ - `screenshot_file(windowId)` returns the image path — you can read it to see the actual design
+ - `ui_tree` shows the component structure (like React DevTools but for any app)
+ - `browser_dom(selector="*")` with limit shows the DOM tree of any web page
+ - `browser_js` can extract computed styles: `getComputedStyle(el).color`
+
+ ### For web app debugging:
+ - Use `browser_js` to run any debugging code — console.log, inspect state, check network
+ - Use `browser_dom` to find elements and their properties
+ - Use `browser_wait` before interacting with dynamic content
+ - Chain: `browser_navigate` → `browser_wait` → `browser_dom` → `browser_click`
+
+ ### Common patterns:
+
+ **Debug a native app's UI:**
+ ```
+ apps → find pid
+ ui_tree(pid, 4) → see structure
+ ui_find(pid, "button text") → locate element
+ ui_press(pid, "button text") → interact
+ ```
+
+ **Debug a web page:**
+ ```
+ browser_tabs → find the tab
+ browser_dom("main", tabId) → see page structure
+ browser_js("document.querySelector('.error')?.textContent", tabId) → inspect
+ ```
+
+ **Automate a flow:**
+ ```
+ launch(bundleId) → open app
+ ui_tree(pid, 3) → understand layout
+ menu_click(pid, "File/New") → trigger action
+ ui_set_value(pid, "Name", "test") → fill form
+ ui_press(pid, "Save") → submit
+ ```
+
+ ### Important notes:
+ - Chrome must be running with `--remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug` for browser_* tools
+ - PIDs change when apps restart — always call `apps` first to get current PIDs
+ - Window IDs change when windows are recreated — call `windows` to get current IDs
+ - `ui_tree` requires Accessibility permissions (System Settings → Privacy & Security → Accessibility)
+ - For clicking by coordinates, use `click(x, y)` — coordinates are screen-absolute
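The guide passes key combinations to the `key` tool as plus-separated strings such as `cmd+s` and `cmd+shift+n`. The sketch below shows one way such a string could be split into modifiers and a single key. It is an assumption about the format, not ScreenHand's parser: `parseCombo`, `KeyCombo`, and the modifier set are hypothetical, and only `cmd` and `shift` actually appear in the guide (`alt` and `ctrl` are included as guesses).

```typescript
// Hypothetical sketch of parsing a combo string like "cmd+shift+n".
// Only "cmd" and "shift" appear in the guide; "alt" and "ctrl" are assumed.
const MODIFIERS = new Set(["cmd", "shift", "alt", "ctrl"]);

interface KeyCombo {
  modifiers: string[]; // in the order they were written
  key: string;         // the single non-modifier key
}

function parseCombo(combo: string): KeyCombo {
  const parts = combo
    .toLowerCase()
    .split("+")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);

  const modifiers = parts.filter((part) => MODIFIERS.has(part));
  const [key, ...extraKeys] = parts.filter((part) => !MODIFIERS.has(part));

  // A usable combo names exactly one non-modifier key.
  if (key === undefined || extraKeys.length > 0) {
    throw new Error(`expected exactly one non-modifier key in "${combo}"`);
  }
  return { modifiers, key };
}

console.log(parseCombo("cmd+shift+n"));
```

Rejecting a combo that contains only modifiers (for example `cmd+shift`) is a design choice of this sketch; the guide does not say how the real tool treats such input.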