cursor-buddy 0.0.10-beta.0 → 0.0.11
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +196 -22
- package/dist/{client-CliXcNch.mjs → client-D7kFGsuH.mjs} +634 -300
- package/dist/client-D7kFGsuH.mjs.map +1 -0
- package/dist/client-DoqSfCbo.d.mts +82 -0
- package/dist/client-DoqSfCbo.d.mts.map +1 -0
- package/dist/index.d.mts +3 -2
- package/dist/index.mjs +1 -1
- package/dist/{point-tool-l3FewgM9.d.mts → point-tool-B_s8op--.d.mts} +3 -9
- package/dist/point-tool-B_s8op--.d.mts.map +1 -0
- package/dist/point-tool-DZJmhD8e.mjs.map +1 -1
- package/dist/react/index.d.mts +83 -6
- package/dist/react/index.d.mts.map +1 -1
- package/dist/react/index.mjs +268 -13
- package/dist/react/index.mjs.map +1 -1
- package/dist/server/adapters/next.d.mts +1 -1
- package/dist/server/index.d.mts +3 -3
- package/dist/server/index.mjs +106 -32
- package/dist/server/index.mjs.map +1 -1
- package/dist/{client-sjVVGYPU.d.mts → types-BU0Gegg2.d.mts} +123 -180
- package/dist/types-BU0Gegg2.d.mts.map +1 -0
- package/dist/{types-BJfkApb_.d.mts → types-ClkvIgAm.d.mts} +1 -1
- package/dist/{types-BJfkApb_.d.mts.map → types-ClkvIgAm.d.mts.map} +1 -1
- package/package.json +3 -2
- package/dist/client-CliXcNch.mjs.map +0 -1
- package/dist/client-sjVVGYPU.d.mts.map +0 -1
- package/dist/point-tool-l3FewgM9.d.mts.map +0 -1
package/README.md
CHANGED
@@ -1,7 +1,13 @@
 # cursor-buddy
 
 
-
+
+
+https://github.com/user-attachments/assets/def0876a-d63c-4e31-b633-9be3fb2b79b5
+
+
+
+
 
 
 AI Agent that lives in your cursor, built for web apps. Push-to-talk voice assistant that can see your screen and point at things.
@@ -15,6 +21,7 @@ Customize its prompt, pass custom tools, choose between browser or server-side s
 - **DOM snapshot context** — AI sees a token-efficient representation of your visible page structure
 - **Voice responses** — Browser or server TTS, with optional streaming playback
 - **Cursor pointing** — AI can point at UI elements it references
+- **Tool call bubbles** — Visual feedback for tool execution with customizable display
 - **Voice interruption** — Start talking again to cut off current response
 - **Framework agnostic** — Core client written in TypeScript, adapter-based architecture
 - **Customizable** — CSS variables, custom components, headless mode
@@ -82,6 +89,89 @@ export default function RootLayout({ children }) {
 
 That's it! Hold **Ctrl+Alt** to speak, release to send.
 
+## How It Works
+
+```mermaid
+flowchart LR
+    subgraph Input
+        A[Hold hotkey] --> B[Mic + Speech Recognition]
+        A --> C[Screenshot + DOM Snapshot]
+    end
+
+    subgraph Transcription
+        B --> D{Browser transcript?}
+        D -->|Yes| E[Use browser transcript]
+        D -->|No| F[Server transcription]
+    end
+
+    subgraph Processing
+        E --> G[Send to AI with context]
+        F --> G
+        C --> G
+        G --> H[AI Response]
+        H -->|point tool called| I[Animate cursor to @ID]
+    end
+
+    subgraph Output
+        H --> J[Speak response via TTS]
+    end
+```
+
+1. User holds the hotkey
+2. Microphone captures audio and browser speech recognition starts when available
+3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
+4. User releases the hotkey
+5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
+6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference
+7. The AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
+8. The response is spoken in the browser or on the server based on `speech.mode`,
+   and can either wait for the full response or stream sentence-by-sentence
+   based on `speech.allowStreaming`
+9. If the AI calls the `point` tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
+10. **If the user presses the hotkey again at any point, the current response is interrupted**
+
+## DOM Snapshot
+
+The DOM snapshot is a token-efficient representation of the visible page structure that gives the AI context about what's on screen.
+
+When the user holds the hotkey, cursor-buddy traverses the visible DOM and builds a lightweight text representation. Each interactive or meaningful element is assigned a unique `@ID` that the AI can reference when pointing.
+
+- **Enables pointing** — The AI can say "click the submit button @42" and the cursor will animate to that exact element
+- **Token efficient** — Only visible, relevant elements are included (no hidden elements, scripts, or styles)
+- **Semantic context** — The AI understands the page structure, not just pixels from the screenshot
+
+For a simple login form, the snapshot might look like:
+
+```
+# viewport 1280x720
+@20 body "Sign In Email Password Sign In Forgot password?" [x=0 y=0 w=1280 h=720]
+@19 main "Sign In Email Password Sign In Forgot password?" [x=440 y=200 w=400 h=320]
+@18 form "Sign In Email Password Sign In Forgot password?" [x=440 y=200 w=400 h=320]
+@1 h1 "Sign In" [x=580 y=220 w=120 h=32]
+@4 div "Email" [x=460 y=270 w=360 h=56]
+@2 label "Email" [x=460 y=270 w=40 h=20]
+@3 input [type="email"] [placeholder="Enter your email"] [x=460 y=294 w=360 h=32]
+@7 div "Password" [x=460 y=340 w=360 h=56]
+@5 label "Password" [x=460 y=340 w=64 h=20]
+@6 input [type="password"] [placeholder="Enter your password"] [x=460 y=364 w=360 h=32]
+@8 button "Sign In" [type="submit"] [x=460 y=420 w=360 h=40]
+@9 a "Forgot password?" [href="/forgot"] [x=540 y=476 w=120 h=20]
+```
+
+Each line contains: `@ID tag "text content" [attributes] [bounding box]`
+
+The AI sees this alongside the screenshot. When it wants to guide the user to enter their email, it can call `point(@3)` and the cursor will animate to that input field.
+
+### What Gets Captured
+
+| Included | Excluded |
+|----------|----------|
+| Visible elements in viewport | Hidden elements (`display: none`, `visibility: hidden`) |
+| Interactive elements (buttons, inputs, links) | Script and style tags |
+| Text content (truncated if long) | Elements outside viewport |
+| Element attributes (type, placeholder, href) | Inline styles and classes |
+| Semantic structure | Comment nodes |
+
 ## Server Configuration
 
 ```ts
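The snapshot line format above (`@ID tag "text content" [attributes] [x= y= w= h=]`) is simple enough to consume programmatically. As an illustrative sketch only (this is not cursor-buddy's actual parser), extracting an entry and the center point the cursor would animate to might look like:

```typescript
// Hypothetical parser for the snapshot line format shown above — a sketch,
// not cursor-buddy's internal implementation.
interface SnapshotEntry {
  id: number;
  tag: string;
  text?: string;
  box: { x: number; y: number; w: number; h: number };
}

function parseSnapshotLine(line: string): SnapshotEntry | null {
  // @ID, tag, optional quoted text, anything, then the trailing bounding box.
  const m = line.trim().match(
    /^@(\d+)\s+(\S+)(?:\s+"([^"]*)")?.*\[x=(\d+) y=(\d+) w=(\d+) h=(\d+)\]$/
  );
  if (!m) return null;
  const [, id, tag, text, x, y, w, h] = m;
  return {
    id: Number(id),
    tag,
    text,
    box: { x: Number(x), y: Number(y), w: Number(w), h: Number(h) },
  };
}

// Center point for a point(@ID) call: middle of the bounding box.
function centerOf(e: SnapshotEntry): { cx: number; cy: number } {
  return { cx: e.box.x + e.box.w / 2, cy: e.box.y + e.box.h / 2 };
}

const entry = parseSnapshotLine(
  '@3 input [type="email"] [placeholder="Enter your email"] [x=460 y=294 w=360 h=32]'
);
```

For the `@3` input above this yields a 360x32 box at (460, 294), so a pointer animation would target its center at (640, 310).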
@@ -136,12 +226,22 @@ createCursorBuddyHandler({
   speechBubble={(props) => <CustomBubble {...props} />}
   waveform={(props) => <CustomWaveform {...props} />}
 
+  // Tool display configuration
+  toolDisplay={{
+    "*": { minDisplayTime: 1500 },         // Default for all tools
+    web_search: { label: "Searching..." }, // Custom label
+    internal_tool: { mode: "hidden" },     // Hide from UI
+  }}
+  renderToolBubble={(props) => <CustomToolBubble {...props} />}
+
   // Callbacks
   onTranscript={(text) => {}}   // Called when speech is transcribed
   onResponse={(text) => {}}     // Called when AI responds
   onPoint={(target) => {}}      // Called when AI points at element
   onStateChange={(state) => {}} // Called on state change
   onError={(error) => {}}       // Called on error
+  onToolCall={(event) => {}}    // Called when a tool is invoked
+  onToolResult={(event) => {}}  // Called when a tool completes
 />
 ```
 
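The `toolDisplay` map added above layers per-tool entries over the `"*"` defaults. A hedged sketch of that merge (the option names mirror the README; the `resolveToolDisplay` helper itself is hypothetical, not an exported API):

```typescript
// Hypothetical helper showing how per-tool settings could layer over the
// "*" defaults of a toolDisplay map — not cursor-buddy's actual internals.
interface ToolDisplayOptions {
  minDisplayTime?: number;
  label?: string;
  mode?: "visible" | "hidden";
}

type ToolDisplayConfig = Record<string, ToolDisplayOptions>;

function resolveToolDisplay(
  config: ToolDisplayConfig,
  toolName: string
): ToolDisplayOptions {
  // Defaults first, then tool-specific overrides win.
  return { ...config["*"], ...config[toolName] };
}

const config: ToolDisplayConfig = {
  "*": { minDisplayTime: 1500 },
  web_search: { label: "Searching..." },
  internal_tool: { mode: "hidden" },
};

// web_search keeps the default minDisplayTime and adds its custom label.
const webSearch = resolveToolDisplay(config, "web_search");
```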
@@ -170,6 +270,65 @@ createCursorBuddyHandler({
 - `speech.allowStreaming: true` — Speak completed sentence segments as the chat
   stream arrives.
 
+## Tool Display
+
+When the AI uses tools (like web search), bubbles appear near the cursor showing the tool's status. Configure how tools are displayed:
+
+```tsx
+<CursorBuddy
+  endpoint="/api/cursor-buddy"
+  toolDisplay={{
+    // Default settings for all tools
+    "*": {
+      minDisplayTime: 1500, // Minimum time to show bubble (ms)
+    },
+
+    // Per-tool configuration
+    web_search: {
+      label: "Searching the web...", // Static label
+      // Or dynamic label based on status:
+      // label: (args, status) => status === "completed" ? "Found results" : "Searching..."
+    },
+
+    // Hide internal tools from UI
+    internal_logging: {
+      mode: "hidden",
+    },
+
+    // Custom render for specific tool
+    data_fetch: {
+      render: (props) => (
+        <div className="my-custom-bubble">
+          {props.status === "pending" ? "Loading..." : "Done!"}
+        </div>
+      ),
+    },
+  }}
+/>
+```
+
+### Tool Call States
+
+| Status | Description |
+|--------|-------------|
+| `pending` | Tool called, waiting for result |
+| `awaiting_approval` | Needs user consent (for tools with `needsApproval`) |
+| `approved` | User approved, executing |
+| `denied` | User denied the tool call |
+| `completed` | Finished successfully |
+| `failed` | Execution failed |
+
+### Approval Keyboard Shortcuts
+
+When a tool requires approval, use these keyboard shortcuts:
+
+| Key | Action |
+|-----|--------|
+| **Y** or **Enter** | Approve the tool call |
+| **N** or **Escape** | Deny the tool call |
+
+Shortcuts are automatically enabled when a tool is awaiting approval and disabled otherwise. They are ignored when focus is in an input field or textarea.
+
 ## Customization
 
 ### CSS Variables
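The approval shortcuts described above (Y/Enter approve, N/Escape deny, ignored while the user is typing in a field) reduce to a small key-mapping decision. A sketch of that logic, as a hypothetical standalone helper rather than the library's own event listener:

```typescript
// Sketch of the approval-shortcut decision described above — a hypothetical
// helper, not cursor-buddy's actual keydown handler.
type ApprovalAction = "approve" | "deny" | null;

function approvalActionForKey(
  key: string,
  focusInTextField: boolean
): ApprovalAction {
  // Shortcuts are ignored while focus is in an input field or textarea.
  if (focusInTextField) return null;
  if (key === "y" || key === "Y" || key === "Enter") return "approve";
  if (key === "n" || key === "N" || key === "Escape") return "deny";
  return null;
}

// In a real app this would be attached only while a tool awaits approval,
// e.g. (identifiers from the hook API shown later in this README):
//
// window.addEventListener("keydown", (e) => {
//   const tag = (e.target as HTMLElement).tagName;
//   const action = approvalActionForKey(e.key, tag === "INPUT" || tag === "TEXTAREA");
//   if (action === "approve") approveToolCall(pendingApproval.id);
//   if (action === "deny") denyToolCall(pendingApproval.id);
// });
```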
@@ -192,6 +351,14 @@ Cursor buddy styles are customizable via CSS variables. Override them in your st
 
   /* Waveform */
   --cursor-buddy-waveform-color: #ef4444;
+
+  /* Tool bubbles */
+  --cursor-buddy-tool-bg: #ffffff;
+  --cursor-buddy-tool-text: #1f2937;
+  --cursor-buddy-tool-pending: #3b82f6;
+  --cursor-buddy-tool-approval: #f59e0b;
+  --cursor-buddy-tool-success: #22c55e;
+  --cursor-buddy-tool-error: #ef4444;
 }
 ```
 
@@ -245,6 +412,11 @@ function MyCustomUI() {
     isPointing,
     error,
 
+    // Tool state
+    toolCalls,       // All tool calls in current turn
+    activeToolCalls, // Visible, non-expired tool calls
+    pendingApproval, // Tool awaiting user approval, or null
+
     // Actions
     startListening,
     stopListening,
@@ -252,6 +424,11 @@ function MyCustomUI() {
     pointAt, // Manually point at coordinates
     dismissPointing,
     reset,
+
+    // Tool actions
+    approveToolCall, // Approve a pending tool call
+    denyToolCall,    // Deny a pending tool call
+    dismissToolCall, // Dismiss a tool call bubble
   } = useCursorBuddy()
 
   return (
@@ -264,6 +441,19 @@ function MyCustomUI() {
       >
         Hold to speak
       </button>
+
+      {/* Render active tool calls */}
+      {activeToolCalls.map((tool) => (
+        <div key={tool.id}>
+          {tool.label}
+          {tool.status === "awaiting_approval" && (
+            <>
+              <button onClick={() => approveToolCall(tool.id)}>Yes</button>
+              <button onClick={() => denyToolCall(tool.id)}>No</button>
+            </>
+          )}
+        </div>
+      ))}
     </div>
   )
 }
@@ -362,21 +552,11 @@ client.stopListening()
 | `CursorRenderProps` | Props passed to custom cursor |
 | `SpeechBubbleRenderProps` | Props passed to custom speech bubble |
 | `WaveformRenderProps` | Props passed to custom waveform |
-
-
-
-
-
-3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
-4. User releases hotkey
-5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
-6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference.
-7. AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
-8. Response is spoken in the browser or on the server based on `speech.mode`,
-   and can either wait for the full response or stream sentence-by-sentence
-   based on `speech.allowStreaming`
-9. If the AI calls the point tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
-10. **If user presses hotkey again at any point, current response is interrupted**
+| `ToolBubbleRenderProps` | Props passed to custom tool bubble |
+| `ToolCallState` | State of a tool call |
+| `ToolCallStatus` | `"pending" \| "awaiting_approval" \| "approved" \| "denied" \| "completed" \| "failed"` |
+| `ToolDisplayConfig` | Configuration for tool display |
+| `ToolDisplayOptions` | Options for a single tool |
 
 ## Security Best Practices
 
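The `ToolCallStatus` union added above implies a small state machine. As a hedged sketch, with the union copied from the README but the transition table inferred from the status descriptions (not taken from the source):

```typescript
// Plausible lifecycle for the ToolCallStatus values listed above — the union
// matches the README; the transition table is an inference, not library code.
type ToolCallStatus =
  | "pending"
  | "awaiting_approval"
  | "approved"
  | "denied"
  | "completed"
  | "failed";

const transitions: Record<ToolCallStatus, ToolCallStatus[]> = {
  pending: ["completed", "failed"],           // auto-approved tools run directly
  awaiting_approval: ["approved", "denied"],  // tools with needsApproval
  approved: ["completed", "failed"],          // user approved, now executing
  denied: [],                                 // terminal
  completed: [],                              // terminal
  failed: [],                                 // terminal
};

function canTransition(from: ToolCallStatus, to: ToolCallStatus): boolean {
  return transitions[from].includes(to);
}
```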
@@ -407,14 +587,8 @@ export async function POST(request: Request) {
 
   return handler(request)
 }
-
-export const GET = POST
 ```
 
-## TODOs
-
-- [ ] Medium: Proper test structure without relying on `as any` for audio and voice capture
-
 ## License
 
 MIT