cursor-buddy 0.0.10 → 0.0.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,7 +1,13 @@
  # cursor-buddy
 
 
- https://github.com/user-attachments/assets/3cdfe011-aee2-4c8e-b695-34f83a972593
+
+
+ https://github.com/user-attachments/assets/def0876a-d63c-4e31-b633-9be3fb2b79b5
+
+
+
+
 
 
  AI Agent that lives in your cursor, built for web apps. Push-to-talk voice assistant that can see your screen and point at things.
@@ -15,6 +21,7 @@ Customize its prompt, pass custom tools, choose between browser or server-side s
  - **DOM snapshot context** — AI sees a token-efficient representation of your visible page structure
  - **Voice responses** — Browser or server TTS, with optional streaming playback
  - **Cursor pointing** — AI can point at UI elements it references
+ - **Tool call bubbles** — Visual feedback for tool execution with customizable display
  - **Voice interruption** — Start talking again to cut off current response
  - **Framework agnostic** — Core client written in TypeScript, adapter-based architecture
  - **Customizable** — CSS variables, custom components, headless mode
@@ -82,6 +89,89 @@ export default function RootLayout({ children }) {
 
  That's it! Hold **Ctrl+Alt** to speak, release to send.
 
+ ## How It Works
+
+ ```mermaid
+ flowchart LR
+   subgraph Input
+     A[Hold hotkey] --> B[Mic + Speech Recognition]
+     A --> C[Screenshot + DOM Snapshot]
+   end
+
+   subgraph Transcription
+     B --> D{Browser transcript?}
+     D -->|Yes| E[Use browser transcript]
+     D -->|No| F[Server transcription]
+   end
+
+   subgraph Processing
+     E --> G[Send to AI with context]
+     F --> G
+     C --> G
+     G --> H[AI Response]
+     H -->|point tool called| I[Animate cursor to @ID]
+   end
+
+   subgraph Output
+     H --> J[Speak response via TTS]
+   end
+ ```
+
+ 1. User holds the hotkey
+ 2. Microphone captures audio and browser speech recognition starts when available
+ 3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
+ 4. User releases hotkey
+ 5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
+ 6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference.
+ 7. AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
+ 8. Response is spoken in the browser or on the server based on `speech.mode`,
+    and can either wait for the full response or stream sentence-by-sentence
+    based on `speech.allowStreaming`
+ 9. If the AI calls the point tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
+ 10. **If user presses hotkey again at any point, current response is interrupted**
+
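The transcript-selection rule in step 5 can be sketched as follows. This is a minimal illustration, not cursor-buddy's actual API: `resolveTranscript` and its parameters are hypothetical names.

```typescript
// Illustrative sketch of step 5: prefer the browser transcript, and fall back
// to server transcription when it is missing or empty in "auto" mode.
// All names here are hypothetical, not part of cursor-buddy's public API.
type TranscriptSource = "browser" | "server";

async function resolveTranscript(
  browserTranscript: string | undefined,
  transcribeOnServer: () => Promise<string>,
  mode: "auto" | "server" = "auto"
): Promise<{ text: string; source: TranscriptSource }> {
  const text = (browserTranscript ?? "").trim();
  if (mode === "auto" && text.length > 0) {
    // The browser transcript is available and non-empty: use it directly.
    return { text, source: "browser" };
  }
  // Otherwise send the recorded audio to the server for transcription.
  return { text: await transcribeOnServer(), source: "server" };
}
```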
+ ## DOM Snapshot
+
+ The DOM snapshot is a token-efficient representation of the visible page structure that gives the AI context about what's on screen.
+
+ When the user holds the hotkey, cursor-buddy traverses the visible DOM and builds a lightweight text representation. Each interactive or meaningful element is assigned a unique `@ID` that the AI can reference when pointing.
+
+ - **Enables pointing** — The AI can say "click the submit button @42" and the cursor will animate to that exact element
+ - **Token efficient** — Only visible, relevant elements are included (no hidden elements, scripts, or styles)
+ - **Semantic context** — The AI understands the page structure, not just pixels from the screenshot
+
+ For a simple login form, the snapshot might look like:
+
+ ```
+ # viewport 1280x720
+ @20 body "Sign In Email Password Sign In Forgot password?" [x=0 y=0 w=1280 h=720]
+ @19 main "Sign In Email Password Sign In Forgot password?" [x=440 y=200 w=400 h=320]
+ @18 form "Sign In Email Password Sign In Forgot password?" [x=440 y=200 w=400 h=320]
+ @1 h1 "Sign In" [x=580 y=220 w=120 h=32]
+ @4 div "Email" [x=460 y=270 w=360 h=56]
+ @2 label "Email" [x=460 y=270 w=40 h=20]
+ @3 input [type="email"] [placeholder="Enter your email"] [x=460 y=294 w=360 h=32]
+ @7 div "Password" [x=460 y=340 w=360 h=56]
+ @5 label "Password" [x=460 y=340 w=64 h=20]
+ @6 input [type="password"] [placeholder="Enter your password"] [x=460 y=364 w=360 h=32]
+ @8 button "Sign In" [type="submit"] [x=460 y=420 w=360 h=40]
+ @9 a "Forgot password?" [href="/forgot"] [x=540 y=476 w=120 h=20]
+ ```
+
+ Each line contains: `@ID tag "text content" [attributes] [bounding box]`
+
+ The AI sees this alongside the screenshot. When it wants to guide the user to enter their email, it can call `point(@3)` and the cursor will animate to that input field.
+
+ ### What Gets Captured
+
+ | Included | Excluded |
+ |----------|----------|
+ | Visible elements in viewport | Hidden elements (`display: none`, `visibility: hidden`) |
+ | Interactive elements (buttons, inputs, links) | Script and style tags |
+ | Text content (truncated if long) | Elements outside viewport |
+ | Element attributes (type, placeholder, href) | Inline styles and classes |
+ | Semantic structure | Comment nodes |
+
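The snapshot format above is regular enough to illustrate how pointing works: the cursor target is the center of the referenced element's bounding box. This regex-based parser is purely illustrative — the package resolves live elements from its registry rather than re-parsing snapshot text.

```typescript
// Illustrative parser for one snapshot line of the form:
//   @ID tag "text content" [attributes] [x=.. y=.. w=.. h=..]
interface SnapshotEntry {
  id: number;
  tag: string;
  box: { x: number; y: number; w: number; h: number };
}

function parseSnapshotLine(line: string): SnapshotEntry | null {
  const m = line.match(/^@(\d+)\s+(\S+).*\[x=(\d+) y=(\d+) w=(\d+) h=(\d+)\]\s*$/);
  if (!m) return null;
  const [, id, tag, x, y, w, h] = m;
  return { id: Number(id), tag, box: { x: +x, y: +y, w: +w, h: +h } };
}

// Cursor target: the center point of the element's bounding box.
function centerOf(entry: SnapshotEntry): { x: number; y: number } {
  return { x: entry.box.x + entry.box.w / 2, y: entry.box.y + entry.box.h / 2 };
}

const entry = parseSnapshotLine(
  '@8 button "Sign In" [type="submit"] [x=460 y=420 w=360 h=40]'
);
if (entry) {
  const { x, y } = centerOf(entry); // x = 640, y = 440 for the line above
}
```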
  ## Server Configuration
 
  ```ts
@@ -136,12 +226,22 @@ createCursorBuddyHandler({
    speechBubble={(props) => <CustomBubble {...props} />}
    waveform={(props) => <CustomWaveform {...props} />}
 
+   // Tool display configuration
+   toolDisplay={{
+     "*": { minDisplayTime: 1500 }, // Default for all tools
+     web_search: { label: "Searching..." }, // Custom label
+     internal_tool: { mode: "hidden" }, // Hide from UI
+   }}
+   renderToolBubble={(props) => <CustomToolBubble {...props} />}
+
    // Callbacks
    onTranscript={(text) => {}} // Called when speech is transcribed
    onResponse={(text) => {}} // Called when AI responds
    onPoint={(target) => {}} // Called when AI points at element
    onStateChange={(state) => {}} // Called on state change
    onError={(error) => {}} // Called on error
+   onToolCall={(event) => {}} // Called when a tool is invoked
+   onToolResult={(event) => {}} // Called when a tool completes
  />
  ```
 
@@ -170,6 +270,65 @@ createCursorBuddyHandler({
  - `speech.allowStreaming: true` — Speak completed sentence segments as the chat
    stream arrives.
 
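The streaming mode can be sketched as a sentence splitter over the arriving chat stream. The segmentation rule below is an assumption for illustration; the package's actual splitter is not documented here.

```typescript
// Illustrative splitter for streaming TTS: returns the sentences completed so
// far plus the unfinished remainder to carry into the next chunk.
function drainSentences(pending: string): [string[], string] {
  const sentences: string[] = [];
  // A sentence ends with ., !, or ?, optionally followed by closing quotes/brackets.
  const re = /[^.!?]+[.!?]+["')\]]*\s*/g;
  let lastIndex = 0;
  let match: RegExpExecArray | null;
  while ((match = re.exec(pending)) !== null) {
    sentences.push(match[0].trim());
    lastIndex = re.lastIndex;
  }
  return [sentences, pending.slice(lastIndex)];
}

// Accumulate stream chunks and speak each completed sentence immediately.
function onChunk(buffer: { text: string }, chunk: string, speak: (s: string) => void): void {
  buffer.text += chunk;
  const [done, rest] = drainSentences(buffer.text);
  done.forEach(speak);
  buffer.text = rest; // keep the unfinished tail for the next chunk
}
```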
+ ## Tool Display
+
+ When the AI uses tools (like web search), bubbles appear near the cursor showing the tool's status. Configure how tools are displayed:
+
+ ```tsx
+ <CursorBuddy
+   endpoint="/api/cursor-buddy"
+   toolDisplay={{
+     // Default settings for all tools
+     "*": {
+       minDisplayTime: 1500, // Minimum time to show bubble (ms)
+     },
+
+     // Per-tool configuration
+     web_search: {
+       label: "Searching the web...", // Static label
+       // Or dynamic label based on status:
+       // label: (args, status) => status === "completed" ? "Found results" : "Searching..."
+     },
+
+     // Hide internal tools from UI
+     internal_logging: {
+       mode: "hidden",
+     },
+
+     // Custom render for specific tool
+     data_fetch: {
+       render: (props) => (
+         <div className="my-custom-bubble">
+           {props.status === "pending" ? "Loading..." : "Done!"}
+         </div>
+       ),
+     },
+   }}
+ />
+ ```
+
+ ### Tool Call States
+
+ | Status | Description |
+ |--------|-------------|
+ | `pending` | Tool called, waiting for result |
+ | `awaiting_approval` | Needs user consent (for tools with `needsApproval`) |
+ | `approved` | User approved, executing |
+ | `denied` | User denied the tool call |
+ | `completed` | Finished successfully |
+ | `failed` | Execution failed |
+
+ ### Approval Keyboard Shortcuts
+
+ When a tool requires approval, use these keyboard shortcuts:
+
+ | Key | Action |
+ |-----|--------|
+ | **Y** or **Enter** | Approve the tool call |
+ | **N** or **Escape** | Deny the tool call |
+
+ Shortcuts are automatically enabled when a tool is awaiting approval and disabled otherwise. They are ignored when focus is in an input field or textarea.
+
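The shortcut behavior described above reduces to a small key-to-action mapping. The helper below is hypothetical — the package wires this up internally — but it captures the stated rules: shortcuts only fire while a tool awaits approval, and are ignored while typing.

```typescript
type ApprovalAction = "approve" | "deny" | null;

// Maps a keydown to an approval action per the shortcut table. Returns null
// when no tool is awaiting approval or focus is in a text field (ignored).
function approvalAction(
  key: string,
  awaitingApproval: boolean,
  focusInTextField: boolean
): ApprovalAction {
  if (!awaitingApproval || focusInTextField) return null;
  if (key === "y" || key === "Y" || key === "Enter") return "approve";
  if (key === "n" || key === "N" || key === "Escape") return "deny";
  return null;
}

// Example wiring in the browser (sketch; pendingApproval/approveToolCall/
// denyToolCall come from useCursorBuddy):
// document.addEventListener("keydown", (e) => {
//   const typing =
//     document.activeElement instanceof HTMLInputElement ||
//     document.activeElement instanceof HTMLTextAreaElement;
//   const action = approvalAction(e.key, pendingApproval !== null, typing);
//   if (action === "approve") approveToolCall(pendingApproval.id);
//   if (action === "deny") denyToolCall(pendingApproval.id);
// });
```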
  ## Customization
 
  ### CSS Variables
@@ -192,6 +351,14 @@ Cursor buddy styles are customizable via CSS variables. Override them in your st
 
    /* Waveform */
    --cursor-buddy-waveform-color: #ef4444;
+
+   /* Tool bubbles */
+   --cursor-buddy-tool-bg: #ffffff;
+   --cursor-buddy-tool-text: #1f2937;
+   --cursor-buddy-tool-pending: #3b82f6;
+   --cursor-buddy-tool-approval: #f59e0b;
+   --cursor-buddy-tool-success: #22c55e;
+   --cursor-buddy-tool-error: #ef4444;
  }
  ```
 
@@ -245,6 +412,11 @@ function MyCustomUI() {
    isPointing,
    error,
 
+   // Tool state
+   toolCalls, // All tool calls in current turn
+   activeToolCalls, // Visible, non-expired tool calls
+   pendingApproval, // Tool awaiting user approval, or null
+
    // Actions
    startListening,
    stopListening,
@@ -252,6 +424,11 @@ function MyCustomUI() {
    pointAt, // Manually point at coordinates
    dismissPointing,
    reset,
+
+   // Tool actions
+   approveToolCall, // Approve a pending tool call
+   denyToolCall, // Deny a pending tool call
+   dismissToolCall, // Dismiss a tool call bubble
  } = useCursorBuddy()
 
  return (
@@ -264,6 +441,19 @@ function MyCustomUI() {
    >
      Hold to speak
    </button>
+
+   {/* Render active tool calls */}
+   {activeToolCalls.map((tool) => (
+     <div key={tool.id}>
+       {tool.label}
+       {tool.status === "awaiting_approval" && (
+         <>
+           <button onClick={() => approveToolCall(tool.id)}>Yes</button>
+           <button onClick={() => denyToolCall(tool.id)}>No</button>
+         </>
+       )}
+     </div>
+   ))}
  </div>
  )
  }
@@ -362,21 +552,11 @@ client.stopListening()
  | `CursorRenderProps` | Props passed to custom cursor |
  | `SpeechBubbleRenderProps` | Props passed to custom speech bubble |
  | `WaveformRenderProps` | Props passed to custom waveform |
-
- ## How It Works
-
- 1. User holds the hotkey
- 2. Microphone captures audio, waveform shows audio level, and browser speech recognition starts when available
- 3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
- 4. User releases hotkey
- 5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
- 6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference.
- 7. AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
- 8. Response is spoken in the browser or on the server based on `speech.mode`,
-    and can either wait for the full response or stream sentence-by-sentence
-    based on `speech.allowStreaming`
- 9. If the AI calls the point tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
- 10. **If user presses hotkey again at any point, current response is interrupted**
+ | `ToolBubbleRenderProps` | Props passed to custom tool bubble |
+ | `ToolCallState` | State of a tool call |
+ | `ToolCallStatus` | `"pending" \| "awaiting_approval" \| "approved" \| "denied" \| "completed" \| "failed"` |
+ | `ToolDisplayConfig` | Configuration for tool display |
+ | `ToolDisplayOptions` | Options for a single tool |
 
  ## Security Best Practices
 
@@ -407,14 +587,8 @@ export async function POST(request: Request) {
 
    return handler(request)
  }
-
- export const GET = POST
  ```
 
- ## TODOs
-
- - [ ] Medium: Proper test structure without relying on `as any` for audio and voice capture
-
  ## License
 
  MIT