cursor-buddy 0.0.8 → 0.0.9-beta.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -12,7 +12,7 @@ Customize its prompt, pass custom tools, choose between browser or server-side s
 
 - **Push-to-talk voice input** — Hold a hotkey to speak, release to send
 - **Browser-first live transcription** — Realtime transcript while speaking, with server fallback
-- **Annotated screenshot context** — AI sees your current viewport with numbered interactive elements
+- **DOM snapshot context** — AI sees a token-efficient representation of your visible page structure
 - **Voice responses** — Browser or server TTS, with optional streaming playback
 - **Cursor pointing** — AI can point at UI elements it references
 - **Voice interruption** — Start talking again to cut off current response
@@ -57,7 +57,7 @@ export const cursorBuddy = createCursorBuddyHandler({
 import { toNextJsHandler } from "cursor-buddy/server/next"
 import { cursorBuddy } from "@/lib/cursor-buddy"
 
-export const { GET, POST } = toNextJsHandler(cursorBuddy)
+export const { POST } = toNextJsHandler(cursorBuddy)
 ```
 
 ### 2. Client Setup
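
For reference, the route file assembled from the context and added lines above now exports only `POST`, so the route stops answering `GET` by default (the `export const GET = POST` context line in the last hunk suggests the README still covers opting back in):

```ts
// Route file assembled from the hunk above (file path is illustrative)
import { toNextJsHandler } from "cursor-buddy/server/next"
import { cursorBuddy } from "@/lib/cursor-buddy"

export const { POST } = toNextJsHandler(cursorBuddy)
```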
@@ -367,17 +367,15 @@ client.stopListening()
 
 1. User holds the hotkey
 2. Microphone captures audio, waveform shows audio level, and browser speech recognition starts when available
-3. User releases hotkey
-4. An annotated screenshot of the viewport is captured, with numbered markers on visible interactive elements, based on [agent-browser](https://github.com/vercel-labs/agent-browser) implementation.
+3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
+4. User releases hotkey
 5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
-6. Screenshot + marker context are sent to the AI model
-7. AI responds with text, optionally including a pointing tag:
-   - Preferred: `[POINT:5:Submit]` for numbered interactive elements
-   - Fallback: `[POINT:640,360:Error text]` for arbitrary screen coordinates
+6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference.
+7. AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
 8. Response is spoken in the browser or on the server based on `speech.mode`,
-   and can either wait for the full response or stream sentence-by-sentence
-   based on `speech.allowStreaming`
-9. If a marker tag is present, it is resolved back to the live DOM element; if a coordinate tag is present, it is mapped back to the live viewport; then the cursor animates to the target location
+   and can either wait for the full response or stream sentence-by-sentence
+   based on `speech.allowStreaming`
+9. If the AI calls the `point` tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
 10. **If user presses hotkey again at any point, current response is interrupted**
 
 ## Security Best Practices
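
Steps 3 and 4 above describe capture running concurrently with speech input. A minimal sketch of that keydown/keyup pattern, with every helper name assumed for illustration (none of them are cursor-buddy exports):

```ts
// Sketch only: start screenshot + DOM snapshot capture on keydown,
// await it on keyup. All helpers below are assumed stand-ins.
declare function captureScreenshot(): Promise<Blob>
declare function captureDomSnapshot(): Promise<string> // token-efficient text with @IDs
declare function startMicrophone(): void
declare function stopMicrophone(): Promise<string> // resolves to the transcript

let capture: Promise<{ screenshot: Blob; snapshot: string }> | undefined

function onHotkeyDown(): void {
  startMicrophone()
  // Runs in parallel with speech capture, so it is usually settled before keyup
  capture = Promise.all([captureScreenshot(), captureDomSnapshot()])
    .then(([screenshot, snapshot]) => ({ screenshot, snapshot }))
}

async function onHotkeyUp(): Promise<void> {
  const transcript = await stopMicrophone()
  const context = await capture // already resolved in the common case: no added latency
  // ...send { transcript, ...context } to the model (step 6)
}
```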
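
And a sketch of the step-9 resolution; the registry below is an assumed shape (a map from snapshot `@ID` to live element) that matches the description, not necessarily the package's internals:

```ts
// Sketch only: resolve a snapshot @ID back to the live element and
// compute its current center point for the cursor animation.
const registry = new Map<string, Element>() // filled while building the snapshot, e.g. "@12" -> button

function resolvePointTarget(id: string): { x: number; y: number } | null {
  const el = registry.get(id)
  if (!el || !el.isConnected) return null // element may have left the DOM since capture
  const rect = el.getBoundingClientRect() // current viewport position
  return { x: rect.left + rect.width / 2, y: rect.top + rect.height / 2 }
}

// e.g. when the model calls the point tool with { id: "@12" }:
// const target = resolvePointTarget("@12")
// if (target) animateCursorTo(target) // animateCursorTo is illustrative
```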
@@ -415,7 +413,6 @@ export const GET = POST
 
 ## TODOs
 
-- [ ] High: Make tool calls first class: Pointing becomes tool call (once per turn) + re-use pointing bubble UI for tool calls
 - [ ] Medium: Proper test structure without relying on `as any` for audio and voice capture
 
 ## License