cursor-buddy 0.0.10 → 0.0.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,7 +1,13 @@
  # cursor-buddy
 
 
- https://github.com/user-attachments/assets/3cdfe011-aee2-4c8e-b695-34f83a972593
+
+
+ https://github.com/user-attachments/assets/def0876a-d63c-4e31-b633-9be3fb2b79b5
+
+
+
+
 
 
  AI Agent that lives in your cursor, built for web apps. Push-to-talk voice assistant that can see your screen and point at things.
@@ -15,6 +21,7 @@ Customize its prompt, pass custom tools, choose between browser or server-side s
  - **DOM snapshot context** — AI sees a token-efficient representation of your visible page structure
  - **Voice responses** — Browser or server TTS, with optional streaming playback
  - **Cursor pointing** — AI can point at UI elements it references
+ - **Tool call bubbles** — Visual feedback for tool execution with customizable display
  - **Voice interruption** — Start talking again to cut off current response
  - **Framework agnostic** — Core client written in TypeScript, adapter-based architecture
  - **Customizable** — CSS variables, custom components, headless mode
@@ -82,6 +89,89 @@ export default function RootLayout({ children }) {
 
  That's it! Hold **Ctrl+Alt** to speak, release to send.
 
+ ## How It Works
+
+ ```mermaid
+ flowchart LR
+   subgraph Input
+     A[Hold hotkey] --> B[Mic + Speech Recognition]
+     A --> C[Screenshot + DOM Snapshot]
+   end
+
+   subgraph Transcription
+     B --> D{Browser transcript?}
+     D -->|Yes| E[Use browser transcript]
+     D -->|No| F[Server transcription]
+   end
+
+   subgraph Processing
+     E --> G[Send to AI with context]
+     F --> G
+     C --> G
+     G --> H[AI Response]
+     H -->|point tool called| I[Animate cursor to @ID]
+   end
+
+   subgraph Output
+     H --> J[Speak response via TTS]
+   end
+ ```
+
+ 1. User holds the hotkey
+ 2. Microphone captures audio and browser speech recognition starts when available
+ 3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
+ 4. User releases hotkey
+ 5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
+ 6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference.
+ 7. AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
+ 8. Response is spoken in the browser or on the server based on `speech.mode`,
+    and can either wait for the full response or stream sentence-by-sentence
+    based on `speech.allowStreaming`
+ 9. If the AI calls the point tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
+ 10. **If user presses hotkey again at any point, current response is interrupted**
+
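The transcript-selection rule in step 5 can be sketched as follows. This is a minimal illustration, not cursor-buddy's actual API: `resolveTranscript` and its parameters are hypothetical names.

```typescript
// Illustrative sketch of step 5: prefer the browser transcript, and fall back
// to server transcription when it is missing or empty in "auto" mode.
// All names here are hypothetical, not part of cursor-buddy's public API.
type TranscriptSource = "browser" | "server";

async function resolveTranscript(
  browserTranscript: string | undefined,
  transcribeOnServer: () => Promise<string>,
  mode: "auto" | "server" = "auto"
): Promise<{ text: string; source: TranscriptSource }> {
  const text = (browserTranscript ?? "").trim();
  if (mode === "auto" && text.length > 0) {
    // The browser transcript is available and non-empty: use it directly.
    return { text, source: "browser" };
  }
  // Otherwise send the recorded audio to the server for transcription.
  return { text: await transcribeOnServer(), source: "server" };
}
```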
+ ## DOM Snapshot
+
+ The DOM snapshot is a token-efficient representation of the visible page structure that gives the AI context about what's on screen.
+
+ When the user holds the hotkey, cursor-buddy traverses the visible DOM and builds a lightweight text representation. Each interactive or meaningful element is assigned a unique `@ID` that the AI can reference when pointing.
+
+ - **Enables pointing** — The AI can say "click the submit button @42" and the cursor will animate to that exact element
+ - **Token efficient** — Only visible, relevant elements are included (no hidden elements, scripts, or styles)
+ - **Semantic context** — The AI understands the page structure, not just pixels from the screenshot
+
+ For a simple login form, the snapshot might look like:
+
+ ```
+ # viewport 1280x720
+ @20 body "Sign In Email Password Sign In Forgot password?" [x=0 y=0 w=1280 h=720]
+ @19 main "Sign In Email Password Sign In Forgot password?" [x=440 y=200 w=400 h=320]
+ @18 form "Sign In Email Password Sign In Forgot password?" [x=440 y=200 w=400 h=320]
+ @1 h1 "Sign In" [x=580 y=220 w=120 h=32]
+ @4 div "Email" [x=460 y=270 w=360 h=56]
+ @2 label "Email" [x=460 y=270 w=40 h=20]
+ @3 input [type="email"] [placeholder="Enter your email"] [x=460 y=294 w=360 h=32]
+ @7 div "Password" [x=460 y=340 w=360 h=56]
+ @5 label "Password" [x=460 y=340 w=64 h=20]
+ @6 input [type="password"] [placeholder="Enter your password"] [x=460 y=364 w=360 h=32]
+ @8 button "Sign In" [type="submit"] [x=460 y=420 w=360 h=40]
+ @9 a "Forgot password?" [href="/forgot"] [x=540 y=476 w=120 h=20]
+ ```
+
+ Each line contains: `@ID tag "text content" [attributes] [bounding box]`
+
+ The AI sees this alongside the screenshot. When it wants to guide the user to enter their email, it can call `point(@3)` and the cursor will animate to that input field.
+
+ ### What Gets Captured
+
+ | Included | Excluded |
+ |----------|----------|
+ | Visible elements in viewport | Hidden elements (`display: none`, `visibility: hidden`) |
+ | Interactive elements (buttons, inputs, links) | Script and style tags |
+ | Text content (truncated if long) | Elements outside viewport |
+ | Element attributes (type, placeholder, href) | Inline styles and classes |
+ | Semantic structure | Comment nodes |
+
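The snapshot format above is regular enough to illustrate how pointing works: the cursor target is the center of the referenced element's bounding box. This regex-based parser is purely illustrative — the package resolves live elements from its registry rather than re-parsing snapshot text.

```typescript
// Illustrative parser for one snapshot line of the form:
//   @ID tag "text content" [attributes] [x=.. y=.. w=.. h=..]
interface SnapshotEntry {
  id: number;
  tag: string;
  box: { x: number; y: number; w: number; h: number };
}

function parseSnapshotLine(line: string): SnapshotEntry | null {
  const m = line.match(/^@(\d+)\s+(\S+).*\[x=(\d+) y=(\d+) w=(\d+) h=(\d+)\]\s*$/);
  if (!m) return null;
  const [, id, tag, x, y, w, h] = m;
  return { id: Number(id), tag, box: { x: +x, y: +y, w: +w, h: +h } };
}

// Cursor target: the center point of the element's bounding box.
function centerOf(entry: SnapshotEntry): { x: number; y: number } {
  return { x: entry.box.x + entry.box.w / 2, y: entry.box.y + entry.box.h / 2 };
}

const entry = parseSnapshotLine(
  '@8 button "Sign In" [type="submit"] [x=460 y=420 w=360 h=40]'
);
if (entry) {
  const { x, y } = centerOf(entry); // x = 640, y = 440 for the line above
}
```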
  ## Server Configuration
 
  ```ts
@@ -136,12 +226,22 @@ createCursorBuddyHandler({
    speechBubble={(props) => <CustomBubble {...props} />}
    waveform={(props) => <CustomWaveform {...props} />}
 
+   // Tool display configuration
+   toolDisplay={{
+     "*": { minDisplayTime: 1500 }, // Default for all tools
+     web_search: { label: "Searching..." }, // Custom label
+     internal_tool: { mode: "hidden" }, // Hide from UI
+   }}
+   renderToolBubble={(props) => <CustomToolBubble {...props} />}
+
    // Callbacks
    onTranscript={(text) => {}} // Called when speech is transcribed
    onResponse={(text) => {}} // Called when AI responds
    onPoint={(target) => {}} // Called when AI points at element
    onStateChange={(state) => {}} // Called on state change
    onError={(error) => {}} // Called on error
+   onToolCall={(event) => {}} // Called when a tool is invoked
+   onToolResult={(event) => {}} // Called when a tool completes
  />
  ```
 
@@ -170,6 +270,65 @@ createCursorBuddyHandler({
  - `speech.allowStreaming: true` — Speak completed sentence segments as the chat
    stream arrives.
 
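The streaming mode can be sketched as a sentence splitter over the arriving chat stream. The segmentation rule below is an assumption for illustration; the package's actual splitter is not documented here.

```typescript
// Illustrative splitter for streaming TTS: returns the sentences completed so
// far plus the unfinished remainder to carry into the next chunk.
function drainSentences(pending: string): [string[], string] {
  const sentences: string[] = [];
  // A sentence ends with ., !, or ?, optionally followed by closing quotes/brackets.
  const re = /[^.!?]+[.!?]+["')\]]*\s*/g;
  let lastIndex = 0;
  let match: RegExpExecArray | null;
  while ((match = re.exec(pending)) !== null) {
    sentences.push(match[0].trim());
    lastIndex = re.lastIndex;
  }
  return [sentences, pending.slice(lastIndex)];
}

// Accumulate stream chunks and speak each completed sentence immediately.
function onChunk(buffer: { text: string }, chunk: string, speak: (s: string) => void): void {
  buffer.text += chunk;
  const [done, rest] = drainSentences(buffer.text);
  done.forEach(speak);
  buffer.text = rest; // keep the unfinished tail for the next chunk
}
```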
+ ## Tool Display
+
+ When the AI uses tools (like web search), bubbles appear near the cursor showing the tool's status. Configure how tools are displayed:
+
+ ```tsx
+ <CursorBuddy
+   endpoint="/api/cursor-buddy"
+   toolDisplay={{
+     // Default settings for all tools
+     "*": {
+       minDisplayTime: 1500, // Minimum time to show bubble (ms)
+     },
+
+     // Per-tool configuration
+     web_search: {
+       label: "Searching the web...", // Static label
+       // Or dynamic label based on status:
+       // label: (args, status) => status === "completed" ? "Found results" : "Searching..."
+     },
+
+     // Hide internal tools from UI
+     internal_logging: {
+       mode: "hidden",
+     },
+
+     // Custom render for specific tool
+     data_fetch: {
+       render: (props) => (
+         <div className="my-custom-bubble">
+           {props.status === "pending" ? "Loading..." : "Done!"}
+         </div>
+       ),
+     },
+   }}
+ />
+ ```
+
+ ### Tool Call States
+
+ | Status | Description |
+ |--------|-------------|
+ | `pending` | Tool called, waiting for result |
+ | `awaiting_approval` | Needs user consent (for tools with `needsApproval`) |
+ | `approved` | User approved, executing |
+ | `denied` | User denied the tool call |
+ | `completed` | Finished successfully |
+ | `failed` | Execution failed |
+
+ ### Approval Keyboard Shortcuts
+
+ When a tool requires approval, use these keyboard shortcuts:
+
+ | Key | Action |
+ |-----|--------|
+ | **Y** or **Enter** | Approve the tool call |
+ | **N** or **Escape** | Deny the tool call |
+
+ Shortcuts are automatically enabled when a tool is awaiting approval and disabled otherwise. They are ignored when focus is in an input field or textarea.
+
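The shortcut behavior described above reduces to a small key-to-action mapping. The helper below is hypothetical — the package wires this up internally — but it captures the stated rules: shortcuts only fire while a tool awaits approval, and are ignored while typing.

```typescript
type ApprovalAction = "approve" | "deny" | null;

// Maps a keydown to an approval action per the shortcut table. Returns null
// when no tool is awaiting approval or focus is in a text field (ignored).
function approvalAction(
  key: string,
  awaitingApproval: boolean,
  focusInTextField: boolean
): ApprovalAction {
  if (!awaitingApproval || focusInTextField) return null;
  if (key === "y" || key === "Y" || key === "Enter") return "approve";
  if (key === "n" || key === "N" || key === "Escape") return "deny";
  return null;
}

// Example wiring in the browser (sketch; pendingApproval/approveToolCall/
// denyToolCall come from useCursorBuddy):
// document.addEventListener("keydown", (e) => {
//   const typing =
//     document.activeElement instanceof HTMLInputElement ||
//     document.activeElement instanceof HTMLTextAreaElement;
//   const action = approvalAction(e.key, pendingApproval !== null, typing);
//   if (action === "approve") approveToolCall(pendingApproval.id);
//   if (action === "deny") denyToolCall(pendingApproval.id);
// });
```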
  ## Customization
 
  ### CSS Variables
@@ -192,6 +351,14 @@ Cursor buddy styles are customizable via CSS variables. Override them in your st
 
    /* Waveform */
    --cursor-buddy-waveform-color: #ef4444;
+
+   /* Tool bubbles */
+   --cursor-buddy-tool-bg: #ffffff;
+   --cursor-buddy-tool-text: #1f2937;
+   --cursor-buddy-tool-pending: #3b82f6;
+   --cursor-buddy-tool-approval: #f59e0b;
+   --cursor-buddy-tool-success: #22c55e;
+   --cursor-buddy-tool-error: #ef4444;
  }
  ```
 
@@ -245,6 +412,11 @@ function MyCustomUI() {
    isPointing,
    error,
 
+   // Tool state
+   toolCalls, // All tool calls in current turn
+   activeToolCalls, // Visible, non-expired tool calls
+   pendingApproval, // Tool awaiting user approval, or null
+
    // Actions
    startListening,
    stopListening,
@@ -252,6 +424,11 @@ function MyCustomUI() {
    pointAt, // Manually point at coordinates
    dismissPointing,
    reset,
+
+   // Tool actions
+   approveToolCall, // Approve a pending tool call
+   denyToolCall, // Deny a pending tool call
+   dismissToolCall, // Dismiss a tool call bubble
  } = useCursorBuddy()
 
  return (
@@ -264,6 +441,19 @@ function MyCustomUI() {
    >
      Hold to speak
    </button>
+
+   {/* Render active tool calls */}
+   {activeToolCalls.map((tool) => (
+     <div key={tool.id}>
+       {tool.label}
+       {tool.status === "awaiting_approval" && (
+         <>
+           <button onClick={() => approveToolCall(tool.id)}>Yes</button>
+           <button onClick={() => denyToolCall(tool.id)}>No</button>
+         </>
+       )}
+     </div>
+   ))}
  </div>
  )
  }
@@ -362,21 +552,11 @@ client.stopListening()
  | `CursorRenderProps` | Props passed to custom cursor |
  | `SpeechBubbleRenderProps` | Props passed to custom speech bubble |
  | `WaveformRenderProps` | Props passed to custom waveform |
-
- ## How It Works
-
- 1. User holds the hotkey
- 2. Microphone captures audio, waveform shows audio level, and browser speech recognition starts when available
- 3. At the same time, a screenshot and token-efficient DOM snapshot of the viewport are captured in the background. This runs in parallel with speech capture to minimize latency
- 4. User releases hotkey
- 5. The client prefers the browser transcript; if it is unavailable or empty in `auto` mode, the recorded audio is transcribed on the server
- 6. The already-captured screenshot + DOM snapshot are sent to the AI model. Each element has an `@ID` (e.g., `@12`) that the AI can reference.
- 7. AI responds with text and can optionally call the `point` tool to indicate an element on screen by its `@ID` from the DOM snapshot
- 8. Response is spoken in the browser or on the server based on `speech.mode`,
-    and can either wait for the full response or stream sentence-by-sentence
-    based on `speech.allowStreaming`
- 9. If the AI calls the point tool, the cursor animates to the target element's current position (it resolves the element from the snapshot registry and computes its center point)
- 10. **If user presses hotkey again at any point, current response is interrupted**
+ | `ToolBubbleRenderProps` | Props passed to custom tool bubble |
+ | `ToolCallState` | State of a tool call |
+ | `ToolCallStatus` | `"pending" \| "awaiting_approval" \| "approved" \| "denied" \| "completed" \| "failed"` |
+ | `ToolDisplayConfig` | Configuration for tool display |
+ | `ToolDisplayOptions` | Options for a single tool |
 
  ## Security Best Practices
 
@@ -407,14 +587,8 @@ export async function POST(request: Request) {
 
    return handler(request)
  }
-
- export const GET = POST
  ```
 
- ## TODOs
-
- - [ ] Medium: Proper test structure without relying on `as any` for audio and voice capture
-
  ## License
 
  MIT