npm - testdriverai - Versions diffs - 7.3.16 → 7.3.17 - Mend

testdriverai 7.3.16 → 7.3.17

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15) hide show

package/.github/skills/testdriver:find/SKILL.md +12 -0
package/.github/skills/testdriver:parse/SKILL.md +118 -0
package/CHANGELOG.md +9 -0
package/agent/lib/sdk.js +4 -2
package/ai/skills/testdriver:find/SKILL.md +75 -0
package/ai/skills/testdriver:parse/SKILL.md +236 -0
package/ai/skills/testdriver:reusable-code/SKILL.md +9 -0
package/docs/docs.json +1 -1
package/docs/v7/find.mdx +75 -0
package/docs/v7/parse.mdx +237 -0
package/package.json +1 -1
package/sdk.d.ts +1 -1
package/sdk.js +9 -0
package/ai/skills/testdriver:ocr/SKILL.md +0 -235
package/docs/v7/ocr.mdx +0 -236

package/.github/skills/testdriver:find/SKILL.md CHANGED Viewed

@@ -37,6 +37,18 @@ const element = await testdriver.find(description, options)
       Maximum time in milliseconds to poll for the element. Retries every 5 seconds until found or timeout expires.
     </ParamField>
+    <ParamField path="confidence" type="number">
+      Minimum confidence threshold (0-1). If the AI's confidence score for the found element is below this value, the find will be treated as a failure (`element.found()` returns `false`). Useful for ensuring high-quality matches in critical test steps.
+    </ParamField>
+    <ParamField path="type" type="string">
+      Element type hint that wraps the description for better matching. Accepted values:
+      - `"text"` — Wraps the prompt as `The text "..."`
+      - `"image"` — Wraps the prompt as `The image "..."`
+      - `"ui"` — Wraps the prompt as `The UI element "..."`
+      - `"any"` — No wrapping, uses the description as-is (default behavior)
+    </ParamField>
     <ParamField path="zoom" type="boolean" default={false}>
       Enable two-phase zoom mode for better precision in crowded UIs with many similar elements.
     </ParamField>

package/.github/skills/testdriver:parse/SKILL.md ADDED Viewed

@@ -0,0 +1,118 @@
+```skill
+---
+name: testdriver:parse
+description: Detect all UI elements on screen using OmniParser
+---
+<!-- Generated from parse.mdx. DO NOT EDIT. -->
+## Overview
+Parse the current screen using OmniParser v2 to detect all visible UI elements. Returns structured data including element types, text content, interactivity levels, and bounding box coordinates.
+This method analyzes the entire screen and returns every detected element. It's useful for:
+- Understanding the full UI layout of a screen
+- Finding all clickable or interactive elements
+- Building custom element-based logic
+- Debugging what elements TestDriver can detect
+- Accessibility auditing
+<Note>
+  **Availability**: `parse()` requires an enterprise or self-hosted plan. It uses OmniParser v2 server-side for element detection.
+</Note>
+## Syntax
+```javascript
+const result = await testdriver.parse()
+```
+## Parameters
+None.
+## Returns
+`Promise<ParseResult>` - Object containing detected UI elements
+### ParseResult
+| Property | Type | Description |
+|----------|------|-------------|
+| `elements` | `ParsedElement[]` | Array of detected UI elements |
+| `annotatedImageUrl` | `string` | URL of the annotated screenshot with bounding boxes |
+| `imageWidth` | `number` | Width of the analyzed screenshot |
+| `imageHeight` | `number` | Height of the analyzed screenshot |
+### ParsedElement
+| Property | Type | Description |
+|----------|------|-------------|
+| `index` | `number` | Element index |
+| `type` | `string` | Element type (e.g. `"text"`, `"icon"`, `"button"`) |
+| `content` | `string` | Text content or description of the element |
+| `interactivity` | `string` | Interactivity level (e.g. `"clickable"`, `"non-interactive"`) |
+| `bbox` | `object` | Bounding box in pixel coordinates `{x0, y0, x1, y1}` |
+| `boundingBox` | `object` | Bounding box as `{left, top, width, height}` |
+## Examples
+### Get All Elements on Screen
+```javascript
+const result = await testdriver.parse();
+console.log(`Found ${result.elements.length} elements`);
+result.elements.forEach((el, i) => {
+  console.log(`${i + 1}. [${el.type}] "${el.content}" (${el.interactivity})`);
+});
+```
+### Find Clickable Elements
+```javascript
+const result = await testdriver.parse();
+const clickable = result.elements.filter(e => e.interactivity === 'clickable');
+console.log(`Found ${clickable.length} clickable elements`);
+```
+### Find and Click an Element by Content
+```javascript
+const result = await testdriver.parse();
+const submitBtn = result.elements.find(e =>
+  e.content.toLowerCase().includes('submit') && e.interactivity === 'clickable'
+);
+if (submitBtn) {
+  const x = Math.round((submitBtn.bbox.x0 + submitBtn.bbox.x1) / 2);
+  const y = Math.round((submitBtn.bbox.y0 + submitBtn.bbox.y1) / 2);
+  await testdriver.click({ x, y });
+}
+```
+### Filter by Element Type
+```javascript
+const result = await testdriver.parse();
+const textElements = result.elements.filter(e => e.type === 'text');
+const icons = result.elements.filter(e => e.type === 'icon');
+const buttons = result.elements.filter(e => e.type === 'button');
+```
+## Best Practices
+- Use `find()` for targeting specific elements — `parse()` is for full UI analysis
+- Filter by `interactivity` to distinguish clickable vs non-interactive elements
+- Wait for the page to stabilize before calling `parse()`
+- Use the `annotatedImageUrl` for visual debugging
+## Related
+- [find()](/v7/find) - AI-powered element location
+- [assert()](/v7/assert) - Make AI-powered assertions about screen state
+- [screenshot()](/v7/screenshot) - Capture screenshots
+- [Elements Reference](/v7/elements) - Complete Element API
+```

package/CHANGELOG.md CHANGED Viewed

@@ -1,3 +1,12 @@
+## [7.3.17](https://github.com/testdriverai/testdriverai/compare/v7.3.16...v7.3.17) (2026-02-18)
+### Features
+* add type option to find(), move confidence to API, rename ocr to parse ([#640](https://github.com/testdriverai/testdriverai/issues/640)) ([d98a94b](https://github.com/testdriverai/testdriverai/commit/d98a94bd05c135c74d9fed2ebb72c709fc643337))
 ## [7.3.16](https://github.com/testdriverai/testdriverai/compare/v7.3.14...v7.3.16) (2026-02-18)

package/agent/lib/sdk.js CHANGED Viewed

@@ -337,9 +337,11 @@ const createSDK = (emitter, config, sessionInstance) => {
   }
   const req = async (path, data, onChunk) => {
-    // for each value of data, if it is empty remove it
+    // for each value of data, if it is null/undefined remove it
+    // Note: use == null to match both null and undefined, but preserve
+    // other falsy values like 0, false, and "" which may be intentional
     for (let key in data) {
-      if (!data[key]) {
+      if (data[key] == null) {
         delete data[key];
       }
     }

package/ai/skills/testdriver:find/SKILL.md CHANGED Viewed

@@ -37,6 +37,18 @@ const element = await testdriver.find(description, options)
       Maximum time in milliseconds to poll for the element. Retries every 5 seconds until found or timeout expires.
     </ParamField>
+    <ParamField path="confidence" type="number">
+      Minimum confidence threshold (0-1). If the AI's confidence score for the found element is below this value, the find will be treated as a failure (`element.found()` returns `false`). Useful for ensuring high-quality matches in critical test steps.
+    </ParamField>
+    <ParamField path="type" type="string">
+      Element type hint that wraps the description for better matching. Accepted values:
+      - `"text"` — Wraps the prompt as `The text "..."`
+      - `"image"` — Wraps the prompt as `The image "..."`
+      - `"ui"` — Wraps the prompt as `The UI element "..."`
+      - `"any"` — No wrapping, uses the description as-is (default behavior)
+    </ParamField>
     <ParamField path="zoom" type="boolean" default={false}>
       Enable two-phase zoom mode for better precision in crowded UIs with many similar elements.
     </ParamField>
@@ -232,6 +244,69 @@ This is useful for:
   ```
 </Check>
+## Confidence Threshold
+Require a minimum AI confidence score for element matches. If the confidence is below the threshold, `find()` treats the result as not found:
+```javascript
+// Require at least 90% confidence
+const element = await testdriver.find('submit button', { confidence: 0.9 });
+if (!element.found()) {
+  // AI found something but wasn't confident enough
+  throw new Error('Could not confidently locate submit button');
+}
+await element.click();
+```
+This is useful for:
+- Critical test steps where an incorrect click could cause cascading failures
+- Distinguishing between similar elements (e.g., multiple buttons)
+- Failing fast when the UI has changed unexpectedly
+```javascript
+// Combine with timeout for robust polling with confidence gate
+const element = await testdriver.find('success notification', {
+  confidence: 0.85,
+  timeout: 15000,
+});
+```
+<Tip>
+  The `confidence` value is a float between 0 and 1 (e.g., `0.9` = 90%). The AI returns its confidence with each find result, which you can also read from `element.confidence` after a successful find.
+</Tip>
+## Element Type
+Use the `type` option to hint what kind of element you're looking for. This wraps your description into a more specific prompt for the AI, improving match accuracy — especially when users provide short or ambiguous descriptions.
+```javascript
+// Find text on the page
+const label = await testdriver.find('Sign In', { type: 'text' });
+// AI prompt becomes: The text "Sign In"
+// Find an image
+const logo = await testdriver.find('company logo', { type: 'image' });
+// AI prompt becomes: The image "company logo"
+// Find a UI element (button, input, checkbox, etc.)
+const btn = await testdriver.find('Submit', { type: 'ui' });
+// AI prompt becomes: The UI element "Submit"
+// No wrapping — same as omitting the option
+const el = await testdriver.find('the blue submit button', { type: 'any' });
+```
+| Type | Prompt sent to AI |
+|------|----|
+| `"text"` | `The text "..."` |
+| `"image"` | `The image "..."` |
+| `"ui"` | `The UI element "..."` |
+| `"any"` | Original description (no wrapping) |
+<Tip>
+  This is particularly useful for short descriptions like `"Submit"` or `"Login"` where the AI may not know whether to look for a button, a link, or visible text. Specifying `type` removes the ambiguity.
+</Tip>
 ## Polling for Dynamic Elements
 For elements that may not be immediately visible, use the `timeout` option to automatically poll:

package/ai/skills/testdriver:parse/SKILL.md ADDED Viewed

@@ -0,0 +1,236 @@
+---
+name: testdriver:parse
+description: Detect all UI elements on screen using OmniParser
+---
+<!-- Generated from parse.mdx. DO NOT EDIT. -->
+## Overview
+Parse the current screen using OmniParser v2 to detect all visible UI elements. Returns structured data including element types, text content, interactivity levels, and bounding box coordinates.
+This method analyzes the entire screen and returns every detected element. It's useful for:
+- Understanding the full UI layout of a screen
+- Finding all clickable or interactive elements
+- Building custom element-based logic
+- Debugging what elements TestDriver can detect
+- Accessibility auditing
+<Note>
+  **Availability**: `parse()` requires an enterprise or self-hosted plan. It uses OmniParser v2 server-side for element detection.
+</Note>
+## Syntax
+```javascript
+const result = await testdriver.parse()
+```
+## Parameters
+None.
+## Returns
+`Promise<ParseResult>` - Object containing detected UI elements
+### ParseResult
+| Property | Type | Description |
+|----------|------|-------------|
+| `elements` | `ParsedElement[]` | Array of detected UI elements |
+| `annotatedImageUrl` | `string` | URL of the annotated screenshot with bounding boxes |
+| `imageWidth` | `number` | Width of the analyzed screenshot |
+| `imageHeight` | `number` | Height of the analyzed screenshot |
+### ParsedElement
+| Property | Type | Description |
+|----------|------|-------------|
+| `index` | `number` | Element index |
+| `type` | `string` | Element type (e.g. `"text"`, `"icon"`, `"button"`) |
+| `content` | `string` | Text content or description of the element |
+| `interactivity` | `string` | Interactivity level (e.g. `"clickable"`, `"non-interactive"`) |
+| `bbox` | `object` | Bounding box in pixel coordinates `{x0, y0, x1, y1}` |
+| `boundingBox` | `object` | Bounding box as `{left, top, width, height}` |
+## Examples
+### Get All Elements on Screen
+```javascript
+const result = await testdriver.parse();
+console.log(`Found ${result.elements.length} elements`);
+result.elements.forEach((el, i) => {
+  console.log(`${i + 1}. [${el.type}] "${el.content}" (${el.interactivity})`);
+});
+```
+### Find Clickable Elements
+```javascript
+const result = await testdriver.parse();
+const clickable = result.elements.filter(e => e.interactivity === 'clickable');
+console.log(`Found ${clickable.length} clickable elements`);
+clickable.forEach(el => {
+  console.log(`- "${el.content}" at (${el.bbox.x0}, ${el.bbox.y0})`);
+});
+```
+### Find and Click an Element by Content
+```javascript
+const result = await testdriver.parse();
+// Find a "Submit" button
+const submitBtn = result.elements.find(e =>
+  e.content.toLowerCase().includes('submit') && e.interactivity === 'clickable'
+);
+if (submitBtn) {
+  // Calculate center of the bounding box
+  const x = Math.round((submitBtn.bbox.x0 + submitBtn.bbox.x1) / 2);
+  const y = Math.round((submitBtn.bbox.y0 + submitBtn.bbox.y1) / 2);
+  await testdriver.click({ x, y });
+}
+```
+### Filter by Element Type
+```javascript
+const result = await testdriver.parse();
+// Get all text elements
+const textElements = result.elements.filter(e => e.type === 'text');
+textElements.forEach(e => console.log(`Text: "${e.content}"`));
+// Get all icons
+const icons = result.elements.filter(e => e.type === 'icon');
+console.log(`Found ${icons.length} icons`);
+// Get all buttons
+const buttons = result.elements.filter(e => e.type === 'button');
+console.log(`Found ${buttons.length} buttons`);
+```
+### Build Custom Assertions
+```javascript
+import { describe, expect, it } from "vitest";
+import { TestDriver } from "testdriverai/lib/vitest/hooks.mjs";
+describe("Login Page", () => {
+  it("should have expected form elements", async (context) => {
+    const testdriver = TestDriver(context);
+    await testdriver.provision.chrome({
+      url: 'https://myapp.com/login',
+    });
+    const result = await testdriver.parse();
+    // Assert expected elements exist
+    const textContent = result.elements.map(e => e.content.toLowerCase());
+    expect(textContent).toContain('email');
+    expect(textContent).toContain('password');
+    // Assert there are clickable elements
+    const clickable = result.elements.filter(e => e.interactivity === 'clickable');
+    expect(clickable.length).toBeGreaterThan(0);
+  });
+});
+```
+### Use Bounding Box Coordinates
+```javascript
+const result = await testdriver.parse();
+result.elements.forEach(el => {
+  // Pixel coordinates
+  console.log(`Element "${el.content}":`);
+  console.log(`  bbox: (${el.bbox.x0}, ${el.bbox.y0}) to (${el.bbox.x1}, ${el.bbox.y1})`);
+  console.log(`  size: ${el.boundingBox.width}x${el.boundingBox.height}`);
+  console.log(`  position: left=${el.boundingBox.left}, top=${el.boundingBox.top}`);
+});
+```
+### View Annotated Screenshot
+```javascript
+const result = await testdriver.parse();
+// The annotated image shows all detected elements with bounding boxes
+console.log('Annotated screenshot:', result.annotatedImageUrl);
+console.log(`Image dimensions: ${result.imageWidth}x${result.imageHeight}`);
+```
+## How It Works
+1. TestDriver captures a screenshot of the current screen
+2. The image is sent to the TestDriver API
+3. OmniParser v2 analyzes the image to detect all UI elements
+4. Each element is classified by type (text, icon, button, etc.) and interactivity
+5. Bounding box coordinates are returned in pixel coordinates matching the screen resolution
+<Note>
+  OmniParser detects elements visually — it works with any UI framework, native apps, and even non-standard interfaces. It does not rely on DOM or accessibility trees.
+</Note>
+## Best Practices
+<AccordionGroup>
+  <Accordion title="Use find() for targeting specific elements">
+    For locating and interacting with a specific element, prefer `find()` which uses AI vision. Use `parse()` when you need a complete inventory of all elements on screen.
+    ```javascript
+    // Prefer this for clicking a specific element
+    await testdriver.find("Submit button").click();
+    // Use parse() for full UI analysis
+    const result = await testdriver.parse();
+    const allButtons = result.elements.filter(e => e.type === 'button');
+    ```
+  </Accordion>
+  <Accordion title="Filter by interactivity">
+    Use the `interactivity` field to distinguish between clickable and non-interactive elements.
+    ```javascript
+    const result = await testdriver.parse();
+    const interactive = result.elements.filter(e => e.interactivity === 'clickable');
+    const static_ = result.elements.filter(e => e.interactivity === 'non-interactive');
+    ```
+  </Accordion>
+  <Accordion title="Wait for content to load">
+    If elements aren't being detected, the page may not be fully loaded. Add a wait first.
+    ```javascript
+    // Wait for page to stabilize
+    await testdriver.wait(2000);
+    // Then parse
+    const result = await testdriver.parse();
+    ```
+  </Accordion>
+  <Accordion title="Use the annotated image for debugging">
+    The `annotatedImageUrl` provides a visual overlay showing all detected elements with their bounding boxes — great for debugging.
+    ```javascript
+    const result = await testdriver.parse();
+    console.log('View annotated screenshot:', result.annotatedImageUrl);
+    ```
+  </Accordion>
+</AccordionGroup>
+## Related
+- [find()](/v7/find) - AI-powered element location
+- [assert()](/v7/assert) - Make AI-powered assertions about screen state
+- [screenshot()](/v7/screenshot) - Capture screenshots
+- [Elements Reference](/v7/elements) - Complete Element API

package/ai/skills/testdriver:reusable-code/SKILL.md CHANGED Viewed

@@ -36,6 +36,15 @@ export async function logout(testdriver) {
 }
 ```
+<Warning>
+**Avoid hardcoding dynamic values in element descriptions.** Element selectors should describe the *type* of element, not specific content that might change.
+**❌ Bad:** `await testdriver.find('profile name TestDriver in the top right')`
+**✅ Good:** `await testdriver.find('user profile name in the top right')`
+Hardcoded values like usernames, product names, or prices will cause tests to fail when the data changes. Use generic descriptions that work regardless of the specific content displayed.
+</Warning>
 Now import and use these helpers in any test:
 ```javascript test/checkout.test.mjs

package/docs/docs.json CHANGED Viewed

@@ -107,7 +107,7 @@
               "/v7/hover",
               "/v7/mouse-down",
               "/v7/mouse-up",
-              "/v7/ocr",
+              "/v7/parse",
               "/v7/press-keys",
               "/v7/right-click",
               "/v7/screenshot",

package/docs/v7/find.mdx CHANGED Viewed

@@ -38,6 +38,18 @@ const element = await testdriver.find(description, options)
       Maximum time in milliseconds to poll for the element. Retries every 5 seconds until found or timeout expires.
     </ParamField>
+    <ParamField path="confidence" type="number">
+      Minimum confidence threshold (0-1). If the AI's confidence score for the found element is below this value, the find will be treated as a failure (`element.found()` returns `false`). Useful for ensuring high-quality matches in critical test steps.
+    </ParamField>
+    <ParamField path="type" type="string">
+      Element type hint that wraps the description for better matching. Accepted values:
+      - `"text"` — Wraps the prompt as `The text "..."`
+      - `"image"` — Wraps the prompt as `The image "..."`
+      - `"ui"` — Wraps the prompt as `The UI element "..."`
+      - `"any"` — No wrapping, uses the description as-is (default behavior)
+    </ParamField>
     <ParamField path="zoom" type="boolean" default={false}>
       Enable two-phase zoom mode for better precision in crowded UIs with many similar elements.
     </ParamField>
@@ -233,6 +245,69 @@ This is useful for:
   ```
 </Check>
+## Confidence Threshold
+Require a minimum AI confidence score for element matches. If the confidence is below the threshold, `find()` treats the result as not found:
+```javascript
+// Require at least 90% confidence
+const element = await testdriver.find('submit button', { confidence: 0.9 });
+if (!element.found()) {
+  // AI found something but wasn't confident enough
+  throw new Error('Could not confidently locate submit button');
+}
+await element.click();
+```
+This is useful for:
+- Critical test steps where an incorrect click could cause cascading failures
+- Distinguishing between similar elements (e.g., multiple buttons)
+- Failing fast when the UI has changed unexpectedly
+```javascript
+// Combine with timeout for robust polling with confidence gate
+const element = await testdriver.find('success notification', {
+  confidence: 0.85,
+  timeout: 15000,
+});
+```
+<Tip>
+  The `confidence` value is a float between 0 and 1 (e.g., `0.9` = 90%). The AI returns its confidence with each find result, which you can also read from `element.confidence` after a successful find.
+</Tip>
+## Element Type
+Use the `type` option to hint what kind of element you're looking for. This wraps your description into a more specific prompt for the AI, improving match accuracy — especially when users provide short or ambiguous descriptions.
+```javascript
+// Find text on the page
+const label = await testdriver.find('Sign In', { type: 'text' });
+// AI prompt becomes: The text "Sign In"
+// Find an image
+const logo = await testdriver.find('company logo', { type: 'image' });
+// AI prompt becomes: The image "company logo"
+// Find a UI element (button, input, checkbox, etc.)
+const btn = await testdriver.find('Submit', { type: 'ui' });
+// AI prompt becomes: The UI element "Submit"
+// No wrapping — same as omitting the option
+const el = await testdriver.find('the blue submit button', { type: 'any' });
+```
+| Type | Prompt sent to AI |
+|------|----|
+| `"text"` | `The text "..."` |
+| `"image"` | `The image "..."` |
+| `"ui"` | `The UI element "..."` |
+| `"any"` | Original description (no wrapping) |
+<Tip>
+  This is particularly useful for short descriptions like `"Submit"` or `"Login"` where the AI may not know whether to look for a button, a link, or visible text. Specifying `type` removes the ambiguity.
+</Tip>
 ## Polling for Dynamic Elements
 For elements that may not be immediately visible, use the `timeout` option to automatically poll:

package/docs/v7/parse.mdx ADDED Viewed

@@ -0,0 +1,237 @@
+---
+title: "parse()"
+sidebarTitle: "parse"
+description: "Detect all UI elements on screen using OmniParser"
+icon: "diagram-project"
+---
+## Overview
+Parse the current screen using OmniParser v2 to detect all visible UI elements. Returns structured data including element types, text content, interactivity levels, and bounding box coordinates.
+This method analyzes the entire screen and returns every detected element. It's useful for:
+- Understanding the full UI layout of a screen
+- Finding all clickable or interactive elements
+- Building custom element-based logic
+- Debugging what elements TestDriver can detect
+- Accessibility auditing
+<Note>
+  **Availability**: `parse()` requires an enterprise or self-hosted plan. It uses OmniParser v2 server-side for element detection.
+</Note>
+## Syntax
+```javascript
+const result = await testdriver.parse()
+```
+## Parameters
+None.
+## Returns
+`Promise<ParseResult>` - Object containing detected UI elements
+### ParseResult
+| Property | Type | Description |
+|----------|------|-------------|
+| `elements` | `ParsedElement[]` | Array of detected UI elements |
+| `annotatedImageUrl` | `string` | URL of the annotated screenshot with bounding boxes |
+| `imageWidth` | `number` | Width of the analyzed screenshot |
+| `imageHeight` | `number` | Height of the analyzed screenshot |
+### ParsedElement
+| Property | Type | Description |
+|----------|------|-------------|
+| `index` | `number` | Element index |
+| `type` | `string` | Element type (e.g. `"text"`, `"icon"`, `"button"`) |
+| `content` | `string` | Text content or description of the element |
+| `interactivity` | `string` | Interactivity level (e.g. `"clickable"`, `"non-interactive"`) |
+| `bbox` | `object` | Bounding box in pixel coordinates `{x0, y0, x1, y1}` |
+| `boundingBox` | `object` | Bounding box as `{left, top, width, height}` |
+## Examples
+### Get All Elements on Screen
+```javascript
+const result = await testdriver.parse();
+console.log(`Found ${result.elements.length} elements`);
+result.elements.forEach((el, i) => {
+  console.log(`${i + 1}. [${el.type}] "${el.content}" (${el.interactivity})`);
+});
+```
+### Find Clickable Elements
+```javascript
+const result = await testdriver.parse();
+const clickable = result.elements.filter(e => e.interactivity === 'clickable');
+console.log(`Found ${clickable.length} clickable elements`);
+clickable.forEach(el => {
+  console.log(`- "${el.content}" at (${el.bbox.x0}, ${el.bbox.y0})`);
+});
+```
+### Find and Click an Element by Content
+```javascript
+const result = await testdriver.parse();
+// Find a "Submit" button
+const submitBtn = result.elements.find(e =>
+  e.content.toLowerCase().includes('submit') && e.interactivity === 'clickable'
+);
+if (submitBtn) {
+  // Calculate center of the bounding box
+  const x = Math.round((submitBtn.bbox.x0 + submitBtn.bbox.x1) / 2);
+  const y = Math.round((submitBtn.bbox.y0 + submitBtn.bbox.y1) / 2);
+  await testdriver.click({ x, y });
+}
+```
+### Filter by Element Type
+```javascript
+const result = await testdriver.parse();
+// Get all text elements
+const textElements = result.elements.filter(e => e.type === 'text');
+textElements.forEach(e => console.log(`Text: "${e.content}"`));
+// Get all icons
+const icons = result.elements.filter(e => e.type === 'icon');
+console.log(`Found ${icons.length} icons`);
+// Get all buttons
+const buttons = result.elements.filter(e => e.type === 'button');
+console.log(`Found ${buttons.length} buttons`);
+```
+### Build Custom Assertions
+```javascript
+import { describe, expect, it } from "vitest";
+import { TestDriver } from "testdriverai/lib/vitest/hooks.mjs";
+describe("Login Page", () => {
+  it("should have expected form elements", async (context) => {
+    const testdriver = TestDriver(context);
+    await testdriver.provision.chrome({
+      url: 'https://myapp.com/login',
+    });
+    const result = await testdriver.parse();
+    // Assert expected elements exist
+    const textContent = result.elements.map(e => e.content.toLowerCase());
+    expect(textContent).toContain('email');
+    expect(textContent).toContain('password');
+    // Assert there are clickable elements
+    const clickable = result.elements.filter(e => e.interactivity === 'clickable');
+    expect(clickable.length).toBeGreaterThan(0);
+  });
+});
+```
+### Use Bounding Box Coordinates
+```javascript
+const result = await testdriver.parse();
+result.elements.forEach(el => {
+  // Pixel coordinates
+  console.log(`Element "${el.content}":`);
+  console.log(`  bbox: (${el.bbox.x0}, ${el.bbox.y0}) to (${el.bbox.x1}, ${el.bbox.y1})`);
+  console.log(`  size: ${el.boundingBox.width}x${el.boundingBox.height}`);
+  console.log(`  position: left=${el.boundingBox.left}, top=${el.boundingBox.top}`);
+});
+```
+### View Annotated Screenshot
+```javascript
+const result = await testdriver.parse();
+// The annotated image shows all detected elements with bounding boxes
+console.log('Annotated screenshot:', result.annotatedImageUrl);
+console.log(`Image dimensions: ${result.imageWidth}x${result.imageHeight}`);
+```
+## How It Works
+1. TestDriver captures a screenshot of the current screen
+2. The image is sent to the TestDriver API
+3. OmniParser v2 analyzes the image to detect all UI elements
+4. Each element is classified by type (text, icon, button, etc.) and interactivity
+5. Bounding box coordinates are returned in pixel coordinates matching the screen resolution
+<Note>
+  OmniParser detects elements visually — it works with any UI framework, native apps, and even non-standard interfaces. It does not rely on DOM or accessibility trees.
+</Note>
+## Best Practices
+<AccordionGroup>
+  <Accordion title="Use find() for targeting specific elements">
+    For locating and interacting with a specific element, prefer `find()` which uses AI vision. Use `parse()` when you need a complete inventory of all elements on screen.
+    ```javascript
+    // Prefer this for clicking a specific element
+    await testdriver.find("Submit button").click();
+    // Use parse() for full UI analysis
+    const result = await testdriver.parse();
+    const allButtons = result.elements.filter(e => e.type === 'button');
+    ```
+  </Accordion>
+  <Accordion title="Filter by interactivity">
+    Use the `interactivity` field to distinguish between clickable and non-interactive elements.
+    ```javascript
+    const result = await testdriver.parse();
+    const interactive = result.elements.filter(e => e.interactivity === 'clickable');
+    const static_ = result.elements.filter(e => e.interactivity === 'non-interactive');
+    ```
+  </Accordion>
+  <Accordion title="Wait for content to load">
+    If elements aren't being detected, the page may not be fully loaded. Add a wait first.
+    ```javascript
+    // Wait for page to stabilize
+    await testdriver.wait(2000);
+    // Then parse
+    const result = await testdriver.parse();
+    ```
+  </Accordion>
+  <Accordion title="Use the annotated image for debugging">
+    The `annotatedImageUrl` provides a visual overlay showing all detected elements with their bounding boxes — great for debugging.
+    ```javascript
+    const result = await testdriver.parse();
+    console.log('View annotated screenshot:', result.annotatedImageUrl);
+    ```
+  </Accordion>
+</AccordionGroup>
+## Related
+- [find()](/v7/find) - AI-powered element location
+- [assert()](/v7/assert) - Make AI-powered assertions about screen state
+- [screenshot()](/v7/screenshot) - Capture screenshots
+- [Elements Reference](/v7/elements) - Complete Element API

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "testdriverai",
-  "version": "7.3.16",
+  "version": "7.3.17",
   "description": "Next generation autonomous AI agent for end-to-end testing of web & desktop",
   "main": "sdk.js",
   "types": "sdk.d.ts",

package/sdk.d.ts CHANGED Viewed

@@ -1082,7 +1082,7 @@ export default class TestDriverSDK {
   find(description: string, cacheThreshold?: number): ChainableElementPromise;
   find(
     description: string,
-    options?: { cacheThreshold?: number; cacheKey?: string; timeout?: number; ai?: AIConfig; cache?: { thresholds?: { screen?: number; element?: number } } },
+    options?: { cacheThreshold?: number; cacheKey?: string; timeout?: number; confidence?: number; type?: "text" | "image" | "ui" | "any"; ai?: AIConfig; cache?: { thresholds?: { screen?: number; element?: number } } },
   ): ChainableElementPromise;
   /**

package/sdk.js CHANGED Viewed

@@ -476,6 +476,9 @@ class Element {
       let zoom = false; // Default to disabled, enable with zoom: true
       let perCommandAi = null; // Per-command AI config override
+      let minConfidence = null; // Minimum confidence threshold
+      let elementType = null; // Element type hint: "text", "image", "ui", or "any"
       if (typeof options === "number") {
         // Legacy: options is just a number threshold
         cacheThreshold = options;
@@ -485,6 +488,10 @@ class Element {
         cacheThreshold = options.cacheThreshold ?? null;
         // zoom defaults to false unless explicitly set to true
         zoom = options.zoom === true;
+        // Minimum confidence threshold: fail find if AI confidence is below this value
+        minConfidence = options.confidence ?? null;
+        // Element type hint for prompt wrapping
+        elementType = options.type ?? null;
         // Per-command cache thresholds: { cache: { thresholds: { screen: 0.1, element: 0.2 } } }
         if (typeof options.cache === "object" && options.cache?.thresholds) {
           perCommandThresholds = options.cache.thresholds;
@@ -554,6 +561,8 @@ class Element {
         os: this.sdk.os,
         resolution: this.sdk.resolution,
         zoom: zoom,
+        confidence: minConfidence,
+        type: elementType,
         ai: {
           ...this.sdk.aiConfig,
           ...(perCommandAi || {}),

package/ai/skills/testdriver:ocr/SKILL.md DELETED Viewed

@@ -1,235 +0,0 @@
----
-name: testdriver:ocr
-description: Extract all visible text from the screen using OCR
----
-<!-- Generated from ocr.mdx. DO NOT EDIT. -->
-## Overview
-Extract all visible text from the current screen using Tesseract OCR. Returns structured data including each word's text content, bounding box coordinates, and confidence scores.
-This method runs OCR on-demand and returns the results immediately. It's useful for:
-- Verifying text content on screen
-- Finding elements by their text when visual matching alone isn't enough
-- Debugging what text TestDriver can "see"
-- Building custom text-based assertions
-<Note>
-  **Performance**: OCR runs server-side using Tesseract.js with a worker pool for fast extraction. A typical screenshot processes in 200-500ms.
-</Note>
-## Syntax
-```javascript
-const result = await testdriver.ocr()
-```
-## Parameters
-None.
-## Returns
-`Promise<OCRResult>` - Object containing extracted text data
-### OCRResult
-| Property | Type | Description |
-|----------|------|-------------|
-| `words` | `OCRWord[]` | Array of extracted words with positions |
-| `fullText` | `string` | All text concatenated with spaces |
-| `confidence` | `number` | Overall OCR confidence (0-100) |
-| `imageWidth` | `number` | Width of the analyzed screenshot |
-| `imageHeight` | `number` | Height of the analyzed screenshot |
-### OCRWord
-| Property | Type | Description |
-|----------|------|-------------|
-| `content` | `string` | The word's text content |
-| `confidence` | `number` | Confidence score for this word (0-100) |
-| `bbox.x0` | `number` | Left edge X coordinate |
-| `bbox.y0` | `number` | Top edge Y coordinate |
-| `bbox.x1` | `number` | Right edge X coordinate |
-| `bbox.y1` | `number` | Bottom edge Y coordinate |
-## Examples
-### Get All Text on Screen
-```javascript
-const result = await testdriver.ocr();
-console.log(result.fullText);
-// "Welcome to TestDriver Sign In Email Password Submit"
-console.log(`Found ${result.words.length} words with ${result.confidence}% confidence`);
-```
-### Check if Text Exists
-```javascript
-const result = await testdriver.ocr();
-// Check for error message
-const hasError = result.words.some(w =>
-  w.content.toLowerCase().includes('error')
-);
-if (hasError) {
-  console.log('Error message detected on screen');
-}
-```
-### Find and Click Text
-```javascript
-const result = await testdriver.ocr();
-// Find the "Submit" button text
-const submitWord = result.words.find(w => w.content === 'Submit');
-if (submitWord) {
-  // Calculate center of the word's bounding box
-  const x = Math.round((submitWord.bbox.x0 + submitWord.bbox.x1) / 2);
-  const y = Math.round((submitWord.bbox.y0 + submitWord.bbox.y1) / 2);
-  // Click at those coordinates
-  await testdriver.click({ x, y });
-}
-```
-### Filter Words by Confidence
-```javascript
-const result = await testdriver.ocr();
-// Only use high-confidence words (90%+)
-const reliableWords = result.words.filter(w => w.confidence >= 90);
-console.log('High confidence words:', reliableWords.map(w => w.content));
-```
-### Build Custom Assertions
-```javascript
-import { describe, expect, it } from "vitest";
-import { TestDriver } from "testdriverai/lib/vitest/hooks.mjs";
-describe("Login Page", () => {
-  it("should show form labels", async (context) => {
-    const testdriver = TestDriver(context);
-    await testdriver.provision.chrome({
-      url: 'https://myapp.com/login',
-    });
-    const result = await testdriver.ocr();
-    // Assert expected labels are present
-    expect(result.fullText).toContain('Email');
-    expect(result.fullText).toContain('Password');
-    expect(result.fullText).toContain('Sign In');
-  });
-});
-```
-### Debug Screen Content
-```javascript
-// Useful for debugging what TestDriver can see
-const result = await testdriver.ocr();
-console.log('=== Screen Text ===');
-console.log(result.fullText);
-console.log('');
-console.log('=== Word Details ===');
-result.words.forEach((word, i) => {
-  console.log(`${i + 1}. "${word.content}" at (${word.bbox.x0}, ${word.bbox.y0}) - ${word.confidence}% confidence`);
-});
-```
-### Find Multiple Instances
-```javascript
-const result = await testdriver.ocr();
-// Find all instances of "Button" text
-const buttons = result.words.filter(w =>
-  w.content.toLowerCase() === 'button'
-);
-console.log(`Found ${buttons.length} buttons on screen`);
-buttons.forEach((btn, i) => {
-  console.log(`Button ${i + 1} at position (${btn.bbox.x0}, ${btn.bbox.y0})`);
-});
-```
-## How It Works
-1. TestDriver captures a screenshot of the current screen
-2. The image is sent to the TestDriver API
-3. Tesseract.js processes the image server-side with multiple workers
-4. The API returns structured data with text and positions
-5. Bounding box coordinates are scaled to match the original screen resolution
-<Note>
-  OCR works best with clear, readable text. Very small text, unusual fonts, or low-contrast text may have lower confidence scores or be missed entirely.
-</Note>
-## Best Practices
-<AccordionGroup>
-  <Accordion title="Use find() for element location">
-    For locating elements, prefer `find()` which uses AI vision. Use `ocr()` when you need raw text data or want to build custom text-based logic.
-    ```javascript
-    // Prefer this for clicking elements
-    await testdriver.find("Submit button").click();
-    // Use ocr() for text verification or custom logic
-    const result = await testdriver.ocr();
-    expect(result.fullText).toContain('Success');
-    ```
-  </Accordion>
-  <Accordion title="Filter by confidence">
-    OCR can sometimes misread characters. Filter by confidence score when accuracy is critical.
-    ```javascript
-    const result = await testdriver.ocr();
-    const reliable = result.words.filter(w => w.confidence >= 85);
-    ```
-  </Accordion>
-  <Accordion title="Handle case sensitivity">
-    Text matching should usually be case-insensitive since OCR capitalization can vary.
-    ```javascript
-    const result = await testdriver.ocr();
-    const hasLogin = result.words.some(w =>
-      w.content.toLowerCase() === 'login'
-    );
-    ```
-  </Accordion>
-  <Accordion title="Wait for content to load">
-    If text isn't being found, the page may not be fully loaded. Add a wait or use `waitForText()`.
-    ```javascript
-    // Wait for specific text to appear
-    await testdriver.waitForText("Welcome");
-    // Then run OCR
-    const result = await testdriver.ocr();
-    ```
-  </Accordion>
-</AccordionGroup>
-## Related
-- [find()](/v7/find) - AI-powered element location
-- [assert()](/v7/assert) - Make AI-powered assertions about screen state
-- [waitForText()](/v7/waiting-for-elements) - Wait for text to appear on screen
-- [screenshot()](/v7/screenshot) - Capture screenshots

package/docs/v7/ocr.mdx DELETED Viewed

@@ -1,236 +0,0 @@
----
-title: "ocr()"
-sidebarTitle: "ocr"
-description: "Extract all visible text from the screen using OCR"
-icon: "text"
----
-## Overview
-Extract all visible text from the current screen using Tesseract OCR. Returns structured data including each word's text content, bounding box coordinates, and confidence scores.
-This method runs OCR on-demand and returns the results immediately. It's useful for:
-- Verifying text content on screen
-- Finding elements by their text when visual matching alone isn't enough
-- Debugging what text TestDriver can "see"
-- Building custom text-based assertions
-<Note>
-  **Performance**: OCR runs server-side using Tesseract.js with a worker pool for fast extraction. A typical screenshot processes in 200-500ms.
-</Note>
-## Syntax
-```javascript
-const result = await testdriver.ocr()
-```
-## Parameters
-None.
-## Returns
-`Promise<OCRResult>` - Object containing extracted text data
-### OCRResult
-| Property | Type | Description |
-|----------|------|-------------|
-| `words` | `OCRWord[]` | Array of extracted words with positions |
-| `fullText` | `string` | All text concatenated with spaces |
-| `confidence` | `number` | Overall OCR confidence (0-100) |
-| `imageWidth` | `number` | Width of the analyzed screenshot |
-| `imageHeight` | `number` | Height of the analyzed screenshot |
-### OCRWord
-| Property | Type | Description |
-|----------|------|-------------|
-| `content` | `string` | The word's text content |
-| `confidence` | `number` | Confidence score for this word (0-100) |
-| `bbox.x0` | `number` | Left edge X coordinate |
-| `bbox.y0` | `number` | Top edge Y coordinate |
-| `bbox.x1` | `number` | Right edge X coordinate |
-| `bbox.y1` | `number` | Bottom edge Y coordinate |
-## Examples
-### Get All Text on Screen
-```javascript
-const result = await testdriver.ocr();
-console.log(result.fullText);
-// "Welcome to TestDriver Sign In Email Password Submit"
-console.log(`Found ${result.words.length} words with ${result.confidence}% confidence`);
-```
-### Check if Text Exists
-```javascript
-const result = await testdriver.ocr();
-// Check for error message
-const hasError = result.words.some(w =>
-  w.content.toLowerCase().includes('error')
-);
-if (hasError) {
-  console.log('Error message detected on screen');
-}
-```
-### Find and Click Text
-```javascript
-const result = await testdriver.ocr();
-// Find the "Submit" button text
-const submitWord = result.words.find(w => w.content === 'Submit');
-if (submitWord) {
-  // Calculate center of the word's bounding box
-  const x = Math.round((submitWord.bbox.x0 + submitWord.bbox.x1) / 2);
-  const y = Math.round((submitWord.bbox.y0 + submitWord.bbox.y1) / 2);
-  // Click at those coordinates
-  await testdriver.click({ x, y });
-}
-```
-### Filter Words by Confidence
-```javascript
-const result = await testdriver.ocr();
-// Only use high-confidence words (90%+)
-const reliableWords = result.words.filter(w => w.confidence >= 90);
-console.log('High confidence words:', reliableWords.map(w => w.content));
-```
-### Build Custom Assertions
-```javascript
-import { describe, expect, it } from "vitest";
-import { TestDriver } from "testdriverai/lib/vitest/hooks.mjs";
-describe("Login Page", () => {
-  it("should show form labels", async (context) => {
-    const testdriver = TestDriver(context);
-    await testdriver.provision.chrome({
-      url: 'https://myapp.com/login',
-    });
-    const result = await testdriver.ocr();
-    // Assert expected labels are present
-    expect(result.fullText).toContain('Email');
-    expect(result.fullText).toContain('Password');
-    expect(result.fullText).toContain('Sign In');
-  });
-});
-```
-### Debug Screen Content
-```javascript
-// Useful for debugging what TestDriver can see
-const result = await testdriver.ocr();
-console.log('=== Screen Text ===');
-console.log(result.fullText);
-console.log('');
-console.log('=== Word Details ===');
-result.words.forEach((word, i) => {
-  console.log(`${i + 1}. "${word.content}" at (${word.bbox.x0}, ${word.bbox.y0}) - ${word.confidence}% confidence`);
-});
-```
-### Find Multiple Instances
-```javascript
-const result = await testdriver.ocr();
-// Find all instances of "Button" text
-const buttons = result.words.filter(w =>
-  w.content.toLowerCase() === 'button'
-);
-console.log(`Found ${buttons.length} buttons on screen`);
-buttons.forEach((btn, i) => {
-  console.log(`Button ${i + 1} at position (${btn.bbox.x0}, ${btn.bbox.y0})`);
-});
-```
-## How It Works
-1. TestDriver captures a screenshot of the current screen
-2. The image is sent to the TestDriver API
-3. Tesseract.js processes the image server-side with multiple workers
-4. The API returns structured data with text and positions
-5. Bounding box coordinates are scaled to match the original screen resolution
-<Note>
-  OCR works best with clear, readable text. Very small text, unusual fonts, or low-contrast text may have lower confidence scores or be missed entirely.
-</Note>
-## Best Practices
-<AccordionGroup>
-  <Accordion title="Use find() for element location">
-    For locating elements, prefer `find()` which uses AI vision. Use `ocr()` when you need raw text data or want to build custom text-based logic.
-    ```javascript
-    // Prefer this for clicking elements
-    await testdriver.find("Submit button").click();
-    // Use ocr() for text verification or custom logic
-    const result = await testdriver.ocr();
-    expect(result.fullText).toContain('Success');
-    ```
-  </Accordion>
-  <Accordion title="Filter by confidence">
-    OCR can sometimes misread characters. Filter by confidence score when accuracy is critical.
-    ```javascript
-    const result = await testdriver.ocr();
-    const reliable = result.words.filter(w => w.confidence >= 85);
-    ```
-  </Accordion>
-  <Accordion title="Handle case sensitivity">
-    Text matching should usually be case-insensitive since OCR capitalization can vary.
-    ```javascript
-    const result = await testdriver.ocr();
-    const hasLogin = result.words.some(w =>
-      w.content.toLowerCase() === 'login'
-    );
-    ```
-  </Accordion>
-  <Accordion title="Wait for content to load">
-    If text isn't being found, the page may not be fully loaded. Add a wait or use `waitForText()`.
-    ```javascript
-    // Wait for specific text to appear
-    await testdriver.waitForText("Welcome");
-    // Then run OCR
-    const result = await testdriver.ocr();
-    ```
-  </Accordion>
-</AccordionGroup>
-## Related
-- [find()](/v7/find) - AI-powered element location
-- [assert()](/v7/assert) - Make AI-powered assertions about screen state
-- [waitForText()](/v7/waiting-for-elements) - Wait for text to appear on screen
-- [screenshot()](/v7/screenshot) - Capture screenshots