npm - @dealcrawl/sdk - Versions diffs - 2.7.0 → 2.10.0 - Mend

@dealcrawl/sdk 2.7.0 → 2.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

package/README.md CHANGED

@@ -6,6 +6,15 @@ Official TypeScript SDK for the DealCrawl web scraping and crawling API.
 [![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue.svg)](https://www.typescriptlang.org/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+## What's New in January 2026 🎉
+- **📸 Screenshot Storage (Phase 4)** - Automatic screenshot capture and storage via Supabase with signed URLs (private by default; public URLs are an Enterprise opt-in)
+- **🎯 Priority Crawl System (Phase 5)** - 3-tier queue system (high/medium/low) based on SmartFrontier deal scores for optimized resource allocation
+- **🤖 AI Deal Extraction** - LLM-powered deal extraction with customizable score thresholds and automatic database storage
+- **💾 Enhanced Data Persistence** - New `crawled_pages` and `crawled_deals` tables for comprehensive deal tracking
+- **📝 Markdown Output** - Convert scraped content to clean Markdown with GFM support
+- **🎬 Browser Actions** - Execute preset actions (click, scroll, write, etc.) before scraping for dynamic content
 ## Features
 - 🚀 **Full API Coverage** - Access all 50+ DealCrawl API endpoints
@@ -40,15 +49,172 @@ const client = new DealCrawl({
   apiKey: process.env.DEALCRAWL_API_KEY!,
 });
-// Scrape a single page with deal extraction
+// Scrape a single page with deal extraction and screenshot
 const job = await client.scrape.create({
   url: "https://shop.example.com/product",
   extractDeal: true,
+  screenshot: { enabled: true },
+  outputMarkdown: true, // NEW: Get clean markdown output
 });
 // Wait for result with automatic polling
 const result = await client.waitForResult(job.jobId);
-console.log(result);
+console.log(result.data.parsed.markdown); // Markdown content
+console.log(result.data.screenshot); // Screenshot URL (signed and private by default)
+```
+## January 2026 Features in Detail
+### 📸 Screenshot Storage (SEC-011)
+**Private by default** with configurable signed URL expiration:
+```typescript
+// Basic screenshot (private with tier-specific TTL)
+const job = await client.scrape.create({
+  url: "https://example.com",
+  screenshot: {
+    enabled: true,
+    fullPage: true,
+    format: "webp",
+    quality: 85,
+    signedUrlTtl: 604800,  // 7 days (default for Pro/Enterprise)
+  },
+});
+const result = await client.waitForResult(job.jobId);
+console.log(result.data.screenshotMetadata);
+// {
+//   url: "https://...supabase.co/storage/v1/object/sign/screenshots-private/...",
+//   isPublic: false,
+//   expiresAt: "2026-01-25T12:00:00Z",
+//   width: 1280,
+//   height: 720,
+//   format: "webp",
+//   sizeBytes: 125000
+// }
+// Refresh signed URL before expiration
+const refreshed = await client.screenshots.refresh({
+  path: "job_abc123/1234567890_nanoid_example.webp",
+  ttl: 604800  // Extend for another 7 days
+});
+console.log(refreshed.url);        // New signed URL
+console.log(refreshed.expiresAt);  // "2026-02-01T12:00:00Z"
+// Get tier-specific TTL limits
+const limits = await client.screenshots.getLimits();
+console.log(limits);
+// {
+//   tier: "pro",
+//   limits: { min: 3600, max: 604800, default: 604800 },
+//   formattedLimits: { min: "1 hour", max: "7 days", default: "7 days" }
+// }
+// Enterprise: Public URLs (opt-in)
+const jobPublic = await client.scrape.create({
+  url: "https://example.com",
+  screenshot: {
+    enabled: true,
+    publicUrl: true,  // ⚠️ Enterprise only - exposes data publicly
+  },
+});
+// → Public URL without expiration (Enterprise tier only)
+```
+**Security Note:** Screenshots are private by default to prevent exposure of personal data, copyrighted content, or sensitive tokens. Public URLs require Enterprise tier + explicit opt-in.
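Because signed URLs expire, a caller has to decide when to call `client.screenshots.refresh`. The following is a minimal sketch of that decision, assuming `expiresAt` is an ISO 8601 string as in the metadata example above. The helper name `needsRefresh` and the one-hour safety margin are illustrative assumptions, not part of the SDK.

```typescript
// Hypothetical helper (not an SDK function): returns true when a signed URL
// expires within `marginSeconds` of `now`, or has already expired.
function needsRefresh(
  expiresAt: string, // ISO 8601 timestamp, e.g. screenshotMetadata.expiresAt
  now: Date = new Date(),
  marginSeconds = 3600, // assumed safety margin of one hour
): boolean {
  const expiryMs = Date.parse(expiresAt);
  if (Number.isNaN(expiryMs)) {
    throw new Error(`invalid expiresAt timestamp: ${expiresAt}`);
  }
  return expiryMs - now.getTime() <= marginSeconds * 1000;
}
```

A caller could run this check before rendering a stored screenshot link and request a fresh URL only when it returns `true`.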
+### 🎯 Priority Crawl System
+3-tier queue system automatically prioritizes high-value pages:
+```typescript
+// Crawl with automatic prioritization
+const job = await client.crawl.create({
+  url: "https://shop.example.com",
+  extractDeal: true,
+  minDealScore: 50, // Only extract deals scoring 50+
+});
+// Behind the scenes:
+// - Pages scoring 70+ → High priority queue (5 workers, 30/min)
+// - Pages scoring 40-69 → Medium priority queue (10 workers, 60/min)
+// - Pages scoring <40 → Low priority queue (20 workers, 120/min)
+```
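The thresholds in the comments above can be read as a simple routing rule. This is a sketch only, assuming scores fall in the 0-100 range used elsewhere in the README. `QueueTier` and `queueForScore` are hypothetical names, and the service's actual routing logic is not shown in this diff.

```typescript
// Hypothetical illustration of the documented thresholds; not SDK code.
type QueueTier = "high" | "medium" | "low";

function queueForScore(score: number): QueueTier {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  if (score >= 70) return "high"; // 70+
  if (score >= 40) return "medium"; // 40-69
  return "low"; // below 40
}
```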
+### 🤖 AI Deal Extraction
+Extract deals with LLM-powered analysis:
+```typescript
+// Extract deals during crawl
+const job = await client.crawl.create({
+  url: "https://marketplace.example.com",
+  extractDeal: true,
+  minDealScore: 30, // Only extract if score >= 30
+  maxPages: 200,
+});
+// Get extracted deals
+const deals = await client.status.getDeals(job.jobId, {
+  minScore: 70, // Filter for high-quality deals
+  limit: 50,
+});
+console.log(deals.deals); // Array of ExtractedDeal objects
+```
+### 📝 Markdown Output
+Convert HTML to clean, structured markdown:
+```typescript
+// Single page markdown
+const job = await client.scrape.create({
+  url: "https://blog.example.com/article",
+  outputMarkdown: true,
+  markdownBaseUrl: "https://blog.example.com", // Resolve relative URLs
+  onlyMainContent: true,
+});
+const result = await client.waitForResult(job.jobId);
+console.log(result.data.parsed.markdown);
+// Clean markdown with:
+// - GFM tables, strikethrough, task lists
+// - Code blocks with syntax detection
+// - Absolute URLs
+// - Noise removal (ads, navigation)
+```
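To illustrate what resolving relative URLs against `markdownBaseUrl` means, here is a small client-side sketch built on the standard WHATWG `URL` constructor. The function name `absolutizeMarkdownLinks` is hypothetical, the regular expression only handles simple inline links and images of the form `[text](href)`, and the service's own conversion is not shown in this diff.

```typescript
// Hypothetical sketch: rewrite inline Markdown link targets to absolute URLs.
// Link titles and targets containing spaces or parentheses are not handled.
function absolutizeMarkdownLinks(markdown: string, baseUrl: string): string {
  return markdown.replace(/\]\(([^)\s]+)\)/g, (match, href: string) => {
    try {
      // new URL(relative, base) resolves href against the base URL;
      // already-absolute hrefs are returned unchanged.
      return `](${new URL(href, baseUrl).href})`;
    } catch {
      return match; // leave unparseable targets untouched
    }
  });
}
```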
+### 🎬 Browser Actions
+Execute actions before scraping for dynamic content:
+```typescript
+// Handle cookie popups and load more content
+const job = await client.scrape.create({
+  url: "https://shop.example.com/products",
+  actions: [
+    { type: "click", selector: "#accept-cookies", optional: true },
+    { type: "wait", milliseconds: 500 },
+    { type: "scroll", direction: "down", amount: 500 },
+    { type: "click", selector: ".load-more", retries: 3 },
+    { type: "wait", selector: ".products-loaded" },
+  ],
+  extractMultipleDeals: true,
+});
+// Search and extract
+const job2 = await client.scrape.create({
+  url: "https://marketplace.com",
+  actions: [
+    { type: "write", selector: "input[name='search']", text: "laptop deals" },
+    { type: "press", key: "Enter" },
+    { type: "wait", selector: ".results" },
+  ],
+  extractMultipleDeals: true,
+  maxDeals: 30,
+});
 ```
 ## Configuration
@@ -86,25 +252,29 @@ const job = await client.scrape.withScreenshot("https://example.com", {
 ```
 **Options:**
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `url` | string | required | URL to scrape |
-| `noStore` | boolean | false | Zero Data Retention - don't save results (Pro/Enterprise) |
-| `detectSignals` | boolean | true | Detect prices, discounts, urgency |
-| `extractDeal` | boolean | false | Extract deal information |
-| `extractMultipleDeals` | boolean | false | Extract multiple deals from list pages |
-| `maxDeals` | number | 20 | Max deals to extract (max: 50) |
-| `extractWithAI` | boolean | false | Use AI for extraction |
-| `useAdvancedModel` | boolean | false | Use GPT-4o (higher cost) |
-| `minDealScore` | number | 0 | Minimum deal score (0-100) |
-| `screenshot` | object | - | Screenshot options |
-| `excludeTags` | string[] | - | HTML tags to exclude |
-| `excludeSelectors` | string[] | - | CSS selectors to exclude |
-| `onlyMainContent` | boolean | true | Extract main content only |
-| `headers` | object | - | Custom HTTP headers |
-| `timeout` | number | 30000 | Request timeout in ms (max: 120000) |
-### Batch Scrape - Bulk URL Scraping (NEW)
+| Option                 | Type     | Default  | Description                                               |
+| ---------------------- | -------- | -------- | --------------------------------------------------------- |
+| `url`                  | string   | required | URL to scrape                                             |
+| `noStore`              | boolean  | false    | Zero Data Retention - don't save results (Pro/Enterprise) |
+| `detectSignals`        | boolean  | true     | Detect prices, discounts, urgency                         |
+| `extractDeal`          | boolean  | false    | Extract deal information                                  |
+| `extractMultipleDeals` | boolean  | false    | Extract multiple deals from list pages                    |
+| `maxDeals`             | number   | 20       | Max deals to extract (max: 50)                            |
+| `extractWithAI`        | boolean  | false    | Use AI for extraction                                     |
+| `useAdvancedModel`     | boolean  | false    | Use GPT-4o (higher cost)                                  |
+| `minDealScore`         | number   | 0        | Minimum deal score (0-100)                                |
+| `screenshot`           | object   | -        | Screenshot options                                        |
+| `excludeTags`          | string[] | -        | HTML tags to exclude                                      |
+| `excludeSelectors`     | string[] | -        | CSS selectors to exclude                                  |
+| `onlyMainContent`      | boolean  | true     | Extract main content only                                 |
+| `headers`              | object   | -        | Custom HTTP headers                                       |
+| `timeout`              | number   | 30000    | Request timeout in ms (max: 120000)                       |
+| `outputMarkdown`       | boolean  | false    | Convert content to Markdown (GFM)                         |
+| `markdownBaseUrl`      | string   | -        | Base URL for resolving relative URLs in markdown          |
+| `actions`              | array    | -        | Browser actions to execute before scraping                |
+### Batch Scrape - Bulk URL Scraping
 ```typescript
 // Scrape multiple URLs in one request (1-100 URLs)
@@ -128,16 +298,17 @@ const results = await client.waitForAll(batch.jobIds);
 ```
 **Batch Options:**
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `urls` | array | required | 1-100 URL objects with optional overrides |
-| `defaults` | object | - | Default options applied to all URLs |
-| `priority` | number | 5 | Priority 1-10 (higher = faster) |
-| `delay` | number | 0 | Delay between URLs (0-5000ms) |
-| `webhookUrl` | string | - | Webhook for batch completion |
-| `ref` | string | - | Custom reference ID for tracking |
-### Search - Web Search with AI (NEW)
+| Option       | Type   | Default  | Description                               |
+| ------------ | ------ | -------- | ----------------------------------------- |
+| `urls`       | array  | required | 1-100 URL objects with optional overrides |
+| `defaults`   | object | -        | Default options applied to all URLs       |
+| `priority`   | number | 5        | Priority 1-10 (higher = faster)           |
+| `delay`      | number | 0        | Delay between URLs (0-5000ms)             |
+| `webhookUrl` | string | -        | Webhook for batch completion              |
+| `ref`        | string | -        | Custom reference ID for tracking          |
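The ranges in the table above can be checked on the client before a batch is submitted. This is a minimal sketch under stated assumptions: `BatchOptionsSketch` and `validateBatchOptions` are hypothetical names rather than SDK types, and `priority` is assumed to be an integer.

```typescript
// Hypothetical shape covering only the range-limited options from the table.
interface BatchOptionsSketch {
  urls: { url: string }[]; // 1-100 entries
  priority?: number; // 1-10, default 5 (assumed to be an integer)
  delay?: number; // 0-5000 ms, default 0
}

// Returns a list of human-readable problems; an empty list means valid.
function validateBatchOptions(opts: BatchOptionsSketch): string[] {
  const errors: string[] = [];
  if (opts.urls.length < 1 || opts.urls.length > 100) {
    errors.push("urls must contain 1-100 entries");
  }
  const priority = opts.priority ?? 5;
  if (!Number.isInteger(priority) || priority < 1 || priority > 10) {
    errors.push("priority must be an integer from 1 to 10");
  }
  const delay = opts.delay ?? 0;
  if (!(delay >= 0 && delay <= 5000)) {
    errors.push("delay must be between 0 and 5000 ms");
  }
  return errors;
}
```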
+### Search - Web Search with AI
 ```typescript
 // Basic search
@@ -184,17 +355,18 @@ const result = await client.searchAndWait({
 ```
 **Search Options:**
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `query` | string | required | Search query |
-| `maxResults` | number | 10 | Results to return (1-100) |
-| `useAiOptimization` | boolean | false | AI-enhance the query |
-| `aiProvider` | string | "openai" | "openai" or "anthropic" |
-| `aiModel` | string | - | Model ID (gpt-4o-mini, claude-3-5-sonnet, etc.) |
-| `useDealScoring` | boolean | false | Score results for deal relevance |
-| `autoScrape` | boolean | false | Auto-scrape top results |
-| `autoScrapeLimit` | number | 3 | Number of results to scrape |
-| `filters` | object | - | Location, language, date, domains |
+| Option              | Type    | Default  | Description                                     |
+| ------------------- | ------- | -------- | ----------------------------------------------- |
+| `query`             | string  | required | Search query                                    |
+| `maxResults`        | number  | 10       | Results to return (1-100)                       |
+| `useAiOptimization` | boolean | false    | AI-enhance the query                            |
+| `aiProvider`        | string  | "openai" | "openai" or "anthropic"                         |
+| `aiModel`           | string  | -        | Model ID (gpt-4o-mini, claude-3-5-sonnet, etc.) |
+| `useDealScoring`    | boolean | false    | Score results for deal relevance                |
+| `autoScrape`        | boolean | false    | Auto-scrape top results                         |
+| `autoScrapeLimit`   | number  | 3        | Number of results to scrape                     |
+| `filters`           | object  | -        | Location, language, date, domains               |
 ### Crawl - Website Crawling
@@ -253,26 +425,30 @@ const job = await client.crawl.create({
 - `custom` - No preset, use your own settings
 **Crawl Options:**
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `url` | string | required | Starting URL |
-| `maxDepth` | number | 3 | Max crawl depth (1-5) |
-| `maxPages` | number | 100 | Max pages to crawl (1-1000) |
-| `detectSignals` | boolean | true | Detect prices, discounts |
-| `extractDeal` | boolean | false | Extract deal info with AI |
-| `minDealScore` | number | 30 | Min deal score threshold (0-100) |
-| `categories` | array | - | Filter: courses, software, physical, services, other |
-| `priceRange` | object | - | Filter: { min, max } price |
-| `onlyHighQuality` | boolean | false | Only deals scoring 70+ |
-| `allowedMerchants` | string[] | - | Only these merchants |
-| `blockedMerchants` | string[] | - | Exclude these merchants |
-| `webhookUrl` | string | - | Real-time notifications URL |
-| `syncToDealup` | boolean | false | Auto-sync to DealUp |
-| `template` | string | - | Job template to use |
-| `useSmartRouting` | boolean | true | Auto-detect best settings |
-| `priority` | string | - | Queue priority (Enterprise only) |
-| `requireJS` | boolean | false | Force JavaScript rendering |
-| `bypassAntiBot` | boolean | false | Advanced anti-bot techniques |
+| Option             | Type     | Default  | Description                                          |
+| ------------------ | -------- | -------- | ---------------------------------------------------- |
+| `url`              | string   | required | Starting URL                                         |
+| `maxDepth`         | number   | 3        | Max crawl depth (1-5)                                |
+| `maxPages`         | number   | 100      | Max pages to crawl (1-1000)                          |
+| `detectSignals`    | boolean  | true     | Detect prices, discounts                             |
+| `extractDeal`      | boolean  | false    | Extract deal info with AI                            |
+| `minDealScore`     | number   | 30       | Min deal score threshold (0-100)                     |
+| `categories`       | array    | -        | Filter: courses, software, physical, services, other |
+| `priceRange`       | object   | -        | Filter: { min, max } price                           |
+| `onlyHighQuality`  | boolean  | false    | Only deals scoring 70+                               |
+| `allowedMerchants` | string[] | -        | Only these merchants                                 |
+| `blockedMerchants` | string[] | -        | Exclude these merchants                              |
+| `webhookUrl`       | string   | -        | Real-time notifications URL                          |
+| `syncToDealup`     | boolean  | false    | Auto-sync to DealUp                                  |
+| `template`         | string   | -        | Job template to use                                  |
+| `useSmartRouting`  | boolean  | true     | Auto-detect best settings                            |
+| `priority`         | string   | -        | Queue priority (Enterprise only)                     |
+| `requireJS`        | boolean  | false    | Force JavaScript rendering                           |
+| `bypassAntiBot`    | boolean  | false    | Advanced anti-bot techniques                         |
+| `outputMarkdown`   | boolean  | false    | Convert pages to Markdown (GFM)                      |
+| `markdownBaseUrl`  | string   | -        | Base URL for relative links in markdown              |
+| `noStore`          | boolean  | false    | Zero Data Retention (Pro/Enterprise only)            |
 ### Extract - LLM-Based Extraction
@@ -327,7 +503,7 @@ const query = client.dork.buildQuery({
 // Returns: "laptop deals site:amazon.com intitle:discount"
 ```
-### Agent - AI Autonomous Navigation (NEW)
+### Agent - AI Autonomous Navigation
 Create AI agents that can navigate websites, interact with elements, and extract structured data using natural language instructions.
@@ -335,7 +511,8 @@ Create AI agents that can navigate websites, interact with elements, and extract
 // Basic agent - navigate and extract data
 const job = await client.agent.create({
   url: "https://amazon.com",
-  prompt: "Search for wireless headphones under $50 and extract the top 5 results",
+  prompt:
+    "Search for wireless headphones under $50 and extract the top 5 results",
   schema: {
     type: "object",
     properties: {
@@ -399,28 +576,64 @@ const job = await client.agent.withClaude(
 ```
 **Agent Options:**
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `url` | string | required | Starting URL |
-| `prompt` | string | required | Natural language instructions (10-2000 chars) |
-| `schema` | object | - | JSON Schema for structured output |
-| `maxSteps` | number | 10 | Maximum navigation steps (max: 25) |
-| `actions` | array | - | Preset actions to execute first |
-| `model` | string | "openai" | LLM provider: "openai" or "anthropic" |
-| `timeout` | number | 30000 | Per-step timeout in ms (max: 60000) |
-| `takeScreenshots` | boolean | false | Capture screenshot at each step |
-| `onlyMainContent` | boolean | true | Extract main content only |
+| Option            | Type    | Default  | Description                                   |
+| ----------------- | ------- | -------- | --------------------------------------------- |
+| `url`             | string  | required | Starting URL                                  |
+| `prompt`          | string  | required | Natural language instructions (10-2000 chars) |
+| `schema`          | object  | -        | JSON Schema for structured output             |
+| `maxSteps`        | number  | 10       | Maximum navigation steps (max: 25)            |
+| `actions`         | array   | -        | Preset actions to execute first               |
+| `model`           | string  | "openai" | LLM provider: "openai" or "anthropic"         |
+| `timeout`         | number  | 30000    | Per-step timeout in ms (max: 60000)           |
+| `takeScreenshots` | boolean | false    | Capture screenshot at each step               |
+| `onlyMainContent` | boolean | true     | Extract main content only                     |
 **Action Types:**
-- `click` - Click an element
-- `scroll` - Scroll page or to element
-- `write` - Type text into input
-- `wait` - Wait for time or element
-- `press` - Press keyboard key
-- `screenshot` - Capture screenshot
-- `hover` - Hover over element
-- `select` - Select dropdown option
+| Action       | Key Parameters                                    | Description              |
+|--------------|---------------------------------------------------|--------------------------|
+| `click`      | `selector`, `waitAfter?`, `button?`, `force?`     | Click an element         |
+| `scroll`     | `direction`, `amount?`, `smooth?`                 | Scroll page/to element   |
+| `write`      | `selector`, `text`, `clearFirst?`, `typeDelay?`   | Type text into input     |
+| `wait`       | `milliseconds?`, `selector?`, `condition?`        | Wait for time or element |
+| `press`      | `key`, `modifiers?`                               | Press keyboard key       |
+| `screenshot` | `fullPage?`, `selector?`, `name?`                 | Capture screenshot       |
+| `hover`      | `selector`, `duration?`                           | Hover over element       |
+| `select`     | `selector`, `value`, `byLabel?`                   | Select dropdown option   |
+**Action Resilience (all actions support):**
+- `optional: boolean` - Don't fail job if action fails
+- `retries: number` - Retry failed action (1-5 times)
+- `delayBefore: number` - Delay before executing action (ms)
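The parameter table and the resilience options above imply a few checks that can be run locally before a job is created. The sketch below encodes only what the table states: parameters without a `?` are treated as required, and `retries` must be 1-5. `ActionSketch` and `validateAction` are hypothetical names, not the SDK's `ActionInput` type or any SDK function.

```typescript
// Required parameters per action type, taken from the table above.
// Parameters marked with `?` in the table are optional and not checked.
const REQUIRED_PARAMS = new Map<string, string[]>([
  ["click", ["selector"]],
  ["scroll", ["direction"]],
  ["write", ["selector", "text"]],
  ["wait", []],
  ["press", ["key"]],
  ["screenshot", []],
  ["hover", ["selector"]],
  ["select", ["selector", "value"]],
]);

// Hypothetical loose shape for illustration only.
type ActionSketch = { type: string; retries?: number; [param: string]: unknown };

// Returns a list of problems; an empty list means the action looks valid.
function validateAction(action: ActionSketch): string[] {
  const required = REQUIRED_PARAMS.get(action.type);
  if (required === undefined) return [`unknown action type "${action.type}"`];
  const errors = required
    .filter((name) => action[name] === undefined)
    .map((name) => `${action.type}: missing "${name}"`);
  const { retries } = action;
  if (retries !== undefined && (!Number.isInteger(retries) || retries < 1 || retries > 5)) {
    errors.push(`${action.type}: retries must be an integer from 1 to 5`);
  }
  return errors;
}
```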
+**Schema Generation:**
+```typescript
+// Generate JSON Schema from natural language
+const schemaResult = await client.agent.generateSchema({
+  prompt: "Find e-commerce product deals with prices and discounts",
+  context: {
+    domains: ["e-commerce", "retail"],       // Help AI understand context
+    dataTypes: ["prices", "discounts"],      // Expected data types
+    format: "json",                          // Output format
+    clarifications: ["Include shipping info"] // Additional requirements
+  },
+});
+// Use the generated schema
+const job = await client.agent.create({
+  url: "https://shop.example.com",
+  prompt: schemaResult.refinedPrompt,  // AI-improved prompt
+  schema: schemaResult.schema,          // Generated JSON Schema
+});
+// Check confidence - if low, ask clarifying questions
+if (schemaResult.confidence < 0.7) {
+  console.log("Consider clarifying:", schemaResult.suggestedQuestions);
+}
+```
 ### Status - Job Management
@@ -512,6 +725,44 @@ await client.webhooks.delete(webhookId);
 - `crawl.completed` - Crawl job finished
 - `crawl.failed` - Crawl job failed
+### Screenshots - Signed URL Management
+Manage screenshot signed URLs with configurable TTL and automatic refresh:
+```typescript
+// Refresh a signed URL before expiration
+const refreshed = await client.screenshots.refresh({
+  path: "job_abc123/1234567890_nanoid_example.png",
+  ttl: 604800  // Optional: 7 days (defaults to tier default)
+});
+console.log(refreshed.url);           // New signed URL
+console.log(refreshed.expiresAt);     // "2026-01-25T12:00:00Z"
+console.log(refreshed.tierLimits);    // { min: 3600, max: 604800, default: 604800 }
+// Get tier-specific TTL limits
+const limits = await client.screenshots.getLimits();
+console.log(limits.tier);                    // "pro"
+console.log(limits.limits);                  // { min: 3600, max: 604800, default: 604800 }
+console.log(limits.formattedLimits);         // { min: "1 hour", max: "7 days", default: "7 days" }
+// Specify custom bucket (defaults to 'screenshots-private')
+const refreshedCustom = await client.screenshots.refresh({
+  path: "job_xyz/screenshot.png",
+  ttl: 86400,  // 1 day
+  bucket: "screenshots-private"
+});
+```
+**TTL Limits by Tier:**
+| Tier       | Min TTL | Max TTL  | Default TTL |
+|------------|---------|----------|-------------|
+| Free       | 1 hour  | 24 hours | 24 hours    |
+| Pro        | 1 hour  | 7 days   | 7 days      |
+| Enterprise | 1 hour  | 30 days  | 7 days      |
+**Security Note:** All screenshots are private by default. Public URLs (Enterprise only) don't require refresh as they don't expire.
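The tier table above can be expressed as a lookup that picks a TTL for a request. This is a sketch, not SDK code: `Tier`, `TTL_LIMITS`, and `resolveTtl` are hypothetical names. Clamping out-of-range values is an assumption made for illustration, and the service may instead reject them.

```typescript
type Tier = "free" | "pro" | "enterprise";

const HOUR = 3600;
const DAY = 24 * HOUR;

// TTL limits in seconds, copied from the table above.
const TTL_LIMITS: Record<Tier, { min: number; max: number; default: number }> = {
  free: { min: HOUR, max: DAY, default: DAY },
  pro: { min: HOUR, max: 7 * DAY, default: 7 * DAY },
  enterprise: { min: HOUR, max: 30 * DAY, default: 7 * DAY },
};

// Returns the tier default when no TTL is requested, otherwise the requested
// TTL clamped into the tier's [min, max] range (clamping is an assumption).
function resolveTtl(tier: Tier, requested?: number): number {
  const { min, max, default: fallback } = TTL_LIMITS[tier];
  if (requested === undefined) return fallback;
  return Math.min(max, Math.max(min, requested));
}
```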
 ### Keys - API Key Management
 ```typescript
@@ -650,7 +901,7 @@ const result = await client.crawlAndWait({
 });
 ```
-## Field Selection (NEW)
+## Field Selection
 Reduce response payload size by selecting only the fields you need:
@@ -670,6 +921,22 @@ const deals = await client.data.listDeals({
 const jobs = await client.data.listJobs({
   fields: ["id", "status", "result.deals.title", "result.deals.price"],
 });
+// Agent job field selection
+const agentStatus = await client.status.get(agentJobId, {
+  fields: [
+    "id",
+    "status",
+    "data.extractedData",           // Final extracted data
+    "data.steps.action",            // Just action details (skip observations)
+    "data.totalSteps",
+  ],
+});
+// Markdown content selection
+const scrapeResult = await client.status.get(scrapeJobId, {
+  fields: ["id", "status", "result.parsed.markdown", "result.parsed.title"],
+});
 ```
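The dotted paths in the `fields` arrays above can be understood with a small client-side sketch. It assumes that a path crossing an array (as in `result.deals.title`) applies to every element and that a path ending on a key selects the whole subtree. `Json`, `selectFields`, and `pickFields` are hypothetical names, and the service's own selection logic is not shown in this diff.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Keeps only the parts of `value` reachable through one of `paths`,
// where each path is a list of keys still to be matched.
function selectFields(value: Json, paths: string[][]): Json {
  if (Array.isArray(value)) return value.map((item) => selectFields(item, paths));
  if (value === null || typeof value !== "object") return value;
  const out: { [key: string]: Json } = {};
  for (const key of Object.keys(value)) {
    const matching = paths.filter((p) => p[0] === key);
    if (matching.length === 0) continue;
    if (matching.some((p) => p.length === 1)) {
      out[key] = value[key]; // a path ends here: keep the whole subtree
    } else {
      out[key] = selectFields(value[key], matching.map((p) => p.slice(1)));
    }
  }
  return out;
}

// Entry point taking dotted field names such as "result.deals.title".
function pickFields(value: Json, fields: string[]): Json {
  return selectFields(value, fields.map((f) => f.split(".")));
}
```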
 **Benefits:**
@@ -757,11 +1024,29 @@ import type {
   SearchJobResponse,
   BatchScrapeResponse,
+  // Action Types
+  ActionInput,
+  ClickAction,
+  ScrollAction,
+  WriteAction,
+  WaitAction,
+  PressAction,
+  HoverAction,
+  SelectAction,
+  // Screenshot Options & Responses
+  ScreenshotOptions,
+  ScreenshotResult,
+  RefreshScreenshotOptions,
+  ScreenshotRefreshResponse,
+  ScreenshotLimitsResponse,
   // Re-exports from @dealcrawl/shared
   ScrapeResult,
   CrawlResult,
   ExtractedDeal,
   Signal,
+  ParsedPage,              // Includes markdown field
 } from "@dealcrawl/sdk";
 ```
@@ -876,4 +1161,6 @@ const client = new DealCrawl({
 ## License
+By @Shipfastgo
 MIT © [DealUp](https://dealup.cc)