npm - spectrawl - Versions diffs - 0.3.6 → 0.3.8 - Mend

spectrawl 0.3.6 → 0.3.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md CHANGED

@@ -2,6 +2,8 @@
 The unified web layer for AI agents. Search, browse, authenticate, and act on platforms — one tool, self-hosted, free.
+**Free Tavily alternative** with Google-quality results via Gemini Grounded Search.
 ## What It Does
 AI agents need to interact with the web. That means searching, browsing pages, logging into platforms, and posting content. Today you duct-tape together Playwright + Tavily + cookie managers + platform-specific scripts. Spectrawl replaces all of that.
@@ -10,123 +12,121 @@ AI agents need to interact with the web. That means searching, browsing pages, l
 npm install spectrawl
 ```
-**Search** — 6 engines in a cascade: SearXNG → DuckDuckGo → Brave → Serper → Google CSE → Jina. Tries free/unlimited first, falls through to quota-based. Dual scraping (Jina Reader + readability). Optional LLM summarization.
-**Browse** — Stealth browsing with anti-detection out of the box. Three tiers:
-1. `playwright-extra` + stealth plugin (default, works immediately)
-2. Camoufox binary — engine-level anti-fingerprint (`npx spectrawl install-stealth`)
-3. Remote Camoufox service (for existing deployments)
-**Auth** — Persistent cookie storage (SQLite), multi-account management, automatic cookie refresh, expiry alerts.
-**Act** — 24 platform adapters covering 30+ sites:
-- **Content platforms:** X, Reddit, LinkedIn, Dev.to, Hashnode, IndieHackers, Medium, Hacker News, Quora
-- **Developer:** GitHub (repos, issues, releases), HuggingFace (models, datasets), Discord (bot + webhooks)
-- **Launch/SEO:** Product Hunt, BetaList, AlternativeTo, SaaSHub, DevHunt, AppSumo
-- **Directories:** Generic adapter for MicroLaunch, Uneed, Peerlist, Fazier, BetaPage, LaunchingNext, StartupStash, SideProjectors, TAIFT, Futurepedia, Crunchbase, G2, StackShare, YouTube
-- Rate limiting, content dedup, dead letter queue for retries.
-**Proxy** — Rotating proxy server. One endpoint (`localhost:8080`) for all your tools. Round-robin, random, or least-used strategies. Health checking with auto-failover.
 ## Quick Start
 ```bash
 npm install spectrawl
-npx spectrawl init          # create spectrawl.json config
-npx spectrawl search "your query"
+export GEMINI_API_KEY=your-free-key  # Get one at aistudio.google.com
 ```
-### As a Library
 ```js
 const { Spectrawl } = require('spectrawl')
 const web = new Spectrawl()
-// Search
-const results = await web.search('best practices for node.js APIs')
-console.log(results.sources)      // [{ url, title, snippet, content }]
-console.log(results.answer)       // LLM summary (if configured)
+// Deep search — like Tavily but free
+const result = await web.deepSearch('best AI agent frameworks 2025')
+console.log(result.answer)    // AI-generated answer with citations
+console.log(result.sources)   // [{ title, url, content, score }]
-// Browse with stealth
-const page = await web.browse('https://example.com')
-console.log(page.content)         // extracted text
-console.log(page.engine)          // 'stealth-playwright' or 'camoufox'
-// Act on platforms
-await web.act('x', 'post', {
-  text: 'Hello from Spectrawl',
-  account: '@myhandle'
-})
+// Fast mode — snippets only, ~6s
+const fast = await web.deepSearch('query', { mode: 'fast' })
-// Check auth health
-const accounts = await web.status()
-// [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
+// Basic search — raw results, no AI
+const basic = await web.search('query')
 ```
-### HTTP Server
+### vs Tavily
-```bash
-npx spectrawl serve --port 3900
-```
+| | Tavily | Spectrawl |
+|---|---|---|
+| Speed | ~2s | ~6-9s |
+| Search quality | Google index | Google via Gemini ✅ |
+| Results per query | 10 | 12-16 ✅ |
+| Citations | ✅ | ✅ |
+| Cost | $0.01/query | **Free** ✅ |
+| Self-hosted | No | **Yes** ✅ |
+| Stealth scraping | No | **Yes** ✅ |
+| Auth + posting | No | **24 adapters** ✅ |
+| Cached repeats | No | **<1ms** ✅ |
-```
-POST /search   { "query": "...", "summarize": true }
-POST /browse   { "url": "...", "screenshot": true }
-POST /act      { "platform": "x", "action": "post", "params": { "text": "..." } }
-GET  /status
-GET  /health
-```
+## Search
-### MCP Server
+Default cascade: **Gemini Grounded → Brave → DDG**
-Works with any MCP-compatible agent framework:
+Gemini Grounded Search gives you Google-quality results through the Gemini API. Free tier: 5,000 grounded queries/month.
-```bash
-npx spectrawl mcp
+| Engine | Free Tier | Key Required | Default |
+|--------|-----------|-------------|---------|
+| **Gemini Grounded** | 5,000/month | `GEMINI_API_KEY` | ✅ Primary |
+| Brave | 2,000/month | `BRAVE_API_KEY` | ✅ Fallback |
+| DuckDuckGo | Unlimited | None | ✅ Last resort |
+| Bing | Unlimited | None | Available |
+| Serper | 2,500 trial | `SERPER_API_KEY` | Available |
+| Google CSE | 100/day | `GOOGLE_CSE_KEY` | Available |
+| Jina Reader | Unlimited | None | Available |
+| SearXNG | Unlimited | Self-hosted | Available |
+### Deep Search Pipeline
+```
+Query → Gemini Grounded + DDG (parallel)
+  → Merge & deduplicate (12-16 results)
+  → Source quality ranking (boost GitHub/SO/Reddit, penalize SEO spam)
+  → Parallel scraping (Jina → readability → Playwright fallback)
+  → AI summarization with [1] [2] citations
 ```
-Exposes 5 tools: `web_search`, `web_browse`, `web_act`, `web_auth`, `web_status`.
+### What you get without any keys
-## Stealth Browsing
+DDG-only search, raw results, no AI answer. Works from home IPs. Datacenter IPs get rate-limited by DDG — recommend at minimum a free Gemini key.
-Default: `playwright-extra` with stealth plugin patches webdriver detection, navigator properties, canvas/WebGL fingerprinting, and plugin enumeration. Works for ~90% of sites.
+## Browse
-For deeper anti-detection:
+Stealth browsing with anti-detection. Three tiers (auto-detected):
-```bash
-npx spectrawl install-stealth
+1. **playwright-extra + stealth plugin** — default, works immediately
+2. **Camoufox binary** — engine-level anti-fingerprint (`npx spectrawl install-stealth`)
+3. **Remote Camoufox** — for existing deployments
+```js
+const page = await web.browse('https://example.com')
+console.log(page.content)       // extracted text/markdown
+console.log(page.screenshot)    // PNG buffer (if requested)
+// With screenshot
+const page = await web.browse('https://example.com', { screenshot: true })
 ```
-Downloads the [Camoufox](https://github.com/daijro/camoufox) browser — a patched Firefox with engine-level anti-fingerprint. Spectrawl auto-detects and uses it.
+Auto-fallback: if Jina and readability return too little content (<200 chars), Spectrawl renders the page with Playwright and extracts from the rendered DOM. Tavily can't do this — they fail on JS-heavy pages.
-## Search Engines
+## Auth
-| Engine | Free Tier | Default |
-|--------|-----------|---------|
-| SearXNG | Unlimited (self-hosted) | ✅ |
-| DuckDuckGo | Unlimited | ✅ |
-| Brave | 2,000/month | ✅ |
-| Serper | 2,500/month | Fallback |
-| Google CSE | 100/day | Fallback |
-| Jina Reader | Unlimited | Fallback |
+Persistent cookie storage (SQLite), multi-account management, automatic refresh.
-Configure the cascade in `spectrawl.json`:
+```js
+// Store cookies
+await web.auth.setCookies('x', '@myhandle', cookies)
-```json
-{
-  "search": {
-    "cascade": ["searxng", "ddg", "brave", "serper", "google-cse", "jina"]
-  }
-}
+// Check health
+const accounts = await web.status()
+// [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
 ```
-## Platform Adapters
+## Act — 24 Platform Adapters
+Post to 30+ platforms with one API:
+```js
+await web.act('x', 'post', { text: 'Hello from Spectrawl', account: '@myhandle' })
+await web.act('reddit', 'post', { subreddit: 'node', title: '...', text: '...' })
+await web.act('github', 'create-repo', { name: 'my-repo', description: '...' })
+```
 | Platform | Auth Method | Actions |
 |----------|-------------|---------|
 | X/Twitter | GraphQL Cookie + OAuth 1.0a | post |
 | Reddit | Cookie API (oauth.reddit.com) | post, comment |
-| Dev.to | REST API (API key) | post |
+| Dev.to | REST API | post |
 | Hashnode | GraphQL API | post |
 | LinkedIn | Cookie API (Voyager) | post |
 | IndieHackers | Browser automation | post, comment, upvote |
@@ -135,14 +135,70 @@ Configure the cascade in `spectrawl.json`:
 | Discord | Bot API + webhooks | send, thread |
 | Product Hunt | GraphQL v2 | launch, comment, upvote |
 | Hacker News | Cookie/form POST | submit, comment, upvote |
-| YouTube | Data API v3 | comment, playlist, update |
+| YouTube | Data API v3 | comment, playlist |
 | Quora | Browser automation | answer, question |
 | HuggingFace | Hub API | repo, model card, upload |
 | BetaList | REST API | submit |
 | AlternativeTo | Browser automation | submit |
 | SaaSHub | Browser automation | submit |
 | DevHunt | Browser automation | submit |
-| **30+ Directories** | Generic adapter | submit (MicroLaunch, Uneed, TAIFT, Futurepedia, Crunchbase, G2, etc.) |
+| **14 Directories** | Generic adapter | submit |
+Built-in rate limiting, content dedup (MD5, 24h window), and dead letter queue for retries.
+## Source Quality Ranking
+Spectrawl ranks results by domain trust — something Tavily doesn't do:
+- **Boosted:** GitHub, StackOverflow, HN, Reddit, MDN, arxiv, Wikipedia
+- **Penalized:** SEO farms, thin content sites, tag/category pages
+- **Customizable:** bring your own domain weights
+```js
+const web = new Spectrawl({
+  search: {
+    sourceRanker: {
+      boost: ['github.com', 'news.ycombinator.com'],
+      block: ['spamsite.com']
+    }
+  }
+})
+```
+## HTTP Server
+```bash
+npx spectrawl serve --port 3900
+```
+```
+POST /search   { "query": "...", "summarize": true }
+POST /browse   { "url": "...", "screenshot": true }
+POST /act      { "platform": "x", "action": "post", "params": { ... } }
+GET  /status
+GET  /health
+```
+## MCP Server
+Works with any MCP-compatible agent framework (Claude, OpenAI, etc.):
+```bash
+npx spectrawl mcp
+```
+5 tools: `web_search`, `web_browse`, `web_act`, `web_auth`, `web_status`.
+## CLI
+```bash
+npx spectrawl init              # create spectrawl.json
+npx spectrawl search "query"    # search from terminal
+npx spectrawl status            # check auth health
+npx spectrawl serve             # start HTTP server
+npx spectrawl mcp               # start MCP server
+npx spectrawl install-stealth   # download Camoufox browser
+```
 ## Configuration
@@ -150,28 +206,23 @@ Configure the cascade in `spectrawl.json`:
 ```json
 {
-  "port": 3900,
   "search": {
-    "cascade": ["ddg", "brave"],
+    "cascade": ["gemini-grounded", "brave", "ddg"],
     "scrapeTop": 3
   },
   "cache": {
-    "path": "./data/cache.db",
-    "searchTtl": 1,
-    "scrapeTtl": 24
+    "searchTtl": 3600,
+    "scrapeTtl": 86400
   },
   "proxy": {
     "localPort": 8080,
     "strategy": "round-robin",
     "upstreams": [
-      { "url": "http://user:pass@proxy1.example.com:8080" }
+      { "url": "http://user:pass@proxy.example.com:8080" }
     ]
   },
-  "camoufox": {
-    "url": "http://localhost:9869"
-  },
   "rateLimit": {
-    "x": { "postsPerHour": 3, "minDelayMs": 60000 },
+    "x": { "postsPerHour": 3 },
     "reddit": { "postsPerHour": 5 }
   }
 }
@@ -180,15 +231,14 @@ Configure the cascade in `spectrawl.json`:
 ## Environment Variables
 ```
-BRAVE_API_KEY       Brave Search API key
+GEMINI_API_KEY      Gemini API key (free — primary search + summarization)
+BRAVE_API_KEY       Brave Search API key (2,000 free/month)
 SERPER_API_KEY      Serper.dev API key
 GOOGLE_CSE_KEY      Google Custom Search API key
 GOOGLE_CSE_CX       Google Custom Search engine ID
 JINA_API_KEY        Jina Reader API key (optional)
-SEARXNG_URL         SearXNG instance URL (default: http://localhost:8888)
-CAMOUFOX_URL        Remote Camoufox service URL
-OPENAI_API_KEY      For LLM summarization
-ANTHROPIC_API_KEY   For LLM summarization
+OPENAI_API_KEY      For LLM summarization (alternative to Gemini)
+ANTHROPIC_API_KEY   For LLM summarization (alternative to Gemini)
 ```
 ## License
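
Note on the `cache` hunk above: `searchTtl` goes from `1` to `3600` and `scrapeTtl` from `24` to `86400`. That is consistent with the unit moving from hours to seconds, although the diff itself does not say so. Under that assumption, a config written for 0.3.6 could be converted with a small helper like the one below. `migrateCacheTtl` is a hypothetical name used for illustration and is not part of spectrawl.

```javascript
// Hypothetical helper, not part of spectrawl.
// Assumption: 0.3.6 read cache TTLs in hours and 0.3.8 reads them in seconds
// (inferred only from 1 -> 3600 and 24 -> 86400 in the README diff).
const SECONDS_PER_HOUR = 3600

function migrateCacheTtl(oldCache = {}) {
  // Copy every key so unrelated settings (e.g. `path`) pass through untouched.
  const migrated = { ...oldCache }
  for (const key of ['searchTtl', 'scrapeTtl']) {
    if (typeof oldCache[key] === 'number') {
      migrated[key] = oldCache[key] * SECONDS_PER_HOUR
    }
  }
  return migrated
}

console.log(migrateCacheTtl({ path: './data/cache.db', searchTtl: 1, scrapeTtl: 24 }))
```

If the unit did not actually change, the new defaults would instead mean much longer cache lifetimes, so it is worth confirming against `src/` before migrating a real config.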

package/index.d.ts CHANGED

@@ -1,90 +1,108 @@
 declare module 'spectrawl' {
-  export interface SearchResult {
-    sources: Array<{
-      url: string
-      title: string
-      snippet?: string
-      content?: string
-    }>
-    answer?: string
-    engine: string
-    cached: boolean
+  interface SpectrawlConfig {
+    search?: {
+      cascade?: string[]
+      scrapeTop?: number
+      geminiKey?: string
+      'gemini-grounded'?: { apiKey?: string; model?: string }
+      llm?: { provider: string; model?: string; apiKey?: string }
+      sourceRanker?: {
+        weights?: Record<string, number>
+        boost?: string[]
+        block?: string[]
+      }
+    }
+    browse?: {
+      defaultEngine?: string
+      proxy?: { type: string; host: string; port: number; username?: string; password?: string }
+      humanlike?: { minDelay?: number; maxDelay?: number; scrollBehavior?: boolean }
+    }
+    auth?: {
+      refreshInterval?: string
+      cookieStore?: string
+    }
+    cache?: {
+      path?: string
+      searchTtl?: number
+      scrapeTtl?: number
+      screenshotTtl?: number
+    }
+    rateLimit?: Record<string, { postsPerHour?: number; minDelayMs?: number }>
+    proxy?: {
+      localPort?: number
+      strategy?: 'round-robin' | 'random' | 'least-used'
+      upstreams?: { url: string }[]
+    }
   }
-  export interface BrowseResult {
-    content?: string
-    html?: string
-    screenshot?: Buffer
-    url: string
+  interface SearchResult {
     title: string
-    engine: 'stealth-playwright' | 'camoufox' | 'remote-camoufox' | 'playwright'
+    url: string
+    snippet: string
+    content?: string
+    score?: number
+    engine?: string
+  }
+  interface SearchResponse {
+    answer: string | null
+    sources: SearchResult[]
     cached: boolean
-    cookies?: any[]
   }
-  export interface ActResult {
-    success: boolean
-    error?: string
-    detail?: string
-    suggestion?: string
-    retryAfter?: number
-    url?: string
-    [key: string]: any
+  interface DeepSearchResponse {
+    answer: string | null
+    sources: SearchResult[]
+    queries: string[]
+    cached: boolean
   }
-  export interface AccountStatus {
+  interface DeepSearchOptions {
+    mode?: 'fast' | 'full'
+    scrapeTop?: number
+    expand?: boolean
+    rerank?: boolean
+  }
+  interface BrowseResult {
+    content: string
+    text?: string
+    screenshot?: Buffer
+    engine: string
+    url: string
+  }
+  interface AuthStatus {
     platform: string
     account: string
-    status: 'valid' | 'expiring' | 'expired' | 'unknown'
+    status: 'valid' | 'expired' | 'unknown'
     expiresAt?: string
-    cookieCount?: number
   }
-  export interface SearchOptions {
-    summarize?: boolean
-    minResults?: number
-    noCache?: boolean
-  }
+  class Spectrawl {
+    constructor(config?: SpectrawlConfig | string)
-  export interface BrowseOptions {
-    screenshot?: boolean
-    fullPage?: boolean
-    html?: boolean
-    extract?: boolean
-    stealth?: boolean
-    camoufox?: boolean
-    noCache?: boolean
-    saveCookies?: boolean
-    _cookies?: any[]
-  }
+    /** Basic search — raw results from cascade engines */
+    search(query: string, opts?: { summarize?: boolean; scrapeTop?: number; engines?: string[] }): Promise<SearchResponse>
-  export interface ActParams {
-    text?: string
-    title?: string
-    body?: string
-    account?: string
-    group?: string
-    postUrl?: string
-    tweetId?: string
-    mediaIds?: string[]
-    _cookies?: any[]
-    [key: string]: any
-  }
+    /** Deep search — Tavily-equivalent with citations. Set GEMINI_API_KEY for best results. */
+    deepSearch(query: string, opts?: DeepSearchOptions): Promise<DeepSearchResponse>
+    /** Browse a URL with stealth anti-detection */
+    browse(url: string, opts?: { screenshot?: boolean; timeout?: number; extractText?: boolean }): Promise<BrowseResult>
+    /** Act on a platform (post, comment, submit) */
+    act(platform: string, action: string, params: Record<string, any>): Promise<any>
-  export class Spectrawl {
-    constructor(config?: any)
-    search(query: string, opts?: SearchOptions): Promise<SearchResult>
-    browse(url: string, opts?: BrowseOptions): Promise<BrowseResult>
-    act(platform: string, action: string, params?: ActParams): Promise<ActResult>
-    status(): Promise<AccountStatus[]>
+    /** Check auth health for all configured accounts */
+    status(): Promise<AuthStatus[]>
+    /** Get raw Playwright page for custom automation */
+    getPage(url: string, opts?: any): Promise<any>
+    /** Close all connections */
     close(): Promise<void>
   }
-  export function installStealth(): Promise<{
-    path: string
-    binary?: string
-    version: string
-  }>
-  export function isStealthInstalled(): boolean
+  export { Spectrawl, SpectrawlConfig, SearchResult, SearchResponse, DeepSearchResponse, DeepSearchOptions, BrowseResult, AuthStatus }
 }
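
Note on the typings above: in 0.3.8 both `SearchResponse` and `DeepSearchResponse` declare `answer: string | null`, where 0.3.6 had an optional `answer?: string`, so callers may need an explicit null check. The sketch below shows one way a consumer might handle that. `renderAnswer` and the sample object are hypothetical; only the field names (`answer`, `sources`, `title`, `url`, `queries`, `cached`) come from the declarations in this diff.

```javascript
// Hypothetical consumer code; it does not import or call spectrawl.
function renderAnswer(response) {
  // `answer` is typed `string | null`, so fall back when no summary was produced.
  const heading = response.answer ?? '(no AI answer; showing raw sources)'
  // Number the sources in the same [1] [2] style the README describes for citations.
  const citations = response.sources.map(
    (source, i) => `[${i + 1}] ${source.title} <${source.url}>`
  )
  return [heading, ...citations].join('\n')
}

// Stand-in object shaped like DeepSearchResponse; no network call is made.
const sample = {
  answer: null,
  sources: [{ title: 'Example', url: 'https://example.com', snippet: 'An example page' }],
  queries: ['example'],
  cached: false
}

console.log(renderAnswer(sample))
```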

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "spectrawl",
-  "version": "0.3.6",
+  "version": "0.3.8",
   "description": "The unified web layer for AI agents. Search (6 engines), stealth browse (Camoufox + Playwright), auth (cookies, multi-account), act (24 adapters, 30+ platforms), proxy rotation. Self-hosted, free.",
   "main": "src/index.js",
   "types": "index.d.ts",

package/src/config.js CHANGED

@@ -4,7 +4,7 @@ const path = require('path')
 const DEFAULTS = {
   port: 3900,
   search: {
-    cascade: ['searxng', 'ddg', 'brave', 'serper'],
+    cascade: ['gemini-grounded', 'brave', 'ddg'],
     scrapeTop: 3,
     searxng: { url: 'http://localhost:8888' },
     llm: null // { provider, model, apiKey }
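
Note on the `cascade` default above: the diff changes only the list of engine names and does not show the code that consumes it. The README describes the engines as primary, fallback, and last resort, which suggests a try-in-order loop. The sketch below illustrates that general pattern under that assumption. `runCascade` and the stub engines are hypothetical and are not spectrawl's implementation.

```javascript
// Sketch only: a generic "try each engine in order" fallback loop.
// `runCascade` and `stubEngines` are hypothetical names for illustration.
async function runCascade(cascade, engines, query) {
  const errors = []
  for (const name of cascade) {
    const engine = engines[name]
    if (!engine) {
      errors.push(`${name}: not configured`)
      continue
    }
    try {
      const results = await engine(query)
      // The first engine that returns anything wins; later ones are skipped.
      if (results.length > 0) return { engine: name, results, errors }
      errors.push(`${name}: no results`)
    } catch (err) {
      // A failing engine (missing key, rate limit) falls through to the next.
      errors.push(`${name}: ${err.message}`)
    }
  }
  return { engine: null, results: [], errors }
}

// Stub engines standing in for real search backends; no network calls are made.
const stubEngines = {
  'gemini-grounded': async () => { throw new Error('missing GEMINI_API_KEY') },
  brave: async () => [],
  ddg: async (query) => [{ title: `Result for ${query}`, url: 'https://example.com' }]
}

runCascade(['gemini-grounded', 'brave', 'ddg'], stubEngines, 'node streams')
  .then((outcome) => console.log(outcome.engine, outcome.errors))
```

Collecting the per-engine errors, rather than discarding them, makes it easier to see why a query ended up on the last-resort engine.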