spectrawl 0.3.6 → 0.3.8

package/README.md CHANGED
@@ -2,6 +2,8 @@
 
  The unified web layer for AI agents. Search, browse, authenticate, and act on platforms — one tool, self-hosted, free.
 
+ **Free Tavily alternative** with Google-quality results via Gemini Grounded Search.
+
  ## What It Does
 
  AI agents need to interact with the web. That means searching, browsing pages, logging into platforms, and posting content. Today you duct-tape together Playwright + Tavily + cookie managers + platform-specific scripts. Spectrawl replaces all of that.
@@ -10,123 +12,121 @@ AI agents need to interact with the web. That means searching, browsing pages, l
  npm install spectrawl
  ```
 
- **Search** — 6 engines in a cascade: SearXNG → DuckDuckGo → Brave → Serper → Google CSE → Jina. Tries free/unlimited first, falls through to quota-based. Dual scraping (Jina Reader + readability). Optional LLM summarization.
-
- **Browse** — Stealth browsing with anti-detection out of the box. Three tiers:
-   1. `playwright-extra` + stealth plugin (default, works immediately)
-   2. Camoufox binary — engine-level anti-fingerprint (`npx spectrawl install-stealth`)
-   3. Remote Camoufox service (for existing deployments)
-
- **Auth** — Persistent cookie storage (SQLite), multi-account management, automatic cookie refresh, expiry alerts.
-
- **Act** — 24 platform adapters covering 30+ sites:
-   - **Content platforms:** X, Reddit, LinkedIn, Dev.to, Hashnode, IndieHackers, Medium, Hacker News, Quora
-   - **Developer:** GitHub (repos, issues, releases), HuggingFace (models, datasets), Discord (bot + webhooks)
-   - **Launch/SEO:** Product Hunt, BetaList, AlternativeTo, SaaSHub, DevHunt, AppSumo
-   - **Directories:** Generic adapter for MicroLaunch, Uneed, Peerlist, Fazier, BetaPage, LaunchingNext, StartupStash, SideProjectors, TAIFT, Futurepedia, Crunchbase, G2, StackShare, YouTube
-   - Rate limiting, content dedup, dead letter queue for retries.
-
- **Proxy** — Rotating proxy server. One endpoint (`localhost:8080`) for all your tools. Round-robin, random, or least-used strategies. Health checking with auto-failover.
-
  ## Quick Start
 
  ```bash
  npm install spectrawl
- npx spectrawl init # create spectrawl.json config
- npx spectrawl search "your query"
+ export GEMINI_API_KEY=your-free-key # Get one at aistudio.google.com
  ```
 
- ### As a Library
-
  ```js
  const { Spectrawl } = require('spectrawl')
  const web = new Spectrawl()
 
- // Search
- const results = await web.search('best practices for node.js APIs')
- console.log(results.sources) // [{ url, title, snippet, content }]
- console.log(results.answer) // LLM summary (if configured)
+ // Deep search — like Tavily but free
+ const result = await web.deepSearch('best AI agent frameworks 2025')
+ console.log(result.answer) // AI-generated answer with citations
+ console.log(result.sources) // [{ title, url, content, score }]
 
- // Browse with stealth
- const page = await web.browse('https://example.com')
- console.log(page.content) // extracted text
- console.log(page.engine) // 'stealth-playwright' or 'camoufox'
-
- // Act on platforms
- await web.act('x', 'post', {
-   text: 'Hello from Spectrawl',
-   account: '@myhandle'
- })
+ // Fast mode — snippets only, ~6s
+ const fast = await web.deepSearch('query', { mode: 'fast' })
 
- // Check auth health
- const accounts = await web.status()
- // [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
+ // Basic search — raw results, no AI
+ const basic = await web.search('query')
  ```
 
- ### HTTP Server
+ ### vs Tavily
 
- ```bash
- npx spectrawl serve --port 3900
- ```
+ | | Tavily | Spectrawl |
+ |---|---|---|
+ | Speed | ~2s | ~6-9s |
+ | Search quality | Google index | Google via Gemini ✅ |
+ | Results per query | 10 | 12-16 ✅ |
+ | Citations | ✅ | ✅ |
+ | Cost | $0.01/query | **Free** ✅ |
+ | Self-hosted | No | **Yes** ✅ |
+ | Stealth scraping | No | **Yes** ✅ |
+ | Auth + posting | No | **24 adapters** ✅ |
+ | Cached repeats | No | **<1ms** ✅ |
 
- ```
- POST /search { "query": "...", "summarize": true }
- POST /browse { "url": "...", "screenshot": true }
- POST /act { "platform": "x", "action": "post", "params": { "text": "..." } }
- GET /status
- GET /health
- ```
+ ## Search
 
- ### MCP Server
+ Default cascade: **Gemini Grounded → Brave → DDG**
 
- Works with any MCP-compatible agent framework:
+ Gemini Grounded Search gives you Google-quality results through the Gemini API. Free tier: 5,000 grounded queries/month.
 
- ```bash
- npx spectrawl mcp
+ | Engine | Free Tier | Key Required | Default |
+ |--------|-----------|-------------|---------|
+ | **Gemini Grounded** | 5,000/month | `GEMINI_API_KEY` | ✅ Primary |
+ | Brave | 2,000/month | `BRAVE_API_KEY` | ✅ Fallback |
+ | DuckDuckGo | Unlimited | None | ✅ Last resort |
+ | Bing | Unlimited | None | Available |
+ | Serper | 2,500 trial | `SERPER_API_KEY` | Available |
+ | Google CSE | 100/day | `GOOGLE_CSE_KEY` | Available |
+ | Jina Reader | Unlimited | None | Available |
+ | SearXNG | Unlimited | Self-hosted | Available |
+
+ ### Deep Search Pipeline
+
+ ```
+ Query → Gemini Grounded + DDG (parallel)
+   → Merge & deduplicate (12-16 results)
+   → Source quality ranking (boost GitHub/SO/Reddit, penalize SEO spam)
+   → Parallel scraping (Jina → readability → Playwright fallback)
+   → AI summarization with [1] [2] citations
  ```
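The "Merge & deduplicate" step in the pipeline above can be illustrated with a small sketch. This assumes results are plain objects with a `url` field and that two URLs count as duplicates when they differ only by scheme, a `www.` prefix, a trailing slash, or a fragment. Both `normalizeUrl` and `mergeResults` are hypothetical helpers, and the package may well use different rules.

```javascript
// Hypothetical sketch of merging result lists from two engines; names are illustrative.
function normalizeUrl(raw) {
  const u = new URL(raw)
  // Ignore scheme, "www.", trailing slashes, and fragments when comparing pages.
  const host = u.hostname.replace(/^www\./, '')
  const path = u.pathname.replace(/\/+$/, '')
  return host + path + u.search
}

function mergeResults(...resultLists) {
  const seen = new Map()
  for (const list of resultLists) {
    for (const item of list) {
      const key = normalizeUrl(item.url)
      // First occurrence wins, so earlier lists take priority.
      if (!seen.has(key)) seen.set(key, item)
    }
  }
  return [...seen.values()]
}
```

In an async flow the input lists could come from `Promise.all([engineA(query), engineB(query)])`, which matches the "(parallel)" note in the pipeline diagram.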
 
- Exposes 5 tools: `web_search`, `web_browse`, `web_act`, `web_auth`, `web_status`.
+ ### What you get without any keys
 
- ## Stealth Browsing
+ DDG-only search, raw results, no AI answer. Works from home IPs. Datacenter IPs get rate-limited by DDG — recommend at minimum a free Gemini key.
 
- Default: `playwright-extra` with stealth plugin patches webdriver detection, navigator properties, canvas/WebGL fingerprinting, and plugin enumeration. Works for ~90% of sites.
+ ## Browse
 
- For deeper anti-detection:
+ Stealth browsing with anti-detection. Three tiers (auto-detected):
 
- ```bash
- npx spectrawl install-stealth
+ 1. **playwright-extra + stealth plugin** — default, works immediately
+ 2. **Camoufox binary** — engine-level anti-fingerprint (`npx spectrawl install-stealth`)
+ 3. **Remote Camoufox** — for existing deployments
+
+ ```js
+ const page = await web.browse('https://example.com')
+ console.log(page.content) // extracted text/markdown
+ console.log(page.screenshot) // PNG buffer (if requested)
+
+ // With screenshot
+ const page = await web.browse('https://example.com', { screenshot: true })
  ```
 
- Downloads the [Camoufox](https://github.com/daijro/camoufox) browser a patched Firefox with engine-level anti-fingerprint. Spectrawl auto-detects and uses it.
+ Auto-fallback: if Jina and readability return too little content (<200 chars), Spectrawl renders the page with Playwright and extracts from the rendered DOM. Tavily can't do this — they fail on JS-heavy pages.
 
- ## Search Engines
+ ## Auth
 
- | Engine | Free Tier | Default |
- |--------|-----------|---------|
- | SearXNG | Unlimited (self-hosted) | ✅ |
- | DuckDuckGo | Unlimited | ✅ |
- | Brave | 2,000/month | ✅ |
- | Serper | 2,500/month | Fallback |
- | Google CSE | 100/day | Fallback |
- | Jina Reader | Unlimited | Fallback |
+ Persistent cookie storage (SQLite), multi-account management, automatic refresh.
 
- Configure the cascade in `spectrawl.json`:
+ ```js
+ // Store cookies
+ await web.auth.setCookies('x', '@myhandle', cookies)
 
- ```json
- {
-   "search": {
-     "cascade": ["searxng", "ddg", "brave", "serper", "google-cse", "jina"]
-   }
- }
+ // Check health
+ const accounts = await web.status()
+ // [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
  ```
 
- ## Platform Adapters
+ ## Act — 24 Platform Adapters
+
+ Post to 30+ platforms with one API:
+
+ ```js
+ await web.act('x', 'post', { text: 'Hello from Spectrawl', account: '@myhandle' })
+ await web.act('reddit', 'post', { subreddit: 'node', title: '...', text: '...' })
+ await web.act('github', 'create-repo', { name: 'my-repo', description: '...' })
+ ```
 
  | Platform | Auth Method | Actions |
  |----------|-------------|---------|
  | X/Twitter | GraphQL Cookie + OAuth 1.0a | post |
  | Reddit | Cookie API (oauth.reddit.com) | post, comment |
- | Dev.to | REST API (API key) | post |
+ | Dev.to | REST API | post |
  | Hashnode | GraphQL API | post |
  | LinkedIn | Cookie API (Voyager) | post |
  | IndieHackers | Browser automation | post, comment, upvote |
@@ -135,14 +135,70 @@ Configure the cascade in `spectrawl.json`:
  | Discord | Bot API + webhooks | send, thread |
  | Product Hunt | GraphQL v2 | launch, comment, upvote |
  | Hacker News | Cookie/form POST | submit, comment, upvote |
- | YouTube | Data API v3 | comment, playlist, update |
+ | YouTube | Data API v3 | comment, playlist |
  | Quora | Browser automation | answer, question |
  | HuggingFace | Hub API | repo, model card, upload |
  | BetaList | REST API | submit |
  | AlternativeTo | Browser automation | submit |
  | SaaSHub | Browser automation | submit |
  | DevHunt | Browser automation | submit |
- | **30+ Directories** | Generic adapter | submit (MicroLaunch, Uneed, TAIFT, Futurepedia, Crunchbase, G2, etc.) |
+ | **14 Directories** | Generic adapter | submit |
+
+ Built-in rate limiting, content dedup (MD5, 24h window), and dead letter queue for retries.
+
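The "content dedup (MD5, 24h window)" mentioned in the added README line can be sketched with Node's built-in `crypto` module. This is only an illustration under stated assumptions: the `ContentDeduper` class, the whitespace and case normalization, and the in-memory `Map` are hypothetical choices, and the package may persist fingerprints differently.

```javascript
const crypto = require('node:crypto')

// Hypothetical sketch of content dedup: MD5 fingerprint plus a 24-hour window.
const DEDUP_WINDOW_MS = 24 * 60 * 60 * 1000

class ContentDeduper {
  constructor(windowMs = DEDUP_WINDOW_MS) {
    this.windowMs = windowMs
    this.seen = new Map() // fingerprint -> timestamp (ms) of the last accepted post
  }

  fingerprint(platform, text) {
    // Normalize whitespace and case so trivial edits hash identically (an assumption).
    const normalized = text.trim().replace(/\s+/g, ' ').toLowerCase()
    return crypto.createHash('md5').update(platform + '\n' + normalized).digest('hex')
  }

  // Returns true and records the post if it is new within the window; false otherwise.
  accept(platform, text, now = Date.now()) {
    const key = this.fingerprint(platform, text)
    const last = this.seen.get(key)
    if (last !== undefined && now - last < this.windowMs) return false
    this.seen.set(key, now)
    return true
  }
}
```

MD5 is used here only as a cheap fingerprint for equality checks, not for anything security-sensitive. Passing `now` explicitly keeps the window logic easy to test without waiting on the clock.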
+ ## Source Quality Ranking
+
+ Spectrawl ranks results by domain trust — something Tavily doesn't do:
+
+ - **Boosted:** GitHub, StackOverflow, HN, Reddit, MDN, arxiv, Wikipedia
+ - **Penalized:** SEO farms, thin content sites, tag/category pages
+ - **Customizable:** bring your own domain weights
+
+ ```js
+ const web = new Spectrawl({
+   search: {
+     sourceRanker: {
+       boost: ['github.com', 'news.ycombinator.com'],
+       block: ['spamsite.com']
+     }
+   }
+ })
+ ```
+
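To make the `boost` / `block` / `weights` options above more concrete, here is a rough sketch of how domain-based ranking could work. The option names mirror the `sourceRanker` config shown in the diff, but the scoring formula, the default weights, and the helper functions are assumptions made for illustration and are not taken from the package.

```javascript
// Hypothetical sketch of domain-trust ranking; weights and formula are illustrative.
const DEFAULT_WEIGHTS = { 'github.com': 1.5, 'stackoverflow.com': 1.4, 'wikipedia.org': 1.2 }

function domainOf(url) {
  return new URL(url).hostname.replace(/^www\./, '')
}

// A pattern matches the domain itself and any of its subdomains.
function matches(domain, pattern) {
  return domain === pattern || domain.endsWith('.' + pattern)
}

function rankSources(results, { weights = DEFAULT_WEIGHTS, boost = [], block = [] } = {}) {
  return results
    // Blocked domains are dropped entirely.
    .filter((r) => !block.some((b) => matches(domainOf(r.url), b)))
    .map((r) => {
      const domain = domainOf(r.url)
      let factor = 1
      // If several weight patterns match, the last matching entry wins.
      for (const [pattern, w] of Object.entries(weights)) {
        if (matches(domain, pattern)) factor = w
      }
      // User-supplied boosts multiply on top of the base weight (assumed 1.5x).
      if (boost.some((b) => matches(domain, b))) factor *= 1.5
      return { ...r, score: (r.score ?? 1) * factor }
    })
    .sort((a, b) => b.score - a.score)
}
```

Multiplying an existing relevance score by a per-domain factor keeps the engine's own ordering as the baseline and lets trust act as a tiebreaker or nudge, which is one plausible reading of "boost" and "penalize".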
+ ## HTTP Server
+
+ ```bash
+ npx spectrawl serve --port 3900
+ ```
+
+ ```
+ POST /search { "query": "...", "summarize": true }
+ POST /browse { "url": "...", "screenshot": true }
+ POST /act { "platform": "x", "action": "post", "params": { ... } }
+ GET /status
+ GET /health
+ ```
+
+ ## MCP Server
+
+ Works with any MCP-compatible agent framework (Claude, OpenAI, etc.):
+
+ ```bash
+ npx spectrawl mcp
+ ```
+
+ 5 tools: `web_search`, `web_browse`, `web_act`, `web_auth`, `web_status`.
+
+ ## CLI
+
+ ```bash
+ npx spectrawl init # create spectrawl.json
+ npx spectrawl search "query" # search from terminal
+ npx spectrawl status # check auth health
+ npx spectrawl serve # start HTTP server
+ npx spectrawl mcp # start MCP server
+ npx spectrawl install-stealth # download Camoufox browser
+ ```
 
  ## Configuration
 
@@ -150,28 +206,23 @@ Configure the cascade in `spectrawl.json`:
 
  ```json
  {
-   "port": 3900,
    "search": {
-     "cascade": ["ddg", "brave"],
+     "cascade": ["gemini-grounded", "brave", "ddg"],
      "scrapeTop": 3
    },
    "cache": {
-     "path": "./data/cache.db",
-     "searchTtl": 1,
-     "scrapeTtl": 24
+     "searchTtl": 3600,
+     "scrapeTtl": 86400
    },
    "proxy": {
      "localPort": 8080,
      "strategy": "round-robin",
      "upstreams": [
-       { "url": "http://user:pass@proxy1.example.com:8080" }
+       { "url": "http://user:pass@proxy.example.com:8080" }
      ]
    },
-   "camoufox": {
-     "url": "http://localhost:9869"
-   },
    "rateLimit": {
-     "x": { "postsPerHour": 3, "minDelayMs": 60000 },
+     "x": { "postsPerHour": 3 },
      "reddit": { "postsPerHour": 5 }
    }
  }
@@ -180,15 +231,14 @@ Configure the cascade in `spectrawl.json`:
  ## Environment Variables
 
  ```
- BRAVE_API_KEY      Brave Search API key
+ GEMINI_API_KEY     Gemini API key (free — primary search + summarization)
+ BRAVE_API_KEY      Brave Search API key (2,000 free/month)
  SERPER_API_KEY     Serper.dev API key
  GOOGLE_CSE_KEY     Google Custom Search API key
  GOOGLE_CSE_CX      Google Custom Search engine ID
  JINA_API_KEY       Jina Reader API key (optional)
- SEARXNG_URL        SearXNG instance URL (default: http://localhost:8888)
- CAMOUFOX_URL       Remote Camoufox service URL
- OPENAI_API_KEY     For LLM summarization
- ANTHROPIC_API_KEY  For LLM summarization
+ OPENAI_API_KEY     For LLM summarization (alternative to Gemini)
+ ANTHROPIC_API_KEY  For LLM summarization (alternative to Gemini)
  ```
 
  ## License
package/index.d.ts CHANGED
@@ -1,90 +1,108 @@
  declare module 'spectrawl' {
-   export interface SearchResult {
-     sources: Array<{
-       url: string
-       title: string
-       snippet?: string
-       content?: string
-     }>
-     answer?: string
-     engine: string
-     cached: boolean
+   interface SpectrawlConfig {
+     search?: {
+       cascade?: string[]
+       scrapeTop?: number
+       geminiKey?: string
+       'gemini-grounded'?: { apiKey?: string; model?: string }
+       llm?: { provider: string; model?: string; apiKey?: string }
+       sourceRanker?: {
+         weights?: Record<string, number>
+         boost?: string[]
+         block?: string[]
+       }
+     }
+     browse?: {
+       defaultEngine?: string
+       proxy?: { type: string; host: string; port: number; username?: string; password?: string }
+       humanlike?: { minDelay?: number; maxDelay?: number; scrollBehavior?: boolean }
+     }
+     auth?: {
+       refreshInterval?: string
+       cookieStore?: string
+     }
+     cache?: {
+       path?: string
+       searchTtl?: number
+       scrapeTtl?: number
+       screenshotTtl?: number
+     }
+     rateLimit?: Record<string, { postsPerHour?: number; minDelayMs?: number }>
+     proxy?: {
+       localPort?: number
+       strategy?: 'round-robin' | 'random' | 'least-used'
+       upstreams?: { url: string }[]
+     }
    }
 
-   export interface BrowseResult {
-     content?: string
-     html?: string
-     screenshot?: Buffer
-     url: string
+   interface SearchResult {
      title: string
-     engine: 'stealth-playwright' | 'camoufox' | 'remote-camoufox' | 'playwright'
+     url: string
+     snippet: string
+     content?: string
+     score?: number
+     engine?: string
+   }
+
+   interface SearchResponse {
+     answer: string | null
+     sources: SearchResult[]
      cached: boolean
-     cookies?: any[]
    }
 
-   export interface ActResult {
-     success: boolean
-     error?: string
-     detail?: string
-     suggestion?: string
-     retryAfter?: number
-     url?: string
-     [key: string]: any
+   interface DeepSearchResponse {
+     answer: string | null
+     sources: SearchResult[]
+     queries: string[]
+     cached: boolean
    }
 
-   export interface AccountStatus {
+   interface DeepSearchOptions {
+     mode?: 'fast' | 'full'
+     scrapeTop?: number
+     expand?: boolean
+     rerank?: boolean
+   }
+
+   interface BrowseResult {
+     content: string
+     text?: string
+     screenshot?: Buffer
+     engine: string
+     url: string
+   }
+
+   interface AuthStatus {
      platform: string
      account: string
-     status: 'valid' | 'expiring' | 'expired' | 'unknown'
+     status: 'valid' | 'expired' | 'unknown'
      expiresAt?: string
-     cookieCount?: number
    }
 
-   export interface SearchOptions {
-     summarize?: boolean
-     minResults?: number
-     noCache?: boolean
-   }
+   class Spectrawl {
+     constructor(config?: SpectrawlConfig | string)
 
-   export interface BrowseOptions {
-     screenshot?: boolean
-     fullPage?: boolean
-     html?: boolean
-     extract?: boolean
-     stealth?: boolean
-     camoufox?: boolean
-     noCache?: boolean
-     saveCookies?: boolean
-     _cookies?: any[]
-   }
+     /** Basic search — raw results from cascade engines */
+     search(query: string, opts?: { summarize?: boolean; scrapeTop?: number; engines?: string[] }): Promise<SearchResponse>
 
-   export interface ActParams {
-     text?: string
-     title?: string
-     body?: string
-     account?: string
-     group?: string
-     postUrl?: string
-     tweetId?: string
-     mediaIds?: string[]
-     _cookies?: any[]
-     [key: string]: any
-   }
+     /** Deep search — Tavily-equivalent with citations. Set GEMINI_API_KEY for best results. */
+     deepSearch(query: string, opts?: DeepSearchOptions): Promise<DeepSearchResponse>
+
+     /** Browse a URL with stealth anti-detection */
+     browse(url: string, opts?: { screenshot?: boolean; timeout?: number; extractText?: boolean }): Promise<BrowseResult>
+
+     /** Act on a platform (post, comment, submit) */
+     act(platform: string, action: string, params: Record<string, any>): Promise<any>
 
-   export class Spectrawl {
-     constructor(config?: any)
-     search(query: string, opts?: SearchOptions): Promise<SearchResult>
-     browse(url: string, opts?: BrowseOptions): Promise<BrowseResult>
-     act(platform: string, action: string, params?: ActParams): Promise<ActResult>
-     status(): Promise<AccountStatus[]>
+     /** Check auth health for all configured accounts */
+     status(): Promise<AuthStatus[]>
+
+     /** Get raw Playwright page for custom automation */
+     getPage(url: string, opts?: any): Promise<any>
+
+     /** Close all connections */
      close(): Promise<void>
    }
 
-   export function installStealth(): Promise<{
-     path: string
-     binary?: string
-     version: string
-   }>
-
-   export function isStealthInstalled(): boolean
+   export { Spectrawl, SpectrawlConfig, SearchResult, SearchResponse, DeepSearchResponse, DeepSearchOptions, BrowseResult, AuthStatus }
  }
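The new `SpectrawlConfig.proxy.strategy` type declared in this file allows `'round-robin'`, `'random'`, and `'least-used'`. As a rough illustration of what those three selection strategies usually mean, here is a small sketch. `createSelector` is a hypothetical helper and not part of the package's API, and the real proxy server presumably also handles the health checking that the README mentions.

```javascript
// Hypothetical sketch of the three upstream-selection strategies named in the typings.
function createSelector(upstreams, strategy = 'round-robin', random = Math.random) {
  if (upstreams.length === 0) throw new Error('no upstreams configured')
  const uses = new Map(upstreams.map((u) => [u.url, 0]))
  let cursor = 0
  return function next() {
    let chosen
    if (strategy === 'round-robin') {
      // Cycle through upstreams in declaration order.
      chosen = upstreams[cursor % upstreams.length]
      cursor += 1
    } else if (strategy === 'random') {
      chosen = upstreams[Math.floor(random() * upstreams.length)]
    } else if (strategy === 'least-used') {
      // Pick the upstream with the fewest selections; ties go to the earliest entry.
      chosen = upstreams.reduce((a, b) => (uses.get(b.url) < uses.get(a.url) ? b : a))
    } else {
      throw new Error('unknown strategy: ' + strategy)
    }
    uses.set(chosen.url, uses.get(chosen.url) + 1)
    return chosen
  }
}
```

The `random` parameter is injectable so the random strategy can be exercised deterministically in tests.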
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "spectrawl",
-   "version": "0.3.6",
+   "version": "0.3.8",
    "description": "The unified web layer for AI agents. Search (6 engines), stealth browse (Camoufox + Playwright), auth (cookies, multi-account), act (24 adapters, 30+ platforms), proxy rotation. Self-hosted, free.",
    "main": "src/index.js",
    "types": "index.d.ts",
package/src/config.js CHANGED
@@ -4,7 +4,7 @@ const path = require('path')
  const DEFAULTS = {
    port: 3900,
    search: {
-     cascade: ['searxng', 'ddg', 'brave', 'serper'],
+     cascade: ['gemini-grounded', 'brave', 'ddg'],
      scrapeTop: 3,
      searxng: { url: 'http://localhost:8888' },
      llm: null // { provider, model, apiKey }