spectrawl 0.4.3 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +578 -67
- package/package.json +1 -1
- package/src/agent.js +295 -0
- package/src/browse/index.js +125 -0
- package/src/crawl.js +138 -3
- package/src/extract.js +314 -0
- package/src/index.js +35 -0
- package/src/server.js +69 -12
package/README.md
CHANGED
@@ -1,12 +1,12 @@
# Spectrawl

The unified web layer for AI agents. Search, browse, crawl, extract, and act on platforms — one package, self-hosted.

**5,000 free searches/month** via Gemini Grounded Search. Full page scraping, stealth browsing, multi-page crawling, structured extraction, AI browser agent, 24 platform adapters.

## What It Does

AI agents need to interact with the web — searching, browsing pages, crawling sites, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.

```
npm install spectrawl
```

@@ -44,25 +44,45 @@ const basic = await web.search('query')

> **Why no summary by default?** Your agent already has an LLM. If we summarize AND your agent summarizes, you're paying two LLMs for one answer. We return rich sources — your agent does the rest.

## Spectrawl vs Others

| | Tavily | Crawl4AI | Firecrawl | Stagehand | Spectrawl |
|---|---|---|---|---|---|
| Speed | ~2s | ~5s | ~3s | ~3s | ~6-10s |
| Free tier | 1,000/mo | Unlimited | 500/mo | None | 5,000/mo |
| Returns | Snippets + AI | Markdown | Markdown/JSON | Structured | Full page + structured |
| Self-hosted | No | Yes | Yes | Yes | Yes |
| Anti-detect | No | No | No | No | **Yes (Camoufox)** |
| Block detection | No | No | No | No | **8 services** |
| CAPTCHA solving | No | No | No | No | **Yes (Gemini Vision)** |
| Structured extraction | No | No | No | **Yes** | **Yes** |
| NL browser agent | No | No | No | **Yes** | **Yes** |
| Network capturing | No | Yes | No | No | **Yes** |
| Multi-page crawl | No | Yes | Yes | No | **Yes (+ sitemap)** |
| Platform posting | No | No | No | No | **24 adapters** |
| Auth management | No | No | No | No | **Cookie store + refresh** |

## Search

Two modes: **basic search** and **deep search**.

### Basic Search

```js
const results = await web.search('query')
```

Returns raw search results from the engine cascade. Fast, lightweight.

### Deep Search

```js
const results = await web.deepSearch('query', { summarize: true })
```

Full pipeline: query expansion → parallel search → merge/dedup → rerank → scrape top N → optional AI summary with citations.

### Search Engine Cascade

Default cascade: **Gemini Grounded → Tavily → Brave**

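The cascade pattern — try each engine in order and fall through on failure or thin results — can be sketched like this. A hedged illustration only: the engine functions and `minResults` threshold are stand-ins, not Spectrawl's internals.

```js
// Generic search-engine cascade: fall through on error or empty results.
// Each entry in `engines` is an async function (query) => results[].
async function cascadeSearch(query, engines, minResults = 1) {
  for (const engine of engines) {
    try {
      const results = await engine(query)
      if (results.length >= minResults) return results
      // too few results — try the next engine
    } catch (_) {
      // engine failed (quota, network, ...) — fall through
    }
  }
  return []
}
```

The same shape covers the Gemini Grounded → Tavily → Brave ordering: the first engine that succeeds with enough results wins.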
@@ -96,55 +116,351 @@ DDG-only search, raw results, no AI answer. Works from home IPs. Datacenter IPs
|
|
|
96
116
|
|
|
97
117
|
## Browse
|
|
98
118
|
|
|
99
|
-
Stealth browsing with anti-detection. Three tiers (auto-
|
|
119
|
+
Stealth browsing with anti-detection. Three tiers (auto-escalation):
|
|
100
120
|
|
|
101
|
-
1. **
|
|
121
|
+
1. **Playwright + stealth plugin** — default, works immediately
|
|
102
122
|
2. **Camoufox binary** — engine-level anti-fingerprint (`npx spectrawl install-stealth`)
|
|
103
123
|
3. **Remote Camoufox** — for existing deployments
|
|
104
124
|
|
|
125
|
+
If tier 1 gets blocked, Spectrawl automatically escalates to tier 2 (if installed) or tier 3 (if configured). No manual intervention needed.
|
|
126
|
+
|
|
127
|
+
### Browse Options
|
|
128
|
+
|
|
129
|
+
```js
|
|
130
|
+
const page = await web.browse('https://example.com', {
|
|
131
|
+
screenshot: true, // Take a PNG screenshot
|
|
132
|
+
fullPage: true, // Full page screenshot (not just viewport)
|
|
133
|
+
html: true, // Return raw HTML alongside markdown
|
|
134
|
+
stealth: true, // Force stealth mode
|
|
135
|
+
camoufox: true, // Force Camoufox engine
|
|
136
|
+
noCache: true, // Bypass cache
|
|
137
|
+
auth: 'reddit' // Use stored auth cookies for this platform
|
|
138
|
+
})
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
### Browse Response
|
|
142
|
+
|
|
143
|
+
```js
|
|
144
|
+
{
|
|
145
|
+
content: "# Page Title\n\nExtracted markdown content...",
|
|
146
|
+
url: "https://example.com",
|
|
147
|
+
title: "Page Title",
|
|
148
|
+
statusCode: 200,
|
|
149
|
+
cached: false,
|
|
150
|
+
engine: "camoufox", // which engine was used
|
|
151
|
+
screenshot: Buffer<png>, // PNG buffer (JS) or base64 (HTTP)
|
|
152
|
+
html: "<html>...</html>", // raw HTML (if html: true)
|
|
153
|
+
blocked: false, // true if block page detected
|
|
154
|
+
blockInfo: null // { type: 'cloudflare', detail: '...' }
|
|
155
|
+
}
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
### Block Page Detection
|
|
159
|
+
|
|
160
|
+
Spectrawl detects block/challenge pages from **8 anti-bot services** and reports them in the response instead of returning garbage HTML:
|
|
161
|
+
|
|
162
|
+
- **Cloudflare** (including RFC 9457 structured errors)
|
|
163
|
+
- **Akamai**
|
|
164
|
+
- **AWS WAF**
|
|
165
|
+
- **Imperva / Incapsula**
|
|
166
|
+
- **DataDome**
|
|
167
|
+
- **PerimeterX / HUMAN**
|
|
168
|
+
- **hCaptcha** challenges
|
|
169
|
+
- **reCAPTCHA** challenges
|
|
170
|
+
- Generic bot detection (403, "access denied", etc.)
|
|
171
|
+
|
|
172
|
+
When a block is detected, the response includes `blocked: true` and `blockInfo: { type, detail }`.
|
|
173
|
+
|
|
174
|
+
### CAPTCHA Solving
|
|
175
|
+
|
|
176
|
+
Built-in CAPTCHA solver using **Gemini Vision** (free tier: 1,500 req/day):
|
|
177
|
+
|
|
178
|
+
- ✅ Image CAPTCHAs
|
|
179
|
+
- ✅ Text/math CAPTCHAs
|
|
180
|
+
- ✅ Simple visual challenges
|
|
181
|
+
- ❌ reCAPTCHA v2/v3 (requires token solving services)
|
|
182
|
+
- ❌ hCaptcha (requires token solving services)
|
|
183
|
+
- ❌ Cloudflare Turnstile (requires token solving services)
|
|
184
|
+
|
|
185
|
+
The solver automatically detects CAPTCHA type and attempts resolution before returning the page.
|
|
186
|
+
|
|
187
|
+
## Extract — Structured Data Extraction
|
|
188
|
+
|
|
189
|
+
Pull structured data from any page using LLM + optional CSS/XPath selectors. Like Stagehand's `extract()` but self-hosted and integrated with Spectrawl's anti-detect browsing.
|
|
190
|
+
|
|
191
|
+
### Basic Extraction
|
|
192
|
+
|
|
193
|
+
```js
|
|
194
|
+
const result = await web.extract('https://news.ycombinator.com', {
|
|
195
|
+
instruction: 'Extract the top 3 story titles and their point counts',
|
|
196
|
+
schema: {
|
|
197
|
+
type: 'object',
|
|
198
|
+
properties: {
|
|
199
|
+
stories: {
|
|
200
|
+
type: 'array',
|
|
201
|
+
items: {
|
|
202
|
+
type: 'object',
|
|
203
|
+
properties: {
|
|
204
|
+
title: { type: 'string' },
|
|
205
|
+
points: { type: 'number' }
|
|
206
|
+
}
|
|
207
|
+
}
|
|
208
|
+
}
|
|
209
|
+
}
|
|
210
|
+
}
|
|
211
|
+
})
|
|
212
|
+
// result.data = { stories: [{ title: "...", points: 210 }, ...] }
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
### HTTP API
|
|
216
|
+
|
|
217
|
+
```bash
|
|
218
|
+
curl -X POST http://localhost:3900/extract \
|
|
219
|
+
-H 'Content-Type: application/json' \
|
|
220
|
+
-d '{
|
|
221
|
+
"url": "https://example.com",
|
|
222
|
+
"instruction": "Extract the page title and main heading",
|
|
223
|
+
"schema": {"type": "object", "properties": {"title": {"type": "string"}, "heading": {"type": "string"}}}
|
|
224
|
+
}'
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
Response:
|
|
228
|
+
```json
|
|
229
|
+
{
|
|
230
|
+
"data": { "title": "Example Domain", "heading": "Example Domain" },
|
|
231
|
+
"url": "https://example.com",
|
|
232
|
+
"title": "Example Domain",
|
|
233
|
+
"contentLength": 129,
|
|
234
|
+
"duration": 679
|
|
235
|
+
}
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
### Targeted Extraction with Selectors
|
|
239
|
+
|
|
240
|
+
Narrow extraction scope using CSS or XPath selectors — reduces tokens and improves accuracy:
|
|
241
|
+
|
|
242
|
+
```js
|
|
243
|
+
const result = await web.extract('https://news.ycombinator.com', {
|
|
244
|
+
instruction: 'Extract all story titles',
|
|
245
|
+
selector: '.titleline', // CSS selector
|
|
246
|
+
// or: selector: 'xpath=//table[@class="itemlist"]'
|
|
247
|
+
schema: { type: 'object', properties: { titles: { type: 'array', items: { type: 'string' } } } }
|
|
248
|
+
})
|
|
249
|
+
```
|
|
250
|
+
|
|
251
|
+
### Relevance Filtering (BM25)
|
|
252
|
+
|
|
253
|
+
For large pages, filter content by relevance before sending to the LLM — saves tokens:
|
|
254
|
+
|
|
255
|
+
```js
|
|
256
|
+
const result = await web.extract('https://en.wikipedia.org/wiki/Node.js', {
|
|
257
|
+
instruction: 'Extract the creator and release date',
|
|
258
|
+
relevanceFilter: true // BM25 scoring keeps only relevant sections
|
|
259
|
+
})
|
|
260
|
+
// Content reduced from 50K+ chars to ~2K relevant chars
|
|
261
|
+
```
|
|
262
|
+
|
|
263
|
+
### Extract from Content (No Browsing)
|
|
264
|
+
|
|
265
|
+
Already have the content? Skip the browse step:
|
|
266
|
+
|
|
267
|
+
```js
|
|
268
|
+
const result = await web.extractFromContent(markdownContent, {
|
|
269
|
+
instruction: 'Extract all email addresses',
|
|
270
|
+
schema: { type: 'object', properties: { emails: { type: 'array', items: { type: 'string' } } } }
|
|
271
|
+
})
|
|
272
|
+
```
|
|
273
|
+
|
|
274
|
+
Uses Gemini Flash (free) by default. Falls back to OpenAI if configured.
|
|
275
|
+
|
|
276
|
+
## Agent — Natural Language Browser Actions
|
|
277
|
+
|
|
278
|
+
Control a browser with natural language. Navigate, click, type, scroll — the LLM interprets the page and decides what to do.
|
|
279
|
+
|
|
280
|
+
```js
|
|
281
|
+
const result = await web.agent('https://example.com', 'click the More Information link', {
|
|
282
|
+
maxSteps: 5, // max actions to take
|
|
283
|
+
screenshot: true // screenshot after completion
|
|
284
|
+
})
|
|
285
|
+
// result.success = true
|
|
286
|
+
// result.url = "https://www.iana.org/domains/reserved" (navigated!)
|
|
287
|
+
// result.steps = [{ step: 1, action: "click", elementIdx: 0, result: "clicked" }, ...]
|
|
288
|
+
```
|
|
289
|
+
|
|
290
|
+
### HTTP API
|
|
291
|
+
|
|
292
|
+
```bash
|
|
293
|
+
curl -X POST http://localhost:3900/agent \
|
|
294
|
+
-H 'Content-Type: application/json' \
|
|
295
|
+
-d '{"url": "https://example.com", "instruction": "click the More Information link", "maxSteps": 3}'
|
|
296
|
+
```
|
|
297
|
+
|
|
298
|
+
Response:
|
|
299
|
+
```json
|
|
300
|
+
{
|
|
301
|
+
"success": true,
|
|
302
|
+
"url": "https://www.iana.org/domains/reserved",
|
|
303
|
+
"title": "IANA — Reserved Domains",
|
|
304
|
+
"steps": [
|
|
305
|
+
{ "step": 1, "action": "click", "elementIdx": 0, "reason": "clicking the More Information link", "result": "clicked" }
|
|
306
|
+
],
|
|
307
|
+
"content": "...",
|
|
308
|
+
"duration": 5200
|
|
309
|
+
}
|
|
310
|
+
```
|
|
311
|
+
|
|
312
|
+
### Supported Actions
|
|
313
|
+
|
|
314
|
+
The agent can: **click**, **type** (fill inputs), **select** (dropdowns), **press** (keyboard keys), **scroll** (up/down).
|
|
315
|
+
|
|
316
|
+
## Network Request Capturing
|
|
317
|
+
|
|
318
|
+
Capture XHR/fetch requests made by a page during browsing — useful for discovering hidden APIs:
|
|
319
|
+
|
|
320
|
+
```js
|
|
321
|
+
const result = await web.browse('https://example.com', {
|
|
322
|
+
captureNetwork: true,
|
|
323
|
+
captureNetworkHeaders: true, // include request headers
|
|
324
|
+
captureNetworkBody: true // include response bodies (<50KB)
|
|
325
|
+
})
|
|
326
|
+
// result.networkRequests = [
|
|
327
|
+
// { url: "https://api.example.com/data", method: "GET", status: 200, contentType: "application/json", body: "..." }
|
|
328
|
+
// ]
|
|
329
|
+
```
|
|
330
|
+
|
|
331
|
+
### HTTP API
|
|
332
|
+
|
|
333
|
+
```bash
|
|
334
|
+
curl -X POST http://localhost:3900/browse \
|
|
335
|
+
-d '{"url": "https://example.com", "captureNetwork": true, "captureNetworkBody": true}'
|
|
336
|
+
```
|
|
337
|
+
|
|
338
|
+
## Screenshots
|
|
339
|
+
|
|
340
|
+
Take screenshots of any page via browse:
|
|
341
|
+
|
|
342
|
+
### JavaScript
|
|
343
|
+
|
|
105
344
|
```js
|
|
106
|
-
const
|
|
107
|
-
|
|
108
|
-
|
|
345
|
+
const result = await web.browse('https://example.com', {
|
|
346
|
+
screenshot: true,
|
|
347
|
+
fullPage: true // optional: capture entire page, not just viewport
|
|
348
|
+
})
|
|
349
|
+
// result.screenshot is a PNG Buffer
|
|
350
|
+
fs.writeFileSync('screenshot.png', result.screenshot)
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
### HTTP API
|
|
354
|
+
|
|
355
|
+
```bash
|
|
356
|
+
curl -X POST http://localhost:3900/browse \
|
|
357
|
+
-H 'Content-Type: application/json' \
|
|
358
|
+
-d '{"url": "https://example.com", "screenshot": true, "fullPage": true}'
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
Response:
|
|
362
|
+
```json
|
|
363
|
+
{
|
|
364
|
+
"content": "# Page Title\n\nExtracted markdown...",
|
|
365
|
+
"url": "https://example.com",
|
|
366
|
+
"title": "Page Title",
|
|
367
|
+
"screenshot": "iVBORw0KGgo...base64-encoded-png...",
|
|
368
|
+
"cached": false
|
|
369
|
+
}
|
|
109
370
|
```
|
|
110
371
|
|
|
111
|
-
|
|
372
|
+
> **Note:** Screenshots bypass the cache — each request renders a fresh page.
|
|
112
373
|
|
|
113
374
|
## Crawl
|
|
114
375
|
|
|
115
|
-
|
|
376
|
+
Multi-page website crawler with automatic RAM-based parallelization.
|
|
377
|
+
|
|
378
|
+
```js
|
|
379
|
+
const result = await web.crawl('https://docs.example.com', {
|
|
380
|
+
depth: 2, // how many link levels to follow
|
|
381
|
+
maxPages: 50, // stop after N pages
|
|
382
|
+
format: 'markdown', // 'markdown' or 'html'
|
|
383
|
+
scope: 'domain', // 'domain' | 'subdomain' | 'path'
|
|
384
|
+
concurrency: 'auto', // auto-detect from available RAM, or set a number
|
|
385
|
+
merge: true, // merge all pages into one document
|
|
386
|
+
includePatterns: [], // regex patterns to include
|
|
387
|
+
excludePatterns: [], // regex patterns to skip
|
|
388
|
+
delay: 300, // ms between batch launches (politeness)
|
|
389
|
+
stealth: true // use anti-detect browsing
|
|
390
|
+
})
|
|
391
|
+
```
|
|
392
|
+
|
|
393
|
+
### Crawl Response
|
|
394
|
+
|
|
395
|
+
```js
|
|
396
|
+
{
|
|
397
|
+
pages: [
|
|
398
|
+
{ url: 'https://docs.example.com/', content: '...', title: '...', statusCode: 200 },
|
|
399
|
+
{ url: 'https://docs.example.com/guide', content: '...', title: '...', statusCode: 200 },
|
|
400
|
+
// ...
|
|
401
|
+
],
|
|
402
|
+
stats: {
|
|
403
|
+
pagesScraped: 23,
|
|
404
|
+
duration: 45000,
|
|
405
|
+
concurrency: 4
|
|
406
|
+
}
|
|
407
|
+
}
|
|
408
|
+
```
|
|
409
|
+
|
|
410
|
+
### Sitemap-Based Crawling
|
|
116
411
|
|
|
117
|
-
|
|
412
|
+
Spectrawl auto-discovers `sitemap.xml` and pre-seeds the crawl queue — much faster than link-following for documentation sites:
|
|
118
413
|
|
|
119
414
|
```js
|
|
120
|
-
// Crawl a docs site — returns clean markdown for every page
|
|
121
415
|
const result = await web.crawl('https://docs.example.com', {
|
|
122
|
-
|
|
123
|
-
maxPages:
|
|
124
|
-
format: 'markdown', // markdown | html | json
|
|
125
|
-
delay: 300, // ms between requests (be polite)
|
|
126
|
-
stealth: false, // use Camoufox for anti-detect
|
|
127
|
-
auth: 'account' // use stored cookies (crawl behind logins)
|
|
416
|
+
useSitemap: true, // enabled by default
|
|
417
|
+
maxPages: 20
|
|
128
418
|
})
|
|
419
|
+
// [crawl] Found sitemap at https://docs.example.com/sitemap.xml with 82 URLs
|
|
420
|
+
// [crawl] Pre-seeded 20 URLs from sitemap
|
|
421
|
+
```
|
|
422
|
+
|
|
423
|
+
Set `useSitemap: false` to disable and rely only on link discovery.
|
|
424
|
+
|
|
425
|
+
### Webhook Notifications
|
|
426
|
+
|
|
427
|
+
Get notified when a crawl completes:
|
|
129
428
|
|
|
130
|
-
|
|
131
|
-
|
|
429
|
+
```bash
|
|
430
|
+
curl -X POST http://localhost:3900/crawl \
|
|
431
|
+
-d '{"url": "https://docs.example.com", "maxPages": 50, "webhook": "https://your-server.com/webhook"}'
|
|
132
432
|
```
|
|
133
433
|
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
- ✅ Works for AI agents (50-200 pages, not millions)
|
|
434
|
+
Spectrawl will POST the full crawl result to your webhook URL when finished.
|
|
435
|
+
|
|
436
|
+
### Async Crawl Jobs
|
|
437
|
+
|
|
438
|
+
For large sites, use async mode to avoid HTTP timeouts:
|
|
140
439
|
|
|
141
|
-
**HTTP API:**
|
|
142
440
|
```bash
|
|
441
|
+
# Start a crawl job (returns immediately)
|
|
143
442
|
curl -X POST http://localhost:3900/crawl \
|
|
144
|
-
-
|
|
145
|
-
|
|
443
|
+
-d '{"url": "https://docs.example.com", "depth": 3, "maxPages": 100, "async": true}'
|
|
444
|
+
# Response: { "jobId": "abc123", "status": "running" }
|
|
445
|
+
|
|
446
|
+
# Check job status
|
|
447
|
+
curl http://localhost:3900/crawl/abc123
|
|
448
|
+
|
|
449
|
+
# List all jobs
|
|
450
|
+
curl http://localhost:3900/crawl/jobs
|
|
451
|
+
|
|
452
|
+
# Check system capacity
|
|
453
|
+
curl http://localhost:3900/crawl/capacity
|
|
146
454
|
```
|
|
147
455
|
|
|
456
|
+
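A client can poll the job endpoint until the crawl leaves the running state — a hedged sketch in which `waitForCrawl` is a hypothetical helper and the fetcher is injected so it can be tested without a server; terminal status names other than `running` are assumptions.

```js
// Poll an async crawl job until it is no longer "running".
// `fetchJson(url)` is any async function returning the parsed job JSON.
async function waitForCrawl(jobId, fetchJson, { intervalMs = 2000, maxTries = 150 } = {}) {
  for (let i = 0; i < maxTries; i++) {
    const job = await fetchJson(`http://localhost:3900/crawl/${jobId}`)
    if (job.status !== 'running') return job // finished (or failed)
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
  throw new Error(`crawl ${jobId} still running after ${maxTries} polls`)
}
```

In real use, `fetchJson` would be `url => fetch(url).then(r => r.json())` against the job-status endpoint shown above.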
### RAM-Based Auto-Parallelization

Spectrawl estimates ~250MB per browser tab and calculates safe concurrency from available system RAM:

- **8GB server:** ~4 concurrent tabs
- **16GB server:** ~8 concurrent tabs
- **32GB server:** 10 concurrent tabs (capped)

## Auth
Persistent cookie storage (SQLite), multi-account management, automatic expiry detection.

@@ -158,11 +158,45 @@
```js
const accounts = await web.auth.getStatus()
// [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
```

Cookie refresh cron fires events before accounts go stale (see [Events](#events)).

## Events

Spectrawl emits events for auth state changes, rate limits, and action results. Subscribe to stay informed:

```js
const { EVENTS } = require('spectrawl')

web.on(EVENTS.COOKIE_EXPIRING, (data) => {
  console.log(`Cookie expiring for ${data.platform}:${data.account}`)
})

web.on(EVENTS.RATE_LIMITED, (data) => {
  console.log(`Rate limited on ${data.platform}`)
})

// Wildcard — catch everything
web.on('*', ({ event, ...data }) => {
  console.log(`Event: ${event}`, data)
})
```

### Available Events

| Event | When |
|---|---|
| `cookie_expiring` | Cookie approaching expiry |
| `cookie_expired` | Cookie has expired |
| `auth_failed` | Authentication attempt failed |
| `auth_refreshed` | Cookie successfully refreshed |
| `rate_limited` | Platform rate limit hit |
| `action_failed` | Platform action failed |
| `action_success` | Platform action succeeded |
| `health_check` | Periodic health check result |

## Act — 24 Platform Adapters

Post to 24+ platforms with one API:

```js
await web.act('github', 'create-issue', { repo: 'user/repo', title: 'Bug report', body: '...' })
```
@@ -171,7 +521,7 @@ await web.act('devto', 'post', { title: '...', body: '...', tags: ['ai'] })
|
|
|
171
521
|
await web.act('huggingface', 'create-repo', { name: 'my-model', type: 'model' })
|
|
172
522
|
```
|
|
173
523
|
|
|
174
|
-
**Live tested:** GitHub ✅, Reddit ✅, Dev.to ✅, HuggingFace ✅, X (reads)
|
|
524
|
+
**Live tested:** GitHub ✅, Reddit ✅, Dev.to ✅, HuggingFace ✅, X (reads) ✅
|
|
175
525
|
|
|
176
526
|
| Platform | Auth Method | Actions |
|
|
177
527
|
|----------|-------------|---------|
|
|
@@ -190,16 +540,13 @@ await web.act('huggingface', 'create-repo', { name: 'my-model', type: 'model' })
|
|
|
190
540
|
| Quora | Browser automation | answer |
|
|
191
541
|
| HuggingFace | Hub API | repo, model card, upload |
|
|
192
542
|
| BetaList | REST API | submit |
|
|
193
|
-
|
|
|
194
|
-
| DevHunt | Supabase auth | submit, upvote |
|
|
195
|
-
| SaaSHub | Generic adapter | submit |
|
|
196
|
-
| **Generic Directory** | Configurable | submit |
|
|
543
|
+
| **14 Directories** | Generic adapter | submit |
|
|
197
544
|
|
|
198
545
|
Built-in rate limiting, content dedup (MD5, 24h window), and dead letter queue for retries.
## Source Quality Ranking

Spectrawl ranks results by domain trust — something most search tools don't do:

- **Boosted:** GitHub, StackOverflow, HN, Reddit, MDN, arxiv, Wikipedia
- **Penalized:** SEO farms, thin content sites, tag/category pages

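Domain-trust reranking of this kind can be modeled as a simple score adjustment — purely illustrative: the weights, the 0-100 scale, and the domain lists here are assumptions, not Spectrawl's actual values.

```js
// Toy domain-trust scoring: boost trusted domains, penalize known-thin ones.
const BOOSTED = new Set([
  'github.com', 'stackoverflow.com', 'news.ycombinator.com',
  'reddit.com', 'developer.mozilla.org', 'arxiv.org', 'wikipedia.org'
])
const PENALIZED = new Set(['seo-farm.example']) // stand-in for a thin-content list

function trustScore(url, base = 50) {
  const host = new URL(url).hostname.replace(/^www\./, '')
  let score = base
  if (BOOSTED.has(host)) score += 20
  if (PENALIZED.has(host)) score -= 20
  return Math.min(100, Math.max(0, score)) // clamp to 0–100
}
```

Results would then be sorted by the adjusted score before being returned.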
@@ -220,12 +567,143 @@ const web = new Spectrawl({
|
|
|
220
567
|
npx spectrawl serve --port 3900
|
|
221
568
|
```
|
|
222
569
|
|
|
570
|
+
### Endpoints
|
|
571
|
+
|
|
572
|
+
| Method | Path | Description |
|
|
573
|
+
|--------|------|-------------|
|
|
574
|
+
| `POST` | `/search` | Search the web |
|
|
575
|
+
| `POST` | `/browse` | Stealth browse a URL |
|
|
576
|
+
| `POST` | `/crawl` | Crawl a website (sync or async) |
|
|
577
|
+
| `POST` | `/extract` | Structured data extraction with LLM |
|
|
578
|
+
| `POST` | `/agent` | Natural language browser actions |
|
|
579
|
+
| `POST` | `/act` | Platform actions |
|
|
580
|
+
| `GET` | `/status` | Auth account health |
|
|
581
|
+
| `GET` | `/health` | Server health |
|
|
582
|
+
| `GET` | `/crawl/jobs` | List async crawl jobs |
|
|
583
|
+
| `GET` | `/crawl/:jobId` | Get crawl job status/results |
|
|
584
|
+
| `GET` | `/crawl/capacity` | System crawl capacity |
|
|
585
|
+
|
|
586
|
+
### Request / Response Examples
|
|
587
|
+
|
|
588
|
+
#### POST /search
|
|
589
|
+
|
|
590
|
+
```bash
|
|
591
|
+
curl -X POST http://localhost:3900/search \
|
|
592
|
+
-H 'Content-Type: application/json' \
|
|
593
|
+
-d '{"query": "best headless browsers 2026", "summarize": true}'
|
|
594
|
+
```
|
|
595
|
+
|
|
596
|
+
Response:
|
|
597
|
+
```json
|
|
598
|
+
{
|
|
599
|
+
"sources": [
|
|
600
|
+
{
|
|
601
|
+
"title": "Top Headless Browsers in 2026",
|
|
602
|
+
"url": "https://example.com/article",
|
|
603
|
+
"snippet": "Short snippet from search...",
|
|
604
|
+
"content": "Full page markdown content (if scraped)...",
|
|
605
|
+
"source": "gemini-grounded",
|
|
606
|
+
"confidence": 0.95
|
|
607
|
+
}
|
|
608
|
+
],
|
|
609
|
+
"answer": "AI-generated summary with [1] citations... (only if summarize: true)",
|
|
610
|
+
"cached": false
|
|
611
|
+
}
|
|
612
|
+
```
|
|
613
|
+
|
|
614
|
+
#### POST /browse
|
|
615
|
+
|
|
616
|
+
```bash
|
|
617
|
+
curl -X POST http://localhost:3900/browse \
|
|
618
|
+
-H 'Content-Type: application/json' \
|
|
619
|
+
-d '{"url": "https://example.com", "screenshot": true, "fullPage": true}'
|
|
620
|
+
```
|
|
621
|
+
|
|
622
|
+
Response:
|
|
623
|
+
```json
|
|
624
|
+
{
|
|
625
|
+
"content": "# Example Domain\n\nThis domain is for use in illustrative examples...",
|
|
626
|
+
"url": "https://example.com",
|
|
627
|
+
"title": "Example Domain",
|
|
628
|
+
"statusCode": 200,
|
|
629
|
+
"screenshot": "iVBORw0KGgoAAAANSUhEUg...base64...",
|
|
630
|
+
"cached": false,
|
|
631
|
+
"engine": "playwright"
|
|
632
|
+
}
|
|
633
|
+
```
|
|
634
|
+
|
|
635
|
+
#### POST /crawl
|
|
636
|
+
|
|
637
|
+
```bash
|
|
638
|
+
curl -X POST http://localhost:3900/crawl \
|
|
639
|
+
-H 'Content-Type: application/json' \
|
|
640
|
+
-d '{"url": "https://docs.example.com", "depth": 2, "maxPages": 10}'
|
|
641
|
+
```
|
|
642
|
+
|
|
643
|
+
Response:
|
|
644
|
+
```json
|
|
645
|
+
{
|
|
646
|
+
"pages": [
|
|
647
|
+
{
|
|
648
|
+
"url": "https://docs.example.com/",
|
|
649
|
+
"content": "# Docs Home\n\n...",
|
|
650
|
+
"title": "Documentation",
|
|
651
|
+
"statusCode": 200
|
|
652
|
+
}
|
|
653
|
+
],
|
|
654
|
+
"stats": {
|
|
655
|
+
"pagesScraped": 8,
|
|
656
|
+
"duration": 12000,
|
|
657
|
+
"concurrency": 4
|
|
658
|
+
}
|
|
659
|
+
}
|
|
660
|
+
```
|
|
661
|
+
|
|
662
|
+
#### POST /act
|
|
663
|
+
|
|
664
|
+
```bash
|
|
665
|
+
curl -X POST http://localhost:3900/act \
|
|
666
|
+
-H 'Content-Type: application/json' \
|
|
667
|
+
-d '{"platform": "github", "action": "create-issue", "repo": "user/repo", "title": "Bug", "body": "Details..."}'
|
|
668
|
+
```
|
|
669
|
+
|
|
670
|
+
#### Error Responses
|
|
671
|
+
|
|
672
|
+
All errors follow [RFC 9457](https://www.rfc-editor.org/rfc/rfc9457) Problem Details format:
|
|
673
|
+
|
|
674
|
+
```json
|
|
675
|
+
{
|
|
676
|
+
"type": "https://spectrawl.dev/errors/rate-limited",
|
|
677
|
+
"status": 429,
|
|
678
|
+
"title": "rate limited",
|
|
679
|
+
"detail": "Reddit rate limit: max 3 posts per hour",
|
|
680
|
+
"retryable": true
|
|
681
|
+
}
|
|
682
|
+
```
|
|
683
|
+
|
|
684
|
+
Error types: `bad-request` (400), `unauthorized` (401), `forbidden` (403), `not-found` (404), `rate-limited` (429), `internal-error` (500), `upstream-error` (502), `service-unavailable` (503).
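Client-side, the `retryable` flag can drive retry logic. A minimal sketch — the fallback on 429/502/503 when `retryable` is absent is an assumption, not documented behavior:

```js
// Decide whether an RFC 9457 problem response is worth retrying.
function shouldRetry(problem) {
  if (typeof problem.retryable === 'boolean') return problem.retryable
  // fallback: treat conventionally transient statuses as retryable
  return [429, 502, 503].includes(problem.status)
}
```
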
## Proxy Configuration

Route browsing through residential or datacenter proxies:

```json
{
  "browse": {
    "proxy": {
      "host": "proxy.example.com",
      "port": 8080,
      "username": "user",
      "password": "pass"
    }
  }
}
```

The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server:

```bash
npx spectrawl proxy --port 8080
```

## MCP Server

@@ -236,7 +236,15 @@
Works with any MCP-compatible agent (Claude, Cursor, OpenClaw, LangChain):

```
npx spectrawl mcp
```

### MCP Tools

| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `web_search` | Search the web | `query`, `summarize`, `scrapeTop`, `minResults` |
| `web_browse` | Stealth browse a URL | `url`, `auth`, `screenshot`, `html` |
| `web_act` | Platform action | `platform`, `action`, `account`, `text`, `title` |
| `web_auth` | Manage auth | `action` (list/add/remove), `platform`, `account` |
| `web_status` | Check auth health | — |

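Registering the server with an MCP client typically looks like this — illustrative only: the exact config file and schema depend on your client, and the `spectrawl` server name here is arbitrary:

```json
{
  "mcpServers": {
    "spectrawl": {
      "command": "npx",
      "args": ["spectrawl", "mcp"]
    }
  }
}
```
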
## CLI

@@ -246,35 +246,64 @@
```
npx spectrawl search "query"   # search from terminal
npx spectrawl status           # check auth health
npx spectrawl serve            # start HTTP server
npx spectrawl mcp              # start MCP server
npx spectrawl proxy            # start rotating proxy server
npx spectrawl install-stealth  # download Camoufox browser
npx spectrawl version          # show version
```

## Configuration

`spectrawl.json` — full defaults:

```json
{
  "port": 3900,
  "concurrency": 3,
  "search": {
    "cascade": ["gemini-grounded", "tavily", "brave"],
    "scrapeTop": 5
  },
  "browse": {
    "defaultEngine": "playwright",
    "proxy": null,
    "humanlike": {
      "minDelay": 500,
      "maxDelay": 2000,
      "scrollBehavior": true
    }
  },
  "auth": {
    "refreshInterval": "4h",
    "cookieStore": "./data/cookies.db"
  },
  "cache": {
    "path": "./data/cache.db",
    "searchTtl": 3600,
    "scrapeTtl": 86400,
    "screenshotTtl": 3600
  },
  "rateLimit": {
    "x": { "postsPerHour": 5, "minDelayMs": 30000 },
    "reddit": { "postsPerHour": 3, "minDelayMs": 600000 }
  }
}
```

### Human-like Browsing

Spectrawl simulates human browsing patterns by default:

- **Random delays** between page loads (500-2000ms)
- **Scroll behavior** simulation
- **Random viewport sizes** from common resolutions
- Configurable via `browse.humanlike`

## Environment Variables

```
GEMINI_API_KEY     Free — primary search + summarization (aistudio.google.com)
BRAVE_API_KEY      Brave Search (2,000 free/month)
TAVILY_API_KEY     Tavily Search (1,000 free/month)
SERPER_API_KEY     Serper.dev (2,500 trial queries)
GITHUB_TOKEN       For GitHub adapter
DEVTO_API_KEY      For Dev.to adapter
```

@@ -286,7 +801,3 @@
```
ANTHROPIC_API_KEY  Alternative LLM for summarization
```

## License

MIT