spectrawl 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,12 +1,12 @@
  # Spectrawl
 
- The unified web layer for AI agents. Search, browse, authenticate, and act on platforms — one package, self-hosted.
+ The unified web layer for AI agents. Search, browse, crawl, extract, and act on platforms — one package, self-hosted.
 
- **5,000 free searches/month** via Gemini Grounded Search. Full site crawling, stealth browsing, 19 platform adapters.
+ **5,000 free searches/month** via Gemini Grounded Search. Full page scraping, stealth browsing, multi-page crawling, structured extraction, AI browser agent, 24 platform adapters.
 
  ## What It Does
 
- AI agents need to interact with the web — searching, browsing pages, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.
+ AI agents need to interact with the web — searching, browsing pages, crawling sites, logging into platforms, posting content. Today you wire together Playwright + a search API + cookie managers + platform-specific scripts. Spectrawl is one package that does all of it.
 
  ```
  npm install spectrawl
@@ -44,25 +44,45 @@ const basic = await web.search('query')
 
  > **Why no summary by default?** Your agent already has an LLM. If we summarize AND your agent summarizes, you're paying two LLMs for one answer. We return rich sources — your agent does the rest.
 
- ## Spectrawl vs Tavily
+ ## Spectrawl vs Others
+
+ | | Tavily | Crawl4AI | Firecrawl | Stagehand | Spectrawl |
+ |---|---|---|---|---|---|
+ | Speed | ~2s | ~5s | ~3s | ~3s | ~6-10s |
+ | Free tier | 1,000/mo | Unlimited | 500/mo | None | 5,000/mo |
+ | Returns | Snippets + AI | Markdown | Markdown/JSON | Structured | Full page + structured |
+ | Self-hosted | No | Yes | Yes | Yes | Yes |
+ | Anti-detect | No | No | No | No | **Yes (Camoufox)** |
+ | Block detection | No | No | No | No | **8 services** |
+ | CAPTCHA solving | No | No | No | No | **Yes (Gemini Vision)** |
+ | Structured extraction | No | No | No | **Yes** | **Yes** |
+ | NL browser agent | No | No | No | **Yes** | **Yes** |
+ | Network capturing | No | Yes | No | No | **Yes** |
+ | Multi-page crawl | No | Yes | Yes | No | **Yes (+ sitemap)** |
+ | Platform posting | No | No | No | No | **24 adapters** |
+ | Auth management | No | No | No | No | **Cookie store + refresh** |
 
- Different tools for different needs.
+ ## Search
 
- | | Tavily | Spectrawl |
- |---|---|---|
- | Speed | ~2s | ~6-10s |
- | Free tier | 1,000/month | 5,000/month |
- | Returns | Snippets + AI answer | Full page content + snippets |
- | Self-hosted | No | Yes |
- | Stealth browsing | No | Yes (Camoufox + Playwright) |
- | Platform posting | No | 19 adapters |
- | Auth management | No | Cookie store + auto-refresh |
- | Site crawling | No | ✅ Free (Jina + Playwright) |
- | Cached repeats | No | <1ms |
+ Two modes: **basic search** and **deep search**.
 
- **Tavily** is fast and simple — great for agents that need quick answers. **Spectrawl** returns richer data and does more (browse, auth, post) — but it's slower. Choose based on your use case.
+ ### Basic Search
 
- ## Search
+ ```js
+ const results = await web.search('query')
+ ```
+
+ Returns raw search results from the engine cascade. Fast, lightweight.
+
+ ### Deep Search
+
+ ```js
+ const results = await web.deepSearch('query', { summarize: true })
+ ```
+
+ Full pipeline: query expansion → parallel search → merge/dedup → rerank → scrape top N → optional AI summary with citations.
+
+ ### Search Engine Cascade
 
  Default cascade: **Gemini Grounded → Tavily → Brave**
 
@@ -96,55 +116,351 @@ DDG-only search, raw results, no AI answer. Works from home IPs. Datacenter IPs
 
  ## Browse
 
- Stealth browsing with anti-detection. Three tiers (auto-detected):
+ Stealth browsing with anti-detection. Three tiers (auto-escalation):
 
- 1. **playwright-extra + stealth plugin** — default, works immediately
+ 1. **Playwright + stealth plugin** — default, works immediately
  2. **Camoufox binary** — engine-level anti-fingerprint (`npx spectrawl install-stealth`)
  3. **Remote Camoufox** — for existing deployments
 
+ If tier 1 gets blocked, Spectrawl automatically escalates to tier 2 (if installed) or tier 3 (if configured). No manual intervention needed.
+
+ ### Browse Options
+
+ ```js
+ const page = await web.browse('https://example.com', {
+   screenshot: true, // Take a PNG screenshot
+   fullPage: true,   // Full page screenshot (not just viewport)
+   html: true,       // Return raw HTML alongside markdown
+   stealth: true,    // Force stealth mode
+   camoufox: true,   // Force Camoufox engine
+   noCache: true,    // Bypass cache
+   auth: 'reddit'    // Use stored auth cookies for this platform
+ })
+ ```
+
+ ### Browse Response
+
+ ```js
+ {
+   content: "# Page Title\n\nExtracted markdown content...",
+   url: "https://example.com",
+   title: "Page Title",
+   statusCode: 200,
+   cached: false,
+   engine: "camoufox",       // which engine was used
+   screenshot: Buffer<png>,  // PNG buffer (JS) or base64 (HTTP)
+   html: "<html>...</html>", // raw HTML (if html: true)
+   blocked: false,           // true if block page detected
+   blockInfo: null           // { type: 'cloudflare', detail: '...' }
+ }
+ ```
+
+ ### Block Page Detection
+
+ Spectrawl detects block/challenge pages from **8 anti-bot services** and reports them in the response instead of returning garbage HTML:
+
+ - **Cloudflare** (including RFC 9457 structured errors)
+ - **Akamai**
+ - **AWS WAF**
+ - **Imperva / Incapsula**
+ - **DataDome**
+ - **PerimeterX / HUMAN**
+ - **hCaptcha** challenges
+ - **reCAPTCHA** challenges
+ - Generic bot detection (403, "access denied", etc.)
+
+ When a block is detected, the response includes `blocked: true` and `blockInfo: { type, detail }`.
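In agent code, those two fields are all you need to branch on. A minimal sketch of a retry decision (the escalation policy and the `type` values checked here are illustrative, not Spectrawl's internal logic):

```javascript
// Sketch: decide what to do with a browse result, using the blocked/blockInfo
// fields from the Browse Response above. The policy itself is illustrative.
function classifyBrowseResult(result) {
  if (!result.blocked) return { ok: true, action: 'use-content' }
  const type = result.blockInfo ? result.blockInfo.type : 'unknown'
  // Token-based CAPTCHAs need an external solving service; infrastructure
  // blocks (Cloudflare, Akamai, ...) may pass on a stealth/Camoufox retry.
  const needsCaptchaService = ['recaptcha', 'hcaptcha'].includes(type)
  return {
    ok: false,
    action: needsCaptchaService ? 'give-up' : 'retry-with-stealth',
    type
  }
}
```

Pair a `retry-with-stealth` result with the `stealth: true` / `camoufox: true` browse options to implement the actual retry.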
+
+ ### CAPTCHA Solving
+
+ Built-in CAPTCHA solver using **Gemini Vision** (free tier: 1,500 req/day):
+
+ - ✅ Image CAPTCHAs
+ - ✅ Text/math CAPTCHAs
+ - ✅ Simple visual challenges
+ - ❌ reCAPTCHA v2/v3 (requires token solving services)
+ - ❌ hCaptcha (requires token solving services)
+ - ❌ Cloudflare Turnstile (requires token solving services)
+
+ The solver automatically detects CAPTCHA type and attempts resolution before returning the page.
+
+ ## Extract — Structured Data Extraction
+
+ Pull structured data from any page using LLM + optional CSS/XPath selectors. Like Stagehand's `extract()` but self-hosted and integrated with Spectrawl's anti-detect browsing.
+
+ ### Basic Extraction
+
+ ```js
+ const result = await web.extract('https://news.ycombinator.com', {
+   instruction: 'Extract the top 3 story titles and their point counts',
+   schema: {
+     type: 'object',
+     properties: {
+       stories: {
+         type: 'array',
+         items: {
+           type: 'object',
+           properties: {
+             title: { type: 'string' },
+             points: { type: 'number' }
+           }
+         }
+       }
+     }
+   }
+ })
+ // result.data = { stories: [{ title: "...", points: 210 }, ...] }
+ ```
+
+ ### HTTP API
+
+ ```bash
+ curl -X POST http://localhost:3900/extract \
+   -H 'Content-Type: application/json' \
+   -d '{
+     "url": "https://example.com",
+     "instruction": "Extract the page title and main heading",
+     "schema": {"type": "object", "properties": {"title": {"type": "string"}, "heading": {"type": "string"}}}
+   }'
+ ```
+
+ Response:
+ ```json
+ {
+   "data": { "title": "Example Domain", "heading": "Example Domain" },
+   "url": "https://example.com",
+   "title": "Example Domain",
+   "contentLength": 129,
+   "duration": 679
+ }
+ ```
+
+ ### Targeted Extraction with Selectors
+
+ Narrow extraction scope using CSS or XPath selectors — reduces tokens and improves accuracy:
+
+ ```js
+ const result = await web.extract('https://news.ycombinator.com', {
+   instruction: 'Extract all story titles',
+   selector: '.titleline', // CSS selector
+   // or: selector: 'xpath=//table[@class="itemlist"]'
+   schema: { type: 'object', properties: { titles: { type: 'array', items: { type: 'string' } } } }
+ })
+ ```
+
+ ### Relevance Filtering (BM25)
+
+ For large pages, filter content by relevance before sending to the LLM — saves tokens:
+
+ ```js
+ const result = await web.extract('https://en.wikipedia.org/wiki/Node.js', {
+   instruction: 'Extract the creator and release date',
+   relevanceFilter: true // BM25 scoring keeps only relevant sections
+ })
+ // Content reduced from 50K+ chars to ~2K relevant chars
+ ```
+
+ ### Extract from Content (No Browsing)
+
+ Already have the content? Skip the browse step:
+
+ ```js
+ const result = await web.extractFromContent(markdownContent, {
+   instruction: 'Extract all email addresses',
+   schema: { type: 'object', properties: { emails: { type: 'array', items: { type: 'string' } } } }
+ })
+ ```
+
+ Uses Gemini Flash (free) by default. Falls back to OpenAI if configured.
+
+ ## Agent — Natural Language Browser Actions
+
+ Control a browser with natural language. Navigate, click, type, scroll — the LLM interprets the page and decides what to do.
+
+ ```js
+ const result = await web.agent('https://example.com', 'click the More Information link', {
+   maxSteps: 5,     // max actions to take
+   screenshot: true // screenshot after completion
+ })
+ // result.success = true
+ // result.url = "https://www.iana.org/domains/reserved" (navigated!)
+ // result.steps = [{ step: 1, action: "click", elementIdx: 0, result: "clicked" }, ...]
+ ```
+
+ ### HTTP API
+
+ ```bash
+ curl -X POST http://localhost:3900/agent \
+   -H 'Content-Type: application/json' \
+   -d '{"url": "https://example.com", "instruction": "click the More Information link", "maxSteps": 3}'
+ ```
+
+ Response:
+ ```json
+ {
+   "success": true,
+   "url": "https://www.iana.org/domains/reserved",
+   "title": "IANA — Reserved Domains",
+   "steps": [
+     { "step": 1, "action": "click", "elementIdx": 0, "reason": "clicking the More Information link", "result": "clicked" }
+   ],
+   "content": "...",
+   "duration": 5200
+ }
+ ```
+
+ ### Supported Actions
+
+ The agent can: **click**, **type** (fill inputs), **select** (dropdowns), **press** (keyboard keys), **scroll** (up/down).
+
+ ## Network Request Capturing
+
+ Capture XHR/fetch requests made by a page during browsing — useful for discovering hidden APIs:
+
+ ```js
+ const result = await web.browse('https://example.com', {
+   captureNetwork: true,
+   captureNetworkHeaders: true, // include request headers
+   captureNetworkBody: true     // include response bodies (<50KB)
+ })
+ // result.networkRequests = [
+ //   { url: "https://api.example.com/data", method: "GET", status: 200, contentType: "application/json", body: "..." }
+ // ]
+ ```
+
+ ### HTTP API
+
+ ```bash
+ curl -X POST http://localhost:3900/browse \
+   -d '{"url": "https://example.com", "captureNetwork": true, "captureNetworkBody": true}'
+ ```
+
+ ## Screenshots
+
+ Take screenshots of any page via browse:
+
+ ### JavaScript
+
  ```js
- const page = await web.browse('https://example.com')
- console.log(page.content) // extracted text/markdown
- console.log(page.screenshot) // PNG buffer (if requested)
+ const result = await web.browse('https://example.com', {
+   screenshot: true,
+   fullPage: true // optional: capture entire page, not just viewport
+ })
+ // result.screenshot is a PNG Buffer
+ fs.writeFileSync('screenshot.png', result.screenshot)
+ ```
+
+ ### HTTP API
+
+ ```bash
+ curl -X POST http://localhost:3900/browse \
+   -H 'Content-Type: application/json' \
+   -d '{"url": "https://example.com", "screenshot": true, "fullPage": true}'
+ ```
+
+ Response:
+ ```json
+ {
+   "content": "# Page Title\n\nExtracted markdown...",
+   "url": "https://example.com",
+   "title": "Page Title",
+   "screenshot": "iVBORw0KGgo...base64-encoded-png...",
+   "cached": false
+ }
  ```
 
- Auto-fallback: if Jina and readability return too little content (<200 chars), Spectrawl renders the page with Playwright and extracts from the rendered DOM. Tavily can't do this — they fail on JS-heavy pages.
+ > **Note:** Screenshots bypass the cache; each request renders a fresh page.
 
  ## Crawl
 
- Give your agent the ability to read an entire website in one call. Free, no API costs.
+ Multi-page website crawler with automatic RAM-based parallelization.
+
+ ```js
+ const result = await web.crawl('https://docs.example.com', {
+   depth: 2,            // how many link levels to follow
+   maxPages: 50,        // stop after N pages
+   format: 'markdown',  // 'markdown' or 'html'
+   scope: 'domain',     // 'domain' | 'subdomain' | 'path'
+   concurrency: 'auto', // auto-detect from available RAM, or set a number
+   merge: true,         // merge all pages into one document
+   includePatterns: [], // regex patterns to include
+   excludePatterns: [], // regex patterns to skip
+   delay: 300,          // ms between batch launches (politeness)
+   stealth: true        // use anti-detect browsing
+ })
+ ```
+
+ ### Crawl Response
+
+ ```js
+ {
+   pages: [
+     { url: 'https://docs.example.com/', content: '...', title: '...', statusCode: 200 },
+     { url: 'https://docs.example.com/guide', content: '...', title: '...', statusCode: 200 },
+     // ...
+   ],
+   stats: {
+     pagesScraped: 23,
+     duration: 45000,
+     concurrency: 4
+   }
+ }
+ ```
+
+ ### Sitemap-Based Crawling
 
- Uses [Jina Reader](https://jina.ai/reader) (free, unlimited) with Playwright stealth fallback for JS-heavy sites.
+ Spectrawl auto-discovers `sitemap.xml` and pre-seeds the crawl queue with its URLs, which is much faster than link-following for documentation sites:
 
  ```js
- // Crawl a docs site — returns clean markdown for every page
  const result = await web.crawl('https://docs.example.com', {
- depth: 2, // how many levels deep (default: 1)
- maxPages: 50, // max pages to crawl (default: 50)
- format: 'markdown', // markdown | html | json
- delay: 300, // ms between requests (be polite)
- stealth: false, // use Camoufox for anti-detect
- auth: 'account' // use stored cookies (crawl behind logins)
+   useSitemap: true, // enabled by default
+   maxPages: 20
  })
+ // [crawl] Found sitemap at https://docs.example.com/sitemap.xml with 82 URLs
+ // [crawl] Pre-seeded 20 URLs from sitemap
+ ```
+
+ Set `useSitemap: false` to disable and rely only on link discovery.
+
+ ### Webhook Notifications
+
+ Get notified when a crawl completes:
 
- result.pages // [{ url, title, content, links, depth }]
- result.stats // { total, crawled, failed, duration }
+ ```bash
+ curl -X POST http://localhost:3900/crawl \
+   -d '{"url": "https://docs.example.com", "maxPages": 50, "webhook": "https://your-server.com/webhook"}'
  ```
 
- **vs Cloudflare's /crawl:**
- - ✅ Free (self-hosted, no per-request cost)
- - Crawls sites that block Cloudflare IPs
- - ✅ Auth-aware — crawl behind login walls with stored cookies
- - Stealth mode bypasses bot detection
- - ✅ Works for AI agents (50-200 pages, not millions)
+ Spectrawl will POST the full crawl result to your webhook URL when finished.
+
+ ### Async Crawl Jobs
+
+ For large sites, use async mode to avoid HTTP timeouts:
 
- **HTTP API:**
  ```bash
+ # Start a crawl job (returns immediately)
  curl -X POST http://localhost:3900/crawl \
- -H "Content-Type: application/json" \
- -d '{ "url": "https://docs.example.com", "depth": 2, "maxPages": 50 }'
+   -d '{"url": "https://docs.example.com", "depth": 3, "maxPages": 100, "async": true}'
+ # Response: { "jobId": "abc123", "status": "running" }
+
+ # Check job status
+ curl http://localhost:3900/crawl/abc123
+
+ # List all jobs
+ curl http://localhost:3900/crawl/jobs
+
+ # Check system capacity
+ curl http://localhost:3900/crawl/capacity
  ```
 
+ ### RAM-Based Auto-Parallelization
+
+ Spectrawl estimates ~250MB per browser tab and calculates safe concurrency from available system RAM:
+
+ - **8GB server:** ~4 concurrent tabs
+ - **16GB server:** ~8 concurrent tabs
+ - **32GB server:** 10 concurrent tabs (capped)
+
  ## Auth
 
  Persistent cookie storage (SQLite), multi-account management, automatic expiry detection.
@@ -158,11 +474,45 @@ const accounts = await web.auth.getStatus()
  // [{ platform: 'x', account: '@myhandle', status: 'valid', expiresAt: '...' }]
  ```
 
- Cookie refresh cron fires `cookie_expiring` and `cookie_expired` events before accounts go stale.
+ Cookie refresh cron fires events before accounts go stale (see [Events](#events)).
+
+ ## Events
+
+ Spectrawl emits events for auth state changes, rate limits, and action results. Subscribe to stay informed:
+
+ ```js
+ const { EVENTS } = require('spectrawl')
+
+ web.on(EVENTS.COOKIE_EXPIRING, (data) => {
+   console.log(`Cookie expiring for ${data.platform}:${data.account}`)
+ })
+
+ web.on(EVENTS.RATE_LIMITED, (data) => {
+   console.log(`Rate limited on ${data.platform}`)
+ })
+
+ // Wildcard — catch everything
+ web.on('*', ({ event, ...data }) => {
+   console.log(`Event: ${event}`, data)
+ })
+ ```
+
+ ### Available Events
+
+ | Event | When |
+ |---|---|
+ | `cookie_expiring` | Cookie approaching expiry |
+ | `cookie_expired` | Cookie has expired |
+ | `auth_failed` | Authentication attempt failed |
+ | `auth_refreshed` | Cookie successfully refreshed |
+ | `rate_limited` | Platform rate limit hit |
+ | `action_failed` | Platform action failed |
+ | `action_success` | Platform action succeeded |
+ | `health_check` | Periodic health check result |
 
- ## Act — 19 Platform Adapters
+ ## Act — 24 Platform Adapters
 
- Post to 19 platforms with one API:
+ Post to 24 platforms with one API:
 
  ```js
  await web.act('github', 'create-issue', { repo: 'user/repo', title: 'Bug report', body: '...' })
@@ -171,7 +521,7 @@ await web.act('devto', 'post', { title: '...', body: '...', tags: ['ai'] })
  await web.act('huggingface', 'create-repo', { name: 'my-model', type: 'model' })
  ```
 
- **Live tested:** GitHub ✅, Reddit ✅, Dev.to ✅, HuggingFace ✅, X (reads) ✅, Hashnode ✅, Discord ✅, Product Hunt
+ **Live tested:** GitHub ✅, Reddit ✅, Dev.to ✅, HuggingFace ✅, X (reads) ✅
 
  | Platform | Auth Method | Actions |
  |----------|-------------|---------|
@@ -190,16 +540,13 @@ await web.act('huggingface', 'create-repo', { name: 'my-model', type: 'model' })
  | Quora | Browser automation | answer |
  | HuggingFace | Hub API | repo, model card, upload |
  | BetaList | REST API | submit |
- | AlternativeTo | Cookie session | submit, claim |
- | DevHunt | Supabase auth | submit, upvote |
- | SaaSHub | Generic adapter | submit |
- | **Generic Directory** | Configurable | submit |
+ | **14 Directories** | Generic adapter | submit |
 
  Built-in rate limiting, content dedup (MD5, 24h window), and dead letter queue for retries.
 
  ## Source Quality Ranking
 
- Spectrawl ranks results by domain trust — something Tavily doesn't do:
+ Spectrawl ranks results by domain trust — something most search tools don't do:
 
  - **Boosted:** GitHub, StackOverflow, HN, Reddit, MDN, arxiv, Wikipedia
  - **Penalized:** SEO farms, thin content sites, tag/category pages
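A toy version of this kind of reranking looks like the following. The domain list and score values are illustrative, not Spectrawl's actual weights:

```javascript
// Boost known high-signal domains, penalize tag/category pages, else neutral.
const BOOSTED = [
  'github.com', 'stackoverflow.com', 'news.ycombinator.com', 'reddit.com',
  'developer.mozilla.org', 'arxiv.org', 'wikipedia.org'
]

function trustScore(url) {
  const { hostname, pathname } = new URL(url)
  if (BOOSTED.some((d) => hostname === d || hostname.endsWith('.' + d))) return 1
  if (/\/(tag|tags|category)\//.test(pathname)) return -1 // thin listing pages
  return 0
}

// Rerank results by descending domain trust.
const rerank = (urls) => [...urls].sort((a, b) => trustScore(b) - trustScore(a))
```

Matching on hostname suffix (`.wikipedia.org`) keeps subdomains like `en.wikipedia.org` boosted without also boosting lookalikes such as `notwikipedia.org`.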
@@ -220,12 +567,143 @@ const web = new Spectrawl({
  npx spectrawl serve --port 3900
  ```
 
+ ### Endpoints
+
+ | Method | Path | Description |
+ |--------|------|-------------|
+ | `POST` | `/search` | Search the web |
+ | `POST` | `/browse` | Stealth browse a URL |
+ | `POST` | `/crawl` | Crawl a website (sync or async) |
+ | `POST` | `/extract` | Structured data extraction with LLM |
+ | `POST` | `/agent` | Natural language browser actions |
+ | `POST` | `/act` | Platform actions |
+ | `GET` | `/status` | Auth account health |
+ | `GET` | `/health` | Server health |
+ | `GET` | `/crawl/jobs` | List async crawl jobs |
+ | `GET` | `/crawl/:jobId` | Get crawl job status/results |
+ | `GET` | `/crawl/capacity` | System crawl capacity |
+
+ ### Request / Response Examples
+
+ #### POST /search
+
+ ```bash
+ curl -X POST http://localhost:3900/search \
+   -H 'Content-Type: application/json' \
+   -d '{"query": "best headless browsers 2026", "summarize": true}'
+ ```
+
+ Response:
+ ```json
+ {
+   "sources": [
+     {
+       "title": "Top Headless Browsers in 2026",
+       "url": "https://example.com/article",
+       "snippet": "Short snippet from search...",
+       "content": "Full page markdown content (if scraped)...",
+       "source": "gemini-grounded",
+       "confidence": 0.95
+     }
+   ],
+   "answer": "AI-generated summary with [1] citations... (only if summarize: true)",
+   "cached": false
+ }
+ ```
+
+ #### POST /browse
+
+ ```bash
+ curl -X POST http://localhost:3900/browse \
+   -H 'Content-Type: application/json' \
+   -d '{"url": "https://example.com", "screenshot": true, "fullPage": true}'
+ ```
+
+ Response:
+ ```json
+ {
+   "content": "# Example Domain\n\nThis domain is for use in illustrative examples...",
+   "url": "https://example.com",
+   "title": "Example Domain",
+   "statusCode": 200,
+   "screenshot": "iVBORw0KGgoAAAANSUhEUg...base64...",
+   "cached": false,
+   "engine": "playwright"
+ }
+ ```
+
+ #### POST /crawl
+
+ ```bash
+ curl -X POST http://localhost:3900/crawl \
+   -H 'Content-Type: application/json' \
+   -d '{"url": "https://docs.example.com", "depth": 2, "maxPages": 10}'
+ ```
+
+ Response:
+ ```json
+ {
+   "pages": [
+     {
+       "url": "https://docs.example.com/",
+       "content": "# Docs Home\n\n...",
+       "title": "Documentation",
+       "statusCode": 200
+     }
+   ],
+   "stats": {
+     "pagesScraped": 8,
+     "duration": 12000,
+     "concurrency": 4
+   }
+ }
+ ```
+
+ #### POST /act
+
+ ```bash
+ curl -X POST http://localhost:3900/act \
+   -H 'Content-Type: application/json' \
+   -d '{"platform": "github", "action": "create-issue", "repo": "user/repo", "title": "Bug", "body": "Details..."}'
+ ```
+
+ #### Error Responses
+
+ All errors follow [RFC 9457](https://www.rfc-editor.org/rfc/rfc9457) Problem Details format:
+
+ ```json
+ {
+   "type": "https://spectrawl.dev/errors/rate-limited",
+   "status": 429,
+   "title": "rate limited",
+   "detail": "Reddit rate limit: max 3 posts per hour",
+   "retryable": true
+ }
+ ```
+
+ Error types: `bad-request` (400), `unauthorized` (401), `forbidden` (403), `not-found` (404), `rate-limited` (429), `internal-error` (500), `upstream-error` (502), `service-unavailable` (503).
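Client-side, `retryable` plus `status` is enough to drive retry behavior. A sketch (the backoff bases are an assumption, not part of the API):

```javascript
// Decide how long to wait before retrying, given a parsed RFC 9457 body.
// Returns null when the error should be surfaced instead of retried.
function retryDelayMs(problem, attempt) {
  if (!problem.retryable) return null
  // Longer base delay for rate limits, exponential backoff either way.
  const base = problem.status === 429 ? 60_000 : 1_000
  return base * 2 ** attempt
}
```

Honoring `retryable: false` matters: retrying a `bad-request` or `unauthorized` error just burns rate-limit budget.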
+
+ ## Proxy Configuration
+
+ Route browsing through residential or datacenter proxies:
+
+ ```json
+ {
+   "browse": {
+     "proxy": {
+       "host": "proxy.example.com",
+       "port": 8080,
+       "username": "user",
+       "password": "pass"
+     }
+   }
+ }
  ```
- POST /search { "query": "...", "summarize": true }
- POST /browse { "url": "...", "screenshot": true }
- POST /act { "platform": "github", "action": "create-issue", ... }
- GET /status — auth account health
- GET /health — server health
+
+ The proxy is used for all Playwright and Camoufox browsing sessions. You can also start a local rotating proxy server:
+
+ ```bash
+ npx spectrawl proxy --port 8080
  ```
 
  ## MCP Server
@@ -236,7 +714,15 @@ Works with any MCP-compatible agent (Claude, Cursor, OpenClaw, LangChain):
  npx spectrawl mcp
  ```
 
- 5 tools: `web_search`, `web_browse`, `web_act`, `web_auth`, `web_status`.
+ ### MCP Tools
+
+ | Tool | Description | Key Parameters |
+ |------|-------------|----------------|
+ | `web_search` | Search the web | `query`, `summarize`, `scrapeTop`, `minResults` |
+ | `web_browse` | Stealth browse a URL | `url`, `auth`, `screenshot`, `html` |
+ | `web_act` | Platform action | `platform`, `action`, `account`, `text`, `title` |
+ | `web_auth` | Manage auth | `action` (list/add/remove), `platform`, `account` |
+ | `web_status` | Check auth health | — |
 
  ## CLI
 
@@ -246,35 +732,64 @@ npx spectrawl search "query" # search from terminal
  npx spectrawl status          # check auth health
  npx spectrawl serve           # start HTTP server
  npx spectrawl mcp             # start MCP server
+ npx spectrawl proxy           # start rotating proxy server
  npx spectrawl install-stealth # download Camoufox browser
+ npx spectrawl version         # show version
  ```
 
  ## Configuration
 
- `spectrawl.json`:
+ `spectrawl.json` — full defaults:
 
  ```json
  {
+   "port": 3900,
+   "concurrency": 3,
    "search": {
      "cascade": ["gemini-grounded", "tavily", "brave"],
      "scrapeTop": 5
    },
+   "browse": {
+     "defaultEngine": "playwright",
+     "proxy": null,
+     "humanlike": {
+       "minDelay": 500,
+       "maxDelay": 2000,
+       "scrollBehavior": true
+     }
+   },
+   "auth": {
+     "refreshInterval": "4h",
+     "cookieStore": "./data/cookies.db"
+   },
    "cache": {
+     "path": "./data/cache.db",
      "searchTtl": 3600,
-     "scrapeTtl": 86400
+     "scrapeTtl": 86400,
+     "screenshotTtl": 3600
    },
    "rateLimit": {
-     "x": { "postsPerHour": 3 },
-     "reddit": { "postsPerHour": 5 }
+     "x": { "postsPerHour": 5, "minDelayMs": 30000 },
+     "reddit": { "postsPerHour": 3, "minDelayMs": 600000 }
    }
  }
  ```
 
+ ### Human-like Browsing
+
+ Spectrawl simulates human browsing patterns by default:
+
+ - **Random delays** between page loads (500-2000ms)
+ - **Scroll behavior** simulation
+ - **Random viewport sizes** from common resolutions
+ - Configurable via `browse.humanlike`
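The delay part can be sketched like this (whether Spectrawl actually draws from a uniform distribution is an assumption):

```javascript
// Uniform random delay between humanlike.minDelay and humanlike.maxDelay
// (500-2000ms by default). `rand` is injectable for testing.
function humanDelayMs(min = 500, max = 2000, rand = Math.random) {
  return Math.floor(min + rand() * (max - min))
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
// e.g. await sleep(humanDelayMs()) between page loads
```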
+
  ## Environment Variables
 
  ```
  GEMINI_API_KEY      Free — primary search + summarization (aistudio.google.com)
  BRAVE_API_KEY       Brave Search (2,000 free/month)
+ TAVILY_API_KEY      Tavily Search (1,000 free/month)
  SERPER_API_KEY      Serper.dev (2,500 trial queries)
  GITHUB_TOKEN        For GitHub adapter
  DEVTO_API_KEY       For Dev.to adapter
@@ -286,7 +801,3 @@ ANTHROPIC_API_KEY Alternative LLM for summarization
  ## License
 
  MIT
-
- ## Part of xanOS
-
- Spectrawl is the web layer for [xanOS](https://github.com/FayAndXan/xanOS) — the autonomous content engine. Use it standalone or as part of the full stack.